I’ve written a summary of the paper “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList” by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. The paper won the Best Paper award at ACL 2020.

Summary

💡 Goal of the paper: Present CheckList, a methodology and a tool for testing NLP models.

Authors claim that…

  • … benchmark accuracy alone is not enough to evaluate the quality of NLP models

  • … by applying techniques similar to those used in software engineering (SWE) testing, it is possible to reveal serious failures in models that have already passed the existing benchmarks on three different tasks (sentiment analysis, duplicate question detection, and machine comprehension); a sketch of this testing style follows this list

  • … their methodology and tooling are easy to follow and use

  • … the framework’s utility is demonstrated in practice

    • they are able to find errors in battle-tested, public commercial models
    • they show how users (both experts and newcomers) can benefit from the framework almost immediately
  • … open-source is the way to go, so CheckList itself is released as an open-source tool

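To make the SWE-testing analogy concrete, here is a minimal sketch of a CheckList Minimum Functionality Test (MFT) for binary sentiment analysis, built with the open-source `checklist` package. The `predict_proba` function is a hypothetical stand-in for the model under test, and the template, labels, and test name are illustrative assumptions, not examples taken from the paper.

```python
# Minimal sketch of a CheckList Minimum Functionality Test (MFT) for a
# binary sentiment model. Assumes `pip install checklist`; predict_proba
# is a hypothetical stand-in for the model under test.
import numpy as np
from checklist.editor import Editor
from checklist.test_types import MFT
from checklist.pred_wrapper import PredictorWrapper

def predict_proba(texts):
    # Stand-in model: always predicts "positive" (class 1) with 90% confidence.
    return np.array([[0.1, 0.9]] * len(texts))

editor = Editor()
# Generate test cases from a template: {mask} is filled in by a masked
# language model, and {a:adj} expands each adjective with "a"/"an".
ret = editor.template('This is not {a:adj} {mask}.',
                      adj=['good', 'great', 'amazing'],
                      labels=0,          # expected label: negative
                      nsamples=100)

test = MFT(ret.data, labels=ret.labels,
           name='Simple negation',
           capability='Negation',
           description='Negated positive adjectives should be predicted negative.')

# Wrap the raw probability function so CheckList receives (predictions, confidences).
wrapped_pp = PredictorWrapper.wrap_softmax(predict_proba)
test.run(wrapped_pp)
test.summary()  # reports the failure rate and sample failing cases
```

MFTs are only one of the three test types the paper defines; Invariance (INV) and Directional Expectation (DIR) tests are built with the same template and perturbation machinery.
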
Find my summary of the paper here in the form of a Jupyter notebook, or just click here to start Binder and access the presentation.