I’ve written a summary of the paper “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList” by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. This paper won the Best Paper Award at ACL 2020.

Summary

💡 Goal of the paper: Present CheckList, a methodology and a tool for testing NLP models.

Authors claim that…

… checking that models reach benchmark accuracy is not enough to evaluate model quality in NLP
… by borrowing techniques from software engineering testing, it is possible to reveal quality problems in models that have passed the existing benchmarks, across 3 different tasks
… their methodology and tools are easy to follow/use
… the approach is useful in practice
- they are able to find errors in battle-tested public commercial models
- they show how users (both experts and newcomers) can benefit from the framework almost immediately
… open source is the way to go, so…
- The tool and everything described in the paper are open-sourced
- They expect a community to grow around it by sharing experiences through new test suites and capabilities
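
To make the software-testing analogy above concrete, here is a minimal sketch of two of the test types the paper describes: a Minimum Functionality Test (simple cases with a known expected label) and an Invariance test (a label-preserving perturbation should not change the prediction). It does not use the CheckList library itself. The `toy_sentiment` model, the helper functions, and the example sentences are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in model: a naive keyword-based sentiment classifier."""
    lowered = text.lower()
    if any(word in lowered for word in ("bad", "terrible", "awful")):
        return "negative"
    if any(word in lowered for word in ("good", "great", "love")):
        return "positive"
    return "neutral"


def minimum_functionality_test(
    model: Model, cases: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (text, expected, predicted) for every case the model gets wrong."""
    failures = []
    for text, expected in cases:
        predicted = model(text)
        if predicted != expected:
            failures.append((text, expected, predicted))
    return failures


def invariance_test(
    model: Model, pairs: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return the pairs where a label-preserving edit changed the prediction."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]


# Simple cases with a known label, including one negation.
mft_cases = [
    ("I love this airline.", "positive"),
    ("The food was terrible.", "negative"),
    ("The food was not good.", "negative"),
]

# Perturbations that should not change the label: a swapped city, a typo.
inv_pairs = [
    ("The flight to Chicago was great.", "The flight to Dallas was great."),
    ("The crew was awful.", "The crew was awufl."),
]

mft_failures = minimum_functionality_test(toy_sentiment, mft_cases)
inv_failures = invariance_test(toy_sentiment, inv_pairs)

print(f"MFT failures: {len(mft_failures)} of {len(mft_cases)}")
print(f"INV failures: {len(inv_failures)} of {len(inv_pairs)}")
```

The toy model handles the plain positive and negative sentences but fails the negation case and the typo case, which is the kind of gap a single accuracy number can hide.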

Find my summary of the paper here in the form of a Jupyter notebook, or just click here to launch Binder and access the presentation.