ai for science · reasoning
Long Phan et al. (incl. Jean Kaddour), arXiv 2025
• What: A very difficult multiple-choice science benchmark for LLMs.
• Why: Previous benchmarks were hill-climbed to saturation quickly; this one is designed to stay challenging far longer.
tool use · coding
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro von Werra, ICLR 2025 (Oral, Top 2%)
• What: 1k+ diverse, multi-tool-use programming tasks in Python.
• Why: Existing code benchmarks are too homogeneous and lack tool calls.
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini, NAACL 2025
• What: We expose serious flaws in MMLU and release a smaller, cleaner subset, MMLU-Redux.
• Why: MMLU is one of the most popular LLM benchmarks; better benchmarks, better models.
Hanchen Wang∗, Jean Kaddour∗, Shengchao Liu∗, Jian Tang, Joan Lasenby, Qi Liu, NeurIPS 2023
• What: A probing suite to profile molecular graph embeddings.
• Why: Downstream-only evaluations can be misleading; better probes yield more faithful assessments.
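The idea behind probing can be sketched in a few lines: freeze the embeddings and train a simple linear classifier to predict a known property; high probe accuracy means the property is linearly decodable from the embedding. This is a minimal illustration with synthetic stand-in data, not the paper's actual probing suite.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen molecular-graph embeddings (n_molecules x dim) and a
# binary property label (e.g., presence of a functional group); here the label
# is planted along the first embedding dimension so the probe can recover it.
embeddings = rng.normal(size=(200, 16))
labels = (embeddings[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Logistic-regression probe trained by plain gradient descent (no extra deps).
w, b, lr = np.zeros(16), 0.0, 0.5
for _ in range(500):
    probs = 1 / (1 + np.exp(-(embeddings @ w + b)))
    w -= lr * embeddings.T @ (probs - labels) / len(labels)
    b -= lr * np.mean(probs - labels)

# High accuracy => the probed property is linearly readable from the embedding.
accuracy = np.mean(((embeddings @ w + b) > 0).astype(int) == labels)
print(f"probe accuracy: {accuracy:.2f}")
```

In practice one runs many such probes, one per property of interest, to build a profile of what an embedding encodes independently of any downstream task.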
synthetic datavision modelsspurious correlations
Aengus Lynch∗, Gbètondji J-S Dovonon∗, Jean Kaddour∗, Ricardo Silva, ICLR 2025 SCSL
• What: A vision dataset of cute dogs with spurious correlations between dog breeds and backgrounds.
• Why: Spurious correlations harm the reliability of vision models; previous benchmarks were too easy.
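The construction behind such a dataset can be sketched as follows: at training time, each class co-occurs with a particular background far more often than chance, so a model can shortcut on the background; the test split breaks the correlation. The breed/background pairing and correlation strength below are illustrative assumptions, not the dataset's actual specification.

```python
import random

random.seed(0)
BREEDS = ["corgi", "husky"]
BACKGROUNDS = {"corgi": "beach", "husky": "snow"}  # the spurious pairing

def sample_example(split: str, correlation: float = 0.95):
    """Return (breed, background); train correlates them, test decorrelates."""
    breed = random.choice(BREEDS)
    if split == "train" and random.random() < correlation:
        background = BACKGROUNDS[breed]  # shortcut feature available
    else:
        background = random.choice(list(BACKGROUNDS.values()))
    return breed, background

train = [sample_example("train") for _ in range(1000)]
shortcut_rate = sum(bg == BACKGROUNDS[br] for br, bg in train) / len(train)
print(f"train split: background matches breed {shortcut_rate:.0%} of the time")
```

A model that scores well on the correlated train split but fails on the decorrelated test split has learned the background, not the breed.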