Jean Kaddour

I'm an AI researcher in London.

Publications

Pretraining

compute-optimal training · hardware awareness · optimizer

Kaddour et al., NeurIPS 2023

• What: A simple budget-aware LR scheduler outperforms most fancy efficient training methods.

• Why: Every day, there's a new efficient training algorithm; the ones we tried weren't that effective.

• Trivia: This project started with novel ideas that never outperformed our baseline; we then realized the baseline itself was quite competitive.
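For concreteness, here is a minimal sketch of what a budget-aware schedule can look like (function name and constants are illustrative, not the paper's exact recipe): warm up briefly, then decay linearly so the learning rate hits zero exactly when the compute budget runs out.

```python
def budget_aware_lr(step, total_budget_steps, peak_lr=1e-3, warmup_steps=100):
    """Budget-aware LR schedule (sketch): linear warmup, then a linear
    decay that reaches zero precisely at the end of the compute budget."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Decay so the LR is exhausted exactly when the budget is.
    remaining = max(total_budget_steps - step, 0)
    return peak_lr * remaining / (total_budget_steps - warmup_steps)
```

The key property is that the schedule is parameterized by the budget itself, so stopping early or training longer changes the whole decay curve, not just where it is cut off.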

pretraining data · data mixtures · clustering

Kaddour, arXiv 2023

• What: Using embeddings and k-means, I construct a small and clean yet diverse pretraining corpus.

• Why: The Pile is too large for GPU-poor academics.

• Trivia: I reviewed examples of each k-means cluster during my daily tube commute.
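The selection step can be sketched in a few lines (a toy, stdlib-only stand-in for the real pipeline, which would use learned document embeddings and a library k-means): cluster the embeddings, then keep only the examples nearest each centroid.

```python
def select_diverse_subset(embeddings, n_clusters, iters=20, per_cluster=1):
    """Cluster document embeddings with plain k-means (Lloyd's), then
    keep only the point(s) nearest each centroid: a small yet topically
    diverse slice of the corpus (illustrative sketch)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [list(e) for e in embeddings[:n_clusters]]  # deterministic init
    for _ in range(iters):
        labels = [min(range(n_clusters), key=lambda c: dist2(e, centers[c]))
                  for e in embeddings]
        for c in range(n_clusters):
            members = [e for e, lab in zip(embeddings, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]

    keep = []
    for c in range(n_clusters):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        idx.sort(key=lambda i: dist2(embeddings[i], centers[c]))
        keep.extend(idx[:per_cluster])
    return sorted(keep)
```

Because every cluster contributes the same number of examples, the subset stays diverse even when the raw corpus is heavily skewed toward a few topics.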

optimizer

Kaddour et al., NeurIPS 2022

• What: We can find even flatter minima than SAM by adding weight averaging.

• Why: SAM finds flat basins; WA finds flat points inside those basins.
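The combination can be sketched in a few lines (hypothetical names; `lr` and `rho` are placeholder values, and the scalar case stands in for full weight vectors): take SAM steps, which evaluate the gradient at an adversarially perturbed point, and keep a running average of the iterates on the side.

```python
def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update on a scalar weight (sketch): perturb the weight
    toward higher loss, then descend using the gradient taken there."""
    g = grad_fn(w)
    eps = rho * (1.0 if g >= 0 else -1.0)  # normalized ascent direction
    return w - lr * grad_fn(w + eps)

def running_average(avg, w, n_seen):
    """Incremental mean of the weight iterates (the WA part)."""
    return avg + (w - avg) / (n_seen + 1)
```

SAM steers the trajectory into a flat basin; averaging the iterates then lands on a flatter point inside that basin than any single iterate.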

data collection · active learning · meta-learning

Kaddour et al., NeurIPS 2020

• What: We make meta-learning more sample-efficient by letting the model guide the task selection.

• Why: Acquiring datasets can be expensive and slow; let's make sure it's worth it.

• Trivia: This was my Master's thesis while studying at the wonderful Imperial College London.
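The core idea fits in a few lines (a toy sketch; `uncertainty_fn` stands in for whatever acquisition score the learner produces, and the epsilon-greedy mix is one simple choice): instead of sampling tasks uniformly, pick the task the model is least sure about, with a little random exploration mixed in.

```python
import random

def select_next_task(candidate_tasks, uncertainty_fn, explore_eps=0.1, rng=None):
    """Model-guided task selection (sketch): mostly exploit the task with
    the highest uncertainty score, occasionally explore at random."""
    rng = rng or random.Random(0)
    if rng.random() < explore_eps:
        return rng.choice(candidate_tasks)
    return max(candidate_tasks, key=uncertainty_fn)
```

Each acquired task then maximizes expected information gain per labeling dollar, which is where the sample efficiency comes from.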

Posttraining

memory · parallelism

Key et al., NeurIPS 2023 WANT workshop

• What: A method for fine-tuning an arbitrarily large model chunk by chunk (in isolation).

• Why: Allowing the GPU-poor to fine-tune larger LLMs too.

• Trivia: Inspired by distributed training techniques, adapted for single-GPU fine-tuning.
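As a toy illustration of the chunk-by-chunk idea (not the paper's actual algorithm): the "model" below is just a product of scalar weights, and each weight is trained in isolation while the others stay frozen, so gradients and optimizer state only ever exist for the active chunk.

```python
import math

def train_chunkwise(weights, x, y_target, lr=0.05, steps=200):
    """Fit y = (w1*w2*...*wn)*x by training one weight ('chunk') at a
    time while the rest stay frozen: a toy sketch of isolated,
    memory-light chunk-by-chunk fine-tuning."""
    w = list(weights)
    for i in range(len(w)):            # activate one chunk in isolation
        for _ in range(steps):
            err = math.prod(w) * x - y_target
            # d(0.5*err**2)/dw_i: every frozen weight is a constant.
            frozen = math.prod(w[:i] + w[i + 1:])
            w[i] -= lr * err * frozen * x
    return w
```

The memory win is that peak usage scales with the largest chunk rather than the full model, at the cost of the chunks never seeing each other's updates within a pass.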

Evals

ai for science · reasoning

Phan et al., arXiv 2025

• What: A really hard multiple-choice science benchmark for LLMs.

• Why: Previous benchmarks got hill-climbed quickly, but this one will remain the last one standing (trust me, bro).

Gema et al., NAACL 2025

• What: We expose serious flaws in MMLU and release a smaller and cleaner version, MMLU-Redux.

• Why: MMLU is one of the most popular LLM benchmarks; better benchmarks, better models.

synthetic data · vision models · spurious correlations

Lynch et al., ICLR 2025 SCSL workshop

• What: A vision dataset of cute dogs with spurious correlations between dog breeds and backgrounds.

• Why: Spurious correlations harm the reliability of vision models; previous benchmarks were too easy.

Misc

agents · scaffolds · oss

Kaddour et al., GitHub (5.6k stars)

• What: A Python package with a UI for building and debugging agents. Used by several enterprises.

• Why: Debugging long-running agents in a terminal gets cumbersome.

• Trivia: Building this taught me a lot about frontend and TypeScript.

unfathomable datasets · hallucinations · misalignment

Kaddour et al., arXiv 2023

• What: An opinionated review of 16 challenges for LLMs.

• Why: The field is moving fast; it's hard to keep up with what's worth solving.

• Trivia: This doc started as notes I took during an internship to teach myself about LLMs.

causality · spurious correlations · causal rl

Kaddour et al., Foundations and Trends in Optimization, 2022

• What: A survey of how causality can be applied to ML problems.

• Why: Causality allows you to make assumptions about the data-generating process.

• Trivia: Three years later, I'm surprised by how far we've come with LLMs without any causality.

Recent Blog Posts

View all posts →