Sanyal et al., COLM 2024, NeurIPS 2023 WANT
• What: We scale up LAWA (see below) to large models.
• Why: Large model training -> large batch sizes -> large LRs -> LAWA makes (even more) sense.
Kaddour et al., NeurIPS 2023
• What: A simple budget-aware LR scheduler outperforms most fancy efficient training methods (sketch below).
• Why: Every day, there's a new efficient training algorithm; the ones we tried weren't that effective.
• Trivia: This started by trying some novel ideas that never outperformed our baseline; then realizing that our baseline was quite competitive.
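A minimal sketch of the baseline idea, assuming linear warmup followed by cosine decay that hits zero exactly at the step budget (the paper's exact schedule may differ):

```python
import math

def budget_aware_lr(step: int, budget: int, peak_lr: float, warmup: int = 100) -> float:
    """LR at `step`, decaying to zero exactly when the training budget runs out."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)  # linear warmup
    progress = min(1.0, (step - warmup) / max(1, budget - warmup))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero
```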
Kaddour, arXiv 2023
• What: Using embeddings and k-means, I construct a small and clean yet diverse pretraining corpus (sketch below).
• Why: The Pile is too large for GPU-poor academics.
• Trivia: I reviewed examples of each k-means cluster during my daily tube commute.
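The gist, as a sketch; `embed_fn` is a hypothetical stand-in for whatever sentence-embedding model you use, and the even-per-cluster sampling is a simplification:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_subset(docs, embed_fn, n_clusters=100, per_cluster=10, seed=0):
    """Cluster document embeddings, then sample evenly across clusters."""
    X = np.stack([embed_fn(d) for d in docs])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    rng = np.random.default_rng(seed)
    subset = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        picks = rng.choice(members, size=min(per_cluster, len(members)), replace=False)
        subset.extend(docs[i] for i in picks)
    return subset
```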
Kaddour et al., NeurIPS 2022
• What: We can find even flatter minima than SAM by adding weight averaging (sketch below).
• Why: SAM finds flat basins; WA finds flat points inside those basins.
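A condensed sketch of one SAM step with weight averaging stacked on top; `loss_fn(model, batch)` is a hypothetical closure, and the details (base optimizer, rho) are illustrative:

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    loss_fn(model, batch).backward()  # gradient at the current weights
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        eps = {}
        for p in model.parameters():
            if p.grad is not None:
                eps[p] = rho * p.grad / (grad_norm + 1e-12)
                p.add_(eps[p])  # ascend to the nearby worst-case point
    base_opt.zero_grad()
    loss_fn(model, batch).backward()  # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)  # restore weights before the real update
    base_opt.step()
    base_opt.zero_grad()

# WA on top: torch.optim.swa_utils.AveragedModel can track a running average of
# the weights SAM visits; evaluating that average targets a flat *point* inside
# the flat *basin* SAM found.
```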
Kaddour, NeurIPS 2022 HIIT Workshop
• What: Weight averaging = implicit LR decay (sketch below).
• Why: We can evaluate intermediate checkpoints pre-LR decay, which is much cheaper.
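A minimal LAWA-style sketch: keep the last k checkpoints and evaluate their uniform average instead of the raw weights (k and the float-casting are my illustrative choices):

```python
import copy
from collections import deque
import torch

class LatestWeightAverager:
    """Uniform average of the last k checkpoints."""
    def __init__(self, k: int = 5):
        self.ckpts = deque(maxlen=k)

    def update(self, model: torch.nn.Module):
        self.ckpts.append({name: v.detach().cpu().clone()
                           for name, v in model.state_dict().items()})

    def averaged_state_dict(self):
        avg = copy.deepcopy(self.ckpts[0])
        for name in avg:  # float-cast; integer buffers would need special handling
            avg[name] = torch.stack([c[name].float() for c in self.ckpts]).mean(dim=0)
        return avg
```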
Kaddour et al., NeurIPS 2020
• What: We make meta-learning more sample-efficient by letting the model guide the task selection (sketch below).
• Why: Acquiring datasets can be expensive and slow. Let's make sure we make it worth it.
• Trivia: This was my Master's thesis while studying at the wonderful Imperial College London.
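In spirit, heavily simplified (the paper ranks candidate tasks with a probabilistic model over latent task embeddings; this toy greedy-uncertainty rule is mine):

```python
import numpy as np

def select_next_task(candidate_tasks, uncertainty_fn):
    """Greedily acquire data for the task the current model is most unsure about."""
    scores = np.array([uncertainty_fn(t) for t in candidate_tasks])
    return candidate_tasks[int(np.argmax(scores))]
```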
Stojanovski et al., NeurIPS 2025 (Spotlight, Top 2%)
• What: 100+ RL envs across 8 domains with configurable complexity.
• Why: RL is so back thanks to DeepSeek-R1. More envs, more data, more RL.
Tyukin et al., arXiv 2024
• What: We can remove up to 33% of the attention layers in Llama2 with negligible performance loss (sketch below).
• Why: Removing attention layers makes inference faster and cheaper.
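A toy pre-norm block showing the mechanics (not the Llama2 implementation): dropping attention means the block reduces to its MLP branch on the residual stream:

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d: int, drop_attn: bool = False):
        super().__init__()
        self.drop_attn = drop_attn
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        if not self.drop_attn:  # skipping this branch saves the attention FLOPs
            h = self.ln1(x)
            x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.ln2(x))
```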
Key et al., NeurIPS 2023 WANT
• What: A method for fine-tuning an arbitrarily large model chunk by chunk (in isolation); sketch below.
• Why: Allowing the GPU-poor to fine-tune some LLMs too.
• Trivia: Inspired by distributed training techniques, adapted for single-GPU fine-tuning.
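The control flow, sketched; `make_opt` and `train_steps` are hypothetical user-supplied pieces, and treating the model as an `nn.Sequential` of blocks is a simplification:

```python
import torch.nn as nn

def finetune_chunkwise(model: nn.Sequential, chunk_ids, make_opt, train_steps):
    """Fine-tune one chunk of blocks at a time; everything else stays frozen."""
    for chunk in chunk_ids:  # e.g. [[0, 1], [2, 3], ...]
        for p in model.parameters():
            p.requires_grad_(False)
        trainable = [p for i in chunk for p in model[i].parameters()]
        for p in trainable:
            p.requires_grad_(True)
        opt = make_opt(trainable)
        train_steps(model, opt)  # inner loop over batches
```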
Kaddour and Liu, arXiv 2023
• What: Knowledge distillation via synthetic data generation after fine-tuning the teacher (sketch below).
• Why: Teachers are more sample-efficient; by fine-tuning them, we can generate synthetic data for students.
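The three-step recipe as a sketch; `finetune` and `generate` are hypothetical stand-ins for your training loop and the teacher's sampling routine, not the paper's API:

```python
def distill_via_synthetic_data(teacher, student, task_data, finetune, generate,
                               n_synth=10_000):
    """1) adapt the teacher, 2) let it synthesize data, 3) train the student."""
    teacher = finetune(teacher, task_data)
    synthetic = [generate(teacher) for _ in range(n_synth)]
    return finetune(student, list(task_data) + synthetic)
```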
Phan et al., arXiv 2025
• What: A really hard multiple-choice science benchmark for LLMs.
• Why: Previous benchmarks got hill-climbed quickly, but this one will remain the last one standing (trust me, bro).
Zhuo et al., ICLR 2025 (Oral, Top 2%)
• What: 1k+ diverse, multi-tool-use programming tasks in Python.
• Why: Other code benchmarks were too monotonous (e.g., Django) and lacked tool calls.
Gema et al., NAACL 2025
• What: We expose serious flaws in MMLU and release a smaller and cleaner version, MMLU-Redux.
• Why: MMLU is one of the most popular LLM benchmarks; better benchmarks, better models.
Wang et al., NeurIPS 2023
• What: A probing suite to profile molecular graph embeddings (sketch below).
• Why: Downstream-only evaluations can be misleading; better probes yield more faithful assessments.
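A minimal linear-probe sketch of the methodology, assuming precomputed frozen embeddings and a label array per property:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """How well a linear probe reads a property off frozen embeddings."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, embeddings, labels, cv=5).mean()
```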
Lynch et al., ICLR 2025 SCSL
• What: A vision dataset of cute dogs with spurious correlations between dog breeds and backgrounds.
• Why: Spurious correlations harm the reliability of vision models; previous benchmarks were too easy.
Kaddour et al., GitHub (5.6k stars)
• What: A Python package with UI for building and debugging agents. Used by several enterprises.
• Why: Debugging long-running agents in a terminal gets cumbersome.
• Trivia: Building this taught me a lot about frontend and TypeScript.
Kaddour et al., arXiv 2023
• What: An opinionated review of 16 challenges for LLMs.
• Why: The field is moving fast; it's hard to keep up with what's worth solving.
• Trivia: This doc started as notes I took during an internship to teach myself about LLMs.
Yin et al., arXiv 2023
• What: We generate synthetic training data for vision classification models (sketch below).
• Why: You can think of it as knowledge distillation from generative to discriminative models.
• Trivia: This is sort of the training-equivalent of Spawrious.
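A rough sketch of the recipe; the generator checkpoint, prompts, and class list are my illustrative choices, not necessarily the paper's:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
classes = ["golden retriever", "dachshund", "corgi"]

dataset = []  # (PIL image, int label) pairs; the labels come for free
for label, name in enumerate(classes):
    for _ in range(100):
        image = pipe(f"a photo of a {name}").images[0]
        dataset.append((image, label))
# `dataset` can now feed any standard classifier training loop.
```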
Kaddour et al., Foundations and Trends in Optimization, 2022
• What: A survey of how causality can be applied to ML problems.
• Why: Causality allows you to make assumptions about the data-generating process.
• Trivia: 3 years later, I'm surprised how far we've come with LLMs without any causality.
Kaddour et al., NeurIPS 2021
• What: We generalize the Robinson decomposition to treatment embeddings (see the decomposition below).
• Why: We can now use, e.g., an embedding of a drug's molecular graph.
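For context, the classic Robinson decomposition of the partially linear model, which the paper generalizes from scalar treatments to embeddings of structured treatments:

```latex
% Partially linear model with nuisances m(x) = E[Y \mid X = x], e(x) = E[T \mid X = x]:
Y = \tau(X)\,T + g(X) + \varepsilon
% Robinson's residual-on-residual decomposition:
Y - m(X) = \tau(X)\,\bigl(T - e(X)\bigr) + \varepsilon
% The paper swaps the scalar T for an embedding of a structured treatment,
% e.g., a drug's molecular graph, so the effect can condition on treatment structure.
```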