Sanyal et al., COLM 2024, NeurIPS 2023 WANT
• What: We scale up LAWA (see below) to large models.
• Why: Large model training -> large batch sizes -> large LRs -> LAWA makes (even more) sense.
Kaddour et al., NeurIPS 2023
• What: A simple budget-aware LR scheduler outperforms most fancy efficient training methods (sketch below).
• Why: Every day, there's a new efficient training algorithm; the ones we tried weren't that effective.
• Trivia: This started by trying some novel ideas that never outperformed our baseline; then realizing that our baseline was quite competitive.
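A minimal sketch of the baseline idea, assuming linear warmup followed by cosine decay that hits zero exactly at the step budget (the paper's exact schedule may differ):

```python
import math

def budget_aware_lr(step: int, budget: int, peak_lr: float, warmup: int = 100) -> float:
    """LR at `step`, decaying to zero exactly when the training budget runs out."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)  # linear warmup
    progress = min(1.0, (step - warmup) / max(1, budget - warmup))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero
```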
Kaddour, arXiv 2023
• What: Using embeddings and k-means, I construct a small and clean yet diverse pretraining corpus (sketch below).
• Why: The Pile is too large for GPU-poor academics.
• Trivia: I reviewed examples of each k-means cluster during my daily tube commute.
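The gist, as a sketch; `embed_fn` is a hypothetical stand-in for whatever sentence-embedding model you use, and the even-per-cluster sampling is a simplification:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_subset(docs, embed_fn, n_clusters=100, per_cluster=10, seed=0):
    """Cluster document embeddings, then sample evenly across clusters."""
    X = np.stack([embed_fn(d) for d in docs])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    rng = np.random.default_rng(seed)
    subset = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        picks = rng.choice(members, size=min(per_cluster, len(members)), replace=False)
        subset.extend(docs[i] for i in picks)
    return subset
```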
Kaddour et al., NeurIPS 2022
• What: We can find even flatter minima than SAM by adding weight averaging (sketch below).
• Why: SAM finds flat basins; WA finds flat points inside those basins.
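A condensed sketch of one SAM step with weight averaging stacked on top; `loss_fn(model, batch)` is a hypothetical closure, and the details (base optimizer, rho) are illustrative:

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    loss_fn(model, batch).backward()  # gradient at the current weights
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        eps = {}
        for p in model.parameters():
            if p.grad is not None:
                eps[p] = rho * p.grad / (grad_norm + 1e-12)
                p.add_(eps[p])  # ascend to the nearby worst-case point
    base_opt.zero_grad()
    loss_fn(model, batch).backward()  # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)  # restore weights before the real update
    base_opt.step()
    base_opt.zero_grad()

# WA on top: torch.optim.swa_utils.AveragedModel can track a running average of
# the weights SAM visits; evaluating that average targets a flat *point* inside
# the flat *basin* SAM found.
```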
Kaddour, NeurIPS 2022 HIIT Workshop
• What: Weight averaging = implicit LR decay (sketch below).
• Why: We can evaluate intermediate checkpoints pre-LR decay, which is much cheaper.
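A minimal LAWA-style sketch: keep the last k checkpoints and evaluate their uniform average instead of the raw weights (k and the float-casting are my illustrative choices):

```python
import copy
from collections import deque
import torch

class LatestWeightAverager:
    """Uniform average of the last k checkpoints."""
    def __init__(self, k: int = 5):
        self.ckpts = deque(maxlen=k)

    def update(self, model: torch.nn.Module):
        self.ckpts.append({name: v.detach().cpu().clone()
                           for name, v in model.state_dict().items()})

    def averaged_state_dict(self):
        avg = copy.deepcopy(self.ckpts[0])
        for name in avg:  # float-cast; integer buffers would need special handling
            avg[name] = torch.stack([c[name].float() for c in self.ckpts]).mean(dim=0)
        return avg
```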
Kaddour et al., NeurIPS 2020
• What: We make meta-learning more sample-efficient by letting the model guide the task selection (sketch below).
• Why: Acquiring datasets can be expensive and slow. Let's make sure we make it worth it.
• Trivia: This was my Master's thesis while studying at the wonderful Imperial College London.
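In spirit, heavily simplified (the paper ranks candidate tasks with a probabilistic model over latent task embeddings; this toy greedy-uncertainty rule is mine):

```python
import numpy as np

def select_next_task(candidate_tasks, uncertainty_fn):
    """Greedily acquire data for the task the current model is most unsure about."""
    scores = np.array([uncertainty_fn(t) for t in candidate_tasks])
    return candidate_tasks[int(np.argmax(scores))]
```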
Stojanovski et al., NeurIPS 2025 (Spotlight, Top 2%)
• What: 100+ RL envs across 8 domains with configurable complexity.
• Why: RL is so back thanks to DeepSeek-R1. More envs, more data, more RL.
Tyukin et al., arXiv 2024
• What: We can remove up to 33% of the attention layers in Llama2 with negligible performance loss (sketch below).
• Why: Removing attention layers makes inference faster and cheaper.
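A toy pre-norm block showing the mechanics (not the Llama2 implementation): dropping attention means the block reduces to its MLP branch on the residual stream:

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d: int, drop_attn: bool = False):
        super().__init__()
        self.drop_attn = drop_attn
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        if not self.drop_attn:  # skipping this branch saves the attention FLOPs
            h = self.ln1(x)
            x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.ln2(x))
```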
Key et al., NeurIPS 2023 WANT
• What: A method for fine-tuning an arbitrarily large model chunk by chunk (in isolation); sketch below.
• Why: Allowing the GPU-poor to fine-tune some LLMs too.
• Trivia: Inspired by distributed training techniques, adapted for single-GPU fine-tuning.
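The control flow, sketched; `make_opt` and `train_steps` are hypothetical user-supplied pieces, and treating the model as an `nn.Sequential` of blocks is a simplification:

```python
import torch.nn as nn

def finetune_chunkwise(model: nn.Sequential, chunk_ids, make_opt, train_steps):
    """Fine-tune one chunk of blocks at a time; everything else stays frozen."""
    for chunk in chunk_ids:  # e.g. [[0, 1], [2, 3], ...]
        for p in model.parameters():
            p.requires_grad_(False)
        trainable = [p for i in chunk for p in model[i].parameters()]
        for p in trainable:
            p.requires_grad_(True)
        opt = make_opt(trainable)
        train_steps(model, opt)  # inner loop over batches
```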
Kaddour and Liu, arXiv 2023
• What: Knowledge distillation via synthetic data generation after fine-tuning the teacher (sketch below).
• Why: Teachers are more sample-efficient; by fine-tuning them, we can generate synthetic data for students.
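The three-step recipe as a sketch; `finetune` and `generate` are hypothetical stand-ins for your training loop and the teacher's sampling routine, not the paper's API:

```python
def distill_via_synthetic_data(teacher, student, task_data, finetune, generate,
                               n_synth=10_000):
    """1) adapt the teacher, 2) let it synthesize data, 3) train the student."""
    teacher = finetune(teacher, task_data)
    synthetic = [generate(teacher) for _ in range(n_synth)]
    return finetune(student, list(task_data) + synthetic)
```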
Phan et al., arXiv 2025
• What: A really hard multiple-choice science benchmark for LLMs.
• Why: Previous benchmarks got hill-climbed quickly, but this one will remain the last one standing (trust me, bro).
Zhuo et al., ICLR 2025 (Oral, Top 2%)
• What: 1k+ diverse, multi-tool-use programming tasks in Python.
• Why: Other code benchmarks were too monotonous (e.g., Django) and lacked tool calls.
Gema et al., NAACL 2025
• What: We expose serious flaws in MMLU and release a smaller and cleaner version, MMLU-Redux.
• Why: MMLU is one of the most popular LLM benchmarks; better benchmarks, better models.
Wang et al., NeurIPS 2023
• What: A probing suite to profile molecular graph embeddings (sketch below).
• Why: Downstream-only evaluations can be misleading; better probes yield more faithful assessments.
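A minimal linear-probe sketch of the methodology, assuming precomputed frozen embeddings and a label array per property:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """How well a linear probe reads a property off frozen embeddings."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, embeddings, labels, cv=5).mean()
```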
Lynch et al., ICLR 2025 SCSL
• What: A vision dataset of cute dogs with spurious correlations between dog breeds and backgrounds.
• Why: Spurious correlations harm the reliability of vision models; previous benchmarks were too easy.
Kaddour et al., GitHub (5.6k stars)
• What: A Python package with UI for building and debugging agents. Used by several enterprises.
• Why: Debugging long-running agents in a terminal gets cumbersome.
• Trivia: Building this taught me a lot about frontend and TypeScript.
Kaddour et al., arXiv 2023
• What: An opinionated review of 16 challenges for LLMs.
• Why: The field is moving fast; it's hard to keep up with what's worth solving.
• Trivia: This doc started as notes I took during an internship to teach myself about LLMs.
Yin et al., arXiv 2023
• What: We generate synthetic training data for vision classification models (sketch below).
• Why: You can think of it as knowledge distillation from generative to discriminative models.
• Trivia: This is sort of the training-equivalent of Spawrious.
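A rough sketch of the recipe; the generator checkpoint, prompts, and class list are my illustrative choices, not necessarily the paper's:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
classes = ["golden retriever", "dachshund", "corgi"]

dataset = []  # (PIL image, int label) pairs; the labels come for free
for label, name in enumerate(classes):
    for _ in range(100):
        image = pipe(f"a photo of a {name}").images[0]
        dataset.append((image, label))
# `dataset` can now feed any standard classifier training loop.
```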
Kaddour et al., Foundations and Trends in Optimization, 2022
• What: A survey of how causality can be applied to ML problems.
• Why: Causality allows you to make assumptions about the data-generating process.
• Trivia: 3 years later, I'm surprised how far we've come with LLMs without any causality.
Kaddour et al., NeurIPS 2021
• What: We generalize the Robinson decomposition to treatment embeddings (see the decomposition below).
• Why: We can now use, e.g., an embedding of a drug's molecular graph.
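For context, the classic Robinson decomposition of the partially linear model, which the paper generalizes from scalar treatments to embeddings of structured treatments:

```latex
% Partially linear model with nuisances m(x) = E[Y \mid X = x], e(x) = E[T \mid X = x]:
Y = \tau(X)\,T + g(X) + \varepsilon
% Robinson's residual-on-residual decomposition:
Y - m(X) = \tau(X)\,\bigl(T - e(X)\bigr) + \varepsilon
% The paper swaps the scalar T for an embedding of a structured treatment,
% e.g., a drug's molecular graph, so the effect can condition on treatment structure.
```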