Advice on advice
Why good advice often fails
Currently, I'm developing PySpur, a Graph-Based Editor for AI agents. At some point, I will graduate with my PhD in LLMs supervised by Ricardo Silva and Matt Kusner at UCL. I am based in London, UK.
[0] REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Stojanovski et al., NeurIPS 2025 (Spotlight, Top 2%)
• What: 100+ RL envs across 8 domains with configurable complexity.
• Why: RL is so back thanks to R1. More envs, more data, more RL.
[1] PySpur: A visual playground for AI Agents
Kaddour et al., Github 2025
• What: A Python package with UI for building and debugging agent scaffolds. Used by several enterprises.
• Why: Debugging long-running agents in a terminal gets cumbersome.
[2] Humanity's Last Exam
Phan et al., arXiv 2025
• What: A really hard multiple-choice science benchmark for LLMs.
• Why: Previous benchmarks got hill-climbed quickly, but this one will remain the last one standing (trust me, bro).
[3] BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Zhuo et al., ICLR 2025 (Oral, Top 2%)
• What: 1k+ diverse, multi-tool-use programming tasks in Python.
• Why: Other code benchmarks were too monotonous (e.g., Django) and lacked tool calls.
[4] Are We Done with MMLU?
Gema et al., NAACL 2025
• What: We expose serious flaws in MMLU and release a smaller and cleaner version, MMLU-Redux.
• Why: MMLU is one of the most popular LLM benchmarks; better benchmarks, better models.
[5] Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
Tyukin et al., arXiv 2024
• What: We can remove up to 33% of the attention layers in Llama2 with negligible performance loss.
• Why: Removing attention layers makes inference faster and cheaper.
[6] Challenges and applications of large language models
Kaddour et al., arXiv 2023
• What: An opinionated review of 16 challenges for LLMs.
• Why: The field is moving fast, hard to keep up with what's worth solving.
• Trivia: This doc started as notes I took during an internship to teach myself about LLMs.
[7] Early Weight Averaging meets High Learning Rates for LLM Pre-training
Sanyal et al., COLM 2024, NeurIPS 2023 WANT
• What: We scale up LAWA (see below) to large models.
• Why: Large model training -> large batch sizes -> large LRs -> LAWA makes (even more) sense.
[8] Local LoRA: Memory-Efficient Fine-Tuning of Large Language Models
Key et al., NeurIPS 2023 WANT
• What: A method for fine-tuning an arbitrarily large model chunk by chunk (in isolation).
• Why: Allowing the GPU-poor to fine-tune some LLMs too.
[9] Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models
Kaddour and Liu, arXiv 2023
• What: Knowledge distillation via synthetic data generation after fine-tuning of the teacher.
• Why: Teachers are more sample-efficient; by fine-tuning them, we can generate synthetic data for students.
[10] No train no gain: Revisiting efficient training algorithms for transformer-based language models
Kaddour et al., NeurIPS 2023
• What: A simple budget-aware LR scheduler outperforms most fancy efficient training methods.
• Why: Every day, there's a new efficient training algorithm; the ones we tried weren't that effective.
• Trivia: This started with trying some novel ideas that never outperformed our baseline; we then realized the baseline itself was quite competitive.
[11] Evaluating Self-Supervised Learning for Molecular Graph Embeddings
Wang et al., NeurIPS 2023
• What: A probing suite to profile molecular graph embeddings and evaluate GSSL methods.
• Why: Downstream-only evaluations can be misleading; better probes yield more faithful assessments.
[12] Ttida: Controllable generative data augmentation via text-to-text and text-to-image models
Yin et al., arXiv 2023
• What: We generate synthetic training data for vision classification models.
• Why: You can think of it as knowledge distillation from generative to discriminative models.
• Trivia: This is sort of the training-equivalent of Spawrious (see below).
[13] Minipile: A Challenge for Data-Efficient Language Models
Kaddour, arXiv 2023
• What: Using embeddings and k-means, I construct a small and clean yet diverse pretraining corpus.
• Why: The Pile is too large for GPU-poor academics.
• Trivia: I reviewed examples of each k-means cluster during my daily tube commute.
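The Minipile recipe is generic enough to sketch in a few lines. Below is a toy version with random vectors standing in for learned text embeddings and a hand-rolled k-means (the function names, cluster count, and docs-per-cluster are illustrative choices, not the paper's; the real pipeline also involves manually inspecting and discarding low-quality clusters):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns centroids and cluster assignments."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Toy corpus: 100 "documents" embedded into 8 dimensions.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(100, 8))
centroids, labels = kmeans(embeddings, k=5)

# Keep the documents closest to each centroid as the small-but-diverse subset.
subset = []
for j in range(5):
    idx = np.where(labels == j)[0]
    dists = np.linalg.norm(embeddings[idx] - centroids[j], axis=1)
    subset.extend(idx[dists.argsort()[:4]].tolist())  # up to 4 docs per cluster
# `subset` now holds the indices of the representative documents.
```

Because clusters partition the corpus, the selected indices are distinct; diversity comes from sampling every cluster rather than the corpus head.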
[14] Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases
Lynch et al., ICLR 2025 SCSL
• What: A vision dataset of cute dogs with spurious correlations between dog breeds and backgrounds.
• Why: Spurious correlations harm the reliability of vision models; previous benchmarks were too easy.
[15] When Do Flat Minima Optimizers Work?
Kaddour et al., NeurIPS 2022
• What: We can find even flatter minima than SAM by adding weight averaging.
• Why: SAM finds flat basins; WA finds flat points inside those basins.
[16] Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging
Kaddour, NeurIPS 2022 HIIT Workshop
• What: Surprisingly, LAtest Weight Averaging (LAWA), i.e., SWA in a FIFO way, is almost identical to decaying the LR.
• Why: Training runs can last for months; wouldn't it be nice to make better use of intermediate checkpoints?
• Trivia: NeurIPS folks thought I had a bug in my code, until the result was confirmed by several other works.
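The FIFO averaging idea is simple enough to sketch. This toy treats a checkpoint as a flat list of floats (a real implementation would average model state dicts); the helper name is mine, not the paper's:

```python
from collections import deque

def make_lawa_averager(k):
    """Keep a FIFO buffer of the k latest checkpoints; return their mean.

    Toy sketch of LAtest Weight Averaging: unlike classic SWA, which
    averages over the whole run, only the k most recent checkpoints count.
    """
    buffer = deque(maxlen=k)  # the oldest checkpoint is evicted automatically

    def update(checkpoint):
        buffer.append(list(checkpoint))
        n = len(buffer)
        # Element-wise mean over the checkpoints currently in the buffer.
        return [sum(ws) / n for ws in zip(*buffer)]

    return update

# Usage: average the 3 latest checkpoints of a 2-parameter "model".
lawa = make_lawa_averager(k=3)
for step_weights in ([1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]):
    averaged = lawa(step_weights)
# The buffer holds the last 3 checkpoints, so `averaged` is [3.0, 30.0].
```

The `deque(maxlen=k)` does the FIFO bookkeeping: appending the (k+1)-th checkpoint silently drops the oldest one.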
[17] Causal Machine Learning: A Survey and Open Problems
Kaddour et al., Foundations and Trends in Optimization 2022
• What: A survey of how causality can be applied to ML problems.
• Why: Causality lets you build assumptions about the data-generating process into your model.
[18] Causal Effect Inference for Structured Treatments
Kaddour et al., NeurIPS 2021
• What: We generalize the Robinson decomposition to continuous vector treatments.
• Why: In medicine or economics, treatments are often continuous and multivariate.
• Trivia: Made me realize that causal inference research lacks meaningful benchmarks.
[19] Probabilistic Active Meta-Learning
Kaddour et al., NeurIPS 2020
• What: We make meta-learning more sample-efficient by letting the model guide the task selection.
• Why: Acquiring task-specific datasets can be expensive and slow. Let's make sure we make it worth it.
• Trivia: This was my Master's thesis while studying at the wonderful Imperial College London.