
Autoresearch: 700 Experiments While You Sleep

An AI agent ran 700 training experiments autonomously and found 20 real improvements. Here's how autoresearch works.

Saurabh Prakash


Mar 10, 2026 · 6 min read

What if you could optimize a model overnight without any ML experience? What if an AI agent runs hundreds of training experiments autonomously, keeping only the improvements?

That is the idea behind autoresearch.

The core loop:

You give an AI agent a training script and a metric. The agent edits the code, runs a short experiment, checks whether the metric improved, keeps or discards the change, and repeats.

Karpathy used it to squeeze 11% more speed out of his GPT-2 training[2]. Tobi Lütke, Shopify's CEO, trained a 0.8B model overnight that outscored his previous 1.6B model[3].


How Autoresearch Works

An LLM agent edits training code, runs a short experiment, checks if the metric improved, and repeats — without human involvement.

The design constraints that make it work:

  • Fixed 5-minute time budget. Results stay comparable regardless of what the agent changes.
  • Single file scope. Agent edits train.py only. Data prep and evaluation are locked down.
  • Git as memory. Each experiment is a commit. The agent reads branch history to plan what to try next.
  • Binary keep/discard. No human judgment needed.

Roughly 12 experiments/hour. Around 100 overnight.
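The loop is simple enough to sketch. Below is a minimal, illustrative version in Python; the agent, the experiment runner, and the git keep/discard steps are injected as plain callables, and all names here are hypothetical, not taken from Karpathy's repo:

```python
def autoresearch_loop(propose_edit, run_experiment, keep, discard, n=100):
    """Core keep/discard loop.

    propose_edit(history) -- the agent reads past results and edits train.py
    run_experiment()      -- runs train.py under the fixed time budget and
                             returns the target metric
    keep(metric)          -- e.g. `git commit -am "..."` (experiment kept)
    discard()             -- e.g. `git checkout -- train.py` (reverted)
    """
    best = run_experiment()       # baseline from the unmodified script
    history = [best]
    for _ in range(n):
        propose_edit(history)     # agent plans its next change from history
        metric = run_experiment()
        if metric > best:         # binary keep/discard: no human judgment
            best = metric
            keep(metric)
        else:
            discard()
        history.append(metric)
    return best
```

At roughly five minutes per experiment this yields the quoted ~12 experiments/hour; a real implementation would also need timeouts and crash handling for experiments that hang or error out.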

The human iterates on the prompt (.md). The AI agent iterates on the training code (.py). — Karpathy[1]


Two Early Experiments

Both examples below are small-scale and early. The setups are minimal, the models are small, and neither is a controlled study. But they show where this is headed.

Karpathy: 700 experiments on nanochat

Karpathy pointed autoresearch at nanochat, his already well-tuned GPT-2 training codebase. Over two days the agent ran ~700 experiments and found ~20 real improvements. Stacked together, time-to-GPT-2 dropped from 2.02 to 1.80 hours (11% faster)[2].

What the agent found that Karpathy missed:

  • The attention mechanism was too spread out (QKNorm missing a scaler multiplier)
  • A key layer was missing regularization (Value Embeddings)
  • The local attention window was too narrow (banded attention)
  • The optimizer settings were wrong (AdamW betas, weight decay, initialization)

All improvements transferred from depth-12 to depth-24 models.

"This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work, you come up with new ideas based on that, you read some papers for inspiration. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild." — Karpathy[2]


Tobi Lütke: Query expansion overnight

Tobi adapted the pattern for a query-expansion model for the QMD open source project:

  1. Told an AI agent to read the autoresearch repo and build a version for QMD
  2. Went to sleep
  3. Woke up to find that, after 37 experiments in 8 hours, a 0.8B model scored 19% higher than the previous 1.6B model

A smaller model outperformed one twice its size. He then pointed the same loop at a reranker and beat that baseline too[3].


How To Apply This

The loop hinges on your eval. If the metric is gamed or leaky, the model looks better on paper and fails in production.

Critical requirement:

Your eval set must be held out completely — the agent never touches it, never trains on it, never sees it during optimization.

You need:

  • A training script the agent can modify
  • Training data (manually labeled or synthetic)
  • A metric that reflects what the model will actually do in production
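The held-out requirement can be sketched in a few lines. This assumes a simple list of labeled examples; the function name and split fraction are illustrative. The key property is that the split happens once, before the agent ever runs, and the eval set lives outside the agent's file scope:

```python
import random

def split_holdout(examples, holdout_frac=0.2, seed=0):
    """Split once, up front. The returned eval set should be stored
    somewhere the agent cannot edit and must never appear in training
    data; the agent only sees scores, never the eval examples."""
    rng = random.Random(seed)   # fixed seed: the split never changes
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * holdout_frac)
    return shuffled[n_eval:], shuffled[:n_eval]   # (train, eval)
```

Locking the evaluation code down alongside the data matters just as much: if the agent can edit the scorer, it can game the metric instead of improving the model.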

When experiments run 100x faster than a human can manage, your eval becomes the bottleneck. Static benchmarks get saturated. Build your eval pipeline so it can evolve: refresh it with real production data and harder edge cases.

The pattern fits:

  • Search ranking
  • Product categorization
  • Clinical NER
  • Fraud scoring
  • Contract extraction
  • Intent classification

Small models work well — training runs finish in minutes and improvements transfer when you scale up. Open models like Gemma are a good starting point: small enough for a single GPU, performant for production tasks, commercially licensed.


How This Differs From Prompt Optimizers

Unlike prompt optimizers (which tune prompts on frozen models), autoresearch changes the model itself — it modifies training code, architecture, and hyperparameters. For teams building domain SLMs, both layers compound.


What People Are Saying

"Karpathy just mass-produced the most expensive part of ML research for free. A senior ML engineer costs $400K–$800K/year, runs maybe 3–5 meaningful experiments per day, and spends 80% of their time on the exact loop Karpathy just automated. The 630 lines of code in this repo fit inside a single LLM context window. That's by design." — an ML engineer on X

"Now this research mode is starting to get automated with LLMs, I see no other outcome than an LLM coming up with real innovation one day very soon. Current frontier models are already having nice research tastes. Scary yet amazing things are about to happen." — a researcher at Moonshot AI (Kimi)


The Bigger Picture

Karpathy's own framing says it all:

"All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course — you don't just have a single 'train.py' file to tune. But doing it is 'just engineering' and it's going to work." — Karpathy[2]

And the broader implication: any metric you care about that is reasonably cheap to evaluate can be autoresearched by an agent swarm. It is worth asking whether your problem falls into this bucket too.

The role of the human shifts. You spin up a swarm of agents, have them collaborate to tune smaller models, promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.

That word — optionally — should make every ML researcher who defines their value as "I tune models" think carefully about what comes next.

Is your team already running autonomous experiments — or is the eval problem holding you back?


References

[1]: Andrej Karpathy, autoresearch announcement — tweet

[2]: Andrej Karpathy, autoresearch results — tweet

[3]: Tobi Lütke, overnight autoresearch run — tweet

[4]: Andrej Karpathy — autoresearch repo: github.com/karpathy/autoresearch

[5]: Andrej Karpathy — agenthub repo: github.com/karpathy/agenthub

[6]: nanochat commit with autoresearch improvements — commit