Autonomous Experimentation

Autoresearch

AI agents that run experiments automatically and optimize themselves

What if AI agents could run their own experiments, measure the results, and improve themselves autonomously, overnight, with no human intervention? That's autoresearch.

The Core Loop

How Autoresearch Works

A never-ending experiment loop
1. Measure Baseline

Run the current system and score it against objective criteria. This is experiment #0.
2. Analyze Failures

Examine what went wrong. Identify the most common failure pattern. Form a hypothesis.
3. Mutate ONE Thing

Change exactly one variable. Not five things at once, just one, so you know what helped.
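A minimal sketch of what "one variable" can mean in practice, assuming the system is described by a flat config dict. The `single_variable_mutations` helper, the config keys, and the candidate values are all hypothetical and chosen only for illustration.

```python
from copy import deepcopy


def single_variable_mutations(config, candidates):
    """Yield (description, new_config) pairs that differ from config in exactly one key."""
    for key, values in candidates.items():
        for value in values:
            if config.get(key) == value:
                continue  # same as the current value, so not a mutation
            mutated = deepcopy(config)
            mutated[key] = value
            yield f"{key}: {config.get(key)!r} -> {value!r}", mutated


# Hypothetical starting point and candidate values (not taken from any real run).
baseline = {"learning_rate": 3e-4, "layers": 8, "attention_heads": 8}
candidates = {"learning_rate": [1e-4, 3e-4, 1e-3], "layers": [8, 12]}

mutations = list(single_variable_mutations(baseline, candidates))
for description, _ in mutations:
    print(description)
```

Each yielded config is a separate experiment, so a score change can be attributed to the single key named in its description.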
4. Run Experiment

Execute the system N times with the same test inputs. Score every output.
5. Keep or Discard

Score improved? Keep. Same or worse? Discard and revert. No sentimentality.

Repeat Until Ceiling

Go back to step 2. Run autonomously overnight. Stop once the score reaches 95% or higher, or when the user returns.
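The steps above can be sketched as a short Python loop. This is an illustrative outline under stated assumptions, not the implementation of either project credited on this page: `build_system`, `propose`, the flag names, and the toy checks are hypothetical stand-ins, and the failure analysis of step 2 is reduced to a `propose` hook that picks the next untried change.

```python
def pass_rate(system, test_inputs, evals, n_runs):
    """Run the system n_runs times per input; return the fraction of binary checks passed."""
    results = []
    for _ in range(n_runs):
        for item in test_inputs:
            output = system(item)
            results.extend(bool(check(output)) for check in evals)
    return sum(results) / len(results)


def autoresearch(build_system, config, propose, test_inputs, evals,
                 n_runs=3, ceiling=0.95, max_experiments=50):
    """Baseline first, then one mutation per experiment, keeping only strict improvements."""
    best = pass_rate(build_system(config), test_inputs, evals, n_runs)  # experiment #0
    log = [{"experiment": 0, "change": "baseline", "score": best, "kept": True}]
    for number in range(1, max_experiments + 1):
        if best >= ceiling:
            break  # ceiling reached
        proposal = propose(config, log)  # steps 2-3: hypothesis and one mutation
        if proposal is None:
            break  # nothing left to try
        change, candidate = proposal
        score = pass_rate(build_system(candidate), test_inputs, evals, n_runs)  # step 4
        kept = score > best  # step 5: same or worse is discarded
        if kept:
            config, best = candidate, score
        log.append({"experiment": number, "change": change, "score": score, "kept": kept})
    return config, best, log


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def build_system(config):
    def system(text):
        if config["greet"]:
            text = "Hello. " + text
        if config["sign_off"]:
            text = text + " Bye."
        if config["shout"]:
            text = text.upper()
        return text
    return system


def propose(config, log):
    """Flip the first flag that has not been tried yet; return None when out of ideas."""
    tried = {entry["change"] for entry in log}
    for flag in config:
        if flag not in tried:
            return flag, {**config, flag: not config[flag]}
    return None


evals = [lambda out: out.startswith("Hello"), lambda out: out.endswith("Bye.")]
start = {"greet": False, "shout": False, "sign_off": False}
final_config, final_score, history = autoresearch(
    build_system, start, propose, ["status update", "weekly report"], evals)

for entry in history:
    print(entry)
```

In this toy run the `shout` mutation lowers the score and is discarded, while `greet` and `sign_off` are kept. A real setup would replace `build_system` with the actual system under test and `propose` with the agent's failure analysis.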
Applications

Two Flavors of Autoresearch

ML training vs. AI skill optimization
ML Training (Karpathy)

Optimize Model Training

An AI agent modifies train.py (hyperparameters, architecture, optimizer settings), then runs 5-minute GPU experiments and keeps the changes that improve the score.
  • Target: validation bits-per-byte
  • Budget: 5 min per experiment, single GPU
  • Mutations: learning rate, layers, attention heads
  • ~100 experiments overnight
  • Stack: Python, PyTorch, NVIDIA GPU
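The "~100 experiments overnight" figure follows from the 5-minute budget. A quick check, assuming an 8-hour night and no per-experiment overhead (both are assumptions, and `experiments_per_night` is a hypothetical helper):

```python
def experiments_per_night(hours, minutes_per_experiment, overhead_minutes=0.0):
    """Number of whole experiments that fit in the given wall-clock window."""
    return int(hours * 60 // (minutes_per_experiment + overhead_minutes))


print(experiments_per_night(8, 5))     # 8 h of back-to-back 5-minute runs
print(experiments_per_night(8, 5, 1))  # same, with 1 minute of setup per run
```

With zero overhead this gives 96 runs, which is where a figure of roughly 100 comes from. Any setup or evaluation time between runs lowers it.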
Skill Optimization

Optimize AI Prompts

An AI agent mutates a Claude Code skill's prompt (wording, examples, anti-patterns), then runs test inputs and keeps the mutations that improve the pass rate.
  • Target: binary eval pass rate
  • Budget: N runs per experiment
  • Mutations: instructions, examples, anti-patterns
  • Output: improved SKILL.md + changelog
  • Stack: Claude Code skill framework
Philosophy

The Principles That Make It Work

01. Binary Evals Only

Pass or fail, no 1-10 scales. Graded scales compound scorer variability and give unreliable results.
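As an illustration, binary evals can be written as plain predicates that return True or False, with the score being the fraction of checks passed. The three checks and the sample outputs below are hypothetical examples, not criteria taken from either credited project.

```python
def has_title(output):
    """Pass if the first non-empty line is a '# ' heading."""
    lines = output.strip().splitlines()
    return bool(lines) and lines[0].startswith("# ")


def under_word_limit(output, limit=50):
    """Pass if the output has at most `limit` words."""
    return len(output.split()) <= limit


def no_placeholder_text(output):
    """Pass if the output contains no leftover placeholder text."""
    return "TODO" not in output and "lorem ipsum" not in output.lower()


EVALS = [has_title, under_word_limit, no_placeholder_text]


def score_outputs(outputs, evals=EVALS):
    """Fraction of (output, check) pairs that pass."""
    checks = [check(output) for output in outputs for check in evals]
    return sum(checks) / len(checks)


# Hypothetical outputs from two runs of the system under test.
sample_outputs = [
    "# Release notes\nFixed the login bug.",
    "Release notes\nTODO: write this.",
]
rate = score_outputs(sample_outputs)
print(f"pass rate: {rate:.2f}")
```

Each check is either satisfied or not, so two scorers looking at the same output cannot disagree by a point or two the way they can on a graded scale.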
02. One Variable at a Time

Change one thing per experiment. If you change five things, you don't know which one helped.
03. Baseline First

Always measure the starting point before changing anything. No baseline = no signal.
04. Revert Ruthlessly

If the score doesn't improve, revert. Complexity without improvement is pure cost.
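One way to make reverting automatic is to snapshot the file before each mutation and restore it unless the score strictly improves. This is a sketch under assumptions: `try_mutation` is a hypothetical helper, and `evaluate` stands for whatever scoring function the setup provides.

```python
from pathlib import Path


def try_mutation(path, new_text, evaluate, best_score):
    """Write new_text to path and score it; restore the old text unless the score improves.

    Returns (best_score_after, kept).
    """
    path = Path(path)
    previous = path.read_text(encoding="utf-8")  # snapshot before mutating
    path.write_text(new_text, encoding="utf-8")
    score = evaluate(path)
    if score > best_score:
        return score, True
    path.write_text(previous, encoding="utf-8")  # same or worse: revert
    return best_score, False
```

A call such as `try_mutation("SKILL.md", candidate_text, evaluate, best_score)` leaves the file untouched after any experiment that fails to beat the current best. A setup that tracks its files in version control could restore from there instead.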
05. Log Everything

Every experiment is recorded, kept or discarded. The changelog is the most valuable artifact.
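A minimal sketch of such a log, assuming one JSON object per line in an append-only file. The field names and the `log_experiment` and `read_log` helpers are illustrative, not a format defined by either credited project.

```python
import json
from datetime import datetime, timezone


def log_experiment(log_path, number, change, score, kept):
    """Append one experiment record, kept or discarded, as a JSON line."""
    entry = {
        "experiment": number,
        "change": change,
        "score": score,
        "kept": kept,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_log(log_path):
    """Load every recorded experiment in the order it was run."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Appending rather than rewriting means a crash partway through the night loses at most the experiment in progress, and discarded experiments stay visible so the same idea is not retried.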
06. Run Autonomously

Once started, don't stop to ask. Run overnight. The human will check in the morning.
Credits

Built On

karpathy/autoresearch

AI agents running research on single-GPU nanochat training automatically

autoresearch-skill

Claude Code skill adaptation of autoresearch for prompt optimization