Autonomous Experimentation

Autoresearch

Autonomous research: AI agents run experiments and optimize themselves automatically

What if AI agents could run their own experiments, measure the results, and improve themselves autonomously, overnight, without human intervention? That's autoresearch.

The Core Loop

How Autoresearch Works

A never-ending experiment loop

Measure Baseline

Run the current, unmodified system and score it against objective criteria. This is experiment #0, the reference point for every later comparison.

↓

Analyze Failures

Examine what went wrong. Identify the most common failure pattern and form a hypothesis about what causes it.

↓

Mutate ONE Thing

Change exactly one variable, not five at once, so you know which change helped.
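
One way to enforce the one-variable rule in code is to have the mutation function touch a single key and report which one it changed. The sketch below is illustrative only: `mutate_one`, the config dictionary, and the `choices` table are hypothetical names, not part of any particular autoresearch implementation.

```python
import random

def mutate_one(config: dict, choices: dict, rng: random.Random):
    """Return (new_config, changed_key) with exactly one value changed.

    `choices` maps each tunable key to the values it may take (hypothetical).
    """
    # Only keys that have at least one alternative value can be mutated.
    candidates = sorted(
        key for key, options in choices.items()
        if any(value != config[key] for value in options)
    )
    if not candidates:
        raise ValueError("no key has an alternative value to try")
    key = rng.choice(candidates)
    alternatives = [value for value in choices[key] if value != config[key]]
    new_config = dict(config)  # copy, so the original config is left untouched
    new_config[key] = rng.choice(alternatives)
    return new_config, key
```

Returning the changed key alongside the new config makes it easy to record what each experiment actually tested.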

↓

Run Experiment

Execute the system N times with the same test inputs used for the baseline, and score every output.
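
A minimal sketch of this step, assuming the system under test is a plain callable from input text to output text and each check is a function returning True or False. The names `run_experiment`, `system`, and `evals` are hypothetical.

```python
from typing import Callable, Sequence

def run_experiment(
    system: Callable[[str], str],
    test_inputs: Sequence[str],
    evals: Sequence[Callable[[str], bool]],
    n_runs: int,
) -> float:
    """Run `system` n_runs times on every test input and return the pass rate."""
    passed = 0
    total = 0
    for _ in range(n_runs):
        for text in test_inputs:
            output = system(text)
            # An output passes only if every binary check passes.
            passed += all(check(output) for check in evals)
            total += 1
    return passed / total if total else 0.0
```

Repeating the same inputs several times matters when the system is non-deterministic: a single run can pass or fail by chance.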

↓

Keep or Discard

Score improved? Keep the change. Same or worse? Discard it and revert. No sentimentality.
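
The decision rule is small enough to write down directly. This is a sketch under the assumption that a higher score is better; `State` and `keep_or_discard` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class State:
    config: dict
    score: float

def keep_or_discard(best: State, candidate: State):
    """Return (surviving_state, kept) for one experiment."""
    if candidate.score > best.score:  # strictly better: keep the mutation
        return candidate, True
    return best, False                # same or worse: revert to the previous best
```

Note the strict comparison: a tie counts as a discard, so a change that adds complexity without moving the score does not survive.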

↓

↺

Repeat Until Ceiling

Go back to step 2 (Analyze Failures) and keep running autonomously overnight. Stop when the score reaches 95% or higher, or when the user returns.
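
Put together, the steps above can be sketched as a single loop. This is an illustration under stated assumptions, not a reference implementation: `score` and `mutate` are hypothetical callables supplied by the caller (failure analysis is assumed to happen inside `mutate`), and `max_experiments` stands in for "the user returns".

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]

def autoresearch(
    baseline: Config,
    score: Callable[[Config], float],
    mutate: Callable[[Config], Config],
    target: float = 0.95,
    max_experiments: int = 100,
) -> Tuple[Config, float, List[dict]]:
    """Measure a baseline, then mutate, score, and keep or discard until done."""
    best, best_score = dict(baseline), score(baseline)  # experiment #0
    log = [{"experiment": 0, "score": best_score, "kept": True}]
    for i in range(1, max_experiments + 1):
        if best_score >= target:            # ceiling reached, stop early
            break
        candidate = mutate(best)            # one change relative to current best
        candidate_score = score(candidate)
        kept = candidate_score > best_score
        if kept:
            best, best_score = candidate, candidate_score
        # Discarded candidates are simply dropped, which is the revert.
        log.append({"experiment": i, "score": candidate_score, "kept": kept})
    return best, best_score, log
```

Because every candidate is derived from the current best, a discarded experiment never contaminates the next one.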

Applications

Two Flavors of Autoresearch

ML training vs. AI skill optimization

ML Training (Karpathy)

Optimize Model Training

An AI agent modifies train.py (hyperparameters, architecture, optimizer settings), then runs 5-minute GPU experiments and keeps the changes that improve the score.

Target: validation bits-per-byte
Budget: 5 min per experiment, single GPU
Mutations: learning rate, layers, attention heads
Throughput: ~100 experiments overnight
Stack: Python, PyTorch, NVIDIA GPU
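
The target metric can be derived from an ordinary validation loss. A minimal sketch, assuming the loss is the mean cross-entropy in nats per token (the default for PyTorch's cross-entropy loss) and that the token and byte counts of the validation text are known. The function name and arguments are hypothetical, and the source does not say how train.py computes the metric.

```python
import math

def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) into bits per byte of text."""
    total_bits = mean_loss_nats * n_tokens / math.log(2)  # nats -> bits, summed
    return total_bits / n_bytes
```

Lower bits-per-byte is better, so for this metric "score improved" means the value went down.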

Skill Optimization

Optimize AI Prompts

An AI agent mutates the prompt of a Claude Code skill (wording, examples, anti-patterns), then runs the test inputs and keeps the mutations that improve the pass rate.

Target: binary eval pass rate
Budget: N runs per experiment
Mutations: instructions, examples, anti-patterns
Output: improved SKILL.md + changelog
Stack: Claude Code skill framework

Philosophy

The Principles That Make It Work

Binary Evals Only

Pass or fail. No 1-10 scales: graded scales compound variability across runs and give unreliable results.
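
In practice a binary eval is just a predicate over one output. The checks below are hypothetical examples of criteria a text-generating skill might be held to; they are not taken from any real eval suite.

```python
import re

def has_title(output: str) -> bool:
    """Output starts with a top-level heading."""
    return output.lstrip().startswith("# ")

def under_word_limit(output: str, limit: int = 200) -> bool:
    """Output stays within a word budget."""
    return len(output.split()) <= limit

def no_placeholder_text(output: str) -> bool:
    """Output contains no leftover placeholder markers."""
    return re.search(r"\b(TODO|lorem ipsum)\b", output, re.IGNORECASE) is None

EVALS = [has_title, under_word_limit, no_placeholder_text]

def passes(output: str) -> bool:
    """An output passes only if every check passes."""
    return all(check(output) for check in EVALS)
```

Each check answers a yes-or-no question that two runs of the same output will always answer the same way, which is what a 1-10 score cannot guarantee.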

One Variable at a Time

Change one thing per experiment. If you change five things, you cannot tell which one helped.

Baseline First

Always measure the starting point before changing anything. No baseline means no signal.

Revert Ruthlessly

If the score doesn't improve, revert. Complexity without improvement is pure cost.

Log Everything

Record every experiment, kept or discarded. The changelog is the most valuable artifact.
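
One simple way to keep such a record is an append-only file with one JSON object per experiment. This is a sketch under assumptions: the file layout, field names, and `log_experiment` helper are hypothetical, not a format the source prescribes.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_experiment(path, experiment: int, mutation: str, score: float, kept: bool) -> dict:
    """Append one experiment record to a JSON Lines changelog and return it."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "experiment": experiment,
        "mutation": mutation,  # what was changed, in plain words
        "score": score,
        "kept": kept,          # discarded experiments are logged too
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Appending rather than rewriting means a crash in the middle of the night loses at most the experiment in progress.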

Run Autonomously

Once started, don't stop to ask. Run overnight. The human will check in the morning.