What if AI agents could run their own experiments, measure results, and improve themselves — autonomously, overnight, without human intervention? That's autoresearch.
如果AI代理能自主运行实验、衡量结果并自我改进——一夜之间,无需人类干预?这就是自主研究。
Run the current system and score it against objective criteria. This is experiment #0.
Examine what went wrong. Identify the most common failure pattern. Form a hypothesis.
Change exactly one variable. Not five things at once — one. So you know what helped.
Execute the system N times with the same test inputs. Score every output.
Score improved? Keep. Same or worse? Discard and revert. No sentimentality.
Go back to step 2. Run autonomously overnight. Stop at 95%+ or when the user returns.
AI agent modifies train.py — hyperparameters, architecture, optimizer settings — then runs 5-minute GPU experiments and keeps improvements.
AI agent mutates a Claude Code skill's prompt — wording, examples, anti-patterns — then runs test inputs and keeps mutations that improve pass rate.
Pass or fail. No 1-10 scales. Scales compound variability and give unreliable results.
Change one thing per experiment. If you change five things, you don't know which one helped.
Always measure the starting point before changing anything. No baseline = no signal.
If the score doesn't improve, revert. Complexity without improvement is pure cost.
Every experiment recorded — kept or discarded. The changelog is the most valuable artifact.
Once started, don't stop to ask. Run overnight. The human will check in the morning.