Tutorials · 12 min read

How to Build an AI Research Agent That Works While You Sleep (Karpathy's Autoresearch Method)

A step-by-step tutorial on building autonomous AI research loops using Karpathy's autoresearch pattern. Learn the experiment → evaluate → keep/discard → iterate cycle with real use cases and practical tool recommendations.

ComputeLeap Team

March 13, 2026

In January 2025, Andrej Karpathy shared a concept that quietly changed how serious practitioners think about AI-assisted work: autoresearch. The idea is deceptively simple — instead of using an AI agent for one-shot tasks, you set up a loop where the agent autonomously runs experiments, evaluates results, keeps what works, discards what doesn't, and iterates. Over and over. While you sleep.

This isn't theoretical anymore. In early 2026, ARK Invest's research showed that AI coding agents can now work reliably and autonomously for 55+ minutes before needing human intervention. That's enough time for dozens of experiment-evaluate-iterate cycles. The autoresearch pattern has gone from interesting idea to practical workflow.

In this tutorial, we'll build an autonomous research loop from scratch. You'll learn the core pattern, see three real use cases, and walk away with a working setup you can adapt to your own projects.

The Autoresearch Pattern: Experiment → Evaluate → Keep/Discard → Iterate

At its core, Karpathy's autoresearch is a hill-climbing algorithm for knowledge work. Here's the loop:

  1. Experiment: The agent tries something — runs a code change, tests a hypothesis, tweaks a parameter
  2. Evaluate: It measures the result against a clear metric — did the test pass? Did the score improve? Did the output match expectations?
  3. Keep or Discard: If the result improved, keep the change. If not, revert it
  4. Iterate: Go back to step 1 with the updated state

This is fundamentally different from asking ChatGPT a question and getting an answer. Autoresearch is closed-loop — the agent acts on the world, observes the consequences, and adapts. It's the difference between reading a recipe and actually cooking, tasting, and adjusting.

┌─────────────┐
│  EXPERIMENT  │ ← Agent tries a change
└──────┬──────┘
       ▼
┌─────────────┐
│  EVALUATE   │ ← Measure against metric
└──────┬──────┘
       ▼
┌─────────────┐     ┌──────────┐
│   BETTER?   │─No─▶│ DISCARD  │──┐
└──────┬──────┘     └──────────┘  │
       │ Yes                       │
       ▼                           │
┌─────────────┐                   │
│    KEEP     │                   │
└──────┬──────┘                   │
       │                           │
       ▼                           │
┌─────────────┐                   │
│   ITERATE   │◀──────────────────┘
└─────────────┘

The critical insight is that this loop requires three things to work:

  • An automated experiment the agent can run without human intervention
  • A measurable evaluation metric — not vibes, actual numbers
  • A version control mechanism to revert failed experiments cleanly

If you have all three, you can run an autoresearch loop. If you're missing any one of them, you'll need a human in the loop.
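In code, the whole pattern collapses to a few lines. A minimal sketch in Python; `run_experiment`, `evaluate`, `keep`, and `revert` are placeholders you would wire up to your agent, your metric script, and git:

```python
def autoresearch(run_experiment, evaluate, keep, revert, iterations=20):
    """Generic hill-climbing loop: experiment, evaluate, keep or discard."""
    best = evaluate()                # baseline score before any changes
    for i in range(iterations):
        run_experiment()             # EXPERIMENT: try a change
        score = evaluate()           # EVALUATE: measure against the metric
        if score > best:             # KEEP: the change improved the metric
            best = score
            keep(i, score)
        else:                        # DISCARD: roll back to last good state
            revert()
    return best
```

In the shell-script versions later in this tutorial, `run_experiment` is an agent invocation, `evaluate` is a scoring script, `keep` is a git commit, and `revert` is a git checkout.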

What You Need: The Toolchain

Before we build anything, here's the practical toolchain that makes autoresearch loops work in 2026:

Orchestration Layer: An AI Coding Agent

You need an agent that can read files, write files, run shell commands, and iterate on its own output. The best options right now:

  • Claude Code — Anthropic's CLI agent. Runs in your terminal, has full filesystem and shell access, and can operate autonomously for extended periods. Our recommended choice for autoresearch loops due to its strong reasoning and tool-use capabilities.
  • OpenAI Codex — OpenAI's sandboxed coding agent. Good isolation model but more constrained environment.
  • Gemini CLI — Google's command-line agent. Strong multimodal capabilities if your research involves images or documents.

Any agent that supports autonomous tool use will work. The key requirement is that it can execute commands, read output, and decide what to do next — without waiting for you to hit Enter.

Experiment Log: Git and GitHub

Every experiment needs a paper trail. Git gives you:

  • Atomic commits for each experiment attempt
  • Easy reverts when an experiment fails
  • Branch isolation for parallel experiment tracks
  • Full history to analyze what the agent tried and why

Create a dedicated repository for your autoresearch project. The commit history becomes your experiment log.

Evaluation: Scripts That Return Numbers

Your evaluation metric needs to be a script that prints a score or signals success through its exit code. Examples:

  • pytest --tb=short → pass/fail count
  • python benchmark.py → performance score
  • npm run test:coverage → coverage percentage
  • A custom scoring script that evaluates output quality

The more unambiguous your metric, the better the loop works.
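As a concrete shape for that last bullet, a custom scorer can simply wrap an existing command and reduce its output to one printed number. A hypothetical sketch (the summary-line parsing assumes pytest's `-q` output format):

```python
#!/usr/bin/env python3
"""Hypothetical scorer: reduce a test run to one printed number."""
import re
import subprocess

def parse_passed(summary: str) -> int:
    """Extract the 'N passed' count from a pytest summary line."""
    match = re.search(r"(\d+) passed", summary)
    return int(match.group(1)) if match else 0

def score() -> int:
    try:
        result = subprocess.run(["pytest", "--tb=no", "-q"],
                                capture_output=True, text=True, timeout=60)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return 0  # pytest missing or hung: count it as a failed experiment
    return parse_passed(result.stdout)

if __name__ == "__main__":
    print(score())
```

The loop script then captures the printed number with `SCORE=$(python evaluate.py)` and compares it against the previous best.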

Use Case 1: Automated Research Paper Analysis

Goal: Build an agent that reads research papers, extracts key findings, and maintains a structured knowledge base.

Setup

Create a project directory with this structure:

autoresearch-papers/
├── papers/           # Drop PDFs or URLs here
├── findings/         # Agent writes structured summaries
├── knowledge.json    # Accumulated knowledge base
├── evaluate.py       # Scores completeness and accuracy
└── run-loop.sh       # The main loop script

The Loop Script

Here's a practical run-loop.sh that drives the autoresearch cycle:

#!/bin/bash
# run-loop.sh — Autonomous research paper analysis loop

REPO_DIR="$(cd "$(dirname "$0")" && pwd)"
MAX_ITERATIONS=20
LOG_FILE="$REPO_DIR/loop-log.txt"

cd "$REPO_DIR"

for i in $(seq 1 $MAX_ITERATIONS); do
  echo "=== Iteration $i ===" | tee -a "$LOG_FILE"

  # EXPERIMENT: Agent processes next unanalyzed paper
  claude -p "Look in papers/ for any paper not yet in findings/.
    Read it, extract key findings, methodology, and results.
    Write a structured summary to findings/.
    Update knowledge.json with new cross-references.
    Be thorough but concise." \
    --allowedTools "Read,Write,Bash" 2>&1 | tee -a "$LOG_FILE"

  # EVALUATE: Check quality of findings
  SCORE=$(python evaluate.py)
  echo "Score: $SCORE" | tee -a "$LOG_FILE"

  # KEEP or DISCARD based on improvement over the best score so far
  : "${BEST_SCORE:=0}"   # initializes to 0 on the first iteration
  if [ "$SCORE" -gt "$BEST_SCORE" ]; then
    BEST_SCORE=$SCORE
    git add -A && git commit -m "research: iteration $i — score $SCORE"
    echo "✓ Kept iteration $i" | tee -a "$LOG_FILE"
  else
    git checkout -- .
    git clean -fd findings/   # also drop new untracked files from the failed attempt
    echo "✗ Discarded iteration $i" | tee -a "$LOG_FILE"
  fi
done

The Evaluation Script

# evaluate.py — Score the knowledge base quality
import json
import os

def evaluate():
    score = 0

    # Check findings exist and have content
    findings_dir = "findings"
    if not os.path.exists(findings_dir):
        return 0

    for name in os.listdir(findings_dir):
        filepath = os.path.join(findings_dir, name)
        if not os.path.isfile(filepath):
            continue  # skip stray subdirectories
        with open(filepath) as fh:
            content = fh.read()
        # Score based on structural completeness
        if "## Key Findings" in content: score += 1
        if "## Methodology" in content: score += 1
        if "## Results" in content: score += 1
        if len(content) > 500: score += 1

    # Check knowledge base coherence
    if os.path.exists("knowledge.json"):
        with open("knowledge.json") as fh:
            kb = json.load(fh)
        if isinstance(kb, dict) and len(kb) > 0:
            score += 2

    return score

if __name__ == "__main__":
    print(evaluate())

You drop papers into the papers/ folder, start the loop, and come back to a structured knowledge base with cross-referenced findings. The git history shows exactly what the agent found and when.

Use Case 2: Trading Strategy Optimization

Goal: Iteratively improve a trading strategy by testing parameter variations against historical data.

This is where autoresearch really shines — parameter search over a well-defined objective function.

Setup

autoresearch-trading/
├── strategy.py       # The trading strategy with configurable params
├── backtest.py       # Runs strategy against historical data
├── data/             # Historical price data
├── results/          # Backtest results per iteration
└── run-loop.sh       # The loop

The Core Loop

#!/bin/bash
# Autonomous trading strategy optimization

BEST_SHARPE=0
MAX_ITERATIONS=50

for i in $(seq 1 $MAX_ITERATIONS); do
  # EXPERIMENT: Agent modifies strategy parameters
  claude -p "Review the current strategy.py and recent backtest results
    in results/. Analyze what's working and what isn't.
    Make ONE targeted parameter change to improve the Sharpe ratio.
    Document your reasoning in a comment.
    Do NOT change the core strategy logic, only parameters." \
    --allowedTools "Read,Write,Bash"

  # EVALUATE: Run backtest
  RESULT=$(python backtest.py 2>&1)
  SHARPE=$(echo "$RESULT" | grep "Sharpe:" | awk '{print $2}')
  SHARPE=${SHARPE:-0}   # guard: a crashed backtest produces no score line

  # KEEP or DISCARD
  if (( $(echo "$SHARPE > $BEST_SHARPE" | bc -l) )); then
    BEST_SHARPE=$SHARPE
    git add -A && git commit -m "strategy: sharpe=$SHARPE (iteration $i)"
    echo "✓ New best Sharpe: $SHARPE"
  else
    git checkout -- strategy.py
    echo "✗ Sharpe $SHARPE < best $BEST_SHARPE, reverted"
  fi

  # Save result for agent context
  echo "Iteration $i: Sharpe=$SHARPE" >> results/history.txt
done

The agent sees the full history of what it's tried, so it can learn from failed experiments. After 50 iterations overnight, you wake up to an optimized parameter set with full documentation of every change attempted.
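The loop parses a `Sharpe:` line from the backtest output, so `backtest.py` just has to print one. A sketch of that scoring end, assuming you already have per-period strategy returns (the hardcoded returns and the 252-day annualization factor are illustrative, and the risk-free rate is ignored):

```python
import math
from statistics import mean, stdev

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a series of per-period returns."""
    if len(returns) < 2 or stdev(returns) == 0:
        return 0.0  # not enough data, or zero volatility
    return mean(returns) / stdev(returns) * math.sqrt(periods_per_year)

if __name__ == "__main__":
    # In the real backtest.py these come from running strategy.py
    # against data/; hardcoded here for illustration only.
    returns = [0.001, -0.002, 0.003, 0.0015, -0.0005]
    print(f"Sharpe: {sharpe_ratio(returns):.2f}")
```

The loop's `grep "Sharpe:" | awk '{print $2}'` then picks the number off stdout.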

Important caveat: See the "When Autoresearch Fails" section below for why you should treat these results as starting points, not final answers.

Use Case 3: Code Quality Improvement

Goal: Autonomously improve a codebase's test coverage, performance, or code quality metrics.

This is the most immediately practical use case for most developers.

The Loop

#!/bin/bash
# Autonomous code improvement loop

BASELINE_COVERAGE=$(pytest --cov=src --cov-report=term 2>&1 | \
  grep TOTAL | awk '{print $4}' | tr -d '%')
BASELINE_COVERAGE=${BASELINE_COVERAGE:-0}   # guard: no coverage data yet

for i in $(seq 1 30); do
  # EXPERIMENT: Agent writes tests or refactors for coverage
  claude -p "Run 'pytest --cov=src --cov-report=term-missing' and
    identify the module with the lowest coverage.
    Write meaningful tests for uncovered code paths.
    Focus on edge cases and error handling.
    Do not write trivial tests just to inflate coverage." \
    --allowedTools "Read,Write,Bash"

  # EVALUATE: Check that tests pass AND coverage improved
  RESULT=$(pytest --cov=src --cov-report=term 2>&1)
  PASS=$?
  NEW_COVERAGE=$(echo "$RESULT" | grep TOTAL | awk '{print $4}' | tr -d '%')
  NEW_COVERAGE=${NEW_COVERAGE:-0}   # guard: coverage line missing

  if [ $PASS -eq 0 ] && \
     (( $(echo "$NEW_COVERAGE > $BASELINE_COVERAGE" | bc -l) )); then
    BASELINE_COVERAGE=$NEW_COVERAGE
    git add -A && git commit -m "tests: coverage $NEW_COVERAGE% (iteration $i)"
    echo "✓ Coverage: $NEW_COVERAGE%"
  else
    # Revert tracked edits and drop untracked files the agent created
    git checkout -- . && git clean -fd
    echo "✗ Tests failed or coverage didn't improve, reverted"
  fi
done

We've seen this pattern take a project from 45% to 78% test coverage overnight — with meaningful, well-written tests, not just assertion-free stubs.

Making It Robust: Practical Tips

1. Set Resource Limits

Autonomous loops can burn through API credits fast. Set hard limits:

# Cap total API spend
export MAX_TOKENS=500000  # Per iteration
export MAX_ITERATIONS=30  # Total loop count

# Add a cost check
COST=$(calculate_api_cost)  # Your cost tracking function
if (( $(echo "$COST > 50.00" | bc -l) )); then
  echo "Cost limit reached: \$$COST"
  exit 0
fi
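The `calculate_api_cost` function above is left to you. One hypothetical shape for it: persist a running total in a small JSON file that the shell loop can read back (the file name and per-token prices here are made up; check your provider's actual pricing):

```python
"""Hypothetical cost tracker backing the loop's calculate_api_cost."""
import json
import os

COST_FILE = "cost.json"          # illustrative location
PRICE_PER_1K_INPUT = 0.003       # example rates only, not real pricing
PRICE_PER_1K_OUTPUT = 0.015

def record_usage(input_tokens: int, output_tokens: int) -> float:
    """Add one iteration's token usage; return the running dollar total."""
    total = 0.0
    if os.path.exists(COST_FILE):
        with open(COST_FILE) as f:
            total = json.load(f)["total"]
    total += (input_tokens / 1000) * PRICE_PER_1K_INPUT
    total += (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    with open(COST_FILE, "w") as f:
        json.dump({"total": round(total, 4)}, f)
    return total
```

The shell side can then read the total with something like `python -c 'import json; print(json.load(open("cost.json"))["total"])'`.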

2. Use Git Branches, Not Main

Always run autoresearch on a feature branch:

git checkout -b autoresearch/experiment-$(date +%Y%m%d)
# ... run loop ...
# Review results before merging to main

3. Log Everything

The agent's reasoning is as valuable as its results. Capture full output:

claude -p "..." 2>&1 | tee -a "logs/iteration-$i.log"

4. Add Circuit Breakers

Stop the loop if something goes obviously wrong:

# Stop after 5 consecutive failed iterations
FAIL_COUNT=0   # initialize once, before the loop (the check below runs inside it)
if [ "$IMPROVED" = "false" ]; then
  FAIL_COUNT=$((FAIL_COUNT + 1))
  if [ $FAIL_COUNT -ge 5 ]; then
    echo "5 consecutive failures — stopping loop"
    exit 1
  fi
else
  FAIL_COUNT=0
fi

When Autoresearch Fails: Honest Assessment

Autoresearch loops are powerful, but they have real limitations. Here's when they struggle:

Local Optima

This is the biggest risk. Hill-climbing algorithms — including autoresearch — can get stuck on local optima. The agent finds a parameter set that's better than its neighbors but far from globally optimal. It keeps making tiny changes, none of which improve the metric, and the loop stalls.

Mitigation: Run multiple loops from different starting points. Add randomization to the experiment step. Use the loop to explore, then apply human judgment to the best results.
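Random restarts are the textbook mitigation, and easy to demo on a toy objective. The `bumpy` function below is made up purely to show the failure mode: a single greedy climb can stall on a local bump, while several climbs from random starts explore more of the space:

```python
import math
import random

def hill_climb(objective, start, steps=200, step_size=0.1):
    """Greedy climb from one start: keep improvements, discard the rest."""
    x, best = start, objective(start)
    for _ in range(steps):
        candidate = x + random.uniform(-step_size, step_size)
        score = objective(candidate)
        if score > best:
            x, best = candidate, score
    return x, best

def random_restarts(objective, n_restarts=10, low=-5.0, high=5.0):
    """Run several climbs from random starts; return the best overall."""
    climbs = [hill_climb(objective, random.uniform(low, high))
              for _ in range(n_restarts)]
    return max(climbs, key=lambda result: result[1])

def bumpy(x):
    # Global maximum of 0.5 at x = 0, with small local bumps around it.
    return -x * x + 0.5 * math.cos(5 * x)

x, score = random_restarts(bumpy)
```

In an autoresearch setting, "restart" means resetting the branch to the baseline commit and letting the agent begin from a different initial configuration.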

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure." If your evaluation metric is test coverage, the agent might write trivial tests that inflate the number without catching real bugs. If your metric is Sharpe ratio, the agent might overfit to historical data.

Mitigation: Use multiple evaluation metrics. Add qualitative checks. Review the agent's output periodically rather than blindly trusting the final number.
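One lightweight way to combine metrics is a composite score, so gaming any single number no longer pays. A sketch, assuming you can also run a mutation-testing tool to get a kill rate (both inputs as fractions in [0, 1]; the weights are arbitrary):

```python
def composite_score(coverage: float, mutation_kill_rate: float,
                    weights=(0.4, 0.6)) -> float:
    """Blend coverage with mutation-kill rate so trivial tests can't win.

    Mutation-kill rate is weighted higher here because assertion-free
    tests can inflate coverage without catching any injected bugs.
    """
    w_cov, w_mut = weights
    return w_cov * coverage + w_mut * mutation_kill_rate

# Trivial tests: coverage jumps but most mutants survive.
gamed = composite_score(coverage=0.9, mutation_kill_rate=0.3)
# Honest tests: lower coverage, far more mutants killed.
honest = composite_score(coverage=0.7, mutation_kill_rate=0.6)
```

With these weights the gamed run scores 0.54 and the honest run 0.64, so the loop keeps the honest change.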

Tasks Without Clear Metrics

Autoresearch works best when you can reduce success to a number. It struggles with subjective tasks: "Is this blog post good?" "Is this UI intuitive?" "Is this architecture clean?" If you can't script the evaluation, you can't close the loop.

Mitigation: For subjective tasks, use an LLM-as-judge approach (have a second model evaluate the output against a rubric). It's imperfect but can work for rough filtering.

Compounding Errors

Over many iterations, small errors can compound. The agent makes a change that's slightly wrong, future iterations build on that wrong foundation, and you end up far from where you want to be.

Mitigation: Periodic human review checkpoints. Run the full test suite (not just the incremental metric) every N iterations. Compare against the original baseline, not just the previous iteration.

The 55-Minute Threshold: Why This Works Now

ARK Invest's 2026 research on AI agent autonomy revealed a key data point: the best coding agents can now work reliably for 55+ minutes without human intervention. This number matters because of what it enables.

A single autoresearch iteration typically takes 2–5 minutes (experiment + evaluation). In a 55-minute autonomous window, that's 11–27 iterations per session. Run multiple sessions overnight, and you're looking at hundreds of experiment-evaluate-iterate cycles by morning.

Two years ago, agents would derail after 5–10 minutes — not enough time for the loop to produce meaningful results. The jump to 55+ minutes is what makes autoresearch practically useful, not just theoretically interesting.

Getting Started Today

Here's your minimal starting point:

  1. Pick a project with a clear, scriptable metric (test coverage is a great first choice)
  2. Create a git repo dedicated to the experiment
  3. Write your evaluation script — this is the hardest part and the most important
  4. Write a simple loop script using the patterns above
  5. Run it for 10 iterations while you watch, to calibrate
  6. Then let it run overnight on a longer loop

Start small. A 10-iteration loop on a test coverage task will teach you more about autoresearch than any amount of reading. Once you see the pattern work, you'll immediately recognize other problems in your workflow where it applies.

The tools are ready. The agents are capable enough. The pattern is proven. The only question is what you'll point it at first.

About ComputeLeap Team

The ComputeLeap editorial team covers AI tools, agents, and products — helping readers discover and use artificial intelligence to work smarter.