Autoresearch by Andrej Karpathy: A Computer That Optimises Itself

Most people use AI for a single answer, then do the rest of the iteration themselves. Autoresearch by Karpathy flips that completely.

Over the past few days, Autoresearch by Andrej Karpathy has been everywhere, and it deserves the attention. It is a proof of concept for something powerful: autonomous iteration by AI.

Instead of waiting for instructions at every step, the system keeps improving on its own. It experiments, evaluates the results, keeps what works, discards what doesn't, and repeats continuously.

TL;DR

AI modifies code, runs it, evaluates results, and repeats the loop
It trains a small GPT model and improves the training process automatically
Shopify's CEO used the same idea and achieved a 53% speed improvement
This works for anything that can be measured or scored
The real challenge is defining what "good" actually means

What Is It Doing?

At its core, autoresearch is a simple loop built around two files:

train.py: the code being optimized
program.md: instructions for the AI agent

The AI agent runs the following loop:

Reads the current code
Makes a small change
Saves the change
Runs the training script
Checks performance using a metric
Keeps the change if results improve
Reverts it if results get worse
Repeats endlessly
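
The keep-or-revert logic above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the autoresearch repository: the names autoresearch_loop, propose_change and run_and_measure are hypothetical, the "code" is reduced to a single hyperparameter, and the "training run" is a toy function. The sketch also stops after a fixed number of experiments, whereas the real agent is told to keep going.

```python
import random
from typing import Callable, Tuple


def autoresearch_loop(
    state: dict,
    propose_change: Callable[[dict, random.Random], dict],
    run_and_measure: Callable[[dict], float],
    n_experiments: int,
    seed: int = 0,
) -> Tuple[dict, float, int]:
    """Keep-or-revert loop where a lower metric is better (as with val_bpb)."""
    rng = random.Random(seed)
    best_state = dict(state)
    best_metric = run_and_measure(best_state)  # baseline run
    kept = 0
    for _ in range(n_experiments):
        candidate = propose_change(best_state, rng)  # small edit on a copy
        metric = run_and_measure(candidate)          # stands in for a training run
        if metric < best_metric:                     # improvement: keep the change
            best_state, best_metric = candidate, metric
            kept += 1
        # otherwise the candidate is dropped, which is the "revert" step
    return best_state, best_metric, kept


# Toy stand-ins (hypothetical): one hyperparameter and a bowl-shaped
# "validation metric" whose minimum of 1.0 sits at lr = 0.3.
def toy_propose(state: dict, rng: random.Random) -> dict:
    new_state = dict(state)
    new_state["lr"] = state["lr"] + rng.uniform(-0.1, 0.1)
    return new_state


def toy_measure(state: dict) -> float:
    return (state["lr"] - 0.3) ** 2 + 1.0


best, metric, kept = autoresearch_loop({"lr": 1.0}, toy_propose, toy_measure, 200)
print(best, metric, kept)
```

Because a worse candidate never replaces the current best, the metric can only stay flat or improve from one experiment to the next.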

The system is explicitly told to never stop. If it runs out of ideas, it is expected to explore deeper and try more variations.

This is not a chatbot. It behaves more like a researcher that never gets tired.

What Is It Training?

The script trains a small GPT model, but uses modern techniques:

Scaled Dot Product Attention
A hybrid optimizer setup that combines Muon and AdamW
Rotary Positional Embeddings

It trains on the TinyStories dataset, designed for efficient experimentation.

Each run takes about five minutes, so a single session of around eight hours fits roughly 100 experiments. The goal is to improve a metric called val_bpb (validation bits per byte); lower values mean better performance.
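
For readers who have not met bits per byte before, here is a minimal sketch of the standard definition: total cross-entropy on the validation text, converted from nats to bits, divided by the number of bytes in that text. The function names are hypothetical and the exact computation inside the training script may differ.

```python
import math


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats) into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nats / math.log(2) / total_bytes


def bpb_from_token_loss(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Same metric, starting from the usual mean per-token loss."""
    return bits_per_byte(mean_loss_nats * n_tokens, n_bytes)


# Example: a mean loss of 4 bits per token (4 * ln 2 nats) over 1,000 tokens
# that together cover 4,000 bytes of text.
example_bpb = bpb_from_token_loss(4 * math.log(2), n_tokens=1000, n_bytes=4000)
print(example_bpb)
```

Dividing by bytes rather than tokens means the score does not depend on how the tokenizer splits the text, so runs with different vocabularies stay comparable.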

Quick Explanation of GPT

A GPT (Generative Pre-trained Transformer) is the fundamental architecture behind models like ChatGPT, Claude, and Llama:

The Transformer is a neural network architecture invented at Google in 2017. Its superpower is Attention — rather than reading text word-by-word, it looks at the entire context at once and calculates which words are most relevant to each other.
Pre-trained means the model is first exposed to large amounts of text. It learns statistical patterns.
Generative means the model produces text by predicting the next token (a word or word fragment), one token at a time.
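
To make the attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It is written for readability, not speed, and is not taken from train.py. It includes the causal mask a GPT uses, so each position attends only to itself and earlier positions. The function names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention where position i sees positions 0..i."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Relevance of each visible position j to position i.
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the relevance-weighted average of the visible value vectors.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
attended = causal_attention(q, k, v)
print(attended)
```

The first position can only see itself, so its output is its own value vector. The second position blends both value vectors, leaning toward the one whose key matches its query.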

Real-World Proof

Tobi Lütke (co-founder and CEO of Shopify) applied this exact approach to Shopify's Liquid template engine.

He ran around 120 automated experiments overnight. Results:

53% faster parsing and rendering
61% fewer memory allocations
All tests still passed
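
The "all tests still passed" condition matters as much as the speed numbers: a change should only be kept if it is both correct and cheaper. Below is a minimal Python sketch of such a two-stage gate. It is an analogy only. Liquid is a Ruby project and this is not the harness used there. The helper names are hypothetical, and peak traced memory stands in for the real benchmark.

```python
import tracemalloc
from typing import Callable, Sequence, Tuple


def peak_bytes(fn: Callable, *args) -> int:
    """Peak memory traced by tracemalloc while fn(*args) runs."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def accept(
    candidate: Callable,
    baseline: Callable,
    test_cases: Sequence[Tuple[tuple, object]],
    workload: tuple,
) -> bool:
    """Keep a change only if tests pass AND the measured cost goes down."""
    for args, expected in test_cases:
        if candidate(*args) != expected:  # gate 1: correctness
            return False
    # gate 2: the candidate must use less peak memory on the workload
    return peak_bytes(candidate, *workload) < peak_bytes(baseline, *workload)


def sum_squares_list(n: int) -> int:
    return sum([i * i for i in range(n)])   # builds a full list first


def sum_squares_gen(n: int) -> int:
    return sum(i * i for i in range(n))     # streams values, no list


def sum_squares_broken(n: int) -> int:
    return sum(i * i for i in range(n - 1))  # cheaper, but wrong


TESTS = [((4,), 14), ((0,), 0)]
print(accept(sum_squares_gen, sum_squares_list, TESTS, (50_000,)))
print(accept(sum_squares_broken, sum_squares_list, TESTS, (50_000,)))
```

Checking correctness before cost prevents the loop from "winning" by breaking behaviour, which is the failure mode an unattended optimiser is most likely to find.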

No manual tuning. Just automated iteration.

The Bigger Idea

This approach works for anything that can be scored.

The loop becomes:

Define a task
Define how to measure success
Let AI generate variations
Keep what improves results
Repeat

Outside of model training, the loop does not need a GPU or a complex setup. It just needs clear evaluation criteria.

Where You Can Use It

Marketing and Content

Define the scoring criteria: readability, SEO structure, keyword density, tone, predicted CTR, brand voice.

The system writes content, scores it, improves weak areas, and repeats until it meets all conditions.
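
As a rough illustration of what "scoring" can mean here, the sketch below checks a draft against two of the criteria above using simple rules. Everything in it is an assumption made for the example: the function names, the use of average sentence length as a readability proxy, and the threshold values. Criteria such as tone or predicted CTR would need a model or real data behind them.

```python
import re
from typing import Dict


def score_copy(
    text: str,
    keyword: str,
    max_avg_sentence_words: float = 20.0,  # hypothetical readability ceiling
    min_density: float = 0.02,             # hypothetical keyword density floor
    max_density: float = 0.08,             # hypothetical keyword stuffing ceiling
) -> Dict[str, bool]:
    """Return a pass/fail result per criterion for one draft."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {"readability": False, "keyword_density": False}
    avg_sentence_words = len(words) / len(sentences)
    density = words.count(keyword.lower()) / len(words)
    return {
        "readability": avg_sentence_words <= max_avg_sentence_words,
        "keyword_density": min_density <= density <= max_density,
    }


def meets_all(checks: Dict[str, bool]) -> bool:
    """The loop can stop once every criterion passes."""
    return all(checks.values())


draft_a = "Planner planner planner. Buy the planner."
draft_b = ("Our planner saves your team an hour every week. "
           "Start a free trial today and see the difference.")

checks_a = score_copy(draft_a, "planner")
checks_b = score_copy(draft_b, "planner")
print(checks_a, meets_all(checks_a))
print(checks_b, meets_all(checks_b))
```

Returning one result per criterion, instead of a single number, tells the next iteration which weak area to rewrite.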

Use cases:

Ad copy: 50 variants overnight, humans review a shortlist of winners
Email campaigns: score on open rate prediction, spam likelihood, CTA clarity
SEO articles: define the target keyword, semantic clusters, headings, and a readability floor, then let the agent rewrite until the article meets every criterion
Social media posts: platform-specific scoring for Twitter/X, LinkedIn, and Instagram simultaneously
Landing pages: iterate copy layer independently from design

Web and Product Development

Landing pages optimized for performance and accessibility
UI components checked for design and compliance
Onboarding flows refined for clarity and completion

Engineering and Performance

APIs optimized for speed
Queries improved automatically
Prompts refined using evaluation datasets
CI pipelines optimized for efficiency

Why This Matters

The real unlock isn't the loop — it's being forced to define what "good" actually means. Most teams have never done that rigorously.

This is exactly the architecture we're building into Montr AI — automated content loops that score and iterate without a human in the middle.

Before: AI helps once, then humans iterate manually

Now: AI runs the entire loop and improves continuously

The real bottleneck is no longer generating output. It is defining what success looks like. Once that is clear, the system can keep improving on its own.