Most people are using AI to help once, then doing the rest themselves. Autoresearch by Karpathy flips that completely.
Over the past few days, Autoresearch by Andrej Karpathy has been everywhere, and it deserves the attention. It is a proof of concept for something powerful: autonomous iteration by AI.
Instead of waiting for instructions at every step, the system keeps improving on its own. It experiments, evaluates results, keeps what works, discards what doesn't, and repeats continuously.
TL;DR
- AI modifies code, runs it, evaluates results, and repeats the loop
- It trains a small GPT model and improves the training process automatically
- Shopify's CEO used the same idea and achieved a 53% speed improvement
- This works for anything that can be measured or scored
- The real challenge is defining what "good" actually means
What Is It Doing?
At its core, autoresearch is a simple loop built around two files:
- train.py: the code being optimized
- program.md: instructions for the AI agent
The process:
- The AI reads the current code
- It makes a small change
- Saves the change
- Runs the training script
- Checks performance using a metric
- Keeps the change if results improve
- Reverts it if results get worse
- Repeats endlessly
The system is explicitly told to never stop. If it runs out of ideas, it is expected to explore deeper and try more variations.
This is not a chatbot. It behaves more like a researcher that never gets tired.
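The keep-or-revert loop above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Karpathy's code: `propose_change` is a hypothetical stand-in for the AI agent, the `val_bpb: <number>` output line is an assumed format, and the loop is bounded by `iterations` so the example terminates (the real system is told never to stop).

```python
import re
import subprocess
import sys
from pathlib import Path
from typing import Callable


def run_and_measure(script: Path, timeout_s: int = 600) -> float:
    """Run the script and parse a line such as 'val_bpb: 1.234' from stdout.

    The output format is an assumption made for this sketch.
    """
    try:
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return float("inf")  # a run that hangs counts as a failed experiment
    match = re.search(r"val_bpb:\s*([0-9]+(?:\.[0-9]+)?)", result.stdout)
    if result.returncode != 0 or match is None:
        return float("inf")  # a crash counts as the worst possible score
    return float(match.group(1))


def keep_or_revert_loop(
    script: Path,
    propose_change: Callable[[str], str],  # hypothetical stand-in for the agent
    iterations: int,
) -> float:
    """Try one change at a time; keep it only if the metric gets lower."""
    best = run_and_measure(script)
    for _ in range(iterations):
        previous = script.read_text()
        script.write_text(propose_change(previous))  # save the change
        metric = run_and_measure(script)             # run and check the metric
        if metric < best:
            best = metric                            # keep the improvement
        else:
            script.write_text(previous)              # revert the change
    return best
```

The important design choice is that a crash or a missing metric is scored as infinitely bad, so a broken change is always reverted rather than kept by accident.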
What Is It Training?
The script trains a small GPT model, but uses modern techniques:
- Scaled Dot Product Attention
- Hybrid optimizers like Muon and AdamW
- Rotary Positional Embeddings
It trains on the TinyStories dataset, designed for efficient experimentation.
Each run takes about five minutes, so roughly 12 experiments fit in an hour and a session of eight hours or so covers around 100. The goal is to improve a metric called val_bpb (validation bits per byte): lower values mean better performance.
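For readers who have not met the metric before, bits per byte is usually computed by converting the summed cross-entropy loss from nats to bits and dividing by the number of raw bytes of text that were scored. The sketch below shows that arithmetic. The function name and arguments are illustrative assumptions, and the exact computation inside autoresearch may differ.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) into bits per byte.

    total_loss_nats: sum of per-token negative log-likelihoods, natural log.
    total_bytes: number of UTF-8 bytes in the text those tokens cover.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_loss_nats / (math.log(2) * total_bytes)


# Illustrative numbers only: 1,000 tokens at an average loss of 2.0 nats,
# covering 4,200 bytes of text, works out to roughly 0.69 bits per byte.
example = bits_per_byte(total_loss_nats=2.0 * 1000, total_bytes=4200)
```

Normalizing by bytes rather than by tokens means the score does not depend on the tokenizer, which makes results comparable across experiments that change the tokenizer.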
Quick Explanation of GPT
A GPT (Generative Pre-trained Transformer) is the fundamental architecture behind models like ChatGPT, Claude, and Llama:
- The Transformer is a neural network architecture invented at Google in 2017. Its superpower is Attention — rather than reading text word-by-word, it looks at the entire context at once and calculates which words are most relevant to each other.
- Pre-trained means the model is first exposed to large amounts of text. It learns statistical patterns.
- Generative means the model's only job is to predict the next token, one at a time.
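To make "calculates which words are most relevant to each other" concrete, here is a minimal pure-Python sketch of scaled dot-product attention for a single head. It is a teaching example, not the implementation used in train.py: real models run this on tensors with learned projections and many heads. The `causal` flag is an addition for illustration and assumes one query per key position, as in a GPT that may only look at earlier tokens.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Turn raw scores into weights that sum to 1."""
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    causal: bool = False,
) -> List[Vector]:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # With causal=True, position i only sees positions 0..i.
        visible = i + 1 if causal else len(keys)
        scores = [
            sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d_k)
            for k in keys[:visible]
        ]
        weights = softmax(scores)
        # Each output is a relevance-weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values[:visible]))
            for j in range(len(values[0]))
        ])
    return outputs
```

Each output row is a weighted average of the value vectors, where the weights say how relevant every other position is to the current one.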
Real-World Proof
Tobi Lütke (co-founder and CEO of Shopify) applied the same approach to Liquid, Shopify's template engine.
He ran around 120 automated experiments overnight. Results:
- 53% faster performance
- 61% fewer memory allocations
- All tests still passed
No manual tuning. Just automated iteration.
The Bigger Idea
This approach works for anything that can be scored.
The loop becomes:
- Define a task
- Define how to measure success
- Let AI generate variations
- Keep what improves results
- Repeat
Outside of model training, the loop does not need a GPU or a complex setup. It needs clear evaluation criteria.
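The generalized loop fits in one small function. In this sketch, `generate_variations` and `score` are hypothetical callables you would supply: the first stands in for the AI proposing several variants per round, the second encodes what "good" means for your task (higher is better here). Neither name comes from autoresearch.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def improve(
    candidate: T,
    generate_variations: Callable[[T], Iterable[T]],  # stand-in for the AI
    score: Callable[[T], float],                      # your definition of "good"
    rounds: int,
) -> Tuple[T, float]:
    """Each round, score a batch of variations and keep the best one
    only if it beats the current candidate."""
    best, best_score = candidate, score(candidate)
    for _ in range(rounds):
        scored = [(score(v), v) for v in generate_variations(best)]
        if not scored:
            continue  # nothing proposed this round
        challenger_score, challenger = max(scored, key=lambda pair: pair[0])
        if challenger_score > best_score:
            best, best_score = challenger, challenger_score
    return best, best_score


# Toy usage: "improve" a number toward 10.
result, result_score = improve(
    candidate=0,
    generate_variations=lambda x: [x - 1, x + 1, x + 2],
    score=lambda x: -(x - 10) ** 2,
    rounds=20,
)
```

Nothing in the function knows what is being optimized. Swapping the two callables is all it takes to move from numbers to ad copy, SQL queries, or prompts.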
Where You Can Use It
Marketing and Content
Define: readability, SEO structure, keyword density, tone, predicted CTR, brand voice.
The system writes content, scores it, improves weak areas, and repeats until it meets all conditions.
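As a sketch of what "scores it" could look like for content, the function below checks two of the criteria listed above: a crude readability proxy and keyword density. The thresholds, the check names, and the single-word keyword restriction are all illustrative assumptions. Criteria like tone, predicted CTR, or brand voice would need their own scorers.

```python
import re
from typing import Dict


def score_copy(
    text: str,
    keyword: str,                          # single word, for simplicity
    max_avg_sentence_words: float = 20.0,  # assumed readability threshold
    min_density: float = 0.01,             # assumed lower bound
    max_density: float = 0.03,             # assumed upper bound (avoid stuffing)
) -> Dict[str, bool]:
    """Return a pass/fail result for each criterion."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {"readable": False, "keyword_density_ok": False}
    avg_sentence_words = len(words) / len(sentences)
    density = words.count(keyword.lower()) / len(words)
    return {
        "readable": avg_sentence_words <= max_avg_sentence_words,
        "keyword_density_ok": min_density <= density <= max_density,
    }


def meets_all(checks: Dict[str, bool]) -> bool:
    """The loop stops rewriting once every criterion passes."""
    return all(checks.values())
```

Returning one flag per criterion, rather than a single number, tells the agent which weak area to work on in the next pass.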
Use cases:
- Ad copy: 50 variants overnight, humans review a shortlist of winners
- Email campaigns: score on open rate prediction, spam likelihood, CTA clarity
- SEO articles: define the target keyword, semantic clusters, headings, and a readability floor; the agent rewrites until the article meets every criterion
- Social media posts: platform-specific scoring for Twitter/X, LinkedIn, and Instagram simultaneously
- Landing pages: iterate copy layer independently from design
Web and Product Development
- Landing pages optimized for performance and accessibility
- UI components checked for design and compliance
- Onboarding flows refined for clarity and completion
Engineering and Performance
- APIs optimized for speed
- Queries improved automatically
- Prompts refined using evaluation datasets
- CI pipelines optimized for efficiency
Why This Matters
The real unlock isn't the loop — it's being forced to define what "good" actually means. Most teams have never done that rigorously.
This is exactly the architecture we're building into Montr AI — automated content loops that score and iterate without a human in the middle.
Before: AI helps once, then humans iterate manually
Now: AI runs the entire loop and improves continuously
The real bottleneck is no longer generating output. It is defining what success looks like. Once that is clear, the system can keep improving on its own.