What Happened When I Applied Karpathy's Autoresearch Idea to LLM Inference
Most “AI optimization” demos are fun to watch for the same reason benchmark tweets are fun to watch: they show you the win, not the search.
You see the final graph. You see the +12% or the “runs 2x faster now” claim. What you usually do not see is the graveyard of bad ideas behind it. The settings that looked promising but were just noise. The optimizations that made throughput better by quietly making the model worse. The fake wins that only happened because the benchmark got easier.
So I built a small repo called Auto-Inference-Optimiser to study exactly that.
The idea is simple: lock the evaluation, open one file for experimentation, and let an AI coding agent hill-climb on inference speed forever on Apple Silicon.
And the most interesting part was not that it found a speedup.
It was what kind of speedup it found, what it failed to improve, and what that says about inference engineering on real hardware.
Let’s get into it.
Why I Built This
I care a lot about inference right now.
Not in the abstract “LLMs are cool” sense. I mean the actual production questions: where latency comes from, what batching buys you, what prompt processing costs, how KV cache decisions change throughput, and where the hardware wall starts pushing back.
There is a lot of content online about training. There is also a lot of content online about agents. But there is still not enough content that combines the two instincts: build a tight experimental harness, let the agent search inside it, and use that process to learn something real about inference.
This repo was my way of doing that.
It is clearly inspired by Karpathy’s Autoresearch, but pointed at a different layer of the stack. Instead of searching over training code on a GPU box, this one searches over an mlx inference pipeline on a Mac.
What The Repo Actually Does
At a high level, the repo turns “make inference faster” into a bounded optimization problem.
The structure is intentionally small:
- prepare.py -> fixed evaluation harness, quality gates, benchmark prompts
- inference.py -> the only file the agent is allowed to modify
- program.md -> operating manual for the agent
- results.tsv -> untracked experiment log
That boundary is doing most of the work.
prepare.py is read-only. It fixes the benchmark model, the prompts, the warmup behavior, the averaging logic, and the quality gates. The agent cannot “win” by quietly changing the test.
inference.py is the search surface. That is where the agent is allowed to touch sampling, prefill step size, prompt formatting, and the general generation path.
program.md tells the agent how to behave:
1. edit inference.py
2. commit the change
3. run python prepare.py
4. extract metrics from run.log
5. keep the change if generation_tps improved and quality still passes
6. otherwise revert
7. repeat forever
That is the core harness.
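The loop above is easy to sketch. Here is a minimal, self-contained Python version; `hill_climb`, `fake_benchmark`, and the gate thresholds are hypothetical stand-ins for the real commit/revert-and-run-prepare.py cycle, not the repo's actual code:

```python
# A minimal sketch of the keep/revert loop from program.md. The metric
# names mirror the repo's gates; run_benchmark stands in for
# "edit inference.py, commit, run python prepare.py, parse run.log".

PPL_THRESHOLD = 20.0       # assumed gate values, not the repo's real ones
SANITY_THRESHOLD = 0.8

def hill_climb(candidate_changes, run_benchmark, baseline):
    best = dict(baseline)
    kept = []
    for change in candidate_changes:
        trial = run_benchmark(change)              # run the fixed eval
        faster = trial["generation_tps"] > best["generation_tps"]
        quality_ok = (trial["avg_perplexity"] < PPL_THRESHOLD
                      and trial["sanity_check"] > SANITY_THRESHOLD)
        if faster and quality_ok:
            best = trial                           # the commit stays
            kept.append(change)
        # else: revert the commit and move on
    return best, kept

# Toy demo: one real win, one quality regression, one noise change.
def fake_benchmark(change):
    return {
        "argmax_sampling": {"generation_tps": 437.2, "avg_perplexity": 9.1, "sanity_check": 1.0},
        "kv_cache_4bit":   {"generation_tps": 450.0, "avg_perplexity": 95.0, "sanity_check": 0.2},
        "prefill_tweak":   {"generation_tps": 394.5, "avg_perplexity": 9.0, "sanity_check": 1.0},
    }[change]

baseline = {"generation_tps": 395.0, "avg_perplexity": 9.2, "sanity_check": 1.0}
best, kept = hill_climb(["argmax_sampling", "kv_cache_4bit", "prefill_tweak"],
                        fake_benchmark, baseline)
# kept == ["argmax_sampling"]: the KV-quant run was faster but failed the gate
```

Notice how the gate makes the fastest candidate lose: raw throughput alone would have kept the 4-bit KV change.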
And I like this design a lot because it bakes in three things that most autonomous coding demos hand-wave away:
- Reversibility - bad ideas are cheap to discard.
- Observability - every run leaves behind metrics and logs.
- Constraints - the agent is not allowed to optimize by moving the goalposts.
That is what makes the repo interesting to me. Not “an agent edited code.” Plenty of agents can do that. The interesting part is that the edits are forced through a stable eval loop.
The Evaluation Is The Real Product
The truth is that the most important file in this repo is not inference.py.
It is prepare.py.
That file fixes the benchmark around a small Apple Silicon-friendly model, runs warmups, averages across multiple runs, and evaluates five different prompt types:
- explanation
- long-context summarization
- reasoning
- creative generation
- code generation
That already makes the benchmark better than a lot of speed demos, because decode-heavy and prefill-heavy cases behave differently.
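To make that concrete, here is a rough sketch of what such a harness does: warmups first, then repeated runs averaged per prompt type. The prompts and the `generate` hook are assumptions for illustration, not the repo's actual ones:

```python
from statistics import mean

PROMPTS = {  # one stand-in prompt per category, mirroring the five types
    "explanation": "Explain how a transformer attends over tokens.",
    "summarization": "Summarize the following long document: ...",
    "reasoning": "A train travels 120 miles in 2.5 hours. What is its speed?",
    "creative": "Write a short poem about memory bandwidth.",
    "code": "Write a Python function for the longest common subsequence.",
}

def benchmark(generate, warmups=2, runs=3):
    """Average generation tokens/sec across prompt types, after warmups.
    `generate(prompt)` is a hypothetical hook returning (n_tokens, seconds)."""
    for _ in range(warmups):                 # warm caches before timing
        generate(next(iter(PROMPTS.values())))
    per_prompt_tps = []
    for prompt in PROMPTS.values():
        tps_samples = []
        for _ in range(runs):
            n_tokens, seconds = generate(prompt)
            tps_samples.append(n_tokens / seconds)
        per_prompt_tps.append(mean(tps_samples))
    return mean(per_prompt_tps)              # avg_generation_tps
```

The per-prompt averaging matters: without it, a change that helps decode-heavy prompts but hurts prefill-heavy ones can hide inside a single blended number.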
But the more important choice is the quality gate.
This repo does not let the agent optimize only for tokens/sec. It requires two checks to pass:
- avg_perplexity has to stay below a threshold
- sanity_check has to stay above a threshold
That second gate matters a lot.
Perplexity is useful, but it is still a model-internal metric. It can tell you that outputs are becoming unstable or degenerate, but it does not fully tell you whether the answer is still usable. So the repo also checks for concrete task-level correctness: did the train-speed answer contain 48? Did the transformer explanation mention the right ideas? Did the LCS prompt actually return something that looks like Python code?
This is one of my favorite design choices in the whole project.
Because if you do not defend quality explicitly, an optimization harness will absolutely “improve” your system by making it worse.
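The shape of such a sanity gate is simple: a handful of task-level checks scored as a fraction. The expected markers below are assumptions in the spirit of the repo's checks, not the exact strings prepare.py looks for:

```python
# Task-level sanity checks on top of perplexity. The marker strings are
# illustrative assumptions, not the repo's actual gate definitions.

SANITY_CHECKS = {
    "reasoning": lambda out: "48" in out,                    # train-speed answer
    "explanation": lambda out: "attention" in out.lower(),   # right transformer ideas
    "code": lambda out: "def " in out,                       # looks like Python
}

def sanity_score(outputs):
    """Fraction of task-level checks the outputs pass (0.0 to 1.0)."""
    results = [check(outputs[name]) for name, check in SANITY_CHECKS.items()]
    return sum(results) / len(results)

outputs = {
    "reasoning": "The train's speed is 48 mph.",
    "explanation": "Self-attention lets every token look at every other token.",
    "code": "def lcs(a, b): ...",
}
# A healthy run scores 1.0; a degenerate run drops toward 0.0
```

Crude string checks like these are easy to criticize, but as a gate they do exactly one job: they catch the runs where perplexity still looks fine and the answers have quietly stopped being answers.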
What Actually Worked
After the optimization runs, the pattern was surprisingly clear.
Here is the short version:
| Model | Baseline avg_generation_tps | Best avg_generation_tps | Improvement |
|---|---|---|---|
| Qwen 0.5B 4-bit | 394.97 | 437.17 | +10.7% |
| Llama 3.2 3B 4-bit | 115.55 | 118.10 | +2.2% |
But the more interesting part is where they came from.
1. Argmax sampling was the biggest win
On the Qwen run, setting sampling to greedy decoding gave the largest gain: about +10.8% generation throughput.
On the Llama run, it was also the best keep: about +2.6%.
That tells you something important: sampling overhead is not free. Top-p decoding is doing real work every token, and if your objective is pure throughput, removing that work can matter more than a lot of fancier ideas.
Of course there is a trade-off.
You get deterministic output and lose diversity. So this is not a universal recommendation for every product. But as an inference lesson, it is very clean: sometimes the fastest path is just doing less decoding logic per token.
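You can see where the saved work comes from in a toy comparison (pure Python, not mlx code): argmax is a single linear scan over the logits, while nucleus sampling has to softmax, sort the vocabulary, and renormalize a tail on every step:

```python
# Toy per-token samplers. Argmax is O(V); top-p pays for a softmax and an
# O(V log V) sort every single token.
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_sample(logits):
    return max(range(len(logits)), key=logits.__getitem__)   # one O(V) scan

def top_p_sample(logits, p=0.9, rng=random):
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = [], 0.0
    for i in order:                      # keep the smallest head with mass >= p
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in kept)   # renormalize and draw from the head
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [0.1, 3.0, 0.5, 2.9]
# argmax_sample(logits) always returns 1; top_p_sample draws from the nucleus
```

Real implementations vectorize all of this, but the asymmetry survives: the greedy path simply has fewer operations per token.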
2. Simplicity was a real optimization
One of the kept changes was not some exotic kernel trick.
It was simplifying inference.py: singleton sampler, inline formatting, less unused configuration, fewer lines of code. The final version kept essentially the same speed while removing about 42 lines.
I love this result.
Because it reinforces something I keep seeing in systems work: complexity often arrives with a performance story attached, but a lot of the time the clean version is just as good.
Less code to review. Less surface area for bugs. Same throughput.
What Did Not Work
This is where the repo got really instructive.
Most of the optimization ideas were either noise or regressions.
KV cache quantization hurt more than it helped
This was probably the clearest “do not cargo-cult this” result.
On Qwen 0.5B, 8-bit KV cache quantization reduced speed and dropped the sanity score hard enough to fail the gate. 4-bit KV quantization was much worse. Perplexity exploded, sanity dropped to 0.20, and the run was clearly unusable.
On Llama 3.2 3B, 4-bit KV quantization did not collapse quality, but it still made throughput worse.
So the lesson is not just “KV quantization is bad.”
It is narrower, and more useful:
- quantization overhead can outweigh the memory savings
- model families tolerate these trade-offs differently
- a trick that sounds good in theory can still lose on your actual hardware
That last part matters. Apple Silicon has its own constraints. You do not get to assume that an optimization which sounds correct on paper will pay off on an M4.
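A toy symmetric quantizer (pure Python, nothing like mlx's actual KV cache code) makes the quality side of the trade-off visible: every bit you drop roughly doubles the rounding step, so 4-bit round-trips lose far more information than 8-bit ones, and the quantize/dequantize work is extra per-step compute the memory savings must beat:

```python
# Illustrative round-trip quantizer. Not mlx code; just shows how error
# scales as the bit width shrinks.

def quantize_roundtrip(values, bits):
    qmax = 2 ** (bits - 1) - 1               # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    quantized = [round(v / scale) for v in values]
    return [q * scale for q in quantized]

def max_abs_error(values, bits):
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))

cache_slice = [0.013 * i - 0.8 for i in range(128)]   # stand-in for KV values
err8 = max_abs_error(cache_slice, bits=8)
err4 = max_abs_error(cache_slice, bits=4)
# err4 comes out over an order of magnitude larger than err8 here
```

Whether that error matters depends on the model: on this benchmark Qwen 0.5B fell apart under it while Llama 3.2 3B merely got slower, which is exactly the family-specific tolerance the bullets above describe.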
Most tuning knobs were just noise
Changing PREFILL_STEP_SIZE, rotating the KV cache, disabling Python GC, tweaking Metal cache limits. Most of these landed inside measurement noise or slightly hurt performance.
And honestly, I think that is a useful result too.
When people talk about optimization, they often imply there is always hidden free performance waiting for you if you are clever enough. The reality is much more annoying. Once the obvious waste is gone, many parameter changes are just tiny movements around a local ceiling.
That is exactly what these runs look like.
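One practical way to avoid fooling yourself with these knobs is to demand that an improvement clear the run-to-run noise band before keeping it. A minimal sketch, with made-up tps samples:

```python
from statistics import mean, stdev

def is_real_improvement(baseline_runs, trial_runs, sigmas=2.0):
    """Treat a tps change as real only if the means differ by more than
    `sigmas` times the worse of the two run-to-run standard deviations."""
    noise = max(stdev(baseline_runs), stdev(trial_runs))
    return mean(trial_runs) - mean(baseline_runs) > sigmas * noise

baseline = [394.1, 396.0, 395.2, 394.6]     # made-up tps samples
tweak    = [395.0, 396.4, 394.8, 395.9]     # inside the noise band
argmax   = [436.2, 437.9, 437.0, 437.5]     # clearly outside it
# is_real_improvement(baseline, tweak)  -> False
# is_real_improvement(baseline, argmax) -> True
```

This is a blunt instrument compared to a proper significance test, but for a hill-climbing loop it is exactly the right kind of blunt: it stops the agent from accumulating a pile of "wins" that are really just jitter.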
Reducing MAX_TOKENS was a fake win
This one is my favorite example of why the harness matters.
One experiment reduced MAX_TOKENS to 128 and got better throughput. On paper, that looks like progress, but it is not.
The model was simply doing less work.
This repo’s change monitor calls that out explicitly, which is great. It is a reminder that benchmark hygiene matters as much as cleverness. If the agent can improve the score by shrinking the task, you are not optimizing inference. You are optimizing your ability to lie to yourself.
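The guard such a monitor implies is small: refuse a throughput win if the amount of work per run shrank. The field names below are assumptions, not the repo's actual schema:

```python
# Refuse a tps "improvement" that came from doing less work. The
# min_work_ratio slack allows for natural variation in generation length.

def accept_speedup(baseline, trial, min_work_ratio=0.95):
    same_work = trial["tokens_generated"] >= min_work_ratio * baseline["tokens_generated"]
    faster = trial["generation_tps"] > baseline["generation_tps"]
    return faster and same_work

base   = {"generation_tps": 395.0, "tokens_generated": 512}
shrunk = {"generation_tps": 430.0, "tokens_generated": 128}   # MAX_TOKENS cut
real   = {"generation_tps": 437.0, "tokens_generated": 512}
# accept_speedup(base, shrunk) -> False: faster only because it did less
# accept_speedup(base, real)   -> True
```

The general principle: any quantity the agent can shrink to inflate the score needs to be pinned by the harness, not left to the agent's discretion.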
The Memory Bandwidth Wall Is Real
One thing I found especially interesting is how consistent the larger pattern was across two very different model sizes.
Qwen 0.5B got a much larger percentage jump, but both Qwen and Llama told roughly the same story:
- argmax helps
- KV quantization does not pay off here
- most config tuning is noise
- simplification is basically free
That kind of consistency usually means you are not looking at random luck. You are looking at the shape of the hardware constraint.
The README calls out the memory bandwidth wall explicitly, and I think that is the right read. Once you are already fairly close to the limits of the machine, there may not be much low-hanging fruit left beyond reducing obvious per-token overhead.
That is a useful mental model for anyone working on local inference systems.
Why I Think This Repo Matters
I do not think this project matters because it found a +10.7% win on one setup.
I think it matters because it demonstrates a practical pattern for learning about inference without turning the whole thing into vibes.
A lot of agent demos still optimize for spectacle. They show autonomous behavior without showing the measurement discipline around it. This repo goes in the opposite direction. The agent is not asked to be generally intelligent. It is asked to operate inside a small, unforgiving harness and earn every keep.
That is much closer to how I think useful agentic systems will actually look in production.
And on a more personal level, this repo was also just a good excuse to build intuition.
I want to understand inference systems the same way I wanted to understand database internals when I built toy databases. Not by memorizing a list of optimizations, but by building something tight enough that the trade-offs stop being abstract.
Conclusion
Auto-Inference-Optimiser taught me something simple but important: the hardest part of optimization is not generating ideas. It is building a harness that can tell the difference between a real win, a quality regression, and a benchmark illusion.
If you are building inference systems, I think this is the right instinct to cultivate: do not just chase faster numbers. Build a harness that makes those numbers honest.
If you found this interesting, I’d love to hear your thoughts. Share it on Twitter, LinkedIn, or reach out at guptaamanthan01[at]gmail[dot]com.