What Happened When I Applied Karpathy's Autoresearch Idea to LLM Inference
Most “AI optimization” demos are fun to watch for the same reason benchmark tweets are fun to watch: they show you the win, not the search.
You see the final graph. You see the +12% or the “runs 2x faster now” claim. What you usually do not see is the graveyard of bad ideas behind it. The settings that looked promising but were just noise. The optimizations that made throughput better by quietly making the model worse. The fake wins that only happened because the benchmark got easier.
So I built a small repo called Auto-Inference-Optimiser to study exactly that.
The idea is simple: lock the evaluation, open one file for experimentation, and let an AI coding agent hill-climb on inference speed forever on Apple Silicon.
And the most interesting part was not that it found a speedup.
It was what kind of speedup it found, what it failed to improve, and what that says about inference engineering on real hardware.
Let’s get into it.
Why I Built This
I care a lot about inference right now.
Not in the abstract “LLMs are cool” sense. I mean the actual production questions: where latency comes from, what batching buys you, what prompt processing costs, how KV cache decisions change throughput, and where the hardware wall starts pushing back.
There is a lot of content online about training. There is also a lot of content online about agents. But there is still not enough content that combines the two instincts: build a tight experimental harness, let the agent search inside it, and use that process to learn something real about inference.
This repo was my way of doing that.
It is clearly inspired by Karpathy’s Autoresearch, but pointed at a different layer of the stack. Instead of searching over training code on a GPU box, this one searches over an mlx inference pipeline on a Mac.
What The Repo Actually Does
At a high level, the repo turns “make inference faster” into a bounded optimization problem.
The structure is intentionally small:
- prepare.py -> fixed evaluation harness, quality gates, benchmark prompts
- inference.py -> the only file the agent is allowed to modify
- program.md -> operating manual for the agent
- results.tsv -> untracked experiment log
That boundary is doing most of the work.
prepare.py is read-only. It fixes the benchmark model, the prompts, the warmup behavior, the averaging logic, and the quality gates. The agent cannot “win” by quietly changing the test.
inference.py is the search surface. That is where the agent is allowed to touch sampling, prefill step size, prompt formatting, and the general generation path.
program.md tells the agent how to behave:
1. edit inference.py
2. commit the change
3. run python prepare.py
4. extract metrics from run.log
5. keep the change if generation_tps improved and quality still passes
6. otherwise revert
7. repeat forever
That is the core harness.
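The loop above is easy to sketch. Here is a minimal, self-contained Python version; `hill_climb`, `fake_benchmark`, and the gate thresholds are hypothetical stand-ins for the real commit/revert-and-run-prepare.py cycle, not the repo's actual code:

```python
# A minimal sketch of the keep/revert loop from program.md. The metric
# names mirror the repo's gates; run_benchmark stands in for
# "edit inference.py, commit, run python prepare.py, parse run.log".

PPL_THRESHOLD = 20.0       # assumed gate values, not the repo's real ones
SANITY_THRESHOLD = 0.8

def hill_climb(candidate_changes, run_benchmark, baseline):
    best = dict(baseline)
    kept = []
    for change in candidate_changes:
        trial = run_benchmark(change)              # run the fixed eval
        faster = trial["generation_tps"] > best["generation_tps"]
        quality_ok = (trial["avg_perplexity"] < PPL_THRESHOLD
                      and trial["sanity_check"] > SANITY_THRESHOLD)
        if faster and quality_ok:
            best = trial                           # the commit stays
            kept.append(change)
        # else: revert the commit and move on
    return best, kept

# Toy demo: one real win, one quality regression, one noise change.
def fake_benchmark(change):
    return {
        "argmax_sampling": {"generation_tps": 437.2, "avg_perplexity": 9.1, "sanity_check": 1.0},
        "kv_cache_4bit":   {"generation_tps": 450.0, "avg_perplexity": 95.0, "sanity_check": 0.2},
        "prefill_tweak":   {"generation_tps": 394.5, "avg_perplexity": 9.0, "sanity_check": 1.0},
    }[change]

baseline = {"generation_tps": 395.0, "avg_perplexity": 9.2, "sanity_check": 1.0}
best, kept = hill_climb(["argmax_sampling", "kv_cache_4bit", "prefill_tweak"],
                        fake_benchmark, baseline)
# kept == ["argmax_sampling"]: the KV-quant run was faster but failed the gate
```

Notice how the gate makes the fastest candidate lose: raw throughput alone would have kept the 4-bit KV change.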
And I like this design a lot because it bakes in three things that most autonomous coding demos hand-wave away:
- Reversibility - bad ideas are cheap to discard.
- Observability - every run leaves behind metrics and logs.
- Constraints - the agent is not allowed to optimize by moving the goalposts.
That is what makes the repo interesting to me. Not “an agent edited code.” Plenty of agents can do that. The interesting part is that the edits are forced through a stable eval loop.
The Evaluation Is The Real Product
The truth is that the most important file in this repo is not inference.py.
It is prepare.py.
That file fixes the benchmark around a small Apple Silicon-friendly model, runs warmups, averages across multiple runs, and evaluates five different prompt types:
- explanation
- long-context summarization
- reasoning
- creative generation
- code generation
That already makes the benchmark better than a lot of speed demos, because decode-heavy and prefill-heavy cases behave differently.
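To make that concrete, here is a rough sketch of what such a harness does: warmups first, then repeated runs averaged per prompt type. The prompts and the `generate` hook are assumptions for illustration, not the repo's actual ones:

```python
from statistics import mean

PROMPTS = {  # one stand-in prompt per category, mirroring the five types
    "explanation": "Explain how a transformer attends over tokens.",
    "summarization": "Summarize the following long document: ...",
    "reasoning": "A train travels 120 miles in 2.5 hours. What is its speed?",
    "creative": "Write a short poem about memory bandwidth.",
    "code": "Write a Python function for the longest common subsequence.",
}

def benchmark(generate, warmups=2, runs=3):
    """Average generation tokens/sec across prompt types, after warmups.
    `generate(prompt)` is a hypothetical hook returning (n_tokens, seconds)."""
    for _ in range(warmups):                 # warm caches before timing
        generate(next(iter(PROMPTS.values())))
    per_prompt_tps = []
    for prompt in PROMPTS.values():
        tps_samples = []
        for _ in range(runs):
            n_tokens, seconds = generate(prompt)
            tps_samples.append(n_tokens / seconds)
        per_prompt_tps.append(mean(tps_samples))
    return mean(per_prompt_tps)              # avg_generation_tps
```

The per-prompt averaging matters: without it, a change that helps decode-heavy prompts but hurts prefill-heavy ones can hide inside a single blended number.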
But the more important choice is the quality gate.
This repo does not let the agent optimize only for tokens/sec. It requires two checks to pass:
- avg_perplexity has to stay below a threshold
- sanity_check has to stay above a threshold
That second gate matters a lot.
Perplexity is useful, but it is still a model-internal metric. It can tell you that outputs are becoming unstable or degenerate, but it does not fully tell you whether the answer is still usable. So the repo also checks for concrete task-level correctness: did the train-speed answer contain 48? Did the transformer explanation mention the right ideas? Did the LCS prompt actually return something that looks like Python code?
This is one of my favorite design choices in the whole project.
Because if you do not defend quality explicitly, an optimization harness will absolutely “improve” your system by making it worse.
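The shape of such a sanity gate is simple: a handful of task-level checks scored as a fraction. The expected markers below are assumptions in the spirit of the repo's checks, not the exact strings prepare.py looks for:

```python
# Task-level sanity checks on top of perplexity. The marker strings are
# illustrative assumptions, not the repo's actual gate definitions.

SANITY_CHECKS = {
    "reasoning": lambda out: "48" in out,                    # train-speed answer
    "explanation": lambda out: "attention" in out.lower(),   # right transformer ideas
    "code": lambda out: "def " in out,                       # looks like Python
}

def sanity_score(outputs):
    """Fraction of task-level checks the outputs pass (0.0 to 1.0)."""
    results = [check(outputs[name]) for name, check in SANITY_CHECKS.items()]
    return sum(results) / len(results)

outputs = {
    "reasoning": "The train's speed is 48 mph.",
    "explanation": "Self-attention lets every token look at every other token.",
    "code": "def lcs(a, b): ...",
}
# A healthy run scores 1.0; a degenerate run drops toward 0.0
```

Crude string checks like these are easy to criticize, but as a gate they do exactly one job: they catch the runs where perplexity still looks fine and the answers have quietly stopped being answers.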
What Actually Worked
After the optimization runs, the pattern was surprisingly clear.
Here is the short version:
| Model | Baseline avg_generation_tps | Best avg_generation_tps | Improvement |
|---|---|---|---|
| Qwen 0.5B 4-bit | 394.97 | 437.17 | +10.7% |
| Llama 3.2 3B 4-bit | 115.55 | 118.10 | +2.2% |
But the more interesting part is where they came from.
1. Argmax sampling was the biggest win
On the Qwen run, setting sampling to greedy decoding gave the largest gain: about +10.8% generation throughput.
On the Llama run, it was also the best keep: about +2.6%.
That tells you something important: sampling overhead is not free. Top-p decoding is doing real work every token, and if your objective is pure throughput, removing that work can matter more than a lot of fancier ideas.
Of course there is a trade-off.
You get deterministic output and lose diversity. So this is not a universal recommendation for every product. But as an inference lesson, it is very clean: sometimes the fastest path is just doing less decoding logic per token.
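You can see where the saved work comes from in a toy comparison (pure Python, not mlx code): argmax is a single linear scan over the logits, while nucleus sampling has to softmax, sort the vocabulary, and renormalize a tail on every step:

```python
# Toy per-token samplers. Argmax is O(V); top-p pays for a softmax and an
# O(V log V) sort every single token.
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_sample(logits):
    return max(range(len(logits)), key=logits.__getitem__)   # one O(V) scan

def top_p_sample(logits, p=0.9, rng=random):
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = [], 0.0
    for i in order:                      # keep the smallest head with mass >= p
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in kept)   # renormalize and draw from the head
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [0.1, 3.0, 0.5, 2.9]
# argmax_sample(logits) always returns 1; top_p_sample draws from the nucleus
```

Real implementations vectorize all of this, but the asymmetry survives: the greedy path simply has fewer operations per token.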
2. Simplicity was a real optimization
One of the kept changes was not some exotic kernel trick.
It was simplifying inference.py: singleton sampler, inline formatting, less unused configuration, fewer lines of code. The final version kept essentially the same speed while removing about 42 lines.
I love this result.
Because it reinforces something I keep seeing in systems work: complexity often arrives with a performance story attached, but a lot of the time the clean version is just as good.
Less code to review. Less surface area for bugs. Same throughput.
What Did Not Work
This is where the repo got really instructive.
Most of the optimization ideas were either noise or regressions.
KV cache quantization hurt more than it helped
This was probably the clearest “do not cargo-cult this” result.
On Qwen 0.5B, 8-bit KV cache quantization reduced speed and dropped the sanity score hard enough to fail the gate. 4-bit KV quantization was much worse. Perplexity exploded, sanity dropped to 0.20, and the run was clearly unusable.
On Llama 3.2 3B, 4-bit KV quantization did not collapse quality, but it still made throughput worse.
So the lesson is not just “KV quantization is bad.”
It is narrower, and more useful:
- quantization overhead can outweigh the memory savings
- model families tolerate these trade-offs differently
- a trick that sounds good in theory can still lose on your actual hardware
That last part matters. Apple Silicon has its own constraints. You do not get to assume that an optimization which sounds correct on paper will pay off on an M4.
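A toy symmetric quantizer (pure Python, nothing like mlx's actual KV cache code) makes the quality side of the trade-off visible: every bit you drop roughly doubles the rounding step, so 4-bit round-trips lose far more information than 8-bit ones, and the quantize/dequantize work is extra per-step compute the memory savings must beat:

```python
# Illustrative round-trip quantizer. Not mlx code; just shows how error
# scales as the bit width shrinks.

def quantize_roundtrip(values, bits):
    qmax = 2 ** (bits - 1) - 1               # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    quantized = [round(v / scale) for v in values]
    return [q * scale for q in quantized]

def max_abs_error(values, bits):
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))

cache_slice = [0.013 * i - 0.8 for i in range(128)]   # stand-in for KV values
err8 = max_abs_error(cache_slice, bits=8)
err4 = max_abs_error(cache_slice, bits=4)
# err4 comes out over an order of magnitude larger than err8 here
```

Whether that error matters depends on the model: on this benchmark Qwen 0.5B fell apart under it while Llama 3.2 3B merely got slower, which is exactly the family-specific tolerance the bullets above describe.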
Most tuning knobs were just noise
Changing PREFILL_STEP_SIZE, rotating the KV cache, disabling Python GC, tweaking Metal cache limits. Most of these landed inside measurement noise or slightly hurt performance.
And honestly, I think that is a useful result too.
When people talk about optimization, they often imply there is always hidden free performance waiting for you if you are clever enough. The reality is much more annoying. Once the obvious waste is gone, many parameter changes are just tiny movements around a local ceiling.
That is exactly what these runs look like.
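One practical way to avoid fooling yourself with these knobs is to demand that an improvement clear the run-to-run noise band before keeping it. A minimal sketch, with made-up tps samples:

```python
from statistics import mean, stdev

def is_real_improvement(baseline_runs, trial_runs, sigmas=2.0):
    """Treat a tps change as real only if the means differ by more than
    `sigmas` times the worse of the two run-to-run standard deviations."""
    noise = max(stdev(baseline_runs), stdev(trial_runs))
    return mean(trial_runs) - mean(baseline_runs) > sigmas * noise

baseline = [394.1, 396.0, 395.2, 394.6]     # made-up tps samples
tweak    = [395.0, 396.4, 394.8, 395.9]     # inside the noise band
argmax   = [436.2, 437.9, 437.0, 437.5]     # clearly outside it
# is_real_improvement(baseline, tweak)  -> False
# is_real_improvement(baseline, argmax) -> True
```

This is a blunt instrument compared to a proper significance test, but for a hill-climbing loop it is exactly the right kind of blunt: it stops the agent from accumulating a pile of "wins" that are really just jitter.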
Reducing MAX_TOKENS was a fake win
This one is my favorite example of why the harness matters.
One experiment reduced MAX_TOKENS to 128 and got better throughput. On paper, that looks like progress, but it is not.
The model was simply doing less work.
This repo’s change monitor calls that out explicitly, which is great. It is a reminder that benchmark hygiene matters as much as cleverness. If the agent can improve the score by shrinking the task, you are not optimizing inference. You are optimizing your ability to lie to yourself.
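The guard such a monitor implies is small: refuse a throughput win if the amount of work per run shrank. The field names below are assumptions, not the repo's actual schema:

```python
# Refuse a tps "improvement" that came from doing less work. The
# min_work_ratio slack allows for natural variation in generation length.

def accept_speedup(baseline, trial, min_work_ratio=0.95):
    same_work = trial["tokens_generated"] >= min_work_ratio * baseline["tokens_generated"]
    faster = trial["generation_tps"] > baseline["generation_tps"]
    return faster and same_work

base   = {"generation_tps": 395.0, "tokens_generated": 512}
shrunk = {"generation_tps": 430.0, "tokens_generated": 128}   # MAX_TOKENS cut
real   = {"generation_tps": 437.0, "tokens_generated": 512}
# accept_speedup(base, shrunk) -> False: faster only because it did less
# accept_speedup(base, real)   -> True
```

The general principle: any quantity the agent can shrink to inflate the score needs to be pinned by the harness, not left to the agent's discretion.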
The Memory Bandwidth Wall Is Real
One thing I found especially interesting is how consistent the larger pattern was across two very different model sizes.
Qwen 0.5B got a much larger percentage jump, but both Qwen and Llama told roughly the same story:
- argmax helps
- KV quantization does not pay off here
- most config tuning is noise
- simplification is basically free
That kind of consistency usually means you are not looking at random luck. You are looking at the shape of the hardware constraint.
The README calls out the memory bandwidth wall explicitly, and I think that is the right read. Once you are already fairly close to the limits of the machine, there may not be much low-hanging fruit left beyond reducing obvious per-token overhead.
That is a useful mental model for anyone working on local inference systems.
Why I Think This Repo Matters
I do not think this project matters because it found a +10.7% win on one setup.
I think it matters because it demonstrates a practical pattern for learning about inference without turning the whole thing into vibes.
A lot of agent demos still optimize for spectacle. They show autonomous behavior without showing the measurement discipline around it. This repo goes in the opposite direction. The agent is not asked to be generally intelligent. It is asked to operate inside a small, unforgiving harness and earn every keep.
That is much closer to how I think useful agentic systems will actually look in production.
And on a more personal level, this repo was also just a good excuse to build intuition.
I want to understand inference systems the same way I wanted to understand database internals when I built toy databases. Not by memorizing a list of optimizations, but by building something tight enough that the trade-offs stop being abstract.
Conclusion
Auto-Inference-Optimiser taught me something simple but important: the hardest part of optimization is not generating ideas. It is building a harness that can tell the difference between a real win, a quality regression, and a benchmark illusion.
If you are building inference systems, I think this is the right instinct to cultivate: do not just chase faster numbers. Build a harness that makes those numbers honest.
If you found this interesting, I’d love to hear your thoughts. Share it on Twitter, LinkedIn, or reach out at guptaamanthan01[at]gmail[dot]com.