How Does Cursor Index Your Codebase?
Every time you ask Cursor “where do we handle authentication?” and it points you to the right file in a 50,000 file monorepo in under a second, something interesting has happened under the hood. It’s not magic but an elegant combination of Merkle trees, trigram indexes, AST-based chunking, a custom trained embedding model, and a vector database (turbopuffer) storing over a trillion vectors across 80 million namespaces. This blog post is special because I am a big fan of Cursor and Turbopuffer. So I thought it would be interesting to understand how it works under the hood. Let’s get into it.
The Two Index Strategy
Most people assume Cursor has a single index: a vector store with embeddings of your code. In fact, Cursor maintains two fundamentally different indexes that serve different purposes:
- A semantic (vector) index for natural language queries like “where do we handle payment retries?”, “show me the database connection logic”
- A trigram-style (inverted) index for regular expression search, the kind of structured pattern matching that grep does, but without the O(n*files) cost (the production version uses sparse n-grams rather than plain trigrams; more on that later)
These aren’t interchangeable. The semantic index is great at conceptual lookup but can’t tell you “find all places where we call db.execute with a raw string literal.” The regex index is precise but completely blind to meaning. Cursor’s agent harness uses both, and the combination is what makes the context quality good enough to write code that actually gets retained in your codebase.
Cursor’s own research shows that semantic search on top of grep improves agent accuracy by 12.5% on average (ranging from 6.5% to 23.5% depending on the model), increases code retention by 2.6% on large codebases, and reduces dissatisfied follow-up requests by 2.2%.
Building the Semantic Index
When you open a project, Cursor starts building its semantic index. For a fresh codebase, this means reading every file, splitting it into chunks, embedding each chunk, and uploading the embeddings to a vector database. Here’s what each of those steps actually involves.
Chunking With Tree-Sitter
Cursor describes splitting each file into “syntactic chunks” rather than fixed windows: the code is first parsed into an Abstract Syntax Tree (AST), then chunked along AST boundaries like functions, classes, methods, and blocks. (The standard tool for this across dozens of language grammars is tree-sitter, which is what the examples here use.) A function is a natural unit of meaning in code, and it becomes a natural unit in the index. The way this typically works is that small sibling AST nodes get merged together up to the token limit while a coherent unit like a function or class stays in one piece, so the embedding for a chunk captures something meaningful rather than an arbitrary window of characters.
Here’s a simplified version of what AST based chunking looks like:
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)
MAX_CHUNK_BYTES = 1500

def chunk_file(source_code: bytes) -> list[str]:
    tree = parser.parse(source_code)
    chunks = []
    for node in tree.root_node.children:
        node_text = source_code[node.start_byte:node.end_byte].decode()
        # Top-level functions and classes (including decorated ones)
        # become individual chunks
        if node.type in ("function_definition", "class_definition", "decorated_definition"):
            chunks.append(node_text)
        # Small sibling nodes (imports, constants) get merged into the
        # previous chunk while it stays under the size limit
        elif chunks and len(chunks[-1]) + len(node_text) < MAX_CHUNK_BYTES:
            chunks[-1] += "\n" + node_text
        else:
            chunks.append(node_text)
    return chunks
This is a simplification. A real implementation has to handle nested structures, multiple languages, and edge cases across dozens of language grammars, but the idea is the same.
A Custom Embedding Model Trained on Agent Traces
Interestingly, Cursor doesn’t reach for an off-the-shelf embedding model like text-embedding-ada-002 or text-embedding-3-small. They say plainly that they train their own embedding model, and the training signal comes from agent sessions themselves.
The clever part is where the training labels come from. In Cursor’s description, agent sessions are the training data. When an agent works through a task, it performs multiple searches and opens files before finding the right code, and by analyzing those traces you can see, in retrospect, what should have been retrieved earlier in the conversation. They feed each trace to an LLM that ranks what content would have been most helpful at each step, then train the embedding model to align its similarity scores with those rankings. That’s their description, and it’s the part I’d take as ground truth.
Here’s where I’m reading between the lines: a session is effectively a self-labeling dataset. The agent’s eventual successful change, and the files it kept coming back to, are exactly the kind of hindsight signal an LLM grader would lean on to decide what “should have surfaced” at each step. And “align similarity scores with the rankings” almost always cashes out as some contrastive-style objective that pulls the query vector toward the chunks that proved useful and pushes it away from the ones that merely looked similar. The net effect is the thing they do state outright: the model stops optimizing for “do these two pieces of code resemble each other?” and starts optimizing for “given this task and context, which chunk actually helps the agent?”
This creates a feedback loop grounded in how agents actually use code and not generic code similarity. It’s the same idea as using preference data from human feedback, applied to retrieval.
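To make the contrastive reading concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. This is my guess at the shape of the objective, not Cursor’s published loss: the function names, the temperature value, and the idea of one helpful chunk versus a list of distractors are all assumptions for illustration.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(
    query: list[float],
    helpful: list[float],
    distractors: list[list[float]],
    temperature: float = 0.1,  # assumed value, purely illustrative
) -> float:
    """InfoNCE-style loss: small when the query embedding is closer
    (by cosine similarity) to the helpful chunk than to the distractors."""
    q = normalize(query)
    candidates = [normalize(helpful)] + [normalize(d) for d in distractors]
    logits = [dot(q, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability assigned to the helpful chunk (index 0)
    return log_sum - logits[0]
```

Minimizing this over many (query, helpful chunk, distractors) triples mined from traces would pull queries toward chunks that helped and away from lookalikes, which is the behavior described above.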
Turbopuffer: A Namespace Per Codebase
The embeddings land in Turbopuffer, a vector database whose object storage architecture with unlimited namespaces is a natural fit for Cursor’s namespace-per-codebase use case. Every codebase gets its own namespace. Active namespaces are kept in memory/NVMe; inactive ones fade to object storage. When a query comes in for an inactive namespace, it warms up on demand.
Cursor is running over 1 trillion vectors across 80 million namespaces. Their previous vector database required manually bin-packing namespaces to servers, a constant operational headache. Turbopuffer’s serverless architecture eliminated that entirely and cut costs by 20x. The API is straightforward:
import turbopuffer

tpuf = turbopuffer.Turbopuffer(region="gcp-us-central1")

# One namespace per codebase, keyed by a hash of the repo path
# (repo_hash, chunks, and query_embedding are placeholders defined elsewhere)
ns = tpuf.namespace(f"codebase-{repo_hash}")

# Upsert embeddings for new or changed chunks (row-based format)
ns.write(
    upsert_rows=[
        {"id": chunk.id, "vector": chunk.embedding, "file_path": chunk.path}
        for chunk in chunks
    ],
    distance_metric="cosine_distance",
    schema={"file_path": {"type": "string", "glob": True}},
)

# Query at search time
results = ns.query(
    rank_by=("vector", "ANN", query_embedding),
    top_k=20,
    filters=("file_path", "Glob", "src/**/*.py"),
    include_attributes=["file_path"],
)
Efficient Sync with Merkle Trees
Embedding computation is expensive. On a large codebase, embedding every chunk from scratch every time you make a change would be prohibitively slow. Cursor uses Merkle trees to know exactly which chunks need re-embedding.
A Merkle tree assigns a cryptographic hash (SHA-256 here) to every file, and then to every directory based on the hashes of its children. Git’s object model rests on the same idea: Git keys its objects by content hash and chains them the same way (using SHA-1 rather than SHA-256), so a change to any file propagates up through every parent directory to the root.
When a file changes, Cursor walks the Merkle tree and finds exactly which branches diverge between the client and server. Only the changed files need to be re-chunked and re-embedded. The walkthrough looks like this:
import hashlib
from dataclasses import dataclass
from pathlib import Path

@dataclass
class MerkleNode:
    path: str
    hash: str
    children: list["MerkleNode"]
    is_file: bool

def hash_file(path: Path) -> str:
    content = path.read_bytes()
    return hashlib.sha256(content).hexdigest()

def build_merkle_tree(root: Path) -> MerkleNode:
    if root.is_file():
        return MerkleNode(str(root), hash_file(root), [], is_file=True)
    children = [build_merkle_tree(child) for child in sorted(root.iterdir())]
    # Directory hash is SHA-256 of all children's hashes concatenated
    combined = "".join(child.hash for child in children)
    dir_hash = hashlib.sha256(combined.encode()).hexdigest()
    return MerkleNode(str(root), dir_hash, children, is_file=False)

def list_files(node: MerkleNode) -> list[str]:
    """Return every file path under a node (used for brand-new subtrees)."""
    if node.is_file:
        return [node.path]
    return [path for child in node.children for path in list_files(child)]

def find_changed_files(client_tree: MerkleNode, server_tree: MerkleNode) -> list[str]:
    """Walk both trees simultaneously, only descending into differing branches.

    Reports new and modified files on the client; deletions are not handled
    in this sketch. Paths must be comparable across both trees.
    """
    if client_tree.hash == server_tree.hash:
        return []  # Entire subtree is unchanged, skip it
    if client_tree.is_file:
        return [client_tree.path]
    # Build lookup for server children
    server_children = {child.path: child for child in server_tree.children}
    changed = []
    for client_child in client_tree.children:
        if client_child.path not in server_children:
            # New file or directory: every file beneath it needs indexing
            changed.extend(list_files(client_child))
        else:
            changed.extend(
                find_changed_files(client_child, server_children[client_child.path])
            )
    return changed
The efficiency gain is substantial. In a 50,000 file workspace, just the filenames and SHA-256 hashes add up to roughly 3.2 MB. Without the Merkle tree, you would transfer that data on every update to check what changed. With the tree, you only walk branches where hashes differ, typically a handful of files after a normal coding session.
Reusing Teammate Indexes
Here’s the part of Cursor’s indexing pipeline that I think is genuinely clever for performance. On large monorepos, indexing from scratch takes hours. But most engineers on a team work from near-identical copies of the same codebase: clones average 92% similarity within an organization. Re-embedding the same code over and over is a pure waste of compute and storage.
When a new user opens a codebase, Cursor computes the Merkle tree and derives a simhash, a single value that summarizes the distribution of file content hashes, similar in spirit to a locality sensitive hash. The client sends this simhash to the server and the server searches a vector database of all existing simhashes across the organization and finds the most similar existing index.
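Cursor doesn’t spell out how the simhash is computed, so treat the following as a minimal sketch of the textbook technique rather than their implementation. It assumes the inputs are hex-encoded file content hashes (such as the SHA-256 leaves of the Merkle tree); the 64-bit width and the `similarity` helper are my own choices for illustration.

```python
import hashlib


def simhash(file_hashes: list[str], bits: int = 64) -> int:
    """Fold a collection of hex-encoded content hashes into one fingerprint.

    Each input hash votes +1 or -1 on every bit position; the fingerprint
    keeps the majority vote, so similar collections yield similar bits.
    """
    counts = [0] * bits
    for file_hash in file_hashes:
        # Use the leading `bits` bits of the content hash as the feature
        value = int(file_hash[: bits // 4], 16)
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of matching bits (1.0 means identical fingerprints)."""
    return 1 - bin(a ^ b).count("1") / bits
```

Two clones that share most of their files produce fingerprints that differ in only a few bits, while unrelated repositories agree on roughly half of them, which is what makes a nearest-neighbor search over fingerprints a reasonable way to find a reusable index.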
If similarity is above a threshold, Cursor seeds the new user’s namespace from the existing one using Turbopuffer’s copy_from_namespace operation, at a 50% write discount. The user is immediately allowed to query the copied index while the background sync reconciles differences and their local changes relative to the copied codebase. The results are:
- Median repo: time to first query drops from 7.87 seconds to 525 milliseconds
- 90th percentile: 2.82 minutes -> 1.87 seconds
- 99th percentile: 4.03 hours -> 21 seconds
But this creates a security problem: user A’s index might contain code that user B shouldn’t see. How do you reuse an index without leaking code across trust boundaries?
Proving Access via Merkle Proofs
The solution to the problem is elegant. Since each node in the Merkle tree is a cryptographic hash of the content beneath it, you can only compute that hash if you actually have the file. When user B starts from a copied index, their client uploads the full Merkle tree. The server stores this tree as a set of content proofs: for each file path in the index, the hash that proves the client has that file.
When user B runs a semantic search, results are filtered by checking the returned chunk’s file path against user B’s content proofs. If user B can’t prove they have a file (because the hash isn’t in their Merkle tree), the result is dropped. They can only see search results for code their local machine actually contains.
This gives you shared indexes with hard security guarantees, without any server-side file content inspection. The server still stores the embedding vectors and obfuscated file paths it needs to serve queries, but it never sees raw source. The access decision rides entirely on hashes the client can only produce if it actually has the file.
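The filtering step can be sketched in a few lines. This is an illustration under assumptions, not Cursor’s server code: the `SearchHit` type is hypothetical, and I’m assuming each stored chunk remembers the content hash of the file version it was embedded from so it can be compared with the client’s proof.

```python
from dataclasses import dataclass


@dataclass
class SearchHit:
    file_path: str     # (obfuscated) path of the file the chunk came from
    content_hash: str  # hash of the file version the chunk was embedded from
    score: float


def filter_by_content_proofs(
    hits: list[SearchHit],
    client_proofs: dict[str, str],  # file path -> hash from the client's Merkle tree
) -> list[SearchHit]:
    """Keep only hits whose file the client has proven it holds."""
    return [
        hit for hit in hits
        if client_proofs.get(hit.file_path) == hit.content_hash
    ]
```

A hit for a file the client never uploaded a hash for, or for a different version of the file, simply falls out of the results.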
The Trigram Index
The semantic index handles conceptual queries, but agents also need precise pattern matching. The problem is that ripgrep is fast on small projects, but on large monorepos Cursor was seeing rg invocations take more than 15 seconds, which stalls the entire agent workflow while it waits for a search result.
The foundation Cursor built on is the trigram index, the same approach described by Zobel, Moffat, and Sacks-Davis in 1993 and popularized by Russ Cox’s 2012 blog post on Google Code Search. The idea is to pre-build an inverted index from every overlapping 3-character sequence (trigram) in your codebase.
def extract_trigrams(text: str) -> set[str]:
    """Extract all overlapping 3-character sequences."""
    return {text[i:i+3] for i in range(len(text) - 2)}

def build_trigram_index(files: dict[str, str]) -> dict[str, set[str]]:
    """Build inverted index: trigram → set of file IDs containing it."""
    index: dict[str, set[str]] = {}
    for file_id, content in files.items():
        for trigram in extract_trigrams(content):
            index.setdefault(trigram, set()).add(file_id)
    return index
At query time, a regex like db\.execute\( is decomposed into its literal trigrams: db., b.e, .ex, exe, xec, ecu, cut, ute, te(. The search engine intersects the posting lists for all these trigrams to find candidate files, a tiny fraction of the codebase. Then the regex is matched against only those candidates “the old-fashioned way” to confirm actual hits.
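def extract_literals_from_regex(pattern: str) -> list[str]:
    """Conservative stand-in for a real regex parser (illustrative only).

    Handles patterns made of plain characters and backslash-escaped
    punctuation; anything else returns [] so the caller scans everything.
    """
    literal, i = "", 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern) and not pattern[i + 1].isalnum():
            literal += pattern[i + 1]  # escaped punctuation is a literal
            i += 2
        elif ch in r"\.^$*+?{}[]()|":
            return []  # metacharacter: give up rather than guess
        else:
            literal += ch
            i += 1
    return [literal] if len(literal) >= 3 else []

def regex_candidate_files(
    pattern: str,
    index: dict[str, set[str]]
) -> set[str] | None:
    """Find candidate files using trigram index before running regex.

    Returns None when the pattern has no extractable literals, signaling
    the caller to fall back to a full scan.
    """
    # Extract literal strings from the regex pattern
    # (simplified: a real implementation parses the full regex AST)
    literal_parts = extract_literals_from_regex(pattern)
    if not literal_parts:
        return None  # No trigrams extractable, must scan everything
    # Get trigrams from all literal parts
    trigrams = set()
    for literal in literal_parts:
        trigrams.update(extract_trigrams(literal))
    # Intersect posting lists: files must contain ALL trigrams
    candidate_files = None
    for trigram in trigrams:
        matching_files = index.get(trigram, set())
        if candidate_files is None:
            # Copy so the intersection below never mutates the index itself
            candidate_files = set(matching_files)
        else:
            candidate_files &= matching_files  # Intersection
    return candidate_files or set()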
The trigram index acts as a filter. If your regex has enough literal characters (most real patterns do), you narrow the candidate set from 50,000 files to a handful before the actual regex match runs. The false positive rate is low enough that confirming the candidates is cheap, so the index lookup dominates the total cost.
The classic trigram index is where Cursor started, but they quickly found that pure trigrams run into capacity trouble at monorepo scale: common trigrams produce posting lists so large that loading them is nearly as slow as scanning everything. So Cursor’s production index extends the idea in two ways. First, it uses sparse n-grams (deterministically chosen, variable-length n-grams rather than every fixed 3-character window), which keeps the index smaller and lets queries decompose into far fewer, more selective lookups. Second, it augments each posting with small bloom-filter masks encoding the character that follows a trigram and the positions where it appears, so a trigram-keyed index can be queried with quadgram-level precision and can verify that matched trigrams are actually adjacent. The mental model above is the classic trigram filter; the shipped system is that filter sharpened until the candidate set is small enough to confirm locally with a memory-mapped lookup.
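To give a feel for the next-character mask, here is a toy sketch. Everything about the encoding is assumed for illustration: the 8-bit mask, the `ord(ch) % 8` hash, and the function names are mine, and the position masks and sparse n-gram selection are left out entirely.

```python
def next_char_bit(ch: str) -> int:
    """Map a character to one of 8 mask bits (a one-hash bloom filter)."""
    return 1 << (ord(ch) % 8)


def build_masked_index(files: dict[str, str]) -> dict[str, dict[str, int]]:
    """trigram -> {file_id: mask of characters seen right after the trigram}."""
    index: dict[str, dict[str, int]] = {}
    for file_id, content in files.items():
        for i in range(len(content) - 2):
            trigram = content[i:i + 3]
            mask = next_char_bit(content[i + 3]) if i + 3 < len(content) else 0
            postings = index.setdefault(trigram, {})
            postings[file_id] = postings.get(file_id, 0) | mask
    return index


def candidates_for_literal(literal: str, index: dict[str, dict[str, int]]) -> set[str]:
    """Files that may contain `literal`, checking each trigram's follower bit."""
    if len(literal) < 3:
        raise ValueError("need at least one trigram")
    result: set[str] | None = None
    for i in range(len(literal) - 2):
        postings = index.get(literal[i:i + 3], {})
        if i + 3 < len(literal):
            # The file must have seen this trigram followed by the right character
            need = next_char_bit(literal[i + 3])
            files = {f for f, mask in postings.items() if mask & need}
        else:
            files = set(postings)
        result = files if result is None else result & files
    return result or set()
```

A file containing both `exe` and `xec` but never `exec` would survive a plain trigram intersection; with the mask it is dropped unless the character after `exe` happens to collide with `c` in the 8-bit mask, which is the usual bloom-filter trade of a few false positives for a very small index entry.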
There’s one more architectural decision that’s easy to miss but matters a lot. The semantic index lives on Cursor’s servers, but the regex index is built and queried entirely on your machine. The reason is freshness. A stale embedding still points roughly the right way in vector space, so the semantic index can lag behind your edits without much harm. A regex index can’t afford that: the moment the agent searches for code it just wrote and the index doesn’t have it yet, it spirals into a wasteful hunt. Cursor keeps the local index fresh by anchoring it to the current Git commit and layering user and agent edits on top, which makes it cheap to update on every keystroke and fast to load on startup.
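One way to picture the “commit plus edits” layering is a base index built once from the committed files, with a small overlay for files that have changed since. The class below is a hypothetical sketch of that idea using plain trigrams; the names and structure are mine, not Cursor’s, and deleted files are not handled.

```python
def trigrams(text: str) -> set[str]:
    return {text[i:i + 3] for i in range(len(text) - 2)}


class LayeredTrigramIndex:
    """Base index from the current commit plus an overlay of edited files."""

    def __init__(self, committed_files: dict[str, str]):
        # Built once per commit: trigram -> paths containing it
        self.base: dict[str, set[str]] = {}
        for path, content in committed_files.items():
            for t in trigrams(content):
                self.base.setdefault(t, set()).add(path)
        # path -> trigrams of the file's current (edited) content
        self.overlay: dict[str, set[str]] = {}

    def record_edit(self, path: str, new_content: str) -> None:
        """Cheap per-edit update: only the edited file is re-scanned."""
        self.overlay[path] = trigrams(new_content)

    def lookup(self, trigram: str) -> set[str]:
        # Edited files are answered from the overlay, never from the stale base
        clean = self.base.get(trigram, set()) - set(self.overlay)
        dirty = {p for p, ts in self.overlay.items() if trigram in ts}
        return clean | dirty
```

The point of the split is that an edit touches only the overlay entry for one file, so code the agent wrote a moment ago is searchable without rebuilding anything derived from the commit.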
Dynamic Context Discovery
One more layer is worth understanding. The agent doesn’t just dump all retrieved chunks into the context window. Cursor has moved toward what they call dynamic context discovery, with files as the unit of context management.
Instead of eagerly injecting retrieved content into the prompt, tool responses that are too large get written to temporary files. The agent is given the file path and can decide whether to read it, tail it, or grep it based on its current reasoning state. This prevents context bloat from large tool responses while ensuring no information is lost to truncation.
The same pattern applies to terminal output, MCP tool responses, and chat history. Everything the agent might need is accessible as a file. The agent retrieves what it actually needs rather than receiving everything upfront. In an A/B test on MCP tool usage, this reduced total agent tokens by 46.9%. This is a really interesting approach and I think it’s a good way to keep the context window small and the agent focused on the task at hand.
Conclusion
Here is the full indexing pipeline, end to end:
- On project open: build a Merkle tree over all files in the codebase
- Semantic index (server-side), first time: chunk with tree-sitter, embed with Cursor’s custom model, upload to a per-codebase namespace in Turbopuffer. This is the semantic index, and it lives on Cursor’s servers.
- Semantic index, on subsequent opens: compute simhash, check for similar existing indexes in the org, copy the closest match via copy_from_namespace
- Semantic index, background sync: walk the Merkle tree diff to find changed files, re-embed only those, update the namespace. Embeddings tolerate some staleness, so this happens asynchronously.
- Regex index (client-side): build the trigram / sparse n-gram index locally on your machine, seeded from the current Git commit with user and agent edits layered on top. Unlike the semantic index, it’s kept fresh on every edit, because a regex that misses the model’s own just-written code sends the agent on a wild goose chase.
- At search time: semantic queries go to Turbopuffer with content proofs for access control; regex queries hit the local index to filter candidates before the actual regex match runs
- In context: retrieved content surfaces as files that the agent reads on demand rather than as pre-injected context
What I find interesting about this architecture isn’t any single piece. Merkle trees, trigram indexes, and content-based addressing are all decades old. It’s the combination: using the cryptographic properties of Merkle trees for both change detection and access control, training the embedding model on agent session traces rather than code similarity, and the file-as-context abstraction that runs consistently through the whole system.
The next time Cursor finds the exact function you were thinking of across 50,000 files in half a second, you now know what happened under the hood.
Enjoyed this? You can find me on Twitter and LinkedIn. If you’re building anything in this space, I’d love to hear about it — guptaamanthan01[at]gmail[dot]com.