Reverse Engineering ChatGPT, Claude, OpenClaw, and Hermes Convinced Me Most AI Products Shouldn't Ship Memory
The first time I asked ChatGPT what it remembered about me, it listed 33 facts. Name, career goals, fitness routine, names of side projects I had mentioned weeks earlier, a throwaway line about my sabbatical from a completely unrelated chat. I was impressed. I spent the next few weeks reverse-engineering how it actually works, then did the same for Claude, OpenClaw, and Hermes. Somewhere around the third system, I stopped noticing how clever the designs were and started noticing how often memory was degrading my own outputs.
Explicit state beats implicit memory, surprisingly often. Memory is not a default feature, it is a product and systems tax, and most AI products have not earned the right to pay it.
What I Saw Inside Four Memory Systems
I want to start with the synthesis, because it is the part I could not have written without doing the reverse-engineering work.
Each of the four systems I mapped takes a fundamentally different approach:
- ChatGPT uses an injected profile. A long-term fact store (33 facts, in my case), plus pre-computed summaries of recent chats, plus session metadata. All glued into every prompt. Just a curated block that rides along on every turn.
- Claude uses on-demand retrieval. A small
<userMemories>block is always present, but past conversations are not injected by default. The model can invokeconversation_searchorrecent_chatsas tools when it decides context is relevant. - OpenClaw uses a Markdown workspace. Everything is plain files on disk:
MEMORY.mdfor durable knowledge,memory/YYYY-MM-DD.mdfor daily logs, indexed for hybrid semantic + keyword search. The agent searches its own notes on demand. - Hermes uses a hot/cold split. A tiny frozen prompt memory —
MEMORY.mdcapped at 2,200 characters andUSER.mdat 1,375 characters, about 1,300 tokens combined, plus a SQLite-backedsession_searchfor episodic recall, plus a skills system for procedural memory, plus an optional user-modeling layer.
Four systems, four different answers. Put them next to each other and a pattern becomes obvious. The simplest approach, injected profile, is also the one with the worst failure mode, every old fact competes for attention on every future prompt, and you have no principled story for when to stop. It is also the one most product teams copy, because it is the easiest to demo and the easiest to ship. That is how most AI products ended up with memory that degrades their own outputs.
The most deliberate of the four is Hermes, and the design choice holding it together is a single sentence from the source: keep the prompt stable for caching, and push everything else to tools. Memory stops being ambient and becomes a choice the model has to make. That is the direction I think most teams should copy, and it is the opposite of what the easy path gives you.
Storage is Easy, Retrieval Policy is Hard.
Once I had looked inside four of these systems, one observation would not go away. The storage side is actually not the hard part. Storing facts is easy. You can do it with a JSON blob, a Markdown file, a SQLite table, or a vector index. The interesting part is the part that decides whether memory helps or hurts the output, the retrieval policy, the heuristic that decides which remembered thing gets pulled into which future prompt.
Walk through the four systems through that lens and the picture gets clearer:
- ChatGPT’s retrieval policy is “always inject.” Every stored fact rides along on every prompt. Cheap, fast, and the reason old context keeps shaping new answers whether it is relevant or not.
- Claude’s retrieval policy is “model decides.” The model has to recognize when past conversations matter and call a tool. Cleaner prompts when it works, but dependent on the model getting the “do I need to search?” call right.
- OpenClaw’s retrieval policy is “agent issues semantic + keyword search.” Better than always-inject, but the more notes you accumulate, the harder the search has to work, and the more likely you are to pull stale or redundant material.
- Hermes’s retrieval policy is tiered and explicit. A tiny hot set for durable facts, a separate cold store for episodic history, a separate skills index for procedural knowledge, and clear rules about what belongs where. (“Save user preferences, environment facts, recurring corrections, stable conventions. Do not save task progress, session outcomes, temporary TODO state.”)
Every failure I saw, and most of the failures the rest of this post describes, comes from a weak retrieval policy and not a weak storage layer. ChatGPT’s failure is architectural: the policy is “always inject,” which is why stale context keeps bleeding through. A better storage scheme would not fix that. Only a better retrieval policy would.
This matters because most teams shipping memory are spending their complexity budget on the wrong half. They compare vector stores, they design embedding pipelines, they debate chunk sizes. The retrieval policy gets one sentence in the design doc: “we’ll retrieve the top-k relevant items and inject them.” That one sentence is where the product quality lives. Most teams are flying blind on it.
The Best Case for Memory
I want to take the opposite view seriously before arguing against it.
The strongest case for default memory is friction reduction. Not having to re-enter preferences every session is genuinely nice, especially for casual users. “Remember I’m vegetarian” should not need to be said twice.
The next strongest case is continuity for inherently longitudinal products. Meeting tools like Granola, personal knowledge products like Reflect and Mem.ai, relationship products like Replika, therapy companions. For these, memory is not a feature, it is the product.
The third is the retention wedge. Mike Taylor’s piece notes that the “it knows me so well” feeling is exactly what locks ChatGPT users in. That is real. Users do not switch to Gemini or Claude partly because they do not want to rebuild the profile. Memory makes your product stickier whether or not it makes the outputs better.
The fourth is low-stakes drift. For casual tasks like recipe ideas, travel suggestions, chit-chat being slightly wrong because of stale memory does not really hurt the user.
Each of these has a counter. Friction reduction does not require implicit memory; a settings panel does it without the tax. The longitudinal case is exactly the one this post concedes for those products, ship memory. The retention wedge is a business case, not a quality case; you are trading output quality for stickiness, which is a legitimate choice but should be made consciously. And low-stakes drift assumes your product only serves casual tasks, which is rarely true the same ChatGPT user doing recipe lookups is also doing code reviews and performance reviews and therapy-adjacent venting, and stale memory does not know which of those it is in.
The strongest version of the pro-memory argument is real. It is just much narrower than the scope most products ship memory at.
Where Memory Goes Wrong in Practice
Once you ship memory with a weak retrieval policy, it fails in predictable, documented ways.
Output quality degrades. Mike Taylor’s Why I Turned Off ChatGPT’s Memory is the best user-side documentation of this. He put a Kanye quote about “dopeness” into his custom instructions, and ChatGPT started claiming it had built a collapsible website section “as dope as possible”, applying the same quote to interior decor, marketing plans, and Python debugging. When he turned memory back on to write the piece, a request for barbecue rib advice came back as “Hoboken Dinner Upgrade Ideas” because the assistant knew he had just moved. OP-Bench shows the pattern at benchmark scale: memory-augmented agents retrieve user details even when unnecessary, then over-attend to them until the details overshadow the actual query.
Debugging gets dramatically harder. I have spent enough time inside agent codebases to know the signature of a memory bug: the live trace is clean. Prompt, retrieval, tool logs are all fine. The weird behavior still happens. The reason is architectural.
REQUEST PIPELINE (what your logs see)
─────────────────────────────────────────────────
user msg → system prompt → retrieval → tool calls → response
▲
│ logs end here
MEMORY PIPELINE (what your logs do NOT see)
─────────────────────────────────────────────────
session end → summarizer → memory store
│
└──→ next session's system prompt
The thing shaping tomorrow's answer lives in a
pipeline you are not logging.
The Unit 42 writeup on Amazon Bedrock Agents describes exactly this shape: memory is produced by a separate session summarization process that runs at session end and merges into the next session’s system prompt. Every memory system I reverse-engineered has some version of this split. With memory, you are not debugging a request, you are debugging a relationship, and the relationship is logged somewhere you are not looking.
Context rot. The context window is not free intelligence, it is scarce working memory. Chroma’s Context Rot research evaluates 18 leading models on deliberately controlled tasks, holding task difficulty constant and varying only input length. Performance degrades with input length across every model they test, and distractors hurt more as context grows. A Databricks study shows accuracy dropping well before the window is full, sometimes as early as 32k tokens. A Microsoft/Salesforce paper shows splitting a prompt into a multi-turn conversation instead of one shot drops performance by 39% on average. Most memory systems are context inflation mechanisms in disguise.
Privacy gets weird fast. I built BYOM because, once you look at a memory system clearly, it stops being a convenience feature and becomes a persistent user-profiling system. The CIMemories paper calls out the specific failure mode: memory-augmented LLMs often pick the right domain to talk about but cannot tell which details inside that domain are relevant. Right domain (life logistics), wrong granularity (a therapy schedule bleeding into a work email draft). Personalization and contextual integrity are not the same thing.
A brand new attack surface. The Unit 42 proof-of-concept against Bedrock Agents shows what “poisoned memory” actually looks like. An attacker hides a prompt injection in a webpage. The victim asks their travel agent to read the URL. Nothing goes wrong in the live session, the payload is crafted to target the session summarization prompt, not the orchestration prompt. When the session closes, the summarizer writes the attacker’s instructions into memory as a normal-looking topic. Days later, the user returns, books a trip, and the agent exfiltrates the booking to an attacker-controlled domain by calling its own scrape_url tool. Prompt injection against a stateless chat is transient. Prompt injection into memory is persistent. The MINJA paper shows the attacker does not even always need access to the memory store, user-style interaction alone can land the payload.
Personality drift. PersistBench reports median failure rates of 53% on cross-domain leakage and 97% on sycophancy samples in long-term-memory systems. The sycophancy number is the scary one. If what the assistant remembers about you nudges it toward agreement and accommodation instead of honest judgment, you do not have a memory problem, you have a judgment problem. And it will still feel personal to the user.
The Hermes Pattern: Memory as a Tool, Not Ambient Context
If the previous section is the symptom list, Hermes is the design that treats the disease.
Three principles hold it together. First, it separates hot memory from cold recall. A tiny always-injected block for durable facts, a searchable cold store for episodic history, a skills index for procedural memory. Nothing in the “always-injected” tier is allowed to grow unbounded, because prompt memory is cache-sensitive working set, not a diary.
Second, it treats prompt stability as a first-class constraint. Memory is frozen into a snapshot at session start and not mutated mid-session. Writes go to disk immediately, but the prompt stays stable until a natural rebuild point (new session, post-compression). Every agent system ships memory without thinking about caching. Hermes does.
Third, it acknowledges that memory is plural. Facts, episodes, skills, and deeper user modeling are distinct retrieval problems with distinct policies. One store does not solve them all.
This is what memory done right looks like in practice. Not a bigger vector DB. Not smarter auto-promotion. Fewer things in the system prompt, more things in tools, explicit rules about what belongs where. Most AI products that ship memory could cut their memory surface by 80% and end up with better outputs.
Before You Ship Memory, Answer These
The synthesis above turns into a checklist. Before shipping memory in your product, answer these honestly:
- Is your product inherently longitudinal? Do users get less value from session one than from session ten? If no, you do not need memory.
- Can you draw a clear line between your storage system and your retrieval policy? If no, you are about to ship the ChatGPT failure mode and call it personalization.
- Will the stored state be visible to users and directly editable by them? If no, you are building implicit profiling, not memory.
- Can you scope recall to a specific task, project, or explicit tool invocation? If no, ambient memory will bleed across contexts.
- Is your team willing to own the privacy, security, and debugging tax for the next three years? If no, you are not shipping memory, you are shipping a liability.
If the honest answer to most of these is no, do not ship memory. Ship visible settings, scoped project state, and explicit task briefs instead. Cursor’s .cursorrules and AGENTS.md, Claude Projects, Zed’s .rules, ChatGPT Custom Instructions, Linear task context, all of these work because they are legible, editable, and scoped. None of them need a memory layer to do their job.
Conclusion
Memory sounds like intelligence because humans associate memory with understanding. Product memory is not human memory. It is stored context with retrieval rules, summarization errors, privacy trade-offs, security exposure, and a constant tendency to turn old signals into future bias. That does not make it useless. It makes it expensive.
If your AI product still struggles with basic workflow design, explicit settings, clean state management, and reliable task execution, adding memory will not make it smarter. It will make it harder to understand when it fails, harder to debug when it drifts, and harder to trust when it confidently carries the wrong things forward.
The truth is that most AI products do not need better memory. They need better product design.
References
- OpenAI Memory FAQ
- Why I Turned Off ChatGPT’s Memory - Mike Taylor, Every
- Context Rot: How Increasing Input Tokens Impacts LLM Performance - Chroma
- Long Context RAG Performance of LLMs - Databricks
- When AI Remembers Too Much: Persistent Behaviors in Agents’ Memory - Unit 42
- CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs
- PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?
- OP-Bench: Benchmarking Over-Personalization in Memory-Augmented Conversational Agents
- MINJA: Memory Injection Attacks on LLM Agents via Query-Only Interaction
If you found this interesting, I’d love to hear your thoughts. Share it on Twitter, LinkedIn, or reach out at guptaamanthan01[at]gmail[dot]com.