In February 2024, Klarna announced what appeared to be a triumph of artificial intelligence at enterprise scale. Their AI assistant, built in partnership with OpenAI, had handled 2.3 million customer conversations in its first month — two-thirds of all service inquiries. Resolution times dropped from eleven minutes to two. The company projected forty million dollars in profit improvement. By mid-2025, that number had grown to sixty million, with the AI performing work equivalent to 853 full-time agents. The financial markets took notice; the efficiency narrative was compelling.
Fifteen months later, Klarna's CEO Sebastian Siemiatkowski offered a different assessment. "We went too far," he admitted publicly. "Cost was a predominant evaluation factor, resulting in lower quality." Customers had reported what internal reviews confirmed: generic responses, repetitive answers, and an inability to handle nuanced problem-solving. The company began rehiring human agents.
The Klarna reversal is instructive not because artificial intelligence failed, but because of how it failed. The system performed adequately on transactional queries — straightforward requests with predictable resolutions. It degraded on precisely the interactions that matter most: complex issues requiring accumulated context, customer history, and the kind of institutional memory that distinguishes service from processing.
Klarna is not an outlier. It is an early signal of a structural limitation appearing across enterprise AI systems.
Klarna's AI could retrieve similar responses. It could not precisely recall what it needed to know.
This pattern has a name. I call it context decay: the systematic erosion of AI-generated institutional knowledge, driven by an architecture that cannot precisely recall what it has already learned.
The Retrieval Tax
To understand the scale of what enterprises are paying, consider the market trajectory. Enterprise spending on generative AI grew from $11.5 billion in 2024 to $37 billion in 2025 — a threefold increase in twelve months. Inference — the cost of actually running AI models — now accounts for 85 percent of enterprise AI budgets. Five hundred companies spend more than one million dollars annually on AI APIs alone, up from a dozen two years ago.
Here is the paradox: per-token costs have dropped by a factor of one thousand. Yet total enterprise spending more than tripled in a single year. The efficiency gains are being consumed — by volume, by architectural overhead, and by waste that current systems make invisible.
The source of that waste requires a brief explanation of how AI memory currently works. Large language models are stateless — when a session ends, they retain nothing. To give AI systems access to prior knowledge, the industry adopted retrieval-augmented generation, or RAG. When context is needed, the system queries a database and retrieves content that appears semantically similar to the current request. This retrieved content is then injected into the conversation as tokens — and billed accordingly.
The limitation is structural. Similarity retrieval returns probabilistic approximations rather than deterministic recall. The architecture creates four distinct taxes.
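The mechanics are easy to sketch. In the toy example below, three-dimensional vectors stand in for real embeddings and the 0.8 threshold is arbitrary — both are illustrative assumptions — but the behavior is the same as in production systems: everything that clears the threshold comes back, relevant or not.

```python
import math

# Toy knowledge store: each entry pairs a text chunk with a pretend embedding.
# Real systems use learned embeddings with hundreds of dimensions.
STORE = [
    ("Refund and Return Guidelines", [0.9, 0.1, 0.0]),
    ("Vendor assessment policy (EU)", [0.1, 0.8, 0.3]),
    ("Shipping FAQ", [0.7, 0.2, 0.1]),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, threshold=0.8):
    """Return every chunk whose similarity clears the threshold, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in STORE]
    return [text for score, text in sorted(scored, reverse=True) if score >= threshold]

# A query "near" the refund document retrieves it -- along with anything else
# that happens to score above the threshold, needed or not.
print(retrieve([0.85, 0.15, 0.05]))
```

Note that the query aimed at the refund policy also drags in the shipping FAQ, because its vector happens to sit nearby. That is over-retrieval in miniature.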
The re-retrieval tax
When a session ends or a context window fills, the model discards everything it learned. To continue working, previously known context must be retrieved and re-injected. The enterprise pays to teach the AI something. Then pays again to retrieve it. Then pays again when the context clears and the cycle repeats. The same institutional knowledge, tokenized and billed multiple times — not because new value is being created, but because the architecture cannot retain what it already learned.
The over-retrieval tax
RAG architectures retrieve by similarity, not precision. To increase the likelihood of capturing relevant content, the system retrieves broadly — returning chunks of text that scored above a similarity threshold, regardless of whether they are specifically needed. Industry benchmarks illustrate the inefficiency: a system retrieves ten documents "to be safe," injecting 8,055 tokens into the context window to produce a 50-token answer. Hidden system prompts add another 500 to 3,000 tokens per request. Studies suggest that context pruning and retrieval optimization can reduce token consumption by 40 to 70 percent — which implies that 40 to 70 percent of current spending retrieves context the AI does not need.
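The arithmetic behind that waste is simple to work through. The per-token price below is a hypothetical placeholder, not any vendor's published rate; the token counts are the ones cited above.

```python
# Back-of-envelope cost of over-retrieval, using the figures cited above.
# The per-token price is a hypothetical placeholder for illustration only.
PRICE_PER_1K_INPUT_TOKENS = 0.003  # USD, assumed

retrieved_tokens = 8_055      # ten documents retrieved "to be safe"
system_prompt_tokens = 1_750  # midpoint of the 500-3,000 range cited above
answer_tokens = 50            # what the user actually needed

tokens_per_request = retrieved_tokens + system_prompt_tokens
cost_per_request = tokens_per_request / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"tokens per request: {tokens_per_request}, cost: ${cost_per_request:.4f}")

# If 40-70% of retrieved context is unnecessary, the waste per million requests:
for waste_rate in (0.40, 0.70):
    wasted_tokens = retrieved_tokens * waste_rate * 1_000_000
    wasted_cost = wasted_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    print(f"{waste_rate:.0%} over-retrieval: ${wasted_cost:,.0f} per million requests")
```

The absolute numbers depend entirely on the assumed price; the ratio does not. Nearly two hundred input tokens are billed for every token of answer produced.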
$10–20 billion
Estimated annual waste from architectural inefficiency, based on $37B enterprise AI spending with 85% on inference and 40–70% over-retrieval.
The labor tax
Beyond the invoice, there is the cost measured in human attention. Every developer who re-explains project context to a coding agent that should already know it. Every analyst who re-states prior conclusions because the AI cannot recall them. Every employee who spends the first minutes of an AI interaction rebuilding context that existed in a previous session.
The industry has developed workarounds. Markdown files containing persistent context that developers manually maintain. Summarization routines that compress prior conversation into condensed form. Context "compacting" that attempts to preserve essential information when the window fills. These approaches acknowledge the problem exists. They do not solve it. They shift the burden of memory management from the architecture to the user — asking humans to compensate for what the system cannot do. And the labor cost scales with adoption.
The risk tax
There is a fourth cost that does not appear until it materializes as loss: the cost of acting on similar but incorrect context.
Similarity retrieval does not only fail by omission. It can fail by substitution. When the system retrieves content that is semantically adjacent but substantively wrong for the query at hand — a policy from a different jurisdiction, a procedure that was superseded, guidance that applies to a different product line — the AI processes it as valid input. The output is confident, well-structured, and built on the wrong foundation.
This risk is asymmetric. The retrieval error is minor — a similarity score that cleared threshold when it should not have. The consequence can be disproportionate — a compliance determination based on outdated guidance, a customer commitment that contradicts current policy, a decision rationale that references the wrong precedent.
In my work on cybersecurity due diligence for mergers and acquisitions, I observed this pattern repeatedly: minor oversights creating outsized consequences. A single unreviewed vendor relationship. A compliance gap in an acquired subsidiary. The initial failure appeared small; the downstream impact was not. The same asymmetry applies to AI memory. The retrieval failure is invisible at the moment it occurs. The cost surfaces later, often far removed from its cause, and frequently difficult to trace back to the retrieval layer that produced it.
The Architecture of Context Decay
Understanding why this tax exists requires examining how AI memory systems are built.
The models themselves are remarkable. They reason, synthesize, generate, and in many cases perform cognitive tasks that previously required human judgment. But when a session ends and a new one begins, the model retains nothing. No memory of prior conversations. No record of decisions made. No accumulated context from previous interactions.
A genius professor with amnesia.
This limitation created a genuine engineering problem. If an enterprise deploys AI to assist with complex, ongoing work — compliance analysis, customer relationships, strategic planning — the system needs access to knowledge beyond the current conversation. It needs memory.
The dominant solution is retrieval-augmented generation. The mechanism is straightforward: when an AI system needs context beyond its immediate session, it queries a vector database containing prior knowledge. The database returns content that is semantically similar to the query — text whose mathematical representation is closest to the mathematical representation of the question being asked.
This works well for certain use cases. If I ask "What is our refund policy?" and the system returns a document titled "Refund and Return Guidelines," the semantic match has done its job.
The limitation emerges when precision matters more than similarity.
Consider a regulated financial institution managing compliance documentation across multiple jurisdictions. An analyst queries the AI system: "What is our interpretation of NYDFS cybersecurity requirements for third-party vendor assessments?" The vector database returns content that is semantically similar — other cybersecurity requirements, other vendor management policies, other NYDFS regulations, perhaps guidance from different jurisdictions with overlapping terminology.
What it may fail to return is the specific internal memorandum from eighteen months ago in which the compliance team documented how the organization interprets those requirements in practice — because that document's semantic signature is not close enough to the query. The words are different. The context is institutional. The similarity score falls below threshold.
The analyst receives a confident, well-structured response built on adjacent content. The response is not wrong. It is incomplete in ways that are difficult to detect.
Similarity retrieval asks:
"What content resembles what I'm asking about?"
Precision recall asks:
"What specific knowledge do I need to answer this question accurately?"
These are not variations of the same operation. They are fundamentally different retrieval objectives: the first settles for probabilistic approximation, the second demands deterministic recall.
The gap between them is manageable when the knowledge corpus is small, when queries are transactional, and when approximate answers carry low risk. The gap becomes a liability when institutional knowledge accumulates, when context is nuanced, and when the cost of incomplete retrieval compounds over time.
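The difference can be made concrete in a few lines. In the sketch below, word overlap stands in for embedding similarity, and the keys and documents are invented for illustration. The similarity path cheerfully ranks the wrong jurisdiction's policy first; the precision path returns the exact unit or fails out loud.

```python
# Two retrieval objectives over the same knowledge, side by side.
# Keys, contents, and the word-overlap scorer are illustrative stand-ins.
KNOWLEDGE = {
    "memo/2023-06/nydfs-vendor-interp": (
        "Internal interpretation of NYDFS third-party assessment requirements",
        {"internal", "interpretation", "assessment"},  # topic words
    ),
    "policy/eu/vendor-management": (
        "EU vendor management policy",
        {"vendor", "management", "cybersecurity"},
    ),
}

def similarity_retrieve(query_words):
    """Rank entries by word overlap -- a crude stand-in for embedding similarity."""
    scored = sorted(
        KNOWLEDGE.values(),
        key=lambda entry: len(query_words & entry[1]),
        reverse=True,
    )
    return [content for content, _ in scored]  # always returns *something*

def precision_recall(key):
    """Return the exact knowledge unit, or fail explicitly."""
    if key not in KNOWLEDGE:
        raise KeyError(f"no knowledge unit {key!r}")
    return KNOWLEDGE[key][0]

# The analyst's query shares surface vocabulary with the EU policy, not with
# the memo that actually answers the question -- so similarity ranks it first.
print(similarity_retrieve({"vendor", "cybersecurity", "requirements"})[0])
print(precision_recall("memo/2023-06/nydfs-vendor-interp"))
```

The two functions never disagree about what is stored. They disagree about what "retrieve" means — and only the second one can tell you when the answer does not exist.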
Klarna's AI did not fail because the underlying language model was inadequate. It failed because the memory architecture could not precisely recall customer context when nuanced problem-solving was required. It retrieved similar responses to similar queries. It could not recall the specific history, the specific prior interactions, the specific context that distinguished one customer's situation from another's.
The architecture guaranteed this outcome.
Deterministic Architecture
The retrieval tax is not inevitable. It is a consequence of architectural choices that can be made differently.
The alternative is not incremental improvement to similarity retrieval. Adding more sophisticated ranking algorithms, expanding context windows, or fine-tuning embedding models addresses symptoms without resolving the underlying limitation. The vector database still returns probabilistic approximations. The context window still clears. The enterprise still pays the tax.
What enterprise AI requires is architectural redesign around precision recall as the foundational capability.
This means AI memory systems that maintain exact retrieval of specific knowledge — not ranked lists of semantically adjacent content. Systems where the question "What did we decide about third-party vendor assessments eighteen months ago?" returns the precise decision, the rationale that informed it, and the context in which it was made. Not documents that discuss similar topics. The specific institutional knowledge that answers the specific question.
The technical requirements follow from this design principle:
Deterministic retrieval
Queries return specific, identifiable knowledge units. The system either retrieves the exact context needed or acknowledges that it does not exist. There is no similarity threshold to tune, no ranking algorithm to optimize, no probabilistic scoring that might or might not surface what matters. The retrieval is exact or it fails explicitly.
Temporal precision
The system maintains when knowledge was created and can retrieve context as it existed at any point in time. An AI agent queried about a policy interpretation can retrieve either the current interpretation or the interpretation that was in effect during a prior audit period. The system does not conflate versions or return the most recent match when historical accuracy is required.
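One way to sketch temporal precision is an as-of lookup over versioned knowledge. The keys, dates, and contents below are hypothetical; the point is that the query names a moment in time, and the system never conflates versions or silently returns the most recent one.

```python
from bisect import bisect_right
from datetime import date

# Versioned knowledge: each key maps to (effective_date, content) pairs,
# kept sorted by date. Keys and contents are hypothetical.
VERSIONS = {
    "policy/vendor-assessment": [
        (date(2022, 1, 15), "Interpretation v1: annual assessments"),
        (date(2024, 3, 1), "Interpretation v2: continuous monitoring"),
    ],
}

def as_of(key, when):
    """Return the version in effect on `when`, or fail explicitly."""
    history = VERSIONS.get(key)
    if not history:
        raise KeyError(f"no knowledge unit {key!r}")
    dates = [d for d, _ in history]
    i = bisect_right(dates, when)  # count versions effective on or before `when`
    if i == 0:
        raise LookupError(f"{key!r} did not exist on {when}")
    return history[i - 1][1]

# An audit of FY2023 needs the interpretation in force then, not the latest one.
print(as_of("policy/vendor-assessment", date(2023, 6, 30)))
print(as_of("policy/vendor-assessment", date(2025, 6, 30)))
```

A date before the first version is an explicit error, not a best-effort match — the same exact-or-fail contract, extended along the time axis.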
Relational integrity
Knowledge maintains its connections to related context. Retrieval of a decision includes the rationale that informed it. Retrieval of a policy includes the exceptions that qualify it. Retrieval of guidance includes the regulatory framework it implements. The architecture preserves these relationships rather than fragmenting knowledge into isolated chunks that lose context when retrieved.
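A minimal sketch of relational integrity, with invented unit ids and contents: retrieving a decision follows its links, so the rationale and the exception arrive with it rather than as disconnected chunks.

```python
# Knowledge units keep explicit links; retrieval follows them instead of
# returning isolated chunks. Ids, roles, and texts are illustrative.
UNITS = {
    "decision/vendor-tiering": {
        "text": "Tier vendors by data access; reassess Tier 1 quarterly.",
        "links": {
            "rationale": "rationale/vendor-tiering",
            "exception": "exception/legacy-vendors",
        },
    },
    "rationale/vendor-tiering": {
        "text": "Based on the 2023 incident review.",
        "links": {},
    },
    "exception/legacy-vendors": {
        "text": "Legacy vendors exempt until contract renewal.",
        "links": {},
    },
}

def recall_with_context(unit_id):
    """Return a unit together with everything it is linked to."""
    unit = UNITS[unit_id]  # exact lookup: an unknown id raises KeyError
    return {
        "text": unit["text"],
        **{role: UNITS[target]["text"] for role, target in unit["links"].items()},
    }

result = recall_with_context("decision/vendor-tiering")
print(result)  # the decision, its rationale, and its exception, together
```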
Compliance-grade auditability
Every retrieval is logged, traceable, and demonstrable for regulatory purposes. The enterprise can answer not only "What did the AI tell the analyst?" but "What knowledge did the AI retrieve to generate that response, and why?" In regulated industries, this is not optional. SEC Rule 17a-4, NYDFS 23 NYCRR 500, GDPR Article 32 — these frameworks require that organizations demonstrate how decisions were informed and how records were maintained.
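Auditability can be sketched as a thin wrapper around retrieval that records every lookup, successful or not. The log fields below are illustrative; a production system would write to append-only, tamper-evident storage rather than an in-memory list.

```python
import json
import time

AUDIT_LOG = []  # stand-in for durable, append-only audit storage

def audited_recall(store, key, actor):
    """Wrap exact retrieval so every lookup leaves a traceable record."""
    found = key in store
    AUDIT_LOG.append({
        "ts": time.time(),   # when the retrieval happened
        "actor": actor,      # who (or which agent) asked
        "key": key,          # exactly what was requested
        "found": found,      # whether the knowledge existed
    })
    if not found:
        raise KeyError(key)
    return store[key]

store = {"guidance/nydfs-500": "Current internal guidance text"}
audited_recall(store, "guidance/nydfs-500", actor="analyst-42")

# The enterprise can now answer "what knowledge informed that response?"
print(json.dumps(AUDIT_LOG[-1], default=str))
```

Failed lookups are logged too: for audit purposes, "the knowledge did not exist" is as important a record as "the knowledge was retrieved."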
These are not features layered onto existing RAG systems. They require building the memory layer around different foundational assumptions — assumptions that prioritize precision over similarity, determinism over probability, and institutional accountability over retrieval convenience.
The cost of this redesign is real. It requires rethinking how knowledge is stored, indexed, and retrieved. It requires infrastructure that most AI deployments have not built.
The cost of not redesigning is also real. It appears in the retrieval tax — paid continuously, compounding silently, and creating risk exposure that surfaces only when the wrong context produces the wrong outcome.
Background
The pattern I describe in this paper — asymmetric risk from architectural failure — is one I have observed across two decades of work at the intersection of technology, security, and enterprise decision-making.
My career began at IBM, where I first encountered the challenge of translating technical architecture into business risk. I have since advised major financial institutions — including Citigroup, JPMorgan Chase, and Bank of America — on cybersecurity strategy and risk management. That work, combined with direct M&A cybersecurity experience, became the foundation for my book, Mergers & Acquisitions Cybersecurity: The Framework for Maximizing Value, which formalized the asymmetric risk framework I now see replicated in AI memory systems.
I also served as Head of Information Security, Compliance, and Privacy at an AI company, where I led the security program through enterprise approvals including Blackstone, Goldman Sachs, and Vanguard. That experience — building AI infrastructure that satisfied the most demanding institutional scrutiny — clarified what enterprise-grade AI requires and what current architectures fail to provide.
I founded SolonAI to address the gap this paper describes. GrantAi, our memory and recall infrastructure, is built around precision recall rather than similarity retrieval. The argument presented here is not neutral — I have a position. But the underlying problem exists independently of my approach to solving it. Klarna's experience, the growing cost of inference in enterprise AI deployments, and the operational risks created by probabilistic memory systems are observable patterns across the industry.
The architecture has amnesia. The enterprise is paying. The question is whether that continues.
Enterprise AI is at an inflection point. The capabilities are extraordinary — reasoning, synthesis, generation at a scale and speed that was unimaginable a decade ago. But those capabilities are constrained by memory infrastructure that was designed for a different problem.
Similarity retrieval was a reasonable solution when AI needed to search documents. It is not sufficient when AI needs to remember.
The distinction matters because the cost compounds. Every cleared context window. Every redundant retrieval. Every hour spent re-stating what the system should already know. Every confident response built on incomplete context. The tax is paid daily, distributed across workflows, absorbed into operational friction, and surfacing as risk only when the wrong context produces the wrong outcome.
Klarna spent fifteen months discovering what the architecture guaranteed from the start. The question for every enterprise deploying AI at scale is whether to learn the same lesson the same way — or to recognize that memory infrastructure is the next layer that enterprise AI requires.
Your AI has amnesia. You are paying for it. The architecture is to blame.
The architecture can also be changed.
Lawrence Grant
Lawrence Grant is the founder of SolonAI, the company behind GrantAi, a memory and recall infrastructure for AI, and author of "Mergers & Acquisitions Cybersecurity: The Framework for Maximizing Value". His background spans IBM and advisory roles at major financial institutions, including Citigroup, JPMorgan Chase, and Bank of America. He also served as Head of Information Security, Compliance, and Privacy at an AI company, where he led the security program through enterprise approvals including Blackstone, Goldman Sachs, and Vanguard. He holds a Master's degree from Harvard University, where he received the Dean's Award for Academic Excellence, a Private Equity certification from The Wharton School, and Venture Capital credentials from Columbia University.