LLM Context Windows and Memory: How Models Handle Extended Dialogues
Context window
Imagine you're telling a story to a friend, but your friend can only remember the last 10 sentences at any given moment. If you talk too much and provide too much information, the initial part of your story disappears from their memory. That's what happens with LLMs — they have a "context window," which is like a box that can only hold a certain number of words or tokens.
Example:
If the model's box holds 4K tokens, it's like saying:
"I can remember everything you said… as long as it fits in this box. If you bring more than the box can hold, then sorry, I'll forget what you said earlier!"
The limited context window problem
LLMs process text in chunks called tokens (pieces of words). The context window is the maximum number of tokens the model can process at once.
Example: If an LLM has a 2K-token limit, it can only consider the most recent 2K tokens of the conversation. Anything earlier falls out of context.
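Here is a minimal sketch of that cut-off. It assumes a crude whitespace split stands in for a real tokenizer (real tokenizers split text into sub-word pieces, so actual counts will differ), and the names `count_tokens` and `fit_to_window` are made up for illustration. The only point is that the oldest messages are the first to go.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    # Walk backwards from the newest message and stop at the first one that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "My name is Priya and I live in Pune.",        # 9 words
    "I am planning a trip to Japan next spring.",  # 9 words
    "What should I pack for the trip?",            # 7 words
]

# With a 16-token budget only the two newest messages fit; the name is dropped.
window = fit_to_window(history, max_tokens=16)
print(window)
```

In a real chat application this kind of trimming typically happens in the code that builds the prompt, before the model is called.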
Why it's a problem
- Cost and speed: increasing the window size makes computation expensive because attention mechanisms scale roughly quadratically with the number of tokens.
- Loss of coherence: in long conversations or documents, the model forgets earlier details (e.g., character names, instructions).
- Truncation: if you feed a 100-page document to a model whose window holds only a few pages, the rest is cut off.
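The quadratic scaling in the first bullet is easy to check with a little arithmetic. The sketch below only counts the pairwise scores in one full attention matrix. It ignores everything else that contributes to real cost (number of layers and heads, hardware, caching), so treat it as an illustration of the trend rather than a cost model. The helper name `attention_pairs` is hypothetical.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key scores in one full self-attention matrix."""
    return num_tokens * num_tokens


# Doubling the number of tokens quadruples the number of scores.
for n in (2_000, 4_000, 8_000):
    print(f"{n:>6} tokens -> {attention_pairs(n):>12,} pairwise scores")
```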
Context windows of different LLMs
| Model | Context window |
|---|---|
| Claude 3.5 Sonnet / Claude 3 Opus | ~200K tokens |
| GPT-4.1 (API) | up to ~1M tokens |
| Gemini 1.5 Pro | up to 1M tokens |
(Numbers move fast — check each provider's current docs before you commit architecture decisions to them.)
How the latest LLMs handle context better
Since no LLM can have an unlimited context window, what matters is fitting the relevant context into the window without losing too much information. Here are several strategies for working around the limit.
Bigger context windows
Models like GPT-5.2 (~400K tokens) and Claude 4.5 (200K tokens) massively enlarge the memory box.
Trade-off: larger windows = more compute and cost, but better for long documents. Also worth noting: long-context retrieval accuracy tends to degrade in the middle of the window ("lost in the middle"), so a 1M-token window isn't a substitute for good chunking.
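To make "chunking" concrete, here is a minimal sketch that splits a document into overlapping chunks. It assumes words stand in for tokens, and the function name `chunk_words` and its parameters are made up for illustration. Production systems often split on sentence or section boundaries instead.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

The overlap is there so that a sentence falling on a chunk boundary still appears whole in at least one chunk.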
Retrieval-Augmented Generation (RAG)
Rather than stuffing information into the LLM during training, RAG fetches relevant chunks from an external database at query time and adds them to the prompt.
Think of it as the model saying: "I don't need to memorise the whole Wikipedia — I'll just look up the right page when needed."
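The "look up the right page" step can be sketched in a few lines. This is a toy under loud assumptions: word-count vectors stand in for learned embeddings, a Python list stands in for the external database, and `bag_of_words`, `retrieve`, and `build_prompt` are hypothetical names rather than any library's API.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the `top_k` chunks most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, bag_of_words(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Put only the retrieved chunks in front of the question."""
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}"


chunks = [
    "The context window is the maximum number of tokens a model can process at once.",
    "FlashAttention reorders the attention computation to make better use of GPU SRAM.",
    "MemGPT pages context in and out of an external memory store.",
]
prompt = build_prompt("What is a context window?", chunks)
print(prompt)
```

Only the best-matching chunk reaches the prompt, so the window is spent on text that is relevant to the question rather than on the whole knowledge base.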
Persistent memory
Experimental approaches store long-term context outside the model, like writing in a notebook. The model can refer back to this memory when needed. Projects like MemGPT formalise this as an OS-style memory hierarchy, with the model paging context in and out.
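The notebook idea can be sketched as a two-tier store: a small buffer that represents what is in the context window, plus an archive outside it. This is a toy illustration only. The class `NotebookMemory` and its methods are hypothetical and are not MemGPT's actual interface, and the keyword search stands in for whatever retrieval a real system would use.

```python
from collections import deque


class NotebookMemory:
    """Toy two-tier memory: a small in-context buffer plus an external archive."""

    def __init__(self, context_slots: int) -> None:
        self.context_slots = context_slots
        self.context: deque[str] = deque()  # what the model can "see" right now
        self.archive: list[str] = []        # the notebook kept outside the model

    def remember(self, fact: str) -> None:
        """Add a fact to the context, paging the oldest facts out when the buffer is full."""
        self.context.append(fact)
        while len(self.context) > self.context_slots:
            self.archive.append(self.context.popleft())

    def recall(self, keyword: str) -> list[str]:
        """Return archived facts that mention the keyword (case-insensitive)."""
        needle = keyword.lower()
        return [fact for fact in self.archive if needle in fact.lower()]


memory = NotebookMemory(context_slots=2)
for fact in [
    "User's name is Priya.",
    "User prefers metric units.",
    "User is planning a trip to Japan.",
]:
    memory.remember(fact)

# The oldest fact no longer fits in the context, but it can still be looked up.
print(list(memory.context))
print(memory.recall("name"))
```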
Efficient attention mechanisms
Techniques like FlashAttention and sparse attention reduce the cost of handling large windows, in two different ways. FlashAttention still computes exact attention over every pair of tokens, but it reorders the computation into tiles so that more of it stays in fast GPU SRAM and less time is spent moving data to and from slower memory. Sparse attention takes a shortcut instead: each token attends to a carefully chosen subset of tokens rather than all of them.
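The reordering idea can be shown on a toy scale. The sketch below computes the attention output for a single query in two ways: all at once, and tile by tile while keeping a running maximum and running sums (often called an online softmax). It is written in the spirit of FlashAttention's tiling, not as its implementation: the real kernel works on matrix tiles held in GPU SRAM, while this uses plain Python lists and scalar values. The function names are hypothetical.

```python
import math


def attention_full(scores: list[float], values: list[float]) -> float:
    """Softmax-weighted average of values, computed with all scores in hand."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_tiled(scores: list[float], values: list[float], tile: int) -> float:
    """Same result, but only one tile of scores is looked at per step."""
    running_max = -math.inf
    running_sum = 0.0  # softmax denominator seen so far
    running_out = 0.0  # softmax numerator (weighted values) seen so far
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale earlier partial results to the new maximum (this is 0.0 on the first tile).
        scale = math.exp(running_max - new_max)
        tile_weights = [math.exp(s - new_max) for s in s_tile]
        running_sum = running_sum * scale + sum(tile_weights)
        running_out = running_out * scale + sum(w * v for w, v in zip(tile_weights, v_tile))
        running_max = new_max
    return running_out / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(attention_full(scores, values), attention_tiled(scores, values, tile=2))
```

Both functions agree up to floating-point rounding, which is the point: the tiled version changes the order of the work, not the answer.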
Conclusion
Large Language Models are powerful, but their ability to "remember" is limited by the size of their context window — like a friend who can only recall the last few pages of a book. This constraint leads to challenges in maintaining coherence during long conversations or processing large documents. Newer models tackle this by:
- Expanding context windows (to hundreds of thousands of tokens, and in some cases around a million).
- Using smarter attention mechanisms to reduce computational cost.
- Employing techniques like summarisation, retrieval-augmented generation (RAG), and persistent memory to keep important information accessible without overwhelming the model.
As research advances and newer architectures emerge, LLMs should be able to handle massive contexts more efficiently, blending short-term and long-term memory and behaving more like a well-organised librarian than a forgetful friend.
Further reading
- Fine-tuning vs Retrieval-Augmented Generation (RAG) — when to reach for each.
- Build LLM Vocab, Tokens, Embeddings, and Context — the fundamentals behind tokenisation and context.
- Building RAG Systems in Production — a hands-on walkthrough on CipherMind.
- Fine-Tune an LLM on Your MacBook with LoRA — a local alternative when RAG alone isn't enough.