LLM Context Windows and Memory: How Models Handle Extended Dialogues

Context window

Imagine you're telling a story to a friend who can only remember the last 10 sentences at any given moment. If you keep talking, the start of your story drops out of their memory. That's what happens with LLMs: they have a "context window," which is like a box that can only hold a certain number of words or tokens.

Example:

If the model's box holds 4K tokens, it's like saying:

"I can remember everything you said… as long as it fits in this box. If you bring more than the box can hold then sorry, I'll forget what you said earlier!"

The limited context window problem

LLMs process text in chunks called tokens (pieces of words). The context window is the maximum number of tokens the model can process at once.

Example: If an LLM has a 2K token limit, it can only consider the most recent 2K tokens of the conversation. Anything earlier is out of context and invisible to the model.
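To make the "last N tokens" idea concrete, here is a minimal sketch. It assumes whitespace splitting as a stand-in for a real tokenizer, and the function name fit_to_window is made up for illustration; real chat applications differ in how they trim history.

```python
def fit_to_window(tokens, max_tokens):
    """Keep only the most recent max_tokens tokens; older ones are dropped."""
    if max_tokens <= 0:
        return []
    return tokens[-max_tokens:]


# Assumption: splitting on whitespace stands in for a real tokenizer.
conversation = "my name is Ada and I would like help planning a trip to Lisbon".split()

visible = fit_to_window(conversation, max_tokens=6)
print(visible)           # the six most recent tokens
print("Ada" in visible)  # the name from the start of the conversation is gone
```

With a window of 6 tokens, the name given at the start is no longer visible to the model, which is exactly the forgetting described above.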

Why it's a problem

Cost and speed: increasing the window size makes computation expensive because attention mechanisms scale roughly quadratically with the number of tokens.
Loss of coherence: in long conversations or documents, the model forgets earlier details (e.g., character names, instructions).
Truncation: if you feed in a 100-page document that is larger than the window, only the part that fits is seen, and the rest is cut off.
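The "roughly quadratic" cost can be checked with simple arithmetic. This sketch counts the pairwise scores that full self-attention computes for a single head in a single layer; the function name is illustrative, and real costs also depend on model size, hardware, and whether causal masking (which roughly halves the count) is used.

```python
def attention_pairs(num_tokens):
    """Score entries in full self-attention: every token against every token."""
    return num_tokens * num_tokens


for n in (2_000, 4_000, 200_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):,} pairwise scores")

# Doubling the window quadruples the work.
print(attention_pairs(4_000) // attention_pairs(2_000))
```

Going from 2K to 4K tokens doubles the input but quadruples the number of scores, which is why bigger windows get expensive quickly.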

Context windows of different LLMs

Model	Context window
Claude 3.5 Sonnet / Claude 3 Opus	~200K tokens
GPT-4.1 (API)	up to ~1M tokens
Gemini 1.5 Pro	up to 1M tokens

(Numbers move fast — check each provider's current docs before you commit architecture decisions to them.)

How the latest LLMs handle context better

Since no LLM can have an unlimited context window, what matters is making the window large enough for the task and using it well, so that important information is not lost. Here are several strategies for working around the limit.

Bigger context windows

Models like GPT-5.2 (~400K tokens) and Claude 4.5 (200K tokens) offer a far bigger memory box than the 2K and 4K windows used in the examples above.

Trade-off: larger windows = more compute and cost, but better for long documents. Also worth noting: long-context retrieval accuracy tends to degrade in the middle of the window ("lost in the middle"), so a 1M-token window isn't a substitute for good chunking.
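Chunking, mentioned above, can be as simple as slicing a token list into overlapping pieces. The sketch below is one possible approach, not a standard recipe: the function name, the chunk size, and the overlap are all assumptions chosen for illustration.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split tokens into chunks of chunk_size that share `overlap` tokens with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Assumption: integers stand in for real tokens.
document = list(range(10))
for chunk in chunk_tokens(document, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the price of some duplicated tokens.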

Retrieval-Augmented Generation (RAG)

Rather than baking all knowledge into the LLM during training, or stuffing everything into the prompt, RAG fetches relevant chunks from an external database at query time and adds only those to the context.

Think of it as the model saying: "I don't need to memorise the whole Wikipedia — I'll just look up the right page when needed."
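Here is a toy version of the "look up the right page" step. It is a sketch under heavy assumptions: word-count vectors and cosine similarity stand in for the learned embeddings and vector database a production RAG system would typically use, and all function names are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Very rough text representation: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Put only the retrieved text into the prompt, not the whole collection."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "The Eiffel Tower is in Paris",
    "Mount Fuji is the tallest mountain in Japan",
    "Python is a programming language",
]
# Punctuation is left out because this toy tokenizer does not strip it.
print(build_prompt("Where is the Eiffel Tower", docs))
```

Only the best-matching document ends up in the prompt, so the context window is spent on text that is likely to matter for the question.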

Persistent memory

Experimental approaches store long-term context outside the model, like writing in a notebook. The model can refer back to this memory when needed. Projects like MemGPT formalise this as an OS-style memory hierarchy, with the model paging context in and out.
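The notebook idea can be sketched with a small key-value store saved to disk. This is an illustration only and not MemGPT's design or API: the Notebook class, its methods, and the JSON file format are all assumptions.

```python
import json
import os
import tempfile


class Notebook:
    """Hypothetical long-term memory: facts live in a JSON file, outside the model."""

    def __init__(self, path):
        self.path = path
        self.facts = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.facts = json.load(f)

    def remember(self, key, value):
        """Store a fact and write the notebook to disk straight away."""
        self.facts[key] = value
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.facts, f)

    def recall(self, key, default=None):
        """Look a fact up; the result would be pasted into the prompt when needed."""
        return self.facts.get(key, default)


path = os.path.join(tempfile.mkdtemp(), "notebook.json")

Notebook(path).remember("user_name", "Ada")  # first session writes a fact

later = Notebook(path)                       # a later session starts fresh
print(later.recall("user_name"))
```

The second Notebook object knows nothing except what is on disk, which mirrors a new conversation that has an empty context window but can still read the notebook.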

Efficient attention mechanisms

Techniques like FlashAttention and sparse attention reduce the cost of handling large windows in two different ways. FlashAttention still compares every token with every other token, but it reorders the computation into blocks that stay in fast on-chip GPU SRAM, so the full attention matrix never has to be written out to slower memory. Sparse attention skips work instead: each token attends to a carefully chosen subset of tokens rather than all of them.
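The reordering trick rests on the fact that softmax can be computed one block at a time with a running maximum and a running sum. The sketch below shows that idea in plain Python for a single query with scalar values; it is not FlashAttention itself, which runs as a fused GPU kernel on matrices, and the function names are illustrative. It assumes a non-empty list of scores.

```python
import math


def attention_full(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_blockwise(scores, values, block_size):
    """Same result, visiting one block at a time with running accumulators."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    running_out = 0.0  # sum of exp(score - running_max) * value seen so far

    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]

        new_max = max(running_max, max(block_scores))
        # Rescale what has been accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale

        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max

    return running_out / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(attention_full(scores, values))
print(attention_blockwise(scores, values, block_size=2))
```

Both functions give the same answer up to floating-point rounding, but the blockwise version never needs all the weights in memory at once, which is what lets the real kernel keep its working set in fast memory.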

Conclusion

Large Language Models are powerful, but their ability to "remember" is limited by the size of their context window — like a friend who can only recall the last few pages of a book. This constraint leads to challenges in maintaining coherence during long conversations or processing large documents. Newer models tackle this by:

Expanding context windows (to hundreds of thousands of tokens, and up to about a million in some models).
Using smarter attention mechanisms to reduce computational cost.
Employing techniques like summarisation, retrieval-augmented generation (RAG), and persistent memory to keep important information accessible without overwhelming the model.

As research advances and newer architectures emerge, LLMs are getting better at handling massive contexts efficiently by blending short-term and long-term memory, making them more like a well-organised librarian than a forgetful friend.