Build LLM Vocab: Tokens, Embeddings, Vocabulary Size, and Context Windows Explained

Anil K · 7 min read
Close-up of colourful Lego bricks, used as a metaphor for LLM tokens

In the world of Large Language Models (LLMs), raw text is never used as-is: it has to be carefully transformed into a form that machines can understand and manipulate. This transformation is at the heart of how LLMs operate, and it relies on a few key concepts: tokens, tokenisation, embeddings, vocabulary size, and the context window. Together, these elements form the bridge between human language and machine computation.

Tokens: the Lego bricks of machine language

Think of tokens as the tiny Lego bricks that LLMs use to build up meaning. Without them, the machine is basically staring into the void.

Tokens are the atoms of machine language. They're the base unit, the "stuff" that everything else relies on. You can't hand "oh, this article looks interesting" to an LLM until it has been broken down into the right tokens, because tokens are what connect our lovely, chaotic human language to the machine's world of numbers. Without tokens, an LLM is like you trying to read a language you've never learned.

Example: "This article looks interesting for LLM"

Depending on the tokenizer (each model has its own style), it might split like this:

["This", "article", "looks", "interesting", "for", "LL", "M"]

Funny, right? Notice the twist: a tokenizer only keeps an acronym whole if that exact string earned a place in its vocabulary during training. Otherwise, "LLM" can come out as "LL" + "M". You can see exactly how GPT-style models split your text using the OpenAI tokenizer tool.
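
To make the split above concrete, here is a minimal sketch of a toy tokenizer. It assumes a tiny hand-written vocabulary (TOY_VOCAB is hypothetical, not taken from any real model) and a simple "take the longest match" rule. Real GPT-style tokenizers apply learned BPE merges instead, and they usually attach the leading space to the token, which is why the toy entries start with a space.

```python
# Hypothetical vocabulary, invented for this illustration only.
# Note there is no entry for " LLM", so the acronym must be split.
TOY_VOCAB = ["This", " article", " looks", " interesting", " for", " LL", "M", " "]


def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Pick the longest entry matching at this position;
        # fall back to a single character if nothing matches.
        match = max(
            (entry for entry in vocab if text.startswith(entry, pos)),
            key=len,
            default=text[pos],
        )
        tokens.append(match)
        pos += len(match)
    return tokens


tokens = greedy_tokenize("This article looks interesting for LLM", TOY_VOCAB)
# The model never sees the strings, only their positions in the vocabulary.
token_ids = [TOY_VOCAB.index(token) for token in tokens]
```

Here tokens comes out as "This", " article", " looks", " interesting", " for", " LL", "M", and token_ids is simply 0 through 6. Add " LLM" to the toy vocabulary and the acronym stays in one piece.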

Tokenisation: turning text into tokens

Before an LLM can understand language, it needs to chop text into small, manageable chunks called tokens, and the process of doing that is called tokenisation. It's the very first step in turning human words into something a machine can work with.

There are two main cutting styles:

  • Word-level tokenisation — splits text by spaces or punctuation. Like cutting a pizza into slices: each word = one big slice. Example: "Dogs are lovely" → ["Dogs", "are", "lovely"]

  • Subword tokenisation — breaks words into smaller, reusable chunks. This is like slicing your pizza into bite-sized squares, so you can reuse leftovers anywhere. Example: "unbelievable" → ["un", "believ", "able"]

Modern LLMs almost all use subword schemes, such as Byte Pair Encoding (BPE), WordPiece, or the Unigram model (BPE and Unigram are often run through the SentencePiece library), because they strike the right balance between handling rare words and keeping the token count reasonable.
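
For a feel of how a subword vocabulary gets built, here is a rough sketch of the core BPE training loop: start from single characters and keep fusing the most frequent adjacent pair. The function name learn_bpe_merges and the tiny word list are assumptions made for illustration. Production tokenizers work on bytes, pre-split the text, and are far more optimised.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe_merges(["hug", "hug", "hug", "pug", "bug"], num_merges=2)
# merges == [("u", "g"), ("h", "ug")]
```

After two merges the frequent word "hug" has become a single token, while the rarer "pug" and "bug" are still stored as "p" + "ug" and "b" + "ug". That is the rare-word versus token-count trade-off in miniature.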

Embeddings: converting tokens into numbers

After we slice text into tokens, the model still needs to understand what those tokens actually mean. This is where embeddings come in.

An embedding is simply a list of numbers (a vector) that represents a token's meaning in a multi-dimensional space. Think of it as giving each token an "address" on a vast map of language.

Here's why embeddings matter:

  • Similar meanings live close together. Tokens like "Stockholm" and "Delhi" end up with vectors that sit near each other on the map, because both show up in similar contexts as capital cities.
  • Context shapes meaning. The word "bank" doesn't always mean the same thing. The token starts from the same embedding in "river bank" and "money bank", but the model's deeper layers shift that "address" depending on the neighbourhood the word appears in, so the two end up with different contextual representations.
  • The model thinks in vectors, not words. Once tokens are converted to embeddings, the LLM can do math with them — measuring distances or directions — to capture relationships between words.
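
"Doing math with vectors" usually starts with measuring how closely two of them point in the same direction. The sketch below uses cosine similarity on three hand-made 3-dimensional vectors. The numbers are invented purely for illustration, and real embeddings have hundreds or thousands of dimensions learned during training.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings, made up for this example (not from any real model).
toy_embeddings = {
    "Stockholm": [0.9, 0.8, 0.1],
    "Delhi": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

sim_cities = cosine_similarity(toy_embeddings["Stockholm"], toy_embeddings["Delhi"])
sim_mixed = cosine_similarity(toy_embeddings["Stockholm"], toy_embeddings["banana"])
```

With these toy numbers, sim_cities lands close to 1 while sim_mixed is much lower, which is the "close together on the map" idea expressed as a single number.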

If you want to see this visually, Google's Embedding Projector lets you explore the geometry of word embeddings interactively.

Vocabulary size: the model's lexicon

When we say a model has a "vocabulary," it means the total number of unique tokens (pieces of text) that the model has been trained to recognise — not that it has memorised the Oxford English Dictionary.

Think of it as the menu of tokens the model is allowed to work with. If a token is on the menu, the model already knows it. If not, it has to creatively chop the word into smaller sub-tokens it does know.

Model with a small vocabulary

The model knows fewer tokens, so it has to break words down more often (using subwords or even single characters).

  • Pros: super flexible — it can handle new words, slang, or languages it wasn't explicitly trained on by remixing known chunks.
  • Cons: sentence processing takes more steps because simple words get split into many tokens.

Model with a large vocabulary

The model stores more whole tokens, so it can recognise complex or rare words all at once.

  • Pros: faster processing for known words, since less splitting is needed.
  • Cons: bigger vocabulary = more memory and compute, because each token needs its own embedding vector in the model's brain.
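
The memory cost is easy to estimate, because the embedding table has one row per token. The sketch below assumes an embedding width of 4,096 and 2 bytes per parameter (16-bit weights). Both numbers are assumptions picked for illustration, not figures from any particular model.

```python
def embedding_table_size(vocab_size, embedding_dim, bytes_per_param=2):
    """Return (parameter count, size in bytes) for a vocab_size x embedding_dim table."""
    params = vocab_size * embedding_dim
    return params, params * bytes_per_param


# Compare a smaller and a larger vocabulary at the same (assumed) embedding width.
for vocab_size in (32_000, 200_000):
    params, size_bytes = embedding_table_size(vocab_size, embedding_dim=4096)
    print(f"{vocab_size:>7} tokens -> {params / 1e6:.0f}M parameters, {size_bytes / 1e6:.0f} MB")
```

Under these assumptions, the 32K vocabulary needs about 262 MB for its embedding table and the 200K vocabulary about 1.6 GB, before counting any other part of the model.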

Real-world balance

Modern LLMs (like GPT-style models) usually aim for a middle ground — vocabularies in the ~32K to ~200K range.

  • Big enough to capture lots of words directly.
  • Small enough to stay efficient without consuming too much memory.

Striking this balance between small and large is one reason newer models handle everyday text so efficiently.

Context window: the model's working memory

Think of the context window as the maximum number of words you can hold in your head while actively listening. In LLM terms, it's the number of tokens the model can take into account at once, counting both your input and the text it generates.

  • A model with a 2K-token window can hold roughly 1,500 English words at a time, since a token averages about three-quarters of a word.
  • Longer context windows allow the model to handle longer conversations and bigger documents without losing track of earlier details.
  • Once the limit is reached, the oldest tokens have to be dropped or condensed (for example, summarised) to make room, which is why the model seems to "forget" the earlier parts of a conversation.
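
The simplest way an application can stay inside the window is to keep only the most recent tokens. The sketch below shows that idea, plus the rough words-per-token arithmetic behind the 2K example. The helper names and the 0.75 words-per-token ratio are assumptions (a common rule of thumb for English), and real chat applications often summarise or retrieve older content instead of dropping it outright.

```python
def fit_to_context(token_ids, max_tokens):
    """Keep only the most recent max_tokens tokens, dropping the oldest."""
    if max_tokens <= 0:
        return []
    return token_ids[-max_tokens:]


def estimate_words(num_tokens, words_per_token=0.75):
    """Rough word count for a token budget (rule-of-thumb ratio for English)."""
    return int(num_tokens * words_per_token)


history = list(range(10))                       # stand-in for 10 token IDs
window = fit_to_context(history, max_tokens=4)  # [6, 7, 8, 9]
approx_words = estimate_words(2048)             # 1536
```

Tokens 0 to 5 never reach the model in this example, which is exactly the "forgetting" described above.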

For a deeper dive on how modern LLMs push this limit — RAG, persistent memory, FlashAttention, and how "lost in the middle" applies even when the window is huge — see LLM Context Windows and Memory.

Why all of this matters

Understanding these concepts is essential because they directly affect:

  • Performance — tokenisation impacts speed and accuracy.
  • Cost — most LLM APIs charge per token processed.
  • Capabilities — vocabulary size and context window determine how well a model can handle complex or lengthy inputs.
  • Interpretability — embeddings reveal how the model organises and relates concepts internally.

Summary

LLMs are like language robots that understand and generate human language. They do this by breaking text into smaller pieces called tokens. The vocabulary size determines how many distinct tokens the model knows, and embeddings turn each token into a vector so the model can reason about meaning mathematically. The context window is like the robot's working memory: it caps how many tokens the model can consider at once. Together, these pieces make LLMs capable of answering questions, writing poetry, and more.

