Module 2 — Tokens & Embeddings

The core idea · hands-on · about 25 minutes.

In Module 1 the model treated every word as a bare label. It had no idea that "cat" and "dog" are more alike than "cat" and "sky" — to it they were just different symbols. That is a serious limitation. This module fixes it in two steps that every real LLM performs before it does anything else: tokenize the text, then turn each token into an embedding — a list of numbers that captures meaning.

Step 1 — Tokenize

A model does not read characters or words directly; it reads tokens. A token is a chunk of text: often a whole word, sometimes a piece of one ("token", "##izing"). The "##" prefix is the WordPiece convention for a piece that continues a word; other tokenizers mark pieces differently. The text is chopped into a sequence of tokens, and each token has an ID number in the model's vocabulary.
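
To make the splitting concrete, here is a minimal sketch of a greedy longest-match subword tokenizer in the WordPiece style. The vocabulary, its ID numbers, and the function name `tokenize` are invented for illustration; real tokenizers learn vocabularies of tens of thousands of entries from data and treat casing, punctuation, and unknown words with more care.

```python
# Toy vocabulary: token string -> ID. All entries and IDs are made up for illustration.
VOCAB = {"[UNK]": 0, "the": 4, "cat": 7, "happy": 19,
         "token": 23, "##izing": 24, "##s": 25}

def tokenize(text):
    """Split text into subword tokens using greedy longest-match."""
    tokens = []
    for word in text.lower().split():
        start = 0
        pieces = []
        while start < len(word):
            end = len(word)
            piece = None
            # Try the longest remaining substring first, then shrink it.
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # marks a continuation piece
                if candidate in VOCAB:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                pieces = ["[UNK]"]  # nothing matched: treat the whole word as unknown
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens

tokens = tokenize("the happy cat")
ids = [VOCAB[t] for t in tokens]
print(tokens, ids)             # ['the', 'happy', 'cat'] [4, 19, 7]
print(tokenize("tokenizing"))  # ['token', '##izing']
```

Note that a word missing from this tiny vocabulary, such as "dog", comes back as `[UNK]`. Subword vocabularies exist largely to make that outcome rare.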

Step 2 — Embed: turn each token into a vector

Models do math, and you cannot do math on the word "cat." So each token ID is mapped to a vector — a list of numbers, its embedding. Real models use hundreds or thousands of numbers per token; here we use just two so we can draw them. The crucial property the model learns: tokens used in similar ways end up with similar vectors, so they sit close together.

Plotted on a 2-D map, each word's nearest neighbours are the tokens the model considers most similar.
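
The nearest-neighbour lookup can be sketched in a few lines. The 2-D coordinates here are hand-picked for illustration, not taken from a trained model, and `nearest` is a hypothetical helper name.

```python
import math

# Hand-picked 2-D embeddings (illustrative only; a real model learns these).
EMBEDDINGS = {
    "cat": (0.9, 0.8),
    "dog": (1.0, 0.7),
    "kitten": (0.7, 0.9),
    "sky": (-0.7, 0.9),
    "cloud": (-0.8, 0.8),
    "happy": (0.1, -0.9),
}

def nearest(word, k=2):
    """Return the k tokens whose vectors lie closest to `word` (Euclidean distance)."""
    target = EMBEDDINGS[word]
    others = [w for w in EMBEDDINGS if w != word]
    return sorted(others, key=lambda w: math.dist(EMBEDDINGS[w], target))[:k]

print(nearest("cat"))       # ['dog', 'kitten']
print(nearest("sky", k=1))  # ['cloud']
```

Real systems often rank neighbours by cosine similarity rather than Euclidean distance; on this toy map either choice gives the same neighbours.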

Why "distance" means "meaning"

Because similar tokens have nearby vectors, the model can generalize. If it learns something about "dog," that knowledge partly transfers to "cat" simply because their vectors are close. This is the difference between memorizing words and understanding relationships between them — and it is why an LLM can handle a sentence it has never seen before.
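
A small sketch of why closeness helps: any smooth function of the vector gives nearby inputs nearby outputs. The vectors and the weights of the hypothetical `is_animal_score` function are invented for illustration; imagine the weights were tuned using only examples of "dog".

```python
# Illustrative 2-D embeddings (hand-picked, not from a trained model).
vec = {"dog": (1.0, 0.7), "cat": (0.9, 0.8), "sky": (-0.7, 0.9)}

# Hypothetical weights, imagined as learned only from sentences about "dog".
weights = (1.0, 0.2)

def is_animal_score(word):
    """Linear score: dot product of the word's vector with the weights."""
    x, y = vec[word]
    return weights[0] * x + weights[1] * y

for word in ("dog", "cat", "sky"):
    print(word, round(is_animal_score(word), 2))
# dog 1.14
# cat 1.06
# sky -0.52
```

The scorer never saw "cat", yet it rates "cat" almost the same as "dog" because the two vectors are close, while "sky" lands far away.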

The same idea in code — read only, nothing to install
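import torch
import torch.nn as nn
import torch.nn.functional as F

# toy vocabulary with made-up ids; a real tokenizer supplies these
vocab = {"the": 4, "happy": 19, "cat": 7, "dog": 8}
# every token id maps to a learned vector (here 2 numbers; real models use 100s)
embedding = nn.Embedding(num_embeddings=20, embedding_dim=2)

ids = torch.tensor([vocab[w] for w in "the happy cat".split()])  # → [4, 19, 7]
vectors = embedding(ids)                  # → three 2-number vectors, shape (3, 2)

# similar tokens have nearby vectors; distance encodes meaning
cat, dog = embedding(torch.tensor([vocab["cat"], vocab["dog"]]))
similarity = F.cosine_similarity(cat, dog, dim=0)  # high after training; arbitrary at random init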

The embedding table is learned during training, exactly like the weights in Course 4 — gradient descent pushes related tokens together.
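
How gradient descent can pull related tokens together is sketched here in plain Python. This is a deliberate simplification: the fixed `context` vector is a hypothetical stand-in for a context both words appear in (such as "the ___ slept"), and the loss is simply the squared distance to it. Real models minimize a prediction error instead, and closeness emerges as a side effect.

```python
import math

# Made-up starting vectors; training normally starts from random values.
emb = {"cat": [-1.0, 0.5], "dog": [1.0, -0.5]}
# Hypothetical fixed vector standing in for a context shared by both words.
context = [0.5, 1.0]
learning_rate = 0.1

before = math.dist(emb["cat"], emb["dog"])
for step in range(20):
    for word in emb:
        # Loss = squared distance to the context vector; its gradient is 2 * (e - c).
        grad = [2 * (e - c) for e, c in zip(emb[word], context)]
        # Gradient descent update: move a small step against the gradient.
        emb[word] = [e - learning_rate * g for e, g in zip(emb[word], grad)]
after = math.dist(emb["cat"], emb["dog"])

print(f"distance before: {before:.3f}, after: {after:.3f}")
```

Both vectors are drawn toward the same context, so the gap between them shrinks at every step even though the loss never compares "cat" and "dog" directly.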

AI anchor — the first thing every model does to your prompt

The moment you hit enter, your prompt is tokenized and every token is replaced by its embedding vector before any "thinking" happens. Those vectors are the model's actual input. The well-known example king − man + woman ≈ queen works because these vectors arrange meaning as geometry: directions in the space correspond to concepts like gender or plurality. Everything the model does downstream operates on these vectors, never on the letters you typed.
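
The king/queen arithmetic can be reproduced on a toy scale. The vectors here are hand-built so that one axis loosely means "royalty" and the other "gender"; they are assumptions for illustration, and in trained embeddings the relation holds only approximately. `analogy` is a hypothetical helper name.

```python
import math

# Hand-built 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Illustrative only.
vec = {
    "king":  (0.9, 0.9),
    "queen": (0.9, -0.9),
    "man":   (0.1, 0.9),
    "woman": (0.1, -0.9),
    "apple": (-0.8, 0.0),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to vec[a] - vec[b] + vec[c], excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vec[a], vec[b], vec[c]))
    candidates = [w for w in vec if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vec[w], target))

print(analogy("king", "man", "woman"))  # queen
```

Subtracting "man" and adding "woman" flips the gender axis while leaving the royalty axis alone, which is what it means for a direction in the space to correspond to a concept.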

Why this matters next

Now every token is a vector. But each token still gets the same vector wherever it appears, and meaning also depends on the surrounding words: "bank" by a river vs. a bank with money. Module 3 introduces attention, which lets each token look at the others to figure out what it means in context.

One-sentence summary: a model first chops text into tokens, then replaces each token with a learned embedding vector; because similar tokens get nearby vectors, distance in that space encodes meaning, letting the model generalize instead of just memorize.
