
Module 2 — Tokens & Embeddings

The core idea · hands-on · about 25 minutes.

In Module 1, the model represented each word as an opaque categorical label. Nothing in that representation indicated that "cat" and "dog" are more semantically similar than "cat" and "sky": each word was a distinct symbol with no encoded relationship to any other. This is a substantial limitation. This module resolves it through two operations that every production LLM performs before any further computation: it tokenizes the text, then maps each token to an embedding, a learned vector of real numbers that encodes the token's semantic properties.

Step 1 — Tokenization

The model does not operate directly on characters or words; its input units are tokens. A token is a contiguous span of text — typically a full word, or in some cases a sub-word fragment ("token", "##izing"). The input text is segmented into a sequence of tokens, and each token is mapped to an integer identifier in the model's vocabulary. Enter a sentence below to observe its tokenization.
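
The segmentation described above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer of any particular model: the vocabulary, the `[UNK]` fallback, and the function names `tokenize_word`, `tokenize`, and `encode` are all hypothetical, and the greedy longest-match rule is one common way to produce sub-word pieces such as "token" + "##izing".

```python
# Hypothetical toy vocabulary: token string -> integer id.
# "##" marks a piece that continues a word, as in the "##izing" example.
VOCAB = {
    "[UNK]": 0, "the": 1, "model": 2, "is": 3,
    "token": 4, "##izing": 5, "##s": 6, "text": 7,
}


def tokenize_word(word):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in VOCAB:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text):
    """Lowercase, split on whitespace, then split each word into pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word))
    return tokens


def encode(text):
    """Map each token to its integer id in the vocabulary."""
    return [VOCAB[token] for token in tokenize(text)]


tokens = tokenize("The model is tokenizing text")
ids = encode("The model is tokenizing text")
print(tokens)  # ['the', 'model', 'is', 'token', '##izing', 'text']
print(ids)     # [1, 2, 3, 4, 5, 7]
```

Production tokenizers learn their vocabulary from data and handle punctuation, casing, and unknown text more carefully; the sketch only shows the two steps named in the lesson, segmenting text and mapping each piece to an integer id.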


Step 2 — Embedding: mapping tokens to vectors

The model performs numerical computation, and the discrete symbol "cat" does not admit numerical operations directly. Each token ID is therefore mapped to a vector of real numbers — its embedding. Production models employ embedding dimensions of several hundred to several thousand; this activity uses two dimensions so that the embeddings can be visualized in the plane. The critical property that the model learns during training is that tokens used in similar contexts are assigned similar embeddings — they are nearby in the embedding space.

Click any word on the map below to display its nearest neighbors — the tokens the model considers most semantically similar.
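
The nearest-neighbor lookup can also be sketched directly. The coordinates below are hand-picked for illustration and are not taken from any trained model; `EMBEDDINGS` and `nearest_neighbors` are hypothetical names, and Euclidean distance is assumed as the measure of closeness.

```python
import math

# Hand-picked 2-D vectors for illustration only; a trained model learns these.
EMBEDDINGS = {
    "cat": (0.9, 0.8),
    "dog": (0.8, 0.9),
    "kitten": (1.0, 0.6),
    "sky": (-0.7, 0.6),
    "cloud": (-0.8, 0.5),
}


def nearest_neighbors(word, k=2):
    """Return the k other words whose vectors lie closest to `word`."""
    query = EMBEDDINGS[word]
    others = [w for w in EMBEDDINGS if w != word]
    return sorted(others, key=lambda w: math.dist(query, EMBEDDINGS[w]))[:k]


print(nearest_neighbors("cat"))  # ['dog', 'kitten']
print(nearest_neighbors("sky"))  # ['cloud', 'dog']
```

Note that the second neighbor of "sky" is "dog" only because something has to come second in a five-word table; a neighbor list ranks closeness, it does not guarantee that every listed word is actually close.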


Distance in embedding space encodes semantic similarity

Because semantically similar tokens are mapped to nearby vectors, the model can generalize: what it learns about one token transfers to other tokens, more strongly the closer their embeddings are. This is the difference between memorizing individual tokens and representing the relationships between them, and it helps explain how an LLM can process sentences that never appeared in its training data.
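
One way to see this transfer is with a toy scorer that is fitted on a single token and then applied to tokens it never saw. Everything here is an assumption made for illustration: the hand-picked vectors, the linear scorer `score`, the target value 1.0, and the learning rate. Real models are far larger, but the same effect applies, since a smooth function of the embedding gives similar outputs for nearby inputs.

```python
# Hand-picked 2-D vectors for illustration only.
EMBEDDINGS = {"cat": (0.9, 0.8), "dog": (0.8, 0.9), "sky": (-0.7, 0.6)}

weights = [0.0, 0.0]  # parameters of a toy linear scorer


def score(word):
    """Dot product of the scorer's weights with the word's embedding."""
    return sum(w * x for w, x in zip(weights, EMBEDDINGS[word]))


# Fit the scorer on "cat" only: push score("cat") toward 1.0
# by gradient descent on the squared error (score - 1)^2.
learning_rate = 0.1
for _ in range(200):
    error = score("cat") - 1.0
    for i, x in enumerate(EMBEDDINGS["cat"]):
        weights[i] -= learning_rate * 2 * error * x

# "dog" was never used in fitting, yet it scores close to "cat"
# because its vector is close; "sky" is far away and scores low.
print(round(score("cat"), 3), round(score("dog"), 3), round(score("sky"), 3))
```

The scorer was shown only "cat", but "dog" inherits nearly the same score through its position in the space, while the distant "sky" does not.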

The equivalent procedure in code

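import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy vocabulary with made-up ids; a real tokenizer supplies this mapping.
vocab = {"the": 4, "happy": 19, "cat": 7, "dog": 8}
# Every token id maps to a learned vector (here 2 numbers; real models use hundreds to thousands).
embedding = nn.Embedding(num_embeddings=32, embedding_dim=2)

ids = torch.tensor([vocab[w] for w in "the happy cat".split()])  # tensor([4, 19, 7])
vectors = embedding(ids)                                         # shape (3, 2): three 2-number vectors

# In a trained table, similar tokens have nearby vectors; this fresh table is still random.
cat, dog = embedding(torch.tensor([vocab["cat"], vocab["dog"]]))
similarity = F.cosine_similarity(cat, dog, dim=0)  # high only after training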

The embedding table is learned during training in the same way as the weights introduced in Course 4: gradient descent adjusts the embeddings, and tokens that appear in similar contexts end up close together in the embedding space.
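
The effect of gradient descent on an embedding table can be sketched with a deliberately simplified objective. This is an assumption for illustration: an LLM does not minimize the distance between chosen word pairs; it minimizes a prediction loss, and closeness between related tokens emerges as a side effect. The starting vectors, the learning rate, and the helper `gap` are hypothetical.

```python
import math

# Hypothetical starting embeddings; real tables start from random values.
emb = {"cat": [1.0, -0.5], "dog": [-0.8, 0.9]}


def gap(a, b):
    """Euclidean distance between the current embeddings of two tokens."""
    return math.dist(emb[a], emb[b])


before = gap("cat", "dog")

learning_rate = 0.1
for _ in range(10):
    a, b = emb["cat"], emb["dog"]
    # Stand-in loss: sum((a_i - b_i)^2).
    # Its gradient is 2(a_i - b_i) for a and -2(a_i - b_i) for b.
    grad_a = [2 * (x - y) for x, y in zip(a, b)]
    grad_b = [-g for g in grad_a]
    emb["cat"] = [x - learning_rate * g for x, g in zip(a, grad_a)]
    emb["dog"] = [y - learning_rate * g for y, g in zip(b, grad_b)]

after = gap("cat", "dog")
print(before > after)  # True: the two vectors have moved toward each other
```

A loss that only pulls vectors together would eventually collapse the whole table to one point, which is one reason real training objectives also have to keep unrelated tokens distinguishable.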

AI anchor: the first operation every model performs on a prompt

As soon as a prompt is submitted, the text is tokenized and each token is replaced by its embedding vector before any downstream computation occurs. These vectors are the model's actual input. The classic observation that king − man + woman ≈ queen holds approximately in trained word-embedding spaces because semantic relationships are encoded as geometric structure: particular directions in the space correspond to attributes such as gender or grammatical number. All subsequent computation in the model operates on these vectors, not on the input characters.
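
The vector arithmetic behind the king and queen example can be reproduced on a toy table. The vectors below are constructed by hand so that one axis stands for "royalty" and the other for "female"; they are assumptions chosen to make the geometry visible, and the helper `analogy` is a hypothetical name. In a trained model the relationship holds only approximately and the directions are not aligned with the axes.

```python
import math

# Hand-built vectors: axis 0 ~ "royalty", axis 1 ~ "female" (illustrative only).
VECTORS = {
    "king": (0.9, 0.1),
    "queen": (0.9, 0.9),
    "man": (0.1, 0.1),
    "woman": (0.1, 0.9),
    "apple": (0.0, 0.5),
}


def analogy(a, b, c):
    """Return the word nearest to a - b + c, excluding the three inputs."""
    target = tuple(
        x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    )
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(target, VECTORS[w]))


print(analogy("king", "man", "woman"))  # queen
```

Subtracting "man" from "king" removes the male component and leaves the royalty component; adding "woman" supplies the female component, which lands on "queen".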

Check your understanding

Answer a short set of questions on tokens and embeddings.


Why this matters next

Each token is now represented as a vector. However, the meaning of a word depends on its surrounding context: "bank" can denote a river bank or a financial institution. Module 3 introduces attention, the mechanism that lets each token incorporate information from the other tokens in its context to determine its contextual meaning.

Summary: a model first tokenizes the input text and then replaces each token with a learned embedding vector. Because semantically similar tokens are assigned nearby vectors, distance in the embedding space encodes semantic similarity, enabling the model to generalize rather than merely memorize individual tokens.
