Module 2 — Tokens & Embeddings
In Module 1 the model treated every word as a bare label. It had no idea that "cat" and "dog" are more alike than "cat" and "sky" — to it they were just different symbols. That is a serious limitation. This module fixes it in two steps that every real LLM performs before it does anything else: tokenize the text, then turn each token into an embedding — a list of numbers that captures meaning.
Step 1 — Tokenize
A model does not read characters or words directly; it reads tokens. A token is a chunk of text — often a whole word, sometimes a piece of one ("token", "##izing"). The text is chopped into a sequence of tokens, and each token has an ID number in the model's vocabulary. Type a sentence and watch it get split.
This activity needs JavaScript. The lesson below still covers everything.
Step 2 — Embed: turn each token into a vector
Models do math, and you cannot do math on the word "cat." So each token ID is mapped to a vector — a list of numbers, its embedding. Real models use hundreds or thousands of numbers per token; here we use just two so we can draw them. The crucial property the model learns: tokens used in similar ways end up with similar vectors, so they sit close together.
Click any word on the map below to see its nearest neighbours — the tokens the model considers most similar.
This activity needs JavaScript.
Why "distance" means "meaning"
Because similar tokens have nearby vectors, the model can generalize. If it learns something about "dog," that knowledge partly transfers to "cat" simply because their vectors are close. This is the difference between memorizing words and understanding relationships between them — and it is why an LLM can handle a sentence it has never seen before.
# every token id maps to a learned vector (here 2 numbers; real models use 100s) embedding = nn.Embedding(vocab_size, 2) ids = tokenizer("the happy cat") # → [4, 19, 7] vectors = embedding(ids) # → three 2-number vectors # similar tokens have nearby vectors — distance encodes meaning similarity = cosine(vectors["cat"], vectors["dog"]) # high
The embedding table is learned during training, exactly like the weights in Course 4 — gradient descent pushes related tokens together.
Check your understanding
A few questions about tokens and embeddings. You will get a score.
This activity needs JavaScript.