Module 1 — The Language and Notation of ML

Warm-up · hands-on · about 25 minutes.

Machine-learning mathematics is often perceived as difficult primarily because of its notation. Notation, however, is a compact formal language for expressing concepts that are themselves straightforward. This module introduces the small set of symbols that appear in nearly every ML formula, so that subsequent modules can be read fluently.

The concepts here are elementary — summation and functional dependence. The objective is to establish familiarity with the notation, not to introduce new mathematics.

Notation reference

Each card displays a piece of notation on one side and its plain-language meaning on the other. Select a card to reveal its meaning: read the symbol, formulate your interpretation, then verify it.


Functions: mapping inputs to outputs

A function is a rule that maps an input to an output. The notation \( f(x) \), read "f of x", denotes the output of the rule \( f \) applied to the input \( x \). The choice of letters is arbitrary; \( f(x) \), \( g(t) \), and \( \text{loss}(w) \) all follow the same pattern: a named rule applied to the input written in parentheses.

This is the central abstraction: a machine-learning model is a function. It receives an input \( x \) (an email, an image, a row of data) and returns an output \( \hat{y} \) (read "y-hat"), its prediction. Training is the process of selecting the version of that function that produces the most accurate predictions.
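As a minimal sketch of the idea that a model is a function, the Python snippet below defines a hypothetical rule `f` and applies it to an input. The rule itself (multiply by 2, add 1) is made up for illustration and is not taken from the lesson; the names `f`, `x`, and `y_hat` simply mirror the symbols above.

```python
def f(x: float) -> float:
    """A hypothetical rule: multiply the input by 2, then add 1."""
    return 2.0 * x + 1.0

x = 3.0        # the input
y_hat = f(x)   # the output, playing the role of the prediction y-hat
print(y_hat)   # 7.0
```

Swapping in a different body for `f` gives a different "version" of the function, which is loosely what training does when it searches for the rule with the most accurate outputs.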

In the activity below, select a rule \( f \), provide an input \( x \), and observe the resulting output \( \hat{y} \).


Subscripts and indices: referencing elements of a sequence

Data is frequently organized as an ordered sequence. A subscript identifies a specific element: \( x_1 \) is the first element, \( x_2 \) the second, and \( x_i \) denotes "the \( i \)-th element," where \( i \) is an index variable ranging over the positions. A dataset of \( n \) points is written \( x_1, x_2, \ldots, x_n \).

A subscript must be distinguished from an exponent: \( x_2 \) (subscript) denotes the second element, whereas \( x^2 \) (superscript) denotes \( x \) squared — position versus exponentiation.
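The distinction between a subscript and an exponent can be sketched in Python, with one caveat: mathematical subscripts usually start at 1, whereas Python lists start at position 0. The sequence values and the helper `element` below are hypothetical, chosen only to illustrate the notation.

```python
xs = [4.0, 7.0, 1.0, 9.0]   # made-up sequence x_1, x_2, x_3, x_4

def element(seq, i):
    """Return x_i using the 1-based index of the math notation."""
    return seq[i - 1]       # Python lists start at position 0

x_2 = element(xs, 2)        # subscript: the second element
x_squared = xs[0] ** 2      # superscript: x_1 squared, i.e. 4.0 * 4.0

print(x_2)        # 7.0
print(x_squared)  # 16.0
```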

Adjust the index \( i \) below to reference any element of the sequence; the label beneath each element is its subscript.


Summation notation

The Greek capital sigma, \( \sum \), is among the most frequently used symbols in ML. It denotes the sum of a sequence of terms. The annotation below the sigma names the index variable and its starting value; the annotation above gives its final value:

\[ \sum_{i=1}^{n} x_i \;=\; x_1 + x_2 + \cdots + x_n \]

This is read as: "for \( i \) ranging from 1 to \( n \), sum every \( x_i \)." Adjust the slider below to observe the sum expand term by term.
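One way to read the sigma is as a loop. The sketch below, using made-up values for the sequence, accumulates the sum term by term and then compares the result with Python's built-in `sum`. The `i - 1` shift is only there because Python lists are 0-based while the notation counts from 1.

```python
xs = [2.0, 5.0, 3.0]   # made-up values for x_1, x_2, x_3
n = len(xs)

total = 0.0
for i in range(1, n + 1):   # i runs from 1 to n, as written below and above the sigma
    total += xs[i - 1]      # add x_i (shifted by one for Python's 0-based lists)

print(total)                # 10.0
print(total == sum(xs))     # True: the built-in sum does the same job
```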


Logarithms and exponentials: their role in ML

Two further symbols appear frequently: the exponential \( e^x \) and the logarithm \( \log(x) \). They are inverse functions: with \( \log \) taken as the natural logarithm (base \( e \)), as is conventional in ML, \( \log(e^x) = x \). The objective here is not to compute them manually but to understand why they are used in ML.

Products of probabilities underflow. The probability of several independent events occurring jointly is a product: \( p_1 \times p_2 \times \cdots \). A product of many values less than 1 becomes extremely small, eventually falling below the smallest positive number that floating-point arithmetic can represent, at which point it is rounded to zero. The logarithm converts this product into a sum: \( \log(p_1 p_2) = \log p_1 + \log p_2 \). The sum stays within a range that is easy to represent. This is why loss functions are typically expressed using logarithms.

Exponentials model growth and saturation. The function \( e^x \) grows rapidly; the related sigmoid function uses \( e \) to map any real number to a value in \( (0, 1) \), which can be read as a probability, as a classifier requires. You will construct a sigmoid in Module 8.
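As a brief preview, and only as a sketch, the snippet below uses the standard logistic form \( 1 / (1 + e^{-x}) \). That formula is an assumption here: the lesson names the sigmoid without defining it, and Module 8 may construct it differently. The input values are arbitrary.

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid(x))   # every output lies strictly between 0 and 1
```

An input of 0 maps to exactly 0.5, negative inputs map below 0.5, and positive inputs map above it. This naive form can raise an overflow error for very large negative inputs, so it is meant for illustration rather than production use.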

This identity — \( \log(a \times b) = \log a + \log b \) — is the primary reason logarithms are used. Multiply two numbers below and observe that their logarithms add:
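The identity can be checked numerically. The sketch below picks two arbitrary positive numbers and compares the logarithm of their product with the sum of their logarithms, using `math.isclose` because floating-point results may differ in the last digits.

```python
import math

a, b = 8.0, 4.0                   # arbitrary positive numbers
lhs = math.log(a * b)             # log of the product
rhs = math.log(a) + math.log(b)   # sum of the logs

print(lhs, rhs)
print(math.isclose(lhs, rhs))     # True, up to floating-point rounding
```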


The activity below multiplies several probabilities and demonstrates how the product approaches zero while the sum of their logarithms remains numerically tractable.
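The same effect can be reproduced in a few lines of Python. This is a sketch with made-up numbers: 200 probabilities of 0.01 each, whose true product is \( 10^{-400} \), far smaller than a standard 64-bit float can hold.

```python
import math

probs = [0.01] * 200   # 200 made-up independent probabilities

product = 1.0
for p in probs:
    product *= p       # true value is 1e-400, too small for a float

log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0: the product has underflowed
print(log_sum)   # roughly -921.03, an ordinary-sized number
```

The product carries no usable information once it reaches zero, while the sum of logarithms still distinguishes this sequence from any other.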


AI anchor: interpreting a real loss function

Here is a loss function used to train real models, the mean squared error:

\[ L \;=\; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]

Every component can now be interpreted. \( y_i \) is the true value for example \( i \); \( \hat{y}_i \) is the model's prediction for that example; \( (y_i - \hat{y}_i) \) is the error on that example; the error is squared so that positive and negative errors both contribute positively; \( \sum \) sums these squared errors over all \( n \) examples; and \( \frac{1}{n} \) turns that sum into a mean. The loss is therefore a single number (a scalar) measuring the model's average squared error, and training seeks to minimize it. This principle underlies the entire course.
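The formula translates almost symbol for symbol into code. The sketch below uses four made-up true values and predictions; the function name `mean_squared_error` and the data are illustrative assumptions, not part of the lesson.

```python
y_true = [3.0, 5.0, 2.0, 7.0]   # made-up true values y_i
y_pred = [2.0, 5.0, 4.0, 6.0]   # made-up predictions y-hat_i

def mean_squared_error(y, y_hat):
    """Compute (1/n) * sum over i of (y_i - y_hat_i)^2."""
    n = len(y)
    squared_errors = [(yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)]
    return sum(squared_errors) / n

print(mean_squared_error(y_true, y_pred))   # 1.5
```

Here the errors are 1, 0, -2, and 1; their squares are 1, 0, 4, and 1; the sum is 6; and dividing by \( n = 4 \) gives 1.5.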

Put it together

Translate each expression into plain language. Attempting an interpretation before checking reinforces familiarity with the notation.


Why this matters next

Every subsequent module depends on this notation. Conditional probability is written \( P(A\mid B) \); a vector is \( \mathbf{x} = [x_1, x_2, \ldots] \); the dot product is expressed as a \( \sum \); a gradient is a vector of partial derivatives; and the loss function above is precisely the quantity that gradient descent minimizes in Module 6. Once the notation is familiar, the remaining mathematics is largely a matter of reading it.

Summary: ML notation is a compact formal language for elementary concepts — \( f(x) \) denotes a rule applied to an input, a subscript references an element of a sequence, \( \sum \) denotes summation, and logarithms are used because they convert numerically unstable products of probabilities into stable sums.
