Module 2 — Activation Functions

The building block · hands-on · about 25 minutes.

In Module 1, every neuron applied a final non-linear transformation, \( \sigma \). This transformation — the activation function — may appear to be a minor architectural detail. It is, in fact, the single most important reason deep networks are capable of representing non-linear functions. This module introduces the principal activation functions and demonstrates interactively why a network without non-linear activations is mathematically equivalent to a single linear model.

The three principal activation functions

Sigmoid: \( \sigma(z) = \frac{1}{1+e^{-z}} \). Maps any real input to the interval \( (0,1) \). Its output can be read as a probability, which makes it the standard choice for the output layer of binary classifiers.
Tanh: \( \tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \). Similar in shape to sigmoid but with range \( (-1, 1) \) and zero-centered output, which typically improves training dynamics in hidden layers somewhat.
ReLU: \( \text{ReLU}(z) = \max(0, z) \). The identity for positive inputs and zero otherwise. Computationally inexpensive, non-saturating for positive inputs, and the standard default for hidden layers in modern deep networks.

Plot each function and vary the input to observe the corresponding output.
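For readers without the interactive plot, the three definitions can be sketched in a few lines of plain Python. This is a minimal, dependency-free illustration rather than how a deep learning library implements them; the function names and the sample inputs are choices made here.

```python
import math

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # z < 0, so e is in (0, 1) and cannot overflow
    return e / (1.0 + e)

def tanh(z):
    """Hyperbolic tangent; output lies in (-1, 1) and is centered on zero."""
    return math.tanh(z)

def relu(z):
    """Rectified linear unit: identity for positive inputs, zero otherwise."""
    return max(0.0, z)

# Vary the input and observe the corresponding outputs.
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  tanh={tanh(z):+.4f}  relu={relu(z):.1f}")
```

Running the loop shows sigmoid and tanh flattening out at the extremes while ReLU keeps growing with positive inputs.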

The fundamental role of non-linearity

The formal reason activation functions are essential is as follows. The composition of two linear (more precisely, affine) transformations is itself linear: a weighted sum of weighted sums remains a weighted sum. Consequently, no matter how many layers without activations are stacked, the resulting network can only represent linear functions of its input. The activation function is the only source of non-linearity in the architecture: it lets each layer bend the output of the previous one rather than merely rescale and shift it. Successive composition of these non-linear transformations is what enables the network, given enough units, to represent highly complex functions.
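As a worked instance of this argument, consider two stacked layers with no activation. The symbols \( W_1, b_1, W_2, b_2 \) (weight matrices and bias vectors) and \( W', b' \) are notation assumed here for illustration; they do not appear elsewhere in this module.

\[ W_2 \left( W_1 x + b_1 \right) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2) = W' x + b' \]

The right-hand side is a single layer with weights \( W' = W_2 W_1 \) and bias \( b' = W_2 b_1 + b_2 \). Repeating the same step absorbs any number of additional activation-free layers into one.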

In the activity below, three ReLU neurons feed a single output unit, and you specify the contribution of each. Attempt to approximate the sinusoidal target. Then switch the activation to linear — the output collapses to a straight line regardless of the parameter settings.
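The contrast this activity draws can be sketched in plain Python. The weights, biases, and output contributions below are hypothetical hand-picked values, not the activity's actual settings; they are chosen so that the ReLU version traces a triangle-wave approximation to \( \sin(x) \) on \( [0, 2\pi] \).

```python
import math

def relu(z):
    return max(0.0, z)

def identity(z):
    # "Linear" activation: passes the weighted sum through unchanged.
    return z

# Hypothetical units: (input weight, bias, contribution to the output).
# Each ReLU unit switches on at a different x, adding a kink to the curve.
UNITS = [
    (1.0, 0.0,               2 / math.pi),   # rises from x = 0
    (1.0, -math.pi / 2,     -4 / math.pi),   # bends downward at x = pi/2
    (1.0, -3 * math.pi / 2,  4 / math.pi),   # bends upward at x = 3*pi/2
]

def network(x, activation):
    """Single output unit: a weighted sum of three activated neurons."""
    return sum(c * activation(w * x + b) for w, b, c in UNITS)

def is_straight_line(ys, tol=1e-9):
    """For equally spaced inputs, a straight line has zero second differences."""
    return all(abs(ys[i - 1] - 2 * ys[i] + ys[i + 1]) < tol
               for i in range(1, len(ys) - 1))

xs = [i * 2 * math.pi / 8 for i in range(9)]        # 0, pi/4, ..., 2*pi
relu_out = [network(x, relu) for x in xs]
linear_out = [network(x, identity) for x in xs]

print("ReLU output is a straight line:  ", is_straight_line(relu_out))
print("Linear output is a straight line:", is_straight_line(linear_out))
```

With the same parameters, only the choice of activation changes: the ReLU network peaks near \( x = \pi/2 \) and dips near \( x = 3\pi/2 \), while the linear network reduces to one line with slope equal to the sum of the contributions.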

Activation functions in Keras

from tensorflow.keras.layers import Dense

Dense(16, activation='relu')      # hidden layer — ReLU is the usual default
Dense(16, activation='tanh')      # tanh: zero-centered alternative
Dense(1,  activation='sigmoid')   # output: a 0–1 probability for yes/no

If the activation argument is omitted, a Dense layer applies no activation and is purely linear. Omit it from every layer and the network collapses to a single equivalent linear model, which is precisely the behavior demonstrated by the activity above.
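The collapse can be checked numerically. The sketch below uses plain Python lists rather than Keras, and the helper names and all weight values are hypothetical, picked only to keep the arithmetic easy to follow.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def dense(W, b, x, activation=None):
    """One fully connected layer: W x + b, optionally followed by an activation."""
    z = [zi + bi for zi, bi in zip(matvec(W, x), b)]
    return z if activation is None else [activation(v) for v in z]

def relu(z):
    return max(0.0, z)

# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]]
b1 = [0.5, -1.0, 2.0]
W2 = [[2.0, -1.0, 0.5]]
b2 = [0.25]

# Fold the two activation-free layers into one equivalent layer.
W_eq = matmul(W2, W1)
b_eq = [v + c for v, c in zip(matvec(W2, b1), b2)]

x = [3.0, -1.0]
two_linear = dense(W2, b2, dense(W1, b1, x))          # no activation anywhere
one_linear = dense(W_eq, b_eq, x)                     # single folded layer
with_relu = dense(W2, b2, dense(W1, b1, x, relu))     # ReLU on the hidden layer

print(two_linear, one_linear, with_relu)
```

The first two results agree for every input, because the folded layer is algebraically identical to the stack. The ReLU version differs, since the hidden activation prevents the two layers from being merged.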

AI anchor: ReLU and the resurgence of deep networks

For several decades, deep networks were difficult to train, in part because the sigmoid and tanh activations saturate: their derivatives approach zero for inputs of large magnitude, so the gradient signal shrinks toward zero as it propagates backward through many layers (the vanishing gradient problem). ReLU largely mitigated this: its derivative is exactly 1 for all positive inputs, so the activation itself does not attenuate the gradient on active units. This change, together with larger datasets and faster GPUs, is widely credited with contributing to the resurgence of deep learning circa 2012.
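The saturation effect can be made concrete with a small calculation. The sketch below is a deliberate simplification: it multiplies only the activation derivatives across layers and ignores the weight matrices, which also scale the gradient in a real network. The function names and the depth of 10 are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s), which is at most 0.25 (at z = 0)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: exactly 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chained_activation_grad(grad_fn, z, depth):
    """Product of the activation derivative across `depth` layers,
    assuming every layer sees the same pre-activation z (weights ignored)."""
    return grad_fn(z) ** depth

depth = 10
print("sigmoid, best case (z = 0):", chained_activation_grad(sigmoid_grad, 0.0, depth))
print("sigmoid, saturated (z = 5):", chained_activation_grad(sigmoid_grad, 5.0, depth))
print("ReLU, active unit  (z = 2):", chained_activation_grad(relu_grad, 2.0, depth))
```

Even in the best case for sigmoid, ten layers scale the gradient by \( 0.25^{10} \), roughly one millionth, whereas an active ReLU path contributes a factor of exactly 1.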

Why this matters next

Non-linear neurons have now been introduced. Module 3 composes them into a complete neural network and propagates data through the architecture (the forward pass), illustrating how a hidden layer transforms the raw input space into a representation in which the final output neuron can produce a non-linear decision boundary.

Summary: an activation function supplies the non-linearity that lets each layer do more than rescale and shift the output of the previous one. Without non-linear activations, any stack of layers collapses to a single linear model, which is why every effective network places activation functions (ReLU, tanh, sigmoid) between its layers.
