DAY 75

Activations ⚡🧠🎯

Learn what activation functions do between layers — how ReLU kills negatives to keep learning alive, how Sigmoid squashes outputs into probabilities, and the golden rule: ReLU for hidden layers, match output activation to your task.

⏱ 15 mins
⚡ +50 XP
Activations ⚡🧠🎯

Day 75: Activation Functions

Why Should I Care?

Without activation functions, a neural network with 100 layers behaves exactly like a network with 1 layer. No matter how many layers you stack, the whole thing is just one straight line of math. Activation functions are what break that line. They give the network the ability to learn curves, patterns, and complexity. They are the filter between every layer — and the source of all intelligence in the network.

Core Concept

An activation function sits between layers and decides what signal passes through, how strongly, and what gets blocked. Think of it as an executive assistant who filters every message before it reaches the manager. Strong positive signal — PASS. Weak signal — REDUCED. Negative signal — BLOCKED. The manager never sees irrelevant noise. Only useful, scaled signals get through. The two most important activation functions are ReLU and Sigmoid. Each has a different job.

How It Works

ReLU is simple: if the value is negative, output zero. If positive, pass it unchanged. Dead below zero. Alive above zero. That is the whole rule. ReLU is fast, simple, and the default choice for all hidden layers. Sigmoid is different: it squashes any value — no matter how large or small — into a range between 0 and 1. This makes it perfect for output layers in binary classification — yes or no, spam or not spam, fraud or not fraud. The activation role is to filter, scale, decide, and protect learning. Both ReLU and Sigmoid do this in their own way.

ReLU:    relu(x)    = max(0, x)
         relu(-3)   = 0      (blocked)
         relu(3)    = 3      (passed unchanged)

Sigmoid: sigmoid(x) = 1 / (1 + e^(-x))
         sigmoid(-2) = 0.12  (near zero)
         sigmoid(3)  = 0.95  (near one)

Real World Connection

When Zomato predicts whether you will order again tonight — the output layer uses Sigmoid. It squashes the network output to a probability between 0 and 1. Above 0.5 means yes, below means no. All the hidden layers inside that model use ReLU — fast, simple, kills noise, keeps learning moving. When Google Photos classifies a photo into one of 100 categories — the output layer uses Softmax, not Sigmoid. Softmax is like Sigmoid but for more than two classes — it spreads probability across all options so they add up to 1. Rule: match the output activation to the task. Hidden layers always get ReLU.

Examples

import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

values = np.array([-3, -1, 0, 1, 3])

print(f"Input   : {values}")
print(f"ReLU    : {relu(values)}")
print(f"Sigmoid : {sigmoid(values).round(2)}")
print()
print("--- When to use ---")
print("ReLU    -> hidden layers (fast, simple)")
print("Sigmoid -> output layer for binary classification")

# OUTPUT:
# Input   : [-3  -1   0   1   3]
# ReLU    : [ 0   0   0   1   3]
# Sigmoid : [0.05 0.27 0.5 0.73 0.95]
# ReLU killed first three negatives -- last two passed unchanged
# Sigmoid squashed all values between 0 and 1

Common Mistakes

Mistake 1 — Using Sigmoid on all hidden layers:

-- WRONG:
Dense(16, activation="sigmoid")   # hidden layer
Dense(8,  activation="sigmoid")   # hidden layer
Dense(1,  activation="sigmoid")   # output layer
# sigmoid on hidden layers causes vanishing gradients
# learning grinds to a complete halt in deep networks

-- CORRECT:
Dense(16, activation="relu")      # hidden layer
Dense(8,  activation="relu")      # hidden layer
Dense(1,  activation="sigmoid")   # output layer -- binary only
# ReLU for hidden. Sigmoid squashes gradients -- deep networks
# stop learning entirely when sigmoid is used throughout.

Mistake 2 — No activation on the output layer:

-- WRONG:
Dense(1)   # no activation on output layer
# predictions can go above 1 or below 0
# outside valid probability range entirely

-- CORRECT:
Dense(1, activation="sigmoid")    # binary yes/no
Dense(5, activation="softmax")    # multi-class (5 categories)
Dense(1)                          # regression -- no activation fine here
# wrong output activation gives probabilities above 1 or below 0
# always match output activation to your task

Mini Challenge

Mini Challenge

Open Google Colab. Copy the vectorized ReLU and Sigmoid code from the Examples section. Change the values array to [-10, -2, 0, 2, 10] and run. Look at the ReLU output — which values survived and which got killed? Look at the Sigmoid output — what is the range of all five values? Now answer: if this network is predicting whether a Swiggy user will reorder, which activation goes on the output layer? That one question is the whole lesson locked in your brain.

Quick Quiz

Q: What does ReLU do to a negative value?
A: ReLU replaces any negative value with zero and passes positive values through unchanged — dead below zero, alive above zero.

Q: When should you use Sigmoid instead of ReLU?
A: Use Sigmoid on the output layer when your task is binary classification — yes or no, spam or not spam, fraud or not fraud. Never use it on hidden layers.

Q: What happens to a deep network if you use Sigmoid on all hidden layers?
A: Vanishing gradients — Sigmoid squashes values so aggressively that gradients shrink to near zero during backpropagation, and the network stops learning entirely.

Bonus Knowledge

There is a third activation function you will use often — Softmax. Softmax is the output layer choice when your task has more than two classes. It takes all output values and converts them into probabilities that add up to exactly 1. For example: cat 60%, dog 30%, bird 10% — that is Softmax at work. The golden rule for activation functions: ReLU for every hidden layer. Match the output activation to your task. Binary classification uses Sigmoid. Multi-class classification uses Softmax. Regression uses no activation at all. Non-linearity is the ability to learn curves. Activation is what gives the network that power.

Key Takeaways

Key Takeaways

  • Activation functions sit between layers and filter signals — deciding what passes, how strongly, and what gets blocked.
  • Without activation functions, any network no matter how deep is just one straight line of math.
  • ReLU kills negatives and passes positives unchanged — fast, simple, default for all hidden layers.
  • Sigmoid squashes any value into a range between 0 and 1 — perfect for binary output layers.
  • Never use Sigmoid on hidden layers — it causes vanishing gradients and stops learning entirely.
  • Golden rule: ReLU for hidden layers. Sigmoid for binary output. Softmax for multi-class output. None for regression.
  • Activation is what gives the network the ability to learn curves. Without it, depth means nothing.

← Previous Lesson