Gradient Descent

Gradient Descent is an optimization algorithm used to find the minimum of a function.

In machine learning, we use it to minimize the loss (error) of a model by adjusting its parameters.

Intuition (Mountain analogy)

Imagine you are standing on a foggy mountain and want to reach the lowest point (valley).

  • You can’t see the whole mountain

  • You only know the slope at your current position

  • So you:

    1. Check the slope

    2. Take a small step downhill

    3. Repeat until you reach the bottom

That’s Gradient Descent.

  • Mountain height → Loss function

  • Your position → Model parameters

  • Slope → Gradient

  • Step size → Learning rate
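
The three steps above can be sketched as a tiny loop. This is a minimal illustration (the function `descend` and its arguments are just names chosen here, not a standard API): `slope` plays the role of the gradient, `lr` the learning rate.

```python
# Minimal sketch of the mountain analogy: repeatedly step downhill.
# slope(x) is the derivative of the "height" at x; lr is the step size.

def descend(slope, start, lr=0.1, steps=100):
    x = start
    for _ in range(steps):
        x = x - lr * slope(x)  # check the slope, take a small step downhill
    return x

# Example: f(x) = (x - 3)^2 has slope 2*(x - 3) and its minimum at x = 3.
result = descend(lambda x: 2 * (x - 3), start=10.0)
```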


Why do we need Gradient Descent?

Most ML models learn by minimizing a loss function:

Loss = f(θ)

Where:

  • θ = model parameters (weights, biases)

  • Goal: find θ that minimizes loss

For complex models:

  • No closed-form solution

  • Too expensive to try all possibilities

Gradient Descent efficiently finds the minimum.

What is a Gradient?

The gradient is a vector of partial derivatives:

∇f(θ) = [∂f/∂θ₁, ∂f/∂θ₂, ...]

It points in the direction of steepest increase of f.

To minimize, we move in the opposite direction.
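
A quick numeric check of this, using f(θ) = θ₁² + θ₂² as an example (the function and point are chosen here purely for illustration): stepping against the gradient lowers f, stepping along it raises f.

```python
# For f(t1, t2) = t1^2 + t2^2 the gradient is [2*t1, 2*t2].

def f(t):
    return t[0] ** 2 + t[1] ** 2

def grad(t):
    return [2 * t[0], 2 * t[1]]

t = [3.0, 4.0]
g = grad(t)          # direction of steepest increase at t
alpha = 0.01

downhill = [t[0] - alpha * g[0], t[1] - alpha * g[1]]  # against the gradient
uphill = [t[0] + alpha * g[0], t[1] + alpha * g[1]]    # along the gradient
```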

Gradient Descent Update Rule

θ = θ − α∇f(θ)

Where:

  • α = learning rate (step size)

  • ∇f(θ) = gradient of the loss

Key idea:

  • Big gradient → big step

  • Small gradient → small step

Simple Example (1D)

Function:

f(x) = x^2


Derivative:

df/dx = 2x

Update:

x = x − α · 2x

If:

  • Start at x = 10

  • Learning rate α = 0.1

Steps:
x = 10
x = 8
x = 6.4
x = 5.12
...
→ 0
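
The same sequence can be reproduced in a few lines of code (a minimal sketch, with the same starting point and learning rate as above):

```python
# 1D example: f(x) = x^2, df/dx = 2x, start at x = 10, alpha = 0.1.
x = 10.0
alpha = 0.1
trace = [x]
for _ in range(3):
    x = x - alpha * 2 * x  # update rule: x = x - alpha * df/dx
    trace.append(x)
# trace ≈ [10.0, 8.0, 6.4, 5.12] — the sequence shown above
```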

Learning Rate (α) – VERY IMPORTANT

Learning rate    Effect
Too small        Very slow convergence
Too large        Overshoots, may diverge
Just right       Fast and stable

Common values: 0.1, 0.01, 0.001

Types of Gradient Descent

Batch Gradient Descent

  • Uses entire dataset to compute gradient

  • Stable but slow for large data

∇f(θ) = (1/N) · Σᵢ₌₁ᴺ ∇Lᵢ


Stochastic Gradient Descent (SGD)

  • Uses one data point at a time

  • Faster, noisy updates

  • May not converge exactly


Mini-Batch Gradient Descent (most common)

  • Uses small batches (e.g., 32, 64, 128)

  • Balance between speed & stability
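
The three variants differ only in how the gradient is estimated each step. A minimal sketch on a toy dataset (the per-example gradient here is taken from the linear-regression example below, with b omitted; all names are illustrative):

```python
import random

# Toy dataset: y = 2*x, so the true parameter is w = 2.
data = [(x, 2.0 * x) for x in range(1, 9)]

def grad_i(w, x, y):
    # d/dw of (y - w*x)^2 for a single example
    return -2 * x * (y - w * x)

def batch_grad(w):
    # Batch GD: average over the entire dataset (stable, expensive)
    return sum(grad_i(w, x, y) for x, y in data) / len(data)

def sgd_grad(w):
    # SGD: one random example (fast, noisy)
    x, y = random.choice(data)
    return grad_i(w, x, y)

def minibatch_grad(w, batch_size=4):
    # Mini-batch GD: average over a small random batch
    batch = random.sample(data, batch_size)
    return sum(grad_i(w, x, y) for x, y in batch) / batch_size
```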


Gradient Descent in Linear Regression

Model:

y=wx+b

Loss (MSE):

L = (1/n) · Σ (y − ŷ)²

Gradients (with ŷ = wx + b):

∂L/∂w = −(2/n) · Σ x(y − ŷ)
∂L/∂b = −(2/n) · Σ (y − ŷ)

Update:

w = w - α * dw

b = b - α * db

Repeat until loss stops decreasing.
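
Putting this together, here is a minimal sketch of the full training loop on a tiny synthetic dataset (the data, the true parameters w = 3 and b = 1, and values like alpha = 0.05 are all chosen here for illustration):

```python
# Gradient descent for y = w*x + b with MSE loss.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 1.0 for x in xs]  # perfect line y = 3x + 1, no noise

w, b = 0.0, 0.0
alpha = 0.05
n = len(xs)

for _ in range(2000):
    preds = [w * x + b for x in xs]
    # Gradients of L = (1/n) * sum((y - y_hat)^2):
    dw = (-2 / n) * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
    db = (-2 / n) * sum(y - p for y, p in zip(ys, preds))
    w = w - alpha * dw
    b = b - alpha * db
```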


Problems with Gradient Descent

Local Minima

  • Can get stuck (less of an issue in deep learning)

Saddle Points

  • Flat regions slow training

Vanishing / Exploding Gradients

  • Common in deep neural networks

Improvements & Variants

Algorithm    Idea
Momentum     Accelerates in a consistent direction
RMSProp      Adapts learning rate per parameter
Adam         Combines Momentum + RMSProp

Adam is the default choice in most DL frameworks.
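
As one example of these variants, here is a minimal sketch of Momentum on f(x) = x² (the function name and the coefficient β = 0.9, a common default, are illustrative):

```python
# Momentum: keep a velocity that accumulates past gradients,
# so steps accelerate when gradients point in a consistent direction.

def momentum_descent(grad, start, alpha=0.1, beta=0.9, steps=300):
    x, v = start, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)  # exponentially weighted sum of gradients
        x = x - alpha * v       # step along the accumulated velocity
    return x

x_min = momentum_descent(lambda x: 2 * x, start=10.0)  # minimum of x^2 is at 0
```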
