Gradient Descent

Gradient Descent is an optimization algorithm used to find the minimum of a function.

In machine learning, we use it to minimize the loss (error) of a model by adjusting its parameters.

Intuition (Mountain analogy)

Imagine you are standing on a foggy mountain and want to reach the lowest point (valley).

  • You can’t see the whole mountain

  • You only know the slope at your current position

  • So you:

    1. Check the slope

    2. Take a small step downhill

    3. Repeat until you reach the bottom

That’s Gradient Descent.

  • Mountain height → Loss function

  • Your position → Model parameters

  • Slope → Gradient

  • Step size → Learning rate
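
The three steps above can be sketched as a tiny loop. This is a minimal illustration (the function `descend` and its arguments are just names chosen here, not a standard API): `slope` plays the role of the gradient, `lr` the learning rate.

```python
# Minimal sketch of the mountain analogy: repeatedly step downhill.
# slope(x) is the derivative of the "height" at x; lr is the step size.

def descend(slope, start, lr=0.1, steps=100):
    x = start
    for _ in range(steps):
        x = x - lr * slope(x)  # check the slope, take a small step downhill
    return x

# Example: f(x) = (x - 3)^2 has slope 2*(x - 3) and its minimum at x = 3.
result = descend(lambda x: 2 * (x - 3), start=10.0)
```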


Why do we need Gradient Descent?

Most ML models learn by minimizing a loss function:

Loss = f(θ)

Where:

  • θ = model parameters (weights, biases)

  • Goal: find θ that minimizes loss

For complex models:

  • No closed-form solution

  • Too expensive to try all possibilities

Gradient Descent efficiently finds the minimum.

What is a Gradient?

The gradient is a vector of partial derivatives:

∇f(θ) = [∂f/∂θ₁, ∂f/∂θ₂, ...]

It points in the direction of steepest increase of f.

To minimize, we move in the opposite direction.
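
A quick numeric check of this, using f(θ) = θ₁² + θ₂² as an example (the function and point are chosen here purely for illustration): stepping against the gradient lowers f, stepping along it raises f.

```python
# For f(t1, t2) = t1^2 + t2^2 the gradient is [2*t1, 2*t2].

def f(t):
    return t[0] ** 2 + t[1] ** 2

def grad(t):
    return [2 * t[0], 2 * t[1]]

t = [3.0, 4.0]
g = grad(t)          # direction of steepest increase at t
alpha = 0.01

downhill = [t[0] - alpha * g[0], t[1] - alpha * g[1]]  # against the gradient
uphill = [t[0] + alpha * g[0], t[1] + alpha * g[1]]    # along the gradient
```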

Gradient Descent Update Rule

θ = θ − α∇f(θ)

Where:

  • α = learning rate (step size)

  • ∇f(θ) = gradient of the loss

Key idea:

  • Big gradient → big step

  • Small gradient → small step

Simple Example (1D)

Function:

f(x) = x^2


Derivative:

df/dx = 2x

Update:

x = x − α · 2x

If:

  • Start at x = 10

  • Learning rate α = 0.1

Steps:
x = 10
x = 8
x = 6.4
x = 5.12
...
→ 0
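
The same sequence can be reproduced in a few lines of code (a minimal sketch, with the same starting point and learning rate as above):

```python
# 1D example: f(x) = x^2, df/dx = 2x, start at x = 10, alpha = 0.1.
x = 10.0
alpha = 0.1
trace = [x]
for _ in range(3):
    x = x - alpha * 2 * x  # update rule: x = x - alpha * df/dx
    trace.append(x)
# trace ≈ [10.0, 8.0, 6.4, 5.12] — the sequence shown above
```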

Learning Rate (α) – VERY IMPORTANT

Learning rate    Effect
Too small        Very slow convergence
Too large        Overshoots, may diverge
Just right       Fast and stable

Common values: 0.1, 0.01, 0.001

Types of Gradient Descent

Batch Gradient Descent

  • Uses entire dataset to compute gradient

  • Stable but slow for large data

∇f(θ) = (1/N) · Σᵢ₌₁ᴺ ∇Lᵢ


Stochastic Gradient Descent (SGD)

  • Uses one data point at a time

  • Faster, noisy updates

  • May not converge exactly


Mini-Batch Gradient Descent (most common)

  • Uses small batches (e.g., 32, 64, 128)

  • Balance between speed & stability
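
The three variants differ only in how the gradient is estimated each step. A minimal sketch on a toy dataset (the per-example gradient here is taken from the linear-regression example below, with b omitted; all names are illustrative):

```python
import random

# Toy dataset: y = 2*x, so the true parameter is w = 2.
data = [(x, 2.0 * x) for x in range(1, 9)]

def grad_i(w, x, y):
    # d/dw of (y - w*x)^2 for a single example
    return -2 * x * (y - w * x)

def batch_grad(w):
    # Batch GD: average over the entire dataset (stable, expensive)
    return sum(grad_i(w, x, y) for x, y in data) / len(data)

def sgd_grad(w):
    # SGD: one random example (fast, noisy)
    x, y = random.choice(data)
    return grad_i(w, x, y)

def minibatch_grad(w, batch_size=4):
    # Mini-batch GD: average over a small random batch
    batch = random.sample(data, batch_size)
    return sum(grad_i(w, x, y) for x, y in batch) / batch_size
```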


Gradient Descent in Linear Regression

Model:

y=wx+b

Loss (MSE):

L = (1/n) · Σ (y − ŷ)²

Gradients (with ŷ = wx + b):

∂L/∂w = −(2/n) · Σ x(y − ŷ)
∂L/∂b = −(2/n) · Σ (y − ŷ)

Update:

w = w - α * dw

b = b - α * db

Repeat until loss stops decreasing.
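
Putting this together, here is a minimal sketch of the full training loop on a tiny synthetic dataset (the data, the true parameters w = 3 and b = 1, and values like alpha = 0.05 are all chosen here for illustration):

```python
# Gradient descent for y = w*x + b with MSE loss.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 1.0 for x in xs]  # perfect line y = 3x + 1, no noise

w, b = 0.0, 0.0
alpha = 0.05
n = len(xs)

for _ in range(2000):
    preds = [w * x + b for x in xs]
    # Gradients of L = (1/n) * sum((y - y_hat)^2):
    dw = (-2 / n) * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
    db = (-2 / n) * sum(y - p for y, p in zip(ys, preds))
    w = w - alpha * dw
    b = b - alpha * db
```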


Problems with Gradient Descent

Local Minima

  • Can get stuck (less of an issue in deep learning)

Saddle Points

  • Flat regions slow training

Vanishing / Exploding Gradients

  • Common in deep neural networks

Improvements & Variants

Algorithm    Idea
Momentum     Accelerates in a consistent direction
RMSProp      Adapts learning rate per parameter
Adam         Combines Momentum + RMSProp

Adam is the default choice in most DL frameworks.
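
As one example of these variants, here is a minimal sketch of Momentum on f(x) = x² (the function name and the coefficient β = 0.9, a common default, are illustrative):

```python
# Momentum: keep a velocity that accumulates past gradients,
# so steps accelerate when gradients point in a consistent direction.

def momentum_descent(grad, start, alpha=0.1, beta=0.9, steps=300):
    x, v = start, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)  # exponentially weighted sum of gradients
        x = x - alpha * v       # step along the accumulated velocity
    return x

x_min = momentum_descent(lambda x: 2 * x, start=10.0)  # minimum of x^2 is at 0
```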
