Gradient Descent
Gradient Descent is an optimization algorithm used to find the minimum of a function.
In machine learning, we use it to minimize the loss (error) of a model by adjusting its parameters.

Intuition (Mountain analogy)
Imagine you are standing on a foggy mountain and want to reach the lowest point (valley).
- You can’t see the whole mountain
- You only know the slope at your current position

So you:

- Check the slope
- Take a small step downhill
- Repeat until you reach the bottom

That’s Gradient Descent.

- Mountain height → Loss function
- Your position → Model parameters
- Slope → Gradient
- Step size → Learning rate
Why do we need Gradient Descent?
Most ML models learn by minimizing a loss function:
Loss = f(θ)
Where:
- θ = model parameters (weights, bias)
- Goal: find θ that minimizes loss
For complex models:
- No closed-form solution
- Too expensive to try all possibilities
Gradient Descent efficiently finds the minimum.
What is a Gradient?
The gradient is a vector of partial derivatives:

∇f(θ) = [ ∂f/∂θ1, ∂f/∂θ2, …, ∂f/∂θn ]

It tells us the direction of steepest increase.

To minimize, we move in the opposite direction.
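Conceptually, each component of the gradient is the partial derivative of f with respect to one parameter. A small sketch (the function f and the finite-difference helper below are illustrative, not part of any library):

```python
def numerical_gradient(f, theta, eps=1e-6):
    """Approximate ∇f(θ) with central finite differences, one parameter at a time."""
    grad = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad

# Example: f(θ) = θ1² + 3·θ2, so the gradient is [2·θ1, 3].
f = lambda t: t[0] ** 2 + 3 * t[1]
print(numerical_gradient(f, [1.0, 2.0]))  # ≈ [2.0, 3.0]
```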
Gradient Descent Update Rule
θ = θ - α * ∇f(θ)

Where:

- α = learning rate (step size)
- ∇f(θ) = gradient

Key idea:

- Big gradient → big step
- Small gradient → small step
Simple Example (1D)
Function:
f(x)=x^2
Derivative:

f'(x) = 2x

Update:

x = x - α * 2x = (1 - 2α) * x

If we:

- Start at some initial point x₀
- Pick a small learning rate (e.g., α = 0.1)

then each step multiplies x by (1 - 2α), so x shrinks toward 0, the minimum of f.
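A minimal runnable sketch of this 1D example (the starting point and learning rate below are illustrative choices, not prescribed values):

```python
# Gradient descent on f(x) = x^2, whose minimum is at x = 0.
def gradient_descent_1d(x0, alpha, steps):
    x = x0
    for _ in range(steps):
        grad = 2 * x          # f'(x) = 2x
        x = x - alpha * grad  # update rule: x = x - α * f'(x)
    return x

print(gradient_descent_1d(x0=5.0, alpha=0.1, steps=50))  # ≈ 7e-05, effectively 0
```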
Learning Rate (α) – VERY IMPORTANT
| Learning Rate | Effect |
|---|---|
| Too small | Very slow convergence |
| Too large | Overshoots, may diverge |
| Just right | Fast and stable |
Common values: 0.1, 0.01, 0.001
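To see these effects concretely, here is a small sketch on the same f(x) = x² example (the α values and step count are illustrative):

```python
# Effect of the learning rate when minimizing f(x) = x^2, starting from x = 5.0.
def run(alpha, steps=20, x=5.0):
    for _ in range(steps):
        x = x - alpha * 2 * x   # gradient descent update with f'(x) = 2x
    return x

print(run(0.001))  # too small:  ~4.8, barely moved after 20 steps
print(run(1.5))    # too large:  ~5e6, overshoots and diverges
print(run(0.1))    # just right: ~0.06, close to the minimum
```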
Types of Gradient Descent
Batch Gradient Descent
- Uses entire dataset to compute gradient
- Stable but slow for large data

Stochastic Gradient Descent (SGD)

- Uses one data point at a time
- Faster, noisy updates
- May not converge exactly

Mini-Batch Gradient Descent (Most common)

- Uses small batches (e.g., 32, 64, 128)
- Balance between speed & stability (see the sketch below)
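The three variants differ only in how much data feeds each gradient estimate. Below is a hedged, generic sketch of the mini-batch loop; data, compute_gradients, and the hyperparameters are placeholders, not a specific library's API:

```python
import random

def minibatch_gradient_descent(data, theta, compute_gradients,
                               alpha=0.01, batch_size=32, epochs=10):
    """Shuffle the data, split it into batches, and update θ once per batch."""
    for _ in range(epochs):
        random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grads = compute_gradients(theta, batch)            # gradient from this batch only
            theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta

# batch_size = len(data) → Batch GD; batch_size = 1 → SGD; anything between → mini-batch.
```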
Gradient Descent in Linear Regression
Model:

ŷ = w * x + b

Loss (MSE):

L = (1/n) * Σ (ŷᵢ - yᵢ)²

Gradients:

dw = (2/n) * Σ (ŷᵢ - yᵢ) * xᵢ
db = (2/n) * Σ (ŷᵢ - yᵢ)
Update:
w = w - α * dw
b = b - α * db
Repeat until loss stops decreasing.
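Putting the pieces together, a minimal sketch of these updates in plain Python (the toy dataset and hyperparameters are illustrative):

```python
# Fit y ≈ w*x + b by gradient descent on the MSE loss.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]          # generated from y = 2x + 1
w, b, alpha, n = 0.0, 0.0, 0.05, len(x)

for _ in range(2000):
    y_pred = [w * xi + b for xi in x]
    errors = [yp - yi for yp, yi in zip(y_pred, y)]
    dw = (2 / n) * sum(e * xi for e, xi in zip(errors, x))   # ∂L/∂w
    db = (2 / n) * sum(errors)                               # ∂L/∂b
    w = w - alpha * dw
    b = b - alpha * db

print(w, b)  # approaches w ≈ 2.0, b ≈ 1.0
```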
Problems with Gradient Descent
Local Minima
- Can get stuck (less of an issue in deep learning)

Saddle Points

- Flat regions slow training

Vanishing / Exploding Gradients

- Common in deep neural networks
Improvements & Variants
| Algorithm | Idea |
|---|---|
| Momentum | Accelerates in consistent direction |
| RMSProp | Adapts learning rate per parameter |
| Adam | Combines Momentum + RMSProp |
Adam is the default choice in most DL frameworks.
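For reference, a compact sketch of the Adam update applied to the earlier f(x) = x² example (the constants β1, β2, ε follow the commonly cited defaults; the learning rate and step count are illustrative, and this is not any framework's exact implementation):

```python
import math

# Adam on f(x) = x^2: a momentum-style average (m) combined with an
# RMSProp-style per-parameter scaling (v).
x, alpha = 5.0, 0.01
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = v = 0.0

for t in range(1, 3001):
    g = 2 * x                                   # gradient of x^2
    m = beta1 * m + (1 - beta1) * g             # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g * g         # second-moment (RMSProp) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)

print(x)  # ends close to 0, the minimum
```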