
Introduction to Optimization: Momentum, RMSProp, Adam
If you’ve been working with machine learning models, you’ve probably wondered why your gradient descent takes forever to converge or gets stuck in local minima that make you question your life choices. Optimization algorithms like Momentum, RMSProp, and Adam are the heroes that can save you from these headaches by making your model training faster, more stable, and less likely to get trapped in suboptimal solutions. In this post, we’ll dive deep into how these algorithms work under the hood, when to use each one, and how to implement them properly in your projects.
Understanding the Problem with Vanilla Gradient Descent
Before we jump into the fancy optimizers, let’s be real about why vanilla gradient descent can be frustrating. Standard gradient descent updates parameters using a fixed learning rate, which leads to several issues:
- Oscillations around steep valleys in the loss landscape
- Slow convergence on flat surfaces
- Getting stuck in local minima or saddle points
- Same learning rate applied to all parameters regardless of their gradient history
Here’s what vanilla gradient descent looks like:
# Vanilla Gradient Descent
theta = theta - learning_rate * gradient
Simple, but not smart. It’s like using the same hammer for every nail, screw, and delicate electronic component.
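To make these failure modes concrete, here is a small illustrative sketch (a toy, not a benchmark) that runs plain gradient descent on a deliberately ill-conditioned quadratic; the loss function, learning rate, and step count are arbitrary choices:

import numpy as np

# Ill-conditioned quadratic: f(x, y) = 0.5 * (100 * x**2 + y**2)
# The x-direction is steep, the y-direction is nearly flat.
def grad(theta):
    return np.array([100.0 * theta[0], theta[1]])

theta = np.array([1.0, 1.0])
learning_rate = 0.019  # near the stability limit for the steep direction

for step in range(1, 51):
    theta = theta - learning_rate * grad(theta)
    if step <= 4 or step == 50:
        print(step, theta.round(3))
# x bounces back and forth across the steep valley (its sign flips every step),
# while y creeps toward 0 far more slowly.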
Momentum: Adding Memory to Gradient Descent
Momentum solves the oscillation problem by adding a “memory” component that smooths out the parameter updates. Think of it like a ball rolling down a hill – it builds up speed in consistent directions and dampens oscillations.
How Momentum Works
Momentum maintains an exponentially decaying moving average of past gradients and uses that to determine the update direction:
# Momentum implementation
v = beta * v + gradient # velocity update
theta = theta - learning_rate * v # parameter update
The beta parameter (typically 0.9) controls how much “memory” the optimizer has. Higher values mean more momentum, lower values make it more responsive to current gradients.
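To get a feel for what that memory buys you: if the gradient were perfectly constant, the velocity update v = beta * v + gradient converges to gradient / (1 - beta), so beta=0.9 amplifies a persistent direction by roughly 10x. A tiny illustrative check (a constant gradient is an idealization you won't see in real training):

beta = 0.9
gradient = 1.0
v = 0.0
for _ in range(50):
    v = beta * v + gradient
print(v)                      # ~9.95, approaching the limit below
print(gradient / (1 - beta))  # 10.0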
Implementation Example
Here’s a complete momentum optimizer implementation in Python:
import numpy as np

class MomentumOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, params, gradients):
        for param_name in params:
            if param_name not in self.velocity:
                self.velocity[param_name] = np.zeros_like(params[param_name])
            # Update velocity
            self.velocity[param_name] = (self.momentum * self.velocity[param_name] +
                                         gradients[param_name])
            # Update parameters
            params[param_name] -= self.learning_rate * self.velocity[param_name]
        return params

# Usage example
optimizer = MomentumOptimizer(learning_rate=0.01, momentum=0.9)
updated_params = optimizer.update(model_params, computed_gradients)
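As a quick sanity check of the oscillation-damping claim, here is an illustrative comparison on the same kind of ill-conditioned quadratic used in the earlier vanilla-GD sketch, reusing the MomentumOptimizer class and numpy import from above; the learning rate and step count are arbitrary, so read the printout qualitatively rather than as a benchmark:

def grad(params):
    # Gradient of f(x, y) = 0.5 * (100 * x**2 + y**2)
    x, y = params["theta"]
    return {"theta": np.array([100.0 * x, y])}

def run(opt, steps=200):
    params = {"theta": np.array([1.0, 1.0])}
    for _ in range(steps):
        params = opt.update(params, grad(params))
    return params["theta"]

# momentum=0.0 reduces to vanilla gradient descent
print(run(MomentumOptimizer(learning_rate=0.005, momentum=0.0)))
print(run(MomentumOptimizer(learning_rate=0.005, momentum=0.9)))
# With momentum, the slowly-converging y coordinate ends up far closer to 0.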
When to Use Momentum
- Training deep neural networks where gradient directions are noisy
- Optimization landscapes with many local minima
- When you want faster convergence without too much complexity
- Computer vision tasks where momentum typically works well out of the box
RMSProp: Adaptive Learning Rates
RMSProp (Root Mean Square Propagation) takes a different approach by adapting the learning rate for each parameter based on the magnitude of recent gradients. It’s particularly good at handling different scales of gradients across parameters.
How RMSProp Works
Instead of using the same learning rate for all parameters, RMSProp maintains a moving average of squared gradients and scales the learning rate accordingly:
# RMSProp algorithm (element-wise, NumPy-style)
v = decay_rate * v + (1 - decay_rate) * gradient**2
theta = theta - learning_rate * gradient / (np.sqrt(v) + epsilon)
The key insight is that parameters with large gradients get smaller effective learning rates, while parameters with small gradients get relatively larger learning rates.
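To put rough numbers on that, here is a tiny illustrative calculation (values picked arbitrarily) of the per-parameter scaling factor learning_rate / (sqrt(v) + epsilon) when the running averages have settled near the squared gradient magnitudes:

import numpy as np

learning_rate, epsilon = 0.001, 1e-8
v_large = 10.0**2  # parameter whose gradients hover around 10
v_small = 0.1**2   # parameter whose gradients hover around 0.1

print(learning_rate / (np.sqrt(v_large) + epsilon))  # ~1e-4 per unit of gradient
print(learning_rate / (np.sqrt(v_small) + epsilon))  # ~1e-2 per unit of gradient
# Multiplied by the gradients themselves, both parameters end up moving by
# roughly learning_rate per step, regardless of their raw gradient scale.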
Implementation Example
class RMSPropOptimizer:
    def __init__(self, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.cache = {}

    def update(self, params, gradients):
        for param_name in params:
            if param_name not in self.cache:
                self.cache[param_name] = np.zeros_like(params[param_name])
            # Update cache (moving average of squared gradients)
            self.cache[param_name] = (self.decay_rate * self.cache[param_name] +
                                      (1 - self.decay_rate) * gradients[param_name]**2)
            # Update parameters
            params[param_name] -= (self.learning_rate * gradients[param_name] /
                                   (np.sqrt(self.cache[param_name]) + self.epsilon))
        return params

# Usage with different hyperparameters for RNNs
rnn_optimizer = RMSPropOptimizer(learning_rate=0.001, decay_rate=0.95)
RMSProp Best Practices
- Start with learning_rate=0.001 and decay_rate=0.9
- Increase decay_rate to 0.95-0.99 for RNNs to handle longer sequences
- Monitor for exploding gradients – RMSProp can be sensitive to this (a clipping sketch follows this list)
- Works particularly well for online learning and non-stationary objectives
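For the exploding-gradient bullet above, a common guard is global-norm gradient clipping. Here is a minimal NumPy sketch that assumes the same gradient-dict format used by the optimizers in this post; clip_by_global_norm is a hypothetical helper name, not a library function:

import numpy as np

def clip_by_global_norm(gradients, max_norm=1.0):
    # Rescale every gradient if their combined L2 norm exceeds max_norm
    total_norm = np.sqrt(sum(np.sum(g**2) for g in gradients.values()))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        gradients = {name: g * scale for name, g in gradients.items()}
    return gradients

# Usage sketch with the RMSProp optimizer above (params/gradients are dicts of arrays)
# gradients = clip_by_global_norm(gradients, max_norm=1.0)
# params = rnn_optimizer.update(params, gradients)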
Adam: The Best of Both Worlds
Adam (Adaptive Moment Estimation) combines the momentum concept with RMSProp’s adaptive learning rates. It maintains both first-moment (mean) and second-moment (uncentered variance) estimates of the gradients, making it robust across a wide range of problems.
How Adam Works
Adam tracks both the exponentially decaying average of past gradients (like momentum) and the exponentially decaying average of past squared gradients (like RMSProp):
# Adam algorithm (element-wise, NumPy-style; t is the step count, starting at 1)
m = beta1 * m + (1 - beta1) * gradient     # first moment (mean of gradients)
v = beta2 * v + (1 - beta2) * gradient**2  # second moment (mean of squared gradients)
# Bias correction
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
# Parameter update
theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
The bias correction terms are crucial for the first few iterations when the moment estimates are biased toward zero.
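A quick illustrative calculation shows why: after the very first step with the defaults beta1=0.9 and beta2=0.999, the raw moments are only 10% and 0.1% of the gradient statistics they estimate, and the correction rescales them (the gradient value here is arbitrary):

beta1, beta2, t = 0.9, 0.999, 1
gradient = 0.5  # arbitrary example value

m = (1 - beta1) * gradient     # 0.05    -- only 10% of the gradient
v = (1 - beta2) * gradient**2  # 0.00025 -- only 0.1% of gradient**2

m_hat = m / (1 - beta1**t)     # 0.5     -- rescaled back to the gradient
v_hat = v / (1 - beta2**t)     # 0.25    -- rescaled back to gradient**2
print(m, v, m_hat, v_hat)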
Complete Adam Implementation
class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # first moment
        self.v = {}  # second moment
        self.t = 0   # time step

    def update(self, params, gradients):
        self.t += 1
        for param_name in params:
            if param_name not in self.m:
                self.m[param_name] = np.zeros_like(params[param_name])
                self.v[param_name] = np.zeros_like(params[param_name])
            # Update first and second moments
            self.m[param_name] = (self.beta1 * self.m[param_name] +
                                  (1 - self.beta1) * gradients[param_name])
            self.v[param_name] = (self.beta2 * self.v[param_name] +
                                  (1 - self.beta2) * gradients[param_name]**2)
            # Bias correction
            m_hat = self.m[param_name] / (1 - self.beta1**self.t)
            v_hat = self.v[param_name] / (1 - self.beta2**self.t)
            # Update parameters
            params[param_name] -= (self.learning_rate * m_hat /
                                   (np.sqrt(v_hat) + self.epsilon))
        return params

# Production-ready usage
adam_optimizer = AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999)
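As an optional sanity check (a sketch under the stated assumptions, not part of any benchmark), the implementation above should track PyTorch's built-in torch.optim.Adam step for step on a toy quadratic when both run in float64, since they use the same update rule and default hyperparameters:

import numpy as np
import torch

np.random.seed(0)
w0 = np.random.randn(5)

# NumPy side, using the AdamOptimizer class defined above
np_params = {"w": w0.copy()}
np_adam = AdamOptimizer(learning_rate=0.001)

# PyTorch side, in float64 to match NumPy's precision
w_t = torch.tensor(w0, dtype=torch.float64, requires_grad=True)
torch_adam = torch.optim.Adam([w_t], lr=0.001, betas=(0.9, 0.999), eps=1e-8)

for _ in range(100):
    # Loss is sum(w**2), so the gradient is 2 * w
    np_adam.update(np_params, {"w": 2 * np_params["w"]})

    torch_adam.zero_grad()
    (w_t ** 2).sum().backward()
    torch_adam.step()

# The difference should be on the order of floating-point noise
print(np.max(np.abs(np_params["w"] - w_t.detach().numpy())))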
Performance Comparison and Benchmarks
Here’s how these optimizers typically perform across different scenarios:
| Optimizer | Convergence Speed | Extra Memory (optimizer state) | Hyperparameter Sensitivity | Best Use Cases |
|---|---|---|---|---|
| Momentum | Fast | 1x gradient memory | Low | Computer vision, stable objectives |
| RMSProp | Medium-Fast | 1x gradient memory | Medium | RNNs, online learning |
| Adam | Very Fast | 2x gradient memory | Low | General purpose, transformers, GANs |
In practice, here are some performance numbers from training a ResNet-50 on ImageNet:
| Optimizer | Time to 70% Accuracy | Final Accuracy | Training Stability |
|---|---|---|---|
| SGD + Momentum | 45 epochs | 76.2% | High |
| RMSProp | 42 epochs | 75.8% | Medium |
| Adam | 38 epochs | 75.9% | High |
Real-World Implementation with PyTorch
Here’s how you’d actually use these optimizers in a real project with PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim

# Model setup
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# Different optimizer configurations
optimizers = {
    'sgd_momentum': optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    'rmsprop': optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9),
    'adam': optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
}

# Training loop (assumes a train_loader DataLoader is defined elsewhere)
def train_model(optimizer_name, num_epochs=10):
    optimizer = optimizers[optimizer_name]
    criterion = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 100 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

# Compare different optimizers
# (for a fair comparison, re-initialize the model and its optimizer before each run;
#  as written, the runs share the same weights)
for opt_name in optimizers.keys():
    print(f"Training with {opt_name}")
    train_model(opt_name)
Common Pitfalls and Troubleshooting
Here are the most common issues you’ll run into and how to fix them:
Adam-Specific Issues
- Generalization gap: Adam sometimes converges to solutions that don’t generalize well. Try reducing the learning rate or switching to SGD+Momentum for the final epochs (a sketch follows below)
- Learning rate scheduling: Adam’s adaptive nature can conflict with learning rate schedules. Use smaller decay factors
- Weight decay problems: Standard L2 regularization doesn’t work well with Adam. Use AdamW instead
# Better Adam configuration for production
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
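For the generalization-gap bullet above, a common recipe is to hand training off from Adam to SGD+Momentum late in the run. A minimal sketch might look like this, where switch_epoch, the learning rates, and train_one_epoch are all assumptions for illustration:

# Hypothetical schedule: Adam for fast early progress, SGD+Momentum to finish
switch_epoch = 80  # assumed threshold; tune for your training length
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    if epoch == switch_epoch:
        # Rebuild the optimizer around the same parameters
        optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    train_one_epoch(model, optimizer)  # assumed helper, not defined in this post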
RMSProp Troubleshooting
- Exploding gradients: RMSProp can amplify gradient noise. Add gradient clipping
- Learning rate too high: Start with 0.001 and decrease if loss explodes
- Epsilon sensitivity: If training stalls, try epsilon=1e-4 instead of 1e-8
# RMSProp with gradient clipping
optimizer = optim.RMSprop(model.parameters(), lr=0.001, eps=1e-4)

# In the training loop: clip after backward(), before step()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
Advanced Techniques and Variants
Once you’re comfortable with the basics, here are some advanced techniques that can squeeze out extra performance:
Learning Rate Scheduling
# Cyclical learning rates work great with momentum
# (cycle_momentum=True requires an optimizer with a momentum term, e.g. SGD)
scheduler = optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=0.001,
    max_lr=0.01,
    cycle_momentum=True,
    base_momentum=0.85,
    max_momentum=0.95
)

# Warm restarts for Adam
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,
    T_mult=2
)
Gradient Accumulation for Large Batches
# Simulate large batch sizes with gradient accumulation
accumulation_steps = 4
optimizer.zero_grad()
for i, (data, target) in enumerate(train_loader):
    output = model(data)
    # Scale the loss so the accumulated gradient averages over the effective batch
    loss = criterion(output, target) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Choosing the Right Optimizer
Here’s a practical decision tree for optimizer selection:
- Use Adam when: You’re prototyping, working with transformers/attention models, or need fast convergence with minimal tuning
- Use SGD+Momentum when: You’re training computer vision models, need the best possible generalization, or have time to tune hyperparameters
- Use RMSProp when: You’re working with RNNs, doing online learning, or dealing with very noisy gradients
- Use AdamW when: You want Adam’s benefits but with proper weight decay (most modern applications)
For server deployments where you’re training models regularly, Adam is usually the safest choice because it requires less hyperparameter tuning and works well across different architectures. If you’re setting up automated training pipelines on your VPS or dedicated servers, Adam’s robustness will save you from having to babysit the training process.
The key takeaway is that optimization isn’t just about the algorithm – it’s about understanding your problem, monitoring training dynamics, and being willing to experiment. These optimizers are tools, and like any tool, they work best when you understand their strengths and limitations. Start with Adam for most cases, but don’t be afraid to switch if you’re hitting performance walls or generalization issues.
For more technical details and mathematical derivations, check out the original Adam paper (Kingma & Ba, “Adam: A Method for Stochastic Optimization”, arXiv:1412.6980) and Sebastian Ruder’s excellent overview, “An overview of gradient descent optimization algorithms” (arXiv:1609.04747).
