Introduction to Optimization: Momentum, RMSProp, Adam

If you’ve been working with machine learning models, you’ve probably wondered why your gradient descent takes forever to converge or gets stuck in local minima that make you question your life choices. Optimization algorithms like Momentum, RMSProp, and Adam are the heroes that can save you from these headaches by making your model training faster, more stable, and less likely to get trapped in suboptimal solutions. In this post, we’ll dive deep into how these algorithms work under the hood, when to use each one, and how to implement them properly in your projects.

Understanding the Problem with Vanilla Gradient Descent

Before we jump into the fancy optimizers, let’s be real about why vanilla gradient descent can be frustrating. Standard gradient descent updates parameters using a fixed learning rate, which leads to several issues:

  • Oscillations around steep valleys in the loss landscape
  • Slow convergence on flat surfaces
  • Getting stuck in local minima or saddle points
  • Same learning rate applied to all parameters regardless of their gradient history

Here’s what vanilla gradient descent looks like:


# Vanilla Gradient Descent
theta = theta - learning_rate * gradient

Simple, but not smart. It’s like using the same hammer for every nail, screw, and delicate electronic component.
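To see these issues in action, here is a tiny, self-contained sketch (the quadratic loss and step size are made up purely for illustration): on a surface that is much steeper in one direction than the other, plain gradient descent overshoots back and forth along the steep axis while crawling along the flat one.


import numpy as np

# Toy loss: f(x, y) = 0.5 * (x**2 + 25 * y**2) -- a valley that is steep in y, flat in x
def gradient(theta):
    x, y = theta
    return np.array([x, 25.0 * y])

theta = np.array([1.0, 1.0])
learning_rate = 0.07  # large enough to overshoot along the steep y-axis

for step in range(20):
    theta = theta - learning_rate * gradient(theta)
    # y flips sign every step (oscillation); x shrinks by only 7% per step (slow progress)

print(theta)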

Momentum: Adding Memory to Gradient Descent

Momentum solves the oscillation problem by adding a “memory” component that smooths out the parameter updates. Think of it like a ball rolling down a hill – it builds up speed in consistent directions and dampens oscillations.

How Momentum Works

Momentum maintains a velocity term, an exponentially decaying running sum of past gradients, and uses it in place of the raw gradient for each update:


# Momentum implementation
v = beta * v + gradient  # velocity update
theta = theta - learning_rate * v  # parameter update

The beta parameter (typically 0.9) controls how much “memory” the optimizer has. Higher values mean more momentum, lower values make it more responsive to current gradients.
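A quick way to build intuition for beta: with the update rule above, a gradient that keeps pointing the same way compounds until the velocity is roughly 1/(1 - beta) times its size, which is one reason a learning rate tuned for plain SGD often needs to be reduced when momentum is switched on. A minimal sketch (the constant gradient is just for illustration):


beta = 0.9
g = 1.0   # pretend the gradient is a constant 1.0 in some direction
v = 0.0

for _ in range(100):
    v = beta * v + g

print(v)  # ~10.0, i.e. g / (1 - beta): consistent gradients are amplified about 10x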

Implementation Example

Here’s a complete momentum optimizer implementation in Python:


import numpy as np

class MomentumOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}
    
    def update(self, params, gradients):
        for param_name in params:
            if param_name not in self.velocity:
                self.velocity[param_name] = np.zeros_like(params[param_name])
            
            # Update velocity
            self.velocity[param_name] = (self.momentum * self.velocity[param_name] + 
                                       gradients[param_name])
            
            # Update parameters
            params[param_name] -= self.learning_rate * self.velocity[param_name]
        
        return params

# Usage example: model_params and computed_gradients are dicts of NumPy arrays keyed by parameter name
optimizer = MomentumOptimizer(learning_rate=0.01, momentum=0.9)
updated_params = optimizer.update(model_params, computed_gradients)

When to Use Momentum

  • Training deep neural networks where gradient directions are noisy
  • Optimization landscapes with many local minima
  • When you want faster convergence without too much complexity
  • Computer vision tasks where momentum typically works well out of the box

RMSProp: Adaptive Learning Rates

RMSProp (Root Mean Square Propagation) takes a different approach by adapting the learning rate for each parameter based on the magnitude of recent gradients. It’s particularly good at handling different scales of gradients across parameters.

How RMSProp Works

Instead of using the same learning rate for all parameters, RMSProp maintains a moving average of squared gradients and scales the learning rate accordingly:


# RMSProp algorithm
v = decay_rate * v + (1 - decay_rate) * gradient^2
theta = theta - learning_rate * gradient / (sqrt(v) + epsilon)

The key insight is that parameters with large gradients get smaller effective learning rates, while parameters with small gradients get relatively larger learning rates.
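To make that concrete, here is a toy sketch (the constant gradients are illustrative): once the moving average of squared gradients has settled, the effective step for a parameter approaches roughly learning_rate in magnitude, no matter how large or small the raw gradient is.


import numpy as np

learning_rate, decay_rate, epsilon = 0.001, 0.9, 1e-8

for g in (0.001, 1.0, 1000.0):            # wildly different gradient scales
    v = 0.0
    for _ in range(200):                   # let the moving average of g^2 settle
        v = decay_rate * v + (1 - decay_rate) * g**2
    step = learning_rate * g / (np.sqrt(v) + epsilon)
    print(g, step)                         # each effective step is ~0.001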

Implementation Example


class RMSPropOptimizer:
    def __init__(self, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.cache = {}
    
    def update(self, params, gradients):
        for param_name in params:
            if param_name not in self.cache:
                self.cache[param_name] = np.zeros_like(params[param_name])
            
            # Update cache (moving average of squared gradients)
            self.cache[param_name] = (self.decay_rate * self.cache[param_name] + 
                                    (1 - self.decay_rate) * gradients[param_name]**2)
            
            # Update parameters
            params[param_name] -= (self.learning_rate * gradients[param_name] / 
                                 (np.sqrt(self.cache[param_name]) + self.epsilon))
        
        return params

# Usage with different hyperparameters for RNNs
rnn_optimizer = RMSPropOptimizer(learning_rate=0.001, decay_rate=0.95)

RMSProp Best Practices

  • Start with learning_rate=0.001 and decay_rate=0.9
  • Increase decay_rate to 0.95-0.99 for RNNs to handle longer sequences
  • Monitor for exploding gradients – RMSProp can be sensitive to this
  • Works particularly well for online learning and non-stationary objectives

Adam: The Best of Both Worlds

Adam (Adaptive Moment Estimation) combines the momentum concept with RMSProp’s adaptive learning rates. It maintains both first-moment (mean) and second-moment (uncentered variance) estimates of the gradients, making it remarkably robust across different types of problems.

How Adam Works

Adam tracks both the exponentially decaying average of past gradients (like momentum) and the exponentially decaying average of past squared gradients (like RMSProp):


# Adam algorithm
m = beta1 * m + (1 - beta1) * gradient     # first moment
v = beta2 * v + (1 - beta2) * gradient^2   # second moment

# Bias correction
m_hat = m / (1 - beta1^t)
v_hat = v / (1 - beta2^t)

# Parameter update
theta = theta - learning_rate * m_hat / (sqrt(v_hat) + epsilon)

The bias correction terms are crucial for the first few iterations when the moment estimates are biased toward zero.
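To see why the correction matters, plug in the very first step (t = 1) with the default betas: the raw estimates start at only (1 - beta1) and (1 - beta2) times the true scale, and dividing by those same factors restores it. A quick numeric check (the constant gradient of 1.0 is just for illustration):


beta1, beta2 = 0.9, 0.999
g = 1.0                         # pretend the first gradient is 1.0

m = (1 - beta1) * g             # 0.1   -- biased toward zero
v = (1 - beta2) * g**2          # 0.001 -- even more biased

t = 1
m_hat = m / (1 - beta1**t)      # 1.0   -- back to the true gradient scale
v_hat = v / (1 - beta2**t)      # 1.0

print(m, v, m_hat, v_hat)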

Complete Adam Implementation


class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # first moment
        self.v = {}  # second moment
        self.t = 0   # time step
    
    def update(self, params, gradients):
        self.t += 1
        
        for param_name in params:
            if param_name not in self.m:
                self.m[param_name] = np.zeros_like(params[param_name])
                self.v[param_name] = np.zeros_like(params[param_name])
            
            # Update first and second moments
            self.m[param_name] = (self.beta1 * self.m[param_name] + 
                                (1 - self.beta1) * gradients[param_name])
            self.v[param_name] = (self.beta2 * self.v[param_name] + 
                                (1 - self.beta2) * gradients[param_name]**2)
            
            # Bias correction
            m_hat = self.m[param_name] / (1 - self.beta1**self.t)
            v_hat = self.v[param_name] / (1 - self.beta2**self.t)
            
            # Update parameters
            params[param_name] -= (self.learning_rate * m_hat / 
                                 (np.sqrt(v_hat) + self.epsilon))
        
        return params

# Production-ready usage
adam_optimizer = AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999)

Performance Comparison and Benchmarks

Here’s how these optimizers typically perform across different scenarios:

Optimizer   Convergence Speed   Memory Usage         Hyperparameter Sensitivity   Best Use Cases
Momentum    Fast                1x gradient memory   Low                          Computer vision, stable objectives
RMSProp     Medium-Fast         1x gradient memory   Medium                       RNNs, online learning
Adam        Very Fast           2x gradient memory   Low                          General purpose, transformers, GANs

In practice, here are some performance numbers from training a ResNet-50 on ImageNet:

Optimizer        Time to 70% Accuracy   Final Accuracy   Training Stability
SGD + Momentum   45 epochs              76.2%            High
RMSProp          42 epochs              75.8%            Medium
Adam             38 epochs              75.9%            High

Real-World Implementation with PyTorch

Here’s how you’d actually use these optimizers in a real project with PyTorch:


import torch
import torch.nn as nn
import torch.optim as optim

# Model setup
def make_model():
    return nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Linear(256, 10)
    )

model = make_model()

# Different optimizer configurations, built from parameters so each run can use a fresh model
optimizer_configs = {
    'sgd_momentum': lambda params: optim.SGD(params, lr=0.01, momentum=0.9),
    'rmsprop':      lambda params: optim.RMSprop(params, lr=0.001, alpha=0.9),
    'adam':         lambda params: optim.Adam(params, lr=0.001, betas=(0.9, 0.999)),
}

# Training loop (train_loader is assumed to yield (data, target) batches of flattened images)
def train_model(optimizer_name, num_epochs=10):
    model = make_model()  # fresh weights so every optimizer starts from the same point
    optimizer = optimizer_configs[optimizer_name](model.parameters())
    criterion = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            if batch_idx % 100 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

# Compare different optimizers
for opt_name in optimizer_configs:
    print(f"Training with {opt_name}")
    train_model(opt_name)

Common Pitfalls and Troubleshooting

Here are the most common issues you’ll run into and how to fix them:

Adam-Specific Issues

  • Generalization gap: Adam sometimes converges to solutions that don’t generalize well. Try reducing the learning rate or switching to SGD+Momentum for the final epochs (see the sketch after the code block below)
  • Learning rate scheduling: Adam’s adaptive nature can conflict with learning rate schedules. Use smaller decay factors
  • Weight decay problems: Standard L2 regularization doesn’t work well with Adam. Use AdamW instead

# Better Adam configuration for production
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
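The first bullet above also suggests handing training off to SGD+Momentum for the final epochs. Here is a minimal sketch of that hand-off; the switch_epoch threshold, learning rates, and the train_one_epoch helper (plus model and num_epochs) are placeholders assumed to be defined elsewhere, not settings from this post:


# Hypothetical two-phase schedule: AdamW for fast early progress, SGD + Momentum to finish
switch_epoch = 80                         # illustrative threshold, tune for your schedule
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

for epoch in range(num_epochs):
    if epoch == switch_epoch:
        optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    train_one_epoch(model, optimizer)     # assumed helper that runs one pass over the data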

RMSProp Troubleshooting

  • Exploding gradients: RMSProp can amplify gradient noise. Add gradient clipping
  • Learning rate too high: Start with 0.001 and decrease if loss explodes
  • Epsilon sensitivity: If training stalls, try epsilon=1e-4 instead of 1e-8

# RMSProp with gradient clipping
optimizer = optim.RMSprop(model.parameters(), lr=0.001, eps=1e-4)

# In the training loop: clip after backward() so the gradients exist, then step
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

Advanced Techniques and Variants

Once you’re comfortable with the basics, here are some advanced techniques that can squeeze out extra performance:

Learning Rate Scheduling


# Cyclical learning rates work great with momentum
scheduler = optim.lr_scheduler.CyclicLR(
    optimizer, 
    base_lr=0.001, 
    max_lr=0.01,
    cycle_momentum=True,
    base_momentum=0.85,
    max_momentum=0.95
)

# Warm restarts for Adam
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, 
    T_0=10, 
    T_mult=2
)

Gradient Accumulation for Large Batches


# Simulate large batch sizes with gradient accumulation
accumulation_steps = 4
optimizer.zero_grad()

for i, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = criterion(output, target) / accumulation_steps
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Choosing the Right Optimizer

Here’s a practical decision tree for optimizer selection:

  • Use Adam when: You’re prototyping, working with transformers/attention models, or need fast convergence with minimal tuning
  • Use SGD+Momentum when: You’re training computer vision models, need the best possible generalization, or have time to tune hyperparameters
  • Use RMSProp when: You’re working with RNNs, doing online learning, or dealing with very noisy gradients
  • Use AdamW when: You want Adam’s benefits but with proper weight decay (most modern applications)

For server deployments where you’re training models regularly, Adam is usually the safest choice because it requires less hyperparameter tuning and works well across different architectures. If you’re setting up automated training pipelines on your VPS or dedicated servers, Adam’s robustness will save you from having to babysit the training process.

The key takeaway is that optimization isn’t just about the algorithm – it’s about understanding your problem, monitoring training dynamics, and being willing to experiment. These optimizers are tools, and like any tool, they work best when you understand their strengths and limitations. Start with Adam for most cases, but don’t be afraid to switch if you’re hitting performance walls or generalization issues.

For more technical details and mathematical derivations, check out the original Adam paper (Kingma & Ba, “Adam: A Method for Stochastic Optimization”) and Sebastian Ruder’s excellent overview, “An overview of gradient descent optimization algorithms”.



