VGG from Scratch with PyTorch – Step-by-Step Guide

VGG (named after Oxford’s Visual Geometry Group) is a classic convolutional neural network architecture that took top places in the 2014 ImageNet challenge and demonstrated that depth matters in neural networks. While you’ve probably heard of ResNet and EfficientNet being the hot stuff nowadays, understanding VGG from scratch is crucial for grasping the fundamentals of CNN architectures – plus it’s surprisingly straightforward to implement in PyTorch. In this guide, we’ll build VGG-16 from the ground up, dive into the architecture details, handle common implementation gotchas, and compare its performance against modern alternatives.

How VGG Architecture Works

VGG’s beauty lies in its simplicity – it’s basically a stack of 3×3 convolutional layers followed by max pooling, repeated until you get to fully connected layers. The key insight from the VGG paper was that stacking several small 3×3 filters is more effective than using one large filter: three stacked 3×3 layers cover the same 7×7 receptive field as a single 7×7 layer, but with fewer parameters (27C² versus 49C² weights for C channels) and two extra non-linearities.
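To make the small-filter argument concrete, here is a minimal sketch that compares a stack of three 3×3 convolutions with a single 7×7 convolution. The helper names and the channel width of 256 are assumptions picked for illustration; biases are ignored.

```python
def conv_weights(kernel_size: int, channels: int, num_layers: int = 1) -> int:
    """Weights (biases ignored) in a stack of square convs with `channels` in and out."""
    return num_layers * kernel_size * kernel_size * channels * channels


def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of `num_layers` stride-1 convs stacked on top of each other."""
    return 1 + num_layers * (kernel_size - 1)


channels = 256  # assumed width, for illustration only
three_small = conv_weights(3, channels, num_layers=3)
one_large = conv_weights(7, channels)

print("receptive fields:", stacked_receptive_field(3, 3), stacked_receptive_field(7, 1))
print("weights, three 3x3 layers:", three_small)
print("weights, one 7x7 layer:  ", one_large)
```

Both options see a 7×7 window of the input, but the stacked version needs 27/49 of the weights, a bit more than half.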

The architecture follows this pattern:

Convolutional blocks with 3×3 kernels, stride 1, padding 1
ReLU activation after each conv layer
Max pooling (2×2, stride 2) after each block
Channel count doubles after each pooling operation (64 → 128 → 256 → 512, then stays at 512) while the spatial size halves
Three fully connected layers at the end

Here’s how the feature map sizes change through VGG-16:

Layer Block	Input Size	Filters	Output Size
Conv Block 1	224x224x3	64	224x224x64
After Pool 1	224x224x64	–	112x112x64
Conv Block 2	112x112x64	128	112x112x128
After Pool 2	112x112x128	–	56x56x128
Conv Block 3	56x56x128	256	56x56x256
After Pool 3	56x56x256	–	28x28x256
Conv Block 4	28x28x256	512	28x28x512
After Pool 4	28x28x512	–	14x14x512
Conv Block 5	14x14x512	512	14x14x512
After Pool 5	14x14x512	–	7x7x512
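The table can be reproduced with a few lines of plain Python. This is a sketch, assuming the standard VGG-16 layer list (the same one used in the implementation section) and the fact that a 3×3 convolution with stride 1 and padding 1 preserves height and width; `trace_shapes` is a hypothetical helper name.

```python
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']


def trace_shapes(layer_cfg, height=224, width=224, channels=3):
    """Return (height, width, channels) after every max-pooling stage."""
    shapes = []
    for item in layer_cfg:
        if item == 'M':
            # 2x2 max pooling with stride 2 halves both spatial dimensions
            height, width = height // 2, width // 2
            shapes.append((height, width, channels))
        else:
            # 3x3 conv, stride 1, padding 1: only the channel count changes
            channels = item
    return shapes


for stage, (h, w, c) in enumerate(trace_shapes(VGG16_CFG), start=1):
    print(f"After Pool {stage}: {h}x{w}x{c}")
```

Passing `height=32, width=32` shows why CIFAR-sized inputs end up as a 1x1x512 feature map after the fifth pooling stage.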

Step-by-Step Implementation Guide

Let’s start by setting up our imports and defining the VGG configuration. Different VGG variants (VGG-11, VGG-13, VGG-16, VGG-19) differ only in the number of convolutional layers.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import time

# VGG configurations for different variants
cfg = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}
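The numbers in the variant names come straight from these lists: count the convolutional entries and add the three fully connected layers. A small sketch to check that, with the configuration repeated so the snippet runs on its own (`weight_layers` is a hypothetical helper):

```python
cfg = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}


def weight_layers(layer_cfg, num_fc_layers=3):
    """Count layers with learnable weights: conv entries plus the FC layers."""
    conv_layers = sum(1 for item in layer_cfg if item != 'M')
    return conv_layers + num_fc_layers


for name, layer_cfg in cfg.items():
    print(name, "->", weight_layers(layer_cfg), "weight layers")
```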

Now let’s implement the VGG class. The key is to build the feature extraction layers dynamically based on the configuration:

class VGG(nn.Module):
    def __init__(self, vgg_name, num_classes=1000):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def _make_layers(self, cfg):
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                          nn.BatchNorm2d(x),
                          nn.ReLU(inplace=True)]
                in_channels = x
        return nn.Sequential(*layers)

# Create VGG-16 model
def vgg16(num_classes=1000):
    return VGG('VGG16', num_classes)

Note that I added BatchNorm2d which wasn’t in the original VGG paper but significantly helps with training stability. If you want the pure original architecture, just remove those lines.

Let’s set up data loading and preprocessing. CIFAR-10 is perfect for testing since it trains quickly:

# Data preprocessing
transform_train = transforms.Compose([
    transforms.Resize(224),  # VGG expects 224x224 input
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

transform_test = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

# Load CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform_train)
trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform_test)
testloader = DataLoader(testset, batch_size=32, shuffle=False, num_workers=2)

Now for the training loop. VGG can be memory-hungry, so watch your batch sizes:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Initialize model, loss, and optimizer
model = vgg16(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

def train_epoch(model, trainloader, criterion, optimizer, epoch):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
        
        if batch_idx % 100 == 0:
            print(f'Epoch: {epoch}, Batch: {batch_idx}, Loss: {loss.item():.4f}, Acc: {100.*correct/total:.2f}%')

def test(model, testloader, criterion):
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(testloader):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            
            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    
    accuracy = 100. * correct / total
    print(f'Test Accuracy: {accuracy:.2f}%')
    return accuracy

# Training loop
epochs = 50
best_acc = 0

for epoch in range(epochs):
    start_time = time.time()
    train_epoch(model, trainloader, criterion, optimizer, epoch)
    acc = test(model, testloader, criterion)
    scheduler.step()
    
    # Save best model
    if acc > best_acc:
        print(f'Saving model with accuracy: {acc:.2f}%')
        torch.save(model.state_dict(), 'vgg16_best.pth')
        best_acc = acc
    
    print(f'Epoch {epoch} completed in {time.time() - start_time:.2f}s\n')
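It helps to know what the scheduler actually does over those 50 epochs. The sketch below reproduces the StepLR rule in plain Python, assuming `scheduler.step()` is called once at the end of every epoch as in the loop above; `step_lr` is a hypothetical helper, not part of PyTorch.

```python
def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed) under a StepLR-style decay."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(0.01, epoch) for epoch in range(50)]

# Epochs 0-29 train at the base rate, epochs 30-49 at one tenth of it
for epoch in (0, 29, 30, 49):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.4f}")
```

With `step_size=30` and only 50 epochs, the learning rate drops exactly once, so most of the run happens at the initial rate.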

Real-World Examples and Use Cases

While VGG isn’t the go-to choice for production anymore, it still has its place. Here are scenarios where you might actually want to use VGG:

Transfer Learning Base: VGG’s simple architecture makes it excellent for understanding feature extraction before jumping to ResNet
Resource-Constrained Environments: VGG-11 with a slimmed-down classifier can be a workable baseline when you need decent accuracy, although compact modern architectures usually deliver more accuracy per parameter
Educational Purposes: Perfect for teaching CNN concepts without the complexity of skip connections
Feature Extraction: VGG features are still used in style transfer and some computer vision pipelines

Here’s how to use VGG for transfer learning on a custom dataset:

# Load pre-trained VGG and modify for your dataset
import torchvision.models as models

def create_transfer_vgg(num_classes):
    # Load pre-trained VGG16 (the pretrained=True flag is deprecated since torchvision 0.13)
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    
    # Freeze feature extraction layers
    for param in model.features.parameters():
        param.requires_grad = False
    
    # Replace classifier for your number of classes
    model.classifier[6] = nn.Linear(4096, num_classes)
    
    return model

# For fine-tuning instead of feature extraction
def create_finetuned_vgg(num_classes, freeze_layers=True):
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    
    if freeze_layers:
        # Freeze early layers, train later ones
        for i, param in enumerate(model.features.parameters()):
            # Each conv layer contributes a weight and a bias tensor,
            # so the first 20 tensors cover the first 10 conv layers
            if i < 20:
                param.requires_grad = False
    
    model.classifier[6] = nn.Linear(4096, num_classes)
    return model
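One thing that is easy to misread in the fine-tuning function: the loop index counts parameter tensors, not layers. In the plain (no batch norm) VGG-16, every convolution owns two tensors, a weight and a bias, so an index cutoff maps to half as many conv layers. A small sketch of that bookkeeping, where `frozen_conv_layers` is a hypothetical helper and the two-tensors-per-conv figure is the stated assumption:

```python
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']


def frozen_conv_layers(layer_cfg, num_frozen_tensors, tensors_per_conv=2):
    """Map a parameter-tensor cutoff to the number of fully frozen conv layers."""
    num_convs = sum(1 for item in layer_cfg if item != 'M')
    fully_frozen = min(num_frozen_tensors // tensors_per_conv, num_convs)
    return fully_frozen, num_convs


frozen, total = frozen_conv_layers(VGG16_CFG, num_frozen_tensors=20)
print(f"{frozen} of {total} conv layers frozen, {total - frozen} left trainable")
```

So a cutoff of 20 leaves the last three convolutions (the whole fifth block) plus the classifier trainable. For a batch-norm variant the count per conv would be four tensors, since each BatchNorm2d adds its own weight and bias.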

For deployment on servers with limited GPU memory, you can implement a memory-efficient version:

class EfficientVGG(nn.Module):
    def __init__(self, vgg_name, num_classes=1000):
        super(EfficientVGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        # Use global average pooling instead of large FC layers
        self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Linear(512, num_classes)
        
    def forward(self, x):
        x = self.features(x)
        x = self.global_pool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
    
    def _make_layers(self, cfg):
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                          nn.BatchNorm2d(x),
                          nn.ReLU(inplace=True)]
                in_channels = x
        return nn.Sequential(*layers)

Performance Comparisons and Benchmarks

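Let's be honest about VGG's performance compared to modern architectures. Here's roughly what you can expect on CIFAR-10. The VGG parameter counts are for CIFAR-style variants with a compact classifier head (such as VGG_CIFAR in the troubleshooting section), not the 4096-wide classifier built above, which brings VGG-16 to roughly 134M parameters: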

Model	Parameters	CIFAR-10 Accuracy	Training Time (50 epochs)	Memory Usage
VGG-11	9.2M	88-90%	~2 hours (GTX 1080)	~3GB
VGG-16	15M	91-93%	~3 hours (GTX 1080)	~4GB
VGG-19	20M	92-94%	~4 hours (GTX 1080)	~5GB
ResNet-18	11M	94-95%	~1 hour (GTX 1080)	~2GB
EfficientNet-B0	5.3M	96-97%	~1.5 hours (GTX 1080)	~2GB

The numbers don't lie - VGG is slower and hungrier than modern alternatives. But here's the performance on ImageNet for reference:

Model	Top-1 Accuracy	Top-5 Accuracy	Parameters
VGG-16	71.59%	90.38%	138M
VGG-19	72.38%	90.88%	144M
ResNet-50	76.15%	92.87%	25.6M
EfficientNet-B0	77.69%	93.53%	5.3M
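The VGG parameter counts in the ImageNet table can be derived by hand from the layer lists. Here is a sketch that does the arithmetic for the plain variants (no batch norm, 1000 classes, 7×7×512 features feeding the classifier); the function names are hypothetical.

```python
VGG_CFGS = {
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}


def conv_params(layer_cfg, in_channels=3):
    """Weights plus biases of all 3x3 convolutions in a VGG layer list."""
    total = 0
    for item in layer_cfg:
        if item != 'M':
            total += 3 * 3 * in_channels * item + item
            in_channels = item
    return total


def classifier_params(num_classes=1000, hidden=4096, in_features=512 * 7 * 7):
    """Weights plus biases of the three fully connected layers."""
    sizes = [(in_features, hidden), (hidden, hidden), (hidden, num_classes)]
    return sum(n_in * n_out + n_out for n_in, n_out in sizes)


for name, layer_cfg in VGG_CFGS.items():
    convs, head = conv_params(layer_cfg), classifier_params()
    print(f"{name}: {convs + head:,} total, {100 * head / (convs + head):.0f}% in the classifier")
```

The takeaway is that the convolutions are the small part: close to 90% of the parameters sit in the three fully connected layers.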

Common Pitfalls and Troubleshooting

After implementing VGG dozens of times, here are the gotchas that'll save you hours of debugging:

Memory Issues: VGG's fully connected layers are massive (25,088 x 4096 = 103M parameters just for the first FC layer). If you're getting CUDA out of memory errors:

# Reduce batch size first
trainloader = DataLoader(trainset, batch_size=16, shuffle=True)  # Instead of 32

# Or use gradient accumulation
accumulation_steps = 4
optimizer.zero_grad()

for i, (inputs, targets) in enumerate(trainloader):
    outputs = model(inputs.to(device))
    loss = criterion(outputs, targets.to(device)) / accumulation_steps
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
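Why divide the loss by `accumulation_steps`? Because gradients add up across `backward()` calls, and the division turns that sum into an average. The sketch below checks this on a toy one-parameter model in plain Python (the data, the model `y = w * x`, and the helper name are all made up for illustration):

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9]
w = 0.3
accumulation_steps = 4
micro_batch = len(xs) // accumulation_steps

# One large batch of 8 samples
full_grad = mse_grad(w, xs, ys)

# Four micro-batches of 2 samples, each scaled by 1 / accumulation_steps
accumulated_grad = 0.0
for step in range(accumulation_steps):
    lo, hi = step * micro_batch, (step + 1) * micro_batch
    accumulated_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_grad, accumulated_grad)
```

The two gradients agree up to floating-point rounding, so four micro-batches of 16 behave like one batch of 64 for the optimizer. One caveat for the model in this guide: BatchNorm statistics are still computed per micro-batch, so the match is not exact once batch norm is involved.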

Training Instability: VGG without batch normalization can be tricky to train. If loss explodes or doesn't converge:

# Use gradient clipping (call it after loss.backward() and before optimizer.step())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Lower learning rate
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # Instead of 0.01

# Add weight initialization
def init_weights(m):
    if isinstance(m, nn.Conv2d):
        torch.nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        if m.bias is not None:
            torch.nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.Linear):
        torch.nn.init.normal_(m.weight, 0, 0.01)
        torch.nn.init.constant_(m.bias, 0)

model.apply(init_weights)
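For intuition on what norm-based gradient clipping does, here is a plain-Python sketch of the rule: compute one global L2 norm over all gradients and, if it exceeds `max_norm`, scale every gradient by the same factor. This is an illustration of the idea with made-up numbers, not PyTorch's code; the small `eps` in the denominator is an assumption modeled on how such implementations typically avoid division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale `grads` so their global L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


exploding = [3.0, 4.0]   # global norm 5.0, well above the limit
healthy = [0.3, 0.4]     # global norm 0.5, below the limit

clipped, norm_before = clip_by_global_norm(exploding)
untouched, _ = clip_by_global_norm(healthy)

print("norm before:", norm_before)
print("norm after: ", math.sqrt(sum(g * g for g in clipped)))
print("healthy gradients unchanged:", untouched == healthy)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its length is capped.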

Input Size Mismatch: VGG expects 224x224 inputs. For smaller datasets like CIFAR-10 (32x32), you need to resize or modify the architecture:

# Option 1: Resize inputs (what we did above)
transform = transforms.Compose([
    transforms.Resize(224),
    # ... other transforms
])

# Option 2: Modify architecture for smaller inputs
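class VGG_CIFAR(nn.Module):
    # Train this on the raw 32x32 images (drop transforms.Resize(224))
    def __init__(self, vgg_name, num_classes=10):
        super(VGG_CIFAR, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        # Calculate the size after conv layers for CIFAR-10 (32x32 input)
        # After 5 pooling layers: 32 -> 16 -> 8 -> 4 -> 2 -> 1
        self.classifier = nn.Sequential(
            nn.Linear(512 * 1 * 1, 512),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(512, num_classes),
        )

    # Reuse the layer builder from the VGG class defined earlier
    _make_layers = VGG._make_layers

    def forward(self, x):
        x = self.features(x)      # (N, 512, 1, 1) for 32x32 inputs
        x = torch.flatten(x, 1)   # (N, 512)
        return self.classifier(x)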

Best Practices and Optimization Tips

If you're committed to using VGG, here are ways to make it less painful:

Mixed Precision Training: Cuts memory usage and training time significantly:

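# Note: recent PyTorch releases deprecate torch.cuda.amp in favor of
# torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda")
from torch.cuda.amp import autocast, GradScaler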

scaler = GradScaler()

for epoch in range(epochs):
    for inputs, targets in trainloader:
        optimizer.zero_grad()
        
        with autocast():
            outputs = model(inputs.to(device))
            loss = criterion(outputs, targets.to(device))
        
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

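Model Pruning: Zero out low-magnitude weights. Note that unstructured pruning only masks weights, so the saved model gets smaller only once you call prune.remove to make the masks permanent and store the weights in a sparse or compressed format: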

import torch.nn.utils.prune as prune

def prune_vgg(model, amount=0.3):
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name='weight', amount=amount)
        elif isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name='weight', amount=amount)
    
    return model

# Apply pruning after training
model = prune_vgg(model, amount=0.3)
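To see what magnitude-based unstructured pruning does to a layer, here is a plain-Python sketch of the idea on a made-up weight vector: rank weights by absolute value and mask out the smallest `amount` fraction. The helper name is hypothetical and this is an illustration of the technique, not the library's implementation.

```python
def l1_unstructured_mask(weights, amount=0.3):
    """Return a 0/1 mask that drops the `amount` fraction of smallest-magnitude weights."""
    num_to_prune = round(amount * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_to_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


weights = [0.8, -0.05, 0.3, -0.9, 0.02, 0.4, -0.1, 0.6, -0.7, 0.2]  # made-up values
mask = l1_unstructured_mask(weights, amount=0.3)
pruned_weights = [w * m for w, m in zip(weights, mask)]

print("mask:          ", mask)
print("pruned weights:", pruned_weights)
print("sparsity:      ", 1 - sum(mask) / len(mask))
```

Note that the pruned vector has the same length as the original. The zeros still occupy memory in a dense tensor, which is why pruning alone does not shrink the model file or speed up inference.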

Knowledge Distillation: Use a larger VGG to train a smaller one:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, temperature=4, alpha=0.7):
    # Soft-target term: KL divergence between temperature-softened distributions
    soft_loss = nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1)
    ) * (temperature ** 2)
    
    # Hard-target term: regular cross-entropy against the true labels
    student_loss = F.cross_entropy(student_logits, targets)
    return alpha * soft_loss + (1 - alpha) * student_loss
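The temperature is the part of distillation that tends to confuse people, so here is a plain-Python sketch of what dividing logits by `temperature=4` does to a teacher's output. The logits are made up, and `softmax` is a small hand-written helper rather than the PyTorch function.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over `logits` after dividing them by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [8.0, 2.0, 0.5]  # made-up logits for three classes

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 4) for p in sharp])
print("T=4:", [round(p, 4) for p in soft])
```

At T=1 almost all probability mass sits on the top class, so the student learns little beyond the label. At T=4 the runner-up classes get visible probability, which is the extra signal about class similarity that the student is trained on. The `temperature ** 2` factor in the loss compensates for the gradients of the softened term shrinking by roughly 1/T².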

For production deployments on your dedicated servers, consider converting to ONNX for faster inference:

import torch.onnx

# Convert trained model to ONNX (switch to eval mode so Dropout and BatchNorm behave deterministically)
model.eval()
dummy_input = torch.randn(1, 3, 224, 224).to(device)
torch.onnx.export(model, dummy_input, "vgg16.onnx", 
                  export_params=True, opset_version=11,
                  input_names=['input'], output_names=['output'])

# Load with ONNX Runtime for deployment
import onnxruntime as ort
ort_session = ort.InferenceSession("vgg16.onnx")

The PyTorch documentation has extensive details on pre-trained models and transfer learning techniques. For deployment considerations on your VPS or dedicated infrastructure, remember that the compute cost of VGG's convolutional layers scales linearly with the number of input pixels (so doubling the image side length roughly quadruples it), making it predictable for capacity planning but potentially expensive for high-resolution inference workloads.
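For capacity planning, a back-of-the-envelope estimate of the convolutional workload is often enough. The sketch below counts multiply-accumulate operations (MACs) for the VGG-16 convolutions at a given square input size; it ignores biases, activations, pooling, and the classifier, and `conv_macs` is a hypothetical helper.

```python
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']


def conv_macs(layer_cfg, side, in_channels=3):
    """Multiply-accumulates of all 3x3 convolutions for a `side` x `side` input."""
    macs = 0
    for item in layer_cfg:
        if item == 'M':
            side //= 2  # 2x2 max pooling halves the feature map
        else:
            # One 3x3 x in_channels dot product per output pixel and output channel
            macs += side * side * 3 * 3 * in_channels * item
            in_channels = item
    return macs


base = conv_macs(VGG16_CFG, 224)
doubled = conv_macs(VGG16_CFG, 448)

print(f"224x224: {base / 1e9:.1f} GMACs")
print(f"448x448: {doubled / 1e9:.1f} GMACs ({doubled / base:.0f}x)")
```

The classifier cost stays constant in the implementation from this guide, because the adaptive average pooling always hands it a 7×7×512 tensor; it is the convolutional part that grows with resolution.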
