
VGG from Scratch with PyTorch – Step-by-Step Guide
VGG, named after Oxford's Visual Geometry Group, is a classic convolutional neural network architecture that placed first in localization and second in classification at ILSVRC 2014, demonstrating that depth matters in neural networks. While ResNet and EfficientNet get most of the attention nowadays, understanding VGG from scratch is crucial for grasping the fundamentals of CNN architectures, and it’s surprisingly straightforward to implement in PyTorch. In this guide, we’ll build VGG-16 from the ground up, dive into the architecture details, handle common implementation gotchas, and compare its performance against modern alternatives.
How VGG Architecture Works
VGG’s beauty lies in its simplicity: it’s basically a stack of 3×3 convolutional layers followed by max pooling, repeated until you get to fully connected layers. The key insight from the VGG paper was that stacking several small (3×3) filters covers the same receptive field as one larger filter (two 3×3 layers see a 5×5 region, three see 7×7) while using fewer parameters and adding more non-linearities.
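To make the parameter saving concrete, here is a small back-of-the-envelope sketch. It assumes every layer has the same number of input and output channels and ignores biases; the helper names and the channel width of 256 are illustrative choices, not values taken from the VGG paper.

```python
def conv_weights(kernel_size: int, channels: int) -> int:
    """Weights in one conv layer with `channels` inputs and outputs (biases ignored)."""
    return kernel_size * kernel_size * channels * channels


def stacked_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


channels = 256  # assumed channel width, chosen only for illustration

for num_layers, big_kernel in [(2, 5), (3, 7)]:
    # The stack of small kernels sees the same region as the single large kernel.
    assert stacked_receptive_field(num_layers) == big_kernel
    small = num_layers * conv_weights(3, channels)
    large = conv_weights(big_kernel, channels)
    print(f"{num_layers} x 3x3: {small:,} weights vs one {big_kernel}x{big_kernel}: {large:,}")
```

In general the stack costs 9·n·C² weights against k²·C² for the single large kernel, so three 3×3 layers need 27C² weights where one 7×7 layer needs 49C².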
The architecture follows this pattern:
- Convolutional blocks with 3×3 kernels, stride 1, padding 1
- ReLU activation after each conv layer
- Max pooling (2×2, stride 2) after each block
- The number of channels doubles after each pooling operation until it reaches 512 (64 → 128 → 256 → 512 → 512), while the spatial size halves
- Three fully connected layers at the end
Here’s how the feature map sizes change through VGG-16:
Layer Block | Input Size | Filters | Output Size |
---|---|---|---|
Conv Block 1 | 224x224x3 | 64 | 224x224x64 |
After Pool 1 | 224x224x64 | – | 112x112x64 |
Conv Block 2 | 112x112x64 | 128 | 112x112x128 |
After Pool 2 | 112x112x128 | – | 56x56x128 |
Conv Block 3 | 56x56x128 | 256 | 56x56x256 |
After Pool 3 | 56x56x256 | – | 28x28x256 |
Conv Block 4 | 28x28x256 | 512 | 28x28x512 |
After Pool 4 | 28x28x512 | – | 14x14x512 |
Conv Block 5 | 14x14x512 | 512 | 14x14x512 |
After Pool 5 | 14x14x512 | – | 7x7x512 |
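The sizes in the table follow from two rules: a 3×3 convolution with padding 1 keeps the spatial size, and a 2×2 max pool with stride 2 halves it. A minimal sketch that walks a VGG-16 layer list and reproduces the "After Pool" rows (the function name `trace_shapes` is made up for this example):

```python
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']


def trace_shapes(cfg, size=224, channels=3):
    """Return (side length, channels) after each max-pooling step."""
    shapes = []
    for item in cfg:
        if item == 'M':
            size //= 2          # 2x2 pooling with stride 2 halves each side
            shapes.append((size, channels))
        else:
            channels = item     # 3x3 conv with padding 1 keeps the spatial size
    return shapes


for block, (size, channels) in enumerate(trace_shapes(VGG16_CFG), start=1):
    print(f"After Pool {block}: {size}x{size}x{channels}")
```

Calling `trace_shapes(VGG16_CFG, size=32)` shows why unresized CIFAR-10 images end up as a 1×1×512 feature map.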
Step-by-Step Implementation Guide
Let’s start by setting up our imports and defining the VGG configuration. Different VGG variants (VGG-11, VGG-13, VGG-16, VGG-19) differ only in the number of convolutional layers.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import time
# VGG configurations for different variants
cfg = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}
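The number in each variant's name counts its weight layers: the convolutions in the configuration plus the three fully connected layers. A quick check, repeating the configuration so the snippet runs on its own (`weight_layers` is a hypothetical helper):

```python
cfg = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
              512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
              512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}


def weight_layers(layer_cfg, num_fc=3):
    """Count conv entries (anything that is not 'M') plus the FC layers."""
    num_convs = sum(1 for item in layer_cfg if item != 'M')
    return num_convs + num_fc


for name, layer_cfg in cfg.items():
    print(name, weight_layers(layer_cfg))
```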
Now let’s implement the VGG class. The key is to build the feature extraction layers dynamically based on the configuration:
class VGG(nn.Module):
    def __init__(self, vgg_name, num_classes=1000):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def _make_layers(self, cfg):
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                           nn.BatchNorm2d(x),
                           nn.ReLU(inplace=True)]
                in_channels = x
        return nn.Sequential(*layers)

# Create VGG-16 model
def vgg16(num_classes=1000):
    return VGG('VGG16', num_classes)
Note that I added BatchNorm2d which wasn’t in the original VGG paper but significantly helps with training stability. If you want the pure original architecture, just remove those lines.
Let’s set up data loading and preprocessing. CIFAR-10 is perfect for testing since it trains quickly:
# Data preprocessing
transform_train = transforms.Compose([
    transforms.Resize(224),  # VGG expects 224x224 input
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])
transform_test = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

# Load CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform_train)
trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform_test)
testloader = DataLoader(testset, batch_size=32, shuffle=False, num_workers=2)
Now for the training loop. VGG can be memory-hungry, so watch your batch sizes:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Initialize model, loss, and optimizer
model = vgg16(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

def train_epoch(model, trainloader, criterion, optimizer, epoch):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
        if batch_idx % 100 == 0:
            print(f'Epoch: {epoch}, Batch: {batch_idx}, Loss: {loss.item():.4f}, Acc: {100.*correct/total:.2f}%')

def test(model, testloader, criterion):
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(testloader):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    accuracy = 100. * correct / total
    print(f'Test Accuracy: {accuracy:.2f}%')
    return accuracy

# Training loop
epochs = 50
best_acc = 0
for epoch in range(epochs):
    start_time = time.time()
    train_epoch(model, trainloader, criterion, optimizer, epoch)
    acc = test(model, testloader, criterion)
    scheduler.step()
    # Save best model
    if acc > best_acc:
        print(f'Saving model with accuracy: {acc:.2f}%')
        torch.save(model.state_dict(), 'vgg16_best.pth')
        best_acc = acc
    print(f'Epoch {epoch} completed in {time.time() - start_time:.2f}s\n')
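With the settings above (lr=0.01, step_size=30, gamma=0.1, 50 epochs, and one `scheduler.step()` per epoch), the learning rate drops exactly once. The sketch below mirrors that step-decay rule in plain Python; `step_lr` is an illustrative helper, not a PyTorch function.

```python
def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 49):
    print(f"epoch {epoch}: lr = {step_lr(0.01, epoch):.4f}")
```

Epochs 0 to 29 train at 0.01 and epochs 30 to 49 at 0.001, so if you shorten the run below 30 epochs the scheduler never kicks in.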
Real-World Examples and Use Cases
While VGG isn’t the go-to choice for production anymore, it still has its place. Here are scenarios where you might actually want to use VGG:
- Transfer Learning Base: VGG’s simple architecture makes it excellent for understanding feature extraction before jumping to ResNet
- Resource-Constrained Environments: VGG-11 can outperform smaller models when you need decent accuracy but can’t afford modern architectures
- Educational Purposes: Perfect for teaching CNN concepts without the complexity of skip connections
- Feature Extraction: VGG features are still used in style transfer and some computer vision pipelines
Here’s how to use VGG for transfer learning on a custom dataset:
# Load pre-trained VGG and modify for your dataset
import torchvision.models as models

def create_transfer_vgg(num_classes):
    # Load VGG16 with ImageNet weights. The `weights` argument replaces the
    # deprecated `pretrained=True` in torchvision 0.13 and later.
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    # Freeze feature extraction layers
    for param in model.features.parameters():
        param.requires_grad = False
    # Replace classifier for your number of classes
    model.classifier[6] = nn.Linear(4096, num_classes)
    return model

# For fine-tuning instead of feature extraction
def create_finetuned_vgg(num_classes, freeze_layers=True):
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    if freeze_layers:
        # Freeze early layers, train later ones.
        # parameters() yields a weight and a bias tensor per conv layer, so
        # i < 20 freezes the first 10 conv layers (blocks 1-4), not 20 layers.
        for i, param in enumerate(model.features.parameters()):
            if i < 20:
                param.requires_grad = False
    model.classifier[6] = nn.Linear(4096, num_classes)
    return model
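Because `enumerate(model.features.parameters())` counts parameter tensors rather than layers, it can help to check which convolutions a cutoff such as `i < 20` actually covers. The sketch below assumes the plain VGG-16 layout without batch normalization, where each convolution contributes one weight and one bias tensor; `frozen_convs` is a hypothetical helper written for this illustration.

```python
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']


def frozen_convs(cfg, num_frozen_tensors):
    """Return (block, conv-in-block) pairs whose weight AND bias are frozen."""
    frozen = []
    block, conv_in_block, tensor_idx = 1, 0, 0
    for item in cfg:
        if item == 'M':
            block += 1          # a pooling layer closes the current block
            conv_in_block = 0
            continue
        conv_in_block += 1
        # This conv owns tensors tensor_idx (weight) and tensor_idx + 1 (bias).
        if tensor_idx + 1 < num_frozen_tensors:
            frozen.append((block, conv_in_block))
        tensor_idx += 2
    return frozen


print(frozen_convs(VGG16_CFG, 20))
```

Under that assumption a cutoff of 20 freezes the ten convolutions of blocks 1 to 4 and leaves the three convolutions of block 5 trainable.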
For deployment on servers with limited GPU memory, you can implement a memory-efficient version:
class EfficientVGG(nn.Module):
    def __init__(self, vgg_name, num_classes=1000):
        super(EfficientVGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        # Use global average pooling instead of large FC layers
        self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.global_pool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def _make_layers(self, cfg):
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                           nn.BatchNorm2d(x),
                           nn.ReLU(inplace=True)]
                in_channels = x
        return nn.Sequential(*layers)
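To see how much the global-average-pooling head saves, the following sketch compares the parameter count of the three-layer classifier from the VGG class with the single linear layer used in EfficientVGG. It counts weights and biases only and assumes 10 output classes; the function names are illustrative.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def fc_head_params(num_classes):
    """Classic VGG head: 25088 -> 4096 -> 4096 -> num_classes."""
    return (linear_params(512 * 7 * 7, 4096)
            + linear_params(4096, 4096)
            + linear_params(4096, num_classes))


def gap_head_params(num_classes):
    """Global average pooling followed by a single 512 -> num_classes layer."""
    return linear_params(512, num_classes)


print(f"FC head:  {fc_head_params(10):,} parameters")
print(f"GAP head: {gap_head_params(10):,} parameters")
```

The trade-off is capacity: the small head removes almost all classifier parameters, which usually matters little on CIFAR-10 but may cost accuracy on harder datasets.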
Performance Comparisons and Benchmarks
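Let's be honest about VGG's performance compared to modern architectures. Here's roughly what you can expect on CIFAR-10; treat the accuracy, time, and memory figures as ballpark values that depend on hardware and hyperparameters. Note that the VGG parameter counts in this table match compact CIFAR-style variants (the convolutional stack plus a small classifier), not the 4096-unit classifier built above, which adds roughly 120M parameters: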
Model | Parameters | CIFAR-10 Accuracy | Training Time (50 epochs) | Memory Usage |
---|---|---|---|---|
VGG-11 | 9.2M | 88-90% | ~2 hours (GTX 1080) | ~3GB |
VGG-16 | 15M | 91-93% | ~3 hours (GTX 1080) | ~4GB |
VGG-19 | 20M | 92-94% | ~4 hours (GTX 1080) | ~5GB |
ResNet-18 | 11M | 94-95% | ~1 hour (GTX 1080) | ~2GB |
EfficientNet-B0 | 5.3M | 96-97% | ~1.5 hours (GTX 1080) | ~2GB |
The numbers don't lie - VGG is slower and hungrier than modern alternatives. But here's the performance on ImageNet for reference:
Model | Top-1 Accuracy | Top-5 Accuracy | Parameters |
---|---|---|---|
VGG-16 | 71.59% | 90.38% | 138M |
VGG-19 | 72.38% | 90.88% | 144M |
ResNet-50 | 76.15% | 92.87% | 25.6M |
EfficientNet-B0 | 77.69% | 93.53% | 5.3M |
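The 138M and 144M figures for VGG-16 and VGG-19 can be reproduced by hand. The sketch below counts weights and biases for the original layout (no batch normalization, 1000 classes, 7×7×512 features feeding the classifier); `conv_params` and `total_params` are names made up for this example.

```python
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']
VGG19_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
             512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']


def conv_params(cfg, in_channels=3):
    """Weights plus biases of all 3x3 convolutions in the configuration."""
    total = 0
    for item in cfg:
        if item == 'M':
            continue  # pooling layers have no parameters
        total += 3 * 3 * in_channels * item + item
        in_channels = item
    return total


def total_params(cfg, num_classes=1000):
    """Conv stack plus the 25088 -> 4096 -> 4096 -> num_classes classifier."""
    fc_sizes = [(512 * 7 * 7, 4096), (4096, 4096), (4096, num_classes)]
    fc = sum(n_in * n_out + n_out for n_in, n_out in fc_sizes)
    return conv_params(cfg) + fc


for name, layer_cfg in [('VGG-16', VGG16_CFG), ('VGG-19', VGG19_CFG)]:
    print(f"{name}: {total_params(layer_cfg) / 1e6:.1f}M parameters, "
          f"{conv_params(layer_cfg) / 1e6:.1f}M of them in conv layers")
```

Most of the parameters sit in the classifier rather than the convolutions, which is why ResNet-50 and EfficientNet-B0 can be so much smaller despite being deeper.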
Common Pitfalls and Troubleshooting
After implementing VGG dozens of times, here are the gotchas that'll save you hours of debugging:
Memory Issues: VGG's fully connected layers are massive (25,088 x 4096 = 103M parameters just for the first FC layer). If you're getting CUDA out of memory errors:
# Reduce batch size first
trainloader = DataLoader(trainset, batch_size=16, shuffle=True)  # Instead of 32

# Or use gradient accumulation
accumulation_steps = 4
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(trainloader):
    outputs = model(inputs.to(device))
    loss = criterion(outputs, targets.to(device)) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
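Dividing the loss by `accumulation_steps` is what makes the accumulated gradient match the gradient of one large batch. A toy check in plain Python, using a one-parameter linear model with mean squared error and made-up data (everything here is illustrative, and it assumes equally sized micro-batches):

```python
def mse_grad(w, xs, ys):
    """d/dw of mean((w * x - y) ** 2) over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5
accumulation_steps = 4
micro = len(xs) // accumulation_steps   # micro-batch size of 2

full_grad = mse_grad(w, xs, ys)         # gradient of one batch of 8

accumulated = 0.0
for step in range(accumulation_steps):
    lo, hi = step * micro, (step + 1) * micro
    # Scaling each micro-batch gradient mirrors `loss / accumulation_steps`.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_grad, accumulated)
```

The two values agree up to floating-point error. With BatchNorm in the model the match is only approximate, because the normalization statistics are still computed per micro-batch.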
Training Instability: VGG without batch normalization can be tricky to train. If loss explodes or doesn't converge:
# Use gradient clipping (call this between loss.backward() and optimizer.step())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Lower learning rate
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # Instead of 0.01

# Add weight initialization
def init_weights(m):
    if isinstance(m, nn.Conv2d):
        torch.nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        if m.bias is not None:
            torch.nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.Linear):
        torch.nn.init.normal_(m.weight, 0, 0.01)
        torch.nn.init.constant_(m.bias, 0)

model.apply(init_weights)
Input Size Mismatch: VGG expects 224x224 inputs. For smaller datasets like CIFAR-10 (32x32), you need to resize or modify the architecture:
# Option 1: Resize inputs (what we did above)
transform = transforms.Compose([
    transforms.Resize(224),
    # ... other transforms
])

# Option 2: Modify architecture for smaller inputs
class VGG_CIFAR(nn.Module):
    def __init__(self, vgg_name, num_classes=10):
        super(VGG_CIFAR, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        # Calculate the size after conv layers for CIFAR-10 (32x32 input)
        # After 5 pooling layers: 32 -> 16 -> 8 -> 4 -> 2 -> 1
        self.classifier = nn.Sequential(
            nn.Linear(512 * 1 * 1, 512),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)      # (N, 512, 1, 1) for 32x32 inputs
        x = torch.flatten(x, 1)   # (N, 512)
        return self.classifier(x)

    # Reuse the layer builder from the VGG class defined earlier
    _make_layers = VGG._make_layers
Best Practices and Optimization Tips
If you're committed to using VGG, here are ways to make it less painful:
Mixed Precision Training: Cuts memory usage and training time significantly:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for epoch in range(epochs):
    for inputs, targets in trainloader:
        optimizer.zero_grad()
        with autocast():
            outputs = model(inputs.to(device))
            loss = criterion(outputs, targets.to(device))
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
Model Pruning: Zero out low-magnitude weights to make the model sparser. Note that unstructured pruning only masks weights; the saved model doesn't get smaller or faster unless you make the pruning permanent and use sparse storage or a sparsity-aware runtime:
import torch.nn.utils.prune as prune

def prune_vgg(model, amount=0.3):
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            # Mask the `amount` fraction of weights with the smallest absolute value
            prune.l1_unstructured(module, name='weight', amount=amount)
    return model

# Apply pruning after training
model = prune_vgg(model, amount=0.3)
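For intuition about what L1 unstructured pruning does to a single weight tensor, here is a framework-free sketch on a flat list of made-up weights. It is a simplified illustration of the idea (mask the smallest-magnitude entries), not PyTorch's implementation, and `l1_unstructured_mask` is a hypothetical name.

```python
def l1_unstructured_mask(weights, amount):
    """Return a 0/1 mask that zeroes the `amount` fraction of smallest |w|."""
    num_to_prune = round(amount * len(weights))
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_to_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6, -0.02, 0.1]
mask = l1_unstructured_mask(weights, amount=0.3)
pruned_weights = [w * m for w, m in zip(weights, mask)]

print(mask)
print(f"{len(pruned_weights)} values stored, {mask.count(0)} of them zero")
```

The tensor keeps its original shape and only gains zeros, which is why masking alone does not shrink the model on disk.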
Knowledge Distillation: Use a larger VGG to train a smaller one:
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, temperature=4, alpha=0.7):
    # Soft-target term: KL divergence between temperature-softened distributions
    soft_loss = nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1)
    ) * (temperature ** 2)
    # Hard-target term: ordinary cross-entropy against the labels
    student_loss = F.cross_entropy(student_logits, targets)
    return alpha * soft_loss + (1 - alpha) * student_loss
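The temperature is what lets the student learn from the teacher's relative confidence in the wrong classes. The sketch below works through the soft-target term for a single example in plain Python, with made-up logits; it computes KL(teacher || student) on temperature-scaled softmax outputs, scaled by the squared temperature as in the function above.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher_logits = [6.0, 2.0, 1.0]   # illustrative values only
student_logits = [4.0, 3.0, 0.5]

for temperature in (1, 4):
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    soft_loss = kl_divergence(teacher, student) * temperature ** 2
    print(temperature, [round(p, 3) for p in teacher], round(soft_loss, 4))
```

At the higher temperature the teacher's distribution is visibly flatter, so the smaller logits carry more of the training signal.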
For production deployments on your dedicated servers, consider converting to ONNX for faster inference:
import torch.onnx

# Convert trained model to ONNX.
# Switch to eval mode first so Dropout and BatchNorm use inference behavior.
model.eval()
dummy_input = torch.randn(1, 3, 224, 224).to(device)
torch.onnx.export(model, dummy_input, "vgg16.onnx",
                  export_params=True, opset_version=11,
                  input_names=['input'], output_names=['output'])

# Load with ONNX Runtime for deployment
import onnxruntime as ort
ort_session = ort.InferenceSession("vgg16.onnx")
The PyTorch documentation has extensive details on pre-trained models and transfer learning techniques. For deployment considerations on your VPS or dedicated infrastructure, remember that the cost of VGG's convolutional layers scales linearly with the number of input pixels, so doubling the image side length roughly quadruples the compute. That makes it predictable for capacity planning but potentially expensive for high-resolution inference workloads.
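As a rough capacity-planning aid, the sketch below estimates the multiply-accumulate operations (MACs) in the VGG-16 convolutional stack at a few resolutions. It counts only the 3×3 convolutions, ignoring biases, pooling, activations, and the classifier, and `conv_macs` is a hypothetical helper.

```python
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']


def conv_macs(cfg, size, in_channels=3):
    """MACs of all 3x3 convs for a square `size` x `size` input."""
    total = 0
    for item in cfg:
        if item == 'M':
            size //= 2  # pooling halves the side length for later layers
        else:
            # One 3x3 x in_channels dot product per output position and filter.
            total += size * size * 3 * 3 * in_channels * item
            in_channels = item
    return total


base = conv_macs(VGG16_CFG, 224)
for size in (224, 448, 896):
    macs = conv_macs(VGG16_CFG, size)
    print(f"{size}x{size}: {macs / 1e9:.1f} GMACs ({macs / base:.0f}x the 224x224 cost)")
```

Each doubling of the side length multiplies the convolutional cost by four, while the classifier cost stays fixed when adaptive pooling reduces the features to 7×7.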
