
Writing CNNs from Scratch in PyTorch – Beginner’s Guide
Convolutional Neural Networks (CNNs) are the backbone of modern computer vision applications, powering everything from image classification to object detection systems. While frameworks like TensorFlow and Keras make it easy to build CNNs with high-level abstractions, understanding how to implement them from scratch in PyTorch provides crucial insights into their inner workings and gives you greater control over your models. This guide will walk you through building CNNs from the ground up, covering the mathematical foundations, practical implementation details, and common troubleshooting scenarios you’ll encounter when deploying these models on production servers.
How CNNs Work Under the Hood
Before diving into code, it’s essential to understand what happens inside a CNN. At its core, a CNN applies learnable filters (kernels) across input images through convolution operations, followed by pooling layers that reduce spatial dimensions and fully connected layers for final classification.
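To make the convolution step concrete, here is a minimal sketch in plain Python of sliding one kernel over a single-channel image with stride 1 and no padding. The function name `conv2d_valid`, the toy image, and the edge kernel are illustrative assumptions, not part of PyTorch. Like `nn.Conv2d`, the sketch computes cross-correlation, meaning the kernel is not flipped.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it and sum the products
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image: dark left half, bright right half (a vertical edge)
image = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical hand-written kernel that responds to left-to-right brightening
edge_kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, edge_kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only where the edge sits. In a real CNN the kernel values are learned during training rather than written by hand.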
The key components include:
- Convolutional layers that detect features like edges, textures, and patterns
- Activation functions (typically ReLU) that introduce non-linearity
- Pooling layers that downsample feature maps and reduce computational load
- Fully connected layers that map features to class probabilities
- Dropout layers for regularization to prevent overfitting
PyTorch’s dynamic computation graph makes it particularly well-suited for understanding these operations since you can inspect tensors at each step and modify the forward pass dynamically.
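As a small illustration of inspecting tensors at each step, the sketch below pushes one random image through a convolution, a ReLU, and a max-pooling layer and prints the intermediate shapes. The layer sizes and the random input are arbitrary choices for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 32, 32)  # one fake RGB image, CIFAR-10 sized

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

features = conv(x)            # padding=1 keeps the 32x32 spatial size
activated = F.relu(features)  # negative responses are clamped to zero
pooled = pool(activated)      # 2x2 pooling halves height and width

for name, tensor in [("input", x), ("conv", features),
                     ("relu", activated), ("pool", pooled)]:
    print(f"{name:>5}: shape={tuple(tensor.shape)}, min={tensor.min().item():.3f}")
```

Because each intermediate result is an ordinary tensor, you can print it, plot it, or set a breakpoint on it without any special tooling.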
Setting Up Your Development Environment
First, ensure you have the necessary dependencies installed. If you’re running this on a VPS or dedicated server, make sure you have sufficient RAM (at least 8GB recommended) and GPU support if available.
pip install torch torchvision matplotlib numpy
# For GPU support (optional but recommended)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Verify your installation:
import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
Building Your First CNN from Scratch
Let’s start with a simple CNN for CIFAR-10 classification. This implementation shows the fundamental structure without any shortcuts:
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        # First convolutional block
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32,
                               kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Second convolutional block
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64,
                               kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Third convolutional block
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128,
                               kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Fully connected layers
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        # First block
        x = self.pool1(F.relu(self.bn1(self.conv1(x))))
        # Second block
        x = self.pool2(F.relu(self.bn2(self.conv2(x))))
        # Third block
        x = self.pool3(F.relu(self.bn3(self.conv3(x))))
        # Flatten for fully connected layers
        x = x.view(x.size(0), -1)
        # Fully connected layers
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Initialize the model
model = SimpleCNN(num_classes=10)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
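This basic architecture follows the classic CNN pattern: three blocks of convolution, batch normalization, ReLU, and max pooling, followed by two fully connected layers. Each pooling layer halves the spatial size, so a 32x32 CIFAR-10 image reaches fc1 as a 128 x 4 x 4 feature map, which is where the 128 * 4 * 4 input size comes from. The batch normalization layers help with training stability, while dropout helps reduce overfitting on smaller datasets.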
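The input size of the first fully connected layer can be checked with the standard output-size formula for convolution and pooling layers, floor((size + 2 * padding - kernel_size) / stride) + 1. The helper name `output_size` below is a hypothetical one chosen for this sketch.

```python
def output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 32  # CIFAR-10 images are 32x32
for block in range(1, 4):
    size = output_size(size, kernel_size=3, stride=1, padding=1)  # conv keeps the size
    size = output_size(size, kernel_size=2, stride=2)             # max pool halves it
    print(f"after block {block}: {size}x{size}")

flat_features = 128 * size * size  # channels of the last block * height * width
print(f"fc1 input features: {flat_features}")  # 2048
```

If you change the input resolution or the number of pooling layers, rerun this arithmetic before editing the `nn.Linear` size, otherwise the flatten step will raise a shape mismatch.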
Data Loading and Preprocessing
Proper data handling is crucial for CNN performance. Here’s how to set up efficient data loading with torchvision:
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Define transforms for training and validation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
# Validation data gets no augmentation, only tensor conversion and normalization
val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

# Load CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                             download=True, transform=train_transform)
val_dataset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                           download=True, transform=val_transform)
# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False, num_workers=4)
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
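The two tuples passed to `transforms.Normalize` are per-channel means and standard deviations. The sketch below shows how such statistics could be computed for a batch and what the normalization does to it. A random tensor stands in for real images here, which is an assumption made to keep the example self-contained.

```python
import torch

torch.manual_seed(0)
# Stand-in for a batch of ToTensor() outputs: values in [0, 1], shape (N, C, H, W)
images = torch.rand(64, 3, 32, 32)

# Reduce over batch, height, and width to get one value per channel
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))
print(f"per-channel mean: {mean.tolist()}")
print(f"per-channel std:  {std.tolist()}")

# Normalize applies (x - mean) / std channel by channel
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]
print(f"mean after normalization: {normalized.mean(dim=(0, 2, 3)).tolist()}")
```

After normalization each channel has roughly zero mean and unit standard deviation, which tends to make optimization better conditioned.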
Training Loop Implementation
Here’s a complete training loop with proper logging, validation, and checkpointing:
def train_model(model, train_loader, val_loader, num_epochs=50):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    best_val_acc = 0.0
    train_losses = []
    val_accuracies = []

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        running_loss = 0.0
        for batch_idx, (data, targets) in enumerate(train_loader):
            data, targets = data.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(data)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if batch_idx % 100 == 0:
                print(f'Epoch {epoch+1}/{num_epochs}, Batch {batch_idx}, '
                      f'Loss: {loss.item():.4f}')

        # Validation phase
        model.eval()
        val_correct = 0
        val_total = 0
        with torch.no_grad():
            for data, targets in val_loader:
                data, targets = data.to(device), targets.to(device)
                outputs = model(data)
                _, predicted = torch.max(outputs, 1)
                val_total += targets.size(0)
                val_correct += (predicted == targets).sum().item()

        val_acc = 100 * val_correct / val_total
        avg_train_loss = running_loss / len(train_loader)
        train_losses.append(avg_train_loss)
        val_accuracies.append(val_acc)
        print(f'Epoch {epoch+1}: Train Loss: {avg_train_loss:.4f}, '
              f'Val Accuracy: {val_acc:.2f}%')

        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')

        # Step the learning rate schedule once per epoch
        scheduler.step()

    return train_losses, val_accuracies

# Train the model
train_losses, val_accuracies = train_model(model, train_loader, val_loader)
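The training loop saves only the `state_dict`, so reloading a checkpoint means rebuilding the architecture first and then loading the weights into it. The sketch below shows that round trip with a tiny stand-in model and a temporary directory, both assumptions made for illustration. With the code above you would construct `SimpleCNN` and load `best_model.pth` instead.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_tiny_model():
    # Hypothetical stand-in for SimpleCNN, kept small so the example runs quickly
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


torch.manual_seed(0)
tiny_model = build_tiny_model()
tiny_model.eval()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pth")
    torch.save(tiny_model.state_dict(), path)

    # Rebuild the same architecture, then load the saved weights into it
    restored = build_tiny_model()
    state_dict = torch.load(path, map_location="cpu")
    restored.load_state_dict(state_dict)

restored.eval()  # disable dropout and use running batch-norm statistics
x = torch.randn(3, 4)
with torch.no_grad():
    same_output = torch.allclose(tiny_model(x), restored(x))
print(f"Outputs match after reload: {same_output}")
```

Passing `map_location="cpu"` lets a checkpoint trained on a GPU be loaded on a machine without one.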
Advanced CNN Architectures
Once you understand the basics, you can implement more sophisticated architectures. Here’s a ResNet-style block with skip connections:
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Identity shortcut unless the shape changes
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out


class CustomResNet(nn.Module):
    def __init__(self, num_classes=10):
        super(CustomResNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.layer1 = self._make_layer(64, 64, 2, stride=1)
        self.layer2 = self._make_layer(64, 128, 2, stride=2)
        self.layer3 = self._make_layer(128, 256, 2, stride=2)
        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(256, num_classes)

    def _make_layer(self, in_channels, out_channels, num_blocks, stride):
        # Only the first block of a stage changes channels or resolution
        layers = []
        layers.append(ResidualBlock(in_channels, out_channels, stride))
        for _ in range(1, num_blocks):
            layers.append(ResidualBlock(out_channels, out_channels))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.avg_pool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
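The 1x1 convolution in the shortcut exists because the addition in a residual block only works when both branches have the same shape. The standalone sketch below, with arbitrary layer sizes, shows the mismatch that a stride-2 block creates and how a projection shortcut resolves it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 64, 32, 32)

# Main path: the first conv halves the resolution and doubles the channels
main_path = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(128),
)
# Projection shortcut: a 1x1 conv with the same stride and output channels
projection = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)

out = main_path(x)
print(f"input: {tuple(x.shape)}, main path: {tuple(out.shape)}")  # shapes differ

shortcut = projection(x)
result = torch.relu(out + shortcut)  # shapes now agree, so the sum is valid
print(f"shortcut: {tuple(shortcut.shape)}, block output: {tuple(result.shape)}")
```

When the stride is 1 and the channel count is unchanged, the input already matches the main path, which is why the block falls back to an empty `nn.Sequential()` as an identity shortcut.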
Performance Optimization and Best Practices
Here are critical optimizations for production deployment, especially important when running on server infrastructure:
| Optimization | Typical speed impact | Implementation | Typical memory impact |
|---|---|---|---|
| Mixed Precision Training | 1.5-2x speed increase | torch.cuda.amp.autocast() | ~50% reduction |
| Gradient Checkpointing | 10-20% slower training | torch.utils.checkpoint | ~80% reduction |
| DataLoader num_workers | 2-4x data loading speed | num_workers=4-8 | Minimal increase |
| Batch Size Optimization | GPU utilization grows with batch size | Powers of 2 (32, 64, 128) | Linear increase |
Here’s how to implement mixed precision training:
from torch.cuda.amp import GradScaler, autocast

def train_with_mixed_precision(model, train_loader, val_loader, num_epochs=50):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # These CUDA AMP helpers only apply on a GPU; disable them on CPU
    use_amp = device.type == 'cuda'
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
    scaler = GradScaler(enabled=use_amp)

    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for batch_idx, (data, targets) in enumerate(train_loader):
            data, targets = data.to(device), targets.to(device)
            optimizer.zero_grad()
            # Mixed precision forward pass
            with autocast(enabled=use_amp):
                outputs = model(data)
                loss = criterion(outputs, targets)
            # Scale the loss so small float16 gradients do not underflow
            scaler.scale(loss).backward()
            # step() unscales the gradients and skips the update if they overflowed
            scaler.step(optimizer)
            scaler.update()
            running_loss += loss.item()
        print(f'Epoch {epoch+1}: Loss: {running_loss/len(train_loader):.4f}')
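Gradient checkpointing, also listed in the table above, trades compute for memory: activations inside a checkpointed segment are discarded after the forward pass and recomputed during the backward pass. The sketch below wraps a small convolutional block with `torch.utils.checkpoint.checkpoint` and confirms that the gradients match the ordinary forward pass. The block, the head, and the helper name `run_forward` are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)
head = nn.Linear(16, 10)
x = torch.randn(4, 3, 32, 32)


def run_forward(inputs, use_checkpoint):
    if use_checkpoint:
        # Activations inside `block` are recomputed in backward instead of stored
        features = checkpoint(block, inputs, use_reentrant=False)
    else:
        features = block(inputs)
    return head(features.mean(dim=(2, 3)))  # global average pool, then classify


# Reference gradients from a normal forward and backward pass
run_forward(x, use_checkpoint=False).sum().backward()
reference = [p.grad.clone() for p in block.parameters()]

block.zero_grad()
head.zero_grad()

# Same computation with checkpointing enabled
run_forward(x, use_checkpoint=True).sum().backward()
checkpointed = [p.grad.clone() for p in block.parameters()]

grads_match = all(torch.allclose(a, b, atol=1e-5)
                  for a, b in zip(reference, checkpointed))
print(f"Gradients match: {grads_match}")
```

The memory saving grows with the size of the checkpointed segment, so in practice it is usually applied to whole stages of a deep network rather than to a two-layer block like this one.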
Common Issues and Troubleshooting
Based on real-world deployment experience, here are the most frequent problems and their solutions:
- Out of Memory Errors: Reduce batch size, use gradient checkpointing, or implement gradient accumulation
- Vanishing Gradients: Add batch normalization, use ResNet-style skip connections, or try different activation functions
- Slow Training: Increase batch size if memory allows, use multiple GPUs with DataParallel, optimize data loading with proper num_workers
- Poor Convergence: Adjust learning rate, add learning rate scheduling, check data normalization
- Overfitting: Increase dropout rate, add weight decay, use data augmentation, reduce model complexity
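Gradient accumulation, mentioned above as a remedy for out-of-memory errors, can be sketched as follows. Several small batches are processed before each optimizer step, and each loss is divided by the number of accumulation steps so that the summed gradient matches the one a single large batch would produce. The linear model, the random data, and the value of `accumulation_steps` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data = torch.randn(16, 8)
targets = torch.randint(0, 3, (16,))
accumulation_steps = 4  # four micro-batches of 4 samples each

optimizer.zero_grad()
micro_batches = zip(data.chunk(accumulation_steps), targets.chunk(accumulation_steps))
for micro_data, micro_targets in micro_batches:
    loss = criterion(model(micro_data), micro_targets)
    # backward() adds to .grad, so dividing keeps the total equal to the batch mean
    (loss / accumulation_steps).backward()
accumulated_grad = model.weight.grad.clone()

# Reference: gradient of the loss over the full batch of 16 samples
full_loss = criterion(model(data), targets)
reference_grad = torch.autograd.grad(full_loss, model.weight)[0]
print(f"Matches full-batch gradient: "
      f"{torch.allclose(accumulated_grad, reference_grad, atol=1e-6)}")

optimizer.step()  # one parameter update for the whole effective batch
```

The equivalence is exact only for layers that treat samples independently. Batch normalization computes statistics per micro-batch, so results can differ slightly from true large-batch training.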
Here’s a debugging utility to monitor training:
def debug_model(model, sample_input):
    model.eval()
    # Hook to capture activations
    activations = {}
    handles = []

    def hook_fn(name):
        def hook(module, input, output):
            activations[name] = output.detach()
        return hook

    # Register hooks and keep the handles so they can be removed afterwards
    for name, layer in model.named_modules():
        if isinstance(layer, (nn.Conv2d, nn.Linear)):
            handles.append(layer.register_forward_hook(hook_fn(name)))

    # Forward pass on the same device as the model
    device = next(model.parameters()).device
    with torch.no_grad():
        model(sample_input.to(device))

    # Remove the hooks so repeated calls do not stack duplicates
    for handle in handles:
        handle.remove()

    # Print activation statistics
    for name, activation in activations.items():
        print(f"{name}: Shape={activation.shape}, "
              f"Mean={activation.mean():.4f}, "
              f"Std={activation.std():.4f}")
    return activations

# Usage
sample_batch = next(iter(train_loader))[0][:1]  # Single sample
debug_info = debug_model(model, sample_batch)
Real-World Use Cases and Deployment
CNNs built from scratch in PyTorch are particularly valuable in these scenarios:
- Custom Computer Vision Tasks: Medical imaging, satellite imagery analysis, industrial quality control
- Edge Deployment: Converting to ONNX or TensorRT for inference on embedded devices
- Research Applications: Experimenting with novel architectures or loss functions
- Educational Purposes: Understanding the mathematical foundations for team training
For production deployment, consider model quantization and pruning. Post-training static quantization looks like this:
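# Model quantization for faster inference (post-training static quantization, CPU only)
import copy
import io
import torch.quantization as quantization

def quantize_model(model, calibration_loader):
    # Quantized kernels run on the CPU; work on a copy so the trained model is untouched
    model_fp32 = copy.deepcopy(model).cpu().eval()
    # QuantWrapper adds the quantize/dequantize stubs that static quantization expects
    wrapped = quantization.QuantWrapper(model_fp32)
    wrapped.qconfig = quantization.get_default_qconfig('fbgemm')
    model_prepared = quantization.prepare(wrapped, inplace=False)
    # Calibration with sample data
    with torch.no_grad():
        for data, _ in calibration_loader:
            model_prepared(data)
            break
    model_quantized = quantization.convert(model_prepared, inplace=False)
    return model_quantized

def serialized_size_mb(m):
    # Quantized weights are packed, so compare serialized bytes, not parameter counts
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

quantized_model = quantize_model(model, train_loader)
print(f"Original model size: {serialized_size_mb(model):.2f} MB")
print(f"Quantized model size: {serialized_size_mb(quantized_model):.2f} MB")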
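Pruning, the other technique mentioned above, can be tried with `torch.nn.utils.prune`. The sketch below zeroes the half of a convolution's weights with the smallest magnitude and then makes the result permanent. The layer and the 50% amount are arbitrary choices for demonstration, not tuned recommendations.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3)  # 8 * 3 * 3 * 3 = 216 weights

# Mask out the 50% of weights with the smallest absolute value
prune.l1_unstructured(conv, name="weight", amount=0.5)
# While pruning is active, `weight` is computed as weight_orig * weight_mask
has_mask = hasattr(conv, "weight_mask")

# Fold the mask into the tensor and drop the pruning bookkeeping
prune.remove(conv, "weight")

sparsity = (conv.weight == 0).float().mean().item()
print(f"Mask present during pruning: {has_mask}")
print(f"Fraction of zero weights: {sparsity:.2f}")
```

Unstructured pruning like this only sets weights to zero. The tensor keeps its dense shape, so a speed or size benefit depends on sparse-aware storage or kernels, and accuracy should be re-validated after pruning.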
Integration with MLOps and Monitoring
When deploying CNNs on production servers, implement proper monitoring and logging:
import json
import logging
import time

class ModelTracker:
    def __init__(self, log_file='model_training.log'):
        logging.basicConfig(
            filename=log_file,
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.start_time = None
        self.metrics = {}

    def start_epoch(self, epoch):
        self.start_time = time.time()
        logging.info(f"Starting epoch {epoch}")

    def end_epoch(self, epoch, train_loss, val_acc):
        epoch_time = time.time() - self.start_time
        self.metrics[epoch] = {
            'train_loss': train_loss,
            'val_accuracy': val_acc,
            'epoch_time': epoch_time
        }
        logging.info(f"Epoch {epoch} completed in {epoch_time:.2f}s - "
                     f"Loss: {train_loss:.4f}, Accuracy: {val_acc:.2f}%")

    def save_metrics(self, filename='training_metrics.json'):
        with open(filename, 'w') as f:
            json.dump(self.metrics, f, indent=2)

# Usage in training loop
tracker = ModelTracker()
# Call tracker.start_epoch(epoch) at the top of each epoch and
# tracker.end_epoch(epoch, avg_train_loss, val_acc) after validation
For comprehensive CNN tutorials and advanced techniques, refer to the official PyTorch documentation and the torchvision model zoo for reference implementations.
Building CNNs from scratch provides invaluable insights into deep learning fundamentals while giving you the flexibility to create custom architectures for specific use cases. The combination of PyTorch’s dynamic nature and proper server infrastructure makes it an excellent choice for both research and production deployment of computer vision systems.
