Writing CNNs from Scratch in PyTorch – Beginner’s Guide

Convolutional Neural Networks (CNNs) are the backbone of modern computer vision applications, powering everything from image classification to object detection systems. While frameworks like TensorFlow and Keras make it easy to build CNNs with high-level abstractions, understanding how to implement them from scratch in PyTorch provides crucial insights into their inner workings and gives you greater control over your models. This guide will walk you through building CNNs from the ground up, covering the mathematical foundations, practical implementation details, and common troubleshooting scenarios you’ll encounter when deploying these models on production servers.

How CNNs Work Under the Hood

Before diving into code, it’s essential to understand what happens inside a CNN. At its core, a CNN applies learnable filters (kernels) across input images through convolution operations, followed by pooling layers that reduce spatial dimensions and fully connected layers for final classification.

The key components include:

Convolutional layers that detect features like edges, textures, and patterns
Activation functions (typically ReLU) that introduce non-linearity
Pooling layers that downsample feature maps and reduce computational load
Fully connected layers that map features to class scores (logits), which softmax turns into probabilities
Dropout layers for regularization to prevent overfitting
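To make the first three components concrete, here is a minimal sketch that runs convolution, ReLU, and max pooling by hand on a tiny single-channel image, using plain Python lists. The helper names (conv2d_valid, relu, max_pool2x2) and the hand-picked edge kernel are illustrative assumptions, not PyTorch APIs; in a real CNN the kernel weights are learned. Like nn.Conv2d, the helper slides the kernel without flipping it.

```python
# Hand-rolled conv -> ReLU -> max-pool on a tiny single-channel "image".
# conv2d_valid, relu and max_pool2x2 are illustrative helpers, not PyTorch APIs.

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), summing products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(fmap):
    """Zero out negative responses."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]

# 6x6 image: bright, dark, bright vertical bands (two vertical edges).
image = [[1, 1, 0, 0, 1, 1] for _ in range(6)]

# Hand-picked vertical-edge kernel; a CNN would learn these nine weights.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)  # 4x4, rows are [-3, -3, 3, 3]
activated = relu(feature_map)              # rows become [0, 0, 3, 3]
pooled = max_pool2x2(activated)            # 2x2: [[0, 3], [0, 3]]

print(feature_map[0], activated[0], pooled)
```

The kernel responds negatively to the bright-to-dark edge and positively to the dark-to-bright one. ReLU discards the negative response, and pooling halves the resolution while keeping the location of the surviving edge.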

PyTorch’s dynamic computation graph makes it particularly well-suited for understanding these operations since you can inspect tensors at each step and modify the forward pass dynamically.

Setting Up Your Development Environment

First, ensure you have the necessary dependencies installed. If you’re running this on a VPS or dedicated server, make sure you have sufficient RAM (at least 8GB recommended) and GPU support if available.

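pip install matplotlib numpy
# Default PyTorch build
pip install torch torchvision
# Or, for a CUDA 11.8 build, run this instead of the line above (optional, but recommended if you have a GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118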

Verify your installation:

import torch
import torchvision
import torch.nn as nn
import torch.optim as optim

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")

Building Your First CNN from Scratch

Let’s start with a simple CNN for CIFAR-10 classification. This implementation shows the fundamental structure without any shortcuts:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        
        # First convolutional block
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, 
                              kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Second convolutional block
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, 
                              kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Third convolutional block
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, 
                              kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Fully connected layers
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(512, num_classes)
        
    def forward(self, x):
        # First block
        x = self.pool1(F.relu(self.bn1(self.conv1(x))))
        
        # Second block
        x = self.pool2(F.relu(self.bn2(self.conv2(x))))
        
        # Third block
        x = self.pool3(F.relu(self.bn3(self.conv3(x))))
        
        # Flatten for fully connected layers
        x = x.view(x.size(0), -1)
        
        # Fully connected layers
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        
        return x

# Initialize the model
model = SimpleCNN(num_classes=10)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

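This architecture follows the classic CNN pattern: three convolution, batch normalization, ReLU, and max-pooling blocks, followed by two fully connected layers. Each pooling layer halves the spatial size, so a 32x32 CIFAR-10 image reaches the first fully connected layer as a 128 x 4 x 4 feature map. The batch normalization layers help with training stability, while dropout reduces overfitting on smaller datasets.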
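If you want to check where the 128 * 4 * 4 input size of fc1 comes from, the arithmetic can be sketched in plain Python. The helper names below are illustrative, and the output-size formula is the standard one for convolution and pooling layers with no dilation, floor((size + 2 * padding - kernel_size) / stride) + 1.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# Trace the spatial size of a 32x32 CIFAR-10 image through the three blocks.
size = 32
trace = [size]
for _ in range(3):
    size = conv_out_size(size, kernel_size=3, stride=1, padding=1)  # 3x3 conv, padding 1: size unchanged
    size = conv_out_size(size, kernel_size=2, stride=2)             # 2x2 max-pool: size halved
    trace.append(size)

flat_features = 128 * size * size  # channels * height * width entering fc1

# Learnable parameter counts per layer type.
def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out   # weights + biases

def bn_params(channels):
    return 2 * channels                   # scale + shift (running stats are buffers)

def linear_params(n_in, n_out):
    return n_in * n_out + n_out           # weights + biases

total_params = (
    conv_params(3, 32, 3) + bn_params(32)
    + conv_params(32, 64, 3) + bn_params(64)
    + conv_params(64, 128, 3) + bn_params(128)
    + linear_params(flat_features, 512)
    + linear_params(512, 10)
)

print(trace)                 # [32, 16, 8, 4]
print(flat_features)         # 2048
print(f"{total_params:,}")   # 1,147,914
```

Most of the parameters sit in fc1, which is one reason later architectures replace the large fully connected layer with global average pooling.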

Data Loading and Preprocessing

Proper data handling is crucial for CNN performance. Here’s how to set up efficient data loading with torchvision:

import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Define transforms for training and validation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

# Load CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                           download=True, transform=train_transform)
val_dataset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                         download=True, transform=val_transform)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False, num_workers=4)

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
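The constants passed to Normalize above are per-channel means and standard deviations. As a rough sketch of how such statistics can be computed and applied, the snippet below uses a random tensor as a stand-in for a stack of images already scaled to [0, 1] by ToTensor(); the variable names and the synthetic data are assumptions for illustration, so the numbers it produces are not CIFAR-10 statistics.

```python
import torch

torch.manual_seed(0)

# Stand-in for a stack of images in (N, C, H, W) layout with values in [0, 1].
images = torch.rand(64, 3, 32, 32)

# One mean and one std per channel, reduced over batch, height and width.
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))

# What transforms.Normalize(mean, std) does per image: (x - mean) / std, channel by channel.
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]

print("mean per channel:", mean)
print("std per channel: ", std)
print("after normalization:", normalized.mean(dim=(0, 2, 3)), normalized.std(dim=(0, 2, 3)))
```

After normalization each channel has roughly zero mean and unit variance, which tends to make optimization better conditioned. For a custom dataset you would compute these statistics on the training split only and reuse them for validation.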

Training Loop Implementation

Here’s a complete training loop with proper logging, validation, and checkpointing:

def train_model(model, train_loader, val_loader, num_epochs=50):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    
    best_val_acc = 0.0
    train_losses = []
    val_accuracies = []
    
    for epoch in range(num_epochs):
        # Training phase
        model.train()
        running_loss = 0.0
        
        for batch_idx, (data, targets) in enumerate(train_loader):
            data, targets = data.to(device), targets.to(device)
            
            optimizer.zero_grad()
            outputs = model(data)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
            
            if batch_idx % 100 == 0:
                print(f'Epoch {epoch+1}/{num_epochs}, Batch {batch_idx}, '
                      f'Loss: {loss.item():.4f}')
        
        # Validation phase
        model.eval()
        val_correct = 0
        val_total = 0
        
        with torch.no_grad():
            for data, targets in val_loader:
                data, targets = data.to(device), targets.to(device)
                outputs = model(data)
                predicted = outputs.argmax(dim=1)  # index of the highest logit per sample
                val_total += targets.size(0)
                val_correct += (predicted == targets).sum().item()
        
        val_acc = 100 * val_correct / val_total
        avg_train_loss = running_loss / len(train_loader)
        
        train_losses.append(avg_train_loss)
        val_accuracies.append(val_acc)
        
        print(f'Epoch {epoch+1}: Train Loss: {avg_train_loss:.4f}, '
              f'Val Accuracy: {val_acc:.2f}%')
        
        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')
        
        scheduler.step()
    
    return train_losses, val_accuracies

# Train the model
train_losses, val_accuracies = train_model(model, train_loader, val_loader)
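The StepLR scheduler in the loop above multiplies the learning rate by gamma every step_size epochs. A small sketch of the resulting schedule, in plain Python with an illustrative helper name (step_lr is not a PyTorch function), assuming scheduler.step() is called once at the end of each epoch as in train_model:

```python
import math

def step_lr(base_lr, epoch, step_size=20, gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed) under a StepLR-style decay."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate at the boundaries of the 50-epoch run used above.
schedule = {epoch: step_lr(0.001, epoch) for epoch in (0, 19, 20, 39, 40, 49)}

for epoch, lr in schedule.items():
    print(f"epoch {epoch + 1:2d}: lr = {lr:.0e}")
```

With the defaults used here, epochs 1 to 20 train at 1e-3, epochs 21 to 40 at 1e-4, and the final ten epochs at 1e-5.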

Advanced CNN Architectures

Once you understand the basics, you can implement more sophisticated architectures. Here’s a ResNet-style block with skip connections:

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super(ResidualBlock, self).__init__()
        
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, 
                              stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, 
                              stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, 
                         stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )
    
    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out

class CustomResNet(nn.Module):
    def __init__(self, num_classes=10):
        super(CustomResNet, self).__init__()
        
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        
        self.layer1 = self._make_layer(64, 64, 2, stride=1)
        self.layer2 = self._make_layer(64, 128, 2, stride=2)
        self.layer3 = self._make_layer(128, 256, 2, stride=2)
        
        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(256, num_classes)
    
    def _make_layer(self, in_channels, out_channels, num_blocks, stride):
        layers = []
        layers.append(ResidualBlock(in_channels, out_channels, stride))
        
        for _ in range(1, num_blocks):
            layers.append(ResidualBlock(out_channels, out_channels))
        
        return nn.Sequential(*layers)
    
    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.avg_pool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
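The shortcut branch in ResidualBlock exists because the addition only works when both tensors have the same shape. The following sketch, using standalone layers rather than the classes above, shows the shapes involved when a block halves the resolution and doubles the channels, as layer2 does; the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 64, 32, 32)  # (batch, channels, height, width)

# Main path: two 3x3 convolutions, the first with stride 2.
main_path = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(128),
)

# Projection shortcut: a 1x1 convolution with the same stride and output channels.
projection = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)

main_path.eval()
projection.eval()
with torch.no_grad():
    main_out = main_path(x)
    shortcut_out = projection(x)
    out = torch.relu(main_out + shortcut_out)

print("input:   ", tuple(x.shape))             # (1, 64, 32, 32)
print("main:    ", tuple(main_out.shape))      # (1, 128, 16, 16)
print("shortcut:", tuple(shortcut_out.shape))  # (1, 128, 16, 16)
```

Adding x directly to main_out would fail here because (1, 64, 32, 32) and (1, 128, 16, 16) do not broadcast. When stride is 1 and the channel count is unchanged, the empty nn.Sequential acts as an identity and no projection is needed.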

Performance Optimization and Best Practices

Here are critical optimizations for production deployment, especially important when running on server infrastructure:

Optimization	Speed impact	Implementation	Memory impact
Mixed Precision Training	1.5-2x faster training	torch.cuda.amp.autocast() with GradScaler	~50% reduction
Gradient Checkpointing	10-20% slower training	torch.utils.checkpoint	~80% reduction
DataLoader num_workers	2-4x faster data loading	num_workers=4-8	Minimal increase
Batch Size Optimization	Higher GPU utilization as batch size grows	Try powers of 2 (32, 64, 128)	Grows roughly linearly with batch size

Here’s how to implement mixed precision training:

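# Recent PyTorch releases prefer torch.amp.autocast('cuda') and torch.amp.GradScaler('cuda');
# the torch.cuda.amp names below still work there but emit a deprecation warning.
from torch.cuda.amp import GradScaler, autocast

def train_with_mixed_precision(model, train_loader, val_loader, num_epochs=50):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    use_amp = device.type == 'cuda'  # fall back to plain FP32 on CPU
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
    scaler = GradScaler(enabled=use_amp)
    
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        
        for batch_idx, (data, targets) in enumerate(train_loader):
            data, targets = data.to(device), targets.to(device)
            
            optimizer.zero_grad()
            
            # Forward pass: eligible ops run in half precision
            with autocast(enabled=use_amp):
                outputs = model(data)
                loss = criterion(outputs, targets)
            
            # Backward pass: scale the loss so small gradients do not underflow in FP16
            scaler.scale(loss).backward()
            scaler.step(optimizer)  # unscales gradients; skips the step if they contain inf/NaN
            scaler.update()         # adjusts the scale factor for the next iteration
            
            running_loss += loss.item()
        
        print(f'Epoch {epoch+1}: Loss: {running_loss/len(train_loader):.4f}')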
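Gradient checkpointing, the second row of the table, trades compute for memory: activations inside a checkpointed segment are dropped after the forward pass and recomputed during the backward pass. Here is a minimal sketch on a toy network; CheckpointedNet is a hypothetical model for illustration, and use_reentrant=False is the variant that recent PyTorch documentation recommends (older releases may not accept that argument).

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedNet(nn.Module):
    """Toy model (hypothetical) with one checkpointed block."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):
        x = self.block1(x)
        # Intermediate activations of block2 are not stored; they are
        # recomputed from its input when backward() reaches this point.
        x = checkpoint(self.block2, x, use_reentrant=False)
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.head(x)

torch.manual_seed(0)
net = CheckpointedNet()
batch = torch.randn(4, 3, 8, 8)

logits = net(batch)
logits.sum().backward()

# Gradients still reach every parameter, including those inside block2.
grads_ready = all(p.grad is not None for p in net.parameters())
print(logits.shape, grads_ready)
```

The savings only matter for deep models where stored activations dominate memory. On a network this small the recomputation cost outweighs any benefit, so treat this purely as a demonstration of the call pattern.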

Common Issues and Troubleshooting

Based on real-world deployment experience, here are the most frequent problems and their solutions:

Out of Memory Errors: Reduce batch size, use gradient checkpointing, or implement gradient accumulation
Vanishing Gradients: Add batch normalization, use ResNet-style skip connections, or try different activation functions
Slow Training: Increase batch size if memory allows, use multiple GPUs with DistributedDataParallel (generally preferred over DataParallel), optimize data loading with proper num_workers
Poor Convergence: Adjust learning rate, add learning rate scheduling, check data normalization
Overfitting: Increase dropout rate, add weight decay, use data augmentation, reduce model complexity

Here’s a debugging utility to monitor training:

def debug_model(model, sample_input):
    model.eval()
    
    # Run the sample on whatever device the model currently lives on
    device = next(model.parameters()).device
    sample_input = sample_input.to(device)
    
    # Hook to capture activations
    activations = {}
    
    def hook_fn(name):
        def hook(module, input, output):
            activations[name] = output.detach()
        return hook
    
    # Register hooks and keep the handles so they can be removed afterwards
    handles = []
    for name, layer in model.named_modules():
        if isinstance(layer, (nn.Conv2d, nn.Linear)):
            handles.append(layer.register_forward_hook(hook_fn(name)))
    
    # Forward pass; always remove the hooks so repeated calls do not stack them
    try:
        with torch.no_grad():
            model(sample_input)
    finally:
        for handle in handles:
            handle.remove()
    
    # Print activation statistics
    for name, activation in activations.items():
        print(f"{name}: Shape={tuple(activation.shape)}, "
              f"Mean={activation.mean():.4f}, "
              f"Std={activation.std():.4f}")
    
    return activations

# Usage
sample_batch = next(iter(train_loader))[0][:1]  # Single sample
debug_info = debug_model(model, sample_batch)

Real-World Use Cases and Deployment

CNNs built from scratch in PyTorch are particularly valuable in these scenarios:

Custom Computer Vision Tasks: Medical imaging, satellite imagery analysis, industrial quality control
Edge Deployment: Converting to ONNX or TensorRT for inference on embedded devices
Research Applications: Experimenting with novel architectures or loss functions
Educational Purposes: Understanding the mathematical foundations for team training

For production deployment, consider model quantization and pruning:

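# Post-training static quantization for faster CPU inference
import copy
import io
import torch.quantization as quantization

def quantize_model(model, calibration_loader, num_batches=10):
    # Quantized kernels run on the CPU; work on a copy so the trained model stays intact
    model_fp32 = copy.deepcopy(model).cpu().eval()
    
    # QuantWrapper adds the QuantStub/DeQuantStub pair that converts tensors
    # to and from int8 at the model boundary
    wrapped = quantization.QuantWrapper(model_fp32)
    wrapped.qconfig = quantization.get_default_qconfig('fbgemm')  # x86 backend
    model_prepared = quantization.prepare(wrapped, inplace=False)
    
    # Calibration: observers record activation ranges on a few batches
    with torch.no_grad():
        for batch_idx, (data, _) in enumerate(calibration_loader):
            if batch_idx >= num_batches:
                break
            model_prepared(data)
    
    return quantization.convert(model_prepared, inplace=False)

def serialized_size_mb(model):
    # Quantized modules store packed weights rather than nn.Parameter objects,
    # so compare serialized state_dict sizes instead of parameter counts
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

quantized_model = quantize_model(model, train_loader)
print(f"Original model size: {serialized_size_mb(model):.2f} MB")
print(f"Quantized model size: {serialized_size_mb(quantized_model):.2f} MB")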
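Pruning is the other technique mentioned above. A minimal sketch using torch.nn.utils.prune on a single stand-alone layer (the layer and the 30% amount are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(10, 10)  # 100 weights

# Zero out the 30% of weights with the smallest absolute value.
# This adds `weight_orig` and `weight_mask`, and `weight` becomes their product.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent: drop the mask and keep the zeroed weights.
prune.remove(layer, "weight")

zeros = int((layer.weight == 0).sum())
sparsity = zeros / layer.weight.numel()
print(f"Zeroed weights: {zeros} / {layer.weight.numel()} ({sparsity:.0%})")
```

Unstructured pruning like this only sets weights to zero. The tensors stay dense, so file size and latency do not improve on their own unless the model is later stored or executed with sparsity-aware tooling. Accuracy should be re-checked after pruning, and a short fine-tuning run is commonly used to recover it.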

Integration with MLOps and Monitoring

When deploying CNNs on production servers, implement proper monitoring and logging:

import logging
import time

class ModelTracker:
    def __init__(self, log_file='model_training.log'):
        logging.basicConfig(
            filename=log_file,
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.start_time = None
        self.metrics = {}
    
    def start_epoch(self, epoch):
        self.start_time = time.time()
        logging.info(f"Starting epoch {epoch}")
    
    def end_epoch(self, epoch, train_loss, val_acc):
        epoch_time = time.time() - self.start_time
        self.metrics[epoch] = {
            'train_loss': train_loss,
            'val_accuracy': val_acc,
            'epoch_time': epoch_time
        }
        
        logging.info(f"Epoch {epoch} completed in {epoch_time:.2f}s - "
                    f"Loss: {train_loss:.4f}, Accuracy: {val_acc:.2f}%")
    
    def save_metrics(self, filename='training_metrics.json'):
        import json
        with open(filename, 'w') as f:
            json.dump(self.metrics, f, indent=2)

# Usage in training loop
tracker = ModelTracker()
# Integrate with your training loop
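One possible way to wire the tracker into an epoch loop is sketched below. To keep the snippet runnable on its own it uses a trimmed copy of ModelTracker that logs through a named logger instead of configuring a log file, and train_one_epoch and validate are hypothetical stand-ins that return made-up numbers in place of real training and evaluation.

```python
import json
import logging
import os
import tempfile
import time

logger = logging.getLogger("training")

class ModelTracker:
    """Trimmed copy of the tracker above so this block runs standalone."""

    def __init__(self):
        self.start_time = None
        self.metrics = {}

    def start_epoch(self, epoch):
        self.start_time = time.time()
        logger.info("Starting epoch %d", epoch)

    def end_epoch(self, epoch, train_loss, val_acc):
        epoch_time = time.time() - self.start_time
        self.metrics[epoch] = {
            "train_loss": train_loss,
            "val_accuracy": val_acc,
            "epoch_time": epoch_time,
        }
        logger.info("Epoch %d done in %.2fs", epoch, epoch_time)

    def save_metrics(self, filename):
        with open(filename, "w") as f:
            json.dump(self.metrics, f, indent=2)

# Hypothetical stand-ins for one epoch of training and validation.
def train_one_epoch(epoch):
    return 1.0 / (epoch + 1)   # pretend the loss shrinks each epoch

def validate(epoch):
    return 50.0 + 10.0 * epoch  # pretend accuracy improves each epoch

tracker = ModelTracker()
for epoch in range(3):
    tracker.start_epoch(epoch)
    train_loss = train_one_epoch(epoch)
    val_acc = validate(epoch)
    tracker.end_epoch(epoch, train_loss, val_acc)

# Write to a temporary directory so the sketch leaves the working directory clean.
metrics_path = os.path.join(tempfile.mkdtemp(), "training_metrics.json")
tracker.save_metrics(metrics_path)

with open(metrics_path) as f:
    saved = json.load(f)
print(sorted(saved))  # ['0', '1', '2']
```

Note that JSON object keys are always strings, so the integer epoch numbers come back as '0', '1', '2' when the file is reloaded.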

For comprehensive CNN tutorials and advanced techniques, refer to the official PyTorch documentation and the torchvision model zoo for reference implementations.

Building CNNs from scratch provides invaluable insights into deep learning fundamentals while giving you the flexibility to create custom architectures for specific use cases. The combination of PyTorch’s dynamic nature and proper server infrastructure makes it an excellent choice for both research and production deployment of computer vision systems.
