YOLO NAS: What It Is and How It Works

YOLO NAS (the NAS stands for Neural Architecture Search) is an object detection model in the YOLO (You Only Look Once) family whose architecture was found by automated neural architecture search. Unlike previous YOLO versions, which relied on manual architecture engineering, YOLO NAS uses a search procedure to select network structures, with the goal of a better accuracy-to-latency trade-off. This guide walks through the technical foundations of YOLO NAS, practical implementation steps, performance comparisons, and real-world deployment scenarios for your computer vision projects.

How YOLO NAS Works

YOLO NAS operates on a fundamentally different approach compared to traditional YOLO architectures. The core innovation lies in its Neural Architecture Search methodology, which automatically discovers optimal network configurations through a search process that evaluates thousands of potential architectures.

The architecture consists of three main components:

Backbone Network: Extracts feature representations from input images using a searched convolutional neural network structure
Neck Module: Aggregates features from different scales through Feature Pyramid Network (FPN) and Path Aggregation Network (PAN)
Detection Head: Performs final object classification and bounding box regression with anchor-free detection
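
To make the anchor-free head more concrete, here is a small sketch of how many locations such a head predicts at. It assumes three output scales with strides 8, 16 and 32, which is the usual layout for YOLO-style FPN/PAN necks but is not stated above; the function name is illustrative.

```python
# Assumption: three feature-map scales with strides 8, 16 and 32.
STRIDES = (8, 16, 32)


def grid_cells(input_size: int, strides=STRIDES) -> dict:
    """Return the number of prediction cells per stride for a square input."""
    return {stride: (input_size // stride) ** 2 for stride in strides}


cells = grid_cells(640)
total = sum(cells.values())
print(cells)   # cells per scale
print(total)   # total candidate boxes before confidence filtering and NMS
```

In an anchor-free head each cell predicts one box directly, so the total is the number of candidate boxes the post-processing step has to filter.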

The NAS process optimizes for multiple objectives simultaneously, including accuracy, inference speed, and memory consumption. This multi-objective optimization produces Pareto-optimal solutions that balance performance trade-offs effectively.
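
As a rough illustration of what "Pareto-optimal" means here, the sketch below filters a handful of candidate architectures down to those that no other candidate beats on both accuracy and latency. The candidate names and numbers are invented for the example, and only two objectives are used; this is not the actual search procedure.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float    # higher is better (e.g. mAP)
    latency_ms: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
    better = a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
    return no_worse and better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical search results (made-up values)
candidates = [
    Candidate("arch_a", 47.0, 3.0),
    Candidate("arch_b", 45.0, 4.0),  # slower and less accurate than arch_a
    Candidate("arch_c", 51.0, 6.0),
    Candidate("arch_d", 50.0, 7.0),  # slower and less accurate than arch_c
    Candidate("arch_e", 52.0, 9.0),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

Each surviving candidate represents a different trade-off, which is how a search can yield small, medium and large variants rather than a single winner.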

Key technical improvements include:

Quantization-friendly architecture design for efficient INT8 deployment
Attention mechanisms integrated at optimal network locations
Advanced data augmentation techniques like Mosaic and MixUp
Knowledge distillation during training for improved small model performance
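
To show what INT8 deployment involves numerically, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It illustrates the general technique only; the weight values are made up, and real toolchains add calibration and per-channel scales that are not shown.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale, so layers whose weights and activations span a narrow range lose little precision. A quantization-friendly architecture is one whose blocks keep those ranges well behaved.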

Step-by-Step Implementation Guide

YOLO NAS requires Python 3.8+ and runs in both CPU and GPU environments. The walkthrough below covers environment setup, basic inference, and custom training:

Environment Setup

# Create virtual environment
python -m venv yolo_nas_env
source yolo_nas_env/bin/activate  # Linux/Mac
# yolo_nas_env\Scripts\activate  # Windows

# Install dependencies
pip install super-gradients
pip install torch torchvision
pip install opencv-python
pip install matplotlib

Basic Inference Implementation

import torch
from super_gradients.training import models
from super_gradients.common.object_names import Models
import cv2
import numpy as np

# Load pre-trained model (options: yolo_nas_s, yolo_nas_m, yolo_nas_l)
model = models.get(Models.YOLO_NAS_S, pretrained_weights="coco")

# Set model to evaluation mode
model.eval()

# Load and preprocess image
image_path = "your_image.jpg"
image = cv2.imread(image_path)
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Perform inference
predictions = model.predict(image_rgb)

# Process results
for prediction in predictions:
    class_names = prediction.class_names
    bboxes = prediction.prediction.bboxes_xyxy
    confidence = prediction.prediction.confidence
    labels = prediction.prediction.labels

    # Draw bounding boxes on the original (BGR) image
    for bbox, conf, label in zip(bboxes, confidence, labels):
        if conf > 0.5:  # Confidence threshold
            x1, y1, x2, y2 = map(int, bbox)
            name = class_names[int(label)]  # labels are class indices
            cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(image, f'{name}: {conf:.2f}',
                       (x1, y1-10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

# Display results
cv2.imshow('YOLO NAS Detection', image)
cv2.waitKey(0)
cv2.destroyAllWindows()

Custom Training Setup

from super_gradients.common.object_names import Models
from super_gradients.training import Trainer, models
from super_gradients.training.dataloaders.dataloaders import (
    coco_detection_yolo_format_train,
    coco_detection_yolo_format_val
)
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback

# Initialize trainer
trainer = Trainer(experiment_name="yolo_nas_custom", ckpt_root_dir="./checkpoints")

# Define dataset paths
train_data_dir = "/path/to/train/images"
train_labels_dir = "/path/to/train/labels"
val_data_dir = "/path/to/val/images"
val_labels_dir = "/path/to/val/labels"

# Create data loaders
train_data = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': train_data_dir,
        'images_dir': train_data_dir,
        'labels_dir': train_labels_dir,
        'classes': ['class1', 'class2', 'class3']  # Your custom classes
    },
    dataloader_params={'batch_size': 16, 'num_workers': 4}
)

val_data = coco_detection_yolo_format_val(
    dataset_params={
        'data_dir': val_data_dir,
        'images_dir': val_data_dir,
        'labels_dir': val_labels_dir,
        'classes': ['class1', 'class2', 'class3']
    },
    dataloader_params={'batch_size': 16, 'num_workers': 4}
)

# Load model and start training
model = models.get(Models.YOLO_NAS_S, num_classes=3, pretrained_weights="coco")

train_params = {
    "silent_mode": False,
    "average_best_models": True,
    "warmup_mode": "linear_epoch_step",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "AdamW",
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.9, "decay_type": "threshold"},
    "max_epochs": 100,
    "mixed_precision": True,
    "loss": PPYoloELoss(use_static_assigner=False, num_classes=3),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            num_cls=3,
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50'
}

# Start training
trainer.train(model=model, training_params=train_params, 
              train_loader=train_data, valid_loader=val_data)
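
To see what the warmup and cosine settings in the training parameters imply, the following sketch computes an approximate learning rate per epoch using the same values (linear warmup from 1e-6 to 5e-4 over 3 epochs, then cosine decay to 10% of the initial rate by epoch 100). It is a simplified stand-in written for illustration; the exact schedule SuperGradients applies may differ in detail, and the function name is hypothetical.

```python
import math


def lr_at_epoch(epoch, *, initial_lr=5e-4, warmup_initial_lr=1e-6,
                warmup_epochs=3, max_epochs=100, final_lr_ratio=0.1):
    """Approximate learning rate: linear warmup followed by cosine decay."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_initial_lr up to initial_lr
        fraction = epoch / warmup_epochs
        return warmup_initial_lr + fraction * (initial_lr - warmup_initial_lr)
    # Cosine decay from initial_lr down to initial_lr * final_lr_ratio
    progress = (epoch - warmup_epochs) / max(1, max_epochs - warmup_epochs)
    final_lr = initial_lr * final_lr_ratio
    return final_lr + 0.5 * (initial_lr - final_lr) * (1 + math.cos(math.pi * progress))


for epoch in (0, 1, 3, 50, 100):
    print(epoch, f"{lr_at_epoch(epoch):.2e}")
```

Printing a few epochs like this is a quick sanity check before committing to a long training run, particularly when changing max_epochs or the warmup length.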

Performance Comparisons and Benchmarks

The table below compares YOLO NAS with earlier YOLO models on accuracy (mAP, in percent), throughput on an NVIDIA V100, and model size:

Model	mAP@0.5	mAP@0.5:0.95	FPS (V100)	Parameters (M)	Model Size (MB)
YOLOv5s	56.8	37.4	740	7.2	14
YOLOv8s	61.8	44.9	520	11.2	22
YOLO NAS S	65.1	47.5	780	12.9	26
YOLO NAS M	68.4	51.1	480	33.8	68
YOLO NAS L	70.7	52.2	320	44.1	88
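
The two mAP columns differ only in how strictly a predicted box must overlap the ground truth: mAP@0.5 counts a detection as correct when its intersection over union (IoU) is at least 0.5, while mAP@0.5:0.95 averages over thresholds from 0.5 to 0.95. A minimal IoU sketch for boxes in (x1, y1, x2, y2) format, with made-up coordinates:

```python
def iou_xyxy(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 100, 100)
predicted = (50, 0, 150, 100)  # shifted right by half the box width
iou = iou_xyxy(ground_truth, predicted)
print(f"IoU = {iou:.3f}, counts at 0.5 threshold: {iou >= 0.5}")
```

A box shifted by half its width overlaps the target by only a third, so it would not count as a match even under the more lenient mAP@0.5 metric.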

Quantizing to INT8 costs a fraction of a point of mAP while roughly doubling inference speed:

Model (INT8)	mAP Drop	Speed Improvement	Memory Reduction	Deployment Advantage
YOLO NAS S	0.3%	2.1x	4x	Edge devices
YOLO NAS M	0.5%	1.9x	4x	Mobile/Embedded
YOLO NAS L	0.7%	1.8x	4x	Server optimization

Real-World Use Cases and Examples

YOLO NAS excels in various production scenarios where both accuracy and speed are critical:

Surveillance and Security Systems

import queue
import threading

import cv2
from super_gradients.training import models

class RealTimeDetection:
    def __init__(self, model_name="yolo_nas_s"):
        # model_name: "yolo_nas_s", "yolo_nas_m" or "yolo_nas_l"
        self.model = models.get(model_name, pretrained_weights="coco")
        self.model.eval()
        self.frame_queue = queue.Queue(maxsize=30)
        self.result_queue = queue.Queue(maxsize=30)

    def process_video_stream(self, rtsp_url):
        cap = cv2.VideoCapture(rtsp_url)

        while True:
            ret, frame = cap.read()
            if not ret:
                break

            # Drop the frame instead of blocking if the worker is behind
            if not self.frame_queue.full():
                self.frame_queue.put(frame)

            # Display processed frames as they become available
            if not self.result_queue.empty():
                processed_frame = self.result_queue.get()
                cv2.imshow('Security Feed', processed_frame)

            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        cap.release()
        cv2.destroyAllWindows()

    def send_alert(self, frame, class_name, confidence):
        # Placeholder: replace with your own notification logic
        print(f"ALERT: {class_name} detected ({confidence:.2f})")

    def detection_worker(self):
        while True:
            frame = self.frame_queue.get()  # Blocks until a frame arrives
            # OpenCV frames are BGR; convert to RGB as in the inference example
            predictions = self.model.predict(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

            # Process detections for security alerts
            for pred in predictions:
                detections = pred.prediction
                for label, conf in zip(detections.labels, detections.confidence):
                    if pred.class_names[int(label)] == 'person' and conf > 0.7:
                        # Trigger security alert (one per frame is enough)
                        self.send_alert(frame, 'person', float(conf))
                        break

            # Skip the frame if the display loop is behind
            if not self.result_queue.full():
                self.result_queue.put(frame)

# Usage
detector = RealTimeDetection("yolo_nas_m")
threading.Thread(target=detector.detection_worker, daemon=True).start()
detector.process_video_stream("rtsp://camera_ip:554/stream")

Manufacturing Quality Control

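import cv2
from super_gradients.training import models

class QualityInspection:
    def __init__(self, custom_model_path, num_classes):
        # Load custom-trained model for defect detection;
        # num_classes must match the value used when the checkpoint was trained
        self.model = models.get("yolo_nas_s", num_classes=num_classes,
                                checkpoint_path=custom_model_path)
        self.model.eval()

    def inspect_product(self, image_path):
        image = cv2.imread(image_path)
        if image is None:
            raise FileNotFoundError(f"Could not read image: {image_path}")
        # OpenCV loads BGR; convert to RGB as in the inference example
        predictions = self.model.predict(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

        defects = []
        for pred in predictions:
            det = pred.prediction
            for bbox, conf, label in zip(det.bboxes_xyxy, det.confidence, det.labels):
                if conf > 0.8:  # High confidence threshold for quality control
                    defects.append({
                        'type': int(label),  # Class index from the training class list
                        'confidence': float(conf),
                        'location': bbox.tolist(),
                        'severity': 'high' if conf > 0.95 else 'medium'
                    })

        return {
            'pass': len(defects) == 0,
            'defects_found': len(defects),
            'details': defects
        }

# Batch processing for production line
# num_classes=3 matches the custom training example; adjust for your model
inspector = QualityInspection("./models/defect_detection.pth", num_classes=3)
product_batch = ["product_001.jpg", "product_002.jpg"]  # Placeholder image paths
results = []

for product_image in product_batch:
    result = inspector.inspect_product(product_image)
    results.append(result)

    if not result['pass']:
        # Flag for manual inspection
        print(f"Product {product_image} failed QC: {result['defects_found']} defects")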

Deployment and Production Considerations

Docker Containerization

# Dockerfile
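FROM nvidia/cuda:11.8.0-runtime-ubuntu20.04

# Avoid interactive prompts (e.g. tzdata) during apt-get install
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    libgl1 \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*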

WORKDIR /app

COPY requirements.txt .
RUN pip3 install -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["python3", "app.py"]

REST API Implementation

from flask import Flask, request, jsonify
import base64
import io
from PIL import Image
import numpy as np
from super_gradients.training import models

app = Flask(__name__)

# Load model once at startup
model = models.get("yolo_nas_s", pretrained_weights="coco")
model.eval()

@app.route('/detect', methods=['POST'])
def detect_objects():
    try:
        # Get base64-encoded image from the JSON request body
        image_data = request.json['image']
        image_bytes = base64.b64decode(image_data)
        # Force 3-channel RGB so grayscale and RGBA uploads also work
        image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
        image_array = np.array(image)

        # Perform detection
        predictions = model.predict(image_array)

        # Format results using JSON-serializable types
        results = []
        for pred in predictions:
            det = pred.prediction
            for bbox, conf, label in zip(det.bboxes_xyxy, det.confidence, det.labels):
                if conf > 0.5:
                    results.append({
                        'class': pred.class_names[int(label)],
                        'confidence': float(conf),
                        'bbox': bbox.tolist()
                    })
        
        return jsonify({
            'success': True,
            'detections': results,
            'count': len(results)
        })
        
    except Exception as e:
        return jsonify({'success': False, 'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
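
For completeness, here is a sketch of the client side of this endpoint using only the standard library. The helper names and the example URL are illustrative; the payload shape (a JSON object with a base64-encoded `image` field) follows the server code above.

```python
import base64
import json
import urllib.request


def build_detect_payload(image_bytes: bytes) -> str:
    """Encode raw image bytes as the JSON body the /detect endpoint expects."""
    return json.dumps({"image": base64.b64encode(image_bytes).decode("ascii")})


def decode_detect_payload(body: str) -> bytes:
    """Inverse of build_detect_payload; mirrors what the server does on receipt."""
    return base64.b64decode(json.loads(body)["image"])


def post_image(url: str, image_path: str) -> dict:
    """Send an image file to the detection API and return the parsed JSON response."""
    with open(image_path, "rb") as f:
        body = build_detect_payload(f.read())
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Round-trip check that needs no running server (the bytes are not a real image)
sample = b"\x89PNG\r\n\x1a\n-placeholder-bytes"
assert decode_detect_payload(build_detect_payload(sample)) == sample
```

With the server running locally, a call such as `post_image("http://localhost:8000/detect", "your_image.jpg")` would return the `detections` list. The `Content-Type` header matters because Flask only parses `request.json` for JSON requests.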

Best Practices and Common Pitfalls

Memory Management

YOLO NAS can be memory-intensive, especially the larger variants. Implement proper memory management:

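import contextlib
import gc

import torch
from super_gradients.training import models

class OptimizedDetector:
    def __init__(self, model_size="s", use_half_precision=True):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = models.get(f"yolo_nas_{model_size}", pretrained_weights="coco")
        self.model = self.model.to(self.device).eval()
        # Half precision only pays off on GPU
        self.use_half_precision = use_half_precision and self.device == "cuda"

    def detect_batch(self, images, clear_cache=True):
        # Autocast runs eligible ops in FP16 without converting the weights,
        # which avoids dtype mismatches with float32 inputs
        if self.use_half_precision:
            autocast = torch.autocast(device_type="cuda", dtype=torch.float16)
        else:
            autocast = contextlib.nullcontext()

        results = []
        with torch.no_grad(), autocast:
            for image in images:
                results.append(self.model.predict(image))

        # Release cached GPU memory once per batch rather than per image
        if clear_cache and self.device == "cuda":
            torch.cuda.empty_cache()

        return results

    def __del__(self):
        # Cleanup on object destruction
        if hasattr(self, 'model'):
            del self.model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()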

Common Issues and Solutions

CUDA Out of Memory: Reduce batch size, use gradient checkpointing, or implement model sharding for large deployments
Slow Inference on CPU: Consider using ONNX Runtime or Intel OpenVINO for CPU optimization
Poor Performance on Custom Data: Ensure proper data augmentation and sufficient training epochs
Model Loading Failures: Verify CUDA compatibility and torch version alignment

Performance Optimization

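import torch

# ONNX Export for optimized inference
def export_to_onnx(model, output_path):
    model.eval()  # Export in inference mode
    # The dummy input must live on the same device as the model
    device = next(model.parameters()).device
    dummy_input = torch.randn(1, 3, 640, 640, device=device)
    torch.onnx.export(
        model,
        dummy_input,
        output_path,
        export_params=True,
        opset_version=11,
        do_constant_folding=True,
        input_names=['input'],
        output_names=['output'],
        dynamic_axes={
            'input': {0: 'batch_size'},
            'output': {0: 'batch_size'}
        }
    )

# TensorRT optimization for NVIDIA GPUs
import tensorrt as trt

def optimize_with_tensorrt(onnx_path, engine_path, max_batch_size=8):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    # TensorRT 8.4+ API; replaces the deprecated max_workspace_size attribute
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB
    config.set_flag(trt.BuilderFlag.FP16)  # Enable FP16 precision

    # The exported model has a dynamic batch axis, so TensorRT needs an
    # optimization profile; max_batch_size is an arbitrary example value
    profile = builder.create_optimization_profile()
    profile.set_shape('input', (1, 3, 640, 640), (1, 3, 640, 640),
                      (max_batch_size, 3, 640, 640))
    config.add_optimization_profile(profile)

    # build_engine() is deprecated; build a serialized engine directly
    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("TensorRT engine build failed")

    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)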
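
Before and after any of these optimizations, it helps to measure latency the same way each time. The sketch below is a generic timing helper, not part of SuperGradients; the function name and the stand-in workload are illustrative.

```python
import statistics
import time


def benchmark(fn, *, warmup=3, runs=20):
    """Time a zero-argument callable and report median latency and throughput."""
    for _ in range(warmup):
        fn()  # Warm-up runs are excluded (caches, lazy initialization)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    return {
        "median_ms": median * 1000,
        "fps": 1.0 / median if median > 0 else float("inf"),
    }


# Stand-in workload; replace with e.g. lambda: model.predict(image)
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{stats['median_ms']:.3f} ms per call, {stats['fps']:.1f} calls/s")
```

When timing GPU inference, the callable should wait for the device to finish (for example by calling `torch.cuda.synchronize()` at its end), otherwise the measurement only covers kernel launch time.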

For production deployments, dedicated servers with NVIDIA GPUs give the best YOLO NAS performance. For development and testing, VPS instances provide a cost-effective environment for model experimentation and smaller-scale deployments.

YOLO NAS improves accuracy over earlier YOLO models while keeping real-time inference speed. Its quantization-friendly architecture makes it well suited to edge deployment, and the small, medium and large variants produced by the architecture search cover a range of accuracy-speed requirements. Successful implementation depends on proper environment setup, careful memory management, and choosing the right model variant for your use case.
