
Monitoring GPU Utilization in Real Time
Monitoring GPU utilization in real time is essential for anyone running high-performance workloads, whether you’re training machine learning models, rendering graphics, mining cryptocurrency, or running intensive computational tasks. Real-time monitoring helps you understand resource usage patterns, identify bottlenecks, optimize performance, and prevent hardware damage from overheating or overclocking. This guide covers various tools and techniques for monitoring GPU utilization across different platforms, from simple command-line utilities to comprehensive monitoring solutions, plus troubleshooting common issues you’ll encounter along the way.
How GPU Monitoring Works
GPU monitoring operates through hardware-level sensors and driver APIs that expose real-time metrics about the graphics card’s state. Modern GPUs contain multiple sensors that track temperature, power consumption, memory usage, clock speeds, and utilization percentages across different processing units like CUDA cores, tensor cores, and video encoders.
The monitoring process typically works through these layers:
- Hardware sensors embedded in the GPU chip measure physical parameters
- GPU drivers translate sensor data into standardized metrics
- System APIs (like NVIDIA’s NVML or AMD’s ADL) provide programmatic access
- Monitoring tools query these APIs to display real-time information
Different GPU vendors use different monitoring interfaces. NVIDIA provides the NVIDIA Management Library (NVML), while AMD uses the AMD Display Library (ADL) and ROCm System Management Interface (ROCm-SMI). Intel’s Arc GPUs use Intel GPU Tools and similar interfaces.
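As a brief illustration of these layers on Linux (assuming an NVIDIA card with its driver installed for the first command, and an AMD card using the amdgpu driver for the second; the card0 index varies by system):
# API layer: nvidia-smi is a thin client over NVML and reports the same counters programmatic tools query
nvidia-smi --query-gpu=utilization.gpu,temperature.gpu --format=csv,noheader
# Driver layer: with amdgpu, some sensor data is exposed directly through sysfs
cat /sys/class/drm/card0/device/gpu_busy_percent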
Command-Line Monitoring Tools
The quickest way to start monitoring GPU utilization is through command-line tools that come with your GPU drivers or can be installed separately.
NVIDIA GPUs
For NVIDIA cards, nvidia-smi is your primary tool. It's included with the NVIDIA drivers and provides comprehensive real-time monitoring:
# Basic usage - single snapshot
nvidia-smi
# Continuous monitoring every 2 seconds
nvidia-smi -l 2
# Monitor specific metrics
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1
# Monitor processes using GPU
nvidia-smi pmon -i 0
The output includes GPU utilization percentage, memory usage, temperature, power draw, and running processes. You can also use nvtop, which provides a more user-friendly interface similar to htop:
# Install nvtop (Ubuntu/Debian)
sudo apt install nvtop
# Run nvtop
nvtop
AMD GPUs
AMD users can use radeontop for basic monitoring or rocm-smi for more detailed metrics:
# Install radeontop
sudo apt install radeontop
# Run radeontop
sudo radeontop
# Using rocm-smi (if ROCm is installed)
rocm-smi
rocm-smi -a # Show all available information
Intel GPUs
Intel Arc and integrated GPUs can be monitored using intel_gpu_top:
# Install intel-gpu-tools
sudo apt install intel-gpu-tools
# Monitor Intel GPU
sudo intel_gpu_top
Cross-Platform Monitoring Solutions
Several tools work across different GPU vendors and operating systems, making them ideal for mixed environments.
GPU-Z and MSI Afterburner (Windows)
On Windows systems, GPU-Z provides detailed real-time monitoring with logging capabilities, while MSI Afterburner offers both monitoring and overclocking features. Both tools support multiple GPU vendors and provide comprehensive telemetry data.
System Monitoring Tools
Tools like htop, glances, and netdata can display GPU information alongside CPU and memory metrics:
# Install glances with GPU support
pip install glances[gpu]
# Run glances
glances
# Install netdata for web-based monitoring
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
Programming-Based Monitoring
For custom monitoring solutions or integration into existing applications, you can access GPU metrics programmatically using various libraries and APIs.
Python with pynvml
The pynvml library provides Python bindings for NVIDIA's NVML library.
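If the bindings aren't already installed, they are available from PyPI (pynvml is the package name assumed here; NVIDIA's own nvidia-ml-py package provides the same pynvml module):
# Install the NVML Python bindings
pip install pynvml
With the bindings in place, the following script polls every detected GPU every two seconds: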
#!/usr/bin/env python3
import pynvml
import time

def monitor_gpu():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()

    while True:
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)

            # Get GPU name (older pynvml versions return bytes, newer return str)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):
                name = name.decode('utf-8')

            # Get utilization
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)

            # Get memory info
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)

            # Get temperature
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

            print(f"GPU {i} ({name}):")
            print(f"  GPU Utilization: {util.gpu}%")
            print(f"  Memory Utilization: {util.memory}%")
            print(f"  Memory Used: {mem_info.used // 1024**2} MB / {mem_info.total // 1024**2} MB")
            print(f"  Temperature: {temp}°C")
            print()

        time.sleep(2)

if __name__ == "__main__":
    monitor_gpu()
Node.js with systeminformation
For JavaScript/Node.js applications, the systeminformation library provides GPU monitoring capabilities.
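The package can be installed from npm (this assumes an existing Node.js project):
# Install systeminformation
npm install systeminformation
The script below prints each detected GPU controller every two seconds: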
const si = require('systeminformation');

async function monitorGPU() {
  try {
    const graphics = await si.graphics();

    graphics.controllers.forEach((gpu, index) => {
      // Fields such as memoryUsed and temperatureGpu may be null on platforms
      // where the library cannot read them
      console.log(`GPU ${index}: ${gpu.model}`);
      console.log(`  Memory: ${gpu.memoryTotal}MB total, ${gpu.memoryUsed}MB used`);
      console.log(`  Temperature: ${gpu.temperatureGpu}°C`);
    });
  } catch (error) {
    console.error('Error monitoring GPU:', error);
  }
}

setInterval(monitorGPU, 2000);
Web-Based Monitoring Dashboards
For comprehensive monitoring solutions, especially in server environments, web-based dashboards provide excellent visualization and historical data tracking.
Prometheus and Grafana
Setting up Prometheus with NVIDIA's DCGM exporter creates a robust monitoring solution:
# Install DCGM
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/datacenter-gpu-manager_3.1.7_amd64.deb
sudo dpkg -i datacenter-gpu-manager_3.1.7_amd64.deb
# Start DCGM
sudo systemctl --now enable nvidia-dcgm
# Run DCGM exporter
docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
Then configure Prometheus to scrape the metrics and visualize them in Grafana. This setup provides historical data, alerting capabilities, and detailed performance analytics.
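As a minimal sketch, the scrape job might look like this in prometheus.yml (the job name and target are assumptions; merge the job into your existing scrape_configs section and adjust the target host):
scrape_configs:
  - job_name: 'dcgm-exporter'
    static_configs:
      - targets: ['localhost:9400']
You can confirm the exporter is serving data before wiring up Prometheus by fetching http://localhost:9400/metrics and checking for gauges such as DCGM_FI_DEV_GPU_UTIL.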
Netdata Cloud
Netdata provides out-of-the-box GPU monitoring with minimal configuration. After installation, it automatically detects and monitors available GPUs, providing real-time charts and alerts through a web interface.
Monitoring Tool Comparison
| Tool | Platform | GPU Support | Real-time | Historical Data | API Access | Cost |
|---|---|---|---|---|---|---|
| nvidia-smi | Linux, Windows | NVIDIA only | Yes | No | Yes (NVML) | Free |
| nvtop | Linux | NVIDIA, AMD | Yes | No | No | Free |
| GPU-Z | Windows | All major vendors | Yes | Yes (logging) | No | Free |
| MSI Afterburner | Windows | All major vendors | Yes | Yes | Limited | Free |
| Netdata | Cross-platform | NVIDIA, AMD | Yes | Yes | Yes (REST) | Free/Paid |
| Prometheus + Grafana | Cross-platform | NVIDIA (DCGM) | Yes | Yes | Yes | Free |
Real-World Use Cases and Examples
Machine Learning Model Training
When training deep learning models, monitoring GPU utilization helps optimize batch sizes and identify data loading bottlenecks. A typical monitoring setup for ML workflows includes:
#!/bin/bash
# Monitor GPU during training with custom metrics
# (assumes a single GPU; with multiple GPUs, only the first line of each query is used)
while true; do
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    gpu_util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
    mem_util=$(nvidia-smi --query-gpu=utilization.memory --format=csv,noheader,nounits | head -n1)
    temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | head -n1)

    echo "$timestamp,GPU_Util:${gpu_util}%,Mem_Util:${mem_util}%,Temp:${temp}C"

    # Alert if GPU utilization drops below 80% (potential data-loading bottleneck)
    if [ "$gpu_util" -lt 80 ]; then
        echo "WARNING: GPU utilization below 80% - check data pipeline"
    fi

    sleep 5
done
Cryptocurrency Mining
Mining operations require constant monitoring to ensure optimal performance and prevent hardware damage. A comprehensive monitoring script might track power efficiency and temperature limits:
#!/usr/bin/env python3
import pynvml
import time

def mining_monitor():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()

    while True:
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)

            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # Convert milliwatts to watts

            # Calculate efficiency (utilization per watt)
            efficiency = util.gpu / power if power > 0 else 0

            print(f"GPU {i}: {util.gpu}% util, {temp}°C, {power:.1f}W, {efficiency:.2f} util/W")

            # Temperature protection
            if temp > 83:
                print(f"WARNING: GPU {i} temperature critical ({temp}°C)")
                # Could trigger mining software restart or fan curve adjustment

        time.sleep(10)

if __name__ == "__main__":
    mining_monitor()
Server Infrastructure Monitoring
For servers running multiple GPU workloads, integration with existing monitoring infrastructure is crucial. Here's an example of exporting GPU metrics to a centralized logging system:
#!/usr/bin/env python3
import pynvml
import json
import time
import socket
from datetime import datetime, timezone

def export_gpu_metrics():
    pynvml.nvmlInit()
    hostname = socket.gethostname()

    while True:
        metrics = []
        device_count = pynvml.nvmlDeviceGetCount()

        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)

            # Query each metric once and reuse the results
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml versions return bytes
                name = name.decode('utf-8')
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)

            metric = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "hostname": hostname,
                "gpu_id": i,
                "gpu_name": name,
                "utilization_gpu": util.gpu,
                "utilization_memory": util.memory,
                "temperature": pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU),
                "power_usage": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,  # watts
                "memory_total": mem_info.total,
                "memory_used": mem_info.used,
                "memory_free": mem_info.free
            }
            metrics.append(metric)

        # Export to your logging/monitoring system
        print(json.dumps(metrics, indent=2))
        time.sleep(30)

if __name__ == "__main__":
    export_gpu_metrics()
Common Issues and Troubleshooting
Permission Issues
Many monitoring tools require elevated privileges or specific group membership. Common solutions include the following; keep in mind that loosening device-file permissions trades security for convenience, and changes to /dev/nvidia* do not persist across reboots:
# Add user to video group (Linux)
sudo usermod -a -G video $USER
# For nvidia-smi without sudo
sudo chmod 755 /usr/bin/nvidia-smi
# Set proper permissions for device files
sudo chmod 666 /dev/nvidia*
Driver and Library Conflicts
Monitoring tools often depend on specific driver versions or libraries. When troubleshooting:
- Verify driver installation: nvidia-smi should work without errors
- Check library versions: pip list | grep nvidia or equivalent
- Reinstall monitoring libraries if needed
- Use virtual environments to isolate dependencies
Missing Metrics or Inaccurate Readings
Some monitoring tools may not display all available metrics or show incorrect values:
# Verify available metrics with nvidia-smi
nvidia-smi --help-query-gpu
# Check supported queries
nvidia-smi --query-gpu=name,driver_version,memory.total,memory.used,memory.free,temperature.gpu,utilization.gpu,utilization.memory,power.draw --format=csv
# For scripting, always handle missing values
nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits 2>/dev/null || echo "N/A"
Performance Impact
Intensive monitoring can impact system performance. Best practices include:
- Adjust polling intervals based on requirements (1-5 seconds for real-time, 30+ seconds for logging)
- Use efficient monitoring tools (avoid resource-heavy GUI applications on servers)
- Implement monitoring data aggregation to reduce storage requirements
- Monitor the monitoring tools themselves for resource consumption
Best Practices and Security Considerations
Monitoring Strategy
Effective GPU monitoring requires a strategic approach:
- Define clear metrics that align with your use case (utilization, temperature, memory usage)
- Set appropriate alert thresholds (typically 85°C for temperature, 90% for memory usage); a minimal threshold check is sketched after this list
- Implement redundant monitoring for critical systems
- Maintain historical data for trend analysis and capacity planning
- Document normal operating ranges for your specific workloads
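As a hedged sketch of the threshold bullet above (assuming a single NVIDIA GPU and the example limits of 85°C and 90% memory usage), a cron-friendly check could look like this:
#!/bin/bash
# Read current temperature and memory figures for the first GPU
temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | head -n1)
mem_used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n1)
mem_total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
# Compare against the example thresholds and emit alerts for your alerting system to pick up
if [ "$temp" -ge 85 ]; then
    echo "ALERT: GPU temperature at ${temp}C (threshold 85C)"
fi
if [ $((100 * mem_used / mem_total)) -ge 90 ]; then
    echo "ALERT: GPU memory usage above 90% (${mem_used}/${mem_total} MiB)"
fi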
Security Considerations
GPU monitoring can expose sensitive system information:
- Restrict access to monitoring interfaces using authentication
- Use HTTPS for web-based monitoring dashboards
- Limit network exposure of monitoring endpoints
- Regularly update monitoring tools and their dependencies
- Consider the implications of exposing GPU utilization patterns
Integration with Infrastructure
For production environments, integrate GPU monitoring with existing infrastructure:
- Use configuration management tools (Ansible, Puppet) to deploy monitoring consistently
- Integrate with existing alerting systems (PagerDuty, Slack, email)
- Export metrics to centralized logging systems (ELK stack, Splunk)
- Implement automated responses to common issues (fan curve adjustments, workload redistribution)
Whether you're running VPS instances with GPU acceleration or managing dedicated servers with multiple graphics cards, implementing comprehensive GPU monitoring ensures optimal performance, prevents hardware damage, and provides valuable insights for capacity planning and optimization.
Real-time GPU monitoring is an essential skill for modern system administrators and developers working with high-performance computing workloads. The tools and techniques covered in this guide provide a solid foundation for monitoring GPU utilization across different platforms and use cases, from simple command-line monitoring to enterprise-grade dashboard solutions.
