
Monitoring GPU Utilization in Real Time
Monitoring GPU utilization in real time is essential for anyone running high-performance workloads, whether you’re training machine learning models, rendering graphics, mining cryptocurrency, or running intensive computational tasks. Real-time monitoring helps you understand resource usage patterns, identify bottlenecks, optimize performance, and prevent hardware damage from overheating or overclocking. This guide covers various tools and techniques for monitoring GPU utilization across different platforms, from simple command-line utilities to comprehensive monitoring solutions, plus troubleshooting common issues you’ll encounter along the way.
How GPU Monitoring Works
GPU monitoring operates through hardware-level sensors and driver APIs that expose real-time metrics about the graphics card’s state. Modern GPUs contain multiple sensors that track temperature, power consumption, memory usage, clock speeds, and utilization percentages across different processing units like CUDA cores, tensor cores, and video encoders.
The monitoring process typically works through these layers:
- Hardware sensors embedded in the GPU chip measure physical parameters
- GPU drivers translate sensor data into standardized metrics
- System APIs (like NVIDIA’s NVML or AMD’s ADL) provide programmatic access
- Monitoring tools query these APIs to display real-time information
Different GPU vendors use different monitoring interfaces. NVIDIA provides the NVIDIA Management Library (NVML), while AMD uses the AMD Display Library (ADL) and ROCm System Management Interface (ROCm-SMI). Intel’s Arc GPUs use Intel GPU Tools and similar interfaces.
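As a brief illustration of these layers on Linux (assuming an NVIDIA card with its driver installed for the first command, and an AMD card using the amdgpu driver for the second; the card0 index varies by system):
# API layer: nvidia-smi is a thin client over NVML and reports the same counters programmatic tools query
nvidia-smi --query-gpu=utilization.gpu,temperature.gpu --format=csv,noheader
# Driver layer: with amdgpu, some sensor data is exposed directly through sysfs
cat /sys/class/drm/card0/device/gpu_busy_percent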
Command-Line Monitoring Tools
The quickest way to start monitoring GPU utilization is through command-line tools that come with your GPU drivers or can be installed separately.
NVIDIA GPUs
For NVIDIA cards, nvidia-smi is your primary tool. It's included with the NVIDIA drivers and provides comprehensive real-time monitoring:
# Basic usage - single snapshot
nvidia-smi
# Continuous monitoring every 2 seconds
nvidia-smi -l 2
# Monitor specific metrics
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1
# Monitor processes using GPU
nvidia-smi pmon -i 0
The output includes GPU utilization percentage, memory usage, temperature, power draw, and running processes. You can also use nvtop, which provides a more user-friendly interface similar to htop:
# Install nvtop (Ubuntu/Debian)
sudo apt install nvtop
# Run nvtop
nvtop
AMD GPUs
AMD users can use radeontop for basic monitoring or rocm-smi for more detailed metrics:
# Install radeontop
sudo apt install radeontop
# Run radeontop
sudo radeontop
# Using rocm-smi (if ROCm is installed)
rocm-smi
rocm-smi -a # Show all available information
Intel GPUs
Intel Arc and integrated GPUs can be monitored using intel_gpu_top:
# Install intel-gpu-tools
sudo apt install intel-gpu-tools
# Monitor Intel GPU
sudo intel_gpu_top
Cross-Platform Monitoring Solutions
Several tools work across different GPU vendors and operating systems, making them ideal for mixed environments.
GPU-Z and MSI Afterburner (Windows)
On Windows systems, GPU-Z provides detailed real-time monitoring with logging capabilities, while MSI Afterburner offers both monitoring and overclocking features. Both tools support multiple GPU vendors and provide comprehensive telemetry data.
System Monitoring Tools
Tools like htop, glances, and netdata can display GPU information alongside CPU and memory metrics:
# Install glances with GPU support
pip install glances[gpu]
# Run glances
glances
# Install netdata for web-based monitoring
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
Programming-Based Monitoring
For custom monitoring solutions or integration into existing applications, you can access GPU metrics programmatically using various libraries and APIs.
Python with pynvml
The pynvml library provides Python bindings for NVIDIA's NVML library.
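If the bindings aren't already installed, they are available from PyPI (pynvml is the package name assumed here; NVIDIA's own nvidia-ml-py package provides the same pynvml module):
# Install the NVML Python bindings
pip install pynvml
With the bindings in place, the following script polls every detected GPU every two seconds: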
#!/usr/bin/env python3
import pynvml
import time

def monitor_gpu():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()

    while True:
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)

            # Get GPU name (older pynvml versions return bytes, newer return str)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):
                name = name.decode('utf-8')

            # Get utilization
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)

            # Get memory info
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)

            # Get temperature
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

            print(f"GPU {i} ({name}):")
            print(f"  GPU Utilization: {util.gpu}%")
            print(f"  Memory Utilization: {util.memory}%")
            print(f"  Memory Used: {mem_info.used // 1024**2} MB / {mem_info.total // 1024**2} MB")
            print(f"  Temperature: {temp}°C")
            print()

        time.sleep(2)

if __name__ == "__main__":
    monitor_gpu()
Node.js with systeminformation
For JavaScript/Node.js applications, the systeminformation library provides GPU monitoring capabilities.
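The package can be installed from npm (this assumes an existing Node.js project):
# Install systeminformation
npm install systeminformation
The script below prints each detected GPU controller every two seconds: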
const si = require('systeminformation');

async function monitorGPU() {
  try {
    const graphics = await si.graphics();

    graphics.controllers.forEach((gpu, index) => {
      // Fields such as memoryUsed and temperatureGpu may be null on platforms
      // where the library cannot read them
      console.log(`GPU ${index}: ${gpu.model}`);
      console.log(`  Memory: ${gpu.memoryTotal}MB total, ${gpu.memoryUsed}MB used`);
      console.log(`  Temperature: ${gpu.temperatureGpu}°C`);
    });
  } catch (error) {
    console.error('Error monitoring GPU:', error);
  }
}

setInterval(monitorGPU, 2000);
Web-Based Monitoring Dashboards
For comprehensive monitoring solutions, especially in server environments, web-based dashboards provide excellent visualization and historical data tracking.
Prometheus and Grafana
Setting up Prometheus with NVIDIA's DCGM exporter creates a robust monitoring solution:
# Install DCGM
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/datacenter-gpu-manager_3.1.7_amd64.deb
sudo dpkg -i datacenter-gpu-manager_3.1.7_amd64.deb
# Start DCGM
sudo systemctl --now enable nvidia-dcgm
# Run DCGM exporter
docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
Then configure Prometheus to scrape the metrics and visualize them in Grafana. This setup provides historical data, alerting capabilities, and detailed performance analytics.
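As a minimal sketch, the scrape job might look like this in prometheus.yml (the job name and target are assumptions; merge the job into your existing scrape_configs section and adjust the target host):
scrape_configs:
  - job_name: 'dcgm-exporter'
    static_configs:
      - targets: ['localhost:9400']
You can confirm the exporter is serving data before wiring up Prometheus by fetching http://localhost:9400/metrics and checking for gauges such as DCGM_FI_DEV_GPU_UTIL.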
Netdata Cloud
Netdata provides out-of-the-box GPU monitoring with minimal configuration. After installation, it automatically detects and monitors available GPUs, providing real-time charts and alerts through a web interface.
Monitoring Tool Comparison
| Tool | Platform | GPU Support | Real-time | Historical Data | API Access | Cost |
|---|---|---|---|---|---|---|
| nvidia-smi | Linux, Windows | NVIDIA only | Yes | No | Yes (NVML) | Free |
| nvtop | Linux | NVIDIA, AMD | Yes | No | No | Free |
| GPU-Z | Windows | All major vendors | Yes | Yes (logging) | No | Free |
| MSI Afterburner | Windows | All major vendors | Yes | Yes | Limited | Free |
| Netdata | Cross-platform | NVIDIA, AMD | Yes | Yes | Yes (REST) | Free/Paid |
| Prometheus + Grafana | Cross-platform | NVIDIA (DCGM) | Yes | Yes | Yes | Free |
Real-World Use Cases and Examples
Machine Learning Model Training
When training deep learning models, monitoring GPU utilization helps optimize batch sizes and identify data loading bottlenecks. A typical monitoring setup for ML workflows includes:
#!/bin/bash
# Monitor GPU during training with custom metrics
# (assumes a single GPU; with multiple GPUs, only the first line of each query is used)
while true; do
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    gpu_util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
    mem_util=$(nvidia-smi --query-gpu=utilization.memory --format=csv,noheader,nounits | head -n1)
    temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | head -n1)

    echo "$timestamp,GPU_Util:${gpu_util}%,Mem_Util:${mem_util}%,Temp:${temp}C"

    # Alert if GPU utilization drops below 80% (potential data-loading bottleneck)
    if [ "$gpu_util" -lt 80 ]; then
        echo "WARNING: GPU utilization below 80% - check data pipeline"
    fi

    sleep 5
done
Cryptocurrency Mining
Mining operations require constant monitoring to ensure optimal performance and prevent hardware damage. A comprehensive monitoring script might track power efficiency and temperature limits:
#!/usr/bin/env python3
import pynvml
import time

def mining_monitor():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()

    while True:
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)

            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # Convert milliwatts to watts

            # Calculate efficiency (utilization per watt)
            efficiency = util.gpu / power if power > 0 else 0

            print(f"GPU {i}: {util.gpu}% util, {temp}°C, {power:.1f}W, {efficiency:.2f} util/W")

            # Temperature protection
            if temp > 83:
                print(f"WARNING: GPU {i} temperature critical ({temp}°C)")
                # Could trigger mining software restart or fan curve adjustment

        time.sleep(10)

if __name__ == "__main__":
    mining_monitor()
Server Infrastructure Monitoring
For servers running multiple GPU workloads, integration with existing monitoring infrastructure is crucial. Here's an example of exporting GPU metrics to a centralized logging system:
#!/usr/bin/env python3
import pynvml
import json
import time
import socket
from datetime import datetime, timezone

def export_gpu_metrics():
    pynvml.nvmlInit()
    hostname = socket.gethostname()

    while True:
        metrics = []
        device_count = pynvml.nvmlDeviceGetCount()

        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)

            # Query each metric once and reuse the results
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml versions return bytes
                name = name.decode('utf-8')
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)

            metric = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "hostname": hostname,
                "gpu_id": i,
                "gpu_name": name,
                "utilization_gpu": util.gpu,
                "utilization_memory": util.memory,
                "temperature": pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU),
                "power_usage": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,  # watts
                "memory_total": mem_info.total,
                "memory_used": mem_info.used,
                "memory_free": mem_info.free
            }
            metrics.append(metric)

        # Export to your logging/monitoring system
        print(json.dumps(metrics, indent=2))
        time.sleep(30)

if __name__ == "__main__":
    export_gpu_metrics()
Common Issues and Troubleshooting
Permission Issues
Many monitoring tools require elevated privileges or specific group membership. Common solutions include the following; keep in mind that loosening device-file permissions trades security for convenience, and changes to /dev/nvidia* do not persist across reboots:
# Add user to video group (Linux)
sudo usermod -a -G video $USER
# For nvidia-smi without sudo
sudo chmod 755 /usr/bin/nvidia-smi
# Set proper permissions for device files
sudo chmod 666 /dev/nvidia*
Driver and Library Conflicts
Monitoring tools often depend on specific driver versions or libraries. When troubleshooting:
- Verify driver installation: nvidia-smi should work without errors
- Check library versions: pip list | grep nvidia or equivalent
- Reinstall monitoring libraries if needed
- Use virtual environments to isolate dependencies
Missing Metrics or Inaccurate Readings
Some monitoring tools may not display all available metrics or show incorrect values:
# Verify available metrics with nvidia-smi
nvidia-smi --help-query-gpu
# Check supported queries
nvidia-smi --query-gpu=name,driver_version,memory.total,memory.used,memory.free,temperature.gpu,utilization.gpu,utilization.memory,power.draw --format=csv
# For scripting, always handle missing values
nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits 2>/dev/null || echo "N/A"
Performance Impact
Intensive monitoring can impact system performance. Best practices include:
- Adjust polling intervals based on requirements (1-5 seconds for real-time, 30+ seconds for logging)
- Use efficient monitoring tools (avoid resource-heavy GUI applications on servers)
- Implement monitoring data aggregation to reduce storage requirements
- Monitor the monitoring tools themselves for resource consumption
Best Practices and Security Considerations
Monitoring Strategy
Effective GPU monitoring requires a strategic approach:
- Define clear metrics that align with your use case (utilization, temperature, memory usage)
- Set appropriate alert thresholds (typically 85°C for temperature, 90% for memory usage); a minimal threshold check is sketched after this list
- Implement redundant monitoring for critical systems
- Maintain historical data for trend analysis and capacity planning
- Document normal operating ranges for your specific workloads
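As a hedged sketch of the threshold bullet above (assuming a single NVIDIA GPU and the example limits of 85°C and 90% memory usage), a cron-friendly check could look like this:
#!/bin/bash
# Read current temperature and memory figures for the first GPU
temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | head -n1)
mem_used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n1)
mem_total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
# Compare against the example thresholds and emit alerts for your alerting system to pick up
if [ "$temp" -ge 85 ]; then
    echo "ALERT: GPU temperature at ${temp}C (threshold 85C)"
fi
if [ $((100 * mem_used / mem_total)) -ge 90 ]; then
    echo "ALERT: GPU memory usage above 90% (${mem_used}/${mem_total} MiB)"
fi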
Security Considerations
GPU monitoring can expose sensitive system information:
- Restrict access to monitoring interfaces using authentication
- Use HTTPS for web-based monitoring dashboards
- Limit network exposure of monitoring endpoints
- Regularly update monitoring tools and their dependencies
- Consider the implications of exposing GPU utilization patterns
Integration with Infrastructure
For production environments, integrate GPU monitoring with existing infrastructure:
- Use configuration management tools (Ansible, Puppet) to deploy monitoring consistently
- Integrate with existing alerting systems (PagerDuty, Slack, email)
- Export metrics to centralized logging systems (ELK stack, Splunk)
- Implement automated responses to common issues (fan curve adjustments, workload redistribution)
Whether you're running VPS instances with GPU acceleration or managing dedicated servers with multiple graphics cards, implementing comprehensive GPU monitoring ensures optimal performance, prevents hardware damage, and provides valuable insights for capacity planning and optimization.
Real-time GPU monitoring is an essential skill for modern system administrators and developers working with high-performance computing workloads. The tools and techniques covered in this guide provide a solid foundation for monitoring GPU utilization across different platforms and use cases, from simple command-line monitoring to enterprise-grade dashboard solutions.
