GPU Monitoring in 2025: nvtop and radeontop for AI Workloads

🚀 performance 🧐 monitoring

Table of Contents

What’s This Article About?
The “Uh-Oh, My GPU Is Melting” Drama
Why GPU Monitoring Matters for AI in 2025
How Do nvtop & radeontop Work? (And What’s Under the Hood?)
When, Where, and How: Use Cases & Benefits
Quick Setup Guide: Getting nvtop & radeontop Running (Fast!)
Real-World Examples & Comic Metaphor Table
Beginner Mistakes, Myths, and Similar Tools
“Use This If…” Decision Flowchart
Fun Facts & Unconventional Hacks
Scripting & Automation: Level Up Your Monitoring
A Short Admin Story: The Day The GPUs Rebelled
Wrap-up & Recommendations

What’s This Article About?

Let’s be real: GPUs are the new datacenter rockstars, powering everything from Stable Diffusion and ChatGPT clones to deep learning pipelines and crypto mining. But with great power comes great overheating (and even greater headaches when things go wrong). This article is your hands-on, no-BS guide to monitoring your GPUs like a pro in 2025 using nvtop (for NVIDIA) and radeontop (for AMD). If you’re spinning up machines—whether locally, in Docker containers, on a VPS, or on some beefy dedicated box—this post is for you. Get ready for practical tips, quick setup, common pitfalls, and even a few nerdy admin tales.

The “Uh-Oh, My GPU Is Melting” Drama

Picture this: It’s 2AM, your AI model’s been training for 12 hours straight, the office is empty, and you’re watching your phone notifications like a hawk. Suddenly, the temperature on your RTX 4090 spikes. The fans go wild. Your SSH session lags. You check your cloud dashboard—nothing. Your monitoring system? MIA. That’s the moment you realize: “If only I had a simple, real-time GPU dashboard right in my terminal…”

If you’ve ever fried a card, lost a week’s worth of training, or just hate guessing what your hardware is up to, you know why this matters.

Why GPU Monitoring Matters for AI in 2025

AI Workloads ≠ Normal Workloads: Training LLMs, running Stable Diffusion, or even video encoding can peg your GPU at 100% for hours or even days. Stuff gets hot.
Cloud, VPS, Docker—You Name It: In virtualized or containerized environments, GPU access can get weird. You need to see what’s going on, fast.
Spotting Bottlenecks: Is your job CPU-bound, I/O-bound, or just waiting for VRAM? Without monitoring, it’s all guesswork.
Preventing Meltdowns: Overheating, throttling, and memory leaks are real. Detect them before your fans sound like a jet engine.

TL;DR: Good GPU monitoring is the difference between “training complete” and “RIP, GPU.”

How Do nvtop & radeontop Work? (And What’s Under the Hood?)

nvtop: The NVIDIA Terminal Dashboard

What is it? Think htop or top, but for NVIDIA GPUs. Real-time stats, per-GPU and per-process.
How? Uses libnvidia-ml (NVIDIA Management Library) to tap into hardware sensors and stats.
Visuals: Colorful bars for utilization, memory, temperature, power, fan speed, and per-process info. All in your terminal—SSH-friendly!
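Supported Hardware: GeForce RTX, Quadro, Tesla, and basically anything with a modern NVIDIA chip. Recent nvtop releases can also read AMD and Intel GPUs, but this guide sticks to the NVIDIA side.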

radeontop: The AMD Terminal Sidekick

What is it? Same concept as nvtop, but for AMD cards.
How? Samples GPU status registers through the Linux kernel’s AMD drivers (amdgpu, or radeon on older cards) and reports how often each hardware block is busy.
Visuals: Text-based bars for each pipeline block (graphics pipe, shader, texture units, and so on), plus VRAM usage and clocks.
Supported Hardware: Most GCN and RDNA cards, including the latest RDNA3 monsters.

Both tools run in a simple terminal window, with zero bloat and minimal system overhead. Perfect for servers, SSH, tmux/screen, or even inside Docker containers (with the right device mounts).
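
Want to poke at the same numbers nvtop draws as bars? On NVIDIA, the counters are also reachable through nvidia-smi's query interface. Here's a minimal sketch that turns that CSV into a one-line summary per GPU. The function name summarize_gpus is made up for this example, and the column order is an assumption that must match the query shown in the comment.

```shell
#!/bin/bash
# Summarize per-GPU stats from nvidia-smi CSV read on stdin.
# Assumed column order (must match the query below):
#   index, name, utilization.gpu, memory.used, memory.total, temperature.gpu
summarize_gpus() {
  awk -F', ' '{
    # Guard against a zero total so we never divide by zero
    pct = ($5 > 0) ? 100 * $4 / $5 : 0
    printf "GPU %s (%s): util %d%%, VRAM %d/%d MiB (%.0f%%), %d C\n", $1, $2, $3, $4, $5, pct, $6
  }'
}

# Usage on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,temperature.gpu \
#     --format=csv,noheader,nounits | summarize_gpus
```

Keeping the parsing in a function that reads stdin means you can test it with canned input on a laptop that has no GPU at all.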

When, Where, and How: Use Cases & Benefits

Server Admins: Watch for overheating, throttling, or failed fans on headless servers—no GUI needed.
ML Engineers & AI Researchers: Tune batch sizes and parallel jobs by watching VRAM usage in real time.
Cloud & VPS Users: Confirm you’re actually getting the GPU you’re paying for (and not just a “virtual” card with no power).
DevOps & CI/CD: Automate sanity checks before running expensive training jobs. Alert on overload or hardware errors.
Home Lab Geeks: Benchmark your setup, overclock safely, and spot zombie processes hogging your card.
Dockerized Workloads: Monitor inside containers—yes, it works, with a little setup.

Bonus: Both tools play nice with scripts, so you can automate metrics collection, alerts, or even scale workloads based on real utilization.
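
To make the "sanity check before an expensive job" idea concrete, here's a rough preflight sketch. It assumes you want every GPU to have a minimum amount of free VRAM and to sit below a temperature ceiling before training starts. The function name gpu_preflight and the thresholds in the usage line are placeholders, not recommendations.

```shell
#!/bin/bash
# Preflight gate: reads "index, memory.free, temperature.gpu" CSV on stdin.
# $1 = minimum free VRAM in MiB, $2 = maximum acceptable temperature in C.
# Exits 0 only if at least one GPU is listed and every GPU passes both checks.
gpu_preflight() {
  awk -F', ' -v min_free="$1" -v max_temp="$2" '
    BEGIN { bad = 0 }
    { seen = 1 }
    $2 + 0 < min_free + 0 { printf "GPU %s: only %d MiB free\n", $1, $2; bad = 1 }
    $3 + 0 > max_temp + 0 { printf "GPU %s: already at %d C\n", $1, $3; bad = 1 }
    END {
      if (!seen) { print "no GPUs reported"; bad = 1 }
      exit bad
    }
  '
}

# Usage (thresholds are examples only):
#   nvidia-smi --query-gpu=index,memory.free,temperature.gpu \
#     --format=csv,noheader,nounits | gpu_preflight 16000 80 && echo "OK to launch"
```

Because the function returns a normal exit code, it drops straight into a CI step or a shell && chain.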

Quick Setup Guide: Getting nvtop & radeontop Running (Fast!)

Mini Glossary (Real-Talk Definitions)

VRAM: Video RAM, the “workspace” for your GPU. If it fills up, jobs crash or slow down.
Utilization: How “busy” your GPU is. High = working hard. Low = probably waiting for data or idle.
Power/Temp: How hot and hungry your GPU is. Watch these for meltdown prevention.
Device Mounts: In Docker, passing through the physical GPU so containers can see it.

Step-By-Step: nvtop (NVIDIA GPUs)

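Prerequisites: Install the NVIDIA driver (and CUDA toolkit if needed). The driver version below is only an example; use whichever one your distro recommends for your card.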
sudo apt install nvidia-driver-535 nvidia-cuda-toolkit
Install nvtop:
sudo apt install nvtop

or (Fedora):

sudo dnf install nvtop
Fire it up:
nvtop
Inside Docker?
Install the NVIDIA Container Toolkit on the host, run your container with the --gpus all flag, and install nvtop inside the container.

docker run --gpus all -it ubuntu bash

apt-get update && apt-get install -y nvtop

Step-By-Step: radeontop (AMD GPUs)

Prerequisites: Kernel with amdgpu driver (most modern distros have this).
Install radeontop:
sudo apt install radeontop

or (Fedora):

sudo dnf install radeontop
Run it:
radeontop
Docker? It’s trickier for AMD: you’ll need to pass the GPU device nodes into the container, typically /dev/dri (and /dev/kfd for ROCm compute workloads).
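
For the AMD passthrough, a small helper can build the docker flags from whatever device nodes actually exist on the host. This is a sketch under assumptions: amd_device_flags is a made-up name, and the --group-add video option in the usage line follows the common ROCm container pattern, which may differ on your distro.

```shell
#!/bin/bash
# Print docker --device flags for the AMD GPU nodes present on this host.
# /dev/dri carries the render nodes; /dev/kfd is the ROCm compute interface.
amd_device_flags() {
  flags=""
  for node in /dev/dri /dev/kfd; do
    if [ -e "$node" ]; then
      flags="$flags --device=$node"
    fi
  done
  # Strip the leading space left by the first append
  echo "${flags# }"
}

# Usage (unquoted on purpose so the flags split into separate arguments):
#   docker run -it $(amd_device_flags) --group-add video ubuntu bash
```

On a box with no AMD GPU the function prints an empty line, so the container simply starts without any device passthrough.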

Links:

Official nvtop: https://github.com/Syllo/nvtop
Official radeontop: https://github.com/clbr/radeontop

Need a beefy box to try all this? Check out VPS or dedicated server options at MangoHost!

Real-World Examples & Comic Metaphor Table

Meet the Team: nvtop vs. radeontop (as Comic Superheroes)

Captain GreenBar (nvtop): Fights for NVIDIA justice. Sees through CUDA, laser-vision on VRAM leaks, smashes overheating with a single glance.
- Strengths: Process-level info, per-GPU stats, works in Docker, shiny colors.
- Weakness: On NVIDIA cards it needs the NVIDIA driver (no drivers = no powers). Newer releases can read AMD GPUs too, but in this guide that job goes to radeontop.
Red Lightning (radeontop): AMD’s champion. Races across PCIe lanes, tracks memory, engines, and shaders.
- Strengths: No-CUDA dependency, tracks all the new RDNA stuff, super-lightweight.
- Weakness: No NVIDIA support, fewer stats than nvtop, sometimes cryptic output.
Sidekick: nvidia-smi (the silent type): Always there, but it prints one snapshot at a time and has no interactive view or history graphs.

Example: The Good, The Bad, and The Ugly

Good: You catch a memory leak in your PyTorch training by watching VRAM usage climb and halt the process before OOM crash.
Bad: You forgot to check power limits, and the GPU throttles down—training slows to a crawl.
Ugly: You trusted the cloud “GPU available” status, but nvtop shows your job getting almost no GPU time. Turns out your instance is sharing the card with ten other tenants. Ouch.
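
The "Good" scenario can be roughly automated. The sketch below reads a series of memory.used samples and flags a possible leak when usage rose on every consecutive sample and grew past a threshold overall. The function name detect_vram_growth and the heuristic itself are assumptions for illustration; frameworks with caching allocators (PyTorch included) can grow legitimately during warm-up, so treat a hit as a prompt to look at nvtop, not as proof.

```shell
#!/bin/bash
# Read one memory.used value (MiB) per line on stdin.
# $1 = minimum total growth in MiB that counts as suspicious.
# Prints a warning when every sample is higher than the previous one
# and total growth is at least $1; prints nothing otherwise.
detect_vram_growth() {
  awk -v min_growth="$1" '
    NR == 1 { first = $1 + 0; prev = first; next }
    { cur = $1 + 0; if (cur <= prev) dipped = 1; prev = cur }
    END {
      if (NR >= 3 && !dipped && prev - first >= min_growth + 0) {
        printf "possible leak: +%d MiB over %d samples\n", prev - first, NR
      }
    }
  '
}

# Usage: collect a sample per minute into a file (GPU 0 here), then check the last ten:
#   nvidia-smi -i 0 --query-gpu=memory.used --format=csv,noheader,nounits >> vram.txt
#   tail -n 10 vram.txt | detect_vram_growth 2000
```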

Beginner Mistakes, Myths, and Similar Tools

Myth #1: “nvidia-smi is all you need.” Not quite. It prints a one-off snapshot unless you loop it (nvidia-smi -l 1 or watch), and it has no interactive view, history graphs, or sortable process list.
Myth #2: “You can’t monitor GPUs in Docker.” You can: use --gpus all for NVIDIA, or device passthrough for AMD.
Beginner Mistake: Not running as root (some metrics require it, especially on AMD).
Similar Tools: watch nvidia-smi (boring, not interactive), gpustat (great summary, less detail), glances (system overview, GPU plugin optional).

Pro tip: Combine nvtop/radeontop with Prometheus exporters for web dashboards.
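
One low-effort way to get GPU numbers into Prometheus is node_exporter's textfile collector, which scrapes any *.prom file in a directory you configure. Here's a minimal sketch that converts nvidia-smi CSV into that text format. The metric names and the function name gpu_csv_to_prom are invented for this example, and the output path in the usage comment is an assumption about your setup.

```shell
#!/bin/bash
# Convert "index, utilization.gpu, memory.used, temperature.gpu" CSV (stdin)
# into Prometheus text exposition lines. Metric names are illustrative only.
gpu_csv_to_prom() {
  awk -F', ' '{
    printf "gpu_utilization_percent{gpu=\"%s\"} %s\n", $1, $2
    printf "gpu_memory_used_mib{gpu=\"%s\"} %s\n", $1, $3
    printf "gpu_temperature_celsius{gpu=\"%s\"} %s\n", $1, $4
  }'
}

# Usage from cron; write to a temp file and rename so the collector
# never reads a half-written file (directory path is an assumption):
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,temperature.gpu \
#     --format=csv,noheader,nounits | gpu_csv_to_prom > /var/lib/node_exporter/gpu.prom.tmp \
#     && mv /var/lib/node_exporter/gpu.prom.tmp /var/lib/node_exporter/gpu.prom
```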

“Use This If…” Decision Flowchart

    👾 Are you using NVIDIA?
      ↓ YES
        ├─> Need per-process stats, real-time bars? → nvtop
        └─> Just a summary? → nvidia-smi or gpustat
      ↓ NO (AMD)
        └─> Want real-time terminal stats? → radeontop
      ↓ Neither / Intel / Embedded?
        └─> Try intel_gpu_top or custom scripts
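
If you'd rather let a script walk the flowchart, here's a rough sketch that reads lspci output and suggests a tool per GPU vendor found. The function name pick_gpu_tool is made up, and the vendor matching is a simple string heuristic that assumes typical lspci device descriptions.

```shell
#!/bin/bash
# Read lspci output on stdin and print the suggested monitoring tool(s),
# one per line, sorted. Only display-class devices are considered.
pick_gpu_tool() {
  awk '
    /VGA|3D|Display/ {
      if ($0 ~ /NVIDIA/)        tool["nvtop"] = 1
      else if ($0 ~ /AMD|ATI/)  tool["radeontop"] = 1
      else if ($0 ~ /Intel/)    tool["intel_gpu_top"] = 1
    }
    END { for (t in tool) print t }
  ' | sort
}

# Usage:
#   lspci | pick_gpu_tool
```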

Still shopping for the right GPU server? Order a VPS or dedicated server to get started.

Fun Facts & Unconventional Hacks

nvtop supports multiple GPUs at once—watch your entire AI cluster in a single window!
radeontop can dump plain-text samples to a file or to stdout (the -d option), which makes scripting and log collection easy.
Both tools can be run inside tmux or screen—perfect for background monitoring.
Fancy dashboards? Feed nvtop stats into Grafana via a custom script for pretty web views.
Some admins use nvtop output to auto-throttle workloads when temps spike—saving hardware and power bills.
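
The auto-throttle trick can be as simple as lowering the power cap when a card runs hot. Below is a tiny sketch of the decision logic only. The function name pick_power_limit and every number in the usage comment are hypothetical; check your card's supported power range before setting anything, and note that changing the limit with nvidia-smi -pl needs root.

```shell
#!/bin/bash
# Choose a power limit based on the current temperature.
# $1 = current temp (C), $2 = threshold (C),
# $3 = normal limit (W), $4 = reduced limit (W)
pick_power_limit() {
  if [ "$1" -ge "$2" ]; then
    echo "$4"
  else
    echo "$3"
  fi
}

# Usage for GPU 0 (wattages are placeholders, not recommendations):
#   temp=$(nvidia-smi -i 0 --query-gpu=temperature.gpu --format=csv,noheader,nounits)
#   sudo nvidia-smi -i 0 -pl "$(pick_power_limit "$temp" 83 450 300)"
```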

Scripting & Automation: Level Up Your Monitoring

Example: Simple GPU Overheat Alert (nvidia-smi + bash)

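#!/bin/bash
# Alert if any NVIDIA GPU exceeds 85C
while true; do
  # Hottest GPU temperature as a bare integer (no header, no units)
  temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | sort -nr | head -1)
  # Skip the comparison if nvidia-smi returned nothing (driver hiccup, no GPU)
  if [ -n "$temp" ] && [ "$temp" -gt 85 ]; then
    echo "🚨 Warning: GPU temperature is ${temp}C! Check nvtop!" | mail -s "GPU Overheat Alert" you@example.com
  fi
  sleep 60
done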

Automate VRAM Usage Logging

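#!/bin/bash
# Append one timestamped VRAM reading per GPU (run it from cron);
# noheader keeps the CSV header from being repeated in the log
nvidia-smi --query-gpu=timestamp,index,memory.used --format=csv,noheader >> /var/log/gpu_mem.log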

Or, for AMD:

radeontop -d - -l 1 | grep 'gpu ' >> /var/log/amd_gpu.log

The outputs can be parsed, visualized, or fed into alerting systems like Zabbix, Prometheus, or even Slack bots.
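
As a parsing example, here's a sketch that boils radeontop's dump lines down to a three-column CSV (timestamp, GPU busy %, VRAM %). It assumes the dump format looks like "1507733158.394628: bus 01, gpu 4.17%, ee 0.00%, ..., vram 15.43% 315.74mb, ..."; verify that against your radeontop version before relying on it. The function name radeontop_to_csv is made up.

```shell
#!/bin/bash
# Convert radeontop dump lines (stdin) into "timestamp,gpu_pct,vram_pct" CSV.
# Lines without a "gpu" field are skipped.
radeontop_to_csv() {
  awk '{
    ts = $1; sub(/:$/, "", ts)          # drop the trailing colon on the timestamp
    gpu = ""; vram = ""
    for (i = 2; i < NF; i++) {
      if ($i == "gpu")  gpu  = $(i + 1)
      if ($i == "vram") vram = $(i + 1)
    }
    gsub(/[%,]/, "", gpu); gsub(/[%,]/, "", vram)
    if (gpu != "") print ts "," gpu "," vram
  }'
}

# Usage:
#   radeontop -d - -l 5 | radeontop_to_csv >> /var/log/amd_gpu.csv
```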

A Short Admin Story: The Day The GPUs Rebelled

Once upon a midnight patch cycle, an admin (let’s call them “Alex”) was running a multi-GPU training job on a rented server. Suddenly, one card’s fan failed. Training slowed, then stopped. But with nvtop running in a tmux pane, Alex spotted the issue: one GPU’s temp bar was pegged at 98C! A quick shutdown, a ticket to the hosting provider, and disaster averted. Moral: GPU monitoring isn’t just a luxury; it’s your early-warning system.

Wrap-up & Recommendations

For AI workloads in 2025, nvtop (NVIDIA) and radeontop (AMD) are the go-to terminal tools for real-time GPU monitoring—no GUI, no nonsense.
They’re fast to install, work over SSH, and are perfect for VPS, cloud, Docker, and dedicated server setups.
Great for admins, ML engineers, DevOps, and anyone who wants to keep their expensive hardware alive and well.
Want to jump in? Get a VPS or dedicated server with GPU and test out nvtop/radeontop today!
Remember: Your GPUs are the engine. Don’t drive blind—nvtop and radeontop are your dashboard. Happy monitoring!
