FP8 Training Analyzer - User Guide
A model-agnostic tool for analyzing FP8 quantization logs and visualizing gradient underflows during training.
🎯 What Does This Tool Do?
The FP8 Analyzer helps you diagnose training issues caused by FP8 quantization by:
- Parsing training logs - Extracts FP8 metrics from your training runs
- Auto-detecting model architecture - Identifies encoder layers, head layers, and embeddings
- Generating publication-quality heatmaps - Visualizes gradient underflows across all model components over time
- Exporting structured data - Saves metrics to CSV for further analysis
Key Features
- ✅ Model-agnostic - Works with any transformer architecture (ESM, BERT, GPT, T5, etc.)
- ✅ Automatic component detection - No configuration needed
- ✅ Beautiful visualizations - 600 DPI publication-ready heatmaps
- ✅ Easy comparison - Run on multiple experiments with suffixes
How to use it
Before using this tool, you must first gather FP8 statistics during training. We currently support the following models and training scripts.
| Model | DDP | FSDP2 | MFSDP |
|---|---|---|---|
| ESM2 | ✓ | ✓ | ✗ |
| LLAMA3 | ✓ | ✓ | ✗ |
To gather FP8 statistics for analysis, refer to the model-specific documentation (e.g., ESM2 FP8 Debugging) or add these arguments to your training command:
# fp8_stats_config.enabled: whether to log stats
# fp8_stats_config.fp8_log_dir: where to store the logs
# fp8_stats_config.fp8_stats_file: YAML file specifying which stats to collect
# fp8_config.enabled: must be True, otherwise stats logging won't work
python train_fsdp2.py \
    fp8_stats_config.enabled=True \
    fp8_stats_config.fp8_log_dir=./logs/fp8_stats_logs_dummy \
    fp8_stats_config.fp8_stats_file=./fp8_stats.yaml \
    fp8_config.enabled=True
Once the run completes, the directory set by fp8_stats_config.fp8_log_dir should contain several subdirectories, laid out like this:
└── rank_0
    ├── nvdlfw_inspect_logs
    │   └── nvdlfw_inspect_globalrank-0.log
    └── nvdlfw_inspect_statistics_logs
        └── nvdlfw_inspect_globalrank-0.log
There is one directory per rank, which makes rank-by-rank analysis possible.
Each rank directory contains nvdlfw_inspect_logs and nvdlfw_inspect_statistics_logs. The file under nvdlfw_inspect_logs records which layer names are being tracked and which tensor values are being logged.
The file under nvdlfw_inspect_statistics_logs holds the actual statistics: one value for every tracked tensor at the iterations set by the freq parameter in the stats config file (the file given by fp8_stats_config.fp8_stats_file).
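Before running the analyzer, it can help to confirm that every rank directory contains both log files. The following is a minimal sketch, not part of the tool: the helper name `find_rank_logs` is hypothetical, and it assumes that all ranks follow the `rank_N` and `nvdlfw_inspect_globalrank-N.log` naming seen for rank 0 above.

```python
from pathlib import Path

# Assumption: every rank uses the same file naming as rank 0 in the tree above.
LOG_PATTERN = "nvdlfw_inspect_globalrank-*.log"


def find_rank_logs(log_dir):
    """Map each rank_* directory name to its metadata and statistics log files."""
    result = {}
    for rank_dir in sorted(Path(log_dir).glob("rank_*")):
        if not rank_dir.is_dir():
            continue
        metadata_dir = rank_dir / "nvdlfw_inspect_logs"
        stats_dir = rank_dir / "nvdlfw_inspect_statistics_logs"
        result[rank_dir.name] = {
            # Empty lists mean the subdirectory or its log file is missing.
            "metadata": sorted(metadata_dir.glob(LOG_PATTERN)),
            "statistics": sorted(stats_dir.glob(LOG_PATTERN)),
        }
    return result
```

Calling `find_rank_logs("./logs/fp8_stats_logs_dummy")` returns one entry per rank; a rank with an empty `metadata` or `statistics` list is probably incomplete and worth checking before analysis.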
📊 Sample Output
Example: Full FP8 Run (Encoder + Head in FP8)
Command:
python3 analyze_and_create_heatmap.py fp8logswithhead
Result:

What you see:
- 35 components: 33 encoder layers + 2 head layers (Dense, Decoder)
- Red/orange bands: Critical underflows in early layers (2-5) and late layer (33)
- Head layers affected: Both Dense (2.3%) and Decoder (2.5%) show underflows
- White separator line: Divides encoder layers from head layers
- U-shape pattern: Middle layers (7-28) are fine, but edges suffer
- Max underflow: 5.89% at Layer 33
- Yellow boxes: Highlight the 5 worst components
Interpretation: This is a problematic run where FP8 quantization causes significant gradient underflows throughout the model. The head being in FP8 amplifies problems in the encoder. Early layers (2-5) suffer from vanishing gradients, while late layers (32-33) and head layers receive noisy gradients from FP8 quantization.
🚀 Quick Start
1. Run the Analyzer
python3 analyze_and_create_heatmap.py <log_directory>
2. View Output
The script generates two files:
analysis_output/csv_data/rank_0_metrics.csv
heatmap_visualization/heatmap_highres.png
Open the PNG to see your heatmap!
📖 Understanding the Heatmap
Color Scale
| Color | Underflow % | Meaning |
|---|---|---|
| 🟢 Green | < 0.5% | ✅ Acceptable - Normal quantization noise |
| 🟡 Yellow | 0.5-2% | ⚠️ Warning - Monitor but not critical |
| 🟠 Orange | 2-4% | 🔶 Critical - Significant learning signal loss |
| 🔴 Red | > 4% | ❌ Severe - Major training instability risk |
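For scripted checks on the CSV output, the color scale can be expressed as a small lookup. This is a sketch only: the function name `underflow_band` is hypothetical, and the handling of exact boundary values (2% counted as yellow, 4% as orange) is an assumption, since the table does not say which band a boundary value belongs to.

```python
def underflow_band(pct):
    """Return the color band for a gradient underflow percentage.

    Thresholds follow the color scale table; boundary handling is an assumption.
    """
    if pct < 0.5:
        return "green"   # acceptable
    if pct <= 2.0:
        return "yellow"  # warning
    if pct <= 4.0:
        return "orange"  # critical
    return "red"         # severe


for value in (0.1, 1.2, 2.3, 5.89):
    print(f"{value}% -> {underflow_band(value)}")
```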
Visual Elements
- Yellow boxes - Highlight the 5 worst components (>2% underflows)
- White separator lines - Divide component groups (Encoder | Head | Embedding)
- Cyan vertical line - Marks iteration 3000 (common divergence point)
- Side labels - Show component groups (ENCODER, HEAD)
- Summary box (top-right) - Key statistics
Interpreting Patterns
✅ Good Pattern
Most layers: Green
Few yellow spots: Acceptable
Max < 2%: Safe to continue training
⚠️ Warning Pattern
Some layers: Orange (2-4%)
Isolated to 1-3 layers: Monitor closely
Max < 4%: Consider adjusting FP8 settings
❌ Bad Pattern
Multiple layers: Red (>4%)
U-shape (early + late): Gradient flow issues
Max > 5%: High risk of divergence
🔍 Common Scenarios
Scenario 1: U-Shape Pattern (Early + Late Layers)
What it looks like:
- Layers 1-5: Red/Orange
- Layers 6-28: Green
- Layers 29-33: Red/Orange
- Head: Red/Orange
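The same pattern can be checked programmatically once you have the peak underflow per encoder layer. The sketch below is a heuristic, not something the analyzer does: the function name `is_u_shaped`, the edge width of 5 layers, and the 2% threshold are all assumptions taken from the example numbers in this guide.

```python
def is_u_shaped(layer_peaks, edge=5, threshold=2.0):
    """Heuristic check for the U-shape pattern.

    layer_peaks: peak underflow % per encoder layer, ordered first to last.
    Returns True when both edges exceed the threshold and the middle does not.
    """
    if len(layer_peaks) <= 2 * edge:
        return False  # too few layers to separate edges from the middle
    early = layer_peaks[:edge]
    middle = layer_peaks[edge:-edge]
    late = layer_peaks[-edge:]
    return (
        max(early) > threshold
        and max(late) > threshold
        and max(middle) <= threshold
    )


# Illustrative values only: 33 layers with hot edges and a quiet middle.
example = [3.1] * 5 + [0.3] * 23 + [4.2] * 5
print(is_u_shaped(example))
```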
Why it happens:
- Early layers: Far from loss, gradients shrink through backprop (vanishing gradient)
- Late layers: Close to loss but receive noisy gradients from head
- Middle layers: Goldilocks zone - far enough to have stable gradients, close enough to receive clean signal
Solution:
# Keep problematic layers in higher precision
fp8_skip_layers = [
    "layers.1",
    "layers.2",
    "layers.3",  # Early layers
    "layers.31",
    "layers.32",
    "layers.33",  # Late layers
    "lm_head.dense",
    "lm_head.decoder",  # Head
]
Scenario 2: Head-Only Problem
What it looks like:
- Encoder layers: Mostly green
- Head (Dense/Decoder): Red/Orange
Why it happens:
- The head has a small vocabulary weight matrix with a large dynamic range
- Gradients from cross-entropy loss can be very small or very large
- FP8 struggles with this high dynamic range
Solution:
# Keep head in BF16
fp8_enabled = True
fp8_skip_layers = ["lm_head.dense", "lm_head.decoder"]
Expected improvement: 34-57% reduction in encoder underflows (validated!)
📐 Reading the Log Output
When you run the script, you'll see:
INFO - ================================================================================
INFO - MODEL-AGNOSTIC FP8 LOG ANALYZER & HEATMAP GENERATOR
INFO - ================================================================================
INFO - Log directory: fp8logswithhead
INFO - Output suffix: '_fp8head' (if provided)
INFO - ================================================================================
INFO - ================================================================================
INFO - PARSING MODEL ARCHITECTURE
INFO - ================================================================================
INFO - Metadata: fp8logswithhead/rank_0/nvdlfw_inspect_logs/nvdlfw_inspect_globalrank-0.log
INFO - Found 373 layer names
INFO - Model Structure:
INFO - Encoder layers: 33
INFO - Range: Layer 1 to 33
INFO - Head layers: 3
INFO - - model.lm_head
INFO - - model.lm_head.dense
INFO - - model.lm_head.decoder
INFO - ================================================================================
INFO - PARSING LOG FILE
INFO - ================================================================================
INFO - File: fp8logswithhead/.../nvdlfw_inspect_globalrank-0.log
INFO - Processed 500,000 lines...
INFO - Total lines: 6,013,920
INFO - Metrics extracted: 6,013,920
INFO - Iteration range: 0 to 7369
INFO - ================================================================================
INFO - AUTO-DETECTING COMPONENTS
INFO - ================================================================================
INFO - Found 68 gradient underflow metrics
INFO - Component Summary:
INFO - Encoder: 33 components
INFO - Head: 2 components
INFO - - Decoder
INFO - - Dense
INFO - ================================================================================
INFO - CREATING HEATMAP
INFO - ================================================================================
INFO - Components: 35
INFO - Data points: 257,950
INFO - Heatmap dimensions: 35 components × 121 time points
INFO - ✨ Saved heatmap: heatmap_visualization/heatmap_highres_fp8head.png
INFO - Max underflow: 5.89%
INFO - Critical components (>2%): 5
INFO - ================================================================================
INFO - ✅ COMPLETE
INFO - ================================================================================
Key metrics to watch:
- Max underflow: Should be < 2% ideally, < 4% acceptable
- Critical components: Fewer is better
- Iteration range: Ensure you have enough data
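If you capture the analyzer's console output to a file, these numbers can be pulled out automatically, for example to fail a CI job when the maximum underflow is too high. This is a minimal sketch that assumes the line formats shown in the sample log above; `parse_summary` is a hypothetical helper, not part of the script.

```python
import re


def parse_summary(log_text):
    """Extract headline numbers from the analyzer's console output.

    Returns None for any field whose line is not found.
    """
    patterns = {
        "max_underflow_pct": (r"Max underflow:\s*([\d.]+)%", float),
        "critical_components": (r"Critical components \(>2%\):\s*(\d+)", int),
        "last_iteration": (r"Iteration range:\s*\d+ to (\d+)", int),
    }
    summary = {}
    for key, (pattern, cast) in patterns.items():
        match = re.search(pattern, log_text)
        summary[key] = cast(match.group(1)) if match else None
    return summary


# Lines copied from the sample output above.
sample = """\
INFO - Iteration range: 0 to 7369
INFO - Max underflow: 5.89%
INFO - Critical components (>2%): 5
"""
print(parse_summary(sample))
```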
🔬 Advanced Analysis
Comparing Multiple Runs
To compare different experiments, run the analyzer in separate directories or rename the output files after each run:
# Run 1: Analyze and save results
python3 analyze_and_create_heatmap.py logs_fp8_full
mv heatmap_visualization/heatmap_highres.png heatmap_visualization/run1_fp8.png
mv analysis_output/csv_data/rank_0_metrics.csv analysis_output/csv_data/run1_fp8.csv
# Run 2: Analyze next experiment
python3 analyze_and_create_heatmap.py logs_bf16_head
mv heatmap_visualization/heatmap_highres.png heatmap_visualization/run2_bf16.png
mv analysis_output/csv_data/rank_0_metrics.csv analysis_output/csv_data/run2_bf16.csv
Then compare the heatmaps side-by-side!
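A numeric comparison can complement the visual one. The sketch below computes, for each gradient underflow metric present in both runs, the peak value and the relative change. It assumes the CSV has `metric_name`, `iteration`, and `value` columns and that underflow metrics end in `gradient_underflows%`, as in the example later in this guide; the function names are hypothetical.

```python
import csv

# Assumption: underflow metric names end with this suffix.
SUFFIX = "gradient_underflows%"


def peak_underflows(csv_path):
    """Return {metric_name: peak value} for gradient underflow metrics."""
    peaks = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            name = row["metric_name"]
            if name.endswith(SUFFIX):
                peaks[name] = max(peaks.get(name, 0.0), float(row["value"]))
    return peaks


def compare_runs(baseline_csv, candidate_csv):
    """Return a list of (metric, baseline_peak, candidate_peak, change_pct).

    Only metrics present in both runs are compared, so layers that were
    skipped in one run are left out.
    """
    base = peak_underflows(baseline_csv)
    cand = peak_underflows(candidate_csv)
    rows = []
    for name in sorted(base.keys() & cand.keys()):
        b, c = base[name], cand[name]
        # Relative change is undefined when the baseline peak is zero.
        change = (c - b) / b * 100 if b > 0 else float("nan")
        rows.append((name, b, c, change))
    return rows
```

With the files renamed as above, `compare_runs("analysis_output/csv_data/run1_fp8.csv", "analysis_output/csv_data/run2_bf16.csv")` returns one row per shared metric; a negative `change_pct` means the second run had a lower peak underflow.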
Extracting Specific Metrics
The CSV output contains all metrics:
import matplotlib.pyplot as plt
import pandas as pd

# Load data
df = pd.read_csv("analysis_output/csv_data/rank_0_metrics.csv")

# Get Layer 33 underflows over time
metric = "model.esm.encoder.layers.33.self_attention.layernorm_qkv_gradient_underflows%"
layer33 = df[df["metric_name"] == metric]

# Plot
plt.plot(layer33["iteration"], layer33["value"])
plt.xlabel("Iteration")
plt.ylabel("Gradient Underflow %")
plt.title("Layer 33 Gradient Underflows")
plt.show()
❓ FAQ
Q: What if my model has a different architecture?
A: The script auto-detects layers! It works with:
- ESM: model.esm.encoder.layers.N
- BERT: model.encoder.layer.N
- GPT: model.transformer.layers.N
- Custom: Any pattern with .layers.N. or .layer.N.
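To check whether your own layer names would match these patterns, a regular expression along the following lines can be used. This illustrates the naming convention only and is not necessarily how the script detects layers; `layer_index` is a hypothetical helper.

```python
import re

# Matches ".layers.N." or ".layer.N." anywhere in a name and captures N.
LAYER_RE = re.compile(r"\.layers?\.(\d+)\.")


def layer_index(name):
    """Return the layer number embedded in a metric or module name, or None."""
    match = LAYER_RE.search(name)
    return int(match.group(1)) if match else None


names = [
    "model.esm.encoder.layers.33.self_attention.layernorm_qkv_gradient_underflows%",
    "model.encoder.layer.4.attention.output",  # illustrative BERT-style name
    "model.lm_head.dense",  # head component, no layer index
]
for name in names:
    print(name, "->", layer_index(name))
```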
Q: Why are some layers missing in my heatmap?
A: If you used fp8_skip_layers, those layers won't be in FP8 and won't have underflow metrics logged. This is expected!
Q: What's a "good" underflow percentage?
A:
- < 0.5%: Excellent
- 0.5-1%: Good
- 1-2%: Acceptable
- 2-4%: Concerning
- > 4%: Critical - action needed
Q: Can I change the color scale?
A: Yes! Edit the max_val variable in the create_heatmap function in analyze_and_create_heatmap.py:
max_val = min(6, pivot_sample.values.max()) # Change 6 to your max
Q: How do I export to PDF?
A: Use ImageMagick or Preview.app:
# macOS: open in Preview.app, then choose File → Export as PDF
open heatmap_visualization/heatmap_highres.png
# Linux: convert with ImageMagick
convert heatmap_visualization/heatmap_highres.png output.pdf
🎓 Key Takeaways
- Gradient underflows % shows how much learning signal is lost - Keep it under 2%
- U-shape pattern indicates vanishing/noisy gradients - Fix by keeping problematic layers in BF16
- Head precision matters - BF16 head reduces encoder underflows by 34-57%
- Compare runs visually - Side-by-side heatmaps quickly show improvements
- Early detection is key - Run this analyzer frequently during training
Need help? Check the example heatmaps above or refer to the FAQ section!
Generated: January 13, 2026