NVRx Logger Guide
The NVRx Logger is a sophisticated logging system designed specifically for multi-node, multi-rank training workloads. It provides intelligent log aggregation, rank-aware formatting, and automatic adaptation between two logging modes:
- Regular Mode: Logs go directly to stderr/stdout (standard Python logging)
- Node Local Temporary Mode: Each rank writes logs to temporary files on local node storage, then local rank 0 aggregates and writes them to a per-node log file
What “Distributed Logging” Actually Means: The term “distributed logging” in this context refers to the fact that logs are collected from multiple ranks/processes across multiple nodes, but the actual storage and aggregation happens locally on each node. This is different from traditional distributed logging systems that send all logs to a central location over the network. The NVRx approach keeps logging local to each node for better performance and reliability.
Key Features
- Node Local Temporary Logging: When enabled, each rank writes logs to temporary files on local node storage, avoiding network filesystem bottlenecks
- Automatic Log Aggregation: Local rank 0 acts as the node aggregator, collecting logs from all ranks on the same node and writing them to a single per-node log file
- Environment-driven Behavior: Automatically adapts between regular logging (stderr/stdout) and node local temporary logging based on configuration
- Fork-safe Design: All ranks use file-based message passing to ensure child processes can log even when they don’t inherit the aggregator thread
- Dynamic Rank Detection: Automatically reads rank information from environment variables (RANK, LOCAL_RANK, SLURM_PROCID, SLURM_LOCALID), as sketched below
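The rank detection can be pictured with a short sketch. The fallback order shown here (torchrun-style variables first, then SLURM) is an illustrative assumption, not the library's exact implementation:

import os

def detect_ranks():
    # Illustrative only: read the global and local rank from common launcher
    # environment variables, falling back from torchrun-style to SLURM-style.
    rank = int(os.environ.get("RANK", os.environ.get("SLURM_PROCID", "0")))
    local_rank = int(os.environ.get("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0")))
    return rank, local_rank

print(detect_ranks())  # e.g. (0, 0) when run outside any launcher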
Architecture
The logger operates in two modes:
- Regular Mode (default): Logs go directly to stderr/stdout. This is the standard Python logging behavior.
- Node Local Temporary Mode (when NVRX_NODE_LOCAL_TMPDIR is set): Each rank writes logs to temporary files on local node storage (e.g., /tmp, /scratch, local SSDs). Local rank 0 aggregates these logs and writes them to a single per-node log file. This approach avoids network filesystem bottlenecks and performs better in high-throughput logging scenarios (see the sketch below).
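A minimal sketch of the mode switch, assuming only that NVRX_NODE_LOCAL_TMPDIR is the trigger; the handler-selection function and the per-rank file naming scheme are hypothetical, not the library's internals:

import logging
import os
import sys

def choose_log_handler(local_rank):
    # Hypothetical sketch: regular mode streams to stderr; node local
    # temporary mode writes a per-rank file under local node storage,
    # which local rank 0 would later aggregate into a per-node file.
    tmpdir = os.environ.get("NVRX_NODE_LOCAL_TMPDIR")
    if tmpdir is None:
        return logging.StreamHandler(sys.stderr)
    os.makedirs(tmpdir, exist_ok=True)
    return logging.FileHandler(os.path.join(tmpdir, f"rank_{local_rank}.log"))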
Configuration
The logger is configured through environment variables. See Configuration Reference for complete configuration details.
Key configuration variable:
- NVRX_NODE_LOCAL_TMPDIR: Set to enable node local temporary logging with aggregation
For advanced configuration options, environment variables, and troubleshooting, refer to the Configuration Reference.
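If you prefer to configure from Python rather than the shell, set the variable before initializing the logger (a sketch; setting it after setup_logger() has already run may not take effect):

import os

# Must be set before setup_logger() runs for the mode to take effect.
os.environ["NVRX_NODE_LOCAL_TMPDIR"] = "/tmp/nvrx_logs"

from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger

logger = setup_logger()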
Basic Usage
Set up the logger at the start of your program:
from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger
import logging
# Setup logging
logger = setup_logger()
# Get the configured logger
log = logging.getLogger("nvrx")
# Use throughout your code
log.info("Training started")
log.warning("GPU memory usage high")
log.error("Rank 0 failed")
Node Local Temporary Logging Setup
For workloads that need node-local temporary logging, set the environment variable:
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
Or in your SLURM script:
#!/bin/bash
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
srun python your_training_script.py
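The same pattern applies to other launchers; for example, with torchrun (node and GPU counts are placeholders):

#!/bin/bash
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
torchrun --nnodes=2 --nproc-per-node=8 your_training_script.py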
⚠️ Critical Filesystem Warning: The temporary directory experiences high write throughput from all ranks on each node. Use local node storage (e.g., /tmp, /scratch, local SSDs) and avoid network filesystems such as NFS or Lustre, or any other storage accessed over the network.
The logger automatically handles:
- Temporary log file creation for each rank
- Log aggregation from all ranks on each node
- Per-node log file writing
- Log rotation and cleanup
Advanced Configuration
Force logger reconfiguration for subprocesses:
logger = setup_logger(force_reset=True)
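For example, a spawned worker can re-initialize the logger before it logs anything (a minimal sketch; the worker function is hypothetical):

import logging
import multiprocessing as mp

from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger

def worker():
    # Re-initialize in the child, which does not inherit the parent's
    # aggregator thread.
    setup_logger(force_reset=True)
    logging.getLogger("nvrx").info("Worker started")

if __name__ == "__main__":
    setup_logger()
    p = mp.Process(target=worker)
    p.start()
    p.join()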
Log formatting automatically includes:
- Timestamp, log level, node ID
- Workload and infrastructure rank information
- Source file and line number
Example Output Format
2024-01-15 10:30:45,123 [INFO] [node001] [workload:0(0) infra:0(0)] training.py:45 Training started
2024-01-15 10:30:46,456 [WARNING] [node001] [workload:0(0) infra:0(0)] training.py:67 GPU memory usage high
2024-01-15 10:30:47,789 [ERROR] [node001] [workload:0(0) infra:0(0)] training.py:89 Rank 0 failed
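If you post-process per-node log files, lines in this format can be parsed with a regular expression (a sketch derived from the example above; adjust it if your build's format differs):

import re

# timestamp [LEVEL] [node] [workload:R(L) infra:R(L)] file:line message
LOG_LINE = re.compile(
    r"(?P<ts>\S+ \S+) "
    r"\[(?P<level>\w+)\] "
    r"\[(?P<node>[^\]]+)\] "
    r"\[workload:(?P<wl_rank>\d+)\((?P<wl_local>\d+)\) "
    r"infra:(?P<in_rank>\d+)\((?P<in_local>\d+)\)\] "
    r"(?P<src>\S+:\d+) (?P<msg>.*)"
)

line = "2024-01-15 10:30:45,123 [INFO] [node001] [workload:0(0) infra:0(0)] training.py:45 Training started"
m = LOG_LINE.match(line)
print(m.group("level"), m.group("node"), m.group("msg"))  # INFO node001 Training started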
Integration with Other NVRx Components
The logger automatically integrates with these NVRx components:
- Fault Tolerance: Automatic logging of restart events and health checks
- In-Process Restart: Logging of restart boundaries and process state
- Health Check: Logging of system health monitoring events
Note: Checkpointing and Straggler Detection components use their own logging mechanisms and do not integrate with the NVRx logger.
Best Practices
- Setup Once: Call setup_logger() once at the start of your main program
- Use Standard Logger: Access via logging.getLogger("nvrx") in other modules
- Environment Configuration: Use environment variables rather than hardcoding
- Subprocess Handling: Use force_reset=True for subprocesses
- Filesystem Selection: Use local node storage, avoid network filesystems (NFS, Lustre)
Troubleshooting
Common Issues (a diagnostic sketch follows this list):
- Logs not appearing: Check that NVRX_NODE_LOCAL_TMPDIR is set and the directory is writable
- Missing rank info: Ensure RANK/LOCAL_RANK environment variables are set
- Performance issues: Monitor directory size, adjust file limits, verify filesystem choice (avoid NFS/Lustre)
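A quick way to check the first two causes on a node is a small diagnostic script (a sketch; it inspects only the environment and filesystem, not the logger's internals):

import os

tmpdir = os.environ.get("NVRX_NODE_LOCAL_TMPDIR")
if tmpdir is None:
    print("NVRX_NODE_LOCAL_TMPDIR not set: regular (stderr/stdout) mode")
elif not os.path.isdir(tmpdir):
    print(f"{tmpdir} does not exist")
elif not os.access(tmpdir, os.W_OK):
    print(f"{tmpdir} is not writable")
else:
    print(f"{tmpdir} looks usable")

for var in ("RANK", "LOCAL_RANK", "SLURM_PROCID", "SLURM_LOCALID"):
    print(f"{var}={os.environ.get(var, '<unset>')}")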