NVRx Logger Guide

The NVRx Logger is a sophisticated logging system designed specifically for multi-node, multi-rank training workloads. It provides intelligent log aggregation, rank-aware formatting, and automatic adaptation between two logging modes:

  • Regular Mode: Logs go directly to stderr/stdout (standard Python logging)

  • Node Local Temporary Mode: Each rank writes logs to temporary files on local node storage, then local rank 0 aggregates and writes them to a per-node log file

What “Distributed Logging” Actually Means: The term “distributed logging” in this context refers to the fact that logs are collected from multiple ranks/processes across multiple nodes, but the actual storage and aggregation happen locally on each node. This is different from traditional distributed logging systems, which send all logs to a central location over the network. The NVRx approach keeps logging local to each node for better performance and reliability.

Key Features

  • Node Local Temporary Logging: When enabled, each rank writes logs to temporary files on local node storage, avoiding network filesystem bottlenecks

  • Automatic Log Aggregation: Local rank 0 acts as the node aggregator, collecting logs from all ranks on the same node and writing them to a single per-node log file

  • Environment-driven Behavior: Automatically adapts between regular logging (stderr/stdout) and node local temporary logging based on configuration

  • Fork-safe Design: All ranks use file-based message passing to ensure child processes can log even when they don’t inherit the aggregator thread

  • Dynamic Rank Detection: Automatically reads rank information from environment variables (RANK, LOCAL_RANK, SLURM_PROCID, SLURM_LOCALID)
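
As an illustration of how rank detection works, the sketch below resolves ranks from the same environment variables the logger reads. The helper name and the exact precedence (torchrun-style variables first, SLURM variables as fallback) are assumptions for illustration, not the NVRx implementation:

import os

def detect_ranks():
    # Illustrative helper, not part of the NVRx API. The precedence shown
    # here (RANK/LOCAL_RANK first, SLURM variables as fallback) is an
    # assumption for illustration.
    rank = int(os.environ.get("RANK", os.environ.get("SLURM_PROCID", "0")))
    local_rank = int(os.environ.get("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0")))
    return rank, local_rank

rank, local_rank = detect_ranks()
print(f"global rank={rank}, local rank={local_rank}")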

Architecture

The logger operates in two modes:

Regular Mode (default)

Logs go directly to stderr/stdout. This is the standard Python logging behavior.

Node Local Temporary Mode (when NVRX_NODE_LOCAL_TMPDIR is set)

Each rank writes logs to temporary files on local node storage (e.g., /tmp, /scratch, local SSDs). Local rank 0 aggregates these logs and writes them to a single per-node log file. This approach avoids network filesystem bottlenecks and provides better performance for high-throughput logging scenarios.
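
The flow can be pictured with a minimal sketch. Everything below (the rank_*.log file name pattern, the polling loop, the node.log output name) is an illustrative assumption, not the actual NVRx implementation; it only shows the shape of the pattern:

import glob
import os
import time

TMPDIR = os.environ["NVRX_NODE_LOCAL_TMPDIR"]

def aggregate_once(offsets, node_log):
    # Copy any new bytes from each per-rank temporary file (hypothetical
    # name pattern) into the single per-node log file.
    for path in sorted(glob.glob(os.path.join(TMPDIR, "rank_*.log"))):
        with open(path) as f:
            f.seek(offsets.get(path, 0))
            new_lines = f.read()
            offsets[path] = f.tell()
        if new_lines:
            node_log.write(new_lines)
            node_log.flush()

# Local rank 0 would run a loop like this in a background thread.
offsets = {}
with open(os.path.join(TMPDIR, "node.log"), "a") as node_log:
    for _ in range(3):  # a real aggregator loops until shutdown
        aggregate_once(offsets, node_log)
        time.sleep(1.0)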

Configuration

The logger is configured through environment variables. See Configuration Reference for complete configuration details.

Key configuration variable:

  • NVRX_NODE_LOCAL_TMPDIR: Set to enable node local temporary logging with aggregation

For advanced configuration options, environment variables, and troubleshooting, refer to the Configuration Reference.

Basic Usage

Set up the logger at the start of your program:

from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger
import logging

# Set up logging
logger = setup_logger()

# Get the configured logger
log = logging.getLogger("nvrx")

# Use throughout your code
log.info("Training started")
log.warning("GPU memory usage high")
log.error("Rank 0 failed")

Node Local Temporary Logging Setup

For workloads that need node-local temporary logging, set the environment variable:

export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs

Or in your SLURM script:

#!/bin/bash
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs

srun python your_training_script.py

⚠️ Critical Filesystem Warning: The temporary directory experiences high write throughput from all ranks on each node. Use local node storage (e.g., /tmp, /scratch, local SSDs) and avoid network filesystems such as NFS or Lustre, or any other storage accessed over the network.

The logger automatically handles:

  • Temporary log file creation for each rank

  • Log aggregation from all ranks on each node

  • Per-node log file writing

  • Log rotation and cleanup

Advanced Configuration

Force logger reconfiguration for subprocesses:

logger = setup_logger(force_reset=True)
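
For example, a worker process can re-run setup in its entry point. This is a minimal sketch; the spawn-based worker and its function name are illustrative:

import logging
import multiprocessing as mp

from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger

def worker():
    # Reconfigure logging in the child, which does not inherit the
    # parent's aggregator thread.
    setup_logger(force_reset=True)
    logging.getLogger("nvrx").info("Hello from the subprocess")

if __name__ == "__main__":
    setup_logger()
    p = mp.get_context("spawn").Process(target=worker)
    p.start()
    p.join()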

Log formatting automatically includes:

  • Timestamp, log level, node ID

  • Workload and infrastructure rank information

  • Source file and line number

Example Output Format

2024-01-15 10:30:45,123 [INFO] [node001] [workload:0(0) infra:0(0)] training.py:45 Training started
2024-01-15 10:30:46,456 [WARNING] [node001] [workload:0(0) infra:0(0)] training.py:67 GPU memory usage high
2024-01-15 10:30:47,789 [ERROR] [node001] [workload:0(0) infra:0(0)] training.py:89 Rank 0 failed

Integration with Other NVRx Components

The logger automatically integrates with these NVRx components:

  • Fault Tolerance: Automatic logging of restart events and health checks

  • In-Process Restart: Logging of restart boundaries and process state

  • Health Check: Logging of system health monitoring events

Note: Checkpointing and Straggler Detection components use their own logging mechanisms and do not integrate with the NVRx logger.

Best Practices

  1. Setup Once: Call setup_logger() once at the start of your main program

  2. Use Standard Logger: Access via logging.getLogger("nvrx") in other modules

  3. Environment Configuration: Use environment variables rather than hardcoding

  4. Subprocess Handling: Use force_reset=True for subprocesses

  5. Filesystem Selection: Use local node storage, avoid network filesystems (NFS, Lustre)
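
Practices 1 and 2 combine into the pattern sketched below; the function name is illustrative, and the two halves would normally live in separate files:

# In your entry point, set up once:
from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger
import logging

setup_logger()

# In any other module, fetch the same logger by name; no setup needed:
log = logging.getLogger("nvrx")

def load_dataset():
    log.info("Loading dataset")

load_dataset()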

Troubleshooting

Common Issues:

  • Logs not appearing: Check that NVRX_NODE_LOCAL_TMPDIR is set and the directory is writable

  • Missing rank info: Ensure the RANK/LOCAL_RANK (or SLURM equivalent) environment variables are set

  • Performance issues: Monitor directory size, adjust file limits, and verify the filesystem choice (avoid NFS/Lustre)
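
A quick check for the first two issues (a minimal diagnostic sketch; it inspects only the environment, not the logger itself):

import os

# Check node local temporary logging configuration.
tmpdir = os.environ.get("NVRX_NODE_LOCAL_TMPDIR")
if tmpdir is None:
    print("NVRX_NODE_LOCAL_TMPDIR is not set: regular (stderr/stdout) mode")
elif not os.access(tmpdir, os.W_OK):
    print(f"NVRX_NODE_LOCAL_TMPDIR={tmpdir} is not writable")
else:
    print(f"node local temporary logging enabled in {tmpdir}")

# Check rank environment variables.
for var in ("RANK", "LOCAL_RANK", "SLURM_PROCID", "SLURM_LOCALID"):
    print(f"{var}={os.environ.get(var, '<unset>')}")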