NVRx Logger Guide

The NVRx Logger is a sophisticated logging system designed specifically for multi-node, multi-rank training workloads. It provides intelligent log aggregation, rank-aware formatting, and automatic adaptation between two logging modes:

  • Regular Mode: Logs go directly to stderr/stdout (standard Python logging)

  • Node Local Temporary Mode: Each rank writes logs to temporary files on local node storage, then local rank 0 aggregates and writes them to a per-node log file

What “Distributed Logging” Actually Means: The term “distributed logging” in this context refers to the fact that logs are collected from multiple ranks/processes across multiple nodes, but the actual storage and aggregation happen locally on each node. This is different from traditional distributed logging systems, which send all logs to a central location over the network. The NVRx approach keeps logging local to each node for better performance and reliability.

Key Features

  • Node Local Temporary Logging: When enabled, each rank writes logs to temporary files on local node storage, avoiding network filesystem bottlenecks

  • Automatic Log Aggregation: Local rank 0 acts as the node aggregator, collecting logs from all ranks on the same node and writing them to a single per-node log file

  • Environment-driven Behavior: Automatically adapts between regular logging (stderr/stdout) and node local temporary logging based on configuration

  • Fork-safe Design: All ranks use file-based message passing to ensure child processes can log even when they don’t inherit the aggregator thread

  • Dynamic Rank Detection: Automatically reads rank information from environment variables (RANK, LOCAL_RANK, SLURM_PROCID, SLURM_LOCALID)
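
As an illustration of how rank detection works, the sketch below resolves ranks from the same environment variables the logger reads. The helper name and the exact precedence (torchrun-style variables first, SLURM variables as fallback) are assumptions for illustration, not the NVRx implementation:

import os

def detect_ranks():
    # Illustrative helper, not part of the NVRx API. The precedence shown
    # here (RANK/LOCAL_RANK first, SLURM variables as fallback) is an
    # assumption for illustration.
    rank = int(os.environ.get("RANK", os.environ.get("SLURM_PROCID", "0")))
    local_rank = int(os.environ.get("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0")))
    return rank, local_rank

rank, local_rank = detect_ranks()
print(f"global rank={rank}, local rank={local_rank}")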

Architecture

The logger operates in two modes:

Regular Mode (default)

Logs go directly to stderr/stdout. This is the standard Python logging behavior.

Node Local Temporary Mode (when NVRX_NODE_LOCAL_TMPDIR is set)

Each rank writes logs to temporary files on local node storage (e.g., /tmp, /scratch, local SSDs). Local rank 0 aggregates these logs and writes them to a single per-node log file. This approach avoids network filesystem bottlenecks and provides better performance for high-throughput logging scenarios.
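
The flow can be pictured with a minimal sketch. Everything below (the rank_*.log file name pattern, the polling loop, the node.log output name) is an illustrative assumption, not the actual NVRx implementation; it only shows the shape of the pattern:

import glob
import os
import time

TMPDIR = os.environ["NVRX_NODE_LOCAL_TMPDIR"]

def aggregate_once(offsets, node_log):
    # Copy any new bytes from each per-rank temporary file (hypothetical
    # name pattern) into the single per-node log file.
    for path in sorted(glob.glob(os.path.join(TMPDIR, "rank_*.log"))):
        with open(path) as f:
            f.seek(offsets.get(path, 0))
            new_lines = f.read()
            offsets[path] = f.tell()
        if new_lines:
            node_log.write(new_lines)
            node_log.flush()

# Local rank 0 would run a loop like this in a background thread.
offsets = {}
with open(os.path.join(TMPDIR, "node.log"), "a") as node_log:
    for _ in range(3):  # a real aggregator loops until shutdown
        aggregate_once(offsets, node_log)
        time.sleep(1.0)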

Configuration

The logger is configured through environment variables. See Configuration Reference for complete configuration details.

Key configuration variable:

  • NVRX_NODE_LOCAL_TMPDIR: Set to enable node local temporary logging with aggregation

For advanced configuration options, environment variables, and troubleshooting, refer to the Configuration Reference.

Basic Usage

Set up the logger at the start of your program:

from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger
import logging

# Set up logging
logger = setup_logger()

# Get the configured logger
log = logging.getLogger("nvrx")

# Use throughout your code
log.info("Training started")
log.warning("GPU memory usage high")
log.error("Rank 0 failed")

Node Local Temporary Logging Setup

For workloads that need node-local temporary logging, set the environment variable:

export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs

Or in your SLURM script:

#!/bin/bash
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs

srun python your_training_script.py

⚠️ Critical Filesystem Warning: The temporary directory experiences high write throughput from all ranks on each node. Use local node storage (e.g., /tmp, /scratch, local SSDs) and avoid network filesystems such as NFS or Lustre, or any other storage accessed over the network.

The logger automatically handles:

  • Temporary log file creation for each rank

  • Log aggregation from all ranks on each node

  • Per-node log file writing

  • Log rotation and cleanup

Advanced Configuration

Force logger reconfiguration for subprocesses:

logger = setup_logger(force_reset=True)
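
For example, a worker process can re-run setup in its entry point. This is a minimal sketch; the spawn-based worker and its function name are illustrative:

import logging
import multiprocessing as mp

from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger

def worker():
    # Reconfigure logging in the child, which does not inherit the
    # parent's aggregator thread.
    setup_logger(force_reset=True)
    logging.getLogger("nvrx").info("Hello from the subprocess")

if __name__ == "__main__":
    setup_logger()
    p = mp.get_context("spawn").Process(target=worker)
    p.start()
    p.join()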

Log formatting automatically includes:

  • Timestamp, log level, node ID

  • Workload and infrastructure rank information

  • Source file and line number

Example Output Format

2024-01-15 10:30:45,123 [INFO] [node001] [workload:0(0) infra:0(0)] training.py:45 Training started
2024-01-15 10:30:46,456 [WARNING] [node001] [workload:0(0) infra:0(0)] training.py:67 GPU memory usage high
2024-01-15 10:30:47,789 [ERROR] [node001] [workload:0(0) infra:0(0)] training.py:89 Rank 0 failed

Integration with Other NVRx Components

The logger automatically integrates with these NVRx components:

  • Fault Tolerance: Automatic logging of restart events and health checks

  • In-Process Restart: Logging of restart boundaries and process state

  • Health Check: Logging of system health monitoring events

Note: Checkpointing and Straggler Detection components use their own logging mechanisms and do not integrate with the NVRx logger.

Best Practices

  1. Setup Once: Call setup_logger() once at the start of your main program

  2. Use Standard Logger: Access via logging.getLogger("nvrx") in other modules

  3. Environment Configuration: Use environment variables rather than hardcoding

  4. Subprocess Handling: Use force_reset=True for subprocesses

  5. Filesystem Selection: Use local node storage, avoid network filesystems (NFS, Lustre)
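
Practices 1 and 2 combine into the pattern sketched below; the function name is illustrative, and the two halves would normally live in separate files:

# In your entry point, set up once:
from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger
import logging

setup_logger()

# In any other module, fetch the same logger by name; no setup needed:
log = logging.getLogger("nvrx")

def load_dataset():
    log.info("Loading dataset")

load_dataset()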

Troubleshooting

Common Issues:

  • Logs not appearing: Check that NVRX_NODE_LOCAL_TMPDIR is set and the directory is writable

  • Missing rank info: Ensure the RANK/LOCAL_RANK (or SLURM equivalent) environment variables are set

  • Performance issues: Monitor directory size, adjust file limits, and verify the filesystem choice (avoid NFS/Lustre)
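
A quick check for the first two issues (a minimal diagnostic sketch; it inspects only the environment, not the logger itself):

import os

# Check node local temporary logging configuration.
tmpdir = os.environ.get("NVRX_NODE_LOCAL_TMPDIR")
if tmpdir is None:
    print("NVRX_NODE_LOCAL_TMPDIR is not set: regular (stderr/stdout) mode")
elif not os.access(tmpdir, os.W_OK):
    print(f"NVRX_NODE_LOCAL_TMPDIR={tmpdir} is not writable")
else:
    print(f"node local temporary logging enabled in {tmpdir}")

# Check rank environment variables.
for var in ("RANK", "LOCAL_RANK", "SLURM_PROCID", "SLURM_LOCALID"):
    print(f"{var}={os.environ.get(var, '<unset>')}")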