NVRx Logger Guide
=================

The NVRx Logger is a logging system designed for multi-node, multi-rank training workloads. It provides log aggregation, rank-aware formatting, and automatic adaptation between two logging modes:

* **Regular Mode**: Logs go directly to stderr/stdout (standard Python logging)
* **Node Local Temporary Mode**: Each rank writes logs to temporary files on local node storage, then local rank 0 aggregates and writes them to a per-node log file

**What "Distributed Logging" Actually Means:**

The term "distributed logging" in this context refers to the fact that logs are collected from multiple ranks/processes across multiple nodes, but the actual storage and aggregation happen locally on each node. This is different from traditional distributed logging systems that send all logs to a central location over the network. The NVRx approach keeps logging local to each node for better performance and reliability.

Key Features
------------

* **Node Local Temporary Logging**: When enabled, each rank writes logs to temporary files on local node storage, avoiding network filesystem bottlenecks
* **Automatic Log Aggregation**: Local rank 0 acts as the node aggregator, collecting logs from all ranks on the same node and writing them to a single per-node log file
* **Environment-driven Behavior**: Automatically adapts between regular logging (stderr/stdout) and node local temporary logging based on configuration
* **Fork-safe Design**: All ranks use file-based message passing to ensure child processes can log even when they don't inherit the aggregator thread
* **Dynamic Rank Detection**: Automatically reads rank information from environment variables (``RANK``, ``LOCAL_RANK``, ``SLURM_PROCID``, ``SLURM_LOCALID``)

Architecture
------------

The logger operates in two modes:

**Regular Mode** (default)
    Logs go directly to stderr/stdout. This is the standard Python logging behavior.

**Node Local Temporary Mode** (when ``NVRX_NODE_LOCAL_TMPDIR`` is set)
    Each rank writes logs to temporary files on local node storage (e.g., ``/tmp``, ``/scratch``, local SSDs). Local rank 0 aggregates these logs and writes them to a single per-node log file. This approach avoids network filesystem bottlenecks and provides better performance for high-throughput logging scenarios.

Configuration
-------------

The logger is configured through environment variables.

Key configuration variable:

- ``NVRX_NODE_LOCAL_TMPDIR``: Set to enable node local temporary logging with aggregation

For advanced configuration options, the full list of environment variables, and troubleshooting, see :doc:`config_reference`.

Basic Usage
-----------

Set up the logger at the start of your program:

.. code-block:: python

    import logging

    from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger

    # Set up logging
    logger = setup_logger()

    # Get the configured logger
    log = logging.getLogger("nvrx")

    # Use throughout your code
    log.info("Training started")
    log.warning("GPU memory usage high")
    log.error("Rank 0 failed")

Node Local Temporary Logging Setup
----------------------------------

For workloads that need node-local temporary logging, set the environment variable:

.. code-block:: bash

    export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs

Or in your SLURM script:

.. code-block:: bash

    #!/bin/bash
    export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
    srun python your_training_script.py

**⚠️ Critical Filesystem Warning**: The temporary directory experiences high write throughput from all ranks on each node. Use local node storage (e.g., ``/tmp``, ``/scratch``, local SSDs) and avoid network filesystems like NFS, Lustre (LFS), or any storage accessed over the network.

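The setup steps above can be checked before training starts. The following is a minimal sketch, not part of NVRx: ``check_node_local_tmpdir`` is a hypothetical helper that reports which mode the environment implies and confirms that the directory is writable by the current process. It assumes that an unset or empty variable means regular mode and that creating the directory up front is acceptable; this guide does not state whether NVRx creates the directory itself. It does not detect whether the path is on a network filesystem.

```python
import os
import tempfile


def check_node_local_tmpdir(env_var: str = "NVRX_NODE_LOCAL_TMPDIR") -> str:
    """Hypothetical pre-flight check (not an NVRx API).

    Returns "regular" when env_var is unset or empty (an assumption),
    otherwise "node-local" after confirming the directory is writable.
    Raises OSError if the directory cannot be created or written to.
    """
    path = os.environ.get(env_var)
    if not path:
        return "regular"

    # Assumption: creating the directory ahead of time is harmless.
    os.makedirs(path, exist_ok=True)

    # Create and immediately delete a probe file to prove write access.
    with tempfile.NamedTemporaryFile(dir=path, prefix="probe_"):
        pass
    return "node-local"
```

Calling ``check_node_local_tmpdir()`` once per process before ``setup_logger()`` surfaces permission problems as an ``OSError`` instead of as silently missing logs.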
The logger automatically handles:

- Temporary log file creation for each rank
- Log aggregation from all ranks on each node
- Per-node log file writing
- Log rotation and cleanup

Advanced Configuration
----------------------

Force logger reconfiguration for subprocesses:

.. code-block:: python

    logger = setup_logger(force_reset=True)

Log formatting automatically includes:

- Timestamp, log level, node ID
- Workload and infrastructure rank information
- Source file and line number

Example Output Format
---------------------

.. code-block:: text

    2024-01-15 10:30:45,123 [INFO] [node001] [workload:0(0) infra:0(0)] training.py:45 Training started
    2024-01-15 10:30:46,456 [WARNING] [node001] [workload:0(0) infra:0(0)] training.py:67 GPU memory usage high
    2024-01-15 10:30:47,789 [ERROR] [node001] [workload:0(0) infra:0(0)] training.py:89 Rank 0 failed

Integration with Other NVRx Components
--------------------------------------

The logger automatically integrates with these NVRx components:

- **Fault Tolerance**: Automatic logging of restart events and health checks
- **In-Process Restart**: Logging of restart boundaries and process state
- **Health Check**: Logging of system health monitoring events

**Note**: Checkpointing and Straggler Detection components use their own logging mechanisms and do not integrate with the NVRx logger.

Best Practices
--------------

1. **Setup Once**: Call ``setup_logger()`` once at the start of your main program
2. **Use Standard Logger**: Access via ``logging.getLogger("nvrx")`` in other modules
3. **Environment Configuration**: Use environment variables rather than hardcoding
4. **Subprocess Handling**: Use ``force_reset=True`` for subprocesses
5. **Filesystem Selection**: Use local node storage, avoid network filesystems (NFS, Lustre)

Troubleshooting
---------------

**Common Issues:**

- **Logs not appearing**: Check ``NVRX_NODE_LOCAL_TMPDIR`` is set and directory is writable
- **Missing rank info**: Ensure RANK/LOCAL_RANK environment variables are set
- **Performance issues**: Monitor directory size, adjust file limits, verify filesystem choice (avoid NFS/Lustre)
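
When rank information is missing from log lines, it can help to print what the process actually sees in its environment. The sketch below is a hypothetical diagnostic, not the NVRx implementation: ``detect_ranks`` reads the four variables named in this guide and assumes that ``RANK``/``LOCAL_RANK`` take precedence over ``SLURM_PROCID``/``SLURM_LOCALID``. The real precedence used by the logger may differ.

```python
import os
from typing import Mapping, Optional


def detect_ranks(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical diagnostic (not an NVRx API).

    Assumption: RANK/LOCAL_RANK win over SLURM_PROCID/SLURM_LOCALID.
    Values that are missing or not non-negative integers become None.
    """
    env = os.environ if env is None else env

    def first_int(*names: str) -> Optional[int]:
        # Return the first variable that holds a non-negative integer.
        for name in names:
            value = env.get(name)
            if value is not None and value.isdigit():
                return int(value)
        return None

    return {
        "rank": first_int("RANK", "SLURM_PROCID"),
        "local_rank": first_int("LOCAL_RANK", "SLURM_LOCALID"),
    }


if __name__ == "__main__":
    print(detect_ranks())
```

A result of ``None`` for either key suggests the launcher did not export the corresponding variables to this process.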