API Reference

This section provides detailed API documentation for the NVRx Shared Utilities, focusing on the logging system.

Note

For configuration options, environment variables, examples, and usage guides, see Configuration Reference.

Log Manager

NVRx Logger for Large-Scale LLM Training

This module provides a simple and efficient log manager that supports both regular logging and distributed logging for large-scale training with thousands of GPUs. The design automatically adapts based on environment configuration.

Key Design Principles:

  • Environment-driven behavior: NVRX_NODE_LOCAL_TMPDIR controls distributed vs. regular logging

  • Per-node aggregation: when distributed logging is enabled, a separate aggregator service performs the log aggregation

  • Dynamic rank detection: rank information is read automatically from environment variables

  • Scalable: works with 3K+ GPUs without overwhelming the logging infrastructure

  • Fork-safe: all ranks use file-based messaging so that child processes can log

  • Subprocess-safe: supports force_reset=True for fresh logger setup in subprocesses

  • Service-based aggregation: the aggregator can run as a separate service for reliable log collection

Features:

  • Dual-mode operation: regular logging (stderr/stdout) or distributed logging (file aggregation)

  • Per-node log files when distributed logging is enabled (e.g., node_hostname.log)

  • Automatic rank and node identification in log messages

  • Thread-safe logging with proper synchronization

  • Environment variable configuration for easy deployment

  • Fork-safe design with file-based message passing for all ranks

  • Separate aggregator service that can run independently of the training processes

  • Configurable temp directory for pending message files

Environment Variables:

  • NVRX_LOG_DEBUG: set to "1", "true", "yes", or "on" to enable DEBUG-level logging (default: INFO)

  • NVRX_LOG_TO_STDOUT: set to "1" to log to stdout instead of stderr

  • NVRX_NODE_LOCAL_TMPDIR: directory for temporary log files

  • NVRX_LOG_MAX_FILE_SIZE_KB: maximum size of temporary message files in KB before rotation (default: 10240 KB = 10 MB)

  • NVRX_LOG_MAX_LOG_FILES: maximum number of log files to keep per rank (default: 4)

Note: File rotation is designed to be safe for the aggregator service. When files are rotated, the aggregator will automatically read from both current and backup files to ensure no messages are lost.
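For illustration, a minimal sketch of setting these variables programmatically before logger setup; the values below are examples, not defaults:

    import os

    # Set before calling setup_logger(); the values here are illustrative.
    os.environ["NVRX_LOG_DEBUG"] = "1"                  # enable DEBUG-level logging
    os.environ["NVRX_NODE_LOCAL_TMPDIR"] = "/tmp/nvrx"  # enable distributed logging
    os.environ["NVRX_LOG_MAX_FILE_SIZE_KB"] = "10240"   # rotate temp files at 10 MB

    from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger
    logger = setup_logger()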

Usage:

    # In main script (launcher.py)
    from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger
    logger = setup_logger()  # Call once at startup

    # In other modules
    import logging
    from nvidia_resiliency_ext.shared_utils.log_manager import LogConfig
    logger = logging.getLogger(LogConfig.name)
    logger.info("Training started")
    logger.debug("Debug information")
    logger.error("Error occurred")
    logger.warning("Warning message")
    logger.critical("Critical error")

Forking Support:

The logger is designed to work safely with process forking. When using fork():

    # In parent process
    import logging
    import os
    from nvidia_resiliency_ext.shared_utils.log_manager import LogConfig, setup_logger

    logger = setup_logger()  # Setup before forking
    logger.info("Parent process logging")

    # Fork child process
    pid = os.fork()
    if pid == 0:
        # In child process - logger will work normally
        logger = logging.getLogger(LogConfig.name)
        logger.info("Child process logging")
    else:
        # Parent continues normally
        logger.info("Parent continues")

All ranks use file-based message passing, ensuring child processes can log even when they don’t inherit the aggregator thread from the parent.
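The same guarantee is useful for spawned subprocesses. Below is a minimal sketch, assuming a spawn-based child that calls setup_logger(force_reset=True) to get a fresh configuration, as described in the setup_logger() documentation:

    import multiprocessing as mp
    from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger

    def child():
        # Spawned children start a fresh interpreter and do not inherit
        # the parent's logger setup, so request a fresh configuration.
        logger = setup_logger(force_reset=True)
        logger.info("Child process logging")

    if __name__ == "__main__":
        logger = setup_logger()
        logger.info("Parent process logging")
        p = mp.get_context("spawn").Process(target=child)
        p.start()
        p.join()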

Separate aggregator service: see log_aggregator.py for details.

class nvidia_resiliency_ext.shared_utils.log_manager.LogConfig

Bases: object

Utility class for log configuration.

classmethod get_infra_local_rank()

classmethod get_infra_rank()

classmethod get_log_file()

classmethod get_log_level()

classmethod get_log_to_stdout_cfg()

    Return type: bool

classmethod get_max_file_size(file_size_kb=None)

    Return type: int

classmethod get_max_log_files()

    Return type: int

classmethod get_node_id()

classmethod get_node_local_tmp_dir(node_local_tmp_dir=None)

classmethod get_process_name(proc_name=None)

    Parameters: proc_name (str | None)

    Return type: str

classmethod get_workload_local_rank()

classmethod get_workload_rank()

name = 'nvrx'
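As an illustrative sketch, the LogConfig classmethods can be used to inspect the effective configuration; pairing each getter with an environment variable in the comments is an assumption based on the variable list above:

    from nvidia_resiliency_ext.shared_utils.log_manager import LogConfig

    level = LogConfig.get_log_level()              # assumed: driven by NVRX_LOG_DEBUG
    to_stdout = LogConfig.get_log_to_stdout_cfg()  # assumed: driven by NVRX_LOG_TO_STDOUT
    tmp_dir = LogConfig.get_node_local_tmp_dir()   # assumed: driven by NVRX_NODE_LOCAL_TMPDIR
    print(level, to_stdout, tmp_dir, LogConfig.name)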
class nvidia_resiliency_ext.shared_utils.log_manager.LogManager(node_local_tmp_dir=None, node_local_tmp_prefix=None)

Bases: object

Log manager for large-scale LLM training.

Supports both regular logging and node-local temporary logging. When node-local temporary logging is enabled (NVRX_NODE_LOCAL_TMPDIR is set), each node logs independently to avoid overwhelming centralized logging systems. Local rank 0 acts as the node aggregator, collecting logs from all ranks on the same node and writing them to a per-node log file.

Fork-safe: child processes automatically disable aggregation to avoid conflicts. Service-based: the aggregator can run as a separate service for reliable log collection.

Parameters:
  • node_local_tmp_dir (str | None)

  • node_local_tmp_prefix (str)

property infra_local_rank: int | None

Get the infrastructure local rank (from SLURM_LOCALID env var).

property infra_rank: int | None

Get the infrastructure rank (from SLURM_PROCID env var).

property logger: Logger

Get the distributed logger instance.

This property provides direct access to the underlying logger, allowing users to use all standard logging methods:

  • logger.debug(message)

  • logger.info(message)

  • logger.warning(message)

  • logger.error(message)

  • logger.critical(message)

property node_local_tmp_logging_enabled: bool

Check if node local temporary logging is enabled.

property workload_local_rank: int | None

Get the workload local rank (from LOCAL_RANK env var).

property workload_rank: int | None

Get the workload rank (from RANK env var).
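For illustration, a minimal sketch of constructing a LogManager directly and reading its documented properties; most programs should call setup_logger() instead:

    from nvidia_resiliency_ext.shared_utils.log_manager import LogManager

    lm = LogManager()  # reads NVRX_NODE_LOCAL_TMPDIR if set
    lm.logger.info("Hello from LogManager")
    print(lm.node_local_tmp_logging_enabled)         # True when NVRX_NODE_LOCAL_TMPDIR is set
    print(lm.workload_rank, lm.workload_local_rank)  # from RANK / LOCAL_RANK, else None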

nvidia_resiliency_ext.shared_utils.log_manager.setup_logger(node_local_tmp_dir=None, force_reset=False, node_local_tmp_prefix=None)

Setup the distributed logger.

This function configures the standard Python logger "nvrx" with appropriate handlers for distributed logging. It is safe to call multiple times: if the logger is already configured, it won't be reconfigured unless force_reset=True.

The expectation is that this function is called once at the start of the program, and the logger is then used throughout the program, i.e., it is a singleton.

The logger automatically adapts to distributed or regular mode based on whether NVRX_NODE_LOCAL_TMPDIR is set. If set, enables distributed logging with aggregation. If not set, logs go directly to stderr/stdout.

The logger is fork-safe: all ranks use file-based message passing to ensure child processes can log even when they don’t inherit the aggregator thread.

Parameters:
  • node_local_tmp_dir – Optional directory path for temporary files. If None, uses NVRX_NODE_LOCAL_TMPDIR env var.

  • force_reset – If True, force reconfiguration even if logger is already configured. Useful for subprocesses that need fresh logger setup.

  • node_local_tmp_prefix (str | None)

Returns: Configured logger instance

Return type: logging.Logger

Example

    # In main script (launcher.py) or training subprocess
    from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger
    logger = setup_logger()

    # In subprocesses that need fresh logger setup
    logger = setup_logger(force_reset=True)

    # In other modules
    import logging
    from nvidia_resiliency_ext.shared_utils.log_manager import LogConfig
    logger = logging.getLogger(LogConfig.name)
    logger.info("Some message")

Log Aggregator

NVRx Log Aggregator Service

This module provides a standalone log aggregator service that can run independently of training processes. The service monitors a node-local temporary directory, accessible to all training processes on the same node, and aggregates their log messages into per-node log files stored on a shared filesystem (e.g., Lustre or NFS).

Example sbatch Usage:

    # For PyPI installation:
    export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx
    # Call the Python module directly
    srun bash -c '
        if [[ ${SLURM_LOCALID:-0} -eq 0 ]]; then
            python -m nvidia_resiliency_ext.shared_utils.log_aggregator --wait-file "${WAIT_FILE}" --log-dir "${AGG_DIR}" &
        fi
        $LAUNCHER_CMD $LAUNCHER_ARGS $WORKLOAD_CMD $WORKLOAD_ARGS
        touch "${WAIT_FILE}"
    '

    # For source installation:
    export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx
    NVRX_REPO=/../nvidia-resiliency-ext:/nvrx_repo

    # All-node setup, if installing from source
    srun bash -c '
        echo "export NVRX_NODE_LOCAL_TMPDIR=$NVRX_NODE_LOCAL_TMPDIR" >> /tmp/.myenv_${SLURM_JOB_ID}.sh
        cd /nvrx_repo && pip install -e .
    '

    # Main workload with aggregator
    srun bash -c '
        source /tmp/.myenv_${SLURM_JOB_ID}.sh
        if [[ $SLURM_LOCALID -eq 0 ]]; then
            cd /nvrx_repo && PYTHONPATH=./src:$PYTHONPATH python src/nvidia_resiliency_ext/shared_utils/log_aggregator.py --wait-file ./stop --log-dir /logs/slurm/${SLURM_JOB_ID} &
        fi
        $LAUNCHER_CMD $LAUNCHER_ARGS $WORKLOAD_CMD $WORKLOAD_ARGS
        touch /nvrx_repo/stop
    '

nvidia_resiliency_ext.shared_utils.log_aggregator.main()

Main function for running the log aggregator as a separate service.
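As a sketch, the service can also be launched from Python on local rank 0; the paths below are hypothetical, and the --wait-file and --log-dir flags match the sbatch examples above:

    import os
    import subprocess

    # Start the aggregator service on local rank 0 only (hypothetical paths).
    if int(os.environ.get("SLURM_LOCALID", "0")) == 0:
        aggregator = subprocess.Popen([
            "python", "-m", "nvidia_resiliency_ext.shared_utils.log_aggregator",
            "--wait-file", "/tmp/nvrx/stop",
            "--log-dir", "/logs/my_job",
        ])

    # ... run the workload, then create the wait file to signal shutdown:
    # open("/tmp/nvrx/stop", "w").close()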

Log Configuration

Core Classes

LogManager

See the LogManager class reference in the Log Manager section above.

LogConfig

See the LogConfig class reference in the Log Manager section above.

NodeLogAggregator

class nvidia_resiliency_ext.shared_utils.log_aggregator.NodeLogAggregator(log_dir, temp_dir, log_file, max_file_size, en_chrono_ord)

Bases: object

Parameters:
  • log_dir (str)

  • temp_dir (str)

  • log_file (str)

  • max_file_size (int)

  • en_chrono_ord (bool)

__init__(log_dir, temp_dir, log_file, max_file_size, en_chrono_ord)

shutdown()

start_aggregator()

    Start the log aggregator thread.
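A minimal construction sketch follows; the parameter semantics in the comments (byte units for max_file_size, chronological ordering for en_chrono_ord) are assumptions based on the names, and the paths and file name are hypothetical:

    from nvidia_resiliency_ext.shared_utils.log_aggregator import NodeLogAggregator

    aggregator = NodeLogAggregator(
        log_dir="/logs/my_job",          # destination for per-node log files (shared FS)
        temp_dir="/tmp/nvrx",            # node-local dir watched for pending messages
        log_file="node0.log",            # per-node log file name
        max_file_size=10 * 1024 * 1024,  # assumed bytes; matches the 10 MB default
        en_chrono_ord=True,              # assumed: emit messages in chronological order
    )
    aggregator.start_aggregator()        # runs aggregation in a background thread
    # ... workload runs and ranks write messages to temp_dir ...
    aggregator.shutdown()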

Core Functions

setup_logger

See the setup_logger() function reference in the Log Manager section above.

Quick Reference

For quick access to configuration options and environment variables, see the Configuration Reference page which contains:

  • Complete environment variables reference

  • Configuration examples and integration guides

  • Best practices and troubleshooting

  • Performance considerations and filesystem selection