Configuration Reference

This is a comprehensive reference for all NVRx Logger configuration options, examples, and usage guides.

Note

For detailed API documentation, class methods, and function signatures, see API Reference.

Environment Variables

Complete Environment Variables Reference
Variable	Type	Default	Description
`NVRX_NODE_LOCAL_TMPDIR`	string	None	Directory for temporary log files. When set, enables node local temporary logging mode.
`NVRX_LOG_DEBUG`	string	INFO	Set to “1”, “true”, “yes”, or “on” to enable DEBUG level logging.
`NVRX_LOG_TO_STDOUT`	string	stderr	Set to “1” to log to stdout instead of stderr.
`NVRX_LOG_MAX_FILE_SIZE_KB`	integer	10240 KB (10 MB)	Maximum size of temporary message files in KB before rotation.
`NVRX_LOG_MAX_LOG_FILES`	integer	4	Maximum number of log files to keep per rank.

Python API Parameters

setup_logger Function Parameters
Parameter	Type	Default	Description
`node_local_tmp_dir`	string	None	Custom temporary directory path. Overrides NVRX_NODE_LOCAL_TMPDIR.
`force_reset`	boolean	False	Force reconfiguration even if logger is already configured.
`node_local_tmp_prefix`	string	None	Custom prefix for log files in distributed mode.

LogManager Constructor Parameters

LogManager Class Parameters
Parameter	Type	Default	Description
`node_local_tmp_dir`	string	None	Directory path for temporary log files.
`node_local_tmp_prefix`	string	None	Prefix for log files in distributed mode.

Configuration Examples

Basic Configuration

# Enable node local temporary logging
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs

# Enable debug logging
export NVRX_LOG_DEBUG=1

Advanced Configuration

# Complete configuration
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
export NVRX_LOG_DEBUG=1
export NVRX_LOG_TO_STDOUT=1
export NVRX_LOG_MAX_FILE_SIZE_KB=10240
export NVRX_LOG_MAX_LOG_FILES=10

Python Configuration

from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger

# Custom configuration
logger = setup_logger(
    node_local_tmp_dir="/custom/logs",
    node_local_tmp_prefix="mytraining",
    force_reset=True
)

SLURM Integration

#!/bin/bash
#SBATCH --job-name=nvrx_training
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8

# NVRx Logger Configuration
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs_${SLURM_JOB_ID}
export NVRX_LOG_DEBUG=1
export NVRX_LOG_MAX_FILE_SIZE_KB=10240

# Launch training
srun python training_script.py

Docker Integration

# Dockerfile
FROM nvcr.io/nvidia/pytorch:24.01-py3

# Install NVRx
RUN pip install nvidia-resiliency-ext

# Set default logging configuration
ENV NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
ENV NVRX_LOG_DEBUG=1
ENV NVRX_LOG_MAX_FILE_SIZE_KB=10240

Kubernetes Integration

# kubernetes-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvrx-training
spec:
  template:
    spec:
      containers:
      - name: training
        image: nvrx-training:latest
        env:
        - name: NVRX_NODE_LOCAL_TMPDIR
          value: "/tmp/nvrx_logs"
        - name: NVRX_LOG_DEBUG
          value: "1"
        - name: NVRX_LOG_MAX_FILE_SIZE_KB
          value: "10240"

Configuration Precedence

Python API parameters (highest priority)
Environment variables
Default values (lowest priority)

Example: - If you set NVRX_NODE_LOCAL_TMPDIR=/tmp/env_logs in environment - And call setup_logger(node_local_tmp_dir=”/tmp/api_logs”) - The API parameter /tmp/api_logs will be used

Best Practices

✅ Do: - Set NVRX_NODE_LOCAL_TMPDIR for node local temporary logging - Use job-specific directories (e.g., /tmp/nvrx_logs_${SLURM_JOB_ID}) - Enable debug logging during development - Use appropriate file size limits for your workload

❌ Don’t: - Use system-critical directories (e.g., /var/log) - Use network filesystems (e.g., NFS) that cannot handle high write throughput from multiple nodes - Set extremely large file size limits - Keep too many log files (can fill disk) - Mix different logging configurations in the same job

Filesystem Selection

Critical Consideration: The temporary directory for distributed logging experiences high write throughput from all ranks on each node. Choose your filesystem carefully:

Recommended Filesystems: - Local node storage: /tmp, /scratch, local SSDs - Local NVMe storage: Fastest option for high-throughput logging

Avoid These Filesystems: - NFS: Cannot handle concurrent writes from multiple processes efficiently - Lustre (LFS): Network filesystem that may have performance limitations for high-frequency small writes

Performance Impact: - Poor filesystem choice can significantly slow down your training - Logging overhead should be minimal (< 1% of training time) - Test filesystem performance before production deployment

Troubleshooting

Common Issues:

Troubleshooting Guide
Issue	Solution
Logs not appearing	Check NVRX_NODE_LOCAL_TMPDIR is set and writable
Permission denied	Ensure directory has proper write permissions
Disk space issues	Reduce NVRX_LOG_MAX_FILE_SIZE_KB or NVRX_LOG_MAX_LOG_FILES
Missing rank info	Verify RANK and LOCAL_RANK environment variables are set
Performance issues	Monitor temporary directory size and adjust limits
Slow logging performance	Check filesystem type (avoid NFS, Lustre, or network storage, use local storage)

Debug Mode: Enable debug logging to see detailed configuration information:

export NVRX_LOG_DEBUG=1
python your_script.py

This will show: - Current configuration values - Directory creation status - Rank detection results - Log handler setup details

Quick API Reference

For developers who need quick access to the most commonly used API methods:

Common LogConfig Methods
Method	Description
`get_node_local_tmp_dir()`	Returns the configured temporary directory path or None if not set
`get_log_level()`	Returns the configured log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
`get_max_file_size()`	Returns the maximum file size in bytes for log rotation
`get_max_log_files()`	Returns the maximum number of log files to keep
`get_workload_rank()`	Returns the workload rank from RANK environment variable
`get_workload_local_rank()`	Returns the workload local rank from LOCAL_RANK environment variable

Common LogManager Properties
Property	Description
`node_local_tmp_logging_enabled`	Boolean indicating whether node local temporary logging is enabled
`workload_rank`	Integer representing the workload rank
`workload_local_rank`	Integer representing the workload local rank
`logger`	The configured Python logging.Logger instance

Note

For complete API documentation including all methods, properties, and detailed signatures, see API Reference.