Configuration Reference

This page is a reference for the NVRx Logger configuration options, with configuration examples and usage guidance.

Note

For detailed API documentation, class methods, and function signatures, see API Reference.

Environment Variables

Complete Environment Variables Reference

| Variable | Type | Default | Description |
|---|---|---|---|
| NVRX_NODE_LOCAL_TMPDIR | string | None | Directory for temporary log files. When set, enables node local temporary logging mode. |
| NVRX_LOG_DEBUG | string | unset (INFO level) | Set to “1”, “true”, “yes”, or “on” to enable DEBUG level logging. |
| NVRX_LOG_TO_STDOUT | string | unset (logs to stderr) | Set to “1” to log to stdout instead of stderr. |
| NVRX_LOG_MAX_FILE_SIZE_KB | integer | 10240 KB (10 MB) | Maximum size of temporary message files in KB before rotation. |
| NVRX_LOG_MAX_LOG_FILES | integer | 4 | Maximum number of log files to keep per rank. |
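To make the table concrete, here is a minimal sketch of how the documented variables and defaults could be read with the standard library. It is not the NVRx implementation: the function name `read_logger_env`, the returned dictionary keys, the case-insensitive matching of the truthy strings, and treating an empty value as unset are all assumptions made for illustration.

```python
import os

# Truthy strings listed in the table for NVRX_LOG_DEBUG.
TRUTHY = {"1", "true", "yes", "on"}


def read_logger_env(env=None):
    """Collect the documented NVRX_* settings from a mapping (os.environ by default).

    Hypothetical helper: it only mirrors the defaults documented above.
    """
    env = os.environ if env is None else env
    # Assumption: matching is case-insensitive and ignores surrounding whitespace.
    debug = env.get("NVRX_LOG_DEBUG", "").strip().lower() in TRUTHY
    to_stdout = env.get("NVRX_LOG_TO_STDOUT", "").strip() == "1"
    return {
        # Assumption: an empty string is treated the same as an unset variable.
        "node_local_tmp_dir": env.get("NVRX_NODE_LOCAL_TMPDIR") or None,
        "log_level": "DEBUG" if debug else "INFO",
        "stream": "stdout" if to_stdout else "stderr",
        "max_file_size_kb": int(env.get("NVRX_LOG_MAX_FILE_SIZE_KB", "10240")),
        "max_log_files": int(env.get("NVRX_LOG_MAX_LOG_FILES", "4")),
    }
```

Passing a plain dictionary instead of `os.environ` makes it easy to check what a given job environment would resolve to before launching it.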

Python API Parameters

setup_logger Function Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| node_local_tmp_dir | string | None | Custom temporary directory path. Overrides NVRX_NODE_LOCAL_TMPDIR. |
| force_reset | boolean | False | Force reconfiguration even if logger is already configured. |
| node_local_tmp_prefix | string | None | Custom prefix for log files in distributed mode. |

LogManager Constructor Parameters

LogManager Class Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| node_local_tmp_dir | string | None | Directory path for temporary log files. |
| node_local_tmp_prefix | string | None | Prefix for log files in distributed mode. |

Configuration Examples

Basic Configuration

# Enable node local temporary logging
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs

# Enable debug logging
export NVRX_LOG_DEBUG=1

Advanced Configuration

# Complete configuration
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
export NVRX_LOG_DEBUG=1
export NVRX_LOG_TO_STDOUT=1
export NVRX_LOG_MAX_FILE_SIZE_KB=10240
export NVRX_LOG_MAX_LOG_FILES=10

Python Configuration

from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger

# Custom configuration
logger = setup_logger(
    node_local_tmp_dir="/custom/logs",
    node_local_tmp_prefix="mytraining",
    force_reset=True
)

SLURM Integration

#!/bin/bash
#SBATCH --job-name=nvrx_training
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8

# NVRx Logger Configuration
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs_${SLURM_JOB_ID}
export NVRX_LOG_DEBUG=1
export NVRX_LOG_MAX_FILE_SIZE_KB=10240

# Launch training
srun python training_script.py

Docker Integration

# Dockerfile
FROM nvcr.io/nvidia/pytorch:24.01-py3

# Install NVRx
RUN pip install nvidia-resiliency-ext

# Set default logging configuration
ENV NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
ENV NVRX_LOG_DEBUG=1
ENV NVRX_LOG_MAX_FILE_SIZE_KB=10240

Kubernetes Integration

# kubernetes-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvrx-training
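spec:
  # A Deployment requires a selector that matches the pod template labels.
  # The label value below is illustrative.
  selector:
    matchLabels:
      app: nvrx-training
  template:
    metadata:
      labels:
        app: nvrx-training
    spec:
      containers: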
      - name: training
        image: nvrx-training:latest
        env:
        - name: NVRX_NODE_LOCAL_TMPDIR
          value: "/tmp/nvrx_logs"
        - name: NVRX_LOG_DEBUG
          value: "1"
        - name: NVRX_LOG_MAX_FILE_SIZE_KB
          value: "10240"

Configuration Precedence

  1. Python API parameters (highest priority)

  2. Environment variables

  3. Default values (lowest priority)

Example:

- If you set NVRX_NODE_LOCAL_TMPDIR=/tmp/env_logs in the environment
- And call setup_logger(node_local_tmp_dir="/tmp/api_logs")
- The API parameter /tmp/api_logs will be used
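The precedence order can be sketched as a small resolution function. This is an illustration of the rule stated above, not the library's code; the name `resolve_tmp_dir` and its parameters are hypothetical.

```python
import os


def resolve_tmp_dir(api_value=None, env=None, default=None):
    """Pick the temporary directory using the documented precedence.

    Hypothetical helper illustrating: API parameter > environment > default.
    """
    env = os.environ if env is None else env
    # 1. An explicit Python API parameter wins.
    if api_value is not None:
        return api_value
    # 2. Otherwise fall back to the environment variable, if it is set.
    env_value = env.get("NVRX_NODE_LOCAL_TMPDIR")
    if env_value:
        return env_value
    # 3. Otherwise use the default (None, per the tables above).
    return default


env = {"NVRX_NODE_LOCAL_TMPDIR": "/tmp/env_logs"}
print(resolve_tmp_dir("/tmp/api_logs", env))  # API parameter is used
print(resolve_tmp_dir(None, env))             # environment variable is used
```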

Best Practices

Do:

- Set NVRX_NODE_LOCAL_TMPDIR for node local temporary logging
- Use job-specific directories (e.g., /tmp/nvrx_logs_${SLURM_JOB_ID})
- Enable debug logging during development
- Use appropriate file size limits for your workload

Don’t:

- Use system-critical directories (e.g., /var/log)
- Use network filesystems (e.g., NFS) that cannot handle high write throughput from multiple nodes
- Set extremely large file size limits
- Keep too many log files (can fill disk)
- Mix different logging configurations in the same job
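When choosing file size and file count limits, it can help to estimate the worst-case disk usage per node. The sketch below is a rough upper bound under stated assumptions: every rank keeps the maximum number of files, every file reaches the size limit, and rotation enforces both limits strictly. The function name is hypothetical, and the 8 ranks per node are taken from the `--ntasks-per-node=8` setting in the SLURM example.

```python
def max_log_footprint_kb(max_file_size_kb=10240, max_log_files=4, ranks_per_node=8):
    """Upper-bound estimate of temporary log usage on one node, in KB."""
    return max_file_size_kb * max_log_files * ranks_per_node


footprint = max_log_footprint_kb()
# With the documented defaults: 10240 KB x 4 files x 8 ranks
print(f"{footprint} KB = {footprint / 1024:.0f} MB per node")
# prints "327680 KB = 320 MB per node"
```

Raising NVRX_LOG_MAX_LOG_FILES to 10, as in the advanced configuration example, raises the same bound to 800 MB per node.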

Filesystem Selection

Critical Consideration: The temporary directory for distributed logging experiences high write throughput from all ranks on each node. Choose your filesystem carefully:

Recommended Filesystems:

- Local node storage: /tmp, /scratch, local SSDs
- Local NVMe storage: Fastest option for high-throughput logging

Avoid These Filesystems:

- NFS: Cannot handle concurrent writes from multiple processes efficiently
- Lustre (LFS): Network filesystem that may have performance limitations for high-frequency small writes

Performance Impact:

- Poor filesystem choice can significantly slow down your training
- Logging overhead should be minimal (< 1% of training time)
- Test filesystem performance before production deployment
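One way to check which filesystem backs a candidate directory on Linux is to look it up in the mount table. The sketch below is independent of NVRx and uses only the standard library. The helper names and the `NETWORK_FS` set are assumptions for illustration. It expects text in the `/proc/mounts` format and does not decode escaped characters (such as `\040` for spaces) in mount points.

```python
import os

# Filesystem types treated as network storage in this sketch (not exhaustive).
NETWORK_FS = {"nfs", "nfs4", "lustre"}


def fs_type_for_path(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is text in /proc/mounts format:
    "<device> <mount point> <fs type> <options> ...", one mount per line.
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with trailing slashes so "/tmp" does not match "/tmpfoo".
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest matching mount point; later entries win ties.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_network_fs(path, mounts_text):
    """True if `path` lives on one of the filesystem types in NETWORK_FS."""
    return fs_type_for_path(path, mounts_text) in NETWORK_FS


def check_dir(path):
    """Convenience wrapper that reads the live mount table (Linux only)."""
    with open("/proc/mounts") as f:
        return fs_type_for_path(path, f.read())
```

Calling `check_dir("/tmp/nvrx_logs")` on a compute node before a run shows whether the directory sits on local or network storage. It does not measure throughput, so a separate write test is still worthwhile.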

Troubleshooting

Common Issues:

Troubleshooting Guide

| Issue | Solution |
|---|---|
| Logs not appearing | Check that NVRX_NODE_LOCAL_TMPDIR is set and writable |
| Permission denied | Ensure the directory has proper write permissions |
| Disk space issues | Reduce NVRX_LOG_MAX_FILE_SIZE_KB or NVRX_LOG_MAX_LOG_FILES |
| Missing rank info | Verify that the RANK and LOCAL_RANK environment variables are set |
| Performance issues | Monitor temporary directory size and adjust limits |
| Slow logging performance | Check the filesystem type: avoid NFS, Lustre, and other network storage, and use local storage |

Debug Mode: Enable debug logging to see detailed configuration information:

export NVRX_LOG_DEBUG=1
python your_script.py

This will show:

- Current configuration values
- Directory creation status
- Rank detection results
- Log handler setup details
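Several of the checks in the troubleshooting table can be automated before a job starts. The following is a minimal preflight sketch using only the standard library. The function name `preflight` and its messages are hypothetical and not part of NVRx. It also reports a missing directory as a problem, although the logger may create the directory itself.

```python
import os
import tempfile


def preflight(tmp_dir, env=None):
    """Return a list of problems found; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    if not tmp_dir:
        problems.append("NVRX_NODE_LOCAL_TMPDIR is not set")
    elif not os.path.isdir(tmp_dir):
        problems.append(f"{tmp_dir} does not exist or is not a directory")
    else:
        try:
            # Creating and removing a scratch file verifies write permission.
            with tempfile.TemporaryFile(dir=tmp_dir):
                pass
        except OSError as exc:
            problems.append(f"{tmp_dir} is not writable: {exc}")
    # Rank information is read from these variables (see the table above).
    for name in ("RANK", "LOCAL_RANK"):
        if name not in env:
            problems.append(f"{name} is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ.get("NVRX_NODE_LOCAL_TMPDIR")):
        print("WARNING:", problem)
```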

Quick API Reference

For developers who need quick access to the most commonly used API methods:

Common LogConfig Methods

| Method | Description |
|---|---|
| get_node_local_tmp_dir() | Returns the configured temporary directory path or None if not set |
| get_log_level() | Returns the configured log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) |
| get_max_file_size() | Returns the maximum file size in bytes for log rotation |
| get_max_log_files() | Returns the maximum number of log files to keep |
| get_workload_rank() | Returns the workload rank from RANK environment variable |
| get_workload_local_rank() | Returns the workload local rank from LOCAL_RANK environment variable |

Common LogManager Properties

| Property | Description |
|---|---|
| node_local_tmp_logging_enabled | Boolean indicating whether node local temporary logging is enabled |
| workload_rank | Integer representing the workload rank |
| workload_local_rank | Integer representing the workload local rank |
| logger | The configured Python logging.Logger instance |

Note

For complete API documentation including all methods, properties, and detailed signatures, see API Reference.