Configuration Reference
This page is a comprehensive reference for NVRx Logger configuration options, with examples and usage guidance.
Note
For detailed API documentation, class methods, and function signatures, see API Reference.
Environment Variables
| Variable | Type | Default | Description |
|---|---|---|---|
| `NVRX_NODE_LOCAL_TMPDIR` | string | None | Directory for temporary log files. When set, enables node local temporary logging mode. |
| `NVRX_LOG_DEBUG` | string | unset (INFO level) | Set to "1", "true", "yes", or "on" to enable DEBUG level logging. |
| `NVRX_LOG_TO_STDOUT` | string | unset (stderr) | Set to "1" to log to stdout instead of stderr. |
| `NVRX_LOG_MAX_FILE_SIZE_KB` | integer | 10240 KB (10 MB) | Maximum size of temporary message files in KB before rotation. |
| `NVRX_LOG_MAX_LOG_FILES` | integer | 4 | Maximum number of log files to keep per rank. |
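The boolean-style variables above accept several truthy spellings. As an illustration, a flag such as `NVRX_LOG_DEBUG` could be interpreted like this; note that `env_flag_enabled` is a hypothetical helper, not part of the nvidia-resiliency-ext API, and the library's actual parsing may differ:

```python
import os

def env_flag_enabled(name: str) -> bool:
    """Return True when the environment variable holds a documented truthy value."""
    value = os.environ.get(name, "")
    # The documented truthy spellings, matched case-insensitively.
    return value.strip().lower() in {"1", "true", "yes", "on"}

os.environ["NVRX_LOG_DEBUG"] = "YES"
print(env_flag_enabled("NVRX_LOG_DEBUG"))  # True
```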
Python API Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `node_local_tmp_dir` | string | None | Custom temporary directory path. Overrides `NVRX_NODE_LOCAL_TMPDIR`. |
| `force_reset` | boolean | False | Force reconfiguration even if the logger is already configured. |
| `node_local_tmp_prefix` | string | None | Custom prefix for log files in distributed mode. |
LogManager Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
|  | string | None | Directory path for temporary log files. |
|  | string | None | Prefix for log files in distributed mode. |
Configuration Examples
Basic Configuration
```bash
# Enable node local temporary logging
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs

# Enable debug logging
export NVRX_LOG_DEBUG=1
```
Advanced Configuration
```bash
# Complete configuration
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
export NVRX_LOG_DEBUG=1
export NVRX_LOG_TO_STDOUT=1
export NVRX_LOG_MAX_FILE_SIZE_KB=10240
export NVRX_LOG_MAX_LOG_FILES=10
```
Python Configuration
```python
from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger

# Custom configuration
logger = setup_logger(
    node_local_tmp_dir="/custom/logs",
    node_local_tmp_prefix="mytraining",
    force_reset=True,
)
```
SLURM Integration
```bash
#!/bin/bash
#SBATCH --job-name=nvrx_training
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8

# NVRx Logger Configuration
export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs_${SLURM_JOB_ID}
export NVRX_LOG_DEBUG=1
export NVRX_LOG_MAX_FILE_SIZE_KB=10240

# Launch training
srun python training_script.py
```
Docker Integration
```dockerfile
# Dockerfile
FROM nvcr.io/nvidia/pytorch:24.01-py3

# Install NVRx
RUN pip install nvidia-resiliency-ext

# Set default logging configuration
ENV NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
ENV NVRX_LOG_DEBUG=1
ENV NVRX_LOG_MAX_FILE_SIZE_KB=10240
```
Kubernetes Integration
```yaml
# kubernetes-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvrx-training
spec:
  template:
    spec:
      containers:
      - name: training
        image: nvrx-training:latest
        env:
        - name: NVRX_NODE_LOCAL_TMPDIR
          value: "/tmp/nvrx_logs"
        - name: NVRX_LOG_DEBUG
          value: "1"
        - name: NVRX_LOG_MAX_FILE_SIZE_KB
          value: "10240"
```
Configuration Precedence
Configuration is resolved in the following order:

1. Python API parameters (highest priority)
2. Environment variables
3. Default values (lowest priority)
Example:

- If you set `NVRX_NODE_LOCAL_TMPDIR=/tmp/env_logs` in the environment
- And call `setup_logger(node_local_tmp_dir="/tmp/api_logs")`
- The API parameter `/tmp/api_logs` will be used
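The precedence rules can be sketched with plain standard-library code; `resolve_tmp_dir` is a hypothetical helper written for this example, not part of the nvidia-resiliency-ext API:

```python
import os

DEFAULT_TMP_DIR = None  # lowest priority: the built-in default

def resolve_tmp_dir(api_value=None):
    # 1. A Python API parameter wins when provided.
    if api_value is not None:
        return api_value
    # 2. Otherwise fall back to the environment variable.
    env_value = os.environ.get("NVRX_NODE_LOCAL_TMPDIR")
    if env_value:
        return env_value
    # 3. Finally, the default.
    return DEFAULT_TMP_DIR

os.environ["NVRX_NODE_LOCAL_TMPDIR"] = "/tmp/env_logs"
print(resolve_tmp_dir("/tmp/api_logs"))  # /tmp/api_logs (API parameter wins)
print(resolve_tmp_dir())                 # /tmp/env_logs (falls back to env var)
```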
Best Practices
✅ Do:

- Set `NVRX_NODE_LOCAL_TMPDIR` for node local temporary logging
- Use job-specific directories (e.g., `/tmp/nvrx_logs_${SLURM_JOB_ID}`)
- Enable debug logging during development
- Use appropriate file size limits for your workload
❌ Don't:

- Use system-critical directories (e.g., `/var/log`)
- Use network filesystems (e.g., NFS) that cannot handle high write throughput from multiple nodes
- Set extremely large file size limits
- Keep too many log files (they can fill the disk)
- Mix different logging configurations in the same job
Filesystem Selection
Critical Consideration: The temporary directory for distributed logging experiences high write throughput from all ranks on each node. Choose your filesystem carefully:
Recommended Filesystems:

- Local node storage: `/tmp`, `/scratch`, local SSDs
- Local NVMe storage: the fastest option for high-throughput logging

Avoid These Filesystems:

- NFS: cannot handle concurrent writes from multiple processes efficiently
- Lustre (LFS): a network filesystem that may have performance limitations for high-frequency small writes

Performance Impact:

- A poor filesystem choice can significantly slow down your training
- Logging overhead should be minimal (< 1% of training time)
- Test filesystem performance before production deployment
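A quick way to act on that last point is a rough write-throughput probe of the candidate log directory. The sketch below is an illustrative pre-deployment sanity check, not a substitute for a real benchmark such as `fio`, and `measure_write_throughput` is a hypothetical helper, not part of the library:

```python
import os
import tempfile
import time

def measure_write_throughput(directory, total_mb=16, chunk_kb=64):
    """Write total_mb of data in chunk_kb chunks and return the rate in MB/s."""
    chunk = b"x" * (chunk_kb * 1024)
    writes = (total_mb * 1024) // chunk_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(writes):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # include the flush cost, as log rotation would
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return total_mb / elapsed

print(f"{measure_write_throughput(tempfile.gettempdir()):.1f} MB/s")
```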
Troubleshooting
Common Issues:
| Issue | Solution |
|---|---|
| Logs not appearing | Check that `NVRX_NODE_LOCAL_TMPDIR` is set and writable |
| Permission denied | Ensure the directory has proper write permissions |
| Disk space issues | Reduce `NVRX_LOG_MAX_FILE_SIZE_KB` or `NVRX_LOG_MAX_LOG_FILES` |
| Missing rank info | Verify that the `RANK` and `LOCAL_RANK` environment variables are set |
| Performance issues | Monitor the temporary directory size and adjust limits |
| Slow logging performance | Check the filesystem type (avoid NFS, Lustre, and other network storage; use local storage) |
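The first few rows of the table can be checked programmatically before a job starts. This pre-flight sketch uses only the standard library; `preflight_report` is a hypothetical helper written for this example, not part of the nvidia-resiliency-ext API:

```python
import os

def preflight_report():
    """Return a list of likely logger misconfigurations; empty means all clear."""
    issues = []
    tmpdir = os.environ.get("NVRX_NODE_LOCAL_TMPDIR")
    if not tmpdir:
        issues.append("NVRX_NODE_LOCAL_TMPDIR is not set")
    elif not os.path.isdir(tmpdir) or not os.access(tmpdir, os.W_OK):
        issues.append(f"{tmpdir} is missing or not writable")
    for var in ("RANK", "LOCAL_RANK"):
        if var not in os.environ:
            issues.append(f"{var} is not set (rank info may be missing)")
    return issues

for issue in preflight_report():
    print("WARNING:", issue)
```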
Debug Mode: Enable debug logging to see detailed configuration information:
```bash
export NVRX_LOG_DEBUG=1
python your_script.py
```
This will show:

- Current configuration values
- Directory creation status
- Rank detection results
- Log handler setup details
Quick API Reference
For developers who need quick access to the most commonly used API methods:
| Method | Description |
|---|---|
|  | Returns the configured temporary directory path, or None if not set |
|  | Returns the configured log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) |
|  | Returns the maximum file size in bytes for log rotation |
|  | Returns the maximum number of log files to keep |
|  | Returns the workload rank from the `RANK` environment variable |
|  | Returns the workload local rank from the `LOCAL_RANK` environment variable |
| Property | Description |
|---|---|
|  | Boolean indicating whether node local temporary logging is enabled |
|  | Integer representing the workload rank |
|  | Integer representing the workload local rank |
|  | The configured Python `logging.Logger` instance |
Note
For complete API documentation including all methods, properties, and detailed signatures, see API Reference.