Configuration Reference
=======================

This page documents every NVRx Logger configuration option, with examples and usage guidance.

.. note::
   For detailed API documentation, class methods, and function signatures, see :doc:`api`.

Environment Variables
---------------------

.. list-table:: Complete Environment Variables Reference
   :widths: 25 15 20 40
   :header-rows: 1

   * - Variable
     - Type
     - Default
     - Description
   * - ``NVRX_NODE_LOCAL_TMPDIR``
     - string
     - None
     - Directory for temporary log files. When set, enables node local temporary logging mode.
   * - ``NVRX_LOG_DEBUG``
     - string
     - Unset (INFO level)
     - Set to "1", "true", "yes", or "on" to enable DEBUG-level logging.
   * - ``NVRX_LOG_TO_STDOUT``
     - string
     - Unset (logs to stderr)
     - Set to "1" to log to stdout instead of stderr.
   * - ``NVRX_LOG_MAX_FILE_SIZE_KB``
     - integer
     - 10240 KB (10 MB)
     - Maximum size of temporary message files in KB before rotation.
   * - ``NVRX_LOG_MAX_LOG_FILES``
     - integer
     - 4
     - Maximum number of log files to keep per rank.
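
The table's semantics can be sketched in a few lines of Python. This is an illustrative reading of the table, not the library's own parsing code: ``read_nvrx_env`` and ``TRUTHY`` are hypothetical names, and treating ``NVRX_LOG_DEBUG`` case-insensitively is an assumption.

.. code-block:: python

    import os

    # Values the table lists as enabling DEBUG logging.
    TRUTHY = {"1", "true", "yes", "on"}


    def read_nvrx_env(env=None):
        """Interpret the documented variables (hypothetical helper)."""
        env = os.environ if env is None else env
        # Assumption: comparison ignores case and surrounding whitespace.
        debug = env.get("NVRX_LOG_DEBUG", "").strip().lower() in TRUTHY
        return {
            "node_local_tmp_dir": env.get("NVRX_NODE_LOCAL_TMPDIR"),
            "log_level": "DEBUG" if debug else "INFO",
            "stream": "stdout" if env.get("NVRX_LOG_TO_STDOUT") == "1" else "stderr",
            "max_file_size_kb": int(env.get("NVRX_LOG_MAX_FILE_SIZE_KB", "10240")),
            "max_log_files": int(env.get("NVRX_LOG_MAX_LOG_FILES", "4")),
        }

Calling ``read_nvrx_env({})`` returns the documented defaults, which makes it a convenient way to compare what a job's environment would request against the table.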

Python API Parameters
---------------------

.. list-table:: setup_logger Function Parameters
   :widths: 30 15 20 35
   :header-rows: 1

   * - Parameter
     - Type
     - Default
     - Description
   * - ``node_local_tmp_dir``
     - string
     - None
     - Custom temporary directory path. Overrides ``NVRX_NODE_LOCAL_TMPDIR``.
   * - ``force_reset``
     - boolean
     - False
     - Force reconfiguration even if logger is already configured.
   * - ``node_local_tmp_prefix``
     - string
     - None
     - Custom prefix for log files in distributed mode.

LogManager Constructor Parameters
---------------------------------

.. list-table:: LogManager Class Parameters
   :widths: 30 15 20 35
   :header-rows: 1

   * - Parameter
     - Type
     - Default
     - Description
   * - ``node_local_tmp_dir``
     - string
     - None
     - Directory path for temporary log files.
   * - ``node_local_tmp_prefix``
     - string
     - None
     - Prefix for log files in distributed mode.

Configuration Examples
----------------------

Basic Configuration
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    # Enable node local temporary logging
    export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
    
    # Enable debug logging
    export NVRX_LOG_DEBUG=1

Advanced Configuration
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    # Complete configuration
    export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
    export NVRX_LOG_DEBUG=1
    export NVRX_LOG_TO_STDOUT=1
    export NVRX_LOG_MAX_FILE_SIZE_KB=10240
    export NVRX_LOG_MAX_LOG_FILES=10

Python Configuration
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from nvidia_resiliency_ext.shared_utils.log_manager import setup_logger
    
    # Custom configuration
    logger = setup_logger(
        node_local_tmp_dir="/custom/logs",
        node_local_tmp_prefix="mytraining",
        force_reset=True
    )

SLURM Integration
-----------------

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=nvrx_training
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=8
    
    # NVRx Logger Configuration
    export NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs_${SLURM_JOB_ID}
    export NVRX_LOG_DEBUG=1
    export NVRX_LOG_MAX_FILE_SIZE_KB=10240
    
    # Launch training
    srun python training_script.py

Docker Integration
------------------

.. code-block:: dockerfile

    # Dockerfile
    FROM nvcr.io/nvidia/pytorch:24.01-py3
    
    # Install NVRx
    RUN pip install nvidia-resiliency-ext
    
    # Set default logging configuration
    ENV NVRX_NODE_LOCAL_TMPDIR=/tmp/nvrx_logs
    ENV NVRX_LOG_DEBUG=1
    ENV NVRX_LOG_MAX_FILE_SIZE_KB=10240

Kubernetes Integration
----------------------

.. code-block:: yaml

    # kubernetes-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nvrx-training
    spec:
      template:
        spec:
          containers:
          - name: training
            image: nvrx-training:latest
            env:
            - name: NVRX_NODE_LOCAL_TMPDIR
              value: "/tmp/nvrx_logs"
            - name: NVRX_LOG_DEBUG
              value: "1"
            - name: NVRX_LOG_MAX_FILE_SIZE_KB
              value: "10240"

Configuration Precedence
------------------------

1. **Python API parameters** (highest priority)
2. **Environment variables**
3. **Default values** (lowest priority)

Example: if ``NVRX_NODE_LOCAL_TMPDIR=/tmp/env_logs`` is set in the environment and you call ``setup_logger(node_local_tmp_dir="/tmp/api_logs")``, the API parameter wins and ``/tmp/api_logs`` is used.
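
The precedence order can be expressed as a small resolution function. This is a minimal sketch of the rule as stated here, not the library's implementation; ``resolve_tmp_dir`` and ``DEFAULT_TMP_DIR`` are hypothetical names.

.. code-block:: python

    import os

    # Documented default: no directory, so node-local logging stays disabled.
    DEFAULT_TMP_DIR = None


    def resolve_tmp_dir(api_value=None, env=None):
        """Pick the temporary directory: API argument, then env var, then default."""
        env = os.environ if env is None else env
        if api_value is not None:
            return api_value  # 1. Python API parameter (highest priority)
        # 2. Environment variable, falling back to 3. the default value.
        return env.get("NVRX_NODE_LOCAL_TMPDIR", DEFAULT_TMP_DIR)


    env = {"NVRX_NODE_LOCAL_TMPDIR": "/tmp/env_logs"}
    print(resolve_tmp_dir("/tmp/api_logs", env))  # API value wins
    print(resolve_tmp_dir(None, env))             # falls back to the env var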

Best Practices
--------------

✅ **Do:**

- Set ``NVRX_NODE_LOCAL_TMPDIR`` for node local temporary logging
- Use job-specific directories (e.g., ``/tmp/nvrx_logs_${SLURM_JOB_ID}``)
- Enable debug logging during development
- Use appropriate file size limits for your workload

❌ **Don't:**

- Use system-critical directories (e.g., ``/var/log``)
- Use network filesystems (e.g., NFS) that cannot handle high write throughput from multiple nodes
- Set extremely large file size limits
- Keep too many log files (can fill disk)
- Mix different logging configurations in the same job
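
To judge whether size and file-count limits are appropriate, a rough upper bound on per-node disk usage can help. The sketch below assumes each rank keeps at most ``NVRX_LOG_MAX_LOG_FILES`` files of at most ``NVRX_LOG_MAX_FILE_SIZE_KB`` each; actual usage may differ (for example, a file can slightly exceed the limit before it rotates). ``worst_case_log_kb`` is a hypothetical helper, and the default of 8 ranks per node mirrors the SLURM example on this page.

.. code-block:: python

    def worst_case_log_kb(max_file_size_kb=10240, max_log_files=4, ranks_per_node=8):
        """Rough per-node upper bound, in KB, under the assumptions stated above."""
        return max_file_size_kb * max_log_files * ranks_per_node


    # Documented defaults on an 8-rank node: 10240 KB * 4 files * 8 ranks.
    default_kb = worst_case_log_kb()
    print(f"{default_kb} KB = {default_kb / 1024:.0f} MB")  # 327680 KB = 320 MB

    # Raising the file count to 10 increases the bound proportionally.
    print(f"{worst_case_log_kb(max_log_files=10) / 1024:.0f} MB")  # 800 MB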

Filesystem Selection
--------------------

**Critical Consideration**: The temporary directory for distributed logging experiences high write throughput from all ranks on each node. Choose your filesystem carefully:

**Recommended Filesystems:**

- **Local node storage**: ``/tmp``, ``/scratch``, local SSDs
- **Local NVMe storage**: Fastest option for high-throughput logging

**Avoid These Filesystems:**

- **NFS**: Cannot handle concurrent writes from multiple processes efficiently
- **Lustre (LFS)**: Network filesystem that may have performance limitations for high-frequency small writes

**Performance Impact:**

- Poor filesystem choice can significantly slow down your training
- Logging overhead should be minimal (< 1% of training time)
- Test filesystem performance before production deployment
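
One way to compare candidate directories is to time many small synchronous writes in each. The probe below is a hedged sketch, not an NVRx tool: ``time_small_writes`` is a hypothetical helper, and calling ``os.fsync`` after every write is a deliberately pessimistic pattern that may not match how the logger actually writes.

.. code-block:: python

    import os
    import tempfile
    import time


    def time_small_writes(directory, n_writes=200, payload=b"x" * 256):
        """Return the mean seconds per small fsync'd write in ``directory``."""
        fd, path = tempfile.mkstemp(dir=directory, prefix="nvrx_probe_")
        try:
            start = time.perf_counter()
            for _ in range(n_writes):
                os.write(fd, payload)
                os.fsync(fd)  # force the data to storage, not just the page cache
            elapsed = time.perf_counter() - start
        finally:
            os.close(fd)
            os.remove(path)  # leave no probe file behind
        return elapsed / n_writes


    if __name__ == "__main__":
        mean_s = time_small_writes(tempfile.gettempdir())
        print(f"mean write latency: {mean_s * 1e6:.1f} microseconds")

Running the probe against a local directory and a network mount on the same node gives a quick relative comparison; absolute numbers depend heavily on hardware and load.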

Troubleshooting
---------------

**Common Issues:**

.. list-table:: Troubleshooting Guide
   :widths: 30 70
   :header-rows: 1

   * - Issue
     - Solution
   * - Logs not appearing
     - Check that ``NVRX_NODE_LOCAL_TMPDIR`` is set and writable
   * - Permission denied
     - Ensure directory has proper write permissions
   * - Disk space issues
     - Reduce ``NVRX_LOG_MAX_FILE_SIZE_KB`` or ``NVRX_LOG_MAX_LOG_FILES``
   * - Missing rank info
     - Verify that the ``RANK`` and ``LOCAL_RANK`` environment variables are set
   * - Performance issues
     - Monitor temporary directory size and adjust limits
   * - Slow logging performance
     - Check the filesystem type: avoid NFS, Lustre, and other network storage; use local storage instead
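
Several of these checks can be automated before a job starts. The following is a minimal sketch under stated assumptions: ``preflight`` is a hypothetical helper, not part of NVRx, and it reports a missing directory as a problem even though the logger may be able to create the directory itself.

.. code-block:: python

    import os
    import tempfile


    def preflight(env=None):
        """Return a list of problems found; an empty list means none were detected."""
        env = os.environ if env is None else env
        problems = []

        tmp_dir = env.get("NVRX_NODE_LOCAL_TMPDIR")
        if not tmp_dir:
            problems.append("NVRX_NODE_LOCAL_TMPDIR is not set")
        elif not os.path.isdir(tmp_dir):
            problems.append(f"{tmp_dir} does not exist or is not a directory")
        else:
            try:
                # Creating a throwaway file proves the directory is writable.
                with tempfile.TemporaryFile(dir=tmp_dir):
                    pass
            except OSError as exc:
                problems.append(f"{tmp_dir} is not writable: {exc}")

        # Rank information is read from these variables.
        for name in ("RANK", "LOCAL_RANK"):
            if name not in env:
                problems.append(f"{name} is not set")
        return problems


    if __name__ == "__main__":
        for problem in preflight():
            print("WARNING:", problem)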

**Debug Mode:**

Enable debug logging to see detailed configuration information:

.. code-block:: bash

    export NVRX_LOG_DEBUG=1
    python your_script.py

This will show:

- Current configuration values
- Directory creation status
- Rank detection results
- Log handler setup details

Quick API Reference
-------------------

For developers who need quick access to the most commonly used API methods:

.. list-table:: Common LogConfig Methods
   :widths: 30 70
   :header-rows: 1

   * - Method
     - Description
   * - ``get_node_local_tmp_dir()``
     - Returns the configured temporary directory path or None if not set
   * - ``get_log_level()``
     - Returns the configured log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
   * - ``get_max_file_size()``
     - Returns the maximum file size in bytes for log rotation
   * - ``get_max_log_files()``
     - Returns the maximum number of log files to keep
   * - ``get_workload_rank()``
     - Returns the workload rank from RANK environment variable
   * - ``get_workload_local_rank()``
     - Returns the workload local rank from LOCAL_RANK environment variable

.. list-table:: Common LogManager Properties
   :widths: 30 70
   :header-rows: 1

   * - Property
     - Description
   * - ``node_local_tmp_logging_enabled``
     - Boolean indicating whether node local temporary logging is enabled
   * - ``workload_rank``
     - Integer representing the workload rank
   * - ``workload_local_rank``
     - Integer representing the workload local rank
   * - ``logger``
     - The configured Python logging.Logger instance

.. note::
   For complete API documentation including all methods, properties, and detailed signatures, see :doc:`api`.