nvidia-resiliency-ext

Documentation contents:

  • Fault Tolerance
    • Usage guide
    • Integration Guides
      • Heartbeats API Integration
      • Sections API Integration
      • PyTorch Lightning Integration
      • FT Launcher & Inprocess integration
    • API documentation
    • Examples
  • Inprocess Restart
  • Async Checkpointing
  • Local Checkpointing
  • Straggler Detection
nvidia-resiliency-ext
  • Fault Tolerance
  • Integration Guides
  • View page source

Integration Guides

Integration Guides

  • Heartbeats API Integration
    • 1. Prerequisites
    • 2. FT configuration
    • 3. Integration with PyTorch workload code
  • Sections API Integration
    • 1. Prerequisites
    • 2. FT configuration
    • 3. Integration with PyTorch workload code
  • PyTorch Lightning Integration
    • 1. Use ft_launcher to start the workload
    • 2. Add the FT callback to the PTL trainer
    • 3. Implementing auto-resume
  • FT Launcher & Inprocess integration
    • 1. Heartbeat Mechanism
    • 2. Worker Monitoring & Restart Policy
    • Supported & Unsupported Configurations
Previous Next

© Copyright 2024, NVIDIA Corporation.

Built with Sphinx using a theme provided by Read the Docs.