nvidia-resiliency-ext
Integration Guides
Heartbeats API Integration
1. Prerequisites
2. FT configuration
3. Integration with PyTorch workload code
Sections API Integration
1. Prerequisites
2. FT configuration
3. Integration with PyTorch workload code
PyTorch Lightning Integration
1. Use ft_launcher to start the workload
2. Add the FT callback to the PTL trainer
3. Implementing auto-resume
FT Launcher & Inprocess integration
1. Heartbeat Mechanism
2. Worker Monitoring & Restart Policy
Supported & Unsupported Configurations