nvidia-resiliency-ext

Documentation contents:

  • Fault Tolerance
  • Inprocess Restart
    • Usage Guide
    • API documentation
      • Wrapper
      • Compose
      • State
      • Rank Assignment
      • Rank Filter
      • Initialize
      • Abort
      • Finalize
      • Health Check
      • Exceptions
    • Examples
  • Async Checkpointing
  • Local Checkpointing
  • Straggler Detection
nvidia-resiliency-ext
  • Inprocess Restart
  • API documentation
  • View page source

API documentation

API documentation

  • Wrapper
    • CallWrapper
      • CallWrapper.atomic()
      • CallWrapper.iteration
      • CallWrapper.ping()
    • Wrapper
  • Compose
  • State
  • Rank Assignment
    • Rank Assignment
      • Base class
      • Tree
      • Composable Rank Assignments
    • Rank Filtering
      • Base class
      • Rank Filters
  • Rank Filter
    • ActiveWorldSizeDivisibleBy
    • MaxActiveWorldSize
    • RankFilter
    • WorldSizeDivisibleBy
  • Initialize
    • Initialize
      • Initialize.__call__()
    • RetryController
  • Abort
    • Abort
      • Abort.__call__()
    • AbortTorchDistributed
    • AbortTransformerEngine
  • Finalize
    • Finalize
      • Finalize.__call__()
    • ThreadedFinalize
  • Health Check
    • HealthCheck
      • HealthCheck.__call__()
    • CudaHealthCheck
    • FaultCounter
    • FaultCounterExceeded
  • Exceptions
    • HealthCheckError
    • InternalError
    • RestartAbort
    • RestartError
    • TimeoutError
Previous Next

© Copyright 2024, NVIDIA Corporation.

Built with Sphinx using a theme provided by Read the Docs.