Troubleshooting
This section guides you through identifying the root cause of issues, determining whether they stem from system infrastructure or a bug in CloudAI.
Identifying the Root Cause
If you encounter issues running a command, start by reading the error message to understand the root cause. We strive to make our error and exception messages as readable and interpretable as possible.
System Infrastructure vs. CloudAI Bugs
To determine whether an issue is due to system infrastructure or a CloudAI bug, follow these steps:
Check stdout Messages: If CloudAI fails to run a test, the stdout messages indicate that the test has failed.
Review Log Files:
Navigate to the output directory and review the debug.log, stdout, and stderr files.
debug.log contains the detailed steps executed by CloudAI, including generated commands, executed commands, and error messages.
Analyze Error Messages: By examining the error messages in the log files, you can understand the type of errors CloudAI encountered.
Examine Output Directory: If a test fails without explicit error messages, review the output directory of the failed test. Look for stdout.txt, stderr.txt, or any generated files to understand the failure reason.
Manual Rerun of Tests:
To manually rerun the test, consult debug.log for the command CloudAI executed.
Look for an sbatch command with a generated sbatch script.
Execute the command manually to debug further.
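The log review steps above can be partly automated. The following is a minimal Python sketch, not part of CloudAI: the function names are hypothetical, and it assumes that error lines in debug.log contain the word "error" and that the executed command appears on a line containing "sbatch". The actual log format may differ, so adjust the keywords to match what you see in your own logs.

```python
from pathlib import Path


def find_matching_lines(log_path, keywords):
    """Return (line_number, text) pairs for lines containing any keyword.

    Matching is a case-insensitive substring test; this is an assumption
    about the log format, not a documented CloudAI convention.
    """
    lowered = [keyword.lower() for keyword in keywords]
    matches = []
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            if any(keyword in text.lower() for keyword in lowered):
                matches.append((number, text))
    return matches


def non_empty_files(output_dir, names=("stdout.txt", "stderr.txt")):
    """Return sorted paths under output_dir with a matching name and size > 0."""
    root = Path(output_dir)
    found = []
    for name in names:
        for path in root.rglob(name):
            if path.is_file() and path.stat().st_size > 0:
                found.append(path)
    return sorted(found)


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        output_dir = Path(sys.argv[1])
        # Assumed location: debug.log directly inside the given directory.
        debug_log = output_dir / "debug.log"
        if debug_log.exists():
            for number, text in find_matching_lines(debug_log, ["error", "sbatch"]):
                print(f"{debug_log}:{number}: {text}")
        for path in non_empty_files(output_dir):
            print(f"non-empty: {path}")
```

Run the script with the output directory as its only argument. Lines that mention sbatch are candidates for the command to rerun manually, and non-empty stderr.txt files are a good first place to look when a test fails without an explicit error message.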
If the problem persists, please report the issue at NVIDIA/cloudai#choose. Make sure the issue is reproducible. Follow the issue template and provide the necessary details, such as the commit hash used, system settings, any changes to the schema files, and the command you ran.
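To gather some of these details consistently, a small helper can be useful. The sketch below is only an illustration: the function names and the report layout are hypothetical and are not the project's issue template. It assumes git is installed and that you run it against your CloudAI checkout.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the commit hash of HEAD in repo_dir, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the directory does not exist, or it is not a repository.
        return None
    return result.stdout.strip()


def format_report(commit, command, notes=""):
    """Build a plain-text summary to paste into an issue (hypothetical layout)."""
    lines = [f"Commit hash: {commit or 'unknown'}", f"Command: {command}"]
    if notes:
        lines.append(f"Notes: {notes}")
    return "\n".join(lines)
```

For example, `print(format_report(current_commit(), "<the command you ran>"))` produces a short block you can paste into the issue alongside your system settings and any schema changes.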