Troubleshooting#

Credentials#

Below are some common errors you might run into when using the credential CLI, along with suggested troubleshooting steps. Please also refer to Setup Credentials for more information.

Could not connect to the endpoint URL#

client [ERROR] common: Server responded with status code 400
Data validation failed with error: Could not connect to the endpoint URL: "{your_data_endpoint_url}"

Please confirm that the data endpoint URL is valid and reachable.

Extra fields not permitted#

client [ERROR] common: Server responded with status code 422
{'detail': [{'loc': ['body', 'xxxx_credential', 'xxx'], 'msg': 'extra fields not permitted', 'type': 'value_error.extra'}]}

Please make sure the payload does not contain extra fields when setting credentials. The credential payload table lists which keys are required and which are optional for each type of credential.
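To find which field the server rejected, you can read the 'loc' path in the error detail. The snippet below is a minimal sketch, assuming the error body is printed as a Python dict literal exactly as in the example above. The helper name extra_field_paths is hypothetical and is not part of the osmo CLI.

```python
import ast


def extra_field_paths(error_body: str) -> list[str]:
    """Return dotted paths of the fields rejected as extra (hypothetical helper)."""
    # The example error is printed with single quotes, so it is parsed as a
    # Python literal rather than as JSON.
    payload = ast.literal_eval(error_body)
    return [
        ".".join(str(part) for part in item["loc"])
        for item in payload.get("detail", [])
        if item.get("type") == "value_error.extra"
    ]


body = (
    "{'detail': [{'loc': ['body', 'xxxx_credential', 'xxx'], "
    "'msg': 'extra fields not permitted', 'type': 'value_error.extra'}]}"
)
print(extra_field_paths(body))  # ['body.xxxx_credential.xxx']
```

The last element of each path is the key to remove from the payload.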

SignatureDoesNotMatch#

client [ERROR] common: Server responded with status code 400
Data validation failed with error: An error occurred (SignatureDoesNotMatch) when calling the ListBuckets operation: The request signature we calculated does not match the signature you provided. Check your key and signing method.

Please check that your access_key_id and access_key are valid.

AuthorizationHeaderMalformed#

client [ERROR] common: Server responded with status code 400
Data validation failed with error: An error occurred (AuthorizationHeaderMalformed) when calling the ListBuckets operation: The authorization header is malformed; the region 'us-east-3' is wrong; expecting 'us-east-1'

Please change the region to the one the error message expects (us-east-1 in the example above).
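If you handle this error in a script, the expected region can be pulled out of the message text. This is a minimal sketch, assuming the message keeps the wording shown in the example above. The helper name region_hint is hypothetical.

```python
import re


def region_hint(message: str):
    """Return (configured_region, expected_region), or None if the text does not match."""
    match = re.search(
        r"the region '([^']+)' is wrong; expecting '([^']+)'", message
    )
    return match.groups() if match else None


msg = (
    "The authorization header is malformed; "
    "the region 'us-east-3' is wrong; expecting 'us-east-1'"
)
print(region_hint(msg))  # ('us-east-3', 'us-east-1')
```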

Max retries exceeded with url#

client [ERROR] common: Server responded with status code 400
Registry connection error for https://your_registry/v2/:
HTTPSConnectionPool(host='your_registry', port=443): Max retries exceeded with url: /v2/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe07c119e40>: Failed to establish a new connection: [Errno -2] Name or service not known'))

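Please check that your registry URL is valid. In the example above, the registry hostname could not be resolved (Name or service not known).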
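A quick way to rule out a DNS problem is to check whether the registry hostname resolves from your machine. The snippet below is a minimal sketch using only the Python standard library. The helper name hostname_resolves is hypothetical, and a successful lookup does not prove that the registry itself is reachable or accepts your credentials.

```python
import socket


def hostname_resolves(host: str, port: int = 443) -> bool:
    """Return True if the host name can be resolved to at least one address."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Raised for failures such as "Name or service not known".
        return False


# Example (replace your_registry with your actual registry host):
# hostname_resolves("your_registry")
```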

Registry authentication failed#

client [ERROR] common: Server responded with status code 400
Registry authentication failed.

Please check that your registry username and auth are valid.

Duplicate key value#

client [ERROR] common: Server responded with status code 400
{'message': ' duplicate key value violates unique constraint "credential_pkey"\nDETAIL:  Key (user_name, cred_name)=(your_user_name, your_cred_name) already exists.\n', 'error_code': 'USER'}

Please rename your credential, or delete the existing one with $ osmo credential delete <your_cred_name> and then set it again.

Dataset#

Below are some common errors you might run into when using the dataset CLI. Please follow the suggested steps to troubleshoot. Please also refer to Data or Working with Data for more information.

Validation error#

Data upload failed with error:

Data key validation error: access_key_id <> not valid for <>
Data key validation error: access_key_id has no read access for <>
Data key validation error: access_key_id has no write access for <>

Please confirm that the access_key_id set for your data credentials is the same as the Shared Storage S3 ACL Access User found at Data. If the access_key_id does not have the correct permissions, ask an admin for permission.

No default bucket#

No default bucket set. Specify default bucket using the "osmo profile set" CLI.

Please set a default bucket as specified at Data.

Resources#

Please change the workflow resource specs based on the detailed error message. Please also refer to Overview to make sure your resource spec is correct. Set the labels, cpu/gpu, memory, and storage based on the current pool/platform availability, which you can check with osmo resource list.

Some common errors are listed below:

Too high for label memory#

Resource memory error
E.g. Value "1000000" too high for label memory

Please check the available memory and set it correctly.

Too high for label cpu#

Resource cpu/gpu error
E.g. Value "1000000" too high for label cpu

Please check the available cpu and set it correctly.

Too high for label storage#

Resource storage error
E.g. Value "1000000" too high for label storage

Please check the available storage and set it correctly.

Does not allow mount#

Mount error:
E.g. Mount /bad_mount not allowed for selected platform dgx-h100

If you need specific host mounts, reach out to an admin to update the platform configs.

Workflow#

When a workflow fails, refer to query for an overview of the workflow's task statuses, including which pods failed and their failure messages. Refer to Status Reference for more information on the different workflow statuses.

Use logs for better insight into what happened during the workflow runtime.

137 Error Code#

When a task exits with exit code 137, it usually means that the task was killed for using too much memory.
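By convention, an exit code above 128 means the process was terminated by a signal whose number is the exit code minus 128. For 137 that is signal 9 (SIGKILL), which is what the kernel sends when a process is killed for exceeding its memory limit. The snippet below is a minimal sketch of that arithmetic. The helper name describe_exit_code is hypothetical, and SIGKILL can also come from other sources, so treat the result as a hint only.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a process exit code into a short description."""
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by signal {signum}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```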

Users can confirm this if the admins have set up a Grafana Dashboard with detailed workflow usage information. To see the dashboard, click the Resource Usage button in the UI on the detailed workflow information page.

To resolve the memory issue, try increasing the amount of memory requested or lowering the memory usage within the task. To learn more about workflow resources, refer to Overview.