Release Instance API Enhancements
What's New
The Release Instance API for NCX Infra Controller (NICo) now supports issue reporting and automated repair workflows. When releasing an instance, you can report problems to help improve system reliability.
Key Features
- Report Issues: Hardware, Network, Performance, or Other problems
- Auto-Repair: Makes machines available for repair plugins/systems to fix issues
- Repair Integration: Special handling for repair systems
- Enhanced Labels: Machine metadata labels for repair status tracking
Quick Start
REST API:
Basic Release (No Issues)
curl -X POST /api/v1/instances/release \
-d '{"id": "instance-12345"}'
Release with Issue Report
curl -X POST /api/v1/instances/release \
-d '{
"id": "instance-12345",
"issue": {
"category": "HARDWARE",
"summary": "Memory errors during training",
"details": "Job crashed with ECC errors on DIMM slot 2"
}
}'
Issue Categories
| Category | When to Use | Examples |
|---|---|---|
| HARDWARE | Physical component failures | Memory errors, GPU failures, disk problems |
| NETWORK | Connectivity issues | Slow InfiniBand, packet loss, timeouts |
| PERFORMANCE | Slower than expected | Thermal throttling, reduced GPU performance |
| OTHER | Software/config issues | Driver problems, CUDA version mismatches |
What Happens When You Report Issues
When you release an instance with issue reporting, the system automatically takes several actions to fix the machine and prevent the issue-reported machine from being allocated to tenants until resolved:
Immediate Actions
- Health Override Application - Marks machine with health status and prevents new allocations
- Issue Logging - Records problem details for tracking and analysis
- Auto-Repair Signal - Makes machine available for repair plugins to act on (if enabled)
Health Override Types
The system uses two complementary health overrides to manage the repair workflow:
| Override | Purpose | Behavior | When Applied |
|---|---|---|---|
tenant-reported-issue | Documents tenant-reported problems | Prevents machine allocation until resolved | Always when issue is reported |
repair-request | Signals automated repair needed | Triggers breakfix system to claim machine | When auto-repair is enabled or manually applied |
Auto-Repair Behavior
- Enabled: Machine gets both overrides (
tenant-reported-issue+repair-request) - repair plugins can act on the machine - Disabled: Machine gets only
tenant-reported-issueoverride (manual intervention needed)
NICo - Breakfix Integration Workflow
Workflow Overview
The breakfix integration follows this automated repair cycle:
- Issue Reporting: Tenant releases instance and reports hardware/software problems via API
- Health Override Application: System applies appropriate health overrides based on configuration
- Repair System Activation: Breakfix system detects machines marked for repair and claims them
- Automated Repair: Repair tenant diagnoses and fixes the reported issues
- Validation & Release: Successfully repaired machines return to the available pool
Stage Details
- Normal Operation: Machine serves tenant workloads without issues
- Issue Reported: Tenant releases instance with problem details via API
- Quarantined: Machine marked with health overrides, preventing new allocations
- Repair Process:
- If auto-repair enabled: Repair plugins automatically attempt fixes
- If auto-repair disabled: Manual intervention required by operations team
- Resolution: Machine either gets repaired successfully or escalated for further action
- Return to Pool: Successfully repaired machines with
repair_status="Completed"return to the available pool
Repair Status Labels
Repair systems use machine metadata labels to communicate repair outcomes back to Forge:
Critical Label: repair_status
| Value | Meaning | Result |
|---|---|---|
"Completed" | Repair successful | Machine returns to available pool |
"Failed" | Repair couldn't fix issue | Escalated to operations team |
"InProgress" | Repair still running | Treated as failed if instance released |
⚠️ Important: Repair systems must set
repair_statusbefore releasing instances. Missing or invalid labels result in failed repair handling.
Optional Labels
repair_details: Explanation of what was done (e.g.,"thermal_paste_replaced")repair_eta: Expected completion time for planning purposes
Configuration
Auto-Repair Settings
>>carbide-api-site-config.toml
...
[auto_machine_repair_plugin]
enabled = true
...
Frequently Asked Questions (FAQ)
Q1: Tenant releases machine reporting issue but auto_machine_repair_plugin.enabled is false
Scenario: A tenant calls the release API with issue details, but automatic repair is disabled in the site configuration.
What happens:
- Machine is released and marked with issue details
- Health override
tenant-reported-issueIS applied (issue is documented) - Health override
repair-requestis NOT applied (no automatic repair triggered) - Machine becomes unavailable for normal allocation due to tenant-reported-issue override
Resolution:
# Check current configuration (requires server access to config file)
# Auto-repair setting is in carbide-api-site-config.toml
# Manually trigger repair using health override
carbide-admin-cli machine health-override add <machine-id> --template RequestRepair \
--message "Manual repair trigger for tenant-reported issue"
# To enable auto-repair site-wide, update carbide-api-site-config.toml:
# [auto_machine_repair_plugin]
# enabled = true
Best Practice: Enable auto-repair in production environments to ensure tenant-reported issues are automatically handled.
Q2: Tenant releases machine reporting issue but repair tenant hasn't picked up the machine
Scenario: Auto-repair is enabled, tenant reports issue, health override is applied, but repair tenant hasn't started working on the machine.
What happens:
- Machine gets
tenant-reported-issuehealth override (documents the issue) - Machine gets
repair-requesthealth override (signals repair system) - Machine becomes unavailable for normal tenant allocation
- Repair plugins should detect and claim the machine
- If repair tenant doesn't pick up machine, it remains in limbo
Troubleshooting:
# Check machine status and health overrides
carbide-admin-cli machine show <machine-id>
carbide-admin-cli machine health-override show <machine-id>
# Check repair system status (requires monitoring tools)
# - Check repair tenant instances
# - Verify repair system connectivity
# Manually assign repair override if needed
carbide-admin-cli machine health-override add <machine-id> --template RequestRepair \
--message "Manual assignment for repair system"
Common Causes:
- Repair tenant is at capacity
- Repair plugins are not running
- Machine doesn't match repair tenant's allocation criteria
- Network connectivity issues between repair systems
Q3: Repair tenant releases machine as "fixed" but machine still needs repair
Scenario: Repair tenant completes work and releases machine claiming it's fixed, but the underlying issue persists.
What happens:
- Health override
repair-requestis removed (repair claimed complete) - If repair tenant reports new issues:
tenant-reported-issueoverride is applied - If repair tenant reports new issues: Machine does NOT return to available pool
- If no new issues reported: Both overrides removed, machine returns to available pool
- Auto-repair is NOT triggered again (prevents infinite repair loops)
Detection and Response:
# Check machine status and current health overrides
carbide-admin-cli machine show <machine-id>
carbide-admin-cli machine health-override show <machine-id>
# Check repair work status (requires access to repair system logs)
# - Review repair tenant instance logs
# - Check repair system monitoring
# If issue persists, escalate to manual intervention
carbide-admin-cli machine health-override add <machine-id> --template OutForRepair \
--message "Repair unsuccessful, requires manual investigation"
Prevention:
- Implement repair validation tests
- Require repair tenants to provide detailed fix reports
- Set up monitoring to detect recurring issues on same machines
- Establish escalation procedures for failed repairs
Q4: Repair tenant successfully fixes machine and reports completion
Scenario: The ideal case where repair tenant successfully resolves the issue and properly reports completion.
What happens:
- Repair tenant releases machine with success status (repair_status = "Completed")
- Health override
repair-requestis automatically removed - Health override
tenant-reported-issueis automatically removed - Machine returns to healthy, available state
- Machine becomes available for normal tenant allocation
Verification Steps:
# Confirm machine is healthy and available
carbide-admin-cli machine show <machine-id>
# Check that health overrides are cleared
carbide-admin-cli machine health-override show <machine-id>
# Verify machine status (should show as available)
# Machine should appear in normal allocation pool
# Review repair work (requires access to repair system)
# - Check repair tenant instance completion status
# - Review repair system logs and reports
Success Indicators:
- ✅ Machine status:
Available - ✅ Health overrides: None or only non-blocking ones
- ✅ Recent allocation tests pass
- ✅ Repair logs show successful completion
- ✅ No recurring issues reported
Q5: Repair tenant releases machine without setting repair_status
Scenario: Repair tenant completes work and releases machine but forgets to set the repair_status metadata or sets it to something other than "Completed".
What happens:
- Machine has existing
repair-requesthealth override - Repair tenant releases machine without
repair_status = "Completed" - System treats this as failed/incomplete repair
- Health override
repair-requestis automatically removed - Health override
tenant-reported-issueis applied (or updated if already exists) - Machine does NOT return to available pool
- Auto-repair is NOT triggered again (prevents infinite loops)
Detection:
# Check machine status after repair tenant release
carbide-admin-cli machine show <machine-id>
carbide-admin-cli machine health-override show <machine-id>
# Look for:
# - repair-request override: REMOVED
# - tenant-reported-issue override: PRESENT
# - Machine status: NOT available for allocation
Resolution:
# If repair was actually successful, manually clear the issue
carbide-admin-cli machine health-override remove <machine-id> tenant-reported-issue
# If repair was incomplete, escalate properly
carbide-admin-cli machine health-override add <machine-id> --template OutForRepair \
--message "Repair incomplete - requires manual investigation"
Prevention:
- Train repair tenants to always set repair_status metadata
- Implement validation in repair workflows to ensure status is set
- Monitor for machines released by repair tenant without "Completed" status
- Set up alerts for machines with tenant-reported-issue after repair tenant release
Best Practice:
# Repair tenants should always set metadata before release:
# repair_status = "Completed" # for successful repairs
# repair_status = "Failed" # for unsuccessful repairs
# repair_status = "InProgress" # repair in progress
General Troubleshooting Commands
Check Auto-Repair Configuration:
# Auto-repair settings are in carbide-api-site-config.toml
# [auto_machine_repair_plugin]
# enabled = true|false
# Check current runtime configuration
carbide-admin-cli version --show-runtime-config
Monitor Issue Reporting:
# Check machine status and health overrides
carbide-admin-cli machine show <machine-id>
carbide-admin-cli machine health-override show <machine-id>
# Monitor machine through repair cycle (requires external monitoring)
Manual Intervention:
# Remove specific health overrides
carbide-admin-cli machine health-override remove <machine-id> repair-request
carbide-admin-cli machine health-override remove <machine-id> tenant-reported-issue
# Apply manual repair override
carbide-admin-cli machine health-override add <machine-id> --template RequestRepair \
--message "Manual repair assignment"
# Escalate to operations team
carbide-admin-cli machine health-override add <machine-id> --template OutForRepair \
--message "Automated repair failed, requires manual investigation"
This enhanced API improves system reliability by enabling structured issue reporting, automated repairs, and better coordination between tenants, repair systems, and operations teams.