NCX Infra Controller (NICo) core metrics
This file contains a list of metrics exported by NCX Infra Controller (NICo). The list is auto-generated from an integration test (test_integration). Metrics for workflows which are not exercised by the test are missing.
| Name | Type | Description |
| --- | --- | --- |
| carbide_active_host_firmware_update_count | gauge | The number of host machines in the system currently working on updating their firmware. |
| carbide_api_db_queries_total | counter | The number of database queries that occurred inside a span |
| carbide_api_db_span_query_time_milliseconds | histogram | Total time the request spent inside a span on database transactions |
| carbide_api_grpc_server_duration_milliseconds | histogram | Processing time for a request on the carbide API server |
| carbide_api_ready | gauge | Whether the Forge Site Controller API is running |
| carbide_api_tls_connection_attempted_total | counter | The number of TLS connections that were attempted |
| carbide_api_tls_connection_success_total | counter | The number of TLS connections that were successful |
| carbide_api_tracing_spans_open | gauge | The number of tracing spans that are currently open |
| carbide_api_vault_request_duration_milliseconds | histogram | The duration of outbound Vault requests, in milliseconds |
| carbide_api_vault_requests_attempted_total | counter | The number of Vault requests that were attempted |
| carbide_api_vault_requests_failed_total | counter | The number of Vault requests that failed |
| carbide_api_vault_requests_succeeded_total | counter | The number of Vault requests that succeeded |
| carbide_api_vault_token_time_until_refresh_seconds | gauge | The amount of time, in seconds, until the vault token is required to be refreshed |
| carbide_api_version | gauge | Version (git sha, build date, etc) of this service |
| carbide_available_ips_count | gauge | The total number of available IPs in the site |
| carbide_concurrent_machine_updates_available | gauge | The number of machines in the system that we will update concurrently. |
| carbide_db_pool_idle_conns | gauge | The number of idle connections in the carbide database pool |
| carbide_db_pool_total_conns | gauge | The total number of connections (active + idle) in the carbide database pool |
| carbide_dpu_agent_version_count | gauge | The number of Forge DPU agents that have reported a certain version. |
| carbide_dpu_firmware_version_count | gauge | The number of DPUs that have reported a certain firmware version. |
| carbide_dpus_healthy_count | gauge | The total number of DPUs in the system that have reported healthy in the last report. Healthy does not imply up - the report from the DPU might be outdated. |
| carbide_dpus_up_count | gauge | The total number of DPUs in the system that are up. Up means we have received a health report less than 5 minutes ago. |
| carbide_endpoint_exploration_duration_milliseconds | histogram | The time it took to explore an endpoint |
| carbide_endpoint_exploration_expected_machines_missing_overall_count | gauge | The total number of machines that were expected but not identified |
| carbide_endpoint_exploration_expected_power_shelves_missing_overall_count | gauge | The total number of power shelves that were expected but not identified |
| carbide_endpoint_exploration_identified_managed_hosts_overall_count | gauge | The total number of managed hosts identified by expectation |
| carbide_endpoint_exploration_machines_explored_overall_count | gauge | The total number of machines explored by machine type |
| carbide_endpoint_exploration_success_count | gauge | The number of endpoint explorations that have been successful |
| carbide_endpoint_explorations_count | gauge | The number of endpoint explorations that have been attempted |
| carbide_gpus_in_use_count | gauge | The total number of GPUs that are actively used by tenants in instances in the Forge site |
| carbide_gpus_total_count | gauge | The total number of GPUs available in the Forge site |
| carbide_gpus_usable_count | gauge | The remaining number of GPUs in the Forge site which are available for immediate instance creation |
| carbide_hosts_by_sku_count | gauge | The number of hosts by SKU and device type ('unknown' for hosts without SKU) |
| carbide_hosts_health_overrides_count | gauge | The number of health overrides that are configured in the site |
| carbide_hosts_health_status_count | gauge | The total number of Managed Hosts in the system that have reported either a healthy or an unhealthy status, based on the presence of health probe alerts |
| carbide_hosts_in_use_count | gauge | The total number of hosts that are actively used by tenants as instances in the Forge site |
| carbide_hosts_usable_count | gauge | The remaining number of hosts in the Forge site which are available for immediate instance creation |
| carbide_hosts_with_bios_password_set | gauge | The total number of Hosts in the system that have their BIOS password set. |
| carbide_ib_partitions_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_ib_partitions in the system |
| carbide_ib_partitions_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_ib_partitions |
| carbide_ib_partitions_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_ib_partitions |
| carbide_ib_partitions_total | gauge | The total number of carbide_ib_partitions in the system |
| carbide_machine_reboot_duration_seconds | histogram | Time taken for machine/host to reboot in seconds |
| carbide_machine_updates_started_count | gauge | The number of machines in the system that are in the process of updating. |
| carbide_machine_validation_completed | gauge | Count of machine validations that have completed successfully |
| carbide_machine_validation_failed | gauge | Count of machine validations that have failed |
| carbide_machine_validation_in_progress | gauge | Count of machine validations that are in progress |
| carbide_machine_validation_tests | gauge | The details of machine validation tests |
| carbide_machines_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_machines in the system |
| carbide_machines_handler_latency_in_state_milliseconds | histogram | The amount of time it took to invoke the state handler for objects of type carbide_machines in a certain state |
| carbide_machines_in_maintenance_count | gauge | The total number of machines in the system that are in maintenance. |
| carbide_machines_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_machines |
| carbide_machines_object_tasks_completed_total | counter | The number of object handling tasks that have been completed for objects of type carbide_machines |
| carbide_machines_object_tasks_dispatched_total | counter | The number of object handling tasks that have been dequeued and dispatched for processing for objects of type carbide_machines |
| carbide_machines_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_machines |
| carbide_machines_object_tasks_requeued_total | counter | The number of object handling tasks that have been requeued for objects of type carbide_machines |
| carbide_machines_per_state | gauge | The number of carbide_machines in the system with a given state |
| carbide_machines_per_state_above_sla | gauge | The number of carbide_machines in the system that have been in a state longer than allowed by the SLA |
| carbide_machines_state_entered_total | counter | The number of times that objects of type carbide_machines have entered a certain state |
| carbide_machines_state_exited_total | counter | The number of times that objects of type carbide_machines have exited a certain state |
| carbide_machines_time_in_state_seconds | histogram | The amount of time objects of type carbide_machines have spent in a certain state |
| carbide_machines_total | gauge | The total number of carbide_machines in the system |
| carbide_machines_with_state_handling_errors_per_state | gauge | The number of carbide_machines in the system with a given state that failed state handling |
| carbide_measured_boot_bundles_total | gauge | The total number of measured boot bundles. |
| carbide_measured_boot_machines_per_bundle_state_total | gauge | The total number of machines per a given measured boot bundle state. |
| carbide_measured_boot_machines_per_machine_state_total | gauge | The total number of machines per a given measured boot machine state. |
| carbide_measured_boot_machines_total | gauge | The total number of machines reporting measurements. |
| carbide_measured_boot_profiles_total | gauge | The total number of measured boot profiles. |
| carbide_network_segments_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_network_segments in the system |
| carbide_network_segments_handler_latency_in_state_milliseconds | histogram | The amount of time it took to invoke the state handler for objects of type carbide_network_segments in a certain state |
| carbide_network_segments_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_network_segments |
| carbide_network_segments_object_tasks_completed_total | counter | The number of object handling tasks that have been completed for objects of type carbide_network_segments |
| carbide_network_segments_object_tasks_dispatched_total | counter | The number of object handling tasks that have been dequeued and dispatched for processing for objects of type carbide_network_segments |
| carbide_network_segments_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_network_segments |
| carbide_network_segments_object_tasks_requeued_total | counter | The number of object handling tasks that have been requeued for objects of type carbide_network_segments |
| carbide_network_segments_per_state | gauge | The number of carbide_network_segments in the system with a given state |
| carbide_network_segments_per_state_above_sla | gauge | The number of carbide_network_segments in the system that have been in a state longer than allowed by the SLA |
| carbide_network_segments_state_entered_total | counter | The number of times that objects of type carbide_network_segments have entered a certain state |
| carbide_network_segments_state_exited_total | counter | The number of times that objects of type carbide_network_segments have exited a certain state |
| carbide_network_segments_time_in_state_seconds | histogram | The amount of time objects of type carbide_network_segments have spent in a certain state |
| carbide_network_segments_total | gauge | The total number of carbide_network_segments in the system |
| carbide_network_segments_with_state_handling_errors_per_state | gauge | The number of carbide_network_segments in the system with a given state that failed state handling |
| carbide_nvlink_partition_monitor_nmxm_changes_applied_total | counter | Number of changes requested to Nmx-M |
| carbide_pending_dpu_nic_firmware_update_count | gauge | The number of machines in the system that need a firmware update. |
| carbide_pending_host_firmware_update_count | gauge | The number of host machines in the system that need a firmware update. |
| carbide_power_shelves_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_power_shelves in the system |
| carbide_power_shelves_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_power_shelves |
| carbide_power_shelves_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_power_shelves |
| carbide_power_shelves_total | gauge | The total number of carbide_power_shelves in the system |
| carbide_preingestion_total | gauge | The number of known machines currently being evaluated prior to ingestion |
| carbide_preingestion_waiting_download | gauge | The number of machines that are waiting for firmware downloads on other machines to complete before doing their own |
| carbide_preingestion_waiting_installation | gauge | The number of machines that have had firmware uploaded to them and are currently in the process of installing that firmware |
| carbide_racks_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_racks in the system |
| carbide_racks_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_racks |
| carbide_racks_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_racks |
| carbide_racks_total | gauge | The total number of carbide_racks in the system |
| carbide_reboot_attempts_in_booting_with_discovery_image | histogram | The number of machines rebooted again in BootingWithDiscoveryImage because the host did not respond within a certain time. |
| carbide_reserved_ips_count | gauge | The total number of reserved IPs in the site |
| carbide_resourcepool_free_count | gauge | Count of values in the pool currently available for allocation |
| carbide_resourcepool_used_count | gauge | Count of values in the pool currently allocated |
| carbide_running_dpu_updates_count | gauge | The number of machines in the system that are running a firmware update. |
| carbide_site_exploration_expected_machines_sku_count | gauge | The total count of expected machines by SKU ID and device type |
| carbide_site_exploration_identified_managed_hosts_count | gauge | The number of Host+DPU pairs that have been identified in the last SiteExplorer run |
| carbide_site_explorer_bmc_reset_count | gauge | The number of BMC resets initiated in the last SiteExplorer run |
| carbide_site_explorer_create_machines_latency_milliseconds | histogram | The time it took to perform create_machines inside site-explorer |
| carbide_site_explorer_created_machines_count | gauge | The number of Machine pairs that have been created by Site Explorer after being identified |
| carbide_site_explorer_created_power_shelves_count | gauge | The number of Power Shelves that have been created by Site Explorer after being identified |
| carbide_site_explorer_iteration_latency_milliseconds | histogram | The time it took to perform one site explorer iteration |
| carbide_switches_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_switches in the system |
| carbide_switches_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_switches |
| carbide_switches_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_switches |
| carbide_switches_total | gauge | The total number of carbide_switches in the system |
| carbide_total_ips_count | gauge | The total number of IPs in the site |
| carbide_unavailable_dpu_nic_firmware_update_count | gauge | The number of machines in the system that need a firmware update but are unavailable for update. |
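
As an illustration of how the gauges and counters listed above might be combined, the following is a minimal sketch that derives two ratios from a scrape. It assumes the metrics are exposed in the Prometheus text exposition format, which this document does not state. The sample values are made up, and `parse_metrics` and `ratio` are hypothetical helpers, not part of NICo. The parser is deliberately simplified: it sums all label combinations of a metric and does not handle `}` inside label values.

```python
import re

# Hypothetical scrape output; values are invented for illustration only.
SAMPLE = """\
# TYPE carbide_gpus_total_count gauge
carbide_gpus_total_count 64
# TYPE carbide_gpus_in_use_count gauge
carbide_gpus_in_use_count 48
# TYPE carbide_api_tls_connection_attempted_total counter
carbide_api_tls_connection_attempted_total 200
# TYPE carbide_api_tls_connection_success_total counter
carbide_api_tls_connection_success_total 190
"""

# metric name, optional {labels}, then the sample value
LINE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def parse_metrics(text):
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, value = match.group(1), float(match.group(3))
        totals[name] = totals.get(name, 0.0) + value
    return totals


def ratio(totals, numerator, denominator):
    """Return numerator / denominator, or None if the denominator is absent or zero."""
    denom = totals.get(denominator, 0.0)
    if denom == 0:
        return None
    return totals.get(numerator, 0.0) / denom


totals = parse_metrics(SAMPLE)
gpu_utilization = ratio(
    totals, "carbide_gpus_in_use_count", "carbide_gpus_total_count"
)
tls_success = ratio(
    totals,
    "carbide_api_tls_connection_success_total",
    "carbide_api_tls_connection_attempted_total",
)
print(f"GPU utilization: {gpu_utilization:.2f}")
print(f"TLS success ratio: {tls_success:.2f}")
```

In a real deployment these ratios would more likely be computed in the monitoring system's query language. The sketch only shows which metrics pair naturally as numerator and denominator.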