NCX Infra Controller (NICo) core metrics

This file contains a list of metrics exported by NCX Infra Controller (NICo). The list is auto-generated from an integration test (test_integration). Metrics for workflows which are not exercised by the test are missing.

| Name | Type | Description |
|---|---|---|
| carbide_active_host_firmware_update_count | gauge | The number of host machines in the system currently working on updating their firmware. |
| carbide_api_db_queries_total | counter | The number of database queries that occurred inside a span |
| carbide_api_db_span_query_time_milliseconds | histogram | Total time the request spent inside a span on database transactions |
| carbide_api_grpc_server_duration_milliseconds | histogram | Processing time for a request on the carbide API server |
| carbide_api_ready | gauge | Whether the Forge Site Controller API is running |
| carbide_api_tls_connection_attempted_total | counter | The number of TLS connections that were attempted |
| carbide_api_tls_connection_success_total | counter | The number of TLS connections that were successful |
| carbide_api_tracing_spans_open | gauge | The number of tracing spans that are currently open |
| carbide_api_vault_request_duration_milliseconds | histogram | The duration of outbound Vault requests, in milliseconds |
| carbide_api_vault_requests_attempted_total | counter | The number of Vault requests that were attempted |
| carbide_api_vault_requests_failed_total | counter | The number of Vault requests that failed |
| carbide_api_vault_requests_succeeded_total | counter | The number of Vault requests that succeeded |
| carbide_api_vault_token_time_until_refresh_seconds | gauge | The amount of time, in seconds, until the Vault token is required to be refreshed |
| carbide_api_version | gauge | Version (git sha, build date, etc.) of this service |
| carbide_available_ips_count | gauge | The total number of available IPs in the site |
| carbide_concurrent_machine_updates_available | gauge | The number of machines in the system that we will update concurrently. |
| carbide_db_pool_idle_conns | gauge | The number of idle connections in the carbide database pool |
| carbide_db_pool_total_conns | gauge | The total number of (active + idle) connections in the carbide database pool |
| carbide_dpu_agent_version_count | gauge | The number of Forge DPU agents which have reported a certain version. |
| carbide_dpu_firmware_version_count | gauge | The number of DPUs which have reported a certain firmware version. |
| carbide_dpus_healthy_count | gauge | The total number of DPUs in the system that have reported healthy in the last report. Healthy does not imply up: the report from the DPU might be outdated. |
| carbide_dpus_up_count | gauge | The total number of DPUs in the system that are up. Up means we have received a health report less than 5 minutes ago. |
| carbide_endpoint_exploration_duration_milliseconds | histogram | The time it took to explore an endpoint |
| carbide_endpoint_exploration_expected_machines_missing_overall_count | gauge | The total number of machines that were expected but not identified |
| carbide_endpoint_exploration_expected_power_shelves_missing_overall_count | gauge | The total number of power shelves that were expected but not identified |
| carbide_endpoint_exploration_identified_managed_hosts_overall_count | gauge | The total number of managed hosts identified by expectation |
| carbide_endpoint_exploration_machines_explored_overall_count | gauge | The total number of machines explored by machine type |
| carbide_endpoint_exploration_success_count | gauge | The number of endpoint explorations that have been successful |
| carbide_endpoint_explorations_count | gauge | The number of endpoint explorations that have been attempted |
| carbide_gpus_in_use_count | gauge | The total number of GPUs that are actively used by tenants in instances in the Forge site |
| carbide_gpus_total_count | gauge | The total number of GPUs available in the Forge site |
| carbide_gpus_usable_count | gauge | The remaining number of GPUs in the Forge site which are available for immediate instance creation |
| carbide_hosts_by_sku_count | gauge | The number of hosts by SKU and device type ('unknown' for hosts without SKU) |
| carbide_hosts_health_overrides_count | gauge | The number of health overrides that are configured in the site |
| carbide_hosts_health_status_count | gauge | The total number of Managed Hosts in the system that have reported a healthy or not healthy status, based on the presence of health probe alerts |
| carbide_hosts_in_use_count | gauge | The total number of hosts that are actively used by tenants as instances in the Forge site |
| carbide_hosts_usable_count | gauge | The remaining number of hosts in the Forge site which are available for immediate instance creation |
| carbide_hosts_with_bios_password_set | gauge | The total number of Hosts in the system that have their BIOS password set. |
| carbide_ib_partitions_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_ib_partitions in the system |
| carbide_ib_partitions_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_ib_partitions |
| carbide_ib_partitions_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_ib_partitions |
| carbide_ib_partitions_total | gauge | The total number of carbide_ib_partitions in the system |
| carbide_machine_reboot_duration_seconds | histogram | Time taken for a machine/host to reboot, in seconds |
| carbide_machine_updates_started_count | gauge | The number of machines in the system that are in the process of updating. |
| carbide_machine_validation_completed | gauge | Count of machine validations that have completed successfully |
| carbide_machine_validation_failed | gauge | Count of machine validations that have failed |
| carbide_machine_validation_in_progress | gauge | Count of machine validations that are in progress |
| carbide_machine_validation_tests | gauge | The details of machine validation tests |
| carbide_machines_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_machines in the system |
| carbide_machines_handler_latency_in_state_milliseconds | histogram | The amount of time it took to invoke the state handler for objects of type carbide_machines in a certain state |
| carbide_machines_in_maintenance_count | gauge | The total number of machines in the system that are in maintenance. |
| carbide_machines_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_machines |
| carbide_machines_object_tasks_completed_total | counter | The number of object handling tasks that have been completed for objects of type carbide_machines |
| carbide_machines_object_tasks_dispatched_total | counter | The number of object handling tasks that have been dequeued and dispatched for processing for objects of type carbide_machines |
| carbide_machines_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_machines |
| carbide_machines_object_tasks_requeued_total | counter | The number of object handling tasks that have been requeued for objects of type carbide_machines |
| carbide_machines_per_state | gauge | The number of carbide_machines in the system with a given state |
| carbide_machines_per_state_above_sla | gauge | The number of carbide_machines in the system which have been in a state longer than allowed per SLA |
| carbide_machines_state_entered_total | counter | The number of times that objects of type carbide_machines have entered a certain state |
| carbide_machines_state_exited_total | counter | The number of times that objects of type carbide_machines have exited a certain state |
| carbide_machines_time_in_state_seconds | histogram | The amount of time objects of type carbide_machines have spent in a certain state |
| carbide_machines_total | gauge | The total number of carbide_machines in the system |
| carbide_machines_with_state_handling_errors_per_state | gauge | The number of carbide_machines in the system with a given state that failed state handling |
| carbide_measured_boot_bundles_total | gauge | The total number of measured boot bundles. |
| carbide_measured_boot_machines_per_bundle_state_total | gauge | The total number of machines per a given measured boot bundle state. |
| carbide_measured_boot_machines_per_machine_state_total | gauge | The total number of machines per a given measured boot machine state. |
| carbide_measured_boot_machines_total | gauge | The total number of machines reporting measurements. |
| carbide_measured_boot_profiles_total | gauge | The total number of measured boot profiles. |
| carbide_network_segments_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_network_segments in the system |
| carbide_network_segments_handler_latency_in_state_milliseconds | histogram | The amount of time it took to invoke the state handler for objects of type carbide_network_segments in a certain state |
| carbide_network_segments_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_network_segments |
| carbide_network_segments_object_tasks_completed_total | counter | The number of object handling tasks that have been completed for objects of type carbide_network_segments |
| carbide_network_segments_object_tasks_dispatched_total | counter | The number of object handling tasks that have been dequeued and dispatched for processing for objects of type carbide_network_segments |
| carbide_network_segments_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_network_segments |
| carbide_network_segments_object_tasks_requeued_total | counter | The number of object handling tasks that have been requeued for objects of type carbide_network_segments |
| carbide_network_segments_per_state | gauge | The number of carbide_network_segments in the system with a given state |
| carbide_network_segments_per_state_above_sla | gauge | The number of carbide_network_segments in the system which have been in a state longer than allowed per SLA |
| carbide_network_segments_state_entered_total | counter | The number of times that objects of type carbide_network_segments have entered a certain state |
| carbide_network_segments_state_exited_total | counter | The number of times that objects of type carbide_network_segments have exited a certain state |
| carbide_network_segments_time_in_state_seconds | histogram | The amount of time objects of type carbide_network_segments have spent in a certain state |
| carbide_network_segments_total | gauge | The total number of carbide_network_segments in the system |
| carbide_network_segments_with_state_handling_errors_per_state | gauge | The number of carbide_network_segments in the system with a given state that failed state handling |
| carbide_nvlink_partition_monitor_nmxm_changes_applied_total | counter | Number of changes requested to Nmx-M |
| carbide_pending_dpu_nic_firmware_update_count | gauge | The number of machines in the system that need a firmware update. |
| carbide_pending_host_firmware_update_count | gauge | The number of host machines in the system that need a firmware update. |
| carbide_power_shelves_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_power_shelves in the system |
| carbide_power_shelves_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_power_shelves |
| carbide_power_shelves_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_power_shelves |
| carbide_power_shelves_total | gauge | The total number of carbide_power_shelves in the system |
| carbide_preingestion_total | gauge | The number of known machines currently being evaluated prior to ingestion |
| carbide_preingestion_waiting_download | gauge | The number of machines that are waiting for firmware downloads on other machines to complete before doing their own |
| carbide_preingestion_waiting_installation | gauge | The number of machines which have had firmware uploaded to them and are currently in the process of installing that firmware |
| carbide_racks_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_racks in the system |
| carbide_racks_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_racks |
| carbide_racks_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_racks |
| carbide_racks_total | gauge | The total number of carbide_racks in the system |
| carbide_reboot_attempts_in_booting_with_discovery_image | histogram | The number of machines rebooted again in BootingWithDiscoveryImage because there was no response from the host after a certain time. |
| carbide_reserved_ips_count | gauge | The total number of reserved IPs in the site |
| carbide_resourcepool_free_count | gauge | Count of values in the pool currently available for allocation |
| carbide_resourcepool_used_count | gauge | Count of values in the pool currently allocated |
| carbide_running_dpu_updates_count | gauge | The number of machines in the system that are running a firmware update. |
| carbide_site_exploration_expected_machines_sku_count | gauge | The total count of expected machines by SKU ID and device type |
| carbide_site_exploration_identified_managed_hosts_count | gauge | The number of Host+DPU pairs that have been identified in the last SiteExplorer run |
| carbide_site_explorer_bmc_reset_count | gauge | The number of BMC resets initiated in the last SiteExplorer run |
| carbide_site_explorer_create_machines_latency_milliseconds | histogram | The time it took to perform create_machines inside site-explorer |
| carbide_site_explorer_created_machines_count | gauge | The number of Machine pairs that have been created by Site Explorer after being identified |
| carbide_site_explorer_created_power_shelves_count | gauge | The number of Power Shelves that have been created by Site Explorer after being identified |
| carbide_site_explorer_iteration_latency_milliseconds | histogram | The time it took to perform one site explorer iteration |
| carbide_switches_enqueuer_iteration_latency_milliseconds | histogram | The overall time it took to enqueue state handling tasks for all carbide_switches in the system |
| carbide_switches_iteration_latency_milliseconds | histogram | The elapsed time in the last state processor iteration to handle objects of type carbide_switches |
| carbide_switches_object_tasks_enqueued_total | counter | The number of object handling tasks that have been freshly enqueued for objects of type carbide_switches |
| carbide_switches_total | gauge | The total number of carbide_switches in the system |
| carbide_total_ips_count | gauge | The total number of IPs in the site |
| carbide_unavailable_dpu_nic_firmware_update_count | gauge | The number of machines in the system that need a firmware update but are unavailable for update. |
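
As a rough illustration of how the metrics listed here might be combined, the following Python sketch derives a few values from them. It assumes the metrics are scraped in the Prometheus text exposition format, without labels or timestamps, and that histograms expose `_sum` and `_count` series; none of this is confirmed by this page. The sample values and the helper names `parse_samples` and `derive` are hypothetical.

```python
# Hypothetical sample in Prometheus text exposition format (an assumption).
# The values are invented; real output usually carries labels and more series.
SAMPLE = """\
# TYPE carbide_gpus_total_count gauge
carbide_gpus_total_count 64
carbide_gpus_in_use_count 40
carbide_db_pool_total_conns 10
carbide_db_pool_idle_conns 7
carbide_api_tls_connection_attempted_total 200
carbide_api_tls_connection_success_total 190
carbide_site_explorer_iteration_latency_milliseconds_sum 1500
carbide_site_explorer_iteration_latency_milliseconds_count 3
"""


def parse_samples(text):
    """Map each series name to its float value, skipping comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes "<series> <value>" with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def derive(samples):
    """Compute a few derived values from the raw samples."""
    gpus_total = samples["carbide_gpus_total_count"]
    tls_attempted = samples["carbide_api_tls_connection_attempted_total"]
    latency = "carbide_site_explorer_iteration_latency_milliseconds"
    iterations = samples[latency + "_count"]
    return {
        # Share of GPUs actively used by tenants.
        "gpu_in_use_ratio": (
            samples["carbide_gpus_in_use_count"] / gpus_total if gpus_total else 0.0
        ),
        # The pool total is documented as active + idle, so active = total - idle.
        "db_pool_active_conns": (
            samples["carbide_db_pool_total_conns"]
            - samples["carbide_db_pool_idle_conns"]
        ),
        # Fraction of attempted TLS connections that succeeded.
        "tls_success_ratio": (
            samples["carbide_api_tls_connection_success_total"] / tls_attempted
            if tls_attempted
            else 0.0
        ),
        # Mean iteration latency, from the histogram's sum and count series.
        "site_explorer_mean_latency_ms": (
            samples[latency + "_sum"] / iterations if iterations else 0.0
        ),
    }
```

Calling `derive(parse_samples(SAMPLE))` on the sample above gives a GPU in-use ratio of 0.625 and 3 active database connections. In a real deployment most of these series would carry labels, so a full Prometheus client or PromQL query would be a better fit than this minimal parser.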