Telemetry¶
The MSC provides telemetry through the OpenTelemetry Python API and SDK.
Only the OpenTelemetry Python API is included by default. The OpenTelemetry Python SDK is included with the observability-otel
extra.
Telemetry is configured by adding an opentelemetry dictionary to the MSC configuration and by creating a telemetry provider to pass to storage client creation flows. See the Configuration Reference for all configuration options.
profiles:
  data:
    # ...

opentelemetry:
  metrics:
    attributes:
      - type: static
        options:
          attributes:
            organization: NVIDIA
            cluster: DGX SuperPOD 1
      - type: host
        options:
          attributes:
            node: name
      - type: process
        options:
          attributes:
            process: pid
    reader:
      options:
        # ≤ 100 Hz collect frequency.
        collect_interval_millis: 10
        collect_timeout_millis: 100
        # ≤ 1 Hz export frequency.
        export_interval_millis: 1000
        export_timeout_millis: 500
    exporter:
      type: otlp
      options:
        # OpenTelemetry Collector default local HTTP endpoint.
        endpoint: http://localhost:4318/v1/metrics
  traces:
    exporter:
      type: otlp
      options:
        # OpenTelemetry Collector default local HTTP endpoint.
        endpoint: http://localhost:4318/v1/traces
The telemetry provider is then created in Python and attached to storage clients:

import multistorageclient
import multistorageclient.telemetry

# Create a telemetry provider instance.
#
# Returns a proxy object by default to make the OpenTelemetry Python SDK work
# correctly with Python multiprocessing.
#
# When on the main process, this creates a Python multiprocessing manager server
# listening on 127.0.0.1:{dynamic port based on the current process PID}
# and connects to it.
#
# When in a child process, this connects to the Python multiprocessing manager server
# listening on 127.0.0.1:{dynamic port based on the parent process PID}.
#
# The telemetry mode and address can be provided as function parameters.
# See the API reference for more details.
telemetry = multistorageclient.telemetry.init()

# Create a storage client with the telemetry provider instance.
client = multistorageclient.StorageClient(
    config=multistorageclient.StorageClientConfig.from_file(
        profile="data",
        telemetry=telemetry,
    )
)

# Set the telemetry provider instance to use when MSC shortcuts create storage clients.
multistorageclient.set_telemetry(telemetry=telemetry)

# Create a storage client for a profile and open an object/file.
multistorageclient.open("msc://data/file.txt")
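Building on the comments above, here is a hedged sketch (not from the MSC documentation) of using MSC shortcuts from worker processes. The worker-initializer pattern, the repeated example path, and the assumption that each worker can simply call init() again to connect to the main process's manager server are illustrative choices, not documented MSC behavior.

import multiprocessing

import multistorageclient
import multistorageclient.telemetry


def init_worker() -> None:
    # In a child process, init() connects to the manager server started by the
    # main process (see the comments in the example above).
    telemetry = multistorageclient.telemetry.init()
    multistorageclient.set_telemetry(telemetry=telemetry)


def read_object(path: str):
    # MSC shortcuts reuse the telemetry provider registered in init_worker().
    with multistorageclient.open(path) as f:
        return f.read()


if __name__ == "__main__":
    # Main process: start the manager server and register the provider.
    telemetry = multistorageclient.telemetry.init()
    multistorageclient.set_telemetry(telemetry=telemetry)

    # The same example object as above, read from two worker processes.
    paths = ["msc://data/file.txt"] * 4
    with multiprocessing.Pool(processes=2, initializer=init_worker) as pool:
        contents = pool.map(read_object, paths)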
Metrics¶
MSC prefers publishing raw samples when possible to support arbitrary post-hoc aggregations.
This is done through high-frequency gauges, with sums used for accurate global aggregates.
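As a toy illustration of post-hoc aggregation (plain Python with made-up latency values, not MSC code): any summary statistic can still be computed after the raw samples have been collected.

import statistics

# Made-up per-operation latency samples, in seconds.
latency_samples = [0.012, 0.015, 0.011, 0.250, 0.013]

# None of these aggregates had to be chosen before collection.
maximum = max(latency_samples)                           # 0.25
average = statistics.fmean(latency_samples)              # ~0.0602
p99 = statistics.quantiles(latency_samples, n=100)[98]   # 99th percentile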
Concepts¶
Theory¶
- sample¶
Individual metric data point.
- distribution¶
Collection of samples.
- true distribution¶
A distribution with all samples (e.g. true distribution of fair 6-sided dice rolls).
This may have infinite samples.
- empirical distribution¶
A distribution with a subset of samples (e.g. empirical distribution of 1000 fair 6-sided dice rolls).
- aggregate¶
Compress a distribution into a summary statistic (e.g. minimum, maximum, sum, average, percentile).
- decomposable aggregate¶
An aggregate which can be recursively applied.
For example, the maximum is a decomposable aggregate because the global maximum can be found by taking the maximum of the local maxima of sample subsets.
On the other hand, the average is not a decomposable aggregate because the global average cannot be found by taking the average of the local averages of sample subsets (see the numeric sketch after this list).
- sampling rate¶
For a signal over time (e.g. metric data points over time), this is how often a sample is collected.
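A numeric sketch of the decomposable-aggregate distinction, using made-up samples split across two workers:

# Made-up latency samples collected by two workers.
worker_a = [1.0, 2.0, 9.0]
worker_b = [3.0]
combined = worker_a + worker_b

# Maximum is decomposable: the maximum of the local maxima is the global maximum.
assert max(max(worker_a), max(worker_b)) == max(combined)  # 9.0

# Average is not: averaging the local averages gives 3.5,
# while the global average is 3.75.
average_of_averages = (sum(worker_a) / len(worker_a) + sum(worker_b) / len(worker_b)) / 2
global_average = sum(combined) / len(combined)
assert average_of_averages != global_average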
OpenTelemetry¶
OpenTelemetry provides several metric data point types. Of note are the following, with a short recording sketch after the list:
- gauge¶
Captures a distribution.
If the sampling rate is high enough, this captures the true distribution.
If the sampling rate is not high enough, this captures the empirical distribution. This preserves local (i.e. per-sample) information at the expense of global (i.e. aggregate) information.
- sum¶
Captures sums, a decomposable aggregate.
This preserves global (i.e. aggregate) information at the expense of local (i.e. per-sample) information.
- histogram¶
Captures a distribution by bucketing samples by value.
Not used by the MSC since buckets must be pre-defined, requiring the distribution to be known ahead of time.
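A minimal sketch of recording these two data point types with the OpenTelemetry Python API. This is not MSC's internal code; the instrument names and attributes are made up, and synchronous gauges require a reasonably recent opentelemetry-api release (older releases only offer observable gauges).

from opentelemetry import metrics

meter = metrics.get_meter("example")

# Gauge: record every raw sample, preserving per-sample information.
latency_gauge = meter.create_gauge("example.latency", unit="s")

# Sum (counter): a decomposable aggregate, preserving accurate global totals.
request_counter = meter.create_counter("example.requests", unit="{request}")

attributes = {"operation": "read"}
request_counter.add(1, attributes=attributes)
latency_gauge.set(0.012, attributes=attributes)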
Emitted Metrics¶
Storage Provider¶
multistorageclient.latency¶
The time it took for an operation to complete.
Operations: All
Metric data point: Gauge
Unit: Seconds
Attributes: multistorageclient.provider (e.g. s3), multistorageclient.operation (e.g. read), multistorageclient.status (e.g. success, error.{Python error class name})
Timestamp: Operation End

multistorageclient.data_size¶
The data (object/file) size for an operation.
Operations: Successful Read, Write, Copy
Metric data point: Gauge
Unit: Bytes
Attributes: multistorageclient.provider (e.g. s3), multistorageclient.operation (e.g. read), multistorageclient.status (e.g. success, error.{Python error class name})
Timestamp: Operation End

multistorageclient.data_rate¶
The data size divided by the latency for an operation. Equivalent to an operation’s average data rate.
Operations: Successful Read, Write, Copy
Metric data point: Gauge
Unit: Bytes/Second
Attributes: multistorageclient.provider (e.g. s3), multistorageclient.operation (e.g. read), multistorageclient.status (e.g. success, error.{Python error class name})
Timestamp: Operation End

multistorageclient.request.sum¶
The sum of operation starts.
Operations: All
Metric data point: Sum
Unit: Requests
Attributes: multistorageclient.provider (e.g. s3), multistorageclient.operation (e.g. read)
Timestamp: Operation Start

multistorageclient.response.sum¶
The sum of operation ends.
Operations: All
Metric data point: Sum
Unit: Responses
Attributes: multistorageclient.provider (e.g. s3), multistorageclient.operation (e.g. read), multistorageclient.status (e.g. success, error.{Python error class name})
Timestamp: Operation End

multistorageclient.data_size.sum¶
The data (object/file) size for all operations.
Operations: Successful Read, Write, Copy
Metric data point: Sum
Unit: Bytes
Attributes: multistorageclient.provider (e.g. s3), multistorageclient.operation (e.g. read), multistorageclient.status (e.g. success, error.{Python error class name})
Timestamp: Operation End
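As an example of working with these metrics after export, averaging per-operation multistorageclient.data_rate samples does not give aggregate throughput, because the average is not a decomposable aggregate; dividing totals does. A toy calculation with made-up values, assuming two sequential operations:

# Two made-up sequential read operations: (data size in bytes, latency in seconds).
operations = [(1_000_000, 10.0), (100_000_000, 1.0)]

# Per-operation rates, as reported by the multistorageclient.data_rate gauge.
rates = [size / latency for size, latency in operations]  # [100 KB/s, 100 MB/s]

# Averaging the per-operation rates suggests ~50.05 MB/s.
average_of_rates = sum(rates) / len(rates)

# Aggregate throughput from totals (summed data sizes over summed latencies)
# is ~9.18 MB/s: 101 MB moved in 11 seconds.
total_bytes = sum(size for size, _ in operations)
total_seconds = sum(latency for _, latency in operations)
aggregate_rate = total_bytes / total_seconds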
Traces¶
MSC publishes spans using a tail sampler that keeps traces for errors and high-latency operations. Apart from the exporter, the span pipeline isn't currently configurable.
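Conceptually, a tail sampler decides whether to keep a trace only after its spans have ended, so the decision can take the outcome into account. A sketch of such a decision (not MSC's implementation; the latency threshold is a made-up example):

# Conceptual tail-sampling decision, evaluated after a span has ended.
HIGH_LATENCY_SECONDS = 1.0  # hypothetical threshold, not an MSC setting


def keep_trace(had_error: bool, latency_seconds: float) -> bool:
    # Keep traces for errors and high-latency operations; drop the rest.
    return had_error or latency_seconds >= HIGH_LATENCY_SECONDS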