Telemetry

The MSC provides telemetry through the OpenTelemetry Python API and SDK.

Only the OpenTelemetry Python API is included by default. The OpenTelemetry Python SDK is included with the observability-otel extra.

Telemetry is configured in two steps: add an opentelemetry dictionary to the MSC configuration, then create a telemetry provider and pass it to the storage client creation flows. See the Configuration Reference for all configuration options.

Example MSC configuration.
profiles:
  data:
    # ...
opentelemetry:
  metrics:
    attributes:
      - type: static
        options:
          attributes:
            organization: NVIDIA
            cluster: DGX SuperPOD 1
      - type: host
        options:
          attributes:
            node: name
      - type: process
        options:
          attributes:
            process: pid
    reader:
      options:
        # ≤ 100 Hz collect frequency.
        collect_interval_millis: 10
        collect_timeout_millis: 100
        # ≤ 1 Hz export frequency.
        export_interval_millis: 1000
        export_timeout_millis: 500
    exporter:
      type: otlp
      options:
        # OpenTelemetry Collector default local HTTP endpoint.
        endpoint: http://localhost:4318/v1/metrics
  traces:
    exporter:
      type: otlp
      options:
        # OpenTelemetry Collector default local HTTP endpoint.
        endpoint: http://localhost:4318/v1/traces
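
The frequency comments in the reader options follow directly from the interval values. As a minimal sketch (the helper name max_frequency_hz is hypothetical and not part of MSC), the arithmetic is:

```python
def max_frequency_hz(interval_millis: int) -> float:
    """Upper bound on events per second for a fixed interval in milliseconds."""
    return 1000 / interval_millis


# Values taken from the example configuration above.
collect_hz = max_frequency_hz(10)     # 100.0
export_hz = max_frequency_hz(1000)    # 1.0

# Number of collection intervals that fit into one export interval.
collections_per_export = 1000 // 10   # 100
```

The bounds are upper limits: a collection or export that runs long lowers the effective frequency.
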
Example usage.
import multistorageclient
import multistorageclient.telemetry

# Create a telemetry provider instance.
#
# Returns a proxy object by default to make the OpenTelemetry Python SDK work
# correctly with Python multiprocessing.
#
# When on the main process, this creates a Python multiprocessing manager server
# listening on 127.0.0.1:{dynamic port based on the current process PID}
# and connects to it.
#
# When in a child process, this connects to the Python multiprocessing manager server
# listening on 127.0.0.1:{dynamic port based on the parent process PID}.
#
# The telemetry mode and address can be provided as function parameters.
# See the API reference for more details.
telemetry = multistorageclient.telemetry.init()

# Create a storage client with the telemetry provider instance.
client = multistorageclient.StorageClient(
    config=multistorageclient.StorageClientConfig.from_file(
        profile="data",
        telemetry=telemetry
    )
)

# Set the telemetry provider instance to use when MSC shortcuts create storage clients.
multistorageclient.set_telemetry(telemetry=telemetry)

# Create a storage client for a profile and open an object/file.
multistorageclient.open("msc://data/file.txt")

Metrics

MSC prefers publishing raw samples when possible to support arbitrary post-hoc aggregations.

It does this with high-frequency gauges, and uses sums where accurate global aggregates are needed.
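
To illustrate what post-hoc aggregation looks like, here is a minimal sketch using only the Python standard library. The latency values are made up for illustration and are not MSC output. Because the raw samples are kept, the choice of aggregate can be made after collection.

```python
import statistics

# Hypothetical raw latency samples in seconds (illustrative values only).
latencies = [0.012, 0.015, 0.011, 0.250, 0.014, 0.013, 0.016, 0.012, 0.900, 0.015]

# With raw samples, any aggregate can be chosen after collection.
minimum = min(latencies)
maximum = max(latencies)
average = statistics.fmean(latencies)
median = statistics.median(latencies)

# quantiles(n=100) returns the 99 cut points between percentiles; index 94 is p95.
p95 = statistics.quantiles(latencies, n=100, method="inclusive")[94]
```

Had only the average been published, the median and p95 could not be recovered afterwards.
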

Concepts

Theory

sample

Individual metric data point.

distribution

Collection of samples.

true distribution

A distribution with all samples (e.g. true distribution of fair 6-sided dice rolls).

This may have infinite samples.

empirical distribution

A distribution with a subset of samples (e.g. empirical distribution of 1000 fair 6-sided dice rolls).
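
The dice example can be sketched in a few lines. This is an illustrative simulation with a fixed seed, not MSC code.

```python
import random
from collections import Counter

# True distribution of a fair 6-sided die: each face has probability 1/6.
true_distribution = {face: 1 / 6 for face in range(1, 7)}

# Empirical distribution from 1000 rolls (seeded so the run is repeatable).
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(1000)]
counts = Counter(rolls)
empirical_distribution = {face: counts[face] / len(rolls) for face in range(1, 7)}

# The empirical frequencies approximate 1/6 but generally do not equal it.
max_deviation = max(
    abs(empirical_distribution[face] - true_distribution[face]) for face in range(1, 7)
)
```
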

aggregate

Compress a distribution into a summary statistic (e.g. minimum, maximum, sum, average, percentile).

decomposable aggregate

An aggregate which can be recursively applied.

For example, the maximum is a decomposable aggregate because the global maximum can be found by taking the maximum of the local maxima of sample subsets.

On the other hand, the average is not a decomposable aggregate because the global average cannot, in general, be found by taking the average of the local averages of sample subsets. That shortcut only works when every subset has the same number of samples.
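
A small worked example, with illustrative values, shows the difference between the two cases. It also shows a common workaround: sums and counts are both decomposable, so carrying both allows the global average to be recovered.

```python
import statistics

# Two sample subsets of different sizes (illustrative values).
subset_a = [1.0, 2.0, 3.0]
subset_b = [10.0]
all_samples = subset_a + subset_b

# Maximum: the maximum of the local maxima equals the global maximum.
global_max = max(all_samples)                                # 10.0
max_of_local_maxima = max(max(subset_a), max(subset_b))      # 10.0

# Average: the average of the local averages differs from the global average
# when the subsets have different sizes.
global_average = statistics.fmean(all_samples)               # 16 / 4 = 4.0
average_of_local_averages = statistics.fmean(
    [statistics.fmean(subset_a), statistics.fmean(subset_b)]
)                                                            # (2 + 10) / 2 = 6.0

# Sums and counts are decomposable, so the global average can be recovered
# by combining them and dividing at the end.
total = sum(subset_a) + sum(subset_b)
count = len(subset_a) + len(subset_b)
recovered_average = total / count                            # 4.0
```
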

sampling rate

For a signal over time (e.g. metric data points over time), this is how often a sample is collected.

OpenTelemetry

OpenTelemetry provides several types of metric data point. Of note are:

gauge

Captures a distribution.

If the sampling rate is high enough, this captures the true distribution.

If the sampling rate is not high enough, this captures the empirical distribution. This preserves local (i.e. per-sample) information at the expense of global (i.e. aggregate) information.

sum

Captures sums, a decomposable aggregate.

This preserves global (i.e. aggregate) information at the expense of local (i.e. per-sample) information.
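
The trade-off between gauges and sums can be sketched with a small simulation. It assumes last-value semantics for the gauge (each collection sees only the most recent recording in its window), which is a simplification for illustration and not MSC code. The recordings are made-up values.

```python
# Recordings as (timestamp in milliseconds, bytes transferred); illustrative values.
recordings = [(1, 100), (4, 200), (12, 50), (13, 400), (27, 300)]


def collect_gauge(recordings, interval_ms, end_ms):
    """Return the last value recorded in each collection window, if any."""
    collected = []
    for window_start in range(0, end_ms, interval_ms):
        window = [
            value
            for timestamp, value in recordings
            if window_start <= timestamp < window_start + interval_ms
        ]
        if window:
            collected.append(window[-1])
    return collected


# Slow collection (every 10 ms): some samples are overwritten before collection.
slow = collect_gauge(recordings, interval_ms=10, end_ms=30)   # [200, 400, 300]

# Fast collection (every 1 ms): every sample is captured.
fast = collect_gauge(recordings, interval_ms=1, end_ms=30)    # [100, 200, 50, 400, 300]

# A sum accumulates every recording, so the total stays exact at any collection rate.
exact_total = sum(value for _, value in recordings)           # 1050
```

Summing the slowly collected gauge values gives 900 rather than 1050, which is why a separate sum is needed for accurate global aggregates.
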

histogram

Captures a distribution by bucketing samples by value.

Not used by the MSC since buckets must be pre-defined, requiring the distribution to be known ahead of time.
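
The drawback of pre-defined buckets can be seen in a minimal sketch. The boundaries and samples below are hypothetical values chosen to show what happens when the real distribution falls outside the expected range.

```python
import bisect

# Pre-defined upper bucket boundaries in seconds (a hypothetical choice).
boundaries = [0.005, 0.01, 0.025, 0.05, 0.1]


def bucket_counts(samples, boundaries):
    """Count samples per bucket; the final bucket holds values above the last boundary."""
    counts = [0] * (len(boundaries) + 1)
    for sample in samples:
        # bisect_left places a sample equal to a boundary in the bucket
        # whose upper bound is that boundary.
        counts[bisect.bisect_left(boundaries, sample)] += 1
    return counts


# Latencies that mostly exceed the expected range (illustrative values).
samples = [0.004, 0.2, 0.7, 1.5, 3.0, 12.0]
counts = bucket_counts(samples, boundaries)   # [1, 0, 0, 0, 0, 5]
```

Five of the six samples land in the overflow bucket, so a 0.2 second operation and a 12 second operation become indistinguishable.
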

Emitted Metrics

Storage Provider

multistorageclient.latency

The time it took for an operation to complete.

  • Operations:

    • All

  • Metric data point:

    • Gauge

  • Unit:

    • Seconds

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

    • multistorageclient.status (e.g. success, error.{Python error class name})

  • Timestamp:

    • Operation End
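
As a rough sketch of how a latency value and a status attribute of this shape could be produced, consider the following. The timed_call helper is hypothetical and not part of the MSC API. It swallows the exception only to keep the example short.

```python
import time


def timed_call(operation, *args, **kwargs):
    """Run an operation and return (result, latency_seconds, status).

    Hypothetical helper for illustration; not part of the MSC API.
    """
    start = time.perf_counter()
    try:
        result = operation(*args, **kwargs)
        status = "success"
    except Exception as error:
        result = None
        # Mirrors the documented format: error.{Python error class name}.
        status = f"error.{type(error).__name__}"
    # Latency is measured at operation end, whether it succeeded or failed.
    latency = time.perf_counter() - start
    return result, latency, status


def failing_read():
    raise FileNotFoundError("no such object")


_, ok_latency, ok_status = timed_call(lambda: b"data")
_, err_latency, err_status = timed_call(failing_read)
# ok_status is "success"; err_status is "error.FileNotFoundError".
```
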

multistorageclient.data_size

The data (object/file) size for an operation.

  • Operations:

    • Successful Read, Write, Copy

  • Metric data point:

    • Gauge

  • Unit:

    • Bytes

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

    • multistorageclient.status (e.g. success, error.{Python error class name})

  • Timestamp:

    • Operation End

multistorageclient.data_rate

The data size divided by the latency for an operation. Equivalent to an operation’s average data rate.

  • Operations:

    • Successful Read, Write, Copy

  • Metric data point:

    • Gauge

  • Unit:

    • Bytes/Second

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

    • multistorageclient.status (e.g. success, error.{Python error class name})

  • Timestamp:

    • Operation End
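
The data rate definition reduces to a single division. A minimal sketch, using a hypothetical helper name and illustrative values:

```python
def data_rate(data_size_bytes: int, latency_seconds: float) -> float:
    """Average data rate in bytes/second for one operation."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return data_size_bytes / latency_seconds


# A 64 MiB read that took 0.5 seconds (illustrative values).
rate = data_rate(64 * 1024 * 1024, 0.5)   # 134217728.0 bytes/second (128 MiB/s)
```
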

multistorageclient.request.sum

The sum of operation starts.

  • Operations:

    • All

  • Metric data point:

    • Sum

  • Unit:

    • Requests

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

  • Timestamp:

    • Operation Start

multistorageclient.response.sum

The sum of operation ends.

  • Operations:

    • All

  • Metric data point:

    • Sum

  • Unit:

    • Responses

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

    • multistorageclient.status (e.g. success, error.{Python error class name})

  • Timestamp:

    • Operation End
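
One possible use of the request and response sums together is estimating how many operations are in flight. This is an inference from the two definitions rather than something stated by MSC, and the readings below are hypothetical values. Because the response sum carries a status attribute that the request sum lacks, responses are first totalled across statuses.

```python
# Hypothetical cumulative sums read at one instant, keyed by attribute values.
request_sum = {("s3", "read"): 120, ("s3", "write"): 40}
response_sum = {
    ("s3", "read", "success"): 110,
    ("s3", "read", "error.TimeoutError"): 4,
    ("s3", "write", "success"): 40,
}


def in_flight(request_sum, response_sum):
    """Started-but-not-finished operations per (provider, operation)."""
    finished = {}
    for (provider, operation, _status), value in response_sum.items():
        key = (provider, operation)
        finished[key] = finished.get(key, 0) + value
    return {key: started - finished.get(key, 0) for key, started in request_sum.items()}


pending = in_flight(request_sum, response_sum)
# {("s3", "read"): 6, ("s3", "write"): 0}
```
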

multistorageclient.data_size.sum

The total data (object/file) size across all operations.

  • Operations:

    • Successful Read, Write, Copy

  • Metric data point:

    • Sum

  • Unit:

    • Bytes

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

    • multistorageclient.status (e.g. success, error.{Python error class name})

  • Timestamp:

    • Operation End

Traces

MSC publishes spans through a tail sampler, which publishes only traces that contain errors or have high latency. Apart from the exporter, the span pipeline is not currently configurable.
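
The idea behind tail sampling is that the keep-or-drop decision is made after a trace has finished, when its errors and durations are known. The following is a minimal sketch of such a decision, not MSC's implementation. The Span class is a stand-in rather than an OpenTelemetry type, and the 1 second threshold is an assumed value.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Minimal stand-in for a finished span (hypothetical, not an OpenTelemetry type)."""

    name: str
    duration_seconds: float
    is_error: bool = False


def keep_trace(spans, latency_threshold_seconds=1.0):
    """Decide after the trace completes whether to publish it.

    The 1 second default threshold is an assumed value for illustration.
    """
    return any(
        span.is_error or span.duration_seconds >= latency_threshold_seconds
        for span in spans
    )


fast_ok = [Span("read", 0.02), Span("list", 0.05)]
slow_ok = [Span("read", 2.5)]
fast_error = [Span("write", 0.01, is_error=True)]

decisions = [keep_trace(trace) for trace in (fast_ok, slow_ok, fast_error)]
# [False, True, True]
```
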