Telemetry

The MSC provides telemetry through the OpenTelemetry Python API and SDK.

Only the OpenTelemetry Python API is included by default. The OpenTelemetry Python SDK is included with the observability-otel extra.

Telemetry can be configured with the opentelemetry dictionary in the MSC configuration. See Configuration Reference for all configuration options.

Example MSC configuration.
profiles:
  data:
    # ...
opentelemetry:
  metrics:
    attributes:
      - type: static
        options:
          attributes:
            organization: NVIDIA
            cluster: DGX SuperPOD 1
      - type: host
        options:
          attributes:
            node: name
      - type: process
        options:
          attributes:
            process: pid
    reader:
      options:
        # ≤ 100 Hz collect frequency.
        collect_interval_millis: 10
        collect_timeout_millis: 100
        # ≤ 1 Hz export frequency.
        export_interval_millis: 1000
        export_timeout_millis: 500
    exporter:
      type: otlp
      options:
        # OpenTelemetry Collector default local HTTP endpoint.
        endpoint: http://localhost:4318/v1/metrics
  traces:
    exporter:
      type: otlp
      options:
        # OpenTelemetry Collector default local HTTP endpoint.
        endpoint: http://localhost:4318/v1/traces
Example usage with automatic telemetry initialization.
import multistorageclient

# Directly create a storage client for a profile and open an object/file.
client = multistorageclient.StorageClient(
    config=multistorageclient.StorageClientConfig.from_file(profile="data")
)
client.open("file.txt")

# Use an MSC shortcut to create a storage client for a profile and open an object/file.
multistorageclient.open("msc://data/file.txt")

This will automatically create telemetry provider instances, which spin up dedicated telemetry processes and open local network ports (for IPC) as needed.

If the default telemetry provider creation doesn’t behave as desired, you can manually create a telemetry provider to use with storage client creation flows.

Example usage with manual telemetry initialization.
import multistorageclient
import multistorageclient.telemetry

# Create a telemetry provider instance.
#
# Returns a proxy object by default to make the OpenTelemetry Python SDK work
# correctly with Python multiprocessing.
#
# When on the main process, this creates a Python multiprocessing manager server
# listening on 127.0.0.1:{dynamic port based on the current process PID}
# and connects to it.
#
# When in a child process, this connects to the Python multiprocessing manager server
# listening on 127.0.0.1:{dynamic port based on the parent process PID}.
#
# The telemetry mode and address can be provided as function parameters.
# See the API reference for more details.
telemetry = multistorageclient.telemetry.init()

# Directly create a storage client with the telemetry provider instance and open an object/file.
client = multistorageclient.StorageClient(
    config=multistorageclient.StorageClientConfig.from_file(
        profile="data",
        telemetry=telemetry
    )
)
client.open("file.txt")

# Set the telemetry provider instance to use when MSC shortcuts create storage clients.
multistorageclient.set_telemetry(telemetry=telemetry)

# Use an MSC shortcut to create a storage client for a profile and open an object/file.
multistorageclient.open("msc://data/file.txt")

Metrics

MSC prefers publishing raw samples when possible to support arbitrary post-hoc aggregation.

This is done with high-frequency gauges, with sums used where accurate global aggregates are needed.
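As an illustration (the latency values below are made up), raw gauge samples support any aggregation after the fact, whereas a pre-aggregated value fixes the statistic up front:

```python
import statistics

# Hypothetical raw latency samples (seconds) as exported by a gauge.
samples = [0.012, 0.015, 0.350, 0.011, 0.014]

# Any statistic can be computed post hoc from the raw samples.
maximum = max(samples)
mean = statistics.mean(samples)
p50 = statistics.median(samples)

# A sum alone only supports sum-derived aggregates (e.g. totals, rates).
# Per-sample statistics like the median can't be recovered from it.
total = sum(samples)
```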

Concepts

Theory

sample

Individual metric data point.

distribution

Collection of samples.

true distribution

A distribution with all samples (e.g. true distribution of fair 6-sided dice rolls).

This may have infinite samples.

empirical distribution

A distribution with a subset of samples (e.g. empirical distribution of 1000 fair 6-sided dice rolls).

aggregate

Compress a distribution into a summary statistic (e.g. minimum, maximum, sum, average, percentile).

decomposable aggregate

An aggregate which can be recursively applied.

For example, the maximum is a decomposable aggregate because the global maximum can be found by taking the maximum of the local maxima of sample subsets.

On the other hand, the average is not a decomposable aggregate because the global average cannot be found by taking the average of the local averages of sample subsets.
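A quick sanity check of the above in Python (the sample values are arbitrary):

```python
import statistics

# Two subsets of samples, e.g. collected by two processes.
subset_a = [1, 5]
subset_b = [2, 2, 2]
all_samples = subset_a + subset_b

# Maximum is decomposable: the maximum of the local maxima is the global maximum.
assert max(max(subset_a), max(subset_b)) == max(all_samples)

# Average is not: averaging the local averages weights each subset equally,
# which disagrees with the global average when subset sizes differ.
average_of_averages = statistics.mean(
    [statistics.mean(subset_a), statistics.mean(subset_b)]
)
global_average = statistics.mean(all_samples)
assert average_of_averages != global_average
```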

sampling rate

For a signal over time (e.g. metric data points over time), this is how often a sample is collected.

OpenTelemetry

OpenTelemetry provides several kinds of metric data points. Of note are:

gauge

Captures a distribution.

If the sampling rate is high enough, this captures the true distribution.

If the sampling rate is not high enough, this captures the empirical distribution. This preserves local (i.e. per-sample) information at the expense of global (i.e. aggregate) information.

sum

Captures sums, a decomposable aggregate.

This preserves global (i.e. aggregate) information at the expense of local (i.e. per-sample) information.

histogram

Captures a distribution by bucketing samples by value.

Not used by the MSC since buckets must be pre-defined, requiring the distribution to be known ahead of time.

Emitted Metrics

Storage Provider

multistorageclient.latency

The time it took for an operation to complete.

  • Operations:

    • All

  • Metric data point:

    • Gauge

  • Unit:

    • Seconds

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

    • multistorageclient.status (e.g. success, error.{Python error class name})

  • Timestamp:

    • Operation End

multistorageclient.data_size

The data (object/file) size for an operation.

  • Operations:

    • Successful Read, Write, Copy

  • Metric data point:

    • Gauge

  • Unit:

    • Bytes

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

    • multistorageclient.status (e.g. success, error.{Python error class name})

  • Timestamp:

    • Operation End

multistorageclient.data_rate

The data size divided by the latency for an operation. Equivalent to an operation’s average data rate.

  • Operations:

    • Successful Read, Write, Copy

  • Metric data point:

    • Gauge

  • Unit:

    • Bytes/Second

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

    • multistorageclient.status (e.g. success, error.{Python error class name})

  • Timestamp:

    • Operation End
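For example, with hypothetical gauge values for a single successful read, the data rate is simply the size divided by the latency:

```python
# Hypothetical gauge samples for one successful read operation.
data_size_bytes = 64 * 1024 * 1024  # multistorageclient.data_size
latency_seconds = 0.5               # multistorageclient.latency

# multistorageclient.data_rate for the same operation (bytes/second).
data_rate = data_size_bytes / latency_seconds
```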

multistorageclient.request.sum

The sum of operation starts.

  • Operations:

    • All

  • Metric data point:

    • Sum

  • Unit:

    • Requests

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

  • Timestamp:

    • Operation Start

multistorageclient.response.sum

The sum of operation ends.

  • Operations:

    • All

  • Metric data point:

    • Sum

  • Unit:

    • Responses

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

    • multistorageclient.status (e.g. success, error.{Python error class name})

  • Timestamp:

    • Operation End
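Since request.sum counts operation starts and response.sum counts operation ends, their difference at any instant gives the number of in-flight operations (the counter values here are hypothetical):

```python
# Hypothetical counter values read at the same instant.
requests_started = 1000   # multistorageclient.request.sum
responses_finished = 997  # multistorageclient.response.sum

# Operations started but not yet finished.
in_flight = requests_started - responses_finished
```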

multistorageclient.data_size.sum

The total data (object/file) size across operations.

  • Operations:

    • Successful Read, Write, Copy

  • Metric data point:

    • Sum

  • Unit:

    • Bytes

  • Attributes:

    • multistorageclient.provider (e.g. s3)

    • multistorageclient.operation (e.g. read)

    • multistorageclient.status (e.g. success, error.{Python error class name})

  • Timestamp:

    • Operation End

Traces

MSC publishes spans using a tail sampler that keeps error and high-latency traces. Aside from the exporter, the span pipeline currently isn't configurable.