Dataset Buckets#

Register external cloud storage buckets (S3, GCS, Azure) with OSMO to organize datasets across multiple storage locations. This configuration is optional.

Why Use Dataset Buckets?#

Registering multiple dataset buckets provides flexible data management in OSMO:

Automatic Deduplication

Content-addressed storage means identical files are stored only once across versions, saving storage costs and transfer time.

Version Control

Full version history for datasets—track changes, rollback to previous versions, and maintain reproducible workflows.

Organize by Team or Project

Separate datasets across different buckets for access control, billing, or organizational boundaries.

Use Existing Infrastructure

Register pre-existing S3/GCS/Azure buckets without migrating data—integrate seamlessly with existing storage.

Multi-Cloud Support

Mix storage providers (AWS S3, Google Cloud Storage, Azure Blob) in the same OSMO deployment.

Simplified References

Use short names (e.g., production/model-v2) instead of full URIs (s3://long-bucket-name/model-v2).

Persistent & Shareable

Datasets persist beyond workflow execution, can be shared across workflows and teams, and are accessible via the CLI, workflows, or the Web UI.

How It Works#

Bucket Registration#

1. Register Bucket 🪣: add a cloud storage location

2. Set Default: choose the primary bucket

3. Use in Workflows 🔗: reference datasets by name

Bucket Naming#

Once a bucket is registered under a name (say production), datasets in that bucket are referenced as:

  • production/imagenet

  • production/resnet50

If the bucket is set as the default bucket, datasets can be referenced without the bucket name prefix:

  • imagenet

  • resnet50
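For instance, a workflow consuming datasets from the production bucket could reference them either way (a sketch; the dataset names are illustrative, and production is assumed to be the default bucket):

```yaml
# Hypothetical workflow inputs; "production" is registered and set as the default bucket
inputs:
  - production/imagenet   # explicit bucket prefix
  - resnet50              # no prefix: resolves via the default bucket (production)
```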

Practical Guide#

📄 Edit in your Helm values file

Everything in this section should be added to your Helm values file under services.configs. Apply changes with helm upgrade.

Registering Buckets#

Step 1: Register a Single Bucket

Add your first cloud storage bucket under services.configs.dataset.buckets:

services:
  configs:
    enabled: true
    dataset:
      buckets:
        production:
          dataset_path: s3://my-production-bucket
          region: us-east-1
          mode: read-write

Step 2: Register Multiple Buckets

Add buckets from different cloud providers:

services:
  configs:
    dataset:
      buckets:
        production:
          dataset_path: s3://prod-datasets
          region: us-east-1
          mode: read-write
        staging:
          dataset_path: s3://staging-datasets
          region: us-east-1
          mode: read-write
        research:
          dataset_path: gs://research-bucket
          region: us-central1
          mode: read-write
        archive:
          dataset_path: azure://archive-storage
          region: eastus
          mode: read-only
      default_bucket: production

Step 3: Attach Credentials (Optional)

If a bucket requires credentials, create a Kubernetes Secret with one key per credential field and reference it via default_credential.secretName:

kubectl create secret generic prod-bucket-cred \
    --from-literal=access_key_id=<your-access-key-id> \
    --from-literal=access_key=<your-secret-access-key>

Then reference the Secret in your values file:

services:
  configs:
    secretRefs:
      - secretName: prod-bucket-cred
    dataset:
      buckets:
        production:
          dataset_path: s3://prod-datasets
          region: us-east-1
          mode: read-write
          default_credential:
            secretName: prod-bucket-cred

Buckets that rely on workload identity (IRSA, Pod Identity) or public read-only access can leave default_credential as null.
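For example, a read-only bucket accessed via workload identity or public access could be registered with no credential reference at all (a sketch; the bucket name and path are illustrative):

```yaml
services:
  configs:
    dataset:
      buckets:
        open-data:
          dataset_path: s3://public-datasets
          region: us-east-1
          mode: read-only
          default_credential: null  # auth comes from IRSA / Pod Identity or public access
```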

Step 4: Apply

helm upgrade osmo deployments/charts/service -f my-values.yaml

Step 5: Verify Configuration

List all registered buckets:

$ osmo bucket list

Bucket               Location
============================================
production (default) s3://prod-datasets
staging              s3://staging-datasets
research             gs://research-bucket
archive              azure://archive-storage

Usage Examples#

Team-Based Buckets

Separate datasets by team or department:

services:
  configs:
    dataset:
      buckets:
        robotics:
          dataset_path: s3://robotics-team-data
          region: us-east-1
          mode: read-write
        ml-research:
          dataset_path: s3://ml-research-data
          region: us-east-1
          mode: read-write
        engineering:
          dataset_path: s3://engineering-shared
          region: us-east-1
          mode: read-write
      default_bucket: robotics

Workflow Usage:

inputs:
  - robotics/sim-data-2024      # Robotics team bucket
  - ml-research/models          # ML research bucket
  - synthetic-data              # Default bucket (robotics)

Environment-Based Buckets

Organize by development stage:

services:
  configs:
    dataset:
      buckets:
        dev:
          dataset_path: s3://dev-datasets
          region: us-east-1
          mode: read-write
        staging:
          dataset_path: s3://staging-datasets
          region: us-east-1
          mode: read-write
        production:
          dataset_path: s3://prod-datasets
          region: us-east-1
          mode: read-write
      default_bucket: dev

Multi-Cloud Buckets

Mix storage providers:

services:
  configs:
    dataset:
      buckets:
        aws-main:
          dataset_path: s3://primary-storage
          region: us-east-1
          mode: read-write
        gcp-backup:
          dataset_path: gs://backup-datasets
          region: us-central1
          mode: read-write
        azure-archive:
          dataset_path: azure://cold-storage
          region: eastus
          mode: read-only
      default_bucket: aws-main

Troubleshooting#

Bucket Not Found
  • Verify bucket name matches exactly (case-sensitive)

  • Check bucket was added before workflow submission

  • Run osmo bucket list to see all registered buckets

Access Denied Errors
  • Ensure the referenced Secret exists and secretName is listed in secretRefs

  • Verify bucket permissions allow read/write operations

  • Check bucket region matches OSMO cluster region

Default Bucket Not Working
  • Confirm default_bucket name matches a registered bucket

  • Verify configuration was applied: osmo config get DATASET

  • Check workflows use correct dataset reference format

Tip

Best Practices

  • Use descriptive bucket names (team, project, or environment)

  • Set a default bucket for the most common use case

  • Document bucket purposes and access policies for teams

  • Use separate buckets for production vs. development data

  • Consider data locality (bucket region near compute)

  • Review and clean up unused buckets quarterly

Note

Supported storage protocols:
  • s3:// (AWS S3)

  • gs:// (Google Cloud Storage)

  • azure:// (Azure Blob Storage)

See also

  • Learn more about datasets in OSMO at Datasets