Dataset Buckets#
Register external cloud storage buckets (S3, GCS, Azure) with OSMO to organize datasets across multiple storage locations. This configuration is optional.
Why Use Dataset Buckets?#
Multiple dataset buckets provide flexible data management for OSMO datasets:
- ✓ Automatic Deduplication
Content-addressed storage means identical files are stored only once across versions, saving storage costs and transfer time.
- ✓ Version Control
Full version history for datasets—track changes, rollback to previous versions, and maintain reproducible workflows.
- ✓ Organize by Team or Project
Separate datasets across different buckets for access control, billing, or organizational boundaries.
- ✓ Use Existing Infrastructure
Register pre-existing S3/GCS/Azure buckets without migrating data—integrate seamlessly with existing storage.
- ✓ Multi-Cloud Support
Mix storage providers (AWS S3, Google Cloud Storage, Azure Blob) in the same OSMO deployment.
- ✓ Simplified References
Use short names (e.g., production/model-v2) instead of full URIs (e.g., s3://long-bucket-name/model-v2).
- ✓ Persistent & Shareable
Datasets persist beyond workflow execution and can be shared across workflows and teams, with access via the CLI, workflows, or the Web UI.
How It Works#
Bucket Registration#
1. Register Bucket 🪣
Add cloud storage
2. Set Default ⭐
Choose primary bucket
3. Use in Workflows 🔗
Reference datasets
Bucket Naming#
Once registered with a bucket name (say production), datasets in that bucket are referenced as:
production/imagenet
production/resnet50
If the bucket is set as the default bucket, datasets can be referenced without the bucket name prefix:
imagenet
resnet50
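As a sketch, assuming production is registered and set as the default bucket, a workflow could reference the same dataset either way (using the inputs: list format from the workflow examples on this page):

```yaml
inputs:
  - production/imagenet   # explicit bucket prefix
  - imagenet              # same dataset via the default bucket
```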
Practical Guide#
📄 Edit in your Helm values file
Everything in this section is added to your Helm values file under services.configs.
Apply changes with helm upgrade.
Registering Buckets#
Step 1: Register a Single Bucket
Add your first cloud storage bucket under services.configs.dataset.buckets:
services:
  configs:
    enabled: true
    dataset:
      buckets:
        production:
          dataset_path: s3://my-production-bucket
          region: us-east-1
          mode: read-write
Step 2: Register Multiple Buckets
Add buckets from different cloud providers:
services:
  configs:
    dataset:
      buckets:
        production:
          dataset_path: s3://prod-datasets
          region: us-east-1
          mode: read-write
        staging:
          dataset_path: s3://staging-datasets
          region: us-east-1
          mode: read-write
        research:
          dataset_path: gs://research-bucket
          region: us-central1
          mode: read-write
        archive:
          dataset_path: azure://archive-storage
          region: eastus
          mode: read-only
      default_bucket: production
Step 3: Attach Credentials (Optional)
If a bucket requires credentials, create a Kubernetes Secret with one key per credential field and reference it via default_credential.secretName:
kubectl create secret generic prod-bucket-cred \
  --from-literal=access_key_id=<your-access-key-id> \
  --from-literal=access_key=<your-secret-access-key>
services:
  configs:
    secretRefs:
      - secretName: prod-bucket-cred
    dataset:
      buckets:
        production:
          dataset_path: s3://prod-datasets
          region: us-east-1
          mode: read-write
          default_credential:
            secretName: prod-bucket-cred
Buckets that rely on workload identity (IRSA, Pod Identity) or public read-only access can leave default_credential as null.
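For example, a bucket accessed through workload identity can simply omit the credential block. This is a minimal sketch; the bucket name and path are hypothetical:

```yaml
services:
  configs:
    dataset:
      buckets:
        shared-public:
          dataset_path: s3://public-datasets
          region: us-east-1
          mode: read-only
          # No default_credential: access comes from IRSA / Pod Identity
          # or from public read-only bucket permissions.
```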
Step 4: Apply
helm upgrade osmo deployments/charts/service -f my-values.yaml
Step 5: Verify Configuration
List all registered buckets:
$ osmo bucket list
Bucket                Location
============================================
production (default)  s3://prod-datasets
staging               s3://staging-datasets
research              gs://research-bucket
archive               azure://archive-storage
Usage Examples#
Team-Based Buckets
Separate datasets by team or department:
services:
  configs:
    dataset:
      buckets:
        robotics:
          dataset_path: s3://robotics-team-data
          region: us-east-1
          mode: read-write
        ml-research:
          dataset_path: s3://ml-research-data
          region: us-east-1
          mode: read-write
        engineering:
          dataset_path: s3://engineering-shared
          region: us-east-1
          mode: read-write
      default_bucket: robotics
Workflow Usage:
inputs:
  - robotics/sim-data-2024   # Robotics team bucket
  - ml-research/models       # ML research bucket
  - synthetic-data           # Default bucket (robotics)
Environment-Based Buckets
Organize by development stage:
services:
  configs:
    dataset:
      buckets:
        dev:
          dataset_path: s3://dev-datasets
          region: us-east-1
          mode: read-write
        staging:
          dataset_path: s3://staging-datasets
          region: us-east-1
          mode: read-write
        production:
          dataset_path: s3://prod-datasets
          region: us-east-1
          mode: read-write
      default_bucket: dev
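With dev as the default, day-to-day workflows can omit the bucket prefix while cross-environment jobs reference buckets explicitly. A sketch with hypothetical dataset names:

```yaml
inputs:
  - experiment-logs           # default bucket (dev)
  - staging/candidate-model   # explicit staging reference
```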
Multi-Cloud Buckets
Mix storage providers:
services:
  configs:
    dataset:
      buckets:
        aws-main:
          dataset_path: s3://primary-storage
          region: us-east-1
          mode: read-write
        gcp-backup:
          dataset_path: gs://backup-datasets
          region: us-central1
          mode: read-write
        azure-archive:
          dataset_path: azure://cold-storage
          region: eastus
          mode: read-only
      default_bucket: aws-main
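Datasets from any of these providers can then be mixed in a single workflow, just as in the team-based example. A sketch with hypothetical dataset names:

```yaml
inputs:
  - gcp-backup/checkpoints    # GCS bucket
  - azure-archive/logs-2023   # read-only Azure bucket
  - training-data             # default bucket (aws-main)
```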
Troubleshooting#
- Bucket Not Found
Verify bucket name matches exactly (case-sensitive)
Check bucket was added before workflow submission
Run osmo bucket list to see all registered buckets
- Access Denied Errors
Ensure the referenced Secret exists and secretName is listed in secretRefs
Verify bucket permissions allow read/write operations
Check bucket region matches OSMO cluster region
- Default Bucket Not Working
Confirm default_bucket name matches a registered bucket
Verify configuration was applied: osmo config get DATASET
Check workflows use the correct dataset reference format
Tip
Best Practices
Use descriptive bucket names (team, project, or environment)
Set a default bucket for the most common use case
Document bucket purposes and access policies for teams
Use separate buckets for production vs. development data
Consider data locality (bucket region near compute)
Review and clean up unused buckets quarterly
Note
- Supported storage protocols:
s3:// (AWS S3)
gs:// (Google Cloud Storage)
azure:// (Azure Blob Storage)
See also
Learn more about datasets in OSMO at Datasets