Command-Line Interface¶
After installing the multi-storage-client
package (see Installation), you can use the msc
command to interact with your storage services.
Below are the available sub-commands under msc
.
msc help¶
The msc help
command displays general help information and available commands. It can also be used to display help for a specific command.
$ msc help
usage: msc <command> [options] [parameters]
To see help text, you can run:
msc help
msc help <command>
commands:
glob Find files using Unix-style wildcard patterns with optional attribute filtering
help Display help for commands
ls List files and directories with optional attribute filtering
rm Delete files with a given prefix
sync Synchronize files from the source storage to the target storage
msc ls¶
The msc ls
command lists files and directories in a storage service. It supports various options for filtering and displaying the results.
$ msc help ls
usage: msc ls [--attribute-filter-expression ATTRIBUTE_FILTER_EXPRESSION] [--recursive] [--human-readable] [--summarize] [--debug] [--limit LIMIT] [--show-attributes] path
List files and directories at the specified path. Supports:
1. Simple directory listings
2. Attribute filtering
3. Human readable sizes
4. Summary information
5. Metadata attributes display
positional arguments:
path The path to list (POSIX path or msc:// URL)
options:
--attribute-filter-expression ATTRIBUTE_FILTER_EXPRESSION, -e ATTRIBUTE_FILTER_EXPRESSION
Filter by attributes using a filter expression (e.g., 'model_name = "gpt" AND version > 1.0')
--recursive List contents recursively (default: list only first level)
--human-readable Displays file sizes in human readable format
--summarize Displays summary information (number of objects, total size)
--debug Enable debug output
--limit LIMIT Limit the number of results to display
--show-attributes Display metadata attributes dictionary as an additional column
$ msc ls msc://profile/data/
Last Modified Size Name
2025-04-15 00:22:40 5242880 msc://profile/data/data-5MB.bin
2025-04-15 00:23:36 1496 msc://profile/data/model.pt
$ msc ls --recursive msc://profile/data/
Last Modified Size Name
2025-04-15 00:22:40 5242880 msc://profile/data/data-5MB.bin
2025-04-15 00:23:36 1496 msc://profile/data/model.pt
2025-04-15 00:24:15 2048 msc://profile/data/subdir/config.json
2025-04-15 00:25:30 1024 msc://profile/data/subdir/logs/error.log
Note
The --attribute-filter-expression
option allows you to filter files based on their metadata attributes.
- Supported Operators:
Equality:
=
,!=
Comparison:
>
,>=
,<
,<=
Logical:
AND
,OR
Grouping:
()
- Examples:
model_name = "gpt"
- Find files with model_name attribute equal to “gpt”version >= 1.0
- Find files with version 1.0 or higherenvironment != "test"
- Find files not in test environment(model_name = "gpt" OR model_name = "bert") AND version > 1.0
- Complex filter with logical operators
Numeric vs String Comparison: For comparison operators (>
, >=
, <
, <=
), the system first attempts numeric comparison. If that fails, it falls back to lexicographic string comparison.
Performance Considerations: When using attribute filtering, the system will make additional HEAD requests to retrieve metadata for each file if metadata provider is not provided. This can increase latency, especially when working with many files.
msc glob¶
The msc glob
command finds files in a storage service using Unix-style wildcard patterns.
$ msc help glob
usage: msc glob [--attribute-filter-expression ATTRIBUTE_FILTER_EXPRESSION] pattern
Find files using Unix-style wildcard patterns with optional attribute filtering.
positional arguments:
pattern The glob pattern to match files (POSIX path or msc:// URL)
options:
--attribute-filter-expression ATTRIBUTE_FILTER_EXPRESSION, -e ATTRIBUTE_FILTER_EXPRESSION
Filter by attributes using a filter expression (e.g., 'model_name = "gpt" AND version > 1.0')
$ msc glob "msc://profile/data/*.pt"
msc://profile/data/model.pt
Note
The msc glob
command works by first listing all files in the specified directory using the equivalent of msc ls
, then applying the glob pattern as a post-filter to the results. This means that glob patterns are evaluated locally after retrieving the file listing from the storage service. To reduce number of HEAD requests, either opt in to use metadata provider or use msc glob
for only a specific pattern of files.
msc rm¶
The msc rm
command deletes files or directories in a storage service. It supports both single file deletion and recursive directory deletion.
$ msc help rm
usage: msc rm [-r] [-y] [--debug] [--dryrun] [--quiet] [--only-show-errors] path
Delete files or directories.
positional arguments:
path The file or directory path to delete (either POSIX path or MSC URL)
options:
-r, --recursive Delete directories and their contents recursively (This option is needed to delete directories)
-y, --yes Skip confirmation prompt and proceed with deletion
--debug Enable debug output with deletion details
--dryrun Show what would be deleted without actually deleting
--quiet Suppress output of operations performed
--only-show-errors Only errors and warnings are displayed. All other output is suppressed
$ msc rm msc://profile/foo/file.txt
This will delete the file: msc://profile/foo/file.txt
Are you sure you want to continue? (y/N): y
Deleting: msc://profile/foo/file.txt
Successfully deleted: msc://profile/foo/file.txt
$ msc rm --dryrun --recursive msc://profile/foo
Files that would be deleted:
msc://profile/foo/data-5MB.bin
msc://profile/foo/model.pt
Total: 2 file(s)
$ msc rm --recursive msc://profile/foo
This will delete everything under the path: msc://profile/foo (recursively)
Are you sure you want to continue? (y/N): y
Deleting: msc://profile/foo
Successfully deleted: msc://profile/foo
msc sync¶
The msc sync
command synchronizes files between storage locations. It can be used to upload files from the filesystem to object storage, download files from object storage to the filesystem, or transfer files between different object storage locations.
The sync operation compares files between source and target locations using metadata (etag, size, modification time) to determine if files need to be copied. Files are processed in parallel using multiple worker processes and threads for optimal performance.
$ msc sync msc://profile/data/ --target-url /path/to/local/dataset/
Upload files from the filesystem to object storage:
$ msc sync /path/to/dataset --target-url msc://profile/prefix
Download files from object storage to the filesystem:
$ msc sync msc://profile/prefix --target-url /path/to/dataset
Transfer files between different object storage locations:
$ msc sync msc://profile1/prefix --target-url msc://profile2/prefix
Sync with cleanup (removes files in target not in source):
$ msc sync msc://source-profile/data --target-url msc://target-profile/data --delete-unmatched-files
The sync operation uses a parallel processing architecture with producer/consumer threads and multiple worker processes to maximize throughput. It efficiently compares files using metadata and only transfers files that have changed or are missing.
For large files, the sync operation uses temporary files to avoid loading entire files into memory. Smaller files are transferred directly in memory for better performance.
Note
The sync operation automatically handles metadata updates for the target storage client.
Sync Replicas¶
If the source profile has replicas configured, the sync operation will copy the data to all the replicas by default without requiring the --target-url
option. Please refer to Replicas for the configuration details.
$ msc sync msc://source-profile/data
You can also sync to specific replicas instead of all replicas by using the --replica-indices
option. Replica indices start from 0.
$ msc sync msc://source-profile/data --replica-indices "0,1"
Fine-tuning Parallelism¶
MSC automatically determines optimal parallelism based on your system’s CPU count, but you can fine-tune it using environment variables.
# Set number of worker processes (default: min(8, CPU_count))
$ export MSC_NUM_PROCESSES=4
# Set threads per process (default: max(16, CPU_count/processes))
$ export MSC_NUM_THREADS_PER_PROCESS=8
# Run sync with custom parallelism
$ msc sync msc://source-profile/data --target-url msc://target-profile/data
Note
MSC uses a producer-consumer pattern with multiprocessing and multithreading to maximize throughput:
Producer Thread: Compares source and target files, queues sync operations
Worker Processes: Multiple processes handle file transfers (multiprocessing bypasses Python’s GIL)
Worker Threads: Each process spawns multiple threads for concurrent I/O operations
Consumer Thread: Collects results and updates progress
Ray Integration¶
MSC provides integration with Ray for distributed computing capabilities, enabling you to scale sync operations across multiple machines in a cluster. This is particularly useful for large-scale data transfers that require significant computational resources.
- Prerequisites:
Ray must be installed:
pip install "multi-storage-client[ray]"
A Ray cluster must be running and accessible
- Benefits of Ray Integration:
Distributed Processing: Scale sync operations across multiple machines
Fault Tolerance: Ray provides automatic task retry and failure recovery
Resource Management: Efficient utilization of cluster resources
Scalability: Handle larger datasets by distributing work across nodes
Usage:
To use Ray for distributed sync operations, specify the Ray cluster address using the --ray-cluster
option:
# Start a local Ray cluster
$ ray start --head --port=6379
# Connect to a local Ray cluster
$ msc sync msc://source-profile/data --ray-cluster 127.0.0.1:6379 --target-url msc://target-profile/data