Command-Line Interface
After installing the multi-storage-client package (see Installation), you can use the msc command to interact with your storage services. Below are the available sub-commands under msc.
msc help
The msc help command displays general help information and lists the available commands. It can also display help for a specific command.
$ msc help
usage: msc <command> [options] [parameters]
To see help text, you can run:
msc help
msc help <command>
commands:
glob Find files using Unix-style wildcard patterns with optional attribute filtering
help Display help for commands
ls List files and directories with optional attribute filtering
rm Delete files with a given prefix
sync Synchronize files from the source storage to the target storage
$ msc help ls
usage: msc ls [--attribute-filter-expression ATTRIBUTE_FILTER_EXPRESSION] [--recursive] [--human-readable] [--summarize] [--debug] [--limit LIMIT] [--show-attributes] path
List files and directories at the specified path. Supports:
1. Simple directory listings
2. Attribute filtering
3. Human readable sizes
4. Summary information
5. Metadata attributes display
positional arguments:
path The path to list (POSIX path or msc:// URL)
options:
--attribute-filter-expression ATTRIBUTE_FILTER_EXPRESSION, -e ATTRIBUTE_FILTER_EXPRESSION
Filter by attributes using a filter expression (e.g., 'model_name = "gpt" AND version > 1.0')
--recursive List contents recursively (default: list only first level)
--human-readable Displays file sizes in human readable format
--summarize Displays summary information (number of objects, total size)
--debug Enable debug output
--limit LIMIT Limit the number of results to display
--show-attributes Display metadata attributes dictionary as an additional column
examples:
# Basic directory listing
msc ls "msc://profile/data/"
msc ls "/path/to/files/"
# Human readable sizes
msc ls "msc://profile/models/" --human-readable
# Show summary information
msc ls "msc://profile/data/" --summarize
# List with attribute filtering
msc ls "msc://profile/models/" --attribute-filter-expression 'model_name = "gpt"'
msc ls "msc://profile/data/" --attribute-filter-expression 'version >= 1.0 AND environment != "test"'
# Limited results
msc ls "msc://profile/data/" --limit 10
# List contents recursively
msc ls "msc://profile/data/" --recursive
# Show metadata attributes
msc ls "msc://profile/models/" --show-attributes
msc ls "msc://profile/data/" --show-attributes --human-readable
msc ls
The msc ls command lists files and directories in a storage service. It supports various options for filtering and displaying the results.
$ msc ls msc://profile/data/
Last Modified Size Name
2025-04-15 00:22:40 5242880 msc://profile/data/data-5MB.bin
2025-04-15 00:23:36 1496 msc://profile/data/model.pt
Note
The --attribute-filter-expression option allows you to filter files based on their metadata attributes.
- Supported Operators:
Equality: =, !=
Comparison: >, >=, <, <=
Logical: AND, OR
Grouping: ()
- Examples:
model_name = "gpt" - Find files with model_name attribute equal to “gpt”
version >= 1.0 - Find files with version 1.0 or higher
environment != "test" - Find files not in test environment
(model_name = "gpt" OR model_name = "bert") AND version > 1.0 - Complex filter with logical operators
Numeric vs String Comparison: For comparison operators (>, >=, <, <=), the system first attempts numeric comparison. If that fails, it falls back to lexicographic string comparison.
Performance Considerations: When using attribute filtering, the system makes additional HEAD requests to retrieve metadata for each file. This can increase latency, especially when working with many files.
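The numeric-then-string fallback described in this note can be sketched in a few lines of Python. This is an illustration only, not MSC's implementation: the function name compare and the use of float() as the numeric test are assumptions made for the example.

```python
import operator

# Map the comparison operators from the note to Python functions.
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def compare(attribute_value, op, literal):
    """Compare numerically when both sides parse as numbers, else as strings.

    Hypothetical helper: the name and the float() parsing rule are assumptions.
    """
    try:
        left, right = float(attribute_value), float(literal)
    except (TypeError, ValueError):
        # Fallback: lexicographic comparison of the string forms.
        left, right = str(attribute_value), str(literal)
    return _OPS[op](left, right)


print(compare("10", ">", "9"))    # numeric comparison: 10.0 > 9.0
print(compare("v10", ">", "v9"))  # string comparison: "v10" sorts before "v9"
```

The second call shows why the fallback matters in practice: version-like strings such as "v10" and "v9" are ordered character by character, so storing versions as plain numbers gives more predictable filters.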
msc glob
The msc glob command finds files in a storage service using Unix-style wildcard patterns.
$ msc glob "msc://profile/data/*.pt"
msc://profile/data/model.pt
Note
The msc glob command works by first listing all files in the specified directory using the equivalent of msc ls, then applying the glob pattern as a post-filter to the results. This means that glob patterns are evaluated locally after retrieving the file listing from the storage service.
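As a rough picture of the list-then-filter approach described in this note, the sketch below applies a wildcard pattern to a listing that has already been retrieved. It uses Python's standard fnmatch module as a stand-in; the function name glob_post_filter and the sample listing are hypothetical, and MSC's own matcher may treat path separators differently (fnmatch lets * match across /).

```python
from fnmatch import fnmatchcase


def glob_post_filter(listing, pattern):
    """Return the entries of an already-retrieved listing that match pattern."""
    return [path for path in listing if fnmatchcase(path, pattern)]


# Hypothetical listing, standing in for the result of listing msc://profile/data/.
listing = [
    "msc://profile/data/data-5MB.bin",
    "msc://profile/data/model.pt",
]

print(glob_post_filter(listing, "msc://profile/data/*.pt"))
```

Because the pattern is applied after the listing is fetched, the cost of a glob is dominated by the size of the directory being listed, not by how selective the pattern is.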
msc rm
The msc rm command deletes files in a storage service. It supports recursively deleting directories.
$ msc rm --dryrun msc://profile/data
Files that would be deleted:
msc://profile/data/data-5MB.bin
msc://profile/data/model.pt
Total: 2 file(s)
msc sync
The msc sync command synchronizes files between storage locations. It can be used to upload files from the filesystem to object storage, download files from object storage to the filesystem, or transfer files between different object storage locations.
The sync operation compares files between source and target locations using metadata (etag, size, modification time) to determine if files need to be copied. Files are processed in parallel using multiple worker processes and threads for optimal performance.
$ msc sync msc://profile/data/ --target-url /path/to/local/dataset/
Upload files from the filesystem to object storage:
$ msc sync /path/to/dataset --target-url msc://profile/prefix
Download files from object storage to the filesystem:
$ msc sync msc://profile/prefix --target-url /path/to/dataset
Transfer files between different object storage locations:
$ msc sync msc://profile1/prefix --target-url msc://profile2/prefix
Sync with cleanup (removes files in target not in source):
$ msc sync msc://source-profile/data --target-url msc://target-profile/data --delete-unmatched-files
The sync operation uses a parallel processing architecture with producer/consumer threads and multiple worker processes to maximize throughput. It efficiently compares files using metadata and only transfers files that have changed or are missing.
For large files, the sync operation uses temporary files to avoid loading entire files into memory. Smaller files are transferred directly in memory for better performance.
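The metadata comparison can be pictured with a small Python sketch. The exact rule MSC applies is not spelled out here, so the decision order below (missing target, then etag, then size, then modification time) is an assumption, and FileInfo, needs_copy and plan_sync are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical metadata record; field names are illustrative."""
    size: int
    mtime: float                 # modification time, seconds since the epoch
    etag: Optional[str] = None   # not every backend exposes an etag


def needs_copy(source: FileInfo, target: Optional[FileInfo]) -> bool:
    """Assumed rule: copy when the target is missing or looks stale."""
    if target is None:
        return True                          # missing in target
    if source.etag and target.etag:
        return source.etag != target.etag    # both etags known: compare them
    if source.size != target.size:
        return True                          # size mismatch
    return source.mtime > target.mtime       # source modified more recently


def plan_sync(
    source: Dict[str, FileInfo],
    target: Dict[str, FileInfo],
    delete_unmatched: bool = False,
) -> Tuple[List[str], List[str]]:
    """Return (keys to copy, keys to delete from the target)."""
    to_copy = [k for k, info in sorted(source.items()) if needs_copy(info, target.get(k))]
    to_delete = sorted(set(target) - set(source)) if delete_unmatched else []
    return to_copy, to_delete


source = {
    "a.bin": FileInfo(size=10, mtime=100.0),
    "b.bin": FileInfo(size=20, mtime=200.0),
}
target = {
    "a.bin": FileInfo(size=10, mtime=100.0),   # unchanged, will be skipped
    "old.bin": FileInfo(size=5, mtime=50.0),   # not present in source
}

print(plan_sync(source, target, delete_unmatched=True))
```

The delete_unmatched argument mirrors the intent of --delete-unmatched-files: without it, files that exist only in the target are left alone.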
Note
The sync operation automatically handles metadata updates for the target storage client.
Fine-tuning Parallelism
MSC automatically determines optimal parallelism based on your system’s CPU count, but you can fine-tune it using environment variables.
# Set number of worker processes (default: min(8, CPU_count))
$ export MSC_NUM_PROCESSES=4
# Set threads per process (default: max(16, CPU_count/processes))
$ export MSC_NUM_THREADS_PER_PROCESS=8
# Run sync with custom parallelism
$ msc sync msc://source-profile/data --target-url msc://target-profile/data
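To make the defaults in the comments above concrete, here is a minimal Python sketch of how the two environment variables could be resolved. The function resolve_parallelism is hypothetical, and the use of integer division for CPU_count/processes is an assumption; only the variable names and the min/max formulas come from the text above.

```python
import os


def resolve_parallelism(environ=os.environ, cpu_count=None):
    """Return (processes, threads_per_process) from env vars or CPU-based defaults."""
    cpus = cpu_count or os.cpu_count() or 1

    # Default: min(8, CPU_count), unless MSC_NUM_PROCESSES is set.
    processes = int(environ.get("MSC_NUM_PROCESSES", min(8, cpus)))
    processes = max(1, processes)  # guard against zero or negative values

    # Default: max(16, CPU_count / processes), unless MSC_NUM_THREADS_PER_PROCESS is set.
    threads = int(environ.get("MSC_NUM_THREADS_PER_PROCESS", max(16, cpus // processes)))
    return processes, threads


print(resolve_parallelism({}, cpu_count=4))
print(resolve_parallelism({"MSC_NUM_PROCESSES": "4", "MSC_NUM_THREADS_PER_PROCESS": "8"}, cpu_count=64))
```

Under these formulas the thread default stays at 16 on most machines; the CPU-based term only takes over when there are more than 16 CPUs per worker process.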
Note
MSC uses a producer-consumer pattern with multiprocessing and multithreading to maximize throughput:
Producer Thread: Compares source and target files, queues sync operations
Worker Processes: Multiple processes handle file transfers (multiprocessing bypasses Python’s GIL)
Worker Threads: Each process spawns multiple threads for concurrent I/O operations
Consumer Thread: Collects results and updates progress
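The roles listed in this note can be illustrated with a simplified Python sketch. It is deliberately thread-only: worker processes are replaced by worker threads to keep the example short, so it shows the shape of the pipeline and not MSC's actual multiprocessing code. The name run_pipeline and the transfer callback are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a queue


def run_pipeline(operations, transfer, num_workers=4):
    """Run transfer(op) for each operation and return a list of (op, outcome)."""
    work_q, result_q = queue.Queue(), queue.Queue()
    results = []

    def producer():
        # Queue every sync operation, then one stop signal per worker.
        for op in operations:
            work_q.put(op)
        for _ in range(num_workers):
            work_q.put(_DONE)

    def worker():
        # Pull operations until the stop signal arrives.
        while True:
            op = work_q.get()
            if op is _DONE:
                result_q.put(_DONE)
                return
            try:
                result_q.put((op, transfer(op)))
            except Exception as exc:
                result_q.put((op, exc))  # report failures instead of hanging

    def consumer():
        # Collect results until every worker has signalled completion.
        finished = 0
        while finished < num_workers:
            item = result_q.get()
            if item is _DONE:
                finished += 1
            else:
                results.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    threads += [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(sorted(run_pipeline(["a", "b", "c"], str.upper, num_workers=2)))
```

Results arrive in completion order, not submission order, which is why the usage line sorts them before printing.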
Ray Integration
MSC provides integration with Ray for distributed computing capabilities, enabling you to scale sync operations across multiple machines in a cluster. This is particularly useful for large-scale data transfers that require significant computational resources.
- Prerequisites:
Ray must be installed:
pip install "multi-storage-client[ray]"
A Ray cluster must be running and accessible
- Benefits of Ray Integration:
Distributed Processing: Scale sync operations across multiple machines
Fault Tolerance: Ray provides automatic task retry and failure recovery
Resource Management: Efficient utilization of cluster resources
Scalability: Handle larger datasets by distributing work across nodes
Usage:
To use Ray for distributed sync operations, specify the Ray cluster address using the --ray-cluster option:
# Start a local Ray cluster
$ ray start --head --port=6379
# Connect to a local Ray cluster
$ msc sync msc://source-profile/data --ray-cluster 127.0.0.1:6379 --target-url msc://target-profile/data