Cache¶
The MSC provides a caching system designed to improve performance when accessing remote storage objects.
How It Works¶
When a developer calls msc.open("msc://profile/path/to/data.bin")
, MSC automatically handles caching behind the scenes
if cache is configured. On the first access, the remote object is downloaded and written to the configured cache location,
making all subsequent accesses significantly faster since the data is served directly from local storage.
The cache location should be configured as an absolute filesystem path, and for optimal performance, it’s recommended to point it to high-performance storage such as NVMe drives or parallel file systems like Lustre.
To ensure data integrity in multi-process or multi-threaded environments, MSC employs filelock mechanisms that prevent concurrent writes to the same cached object. When multiple processes attempt to cache the same file simultaneously, an exclusive lock ensures only one process proceeds with the write operation while others wait, eliminating redundant downloads and potential corruption.
Note
If the profile uses a POSIX filesystem (e.g. type: file
) as the storage provider, the cache won’t be used for that profile.
Basic Cache Configuration¶
To enable cache, you need to configure the cache location and size in the configuration file. The cache location should be
an absolute filesystem path, and the size should be a positive integer with a unit suffix (e.g. "100M"
, "100G"
).
cache:
size: 500G
location: /path/to/msc_cache
check_source_version: true # validate remote object version before serving cached copy
cache_line_size: 64M # size of chunks for partial file caching
check_source_version
replaces the older use_etag
flag (still accepted
for backward-compatibility). The default is true
(MSC validates the
remote object’s current version — for example via ETag — before serving a
cached copy). Set it to false
only if you want to skip that validation
for performance reasons or when the storage backend lacks a versioning
mechanism.
For detailed configuration options, see Configuration Reference
Partial File Caching¶
MSC supports partial file caching, which allows efficient caching of large files by downloading and storing them in smaller chunks. This is particularly useful for large files where you only need to access specific portions.
Key Features:
Chunk-based Storage: Large files are automatically split into configurable chunks (default 64MB) and stored separately in the cache.
Range Request Optimization: When reading specific byte ranges, MSC only downloads the necessary chunks, not the entire file.
Configuration:
The cache_line_size
parameter controls the size of each chunk. This should be set based on your typical access patterns:
cache:
size: 500G
location: /path/to/msc_cache
cache_line_size: 64M # 64MB chunks
check_source_version: true
Usage:
Partial file caching is automatically enabled when using client.read()
with range:
from multistorageclient import StorageClient, StorageClientConfig
from multistorageclient.types import Range
# Create a storage client with configuration
config = StorageClientConfig.from_file() # or from_dict()
client = StorageClient(config=config, profile="your_profile")
# This will only download the necessary chunks for the specified range
data = client.read("path/to/large_file.bin",
byte_range=Range(offset=1024*1024, size=512*1024)) # 512KB read from 1MB offset
# For full file access, use client.open() with prefetch_file parameter
with client.open("path/to/large_file.bin", prefetch_file=False) as f:
# This will use partial file caching instead of downloading the entire file
data = f.read(1024*1024) # Read 1MB
Source Version Validation:
The client.read()
method supports source version validation through the check_source_version
parameter. For most use cases, you can control source version checking entirely through the cache configuration:
cache:
check_source_version: false # Disable validation globally for performance
# or
check_source_version: true # Enable validation globally for data consistency
When using SourceVersionCheckMode.INHERIT
(the default), the cache configuration setting is used. You only need to explicitly specify ENABLE
or DISABLE
when you want to override the cache config for specific calls.
from multistorageclient.types import SourceVersionCheckMode
# Use cache configuration setting (default - recommended for most cases)
data = client.read("path/to/file.bin",
byte_range=Range(offset=0, size=1024))
# Equivalent to: check_source_version=SourceVersionCheckMode.INHERIT
# Override cache config for specific calls when needed
# Always validate source version (may require HEAD request)
data = client.read("path/to/file.bin",
byte_range=Range(offset=0, size=1024),
check_source_version=SourceVersionCheckMode.ENABLE)
# Skip source version validation for performance
data = client.read("path/to/file.bin",
byte_range=Range(offset=0, size=1024),
check_source_version=SourceVersionCheckMode.DISABLE)
Prefetch File Control:
The prefetch_file
parameter in client.open()
controls whether the entire file is downloaded upfront or uses partial file caching:
# Download entire file upfront (default behavior)
with client.open("path/to/large_file.bin", prefetch_file=True) as f:
data = f.read(1024*1024) # Fast local read after download
# Use partial file caching - only download chunks as needed
with client.open("path/to/large_file.bin", prefetch_file=False) as f:
data = f.read(1024*1024) # Downloads only necessary chunks
f.seek(1024*1024*100) # Seek to different position
more_data = f.read(1024*1024) # Downloads additional chunks as needed
When to use each approach:
prefetch_file=True (default): Use when you expect to read most or all of the file. The entire file is downloaded once, then all subsequent reads are served from local cache at maximum speed.
prefetch_file=False: Use for large files where you only need specific portions, or when you want to start processing data immediately without waiting for the full download. This enables partial file caching with chunk-based downloads.
Cache Directory Structure¶
The cache directory structure is mirrored to the remote storage object hierarchy. Because the cache location is shared by all profiles, the MSC automatically prefixes the profile name to the cache directory to avoid conflicts.
s3://bucketA/
└── datasets/
├── parts_0001/
│ ├── data-0001.tar
│ └── data-0002.tar
└── parts_0002/
├── data-0001.tar
└── data-0002.tar
/tmp/msc_cache/{profile_name}/
└── datasets/
├── parts_0001/
│ ├── data-0001.tar
│ └── data-0002.tar
└── parts_0002/
├── data-0001.tar
└── data-0002.tar
Cache Eviction¶
The cache eviction procedure is a background thread that runs periodically to remove the cached files from the cache directory to ensure the cache size is within the configured limit. Since all profiles share the same cache directory, the cache eviction procedure is synchronized across all profiles.
Note
When running in a multi-process environment, only one process will run the cache eviction procedure at a time (protected by an exclusive filelock).
The cache eviction policy is configured using the eviction_policy
key in the cache configuration. By default, the cache eviction policy
is set to fifo
with a refresh interval of 300 seconds.
cache:
size: 500G
location: /path/to/msc_cache
eviction_policy:
policy: random
refresh_interval: 300
Best Practices¶
Full File Caching:
Configure MSC cache when your workload performs frequent small range-reads on objects. This is particularly common in:
ML Training Workloads: Machine learning training often involves reading large amount of data files and selecting random samples from different parts of the file, resulting in many small range-read operations. By caching the entire object upfront, these expensive network round-trips are eliminated, and subsequent sample reads are served directly from local storage at much higher speeds.
Checkpoint Loading: When loading large model checkpoints (often several GB in size), frameworks like PyTorch may perform multiple small reads to load different parts of the model. Rather than allowing these small reads to hit remote storage repeatedly, it’s much more performant to download the entire checkpoint file using multi-threaded downloads to the cache first, then let PyTorch load from the local cached file.
Random Access Patterns: Any workload that requires random access to different parts of large files frequently will benefit significantly from caching, as the alternative would be numerous individual range requests to remote storage.
Partial File Caching:
For large files where you only need to access specific portions, use partial file caching with range reads:
Large Dataset Access: When working with very large datasets (hundreds of GB or TB), use
client.read()
withbyte_range
to only download the chunks you need.Sparse File Access: For files where you only read small portions scattered throughout the file, partial file caching is more efficient than downloading the entire file.
Streaming Workloads: When processing large files sequentially in chunks, partial file caching allows you to process data as it’s downloaded without waiting for the entire file.
Performance Considerations:
Full File Caching (
prefetch_file=True
withclient.open()
): The cache transforms what would be hundreds or thousands of small, high-latency network requests into a single bulk download followed by fast local file system access.Partial File Caching (
prefetch_file=False
withclient.open()
): Use for large files with sparse access patterns. Setcache_line_size
based on your typical read sizes - smaller chunks (e.g., 16MB) for fine-grained access, larger chunks (e.g., 128MB) for coarser access patterns.Source Version Validation: This feature can be turned on and off via the cache config using
check_source_version: true/false
. For fine-grained tuning, useSourceVersionCheckMode.DISABLE
for maximum performance when you can tolerate potentially stale data, or when the data doesn’t change (gives best performance). UseSourceVersionCheckMode.ENABLE
when data consistency is critical.Prefetch File Strategy: - Use
prefetch_file=True
for sequential reads or when you need the entire file - Useprefetch_file=False
for random access patterns or when you want to start processing immediately - Combine with appropriatecache_line_size
for optimal performance
Limitations¶
Full file caching inefficiency: Traditional full file caching downloads entire files, which can be inefficient for large files when your workload only reads small portions at a time. For such cases, use partial file caching with
client.read()
andbyte_range
, or useclient.open(..., prefetch_file=False)
to enable chunk-based caching.Chunk size tuning: The effectiveness of partial file caching depends on choosing an appropriate
cache_line_size
. Too small chunks increase overhead, while too large chunks reduce the benefits of partial caching.Network dependency: Initial chunk downloads still require network access. For completely offline scenarios, ensure all required chunks are pre-cached.