########
Replicas
########

MSC replicas provide automatic data mirroring across multiple storage backends for improved performance and availability.
Once replicas are configured, MSC reads from them in priority order without requiring changes to application code,
and provides mechanisms to populate the replicas from the source.

*************
Configuration
*************

To enable replicas, add a ``replicas`` list to your profile configuration. Each replica entry names a profile with
``replica_profile`` and sets a required ``read_priority`` (lower numbers mean higher priority).

.. code-block:: yaml
   :caption: Configuration example with source and replicas.
   :linenos:

   profiles:
     my-dataset:
       storage_provider:
         type: s3
         options:
           base_path: my-dataset-bucket
       # Configure two replicas
       replicas:
         - replica_profile: my-dataset-s3-express
           read_priority: 1   # First choice
         - replica_profile: my-dataset-lustre
           read_priority: 2   # Second choice and so on...

     my-dataset-s3-express:
       storage_provider:
         type: s3
         options:
           base_path: my-dataset-bucket--x-s3

     my-dataset-lustre:
       storage_provider:
         type: file
         options:
           base_path: /lustre/datasets/my-dataset

In the above example, the ``my-dataset`` profile is the source, and the ``my-dataset-s3-express`` and ``my-dataset-lustre`` 
profiles are replicas. When you use the ``my-dataset`` profile, MSC will read from the ``my-dataset-s3-express`` profile first, 
and if that is not available, it will read from the ``my-dataset-lustre`` profile. If both replicas are unavailable, MSC will 
fall back and read from the source profile ``my-dataset``.

.. note::

  The ``read_priority`` is **required** and must be a positive integer (``1`` = highest priority), with replicas of the 
  same priority being tried in the order they are listed in the configuration.
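
The ordering rule in the note can be pictured with a small, self-contained sketch. This is an illustration
only, not MSC's implementation: ``ReplicaEntry`` and ``read_order`` are hypothetical names, and
``my-dataset-nvme`` is an invented third replica added to show how ties are handled.

.. code-block:: python
   :caption: Illustrative sketch of read-priority ordering (not MSC internals).
   :linenos:

   from dataclasses import dataclass


   @dataclass(frozen=True)
   class ReplicaEntry:
       """Hypothetical stand-in for one item in the ``replicas`` list."""

       replica_profile: str
       read_priority: int


   def read_order(replicas):
       """Return replica profile names in the order they would be tried."""
       for entry in replicas:
           priority = entry.read_priority
           # bool is a subclass of int, so reject it explicitly.
           if isinstance(priority, bool) or not isinstance(priority, int) or priority < 1:
               raise ValueError(
                   f"read_priority for {entry.replica_profile!r} must be a positive integer"
               )
       # sorted() is stable, so entries with equal priority keep their listed order.
       ordered = sorted(replicas, key=lambda entry: entry.read_priority)
       return [entry.replica_profile for entry in ordered]


   replicas = [
       ReplicaEntry("my-dataset-lustre", 2),
       ReplicaEntry("my-dataset-s3-express", 1),
       ReplicaEntry("my-dataset-nvme", 2),  # hypothetical third replica
   ]
   print(read_order(replicas))

Here ``my-dataset-s3-express`` comes first, and the two priority-``2`` replicas keep the order in which they are listed.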

.. _mirror-data-to-replicas:

***********************
Mirror Data to Replicas
***********************

There are two ways to mirror data to replicas: using the command-line interface (CLI) or Python code.

**Command Line Interface**

The easiest way to mirror data is to use the MSC CLI. See :ref:`msc-sync-replicas-cli` for details about the command.

**Python Code**

To populate replicas from the source using Python, you can use the :py:meth:`multistorageclient.StorageClient.sync_replicas` method:

.. code-block:: python
   :caption: Mirror data to replicas using Python.
   :linenos:

   from multistorageclient import StorageClient, StorageClientConfig

   # Initialize the client
   client = StorageClient(StorageClientConfig.from_file(profile="my-dataset"))

   # Mirror data from the source to all configured replicas
   client.sync_replicas(source_path="", num_worker_processes=8)

The ``sync_replicas`` method spawns worker processes that copy data from the source to the replicas. By default, it runs in
local mode, where the worker processes run on the same machine as the client. In Ray mode, the worker processes run on a
Ray cluster instead, which distributes the copy across the cluster.

.. code-block:: python
   :caption: Mirror data to replicas using Ray.
   :linenos:

   from multistorageclient import StorageClient, StorageClientConfig
   from multistorageclient.types import ExecutionMode

   import ray

   # Connect to the Ray cluster
   ray.init(address="auto")

   # Initialize the client
   client = StorageClient(StorageClientConfig.from_file(profile="my-dataset"))

   # Mirror data from the source to all configured replicas
   client.sync_replicas(source_path="", execution_mode=ExecutionMode.RAY)

   # Disconnect from the Ray cluster
   ray.shutdown()

******************
Read from Replicas
******************

When you read from the source, MSC will automatically read from the replicas based on the configured priority.

.. code-block:: python
   :caption: Read from replicas using Python.
   :linenos:

   from multistorageclient import StorageClient, StorageClientConfig

   # Initialize the client
   client = StorageClient(StorageClientConfig.from_file(profile="my-dataset"))

   # Read an object through the source profile.
   # MSC tries the replicas in read-priority order:
   # first my-dataset-s3-express, then my-dataset-lustre.
   client.read("files/my-file.txt")

   # Other supported methods that read through replicas:
   # client.download_file("files/my-file.txt", "/local/path/to/my-file.txt")
   # client.copy("files/my-file.txt", "files/my-file-copy.txt")
   # client.open("files/my-file.txt", mode="rb")

If a replica is not populated (that is, the requested data doesn't exist in the replica storage) or is
unavailable, MSC automatically tries the next replica in priority order. If all replicas fail, MSC reads
from the source profile.

This fallback happens in the background, so your application continues to work even if replica
synchronization hasn't completed.
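
The fallback order can be pictured with a short, self-contained sketch. It models each storage backend as a
plain dictionary and is not MSC's implementation; ``read_with_fallback`` is a hypothetical helper written
for this illustration.

.. code-block:: python
   :caption: Illustrative sketch of replica fallback (not MSC internals).
   :linenos:

   def read_with_fallback(path, replicas, source):
       """Return ``(profile_name, data)`` from the first store that holds ``path``.

       ``replicas`` is a list of ``(profile_name, store)`` pairs already sorted by
       read priority; ``source`` is a single ``(profile_name, store)`` pair.
       """
       for name, store in replicas:
           if path in store:
               return name, store[path]
       # Every replica missed, so fall back to the source.
       name, store = source
       if path not in store:
           raise FileNotFoundError(path)
       return name, store[path]


   source = ("my-dataset", {"files/a.txt": b"a", "files/b.txt": b"b"})
   replicas = [
       ("my-dataset-s3-express", {}),                 # not populated yet
       ("my-dataset-lustre", {"files/a.txt": b"a"}),  # partially populated
   ]

   served_a = read_with_fallback("files/a.txt", replicas, source)
   served_b = read_with_fallback("files/b.txt", replicas, source)
   print(served_a[0])  # my-dataset-lustre
   print(served_b[0])  # my-dataset

In this toy setup, ``files/a.txt`` is served by the second replica because the first one is empty, and
``files/b.txt`` is served by the source because no replica holds it.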

Additionally, MSC implements an **async upload-on-miss** strategy. When a read operation misses on any 
replica, MSC automatically uploads the object to replicas that are missing the data. This happens in 
background threads, so the caller doesn't block.

Together, fallback and upload-on-miss let your applications operate normally regardless of how fully
the replicas are populated, while keeping replicas up-to-date opportunistically.
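
The async upload-on-miss idea can be sketched with a thread pool. The example below is a simplified model
under stated assumptions, not MSC's implementation: stores are plain dictionaries, ``BackfillingReader`` and
``drain`` are hypothetical names, and only the replicas that were tried and missed are backfilled.

.. code-block:: python
   :caption: Illustrative sketch of async upload-on-miss (not MSC internals).
   :linenos:

   from concurrent.futures import ThreadPoolExecutor, wait


   class BackfillingReader:
       """Toy reader: dict-backed stores, replicas listed in priority order."""

       def __init__(self, source, replicas):
           self.source = source
           self.replicas = replicas
           self._pool = ThreadPoolExecutor(max_workers=2)
           self._pending = []

       def read(self, path):
           missed = []
           for store in self.replicas:
               if path in store:
                   data = store[path]
                   break
               missed.append(store)
           else:
               # No replica had the object, so serve it from the source.
               data = self.source[path]
           # Schedule the backfill and return without waiting for it.
           for store in missed:
               self._pending.append(self._pool.submit(store.__setitem__, path, data))
           return data

       def drain(self):
           """Block until all scheduled backfills finish (for the demo only)."""
           wait(self._pending)
           self._pending.clear()


   source = {"files/a.txt": b"payload"}
   express, lustre = {}, {}
   reader = BackfillingReader(source, [express, lustre])

   data = reader.read("files/a.txt")  # served from the source
   reader.drain()
   print(data, "files/a.txt" in express, "files/a.txt" in lustre)

The call to ``read`` returns as soon as the data is available. The copies into the replicas run on pool
threads, which is why the demo calls ``drain`` before inspecting them.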

**************
Best Practices
**************

**Latency-Sensitive Workloads**

If your workload is latency-sensitive and the data source is geographically distant, choose a replica 
that is close to your compute resources. For optimal performance, prefer a parallel filesystem such as 
Lustre, which provides high-throughput, low-latency access for compute-intensive applications.

**Network Throughput Optimization**

To maximize network throughput when copying data from object storage to local shared filesystems or
to other object storage systems, consider using the :ref:`Rust client <rust-client-reference>` in MSC.
The Rust client offers significant performance improvements over Python clients such as boto3, making it
well suited to large-scale data transfer operations.

**Replica Synchronization Strategy**

Populate replicas first by :ref:`mirroring data <mirror-data-to-replicas>` before relying on the **async upload-on-miss** mechanism.
Async upload-on-miss backfills replicas automatically, but only as objects are read, so it is not a substitute for an explicit
:ref:`sync <mirror-data-to-replicas>`, which leverages multiprocessing for significantly better performance during **bulk data transfers**.