(data_userguide)=

# Data Movement

:::{admonition} Keeping Data Transparent
:class: tip
"Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won't usually need your code; it'll be obvious." - *Fred Brooks*
:::

Earth2Studio aims to keep data simple and interpretable between components.
Given that this package interacts with geo-physical data, the common data structure inside workflows is the pairing of:

1. A PyTorch tensor (`torch.Tensor`) on the inference device holding the array data of interest.
2. An OrderedDict of numpy arrays (`CoordSystem`) that describes the geo-physical coordinate system of the tensor.

For example, perturbation methods operate by using a data tensor and coordinate system to generate a noise tensor:

```{literalinclude} ../../../earth2studio/perturbation/base.py
:lines: 25-
:language: python
```

In later sections, users will find that most components have APIs that either generate or interact with these two data structures.
The combination of the data tensor and its coordinate system provides the complete information one needs to interpret any stage of a workflow.

:::{note}
Data is **always** moved between components using unnormalized physical units.
:::

(coordinates_userguide)=

## Coordinate Systems

As previously discussed, coordinate dictionaries are a critical part of Earth-2 Inference Studio's data movement.
We wanted the coordinate object to be a fairly primitive data object that allows users to interact with the data outside the project and keeps things transparent in workflows.
Inside the package these are typed as `CoordSystem`, which is defined as follows:

```python
CoordSystem = NewType("CoordSystem", OrderedDict[str, np.ndarray])
```

The dictionary is ordered since the keys correspond to the dimensions of the associated data tensor.
Let's consider a simple example of a 2D lat-lon grid:

```python
from collections import OrderedDict

import numpy as np
import torch

x = torch.randn(181, 360)
coords = OrderedDict({
    "lat": np.linspace(-90, 90, 181),
    "lon": np.linspace(0, 360, 360, endpoint=False)
})
```

Most of Earth2Studio operates on a lat-lon grid, but this is not required.

### Standard Coordinate Names

Earth2Studio has a dimension naming standard for its built-in feature set.
We encourage users to follow similar naming schemes, when possible, for compatibility between Earth-2 Inference Studio and the packages we interface with.

```{list-table}
:widths: 15 40 25
:header-rows: 1

* - Key
  - Description
  - Type
* - `batch`
  - Dimension representing the batch dimension of the data. Used to denote a "free" dimension, consult batching docs for more details.
  - `np.empty(0)`
* - `time`
  - Time dimension, represented via numpy arrays of datetime objects.
  - `np.ndarray[np.datetime64[ns]]` (`TimeArray`)
* - `lead_time`
  - Lead time is used to denote a dimension that indexes over forecast steps.
  - `np.ndarray[np.timedelta64[ns]]` (`LeadTimeArray`)
* - `variable`
  - Dimension representing physical variable (atmospheric, surface, etc). Earth-2 Inference Studio has its own naming convention. See {ref}`lexicon_userguide` docs for more details.
  - `np.ndarray[str]` (`VariableArray`)
* - `lat`
  - Latitude coordinate array
  - `np.ndarray[float]`
* - `lon`
  - Longitude coordinate array
  - `np.ndarray[float]`
```

:::{note}
`np.empty(0)` is used to denote a variable axis in the coordinate system, namely a dimension in the tensor that can be of any size greater than 0.
Typically this is the `batch` dimension.
:::

### Coordinate Utilities

The downside of using a dictionary to store coordinates is that manipulating the data tensor and then updating the coordinate array is a manual process.
To help make this process less tedious, Earth2Studio has several utility functions that make interacting with coordinates easier.
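To illustrate the manual bookkeeping involved, the following is a minimal sketch using plain PyTorch and NumPy (it is not an Earth2Studio utility). It crops a lat-lon tensor to the northern hemisphere and updates the coordinate dictionary by hand. The helper name `select_lat_range` is hypothetical and exists only for this illustration.

```python
from collections import OrderedDict

import numpy as np
import torch


def select_lat_range(x, coords, lat_min, lat_max):
    """Hypothetical helper: crop the `lat` dimension of x and coords together."""
    # Position of the "lat" key gives the matching tensor axis
    dim = list(coords.keys()).index("lat")
    mask = (coords["lat"] >= lat_min) & (coords["lat"] <= lat_max)
    index = torch.from_numpy(np.flatnonzero(mask)).to(x.device)
    # Crop the tensor along the latitude axis
    x_out = torch.index_select(x, dim, index)
    # Copy the coordinates and replace only the latitude array
    coords_out = coords.copy()
    coords_out["lat"] = coords["lat"][mask]
    return x_out, coords_out


x = torch.randn(181, 360)
coords = OrderedDict(
    {
        "lat": np.linspace(-90, 90, 181),
        "lon": np.linspace(0, 360, 360, endpoint=False),
    }
)

x_north, coords_north = select_lat_range(x, coords, 0.0, 90.0)
print(x_north.shape)  # torch.Size([91, 360])
```

Keeping the tensor and the dictionary in sync by hand like this is the kind of bookkeeping the utility functions are meant to reduce.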
The bulk of these utilities can be found in the [Earth2Studio Utilities](earth2studio.utils_api).

:::{warning}
🚧 Under construction, todo: add some example here! 🚧
:::

### Model Coordinates

Models are components where coordinate systems are essential, both for defining what data is needed and what data is produced.
Both {ref}`prognostic_model_userguide` and {ref}`diagnostic_model_userguide` require two functions that serve coordinate information:

- `input_coords()` : A function that returns the expected input coordinate system of the model. A new dictionary should be returned every time.
- `output_coords()` : A function that returns the expected output coordinate system of the model *given* an input coordinate system. This function should also validate the input coordinate dictionary.

For example, consider the FourCastNet model's implementation:

```{literalinclude} ../../../earth2studio/models/px/fcn.py
:start-after: "# sphinx - coords start"
:end-before: "# sphinx - coords end"
:language: python
```

The `input_coords()` function provides the coordinate spec of the input tensor expected by the model.
In the case of FourCastNet, it expects an input tensor of shape `[...,1,26,720,1440]`, which corresponds to `[...,lead,variables,lat,lon]`.

The `output_coords()` function provides validation and transformation of coordinates by the model.
That is, one should be able to deduce the coordinate system of the model's output data without executing the forward pass of the model.
This function requires an input coordinate system that represents the input data it received.
The first step is validating the `input_coords` using the coordinate handshake utility functions.
Next, the output coordinate system is built.
This function is the place to capture the complexity of a model's forward process for other Earth2Studio components.
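As a rough sketch of this pattern, the toy class below shows how `output_coords()` can validate its input and then advance `lead_time` by a fixed step. It is hypothetical and not an Earth2Studio model: plain assertions stand in for the coordinate handshake utilities, and the 6 hour step, grid size, and variable names are assumptions chosen for illustration.

```python
from collections import OrderedDict

import numpy as np


class ToyModel:
    """Hypothetical model used only to illustrate the coordinate interface."""

    step = np.timedelta64(6, "h")  # assumed forecast step

    def input_coords(self):
        # A new dictionary is built on every call
        return OrderedDict(
            {
                "batch": np.empty(0),
                "lead_time": np.array([np.timedelta64(0, "h")]),
                "variable": np.array(["t2m", "u10m"]),
                "lat": np.linspace(90, -90, 721),
                "lon": np.linspace(0, 360, 1440, endpoint=False),
            }
        )

    def output_coords(self, input_coords):
        expected = self.input_coords()
        # Validate the dimension names and their order
        assert list(input_coords.keys()) == list(expected.keys())
        # Validate the coordinates the model cannot change
        for key in ("variable", "lat", "lon"):
            assert np.array_equal(input_coords[key], expected[key]), key
        # Build the output coordinates: same grid, lead time advanced one step
        output_coords = input_coords.copy()
        output_coords["lead_time"] = input_coords["lead_time"] + self.step
        return output_coords


model = ToyModel()
first = model.output_coords(model.input_coords())
second = model.output_coords(first)
print(first["lead_time"], second["lead_time"])
```

Because `lead_time` is not validated against a fixed value here, the output of one call can be fed back in as the input of the next, which mirrors how a forecast is rolled forward in time.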
We encourage users to explore the coordinate transforms of models to learn more about how they operate:

```python
from earth2studio.models.px import FCN

model = FCN(None, None, None)

input_coords = model.input_coords()
output_coords = model.output_coords(input_coords)
print("Input lead:", input_coords['lead_time'])
print("Output lead:", output_coords['lead_time'])

output_coords = model.output_coords(output_coords)
print("Output lead 2:", output_coords['lead_time'])
```

The output of the above script will be:

```console
Input lead: [0]
Output lead: [6]
Output lead 2: [12]
```

:::{note}
The batch dimension was intentionally not discussed.
Think of it as a free dimension.
More information can be found in the {ref}`batch_function_userguide` section.
:::

## Inference on the GPU

It is beneficial to leverage the GPU for as many processes as possible.
Earth2Studio aims to get data from the data source and immediately convert it into the tensor and coordinate data structure on the device.
From there, the data is kept on the GPU until the very last moment, when it needs to be written to memory or to file.

```{figure} https://huggingface.co/datasets/NickGeneva/Earth2StudioAssets/raw/main/0.2.0/e2studio-data.png
:alt: earth2studio-data
:width: 600px
:align: center
```

In the figure above, the data is first pulled from the data source as an Xarray data array, which is then converted to a tensor.
The data remains on the device, denoted by the GPU boundary, until it needs to be written by the IO component.

:::{admonition} Data Sources and Xarray
:class: tip
This may raise the question: why do data sources not output directly to tensor and coordinate dictionaries?
This is an opinionated decision: these data sources need to store data on the CPU regardless, and they can be extremely useful outside the context of this package.
Thus they return Xarray data arrays, which are nothing more than a fancy data array with a coordinate system attached!
:::
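The hand-off from an Xarray data array to the tensor and coordinate pair can be sketched as follows. This is a minimal illustration with a hypothetical helper, `to_tensor_coords`, and is not the conversion routine Earth2Studio itself uses. The synthetic data array and the variable name `t2m` are placeholders for what a data source might return.

```python
from collections import OrderedDict

import numpy as np
import torch
import xarray as xr


def to_tensor_coords(da, device):
    """Hypothetical helper: split a DataArray into a tensor and coordinates."""
    # Coordinates stay on the CPU as numpy arrays, ordered like the dimensions
    coords = OrderedDict((dim, da.coords[dim].values) for dim in da.dims)
    # Array data is moved to the inference device
    x = torch.as_tensor(da.values, dtype=torch.float32, device=device)
    return x, coords


# Synthetic stand-in for the output of a data source
da = xr.DataArray(
    np.random.rand(1, 1, 181, 360),
    dims=["time", "variable", "lat", "lon"],
    coords={
        "time": np.array(["2024-01-01T00:00"], dtype="datetime64[ns]"),
        "variable": np.array(["t2m"]),
        "lat": np.linspace(90, -90, 181),
        "lon": np.linspace(0, 360, 360, endpoint=False),
    },
)

# Fall back to the CPU when no GPU is present
device = "cuda" if torch.cuda.is_available() else "cpu"
x, coords = to_tensor_coords(da, device)
print(x.shape, list(coords.keys()))
```

Only the array data moves to the device in this sketch. The coordinate arrays remain small numpy arrays on the CPU, which keeps them easy to inspect at any stage of a workflow.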