Lexicon#

As discussed in detail in the Data Movement section, Earth2Studio tracks the geophysical representation of tensor data inside workflows. This includes the name of the variable / parameter / property the data represents, which is tracked explicitly via Earth2Studio's lexicon. Similar to ECMWF's parameter database, Earth2Studio's lexicon aims to provide an opinionated and explicit list of short variable names used across the package, found in earth2studio.lexicon.base.E2STUDIO_VOCAB. Many of these names are based on ECMWF's parameter database, but not all.

Below are a few examples:

  • t2m: Temperature in Kelvin at 2 meters

  • u10m: Zonal (u) component of wind at 10 meters

  • v10m: Meridional (v) component of wind at 10 meters

  • u200: Zonal (u) component of wind at 200 hPa

  • z250: Geopotential at 250 hPa

  • z500: Geopotential at 500 hPa

  • tcwv: Total column water vapor

Altitude / Pressure Levels

Note that 3D atmospheric variables are split into their individual pressure levels. This is better suited for working with AI models that may use different sets of pressure levels. Levels based on altitude end with an "m" to denote height in meters.
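
As a quick check, the core vocabulary can be inspected directly in Python. The following is a minimal sketch, assuming E2STUDIO_VOCAB behaves like a standard Python collection of variable names:

from earth2studio.lexicon.base import E2STUDIO_VOCAB

# Confirm a few of the short names listed above are part of the core vocabulary
for name in ["t2m", "u10m", "z500", "tcwv"]:
    print(name, name in E2STUDIO_VOCAB)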

Datasource Lexicon#

A common challenge when working with different sources of weather/climate data is that the same variable may be named / denoted in different ways. The lexicon is also used to track the translation between Earth2Studio's naming scheme and the scheme needed to parse the remote data source. Each remote data store has its own lexicon: a dictionary whose keys are Earth2Studio variable names and whose values are strings used to parse the remote data store. Typically, this value corresponds to the variable name inside the remote data store.

The following snippet is part of the lexicon for the GFS dataset. Note that the class uses metaclass=LexiconType, which is defined in earth2studio.lexicon.base and is used for type checking.

class GFSLexicon(metaclass=LexiconType):
    """Global Forecast System Lexicon
    GFS specified <Parameter ID>::<Level/ Layer>

    Warning
    -------
    Some variables are only present for lead time greater than 0

    Note
    ----
    Additional resources:
    https://www.nco.ncep.noaa.gov/pmb/products/gfs/gfs.t00z.pgrb2.0p25.f000.shtml
    https://www.nco.ncep.noaa.gov/pmb/products/gfs/gfs.t00z.pgrb2.0p25.f003.shtml
    """

    VOCAB = {
        "u10m": "UGRD::10 m above ground",
        "v10m": "VGRD::10 m above ground",
        "u100m": "UGRD::100 m above ground",
        "v100m": "VGRD::100 m above ground",
        "t2m": "TMP::2 m above ground",

How the value of each variable is interpreted is left up to the data source. The common pattern is to split the string on the :: separator and then use the resulting parts to access the required data. For example, for the variable u100, zonal winds at 100 hPa, the value UGRD::100 mb is split into UGRD and 100 mb, which are then used with the remote GRIB index file to fetch the correct data.
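
The split itself is straightforward. A minimal sketch using the UGRD::100 mb value from the example above:

# Split a GFS lexicon value into the GRIB parameter name and level
gfs_value = "UGRD::100 mb"
gfs_name, level = gfs_value.split("::")
print(gfs_name)  # UGRD
print(level)  # 100 mb

Inside the GFS data source, the lexicon lookup and GRIB index parsing look like the following: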

            try:
                gfs_name, modifier = GFSLexicon[variable]
            except KeyError:
                logger.warning(
                    f"variable id {variable} not found in GFS lexicon, good luck"
                )
                gfs_name = variable

                def modifier(x: np.array) -> np.array:
                    """Modify data (if necessary)."""
                    return x

            byte_offset = None
            byte_length = None
            for key, value in index_file.items():
                if gfs_name in key:
                    byte_offset = value[0]
                    byte_length = value[1]
                    break

            if byte_offset is None:
                raise KeyError(f"Could not find variable {gfs_name} in index file")
            # Download the grib file to cache
            logger.debug(
                f"Fetching GFS grib file for variable: {variable} at {time}_{lead_hour}"
            )
            grib_file = self._download_s3_grib_cached(
                grib_file_name, byte_offset=byte_offset, byte_length=byte_length
            )
            # Open into xarray data-array
            da = xr.open_dataarray(
                grib_file, engine="cfgrib", backend_kwargs={"indexpath": ""}
            )
            gfsda[0, 0, i] = modifier(da.values)

It is a common pattern for data source lexicons to contain a modifier function that applies adjustments to align data more uniformly across the package. A good example of this is the GFS dataset, which uses the modifier function to transform the GFS-supplied geopotential height into geopotential, to better align with other sources inside Earth2Studio.

    @classmethod
    def get_item(cls, val: str) -> tuple[str, Callable]:
        """Get item from GFS vocabulary."""
        gfs_key = cls.VOCAB[val]
        if gfs_key.split("::")[0] == "HGT":

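            # GFS supplies geopotential height in meters; multiply by g (9.81 m/s^2)
            # to convert to geopotential (m^2/s^2)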
            def mod(x: np.array) -> np.array:
                """Modify data value (if necessary)."""
                return x * 9.81

        else:

            def mod(x: np.array) -> np.array:
                """Modify data value (if necessary)."""
                return x

        return gfs_key, mod
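
For example, accessing a pressure-level geopotential variable through the lexicon returns both the GFS key and the conversion function. The following is a minimal sketch, assuming GFSLexicon is importable from earth2studio.lexicon and that z500 maps to an HGT entry:

import numpy as np

from earth2studio.lexicon import GFSLexicon

# Hypothetical geopotential height values (meters) at 500 hPa
height = np.array([5500.0, 5700.0])

# Lexicon lookup returns the GFS key and the modifier function
gfs_key, mod = GFSLexicon["z500"]  # e.g. ("HGT::500 mb", <modifier>)
geopotential = mod(height)  # height (m) -> geopotential (m^2/s^2)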

Warning

The lexicon does not necessarily contain every variable inside the remote data store. Rather, it explicitly lists what is available inside Earth2Studio. See a variable missing that you would like added? Open an issue!