Lexicon#
As discussed in detail in the Data Movement section, Earth2Studio tracks the
geo-physical representation of tensor data inside workflows.
This includes the name of the variable / parameter / property the data represents, which
is tracked explicitly via Earth2Studio’s lexicon.
Similar to ECMWF’s parameter database,
Earth2Studio’s lexicon aims to provide an opinionated and explicit list of short variable
names that are used across the package, found in earth2studio.lexicon.base.E2STUDIO_VOCAB.
Many of these names are based on ECMWF’s parameter database, but not all.
Below are a few examples:
- t2m: Temperature in Kelvin at 2 meters
- u10m: u-component (eastward/zonal) of winds at 10 meters
- v10m: v-component (northward/meridional) of winds at 10 meters
- u200: u-component of winds at 200 hPa
- z250: Geo-potential at 250 hPa
- z500: Geo-potential at 500 hPa
- tcwv: Total column water vapor
Altitude / Pressure Levels#
Note that there are a variety of ways to represent the vertical coordinates for 3D
atmospheric variables. The most common method is to slice variables along pressure
levels (surfaces of constant pressure), and this is considered the “default” in terms
of variable names within the lexicon (e.g., z500 is the geo-potential at the 500 hPa
pressure level). Variables which are represented based on altitude above the surface
contain an “m” at the end, to denote height in meters, such as u10m.
Some models or workflows, however, require using their own custom vertical coordinate
which is neither pressure-level nor terrain-following. These are typically referred to
as “native” or “hybrid” vertical levels, and are defined differently for different
use-cases. The lexicon supports these custom levels by indexing the vertical level and
appending a suffix to the variable name to denote it is a custom vertical level, as in
u100k to indicate the u-component of winds at the custom vertical level with index
100 (indexed by k). We leave the choice of suffix up to each use-case, and reserve
the following special-case suffixes:
- No suffix: assumed to be pressure-level, as in z500 for geo-potential at the 500 hPa level
- m: altitude in meters above the surface
Warning
Only use custom vertical level data with caution. The definition of these vertical levels changes with each data source, model, or use-case, and thus they are not necessarily interoperable. Transforming between different custom vertical levels will likely require custom interpolation schemes, possibly using pressure-levels as an intermediate step.
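The suffix convention described in this section can be illustrated with a small parser. This is a minimal sketch and not part of Earth2Studio: the function name parse_variable, the ParsedVariable tuple, and the regular expression are all hypothetical, and names without a numeric level (such as tcwv) are simply reported as having no vertical level.

```python
import re
from typing import NamedTuple, Optional


class ParsedVariable(NamedTuple):
    """Hypothetical container for the pieces of a lexicon variable name."""

    base: str
    level: Optional[int]
    vertical: str  # one of "pressure", "altitude", "custom", "none"


# Assumed pattern: letters, then a numeric level, then an optional letter suffix.
_NAME_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)$")


def parse_variable(name: str) -> ParsedVariable:
    """Split a variable name into base name, level, and vertical coordinate type."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        # No numeric level in the name, e.g. "tcwv" or "msl".
        return ParsedVariable(name, None, "none")
    base, level, suffix = match.groups()
    if suffix == "":
        vertical = "pressure"  # no suffix: pressure level in hPa
    elif suffix == "m":
        vertical = "altitude"  # "m": meters above the surface
    else:
        vertical = "custom"  # any other suffix: use-case specific level index
    return ParsedVariable(base, int(level), vertical)


for name in ["z500", "u10m", "u100k", "tcwv"]:
    print(name, parse_variable(name))
```

A parser like this only classifies the name; it says nothing about how a custom level such as u100k is defined, which still depends on the data source or model.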
Datasource Lexicon#
A common challenge when working with different sources of weather/climate data is that the same variables may be named or denoted in different ways. The lexicon is also used to track the translation between Earth2Studio’s naming scheme and the scheme needed to parse the remote data source. Each remote data store has its own lexicon, which is a dictionary that has Earth2Studio variable names as keys and, as values, the strings used to parse the remote data store. Typically, each value is a string that corresponds to the variable name inside the remote data store.
The following snippet is part of the lexicon for the GFS dataset.
Note that the class sets metaclass=LexiconType, which is defined in
earth2studio.lexicon.base.py and is used for type checking.
class GFSLexicon(metaclass=LexiconType):
    """Global Forecast System Lexicon

    GFS specified <Parameter ID>::<Level/ Layer>

    Warning
    -------
    Some variables are only present for lead time greater than 0

    Note
    ----
    Additional resources:
    https://www.nco.ncep.noaa.gov/pmb/products/gfs/gfs.t00z.pgrb2.0p25.f000.shtml
    https://www.nco.ncep.noaa.gov/pmb/products/gfs/gfs.t00z.pgrb2.0p25.f003.shtml
    """

    VOCAB = {
        "u10m": "UGRD::10 m above ground",
        "v10m": "VGRD::10 m above ground",
        "u100m": "UGRD::100 m above ground",
        "v100m": "VGRD::100 m above ground",
        "t2m": "TMP::2 m above ground",
        "d2m": "DPT::2 m above ground",
        "r2m": "RH::2 m above ground",
        "q2m": "SPFH::2 m above ground",
        "sp": "PRES::surface",
        "msl": "PRMSL::mean sea level",
        "tcwv": "PWAT::entire atmosphere (considered as a single layer)",
        "tp": "596::APCP::surface",  # 3 hour acc
        "2d": "DPT::2 m above ground",
        "fg10m": "GUST::surface",  # Surface
        "u1": "UGRD::1 mb",
        "u2": "UGRD::2 mb",
        "u3": "UGRD::3 mb",
        "u5": "UGRD::5 mb",
        "u7": "UGRD::7 mb",
        "u10": "UGRD::10 mb",
        "u15": "UGRD::15 mb",
The value stored for each variable is left up to the data source.
The present pattern is to split the string on the separator ::, and the resulting parts
are then used to access the required data.
For example, for the variable u100 (zonal winds at 100 hPa), the value UGRD::100 mb is
split into UGRD and 100 mb, which are then used with the remote GRIB index file to
fetch the correct data.
try:
    gfs_name, modifier = GFSLexicon[v]
except KeyError:
    logger.warning(
        f"variable id {v} not found in GFS lexicon, good luck"
    )
    gfs_name = v

    def modifier(x: np.array) -> np.array:
        """Modify data (if necessary)."""
        return x

byte_offset = None
byte_length = None
for key, value in index_file.items():
    if gfs_name in key:
        byte_offset = value[0]
        byte_length = value[1]
        break

if byte_length is None or byte_offset is None:
    logger.warning(
        f"Variable {v} not found in index file for time {t} at {lt}, values will be unset"
    )
    continue
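The lookup step can also be sketched in isolation. The following is a minimal, self-contained illustration rather than the Earth2Studio implementation: TOY_INDEX, its key layout ("<line>::<parameter>::<level>") and the byte values are invented for the example, and find_byte_range is a hypothetical helper. It compares the split parts exactly instead of using the substring test shown above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical index: "<line>::<parameter>::<level>" -> (byte_offset, byte_length).
# The layout and the numbers are assumptions made for this sketch only.
TOY_INDEX: Dict[str, Tuple[int, int]] = {
    "1::UGRD::100 mb": (0, 1200),
    "2::VGRD::100 mb": (1200, 1150),
    "3::TMP::2 m above ground": (2350, 900),
}


def find_byte_range(
    lexicon_value: str, index: Dict[str, Tuple[int, int]]
) -> Optional[Tuple[int, int]]:
    """Return (byte_offset, byte_length) for a lexicon value, or None if absent."""
    # Split on the "::" separator; the last two parts are parameter and level.
    parameter, level = lexicon_value.split("::")[-2:]
    for key, byte_range in index.items():
        # Compare against the last two parts of the index key.
        if key.split("::")[-2:] == [parameter, level]:
            return byte_range
    return None


print(find_byte_range("UGRD::100 mb", TOY_INDEX))
print(find_byte_range("UGRD::10 mb", TOY_INDEX))
```

Returning None for a missing entry mirrors the excerpt above, where the caller logs a warning and skips the variable instead of raising.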
It is a common pattern for data source lexicons to contain a modifier function that is used to apply adjustments to align data more uniformly with the package. A good example of this is the GFS dataset, which uses the modifier function to transform the GFS-supplied geo-potential height into geo-potential to better align with other sources inside Earth2Studio.
    @classmethod
    def get_item(cls, val: str) -> tuple[str, Callable]:
        """Get item from GFS vocabulary."""
        gfs_key = cls.VOCAB[val]
        if gfs_key.split("::")[0] == "HGT":

            def mod(x: np.array) -> np.array:
                """Modify data value (if necessary)."""
                return x * 9.81

        else:

            def mod(x: np.array) -> np.array:
                """Modify data value (if necessary)."""
                return x

        return gfs_key, mod
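To show how a class-level lookup such as GFSLexicon[v] could be connected to get_item, here is a minimal sketch using a metaclass that defines __getitem__. ToyLexiconType and ToyLexicon are hypothetical stand-ins, not the library's LexiconType or GFSLexicon. The "z500" entry is an illustrative assumption, and plain floats are used instead of arrays to keep the example dependency-free.

```python
from typing import Callable, Dict, Tuple

GRAVITY = 9.81  # same factor as in the get_item snippet above

Modifier = Callable[[float], float]


class ToyLexiconType(type):
    """Hypothetical metaclass that forwards ToyLexicon[name] to get_item(name)."""

    def __getitem__(cls, val: str) -> Tuple[str, Modifier]:
        return cls.get_item(val)


class ToyLexicon(metaclass=ToyLexiconType):
    """Toy data source lexicon with two entries."""

    # "z500" is an assumed entry for illustration; "t2m" matches the GFS excerpt.
    VOCAB: Dict[str, str] = {
        "z500": "HGT::500 mb",
        "t2m": "TMP::2 m above ground",
    }

    @classmethod
    def get_item(cls, val: str) -> Tuple[str, Modifier]:
        """Return the remote key and a modifier for the requested variable."""
        key = cls.VOCAB[val]  # raises KeyError for unknown variables
        if key.split("::")[0] == "HGT":
            # Geo-potential height (m) -> geo-potential (m^2 / s^2).
            return key, lambda x: x * GRAVITY
        # All other variables pass through unchanged.
        return key, lambda x: x


key, modifier = ToyLexicon["z500"]
print(key, modifier(5500.0))
```

Because unknown names raise KeyError from the dictionary lookup, a caller can catch it and fall back to an identity modifier, as the GFS excerpt earlier in this section does.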
Warning
The lexicon does not necessarily contain every variable inside the remote data store. Rather it explicitly lists what is available inside Earth2Studio. See some variable missing you would like to add? Open an issue!