How to work with EERIE data on DKRZ#

This notebook guides EERIE data users and explains how to find and load data available at DKRZ.

The notebook works well within the /work/bm1344/conda-envs/py_312/ environment.

import panel as pn

pn.extension("tabulator")
import intake

catalog = (
    "https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml"
)
eerie_cat = intake.open_catalog(catalog)
eerie_cat
eerie:
  args:
    path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml
  description: ''
  driver: intake.catalog.local.YAMLFileCatalog
  metadata: {}

Interactive browsing with the database table#

The following table contains all EERIE model-output available at DKRZ on the file system.

  • Use the filters to subset the database

  • Copy one value of the catalog_entry column and use it to open the data set from the intake catalog

  • Each line contains a single variable name parsed from the datasets.

  • DRS is the Data Reference Syntax, a template for the path hierarchy of the catalogs.
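As a minimal sketch (using the component names of the DRS template and a dataset_id that appears later in this notebook), a dataset_id can be mapped onto named DRS components like this:

```python
# Sketch: map a dataset_id from the database onto the DRS components.
DRS = "source_id.experiment_id.version.realm.grid_label.aggregation"


def parse_dataset_id(dataset_id: str) -> dict:
    """Split a dataset_id into named DRS components.

    Ids without a version component (5 parts instead of 6) get
    'not_set' inserted at the version position.
    """
    parts = dataset_id.split(".")
    keys = DRS.split(".")
    if len(parts) == len(keys) - 1:
        parts.insert(2, "not_set")  # version is missing
    return dict(zip(keys, parts))


parsed = parse_dataset_id("icon-esm-er.hist-1950.v20240618.ocean.gr025.2d_monthly_mean")
print(parsed["source_id"], parsed["version"], parsed["aggregation"])
# icon-esm-er v20240618 2d_monthly_mean
```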


import ast

import pandas as pd

csv_uri = "https://eerie.cloud.dkrz.de/stats-eerie.csv"
df = pd.read_csv(csv_uri)
drs = "source_id.experiment_id.version.realm.grid_label.aggregation"


def normalize_identifier(s: str) -> list[str]:
    """Split a dataset_id into DRS parts, inserting 'not_set' for a missing version."""
    parts = s.split(".")
    if len(parts) == len(drs.split(".")) - 1:
        parts.insert(2, "not_set")
    return parts


df["catalog_entry"] = "dkrz.disk.model-output." + df["dataset_id"]
df["parts"] = df["dataset_id"].apply(normalize_identifier)
df[drs.split(".")] = pd.DataFrame(df["parts"].to_list(), index=df.index)
df = df.drop(columns="parts")

# var_names is stored as a string representation of a list;
# parse it safely with ast.literal_eval instead of eval
df["var_names"] = df["var_names"].fillna("['Not set']").apply(ast.literal_eval)
df = df.explode("var_names", ignore_index=True)
df = df.rename(columns=dict(var_names="variable_id"))

# drop coordinate-like entries
patterns = ["height", "lat", "lon", "pfull", "depth", "time"]
df = df[~df["variable_id"].str.contains("|".join(patterns), regex=True)]

tabu = pn.widgets.Tabulator(
    df[drs.split(".") + ["variable_id", "catalog_entry", "start_year", "end_year"]],
    show_index=False,
    header_filters=True,
    widths={"variable_id": 200, "catalog_entry": 150},
    selectable=1,
    pagination="local",
    page_size=20,
)
tabu  # there is a hidden cell before
tabu.save("eerie-intake-database.html")

The EERIE Catalog is hierarchically structured using a path template (similar to the Data Reference Syntax used in CMIP):

hpc.hardware.product.source_id.experiment_id.version.realm.grid_label.aggregation

Opened with Python, the catalog is a nested dictionary of catalog sources. The lowest level contains data sources, which can be opened as xarray datasets with to_dask().

You can browse through the catalog by listing the catalog and selecting keys:

print(list(eerie_cat))
print(list(eerie_cat["dkrz"]))
['jasmin', 'jasmin_badc', 'dkrz']
['disk', 'archive', 'cloud', 'main', 'dkrz_ngc3']

Entries can be joined with a ‘.’ so that you can access deeper level entries from the highest catalog level:

eerie_cat["dkrz.disk"]
disk:
  args:
    path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/main.yaml
  description: Use this catalog if you are working on Levante. This catalog contains
    datasets for all raw data in /work/bm1344 and accesses the data via kerchunks
    in /work/bm1344/DKRZ/kerchunks.
  driver: intake.catalog.local.YAMLFileCatalog
  metadata:
    catalog_dir: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz

… tip:: Note that catalogs support autocompletion when pressing Tab.

For model-output stored on DKRZ’s disk, you can get a table-like overview from a “database” CSV file served by the EERIE cloud. The code hidden above the interactive table shows you how to read it with pandas.

In the following, we display all unique values of each element of the catalog hierarchy:

from IPython.display import display, Markdown

drs = "source_id.experiment_id.version.realm.grid_label"
for c in drs.split("."):
    display(Markdown(f"## Unique entries for *{c}*:"))
    display(Markdown("- " + "\n- ".join(df[c].unique())))

Unique entries for source_id:

  • hadgem3-gc5-n216-orca025

  • hadgem3-gc5-n640-orca12

  • icon-esm-er

  • ifs-amip-tco1279

  • ifs-amip-tco2559

  • ifs-amip-tco399

  • ifs-amip-tco3999

  • ifs-fesom2-sr

  • ifs-nemo-er

Unique entries for experiment_id:

  • eerie-picontrol

  • eerie-control-1950

  • highres-future-ssp245

  • hist-1950

  • hist-c-0-a-lr20

  • hist

  • hist-c-lr20-a-0

Unique entries for version:

  • not_set

  • v20240618

  • v20240901

  • v20250101

  • v20240304

  • v20250516

Unique entries for realm:

  • atmos

  • ocean

  • land

  • wave

Unique entries for grid_label:

  • gr025

  • native

Note that not all combinations of the DRS exist. For example, the ICON spinup run only has data on the native grid. You can use list to explore the contents.

list(eerie_cat["dkrz.disk.model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos"])
['gr025', 'native']

See also the processing example for printing the full tree of a catalog.

You can print a short description of each collection with its describe function:

for col in eerie_cat["dkrz"]:
    print(f"Description of {col}:")
    print(eerie_cat["dkrz"][col].describe()["description"])
Description of disk:
Use this catalog if you are working on Levante. This catalog contains datasets for all raw data in /work/bm1344 and accesses the data via kerchunks in /work/bm1344/DKRZ/kerchunks.
Description of archive:
Only use this catalog if your desired data is not available on disk and needs to be retrieved from DKRZ's tape archive. This catalog contains datasets for archived data in /arch/bm1344
Description of cloud:
Use this catalog if you are NOT working on Levante. This catalog contains the same datasets as *dkrz_eerie_kerchunk* but data access is via the xpublish server *eerie.cloud.dkrz.de*
Description of main:
DKRZ master catalog for all /pool/data catalogs available
Description of dkrz_ngc3:
NextGEMs Cycle 3 data

The DKRZ catalogs are distinguished by storage location at DKRZ.

… note:: If you are accessing the data remotely, you can use the cloud catalog. For accessing other data, you have to be logged in on DKRZ's HPC.

eerie_dkrz_disk = eerie_cat["dkrz"]["disk"]
for col in eerie_dkrz_disk:
    print(f"Description of {col}:")
    print(eerie_dkrz_disk[col].describe()["description"])
Description of model-output:
EERIE Earth System Model output available on DKRZ's Levante File System. This catalog contains datasets for all raw data in /work/bm1344 and accesses the data via kerchunks in /work/bm1344/DKRZ/kerchunks
Description of observations:
This catalog contains observational data that is used for EERIE evaluation.
Description of CMOR:
This catalog contains CMORized testing data that is going to be used for ESGF publication.
Description of derived-variables:
Products of EERIE Earth System Model output available on DKRZ's Levante File System.

We continue to work with the eerie-dkrz-disk-model-output catalog to show how to

  • browse

  • open and load

  • subset

data.

cat = eerie_dkrz_disk["model-output"]
list(cat)
['icon-esm-er',
 'ifs-fesom2-sr',
 'ifs-amip-tco3999',
 'ifs-amip-tco2559',
 'ifs-amip-tco1279',
 'ifs-amip-tco399',
 'ifs-nemo-er',
 'hadgem3-gc5-n640-orca12',
 'hadgem3-gc5-n216-orca025',
 'csv',
 'esm-json']

We have two options:

  • continue working with yaml files and the intake-xarray plugins (easier to load)

  • switch to intake-esm (easier to browse)

Browse#

Pure Intake#

Intake offers a free-text search method. We can search, for example, for the control run; Intake returns another catalog.

searchdict = dict(
    model="ICON",
    realm="ocean",
    exp="eerie-control-1950",
    var="temperature",
    frequency="monthly",
)
subcat = cat["icon-esm-er"]
for v in searchdict.values():
    subcat = subcat.search(v)
list(subcat)

# note that `search` has a keyword argument *depth* (default: 2) which indicates how many subcatalogs should be searched.
# if you use a high level catalog, adapt that argument to your needs
['eerie-control-1950.v20231106.ocean.gr025.2d_monthly_mean',
 'eerie-control-1950.v20231106.ocean.gr025.2d_monthly_square',
 'eerie-control-1950.v20231106.ocean.gr025.eddy_monthly_mean',
 'eerie-control-1950.v20240618.ocean.gr025.2d_monthly_mean',
 'eerie-control-1950.v20240618.ocean.gr025.2d_monthly_square',
 'eerie-control-1950.v20240618.ocean.gr025.eddy_monthly_mean']

Flatten a catalog by finding all data sources in all levels:

from copy import deepcopy


def find_data_sources(catalog, name=None):
    newname = ".".join([a for a in [name, catalog.name] if a])
    data_sources = []

    for key, entry in catalog.items():
        if isinstance(entry, intake.catalog.Catalog):
            if newname == "main":
                newname = None
            # If the entry is a subcatalog, recursively search it
            data_sources.extend(find_data_sources(entry, newname))
        elif isinstance(entry, intake.source.base.DataSource):
            data_sources.append(newname + "." + key)

    return data_sources
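The recursion above can be illustrated with plain nested dictionaries standing in for catalogs (a sketch only; the real function checks intake's Catalog and DataSource types):

```python
# Sketch: the same flattening idea on plain nested dicts.
# A dict value represents a subcatalog, anything else a data source.
def flatten_keys(node, name=None):
    flat = []
    for key, entry in node.items():
        newname = ".".join([a for a in [name, key] if a])
        if isinstance(entry, dict):
            # recurse into the "subcatalog"
            flat.extend(flatten_keys(entry, newname))
        else:
            flat.append(newname)
    return flat


tree = {
    "dkrz": {
        "disk": {"model-output": {"icon-esm-er": "source"}},
        "cloud": "source",
    }
}
print(flatten_keys(tree))
# ['dkrz.disk.model-output.icon-esm-er', 'dkrz.cloud']
```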

Get file names#

If you need file names for use in shell scripts, you can get them via the query_yaml.py program for a given specification of

  • catalog,

  • dataset,

  • variable name,

e.g.:

%%bash
module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable

FILES=$(query_yaml.py ifs-fesom2-sr highres-future-ssp245 v20240304 atmos \
    gr025 2D_monthly_avg \
    -c https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/model-output/main.yaml \
    --var mean2t --uri --cdo )
echo found
echo ${FILES} | wc -w
echo files
found
672
files

Open and load#

You can open a dataset from the catalog like a dictionary entry. This gives you a lot of metadata:

dsid = "icon-esm-er.hist-1950.v20240618.ocean.gr025.2d_monthly_mean"
cat[dsid]
2d_monthly_mean:
  args:
    chunks: auto
    consolidated: false
    storage_options:
      lazy: true
      remote_protocol: file
    urlpath: reference:://work/bm1344/k202193/Kerchunk/erc2020/v20240618/ocean/gr025/oce_2d_1mth_mean_remap025.parq
  description: ''
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    Conventions: CF-1.7 CMIP-6.2
    activity_id: HighResMIP
    catalog_dir: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/model-output/icon-esm-er/hist-1950/v20240618/ocean/gr025
    contact: juergen.kroeger@mpimet.mpg.de
    creation_date: '2025-05-20T09:04:20'
    data_specs_version: 01.00.32
    experiment: coupled historical 1950-2014
    experiment_id: hist-1950
    forcing_index: 1
    grid: gr025
    grid_label: gr
    initialization_index: 1
    institution: Max Planck Institute for Meteorology, Hamburg 20146, Germany
    institution_id: MPI-M
    license: 'EERIE model data produced by MPI-M is licensed under a Creative Commons
      Attribution 4.0 International License (https://creativecommons.org/licenses).
      The data producers and data providers make no warranty, either express or implied,
      including, but not limited to, warranties of merchantability and fitness for
      a particular purpose. All liabilities arising from the supply of the information
      (including any liability arising in negligence) are excluded to the fullest
      extent permitted by law. '
    mip_era: EERIE
    nominal_resolution: 5 km
    parent_activity_id: HighResMIP
    parent_experiment_id: spinup-1950
    physics_index: 1
    product: model-output
    project_id: EERIE
    realization_index: 1
    realm: ocean
    references: "Hohenegger et al., ICON-Sapphire: simulating the components of the\
      \ Earth system and their interactions at kilometer and subkilometer scales.\
      \ Geosci. Model Dev., 16, 779\u2013811, 2023, https://doi.org/10.5194/gmd-16-779-2023"
    source: "ICON-ESM-ER (2023): \naerosol: none, prescribed MACv2-SP\natmos: ICON-A\
      \ (icosahedral/triangles; 10 km; 90 levels; top level 80 km)\natmosChem: none\n\
      land: JSBACH4.20\nlandIce: none/prescribed\nocean: ICON-O (icosahedral/triangles;\
      \ 5 km; 72 levels; top grid cell 0-2 m)\nocnBgchem: none\nseaIce: unnamed (thermodynamic\
      \ (Semtner zero-layer) dynamic (Hibler 79) sea ice model)"
    source_id: ICON-ESM-ER
    source_type: AOGCM
    sub_experiment: none
    sub_experiment_id: none
    variable-long_names:
    - Salt volume flux due to sea ice change
    - Freshwater Flux due to Sea Ice Change
    - Conductive heat flux at ice-ocean interface
    - Energy flux available for surface melting
    - Wind Speed at 10m height
    - atmos_fluxes_FrshFlux_Evaporation
    - atmos_fluxes_FrshFlux_Precipitation
    - atmos_fluxes_FrshFlux_Runoff
    - atmos_fluxes_FrshFlux_SnowFall
    - atmos_fluxes_HeatFlux_Latent
    - atmos_fluxes_HeatFlux_LongWave
    - atmos_fluxes_HeatFlux_Sensible
    - atmos_fluxes_HeatFlux_ShortWave
    - atmos_fluxes_HeatFlux_Total
    - atmos_fluxes_stress_x
    - atmos_fluxes_stress_xw
    - atmos_fluxes_stress_y
    - atmos_fluxes_stress_yw
    - ice concentration in each ice class
    - Change in ice mean thickness due to thermodynamic effects
    - Change in mean snow thickness due to thermodynamic melting
    - Heat flux to ocean from the ice growth
    - Heat flux to ocean from the atmosphere
    - heat_content_seaice
    - heat_conten_snow
    - heat_content_total
    - ice thickness
    - snow thickness
    - zonal velocity
    - meridional velocity
    - ocean_mixed_layer_thickness_defined_by_sigma_t
    - ocean_mixed_layer_thickness_defined_by_sigma_t_10m
    - new ice growth in open water
    - Sea Level Pressure
    - amount of snow that is transformed to ice
    - sea water salinity
    - surface elevation at cell center
    - zstar surface stretch at cell center
    - sea water potential temperature
    - vertically integrated mass flux at edges
    variant_label: r1i1p1f1
    version_id: v20240618

You can open the data as an xarray dataset with the to_dask() function.

ds = cat[dsid].to_dask()
ds
<xarray.Dataset> Size: 126GB
Dimensions:                              (time: 780, lat: 721, lon: 1440,
                                          lev: 1, depth: 1)
Coordinates:
  * depth                                (depth) float64 8B 1.0
  * lat                                  (lat) float64 6kB -90.0 -89.75 ... 90.0
  * lev                                  (lev) float64 8B 0.0
  * lon                                  (lon) float64 12kB 0.0 0.25 ... 359.8
  * time                                 (time) datetime64[ns] 6kB 1950-01-31...
Data variables: (12/39)
    FrshFlux_IceSalt                     (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    FrshFlux_TotalIce                    (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    Qbot                                 (time, lev, lat, lon) float32 3GB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    Qtop                                 (time, lev, lat, lon) float32 3GB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    Wind_Speed_10m                       (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    atmos_fluxes_FrshFlux_Evaporation    (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    ...                                   ...
    sea_level_pressure                   (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    snow_to_ice                          (time, lev, lat, lon) float32 3GB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    so                                   (time, depth, lat, lon) float32 3GB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    ssh                                  (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    stretch_c                            (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    to                                   (time, depth, lat, lon) float32 3GB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
Attributes: (12/31)
    Conventions:           CF-1.7 CMIP-6.2
    activity_id:           HighResMIP
    data_specs_version:    01.00.32
    forcing_index:         1
    initialization_index:  1
    license:               EERIE model data produced by MPI-M is licensed und...
    ...                    ...
    parent_activity_id:    HighResMIP
    sub_experiment_id:     none
    experiment:            coupled historical 1950-2014
    source:                ICON-ESM-ER (2023): \naerosol: none, prescribed MA...
    institution:           Max Planck Institute for Meteorology, Hamburg 2014...
    sub_experiment:        none

Subset#

Use

  • isel for index selection

  • sel for value selection. Note that you can use method="nearest" to do a nearest-neighbour lookup

  • groupby for statistics

# get the latest time step
to_last_timestep = ds["to"].isel(time=-1)
to_last_timestep
<xarray.DataArray 'to' (depth: 1, lat: 721, lon: 1440)> Size: 4MB
dask.array<getitem, shape=(1, 721, 1440), dtype=float32, chunksize=(1, 721, 1440), chunktype=numpy.ndarray>
Coordinates:
  * depth    (depth) float64 8B 1.0
  * lat      (lat) float64 6kB -90.0 -89.75 -89.5 -89.25 ... 89.5 89.75 90.0
  * lon      (lon) float64 12kB 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
    time     datetime64[ns] 8B 2014-12-31T23:59:59
Attributes:
    code:           255
    long_name:      sea water potential temperature
    standard_name:  sea_water_potential_temperature
    units:          C
# select a coordinate by value with nearest-neighbour lookup:
import hvplot.xarray

to_northsea = (
    ds["to"].sel(lat=54, lon=8.2, method="nearest").drop_vars(["lat", "lon"])
)
to_northsea.hvplot.line()
to_northsea_yearmean = to_northsea.groupby("time.year").mean()
to_northsea_yearmean.hvplot.line()
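The annual-mean statistic can be sketched with a small synthetic pandas series (pure pandas, no EERIE data required; analogous to xarray's groupby("time.year").mean() above):

```python
import pandas as pd

# Sketch: annual means via groupby on a synthetic monthly series.
time = pd.date_range("1950-01-01", periods=24, freq="MS")
series = pd.Series(range(24), index=time)

# group the monthly values by calendar year and average
yearmean = series.groupby(series.index.year).mean()
print(yearmean)
# 1950 -> mean of 0..11 = 5.5, 1951 -> mean of 12..23 = 17.5
```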

Handling of ICON native grid for georeferenced plots:

dsnative = cat[
    "icon-esm-er.highres-future-ssp245.v20240618.ocean.native.2d_monthly_mean"
].to_dask()
import hvplot.xarray

dsnative["to"].isel(time=0).squeeze().load().hvplot.scatter(
    x="lon", y="lat", c="to", rasterize=True
)