Overview and access at DKRZ#

This notebook guides EERIE data users through finding and loading the data available at DKRZ.

The notebook works well within the python3/unstable kernel.

All data relevant for the project is referenced in the main DKRZ-EERIE Catalog:

[1]:
import intake

catalog = (
    "https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml"
)
eerie_cat = intake.open_catalog(catalog)
eerie_cat
eerie:
  args:
    path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml
  description: ''
  driver: intake.catalog.local.YAMLFileCatalog
  metadata: {}

Online interactive browsing with the database table#

The following table contains all EERIE model output available at DKRZ on the file system.

  • Use the filters to subset the database.

  • Copy a value from the catalog_entry column and use it to open the corresponding dataset from the intake catalog.

  • Each line contains a single variable long name parsed from the datasets.

  • DRS is the Data Reference Syntax, a template for the path hierarchy of the catalogs.

[3]:
tabu  # there is a hidden cell before
[3]:

The EERIE Catalog is hierarchically structured using a path template (similar to the Data Reference Syntax used in CMIP):

hpc.hardware.product.source_id.experiment_id.realm.grid_type
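As a sketch (the key and the component names below mirror entries that appear later in this notebook), a dotted catalog key can be mapped onto its DRS components:

```python
# Sketch: split a dotted catalog key into the DRS components used by the
# DKRZ disk model-output catalog (component names as in the database,
# plus the trailing aggregation level).
dsid = "icon-esm-er.eerie-control-1950.v20231106.atmos.gr025.2d_monthly_mean"
components = ["source_id", "experiment_id", "version", "realm", "grid_lable", "aggregation"]
parsed = dict(zip(components, dsid.split(".")))
print(parsed["experiment_id"])  # eerie-control-1950
```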

Opened with Python, the catalog is a nested dictionary of catalog sources. The lowest level contains data sources which can be opened as xarray datasets with to_dask().

You can browse through the catalog by listing the catalog and selecting keys:

[4]:
print(list(eerie_cat))
print(list(eerie_cat["dkrz"]))
['jasmin', 'dkrz']
['disk', 'archive', 'cloud', 'main', 'dkrz_ngc3']

Entries can be joined with a ‘.’ so that you can access deeper level entries from the highest catalog level:

[5]:
eerie_cat["dkrz.disk"]
disk:
  args:
    path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/main.yaml
  description: Use this catalog if you are working on Levante. This catalog contains
    datasets for all raw data in /work/bm1344 and accesses the data via kerchunks
    in /work/bm1344/DKRZ/kerchunks.
  driver: intake.catalog.local.YAMLFileCatalog
  metadata:
    catalog_dir: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz

Tip

Note that catalogs support autocompletion: press Tab while indexing a catalog to list the available entries.

For model output stored on DKRZ’s disk, you can get a table-like overview from a “database” CSV file opened with pandas:

[6]:
data_base = eerie_cat["dkrz"]["disk"]["model-output"]["csv"].read()
# equivalent to
data_base = eerie_cat["dkrz.disk.model-output.csv"].read()

In the following, we display all unique values of each element of the catalog hierarchy:

[7]:
from IPython.display import display, Markdown

drs = "source_id.experiment_id.version.realm.grid_lable"
for c in drs.split("."):
    display(Markdown(f"## Unique entries for *{c}*:"))
    display(Markdown("- " + "\n- ".join(data_base[c].unique())))

Unique entries for source_id:#

  • icon-esm-er

  • ifs-fesom2-sr

  • ifs-amip

  • ifs-nemo

  • hadgem3-gc5-n640-orca12

  • hadgem3-gc5-n216-orca025

Unique entries for experiment_id:#

  • eerie-control-1950

  • eerie-spinup-1950

  • amip-hist-obs

  • amip-hist-obs-lr30

  • amip-hist-obs-c-lr30-a-0

  • amip-hist-obs-c-lr30-a-lr30

  • amip-ng-obs

  • amip-ng-obs-lr30

  • amip-hist-esav3

  • amip-hist-esav3-c-0-a-lr30

  • eerie-picontrol

Unique entries for version:#

  • v20231106

  • v20240618

  • v20240304

  • Not set

Unique entries for realm:#

  • atmos

  • ocean

  • land

Unique entries for grid_lable:#

  • gr025

  • native

See Browse with intake-esm below for more on how to work with the dataframe data_base.

Note that not all combinations of the DRS exist. For example, the ICON spinup run only has data on the native grid. You can use list to explore the contents.
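To see which combinations actually exist, you can drop duplicate DRS rows from the database. A minimal sketch with a toy dataframe standing in for data_base (the real values come from the CSV above):

```python
import pandas as pd

# Toy stand-in for the data_base dataframe: one row per variable, so the
# same DRS combination can occur several times.
toy = pd.DataFrame(
    {
        "source_id": ["icon-esm-er", "icon-esm-er", "ifs-fesom2-sr"],
        "experiment_id": ["eerie-spinup-1950", "eerie-spinup-1950", "eerie-control-1950"],
        "grid_lable": ["native", "native", "gr025"],
    }
)
# Drop duplicates to list each existing DRS combination exactly once.
combos = toy[["source_id", "experiment_id", "grid_lable"]].drop_duplicates()
print(len(combos))  # 2
```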

[8]:
list(eerie_cat["dkrz.disk.model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos"])
[8]:
['native']

See also the processing example for printing the full tree of a catalog.

You can print a short description of each collection with its describe function:

[9]:
for col in eerie_cat:
    print(f"Description of {col}:")
    print(eerie_cat[col].describe()["description"])
Description of jasmin:
This catalog contains datasets for EERIE stored on JASMIN
Description of dkrz:
This catalog contains datasets for EERIE stored on DKRZ

Datasets saved on DKRZ are in the DKRZ catalog:

[10]:
eerie_dkrz = eerie_cat["dkrz"]
for col in eerie_dkrz:
    print(f"Description of {col}:")
    print(eerie_dkrz[col].describe()["description"])
Description of disk:
Use this catalog if you are working on Levante. This catalog contains datasets for all raw data in /work/bm1344 and accesses the data via kerchunks in /work/bm1344/DKRZ/kerchunks.
Description of archive:
Only use this catalog if your desired data is not available on diks and needs to be retrieved from DKRZ's tape archive. This catalog contains datasets for archived data in /arch/bm1344
Description of cloud:
Use this catalog if you are NOT working on Levante. This catalog contains the same datasets as *dkrz_eerie_kerchunk* but data access is via the xpublish server *eerie.cloud.dkrz.de*
Description of main:
DKRZ master catalog for all /pool/data catalogs available
Description of dkrz_ngc3:
NextGEMs Cycle 3 data

The DKRZ catalogs are distinguished by storage location at DKRZ.

Note

If you are accessing the data remotely, use the cloud catalog. For all other catalogs, you have to be logged in on DKRZ's HPC.

[11]:
eerie_dkrz_disk = eerie_dkrz["disk"]
for col in eerie_dkrz_disk:
    print(f"Description of {col}:")
    print(eerie_dkrz_disk[col].describe()["description"])
Description of model-output:
EERIE Earth System Model output available on DKRZ's Levante File System. This catalog contains datasets for all raw data in /work/bm1344 and accesses the data via kerchunks in /work/bm1344/DKRZ/kerchunks
Description of observations:
This catalog contains observational data that is used for EERIE evaluation.

We continue to work with the eerie-dkrz-disk-model-output catalog to show how to

  • browse

  • open and load

  • subset

data.

[12]:
cat = eerie_dkrz_disk["model-output"]
list(cat)
[12]:
['icon-esm-er',
 'ifs-fesom2-sr',
 'ifs-amip',
 'ifs-nemo',
 'hadgem3-gc5-n640-orca12',
 'hadgem3-gc5-n216-orca025',
 'csv',
 'esm-json']

We have two options:

  • continue working with yaml files and the intake-xarray plugins (easier to load)

  • switch to intake-esm (easier to browse)

Browse#

With intake-esm#

We can use a json+csv from the intake catalog to generate an intake-esm catalog:

[13]:
import json

esmjson = json.loads("".join(cat["esm-json"].read()))
dkrz_disk_model_esm = intake.open_esm_datastore(
    obj=dict(esmcat=esmjson, df=data_base),
    columns_with_iterables=["variables", "variable-long_names", "urlpath"],
)
dkrz_disk_model_esm

dkrz-catalogue catalog with 164 dataset(s) from 170 asset(s):

unique
format 3
grid_id 3
member_id 1
institution_id 2
institution 3
references 1
simulation_id 4
variable-long_names 80
variables 91
source_id 6
experiment_id 11
version 4
realm 3
grid_lable 2
aggregation 57
urlpath 170
derived_variables 0

Intake-esm uses the data_base dataframe under the hood, accessible via .df, which makes browsing the catalog easier.

  • what query keywords do exist?

[14]:
dkrz_disk_model_esm.df.columns
[14]:
Index(['format', 'grid_id', 'member_id', 'institution_id', 'institution',
       'references', 'simulation_id', 'variable-long_names', 'variables',
       'source_id', 'experiment_id', 'version', 'realm', 'grid_lable',
       'aggregation', 'urlpath'],
      dtype='object')

  • which models are available in the catalog?

[15]:
dkrz_disk_model_esm.unique()["source_id"]
[15]:
['icon-esm-er',
 'ifs-fesom2-sr',
 'ifs-amip',
 'ifs-nemo',
 'hadgem3-gc5-n640-orca12',
 'hadgem3-gc5-n216-orca025']

Search with wildcards:

[16]:
subcat_esm = dkrz_disk_model_esm.search(
    **{
        "source_id": "icon-esm-er",
        "experiment_id": "eerie-control-1950",
        "grid_lable": "gr025",
        "realm": "atmos",
        "variable-long_names": "temperature*",
        "aggregation": "monthly*",
    }
)
subcat_esm

dkrz-catalogue catalog with 1 dataset(s) from 1 asset(s):

unique
format 1
grid_id 1
member_id 1
institution_id 1
institution 1
references 1
simulation_id 1
variable-long_names 1
variables 1
source_id 1
experiment_id 1
version 1
realm 1
grid_lable 1
aggregation 1
urlpath 1
derived_variables 0

Pure Intake#

Intake offers users a free-text search. We can search, for example, for the control run. Intake returns another catalog.

[17]:
searchdict = dict(
    model="ICON",
    realm="atmos",
    exp="eerie-control-1950",
    var="temperature",
    frequency="monthly",
)
subcat = cat["icon-esm-er"]
for v in searchdict.values():
    subcat = subcat.search(v)
list(subcat)

# note that `search` has a keyword argument *depth* (default: 2) which indicates how many subcatalogs should be searched.
# if you use a high level catalog, adapt that argument to your needs
[17]:
['eerie-control-1950.v20231106.atmos.gr025.2d_monthly_mean',
 'eerie-control-1950.v20231106.atmos.gr025.model-level_monthly_mean',
 'eerie-control-1950.v20231106.atmos.gr025.plev19_monthly_mean',
 'eerie-control-1950.v20231106.atmos.native.2d_monthly_mean',
 'eerie-control-1950.v20231106.atmos.native.model-level_monthly_mean',
 'eerie-control-1950.v20231106.atmos.native.plev19_monthly_mean']

Use the GUI:

[18]:
cat.gui
[18]:

Flatten a catalog by finding all data sources at all levels:

[19]:
def find_data_sources(catalog, name=None):
    newname = ".".join([a for a in [name, catalog.name] if a])
    data_sources = []

    for key, entry in catalog.items():
        if isinstance(entry, intake.catalog.Catalog):
            if newname == "main":
                newname = None
            # If the entry is a subcatalog, recursively search it
            data_sources.extend(find_data_sources(entry, newname))
        elif isinstance(entry, intake.source.base.DataSource):
            data_sources.append(newname + "." + key)

    return data_sources
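
The same flattening idea can be sketched on plain nested dictionaries (a toy stand-in for a catalog tree; for real catalogs use the function above):

```python
# Toy illustration of the flattening recursion: walk a nested dict and
# collect dotted paths to the leaves (the "data sources").
def flatten_tree(tree, name=None):
    sources = []
    for key, entry in tree.items():
        newname = ".".join([a for a in [name, key] if a])
        if isinstance(entry, dict):
            # a subcatalog: recurse into it
            sources.extend(flatten_tree(entry, newname))
        else:
            # a leaf: record its dotted path
            sources.append(newname)
    return sources


toy = {"dkrz": {"disk": {"model-output": "source"}, "cloud": "source"}}
print(flatten_tree(toy))  # ['dkrz.disk.model-output', 'dkrz.cloud']
```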

Get file names#

If you need file names for use in shell scripts, you can get them with the query_yaml.py program by specifying

  • a catalog,

  • a dataset,

  • a variable name,

e.g.:

[20]:
%%bash
module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable

FILES=$(query_yaml.py ifs-fesom2-sr eerie-spinup-1950 v20240304 atmos \
    native daily \
    -c https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/model-output/main.yaml \
    --var 2t --uri --cdo )
echo found
echo ${FILES} | wc -w
echo files
Choices for this dataset:
        name  ... default
0  variables  ...    100u

[1 rows x 5 columns]
found
372
files

Open and load#

You can open a dataset from the catalog like a dictionary entry. This gives you a lot of metadata:

[21]:
dsid = "icon-esm-er.eerie-control-1950.v20231106.atmos.gr025.2d_monthly_mean"
cat[dsid]
2d_monthly_mean:
  args:
    chunks: auto
    consolidated: false
    storage_options:
      lazy: true
      remote_protocol: file
    urlpath: reference:://work/bm1344/DKRZ/kerchunks_batched/erc1011/atm_2d_1mth_mean_remap025/combined.parq
  description: ''
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    CDI: Climate Data Interface version 2.2.4 (https://mpimet.mpg.de/cdi)
    CDO: Climate Data Operators version 2.2.2 (https://mpimet.mpg.de/cdo)
    Conventions: CF-1.6
    DOKU_License: CC BY 4.0
    DOKU_Name: EERIE ICON-ESM-ER eerie-control-1950 run
    DOKU_authors: "Putrasahan, D.; Kr\xF6ger, J.; Wachsmann, F."
    DOKU_responsible_person: Fabian Wachsmann
    DOKU_summary: EERIE ICON-ESM-ER eerie-control-1950 run
    activity_id: EERIE
    catalog_dir: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/model-output/icon-esm-er/eerie-control-1950/v20231106/atmos/gr025
    cdo_openmp_thread_number: 16
    comment: Sapphire Dyamond (k203123) on l40335 (Linux 4.18.0-372.32.1.el8_6.x86_64
      x86_64)
    experiment: eerie-control-1950
    experiment_id: eerie-1950control
    format: netcdf
    frequency: 1month
    grid_id: 5aff0578-9bd9-11e8-8e4a-af3d880818e6
    grid_label: gn
    history: deleted for convenience
    institution: Max Planck Institute for Meteorology/Deutscher Wetterdienst
    institution_id: MPI-M
    level_type: 2d
    member_id: r1i1p1f1
    plots:
      quicklook:
        aspect: 1
        cmap: jet
        coastline: 50m
        geo: true
        groupby: time
        kind: image
        use_dask: true
        width: 800
        x: lon
        y: lat
        z: clivi
    project: EERIE
    project_id: EERIE
    realm: atm
    references: see MPIM/DWD publications
    simulation_id: erc1011
    source: git@gitlab.dkrz.de:icon/icon-mpim.git@450227788f06e837f1238ebed27af6e2365fa673
    source_id: ICON-ESM
    source_type: AOGCM
    time_max: 2188799
    time_min: 44639
    time_reduction: mean
    title: ICON simulation
    variable-long_names:
    - vertically integrated cloud ice
    - vertically integrated cloud water
    - total cloud cover
    - dew point temperature in 2m
    - evaporation
    - latent heat flux
    - sensible heat flux
    - precipitation flux
    - vertically integrated water vapour
    - surface pressure
    - mean sea level pressure
    - surface downwelling longwave radiation
    - surface downwelling clear-sky longwave radiation
    - surface upwelling longwave radiation
    - toa outgoing longwave radiation
    - toa outgoing clear-sky longwave radiation
    - surface downwelling shortwave radiation
    - surface downwelling clear-sky shortwave radiation
    - toa incident shortwave radiation
    - surface upwelling shortwave radiation
    - surface upwelling clear-sky shortwave radiation
    - toa outgoing shortwave radiation
    - toa outgoing clear-sky shortwave radiation
    - 10m windspeed
    - temperature in 2m
    - maximum 2m temperature
    - minimum 2m temperature
    - u-momentum flux at the surface
    - v-momentum flux at the surface
    - surface temperature
    - zonal wind in 10m
    - meridional wind in 10m

You can load the data with the to_dask() function.

[22]:
ds = cat[dsid].to_dask()
ds
[22]:
<xarray.Dataset>
Dimensions:   (time: 372, lat: 721, lon: 1440, height: 1, height_2: 1)
Coordinates:
  * height    (height) float64 2.0
  * height_2  (height_2) float64 10.0
  * lat       (lat) float64 -90.0 -89.75 -89.5 -89.25 ... 89.25 89.5 89.75 90.0
  * lon       (lon) float64 0.0 0.25 0.5 0.75 1.0 ... 359.0 359.2 359.5 359.8
  * time      (time) datetime64[ns] 2002-01-31T23:59:00 ... 2032-12-31T23:59:00
Data variables: (12/32)
    clivi     (time, lat, lon) float32 dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    cllvi     (time, lat, lon) float32 dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    clt       (time, lat, lon) float32 dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    dew2      (time, height, lat, lon) float32 dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    evspsbl   (time, lat, lon) float32 dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    hfls      (time, lat, lon) float32 dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    ...        ...
    tasmin    (time, height, lat, lon) float32 dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    tauu      (time, lat, lon) float32 dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    tauv      (time, lat, lon) float32 dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    ts        (time, lat, lon) float32 dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    uas       (time, height_2, lat, lon) float32 dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    vas       (time, height_2, lat, lon) float32 dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
Attributes: (12/33)
    CDI:                       Climate Data Interface version 2.2.4 (https://...
    CDO:                       Climate Data Operators version 2.2.2 (https://...
    Conventions:               CF-1.6
    DOKU_License:              CC BY 4.0
    DOKU_Name:                 EERIE ICON-ESM-ER eerie-control-1950 run
    DOKU_authors:              Putrasahan, D.; Kröger, J.; Wachsmann, F.
    ...                        ...
    source_id:                 ICON-ESM
    source_type:               AOGCM
    time_max:                  2188799
    time_min:                  44639
    time_reduction:            mean
    title:                     ICON simulation

Load with intake-esm#

Loading with intake-esm is more complicated as we need different keyword arguments for different catalog entries:

[23]:
default_kwargs = dict(
    xarray_open_kwargs=dict(backend_kwargs=dict(consolidated=False)),
    storage_options=dict(remote_protocol="file", lazy=True),
)

if subcat_esm.df["format"][0] != "zarr":
    del default_kwargs["xarray_open_kwargs"]["backend_kwargs"]

if "," in subcat_esm.df["urlpath"][0]:
    # the urlpath column contains stringified lists; convert them back to lists
    subcat_esm.df["urlpath"] = subcat_esm.df["urlpath"].apply(eval)
    if "icon-esm" not in subcat_esm.df["source_id"][0]:
        default_kwargs["xarray_open_kwargs"]["compat"] = "override"

subcat_esm.to_dataset_dict(**default_kwargs).popitem()[1]

--> The keys in the returned dictionary of datasets are constructed as follows:
        'source_id.experiment_id.realm.grid_lable.aggregation'
100.00% [1/1 00:00<00:00]
[23]:
<xarray.Dataset>
Dimensions:   (time: 372, lat: 721, lon: 1440, height: 1, height_2: 1)
Coordinates:
  * height    (height) float64 2.0
  * height_2  (height_2) float64 10.0
  * lat       (lat) float64 -90.0 -89.75 -89.5 -89.25 ... 89.25 89.5 89.75 90.0
  * lon       (lon) float64 0.0 0.25 0.5 0.75 1.0 ... 359.0 359.2 359.5 359.8
  * time      (time) datetime64[ns] 2002-01-31T23:59:00 ... 2032-12-31T23:59:00
Data variables: (12/32)
    clivi     (time, lat, lon) float32 dask.array<chunksize=(1, 182, 1440), meta=np.ndarray>
    cllvi     (time, lat, lon) float32 dask.array<chunksize=(1, 182, 1440), meta=np.ndarray>
    clt       (time, lat, lon) float32 dask.array<chunksize=(1, 182, 1440), meta=np.ndarray>
    dew2      (time, height, lat, lon) float32 dask.array<chunksize=(1, 1, 182, 1440), meta=np.ndarray>
    evspsbl   (time, lat, lon) float32 dask.array<chunksize=(1, 182, 1440), meta=np.ndarray>
    hfls      (time, lat, lon) float32 dask.array<chunksize=(1, 182, 1440), meta=np.ndarray>
    ...        ...
    tasmin    (time, height, lat, lon) float32 dask.array<chunksize=(1, 1, 182, 1440), meta=np.ndarray>
    tauu      (time, lat, lon) float32 dask.array<chunksize=(1, 182, 1440), meta=np.ndarray>
    tauv      (time, lat, lon) float32 dask.array<chunksize=(1, 182, 1440), meta=np.ndarray>
    ts        (time, lat, lon) float32 dask.array<chunksize=(1, 182, 1440), meta=np.ndarray>
    uas       (time, height_2, lat, lon) float32 dask.array<chunksize=(1, 1, 182, 1440), meta=np.ndarray>
    vas       (time, height_2, lat, lon) float32 dask.array<chunksize=(1, 1, 182, 1440), meta=np.ndarray>
Attributes: (12/51)
    CDI:                                   Climate Data Interface version 2.2...
    CDO:                                   Climate Data Operators version 2.2...
    Conventions:                           CF-1.6
    DOKU_License:                          CC BY 4.0
    DOKU_Name:                             EERIE ICON-ESM-ER eerie-control-19...
    DOKU_authors:                          Putrasahan, D.; Kröger, J.; Wachsm...
    ...                                    ...
    intake_esm_attrs:version:              v20231106
    intake_esm_attrs:realm:                atmos
    intake_esm_attrs:grid_lable:           gr025
    intake_esm_attrs:aggregation:          2d_monthly_mean
    intake_esm_attrs:urlpath:              reference:://work/bm1344/DKRZ/kerc...
    intake_esm_dataset_key:                icon-esm-er.eerie-control-1950.atm...

Subset#

Use

  • isel for index selection

  • sel for value selection; note that you can pass method="nearest" for a nearest-neighbour lookup

  • groupby for statistics

[24]:
# get the latest time step
tas_last_timestep = ds["tas"].isel(time=-1)
tas_last_timestep
[24]:
<xarray.DataArray 'tas' (height: 1, lat: 721, lon: 1440)>
dask.array<getitem, shape=(1, 721, 1440), dtype=float32, chunksize=(1, 721, 1440), chunktype=numpy.ndarray>
Coordinates:
  * height   (height) float64 2.0
  * lat      (lat) float64 -90.0 -89.75 -89.5 -89.25 ... 89.25 89.5 89.75 90.0
  * lon      (lon) float64 0.0 0.25 0.5 0.75 1.0 ... 359.0 359.2 359.5 359.8
    time     datetime64[ns] 2032-12-31T23:59:00
Attributes:
    long_name:      temperature in 2m
    param:          0.0.0
    standard_name:  tas
    units:          K
[25]:
# select a coordinate by values and with nearest neighbor look up:
import hvplot.xarray

tas_last_timestep_northsea = ds["tas"].sel(
    **dict(lat="54", lon="8.2"), method="nearest"
)
tas_last_timestep_northsea.hvplot.line()
[25]:
[26]:
tas_last_timestep_northsea_yearmean = tas_last_timestep_northsea.groupby(
    "time.year"
).mean()
[27]:
tas_last_timestep_northsea_yearmean.hvplot.line()
[27]:

Handling of the ICON native grid for georeferenced plots:

[28]:
dsnative = cat[
    "icon-esm-er.eerie-spinup-1950.v20240618.ocean.native.2d_monthly_mean"
].to_dask()
[29]:
import hvplot.xarray

dsnative["to"].isel(time=0).squeeze().load().hvplot.scatter(
    x="lon", y="lat", c="to", rasterize=True, datashade=True
)
[29]: