EERIE Data at DKRZ

This notebook guides EERIE data users and explains how to find and load data available at DKRZ.

The notebook works well within the python3/unstable kernel.

All data relevant for the project is referenced in the main DKRZ-EERIE Catalog:

[1]:
import intake

eerie_cat = intake.open_catalog(
    "https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml"
)
eerie_cat
eerie:
  args:
    path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml
  description: ''
  driver: intake.catalog.local.YAMLFileCatalog
  metadata: {}

We use a catalog reference syntax, i.e. a path name template:

hpc.hardware.product.source_id.experiment_id.realm.grid_type

Opened with Python, the catalog is a nested dictionary of catalog sources. The lowest level contains the data sources, which can be opened as xarray datasets with to_dask().
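
As a minimal sketch of the template above, the levels can be joined into a dotted key. The helper `build_entry_key` is a hypothetical name introduced here for illustration and is not part of Intake; the level values are taken from entries that appear in this notebook.

```python
# Order of the levels in the catalog path template.
TEMPLATE = "hpc.hardware.product.source_id.experiment_id.realm.grid_type"


def build_entry_key(**levels):
    """Join the given levels into a dotted catalog key, in template order.

    Hypothetical helper for illustration only; it just builds a string.
    """
    names = TEMPLATE.split(".")
    missing = [name for name in names if name not in levels]
    if missing:
        raise ValueError(f"missing template levels: {missing}")
    return ".".join(levels[name] for name in names)


key = build_entry_key(
    hpc="dkrz",
    hardware="disk",
    product="model-output",
    source_id="icon-esm-er",
    experiment_id="eerie-control-1950",
    realm="atmos",
    grid_type="gr025",
)
print(key)
```

The resulting string has the form that the catalog accepts as a key, for example `eerie_cat[key]`.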

You can browse through the catalog by listing the catalog and selecting keys:

[2]:
print(list(eerie_cat))
print(list(eerie_cat["dkrz"]))
['jasmin', 'dkrz']
['disk', 'archive', 'cloud', 'main', 'dkrz_ngc3']

Entries can be joined with a ‘.’ so that you can access deeper-level entries directly from the top-level catalog:

[3]:
eerie_cat["dkrz.disk"]
disk:
  args:
    path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/main.yaml
  description: Use this catalog if you are working on Levante. This catalog contains
    datasets for all raw data in /work/bm1344 and accesses the data via kerchunks
    in /work/bm1344/DKRZ/kerchunks.
  driver: intake.catalog.local.YAMLFileCatalog
  metadata:
    catalog_dir: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz

Note that catalogs support autocompletion when you press Tab.

For model output stored on DKRZ’s disk, a “database” CSV file opened with pandas gives a table-like overview:

[4]:
data_base = eerie_cat["dkrz"]["disk"]["model-output"]["csv"].read()
# equivalent to
data_base = eerie_cat["dkrz.disk.model-output.csv"].read()
data_base.head()
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/dask/dataframe/io/csv.py:542: UserWarning: Warning gzip compression does not support breaking apart files
Please ensure that each individual file can fit in memory and
use the keyword ``blocksize=None to remove this message``
Setting ``blocksize=None``
  warn(
[4]:
| | format | grid_id | member_id | institution_id | institution | references | simulation_id | variable-long_names | variables | source_id | experiment_id | realm | grid_lable | aggregation | urlpath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | zarr | 5aff0578-9bd9-11e8-8e4a-af3d880818e6 | r1i1p1f1 | MPI-M | Max Planck Institute for Meteorology/Deutscher... | see MPIM/DWD publications | erc1011 | ['10m windspeed', 'temperature in 2m'] | ['sfcwind', 'tas'] | icon-esm-er | eerie-control-1950 | atmos | gr025 | 2d_daily_max | reference:://work/bm1344/DKRZ/kerchunks_batche... |
| 1 | zarr | 5aff0578-9bd9-11e8-8e4a-af3d880818e6 | r1i1p1f1 | MPI-M | Max Planck Institute for Meteorology/Deutscher... | see MPIM/DWD publications | erc1011 | ['total cloud cover', 'dew point temperature i... | ['clt', 'dew2', 'evspsbl', 'hfls', 'hfss', 'pr... | icon-esm-er | eerie-control-1950 | atmos | gr025 | 2d_daily_mean | reference:://work/bm1344/DKRZ/kerchunks_batche... |
| 2 | zarr | 5aff0578-9bd9-11e8-8e4a-af3d880818e6 | r1i1p1f1 | MPI-M | Max Planck Institute for Meteorology/Deutscher... | see MPIM/DWD publications | erc1011 | ['temperature in 2m'] | ['tas'] | icon-esm-er | eerie-control-1950 | atmos | gr025 | 2d_daily_min | reference:://work/bm1344/DKRZ/kerchunks_batche... |
| 3 | zarr | 5aff0578-9bd9-11e8-8e4a-af3d880818e6 | r1i1p1f1 | MPI-M | Max Planck Institute for Meteorology/Deutscher... | see MPIM/DWD publications | erc1011 | ['vertically integrated cloud ice', 'verticall... | ['clivi', 'cllvi', 'clt', 'dew2', 'evspsbl', '... | icon-esm-er | eerie-control-1950 | atmos | gr025 | 2d_monthly_mean | reference:://work/bm1344/DKRZ/kerchunks_batche... |
| 4 | zarr | 5aff0578-9bd9-11e8-8e4a-af3d880818e6 | r1i1p1f1 | MPI-M | Max Planck Institute for Meteorology/Deutscher... | see MPIM/DWD publications | erc1011 | ['specific cloud ice content', 'specific cloud... | ['cli', 'clw', 'gpsm', 'height_bnds', 'hus', '... | icon-esm-er | eerie-control-1950 | atmos | gr025 | model-level_monthly_mean | reference:://work/bm1344/DKRZ/kerchunks_batche... |
[5]:
from IPython.display import display, Markdown

drs = "source_id.experiment_id.realm.grid_lable"
for c in drs.split("."):
    display(Markdown(f"## Unique entries for *{c}*:"))
    display(Markdown("- " + "\n- ".join(data_base[c].unique())))

Unique entries for source_id:

  • icon-esm-er

  • ifs-fesom2-sr

  • ifs-amip

  • ifs-nemo

  • hadgem3-gc5-n640-orca12

  • hadgem3-gc5-n216-orca025

Unique entries for experiment_id:

  • eerie-control-1950

  • eerie-spinup-1950

  • amip-hist-obs

  • amip-hist-obs-lr30

  • amip-hist-obs-c-lr30-a-0

  • amip-hist-obs-c-lr30-a-lr30

  • amip-ng-obs

  • amip-ng-obs-lr30

  • eerie-picontrol

Unique entries for realm:

  • atmos

  • ocean

  • land

Unique entries for grid_lable:

  • gr025

  • native

See "With intake-esm" in the Browse section to learn more about working with the data_base dataframe.
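
As a rough sketch of working with such a dataframe directly in pandas, rows can be selected with boolean masks and counted per combination of DRS elements. The small dataframe `data_base_demo` is a stand-in with invented rows, not the real content of data_base; only the column names are taken from the output above.

```python
import pandas as pd

# Tiny stand-in for data_base; the rows are invented for illustration.
data_base_demo = pd.DataFrame(
    {
        "source_id": ["icon-esm-er", "icon-esm-er", "ifs-amip"],
        "experiment_id": ["eerie-control-1950", "eerie-control-1950", "amip-hist-obs"],
        "realm": ["atmos", "ocean", "atmos"],
        "grid_lable": ["gr025", "native", "gr025"],
    }
)

# Boolean masks select the rows that match all conditions.
mask = (data_base_demo["source_id"] == "icon-esm-er") & (
    data_base_demo["realm"] == "atmos"
)
selection = data_base_demo[mask]
print(selection)

# Count how many rows exist per combination of two DRS elements.
counts = data_base_demo.groupby(["source_id", "realm"]).size()
print(counts)
```

Combinations that do not occur in the dataframe simply do not appear in `counts`.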

Note that not all combinations of the DRS elements exist. One combination that does exist is:

[6]:
eerie_cat["dkrz.disk.model-output.icon-esm-er.eerie-control-1950.atmos.gr025"]
gr025:
  args:
    path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/model-output/icon-esm-er/eerie-control-1950/atmos/gr025/main.yaml
  description: This catalog contains atmospheric EERIE ICON-ESM-ER eerie-control-1950
    output on 0.25deg grid available at DKRZ disk.
  driver: intake.catalog.local.YAMLFileCatalog
  metadata:
    catalog_dir: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/model-output/icon-esm-er/eerie-control-1950/atmos
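
Because not every combination exists, it can help to test a dotted key before using it. The sketch below walks a plain nested dictionary that stands in for a small part of the catalog tree; `find_entry` and `demo_cat` are hypothetical names, and the approach assumes that a missing level raises `KeyError`, as it does for a dictionary.

```python
def find_entry(catalog, dotted_key):
    """Walk a nested mapping level by level; return None if a level is missing."""
    node = catalog
    for level in dotted_key.split("."):
        try:
            node = node[level]
        except (KeyError, TypeError):
            # KeyError: the level does not exist.
            # TypeError: the walk went past a leaf that is not a mapping.
            return None
    return node


# Plain-dict stand-in for part of the catalog tree (not real catalog objects).
demo_cat = {
    "dkrz": {
        "disk": {
            "model-output": {
                "icon-esm-er": {
                    "eerie-control-1950": {"atmos": {"gr025": "data sources"}}
                }
            }
        }
    }
}

existing = "dkrz.disk.model-output.icon-esm-er.eerie-control-1950.atmos.gr025"
absent = "dkrz.disk.model-output.icon-esm-er.eerie-control-1950.land.gr025"

print(find_entry(demo_cat, existing))
print(find_entry(demo_cat, absent))
```

Here `absent` is missing only in the stand-in dictionary; whether that combination exists in the real catalog has to be checked against the catalog itself.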

Use list to explore the content of a catalog:

[7]:
list(eerie_cat)
[7]:
['jasmin', 'dkrz']

You can print a short description of each collection with its describe function:

[8]:
for col in eerie_cat:
    print(f"Description of {col}:")
    print(eerie_cat[col].describe()["description"])
Description of jasmin:
This catalog contains datasets for EERIE stored on JASMIN
Description of dkrz:
This catalog contains datasets for EERIE stored on DKRZ

Datasets saved on DKRZ are in the DKRZ catalog:

[9]:
eerie_dkrz = eerie_cat["dkrz"]
for col in eerie_dkrz:
    print(f"Description of {col}:")
    print(eerie_dkrz[col].describe()["description"])
Description of disk:
Use this catalog if you are working on Levante. This catalog contains datasets for all raw data in /work/bm1344 and accesses the data via kerchunks in /work/bm1344/DKRZ/kerchunks.
Description of archive:
Only use this catalog if your desired data is not available on diks and needs to be retrieved from DKRZ's tape archive. This catalog contains datasets for archived data in /arch/bm1344
Description of cloud:
Use this catalog if you are NOT working on Levante. This catalog contains the same datasets as *dkrz_eerie_kerchunk* but data access is via the xpublish server *eerie.cloud.dkrz.de*
Description of main:
DKRZ master catalog for all /pool/data catalogs available
Description of dkrz_ngc3:
NextGEMs Cycle 3 data

The DKRZ catalogs are organized by storage location at DKRZ.

```{note}

If you access the data remotely, you can use the cloud catalog. To access any other data, you have to be logged in to DKRZ's HPC system.

```

[10]:
eerie_dkrz_disk = eerie_dkrz["disk"]
for col in eerie_dkrz_disk:
    print(f"Description of {col}:")
    print(eerie_dkrz_disk[col].describe()["description"])
Description of model-output:
EERIE Earth System Model output available on DKRZ's Levante File System. This catalog contains datasets for all raw data in /work/bm1344 and accesses the data via kerchunks in /work/bm1344/DKRZ/kerchunks
Description of observations:
This catalog contains observational data that is used for EERIE evaluation.

We continue to work with the eerie-dkrz-disk-model-output catalog to show how to

  • browse

  • open and load

  • subset

data.

[11]:
cat = eerie_dkrz_disk["model-output"]
list(cat)
[11]:
['icon-esm-er',
 'ifs-fesom2-sr',
 'ifs-amip',
 'ifs-nemo',
 'hadgem3-gc5-n640-orca12',
 'hadgem3-gc5-n216-orca025',
 'csv',
 'esm-json']

We have two options:

  • continue working with YAML files and the intake-xarray plugins (easier to load)

  • switch to intake-esm (easier to browse)

Browse

With intake-esm

We can combine the JSON and CSV entries of the intake catalog to generate an intake-esm catalog:

[12]:
import json

esmjson = json.loads("".join(cat["esm-json"].read()))
dkrz_disk_model_esm = intake.open_esm_datastore(
    obj=dict(esmcat=esmjson, df=data_base),
    columns_with_iterables=["variables", "variable-long_names", "urlpath"],
)
dkrz_disk_model_esm

dkrz-catalogue catalog with 110 dataset(s) from 112 asset(s):

unique
format 3
grid_id 3
member_id 1
institution_id 1
institution 2
references 1
simulation_id 3
variable-long_names 54
variables 57
source_id 6
experiment_id 9
realm 3
grid_lable 2
aggregation 51
urlpath 112
derived_variables 0

Intake-esm uses the data_base dataframe under the hood. It is accessible via .df, which makes the catalog easier to browse.

  • Which query keywords exist?

[13]:
dkrz_disk_model_esm.df.columns
[13]:
Index(['format', 'grid_id', 'member_id', 'institution_id', 'institution',
       'references', 'simulation_id', 'variable-long_names', 'variables',
       'source_id', 'experiment_id', 'realm', 'grid_lable', 'aggregation',
       'urlpath'],
      dtype='object')

The unique() method lists the distinct values found in each column of that dataframe.

  • Which models are available in the catalog?

[14]:
dkrz_disk_model_esm.unique()["source_id"]
[14]:
['icon-esm-er',
 'ifs-fesom2-sr',
 'ifs-amip',
 'ifs-nemo',
 'hadgem3-gc5-n640-orca12',
 'hadgem3-gc5-n216-orca025']

Search with wildcards:

[15]:
subcat_esm = dkrz_disk_model_esm.search(
    **{
        "source_id": "icon-esm-er",
        "experiment_id": "eerie-control-1950",
        "grid_lable": "gr025",
        "realm": "atmos",
        "variable-long_names": "temperature*",
        "aggregation": "monthly*",
    }
)
subcat_esm

dkrz-catalogue catalog with 1 dataset(s) from 1 asset(s):

unique
format 1
grid_id 1
member_id 1
institution_id 1
institution 1
references 1
simulation_id 1
variable-long_names 1
variables 1
source_id 1
experiment_id 1
realm 1
grid_lable 1
aggregation 1
urlpath 1
derived_variables 0
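
To illustrate how a pattern such as `monthly*` can select aggregation names, the sketch below uses Python's `re` module. It assumes that the pattern is treated as a regular expression matched anywhere in the value; this is an assumption about the matching semantics, so consult the intake-esm documentation for the exact behavior. The aggregation names are taken from the table shown earlier in this notebook.

```python
import re

# Aggregation names copied from the data_base overview.
aggregations = [
    "2d_daily_max",
    "2d_daily_mean",
    "2d_daily_min",
    "2d_monthly_mean",
    "model-level_monthly_mean",
]

# As a regular expression, "monthly*" means "monthl" followed by zero or more "y".
pattern = re.compile("monthly*")

# search() matches anywhere in the string, not only at the start.
matches = [name for name in aggregations if pattern.search(name)]
print(matches)
```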

Pure Intake

Intake offers a free-text search. For example, we can search for the control run. Each search returns another catalog, so searches can be chained.

[16]:
searchdict = dict(
    model="ICON",
    realm="atmos",
    exp="eerie-control-1950",
    var="temperature",
    frequency="monthly",
)
subcat = cat["icon-esm-er"]
for v in searchdict.values():
    subcat = subcat.search(v)
list(subcat)

# Note that `search` has a keyword argument `depth` (default: 2) that sets how many subcatalog levels are searched.
# If you start from a high-level catalog, adapt that argument to your needs.
[16]:
['eerie-control-1950.atmos.gr025.2d_monthly_mean',
 'eerie-control-1950.atmos.gr025.model-level_monthly_mean',
 'eerie-control-1950.atmos.gr025.plev19_monthly_mean',
 'eerie-control-1950.atmos.native.2d_monthly_mean',
 'eerie-control-1950.atmos.native.model-level_monthly_mean',
 'eerie-control-1950.atmos.native.plev19_monthly_mean']
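
The loop above narrows the catalog one search term at a time. The following sketch mimics that idea on a plain dictionary and does not reproduce Intake's implementation: `text_search` is a hypothetical helper, and the entry names and descriptions are illustrative stand-ins.

```python
def text_search(entries, text):
    """Keep entries whose name or description contains the text (case-insensitive)."""
    text = text.lower()
    return {
        name: description
        for name, description in entries.items()
        if text in name.lower() or text in description.lower()
    }


# Stand-in entries: name -> description (invented for illustration).
entries = {
    "eerie-control-1950.atmos.gr025.2d_monthly_mean": "ICON atmosphere output, monthly means",
    "eerie-control-1950.atmos.gr025.2d_daily_mean": "ICON atmosphere output, daily means",
    "eerie-control-1950.ocean.native.2d_monthly_mean": "ICON ocean output, monthly means",
}

# Each pass searches only what the previous pass returned.
subset = entries
for term in ["ICON", "atmos", "monthly"]:
    subset = text_search(subset, term)

print(list(subset))
```

Because every term filters the result of the previous one, the final selection contains only entries that match all terms.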

Use the GUI:

[17]:
cat.gui