Overview and access at DKRZ#
This notebook guides EERIE data users and explains how to find and load data available at DKRZ.
The notebook runs within the /work/bm1344/conda-envs/py_312/ conda environment.
All data relevant for the project is referenced in the main DKRZ-EERIE Catalog:
[1]:
import panel as pn
pn.extension("tabulator")
[2]:
import intake
catalog = (
"https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml"
)
eerie_cat = intake.open_catalog(catalog)
eerie_cat
eerie:
args:
path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml
description: ''
driver: intake.catalog.local.YAMLFileCatalog
metadata: {}
Online interactive browsing with the database table#
The following table contains all EERIE model output available at DKRZ on the file system.
- Use the filters to subset the database.
- Copy one value of the catalog_entry column and use it to open the dataset from the intake catalog.
- Each line contains a single variable long name parsed from the datasets.
- DRS is the Data Reference Syntax, a template for the path hierarchy of the catalogs.
[4]:
tabu # there is a hidden cell before
[4]:
[5]:
tabu.save("eerie-intake-database.html")
The EERIE Catalog is hierarchically structured using a path template (similar to the Data Reference Syntax used in CMIP):
hpc.hardware.product.source_id.experiment_id.version.realm.grid_type
Opened with Python, the catalog is a nested dictionary of catalog sources. The lowest level contains data sources, which can be opened as xarray datasets with to_dask().
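For example, the following one-liner loads a dataset directly via its dot-joined entry path (a minimal sketch; this is the same entry that is opened step by step later in this notebook):
ds = eerie_cat[
    "dkrz.disk.model-output.icon-esm-er.eerie-spinup-1950.v20240618.ocean.gr025.2d_monthly_mean"
].to_dask()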
You can browse through the catalog by listing it and selecting keys:
[6]:
print(list(eerie_cat))
print(list(eerie_cat["dkrz"]))
['jasmin', 'dkrz']
['disk', 'archive', 'cloud', 'main', 'dkrz_ngc3']
Entries can be joined with a ‘.’ so that you can access deeper level entries from the highest catalog level:
[7]:
eerie_cat["dkrz.disk"]
disk:
args:
path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/main.yaml
description: Use this catalog if you are working on Levante. This catalog contains
datasets for all raw data in /work/bm1344 and accesses the data via kerchunks
in /work/bm1344/DKRZ/kerchunks.
driver: intake.catalog.local.YAMLFileCatalog
metadata:
catalog_dir: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz
Tip
Catalogs support autocompletion: press Tab to complete entry names.
For model output stored on DKRZ's disk, you can get a table-like overview from a database csv file opened with pandas:
[8]:
data_base = eerie_cat["dkrz"]["disk"]["model-output"]["csv"].read()
# equivalent to
data_base = eerie_cat["dkrz.disk.model-output.csv"].read()
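Since data_base is a plain pandas DataFrame, you can take a quick look before browsing (a minimal sketch):
print(data_base.shape)
data_base.head()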
In the following, we display all unique values of each element of the catalog hierarchy:
[9]:
from IPython.display import display, Markdown
drs = "source_id.experiment_id.version.realm.grid_lable"
for c in drs.split("."):
    display(Markdown(f"## Unique entries for *{c}*:"))
    display(Markdown("- " + "\n- ".join(data_base[c].unique())))
Unique entries for source_id:#
icon-esm-er
ifs-fesom2-sr
ifs-amip-tco1279
ifs-amip-tco399
ifs-nemo
hadgem3-gc5-n640-orca12
hadgem3-gc5-n216-orca025
Unique entries for experiment_id:#
eerie-control-1950
eerie-spinup-1950
hist-1950
hist
hist-c-0-a-lr20
hist-c-0-a-lr30
hist-c-lr20-a-0
hist-c-lr30-a-0
hist-c-lr30-a-lr30
eerie-picontrol
Unique entries for version:#
v20231106
v20240618
v20240304
v20240901
v20231006
Not set
v20241010
Unique entries for realm:#
atmos
ocean
land
wave
Unique entries for grid_lable:#
gr025
native
gr1x1
Go to Browse with intake-esm to understand more about how to work with the dataframe data_base.
Note that not all combinations of the DRS exist; for example, some datasets of the ICON spinup run are only available on the native grid. You can use list to explore the contents.
[10]:
list(eerie_cat["dkrz.disk.model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos"])
[10]:
['gr025', 'native']
See also the processing example for printing the full tree of a catalog.
You can print a short description of each collection with its describe function:
[11]:
for col in eerie_cat:
    print(f"Description of {col}:")
    print(eerie_cat[col].describe()["description"])
Description of jasmin:
This catalog contains datasets for EERIE stored on JASMIN
Description of dkrz:
This catalog contains datasets for EERIE stored on DKRZ
Datasets saved on DKRZ are in the DKRZ catalog:
[12]:
eerie_dkrz = eerie_cat["dkrz"]
for col in eerie_dkrz:
    print(f"Description of {col}:")
    print(eerie_dkrz[col].describe()["description"])
Description of disk:
Use this catalog if you are working on Levante. This catalog contains datasets for all raw data in /work/bm1344 and accesses the data via kerchunks in /work/bm1344/DKRZ/kerchunks.
Description of archive:
Only use this catalog if your desired data is not available on disk and needs to be retrieved from DKRZ's tape archive. This catalog contains datasets for archived data in /arch/bm1344
Description of cloud:
Use this catalog if you are NOT working on Levante. This catalog contains the same datasets as *dkrz_eerie_kerchunk* but data access is via the xpublish server *eerie.cloud.dkrz.de*
Description of main:
DKRZ master catalog for all /pool/data catalogs available
Description of dkrz_ngc3:
NextGEMs Cycle 3 data
The DKRZ catalogs are organized by storage location at DKRZ.
Note
If you are accessing the data remotely, you can use the cloud catalog. To access the other catalogs, you have to be logged in on DKRZ's HPC.
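For remote users, a minimal sketch (assuming the cloud catalog mirrors the structure of the disk catalog):
cloud_cat = eerie_cat["dkrz.cloud"]
print(list(cloud_cat))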
[13]:
eerie_dkrz_disk = eerie_dkrz["disk"]
for col in eerie_dkrz_disk:
    print(f"Description of {col}:")
    print(eerie_dkrz_disk[col].describe()["description"])
Description of model-output:
EERIE Earth System Model output available on DKRZ's Levante File System. This catalog contains datasets for all raw data in /work/bm1344 and accesses the data via kerchunks in /work/bm1344/DKRZ/kerchunks
Description of observations:
This catalog contains observational data that is used for EERIE evaluation.
We continue to work with the eerie-dkrz-disk-model-output catalog to show how to browse, open and load, and subset the data.
[14]:
cat = eerie_dkrz_disk["model-output"]
list(cat)
[14]:
['icon-esm-er',
'ifs-fesom2-sr',
'ifs-amip-tco1279',
'ifs-amip-tco399',
'ifs-amip',
'ifs-nemo',
'hadgem3-gc5-n640-orca12',
'hadgem3-gc5-n216-orca025',
'csv',
'esm-json']
We have two options:
- continue working with yaml files and the intake-xarray plugins (easier to load)
- switch to intake-esm (easier to browse)
Browse#
With intake-esm#
We can use a json+csv pair from the intake catalog to generate an intake-esm catalog:
[15]:
import json
esmjson = json.loads("".join(cat["esm-json"].read()))
dkrz_disk_model_esm = intake.open_esm_datastore(
obj=dict(esmcat=esmjson, df=data_base),
columns_with_iterables=["variables", "variable-long_names", "urlpath"],
)
dkrz_disk_model_esm
dkrz-catalogue catalog with 284 dataset(s) from 346 asset(s):
                     unique
format                    3
grid_id                   3
member_id                 1
institution_id            2
institution               3
references                1
simulation_id             5
variable-long_names     103
variables               113
source_id                 7
experiment_id            10
version                   7
realm                     4
grid_lable                3
aggregation              73
urlpath                 338
derived_variables         0
Intake-esm uses the data_base dataframe under the hood, accessible via .df, which makes the catalog easier to browse.
Which query keywords exist?
[16]:
dkrz_disk_model_esm.df.columns
[16]:
Index(['format', 'grid_id', 'member_id', 'institution_id', 'institution',
'references', 'simulation_id', 'variable-long_names', 'variables',
'source_id', 'experiment_id', 'version', 'realm', 'grid_lable',
'aggregation', 'urlpath'],
dtype='object')
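Because .df is a plain pandas DataFrame, you can also subset it with ordinary pandas syntax, for example (a minimal sketch using the column names shown above):
df = dkrz_disk_model_esm.df
icon_ocean = df[(df["source_id"] == "icon-esm-er") & (df["realm"] == "ocean")]
print(len(icon_ocean))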
Which models are available in the catalog?
[17]:
dkrz_disk_model_esm.unique()["source_id"]
[17]:
['icon-esm-er',
'ifs-fesom2-sr',
'ifs-amip-tco1279',
'ifs-amip-tco399',
'ifs-nemo',
'hadgem3-gc5-n640-orca12',
'hadgem3-gc5-n216-orca025']
Search with wildcards:
[18]:
subcat_esm = dkrz_disk_model_esm.search(
**{
"source_id": "icon-esm-er",
"experiment_id": "eerie-spinup-1950",
"grid_lable": "gr025",
"realm": "ocean",
"variable-long_names": "temperature*",
"aggregation": "monthly*",
}
)
subcat_esm
dkrz-catalogue catalog with 2 dataset(s) from 2 asset(s):
                     unique
format                    1
grid_id                   1
member_id                 1
institution_id            1
institution               1
references                1
simulation_id             1
variable-long_names       2
variables                 2
source_id                 1
experiment_id             1
version                   1
realm                     1
grid_lable                1
aggregation               2
urlpath                   2
derived_variables         0
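The underlying dataframe of a search result can be inspected in the same way (a sketch):
subcat_esm.df[["variables", "aggregation", "urlpath"]]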
Pure Intake#
Intake offers a free-text search. We can search, for example, for the spinup run. Intake returns another catalog.
[19]:
searchdict = dict(
model="ICON",
realm="ocean",
exp="eerie-spinup-1950",
var="temperature",
frequency="monthly",
)
subcat = cat["icon-esm-er"]
for v in searchdict.values():
    subcat = subcat.search(v)
list(subcat)
# note that `search` has a keyword argument *depth* (default: 2) which indicates how many subcatalogs should be searched.
# if you use a high level catalog, adapt that argument to your needs
[19]:
['eerie-spinup-1950.v20231106.ocean.native.model-level_monthly_mean',
'eerie-spinup-1950.v20240618.ocean.gr025.2d_monthly_mean',
'eerie-spinup-1950.v20240618.ocean.gr025.2d_monthly_square',
'eerie-spinup-1950.v20240618.ocean.native.2d_monthly_mean',
'eerie-spinup-1950.v20240618.ocean.native.2d_monthly_square',
'eerie-spinup-1950.v20240618.ocean.native.eddy_monthly_mean',
'eerie-spinup-1950.v20240618.ocean.native.model-level_monthly_mean']
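Any of the returned keys can then be opened directly from the search result, e.g. (a sketch using the first key of the list above):
first_key = list(subcat)[0]
ds_found = subcat[first_key].to_dask()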
Use the GUI:
[20]:
cat.gui
[20]:
Flatten a catalog by finding all data sources at all levels:
[21]:
def find_data_sources(catalog, name=None):
    """Recursively collect the dotted names of all data sources in a catalog."""
    newname = ".".join([a for a in [name, catalog.name] if a])
    data_sources = []
    for key, entry in catalog.items():
        if isinstance(entry, intake.catalog.Catalog):
            if newname == "main":
                newname = None
            # If the entry is a subcatalog, recursively search it
            data_sources.extend(find_data_sources(entry, newname))
        elif isinstance(entry, intake.source.base.DataSource):
            data_sources.append(newname + "." + key)
    return data_sources
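For example, applied to the model-output catalog (a sketch; walking the whole tree opens every subcatalog and can take a while):
sources = find_data_sources(cat)
print(len(sources))
print(sources[:3])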
Get file names#
If you need file names for use in shell scripts, you can retrieve them with the query_yaml.py program by specifying a catalog, a dataset and a variable name, e.g.:
[22]:
%%bash
module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable
FILES=$(query_yaml.py ifs-fesom2-sr eerie-spinup-1950 v20240304 atmos \
native daily \
-c https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/model-output/main.yaml \
--var 2t --uri --cdo )
echo found
echo ${FILES} | wc -w
echo files
Autoloading slk
Autoloading openjdk/17.0.0_35-gcc-11.2.0
Autoloading python3
Loading hsm-tools/unstable
Loading requirement: openjdk/17.0.0_35-gcc-11.2.0 slk/3.3.91_h1.12.10_w1.2.2
cdo/2.3.0-gcc-11.2.0 python3/python_3.12-flo
Choices for this dataset:
name ... default
0 variables ... 100u
[1 rows x 5 columns]
found
372
files
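If you prefer to stay in Python, similar path information is available in the urlpath column of the data_base table (a sketch; note that urlpath can hold a comma-separated list of files for some entries):
paths = data_base.loc[
    (data_base["source_id"] == "ifs-fesom2-sr")
    & (data_base["experiment_id"] == "eerie-spinup-1950")
    & (data_base["realm"] == "atmos"),
    "urlpath",
]
print(paths.head())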
Open and load#
You can open a dataset from the catalog like a dictionary entry. This gives you a lot of metadata:
[23]:
dsid = "icon-esm-er.eerie-spinup-1950.v20240618.ocean.gr025.2d_monthly_mean"
cat[dsid]
2d_monthly_mean:
args:
chunks: auto
consolidated: false
storage_options:
lazy: true
remote_protocol: file
urlpath: reference:://work/bm1344/k202193/Kerchunk/erc2002/oce_2d_1mth_mean_remap025.parq
description: ''
driver: intake_xarray.xzarr.ZarrSource
metadata:
CDI: Climate Data Interface version 2.2.4 (https://mpimet.mpg.de/cdi)
CDO: Climate Data Operators version 2.2.2 (https://mpimet.mpg.de/cdo)
Conventions: CF-1.6
DOKU_License: CC BY 4.0
DOKU_Name: EERIE ICON-ESM-ER eerie-spinup-1950 run
DOKU_authors: "Putrasahan, D.; Kr\xF6ger, J.; Wachsmann, F."
DOKU_responsible_person: Fabian Wachsmann
DOKU_summary: EERIE ICON-ESM-ER eerie-spinup-1950 run
activity_id: EERIE
catalog_dir: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/model-output/icon-esm-er/eerie-spinup-1950/v20240618/ocean/gr025
cdo_openmp_thread_number: 16
comment: Sapphire Dyamond (k203123) on l40344 (Linux 4.18.0-372.32.1.el8_6.x86_64
x86_64)
experiment: eerie-spinup-1950
experiment_id: eerie-1950spinup
format: netcdf
frequency: 1month
grid_id: 375cb0cc-637e-11e8-9d6f-8f41a9b9ff4b
grid_label: gn
history: deleted for convenience
institution: Max Planck Institute for Meteorology/Deutscher Wetterdienst
institution_id: MPI-M
level_type: 2d
member_id: r1i1p1f1
plots:
quicklook:
aspect: 1
cmap: jet
coastline: 50m
geo: true
groupby: time
kind: image
use_dask: true
width: 800
x: lon
y: lat
z: Qbot
project: EERIE
project_id: EERIE
realm: oce
references: see MPIM/DWD publications
simulation_id: erc2002
source: git@gitlab.dkrz.de:icon/icon-mpim.git@450227788f06e837f1238ebed27af6e2365fa673
source_id: ICON-ESM
source_type: AOGCM
time_max: 21564000
time_min: 15778080
time_reduction: mean
title: ICON simulation
variable-long_names:
- Salt volume flux due to sea ice change
- Freshwater Flux due to Sea Ice Change
- Conductive heat flux at ice-ocean interface
- Energy flux available for surface melting
- Wind Speed at 10m height
- atmos_fluxes_FrshFlux_Evaporation
- atmos_fluxes_FrshFlux_Precipitation
- atmos_fluxes_FrshFlux_Runoff
- atmos_fluxes_FrshFlux_SnowFall
- atmos_fluxes_HeatFlux_Latent
- atmos_fluxes_HeatFlux_LongWave
- atmos_fluxes_HeatFlux_Sensible
- atmos_fluxes_HeatFlux_ShortWave
- atmos_fluxes_HeatFlux_Total
- atmos_fluxes_stress_x
- atmos_fluxes_stress_xw
- atmos_fluxes_stress_y
- atmos_fluxes_stress_yw
- Heat flux to ocean from the ice growth
- Heat flux to ocean from the atmosphere
- meridional velocity
- ocean_mixed_layer_thickness_defined_by_sigma_t
- ocean_mixed_layer_thickness_defined_by_sigma_t_10m
- new ice growth in open water
- Sea Level Pressure
- amount of snow that is transformed to ice
- sea water salinity
- surface elevation at cell center
- zstar surface stretch at cell center
- sea water potential temperature
You can open the data as an xarray dataset with the to_dask() function.
[24]:
ds = cat[dsid].to_dask()
ds
/work/bm1344/conda-envs/py_312/lib/python3.12/site-packages/intake_xarray/base.py:21: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
'dims': dict(self._ds.dims),
[24]:
<xarray.Dataset> Size: 17GB
Dimensions:                            (time: 133, lat: 721, lon: 1440, lev: 1, depth: 1)
Coordinates:
  * depth                              (depth) float64 8B 1.0
  * lat                                (lat) float64 6kB -90.0 -89.75 ... 90.0
  * lev                                (lev) float64 8B 0.0
  * lon                                (lon) float64 12kB 0.0 0.25 ... 359.8
  * time                               (time) datetime64[ns] 1kB 1980-01-01...
Data variables: (12/30)
    FrshFlux_IceSalt                   (time, lat, lon) float32 552MB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    FrshFlux_TotalIce                  (time, lat, lon) float32 552MB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    Qbot                               (time, lev, lat, lon) float32 552MB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    Qtop                               (time, lev, lat, lon) float32 552MB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    Wind_Speed_10m                     (time, lat, lon) float32 552MB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    atmos_fluxes_FrshFlux_Evaporation  (time, lat, lon) float32 552MB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    ...                                 ...
    sea_level_pressure                 (time, lat, lon) float32 552MB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    snow_to_ice                        (time, lev, lat, lon) float32 552MB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    so                                 (time, depth, lat, lon) float32 552MB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    ssh                                (time, lat, lon) float32 552MB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    stretch_c                          (time, lat, lon) float32 552MB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    to                                 (time, depth, lat, lon) float32 552MB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
Load with intake-esm#
Loading with intake-esm is more complicated as we need different keyword arguments for different catalog entries:
[25]:
default_kwargs = dict(
    xarray_open_kwargs=dict(backend_kwargs=dict(consolidated=False)),
    storage_options=dict(remote_protocol="file", lazy=True),
)
if subcat_esm.df["format"][0] != "zarr":
    del default_kwargs["xarray_open_kwargs"]["backend_kwargs"]
if "," in subcat_esm.df["urlpath"][0]:
    # urlpath can hold a stringified list of files; convert it back to a list
    subcat_esm.df["urlpath"] = subcat_esm.df["urlpath"].apply(eval)
if "icon-esm" not in subcat_esm.df["source_id"][0]:
    default_kwargs["xarray_open_kwargs"]["compat"] = "override"
subcat_esm.to_dataset_dict(**default_kwargs).popitem()[1]
--> The keys in the returned dictionary of datasets are constructed as follows:
'source_id.experiment_id.realm.grid_lable.aggregation'
[25]:
<xarray.Dataset> Size: 2GB
Dimensions:   (depth: 1, lat: 721, lon: 1440, time: 133)
Coordinates:
  * depth     (depth) float64 8B 1.0
  * lat       (lat) float64 6kB -90.0 -89.75 -89.5 -89.25 ... 89.5 89.75 90.0
  * lon       (lon) float64 12kB 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * time      (time) datetime64[ns] 1kB 1980-01-01 1980-02-01 ... 1991-01-01
Data variables:
    mlotst    (time, lat, lon) float32 552MB dask.array<chunksize=(1, 721, 1440), meta=np.ndarray>
    mlotst10  (time, lat, lon) float32 552MB dask.array<chunksize=(1, 721, 1440), meta=np.ndarray>
    ssh       (time, lat, lon) float32 552MB dask.array<chunksize=(1, 721, 1440), meta=np.ndarray>
    to        (time, depth, lat, lon) float32 552MB dask.array<chunksize=(1, 1, 721, 1440), meta=np.ndarray>
Attributes: (12/18)
    intake_esm_vars:                 ["['mlotst', 'mlotst10', 'ssh', 't...
    intake_esm_attrs:_data_format_:  zarr
    intake_esm_attrs:grid_id:        375cb0cc-637e-11e8-9d6f-8f41a9b9ff4b
    intake_esm_attrs:member_id:      r1i1p1f1
    intake_esm_attrs:institution_id: MPI-M
    intake_esm_attrs:institution:    Max Planck Institute for Meteorolo...
    ...                              ...
    intake_esm_attrs:version:        v20240618
    intake_esm_attrs:realm:          ocean
    intake_esm_attrs:grid_lable:     gr025
    intake_esm_attrs:aggregation:    2d_monthly_square
    intake_esm_attrs:urlpath:        reference:://work/bm1344/k202193/K...
    intake_esm_dataset_key:          icon-esm-er.eerie-spinup-1950.ocea...
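Instead of popitem, you can pick a specific dataset by its key, which follows the template printed above (a sketch; the key in the comment is illustrative, check list(dsets) for the actual keys):
dsets = subcat_esm.to_dataset_dict(**default_kwargs)
print(list(dsets))
# e.g. dsets["icon-esm-er.eerie-spinup-1950.ocean.gr025.2d_monthly_mean"]  # illustrative key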
Subset#
Use
- isel for index selection
- sel for value selection; note that you can use method="nearest" for a nearest-neighbour lookup
- groupby for statistics
[26]:
# get the latest time step
to_last_timestep = ds["to"].isel(time=-1)
to_last_timestep
[26]:
<xarray.DataArray 'to' (depth: 1, lat: 721, lon: 1440)> Size: 4MB
dask.array<getitem, shape=(1, 721, 1440), dtype=float32, chunksize=(1, 721, 1440), chunktype=numpy.ndarray>
Coordinates:
  * depth    (depth) float64 8B 1.0
  * lat      (lat) float64 6kB -90.0 -89.75 -89.5 -89.25 ... 89.5 89.75 90.0
  * lon      (lon) float64 12kB 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
    time     datetime64[ns] 8B 1991-01-01
Attributes:
    long_name:      sea water potential temperature
    param:          18.4.10
    standard_name:  sea_water_potential_temperature
    units:          C
[27]:
# select a coordinate by value with a nearest-neighbour lookup:
import hvplot.xarray

to_northsea = ds["to"].sel(lat=54, lon=8.2, method="nearest").drop_vars(["lat", "lon"])
to_northsea.hvplot.line()
[27]:
[28]:
to_northsea_yearmean = to_northsea.groupby("time.year").mean()
[29]:
to_northsea_yearmean.hvplot.line()
[29]:
Handling of the ICON native grid for georeferenced plots:
[30]:
dsnative = cat[
"icon-esm-er.eerie-spinup-1950.v20240618.ocean.native.2d_monthly_mean"
].to_dask()
/work/bm1344/conda-envs/py_312/lib/python3.12/site-packages/intake_xarray/base.py:21: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
'dims': dict(self._ds.dims),
[31]:
import hvplot.xarray
dsnative["to"].isel(time=0).squeeze().load().hvplot.scatter(
x="lon", y="lat", c="to", rasterize=True, datashade=True
)
[31]: