The EERIE cloud#
Access EERIE and NextGEMS data from a data server, from anywhere#
The EERIE cloud addresses data users who cannot work on DKRZ's High Performance Computer Levante, close to the data.
From 2024 onwards, the EERIE cloud main URL redirects to a stac-browser view of its content, so that it is discoverable by both humans and machines. This notebook describes how to access the data from the server programmatically.
Data, mainly from ICON and IFS-FESOM2, is made accessible via a fast-lane web service under eerie.cloud.dkrz.de. This is enabled by the Python package xpublish. A detailed technical description of the eerie.cloud is provided here.
Use-cases#
Small volume data retrieval
Monitoring
Interactive analysis
For example, monitoring:
[1]:
import intake
import hvplot.xarray
caturl = "https://eerie.cloud.dkrz.de/intake.yaml"
dsname = "nextgems.ICON.ngc4008.P1D_0"
var = "tas"
cat = intake.open_catalog(caturl)
ds = cat[dsname].to_dask()
vmean = ds[var].mean(dim=["cell"]).compute()
vmean_yr = vmean.resample(time="YE").mean()
(vmean.hvplot.line(grid=True) * vmean_yr.hvplot.line(grid=True)).opts(
title=" ".join(dsname.split(".")) + f"\ntime series of daily and yearly mean {var}"
)
[1]:
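The daily-to-yearly resampling pattern in the cell above works on any time-indexed DataArray. A minimal offline sketch with synthetic data (the real series comes from the catalog):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the global-mean series: two years of daily values.
time = pd.date_range("2020-01-01", "2021-12-31", freq="D")
vmean = xr.DataArray(
    np.linspace(280.0, 290.0, time.size), coords={"time": time}, dims="time"
)

# Yearly means via resample; "YE" (year end) replaces the deprecated "Y" alias.
vmean_yr = vmean.resample(time="YE").mean()
```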
Intake#
Full EERIE Cloud content#
The EERIE cloud has an endpoint for an intake catalog which is synced with the available datasets.
[2]:
cat = intake.open_catalog("https://eerie.cloud.dkrz.de/intake.yaml")
We can list all datasets available in the catalog with list:
[3]:
all_dkrz_datasets = list(cat)
len(all_dkrz_datasets)
[3]:
398
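The flat list of dataset ids can be narrowed with plain string matching, since the ids are dot-separated hierarchies. A sketch using two ids copied from this notebook as stand-ins for the full list:

```python
# Two dataset ids from this notebook stand in for the full catalog listing.
all_dkrz_datasets = [
    "nextgems.ICON.ngc4008.P1D_0",
    "icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_daily_mean",
]

# Case-insensitive substring match across all ids:
icon_datasets = [d for d in all_dkrz_datasets if "icon" in d.lower()]

# Or match on the leading project component:
nextgems = [d for d in all_dkrz_datasets if d.startswith("nextgems.")]
```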
A particular dataset has two zarr access points:
[4]:
dsname = all_dkrz_datasets[-1]
cat[dsname].describe()
[4]:
{'name': 'nextgems.IFS_9-FESOM_5-production.3D_hourly_healpix512',
'container': 'xarray',
'plugin': ['zarr'],
'driver': ['zarr'],
'description': '',
'direct_access': 'forbid',
'user_parameters': [{'name': 'method',
'description': 'server-side loading method',
'type': 'str',
'allowed': ['kerchunk', 'zarr'],
'default': 'kerchunk'}],
'metadata': {},
'args': {'consolidated': True,
'urlpath': 'https://eerie.cloud.dkrz.de/datasets/nextgems.IFS_9-FESOM_5-production.3D_hourly_healpix512/{{ method }}'}}
Zarr access endpoints#
You can control which of the two eerie.cloud endpoints you want to use to retrieve data via the method keyword.
method="zarr": delivers data which is processed server-side with dask. The resulting data is
rechunked into large, dask-optimal chunk sizes (O(100 MB) in memory)
uniformly compressed with blosc. Native data is lossily compressed with 12-bit bitrounding.
method="kerchunk": data is hosted as is. Binary values of the referenced chunks are streamed to the clients, so that the format seen by clients is pure zarr. Original zarr data, e.g. NextGEMS ICON, is just passed through to the user.
The kerchunk endpoint is the default, as it reduces the load on the server. Use zarr instead if you cannot use the original encoding or need to reduce the data volume before transfer.
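The two endpoints differ only in the trailing path segment of the dataset URL (cf. the `urlpath` template in the catalog entry above). A small sketch of the URL pattern; with intake you would instead select the endpoint via `cat[dsname](method="zarr").to_dask()`:

```python
# Build both endpoint URLs for one dataset, following the catalog's
# urlpath template ".../datasets/<dataset-id>/{{ method }}".
base = "https://eerie.cloud.dkrz.de/datasets"
dsname = "nextgems.ICON.ngc4008.P1D_0"
urls = {method: f"{base}/{dsname}/{method}" for method in ("kerchunk", "zarr")}
```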
Open data#
We can open all datasets and put them in a dictionary. For some special cases, we have to fall back to the zarr method.
[5]:
%%capture
from tqdm import tqdm
nbytes = 0
dsdict = {}
for dsid in tqdm(all_dkrz_datasets):
    if "hadgem" in dsid.lower():
        continue
    try:
        dsdict[dsid] = cat[dsid].to_dask()
    except Exception:
        try:
            dsdict[dsid] = cat[dsid](method="zarr").to_dask()
        except Exception:
            print(dsid)
Stac#
Single datasets from stac-items#
Once you have found your desired dataset item with the stac-browser, you can use the snippet from its description.
[6]:
baseurl = "https://eerie.cloud.dkrz.de/datasets"
dataset = "ifs-amip-tco1279.hist.v20240901.atmos.native.2D_24h"
stac_item = intake.open_stac_item("/".join([baseurl, dataset, "stac"]))
The stac-item not only has a data asset but also links to other applications:
[7]:
assets = list(stac_item)
assets
[7]:
['data', 'xarray_view', 'jupyterlite', 'gridlook']
[8]:
for asset in assets:
metadata = stac_item[asset].metadata
url = metadata.get("urlpath", metadata.get("href", None))
print(asset + " url: " + url)
data url: https://eerie.cloud.dkrz.de/datasets/ifs-amip-tco1279.hist.v20240901.atmos.native.2D_24h/kerchunk
xarray_view url: https://eerie.cloud.dkrz.de/datasets/ifs-amip-tco1279.hist.v20240901.atmos.native.2D_24h/
jupyterlite url: https://swift.dkrz.de/v1/dkrz_7fa6baba-db43-4d12-a295-8e3ebb1a01ed/apps/jupyterlite/index.html
gridlook url: https://swift.dkrz.de/v1/dkrz_7fa6baba-db43-4d12-a295-8e3ebb1a01ed/apps/gridlook/index.html#https://eerie.cloud.dkrz.de/datasets/ifs-amip-tco1279.hist.v20240901.atmos.native.2D_24h/stac
The zarr API endpoint is hidden in the data asset under its metadata.
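Extracting that endpoint follows the same `urlpath`/`href` lookup as in the loop above. A minimal sketch with a hand-written stand-in for the data asset's metadata (with intake-stac, `stac_item["data"].metadata` provides such a mapping):

```python
# Stand-in for the metadata of the stac-item's "data" asset shown above.
metadata = {
    "urlpath": "https://eerie.cloud.dkrz.de/datasets/ifs-amip-tco1279.hist.v20240901.atmos.native.2D_24h/kerchunk"
}

# Prefer "urlpath", fall back to "href" (assets from other sources may use it).
url = metadata.get("urlpath", metadata.get("href"))

# The endpoint can then be opened lazily with xarray (network access required):
# ds = xr.open_dataset(url, engine="zarr", chunks={})
```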
Calculate and plot an example#
In the following, we
select Hamburg
plot the data
for a daily mean time series of an EERIE ICON control run:
[9]:
ds = dsdict["icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_daily_mean"]
ds
[9]:
<xarray.Dataset> Size: 5TB
Dimensions:             (ncells: 5242880, time: 11688, height: 1, height_2: 1,
                         height_3: 1, bnds: 2)
Coordinates:
    cell_sea_land_mask  (ncells) int32 21MB dask.array<chunksize=(5242880,), meta=np.ndarray>
  * height              (height) float64 8B 2.0
  * height_2            (height_2) float64 8B 10.0
  * height_3            (height_3) float64 8B 90.0
    lat                 (ncells) float64 42MB dask.array<chunksize=(5242880,), meta=np.ndarray>
    lon                 (ncells) float64 42MB dask.array<chunksize=(5242880,), meta=np.ndarray>
  * time                (time) datetime64[ns] 94kB 1991-01-01T23:59:00 ... 20...
Dimensions without coordinates: ncells, bnds
Data variables: (12/22)
    clt                 (time, ncells) float32 245GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>
    evspsbl             (time, ncells) float32 245GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>
    height_3_bnds       (time, height_3, bnds) float64 187kB dask.array<chunksize=(1, 1, 2), meta=np.ndarray>
    hfls                (time, ncells) float32 245GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>
    hfss                (time, ncells) float32 245GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>
    hur                 (time, height_3, ncells) float32 245GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>
    ...                  ...
    rsus                (time, ncells) float32 245GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>
    sfcwind             (time, height_2, ncells) float32 245GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>
    tas                 (time, height, ncells) float32 245GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>
    ts                  (time, ncells) float32 245GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>
    uas                 (time, height_2, ncells) float32 245GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>
    vas                 (time, height_2, ncells) float32 245GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>
Attributes: (12/30)
    Conventions:           CF-1.7 CMIP-6.2
    activity_id:           EERIE
    data_specs_version:    01.00.32
    forcing_index:         1
    initialization_index:  1
    license:               EERIE model data produced by MPI-M is licensed und...
    ...                    ...
    parent_experiment_id:  eerie-spinup-1950
    parent_activity_id:    EERIE
    sub_experiment_id:     none
    experiment:            coupled control with fixed 1950's forcing (HighRes...
    source:                ICON-ESM-ER (2023): \naerosol: none, prescribed MA...
    institution:           Max Planck Institute for Meteorology, Hamburg 2014...
[10]:
import numpy as np
import xarray as xr
For Hamburg, we use the first cell we find that lies between 53°N and 54°N and between 9.5°E and 11°E:
[12]:
dsgrid = ds[var].isel(time=0).drop_vars(["time", "height"])
dsgrid["gridsel"] = xr.where(
((dsgrid.lat > 53) & (dsgrid.lat < 54) & (dsgrid.lon > 9.5) & (dsgrid.lon < 11)),
1,
np.nan,
).compute()
[13]:
cell = dsgrid["gridsel"].argmax().values[()]
cell
[13]:
1064520
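Instead of the box mask above, the nearest cell to a point can be found directly by minimizing a distance in degrees. A sketch with a tiny synthetic grid (in the notebook, the real `lat`/`lon` arrays would come from `dsgrid`):

```python
import numpy as np

# Synthetic unstructured grid with 5 cells; pick the one nearest to Hamburg
# (approx. 53.55°N, 9.99°E). Longitude differences are wrapped to [-180, 180).
lat = np.array([0.0, 30.0, 53.5, 70.0, -10.0])
lon = np.array([0.0, 100.0, 10.0, 20.0, 300.0])
dlon = np.mod(lon - 9.99 + 180.0, 360.0) - 180.0
cell = int(((lat - 53.55) ** 2 + dlon**2).argmin())  # index of the closest cell
```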
[19]:
tas_hamburg = ds[var].isel(ncells=cell, time=range(-365, 0))
[20]:
tas_hamburg
[20]:
<xarray.DataArray 'tas' (time: 365, height: 1)> Size: 1kB
dask.array<getitem, shape=(365, 1), dtype=float32, chunksize=(1, 1), chunktype=numpy.ndarray>
Coordinates:
    cell_sea_land_mask  int32 4B dask.array<chunksize=(), meta=np.ndarray>
  * height              (height) float64 8B 2.0
    lat                 float64 8B dask.array<chunksize=(), meta=np.ndarray>
    lon                 float64 8B dask.array<chunksize=(), meta=np.ndarray>
  * time                (time) datetime64[ns] 3kB 2022-01-01T23:59:00 ... 202...
Attributes:
    CDI_grid_type:                unstructured
    long_name:                    temperature in 2m
    number_of_grid_in_reference:  1
    param:                        0.0.0
    standard_name:                tas
    units:                        K
The hvplot library can be used to generate interactive plots.
[21]:
tas_hamburg.squeeze().hvplot.line()
[21]:
Download the data#
Note that the eerie.cloud is not a high-performance data server but is meant for interactive work. Do NOT use it for large retrievals of high-volume data. Thanks!
In the following, we store a slice of a time series of tas locally in zarr format.
[22]:
towrite = ds[var].isel(time=slice(200, 250))
allbytes = towrite.nbytes / 1024 / 1024
print(allbytes, " MB to write")
1000.0 MB to write
[23]:
import time
filename = "/work/bm0021/k204210/test.zarr"
start = time.time()
towrite.to_zarr(filename, mode="w", compute=True, consolidated=True)
end = time.time()
print(allbytes / (end - start))  # write rate in MB/s
149.00784099592835
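As a rule of thumb, check `.nbytes` before any retrieval so you know what you are about to transfer. A synthetic sketch of the size estimate used above:

```python
import numpy as np
import xarray as xr

# 50 float32 fields of 1024 x 1024 values:
# 50 * 1024 * 1024 * 4 bytes = 200 MB in memory.
da = xr.DataArray(
    np.zeros((50, 1024, 1024), dtype="float32"), dims=("time", "y", "x")
)
size_mb = da.nbytes / 1024 / 1024
```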
Environment#
To successfully run this notebook, you need a kernel / environment with the following packages installed:
- intake-xarray
- intake-stac
- intake
- zarr
- dask
- hvplot # for plotting
- aiohttp
- requests
- eccodes # for grib data
- python-eccodes # for grib data
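One way to provide these packages is a conda environment file; a possible `environment.yml` collecting the list above (the environment name is arbitrary):

```yaml
name: eerie-cloud
channels:
  - conda-forge
dependencies:
  - intake
  - intake-xarray
  - intake-stac
  - zarr
  - dask
  - hvplot          # for plotting
  - aiohttp
  - requests
  - eccodes         # for grib data
  - python-eccodes  # for grib data
```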
[ ]: