The EERIE cloud
Access EERIE and nextGEMS data on a data server from anywhere
This notebook addresses data users who cannot work next to the data on DKRZ's High Performance Computer Levante.
Data from ICON and IFS-FESOM2 is made accessible via a fast-lane web service. This is enabled by the Python package xpublish. All API endpoints can be explored in the automatically generated documentation.
The EERIE Cloud hosts different endpoints for many cloudified datasets. All available datasets are listed here. The service is described in detail here. This notebook focuses on the Python-relevant endpoints.
For all datasets, there are two main types of endpoints:
- A dataset view
- Data access endpoints
You can construct both using these templates:
url = ""
dataset_id = "ifs-fesom2-sr.eerie-spinup-1950.v20240304.ocean.gr025.monthly"
zarr_endpoint = '/'.join([url,dataset_id,"zarr"])
kerchunk_endpoint = '/'.join([url,dataset_id,"kerchunk"])
url_list_of_datasets = ""
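The templates above can be wrapped in a small helper. This is a minimal sketch: the function name `build_endpoints` and the base URL `https://example.org/datasets` are illustrative placeholders, not part of the service, and the sketch assumes that the dataset view and the data access endpoints share one base URL.

```python
def build_endpoints(base_url: str, dataset_id: str) -> dict:
    """Return the dataset view and the data access endpoints for one dataset."""
    # Drop a trailing slash so that joining never produces "//".
    base = base_url.rstrip("/")
    dataset_view = "/".join([base, dataset_id])
    return {
        "dataset_view": dataset_view,
        "zarr": "/".join([dataset_view, "zarr"]),
        "kerchunk": "/".join([dataset_view, "kerchunk"]),
    }


# Placeholder base URL (assumption); replace it with the real service URL.
endpoints = build_endpoints(
    "https://example.org/datasets",
    "ifs-fesom2-sr.eerie-spinup-1950.v20240304.ocean.gr025.monthly",
)
for name, endpoint in endpoints.items():
    print(f"{name}: {endpoint}")
```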
import json
import fsspec as fs
import pandas as pd
from IPython.display import HTML
import panel as pn
from bokeh.models import HTMLTemplateFormatter
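bokeh_formatters = {
    "dataset_view": HTMLTemplateFormatter(template="<code><%= value %></code>")
}
# The original call is truncated; reading the JSON list from
# url_list_of_datasets is an assumption.
with fs.open(url_list_of_datasets) as f:
    list_of_datasets = json.load(f)
dict_of_datasets = {}
for dsid in list_of_datasets:
    dict_of_datasets[dsid] = {
        # Clickable link to the dataset view
        "dataset_view": '<a href="'
        + "/".join([url_list_of_datasets, dsid])
        + '/" '
        + 'target="_blank">'
        + dsid
        + "</a>",
        "zarr_endpoint": "/".join([url_list_of_datasets, dsid, "zarr"]),
    }
df = pd.DataFrame(dict_of_datasets).transpose()
# The original Tabulator arguments are truncated; these are assumed.
tabu = pn.widgets.Tabulator(df, formatters=bokeh_formatters)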
To successfully run this notebook, you need a kernel / environment with the following packages installed:
- intake-xarray
- intake
- zarr
- dask
- hvplot # for plotting
- aiohttp
- requests
Before we do anything with the data, we should set up a dask client to limit the number of concurrent threads, because the EERIE cloud service is very memory intensive. Here, we open a client with 4 threads (2 workers with 2 threads each):
# client.close()
from distributed import Client
client = Client(n_workers=2, threads_per_worker=2)
Connection method: Cluster object | Cluster type: distributed.LocalCluster | Dashboard: /proxy/8787/status
Workers: 2 | Total threads: 4 | Total memory: 250.00 GiB | Status: running | Using processes: True
Worker 0: Total threads: 2 | Memory: 125.00 GiB
Worker 1: Total threads: 2 | Memory: 125.00 GiB
Data access endpoints
Zarr: Zarr endpoints deliver data that is processed on the server side with dask. The resulting data is
- rechunked into dask-optimal chunk sizes, larger than about 10 MB
- compressed with blosc by default
Kerchunk: Data is served as is. The chunks that the references point to are streamed to the clients, so the format seen by clients is pure zarr. Original zarr data is simply passed through to the user.
Use Kerchunk endpoints by default, as this reduces the load on the server. Prefer zarr if
- the original data is uncompressed, to reduce the data volume to be transferred
- the original data is compressed with complicated compression methods that would require additional software on the client side
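The rule of thumb above can be written down as a small decision function. This is only a sketch: `choose_endpoint` and its two flags are hypothetical names introduced for illustration, and whether a dataset is compressed or decodable on the client is something you have to know or look up yourself.

```python
def choose_endpoint(original_is_compressed: bool, client_can_decode: bool = True) -> str:
    """Pick an endpoint type following the recommendation in the text."""
    if not original_is_compressed:
        # Uncompressed originals: let the server compress to save transfer volume.
        return "zarr"
    if not client_can_decode:
        # The client lacks the software for the original compression method.
        return "zarr"
    # Default: pass the original chunks through to reduce server load.
    return "kerchunk"


print(choose_endpoint(original_is_compressed=True))
print(choose_endpoint(original_is_compressed=False))
print(choose_endpoint(original_is_compressed=True, client_can_decode=False))
```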
Data access with intake
We recommend using intake to open and load available data. The EERIE cloud has an endpoint for an intake catalog which is synced with the available datasets. See this notebook for how to use intake to browse and load data sources.
import intake
cat = intake.open_catalog("")
We can list all datasets available in the catalog with list:
all_dkrz_datasets = list(cat)
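The dataset identifiers shown in this notebook consist of six dot-separated parts, which makes it easy to filter the list before opening anything. The sketch below is an illustration only: the field names (`source`, `experiment`, `version`, `realm`, `grid`, `frequency`) are assumptions and not an official naming scheme, and identifiers with a different number of parts raise an error.

```python
from typing import NamedTuple


class DatasetId(NamedTuple):
    # Field names are assumptions made for this sketch.
    source: str
    experiment: str
    version: str
    realm: str
    grid: str
    frequency: str


def parse_dataset_id(dsid: str) -> DatasetId:
    """Split a dot-separated dataset identifier into its six parts."""
    parts = dsid.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated parts, got {len(parts)}: {dsid}")
    return DatasetId(*parts)


# The two identifiers used in this notebook; in practice, use all_dkrz_datasets.
example_ids = [
    "ifs-fesom2-sr.eerie-spinup-1950.v20240304.ocean.gr025.monthly",
    "icon-esm-er.eerie-control-1950.v20231106.atmos.native.2d_daily_mean",
]
ocean_ids = [dsid for dsid in example_ids if parse_dataset_id(dsid).realm == "ocean"]
print(ocean_ids)
```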
We can just open all datasets and put them in a dictionary, using the kerchunk policy introduced before:
from tqdm import tqdm
nbytes = 0
dsdict = {}
for dsid in tqdm(all_dkrz_datasets):
    # Prefer the kerchunk access method and fall back to zarr otherwise
    if any("kerchunk" in access_method for access_method in list(cat[dsid])):
        dsdict[dsid] = cat[dsid][dsid + "-kerchunk"].to_dask()
    else:
        dsdict[dsid] = cat[dsid][dsid + "-zarr"].to_dask()
    nbytes += dsdict[dsid].nbytes
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/ SerializationWarning: variable 'siconc' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
(the same warning is raised for the variables sithick, siu, siv, mlotst10, sosabs, toscon and zos)
100%|██████████| 176/176 [04:43<00:00, 1.61s/it]
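The loop above accumulates the total size of all opened datasets in `nbytes`, but a raw byte count is hard to read. A minimal sketch of a formatter follows; `format_nbytes` is a hypothetical helper introduced here, and it uses binary units (1 KiB = 1024 B).

```python
def format_nbytes(nbytes: int) -> str:
    """Format a byte count with binary units (KiB, MiB, ...)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(nbytes)
    for unit in units[:-1]:
        if value < 1024:
            return f"{value:.2f} {unit}"
        value /= 1024
    # Anything larger is reported in the largest unit.
    return f"{value:.2f} {units[-1]}"


# Example with a made-up byte count; in the notebook, pass the accumulated nbytes.
print(format_nbytes(3 * 1024**4))
```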
ds = dsdict["icon-esm-er.eerie-control-1950.v20231106.atmos.native.2d_daily_mean"]
<xarray.Dataset>
Dimensions:             (ncells: 5242880, time: 13879, height: 1, height_2: 1)
Coordinates:
    cell_sea_land_mask  (ncells) int32 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * height              (height) float64 2.0
  * height_2            (height_2) float64 10.0
    lat                 (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
    lon                 (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * time                (time) datetime64[ns] 2002-01-01T23:59:00 ... 2039-12...
Dimensions without coordinates: ncells
Data variables: (12/17)
    clt                 (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    dew2                (time, height, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    evspsbl             (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    hfls                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    hfss                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    pr                  (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    ...                  ...
    rsds                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    rsus                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    sfcwind             (time, height_2, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    tas                 (time, height, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    uas                 (time, height_2, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    vas                 (time, height_2, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
Attributes: (12/36)
    CDI:                      Climate Data Interface version 2.2.0 (https://m...
    Conventions:              CF-1.6
    DOKU_License:             CC BY 4.0
    DOKU_Name:                EERIE ICON-ESM-ER eerie-control-1950 run
    DOKU_authors:             Putrasahan, D.; Kröger, J.; Wachsmann, F.
    DOKU_responsible_person:  Fabian Wachsmann
    ...                       ...
    time_min:                 2002-01-01T23:59:00.000000000
    time_reduction:           mean
    title:                    ICON simulation
    uuidOfHGrid:              5aff0578-9bd9-11e8-8e4a-af3d880818e6
    _catalog_id:              icon-esm-er.eerie-control-1950.v20231106.atmos....
    creation_date:            2024-07-16T09:14:16Z
Download the data
Note that the EERIE cloud is not a high-performance data server but is intended for interactive work. DO NOT use it for large retrievals of high-volume data. Thanks!
With zarr:
import hvplot.xarray
towrite = ds["tas"].isel(time=slice(200, 250))
allbytes = towrite.nbytes / 1024 / 1024
print(allbytes, " MB to write")
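The size of this selection can be checked by hand from the dataset representation above: `tas` is a float32 variable with dimensions (time, height, ncells), and the slice covers 50 time steps. The sketch below redoes that arithmetic without touching the server. It also estimates how many stored chunks the slice touches, assuming the chunk size (4, 1, 5242880) shown in the representation, chunks aligned at time index 0, and whole chunks being fetched; sizes are uncompressed.

```python
# Shape of the selection ds["tas"].isel(time=slice(200, 250))
n_time = 250 - 200      # 50 time steps
n_height = 1
n_cells = 5_242_880
itemsize = 4            # bytes per float32 value

# Uncompressed size of the selection in MiB
selection_mib = n_time * n_height * n_cells * itemsize / 1024 / 1024
print(selection_mib, "MiB selected")

# Chunks along time hold 4 steps each (assumed to start at index 0)
time_chunk = 4
first_chunk = 200 // time_chunk
last_chunk = (250 - 1) // time_chunk
n_chunks = last_chunk - first_chunk + 1
chunk_mib = time_chunk * n_height * n_cells * itemsize / 1024 / 1024
print(n_chunks, "chunks of", chunk_mib, "MiB each")
```

Under these assumptions the selection is 1000 MiB, while 13 chunks of 80 MiB (1040 MiB uncompressed) are touched, because the slice end does not fall on a chunk boundary.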