The EERIE cloud#

Access EERIE and nextGEMS data from a data server anywhere#

This notebook is for data users who cannot work next to the data on DKRZ's high-performance computer Levante.

Data from ICON and IFS-FESOM2 is made accessible via a fast lane web service under eerie.cloud.dkrz.de. This is enabled by the Python package xpublish. All API endpoints can be explored in the automatically generated documentation.

The EERIE Cloud hosts different endpoints for many cloudified datasets. All available datasets are listed here. The service is described in detail here. This notebook focuses on the Python-relevant endpoints.

For all datasets, there are two main types of endpoints:

  1. A dataset view

  2. Data access endpoints

You can construct both using these templates:

url = "https://eerie.cloud.dkrz.de/datasets"
dataset_id = "ifs-fesom2-sr.eerie-spinup-1950.v20240304.ocean.gr025.monthly"
dataset_view = '/'.join([url, dataset_id]) + "/"

zarr_endpoint = '/'.join([url,dataset_id,"zarr"])
kerchunk_endpoint = '/'.join([url,dataset_id,"kerchunk"])
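
The templates above can be wrapped in a small helper. This is a minimal sketch: `build_endpoints` is a hypothetical name and not part of the service or of any library, and the `opendap` suffix is the one used in the cdo example of this notebook.

```python
BASE_URL = "https://eerie.cloud.dkrz.de/datasets"


def build_endpoints(dataset_id, base_url=BASE_URL):
    """Return the dataset view and data access URLs for one dataset ID.

    Hypothetical helper for illustration. It only concatenates strings and
    does not check that the dataset exists on the server.
    """
    root = "/".join([base_url, dataset_id])
    return {
        "dataset_view": root + "/",
        "zarr": root + "/zarr",
        "kerchunk": root + "/kerchunk",
        "opendap": root + "/opendap",
    }


endpoints = build_endpoints(
    "ifs-fesom2-sr.eerie-spinup-1950.v20240304.ocean.gr025.monthly"
)
for name, endpoint in endpoints.items():
    print(f"{name}: {endpoint}")
```
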
[1]:
url_list_of_datasets = "https://eerie.cloud.dkrz.de/datasets"
import json
import fsspec as fs
import pandas as pd
from IPython.display import HTML
import panel as pn
from bokeh.models import HTMLTemplateFormatter

bokeh_formatters = {
    "dataset_view": HTMLTemplateFormatter(template="<code><%= value %></code>")
}

pn.extension("tabulator")
list_of_datasets = json.load(fs.open(url_list_of_datasets).open())
dict_of_datasets = {}
for dsid in list_of_datasets:
    dict_of_datasets[dsid] = {
        "dataset_view": '<a href="'
        + "/".join([url_list_of_datasets, dsid])
        + '/" '
        + 'target="_blank">'
        + dsid
        + "</a>",
        "zarr_endpoint": "/".join([url_list_of_datasets, dsid, "zarr"]),
    }
df = pd.DataFrame(dict_of_datasets).transpose()
tabu = pn.widgets.Tabulator(
    df,
    show_index=False,
    header_filters=True,
    selectable=1,
    pagination="local",
    formatters=bokeh_formatters,
)
tabu
[1]:

To successfully run this notebook, you need a kernel / environment with the following packages installed:

- intake-xarray
- intake
- zarr
- dask
- hvplot # for plotting
- aiohttp
- requests

Before we do anything with the data, we should set up a dask client to limit the number of concurrent threads. The EERIE cloud service is very memory intensive. Here, we open a client with four threads in total (two workers with two threads each):

[2]:
# client.close()
from distributed import Client

client = Client(n_workers=2, threads_per_worker=2)
client
[2]:

Client

Client-b0ff4286-4361-11ef-8839-080038c03f2f

Connection method: Cluster object Cluster type: distributed.LocalCluster
Dashboard: /proxy/8787/status

Cluster Info

Data access endpoints#

  • Zarr: Zarr endpoints deliver data that is processed on the server side with dask. The resulting data is

    • rechunked into dask-friendly chunk sizes of roughly 10 MB or more

    • compressed with blosc by default

  • Kerchunk: Data is served as is. Chunks that are stored as references are streamed to the client, so the format seen by the client is pure zarr. Original zarr data is simply passed through to the user.

Use Kerchunk endpoints by default because they reduce the load on the server. Prefer Zarr endpoints if

  • the original data is uncompressed, so that server-side compression reduces the data volume to be transferred

  • the original data is compressed with a codec that requires additional software on the client side
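
The rule of thumb above can be written as a small selection function. This is only a sketch: `pick_access_method` and its `prefer_server_side` flag are hypothetical names, and it assumes that source names end in `-kerchunk` or `-zarr`, as they do in the intake catalog of this service.

```python
def pick_access_method(available_methods, prefer_server_side=False):
    """Pick a source name following the Kerchunk-first rule of thumb.

    available_methods: iterable of source names such as "<dataset_id>-kerchunk".
    prefer_server_side: set to True when the original data is uncompressed or
        uses a codec that the client cannot decode.
    """
    methods = list(available_methods)
    kerchunk = [m for m in methods if m.endswith("-kerchunk")]
    zarr = [m for m in methods if m.endswith("-zarr")]
    if kerchunk and not prefer_server_side:
        return kerchunk[0]
    if zarr:
        return zarr[0]
    if kerchunk:
        # No zarr source exists, so fall back to kerchunk.
        return kerchunk[0]
    raise ValueError("no kerchunk or zarr source available")


# Illustrative source names, not real catalog entries.
sources = ["example-dataset-zarr", "example-dataset-kerchunk"]
print(pick_access_method(sources))
print(pick_access_method(sources, prefer_server_side=True))
```
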

Data access with intake#

We recommend using intake to open and load the available data. The EERIE cloud has an endpoint for an intake catalog which is synced with the available datasets. See this notebook for how to use intake to browse and load data sources.

[3]:
import intake

cat = intake.open_catalog("https://eerie.cloud.dkrz.de/intake.yaml")

We can list all datasets available in the catalog with list:

[4]:
all_dkrz_datasets = list(cat)
print(all_dkrz_datasets[0])
ICON.ngc4008.P1D_0

We can just open all datasets and put them in a dictionary, using the kerchunk policy introduced before:

[5]:
from tqdm import tqdm

nbytes = 0
dsdict = {}
for dsid in tqdm(all_dkrz_datasets):
    try:
        if any("kerchunk" in access_method for access_method in list(cat[dsid])):
            dsdict[dsid] = cat[dsid][dsid + "-kerchunk"].to_dask()
        else:
            dsdict[dsid] = cat[dsid][dsid + "-zarr"].to_dask()
        nbytes += dsdict[dsid].nbytes
    except Exception:
        # Print the IDs of datasets that could not be opened and continue.
        print(dsid)
 24%|██▍       | 42/176 [03:03<22:11,  9.94s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_daily_center.0
 24%|██▍       | 43/176 [03:04<15:44,  7.10s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_daily_edge
 25%|██▌       | 44/176 [03:04<11:16,  5.12s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_monthly_aermon
 26%|██▌       | 45/176 [03:05<08:09,  3.74s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_monthly_amon_center
 26%|██▌       | 46/176 [03:05<05:59,  2.77s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_monthly_amon_edge
 27%|██▋       | 47/176 [03:06<04:29,  2.09s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_monthly_emon
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'siconc' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'sithick' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'siu' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'siv' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
 27%|██▋       | 48/176 [03:07<03:36,  1.69s/it]/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'mlotst10' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'sosabs' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'toscon' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'zos' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
 28%|██▊       | 50/176 [03:07<02:12,  1.05s/it]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.gr025.daily.0
 29%|██▉       | 51/176 [03:08<01:50,  1.13it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_daily_center.0
 30%|██▉       | 52/176 [03:08<01:35,  1.30it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_daily_edge
 30%|███       | 53/176 [03:09<01:24,  1.45it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_monthly_aermon
 31%|███       | 54/176 [03:09<01:17,  1.58it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_monthly_amon_center
 31%|███▏      | 55/176 [03:10<01:11,  1.69it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_monthly_amon_edge
 32%|███▏      | 56/176 [03:10<01:07,  1.77it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_monthly_emon
 32%|███▏      | 57/176 [03:11<01:04,  1.83it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.ocean.gr025.daily.0
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'mlotst10' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'sosabs' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'toscon' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'zos' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
 51%|█████     | 89/176 [03:38<00:57,  1.50it/s]
icon-esm-er.eerie-control-1950.v20231106.ocean.native.2d_grid
 62%|██████▏   | 109/176 [03:59<00:58,  1.15it/s]
icon-esm-er.eerie-spinup-1950.ocean.native.2d_monthly_mean.0
 98%|█████████▊| 173/176 [04:42<00:02,  1.22it/s]
ifs-nemo.eerie-control-1950.atmos.native.daily
 99%|█████████▉| 174/176 [04:42<00:01,  1.39it/s]
ifs-nemo.eerie-control-1950.ocean.native.daily
 99%|█████████▉| 175/176 [04:43<00:00,  1.53it/s]
ifs-nemo.eerie-control-1950.ocean.native.daily_ice
100%|██████████| 176/176 [04:43<00:00,  1.61s/it]
ifs-nemo.eerie-control-1950.ocean.native.monthly

[6]:
ds = dsdict["icon-esm-er.eerie-control-1950.v20231106.atmos.native.2d_daily_mean"]
ds
[6]:
<xarray.Dataset>
Dimensions:             (ncells: 5242880, time: 13879, height: 1, height_2: 1)
Coordinates:
    cell_sea_land_mask  (ncells) int32 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * height              (height) float64 2.0
  * height_2            (height_2) float64 10.0
    lat                 (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
    lon                 (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * time                (time) datetime64[ns] 2002-01-01T23:59:00 ... 2039-12...
Dimensions without coordinates: ncells
Data variables: (12/17)
    clt                 (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    dew2                (time, height, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    evspsbl             (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    hfls                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    hfss                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    pr                  (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    ...                  ...
    rsds                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    rsus                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    sfcwind             (time, height_2, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    tas                 (time, height, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    uas                 (time, height_2, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    vas                 (time, height_2, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
Attributes: (12/36)
    CDI:                      Climate Data Interface version 2.2.0 (https://m...
    Conventions:              CF-1.6
    DOKU_License:             CC BY 4.0
    DOKU_Name:                EERIE ICON-ESM-ER eerie-control-1950 run
    DOKU_authors:             Putrasahan, D.; Kröger, J.; Wachsmann, F.
    DOKU_responsible_person:  Fabian Wachsmann
    ...                       ...
    time_min:                 2002-01-01T23:59:00.000000000
    time_reduction:           mean
    title:                    ICON simulation
    uuidOfHGrid:              5aff0578-9bd9-11e8-8e4a-af3d880818e6
    _catalog_id:              icon-esm-er.eerie-control-1950.v20231106.atmos....
    creation_date:            2024-07-16T09:14:16Z

Download the data#

Note that eerie.cloud is not a high-performance data server; it is meant for interactive work. DO NOT use it for large retrievals of high-volume data. Thanks!

  1. With zarr:

[7]:
import hvplot.xarray

towrite = ds["tas"].isel(time=slice(200, 250))
allbytes = towrite.nbytes / 1024 / 1024
print(allbytes, " MB to write")
1000.0  MB to write
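
The printed size can be checked by hand from the shape of the selection. The short calculation below uses the dimensions shown in the dataset representation above. The variable names are chosen for this illustration only. Because the code divides by 1024 twice, the "MB" here are strictly mebibytes.

```python
n_time = 50          # time steps selected with slice(200, 250)
n_height = 1         # single 2 m level
n_cells = 5_242_880  # ncells of this dataset
bytes_per_value = 4  # float32

size_mib = n_time * n_height * n_cells * bytes_per_value / 1024 / 1024
print(size_mib)  # 1000.0
```
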
[8]:
import time

filename = "/work/bm0021/k204210/tets2.zarr"

start = time.time()
towrite.to_zarr(filename, mode="w", compute=True, consolidated=True)
end = time.time()
# Throughput in MB/s
print(allbytes / (end - start))
110.78050460428145

Calculate and plot an example#

In the following, we

  1. select Hamburg

  2. plot the data

[9]:
import xarray as xr
import numpy as np

tas = xr.open_zarr("/work/bm0021/k204210/tets2.zarr").drop_vars("cell_sea_land_mask")
[10]:
tas
[10]:
<xarray.Dataset>
Dimensions:  (height: 1, ncells: 5242880, time: 50)
Coordinates:
  * height   (height) float64 2.0
    lat      (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
    lon      (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * time     (time) datetime64[ns] 2002-07-20T23:59:00 ... 2002-09-07T23:59:00
Dimensions without coordinates: ncells
Data variables:
    tas      (time, height, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
[11]:
tas["lat"] = np.rad2deg(tas["lat"])
tas["lat"].attrs["units"] = "degrees"
tas["lon"] = np.rad2deg(tas["lon"])
tas["lon"].attrs["units"] = "degrees"

For Hamburg, we use the first cell we find inside a small box around the city (53°N to 54°N, 9.5°E to 11°E):

[12]:
tas2 = tas.isel(time=0).drop_vars(["time", "height"])
tas2["gridsel"] = xr.where(
    ((tas2.lat > 53) & (tas2.lat < 54) & (tas2.lon > 9.5) & (tas2.lon < 11)),
    1,
    np.nan,
).compute()
[13]:
cell = tas2["gridsel"].argmax().values[()]
cell
[13]:
1064520
[14]:
tas_hamburg = tas.isel(ncells=cell)
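
The selection logic (convert radians to degrees, test against a box, take the first match) can also be followed in plain Python on a handful of cells. The coordinates below are hypothetical and not taken from the ICON grid, and `first_cell_in_box` is a name made up for this sketch.

```python
import math

# Hypothetical cell centres in radians as (lat, lon) pairs.
cells_rad = [
    (math.radians(48.1), math.radians(11.6)),  # outside the box
    (math.radians(53.6), math.radians(10.0)),  # inside the box
    (math.radians(53.9), math.radians(10.7)),  # inside the box, but later
]


def first_cell_in_box(cells, lat_min, lat_max, lon_min, lon_max):
    """Return the index of the first cell whose centre lies inside the box."""
    for index, (lat_rad, lon_rad) in enumerate(cells):
        lat = math.degrees(lat_rad)
        lon = math.degrees(lon_rad)
        if lat_min < lat < lat_max and lon_min < lon < lon_max:
            return index
    raise ValueError("no cell inside the requested box")


cell_index = first_cell_in_box(cells_rad, 53, 54, 9.5, 11)
print(cell_index)  # 1
```

On the real grid the xarray version is preferable, because it evaluates the mask for all cells at once instead of looping in Python.
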

The hvplot library can be used to generate interactive plots.

[15]:
import hvplot.xarray

tas_hamburg.squeeze().hvplot.line()
[15]:
  2. With cdo:

  • Use cdo >= 2.4.0 to get the expected behaviour of opendap access.

  • Use cdo sinfo and cdo select for all retrievals to get only the subset you need.

  • Opendap access is serial. You may run up to 4 cdo processes against the server at the same time, similar to the dask client.

[16]:
url = "https://eerie.cloud.dkrz.de/datasets"
dataset_id = "icon-esm-er.eerie-control-1950.v20231106.atmos.native.2d_daily_min"
opendap_endpoint = "/".join([url, dataset_id, "opendap"])
filename = "/work/bm0021/k204210/tets2.nc"
# Remove a previous output file, if any (-f: no error if it does not exist)
!rm -f {filename}

start = time.time()
!cdo select,timestep=1/50 {opendap_endpoint} {filename}
end = time.time()
# Throughput in MB/s
print(allbytes / (end - start))
cdo    select: Processed 262144000 values from 1 variable over 51 timesteps [43.85s 6418MB]
22.009078238926783
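
Since opendap access is serial but up to 4 cdo processes are allowed, one option is to split a retrieval into disjoint timestep ranges and run one cdo process per range. The sketch below is an assumption about how this could be organised, not a documented workflow of the service. All function names are hypothetical, and only the `select,timestep=first/last` syntax is taken from the cell above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def timestep_ranges(first, last, n_parts):
    """Split the inclusive range first..last into up to n_parts contiguous ranges."""
    total = last - first + 1
    if total <= 0 or n_parts <= 0:
        raise ValueError("need last >= first and n_parts >= 1")
    n_parts = min(n_parts, total)
    base, extra = divmod(total, n_parts)
    ranges, start = [], first
    for part in range(n_parts):
        length = base + (1 if part < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def cdo_select_command(endpoint, outfile, first, last):
    """Build the argument list for `cdo select,timestep=first/last`."""
    return ["cdo", f"select,timestep={first}/{last}", endpoint, outfile]


def retrieve_in_parallel(endpoint, out_template, first, last, max_processes=4):
    """Run one cdo process per timestep range, at most max_processes at once.

    out_template must contain "{part}", for example "subset_{part}.nc".
    """
    jobs = [
        cdo_select_command(endpoint, out_template.format(part=i), lo, hi)
        for i, (lo, hi) in enumerate(timestep_ranges(first, last, max_processes))
    ]
    with ThreadPoolExecutor(max_workers=max_processes) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd, check=True), jobs))


# Only the splitting is shown here; no cdo process is started.
print(timestep_ranges(1, 50, 4))
```

Calling `retrieve_in_parallel(opendap_endpoint, "subset_{part}.nc", 1, 50)` would then write four files, one per timestep range.
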