The EERIE cloud#
Access EERIE and nextGEMS data via a data server from anywhere#
This notebook is aimed at data users who cannot work next to the data on DKRZ's high-performance computer Levante.
Data from ICON and IFS-FESOM2 is made accessible via a fast lane web service under eerie.cloud.dkrz.de. This is enabled by the Python package xpublish. All API endpoints can be explored in the automatically generated documentation.
The EERIE Cloud hosts different endpoints for many cloudified datasets. All available datasets are listed here. The service is described in detail here. This notebook focuses on the Python-relevant endpoints.
For every dataset, there are two main types of endpoints:
- A dataset view
- Data access endpoints
You can construct both using these templates:
url = "https://eerie.cloud.dkrz.de/datasets"
dataset_id = "ifs-fesom2-sr.eerie-spinup-1950.v20240304.ocean.gr025.monthly"
dataset_view = '/'.join([url, dataset_id]) + "/"
zarr_endpoint = '/'.join([url,dataset_id,"zarr"])
kerchunk_endpoint = '/'.join([url,dataset_id,"kerchunk"])
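The templates above can be wrapped in a small helper. This is only a minimal sketch: the function name `build_endpoints` is hypothetical and not part of xpublish or the EERIE cloud; only the URL layout is taken from the templates above, and the code does not contact the server.

```python
BASE_URL = "https://eerie.cloud.dkrz.de/datasets"


def build_endpoints(dataset_id, base_url=BASE_URL):
    """Return the dataset view and data access URLs for one dataset ID.

    Hypothetical helper: it only joins strings following the templates
    above and never contacts the server.
    """
    return {
        "dataset_view": "/".join([base_url, dataset_id]) + "/",
        "zarr_endpoint": "/".join([base_url, dataset_id, "zarr"]),
        "kerchunk_endpoint": "/".join([base_url, dataset_id, "kerchunk"]),
    }


endpoints = build_endpoints(
    "ifs-fesom2-sr.eerie-spinup-1950.v20240304.ocean.gr025.monthly"
)
for name, endpoint in endpoints.items():
    print(name, endpoint)
```

Assuming xarray, zarr and aiohttp are installed, the resulting `zarr_endpoint` string can typically be opened lazily with `xarray.open_dataset(endpoints["zarr_endpoint"], engine="zarr", chunks={})`.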
[1]:
url_list_of_datasets = "https://eerie.cloud.dkrz.de/datasets"
import json
import fsspec as fs
import pandas as pd
from IPython.display import HTML
import panel as pn
from bokeh.models import HTMLTemplateFormatter
bokeh_formatters = {
    "dataset_view": HTMLTemplateFormatter(template="<code><%= value %></code>")
}
pn.extension("tabulator")
list_of_datasets = json.load(fs.open(url_list_of_datasets).open())
dict_of_datasets = {}
for dsid in list_of_datasets:
    dict_of_datasets[dsid] = {
        "dataset_view": '<a href="'
        + "/".join([url_list_of_datasets, dsid])
        + '/" '
        + 'target="_blank">'
        + dsid
        + "</a>",
        "zarr_endpoint": "/".join([url_list_of_datasets, dsid, "zarr"]),
    }
df = pd.DataFrame(dict_of_datasets).transpose()
tabu = pn.widgets.Tabulator(
    df,
    show_index=False,
    header_filters=True,
    selectable=1,
    pagination="local",
    formatters=bokeh_formatters,
)
tabu
To successfully run this notebook, you need a kernel / environment with the following packages installed:
- intake-xarray
- intake
- zarr
- dask
- hvplot # for plotting
- aiohttp
- requests
Before we do anything with the data, we should set up a dask client that limits the number of concurrent threads, because the EERIE cloud service is very memory intensive. Here, we open a client with 4 threads (2 workers with 2 threads each):
[2]:
# client.close()
from distributed import Client
client = Client(n_workers=2, threads_per_worker=2)
client
[2]:
Client
Client-b0ff4286-4361-11ef-8839-080038c03f2f
Connection method: Cluster object | Cluster type: distributed.LocalCluster |
Dashboard: /proxy/8787/status |
Cluster Info
LocalCluster
a4387be7
Dashboard: /proxy/8787/status | Workers: 2 |
Total threads: 4 | Total memory: 250.00 GiB |
Status: running | Using processes: True |
Scheduler Info
Scheduler
Scheduler-c41425e4-1ada-409d-b963-61518f914bd8
Comm: tcp://127.0.0.1:39921 | Workers: 2 |
Dashboard: /proxy/8787/status | Total threads: 4 |
Started: Just now | Total memory: 250.00 GiB |
Workers
Worker: 0
Comm: tcp://127.0.0.1:38871 | Total threads: 2 |
Dashboard: /proxy/37089/status | Memory: 125.00 GiB |
Nanny: tcp://127.0.0.1:45783 | |
Local directory: /tmp/dask-worker-space/worker-qc9gen34 |
Worker: 1
Comm: tcp://127.0.0.1:43519 | Total threads: 2 |
Dashboard: /proxy/41737/status | Memory: 125.00 GiB |
Nanny: tcp://127.0.0.1:41091 | |
Local directory: /tmp/dask-worker-space/worker-h07bok_f |
Data access endpoints#
Zarr: Zarr endpoints deliver data that is processed on the server side with dask. The resulting data is
- rechunked into dask-optimal chunk sizes, larger than about 10 MB
- compressed with blosc by default
Kerchunk: Data is hosted as is. Chunks that are stored as references are streamed to the client, so that the format seen by the client is pure zarr. Original zarr data is simply passed through to the user.
You should use Kerchunk endpoints by default because they reduce the load on the server. Prefer zarr if
- the original data is uncompressed, to reduce the data volume to be transferred
- the original data is compressed with complicated compression methods that require additional software on the client side
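The rule of thumb above can be written down as a small decision function. This is a sketch only: `choose_access_method` and its two arguments are hypothetical, and whether the original data is compressed or needs a special codec is something you have to know about the dataset yourself.

```python
def choose_access_method(original_is_compressed, needs_special_codec=False):
    """Pick "kerchunk" or "zarr" following the rule of thumb above.

    Both arguments are assumptions supplied by the user; this hypothetical
    helper does not query the EERIE cloud.
    """
    if not original_is_compressed:
        # The zarr endpoint compresses on the server, so less data is transferred.
        return "zarr"
    if needs_special_codec:
        # The zarr endpoint re-encodes the data, so no extra decompression
        # software is needed on the client.
        return "zarr"
    # Default: kerchunk passes the original chunks through and keeps server load low.
    return "kerchunk"


print(choose_access_method(original_is_compressed=True))
print(choose_access_method(original_is_compressed=False))
print(choose_access_method(original_is_compressed=True, needs_special_codec=True))
```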
Data access with intake#
We recommend using intake to open and load the available data. The EERIE cloud has an endpoint for an intake catalog that is synced with the available datasets. See this notebook for how to use intake to browse and load data sources.
[3]:
import intake
cat = intake.open_catalog("https://eerie.cloud.dkrz.de/intake.yaml")
We can list all datasets available in the catalog with list:
[4]:
all_dkrz_datasets = list(cat)
print(all_dkrz_datasets[0])
ICON.ngc4008.P1D_0
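Since the catalog listing is a plain Python list of dot-separated ID strings, it can be narrowed down with ordinary string matching before any dataset is opened. The following is a minimal sketch; `filter_dataset_ids` is a hypothetical helper, and `example_ids` is a stand-in for `list(cat)` built from IDs that appear in this notebook.

```python
def filter_dataset_ids(dataset_ids, *keywords):
    """Return the IDs that contain every keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        dsid
        for dsid in dataset_ids
        if all(keyword in dsid.lower() for keyword in lowered)
    ]


# Stand-in for list(cat); with a live catalog you would pass all_dkrz_datasets.
example_ids = [
    "ICON.ngc4008.P1D_0",
    "icon-esm-er.eerie-control-1950.v20231106.atmos.native.2d_daily_mean",
    "ifs-nemo.eerie-control-1950.ocean.native.monthly",
]

print(filter_dataset_ids(example_ids, "eerie-control-1950", "ocean"))
```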
We can just open all datasets and put them in a dictionary, using the kerchunk policy introduced before:
[5]:
from tqdm import tqdm
nbytes = 0
dsdict = {}
for dsid in tqdm(all_dkrz_datasets):
    try:
        # Prefer the kerchunk entry if the dataset offers one, otherwise use zarr
        if any("kerchunk" in access_method for access_method in list(cat[dsid])):
            dsdict[dsid] = cat[dsid][dsid + "-kerchunk"].to_dask()
        else:
            dsdict[dsid] = cat[dsid][dsid + "-zarr"].to_dask()
        nbytes += dsdict[dsid].nbytes
    except Exception:
        # Print the IDs of datasets that could not be opened
        print(dsid)
24%|██▍ | 42/176 [03:03<22:11, 9.94s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_daily_center.0
24%|██▍ | 43/176 [03:04<15:44, 7.10s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_daily_edge
25%|██▌ | 44/176 [03:04<11:16, 5.12s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_monthly_aermon
26%|██▌ | 45/176 [03:05<08:09, 3.74s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_monthly_amon_center
26%|██▌ | 46/176 [03:05<05:59, 2.77s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_monthly_amon_edge
27%|██▋ | 47/176 [03:06<04:29, 2.09s/it]
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.native.atmos_monthly_emon
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'siconc' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'sithick' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'siu' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'siv' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
27%|██▋ | 48/176 [03:07<03:36, 1.69s/it]/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'mlotst10' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'sosabs' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'toscon' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'zos' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
28%|██▊ | 50/176 [03:07<02:12, 1.05s/it]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.gr025.daily.0
29%|██▉ | 51/176 [03:08<01:50, 1.13it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_daily_center.0
30%|██▉ | 52/176 [03:08<01:35, 1.30it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_daily_edge
30%|███ | 53/176 [03:09<01:24, 1.45it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_monthly_aermon
31%|███ | 54/176 [03:09<01:17, 1.58it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_monthly_amon_center
31%|███▏ | 55/176 [03:10<01:11, 1.69it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_monthly_amon_edge
32%|███▏ | 56/176 [03:10<01:07, 1.77it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.atmos.native.atmos_monthly_emon
32%|███▏ | 57/176 [03:11<01:04, 1.83it/s]
hadgem3-gc5-n640-orca12.eerie-picontrol.ocean.gr025.daily.0
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'mlotst10' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'sosabs' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'toscon' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/conventions.py:427: SerializationWarning: variable 'zos' has multiple fill values {1e+20, 1e+20}, decoding all values to NaN.
new_vars[k] = decode_cf_variable(
51%|█████ | 89/176 [03:38<00:57, 1.50it/s]
icon-esm-er.eerie-control-1950.v20231106.ocean.native.2d_grid
62%|██████▏ | 109/176 [03:59<00:58, 1.15it/s]
icon-esm-er.eerie-spinup-1950.ocean.native.2d_monthly_mean.0
98%|█████████▊| 173/176 [04:42<00:02, 1.22it/s]
ifs-nemo.eerie-control-1950.atmos.native.daily
99%|█████████▉| 174/176 [04:42<00:01, 1.39it/s]
ifs-nemo.eerie-control-1950.ocean.native.daily
99%|█████████▉| 175/176 [04:43<00:00, 1.53it/s]
ifs-nemo.eerie-control-1950.ocean.native.daily_ice
100%|██████████| 176/176 [04:43<00:00, 1.61s/it]
ifs-nemo.eerie-control-1950.ocean.native.monthly
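The loop accumulates the total size of all opened datasets in `nbytes` but does not report it. A minimal sketch for printing such a byte count in readable units follows; `format_nbytes` is a hypothetical helper (dask ships a similar `dask.utils.format_bytes`), and the number passed in is an illustrative value, not the real catalog total.

```python
def format_nbytes(nbytes):
    """Format a byte count with binary prefixes (1 KiB = 1024 B)."""
    size = float(nbytes)
    for unit in ["B", "KiB", "MiB", "GiB", "TiB"]:
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    # Anything that is still >= 1024 TiB is reported in PiB.
    return f"{size:.2f} PiB"


# Illustrative value only; in the notebook you would pass the accumulated nbytes.
print(format_nbytes(1_048_576_000))
```

Note that `nbytes` of an xarray dataset is the uncompressed in-memory size, not the volume that would be transferred over the network.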
[6]:
ds = dsdict["icon-esm-er.eerie-control-1950.v20231106.atmos.native.2d_daily_mean"]
ds
[6]:
<xarray.Dataset>
Dimensions:             (ncells: 5242880, time: 13879, height: 1, height_2: 1)
Coordinates:
    cell_sea_land_mask  (ncells) int32 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * height              (height) float64 2.0
  * height_2            (height_2) float64 10.0
    lat                 (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
    lon                 (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * time                (time) datetime64[ns] 2002-01-01T23:59:00 ... 2039-12...
Dimensions without coordinates: ncells
Data variables: (12/17)
    clt                 (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    dew2                (time, height, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    evspsbl             (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    hfls                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    hfss                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    pr                  (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    ...                  ...
    rsds                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    rsus                (time, ncells) float32 dask.array<chunksize=(4, 3506176), meta=np.ndarray>
    sfcwind             (time, height_2, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    tas                 (time, height, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    uas                 (time, height_2, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
    vas                 (time, height_2, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
Attributes: (12/36)
    CDI:                      Climate Data Interface version 2.2.0 (https://m...
    Conventions:              CF-1.6
    DOKU_License:             CC BY 4.0
    DOKU_Name:                EERIE ICON-ESM-ER eerie-control-1950 run
    DOKU_authors:             Putrasahan, D.; Kröger, J.; Wachsmann, F.
    DOKU_responsible_person:  Fabian Wachsmann
    ...                       ...
    time_min:                 2002-01-01T23:59:00.000000000
    time_reduction:           mean
    title:                    ICON simulation
    uuidOfHGrid:              5aff0578-9bd9-11e8-8e4a-af3d880818e6
    _catalog_id:              icon-esm-er.eerie-control-1950.v20231106.atmos....
    creation_date:            2024-07-16T09:14:16Z
Download the data#
Note that eerie.cloud is not a high-performance data server; it is intended for interactive work. DO NOT use it for large retrievals of high-volume data. Thanks!
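To judge whether a retrieval is still reasonable, the uncompressed size of a selection can be estimated from its shape and data type before anything is transferred. The sketch below uses the numbers of the dataset shown above (a float32 variable with 5242880 cells and one height level) for an assumed selection of 50 time steps; `estimate_size_mib` and the threshold `MAX_INTERACTIVE_MIB` are hypothetical and not an official limit of the service.

```python
import math


def estimate_size_mib(shape, itemsize):
    """Uncompressed size in MiB of an array with this shape and bytes per element."""
    return math.prod(shape) * itemsize / 1024 / 1024


# (time, height, ncells) = (50, 1, 5242880), float32 has 4 bytes per element.
size_mib = estimate_size_mib((50, 1, 5242880), itemsize=4)
print(size_mib, "MiB")

# Hypothetical threshold; choose one that fits your connection and use case.
MAX_INTERACTIVE_MIB = 2048
if size_mib > MAX_INTERACTIVE_MIB:
    print("Selection is large; consider a smaller subset.")
```

For an xarray object the same number is available as `nbytes / 1024 / 1024` without loading any data.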
With zarr:
[7]:
import hvplot.xarray
towrite = ds["tas"].isel(time=slice(200, 250))
allbytes = towrite.nbytes / 1024 / 1024
print(allbytes, " MB to write")