EERIE data from everywhere#

Access EERIE data via eerie.cloud.dkrz.de#

EERIE data from ICON and IFS-FESOM2 is made accessible via a fast-lane web service at eerie.cloud.dkrz.de, enabled by the python package xpublish. All API endpoints can be explored via the automatic documentation. The web service provides a live view of all EERIE experiments run and stored at DKRZ.

All available datasets are listed here. For each dataset,

  • a zarr endpoint is provided which can be directly opened with xarray,

  • an opendap endpoint is provided which can be directly opened with cdo,

  • an entry is generated in the server's synced intake catalog.
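
For instance, the list of dataset ids can also be retrieved programmatically. The following is a minimal sketch that assumes the /datasets endpoint returns a JSON list of ids (xpublish's default REST behaviour) and that the requests package is installed:

import requests

# Retrieve the ids of all datasets served by eerie.cloud
dataset_ids = requests.get("https://eerie.cloud.dkrz.de/datasets").json()
print(len(dataset_ids), "datasets available")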

The endpoints can be constructed with the following template:

url = "https://eerie.cloud.dkrz.de/datasets"
dataset_id = "icon-esm-er.eerie-control-1950.atmos.native.2d_daily_min"

zarr_endpoint = '/'.join([url,dataset_id,"zarr"])
opendap_endpoint = '/'.join([url,dataset_id,"opendap"])

print(zarr_endpoint)
print(opendap_endpoint)
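
As a minimal sketch (assuming zarr, aiohttp and dask are installed), the zarr endpoint can then be opened directly with xarray:

import xarray as xr

# Open the zarr endpoint over HTTP; chunks={} gives lazy dask arrays
ds = xr.open_dataset(zarr_endpoint, engine="zarr", chunks={})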

The server’s intake catalog is available at https://eerie.cloud.dkrz.de/intake.yaml and can also be opened from a script for all datasets on the server.
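
A minimal sketch of opening this catalog from a script (the variable name server_cat is only illustrative):

import intake

# Open the intake catalog generated by eerie.cloud itself
server_cat = intake.open_catalog("https://eerie.cloud.dkrz.de/intake.yaml")
print(list(server_cat)[:3])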


Xpublish allows plugins and server-side processing to be added to the server. For uncompressed datasets on the native grid, we offer a precision-trimmed version which reduces the storage required for saving the dataset at another location without significantly affecting analysis results.

To access the data and run this notebook, you need a kernel / environment with the following packages installed:

- intake-xarray
- intake
- zarr
- dask
- hvplot # for plotting
- aiohttp
- requests

Before we do anything with the data, we have to set up a dask client. The eerie.cloud server allows 4 concurrent threads per user because the application on the server is very memory-intensive. We therefore open a client with 4 threads in total:

[1]:
# client.close()
from distributed import Client

# 2 workers x 2 threads = 4 concurrent threads, matching the per-user limit
client = Client(n_workers=2, threads_per_worker=2)
client
[1]:
Client: Client-8bfabe11-d54f-11ee-a9a9-080038c049a7
Connection method: Cluster object    Cluster type: distributed.LocalCluster
Dashboard: /proxy/8787/status

Using intake for loading#

See this notebook for how to use intake to browse and load data sources.

[2]:
import intake

eerie_cat = intake.open_catalog(
    "https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml"
)
cat = eerie_cat["dkrz.cloud"]

We can list all datasets available in the catalog with list:

[3]:
all_dkrz_datasets = list(cat)
print(all_dkrz_datasets[0])
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.gr025.daily
  • datasets with the substring ‘native’ are lossily compressed on the fly with bitrounding to reduce traffic

  • datasets with the suffix ‘gr025’ can also be retrieved directly via kerchunk, so that eerie.cloud is only used as a data server (see the snippet after this list)
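
Based on this naming convention, the dataset list can be split by grid, for example (variable names are only illustrative):

# Split the catalog's dataset ids by grid type
native_datasets = [name for name in all_dkrz_datasets if "native" in name]
gr025_datasets = [name for name in all_dkrz_datasets if "gr025" in name]
print(len(native_datasets), "native datasets,", len(gr025_datasets), "gr025 datasets")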

We can get the asset of a dataset (i.e. the actual data source that can be opened, described as a catalog entry) by using its name as a key in the cat dictionary:

[4]:
dset = "icon-esm-er.eerie-control-1950.atmos.native.2d_daily_min"
icon_control_atm_2d_1mth_mean = cat[dset]
[5]:
list(icon_control_atm_2d_1mth_mean)
[5]:
['icon-esm-er.eerie-control-1950.atmos.native.2d_daily_min-zarr']

Opening the dataset#

Each intake-xarray asset can be opened with the to_dask function:

[6]:
ds = icon_control_atm_2d_1mth_mean[dset + "-zarr"].to_dask()
[7]:
ds
[7]:
<xarray.Dataset>
Dimensions:             (ncells: 5242880, height: 1, time: 11323)
Coordinates:
    cell_sea_land_mask  (ncells) int32 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * height              (height) float64 2.0
    lat                 (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
    lon                 (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * time                (time) datetime64[ns] 2008-12-31T23:59:00 ... 2039-12...
Dimensions without coordinates: ncells
Data variables:
    tas                 (time, height, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
Attributes: (12/34)
    CDI:                      Climate Data Interface version 2.2.0 (https://m...
    Conventions:              CF-1.6
    DOKU_License:             CC BY 4.0
    DOKU_Name:                EERIE ICON-ESM-ER eerie-control-1950 run
    DOKU_authors:             Putrasahan, D.; Kröger, J.; Wachsmann, F.
    DOKU_responsible_person:  Fabian Wachsmann
    ...                       ...
    source_type:              AOGCM
    time_max:                 4776479
    time_min:                 3682079
    time_reduction:           min
    title:                    ICON simulation
    uuidOfHGrid:              5aff0578-9bd9-11e8-8e4a-af3d880818e6

Download the data#

Note that eerie.cloud is not a high-performance data server but is rather meant for interactive work. DO NOT use it for large retrievals of high-volume data. Thanks!

  1. With zarr:

[8]:
import hvplot.xarray

towrite = ds["tas"].isel(time=slice(200, 250))
allbytes = towrite.nbytes / 1024 / 1024
print(allbytes, " MB to write")
1000.0  MB to write
[9]:
filename = "/work/bm0021/k204210/tets2.zarr"
!rm -r {filename}
import time

start = time.time()
towrite.to_zarr(filename, mode="w", compute=True, consolidated=True)
end = time.time()
print(allbytes / (end - start))  # write rate in MB/s
# temp.isel(time=0).hvplot.image(x="lon",y="lat",)
320.02920188335054
  2. With cdo:

  • better use cdo >= 2.4.0 to get the expected behaviour of the opendap access

  • use cdo sinfo and cdo select for all retrievals to get only the subset you need (see the sketch after this list)

  • opendap access is serial. You are allowed to run up to 4 cdo processes against the server at the same time, analogous to the dask client.
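
For example, the structure of a remote dataset can be inspected before any retrieval. This is a minimal sketch that assumes cdo is available in the notebook environment and uses the opendap endpoint of the dataset from above:

# Print a short summary (variables, grid, timesteps) of the remote dataset
!cdo sinfo https://eerie.cloud.dkrz.de/datasets/icon-esm-er.eerie-control-1950.atmos.native.2d_daily_min/opendap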

[10]:
url = "https://eerie.cloud.dkrz.de/datasets"
dataset_id = "icon-esm-er.eerie-control-1950.atmos.native.2d_daily_min"
opendap_endpoint = "/".join([url, dataset_id, "opendap"])
filename = "/work/bm0021/k204210/tets2.nc"
!rm {filename}

start = time.time()
!cdo select,timestep=1/50 {opendap_endpoint} {filename}
end = time.time()
print(allbytes / (end - start))  # approximate retrieval rate in MB/s
rm: cannot remove '/work/bm0021/k204210/tets2.nc': No such file or directory
cdo    select: Processed 262144000 values from 1 variable over 51 timesteps [36.49s 281MB]
27.135707294094843

Calculate and plot an example#

In the following, we

  1. select Hamburg

  2. plot the data

[11]:
import xarray as xr

tasmin = xr.open_zarr("/work/bm0021/k204210/tets2.zarr").drop("cell_sea_land_mask")

For Hamburg, we use the first cell we find that lies at about 53°N, 9.9°E:

[12]:
import numpy as np

tasmin1 = tasmin.copy()
tasmin1["lat"] = np.rad2deg(tasmin["lat"])
tasmin1["lat"].attrs["units"] = "degrees"
tasmin1["lon"] = np.rad2deg(tasmin["lon"])
tasmin1["lon"].attrs["units"] = "degrees"
[13]:
tasmin2 = tasmin1.isel(time=0).drop("time")
tasmin2["gridsel"] = xr.where(
    (
        (tasmin2.lat > 53)
        & (tasmin2.lat < 54)
        & (tasmin2.lon > 9.8)
        & (tasmin2.lon < 10)
    ),
    1,
    np.nan,
).compute()
[14]:
cell = tasmin2["gridsel"].argmax().values[()]
cell
[14]:
1064546
[15]:
tasmin_hamburg = tasmin1.isel(ncells=cell)

The hvplot library can be used to generate interactive plots.

[16]:
import hvplot.xarray

tasmin_hamburg.squeeze().hvplot.line()
[16]:
(interactive hvplot line plot of the tas time series at the selected Hamburg cell)