EERIE data from everywhere#
Access EERIE data via eerie.cloud.dkrz.de#
EERIE data from ICON and IFS-FESOM2 is made accessible via a fast-lane web service at eerie.cloud.dkrz.de, enabled by the python package xpublish. All API endpoints can be explored in the automatic documentation. The web service provides a live view of all EERIE experiments run and stored at DKRZ.
All available datasets are listed here. For each dataset,
- a zarr endpoint is provided, which can be opened directly with xarray,
- an OPeNDAP endpoint is provided, which can be opened directly with cdo,
- an entry in the synced intake catalog of the server is generated.
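The list of available dataset ids can also be queried programmatically. A minimal sketch, assuming the server is reachable and that, as is standard for xpublish, a GET request to /datasets returns a JSON list of dataset ids:

```python
# Query the list of dataset ids served by eerie.cloud.
# Assumes the standard xpublish JSON response under /datasets.
import requests

ids = requests.get("https://eerie.cloud.dkrz.de/datasets").json()
print(len(ids), "datasets available, e.g.", ids[0])
```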
The endpoints can be constructed with the template
url = "https://eerie.cloud.dkrz.de/datasets"
dataset_id = "icon-esm-er.eerie-control-1950.atmos.native.2d_daily_min"
zarr_endpoint = '/'.join([url,dataset_id,"zarr"])
opendap_endpoint = '/'.join([url,dataset_id,"opendap"])
print(zarr_endpoint)
print(opendap_endpoint)
The server’s intake catalog is available at https://eerie.cloud.dkrz.de/intake.yaml and can also be opened from a script.
Xpublish allows plugins to be added to the server as well as server-side processing. For uncompressed datasets on the native grid, we offer a precision-trimmed version which reduces the storage required for saving the dataset at another location while not significantly affecting the results of analyses.
To access the data and run this notebook, you need a kernel / environment with the following packages installed:
- intake-xarray
- intake
- zarr
- dask
- hvplot # for plotting
- aiohttp
- requests
Before we do anything with the data, we have to set up a dask client. The eerie.cloud server allows 4 concurrent threads per user because the application on the server is very memory-intensive. We therefore open a client with 4 threads in total:
[1]:
# client.close()
from distributed import Client
client = Client(n_workers=2, threads_per_worker=2)
client
[1]:
Client: LocalCluster with 2 workers (4 threads in total, 250.00 GiB memory); dashboard at /proxy/8787/status
Using intake for loading#
See this notebook for how to use intake to browse and load data sources.
[2]:
import intake
eerie_cat = intake.open_catalog(
"https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml"
)
cat = eerie_cat["dkrz.cloud"]
We can list all datasets available in the catalog with list:
[3]:
all_dkrz_datasets = list(cat)
print(all_dkrz_datasets[0])
hadgem3-gc5-n216-orca025.eerie-picontrol.atmos.gr025.daily
- Datasets with the substring ‘native’ are lossy-compressed on the fly with bitrounding to reduce traffic.
- Datasets with the suffix ‘gr025’ can also be retrieved directly via kerchunk, so that eerie.cloud is only used as a data server.
We can get the asset of a dataset (i.e. the actual file that can be opened, described as a catalog entry) by using it as a key in the cat dictionary:
[4]:
dset = "icon-esm-er.eerie-control-1950.atmos.native.2d_daily_min"
icon_control_atm_2d_1mth_mean = cat[dset]
[5]:
list(icon_control_atm_2d_1mth_mean)
[5]:
['icon-esm-er.eerie-control-1950.atmos.native.2d_daily_min-zarr']
Opening the dataset#
Each intake-xarray asset can be opened with the to_dask function:
[6]:
ds = icon_control_atm_2d_1mth_mean[dset + "-zarr"].to_dask()
[7]:
ds
[7]:
<xarray.Dataset>
Dimensions:             (ncells: 5242880, height: 1, time: 11323)
Coordinates:
    cell_sea_land_mask  (ncells) int32 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * height              (height) float64 2.0
    lat                 (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
    lon                 (ncells) float64 dask.array<chunksize=(5242880,), meta=np.ndarray>
  * time                (time) datetime64[ns] 2008-12-31T23:59:00 ... 2039-12...
Dimensions without coordinates: ncells
Data variables:
    tas                 (time, height, ncells) float32 dask.array<chunksize=(4, 1, 5242880), meta=np.ndarray>
Attributes: (12/34)
    CDI:                      Climate Data Interface version 2.2.0 (https://m...
    Conventions:              CF-1.6
    DOKU_License:             CC BY 4.0
    DOKU_Name:                EERIE ICON-ESM-ER eerie-control-1950 run
    DOKU_authors:             Putrasahan, D.; Kröger, J.; Wachsmann, F.
    DOKU_responsible_person:  Fabian Wachsmann
    ...                       ...
    source_type:              AOGCM
    time_max:                 4776479
    time_min:                 3682079
    time_reduction:           min
    title:                    ICON simulation
    uuidOfHGrid:              5aff0578-9bd9-11e8-8e4a-af3d880818e6
Download the data#
Note that eerie.cloud is not a high-performance data server but is meant for interactive work. DO NOT use it for large retrievals of high-volume data. Thanks!
With zarr:
[8]:
import hvplot.xarray
towrite = ds["tas"].isel(time=slice(200, 250))
allbytes = towrite.nbytes / 1024 / 1024
print(allbytes, " MB to write")
1000.0 MB to write
[9]:
filename = "/work/bm0021/k204210/tets2.zarr"
!rm -r {filename}
import time
start = time.time()
towrite.to_zarr(filename, mode="w", compute=True, consolidated=True)
end = time.time()
print(allbytes / (end - start))
# temp.isel(time=0).hvplot.image(x="lon",y="lat",)
320.02920188335054
With cdo:
- Better use cdo >= 2.4.0 to get the expected behaviour of OPeNDAP access.
- Use cdo sinfo and cdo select for all retrievals to get the subset you need.
- OPeNDAP access is serial. You are allowed to submit up to 4 cdo processes to the server at the same time, analogous to the dask client.
[10]:
url = "https://eerie.cloud.dkrz.de/datasets"
dataset_id = "icon-esm-er.eerie-control-1950.atmos.native.2d_daily_min"
opendap_endpoint = "/".join([url, dataset_id, "opendap"])
filename = "/work/bm0021/k204210/tets2.nc"
!rm {filename}
start = time.time()
!cdo select,timestep=1/50 {opendap_endpoint} {filename}
end = time.time()
print(allbytes / (end - start))
rm: cannot remove '/work/bm0021/k204210/tets2.nc': No such file or directory
cdo select: Processed 262144000 values from 1 variable over 51 timesteps [36.49s 281MB]
27.135707294094843
Calculate and plot an example#
In the following, we
- select Hamburg,
- plot the data.
[11]:
import xarray as xr
tasmin = xr.open_zarr("/work/bm0021/k204210/tets2.zarr").drop_vars("cell_sea_land_mask")
For Hamburg, we use the first cell we find that lies at about 53°N and 9.9°E:
[12]:
import numpy as np
tasmin1 = tasmin.copy()
tasmin1["lat"] = np.rad2deg(tasmin["lat"])
tasmin1["lat"].attrs["units"] = "degrees"
tasmin1["lon"] = np.rad2deg(tasmin["lon"])
tasmin1["lon"].attrs["units"] = "degrees"
[13]:
tasmin2 = tasmin1.isel(time=0).drop_vars("time")
tasmin2["gridsel"] = xr.where(
(
(tasmin2.lat > 53)
& (tasmin2.lat < 54)
& (tasmin2.lon > 9.8)
& (tasmin2.lon < 10)
),
1,
np.nan,
).compute()
[14]:
cell = tasmin2["gridsel"].argmax().values[()]
cell
[14]:
1064546
[15]:
tasmin_hamburg = tasmin1.isel(ncells=cell)
The hvplot library can be used to generate interactive plots.
[16]:
import hvplot.xarray
tasmin_hamburg.squeeze().hvplot.line()
[16]: