The km-scale cloud by DKRZ#

The km-scale cloud, formerly known as EERIE cloud, addresses data users who cannot work on DKRZ's High Performance Computer Levante next to the data. It provides open, efficient, Zarr-based access to high-resolution ESM datasets produced by

  • projects like EERIE and NextGEMs

  • models like ICON and IFS-FESOM.

Whether the data is originally stored as Zarr, NetCDF, or GRIB, the km-scale cloud delivers byte ranges of the original files as Zarr chunks, without rewriting, relocating, or converting anything. In addition, it incorporates Dask for on-demand, server-side data reduction tasks such as compression and the computation of summary statistics. This combination makes it practical and scalable to work with petabyte-scale climate data, even for users with limited compute resources. Integrated support for catalog tools like STAC (SpatioTemporal Asset Catalog) and intake, organized in a semantic hierarchy, allows users to discover, query, and subset data efficiently through standardized, interoperable interfaces.

This is enabled by the Python package cloudify, described in detail in Wachsmann, 2025.

Showcase: Time series quickplot for an ICON simulation#

import xarray as xr

dataset_id="icon-esm-er.highres-future-ssp245.v20240618.atmos.native.mon"
# simplecache:: caches fetched chunks on local disk so repeated reads are fast
ds=xr.open_dataset(
    f"simplecache::https://km-scale-cloud.dkrz.de/datasets/{dataset_id}/kerchunk",
    engine="zarr",
    chunks="auto",
    zarr_format=2,
# if you use zarr v3:
#    storage_options=dict(
#        simplecache=dict(asynchronous=True),
#        https=dict(asynchronous=True)
#    )
)
import matplotlib.pyplot as plt

# global mean near-surface air temperature
var="tas_gmean"

vmean = ds[var].squeeze()
# resample the monthly series to yearly means ("YE" = year end)
vmean_yr = vmean.resample(time="YE").mean()

vmean.plot()
vmean_yr.plot()
plt.title(" ".join(dataset_id.split(".")) + f"\ntime series of monthly and yearly mean {var}")
plt.xlabel("Year")
[Figure: time series of monthly and yearly mean tas_gmean for icon-esm-er.highres-future-ssp245.v20240618.atmos.native.mon]

Km-scale datasets and how to approach them#

The km-scale-cloud provides many endpoints (URIs) with different features. Under this link, all available endpoints are documented.

Helpful dataset endpoints#

A key endpoint is the /datasets endpoint, which provides a simple list of the datasets accessible through the km-scale-cloud. We can use it like this:

import requests
kmscale_uri = "https://km-scale-cloud.dkrz.de"
kmscale_datasets_uri=kmscale_uri+"/datasets"

# the endpoint returns a JSON list of dataset ids
kmscale_datasets=requests.get(kmscale_datasets_uri).json()
print(f"The km-scale-cloud provides {len(kmscale_datasets)} datasets such as:")
print(kmscale_datasets[0])
The km-scale-cloud provides 682 datasets such as:
cosmo-rea-1hr_atmos

For each dataset, an Xarray-dataset view endpoint exists which can be helpful to get an overview of the content of the dataset. The endpoint is constructed like datasets/dataset_id/, e.g.:

dataset_pattern="atmos.native.mon"
# pick the first dataset id that matches the pattern
global_atm_mon_mean_ds_name=next(
    a for a in kmscale_datasets if dataset_pattern in a
)
print(
    "An available atmospheric global mean monthly mean dataset is: "+
    global_atm_mon_mean_ds_name
)
global_atm_mon_mean_ds_uri=kmscale_datasets_uri+"/"+global_atm_mon_mean_ds_name
An available atmospheric global mean monthly mean dataset is: icon-esm-er.highres-future-ssp245.v20240618.atmos.native.mon
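
The view endpoint itself can also be fetched directly. A minimal sketch with requests, assuming the view is served as a text/HTML representation of the dataset (the exact response format may differ):

# fetch the dataset view and print the beginning for a quick look
view = requests.get(global_atm_mon_mean_ds_uri + "/")
print(view.text[:200])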

The key endpoint is the default Zarr endpoint, constructed like datasets/DATASET_ID/kerchunk. Such an address can be used by any tool that can read Zarr over HTTP, e.g. with Xarray:

xr.open_dataset(
    global_atm_mon_mean_ds_uri+"/kerchunk",
    engine="zarr"    
)
<xarray.Dataset> Size: 31kB
Dimensions:        (time: 432, lat: 1, lon: 1)
Coordinates:
  * time           (time) datetime64[ns] 3kB 2015-02-01 ... 2051-01-01
  * lat            (lat) float64 8B 0.0
  * lon            (lon) float64 8B 0.0
Data variables: (12/16)
    duphyvi_gmean  (time, lat, lon) float32 2kB ...
    evap_gmean     (time, lat, lon) float32 2kB ...
    fwfoce_gmean   (time, lat, lon) float32 2kB ...
    kedisp_gmean   (time, lat, lon) float32 2kB ...
    prec_gmean     (time, lat, lon) float32 2kB ...
    radbal_gmean   (time, lat, lon) float32 2kB ...
    ...             ...
    tas_gmean      (time, lat, lon) float32 2kB ...
    udynvi_gmean   (time, lat, lon) float32 2kB ...
    ufcs_gmean     (time, lat, lon) float32 2kB ...
    ufts_gmean     (time, lat, lon) float32 2kB ...
    ufvs_gmean     (time, lat, lon) float32 2kB ...
    uphybal_gmean  (time, lat, lon) float32 2kB ...
Attributes:
    CDI:          Climate Data Interface version 2.4.0 (https://mpimet.mpg.de...
    Conventions:  CF-1.6
    comment:      Sapphire Dyamond (k203123) on l30674 (Linux 4.18.0-513.24.1...
    history:      /work/bm1344/k203123/experiments/erc2020/run_20150101T00000...
    institution:  Max Planck Institute for Meteorology/Deutscher Wetterdienst
    references:   see MPIM/DWD publications
    source:       git@gitlab.dkrz.de:icon/icon-mpim.git@be4454b870506af6f8f4b...
    title:        ICON simulation

The Datatree: All in one Zarr-group#

Note: xarray>=2025.7.1 is required, and the DataTree API is still evolving fast.

Opening Zarr data and representing it in Xarray does not cost many resources. It is fast and small in memory. A strategy for working with multiple datasets is therefore to just open everything.

We can access the full >10 PB of the km-scale-cloud using only xarray with the following code:

dt = xr.open_datatree(
    "simplecache::https://km-scale-cloud.dkrz.de/datasets",
    engine="zarr",
    zarr_format=2,
    # keep the open cheap: no dask chunking, no indexes, no CF decoding
    chunks=None,
    create_default_indexes=False,
    decode_cf=False,
)

The dt object is an Xarray DataTree. This datatree object allows you to browse and discover the full content of the km-scale-cloud, similar to the functionality of intake but without requiring an additional tool.
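
For a first look, you can list the top-level groups of the tree; each child group of the root corresponds to one dataset id (a minimal sketch; the exact ids depend on the current server content):

# number of datasets and the first few dataset ids
print(len(dt.children))
print(list(dt.children)[:3])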

The workflow using the datatree involves the following steps:

  1. filter: subset the tree so that it only includes the datasets you want. You can apply any custom function that takes the datasets as input. In the following example, we look for a specific dataset name.

  2. chunk, decode_cf: in case you need coordinate values for subsetting, or in case you are interested in more than one chunk, use these functions to allow lazy access to coordinates.

  3. Instead of writing loops over a list or a dictionary of datasets, you can now use the .map_over_datasets function to apply a function to all datasets at once.

path_filter="s2024-08-10"
# keep only groups whose path contains the filter string
filtered_tree=dt.filter(lambda ds: ds if path_filter in ds.path else None)
display(filtered_tree)
<xarray.DataTree>
Group: /
├── Group: /orcestra.ICON-LAM.s2024-08-10_2d_PT10M_12
│   └── Group: /orcestra.ICON-LAM.s2024-08-10_2d_PT10M_12/kerchunk
│           Dimensions:  (time: 145, cell: 9371648)
│           Coordinates:
│               time     (time) float64 1kB ...
│               cell     (cell) int64 75MB ...
│           Data variables: (12/33)
│               cllvi    (time, cell) float32 5GB ...
│               clt      (time, cell) float32 5GB ...
│               crs      float32 4B ...
│               evspsbl  (time, cell) float32 5GB ...
│               hfls     (time, cell) float32 5GB ...
│               hfss     (time, cell) float32 5GB ...
│               ...       ...
│               tas      (time, cell) float32 5GB ...
│               tauu     (time, cell) float32 5GB ...
│               tauv     (time, cell) float32 5GB ...
│               ts       (time, cell) float32 5GB ...
│               uas      (time, cell) float32 5GB ...
│               vas      (time, cell) float32 5GB ...
└── Group: /orcestra.ICON-LAM.s2024-08-10_3d_PT4H_12
    └── Group: /orcestra.ICON-LAM.s2024-08-10_3d_PT4H_12/kerchunk
            Dimensions:      (time: 25, height_full: 56, cell: 9371648, height_half: 57)
            Coordinates:
                time         (time) float64 200B ...
                height_full  (height_full) float64 448B ...
                cell         (cell) int64 75MB ...
                height_half  (height_half) float64 456B ...
            Data variables: (12/13)
                crs          float32 4B ...
                pfull        (time, height_full, cell) float32 52GB ...
                qc           (time, height_full, cell) float32 52GB ...
                qg           (time, height_full, cell) float32 52GB ...
                qi           (time, height_full, cell) float32 52GB ...
                qr           (time, height_full, cell) float32 52GB ...
                ...           ...
                qv           (time, height_full, cell) float32 52GB ...
                rho          (time, height_full, cell) float32 52GB ...
                ta           (time, height_full, cell) float32 52GB ...
                ua           (time, height_full, cell) float32 52GB ...
                va           (time, height_full, cell) float32 52GB ...
                wa           (time, height_half, cell) float32 53GB ...
filtered_tree_chunked=filtered_tree.chunk()
# decode CF metadata (e.g. time units) only for groups that contain data
filtered_tree_chunked_decoded=filtered_tree_chunked.map_over_datasets(
    lambda ds: xr.decode_cf(ds) if "time" in ds else None
)

Any .compute() or .load() will trigger data retrieval. Thus, make sure you subset before you download.

The .nbytes attribute shows you how much uncompressed data will be loaded into your memory.

filtered_tree_chunked_decoded.nbytes/1024**2
767516.752166748
time_sel="2024-08-10"
var="tas"
filtered_tree_chunked_decoded_subsetted=filtered_tree_chunked_decoded.map_over_datasets(
    lambda ds: ds[[var]].sel(time=time_sel) if all(a in ds for a in [var,"time"]) else None
).prune()
print(filtered_tree_chunked_decoded_subsetted.nbytes/1024**2)
5219.50110244751

Before running .load(), add all lazy functions to Dask's task graph:

dt_workflow=filtered_tree_chunked_decoded_subsetted.map_over_datasets(
    lambda ds: ds.mean(dim="time") if "time" in ds else None
)
%%time
ds_mean=dt_workflow.compute().leaves[0].to_dataset()
ds_mean
CPU times: user 16.6 s, sys: 34.4 s, total: 50.9 s
Wall time: 1min 36s
<xarray.Dataset> Size: 112MB
Dimensions:  (cell: 9371648)
Coordinates:
  * cell     (cell) int64 75MB 50331648 50331649 ... 201326590 201326591
    crs      float32 4B nan
Data variables:
    tas      (cell) float32 37MB 299.8 299.8 299.8 299.8 ... 299.9 299.9 299.9

STAC interface#

The most interoperable catalog interface of the km-scale-cloud is STAC (SpatioTemporal Asset Catalog). You can use any stac-browser implementation to browse and discover the dynamically created km-scale cloud collection with all datasets linked as dynamic items.

For programmatic access, follow this notebook.

import pystac
kmscale_collection_uri="https://km-scale-cloud.dkrz.de/stac-collection-all.json"
kmscale_collection=pystac.Collection.from_file(kmscale_collection_uri)
kmscale_collection
<Collection id=eerie-cloud-all>

Each child link in the collection corresponds to a dataset and is formatted as a STAC item. Datasets of the km-scale-cloud are highly aggregated.

We can load all dataset items like this:

# load every child link of the collection as a STAC item
dataset_items=[
    pystac.Item.from_file(link.href)
    for link in kmscale_collection.get_links("child")
]
item=dataset_items[0]
item
<Item id=cosmo-rea-1hr_atmos>

We can use properties or ids to browse and search the items:

selection=[
    a
    for a in dataset_items
    if all(b in a.id.lower() for b in ["icon","gr025","monthly","2d","hist","atmos"])
]
item=selection[0]
len(selection)
1

Each dataset item has multiple assets corresponding to different ways to open the dataset; e.g., the “dkrz-disk” asset should be used when working on Levante.
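
To see which access methods exist for an item, you can list its asset keys (a minimal sketch; the available keys depend on the dataset):

# list all asset keys, i.e. the available access methods for this dataset
print(list(item.get_assets().keys()))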

We now use the “eerie-cloud” asset. It provides information on the volume and the number of variables of the total, uncompressed dataset.

asset="eerie-cloud"
ec_asset=item.get_assets()[asset]
ec_asset_dict=ec_asset.to_dict()
ec_asset
<Asset href=https://eerie.cloud.dkrz.de/datasets/icon-esm-er.hist-1950.v20240618.atmos.gr025.2d_monthly_mean/kerchunk>

We can open datasets with the asset:

import xarray as xr
zarr_uri=ec_asset_dict["href"]
xr.open_dataset(
    zarr_uri,
    **ec_asset_dict["xarray:open_kwargs"],
    storage_options=ec_asset_dict.get("xarray:storage_options")
)
<xarray.Dataset> Size: 107GB
Dimensions:        (time: 780, lat: 721, lon: 1440, height_3: 1, bnds: 2,
                    height: 1, height_2: 1)
Coordinates:
  * time           (time) datetime64[ns] 6kB 1950-01-31T23:59:59 ... 2014-12-...
  * lat            (lat) float64 6kB -90.0 -89.75 -89.5 ... 89.5 89.75 90.0
  * lon            (lon) float64 12kB 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
  * height_3       (height_3) float64 8B 90.0
  * height         (height) float64 8B 2.0
  * height_2       (height_2) float64 8B 10.0
Dimensions without coordinates: bnds
Data variables: (12/34)
    clivi          (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    cllvi          (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    clt            (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    evspsbl        (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    height_3_bnds  (time, height_3, bnds) float64 12kB dask.array<chunksize=(780, 1, 2), meta=np.ndarray>
    hfls           (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    ...             ...
    tasmin         (time, height, lat, lon) float32 3GB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    tauu           (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    tauv           (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    ts             (time, lat, lon) float32 3GB dask.array<chunksize=(32, 721, 1440), meta=np.ndarray>
    uas            (time, height_2, lat, lon) float32 3GB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
    vas            (time, height_2, lat, lon) float32 3GB dask.array<chunksize=(32, 1, 721, 1440), meta=np.ndarray>
Attributes: (12/31)
    Conventions:           CF-1.7 CMIP-6.2
    activity_id:           HighResMIP
    data_specs_version:    01.00.32
    forcing_index:         1
    initialization_index:  1
    license:               EERIE model data produced by MPI-M is licensed und...
    ...                    ...
    parent_activity_id:    HighResMIP
    sub_experiment_id:     none
    experiment:            coupled historical 1950-2014
    source:                ICON-ESM-ER (2023): \naerosol: none, prescribed MA...
    institution:           Max Planck Institute for Meteorology, Hamburg 2014...
    sub_experiment:        none

Advanced: Retrieval settings#

Before we start excessively transferring data chunks, we need to understand how the km-scale cloud limits requests, so that we do not run into rejected ones. The km-scale-cloud aims to serve data to multiple users at once. Therefore, it allows 200 requests per client IP before it rejects further requests with a 429 error. It also only allows 60 requests per dataset.

On the other side, the Python Zarr library that we use for access is embarrassingly parallel. Under the hood, it uses the aiohttp client, which limits the number of simultaneous requests to 100 per session.

Thus, with default settings, you can only retrieve data from 2 datasets (2 * 100 requests) at once! Moreover, it makes sense to configure the client so that it issues fewer requests at once. A reasonable number of concurrent requests depends on

  • your network bandwidth

  • the chunk size of the target data. For some high-resolution data (GRIB formatted), one chunk can be large (>>10MB). In that case, a few parallel chunks already fill your network capacity.

This is how parallel requests are limited to 2:

import aiohttp

# limit fsspec's aiohttp session to 2 simultaneous TCP connections
so_limited={
    "client_kwargs": {
        "connector": aiohttp.TCPConnector(limit=2),
    }
}

xr.open_dataset(
    zarr_uri,
    **ec_asset_dict["xarray:open_kwargs"],
    storage_options=so_limited
)

Advanced: Two Zarr endpoints#

The km-scale-cloud provides two Zarr access endpoints for the same datasets, which differ:

  • /kerchunk suffix: The preferred href

    delivers data as stored in the original files without any processing on the server. Original Zarr data, e.g. NextGEMs ICON, is just passed through to the user.

  • /zarr suffix: The alternate href

    delivers data processed server-side with Dask. The resulting data is

    • rechunked into large, Dask-optimal chunk sizes (O(100MB) in memory)

    • uniformly compressed with Blosc. Native data is lossily compressed with 12-bit bitrounding.

The /kerchunk endpoint has a smaller footprint in both memory and compute on the server. The /zarr endpoint reduces the data volume before transfer to support users with limited network bandwidth.
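
As a sketch, the processed variant of the dataset from above can be opened by swapping the suffix, assuming the /zarr endpoint accepts the same open parameters as /kerchunk:

# open the server-side processed variant of the same dataset
ds_processed = xr.open_dataset(
    global_atm_mon_mean_ds_uri + "/zarr",
    engine="zarr",
    chunks="auto",
)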

Intake#

Warning

The Intake catalog is a legacy interface and requires intake<2

The server dynamically creates an intake catalog which is synced with the available datasets.

import intake
cat = intake.open_catalog("https://eerie.cloud.dkrz.de/intake.yaml")

We can list all datasets available in the catalog with list():

all_dkrz_datasets = list(cat)
len(all_dkrz_datasets)
682

The two zarr access points for each dataset are given in the description:

dsname = all_dkrz_datasets[-1]
cat[dsname].describe()
{'name': 'orcestra.ICON-LAM.s2024-09-29_3d_PT4H_12',
 'container': 'xarray',
 'plugin': ['zarr'],
 'driver': ['zarr'],
 'description': '',
 'direct_access': 'forbid',
 'user_parameters': [{'name': 'method',
   'description': 'server-side loading method',
   'type': 'str',
   'allowed': ['kerchunk', 'zarr'],
   'default': 'kerchunk'}],
 'metadata': {},
 'args': {'consolidated': True,
  'urlpath': 'https://eerie.cloud.dkrz.de/datasets/orcestra.ICON-LAM.s2024-09-29_3d_PT4H_12/{{ method }}'}}

to_dask() opens the dataset:

cat[dsname].to_dask()
<xarray.Dataset> Size: 631GB
Dimensions:      (time: 25, height_full: 56, cell: 9371648, height_half: 57)
Coordinates:
  * time         (time) datetime64[ns] 200B 2024-09-29 ... 2024-09-30
  * height_full  (height_full) float64 448B 35.0 36.0 37.0 ... 88.0 89.0 90.0
  * cell         (cell) int64 75MB 50331648 50331649 ... 201326590 201326591
  * height_half  (height_half) float64 456B 35.0 36.0 37.0 ... 89.0 90.0 91.0
    crs          float32 4B ...
Data variables:
    pfull        (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    qc           (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    qg           (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    qi           (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    qr           (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    qs           (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    qv           (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    rho          (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    ta           (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    ua           (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    va           (time, height_full, cell) float32 52GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
    wa           (time, height_half, cell) float32 53GB dask.array<chunksize=(6, 8, 16384), meta=np.ndarray>
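
The method user parameter from the description above switches between the two access points. A minimal sketch using intake's user-parameter syntax:

# select the server-side processed /zarr endpoint instead of the default /kerchunk
ds_processed = cat[dsname](method="zarr").to_dask()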

Environment#

To successfully run this notebook, you need a kernel / environment with the following packages installed:

- xarray
- pystac
- zarr
- dask
- matplotlib # for plotting
- hvplot # for plotting
- aiohttp
- requests
- intake-xarray
- intake<2
- eccodes # for grib data
- python-eccodes # for grib data