Dask chunks and best practices#

This notebook is meant to be used in DKRZ’s Jupyterhub with the python3 unstable kernel.

We aim to understand

  • the different types of chunks, especially

    • storage chunks

      • The smallest unit of binary data that is stored within a climate data file (zarr, netcdf, grib)

    • dask chunks

The chunks that dask uses inside a dask array; the smallest unit of data on which dask executes a task at once.

  • the best practices recommended by dask.

We use the following packages, which need to be in your environment/kernel:

[1]:
import intake
import xarray as xr
import hvplot.xarray
import numpy as np
from xhistogram.xarray import histogram
import dask
from dask.distributed import performance_report
from distributed import Client
import time
import glob

We mainly look at high-frequency (30-minute), high-resolution (zoom level 10) ICON data from NextGEMS Cycle 3 and select precipitation as an example variable.

We get the data from an intake catalog:

[2]:
CATALOG_LINK = "https://data.nextgems-h2020.eu/catalog.yaml"
VARID = "pr"
[3]:
selection = dict(zoom=10, time="PT30M")
[4]:
cat = intake.open_catalog(CATALOG_LINK)
cat
data_nextgems-h2020_eu:
  args:
    path: https://data.nextgems-h2020.eu/catalog.yaml
  description: ''
  driver: intake.catalog.local.YAMLFileCatalog
  metadata: {}

Chunks#

TL;DR:

Dask chunks need to

  • be smaller than your memory

Dask chunks should

  • be about 100MB

  • not exceed about one million in total when sent to the scheduler, i.e. when computed

  • reflect a multiple of the storage chunks

  • not cut storage chunks

  • fit to your analysis
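
A rough sanity check of these rules for a dask-backed DataArray could look like the following sketch (check_chunk_rules is a hypothetical helper, not part of any library; the 100MB and one-million figures are the rules of thumb from above):

import numpy as np

def check_chunk_rules(da):
    # da: an xarray DataArray backed by a dask array
    darr = da.data
    chunk_mib = np.prod(darr.chunksize) * da.dtype.itemsize / 1024**2
    n_chunks = int(np.prod(darr.numblocks))
    print(f"largest dask chunk: ~{chunk_mib:.0f} MiB (rule of thumb: ~100MB)")
    print(f"number of chunks/tasks: {n_chunks} (rule of thumb: well below ~1 million)")

# e.g. check_chunk_rules(ds_auto), once the data is opened below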

with open_zarr and intake configurations#

In the following, we have a look at the NGC3 zarr data and open it in 3 different ways:

  1. The catalog is configured such that xarray opens the data lazily, encoded as numpy arrays (no dask):

[5]:
ds_delayed = cat["ICON"]["ngc3028"](**selection).to_dask()[VARID]
ds_delayed
[5]:
<xarray.DataArray 'pr' (time: 96480, cell: 12582912)>
[1213999349760 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 2020-01-20T00:30:00 ... 2025-07-22
Dimensions without coordinates: cell
Attributes:
    cell_methods:  time: mean
    component:     atmo
    grid_mapping:  crs
    long_name:     precipitation flux
    units:         kg m-2 s-1
    vgrid:         surface

The same as:

[6]:
ds_delayed = xr.open_dataset(
    f"/fastdata/ka1081/nextgems_cycle3/ngc3028/ngc3028_{selection['time']}_{str(selection['zoom'])}.zarr",
    engine="zarr",
    chunks=None,
)[VARID]
  2. Let dask create dask chunks when opening with chunks="auto":

[7]:
ds_auto = cat["ICON"]["ngc3028"](**selection, chunks="auto").to_dask()[VARID]
ds_auto
[7]:
<xarray.DataArray 'pr' (time: 96480, cell: 12582912)>
dask.array<open_dataset-dd9675028d7e43a50230dc83a922860apr, shape=(96480, 12582912), dtype=float32, chunksize=(32, 524288), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-01-20T00:30:00 ... 2025-07-22
Dimensions without coordinates: cell
Attributes:
    cell_methods:  time: mean
    component:     atmo
    grid_mapping:  crs
    long_name:     precipitation flux
    units:         kg m-2 s-1
    vgrid:         surface

The same as:

[8]:
ds_auto = xr.open_dataset(
    f"/fastdata/ka1081/nextgems_cycle3/ngc3028/ngc3028_{selection['time']}_{str(selection['zoom'])}.zarr",
    engine="zarr",
    chunks="auto",
)[VARID]

But why is .encoding["chunks"] different from the dask chunks?

[9]:
ds_auto.encoding["chunks"]
[9]:
(16, 262144)

Because .encoding["chunks"] holds the storage chunks: the shape of a chunk as it is written to storage.

Storage chunks have a different optimal size than dask chunks. Each storage chunk is compressed individually, so the optimal storage chunk size depends a lot on the chosen compression tool. Optimal access to the storage backend hardware also plays a role.

In general, one can say that 1-10MB is a good storage chunk size for disk/lustre and <100MB is still OK for most backends.
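
As a rough sketch, the uncompressed size of one storage chunk can be estimated from the encoding and the dtype (the compressed size on disk is usually smaller):

# 16 * 262144 values * 4 bytes = 16 MiB per uncompressed storage chunk
print(np.prod(ds_auto.encoding["chunks"]) * ds_auto.dtype.itemsize / 1024**2)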

  3. Force dask to use storage chunks when opening with chunks={}:

[10]:
ds_storage = cat["ICON"]["ngc3028"](**selection, chunks={}).to_dask()[VARID]
ds_storage
[10]:
<xarray.DataArray 'pr' (time: 96480, cell: 12582912)>
dask.array<open_dataset-11ac8388882b0270658ade8f13719e04pr, shape=(96480, 12582912), dtype=float32, chunksize=(16, 262144), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-01-20T00:30:00 ... 2025-07-22
Dimensions without coordinates: cell
Attributes:
    cell_methods:  time: mean
    component:     atmo
    grid_mapping:  crs
    long_name:     precipitation flux
    units:         kg m-2 s-1
    vgrid:         surface

The same as:

[11]:
ds_storage = xr.open_dataset(
    f"/fastdata/ka1081/nextgems_cycle3/ngc3028/ngc3028_{selection['time']}_{str(selection['zoom'])}.zarr",
    engine="zarr",
    chunks={},
)[VARID]

We look at a

  • 5TB array

  • which is chunked in all dimensions both in storage chunks and dask chunks

  • and results in a dask graph with O(100,000) chunks

Note

Loading one individual dask chunk is one task in dask's task graph.

When we compare chunks of ds_auto and ds_storage, we find that

  • Storage chunks of precipitation are small.

  • Dask aims to create chunks of about 100MB in size

  • Dask is clever enough to use a multiple of the storage chunk size:

[12]:
print(ds_auto.chunksizes["time"][0] / ds_storage.chunksizes["time"][0])
print(ds_auto.chunksizes["cell"][0] / ds_storage.chunksizes["cell"][0])
2.0
2.0

That means we have 2²=4 storage chunks within one dask chunk.

Dask loads these storage chunks together in one task.
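
We can verify this ratio directly (a quick sketch):

# storage chunks per dask chunk: (32 / 16) * (524288 / 262144) = 4
print(np.prod(ds_auto.data.chunksize) / np.prod(ds_auto.encoding["chunks"]))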

Note

chunks="auto" returns well-configured dask chunks for general analysis purposes, so that a chunk has about 100MB.

Config and default setting#

Sometimes 100MB per chunk is not the best option for your application. You can change this dask setting with:

[13]:
# dask.config.set({"array.slicing.split_large_chunks": False})
# dask.config.set({"array.chunk-size": "10 MB"})
[14]:
# find out the target dask chunk size:
dask.config.config["array"]["chunk-size"]
[14]:
'128MiB'
[15]:
# set the target dask chunk size:
dask.config.set({"array.chunk-size": "10 MB"})
[15]:
<dask.config.set at 0x7fff7ccdb2e0>

If we now open the data with chunks="auto", we get:

[16]:
ds_auto_config = xr.open_dataset(
    f"/fastdata/ka1081/nextgems_cycle3/ngc3028/ngc3028_{selection['time']}_{str(selection['zoom'])}.zarr",
    engine="zarr",
    chunks="auto",
)[VARID]
ds_auto_config
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/core/dataset.py:255: UserWarning: The specified Dask chunks separate the stored chunks along dimension "time" starting at index 12. This could degrade performance. Instead, consider rechunking after loading.
  warnings.warn(
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/xarray/core/dataset.py:255: UserWarning: The specified Dask chunks separate the stored chunks along dimension "cell" starting at index 208332. This could degrade performance. Instead, consider rechunking after loading.
  warnings.warn(
[16]:
<xarray.DataArray 'pr' (time: 96480, cell: 12582912)>
dask.array<open_dataset-1a14ada58f9017b077d18ef7caddcac1pr, shape=(96480, 12582912), dtype=float32, chunksize=(12, 208332), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-01-20T00:30:00 ... 2025-07-22
Dimensions without coordinates: cell
Attributes:
    cell_methods:  time: mean
    component:     atmo
    grid_mapping:  crs
    long_name:     precipitation flux
    units:         kg m-2 s-1
    vgrid:         surface
[17]:
# set the target dask chunk size:
dask.config.set({"array.chunk-size": "128 MB"})
[17]:
<dask.config.set at 0x7fff7c69cd00>

open_mfdataset on netcdf#

It gets even more complicated: open_mfdataset behaves differently!

Note

Using chunks="auto" does not always lead to dask chunks that are a multiple of the storage chunks.

Consider this example:

[18]:
varname = "ta"
filelist = glob.glob(
    f"/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/CFday/{varname}/gn/v20190710/*.nc"
)
[19]:
openmf_auto = xr.open_mfdataset(filelist, chunks="auto")
openmf_auto
[19]:
<xarray.Dataset>
Dimensions:    (time: 31411, bnds: 2, lev: 95, lat: 192, lon: 384)
Coordinates:
  * time       (time) datetime64[ns] 2015-01-01T12:00:00 ... 2100-12-31T12:00:00
  * lev        (lev) float64 0.9961 0.9826 0.959 ... 2.31e-05 9.816e-06
  * lat        (lat) float64 -89.28 -88.36 -87.42 -86.49 ... 87.42 88.36 89.28
  * lon        (lon) float64 0.0 0.9375 1.875 2.812 ... 356.2 357.2 358.1 359.1
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(1826, 2), meta=np.ndarray>
    lev_bnds   (time, lev, bnds) float64 dask.array<chunksize=(1826, 95, 2), meta=np.ndarray>
    ap         (time, lev) float64 dask.array<chunksize=(1826, 95), meta=np.ndarray>
    b          (time, lev) float64 dask.array<chunksize=(1826, 95), meta=np.ndarray>
    ps         (time, lat, lon) float32 dask.array<chunksize=(1142, 118, 237), meta=np.ndarray>
    ap_bnds    (time, lev, bnds) float64 dask.array<chunksize=(1826, 95, 2), meta=np.ndarray>
    b_bnds     (time, lev, bnds) float64 dask.array<chunksize=(1826, 95, 2), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(1826, 192, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(1826, 384, 2), meta=np.ndarray>
    ta         (time, lev, lat, lon) float32 dask.array<chunksize=(423, 21, 42, 85), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            ScenarioMIP
    branch_method:          standard
    branch_time_in_child:   60265.0
    branch_time_in_parent:  60265.0
    contact:                cmip6-mpi-esm@dkrz.de
    ...                     ...
    title:                  MPI-ESM1-2-HR output prepared for CMIP6
    variable_id:            ta
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by DKRZ is licensed und...
    cmor_version:           3.4.0
    tracking_id:            hdl:21.14100/c30b8347-4287-4f20-ab6f-d8ddb1198635

This configuration gives us chunks which split the storage chunks.

[20]:
storage_chunks = openmf_auto[varname].encoding["chunksizes"]
storage_chunks
[20]:
(1, 48, 96, 192)
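
A quick way to see the mismatch is to check, per dimension, whether the dask chunk size is a multiple of the storage chunk size (a sketch using the arrays opened above):

# True means the dask chunks along that dimension align with the storage chunks:
dask_chunks = openmf_auto[varname].data.chunksize
print([d % s == 0 for d, s in zip(dask_chunks, storage_chunks)])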

This is the second worst configuration you can have in dask: most dask chunks cut across storage chunk boundaries, so the same storage chunks have to be read and decompressed multiple times. On top of that, chunks={} does not give us storage chunks either, but one file per chunk:

[21]:
one_file_per_chunk = xr.open_mfdataset(
    "/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/CFday/ta/gn/v20190710/*.nc",
    chunks={},
)
one_file_per_chunk
[21]:
<xarray.Dataset>
Dimensions:    (time: 31411, bnds: 2, lev: 95, lat: 192, lon: 384)
Coordinates:
  * time       (time) datetime64[ns] 2015-01-01T12:00:00 ... 2100-12-31T12:00:00
  * lev        (lev) float64 0.9961 0.9826 0.959 ... 2.31e-05 9.816e-06
  * lat        (lat) float64 -89.28 -88.36 -87.42 -86.49 ... 87.42 88.36 89.28
  * lon        (lon) float64 0.0 0.9375 1.875 2.812 ... 356.2 357.2 358.1 359.1
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(1826, 2), meta=np.ndarray>
    lev_bnds   (time, lev, bnds) float64 dask.array<chunksize=(1826, 95, 2), meta=np.ndarray>
    ap         (time, lev) float64 dask.array<chunksize=(1826, 95), meta=np.ndarray>
    b          (time, lev) float64 dask.array<chunksize=(1826, 95), meta=np.ndarray>
    ps         (time, lat, lon) float32 dask.array<chunksize=(1826, 192, 384), meta=np.ndarray>
    ap_bnds    (time, lev, bnds) float64 dask.array<chunksize=(1826, 95, 2), meta=np.ndarray>
    b_bnds     (time, lev, bnds) float64 dask.array<chunksize=(1826, 95, 2), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(1826, 192, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(1826, 384, 2), meta=np.ndarray>
    ta         (time, lev, lat, lon) float32 dask.array<chunksize=(1826, 95, 192, 384), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            ScenarioMIP
    branch_method:          standard
    branch_time_in_child:   60265.0
    branch_time_in_parent:  60265.0
    contact:                cmip6-mpi-esm@dkrz.de
    ...                     ...
    title:                  MPI-ESM1-2-HR output prepared for CMIP6
    variable_id:            ta
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by DKRZ is licensed und...
    cmor_version:           3.4.0
    tracking_id:            hdl:21.14100/c30b8347-4287-4f20-ab6f-d8ddb1198635

For this case, this is the worst configuration you can have in dask, because the chunk size most likely exceeds the available memory, so you cannot really work with this dataset at all.

Tip

When using open_mfdataset, ensure that the chunks are well defined. Consider reopening with a multiple of the storage chunk sizes.

[22]:
open_chunks = dict(
    time=2,
    lev=storage_chunks[1],  # storage chunk levels,
    lon=len(one_file_per_chunk["lon"]),
    lat=len(one_file_per_chunk["lat"]),
)
multiple_storage_chunks = xr.open_mfdataset(
    "/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/CFday/ta/gn/v20190710/*.nc",
    chunks=open_chunks,  # or a multiple of the storage_chunks
)
multiple_storage_chunks
[22]:
<xarray.Dataset>
Dimensions:    (time: 31411, bnds: 2, lev: 95, lat: 192, lon: 384)
Coordinates:
  * time       (time) datetime64[ns] 2015-01-01T12:00:00 ... 2100-12-31T12:00:00
  * lev        (lev) float64 0.9961 0.9826 0.959 ... 2.31e-05 9.816e-06
  * lat        (lat) float64 -89.28 -88.36 -87.42 -86.49 ... 87.42 88.36 89.28
  * lon        (lon) float64 0.0 0.9375 1.875 2.812 ... 356.2 357.2 358.1 359.1
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(2, 2), meta=np.ndarray>
    lev_bnds   (time, lev, bnds) float64 dask.array<chunksize=(1826, 48, 2), meta=np.ndarray>
    ap         (time, lev) float64 dask.array<chunksize=(1826, 48), meta=np.ndarray>
    b          (time, lev) float64 dask.array<chunksize=(1826, 48), meta=np.ndarray>
    ps         (time, lat, lon) float32 dask.array<chunksize=(2, 192, 384), meta=np.ndarray>
    ap_bnds    (time, lev, bnds) float64 dask.array<chunksize=(1826, 48, 2), meta=np.ndarray>
    b_bnds     (time, lev, bnds) float64 dask.array<chunksize=(1826, 48, 2), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(1826, 192, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(1826, 384, 2), meta=np.ndarray>
    ta         (time, lev, lat, lon) float32 dask.array<chunksize=(2, 48, 192, 384), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            ScenarioMIP
    branch_method:          standard
    branch_time_in_child:   60265.0
    branch_time_in_parent:  60265.0
    contact:                cmip6-mpi-esm@dkrz.de
    ...                     ...
    title:                  MPI-ESM1-2-HR output prepared for CMIP6
    variable_id:            ta
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by DKRZ is licensed und...
    cmor_version:           3.4.0
    tracking_id:            hdl:21.14100/c30b8347-4287-4f20-ab6f-d8ddb1198635

Best-practices#

In the following, we run some example analyses to explore how dask works. For ds_auto, we want to know:

What is the quick and unweighted field mean of the first year?

[23]:
def calc_fieldmean_of_first_year(ds):
    # select the first year of the time series
    dsy1 = list(dict(ds.groupby("time.year")).values())[0]
    # unweighted mean over all cells
    dsy1mean = dsy1.mean(dim="cell")
    display(dsy1mean)
    # amount of data that has to be processed, in GB
    gbs = dsy1.nbytes / 1024 / 1024 / 1024
    print(f"Going to process: {gbs} GBs")
    return dsy1mean.compute(), gbs

Do we need dask?#

[24]:
print(
    "The first year of the dataset is "
    f"{ds_delayed.isel(time=slice(0,2*24*365)).nbytes/1024/1024/1024}"
    " GBs large and does not fit into memory"
)
The first year of the dataset is 821.25 GBs large and does not fit into memory
[25]:
# Without dask, this compute would try to load the ~821 GB into memory at once:
# ds_delayed.isel(time=slice(0,2*24*365)).mean(dim="time").compute()

Let’s quickly define a client to get a dashboard view. More on this in the distributed notebook.

[26]:
# client.close()
client = Client()
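
By default, this starts a local cluster sized to the machine it runs on. If you want explicit control over the resources, you can pass them yourself; a sketch with hypothetical numbers that you would adapt to your node:

# Hypothetical sizing; adapt workers, threads and memory to your node:
# client = Client(n_workers=4, threads_per_worker=4, memory_limit="16GB")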

Diagnostic tools#

When we let dask actually work, we can use diagnostic tools to inspect from the client how our task graph was evaluated, i.e. how performant the execution was. One of these tools is the performance_report, which we use in a with context as follows:

[27]:
s = time.time()
with performance_report(filename="dask-report-auto.html"):
    result, gbs = calc_fieldmean_of_first_year(ds_auto)
e = time.time()
print(f"Process speed: {gbs/(e-s)} GB/s")
<xarray.DataArray 'pr' (time: 16655)>
dask.array<mean_agg-aggregate, shape=(16655,), dtype=float32, chunksize=(32,), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-01-20T00:30:00 ... 2020-12-31T23:30:00
Going to process: 780.703125 GBs
Process speed: 6.697790692183614 GB/s

Let's compare this with the dataset whose dask chunks equal the storage chunks:

[28]:
s = time.time()
with performance_report(filename="dask-report-storage.html"):
    gbs = calc_fieldmean_of_first_year(ds_storage)[1]
e = time.time()
print(f"Process speed: {gbs/(e-s)} GB/s")
<xarray.DataArray 'pr' (time: 16655)>
dask.array<mean_agg-aggregate, shape=(16655,), dtype=float32, chunksize=(16,), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-01-20T00:30:00 ... 2020-12-31T23:30:00
Going to process: 780.703125 GBs
/sw/spack-levante/mambaforge-23.1.0-1-Linux-x86_64-3boc6i/lib/python3.10/site-packages/distributed/client.py:3106: UserWarning: Sending large graph of size 9.65 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(
Process speed: 6.211303828719483 GB/s

We found that the field mean over the 780GB of the first year was processed for

  • ds_auto:

    • in 521 chunks in 7 graph layers

    • with a speed of about 6.5GB/s (10GB/s can be considered the maximum for a Levante node)

  • ds_storage:

    • in 1041 chunks in 7 graph layers

    • with a speed of about 5.5GB/s

[29]:
result.hvplot.line()
[29]:

Adapt chunks to your analysis#

Now we ask

What is the histogram for the first 8000 cells?

The analysis goal has implications for our chunk setting. One dask chunk should contain

  • 8000 cells

  • as many timesteps as possible

However, the general optimization rules for setting dask chunks say:

  • do not split up storage chunks

and

  • target 100MB dask chunks

So we can come up with something like:

[30]:
chunked_for_histogram = xr.open_dataset(
    f"/fastdata/ka1081/nextgems_cycle3/ngc3028/ngc3028_{selection['time']}_{str(selection['zoom'])}.zarr",
    engine="zarr",
    # 8 storage chunks along time (16 * 8 = 128), 1 along cell: about 128MiB per dask chunk
    chunks=dict(time=16 * 8, cell=262144),
)[VARID]
chunked_for_histogram
[30]:
<xarray.DataArray 'pr' (time: 96480, cell: 12582912)>
dask.array<open_dataset-76e6ce1b4c49b5e250a66fdc64b44210pr, shape=(96480, 12582912), dtype=float32, chunksize=(128, 262144), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-01-20T00:30:00 ... 2025-07-22
Dimensions without coordinates: cell
Attributes:
    cell_methods:  time: mean
    component:     atmo
    grid_mapping:  crs
    long_name:     precipitation flux
    units:         kg m-2 s-1
    vgrid:         surface

Let's subselect:

[31]:
dss = ds_auto.isel(cell=slice(0, 8000))
dss
[31]:
<xarray.DataArray 'pr' (time: 96480, cell: 8000)>
dask.array<getitem, shape=(96480, 8000), dtype=float32, chunksize=(32, 8000), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-01-20T00:30:00 ... 2025-07-22
Dimensions without coordinates: cell
Attributes:
    cell_methods:  time: mean
    component:     atmo
    grid_mapping:  crs
    long_name:     precipitation flux
    units:         kg m-2 s-1
    vgrid:         surface

Persist for more analysis, compute for results#

We can see that dss would fit into our memory. If we plan to run multiple different functions on it after loading, we can use .persist() instead of .compute(). This keeps the data distributed across the workers so that subsequent dask computations on the data are faster.

You can see in the dashboard that .persist will trigger some getitem() tasks.

[32]:
dss_persist = dss.persist()
dss_persist
[32]:
<xarray.DataArray 'pr' (time: 96480, cell: 8000)>
dask.array<getitem, shape=(96480, 8000), dtype=float32, chunksize=(32, 8000), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-01-20T00:30:00 ... 2025-07-22
Dimensions without coordinates: cell
Attributes:
    cell_methods:  time: mean
    component:     atmo
    grid_mapping:  crs
    long_name:     precipitation flux
    units:         kg m-2 s-1
    vgrid:         surface

dss_persist is still a dask array, but a .load() would return a numpy-backed array very quickly by gathering the data from the workers.
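
If you want the result as plain numpy in your local session, a .load() gathers the persisted chunks from the workers; a minimal sketch (dss fits into memory, about 3GB):

# dss_local = dss_persist.load()   # now backed by a numpy array
# type(dss_local.data)             # numpy.ndarray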

[33]:
dss_persist
[33]:
<xarray.DataArray 'pr' (time: 96480, cell: 8000)>
dask.array<getitem, shape=(96480, 8000), dtype=float32, chunksize=(32, 8000), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-01-20T00:30:00 ... 2025-07-22
Dimensions without coordinates: cell
Attributes:
    cell_methods:  time: mean
    component:     atmo
    grid_mapping:  crs
    long_name:     precipitation flux
    units:         kg m-2 s-1
    vgrid:         surface

Multiple computes at once#

Now we create the bins for our histogram. If we would otherwise need to call .compute() multiple times, we can instead run the computations together on the scheduler with dask.compute:

[34]:
# NOT: two separate computes, each traversing the data
# dsmin=dss.min().compute()
# dsmax=dss.max().compute()
# BETTER: compute both aggregations together
dsmin, dsmax = dask.compute(dss_persist.min(), dss_persist.max())
# 20 bin edges covering the lowest 5% of the value range
bins = np.linspace(dsmin, dsmax - (dsmax - dsmin) * 0.95, 20)

For the histogram we use xhistogram:

[35]:
def calc_histogram_for_fieldslice(dshist):
    # histogram over the time dimension, separately for each cell
    h_x = histogram(dshist, bins=bins, dim=["time"])
    # gbs is reused from the field-mean example above
    return h_x.compute(), gbs
[36]:
with performance_report(filename="dask-report-auto.html"):
    result_histogram, gbs = calc_histogram_for_fieldslice(dss_persist)
result_histogram
[36]:
<xarray.DataArray 'histogram_pr' (cell: 8000, pr_bin: 19)>
array([[95998,   159,    73, ...,     2,     5,     4],
       [96011,   161,    60, ...,     4,     3,     4],
       [96061,   121,    67, ...,     2,     4,     2],
       ...,
       [96346,    25,    12, ...,     1,     3,     1],
       [96341,    32,    17, ...,     1,     1,     1],
       [96353,    18,    18, ...,     2,     1,     1]])
Coordinates:
  * pr_bin   (pr_bin) float64 0.0001068 0.0003205 ... 0.003739 0.003953
Dimensions without coordinates: cell
[37]:
result_histogram.isel(cell=1, pr_bin=slice(1, 99)).hvplot.bar(x="pr_bin")
[37]:
[ ]: