Working with Dask#
Earth system datasets are often too large to fit into memory, and analyses frequently need to scale from a laptop to a compute cluster. Dask addresses this challenge by extending Python tools such as xarray to work lazily and in parallel. Instead of executing operations immediately, Dask builds a task graph that describes what needs to be done and only performs the actual computation when explicitly requested. This design allows users to prototype analyses interactively while retaining the option to scale up without rewriting code.
At the user level, Dask is primarily encountered through its collections such as dask.array, dask.dataframe, and xarray objects backed by Dask.
These collections look and feel like their in-memory counterparts but are internally split into smaller chunks.
Chunking is a key concept: it determines how data is partitioned and therefore strongly influences performance and memory usage.
In practice, users should think less about Dask internals and more about choosing chunk sizes that align with their access patterns (e.g. spatial vs. temporal operations in climate data) and the available compute resources.
Finally, Dask separates defining a computation from executing it.
Operations on Dask-backed objects are cheap and immediate, while execution happens only when calling methods such as .compute() or when results are plotted or written to disk.
This separation enables parallel execution on a single machine or a cluster via a scheduler, without changing the analysis logic.
For Earth system science workflows, this means Dask can serve as a scalable execution engine underneath xarray-based analyses, provided users keep the focus on high-level data structures and avoid premature optimization or unnecessary exposure to low-level Dask functionality.
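As a minimal, self-contained sketch of this deferred-execution model (the array and variable names here are purely illustrative and use a synthetic dask.array rather than a real dataset):
import dask.array as da

# Building the computation is cheap: Dask only records a task graph;
# no random numbers are generated and nothing is reduced yet.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
anomaly_std = (x - x.mean(axis=0)).std()

# Only this call executes the task graph, in parallel across chunks.
print(anomaly_std.compute())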
Storage Chunks#
Earth system data is often stored in chunked file formats such as NetCDF4 or Zarr. In this context, storage chunks describe how variables are physically laid out on disk. Each chunk is a contiguous block of data that is read or written in a single I/O operation. Storage chunking is fixed at data creation time and reflects assumptions about how the data will be accessed—for example, reading full time series at individual grid points or loading spatial slices at specific time steps.
For users, storage chunks primarily matter because they define the minimum cost of I/O. Access patterns that align with storage chunks are efficient, while misaligned access can require reading many chunks to assemble a small logical subset of data. Although storage chunking cannot usually be changed without rewriting the dataset, it is essential to be aware of it, as it sets the baseline for performance in any downstream analysis, including those using Dask.
We can inspect the storage chunking of a dataset by providing the chunks={} keyword to xr.open_dataset().
This will create a Dask array that uses the underlying storage chunking directly as Dask chunks:
import intake
import xarray as xr

# Open the online data catalogue and look up the dataset location.
cat = intake.open_catalog("https://data.nextgems-h2020.eu/online.yaml")
urlpath = cat.ERA5(zoom=6).urlpath

# chunks={} creates Dask arrays whose chunks match the storage chunking.
ds = xr.open_dataset(urlpath, chunks={})
ds["2t"].chunks
((24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
  24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
  24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24),
 (16384, 16384, 16384))
In this example, the data were written with storage chunks of 24 time steps and 16,384 spatial cells, so at least this much data is read in every single I/O operation.
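As a quick back-of-the-envelope sketch (reusing the ds object opened above and the float32 dtype of 2t), we can estimate how much data one storage chunk represents:
# One storage chunk of "2t": 24 time steps x 16,384 cells x 4 bytes (float32).
bytes_per_chunk = 24 * 16384 * ds["2t"].dtype.itemsize
print(f"{bytes_per_chunk / 1e6:.1f} MB per storage chunk")  # roughly 1.6 MB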
Dask Chunks#
Dask chunks describe how data are partitioned in memory for parallel computation.
When opening a dataset with Dask (for example via xarray’s chunks= argument), users specify how arrays should be split into pieces that can be processed independently.
Each Dask chunk becomes one or more tasks in the Dask task graph, enabling parallel execution across CPU cores or cluster workers.
Unlike storage chunks, Dask chunks are flexible and can be adapted to the analysis at hand.
Ideally, the Dask chunk size is set directly when the Dask array is created, i.e. when the dataset is opened; this avoids an additional rechunking layer in the task graph:
# 96 time steps per Dask chunk = 4 storage chunks of 24 time steps each.
ds_dask = xr.open_dataset(urlpath, chunks={"time": 96})
ds_dask["2t"].chunks
((96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 48), (16384, 16384, 16384))
Warning
The .chunk() method can change the chunking of existing Dask arrays.
Use with care: rechunking is often expensive and can significantly degrade performance.
Prefer choosing suitable chunking when opening the dataset.
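For illustration only, a rechunking call on the Dask-backed dataset opened above might look like this (a sketch of what the warning refers to; prefer the chunks= argument of xr.open_dataset() instead):
# Rechunk "2t" to 192 time steps per chunk (= 8 storage chunks of 24);
# this adds an extra rechunking layer on top of the existing 96-step chunks.
rechunked = ds_dask["2t"].chunk({"time": 192})
rechunked.chunks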
Efficient workflows should aim to align Dask chunks with storage chunks, or to combine multiple storage chunks into a single Dask chunk, rather than splitting them further. Good alignment minimizes unnecessary I/O and avoids excessive task overhead. From a practical perspective, users should choose Dask chunks that are large enough to amortize overhead, small enough to fit comfortably in memory, and shaped to match the dominant operations (e.g. chunking along time for temporal reductions). Understanding the distinction—and relationship—between storage chunks and Dask chunks is often the single most important step toward effective and scalable use of Dask with Earth system data.
In the following example, we intentionally keep the entire spatial (cell) dimension in a single, large chunk and additionally split the storage chunking along the time dimension.
Splitting storage chunks is a particularly poor choice, especially when performing temporal reductions, which is reflected in a rather complex task graph:
# time=12 splits each 24-step storage chunk in half;
# cell=-1 puts the entire cell dimension into a single chunk.
bad = xr.open_dataset(urlpath, chunks={"cell": -1, "time": 12})
bad["2t"].mean("time").data.visualize()
A better pattern would be to define Dask chunks that combine several storage chunks along the time dimension:
# Each Dask chunk now spans many storage chunks along the time dimension.
good = xr.open_dataset(urlpath, chunks={"time": 256})
good["2t"].mean("time").data.visualize()
Tip
Technically, the ideal Dask chunk size depends on the amount of memory available to each CPU.
A good rule of thumb is to aim for Dask chunks of around 100 MB.
This size allows for an adequate memory usage buffer, even when many Dask tasks are being executed in parallel.
In recent versions of Dask, you can also set the chunk size to "auto", which lets Dask pick chunk sizes based on the configured target chunk size and the underlying storage chunking (see the sketch below).
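As a sketch of the "auto" option (reusing urlpath from above; the resulting chunk sizes depend on your local Dask configuration):
import numpy as np

auto = xr.open_dataset(urlpath, chunks="auto")
arr = auto["2t"].data

# Largest chunk shape and its approximate size in memory.
chunk_mb = np.prod(arr.chunksize) * arr.dtype.itemsize / 1e6
print(arr.chunksize, f"~{chunk_mb:.0f} MB per Dask chunk")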
None of the examples above has triggered any actual computation yet.
Execution is triggered by accessing the .values attribute of an xarray DataArray, by plotting it, or by calling the .compute() method explicitly:
good["2t"].mean("time").compute()
<xarray.DataArray '2t' (cell: 49152)> Size: 197kB
array([299.24762, 299.25766, 299.37292, ..., 299.55402, 299.43484,
299.50705], shape=(49152,), dtype=float32)
Coordinates:
* cell (cell) float32 197kB 0.0 1.0 2.0 ... 4.915e+04 4.915e+04 4.915e+04
crs float32 4B 0.0
lat (cell) float32 197kB 0.5968 1.194 1.194 ... -1.194 -1.194 -0.5968
lon (cell) float32 197kB 45.0 45.7 44.3 45.0 ... 315.7 314.3 315.0
Attributes:
grid_mapping: crs
levtype: surface
long_name: 2 metre temperature
standard_name:
units: K
Take-home messages
Dask enables scalable analysis by separating definition from execution.
Storage chunks define how data is laid out on disk and thus set the minimum I/O cost.
Dask chunks define how data is partitioned in memory for parallel computation.
For good performance, Dask chunks should align with (or combine) storage chunks.
Prefer choosing suitable chunking when opening datasets.