Dask Collections: Arrays#


We recommend the official dask documentation for the interested user via https://docs.dask.org/en/stable/

Dask Collections consist of implementations for

  • Array (xarray, numpy)

  • DataFrame (tables, pandas)

  • Bag (low level dask)

  • Delayed (low level dask)

  • Futures (need a client)

This notebook

  1. only covers the dask array. A dask array can be interpreted as a collection of numpy arrays.

  2. works well on the python3/unstable kernel on DKRZ’s Jupyterhub.

  3. tries to calculate and plot quantiles of a large ensemble dataset

Dask arrays feature without distributed:

  • Parallel and larger-than-memory processing:

    • Only process one piece of the array at once - the rest is not loaded into memory yet

    • Automatic parallelization over parts of the dask array - multithreading per default

  • Lazy: Arrays are computed only when requested. High level software like xarray anticipates when it is rqeuired to evaluate an array. Until then, a Task Graph is filled.

We will work with the following packages:

import xarray as xr
from IPython.display import Image
import dask
from dask.diagnostics import ProgressBar
import hvplot.xarray
import numcodecs
from distributed import Client
import intake