How to write a Zarr store - 2.0#

Note

This example shows how to configure the encoding for Zarr stores created according to the v2 specification. The more recent v3 specification requires subtle adjustments, particularly with regard to compression definitions.

In general, writing a Zarr store is as easy as calling the to_zarr method on an Xarray dataset. However, in practice, there are some tricks that can make the process more robust in certain circumstances.

First, we will again load a dummy dataset to work with:

import intake


cat = intake.open_catalog("https://tcodata.mpimet.mpg.de/internal.yaml")
ds = cat.HIFS(datetime="2024-09-01").to_dask()
ds = ds.sel(time=slice("2024-09-01", "2024-09-01 18:00"))

Region writes#

When writing large datasets (as we encourage), it may be necessary to split the processing into several batches. Fortunately, the Zarr specification allows this to be handled quite easily.

Initially, we will pass the compute=False keyword to the to_zarr method. This will create the Zarr store including all metadata, but not write any actual data chunks:

ds.to_zarr("test_dataset2.zarr", compute=False, zarr_format=2)
Delayed('_finalize_store-b72a3b2c-568f-4457-a151-9b7659c80c80')

Next, we can use a region write to fill a specific region of the store with the actual data. All variables in the dataset must share the dimension of the region write, so we’ll need to remove crs and level first. Using this approach, it is easy to split a long processing job into several batches, which can even be run in parallel (!).

ds.drop_vars(["crs", "level"]).isel(time=slice(24, 48)).to_zarr(
    "test_dataset2.zarr", region={"time": slice(24, 48)}, zarr_format=2
)
<xarray.backends.zarr.ZarrStore at 0x7f62a5388360>
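
Since the store and its metadata already exist, the same pattern extends naturally to a loop over batches. The following is only a sketch: the batch size of 24 time steps is illustrative and should be chosen to align with the chunking of the store, and each iteration could just as well be submitted as a separate, parallel job:

ds_data = ds.drop_vars(["crs", "level"])
batch_size = 24  # illustrative only; align with the on-disk chunking

for start in range(0, ds_data.sizes["time"], batch_size):
    batch = slice(start, min(start + batch_size, ds_data.sizes["time"]))
    # Each iteration fills one region of the pre-created store.
    ds_data.isel(time=batch).to_zarr(
        "test_dataset2.zarr", region={"time": batch}, zarr_format=2
    )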

Number of inodes (files)#

Internally, a Zarr store consists of directories containing data chunks for each variable. By default, when writing data in Zarr v2 format, all data chunks are stored in one directory per variable, with dimensions separated by a dot in the file name (e.g. 0.1.2). For larger datasets, this can result in millions of files in a single directory, which certain file systems (e.g. Lustre) do not handle well.
One solution is to use / as the so-called dimension separator, creating additional levels of directories, each containing significantly fewer files. When writing Zarr according to the v3 specification (zarr_format=3), this nested directory structure is used by default.
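
As an illustration, one way to obtain the nested layout for a v2 store is to create the target store with zarr directly and hand it to to_zarr. The following is only a sketch and assumes zarr-python 2.x (version 2.8 or newer), where DirectoryStore accepts a dimension_separator argument; with zarr-python 3 installed, the store classes and defaults differ. The path name is an example:

import zarr

# zarr-python 2.x only: chunk files are nested as 0/1/2 instead of the flat 0.1.2
store = zarr.DirectoryStore("test_dataset_nested.zarr", dimension_separator="/")
ds.to_zarr(store)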

Enable retries via HTTP#

When writing to an object store via HTTP, packets may get lost or individual requests may fail. One way to mitigate this issue is to define a client that retries the transfer if an error occurs. This can be done in the following way using the aiohttp and aiohttp_retry packages:

async def get_client(**kwargs):
    import aiohttp
    import aiohttp_retry

    # Retry up to three times with exponential back-off when the connection
    # drops or a low-level I/O error occurs.
    retry_options = aiohttp_retry.ExponentialRetry(
        attempts=3, exceptions={OSError, aiohttp.ServerDisconnectedError}
    )
    # fsspec will use this client for all HTTP requests to the store.
    retry_client = aiohttp_retry.RetryClient(
        raise_for_status=False, retry_options=retry_options
    )
    return retry_client


ds.to_zarr(
    "https://path/to/test_dataset.zarr",
    storage_options={"get_client": get_client},
    zarr_format=2,
)