How to write a Zarr store - 2.0#

Note

This example shows how to configure the encoding for Zarr stores created according to the v2 specification. The more recent v3 specification requires subtle adjustments, particularly with regard to compression definitions.

In general, writing a Zarr store is as easy as calling the to_zarr method on an Xarray dataset. However, in practice, there are some tricks that can make the process more robust in certain circumstances.

First, we will again load a dummy dataset to work with:

import intake


cat = intake.open_catalog("https://tcodata.mpimet.mpg.de/internal.yaml")
ds = cat.HIFS(datetime="2024-09-01").to_dask()
ds = ds.sel(time=slice("2024-09-01", "2024-09-01 18:00"))

Region writes#

When writing large datasets (as we encourage), it may be necessary to split the processing into several batches. Fortunately, the Zarr specification allows this to be handled quite easily.

Initially, we will pass the compute=False keyword to the to_zarr method. This will create the Zarr store including all metadata, but not write any actual data chunks:

ds.to_zarr("test_dataset2.zarr", compute=False, zarr_format=2)
Delayed('_finalize_store-b72a3b2c-568f-4457-a151-9b7659c80c80')

Next, we can use a region write to fill a specific region of the store with the actual data. All variables in the dataset must share the dimension of the region write, so we’ll need to remove crs and level first. Using this approach, it is easy to split a long processing job into several batches, which can even be run in parallel (!).

ds.drop_vars(["crs", "level"]).isel(time=slice(24, 48)).to_zarr(
    "test_dataset2.zarr", region={"time": slice(24, 48)}, zarr_format=2
)
<xarray.backends.zarr.ZarrStore at 0x7f62a5388360>
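
Since the store and its metadata already exist, the same pattern extends naturally to a loop over batches. The following is only a sketch: the batch size of 24 time steps is illustrative and should be chosen to align with the chunking of the store, and each iteration could just as well be submitted as a separate, parallel job:

ds_data = ds.drop_vars(["crs", "level"])
batch_size = 24  # illustrative only; align with the on-disk chunking

for start in range(0, ds_data.sizes["time"], batch_size):
    batch = slice(start, min(start + batch_size, ds_data.sizes["time"]))
    # Each iteration fills one region of the pre-created store.
    ds_data.isel(time=batch).to_zarr(
        "test_dataset2.zarr", region={"time": batch}, zarr_format=2
    )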

Number of inodes (files)#

Internally, a Zarr store consists of directories containing data chunks for each variable. By default, when writing data in Zarr v2 format, all data chunks are stored in one directory per variable, with dimensions separated by a dot in the file name (e.g. 0.1.2). For larger datasets, this can result in millions of files in a single directory, which certain file systems (e.g. Lustre) do not handle well.
One solution is to use / as the so-called dimension separator, creating additional levels of directories, each containing significantly fewer files. When writing Zarr according to the v3 specification (zarr_format=3), this nested directory structure is used by default.
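
As an illustration, one way to obtain the nested layout for a v2 store is to create the target store with zarr directly and hand it to to_zarr. The following is only a sketch and assumes zarr-python 2.x (version 2.8 or newer), where DirectoryStore accepts a dimension_separator argument; with zarr-python 3 installed, the store classes and defaults differ. The path name is an example:

import zarr

# zarr-python 2.x only: chunk files are nested as 0/1/2 instead of the flat 0.1.2
store = zarr.DirectoryStore("test_dataset_nested.zarr", dimension_separator="/")
ds.to_zarr(store)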

Enable retries via HTTP#

When writing to an object store via HTTP, packets may get lost or individual requests may fail. One way to mitigate this issue is to define a client that retries the transfer if an error occurs. This can be done in the following way using the aiohttp and aiohttp_retry packages:

async def get_client(**kwargs):
    import aiohttp
    import aiohttp_retry

    # Retry up to three times with exponential back-off when the connection
    # drops or a low-level I/O error occurs.
    retry_options = aiohttp_retry.ExponentialRetry(
        attempts=3, exceptions={OSError, aiohttp.ServerDisconnectedError}
    )
    # fsspec will use this client for all HTTP requests to the store.
    retry_client = aiohttp_retry.RetryClient(
        raise_for_status=False, retry_options=retry_options
    )
    return retry_client


ds.to_zarr(
    "https://path/to/test_dataset.zarr",
    storage_options={"get_client": get_client},
    zarr_format=2,
)