How to write a Zarr store - 2.0#

In general, writing a Zarr store is as easy as calling the to_zarr method on an Xarray dataset. However, in practice, there are some tricks that can make the process more robust in certain circumstances.

First, we will again load a dummy dataset to work with:

[1]:
import intake


cat = intake.open_catalog("https://tcodata.mpimet.mpg.de/internal.yaml")
ds = cat.HIFS(datetime="2024-09-01").to_dask()
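
In the simplest case, storing the whole dataset is a single call. Below is a minimal sketch; the output path is just a placeholder:

[ ]:
# write the complete dataset, metadata and all data chunks, in one go
ds.to_zarr("my_dataset.zarr")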

Region writes#

When writing large datasets (as we encourage), it might be necessary to split the processing into several batches. Fortunately, the Zarr specification allows this to be handled quite easily.

Initially, we pass the compute=False keyword argument to the to_zarr method. This creates the Zarr store, including all metadata, but does not write any actual data chunks:

[2]:
ds.to_zarr("test_dataset.zarr", compute=False)
[2]:
Delayed('_finalize_store-e1735278-0418-4916-a0ec-1255d8a66b3d')

Next, we can use a region write to fill a specific region of the store with actual data. All variables that are written must contain the dimension of the region write, so we need to drop crs and level first. Using this approach, it is easy to split a long processing job into several batches, which can even be run in parallel (!).

[3]:
ds.drop_vars(["crs", "level"]).isel(time=slice(24, 48)).to_zarr(
    "test_dataset.zarr", region={"time": slice(24, 48)}
)
[3]:
<xarray.backends.zarr.ZarrStore at 0x157e75c40>
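
This pattern generalizes to the whole time axis. Below is a minimal sketch (the batch size of 24 time steps is an arbitrary choice) that loops over all batches sequentially; in practice, each iteration could just as well be submitted as its own batch job or worker task:

[ ]:
batch = 24  # number of time steps written per region write

for start in range(0, ds.sizes["time"], batch):
    stop = min(start + batch, ds.sizes["time"])
    # each batch only touches its own slice of the time dimension
    ds.drop_vars(["crs", "level"]).isel(time=slice(start, stop)).to_zarr(
        "test_dataset.zarr", region={"time": slice(start, stop)}
    )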

Dimension separator#

Internally, a Zarr store is made up of a hierarchy of directories containing the data chunks for each variable. By default, all data chunks of a variable are stored in a single directory, with the chunk indices along each dimension separated by a dot in the filename (e.g. 0.1.2). This can result in millions of files in a single directory, which doesn't play well with certain file systems (e.g. Lustre).

One way around this is to use / as the dimension separator, creating additional levels of directories, each containing significantly fewer files.

[4]:
import zarr


store = zarr.storage.DirectoryStore("test_dataset2.zarr", dimension_separator="/")
ds.isel(time=slice(0, 2)).to_zarr(store)  # store a couple of time steps as an example
[4]:
<xarray.backends.zarr.ZarrStore at 0x105fd02c0>
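
To see the effect of the separator, you can inspect the resulting store on disk. Here is a small sketch (assuming the store above was written to the current working directory) that lists the first few entries of the nested chunk hierarchy:

[ ]:
from pathlib import Path

# with "/" as separator, each chunk index becomes its own directory level
for path in sorted(Path("test_dataset2.zarr").rglob("*"))[:10]:
    print(path)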

Enable retries via HTTP#

When writing to an object store via HTTP, individual requests may fail or get lost. One way to mitigate this issue is to define a client that retries the delivery if an error occurs. This can be done in the following way using the aiohttp and aiohttp_retry packages:

[ ]:
# client factory that is handed to the HTTP file system via storage_options
async def get_client(**kwargs):
    import aiohttp
    import aiohttp_retry

    # retry up to three times with exponential back-off on connection errors
    retry_options = aiohttp_retry.ExponentialRetry(
        attempts=3, exceptions={OSError, aiohttp.ServerDisconnectedError}
    )
    retry_client = aiohttp_retry.RetryClient(
        raise_for_status=False, retry_options=retry_options
    )
    return retry_client


ds.to_zarr(
    "https://path/to/test_dataset.zarr", storage_options={"get_client": get_client}
)