---
file_format: mystnb
kernelspec:
  name: python3
execution:
  timeout: 60
---

# Define the encoding of a Zarr store

```{code-cell} ipython3
import intake
import numcodecs
import numpy as np
import xarray as xr

cat = intake.open_catalog("https://tcodata.mpimet.mpg.de/internal.yaml")
ds = cat.HIFS(datetime="2024-09-01").to_dask()
ds = ds.sel(time=slice("2024-09-01", "2024-09-01 18:00"))
ds
```

## Data types

We explicitly set the output data type to single-precision float for all floating-point subtypes.

```{code-cell} ipython3
def get_dtype(da):
    """Return the target on-disk dtype for a given DataArray."""
    if np.issubdtype(da.dtype, np.floating):
        return "float32"
    else:
        return da.dtype


get_dtype(ds["tcwv"])
```

## Chunking

We define [multi-dimensional chunks](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/nc4chunking/WhatIsChunking.html) for more efficient data access. We aim at a chunk size of about 1 MB, which is a reasonable choice when accessing data via HTTP. Depending on the total size of your dataset, this chunk size may result in millions (!) of individual files, which might cause problems on some file systems.

```{code-cell} ipython3
def get_chunks(dimensions):
    """Return a chunk tuple matching the given dimension names."""
    if "level" in dimensions:
        chunks = {
            "time": 6,
            "cell": 4**6,
            "level": 4,
        }
    else:
        chunks = {
            "time": 6,
            "cell": 4**7,
        }

    return tuple(chunks[d] for d in dimensions)


get_chunks(ds["tcwv"].dims)
```

## Compression

We compress all variables with Zstd inside a Blosc container. Increasing the compression level from its default value of 5 will usually result in a slightly better compression ratio without adding significant overhead.

```{code-cell} ipython3
def get_compressor():
    """Return the Blosc/Zstd compressor used for all variables."""
    return numcodecs.Blosc("zstd", clevel=6)


get_compressor()
```

## Plug and play

Finally, we can put the pieces together to define an encoding for the whole dataset. The following function loops over all variables (that are not a dimension) and creates an encoding dictionary.
```{code-cell} ipython3
def get_encoding(dataset):
    """Create an encoding dictionary for all non-dimension variables."""
    return {
        var: {
            "compressor": get_compressor(),
            "dtype": get_dtype(dataset[var]),
            "chunks": get_chunks(dataset[var].dims),
        }
        for var in dataset.variables
        if var not in dataset.dims
    }


get_encoding(ds[["t", "2t"]])
```

The encoding dictionary can be passed to the `to_zarr()` function. When using dask, make sure that the dask chunks align with the selected Zarr chunks. Otherwise, the Zarr library will raise an error to prevent multiple dask chunks from writing to the same chunk on disk.

```{code-cell} ipython3
ds.chunk({"time": 24, "level": 4, "cell": -1}).to_zarr(
    "test_dataset.zarr", encoding=get_encoding(ds)
)
```