---
file_format: mystnb
kernelspec:
  name: python3
execution:
  timeout: 60
---

# Define the encoding of a Zarr store

```{code-cell} ipython3
import intake
import numcodecs
import numpy as np
import xarray as xr

cat = intake.open_catalog("https://tcodata.mpimet.mpg.de/internal.yaml")
ds = cat.HIFS(datetime="2024-09-01").to_dask()
ds = ds.sel(time=slice("2024-09-01", "2024-09-01 18:00"))
ds
```

## Data types

We explicitly set the output data type to single-precision float for all floating-point subtypes.

```{code-cell} ipython3
def get_dtype(da):
    """Return the target on-disk dtype for a given DataArray."""
    if np.issubdtype(da.dtype, np.floating):
        return "float32"
    else:
        return da.dtype


get_dtype(ds["tcwv"])
```

## Chunking

We define [multi-dimensional chunks](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/nc4chunking/WhatIsChunking.html) for more efficient data access. We aim at a chunk size of about 1 MB, which is a reasonable choice when accessing data via HTTP. Depending on the total size of your dataset, this chunk size may result in millions (!) of individual files, which might cause problems on some file systems.

```{code-cell} ipython3
def get_chunks(dimensions):
    """Return a chunk tuple matching the given dimension names."""
    if "level" in dimensions:
        chunks = {
            "time": 6,
            "cell": 4**6,
            "level": 4,
        }
    else:
        chunks = {
            "time": 6,
            "cell": 4**7,
        }

    return tuple(chunks[d] for d in dimensions)


get_chunks(ds["tcwv"].dims)
```

## Compression

We compress all variables with Zstd inside a Blosc container. Increasing the compression level from its default value of 5 will usually result in a slightly better compression ratio without adding significant overhead.

```{code-cell} ipython3
def get_compressor():
    """Return the Blosc/Zstd compressor used for all variables."""
    return numcodecs.Blosc("zstd", clevel=6)


get_compressor()
```

## Plug and play

Finally, we can put the pieces together to define an encoding for the whole dataset. The following function loops over all variables (that are not a dimension) and creates an encoding dictionary.
```{code-cell} ipython3
def get_encoding(dataset):
    """Create an encoding dictionary for all non-dimension variables."""
    return {
        var: {
            "compressor": get_compressor(),
            "dtype": get_dtype(dataset[var]),
            "chunks": get_chunks(dataset[var].dims),
        }
        for var in dataset.variables
        if var not in dataset.dims
    }


get_encoding(ds[["t", "2t"]])
```

The encoding dictionary can be passed to the `to_zarr()` function. When using dask, make sure that the dask chunks align with the selected Zarr chunks. Otherwise, the Zarr library will raise an error to prevent multiple dask chunks from writing to the same chunk on disk.

```{code-cell} ipython3
ds.chunk({"time": 24, "level": 4, "cell": -1}).to_zarr(
    "test_dataset.zarr", encoding=get_encoding(ds)
)
```