Define the encoding of a Zarr store#

[57]:
import intake
import numcodecs
import numpy as np
import xarray as xr


cat = intake.open_catalog("https://tcodata.mpimet.mpg.de/internal.yaml")
ds = cat.HIFS(datetime="2024-09-01").to_dask()
ds
[57]:
<xarray.Dataset> Size: 7GB
Dimensions:  (time: 64, cell: 196608, crs: 1, level: 13)
Coordinates:
  * crs      (crs) float64 8B nan
  * level    (level) int64 104B 50 100 150 200 250 300 ... 600 700 850 925 1000
  * time     (time) datetime64[ns] 512B 2024-09-01T03:00:00 ... 2024-09-11
Dimensions without coordinates: cell
Data variables: (12/39)
    100u     (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>
    100v     (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>
    10u      (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>
    10v      (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>
    2d       (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>
    2t       (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>
    ...       ...
    tp       (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>
    ttr      (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>
    u        (time, level, cell) float32 654MB dask.array<chunksize=(6, 1, 16384), meta=np.ndarray>
    v        (time, level, cell) float32 654MB dask.array<chunksize=(6, 1, 16384), meta=np.ndarray>
    vo       (time, level, cell) float32 654MB dask.array<chunksize=(6, 1, 16384), meta=np.ndarray>
    w        (time, level, cell) float32 654MB dask.array<chunksize=(6, 1, 16384), meta=np.ndarray>

Data types#

Explicitly set the output datatype to single precision float for all float subtypes.

[50]:
def get_dtype(da):
    if np.issubdtype(da.dtype, np.floating):
        return "float32"
    else:
        return da.dtype


get_dtype(ds["tcwv"])
[50]:
'float32'

Chunking#

We define multi-dimensional chunks for me efficient data access. We aim at a chunk size of about 1 MB which is a reasonable choice when accessing data via HTTP. Depending on the total size of your dataset, this chunksize may results in millions (!) of individual files, which might cause problems on some file systems.

[51]:
def get_chunks(dimensions):
    if "level" in dimensions:
        chunks = {
            "time": 24,
            "cell": 4**5,
            "level": 4,
        }
    else:
        chunks = {
            "time": 24,
            "cell": 4**6,
        }

    return tuple((chunks[d] for d in dimensions))


get_chunks(ds["tcwv"].dims)
[51]:
(24, 4096)

Compression#

We compress all variables using Zstd into a blosc container. We also enable bit shuffling.

[52]:
def get_compressor():
    return numcodecs.Blosc("zstd", shuffle=2)


get_compressor()
[52]:
Blosc(cname='zstd', clevel=5, shuffle=BITSHUFFLE, blocksize=0)

Plug and play#

Finally, we can put the pieces together to define an encoding for the whole dataset. The following function loops over all variables (that are not a dimension) and creates an encoding dictionary.

[53]:
def get_encoding(dataset):
    return {
        var: {
            "compressor": get_compressor(),
            "dtype": get_dtype(dataset[var]),
            "chunks": get_chunks(dataset[var].dims),
        }
        for var in dataset.variables
        if var not in dataset.dims
    }


get_encoding(ds[["t", "2t"]])
[53]:
{'t': {'compressor': Blosc(cname='zstd', clevel=5, shuffle=BITSHUFFLE, blocksize=0),
  'dtype': 'float32',
  'chunks': (24, 4, 1024)},
 '2t': {'compressor': Blosc(cname='zstd', clevel=5, shuffle=BITSHUFFLE, blocksize=0),
  'dtype': 'float32',
  'chunks': (24, 4096)}}

The encoding dictionary can be passed to the to_zarr() function. When using dask, make sure that the dask chunks match the selected Zarr chunks. Otherwise the Zarr library will throw an error to prevent multiple dask chunks from writing to the same chunk on disk.

[ ]:
ds.chunk({"time": 24, "level": 4, "cell": -1}).to_zarr(
    "test_dataset.zarr", encoding=get_encoding(ds)
)