{ "cells": [ { "cell_type": "markdown", "id": "b42815b5-20ee-458e-aa5e-c3faa4aeccc6", "metadata": {}, "source": [ "# Define the encoding of a Zarr store" ] }, { "cell_type": "code", "execution_count": 57, "id": "682d58aa-242a-44ff-8a0c-f549081cc325", "metadata": { "execution": { "iopub.execute_input": "2024-11-05T13:55:09.244127Z", "iopub.status.busy": "2024-11-05T13:55:09.243482Z", "iopub.status.idle": "2024-11-05T13:55:09.394480Z", "shell.execute_reply": "2024-11-05T13:55:09.394205Z", "shell.execute_reply.started": "2024-11-05T13:55:09.244078Z" } }, "outputs": [ { "data": { "text/plain": [
"<xarray.Dataset> Size: 7GB\n",
"Dimensions:  (time: 64, cell: 196608, crs: 1, level: 13)\n",
"Coordinates:\n",
"  * crs      (crs) float64 8B nan\n",
"  * level    (level) int64 104B 50 100 150 200 250 300 ... 600 700 850 925 1000\n",
"  * time     (time) datetime64[ns] 512B 2024-09-01T03:00:00 ... 2024-09-11\n",
"Dimensions without coordinates: cell\n",
"Data variables: (12/39)\n",
"    100u     (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>\n",
"    100v     (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>\n",
"    10u      (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>\n",
"    10v      (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>\n",
"    2d       (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>\n",
"    2t       (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>\n",
"    ...       ...\n",
"    tp       (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>\n",
"    ttr      (time, cell) float32 50MB dask.array<chunksize=(6, 16384), meta=np.ndarray>\n",
"    u        (time, level, cell) float32 654MB dask.array<chunksize=(6, 1, 16384), meta=np.ndarray>\n",
"    v        (time, level, cell) float32 654MB dask.array<chunksize=(6, 1, 16384), meta=np.ndarray>\n",
"    vo       (time, level, cell) float32 654MB dask.array<chunksize=(6, 1, 16384), meta=np.ndarray>\n",
"    w        (time, level, cell) float32 654MB dask.array<chunksize=(6, 1, 16384), meta=np.ndarray>"
] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [
"import intake\n",
"import numcodecs\n",
"import numpy as np\n",
"import xarray as xr\n",
"\n",
"\n",
"cat = intake.open_catalog(\"https://tcodata.mpimet.mpg.de/internal.yaml\")\n",
"ds = cat.HIFS(datetime=\"2024-09-01\").to_dask()\n",
"ds"
] }, { "cell_type": "markdown", "id": "582a242c-58a1-485b-9bbc-4f83ee6f9633", "metadata": {}, "source": [ "## Data types\n", "\n", "Explicitly set the output data type to single-precision float for all floating-point variables. All other data types are passed through unchanged." ] }, { "cell_type": "code", "execution_count": 50, "id": "f2204aff-f95f-4599-ba14-5d1bdb845a4f", "metadata": { "execution": { "iopub.execute_input": "2024-11-05T13:54:20.190056Z", "iopub.status.busy": "2024-11-05T13:54:20.189935Z", "iopub.status.idle": "2024-11-05T13:54:20.205950Z", "shell.execute_reply": "2024-11-05T13:54:20.205678Z", "shell.execute_reply.started": "2024-11-05T13:54:20.190047Z" } }, "outputs": [ { "data": { "text/plain": [ "'float32'" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [
"def get_dtype(da):\n",
"    # Cast all float subtypes (e.g. float64) down to single precision;\n",
"    # keep integer, datetime, and other types unchanged.\n",
"    if np.issubdtype(da.dtype, np.floating):\n",
"        return \"float32\"\n",
"    else:\n",
"        return da.dtype\n",
"\n",
"\n",
"get_dtype(ds[\"tcwv\"])"
] }, { "cell_type": "markdown", "id": "8ac446b7-2581-4432-bd85-691472efa094", "metadata": {}, "source": [ "## Chunking\n", "\n", "We define [multi-dimensional chunks](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/nc4chunking/WhatIsChunking.html) for more efficient data access.\n", "We aim for a chunk size of about 1 MB, which is a reasonable choice when accessing data via HTTP; a short check after the compression section verifies this. Depending on the total size of your dataset, this chunk size may result in millions (!) of individual files, which might cause problems on some file systems." ] },
{ "cell_type": "code", "execution_count": 51, "id": "be65db51-61d3-4d3c-87d7-a582adceb68e", "metadata": { "execution": { "iopub.execute_input": "2024-11-05T13:54:20.206754Z", "iopub.status.busy": "2024-11-05T13:54:20.206615Z", "iopub.status.idle": "2024-11-05T13:54:20.225166Z", "shell.execute_reply": "2024-11-05T13:54:20.224908Z", "shell.execute_reply.started": "2024-11-05T13:54:20.206741Z" } }, "outputs": [ { "data": { "text/plain": [ "(24, 4096)" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [
"def get_chunks(dimensions):\n",
"    # Variables with a vertical dimension get smaller horizontal chunks,\n",
"    # so that the number of values per chunk stays the same.\n",
"    if \"level\" in dimensions:\n",
"        chunks = {\n",
"            \"time\": 24,\n",
"            \"cell\": 4**5,\n",
"            \"level\": 4,\n",
"        }\n",
"    else:\n",
"        chunks = {\n",
"            \"time\": 24,\n",
"            \"cell\": 4**6,\n",
"        }\n",
"\n",
"    return tuple(chunks[d] for d in dimensions)\n",
"\n",
"\n",
"get_chunks(ds[\"tcwv\"].dims)"
] }, { "cell_type": "markdown", "id": "2f5a8a60-a86e-4254-9a32-d11660cf3d06", "metadata": {}, "source": [ "## Compression\n", "\n", "We compress all variables using Zstd inside a Blosc container. We also enable bit shuffling, which groups the bits of neighboring values and typically improves the compression ratio of floating-point data." ] }, { "cell_type": "code", "execution_count": 52, "id": "9054052e-ee08-4922-9dee-bd8767fa4c59", "metadata": { "execution": { "iopub.execute_input": "2024-11-05T13:54:20.226219Z", "iopub.status.busy": "2024-11-05T13:54:20.226092Z", "iopub.status.idle": "2024-11-05T13:54:20.243254Z", "shell.execute_reply": "2024-11-05T13:54:20.242986Z", "shell.execute_reply.started": "2024-11-05T13:54:20.226208Z" } }, "outputs": [ { "data": { "text/plain": [ "Blosc(cname='zstd', clevel=5, shuffle=BITSHUFFLE, blocksize=0)" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [
"def get_compressor():\n",
"    # Zstd inside a Blosc container; BITSHUFFLE is the named constant for 2.\n",
"    return numcodecs.Blosc(\"zstd\", shuffle=numcodecs.Blosc.BITSHUFFLE)\n",
"\n",
"\n",
"get_compressor()"
] },
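{ "cell_type": "markdown", "id": "3f1c2d4e-5a6b-4c7d-8e9f-0a1b2c3d4e5f", "metadata": {}, "source": [ "As a quick sanity check, we can verify both choices on single variables: the nominal (uncompressed) chunk size should stay below the roughly 1 MB target, and the compressor should yield a noticeable reduction.\n", "The helper `chunk_nbytes` below is a small sketch for illustration only and is not needed to write the store." ] }, { "cell_type": "code", "execution_count": null, "id": "7a8b9c0d-1e2f-4a3b-8c4d-5e6f7a8b9c0d", "metadata": {}, "outputs": [], "source": [
"def chunk_nbytes(da):\n",
"    # Number of values per chunk times the size of a single value.\n",
"    return np.prod(get_chunks(da.dims)) * np.dtype(get_dtype(da)).itemsize\n",
"\n",
"\n",
"# Nominal chunk sizes for a surface (2D) and a level (3D) variable.\n",
"print({var: f\"{chunk_nbytes(ds[var]) / 1024**2:.3f} MiB\" for var in (\"tcwv\", \"t\")})\n",
"\n",
"# Compress one chunk-sized block in memory and report the ratio.\n",
"block = np.ascontiguousarray(\n",
"    ds[\"tcwv\"].isel(time=slice(0, 24), cell=slice(0, 4**6)).to_numpy(),\n",
"    dtype=\"float32\",\n",
")\n",
"print(f\"compression ratio: {block.nbytes / len(get_compressor().encode(block)):.1f}\")"
] },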
{ "cell_type": "markdown", "id": "b46eb072-60f6-4fc1-af1c-d2a7a6245e87", "metadata": { "execution": { "iopub.execute_input": "2024-11-05T13:07:53.171967Z", "iopub.status.busy": "2024-11-05T13:07:53.171874Z", "iopub.status.idle": "2024-11-05T13:07:53.188782Z", "shell.execute_reply": "2024-11-05T13:07:53.188501Z", "shell.execute_reply.started": "2024-11-05T13:07:53.171958Z" } }, "source": [ "## Plug and play\n", "\n", "Finally, we can put the pieces together to define an encoding for the whole dataset.\n", "The following function loops over all variables (except the dimension coordinates) and creates an encoding dictionary for each of them." ] }, { "cell_type": "code", "execution_count": 53, "id": "44c93bbb-26bd-46eb-bd23-f75ad24fe53d", "metadata": { "execution": { "iopub.execute_input": "2024-11-05T13:54:20.243725Z", "iopub.status.busy": "2024-11-05T13:54:20.243631Z", "iopub.status.idle": "2024-11-05T13:54:20.261276Z", "shell.execute_reply": "2024-11-05T13:54:20.261003Z", "shell.execute_reply.started": "2024-11-05T13:54:20.243716Z" } }, "outputs": [ { "data": { "text/plain": [ "{'t': {'compressor': Blosc(cname='zstd', clevel=5, shuffle=BITSHUFFLE, blocksize=0),\n", "  'dtype': 'float32',\n", "  'chunks': (24, 4, 1024)},\n", " '2t': {'compressor': Blosc(cname='zstd', clevel=5, shuffle=BITSHUFFLE, blocksize=0),\n", "  'dtype': 'float32',\n", "  'chunks': (24, 4096)}}" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [
"def get_encoding(dataset):\n",
"    return {\n",
"        var: {\n",
"            \"compressor\": get_compressor(),\n",
"            \"dtype\": get_dtype(dataset[var]),\n",
"            \"chunks\": get_chunks(dataset[var].dims),\n",
"        }\n",
"        for var in dataset.variables\n",
"        if var not in dataset.dims\n",
"    }\n",
"\n",
"\n",
"get_encoding(ds[[\"t\", \"2t\"]])"
] }, { "cell_type": "markdown", "id": "76a4fc97-d04f-4d60-8f21-66198dbbabf9", "metadata": { "execution": { "iopub.status.busy": "2024-11-05T13:56:34.389258Z", "iopub.status.idle": "2024-11-05T13:56:34.389374Z", "shell.execute_reply": "2024-11-05T13:56:34.389316Z", "shell.execute_reply.started": "2024-11-05T13:56:34.389311Z" } }, "source": [ "The encoding dictionary can be passed to the `to_zarr()` method.\n", "When using Dask, make sure that the Dask chunks align with the selected Zarr chunks, i.e. that every Zarr chunk is written by exactly one Dask chunk.\n", "Otherwise, the Zarr library will raise an error to prevent multiple Dask chunks from writing to the same chunk on disk." ] }, { "cell_type": "code", "execution_count": null, "id": "76727796-e7e9-469f-ac70-918d7117440e", "metadata": {}, "outputs": [], "source": [
"# Rechunk so that every Zarr chunk is covered by exactly one Dask chunk.\n",
"ds.chunk({\"time\": 24, \"level\": 4, \"cell\": -1}).to_zarr(\n",
"    \"test_dataset.zarr\", encoding=get_encoding(ds)\n",
")"
] },
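{ "cell_type": "markdown", "id": "9e8d7c6b-5a4f-4e3d-9c2b-1a0f9e8d7c6b", "metadata": {}, "source": [ "As an optional final check, the store can be opened again to confirm that chunking, data type, and compressor ended up as intended. This assumes the write in the previous cell has completed." ] }, { "cell_type": "code", "execution_count": null, "id": "0f1e2d3c-4b5a-4968-8776-655443322110", "metadata": {}, "outputs": [], "source": [
"# Re-open the store and inspect the on-disk encoding of one variable.\n",
"xr.open_zarr(\"test_dataset.zarr\")[\"t\"].encoding"
] } ], "metadata": { "jupytext": { "notebook_metadata_filter": "-jupytext.text_representation.jupytext_version" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }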