{ "cells": [ { "cell_type": "markdown", "id": "0690e46e-38f3-46be-8c9e-9b863b0aa611", "metadata": {}, "source": [ "# How to write a Zarr store - 2.0\n", "\n", "In general, writing a Zarr store is as easy as calling the `to_zarr` method on an Xarray dataset.\n", "However, in practice, there are some tricks that can make the process more robust in certain circumstances.\n", "\n", "First, we will again load a dummy dataset to work with:" ] }, { "cell_type": "code", "execution_count": 1, "id": "f3d6ecd4-2430-4386-964d-e8f66fc95f90", "metadata": { "execution": { "iopub.execute_input": "2024-11-07T08:49:23.178585Z", "iopub.status.busy": "2024-11-07T08:49:23.178225Z", "iopub.status.idle": "2024-11-07T08:49:24.979216Z", "shell.execute_reply": "2024-11-07T08:49:24.978899Z", "shell.execute_reply.started": "2024-11-07T08:49:23.178560Z" } }, "outputs": [], "source": [ "import intake\n", "\n", "\n", "cat = intake.open_catalog(\"https://tcodata.mpimet.mpg.de/internal.yaml\")\n", "ds = cat.HIFS(datetime=\"2024-09-01\").to_dask()" ] }, { "cell_type": "markdown", "id": "64e415d7-97d1-464a-b655-b0154bae2654", "metadata": { "execution": { "iopub.execute_input": "2024-11-05T14:08:28.737758Z", "iopub.status.busy": "2024-11-05T14:08:28.736472Z", "iopub.status.idle": "2024-11-05T14:08:28.777690Z", "shell.execute_reply": "2024-11-05T14:08:28.777166Z", "shell.execute_reply.started": "2024-11-05T14:08:28.737723Z" } }, "source": [ "## Region writes\n", "\n", "When writing large datasets (as we encourage), it might be necessary to split the processing into several batches.\n", "Fortunately, the Zarr specification allows this to be handled quite easily.\n", "\n", "Initially, we will pass the `compute=False` keyword to the `to_zarr` method.\n", "This will create the Zarr store including all metadata, but not write any actual data chunks:" ] }, { "cell_type": "code", "execution_count": 2, "id": "e2c8bffd-7c35-4979-8314-7de4d9eaaf0b", "metadata": { "execution": { "iopub.execute_input": 
"2024-11-07T08:49:25.787031Z", "iopub.status.busy": "2024-11-07T08:49:25.785996Z", "iopub.status.idle": "2024-11-07T08:49:26.336123Z", "shell.execute_reply": "2024-11-07T08:49:26.335882Z", "shell.execute_reply.started": "2024-11-07T08:49:25.786988Z" } }, "outputs": [ { "data": { "text/plain": [ "Delayed('_finalize_store-e1735278-0418-4916-a0ec-1255d8a66b3d')" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.to_zarr(\"test_dataset.zarr\", compute=False)" ] }, { "cell_type": "markdown", "id": "c88105fe-2a46-49c6-8e86-99fce1032db3", "metadata": {}, "source": [ "In the next step, we can use a region write to fill a specific data region with the actual data.\n", "All variables in the dataset must share the dimension of the region write, so we'll need to remove `crs` and `level` first.\n", "Using this approach, it is easy to split a long processing job into several batches, which can even be run in parallel (!)." ] }, { "cell_type": "code", "execution_count": 3, "id": "dcd3ccc4-6269-4061-9241-60b2ae458e61", "metadata": { "execution": { "iopub.execute_input": "2024-11-07T08:49:28.106205Z", "iopub.status.busy": "2024-11-07T08:49:28.105751Z", "iopub.status.idle": "2024-11-07T08:49:32.767677Z", "shell.execute_reply": "2024-11-07T08:49:32.767361Z", "shell.execute_reply.started": "2024-11-07T08:49:28.106177Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.drop_vars([\"crs\", \"level\"]).isel(time=slice(24, 48)).to_zarr(\n", " \"test_dataset.zarr\", region={\"time\": slice(24, 48)}\n", ")" ] }, { "cell_type": "markdown", "id": "535254c0-781f-44a4-80f2-6825df4cc79c", "metadata": {}, "source": [ "## Dimension separator\n", "\n", "Internally, a Zarr store is made up of a hierarchy of directories containing the data chunks for each variable. 
By default, all data chunks are stored in one directory per variable, with dimensions being separated by a dot in the filename (e.g. `0.1.2`). This can result in millions of files in a single directory, which doesn't play well with certain file systems (e.g. Lustre).\n", "\n", "One way around this is to use `/` as the dimension separator, creating additional levels of directories, each containing significantly fewer files." ] }, { "cell_type": "code", "execution_count": 4, "id": "4aac4513-4f01-45f6-bdf5-2b8a3dd02887", "metadata": { "execution": { "iopub.execute_input": "2024-11-07T08:49:32.768495Z", "iopub.status.busy": "2024-11-07T08:49:32.768332Z", "iopub.status.idle": "2024-11-07T08:49:33.926420Z", "shell.execute_reply": "2024-11-07T08:49:33.926147Z", "shell.execute_reply.started": "2024-11-07T08:49:32.768482Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import zarr\n", "\n", "\n", "store = zarr.storage.DirectoryStore(\"test_dataset2.zarr\", dimension_separator=\"/\")\n", "ds.isel(time=slice(0, 2)).to_zarr(store) # store a couple of timesteps as example" ] }, { "cell_type": "markdown", "id": "48aef478-ec9f-42b9-83c3-0a03bf45ec71", "metadata": {}, "source": [ "## Enable retries via HTTP\n", "\n", "When writing to an object store via HTTP, it is possible that packets may be lost.\n", "One way to mitigate this issue is to define a client that will retry the delivery if an error occurs.\n", "This can be done in the following way using the `aiohttp` package:" ] }, { "cell_type": "code", "execution_count": null, "id": "7283195f-4915-44c6-a0e0-a280012e7aa7", "metadata": {}, "outputs": [], "source": [ "async def get_client(**kwargs):\n", " import aiohttp\n", " import aiohttp_retry\n", "\n", " retry_options = aiohttp_retry.ExponentialRetry(\n", " attempts=3, exceptions={OSError, aiohttp.ServerDisconnectedError}\n", " )\n", " retry_client = aiohttp_retry.RetryClient(\n", " 
raise_for_status=False, retry_options=retry_options\n", " )\n", " return retry_client\n", "\n", "\n", "ds.to_zarr(\n", " \"https://path/to/test_dataset.zarr\", storage_options={\"get_client\": get_client}\n", ")" ] } ], "metadata": { "jupytext": { "notebook_metadata_filter": "-jupytext.text_representation.jupytext_version" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }