{ "cells": [ { "cell_type": "markdown", "id": "dddb64f8-7c77-4367-81e9-10467c1e7f9a", "metadata": { "tags": [] }, "source": [ "# Retrieval from DKRZ tape archive\n", "\n", "This notebook explains how to efficiently retrieve data from the tape archive with [intake](https://intake.readthedocs.io/en/latest/) version 1 and [slkspec](https://github.com/observingClouds/slkspec), using EERIE data as an example. The notebook works well within the `/work/bm1344/conda-envs/py_312/` environment.\n", "\n", "Advantages of intake retrievals compared to other options such as command-line slk retrievals:\n", "\n", "- Opening the data from the catalog, doing coordinate look-ups and preparing workflows are *always* possible without retrieval and for free.\n", "- The same convenient and familiar `ds.load()` command gets the data - retrievals are included.\n", "- Optimized background retrieval management, e.g. by tape grouping\n", "- \"Recalls\" can be asynchronously submitted for later resuming of work. Just rerun the script; it can only become faster.\n", "- Catalog configurations include a specification of a shared and mostly quota-free Levante scratch cache for tape retrievals to prevent duplicates on disk and therefore quota issues" ] }, { "cell_type": "markdown", "id": "63f5d893-05e3-4b05-ba8b-46287a0ab563", "metadata": { "tags": [] }, "source": [ "## Workflow and speed\n", "\n", "\n", "### Subsetting\n", "\n", "When you open a dataset from the intake archive catalog that we introduce here, you can still browse and subset by coordinates as well as prepare dask workflows just as you know from other catalogs. As soon as dask starts to actually run data tasks, e.g. if you call `.compute()` or `.load()`, the retrieval workflow will be triggered. Therefore, it is of particular importance for these archive catalogs that you **first subset** the data before you do a compute. Otherwise, you will submit retrievals of too much data, which will not only take forever but may also break the underlying tape system." 
] }, { "cell_type": "markdown", "id": "ce858053-f67c-4d25-9bd4-6314170e1e25", "metadata": { "tags": [] }, "source": [ "### Data flow and speed\n", "\n", "The data flow from tape to your computer's memory passes through three stages:\n", "\n", "1. **tape->tapecache**: This transfer is called *recall*. It takes about O(30min) per tape with large fluctuations depending on the archive load. Each user is allowed 3 tape recalls in parallel. Each EERIE dataset is usually distributed across O(5) tapes.\n", "1. **tapecache->levante**: This transfer is called *retrieve*. It takes about O(1min).\n", "1. **levante->memory**: The final load.\n", "\n", "For experts, we collect more info in [this pad](https://pad.gwdg.de/rfmL1ntDQAqhVzokCuAcwA#)." ] }, { "cell_type": "markdown", "id": "a648a7f5-57f5-42db-9631-8d1aadbec10a", "metadata": { "tags": [] }, "source": [ "## Example: EERIE data in archive\n", "\n", "A key benefit of catalogs is that you do not need to know where and how the data is stored. For completeness, and because we can use the information to optimize our workflow, we explain how EERIE datasets in the archive are organized depending on their sizes:\n", "\n", "- **Small**: five years per file if five years of the full dataset are <1GB\n", "- **Medium**: one year per file if one year is <100GB and if a \"time\" dimension exists\n", "- **Large**: one year of one variable per file if one year of one variable is <100GB\n", "- **XL**: one month of one year of one variable per file in all other cases\n", "\n", "The root directories for the output of the German ESM contributions are:\n", "\n", "- ICON-ESM-ER: `/arch/bm1344/ICON/outdata/`\n", "- IFS-FESOM2-SR: `/arch/bm1344/IFS-FESOM2/outdata/`" ] }, { "cell_type": "markdown", "id": "d00fda90-9105-4b97-aa83-419b1bde0c63", "metadata": {}, "source": [ "## Small retrievals\n", "\n", "If you aim to retrieve data volumes O(<=10GB), you can **forget how EERIE data is archived** and just work with the datasets as if they were on disk and do 
a `.load()` on your subset. You will have to wait for about an hour until the command finishes and you can work with the data." ] }, { "cell_type": "code", "execution_count": 1, "id": "9d655b1a-9756-4925-8382-b16e7eda00c8", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/yaml": "main:\n args:\n path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/archive/main.yaml\n description: ''\n driver: intake.catalog.local.YAMLFileCatalog\n metadata: {}\n", "text/plain": [ "main:\n", " args:\n", " path: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/archive/main.yaml\n", " description: ''\n", " driver: intake.catalog.local.YAMLFileCatalog\n", " metadata: {}\n" ] }, "metadata": { "application/json": { "root": "main" } }, "output_type": "display_data" } ], "source": [ "import intake\n", "\n", "catalog = (\n", " \"https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/archive/main.yaml\"\n", ")\n", "#catalog=\"/work/bm1344/DKRZ/intake_catalogues/dkrz/archive/main.yaml\"\n", "eerie_cat = intake.open_catalog(catalog)\n", "eerie_cat" ] }, { "cell_type": "code", "execution_count": 2, "id": "71034372-df36-47bf-b14d-ffaeb45c8204", "metadata": { "nbsphinx": "hidden", "tags": [ "hide-input" ] }, "outputs": [], "source": [ "def find_data_sources(catalog,name=None):\n", " newname='.'.join(\n", " [ a \n", " for a in [name, catalog.name]\n", " if a\n", " ]\n", " )\n", " data_sources = []\n", "\n", " for key, entry in catalog.items():\n", " if isinstance(entry, intake.catalog.Catalog):\n", " if newname == \"main\":\n", " newname = None\n", " # If the entry is a subcatalog, recursively search it\n", " data_sources.extend(find_data_sources(entry, newname))\n", " elif isinstance(entry, intake.source.base.DataSource):\n", " if key.endswith('.nc'):\n", " continue\n", " if newname:\n", " data_sources.append(newname+\".\"+key)\n", " else:\n", " data_sources.append(key)\n", "\n", " return 
data_sources" ] }, { "cell_type": "code", "execution_count": 3, "id": "2929a126-2811-4fd0-afea-c1d466553702", "metadata": { "tags": [] }, "outputs": [], "source": [ "all_sources=find_data_sources(eerie_cat)" ] }, { "cell_type": "code", "execution_count": 4, "id": "560fa974-3be1-434c-82ba-c12dbf73a9ca", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.2d_1h_inst',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.2d_1h_mean',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.2d_3h_inst',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.2d_6h_inst',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.2d_6h_mean',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.2d_daily_max',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.2d_daily_mean',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.2d_daily_min',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.model-level_daily_mean_1',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.model-level_daily_mean_2',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.atmos.native.pl_6h_inst',\n", " 'model-output.icon-esm-er.eerie-spinup-1950.v20240618.ocean.native.model-level_daily_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_1h_inst',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_1h_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_3h_inst',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_6h_inst',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_6h_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_daily_max',\n", " 
'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_daily_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_daily_min',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_monthly_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.atmos_native_mon',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.model-level_daily_mean_1',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.model-level_daily_mean_2',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.model-level_monthly_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.pl_6h_inst',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.ocean.native.2d_daily_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.ocean.native.2d_daily_square',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.ocean.native.2d_monthly_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.ocean.native.2d_monthly_square',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.ocean.native.5lev_daily_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.ocean.native.eddy_monthly_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.ocean.native.model-level_daily_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.ocean.native.model-level_monthly_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.ocean.native.ocean_native_mon',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.land.native.2d_daily_mean',\n", " 'model-output.icon-esm-er.eerie-control-1950.v20240618.land.native.2d_monthly_mean']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_sources" ] }, { "cell_type": "code", "execution_count": 5, "id": "8162f3db-bb0e-4921-bb5a-f9e9f37b4339", "metadata": { "tags": 
[] }, "outputs": [ { "data": { "application/yaml": "2d_daily_mean:\n args:\n chunks: {}\n consolidated: false\n storage_options:\n lazy: true\n remote_options:\n slk_cache: /scratch/k/k202134/INTAKE_CACHE\n remote_protocol: slk\n urlpath: reference:://work/bm1344/DKRZ/kerchunks_pp_batched/ICON/eerie-control-1950/v20240618/atmos_native_2d_daily_mean_slk.parq\n description: ''\n driver: intake_xarray.xzarr.ZarrSource\n metadata:\n catalog_dir: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/archive/model-output/icon-esm-er/eerie-control-1950/v20240618/atmos/native\n", "text/plain": [ "2d_daily_mean:\n", " args:\n", " chunks: {}\n", " consolidated: false\n", " storage_options:\n", " lazy: true\n", " remote_options:\n", " slk_cache: /scratch/k/k202134/INTAKE_CACHE\n", " remote_protocol: slk\n", " urlpath: reference:://work/bm1344/DKRZ/kerchunks_pp_batched/ICON/eerie-control-1950/v20240618/atmos_native_2d_daily_mean_slk.parq\n", " description: ''\n", " driver: intake_xarray.xzarr.ZarrSource\n", " metadata:\n", " catalog_dir: https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/archive/model-output/icon-esm-er/eerie-control-1950/v20240618/atmos/native\n" ] }, "metadata": { "application/json": { "root": "2d_daily_mean" } }, "output_type": "display_data" } ], "source": [ "dscat = eerie_cat[\"model-output.icon-esm-er.eerie-control-1950.v20240618.atmos.native.2d_daily_mean\"](chunks={})\n", "dscat" ] }, { "cell_type": "markdown", "id": "207354db-693c-4b52-887f-291db7a6268d", "metadata": {}, "source": [ "The `slk_cache` entry denotes the location on Levante to which the data is retrieved. According to the intake configuration, data is stored in a shared, additional scratch cache. Data untouched for two weeks will automatically be deleted from that location.\n", "\n", "Make sure you have write permissions for the shared cache directory. If not, you can provide another location via `remote_options`, similarly to what we did for `chunks` in the cell above." 
] }, { "cell_type": "code", "execution_count": 6, "id": "1d5e4e3d-d42f-4b24-94b9-159840ed6663", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'/scratch/k/k202134/INTAKE_CACHE'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shared_cache=dscat.describe()[\"args\"][\"storage_options\"][\"remote_options\"][\"slk_cache\"]\n", "shared_cache" ] }, { "cell_type": "code", "execution_count": 7, "id": "539354de-66e3-49eb-ae16-09856dc6f26c", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/work/bm1344/conda-envs/py_312/lib/python3.12/site-packages/intake_xarray/base.py:21: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.\n", " 'dims': dict(self._ds.dims),\n" ] }, { "data": { "text/html": [ "
<xarray.Dataset> Size: 8TB\n", "Dimensions: (ncells: 5242880, time: 18262, height: 1, height_2: 1,\n", " height_3: 1)\n", "Coordinates:\n", " * height (height) float64 8B 2.0\n", " * height_2 (height_2) float64 8B 10.0\n", " * height_3 (height_3) float64 8B 90.0\n", " lat (ncells) float64 42MB dask.array<chunksize=(5242880,), meta=np.ndarray>\n", " lon (ncells) float64 42MB dask.array<chunksize=(5242880,), meta=np.ndarray>\n", " * time (time) datetime64[ns] 146kB 1991-01-01T23:59:59 ... 2...\n", "Dimensions without coordinates: ncells\n", "Data variables: (12/22)\n", " cell_sea_land_mask (ncells) int32 21MB dask.array<chunksize=(5242880,), meta=np.ndarray>\n", " clt (time, ncells) float32 383GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>\n", " evspsbl (time, ncells) float32 383GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>\n", " hfls (time, ncells) float32 383GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>\n", " hfss (time, ncells) float32 383GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>\n", " hur (time, height_3, ncells) float32 383GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>\n", " ... ...\n", " rsus (time, ncells) float32 383GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>\n", " sfcwind (time, height_2, ncells) float32 383GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>\n", " tas (time, height, ncells) float32 383GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>\n", " ts (time, ncells) float32 383GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>\n", " uas (time, height_2, ncells) float32 383GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>\n", " vas (time, height_2, ncells) float32 383GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>
<xarray.Dataset> Size: 31GB\n", "Dimensions: (time: 731, ncells: 5242880, height_3: 1)\n", "Coordinates:\n", " * height_3 (height_3) float64 8B 90.0\n", " lat (ncells) float64 42MB dask.array<chunksize=(5242880,), meta=np.ndarray>\n", " lon (ncells) float64 42MB dask.array<chunksize=(5242880,), meta=np.ndarray>\n", " * time (time) datetime64[ns] 6kB 1991-01-01T23:59:59 ... 1992-12-31T23...\n", "Dimensions without coordinates: ncells\n", "Data variables:\n", " clt (time, ncells) float32 15GB dask.array<chunksize=(1, 5242880), meta=np.ndarray>\n", " hur (time, height_3, ncells) float32 15GB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>
<xarray.Dataset> Size: 1GB\n", "Dimensions: (time: 24, ncells: 5242880, height_3: 1)\n", "Coordinates:\n", " * height_3 (height_3) float64 8B 90.0\n", " lat (ncells) float64 42MB dask.array<chunksize=(5242880,), meta=np.ndarray>\n", " lon (ncells) float64 42MB dask.array<chunksize=(5242880,), meta=np.ndarray>\n", " * time (time) datetime64[ns] 192B 1991-01-31T23:59:59 ... 1992-12-31T2...\n", "Dimensions without coordinates: ncells\n", "Data variables:\n", " clt (time, ncells) float32 503MB dask.array<chunksize=(1, 5242880), meta=np.ndarray>\n", " hur (time, height_3, ncells) float32 503MB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>
<xarray.Dataset> Size: 1GB\n", "Dimensions: (time: 24, ncells: 5242880, height_3: 1)\n", "Coordinates:\n", " * height_3 (height_3) float64 8B 90.0\n", " lat (ncells) float64 42MB dask.array<chunksize=(5242880,), meta=np.ndarray>\n", " lon (ncells) float64 42MB dask.array<chunksize=(5242880,), meta=np.ndarray>\n", " * time (time) datetime64[ns] 192B 1991-01-31T23:59:59 ... 1992-12-31T2...\n", "Dimensions without coordinates: ncells\n", "Data variables:\n", " clt (time, ncells) float32 503MB dask.array<chunksize=(1, 5242880), meta=np.ndarray>\n", " hur (time, height_3, ncells) float32 503MB dask.array<chunksize=(1, 1, 5242880), meta=np.ndarray>