---
file_format: mystnb
kernelspec:
  name: python3
  display_name: Python 3
execution:
  timeout: 60
---

# Getting started with hierarchical HEALPix data

## The catalog and dataset

We found that concatenating datasets tends to be much more complicated than subselecting. Thus, for the hierarchical HEALPix output, we aim to create one single `Dataset` including all variables and timesteps per model run. This `Dataset` should still be accessed through an `intake` catalog, as the catalog provides a way for dataset maintainers to optimize access to the data without affecting scripts using the data.

### The dataset

Long story short, let's have a look at the output of the nextGEMS `ICON` run `ngc4008` through the catalog `cat`:

```{code-cell} ipython3
import intake

cat = intake.open_catalog("https://data.nextgems-h2020.eu/online.yaml")
ds = cat.ICON.ngc4008.to_dask()
ds
```

You can get an overview of all output variables by inspecting the retrieved dataset `ds`. The dimensions of each variable show whether it is a 2D or 3D variable, and the attributes inform you e.g. about the model `component` which generated that variable, or about the `cell_methods` which have been applied if a variable has been aggregated.

### The catalog

The catalog entry is parameterized by `zoom` and `time` in order to implement hierarchical data access. Here's how this looks:

![dataset hierarchy](ds_hierarchy.png)

So while technically there's an independent `Dataset` for each combination of the `zoom` and `time` parameters, it's better to picture all the datasets as copies of the same thing; only their coordinates differ, of course, due to the different resolutions. The only exception is that some variables are missing at finer temporal resolutions (e.g. ocean and 3D variables), but generally all variables available at fine resolutions are **also** available at coarse resolutions.
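The hierarchy between zoom levels can be made concrete: in nested HEALPix ordering, the four child cells of cell `i` at one zoom level occupy indices `4*i` to `4*i + 3` at the next finer level. Here is a minimal sketch of that relationship in pure Python (the function names `parent` and `children` are ours for illustration, not part of any HEALPix library):

```{code-cell} ipython3
def parent(cell, levels=1):
    """Nested-ordering index of the ancestor cell `levels` zoom levels coarser."""
    return cell >> (2 * levels)

def children(cell):
    """Nested-ordering indices of the four child cells one zoom level finer."""
    return range(4 * cell, 4 * cell + 4)

# Cell 42 at some zoom level lies inside cell 10 one level coarser,
# and inside cell 0 three levels coarser:
print(parent(42))          # 10
print(parent(42, 3))       # 0
print(list(children(10)))  # [40, 41, 42, 43]
```

This is why the datasets at different `zoom` settings can be pictured as copies of each other: each coarse cell aggregates a fixed, contiguous block of fine cells.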
Let's see which parameter values are available (you can omit the `pandas` part, but it looks nicer):

```{code-cell} ipython3
import pandas as pd

pd.DataFrame(cat.ICON.ngc4008.describe()["user_parameters"])
```

So just as in the picture above, we have `time` and `zoom` parameters. The `time` resolution is given as [ISO duration strings](https://en.wikipedia.org/wiki/ISO_8601#Durations), and `zoom` is the number of bisections of the HEALPix grid (i.e. the data will have $12 \cdot 4^{zoom}$ cells). The coarsest settings are the default. The default is well suited to e.g. obtain the trend of global mean surface temperatures:

```{code-cell} ipython3
ds.tas.mean("cell").plot()
```

If you are interested in daily variability, it might however be important to change the temporal resolution to a finer scale:

```{code-cell} ipython3
ds_fine = cat.ICON.ngc4008(time="PT3H").to_dask()
ds_fine.tas.mean("cell").plot()
```

For sure, the `12` horizontal cells at the default zoom level won't provide a detailed map, but let's check quickly. We'll use `easygems`'s [plotting tools](../map_show) for their simplicity. Also, we use `.isel(time=0)` to select the first output timestep by index:

```{code-cell} ipython3
import easygems.healpix as egh

egh.healpix_show(ds.tas.isel(time=0).values)
```

For a decent global map, a `zoom` level of 7 (roughly corresponding to 1°x1° resolution) should be good. So let's try this again:

```{code-cell} ipython3
ds_map = cat.ICON.ngc4008(zoom=7).to_dask()

egh.healpix_show(ds_map.tas.isel(time=0).values)
```

This is a map as we'd expect it to be. You might want to try different `zoom` settings to get a bit more familiar with the concept. Generally, using finer resolutions will load **a lot** more data, so it's recommended to use the coarsest settings suiting your needs. If your analysis really requires fine-resolution data, it might be convenient to use coarse settings while debugging and then switch to finer settings once you are happy with your code.
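The $12 \cdot 4^{zoom}$ relation is also what makes the unweighted `mean("cell")` above legitimate: HEALPix cells are equal-area, so every cell contributes the same weight to a global mean. A quick sketch of the cell counts and per-cell solid angles (the helper names are ours for illustration):

```{code-cell} ipython3
import math

def n_cells(zoom):
    """Cell count after `zoom` bisections of the 12 base cells."""
    return 12 * 4 ** zoom

def cell_area_sr(zoom):
    """Equal-area HEALPix cells each cover 4*pi / n_cells steradians."""
    return 4 * math.pi / n_cells(zoom)

for zoom in (0, 7, 10):
    print(zoom, n_cells(zoom), f"{cell_area_sr(zoom):.3e} sr")
# Each extra zoom level quadruples the cell count (and the data volume).
```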
```{admonition} .to_dask()
:class: info

While the method for converting a catalog entry into a `Dataset` is called `.to_dask()`, it will **not** return a `dask`-backed `Dataset` by default. In many cases (in particular for quick plots), `dask` is not really necessary. As the [dask best practices](https://docs.dask.org/en/stable/best-practices.html#start-small) suggest, we should *use better algorithms or data structures* instead of `dask` if we don't need it.

That said, if you **need** `dask` (e.g. because of the processing complexity), you can enable it by providing a `chunks` specification to the catalog. This can be as simple as `cat.ICON.ngc4008(chunks="auto").to_dask()`. If you do so, just be sure that your chunk settings are compatible with the underlying [storage chunks](/Processing/dask.md#storage-chunks).
```

## Working with multiple resolutions

We can also mix resolutions to compare Hamburg's temperature to the global mean. To work with multiple resolutions of a single run, it might be useful to store the selected model run's catalog entry in a variable.
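On the chunk-compatibility requirement from the admonition above: a `chunks` request plays well with the storage layout when each requested chunk size is a whole multiple of the storage chunk size along the same dimension, so no storage chunk has to be read by more than one task. A library-free sketch of that check (the function name and the example chunk sizes are made up for illustration; real storage chunks depend on how the dataset was written):

```{code-cell} ipython3
def chunks_aligned(requested, storage):
    """True if every requested chunk size is a whole multiple of the
    storage chunk size along the same dimension (unlisted dims pass)."""
    return all(
        requested.get(dim, size) % size == 0
        for dim, size in storage.items()
    )

# Hypothetical storage chunks: 24 timesteps x 4**7 cells per chunk.
storage = {"time": 24, "cell": 4 ** 7}

print(chunks_aligned({"time": 48, "cell": 4 ** 7}, storage))  # True
print(chunks_aligned({"time": 10, "cell": 4 ** 7}, storage))  # False: 10 % 24 != 0
```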
We also need helper functions to inspect the grid properties, provided by the `healpix` and `easygems` packages:

```{code-cell} ipython3
import healpix as hp
import matplotlib.pyplot as plt

model_run = cat.ICON.ngc4008
ds_fine = model_run(zoom=7).to_dask()

# Find the HEALPix cell containing Hamburg (9.99°E, 53.55°N).
hamburg = hp.ang2pix(
    egh.get_nside(ds_fine), 9.993333, 53.550556, lonlat=True, nest=egh.get_nest(ds_fine)
)

ds_fine.tas.sel(cell=hamburg, time="2020").plot(label="Hamburg")
model_run(zoom=0).to_dask().tas.sel(time="2020").mean("cell").plot(label="global mean")
plt.legend()
```

## Vertical profile

Let's try something a bit more involved: the northward motion of the water across the equator in the Atlantic at some point in time:

```{code-cell} ipython3
import numpy as np
import xarray as xr

# Sample points along the equator between 60°W and 20°E.
lons = np.linspace(-60.0, 20.0, 200)
lats = np.full_like(lons, 0.0)

ds = model_run(zoom=7).to_dask()
pnts = xr.DataArray(
    hp.ang2pix(egh.get_nside(ds), lons, lats, lonlat=True, nest=egh.get_nest(ds)),
    dims=("cell",),
    coords={"lon": (("cell",), lons), "lat": (("cell",), lats)},
)

ds.v.isel(time=0, cell=pnts).swap_dims({"cell": "lon"}).plot(x="lon", yincrease=False)
```
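To build intuition for what `hp.ang2pix` does in the cells above, here is a brute-force stand-in: given arrays of cell-center coordinates, pick the cell whose center is closest by great-circle distance. This is purely conceptual (the function and the toy grid are ours; `hp.ang2pix` computes the containing cell directly and in constant time per point, so use it for real work):

```{code-cell} ipython3
from math import radians, sin, cos, acos

def nearest_cell(lon, lat, cell_lons, cell_lats):
    """Index of the cell center closest to (lon, lat) by great-circle distance."""
    lam0, phi0 = radians(lon), radians(lat)

    def dist(lam, phi):
        lam, phi = radians(lam), radians(phi)
        # Clamp to guard against rounding slightly outside [-1, 1].
        c = sin(phi0) * sin(phi) + cos(phi0) * cos(phi) * cos(lam - lam0)
        return acos(max(-1.0, min(1.0, c)))

    return min(range(len(cell_lons)),
               key=lambda i: dist(cell_lons[i], cell_lats[i]))

# Toy grid of four cell centers; Hamburg (9.99°E, 53.55°N) is closest to the third:
lons = [0.0, 0.0, 10.0, 10.0]
lats = [0.0, 50.0, 50.0, 0.0]
print(nearest_cell(9.993333, 53.550556, lons, lats))  # 2
```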