---
file_format: mystnb
kernelspec:
  name: python3
  display_name: Python 3
execution:
  timeout: 60
---

# Getting started with hierarchical HEALPix data

## The catalog and dataset

We found that concatenating datasets tends to be much more complicated than subselecting. Thus, for the hierarchical HEALPix output, we aim to create one single `Dataset` including all variables and timesteps per model run. This `Dataset` should still be accessed through an `intake` catalog, as the catalog provides a way for dataset maintainers to optimize access to the data without affecting scripts using the data.

### The dataset

Long story short, let's have a look at the output of the nextGEMS `ICON` run `ngc4008` through the catalog `cat`:

```{code-cell} ipython3
import intake

cat = intake.open_catalog("https://data.nextgems-h2020.eu/online.yaml")
ds = cat.ICON.ngc4008.to_dask()
ds
```

You can get an overview of all output variables by inspecting the retrieved dataset `ds`. The dimensions of each variable show whether it is a 2D or 3D variable, and the attributes inform you e.g. about the model `component` which generated that variable, or about the `cell_methods` which have been applied if a variable has been aggregated.

### The catalog

The catalog entry is parameterized by `zoom` and `time` in order to implement hierarchical data access. Here's how this looks:

![dataset hierarchy](ds_hierarchy.png)

So while technically there's an independent `Dataset` for each combination of the `zoom` and `time` parameters, it's better to picture all the datasets as copies of the same thing; only their coordinates differ, of course, due to the different resolutions. The only exception is that some variables are missing at finer temporal resolutions (e.g. ocean and 3D variables), but generally all variables available at fine resolutions are **also** available at coarse resolutions.
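The hierarchy between zoom levels can be made concrete: in nested HEALPix ordering, the four child cells of cell `i` at one zoom level occupy indices `4*i` to `4*i + 3` at the next finer level. Here is a minimal sketch of that relationship in pure Python (the function names `parent` and `children` are ours for illustration, not part of any HEALPix library):

```{code-cell} ipython3
def parent(cell, levels=1):
    """Nested-ordering index of the ancestor cell `levels` zoom levels coarser."""
    return cell >> (2 * levels)

def children(cell):
    """Nested-ordering indices of the four child cells one zoom level finer."""
    return range(4 * cell, 4 * cell + 4)

# Cell 42 at some zoom level lies inside cell 10 one level coarser,
# and inside cell 0 three levels coarser:
print(parent(42))          # 10
print(parent(42, 3))       # 0
print(list(children(10)))  # [40, 41, 42, 43]
```

This is why the datasets at different `zoom` settings can be pictured as copies of each other: each coarse cell aggregates a fixed, contiguous block of fine cells.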
Let's see which parameter values are available (you can omit the `pandas` part, but it looks nicer):

```{code-cell} ipython3
import pandas as pd

pd.DataFrame(cat.ICON.ngc4008.describe()["user_parameters"])
```

So just as in the picture above, we have `time` and `zoom` parameters. The `time` resolution is given as [ISO duration strings](https://en.wikipedia.org/wiki/ISO_8601#Durations), and `zoom` is the number of bisections of the HEALPix grid (i.e. the data will have $12 \cdot 4^{zoom}$ cells). The coarsest settings are the default. The default is well suited to e.g. obtain the trend of global mean surface temperatures:

```{code-cell} ipython3
ds.tas.mean("cell").plot()
```

If you are interested in daily variability, it might however be important to change the temporal resolution to a finer scale:

```{code-cell} ipython3
ds_fine = cat.ICON.ngc4008(time="PT3H").to_dask()
ds_fine.tas.mean("cell").plot()
```

For sure, the `12` horizontal cells at the default zoom level won't provide a detailed map, but let's check quickly. We'll use `easygems`'s [plotting tools](../map_show) for their simplicity. Also, we use `.isel(time=0)` to select the first output timestep by index:

```{code-cell} ipython3
import easygems.healpix as egh

egh.healpix_show(ds.tas.isel(time=0).values)
```

For a decent global map, a `zoom` level of 7 (roughly corresponding to 1°x1° resolution) should be good. So let's try this again:

```{code-cell} ipython3
ds_map = cat.ICON.ngc4008(zoom=7).to_dask()

egh.healpix_show(ds_map.tas.isel(time=0).values)
```

This is a map as we'd expect it to be. You might want to try different `zoom` settings to get a bit more familiar with the concept. Generally, using finer resolutions will load **a lot** more data, so it's recommended to use the coarsest settings suiting your needs. If your analysis really requires fine-resolution data, it might be convenient to use coarse settings while debugging and then switch to finer settings once you are happy with your code.
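The $12 \cdot 4^{zoom}$ relation is also what makes the unweighted `mean("cell")` above legitimate: HEALPix cells are equal-area, so every cell contributes the same weight to a global mean. A quick sketch of the cell counts and per-cell solid angles (the helper names are ours for illustration):

```{code-cell} ipython3
import math

def n_cells(zoom):
    """Cell count after `zoom` bisections of the 12 base cells."""
    return 12 * 4 ** zoom

def cell_area_sr(zoom):
    """Equal-area HEALPix cells each cover 4*pi / n_cells steradians."""
    return 4 * math.pi / n_cells(zoom)

for zoom in (0, 7, 10):
    print(zoom, n_cells(zoom), f"{cell_area_sr(zoom):.3e} sr")
# Each extra zoom level quadruples the cell count (and the data volume).
```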
```{admonition} .to_dask()
:class: info

While the method for converting a catalog entry into a `Dataset` is called `.to_dask()`, it will **not** return a `dask`-backed `Dataset` by default. In many cases (in particular for quick plots), `dask` is not really necessary. As the [dask best practices](https://docs.dask.org/en/stable/best-practices.html#start-small) suggest, we should *use better algorithms or data structures* instead of `dask` if we don't need it.

That said, if you **need** `dask` (e.g. because of the processing complexity), you can enable it by providing a `chunks` specification to the catalog. This can be as simple as `cat.ICON.ngc4008(chunks="auto").to_dask()`. If you do so, just be sure that your chunk settings are compatible with the underlying [storage chunks](/Processing/dask.md#storage-chunks).
```

## Working with multiple resolutions

We can also mix resolutions to compare Hamburg's temperature to the global mean. To work with multiple resolutions of a single run, it might be useful to store the selected model run's catalog entry in a variable.
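On the chunk-compatibility requirement from the admonition above: a `chunks` request plays well with the storage layout when each requested chunk size is a whole multiple of the storage chunk size along the same dimension, so no storage chunk has to be read by more than one task. A library-free sketch of that check (the function name and the example chunk sizes are made up for illustration; real storage chunks depend on how the dataset was written):

```{code-cell} ipython3
def chunks_aligned(requested, storage):
    """True if every requested chunk size is a whole multiple of the
    storage chunk size along the same dimension (unlisted dims pass)."""
    return all(
        requested.get(dim, size) % size == 0
        for dim, size in storage.items()
    )

# Hypothetical storage chunks: 24 timesteps x 4**7 cells per chunk.
storage = {"time": 24, "cell": 4 ** 7}

print(chunks_aligned({"time": 48, "cell": 4 ** 7}, storage))  # True
print(chunks_aligned({"time": 10, "cell": 4 ** 7}, storage))  # False: 10 % 24 != 0
```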
We also need helper functions to inspect the grid properties, provided by the `healpix` and `easygems` packages:

```{code-cell} ipython3
import healpix as hp
import matplotlib.pyplot as plt

model_run = cat.ICON.ngc4008
ds_fine = model_run(zoom=7).to_dask()

# Find the HEALPix cell containing Hamburg (9.99°E, 53.55°N).
hamburg = hp.ang2pix(
    egh.get_nside(ds_fine), 9.993333, 53.550556, lonlat=True, nest=egh.get_nest(ds_fine)
)

ds_fine.tas.sel(cell=hamburg, time="2020").plot(label="Hamburg")
model_run(zoom=0).to_dask().tas.sel(time="2020").mean("cell").plot(label="global mean")
plt.legend()
```

## Vertical profile

Let's try something a bit more involved: the northward motion of the water across the equator in the Atlantic at some point in time:

```{code-cell} ipython3
import numpy as np
import xarray as xr

# Sample points along the equator between 60°W and 20°E.
lons = np.linspace(-60.0, 20.0, 200)
lats = np.full_like(lons, 0.0)

ds = model_run(zoom=7).to_dask()
pnts = xr.DataArray(
    hp.ang2pix(egh.get_nside(ds), lons, lats, lonlat=True, nest=egh.get_nest(ds)),
    dims=("cell",),
    coords={"lon": (("cell",), lons), "lat": (("cell",), lats)},
)

ds.v.isel(time=0, cell=pnts).swap_dims({"cell": "lon"}).plot(x="lon", yincrease=False)
```
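To build intuition for what `hp.ang2pix` does in the cells above, here is a brute-force stand-in: given arrays of cell-center coordinates, pick the cell whose center is closest by great-circle distance. This is purely conceptual (the function and the toy grid are ours; `hp.ang2pix` computes the containing cell directly and in constant time per point, so use it for real work):

```{code-cell} ipython3
from math import radians, sin, cos, acos

def nearest_cell(lon, lat, cell_lons, cell_lats):
    """Index of the cell center closest to (lon, lat) by great-circle distance."""
    lam0, phi0 = radians(lon), radians(lat)

    def dist(lam, phi):
        lam, phi = radians(lam), radians(phi)
        # Clamp to guard against rounding slightly outside [-1, 1].
        c = sin(phi0) * sin(phi) + cos(phi0) * cos(phi) * cos(lam - lam0)
        return acos(max(-1.0, min(1.0, c)))

    return min(range(len(cell_lons)),
               key=lambda i: dist(cell_lons[i], cell_lats[i]))

# Toy grid of four cell centers; Hamburg (9.99°E, 53.55°N) is closest to the third:
lons = [0.0, 0.0, 10.0, 10.0]
lats = [0.0, 50.0, 50.0, 0.0]
print(nearest_cell(9.993333, 53.550556, lons, lats))  # 2
```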