Getting started with hierarchical HEALPix data#

The catalog and dataset#

We found that concatenating datasets tends to be much more complicated than subselecting. Thus, for the hierarchical HEALPix output, we aim to create a single Dataset per model run (including all variables and timesteps). This Dataset should still be accessed through an intake catalog, as the catalog provides a way for dataset maintainers to optimize access to the data without affecting scripts that use the data.

The dataset#

Long story short, let’s have a look at the output of the nextGEMS Cycle 3 ICON run ngc3028 through the catalog cat:

[1]:
import intake

cat = intake.open_catalog("https://data.nextgems-h2020.eu/catalog.yaml")
ds = cat.ICON.ngc3028.to_dask()  # this does NOT use dask, see note below
ds
[1]:
<xarray.Dataset>
Dimensions:                              (time: 2010, depth_half: 129,
                                          cell: 12, level_full: 90, crs: 1,
                                          depth_full: 128,
                                          soil_depth_water_level: 5,
                                          level_half: 91,
                                          soil_depth_energy_level: 5)
Coordinates:
  * crs                                  (crs) float32 nan
  * depth_full                           (depth_full) float32 1.0 ... 5.904e+03
  * depth_half                           (depth_half) float32 0.0 ... 6.003e+03
  * level_full                           (level_full) int32 1 2 3 4 ... 88 89 90
  * level_half                           (level_half) int32 1 2 3 4 ... 89 90 91
  * soil_depth_energy_level              (soil_depth_energy_level) float32 0....
  * soil_depth_water_level               (soil_depth_water_level) float32 0.0...
  * time                                 (time) datetime64[ns] 2020-01-21 ......
Dimensions without coordinates: cell
Data variables: (12/88)
    a_tracer_v_to                        (time, depth_half, cell) float32 ...
    atmos_fluxes_frshflux_evaporation    (time, cell) float32 ...
    atmos_fluxes_frshflux_precipitation  (time, cell) float32 ...
    atmos_fluxes_frshflux_runoff         (time, cell) float32 ...
    atmos_fluxes_frshflux_snowfall       (time, cell) float32 ...
    atmos_fluxes_heatflux_latent         (time, cell) float32 ...
    ...                                   ...
    va                                   (time, level_full, cell) float32 ...
    vas                                  (time, cell) float32 ...
    w                                    (time, depth_half, cell) float32 ...
    wa_phy                               (time, level_half, cell) float32 ...
    wind_speed_10m                       (time, cell) float32 ...
    zos                                  (time, cell) float32 ...

You can get an overview of all output variables by inspecting the retrieved dataset ds. The dimensions of each variable show whether it is a 2D or 3D variable, and the attributes provide further information, e.g. the model component which generated the variable or the cell_methods that were applied if the variable has been aggregated.
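For example, you can inspect a single variable directly; its dims tell you whether it is 2D or 3D, and its attrs carry the metadata (the exact attribute names depend on the model output, so take this as a sketch):

ds.tas.attrs            # metadata, e.g. units, cell_methods, generating component
ds["va"].dims           # ('time', 'level_full', 'cell') -> a 3D variable
ds.wind_speed_10m.dims  # ('time', 'cell') -> a 2D variable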

The catalog#

The catalog entry is parameterized by zoom and time in order to implement hierarchical data access. Here’s what this looks like:

dataset hierarchy

So while technically there’s an independent Dataset for each combination of the zoom and time parameters, it’s better to picture all of these datasets as copies of the same thing; only their coordinates differ, due to the different resolutions. The one exception is that some variables (e.g. ocean and 3D variables) are missing at the finer temporal resolutions, but generally all variables available at fine resolutions are also available at coarse resolutions.
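For instance, a specific combination can be requested by passing both parameters to the catalog entry (the allowed values are listed in the next cell):

ds_3h = cat.ICON.ngc3028(zoom=5, time="PT3H").to_dask()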

Let’s see which parameter values are available (you can omit the pandas part, but it makes the output look nicer):

[2]:
import pandas as pd

pd.DataFrame(cat.ICON.ngc3028.describe()["user_parameters"])
[2]:
   name  description                     type  allowed                             default
0  time  time resolution of the dataset  str   [PT30M, PT3H, P1D]                  P1D
1  zoom  zoom resolution of the dataset  int   [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]  0

So just as in the picture above, we have time and zoom parameters. The time resolution is given as ISO 8601 duration strings; zoom is the number of bisections of the base HEALPix grid, i.e. the data has \(12 \cdot 4^{zoom}\) cells. The coarsest settings are the default.
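As a quick sanity check of that relation, you can open a few zoom levels and compare their cell counts (assuming that opening a catalog entry only reads metadata, this stays cheap):

for zoom in (0, 4, 7):
    n_cells = cat.ICON.ngc3028(zoom=zoom).to_dask().sizes["cell"]
    print(zoom, n_cells, 12 * 4**zoom)

This default is well suited to, e.g., obtaining the trend of global mean surface temperature: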

[3]:
ds.tas.mean("cell").plot()
[3]:
[<matplotlib.lines.Line2D at 0x7fff86f3bd60>]
../../_images/Processing_healpix_healpix_starter_5_1.png

If you are interested in daily variability, you might however want to switch to a finer temporal resolution:

[4]:
ds_fine = cat.ICON.ngc3028(
    time="PT30M"
).to_dask()  # this does NOT use dask, see note below
ds_fine.tas.mean("cell").plot()
[4]:
[<matplotlib.lines.Line2D at 0x7fff86e942e0>]
../../_images/Processing_healpix_healpix_starter_7_1.png

Of course, the 12 horizontal cells won’t provide a detailed map, but let’s have a quick look. We’ll use healpy’s plotting tools here for their simplicity, but there are much nicer ways to plot data on maps.

NOTE:

You might notice that healpy isn’t installed in your Python environment. Although we’ll check out other ways to plot, you’ll most certainly need healpy when working with HEALPix data. A quick way to install healpy from a running notebook is to run

%pip install healpy

in a new code cell. You might also have to restart your notebook kernel. If you want a cleaner way, please follow the DKRZ docs.

As the healpy functions require some information about the grid, we’ll define a helper function get_nest to extract the nest parameter from the dataset (this function is also available in the easygems package). We also use .isel(time=0) to select the first output timestep by index:

[5]:
import healpy


def get_nest(ds):
    """Return True if the dataset uses 'nest' (rather than 'ring') HEALPix ordering."""
    return ds.crs.healpix_order == "nest"


healpy.mollview(ds.tas.isel(time=0), flip="geo", nest=get_nest(ds))
../../_images/Processing_healpix_healpix_starter_9_0.png

For a decent global map, a zoom level of 7 (HEALPix nside = 128, i.e. roughly 0.5° resolution) should be good. So let’s try this again:

[6]:
ds_map = cat.ICON.ngc3028(zoom=7).to_dask()
healpy.mollview(ds_map.tas.isel(time=0), flip="geo", nest=get_nest(ds_map))
../../_images/Processing_healpix_healpix_starter_11_0.png

This is a map as we’d expect it to be. You might want to try different zoom settings to get a bit more familiar with the concept. Generally, finer resolutions load a lot more data, so it’s recommended to use the coarsest setting that suits your needs. If your analysis really requires fine-resolution data, it can be convenient to use coarse settings while debugging and to switch to finer settings once you are happy with your code.
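One simple pattern for this (the variable names here are just a suggestion) is to keep the zoom level near the top of your script and raise it once the analysis works:

ZOOM = 5  # coarse while debugging; increase for the final analysis
ds_analysis = cat.ICON.ngc3028(zoom=ZOOM).to_dask()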

NOTE:

While the method for converting a catalog entry into a Dataset is called .to_dask(), it will not return a dask-backed Dataset by default. In many cases (in particular for quick plots), dask is not really necessary; as the dask best practices suggest, we should “Use better algorithms or data structures” instead of dask when we don’t need it.

That said, if you need dask (e.g. for lazy evaluation or because of the complexity of your processing), you can enable it by providing a chunks specification to the catalog entry. This can be as simple as cat.ICON.ngc3028(chunks="auto").to_dask(), but you can also pass a dictionary with per-dimension chunk settings. If you do, make sure your chunk settings are compatible with the chunking of the underlying data (you’ll probably know how to check that if you need it).
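For example (the per-dimension chunk sizes below are made up for illustration; check the chunking of the underlying store before relying on specific values):

ds_lazy = cat.ICON.ngc3028(zoom=7, chunks="auto").to_dask()
# or with explicit, per-dimension chunks (hypothetical values):
ds_lazy = cat.ICON.ngc3028(zoom=7, chunks={"time": 24, "cell": -1}).to_dask()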

Working with multiple resolutions#

We can also mix resolutions, for example to compare Hamburg’s temperature to the global mean. To work with multiple resolutions of a single run, it is useful to store the selected model run’s catalog entry in a variable. We also need one further helper function (get_nside, also available in the easygems package), as well as matplotlib’s pyplot:

[7]:
import matplotlib.pyplot as plt


def get_nside(ds):
    """Return the HEALPix nside (resolution parameter) stored in the dataset."""
    return ds.crs.healpix_nside


model_run = cat.ICON.ngc3028

ds_fine = model_run(zoom=10).to_dask()
# HEALPix cell containing Hamburg (lon 9.99°E, lat 53.55°N)
hamburg = healpy.ang2pix(
    get_nside(ds_fine), 9.993333, 53.550556, lonlat=True, nest=get_nest(ds_fine)
)

ds_fine.tas.isel(cell=hamburg).plot(label="Hamburg")
model_run(zoom=0).to_dask().tas.mean("cell").plot(label="global mean")
plt.legend()
[7]:
<matplotlib.legend.Legend at 0x7fff81139750>
../../_images/Processing_healpix_healpix_starter_14_1.png

Vertical profile#

Let’s try something a bit more involved: the northward motion of water across the equator in the Atlantic at a single point in time:

[8]:
import numpy as np
import xarray as xr

# Transect along the equator across the Atlantic
lons = np.linspace(-60.0, 20.0, 200)
lats = np.full_like(lons, 0.0)

ds = model_run(zoom=7).to_dask()

# HEALPix cell indices of the transect points, keeping lon/lat as coordinates
pnts = xr.DataArray(
    healpy.ang2pix(get_nside(ds), lons, lats, lonlat=True, nest=get_nest(ds)),
    dims=("cell",),
    coords={"lon": (("cell",), lons), "lat": (("cell",), lats)},
)

ds.v.isel(time=0, cell=pnts).swap_dims({"cell": "lon"}).plot(x="lon", yincrease=False)
[8]:
<matplotlib.collections.QuadMesh at 0x7fff86c4aaa0>
../../_images/Processing_healpix_healpix_starter_16_1.png