Getting started with hierarchical HEALPix data#
The catalog and dataset#
We found that concatenating datasets tends to be much more complicated than subselecting. Thus, for the hierarchical HEALPix output, we aim at creating one single Dataset
for a model run (including all variables and timesteps). This Dataset
should still be accessed through an intake
catalog, as the catalog provides a way for dataset maintainers to optimize access to the data without affecting scripts using the data.
The dataset#
Long story short, let’s have a look at the output of the nextGEMS Cycle 3 ICON
run ngc3028
though the catalog cat
:
[1]:
import intake
cat = intake.open_catalog("https://data.nextgems-h2020.eu/catalog.yaml")
ds = cat.ICON.ngc3028.to_dask() # this does NOT use dask, see note below
ds
[1]:
<xarray.Dataset> Dimensions: (time: 2010, depth_half: 129, cell: 12, level_full: 90, crs: 1, depth_full: 128, soil_depth_water_level: 5, level_half: 91, soil_depth_energy_level: 5) Coordinates: * crs (crs) float32 nan * depth_full (depth_full) float32 1.0 ... 5.904e+03 * depth_half (depth_half) float32 0.0 ... 6.003e+03 * level_full (level_full) int32 1 2 3 4 ... 88 89 90 * level_half (level_half) int32 1 2 3 4 ... 89 90 91 * soil_depth_energy_level (soil_depth_energy_level) float32 0.... * soil_depth_water_level (soil_depth_water_level) float32 0.0... * time (time) datetime64[ns] 2020-01-21 ...... Dimensions without coordinates: cell Data variables: (12/88) a_tracer_v_to (time, depth_half, cell) float32 ... atmos_fluxes_frshflux_evaporation (time, cell) float32 ... atmos_fluxes_frshflux_precipitation (time, cell) float32 ... atmos_fluxes_frshflux_runoff (time, cell) float32 ... atmos_fluxes_frshflux_snowfall (time, cell) float32 ... atmos_fluxes_heatflux_latent (time, cell) float32 ... ... ... va (time, level_full, cell) float32 ... vas (time, cell) float32 ... w (time, depth_half, cell) float32 ... wa_phy (time, level_half, cell) float32 ... wind_speed_10m (time, cell) float32 ... zos (time, cell) float32 ...
You can get an overview over all output variables by inspecting the retrieved dataset ds
. The dimensions of each variable show if they are 2D or 3D variables, the attributes inform e.g. about the model component
which generated that variable or the cell_methods
which have been applied if a variable has been aggregated.
The catalog#
The catalog entry is parameterized by zoom
and time
in order to implement hierarchical data access. Here’s how this looks like:
So while technically, there’s an independent Dataset
for each combination of the zoom
and time
parameters, it’s better to picture all the datasets as copies of the same thing, only their coordinates are of course different due to the different resolutions. The only exception being, that some variables are missing in finer temporal resolutions (e.g. ocean and 3D variables), but generally all variables available at fine resolutions are also available at coarse resolutions.
Let’s see which parameter values are available (you can ommit the pandas
part, but it looks nicer):
[2]:
import pandas as pd
pd.DataFrame(cat.ICON.ngc3028.describe()["user_parameters"])
[2]:
name | description | type | allowed | default | |
---|---|---|---|---|---|
0 | time | time resolution of the dataset | str | [PT30M, PT3H, P1D] | P1D |
1 | zoom | zoom resolution of the dataset | int | [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0] | 0 |
So just as in the picture above, we have time
and zoom
parameters. time
resolution is given as ISO duration strings, zoom
are the number of bisections in the healpix grid (e.g. the data will have \(12 \cdot 4^{zoom}\) cells). The coarsest settings are the default. The default is well suited to e.g. obtain the trend of global mean surface temperatures:
[3]:
ds.tas.mean("cell").plot()
[3]:
[<matplotlib.lines.Line2D at 0x7fff86f3bd60>]
If you are interested in daily variablility, it might however be important to change temporal resolution to a finer scale:
[4]:
ds_fine = cat.ICON.ngc3028(
time="PT30M"
).to_dask() # this does NOT use dask, see note below
ds_fine.tas.mean("cell").plot()
[4]:
[<matplotlib.lines.Line2D at 0x7fff86e942e0>]
For sure, the 12
horizontal cells won’t provide a detailed map, but let’s check quickly. We’ll use healpy’s plotting tools briefly for their simplicity, but there are much nicer ways to plot data on maps.
NOTE:
You might notice that healpy
isn’t installed in your Python environment. Although we’ll check out other ways to plot, you’ll most certainly need healpy
when working with HEALPix data. A quick way to install healpy
from a running notebook is to run
%pip install healpy
in an new code cell. You might also have to restart your notebook kernel. If you want a cleaner way, please follow the DKRZ docs.
As the healpy
functions require some information about the grid, we’ll define a helper function get_nest
to extract the nest
parameter from the dataset (this function is also available in the easygems
package): Also we use .isel(time=0)
to select the first output timestep by index:
[5]:
import healpy
def get_nest(ds):
return ds.crs.healpix_order == "nest"
healpy.mollview(ds.tas.isel(time=0), flip="geo", nest=get_nest(ds))
For a decent global map, a zoom
level of 7 (roughly corresponding to 1x1° resolution) should be good. So let’s try this again:
[6]:
ds_map = cat.ICON.ngc3028(zoom=7).to_dask()
healpy.mollview(ds_map.tas.isel(time=0), flip="geo", nest=get_nest(ds_map))
This is a map as we’d expect it to be. You might want to try different zoom
settings to get a bit more familiar with the concept. Generally using finer resolutions will load a lot more data, thus it’s recommended to use the coarsest settings suiting your needs. If your analysis really requires fine resolution data, it might be convenient to use coarse settings while debugging and then switch to finer settings once you are happy with your code.
NOTE:
While the method for converting a catalog entry into a Dataset
is called .to_dask()
, it will not return a dask
-backed Dataset
by default. In many cases (in particular for doing some quick plots), dask
is not really necessary. As the dask best practices suggest we should Use better algorithms or data structures instead of dask
if you don’t need it.
That said, if you need dask
(e.g. because of laziness or the processing complexity), you can enable it by providing a chunks
specification to the catalog. This can be as simple as cat.ICON.ngc3028(chunks="auto").to_dask()
, but you can also specify a dictionary with per-dimension chunk settings. If you do so, just be sure that your chunk settings are compatible with the underlying data chunks (you’ll probably know how to check that if you need it).
Working with multiple resolutions#
We can also mix resolutions to compare Hamburg’s temperature to the global mean. To work with multiple resolutions of a single run, it might be useful to store the selected model run’s catalog entry into a variable. We also need one further helper function (get_nside
, also available in the easygems
package), as well as pylab
:
[7]:
import matplotlib.pylab as plt
def get_nside(ds):
return ds.crs.healpix_nside
model_run = cat.ICON.ngc3028
ds_fine = model_run(zoom=10).to_dask()
hamburg = healpy.ang2pix(
get_nside(ds_fine), 9.993333, 53.550556, lonlat=True, nest=get_nest(ds_fine)
)
ds_fine.tas.isel(cell=hamburg).plot(label="Hamburg")
model_run(zoom=0).to_dask().tas.mean("cell").plot(label="global mean")
plt.legend()
[7]:
<matplotlib.legend.Legend at 0x7fff81139750>
Vertical profile#
Let’s try something a bit more involved: northward motion of the water across the equator in the atlantic at some point in time:
[8]:
import numpy as np
import xarray as xr
lons = np.linspace(-60.0, 20.0, 200)
lats = np.full_like(lons, 0.0)
ds = model_run(zoom=7).to_dask()
pnts = xr.DataArray(
healpy.ang2pix(get_nside(ds), lons, lats, lonlat=True, nest=get_nest(ds)),
dims=("cell",),
coords={"lon": (("cell",), lons), "lat": (("cell",), lats)},
)
ds.v.isel(time=0, cell=pnts).swap_dims({"cell": "lon"}).plot(x="lon", yincrease=False)
[8]:
<matplotlib.collections.QuadMesh at 0x7fff86c4aaa0>