Loading data from the catalog

Long story short:

import intake
catalog_file = "/work/ka1081/Catalogs/dyamond-nextgems.json"  # nextGEMS and DYAMOND Winter
cat = intake.open_esm_datastore(catalog_file)
hits = cat.search(simulation_id="ngc2009", variable_id="tas", frequency="30minute")
dataset_dict = hits.to_dataset_dict(cdf_kwargs={"chunks": {"time": 1}})
keys = list(dataset_dict.keys())
dataset = dataset_dict[keys[0]]
dataset.tas.isel(time=1).max().values

# use get_from_cat from below to search a catalog

Loading the catalog

The intake-esm package provides a tool to access large amounts of data without having to worry about where the data is stored. We will give you a short overview of how to use the catalog to your advantage. The root of an intake catalog is a ‘.json’ file.

[1]:
import pandas as pd

pd.set_option("max_colwidth", None)  # makes the tables render better

import intake


def get_from_cat(catalog, columns):
    """A helper function for inspecting an intake catalog.

    Call with the catalog to be inspected and a list of columns of interest."""
    import pandas as pd

    pd.set_option("max_colwidth", None)  # makes the tables render better

    if isinstance(columns, str):  # allow a single column name as well as a list
        columns = [columns]
    return (
        catalog.df[columns]
        .drop_duplicates()
        .sort_values(columns)
        .reset_index(drop=True)
    )
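To see what the helper returns without access to the real catalog, here is a minimal sketch that runs it on a toy stand-in. The object toy_cat and its contents are made up for illustration; the only thing it shares with a real intake-esm catalog is a .df attribute holding a pandas DataFrame.

```python
from types import SimpleNamespace

import pandas as pd


def get_from_cat(catalog, columns):
    """Same logic as the helper above, repeated so this sketch runs on its own."""
    if isinstance(columns, str):
        columns = [columns]
    return (
        catalog.df[columns]
        .drop_duplicates()
        .sort_values(columns)
        .reset_index(drop=True)
    )


# Hypothetical stand-in for a catalog: anything with a ``.df`` DataFrame works.
toy_cat = SimpleNamespace(
    df=pd.DataFrame(
        {
            "project": ["nextGEMS", "nextGEMS", "DYAMOND_WINTER", "nextGEMS"],
            "simulation_id": ["ngc2009", "ngc2009", "r1i1p1f1", "dpp0067"],
            "frequency": ["30minute", "1day", "1day", "30minute"],
        }
    )
)

# One row per unique (project, simulation_id) combination, sorted.
overview = get_from_cat(toy_cat, ["project", "simulation_id"])
print(overview)

# A single column name may be passed as a plain string.
frequencies = get_from_cat(toy_cat, "frequency")
print(frequencies)
```

The duplicated ngc2009 row collapses into one, which is why the helper output is much shorter than the full catalog table.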
[2]:
catalog_file = "/work/ka1081/Catalogs/dyamond-nextgems.json"

cat = intake.open_esm_datastore(catalog_file)
cat

ICON-ESM catalog with 215 dataset(s) from 114472 asset(s):

unique
variable_id 642
project 2
institution_id 12
source_id 20
experiment_id 5
simulation_id 19
realm 6
frequency 17
time_reduction 5
grid_label 10
level_type 10
time_min 2345
time_max 6050
grid_id 14
format 2
uri 114462

The meanings of the categories are:

Info            Description
variable_id     Short name of the variables.
project         Larger project the simulation belongs to.
source_id       Model name.
experiment_id   Class of experiment.
simulation_id   ID of the run.
realm           Model component the data belongs to, e.g. atmosphere (atm), ocean (oce), or land (lnd).
frequency       Frequency in time of the data points.
time_reduction  Average/Instantaneous/…
grid_label      Identifier for the horizontal grid type.
level_type      Identifier for the vertical grid type.
time_min        Start of the time covered by a specific file.
time_max        End of the time covered by a specific file.
grid_id         Identifier of the horizontal grid.
uri             Uniform resource identifier, location of the data files.

Searching the catalog

You can access the underlying pandas DataFrame with “cat.df”. Here we show the first two rows with head():

[3]:
cat.df.head(n=2)
[3]:
variable_id project institution_id source_id experiment_id simulation_id realm frequency time_reduction grid_label level_type time_min time_max grid_id format uri
0 (ps, psl, rsdt, rsut, rsutcs, rlut, rlutcs, rsds, rsdscs, rlds, rldscs, rsus, rsuscs, rlus, ts, sic, sit, clt, prlr, prls, pr, prw, cllvi, clivi, qgvi, qrvi, qsvi, cptgzvi, hfls, hfss, evspsbl, tauu, tauv, sfcwind, uas, vas, tas) nextGEMS MPI-M ICON-ESM Cycle2-alpha dpp0067 atm 30minute mean gn ml 2020-01-20T00:00:00 2020-01-20T23:59:20 not implemented netcdf /work/mh0287/m218027/experiments/dpp0067/dpp0067_atm_2d_ml_20200120T000000Z.nc
1 (ps, psl, rsdt, rsut, rsutcs, rlut, rlutcs, rsds, rsdscs, rlds, rldscs, rsus, rsuscs, rlus, ts, sic, sit, clt, prlr, prls, pr, prw, cllvi, clivi, qgvi, qrvi, qsvi, cptgzvi, hfls, hfss, evspsbl, tauu, tauv, sfcwind, uas, vas, tas) nextGEMS MPI-M ICON-ESM Cycle2-alpha dpp0067 atm 30minute mean gn ml 2020-01-21T00:00:00 2020-01-21T23:59:20 not implemented netcdf /work/mh0287/m218027/experiments/dpp0067/dpp0067_atm_2d_ml_20200121T000000Z.nc
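Because cat.df is an ordinary pandas DataFrame, it can also be filtered with plain pandas. The sketch below uses a small made-up table (toy_df) whose variable_id column holds tuples, as in the output above. The function toy_search is a hypothetical helper that only mimics the idea of a search; it is not how intake-esm implements cat.search.

```python
import pandas as pd

# Made-up rows that imitate the layout of cat.df (tuples in variable_id).
toy_df = pd.DataFrame(
    {
        "simulation_id": ["ngc2009", "ngc2009", "ngc2009", "dpp0067"],
        "frequency": ["30minute", "1day", "30minute", "30minute"],
        "variable_id": [
            ("psl", "ps", "tas", "ts"),
            ("clt", "tas", "ts"),
            ("pr", "prw"),
            ("ps", "psl", "tas"),
        ],
    }
)


def toy_search(df, variable=None, **equals):
    """Keep rows matching all column=value pairs and containing ``variable``."""
    mask = pd.Series(True, index=df.index)
    for column, value in equals.items():
        mask &= df[column] == value
    if variable is not None:
        # A row matches if the requested name is one of the names in its tuple.
        mask &= df["variable_id"].apply(lambda names: variable in names)
    return df[mask]


toy_hits = toy_search(
    toy_df, variable="tas", simulation_id="ngc2009", frequency="30minute"
)
print(toy_hits)
```

All conditions are combined with a logical AND, so each extra keyword narrows the selection further.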

To reduce the output, we defined the helper function get_from_cat at the top of this document. We can use it to get an overview of the projects, experiments, and models in the catalog.

[4]:
get_from_cat(cat, ["project", "experiment_id", "source_id", "simulation_id"])
[4]:
project experiment_id source_id simulation_id
0 DYAMOND_WINTER DW-ATM ARPEGE-NH-2km r1i1p1f1
1 DYAMOND_WINTER DW-ATM GEM r1i1p1f1
2 DYAMOND_WINTER DW-ATM GEOS-1km r1i1p1f1
3 DYAMOND_WINTER DW-ATM GEOS-3km r1i1p1f1
4 DYAMOND_WINTER DW-ATM ICON-NWP-2km r1i1p1f1
5 DYAMOND_WINTER DW-ATM ICON-SAP-5km dpp0014
6 DYAMOND_WINTER DW-ATM NICAM-3km r1i1p1f1
7 DYAMOND_WINTER DW-ATM SAM2-4km r1i1p1f1
8 DYAMOND_WINTER DW-ATM SCREAM-3km r1i1p1f1
9 DYAMOND_WINTER DW-ATM SHiELD-3km r1i1p1f1
10 DYAMOND_WINTER DW-ATM UM-5km r1i1p1f1
11 DYAMOND_WINTER DW-CPL GEOS-6km r1i1p1f1
12 DYAMOND_WINTER DW-CPL ICON-SAP-5km dpp0029
13 DYAMOND_WINTER DW-CPL IFS-4km r1i1p1f1
14 DYAMOND_WINTER DW-CPL IFS-9km r1i1p1f1
15 DYAMOND_WINTER DW-CPL NICAM-3km r1i1p1f1
16 nextGEMS Cycle1 ICON-SAP-5km dpp0052
17 nextGEMS Cycle1 ICON-SAP-5km dpp0054
18 nextGEMS Cycle1 ICON-SAP-5km dpp0065
19 nextGEMS Cycle1 IFS-FESOM2-4km hlq0
20 nextGEMS Cycle1 IFS-NEMO-4km hmrt
21 nextGEMS Cycle1 IFS-NEMO-9km hmt0
22 nextGEMS Cycle1 IFS-NEMO-DEEPon-4km hmwz
23 nextGEMS Cycle2-alpha ICON-ESM dpp0066
24 nextGEMS Cycle2-alpha ICON-ESM dpp0067
25 nextGEMS nextgems_cycle2 ICON-ESM ngc2009
26 nextGEMS nextgems_cycle2 ICON-ESM ngc2012
27 nextGEMS nextgems_cycle2 ICON-ESM ngc2013
28 nextGEMS nextgems_cycle2 IFS-FESOM HQYS
29 nextGEMS nextgems_cycle2 IFS-FESOM HR0N
30 nextGEMS nextgems_cycle2 IFS-FESOM HR2N
31 nextGEMS nextgems_cycle2 IFS-FESOM HR2N_nodeep

Let’s look at the variables of the ICON run ngc2009. Detailed information about how to search the catalog can be found here.

[5]:
get_from_cat(cat.search(simulation_id="ngc2009"), ["realm", "frequency", "variable_id"])
[5]:
realm frequency variable_id
0 atm 1day (clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)
1 atm 1day (psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)
2 atm 1month (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif)
3 atm 1month (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)
4 atm 1month (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)
5 atm 1month (ua, va, wa, ta, hus, rho, clw, cli, pfull, zghalf, zg, dzghalf)
6 atm 2hour (phalf,)
7 atm 30minute (hydro_canopy_cond_limited_box, hydro_w_snow_box, hydro_snow_soil_dens_box)
8 atm 30minute (hydro_discharge_ocean_box, hydro_drainage_box, hydro_runoff_box, hydro_transpiration_box, sse_grnd_hflx_old_box)
9 atm 30minute (psl, ps, sit, sic, tas, ts, uas, vas, cfh_lnd)
10 atm 30minute (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif)
11 atm 6hour (clw, cli, pfull)
12 atm 6hour (hydro_w_soil_sl_box, hydro_w_ice_sl_box, sse_t_soil_sl_box)
13 atm 6hour (ta, hus, rho)
14 atm 6hour (ta, ua, va, clw, hus, zfull, cli, pv)
15 atm 6hour (tas_gmean, rsdt_gmean, rsut_gmean, rlut_gmean, radtop_gmean, prec_gmean, evap_gmean, fwfoce_gmean)
16 atm 6hour (ua, va, wa)
17 atm fx (zghalf, zg, dzghalf)
18 lnd 1month (hydro_discharge_ocean_box, hydro_drainage_box, hydro_runoff_box, hydro_transpiration_box, sse_grnd_hflx_old_box, hydro_canopy_cond_limited_box, hydro_w_snow_box, hydro_snow_soil_dens_box, hydro_w_soil_sl_box, hydro_w_ice_sl_box, sse_t_soil_sl_box)
19 oce 1day (atlantic_hfbasin, atlantic_hfl, atlantic_moc, atlantic_sltbasin, atlantic_wfl, global_hfbasin, global_hfl, global_moc, global_sltbasin, global_wfl, pacific_hfbasin, pacific_hfl, pacific_moc, pacific_sltbasin, pacific_wfl)
20 oce 1day (atmos_fluxes_FrshFlux_Evaporation, atmos_fluxes_FrshFlux_Precipitation, atmos_fluxes_FrshFlux_Runoff, atmos_fluxes_FrshFlux_SnowFall, atmos_fluxes_HeatFlux_Latent, atmos_fluxes_HeatFlux_LongWave, atmos_fluxes_HeatFlux_Sensible, atmos_fluxes_HeatFlux_ShortWave, atmos_fluxes_HeatFlux_Total, atmos_fluxes_stress_x, atmos_fluxes_stress_xw, atmos_fluxes_stress_y, atmos_fluxes_stress_yw, conc, heat_content_seaice, heat_content_snow, heat_content_total, hi, hs, ice_u, ice_v, mlotst, Qbot, Qtop, sea_level_pressure, stretch_c, zos, verticallyTotal_mass_flux_e, Wind_Speed_10m)
21 oce 1day (so, tke, to, u, v, w, A_tracer_v_to, A_veloc_v, heat_content_liquid_water)
22 oce 1hour (atmos_fluxes_FrshFlux_Evaporation, atmos_fluxes_FrshFlux_Precipitation, atmos_fluxes_FrshFlux_Runoff, atmos_fluxes_FrshFlux_SnowFall, atmos_fluxes_HeatFlux_Latent, atmos_fluxes_HeatFlux_LongWave, atmos_fluxes_HeatFlux_Sensible, atmos_fluxes_HeatFlux_ShortWave, atmos_fluxes_HeatFlux_Total, atmos_fluxes_stress_x, atmos_fluxes_stress_xw, atmos_fluxes_stress_y, atmos_fluxes_stress_yw, Qbot, Qtop)
23 oce 1hour (so, to, u, v, conc, hi, hs, ice_u, ice_v, mlotst, sea_level_pressure, stretch_c, Wind_Speed_10m, zos)
24 oce 1month (A_tracer_v_to, tke)
25 oce 1month (atmos_fluxes_FrshFlux_Evaporation, atmos_fluxes_FrshFlux_Precipitation, atmos_fluxes_FrshFlux_Runoff, atmos_fluxes_FrshFlux_SnowFall, atmos_fluxes_HeatFlux_Latent, atmos_fluxes_HeatFlux_LongWave, atmos_fluxes_HeatFlux_Sensible, atmos_fluxes_HeatFlux_ShortWave, atmos_fluxes_HeatFlux_Total, atmos_fluxes_stress_x, atmos_fluxes_stress_xw, atmos_fluxes_stress_y, atmos_fluxes_stress_yw, conc, heat_content_seaice, heat_content_snow, heat_content_total, hi, hs, ice_u, ice_v, mlotst, Qbot, Qtop, sea_level_pressure, stretch_c, zos, Wind_Speed_10m)
26 oce 1month (so, tke, to, u, v, w, A_tracer_v_to, heat_content_liquid_water)
27 oce 1month (so, to, u, v, w)
28 oce 3hour (A_tracer_v_to, A_veloc_v, tke)
29 oce 3hour (so, to, u, v, w)
30 oce 6hour (total_salt, total_saltinseaice, total_saltinliquidwater, amoc26n, kin_energy_global, pot_energy_global, total_energy_global, ssh_global, sst_global, sss_global, potential_enstrophy_global, HeatFlux_Total_global, FrshFlux_Precipitation_global, FrshFlux_SnowFall_global, FrshFlux_Evaporation_global, FrshFlux_Runoff_global, FrshFlux_VolumeIce_global, FrshFlux_TotalOcean_global, FrshFlux_TotalIce_global, FrshFlux_VolumeTotal_global, totalsnowfall_global, ice_volume_nh, ice_volume_sh, ice_extent_nh, ice_extent_sh, global_heat_content, global_heat_content_solid)

Let’s look at the surface air temperature (tas):

[6]:
get_from_cat(
    cat.search(simulation_id="ngc2009", variable_id="tas"),
    ["realm", "frequency", "level_type", "variable_id"],
)
[6]:
realm frequency level_type variable_id
0 atm 1day ml (clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)
1 atm 1day ml (psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)
2 atm 1month ml (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)
3 atm 1month ml (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)
4 atm 30minute ml (psl, ps, sit, sic, tas, ts, uas, vas, cfh_lnd)
[7]:
hits = cat.search(simulation_id="ngc2009", variable_id="tas", frequency="30minute")
# The 1day files are inconsistent across the run and would have crashed the Jupyter kernel.
hits

ICON-ESM catalog with 1 dataset(s) from 823 asset(s):

unique
variable_id 9
project 1
institution_id 1
source_id 1
experiment_id 1
simulation_id 1
realm 1
frequency 1
time_reduction 1
grid_label 1
level_type 1
time_min 823
time_max 823
grid_id 1
format 1
uri 823

Note: The variable_id count is still 9, because the file(s) containing tas hold 9 variables in total.
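The count of 9 can be reproduced on a toy table. This is a minimal sketch with made-up file names; it illustrates why the number stays at 9 and makes no claim about how intake-esm computes its summary internally.

```python
import pandas as pd

# Every matching file carries the same tuple of nine variable names,
# taken from the 30minute row that contains tas.
names_in_file = ("psl", "ps", "sit", "sic", "tas", "ts", "uas", "vas", "cfh_lnd")

toy_hits = pd.DataFrame(
    {
        "variable_id": [names_in_file] * 3,
        "uri": ["file_a.nc", "file_b.nc", "file_c.nc"],  # hypothetical paths
    }
)

# explode() gives each name its own row; nunique() then counts distinct names.
n_unique_variables = toy_hits["variable_id"].explode().nunique()
print(n_unique_variables)
```

Searching for tas selects whole files, so the other eight names listed for those files are counted as well.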

Loading the Data

Once you have searched the catalog and want to access the actual data, it is time to load it.

The option cdf_kwargs={"chunks": {"time": 1}} splits the data into chunks of one time step each, so that only reasonably sized pieces are loaded at a time. Your kernel WILL break if you try to load the whole dataset at once!

[8]:
dataset_dict = hits.to_dataset_dict(cdf_kwargs={"chunks": {"time": 1}})

--> The keys in the returned dictionary of datasets are constructed as follows:
        'project.institution_id.source_id.experiment_id.simulation_id.realm.frequency.time_reduction.grid_label.level_type'
100.00% [1/1 00:00<00:00]

We have only one dataset. To access it, we need its key:

[9]:
keys = list(dataset_dict.keys())
keys
[9]:
['nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009.atm.30minute.inst.gn.ml']
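The key follows the pattern printed by to_dataset_dict, so it can be split back into its parts. The sketch below is plain Python; parse_key is a hypothetical helper and assumes that none of the individual values contains a dot.

```python
# Field order as printed by to_dataset_dict above.
KEY_FIELDS = (
    "project.institution_id.source_id.experiment_id.simulation_id."
    "realm.frequency.time_reduction.grid_label.level_type"
).split(".")


def parse_key(key):
    """Map each dot-separated part of a dataset key to its field name."""
    parts = key.split(".")
    if len(parts) != len(KEY_FIELDS):
        raise ValueError(f"expected {len(KEY_FIELDS)} parts, got {len(parts)}")
    return dict(zip(KEY_FIELDS, parts))


key = "nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009.atm.30minute.inst.gn.ml"
fields = parse_key(key)
print(fields["simulation_id"], fields["frequency"], fields["time_reduction"])
```

This can be handy when a search returns several datasets and you want to pick one by frequency or realm instead of by position in the list.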

Now we can finally access the data:

[10]:
dataset = dataset_dict[keys[0]]
dataset
[10]:
<xarray.Dataset>
Dimensions:  (time: 37009, height: 1, ncells: 20971520)
Coordinates:
  * height   (height) float64 2.0
  * time     (time) datetime64[ns] 2020-01-20 2020-01-20T00:30:00 ... 2022-03-01
Dimensions without coordinates: ncells
Data variables:
    tas      (time, height, ncells) float32 dask.array<chunksize=(1, 1, 20971520), meta=np.ndarray>
Attributes: (12/13)
    history:                 ./icon at 20220512 152214\n./icon at 20220512 19...
    intake_esm_varname:      ['tas']
    CDI:                     Climate Data Interface version 1.8.3rc (http://m...
    uuidOfHGrid:             0f1e7d66-637e-11e8-913b-51232bb4d8f9
    title:                   ICON simulation
    comment:                 Sapphire Dyamond (k203123) on l10739 (Linux 4.18...
    ...                      ...
    source:                  git@gitlab.dkrz.de:icon/icon-aes.git@87a1eaded69...
    Conventions:             CF-1.6
    number_of_grid_used:     15
    grid_file_uri:           http://icon-downloads.mpimet.mpg.de/grids/public...
    references:              see MPIM/DWD publications
    intake_esm_dataset_key:  nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009....
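The dimensions shown above explain why chunking matters. The following back-of-the-envelope sketch uses only the numbers from the dataset representation (37009 time steps, 1 height level, 20971520 cells, float32) and the start and end dates of the time axis, assuming the last time stamp is midnight on 2022-03-01.

```python
from datetime import datetime, timedelta

n_time, n_height, n_cells = 37009, 1, 20971520
bytes_per_value = 4  # float32

# One chunk holds a single time step of tas.
chunk_bytes = n_height * n_cells * bytes_per_value
total_bytes = n_time * chunk_bytes

chunk_mib = chunk_bytes / 2**20
total_tib = total_bytes / 2**40
print(f"one chunk: {chunk_mib:.0f} MiB, whole variable: {total_tib:.2f} TiB")

# Sanity check: 30-minute steps between the first and last time stamp.
span = datetime(2022, 3, 1) - datetime(2020, 1, 20)
expected_steps = span // timedelta(minutes=30) + 1
print(expected_steps == n_time)
```

A single time step fits comfortably in memory, while the full variable is several terabytes, far more than a typical notebook kernel can hold.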
[11]:
dataset.tas.isel(time=1).min().values
# the first time step just contains zeros, so we take the second by saying isel(time=1)
[11]:
array(225.27545, dtype=float32)
[12]:
dataset.tas.isel(time=1).max().values
[12]:
array(312.81677, dtype=float32)
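The two extremes are easier to judge in degrees Celsius. This small sketch assumes that tas is stored in kelvin, which is common for this variable but is not shown in the output above; kelvin_to_celsius is a helper introduced here for illustration.

```python
def kelvin_to_celsius(value_k):
    """Convert a temperature from kelvin to degrees Celsius."""
    return value_k - 273.15


tas_min_k = 225.27545  # minimum printed above
tas_max_k = 312.81677  # maximum printed above

tas_min_c = kelvin_to_celsius(tas_min_k)
tas_max_c = kelvin_to_celsius(tas_max_k)
print(f"min: {tas_min_c:.1f} °C, max: {tas_max_c:.1f} °C")
```

Under that assumption the values correspond to roughly -48 °C and 40 °C, a plausible global range for a single time step in January.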
[13]:
dataset.tas.max(dim="ncells")  # lazy evaluation - no real work is done yet.
[13]:
<xarray.DataArray 'tas' (time: 37009, height: 1)>
dask.array<_nanmax_skip-aggregate, shape=(37009, 1), dtype=float32, chunksize=(1, 1), chunktype=numpy.ndarray>
Coordinates:
  * height   (height) float64 2.0
  * time     (time) datetime64[ns] 2020-01-20 2020-01-20T00:30:00 ... 2022-03-01
[14]:
# evaluate if you have time to spare
# dataset.tas.max(dim="ncells").values
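If evaluating the full time series takes too long, a subset can be computed instead. The sketch below shows the idea on a tiny made-up dataset (toy) with the same dimension names; it assumes xarray and dask are installed, as they are when the cells above run.

```python
import numpy as np
import xarray as xr

# Toy stand-in: 4 time steps, 1 height level, 6 cells, chunked per time step.
data = np.arange(24, dtype="float32").reshape(4, 1, 6)
toy = xr.Dataset({"tas": (("time", "height", "ncells"), data)}).chunk({"time": 1})

# Still lazy: this only builds a task graph.
lazy_max = toy.tas.max(dim="ncells")

# Selecting two time steps before asking for .values computes only those chunks.
first_two = lazy_max.isel(time=slice(0, 2)).values
print(first_two)
```

The same pattern, applied to the real dataset with a suitable time slice, keeps the amount of data read from disk proportional to the number of selected time steps.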