Loading data from the catalog¶
Long story short:¶
import intake
catalog_file = "/work/ka1081/Catalogs/dyamond-nextgems.json" # nextGEMS and DYAMOND Winter
cat = intake.open_esm_datastore(catalog_file)
hits = cat.search(simulation_id="ngc2009", variable_id="tas", frequency="30minute")
dataset_dict = hits.to_dataset_dict(cdf_kwargs={"chunks": {"time": 1}})
keys = list(dataset_dict.keys())
dataset = dataset_dict[keys[0]]
dataset.tas.isel(time=1).max().values
# use get_from_cat from below to search a catalog
Loading the catalog¶
The intake-esm package provides a tool for accessing large amounts of data without having to worry about where the data is stored. This section gives a short overview of how to use the catalog to your advantage. The root of an intake-esm catalog is a ‘.json’ file.
[1]:
import pandas as pd
pd.set_option("max_colwidth", None) # makes the tables render better
import intake
def get_from_cat(catalog, columns):
    """A helper function for inspecting an intake catalog.

    Call with the catalog to be inspected and a list of columns of interest."""
    # Accept a single column name as well as a list of names.
    if isinstance(columns, str):
        columns = [columns]
    # Return each distinct combination of the requested columns once, sorted.
    return (
        catalog.df[columns]
        .drop_duplicates()
        .sort_values(columns)
        .reset_index(drop=True)
    )
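To get a feeling for what the helper returns without touching the real catalog, it can be tried on a small hand-made table. The sketch below is only an illustration: `toy_cat` and its file names are hypothetical, and the only assumption is that the helper needs an object exposing a pandas DataFrame as `.df`, which is how the catalog is used throughout this page.

```python
from types import SimpleNamespace

import pandas as pd


def get_from_cat(catalog, columns):
    """Return the sorted, de-duplicated combinations of the given columns."""
    if isinstance(columns, str):
        columns = [columns]
    return (
        catalog.df[columns]
        .drop_duplicates()
        .sort_values(columns)
        .reset_index(drop=True)
    )


# Hypothetical stand-in for a catalog: any object with a DataFrame in `.df`.
toy_cat = SimpleNamespace(
    df=pd.DataFrame(
        {
            "simulation_id": ["ngc2009", "ngc2009", "ngc2012", "ngc2009"],
            "frequency": ["30minute", "1day", "1day", "30minute"],
            "uri": ["a.nc", "b.nc", "c.nc", "d.nc"],  # made-up file names
        }
    )
)

# Four files collapse to three distinct (simulation_id, frequency) pairs.
overview = get_from_cat(toy_cat, ["simulation_id", "frequency"])
print(overview)

# A single column name may be passed as a plain string.
single = get_from_cat(toy_cat, "frequency")
print(single)
```

Because the `uri` column is left out of the selection, files that differ only in their path collapse into a single row. This is what keeps the overview tables short.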
[2]:
catalog_file = "/work/ka1081/Catalogs/dyamond-nextgems.json"
cat = intake.open_esm_datastore(catalog_file)
cat
ICON-ESM catalog with 215 dataset(s) from 114472 asset(s):
unique | |
---|---|
variable_id | 642 |
project | 2 |
institution_id | 12 |
source_id | 20 |
experiment_id | 5 |
simulation_id | 19 |
realm | 6 |
frequency | 17 |
time_reduction | 5 |
grid_label | 10 |
level_type | 10 |
time_min | 2345 |
time_max | 6050 |
grid_id | 14 |
format | 2 |
uri | 114462 |
The meanings of the categories are:
Info | Description |
---|---|
variable_id | Shortname of variables. |
project | Larger project the simulation belongs to. |
source_id | Model name. |
experiment_id | Class of experiment |
simulation_id | Id of the run. |
realm | oceanic or atmospheric data |
frequency | Frequency in time of datapoints. |
time_reduction | Average/Instantaneous/… |
grid_label | Identifier for horizontal gridtype. |
level_type | Identifier for vertical gridtype. |
time_min | Starting time for a specific file. |
time_max | End of time covered by a specific file. |
grid_id | Identifier of horizontal grid. |
uri | Uniform resource identifier, location of data files. |
Searching the catalog¶
You can access the underlying pandas dataframe with “cat.df”. Here we show the first 2 entries with head():
[3]:
cat.df.head(n=2)
[3]:
variable_id | project | institution_id | source_id | experiment_id | simulation_id | realm | frequency | time_reduction | grid_label | level_type | time_min | time_max | grid_id | format | uri | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (ps, psl, rsdt, rsut, rsutcs, rlut, rlutcs, rsds, rsdscs, rlds, rldscs, rsus, rsuscs, rlus, ts, sic, sit, clt, prlr, prls, pr, prw, cllvi, clivi, qgvi, qrvi, qsvi, cptgzvi, hfls, hfss, evspsbl, tauu, tauv, sfcwind, uas, vas, tas) | nextGEMS | MPI-M | ICON-ESM | Cycle2-alpha | dpp0067 | atm | 30minute | mean | gn | ml | 2020-01-20T00:00:00 | 2020-01-20T23:59:20 | not implemented | netcdf | /work/mh0287/m218027/experiments/dpp0067/dpp0067_atm_2d_ml_20200120T000000Z.nc |
1 | (ps, psl, rsdt, rsut, rsutcs, rlut, rlutcs, rsds, rsdscs, rlds, rldscs, rsus, rsuscs, rlus, ts, sic, sit, clt, prlr, prls, pr, prw, cllvi, clivi, qgvi, qrvi, qsvi, cptgzvi, hfls, hfss, evspsbl, tauu, tauv, sfcwind, uas, vas, tas) | nextGEMS | MPI-M | ICON-ESM | Cycle2-alpha | dpp0067 | atm | 30minute | mean | gn | ml | 2020-01-21T00:00:00 | 2020-01-21T23:59:20 | not implemented | netcdf | /work/mh0287/m218027/experiments/dpp0067/dpp0067_atm_2d_ml_20200121T000000Z.nc |
To reduce the output, we have defined a helper function in the header of this document. We can use it to get an overview of projects, experiments, and models in the catalog.
[4]:
get_from_cat(cat, ["project", "experiment_id", "source_id", "simulation_id"])
[4]:
project | experiment_id | source_id | simulation_id | |
---|---|---|---|---|
0 | DYAMOND_WINTER | DW-ATM | ARPEGE-NH-2km | r1i1p1f1 |
1 | DYAMOND_WINTER | DW-ATM | GEM | r1i1p1f1 |
2 | DYAMOND_WINTER | DW-ATM | GEOS-1km | r1i1p1f1 |
3 | DYAMOND_WINTER | DW-ATM | GEOS-3km | r1i1p1f1 |
4 | DYAMOND_WINTER | DW-ATM | ICON-NWP-2km | r1i1p1f1 |
5 | DYAMOND_WINTER | DW-ATM | ICON-SAP-5km | dpp0014 |
6 | DYAMOND_WINTER | DW-ATM | NICAM-3km | r1i1p1f1 |
7 | DYAMOND_WINTER | DW-ATM | SAM2-4km | r1i1p1f1 |
8 | DYAMOND_WINTER | DW-ATM | SCREAM-3km | r1i1p1f1 |
9 | DYAMOND_WINTER | DW-ATM | SHiELD-3km | r1i1p1f1 |
10 | DYAMOND_WINTER | DW-ATM | UM-5km | r1i1p1f1 |
11 | DYAMOND_WINTER | DW-CPL | GEOS-6km | r1i1p1f1 |
12 | DYAMOND_WINTER | DW-CPL | ICON-SAP-5km | dpp0029 |
13 | DYAMOND_WINTER | DW-CPL | IFS-4km | r1i1p1f1 |
14 | DYAMOND_WINTER | DW-CPL | IFS-9km | r1i1p1f1 |
15 | DYAMOND_WINTER | DW-CPL | NICAM-3km | r1i1p1f1 |
16 | nextGEMS | Cycle1 | ICON-SAP-5km | dpp0052 |
17 | nextGEMS | Cycle1 | ICON-SAP-5km | dpp0054 |
18 | nextGEMS | Cycle1 | ICON-SAP-5km | dpp0065 |
19 | nextGEMS | Cycle1 | IFS-FESOM2-4km | hlq0 |
20 | nextGEMS | Cycle1 | IFS-NEMO-4km | hmrt |
21 | nextGEMS | Cycle1 | IFS-NEMO-9km | hmt0 |
22 | nextGEMS | Cycle1 | IFS-NEMO-DEEPon-4km | hmwz |
23 | nextGEMS | Cycle2-alpha | ICON-ESM | dpp0066 |
24 | nextGEMS | Cycle2-alpha | ICON-ESM | dpp0067 |
25 | nextGEMS | nextgems_cycle2 | ICON-ESM | ngc2009 |
26 | nextGEMS | nextgems_cycle2 | ICON-ESM | ngc2012 |
27 | nextGEMS | nextgems_cycle2 | ICON-ESM | ngc2013 |
28 | nextGEMS | nextgems_cycle2 | IFS-FESOM | HQYS |
29 | nextGEMS | nextgems_cycle2 | IFS-FESOM | HR0N |
30 | nextGEMS | nextgems_cycle2 | IFS-FESOM | HR2N |
31 | nextGEMS | nextgems_cycle2 | IFS-FESOM | HR2N_nodeep |
Let’s look into the variables of ICON in the ngc2009 simulation. Detailed information about how to search the catalog can be found here.
[5]:
get_from_cat(cat.search(simulation_id="ngc2009"), ["realm", "frequency", "variable_id"])
[5]:
realm | frequency | variable_id | |
---|---|---|---|
0 | atm | 1day | (clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
1 | atm | 1day | (psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
2 | atm | 1month | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif) |
3 | atm | 1month | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
4 | atm | 1month | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
5 | atm | 1month | (ua, va, wa, ta, hus, rho, clw, cli, pfull, zghalf, zg, dzghalf) |
6 | atm | 2hour | (phalf,) |
7 | atm | 30minute | (hydro_canopy_cond_limited_box, hydro_w_snow_box, hydro_snow_soil_dens_box) |
8 | atm | 30minute | (hydro_discharge_ocean_box, hydro_drainage_box, hydro_runoff_box, hydro_transpiration_box, sse_grnd_hflx_old_box) |
9 | atm | 30minute | (psl, ps, sit, sic, tas, ts, uas, vas, cfh_lnd) |
10 | atm | 30minute | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif) |
11 | atm | 6hour | (clw, cli, pfull) |
12 | atm | 6hour | (hydro_w_soil_sl_box, hydro_w_ice_sl_box, sse_t_soil_sl_box) |
13 | atm | 6hour | (ta, hus, rho) |
14 | atm | 6hour | (ta, ua, va, clw, hus, zfull, cli, pv) |
15 | atm | 6hour | (tas_gmean, rsdt_gmean, rsut_gmean, rlut_gmean, radtop_gmean, prec_gmean, evap_gmean, fwfoce_gmean) |
16 | atm | 6hour | (ua, va, wa) |
17 | atm | fx | (zghalf, zg, dzghalf) |
18 | lnd | 1month | (hydro_discharge_ocean_box, hydro_drainage_box, hydro_runoff_box, hydro_transpiration_box, sse_grnd_hflx_old_box, hydro_canopy_cond_limited_box, hydro_w_snow_box, hydro_snow_soil_dens_box, hydro_w_soil_sl_box, hydro_w_ice_sl_box, sse_t_soil_sl_box) |
19 | oce | 1day | (atlantic_hfbasin, atlantic_hfl, atlantic_moc, atlantic_sltbasin, atlantic_wfl, global_hfbasin, global_hfl, global_moc, global_sltbasin, global_wfl, pacific_hfbasin, pacific_hfl, pacific_moc, pacific_sltbasin, pacific_wfl) |
20 | oce | 1day | (atmos_fluxes_FrshFlux_Evaporation, atmos_fluxes_FrshFlux_Precipitation, atmos_fluxes_FrshFlux_Runoff, atmos_fluxes_FrshFlux_SnowFall, atmos_fluxes_HeatFlux_Latent, atmos_fluxes_HeatFlux_LongWave, atmos_fluxes_HeatFlux_Sensible, atmos_fluxes_HeatFlux_ShortWave, atmos_fluxes_HeatFlux_Total, atmos_fluxes_stress_x, atmos_fluxes_stress_xw, atmos_fluxes_stress_y, atmos_fluxes_stress_yw, conc, heat_content_seaice, heat_content_snow, heat_content_total, hi, hs, ice_u, ice_v, mlotst, Qbot, Qtop, sea_level_pressure, stretch_c, zos, verticallyTotal_mass_flux_e, Wind_Speed_10m) |
21 | oce | 1day | (so, tke, to, u, v, w, A_tracer_v_to, A_veloc_v, heat_content_liquid_water) |
22 | oce | 1hour | (atmos_fluxes_FrshFlux_Evaporation, atmos_fluxes_FrshFlux_Precipitation, atmos_fluxes_FrshFlux_Runoff, atmos_fluxes_FrshFlux_SnowFall, atmos_fluxes_HeatFlux_Latent, atmos_fluxes_HeatFlux_LongWave, atmos_fluxes_HeatFlux_Sensible, atmos_fluxes_HeatFlux_ShortWave, atmos_fluxes_HeatFlux_Total, atmos_fluxes_stress_x, atmos_fluxes_stress_xw, atmos_fluxes_stress_y, atmos_fluxes_stress_yw, Qbot, Qtop) |
23 | oce | 1hour | (so, to, u, v, conc, hi, hs, ice_u, ice_v, mlotst, sea_level_pressure, stretch_c, Wind_Speed_10m, zos) |
24 | oce | 1month | (A_tracer_v_to, tke) |
25 | oce | 1month | (atmos_fluxes_FrshFlux_Evaporation, atmos_fluxes_FrshFlux_Precipitation, atmos_fluxes_FrshFlux_Runoff, atmos_fluxes_FrshFlux_SnowFall, atmos_fluxes_HeatFlux_Latent, atmos_fluxes_HeatFlux_LongWave, atmos_fluxes_HeatFlux_Sensible, atmos_fluxes_HeatFlux_ShortWave, atmos_fluxes_HeatFlux_Total, atmos_fluxes_stress_x, atmos_fluxes_stress_xw, atmos_fluxes_stress_y, atmos_fluxes_stress_yw, conc, heat_content_seaice, heat_content_snow, heat_content_total, hi, hs, ice_u, ice_v, mlotst, Qbot, Qtop, sea_level_pressure, stretch_c, zos, Wind_Speed_10m) |
26 | oce | 1month | (so, tke, to, u, v, w, A_tracer_v_to, heat_content_liquid_water) |
27 | oce | 1month | (so, to, u, v, w) |
28 | oce | 3hour | (A_tracer_v_to, A_veloc_v, tke) |
29 | oce | 3hour | (so, to, u, v, w) |
30 | oce | 6hour | (total_salt, total_saltinseaice, total_saltinliquidwater, amoc26n, kin_energy_global, pot_energy_global, total_energy_global, ssh_global, sst_global, sss_global, potential_enstrophy_global, HeatFlux_Total_global, FrshFlux_Precipitation_global, FrshFlux_SnowFall_global, FrshFlux_Evaporation_global, FrshFlux_Runoff_global, FrshFlux_VolumeIce_global, FrshFlux_TotalOcean_global, FrshFlux_TotalIce_global, FrshFlux_VolumeTotal_global, totalsnowfall_global, ice_volume_nh, ice_volume_sh, ice_extent_nh, ice_extent_sh, global_heat_content, global_heat_content_solid) |
Let’s look into surface air temperature (tas):
[6]:
get_from_cat(
cat.search(simulation_id="ngc2009", variable_id="tas"),
["realm", "frequency", "level_type", "variable_id"],
)
[6]:
realm | frequency | level_type | variable_id | |
---|---|---|---|---|
0 | atm | 1day | ml | (clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
1 | atm | 1day | ml | (psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
2 | atm | 1month | ml | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
3 | atm | 1month | ml | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
4 | atm | 30minute | ml | (psl, ps, sit, sic, tas, ts, uas, vas, cfh_lnd) |
[7]:
hits = cat.search(simulation_id="ngc2009", variable_id="tas", frequency="30minute")
# The 1day files would have crashed the Jupyter kernel because the files are inconsistent across the run.
hits
ICON-ESM catalog with 1 dataset(s) from 823 asset(s):
unique | |
---|---|
variable_id | 9 |
project | 1 |
institution_id | 1 |
source_id | 1 |
experiment_id | 1 |
simulation_id | 1 |
realm | 1 |
frequency | 1 |
time_reduction | 1 |
grid_label | 1 |
level_type | 1 |
time_min | 823 |
time_max | 823 |
grid_id | 1 |
format | 1 |
uri | 823 |
Note: The variable_id field still shows 9, because the file(s) containing tas hold 9 variables in total.
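One way to see where the 9 comes from is to rebuild the situation with plain pandas. The rows below are a toy table, not the real catalog: the variable tuples are copied from the listings on this page, while the `uri` values are hypothetical. This is not necessarily how intake-esm performs the search internally. It only illustrates that each file is listed with all the variables it contains.

```python
import pandas as pd

# Toy rows mimicking the catalog layout: variable_id holds one tuple per file.
df = pd.DataFrame(
    {
        "variable_id": [
            ("psl", "ps", "sit", "sic", "tas", "ts", "uas", "vas", "cfh_lnd"),
            ("psl", "ps", "sit", "sic", "tas", "ts", "uas", "vas", "cfh_lnd"),
            ("ta", "hus", "rho"),
        ],
        "uri": ["file_a.nc", "file_b.nc", "file_c.nc"],  # hypothetical names
    }
)

# Keep only the files whose variable tuple contains "tas".
has_tas = df[df["variable_id"].apply(lambda names: "tas" in names)]

# Count the distinct variable names across the matching files.
n_variables = has_tas["variable_id"].explode().nunique()
print(len(has_tas), n_variables)
```

The search selects whole files, so every variable stored next to tas is still counted in the summary.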
Loading the Data¶
Once you have narrowed the search down to the data you want, it is time to load it.
The option cdf_kwargs={"chunks": {"time": 1}}
is used so that the data is split into reasonably sized chunks (one time step each) and only loaded on demand. Your kernel WILL break if you try to load the whole dataset at once!
[8]:
dataset_dict = hits.to_dataset_dict(cdf_kwargs={"chunks": {"time": 1}})
--> The keys in the returned dictionary of datasets are constructed as follows:
'project.institution_id.source_id.experiment_id.simulation_id.realm.frequency.time_reduction.grid_label.level_type'
We have only one dataset. To access it, we need its key:
[9]:
keys = list(dataset_dict.keys())
keys
[9]:
['nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009.atm.30minute.inst.gn.ml']
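Since the key is simply the listed catalog fields joined by dots, it can be taken apart again to check which dataset it refers to. The following is a minimal sketch using only the key and the field order printed above. It assumes that no field value itself contains a dot, which holds for this key but is not guaranteed in general.

```python
# Field order as printed by to_dataset_dict above.
fields = [
    "project",
    "institution_id",
    "source_id",
    "experiment_id",
    "simulation_id",
    "realm",
    "frequency",
    "time_reduction",
    "grid_label",
    "level_type",
]

key = "nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009.atm.30minute.inst.gn.ml"

# Split the key and pair each part with its field name.
parts = dict(zip(fields, key.split(".")))
for name, value in parts.items():
    print(f"{name:15s} {value}")

# Joining the parts in field order gives back the original key.
rebuilt = ".".join(parts[name] for name in fields)
```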
Now we can finally access the data:
[10]:
dataset = dataset_dict[keys[0]]
dataset
[10]:
<xarray.Dataset>
Dimensions:  (time: 37009, height: 1, ncells: 20971520)
Coordinates:
  * height   (height) float64 2.0
  * time     (time) datetime64[ns] 2020-01-20 2020-01-20T00:30:00 ... 2022-03-01
Dimensions without coordinates: ncells
Data variables:
    tas      (time, height, ncells) float32 dask.array<chunksize=(1, 1, 20971520), meta=np.ndarray>
Attributes: (12/13)
    history:                 ./icon at 20220512 152214\n./icon at 20220512 19...
    intake_esm_varname:      ['tas']
    CDI:                     Climate Data Interface version 1.8.3rc (http://m...
    uuidOfHGrid:             0f1e7d66-637e-11e8-913b-51232bb4d8f9
    title:                   ICON simulation
    comment:                 Sapphire Dyamond (k203123) on l10739 (Linux 4.18...
    ...                      ...
    source:                  git@gitlab.dkrz.de:icon/icon-aes.git@87a1eaded69...
    Conventions:             CF-1.6
    number_of_grid_used:     15
    grid_file_uri:           http://icon-downloads.mpimet.mpg.de/grids/public...
    references:              see MPIM/DWD publications
    intake_esm_dataset_key:  nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009....
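The dimensions in this output make it easy to estimate why chunking by time step matters. The back-of-the-envelope calculation below uses only the printed shape and the float32 data type. It ignores compression on disk and any overhead in memory, so treat the numbers as rough orders of magnitude.

```python
# Shape of tas as printed in the dataset summary.
n_time = 37009
n_height = 1
n_cells = 20971520
bytes_per_value = 4  # float32

# One chunk holds a single time step: (1, 1, 20971520) values.
chunk_bytes = n_height * n_cells * bytes_per_value
total_bytes = n_time * chunk_bytes

chunk_mib = chunk_bytes / 2**20
total_tib = total_bytes / 2**40

print(f"one time step:  {chunk_mib:.0f} MiB")
print(f"whole variable: {total_tib:.2f} TiB")
```

A single time step is about 80 MiB and fits comfortably in memory, whereas the full variable is close to 3 TiB uncompressed and does not.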
[11]:
dataset.tas.isel(time=1).min().values
# the first time step just contains zeros, so we take the second by saying isel(time=1)
[11]:
array(225.27545, dtype=float32)
[12]:
dataset.tas.isel(time=1).max().values
[12]:
array(312.81677, dtype=float32)
[13]:
dataset.tas.max(dim="ncells") # lazy evaluation - no real work is done yet.
[13]:
<xarray.DataArray 'tas' (time: 37009, height: 1)>
dask.array<_nanmax_skip-aggregate, shape=(37009, 1), dtype=float32, chunksize=(1, 1), chunktype=numpy.ndarray>
Coordinates:
  * height   (height) float64 2.0
  * time     (time) datetime64[ns] 2020-01-20 2020-01-20T00:30:00 ... 2022-03-01
[14]:
# evaluate if you have time to spare
# dataset.tas.max(dim="ncells").values
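The lazy behaviour can be tried safely on a tiny synthetic array before running it on the real data. The sketch below uses dask directly, since the dataset summary shows dask arrays behind tas. The array contents and shape are made up for illustration, and only the (time, height, ncells) layout and the one-chunk-per-time-step chunking mirror the real dataset.

```python
import dask.array as da
import numpy as np

# Small synthetic stand-in with the same layout as tas: (time, height, ncells).
values = np.arange(32, dtype="float32").reshape(4, 1, 8)
tas_like = da.from_array(values, chunks=(1, 1, 8))  # one chunk per time step

# Reducing over the cell axis only builds a task graph; nothing is computed.
lazy_max = tas_like.max(axis=2)
print(type(lazy_max).__name__, lazy_max.shape)

# compute() triggers the actual work, one chunk at a time.
result = lazy_max.compute()
print(result.ravel())
```

On the xarray objects used above, asking for `.values` plays the role of `compute()` here, so that is the point at which the time-consuming work starts.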