Loading data from the catalog#
Long story short:#
import intake
try:
    import outtake
except ImportError:
    import sys
    print("""Could not load outtake - tape downloads might not work. Try adding
module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable
to your ~/.kernel_env file""", file=sys.stderr)
catalog_file = "/work/ka1081/Catalogs/dyamond-nextgems.json" # nextGEMS and DYAMOND Winter
cat = intake.open_esm_datastore(catalog_file)
hits = cat.search(simulation_id="ngc2009", variable_id="tas", frequency="30minute")
dataset_dict = hits.to_dataset_dict(cdf_kwargs={"chunks": {"time": 1}})
keys = list(dataset_dict.keys())
dataset = dataset_dict[keys[0]]
dataset.tas.isel(time=1).max().values
# use get_from_cat from below to search a catalog
Loading the catalog#
The intake-esm package provides a tool to access large amounts of data without having to worry about where it comes from. We will give you a short overview of how to use the catalog to your advantage. The root of the intake catalog is a ‘.json’ file.
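To make the idea of a JSON root file more concrete, here is a minimal sketch of what such a catalog description might look like. The field names follow the ESM collection specification used by intake-esm as far as we know it. All values (the id, the CSV file name, the column list) are hypothetical and are not taken from dyamond-nextgems.json.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, heavily shortened catalog description.
# The real dyamond-nextgems.json is larger and may differ in detail.
catalog_description = {
    "esmcat_version": "0.1.0",
    "id": "toy-catalog",
    "description": "Toy example, not a real catalog",
    "catalog_file": "toy-catalog.csv.gz",  # table with one row per data file
    "attributes": [
        {"column_name": "project"},
        {"column_name": "simulation_id"},
        {"column_name": "variable_id"},
        {"column_name": "frequency"},
    ],
    "assets": {"column_name": "uri", "format_column_name": "format"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy-catalog.json"
    path.write_text(json.dumps(catalog_description, indent=2))

    # Read it back like any other JSON file
    loaded = json.loads(path.read_text())

searchable_columns = [a["column_name"] for a in loaded["attributes"]]
print(searchable_columns)
print("file locations are stored in column:", loaded["assets"]["column_name"])
```

The JSON file only describes the catalog. The table of files it points to is what ends up in the pandas dataframe used throughout this document.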
[1]:
import pandas as pd
pd.set_option("max_colwidth", None) # makes the tables render better
import intake
try:
    import outtake
except ImportError:
    import sys

    print(
        """Could not load outtake - tape downloads might not work. Try adding
module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable
to your ~/.kernel_env file""",
        file=sys.stderr,
    )


def get_from_cat(catalog, columns):
    """A helper function for inspecting an intake catalog.

    Call with the catalog to be inspected and a list of columns of interest."""
    import pandas as pd

    pd.set_option("max_colwidth", None)  # makes the tables render better
    # Accept a single column name as well as a list of names
    if isinstance(columns, str):
        columns = [columns]
    return (
        catalog.df[columns]
        .drop_duplicates()
        .sort_values(columns)
        .reset_index(drop=True)
    )
[2]:
catalog_file = "/work/ka1081/Catalogs/dyamond-nextgems.json"
cat = intake.open_esm_datastore(catalog_file)
cat
/work/k20200/k202134/Catalogs/dng-merged catalog with 167 dataset(s) from 120310 asset(s):
| | unique |
---|---|
variable_id | 643 |
project | 2 |
institution_id | 13 |
source_id | 21 |
experiment_id | 5 |
simulation_id | 16 |
realm | 6 |
frequency | 16 |
time_reduction | 5 |
grid_label | 11 |
level_type | 6 |
time_min | 3153 |
time_max | 7000 |
grid_id | 16 |
format | 2 |
uri | 120044 |
The meanings of the categories are:
Info | Description |
---|---|
variable_id | Shortname of variables. |
project | Larger project the simulation belongs to. |
source_id | Model name. |
experiment_id | Class of experiment. |
simulation_id | Id of the run. |
realm | Oceanic or atmospheric data. |
frequency | Frequency in time of datapoints. |
time_reduction | Average/Instantaneous/… |
grid_label | Identifier for horizontal gridtype. |
level_type | Identifier for vertical gridtype. |
time_min | Starting time for a specific file. |
time_max | End of time covered by a specific file. |
grid_id | Identifier of horizontal grid. |
uri | Uniform resource identifier, location of data files. |
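Because every file carries its own time_min and time_max, these two columns can be used to narrow a selection down to a time window. The following is a minimal sketch of that idea in plain Python. The rows and file names are hypothetical; only the timestamp format mirrors the catalog.

```python
from datetime import datetime

# Hypothetical catalog rows, one per file.
rows = [
    {"uri": "file_a.nc", "time_min": "2020-01-20T00:00:00.000", "time_max": "2020-01-20T23:30:00.000"},
    {"uri": "file_b.nc", "time_min": "2020-01-21T00:00:00.000", "time_max": "2020-01-21T23:30:00.000"},
    {"uri": "file_c.nc", "time_min": "2020-01-22T00:00:00.000", "time_max": "2020-01-22T23:30:00.000"},
]


def files_overlapping(rows, start, end):
    """Return the uris of all rows whose [time_min, time_max] overlaps [start, end]."""
    start, end = datetime.fromisoformat(start), datetime.fromisoformat(end)
    return [
        row["uri"]
        for row in rows
        if datetime.fromisoformat(row["time_min"]) <= end
        and datetime.fromisoformat(row["time_max"]) >= start
    ]


selected = files_overlapping(rows, "2020-01-20T12:00:00", "2020-01-21T06:00:00")
print(selected)
```

With the real catalog, the same comparison could be applied to the time_min and time_max columns of cat.df.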
Searching the catalog#
You can access the underlying pandas dataframe with cat.df. Here we show the first two entries with head():
[3]:
cat.df.head(n=2)
[3]:
| | variable_id | project | institution_id | source_id | experiment_id | simulation_id | realm | frequency | time_reduction | grid_label | level_type | time_min | time_max | grid_id | format | uri |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (c, l, i, v, i) | DYAMOND_WINTER | CAMS | GRIST-5km | DW-ATM | r1i1p1f1 | atmos | 15min | unkonwn | gn | 2d | 2020-01-20T00:00:00.000 | 2020-01-20T23:45:00.000 | not_implemented | netcdf | /work/ka1081/DYAMOND_WINTER/CAMS/GRIST-5km/DW-ATM/atmos/15min/clivi/r1i1p1f1/2d/gn/clivi_15min_GRIST-5km_DW-ATM_r1i1p1f1_2d_gn_20200120000000-20200120234500.nc |
1 | (c, l, t) | DYAMOND_WINTER | CAMS | GRIST-5km | DW-ATM | r1i1p1f1 | atmos | 15min | unkonwn | gn | 2d | 2020-01-20T00:00:00.000 | 2020-01-20T23:45:00.000 | not_implemented | netcdf | /work/ka1081/DYAMOND_WINTER/CAMS/GRIST-5km/DW-ATM/atmos/15min/clt/r1i1p1f1/2d/gn/clt_15min_GRIST-5km_DW-ATM_r1i1p1f1_2d_gn_20200120000000-20200120234500.nc |
To keep the output short, we use the helper function get_from_cat defined at the top of this document. It gives us an overview of the projects, experiments, and models in the catalog.
[4]:
get_from_cat(cat, ["project", "experiment_id", "source_id", "simulation_id"])
[4]:
| | project | experiment_id | source_id | simulation_id |
---|---|---|---|---|
0 | DYAMOND_WINTER | DW-ATM | ARPEGE-NH-2km | r1i1p1f1 |
1 | DYAMOND_WINTER | DW-ATM | GEM | r1i1p1f1 |
2 | DYAMOND_WINTER | DW-ATM | GEOS-1km | r1i1p1f1 |
3 | DYAMOND_WINTER | DW-ATM | GEOS-3km | r1i1p1f1 |
4 | DYAMOND_WINTER | DW-ATM | GRIST-5km | r1i1p1f1 |
5 | DYAMOND_WINTER | DW-ATM | ICON-NWP-2km | r1i1p1f1 |
6 | DYAMOND_WINTER | DW-ATM | ICON-SAP-5km | dpp0014 |
7 | DYAMOND_WINTER | DW-ATM | MPAS-3km | r1i1p1f1 |
8 | DYAMOND_WINTER | DW-ATM | SCREAM-3km | r1i1p1f1 |
9 | DYAMOND_WINTER | DW-ATM | SHiELD-3km | r1i1p1f1 |
10 | DYAMOND_WINTER | DW-ATM | UM-5km | r1i1p1f1 |
11 | DYAMOND_WINTER | DW-ATM | gSAM-4km | r1i1p1f1 |
12 | DYAMOND_WINTER | DW-CPL | GEOS-6km | r1i1p1f1 |
13 | DYAMOND_WINTER | DW-CPL | ICON-SAP-5km | dpp0029 |
14 | DYAMOND_WINTER | DW-CPL | ICON-SAP-5km | r1i1p1f1 |
15 | DYAMOND_WINTER | DW-CPL | IFS-4km | r1i1p1f1 |
16 | DYAMOND_WINTER | DW-CPL | IFS-9km | r1i1p1f1 |
17 | nextGEMS | Cycle1 | IFS-FESOM2-4km | hlq0 |
18 | nextGEMS | Cycle1 | IFS-NEMO-4km | hmrt |
19 | nextGEMS | Cycle1 | IFS-NEMO-9km | hmt0 |
20 | nextGEMS | Cycle1 | IFS-NEMO-DEEPon-4km | hmwz |
21 | nextGEMS | Cycle2-alpha | ICON-ESM | dpp0066 |
22 | nextGEMS | Cycle2-alpha | ICON-ESM | dpp0067 |
23 | nextGEMS | nextgems_cycle2 | ICON-ESM | ngc2009 |
24 | nextGEMS | nextgems_cycle2 | ICON-ESM | ngc2012 |
25 | nextGEMS | nextgems_cycle2 | ICON-ESM | ngc2013 |
26 | nextGEMS | nextgems_cycle2 | IFS-FESOM | HQYS |
27 | nextGEMS | nextgems_cycle2 | IFS-FESOM | HR0N |
28 | nextGEMS | nextgems_cycle2 | IFS-FESOM | HR2N |
29 | nextGEMS | nextgems_cycle2 | IFS-FESOM | HR2N_nodeep |
Let’s look at the variables of the ICON simulation ngc2009. Detailed information about how to search the catalog can be found here.
[5]:
get_from_cat(cat.search(simulation_id="ngc2009"), ["realm", "frequency", "variable_id"])
[5]:
| | realm | frequency | variable_id |
---|---|---|---|
0 | atm | 1day | (clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
1 | atm | 1day | (psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
2 | atm | 1month | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif) |
3 | atm | 1month | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
4 | atm | 1month | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
5 | atm | 1month | (ua, va, wa, ta, hus, rho, clw, cli, pfull, zghalf, zg, dzghalf) |
6 | atm | 2hour | (phalf,) |
7 | atm | 2minute | (fc, frland, hsurf, p, rnds_dif, rnds_dir, rsds, rvds_dif, rvds_dir, soiltype, t, u, v, w) |
8 | atm | 30minute | (hydro_canopy_cond_limited_box, hydro_w_snow_box, hydro_snow_soil_dens_box) |
9 | atm | 30minute | (hydro_discharge_ocean_box, hydro_drainage_box, hydro_runoff_box, hydro_transpiration_box, sse_grnd_hflx_old_box) |
10 | atm | 30minute | (psl, ps, sit, sic, tas, ts, uas, vas, cfh_lnd) |
11 | atm | 30minute | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif) |
12 | atm | 6hour | (clw, cli, pfull) |
13 | atm | 6hour | (hydro_w_soil_sl_box, hydro_w_ice_sl_box, sse_t_soil_sl_box) |
14 | atm | 6hour | (ta, hus, rho) |
15 | atm | 6hour | (ta, ua, va, clw, hus, zfull, cli, pv) |
16 | atm | 6hour | (tas_gmean, rsdt_gmean, rsut_gmean, rlut_gmean, radtop_gmean, prec_gmean, evap_gmean, fwfoce_gmean) |
17 | atm | 6hour | (ua, va, wa) |
18 | atm | fx | (zghalf, zg, dzghalf) |
19 | lnd | 1month | (hydro_discharge_ocean_box, hydro_drainage_box, hydro_runoff_box, hydro_transpiration_box, sse_grnd_hflx_old_box, hydro_canopy_cond_limited_box, hydro_w_snow_box, hydro_snow_soil_dens_box, hydro_w_soil_sl_box, hydro_w_ice_sl_box, sse_t_soil_sl_box) |
20 | oce | 1day | (atlantic_hfbasin, atlantic_hfl, atlantic_moc, atlantic_sltbasin, atlantic_wfl, global_hfbasin, global_hfl, global_moc, global_sltbasin, global_wfl, pacific_hfbasin, pacific_hfl, pacific_moc, pacific_sltbasin, pacific_wfl) |
21 | oce | 1day | (atmos_fluxes_FrshFlux_Evaporation, atmos_fluxes_FrshFlux_Precipitation, atmos_fluxes_FrshFlux_Runoff, atmos_fluxes_FrshFlux_SnowFall, atmos_fluxes_HeatFlux_Latent, atmos_fluxes_HeatFlux_LongWave, atmos_fluxes_HeatFlux_Sensible, atmos_fluxes_HeatFlux_ShortWave, atmos_fluxes_HeatFlux_Total, atmos_fluxes_stress_x, atmos_fluxes_stress_xw, atmos_fluxes_stress_y, atmos_fluxes_stress_yw, conc, heat_content_seaice, heat_content_snow, heat_content_total, hi, hs, ice_u, ice_v, mlotst, Qbot, Qtop, sea_level_pressure, stretch_c, zos, verticallyTotal_mass_flux_e, Wind_Speed_10m) |
22 | oce | 1day | (so, tke, to, u, v, w, A_tracer_v_to, A_veloc_v, heat_content_liquid_water) |
23 | oce | 1hour | (atmos_fluxes_FrshFlux_Evaporation, atmos_fluxes_FrshFlux_Precipitation, atmos_fluxes_FrshFlux_Runoff, atmos_fluxes_FrshFlux_SnowFall, atmos_fluxes_HeatFlux_Latent, atmos_fluxes_HeatFlux_LongWave, atmos_fluxes_HeatFlux_Sensible, atmos_fluxes_HeatFlux_ShortWave, atmos_fluxes_HeatFlux_Total, atmos_fluxes_stress_x, atmos_fluxes_stress_xw, atmos_fluxes_stress_y, atmos_fluxes_stress_yw, Qbot, Qtop) |
24 | oce | 1hour | (so, to, u, v, conc, hi, hs, ice_u, ice_v, mlotst, sea_level_pressure, stretch_c, Wind_Speed_10m, zos) |
25 | oce | 1month | (A_tracer_v_to, tke) |
26 | oce | 1month | (atmos_fluxes_FrshFlux_Evaporation, atmos_fluxes_FrshFlux_Precipitation, atmos_fluxes_FrshFlux_Runoff, atmos_fluxes_FrshFlux_SnowFall, atmos_fluxes_HeatFlux_Latent, atmos_fluxes_HeatFlux_LongWave, atmos_fluxes_HeatFlux_Sensible, atmos_fluxes_HeatFlux_ShortWave, atmos_fluxes_HeatFlux_Total, atmos_fluxes_stress_x, atmos_fluxes_stress_xw, atmos_fluxes_stress_y, atmos_fluxes_stress_yw, conc, heat_content_seaice, heat_content_snow, heat_content_total, hi, hs, ice_u, ice_v, mlotst, Qbot, Qtop, sea_level_pressure, stretch_c, zos, Wind_Speed_10m) |
27 | oce | 1month | (so, tke, to, u, v, w, A_tracer_v_to, heat_content_liquid_water) |
28 | oce | 1month | (so, to, u, v, w) |
29 | oce | 3hour | (A_tracer_v_to, A_veloc_v, tke) |
30 | oce | 3hour | (so, to, u, v, w) |
31 | oce | 6hour | (total_salt, total_saltinseaice, total_saltinliquidwater, amoc26n, kin_energy_global, pot_energy_global, total_energy_global, ssh_global, sst_global, sss_global, potential_enstrophy_global, HeatFlux_Total_global, FrshFlux_Precipitation_global, FrshFlux_SnowFall_global, FrshFlux_Evaporation_global, FrshFlux_Runoff_global, FrshFlux_VolumeIce_global, FrshFlux_TotalOcean_global, FrshFlux_TotalIce_global, FrshFlux_VolumeTotal_global, totalsnowfall_global, ice_volume_nh, ice_volume_sh, ice_extent_nh, ice_extent_sh, global_heat_content, global_heat_content_solid) |
Let’s look at the surface air temperature (tas):
[6]:
get_from_cat(
cat.search(simulation_id="ngc2009", variable_id="tas"),
["realm", "frequency", "level_type", "variable_id"],
)
[6]:
| | realm | frequency | level_type | variable_id |
---|---|---|---|---|
0 | atm | 1day | ml | (clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
1 | atm | 1day | ml | (psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
2 | atm | 1month | ml | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
3 | atm | 1month | ml | (sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs) |
4 | atm | 30minute | ml | (psl, ps, sit, sic, tas, ts, uas, vas, cfh_lnd) |
[7]:
hits = cat.search(simulation_id="ngc2009", variable_id="tas", frequency="30minute")
# The 1day files would crash the Jupyter kernel because the files are inconsistent across the run.
hits
/work/k20200/k202134/Catalogs/dng-merged catalog with 1 dataset(s) from 817 asset(s):
| | unique |
---|---|
variable_id | 9 |
project | 1 |
institution_id | 1 |
source_id | 1 |
experiment_id | 1 |
simulation_id | 1 |
realm | 1 |
frequency | 1 |
time_reduction | 1 |
grid_label | 1 |
level_type | 1 |
time_min | 817 |
time_max | 817 |
grid_id | 1 |
format | 1 |
uri | 817 |
Note: The variable_id count is still 9, because the file(s) containing tas hold 9 variables in total.
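The following sketch imitates this behaviour in plain Python. It is not how intake-esm implements its search; the rows are hypothetical, with variable names copied from the tables above. Each row stands for one file, and a row matches as soon as the requested variable is among the variables stored in that file.

```python
# Hypothetical catalog rows: variable_id lists every variable in the file.
rows = [
    {"frequency": "30minute",
     "variable_id": ("psl", "ps", "sit", "sic", "tas", "ts", "uas", "vas", "cfh_lnd")},
    {"frequency": "30minute",
     "variable_id": ("hydro_canopy_cond_limited_box", "hydro_w_snow_box", "hydro_snow_soil_dens_box")},
    {"frequency": "6hour", "variable_id": ("ta", "hus", "rho")},
]


def search(rows, variable_id, frequency):
    """Keep rows with the given frequency whose variable tuple contains variable_id."""
    return [
        row for row in rows
        if row["frequency"] == frequency and variable_id in row["variable_id"]
    ]


matches = search(rows, variable_id="tas", frequency="30minute")

# Count all variables that live in the matching files, not only the one searched for
unique_variables = {name for row in matches for name in row["variable_id"]}
print(len(matches), len(unique_variables))
```

The search narrows down the files, not the contents of each file, so all variables stored alongside tas are still counted.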
Loading the data#
Once you have found what you need in the catalog, you can load the actual data.
The option cdf_kwargs={"chunks": {"time": 1}}
is used so that only reasonably sized chunks of data are loaded at a time. Your kernel WILL crash if you try to load the whole dataset at once!
[8]:
dataset_dict = hits.to_dataset_dict(cdf_kwargs={"chunks": {"time": 1}})
--> The keys in the returned dictionary of datasets are constructed as follows:
'project.institution_id.source_id.experiment_id.simulation_id.realm.frequency.time_reduction.grid_label.level_type'
We have only one dataset. To access it, we need its key:
[9]:
keys = list(dataset_dict.keys())
keys
[9]:
['nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009.atm.30minute.inst.gn.ml']
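Since the key is just the listed column values joined by dots, it can be taken apart again with ordinary string handling. A small sketch, assuming that none of the values contains a dot itself:

```python
# Column order copied from the key template printed by to_dataset_dict above.
key_columns = [
    "project", "institution_id", "source_id", "experiment_id", "simulation_id",
    "realm", "frequency", "time_reduction", "grid_label", "level_type",
]

key = "nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009.atm.30minute.inst.gn.ml"

# Splitting on "." only works as long as no value contains a dot itself.
parts = dict(zip(key_columns, key.split(".")))
print(parts["simulation_id"], parts["frequency"])

# Joining the values again reproduces the key.
rebuilt = ".".join(parts[column] for column in key_columns)
assert rebuilt == key
```

This can be handy when a search returns several datasets and you want to pick one by, say, its frequency.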
Now we can finally access the data:
[10]:
dataset = dataset_dict[keys[0]]
dataset
[10]:
<xarray.Dataset>
Dimensions:  (time: 36722, height: 1, ncells: 20971520)
Coordinates:
  * height   (height) float64 2.0
  * time     (time) datetime64[ns] 2020-01-20 2020-01-20T00:30:00 ... 2022-03-01
Dimensions without coordinates: ncells
Data variables:
    tas      (time, height, ncells) float32 dask.array<chunksize=(1, 1, 20971520), meta=np.ndarray>
Attributes: (12/13)
    Conventions:             CF-1.6
    institution:             Max Planck Institute for Meteorology/Deutscher W...
    number_of_grid_used:     15
    CDI:                     Climate Data Interface version 1.8.3rc (http://m...
    uuidOfHGrid:             0f1e7d66-637e-11e8-913b-51232bb4d8f9
    history:                 ./icon at 20220512 152214\n./icon at 20220512 19...
    ...                      ...
    title:                   ICON simulation
    grid_file_uri:           http://icon-downloads.mpimet.mpg.de/grids/public...
    comment:                 Sapphire Dyamond (k203123) on l10739 (Linux 4.18...
    source:                  git@gitlab.dkrz.de:icon/icon-aes.git@87a1eaded69...
    intake_esm_varname:      ['tas']
    intake_esm_dataset_key:  nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009....
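The dimensions shown above explain why the data is loaded in chunks of one time step. A rough back-of-the-envelope estimate, assuming uncompressed float32 values in memory:

```python
# Sizes read off the dataset representation above.
n_time = 36722
n_cells = 20971520
bytes_per_value = 4  # float32

bytes_per_time_step = n_cells * bytes_per_value
bytes_total = n_time * bytes_per_time_step

mib_per_time_step = bytes_per_time_step / 2**20
tib_total = bytes_total / 2**40

print(f"one time step: {mib_per_time_step:.0f} MiB")
print(f"whole variable: {tib_total:.2f} TiB")
```

A single time step of about 80 MiB fits comfortably into memory, whereas the full variable at roughly 2.8 TiB does not.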
[11]:
dataset.tas.isel(time=1).min().values
# The first time step contains only zeros, so we select the second one with isel(time=1).
[11]:
array(225.27545, dtype=float32)
[12]:
dataset.tas.isel(time=1).max().values
[12]:
array(312.81677, dtype=float32)
[13]:
dataset.tas.max(dim="ncells") # lazy evaluation - no real work is done yet.
[13]:
<xarray.DataArray 'tas' (time: 36722, height: 1)>
dask.array<_nanmax_skip-aggregate, shape=(36722, 1), dtype=float32, chunksize=(1, 1), chunktype=numpy.ndarray>
Coordinates:
  * height   (height) float64 2.0
  * time     (time) datetime64[ns] 2020-01-20 2020-01-20T00:30:00 ... 2022-03-01
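Lazy evaluation means that the expression above only describes the computation. Data is read once a result is requested. The sketch below illustrates the principle with a plain Python generator. It is an analogy only: dask builds a task graph rather than a generator, and load_time_step with its values is a hypothetical stand-in for reading a file.

```python
load_log = []  # records which time steps were actually read


def load_time_step(step):
    """Stand-in for reading one time step from disk (hypothetical values)."""
    load_log.append(step)
    return [270.0 + step, 280.0 + step, 290.0 + step]


n_time = 5

# Building the generator describes the computation but runs nothing.
max_per_step = (max(load_time_step(step)) for step in range(n_time))
loads_before = len(load_log)

# Work happens only when values are requested, one time step at a time.
first_two = [next(max_per_step), next(max_per_step)]
loads_after = len(load_log)

print(loads_before, loads_after, first_two)
```

With the real dataset, it is usually a good idea to restrict the time range before evaluating, for example with dataset.tas.isel(time=slice(0, 48)).max(dim="ncells").compute(), so that only a small part of the data is read.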
[14]:
# evaluate if you have time to spare
# dataset.tas.max(dim="ncells").values