Command-line data discovery with query_yaml#
query_yaml is a command-line tool for discovering and locating datasets in the nextGEMS and other catalogs. It provides an efficient way to browse the hierarchical catalog structure and find specific files for analysis tools like CDO.
Getting started#
Loading the module#
Load query_yaml on DKRZ systems:
module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable
Basic usage#
Display the complete nextGEMS catalog tree:
query_yaml
Navigate to specific subcatalogs by adding dataset names:
# Browse ICON datasets
query_yaml ICON
# View contents of a specific dataset
query_yaml ICON ngc4008
Other catalogs#
To query the EERIE catalog, use the --eerie flag; for the 2025 Digital Earths Global Hackathon catalog, use --hk25:
query_yaml --eerie
Other catalogs can be specified with the -c/--catalog_file option:
query_yaml --catalog_file https://url.to/your/catalog.yaml
CDO integration#
For CDO workflows, combine --cdo with --var to get properly formatted file paths:
query_yaml ICON ngc4008 --cdo --var tas
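When the output of this command is substituted directly into a cdo call, an empty result leaves cdo without input files. The following is a minimal sketch of a guard around it. The helper name cdo_inputs is hypothetical and not part of query_yaml, and the sketch assumes that query_yaml prints nothing when no file matches the variable.

```shell
# cdo_inputs VAR BRANCH...  (hypothetical helper, not part of query_yaml)
# Prints the CDO-formatted paths for VAR, or fails with a message if none are found.
cdo_inputs() {
    local var="$1"; shift
    local paths
    # "$@" now holds the catalog branches, e.g. ICON ngc4008
    paths="$(query_yaml "$@" --cdo --var "$var")" || return 1
    if [ -z "$paths" ]; then
        echo "no files found for variable '$var' in: $*" >&2
        return 1
    fi
    printf '%s\n' "$paths"
}

# Possible use (not run here):
#   files="$(cdo_inputs tas ICON ngc4008)" && cdo -s -infov $files
```

The command substitution is stored in a variable first so that a failure stops the chain before cdo is started.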
Getting help#
View all available options:
query_yaml --help
usage: query_yaml [-h] [-c CATALOG_FILE | --eerie | --hk25 | --ruby] [-s [SEARCH_ARGS ...]] [--uri] [--var VAR] [--cdo] [-v] [branches ...]

Query contents of a YAML catalog.

positional arguments:
  branches              specify branches of the tree to follow, e.g.
                        IFS tco2559-ng5-cycle3 2D_1h_native

options:
  -h, --help            show this help message and exit
  -c CATALOG_FILE, --catalog_file CATALOG_FILE
                        catalog to search, default = https://data.nextgems-h2020.eu/catalog.yaml
  --eerie               use the EERIE catalog
  --hk25                use the 2025 Global Hackathon catalog
  --ruby                use the RUBY catalog
  -s [SEARCH_ARGS ...], --search_args [SEARCH_ARGS ...]
                        specify search arguments for the YAML dataset at the end of the tree, e.g.
                        zoom=5 time=P1D
  --uri                 print uris of files in this dataset
  --var VAR             only print uris for files containing VAR
  --cdo                 Format output for CDO
  -v, --verbose         print debugging output

Try things like
  query_yaml
  query_yaml ICON
  query_yaml ICON ngc3028
  query_yaml ICON ngc3028 --search_args time=PT3H zoom=5
  query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_daily_native --uri --var vice
  cdo -s --eccodes -infov [ -select,name=2t,timestep=1/15 $(query_yaml IFS IFS_9-NEMO_25-cycle3 2D_monthly_0.25deg --cdo --var=2t) ]
Working with different dataset types#
The nextGEMS catalog contains datasets stored in various formats. The optimal query_yaml approach depends on how the data is organized:
Datasets with variants#
Characteristics:
Multiple dataset variants (e.g., different temporal/spatial resolutions)
Variants shown in parentheses:
ngc4008 (time, zoom)
Usage:
# Browse available variants
query_yaml ICON ngc4008
# Select specific variant
query_yaml ICON ngc4008 --search_args time=PT3H zoom=5
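To compare several variants in one go, the call above can be wrapped in a loop. This is only a sketch: list_variants is a hypothetical helper, and the zoom levels passed to it are assumed to exist for the dataset.

```shell
# list_variants TIME ZOOM...  (hypothetical helper)
# Queries ICON ngc4008 once per zoom level at a fixed time resolution.
list_variants() {
    local time_res="$1"; shift
    local zoom
    for zoom in "$@"; do
        # header line so the individual results can be told apart
        echo "== time=${time_res} zoom=${zoom} =="
        query_yaml ICON ngc4008 --search_args "time=${time_res}" "zoom=${zoom}"
    done
}

# Possible use (not run here):
#   list_variants PT3H 5 7
```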
Zarr datasets#
Characteristics:
Data stored in Zarr format
Many variables and time steps in a single Zarr store
Variable filtering happens at the CDO level
Zarr datasets contain many variables. Use the cdo -select operator to choose specific variables and/or time slices efficiently.
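The selection can be sketched along the lines of the cdo example in the help text. The helper names select_op and zarr_info are hypothetical, and the sketch assumes that query_yaml with --cdo (and without --var) prints a path that CDO can open for the Zarr store.

```shell
# select_op NAME [TIMESTEPS]  (hypothetical helper)
# Builds a CDO -select operator, e.g. -select,name=tas,timestep=1/15
select_op() {
    local name="$1" steps="${2:-}"
    local op="-select,name=${name}"
    if [ -n "$steps" ]; then
        op="${op},timestep=${steps}"
    fi
    printf '%s\n' "$op"
}

# zarr_info NAME TIMESTEPS BRANCH... [query_yaml options]  (hypothetical helper)
# Prints a CDO summary of the selected variable and time steps.
zarr_info() {
    local name="$1" steps="$2"; shift 2
    # $(...) is left unquoted on purpose so that CDO sees one argument per path
    cdo -s -infov [ "$(select_op "$name" "$steps")" $(query_yaml "$@" --cdo) ]
}

# Possible use (not run here), assuming the chosen variant is stored as Zarr:
#   zarr_info tas 1/15 ICON ngc4008 --search_args time=PT3H zoom=5
```

Selecting inside cdo means only the requested variable and time steps are read from the store, which is why no --var filter is passed to query_yaml here.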
Multi-file netCDF datasets (no kerchunk)#
Characteristics:
Data spread across multiple netCDF files
Slow catalog browsing (must open files to read metadata)
Usage:
# Fast file listing (no metadata inspection)
query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri
# Get files for specific variable
query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri --var sst
Performance consideration
Without --uri, query_yaml must open every file to collect the metadata, making browsing significantly slower for large datasets.
Kerchunk-aggregated datasets (netCDF, FDB/GRIB)#
Characteristics:
Fast catalog operations via kerchunk indices
Unified access to heterogeneous file formats
Usage:
# Find kerchunk index
query_yaml DATASET_NAME --uri
# Extract individual file paths for CDO
query_yaml DATASET_NAME --cdo --var temperature
File ordering
Files in kerchunk datasets are sorted alphabetically by query_yaml, which may not correspond to temporal order. Verify file sequences before time-series operations.
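One way to verify the sequence is to compare the date stamps embedded in the file names. The sketch below rests on two assumptions that may not hold for every dataset: each file name contains a YYYYMMDD stamp, and the paths arrive one per line. The helper name check_time_order is hypothetical.

```shell
# check_time_order  (hypothetical helper)
# Reads file paths on stdin, one per line. Returns 0 if the first YYYYMMDD
# stamp of each file name is in non-decreasing order, 1 if not, and 2 if no
# stamp was found at all.
check_time_order() {
    local stamps
    # look only at the file name (last path component) and keep its first 8-digit run
    stamps="$(awk -F/ '{
        if (match($NF, /[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/))
            print substr($NF, RSTART, 8)
    }')"
    if [ -z "$stamps" ]; then
        echo "no YYYYMMDD stamps found in input" >&2
        return 2
    fi
    # sort -c only checks the order; it does not rearrange anything
    printf '%s\n' "$stamps" | LC_ALL=C sort -c 2>/dev/null
}

# Possible use (not run here); tr puts space-separated paths on separate lines:
#   query_yaml DATASET_NAME --cdo --var temperature | tr -s ' ' '\n' | check_time_order \
#       || echo "file order differs from temporal order" >&2
```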
Best practices#
Start broad, then narrow: Begin with dataset hierarchy, then drill down to specific variables
Specify variants with --search_args for multi-resolution datasets to get the exact resolution needed
Use --cdo --var for CDO workflows to get properly formatted output
Leverage --uri for performance when browsing large multi-file datasets
Verify file order for time-series analysis, especially with kerchunk datasets