Command-line data discovery with query_yaml#

query_yaml is a command-line tool for discovering and locating datasets in the nextGEMS and other catalogs. It provides an efficient way to browse the hierarchical catalog structure and find specific files for analysis tools like CDO.

Getting started#

Loading the module#

Load query_yaml on DKRZ systems:

module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable

Basic usage#

Display the complete nextGEMS catalog tree:

query_yaml

Navigate to specific subcatalogs by adding dataset names:

# Browse ICON datasets
query_yaml ICON

# View contents of a specific dataset
query_yaml ICON ngc4008

Other catalogs#

To query the EERIE catalog, use the --eerie flag. For the 2025 Digital Earths Global Hackathon catalog, use --hk25:

query_yaml --eerie

Other catalogs can be specified with the -c (--catalog_file) option:

query_yaml --catalog_file https://url.to/your/catalog.yaml

CDO integration#

For CDO workflows, combine --cdo with --var to get properly formatted file paths:

query_yaml ICON ngc4008 --cdo --var tas
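As a minimal sketch, the file list can be handed to CDO through command substitution, following the pattern shown in the tool's own help text. The function name cdo_infov_var is hypothetical, and the sketch assumes that the output of --cdo can be word-split directly into CDO input arguments.

```shell
# Hypothetical helper: print a CDO summary for one variable of a dataset.
# Usage: cdo_infov_var VAR BRANCH [BRANCH ...]
cdo_infov_var() {
    var=$1
    shift
    # Stop early if the tools are not available (e.g. module not loaded).
    command -v query_yaml >/dev/null 2>&1 || { echo "query_yaml not found" >&2; return 127; }
    command -v cdo >/dev/null 2>&1 || { echo "cdo not found" >&2; return 127; }
    # Remaining arguments are the catalog branches, e.g. ICON ngc4008.
    files=$(query_yaml "$@" --cdo --var "$var") || return 1
    [ -n "$files" ] || { echo "no files found for $var" >&2; return 1; }
    # Assumption: $files is left unquoted on purpose so that each path
    # becomes a separate CDO input argument.
    cdo -s -infov [ -select,name="$var" $files ]
}
```

Calling cdo_infov_var tas ICON ngc4008 would then run the same query as above and pass the result to CDO. For GRIB-based datasets, the help text additionally uses the --eccodes option of CDO.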

Getting help#

View all available options with query_yaml --help:

usage: query_yaml [-h] [-c CATALOG_FILE | --eerie | --hk25 | --ruby] [-s [SEARCH_ARGS ...]] [--uri] [--var VAR] [--cdo] [-v] [branches ...]

Query contents of a YAML catalog.

positional arguments:
  branches              specify branches of the tree to follow, e.g.
                        IFS tco2559-ng5-cycle3 2D_1h_native

options:
  -h, --help            show this help message and exit
  -c CATALOG_FILE, --catalog_file CATALOG_FILE
                        catalog to search, default = https://data.nextgems-h2020.eu/catalog.yaml
  --eerie               use the EERIE catalog
  --hk25                use the 2025 Global Hackathon catalog
  --ruby                use the RUBY catalog
  -s [SEARCH_ARGS ...], --search_args [SEARCH_ARGS ...]
                        specify search arguments for the YAML dataset at the end of the tree, e.g.
                        zoom=5 time=P1D
  --uri                 print uris of files in this dataset
  --var VAR             only print uris for files containing VAR
  --cdo                 Format output for CDO
  -v, --verbose         print debugging output

Try things like
    query_yaml
    query_yaml ICON
    query_yaml ICON ngc3028
    query_yaml ICON ngc3028 --search_args time=PT3H zoom=5
    query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_daily_native --uri --var vice
    cdo -s --eccodes -infov [ -select,name=2t,timestep=1/15  $(query_yaml IFS IFS_9-NEMO_25-cycle3 2D_monthly_0.25deg  --cdo --var=2t) ]

Working with different dataset types#

The nextGEMS catalog contains datasets stored in various formats. The optimal query_yaml approach depends on how the data is organized:

Datasets with variants#

Characteristics:

  • Multiple dataset variants (e.g., different temporal/spatial resolutions)

  • Variants shown in parentheses: ngc4008 (time, zoom)

Usage:

# Browse available variants
query_yaml ICON ngc4008

# Select specific variant
query_yaml ICON ngc4008 --search_args time=PT3H zoom=5

Zarr datasets#

Characteristics:

  • Data stored in Zarr format

  • Many variables and time steps in a single Zarr store

Variable filtering at the CDO level

Zarr datasets contain many variables. Use the cdo -select operator to choose specific variables and/or time slices efficiently.
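The following sketch builds such a -select operator from a variable name and an optional time step range, using the name= and timestep= parameters that appear in the tool's help text. The function name cdo_select_expr is hypothetical. The commented usage line assumes that --cdo can be combined with --search_args and used without --var, which is not confirmed here.

```shell
# Hypothetical helper: build a CDO -select operator string.
# Usage: cdo_select_expr NAME [FIRST LAST]
cdo_select_expr() {
    sel="-select,name=$1"
    # Append a time step range only when both bounds are given.
    if [ -n "${2:-}" ] && [ -n "${3:-}" ]; then
        sel="$sel,timestep=$2/$3"
    fi
    printf '%s\n' "$sel"
}

# Example (requires cdo and query_yaml, so it is left commented out).
# Assumption: --cdo works together with --search_args and without --var.
# cdo -s -infov [ "$(cdo_select_expr tas 1 15)" \
#     $(query_yaml ICON ngc4008 --search_args time=PT3H zoom=5 --cdo) ]
```

Selecting inside CDO keeps the query itself simple and lets CDO read only the requested variable and time steps from the store.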

Multi-file netCDF datasets (no kerchunk)#

Characteristics:

  • Data spread across multiple netCDF files

  • Slow catalog browsing (must open files to read metadata)

Usage:

# Fast file listing (no metadata inspection)
query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri

# Get files for specific variable
query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri --var sst

Performance consideration

Without --uri, query_yaml must open every file to collect the metadata, making browsing significantly slower for large datasets.
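Because the --uri listing is cheap, it can serve as a quick sanity check before any heavier processing. The sketch below counts the entries of such a listing. The function name count_uris is hypothetical, and it assumes that query_yaml prints one URI per line.

```shell
# Hypothetical helper: count the non-empty lines read from stdin.
# Assumption: query_yaml --uri prints one URI per line.
count_uris() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so the status is reset to keep pipelines usable.
    grep -c . || true
}

# Example (requires query_yaml, so it is left commented out):
# query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri --var sst | count_uris
```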

Kerchunk-aggregated datasets (netCDF, FDB/GRIB)#

Characteristics:

  • Fast catalog operations via kerchunk indices

  • Unified access to heterogeneous file formats

Usage:

# Find kerchunk index
query_yaml DATASET_NAME --uri

# Extract individual file paths for CDO
query_yaml DATASET_NAME --cdo --var temperature

File ordering

Files in kerchunk datasets are sorted alphabetically by query_yaml, which may not correspond to temporal order. Verify file sequences before time-series operations.
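One inexpensive way to spot ordering problems is to compare plain alphabetical order with natural (version) order of the file names. This is only a sketch: the function name check_file_order is hypothetical, it assumes one path per line on stdin, it relies on the -V option of GNU sort, and it treats numeric counters or date stamps in the file names as a stand-in for temporal order. It does not inspect the time axis of the files.

```shell
# Hypothetical helper: read one path per line on stdin and report whether
# alphabetical order matches natural (version) order.
check_file_order() {
    list=$(cat)
    # Byte-wise alphabetical order, independent of the user's locale.
    alpha=$(printf '%s\n' "$list" | LC_ALL=C sort)
    # Natural order: embedded numbers are compared by value (GNU sort -V).
    natural=$(printf '%s\n' "$list" | LC_ALL=C sort -V)
    if [ "$alpha" = "$natural" ]; then
        echo "order consistent"
    else
        echo "order differs"
        return 1
    fi
}

# Example (requires query_yaml, so it is left commented out). If the paths
# are printed space-separated, split them into lines first:
# query_yaml DATASET_NAME --cdo --var temperature | tr ' ' '\n' | check_file_order
```

A result of "order differs" typically points to counters without zero padding, such as file_2.nc sorting after file_10.nc.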

Best practices#

  1. Start broad, then narrow: Begin with the dataset hierarchy, then drill down to specific variables

  2. Specify variants with --search_args for multi-resolution datasets to get the exact resolution needed

  3. Use --cdo --var for CDO workflows to get properly formatted output

  4. Leverage --uri for performance when browsing large multi-file datasets

  5. Verify file order for time-series analysis, especially with kerchunk datasets