---
file_format: myst
---

# Command-line data discovery with query_yaml

`query_yaml` is a command-line tool for discovering and locating datasets in the [nextGEMS](https://nextgems-h2020.eu) and other catalogs. It provides an efficient way to browse the hierarchical catalog structure and find specific files for analysis tools like CDO.

## Getting started

### Loading the module

Load `query_yaml` on DKRZ systems:

```bash
module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable
```

### Basic usage

Display the complete nextGEMS catalog tree:

```bash
query_yaml
```

Navigate to specific subcatalogs by adding dataset names:

```bash
# Browse ICON datasets
query_yaml ICON

# View contents of a specific dataset
query_yaml ICON ngc4008
```

### Other catalogs

To query the [EERIE](https://eerie-project.eu) catalog, use the `--eerie` flag; for the 2025 [digital earths global hackathon](https://digital-earths-global-hackathon.github.io/hk25/) catalog, use `--hk25`:

```bash
query_yaml --eerie
```

Other catalogs can be specified with the `--catalog` option:

```bash
query_yaml --catalog https://url.to/your/catalog.yaml
```

### CDO integration

For CDO workflows, combine `--cdo` with `--var` to get properly formatted file paths:

```bash
query_yaml ICON ngc4008 --cdo --var tas
```

### Getting help

View all available options:

```{literalinclude} query_yaml_help
```

## Working with different dataset types

The nextGEMS catalog contains datasets stored in various formats.
The optimal `query_yaml` approach depends on how the data is organized:

### Datasets with variants

**Characteristics:**

- Multiple dataset variants (e.g., different temporal/spatial resolutions)
- Variants shown in parentheses: `ngc4008 (time, zoom)`

**Usage:**

```bash
# Browse available variants
query_yaml ICON ngc4008

# Select specific variant
query_yaml ICON ngc4008 --search_args time=PT3H zoom=5
```

### Zarr datasets

**Characteristics:**

- Data stored in Zarr format
- Many variables and time steps in a single Zarr store

```{admonition} Variable filtering on CDO level
:class: warning
Zarr datasets contain many variables. Use the CDO [`-select`](https://code.mpimet.mpg.de/projects/cdo/embedded/index.html#x1-1910002.3.1) operator to choose specific variables and/or time slices efficiently.
```

### Multi-file netCDF datasets (no kerchunk)

**Characteristics:**

- Data spread across multiple netCDF files
- Slow catalog browsing (must open files to read metadata)

**Usage:**

```bash
# Fast file listing (no metadata inspection)
query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri

# Get files for specific variable
query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri --var sst
```

```{admonition} Performance consideration
:class: note
Without `--uri`, `query_yaml` must open every file to collect the metadata, making browsing significantly slower for large datasets.
```

### Kerchunk-aggregated datasets (netCDF, FDB/GRIB)

**Characteristics:**

- Fast catalog operations via kerchunk indices
- Unified access to heterogeneous file formats

**Usage:**

```bash
# Find kerchunk index
query_yaml DATASET_NAME --uri

# Extract individual file paths for CDO
query_yaml DATASET_NAME --cdo --var temperature
```

```{admonition} File ordering
:class: warning
Files in kerchunk datasets are sorted alphabetically by `query_yaml`, which may not correspond to temporal order. Verify file sequences before time-series operations.
```

## Best practices

1. **Start broad, then narrow**: Begin with the dataset hierarchy, then drill down to specific variables
2. **Specify variants** with `--search_args` for multi-resolution datasets to get the exact resolution needed
3. **Use `--cdo --var`** for CDO workflows to get properly formatted output
4. **Leverage `--uri`** for performance when browsing large multi-file datasets
5. **Verify file order** for time-series analysis, especially with kerchunk datasets
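
The file-order check from the last practice can be partly scripted. The following is a minimal sketch, not part of `query_yaml`: the helper name `check_order` is hypothetical, and it assumes that the command prints a whitespace-separated list of file paths and that the file names carry a counter or timestamp for which a natural (version) sort matches temporal order. Neither assumption is guaranteed by the tool, so inspect the actual output first.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of query_yaml).
# Reads file paths from stdin, separated by spaces or newlines (assumed format),
# and compares the incoming order with a natural (version) sort.
check_order() {
    local listed sorted
    # Put one path per line and drop empty lines.
    listed=$(tr -s '[:space:]' '\n' | sed '/^$/d')
    # GNU sort -V orders embedded numbers numerically (f_2 before f_10).
    sorted=$(printf '%s\n' "$listed" | sort -V)
    if [ "$listed" = "$sorted" ]; then
        echo "order ok"
    else
        echo "order differs from natural sort"
        return 1
    fi
}

# Possible usage, assuming the output format described above:
# query_yaml DATASET_NAME --cdo --var temperature | check_order
```

A non-zero return status only signals that the listed order and the natural sort disagree. Which of the two matches the time axis still has to be confirmed against the data.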