---
file_format: myst
---

# Command-line data discovery with query_yaml

`query_yaml` is a command-line tool for discovering and locating datasets in the [nextGEMS](https://nextgems-h2020.eu) and other catalogs. It provides an efficient way to browse the hierarchical catalog structure and find specific files for analysis tools like CDO.

## Getting started

### Loading the module

Load `query_yaml` on DKRZ systems:

```bash
module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable
```

### Basic usage

Display the complete nextGEMS catalog tree:

```bash
query_yaml
```

Navigate to specific subcatalogs by adding dataset names:

```bash
# Browse ICON datasets
query_yaml ICON

# View contents of a specific dataset
query_yaml ICON ngc4008
```

### Other catalogs

To query the [EERIE](https://eerie-project.eu) catalog, use the `--eerie` flag. For the 2025 [Digital Earths Global Hackathon](https://digital-earths-global-hackathon.github.io/hk25/) catalog, use `--hk25`:

```bash
query_yaml --eerie
```

Other catalogs can be specified with the `--catalog` option:

```bash
query_yaml --catalog https://url.to/your/catalog.yaml
```

### CDO integration

For CDO workflows, combine `--cdo` with `--var` to get properly formatted file paths:

```bash
query_yaml ICON ngc4008 --cdo --var tas
```
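
The command above can be fed to CDO through command substitution. The sketch below is only an illustration: it assumes that `--cdo` prints whitespace-separated paths that CDO can open directly, and the wrapper function `inspect_tas` is a hypothetical name, not part of `query_yaml`. It uses CDO's `sinfon` operator to print a short summary of the selected files.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarise the files query_yaml reports for variable tas.
# Assumption: --cdo prints whitespace-separated paths that CDO accepts as input.
inspect_tas() {
    local paths
    if ! command -v query_yaml >/dev/null 2>&1; then
        echo "query_yaml not found; load the hsm-tools module first" >&2
        return 1
    fi
    paths="$(query_yaml ICON ngc4008 --cdo --var tas)" || return 1
    # Intentionally unquoted so that each path becomes its own argument.
    # shellcheck disable=SC2086
    cdo sinfon $paths
}
```

Call `inspect_tas` after loading the module. If the output format on your system differs from the assumption above, adapt the way `paths` is split before passing it to CDO.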

### Getting help

View all available options:

```{literalinclude} query_yaml_help
```

## Working with different dataset types

The nextGEMS catalog contains datasets stored in various formats. The best way to use `query_yaml` depends on how the data is organized.

### Datasets with variants

**Characteristics:**
- Multiple dataset variants (e.g., different temporal/spatial resolutions)
- Variants shown in parentheses: `ngc4008 (time, zoom)`

**Usage:**
```bash
# Browse available variants
query_yaml ICON ngc4008

# Select specific variant
query_yaml ICON ngc4008 --search_args time=PT3H zoom=5
```

### Zarr datasets

**Characteristics:**
- Data stored in Zarr format
- Many variables and time steps in a single Zarr store

```{admonition} Variable filtering at the CDO level
:class: warning

Zarr datasets contain many variables. Use the CDO [`-select`](https://code.mpimet.mpg.de/projects/cdo/embedded/index.html#x1-1910002.3.1) operator to choose specific variables and/or time slices efficiently.
```
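
As a rough illustration of the `-select` approach, the sketch below builds a CDO call that keeps one variable and one month. The input placeholder `INPUT_PATH`, the output name, and the date range are assumptions made for this example; substitute the path that `query_yaml` reports for your dataset. The command is printed rather than executed so that it can be reviewed first.

```bash
#!/usr/bin/env bash
# Placeholder input: substitute the path that query_yaml reports for the dataset.
infile="INPUT_PATH"
# Illustrative output file name.
outfile="tas_2020-01.nc"

# Select a single variable (tas) and a single month in one pass.
cmd=(cdo "-select,name=tas,startdate=2020-01-01T00:00:00,enddate=2020-01-31T23:59:59" "$infile" "$outfile")

# Print the command for review.
echo "${cmd[*]}"
# "${cmd[@]}"   # uncomment to run once infile points at real data
```

Restricting the selection this way means CDO only has to process the requested variable and time range instead of the whole store.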

### Multi-file netCDF datasets (no kerchunk)

**Characteristics:**
- Data spread across multiple netCDF files
- Slow catalog browsing (`query_yaml` must open each file to read its metadata)

**Usage:**
```bash
# Fast file listing (no metadata inspection)
query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri

# Get files for specific variable
query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri --var sst
```

```{admonition} Performance consideration
:class: note

Without `--uri`, `query_yaml` must open every file to collect the metadata, making browsing significantly slower for large datasets.
```

### Kerchunk-aggregated datasets (netCDF, FDB/GRIB)

**Characteristics:**
- Fast catalog operations via kerchunk indices
- Unified access to heterogeneous file formats

**Usage:**
```bash
# Find kerchunk index
query_yaml DATASET_NAME --uri

# Extract individual file paths for CDO
query_yaml DATASET_NAME --cdo --var temperature
```

```{admonition} File ordering
:class: warning

Files in kerchunk datasets are sorted alphabetically by `query_yaml`, which may not correspond to temporal order. Verify file sequences before time-series operations.
```
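
One quick way to spot a possible ordering problem is to compare the alphabetical order with a natural (numeric-aware) order of the file names. The sketch below uses hypothetical file names and GNU `sort -V`; natural order is only a proxy for temporal order, so a match does not prove that the sequence is correct.

```bash
#!/usr/bin/env bash
# Hypothetical file names; real ones come from query_yaml.
files='out_1.nc
out_10.nc
out_2.nc'

# Alphabetical order (LC_ALL=C makes the comparison locale-independent).
alphabetical="$(printf '%s\n' "$files" | LC_ALL=C sort)"
# Natural (version) order treats embedded numbers numerically; requires GNU sort.
natural="$(printf '%s\n' "$files" | LC_ALL=C sort -V)"

if [ "$alphabetical" = "$natural" ]; then
    order_status="consistent"
else
    order_status="differs"
fi
echo "alphabetical vs natural order: $order_status"
```

If the two orders differ, inspect the time axis of the files themselves, for example with CDO's `showtimestamp` operator, before running time-series operations.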

## Best practices

1. **Start broad, then narrow**: Begin with the dataset hierarchy, then drill down to specific variables
2. **Specify variants** with `--search_args` for multi-resolution datasets to get the exact resolution needed
3. **Use `--cdo --var`** for CDO workflows to get properly formatted output
4. **Leverage `--uri`** for performance when browsing large multi-file datasets
5. **Verify file order** for time-series analysis, especially with kerchunk datasets