Finding files on the command line with query_yaml

To load query_yaml, use

module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable

Then you can search for files with query_yaml. Calling it without any arguments displays a tree view of the nextGEMS catalog. Adding the names of sub-trees narrows the search (e.g. query_yaml ICON). Once the search is narrowed to a single dataset, the contents of that dataset are listed (e.g. query_yaml ICON ngc4008). If you want to pass the output of query_yaml to cdo, using --cdo together with --var NAME on one specific dataset is generally a good choice.

The full list of options is shown by the help function (query_yaml --help):

usage: query_yaml.py [-h] [-c CATALOG_FILE] [-s [SEARCH_ARGS ...]] [--uri] [--var VAR] [--cdo] [-v] [branches ...]

Query contents of a YAML catalog.

positional arguments:
  branches              specify branches of the tree to follow, e.g.
                        IFS tco2559-ng5-cycle3 2D_1h_native

options:
  -h, --help            show this help message and exit
  -c CATALOG_FILE, --catalog_file CATALOG_FILE
                        catalog to search, default = https://data.nextgems-h2020.eu/catalog.yaml
  -s [SEARCH_ARGS ...], --search_args [SEARCH_ARGS ...]
                        specify search arguments for the YAML dataset at the end of the tree, e.g.
                        zoom=5 time=P1D
  --uri                 print uris of files in this dataset
  --var VAR             only print uris for files containing VAR
  --cdo                 Format output for CDO
  -v, --verbose         print debugging output

Try things like
    query_yaml
    query_yaml ICON
    query_yaml ICON ngc3028
    query_yaml ICON ngc3028 --search_args time=PT3H zoom=5
    query_yaml FESOM tco2559-ng5-cycle3 2d_vertices_daily --uri --var vice
    cdo -s --eccodes -infov [ -select,name=2t,timestep=1/15  $(query_yaml IFS IFS_9-NEMO_25-cycle3 2D_monthly_0.25deg  --cdo --var=2t) ]
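When the output of query_yaml is substituted directly into a cdo call, as in the last example above, a query that matches nothing leaves cdo without input files. A minimal sketch of one way to guard against that is shown below; require_output is a hypothetical helper written for this illustration and is not part of hsm-tools, and the query itself is copied from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of hsm-tools): run a command and fail loudly
# if it prints nothing, so that cdo is never started with an empty file list.
require_output() {
    local out
    out=$("$@") || return 1
    if [ -z "$out" ]; then
        echo "no output from: $*" >&2
        return 1
    fi
    printf '%s\n' "$out"
}

# Only query the catalog where query_yaml is actually available.
if command -v query_yaml >/dev/null 2>&1; then
    # $files is left unquoted on purpose so that each entry becomes its own argument.
    files=$(require_output query_yaml IFS IFS_9-NEMO_25-cycle3 2D_monthly_0.25deg --cdo --var=2t) &&
        cdo -s --eccodes -infov [ -select,name=2t,timestep=1/15 $files ]
fi
```

Storing the list in a variable first also makes it easy to inspect with echo before any data is read.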

Dealing with dataset variants

Zarr datasets with various variants

  • variants are indicated in parentheses after the dataset name, e.g. ngc4008 (time, zoom).

  • query_yaml will be fast.

  • use queries with --search_args, e.g. --search_args time=PT3H zoom=5 to get the desired file set.

  • combine with --cdo to get the decorations needed for opening with cdo (or other libnetcdf-based utilities).

  • Note that the resulting dataset will still contain a lot of variables, so don’t just feed it into cdo -timmean without selecting a variable first.
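As a sketch of the points above, the variable can be selected inside the cdo call before any statistics are computed. The bracket syntax mirrors the cdo example in the help text; the dataset and search arguments are taken from the examples there, while the variable name tas and the output file name are assumptions made for illustration only.

```shell
#!/usr/bin/env bash
var=tas                      # hypothetical variable name; use one that exists in the dataset
outfile="${var}_timmean.nc"  # hypothetical output file name

# Only run where both tools are available (e.g. after loading the modules).
if command -v query_yaml >/dev/null 2>&1 && command -v cdo >/dev/null 2>&1; then
    # Select one variable first, then compute the time mean of that variable only.
    cdo -timmean [ -select,name="$var" \
        $(query_yaml ICON ngc3028 --search_args time=PT3H zoom=5 --cdo) ] "$outfile"
fi
```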

Datasets spread over various netCDF files (no kerchunk)

  • query_yaml will be slow to show the contents of the dataset (without --uri), as it has to open all files to check their contents.

  • using query_yaml with --uri but without --var NAME will list all files of the dataset, regardless of which variable you are interested in (may or may not be useful).

  • combine --uri with --var to get files for a specific variable: query_yaml.py FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri --var sst
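If a query on such a dataset turns out to be slow, its result can be kept in a text file and reused. The following is only a sketch: cached and the file name sst_files.txt are hypothetical and not part of hsm-tools, and the text above only states that listing the contents without --uri is slow, so whether caching pays off for a --uri query is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of hsm-tools): run a command once, store its
# output in a cache file, and print the cached copy on later calls.
cached() {
    local cache_file=$1
    shift
    if [ ! -s "$cache_file" ]; then
        # Remove a partial cache file if the command fails.
        "$@" > "$cache_file" || { rm -f "$cache_file"; return 1; }
    fi
    cat "$cache_file"
}

# Only query the catalog where query_yaml is actually available.
if command -v query_yaml >/dev/null 2>&1; then
    cached sst_files.txt query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri --var sst
fi
```

Delete the cache file to force a fresh query, for example after the dataset has been extended.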

Datasets represented via kerchunk (some netCDF, FDB/GRIB)

  • query_yaml will be fast.

  • Plain --uri will lead you to the index, not to the data files.

  • Use --cdo with --var NAME to get the actual file names.

  • File names will be sorted alphabetically as a best guess. Whether this is the right order in time depends on how the person who created the files named them.
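The caveat about alphabetical sorting can be illustrated without touching the catalog. The sketch below uses made-up file names whose counter is not zero-padded and compares plain alphabetical order with the natural (version) order provided by GNU sort -V; the names are purely illustrative and do not come from any dataset.

```shell
#!/usr/bin/env bash
# Made-up file names without zero padding, purely for illustration.
names='out_1.nc
out_10.nc
out_2.nc'

# Plain alphabetical order: out_10.nc sorts before out_2.nc.
alphabetical=$(printf '%s\n' "$names" | LC_ALL=C sort)
# Natural (version) order, a GNU sort extension: out_2.nc sorts before out_10.nc.
natural=$(printf '%s\n' "$names" | LC_ALL=C sort -V)

if [ "$alphabetical" != "$natural" ]; then
    echo "alphabetical order differs from numeric order; check the time axis" >&2
fi
```

With zero-padded counters or dates in the file names the two orders usually agree, which is the case the alphabetical best guess relies on.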

see also