Finding files on the command line with query_yaml
To load query_yaml, use:
    module use /work/k20200/k202134/hsm-tools/outtake/module
    module load hsm-tools/unstable
Then you can search for files with query_yaml. Just calling it without any other arguments will display a tree view of the nextGEMS catalog. Adding names of sub-trees will limit the search (e.g. query_yaml ICON). Once you have limited it to one dataset, the contents of this dataset will be listed (query_yaml ICON ngc4008).
In general, using --cdo with --var NAME on one specific dataset is a good choice if you want to use the output of query_yaml with cdo.
The full list of options can be obtained from the help function (query_yaml -h):
usage: query_yaml.py [-h] [-c CATALOG_FILE] [-s [SEARCH_ARGS ...]] [--uri] [--var VAR] [--cdo] [-v] [branches ...]
Query contents of a YAML catalog.
positional arguments:
  branches              specify branches of the tree to follow, e.g.
                        IFS tco2559-ng5-cycle3 2D_1h_native
options:
  -h, --help            show this help message and exit
  -c CATALOG_FILE, --catalog_file CATALOG_FILE
                        catalog to search, default = https://data.nextgems-h2020.eu/catalog.yaml
  -s [SEARCH_ARGS ...], --search_args [SEARCH_ARGS ...]
                        specify search arguments for the YAML dataset at the end of the tree, e.g.
                        zoom=5 time=P1D
  --uri                 print uris of files in this dataset
  --var VAR             only print uris for files containing VAR
  --cdo                 Format output for CDO
  -v, --verbose         print debugging output
Try things like:
    query_yaml
    query_yaml ICON
    query_yaml ICON ngc3028
    query_yaml ICON ngc3028 --search_args time=PT3H zoom=5
    query_yaml FESOM tco2559-ng5-cycle3 2d_vertices_daily --uri --var vice
    cdo -s --eccodes -infov [ -select,name=2t,timestep=1/15  $(query_yaml IFS IFS_9-NEMO_25-cycle3 2D_monthly_0.25deg  --cdo --var=2t) ]
Dealing with dataset variants
zarr datasets with various variants
- variants will be indicated in parentheses behind the dataset name, e.g. ngc4008 (time, zoom).
- query_yaml will be fast.
- use queries with --search_args, e.g. --search_args time=PT3H zoom=5, to get the desired file set.
- combine with --cdo to get the decorations needed for opening with cdo (or other libnetcdf-based utilities).
- Note that the resulting dataset will still contain a lot of variables (i.e. don’t just feed it into cdo -timmean).
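The last bullet can be handled by selecting one variable before reducing. The following is a minimal sketch, not a tested recipe: the helper name timmean_one_var, the variable name tas, and the output file name are assumptions for illustration, and the bracket pattern around -select simply mirrors the cdo example in the "Try things like" list.

```shell
# Hypothetical helper: time mean of ONE variable from a zarr dataset variant.
# Usage (variable and file name are assumptions): timmean_one_var tas tas_timmean.nc
timmean_one_var() {
    var="$1"   # name of the variable to keep
    out="$2"   # output file written by cdo
    # --search_args picks the variant, --cdo formats the result for cdo.
    # -select keeps only $var, so -timmean does not process every variable.
    cdo -timmean [ -select,name="$var" \
        $(query_yaml ICON ngc4008 --search_args time=PT3H zoom=5 --cdo) ] "$out"
}
```

Check with plain query_yaml ICON ngc4008 which variables and variants the dataset actually offers before picking a name.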
Datasets spread over various netCDF files (no kerchunk)
- query_yaml will be slow to show the contents of the dataset (without --uri), as it has to open all files to check for their contents.
- just using query_yaml with --uri, but without --var NAME, will dump all files on you, regardless of your interest in the variable (may or may not be useful).
- combine --uri with --var to get files for a specific variable: query_yaml.py FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri --var sst
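Before feeding a long file list into another tool, it can help to see how many files a variable spans. The sketch below assumes that --uri prints the URIs separated by whitespace (the exact output format is not described here), and the helper name preview_var_files is made up for illustration.

```shell
# Hypothetical helper: count the files that hold a variable and show the first few.
# Usage: preview_var_files sst
preview_var_files() {
    var="$1"
    # Query once and reuse the result for both the count and the preview.
    files=$(query_yaml FESOM IFS_4.4-FESOM_5-cycle3 2D_1h_native --uri --var "$var")
    # Assumption: URIs are whitespace-separated; put one per line.
    echo "$files" | tr -s ' \t' '\n' | grep -c .
    echo "$files" | tr -s ' \t' '\n' | head -n 5
}
```

The first output line is the number of files; the following lines are up to five of the URIs.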
Datasets represented via kerchunk (some netCDF, FDB/GRIB)
- query_yaml will be fast.
- Plain --uri will lead you to the index.
- Use --cdo with --var NAME to get actual file names.
- File names will be sorted alphabetically as a best guess. Whether this is the right order in time depends on the person who created the files.
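Since the alphabetical order is only a best guess, one way to check it is to print the first time stamp of each file next to its name. This is a rough sketch under several assumptions: the helper name show_first_timestamps is made up, the --cdo --var output is taken to be a whitespace-separated list of plain file names, and for GRIB files cdo may additionally need the --eccodes option used in the example list above.

```shell
# Hypothetical helper: list the first time stamp of every file of one variable.
# Usage: show_first_timestamps VAR BRANCH [BRANCH ...]
show_first_timestamps() {
    var="$1"; shift
    # Remaining arguments are the catalog branches passed on to query_yaml.
    for f in $(query_yaml "$@" --cdo --var "$var"); do
        # -seltimestep,1 keeps the first step, -showtimestamp prints its time stamp.
        first=$(cdo -s -showtimestamp -seltimestep,1 "$f" | tr -d ' ')
        printf '%s  %s\n' "$first" "$f"
    done
}
```

If the time stamps in the first column do not increase from top to bottom, the files need to be reordered before they are passed to a time-dependent cdo operator.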
