Getting file names from intake with a command-line Python script#

This is a command-line Python tool that yields the names of files containing the desired variables, along with other information about them.

Requirements

On levante, you can use

module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable

to load a recent version of find_files with its dependencies.

You need to run

slk login

on a monthly basis to authenticate with the tape system.

You need to be a member of project bb1153 (the account used in the script below) to download files to the central cache directory. You can join via luv.

Tip

Call it as

find_files -h

to get the full help message.

Getting the full details of a dataset#

Use --full to get detailed information:

find_files ua dpp0066 --time_min='2020-02-02.*' --full
           variable_id   project institution_id source_id experiment_id  \
0  (ua, va, vor, gpsm)  NextGEMS          MPI-M  ICON-ESM  Cycle2-alpha
1             (ua, va)  NextGEMS          MPI-M  ICON-ESM  Cycle2-alpha

  simulation_id realm frequency time_reduction grid_label level_type  \
0       dpp0066   atm     3hour           inst         gn         pl
1       dpp0066   atm     3hour           mean         gn         ml

                  time_min                 time_max          grid_id  format  \
0  2020-02-02T00:00:00.000  2020-02-02T23:59:20.000  not implemented  netcdf
1  2020-02-02T00:00:00.000  2020-02-02T23:59:20.000  not implemented  netcdf

                                                                                  uri
0  /work/mh0287/m300083/experiments/dpp0066/dpp0066_atm_2d_850_pl_20200202T000000Z.nc
1    /work/mh0287/m300083/experiments/dpp0066/dpp0066_atm_3d_2_ml_20200202T000000Z.nc
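
If you prefer to stay in Python, the same query can be run against the catalog directly with intake-esm; a minimal sketch, assuming the default catalog file that the script at the end of this page uses:

import intake

cat = intake.open_esm_datastore("/work/ka1081/Catalogs/dyamond-nextgems.json")
hits = cat.search(variable_id="ua", simulation_id="dpp0066", time_min="2020-02-02.*")
print(hits.df)  # the same table that --full prints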

Controlling the output format#

Call it with --print_format=... or -f to get specific columns:

find_files to dpp0066  -f 'experiment_id,simulation_id,frequency'
experiment_id simulation_id frequency
 Cycle2-alpha       dpp0066      1day
 Cycle2-alpha       dpp0066     1hour
 Cycle2-alpha       dpp0066     3hour

find_files to dpp0066  -f 'experiment_id,simulation_id,frequency,uri' --time_min='2020-02-22.*'
experiment_id simulation_id frequency                                                                                   uri
 Cycle2-alpha       dpp0066      1day       /work/mh0287/m300083/experiments/dpp0066/dpp0066_oce_3d_P1D_20200222T000000Z.nc
 Cycle2-alpha       dpp0066      1day    /work/mh0287/m300083/experiments/dpp0066/dpp0066_oce_3dlev_P1D_20200222T000000Z.nc
 Cycle2-alpha       dpp0066     1hour   /work/mh0287/m300083/experiments/dpp0066/dpp0066_oce_2dopt_PT1H_20200222T000000Z.nc
 Cycle2-alpha       dpp0066     3hour /work/mh0287/m300083/experiments/dpp0066/dpp0066_oce_3du200m_PT3H_20200222T000000Z.nc

# Now assume that using the daily files will cause trouble in further processing, because two filesets contain the data.

find_files to dpp0066  -f 'experiment_id,simulation_id,frequency,uri' --time_min='2020-02-22.*' --uri='.*3dlev.*'
experiment_id simulation_id frequency                                                                                uri
 Cycle2-alpha       dpp0066      1day /work/mh0287/m300083/experiments/dpp0066/dpp0066_oce_3dlev_P1D_20200222T000000Z.nc
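
The script also accepts a --datasets flag (see the code at the end of this page), which prints each distinct dataset once, dropping the per-file uri, time_min and time_max columns:

find_files to dpp0066 --datasets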

Using time ranges#

(taken from --help)

--time_range TIME_RANGE TIME_RANGE can be used to find all files that contain data from a given range START END.

--time_range 2020-02-01 2020-02-03 will give you all data for 2020-02-01 and 2020-02-02, as “2020-02-03” is smaller than any timestamp on 2020-02-03 in the string comparison logic.
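
The comparison is a plain string comparison of the ISO timestamps, which sort chronologically, as a quick check in Python confirms:

>>> "2020-02-03" < "2020-02-03T00:00:00.000"
True
>>> "2020-02-02T23:59:20.000" <= "2020-02-03"
True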

Note that regular expressions can be used, but need to follow the “full” regex-syntax.

  • "20-02" will search for exactly “2020-02” (no regex used).

  • "20-02*" will search for anything containing “20-0” and an arbitrary number of “2”s following that, so “2020-03” also matches.

  • "20-02.*" will search for anything containing “20-02” .

  • "^20-02.*" will search for anything starting with “20-02” .

  • "2020-.*-03T" will search for anything containing “2020”, followed by an arbitrary number of characters followed by “03T”.

Using find_files to download data from the tape archive#

First use the normal queries to make sure you are finding only the files you really need (e.g. no model-level data when you are interested in pressure-level data).

Once you have narrowed down your search, you can use find_files to retrieve data from the tape archive.

sbatch find_files ua dpp0066 --level_type=ml --time_min="2020-02-0.*" --get

This will start a slurm job that will download the data for you (might take a few hours to days).
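
The job's output and error streams go to find_files.log.<jobid> in the submission directory (see the #SBATCH headers in the code below), so you can follow its progress with, for example (<jobid> being the number printed by sbatch):

squeue -u $USER
tail -f find_files.log.<jobid>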

Using find_files with cdo#

cdo  -timmean -select,name=tas [ $(find_files tas dpp0066 --time_min="2020-02-0.*" --get ) ] /my/output/file
cdo(1) select: Process started
cdo(1) select: 100%
cdo(1) select: Processed 9059696640 values from 333 variables over 432 timesteps.
cdo    timmean: Processed 9059696640 values from 1 variable over 432 timesteps [46.86s 524MB].

When using find_files with cdo, you will often also need the --get option, to ensure that find_files rewrites the paths of archived files from slk://... to their paths in the file system.
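
The same file list can also be consumed from Python instead of cdo; a minimal sketch, assuming find_files is on your PATH and xarray is installed:

import subprocess

import xarray as xr

# collect the (already retrieved) file paths, one per line
files = subprocess.run(
    ["find_files", "tas", "dpp0066", "--time_min=2020-02-0.*", "--get"],
    capture_output=True, text=True, check=True,
).stdout.split()

ds = xr.open_mfdataset(files)   # combine the files into one dataset
print(ds["tas"].mean("time"))   # time mean, as in the cdo example above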

The code#

#!/sw/spack-levante/mambaforge-4.11.0-0-Linux-x86_64-sobz6z/bin/python
#SBATCH --job-name=find_files      # Specify job name
#SBATCH --account=bb1153
#SBATCH --partition=shared         # Specify partition name
#SBATCH --mem=14G
#SBATCH --time=24:00:00            # Set a limit on the total run time
#SBATCH --error=find_files.log.%j
#SBATCH --output=find_files.log.%j

import argparse


def get_from_cat(catalog, field, searchdict=None):
    """Call this to get all values of a field in the catalog as a sorted list"""
    if searchdict is not None and len(searchdict) > 0:
        cat = catalog.search(**searchdict)
    else:
        cat = catalog
    return sorted(cat.unique(field)[field]["values"])


def parse_args():
    parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
    parser.description = """List files for a given variable and simulation."""
    parser.epilog = """Note that regular expressions can be used, but need to follow the "full" regex-syntax.
    "20-02" will search for exactly "2020-02" (no regex used).
    "20-02*" will search for anything *containing* "20-0" and an arbitrary number of "2"s following that, so "2020-03" also matches.
    "20-02.*" will search for anything *containing* "20-02" .
    "^20-02.*" will search for anything *starting with* "20-02" .
    "2020-.*-03T" will search for anything *containing*  "2020", followed by an arbitrary number of characters followed by "03T".
    
Use "" to leave variable_id or simulation_id empty.

Use 
find_files "" "" -f "experiment_id,source_id" 
to get a list of all experiments and participating models available
    """
    required_search_args = ("variable_id", "simulation_id")
    # positional (required) search arguments
    for x in required_search_args:
        parser.add_argument(x)
    # optional search arguments (those with default are added below)
    optional_search_args = (
        "project",
        "institution_id",
        "source_id",
        "experiment_id",
        "realm",
        "frequency",
        "time_reduction",
        "grid_label",
        "level_type",
        "time_min",
        "time_max",
        "grid_id",
        "format",
        "uri",
    )
    for x in optional_search_args:
        parser.add_argument(f"--{x}", action="append")

    parser.add_argument(
        "-c", "--catalog_file", default="/work/ka1081/Catalogs/dyamond-nextgems.json"
    )
    parser.add_argument(
        "-f",
        "--print_format",
        default="uri",
        help="Comma separated list of columns to be plotted. e.g. 'variable_id,source_id'",
    )
    parser.add_argument(
        "--full", action="store_true", help="Print full dataset information"
    )
    parser.add_argument("--get", action="store_true", help="Get datasets from tapes")
    parser.add_argument("--datasets", action="store_true", help="List separate datasets")

    parser.add_argument(
        "--time_range",
        nargs=2,
        help="Find all files that contain data from a given range START END. \n--time_range 2020-02-01 2020-02-03 will give you all data for 2020-02-01 and 2020-02-02. \nNote that 2020-02-03 is smaller than any timestamp on 2020-02-03 in the string comparison logic.", 
    )

    pruned_dict = {k: v for k, v in vars(parser.parse_args()).items() if v is not None}
    search_args = {
        k: v
        for k, v in pruned_dict.items()
        if k in optional_search_args + required_search_args
    }
    for k in list(search_args.keys()):
        v = search_args[k]
        if len(v) == 1:
            search_args[k] = v[0]
        if len(v) == 0:
            del search_args[k]

    misc_args = {k: v for k, v in pruned_dict.items() if k not in search_args.keys()}
    return search_args, misc_args


def search_cat(cat, search_args, misc_args):
    hitlist = cat
    if len(search_args):
        hitlist = hitlist.search(**search_args)
    if "time_range" in misc_args.keys():
        tr = misc_args["time_range"]
        hitlist.df = hitlist.df[
            (hitlist.df["time_min"] <= tr[1]) * (hitlist.df["time_max"] >= tr[0])
        ]
    return hitlist


def print_results(hitlist, search_args, misc_args):
    if misc_args.get("full", False):
        import pandas as pd

        pd.set_option("display.max_columns", None)
        pd.set_option("max_colwidth", None)
        pd.set_option("display.width", 10000)
        print(hitlist.df)
    elif misc_args.get("datasets", False):
        import pandas as pd
        cols=[ x for x in hitlist.df if not x in ('uri','time_min','time_max') ]
        hitlist = (
            hitlist.df[cols]
            .drop_duplicates()
            .sort_values(cols)
            .to_string(index=False)
        )
        pd.set_option("display.max_columns", None)
        pd.set_option("max_colwidth", None)
        pd.set_option("display.width", 10000)
        print(hitlist)
    else:
        fmt = misc_args.get("print_format")
        cols = fmt.split(",")
        if len(cols) == 1:
            matches = get_from_cat(hitlist, fmt)
            for x in matches:
                print(x)
        else:
            import pandas as pd

            pd.set_option("display.max_columns", None)
            pd.set_option("max_colwidth", None)
            pd.set_option("display.width", 10000)
            pd.set_option("display.max_rows", None)
            hitlist = (
                hitlist.df[cols]
                .drop_duplicates()
                .sort_values(cols)
                .to_string(index=False)
            )
            print(hitlist)


if __name__ == "__main__":
    search_args, misc_args = parse_args()

    import intake

    catalog_file = misc_args["catalog_file"]
    cat = intake.open_esm_datastore(catalog_file)

    hitlist = search_cat(cat, search_args, misc_args)
    try:
        import outtake
    except Exception as e:
        import sys
        print ("Warning: Failed to import outtake. Reason:", file=sys.stderr)
        print (e, file=sys.stderr)
        outtake = False
    if misc_args.get("get", False):
        if not outtake:
            import sys
            print(
                "Could not import outtake. No download support without it.",
                file=sys.stderr,
            )
            exit(1)
        cat = outtake.get(hitlist)

    cat._df["uri"] = cat._df["uri"].str.replace("file:///", "/")

    try:
        print_results(hitlist, search_args, misc_args)
    except ValueError:
        import sys

        print(
            "\nERROR: Could not find any matches for your query ",
            search_args,
            misc_args,
            "in catalog ",
            catalog_file,
            file=sys.stderr,
        )
        sys.exit(1)

A usage example, after saving the code as find_files and making it executable with chmod a+x find_files:
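
./find_files ua dpp0066 --time_min='2020-02-02.*'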