.. _intake-find-files-command-line:

Getting file names from intake with a command-line python script
================================================================

This is a Python command-line tool that yields the names of files containing the desired variables, along with other information.

.. admonition:: Requirements

    On levante, you can use

    .. code:: bash

        module use /work/k20200/k202134/hsm-tools/outtake/module
        module load hsm-tools/unstable

    to load a recent version of find_files with its dependencies.

    You need to run ::

        slk login

    on a monthly basis to authenticate with the tape system.

    You need to be in project 1153 to download files to the central cache directory. Join via luv.

.. tip::

    Call it as ::

        find_files -h

    to get the full help message.

Getting the full details of a dataset
-------------------------------------

Use :code:`--full` to get detailed information:

.. code::

    find_files ua dpp0066 --time_min='2020-02-02.*' --full
               variable_id   project institution_id source_id experiment_id  \
    0  (ua, va, vor, gpsm)  NextGEMS          MPI-M  ICON-ESM  Cycle2-alpha
    1             (ua, va)  NextGEMS          MPI-M  ICON-ESM  Cycle2-alpha

      simulation_id realm frequency time_reduction grid_label level_type  \
    0       dpp0066   atm     3hour           inst         gn         pl
    1       dpp0066   atm     3hour           mean         gn         ml

                      time_min                 time_max          grid_id  format  \
    0  2020-02-02T00:00:00.000  2020-02-02T23:59:20.000  not implemented  netcdf
    1  2020-02-02T00:00:00.000  2020-02-02T23:59:20.000  not implemented  netcdf

                                                                                      uri
    0  /work/mh0287/m300083/experiments/dpp0066/dpp0066_atm_2d_850_pl_20200202T000000Z.nc
    1    /work/mh0287/m300083/experiments/dpp0066/dpp0066_atm_3d_2_ml_20200202T000000Z.nc

Controlling the output format
------------------------------

Call with :code:`--print_format=...` or :code:`-f` to get specific columns:

.. code::

    find_files to dpp0066 -f 'experiment_id,simulation_id,frequency'
    experiment_id simulation_id frequency
    Cycle2-alpha  dpp0066       1day
    Cycle2-alpha  dpp0066       1hour
    Cycle2-alpha  dpp0066       3hour

    find_files to dpp0066 -f 'experiment_id,simulation_id,frequency,uri' --time_min='2020-02-22.*'
    experiment_id simulation_id frequency uri
    Cycle2-alpha  dpp0066       1day      /work/mh0287/m300083/experiments/dpp0066/dpp0066_oce_3d_P1D_20200222T000000Z.nc
    Cycle2-alpha  dpp0066       1day      /work/mh0287/m300083/experiments/dpp0066/dpp0066_oce_3dlev_P1D_20200222T000000Z.nc
    Cycle2-alpha  dpp0066       1hour     /work/mh0287/m300083/experiments/dpp0066/dpp0066_oce_2dopt_PT1H_20200222T000000Z.nc
    Cycle2-alpha  dpp0066       3hour     /work/mh0287/m300083/experiments/dpp0066/dpp0066_oce_3du200m_PT3H_20200222T000000Z.nc

    # Now assume that the daily files will cause trouble in further processing, because two filesets contain the data.
    find_files to dpp0066 -f 'experiment_id,simulation_id,frequency,uri' --time_min='2020-02-22.*' --uri='.*3dlev.*'
    experiment_id simulation_id frequency uri
    Cycle2-alpha  dpp0066       1day      /work/mh0287/m300083/experiments/dpp0066/dpp0066_oce_3dlev_P1D_20200222T000000Z.nc

Using time ranges
-----------------

(taken from :code:`--help`)

:code:`--time_range TIME_RANGE TIME_RANGE` can be used to find all files that contain data from a given range START END.

:code:`--time_range 2020-02-01 2020-02-03` will give you all data for 2020-02-01 and 2020-02-02, as "2020-02-03" is smaller than any timestamp on 2020-02-03 in the string comparison logic.

Note that regular expressions can be used, but they need to follow the "full" regex syntax.

* :code:`"20-02"` will search for exactly "2020-02" (no regex used).
* :code:`"20-02*"` will search for anything *containing* "20-0" and an arbitrary number of "2"s following that, so "2020-03" also matches.
* :code:`"20-02.*"` will search for anything *containing* "20-02".
* :code:`"^20-02.*"` will search for anything *starting with* "20-02".
* :code:`"2020-.*-03T"` will search for anything *containing* "2020", followed by an arbitrary number of characters, followed by "03T".

Using find_files to download data from the tape archive
-------------------------------------------------------

**First use the normal queries to ensure you are only finding the files you really need (e.g. no model level data when you are interested in pressure level data, etc.)**

Once you have narrowed down your search, you can use find_files to retrieve data from the tape archive.

.. code:: bash

    sbatch find_files ua dpp0066 --level_type=ml --time_min="2020-02-0.*" --get

This will start a slurm job that downloads the data for you (this might take a few hours to days).

Using find_files with cdo
-------------------------

.. code::

    cdo -timmean -select,name=tas [ $(find_files tas dpp0066 --time_min="2020-02-0.*" --get ) ] /my/output/file
    cdo(1) select: Process started
    cdo(1) select: 100%
    cdo(1) select: Processed 9059696640 values from 333 variables over 432 timesteps.
    cdo    timmean: Processed 9059696640 values from 1 variable over 432 timesteps [46.86s 524MB].

When using find_files with cdo, you will often also need the :code:`--get` option to ensure that find_files rewrites the paths of archived files from :code:`slk://...` to the paths in the file system.

The code
--------

.. literalinclude:: find_files.py

A usage example would be (saving the code as :file:`find_files`, and making it executable with :code:`chmod a+x find_files`)
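
Independently of the script itself, the matching rules described under "Using time ranges" can be tried out in plain Python. The following is a minimal sketch, not the code of find_files: the helper names :code:`in_time_range` and :code:`matches` are hypothetical, the inclusive bounds are an assumption, and the use of :code:`re.search` (match anywhere in the string) is inferred from the help text rather than confirmed by it.

.. code:: python

    import re


    def in_time_range(timestamp: str, start: str, end: str) -> bool:
        """Plain string comparison (assumed inclusive), as described for --time_range."""
        return start <= timestamp <= end


    def matches(pattern: str, value: str) -> bool:
        """Search-style regex match: the pattern may match anywhere in value (assumption)."""
        return re.search(pattern, value) is not None


    stamps = [
        "2020-02-01T00:00:00.000",
        "2020-02-02T23:59:20.000",
        "2020-02-03T00:00:00.000",
        "2020-03-01T00:00:00.000",
    ]

    # "2020-02-03" is a prefix of every timestamp on that day and therefore
    # sorts before all of them, so the end date itself is excluded.
    selected = [t for t in stamps if in_time_range(t, "2020-02-01", "2020-02-03")]
    print(selected)

    # "2*" means "zero or more 2s", so "20-02*" also matches "2020-03".
    print(matches("20-02*", "2020-03-01T00:00:00.000"))   # True
    print(matches("20-02.*", "2020-03-01T00:00:00.000"))  # False

    # "^" anchors at the start; the timestamp starts with "2020", not "20-02".
    print(matches("^20-02.*", "2020-02-01T00:00:00.000"))  # False

Only the first two timestamps end up in :code:`selected`, which mirrors the statement that :code:`--time_range 2020-02-01 2020-02-03` returns data for 2020-02-01 and 2020-02-02.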