DYAMOND Winter dpp0029/dpp0033 - a beginner’s guide
I. The output
1. What output is available?
The available variables and their abbreviations are stored in the Table of variables. The output is structured in files; see DYAMOND Winter dpp0029 output for an overview of the variables and the files in which they are stored (for the experiment dpp0029; the experiment dpp0033 has the same structure but shifted by a year). Each file corresponds to a day’s worth of output, but the time intervals at which the output is given can vary (these can also be found in the output description).
Remarks
Horizontally the output is stored in a one-dimensional array (‘ncells’ in the files). In order to access it by latitude and longitude, one needs to attach a grid file in which the latitudes and longitudes of the cells are stored. A land-sea mask is also stored in this grid file.
Vertically the output is structured in levels (1 to 90), the lowest being 90. The geometric heights corresponding to these levels are also stored in a separate grid file. Note that they are absolute and depend on the topography. Over sea, however, the surface lies at sea level everywhere, so for the heights over sea it is sufficient to attach the heights from one single grid point over sea to all the output.
The time variable is stored as real numbers: the integer part encodes the date (YYYYMMDD) and the fractional part the elapsed fraction of the day, e.g. 6 AM on 1 February 2020 is given as \(20200201.25\). For practical reasons it might be useful to convert this into the datetime format. (See below for code to attach the grid and change the time.)
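For illustration, this encoding can be decoded with pandas (a minimal sketch; the variable name t_num is ours and not part of the output):
import pandas as pd
t_num = 20200201.25  # a time value as stored in the output
date = pd.to_datetime(str(int(t_num)), format='%Y%m%d')  # integer part: the date
time = date + pd.to_timedelta(t_num % 1, unit='D')  # fractional part: fraction of the day
print(time)  # 2020-02-01 06:00:00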
Example
I need the surface temperature output. The corresponding variable is ts, and it is available in the files
dpp0029_atm_2d_ml_<datetime>.nc
for January and February with 6h intervals, in
dpp0029_atm4_2d_ml_<datetime>.nc
for January and February with 15min intervals, and in
dpp0029_atm_2d_ml_<datetime>.nc
starting from March with 1 day intervals (i.e. one observation per file, corresponding to the daily mean).
2. How to access it and obtain more information?
One possibility is to log into Levante over the terminal by typing
ssh -X x123456@levante.dkrz.de
and replacing x123456 by your DKRZ user ID. By typing
ls /work/mh0287/k203123/Dyamond++/icon-aes-dyw2/experiments/dpp0029/*20200201*
you obtain the list of files corresponding to a given day (here 1 February 2020) in the dpp0029 experiment. For details on a particular file, type ncdump -h followed by the name of the file.
Example
To see what kind of output is stored in files of the type dpp0029_atm_2d_ml_<datetime>.nc, write
ncdump -h /work/mh0287/k203123/Dyamond++/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_2d_ml_20200201T000000Z.nc
II. Processing the output
CDO
The Climate Data Operators (CDO) is a collection of command-line tools for performing basic operations on climate data. It is used to reduce the amount of data to whatever is necessary BEFORE plotting it or performing a statistical analysis. Frequent uses include selecting parameters or spatial/temporal subsets of the data, concatenating files, building averages/minima/maxima (and performing other basic arithmetic operations), as well as interpolating the data onto a regular horizontal grid (this is called regridding). This is often an efficient way to preprocess the data.
You can type CDO commands directly into the console after logging into Levante. However, for manipulating data you first need to request some compute resources. This is done by typing a corresponding SLURM command into the terminal, for instance
salloc --x11 -p interactive -A xz0123 -n 1 --mem 8000 -t 240
where xz0123 is replaced by your project name.
Examples
The command
cdo -info /work/mh0287/k203123/Dyamond++/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_2d_ml_20200201T000000Z.nc
provides basic information about the contents of the file, including the maxima, minima and mean values of the variables.
The command (read and executed from right to left)
cdo -timmean -selname,ts,pr -cat /work/mh0287/k203123/Dyamond++/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_2d_ml_20200121*.nc /work/mh0287/k203123/Dyamond++/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_2d_ml_20200122*.nc Outfile.nc
concatenates the files from 21 and 22 January, selects the surface temperature and precipitation variables, calculates the respective time means and writes the result into a new file called Outfile.nc.
Finally,
cdo -sellonlatbox,-59,-56.5,12,14.5 -remap,griddes01x01.txt,weight_dpp0029_01x01.nc -seltimestep,1/40 -selname,va -cat '/work/mh0287/k203123/Dyamond++/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_3d_v_ml_2020*.nc' Outfile.nc
is a CDO command that concatenates files of the form /work/mh0287/k203123/Dyamond++/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_3d_v_ml_2020*.nc, selects the variable va, selects the first 40 time steps, interpolates the obtained data onto a regular fine grid (the grid description and weight files are created beforehand following the instructions in MPI Wiki Regridding; a sketch is given below) and finally cuts out a longitude-latitude box (arguments: western and eastern longitude, then southern and northern latitude). The resulting output is then saved as Outfile.nc in the user’s home directory. It can be used, for instance, for 2D plotting of the selected region.
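For reference, the grid description and weight files used in the remap call above might be created along the following lines (a hedged sketch, not part of this guide: the 0.1° spacing is inferred from the file names, and the ICON grid file is the one attached in the notebook section below):
# target grid description: a regular 0.1 x 0.1 degree global lon-lat grid (spacing assumed)
cat > griddes01x01.txt << EOF
gridtype = lonlat
xsize    = 3600
ysize    = 1800
xfirst   = -179.95
xinc     = 0.1
yfirst   = -89.95
yinc     = 0.1
EOF
# generate distance-weighted interpolation weights once and reuse them for every remap;
# -setgrid attaches the ICON grid so that CDO knows the source cell locations
cdo gendis,griddes01x01.txt -setgrid,/pool/data/ICON/grids/public/mpim/0015/icon_grid_0015_R02B09_G.nc /work/mh0287/k203123/Dyamond++/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_3d_v_ml_<datetime>.nc weight_dpp0029_01x01.nc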
Links provided in the CDO section give a lot of information on possible CDO commands.
Remark
For every user there are three directories in which files can be stored (HOME, WORK and SCRATCH). On the DKRZ documentation pages you can familiarise yourself with the particularities of each and with the type of data they are meant for. Pay particular attention to the different data lifetimes!
Jupyter notebook
For creating and running Python, Julia or R scripts on files within Levante, JupyterHub is used (see the MPI Wiki for basic access and setup information). By default, upon logging in and choosing a computing-time option, a user’s home directory is shown. There, one can create a new script by choosing the notebook type in the upper-right corner. We will be working with the Python 3 unstable option from now on.
Starting out
At the beginning of the script we load some basic packages
from getpass import getuser  # Library to get the user name
from pathlib import Path  # Object-oriented library to deal with paths
import os
from tempfile import NamedTemporaryFile, TemporaryDirectory  # Creating temporary files/dirs
from subprocess import run, PIPE
import sys
import dask  # Distributed data library
from dask_jobqueue import SLURMCluster  # Setting up distributed memories via SLURM
from distributed import Client, progress, wait  # Library to orchestrate distributed resources
import xarray as xr  # Library to work with labeled n-dimensional data and dask
from dask.utils import format_bytes
import numpy as np  # Python's standard array library
import multiprocessing
import warnings
warnings.filterwarnings(action='ignore')  # suppress warning messages
import pandas as pd
# Make the dask dashboard link work on JupyterHub
dask.config.config.get('distributed').get('dashboard').update({'link':'{JUPYTERHUB_SERVICE_PREFIX}/proxy/{port}/status'})
then create a shortcut for the SCRATCH directory and set up a dask cluster for distributing the computations
# Set some user-specific variables
scratch_dir = Path('/scratch') / getuser()[0] / getuser()  # Define the user's scratch dir
# Create a temp directory where the output of the distributed cluster will be written to;
# after this notebook is closed the temp directory will be deleted
dask_tmp_dir = TemporaryDirectory(dir=scratch_dir, prefix='PostProc')
cluster = SLURMCluster(memory='500GiB',
                       cores=72,
                       project='xz0123',
                       walltime='1:00:00',
                       queue='compute',
                       name='PostProc',
                       scheduler_options={'dashboard_address': ':12435'},
                       local_directory=dask_tmp_dir.name,
                       job_extra=[f'-J PostProc',
                                  f'-D {dask_tmp_dir.name}',
                                  f'--begin=now',
                                  f'--output={dask_tmp_dir.name}/LOG_cluster.%j.o',
                                  f'--error={dask_tmp_dir.name}/LOG_cluster.%j.e'],
                       interface='ib0')
where xz0123 is replaced by your project’s name, and finally initialise a dask client
cluster.scale(jobs=2)
dask_client = Client(cluster)
#dask_client.wait_for_workers(18)
cluster
Remark
Clicking on the dashboard link in the output of this cell opens a visualisation of the tasks that are being executed when the script is run. This is helpful for tracking progress (for longer tasks), getting a feeling for the duration of certain operations and locating problems.
Reading
Example
Code for opening and concatenating dpp0029 output files:
data_path = Path('/work/mh0287/k203123/Dyamond++/icon-aes-dyw2/experiments/dpp0029/')
glob_pattern_2d = 'atm3_2d_ml'
# Collect all file names with pathlib's rglob
file_names = sorted([str(f) for f in data_path.rglob(f'*{glob_pattern_2d}*.nc')])[1:]  # [1:] skips the first file
dset = xr.open_mfdataset(file_names, combine='by_coords', parallel=True)
dset
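The format_bytes helper imported at the top can be used here, for instance, to check how much data the lazily opened dataset represents (a usage sketch):
format_bytes(dset.nbytes)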
Example
Code for opening files from the SCRATCH directory:
file_names_s = Path('/scratch/x/x123456') / Path('File.nc')
dset_s = xr.open_dataset(file_names_s)
where x123456 stands for your Levante user ID (and x for its first letter).
Bulk of the script
In the bulk of the script one can also process the data, similarly to CDO. The output read from the files is an object of the xarray library. Some basic tools for manipulating an object da of this type include da.sel, da.where, da.mean and da.resample; see the examples of use below.
Example
Code for selecting variables
var_names = ['tas', 'uas']
dset_subset = dset[var_names]
# view the new data array
dset_subset
Changing the time from numerical values to a datetime format facilitates further analysis. It can be done in the following way.
Example
Attaching the time to output with a fixed time interval, here 15 minutes:
dset_subset = dset_subset.sel(time=slice(20200121., 20200229.))
timesf1 = pd.date_range(start='2020-01-21', periods=dset_subset.time.size, freq='15min')
dset_subset['time'] = timesf1
dset_subset
Attaching the time to output with varying time intervals, here 6 hours and 1 day:
timesf1 = pd.date_range(start='2020-01-21', periods=dset_subset.time.sel(time=slice(20200121., 20200301.)).size, freq='6H')
# the integer date labels of the daily output can be converted directly
times_old = dset_subset.time.sel(time=slice(20200302., 20210120.)).values
timesf2_old = pd.to_datetime(times_old.astype(int).astype(str), format='%Y%m%d')
timesf_total = timesf1.union(timesf2_old)
dset_subset['time'] = timesf_total
# build daily means wherever the intervals are smaller, to homogenise the data;
# skipna=True ignores possible missing values
dset_subset = dset_subset.resample(time="1D").mean(skipna=True)
Attaching a grid to the cells can be of similar importance for further calculations.
Example
Code for attaching heights for a subset over sea
file3 = '/work/mh0287/k203123/Dyamond++/icon-aes-dyw2/experiments/dpp0016_atm_vgrid_ml.nc'
dsetvgrid = xr.open_dataset(file3)
dsetvgrid
# selecting a subset of the levels if necessary
dsvg_sub = dsetvgrid['zg'].sel(height_2=slice(14.0, 90.0))
# overwriting the height levels with the geographic heights at an (arbitrary) point over sea
dset_subset['height'] = dsvg_sub[:, 1343333].values
dset_subset
Note that we have chosen the heights in the middle of the corresponding cells (the variable zg). For other possible choices, display dsetvgrid to see all available height variables.
Code for attaching horizontal coordinates and a land-sea mask
file2 = '/pool/data/ICON/grids/public/mpim/0015/icon_grid_0015_R02B09_G.nc'
dsetlm = xr.open_dataset(file2)
dsetlm = dsetlm.rename({'cell' : 'ncells'})
land_sea_mask = dsetlm.cell_sea_land_mask.persist()
progress(land_sea_mask, notebook=False)
mask = land_sea_mask.values
dset_subset['land_sea_mask'] = (('ncells'), mask)
clat = land_sea_mask.clat.values
clon = land_sea_mask.clon.values
dset_subset = dset_subset.assign_coords(clon=("ncells", clon), clat=("ncells", clat))
dset_subset
Now we can cut out boxes by latitude and longitude (note that the grid stores clat and clon in radians, hence the conversion from degrees), for instance
dset_subset = dset_subset.where((land_sea_mask.clat >= (12.0/180*np.pi)) & (land_sea_mask.clat <= (14.5/180*np.pi))
                                & (land_sea_mask.clon >= (-59/180*np.pi)) & (land_sea_mask.clon <= (-56.5/180*np.pi)), drop=True)
or select the land part
dset_subset = dset_subset.where(dset_subset.land_sea_mask>0, drop=True)
Here, as above, drop=True removes the masked-out cells instead of filling them with missing values.
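Correspondingly, one might select the sea part (a sketch mirroring the land selection above, assuming that negative mask values mark sea cells):
dset_subset = dset_subset.where(dset_subset.land_sea_mask < 0, drop=True)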
Here are some more examples of basic operators for xarray objects.
Examples
Code for calculating the spatial mean:
dset_subset_mean = dset_subset.mean(dim='ncells')
Omitting a particular date:
dset_subset_clean = dset_subset.drop_sel(time=np.datetime64('2020-03-01'))
Performing an element-wise calculation and adding the resulting variable to an existing dataset:
# calculate the horizontal wind speed from its components (numpy was already imported above)
dset_subset['ws'] = np.sqrt(dset_subset['va']*dset_subset['va'] + dset_subset['ua']*dset_subset['ua'])
One of the most important tasks performed in the notebook is plotting the preprocessed output. The simplest way is to use the intrinsic plotting method of xarray objects, that is, for an object da of this type, running
da.plot()
This method automatically chooses a type of plot suitable for the data structure and produces a simple plot. More sophisticated plots can be created manually. In particular, one can use a cartography library by running
from cartopy import crs as ccrs
and produce maps. Some examples can be found in the MPI Wiki. For 2D plots one can import the standard plotting library
from matplotlib import pyplot as plt
and use its numerous options.
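As an illustration, a simple map of one time step on the unstructured grid might look as follows (a minimal sketch, assuming the dset_subset from above with clon/clat coordinates in radians and a surface temperature variable ts):
fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot(projection=ccrs.PlateCarree())
ax.coastlines()
# scatter the cell values, converting the coordinates from radians to degrees
sc = ax.scatter(np.rad2deg(dset_subset.clon), np.rad2deg(dset_subset.clat),
                c=dset_subset['ts'].isel(time=0), s=1, transform=ccrs.PlateCarree())
fig.colorbar(sc, label='ts')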
Saving
You should store the preprocessed data that you are working with, especially reduced data sets that take a long time to compute.
Example
Code for saving data in the SCRATCH directory
scratch_dir = Path('/scratch') / getuser()[0] / getuser()  # if it has not been defined before
out_file = Path(scratch_dir) / 'OutfileName.nc'
dset_subset.to_netcdf(out_file, mode='w')
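The saved file can later be reopened directly, e.g. in a new notebook session (a usage sketch):
dset_subset = xr.open_dataset(out_file)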