DYAMOND Winter dpp0029/dpp0033 - a beginner’s guide

I. The output

1. What output is available?

The available variables and their abbreviations are stored in the Table of variables . The output is structured in files, see DYAMOND Winter dpp0029 output for the overview over the variables and the files in which they are stored (for the experiment dpp0029; the experiment dpp0033 has the same structure but shifted by a year). Each file corresponds to a day’s worth of observations, but the time periods over which the output is given can vary (this can also be found in the output description).

Remarks

Horizontally the output is stored in a one-dimensional array (‘ncells’ in the files). In order to be able to assess it depending on latitude and longitude one needs to attach a grid file in which latitudes and longitudes of the cells are stored. A land-sea mask is also stored in this grid file.

Vertically the output is structured in levels (1 to 90), the lowest being 90. Geographic heights corresponding to these levels are also stored in a separate grid file. Note that they are absolute and depend on the topography. Thus, for accessing the heights over sea it is sufficient to attach the heights from one single grid point over sea to all the output.

The time variable is stored as real numbers, e.g. the 6AM on 1 February 2020 is given as \(20200201.25\). For practical reasons it might be useful to change this into the datetime format. (See below for code to attach the grid and change the time.)

Example

I need the surface temperature output. The corresponding variable is ts, and it is available in the files

dpp0029_atm_2d_ml_<datetime>.nc

for January and February with 6h intervals,

dpp0029_atm4_2d_ml_<datetime>.nc

for January and February with 15min intervals and

dpp0029_atm_2d_ml_<datetime>.nc

starting from March with 1 day intervals (i.e. one observation per file corresponding to the daily mean).

2. How to access it and obtain more information?

One possibility is to log into mistral over the terminal by typing

ssh -X NNN@mistral.dkrz.de

and replacing NNN by your mistral ID. By typing

ls /work/mh0287/k203123/GIT/icon-aes-dyw2/experiments/dpp0029/*20200201*

you obtain the list of files corresponding to a given day (here 01.02.2020) in the dpp0029 experiment. For details on a particular file, type ncdump -h and the name of the file.

Example

To see what kind of output is stored in the files of the type dpp0029_atm_2d_ml_<datetime>.nc write

ncdump -h /work/mh0287/k203123/GIT/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_2d_ml_20200201T000000Z.nc

II. Processing the output

CDO

The Climate Data Operators (CDO) is a collection of tools to group and perform basic operations on data. It is used to reduce the amount of data to whatever is necessary BEFORE plotting it or performing a statistical analysis. Frequent uses include selecting parameters, spatial or temporal subsets of data, concatenating files, building averages/minima/maxima (and performing other basic math operations) as well as interpolating the data on a horizontal grid (this is called regridding). This is often an efficient way to preprocess the data.

You can type CDO commands directly into the console after logging into mistral. However, for manipulating data you need to request some memory resources. This is done by typing a corresponding command into the terminal, for instance

salloc -N 1 -p compute,compute2 -A NNN -t4:00:00 -- /bin/bash -c 'ssh -X $SLURM_JOB_NODELIST'

where NNN is replaced by your project name. Other commands and more extensive information are stored in the Nodes section.

Examples

cdo -info /work/mh0287/k203123/GIT/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_2d_ml_20200201T000000Z.nc

provides basic information about the contents of the file, including the maxima, minima and mean values of the variables.

The command (read and executed from right to left)

code

cdo -timmean -selname,ts,pr -cat /work/mh0287/k203123/GIT/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_2d_ml_20200121*.nc /work/mh0287/k203123/GIT/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_2d_ml_20200122*.nc Outfile.nc

concatenates files from January, 21 and 22, selects the variables of surface temperature and precipitation, calculates the respective time means and writes the result in a new file called Outfile.nc

cdo -sellonlatbox,-56.5,-59,12,14.5 -remap,griddes01x01.txt,weight_dpp0029_01x01.nc -seltimestep,1/40 -selname,va -cat '/work/mh0287/k203123/GIT/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_3d_v_ml_2020*.nc' Outfile.nc

is a CDO command that concatenates files of the form /work/mh0287/k203123/GIT/icon-aes-dyw2/experiments/dpp0029/dpp0029_atm_3d_v_ml_2020*.nc , selects the variable va, selects the first 40 time points, interpolates the obtained data on a fine grid (the grid files are created beforehand following the instructions in MPI Wiki Regridding) and finally cuts out a longitude-latitude box. The resulting output is then saved as Outfile.nc in the user’s home directory. It can be used, for instance, for 2D plotting of the selected region.

Links provided in the CDO section give a lot of information on possible CDO commands.

Remark

For every user there are three directories in which files can be stored. On DKRZ User Portal you can familiarise yourself with the particularities of each and also with the type of data they are meant for. Pay particular attention to the different data lifetimes!

Jupyter notebook

For creating and running Python, Julia or R scripts for files within mistral JupiterHub is used (see the MPI Wiki for basic access and setup information). Per default upon logging in and choosing a computing time option, a user’s home directory is shown. There, one can create a new script by choosing the notebook type in the upper-right corner. We will be working with the Python 3 unstable option from now on.

Starting out

At the beginning of the script we load some basic packages

from getpass import getuser # Libaray to copy things
from pathlib import Path # Object oriented libary to deal with paths
import os
from tempfile import NamedTemporaryFile, TemporaryDirectory # Creating temporary Files/Dirs
from subprocess import run, PIPE
import sys

import dask # Distributed data libary
from dask_jobqueue import SLURMCluster # Setting up distributed memories via slurm
from distributed import Client, progress, wait # Libaray to orchestrate distributed resources
import xarray as xr # Libary to work with labeled n-dimensional data and dask

from dask.utils import format_bytes
import numpy as np # Pythons standard array library

import multiprocessing
from subprocess import run, PIPE
import warnings
warnings.filterwarnings(action='ignore')
import pandas as pd

dask.config.config.get('distributed').get('dashboard').update({'link':'{JUPYTERHUB_SERVICE_PREFIX}/proxy/{port}/status'})

then create a shortcut for the SCRATCH directory and set up a dask cluster for partitioning the computing

# Set some user specific variables
scratch_dir = Path('/scratch') / getuser()[0] / getuser() # Define the users scratch dir
# Create a temp directory where the output of distributed cluster will be written to, after this notebook
# is closed the temp directory will be closed
dask_tmp_dir = TemporaryDirectory(dir=scratch_dir, prefix='PostProc')
cluster = SLURMCluster(memory='500GiB',
                       cores=72,
                       project=‘NNN’,
                       walltime='1:00:00',
                       queue='compute',
                       name='PostProc',
                       scheduler_options={'dashboard_address': ':12435'},
                       local_directory=dask_tmp_dir.name,
                       job_extra=[f'-J PostProc',
                                  f'-D {dask_tmp_dir.name}',
                                  f'--begin=now',
                                  f'--output={dask_tmp_dir.name}/LOG_cluster.%j.o',
                                  f'--output={dask_tmp_dir.name}/LOG_cluster.%j.o'
                                 ],
                       interface='ib0')

where NNN is replaced by your project’s name, and finally initialise a dask client

cluster.scale(jobs=2)
dask_client = Client(cluster)
#dask_client.wait_for_workers(18)
cluster

Remark

Clicking on the dashboard link in the output of this cell opens a visualisation of the tasks that are being done when the script is ran. This is helpful to track the progress (for longer tasks), obtain a feeling for the duration of certain operations and locate problems.

Reading

Example

Code for opening and concatenating dpp0029 output files:

data_path = Path('/work/mh0287/k203123/GIT/icon-aes-dyw2/experiments/dpp0029/')
glob_pattern_2d = 'atm3_2d_ml'

# Collect all file names with pathlib's rglob
file_names = sorted([str(f) for f in data_path.rglob(f'*{glob_pattern_2d}*.nc')])[1:]
dset = xr.open_mfdataset(file_names, combine='by_coords', parallel=True)
dset

Example

Code for opening files from the SCRATCH directory:

file_names_s = Path('/scratch/m/NNN') / Path('File.nc')

where NNN stands for your mistral ID.

Bulk of the script

In the bulk of the script one can also process the data, similarly to CDO. The output read from the files is of type xarray. Some basic tools for manipulating an object da of this type include da.sel, da.where, da.mean, da.resample, see examples of use below.

Example

Code for selecting variables

var_names = [ 'tas', 'uas']
dset_subset = dset[var_names]

#view the new data array
dset_subset

Changing the time from numerical values to a datetime format facilitates further analysis. It can be done in the following way.

Example

Attaching the time to output with certain (fixed) time intervals, here 15 minutes

dset_subset = dset_subset.sel(time=slice(20210121., 20210229.))
timesf1 = pd.date_range(start='2020-01-21', periods=dset_subset.time.sel(time=slice(20210121., 20210229.)).size ,freq='15min')
dset_subset['time']=timesf1
dset_subset

Attaching the time to output with varying time intervals, here 6 hours and 1 day

timesf1 = pd.date_range(start='2020-01-21', periods=dset_subset.time.sel(time=slice(20200121., 20200301.)).size ,freq='6H')

#integers can be converted directly to days
times_old = dset_subset.time.sel(time=slice(20200302., 20210120.)).values
timesf2_old = pd.to_datetime(times_old,format='%Y%m%d')
timesf_total = timesf1.union(timesf2_old)
dset_subset['time'] = timesf_total

#build daily means wherever the intervals are smaller to homogenise the data
#the option skipna deletes possible missing values
dset_subset = dset_subset.resample(time="1D", skipna=True).mean()

Attaching a grid to the cells can be of similar importance for further calculations.

Example

Code for attaching heights for a subset over sea

file3 = '/work/mh0287/k203123/GIT/icon-aes-dyw2/experiments/dpp0016_atm_vgrid_ml.nc'
dsetvgrid = xr.open_dataset(file3)
dsetvgrid

#selecting a subgrid if necessary
dsvg_sub = dsetvgrid['zg'].sel(height_2 = slice(14.0, 90.0))

#overwriting the height levels with geographic heights at a random point over sea
dset_subset['height']=dsvg_sub[:,1343333].values
dset_subset

Note that we have chosen heights in the middle of the corresponding cells. For other possible choices call dsetvgrid to access all height variables.

Code for attaching horizontal coordinates and a land-sea mask

file2 = '/pool/data/ICON/grids/public/mpim/0015/icon_grid_0015_R02B09_G.nc'
dsetlm = xr.open_dataset(file2)
dsetlm = dsetlm.rename({'cell' : 'ncells'})
land_sea_mask = dsetlm.cell_sea_land_mask.persist()
progress(land_sea_mask ,notebook=False)
mask = land_sea_mask.values
dset_subset['land_sea_mask'] = (('ncells'),mask)

clat = land_sea_mask.clat.values
clon = land_sea_mask.clon.values
dset_subset = ds_subset.assign_coords(clon=("ncells",clon),clat=("ncells",clat))
dset_subset

Now we can cut out boxes according to latitude and longitude via (for instance)

dset_subset = dset_subset.where((land_sea_mask.clat>=(12.0/180*np.pi)) & (land_sea_mask.clat<=(14.5/180*np.pi)) & (land_sea_mask.clon>=(-59/180*np.pi))& (land_sea_mask.clon<=(-56.5/180*np.pi)), drop = True)

or select the land part

dset_subset = dset_subset.where(dset_subset.land_sea_mask>0, drop=True)

Here, as above, drop=True is used to delete the missing values.

Here are some more examples of basic operators for xarray objects.

Examples

Code for calculating the spatial mean:

dset_subset_mean = dset_subset.mean(dim='ncells')

Omitting a particular date:

dset_subset_clean = dset_subset.drop_sel(time=np.datetime64('2020-03-01'))

Performing a vectorwise calculation and adding the resulting vector to an existing data array:

#library for numerical computations
import numpy as np

#calculate the horizontal wind speed
dset_subset['ws'] = np.sqrt(dset_subset['va']*dset_subset['va'] + dset_subset['ua']*dset_subset['ua'])

One of the most important tasks performed in the notebook is plotting the preprocessed output. The simplest way is to use the intrinsic plotting method of xarray objects, that is, for an object da of this type running

da.plot()

This method automatically chooses a type of plot suitable for the data structure and produces a simple plot. More sophisticated plots can be created manually. In particular, one can use a cartography library by running

from cartopy import crs as ccrs

and produce maps. Some examples can be found in the MPI Wiki. For 2D plots one can import the standard plotting library

from matplotlib import pyplot as plt

and use its numerous options.

Saving

You should store preprocessed data that you are working with, especially reduced data sets that need a long time to be calculated.

Example

Code for saving data in the SCRATH directory

scratch_dir = Path('/scratch') / getuser()[0] / getuser() #if it has not been defined before
out_file = Path(scratch_dir) / 'OutfileName.nc'
dset_subset.to_netcdf(out_file, mode='w')

Example notebook