Finding and loading data with Intake#

Intake is a powerful, flexible data cataloging system that allows you to discover, describe, and load datasets from diverse sources using a unified interface.

This tutorial walks you through the core workflow:

  1. Opening a catalog from a URL

  2. Exploring its contents

  3. Loading a dataset with lazy evaluation and chunking

  4. Using the data for analysis or visualization

We’ll use a public catalog from the nextGEMS project to demonstrate real-world usage.

Intake v2

Version 2 of Intake introduced significant, non-backward-compatible changes to the catalog specification. The catalogs we maintain still use the old format, so it is important to pin the Intake packages in your Python environment, e.g., intake<2 and intake-xarray<2.
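One way to apply these constraints is at install time, e.g. with pip (the quoting below is a sketch for a POSIX shell; adapt to your package manager):

```shell
# Pin Intake and the xarray driver to the v1 catalog format
pip install "intake<2" "intake-xarray<2"
```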

Command-line use

See this tutorial for a little tool for querying the catalogs from the command line.

Opening the catalog#

The first step is to open a catalog using intake.open_catalog(). This function loads a catalog file and returns a catalog object that you can explore and query.

In this example, we’re loading a public YAML catalog hosted online. The catalog contains metadata about datasets, including dataset locations, temporal resolution, and additional metadata.

import intake

cat = intake.open_catalog("https://data.nextgems-h2020.eu/online.yaml")

Online datasets

The online.yaml catalog contains datasets that are publicly accessible via web servers. This means that you can run the tutorials on any machine with an internet connection.

Exploring the catalog structure#

Once the catalog is loaded, you can inspect its entries. Each entry corresponds to a dataset or sub-catalog.

list(cat)
['ICON', 'ERA5', 'JRA3Q', 'MERRA2', 'IFS', 'IMERG', 'GSMaP', 'tutorial']

You’ll see entries such as ICON, IFS, and tutorial, each representing a dataset or a group of datasets. You can access a specific entry directly via attribute access, e.g., cat.tutorial.

list(cat.tutorial)
['ICON.native.2d_PT6H_inst',
 'ICON.native.2d_P1D_mean',
 'ICON.native.3d_P1D_mean']

This shows all datasets in the tutorial sub-catalog. Accessing an individual dataset by its identifier displays its description, parameters, and driver (i.e., how it’s loaded).

cat.tutorial["ICON.native.2d_P1D_mean"]
ICON.native.2d_P1D_mean:
  args:
    chunks: null
    consolidated: true
    urlpath: https://swift.dkrz.de/v1/dkrz_948e7d4bbfbb445fbff5315fc433e36a/easygems_tutorial/ICON.native.2d_P1D_mean.zarr
  description: ''
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    catalog_dir: https://data.nextgems-h2020.eu

Opening a dataset#

Now that we’ve located a dataset, we can load it. Intake supports lazy loading, meaning the data isn’t read until you explicitly request it. The cat.tutorial["ICON.native.2d_P1D_mean"] syntax returns a data source object.

Using the to_dask() method, we can open the dataset:

ds = cat.tutorial["ICON.native.2d_P1D_mean"](chunks={"ncells": -1}).to_dask()
ds
<xarray.Dataset> Size: 844MB
Dimensions:             (time: 31, ncells: 1310720)
Coordinates:
  * time                (time) datetime64[ns] 248B 2020-01-01T23:59:59 ... 20...
    cell_sea_land_mask  (ncells) float64 10MB dask.array<chunksize=(1310720,), meta=np.ndarray>
    lat                 (ncells) float64 10MB dask.array<chunksize=(1310720,), meta=np.ndarray>
    lon                 (ncells) float64 10MB dask.array<chunksize=(1310720,), meta=np.ndarray>
Dimensions without coordinates: ncells
Data variables:
    hus2m               (time, ncells) float32 163MB dask.array<chunksize=(1, 1310720), meta=np.ndarray>
    psl                 (time, ncells) float32 163MB dask.array<chunksize=(1, 1310720), meta=np.ndarray>
    ts                  (time, ncells) float32 163MB dask.array<chunksize=(1, 1310720), meta=np.ndarray>
    uas                 (time, ncells) float32 163MB dask.array<chunksize=(1, 1310720), meta=np.ndarray>
    vas                 (time, ncells) float32 163MB dask.array<chunksize=(1, 1310720), meta=np.ndarray>

What’s happening here?#

  • chunks={"ncells": -1}: This passes a chunking hint to Dask; -1 requests a single chunk spanning the entire ncells dimension, while time keeps the store’s native chunking of one time step per chunk (visible as chunksize=(1, 1310720) in the repr above).

  • .to_dask(): Converts the dataset into a Dask-backed xarray.Dataset, enabling parallel and out-of-core computation.

Dask for memory-intensive applications

Using chunks is crucial for handling large datasets efficiently. Only parts of the data are loaded into memory at a time.
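To see how chunked computation stays memory-friendly, here is a minimal, self-contained sketch using plain dask.array with the same shape and chunking as the dataset above (the array contents are synthetic; no download is involved):

```python
import dask.array as da

# Synthetic stand-in for one (time, ncells) variable: 31 daily steps,
# 1310720 cells, chunked one time step at a time like the Zarr store.
x = da.ones((31, 1310720), chunks=(1, 1310720), dtype="float32")

# Nothing is computed yet -- this only builds a task graph.
m = x.mean()

# .compute() triggers the actual work, processed chunk by chunk.
print(m.compute())  # → 1.0
```

Because each chunk holds only one time step (about 5 MB here), the mean is reduced incrementally without ever materializing the full array in memory.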

Fetching data for analysis or visualization#

Now that the dataset is loaded as a Dask-aware xarray.Dataset, you can perform operations just like with any xarray object. This computes the mean temperature across all cells for a single time step:

ds.ts.isel(time=0).mean("ncells").values
array(286.84082, dtype=float32)

Lazy dataset

Because the dataset is Dask-backed, operations are lazy and won’t load data until the actual computation is triggered.
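The same laziness can be demonstrated offline with a tiny Dask-backed xarray.Dataset (the data are synthetic; the names ts, time, and ncells simply mirror the tutorial dataset):

```python
import numpy as np
import xarray as xr

# Build a small synthetic dataset and chunk it, mimicking to_dask() output.
ds = xr.Dataset(
    {"ts": (("time", "ncells"), np.full((3, 8), 287.0, dtype="float32"))}
).chunk({"time": 1, "ncells": -1})

# Lazy: selecting and averaging only extends the Dask task graph.
lazy_mean = ds.ts.isel(time=0).mean("ncells")

# Accessing .values (or calling .compute()) triggers the computation.
print(lazy_mean.values)  # → 287.0
```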