Finding and loading data with Intake#
Intake is a powerful, flexible data cataloging system that allows you to discover, describe, and load datasets from diverse sources using a unified interface.
This tutorial walks you through the core workflow:

- Opening a catalog from a URL
- Exploring its contents
- Loading a dataset with lazy evaluation and chunking
- Using the data for analysis or visualization
We’ll use a public catalog from the nextGEMS project to demonstrate real-world usage.
Intake v2
Version 2 of Intake introduced significant, non-backward-compatible changes to the catalog specification.
The catalogs that we maintain are still using the old format.
Therefore, it is important to constrain the Intake package in your Python environment, e.g., `intake<2` and `intake-xarray<2`.
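One way to apply this constraint, assuming a pip-based environment (conda users can pin the same specifiers), is:

```shell
# Pin Intake to the v1 catalog format; quote the specifiers so the
# shell does not interpret "<" as a redirection.
pip install "intake<2" "intake-xarray<2"
```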
Command-line use
See this tutorial for a small tool for querying the catalogs from the command line.
Opening the catalog#
The first step is to open a catalog using intake.open_catalog().
This function loads a catalog file and returns a catalog object that you can explore and query.
In this example, we’re loading a public YAML catalog hosted online. The catalog describes each dataset, including its location, temporal resolution, and additional metadata.
```python
import intake

cat = intake.open_catalog("https://data.nextgems-h2020.eu/online.yaml")
```
Online datasets
The online.yaml catalog contains datasets that are publicly accessible via web servers. This means that you can run the tutorials on any machine with an internet connection.
Exploring the catalog structure#
Once the catalog is loaded, you can inspect its entries. Each entry corresponds to a dataset or sub-catalog.
```python
list(cat)
```

```
['ICON', 'ERA5', 'JRA3Q', 'MERRA2', 'IFS', 'IMERG', 'GSMaP', 'tutorial']
```
You’ll see entries like ICON, IFS, or tutorial and others, each representing a dataset or group of datasets.
You can access a specific entry directly via attribute access, e.g., cat.tutorial.
```python
list(cat.tutorial)
```

```
['ICON.native.2d_PT6H_inst',
 'ICON.native.2d_P1D_mean',
 'ICON.native.3d_P1D_mean']
```
This shows all datasets in the tutorial sub-catalog.
You can access individual datasets using their identifier to show the dataset description, parameters, and its driver (how it’s loaded).
```python
cat.tutorial["ICON.native.2d_P1D_mean"]
```

```
ICON.native.2d_P1D_mean:
  args:
    chunks: null
    consolidated: true
    urlpath: https://swift.dkrz.de/v1/dkrz_948e7d4bbfbb445fbff5315fc433e36a/easygems_tutorial/ICON.native.2d_P1D_mean.zarr
  description: ''
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    catalog_dir: https://data.nextgems-h2020.eu
```
Opening a dataset#
Now that we’ve located a dataset, we can load it.
Intake supports lazy loading, meaning the data isn’t read until you explicitly request it.
The cat.tutorial["ICON.native.2d_P1D_mean"] syntax returns a data source object.
Using the to_dask() method, we can open the dataset:
```python
ds = cat.tutorial["ICON.native.2d_P1D_mean"](chunks={"ncells": -1}).to_dask()
ds
```
```
<xarray.Dataset> Size: 844MB
Dimensions:            (time: 31, ncells: 1310720)
Coordinates:
  * time               (time) datetime64[ns] 248B 2020-01-01T23:59:59 ... 20...
    cell_sea_land_mask (ncells) float64 10MB dask.array<chunksize=(1310720,), meta=np.ndarray>
    lat                (ncells) float64 10MB dask.array<chunksize=(1310720,), meta=np.ndarray>
    lon                (ncells) float64 10MB dask.array<chunksize=(1310720,), meta=np.ndarray>
Dimensions without coordinates: ncells
Data variables:
    hus2m              (time, ncells) float32 163MB dask.array<chunksize=(1, 1310720), meta=np.ndarray>
    psl                (time, ncells) float32 163MB dask.array<chunksize=(1, 1310720), meta=np.ndarray>
    ts                 (time, ncells) float32 163MB dask.array<chunksize=(1, 1310720), meta=np.ndarray>
    uas                (time, ncells) float32 163MB dask.array<chunksize=(1, 1310720), meta=np.ndarray>
    vas                (time, ncells) float32 163MB dask.array<chunksize=(1, 1310720), meta=np.ndarray>
```

What’s happening here?#
- `chunks={"ncells": -1}`: tells Intake to use Dask chunking, keeping the full `ncells` dimension in a single chunk.
- `.to_dask()`: converts the dataset into a Dask-backed `xarray.Dataset`, enabling parallel and out-of-core computation.
Dask for memory-intensive applications
Using chunks is crucial for handling large datasets efficiently.
Only parts of the data are loaded into memory at a time.
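To make the chunk layout concrete, here is a small sketch using a synthetic Dask array (not the catalog data) shaped like one of the 2D variables above:

```python
import dask.array as da

# A synthetic stand-in for one 2D variable: 31 time steps x 1,310,720 cells.
# chunks=(1, -1) mirrors chunks={"ncells": -1} above: one time step per chunk,
# with the full cell dimension kept in a single chunk.
arr = da.zeros((31, 1310720), dtype="float32", chunks=(1, -1))

print(arr.chunksize)    # (1, 1310720)
print(arr.npartitions)  # 31 chunks, each loadable independently
```

Each chunk can be fetched and processed on its own, so a computation over one time step never has to hold the full 844 MB dataset in memory.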
Fetching data for analysis or visualization#
Now that the dataset is loaded as a Dask-aware xarray.Dataset, you can perform operations just like with any xarray object.
This computes the mean temperature across all cells for a single time step:
```python
ds.ts.isel(time=0).mean("ncells").values
```

```
array(286.84082, dtype=float32)
```
Lazy dataset
Because the dataset is Dask-backed, operations are lazy and won’t load data until the actual computation is triggered.
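The same lazy behavior can be demonstrated offline with a plain Dask array; nothing is computed until `.compute()` (or, for the xarray objects above, `.values`) is called:

```python
import dask.array as da

# Build a lazy array and a lazy reduction over it.
arr = da.random.random((1000, 1000), chunks=(100, 100))
mean = arr.mean()  # still lazy: only a task graph has been built

result = mean.compute()    # this call triggers the actual computation
print(0.0 < result < 1.0)  # prints True: mean of uniform [0, 1) samples
```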