---
file_format: mystnb
kernelspec:
  name: python3
  display_name: Python 3
---

# Finding and loading data with Intake

[Intake](https://intake.readthedocs.io) is a powerful, flexible data cataloging system that allows you to discover, describe, and load datasets from diverse sources through a unified interface. This tutorial walks you through the core workflow:

1. **Opening a catalog** from a URL
2. **Exploring its contents**
3. **Loading a dataset** with lazy evaluation and chunking
4. **Using the data** for analysis or visualization

We’ll use a public catalog from the [nextGEMS project](https://nextgems-h2020.eu/) to demonstrate real-world usage.

```{admonition} Intake v2
:class: warning
Version 2 of Intake introduced significant, non-backward-compatible changes to the catalog specification. The catalogs that we maintain still use the old format. Therefore, it is important to constrain the Intake packages in your Python environment, e.g., `intake<2` and `intake-xarray<2`.
```

```{admonition} Command-line use
:class: info
See [this tutorial](Intake/query_yaml.md) for a small tool that queries the catalogs from the command line.
```

## Opening the catalog

The first step is to open a catalog using `intake.open_catalog()`. This function loads a catalog file and returns a catalog object that you can explore and query. In this example, we load a public YAML catalog hosted online. The catalog contains metadata about the datasets, such as their locations and temporal resolution.

```{code-cell} python3
import intake

cat = intake.open_catalog("https://data.nextgems-h2020.eu/online.yaml")
```

```{admonition} Online datasets
:class: info
The online.yaml catalog contains datasets that are publicly accessible via web servers, so you can run the tutorials on any machine with an internet connection.
```

### Exploring the catalog structure

Once the catalog is loaded, you can inspect its entries.
Each entry corresponds to a dataset or a sub-catalog.

```{code-cell} python3
list(cat)
```

You’ll see entries like `ICON`, `IFS`, and `tutorial`, each representing a dataset or a group of datasets. You can access a specific entry directly via attribute access, e.g., `cat.tutorial`.

```{code-cell} python3
list(cat.tutorial)
```

This shows all datasets in the `tutorial` sub-catalog. You can access an individual dataset by its identifier to show the dataset description, its parameters, and its driver (i.e., how it is loaded).

```{code-cell} python3
cat.tutorial["ICON.native.2d_P1D_mean"]
```

## Opening a dataset

Now that we’ve located a dataset, we can load it. Intake supports **lazy loading**, meaning the data isn’t read until you explicitly request it. The `cat.tutorial["ICON.native.2d_P1D_mean"]` syntax returns a **data source** object. Using the `to_dask()` method, we can open the dataset:

```{code-cell} python3
ds = cat.tutorial["ICON.native.2d_P1D_mean"](chunks={"ncells": -1}).to_dask()
ds
```

### What’s happening here?

- `chunks={"ncells": -1}`: tells Intake to use **Dask** for chunking; `-1` means the `ncells` dimension is not split, i.e., it is kept in a single chunk.
- `.to_dask()`: converts the data source into a Dask-backed `xarray.Dataset`, enabling parallel and out-of-core computation.

```{admonition} Dask for memory-intensive applications
:class: info
Using `chunks` is crucial for handling large datasets efficiently: only parts of the data are loaded into memory at a time.
```

## Fetching data for analysis or visualization

Now that the dataset is loaded as a Dask-aware `xarray.Dataset`, you can perform operations just like with any other `xarray` object. This computes the **mean of the `ts` field across all cells** for a single time step:

```{code-cell} python3
ds.ts.isel(time=0).mean("ncells").values
```

```{admonition} Lazy dataset
:class: info
Because the dataset is [Dask-backed](dask.md), operations are **lazy** and won’t load data until the actual computation is triggered.
```

```{toctree}
---
hidden: true
---
Intake/query_yaml.md
```
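You can observe this laziness without touching the remote catalog. The following self-contained sketch builds a synthetic Dask-backed dataset whose variable and dimension names (`ts`, `time`, `ncells`) mirror the tutorial dataset, but whose values are random, so it runs offline:

```{code-cell} python3
import dask.array as da
import xarray as xr

# Synthetic stand-in for the tutorial dataset: 10 daily time steps of a
# "ts" variable on an unstructured grid with 1000 cells, one chunk per step.
ds_demo = xr.Dataset(
    {"ts": (("time", "ncells"), da.random.random((10, 1000), chunks=(1, 1000)))}
)

# Building the expression is instantaneous -- no values are computed yet.
mean_ts = ds_demo.ts.isel(time=0).mean("ncells")

# Accessing .values (or calling .compute()) triggers the actual computation.
print(float(mean_ts.values))
```

Until `.values` is accessed, `mean_ts` is only a task graph; this is why you can chain selections and reductions on very large datasets without running out of memory.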