---
file_format: mystnb
kernelspec:
  name: python3
  display_name: Python 3
---

# Finding and loading data with Intake

[Intake](https://intake.readthedocs.io) is a powerful, flexible data cataloging system that allows you to discover, describe, and load datasets from diverse sources through a unified interface. This tutorial walks you through the core workflow:

1. **Opening a catalog** from a URL
2. **Exploring its contents**
3. **Loading a dataset** with lazy evaluation and chunking
4. **Using the data** for analysis or visualization

We’ll use a public catalog from the [nextGEMS project](https://nextgems-h2020.eu/) to demonstrate real-world usage.

```{admonition} Intake v2
:class: warning
Version 2 of Intake introduced significant, non-backward-compatible changes to the catalog specification. The catalogs that we maintain still use the old format. Therefore, it is important to constrain the Intake packages in your Python environment, e.g., `intake<2` and `intake-xarray<2`.
```

```{admonition} Command-line use
:class: info
See [this tutorial](Intake/query_yaml.md) for a small tool that queries the catalogs from the command line.
```

## Opening the catalog

The first step is to open a catalog using `intake.open_catalog()`. This function loads a catalog file and returns a catalog object that you can explore and query. In this example, we load a public YAML catalog hosted online. The catalog contains metadata about the datasets, such as their locations and temporal resolution.

```{code-cell} python3
import intake

cat = intake.open_catalog("https://data.nextgems-h2020.eu/online.yaml")
```

```{admonition} Online datasets
:class: info
The online.yaml catalog contains datasets that are publicly accessible via web servers, so you can run the tutorials on any machine with an internet connection.
```

### Exploring the catalog structure

Once the catalog is loaded, you can inspect its entries.
Each entry corresponds to a dataset or a sub-catalog.

```{code-cell} python3
list(cat)
```

You’ll see entries like `ICON`, `IFS`, and `tutorial`, each representing a dataset or a group of datasets. You can access a specific entry directly via attribute access, e.g., `cat.tutorial`.

```{code-cell} python3
list(cat.tutorial)
```

This shows all datasets in the `tutorial` sub-catalog. You can access an individual dataset by its identifier to show the dataset description, its parameters, and its driver (i.e., how it is loaded).

```{code-cell} python3
cat.tutorial["ICON.native.2d_P1D_mean"]
```

## Opening a dataset

Now that we’ve located a dataset, we can load it. Intake supports **lazy loading**, meaning the data isn’t read until you explicitly request it. The `cat.tutorial["ICON.native.2d_P1D_mean"]` syntax returns a **data source** object. Using the `to_dask()` method, we can open the dataset:

```{code-cell} python3
ds = cat.tutorial["ICON.native.2d_P1D_mean"](chunks={"ncells": -1}).to_dask()
ds
```

### What’s happening here?

- `chunks={"ncells": -1}`: tells Intake to use **Dask** for chunking; `-1` means the `ncells` dimension is not split, i.e., it is kept in a single chunk.
- `.to_dask()`: converts the data source into a Dask-backed `xarray.Dataset`, enabling parallel and out-of-core computation.

```{admonition} Dask for memory-intensive applications
:class: info
Using `chunks` is crucial for handling large datasets efficiently: only parts of the data are loaded into memory at a time.
```

## Fetching data for analysis or visualization

Now that the dataset is loaded as a Dask-aware `xarray.Dataset`, you can perform operations just like with any other `xarray` object. This computes the **mean of the `ts` field across all cells** for a single time step:

```{code-cell} python3
ds.ts.isel(time=0).mean("ncells").values
```

```{admonition} Lazy dataset
:class: info
Because the dataset is [Dask-backed](dask.md), operations are **lazy** and won’t load data until the actual computation is triggered.
```

```{toctree}
---
hidden: true
---
Intake/query_yaml.md
```
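You can observe this laziness without touching the remote catalog. The following self-contained sketch builds a synthetic Dask-backed dataset whose variable and dimension names (`ts`, `time`, `ncells`) mirror the tutorial dataset, but whose values are random, so it runs offline:

```{code-cell} python3
import dask.array as da
import xarray as xr

# Synthetic stand-in for the tutorial dataset: 10 daily time steps of a
# "ts" variable on an unstructured grid with 1000 cells, one chunk per step.
ds_demo = xr.Dataset(
    {"ts": (("time", "ncells"), da.random.random((10, 1000), chunks=(1, 1000)))}
)

# Building the expression is instantaneous -- no values are computed yet.
mean_ts = ds_demo.ts.isel(time=0).mean("ncells")

# Accessing .values (or calling .compute()) triggers the actual computation.
print(float(mean_ts.values))
```

Until `.values` is accessed, `mean_ts` is only a task graph; this is why you can chain selections and reductions on very large datasets without running out of memory.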