Easy IPCC part 1: Multi-Model Datatree
Handling heterogeneous climate model data with xarray-datatree
By Tom Nicholas (@TomNicholas) and Julius Busecke (@jbusecke)
(Originally posted on the Pangeo Medium blog)
In this series of blog posts we aim to quickly reproduce a panel from one of the key figures (fig. 9.3a) of the IPCC AR6 Report from the raw climate model simulation data, using Pangeo cloud-native tools. This figure is critical: it literally shows our best estimate of the future state of the climate given certain choices by humanity.
This figure shows both the historical and projected global average temperature of the ocean’s surface, as simulated by different Global Climate Models. The projections follow possible trajectories of greenhouse gas emissions based on various socio-economic pathways (O’Neill et al. 2017).
However, the process of creating it is potentially very cumbersome: it involves downloading large files, idiosyncratic code to produce intermediate datafiles, and lots more code to finally plot the results. The open sharing of code used for the latest IPCC report inspired us to try to make this process easier for everyone.
The numerical model output produced under the Coupled Model Intercomparison Project Phase 6 (CMIP6) is publicly available (distributed by the Earth System Grid Federation), so the scientific community should be able to reproduce this plot from scratch, and then build upon it as the underlying science advances.
Due to recent advances in software and the migration of the CMIP6 data archive into the cloud, we are now able to produce this panel rapidly from raw data using easily available public cloud computing.
Working with many CMIP6 datasets at once can be cumbersome because each model’s output has different sizes and coordinates. In this blogpost we highlight the new xarray-datatree package, which helps organize the many datasets in CMIP6 into a single tree-like data structure.
There have been other attempts to analyze, organize, and visualize multi-model experiments, but here we prioritize modularity, flexibility, extensibility, and using domain-agnostic tools.
We were able to reproduce parts of this graph in a live demo at Scipy 2022 using cloud-hosted data, but felt we could do it much more succinctly with datatree!
Getting the data
A key component of this workflow is having the large CMIP6 datasets available to open near-instantly from a public cloud store. The Pangeo/ESGF Cloud Data Working Group maintains Analysis-Ready Cloud-Optimized versions of a large part of the CMIP6 archive as Zarr stores as public datasets both on Google Cloud Storage and Amazon S3. (This alone is a huge advancement for open science, but just a part of what we’re showing today.)
For the sake of demonstration we will load a small subset of sea surface temperature data for three models. We’ll load both the historical baseline, as well as the SSP1-2.6 (“best case”) and SSP5-8.5 (“worst case”) future socio-economic pathways: a total of 9 datasets. (We’ll tackle scaling up to all the available data in a future blog post.)
import xarray as xr
import gcsfs  # filesystem interface for the 'gs://' Google Cloud Storage paths below
from xmip.preprocessing import rename_cmip6

xr.set_options(keep_attrs=True)

cmip_stores = {
    'IPSL-CM6A-LR/historical': 'gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r4i1p1f1/Omon/tos/gn/v20180803/',
    'MPI-ESM1-2-LR/historical': 'gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r4i1p1f1/Omon/tos/gn/v20190710/',
    'CESM2/historical': 'gs://cmip6/CMIP6/CMIP/NCAR/CESM2/historical/r4i1p1f1/Omon/tos/gn/v20190308/',
    'IPSL-CM6A-LR/ssp126': 'gs://cmip6/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp126/r4i1p1f1/Omon/tos/gn/v20191121/',
    'IPSL-CM6A-LR/ssp585': 'gs://cmip6/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r4i1p1f1/Omon/tos/gn/v20191122/',
    'MPI-ESM1-2-LR/ssp126': 'gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-LR/ssp126/r4i1p1f1/Omon/tos/gn/v20190710/',
    'MPI-ESM1-2-LR/ssp585': 'gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-LR/ssp585/r4i1p1f1/Omon/tos/gn/v20190710/',
    'CESM2/ssp126': 'gs://cmip6/CMIP6/ScenarioMIP/NCAR/CESM2/ssp126/r4i1p1f1/Omon/tos/gn/v20200528/',
    'CESM2/ssp585': 'gs://cmip6/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585/r4i1p1f1/Omon/tos/gn/v20200528/',
}

# Open each Zarr store, clean up naming conventions with xMIP, and load into memory
datasets = {
    name: rename_cmip6(xr.open_dataset(path, engine="zarr")).load()
    for name, path in cmip_stores.items()
}
CMIP vocabulary explainer: The words “model” and “experiment” have a specific meaning in CMIP lingo. Within CMIP there are many different datasets produced by various modeling centers around the world. Each of these centers has one or more model setups (source_id in official CMIP language), which are used to produce multiple simulations. For example, the simulations for IPSL-CM6A-LR are all produced by the Institut Pierre-Simon Laplace in France. Different simulations are run under certain protocols which prescribe conditions (e.g. greenhouse gas forcings), and these are called experiments. The experiments we’ve loaded here are 'historical', 'ssp126' and 'ssp585'.
These nine datasets are all similar, e.g. they have the same dimension names (freshly cleaned thanks to another package of ours, xMIP) and variable names, but they have different sizes along dimensions: horizontal dimensions differ due to varying resolution, and time dimensions differ because the historical baseline run is longer than the scenarios by a varying amount.
print(datasets["CESM2/ssp585"].sizes)
print(datasets["IPSL-CM6A-LR/ssp585"].sizes)
Frozen({'y': 384, 'x': 320, 'vertex': 4, 'time': 1032, 'bnds': 2})
Frozen({'y': 332, 'x': 362, 'vertex': 4, 'time': 1032, 'bnds': 2})
For instance the "x" and "y" dimensions here (roughly corresponding to latitude and longitude) do not match between the CESM2 model and the IPSL model. Hence they cannot be combined into a single N-dimensional array or a single xarray.Dataset!
Introducing datatree
The CMIP6 datasets are clearly related, however. They have a hierarchical order - grouped by model and then by experiment. Each dataset also contains very similar variables. Wouldn’t it be nice to treat them all as one related set of data, and analyze them together?
The xarray-datatree package enables this. It arranges these related datasets into a DataTree object, allowing them to be manipulated together. A DataTree object can be thought of as a recursive dictionary of xarray.Dataset objects, with additional methods for performing computations on the entire tree at once. You can also think of it as an in-memory representation of an entire netCDF file, including all netCDF groups. Because it is in-memory, your analysis does not require saving intermediate files to disk.
Note: The pip-installable package xarray-datatree provides the class DataTree, importable from the module datatree. From here on we will refer interchangeably to these as just “datatree”.
from datatree import DataTree
dt = DataTree.from_dict(datasets)
print(f"Size of data in tree = {dt.nbytes / 1e9 :.2f} GB")
Size of data in tree = 4.89 GB
print(dt)
DataTree('None', parent=None)
├── DataTree('IPSL-CM6A-LR')
│ ├── DataTree('historical')
│ │ Dimensions: (y: 332, x: 362, vertex: 4, time: 1980, bnds: 2)
│ │ Coordinates:
│ │ * time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
│ │ Dimensions without coordinates: y, x, vertex, bnds
│ │ Data variables:
│ │ area (y, x) float32 16.0 16.0 16.0 ... 1.55e+08 3.18e+07 3.18e+07
│ │ lat_bounds (y, x, vertex) float32 -84.21 -84.21 -84.21 ... 50.11 49.98
│ │ lon_bounds (y, x, vertex) float32 72.5 72.5 72.5 72.5 ... 73.0 72.95 73.0
│ │ lat (y, x) float32 -84.21 -84.21 -84.21 ... 50.23 50.01 50.01
│ │ lon (y, x) float32 72.5 73.5 74.5 75.5 ... 73.05 73.04 73.0 72.99
│ │ time_bounds (time, bnds) object 1850-01-01 00:00:00 ... 2015-01-01 00:00:00
│ │ tos (time, y, x) float32 nan nan nan nan nan ... nan nan nan nan
│ │ Attributes: (12/54)
│ │ CMIP6_CV_version: cv=6.2.3.5-2-g63b123e
│ │ Conventions: CF-1.7 CMIP-6.2
│ │ EXPID: historical
│ │ NCO: "4.6.0"
│ │ activity_id: CMIP
│ │ branch_method: standard
│ │ ... ...
│ │ tracking_id: hdl:21.14100/dc824b97-b288-4ec0-adde-8dbcfa3b0095
│ │ variable_id: tos
│ │ variant_label: r4i1p1f1
│ │ status: 2019-11-09;created;by nhn2@columbia.edu
│ │ netcdf_tracking_ids: hdl:21.14100/dc824b97-b288-4ec0-adde-8dbcfa3b0095
│ │ version_id: v20180803
│ ├── DataTree('ssp126')
│ │ Dimensions: (y: 332, x: 362, vertex: 4, time: 1032, bnds: 2)
│ │ Coordinates:
│ │ * time (time) object 2015-01-16 12:00:00 ... 2100-12-16 12:00:00
│ │ Dimensions without coordinates: y, x, vertex, bnds
│ │ Data variables:
│ │ area (y, x) float32 16.0 16.0 16.0 ... 1.55e+08 3.18e+07 3.18e+07
│ │ lat_bounds (y, x, vertex) float32 -84.21 -84.21 -84.21 ... 50.11 49.98
│ │ lon_bounds (y, x, vertex) float32 72.5 72.5 72.5 72.5 ... 73.0 72.95 73.0
│ │ lat (y, x) float32 -84.21 -84.21 -84.21 ... 50.23 50.01 50.01
│ │ lon (y, x) float32 72.5 73.5 74.5 75.5 ... 73.05 73.04 73.0 72.99
│ │ time_bounds (time, bnds) object 2015-01-01 00:00:00 ... 2101-01-01 00:00:00
│ │ tos (time, y, x) float32 nan nan nan nan nan ... nan nan nan nan
...
We created the whole tree in one go from our dictionary of nine datasets, using the keys of the dict to organize the resulting tree. DataTree objects are structured similarly to a UNIX filesystem, so a key in the dict such as 'IPSL-CM6A-LR/historical' will create a node called IPSL-CM6A-LR and another child node below it called historical. This approach means the DataTree.from_dict method can create trees of arbitrary complexity from flat dictionaries.
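To make the path-splitting idea concrete, here is a tiny pure-Python sketch (an illustration only, not datatree’s actual implementation) of how slash-separated keys turn into a nested hierarchy:

```python
def nest(flat: dict) -> dict:
    """Turn 'parent/child'-style keys into nested dicts,
    mimicking how DataTree.from_dict builds its node hierarchy."""
    tree: dict = {}
    for path, value in flat.items():
        node = tree
        *parents, leaf = path.strip("/").split("/")
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate nodes as needed
        node[leaf] = value
    return tree

nested = nest({
    "IPSL-CM6A-LR/historical": "ds1",
    "IPSL-CM6A-LR/ssp126": "ds2",
    "CESM2/historical": "ds3",
})
# → {'IPSL-CM6A-LR': {'historical': 'ds1', 'ssp126': 'ds2'}, 'CESM2': {'historical': 'ds3'}}
```

In the real DataTree the values are xarray.Dataset objects and the intermediate nodes are themselves DataTree nodes, but the bookkeeping is the same.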
If instead we wanted to create the tree manually, we could do so by setting the .children attributes of each node explicitly, assigning other tree objects to build up a composite result.
You can see that the resulting tree structure is grouped by model first, and then by the experiment. (This choice is somewhat arbitrary and we could have chosen to group first by experiment and then model.)
While we printed the string representation of the tree for this blog post, if you’re doing this in a Jupyter notebook you will automatically get an interactive HTML representation of the tree, where each node is collapsible.
We say that we have created one “tree”, where each “node” within that tree (optionally) contains the contents of exactly one xarray.Dataset, and also has a node name, a “parent” node, and can have any number of “child” nodes. For more information on the data model datatree uses, see the documentation.
We can actually access nodes in the tree using path-like syntax too, for example:
print(dt["/CESM2/ssp585"])
DataTree('ssp585', parent="CESM2")
Dimensions: (y: 384, x: 320, vertex: 4, time: 1032, bnds: 2)
Coordinates:
* y (y) int32 1 2 3 4 5 6 7 8 9 ... 377 378 379 380 381 382 383 384
* x (x) int32 1 2 3 4 5 6 7 8 9 ... 313 314 315 316 317 318 319 320
* time (time) object 2015-01-15 13:00:00 ... 2100-12-15 12:00:00
Dimensions without coordinates: vertex, bnds
Data variables:
lat (y, x) float64 -79.22 -79.22 -79.22 -79.22 ... 72.2 72.19 72.19
lat_bounds (y, x, vertex) float32 -79.49 -79.49 -78.95 ... 72.41 72.41
lon (y, x) float64 320.6 321.7 322.8 323.9 ... 318.9 319.4 319.8
lon_bounds (y, x, vertex) float32 320.0 321.1 321.1 ... 320.0 320.0 319.6
time_bounds (time, bnds) object 2015-01-01 02:00:00.000003 ... 2101-01-0...
tos (time, y, x) float32 nan nan nan nan nan ... nan nan nan nan
Attributes: (12/48)
Conventions: CF-1.7 CMIP-6.2
activity_id: ScenarioMIP
branch_method: standard
branch_time_in_child: 735110.0
branch_time_in_parent: 735110.0
case_id: 1735
... ...
tracking_id: hdl:21.14100/68b741c9-b8f8-479d-b48c-6853b1c71e56...
variable_id: tos
variant_info: CMIP6 SSP5-8.5 experiments (2015-2100) with CAM6,...
variant_label: r4i1p1f1
netcdf_tracking_ids: hdl:21.14100/68b741c9-b8f8-479d-b48c-6853b1c71e56...
version_id: v20200528
You can see that every node is itself another DataTree object, i.e. the tree is a recursive data structure.
This structure not only allows the user full flexibility in how to set up and manipulate the tree, it also behaves like a filesystem, which people are already very familiar with.
But datatrees are not just a neat way to organize xarray datasets; they also allow us to operate on datasets in an intuitive way.
Timeseries
In this case the basic quantity we want to compute is a global mean of the ocean surface temperature. For this we can use the built-in .mean() method.
timeseries = dt.mean(dim=["x", "y"])
This works exactly like the familiar xarray method, and averages the temperature values over the horizontal dimensions for each dataset on any level of the tree.
All arguments are passed down, so this .mean will work so long as the data in each node of the tree has an "x" and a "y" dimension to average over.
Let’s confirm our results by plotting a single dataset from the new tree of timeseries:
timeseries['/IPSL-CM6A-LR/ssp585'].ds['tos'].plot()
And indeed, we get a nice timeseries. Once you select a node, you get back an xarray.Dataset using .ds and can do all the good stuff you already know (like plotting).
For comparison, here is the relevant part of Julius’ code from the Scipy 2022 live demo that did not use datatree:
timeseries_hist_datasets = []
timeseries_ssp126_datasets = []
timeseries_ssp245_datasets = []
timeseries_ssp370_datasets = []
timeseries_ssp585_datasets = []

for k, ds in data_timeseries.items():
    # Separate experiments
    out = ds.convert_calendar('standard')
    out = out.sel(time=slice('1850', '2100'))  # cut extended runs
    out = out.assign_coords(source_id=ds.source_id)
    if ds.experiment_id == 'historical':
        # CMIP
        if len(out.time) == 1980:
            timeseries_hist_datasets.append(out)
        else:
            print(f"found {len(ds.time)} for {k}")
    else:
        # ScenarioMIP
        if len(out.time) != 1032:
            print(f"found {len(out.time)} for {k}")
            # print(ds.time)
        else:
            if ds.experiment_id == 'ssp126':
                timeseries_ssp126_datasets.append(out)
            elif ds.experiment_id == 'ssp245':
                timeseries_ssp245_datasets.append(out)
            elif ds.experiment_id == 'ssp370':
                timeseries_ssp370_datasets.append(out)
            elif ds.experiment_id == 'ssp585':
                timeseries_ssp585_datasets.append(out)

concat_kwargs = dict(
    dim='source_id',
    join='override',
    compat='override',
    coords='minimal',
)

timeseries_hist = xr.concat(timeseries_hist_datasets, **concat_kwargs)
timeseries_ssp126 = xr.concat(timeseries_ssp126_datasets, **concat_kwargs)
timeseries_ssp245 = xr.concat(timeseries_ssp245_datasets, **concat_kwargs)
timeseries_ssp370 = xr.concat(timeseries_ssp370_datasets, **concat_kwargs)
timeseries_ssp585 = xr.concat(timeseries_ssp585_datasets, **concat_kwargs)
😬 Definitely not as nice as a one-liner…
It’s worth understanding what happened under the hood when calling .mean() on the tree. We can reproduce the behavior of .mean() by passing a simple function to the built-in .map_over_subtree() method.
def mean_over_space(ds):
    return ds.mean(dim=["x", "y"])

dt.map_over_subtree(mean_over_space)
The function mean_over_space that we supplied gets applied to the data in each and every node in the tree automatically. This will return the exact same result as above, but with this method users can map arbitrary functions over each dataset in the tree, as long as they consume and produce xarray.Dataset objects. You can even map functions with multiple inputs and outputs, but that is an advanced use case which we will use at the end of this post.
You can already see that this enables very intuitive mapping of custom functions without lengthy writing of loops.
Saving output
Before we go further, we might want to save our aggregated timeseries data to use later. DataTree supports saving to and loading from both netCDF files and Zarr stores.
from datatree import open_datatree

timeseries.to_zarr('cmip_timeseries')  # or netCDF, with any group structure
roundtrip = open_datatree('cmip_timeseries', engine="zarr")
print(roundtrip)
DataTree('None', parent=None)
├── DataTree('CESM2')
│ ├── DataTree('historical')
│ │ Dimensions: (vertex: 4, time: 1980, bnds: 2)
│ │ Coordinates:
│ │ * time (time) object 1850-01-15 12:59:59.999997 ... 2014-12-15 12:0...
│ │ Dimensions without coordinates: vertex, bnds
│ │ Data variables:
│ │ lat float64 ...
│ │ lat_bounds (vertex) float32 ...
│ │ lon float64 ...
│ │ lon_bounds (vertex) float32 ...
│ │ time_bounds (time, bnds) object ...
│ │ tos (time) float32 ...
...
Nodes are saved as netCDF or Zarr groups, meaning that you can now work smoothly with xarray and multi-group files, a long-term sticking point for many xarray users. Being able to work with such files without leaving xarray or python helps retain backwards compatibility and portability.
Calculating anomalies
Note on Climate Anomalies: Coupled climate models often exhibit ‘biases’, which means that the ocean temperatures at the starting point of e.g. the historical and future experiments will differ from model to model. For the purpose of this scientific analysis, however, we are most interested in the relative change from that starting point when a forcing (e.g. increased greenhouse gases) is applied. Conventionally we compute the anomaly relative to a given reference value. In this case the reference value is the global ocean temperature averaged over the years 1950-1980 in the corresponding historical experiment. Therefore we want to compute and subtract this reference value for each model separately, so that each model-specific bias is removed.
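As a minimal single-dataset sketch of that anomaly calculation (synthetic numbers purely for illustration; in the real workflow the base period comes from each model’s historical run):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly "sea surface temperature" with a slow warming trend
time = pd.date_range("1850-01-01", periods=1980, freq="MS")  # 1850-2014, monthly
tos = xr.DataArray(15.0 + 0.001 * np.arange(1980),
                   dims="time", coords={"time": time}, name="tos")

# Reference value: mean over the 1950-1980 base period
base = tos.sel(time=slice("1950", "1980")).mean("time")

# Anomaly relative to that reference value
tos_anomaly = tos - base
```

By construction the anomaly averages to zero over the base period, so any remaining signal is the relative change we care about.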
In practice we need to iterate over each model, pick the historical experiment, select the base period, and average it. We then subtract that resulting value from all experiments of the corresponding model and rearrange the resulting anomaly timeseries into a new tree.
anomaly = DataTree()
for model_name, model in timeseries.children.items():
    # model-specific base period as an xarray.Dataset
    base_period = model["historical"].ds.sel(time=slice('1950', '1980')).mean('time')
    anomaly[model_name] = model - base_period  # subtree - Dataset
We have used a few features of datatree here. We loop over the .children attribute of the DataTree, which iterates through each sub-node in turn in a dictionary-like manner.
We find the reference value (base_period) by extracting the data in a node as an xarray.Dataset via .ds, giving us access to all of xarray’s normal API for computations.
Our tree of results (anomaly) is constructed by creating an empty tree and then adding branches to this new tree whilst we loop through models.
In the final line of this for loop we have subtracted an xarray.Dataset from a DataTree. This operation works node-wise, i.e. the dataset is subtracted from every data-containing node in the tree.
Note: In general, we could imagine having rules for operations involving any number of trees of any node structure, analogous to numpy array broadcasting. This would make the above anomaly calculation much more succinct. However, at the moment datatree only supports tree @ dataset-like operations and tree @ tree-like operations, where the trees must have the same node structure.
Plotting
Alright, so the final step here is to actually plot the data. We can use the map_over_subtree approach to plot each individual simulation onto the same matplotlib axes:
import matplotlib.pyplot as plt
from datatree import map_over_subtree

fig, ax = plt.subplots()

@map_over_subtree
def plot_temp(ds, original_ds):
    if ds:
        label = f'{original_ds.attrs["source_id"]} - {original_ds.attrs["experiment_id"]}'
        ds['tos'].rolling(time=2 * 12).mean().plot(ax=ax, label=label)
    return ds

plot_temp(anomaly, timeseries)
ax.legend()
For this we created a matplotlib figure and axis and then used the built-in xarray plotting capability to add each timeseries to the existing matplotlib.Axes object.
Instead of calling the .map_over_subtree method on the DataTree class, we used map_over_subtree as a function decorator, which promoted plot_temp to a function that will act over all nodes of any given tree. This decorator is happy to map over multiple trees simultaneously, which we used to extract the metadata for labelling each timeseries.
Note: The way we extracted this metadata for plotting was a little awkward; we can imagine improving this in future versions of datatree.
The fact that we operate on an xarray dataset also allows us to conveniently smooth the timeseries (with a 2-year rolling mean) to remove the seasonal cycle.
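To see why a 24-month window removes the seasonal cycle, here is a minimal sketch on synthetic monthly data (made-up numbers, not CMIP6 output): the window covers each calendar month exactly twice, so a purely seasonal signal averages out.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly series: mean of 10 plus a pure seasonal cycle
time = pd.date_range("2000-01-01", periods=120, freq="MS")  # 10 years, monthly
month = xr.DataArray(time.month, dims="time", coords={"time": time})
tos = 10 + 3 * np.sin(2 * np.pi * (month - 1) / 12)

# 24-month (2 * 12) rolling mean, as used in the plotting code above
smoothed = tos.rolling(time=2 * 12).mean()
```

The first 23 values of the smoothed series are NaN (the window is not yet full); once it is, the seasonal wiggle collapses back onto the underlying mean.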
Looking at the figure produced above, we’ve successfully replicated the basic features of the IPCC plot that we wanted! And it took us very few lines of code overall, especially compared to how it’s often done. In future posts we will improve this figure, adding more data and maybe making it interactive.
(Eagle-eyed climate scientist readers will have noticed that our figure is not actually quite correct: we forgot to weight our temperature averages by the areas of the grid cells in each model! We will deal with this subtlety and other more complex custom aggregations in a future blog post.)
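As a taste of what that correction involves, here is a minimal sketch of an area-weighted mean on a toy grid using xarray’s built-in .weighted (hypothetical areas and temperatures; the CMIP6 datasets above carry a comparable area variable):

```python
import xarray as xr

# Toy 2x2 ocean grid: two small cells and two cells three times as large
area = xr.DataArray([[1.0, 1.0], [3.0, 3.0]], dims=("y", "x"))
tos = xr.DataArray([[0.0, 0.0], [8.0, 8.0]], dims=("y", "x"))

unweighted = tos.mean(("x", "y"))               # treats every cell equally
weighted = tos.weighted(area).mean(("x", "y"))  # (0*1 + 0*1 + 8*3 + 8*3) / 8
```

The large warm cells pull the weighted mean up to 6.0, versus 4.0 for the naive mean, which is exactly the kind of bias an unweighted global average introduces on a non-uniform model grid.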
Takeaways
Datatree is a general tool. Whilst here we used it on CMIP6 data, we learnt about capabilities that might be useful in other projects. We saw that datatree can:
- Organize heterogeneous data
- Seamlessly extend xarray’s functionality
- Handle I/O with nested groups
- Apply operations over whole trees
- Perform more complex and custom operations
- Retain fine-grained control
- Reduce overall lines of code
Future
This blogpost is the first in a series. We plan to write more to discuss:
- Using intake with datatree, where the trees could act like in-memory catalogs,
- Supporting more complex aggregation operations, such as using .weighted to weight an operation on one tree with weights from another tree,
- Scaling out to a much larger set of CMIP6 data with dask,
- Quick plotting of data in a whole tree, maybe even making it interactive.