19 Dec 2024

Xarray x NASA: xarray.DataTree for hierarchical data structures

The new xarray.DataTree class allows working with netCDF/Zarr groups, brought to you in collaboration with NASA!

(Originally posted on the Xarray blog.)

tl;dr

xarray.DataTree has been released in v2024.10.0, and the prototype xarray-contrib/datatree repository archived, after collaboration between the xarray team and the NASA ESDIS project. 🤝

Why trees?

The DataTree concept allows for organizing heterogeneous collections of scientific data in the same way that a nested directory structure facilitates organizing large numbers of files on disk. It does so in a way that preserves common structure between data in the collections, such as aligned arrays and common coordinates.

For those familiar with netCDF4/Zarr groups, a DataTree can also be thought of as an in-memory representation of a file’s group structure. Xarray users have been asking for a way to handle multiple netCDF4 groups since at least 2016!

DataTree enables xarray to be used for various new use cases, including:

  • Climate model intercomparisons,
  • Multi-scale image pyramids, e.g. in genomics,
  • Organising heterogeneous data, such as satellite observations and model simulations,
  • Simple and convenient access to entire hierarchical files.

What is a DataTree exactly?

The new high-level container class xarray.DataTree acts like a tree of linked xarray.Dataset objects, with alignment enforced between arrays in parent and child nodes, but not between those in sibling nodes. It can be written to and opened from formats containing multiple groups, such as netCDF4 files and Zarr stores.

For more details please see the high-level description, the dedicated page on hierarchical data, and the section on IO with groups in the xarray documentation.

Deprecation

If you previously used the datatree.DataTree prototype from the xarray-contrib/datatree repository, note that the repository has now been archived and will no longer be supported. Instead, we encourage you to migrate to the implementation of DataTree that you can import from xarray, following the migration guide.

Big moves

This was a big feature addition! For a decade there were 3 core public xarray data structures; now there are 4: Variable, DataArray, Dataset, and DataTree.

DataTree is arguably one of the largest new features added to xarray in 10 years - the migration of the existing prototype alone added >10k lines of code across 80 pull requests, and the resulting datatree implementation now contains contributions from at least 25 people.

We also had to resolve some really gnarly design questions to make it work in a way we were happy with.

How did this happen?

DataTree didn’t get implemented overnight - it was a multi-year effort that took place in a number of steps, and there are some lessons to be learned from the story.

In March 2021, the xarray team submitted a funding proposal to the Chan-Zuckerberg Initiative to develop “TreeDataset”, citing bioscience use cases such as microscopy image pyramids. Unfortunately, whilst we’ve been lucky to receive CZI funding before, on this occasion we didn’t win money to work on the datatree idea.

In the absence of dedicated funding for datatree, Tom then used some time whilst at the Climate Data Science Lab at Columbia University to take an initial stab at the design in August 2021 - writing the first implementation on an overnight Amtrak! This simple prototype was released as a separate package in the xarray-contrib/datatree repository, and steadily gained a small community of intrepid users. It was driven partly by the use case of climate model intercomparison datasets.

A separate repository was chosen for speed of iteration, and to make changes more easily without worrying as much about backwards compatibility as code in xarray’s main repo must. However, the separate repo meant that the prototype datatree library was not fully integrated with xarray’s main codebase, limiting possible features and requiring fragile dependencies on private xarray internals.

The prototype then sat there for 2 years, until the NASA ESDIS team approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray’s main repository so there would be more guarantees of long-term API stability and support.

Amazingly, NASA were able to offer the time of 3 engineers: Owen (NASA EOSDIS Evolution and Development 3 (EED-3) contract), Matt (NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC)), and Eni (Goddard Earth Sciences Data and Information Services Center (GES DISC)). So, starting in late 2023, the NASA trio worked on migrating the prototype datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray core team).

This second stage of development allowed us to reduce the bus factor on the datatree code and sanity check the original approach. It also gave us a chance to make some significant improvements to the design without backwards-compatibility concerns (for example enabling the new “coordinate inheritance” feature).

Lessons for future collaborations

This development story is different from the more typical scientific grant funding model - how did that work out for us?

The scientific grant model for funding software expects you to present a full idea in a proposal, wait 6-12 months to hopefully get funding for it, then implement the whole thing during the grant period. In contrast, datatree evolved gradually from ideas to hacky prototype to robust implementation, with big time gaps for user feedback and experimentation. The migration was completed by developer-users who actually wanted the feature, rather than grant awardees working in service of a separate and maybe-only-theoretical userbase.

Overall, while the migration effort took longer than anticipated, we think it worked out quite well!

Pros

  • Zero overhead - the existing xarray team did not have to write a proposal to get developer time, and there was literally zero paperwork inflicted (on them at least).
  • Certainty of funding - writing grant proposals is a lottery, so the time invested up front doesn’t even come with any certainty of funding. Collaborating with another org has a much higher chance of actually leading to more money being available for developer time.
  • Time efficient - an xarray core dev spending 10% of their time advising someone who is less familiar with the codebase but has more time is an efficient use of relative expertise.
  • Bus factor - the new contributors reduced the bus factor on the datatree code dramatically.
  • User-driven development - it makes sense to have actual interested user communities involved in development.
  • Stakeholder representation - after officially adding Owen, Matt and Eni to the xarray core team, the NASA ESDIS project has some direct representation in, insider understanding of, and stake in continuing to support the xarray project.

Cons

  • Not everyone got direct funding - it’s less than ideal that Tom, Justus, and Stephan didn’t get direct funding for their supervisory work. In future it might be better to have one of the paid people at the contributing org already be a core xarray team member, or perhaps find some way to pay them as a consultant.
  • Tricky to accurately scope - the duration of required work was tricky to estimate in advance, and we didn’t want to “just ship it”. We hold the xarray project to high standards and backwards compatibility promises, so we want to ensure that any publicly released features don’t compromise on quality.

This contributing model is more similar to how open-source software has historically been supported by industry. Perhaps because xarray is primarily developed and used by the scientific community, we tend to default to more grant-based funding models.

Overall we think this type of collaboration could work again in future! So if there is an xarray or xarray-adjacent feature your organisation would like to see, please reach out to us.

Go try out DataTree

Please try datatree out! The hierarchical structure is potentially useful to any xarray user who works with more than one dataset at a time. Simply do from xarray import DataTree, or call xarray.open_datatree(...) on a netCDF4 file or Zarr store containing multiple groups.

Be aware that, as xarray.DataTree is still new, there will likely be some bugs lurking or places where performance could be improved, as well as as-yet unimplemented features (as there always are)!

Thanks

A number of other people also contributed to datatree in various ways - particular shoutout to Alfonso Ladino and Etienne Schalk for their dedicated attendance at many of the weekly migration meetings!

Funding Acknowledgements

  • Owen, Eni, and Matt were able to contribute development time as part of the NASA ESDIS project.
  • Tom was supported first by the Gordon and Betty Moore Foundation as part of Ryan Abernathey’s Climate Data Science Lab at Columbia University, then later by various funders for a fraction of his time through [C]Worthy.