Get your geek on: handling data for ensemble forecasting

There’s something about discussions of data handling that’s particularly soporific – but don’t nod off yet!

Most hydrologists are trained to work on individual catchments and we often opt for simple conceptual models. In the pre-ensemble era, we were often quite happy to use unsophisticated ways of crunching numbers: many of us can remember (perhaps quite recently!) using desktop computers to tune and run models, storing data in text files, and so on.

Maybe it’s because it’s obvious, but it’s little remarked that switching from deterministic forecasts to ensembles means handling much more data. Here at CSIRO we tend to use 1000-member ensembles, and our partners at the Bureau of Meteorology use a method that generates 6000 (!) ensemble members for each forecast. If you’re running cross-validation experiments across multiple catchments this can lead to migraine-like data headaches.
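To get a feel for how quickly this adds up, here is a rough back-of-envelope sketch. Only the member counts (1, 1000 and 6000) come from the text above; the number of catchments, issue dates, lead steps and variables, and the choice of 32-bit floats, are made-up assumptions for illustration, and the function name is hypothetical.

```python
def hindcast_bytes(members, catchments, issue_dates, lead_steps,
                   variables=2, bytes_per_value=4):
    """Uncompressed size in bytes of a hindcast archive of fixed-width values."""
    return (members * catchments * issue_dates * lead_steps
            * variables * bytes_per_value)


# Only the member counts come from the text; every other number is an assumption.
experiment = dict(catchments=20, issue_dates=3650, lead_steps=240)

for members in (1, 1000, 6000):
    size_gb = hindcast_bytes(members, **experiment) / 1e9
    print(f"{members:>5} members: {size_gb:,.1f} GB")
```

The point is simply that storage scales linearly with ensemble size, so a dataset that fits comfortably on a laptop as a deterministic hindcast can become hundreds of gigabytes as an ensemble, before any compression.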

In a recent experiment for 22 catchments we generated over 2 TB of rainfall and streamflow hindcasts. Of course generating the hindcasts is only one step in the process – verifying them with a bunch of different tests and generating a load of plots can be even more time-consuming. It’s simply not feasible to run experiments like this without getting your geek on [1], putting on your “big data” cap (backwards of course) and taking advantage of the awesome power of computer science.

Computer power and data storage

We first began developing a national seasonal forecasting service about 8 years ago. While seasonal forecasting is far less computationally intensive than, say, daily forecasting, we still hit the limits of what could be achieved with desktop computers. So we farmed out jobs to HTCondor, a system that scavenges unused processing power from the many desktop computers at CSIRO. More recently, we have been writing software for short- to medium-term forecasting applications that takes advantage of parallelisation on CSIRO’s high-performance supercomputers.

Data storage is another crucial issue. A few years ago we got sick of storing and exchanging thousands of voluminous text files of differing formats and unknown provenance and repute, and followed the climate community on the path to data righteousness: netCDF.

netCDF is a self-describing binary file format that is purpose-designed for multi-dimensional data. It’s traditionally been used by ocean and climate modellers, and has a very well described, yet adaptable, set of conventions. The cherry-on-top is that netCDF allows serious data compression, so that corpulent ensemble hindcast data can be squished into slender binary files for storage. We developed our own netCDF specification, including a lead-time dimension (see the figure below), that allows us to store many ensemble hindcasts/forecasts at different locations in each file, leading to massive speed-ups in forecast verification.


Schematic of how we use the lead-time dimension to store multiple forecasts in a single netCDF file. The lead-time dimension is defined in relation to the time dimension, and means we can store many hindcasts in a single file. The files also handle large ensembles, multiple locations and multiple variables.
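The lead-time idea can be sketched in a few lines of plain Python. This is only an illustration of the indexing, not our actual specification: the coordinate values, the names `issue_times` and `lead_hours`, and both helper functions are assumptions made up for the example. Each stored value is addressed by an issue-time index and a lead-time index, and the date it applies to is the issue time plus the lead time.

```python
from datetime import datetime, timedelta

# Illustrative coordinate values; names and sizes are assumptions, not the
# real specification.
issue_times = [datetime(2016, 5, 1) + timedelta(days=d) for d in range(3)]  # "time" dimension
lead_hours = [24, 48, 72]                                                   # "lead-time" dimension


def valid_time(time_index, lead_index):
    """Return the date a stored value applies to: issue time plus lead time."""
    return issue_times[time_index] + timedelta(hours=lead_hours[lead_index])


def forecasts_valid_at(target):
    """List every (time_index, lead_index) pair whose value applies to `target`."""
    return [
        (t, k)
        for t in range(len(issue_times))
        for k in range(len(lead_hours))
        if valid_time(t, k) == target
    ]


# All forecasts in the file that can be verified against 4 May 2016.
print(forecasts_valid_at(datetime(2016, 5, 4)))
```

Because every hindcast that applies to a given date can be found by index arithmetic within one file, matching forecasts to observations needs no trawling through thousands of separate files.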


A huge benefit of using a standardised, well supported and self-describing binary file format is that it makes sharing data incredibly easy. At first, we found this beneficial within our small research group: even though there are only about 10 of us, we each manage to (strongly!) prefer different scripting languages – R, MATLAB, Python. All these scripting languages have strong support for netCDF files so it’s very easy to load and manipulate data. The metadata stored inside the netCDF files makes reviewing old experiments much easier – we find we don’t have to retrace our steps as much or (gulp!) regenerate hindcasts.

We have since had very good experiences with sharing our netCDF files with collaborators outside our group, and the Bureau of Meteorology has adopted our netCDF specification for operational forecasting. On the downside, binary data formats can be restrictive: netCDF is not readable with simple text editors, and this can occlude data from forecast users or collaborators who don’t have the time or inclination to learn scripting languages or other new software.

Of course, there are many other aspects to the issue of data in ensemble forecasting that we can’t cover in a short blog – we haven’t even touched on algorithm efficiency – and so we’d like to hear your data stories:

  • Have you faced similar problems with verifying ensemble forecasts?
  • Do you prefer other ways of storing data, like HDF5 or databases?
  • Do you have different or better ways of crunching and storing your data?

Tell us in the comments!

[1] Who are we kidding – we were geeks already. Shout out to those who saw the tribute to Missy ‘Misdemeanor’ Elliott in the phrase ‘get your geek on’: there’s a fair chance that you may be even geekier than us. (If not, then go on and treat yourself.)


Original article posted on HEPEX website, 10th May 2016 (Link)


  • James and team,
    Great article! Ensembles are a terrific way to get quantitative information about forecast uncertainty, and the major national weather centres are moving away from deterministic modeling and relying more on ensemble modeling. There’s a huge scope for applications of ensemble outputs, not just in water but also fire, wind hazards, renewable energy, storm surge, and many more. Users are starting to appreciate the benefits of ensemble predictions, and it will be interesting to learn how best to share our data using standard formats that we have become comfortable with but users might not be (yet?).

  • Thanks Beth! I agree – I think there’s a bit of work for us to do to help users take best advantage of the new ensemble forecasting products that are becoming available, and this includes the handling of data formats that may be unfamiliar to users.
