Petabytes of climate and water data now available

by Jingbo Wang (ANU)

New opportunities have arisen for data-intensive interdisciplinary science at scales and resolutions not hitherto possible.

The National Computational Infrastructure (NCI) at the Australian National University has organised a priority set of large-volume national environmental data assets on a High Performance Data node within a High Performance Computing facility. The node was established as a special node under the Australian Government’s National Collaborative Research Infrastructure Strategy Research Data Storage Infrastructure (RDSI) program.

By co-locating the vast data collections with high performance computing environments and harmonising these large valuable data assets, new opportunities have arisen for data-intensive interdisciplinary science at scales and resolutions not hitherto possible.

We manage more than 30 data collections, including many that are relevant to the water and climate research community. For more information about those collections, please visit our website. NCI will host over 10 Petabytes (PBytes) of data on Raijin, grouped into six categories related to the environmental sciences (see table below).

 

Data collections hosted at the NCI RDSI node. The fields are: 1) earth system sciences, climate and weather model data assets and products; 2) earth and marine observations and products; 3) geosciences; 4) terrestrial ecosystem; 5) water management and hydrology; and 6) astronomy, social science and biosciences.

| Field | Collection Name | TBytes |
|-------|-----------------|--------|
| 1 | Ocean General Circulation Model for the Earth Simulator | 27 |
| 1 | Year Of Tropical Convection (YOTC) Re-analysis | 81 |
| 1 | Community Atmosphere Biosphere Land Exchange (CABLE) Datasets | 24 |
| 1 | Coordinated Regional Climate Downscaling Experiment (CORDEX) | 57 |
| 1 | Coupled Model Inter-Comparison Project (CMIP5) | 2500 |
| 1 | Atmospheric Reanalysis Products | 168 |
| 1 | Australian Community Climate and Earth-System Simulator (ACCESS) | 3000 |
| 1 | Seasonal Climate Prediction | 595 |
| 2 | Australian Bathymetry and Elevation reference data | 113 |
| 2 | Australian Marine Video and Imagery Collection | 7 |
| 2 | Global Navigation Satellite System (GNSS) (Geodesy) | 5 |
| 2 | Digitised Australian Aerial Survey Photography | 74 |
| 2 | Earth Observation (Satellite: Landsat, etc) | 1486 |
| 2 | Satellite Imagery (NOAA/AVHRR, MODIS, VIIRS, AusCover) | 435 |
| 2 | Synthetic Aperture Radar | 29 |
| 2 | Remote and In-Situ Observations Products for Earth System Modelling | 366 |
| 2 | Ocean-Marine Collections | 220 |
| 3 | Australian 3D Geological Models and supporting data | 3 |
| 3 | Australian Geophysical Data Collection | 175 |
| 3 | Australian Natural Hazards Archive | 27 |
| 3 | National CT-Lab Tomographic Collection | 205 |
| 4 | ecosystem Modelling And Scaling faciliTy (eMAST) | 90 |
| 4 | Phenology Monitoring (Near Surface Remote Sensing) | 12 |
| 4 | eMAST Data Assimilation | 110 |
| 5 | Satellite Soil Moisture Products | 5 |
| 5 | Key Water Assets | 44 |
| 5 | Models of Land and Water Dynamics from Space | 22 |
| 6 | Skymapper (Astronomy) | 227 |
| 6 | Australian Data Archive (Social Sciences) | 4 |
| 6 | BioPlatforms Australia (BPA) Melanoma Dataset (Biosciences) | 175 |
| 6 | Plant Phenomics (Biosciences) | 10 |
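As a quick check on the 10 PByte figure, the volumes in the table can be totalled per field. The sketch below copies the TBytes column as listed and assumes decimal units (1 PByte = 1000 TBytes); the variable names are illustrative only.

```python
# TBytes per collection, copied from the table above and grouped by field number.
TBYTES_BY_FIELD = {
    1: [27, 81, 24, 57, 2500, 168, 3000, 595],
    2: [113, 7, 5, 74, 1486, 435, 29, 366, 220],
    3: [3, 175, 27, 205],
    4: [90, 12, 110],
    5: [5, 44, 22],
    6: [227, 4, 175, 10],
}

# Sum within each field, then across all fields.
field_totals = {field: sum(sizes) for field, sizes in TBYTES_BY_FIELD.items()}
total_tbytes = sum(field_totals.values())

# Assumption: decimal units, 1 PByte = 1000 TBytes.
total_pbytes = total_tbytes / 1000

for field, tbytes in field_totals.items():
    print(f"Field {field}: {tbytes} TBytes")
print(f"Total: {total_tbytes} TBytes (about {total_pbytes:.1f} PBytes)")
```

The listed collections add up to 10,296 TBytes, consistent with the figure of over 10 PBytes, with fields 1 and 2 accounting for most of the volume.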

 

 

For example, the Satellite Soil Moisture Products collection consolidates a wide range of satellite-derived soil moisture (SM) products, spanning approximately 30 years and a range of data providers, into a single repository for the Australian research community to access.

The Key Water Assets collection includes Murray-Darling Basin Plan Draft Report Rainfall and Runoff data, the Australian Water Resources Assessment system Landscape model (AWRA-L), and Bioregional Assessment data.

The Models of Land and Water Dynamics from Space collection (also known as Landcover25-Water, or Water Observations from Space, WOfS) is a 25-m resolution gridded dataset indicating areas where surface water has been observed using the Geoscience Australia (GA) Earth observation satellite data holdings. Version 1.5 of the LC25-Water product includes observations taken between 1987 and 2014 (inclusive) by the Landsat 5 and 7 satellites. LC25-Water covers all of mainland Australia and Tasmania but excludes offshore territories.

How is the NCI organised?

NCI operates as a formal partnership between the ANU and three of the major Australian national scientific agencies: CSIRO, the Bureau of Meteorology (BoM) and Geoscience Australia. These agencies are also the custodians of many of the large-volume national scientific data records. The data from these national agencies and from collaborating overseas organisations are either replicated to or produced at NCI, and in many cases processed into higher-level data products. Model data from computational workflows at NCI are also captured and released as modelling products. NCI then manages both the data services and the computational environments, known as Virtual Laboratories, needed to use that data effectively and efficiently.

Our Data Management Plan (DMP) is compatible with the ISO 19115 metadata standard. We use ISO standards to make sure our metadata is transferable and interoperable for sharing and harvesting. The DMP is used, along with metadata from the data itself, to create a hierarchy of data collection, dataset and time series catalogues that is then exposed through GeoNetwork for standard discoverability. The catalogues in this hierarchy are linked using a parent-child relationship. The hierarchical structure of our GeoNetwork catalogue system aims to address both discoverability and in-house administrative use cases. By standardising our metadata structure across the entire data corpus, we are laying the foundation for applying appropriate semantic mechanisms to enhance discovery and analysis of NCI’s national environmental research data.

At NCI, we are currently improving metadata interoperability in our catalogue by linking with standardised community vocabulary services. These emerging vocabulary services are being established to help harmonise data from different national and international scientific communities. One such vocabulary service is currently being established by the Australian National Data Service (ANDS). We expect that this will further encourage data sharing and reuse within the community, increasing the value of the data beyond its current use.
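The parent-child linking between collection, dataset and time series catalogues can be sketched as a simple data structure. This is only an illustration: the class, the field names and the example records are hypothetical, and the `parent_identifier` field is modelled loosely on the parent identifier element in ISO 19115 metadata. How NCI's GeoNetwork catalogues actually encode the link is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CatalogueRecord:
    identifier: str                    # hypothetical record identifier
    title: str
    level: str                         # "collection", "dataset" or "time series"
    parent_identifier: Optional[str]   # None for a top-level collection


def lineage(records: Dict[str, CatalogueRecord], identifier: str) -> List[str]:
    """Return titles from the top-level collection down to the given record."""
    titles: List[str] = []
    current: Optional[str] = identifier
    # Follow parent links upwards until a record with no parent is reached.
    while current is not None:
        record = records[current]
        titles.append(record.title)
        current = record.parent_identifier
    return list(reversed(titles))


# Illustrative records only: identifiers and the dataset/time series titles are made up.
records = {
    r.identifier: r
    for r in [
        CatalogueRecord("col-1", "Key Water Assets", "collection", None),
        CatalogueRecord("ds-1", "Example dataset", "dataset", "col-1"),
        CatalogueRecord("ts-1", "Example time series", "time series", "ds-1"),
    ]
}

print(" > ".join(lineage(records, "ts-1")))
```

Storing only a pointer to the parent keeps each record self-contained while still letting a catalogue walk from any time series up to its collection, which serves both discovery and administrative listing.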

Data citation is another important aspect of the NCI data infrastructure, allowing the data producer, contributor and publisher to be acknowledged and credited. NCI can provide Digital Object Identifier (DOI) minting services with support from ANDS’s DataCite partnership agreement. We endeavour to track data usage and infrastructure investment, encourage data sharing, and increase trust in research that relies on these data collections. We also incorporate the standard vocabularies into the data citation metadata so that the data citation is machine readable and also semantically friendly for web-search purposes. To facilitate effective and efficient use of large-volume data collections and to support their use in high performance computing environments, attention must be paid to the whole data workflow from creation to publication, including data management plans, provenance capture, and unique identification of the data through DOIs and other forms of data discovery and access.
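To illustrate what a citation that is both human readable and machine readable might look like, the sketch below formats one record two ways. Everything in it is an assumption for illustration: the class and field names are hypothetical, the text layout loosely follows the "Creator (Year): Title. Publisher. Identifier" pattern commonly used for DataCite DOIs, and the creator, year, title, publisher and DOI are placeholders, not real NCI records.

```python
from dataclasses import dataclass, asdict


@dataclass
class DataCitation:
    creator: str
    year: int
    title: str
    publisher: str
    doi: str  # bare DOI, without a resolver prefix

    def doi_url(self) -> str:
        """Resolvable link for the DOI via the doi.org resolver."""
        return f"https://doi.org/{self.doi}"

    def as_text(self) -> str:
        """Human-readable form: Creator (Year): Title. Publisher. DOI link."""
        return (
            f"{self.creator} ({self.year}): {self.title}. "
            f"{self.publisher}. {self.doi_url()}"
        )

    def as_record(self) -> dict:
        """Machine-readable form, suitable for serialising (for example to JSON)."""
        record = asdict(self)
        record["url"] = self.doi_url()
        return record


# Placeholder values only; this DOI does not identify a real dataset.
citation = DataCitation(
    creator="Example Producer",
    year=2015,
    title="Example Data Collection",
    publisher="Example Publisher",
    doi="10.0000/example",
)

print(citation.as_text())
print(citation.as_record())
```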

 

Please contact us at datacollections@nci.org.au if you have any questions.

 

 
