Provenance Tracking in Climate Science Data Processing Systems

Tuesday, March 4, 2008, 10:00am

325 ITE

Tags: data, provenance, scientific data, semantic web

NASA, NOAA, ESA and other organizations involved with climate research have captured huge archives of Earth observations. Over time, the sensors, the spacecraft, the science algorithms for transforming and analyzing the data, and the processing frameworks have all evolved. Tracking complete provenance information alongside the science data used in research and, ultimately, in policy decisions is a tremendously complicated problem. Data are stored in multiple archives across multiple agencies. Because the data volume is so large, previous generations of the data are often discarded in favor of newer versions, and systems often cannot reproduce data that were once provided to the public. Tracing the provenance of a product is generally a very manual process, since provenance information is stored in so many different ways (or not stored at all). It often involves reading science papers or calling up the researchers. In next-generation processing systems, data can be transformed in new ways by on-demand processing, resulting in transient data sets that are returned to a user or layered application but never archived. Our goal is complete scientific reproducibility of all data.

I will briefly present the general area and challenges of provenance tracking for science data processing systems and the requirements for scientific reproducibility. I will discuss some existing techniques and proposals, including metadata standards and the representation of provenance through standard ontologies on the semantic web.
