Provenance Tracking in Climate Science Data Processing Systems
by Curt Tilmes
Tuesday, March 4, 2008, 10:00am
325 ITE
NASA, NOAA, ESA and other organizations involved with climate research
have captured huge archives of Earth observations. Over time, the
sensors, spacecraft, science algorithms for transforming and analyzing
the data, and the processing frameworks have all evolved. Tracking
complete provenance information alongside the science data used
in research and, ultimately, policy decisions is a tremendously
complicated problem. Data are stored in multiple archives across
multiple agencies. Since the data volume is so large, previous
generations of the data are often discarded in favor of newer
versions. Systems often aren't capable of reproducing data that were
once provided to the public. Tracing the provenance of a product is
generally a very manual process, since that information is stored in so
many different ways (or not stored at all). It often involves reading
science papers or calling up the researchers. In next-generation
processing systems, data can be transformed by on-demand processing in
new ways, resulting in transient data sets that are returned to a user
or layered application but not archived at all. Our goal is complete
scientific reproducibility of all data.
I will briefly present the general area and challenges of provenance tracking for science data processing systems and the requirements for scientific reproducibility. I will discuss some existing techniques and proposals, including metadata standards and the representation of provenance through standard ontologies on the Semantic Web.