Maybe we should think of data provenance as being like a recipe. A recipe for preparing food is more than just a list of ingredients: it specifies, often in great detail, how the ingredients are combined, cooked, and served, and even names the cooking implements and their settings.
Curt Tilmes presented his PhD dissertation proposal yesterday on “Provenance Tracking in Science Data Processing Systems”. Curt works at the NASA Goddard Space Flight Center, where he is responsible for managing the processing of Earth science climate research data. He has some very good ideas about how to capture all of the relevant provenance for sophisticated scientific datasets. He’s using, of course, the Semantic Web languages (RDF and OWL) to express and share the provenance data.
Part of the problem is that you have to capture not just the inputs to a dataset but also how those inputs were processed to produce it, including (ideally) the algorithms, software and hardware. As an easily grasped example, he referred to a recent post by Ray Pierre on the RealClimate blog, How to cook a graph in three easy lessons. The post shows how Roy Spencer processes inputs from two common climate datasets (the Southern Oscillation and Pacific Decadal Oscillation indexes) to get results supporting the conclusion that global warming is due to natural causes rather than human activity.
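To make the idea concrete, here is a minimal sketch of what such a provenance record might look like as RDF-style triples. This is not Tilmes’s actual vocabulary or ontology — the URIs and predicate names below are hypothetical, loosely modeled on common provenance vocabularies, and plain Python tuples stand in for a real RDF library:

```python
# Hypothetical sketch: a dataset's provenance as (subject, predicate, object)
# triples, capturing the inputs, the processing step, and the software and
# hardware involved. All names here are made up for illustration.

def provenance_triples(dataset, inputs, processing_run, software, hardware):
    """Record which inputs a dataset used and how it was produced."""
    triples = [(dataset, "prov:wasGeneratedBy", processing_run)]
    triples += [(dataset, "prov:usedInput", i) for i in inputs]
    triples.append((processing_run, "ex:ranSoftware", software))
    triples.append((processing_run, "ex:ranOnHardware", hardware))
    return triples

# Example: the graph from the RealClimate post, derived from two index datasets.
triples = provenance_triples(
    dataset="urn:example:spencer-graph",
    inputs=["urn:example:soi-index", "urn:example:pdo-index"],
    processing_run="urn:example:regression-run-1",
    software="curve-fit script v1.2",
    hardware="desktop workstation",
)
for s, p, o in triples:
    print(s, p, o)
```

The point of the recipe analogy survives even in this toy form: the record names not only the ingredient datasets but the processing step that combined them, so a reader can ask exactly how the graph was cooked.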