Big (linked?) data

February 8th, 2009

The Data Evolution blog has an interesting post that asks Is Big Data at a tipping point?. It’s suggests that we may be approaching a tipping point in which large amounts of online data will be interlinked and connected to suddenly produce a whole much larger than the parts.

“For the past several decades, an increasing number of business processes– from sales, customer service, shipping – have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action — sales lead, mouse click, and shipping update — is stored. The result: organizations are overwhelmed by what feels like a tsunami of data. The same trend is occurring in the larger universe of data that these organizations inhabit. Big Data unleashed by the “Industrial Revolution of Data”, whether from public agencies, non-profit institutes, or forward-thinking private firms.”

I expected that the post would soon segue into a discussion of the Semantic Web and maybe even the increasingly popular linked data movement, but it did not. Even so, it sets up plenty of nails for which we have a an excellent hammer in hand. I really like this iceberg analogy, by the way.

“At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because format and meta-data standards make it hard to flow from one place to another: comparing the SEC’s financial data with that of Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether reasons of competitiveness, privacy, or sheer incompetence it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts).”

The post also points out some sources of online data and analysis tools, some familiar and some new to me (or maybe just forgotten.)

“Yet there’s a slow thaw underway as evidenced by a number of initiatives: Aaron Swartz’s, Flip Kromer’s infochimps, Carl Malamud’s, as well as Numbrary, Swivel, Freebase, and Amazon’s public data sets. These are all ambitious projects, but the challenge of weaving these data sets together is still greater.”