infochimps Amazon Machine Image for data analysis and viz

February 14th, 2009

Infochimps has registered a community image for Amazon’s Elastic Compute Cloud (EC2) designed for data processing, analysis, and visualization. Great idea!

Doing experimental computer science research requires the right infrastructure — hardware, bandwidth, software environments and data — and tackling some interesting problems requires a lot of it. Cloud computing services such as EC2 are a great boon to researchers who aren’t part of a well-equipped lab already set up to support the kind of research they want to do.

EC2 allows users to instantiate a virtual computer from a saved image, called an Amazon Machine Image, or AMI. Users can configure a system with the operating system, software packages, and pre-loaded data they want and then save it as a shared community AMI, making it available to others.

The initial announcement, Hacking through the Amazon with a shiny new MachetEC2, says

“MachetEC2 is an effort by a group of Infochimps to create an AMI for data processing, analysis, and visualization. If you create an instance of MachetEC2, you’ll have an environment with tools designed for working with data ready to go. You can load in your own data, grab one of our datasets, or try grabbing the data from one of Amazon’s Public Data Sets. No matter what, you’ll be hacking in minutes.

We’re taking suggestions for what software the community would be most interested in having installed on the image … When we feel that the AMI is getting too bloated, we’ll split it up: MachetEC2-ML (machine learning), MachetEC2-viz, MachetEC2-lang, MachetEC2-bio, etc.”

And a second post gave some more details:

“When you SSH into an instance of machetEC2 (brief instructions after the jump), check the README files: they describe what’s installed, how to deal with volumes and Amazon Public Datasets, and how to use X11-based applications. You can also visit the machetEC2 GitHub page to see the full list of packages installed, the list of gems, and the list of programs installed from source.

To launch an instance of machetEC2, log into the AWS Console, click “AMIs”, search for “machetEC2” or ami-29ef0840, and click “Launch”. If you’re on the command-line, simply run

    $ ec2-run-instances ami-29ef0840 -k [your-keypair-name]

By the time you’ve grabbed some coffee, you’ll be able to access an EC2 instance with all the tools you need for working with data already installed, configured, and ready to hack.”

This is a valuable contribution to the data wrangling community and to the larger research community as an example of what can be done. I can imagine similar community AMIs to support research on the Semantic Web, social network analysis, game development, or multi-agent systems.

Pew: 11% of online adults twitter and/or update status

February 13th, 2009

Amanda Lenhart and Susannah Fox of the Pew Internet Project have a six-page note on Twitter and status updating based on a survey of nearly 2300 adults in November and December of 2008. The findings are not very surprising and include:

  • 11% of online adults use Twitter or update their status online
  • Twitter users are mobile, less tethered by technology
  • Younger internet users lead the way in using Twitter and similar services.

To me, the most notable item is one that suggests a very recent and sharp increase in short, personal status updates.

“As of December 2008, 11% of online American adults said they used a service like Twitter or another service that allowed them to share updates about themselves or to see the updates of others. Just a few weeks earlier, in November 2008, 9% of internet users used Twitter or updated their status online and in May of 2008, 6% of internet users responded yes to a slightly different question, where users were asked if they used ‘Twitter or another microblogging service to share updates about themselves or to see updates about others.’”

I’d guess that this partly represents more people joining Facebook, creating a tipping point in the use of its status update feature.

Yahoo! adds RDF support to SearchMonkey and BOSS

February 12th, 2009

This could be a big step toward the “web of data” vision of the Semantic Web.

Yahoo announced (Accessing Structured Data using BOSS) that their BOSS (Build your Own Search Service) platform will now support structured data, including RDF.

“Yahoo! Search BOSS provides access to structured data acquired through SearchMonkey. Currently, we are only exposing data that has been semantically marked up and subsequently acquired by the Yahoo! Web Crawler. In the near future, we will also expose structured data shared with us in SearchMonkey data feeds. In both cases, we will respect site owner requests to opt-out of structured data sharing through BOSS.”

Yahoo’s BOSS to support RDF data

Here’s how it works:

  • Sites use microformats or RDF (encoded using RDFa or eRDF) to add structured data to their pages
  • Yahoo’s web crawler encounters embedded markup and indexes the structured data along with the unstructured text
  • A BOSS developer specifies “view=searchmonkey_rdf” or “view=searchmonkey_feed” in API requests
  • BOSS’s response returns the structured data via either XML or JSON

Yahoo’s SearchMonkey only acquires structured data using certain microformats or RDF vocabularies. The microformats supported are hAtom, hCalendar, hCard, hReview, XFN, Geo, rel-tag and adr. RDF vocabularies handled include Dublin Core, FOAF, SIOC, and “other supported vocabularies”. See the appendix on vocabularies in Yahoo’s SearchMonkey Guide for a full list and more information.
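As a rough sketch of what the developer side looks like, here is how one might build a BOSS v1 request URL carrying the “view=searchmonkey_rdf” parameter described above. The endpoint path and parameter names follow my reading of the BOSS v1 documentation and should be double-checked; the app id is a placeholder.

```python
from urllib.parse import urlencode, quote

def boss_request_url(query, app_id, view="searchmonkey_rdf"):
    """Build a Yahoo! BOSS v1 web-search URL asking for
    SearchMonkey structured data via the `view` parameter."""
    base = "http://boss.yahooapis.com/ysearch/web/v1/"
    params = urlencode({"appid": app_id, "format": "json", "view": view})
    return base + quote(query) + "?" + params

# A hypothetical request; substitute your own BOSS application id.
url = boss_request_url("semantic web", "YOUR_APP_ID")
```

The JSON (or XML) response would then include the structured fields SearchMonkey extracted for each result, alongside the usual title, abstract, and URL.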

A post on the Yahoo search blog also talks about this and other changes to the BOSS service and includes a nice example of the use of structured data encoded using microformats from President Obama’s LinkedIn page.

microformatted data on President Obama’s LinkedIn page

NSF and science increments survive stimulus conference

February 12th, 2009

Stimulus funding for research and science has done well in the version of the American Economic Recovery and Reinvestment Act coming out of conference. The conference report overview identifies a category that will:

“Transform our Economy with Science and Technology: To secure America’s role as a world leader in a competitive global economy, we are renewing America’s investments in basic research and development, in training students for an innovation economy, and in deploying new technologies into the marketplace. This will help businesses in every community succeed in a global economy.”

The CRA policy blog has the details in House Numbers for Science Prevail in Stimulus Conference. Highlights of the $15B+ to be invested in scientific research include:

  • Provides $3 billion for the National Science Foundation, for basic research in fundamental science and engineering – which spurs discovery and innovation.
  • Provides $1.6 billion for the Department of Energy’s Office of Science, which funds research in such areas as climate science, biofuels, high-energy physics, nuclear physics and fusion energy sciences – areas crucial to our energy future.
  • Provides $400 million for the Advanced Research Projects Agency-Energy (ARPA-E) to support high-risk, high-payoff research into energy sources and energy efficiency in collaboration with industry.
  • Provides $580 million for the National Institute of Standards and Technology, including the Technology Innovation Program and the Manufacturing Extension Partnership.
  • Provides $8.5 billion for NIH, including expanding good jobs in biomedical research to study diseases such as Alzheimer’s, Parkinson’s, cancer, and heart disease.
  • Provides $1 billion for NASA, including $400 million to put more scientists to work doing climate change research.
  • Provides $1.5 billion for NIH to renovate university research facilities and help them compete for biomedical research grants.

Google starts Social Web Blog

February 10th, 2009

Ebiquity alumnus Harry Chen alerted us to Google’s new Social Web Blog, which describes itself as “news and updates about Google products that are helping to make the web more social”. In yesterday’s first post, Mendel Chuang, the product marketing manager for Google Friend Connect, says:

“We are launching this blog for anyone interested or involved in helping to make the web more social. Whether you own a site and want to add social features to increase community engagement, or you’re developing a great social application, this blog is for you.

We will write about social initiatives within Google, such as Google Friend Connect, as well as community efforts like OpenSocial. We plan to share some success stories, present tips and tricks, provide updates when there are new developments, and much more.”

Wikinvest offers the wisdom of the investing crowds

February 9th, 2009

Wikinvest is a free, community driven site that “wants to make investing easier by creating the world’s best source of investment information and investment tools”.

A story in today’s NYT, Offering Free Investment Advice by Anonymous Volunteers, says

“Following the model of Wikipedia, the online encyclopedia that anyone can edit, Wikinvest is building a database of user-generated investment information on popular stocks. A senior at Yale writes about the energy industry, for example, while a former stockbroker covers technology and a mother in Arizona tracks children’s retail chains.

Wikinvest, which recently licensed some content to the Web sites of USA Today and Forbes, seeks to be an alternative to Web portals that are little more than “a data dump” of income statements and government filings, said Parker Conrad, a co-founder.

Users annotate stock charts with notes explaining peaks and valleys, edit company profiles and opine about whether to buy or sell. The site is creating a wire service with articles from finance blogs and building a cheat sheet to guide readers through financial filings by defining terms and comparing a company’s performance to competitors’.”

After a quick look, the site does seem interesting. I may well be ready to trust the wisdom of the crowds over the platitudes of the pundits. The Microsoft article has a lot of useful data, lays out reasons both to buy and to sell, and lets registered members vote on whether they agree. Of course, I thought the reasons offered on both sides were valid — rather than simple propositions, their validity needs to be quantified.

For what it is worth, I note that the site is using MediaWiki. I wonder if there are unique opportunities to incorporate RDF and/or RDFa into such a site, perhaps encoding or annotating their WikiData.

A comment on modern life

February 9th, 2009

I thought this cartoon from a recent issue of the New Yorker offers an accurate comment on modern life.

We’ve got to try to coax him back into his enclosure

Senate plan: less stimulus for NSF, NIST, other science agencies

February 9th, 2009

The US Senate’s stimulus plan released at the end of last week has less money for US science agencies than the House plan from January, but the cuts were not as drastic as were feared. CRA reports in a post Senate Deal Protects Much of NSF Increase in Stimulus that

“The agreement does reduce the increase in the Department of Energy’s Office of Science by $100 million (so, +$330 million instead of +$430 million), and NIST’s increase would be reduced by $100 million (so +$495 million instead of +$595 million). But given the reports we were receiving as recently as yesterday evening about the possibility of no increase for the science agencies in the bill, this is a remarkable turn of events. The increase for NSF in the Senate bill will still be far less than the $3 billion called for in the House version of the bill, but NSF will be in far better shape in the conference between the two chambers coming in with $1.2 billion from the Senate instead of zero.”

Scientists and Engineers for America (a 501(c)(3) organization) has a detailed breakdown of the stimulus package that passed the Senate Friday in Senate-passed stimulus package by the numbers. They also have a downloadable Excel spreadsheet in case you want to crunch the data yourself. Here are some science highlights from their post:

NSF Research: $1.2 billion total for NSF including: $1 billion to help America compete globally; $150 million for scientific infrastructure; and $50 million for competitive grants to improve the quality of science, technology, engineering, and mathematics (STEM) education.

NASA: $1.3 billion total for NASA including: $450 million for Earth science missions to provide critical data about the Earth’s resources and climate; $200 million to enable research and testing of environmentally responsible aircraft and for verification and validation methods for complex aerospace systems and software; $450 million to reduce the gap in time that the U.S. does not have a vehicle to access the International Space Station; and $200 million for repair, upgrade and construction at NASA facilities.

NOAA: $1 billion total for NOAA, including $645 million to construct and repair NOAA facilities, equipment and vessels to reduce the Nation’s coastal charting backlog, upgrade supercomputer infrastructure for climate research, and restore critical habitat around the Nation.

NIST: $475 million total for NIST including: $307 million for renovation of NIST facilities and new laboratories using green technologies; $168 million for scientific and technical research at NIST to strengthen the agency’s IT infrastructure; provide additional NIST research fellowships; provide substantial funding for advanced research and measurement equipment and supplies; increase external grants for NIST-related research.

DOE: The Department of Energy’s Science program sees $330 million for laboratory infrastructure and construction.

Big (linked?) data

February 8th, 2009

The Data Evolution blog has an interesting post that asks Is Big Data at a tipping point? It suggests that we may be approaching a tipping point in which large amounts of online data will be interlinked and connected, suddenly producing a whole much larger than the sum of its parts.

“For the past several decades, an increasing number of business processes – from sales, customer service, shipping – have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action — sales lead, mouse click, and shipping update — is stored. The result: organizations are overwhelmed by what feels like a tsunami of data. The same trend is occurring in the larger universe of data that these organizations inhabit. Big Data unleashed by the “Industrial Revolution of Data”, whether from public agencies, non-profit institutes, or forward-thinking private firms.”

I expected that the post would soon segue into a discussion of the Semantic Web and maybe even the increasingly popular linked data movement, but it did not. Even so, it sets up plenty of nails for which we have an excellent hammer in hand. I really like this iceberg analogy, by the way.

“At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because format and meta-data standards make it hard to flow from one place to another: comparing the SEC’s financial data with that of Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether reasons of competitiveness, privacy, or sheer incompetence it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts).”

The post also points out some sources of online data and analysis tools, some familiar and some new to me (or maybe just forgotten).

“Yet there’s a slow thaw underway as evidenced by a number of initiatives: Aaron Swartz’s, Flip Kromer’s infochimps, Carl Malamud’s, as well as Numbrary, Swivel, Freebase, and Amazon’s public data sets. These are all ambitious projects, but the challenge of weaving these data sets together is still greater.”

Hadoop user group for the Baltimore-DC region

February 8th, 2009

A Hadoop User Group (HUG) has formed for the Washington DC area, describing itself as follows:

“We’re a group of Hadoop & Cloud Computing technologists / enthusiasts / curious people who discuss emerging technologies, Hadoop & related software development (HBase, Hypertable, PIG, etc). Come learn from each other, meet nice people, have some food/drink.”

The group gives its geographic location as Columbia MD, and its first HUG meetup was held last Wednesday at the BWI Hampton Inn. In addition to informal social interaction, it featured two presentations:

  • Amir Youssefi from Yahoo! presented an overview of Hadoop and discussed Multi-Dataset Processing (Joins) using Hadoop and Hadoop Table. Amir is a member of the Cloud Computing and Data Infrastructure group at Yahoo!.
  • Scott Godwin & Bill Oley gave an introduction to complex, fault-tolerant data processing workflows using Cascading and Hadoop.
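For flavor: the standard Hadoop technique for joining datasets is the reduce-side join, in which each mapper tags records with their source table and emits them under the join key, and the shuffle then brings matching records together for the reducer. A minimal pure-Python sketch of that logic (illustrative only; the table names and data here are made up):

```python
from itertools import groupby
from operator import itemgetter

# Two "tables" keyed by user id, as a mapper would see them.
users  = [("u1", "alice"), ("u2", "bob")]
orders = [("u1", "book"), ("u1", "lamp"), ("u2", "pen")]

# Map phase: tag each record with its source and emit under the join key.
mapped = [(k, "U", v) for k, v in users] + [(k, "O", v) for k, v in orders]
mapped.sort(key=itemgetter(0, 1))  # stands in for Hadoop's shuffle/sort

# Reduce phase: records sharing a key arrive together; pair them up.
joined = []
for key, group in groupby(mapped, key=itemgetter(0)):
    rows = list(group)
    names = [v for _, tag, v in rows if tag == "U"]
    items = [v for _, tag, v in rows if tag == "O"]
    for name in names:
        for item in items:
            joined.append((key, name, item))
```

In real Hadoop the map and reduce steps run as separate distributed tasks, but the tagging-and-grouping idea is the same.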

If you’re in Maryland and interested, you can join the group and get announcements for future meetings. It might provide a good way to learn more about new software for exploiting computing clusters and cloud computing.

(Thanks to Chris Diehl for alerting me to this)

Tim Berners-Lee talks at TED 2009 on linked data

February 6th, 2009

Tim Berners-Lee gave a talk at the TED2009 conference on linked data — one of the newest and most interesting ideas to emerge from efforts to realize the Semantic Web vision.

Here’s a summary of Sir Tim’s talk from a GigaOm post, Highlights from TED: Tim Berners-Lee, Pattie Maes, Jacek Utko. I’m looking forward to being able to see his talk online soon.

“Founder of the web Tim Berners-Lee spoke of the next grassroots communication movement he wants to start: linked data. Much in the way his development of the web stemmed out of the frustrations of brilliant people working in silos, he is frustrated that the data of the world is shut apart in offline databases.

Berners-Lee wants raw data to come online so that it can be related to each other and applied together for multidisciplinary purposes, like combining genomics data and protein data to try to cure Alzheimer’s. He urged “raw data now,” and an end to “hugging your data” — i.e. keeping it private — until you can make a beautiful web site for it.

Berners-Lee said his dream is already on its way to becoming a reality, but that it will require a format for tagging data and understanding relationships between different pieces of it in order for a search to turn up something meaningful. Some current efforts are dbpedia, a project aimed at extracting structured information from Wikipedia, and OpenStreetMap, an editable map of the world. He really wants President Obama, who has promised to conduct government transparently online, to post linked data online.”

You can see the slides that TBL used on the W3C site.
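The “format for tagging data and understanding relationships” that Berners-Lee refers to is RDF, which represents facts as subject-predicate-object triples named by URIs, so that independently published datasets can be merged and linked. A toy illustration (the FOAF and dbpedia URI prefixes follow those projects’ conventions; the facts themselves are made up for the example):

```python
# Linked data in miniature: facts as (subject, predicate, object) triples.
DBP = "http://dbpedia.org/resource/"
FOAF = "http://xmlns.com/foaf/0.1/"

graph = {
    (DBP + "Tim_Berners-Lee", FOAF + "name", "Tim Berners-Lee"),
    (DBP + "Tim_Berners-Lee", FOAF + "knows", DBP + "Vint_Cerf"),
    (DBP + "Vint_Cerf", FOAF + "name", "Vint Cerf"),
}

def objects(g, subject, predicate):
    """All objects of triples matching (subject, predicate, _)."""
    return {o for s, p, o in g if s == subject and p == predicate}

names = objects(graph, DBP + "Tim_Berners-Lee", FOAF + "name")
```

Because the subjects and objects are global URIs, a second graph published elsewhere that mentions the same resources can simply be unioned in, which is the whole point of linked data.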

At the tone, the time will be 1234567890

February 6th, 2009

Next week we will see another epochal event, like Y2K and 2012. Actually, we hope it will be more like Y2K and less like the predicted events on 21 December 2012, which some think will bring about the end of the world as we know it.

At 23:31:30 GMT on Friday 13 February 2009 the Unix time will be 1234567890.

This is, of course, the number of seconds since midnight GMT on 1 January 1970 (not counting leap seconds). To keep track of the event, you can use the epoch clock or just ask your local Unix system: date '+%s'.
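You can check the arithmetic with a few lines of Python:

```python
from datetime import datetime, timezone

# 1,234,567,890 seconds after the Unix epoch (midnight UTC, 1 January 1970)
moment = datetime.fromtimestamp(1234567890, tz=timezone.utc)
# moment is 23:31:30 UTC on Friday 13 February 2009

# And, going the other way, the current Unix time, like `date '+%s'`:
now = int(datetime.now(tz=timezone.utc).timestamp())
```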