Tutorial: Hadoop on Windows with Eclipse

April 9th, 2009

Hadoop has become one of the most popular frameworks to exploit parallelism on a computing cluster. You don’t actually need access to a cluster to try Hadoop, learn how to use it, and develop code to solve your own problems.

UMBC Ph.D student Vlad Korolev has written an excellent tutorial, Hadoop on Windows with Eclipse, showing how to install and use Hadoop on a single computer running Microsoft Windows. It also covers the Eclipse Hadoop plugin, which enables you to create and run Hadoop projects from Eclipse. In addition to step by step instructions, the tutorial has short videos documenting the process.

If you want to explore Hadoop and are comfortable developing Java programs in Eclipse on a Windows box, this tutorial will get you going. Once you have mastered Hadoop and had developed your first project using it, you can go about finding a cluster to run it on.

Map reduce on heterogeneous multicore clusters

April 7th, 2009

In tomorrow’s ebiquity meeting (10 am EDT Wed, April 8), PhD student David Chapman will talk about his work on Map Reduce on Heterogeneous Multi-Core Clusters. From the abstract:

“We have extended the Map Reduce programming paradigm to clusters with multicore accelerators. Map Reduce is a simple programming programming model designed for parallel computations with large distributed datasets. Google has reinforced the practical effectiveness of this approach with over 1000 commercial Map Reduce applications. Typical Map Reduce implementations, such as Apache Hadoop exploit parallel file systems for use in homogeneous clusters. Unfortunately, the multicore accelerators such as Cell B.E. used in modern supercomputers such as Roadrunner require additional layers of parallelism, which cannot be addressed from parallel file systems alone. Related work has explored Map Reduce on a single Cell B.E. accelerator machine using hash and sort based techniques. We are incorporating techniques from Apache Hadoop as well as early multicore Map Reduce research to produce an implementation optimized for a hybrid multicore cluster. We are evaluating our implementation on a cluster of 24 of Cell Q series nodes, and and 48 multicore PowerPC J series nodes at the UMBC Multicore Computational Center.”

We will stream the talk live and share the raw recording.

Cloudera offers a simpler Hadoop distribution

March 18th, 2009

We are early in the era of big data (including social and/or semantic) and more and more of us need the tools to handle it. Monday’s NYT had a story, Hadoop, a Free Software Program, Finds Uses Beyond Search, on Hadoop and Cloudera, a new startup that offering its own Hadoop distribution that is designed to beasier to install and configure.

“In the span of just a couple of years, Hadoop, a free software program named after a toy elephant, has taken over some of the world’s biggest Web sites. It controls the top search engines and determines the ads displayed next to the results. It decides what people see on Yahoo’s homepage and finds long-lost friends on Facebook.”

Three top engineers from Google, Yahoo and Facebook, along with a former executive from Oracle, are betting it will. They announced a start-up Monday called Cloudera, based in Burlingame, Calif., that will try to bring Hadoop’s capabilities to industries as far afield as genomics, retailing and finance. The company has just released its own version of Hadoop. The software remains free, but Cloudera hopes to make money selling support and consulting services for the software. It has only a few customers, but it wants to attract biotech, oil and gas, retail and insurance customers to the idea of making more out of their information for less.

Cloudera’s distribution, curently based on Hadoop v0.18.3, uses RPM and comes with a Web-based configuration aide. The company also offers some free basic training in mapReduce concepts, using Hadoop, developing appropriate algorithms and using Hive.

infochimps Amazon Machine Image for data analysis and viz

February 14th, 2009

Infochimps has registered a community image for Amazon’s Elastic Compute Cloud (EC2) designed for data processing, analysis, and visualization. Great idea!

Doing experimental computer science research requires the right infrastructure — hardware, bandwidth, software environments and data — and tacking some interesting problems requires a lot. Cloud computing services, such as EC2, are a great boon to researchers who aren’t part of a well equipped lab already set up to support just the kind of research you want to do.

EC2 allows users to instantiate a virtual computer from a saved image, called an Amazon Machine Image, or AMI. Users can configure a system with the with the operating system, software packages, and pre-loaded data they want and then save it as a shared community AMI, making it available to others.

The initial announcement, Hacking through the Amazon with a shiny new MachetEC2, says

“MachetEC2 is an effort by a group of Infochimps to create an AMI for data processing, analysis, and visualization. If you create an instance of MachetEC2, you’ll be have an environment with tools designed for working with data ready to go. You can load in your own data, grab one of our datasets, or try grabbing the data from one of Amazon’s Public Data Sets. No matter what, you’ll be hacking in minutes.

We’re taking suggestions for what software the community would be most interested in having installed on the image … When we feel that the AMI is getting too bloated, we’ll split it up: MachetEC2-ML (machine learning), MachetEC2-viz, MachetEC2-lang, MachetEC2-bio, etc.”

And a second post gave some more details:

“When you SSH into an instance of machetEC2 (brief instructions after the jump), check the README files: they describe what’s installed, how to deal with volumes and Amazon Public Datasets, and how to use X11-based applications. You can also visit the the machetEC2 GitHub page to see the full list of packages installed, the list of gems, and the list of programs installed from source.

To launch an instance of machetEC2, log into the AWS Console, click “AMIs”, search for “machetEC2″ or ami-29ef0840, and click “Launch”. If you’re on the command-line, simply run

    $ ec2-run-instances ami-29ef0840 -k [your-keypair-name]

By the time you’ve grabbed some coffee, you’ll be able to access an EC2 instance with all the tools you need for working with data already installed, configured, and ready to hack.”

This is a valuable contribution to the data wrangling community and to the larger research community as an example of what can be done. I can imagine similar community AMIs to support research on the Semantic Web, social network analyss, game development or multi-agent systems.

Hadoop user group for the Baltimore-DC region

February 8th, 2009

A Hadoop User Group (HUG) has formed for the Washington DC area via meetup.com.

“We’re a group of Hadoop & Cloud Computing technologists / enthusiasts / curious people who discuss emerging technologies, Hadoop & related software development (HBase, Hypertable, PIG, etc). Come learn from each other, meet nice people, have some food/drink.”

The group defines it’s geographic location as Columbia MD and their first HUG meetup was held last Wednesday at the BWI Hampton Inn. In addition to informal social interactions, it featured two presentations:

  • Amir Youssefi from Yahoo! presented an overview of Hadoop. Amir is a member of the Cloud Computing and Data Infrastructure group at Yahoo!, and will be discussing Multi-Dataset Processing (Joins) using Hadoop and Hadoop Table.
  • Introduction to complex, fault tolerant data processing workflows using Cascading and Hadoop by Scott Godwin & Bill Oley

If you’re in Maryland and interested you can join the group at meetup.com and get announcements for future meetings. It might provide a good way to learn more about new software to exploit computing clusters and cloud computing.

(Thanks to Chris Diehl for alerting me to this)

octo.py: quick and easy MapReduce for Python

January 2nd, 2009

The amount of free, interesting, and useful data is growing explosively. Luckily, computer are getting cheaper as we speak, they are all connected with a robust communication infrastructure, and software for analyzing data is better than ever. That’s why everyone is interested in easy to use frameworks like MapReduce for every-day programmers to run their data crunching in parallel.

octo.py is a very simple MapReduce like system inspired by Ruby’s Starfish.

Octo.py doesn’t aim to meet all your distributed computing needs, but its simple approach is amendable to a large proportion of parallelizable tasks. If your code has a for-loop, there’s a good chance that you can make it distributed with just a few small changes. If you’re already using Python’s map() and reduce() functions, the changes needed are trivial!”

triangular.py is the simple example given in the documentation that is used with octo.py to compute the first 100 triangular numbers.

# triangular.py compute first 100 triangular numbers. Do
# 'octo.py server triangular.py' on server with address IP
# and 'octo.py client IP' on each client. Server uses source
# & final, sends tasks to clients, integrates results. Clients
# get tasks from server, use mapfn & reducefn, return results.

source = dict(zip(range(100), range(100)))

def final(key, value):
    print key, value

def mapfn(key, value):
    for i in range(value + 1):
        yield key, i

def reducefn(key, value):
    return sum(value)

Put octo.py on all of the machines you want to use. On the machine you will use as a server (with ip address <ip>), also install triangular.py, and then execute:

     python octo.py server triangular.py &

On each of your clients, run

     python octo.py client <ip> &

You can try this out using the same machine to run the server process and one or more client processes, of course.

When the clients register with the server, they will get a copy of triangular.py and wait for tasks from the server. The server access the data from source and distributed tasks to the clients. These in turn use mapfn and reducefn to complete the tasks, returning the results. The server integrates these and, when all have completed, invokes final, which in this case just prints the answers, and halts. The clients continue to run, waiting for more tasks to do.

Octo.py is not a replacement for more sophisticated frameworks like Hadoop or Disco but if you are working in Python, its KISS approach is a good way to get started with the MapReduce paradigm and might be all you need for a small projects.

(Note: The package has not been updated since April 2008, so it’s status is not clear. But further development would run the risk of making it more complex and would be self-defeating.)

WWGD: Understanding Google’s Technology Stack

December 24th, 2008

It’s popular to ask “What Would Google Do” these days — The Google reports over 7,000 results for the phrase. Of course, it’s not just about Google, which we all use as the archetype for a new Web way of building and thinking about information systems. Asking WWGD can be productive, but only if we know how to implement and exploit the insights the answer gives us. This in turn requires us (well, some of us, anyway) to understand the algorithms, techniques, and software technology that Google and other large scale Web-oriented companies use. We need to ask “How Would Google Do It”.

Michael Nielsen has a nice post on using your laptop to compute PageRank for millions of webpages. His posts reviews PageRank and how to compute it and shows a short, but reasonably efficient, Python program that can easily do a graph with a few million nodes. While not sufficient for many applications, like the Web, there are lots of interesting and significant graphs this small Python program can handle — Wikipedia pages, DBLP publications, RDF namespaces, BGP routers, Twitter followers, etc.

The post is part of a series Nielsen is making on the Google Technology Stack including PageRank, MapReduce, BigTable, and GFS. The posts are a byproduct of a series of weekly lectures he’s giving starting earlier this month in Waterloo. Here’s the way that Nielsen describes the series.

“Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data. According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: “Google uses Bayesian filtering the way Microsoft uses the if statement” (Credit: Joel Spolsky). This series of posts describes some of the technologies that make this high level of abstraction possible.”

Videos of the first two lectures, Introducion to PageRank and Building our PageRank Intuition) are available online. Nielsen illustrates the concepts and algorithms with well-written Python code and provides exercises to help readers master the material as well as “more challenging and often open-ended problems” which he has worked on but not completely solved.

Nielsen was trained as a as a theoretical Physicist but has shifted his attention to “the development of new tools for scientific collaboration and publication”. As far as I can see, he is offering these as free public lectures out of a desire to share his knowledge and also to help (or maybe force) him to deepen his own understanding of the topics and develop better ways of explaining them. In both cases, it an admirable and inspiring example for us all and appropriate for the holiday season. Merry Christmas!

UMBC to offer special course in parallel programming

December 9th, 2008

There’s a very interesting late addition to UMBC’s spring schedule — CMSC 491/691A, a special topics class on parallel programming. Programming multi-core and cell-based processors is likely to be an important skill in the coming years, especially for systems that require high performance such as those involving scientific computing, graphics and interactive games.

The class will meet Tu/Thr from 7:00pm to 8:15pm in the “Game Lab” in ECS 005A and will be taught by research professors John Dorband and Shujia Zhou. Both are very experienced in high-performance and parallel programming. Professor Dorband helped to design and build the first Beowulf cluster computer in the mid 1990s when he worked at the NASA’s Goddard Space Flight Center. Shujia Zhou has worked at Northrop Grumman and NASA/Goddard on a wide range of projects using high-performance and parallel computing for climate modeling and simulation.

CMSC 491/691a Special Topics in Computer Science:
Introduction to parallel computing emphasizing the
use of the IBM Cell B.E.

3 credits. Grade Method: REG/P-F/AUD Course meets in
ENG 005A. Prerequisites: CMSC 345 and CMSC 313 or
permission of instructor.

[7735/7736] 0101 TuTh 7:00pm- 8:15pm

USPTO rejects Dell cloud computing trademark application

August 27th, 2008

Dude, you're getting a Cloud Computer TM. It turns out that we may not have to fear hearing “Dude, you’re getting a Cloud Computer®!” in the future after all.

Bill Poser noted on Language Log that the US Patent and Trademark Office has refused Dell’s application to register the term cloud computing as a trademark. In a office action report, the USPTO said:

“Registration is refused because the applied-for mark merely describes a feature and characteristic of applicant’s services. … As shown in the attached Internet and LEXISNEXIS® evidence, CLOUD COMPUTING is a descriptive term of art in the relevant industry.”

Frontiers of Multicore Computing at UMBC, 26-28 Aug 2008

August 18th, 2008

The UMBC Multicore Computation Center is hosting a free workshop on Frontiers of Multicore Computing 26-28 August 2008 at UMBC. The workshop will feature leading computational researchers who will share their current experiences with multicore applications. A number of computer architects and major vendors have also been invited to describe their road maps to near and long-term future system developments. The FMC workshop will focus on applications in the fields of geosciences, aerospace, defense, interactive digital media and bioinformatics. The workshop has no registration fees but you must register to attend. More information regarding hotel accommodations, tutorials, exhibits and access to the campus can also be found at the website.

Members of the UMBC ebiquity lab will make presentations on our current and planned use of multicore and cloud computing for research in exploiting Wikipedia as as knowledge base and also in extracting communities from very large social network graphs.

Dell trying to trademark cloud computing

August 3rd, 2008

Cloud computing is a hot topic this year, with IBM, Microsoft, Google, Yahoo, Intel, HP and Amazon all offering, using or developing high-end computing services typically described as “cloud computing”. We’ve started using it in our lab, like many research groups, via the Hadoop software framework and Amazon’s Elastic Compute Cloud services.

Bill Poser notes in a post (Trademark Insanity) on Language Log that Dell as applied for a trademark on the term “cloud computing”.

It’s bad enough that we have to deal with struggles over the use of trademarks that have become generic terms, like “Xerox” and “Coke”, and trademarks that were already generic terms among specialists, such as “Windows”, but a new low in trademarking has been reached by the joint efforts of Dell and the US Patent and Trademark Office. Cyndy Aleo-Carreira reports that Dell has applied for a trademark on the term “cloud computing”. The opposition period has already passed and a notice of allowance has been issued. That means that it is very likely that the application will soon receive final approval.

It’s clear, at least to me, that ‘cloud computing’ has become a generic term in general use for “data centers and mega-scale computing environments” that make it easy to dynamically focus a large number of computers on a computing task. It would be a shame to have one company claim it as a trademark. On Wikipedia a redirect for the Cloud Computing page was created several weeks before Dell’s USPTO application. A Google search produces many uses of cloud computing in news articles before 2007, although it’s clear that it’s use didn’t take off until mid 2007.

An examination of a Google Trends map shows that searches for ‘cloud computing’ (blue) began in September 2007 and have increased steadily, eclipsing searches for related terms like Hadoop, ‘map reduce’ and EC2 over the past ten months.

Here’s a document giving the current status of Dell’s trademark application, (USPTO #77139082) which was submitted on March 23, 2007. According to the Wikipedia article on cloud computing, Dell

“… must file a ‘Statement of Use’ or ‘Extension Request’ within 6 months (by January 8, 2009) in order to proceed to registration, and thereafter must enforce the trademark to prevent removal for ‘non-use’. This may be used to prevent other vendors (eg Google, HP, IBM, Intel, Yahoo) from offering certain products and services relating to data centers and mega-scale computing environments under the cloud computing moniker.”

Five Cloud Computers and Information Sharing

July 28th, 2008

There is an interesting panel to open the Microsoft faculty research summit featuring Rick Rashid, Daniel Reed, Ed Felten, Howard Schmidt, and Elizabeth Lawley. Lots of interesting ideas, but one that got thrown out was the recent idea that maybe the world does only need five (cloud) computers. If something like this really does happen, then perhaps we’ll need to think even more aggressively about the information sharing issues — is there some way for me to make sure that I only share with (say) Google’s cloud the things that are absolutely needed. Once I have given some information to Google, can I still retain some control over it. Who owns this information now? If I do, how do I know that Google will honor whatever commitments it makes about how it will use or further share that information ? We’ll be exploring some of these questions in our “Assured Information Sharing” Research. Some of the auditing work that MIT’s DIG group has done also ties in .