UMBC ebiquity
UMBC eBiquity Blog

Programming with Hadoop: a hands on introduction

Tim Finin, 12:43am 20 September 2011

In this week’s ebiquity meeting (10:30am Tue 9/20 in ITE 325b) we will dive right into writing MapReduce programs, skipping the gory details of Hadoop setup and MapReduce theory. In one hour, we will write a MapReduce Java program in Eclipse that builds an inverted index, test it on a local machine, and run it on an already configured Hadoop cluster. If we have time, we will also see how to do the same in Python instead of Java.
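
For anyone who wants a feel for the idea beforehand, here is a minimal sketch of an inverted index written in the MapReduce style. It is plain Python that simulates the map, shuffle and reduce phases on a single machine; it does not use Hadoop and is not the program from the meeting. The function names (map_doc, reduce_word, build_inverted_index) and the two sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit a (word, doc_id) pair for every word in one document."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        yield word, doc_id


def reduce_word(word, doc_ids):
    """Reduce step: collapse all doc ids seen for a word into a sorted list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(docs):
    """Simulate map, shuffle (group by key) and reduce over a dict of documents."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in map_doc(doc_id, text):
            grouped[word].append(emitted_id)  # the "shuffle": group values by key
    return dict(reduce_word(word, ids) for word, ids in grouped.items())


# Hypothetical sample input: document id -> document text.
docs = {
    "d1": "Hadoop runs MapReduce jobs",
    "d2": "MapReduce builds an inverted index",
}
index = build_inverted_index(docs)
print(index["mapreduce"])  # ['d1', 'd2']
```

On a real cluster the grouping step is done by the framework between the map and reduce phases, so only the two small functions would need to be written.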

You are encouraged to do the following before the meeting if you want to code along.

  • Review the Yahoo Introduction to MapReduce tutorial
  • Download a free virtual machine image with Hadoop pre-installed, so you can get started quickly. Options are available for Linux, Windows and Mac OS X.
  • Make sure you have JDK 1.6.x and Eclipse (or your favourite IDE) installed on your laptop.

Addenda (9/19):

  • If you are planning to code along during the demo, download the latest stable release of Hadoop (0.20.2)
  • Some people have been having problems with Cloudera’s 64-bit VM image. If you do, try this virtual machine from Yahoo Developer Network, which contains a pre-installed Hadoop 0.20.
  • Even if you are not able to get the VM running for now, you can still run the program(s) locally on your laptop using Eclipse.

 

Ten years of words from ebiquity papers

Tim Finin, 8:47am 16 September 2011

Here’s a word cloud that visualizes the 200 most significant words extracted from over 400 papers from our research group over the past ten years. Significance was estimated by tf-idf where the idf data is from a collection of newswire articles (thanks Paul!). The word cloud was created with Wordle.
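
As a rough illustration of the scoring, here is a minimal Python sketch of tf-idf where the term frequencies come from the papers and the idf values come from a separate background collection. It is a sketch under assumptions, not the script used for the cloud: the smoothed idf formula, the function names (idf_from_background, top_words) and the toy data are all made up for illustration.

```python
import math
from collections import Counter


def idf_from_background(background_docs):
    """Return (idf table, default idf) from a background collection.

    Each document is a list of tokens. The smoothed formula
    log((1 + N) / (1 + df)) + 1 is an assumed choice; the default value
    is what a word never seen in the background collection receives.
    """
    n = len(background_docs)
    df = Counter()
    for doc in background_docs:
        df.update(set(doc))  # count each word once per document
    idf = {word: math.log((1 + n) / (1 + count)) + 1.0 for word, count in df.items()}
    return idf, math.log(1 + n) + 1.0


def top_words(tokens, idf, default_idf, k):
    """Rank words by tf * idf and return the k highest (ties broken alphabetically)."""
    tf = Counter(tokens)
    scores = {word: tf[word] * idf.get(word, default_idf) for word in tf}
    return sorted(scores, key=lambda word: (-scores[word], word))[:k]


# Hypothetical toy data standing in for newswire articles and paper text.
newswire = [
    ["the", "president", "said"],
    ["the", "market", "fell"],
    ["the", "team", "won"],
]
papers = "the semantic web the ontology semantic the agent".split()

idf, default_idf = idf_from_background(newswire)
print(top_words(papers, idf, default_idf, 2))  # ['semantic', 'the']
```

Taking the idf from newswire text rather than from the papers themselves means that words common in everyday English are pushed down, while words typical of the research area keep high scores even if they appear in nearly every paper.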


 

Got a problem? There’s a code for that

Tim Finin, 8:43am 15 September 2011

The Wall Street Journal article “Walked Into a Lamppost? Hurt While Crocheting? Help Is on the Way” describes the International Classification of Diseases, 10th Revision, which is used to describe medical problems.

“Today, hospitals and doctors use a system of about 18,000 codes to describe medical services in bills they send to insurers. Apparently, that doesn’t allow for quite enough nuance. A new federally mandated version will expand the number to around 140,000—adding codes that describe precisely what bone was broken, or which artery is receiving a stent. It will also have a code for recording that a patient’s injury occurred in a chicken coop.”

We want to see the search engine companies develop and support a Microdata vocabulary for ICD-10. An ICD-10 OWL DL ontology has already been done, but a Microdata version might add a lot of value. We could use it on our blogs and Facebook posts to catalog those annoying problems we encounter each day, like W59.22XA (Struck by turtle, initial encounter), or Y07.53 (Teacher or instructor, perpetrator of maltreatment and neglect).

Humor aside, a description logic representation (e.g., in OWL) makes the coding system seem less ridiculous. Instead of appearing as a catalog of 140K ground tags, it would emphasize that it is a collection of a much smaller number of classes that can be combined in productive ways to produce them or used to create general descriptions (e.g., bitten by an animal).


 

Detecting fake Google+ profiles with image search

Tim Finin, 4:00pm 11 September 2011

Many Google+ users have been reporting frequent notices about new followers whom they don’t know and who appear to be attractive young women. The suspicious followers have minimal profiles and no posts. These are obviously false accounts being created for some as yet unknown purpose, but how can one prove it?

I just got a notice, for example, that Janet Smith of Philadelphia is following me. Now Janet Smith is a common name and Philadelphia is a big place — there are probably hundreds of people who live in the Philadelphia area with that name. The 990 other people she’s following seem like a pretty random bunch, though I do know many and have more than a few in my own circles. Most seem to have a fair number of followers.

So there is not much to go on other than her profile image. This is a great use for Google’s new image search. I dragged the picture into the image search query field and Google identified its best guess for the image as Indian actress Koyel Mullick. Sure enough, if you search for images with her name, the precise Janet Smith image is result number 15.

Of course, there are still some subtle issues. This is just one kind of false profile — one created for one identity but using an image from a different one. It’s common on most social media systems, including G+, for some people to use a picture of someone or something other than themselves. But it’s obvious to a human viewer that using a picture of a rabbit, Marilyn Monroe or the mighty Thor on your profile is not meant to deceive. It will be challenging to automate the process of discriminating the intent to deceive from modesty, homage or an ironic gesture.


 

Mid-Atlantic student colloquium on speech, language and learning

Tim Finin, 9:31pm 2 September 2011

The First Mid-Atlantic Student Colloquium on Speech, Language and Learning is a one-day event to be held at the Johns Hopkins University in Baltimore on Friday, 23 September 2011. Its goal is to bring together students taking computational approaches to speech, language, and learning, so that they can introduce their research to the local student community, give and receive feedback, and engage each other in collaborative discussion. Attendance is open to all and free but space is limited, so online registration is requested by September 16. The program runs from 10:00am to 5:00pm and will include oral presentations, poster sessions, and breakout sessions.


 

Journal of web semantics issue on evaluation

Tim Finin, 1:39pm 2 September 2011

Call For Papers

Special Issue on Evaluation of Semantic Technologies
Journal of Web Semantics

Semantic technologies have become a well-established field of computer science. However, the field is continuously evolving: the number of semantic technologies is constantly increasing, standards evolve and new ones are defined; and, in this scenario, the problem of how to compare and evaluate the various approaches becomes crucial. The consistent evaluation of semantic technologies is critical not only for future scientific progress, by identifying research goals and allowing a rigorous examination of research results, but also for their industrial adoption, by allowing objective measurement and comparison of these technologies and enabling their certification.

Semantic technology evaluation must, on the one hand, be supported by strong methodological approaches and relevant test data and, on the other hand, satisfy the differing needs of developers, researchers and adopters by addressing those quality characteristics that are relevant to each target group. Nevertheless, numerous issues must be faced when evaluating semantic technologies.

On the one hand, because of the fast evolution of the semantic field, previous evaluation methods and techniques need to be adapted and extended and new ones have to be developed. On the other hand, the cost of defining new evaluation methods or reusing existing ones can be prohibitive, so facilitating the understanding of such methods or their automated processing becomes highly significant.

The goal of this special issue is to present current advances and trends in semantic technology evaluation (theories and models, methods and techniques, evaluation campaigns, technology comparison, etc.). Therefore we solicit papers that improve evaluation paradigms of semantic technologies. At the same time, papers that evaluate a particular method, technology or system without investigating the evaluation regime itself will be considered out of scope and will be returned to the authors without review.

Topics of interest

Relevant topics for the special issue include, but are not limited to, the following.

  • Semantic technology evaluation methods
  • Test data for semantic technology evaluation
  • Automation of semantic technology evaluation
  • Evaluation of semantic technologies in real world scenarios
  • Evaluation of linked data technologies
  • Quality requirements for semantic technologies
  • Semantic technology certification
  • Maturity models for semantic technologies
  • Semantic technology selection
  • Semantic technology quality estimation
  • Interoperability and conformance of semantic technologies
  • Semantic technology efficiency and scalability
  • Usability of semantic technologies

Important dates

We will aim at an efficient publication cycle in order to guarantee prompt availability of the published results. To this end, we encourage submissions well before the submission deadline.

  • Submission deadline. 29 February 2012
  • Author notification. 31 May 2012
  • Final version. 31 July 2012
  • Publication. Fall 2012

Instructions for submission

Please see the author guidelines for detailed instructions before you submit. Submissions should be conducted through Elsevier’s Electronic Submission System. More details on the Journal of Web Semantics can be found on its homepage.

Editors


 

Gingrich Twitter followers not fake, just inactive

Tim Finin, 10:42am 25 August 2011

Three weeks ago, it was widely reported that an analysis by PeekYou concluded that more than 90% of Newt Gingrich’s 1.3M Twitter followers were fake accounts, probably purchased to make him appear more popular. Further analysis by Topsy supports Newt Gingrich’s assertion that his Twitter followers were real people and that his campaign did not purchase any.

“Former House Speaker and GOP presidential candidate Newt Gingrich was correct in his explanation for why he has relatively few active accounts among his 1.3 million Twitter followers, an analysis requested by Mashable has revealed.”

The initial analysis of his followers was apparently based on a few trivial features, mostly the fact that the vast majority of them were inactive. But most of his followers came from the early days of Twitter, when Gingrich’s account was on Twitter’s short list of suggestions for interesting people to follow. Mashable says:

“So there is no smoking gun to suggest that Gingrich, or any of these politicians, bought any of their followers. But what this kind of analysis also reveals, says Topsy, is how hard it is to say which Twitter accounts are for real and which aren’t. Spam bots are getting more sophisticated; many now have fake profile pictures, fake bios and generate fake tweets. “The fact is, a large proportion of all Twitter accounts are inactive anyway,” says Ghosh.

Sorting the humans from the fakes is a problem that companies like Topsy — and Twitter itself, which now has more than 200 million accounts — will be wrestling with for years to come.”


 

2011 Hype Cycle for Emerging Technologies

Tim Finin, 6:23am 24 August 2011

The hype cycle concept has been used by IT consulting company Gartner since 1995 to highlight the common pattern of “overenthusiasm, disillusionment and eventual realism that accompanies each new technology and innovation.” While Gartner’s hype cycles represent one company’s opinions, the underlying concept seems right and it is always interesting to see where they place the current crop of computing related technologies.

Here is their 2011 hype cycle for emerging technologies


and some comments from the accompanying press release.

“Themes from this year’s Emerging Technologies Hype Cycle include ongoing interest and activity in social media, cloud computing and mobile,” Ms. Fenn said. “On the social media side, social analytics, activity streams and a new entry for group buying are close to the peak, showing that the era of sky-high valuations for Web 2.0 startups is not yet over. Private cloud computing has taken over from more-general cloud computing at the top of the peak, while cloud/Web platforms have fallen toward the Trough of Disillusionment since 2010. Mobile technologies continue to be part of most of our clients’ short- and long-range plans and are present on this Hype Cycle in the form of media tablets, NFC payments, quick response (QR)/color codes, mobile application stores and location-aware applications.

“Transformational technologies that will hit the mainstream in less than five years include highly visible areas, such as media tablets and cloud computing, as well as some that are more IT-specific, such as in-memory database management systems, big data, and extreme information processing and management. In the long term, beyond the five-year horizon, 3D printing, context-enriched services, the “Internet of Things” (called the “real-world Web” in earlier Gartner research), Internet TV and natural language question answering will be major technology forces. Looking more than 10 years out, 3D bioprinting, human augmentation, mobile robots and quantum computing will also drive transformational change in the potential of IT.”

You can get a copy of the Hype Cycle for Emerging Technologies Summary Report by giving your contact information, but the full report on this or any of the other 26 topical hype cycle reports will cost you money.


 

Free online courses on AI, databases and machine learning

Tim Finin, 12:32am 16 August 2011

Stanford is experimenting with an interesting idea: offering some of their most popular undergraduate computer science courses online for free and simultaneously with their regular offerings. An AI course was announced several weeks ago and now there are similar offerings for databases and machine learning. These are taught by first-rate instructors (who are also top researchers!) and are the same courses that Stanford students take.

  • “A bold experiment in distributed education, “Introduction to Artificial Intelligence” will be offered free and online to students worldwide during the fall of 2011. The course will include feedback on progress and a statement of accomplishment. Taught by Sebastian Thrun and Peter Norvig, the curriculum draws from that used in Stanford’s introductory Artificial Intelligence course. The instructors will offer similar materials, assignments, and exams.”
  • “A bold experiment in distributed education, “Introduction to Databases” will be offered free and online to students worldwide during the fall of 2011. Students will have access to lecture videos, receive regular feedback on progress, and receive answers to questions. When you successfully complete this class, you will also receive a statement of accomplishment. Taught by Professor Jennifer Widom, the curriculum draws from Stanford’s popular Introduction to Databases course.”
  • “A bold experiment in distributed education, “Machine Learning” will be offered free and online to students worldwide during the fall of 2011. Students will have access to lecture videos, lecture notes, receive regular feedback on progress, and receive answers to questions. When you successfully complete the class, you will also receive a statement of accomplishment. Taught by Professor Andrew Ng, the curriculum draws from Stanford’s popular Machine Learning course.”

If successful, this might be a game changer. Two weeks after the online AI course was announced, 56,000 students had signed up! The approach might work for many disciplines, not just CS. The Khan Academy is a related effort.

Universities should keep an eye on them and think about how to adapt if they are successful. Most of our students will probably benefit from taking our traditional courses. If so, we should be able to explain the benefits from taking them (and make sure we deliver those benefits). At the same time, we may want to leverage the online material from these courses in a synergistic way.


 

JWS special issues: Semantic Sensing and Social Semantic Web

Tim Finin, 8:55am 27 July 2011

The Journal of Web Semantics announced two new special issues, one on semantic sensing and another on the semantic and social web. Both will be published in 2012 with preprints made freely available online as papers are accepted.

The special issue on semantic sensing will be edited by Harith Alani, Oscar Corcho and Manfred Hauswirth. Papers will be reviewed on a rolling basis and authors are encouraged to submit before the final deadline of 20 December 2011.

The issue on the semantic and social web will be edited by John Breslin and Meena Nagarajan. Papers will be reviewed on a rolling basis and authors are encouraged to submit before the final deadline of 21 January 2012.

See the JWS Guide for Authors for details on the submission process.


 

Mid-Atlantic Student Colloquium on Speech, Language and Learning, 23 Sept 2011

Tim Finin, 10:24pm 13 July 2011

The Mid-Atlantic Student Colloquium on Speech, Language and Learning is a free, one-day event bringing together faculty, researchers and students from universities in the Mid-Atlantic area working in Speech/Language/ML. The colloquium is an opportunity for students to present preliminary or completed work and to network with other students, faculty and researchers working in related fields. The event will be held in Baltimore MD at the Johns Hopkins University on Friday 23 September 2011.

Students are encouraged to submit one-page abstracts by Monday, August 15 describing ongoing, planned, or completed research projects, including previously published results and negative results. Student research in any field applying computational methods to any aspect of human language, including speech and learning, from all areas of computer science, linguistics, engineering, neuroscience, information science, and related fields, is welcome. Submissions and presentations must be made by students or postdocs. See the call for papers for more information.

Accepted submissions will be presented as posters and each will also be given a one-minute presentation during a poster spotlight session. A small number of submissions will be selected to be presented as talks, on the basis of diversity and general interest.

Student-led breakout sessions of one hour will also be held to discuss papers on topics of interest and stimulate interaction and discussion. Topics and suggested papers for breakout sessions should be submitted by students alongside abstracts.

The event is sponsored by the Human Language Technology Center of Excellence and the Center for Language and Speech Processing at the Johns Hopkins University.


 

Robot discovers ancient graffiti

Tim Finin, 11:47am 12 June 2011

Back in May, it was reported that a robot explorer sent through the Great Pyramid of Giza discovered mysterious hieroglyphs in the 4,500-year-old mausoleum behind one of its mysterious doors. The images transmitted by the robot showed hieroglyphs written in red paint that had not been seen by human eyes since the construction of the pyramid.

This week, the reports are that the three red ochre figures painted on the floor of a hidden chamber at the end of a tunnel deep inside the pyramid are just numbers. The builders of the pyramid simply recorded the total length of the southern shaft from the Queen’s Chamber: 121 cubits.

While not exactly graffiti, it reminds me that when I’ve worked on an older house, I’ve often found notes left by the original workers who built it, like sketches with dimensions on the plaster covered up by wallpaper.