Archive for the 'Web' Category
March 2nd, 2011, by Tim Finin, posted in AI, Semantic Web, Web
Congratulations to Tom Heath and Christian Bizer on the publication of their new book, Linked Data: Evolving the Web into a Global Data Space. It’s published by Morgan & Claypool in the series Synthesis Lectures on the Semantic Web: Theory and Technology edited by Jim Hendler and Frank van Harmelen.

“This book provides a conceptual and technical introduction to the field of Linked Data. It is intended for anyone who cares about data – using it, managing it, sharing it, interacting with it – and is passionate about the Web. We think this will include data geeks, managers and owners of data sets, system implementors and Web developers. We hope that students and teachers of information management and computer science will find the book a suitable reference point for courses that explore topics in Web development and data management. Established practitioners of Linked Data will find in this book a distillation of much of their knowledge and experience, and a reference work that can bring this to all those who follow in their footsteps.”
More importantly, we should all thank them and Morgan & Claypool for making a free HTML version available on the Web.
February 26th, 2011, by Tim Finin, posted in AI, Google, Semantic Web
Many people now use the Web to find recipes rather than their own collection of cookbooks, and it is estimated that about one percent of all Google searches are for recipes. This past Thursday, Google released Recipe View in the US, letting you limit results to pages that are recipes and further narrow your search by ingredients, cooking time and calories. This feature is powered by semantic metadata encoded in RDFa and other formats.
Google describes the new recipe search in a post on the Official Google Blog:
“Recipe View lets you narrow your search results to show only recipes, and helps you choose the right recipe amongst the search results by showing clearly marked ratings, ingredients and pictures. To get to Recipe View, click on the “Recipes” link in the left-hand panel when searching for a recipe. You can search for specific recipes like [chocolate chip cookies], or more open-ended topics—like [strawberry] to find recipes that feature strawberries, or even a holiday or event, like [cinco de mayo]. In fact, you can try searching for all kinds of things and still find interesting results: a favorite chef like [ina garten], something very specific like [spicy vegetarian curry with coconut and tofu] or even something obscure like [strange salad].”
Recipe View extracts data embedded in Web pages that is encoded in Google’s rich snippets format. This covers both the W3C Semantic Web standard RDFa and microformats. Google recognizes a simple recipe vocabulary with fourteen properties.
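To make the idea of embedded recipe metadata concrete, here is a minimal sketch of pulling a few properties out of microformat-style markup using only the Python standard library. The class names `fn`, `ingredient` and `duration` are taken from the hRecipe microformat draft as I understand it; they are an assumption for illustration, not Google's fourteen-property vocabulary, and the sample page is made up.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class RecipeExtractor(HTMLParser):
    """Collect the text of elements whose class names a wanted property."""

    def __init__(self, wanted=("fn", "ingredient", "duration")):
        super().__init__()
        self.wanted = set(wanted)   # assumed hRecipe-style class names
        self.stack = []             # one entry per open element: property or None
        self.found = {}             # property name -> list of text values

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        prop = next((c for c in classes if c in self.wanted), None)
        self.stack.append(prop)
        if prop:
            self.found.setdefault(prop, []).append("")

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Attach text to the innermost enclosing property element, if any.
        for prop in reversed(self.stack):
            if prop:
                self.found[prop][-1] += data
                break


def extract_recipe(html):
    """Return {property: [values]} with whitespace normalized."""
    parser = RecipeExtractor()
    parser.feed(html)
    parser.close()
    return {p: [" ".join(v.split()) for v in vals]
            for p, vals in parser.found.items()}


# A made-up page fragment, used only to exercise the extractor.
SAMPLE = """
<div class="hrecipe">
  <h1 class="fn">Strange Salad</h1>
  <ul>
    <li class="ingredient">2 cups spinach</li>
    <li class="ingredient">1 ripe <b>mango</b></li>
  </ul>
  <span class="duration">15 min</span>
</div>
"""

if __name__ == "__main__":
    print(extract_recipe(SAMPLE))
```

A search engine that has this kind of structured record for every recipe page can filter on ingredients or cooking time directly, rather than guessing from the page text.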
This is a great example of the potential of semantic web technology that can be understood and appreciated by anyone with an interest in cooking. Or eating.
February 23rd, 2011, by Tim Finin, posted in Datamining, NLP, Semantic Web, Social media
The Fifth International AAAI Conference on Weblogs and Social Media is holding a new data challenge using a new dataset that includes about three TB of social media data collected by Spinn3r between January 13 and February 14, 2011.
The dataset consists of over 386M blog posts, news articles, classifieds, forum posts and other social media content from a month that included events such as the Tunisian revolution and the Egyptian protests. The content includes the syndicated text, its original HTML as found on the web, annotations and metadata (e.g., author information, time of publication and source URL), and boilerplate/chrome extracted content. The data is formatted as Spinn3r’s protostreams, an extension to Google’s protocol buffers. It is also broken down by date, content type and language, making it easy to work with selected data.
See the ICWSM Data Challenge pages for more information on the challenge task, its associated ICWSM workshop and procedures for data access.
February 22nd, 2011, by Tim Finin, posted in AI, Machine Learning, Semantic Web

IBM’s Watson’s performance in last week’s Jeopardy Challenge was an amazing accomplishment and a demonstration of how our computer systems are becoming more intelligent and capable of solving difficult tasks.
But I wonder if the way that questions were given to the human players and Watson didn’t give Watson a short but significant head start. According to the New York Times:
“During the sparring matches, Watson received the questions as electronic texts at the same moment they were made visible to the human players;”
Once Watson received a query, it could process it immediately. While the human contestants got to see the query as written text at the same time, Alex Trebek also started reading the question aloud. When I was watching Jeopardy, I found it almost impossible to read and understand the question more quickly than it was being spoken, and I suspect that Ken Jennings and Brad Rutter might have as well. It’s often observed that people find it very difficult to simultaneously process two language streams. While it took Trebek only a second or two to read the short Jeopardy queries, that could have given Watson a significant head start, enabling it to determine that it had a good answer and press its buzzer before the competition.
If this is the case, I am not sure that it is an unfair advantage. People and computers each have native advantages and disadvantages. If Jennings and Rutter got the questions as text without them being simultaneously read aloud, Watson might still have had the advantage of a quicker start.
February 14th, 2011, by Tim Finin, posted in Computing Research, CS, Semantic Web
There has been an ongoing discussion on the publication culture within the computer science research community in CACM, carried out through a series of editorials, opinion pieces, articles and letters. It covers the usual topics: the best role of workshops, conferences and journals, reviewer responsibility, the effect of deadlines on publications, etc. All important issues.
Jonathan Grudin has an opinion piece in the current (Feb) CACM:
Technology, conferences, and community. J. Grudin, 2011. Comm. of the ACM, 54, 2, 41-43.
He has also made available a list of the 16 recent CACM articles (with links) on the topic. It’s a list of papers worth reading.
February 13th, 2011, by Tim Finin, posted in AI, Datamining, Machine Learning, NLP, Semantic Web
On the eve of the big Jeopardy! match, Peter Norvig’s opinion piece in the New York Post (!) today, The Machine Age, looks at AI’s progress over the past sixty years and lays out six surprising lessons we’ve learned.
- The things we thought were hard turned out to be easier.
- Dealing with uncertainty turned out to be more important than thinking with logical precision.
- Learning turned out to be more important than knowing.
- Current systems are more likely to be built from examples than from logical rules.
- The focus shifted from replacing humans to augmenting them.
- The partnership between human and machine is stronger than either one alone.
When I took Pat Winston’s undergraduate AI class in 1970, only the first of those ideas was current. It’s a good essay.
Of course, after we’ve exploited the new data-driven, statistical paradigm for the next decade or so, we’ll probably have to go back to figuring out how to get logic back into the framework.
February 12th, 2011, by Tim Finin, posted in Machine Learning, Semantic Web, Social media
The current (11 February 2011) issue of Science is a special issue on Dealing with Data. It includes a collection of free, online articles that “highlights both the challenges posed by the data deluge and the opportunities that can be realized if we can better organize and access the data.” Some of the articles are drawn from three sister publications: Science Signaling, Science Translational Medicine and Science Careers.
From the issue’s introduction:

“Scientific innovation has been called on to spur economic recovery; science and technology are essential to improving public health and welfare and to inform sustainability; and the scientific community has been criticized for not being sufficiently accountable and transparent. Data collection, curation, and access are central to all of these issues.
…
As you will discover, two themes appear repeatedly: Most scientific disciplines are finding the data deluge to be extremely challenging, and tremendous opportunities can be realized if we can better organize and access the data.”
One of the great things about the “data deluge” is that there is something in it for almost all computer science researchers including areas like machine learning, data mining, NLP, visualization, semantic web, security and privacy, social media, high performance computing, HCI, etc. Here are some of the articles that caught our eye:
and still more that look very interesting:
- Climate Data Challenges in the 21st Century, J. T. Overpeck et al.
- Challenges and Opportunities of Open Data in Ecology, O. J. Reichman et al.
- Challenges and Opportunities in Mining Neuroscience Data, H. Akil et al.
- The Disappearing Third Dimension, T. Rowe and L. R. Frank
- Advancing Global Health Research Through Digital Technology and Sharing Data, T. Lang
- More Is Less: Signal Processing and the Data Deluge, R. G. Baraniuk
- Access to Stem Cells and Data: Persons, Property Rights, and Scientific Progress, D. J. H. Mathews et al.
- On the Future of Genomic Data, S. D. Kahn
- Conquering the Data Mountain, N. R. Gough and M. B. Yaffe
- Power to the People: Participant Ownership of Clinical Trial Data, S. F. Terry and P. F. Terry
- Surfing the Tsunami, E. Pain
- Sharing Data in Biomedical and Clinical Research, K. Travis
February 8th, 2011, by Tim Finin, posted in Semantic Web
In today’s ebiquity meeting, Curt Tilmes showed an interesting figure comparing how often a particular dataset (MODIS snow cover data) was mentioned in a paper with how often it was formally cited. It’s a good example of how far we still need to go in formally capturing the provenance of data and the information derived from it.
The figure is from:
Parsons, Mark A.; Duerr, Ruth; Minster, Jean-Bernard. Data Citation and Peer Review. Eos, Transactions American Geophysical Union, Volume 91, Issue 34, p. 297-298. 2010.
December 25th, 2010, by Varish Mulwad, posted in RDF, Semantic Web
The goal and vision of the Semantic Web is to create a Web of connected and interlinked data (items) which can be shared and reused by all. Sharing and opening up “raw data” is great, but the Semantic Web isn’t just about sharing data. To create a Web of data, one needs interlinking between data. In 2006, Sir Tim Berners-Lee introduced the notion of linked data, in which he outlined the best practices for creating and sharing data on the Web. To encourage people and governments to share data, he recently developed the following rating system:

The highest rating is for data that links to other people’s data to provide context. While the Semantic Web has been growing steadily, there is a lot of data that is still in raw format. A study by Google researchers shows that there are 154 million tables with high quality relational data on the World Wide Web. The US government, along with 7 other nations, has started sharing data publicly. Not all of that data is RDF or conforms to the best practices for publishing and sharing linked data.
Here in the Ebiquity Research Lab, we have been focusing on converting data in tables and spreadsheets into RDF. Our focus is not on generating just RDF, but rather on generating high quality linked data (or, as Berners-Lee now calls it, “5 star data”). Our goal is to build a completely automated framework for interpreting tables and generating linked data from them.

As part of our preliminary research, we have already developed a baseline framework which can link the table column headers to classes from ontologies in the linked data cloud datasets, link the table cells to entities in the linked data cloud and identify relations between table columns and map them to properties in the linked data cloud. You can read papers related to our preliminary research at [1]. We will use this blog as a medium to publish updates in our pursuit of creating “5-star” data for the Semantic Web.
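The three linking steps described above can be illustrated with a toy sketch. This is not our framework: the hard part is inferring the mappings automatically, and here they are simply hand-written dictionaries. The function name, the sample table and the choice of DBpedia URIs are all assumptions made for illustration.

```python
# Well-known URI of the rdf:type property.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Step 1 (assumed result): column headers linked to ontology classes.
COLUMN_CLASSES = {
    "City": "http://dbpedia.org/ontology/City",
    "Country": "http://dbpedia.org/ontology/Country",
}

# Step 2 (assumed result): cell strings linked to entities.
ENTITIES = {
    "Baltimore": "http://dbpedia.org/resource/Baltimore",
    "United States": "http://dbpedia.org/resource/United_States",
    "France": "http://dbpedia.org/resource/France",
}

# Step 3 (assumed result): a relation between two columns mapped to a property.
COLUMN_RELATIONS = {
    ("City", "Country"): "http://dbpedia.org/ontology/country",
}


def table_to_triples(header, rows):
    """Turn table rows into (subject, predicate, object) URI triples.

    Cells that could not be linked to an entity are skipped rather than
    guessed at, so an unlinked row may contribute few or no triples.
    """
    triples = []
    for row in rows:
        uris = {col: ENTITIES.get(val) for col, val in zip(header, row)}
        # Type each linked cell with the class of its column.
        for col, uri in uris.items():
            if uri and col in COLUMN_CLASSES:
                triples.append((uri, RDF_TYPE, COLUMN_CLASSES[col]))
        # Relate linked cells in the same row via the column-level property.
        for (subj_col, obj_col), prop in COLUMN_RELATIONS.items():
            s, o = uris.get(subj_col), uris.get(obj_col)
            if s and o:
                triples.append((s, prop, o))
    return triples


if __name__ == "__main__":
    header = ["City", "Country"]
    rows = [["Baltimore", "United States"], ["Atlantis", "France"]]
    for s, p, o in table_to_triples(header, rows):
        print(f"<{s}> <{p}> <{o}> .")
```

The output lines are in N-Triples form. Because every subject and object is a URI from an existing dataset, the result links to other people's data instead of merely restating the table.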
If you are a data publisher, go grab some Linked Data star badges at [2]. You can show your support for the open data movement by getting t-shirts, mugs and bumper stickers from [3]! (All profits go to W3C.)
Happy Holidays! Let 2011 be yet another step forward in the open data movement!
[1] – http://ebiquity.umbc.edu/person/html/Varish/Mulwad/?pub=on#pub
[2] – http://lab.linkeddata.deri.ie/2010/lod-badges/
[3] – http://www.cafepress.co.uk/w3c_shop
December 9th, 2010, by Tim Finin, posted in Facebook, Social media
If you are good at solving hard problems and like to program, here is something you might do over your winter break: compete in Facebook’s first annual Hacker Cup.
The Hacker Cup will start in January and aims to “bring engineers from around the world together to compete in a multi-round programming competition.” Contestants will work to solve algorithmic problems to advance and will be ranked based on their accuracy and speed in solving them. Winners will get cash prizes and those who do well will probably get invitations to interview for jobs or internships.
Registration begins on Monday, December 20, and the first three online rounds will be held in January (7-10, 15-16, and 22). The top 25 contestants after the third round will be flown out to the Facebook campus in Palo Alto for the final competition, which will take place on March 11.
For practice, Facebook suggests you work on some of the problems from their Puzzle Master Page. See http://www.facebook.com/hackercup for more information.
November 19th, 2010, by Tim Finin, posted in GENERAL, Privacy, Semantic Web, Web
Sir Tim Berners-Lee discusses the principles underlying the Web and the need to protect them in an article from the December issue of Scientific American, Long Live the Web.
“The Web evolved into a powerful, ubiquitous tool because it was built on egalitarian principles and because thousands of individuals, universities and companies have worked, both independently and together as part of the World Wide Web Consortium, to expand its capabilities based on those principles.
The Web as we know it, however, is being threatened in different ways. Some of its most successful inhabitants have begun to chip away at its principles. Large social-networking sites are walling off information posted by their users from the rest of the Web. Wireless Internet providers are being tempted to slow traffic to sites with which they have not made deals. Governments—totalitarian and democratic alike—are monitoring people’s online habits, endangering important human rights.
If we, the Web’s users, allow these and other trends to proceed unchecked, the Web could be broken into fragmented islands. We could lose the freedom to connect with whichever Web sites we want. The ill effects could extend to smartphones and pads, which are also portals to the extensive information that the Web provides.
Why should you care? Because the Web is yours. It is a public resource on which you, your business, your community and your government depend. The Web is also vital to democracy, a communications channel that makes possible a continuous worldwide conversation. The Web is now more critical to free speech than any other medium. It brings principles established in the U.S. Constitution, the British Magna Carta and other important documents into the network age: freedom from being snooped on, filtered, censored and disconnected.”
Near the end of the long feature article, he mentions the Semantic Web’s linked data as one of the major new technologies the Web will give birth to, provided the principles are upheld.
“A great example of future promise, which leverages the strengths of all the principles, is linked data. Today’s Web is quite effective at helping people publish and discover documents, but our computer programs cannot read or manipulate the actual data within those documents. As this problem is solved, the Web will become much more useful, because data about nearly every aspect of our lives are being created at an astonishing rate. Locked within all these data is knowledge about how to cure diseases, foster business value and govern our world more effectively.”
One of the benefits of linked data is that it makes data integration and fusion much easier. The benefit comes with a potential risk, which Berners-Lee acknowledges.
“Linked data raise certain issues that we will have to confront. For example, new data-integration capabilities could pose privacy challenges that are hardly addressed by today’s privacy laws. We should examine legal, cultural and technical options that will preserve privacy without stifling beneficial data-sharing capabilities.”
The risk is not unique to linked data, and new research is underway, in our lab and elsewhere, on how to also use Semantic Web technology to protect privacy.
October 30th, 2010, by Tim Finin, posted in AI, Datamining, Google, Machine Learning, NLP, sEARCH, Semantic Web, Social media
Recorded Future is a Boston-based startup with backing from Google and In-Q-Tel that uses sophisticated linguistic and statistical algorithms to extract time-related information about entities and events from streams of Web data. Their goal is to help their clients understand how the relationships between entities and events of interest are changing over time and to make predictions about the future.
A recent Technology Review article, See the Future with a Search, describes it this way.
“Conventional search engines like Google use links to rank and connect different Web pages. Recorded Future’s software goes a level deeper by analyzing the content of pages to track the “invisible” connections between people, places, and events described online.
“That makes it possible for me to look for specific patterns, like product releases expected from Apple in the near future, or to identify when a company plans to invest or expand into India,” says Christopher Ahlberg, founder of the Boston-based firm.
A search for information about drug company Merck, for example, generates a timeline showing not only recent news on earnings but also when various drug trials registered with the website clinicaltrials.gov will end in coming years. Another search revealed when various news outlets predict that Facebook will make its initial public offering.
That is done using a constantly updated index of what Ahlberg calls “streaming data,” including news articles, filings with government regulators, Twitter updates, and transcripts from earnings calls or political and economic speeches. Recorded Future uses linguistic algorithms to identify specific types of events, such as product releases, mergers, or natural disasters, the date when those events will happen, and related entities such as people, companies, and countries. The tool can also track the sentiment of news coverage about companies, classifying it as either good or bad.”
Pricing for access to their online services and API starts at $149 a month, but there is a free Futures email alert service through which you can get the results of some standing queries on a daily or weekly basis. You can also explore the capabilities they offer through their page on the 2010 US Senate Races.
“Rather than attempt to predict how the races will turn out, we have drawn from our database the momentum, best characterized as online buzz, and sentiment, both positive and negative, associated with the coverage of the 29 candidates in 14 interesting races. This dashboard is meant to give the view of a campaign strategist, as it measures how well a campaign has done in getting the media to speak about the candidate, and whether that coverage has been positive, in comparison to the opponent.”
Their blog reveals some insights on the technology they are using and much more about the business opportunities they see. Clearly the company is leveraging named entity recognition, event recognition and sentiment analysis. A short paper, A White Paper on Temporal Analytics, has some details on their overall approach.