“This collaboratively edited knowledgebase provides a common source of data for Wikipedia, and everyone else.
Unnoticed by most of its readers, Wikipedia continues to undergo dramatic changes, as its sister project Wikidata introduces a new multilingual “Wikipedia for data” (http://www.wikidata.org) to manage the factual information of the popular online encyclopedia. With Wikipedia’s data becoming cleaned and integrated in a single location, opportunities arise for many new applications.”
In this week’s ebiquity meeting (10am Tue. Oct 1 in ITE346), Varish Mulwad will present Infoboxer, a prototype tool he developed with Roberto Yus that overcomes these challenges using statistical and semantic knowledge from linked data sources to ease the process of creating Wikipedia infoboxes.
Wikipedia infoboxes serve as input in the creation of knowledge bases such as DBpedia, Yago, and Freebase. Currently, Wikipedia infoboxes are created manually, based on templates that are themselves created and maintained collaboratively. However, these templates pose several challenges:
Different communities use different infobox templates for articles in the same category
Attribute names differ (e.g., date of birth vs. birthdate)
Templates are restricted to a single category, making it harder to find a template for an article that belongs to multiple categories (e.g., actor and politician)
Templates are free-form in nature, and no integrity check is performed on whether the value entered by the user is of the appropriate type for the given attribute
Infoboxer creates dynamic and semantic templates by suggesting attributes common for similar articles and controlling the expected values semantically. We will give an overview of our approach and demonstrate how Infoboxer can be used to create infoboxes for new Wikipedia articles as well as update erroneous values in existing infoboxes. We will also discuss our proposed extensions to the project.
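As a rough illustration of the statistical half of this idea (not Infoboxer's actual code), the sketch below ranks candidate attributes by how often they appear in the infoboxes of similar articles and applies a toy stand-in for semantic value checking. The infobox data and type checks are made up for illustration:

```python
from collections import Counter

def suggest_attributes(similar_articles, top_k=3):
    """Rank infobox attributes by how often they appear in the
    infoboxes of similar articles (the statistical suggestion idea)."""
    counts = Counter()
    for infobox in similar_articles:
        counts.update(infobox.keys())
    return [attr for attr, _ in counts.most_common(top_k)]

def value_matches_type(value, expected_type):
    """A toy stand-in for semantic checking of expected value types."""
    checks = {
        "date": lambda v: v.count("-") == 2,          # e.g. 1962-07-03
        "number": lambda v: v.replace(".", "", 1).isdigit(),
    }
    return checks.get(expected_type, lambda v: True)(value)

# Hypothetical infoboxes of articles similar to the one being created.
similar = [
    {"birth_date": "1962-07-03", "occupation": "actor"},
    {"birth_date": "1946-07-06", "occupation": "actor", "party": "Republican"},
    {"birth_date": "1936-06-21", "party": "Democratic"},
]

print(suggest_attributes(similar))   # most frequent attributes first
print(value_matches_type("1962-07-03", "date"))
```

A real system would draw the "similar articles" and expected types from linked data sources such as DBpedia rather than from a hand-built list.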
Yesterday Google announced a very interesting resource with 175M short, unique text strings that were used to refer to one of 7.6M Wikipedia articles. This should be very useful for research on information extraction from text.
“We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.
The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article’s canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept’s url. Our database thus includes weights that measure degrees of association.”
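Data in this (text, url, count) shape lends itself to simple mention resolution: look up a string and take the concept it most often links to. The sketch below assumes a tab-separated layout and uses made-up counts; the actual file format of the released data may differ:

```python
import csv
import io

# Hypothetical sample rows in the (text, url, count) shape described
# above; the counts and layout are invented for illustration.
sample = io.StringIO(
    "Big Blue\thttp://en.wikipedia.org/wiki/IBM\t1342\n"
    "IBM\thttp://en.wikipedia.org/wiki/IBM\t98211\n"
)

# Index the triples by their surface text.
dictionary = {}
for text, url, count in csv.reader(sample, delimiter="\t"):
    dictionary.setdefault(text, []).append((url, int(count)))

def resolve(mention):
    """Map a mention to the concept it is most often linked with."""
    candidates = dictionary.get(mention, [])
    return max(candidates, key=lambda c: c[1])[0] if candidates else None

print(resolve("Big Blue"))
```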
Wikidata is a new project that “aims to create a free knowledge base about the world that can be read and edited by humans and machines alike.” The project was started by the German chapter of Wikimedia, the organization that oversees Wikipedia and related projects, and is the first new Wikimedia project since 2006.
Wikidata has its roots in the successful Semantic MediaWiki project, and the Wikidata development team is led by Dr. Denny Vrandecic, a well-known member of the Semantic Web research community and one of the Semantic MediaWiki creators in 2005. The project is funded by Paul Allen’s AI2 foundation (which funded Semantic MediaWiki), Google, and the Gordon and Betty Moore Foundation.
Wikidata will expose the data that underlies Wikipedia and other sources as RDF and JSON, and will also allow people and programs to query the data as well as add or edit it.
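To get a feel for what machine consumption of such JSON might look like, here is a sketch that pulls a language-specific label out of a simplified, hypothetical entity record. The schema shown is invented for illustration; the project was only just announced, and its real JSON format may differ:

```python
import json

# A hypothetical, simplified entity record in the spirit of the JSON
# Wikidata plans to expose; the real schema may well differ.
entity_json = """{
  "id": "Q42",
  "labels": {"en": "Douglas Adams", "de": "Douglas Adams"},
  "claims": {"P69": ["St John's College"]}
}"""

entity = json.loads(entity_json)

def label(entity, lang="en"):
    """Pick a language-specific label, falling back to the entity id."""
    return entity.get("labels", {}).get(lang, entity["id"])

print(label(entity))         # English label
print(label(entity, "fr"))   # no French label, so fall back to the id
```

The multilingual labels map is the interesting part: one record can serve every language edition of Wikipedia.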
Wikipedia was offline for nearly twelve hours today, starting about 11:00am EDT. According to Wikipedia’s Twitter feed:
“Thanks for being patient, everyone. We’ve figured out the problem: power outage in our Florida data center. Slowly coming back online!”
This is not the first time that Wikimedia has experienced problems caused by power outages. In March 2010, Wikipedia was also knocked offline globally:
“Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries. However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.”
“The cluster is hosted in a co-location facility in Tampa, Florida, which has approximately 300 servers, a 350 Mbps connection, and supports up to 3,000 hits per second, or 150 million hits per day. Two other server clusters – knams in Amsterdam, Netherlands and yaseo, provided by Yahoo! in Seoul, South Korea – also provide hosting and bandwidth to serve users in various regions.
Currently the site supports four mobile platforms: iPhone, Kindle, Android, and Palm Pre. Only the English and German versions are up, but support for more languages is said to be coming.
If you visit a Wikipedia page from a supported mobile device, you will be automatically redirected to the mobile version. You can click through to the regular page for editing or accessing other features not included in the mobile transcoding (e.g., history). You can also permanently disable the mobile redirects for your device, if you like.
You can get some idea how the page rendering is simplified in a non-mobile browser by looking at a page like http://en.m.wikipedia.org/wiki/Alan_Turing. But the device specific encoding makes this work much better for each device.
I like the way it looks on my Palm Pre, whose rendering differs from the iPhone encoding, and I think it will make Wikipedia much more usable from the device.
The special issue invites contributions that show how synergies between Semantic Web and Web 2.0 techniques can be successfully used. Since both communities work on network-like data structures, analysis methods from different fields of research could form a link between those communities. Techniques can be – but are not limited to – social network analysis, graph analysis, machine learning and data mining methods.
Relevant topics include
ontology learning from Web 2.0 data
instance extraction from Web 2.0 systems
analysis of blogs
discovering social structures and communities
predicting trends and user behaviour
analysis of dynamic networks
using content of the Web for modelling
discovering misuse and fraud
network analysis of social resource sharing systems
analysis of folksonomies and other Web 2.0 data structures
“Following the model of Wikipedia, the online encyclopedia that anyone can edit, Wikinvest is building a database of user-generated investment information on popular stocks. A senior at Yale writes about the energy industry, for example, while a former stockbroker covers technology and a mother in Arizona tracks children’s retail chains.
Wikinvest, which recently licensed some content to the Web sites of USA Today and Forbes, seeks to be an alternative to Web portals that are little more than “a data dump” of income statements and government filings, said Parker Conrad, a co-founder.
Users annotate stock charts with notes explaining peaks and valleys, edit company profiles and opine about whether to buy or sell. The site is creating a wire service with articles from finance blogs and building a cheat sheet to guide readers through financial filings by defining terms and comparing a company’s performance to competitors’.”
After a quick look at the site, it does look interesting. I may well be ready to trust the wisdom of the crowds over the platitudes of the pundits. The Microsoft article has a lot of useful data, lays out reasons both to buy and to sell, and lets registered members vote on whether they agree. Of course, I thought the reasons offered on both sides were valid; rather than treating them as simple propositions, their validity needs to be quantified.
For what it is worth, I note that the site is using MediaWiki. I wonder if there are unique opportunities to incorporate RDF and/or RDFa into such a site, perhaps encoding or annotating their WikiData.
TAC has been run by the US National Institute of Standards and Technology (NIST) to encourage research in natural language processing and related applications. As in the NIST-sponsored MUC, TREC and ACE workshops, this is done by providing a large test collection, common evaluation procedures, and a forum for organizations to share their results. The first TAC was held this year and included 65 teams from 20 countries who participated in three tracks: question answering, summarization and recognizing textual entailment.
“The goal of the new Knowledge Base Population track is to augment an existing knowledge representation with information about entities that is discovered from a collection of documents. A snapshot of Wikipedia infoboxes will be used as the original knowledge source, and participants will be expected to fill in empty slots for entities that do exist, add missing entities and their learnable attributes, and provide links between entities and references to text supporting extracted information. The KBP task lies at the intersection of Question Answering and Information Extraction and is expected to be of particular interest to groups that have participated in ACE or TREC QA.”
This is an exciting task, and doing well in it will require a mixture of language processing, knowledge-based processing and (probably) machine learning.
The TAC 2009 workshop will be co-located with TREC and held 16-17 November in Gaithersburg, MD. If you are interested in participating, you should register by March 3.
Wikirage is yet another way to track what’s happening in the world via changes in social media, in this case, Wikipedia. As the site suggests, “popular people in the news, the latest fads, and the hottest video games can be quickly identified by monitoring this social phenomenon.”
Wikirage lists the 100 Wikipedia pages that are being heavily edited over any of six time periods from the last hour to the last month. You can see the top 100 by your choice of six metrics: number of quality edits, unique editors, total edits, vandalism, reversions, or undos. Clicking on a result shows a monthly summary for the article, for example, December 2008 Gaza Strip airstrikes, which is at the top of today’s list for number of edits as I write. I understand the Gaza article, but what’s up with the Tasmanian tiger?
The interface has some other nice features, such as marking pages in red that have high revision, vandalism or undo rates, and showing associated Wikipedia flags indicating articles that need attention or don’t live up to standards. Wikirage is also available for the English, Japanese, Spanish, German and French language Wikipedias.
Wikirage was developed by Craig Wood and is a nicely done system.
Scientific journals are undergoing rapid evolution as they adapt to the Web and various forms of social media. As reported by Nature (Publish in Wikipedia or perish) and in ReadWriteWeb, the journal RNA Biology is experimenting with a connection to Wikipedia. Articles submitted for publication about new RNA molecules must also include a draft Wikipedia page that summarizes the work. The journal will then peer review the page before publishing it in Wikipedia.
Here are the guidelines from the RNA Biology site:
“To be eligible for publication the Supplementary Material must contain: (1) a link to a Wikipedia article preferably in a User’s space. Upon acceptance this can easily be moved into Wikipedia itself together with a reference to the published article.
At least one stub article (essentially an extended abstract) for the paper should be added to either an author’s userspace at Wikipedia (preferred route) or added directly to the main Wikipedia space (be sure to add literature references to avoid speedy deletion). This article will be reviewed alongside the manuscript and may require revision before acceptance. Upon acceptance the former articles can easily be exported to the main Wikipedia space. See below for guidelines on how to do this. Existing articles can be updated in accordance with the latest published results.”
This is definitely an interesting and forward-looking idea. Yet I cannot help having the cynical thought that it’s also a great way for the journal to boost its page rank.
David Huynh completed his PhD at MIT CSAIL last year and joined MetaWeb a few months ago, where he has been working on new and better interfaces to explore the data encoded in their Freebase system. He recently released Parallax as a prototype browsing interface for Freebase. Here is a video that shows the interface in action.
Freebase is “an open database of the world’s information” that is constructed by a Wiki-like collaborative community. In many ways it is like the Semantic Web model, with two big differences: (1) the data is stored centrally rather than distributed across the Web, and (2) the representation system is not based on RDF but rather uses a custom-built object-oriented data representation language.
Freebase is a great resource. Much of the data is extracted from Wikipedia, so its content has a large overlap with DBpedia. But it is also relatively easy to upload additional information in various structured forms and many have done so, resulting in an extended coverage.
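Freebase queries follow a query-by-example style, where a JSON template with concrete values acts as constraints and null fields are requests to be filled in. The toy matcher below imitates that flavor over a handful of made-up local records; it is a sketch of the querying idea, not Freebase's actual query engine or schema:

```python
# Invented records for illustration; real Freebase topics have ids,
# richer types, and many more properties.
records = [
    {"type": "/film/film", "name": "Blade Runner", "directed_by": "Ridley Scott"},
    {"type": "/film/film", "name": "Alien", "directed_by": "Ridley Scott"},
    {"type": "/film/film", "name": "Brazil", "directed_by": "Terry Gilliam"},
]

def query_by_example(query, data):
    """Fields with concrete values constrain the match; null fields
    name the values to be returned, as in Freebase's query style."""
    constraints = {k: v for k, v in query.items() if v is not None}
    requested = [k for k, v in query.items() if v is None]
    return [{k: r[k] for k in requested}
            for r in data
            if all(r.get(k) == v for k, v in constraints.items())]

result = query_by_example(
    {"type": "/film/film", "directed_by": "Ridley Scott", "name": None},
    records)
print(result)
```

The appeal of the style is that the query and the answer share one shape, which makes interfaces like Parallax, which pivot from one set of results to a related set, natural to build.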
This is clearly a system in the Web of Data space along with the Linking Open Data effort and having it should offer a way for us all to explore the consequences of some of the underlying design decisions.