Wikidata article in CACM

October 12th, 2014


I just noticed that Denny Vrandecic and Markus Krötzsch have an article on Wikidata in the latest CACM. Good work! Even better, it’s available without subscription.

Wikidata: a free collaborative knowledgebase, Denny Vrandecic and Markus Krötzsch, Communications of the ACM, v57, n10 (2014), pp 78-85.

“This collaboratively edited knowledgebase provides a common source of data for Wikipedia, and everyone else.

Unnoticed by most of its readers, Wikipedia continues to undergo dramatic changes, as its sister project Wikidata introduces a new multilingual “Wikipedia for data” to manage the factual information of the popular online encyclopedia. With Wikipedia’s data becoming cleaned and integrated in a single location, opportunities arise for many new applications.”

Infoboxer: using statistical semantic knowledge to help create Wikipedia infoboxes

September 29th, 2014

In this week’s ebiquity meeting (10am Tue. Oct 1 in ITE346), Varish Mulwad will present Infoboxer, a prototype tool he developed with Roberto Yus that uses statistical and semantic knowledge from linked data sources to overcome the challenges described below and ease the process of creating Wikipedia infoboxes.

Wikipedia infoboxes serve as input in the creation of knowledge bases such as DBpedia, Yago, and Freebase. Current creation of Wikipedia infoboxes is manual and based on templates that are created and maintained collaboratively. However, these templates pose several challenges:

  • Different communities use different infobox templates for articles in the same category
  • Attribute names differ (e.g., date of birth vs. birthdate)
  • Templates are restricted to a single category, making it harder to find a template for an article that belongs to multiple categories (e.g., actor and politician)
  • Templates are free-form in nature, and no integrity check is performed to ensure that the value a user enters is of the appropriate type for the given attribute

Infoboxer creates dynamic and semantic templates by suggesting attributes common for similar articles and controlling the expected values semantically. We will give an overview of our approach and demonstrate how Infoboxer can be used to create infoboxes for new Wikipedia articles as well as update erroneous values in existing infoboxes. We will also discuss our proposed extensions to the project.
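The semantic control of expected values can be illustrated with a toy type check: given an expected type for each attribute, reject values of the wrong kind. A minimal sketch, assuming a hypothetical attribute-to-type table and crude checks (the names and rules below are illustrative, not Infoboxer's actual vocabulary):

```python
import re

# Hypothetical expected-type table for a "person" infobox.
EXPECTED = {
    "birth_date": "date",
    "spouse": "person",
    "occupation": "string",
}

# Crude per-type checks; a real system would consult linked data sources.
CHECKS = {
    "date": lambda v: bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)),
    "person": lambda v: v.istitle(),   # stand-in for an entity lookup
    "string": lambda v: len(v) > 0,
}

def validate(attribute, value):
    """Return True if value matches the attribute's expected semantic type."""
    expected = EXPECTED.get(attribute)
    return CHECKS[expected](value) if expected else True

print(validate("birth_date", "1809-02-12"))  # True
print(validate("birth_date", "Kentucky"))    # False: not a date
```

A real implementation would replace the regex and title-case heuristics with lookups against DBpedia or similar sources, but the control flow is the same.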

More information about Infoboxer, including a demo, is available online.

Google releases dataset linking strings and concepts

May 19th, 2012

Yesterday Google announced a very interesting resource with 175M short, unique text strings that were used to refer to one of 7.6M Wikipedia articles. This should be very useful for research on information extraction from text.

“We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article’s canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept’s url. Our database thus includes weights that measure degrees of association.”

The details of the data and how it was constructed are in an LREC 2012 paper by Valentin Spitkovsky and Angel Chang, A Cross-Lingual Dictionary for English Wikipedia Concepts. Get the data here.
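Concretely, each record in such a dictionary pairs a string with a Wikipedia URL and a count. A minimal sketch of loading and querying it, assuming a simple tab-separated layout (an illustrative simplification, not the dataset's exact schema):

```python
from collections import defaultdict

def load_dictionary(lines):
    """Parse tab-separated (text, url, count) triples into a nested index."""
    index = defaultdict(dict)  # text -> {url: count}
    for line in lines:
        text, url, count = line.rstrip("\n").split("\t")
        index[text][url] = index[text].get(url, 0) + int(count)
    return index

def best_concept(index, text):
    """Return the concept URL most often associated with this string."""
    candidates = index.get(text, {})
    return max(candidates, key=candidates.get) if candidates else None

sample = [
    "jaguar\thttp://en.wikipedia.org/wiki/Jaguar\t120",
    "jaguar\thttp://en.wikipedia.org/wiki/Jaguar_Cars\t45",
]
index = load_dictionary(sample)
print(best_concept(index, "jaguar"))  # http://en.wikipedia.org/wiki/Jaguar
```

The counts act as the association weights the paper describes, so the same index supports both entity linking (string to concept) and surface-form generation (concept to string).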

Wikidata will create an editable, Semantic Web compatible version of Wikipedia

March 30th, 2012

Wikidata is a new project that “aims to create a free knowledge base about the world that can be read and edited by humans and machines alike.” The project was started by the German chapter of Wikimedia, the organization that oversees Wikipedia and related projects, and is the first new Wikimedia project since 2006.

Wikidata has its roots in the successful Semantic MediaWiki project, and the Wikidata development team is led by Dr. Denny Vrandecic, a well-known member of the Semantic Web research community and one of the Semantic MediaWiki creators in 2005. The project is funded by Paul Allen’s AI2 foundation (which funded Semantic MediaWiki), Google, and the Gordon and Betty Moore Foundation.

Wikidata will expose the data that underlies Wikipedia and other sources as RDF and JSON, and will allow people and programs to query the data as well as add or edit it.
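Exposing the data as JSON means a program can read statements about an entity directly. The snippet below parses a Wikidata-style entity record; the field names here are a simplified assumption for illustration, not the project's final serialization format:

```python
import json

# A simplified, hypothetical Wikidata-style entity serialization.
record = json.loads("""
{
  "id": "Q42",
  "labels": {"en": "Douglas Adams", "de": "Douglas Adams"},
  "claims": [
    {"property": "occupation", "value": "writer"},
    {"property": "birth_place", "value": "Cambridge"}
  ]
}
""")

def label(entity, lang="en"):
    """Return the entity's label in the requested language, if any."""
    return entity["labels"].get(lang)

def values(entity, prop):
    """Collect all claimed values for a given property."""
    return [c["value"] for c in entity["claims"] if c["property"] == prop]

print(label(record))                 # Douglas Adams
print(values(record, "occupation"))  # ['writer']
```

The multilingual labels block is the key design point: the same machine-readable statements can back infoboxes in every language edition of Wikipedia.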

For more information, see Wikipedia’s Next Big Thing on TechCrunch or the Wikimedia press release on Wikidata. You can also see a recent Wikidata presentation by Denny and view his talk on the nascent project at the 2011 Wikimania conference.

Wikimedia fans in our area will find it easy to attend Wikimania 2012, which will be held July 12-15 at George Washington University in the Washington DC area.

Wikipedia offline due to power outage

July 4th, 2010

Wikipedia was offline for nearly twelve hours today, starting about 11:00am EDT. According to Wikipedia’s Twitter feed:

“Thanks for being patient, everyone. We’ve figured out the problem: power outage in our Florida data center. Slowly coming back online!”

This is not the first time that Wikimedia has experienced problems caused by power outages. In March 2010, Wikipedia was also knocked offline globally:

“Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries. However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.”

According to a story in itnews

“The cluster is hosted in a co-location facility in Tampa, Florida, which has approximately 300 servers, a 350 Mbps connection, and supports up to 3,000 hits per second, or 150 million hits per day. Two other server clusters – knams in Amsterdam, Netherlands and yaseo, provided by Yahoo! in Seoul, South Korea – also provide hosting and bandwidth to serve users in various regions.”

It looks like there are still failover problems. 🙁 We can watch the Wikimedia Technical blog for more information.

Wikipedia mobile launches for iPhone, Palm Pre, Android and Kindle

July 5th, 2009

Wikipedia’s mobile site has been officially launched and is running on a new server (in Ruby!).

Currently the site supports four mobile platforms: iPhone, Kindle, Android, and Palm Pre. Only the English and German versions are up, but support for more languages is said to be coming.

If you visit a Wikipedia page from a supported mobile device, you will be automatically redirected to the mobile version. You can click through to the regular page for editing or accessing other features not included in the mobile transcoding (e.g., history). You can also permanently disable the mobile redirects for your device, if you like.

You can get some idea of how the page rendering is simplified by viewing a mobile page in a non-mobile browser, but the device-specific encoding makes this work much better on each device.

I like the way it looks on my Palm Pre, which differs from the iPhone encoding, and I think it will make Wikipedia much more usable from the device.

(via ReadWriteWeb)

CFP: JWS special issue on Semantic Web and Social Media

June 27th, 2009
important dates
abstracts 21 Sept 09
submissions 01 Oct 09
notification 15 Dec 09
final copy 15 Jan 10
publication April 10

The Journal of Web Semantics will publish a special issue on Data Mining and Social Network Analysis for integrating Semantic Web and Web 2.0 in the spring of 2010. The special issue will be edited by Bettina Berendt, Andreas Hotho and Gerd Stumme and initial abstracts for papers must be submitted via the Elsevier EES system by September 21, 2009.

The special issue invites contributions that show how synergies between Semantic Web and Web 2.0 techniques can be successfully exploited. Since both communities work on network-like data structures, analysis methods from different fields of research could form a link between them. Techniques include, but are not limited to, social network analysis, graph analysis, machine learning, and data mining methods.

Relevant topics include

  • ontology learning from Web 2.0 data
  • instance extraction from Web 2.0 systems
  • analysis of Blogs
  • discovering social structures and communities
  • predicting trends and user behaviour
  • analysis of dynamic networks
  • using content of the Web for modelling
  • discovering misuse and fraud
  • network analysis of social resource sharing systems
  • analysis of folksonomies and other Web 2.0 data structures
  • analysis of Web 2.0 applications and their data
  • deriving profiles from usage
  • personalized delivery of news and journals
  • Semantic Web personalization
  • Semantic Web technologies for recommender systems
  • ubiquitous data mining in Web (2.0) environment
  • applications

Wikinvest offers the wisdom of the investing crowds

February 9th, 2009

Wikinvest is a free, community driven site that “wants to make investing easier by creating the world’s best source of investment information and investment tools”.

A story in today’s NYT, Offering Free Investment Advice by Anonymous Volunteers, says

“Following the model of Wikipedia, the online encyclopedia that anyone can edit, Wikinvest is building a database of user-generated investment information on popular stocks. A senior at Yale writes about the energy industry, for example, while a former stockbroker covers technology and a mother in Arizona tracks children’s retail chains.

Wikinvest, which recently licensed some content to the Web sites of USA Today and Forbes, seeks to be an alternative to Web portals that are little more than “a data dump” of income statements and government filings, said Parker Conrad, a co-founder.

Users annotate stock charts with notes explaining peaks and valleys, edit company profiles and opine about whether to buy or sell. The site is creating a wire service with articles from finance blogs and building a cheat sheet to guide readers through financial filings by defining terms and comparing a company’s performance to competitors’.”

After a quick look at the site, it does look interesting. I may well be ready to trust the wisdom of the crowds over the platitudes of the pundits. The Microsoft article has a lot of useful data, lays out reasons both to buy and to sell, and lets registered members vote on whether they agree. Of course, I thought the reasons offered on both sides were valid; rather than simple propositions, their validity needs to be quantified.

For what it is worth, I note that the site is using MediaWiki. I wonder if there are unique opportunities to incorporate RDF and/or RDFa into such a site, perhaps encoding or annotating their WikiData.

Extracting Wikipedia infobox values from text

January 27th, 2009

This year’s Text Analysis Conference (TAC) has an interesting track focused on processing text to populate Wikipedia infoboxes, both for existing entities with missing values as well as newly discovered entities.

TAC has been run by the US National Institute of Standards and Technology (NIST) to encourage research in natural language processing and related applications. As in the NIST-sponsored MUC, TREC and ACE workshops, this is done by providing a large test collection, common evaluation procedures, and a forum for organizations to share their results. The first TAC was held last year and included 65 teams from 20 countries who participated in three tracks: question answering, summarization and recognizing textual entailments.

TAC 2009 will include a new track on Knowledge Base Population coordinated by Paul McNamee of the Johns Hopkins University Human Language Technology Center of Excellence.

“The goal of the new Knowledge Base Population track is to augment an existing knowledge representation with information about entities that is discovered from a collection of documents. A snapshot of Wikipedia infoboxes will be used as the original knowledge source, and participants will be expected to fill in empty slots for entities that do exist, add missing entities and their learnable attributes, and provide links between entities and references to text supporting extracted information. The KBP task lies at the intersection of Question Answering and Information Extraction and is expected to be of particular interest to groups that have participated in ACE or TREC QA.”

This is an exciting task, and doing well in it will require a mixture of language processing, knowledge-based processing and (probably) machine learning.
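In data terms, the slot-filling part of the task amounts to filling empty infobox slots with values extracted from documents while keeping a pointer back to the supporting text. A minimal sketch of that bookkeeping (the slot names and record structure are illustrative assumptions, not the track's actual submission format):

```python
def fill_slots(entity, extracted):
    """Fill only empty slots; record the supporting document for each fill."""
    filled = {}
    for slot, value in entity["slots"].items():
        if value is None and slot in extracted:
            new_value, doc_id = extracted[slot]
            entity["slots"][slot] = new_value
            entity.setdefault("provenance", {})[slot] = doc_id
            filled[slot] = new_value
    return filled

entity = {"name": "Ada Lovelace",
          "slots": {"birth_date": None, "occupation": "mathematician"}}
extracted = {"birth_date": ("1815-12-10", "doc_17"),
             "occupation": ("writer", "doc_03")}  # ignored: already filled

print(fill_slots(entity, extracted))  # {'birth_date': '1815-12-10'}
```

Note that already-filled slots are left alone and every new value carries a document id, matching the track's requirement to link extracted information to supporting text.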

The TAC 2009 workshop will be co-located with TREC and held 16-17 November in Gaithersburg, MD. If you are interested in participating, you should register by March 3.

Wikirage tracks what’s hot on Wikipedia

December 30th, 2008

Wikirage is yet another way to track what’s happening in the world via changes in social media, in this case, Wikipedia. As the site suggests, “popular people in the news, the latest fads, and the hottest video games can be quickly identified by monitoring this social phenomenon.”

Wikirage lists the 100 Wikipedia pages that are being heavily edited over any of six time periods from the last hour to the last month. You can see the top 100 by your choice of six metrics: number of quality edits, unique editors, total edits, vandalism, reversions, or undos. Clicking on a result shows a monthly summary for the article, for example, December 2008 Gaza Strip airstrikes, which is at the top of today’s list for number of edits as I write. I understand the Gaza article, but what’s up with the Tasmanian tiger?

The interface has some other nice features, such as marking pages with high revision, vandalism or undo rates in red and showing associated Wikipedia flags indicating articles that need attention or don’t live up to standards. Wikirage is also available for the English, Japanese, Spanish, German and French language Wikipedias.

Wikirage was developed by Craig Wood and is a nicely done system.

(via the Porn Sex Viagra Casino Spam site)

Journal requires authors to include Wikipedia article with submissions

December 18th, 2008

Scientific journals are undergoing rapid evolution as they adapt to the Web and various forms of social media. As reported by Nature (Publish in Wikipedia or perish) and in ReadWriteWeb, the journal RNA Biology is experimenting with a connection to Wikipedia. Articles submitted for publication about new RNA molecules must also include a draft Wikipedia page that summarizes the work. The journal will then peer review the page before publishing it in Wikipedia.

Here are the guidelines from the RNA Biology site:

“To be eligible for publication the Supplementary Material must contain: (1) a link to a Wikipedia article preferably in a User’s space. Upon acceptance this can easily be moved into Wikipedia itself together with a reference to the published article.

At least one stub article (essentially an extended abstract) for the paper should be added to either an author’s userspace at Wikipedia (preferred route) or added directly to the main Wikipedia space (be sure to add literature references to avoid speedy deletion). This article will be reviewed alongside the manuscript and may require revision before acceptance. Upon acceptance the former articles can easily be exported to the main Wikipedia space. See below for guidelines on how to do this. Existing articles can be updated in accordance with the latest published results.”

This is definitely an interesting and forward-looking idea. Yet, I cannot help having the cynical thought that it’s also a great way for the journal to boost its page rank.

Parallax: a better interface for Freebase

August 14th, 2008

David Huynh completed his PhD at MIT CSAIL last year and joined MetaWeb a few months ago, where he has been working on new and better interfaces to explore the data encoded in their Freebase system. He recently released Parallax as a prototype browsing interface for Freebase. Here is a video that shows the interface in action.

Freebase Parallax: A new way to browse and explore data from David Huynh on Vimeo.

Freebase is “an open database of the world’s information” that is constructed by a Wiki-like collaborative community. In many ways it is like the Semantic Web model, with two big differences: (1) the data is stored centrally rather than distributed across the Web and (2) the representation system is not based on RDF but rather uses a custom built object-oriented data representation language.

Freebase is a great resource. Much of the data is extracted from Wikipedia, so its content has a large overlap with DBpedia. But it is also relatively easy to upload additional information in various structured forms and many have done so, resulting in an extended coverage.

This is clearly a system in the Web of Data space along with the Linking Open Data effort and having it should offer a way for us all to explore the consequences of some of the underlying design decisions.