UMBC ebiquity

Archive for December, 2010

UMBC 2nd in Pan American Intercollegiate Team Chess Championship

December 30th, 2010, by Tim Finin, posted in UMBC

Alan Sherman reports that “UMBC advances to Final Four of College Chess, placing 2nd at 2010 Pan-Am Intercollegiate, after losing to arch rival UTD in close, hard-fought match last evening in Milwaukee. 28 teams competed in this six-round team Swiss event, including Stanford, Yale, Univ of Chicago, Univ. of Toronto.” The top four teams are UT Dallas, UMBC, UT Brownsville and Texas Tech. See here for the final standings of all 28 teams.

From tables to 5 star linked data

December 25th, 2010, by Varish Mulwad, posted in RDF, Semantic Web

The goal and vision of the Semantic Web is to create a Web of connected and interlinked data (items) that can be shared and reused by all. Sharing and opening up “raw data” is great, but the Semantic Web isn’t just about sharing data: to create a Web of data, one needs interlinking between data. In 2006, Sir Tim Berners-Lee introduced the notion of linked data, outlining best practices for creating and sharing data on the Web. To encourage people and governments to share data, he recently proposed the following star rating system:

  • ★ data is available on the Web, in any format, under an open license
  • ★★ it is available as machine-readable structured data (e.g., a spreadsheet rather than a scanned image of a table)
  • ★★★ it is in a non-proprietary format (e.g., CSV rather than Excel)
  • ★★★★ it uses open W3C standards (RDF and SPARQL) to identify things, so that others can point at it
  • ★★★★★ it is linked to other people’s data to provide context

The highest rating is for data that links to other people’s data to provide context. While the Semantic Web has been growing steadily, there is a lot of data that is still in raw format. A study by Google researchers found 154 million tables containing high-quality relational data on the World Wide Web. The US government, along with seven other nations, has started sharing data publicly, but not all of this data is RDF or conforms to the best practices for publishing and sharing linked data.

Here at the Ebiquity Research Lab, we have been focusing on converting data in tables and spreadsheets into RDF; but our focus is not on generating just any RDF, but rather on generating high-quality linked data (what Berners-Lee now calls “5 star data”). Our goal is to build a completely automated framework for interpreting tables and generating linked data from them.

As part of our preliminary research, we have already developed a baseline framework that links table column headers to classes from ontologies in the linked data cloud, links table cells to entities in the linked data cloud, and identifies relations between table columns, mapping them to properties in the linked data cloud. You can read papers related to our preliminary research at [1]. We will use this blog as a medium to publish updates in our pursuit of creating “5-star” data for the Semantic Web.
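
As a toy illustration (not the output of our framework), here is roughly what 5-star output for a single table cell might look like, sketched in Python with the rdflib library; the example.org URIs are made up and the DBpedia terms are just plausible targets:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

DBPEDIA = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")
EX = Namespace("http://example.org/table1/")   # made-up namespace for the table

g = Graph()
# A cell containing "Baltimore", in a column whose header was mapped to the
# ontology class dbo:City, becomes an entity that is typed, labeled and
# linked to the linked data cloud.
cell = EX["row1/city"]
g.add((cell, RDF.type, DBO.City))
g.add((cell, RDFS.label, Literal("Baltimore")))
g.add((cell, OWL.sameAs, DBPEDIA.Baltimore))

print(g.serialize(format="turtle"))  # rdflib 6+ returns a str; older versions return bytes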

If you are a data publisher, go grab some Linked Data star badges at [2]. You can show your support for the open data movement by getting t-shirts, mugs and bumper stickers from [3]! (All profits go to the W3C.)

Happy Holidays! Let 2011 be yet another step forward in the open data movement!

[1] – http://ebiquity.umbc.edu/person/html/Varish/Mulwad/?pub=on#pub

[2] – http://lab.linkeddata.deri.ie/2010/lod-badges/

[3] – http://www.cafepress.co.uk/w3c_shop

JASON report on the Science of Cyber-Security

December 20th, 2010, by Tim Finin, posted in cybersecurity, Privacy, Security

The DoD-sponsored JASON study group was asked to consider the question of whether there is a ‘science’ to cyber-security or whether it is fundamentally empirical. They released an 88-page report last month, Science of Cyber-Security, with the following abstract:

“JASON was requested by the DoD to examine the theory and practice of cyber-security, and evaluate whether there are underlying fundamental principles that would make it possible to adopt a more scientific approach, identify what is needed in creating a science of cyber-security, and recommend specific ways in which scientific methods can be applied. Our study identified several sub-fields of computer science that are specifically relevant and also provides some recommendations on further developing the science of cyber-security.”

The report discusses two general technical approaches to putting cyber-security on a scientific foundation. The first is based on the standard collection of frameworks and tools grounded in logic and mathematics, such as cryptography, game theory, model checking and software verification. The second grounds cyber-security in a model based on an analogy to immunology in biological systems.

It concludes with some observations, recommendations and responses to nine questions that were included in their charge. One interesting observation is that cyber-security, unlike the physical sciences, involves adversaries, so its foundation will use many different tools and methods. A recommendation is that the government establish cyber-security research centers in universities and other research organizations with a “long time horizon and periodic reviews of accomplishments”.

First annual Facebook Hackers Cup

December 9th, 2010, by Tim Finin, posted in Facebook, Social media

If you are good at solving hard problems and like to program, here is something you might do over your winter break: compete in Facebook’s first annual Hacker Cup.

The Hacker Cup will start in January and aims to “bring engineers from around the world together to compete in a multi-round programming competition.” Contestants will work to solve algorithmic problems to advance and will be ranked based on their accuracy and speed in solving them. Winners will get cash prizes and those who do well will probably get invitations to interview for jobs or internships.

Registration begins on Monday December 20 and the first three online rounds will be held in January (7-10, 15-16, and 22). The top 25 contestants after the third round will be flown out to the Facebook campus in Palo Alto for the final competition, which will take place on March 11.

For practice, Facebook suggests you work on some of the problems from their Puzzle Master Page. See http://www.facebook.com/hackercup for more information.

Test drive a (free) Google Cr-48 Chrome notebook

December 7th, 2010, by Tim Finin, posted in Google

Google is looking for some people who “Live on the Web” to take part in a pilot program for their new Cr-48 Chrome notebook. If you are accepted, you get a free Chrome notebook in return for providing feedback.

“We have a limited number of Chrome notebooks to distribute, and we need to ensure that they find good homes. That’s where you come in. Everything is still very much a work in progress, and it’s users, like you, that often give us our best ideas about what feels clunky or what’s missing. So if you live in the United States, are at least 18 years old, and would like to be considered for our small Pilot program, please fill this out. It should take about 15 minutes. We’ll review the requests that come in and contact you if you’ve been selected. This application will be open until 11:59:59 PM PST on December 21, 2010.”

The Cr-48’s features are said to include:

  • 12-inch display
  • Built-in Wi-Fi and 3G service via Verizon
  • Full-sized keyboard w/o function keys
  • Oversized clickable touchpad
  • ~3.8 pounds
  • Solid-state drive
  • Eight hours of active usage with a week of standby power

See the Google Chrome OS site for details and to apply.

Dot Diva

December 7th, 2010, by Tim Finin, posted in CS, GENERAL

Dot Diva is a Web site designed to “create an exciting and positive image of computing for high school girls”. It’s an effort of the ‘New Image of Computing’ project sponsored by NSF, ACM and others. One of the project’s advisors is UMBC’s own Marie desJardins.

The project is trying to raise interest in computing majors and careers among high school students and undergraduates. One of the project themes is that studying computing doesn’t mean you have to abandon your other deep interests. You can use a strong computing background to further music, psychology, biology, medicine, linguistics, journalism or almost any pursuit.

It’s a good message that can help attract people to the computing fields.

Naive Bayes classifier in 50 lines

December 7th, 2010, by Krishnamurthy Viswanathan, posted in Machine Learning

The Naive Bayes classifier is one of the most versatile machine learning algorithms that I have seen around during my meager experience as a graduate student, and I wanted to do a toy implementation for fun. At its core, the implementation reduces to a form of counting, and the entire Python module, including a test harness, took only 50 lines of code. I haven’t really evaluated its performance, so I welcome any comments. I am a Python amateur, and I am sure that experienced Python hackers can trim a few rough edges off this code.

Intuition and Design

Here is a definition of the classifier functionality (from Wikipedia):

classify(f1, …, fn) = argmax over c of p(C = c) × ∏i p(Fi = fi | C = c)

This means that for each possible class label, we multiply together the conditional probabilities of each feature given that label. So, to implement the classifier, all we need to do is compute these individual conditional probabilities p(Fi | Cj) for each feature and each label, and multiply them together with the prior probability of that label, p(Cj). The label for which we get the largest product is the label returned by the classifier.

In order to compute these individual conditional probabilities, we use Maximum Likelihood Estimation (MLE): in short, we approximate the probabilities using counts from the input/training vectors.

Hence we have: p(Fi | Cj) = count( Fi ^ Cj) / count(Cj)

That is, from the training corpus we take the ratio of the number of occurrences of the feature Fi and the label Cj together to the total number of occurrences of the label Cj.

Zero Probability Problem

What if we have never seen a particular feature Fa together with a particular label Cb in the training dataset? Whenever they occur together in the test data, p(Fa | Cb) will be zero, and hence the overall product will also be zero. This is a problem with maximum likelihood estimates: just because a particular observation was not made during training does not mean it will never occur in the test data. To remedy this issue we use what is known as smoothing. The simplest kind, used in this code, is called “add-one smoothing”. Essentially, the probability of an unseen event should be greater than zero, so we add one to each count. The net effect is that we redistribute some of the probability mass from the observed events to the unseen ones. To keep the total probability mass at 1, we also need to increase the total count for each label by the number of possible observations.

For example, if we have two classes C = 0 and C = 1, then after smoothing the MLE probabilities can be written as:

p-smoothed(Fi | Cj) = [count(Fi ^ Cj) + 1] / [count(Cj) + N], where N is the total number of possible observations (distinct feature values) across all features.
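
As an illustrative example (with made-up numbers): suppose the label Cj appears 10 times in the training data and there are N = 12 possible feature values in total. A feature value never seen with Cj then gets p-smoothed = (0 + 1) / (10 + 12) ≈ 0.045 rather than 0, while one seen together with Cj 4 times gets (4 + 1) / (10 + 12) ≈ 0.23.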

Code

For simplicity, we will use Weka’s ARFF file format as input. We have a single class called Model, which has a few dictionaries and lists to store the counts and feature vector details. In this implementation, we only deal with discrete-valued features.

The dictionary ‘features’ stores all possible values for each feature. ‘featureNameList’ is simply a list containing the names of the features in the same order in which they appear in the ARFF file; this is needed because our features dictionary has no intrinsic order, and we need to maintain the feature order explicitly. ‘featureCounts’ contains the actual counts of co-occurrences of each feature value with each label value. The keys of this dictionary are tuples of the form (class_label, feature_name, feature_value); hence, if we have observed the feature F1 with the value ‘x’ for the label ‘yes’ fifteen times, the count for those observations is stored under the key (‘yes’, ‘F1’, ‘x’). Note that the default value for counts in this dictionary is 1 instead of 0, because we are smoothing the counts. The ‘featureVectors’ list contains all the input feature vectors from the ARFF file; the last feature in each vector is the class label itself, as is the convention with Weka ARFF files. Finally, ‘labelCounts’ stores the counts of the class labels themselves, i.e., how many times we saw the label Ci during training.
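
A minimal sketch of how these data structures might be set up in Python; the constructor argument is an assumption, and the attribute names follow the description above:

import math                      # used later for log-domain probabilities
from collections import defaultdict

class Model:
    def __init__(self, arffFile):
        self.trainingFile = arffFile
        self.features = {}             # feature name -> list of its possible values
        self.featureNameList = []      # feature names, in ARFF file order
        self.featureCounts = defaultdict(lambda: 1)  # (label, feature, value) -> count, starts at 1 (add-one smoothing)
        self.featureVectors = []       # training vectors; the last element of each is the class label
        self.labelCounts = defaultdict(lambda: 0)    # label -> (adjusted) count of training vectors with that label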

We also have the following member functions in the Model class:

The GetValues method simply reads the feature names (including the class label), their possible values, and the feature vectors themselves, and populates the appropriate data structures defined above.
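
A sketch of what GetValues could look like, continuing the Model class above and assuming each attribute is declared as “@attribute name {v1, v2, …}”:

    def GetValues(self):
        # "@attribute name {v1, v2, ...}" lines give feature names and their
        # possible values; lines after "@data" are the training vectors.
        with open(self.trainingFile, 'r') as f:
            for line in f:
                line = line.strip().lower()
                if not line:
                    continue
                if not line.startswith('@'):
                    # a training vector; its last element is the class label
                    self.featureVectors.append([v.strip() for v in line.split(',')])
                elif line.startswith('@attribute'):
                    name = line.split()[1]
                    self.featureNameList.append(name)
                    # the possible values are listed between '{' and '}'
                    self.features[name] = [v.strip() for v in
                                           line[line.find('{') + 1: line.find('}')].split(',')]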

The TrainClassifier method simply counts the co-occurrences of each feature value with each class label and stores them under the 3-tuple keys described above. These counts are automatically smoothed with add-one smoothing, since the default count in this dictionary is 1. The label counts are also adjusted, by incrementing each of them by the total number of possible observations.
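
A sketch of TrainClassifier consistent with that description, again continuing the Model class above:

    def TrainClassifier(self):
        # Count (label, feature, value) co-occurrences; since featureCounts
        # defaults to 1, the counts come out add-one smoothed.
        for fv in self.featureVectors:
            label = fv[-1]
            self.labelCounts[label] += 1
            for i in range(len(fv) - 1):
                self.featureCounts[(label, self.featureNameList[i], fv[i])] += 1
        # Add the total number of possible observations (the N in the smoothing
        # formula) to each label count, so the smoothed estimates stay normalized.
        numPossible = sum(len(self.features[f]) for f in self.featureNameList[:-1])
        for label in self.labelCounts:
            self.labelCounts[label] += numPossible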

Finally, we have the Classify method, which accepts as its argument a single feature vector (as a list) and computes the product of the individual conditional probabilities (smoothed MLE) for each label. The computed probabilities are stored in the ‘probabilityPerLabel’ dictionary, and in the last line we return the entry from probabilityPerLabel with the highest probability. Note that the multiplication is actually done as addition in the log domain, since the numbers involved are extremely small. Also, one of the factors used in this multiplication is the prior probability of the class label.
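
And a corresponding sketch of Classify, where featureVector is assumed to hold the feature values (without the class label) in the same order as featureNameList:

    def Classify(self, featureVector):
        # Work in the log domain: log prior + sum of log conditional probabilities.
        probabilityPerLabel = {}
        totalLabelCount = sum(self.labelCounts.values())
        for label in self.labelCounts:
            logProb = math.log(float(self.labelCounts[label]) / totalLabelCount)  # (smoothed) prior
            for i, value in enumerate(featureVector):
                count = self.featureCounts[(label, self.featureNameList[i], value)]
                logProb += math.log(float(count) / self.labelCounts[label])
            probabilityPerLabel[label] = logProb
        # return the label with the largest (log) probability
        return max(probabilityPerLabel, key=probabilityPerLabel.get)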

Here is the complete code, including a test method:
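
Assembled from the sketches above, a possible test driver looks like this; ‘tennis.arff’ is just a placeholder name for the sample ARFF file linked below:

def main():
    # train on the sample file and re-classify the training vectors as a sanity check
    model = Model('tennis.arff')   # placeholder file name
    model.GetValues()
    model.TrainClassifier()
    for vector in model.featureVectors:
        print("expected:", vector[-1], "predicted:", model.Classify(vector[:-1]))

if __name__ == '__main__':
    main()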

Download the sample ARFF file to try it out.

Update: I found a bug in the second-to-last line of the GetValues() function, which gets the possible attribute values from the ARFF file and stores them in self.features. It did not handle whitespace correctly. Update this line to:

self.features[self.featureNameList[len(self.featureNameList) - 1]] = [featureName.strip() for featureName in line[line.find('{')+1: line.find('}')].strip().split(',')]

Tech Council of MD CyberMaryland Forum, Wed AM 12/08/2010

December 3rd, 2010, by Tim Finin, posted in cybersecurity, Security, UMBC

The Tech Council of Maryland is the state’s largest technology trade association and has more than 500 members. It is sponsoring a series of meetings on cyber security:

“Understanding that the conversation about cyber security needs to continue among all stakeholders, the Tech Council of Maryland is moving its CyberMaryland Forum throughout the state. The Forum is open to anyone with an interest in the cyber security industry.”

The next CyberMaryland Forum meeting will be held this coming Wednesday morning at UMBC:

“The next meeting of the CyberMaryland Forum will be held on Wednesday December 8, 2010 from 8:30 to 11:30 am at the University of Maryland, Baltimore County. Our content will cover the latest developments in the state’s initiative to be the “Epicenter for Information Security and Innovation”, the development of the UMBC/Northrop Grumman Cyber Incubator program to help grow fledgling cyber security companies and other hot topics in the cyber security industry. To learn more about the CyberMaryland Forum, contact Mark Glazer at 240-243-4045 or mglazer@techcouncilmd.com.”

The Tech council encourages UMBC faculty, staff and students to participate and is waiving the registration fee for the UMBC community. The meeting will be held in the main conference room at UMBC’s South Campus Technology Center at 1450 South Rolling Road.

Lisp bots win Planet Wars Google AI Challenge

December 2nd, 2010, by Tim Finin, posted in Agents, AI, Games, Google, Social

[Chart: top programming languages used in Planet Wars entries]
The Planet Wars Google AI Challenge had over 4000 entries that used AI and game theory to compete against one another. C at the R-Chart blog analyzed the programming languages used by the contestants, with some interesting results.

The usual suspects were the most popular languages used: Java, C++, Python, C# and PHP. The winner, Hungarian Gábor Melis, was just one of 33 contestants who used Lisp. Even less common were entries in C, but the 18 “C hippies” did remarkably well.

Blogger C wonders if Lisp was the special sauce:

Paul Graham has stated that Java was designed for “average” programmers while other languages (like Lisp) are for good programmers. The fact that the winner of the competition wrote in Lisp seems to support this assertion. Or should we see Mr. Melis as an anomaly who happened to use Lisp for this task?

FTC proposes a do not track privacy mechanism

December 1st, 2010, by Tim Finin, posted in Privacy

Today the FTC released a preliminary staff report that proposes a “do not track” mechanism allowing consumers to opt out of data collection on online searching and browsing activities. The FTC report says that industry self-regulation efforts on privacy have been “too slow, and up to now have failed to provide adequate and meaningful protection.”

“To reduce the burden on consumers and ensure basic privacy protections, the report first recommends that “companies should adopt a ‘privacy by design’ approach by building privacy protections into their everyday business practices.” Such protections include reasonable security for consumer data, limited collection and retention of such data, and reasonable procedures to promote data accuracy. … Second, the report states, consumers should be presented with choice about collection and sharing of their data at the time and in the context in which they are making decisions – not after having to read long, complicated disclosures that they often cannot find. … One method of simplified choice the FTC staff recommends is a “Do Not Track” mechanism governing the collection of information about consumer’s Internet activity to deliver targeted advertisements and for other purposes. Consumers and industry both support increased transparency and choice for this largely invisible practice. The Commission recommends a simple, easy to use choice mechanism for consumers to opt out of the collection of information about their Internet behavior for targeted ads. The most practical method would probably involve the placement of a persistent setting, similar to a cookie, on the consumer’s browser signaling the consumer’s choices about being tracked and receiving targeted ads.”

The full text of the 120-page report, Protecting Consumer Privacy in an Era of Rapid Change: A Proposed Framework for Businesses and Policymakers, is available online.
