UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments

Archive for the 'Machine Learning' Category

Welcome to the Splogosphere: 75% of new pings are spings (splogs)

December 15th, 2005, by Pranam Kolari, posted in Blogging, GENERAL, Machine Learning, Semantic Web, Technology, Web, memeta, splog

In the blogosphere, pings are notifications sent by updated blogs to PingServers. A major issue recently has been unjustified pings, also known as Spings, sent by Splogs. Splogs have been discussed a lot recently, including an interesting thread on post piracy that Steve Rubel initiated on Micropersuasion.

The problem of splogs prompted us to analyze pings from weblogs.com, which publishes hourly pings as changes.xml. We have been collecting these pings over the last 4 weeks, for a total of 40 million pings from around 14 million (so claimed) blogs. To begin with, we fetched these blogs and applied a language identification technique implemented by James Mayfield to identify the language of each. As expected, most of the pings were from blogs authored in English, but we were able to identify blogs in many other languages as well. For instance, the charts below show the distribution of pings from blogs authored in Italian, over a day and over a week. Each bar denotes the number of pings per hour.


Pings over a day
Pings over 8 days

All times are in GMT; clearly, Italian-authored blogs display a specific blogging pattern.

In the next step, we used our work on splog detection to detect splogs (and hence spings) among the English blogs. Our detection mechanism is close to 90% accurate. As shown in the charts below, pings from blogs average around 8K per hour and those from splogs average around 25K per hour.


Blog Pings
Splog Pings

Clearly almost 3 out of 4 pings are spings! Going back further to the source of these spings, we observed that more than 50% of claimed blogs pinging weblogs.com are splogs.
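As a quick check on that ratio, here is a minimal sketch that uses only the rounded hourly averages quoted above. The function name is ours, chosen for illustration, and the inputs are approximations rather than the exact counts behind the charts.

```python
# Rounded hourly averages quoted in the post (approximate figures).
blog_pings_per_hour = 8_000
splog_pings_per_hour = 25_000


def sping_share(blog_pings: float, splog_pings: float) -> float:
    """Return the fraction of all pings that come from splogs."""
    total = blog_pings + splog_pings
    if total == 0:
        raise ValueError("no pings to compare")
    return splog_pings / total


share = sping_share(blog_pings_per_hour, splog_pings_per_hour)
print(f"{share:.1%} of pings are spings")  # prints "75.8% of pings are spings"
```

With these rounded inputs the share comes out just above 75%, which matches the "3 out of 4" figure.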

Given how interesting these preliminary statistics are, the scope for further analysis, and the interest in the resulting dataset, we decided to continuously monitor the pingosphere. We now do it “live” on the updated blogs published by weblogs.com (delayed by an hour), and have made the results publicly available at http://memeta.umbc.edu. The site lists blogging patterns for many other languages, and compares splogs with blogs. All of our work is part of a larger project, memeta, aimed at analyzing the content and structure of the blogosphere.

We hope our effort is a good complement to existing services (e.g., FightSplog, SplogReporter and SplogSpot) towards combating splogs. We currently publish only simple ping statistics on this site, but do stay tuned for fresh splog and classified blog dumps and much more!

UPDATE: Matthew Hurst from BlogPulse points us to an interesting analysis he has done on a day of weblogs.com pings.

Senate Cuts DARPA Cognitive Computing program

October 21st, 2005, by Tim Finin, posted in AI, Funding, KR, Machine Learning

Peter Harsha reports that the Senate Appropriations Committee included language in the Senate version of the FY 06 Defense Appropriations bill that strips $55M from DARPA’s Cognitive Computing program, specifically “Learning, Reasoning, and Integrated Cognitive Systems”. That’s a 50% cut in the program. Peter points out that this runs counter to recent congressional sentiment that the role of computer science, especially university-led fundamental computer science, should be strengthened at DARPA.

SVMs for the Blogosphere: Blog Identification and Splog Detection

October 19th, 2005, by Tim Finin, posted in Blogging, Machine Learning, Semantic Web, Web

There’s been a lot of talk about splogs lately (e.g., here, there and everywhere). There was even a note in the Washington Post’s Computer Security blog today. We recently finished a paper on using SVMs to recognize splogs:

Pranam Kolari, Tim Finin, and Anupam Joshi, SVMs for the Blogosphere: Blog Identification and Splog Detection, TR-CS-05-13, Computer Science and Electrical Engineering, University of Maryland, Baltimore County, 8 October 2005.

The paper compares results using different feature sets for the task of splog recognition as well as some other simple tasks. We’ve submitted this to the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs.

Cfengine as an adaptive autonomous agent

May 28th, 2005, by Tim Finin, posted in AI, Agents, Machine Learning

Cfengine is a configuration management tool that is widely used to manage networks of Unix systems. It was originally developed at the University of Oslo in 1993. I’ve only been dimly aware of it and assumed it was yet another common system administration tool for Unix. I was surprised to see how it’s described on the Cfengine site:

“About Cfengine: Cfengine, or the configuration engine is an autonomous agent and a middle to high level policy language and agent for building expert systems to administrate and configure large computer networks. Cfengine is designed to be a part of a computer immune system. It is ideal for cluster management and has been adopted for use all over the world in small and huge organizations alike.”

The developers have evolved their approach to use a biologically inspired immunity model and have a recent paper in the Machine Learning Journal.

How TiVo does its collaborative filtering

February 18th, 2005, by Harry Chen, posted in Machine Learning, Technology Impact

There is an interesting paper that describes how TiVo computes its recording recommendations.

The abstract:

We describe the TiVo television show collaborative recommendation system which has been fielded in over one million TiVo clients for four years. Over this install base, TiVo currently has approximately 100 million ratings by users over approximately 30,000 distinct TV shows and movies. TiVo uses an item-item (show to show) form of collaborative filtering which obviates the need to keep any persistent memory of each user's viewing preferences at the TiVo server. Taking advantage of TiVo's client-server architecture has produced a novel collaborative filtering system in which the server does a minimum of work and most work is delegated to the numerous clients. Nevertheless, the server-side processing is also highly scalable and parallelizable. Although we have not performed formal empirical evaluations of its accuracy, internal studies have shown its recommendations to be useful even for multiple user households. TiVo's architecture also allows for throttling of the server so if more server-side resources become available, more correlations can be computed on the server allowing TiVo to make recommendations for niche audiences.

See PVRBLog
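To make the item-item idea in the abstract concrete, here is a minimal sketch, not TiVo's actual implementation. It assumes a Pearson correlation between shows and a correlation-weighted average for prediction; the ratings, show names, rating scale and function names are all invented for illustration. The split mirrors the abstract loosely: show-to-show correlations could be computed centrally, while the prediction step needs only one user's own ratings.

```python
from math import sqrt

# Hypothetical ratings (user -> {show: rating}), invented for illustration.
ratings = {
    "u1": {"News": 3, "Cooking": 2, "Sci-Fi": -2},
    "u2": {"News": 2, "Cooking": 3, "Sci-Fi": -3},
    "u3": {"News": -2, "Cooking": -1, "Sci-Fi": 3},
    "u4": {"News": -3, "Sci-Fi": 2},
}


def pearson(show_a, show_b, ratings):
    """Pearson correlation between two shows over users who rated both."""
    pairs = [(r[show_a], r[show_b]) for r in ratings.values()
             if show_a in r and show_b in r]
    n = len(pairs)
    if n < 2:
        return 0.0  # not enough overlap to say anything
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def predict(user_ratings, target, ratings):
    """Correlation-weighted average of one user's own ratings."""
    num = den = 0.0
    for show, rating in user_ratings.items():
        if show == target:
            continue
        weight = pearson(show, target, ratings)
        num += weight * rating
        den += abs(weight)
    return num / den if den else 0.0


# u4 dislikes News and likes Sci-Fi, so the Cooking prediction is negative.
print(predict(ratings["u4"], "Cooking", ratings))
```

In this toy data, News and Cooking are rated alike while Sci-Fi is rated the opposite way, so a viewer who dislikes News and likes Sci-Fi gets a negative score for Cooking.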

Using Google to learn the meanings of words

February 15th, 2005, by Harry Chen, posted in Machine Learning, Ontologies, Web

The Web is the largest database on Earth, and Google has the largest index of this database. Two researchers at the University of Amsterdam have proposed a new system that uses Google search to learn and distinguish the meanings of words.

Their work is based on the theory that the meaning of a word can usually be gleaned from the words used around it. Take the word “rider”. Its meaning can be deduced from the fact that it is often found close to words like “horse” and “saddle”.

Instead of relying on a common sense knowledge base such as Cyc, the researchers use Google search to measure how closely two words relate to each other.

To do this, it needs to build a word tree – a database of how words relate to each other. It might start off with any two words to see how they relate to each other. For example, if it googles “hat” and “head” together it gets nearly 9 million hits, compared to, say, fewer than half a million hits for “hat” and “banana”. Clearly “hat” and “head” are more closely related than “hat” and “banana”.

To gauge just how closely, Vitanyi and Cilibrasi have developed a statistical indicator based on these hit counts that gives a measure of a logical distance separating a pair of words. They call this the normalised Google distance, or NGD. The lower the NGD, the more closely the words are related.
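The post does not spell out the formula, so the sketch below uses the NGD definition from Cilibrasi and Vitanyi's paper as we understand it: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f counts search hits and N is the number of indexed pages. Only the two joint counts are rounded from the figures quoted above; the single-word counts and the value of N are made-up assumptions chosen for illustration.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalised Google distance computed from hit counts.

    fx, fy: hits for each term alone; fxy: hits for both terms together;
    n: assumed total number of indexed pages.
    """
    if min(fx, fy, fxy) <= 0:
        return math.inf  # a term (or the pair) never occurs: treat as unrelated
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


N = 8e9  # assumed index size, not taken from the post

# Single-word counts are hypothetical; joint counts are rounded from the post.
hits = {"hat": 50e6, "head": 400e6, "banana": 30e6}
hat_head = ngd(hits["hat"], hits["head"], 9e6, N)
hat_banana = ngd(hits["hat"], hits["banana"], 0.5e6, N)

print(f"NGD(hat, head)   = {hat_head:.3f}")
print(f"NGD(hat, banana) = {hat_banana:.3f}")
```

Under these assumed counts, "hat" and "head" come out closer (lower NGD) than "hat" and "banana", in line with the example in the article.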

See also: “Google’s search for meaning”, New Scientist.

Adaptive middle agents for service matching

November 10th, 2004, by Tim Finin, posted in Agents, Machine Learning

Xiaocheng Luan’s Ph.D. dissertation on a quantitative approach to matching service requests against capability descriptions is now available online.

Xiaocheng Luan, Adaptive Middle Agent for Service Matching in the Semantic Web: A Quantitative Approach, Ph.D. dissertation, Computer Science and Electrical Engineering, University of Maryland, Baltimore County, November 01, 2004.

In Dr. Luan’s approach, middle agents establish and refine an agent’s capability model based on the domain ontology and through the interactions with the agents. An agent’s performance history is considered as an integral part of the agent’s capability model and the agent’s strong and weak areas can also be revealed. The dynamically captured and updated service distribution in the service domain is considered as an important factor in service matching. Service matching here is carried out in two steps. In the first step, candidates are selected through the semantic service description matching. In the second step, the performance rating of each candidate with respect to the specific request is estimated based on the agent’s capability model, and the candidates with the highest estimated performance ratings will be selected. Statistics collected from evaluation experiments show a significant improvement over typical service matching methods in terms of the accuracy in selecting the best service provider(s) for each request.
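The two-step matching described above can be sketched roughly as follows. This is an illustration under strong simplifying assumptions, not the dissertation's method: semantic description matching is replaced by a plain set-membership test, the capability model is reduced to a per-category list of past ratings, and every name here (Provider, match, prior) is hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Provider:
    name: str
    advertised: set  # service categories the agent claims to offer
    history: dict = field(default_factory=dict)  # category -> past ratings in [0, 1]


def match(request_category, providers, top_k=1, prior=0.5):
    """Select the top_k providers for a request in two steps."""
    # Step 1: keep only providers whose description covers the request
    # (a stand-in for semantic service description matching).
    candidates = [p for p in providers if request_category in p.advertised]

    # Step 2: estimate each candidate's performance for this kind of request
    # from its history, falling back to a neutral prior when there is none.
    def estimate(p):
        past = p.history.get(request_category, [])
        return mean(past) if past else prior

    return sorted(candidates, key=estimate, reverse=True)[:top_k]


providers = [
    Provider("A", {"translation"}, {"translation": [0.9, 0.8]}),
    Provider("B", {"translation", "booking"},
             {"translation": [0.4], "booking": [0.95]}),
    Provider("C", {"booking"}),
]

print([p.name for p in match("translation", providers)])
print([p.name for p in match("booking", providers, top_k=2)])
```

Keeping the history per category is what lets the same agent rank high for one kind of request and low for another, which is the "strong and weak areas" idea in the summary.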
