UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
06 July 2008, 22:17:53 EDT  
2007 March

Archive for March, 2007

One billion spam comments

March 23rd, 2007, by Tim Finin, posted in Uncategorized

Akismet is closing in on identifying a billion spam comments. I hope they capture the one that puts them over the top for posterity.

links for 2007-03-22

March 22nd, 2007, by Tim Finin, posted in Uncategorized

Oracle 11g to support some OWL inferencing

March 22nd, 2007, by Tim Finin, posted in Uncategorized

I noticed in Seth Ladd’s Semergence blog that the next version of Oracle’s RDF Database (11g) is expected to have native inferencing for a subset of OWL. This is in addition to faster querying and bulk-loading and “new SQL operators for enhancing a relational query using an ontology”. See this thread in Oracle’s Semantic Technologies Forum.

According to a recent presentation on 10g, the native OWL inferencing will include:

  • Basics: class, subclass, property, subproperty, domain, range, type
  • Property Characteristics: transitive, symmetric, functional, inverse functional, inverse
  • Class comparisons: equivalence, disjointness
  • Property comparisons: equivalence
  • Individual comparisons: same, different
  • Class expressions: complement
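To make a few of the listed features concrete, here is a minimal sketch of forward-chaining inference over subclass, type, inverse and transitive-property statements. It is a toy in plain Python with simplified term names (no real URIs); the function `infer` and the sample data are made up for illustration and say nothing about how Oracle implements its inferencing.

```python
# Toy forward-chaining over a few RDFS/OWL-style rules.
# Illustrative only: simplified names, not Oracle's implementation.

def infer(triples):
    """Return the closure of `triples` under four simple rules."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p == "subClassOf":
                # subClassOf is transitive.
                new |= {(s, "subClassOf", o2) for s2, p2, o2 in facts
                        if p2 == "subClassOf" and s2 == o}
                # Instances of the subclass are instances of the superclass.
                new |= {(x, "type", o) for x, p2, c in facts
                        if p2 == "type" and c == s}
            elif p == "inverseOf":
                # s inverseOf o: (a s b) implies (b o a), and vice versa.
                new |= {(b, o, a) for a, q, b in facts if q == s}
                new |= {(b, s, a) for a, q, b in facts if q == o}
            elif p == "type" and o == "TransitiveProperty":
                # (a s b) and (b s c) imply (a s c).
                new |= {(a, s, c) for a, q, b in facts if q == s
                        for b2, q2, c in facts if q2 == s and b2 == b}
        if new <= facts:  # nothing new was derived, so we are done
            return facts
        facts |= new

facts = infer({
    ("Student", "subClassOf", "Person"),
    ("Person", "subClassOf", "Agent"),
    ("alice", "type", "Student"),
    ("hasAdvisor", "inverseOf", "advises"),
    ("alice", "hasAdvisor", "bob"),
    ("ancestorOf", "type", "TransitiveProperty"),
    ("ann", "ancestorOf", "ben"),
    ("ben", "ancestorOf", "cat"),
})
```

After running this, `facts` also contains derived statements such as `("alice", "type", "Agent")`, `("bob", "advises", "alice")` and `("ann", "ancestorOf", "cat")`.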

We’ve not yet tried 10g, but it’s on our short list of things to do. I guess this task just moved up in the list.

links for 2007-03-21

March 21st, 2007, by Tim Finin, posted in Uncategorized

Why the Semantic Web will fail, NOT.

March 21st, 2007, by Tim Finin, posted in Uncategorized

Slashdot has a post today titled Why the Semantic Web Will Fail that points to a post by Stephen Downes with the same name. His argument is based on the belief that “The Semantic Web will never work because it depends on businesses working together, on them cooperating.” He says:

“But the big problem is they believed everyone would work together:

  • would agree on web standards (hah!)
  • would adopt a common vocabulary (you don’t say)
  • would reliably expose their APIs so anyone could use them (as if)”

While the argument Stephen makes is grounded in his distrust of corporations, his second point above is off the mark, at least for RDF.

One of the features of the W3C’s model (based on RDF) is that it doesn’t push the idea that everyone should adopt the same vocabulary (or ontology) for a topic or domain. Instead it offers a way to publish vocabularies with some semantics, including how terms in one vocabulary relate to terms in another. In addition, the framework makes it trivial to publish data in which you mix vocabularies, making statements about a person, for example, using terms drawn from FOAF, Dublin Core and others.
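As a rough illustration of mixing vocabularies, the sketch below describes one person with terms from both FOAF and Dublin Core, using plain Python tuples as triples. The `http://example.org/` namespace, the sample values and the helper `vocabularies` are hypothetical; only the FOAF and Dublin Core namespace URIs are the real ones.

```python
# Triples as (subject, predicate, object) tuples with full URIs.
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/"  # hypothetical namespace for illustration

triples = [
    (EX + "alice", FOAF + "name", "Alice Example"),
    (EX + "alice", FOAF + "homepage", EX + "alice/home"),
    (EX + "alice", DC + "description", "Researcher and occasional blogger"),
    (EX + "report1", DC + "title", "A Sample Report"),
    (EX + "report1", DC + "creator", EX + "alice"),
]

def vocabularies(triples):
    """Return the known namespaces that appear in predicate position."""
    known = (FOAF, DC)
    return {ns for _, pred, _ in triples for ns in known if pred.startswith(ns)}

used = vocabularies(triples)  # both FOAF and DC appear in the same data
```

Nothing special is needed to combine the two: each predicate is just a URI, so terms from different vocabularies sit side by side in the same set of statements.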

The RDF approach was designed with interoperability and extensibility in mind, unlike many other approaches. RDF is seeing increasing adoption, appearing in products by Oracle, Adobe and Microsoft, for example.

If this approach doesn’t continue to flourish and help realize the envisioned “web of data”, and it might not after all, it will have left some key concepts, tested and explored, on the table for the next push. IMHO, the ‘semantic web’ vision (a web of data for machines and their users) is inevitable.

John Backus passes away

March 20th, 2007, by Tim Finin, posted in Uncategorized

John Backus, inventor of FORTRAN and BNF, functional programming advocate and winner of the 1977 Turing Award, has passed away. He was 82. The New York Times published his obituary today.

links for 2007-03-20

March 20th, 2007, by Tim Finin, posted in Uncategorized

Who created the first blog?

March 20th, 2007, by Tim Finin, posted in Uncategorized

Declan McCullagh and Anne Broache have an article on CNET that explores the question: who created the first blog?

“It may not be one of the Internet’s grandest accomplishments, but with the number of active bloggers hovering somewhere around 100 million, according to one estimate, there are some serious bragging rights to be claimed by the first person who provably laid fingers to keyboard in the traditional bloggy way.”

The article mentions some who come immediately to mind:

Was the first blogger the irascible Dave Winer? The iconoclastic Jorn Barger? Or was the first blogger really Justin Hall, a Web diarist and online gaming expert whom The New York Times Magazine once called the “founding father of personal blogging”? Or did all three merely make incremental improvements on earlier proto-blogs? The answer is most likely “yes” to all of the above. In truth, awarding the title “first blogger”

and also explores some earlier roots, like finger and .plan files. Those last two are interesting connections and make sense. Kind of. In my experience, the vast majority of people who used .plan files used them to document their generic schedules and availability, rather than to contemporaneously document their activities.

It’s a good article, overall.

Warsaw University wins 2007 ACM programming contest

March 19th, 2007, by Tim Finin, posted in Uncategorized

The results of the finals for the 2007 ACM International Collegiate Programming Contest are in, with Warsaw University, Tsinghua University, St. Petersburg University of IT, Mechanics and Optics, and MIT placing first through fourth.

The contest has been running since the 1970s and is generally recognized as the oldest, largest and most prestigious programming contest in the world. This year over 6000 teams began the multi-tiered competition, with 88 teams in the finals at Maihama, Japan. See the final problems that the teams had to solve and the final team standings.

How Google separates the blogs from the splogs

March 19th, 2007, by Tim Finin, posted in Uncategorized

Google’s patent application (filed 13 September 2005) for Ranking blog documents is being discussed around the web.

“A blog search engine may receive a search query. The blog search engine may determine scores for a group of blog documents in response to the search query, where the scores are based on a relevance of the group of blog documents to the search query and a quality of the group of blog documents. The blog search engine may also provide information regarding the group of blog documents based on the determined scores.”

The Google Operating System blog has a nice summary of the features Google mentions as useful in separating the blogs from the splogs. No surprises here.

Positive features:

  • links from blogrolls (especially from high-quality blogrolls or blogrolls of “trusted bloggers”)
  • links from other sources (mail, chats)
  • using tags to categorize a post
  • PageRank
  • the number of feed subscriptions (from feed readers)
  • clicks in search results

Negative features:

  • posts added at a predictable time
  • different content between the site and the feed
  • the amount of duplicate content
  • using words/n-grams that appear frequently in spam blogs
  • posts that have identical size
  • linking to a single web page
  • a large number of ads
  • the location of ads (“the presence of ads in the recent posts part of a blog”)
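To see how features like these could feed a ranking, here is a minimal sketch that folds a handful of them into a quality score and adds it to a relevance score. Everything specific here is an assumption made for illustration: the `BlogSignals` fields, the weights, the caps and the additive combination are invented, and the patent application as quoted above does not say how relevance and quality are combined.

```python
from dataclasses import dataclass

@dataclass
class BlogSignals:
    """A few of the listed features, as hypothetical numeric signals."""
    blogroll_links: int = 0
    feed_subscriptions: int = 0
    uses_tags: bool = False
    duplicate_fraction: float = 0.0  # share of content duplicated elsewhere
    ad_count: int = 0
    posts_at_fixed_times: bool = False

def quality(s):
    """Made-up weights: positive signals add, negative signals subtract."""
    score = 0.0
    score += 0.5 * min(s.blogroll_links, 10)
    score += 0.1 * min(s.feed_subscriptions, 50)
    score += 1.0 if s.uses_tags else 0.0
    score -= 5.0 * s.duplicate_fraction
    score -= 0.5 * min(s.ad_count, 10)
    score -= 2.0 if s.posts_at_fixed_times else 0.0
    return score

def rank(candidates):
    """Sort (name, relevance, signals) tuples by relevance plus quality."""
    return sorted(candidates, key=lambda c: c[1] + quality(c[2]), reverse=True)

blog = BlogSignals(blogroll_links=4, feed_subscriptions=20, uses_tags=True)
splog = BlogSignals(duplicate_fraction=0.9, ad_count=8, posts_at_fixed_times=True)
ranked = rank([("splog", 1.0, splog), ("blog", 1.0, blog)])
```

With equal relevance, the blog with blogroll links, subscribers and tags sorts ahead of the ad-heavy, duplicate-laden one.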

Spotted on Micro Persuasion.

MSM Citations in Republican, Democrat Blogs

March 19th, 2007, by Akshay Java, posted in Uncategorized

A number of qualitative and quantitative analyses of Main Stream Media (MSM) sources have caused heated debates about bias and trustworthiness (or, in Stephen Colbert’s lingo, shall we say “truthiness”? ;-) ) in MSM. Commentary on news and current affairs was once the exclusive prerogative of a handful of political analysts on news channels and sites. Today, blogs and citizen journalism are the new form of punditry. Their importance is also being recognized by some of the 2008 presidential aspirants.

So, the question is: which MSM sources are going to play an important role in the blogosphere during the election year? To analyze this we first look at the most cited MSM sources from the ICWSM dataset shown on the right (the complete list is here). Next, we use a list of 113 Republican and 144 Democrat blogs. This list was compiled using a data set provided by Dr. Lada Adamic and by querying Technorati. We count the number of citations for MSM in each of the sets. The MSM sources most frequently cited by Democrat and Republican blogs are as follows:

These counts also include multiple citations from each blog. We would like to rank the list in a more meaningful way that would indicate how “influential” an MSM source is for a particular group. To do this, we first use KL-divergence-based scoring to find the difference in the distribution of citations of MSM sources in the two groups. For example, an MSM source would have a high score in the Democrat MSM listing (see below) if it has a high probability of being linked to by each of the Democrat blogs in the set while having a low chance of being linked to by Republican blogs (and vice versa for the Republican set). We also modify the scoring function to give importance to citations from multiple distinct blogs (vs. many links from a single blog). The final scoring function produces a ranked list of MSM sources based on preference of being linked to by either Republican or Democrat blogs. This shows some interesting results (complete list here and here):


Of course, this does not explain any bias of the MSM source itself, but provides a good indication of sources that might influence Republicans and Democrats. Here is a question for our readers: “Bias seems to be quite subjective, according to the side of the political spectrum one may associate with. Do you think it would even be possible to agree on the neutrality of an MSM source?”
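For readers who want to experiment, the scoring idea described above can be sketched roughly as follows. This is one possible reading, not the exact function used for the results in this post: the source names and blog data are made up, the add-one smoothing (`alpha`) is an assumption, and counting each blog at most once per source is just one simple way to favor citations from multiple distinct blogs over many links from a single blog.

```python
import math
from collections import Counter

def citation_distribution(blog_citations, sources, alpha=1.0):
    """Distribution over sources, counting each blog once per source.

    blog_citations maps a blog name to the list of sources it links to
    (repeats allowed). alpha is an assumed smoothing constant.
    """
    counts = Counter()
    for cited in blog_citations.values():
        counts.update(set(cited) & set(sources))  # distinct blogs only
    total = sum(counts[s] + alpha for s in sources)
    return {s: (counts[s] + alpha) / total for s in sources}

def kl_scores(p, q):
    """Pointwise terms p(s) * log(p(s) / q(s)); they sum to KL(p || q)."""
    return {s: p[s] * math.log(p[s] / q[s]) for s in p}

# Hypothetical data: placeholder source names, not real results.
sources = ["source-a", "source-b", "source-c"]
dem_blogs = {"d1": ["source-a", "source-a", "source-b"],
             "d2": ["source-a"],
             "d3": ["source-a", "source-c"]}
rep_blogs = {"r1": ["source-b", "source-b", "source-b"],
             "r2": ["source-b", "source-c"],
             "r3": ["source-b"]}

p_dem = citation_distribution(dem_blogs, sources)
p_rep = citation_distribution(rep_blogs, sources)
dem_scores = kl_scores(p_dem, p_rep)  # high = distinctively Democrat-cited
rep_scores = kl_scores(p_rep, p_dem)  # high = distinctively Republican-cited
```

Sorting `dem_scores` or `rep_scores` in descending order gives a ranked list for each group. A source cited by many blogs in one group and few in the other gets a large positive term, while a source cited about equally by both scores near zero.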

Limitations

  1. The popularity of the blog that links to the MSM is not considered here, and it would be useful to incorporate it.
  2. The results are limited to a small sample of Republican/Democrat blogs.
  3. No content analysis was performed and results are solely based on citations.
  4. The presence of a link does not always indicate influence and we need to use Link Polarity to improve the scoring function.
  5. There is scope for improvement in the ranking function itself. But I think it’s a first-order approximation (to rank distinctively Democrat vs. Republican MSM preferences).

Conclusions

MSM is influential and each community shows selective preferences for different sources. Some of the sources categorized as MSM in the dataset have an almost blog-like quality. As people rely on blogs for information and opinions, the indirect influence of MSM sources (and perhaps their biases) cannot be ignored. While blogs and MSM seem to have an almost symbiotic relationship, (IMHO) this election season might see fierce competition between the two.

[Acknowledgment: Buzzmetrics for the dataset, Dr. Lada Adamic for the Republican/Democrat labels]

Honeyblogs lure suckers to known spam domains

March 19th, 2007, by Tim Finin, posted in Uncategorized

One interesting aspect of web spam is that the suckers include both web searchers and advertisers. The goal of spammers is to get the two together and watch the clicking.

A new article in the NYT, Researchers Track Down a Plague of Fake Web Pages, discusses results by a team of researchers from Microsoft and UC Davis.

“Tens of thousands of junk Web pages, created only to lure search-engine users to advertisements, are proliferating like billboards strung along freeways. Now Microsoft researchers say they have traced the companies and techniques behind them.
     A technical paper published by the researchers says the links promoting such pages are generated by a small group of shadowy operators apparently with the acquiescence of some major advertisers, Web page hosts and advertising syndicators. … The finding is striking because it hints at the possibility of curbing the practice.
     The researchers uncovered a complex scheme in which a small group, creating false doorway pages, works with operators of Web-based computers who profit by redirecting traffic passed from search engines in one direction and then sending advertisements acquired from syndicators in the opposite direction.”

The researchers will present their paper on redirection spamming at WWW-2007:

Spam Double-Funnel: Connecting Web Spammers with Advertisers. Yi-min Wang, Ming Ma, Yuan Niu, and Hao Chen. To appear in Proceedings of the 16th International World Wide Web Conference (WWW2007).

Abstract: Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords – one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.
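The core observation in the abstract, that spam pages can be identified by the third-party domains they redirect to, can be illustrated with a small sketch. This is not the paper's methodology: the function names, the `min_pages` threshold and the sample URLs are hypothetical, and the sketch assumes the (doorway page, redirect target) pairs have already been collected somehow.

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_redirect_domain(redirects):
    """Map each redirect-target domain to the doorway pages leading to it."""
    groups = defaultdict(set)
    for doorway_url, target_url in redirects:
        groups[urlparse(target_url).hostname].add(doorway_url)
    return groups

def prominent_domains(redirects, min_pages=2):
    """Target domains reached from at least `min_pages` doorway pages."""
    groups = group_by_redirect_domain(redirects)
    hits = [d for d, pages in groups.items() if len(pages) >= min_pages]
    return sorted(hits, key=lambda d: len(groups[d]), reverse=True)

# Hypothetical observations: (doorway page, where it redirected to).
observed = [
    ("http://doorway1.example/cheap-widgets", "http://redir.example.net/a"),
    ("http://doorway2.example/best-widgets", "http://redir.example.net/b"),
    ("http://blog.example.org/post", "http://other.example.com/"),
]
suspects = prominent_domains(observed)
```

A domain that many unrelated doorway pages funnel traffic to stands out quickly in this kind of grouping, which is the intuition behind analyzing redirection targets rather than page content.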

A related paper on spamming in forums (ppt presentation) also uses a “context-based” approach that focuses on link structure and redirection.

You are currently browsing the UMBC ebiquity weblog archives for March, 2007.
