One billion spam comments

March 23rd, 2007

Akismet is closing in on identifying a billion spam comments. I hope they capture the one that puts them over the top for posterity.

Oracle 11g to support some OWL inferencing

March 22nd, 2007

I noticed in Seth Ladd’s Semergence blog that the next version of Oracle’s RDF Database (11g) is expected to have native inferencing for a subset of OWL. This is in addition to faster querying and bulk-loading and “new SQL operators for enhancing a relational query using an ontology”. See this thread in Oracle’s Semantic Technologies Forum.

According to a recent presentation on 11g, the native OWL inferencing will include:

  • Basics: class, subclass, property, subproperty, domain,
    range, type
  • Property Characteristics: transitive, symmetric, functional, inverse functional, inverse
  • Class comparisons: equivalence, disjointness
  • Property comparisons: equivalence
  • Individual comparisons: same, different
  • Class expressions: complement
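Most of these constructs reduce to simple graph rules. Subclass inference, for instance, is just a transitive closure over subClassOf edges. A minimal stdlib-only Python sketch (the class hierarchy is an invented example, not Oracle's API):

```python
# Transitive-closure sketch of rdfs:subClassOf inference.

def subclass_closure(edges):
    """Given asserted (sub, super) pairs, return all entailed (sub, super) pairs."""
    supers = {}
    for sub, sup in edges:
        supers.setdefault(sub, set()).add(sup)

    def ancestors(cls, seen=None):
        seen = seen if seen is not None else set()
        for sup in supers.get(cls, ()):  # walk upward, avoiding cycles
            if sup not in seen:
                seen.add(sup)
                ancestors(sup, seen)
        return seen

    return {(c, a) for c in supers for a in ancestors(c)}

asserted = [("Student", "Person"), ("Person", "Agent"), ("Agent", "Thing")]
inferred = subclass_closure(asserted)
# ("Student", "Thing") is entailed even though it was never asserted
```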

We’ve not yet tried 10g, but it’s on our short list of things to do. I guess this task just moved up in the list.

links for 2007-03-22

March 22nd, 2007

Why the Semantic Web will fail, NOT.

March 21st, 2007

Slashdot has a post today titled Why the Semantic Web Will Fail that points to a post by Stephen Downes with the same name. His argument is based on the belief that “The Semantic Web will never work because it depends on businesses working together, on them cooperating.” He says:

“But the big problem is they believed everyone would work together:

  • would agree on web standards (hah!)
  • would adopt a common vocabulary (you don’t say)
  • would reliably expose their APIs so anyone could use them (as if)”

While the argument Stephen makes is grounded in his distrust of corporations, his second point above is off the mark, at least for RDF.

One of the features of the W3C’s model (based on RDF) is that it doesn’t push the idea that everyone should adopt the same vocabulary (or ontology) for a topic or domain. Instead it offers a way to publish vocabularies with some semantics, including how terms in one vocabulary relate to terms in another. In addition, the framework makes it trivial to publish data in which you mix vocabularies, making statements about a person, for example, using terms drawn from FOAF, Dublin Core and others.
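To make that mixing concrete: a single resource can freely combine terms from several published vocabularies. A hypothetical sketch in Python, with triples as plain tuples (the namespace URIs are the real FOAF and Dublin Core prefixes; the person and property values are invented):

```python
# One person described with terms drawn from two vocabularies.
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"

me = "http://example.org/people/alice"
triples = [
    (me, FOAF + "name", "Alice Example"),
    (me, FOAF + "homepage", "http://example.org/~alice"),
    (me, DC + "description", "Occasional blogger"),
]

# A consumer that only understands FOAF can still use the FOAF triples
# and simply ignore the rest -- no prior agreement on one vocabulary needed.
foaf_facts = [t for t in triples if t[1].startswith(FOAF)]
```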

The RDF approach was designed with interoperability and extensibility in mind, unlike many other approaches. RDF is also seeing increasing adoption, showing up in products from Oracle, Adobe and Microsoft, for example.

If this approach doesn’t continue to flourish and help realize the envisioned “web of data”, and it might not after all, it will have left some key concepts, tested and explored, on the table for the next push. IMHO, the ‘semantic web’ vision, a web of data for machines and their users, is inevitable.

links for 2007-03-21

March 21st, 2007

John Backus passes away

March 20th, 2007

John Backus, inventor of FORTRAN and BNF, functional programming advocate and winner of the 1977 Turing Award, has passed away. He was 82. The New York Times published his obituary today.

Who created the first blog?

March 20th, 2007

Declan McCullagh and Anne Broache have an article on CNET that explores the question of who created the first blog.

“It may not be one of the Internet’s grandest accomplishments, but with the number of active bloggers hovering somewhere around 100 million, according to one estimate, there are some serious bragging rights to be claimed by the first person who provably laid fingers to keyboard in the traditional bloggy way.”

The article mentions some who come immediately to mind:

Was the first blogger the irascible Dave Winer? The iconoclastic Jorn Barger? Or was the first blogger really Justin Hall, a Web diarist and online gaming expert whom The New York Times Magazine once called the “founding father of personal blogging”? Or did all three merely make incremental improvements on earlier proto-blogs? The answer is most likely “yes” to all of the above. In truth, awarding the title “first blogger” …

and also explores some earlier roots, like finger and .plan files. Those last two are interesting connections and make sense, kind of. In my experience, though, the vast majority of people who used .plan files used them to document their general schedules and availability rather than to contemporaneously document their activities.

It’s a good article, overall.

links for 2007-03-20

March 20th, 2007

How Google separates the blogs from the splogs

March 19th, 2007

Google’s patent application (filed 13 September 2005) for Ranking blog documents is being discussed around the web.

“A blog search engine may receive a search query. The blog search engine may determine scores for a group of blog documents in response to the search query, where the scores are based on a relevance of the group of blog documents to the search query and a quality of the group of blog documents. The blog search engine may also provide information regarding the group of blog documents based on the determined scores.”

The Google Operating System blog has a nice summary of the features Google mentions as useful in separating the blogs from the splogs. No surprises here.

Positive features:

  • links from blogrolls (especially from high-quality blogrolls or blogrolls of “trusted bloggers”)
  • links from other sources (mail, chats)
  • using tags to categorize a post
  • PageRank
  • the number of feed subscriptions (from feed readers)
  • clicks in search results

Negative features:

  • posts added at a predictable time
  • different content between the site and the feed
  • a large amount of duplicate content
  • using words/n-grams that appear frequently in spam blogs
  • posts that have identical size
  • linking to a single web page
  • a large number of ads
  • the location of ads (“the presence of ads in the recent posts part of a blog”)
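The patent’s relevance-plus-quality scoring could be caricatured as a weighted sum over such signals. A toy sketch (the feature names and weights are invented for illustration; the patent application gives no numbers):

```python
# Toy quality score combining positive and negative blog signals.
POSITIVE = {"blogroll_links": 2.0, "feed_subscriptions": 1.5, "result_clicks": 1.0}
NEGATIVE = {"duplicate_content": 2.0, "identical_post_sizes": 1.0, "heavy_ads": 1.5}

def quality(features):
    """features: dict mapping a signal name to its observed count (0 or more)."""
    score = sum(w * features.get(f, 0) for f, w in POSITIVE.items())
    score -= sum(w * features.get(f, 0) for f, w in NEGATIVE.items())
    return score

likely_blog = quality({"blogroll_links": 3, "result_clicks": 5})    # positive score
likely_splog = quality({"duplicate_content": 4, "heavy_ads": 2})    # negative score
```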

Spotted on Micro Persuasion.

Honeyblogs lure suckers to known spam domains

March 19th, 2007

One interesting aspect of web spam is that the suckers include both web searchers and advertisers. The goal of spammers is to get the two together and watch the clicking.

A new article in the NYT, Researchers Track Down a Plague of Fake Web Pages, discusses results by a team of researchers from Microsoft and UC Davis.

“Tens of thousands of junk Web pages, created only to lure search-engine users to advertisements, are proliferating like billboards strung along freeways. Now Microsoft researchers say they have traced the companies and techniques behind them.
     A technical paper published by the researchers says the links promoting such pages are generated by a small group of shadowy operators apparently with the acquiescence of some major advertisers, Web page hosts and advertising syndicators. … The finding is striking because it hints at the possibility of curbing the practice.
     The researchers uncovered a complex scheme in which a small group, creating false doorway pages, works with operators of Web-based computers who profit by redirecting traffic passed from search engines in one direction and then sending advertisements acquired from syndicators in the opposite direction.”

The researchers will present their paper on redirection spamming at WWW-2007:

Spam Double-Funnel: Connecting Web Spammers with Advertisers. Yi-min Wang, Ming Ma, Yuan Niu, and Hao Chen. To appear in Proceedings of the 16th International World Wide Web Conference (WWW2007).

Abstract: Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords – one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.
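The core signal, identifying spam pages by the third-party domains they redirect traffic to, can be sketched offline: given a page’s redirect chain, flag it when the chain terminates at a known bad domain. All domain names below are invented placeholders:

```python
# Sketch of the redirection-spam signal: a doorway page is suspicious when
# its redirect chain lands on a known traffic-aggregator domain.
from urllib.parse import urlparse

KNOWN_SPAM_DOMAINS = {"redirector-example.test", "ads-funnel-example.test"}

def is_redirection_spam(redirect_chain):
    """redirect_chain: list of URLs, first = doorway page, last = final target."""
    final_host = urlparse(redirect_chain[-1]).hostname or ""
    return final_host in KNOWN_SPAM_DOMAINS

chain = [
    "http://doorway-example.test/cheap-tickets",
    "http://redirector-example.test/r?k=cheap+tickets",
]
```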

A related paper on spamming in forums (ppt presentation) also uses a “context-based” approach that focuses on link structure and redirection.

Warsaw University wins 2007 ACM programming contest

March 19th, 2007

The results of the finals for the 2007 ACM International Collegiate Programming Contest are in, with Warsaw University, Tsinghua University, St. Petersburg University of IT, Mechanics and Optics, and MIT placing first through fourth.

The contest has been running since the 1970s and is generally recognized as the oldest, largest and most prestigious programming contest in the world. This year over 6000 teams began the multi-tiered competition, with 88 teams in the finals at Maihama, Japan. See the final problems that the teams had to solve and the final team standings.

links for 2007-03-19

March 19th, 2007