Archive for April, 2006
April 22nd, 2006, by Tim Finin, posted in Uncategorized
We need better approaches to ranking blogs. Scott Karp made a very interesting observation in Technorati Top 100 Is Changing Radically:
Have you checked out the Technorati Top 100 (by unique links) lately? It’s starting to change in very interesting ways. First, Dave Winer is gone. That’s right — Scripting News is no longer a top 100 blog. So what knocked him off? Personal blogs by young Asian women, most of them on MSN Spaces:
and goes on to list eight blogs that made the 100 list by recently receiving a large number of links from MSN blogs.
There is some evidence that the linking blogs are not authentic, i.e., splogs. If so, this is a new use for splogs — raising the rank of blogs, both on sites like Technorati and also on conventional Web search engines. To date, most splogs have been used to promote regular Web sites on search engines or to host Google Adsense advertisements. I’d check this theory out with our own splog detection software, which works pretty well, but we currently only do blogs in a handful of European languages.
Hacking Cough has another explanation, however:
In with a bullet at number eight was M¥$T(e(?I(O/’u~§ G,Î?L,: a blog stuffed full of platitudes and proverbs in Arabic and English that has seemingly warmed the hearts of 8000 blogs. Most of the link-love they gave the 20 posts at the site, as recorded by Technorati, came in the last 48 hours. … It’s not just M¥$T(e(?I(O/’u~§ G,Î?L, who has stormed into the Technorati A-list. You have Myhurt at Japanese blog-host FC2, who has amassed more than 12′000 links from close to 6700 sites, most of them also hosted at FC2. … It seems that Myhurt has knocked up a nice template that has been used by a lot of FC2 bloggers. And Technorati has picked up those links as real trackbacks in the way that it does not for all those links to SixApart at the bottom of Movable Type blogs.
I’ll be very happy if this is just a technical glitch in the way that Technorati ranks blogs. Even if it is, as blogs continue to rise in importance, we can expect to see more people trying to game the ranking system.
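To make the template effect concrete, here is a minimal sketch in Python. It is not Technorati's method, which is not described here. It only contrasts a count of distinct linking blogs with a count of distinct linking sites. The link data, the `.example` host names, and the helper `site_of` are all made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

def site_of(url):
    """Crude site key: the last two labels of the host name.
    This is an assumption; real code would need a public-suffix list."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def rank(links, key=lambda url: url):
    """Count distinct link sources per target, after mapping each source through `key`."""
    sources = defaultdict(set)
    for source, target in links:
        sources[target].add(key(source))
    return {target: len(found) for target, found in sources.items()}

# Hypothetical data: three blogs on one host share a template that links
# to its author, while two unrelated sites link to a long-running blog.
TEMPLATE = "http://template-author.example/"
VETERAN = "http://veteran-blog.example/"
LINKS = [
    ("http://a.bloghost.example/", TEMPLATE),
    ("http://b.bloghost.example/", TEMPLATE),
    ("http://c.bloghost.example/", TEMPLATE),
    ("http://news.example.org/", VETERAN),
    ("http://weblog.example.net/", VETERAN),
]

by_blog = rank(LINKS)                # every linking blog counts
by_site = rank(LINKS, key=site_of)   # every linking site counts once
print(by_blog)
print(by_site)
```

Counting by blog puts the template author ahead, while counting by site reverses the order. Collapsing by site is only one possible defense, and it would also penalize genuinely popular blogs whose readers happen to share a host.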
It’s clear to me that as blogging becomes more global we will have to develop a sound approach to analyzing blogs in a good fraction of the world’s major languages. We are also working on alternate approaches to ranking blogs and identifying feeds that matter.
One of the advantages of doing research on spam, in any of its varieties, is that the problem will never be solved. As the Red Queen said, “It takes all the running you can do, to keep in the same place.”
April 21st, 2006, by Tim Finin, posted in Uncategorized
US Attorney General Alberto Gonzales has called for Congress to require a mandatory website self-rating system to “prevent people from inadvertently stumbling across pornographic images on the Internet” in remarks made at the National Center for Missing and Exploited Children. The proposed law, the Child Pornography and Obscenity Prevention Amendments of 2006, would require commercial Web sites to place FTC designated “marks and notices” on sexually explicit pages. Gonzales described sexually explicit as covering depictions of everything from sexual intercourse and masturbation to “sadistic abuse” and close-ups of fully clothed genital regions.
Of course, we’ve been down this road before, or at least a very similar one, in PICS, the W3C’s Platform for Internet Content Selection. The PICS effort began in the winter of 1995 as a system for associating metadata with Internet content. While it was designed to be used with multiple vocabularies, its original impetus was to label web pages to help parents and teachers control access by children and students.
PICS data was metadata, and most of the W3C’s early RDF work was motivated by PICS. One product was an RDF schema for PICS ratings. The idea of associating metadata with web pages quickly grew beyond PICS at the W3C and produced the Semantic Web activity. At the same time, the vocabularies that groups developed to rate “objectionable” content seemed to descend into weird territory. See this PICS vocabulary developed by the Internet Content Rating Association, which allows you to describe a page as depicting disturbing subjects like “passionate kissing”, “deliberate damage to objects” and “gambling”. Sounds like just another episode of Desperate Housewives to me.
Work on PICS and the idea of using it to rate content for audience appropriateness ended at least five years ago. There were many problems: gaining consensus on rating vocabularies, getting sites to voluntarily rate their content, deciding how to rate a given page, and deciding what to do with sites that failed to rate or mis-rated their pages. These problems seemed insurmountable. The proposed law that Alberto Gonzales is advocating solves some of the problems by (1) having the FTC decide on a rating vocabulary, (2) making ratings mandatory for some sites and (3) throwing people in jail if they don’t do it right. I guess the only wiggle room left is whether your web page is art or porn, but a federal prosecutor and judge can help make that decision for you.
This ain’t gonna fly.
Update
Here is a transcript of the Attorney General’s speech.
April 19th, 2006, by Tim Finin, posted in Uncategorized
CIA mines ‘rich’ content from blogs is an article in today’s Washington Times discussing how the CIA’s Open Source Center is using Internet blogs to gather intelligence.
“The new Open Source Center (OSC) at CIA headquarters recently stepped up data collection and analysis based on bloggers worldwide and is developing new methods to gauge the reliability of the content, said OSC Director Douglas J. Naquin. “A lot of blogs now have become very big on the Internet, and we’re getting a lot of rich information on blogs that are telling us a lot about social perspectives and everything from what the general feeling is to … people putting information on there that doesn’t exist anywhere else,” Mr. Naquin told The Washington Times.”
Intelligence agencies have long relied on open source intelligence — gathering and analyzing information collected from sources available to the general public, such as newspapers, radio and television broadcasts and public documents. It’s not surprising, nor necessarily worrisome, that blogs are being viewed as a new source of information. Companies are using information mined from blogs for market research, e.g., to analyze “the who, what and why of online opinion to provide deep insight into the buzz about companies”, as Umbria puts it. The same is true for politicians. And Governments.
The Washington Times article continues:
Eliot A. Jardines, assistant deputy director of national intelligence for open source, said the amount of unclassified intelligence reaching Mr. Bush and senior policy-makers has increased as a result of the center’s creation in November.
“We’re certainly scoring a number of wins with our ultimate customer,” said Mr. Jardines, who became the first high-level official in charge of the government’s nonsecret intelligence in December.
“I can’t get into detail of what, but I’ll just say the amount of open source reporting that goes into the president’s daily brief has gone up rather significantly,” Mr. Jardines said. “There has been a real interest at the highest levels of our government, and we’ve been able to consistently deliver products that are on par with the rest of the intelligence community.”
Mr. Naquin said recent OSC successes have included the discovery of a technology advance in a foreign country. Also, most data on avian flu outbreaks come from open sources, he said.
I wonder if they are using the standard Web infrastructure to do this — Google, Yahoo, MSN, ping servers, Technorati, etc. — or building their own? Are they filtering out the splogs?
There is, of course, a very real danger that blog mining can be misused to compile information about individuals who don’t pose a threat to national security. Many of us put way too much information about our lives on blogs, the Web, and in email: what we buy and sell on ebay, seek on craigslist, feed to our cats and rant about when we’ve been drinking. The Semantic Web may even make it worse. Today, Martha Mitchell would be blogging in the wee hours rather than telephoning. Integrate and fuse all of that Web information with other sources (public records, credit information, etc.) and you know a lot about many people.
For more information on the CIA’s OSC, see Intelligence Center Mines Open Sources, an informative article by the AFCEA (Armed Forces Communications and Electronics Association).
Spotted on Wonkette. Related posts: CIA Open Source Center.
April 18th, 2006, by Akshay Java, posted in Uncategorized
“The size of the blogosphere continues to double every six months,” according to the latest quarterly report on the State of the Blogosphere by David Sifry. The report counts 33.5 million weblogs, many of them actively posting. Last year Jim Lanzone of Ask posted on which feeds matter. According to Bloglines/Ask, in July 2005 there were about 1.12 million feeds that really matter, a figure based on the feeds subscribed to by all Bloglines users. A study of the feeds on Bloglines in April 2005 showed about 32,415 public subscribers, whose feeds accounted for 1,059,140 public feed subscriptions.
We collected similar data on the publicly listed users of Bloglines. Since last year, the number of publicly listed subscribers has increased to 82,428 users (2.5 times last year’s figure) and there are 1,833,913 listed feeds (about 1.7 times) on the Bloglines site. Hence, even though the blogosphere is almost doubling every six months, the number of feeds that “really matter” probably doubles only roughly every year. Even so, it may still be only a small fraction of the blogosphere.
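The doubling estimate can be checked with a few lines of arithmetic. The sketch below assumes steady exponential growth and treats the two Bloglines snapshots as exactly one year apart. Both are simplifying assumptions, and the function name is just for illustration.

```python
import math

def doubling_time_years(old, new, years=1.0):
    """Doubling time implied by exponential growth from `old` to `new` over `years`."""
    return years * math.log(2) / math.log(new / old)

# Figures quoted above: April 2005 versus the current crawl.
subscribers = doubling_time_years(32_415, 82_428)
feeds = doubling_time_years(1_059_140, 1_833_913)

print(f"public subscribers double every {subscribers * 12:.1f} months")
print(f"listed feeds double every {feeds * 12:.1f} months")
```

Under these assumptions the subscriber count doubles in roughly nine months and the listed feeds in roughly fifteen, so "about a year" is a fair round figure for the feeds.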
This leads me to think that there is some preferential attachment for feeds. A new user who joins Bloglines would subscribe to some feeds from the long tail (belonging to friends or matching personal interests), but most would also tend to subscribe to feeds that are already popular (such as Slashdot or other top feeds).
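A toy simulation shows what preferential attachment would look like. This is a generic rich-get-richer urn model, not an analysis of the Bloglines data, and every parameter (`n_users`, `feeds_per_user`, `p_new`) is an arbitrary assumption.

```python
import random
from collections import Counter

def simulate_subscriptions(n_users=2000, feeds_per_user=20, p_new=0.1, seed=0):
    """Each user makes `feeds_per_user` picks. With probability `p_new` a pick
    is a brand-new (long tail) feed; otherwise it is drawn in proportion to
    the number of subscribers each existing feed already has."""
    rng = random.Random(seed)
    counts = Counter({0: 1})  # feed id -> subscriber count, seeded with one feed
    urn = [0]                 # one entry per subscription, so a uniform draw
                              # from the urn is a popularity-weighted draw
    for _ in range(n_users):
        chosen = set()
        for _ in range(feeds_per_user):
            if rng.random() < p_new:
                feed = len(counts)      # next unused feed id
            else:
                feed = rng.choice(urn)
            if feed in chosen:          # a user subscribes to a feed only once
                continue
            chosen.add(feed)
            counts[feed] += 1
            urn.append(feed)
    return counts

counts = simulate_subscriptions()
ranked = sorted(counts.values(), reverse=True)
median = ranked[len(ranked) // 2]
print("feeds:", len(ranked), "top feed:", ranked[0], "median feed:", median)
```

In a model like this a handful of early feeds collect a large share of the subscriptions while the typical feed has only one or two readers, which is the pattern the paragraph above conjectures.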

There is also an inherent limit on the amount of information that a user can keep track of at any given time. To study this we show the number of feeds subscribed by the publicly listed users on bloglines.

From the graph it can be observed that although some users monitor more than 5,000 feeds (these might not be real users but programs using the Bloglines API), the majority are normal users who subscribe to the blogs and news feeds that they want to follow regularly. Most of these users monitor somewhere between 30 and 100 feeds. This might explain the deviation of the graph from a typical power law curve.
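One way to look for such a deviation is to bin users by subscription count on a logarithmic scale, since a pure power law would fall off steadily from the first bin. The sketch below uses power-of-two bins, and the sample counts are invented for illustration, not taken from the Bloglines crawl.

```python
from collections import Counter

def log2_bins(feed_counts):
    """Group users into bins [2**k, 2**(k+1)) by number of subscribed feeds."""
    bins = Counter()
    for n in feed_counts:
        if n > 0:
            bins[n.bit_length() - 1] += 1  # exact integer floor(log2(n))
    return dict(sorted(bins.items()))

# Hypothetical feeds-per-user values, including one API-style heavy user.
sample = [1, 3, 12, 35, 40, 64, 80, 100, 250, 6000]
for k, users in log2_bins(sample).items():
    print(f"{2 ** k:>5} to {2 ** (k + 1) - 1:<5} feeds: {users} users")
```

A hump in the middle bins, rather than a steady decline, would match the observation that most people settle on a few dozen feeds.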
To summarize:
- The blogosphere continues to grow as does the number of people who follow blogs.
- While this is still a rough estimate, the number of feeds that really matter is a very small fraction of the entire blogosphere.
- The number of feeds that really matter doubles each year as opposed to the size of the blogosphere, which doubles every 6 months.
- Most users tend to follow a relatively modest number of feeds.
April 17th, 2006, by Anand, posted in Uncategorized
This is an article (or perhaps a transcript) of a UK Marketing Society Keynote Address (28 Feb. 2006) by John Naughton. He talks about ecological adaptation and the integration of new communication technologies into society, how business models have changed since the beginning of the Internet era, and how they will continue to transform. He points out the proliferation of new communication technologies including blogs, IPTV, and the iPod phenomenon.
He talks about media, its distribution and the evolving means for its distribution. The article is interspersed with examples of technology adoption, with a historic timeline of events and evolving technology driving the changes.
He also formulates Naughton’s First Law:
“We invariably over-estimate the short-term implications of new communications technologies, and we grievously underestimate their long term impacts.”
Here is a link to the article.
April 15th, 2006, by Tim Finin, posted in Uncategorized
At last month’s AAAI Symposium on Weblogs there was discussion of the difficulty of categorizing an automatically generated weblog as a legitimate blog or a splog. I might create a weblog to aggregate posts from other blogs on a topic of interest to me, say Owl Links. It might even have a blogroll and also carry some Google ads. Why not? Everyone does it these days. One person’s blog might be another’s splog.
Of course, we noticed this OWL blog because it came up in a Technorati Search — the most recent OWL Links post appropriated text from one of our Sourceforge pages on an OWL reasoner.
The information this OWL blog collects is certainly useless — any occurrence of the word owl or owls is reason enough for text to be included. It might be about birds, RDF vocabulary, a street in Houston, a bar in Baltimore, Hedwig, or the Temple University basketball team. The blog doesn’t carry the usual signs of sploginess, though — links to off-topic sites, ads, links to other splogs, etc. Googling for sites hosted by bpeleven.com, however, reveals hundreds of suspicious sites. Most still don’t seem sploggy, though, just useless.
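The selection rule described above amounts to a bare keyword match. Here is a minimal sketch of such a filter. The pattern and the sample post titles are assumptions for illustration, since the aggregator's actual code is not known.

```python
import re

# Match "owl" or "owls" as a whole word, in any letter case.
OWL_PATTERN = re.compile(r"\bowls?\b", re.IGNORECASE)

def matches_owl(text):
    """Return True if the text mentions owl(s) in any sense at all."""
    return OWL_PATTERN.search(text) is not None

# Hypothetical post titles.
posts = [
    "New release of an OWL reasoner for RDF data",
    "Spotted a barn owl near the river last night",
    "Temple Owls win in overtime",
    "Notes on bowling league schedules",
]
selected = [p for p in posts if matches_owl(p)]
print(selected)
```

The first three titles pass even though only one concerns the Web Ontology Language, which is why the resulting collection is of so little use.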
Things became clearer when we discovered wveleven’s OWL site. This mirrors bpeleven’s but carries a full complement of ads, including several for semantic web products like Racer and Altova.
So, what’s going on here with bpeleven.com and wveleven.com? Further exploration reveals related domains like wvfour.com through wveleven.com and bpseven.com through bpeleven.com. There may be more with different prefixes, who knows? The IP addresses are to various hosting companies and many seem to be unused.
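The naming pattern is regular enough to enumerate. The sketch below only generates candidate names from a prefix and a number word; it performs no lookups and makes no claim about which names are registered. The function name and the idea of probing other prefixes are assumptions.

```python
NUMBER_WORDS = [
    "one", "two", "three", "four", "five", "six",
    "seven", "eight", "nine", "ten", "eleven", "twelve",
]

def candidate_domains(prefix, start, end):
    """Build names like 'wvfour.com' for the number words start..end (inclusive)."""
    return [f"{prefix}{word}.com" for word in NUMBER_WORDS[start - 1:end]]

wv = candidate_domains("wv", 4, 11)  # wvfour.com through wveleven.com
bp = candidate_domains("bp", 7, 11)  # bpseven.com through bpeleven.com
print(wv)
print(bp)
```

A list like this could feed a splog detector as a watch list, with the same generator applied to other two-letter prefixes.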
I’d characterize this as a splog farm where seedlings like owl.bpeleven.com are planted on a new domain hosted at a new service and allowed to grow for a while. When they’ve reached maturity, they are put to work carrying ads and paid links to target sites and become full-blown splogs. Over time, search engines begin to wise up and cancel their ad accounts and/or block them, and the splogs are plowed under. In the meantime, a new crop of virginal young splogs is ready over at bptwelve.com.
April 12th, 2006, by Tim Finin, posted in Uncategorized
One of the biggest changes in “academic” research in the past ten years has been the increasing importance of making publications visible and available online. There are several dimensions to this: online journals, preprint servers, self publishing, blogging, etc. An important one is the increasingly comprehensive research article search services like CiteSeer, Google Scholar and now Windows Live Academic. Microsoft has announced and released a beta version of Live Academic Search that is intended to “help students, researchers and university faculty conduct research across a spectrum of academic journals.”
These services make it easy to find articles on a topic or by an author, count citations and get a sense of a paper’s impact. CiteSeer was the pioneer in providing free access to an automatically maintained digital library of scientific literature. It was developed by researchers at the NEC Research Institute and is currently hosted at Penn State. Google Scholar, introduced in late 2004, provides a similar service but with notable differences, including a paper ranking system based on the number of citations and the ability for publishers to push metadata to the service in addition to relying on web crawlers to find documents and extract the metadata.
Microsoft’s service has some nice features but is missing some offered by CiteSeer and Google Scholar. Interesting features to note are:
- The papers are currently drawn from metadata provided by publishers of 4300 journals and 2000 conferences covering Computer Science, Electrical Engineering and Physics.
- Query results can be grouped and sorted by author, source, and date, and ranked by relevance or date. The details of the relevance ranking are not described, but it is not based on citation count.
- A Sidebar previews information about the paper when you hover over one of the query results.
- Citation text for papers is generated in two forms: BibTeX and EndNote.
- Like Google Scholar, Live Academic Search indexes library-subscribed content and supports the OpenURL for linking to subscription-based content.
- It does not yet compute its own citation counts but imports values from CiteSeer.
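For readers who have not seen the BibTeX export format mentioned in the list above, this is roughly what generating such a record involves. The sketch is not how Live Academic Search produces its output; the function is hypothetical and the paper metadata is a placeholder, not a real reference.

```python
def to_bibtex(key, entry_type, fields):
    """Render one BibTeX entry from a citation key, an entry type and a field mapping."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)

# Placeholder metadata, invented purely to show the layout.
record = to_bibtex("doe2006example", "article", {
    "author": "Doe, Jane",
    "title": "An Example Paper",
    "journal": "Journal of Examples",
    "year": 2006,
})
print(record)
```

The same field mapping could be rendered in EndNote's tagged format by swapping the line templates, which is presumably why services can offer both from one metadata record.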
Competition is usually good and in this case I think researchers and students will definitely benefit if Google and MSN compete to provide the best search service for academic articles. Even in its current beta form, Microsoft Live Academic looks very useful on its own.
April 10th, 2006, by Tim Finin, posted in Uncategorized
Platial is a new Google map mashup that I heard about on NPR a few days ago. It allows people “to find, create and use meaningful maps of Places that matter to them.” This is from a recent Wired article:
Map Mashups Get Personal … Platial provides a home for people who love quirky geographical information or just want to mark the locations that have meaning to them. Sign up for a free account, and you can start building and sharing personalized maps, complete with place markers, tags and descriptions of each spot. Collaborate on them with your buddies, or keep them to yourself. … You might say Platial is a cross between MapQuest and LiveJournal. Built on the open interfaces for Google Maps, the 2-month-old site is one of a new breed of map mashups — web applications created by mixing an already-existing open mapping platform with original software. Platial co-creator Di-Ann Eisnor says she built Platial for what she calls “neogeographers,” who use digital maps to tell stories and chart eccentric routes through familiar terrain. “Users tend to start out by making maps of where they’ve lived or traveled, and then they become tour guides for their neighborhoods,” she says. “Later, they might become what I call ‘collectors’ — they tag all slow-food restaurants and organic farms, or they chart the locations of independent bookstores.”
Platial is a typical Web 2.0 application and is pretty easy to use. I found some functional problems and it can be a bit slow, but I am sure they will get the kinks out as they mature. To check it out, I created a map of places near UMBC and invite others to add interesting or useful places to it. Platial supports a kind of localized search: here’s the result of searching for ‘university’ in ‘Baltimore’. You can also tag places and search by tags, like this map showing places tagged with ‘museum’. It even has rudimentary social networking/blogging features.
April 9th, 2006, by Tim Finin, posted in Uncategorized
The New York Times has an article This Boring Headline Is Written for Google that talks about how journalists and their editors are now writing headlines with search engines in mind.
“Journalists over the years have assumed they were writing their headlines and articles for two audiences — fickle readers and nitpicking editors. Today, there is a third important arbiter of their work: the software programs that scour the Web, analyzing and ranking online news articles on behalf of Internet search engines like Google, Yahoo and MSN.
The search-engine “bots” that crawl the Web are increasingly influential, delivering 30 percent or more of the traffic on some newspaper, magazine or television news Web sites. And traffic means readers and advertisers, at a time when the mainstream media is desperately trying to make a living on the Web.”
We’ve been trying to do this, with good effect, on our own ebiquity research site for the past year or so by carefully choosing the titles of web pages and blog posts. We also apply the same principle when choosing titles for research papers so that they match desired search terms. Of course, this has to be done in a principled way; we don’t want to give a paper on Swoogle a title like “Swoogle: it’s more fun than Britney Spears”. While that might attract more visitors, they aren’t the ones we want. A few might click on our AdSense ads, which support Mr. Capresso, but they are not likely to read and cite our papers. A title like “Swoogle: searching for knowledge on the Semantic Web”, however, is likely to be better and easier for the right people to find than “Evaluating Swoogle’s performance on finding RDF data”.
These changes are hard to make. Here’s a bit more from the Times article.
“Journalists, they say, would be wise to do a little keyword research to determine the two or three most-searched words that relate to their subject, and then include them in the first few sentences. “That’s not something they teach in journalism schools,” said Danny Sullivan, editor of SearchEngineWatch, an online newsletter. “But in the future, they should.”
Such suggestions stir mixed sentiments. “My first thought is that reporters and editors have a job to do and they shouldn’t worry about what Google’s or Yahoo’s software thinks of their work,” said Michael Schudson, a professor at the University of California, San Diego, who is a visiting faculty member at the Columbia University Graduate School of Journalism.
“But my second thought is that newspaper headlines and the presentation of stories in print are in a sense marketing devices to bring readers to your story,” Mr. Schudson added. “Why not use a new marketing device appropriate to the age of the Internet and the search engine?”
The same applies to research authors. We have to give up, somewhat reluctantly, some old habits, like trying to write clever and memorable titles. By the way, I think my favorite all-time AI paper title is Janet Kolodner’s “Car 54 where are you”. But you have to date back to the early 60’s to find that funny. (Sadly, I can’t find a reference to that paper anywhere on the Web. If anyone has the citation information, please send it to me. I think it was from a paper in the late 70s.)
Conventional wisdom is that one should also choose file names to produce URLs with the right terms. So we, like many others, configure our software to automatically produce file names using words selected from titles of papers, talks, events, posts, etc. Most MSM organizations are not yet doing that. For example, the Times article that triggered this post has the filename 09lohr. That’s not likely to draw traffic. Compare that to the URL generated by WordPress for this post, newspapers-like-bloggers-write-headlines-for-search-engines.
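Turning a title into a URL-friendly file name takes only a few lines. The sketch below is a generic slug function, not WordPress's actual routine, and the post title passed to it is inferred from the file name quoted above rather than stated anywhere in the text.

```python
import re

def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

# Assumed title, reconstructed from the WordPress file name.
title = "Newspapers, like bloggers, write headlines for search engines"
print(slugify(title))
```

Every content word of the title survives into the URL, whereas an opaque name like 09lohr gives a search engine nothing to match against.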
What’s next? Well, I hope it’s the appearance of machine-understandable metadata in online content: microformats, RDF/A, linked RDF, etc. Once this becomes a discriminator in getting things noticed by your intended audience, the Semantic Web will gain a lot more traction.
April 6th, 2006, by Tim Finin, posted in Uncategorized
This seemed shocking at first, as if Microsoft were admitting defeat and giving up. But on reflection, that’s the way you deal with a compromised Unix system as well: rebuild it from scratch. I found the observation about motive to be interesting. More crackers are compromising systems as a way to make money and not just for their own amusement or for bragging rights. This will attract a more skilled, motivated and determined group of bad guys. Money is the root of all evil, as usual.
Microsoft Says Recovery from Malware Becoming Impossible April 4, 2006, By Ryan Naraine, Ziff Davis Media Inc.
LAKE BUENA VISTA, Fla. In a rare discussion about the severity of the Windows malware scourge, a Microsoft security official said businesses should consider investing in an automated process to wipe hard drives and reinstall operating systems as a practical way to recover from malware infestation.
“When you are dealing with rootkits and some advanced spyware programs, the only solution is to rebuild from scratch. In some cases, there really is no way to recover without nuking the systems from orbit,” Mike Danseglio, program manager in the Security Solutions group at Microsoft, said in a presentation at the InfoSec World conference here. … Danseglio, who delivered two separate presentations at the conference — one on threats and countermeasures to defend against malware infestations in Windows, and the other on the frightening world on Windows rootkits — said anti-virus software is getting better at detecting and removing the latest threats, but for some sophisticated forms of malware, he conceded that the cleanup process is “just way too hard.”
“We’ve seen the self-healing malware that actually detects that you’re trying to get rid of it. You remove it, and the next time you look in that directory, it’s sitting there. It can simply reinstall itself,” he said. … Danseglio said malicious hackers are conducting targeted attacks that are “stealthy and effective” and warned that the for-profit motive is much more serious than even the destructive network worms of the past. “In 2006, the attackers want to pay the rent. They don’t want to write a worm that destroys your hardware. They want to assimilate your computers and use them to make money. …
April 5th, 2006, by Akshay Java, posted in Uncategorized
While in California, I had the opportunity to meet Adam Rosien (my mentor from PARC), who now works at Sharpcast, a Palo Alto based startup. They have a really cool product that lets you instantly share photos across multiple devices. Adam took a few pictures with his cell phone and these were directly available on the web and on his desktop through Sharpcast Photos. You can also share the photos with your buddy list. The really neat thing about their product is that it is amazingly simple to use. I hope they will offer an API so that developers can create more tools around it. For example, it would be really useful to combine image annotation tools like PhotoStuff from the Mindswap lab with Sharpcast. Another idea would be to export the image properties in RDF and develop a photo search tool like Alexa’s Camera Image Search.
April 5th, 2006, by mayfield, posted in Uncategorized
Dennis Forbes has done an interesting analysis of .com domain names:
If you’re dying to acquire great domains like 8VZ.com or Q6X.com, they’ll free up within a month, though it seems evident that there are swaths of domain speculators acquiring every variant when they come available, so they won’t go without a fight.
If you’re willing to flog a domain name of fifty or more characters, it looks like the possibilities are wide open.
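Some quick arithmetic shows why short names are exhausted while long ones are wide open. The sketch counts the possible labels of a given length under a simplified rule: letters, digits and hyphens, with no hyphen at either end. It ignores reserved names and other registry rules, and it is not a reconstruction of Forbes' method.

```python
def label_count(length):
    """Number of possible labels of `length` characters drawn from a-z, 0-9
    and '-', where the first and last characters may not be a hyphen."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if length == 1:
        return 36
    return 36 * 37 ** (length - 2) * 36

for n in (2, 3, 4, 50):
    print(f"{n:>2} characters: {label_count(n):.3e} possible names")
```

There are fewer than fifty thousand three-character names under this rule, few enough for speculators to register every one, while the count for fifty-character names runs to dozens of digits.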