November 27th, 2006
Blogspot now has company.
Though MySpace has had its own problems dealing with account spam, it now appears it will also have to deal with splogs. Sploggers seem to have compromised MySpace's CAPTCHA system and are using it heavily to promote affiliates.
I came across a link farm built on MySpace and indexed by Technorati. It's time blog search engines exercised some caution when indexing MySpace pages, possibly employing the same techniques they have been using against Blogspot accounts.
November 27th, 2006
We’ve been using Akismet to filter our blog post comments and eliminate the spam. It’s been working great, with virtually no false positives and only a few false negatives a day, which are easy for us to moderate. We stopped looking in the spam bucket long ago since I never found a false positive. Until now. And it’s me. Ouch!
A few days ago I noticed that a comment I made, while logged in to our WordPress-based blog, didn’t show up. I assumed that I must have forgotten to actually hit the SUBMIT button so I made it again. And it still didn’t show up. Looking into the Akismet spam bucket, I found both comments. Neither one had any spam characteristics, at least not to my eye. I marked them as not spam, and assumed that this might help Akismet recognize future comments from me as not being spam.
But it happened again today: a comment I just made on my own post, while logged in to our blog, was identified as spam.
Now I’m puzzled and wish I knew more about Akismet’s techniques. I might have made all three comments from home, so maybe it’s my IP address (a Comcast address) that’s the feature marking my comments as spam. Does Akismet know that I’ve logged in to our blog? Maybe it shouldn’t matter; I think we allow anyone to create a low-level account on our blog.
I’m at home now but about to go to UMBC. I’ll try adding a comment to this post from home and then when I get to work. That will test the theory that it’s the IP address that marks my comments as spam.
I wonder how many valid comments we’ve lost? The level of spam is too high to manually look through them every day.
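For context on what Akismet actually gets to see, here is a minimal sketch of assembling a comment-check request. The endpoint pattern and field names follow Akismet's public API, but the key, URLs and values below are placeholders, and this is only an illustration, not how our WordPress plugin is actually wired up:

```python
def build_comment_check(api_key, blog_url, user_ip, user_agent,
                        author, author_url, content):
    """Assemble the form fields for Akismet's comment-check call.

    The fields are POSTed to the per-key endpoint; the response body
    is the string "true" (spam) or "false" (ham).
    """
    endpoint = "https://%s.rest.akismet.com/1.1/comment-check" % api_key
    fields = {
        "blog": blog_url,            # the site the comment was left on
        "user_ip": user_ip,          # commenter's IP, a likely spam feature
        "user_agent": user_agent,
        "comment_type": "comment",
        "comment_author": author,
        "comment_author_url": author_url,
        "comment_content": content,
    }
    return endpoint, fields
```

Note that the commenter's IP address is a first-class field in the request, which makes the theory above at least plausible.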
November 26th, 2006
Apparently there is yet another industry that has been lost to the United States due to outsourcing to China — virtual gold mining. In Inside World of Warcraft Gold Farm, Future of Work, Wagner James Au talks about the seamy sweatshops where Chinese workers toil to mine virtual gold in “gray market companies which collect and sell virtual gold (primarily from World of Warcraft) to wealthier gamers in the developed world.”
“Drawing from a fascinating upcoming documentary by UC San Diego grad student Ge Jin (YouTube clip from his film here), the MTV segment features interviews with workers and managers of several gold farms, which resemble a cross between a 24 hour LAN party and a very shabby college dorm. By the segment’s estimate, an astounding half million Chinese now make a living – about $100 a month – from the acquisition and sale of WoW gold to US and EU gamers. Why is this is the future of work online? Consider the numbers, youth, and low wages of the gold farmers, and the growing interest in outsourcing tasks online.”
Somebody call Lou Dobbs! Someday the lazy American workers will wake up and be sorry when they can no longer enjoy cheap WoW gold earned by the sweat of others.
November 26th, 2006
The BBC reports that Korean electronics company LG is collaborating with builders to create developments of “smart homes”.
More than 100 homes offering smart technology have just been built in South Korea and another 30,000 are planned.
The homes use the HomeNet standard to communicate with devices and appliances by sending signals and data over power lines. HomeNet is a proprietary standard originally developed by LG that competes with others such as X-10 and more recently Z-Wave and In2 Networks. (spotted on Smart Mobs)
November 25th, 2006
As reported by Charles Arthur in The Guardian and subsequently on Slashdot, blog spammers are outsourcing the placement of comment spam to India and perhaps elsewhere.
“The other day, while administering the Free Our Data blog … I came across an unusual piece of comment spam – a remark left on one of the blog posts. … The surprise was that despite the automated defences to prevent such junk being posted by a machine, it had got through. … The electronic trail explained: the “captcha” … had been filled in. …. So who had done this? The junk filter had recorded their IP (internet) address. It resolved to somewhere in India. Which rang a bell: earlier this year, I spoke with someone who does blog spamming for a living – a very comfortable living, he claimed. But he said that the one thing that did give him pause was the possibility that rival blog spammers might start paying people in developing countries to fill in captchas: they could always use a bit of western cash, would have the spare time and, increasingly, cheap internet connections to be able to do such tedious (but paid) work”.
Our blog relies on Akismet to reject most spam comments, and it does a great job. Anything it doesn’t reject is left for us to moderate. In the last few months, we’ve noticed an increase in spam comments we believe were left by people, not machines. They are typically just a sentence or two and fairly general, yet still quite relevant to the meaning of the post. These are distinct from another common post-specific type of comment spam, which seems to key off a keyword in the post, such as a mention of Google, and is undoubtedly automated.
The only way I can identify these as spam is that (1) they don’t add anything very meaningful to the post and (2) the commenter’s URL points to what is clearly a commercial site unrelated to the post or comment.
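A crude, purely illustrative version of that two-part test could be coded up as follows; the length threshold and the word-overlap rule are my own guesses at what "not very meaningful" might mean, not anything our moderation actually uses:

```python
from urllib.parse import urlparse

def looks_like_human_spam(comment, author_url, post_text):
    """Flag short, generic comments whose author URL is unrelated to the post.

    (1) the comment adds little: it is short and shares no word longer
        than four letters with the post it was left on;
    (2) the author's URL points to a domain never mentioned in the post
        or in the comment itself.
    """
    words = {w.lower().strip(".,!?") for w in comment.split()}
    post_words = {w.lower().strip(".,!?") for w in post_text.split()}
    generic = len(comment) < 120 and not any(
        w in post_words for w in words if len(w) > 4)
    domain = urlparse(author_url).netloc.lower()
    unrelated = bool(domain) and (domain not in post_text.lower()
                                  and domain not in comment.lower())
    return generic and unrelated
```

A comment like "Great post, thanks!" with an author URL pointing at an unrelated storefront would trip both rules; a substantive comment that engages with the post's vocabulary would not.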
November 25th, 2006
It is now legal in the US to unlock your mobile phone, the one you paid for and own, to allow you to use it with a different mobile phone service provider. Well, it will be legal starting on Monday, and only for the next three years.
As reported by The Register and others, the US Register of Copyrights has recommended six exemptions to the Digital Millennium Copyright Act (DMCA) including the right to unlock a cell phone.
“Computer programs in the form of firmware that enable wireless telephone handsets to connect to a wireless telephone communication network, when circumvention is accomplished for the sole purpose of lawfully connecting to a wireless telephone communication network.”
The exemptions become effective on 27 November 2006 and will remain in effect for only three years. The last set of exemptions was defined in 2003 and is expiring, including the right to determine what web sites a ‘commercially marketed filtering software application’ is blocking. Apparently, exemptions can expire if no one argues for them. Not all proposed exemptions are taken up, of course. Ars Technica reports a number of rejected proposals, including these:
Exemptions for space-shifting, playing DVDs on Linux, bypassing region coding on DVDs, bypassing copy protection on legally purchased computer software, audiobooks distributed by libraries, all works protected by DRM that prevents backups, and all works protected by a broadcast flag.
November 23rd, 2006
How credible is the information in Wikipedia? There have been many studies, formal and informal, that have tried to assess this. A new article in the online journal First Monday reports on a simple methodology to explore the question.
Thomas Chesney, An empirical examination of Wikipedia’s credibility, First Monday, Volume 11, Number 11, 6 November 2006.
Two groups of researchers with various areas of specialization were asked to review Wikipedia articles. One group read articles within their own areas of expertise and the other was given random articles. After reading an article they were asked to assess its credibility, the credibility of its author and the credibility of Wikipedia as a whole.
If the Wikipedia articles are not very accurate, one might expect that the subjects would be less likely to judge articles in their own areas as credible than those outside their areas. But the results were a bit surprising.
“No difference was found between the two groups in terms of their perceived credibility of Wikipedia or of the articles’ authors, but a difference was found in the credibility of the articles — the experts found Wikipedia’s articles to be more credible than the non-experts. This suggests that the accuracy of Wikipedia is high.”
The author does point out that the study was small and that the experts did identify mistakes in the articles. Nonetheless, this simple methodology is plausible and produced an interesting result.
November 22nd, 2006
We got a positive write up of the XPod project in Newsday.
In the fantasy world of every iPod user, there’s a playlist for every moment, a soundtrack for every day.
The drawback – it takes a lot of time to build those lists. But what if MP3 players knew when to play your uniquely perfect soundtrack before you requested it?
This particular feature is not yet available from the Apple machine or any of its competitors, but a PhD student at the University of Maryland-Baltimore County is working hard to change that.
November 21st, 2006
We’ll see Elaine Peterson one Borges and raise her another…
“These ambiguities, redundancies and deficiencies remind us of those which doctor Franz Kuhn attributes to a certain Chinese encyclopaedia entitled ‘Celestial Empire of benevolent Knowledge’. In its remote pages it is written that the animals are divided into: (a) belonging to the emperor, (b) embalmed, (c) tame, (d) sucking pigs, (e) sirens, (f) fabulous, (g) stray dogs, (h) included in the present classification, (i) frenzied, (j) innumerable, (k) drawn with a very fine camelhair brush, (l) et cetera, (m) having just broken the water pitcher, (n) that from a long way off look like flies.”
– The Analytical Language of John Wilkins, in Jorge Luis Borges, ‘Other Inquisitions 1937-1952’ (University of Texas Press, 1993)
November 16th, 2006
I am happy to report that 28 people have contributed over 700 photos from the 2006 International Semantic Web Conference in Athens, Georgia. Maybe you are in some?
Of course, it would be even better if they were annotated in RDF. Many in the Semantic Web community have built experimental photo annotation and sharing sites. But it would be great to find an easy and intuitive way to let people make RDF annotations that refer to Flickr photos. A browser plug-in could do the integration for a user.
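As a sketch of what such an annotation might look like, here is a hypothetical fragment in Turtle; the photo URL and the person's name are invented, and the choice of FOAF and Dublin Core terms is just one plausible vocabulary, not a format anyone has standardized:

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .

<http://www.flickr.com/photos/example/12345/>
    a foaf:Image ;
    dc:title      "ISWC 2006 banquet" ;
    foaf:depicts  [ a foaf:Person ; foaf:name "Jane Researcher" ] .
```

The nice property of something like this is that the annotation can live anywhere on the Web, since it refers to the Flickr photo by URL rather than being stored alongside it.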
Anyway, see the ISWC Group Photo Pool on Flickr. If you have photos on Flickr to contribute, it’s easy to do so…
November 16th, 2006
…News at 11…
One vision that many of us have is that the Web is evolving into a collective brain for human society, complete with long-term memory (web pages), active behaviors (web services and agents), a stream of consciousness (the Blogosphere) and a nervous system (Internet protocols). So it’s in this context that I read the news that the big search engines, Google, Yahoo and Microsoft, have agreed on the Sitemaps protocol. It’s a small step for a machine, but a bigger leap for machinekind.
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site. Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata.
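The format itself is simple. A minimal sitemap listing a single URL might look like this (the URL and values are placeholders; the three optional child elements carry the per-URL metadata described above):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.org/blog/</loc>
    <lastmod>2006-11-16</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

A webmaster publishes the file at a well-known location and tells the search engines where to find it; the crawler takes it from there.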
The Semantic Web community should consider doing something similar for Semantic Web data. Frédérick Giasson has addressed one aspect with his PingtheSemanticWeb protocol and site. Last year Li Ding worked out a sitemap-like format and protocol to help RDF crawlers like Swoogle keep their collections current more easily. We’ve not yet had a chance to experiment with or promote the idea; maybe the time is right now. It might be especially helpful in exposing pages that have embedded RDFa or other content that could be extracted via GRDDL. Currently, the only way to discover them is exhaustive crawling, something small systems like Swoogle cannot afford to do.
November 15th, 2006
John Markoff’s Web 3.0 article has given the Semantic Web a bit of a bump in the Blogosphere.