xkcd bot enforces originality on IRC channel

January 14th, 2008

xkcd has an IRC channel where its strange fans talk about even stranger things. xkcd creator Randall Munroe discusses a common problem with IRC channels in a recent blog post, ROBOT9000 and #xkcd-signal: Attacking Noise in Chat.

“When social communities grow past a certain point (Dunbar’s Number?), they start to suck. Be they sororities or IRC channels, there’s a point where they get big enough that nobody knows everybody anymore. The community becomes overwhelmed with noise from various small cliques and floods of obnoxious people and the signal-to-noise ratio eventually drops to near-zero — no signal, just noise. This has happened to every channel I’ve been on that started small and slowly got big.”

After laying out the standard approaches to controlling the problem (entry requirements, moderation, side channels), Randall describes a novel approach that fits oh so well with the xkcd community.

“And then I had an idea — what if you were only allowed to say sentences that had never been said before, ever? A bot with access to the full channel logs could kick you out when you repeated something that had already been said. There would be no “all your base are belong to us”, no “lol”, no “asl”, no “there are no girls on the internet”. No “I know rite”, no “hi everyone”, no “morning sucks.” Just thoughtful, full sentences.”

The idea’s implementation as a Perl bot sounds workable: when you violate the xkcd protocol by uttering a non-novel statement, you are muted for two seconds, and the mute time quadruples with every subsequent violation. The bot forgives you after a while; your mute time decays by half every six hours or so. You can read more about it on the xkcd blog or experience its tight rein on #xkcd-signal at irc.xkcd.com.
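The penalty scheme is simple enough to sketch in a few lines. This is a hypothetical Python rendering of the rules described above, not the real bot (which is written in Perl); the class and method names here are made up:

```python
import hashlib

class NoveltyBot:
    """Hypothetical sketch of the #xkcd-signal penalty rules: a first
    non-novel utterance earns a two-second mute, the penalty quadruples
    on each repeat offense, and it halves every six hours or so."""

    BASE_MUTE = 2  # seconds for a first offense

    def __init__(self):
        self.seen = set()    # fingerprints of every line said so far
        self.penalty = {}    # nick -> next mute duration, in seconds

    def handle_message(self, nick, text):
        # Normalize lightly so "LOL" and "lol " count as the same line
        key = hashlib.sha1(text.strip().lower().encode()).hexdigest()
        if key in self.seen:
            mute = self.penalty.get(nick, self.BASE_MUTE)
            self.penalty[nick] = mute * 4   # quadruple for next time
            return ("mute", mute)
        self.seen.add(key)
        return ("ok", 0)

    def decay(self, nick):
        # Called every six hours or so: forgiveness halves the penalty
        if nick in self.penalty:
            self.penalty[nick] = max(self.BASE_MUTE, self.penalty[nick] // 2)
```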

Not surprisingly, the channel is currently overwhelmed by chatters testing the bot to learn the finer points of its rules and how to subvert them. Hopefully this is just a transient phenomenon, and the robotic enforcement of novelty will evolve into something truly useful: a kinder, gentler moderator who can keep discussion from degenerating. But some serious tinkering will be required, since common and repetitious utterances (“good morning”) are part of our social protocol and need to be allowed to some degree.

For teens, social media is not technology, it’s just life

January 13th, 2008

Alan Kay got it right when he said “Technology is anything that wasn’t around when you were born.” — a quote I’d not heard until I saw it on danah boyd‘s blog recently. She cited it to explain why today’s youth find it so natural to use Internet technology while their parents find it a bit strange and artificial.

The Pew Internet & American Life Project published a report on Teens and Social Media with the tag line “The use of social media gains a greater foothold in teen life as they embrace the conversational nature of interactive online media”.

“Some 93% of teens use the Internet, and more of them than ever are treating it as a venue for social interaction — a place where they can share creations, tell stories, and interact with others.
    The Pew Internet & American Life Project has found that 64% of online teens ages 12-17 have participated in one or more among a wide range of content-creating activities on the internet, up from 57% of online teens in a similar survey at the end of 2004.
    Girls continue to dominate most elements of content creation. Some 35% of all teen girls blog, compared with 20% of online boys, and 54% of wired girls post photos online compared with 40% of online boys. Male teens, however, do dominate one area — posting of video content online. Online boys are nearly twice as likely as online girls (19% vs. 10%) to have posted a video online where others could see it.”

The Kay quote reminds me of Arthur C. Clarke’s third law:

“Any sufficiently advanced technology is indistinguishable from magic.”

although I am not sure anyone really sees blogging as magic. Maybe it’s not yet sufficiently advanced.

The XKCD data died in a blogging accident

January 13th, 2008

The popular xkcd had another Web-related comic yesterday, but it turned out to be self-negating. As was noted on Slashdot:

“As I noted yesterday (and was joined by many others)… in an offhand observation xkcd has singlehandedly changed a small section of the Internet. Changing the results from a Google search for “Died in a Blogging Accident” from 2 to (at this writing) over 7,170 in a little more than 24 hours.”

The number of results has since grown to 13.3K, and then to 66.1K (8/10/08). I guess something like the Heisenberg uncertainty principle applies to the Internet, too.

Update 1/15: Here’s a trend graph from blogpulse for occurrences of “died in a blogging accident” in blogs as of 09:00 gmt+5 on 15 January 2008. Click graph to see current data.

mentions of ‘died in a blogging accident’ in blogs as of 15 Jan 2008 09:00 gmt+5 via blogpulse

Update 1/16: Google Trends shows a sudden interest in the dangers of blogging last week. Here’s a graph from 16 January 2008. Click on the graph to see the current trend graph.

Google searches as of 16 Jan 08 for ‘died in a blogging accident’

How Google processes 20 petabytes of data each day

January 9th, 2008

The latest CACM has an article by Google Fellows Jeffrey Dean and Sanjay Ghemawat with interesting details on Google’s text processing engines. Niall Kennedy summarized it this way in his blog post, Google processes over 20 petabytes of data per day.

“Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. The average MapReduce job ran across approximately 400 machines in September 2007, crunching approximately 11,000 machine years in a single month.”

If big numbers numb your mind, 20 petabytes is 20,000,000,000,000,000 bytes (or 22,517,998,136,852,480 for the obsessive-compulsives among us), enough data to fill over five million 4 GB iPods a day, which, if laid end to end, would …
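The arithmetic above is easy to check. The first figure uses decimal (SI) petabytes, the parenthetical one binary petabytes (PiB), and the iPod count uses decimal gigabytes:

```python
# 20 petabytes, under the decimal (SI) and binary (PiB) definitions
pb_decimal = 20 * 10**15   # 20,000,000,000,000,000 bytes
pb_binary = 20 * 2**50     # 22,517,998,136,852,480 bytes

ipod_capacity = 4 * 10**9  # a 4 GB iPod, decimal gigabytes
ipods_per_day = pb_decimal // ipod_capacity

print(ipods_per_day)       # 5000000 -- five million iPods a day
```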

Kevin Burton has a copy of the paper on his blog.

Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM, 51(1):107-113, January 2008.

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day.

Dean and Ghemawat conclude their paper by summarizing the key reasons why MapReduce has worked so well for Google.

“First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems. Third, we have developed an implementation of MapReduce that scales to large clusters of machines comprising thousands of machines. The implementation makes efficient use of these machine resources and therefore is suitable for use on many of the large computational problems encountered at Google.”
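The programming model the abstract describes can be illustrated with the canonical word-count example. This is a single-process sketch of the map/reduce idea only, not Google’s distributed implementation; all the function names here are ours:

```python
from collections import defaultdict
from itertools import chain

def map_fn(document):
    # Emit an intermediate (key, value) pair for each word
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Combine all values that share a key
    return (key, sum(values))

def mapreduce(documents, map_fn, reduce_fn):
    # The real runtime distributes these phases across thousands of
    # machines and handles failures; here both run in one process.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(d) for d in documents):
        groups[key].append(value)            # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["the cat sat", "the dog sat"], map_fn, reduce_fn)
# counts == {"the": 2, "cat": 1, "sat": 2, "dog": 1}
```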

RDFa tutorial video

January 7th, 2008

Manu Sporny has followed up on his Semantic Web for Noobs video presentation with one covering the basics of RDFa. RDFa provides a standard way to embed RDF data in HTML documents and allows RDF to be used as semantic markup for the text that people see. The “a” in RDFa stands for attribute; it is part of the name because the RDF information is embedded as attributes of standard HTML elements.

It’s not intended that average Web content creators will enter the RDFa markup by hand any more than it’s expected that they enter HTML markup. Rather, the RDFa will be added by HTML editors, embedded in page templates, or generated by programs.
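As a hypothetical illustration (the names and URLs here are ours, not taken from the primer), a fragment like this lets the visible text double as machine-readable RDF via a few extra attributes:

```html
<!-- Illustrative sketch only: the about, property, and rel attributes
     carry the RDF; the text itself is what readers see. -->
<div xmlns:foaf="http://xmlns.com/foaf/0.1/" about="#me">
  <span property="foaf:name">Jane Example</span> keeps a
  <a rel="foaf:weblog" href="http://example.org/blog">weblog</a>.
</div>
```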


The RDFa specification is being developed by the W3C’s Semantic Web Deployment Working Group and is described in the RDFa Primer.


Intel explains withdrawal from OLPC board

January 5th, 2008

Here’s Intel’s explanation for its resignation from the OLPC board, as described in an email message from Chuck Mulloy to Dave Farber (link):

“Intel and OLPC are in agreement on the need to provide children around the world with new opportunities to explore experiment and express themselves through the use of technology. However, Intel and OLPC have reached a philosophical impasse and Intel is no longer a member of OLPC.

OLPC, through its chairman Nicholas Negroponte, had asked Intel to end its support for the non-OLPC platforms including the Intel designed classmate PC and to focus its support exclusively on the OLPC system the XO.

Intel concluded that it cannot accommodate the OLPC request for two reasons: First Intel has long believed that there is no single solution to the needs of children in emerging and underdeveloped markets. We have always said there will be many solutions but the most important priority is to serve the need. Secondly, if Intel were to exclusively support the XO over other platforms it would force us to abandon our relationships with many local OEMs, suppliers and in some cases governments. Enabling a localized solution in developing countries is a core value in Intel’s efforts because it reaches beyond just the benefits for children to create home grown businesses and entrepreneurship. We believe that the more solution providers there are in this area the more quickly and efficiently the benefits will spread.

It is unfortunate that after more than six months of discussion on this key point we have been unable to reach any agreement and have mutually elected to go our separate ways.” (link)

Intel withdraws from One Laptop Per Child (OLPC) project

January 5th, 2008

Can’t we all just get along?

This week Intel announced that it was withdrawing its participation in the One Laptop Per Child project. OLPC’s Nicholas Negroponte wrote the following in response, which I received via Dave Farber’s IP mailing list with the annotation “The following is a reformatted version of a statement from Nicholas Negroponte, founder and chairman of the One Laptop per Child project, received via electronic mail. The statement was confirmed by the sender.”

“We at OLPC have been disappointed that Intel did not deliver on any of the promises they made when they joined OLPC; while we were hopeful for a positive, collaborative relationship, it never materialized.

Intel came in late to the OLPC association: they joined an already strong and thriving OLPC Board of Directors made up of premier technology partners; these partners have been crucial in helping us fulfill our mission of getting laptops into the hands of children in the developing world. We have always embraced and welcomed other low-cost laptop providers to join us in this mission. But since joining the OLPC Board of Directors in July, Intel has violated its written agreement with OLPC on numerous occasions. Intel continued to disparage the XO laptop in developing nations that had already decided to partner with OLPC (Uruguay and Peru), with countries that were in the midst of choosing a laptop solution (Brazil and Nigeria), and even small and remote places (Mongolia).

Intel was unwilling to work cooperatively with OLPC on software development. Over the entire six months it was a member of the association, Intel contributed nothing of value to OLPC: Intel never contributed in any way to our engineering efforts and failed to provide even a single line of code to the XO software efforts – even though Intel marketed its products as being able to run the XO software. The best Intel could offer in regards to an “Intel inside” XO laptop was one that would be more expensive and consume more power – exactly the opposite direction of OLPC’s stated mandate and vision.

Despite OLPC’s best efforts to work things out with Intel and several warnings that their behavior was untenable, it is clear that Intel’s heart has never been in working collaboratively as a part of OLPC.

This is well illustrated by the way in which our separation was announced single-handedly by Intel; Intel issued a statement to the press behind our backs while simultaneously asking us to work on a joint statement with them. Actions do speak louder than words in this case. As we said in the past, we view the children as a mission; Intel views them as a market.

The benefit to the departure of Intel from the OLPC board is a renewed clarity in purpose and the marketplace; we will continue to focus on our mission of providing every child with an opportunity for learning.”

Intel has its own approach to providing a low-cost computer for educational use, the Classmate PC, and the hugely popular (among geeks, anyway) Asus Eee PC is based on it. The two approaches are quite different, and both should be pursued, hopefully with cooperation.

I’m conflicted about the two. The OLPC XO is innovative, using interesting new ideas for both hardware (e.g., mesh networking) and software (e.g., the eToys version of Squeak), so as a computer scientist I think it’s great. The Intel approach, at least judging from the Asus Eee PC, is more conventional, so as a potential user I think it’s a pragmatic choice. (More on this later.)

Making low-cost, networked computers available to people in developing nations is a big idea that will unleash a disruptive change for the better. For all of us.

Call for ICWSM Posters, Jan 6 deadline

January 2nd, 2008

If you have something you would like to submit to the Second International Conference on Weblogs and Social Media (ICWSM 2008), there is still time to send in a two page poster paper or demonstration paper. These are due before 11:59pm PDT on Sunday, January 6. See the ICWSM 2008 conference site for details.

Hoosgot exploits the wisdom of the Blogosphere crowd

January 2nd, 2008

Technorati founder David Sifry launched a new service last week, when everyone was recovering from one holiday and preparing for another. Hoosgot (Who’s got …) lets you ask the collective Blogosphere by posting a question on your blog or on the Twitter microblogging system. You need to include the term hoosgot in your blog post or @hoosgot in your Twitter update to have it noticed.

Sifry’s explanation of how Hoosgot happened reinforces my belief that the greatest skill a practical computer scientist can have is being able to quickly test a new idea by turning it into running code.

“You gotta love Holiday Weekends. Friday night (the 28th) the lazyweb popped back into my mind. I missed it. I started asking myself the questions, “Why hasn’t anyone reconstituted the lazyweb? What if we could rebuild the lazyweb for the 2008 web? What if we could take advantage of all the cool tools that have arrived in the last 5 years? Would it work?” Rather than wait around, I realized I could just build it, and maybe folks like me would use it. At about 5am on Saturday morning, the first prototype was up. I made some major changes, including twitter support, Saturday night. And launch is today, on Sunday morning! Ain’t working on the web fun? :-)” (link)

Of course it helped that he could tweak Technorati to collect blog posts and tweets.

Will it work? Hooknows. One problem is spam, and Sifry is well positioned to deal with it. The other is that the wisdom of crowds is not uniform. Since your Hoosgot query goes out to a very broad group, a narrow question on an obscure aspect of Java programming will be a head-scratcher to most. If you ask the blogmob for a movie recommendation, they will tell you to go see Norbit, which was 2007’s 29th highest-grossing movie but also so irredeemably horrible that it almost killed Eddie Murphy’s career.

There are several things that could address these problems. Learning to spot Hoosgot spam and automatically adjust the model as it evolves is one. Another is to classify the Hoosgot queries by intent, topic and geography. Both of these are made more difficult if the queries are short, as they will be for Twitter-based queries. We’ve dealt with some of this in Akshay Java’s recent work on analyzing Twitter updates (Why We Twitter: Understanding Microblogging Usage and Communities).
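As a hypothetical sketch of the topic-classification idea (nothing like this is part of Hoosgot, and the smoothing here is deliberately crude), a bag-of-words naive Bayes model is the classic starting point for short queries:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTopics:
    """Toy sketch: classify short queries by topic with bag-of-words
    naive Bayes. Short texts give few features, which is exactly why
    Twitter-sized queries make this hard."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word counts
        self.topic_counts = Counter()            # topic -> #examples

    def train(self, query, topic):
        self.topic_counts[topic] += 1
        self.word_counts[topic].update(query.lower().split())

    def classify(self, query):
        words = query.lower().split()
        total = sum(self.topic_counts.values())
        best, best_score = None, float("-inf")
        for topic, n in self.topic_counts.items():
            # log prior + log likelihood with rough add-one smoothing
            denom = sum(self.word_counts[topic].values()) + len(words)
            score = math.log(n / total)
            for w in words:
                score += math.log((self.word_counts[topic][w] + 1) / denom)
            if score > best_score:
                best, best_score = topic, score
        return best
```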


New US RFID pass card raises privacy and security concerns

January 1st, 2008

Today’s Washington Post has a story, Electronic Passports Raise Privacy Issues, on the new passport card that’s part of the DOS/DHS Western Hemisphere Travel Initiative. The program is controversial since the cards use “vicinity read” radio frequency identification (RFID) technology that can be read from a distance of 20 or even 40 feet. This is in contrast to the “proximity read” RFID tags in new US passports that require that the reader be within inches. The cards will be available to US citizens to speed their processing as they cross the borders in North America.

“The goal of the passport card, an alternative to the traditional passport, is to reduce the wait at land and sea border checkpoints by using an electronic device that can simultaneously read multiple cards’ radio frequency identification (RFID) signals from a distance, checking travelers against terrorist and criminal watchlists while they wait. “As people are approaching a port of inspection, they can show the card to the reader, and by the time they get to the inspector, all the information will have been verified and they can be waved on through,” said Ann Barrett, deputy assistant secretary of state for passport services, commenting on the final rule on passport cards published yesterday in the Federal Register.” src

As described in the ruling published in the Federal Register, the Government feels that privacy concerns have been addressed.

“The government said that to protect the data against copying or theft, the chip will contain a unique identifying number linked to information in a secure government database but not to names, Social Security numbers or other personal information. It will also come with a protective sleeve to guard against hackers trying to skim data wirelessly, Barrett said.” src

Of course, if you carry the card in your purse or wallet, your movements can still be tracked by the unique ID on the card. There are also security concerns since the tag’s ID may be cloned.

“Randy Vanderhoof, executive director of the Smart Card Alliance, represents technology firms that make another kind of RFID chip, one that can only be read up close, and he is critical of the passport card’s technology. It offers no way to check whether the card is valid or a duplicate, he said, so a hacker could alter the number on the chip using the same techniques used in cloning. “Because there’s no security in the numbering system, a person who obtains a passport card and is later placed on a watchlist could easily alter the number on the passport card to someone else’s who’s not on the watchlist,” Vanderhoof said.” src