February 17th, 2007, by Akshay Java, posted in Uncategorized
Lately, there has been a lot of discussion about measuring influence using inlinks. A recent post by Matthew Hurst talks about Biz360, a market intelligence company that is using a PageRank-like metric to measure influence. Their approach measures influence using not just the inlinks to a blog but also the rank of the blogs linking to it. Inlinks, PageRank and all its variations work out alright if one is trying to define a global ranking for all blogs. But I think we need to start talking about influence more in terms of communities and readership.
The key question here is not “What is a blog’s influence?”, but “Whom does this blog have an influence on, and how much?”. This is a more difficult problem, and I think the role of communities in influence metrics is quite important. To illustrate this point, here is an example of a small community of political blogs, where the size of each node is proportional to its inlinks.
The interesting thing here is that once we identify the community, inlink counts are quite effective in finding influential nodes. On the other hand, if we just look at the top blogs using PageRank alone, we get a mixed bag of topics. For example, the following are the top three blogs in the dataset using PageRank:
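For readers unfamiliar with the computation being discussed, here is a minimal PageRank sketch on a toy link graph. The blog names and links are hypothetical, for illustration only; this is the textbook iteration, not Biz360’s actual metric.

```python
# Toy link graph: each blog maps to the blogs it links to (hypothetical data)
links = {
    "blog_a": ["blog_b", "blog_c"],
    "blog_b": ["blog_c"],
    "blog_c": ["blog_a"],
    "blog_d": ["blog_c"],  # blog_d links out but receives no links
}

def pagerank(links, damping=0.85, iterations=50):
    """Basic power-iteration PageRank over a dict-of-outlinks graph."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets the baseline (1 - d) / N ...
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        # ... plus a damped share of rank from each node linking to it
        for node, outlinks in links.items():
            share = rank[node] / len(outlinks)
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

ranks = pagerank(links)
best = max(ranks, key=ranks.get)  # blog_c, which three blogs link to
```

The same iteration restricted to a single community’s link graph is what makes inlink-style counts meaningful within that community.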
Over the last few years, Intrade — with headquarters in Dublin, where the gambling laws are loose — has become the biggest success story among a new crop of prediction markets. Another company, Newsfutures, helps the world’s largest steel maker, Arcelor Mittal, run an internal market on which executives predict the price of steel. At Best Buy, a company called Consensus Point has helped start a market for employees to guess which DVDs and video game consoles, among other products, will be popular. Google and Eli Lilly have similar markets. The idea is to let a company’s decision-makers benefit from the collective, if often hidden, knowledge of their employees.
According to the article, Intrade’s odds correctly forecast the outcome in all 50 states in the 2004 US presidential election. What do they say about the 2008 election?
Intrade now makes John McCain the favorite for the Republican nomination, followed by Rudy Giuliani. For the Democrats, Hillary Rodham Clinton has about a 50 percent chance of being the nominee, more than twice as much as anyone else. The most counterintuitive forecast is that Mrs. Clinton is given a better-than-even chance of winning the general election if she is nominated, while Mr. McCain — perhaps because he wants to keep fighting the war in Iraq — is not.
Markets go up and they go down, so we’ll see how things progress. Shares in Vice President Cheney becoming the 2008 GOP nominee are a bargain at $1.50 compared to $37.40 for John McCain. Shares in Cheney resigning by December 7, 2007 are selling at $27.00. This looks like much more fun than actually voting.
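The share prices above can be read as implied probabilities. A rough sketch, under the assumption that a contract pays $100 if the event occurs and nothing otherwise (so price divided by 100 approximates the market’s probability estimate):

```python
def implied_probability(price, payout=100.0):
    """Probability implied by a contract's price, assuming it pays
    `payout` if the event occurs and nothing otherwise."""
    return price / payout

# Prices quoted in the post
mccain_nominee = implied_probability(37.40)  # roughly a 37% chance
cheney_nominee = implied_probability(1.50)   # roughly a 1.5% chance
```

This is why a $1.50 Cheney contract is a “bargain” only if you think his real chances are well above 1.5%.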
Intrade has categories for current events (e.g., Bird flu breaking out in the US, Air strikes in Iran), entertainment (e.g., Jesus Camp wins an Oscar), financial, legal, politics, and weather.
The oldest Internet prediction market is the Iowa Electronic Markets, which are operated by faculty at the University of Iowa Tippie College of Business as part of their research and teaching mission.
February 11th, 2007, by Pranam Kolari, posted in Uncategorized
AIRWeb 2007 is the third in a series of workshops on Adversarial Information Retrieval on the Web. This year the workshop also features a web spam challenge.
AIRWeb is a series of international workshops focusing on Adversarial Information Retrieval on the Web that brings together both researchers and industry practitioners, to present and discuss advances in the state of the art. This year, AIRWeb’2007 will be co-located with the WWW’07 conference in Banff, Canada. The workshop will include a Web Spam challenge that will test different spam detection techniques on a shared reference collection.
The call for papers lists an interesting set of problems, including new ones like malicious tagging.
Web spam is one area where research is highly influenced by discussions with practitioners. With sponsorship and involvement from industry leaders, this should be a great venue to seek input. We plan to submit a paper on our continuing work on splogs.
Hotels, restaurants and online shops that post glowing reviews about themselves under false identities could face criminal prosecution under new rules that come into force next year. Businesses which write fake blog entries or create whole websites purporting to be from customers will fall foul of a European directive banning them from ‘falsely representing oneself as a consumer’. … The change is part of a Europe-wide overhaul of the consumer protection laws. It will oblige businesses not to mislead consumers and also will outlaw aggressive commercial practices such as aggressive doorstep selling, bogus ‘closing down’ sales and pressurising parents through their children to buy products.
February 11th, 2007, by Tim Finin, posted in Uncategorized
The Hidden Persuaders, a best-selling book from the 1950s by journalist Vance Packard, popularized the idea that our beliefs and desires are being consciously manipulated by advertising agencies and the media. Having become a meme, everyone now agrees that it’s part of modern life, and much more so now than it was fifty years ago. It’s a depressing idea, though, which no one really likes, except when the techniques have to be used to promote one’s own ends. The evolution of the Web, and of social media in particular, is thought to offer an antidote to the hidden persuaders. If all of us are empowered to develop and publish content, then maybe the crowds, in their wisdom, can filter out the hype and marketing and identify the authentic content. Well, human ingenuity knows no bounds.
Yesterday’s WSJ has a feature article, The Wizards of Buzz (no subscription required), on social bookmarking sites like Digg and del.icio.us along with a podcast interview with one of the reporters, John Jurgensen.
What makes the article interesting is that it is more than just an overview piece; some real investigation was done.
“To find the key influencers, The Wall Street Journal analyzed more than 25,000 submissions across six major sites. With the help of Dapper, a company that designs software to track information published on the Web, this analysis sifted through snapshots of the sites’ home pages every 30 minutes over three weeks. The data included which users posted the submissions and the number of votes each received from fellow users. We then contacted scores of individual users to find which ones are tracked by the wider community.”
But it’s always good to have some careful studies. Here’s what the WSJ reported:
“Though it can take hundreds or thousands of votes to make it onto the hot list at these sites, the Journal’s analysis found that a substantial number of submissions originated with a handful of users. At Digg, which has 900,000 registered users, 30 people were responsible for submitting one-third of postings on the home page.”
Of course, what makes the article fun to read is that it profiles twenty of the “hidden influencers” it found, including this young man.
“On Reddit, one of the most influential users is 12-year-old Adam Fuhrer. At his desktop computer in his parents’ home in the quiet northern Toronto suburb of Thornhill, Mr. Fuhrer monitors more than 100 Web sites looking for news on criminal justice, software releases — and the Toronto Maple Leafs, his favorite hockey team. When Microsoft launched its Vista operating system this year, he submitted stories that discussed its security flaws and price tag, which attracted approving votes from more than 500 users.”
Reading the capsule descriptions of the twenty influencers profiled, I can see why my own occasional efforts to shamelessly promote our research group on Digg have been complete failures. Most of the profiled influencers spend several hours every day looking for content to push.
February 9th, 2007, by Tim Finin, posted in Uncategorized
View in Quicktime:
More than 500 middle-school youth and their families participated in the FIRST LEGO League Maryland State Tournament, a competition that builds students’ ability to design and program LEGO robots, on Saturday, 20 January 2007. The FIRST LEGO League is an international robotics program intended to encourage enthusiasm for discovery, science, and technology in kids ages 9 to 14. This year, teams built LEGO robots to perform functions such as removing “pizza molecules” from a paper plate. Judges evaluated the teams on their ability to program robots to achieve tasks relevant to nanotechnology, a scientific frontier focused on achieving advances in medicine and computers through research into particles 100,000 times smaller than the thickness of a single strand of hair.
February 6th, 2007, by Tim Finin, posted in Uncategorized
There is a good article on Alan Sherman’s electronic voting research in last week’s Catonsville Times and also in the Arbutus Times. The hardcopy versions have a very good picture of him in front of the UMBC pond. I think that Alan’s position on the current problems with voting systems in the US, and what to do about them, is very pragmatic. The Punchscan system that Alan and his students are collaborating on looks very interesting, as do the VoComp competition and conference.
February 6th, 2007, by Tim Finin, posted in Uncategorized
CRA blogs about new data on CS and CE enrollments and interest among incoming freshmen.
Interest in computer science (CS) and computer engineering (CE) as majors among incoming freshmen at all undergraduate institutions remained low in 2006, according to survey results from the Higher Education Research Institute at the University of California at Los Angeles (HERI/UCLA). After peaking in 1999 and 2000, interest in CS as a major fell 70 percent between 2000 and 2005. In the fall of 2006, 1.1 percent of incoming freshmen indicated CS as their probable major, the same as in 2005.
CRA reports that this year’s Taulbee Survey data (available March 1) will show a second year of double-digit declines in CS undergraduate enrollment and degree production. Interest in Computer Engineering shows a similar decline.
February 5th, 2007, by Tim Finin, posted in Uncategorized
Wow, what a depressing article featured on Slashdot tonight: The death of computing. And on the site of the venerable British Computer Society. And so wrong, I think. I can’t think of a time when computing has had so many interesting and provocative prospects.
It’s certainly true that CS enrollments are down significantly, here in the US and, apparently, in the UK as well. But we’ve seen this before, about 15 years ago.
The anecdotal evidence we see suggests that job opportunities are back up. Companies like IBM, Google, Yahoo, and Microsoft are putting the rush on our students. There’s a rash of startups focused on many new web and internet oriented ideas: Web 2.0, social media, semantic web, SOA, information extraction, interactive entertainment, pervasive computing, etc.
I think this is still a good time to be in information technology.
February 3rd, 2007, by Tim Finin, posted in Uncategorized
Mr. Google is a dull fellow. He doesn’t appreciate irony, have a sense of humor, get outraged or have an ounce of curiosity. If you want people to find your content, and who doesn’t, you need to think like a machine, not like a human. Give articles and posts a title, headline or subject that has the right keywords and phrases to ensure that they will be found by search engines and ranked high in the results list.
“Pithy, witty and provocative headlines–the pride of many an editor–are often useless and even counterproductive in getting the Web page ranked high in search engines. A low ranking means limited exposure and fewer readers.”
This holds for all kinds of content — news articles, research papers, dissertations, blog posts, Web pages, ebay listings, etc. It also works off the Web — pick good subject lines for your email messages and descriptive names for your files.
In traditional MSM, editors write headlines, not reporters. The assumption is that the reader is already looking at the story, holding a newspaper in her hand or seeing it on a news stand, and the headline’s job is to make her want to read the story. But on the Web, the scenario is a bit different. A person is seeking information about a topic and typing keywords and phrases into a search engine to find relevant documents.
News organizations that generate revenue from advertising are keenly aware of the problem; they are using coding techniques and training journalists to rewrite print headlines so they say clearly what each story is about. The science behind this is called SEO, or search engine optimization, and it has spawned a whole industry of companies dedicated to helping Web sites get noticed by Google’s search engine.
It works, too.
In November, Nielsen/NetRatings ranked Boston.com, the sister Web site of The Boston Globe, as the fourth-most trafficked newspaper Web site in the country, even though its print circulation is ranked 15th by one audit bureau. “We’re regularly beating the bigger boys, like the Chicago Tribune and The Wall Street Journal…and part of the reason is SEO,” said David Beard, editor of Boston.com and former assistant managing editor of its print sibling, The Boston Globe. “We have Web ‘heds.’ We go into the newspaper (production) system to create a more literal Web headline,” said Beard. “We’ve had training sessions with copy editors and the night desk for the newspaper. It’s been a big education initiative.”
For our research group, we’ve been applying these ideas whenever we name or title anything. For example, I had to fight the urge to give this post a clever title, like Google has no sense of humor. If you must be witty, save it for the lede, the first sentence. Not only should the title be descriptive, it needs to be timely in using the current words and phrases being used for a topic.
So (this is work in progress), instead of titling a paper Modeling trust and influence in the blogosphere using link polarity, I think Modeling bias, trust and influence in blogs using link sentiment is better. Few (maybe just one?) research groups use the phrase link polarity to mean associating a sentiment (e.g., love it, hate it, don’t care) with a hypertext link. I’m guessing that blogs might be more common in a search than blogosphere. Finally, my choice reflects an intuition that modeling bias in news and information sources will be a topic of increasing interest over the next few years. Of course, I’d like to shove in some additional keywords, like community or opinion, but there is only so much that one title can bear.
February 1st, 2007, by Pranam Kolari, posted in Uncategorized
We present some updates on the splogosphere as seen at a pingserver (weblogs.com). This follows our study from a year earlier, which reported on splogs in the English-speaking blogosphere. Our current update is based on 8.8 million pings received at weblogs.com between January 23rd and January 26th. Though not fully representative, it gives a good sense of spam in the indexed blogosphere.
(i) 53% of all pings are spam, and 64% of all pings from blogs in English are spam. A year earlier we found that close to 75% of all pings from English blogs were spings. Dave Sifry reported seeing 70% spings in his last report. Clearly, the growth of spings has plateaued; one less thing to worry about.
(ii) 56% of all pinging blogs are spam. By collapsing these pings to their respective blogs, we chart the distribution of authentic blogs against splogs. These numbers have seen no change: 56% of all pinging blogs are splogs.
(iii) MySpace is now the biggest contributor to the blogosphere. The other key drivers, LiveJournal and blogs managed by SixApart (as seen at their update stream), contribute only 50-60% of what MySpace does. The growth of MySpace blogs has in fact dwarfed the growth of splogs! Further, if MySpace is discounted in our analysis, close to 84% of all pings are spings! Though MySpace is relatively splog free, we are beginning to notice splogs there, something blog harvesters should keep an eye on. [Note that not all blogspot blogs ping weblogs.com]
(iv) Blogspot continues to be heavily spammed. Most of this spam, however, is now detected by blog search engines, a point also made by Matt Cutts and Randy Morin. Of all the pings we processed, 51% of blogspot blogs were spam!
(v) Most spam blogs are still hosted in the US. We ranked the IPs associated with spam blogs by their frequency of pings and located them using ARIN.
Mountain View, CA
San Francisco, CA
Blogspot hosts the highest number of splogs, but we also found that most of the other top hosts were physically located in the US. Perhaps Jonathan Bailey knows more about the legal ramifications.
(vi) Content on the .info domain continues to be a problem. 99.75% of all blogs hosted on these domains are spam. In other words, 1.65 million blogs were spam, as opposed to only around 4K authentic blogs! As long as these domains are cheap and keyword rich, this trend is likely to continue. Sploggers are also exploiting private domain registration services (see here).
(vii) High PPC contexts remain the primary motivation to spam. We identified the top keywords associated with spam blogs and generated a tag cloud using keyword frequency.
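As a sketch of how numbers like these are derived, here is a simplified computation over labeled ping records. The log format and the spam labels are hypothetical; in practice the spam flag comes from a trained splog classifier, not a hand-labeled list.

```python
from urllib.parse import urlparse

# Hypothetical labeled ping records: (blog URL, classified-as-spam flag)
pings = [
    ("http://buy-cheap-meds.info/", True),
    ("http://buy-cheap-meds.info/", True),   # splogs often ping repeatedly
    ("http://example.blogspot.com/", False),
    ("http://realblog.example.org/", False),
    ("http://casino-deals.info/", True),
]

def spam_ping_rate(pings):
    """Fraction of individual pings that are spam (the sping rate)."""
    return sum(1 for _, spam in pings if spam) / len(pings)

def spam_blog_rate(pings):
    """Fraction of unique pinging blogs that are splogs, i.e. pings
    collapsed to their respective blogs before computing the rate."""
    blogs = {}
    for url, spam in pings:
        blogs[urlparse(url).netloc] = spam
    return sum(blogs.values()) / len(blogs)
```

The distinction between the two rates is why point (i) above (percentage of pings) and point (ii) (percentage of pinging blogs) report different numbers from the same data.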