UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
30 August 2008, 11:34:42 EDT  

Archive for the 'Web' Category

Spammers are using Amazon EC2

July 1st, 2008, by Tim Finin, posted in Social media, Web, splog

The Washington Post’s Security Fix blog has a post, Amazon: Hey Spammers, Get Off My Cloud!, reporting on allegations that spammers are starting to use Amazon’s Elastic Compute Cloud (EC2) servers. It only makes sense: you can sign up easily without committing to a contract of any length, the price is low, and the IP addresses are drawn from a wide range, making it hard to block them all. Besides, if Amazon’s EC2 IP addresses all get put on a spam blacklist, it will be bad for their many legitimate users. It may be tricky for Amazon to police this.

Blog comment spam magnet

July 1st, 2008, by Tim Finin, posted in Social media, splog

A good fraction of the comment spam that makes it through our Akismet filter is from people who are trying to add a comment to one of our posts about spam blogs or comments. Here’s an example from today’s batch: a comment on a two-year-old post, Blog comment spam with plagiarized text: hard to spot, sent from Cameroon and trying to promote the site africapresse.com.

“spam is a real problem in this day not just for .edu but for the entire internet world. Plagiarism is a problem too.”

It’s easy for me to classify this as spam since the comment was made on a very old post, is short, includes a reference to a site that looks commercial, and makes a few general and superficial statements that are not really tied to any of the post’s details.
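The signals I use when eyeballing a comment like this can be written down roughly as follows. This is only a minimal sketch and has nothing to do with how Akismet actually works: the `Comment` class, the `spam_signals` function, the two cutoff constants and the keyword set are all hypothetical names and values picked for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    post_age_days: int  # age of the post being commented on
    author_url: str     # site the commenter links to, "" if none


# Hypothetical cutoffs, chosen for illustration only.
OLD_POST_DAYS = 365
SHORT_COMMENT_WORDS = 30


def spam_signals(comment: Comment, post_keywords: set[str]) -> list[str]:
    """Return the names of the heuristic signals that fire for a comment."""
    signals = []
    words = re.findall(r"[a-z']+", comment.text.lower())
    if comment.post_age_days > OLD_POST_DAYS:
        signals.append("old_post")
    if len(words) < SHORT_COMMENT_WORDS:
        signals.append("short")
    # Crude stand-in for "looks commercial": the linked host ends in .com.
    if comment.author_url.lower().rstrip("/").endswith(".com"):
        signals.append("commercial_link")
    # "Generic" here means the comment uses none of the post's specific terms.
    if not post_keywords & set(words):
        signals.append("generic")
    return signals


example = Comment(
    text=("spam is a real problem in this day not just for .edu but for "
          "the entire internet world. Plagiarism is a problem too."),
    post_age_days=730,
    author_url="africapresse.com",
)
# Assumed keywords for the old post; the real post's terms may differ.
print(spam_signals(example, {"akismet", "plagiarized", "trackback"}))
```

A comment that trips all four signals is an easy call. Real filters weigh far more evidence than this, so treat the list as a way to make the reasoning explicit rather than as a working classifier.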

I think it’s ironic that so many SEO wannabes try to spam posts about spam. I guess they just have spam on the brain. So, I offer up this post as food for the comment spammers and their search and comment tools.

akismet, anti-spam, antispam, automated, automated, automatic, backlink, backlinks, bad behavior, blacklist, block, blocking, blog, blogging, capcha, comment, comment spam, comments, human, keywords, links, links, nofollow, pagerank, people, plagiarize, plagiarism, rank, search engine optimization, seo, spam, spam blogs, spam comments, spam karma, spamming, splog, splog, splogs, steal, target, trackbacks, traffic, typepad, wordpress.

Splogs and politics

July 1st, 2008, by Tim Finin, posted in Social media, Web, splog

Here’s something I never expected: splogs as a political issue. Actually, it’s allegations of political blogs being splogs, or rather allegations that people are accusing political blogs of being splogs in order to get Google to block them. The NYT Bits blog has a post, Google and the Anti-Obama Bloggers, that describes the controversy.

“Did Google use its network of online services to silence critics of Barack Obama? That was the question buzzing on a corner of the blogosphere over the last few days, after several anti-Obama bloggers were unable to update their sites, which are hosted on Google’s Blogger service. … In an article that appeared on Bloggasm.com, the reporter Simon Owens spoke with some of the affected bloggers, who said they believed that Google had fallen prey to a campaign by activists supporting Senator Obama. According to the bloggers, the Obama supporters had clicked on a “flag” on the anti-Obama blogs alerting Google that they were spam.”

Maybe this is a good reason to rely on the judgment of machines, at least until they start running for office.

Act before you think you think you think

June 28th, 2008, by Tim Finin, posted in AI, Semantic Web

The WSJ has an article, Get Out of Your Own Way, on research suggesting that people often form intentions to act and make decisions well before they are conscious of the fact. Maybe this is like detecting the inferences made by the OWL reasoner or the classification of a low-level SVM model before the high-level Python code processes its results. This picture from the article sums it up nicely.


As usual, you’re always the last to know. At least this opens up new interpretations for the old excuse, “Hey, I was out of the loop!”.

Microsoft rumored to buy semantic search startup Powerset

June 26th, 2008, by Tim Finin, posted in AI, NLP, Semantic Web, Web 2.0

Venture Beat reports that Microsoft will acquire Powerset for a price “rumored to be slightly more than $100 million”. Powerset has been developing a Web search system that uses natural language processing technology acquired from PARC to more fully understand users’ queries and the text of the documents it indexes.

“By buying Powerset, Microsoft is hoping to close the perceived quality gap with Google’s search engine. The move comes as Microsoft CEO Steve Ballmer continues to argue that improving search is Microsoft’s most important task. Microsoft’s market share in search has steadily declined, dropping further and further behind first-place Google and second place Yahoo.

Google has generally dismissed Powerset’s semantic, or “natural language” approach as being only marginally interesting, even though Google has hired some semantic specialists to work on that approach in limited fashion. Google’s search results are still based primarily on the individual words you type into its search bar, and its approach does very little to understand the possible meaning created by joining two or more words together.”

If you put the query “Where is Mount Kilimanjaro” into the beta version of Powerset, it answers “Mount Kilimanjaro: Contained by Tanzania” in addition to showing web pages extracted from Wikipedia. That’s a pretty good answer.

Its response to “what is the Serengeti” is a little less precise. It reports seven things it knows about Serengeti: that it replaced ‘desert, Platinum, twilight and Caribbean Blue’, that it hosted ‘migration’, that it provided ‘draw’, that it gained ‘fame’, that it recorded ‘explorations’, that it rutted ‘season’ and that it boasted ‘Blue Wildebeests’. I’m just glad I don’t have a school report on the Serengeti due tomorrow!

Asking “Who is the president of Zimbabwe” results only in the fallback answer — which appears to be just the set of Wikipedia pages that the query words produce in an IR query. Compare this with the results of the Google query who is the president of zimbabwe site:wikipedia.org.

By the way, the AskWiki system often does a better job on these kinds of questions. Asking “where is the Serengeti” produces the answer “The Serengeti ecosystem is located in north-western Tanzania and extends to south-western Kenya between latitudes 1 and 3 S and longitudes 34 and 36 E. It spans some 30,000 km.” It’s a bit of a hack, though. It seems to work by selecting the sentence or two in Wikipedia that best serves as an answer. See our post on AskWiki from last fall for more examples.

Still, Powerset is an ambitious system that shows promise. What they are trying to do is important and will eventually be done. They have shown real progress in the past two years, more than I had expected. I hope Microsoft can accelerate the development and find practical ways to improve Web search even if the ultimate goal of full language understanding is many years away.

Models? We don’t need no stinking models!

June 26th, 2008, by Tim Finin, posted in Semantic Web, Social media, Web, Web 2.0

Wired has an interesting article, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, that discusses the data-driven revolution that computers and the Web have unleashed. Science used to rely on developing models to explain and organize the world and make predictions. Now much of that can be done by correlating large amounts of data. The same holds for other disciplines (e.g., linguistics) and for businesses (think Google).

“All models are wrong, but some are useful.” So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. Until now. Today companies like Google, which have grown up in an era of massively abundant data, don’t have to settle for wrong models. Indeed, they don’t have to settle for models at all.

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

Update: And then there is this counterpoint: Why the cloud cannot obscure the scientific method.

Journal of Web Semantics has high impact factor

June 25th, 2008, by Tim Finin, posted in GENERAL, Semantic Web

During the past year, the Journal of Web Semantics was added to the list of journals indexed by Thomson Reuters. Their most recent Journal Citation Report (2007) gives the JWS an impact factor of 3.41, which is the third highest out of the 92 titles in its category — Computer Science, Information Systems.

Thomson Reuters’ journal impact factor is a measure of the frequency with which the average article in a journal has been cited in a particular year. The 2007 impact factor is computed as the citations received in 2007 to all articles published in 2006 and 2005, divided by the number of “source items” published in 2006 and 2005.
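The arithmetic is simple enough to show directly. Here is a small sketch of the calculation as described above; the function name `impact_factor` is my own, and the counts in the example are made up for illustration and are not the journal’s actual figures.

```python
def impact_factor(citations_in_year: int, source_items_prev_two_years: int) -> float:
    """Citations received in year Y to items published in Y-1 and Y-2,
    divided by the number of source items published in Y-1 and Y-2."""
    if source_items_prev_two_years <= 0:
        raise ValueError("need at least one source item in the two-year window")
    return citations_in_year / source_items_prev_two_years


# Made-up counts: 120 citations in 2007 to 40 items published in 2005-2006.
print(impact_factor(citations_in_year=120, source_items_prev_two_years=40))
```

Because the denominator only covers two years of source items, a young journal with a few heavily cited papers can post a high number.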

Technology Review special issue on Web 2.0

June 24th, 2008, by Tim Finin, posted in GENERAL, Security, Social media, Web 2.0

The July/August issue of Technology Review is focused on Web 2.0. The lead article, “The Business of Social Networks”, asks “Web 2.0–the dream of the user-built, user-centered, user-run Internet–has delivered on just about every promise except profit. Will its most prominent example, social networking, ever make any money?”

“Social networking is the fastest-growing activity on Web 2.0–the shorthand term for the new user-centered Internet, where everyone publicly modifies everyone else’s work, whether it’s an encyclopedia entry or a photo album. The growth of social networking is astonishing, and it has spread to sites of all sizes, which are increasingly intertwined as platforms open (see “Who Owns Your Friends?”). Even small players are soaring.”

There are quite a few interesting stories on various Web 2.0 topics. Visit the table of contents to see what’s available.

Web Science CACM cover article now online

June 23rd, 2008, by Tim Finin, posted in Semantic Web, Social media, Web, Web 2.0

The cover story of the July 2008 CACM (v51, n7) is Web Science by Jim Hendler, Nigel Shadbolt, Wendy Hall, Tim Berners-Lee, and Danny Weitzner. The article argues for an interdisciplinary approach to understanding the Web as an entity in its own right. It’s great that this article is freely available on the web. Ironically, figuring out what URL to use to link to it was a bit tricky, and the pages are rendered as PNG images to protect the IP. Still, it’s a good article that lays out an important new area of study in information systems.

“Despite the Web’s great success as a technology and the significant amount of computing infrastructure on which it is built, it remains, as an entity, surprisingly unstudied. Here, we look at some of the technical and social challenges that must be overcome to model the Web as a whole, keep it growing, and understand its continuing social impact. A systems approach, in the sense of “systems biology,” is needed if we are to be able to understand and engineer the future Web.”

What I find exciting is that one of the attributes that makes the Web so successful is that it is a system to which all can contribute. We need to make sure it remains that way and doesn’t devolve into a hegemonic structure.

Is it Lindsay Lohan or your friends who make you a binge drinker?

June 23rd, 2008, by Tim Finin, posted in Agents, Social, Social media

What determines our behavior or beliefs? Are we influenced by the well-known and popular leaders in our society (political, social, religious) or by the few hundred people in our immediate social network (family, friends and co-workers)? It’s reasonable to assume that it varies by domain or topic, with your music preferences falling in the first category and your spiritual orientation in the second.

Paul Ormerod and Greg Wiltshire have a preprint of a paper, ‘Binge’ drinking in the UK: a social network phenomenon (pdf), that reports on a study suggesting that binge drinking spreads through “small world” social networks rather than through imitation of influentials in a “scale free” network.

“We analyse the recent rapid growth of ‘binge’ drinking in the UK. This means the consumption of large amounts of alcohol, especially by young people, leading to serious anti-social and criminal behaviour in urban centres. We show how a simple agent-based model, based on binary choice with externalities, combined with a small amount of survey data can explain the phenomenon. We show that the increase in binge drinking is a fashion-related phenomenon, with imitative behaviour spreading across social networks. The results show that a small world network, rather than a random or scale free, offers the best description of the key aspects of the data.”

It’s fascinating that with the right data, simulation models can help to answer such questions.
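To give a feel for what such a simulation involves, here is a toy sketch of a binary-choice model on a small-world network. It is not the authors’ model: I am assuming a simple threshold rule (an agent adopts the behaviour once a given fraction of its neighbours has) and a Watts-Strogatz-style graph, and the names `small_world` and `simulate` and all of the parameters are hypothetical.

```python
import random


def small_world(n: int, k: int, p: float, rng: random.Random) -> list[set[int]]:
    """Ring lattice where each node links to its k nearest neighbours
    (k even, n > k), with each lattice edge rewired with probability p."""
    nbrs: list[set[int]] = [set() for _ in range(n)]
    for i in range(n):
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            nbrs[i].add(j)
            nbrs[j].add(i)
    for i in range(n):
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            if j in nbrs[i] and rng.random() < p:
                candidates = [c for c in range(n) if c != i and c not in nbrs[i]]
                if candidates:
                    new = rng.choice(candidates)
                    # Replace edge (i, j) with edge (i, new).
                    nbrs[i].discard(j)
                    nbrs[j].discard(i)
                    nbrs[i].add(new)
                    nbrs[new].add(i)
    return nbrs


def simulate(nbrs: list[set[int]], thresholds: list[float],
             seeds: set[int], steps: int) -> list[int]:
    """Synchronous updates; returns the number of adopters after each step.
    Assumed rule: adopt once the adopting fraction of neighbours reaches the
    agent's threshold, and never un-adopt."""
    state = [i in seeds for i in range(len(nbrs))]
    history = [sum(state)]
    for _ in range(steps):
        new_state = state[:]
        for i, ns in enumerate(nbrs):
            if not state[i] and ns:
                frac = sum(state[j] for j in ns) / len(ns)
                if frac >= thresholds[i]:
                    new_state[i] = True
        state = new_state
        history.append(sum(state))
    return history


graph = small_world(n=50, k=4, p=0.1, rng=random.Random(1))
print(simulate(graph, thresholds=[0.25] * 50, seeds={0, 1}, steps=10))
```

Swapping the graph generator for a random or scale-free one and comparing the adoption curves against survey data is the kind of experiment the paper describes, though the real model is fitted far more carefully than this toy.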

W3C announces RDFa as a candidate recommendation

June 20th, 2008, by Tim Finin, posted in KR, Ontologies, RDF, Semantic Web, Web 2.0

The W3C has officially announced that RDFa is a candidate recommendation:

“2008-06-20: The Semantic Web Deployment Working Group has published a Candidate Recommendation of RDFa in XHTML: Syntax and Processing. Web documents contain significant amounts of structured data, which is largely unavailable to tools and applications. When publishers can express this data more completely, and when tools can read it, a new world of user functionality becomes available, letting users transfer structured data between applications and web sites, and allowing browsing applications to improve the user experience. RDFa is a specification for attributes to be used with languages such as HTML and XHTML to express structured data. See the group’s RDFa implementation report. The Working Group also updated the companion document RDFa Primer. Learn more about the Semantic Web and the HTML Activity.”

Achieving candidate recommendation status is a significant step toward becoming a W3C recommendation. Congratulations to the working group for all of their efforts in developing RDFa.
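For readers who have not seen RDFa, the idea of carrying structured data in attributes can be illustrated with a few lines of Python. This is a rough sketch, not a conforming RDFa processor: it only looks at the `about`, `property` and `content` attributes, ignores scoping and prefix resolution, and assumes no markup is nested inside a `property` element. The XHTML fragment, its `/posts/rdfa-cr` subject and the names `RdfaSketch` and `extract_triples` are made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical XHTML fragment using Dublin Core terms via RDFa attributes.
SNIPPET = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="/posts/rdfa-cr">
  <h2 property="dc:title">RDFa is a candidate recommendation</h2>
  <span property="dc:creator">Tim Finin</span>
  <span property="dc:date" content="2008-06-20">June 20th, 2008</span>
</div>
"""


class RdfaSketch(HTMLParser):
    """Collect (subject, property, value) triples from a small subset of RDFa."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subject = ""  # most recent `about` value (no scoping)
        self._open = None   # (property, text chunks) for the element being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self._subject = a["about"]
        if "property" in a:
            if "content" in a:
                # An explicit `content` attribute overrides the element text.
                self.triples.append((self._subject, a["property"], a["content"]))
            else:
                self._open = (a["property"], [])

    def handle_data(self, data):
        if self._open is not None:
            self._open[1].append(data)

    def handle_endtag(self, tag):
        if self._open is not None:
            prop, chunks = self._open
            self.triples.append((self._subject, prop, "".join(chunks).strip()))
            self._open = None


def extract_triples(markup: str) -> list[tuple[str, str, str]]:
    parser = RdfaSketch()
    parser.feed(markup)
    parser.close()
    return parser.triples


for triple in extract_triples(SNIPPET):
    print(triple)
```

The point is that the same markup a browser renders for people also yields title, creator and date statements that a tool can pick up, which is the “new world of user functionality” the announcement is talking about.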

First Obama-McCain Twitter debate starts tonight

June 20th, 2008, by Tim Finin, posted in Social media, Web 2.0

The Personal Democracy Forum is sponsoring a Twitter debate tonight on “technology and government” between representatives of Barack Obama and John McCain, to be moderated by Time magazine blogger Ana Marie Cox. A note on PDF has the details:

“The McCain campaign will be represented by Liz Mair, the online communications director of the Republican National Committee. The Obama campaign will be represented by Mike Nelson, a professor at Georgetown University who served in the Clinton White House under Vice President Gore on tech policy issues. He is an outside advisor to Obama’s campaign on issues of technology, media and telecommunications.”

Of course, it remains to be seen what kind of debate can happen if short talking points are further compressed into 140-character tweeting points. It will be an interesting experiment.

“Mike, Liz and Ana will be using their personal Twitter accounts, @mikenelson, @lizmair and @anamariecox, and we’ve also asked them to tag their responses with the hashtag #pdfdebate. We suggest that readers who want to follow along use a Twitter application like Summize.com to track the conversation.”

The debate will start sometime tonight (Friday 20 June) and is expected to run through the end of the conference on Tuesday 24 June and maybe beyond.

