Joel Sachs on Linked Data, 10:30am Oct 1, ITE 325b

September 27th, 2007

I thought I would start blogging about our weekly ebiquity meetings, at least for those that might be of interest to people outside of our group. Our meetings are, in general, open and we are happy to have visitors, with or without warning. We meet on Monday mornings, from 10:30 to 11:30 or Noon, depending on the topic, in our department’s large conference room (325b ITE Building). This coming week (October 1) Joel Sachs will give us a tutorial on Linked Data. Here’s his abstract.

Linked Data refers to a collection of best practices for publishing data on the semantic web. It is also, in part, a re-branding of the semantic web itself, with less emphasis on semantics, and more on RDF linkages amongst data sources. Also heavily emphasized is the proper role of web architecture (HTTP requests and responses; 303 redirects; etc.), and the distinction between information resources (those that physically reside on the web) and non-information resources (those that exist in the so-called real world). I’ll give a brief overview of Linked Data, followed by a discussion of some issues that Linked Data raises for the SPIRE project. These issues include how Swoogle should handle information sources such as DBpedia, and how to link ETHAN to other sources of taxonomic and natural history information.
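To make the 303 convention concrete, here is a minimal Python sketch of the client-side logic. The URIs and the routing table are invented for illustration: an information resource is served directly with 200, while a non-information resource (a real-world thing with no bytes to return) gets a 303 redirect to a document that describes it.

```python
# Hypothetical routing table: uri -> (HTTP status, redirect target or None).
RESOURCES = {
    # Information resource: a document that lives on the web, served directly.
    "http://example.org/doc/lion.rdf": ("200 OK", None),
    # Non-information resource: the animal itself; the server answers
    # "303 See Other", pointing at a description document.
    "http://example.org/id/lion": ("303 See Other", "http://example.org/doc/lion.rdf"),
}

def dereference(uri):
    """Return (status, uri of the document actually served)."""
    status, location = RESOURCES[uri]
    if status.startswith("303"):
        # The client re-requests the description document the server named.
        return status, location
    return status, uri

print(dereference("http://example.org/id/lion"))
# → ('303 See Other', 'http://example.org/doc/lion.rdf')
```

The point of the indirection is that a statement about `http://example.org/id/lion` is about the lion, not about an RDF file.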

Pranam Kolari PhD dissertation defense: Detecting Spam Blogs

September 27th, 2007

Earlier this week Pranam Kolari successfully defended his Ph.D. dissertation, Detecting Spam Blogs: An Adaptive Online Approach. Here’s a video of the defense.

Abstract: Weblogs, or blogs, are an important new way to publish information, engage in discussions, and form communities on the Internet. Blogs are a global phenomenon, and with numbers well over 100 million they form the core of the emerging paradigm of Social Media. While the utility of blogs is unquestionable, a serious problem now afflicts them: spam. Spam blogs, or splogs, are blogs with auto-generated or plagiarized content whose sole purpose is to host profitable contextual ads and/or inflate the importance of linked-to sites. Though estimates vary, splogs account for more than 50% of blog content and present a serious threat to blogs’ continued utility.

Splogs impact search engines that index the entire Web or just the blogosphere by increasing computational overhead and reducing user satisfaction. Hence, search engines try to minimize the influence of spam, both prior to indexing and after indexing, by eliminating splogs, comment spam, social media spam, or generic web spam. In this work we further the state of the art of splog detection prior to indexing.

First, we have identified and developed techniques that are effective for splog detection in a supervised machine learning setting. While some of these are novel, others confirm the utility, in a new domain (the blogosphere), of techniques that have worked well for e-mail and Web spam detection. Specifically, our techniques identify spam blogs using URLs, home pages, and syndication feeds. Because our techniques must be usable prior to indexing, the emphasis of our effort is fast online detection.

Second, to effectively utilize the identified techniques in a real-world context, we have developed a novel system that filters out spam in a stream of update pings from blogs. Our approach applies filters serially, in increasing order of detection cost, which better supports balancing cost against effectiveness. We have used such a system to support multiple blog-related projects, both internally and externally.
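The serial-filter idea can be sketched as a short cascade: cheap detectors run on every ping, and only pings they cannot confidently reject reach the more expensive ones. The feature checks and threshold below are invented for illustration and are not Kolari's actual features.

```python
def url_filter(ping):
    """Cheapest check: inspect only the blog's URL string."""
    return 0.9 if "cheap-pills" in ping["url"] else 0.1

def content_filter(ping):
    """Costlier check: requires fetching and scanning the page text."""
    return 0.9 if ping["text"].count("buy now") >= 3 else 0.1

def classify(ping, filters=(url_filter, content_filter), threshold=0.5):
    """Run filters in increasing cost; stop at the first confident spam verdict."""
    for f in filters:
        if f(ping) >= threshold:
            return "splog"   # confident verdict: skip the costlier filters
    return "ham"

print(classify({"url": "http://cheap-pills.example.com/", "text": ""}))
# → splog
```

Most spam is caught by the cheap front of the cascade, so the average cost per ping stays close to the cheapest filter's cost.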

Next, motivated by these experiences and by input from real-world deployments of our techniques over more than a year, we have developed an approach for updating classifiers in an adversarial setting. We show how an ensemble of classifiers can co-evolve and adapt when used on a stream of unlabeled instances susceptible to concept drift, and we discuss how our system is amenable to such evolution and the approaches that can feed into it.

Finally, over the course of this work we have characterized the specific nature of spam blogs along various dimensions, formalized the problem and created general awareness of the issue. We are the first to formalize and address the problem of spam in blogs and identify the general problem of spam in Social Media. We discuss how lessons learned can guide follow-up work on spam in social media, an important new problem on the Web.

Committee: Prof. Tim Finin (Chair), Prof. Anupam Joshi, Prof. Yelena Yesha, Prof. Tim Oates, Dr. James Mayfield (JHU/APL), Dr. Nicolas Nicolov (Umbria)

UMBC Multicore Computational Center (MC**2) hits engadget

September 25th, 2007

An item on UMBC’s Multicore Computational Center was featured on engadget today. Technorati judges engadget to be the world’s most popular blog, receiving about one million pageviews a day; it was said to get ten million on the day the iPhone was released.

Last Friday UMBC held an event to launch MC**2, the UMBC Multicore Computational Center. MC**2 will focus on supercomputing research related to aerospace/defense, financial services, medical imaging and weather/climate change prediction. IBM awarded UMBC a significant gift to support the development of this new center, which researchers describe as an “orchestra” of one of the world’s most powerful supercomputing processors, the Cell Broadband Engine (Cell/B.E.). The Cell/B.E. was jointly developed by IBM, Sony, and Toshiba and is used in Sony’s PlayStation 3.

Stories on engadget and WBAL skewed the facts somewhat. UMBC is not building a supercomputer out of PS3s. Rather, IBM is giving UMBC a number of its new Cell Broadband Engine processor blades, which will be added to our existing IBM-based Beowulf system. The blades include QS20s and the soon-to-be-released QS21s. Their processors are based on the one in the PS3 but have much higher performance characteristics. For example, the QS21 has two 3.2 GHz Cell/B.E. processors, 2 GB of XDR memory, integrated dual Gb Ethernet, and an InfiniBand adapter. One of the goals of the IBM/UMBC partnership is to collaborate on exploring how Cell processors can be used for business, science, and engineering applications.

Top RDF namespaces

September 23rd, 2007

James Simmons posted about PTSW’s namespaces page, which has a complete list of the 388 namespaces they have seen, with frequencies of use. We reported on Swoogle’s list of the 100 most common RDF namespaces last year. There are some interesting differences. I’ve put the top 20 from each list side by side.

It’s interesting to note that there are only eight namespaces that are common to both lists — these are in black. The ones that are unique to a single list are in red.

[Table: top 20 RDF namespaces from PTSW and Swoogle, side by side]

The differences are no doubt due to (1) how the two systems acquire RDF documents, which determines the types of documents in their collections, and (2) the fact that these studies were done a year apart. If I get a chance in the next few days, I’ll re-run the query that produced Swoogle’s “top 100” list.

Is it a social network or a social graph?

September 22nd, 2007

Is it a social network or a social graph? What would Dr. Zaius say?

Dave Winer writes, in How to avoid sounding like a monkey, that we should prefer the term “social network” over “social graph” when talking about models of people and their relationships. All right, his stick has a bit of a point on it…

“Now if you showed that diagram to most educated people, they probably would call it a network, and before we talked about social graphs we called them social networks, and you know what — they’re exactly the same thing, and social network is a much less confusing term, so why don’t we just stick with it? (Answer: we should, imho.) So if you don’t want to sound like an idiot, call a social graph a social network and stand up for your right to understand technology, and make the techies actually do some useful stuff instead of making simple stuff sound complicated.” (link)

If you (or your software system) are extracting, representing, visualizing, analyzing and/or manipulating data about people and their relationships as abstract graph models or as graph data structures, it seems very reasonable to use the term “social graph”.
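For illustration, a social graph in the graph-data-structure sense can be as simple as adjacency sets (the names below are invented):

```python
from collections import defaultdict

graph = defaultdict(set)   # adjacency sets: person -> set of friends

def add_edge(a, b):
    """Friendship is symmetric, so record the edge in both directions."""
    graph[a].add(b)
    graph[b].add(a)

add_edge("alice", "bob")
add_edge("bob", "carol")

def friends_of_friends(person):
    """People two hops away who are not the person or an existing friend."""
    two_hops = {fof for f in graph[person] for fof in graph[f]}
    return two_hops - {person} - graph[person]

print(friends_of_friends("alice"))
# → {'carol'}
```

Whatever you call it, the friend-suggestion queries social sites run are exactly this kind of graph traversal.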

Google planning to leverage social network data?

September 21st, 2007

TechCrunch has an interesting post, Google To “Out Open” Facebook On November 5, on Google’s plans to counter Facebook by making their social network data portable and interoperable.

“Yesterday a select group of fifteen or so industry luminaries attended a highly confidential meeting at Google’s headquarters in Mountain View to discuss the company’s upcoming plans to address the “Facebook issue.” … Google’s goal – to fight Facebook by being even more open than the Facebook Platform. If Facebook is 98% open, Google wants to be 100%. The short version: Google will announce a new set of APIs on November 5 that will allow developers to leverage Google’s social graph data. They’ll start with Orkut and iGoogle (Google’s personalized home page), and expand from there to include Gmail, Google Talk and other Google services over time.” (link)

A thought leader on this at Google is said to be Brad Fitzpatrick, formerly of Six Apart. His Thoughts on the Social Graph post last month was commented on in semantic web circles. RDF would be a great representation standard to make this work of course. It’s not the only way, but it has a lot to offer. Brad’s post and associated slides do show lots of graph examples that sure look like RDF to me. There is also quite a bit of RDF discussion on the Social Network Portability Google Group.

TechCrunch offers this scenario

“In the long run, Google seems to be planning to add a social layer on top of the entire suite of Google services, with Orkut as their initial main source of social graph information and, as I said above, possibly adding third party networks to the back end as well. Social networks would have little choice but to participate to get additional distribution and attention.” (link)

which sounds believable.

Primitive smilie ancestor discovered

September 21st, 2007

This week we heard about the primitive human ancestor referred to as the hobbit. Now we learn that the smilie had primitive ancestors dating back to 1887. In a Language Log post, linguist Benjamin Zimmer talks about the “snigger point or note of cachinnation” and some other early emoticoids or proto-smilies. Ambrose Bierce suggested the earliest known one in 1887, as shown in this passage from The Collected Works of Ambrose Bierce, Vol. XI: Antepenultimata (1912).

[Image: primitive smilie ancestor]

Computer Science starting salaries continue to rise

September 19th, 2007

The latest National Association of Colleges and Employers (NACE) survey of initial salaries for new BS graduates shows that Computer Science graduates were among the best-paid majors for 2007. CS grads received an average salary offer of $53,051, a 4.5 percent increase over 2006. (link)

[Chart: NACE salary survey for new graduates]

Mr. Sulzberger, tear down that wall!

September 17th, 2007

The New York Times reports today (Times to End Charges on Web Site) that it will end its Times Select subscription service at midnight Tuesday. It will also provide free access to archives for the past 20 years and from 1851 to 1922.

“The newspaper said the TimesSelect project had met expectations, drawing 227,000 paying subscribers — out of 787,000 over all — and generating about $10 million a year in revenue. ‘But our projections for growth on that paid subscriber base were low, compared to the growth of online advertising,’ said Vivian L. Schiller, senior vice president and general manager of the site.”

The reason? Google. Well, more accurately, Internet search engines. Whether you are the Grey Lady or a simple research blog, most of your readers come to you via a search. If key content can’t be searched, you lose visitors and you lose online ad revenue.

“What changed, The Times said, was that many more readers started coming to the site from search engines and links on other sites instead of coming directly to the site. These indirect readers, unable to gain access to articles behind the pay wall and less likely to pay subscription fees than the more loyal direct users, were seen as opportunities for more page views and increased advertising revenue. ‘What wasn’t anticipated was the explosion in how much of our traffic would be generated by Google, by Yahoo and some others,’ Ms. Schiller said.”

The Times gave it a good try, and it’s significant that they decided it was not working. There is still a lot of uncertainty about what workable MSM business models will look like in the future. While I’m no expert in this area, I have to wonder if the Wall Street Journal will be next.

The WikiWar of 2008: Fred or Freddie?

September 17th, 2007

The Washington Post has an article, On Wikipedia, Debating 2008 Hopefuls’ Every Facet, about the Wikipedia editing wars going on in the pages for the 2008 candidates in the US presidential election. A current battle is over Republican candidate Fred Thompson.

“On Sen. John McCain’s Wikipedia entry, the argument has been over whether he is a conservative, moderate or liberal Republican. A heated exchange on former senator John Edwards’s page has centered on deleting any reference to his $400 haircuts. And perhaps the most contentious dispute of all — at least last week — was over Fred Thompson’s proper name: Is it Freddie, the name he was born with? Or Fred, as he’s called now? ‘“Freddie” makes Thompson sound ridiculous,’ a user argued. ‘It’s not about making Thompson look silly,’ another responded. ‘It’s about having accurate information.’” (link)

Wikipedia is a marvel of transparency, all in all. Check out the Fred vs. Freddie discussion. I am surprised that the pages of all of the top candidates are not protected to some degree. Here’s my brief survey:

  • Democratic candidates
    • Semi-protected: Clinton, Edwards (temporary), Obama (temporary)
    • Unprotected: Biden, Dodd, Gravel, Kucinich, Richardson
  • Republican candidates
    • Semi-protected: Giuliani, Romney (temporary)
    • Unprotected: Brownback, Huckabee, Hunter, Keyes, Paul, Tancredo, Thompson, McCain
  • Others
    • Unprotected: Gilmore, Gingrich, Gore, Nader, Tommy Thompson, Vilsack

Note: Wikipedia’s semi-protection disables editing from anonymous users and registered accounts less than four days old.

We don’t need an SVM to pick out the distinguishing feature — it’s the currently top-ranked candidates who are locked, not the ones who are most controversial.

Man in China dies after three-day Internet session

September 17th, 2007

Reuters has yet another story about a poor lost soul allegedly dying from exhaustion while playing computer games, Man in China dies after three-day Internet session.

“A Chinese man dropped dead after playing Internet games for three consecutive days, state media said on Monday as China seeks to wean Internet addicts offline. The man from the southern boomtown of Guangzhou, aged about 30, died on Saturday after being rushed to the hospital from the Internet cafe, local authorities were quoted by the Beijing News as saying.” (link)

I wonder how many people around the world have died while watching television all day.

Strangely, the local police looked into the possibility that it was suicide.

“Police have ruled out the possibility of suicide,” the newspaper said, adding that exhaustion was the most likely cause of death. It did not say what game he was playing.

Maybe we should have a hazardous-study tuition discount for students in UMBC’s new academic programs in Games, Animation and Interactive Media.

Powerful anti-meeting spell: is it web 1.0 or web 2.0?

September 16th, 2007

A Dilbert strip from last week shows a powerful technique for bringing any meeting of software developers to a screeching halt. Click the excerpt for the full strip.

[Dilbert excerpt: web 1.0 or 2.0]

Spotted on Smart Mobs.