UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
Google

Archive for the 'Google' Category

Storms on Planet Social Media Research

May 7th, 2009, by Tim Finin, posted in Google, Social media, splog

We maintain Planet Social Media Research (SMR) as a feed aggregator for a set of blogs relevant to research in social media systems. A few days ago I noticed that it wasn’t including new posts from some of the blogs. After updating the Planet Venus software we use and poking around I discovered that our server is unable to access any feeds that resolve to Feedburner.

Apparently Feedburner has a blacklist of IP addresses that it blocks and our server must now be on it. We have a request in to straighten this out and hope that everything will be back to normal very soon. (I was able to get our own blog back onto Planet SMR because I reconfigured the system to revert to the old, non-Feedburner feed.)

We’ve not yet heard from Feedburner/Google and don’t know why we are on their blacklist. It’s unlikely to be a result of our accessing feeds too frequently: we rebuild the site and aggregated feed once an hour and only about ten of our feeds resolve to Feedburner.

My speculation is that this is collateral damage in the global war on spam. The easiest way for splogs (spam blogs) to get content is to hijack feeds from other blogs. Web spammers can do even better at disguising their splogs as legitimate sites if they aggregate several feeds that are topically related.

One way to fight such splogs is to deny them access to the feeds. So Google could be trying to protect Feedburner users and also be a good steward of the Web environment by blocking suspected web spammers from the feeds hosted by Feedburner.

So, my guess is that the Google thinks that the Planet SMR site is a splog. We are not, of course. We only include the feeds of blogs that want to be on SMR. We also do not host any ads, which is a motivation for most splogs.

If our speculation is right, and Google is blocking our access because it thinks we are a splog site, then there will be many other legitimate feed aggregator sites that have or soon will have this problem.

By the way, we are always interested in suggestions for new blogs to add to Planet SMR. If you have or know of one, contact us at planet-smr at cs.umbc.edu.

Update 5/8: We’ve identified and solved the problem, thanks to Google Feedburner ‘community expert’ Franklin Tse. The problem was due to our having an old entry for the Feedburner IP address in the server’s /etc/hosts table. I think we added it when we were having some technical difficulties some years ago and wanted to keep our key services running smoothly. I guess the trouble with quick temporary hacks is that they’re easy to forget and come back to bite you.
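For anyone who wants to check for the same kind of forgotten hack, here is a minimal sketch in Python. It is only an illustration, not the procedure we actually followed: the function names, the sample hostnames and the documentation-range IP addresses are all made up. It parses hosts-file text and reports entries whose pinned address no longer appears among the addresses a resolver returns. The resolve function is supplied by the caller and is assumed to query DNS directly, since ordinary system lookups usually consult /etc/hosts themselves and would just echo the stale entry back.

```python
import ipaddress


def parse_hosts(text):
    """Map each hostname in hosts-file text to the IP address pinned for it."""
    pinned = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            pinned.setdefault(name, ip)  # keep the first entry seen for a name
    return pinned


def _is_loopback(ip):
    """True for loopback addresses; unparsable addresses count as not loopback."""
    try:
        return ipaddress.ip_address(ip).is_loopback
    except ValueError:
        return False


def stale_entries(pinned, resolve):
    """Return {hostname: (pinned_ip, current_ips)} for entries that look stale.

    `resolve` is a caller-supplied function (an assumption of this sketch) that
    maps a hostname to the set of IP strings DNS currently returns for it,
    bypassing the hosts file. Names it cannot resolve are skipped.
    """
    stale = {}
    for name, ip in pinned.items():
        if _is_loopback(ip):
            continue
        current = resolve(name)
        if current and ip not in current:
            stale[name] = (ip, sorted(current))
    return stale
```

Running stale_entries(parse_hosts(open("/etc/hosts").read()), my_resolver) once in a while, with my_resolver being whatever direct DNS lookup you trust, would flag a pinned address that has drifted away from the real one.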

Google flu trends for Mexico

April 30th, 2009, by Tim Finin, posted in Google, Social media

Google has produced a special Mexico Flu Trends page to aggregate flu-related search queries from users in Mexico and various states within Mexico.

“We’ve created experimental estimates of flu activity in Mexico using aggregated search data. Unlike Google Flu Trends for U.S., this data has not been validated against confirmed cases of flu. After conferring with US and Mexican health officials, we’ve decided to share these initial results to provide additional information on the evolving epidemic.”

An article in the New York Times, To Aid Mexico, Google Expands Flu Tracking, quotes one expert on the limitations of the Google data:

Dr. Henry L. Niman, a biochemist in Pittsburgh who runs Recombinomics, a Web site that tracks the genetics of flu cases worldwide, said that Google’s service appeared to provide only limited advance warning. “I am not saying that it is not useful. It probably works to complement other sources of surveillance and data,” he said.

Google flu trends: Web searches as sensors

April 26th, 2009, by Tim Finin, posted in Google, Semantic Web, Social media, sEARCH

Google has had a special “flu trends” site up for many months that provides “up-to-date estimates of flu activity in the United States based on aggregated search queries.”

They have found that how many people search for flu-related topics is a leading indicator for reports on how many people actually have flu symptoms. They believe that this metric “may indicate flu activity up to two weeks ahead of traditional flu surveillance systems”. Click on the flash video below to see the relationship between the flu searches and flu symptoms.

So, is Google magic? The explanation for why changes in the level of flu searches precede changes in the level of flu symptoms is more mundane.

“So why bother with estimates from aggregated search queries? It turns out that traditional flu surveillance systems take 1-2 weeks to collect and release surveillance data, but Google search queries can be automatically counted very quickly. By making our flu estimates available each day, Google Flu Trends may provide an early-warning system for outbreaks of influenza.”

You can get the details in a recent article in Nature:

J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski and L. Brilliant, Detecting influenza epidemics using search engine query data, Nature 457, 1012-1014 (19 February 2009).
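To make the "leading indicator" idea concrete, here is a small sketch in Python. It is not the method of the Nature paper, only a toy lagged-correlation check on made-up weekly numbers: for each candidate lag, it correlates this week's search volume with the case counts reported that many weeks later and picks the lag with the highest correlation. The function names and the data are assumptions for illustration.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_lead(searches, cases, max_lag):
    """Find the lag (in weeks) at which searches best predict later cases.

    For each lag, searches[t] is paired with cases[t + lag]. Returns the
    best lag and a dict of the correlation at every lag tried.
    """
    scores = {}
    for lag in range(max_lag + 1):
        xs = searches[: len(searches) - lag]
        ys = cases[lag:]
        scores[lag] = pearson(xs, ys)
    return max(scores, key=scores.get), scores


# Made-up weekly series: reported cases repeat the search curve two weeks later.
searches = [1, 1, 2, 4, 8, 12, 8, 4, 2, 1, 1, 1, 1, 1]
cases = [1, 1] + searches[:-2]

lag, scores = best_lead(searches, cases, max_lag=4)
```

With these synthetic series the best lag comes out as two weeks, because the second series was built as a two-week-delayed copy of the first. Real data would be far noisier, and a high correlation at some lag would not by itself show that the searches are driven by actual illness.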

Of course, such leading indicators may not correlate well if there is a “black swan” flu epidemic or even if there is an unfounded fear of one. Sometimes the crowds are wise, but often not. Remember when we all thought technology stocks, and later real estate, were a good thing to invest in?

The Google site also allows you to look at the data by state. Click on the image below to try it out.



Cloudera offers a simpler Hadoop distribution

March 18th, 2009, by Tim Finin, posted in Google, High performance computing, MC2, Multicore Computation Center, Semantic Web, Social media, cloud computing

We are early in the era of big data (including social and/or semantic) and more and more of us need the tools to handle it. Monday’s NYT had a story, Hadoop, a Free Software Program, Finds Uses Beyond Search, on Hadoop and Cloudera, a new startup that is offering its own Hadoop distribution that is designed to be easier to install and configure.

“In the span of just a couple of years, Hadoop, a free software program named after a toy elephant, has taken over some of the world’s biggest Web sites. It controls the top search engines and determines the ads displayed next to the results. It decides what people see on Yahoo’s homepage and finds long-lost friends on Facebook.”

Three top engineers from Google, Yahoo and Facebook, along with a former executive from Oracle, are betting it will. They announced a start-up Monday called Cloudera, based in Burlingame, Calif., that will try to bring Hadoop’s capabilities to industries as far afield as genomics, retailing and finance. The company has just released its own version of Hadoop. The software remains free, but Cloudera hopes to make money selling support and consulting services for the software. It has only a few customers, but it wants to attract biotech, oil and gas, retail and insurance customers to the idea of making more out of their information for less.

Cloudera’s distribution, currently based on Hadoop v0.18.3, uses RPM and comes with a Web-based configuration aid. The company also offers some free basic training in MapReduce concepts, using Hadoop, developing appropriate algorithms and using Hive.
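For readers who have not met the MapReduce idea before, here is a tiny single-process sketch in Python of the classic word-count example. It does not use Hadoop or any of its APIs, and the function names are made up; it only shows the three conceptual steps (map each record to key/value pairs, group the pairs by key, reduce each group) that a framework like Hadoop distributes across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the counts emitted for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in a single process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each call to map_phase looks at one document only, and each call to reduce_phase looks at one key only, both steps can in principle run in parallel on different machines, which is the property that makes the pattern scale.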

Wolfram Alpha: an alternative to Google, the Semantic Web and Cyc?

March 11th, 2009, by Tim Finin, posted in AI, Datamining, Google, NLP, Semantic Web

There’s been a lot of interest in Wolfram Alpha in the past week, starting with a blog post from Stephen Wolfram, Wolfram|Alpha Is Coming!, in which he described his approach to building a system that integrates vast amounts of knowledge and then tries to answer free-form questions posed to it by people. His post lays out his approach, which does not involve extracting data from online text.

“A lot of it is now on the web—in billions of pages of text. And with search engines, we can very efficiently search for specific terms and phrases in that text. But we can’t compute from that. And in effect, we can only answer questions that have been literally asked before. We can look things up, but we can’t figure anything new out.

So how can we deal with that? Well, some people have thought the way forward must be to somehow automatically understand the natural language that exists on the web. Perhaps getting the web semantically tagged to make that easier.

But armed with Mathematica and NKS I realized there’s another way: explicitly implement methods and models, as algorithms, and explicitly curate all data so that it is immediately computable.”

Nova Spivack took a look at Wolfram Alpha last week and thought that it could be “as important as Google”.

In a nutshell, Wolfram and his team have built what he calls a “computational knowledge engine” for the Web. OK, so what does that really mean? Basically it means that you can ask it factual questions and it computes answers for you.

It doesn’t simply return documents that (might) contain the answers, like Google does, and it isn’t just a giant database of knowledge, like the Wikipedia. It doesn’t simply parse natural language and then use that to retrieve documents, like Powerset, for example.

Instead, Wolfram Alpha actually computes the answers to a wide range of questions — like questions that have factual answers such as “What is the location of Timbuktu?” or “How many protons are in a hydrogen atom?,” “What was the average rainfall in Boston last year?,” “What is the 307th digit of Pi?,” “where is the ISS?” or “When was GOOG worth more than $300?”

Doug Lenat also had a chance to preview Wolfram Alpha and came away impressed:

“Stephen Wolfram generously gave me a two-hour demo of Wolfram Alpha last evening, and I was quite positively impressed. As he said, it’s not AI, and not aiming to be, so it shouldn’t be measured by contrasting it with HAL or Cyc but with Google or Yahoo.”

Doug’s review does a good job of sketching the differences he sees between Wolfram Alpha and systems like Google and Cyc.

Lenat’s description makes Wolfram Alpha sound like a variation on the Semantic Web vision, but one that is more like a giant closed database than a distributed Web of data. The system is set to launch in May 2009 and I’m anxious to give it a try.

Unlocked developer Android G1 hobbled

February 26th, 2009, by Tim Finin, posted in Google, Mobile Computing

Macworld reports, in Google blocks paid apps for unlocked G1 users, that Google made a recent change in the capabilities of the unlocked G1 Android phone.

“People who bought an unlocked version of the Android G1 phone are no longer allowed to download new paid applications from the Market, after a change Google made late last week. Google is prohibiting users of the unlocked phones from viewing copy-protected applications, including those that cost to download.”

Gizmodo describes the reason, or at least one very plausible one.

“The problem lies in the phone’s full software permissions. Consumer Android phones download paid content to a private, hidden apps folder, inaccessible to the user. Thing is, as is stands, this normally inaccessible folder is accessible on the dev phones. Not only does this let people flat out copy and redistribute apps—it enables a sort of app laundering scam, in which someone buys an app, copies it to another location, and gets a refund for the app (as per the Marketplace’s 24-hour return policy), only to reinstall the copied version later.”

We purchased an unlocked G1 last month and are using it in several research projects. Not being able to access the paid apps should not be a showstopper, but it would be nice to try some out, so I hope a solution to this problem can be worked out soon.

Google starts Social Web Blog

February 10th, 2009, by Tim Finin, posted in Google, Social media

Ebiquity alumnus Harry Chen alerted us to Google’s new Social Web Blog, which describes itself as “news and updates about Google products that are helping to make the web more social”. In yesterday’s first post, Mendel Chuang, the product marketing manager for Google Friend Connect, says:

“We are launching this blog for anyone interested or involved in helping to make the web more social. Whether you own a site and want to add social features to increase community engagement, or you’re developing a great social application, this blog is for you.

We will write about social initiatives within Google, such as Google Friend Connect, as well as community efforts like OpenSocial. We plan to share some success stories, present tips and tricks, provide updates when there are new developments, and much more.”

Google, structured data and the deep web

February 4th, 2009, by Tim Finin, posted in Google, Semantic Web

IDG News Service has a story sketching how Google Researcher Targets Web’s Structured Data. The work is directed not at data published in machine-understandable form (e.g., in RDF), but at other kinds of structured data accessible on the web.

“Internet search engines have focused largely on crawling text on Web pages, but Google is knee-deep in research about how to analyze and organize structured data, a company scientist said Friday. “There’s a lot of structured data out on the Web and we’re not doing a good job of presenting it to our users,” said Alon Halevy during a talk at the New England Database Day conference at the Massachusetts Institute of Technology.

Halevy was referring in part to so-called “deep Web” sources, such as the databases that sit behind form-driven Web sites like Cars.com or Realtor.com. Google has been submitting queries to various forms for some time, retrieving the resulting Web pages and including them in its search index if the information looks useful.

But the company also wants to analyze the data found in structured tables on many Web sites, Halevy said, offering as an example a table on a Web page that lists the U.S. presidents. And there are reams of those tables — Google’s index turned up 14 billion of them, according to Halevy. He “realized very quickly that over 98 percent of these are not that interesting,” but even after significant filtering there remain about 154 million tables worth indexing, he said.

ReadWriteWeb also has a story (Google: “We’re Not Doing a Good Job with Structured Data”) on what Google is or isn’t doing with structured data, including an interesting admission by Google researcher Halevy.

“During a talk at the New England Database Day conference at the Massachusetts Institute of Technology, Google’s Alon Halevy admitted that the search giant has “not been doing a good job” presenting the structured data found on the web to its users. By “structured data,” Halevy was referring to the databases of the “deep web” – those internet resources that sit behind forms and site-specific search boxes, unable to be indexed through passive means.”

For some technical details on the issues and current work, see the paper Google’s Deep-Web Crawl by researchers from Google (including Halevy), UCSD and Cornell published in the Proceedings of VLDB 2008.

Barack Obama on sorting algorithms

February 3rd, 2009, by Tim Finin, posted in Google, Humor

No doubt about it, President Obama is a polymath.



Warning: Google thinks every site may harm your computer

January 31st, 2009, by Tim Finin, posted in Google, Mobile Computing, Security

The Google has flipped out. Starting a few minutes ago when I try to click on any Google search result, I am shown the Google malware page. The one below was the result when I tried to click through to http://google.com/, the first result for searching for “google”. It is obviously an error in Google’s software and one that surely will be fixed shortly, if it has not been fixed already. Since Google is highly distributed, it’s possible that only some of their sites are in error.

Once you get the “Warning – visiting this web site may harm your computer!” page, the only way to continue on to the page is by manually selecting the text of the URL from the warning page and pasting it into your browser’s URL field.

Through experimentation, I found that the problem exists for the default search service as well as image search, but not for searches over blogs, news, video, scholarly papers or shopping.

I suppose this could be the world’s safest CYA disclaimer, but if so they may as well add “Do not taunt Happy Fun Ball.”

Update: This seems to have been fixed around 10:15am GMT-5.

Update 2: Here is Google’s post about the problem.

When will video dominate text on the Web?

January 18th, 2009, by Tim Finin, posted in Google, Web, sEARCH

Information on the Web comes in many forms, including text, images, services, data, games, and video. I’ve always considered text to be the essential type, possibly because it was the first, but also because so much of our Web experience has been shaped by search engines, which still operate mostly on text. But just as television and film dominate books and other forms of text in popular culture, maybe video-oriented modalities will become the preferred form of Web content.

Today’s New York Times has an article, At First, Funny Videos. Now, a Reference Tool, about how many people search for information on YouTube first and turn to text search engines only when their YouTube results are inadequate.

“FACED with writing a school report on an Australian animal, Tyler Kennedy began where many students begin these days: by searching the Internet. But Tyler didn’t use Google or Yahoo. He searched for information about the platypus on YouTube.

“I found some videos that gave me pretty good information about how it mates, how it survives, what it eats,” Tyler said. Similarly, when Tyler gets stuck on one of his favorite games on the Wii, he searches YouTube for tips on how to move forward. And when he wants to explore the ins and outs of collecting Bakugan Battle Brawlers cards, which are linked to a Japanese anime television series, he goes to YouTube again.

While he favors YouTube for searches, he said he also turns to Google from time to time. “When they don’t have really good results on YouTube, then I use Google,” said Tyler, who is 9 and lives in Alameda, Calif.”

The article reports that the number of YouTube searches recently exceeded those on Yahoo, which had been number two.

“In November, Americans conducted nearly 2.8 billion searches on YouTube, about 200 million more than on Yahoo, according to comScore.”

You can see this trend in comScore’s December 2008 Search Engine Rankings report.

It’s hard to say where this is going. Video is great for some kinds of information (e.g., demonstrations, events) and less good for others (e.g., recipes, careful arguments). We can easily link information in text to related information, but can’t (yet) for videos. We can more easily write programs to process text and even extract semantic information from it.

But I have a feeling that nine-year-old Tyler Kennedy is a sign of things to come.

WWGD: Understanding Google’s Technology Stack

December 24th, 2008, by Tim Finin, posted in AI, GENERAL, Google, Programming, Semantic Web, Social media, Web, cloud computing

It’s popular to ask “What Would Google Do” these days — The Google reports over 7,000 results for the phrase. Of course, it’s not just about Google, which we all use as the archetype for a new Web way of building and thinking about information systems. Asking WWGD can be productive, but only if we know how to implement and exploit the insights the answer gives us. This in turn requires us (well, some of us, anyway) to understand the algorithms, techniques, and software technology that Google and other large scale Web-oriented companies use. We need to ask “How Would Google Do It”.

Michael Nielsen has a nice post on using your laptop to compute PageRank for millions of webpages. His post reviews PageRank and how to compute it and shows a short, but reasonably efficient, Python program that can easily do a graph with a few million nodes. While not sufficient for many applications, like the Web, there are lots of interesting and significant graphs this small Python program can handle: Wikipedia pages, DBLP publications, RDF namespaces, BGP routers, Twitter followers, etc.
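To give a flavor of what such a program involves, here is a minimal power-iteration sketch in plain Python. It is not Nielsen's code and makes no attempt at his efficiency; the function name, the default damping factor of 0.85 and the convergence settings are assumptions chosen for illustration. The graph is a dictionary mapping each page to the list of pages it links to, and the rank held by pages with no outgoing links is spread evenly over all pages.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Approximate PageRank by power iteration.

    links: dict mapping each node to a list of the nodes it links to.
    Returns a dict mapping every node to its rank; the ranks sum to 1.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # Rank sitting on dangling nodes (no outlinks) is shared by everyone.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node passes an equal share of its damped rank along each link.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share

        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# A made-up four-page graph: every page links, directly or not, toward "c".
example = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example)
```

In this toy graph page "c" ends up with the highest rank, since three pages link to it, and page "d" with the lowest, since nothing links to it. A dictionary-of-lists version like this is fine for small graphs; handling millions of nodes comfortably calls for array-based code.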

The post is part of a series Nielsen is making on the Google Technology Stack including PageRank, MapReduce, BigTable, and GFS. The posts are a byproduct of a series of weekly lectures he’s giving starting earlier this month in Waterloo. Here’s the way that Nielsen describes the series.

“Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data. According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: “Google uses Bayesian filtering the way Microsoft uses the if statement” (Credit: Joel Spolsky). This series of posts describes some of the technologies that make this high level of abstraction possible.”

Videos of the first two lectures (Introduction to PageRank and Building our PageRank Intuition) are available online. Nielsen illustrates the concepts and algorithms with well-written Python code and provides exercises to help readers master the material as well as “more challenging and often open-ended problems” which he has worked on but not completely solved.

Nielsen was trained as a theoretical physicist but has shifted his attention to “the development of new tools for scientific collaboration and publication”. As far as I can see, he is offering these as free public lectures out of a desire to share his knowledge and also to help (or maybe force) him to deepen his own understanding of the topics and develop better ways of explaining them. In both cases, it is an admirable and inspiring example for us all and appropriate for the holiday season. Merry Christmas!
