Ebiquity PhD student Pranam Kolari has been selected as one of ten finalists for a $5000 Blogging Scholarship by “Scholarships Around the US”, a Web-based scholarship information service. Pranam was selected for his blog posts about his research on blogs, detecting blog spam, policies and trust. If you have a chance, please cast a vote for Pranam. If he wins, you can ask him to buy you a cup of coffee when you see him.
Archive for October, 2006
Here in the US, Lexus is heavily advertising their automatic parallel parking feature. My first thought, I’ll admit, was that when civilization collapses, I’ll have the advantage of still remembering how to do many things manually — drive a stick shift, write in longhand with a fountain pen (and even a nib pen if things get really bad), and parallel park my Mad Max vehicle. But the Lexus Advanced Parking Guidance System is pretty neat technology, even if it does add at least $1,200US to the cost of an already expensive car. Here’s how it works:
“The view from a back-up camera is displayed on the screen at the heart of the car’s navigation system. The driver … uses touch-screen arrows to move about a targeting box, much like a video game. When the red box turns green, the driver pushes a button that reads “OK”, and the car takes over. With the driver’s hands free, the electric motors in the power steering system spin the wheel back and forth to navigate the car. The driver controls speed by keeping a foot on the brake. And the car won’t stop itself if, say, a child steps off the curb into the parking spot. Stomping on the brake, turning the wheel or hitting the gas will shut off the parking system.” [Source]
This isn’t going to help you get into a tight space, or even a normal space in an urban environment — the car won’t even try to park unless the targeting box is at least six feet longer than the car. I’ve spent many years parking in cities and know that you often have to squeeze into a space that has less than one foot of extra room, even when it’s on a hill and especially when it’s 2:00am.
The smart self-parking car might seem like an excess, but it is a good example of how machines are becoming aware of the context they are in and using this knowledge to behave more intelligently, or at least more appropriately.
My friend Rich Fritzson pointed me to TED, a by-invitation conference on “Technology, Entertainment, Design” that has been held in Monterey for some fifteen years. Its tag line is “where leading thinkers and doers gather for inspiration”. One way they filter for the leading among the thinkers and doers is by resources — the 2007 registration fee is $4400! But I digress.
TED is putting talks from its conferences online, in both video and audio form, for viewing or download. The talks are short (18 minutes), focused and polished, and feature speakers with interesting and timely things to say, including Malcolm Gladwell, Steve Levitt, Richard Dawkins, Jimmy Wales, Nicholas Negroponte, Dan Dennett, and many others. Here’s Jimmy Wales’ talk from 2005.
The Washington Post has an article on how Google is expanding its Web reach to Madison Avenue.
“The firm is developing new Web video ad formats that could give TV commercials a run for their money and has been staffing up new projects to sell ads offline in newspapers and magazines and on radio. Earlier this month, Google’s ambitions were on display when it opened its new office in the old Port Authority building, covering 1 1/2 floors on an entire Manhattan block in the funky Chelsea neighborhood. The company has recruited executives from some of the biggest media firms and is rapidly expanding its 500-member team here in an attempt to cultivate relationships with Madison Avenue and large advertisers.”
The most interesting part of the article, at least to me, talks about Google’s plans to team with the “researchware” firm comScore, a company whose practices some consider controversial.
“Teaming with online research firm ComScore Networks Inc., Google is trying to correlate the effectiveness of each ad by tracking the number of people exposed to it who later perform searches about the product. For example, people who visited the AutoTrader.com site this summer were shown an image of a Volvo sport-utility vehicle advertising the car for lease at $389 a month. ComScore placed “cookies,” or tracing files, on the computers of visitors and tracked how many typed the word “Volvo” or “Volvo SUV” into a search box weeks or months later. During the Web campaign for the Volvo’s XC90, Google said 39 percent of Internet users who were exposed to the ads later conducted online searches for Volvo cars.”
Correlating viewing an advertisement with actually changing the behavior of the viewer, as evidenced by their subsequent searches, sounds like an advertiser’s dream. And one that will be hard to resist.
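The measurement described in the quote comes down to a simple rate: of the users who were shown an ad, what fraction later searched for the brand. Here is a minimal sketch in Python, assuming hypothetical log records of the form (user id, time) for exposures and (user id, time, query) for searches. The function name, the record layout, the 90-day window and the sample data are all made up for illustration; nothing here reflects how Google or comScore actually compute their numbers.

```python
from datetime import datetime, timedelta

def search_lift(exposures, searches, keyword, window_days=90):
    """Fraction of exposed users who searched for `keyword` after seeing the ad.

    exposures: list of (user_id, exposure_time)
    searches:  list of (user_id, search_time, query)
    The record layout and the 90-day window are illustrative assumptions.
    """
    # Keep the earliest exposure for each user.
    first_seen = {}
    for user, when in exposures:
        if user not in first_seen or when < first_seen[user]:
            first_seen[user] = when

    window = timedelta(days=window_days)
    converted = set()
    for user, when, query in searches:
        seen = first_seen.get(user)
        if seen is None:
            continue  # this user was never shown the ad
        # Count only searches made after the exposure and inside the window.
        if seen < when <= seen + window and keyword.lower() in query.lower():
            converted.add(user)

    return len(converted) / len(first_seen) if first_seen else 0.0

# Hypothetical sample data
exposures = [
    ("a", datetime(2006, 7, 1)),
    ("b", datetime(2006, 7, 2)),
    ("c", datetime(2006, 7, 3)),
    ("a", datetime(2006, 7, 5)),
]
searches = [
    ("a", datetime(2006, 7, 20), "Volvo SUV lease"),  # exposed, then searched
    ("b", datetime(2006, 6, 1), "volvo"),             # searched before exposure
    ("c", datetime(2006, 8, 1), "used honda"),        # exposed, unrelated search
    ("d", datetime(2006, 8, 1), "Volvo"),             # never exposed
]
rate = search_lift(exposures, searches, "volvo")
```

A rate like this shows only correlation. To say anything about whether the ad changed behavior, it would have to be compared with the same rate for a similar group of users who never saw the ad.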
comScore describes itself as maintaining “massive proprietary databases that provide a continuous, real-time measurement of the myriad ways in which the Internet is used and the wide variety of activities that are occurring online.” Here’s how they describe their approach:
“At the core of comScore Networks is our proprietary data collection technology. Massively scalable, this system allows us to capture a comprehensive view of surfing and buying behavior of more than 2 million participants in an extremely cost-effective manner. These members, representing a cross section of the Internet population, give comScore explicit permission to confidentially monitor their online activities in return for valuable benefits such as server-based virus protection, sweepstakes prizes, and the opportunity to help shape the future of the Internet.” [source]
If this sounds a bit like spyware, read on.
“comScore technology is downloaded to any browser in a matter of seconds and unobtrusively captures and sends information regarding a participant’s Internet browsing and purchasing behavior to comScore’s server network, without requiring any further action on the part of the individual. The technology allows comScore to capture the complete details of communication to and from each individual’s computer — on a site-specific, individual-specific basis. This includes every site visited, page viewed, ad seen, promotion used, product or service bought, and price paid, while excluding sensitive personal information regarding an individual (such as account numbers, user ids, passwords, etc.).” [source]
Ok, you are probably wondering how you can sign up to get chances to win “attractive sweepstakes prizes” while doing your part “to impact and improve the Internet”. Don’t call them, they’ll call you, at least with probability p.
At the heart of the comScore Global Network is a sample of consumers enlisted via Random Digit Dial (RDD) recruitment – the methodology long endorsed by many market and media researchers. comScore also employs a variety of online recruitment programs, which have been time-tested through the years in which the comScore Global Network has been in operation. …
But, while many are called, fewer are chosen.
… our network includes hundreds of thousands of high-income Internet users – one of the most desirable and influential groups to measure, yet also one of the most difficult to recruit. comScore determines the size and characteristics of the total online population via a continuous survey spanning tens of thousands of persons over the course of a year. The sample of participants in this enumeration survey is selected via RDD methodology. Respondents are asked a variety of questions about their Internet use, as well as descriptive information about themselves and their households. The result is an accurate and up-to-date picture of the universe to which the comScore sample is projected.
comScore’s approach has been criticized before, as outlined in this 2004 CNet article. But comScore calls what they do “researchware”, a term they’ve invented to cover products that gather information for market research purposes and fully disclose what they are doing and why. Not everyone is comfortable with the distinction and some universities, including Cornell and Princeton, have warned their students away from agreeing to use comScore. See this 2005 MSNBC article, Researchware watches where you click, for a discussion of the spyware vs. researchware issue.
And here’s one final disquieting fact. As I understand it, comScore works not with cookies but by routing its opt-in customers’ Web browsing activities through a proxy server. As a man in the middle, this lets them decrypt secure transactions going over HTTPS connections to collect information. And they do. They use another secure HTTPS connection, of course, to complete the transaction and they go to great lengths to assure their users that they don’t collect obviously sensitive information. But still — this is quite a concession for any of us to make in return for the opportunity to win “attractive sweepstakes prizes”.
Rexa is a digital library covering computer science research and the people who create it. It was developed by the Information Extraction and Synthesis Laboratory at the University of Massachusetts with support from NSF. With just over seven million papers, its collection is about half the size of the CS collections of CiteSeer and a quarter of Google Scholar’s. Rexa offers some interesting and valuable features, however, including
- A simple and responsive interface
- Cross-linked pages for papers, authors, topics and NSF grants
- Browsing by citations, authors, topics, co-authors, cited authors, and citing authors
- The ability for users to tag papers
The grant abstracts that are indexed appear to be only from the US National Science Foundation, but this is still a useful service if you want to sample related projects for a research proposal you are developing.
Both Mitch Ratcliffe and Matthew Hurst have some very interesting thoughts on defining influence. In some of our work we have tried to explore what influence means on the Blogosphere and how we can measure it. We found that the PageRank of a blog is definitely one of the contributing factors for measuring influence, and being authoritative of course means that people are more likely to listen to you. But influence is not limited to authority alone. I like Matthew Hurst’s suggestion of breaking influence into smaller, directly measurable components like:
- Authority: the expertise level of the individual.
- Credibility: the trust of the readers for the individual and the manner of presentation.
- Network measurements: link counting, traffic stats, graphical analytics
In addition to the above, I would like to point out that IMHO, this is just the beginning of the list. In defining or modeling influence, we need to consider some of the following aspects:
- Influence is topical: For example, a blog like Daily Kos that is influential in politics is less likely to have an impact on technology-related blogs. Similarly, Techcrunch, an extremely popular technology blog, might not be influential when it comes to politics.
- Influence is polar: A community of “iPod fans”, for example, needs no convincing about the product. On the other hand, an “influential” blogger talking negatively about your product might have a drastic impact. In related work, we have participated in the TREC blog track’s opinion extraction task. We feel that opinions, biases and polarity of the links matter when measuring influence on the Blogosphere.
- Influence is temporal: A blog’s influence on a topic might change with time. Tracking blogs over time also allows us to differentiate blogs that are influential from ones that are just briefly popular. For example, many thousands of sites linking to a ‘Coke Mentos’ video in one day indicates popularity. But thousands of links accumulated consistently over time by a blog indicate that it is influential.
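One hedged way to make the popular-versus-influential distinction concrete is to look at how inbound links are spread over time. The sketch below is an illustration only, not a method from the work described here: it takes a hypothetical list of daily inbound-link counts and reports what share of all links arrived on the single busiest day. The function names, the sample numbers and the 0.5 cutoff are arbitrary assumptions.

```python
def peak_share(daily_links):
    """Share of all inbound links that arrived on the single busiest day."""
    total = sum(daily_links)
    return max(daily_links) / total if total else 0.0

def classify(daily_links, burst_threshold=0.5):
    """Label a link history as 'briefly popular' or 'sustained'.

    burst_threshold is an arbitrary illustrative cutoff, not a tuned value.
    """
    if sum(daily_links) == 0:
        return "no links"
    if peak_share(daily_links) >= burst_threshold:
        return "briefly popular"
    return "sustained"

# Hypothetical 10-day link histories
viral_video = [0, 0, 3000, 200, 50, 10, 5, 0, 0, 0]
steady_blog = [300, 320, 310, 290, 305, 330, 315, 300, 310, 320]

labels = (classify(viral_video), classify(steady_blog))
```

Both histories contain roughly three thousand links, so a raw link count cannot tell them apart; only the distribution over time does. A real measure would also have to account for topic and link polarity, as noted in the list above.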
In the end, we need some measures that are based on link count, traffic, readership (see also “feeds that matter”) etc. But we also need to consider more factors that make a blog “influential” and this would involve using a combination of graph, linguistic and temporal analysis. It might be hard to come up with an exact definition of influence, but the exciting thing is that there are a LOT of interesting challenges in this space!
First, let me be clear that this is not a post about a Second Life for database hackers.
I just heard an interesting talk by AnHai Doan at an information integration workshop. His talk, on Best effort data integration (presentation slides), was built around the problem of doing data collection and integration for web communities, such as database researchers. One result is DBLife — a very nice prototype of a dynamic portal of current information for the database research community. The system automatically discovers and revisits web pages and resources for the community, extracts information from them, and integrates it to present a unified view of people, organizations, papers, talks, etc. It also summarizes the interesting new facts for the day, such as new papers, conferences, or projects.
See Community Information Management, AnHai Doan et al., Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 29(1), 2006 for a longer overview of their project and DBLife.
This project takes a database approach, but the Semantic Web could offer a lot to enhance it. There should be an AILife, IRLife, GraphicsLife, etc. The underlying semantic models for this will have a lot in common, so they could be supported by a set of common shared ontologies. Moreover, many of the instances — people, conferences, projects — will show up in several portals. Finally, RDF and SPARQL would be the ideal interchange languages when portals want to import or export data.
“Fifty or so other Republican candidates have also been made targets in a sophisticated “Google bombing” campaign intended to game the search engine’s ranking algorithms. By flooding the Web with references to the candidates and repeatedly cross-linking to specific articles and sites on the Web, it is possible to take advantage of Google’s formula and force those articles to the top of the list of search results.”
Given the amount of money politicians have and the fact that they are always fighting for their jobs and for power, it’s a scary development.
A recent article in Electronic Musician discussed the XPod project. A reader of that article posted some comments on MetaFilter. The article sent a fair bit of traffic to our site. We have done some new work in the area. A new version of the XPod player will be presented at the IJCAI Ambient Intelligence workshop. I don’t have a finished paper to post but you can see the presentation that I plan on showing. If you don’t want to check out the presentation I will tell you the conclusion. We tried a few different machine learning techniques: support vector machines, decision trees, and neural networks. It turns out that neural networks work best for our application.
I wanted to reply to some of the comments from MetaFilter:
Are humans becoming so lazy that we can’t even be bothered to select and play our own music?
Yes, that is why I have created this. I don’t like making playlists, and I am not alone. Lazy solutions can be optimal for some problems. Laziness is the mother of many great inventions. Does anybody want to get rid of high level programming languages? High level programming languages were created because people were too lazy to write assembly code.
Apple may have a problem with the name.
The ebiquity group is fairly bad at choosing project names. Hence our semantic-web search engine Swoogle. If we get enough press to get Apple’s attention, that will be a good thing.
One final note:
We are moving away from the BodyMedia device. I believe our next target platform will be the Nokia 5500 Sport. It’s worth going to the link just to see the Parkour videos.
Spings, or rather pings from splogs and non-blogs, inundate ping servers. We did an analysis on this last year by characterizing splogs at ping servers. The problem makes ping servers far less attractive as a “blogosphere update manager”. Recently, while looking at the weblogs.com ping stream, I noticed something very strange: a new form of sping is now in use by comment/guestbook spammers.
The model used by these spammers has so far been –
- Spam comments on blog postings/guestbooks
- Wait for the next search crawl of compromised pages
- Bask in artificially inflated rankings
However, it’s now changing –
- Spam comments on blog postings/guestbooks
- Send proxy pings on compromised postings/guestbooks to ping servers
- Bask in artificially inflated rankings, faster
Here’s a sampling of what we have seen in the last couple of days (changes.xml) –
<weblog name="auto car finance max" url="http://www.ctle.ngcsu.edu/prof_chuck/?p=129" />
<weblog name="best refinance mortgage" url="http://www.ctle.ngcsu.edu/prof_chuck/?p=120" />
<weblog name="cheap motor car insurance" url="http://www.ctle.ngcsu.edu/prof_chuck/?p=122" />
<weblog name="interest mortgage rate refinance" url="http://www.gajaweb.de/guestbook/gb/index.php" />
<weblog name="debt management plan" url="http://www.dartzwerge.de/guestbook/index.php" />
And here’s the entire list of spings for the ngcsu.edu domain over the last 5 days.
We will of course investigate this further.
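As a rough starting point for that kind of investigation, one could scan a changes.xml snapshot for pings whose URL points at an individual post or a guestbook page rather than at a blog’s home page, which is the pattern in the sample above. The following Python sketch is only a heuristic under stated assumptions: the function names, the list of guestbook hints, the `?p=` permalink check and the `weblogUpdates` wrapper element in the sample are all illustrative choices, not a description of any real detection system.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Path fragments that suggest a guestbook page; an illustrative guess, not a vetted list.
GUESTBOOK_HINTS = ("guestbook", "/gb/")

def looks_like_proxy_sping(url):
    """Heuristic: the pinged URL is a single post or a guestbook, not a blog home page."""
    parsed = urlparse(url)
    if "p" in parse_qs(parsed.query):  # WordPress-style permalink such as ?p=129
        return True
    path = parsed.path.lower()
    return any(hint in path for hint in GUESTBOOK_HINTS)

def suspicious_pings(changes_xml):
    """Return (name, url) for each weblog element whose URL trips the heuristic."""
    root = ET.fromstring(changes_xml)
    return [
        (w.get("name"), w.get("url"))
        for w in root.iter("weblog")
        if looks_like_proxy_sping(w.get("url", ""))
    ]

# Two entries taken from the sample above plus one made-up ordinary blog.
sample = """<weblogUpdates>
  <weblog name="auto car finance max" url="http://www.ctle.ngcsu.edu/prof_chuck/?p=129" />
  <weblog name="debt management plan" url="http://www.dartzwerge.de/guestbook/index.php" />
  <weblog name="an ordinary blog" url="http://example.org/blog/" />
</weblogUpdates>"""

flagged = suspicious_pings(sample)
```

A check like this would also flag legitimate blogs that happen to ping with a permalink, so it is better suited to producing a candidate list for manual review than to blocking pings outright.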
Google launched Google Co-op late last night, a service to create custom search engines. The central feature is that it prioritizes or restricts search results based on websites and pages that you specify. It also allows one to tag sources and to provide a way to focus on hits from sources with a given tag from the results page. You can open up the development and maintenance of your custom search engine to others, allowing people (everyone or just those you invite) to add or exclude sites and to tag sources.
Although Elias Torres beat us to it, we’re experimenting with the service with a Co-op search engine (here) that draws on a number of sites related to the semantic web. Feel free to add to it or tag some resources.
This is a good idea, though not novel — remember the concept of a focused search engine? The idea of letting users create their own focused search engines through a web interface is also not new. Rollyo is a Yahoo-powered service that offers the same basic capability and Swicki is another that offers some interesting wiki-inspired features. Google’s collaboration model is different though. I wish there were something between “anyone” and “invited individuals” for collaboration. The former opens the door to web spam, which will soon come in. The latter is too simple a model for building a large community. Update: After trying the collaboration a bit, I see that the “anyone” model requires the approval of the owner, so the collaboration model is reasonable, though simple.
Collective Intellect is a company that mines blogs and other web resources for information of interest to stock traders or corporate public-relations executives.
“Across blogs, discussion boards and social networking websites, Collective Intellect uses a combination of advanced artificial intelligence algorithms and old-fashioned human ingenuity to identify emerging New Media content that is relevant.”
According to an article in CNet (Putting blogs to work for Wall Street)
“The system examines about 150,000 new postings a day. Then it analyzes them for sentiment–is it causing a stock to go up or down?–and credibility. The company then sends out data feeds and e-mails on stock activity and interesting news to subscribers. … To maintain quality, the company says it monitors the performance and accuracy of the sources it combs. It also filters out spam blogs and tries to weed out sites that can influence opinion versus ones that are just new-media also-rans.”
This is similar to Monitor110, which also mines blogs for investment intelligence.