Monday, October 15, 2007

Discovery and Multiple Explanations

My startup, Ofamind, has both a “classification engine” and a “discovery engine” at the core of its technology. A discovery engine is an algorithmic system that tries to show you new content based on what you have looked at in the past; for Ofamind, discovery currently runs over scientific newsfeeds, scientific papers, and patents. The role of the classification engine is to make it easier to add content to your interest collections (“views” in Ofamind) as you browse the web, by automatically suggesting where new web content belongs (via a Firefox extension).

Both are based on a combination of content and linkages between documents. For content, the system goes a step further than current methods by using extracted people, places, and organizations to improve the quality of the matches, as well as by leveraging aspects of document structure. Ongoing work (prior to the full public release of the system) aims to further improve the disambiguation of “named entities,” making it possible to answer useful research questions about topics and the researchers involved in them.
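To make the entity idea concrete, here is a minimal sketch of blending plain term similarity with overlap between extracted entities. This is a generic illustration of the approach, not Ofamind’s actual pipeline; the weighting and the toy inputs are arbitrary choices for the example.

```python
# A generic illustration of entity-aware matching, not Ofamind's actual
# pipeline: blend bag-of-words cosine similarity with Jaccard overlap of
# extracted entities. The 0.7/0.3 weighting and toy inputs are arbitrary.
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Overlap between two sets of extracted entities."""
    return len(a & b) / len(a | b) if a | b else 0.0

def doc_similarity(words_a, words_b, ents_a, ents_b, w_content=0.7):
    """Weighted blend of term similarity and entity-set similarity."""
    return (w_content * cosine(Counter(words_a), Counter(words_b)) +
            (1 - w_content) * jaccard(set(ents_a), set(ents_b)))

# Two toy abstracts that share a person ("Anfinsen") but little vocabulary:
print(doc_similarity("protein folding simulation methods".split(),
                     "new approaches to folding simulations".split(),
                     ["MIT", "Anfinsen"], ["Anfinsen", "Stanford"]))
```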

I was therefore intrigued when I learned about the Netflix Prize from 3QuarksDaily. The Netflix Prize offers a US$1M purse to the first team that can predict people’s Netflix film ratings 10% more accurately (by root mean squared error) than Netflix’s current system, Cinematch. The prize term runs through 2011, and I am seriously considering giving it a run.

Reading through the work by the leading group (at around 8% improvement over Cinematch so far), the approaches seem rather ho hum at first glance: look at the ratings of people who have seen movies similar to my viewing choices, then use their other ratings to suggest new movies to me. Then we get into the fine details, and two main themes develop. First, there is the problem of high variability in some interest areas relative to others; in other words, the landscape of choices is not very smooth. Different movie genres, directors’ outputs, and actor choices may all influence small pools of choices made by individuals who otherwise share my interests. So smoothing methods are introduced that try to capture latent variables or trends in the data, reducing the distortion from outliers and improving overall system performance.
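A toy version of that latent-factor idea: factor the sparse user-by-movie rating matrix into low-dimensional user and movie vectors, fit by stochastic gradient descent. This is a sketch of the general technique, not any particular team’s model; the data, dimensionality, learning rate, and regularization below are illustrative only.

```python
# Toy latent-factor model: learn k-dimensional user and movie vectors so
# that their dot product approximates observed ratings. Regularization
# (reg) is the "smoothing" that keeps outliers from distorting the fit.
import random
random.seed(0)

ratings = [  # (user, movie, stars) -- tiny made-up sample
    (0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4), (2, 2, 2),
]
n_users, n_movies, k = 3, 3, 2        # k latent dimensions
lr, reg = 0.05, 0.02                  # learning rate, regularization

P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_movies)]

def predict(u, m):
    return sum(P[u][f] * Q[m][f] for f in range(k))

for epoch in range(200):
    for u, m, r in ratings:
        err = r - predict(u, m)
        for f in range(k):            # gradient step with shrinkage
            pu, qm = P[u][f], Q[m][f]
            P[u][f] += lr * (err * qm - reg * pu)
            Q[m][f] += lr * (err * pu - reg * qm)

print(round(predict(0, 2), 2))  # user 0's predicted rating for movie 2
```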

But the other methodology that several groups have started looking at is based on combining the decisions of several different approaches. Indeed, one blogger from Columbia University noted that this is similar to Epicurus’ Principle of Multiple Explanations. The same idea is widely used with classification algorithms like AdaBoost in the hope of overcoming overfitting to training data, that is, of avoiding a decision process too finely tuned to whatever oddities lurk in the training set.
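In its simplest form, the combination can be a learned weighting of each model’s predictions on a held-out set. A minimal sketch, with three stand-in models whose predictions are made up purely for illustration:

```python
# Combining multiple explanations: learn least-squares blend weights for
# several predictors on a held-out set. The three "models" are stand-ins
# (e.g. a neighbor model, a latent-factor model, a global-effects model),
# and all numbers here are fabricated for the example.
import numpy as np

held_out_truth = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
model_preds = np.array([            # rows: models; cols: the same 5 ratings
    [3.8, 3.2, 4.6, 2.5, 4.1],      # model A
    [4.3, 2.7, 4.9, 1.8, 3.7],      # model B
    [3.5, 3.5, 4.4, 2.9, 4.4],      # model C
])

# Weights minimizing squared error of the blended prediction:
weights, *_ = np.linalg.lstsq(model_preds.T, held_out_truth, rcond=None)
blended = weights @ model_preds
print(weights.round(2), blended.round(2))
```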

One area that is not currently exploited, though, is the direct use of movie metadata (directors, actors, release date, genre) in the models. It can be argued that some (if not most) of that metadata’s influence is already encapsulated in the choices people make, with viewers effectively serving as expert judges of casts and directors through their rating behavior. But I think there may be significant value in a hybrid approach that looks at when the metadata connections make better predictions than the crowd does. And that kind of hybrid approach is precisely what I am working on with Ofamind.
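One plausible shape for such a hybrid, sketched here with entirely hypothetical function names and thresholds: trust the collaborative prediction when a movie has enough raters similar to the user, and fall back on metadata similarity when that crowd signal is thin.

```python
# Hypothetical hybrid: use the crowd when it has enough to say, otherwise
# lean on metadata (genre, director, actors, release date). cf_predict,
# meta_predict, and min_support are illustrative stand-ins, not a real API.

def hybrid_predict(user, movie, cf_predict, meta_predict,
                   neighbor_count, min_support=10):
    if neighbor_count(user, movie) >= min_support:
        return cf_predict(user, movie)    # crowd signal is strong enough
    return meta_predict(user, movie)      # fall back on metadata

# Toy stand-ins to show the control flow:
cf = lambda u, m: 4.2
meta = lambda u, m: 3.6
print(hybrid_predict("alice", "obscure_film", cf, meta,
                     neighbor_count=lambda u, m: 3))  # -> 3.6 (metadata wins)
```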
