Showing posts with label ofamind. Show all posts

Monday, October 15, 2007

Discovery and Multiple Explanations

My startup, Ofamind, has both a “classification engine” and a “discovery engine” as part of the core technology. A discovery engine is an algorithmic system that tries to show you new content based on what you have been looking at in the past. For Ofamind, the discoveries are currently over scientific newsfeeds, scientific papers and patents. The role of the classification engine is to make it easier to add content to your interest collections (“views” in Ofamind) as you browse the web by automatically suggesting how to add the new web content (via a Firefox extension).

Both are based on a combination of content and linkages between documents. For content, the system goes a step further than current methods by using extracted people, places and organizations to improve the quality of the matches, as well as by leveraging aspects of document structure. Ongoing work (prior to the full public release of the system) aims to further improve the disambiguation of “named entities” so that the system can answer useful research questions about topics and the researchers involved in them.

I was therefore intrigued when I learned about the Netflix Prize from 3QuarksDaily. The Netflix Prize is offering a US$1M purse to any group that can do a substantially better job of predicting how people rate Netflix films than the company's current system, Cinematch. The prize term runs through 2011 and I am seriously considering giving it a run.

Reading through the work by the leading group (at around 8% improvement over Cinematch so far), the approaches seem rather ho-hum at first glance: look at the ratings given by people who have seen movies similar to my viewing choices, then use their other ratings to suggest new movies to me. Then we get into the fine details and start to see two main themes develop. First, there is the problem of high levels of variability in some interest areas compared with others. In other words, the landscape of choices is not very smooth. Different movie genres, directors' outputs and actor choices may all influence small pools of choices made by individuals who otherwise share my interests. So smoothing methods are introduced that try to capture latent variables or trends in the data, reducing the distortions caused by outliers and improving overall system performance.
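To make the "look at people with similar viewing choices" idea concrete, here is a minimal sketch of a neighborhood predictor. Everything in it is an assumption for illustration: the toy ratings, the function names, and the `shrink` term, which is a simple stand-in for smoothing (it damps similarities built on very few shared movies) rather than the latent-variable methods the leading groups actually use.

```python
from math import sqrt

# Toy data, invented for illustration: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "ann": {"Alien": 5, "Blade Runner": 4, "Amelie": 1},
    "bob": {"Alien": 4, "Blade Runner": 5, "Brazil": 4},
    "cat": {"Alien": 1, "Amelie": 5, "Brazil": 2},
}

def similarity(a, b, shrink=1.0):
    """Cosine similarity over co-rated movies, shrunk toward 0 when overlap is small."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    dot = sum(ratings[a][m] * ratings[b][m] for m in common)
    norm_a = sqrt(sum(ratings[a][m] ** 2 for m in common))
    norm_b = sqrt(sum(ratings[b][m] ** 2 for m in common))
    cosine = dot / (norm_a * norm_b)
    # Hypothetical smoothing: few shared movies means less trust in the similarity.
    return cosine * len(common) / (len(common) + shrink)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    num = den = 0.0
    for other, seen in ratings.items():
        if other == user or movie not in seen:
            continue
        weight = similarity(user, other)
        num += weight * seen[movie]
        den += weight
    return num / den if den else None

guess = predict("ann", "Brazil")
```

In this toy set, ann's tastes track bob's more closely than cat's, so the guess for "Brazil" lands nearer to bob's 4 than to cat's 2.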

But the other methodology that several groups have started looking at is based on combining the decisions of several different approaches. Indeed, one blogger from Columbia University noted that this is similar to Epicurus’ Principle of Multiple Explanations. It is also widely used in classification algorithms like AdaBoost in the hope of overcoming the problem of overfitting to training data, which means creating a decision process that is too finely tuned to any special oddities lurking in the training data.
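A rough sketch of what combining approaches can look like in its simplest form: a weighted average of two predictors, with the weight picked against held-out ratings. This is plain linear blending, not AdaBoost (which reweights training examples across rounds), and the numbers and function names are made up for the example.

```python
from math import sqrt

def rmse(preds, truth):
    """Root mean squared error between predictions and true ratings."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth))

def blend(a, b, w):
    """Weighted average of two prediction lists; w is the weight on a."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

def best_weight(a, b, truth, steps=20):
    """Grid-search the blend weight that minimizes RMSE on held-out ratings."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(blend(a, b, w), truth))

# Invented held-out ratings and two imperfect predictors.
truth = [4, 2, 5, 3]
model_a = [5, 3, 5, 4]  # tends to run high
model_b = [3, 1, 4, 3]  # tends to run low

w = best_weight(model_a, model_b, truth)
blended = blend(model_a, model_b, w)
```

Because the grid includes weights 0 and 1, the blend can never score worse than either predictor on the data used to pick the weight. Whether it holds up on fresh data is exactly the overfitting question, so in practice the weight would be chosen on one held-out set and checked on another.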

One area that is not currently exploited, though, is the direct use of movie metadata (directors, actors, release date, genre) in the models. It can be argued that some (if not most) of that metadata and its influence is encapsulated in the choices made by people, thus producing an expert analysis in their ranking strategy. But I think there may be some significant value in a hybrid approach that looks at when the metadata connections make better predictions than the crowds. And that kind of hybrid approach is precisely what I am working on with Ofamind.
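As a sketch of how such a hybrid might be wired together (this is an illustration under my own assumptions, not the Ofamind implementation): predict from metadata overlap with films I have already rated, and lean on that prediction more heavily when few neighbors have rated the film. The film table, the `k` constant, and the function names are all hypothetical.

```python
# Toy metadata and ratings, invented for illustration.
films = {
    "Alien": {"director": "Ridley Scott", "genre": "sci-fi"},
    "Blade Runner": {"director": "Ridley Scott", "genre": "sci-fi"},
    "Amelie": {"director": "Jean-Pierre Jeunet", "genre": "comedy"},
    "Gladiator": {"director": "Ridley Scott", "genre": "drama"},
}
my_ratings = {"Alien": 5, "Blade Runner": 4, "Amelie": 2}

def metadata_predict(title):
    """Average my past ratings, weighted by how many metadata fields match."""
    target = films[title]
    num = den = 0.0
    for seen, rating in my_ratings.items():
        overlap = sum(films[seen][f] == target[f] for f in ("director", "genre"))
        num += overlap * rating
        den += overlap
    return num / den if den else None

def hybrid_predict(crowd_pred, n_neighbors, metadata_pred, k=5.0):
    """Trust the crowd more as the number of neighbors who rated the film grows."""
    trust = n_neighbors / (n_neighbors + k)
    return trust * crowd_pred + (1 - trust) * metadata_pred

meta = metadata_predict("Gladiator")
sparse = hybrid_predict(3.0, 0, meta)   # nobody nearby has rated it
dense = hybrid_predict(3.0, 45, meta)   # plenty of neighbors have
```

With no neighbors the hybrid falls back entirely on the metadata guess; with 45 neighbors it sits close to the crowd's 3.0. A fixed `k` is the crudest possible rule for deciding when metadata beats the crowd, and learning that trade-off from data is the interesting part.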

Sunday, May 20, 2007

Startup and Dissemination

I've been running full-bore these last two weeks, writing commercialization plans and dispatching them to consultants for review, getting servers up and running in skyscrapers in San Jose, and pushing the whole system out for early public consumption. On top of that, I was drawn into the local paper's wonderful world of fora, sparring with those who use "seditious liberal" as a form of punctuation. Name-calling is such a great debate tactic. Oh, and I did "Career Day" at my son's school, spending the morning repeating myself. Such is the fate of teachers, as I recall. I had my patter down by the third and final session and, of course, will go back again.

The forum engagements began as a follow-on to my original editorial, now followed up twice in print. The second follow-up calmly attacked the first for its anti-evolution stance. I post little notes on these topics, focusing on issues in epistemology and noting that I don't much care what others believe as long as policy provides for freedom of conscience. My unflappability results in the name-calling, I suppose. Oh, I was also told to repent. That was expected.

But more interesting is the continued sense of support for my main business effort, which I will now reveal to my very limited readership! ofamind.com is an elaboration of my earlier glitta.com platform and is designed as an online prosumer web clippings/tagging/social networking engine for online knowledge workers. I liken it to MySpace crossed with Salesforce for serious researchers. The system integrates with patent search and with CiteSeer (more sources to come) as well as with other users' content to provide discoveries related to your interests. From a business standpoint, the technology is a channel for personalization and advertising for a select audience (with high earnings) built around science, technology, legal and business research professionals. The early-stage funding came from consulting gigs, but I won a National Science Foundation grant to continue to expand the system and have been team building in an effort to get a second wave of funding.

Meanwhile, I am hard at work on another government funding opportunity that builds on one component of the Ofamind system: indexable briefings and presentations. Specifically, a subcontractor has a tool for compiling voice together with PowerPoint slideshows, then uploading them for search and dissemination to Ofamind. The new funding opportunity builds off of that capability and expands on the idea that cross-citation of supporting documents and ideas via tagging, content and reference can provide a powerful new way of working on the web and distributing briefings and presentations to a wider audience.