Sunday, July 29, 2007

Semantics and Sonny Bono


"Bono and the tree became one"

That sentence has been an object of scrutiny for me over the past several weeks. It is short enough and the meaning seems fairly easy to digest: Sonny Bono died in a skiing accident. It might have shown up in a blog back when the event transpired, or in casual conversation around the same time.

So what is so fascinating about it? It is the range of semantic tools that are needed to resolve Bono to Sonny Bono and not to U2's Bono or any of the thousands of other Bonos that likely exist. First, we need background knowledge that Sonny Bono died in a skiing accident. Next we need either the specific knowledge that a tree was involved or the inference that skiing accidents sometimes involve trees. Finally, we need a choice preference that rates notable people as more likely to be the object of the discussion than everyday folk.

We could still be wrong, of course. The statement might be about Frank Bono, a guy from down the street who likes to commune with nature. It might be, but for a statement in isolation the notability preference serves a de facto role as a disambiguator.

How, then, can we design technology to assign the correct referent to occurrences like Bono in the text above? We have several choices and the choices overlap to varying degrees. We could, for instance, collect together all of the contexts that contain the term Bono (with or without Sonny), label them as to their referent, and try to infer statistical models that use the term context to partition our choices. This could be as simple as using a feature vector of counts of terms that co-occur with Bono and then looking at the vector distance between a new context vector (formed from the sentence above) and the vectors of the existing assignments.
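To make the co-occurrence idea concrete, here is a minimal sketch in Python. The labeled contexts, the stopword list, and the function names are all invented for illustration, and cosine similarity is just one reasonable choice of vector distance; a real system would draw on many more labeled contexts.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list, enough for this toy example.
STOPWORDS = {"a", "an", "and", "the", "in", "with", "of", "to",
             "while", "after", "before"}

# Assumption: invented labeled contexts standing in for a real collection.
LABELED = [
    ("Sonny Bono", "Bono died after hitting a tree while skiing"),
    ("Sonny Bono", "Sonny Bono sang with Cher before entering Congress"),
    ("Bono (U2)", "Bono and U2 played a concert in Dublin"),
    ("Bono (U2)", "U2 singer Bono released a new album"),
]

def context_vector(text, target="bono"):
    """Count the terms that co-occur with the target term in one context."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t != target and t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_profiles(labeled):
    """Sum the context vectors of every context labeled with a referent."""
    profiles = {}
    for referent, text in labeled:
        profiles.setdefault(referent, Counter()).update(context_vector(text))
    return profiles

def disambiguate(sentence, profiles):
    """Return the referent whose profile is closest to the new context."""
    query = context_vector(sentence)
    scores = {ref: cosine(query, prof) for ref, prof in profiles.items()}
    return max(scores, key=scores.get), scores

profiles = build_profiles(LABELED)
best, scores = disambiguate("Bono and the tree became one", profiles)
print(best, scores)
```

With these made-up contexts the only shared term is "tree", which is enough to pull the sentence toward Sonny Bono. That also shows how brittle the approach is when the labeled data is thin.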

We could also try to create a model that recreates our selection preferences and the skiing <-> tree relationship and does some matching combined with some inference to try to identify the correct referent. That is fairly tricky to do over the vast sea of possible names, but is easy enough for a single one, like Bono.
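For a single name, such a model can be little more than a lookup table plus one inference rule. The sketch below is a toy under stated assumptions: the facts, the "may involve" table, and the `resolve` function are all hypothetical and hand-written for this one example.

```python
# Assumption: hand-written background knowledge about each candidate.
FACTS = {
    "Sonny Bono": {"notable": True, "events": {"skiing accident"}},
    "Bono (U2)": {"notable": True, "events": {"rock concert"}},
    "Frank Bono": {"notable": False, "events": {"nature walk"}},
}

# Assumption: the inference step, encoded as objects an event type may involve.
MAY_INVOLVE = {
    "skiing accident": {"tree", "snow", "slope"},
    "rock concert": {"stage", "guitar"},
    "nature walk": {"tree", "trail"},
}

def resolve(objects_in_sentence, facts=FACTS):
    """Match candidates whose known events may involve a mentioned object,
    then apply the preference for notable people over everyday folk."""
    matches = [
        name for name, info in facts.items()
        if any(objects_in_sentence & MAY_INVOLVE.get(event, set())
               for event in info["events"])
    ]
    notable = [name for name in matches if facts[name]["notable"]]
    # Fall back to non-notable matches only when no notable candidate fits.
    return notable or matches

print(resolve({"tree"}))   # Sonny Bono and Frank Bono both match; notability decides
```

Frank Bono matches the tree just as well as Sonny does, so it is the notability preference, not the tree, that settles the question here.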

It turns out all of these approaches have been tried, as well as interesting hybridizations of them. One hybrid, for instance, expresses the notability preference as a probability weighting based on web search mentions and adds in the distance between different concepts in a tree-based ontology, trying to exploit human-created semantic knowledge to assist in the process. Fairly simple statistics do pretty well over large sets of names (just choose the most likely assignment all the time), but they don't really capture the kinds of semantic processing that we believe we undertake in our own "folk psychologies" as described above.
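A hybrid of that kind might look something like the following sketch. Everything specific in it is an assumption: the mention counts are invented rather than taken from any search engine, the ontology is a toy hierarchy written for this example, and the scoring rule (prior times an exponential decay in path length, with a hand-picked `alpha`) is only one of many ways to combine the two signals.

```python
import math

# Assumption: invented mention counts standing in for web search statistics.
MENTIONS = {"Sonny Bono": 350, "Bono (U2)": 600, "Frank Bono": 10}

# Assumption: the concept each candidate is most associated with.
CONCEPT = {"Sonny Bono": "skiing", "Bono (U2)": "rock concert",
           "Frank Bono": "forest"}

# Assumption: a toy tree-shaped ontology, stored as child -> parent.
PARENT = {
    "tree": "forest", "forest": "outdoors",
    "skiing": "winter sport", "winter sport": "outdoors",
    "outdoors": "thing",
    "rock concert": "music", "music": "entertainment",
    "entertainment": "thing",
}

def ancestors(node):
    """Return the node followed by its chain of parents up to the root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def path_distance(a, b):
    """Number of edges between two concepts via their lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(ancestors(a))}
    for j, n in enumerate(ancestors(b)):
        if n in steps_from_a:
            return steps_from_a[n] + j
    raise ValueError("concepts share no ancestor")

def score_candidates(context_concept, alpha=0.5):
    """Weight each candidate's notability prior by ontology closeness."""
    total = sum(MENTIONS.values())
    return {
        name: (MENTIONS[name] / total)
        * math.exp(-alpha * path_distance(context_concept, CONCEPT[name]))
        for name in MENTIONS
    }

scores = score_candidates("tree")
print(max(scores, key=scores.get), scores)
```

With these invented numbers the prior alone would pick U2's Bono, and the ontology distance alone would pick Frank from down the street; only the combination lands on Sonny Bono. A different `alpha` or different counts could easily flip the answer, which is part of why these hybrids are hard to get right.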

Still, I see the limited success of knowledge resources as an opportunity rather than a source of discouragement. We definitely have not exhausted the well.
