
Friday, May 23, 2008

Assault and Ambiguity

It began with a chance encounter. I was walking through a room with a TV on and a news caption was running during a commercial break. The newscaster intoned with provocative seriousness how a woman had followed a man who had sexually assaulted her weeks earlier to his home and he had been arrested. The location was my town and it rang a bell.

Three weeks back my wife and I went to a doctor’s appointment in Lafayette and, as we returned and entered our neighborhood, we were surprised to see a half-dozen police cars in our otherwise perfectly tame patchwork of planned homes and parks. During a walk two hours later there was still a squad car on one side street. A scan of the police blotter turned up the cause: a woman walking with her toddler had been assaulted and groped by a man who ran away when struck in the groin with a sippy cup.

Weeks passed until I overheard the news of the arrest. Good for her! The next phase of datamining impressed me with the thoroughness of the picture. I was able to use the television station archives combined with Google to find the mug shots, the original sketch of the suspect done by police sketch artists, the suspect’s arrest status in the county courts system, the location of the alleged perpetrator’s house, the suspect’s father’s name and place of business, the suspect’s mother’s name and place of business, a previous citation of the suspect for a moving violation (infraction) in a neighboring city, the county records concerning the amount and type of mortgage held on the suspect’s home, and a satellite view of the home as well.

Amusingly, the reporter in the news piece actually drove by our house and coincidentally filmed our various vehicles. I could likely have read the license plates from the footage if I had wanted to.

Overall, I had managed to scour out all the corners of ambiguity concerning when, where, who and how, leaving only the strange question of why in fuzziness. Why was this 24-year-old still living at home, jogging at midday and preying on middle-aged women? Why was he living in this neighborhood where even a Megan’s Law offender is fairly hard to find?

But strangely, it was the suspect’s last name that was the key to developing the search picture, because the last name was so unusual. Had he been “Jim Smith” or “Joe Sanchez” or “Mike White” it would have been virtually impossible to make as much headway in extinguishing ambiguity. George Miller is quoted as saying something like “There is only one problem in Artificial Intelligence: words have more than one meaning.” (And I can’t resolve the ambiguity of the source of that quote because George Miller is too ambiguous.) This problem is amplified when searching across identities of places and people, or when special identifiers are introduced as placeholders in a single document (this happens quite often in technical literature, where an acronym is used locally as a technical shorthand but is ambiguous outside that document or domain). Moving to the level of folksonomies for, say, labeling pictures on the web, we see the problem exacerbated by the natural telegraphic shorthand that any labeling scheme suggests to the user purely by dint of the size of the entry fields.

Clever approaches to trying to apply context to help address these limitations start with statistical co-occurrence-based disambiguation and linkage analysis, and then run all the way through to using complex ontologies to try to infer the best relabeling of the ambiguous entity or concept as a canonical identifier. None of these methods can hope to achieve any level of perfection but a basket of them can enhance the process of information discovery and disambiguation.
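The "basket" idea can be sketched in a few lines: score each candidate referent with several weak signals and combine them by weighted voting. The two signal functions, their weights, and all the numbers below are invented for illustration, not taken from any real system.

```python
def combine_scores(candidates, signals, weights):
    """Score each candidate referent as a weighted sum of signal scores."""
    totals = {}
    for cand in candidates:
        totals[cand] = sum(w * sig(cand) for sig, w in zip(signals, weights))
    return max(totals, key=totals.get), totals

# Toy signals for a mention of "Bono" in a skiing-related sentence.
context = {"tree", "skiing"}

def co_occurrence(cand):
    # Fraction of the candidate's known context words seen in the sentence.
    known = {"Sonny Bono": {"skiing", "tree", "singer"},
             "U2's Bono": {"singer", "band", "concert"}}[cand]
    return len(known & context) / len(known)

def notability(cand):
    # A made-up prominence prior on [0, 1].
    return {"Sonny Bono": 0.4, "U2's Bono": 0.9}[cand]

best, scores = combine_scores(["Sonny Bono", "U2's Bono"],
                              [co_occurrence, notability],
                              [0.7, 0.3])
```

No single signal needs to be reliable; the point is that an ensemble of imperfect disambiguators can outvote any one signal's blind spots.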

Wednesday, August 15, 2007

Moore and Semantic Skepticism


Strange, I found the Paglia essay interpenetrating all of my thoughts over the past few days, dredging up language swarms from old Derrida and Feyerabend essays, and dipping over into my work on disambiguation and ontology. See, linguistics-wise, I was once an empiricist with an almost palpable antagonism to the value of knowledge resources like ontologies in solving specific problems. I would reach first for a statistical model that was trained on the contexts of word occurrences, expecting that words can only be known by the company that they keep.

I even doubted the notion that the Semantic Web could achieve any level of crispness in assigning metadata to online content, since it seemed inherently impossible for content authors to assign metadata consistently. The position is postmodern relativism, if you will, derived from the same kind of semantic and pragmatic arguments that have been used to deconstruct machine learning: do I translate this as "terrorist" or "freedom fighter"? Well, what is your frame of reference? What is your meta-narrative?

A radical position is the folksonomy view that folks themselves are the best determiners of how to tag metadata. In this view, they use whatever tags seem appropriate based on their own intuitions about the content. But does this get us around the Bono issue, below? Unlikely. The relativist worry seems more appropriate for purely abstract and controversial concepts like "terrorist" or "justice".

So I think we need a gradation of semantic forms that range from relatively simple propositions about identity up through propositions about meaning and intent. The latter are purely Wittgensteinian word games, with agreement and disagreement strewn across the symbol space, but the former have lower average rates of disagreement over referential attachment.

This parallels the notion of post-postmodernism in a way, by accepting fluidity and chaotic symbol/signifier interactions but still anticipating a useful and uncontroversial basis for facts. G.E. Moore would raise his hand in salute.

Sunday, July 29, 2007

Semantics and Sonny Bono


"Bono and the tree became one"

That sentence has been an object of scrutiny for me over the past several weeks. It is short enough and the meaning seems fairly easy to digest: Sonny Bono died in a skiing accident. It might have shown up in a blog back when the event transpired, or in casual conversation around the same time.

So what is so fascinating about it? It is the range of semantic tools that are needed to resolve Bono to Sonny Bono and not to U2's Bono or any of the thousands of other Bonos that likely exist. First, we need background knowledge that Sonny Bono died in a skiing accident. Next we need either the specific knowledge that a tree was involved or the inference that skiing accidents sometimes involve trees. Finally, we need a choice preference that rates notable people as more likely to be the object of the discussion than everyday folk.

We could still be wrong, of course. The statement might be about Frank Bono, a guy from down the street who likes to commune with nature. It might be, but for a statement in isolation the notability preference serves a de facto role as a disambiguator.

How, then, can we design technology to assign the correct referent to occurrences like Bono in the text above? We have several choices, and the choices overlap to varying degrees. We could, for instance, collect together all of the contexts that contain the term Bono (with or without Sonny), label them as to their referent, and try to infer statistical models that use the term context to partition our choices. This could be as simple as using a feature vector of counts of terms that co-occur with Bono and then looking at the vector distance between a new context vector (formed from the sentence above) and the existing assignments.
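A minimal sketch of that statistical route, with the labeled contexts and stopword list invented for the example: build count vectors of co-occurring words for each known referent, then assign a new mention to whichever referent's vector is nearest by cosine similarity.

```python
from collections import Counter
from math import sqrt

# A tiny stopword list so function words don't dominate the vectors.
STOP = {"the", "a", "and", "in", "at", "with", "after", "of", "to"}

def context_vector(text, target="bono"):
    """Count the content words co-occurring with the target term."""
    words = [w.strip('.,"?!').lower() for w in text.split()]
    return Counter(w for w in words if w and w != target and w not in STOP)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented labeled contexts standing in for a training collection.
labeled = {
    "Sonny Bono": context_vector("Sonny Bono died in a skiing accident near a tree"),
    "U2's Bono": context_vector("Bono sang with U2 at the concert last night"),
}

new = context_vector("Bono and the tree became one after the skiing run")
best = max(labeled, key=lambda ref: cosine(labeled[ref], new))
```

With enough labeled contexts per referent this nearest-vector assignment is the simplest workable version of the partitioning described above.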

We could also try to create a model that recreates our selection preferences and the skiing <-> tree relationship and does some matching combined with some inferencing to try to identify the correct referent. That is fairly tricky to do over the vast sea of possible names, but is easy enough for a single one, like Bono.
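The knowledge-based route can be sketched with a few hand-entered facts and a one-step inference connecting "tree" back to a candidate referent. The facts and relation names below are illustrative, not drawn from any real knowledge base.

```python
# A toy fact store: (subject, relation, object) triples.
facts = {
    ("Sonny Bono", "died_in", "skiing accident"),
    ("skiing accident", "may_involve", "tree"),
    ("U2's Bono", "member_of", "U2"),
}

def connected(entity, word, max_hops=2):
    """Breadth-first walk over facts: is `word` reachable from `entity`?"""
    frontier, seen = {entity}, set()
    for _ in range(max_hops):
        nxt = set()
        for subj, _rel, obj in facts:
            if subj in frontier and obj not in seen:
                nxt.add(obj)
                seen.add(obj)
        if word in seen:
            return True
        frontier = nxt
    return word in seen

candidates = ["Sonny Bono", "U2's Bono"]
matches = [c for c in candidates if connected(c, "tree")]
```

Here "tree" is reachable from Sonny Bono via the skiing accident in two hops but unreachable from U2's Bono, which is exactly the matching-plus-inference pattern described above, shrunk to a single name.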

All of these approaches have been tried, as well as interesting hybridizations of them: for instance, expressing the notability preference as a probability weighting based on web search mentions while adding in the distance between different concepts in a tree-based ontology, thereby exploiting human-created semantic knowledge to assist in the process. It turns out that fairly simple statistics do pretty well over large sets of names (just choose the most likely assignment every time), but they don't really capture the kinds of semantic processing that we believe we undertake in our own "folk psychologies" as described above.
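One hypothetical shape for that hybrid: a notability prior taken as the log of (invented) mention counts, discounted by path distance in a toy ontology between each candidate and the concepts the sentence evokes. Every number and category below is made up for illustration.

```python
from math import log

# Invented mention counts standing in for web search hits.
mentions = {"Sonny Bono": 2_000_000, "U2's Bono": 30_000_000}

# Toy ontology as a child -> parent map.
parent = {
    "Sonny Bono": "skier", "skier": "sports", "sports": "thing",
    "U2's Bono": "singer", "singer": "arts", "arts": "thing",
    "skiing": "sports",
}

def path_to_root(node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b):
    """Path length between two nodes through their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    for i, node in enumerate(pa):
        if node in pb:
            return i + pb.index(node)
    return len(pa) + len(pb)

def score(cand, evoked="skiing"):
    # Log-scaled notability, discounted by ontological distance from context.
    return log(mentions[cand]) / (1 + tree_distance(cand, evoked))

best = max(mentions, key=score)
```

The interesting case is exactly this one: the ontology-distance discount lets the contextually closer referent overcome a fifteen-fold notability disadvantage, which a most-likely-assignment baseline could never do.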

Still, I see the limited success of knowledge resources as an opportunity rather than a source of discouragement. We definitely have not exhausted the well.