Thursday, March 8, 2007

Selective Memory and Chunking

A problem that I have worked on for around 15 years involves trying to tease apart how to create optimal mental models. By optimal I mean models that are, in some sense, the best at predicting the class of a signal given a learned history of signals. The best in this sense may not be perfect, however, because the history may not contain enough information to formulate an infallible model, especially when some of the signals were underrepresented in it.
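
To pin down what I mean (this notation is mine, not drawn from any particular study): if s is a new signal, c its true class, and M a candidate model fit to the learned history, then the optimal model is roughly

    M* = argmax_M  E_(s,c)[ log Pr_M(c | s) ]

with the expectation taken over signals not yet seen, which a finite history only lets us approximate.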

My interest in this topic arises from my interest in evolutionary optimization. The generalized intelligence of human beings is both an incremental improvement over other animals and a remarkably new phenomenon, especially when combined with language and the cultural transmission of lessons learned. I assume our model-making capabilities are highly optimized to use information effectively, although there are clear biases that influence the default ways our mental models grow and change. Examples of these biases arise all the time, as in our highly productive attempts to find human faces and forms in shadows. This illustrates a strong bias that likely had evolutionary significance in helping us escape threats: it is better to run away or be prepared for a fight than to carefully rule out a few false positives on the threat radar. The worst a false positive costs you is mild embarrassment; a false negative can cost you your life.

The problem of optimally learning models is nevertheless interesting because it points a way toward understanding how information can be used for prediction even when biases are mixed in by extrinsic factors like threats, and the topic I have studied involves language learning. Here's the interesting tidbit that led me to investigate it: if you create a sequence of letters that looks like a cryptoquip but is generated by a simple set of probabilistic equations, people can study it and then tell whether other, similar sequences came from the same equations or not. In other words, we can learn arbitrary patterns of symbols that are essentially meaningless. Why is this interesting? Well, the sound sequences young children hear have only a bit more context than that: some pointing and smiling and frowning and whatnot. So it seems safe to assume that this general capability for learning arbitrary patterns shares machinery with what children use during language acquisition, and more broadly with general pattern-learning processes.
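
For concreteness, here is a minimal sketch in Python of the kind of generator involved. The transition table is invented for illustration; the actual studies used hand-built artificial grammars, not these numbers.

    import random

    # Invented transition probabilities over an arbitrary alphabet;
    # the real experiments used hand-built artificial grammars.
    TRANSITIONS = {
        'X': {'V': 0.7, 'Q': 0.3},
        'V': {'M': 0.5, 'X': 0.5},
        'M': {'T': 0.6, 'V': 0.4},
        'T': {'M': 0.5, 'X': 0.5},
        'Q': {'T': 1.0},
    }

    def generate(length, start='X'):
        """Walk the transition table, emitting one symbol per step."""
        symbol, out = start, [start]
        for _ in range(length - 1):
            choices, weights = zip(*TRANSITIONS[symbol].items())
            symbol = random.choices(choices, weights=weights)[0]
            out.append(symbol)
        return ''.join(out)

    print(generate(9))  # e.g. 'XVMTMVXQT'

Strings produced this way are meaningless, yet after studying a batch of them people can sort fresh strings by source.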

In the original experiment, the question arose whether people were learning based only on repeated pairs of symbols or whether larger "chunks" of symbols figured into their reasoning. The experiment was repeated with the "pairwise frequencies" carefully balanced, demonstrating that larger chunks must be involved in the learning process. So we appear to create a mental model by inferring that some chunks are useful and others are not, then comparing those chunks with new sequences to see whether each new sequence is a good match. But which chunks do we choose? And is there an optimal way to acquire chunks such that we predict well?
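
A small sketch makes clear what balancing buys. The two toy strings below (my own examples, not stimuli from the experiment) share identical pair counts, so anything that tells them apart must be tracking larger chunks.

    from collections import Counter

    def chunk_counts(sequence, max_len):
        """Tally every contiguous chunk of length 2..max_len."""
        counts = Counter()
        for n in range(2, max_len + 1):
            for i in range(len(sequence) - n + 1):
                counts[sequence[i:i + n]] += 1
        return counts

    # Identical bigram counts, different trigrams: a learner that
    # distinguishes these strings must use chunks beyond pairs.
    a = chunk_counts('XXVVX', 3)
    b = chunk_counts('XVVXX', 3)
    print(a - b)  # Counter({'XXV': 1}) -- the pair counts cancel out
    print(b - a)  # Counter({'VXX': 1})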

For the second question, we can see the bounds of the problem by imagining that we acquired every chunk of every length and held them all in mind for comparison with a new sequence. We could then count the number of chunk matches and declare a match if we found enough. But what threshold justifies the judgment? And, besides, we don't have enough short-term memory to keep all those chunks in mind, since these are arbitrary symbol sequences like XVMTMVXQT... So maybe we look for large chunks that show similar patterns in different positions, or symmetries, and hold only those in mind? That seems more likely, but we know the task can't be performed from pairwise symbol frequencies alone, since people still succeed when those frequencies are balanced.
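
To make that bound concrete, here is the exhaustive strategy as a sketch, with an arbitrary threshold standing in for the unknown judgment criterion. Even one nine-symbol string yields dozens of chunks, which is exactly the working-memory problem.

    def all_chunks(sequence, min_len=2):
        """The exhaustive inventory: every substring of every length."""
        return {sequence[i:j]
                for i in range(len(sequence))
                for j in range(i + min_len, len(sequence) + 1)}

    def naive_match(known, candidate, threshold=0.5):
        """Judge a new string by the fraction of its chunks seen before."""
        cand = all_chunks(candidate)
        return len(cand & known) / len(cand) >= threshold

    known = all_chunks('XVMTMVXQT')
    print(len(known))                  # 36 chunks from nine symbols
    print(naive_match(known, 'XVMT'))  # True: every chunk was seen before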

My solution, which was partially derived from another model used for phrase acquisition, is to create a "lexicon" of symbol pairs initially, then create a new level by combining pairs together (or pairs with singletons), while throwing away combinations that do not increase the predictive capability of the system as a whole. This criterion, known variously as Minimum Description Length or Minimum Message Length, treats the expense of adding an element to the lexicon as trading off against the element's predictive value. A combined element may be slightly useful, yet cost too much to keep representing in the lexicon, and therefore should be thrown away. Using this approach, some notion of optimal prediction can be achieved given a finite set of learning examples.
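
Here is a deliberately crude sketch of that bottom-up pass: repeated fusing of the most frequent adjacent pair, gated by a two-part description-length estimate (bits to state the lexicon plus bits to encode the data with it). The per-symbol lexicon cost is an arbitrary constant, and my actual model was considerably more elaborate than this.

    import math
    from collections import Counter

    def description_length(tokens, bits_per_symbol=8.0):
        """Two-part code: lexicon bits plus -log2 p(token) per token."""
        counts = Counter(tokens)
        total = len(tokens)
        data_bits = -sum(c * math.log2(c / total) for c in counts.values())
        lexicon_bits = sum(len(t) for t in counts) * bits_per_symbol
        return lexicon_bits + data_bits

    def best_pair(tokens):
        """Most frequent adjacent pair: the next candidate chunk."""
        pairs = Counter(zip(tokens, tokens[1:]))
        return max(pairs, key=pairs.get) if pairs else None

    def merge(tokens, pair):
        """Rewrite the token stream with the pair fused into one chunk."""
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def build_lexicon(sequence):
        """Greedily fuse pairs while each fusion shortens the description."""
        tokens = list(sequence)
        while True:
            pair = best_pair(tokens)
            if pair is None:
                break
            candidate = merge(tokens, pair)
            if description_length(candidate) >= description_length(tokens):
                break              # this chunk does not pay for itself
            tokens = candidate
        return sorted(set(tokens), key=len, reverse=True)

    print(build_lexicon('XVMTXVMTQXVMT'))  # ['XVMT', 'Q']

The repeated chunk XVMT is fused step by step because each fusion shortens the total description, while the final candidate fusion fails the test and is discarded.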

But here is the rub. A bottom-up combination approach may never find the best set of chunks, because some chunks repeat with a certain distance between them while the symbols in that separating gap vary. Combining adjacent elements doesn't suffice under those terms, so you need the ability to semi-randomly and experimentally vary elements of the lexicon and check whether those mini-experiments improve the model. This is where we can propose a selectionist description of optimization that builds on the basic combination approach by adding a generate-and-test style of thinking! The end result is a greater likelihood that the resulting model will outperform other models, while requiring only a limited additional commitment of working memory.
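
And a sketch of the generate-and-test loop layered on top, reusing description_length from the previous sketch. The proposal scheme here, randomly adding or dropping a single chunk, is only an illustrative stand-in for whatever mini-experiments actually get run.

    import random

    def segment(sequence, lexicon):
        """Greedy longest-match tokenization; single symbols as fallback."""
        chunks = sorted(lexicon, key=len, reverse=True)
        tokens, i = [], 0
        while i < len(sequence):
            for c in chunks:
                if sequence.startswith(c, i):
                    tokens.append(c)
                    i += len(c)
                    break
            else:
                tokens.append(sequence[i])
                i += 1
        return tokens

    def generate_and_test(sequence, lexicon, steps=500, seed=1):
        """Randomly add or drop one chunk per step; keep the edit only
        if the total description gets shorter."""
        rng = random.Random(seed)
        best = description_length(segment(sequence, lexicon))
        for _ in range(steps):
            trial = set(lexicon)
            if trial and rng.random() < 0.5:
                trial.discard(rng.choice(sorted(trial)))  # drop a chunk
            else:                                         # or add one
                i = rng.randrange(len(sequence) - 1)
                j = rng.randrange(i + 2, min(i + 6, len(sequence)) + 1)
                trial.add(sequence[i:j])
            score = description_length(segment(sequence, trial))
            if score < best:                              # the "test"
                lexicon, best = trial, score
        used = set(segment(sequence, lexicon))
        return sorted((c for c in lexicon if c in used),
                      key=len, reverse=True)

    # The repeats of XVMT are separated by varying filler; random
    # proposals can find the chunk without building it up pairwise.
    print(generate_and_test('XVMTQQXVMTZZXVMT', set()))

A handful of such proposals held at once is a small working-memory commitment compared with the exhaustive chunk inventory sketched earlier.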
