May 15, 2001 by George A. Miller
Understanding how humans process the subtlety of language is crucial to recreating the ability to understand natural language in computers. Dr. George Miller investigates the cognitive processes of resolving the vagueness in human language.
Originally published March 2001 at Impacts Magazine. Published on KurzweilAI.net May 15, 2001.
Most people are unaware how vague and ambiguous human languages really are, so they are disappointed when computers fail to understand linguistic communication. They are surprised when an information retrieval system gives responses that seem unrelated to their search word. They can’t understand why question answering should be so hard for a machine. And they are really upset by the low quality of machine translations. As communication grows increasingly important, the computer’s linguistic limitations become increasingly frustrating. As more and more documents are stored in computers, the machines’ inability to understand the information they hold restricts their usefulness to both business and government. Computers are not to blame for this situation. Language itself is at the root of it.
I am the kind of psychologist who studies the basic cognitive processes of the human mind, the cognitive processes that support sensation and perception, learning and memory, problem solving and reasoning, and especially those characteristically human cognitive processes that support speech and language. As a psychologist, I say that I study the mind, but my private conceit is that I really study the brain, for surely it is the brain that performs all those processes that enable us to develop and maintain our knowledge of the world and of ourselves.
If I accomplish nothing else in this story, I hope I will persuade you that human language is so vague and ambiguous that only a very clever brain could possibly understand it.
The Nature of the Problem
The problem begins with words, which I shall take to be the smallest meaningful units of language. I am going to assume that we already understand how words are recognized as units in the flow of spoken sound. Speech recognition is still a challenging topic for research, but let’s assume that this perceptual part of the process of understanding speech has been solved–that we already have an adequate theory of how individual words are recognized.
The first question is how we assign meaning to the spoken words. To take one example from thousands that are available, consider the noun “triangle.” As philosophers pointed out long ago, the noun “triangle” is hopelessly vague. Without further explanation we don’t know whether the triangle is acute or obtuse, oblique or right-angled, scalene or isosceles or equilateral, and we have no idea what color it is or where it is or how big it is or how it is oriented. So the word “triangle” is referentially vague.
Moreover, the noun “triangle” is ambiguous, in the sense that it can be used to express more than one meaning. The word “triangle” can refer either to a three-sided polygon or to a musical percussion instrument or to a social situation involving three parties. If you were to hear someone say, “It’s a good triangle,” you could not be sure which meaning of “triangle” the speaker had in mind. So the noun “triangle” is semantically ambiguous.
Of course, “triangle” is seldom ambiguous when it occurs as part of an on-going conversation. It has several possible meanings, but the intended meaning is almost always clear from the context in which the word is used. The fact that it has several meanings makes it potentially ambiguous. But there is a difference between multiplicity of meaning and ambiguity. To keep this distinction clear, I am going to use a technical term. I am going to say that “triangle” is polysemantic: poly–meaning “many”–and semantic–meaning “meaning.” A polysemantic word can have many meanings, yet not be ambiguous when used in an appropriate context.
The point, however, is that words, the basic building blocks of meaningful language, are extremely slippery items and must be handled with great care. Indeed, some people think that words can be indefinitely polysemantic–that a word can express different meanings every time it is used. They point to the noun “container.” If someone says he has a container of apples, you would probably understand “container” to mean “basket.” But if he says he has a container of water, you would probably understand “container” to mean “glass” or “bottle.” And if he says he has a container of groceries, you would probably understand “container” to mean “bag” or “box.” According to this argument, every time “container” occurs in a different environment, it expresses a different meaning. Hence, unlimited polysemy.
Things are bad, but they are not that bad. If, like Humpty Dumpty, our words could mean whatever we wanted them to mean, we would not have much luck using words to communicate. The trouble with this argument for unlimited polysemy is that it confuses meaning and reference. The word “container,” like the word “triangle,” is referentially vague–it can be used to refer to any one of a great variety of containers. But its meaning is, roughly, “an object capable of holding material for storage or transport,” and a great variety of objects, from spoons to boxcars, satisfy that definition.
Now, I can understand how people tolerate referential vagueness. It is a matter of common courtesy. A polite communicator gives the audience as much information as is needed, but not all that is available. Language evolved for social collaboration and once collaboration is achieved, language has done its job. Imagine telling someone to come here and then getting into an argument about precisely where “here” is–just there, or maybe an inch closer, or a tiny bit to the left? The adverb “here” is referentially vague, but that doesn’t cause trouble; it would only cause trouble if it were not vague. So I understand vagueness.
What I don’t understand is how we tolerate semantic ambiguity. Yet we seem to thrive on it. As a psychologist, I find it very interesting that most people are not even aware how ambiguous words can be. People are so skilled at resolving potential ambiguities that they don’t realize that they are doing it. The realization hits you, however, when you try to develop a theory of how people do it. People use the context, of course, but precisely what context is and how people use it need to be explained.
How Computers Resolve Ambiguity
There have been many attempts to enable computers to deal with ambiguity and I want to describe them briefly. If nothing else, it will help clarify what the problem is.
One of the benefits that modern computers provide for cognitive scientists is to give us a tool for sharpening and testing our theories. Many behavioral scientists believe that a computational theory is a first step toward a neuro-physiological theory. If we really understood how people cope with semantic ambiguity, we should be able to program a computer to do the same thing. But, so far, our attempts to devise such a theory and explain it to a computer have been only marginally successful.
The problem of ambiguity comes up almost everywhere that computers try to cope with human language. In information retrieval, the computer often retrieves information about alternative meanings of the search terms, meanings that we had no interest in. In machine translation, the different meanings of an English word may be expressed by very different words in the target language, so it is important to determine which meaning of the English word the author intended–and that is what a computer has trouble doing. Over and over, attempts to use computers to process human language have been frustrated by the computer’s limited ability to deal with polysemy.
I will illustrate what a computer faces with a well-known excerpt from Robert Frost’s poem, “Stopping by Woods on a Snowy Evening”:
“The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.”
To make my illustration as simple as possible, I will use only the couplet, “But I have promises to keep, and miles to go before I sleep.” Let’s see what a computer might make of these thirteen words.
Imagine that a computer has been given all of the information about the meanings of English words that can be found in a good collegiate dictionary. So the computer will begin by looking up the word “But” and will discover that the dictionary provides 11 different meanings. Next, the computer looks up “I” and finds three meanings. On the assumption that the meaning of word combinations depends on the meanings of the individual words, the computer concludes that the two initial words, “But I,” must have 3 x 11 = 33 possible compound meanings.
Proceeding in this manner, the computer finds that the word “have” can be used to express 16 different meanings, so the number of possible compound meanings of “But I have” is 3 x 11 x 16 = 528. And 7 meanings of “promise” brings the number of possible meanings to 3,696.
By the time the computer finishes looking up all 13 of the words in this couplet, the product is 3,616,013,016,000 (three trillion six hundred sixteen billion thirteen million and sixteen thousand) possible compound meanings. This works out to a geometric mean of 9.247 meanings per word.
Put it this way: Imagine the computer is running a maze and that at each choice point there are 9 alternative ways to continue. In order to run the maze, the computer must make the correct choice every time–it must find the one correct path out of three trillion possibilities. Computers find this maze very difficult, but you and I sail through it without even noticing that there are any alternatives.
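The arithmetic behind this count can be checked with a short sketch. The sense counts for the first four words and the thirteen-word total are the figures quoted above; the counts for the remaining nine words are not given here, so the sketch works backward from the total rather than guessing them. The function and variable names are illustrative only.

```python
import math

# Sense counts quoted in the text for the first four words of the couplet.
quoted_senses = {"But": 11, "I": 3, "have": 16, "promises": 7}

def compound_meanings(sense_counts):
    """Number of compound meanings if every combination of senses is allowed."""
    return math.prod(sense_counts)

def geometric_mean_senses(total_combinations, n_words):
    """Average branching factor: the n-th root of the total product."""
    return total_combinations ** (1.0 / n_words)

first_four = compound_meanings(quoted_senses.values())

# Total for all 13 words, as reported in the text.
TOTAL_13_WORDS = 3_616_013_016_000
branching = geometric_mean_senses(TOTAL_13_WORDS, 13)

print(first_four)           # 3696
print(round(branching, 3))  # 9.247
```

The geometric mean, rather than the arithmetic mean, is the right average here because the per-word counts multiply: thirteen choice points with about nine branches each give the same number of paths as the actual maze.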
Of course, this couplet is short and the words are as plain and familiar as only Robert Frost could make them. And that is part of the trouble–the words are so plain and familiar. It is a perverse feature of human languages that the words used most frequently tend to be the most polysemantic. If we took a passage filled with obscure but unambiguous technical terms, the branching would be far less. But it would still not be zero.
So far I have assumed that the computer has only a dictionary. Let’s give the computer some capacity for syntactic analysis. Let’s assume–which is not unrealistic–that the little words (“but,” “I,” “to,” “and,” “before”–the so-called “closed-class” words) are there primarily as markers of grammatical structure, so a good syntactic analyzer will take care of them. The only thing tricky about the grammar here is that “have to” is a kind of modal auxiliary verb, synonymous with “must”–“have to keep promises” and “have to go miles.” The syntactic analyzer will also tell us that in this passage “promise” is a noun and “keep” is a transitive verb, and so on. Armed with this information, the computer can now make better use of its dictionary.
| have (t) | modal verb | 1 | 1 |
| Geometric Mean = 2.026 senses/word |
When the ambiguity calculation is repeated using only the meanings possible for the given syntactic structure, the product comes down to 9,660 possible meanings. Of course, this only looks like progress because three trillion was so absurd. But the computer still has to find the right meaning among a set of 9,660 possibilities.
The geometric mean per word is now 2.026 for this brief passage. If longer passages also average about two meanings per word, and if we were to guess at random which meaning was intended, we should be right about half the time. Not good enough.
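A minimal sketch of the reasoning in the last two paragraphs, using only the two figures reported above (9,660 combinations over 13 words). It assumes a uniform random guess among a word's senses, which is a simplification; the names are illustrative.

```python
# Figures reported in the text after syntactic filtering.
REMAINING_COMBINATIONS = 9_660
N_WORDS = 13

def geometric_mean(total, n):
    """n-th root of a product: the average number of senses per word."""
    return total ** (1.0 / n)

senses_per_word = geometric_mean(REMAINING_COMBINATIONS, N_WORDS)

# With k equally likely senses, a random guess is right with probability 1/k.
per_word_guess = 1.0 / senses_per_word

# Guessing every word of the couplet at once means picking one reading
# out of all remaining combinations.
whole_passage_guess = 1.0 / REMAINING_COMBINATIONS

print(f"senses per word:       {senses_per_word:.3f}")
print(f"per-word guess:        {per_word_guess:.2f}")
print(f"whole-passage guess:   {whole_passage_guess:.6f}")
```

Being right about half the time on each word still leaves almost no chance of getting the whole passage right, which is why a per-word average near two is "not good enough."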
The problem is even worse in other languages. The polysemy of words in spoken Chinese is far greater than it is in spoken English. Even French is more polysemantic than English.
The truth is that polysemy just doesn’t bother people. While a computer is struggling with its 9,660 alternatives, you and I select the correct interpretation in the twinkle of an eye. And we don’t even realize that we have done something remarkable.
But maybe language isn’t as ambiguous as this example has made it seem. It is true that common words usually have several different meanings, but not all of those meanings are used equally often. Some meanings of polysemantic words are used much more frequently than others are. For example, the word “horse” can refer to an animal, or it can refer to a gymnastic apparatus, or it can refer to a sawhorse, or it can refer to heroin, but if you sample usage in books and newspapers and magazines, you will find that the noun “horse” refers to an animal 100 times as often as it refers to anything else.
So maybe the computer can use statistics to solve this problem. What would happen if the computer always chose the most frequent meaning at every choice point in the maze?
My colleagues and I at Princeton University actually explored this possibility a few years ago. It isn’t easy, because good statistics about the relative frequencies of different meanings of polysemantic words do not exist. But we determined the context-appropriate meaning of every noun, verb, adjective, and adverb in some 104 passages (over 200,000 running words) of the Brown Corpus, which is a collection of 1,000,000 running words said to be representative of American prose writing. That gave us data on the relative frequencies of the different meanings of the most common polysemantic words.
Then we went through this semantically disambiguated text and looked to see how often the context-appropriate meaning was the most frequent one. The results are shown in this graph (Figure 1).
Looking only at the polysemantic words, the most frequent meaning was correct just 56 percent of the time. Of course, many of the nouns, verbs, adjectives, and adverbs in the Brown Corpus are monosemantic (in which case the most frequent meaning is the only meaning). So if we look at all the words together, the most frequent meaning is the correct one just 67 percent of the time.
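The "always choose the most frequent meaning" strategy can be sketched in a few lines. The sense-tagged tokens below are invented for illustration and are not drawn from the Brown Corpus; the sense labels and function names are likewise hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sense-tagged tokens: (word, sense label). Not real corpus data.
training = [
    ("horse", "animal"), ("horse", "animal"), ("horse", "sawhorse"),
    ("line", "text"), ("line", "queue"), ("line", "text"),
    ("triangle", "polygon"),
]
held_out = [
    ("horse", "animal"), ("horse", "heroin"),
    ("line", "queue"), ("triangle", "polygon"),
]

def most_frequent_senses(tagged):
    """Map each word to the sense it carried most often in the tagged data."""
    counts = defaultdict(Counter)
    for word, sense in tagged:
        counts[word][sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def baseline_accuracy(model, tagged):
    """Share of tokens whose true sense equals the word's most frequent sense."""
    correct = sum(1 for word, sense in tagged if model.get(word) == sense)
    return correct / len(tagged)

model = most_frequent_senses(training)
accuracy = baseline_accuracy(model, held_out)
print(model)
print(accuracy)  # 0.5 on this toy data
```

The baseline ignores context entirely, so it is always wrong whenever a word is used in one of its less common senses.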
When we give a computer more information, it does a better job. But understanding the wrong meaning for a third of the words is still not good enough. So far we have given the computer information about the words’ possible meanings, about the words’ syntactic role, and about the words’ most frequent usages. What more could we give it?
I have already said that people use context to determine the appropriate meanings of individual words, but so far we have not given the computer any information about context. Context can be linguistic–the other words that occur before and after a polysemantic word–or it can be situational–the situation in which the linguistic interaction is occurring. The linguistic context is the easiest to deal with, so let’s start with that.
One way to explore linguistic context is to collect a large sample of excerpts that contain a particular target word and to classify those excerpts manually according to which meaning of the word was intended. This manually disambiguated collection of contexts can then serve as training material for a computer.
My colleagues Claudia Leacock and Martin Chodorow and I programmed a computer to look for certain features of the context, then exposed it to a large sample of manually disambiguated contexts, and finally tested how well the computer could distinguish among a new set of manually disambiguated contexts. One program looked to see what nouns, verbs, adjectives, and adverbs occurred within plus-or-minus 50 words of the target word; we called that topical context. Another program looked at the exact order of words plus-or-minus two words on either side of the target word; we called that local context. And finally, we combined the output of the two programs in the hope that what one program missed, the other might catch.
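The two kinds of context can be illustrated with a rough sketch. This is not the original programs' code: the closed-class word list stands in for a real part-of-speech tagger, and the function names, padding token, and example sentence are assumptions made for illustration.

```python
# Assumed stand-in for part-of-speech filtering: a tiny closed-class word list.
CLOSED_CLASS = {
    "the", "a", "an", "of", "to", "in", "on", "and", "but",
    "i", "he", "she", "it", "for", "was", "is",
}

def topical_features(tokens, target_index, window=50):
    """Unordered set of open-class words within +/- window of the target."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    return {
        tok.lower()
        for i, tok in enumerate(tokens[lo:hi], start=lo)
        if i != target_index and tok.lower() not in CLOSED_CLASS
    }

def local_features(tokens, target_index, window=2):
    """Words within +/- window of the target, each tagged with its offset."""
    feats = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        word = tokens[i].lower() if 0 <= i < len(tokens) else "<pad>"
        feats.append((offset, word))
    return feats

tokens = "She waited in the line for the bus".split()
target = tokens.index("line")
print(topical_features(tokens, target))  # set containing 'waited' and 'bus'
print(local_features(tokens, target))    # [(-2, 'in'), (-1, 'the'), (1, 'for'), (2, 'the')]
```

Topical context discards word order and the little words; local context keeps both, which is why it can pick up patterns such as "in the line for" that a bag of words cannot.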
The results for three different target words are shown in the following figures, where the percent correct is plotted as a function of the number of training contexts provided. In all cases, the performance improved as the number of training contexts increased.
First (Figure 2), the program was trained to distinguish four different meanings of the verb “serve.” As you can see, topical context was not very useful for this verb; the best results were obtained with local context. Combining them was only a little better than local context alone. Second (Figure 3), the program was trained to distinguish three different meanings of the adjective “hard.” As in the case of the verb, local context was much more useful than topical context, and combining them was no help.
Finally, the program was trained to distinguish six different meanings of the noun “line.” For this noun, the topical context was more useful than the local, and there was some advantage to combining them (Figure 4).
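The train-then-test procedure can be sketched with a small classifier over topical context. The learning method is not named in this article, so the naive Bayes model with add-one smoothing below is an assumption chosen for brevity, and the four labelled contexts for "line" are invented examples rather than real training data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labelled contexts for "line" (sense label, context words).
train = [
    ("queue", "people waited in line at the bank".split()),
    ("queue", "a long line of customers waited outside".split()),
    ("text", "she read the first line of the poem".split()),
    ("text", "each line of the poem rhymes".split()),
]

def train_naive_bayes(examples):
    """Count how often each sense occurs and which words accompany it."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def classify(model, words):
    """Return the sense with the highest smoothed log-probability."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense in sorted(sense_counts):
        n = sum(word_counts[sense].values())
        score = math.log(sense_counts[sense] / total)
        for w in words:
            if w in vocab:  # words never seen in training carry no evidence
                score += math.log((word_counts[sense][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

model = train_naive_bayes(train)
print(classify(model, "customers waited in a line".split()))  # queue
print(classify(model, "the last line of the poem".split()))   # text
```

With only four training contexts the classifier is fragile, which mirrors the finding above that performance improved as the number of training contexts increased.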
It is possible, of course–indeed, I think it likely–that we did not choose the correct properties of the context to train on, but in an international competition between programs that try to do this kind of thing [see Senseval-1 at http://www.itri.brighton.ac.uk/events/senseval/], ours was as good as any other. And we are only 85 percent correct at best, and we know how to do that well for only a few of the thousands of polysemantic English words. It’s still not nearly good enough. If you misunderstood the meaning of every seventh important word, you would not find language very useful.
The reality is that no large-scale, operational computer system exists today for determining the intended meanings of words in discourse. But solving the polysemy problem is so important that we can be confident that efforts will continue and that future systems will continue to improve. If you were to ask me what more could be done, I would suggest that we still have a lot to learn about contexts in general and linguistic contexts in particular. If I were feeling reckless, I might even suggest that understanding contexts better is critical for the future of processing linguistic messages by computer.
Any Internet user knows that information technology can now provide large amounts of raw information at the touch of a button. Unfortunately, most of it is irrelevant, and searching through it to find what we really want requires great patience and peace of mind. My reckless claim would be that in addition to information technology, we need context technology. The future belongs to those who discover how to help users better understand the information that is provided. And the only way I know to do that is to provide contexts that make the information meaningful.
How People Resolve Ambiguity
Enough about computers. Since people recognize intended meanings so easily, maybe computational linguists are missing something. So, what do we know about how people deal with ambiguous words?
Psychologists have learned a little about how people do it. We know, for example, that when a polysemantic word occurs, more than one meaning can be activated initially, but the context-appropriate meaning can be chosen very rapidly, within half a second. We assume that during that half second or so a meaning is chosen that can be integrated into a mental representation of the on-going discourse.
The nature of that representation of the on-going discourse is still uncertain, but it seems to involve more than just the linguistic context. It involves situational context and general knowledge.
Some psychologists believe that the representation of discourse must be “propositional,” with many propositions being filled in inferentially from general knowledge, and all of the propositions related by first order logic. A propositional representation of discourse would, of course, be easiest for a computer to simulate.
However, other psychologists maintain that the representation is “imaginal,” a mental picture that provides many default values from general knowledge but in which many irrelevant details are missing. Probably both propositions and mental images are involved. In any case, we don’t understand the mental representation of discourse well enough to replicate it with computers that are available today.
What we know is that the mental representations that people need in order to understand discourse must be both coherent and plausible.
First, a mental representation must be coherent. That is to say, if you scramble the order of sentences, or take sentences randomly from different sources, the result is not going to be organized around a unifying topic. It will not be coherent. The demand for coherence places many linguistic constraints on discourse. For example, new objects must be introduced with the indefinite article and thereafter referred to with the definite article; pronouns must have some antecedent to refer to; tense, locale and voice must agree, and so on.
And the mental representation must be plausible. If someone says, “Bill won the race from Sam because he had a good coach,” it is not plausible to conclude that “he” and “Sam” are coreferential. If told not to play with those boys because they are too rough, only a child would go looking for a smooth one. And if you see a sign in a farmer’s field saying “The bull may charge,” it is not plausible to think that the bull might charge admission. The demand for plausibility implies that the discourse must conform to general knowledge. That is a strong demand, of course, for general knowledge is boundless. It has been said that there is no fact so small or obscure that it would not disambiguate some polysemantic word.
So we can argue with some confidence that when people encounter a polysemantic word, they quickly select a meaning that can be integrated into a coherent and plausible mental representation of the discourse in which the polysemantic word occurs. The word is truly ambiguous only if two or more meanings satisfy that criterion.
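The selection criterion just stated can be put as a toy sketch: keep only the meanings that fit the discourse, and call the word ambiguous only if two or more survive. The sense inventory for "charge" and the single plausibility test (what kind of agent is acting) are invented simplifications; real plausibility draws on unbounded general knowledge.

```python
# Hypothetical senses of "charge", each listing the kinds of agent
# that can plausibly perform it.
SENSES = {
    "rush at": {"animal", "person"},
    "demand payment": {"person", "business"},
}

def plausible_senses(senses, agent_kind):
    """Senses that fit a discourse in which this kind of agent is acting."""
    return [name for name, agents in senses.items() if agent_kind in agents]

def interpret(senses, agent_kind):
    """Return the surviving senses and whether the word is truly ambiguous."""
    candidates = plausible_senses(senses, agent_kind)
    return candidates, len(candidates) >= 2

print(interpret(SENSES, "animal"))  # (['rush at'], False): "The bull may charge"
print(interpret(SENSES, "person"))  # (['rush at', 'demand payment'], True)
```

For the bull, only one reading survives and the polysemy goes unnoticed; swap in a human agent and both readings remain, so the word is genuinely ambiguous until more context arrives.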
Unfortunately, there is no assurance that the mental representation of the speaker and the mental representation of the listener will coincide. When people misunderstand one another, it is usually because they are working with different mental representations of what is being said, not because they misinterpret polysemantic words. They disagree because they have different ideas about why the speaker uttered the words she did. They disagree about the speaker’s pragmatic intentions. When it comes to estimating a speaker’s pragmatic intentions, we really do approach something resembling unlimited polysemy, and there is no dictionary of sentence meanings to guide us. But the pragmatics of discourse is a much larger topic than we can pursue here. Suffice it to say that there are many issues that affect the speaker’s pragmatic intentions, among them context, personal history, culture and the dynamic of the interaction. It’s a vibrant area of research right now.
The first step is a plausible theory of linguistic context. Knowing the possible meanings of words and the grammatical structure of sentences is necessary, of course, but until we understand how people use context to construct a coherent and plausible mental representation of discourse, we will have no theory of language understanding, neither a computational theory nor a psychological theory.
One thing we can say with some assurance, however, is that people are extremely good at using context to resolve potential ambiguities. I believe that this skill in contextualizing is a general cognitive ability, not specific to language, and that it is involved in many higher cognitive processes. And I also believe that the best way to investigate our remarkable human ability to contextualize is to study ambiguous words.