THE AGE OF INTELLIGENT MACHINES | Chapter 7: The Moving Frontier
September 24, 2001
- Ray Kurzweil
The digitization of information in all its forms will probably be known as the most fascinating development of the twentieth century.
An Wang, founder of Wang Laboratories
Most probably, we think, the human brain is, in the main, composed of large numbers of relatively small distributed systems, arranged by embryology into a complex society that is controlled in part (but only in part) by serial, symbolic systems that are added later. But the subsymbolic systems that do most of the work from underneath must, by their very character, block all the other parts of the brain from knowing much about how they work. And this, itself, could help explain how people do so many things yet have such incomplete ideas of how those things are actually done.
Marvin Minsky and Seymour Papert, the 1988 epilogue to Perceptrons
Two types of thinking
Photo by Lou Jones www.fotojones.com
Try not to think about elephants. For the next sixty seconds, do not let the image of these huge mammals with their large ears and swaying trunks enter your mind. Now look across the room and focus your vision on an object. Without closing your eyes or turning them away, try not to determine what the object is. Finally, consider the Tower of Hanoi problem described in the last chapter. For the next sixty seconds, do not solve this problem.
You are undoubtedly having difficulty avoiding the mental image of an elephant. Assuming that the object that you selected to look at is not unknown to you, you were probably unsuccessful as well in not determining its identity. On the other hand, unless you have an unusual passion for mathematics, you are probably experiencing little difficulty in not solving the Tower of Hanoi problem.
Two types of thought processes coexist in our brains, and the above exercises illustrate one of the profound differences between them. Perhaps most often cited as a uniquely human form of intelligence is the logical process involved in solving problems and playing games. A more ubiquitous form of intelligence that we share with most of the earth’s higher animal species is the ability to recognize patterns from our visual, auditory, and tactile senses. We appear to have substantial control over the sequential steps required for logical thought. In contrast, pattern recognition, while very complex and involving several levels of abstraction, seems to happen without our conscious direction.1 It is often said that a master chess player can “see” his or her next move without going through all of the conscious sequences of thinking required of less experienced players. It may be that after being exposed to tens of thousands of board situations, the master player is able to replace at least some of the logical processes usually used to play games with pattern-recognition methods.2
There are several key differences between these two forms of intelligence, including the level of success the AI field has had in emulating them.3 Ironically, we find it easier to create an artificial mathematician or master chess player than to emulate the abilities of animals. While there are many animal capabilities that our machines have not yet mastered, including the intricacies of fine motor coordination, the most difficult barrier has been the subtleties of vision, our most powerful sense and a prime example of pattern recognition.
One attribute that the two types of thinking have in common is the use of imagination. The first example cited above, imagining an elephant, is a direct exercise in imagination. The second example, identifying an object, also involves imagination, particularly in the latter stages of the process. If part of the object we are trying to identify is blocked or if its orientation toward us prevents the most useful view, we use our imagination to visualize in our minds what the full object might look like and then determine if the imagined object matches what we can see. In fact, we almost always use our imagination to visualize an object, because invariably there are sides we cannot see.4 The technical term for this technique is “hypothesis and test”; we use our imagination to hypothesize the answer and then test its validity.5 Hypothesis and test is also used in logical thought. We often imagine an answer to a logical problem based on methods of intuition that are only partially understood and then work backward to the original problem statement.
If we examine the nature of our visual imagination, we can gain some insight into the most important way in which pattern recognition differs from logical rule-based thinking. Consider again your imagination of an elephant. Your mental picture probably does not include a great deal of detail: it is essentially a line drawing, probably one that is moving (I’ll bet the trunk is swaying back and forth). The phenomenon of the line drawing (the fact that we recognize a line drawing of an object, such as a face, as representing that object though the drawing is markedly simpler than the original image) provides us with an important clue to the nature of the transformations performed during the process of human vision. Very important to the recognition of visual objects is the identification of edges, which we model in two dimensions as lines. If we explore what is required to extract edges from an image, we shall gain an appreciation of one major way in which visual perception differs from logic.
There are many aspects of visual perception that we do not yet understand, but some understanding of the identification of edge location and orientation has been achieved.6 A particular set of computations has been discovered that is capable of detecting edges with reasonable accuracy. There is some evidence that similar techniques are used in visual processing by mammals. The technique is based on two observations. The first observation is that we need to smooth the data; changes involving tiny regions can probably be considered to be non-information-bearing visual noise. Thus, small defects in edges can be ignored, at least initially, in locating all of the edges in an image. Second, we note that changes in the visual information (across any spatial dimension) are more important than the information itself. In other words, we are primarily interested in sudden and consistent alterations in color or shading from one region to another.
I shall now describe a method for inferring edges from visual images.7 The following two paragraphs are somewhat technical. Yet it is not necessary to understand all of these details to appreciate some of the implications of the method. The image itself is represented by a two-dimensional array of pixels, or points of information. In a black and white image, each pixel can be represented by a single number representing a shade of gray. In a color image, several numbers (usually three) are required to represent the color and shade. We can take this initial raw image and modify it to take advantage of the two observations cited above. The modification is achieved by applying what is called a filter, in which each pixel has an influence on its surrounding pixels. For example, a Gaussian filter designates certain pixels as propagating pixels; it then increases the intensity of each pixel in the vicinity of each propagating pixel on the basis of the intensity of the propagating pixel and the distance to the neighboring pixel. The function of intensity to distance is based on the familiar Gaussian (normal) curve, with the peak of the curve representing zero distance (that is, the propagating pixel itself). A Gaussian filter is applied to an image by making every pixel a propagating pixel; thus, all pixels bleed into their surrounding pixels. This has the impact of smoothing the image, with the sharpness of the resulting image being a function of the width of the Gaussian curve. A different filter, the Laplacian, can then be applied to detect changes. This filter replaces the value of every pixel with the rate of change of the rate of change (that is, the second derivative) of the pixel values.
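The two filtering steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in any particular vision system: the function names, the tiny 8 x 8 test image, the kernel radius, and the clamping of coordinates at the image border are all choices made for this example, and the Laplacian is approximated by the common five-point discrete form.

```python
import math

def gaussian_kernel(radius, sigma):
    """Return a normalized (2*radius+1)-square Gaussian kernel as nested lists."""
    size = 2 * radius + 1
    kernel = [[math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

def convolve(image, kernel):
    """Apply a square kernel at every pixel, clamping coordinates at the borders."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(-r, r + 1):
                for kx in range(-r, r + 1):
                    yy = min(max(y + ky, 0), h - 1)
                    xx = min(max(x + kx, 0), w - 1)
                    acc += image[yy][xx] * kernel[ky + r][kx + r]
            out[y][x] = acc
    return out

# Five-point discrete approximation of the Laplacian (the second derivative).
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]

# Hypothetical test image: dark left half, bright right half (one vertical edge).
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]

smoothed = convolve(image, gaussian_kernel(radius=2, sigma=1.0))
second_derivative = convolve(smoothed, LAPLACIAN)
```

In this small example the smoothed image rises gradually across the edge, and the second derivative is positive just on the dark side of the edge and negative just on the bright side. A wider Gaussian (a larger `sigma`) would blur the edge further and suppress finer detail.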
These two processes, smoothing and determining rates of rates of change, can be combined into a single filter in which every pixel influences all of the pixels within its vicinity. This filter, with the appropriate, if forbidding, name of “Laplacian of a Gaussian convolver,” has a graph with the shape of an upside-down Mexican hat, so it is often called a sombrero filter. As the figure shows, each pixel has a positive influence on the pixels in its immediate vicinity and a negative influence on pixels in a band surrounding the immediate vicinity. Once the sombrero filter has been applied, edges can be inferred by looking for zero crossings, places where values change from negative to positive.8
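A minimal sketch of the combined filter and the zero-crossing test follows. It makes several assumptions that are not in the text: the kernel is written with a positive center and negative surround, as described above; it is shifted so that its weights sum to zero, which keeps uniform regions from producing a response; sign changes are checked only between horizontal neighbors; and values very close to zero are treated as zero. All names and the 12 x 12 test image are hypothetical.

```python
import math

def sombrero_kernel(radius, sigma):
    """Center-surround kernel: positive at the center, negative in the surrounding band."""
    size = 2 * radius + 1
    kernel = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            q = (x * x + y * y) / (2 * sigma ** 2)
            row.append((1 - q) * math.exp(-q))
        kernel.append(row)
    # Shift the weights so they sum to zero: a uniform region then yields no response.
    mean = sum(sum(row) for row in kernel) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def convolve(image, kernel):
    """Apply a square kernel at every pixel, clamping coordinates at the borders."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ky in range(-r, r + 1):
                for kx in range(-r, r + 1):
                    yy = min(max(y + ky, 0), h - 1)
                    xx = min(max(x + kx, 0), w - 1)
                    out[y][x] += image[yy][xx] * kernel[ky + r][kx + r]
    return out

def sign(value, eps=1e-9):
    """Return -1, 0, or 1, treating values within eps of zero as zero."""
    return 0 if abs(value) < eps else (1 if value > 0 else -1)

def zero_crossings(response):
    """Return (row, column) pairs where the sign flips between horizontal neighbors."""
    return [(y, x)
            for y, row in enumerate(response)
            for x in range(len(row) - 1)
            if sign(row[x]) * sign(row[x + 1]) < 0]

# Hypothetical test image: dark columns 0-5, bright columns 6-11.
image = [[0.0] * 6 + [1.0] * 6 for _ in range(12)]
edges = zero_crossings(convolve(image, sombrero_kernel(radius=3, sigma=1.0)))
```

Here every row reports a single crossing between columns 5 and 6, which is where the dark and bright halves meet. A practical system would also check vertical neighbors and would repeat the analysis with sombreros of several sizes.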
Let us consider some of the implications of this process. First, the technique is not particularly complicated. Second, experiments have shown that it is reasonably successful. In general, edges are correctly inferred. False hypotheses are generated, but these can be eliminated by later processing that incorporates knowledge about the types of objects we expect to see in the environment and the nature of their edges.9 Third, there is evidence that the hardware exists in mammalian brains to perform this type of transformation. For example, David H. Hubel and Torsten N. Wiesel of Harvard Medical School have discovered specialized edge detector cells in the outer (early) layers of the visual cortex of the human brain.10
Most important is a conclusion we can draw regarding the amount of computation required to perform edge detection. While it has not been proved that this precise filter, the Laplacian of a Gaussian convolver, is used in mammal vision, it can be shown that any algorithm that could possibly perform edge detection with the facility of human (and apparently most mammal) vision must use a center-surround filter (a filter in which each pixel influences all pixels within a certain distance) that requires a comparable amount of computation. This amount turns out to be vast and is determined by a six-dimensional computation. First, the filter must be applied for every pixel, and the pixels are organized in a two-dimensional array. For each pixel we must apply the filter to all pixels in a two-dimensional array surrounding that pixel, which gives us a four-dimensional computation. We noted earlier that the sharpness of our edge analysis was a function of the size of the Gaussian (normal) curve applied. In the combined sombrero filter, the size of the Mexican hat has the same impact. A large sombrero will enable us to detect the edges of large objects; a small sombrero will detect smaller features. We thus need to perform this entire computation several times, which is a fifth dimension. The sixth dimension is time; since vision must be capable of dealing with moving images, this entire computation must be repeated many times each second. Undoubtedly, some optimizations can be applied. For example, if we note that portions of the image are not changing, it is not necessary to repeat all of the computations. Nonetheless, the number of computations required is essentially determined by this six-dimensional array.11
Let us plug in some numbers to get a feeling for the orders of magnitude involved. Human vision is estimated to have a resolution of 10,000 positions along each of the two axes of vision, or about 100 million pixels (there are indeed about 100 million rod cells in each eye to detect shape and motion and 6 million cone cells to detect color and fine detail).12 The diameters of typical sombrero filters used in computer-vision experiments range from 10 to 30 pixels, but these experiments are based on images of only 1,000 pixels on a side. A reasonable average size for a human sombrero filter would be about 100 by 100 pixels. If we assume about 3 different sombreros for objects of different sizes and recomputation of the image 30 times per second, we have the following number of multiplications per second: 10,000 x 10,000 x 100 x 100 x 3 x 30, or about 100 trillion. Now, a typical personal computer can perform about 100,000 multiplications per second. Thus, we would need about a billion personal computers to match the edge detection capability of human vision, and that’s just for one eye!13
Typical computer vision systems have somewhat less demanding specifications. Typically image resolution is about 1,000 by 1,000 pixels, which requires smaller filters of about 25 by 25 pixels. With three filters of different sizes and a refresh rate of 30 images per second, we have 1,000 x 1,000 x 25 x 25 x 3 x 30, or only 60 billion multiplications per second, which could be handled in real time by a mere 600,000 personal computers.
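The arithmetic behind both estimates can be checked with a short calculation. The function and constant names below are hypothetical, chosen only for this sketch; the inputs are the figures given in the text.

```python
def multiplications_per_second(pixels_per_side, filter_side, filter_sizes, frames_per_second):
    """One multiplication per filter cell, per pixel, per filter size, per frame."""
    return pixels_per_side ** 2 * filter_side ** 2 * filter_sizes * frames_per_second

# The text's figure for a typical personal computer.
PC_MULTIPLICATIONS_PER_SECOND = 100_000

human_eye = multiplications_per_second(10_000, 100, 3, 30)
vision_system = multiplications_per_second(1_000, 25, 3, 30)

for label, total in (("human eye", human_eye), ("vision system", vision_system)):
    computers = total // PC_MULTIPLICATIONS_PER_SECOND
    print(f"{label}: {total:,} multiplications/s, about {computers:,} personal computers")
```

The exact products are 90 trillion and 56.25 billion multiplications per second, or 900 million and 562,500 personal computers, which the text rounds to 100 trillion, 60 billion, a billion, and 600,000 respectively.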
This brings us back to the issue of digital versus analog computation. As mentioned earlier, the need for massive parallel processing (doing many computations at the same time) may reverse, at least partially, the trend away from analog computing. While it is possible to achieve billions of digital computations per second in our more powerful supercomputers, these systems are large and expensive. The computations described above for the sombrero filter do not need high degrees of accuracy or repeatability, so analog multiplications would be satisfactory. Multiplying 60 billion analog numbers per second (600,000 computing elements each performing 100,000 multiplications per second) could be achieved using VLSI circuits in a relatively compact system. Even the 100 trillion multiplications per second required for human vision, though out of the question using digital circuits, is not altogether impractical using analog techniques. After all, the human brain accomplishes image-filtering tasks using just this combination of methods: massive parallel processing and analog computation.14
The human visual system picks up an image with 100 million specialized (rod and cone) cells. Multiple layers, each of a comparable number of cells, would have the capability to perform transformations similar to the sombrero filter described above. In fact, the visual cortex of the brain contains hundreds of layers, so these filtering steps are but the first transformations in the long (but quick) journey of processing that a visual image undergoes.15
The images from both eyes need to be processed, and then the two images need to be fused into one through a technique called stereopsis. As a result of having two eyes, we can detect depth; that is, we can determine the relative distance of different objects we see.16 Because our eyes are a few inches apart, the same object will be slightly shifted in the images they receive. The amount of shift is determined by simple trigonometric relationships. Distant objects will have little shift, whereas close objects will have larger shifts. However, before our visual system can apply trigonometry to the problem of determining depth it needs to line up the corresponding objects in the two visual fields. This is more difficult than it sounds. Experiments indicate that matching the image of each object in the visual field of one eye to the image of that object in the visual field of the other must take place after the detection of edges.17 Once edge detection has taken place, the edges can be matched using additional pattern-recognition techniques.18
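The trigonometric relationship mentioned above can be illustrated with a simplified model. The sketch assumes two idealized pinhole viewers with parallel lines of sight, in which case similar triangles give a distance equal to the eye separation times the focal length divided by the shift. The function name, the 6.5 cm separation, and the focal length of 1,000 pixel widths are assumptions made for the example, not measurements of the human eye.

```python
def depth_from_shift(eye_separation, focal_length, shift):
    """Distance to a point whose image is displaced by `shift` between the two views.

    Assumes parallel pinhole viewers; `focal_length` and `shift` share one unit
    (here, pixel widths), and the result is in the unit of `eye_separation`.
    """
    if shift <= 0:
        raise ValueError("shift must be positive; a zero shift means an infinitely distant point")
    return eye_separation * focal_length / shift

EYE_SEPARATION_M = 0.065   # assumed: "a few inches", in meters
FOCAL_LENGTH_PX = 1000.0   # assumed, hypothetical value

near = depth_from_shift(EYE_SEPARATION_M, FOCAL_LENGTH_PX, shift=65.0)
far = depth_from_shift(EYE_SEPARATION_M, FOCAL_LENGTH_PX, shift=6.5)
print(f"shift of 65 pixels: about {near:.1f} m; shift of 6.5 pixels: about {far:.1f} m")
```

Under these assumptions a tenfold smaller shift corresponds to a tenfold greater distance, consistent with the observation that distant objects shift little and close objects shift more. The calculation itself is trivial; the hard part, as noted, is deciding which feature in one image corresponds to which feature in the other.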
Once the edges are detected and the dual images fused with corresponding information regarding depth, it becomes possible for more subtle processes of discrimination to begin. Edges and depths can be organized into surfaces, the texture of the surfaces can be estimated, and finally the objects themselves identified.19 In this process a great deal of knowledge about the types of objects we expect to see in our environment is used. The paradigm of hypothesis and test is clearly used here in that people typically see what they expect to see in a situation. Visual experiments have shown that people often misrecognize objects that are not expected if they appear to be similar to those that are anticipated. This indicates that the testing of the hypotheses has given a positive result. If an unusual object does not match our hypothesis (i.e., the test fails), then that object is likely to grab our focus of attention.
We have now described a fundamental way in which pattern recognition in general, and vision in particular, differs from the logical processes of thought. The essence of logic is sequential, whereas vision is parallel. I am not suggesting that the human brain does not incorporate any parallel processing in its logical analyses, but logical thinking generally involves considering only one transformation and its implications at a time. When speaking of parallelism in human vision (and in any attempt to truly emulate vision in a machine), we are speaking not of a few computations at the same time but rather of billions simultaneously. The steps after edge detection also involve vast amounts of computation, most of which is also accomplished through massive parallelism.20 Only in the final stages of the process do we begin to reason about what we have seen and thereby to introduce more sequential logical transformations. Though vision involves vastly greater amounts of computation than logical processes, it is accomplished much more quickly because the number of processing stages is relatively small. The trillions of computations required for the human visual system to view and recognize a scene can take place in a split second.
This explains the relatively automatic (not consciously controlled) nature of vision: these tremendously parallel circuits are constantly processing information and piping their results to the next stage. It is not a process we can turn off unless we close our eyes. Even then we have trouble preventing our imagination from presenting images for analysis.
Logical thought appears to be a more recent evolutionary development than pattern recognition, one that requires more conscious control over each sequential step.21 The amount of computation required is not as vast, and less massive parallelism appears to be involved. This is one reason that we have been more successful in emulating these more “advanced” logical processes in our “intelligent” machines. Despite the relatively slow speed of neuronal circuits, the massive parallelism of the human brain makes it capable of vastly more computation than today’s computers. Thus, the relative lack of computational capability of computers to date (less parallel processing) has rendered them inadequate for a level of visual processing comparable to human vision. On the less computationally intensive (and more sequential) tasks of solving problems and playing games, even the very early computers were sufficient to perform at credible levels. Conversely, the brain’s capacity for massive parallel processing is at least one of the keys to the apparent superiority of human over computer thought in areas such as vision.22
The realization of this superiority has focused attention on breaking the von Neumann bottleneck of conventional, single-processor computers. W. Daniel Hillis’s Connection Machine, for example, is capable of 65,536 computations at the same time, and machines with a millionfold parallelism are on the way.23 Billions of simultaneous processes, particularly if analog methods are combined with digital, are not out of the question.24
The realization that certain critical mental processes are inherently massively parallel rather than sequential has also refocused attention on the neural net as an approach to building intelligent machines.25 The 1960s concept of a neural net machine incorporated very simple neuron models and a relatively small number of neurons (hundreds or thousands) organized in one or two layers. They were provided with no specific task-oriented algorithms and were expected to organize themselves by rearranging the interneuronal connections on the basis of feedback from the human trainer. These systems were capable of recognizing simple shapes, but Minsky and Papert showed, in their classic Perceptrons, that the machines were essentially just matching individual pixel values against stored templates. These early neural nets were simply not capable of more sophisticated discriminations.26 As noted earlier, the 1980s school of neural nets uses potentially more capable neuron models that can incorporate their own algorithms.27 Designers are targeting systems with millions of such artificial neurons organized into many layers. Though the self-organizing paradigm is still popular, its role can be limited. Predetermined algorithms can be built into both the neuron models themselves and the organization of each layer. For example, a layer designed to detect edges should be organized differently from a layer designed to integrate edges into surfaces. Of course, this is still a far cry from the human visual system, with its billions of neurons organized into hundreds of layers. We still have very limited understanding of the algorithms incorporated in most of the layers or even what their functions are. Greater insight into these issues will be required before neural nets can solve real problems. Minsky and Papert remain critical of the excessive reliance of the new connectionists on the self-organizing paradigm of neural nets. 
In the prologue to a new edition of Perceptrons (1988) they state, “Our position remains what it was when we wrote the book: We believe this realm of work to be immensely important and rich, but we expect its growth to require a degree of critical analysis that its more romantic advocates have always been reluctant to pursue - perhaps because the spirit of connectionism seems itself to go somewhat against the grain of analytic rigor.”28
Another difference between logical and imaginal thinking is the issue of gradual versus catastrophic degradation.29 In animal vision the failure of any neuron to perform its task correctly is irrelevant. Even substantial portions of the visual cortex could be defective with relatively little impact on the quality of the end result. Leaving aside physical damage to the eyes themselves, the ability of the human brain to process visual images typically degrades the same way that a holographic (three-dimensional, laser-generated) picture degrades. Failure of individual elements subtracts only marginally from the overall result. Logical processes are quite different. Failure of any step in a chain of logical deductions and inferences dooms the rest of the thought process. Most mistakes are catastrophic (in that they lead to an invalid result). We have some ability to detect problems in later stages, realize that earlier assumptions must have been faulty, and then attempt to correct them, but our ability to do this is limited.
The difference between parallel thinking and sequential thinking is significant in skill acquisition. When we first learn to perform a pattern-recognition task (learning a new type of alphabet, for example, or, on a higher level, a new language), we use our rational facilities to reason through the decision-making tasks required. This tends to be slow, deliberate, and conscious.30 As we “master” the new task, our parallel facilities take over and we no longer need to consciously think through each step. It seems just to happen automatically. We have programmed our parallel pattern-recognition systems to take over the job. The process of recognition becomes substantially faster, and we are no longer conscious of the steps in the process. Visual-perception experiments have indicated that when we read, we do not perform recognition on individual characters and then group the characters into words but rather recognize entire words and even groups of words in parallel. If we had to reason through each discrimination (e.g., “Now there’s a semicircle with a straight line to the left of it, so that must be a p”), our reading speeds would be extremely slow. Indeed, a child’s reading speed is very slow until the child has succeeded in programming his parallel pattern-recognition facilities to recognize first individual letters, then words, finally, after years, groups of words.
Photo by Lou Jones www.fotojones.com
The promise of parallel processing. Hidehiko Tanaka of the University of Tokyo designs super fast computers that incorporate thousands of tiny processors working in unison on a single problem.
Photo by Lou Jones www.fotojones.com
Yuichiro Anzai, one of the leading AI authorities in Japan.
There is a similar phenomenon on the output side of human intelligence. When we learn to perform a certain task that involves the coordination of our muscles (learning a sport or even speaking a new language), we start out very deliberately and conscious of each step in the process. After we “master” the new skill, we are conscious only of the higher-level tasks, not of the individual steps. We have gone from sequential to parallel thinking.
One of the objections that philosophers such as Hubert Dreyfus have raised against AI is that computers appear to lack the ability for this type of parallel thought (the objection is generally expressed in the much vaguer terms that computers lack intuition).31 It is true that the purely logical processes of most expert systems do not have the capacity for achieving this vital category of massively parallel thought. It is not valid, however, to conclude that machines are inherently incapable of using this approach.
One might point out that even massively parallel machines ultimately use logic in their transformations. Logic alone, however, is not the appropriate level of analysis to understand such systems. It is similar to trying to understand meteorology using the laws of physics.32 Obviously, cloud particles do follow the laws of physics, but it is hopeless to attempt to predict the weather by means of the physics of particle interactions alone (not that we are very successful at weather forecasting even with “appropriate” methods). As an example of the weakness of rule-based methodologies in mastering certain intelligent tasks, consider the problem of describing how to recognize faces using logic alone. Face recognition is a process we are very good at despite our having little awareness of how the process actually works. No one has been able to program a computer to perform this task, in part because no one can begin to describe how we perform this feat. In general, we find it far easier to reconstruct our mental processes for sequential thinking than for parallel thinking because we are consciously aware of each step in the process.
Building a brain
We can draw conclusions from the above discussion regarding some of the capabilities required to simulate the human brain (i.e., to emulate its functionality). Clearly, we need a capacity for hundreds of levels of massively parallel computations (with the parallelism of each stage potentially in the billions). These levels cannot be fully self-organizing, although the algorithms will in some cases allow for “growing” new interneuronal connections. Each level will embody an algorithm, although the algorithms must permit learning. The algorithms are implemented in two ways: the transformations performed by the neurons themselves and the architecture of how the neurons are connected. The multiple layers of parallel neuronal analysis permit information to be encoded on multiple levels of abstraction. For example, in vision, images are first analyzed in terms of edges; edges form surfaces; surfaces form objects; objects form scenes.33
Another example is human written language. Lines and curves form letters, which form words, which form phrases, which form sentences, and so on. In spoken language, we have sounds forming phonemes, which form words, and so on. Knowledge regarding the constraints of each level of abstraction is used in the appropriate layer. The knowledge itself is not built in (though algorithms for manipulating it may be) and methods need to be provided to acquire, represent, access, and utilize the domain-specific knowledge.
Each level of analysis reduces information. In vision, for example, we start with the signals received from the hundred million rod and cone cells in each eye. This is equivalent to tens of billions of bits of information per second. Intermediate representations in terms of surfaces and surface qualities can be represented with far less information. The knowledge we finally extract from this analysis is a reduction of the original massive stream of data by many orders of magnitude. Here too we see the selective (i.e., intelligent) destruction of information discussed earlier as the purpose of computation.34
The human brain has a certain degree of plasticity in that different areas of the brain can often be used to represent the same type of knowledge. This property enables stroke victims to relearn lost skills by training other portions of the brain that were not damaged. The process of learning (or relearning) requires our sequential conscious processes to repetitively expose the appropriate parallel unconscious mechanisms to the knowledge and constraints of a pattern-recognition or physical-skill task. There are substantial limits to this plasticity, however. The visual cortex, for example, is specifically designed for vision and cannot be used for most other tasks (although it is involved in visual imagination, which does impact many other areas of thought).
We can also draw a conclusion regarding the type of physical construction required to achieve human-level performance. The human brain achieves massive parallelism in all stages of its processing measured in the tens or hundreds of billions of simultaneous computations in a package substantially under one cubic foot, about the size of a typical personal computer. It is capable of this immense level of performance because it is organized in three dimensions, whereas our electronic circuits are currently organized in only two. Our integrated-circuit chips, for example, are essentially flat. With the number of components on each side of a chip measured in the thousands, we are limited to a few million components per chip. If, on the other hand, we could build three-dimensional chips (that is, with a thousand or so layers of circuitry on each chip instead of just one), we would add three orders of magnitude to their complexity: we would have chips with billions rather than mere millions of components. This appears to be necessary to achieve hardware capable of human performance. Evolution certainly found it necessary to use the third dimension when designing animal brains.35 Interestingly, one way that the design of the human brain uses the third dimension is by elaborately folding the surface of the cerebral cortex to achieve a very large surface area.
A primary reason that the third dimension is not utilized is thermal problems. Transistors generate heat, and multiple layers would cause chip circuitry to melt. However, a solution may be on the horizon in the form of superconductivity: because of their lack of electrical resistance, superconducting circuits generate virtually no heat. This may enable circuit designers to further reduce the size of each transistor as well as to exploit the unexplored third dimension for a potential millionfold improvement in performance.36
David Marr and Tomaso Poggio pointed out another salient difference between human brains and today’s computers in their first paper on stereo vision in 1976.37 While the ratio of connections to components in a conventional computer is about 3, this ratio for the mammalian cortex can be as high as 10,000. In a computer virtually every component and connection is vital. Although there are special fail-safe computers that provide a small measure of redundancy, most computers depend on a very high degree of reliability in all of their components. The design of mammalian brains appears to use a radically different methodology in which none of the components or connections are crucial; massive redundancy allows major portions of the process to fail with little or no effect on the final results.38
In summary, there are two fundamentally different forms of thinking: logical thinking and parallel thinking. Logical thinking is sequential and conscious. It involves deliberate control over each step. It tends to be slow, and errors in early stages propagate throughout the rest of the process, often with catastrophic results. The amount of computation required tends to be limited. Thus, computers lacking in parallel-processing capabilities (nearly all computers to date) have been relatively successful in emulating some forms of logical thought. Most AI through the mid-1980s has been concerned with emulating this type of problem solving, with parallel thought processes often being overlooked.39 This has led to criticism of AI, often with the unjustified conclusion that computers are inherently incapable of parallel thought. Parallel thinking is massively parallel. It is capable of simultaneously processing multiple levels of abstraction, with each level incorporating substantial knowledge and constraints. It tends to be relatively fast because of its highly parallel construction. It generally takes place without either conscious direction or even awareness of the nature of the transformations being made. Skill acquisition generally involves the sequential mind repeatedly training the parallel mind.
The principles of pattern recognition
Building on the observations above, we can describe several principles that govern successful pattern recognition systems. While specific implementations and techniques will differ from one problem domain to another, the principles remain the same.
It is clear that parallel processing is important, particularly in the early stages of the process, since the quantity of information is greatly reduced by each stage of processing. Pattern-recognition tasks generally require a hierarchy of decisions. Each stage has its own manner of representing information and its own methods for deriving the information from the previous stage. For example, in vision we represent the original image data in terms of pixel intensity values. Hypothesized line segments, on the other hand, are probably represented in terms of the coordinates of the ends of each segment along with additional information about the characteristics of each segment (curvature, edge noise, shading, etc.). Similarly, surfaces are represented by a large number of coordinates plus information regarding the surface characteristics. A variety of methods have been devised for representing objects, including the primal sketch, the 2½D sketch, and the world model.40
A key issue in analyzing each stage of representation is segmentation. In speech recognition, for example, we need to divide a sample of continuous speech into smaller segments such as words or perhaps phonemes (basic sounds).41 Choosing the appropriate types of segments for each stage is one of the most important decisions in designing a pattern-recognition system. Once segments in the data have been located, they can be labeled (described). In vision, for example, once we have segmented a scene into line segments, we can describe the nature of the segments. We then segment the edge representation into surfaces and go on to label the surfaces with their characteristics.
After we have determined the stages of processing, the representation of information contained in each stage, the segments to be extracted, and the type of labeling desired for each segment, we are still faced with the heart of the problem: designing methods to make the segmentation and labeling decisions. The most successful paradigm I have found for accomplishing this is that of multiple experts.42 Usually the only methods available to perform specific recognition tasks are very imperfect ones. Information theory tells us that with several independent methods of relatively low accuracy we can still achieve high levels of accuracy if we combine them in a certain way. These multiple methods, called experts, are considered independent if they have what are called orthogonal invariances, that is, independent strengths. Another way of saying the same thing is that the different experts (sometimes also called knowledge sources) tend to make different types of mistakes. The goal is to assemble a group of experts diverse enough that for each pattern that arises, at least one of the experts will have the proficiency to respond correctly. (Of course, we still have to decide which expert is right, just as in ordinary life! I shall come back to this question.)
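The claim that several imperfect but independent experts can be combined into an accurate whole can be illustrated with a small calculation. The sketch below is not the combination method used in any particular system; it assumes the simplest possible setting (a yes-or-no decision, experts that err independently, equal accuracy for every expert, and a plain majority vote), and the function name and the 70 percent accuracy figure are chosen only for illustration.

```python
from math import comb


def majority_vote_accuracy(n_experts: int, p_correct: float) -> float:
    """Probability that a strict majority of independent experts is right.

    Assumes a binary decision and that every expert is correct with the
    same probability p_correct, independently of the others.
    """
    needed = n_experts // 2 + 1  # smallest number of votes forming a majority
    return sum(
        comb(n_experts, k) * p_correct**k * (1 - p_correct) ** (n_experts - k)
        for k in range(needed, n_experts + 1)
    )


# Hypothetical experts that are each right only 70 percent of the time.
for n in (1, 3, 15, 51):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Under these assumptions the accuracy of the vote rises steadily as experts are added. The independence assumption carries the argument: experts that make the same kinds of mistakes gain far less from being combined.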
As an example, consider the recognition of printed letters.43 One useful expert we can call on would detect a feature called the loop, which is an area of white completely surrounded by black. The capital A, for example, has one loop; B has two. Another useful expert would detect concavities, which are concave regions facing in a particular direction. For example, A has one concavity facing south, F has one concave region facing east, and E has two concavities facing east.
Our loop expert would be proficient at distinguishing an O from a C in that O has a loop and C does not. It would not be capable, however, of discriminating C from I (no loops in either case) or O from 6 (each has one loop). The concavity expert could help us here, since it can distinguish C from I and 6 from O by the presence of an east concavity in C and 6. Similarly, the concavity expert by itself would be unable to distinguish C from 6 (since they both have an east concavity), but the loop expert could identify 6 by its single loop. Clearly, the two experts together give us far greater recognition capability than either one alone. In fact, using just these two experts (a loop detector and a concavity detector), we can sort all 62 sans-serif roman characters, excluding punctuation (A through Z, a through z, and 0 through 9) into about two dozen distinct groups with only a few characters in each group. For example, the group characterized by no loops with north and south concavities contains only the characters H and N. In other words, if the loop expert examined a pattern and indicated it had found no loops and the concavity expert indicated concave regions facing south and north, we could conclude that the character was (probably) either an H or an N. Additional experts that examined the location and orientation of line segments or angle vertices could then help us to make a final identification.
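The sorting of characters into groups can be sketched in a few lines. The table below contains only characters whose features are discussed in the text, and it assumes that any concavity not mentioned there is absent; a real table would cover all 62 characters and, as noted elsewhere, would allow multiple classifications per character. The names FEATURES and group_by_features are illustrative.

```python
from collections import defaultdict

# (number of loops, set of concavity directions) for a few characters.
# Values follow the examples in the text; unmentioned concavities are
# assumed absent for the purposes of this sketch.
FEATURES = {
    "O": (1, frozenset()),
    "C": (0, frozenset({"east"})),
    "I": (0, frozenset()),
    "6": (1, frozenset({"east"})),
    "H": (0, frozenset({"north", "south"})),
    "N": (0, frozenset({"north", "south"})),
}


def group_by_features(table):
    """Collect characters that the two experts cannot tell apart."""
    groups = defaultdict(list)
    for char, key in table.items():
        groups[key].append(char)
    return dict(groups)


groups = group_by_features(FEATURES)
for (loops, concavities), chars in groups.items():
    print(loops, sorted(concavities), chars)
```

In this small table every character lands in a group of its own except H and N, which share the group with no loops and with north and south concavities. That is the case in which a further expert, such as one examining line segments, is needed.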
The loop feature.
The concavity feature.
Typical defects in real print.
It is clear that in addition to a set of experts that can provide us with the ability to make all of the discriminations necessary, we also need a process to direct and organize the efforts of these experts. Such a system, often called the expert manager, is programmed with the knowledge of which expert to use in each situation.44 It knows the relative strengths and weaknesses of each expert and how to combine their insights into making final decisions. It would know, for example, that the loop expert is relatively useless in discriminating 6 from O but very helpful for determining whether a character is a 6 or a C, and so on.
In a real system (one that deals with images from the real world), classifications are rarely as straightforward as the examples above suggest. For example, it is entirely possible that an A as actually printed might not contain a loop because a printing error caused the loop to be broken. An a (which should contain one loop) might actually contain two loops if an ink smear caused the upper portion to close. Real-world patterns rarely display the expected patterns perfectly. Even a well-printed document contains a surprisingly large number of defects. One way to deal with the vagaries of real-world patterns is to have redundant experts and multiple ways of describing the same type of pattern. There are a number of different ways of describing what an A should look like. Thus, if one of our experts failed (e.g., the loop expert), we still have a good chance of correctly recognizing the pattern.
What letter is this? Douglas Hofstadter uses these images to illustrate the superiority of human pattern recognition over the machines of today. Humans have little difficulty recognizing these variations (although a few of these may indeed be problematical if presented in isolation). While machines exist today that can accurately recognize the many type styles in common usage, no machine can successfully deal with the level of abstraction required by these ornamental forms.
The classification of roman well-printed sans-serif characters by loop and concavity features.
Disambiguating N from H using a line segment expert.
The loop and concavity features of 62 roman sans-serif characters. Some of the concavities are ambiguous or marginal. For example, the southern concavity in the letter a is so small it may be overlooked by the concavity expert. Thus, in a practical system a would be classified in both the “has one southern concavity” and “has no southern concavity” categories. To account for multiple type fonts most characters will in fact have multiple classifications.
There are many sources of variability. One, called noise for obvious reasons, consists of random changes to a pattern, particularly near the edges, caused by defects in the pattern itself as well as imperfections in the sensing mechanism that visualizes the pattern (e.g., an image scanner). Another source of variability derives from the inherent nature of patterns defined at a high level of abstraction. For example, the concept of an A allows for a great deal of variation. There are hundreds of different styles of type in common use and many more if ornamental styles are considered. If one considers only a single type style, then one could obtain accurate recognition using a relatively small number of experts. If, on the other hand, we attempt to recognize printed characters drawn from a wide multitude of styles, then it is clear that a substantially more diverse set of experts is required.45 Allowing such variability in the patterns to be recognized also complicates the task of the expert manager.
Since the classification of patterns in the real world is often not clear cut, it is desirable for our experts to provide their “opinions” on a continuous scale. Rather than stating that this pattern has a loop, it would be of greater value for the expert to indicate its relative level of confidence in the presence of such a property (e.g., “There is a 95 percent probability of there being one loop in this pattern, a 3 percent probability of there being two loops”). A less-than-certain result might indicate that the loop expert almost found a loop, that the “loop” found is broken by a few pixels. Even if the loop is entirely closed, there is always the possibility that it really should not be there at all but is only an artifact of a printing or scanning error. If all of the experts provide their analyses in terms of probabilities, then the expert manager can use information theory to combine these results in an optimal way.
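One way an expert manager might combine probabilistic opinions is sketched below. It uses naive Bayes fusion, which multiplies the experts' likelihoods together and is appropriate only if the experts' errors are independent given the true character; the text does not specify a particular formula, so this is an assumption. All of the probabilities shown are hypothetical, and the sketch assumes that at least one candidate ends up with a nonzero score.

```python
def combine_experts(prior, expert_likelihoods):
    """Naive Bayes fusion of expert opinions.

    prior: dict mapping each candidate character to its prior probability.
    expert_likelihoods: list of dicts, one per expert, giving the probability
    of that expert's observation if the candidate were the true character.
    Returns the normalized posterior probability of each candidate.
    """
    scores = dict(prior)
    for likelihood in expert_likelihoods:
        for label in scores:
            scores[label] *= likelihood[label]
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical numbers: the loop expert is fairly sure it saw one loop,
# and the concavity expert is fairly sure it saw an east concavity.
prior = {"O": 1 / 3, "C": 1 / 3, "6": 1 / 3}
loop_expert = {"O": 0.95, "C": 0.05, "6": 0.95}
concavity_expert = {"O": 0.10, "C": 0.90, "6": 0.90}

posterior = combine_experts(prior, [loop_expert, concavity_expert])
best = max(posterior, key=posterior.get)
print(best, {label: round(p, 3) for label, p in posterior.items()})
```

Neither expert alone can single out the 6, but the product of their opinions favors it strongly, while still leaving some probability for the O and the C in case one of the observations was an artifact.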
In cases of significant print distortion, even human perception can fail on the level of individual letters. Yet we are often able to correct for printing defects by using our knowledge of language context. For example, if we have trouble distinguishing a t from a c because of poor printing, we generally look (consciously or unconsciously) at the context of the letter. We might determine, for example, that “computer” makes more sense than “compucer.” This introduces the concept of experts that use knowledge of the constraints of higher levels of context. Knowing that “compucer” is not a word in English but that “computer” is enables us to disambiguate an otherwise ambiguous pattern. Similarly, in the field of speech recognition, the only possible way to distinguish the spoken word “to” from “too” and from “two” (all of which sound identical) is from context. In the sentence “I am going to the store,” we can eliminate “too” and “two” from consideration by relying on our higher-level syntactic knowledge. Perceptual experiments indicate that human pattern recognition relies heavily on such contextual discrimination. Attempting to recognize printed letters without a word context, human speech without a sentence context, and musical timbres without a melodic context sharply reduces the accuracy of human perception. Similarly, machines dealing with highly variable types of patterns require extensive use of context experts with substantial knowledge about their domains. A word-context expert in a character-recognition system requires knowledge of all the possible words in the language. A syntactic expert in a speech-recognition system requires knowledge of possible word sequences. Again, an expert that can say, “‘Phthisis’ has a probability of .0001,” is more valuable than one who can only say, “‘Phthisis’ is possible.”
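The “computer” versus “compucer” example can be sketched as a simple word-context expert. The lexicon here is a hypothetical three-word list standing in for a full dictionary, and the function name is illustrative. A practical expert would also weight each word by its probability rather than simply reporting whether it is possible.

```python
from itertools import product

# Hypothetical miniature lexicon; a real system would need the whole language.
LEXICON = {"computer", "compute", "commuter"}


def word_context_expert(candidates, lexicon):
    """Return the spellings permitted by the letter candidates that are words.

    candidates: one list of plausible letters per character position,
    as reported by the lower-level character experts.
    """
    spellings = ("".join(letters) for letters in product(*candidates))
    return sorted(word for word in spellings if word in lexicon)


# The sixth character is poorly printed and could be either a t or a c.
candidates = [["c"], ["o"], ["m"], ["p"], ["u"], ["t", "c"], ["e"], ["r"]]
print(word_context_expert(candidates, LEXICON))  # prints ['computer']
```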
Varieties of low-level (minimal) property sets.
All of the experts mentioned above deal with relatively abstract concepts. Concavity is not a perfectly defined concept. Detecting this property is not straightforward and requires a relatively complex program. A very different category of experts, low-level experts (as distinguished from the high-level experts described above), deal with features that are simple transformations of the original input data. For example, in any type of visual recognition we could have a low-level property associated with every pixel whose value is simply the value of the pixel. This is, of course, the simplest possible property. A slightly higher level property (but still low level) could detect the amount of black in a particular region of the image. For example, a T will tend to have more black in the upper region of the image than an L, which will tend to be more black in the lower region. In actual use, minimal properties tend to be more complex than in the above two examples but nonetheless use straightforward and well-defined transformations of the original input.
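The “amount of black in a region” property is simple enough to show in full. The tiny bitmaps below are hand-drawn for illustration (1 is black, 0 is white), and splitting the image into an upper and a lower half is one arbitrary choice of regions among many.

```python
def black_fraction(image, rows):
    """Fraction of black pixels (value 1) in the given rows of a bitmap."""
    cells = [pixel for r in rows for pixel in image[r]]
    return sum(cells) / len(cells)


def upper_lower_black(image):
    """Return the black fraction of the upper half and of the lower half."""
    half = len(image) // 2
    upper = black_fraction(image, range(half))
    lower = black_fraction(image, range(half, len(image)))
    return upper, lower


# Hand-drawn 4 x 5 bitmaps, for illustration only.
T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
L = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]

print("T:", upper_lower_black(T))  # more black in the upper half
print("L:", upper_lower_black(L))  # more black in the lower half
```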
It turns out that such low-level properties are quite useful in recognizing patterns when the possible types of patterns are highly constrained. For example, in character recognition, if we restrict the problem to a single style of type, then a system built entirely with low-level property experts is capable of a very high level of accuracy (potentially less than one error in over ten thousand printed characters). This limited problem is often attacked with template matching, so called because it involves matching the image under consideration to stored templates of every letter in the character set.46 Template matching (and other methods of minimal-property extraction) also work well for recognizing printed letters drawn from a small number of type styles. If we are trying to recognize any nonornamental type style (called omnifont, or intelligent, character recognition), then an approach using only minimal property extraction does not work at all. In this case, we must use the higher-level (more intelligent) experts that are based on such abstract topological concepts as loops, concavities, and line segments. The minimal properties can still play an important role, however. Fortunately, printed material does not generally combine multiple type styles in anything like a random fashion. Any particular document (e.g., a book or magazine) will tend to use a limited number of type styles in a consistent way.47 When an omnifont character-recognition machine first encounters a new document, it has no choice but to use its intelligent experts (its loop expert, concavity expert, etc.) to recognize the characters. As it begins successfully to recognize characters, its higher-level experts can actually train its lower-level experts to do the job, and its expert manager (which directs the overall recognition process) can begin to rely more heavily on the lower-level experts for recognition. 
The higher-level experts train the lower-level ones by presenting actual examples of recognized characters and telling them, in essence, “Here are examples of characters as they actually appear in this document, and this is what we believe their correct identifications to be.” The advantages of such an automatic learning process include both speed and accuracy. The lower-level experts are not only potentially much faster, they can also be less sensitive to image noise.
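The training of lower-level experts by higher-level ones can be sketched with template matching. In this illustration the labeled examples stand in for characters that the high-level experts have already identified in the current document; templates are formed by averaging those examples, and a new image is assigned to the template with the smallest squared pixel difference. The averaging rule, the distance measure, and the 3 x 3 bitmaps are all assumptions made for the sake of a short example.

```python
def flatten(bitmap):
    """Turn a two-dimensional bitmap into a flat list of pixel values."""
    return [pixel for row in bitmap for pixel in row]


def train_templates(labeled_examples):
    """Average the examples for each label into one template per character."""
    sums, counts = {}, {}
    for label, bitmap in labeled_examples:
        flat = flatten(bitmap)
        if label not in sums:
            sums[label] = [0.0] * len(flat)
            counts[label] = 0
        sums[label] = [s + p for s, p in zip(sums[label], flat)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total] for label, total in sums.items()
    }


def match_template(bitmap, templates):
    """Return the label whose template is closest in squared pixel distance."""
    flat = flatten(bitmap)

    def distance(label):
        return sum((p - t) ** 2 for p, t in zip(flat, templates[label]))

    return min(templates, key=distance)


# Characters already identified by the (hypothetical) high-level experts.
recognized = [
    ("T", [[1, 1, 1], [0, 1, 0], [0, 1, 0]]),
    ("L", [[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
]
templates = train_templates(recognized)

# A T with one pixel missing because of a printing defect.
noisy_t = [[1, 1, 1], [0, 1, 0], [0, 0, 0]]
print(match_template(noisy_t, templates))
```

Because the comparison uses every pixel at once, a single missing pixel changes the distance only slightly, which suggests one reason such experts can tolerate image noise.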
To return to the first theme of this chapter, the higher-level experts in such a character-recognition system are representative of logical analysis, whereas the lower-level experts represent a more parallel type of thinking. The lower-level experts use much simpler algorithms, so they are more amenable to massive parallel processing, which is a major reason for their potential speed advantage.
Interestingly, perceptual experiments indicate that the human visual system works in a similar way. When we first encounter a new typeface, to recognize it, we rely on our conceptual understanding of print (a logical type of analysis), and our recognition speeds are relatively slow. Once we get used to the style, our recognition process becomes less analytic, and our speed and accuracy increase substantially. This is another example of our logical mind training our parallel mind.
The paradigm of pattern recognition described above is common to most serious recognition problems: multiple stages of processing based on a hierarchy of levels, massive parallel processing (particularly in the early stages), segmentation and labeling, multiple experts on both high and low levels, expert management, disambiguation using the constraints of higher levels of context, and learning from actual recognition examples.48 The actual content of the paradigm, however, will differ substantially from one problem area to another. Most of the technology of any successful pattern-recognition system is domain specific; that is, it is based on the detailed nature of the types of patterns to be recognized. Every so often one hears claims regarding a general-purpose pattern-recognition system that can recognize any type of pattern (printed characters, spoken words, land-terrain maps) regardless of their source. As mentioned earlier, while such systems do recognize many types of patterns, they perform these tasks poorly. To perform any specific pattern-recognition task well with commercially acceptable rates of accuracy requires substantial knowledge deeply embedded in the algorithms and specific to the domain of inquiry.
The Real World
Looking at the real world
Attempts to emulate the general capabilities of human vision are being pursued at a number of leading AI laboratories. One is an ambitious project to create an artificial eye-head system at the MIT Vision Laboratory under the direction of Tomaso Poggio.49 The MIT work includes edge detection (using the Laplacian Gaussian convolver described above and other similar algorithms), fusing stereo images to provide information on depth, understanding color perception, reconstructing surfaces and their properties, tracking trajectories of moving objects, and the ultimate problem of describing the content of what is seen. One of the most interesting aspects of the MIT work is the development of a new type of computer that combines digital control with massive analog parallelism.50 Experiments conducted by Poggio, his associate Christof Koch, and others have already suggested that the human nervous system appears to be capable of substantial parallelism (hundreds) of analog computations within a single neuron.
This work is exemplary of the paradigm of multiple experts. A number of different systems, each with extensive knowledge of a specific aspect of the visual-perception task, are combined in the MIT eye-head system.51 For example, an expert being developed by Anya Hurlbert and Tomaso Poggio uses knowledge of the spectral (color) reflectance of surfaces to help describe them.52 The project also addresses the issues of integrating visual perception with the mechanical control of a robot and includes a head with two solid-state cameras for eyes (that is, cameras with a special chip called a charge-coupled device as an electronic retina).
A major center for the development of vision systems and their application to the field of robotics is the Robotics Institute (RI) at Carnegie-Mellon University under the direction of AI pioneer Raj Reddy. A particularly ambitious project at RI, funded by the Defense Advanced Research Projects Agency (DARPA), is an autonomous vehicle called Terregator (Terrestrial Navigator), which combines a high-resolution vision system, parallel processing, and advanced decision-making capabilities.53
Photo by Lou Jones www.fotojones.com
Computer vision pioneer Tomaso Poggio at the MIT Vision Lab.
Photo by Lou Jones www.fotojones.com
Raj Reddy, director of the Robotics Institute of Carnegie-Mellon University. Reddy has been a pioneer in the development of voice recognition, computer vision, and robotics. Now working on the Terregator (Terrestrial Navigator) for the Defense Advanced Research Projects Agency (DARPA), Reddy predicts that future robotic-vision systems will eventually revolutionize driving and provide cars with effective collision-control and road-following capabilities.
In view of the strong Japanese commitment to the application of robotics to production techniques, Japanese researchers have targeted vision as a priority research topic. Building on the work of Poggio and his associates, Yoshiaki Shirai (of the Electrotechnical Laboratory, Ibaraki, Japan) and Yoshiro Nishimoto (of the Research Laboratory, Kobe Steel, Kobe, Japan) are attempting to build a practical system for fusing stereo images. Based on parallel hardware, the Shirai-Nishimoto system uses a Laplacian of a Gaussian convolver (a sombrero filter) as well as more advanced pattern-matching techniques. Japanese development efforts are emphasizing the integration of vision with real-time robotic control to provide a new generation of robots that can see their environment, perceive and understand the relevant features of objects, and reason about what they have seen. Hirochika Inoue and Hiroshi Mizoguchi (of the University of Tokyo) have developed a system that can detect, recognize, and track rapidly moving objects in real time.
One promising approach to organizing the massive parallelism required for pattern-recognition tasks is to develop specialized chips to perform those tasks requiring the most computation. One researcher pursuing this approach is Carver A. Mead (of the California Institute of Technology), one of the original pioneers in the development of design methodologies for large-scale integrated circuits. Mead and his associates have developed an artificial-retina chip that performs such early-vision tasks as edge detection and the adjustment of an image for the effects of varying levels of illumination.54 One of the innovations of Mead’s approach is his reliance on massively parallel analog circuits to provide the bulk of the computation. Mead is also working on an artificial-cochlea chip based on similar principles.
While research is just beginning on systems that emulate the full range of human visual processing, machines that perform more limited tasks of visual perception have already found significant commercial applications. For example, optical character recognition (OCR) was a $100 million industry in 1986 and is projected to grow to several hundred million dollars in 1990.55 Applications include reading aloud for the blind, as well as scanning printed and typed documents for entry into word processing, electronic publishing, transaction processing, and database systems.
Photo by Lou Jones www.fotojones.com
Seeing and believing. At the Tsukuba Research Center in Japan, Yoshiaki Shirai’s research in robotics focuses on the development of three-dimensional vision systems.
Photo by Lou Jones www.fotojones.com
Hirochika Inoue, a pioneer in robotic vision systems, at the University of Tokyo.
Photo by Lou Jones www.fotojones.com
Makoto Nagao of Kyoto University explores pattern recognition by means of shadows and surfaces.
Systems using pattern-recognition techniques are revolutionizing the handling of fingerprints by law enforcement agencies. A system called the Automated Fingerprint Identification System (AFIS) developed by NEC of Japan enables agencies across the United States to rapidly identify suspects from fingerprints or even small fragments of fingerprints by intelligently matching them against the stored prints of hundreds of thousands of previously arrested men and women. A report by the U.S. Bureau of Justice Statistics stated, “AFIS may well have the greatest impact of any technological development on law enforcement effectiveness since the introduction of computers to widespread use in the criminal justice system in the 1960s.”56 AFIS is capable of identifying a suspect in several minutes; the manual methods it replaces took months or even years.
Similar techniques are being used in security devices. Systems manufactured by Fingermatrix, Thumbscan, and other firms include a small optical scanner into which a person inserts his finger.57 The device quickly reads the person’s finger pattern and uses pattern-recognition techniques to match it against stored images. The system can control entry to restricted areas and protect information in computers from unauthorized access. Such systems could eventually replace ordinary locks and keys in homes and cars.
Photo by Lou Jones www.fotojones.com
Photo by Lou Jones www.fotojones.com
The Cognex Machine Vision System.
One of the largest applications of commercial vision systems so far can be found in factories, where the systems are used for inspection, assembly, and process control. Such systems typically use solid-state cameras with specialized electronics to digitize moving images and provide for the computationally intensive early phases of processing.58 A general-purpose computer with custom software provides for the higher levels of analysis. One of the more sophisticated of such systems has been developed by Cognex Corporation, founded by Robert Shillman and a team of MIT AI researchers in 1981. One Cognex product can scan manufactured products streaming by on a conveyor belt and detect and recognize such information as serial numbers embossed in metal or even glass. Other Cognex products can identify specific objects and their orientation for inspection and to assist robotic assemblers. Other major providers of vision systems include Automatix, Defracto, Perceptron, Robotic Vision Systems, and View Engineering. A major player has been General Motors, which has provided investments and contracts for several of the players. According to DM Data, overall revenues for the factory-vision industry were over $300 million in 1987 and are projected to hit $800 million in 1990.59
Military systems account for another major application of artificial vision.60 The ability to scan and recognize terrain at very low altitudes is a crucial element of the cruise missile, which can be launched thousands of miles from its intended target. Modern fighter planes have a similar ability to track terrain and provide pilots with a continually updated display of the location and trajectory of the aircraft. Smart weapons (bombs, missiles, and other munitions) use a variety of sensing mechanisms including vision to locate, identify, and reach intended targets.
The advent of weapons that can see has resulted in profound changes in military tactics and strategy. As recently as the Vietnam War, it was generally necessary to launch enormous numbers of passive blind weapons in relatively indiscriminate patterns to assure the destruction of a target. Modern battlefield tactics emphasize instead the carefully targeted destruction of the enemy with weapons that can recognize their objective. Intelligent missiles allow planes, ships, and submarines to destroy targets from relatively safe distances. For example, a plane can launch an intelligent missile to destroy a ship from tens or even hundreds of miles away, well out of range of the ship’s guns. A new generation of pilotless aircraft use pattern-recognition-based vision systems to navigate and launch weapons without human crews.61 Vision systems and other pattern-recognition technologies are also deployed in defensive tactics to recognize an incoming missile, but such defense is generally much more difficult than offense. The result is an increasing degree of vulnerability for such slow-moving targets as tanks and ships.
An area of emerging importance is the application of pattern recognition to medicine. Medical diagnosis is, after all, a matter of perceiving relevant patterns from symptoms, test results, and other diagnostic data. Experimental systems can look at images from a variety of imaging sources (X-ray machines, CAT (computerized axial tomography) scanners, and MRI (magnetic resonance imaging) systems) and provide tentative diagnoses. Few, if any, medical professionals are ready to replace their own perceptions with those of such systems, but many are willing to augment their own analysis. Often an automatic system will detect and report a diagnosis that manual analysis would have overlooked.62
Photo by Lou Jones www.fotojones.com
Shigeru Eiho, director of Engineering at Kyoto University, with his medical imaging system.
One particularly promising medical application is the analysis of blood-cell images. Certain types of cancer can be diagnosed by finding telltale precursor malignant cells in a blood sample. Analysis of blood samples by human technicians typically involves the examination of only about a hundred cells. By the time a malignant cell shows up in a sample that small, the cancer is often too advanced to be effectively treated. Unhampered by fatigue or tedium and able to operate at speeds hundreds of times greater than human technicians, artificial “technicians” using pattern-recognition techniques can search for signs of cancer in hundreds of thousands or even millions of cells and thus potentially detect the presence of disease at a treatable stage.
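The effect of sample size can be made concrete with a simple calculation. If a fraction f of cells is malignant and cells are examined independently, the chance of seeing at least one malignant cell among n is 1 - (1 - f)^n. The fraction used below (one cell in a hundred thousand) is purely hypothetical and is chosen only to show the shape of the relationship, not to describe any actual disease.

```python
def detection_probability(cells_examined: int, malignant_fraction: float) -> float:
    """Chance of encountering at least one malignant cell in the sample.

    Assumes each examined cell is malignant independently with the
    given probability.
    """
    return 1.0 - (1.0 - malignant_fraction) ** cells_examined


# Hypothetical early-stage frequency: one malignant cell per 100,000.
fraction = 1e-5
for n in (100, 100_000, 1_000_000):
    print(n, round(detection_probability(n, fraction), 4))
```

Under this assumption a hundred-cell sample almost never contains a malignant cell, whereas a million-cell search almost always does.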
Ultimately, medical applications of pattern recognition will have enormous benefit. Medical testing comprises a major fraction of all of medicine and costs several hundred billion dollars per year. Examining the images and other data resulting from such tests is extremely tedious for human technicians, and many studies have cited the relatively low level of accuracy that results. Many doctors routinely order tests to be conducted in duplicate just to improve their accuracy. Once computers have mastered the requisite pattern-recognition tasks, the potential exists for a major transformation of medical testing and diagnosis.
One area of medicine that is already being revolutionized is medical imaging.63 Using a variety of computer-based image-enhancement techniques, physicians now have access to unprecedented views inside our bodies and brains. Similar techniques also allow scientists to visualize such extremely small biochemical phenomena as viruses for the first time.
One of the more surprising results of image processing and recognition took place recently when Lillian Schwartz observed the striking unity of the juxtaposed halves of the “Mona Lisa” and the reversed “Self-Portrait” by Leonardo da Vinci. Further investigation by Schwartz led her to identify Leonardo as the model used to complete the “Mona Lisa,” thereby suggesting a remarkable conclusion to the 500-year-old riddle of the identity of the celebrated painting.64
Listening to the real world
Another human sense that computers are attempting to emulate is hearing. While input to the auditory sense involves substantially less data than the visual sense (about a million bits per second from both ears versus about fifty billion bits per second from both eyes), the two senses are of comparable importance in our understanding of the world. As an experiment, try watching a television news program without sound. Then try listening to a similar broadcast without looking at the picture. You will probably find it easier to follow the news stories with your ears alone than with your eyes alone. Try the same experiment with a situation comedy; the result should be the same.
Part of the importance of our auditory sense is the close link of verbal language to our conscious thinking process. A theory popular until recently held that thinking was subvocalized speech.65 While we now recognize that our thoughts incorporate both language and visual images, the crucial importance of the auditory sense in the acquisition of knowledge is widely accepted.
Blindness is often considered to be a more serious handicap than deafness. A careful consideration of the issues, however, shows this to be a misconception. With modern mobility techniques, blind persons with appropriate training have little difficulty in travelling from place to place, reading machines can provide access to the world of print, and the visually impaired experience few barriers to communicating with other persons in groups and meetings large or small. For the deaf, however, there is a barrier to engaging in a very fundamental activity: understanding what other people are saying in person-to-person contact, on the phone, and in meetings. The hearing impaired are often cut off from basic human communication and feel anger at society’s failure to accommodate or understand their situation.
We hear many things: music, speech, the varied noises of our environment. Of these, the sounds that are the most important in terms of interacting with and learning about the world are those of human speech. Appropriately, the area of auditory recognition that has received the most attention is that of speech recognition in both human and machine.
As with human vision, the stages of human auditory processing that we know the most about are the early ones. From the work of Stephanie Seneff, Richard Goldhor, and others at the MIT Speech Laboratory and from similar work at other laboratories, we have some knowledge of the specific transformations applied by the auditory nerve and early stages of the auditory cortex. As with vision, our knowledge of the details of the later stages is relatively slight due mostly to our inability to access these inner circuits of the brain. Since it is difficult for us to analyze the human auditory system directly, the bulk of the research in speech recognition to date has been devoted to teaching machines to understand speech. As with vision, the success of such efforts may provide us with workable theories as to how the human auditory system might work, and these theories may subsequently be verified by neurophysical experimentation.
Let us examine automatic speech recognition (ASR) in terms of the pattern-recognition paradigm described above for vision. Speech is created by the human vocal tract, which, like a complex musical instrument, has a number of different ways of shaping sound. The vocal cords vibrate, creating a characteristic pitched sound. The length and tautness of the vocal cords determine pitch in the same way that the length and tautness of a violin or piano string determine pitch. We can control the tautness of our vocal cords (hence our ability to sing). We shape the overtones produced by our vibrating vocal cords by moving our tongue, teeth, and lips, which has the effect of changing the shape of the vocal tract. The vocal tract is a chamber that acts like a pipe in a pipe organ, the harmonic resonances of which emphasize certain overtones and diminish others. Finally, we control a small piece of tissue called the alveolar flap, which opens and closes the nasal cavity. When the alveolar flap is open, the nasal cavity provides an additional resonant chamber similar to the opening of another organ pipe.
In addition to the pitched sound produced by the vocal cords, we can produce a noiselike sound by the rush of air through the speech cavity. This sound does not have specific overtones but is rather a complex spectrum of many frequencies mixed together. Like the musical tones produced by the vocal cords, the spectra of these noise sounds are also shaped by the changing resonances of the moving vocal tract.66
This apparatus allows us to create the varied sounds that comprise human speech. While many animals communicate with others of their species with sound, we humans are unique in our ability to shape sound into language. Vowel sounds (/a/, /e/) are produced by shaping the overtones from the vibrating vocal cords into distinct frequency bands called formants. Sibilant sounds (/s/, /z/) are created by the rush of air through particular configurations of tongue and teeth. Plosive consonants (/p/, /k/, /t/) are transitory sounds created by the percussive movement of lips, tongue, and mouth cavity. Nasal sounds (/n/, /m/) are created by invoking the resonances of the nasal cavity.67
Each of the several dozen basic sounds, called phonemes, requires an intricate movement involving precise coordination of the vocal cords, alveolar flap, tongue, lips, and teeth. We typically speak about 3 words per second. So with an average of 6 phonemes per word, we make about 18 complex phonetic gestures each second. We do this without thinking about it, of course. Our thoughts remain on the conceptual (that is, the highest) level of the language hierarchy. In our first two years of life, however, we thought a lot about how to make speech sounds (and how to meaningfully string them together). This is another example of our sequential (logical) conscious mind training our parallel (pattern recognition) mind.
The mechanisms described above for creating speech sounds (vocal cord vibrations, the noise of rushing air, articulatory gestures of the mouth and tongue, the shaping of the vocal and nasal cavities) produce different rates of vibration. A physicist measures these rates of vibration as frequencies; we perceive them as pitches. Though we normally consider speech to be a single time-varying sound, it is actually a composite of many different sounds, each of which has a different frequency. With this insight, most commercial ASR systems start by breaking up the speech waveform into a number of different bands of frequencies. A typical commercial or research ASR system will produce between three and a few dozen frequency bands. The front end of the human auditory system does exactly the same thing; each of the nerve endings in the cochlea (inner ear) responds to different frequencies and emits a pulsed digital signal when activated by an appropriate pitch. The cochlea differentiates several thousand overlapping bands of frequency, which gives the human auditory system its extremely high degree of sensitivity to frequency. Experiments have shown that increasing the number of overlapping frequency bands in an ASR system (and thus bringing it closer to the thousands of bands of the human auditory system) substantially increases the ability of that system to recognize human speech.68
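The idea of breaking a waveform into frequency bands can be illustrated with a small sketch. It is not the front end of any particular ASR system: the function name band_energies, the 8,000-samples-per-second rate, the four band edges, and the two-tone test signal are all assumptions chosen for illustration, and a naive discrete Fourier transform stands in for whatever filter bank a real system would use.

```python
import math

def band_energies(samples, sample_rate, band_edges):
    """Return the signal energy falling in each frequency band (naive DFT)."""
    n = len(samples)
    energies = [0.0] * (len(band_edges) - 1)
    # Bins above the Nyquist frequency mirror the lower ones, so stop at n // 2.
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = -sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        power = (re * re + im * im) / n
        # Add this bin's power to the band whose range contains its frequency.
        for b in range(len(energies)):
            if band_edges[b] <= freq < band_edges[b + 1]:
                energies[b] += power
                break
    return energies

def tone(freq, sample_rate, n):
    """A pure sine tone, used here only as a stand-in for a speech component."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]

rate = 8000  # assumed sampling rate, in samples per second
# A composite "sound": a strong 500 Hz component plus a weaker 2,500 Hz one.
frame = [a + 0.5 * b for a, b in zip(tone(500, rate, 256), tone(2500, rate, 256))]
edges = [0, 1000, 2000, 3000, 4000]  # four assumed bands, in Hz
energies = band_energies(frame, rate, edges)
print(energies)
```

The first band holds most of the energy, the third holds the weaker component, and the other two are essentially empty: the single time-varying waveform has been separated into its constituent frequencies.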
Typically, parallel processing is used in the front-end frequency analysis of an ASR system, although not as massively as in vision systems, since the quantity of data is much less. (If one were to approach the thousands of frequency bands used by the human auditory system, then massive parallel processing would be required.) Once the speech signal has been transformed into the frequency domain, it is normalized (adjusted) to remove the effects of loudness and background noise. At this point we can detect a number of features of frequency-band signals and consider the problems of segmentation and labeling.
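One simple way to picture the normalization step is sketched below. The text does not say how any given system normalizes its data, so the method here (subtract an assumed per-band noise estimate, take logarithms, and remove the average level) is only one plausible choice, and the name normalize_frame and its parameters are hypothetical.

```python
import math

def normalize_frame(band_energies, noise_floor=None, eps=1e-12):
    """Illustrative normalization of one frame of frequency-band energies.

    noise_floor is an assumed per-band estimate of background noise energy.
    """
    if noise_floor is None:
        noise_floor = [0.0] * len(band_energies)
    # Remove the estimated background noise, never going below zero.
    cleaned = [max(e - n, 0.0) for e, n in zip(band_energies, noise_floor)]
    # On a logarithmic scale, a change in loudness becomes a constant offset.
    logs = [math.log(c + eps) for c in cleaned]
    mean = sum(logs) / len(logs)
    # Subtracting the mean removes that offset, leaving only spectral shape.
    return [v - mean for v in logs]

quiet = [4.0, 1.0, 0.25]
loud = [100.0 * e for e in quiet]          # same sound, spoken much louder
noisy = [e + 1.0 for e in quiet]           # same sound over steady noise
print(normalize_frame(quiet))
print(normalize_frame(loud))
print(normalize_frame(noisy, noise_floor=[1.0, 1.0, 1.0]))
```

All three calls yield essentially the same feature values, which is the point of the step: the features passed on to segmentation and labeling should reflect what was said rather than how loudly or in what surroundings.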
As in vision systems, minimal property extraction is one popular technique. One can use as a feature set either the normalized frequency data itself or various transformations of this data. Now, in matching such minimal-property sets, we need to consider the phenomenon of nonlinear time compression.69 When we speak, we change our speed according to context and other factors. If we speak one word more quickly, we do not increase the rate evenly throughout the entire word. The duration of certain portions of the word, such as plosive consonants, will remain fairly constant, while other portions, such as vowels, will undergo most of the change. In matching a spoken word to a stored template, we need to align the corresponding acoustic events, or the match will never succeed. This problem is similar to the matching of visual cues in fusing the stereo images from our two eyes. A mathematical technique called dynamic programming has been developed to accomplish this temporal alignment.70
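The temporal alignment described above is commonly known as dynamic time warping. The following is a minimal sketch of the technique, not the implementation of any commercial system: each frame is reduced to a single number for readability (real systems compare vectors of frequency-band features), and the function name and example sequences are assumptions.

```python
def dtw_distance(a, b):
    """Cost of the best nonlinear time alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Either sequence may linger on a frame while the other advances,
            # which is what lets a stretched vowel match a short one.
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

template = [1, 5, 5, 1]          # stored word: consonant, vowel, consonant
slow = [1, 5, 5, 5, 5, 1]        # same word with only the vowel drawn out
other = [1, 2, 1, 2, 1, 2]       # a different word of the same length as slow

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The slowly spoken utterance aligns with the template at zero cost even though it is half again as long, while the different word cannot be aligned cheaply however its time axis is warped.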
As with vision systems, high-level features are also used in ASR systems. As mentioned above, speech is made up of strings of phonemes, which comprise the basic “alphabet” of spoken language. In English and other European romance languages, there are about 16 vowels and 24 consonants; Japanese primarily uses only 5 vowels and 15 consonants. The nature of a particular phoneme (such as /a/) is an abstract concept in the same way that the inherent nature of a printed character (such as A) is: neither can be simply defined. Identifying phonemes in human speech requires intelligent algorithms and recognition of high-level features (something like the loops and concavities found in printed characters). Segmenting speech into distinct time slices representing different phonemes is also a formidable task. The time-varying spectrum of frequencies characterizing a phoneme in one context may be dramatically different in a different context. In fact, in many instances no time slice corresponding to a particular phoneme can be found; we detect it only from the subtle influence it has on phonemes surrounding it.
Photo by Vladimir Sejnoha and courtesy of Kurzweil Applied Intelligence
Nonlinear alignment of speech events. The lines in the middle graph show the optimal alignment of the spectrograms of two utterances of the word “further” spoken by the same female speaker.
As with vision and character recognition, both high-level and low-level features have value in speech-recognition systems. For recognizing a relatively small vocabulary (a few hundred words) for a single speaker, low-level feature detection and template matching by means of dynamic programming is usually sufficient, and most small-vocabulary systems use this approach. For more advanced systems, a combination of techniques is usually required: generally, multiple experts and an expert manager that knows the strengths and weaknesses of each.71
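The notion of an expert manager can be made concrete with a toy sketch. Nothing here reflects how a specific product weighs its experts: the two expert names, their scores, and the reliability weights are invented for illustration, and a weighted sum is simply the most basic way a manager might act on its knowledge of each expert's strengths.

```python
def combine_experts(expert_scores, expert_weights):
    """Pick the word with the highest reliability-weighted total score.

    expert_scores:  {expert name: {candidate word: score}}
    expert_weights: {expert name: how much the manager trusts that expert}
    """
    totals = {}
    for name, scores in expert_scores.items():
        weight = expert_weights[name]
        for word, score in scores.items():
            totals[word] = totals.get(word, 0.0) + weight * score
    best = max(totals, key=totals.get)
    return best, totals

# Two hypothetical experts disagree about which word was spoken.
scores = {
    "template": {"nine": 0.6, "five": 0.4},
    "phonetic": {"nine": 0.3, "five": 0.7},
}

# If the manager has learned that the template expert is more reliable
# in this situation, its opinion dominates, and vice versa.
trust_template = {"template": 0.8, "phonetic": 0.2}
trust_phonetic = {"template": 0.2, "phonetic": 0.8}

print(combine_experts(scores, trust_template)[0])
print(combine_experts(scores, trust_phonetic)[0])
```

The same expert opinions lead to different decisions depending on which expert the manager has learned to trust, which is why the manager's knowledge matters as much as the experts themselves.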
High-level context experts are also vital for large vocabulary systems. For example, phonemes cannot appear in any order. Indeed, many sequences are impossible to articulate (try saying “ptkee”). More important, only certain phoneme sequences will correspond to a word or word fragment in the language. On a higher level, the syntax and semantics of the language put constraints on possible word orders. While the set of phonemes is similar from one language to another, context factors differ dramatically. English, for example, has over 10,000 possible syllables, whereas Japanese has only 120.
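A context expert of the simplest kind can be sketched as a lookup against the words of the language. The three-word lexicon, the phoneme spellings, and the acoustic scores below are all made up for the example; a real system would hold tens of thousands of entries and would also apply syntactic and semantic constraints.

```python
# A tiny hypothetical lexicon mapping phoneme sequences to words.
LEXICON = {
    ("k", "ae", "t"): "cat",
    ("b", "ae", "t"): "bat",
    ("k", "ae", "p"): "cap",
}

def is_word_prefix(phonemes, lexicon):
    """True if some word in the lexicon begins with this phoneme sequence."""
    p = tuple(phonemes)
    return any(entry[:len(p)] == p for entry in lexicon)

def best_word(hypotheses, lexicon):
    """Return the best-scoring hypothesis that is an actual word, or None.

    hypotheses: list of (phoneme sequence, acoustic score) pairs.
    """
    valid = [(score, lexicon[tuple(p)])
             for p, score in hypotheses if tuple(p) in lexicon]
    if not valid:
        return None
    return max(valid)[1]

# The acoustic stage slightly prefers a sequence that is not a word at all.
hypotheses = [
    (("g", "ae", "t"), 0.50),
    (("k", "ae", "t"), 0.45),
    (("b", "ae", "t"), 0.05),
]
print(best_word(hypotheses, LEXICON))
print(is_word_prefix(("k", "ae"), LEXICON))
print(is_word_prefix(("p", "t"), LEXICON))
```

Although the acoustic evidence alone favors a nonword, the context expert overrules it, and partial sequences that cannot begin any word can be discarded before the utterance is even complete.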
Learning is also vital in speech recognition. Adaptation to the particular characteristics of each speaker is a powerful technique in each stage of processing. Learning must take place on a number of different levels: the frequency and time relationships characterizing each phoneme, the dialect (pronunciation) patterns of each word, and the syntactic patterns of possible phrases and sentences.
In sum, we see in speech recognition the full paradigm of pattern recognition that we first encountered in vision and character recognition systems: parallel processing in the front-end, segmentation and labeling, multiple experts on both high and low levels, expert management, disambiguation by context experts, and learning from actual recognition examples. But while the paradigm is the same, the content is dramatically different. Only a small portion of the technology in a successful ASR system consists of classical pattern-recognition techniques. The bulk of it consists of extensive knowledge about the nature of human speech and language: the shape of the speech sounds and the phonology, syntax, and semantics of spoken language.72
Automatic speech recognition is receiving considerable attention because of its potential for commercial applications.73 We learn to understand and produce spoken language in our first year of life, years before we can understand or create written language. Thus, being able to communicate with computers using verbal language would provide an optimal modality of communication. A major goal of AI is to make our interactions with computers more natural and intuitive. Being able to converse with them by talking and listening is a vital part of that process.
For years ASR systems have been used in situations where users necessarily have their hands and eyes busy, making it impossible to use ordinary computer keyboards and display screens. For example, a laboratory technician examining an image through a microscope or other technical equipment can speak results into the microphone of an ASR system while continuing to view the image being examined. Similarly, factory workers can verbalize inspection data and other information on the production or shop floor directly to a computer without having to occupy their hands with a keyboard. Other systems are beginning to automate routine business transactions over telephone lines. Such telecommunications applications include credit-card verification, sales-order entry, accessing data base and inventory records, conducting banking and other financial transactions, and many others.
The applications mentioned are usually highly structured and thus require the ASR system to recognize only a small vocabulary, typically a few hundred words or fewer. The largest markets are projected for systems that can handle vocabularies that are relatively unrestricted and much larger, say, ten thousand words or more. The most obvious application of large vocabulary ASR systems is the creation of written documents by voice. The creation of written documents is a ubiquitous activity engaged in by almost everyone in offices, schools, and homes. Just copying all of the written documents created each year is a $25 billion industry. Tens of billions of dollars are spent each year in the creation of original written works from books to interoffice memoranda. Being able to dictate a document to a computer-based machine and see the document appear on a screen as it is being dictated has obvious advantages in terms of speed, accuracy, and convenience. Large-vocabulary ASR-based dictation systems are beginning to see use by doctors in writing medical reports and by many other professionals.
There are also significant applications of ASR technology to the handicapped. Persons who are unable to use their hands because of quadriplegia (paralysis due to spinal-cord injury), cerebral palsy, or other neurological impairment can still create written documents, interact with computers, and otherwise control their environment by speaking to a computer equipped with ASR. Certain types of brain damage cause a person’s speech to be slurred and distorted. While such speech patterns are unrecognizable by human listeners (unless they have received special training), they are distorted in a consistent way and thus can be recognized by machine. A system consisting of an ASR system to recognize the distorted speech connected to a speech synthesizer can act as a translator, allowing the person to be understood by others. Research has been conducted at Children’s Hospital, Boston, by Dr. Howard Shane on assisting both the hand-impaired and the speaking-impaired using ASR. Of perhaps the greatest potential benefit to the handicapped, and also representing the greatest technological challenge, would be an ASR-based device for the deaf that could provide a visual read-out of what people are saying.
There are three fundamental attributes that characterize a particular ASR system: vocabulary size, training requirements, and the ability to handle continuous speech. Vocabulary size is the number of different words that a system can handle at one time. Text creation requires a large basic vocabulary as well as the ability to add words to the active personal vocabulary of each individual user. For most other applications, small vocabularies suffice.
Most ASR systems require each user to train the system on their particular pronunciation patterns. This is typically accomplished by making the user provide the system with one or more spoken samples of each word in the vocabulary. However, for large vocabulary systems, speaking every word in the vocabulary is often not practical. It is preferable if the ASR system can infer how the speaker is likely to pronounce words never actually spoken to the machine. Then the user needs to train the system on only a subset of the full vocabulary.
Some small-vocabulary systems have been preprogrammed with all of the dialect patterns anticipated from the population expected to use the system and thus do not require any prior training by each user. This capability, called speaker independence, is generally required for telephone-based systems, where a single system can be accessed by a large group of users.
Most commercial systems to date require users to speak with brief pauses (usually around 100 milliseconds) between words. This helps the system make a crucial segmentation decision: where words start and end. Speaking with such pauses reduces the speed of a typical speaker by 20 to 50 percent. ASR systems that can handle continuous speech exist, but they are limited today to small vocabularies. Continuous-speech systems that can handle large vocabularies are expected in the early 1990s.
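The segmentation decision that these pauses make possible can be sketched in a few lines. The sketch assumes that the speech has already been cut into 10-millisecond frames with one loudness value each, so that a 100-millisecond pause is ten consecutive quiet frames; the function name, the threshold, and the frame values are illustrative assumptions rather than details of any commercial system.

```python
def segment_words(frame_energies, threshold, min_pause_frames):
    """Return (start, end) frame indices (end exclusive) for each word.

    A word ends only when the signal stays below threshold for at least
    min_pause_frames consecutive frames; shorter gaps stay inside the word.
    """
    words = []
    start = None       # first frame of the word in progress, if any
    last_loud = None   # most recent frame at or above the threshold
    silence = 0        # length of the current run of quiet frames
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_loud = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_pause_frames:
                words.append((start, last_loud + 1))
                start = None
                silence = 0
    if start is not None:
        words.append((start, last_loud + 1))
    return words

# Assumed 10 ms frames: a word containing a brief 30 ms closure
# (as before a plosive), a 120 ms pause, then a second word.
frames = [0] * 5 + [1] * 30 + [0] * 3 + [1] * 10 + [0] * 12 + [1] * 20 + [0] * 4
print(segment_words(frames, threshold=0.5, min_pause_frames=10))
```

The brief silence inside the first word does not split it, because it is shorter than the required pause, whereas the deliberate pause between words does. Continuous speech offers no such reliable gaps, which is one reason it is so much harder to segment.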
Other characteristics that are important in describing practical ASR systems include the accuracy rate, response time, immunity to background noise, requirements for correcting errors, and integration of the speech recognition capability with specific computer applications. In general, it is not desirable to simply insert a speech recognition system as a front-end to ordinary computer applications. The human requirements for controlling computer applications by voice are substantially different from those of more conventional input devices such as keyboards, so the design of the overall system needs to take this into account.
While ASR systems continue to fall far short of human performance, their capabilities are rapidly improving, and commercial applications are taking root. As of 1989 ASR systems could either recognize a large vocabulary (10,000 words or more), recognize continuous speech, or provide speaker independence (no user training), but they could provide only one of these capabilities at a time. In 1990, commercial systems were introduced that combined speaker independence with the ability to recognize a large vocabulary. I expect it to be possible in the early 1990s to combine any two of these attributes in the same system. In other words, we will see large vocabulary systems that can handle continuous speech while still requiring training for each speaker; there will be speaker-independent systems that can handle continuous speech but only for small vocabularies; and so on. The Holy Grail of speech recognition is to combine all three of these abilities, as human speech recognition does.
Other types of auditory perception
There are applications of pattern recognition to auditory perception other than recognizing speech. A field closely related to ASR is speaker identification, which attempts to reliably identify the person speaking by his or her speech profile. The techniques used are drawn from the ASR field. Common applications include entry and access control.
Military applications of auditory pattern recognition consist primarily of ship and submarine detection. It is not possible to see through the ocean, but sound waves can travel through the water for long distances. By sending sound waves and analyzing the reflected patterns, marine vehicles can be recognized. One of the challenges is to correctly characterize the patterns of different types of submarines and ships as opposed to whales and other natural undersea phenomena.
There are also a number of medical applications. Scanning the body with sound waves has become a powerful noninvasive diagnostic tool, and systems are being developed to apply pattern-recognition techniques to data from sonogram scanners. Perhaps the most significant medical application lies in the area of listening to the human heart. Many irregularities in heartbeat (arrhythmias) occur infrequently. So a technique has been developed that involves recording an electrocardiogram for a 24-hour period. In such Holter monitoring the patient goes through a normal day while a portable unit makes a tape recording of his or her heart pattern. The recording is then reviewed on a special screen by a human technician at 24 times normal speed, thus requiring an hour of analysis. But reading an electrocardiogram at 24 times real time for an hour is extremely demanding, and studies have indicated a significant rate of errors. Using pattern-recognition techniques, computers have been programmed to analyze the recording at several hundred times real time and thus have the potential for providing lower costs and higher accuracy. Similar systems are used to monitor the vital functions of critical-care patients. Eventually wristwatch systems will be able to monitor our vital functions on a continuous basis. Heart patients could be told by their wristwatches to slow down or take other appropriate action if the watch determines that they are overexerting themselves or otherwise getting into difficulty.
Next to human speech, perhaps the most important type of sound that we are exposed to is music. It might be argued that music is the most fundamental of the arts: it is the only one universal to all known cultures. The musical-tone qualities created by complex acoustic instruments such as the piano or violin have unique psychological effects, particularly when combined with the other elements of music (melody, rhythm, harmony, and expression). What is it that makes a piano sound the way it does? Every piano sounds somewhat different, yet they are all recognizable as the same type of instrument. This is essentially a pattern-recognition question, and insight into the answer has provided the ability to recreate such sounds using computer-based synthesizers. It also provides the ability to create entirely new synthetic sounds that have the same richness, depth, and musical relevance as acoustically produced sounds.74
Analogs to the other human senses are being developed as well. Chemical-analysis systems are beginning to emulate the functions of taste and smell. A variety of tactile sensors have been developed to provide robots with a sense of touch to augment their sight.
Taken together, applications of pattern recognition comprise fully half of the AI industry. It is surprising that many discussions of AI overlook this area entirely. Unlike expert systems and other areas that primarily emphasize logical rules and relationships, the field of pattern recognition combines both parallel and sequential types of thinking. Because of their need for enormous amounts of computation, pattern-recognition systems tend to require the cutting edge of advanced computer architectures.