THE AGE of INTELLIGENT MACHINES | A Personal Postscript
February 21, 2001
Author: Ray Kurzweil
Pattern matching is the basis of Raymond Kurzweil’s inventions in optical character recognition, speech recognition and synthesis, and electronic music. From Ray Kurzweil’s revolutionary book The Age of Intelligent Machines, published in 1990.
Originally Published 1990
Success provides the opportunity for growth, and growth provides the opportunity to risk at a higher level. Eric Vogt
Most of the AI projects that I have been involved in personally are in the pattern-recognition field. The following is a description of some of these efforts from the point of view of technology. This section is perhaps misnamed. This is not really a personal postscript; not recounted here are the many exceptional people who have made contributions to these projects, the early struggles of growing companies, the efforts to attract both capital and talent, the ambiguities and subtleties of understanding markets, the challenge of establishing manufacturing facilities, the relationships with vendors, suppliers, contractors, dealers, distributors, customers, consultants, attorneys, accountants, bankers, investment bankers, investors and the media, or the interpersonal challenges of building organizations.
Optical Character Recognition
I founded Kurzweil Computer Products (KCP) in 1974. Our goal was to solve the problem of omnifont (any type font) optical character recognition (OCR) and to apply the resulting technology to the reading needs of the blind as well as to other commercial applications. There had been attempts to help the blind read using conventional OCR devices (those for a single or limited number of type fonts), but these machines were unable to deal with the great majority of printed material as it actually exists in the world. It was clear that to be of much value to the blind, an OCR machine would have to read any style of print in common use and also deal with the vagaries of printing errors, poor quality photocopies, varieties of paper and ink, complex page formats, and so on. OCR machines had existed from the beginning of the computer age, but all of the machines up to that time had relied on template matching, a form of low-level property extraction, and thus were severely limited in the range of material they could handle. Typically, users had to actually retype printed material using a specialized typeface before scanning. The principal value of these devices was that typewriters were at that time more ubiquitous than computer terminals.
Courtesy of the Kurzweil Reading Machine Division of Xerox
The flow of information in the Kurzweil Reading Machine.
It was clear to us that to produce an OCR device that was font invariant as well as relatively insensitive to distorted print, we would need additional experts beyond minimal property extraction. Our solution was to develop software for multiple experts, including topological experts such as loop, concavity, and line-segment detectors, with an expert manager that could combine the results of both high- and low-level recognition experts. The system was able to learn by having the high-level experts teach the low-level experts the typefaces found in a particular document. At a later point we added context experts by providing the machine with a knowledge of English (and ultimately several other languages).
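The idea of an expert manager that pools the opinions of several recognition experts can be illustrated with a small sketch. This is not the KRM's software: the expert names, scores, weights, and the weighted-sum rule are all assumptions chosen only to show how independent opinions about one character might be merged into a single ranking.

```python
from collections import defaultdict


def combine_experts(votes, weights):
    """Merge per-expert candidate scores into one ranked list.

    votes:   {expert_name: {candidate_char: score between 0 and 1}}
    weights: {expert_name: relative trust placed in that expert}
    Both structures are hypothetical; a weighted sum is one simple
    combination rule among many.
    """
    totals = defaultdict(float)
    for expert, scores in votes.items():
        weight = weights.get(expert, 1.0)
        for candidate, score in scores.items():
            totals[candidate] += weight * score
    # Best combined score first; ties are broken alphabetically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up opinions about one smudged character.
votes = {
    "loop": {"o": 0.9, "c": 0.2, "e": 0.6},
    "concavity": {"c": 0.8, "e": 0.7, "o": 0.1},
    "template": {"e": 0.6, "c": 0.5, "o": 0.3},
}
weights = {"loop": 1.0, "concavity": 1.0, "template": 0.5}

ranking = combine_experts(votes, weights)
print([candidate for candidate, _ in ranking])
```

In this toy case no single expert settles the question, but the combined ranking places "e" first because every expert gives it moderate support.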
The first Kurzweil Reading Machine (KRM), introduced in 1976, consisted of an image scanner we developed ourselves that contained an electro-optical camera. The camera’s eye consisted of a charge-coupled device containing 500 light-sensitive elements arranged in a straight line. The camera was mounted on an electromechanical “X-Y mover,” which could move the camera in both vertical and horizontal directions. The material to be read (a book, magazine, typed letter, etc.) lay face down on the glass plate that formed the top of the machine.
The camera automatically moved back and forth scanning each line of print, transmitting the image electronically to a minicomputer contained in a separate cabinet. Using our omnifont OCR software, the minicomputer recognized the characters, grouped them into words and computed the pronunciation of each word. To accomplish this last task, several hundred pronunciation rules were programmed in, along with several thousand exceptions. The resulting string of phonemes was sent to a speech synthesizer, which articulated each word. Subsequent models of the KRM have been substantially improved, but they are organized in a similar way.
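The combination of pronunciation rules and an exception list can be sketched as a lookup that tries the exceptions first and falls back to letter-pattern rules. The tiny tables below are invented for illustration and use ARPAbet-style symbols as an assumption; the KRM's actual rules and phoneme set are not given here.

```python
# Hypothetical exception list: words whose spelling defeats the rules.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}

# Hypothetical letter-to-sound rules, longest patterns listed first so
# that "sh" is tried before "s" and "ee" before "e".
RULES = [
    ("sh", "SH"),
    ("ee", "IY"),
    ("p", "P"),
    ("s", "S"),
    ("h", "HH"),
    ("e", "EH"),
]


def pronounce(word):
    """Return a list of phoneme symbols for one word."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes = []
    i = 0
    while i < len(word):
        for pattern, phoneme in RULES:
            if word.startswith(pattern, i):
                phonemes.append(phoneme)
                i += len(pattern)
                break
        else:
            i += 1  # no rule covers this letter in the toy table; skip it
    return phonemes


print(pronounce("sheep"))
print(pronounce("two"))
```

Checking the exception list before the rules keeps the rules simple: irregular words never reach them.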
Using the device is straightforward: the user places the document to be read face down on the machine, presses start, and listens. The KRM has a control panel to control the movement of the scanner, back it up, make it reread sections or spell out words, and provide a variety of other functions.
Photo by Lou Jones www.fotojones.com
Jamal Mazrui uses the Kurzweil Reading Machine. The Kurzweil Reading Machine scans and recognizes such text as books, magazines, and memos and converts it into synthesized speech and thus is able to provide blind readers with independent access to printed materials.
Courtesy of the Kurzweil Reading Machine Division of Xerox
Font invariance is a primary goal of Kurzweil Computer Products’ intelligent character recognition. Users can verify recognized characters against images of the original characters if so desired. A few of the topological features considered by Kurzweil Computer Products’ ICR.
The KRM has been called the first commercial product to successfully incorporate AI technology. A recent survey showed that most blind college students have access to a KRM to read their educational materials. Nothing in my professional career has given me greater satisfaction than the many letters I have received from blind persons of all ages indicating the benefit they have received from the KRM in enabling them to complete their studies or gain and maintain productive employment.
Two years after the introduction of the KRM, we introduced a refined version, the Kurzweil Data Entry Machine (KDEM), designed for commercial applications. The KDEM, like the KRM, could scan printed and typed documents and recognize characters and page formats from a wide variety of sources, but rather than speaking the words, it transmits them. It has been used to automatically enter documents into databases, word-processing machines, electronic-publishing systems, and a variety of other computer-based systems.
For example, the KDEM was used to automatically scan and recognize all of the contributed articles in this book for entry into a computerized publishing system.
Many computerized systems move information from electronic form onto the printed page. The KDEM allows it to move back not just as an electronic image but in an intelligent form that a computer can understand and process further.
The result is to make the printed page another form of information storage like a floppy disk or tape. Unlike electronic media, however, the printed page can be easily accessed by humans as well, which makes it the medium of choice for both people and machines.
I find it interesting to review the rapidly improving price performance of computer-based products in terms of the products of my own companies. The 1978 KDEM sold for $120,000, which, adjusted for inflation, is equivalent to $231,000 in 1990 dollars. It had 65,536 bytes of memory and recognized print at about 3 characters per second. In 1990 KCP offered a far superior product for under $5,000.
The 1990 version has 2 to 4 million bytes of memory, can recognize between 30 and 75 characters per second, can recognize a substantially wider range of degraded print, and is far more accurate than the 1978 KDEM. The 1990 version thus has 32 to 64 times as much memory, is 10 to 25 times faster, and is more accurate and versatile than the 1978 version. If we conservatively assume that it provides at least 15 times the performance at 1/46.2 the price, it represents an overall improvement in price-performance of 693 to 1. Since 2^9.4 ≈ 693, KCP has doubled its price-performance 9.4 times in 144 months, which is a doubling of price-performance every 15.3 months. That rate is somewhat better than the computer industry at large, which is generally considered to double its price-performance ratio only every 18 to 24 months.
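The arithmetic behind the doubling-time estimate can be retraced in a few lines. The inputs are the figures quoted in the text; the variable names are illustrative only.

```python
import math

price_1978_in_1990_dollars = 231_000  # inflation-adjusted 1978 KDEM price
price_1990 = 5_000                    # upper bound on the 1990 price
performance_gain = 15                 # the conservative assumption in the text
months = 12 * (1990 - 1978)           # 144 months

price_ratio = price_1978_in_1990_dollars / price_1990  # 46.2
improvement = performance_gain * price_ratio            # about 693
doublings = math.log2(improvement)                      # about 9.4
months_per_doubling = months / doublings                # about 15.3

print(f"improvement: {improvement:.0f} to 1")
print(f"doublings: {doublings:.1f}")
print(f"months per doubling: {months_per_doubling:.1f}")
```

Taking the base-2 logarithm converts the overall improvement factor into a number of doublings, and dividing the elapsed months by that number gives the doubling period.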
On July 1, 1982, I founded Kurzweil Applied Intelligence (KAI). KAI’s goal has been to master automatic speech recognition (ASR) technology and to integrate ASR with other AI technologies to solve real-world problems. The long-term goal is to establish ASR as a ubiquitous modality of communication between human and machine. In 1985 we introduced the Kurzweil Voice System (KVS), the first commercial ASR device with a 1,000 word vocabulary. A refined version called KVW, the first to provide a recognition vocabulary of up to 10,000 words, was introduced in 1987.[1]
The KAI speech recognizers follow the paradigm described in the previous section. The KVW, for example, has seven different experts, all of which attempt to recognize each spoken word simultaneously. Several of the experts analyze the acoustics (or sound) of the word spoken, and the others are context experts programmed with a knowledge of English word sequences (other languages will follow in the future). The system makes extensive use of user training and adaptation. It learns the phonological patterns (that is, the frequency patterns) and the phonetic sequences (the dialect) of each user. It also includes a user syntax expert that learns the characteristic word-sequence patterns used by each speaker.
The KVW consists of specialized electronics, an industry standard personal computer and software. Before a user can begin productive dictation with the system, he needs to enroll, which involves speaking a sample of words to the machine to provide it with initial phonological and phonetic models. Once that has been accomplished, the user can dictate to the machine and watch each word appear on the screen within a fraction of a second after speaking it. The 1990 models require the user to provide a brief pause between words, although later models are expected to accept continuous speech.
The KVW actually improves its performance as the system adapts to (that is, learns about) the user’s pronunciation patterns and syntax. This adaptation continues indefinitely. In addition to displaying each recognized word, the system also displays its second through sixth choices. If the KVW makes a mistake, one of these alternate words is very often the correct choice. Thus, errors are typically corrected by the user saying “Take two” or “Take three,” which replaces the word originally displayed in the text with the appropriate alternate word.
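The "Take two" style of correction amounts to keeping a ranked list of alternates for the most recent word and swapping in the requested rank. The sketch below is only an illustration of that idea; the function name, data layout, and sample words are assumptions, not the KVW's design.

```python
# Spoken ordinals accepted after "Take". The first choice is already
# displayed, so corrections start at the second choice.
ORDINALS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6}


def apply_take(transcript, choices, command):
    """Replace the last word of the transcript with a ranked alternate.

    transcript: list of words displayed so far
    choices:    ranked candidates for the last word (choices[0] is shown)
    command:    spoken correction such as "Take two"
    """
    verb, _, ordinal = command.lower().partition(" ")
    if verb != "take" or ordinal not in ORDINALS:
        raise ValueError(f"not a correction command: {command!r}")
    rank = ORDINALS[ordinal]
    if rank > len(choices):
        raise ValueError(f"only {len(choices)} choices are available")
    return transcript[:-1] + [choices[rank - 1]]


transcript = ["the", "patient", "has", "know"]
choices = ["know", "no", "now", "nose", "gnaw", "note"]
corrected = apply_take(transcript, choices, "Take two")
print(" ".join(corrected))
```

Because the alternates are already ranked at recognition time, a correction costs the user two spoken words rather than a retyped or re-dictated one.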
KVW speech-recognition technology has been integrated with a variety of applications. One version includes a full-function word-processor with the capability of entering text as well as issuing all editing and formatting commands by voice. Several sections of this book were written by voice using this version of the KVW. Other versions of the KVW are integrated with knowledge-based systems that have expertise in the types of reports created in different professions.
Courtesy of Kurzweil Applied Intelligence
A doctor dictates medical reports to VoiceRad.
For example, VoiceRad integrates KAI’s speech recognition technology with knowledge of radiology reporting, allowing a radiologist to quickly dictate the results of an examination for instantaneous transcription. As with the word-processor version of the KVW, the radiologist can dictate a report word by word.
In addition, the system can automatically generate predefined sections of text based on its knowledge of radiology reporting. VoiceEM is a similar system for emergency medicine. A variety of similar systems have been developed for medicine and other disciplines. This approach combines the productivity gains of ASR-based dictation with those of a built-in domain-specific knowledge base. These products mark the first time that a commercially available large vocabulary ASR product has been used to create written text by voice in other than experimental situations.
KAI also has a major commitment to applying technology for the handicapped. Versions of KVS and KVW technology provide means for text creation and computer and environmental control for quadriplegic and other hand-impaired individuals. A long-term goal of the company is to develop a sensory aid for the deaf that would provide a real-time display of what someone is saying on the phone and in person.
The company’s long-term objectives are two-fold. First, it intends to continue strengthening its core speech-recognition technology, to move toward the Holy Grail of combining large vocabulary ASR with continuous-speech capability and minimal requirements for training the system for each user. Second, it intends to integrate ASR with a variety of applications, particularly those emphasizing other AI technologies. Ultimately, our goal is to establish voice communication as a desirable and widely used means of communicating with machine intelligence.
The Electronic Music Revolution
I founded Kurzweil Music Systems (KMS), also on July 1, 1982. The inspiration for starting KMS came from two sources. One was my lifelong interest in music, along with a nearly lifelong interest in computers. My father, a noted conductor and concert pianist, had told me shortly before his death in 1970 that I would combine these two interests one day, although he was not sure how. The other and more immediate genesis of KMS was a conversation I had with Stevie Wonder, who had been a user of the Kurzweil Reading Machine from its inception. While showing me some new musical instruments he had recently acquired, Steve noted that two worlds of musical instruments–the acoustic and the electronic–had developed with no bridge existing between them.
On the one hand, acoustic instruments such as the piano, violin, and guitar provided the musical sounds that were still the sounds of choice for most of the world’s musicians. While these acoustic sounds were rich, complex, and musically satisfying, only limited means were available for controlling or even playing these sounds.
For one thing, once a piano key was struck, there was no further ability to shape the note other than to terminate it: the initial velocity of the key strike was the only means for modifying piano sounds. Second, most instruments could only play one note at a time.
Third, there were no ways to layer sounds, that is, play the sounds of different instruments simultaneously. Even if you had the skills to play both a piano and a guitar, for example, you could hardly play both at the same time. Even two musicians would find playing the same chords on a piano and guitar almost impossible. In any case, very few musicians, no matter how accomplished, could play more than a very few instruments, as each one requires substantially different playing techniques. Since the playing methods themselves were linked to the physics of each acoustic instrument, many instruments required a high level of finger dexterity. If a composer had a multi-instrumental arrangement in mind, he had no way of even hearing what the piece sounded like without assembling a large group of musicians. Then making changes to the composition required laborious modification of written scores and additional rehearsal. I recall my father’s lamenting the same difficulties.
Steve pointed out that on the other hand there existed the electronic world of music in which most of the above limitations are overcome. Using just one type of playing skill (e.g., a piano-keyboard technique), one can activate and control all available electronic sounds. A wide variety of techniques exist for modifying many aspects of the sounds themselves prior to as well as during performance (these techniques have expanded greatly since 1982). One can layer sounds by having each key initiate different sounds simultaneously. Using sequencers, one can play one part of a multi-instrumental composition, then play that part back from memory and play a second part over it, repeating this process indefinitely. However, electronic instruments at that time suffered from a major drawback, namely the sounds themselves. While they had found an important role in both popular and classical music, synthetic sounds were “thin,” had relatively limited diversity, and did not include any of the desirable acoustic sounds.
Steve asked whether it would be possible to combine these two worlds of music to create in a single instrument the capabilities of both. Such an instrument could produce music that neither world of instruments alone could create. Accomplishing this would, for example, enable musicians to play a guitar and a piano at the same time. We could take acoustic sounds and modify them to accomplish a wide variety of artistic purposes.
Courtesy of Kurzweil Music Systems
The Kurzweil 250 Computer-Based Synthesizer.
Courtesy of Kurzweil Music Systems
A musician plays the Kurzweil 250 using the keyboard and an electronic drum controller.
A musician could play a multi-instrumental composition (such as an entire orchestra) by himself using real acoustic (as well as electronic) sounds. A musician could play a violin or any other instrument polyphonically (playing more than one note at a time). One could play sounds of any instrument without having to learn the playing techniques of each. One could even create new sounds that were based on acoustic sounds, and thus shared their complexity, but moved beyond them to a whole new class of timbres with substantial musical value.
This vision defined the goal of KMS. In June of 1983 we demonstrated an engineering prototype of the Kurzweil 250 (K250), and we introduced it commercially in 1984.[2] The K250 was considered the first electronic musical instrument to successfully emulate the sounds of a grand piano and a wide variety of other instruments: orchestral string instruments (violin, viola, etc.), guitar, human voice, brass instruments, drums, and many others. In listening tests we found that listeners, including professional musicians, were essentially unable to tell the K250 “grand piano” sound apart from that of a real $40,000 concert grand piano. A 12-track sequencer, sound layering, and extensive sound modification facilities provide a full complement of artistic control methods.
The essence of K250 technology lies in its sound models. These data structures, contained in read-only memory within the instrument, define the essential patterns of each instrument voice. We needed to create a signal-processing model of an instrument that would respond to changes in pitch, loudness, and the passage of time in the same complex ways as the original acoustic instrument.
To create a sound model, the starting point is to record the original instrument using high-quality digital techniques. Surprisingly, just finding the right instruments to record turned out to be a major challenge. We were unable, for example, to find a single concert grand piano with an attractive sound in all registers. Some had a beautiful bass region but a shrill midrange. Others were stunning in the high range, but mediocre otherwise. We ended up recording five different pianos, including the one that Rudolf Serkin plays when he comes to Boston.
When capturing an instrument, we record examples of many different pitches and loudness levels. When a particular key on a piano is struck with varying levels of force, it is not just the loudness level that changes but the entire time-varying spectrum of sound frequencies. All of these digital recordings are fed into our sound analysis computer, and a variety of both automatic and manual techniques are used to shape each instrument model. Part of the process involves a form of painstaking tuning and attention to detail ironically reminiscent of old-world craftsmanship. The automatic aspects of the process deal primarily with the issue of data compression. The original recorded data for even a single instrument would exceed the K250’s memory capacity. Thus, it is necessary to include only the salient information necessary to accurately represent the original sounds.
When the keyboardist strikes the K250’s piano-like keys, special sensors detect the velocity of each key’s motion. (Other KMS keyboards can also detect the time-varying pressure exerted by each finger.) The K250’s computer and specialized electronics extract the relevant information from the appropriate sound models in memory and then compute in real time the waveforms representing the selected instrument sound, pitch, and loudness for each note. The varied control features, such as sequencing, layering, and sound modification, are provided by software routines stored in the unit’s memory.
In evolving our instruments at KMS, we have followed two paths. First, the K250 has evolved into a comprehensive system for creating complex musical works. It is essentially a digital recording and production studio in an instrument, and it has become a standard for the creation of movie and television soundtracks and professional recordings. Second, KMS has moved to bring down the cost of its sound-modeling technology. Its K1000 series, for example, is a line of relatively inexpensive products that provide the same quality and diversity of sounds as the K250.
There is an historic trend taking place in the musical instrument industry away from acoustic and mechanical technology and toward digital electronic technology. There are two reasons for this. First, the price-performance of acoustic technology is rapidly deteriorating because of the craftsmanship and labor-intensive nature of its manufacturing processes. A grand piano, for example, has over 10,000 mostly hand-crafted moving parts. The price of the average piano has increased by over 250 percent since 1970. At the same time, it is widely acknowledged that the quality of new pianos is diminishing. On the other hand, the price-performance of digital electronics is, of course, rapidly improving. Furthermore, it is now possible for an electronic instrument to provide the same sound quality as an acoustic instrument, with substantially greater functionality. For these reasons, electronic keyboard instruments have gone from 9.5 percent of the American market for keyboard instruments in 1980 to 55.2 percent in 1986 (according to the American Music Conference). It is my strong belief that this trend will continue until the market is virtually entirely electronic. Our long-term goal at KMS is to continue to provide leadership for this emerging worldwide industry of intelligent digital music technology.
A Final Note
I have tried to select projects that make it possible to build strong companies while meeting social and cultural goals that are important to me and others. I believe, for example, that there is a good match between the capabilities of computer science and the needs of the handicapped. It has been a personal goal of mine to apply AI technologies to help overcome the handicaps associated with major physical and sensory disabilities. I believe the potential exists in the next couple of decades to largely overcome these major handicaps. As amplifiers of human thought, computers have great potential to assist human expression, improve productivity, and expand creativity for all of us, in all areas of work and play. I hope to play a role in constructively harnessing this potential.
All of the projects described above have been highly interdisciplinary efforts and have required the dedication and talents of many brilliant individuals in a broad range of fields. Inventing today is very much a team effort and its success is a function of the quality of the individual members of the team as well as the quality of the group’s communication. As Norbert Wiener pointed out in Cybernetics, scientists and engineers with different areas of expertise often use entirely different technical vocabularies to refer to the same phenomena.
Creating an environment in which a team of linguists, speech scientists, signal-processing experts, VLSI designers, and other specialists can understand each other’s terminology and effectively work together (as was required, for example, in the efforts to develop the speech recognition technology described above) is at least as challenging as the development of the technology itself. Once developed, the technology (and the technologists) must be further integrated into the equally well-developed disciplines of manufacturing, marketing, finance, and the other management skills of a modern corporation.
It is always exciting to see (or hear) a new product, to experience the realization of a vision after years of hard collaborative work. Perhaps my greatest pleasure has been the opportunity to share in the creative process with the many outstanding men and women who have contributed to these endeavors.
1. Raymond Kurzweil, “The Technology of the Kurzweil Voice Writer,” BYTE, March 1986.