Digitizing Old Text and Fighting Spam, Too

August 15, 2008 | Source: ScienceNOW Daily News

Carnegie Mellon University scientists have developed a program called reCAPTCHA that collects words flagged as unreadable by optical scanners as they digitize texts. It sends those words (as images from the OCR scans) to cooperating Web sites, where they are used in place of random CAPTCHAs.

The reCAPTCHA system now automatically collects about 4 million responses every day from 40,000 Web sites, the equivalent of 1,500 people working full-time and transcribing 60 words per minute. The process identifies more than 99% of words accurately. It is available at www.recaptcha.net, free to any Web site and accessible to blind users.

After a year of operation, reCAPTCHA has helped resolve about 440 million words for clients that are digitizing newspaper and document archives. It has just completed the entire 1908 archive of The New York Times.