Researchers use text mining to analyze 330,000 New York Times stories

July 31, 2006 | Source: KurzweilAI

Scientists at UC Irvine have used a text-mining technique called “topic modeling” to complete in hours a complex topic analysis of 330,000 stories published primarily by The New York Times.

Topic modeling looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics. UCI researchers developed a technique that allows the technology to be used on huge document collections.
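To make the idea concrete, here is a minimal sketch of how co-occurring words can be grouped into topics. The article does not name the model the UCI team used or describe their scaling technique, so this example assumes latent Dirichlet allocation fitted with collapsed Gibbs sampling, a common choice for topic modeling. The function names, parameters, and toy documents are all hypothetical.

```python
import random
from collections import Counter


def fit_topics(docs, num_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model by collapsed Gibbs sampling (assumed technique).

    docs is a list of tokenized documents (lists of words).
    Returns (topic_word, doc_topic) count tables.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Favor topics already common in this document and for this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Return the n most frequent words assigned to each topic."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


# Hypothetical toy corpus; real news stories would be far longer.
toy_docs = [
    "election vote senate campaign vote".split(),
    "senate campaign election president".split(),
    "game season team coach game".split(),
    "team season playoff coach".split(),
]
topic_word, doc_topic = fit_topics(toy_docs)
for k, words in enumerate(top_words(topic_word)):
    print(f"topic {k}: {words}")
```

No category names are supplied anywhere: the sampler only sees which words appear together, and a person labels the resulting word groups afterward. A loop like this is far too slow for hundreds of thousands of stories, which is the kind of problem the UCI technique was built to address.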

The topic model, applied to the collection of news articles published from 2000 to 2002, identified patterns of words that occurred together in the stories. From those words, researchers were able to identify topics.

Information associated with those topics was charted over time, allowing the scientists to pinpoint which months of the year certain topics were most in the news and how much ink they received from year to year.
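Charting topics over time amounts to grouping per-story topic weights by publication date. The sketch below assumes each story has already been assigned a weight for each topic. The function name, data layout, and sample values are hypothetical and are not taken from the UCI study.

```python
from collections import defaultdict
from datetime import date


def topic_share_by_month(dated_docs):
    """Aggregate topic weights by calendar month.

    dated_docs is an iterable of (publication_date, {topic: weight}) pairs.
    Returns {(year, month): {topic: share of that month's total weight}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for day, weights in dated_docs:
        for topic, weight in weights.items():
            totals[(day.year, day.month)][topic] += weight

    shares = {}
    for month, by_topic in sorted(totals.items()):
        month_total = sum(by_topic.values())
        if month_total == 0:
            continue  # no coverage recorded for this month
        shares[month] = {t: w / month_total for t, w in by_topic.items()}
    return shares


# Hypothetical per-story topic weights, for illustration only.
sample = [
    (date(2001, 7, 1), {"sports": 0.75, "politics": 0.25}),
    (date(2001, 7, 15), {"sports": 0.25, "politics": 0.75}),
    (date(2001, 9, 12), {"politics": 1.0}),
]
for month, share in topic_share_by_month(sample).items():
    print(month, share)
```

Comparing the same month across years, or the same topic across months, gives the kind of seasonal and year-to-year view the researchers describe.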

Older text-mining techniques require the user to come up with an appropriate set of topic categories and manually find hundreds to thousands of example documents for each category. This human-intensive process is called supervised learning.

In contrast, topic modeling, a type of unsupervised learning, doesn’t need suggestions for an appropriate set of topic categories or human-found example documents. This makes retrieving information easier and quicker.

Source: UC Irvine