Mining the blogosphere
September 10, 2012
Leila Kosseim, project lead and associate professor in Concordia’s Computational Linguistics Laboratory, and recently graduated doctoral student Shamima Mithun have developed a natural-language-processing system called BlogSum that allows an organization to pose a question and then find out how a large number of people talking online would respond.
The system is capable of gauging things like consumer preferences and voter intentions by sorting through websites, examining real-life self-expression and conversation, and producing summaries that focus exclusively on the original question.
Making sense of blog text
“Huge quantities of electronic texts have become easily available on the Internet, but people can be overwhelmed, and they need help to find the real content hiding in the mass of information,” explains (CLaC lab).
Analyzing informally written language poses unique challenges compared to analyzing, for example, a news article. Blogs, forums, and the like contain opinions, emotions and speculations, not to mention spelling errors and poor grammar. So a summarization tool must address question irrelevance (sentences that are not relevant to the main question) and discourse incoherence (sentences in which the intent of the writer is unclear).
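To make the idea of question irrelevance concrete, here is a minimal sketch of a relevance filter. It is not BlogSum's actual method, which the article does not detail. It swaps in a simple word-overlap test, and the function names, the stopword list, and the sample sentences are all assumptions made for illustration.

```python
import re

# Hypothetical stopword list for this illustration only.
STOPWORDS = {
    "a", "an", "the", "is", "are", "do", "does", "of", "to", "in", "on",
    "and", "or", "what", "how", "why", "people", "about", "think", "it",
    "i", "my", "for",
}

def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def filter_relevant(question, sentences, min_overlap=1):
    """Keep sentences that share at least min_overlap content words with the question.

    This is a crude stand-in for relevance scoring, not BlogSum's approach.
    The original sentence order is preserved.
    """
    question_words = content_words(question)
    return [
        s for s in sentences
        if len(content_words(s) & question_words) >= min_overlap
    ]

question = "What do people think about the battery life of the phone?"
sentences = [
    "The battery lasts two days, which is great.",
    "My cousin visited last weekend.",
    "Honestly the phone feels cheap.",
]
relevant = filter_relevant(question, sentences)
for s in relevant:
    print(s)
```

A filter like this only addresses question irrelevance. Discourse incoherence, where the writer's intent is unclear, calls for analysis of how sentences relate to one another, which simple word overlap cannot capture.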
The researchers developed and tested BlogSum by examining a set of blogs and review sites. To crunch the data, BlogSum used “discourse relations,” which serve as ways of filtering and ordering sentences into coherent summaries. BlogSum was measured against prior computational rankings and achieved mostly superior results.
In addition, it was evaluated by actual human subjects, who also found it to be superior. Summaries produced by BlogSum reduced question irrelevance and discourse incoherence, successfully distilling large amounts of text into highly readable summaries.
Detailed Architecture of BlogSum (credit: Shamima Mithun)