Using Bayesian statistics to rank Wikipedia entries
August 8, 2014
Computer scientists in China have devised a software algorithm based on Bayesian statistics that can automatically check a Wikipedia entry and rank it by its quality.
Bayesian analysis is commonly used to assess the content of emails, estimating the probability that a message is spam or junk mail and filtering it from the user’s inbox when that probability is high.
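The idea behind such a filter can be shown with a minimal naive-Bayes sketch. The word counts, prior, and threshold below are invented for illustration; they are not taken from the article or the paper.

```python
from collections import Counter

# Made-up per-word training counts for spam and legitimate ("ham") mail.
spam_counts = Counter({"winner": 8, "free": 9, "meeting": 1})
ham_counts = Counter({"winner": 1, "free": 2, "meeting": 9})
p_spam = 0.5  # prior probability that a message is spam

def p_spam_given(words, alpha=1.0):
    """Posterior P(spam | words) under naive Bayes with Laplace smoothing."""
    vocab = set(spam_counts) | set(ham_counts)
    s_total = sum(spam_counts.values()) + alpha * len(vocab)
    h_total = sum(ham_counts.values()) + alpha * len(vocab)
    like_s, like_h = p_spam, 1.0 - p_spam
    for w in words:
        like_s *= (spam_counts[w] + alpha) / s_total
        like_h *= (ham_counts[w] + alpha) / h_total
    return like_s / (like_s + like_h)

# A message dominated by spammy words gets a high posterior and is filtered.
message = ["free", "winner", "free"]
verdict = "filtered" if p_spam_given(message) > 0.9 else "inbox"
```

The same scoring pattern, with a threshold on the posterior, is what a mail client applies per message; Han and Chen's contribution is applying a richer (dynamic) Bayesian model to article content instead.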
Writing in the International Journal of Information Quality, Jingyu Han and Kejia Chen of Nanjing University of Posts and Telecommunications say they have used a dynamic Bayesian network (DBN) to analyze the content of Wikipedia entries in a similar manner.
They also apply multivariate Gaussian distribution modeling to the DBN analysis, which gives them a distribution of quality for each article so that entries can be ranked. Very-low-ranking entries could be flagged for editorial attention to raise their quality. By contrast, high-ranking entries could be marked as the definitive entry in some way, so that such an entry is not subsequently overwritten with lower-quality information.
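The classification idea can be sketched as follows: each quality class is modeled as a Gaussian over quality-dimension scores (the paper's abstract names accuracy, completeness, consistency, and minimality), and an article is assigned to the class whose distribution gives its score vector the highest density. The class names, means, and variances below are invented, and a diagonal covariance is assumed for brevity; the paper fits full multivariate Gaussians from labeled articles.

```python
import math

# Hypothetical per-class Gaussians over four quality dimensions:
# (accuracy, completeness, consistency, minimality). Means and variances
# are illustrative only, not fitted values from the paper.
classes = {
    "featured": ([0.9, 0.9, 0.85, 0.8], [0.01, 0.01, 0.02, 0.02]),
    "stub":     ([0.4, 0.2, 0.5, 0.6], [0.04, 0.03, 0.05, 0.05]),
}

def log_density(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def classify(article_dims):
    """Assign an article's dimension scores to the highest-density class."""
    return max(classes, key=lambda c: log_density(article_dims, *classes[c]))
```

Because the densities are comparable across classes, the same log-density values can also serve as a continuous score for ranking articles within a class, which is what makes flagging the lowest-ranked entries straightforward.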
Outperforms human users
The team tested its algorithm on sets of several hundred articles, comparing its automated quality assessments with those of a human user. The algorithm outperforms a human user by up to 23 percent in correctly classifying the quality rank of a given article in the set, the researchers report.
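A comparison of this kind reduces to measuring classification accuracy against gold-standard labels for both the algorithm and the human rater. The labels below are invented purely to show the computation; the figures do not reproduce the paper's 23 percent result.

```python
def accuracy(predicted, gold):
    """Fraction of articles whose quality class was labeled correctly."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Hypothetical gold-standard quality classes and two sets of predictions.
gold      = ["A", "B", "C", "A", "B", "C", "A", "B"]
algorithm = ["A", "B", "C", "A", "B", "B", "A", "B"]  # 7 of 8 correct
human     = ["A", "B", "C", "A", "C", "B", "C", "B"]  # 5 of 8 correct

# Relative improvement of the algorithm over the human rater.
improvement = (accuracy(algorithm, gold) - accuracy(human, gold)) / accuracy(human, gold)
```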
The use of a computerized system to provide a quality standard for Wikipedia entries would remove the subjectivity of having people classify each entry. That could raise standards and provide a basis for an improved reputation for the online encyclopedia.
* The term “Bayesian” refers to the 18th-century mathematician and theologian Thomas Bayes, who provided the first mathematical treatment of a non-trivial problem of Bayesian inference. — Wikipedia (hopefully, that entry itself ranks as high quality.)
Abstract of International Journal of Information Quality paper
As the largest free user-generated knowledge repository, Wikipedia has drawn great attention to its data quality in recent years, and automatic assessment of a Wikipedia article’s data quality is a pressing concern. We observe that every Wikipedia quality class exhibits specific characteristics along different first-class quality dimensions, including accuracy, completeness, consistency and minimality. We propose to extract quality-dimension values from an article’s content and editing history using a dynamic Bayesian network (DBN) and information-extraction techniques. Next, we employ multivariate Gaussian distributions to model the quality-dimension distributions of each quality class, and combine multiple trained classifiers to predict an article’s quality class, which distinguishes different quality classes effectively and robustly. Experiments demonstrate that our approach achieves good performance.