The purpose of this study was to evaluate the effectiveness of Wikipedia’s premier internal quality control mechanism, the “featured article” process, which assesses articles against a stringent set of criteria. To this end, scholars were asked to evaluate the quality and accuracy of Wikipedia featured articles within their area of expertise. A total of 22 usable responses were collected from a variety of disciplines. Out of the Wikipedia articles assessed, only 12 of 22 were found to pass Wikipedia’s own featured article criteria, indicating that Wikipedia’s process is ineffective. This finding suggests both that Wikipedia must take steps to improve its featured article process and that scholars interested in studying Wikipedia should be careful not to naively believe its assertions of quality.
Since its founding in 2001, the online encyclopedia Wikipedia has produced a phenomenal quantity of articles, including more than three million in its English language version. Wikipedia has also garnered some recognition for articles of high quality, most notably in 2005 when Nature declared that Wikipedia was “nearly as accurate as Britannica” on many scientific topics (Giles, 2005). On the other hand, Wikipedia is often plagued by tremendous failures of quality. For example, in 2005, Wikipedia’s biography of the journalist John Seigenthaler alleged that Seigenthaler had been involved in the assassination of John Kennedy (Seelye, 2005), and in 2009, fake quotes inserted into the biography of composer Maurice Jarre were picked up by mainstream media organizations and included in several obituaries (Fitzgerald, 2009). Other articles in Wikipedia are simply made up. For example, an article on the entirely imagined Baldock Beer Disaster (purportedly a tragic brewery accident) was briefly featured on Wikipedia’s main page (“Recent additions,” 2005 and “Articles for deletion,” 2007).
The featured article process is characterized by a complex bureaucracy and set of criteria (with four major requirements and seven sub–requirements). In order for an article to become “featured” it must be evaluated by a number of other Wikipedia contributors on the basis of those criteria. If a group of Wikipedia contributors agree that the article is of “featured” quality, the director of the process or one of his deputies officially awards “featured” status, and the article is recognized with a small bronze star in the upper left–hand corner. Featured articles are also eligible to be prominently displayed on the site’s main page as “Today’s featured article.”
The standards for featured articles are quite exacting. For example, the article must be “a thorough and representative survey of the relevant literature on the topic”, and its prose must be “engaging, even brilliant, and of a professional standard” (“Featured article criteria,” 2009). As a result of these stringent criteria, at the time of this writing only 2,615 articles (fewer than one in 1,000) had achieved featured status on Wikipedia. Because of the perceived strengths of the featured article process, many earlier authors have accepted, for the purpose of their research, that the featured articles are of high quality, including Huberman and Wilkinson (2007), Poderi (2009), and Blumenstock (2008). Like nearly all aspects of Wikipedia, however, the featured article process relies on anonymous volunteers, and there is no reason to assume that these individuals are in any way qualified to judge whether or not an article is in fact of high quality.
All of this, of course, raises a question: does the featured article process actually work?
In order to assess the effectiveness of Wikipedia’s featured article process, I contacted a number of subject matter experts by e–mail and asked each of them to assess a featured article on Wikipedia. The articles to be assessed were selected randomly, though I discarded those articles for which no qualified expert could be found (most such articles were those which lie far outside the boundaries of traditional academic inquiry). Each expert was asked to comment on the general quality and accuracy of the article he or she was assessing, to comment specifically on whether it satisfied Wikipedia’s own featured article criteria, to compare the article to other materials, and to rate the article on a scale of one to 10 (where 10 is best). The numerical rating was not directly connected to Wikipedia’s own criteria; instead, expert reviewers were asked to rate, in their own opinion, the “overall” quality of each article.
In all, I contacted 160 experts and received 22 usable evaluations of Wikipedia articles, spanning a wide variety of disciplines. All of the reviewers were initially contacted during the months of August and September 2009 and all responses were received by the beginning of October 2009.
The evaluation results demonstrate a fundamental unevenness of quality among Wikipedia’s featured articles. Some of the articles assessed proved to be quite excellent. For example, Charles Esdaile (Professor of History at the University of Liverpool and author of several books on the Napoleonic Wars, including Fighting Napoleon: Guerrillas, Bandits and Adventurers in Spain, 1808–1814) wrote of the article on the Battle of Barossa, “I am glad to say that it is a very complete account which contains no obvious errors and is certainly more detailed than [its] rivals; moreover, it is both well written and accessible.” On the other hand, Grigory Ioffe (Professor of Geography at Radford University and the author of Understanding Belarus and How Western Policy Misses the Mark), wrote of the article on Belarus, “This is a piece of immature writing unusual even for the Wikipedia.”
Overall, of the 22 articles assessed, the expert reviewers found that 12 (54.5 percent) clearly passed Wikipedia’s criteria for a featured article. Another seven clearly failed the criteria (31.8 percent), and the remaining three were borderline cases.
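The proportions above can be checked with a short calculation. The following is a minimal sketch in Python; the counts come from the text, while the 95 percent Wilson score interval is an addition not reported in this study, included only to illustrate how much uncertainty a sample of 22 carries. The function names are hypothetical.

```python
from math import sqrt

# Counts reported in the text: 22 articles, 12 passed, 7 failed, 3 borderline.
TOTAL = 22
counts = {"passed": 12, "failed": 7, "borderline": 3}

def percent(count, total):
    """Share of the total as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95 percent coverage. This interval is an
    illustrative assumption, not part of the original analysis.
    """
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

shares = {label: percent(c, TOTAL) for label, c in counts.items()}
low, high = wilson_interval(counts["passed"], TOTAL)

print(shares)
print(f"pass rate interval: {low:.2f} to {high:.2f}")
```

With so few articles, the interval around the pass rate is wide, which is worth keeping in mind when reading the percentages.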
As for the numerical quality score I requested, the scores assigned ranged from one to nine (no article received a perfect 10) and averaged a seven. In the table below, articles are ordered by the score they received in these assessments.
It is worth noting that many of the articles assessed did score quite well, proving that Wikipedia’s contributors can produce very good articles. The articles receiving lower scores, however, show quite convincingly that Wikipedia’s attempt at quality control is failing. Even among those articles that scored highly, there was room for improvement. For example, David Archer (Professor of Geophysical Sciences at the University of Chicago and the author of Global Warming: Understanding the Forecast), scored the article on global warming at an eight and wrote that it was “very concise and clear”, but remarked that he could tell “it was not written by professional climate scientists” and noted an error in the way the article explained how clouds are included in climate models. Similarly, Jan Kubik (Associate Professor of Political Science at Rutgers and the author of The Power of Symbols Against the Symbols of Power: The Rise of Solidarity and the Fall of State Socialism in Poland) delivered a favorable review of the article “History of Solidarity”, scoring it at a nine, but noted three small errors in it.
Among the articles that did not score as well, several of the expert reviewers compared the articles to the work of high school students or university undergraduates. For example, Malcolm Rohrbough (Emeritus Professor of History at the University of Iowa and author of Days of Gold: The California Gold Rush and the American Nation) wrote that the article on the California Gold Rush was “written at about the level of a junior in high school.” Several others pointed to the problems associated with non–expert authors, noting that the sources used were poorly selected and not representative of the broader literature. Stephen Turner (Professor of Philosophy at the University of South Florida and editor of the Cambridge Companion to Max Weber) wrote that Wikipedia’s account of Max Weber was “misleading, full of errors or at least problematic claims” and “odd judgments” and wrote that the sources used were “pretty strange.”
As for my final question, in which I asked reviewers to compare the Wikipedia entry to other materials (particularly those available online), most reviewers professed ignorance as to exactly what other materials could be found online. Of those who did comment on the subject, seven out of nine responded that the Wikipedia article was the best or one of the best available online, but in several cases, this said more about the other materials online than about the Wikipedia article. For example, Thomas Saine (Professor Emeritus of German at the University of California Irvine and author of Georg Forster) wrote of the article Georg Forster, “It is certainly the best I have encountered online. Not that that says much.”
In expert evaluations, nearly one–third of the featured articles assessed were found to fail Wikipedia’s own featured article criteria. As such, the featured article process can hardly be considered successful. It is not especially surprising that Wikipedia’s non–expert contributors are not able to adequately assess the quality of featured articles. Other research into quality on Wikipedia suggests that the participants in the featured article process apply rather unsophisticated criteria to their decisions.
Blumenstock (2008), for example, showed that article length (as measured by word count) is an excellent predictor of whether an article is featured or not (identifying featured articles with 96.3 percent accuracy), a method more accurate than many more complicated measures. While this can be taken as evidence that Wikipedia’s longest articles are its best, it seems more likely, in light of the evidence presented here, that the featured article process (due to the non–expert nature of its participants) focuses on easily measured attributes like length rather than on actual quality judgments. It is easier for a non–expert to judge length than true comprehensiveness.
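To make the idea of a length-based predictor concrete, the following is a minimal sketch in Python of a word-count threshold classifier. It is not Blumenstock’s implementation: the cutoff value, the function names, and the toy sample are all hypothetical and chosen only for illustration.

```python
def word_count(text):
    """Count whitespace-separated tokens in an article's text."""
    return len(text.split())

def predict_featured(text, threshold):
    """Predict 'featured' whenever the article is longer than the cutoff."""
    return word_count(text) > threshold

def accuracy(samples, threshold):
    """Fraction of (text, is_featured) pairs the length rule labels correctly."""
    correct = sum(
        predict_featured(text, threshold) == is_featured
        for text, is_featured in samples
    )
    return correct / len(samples)

# Hypothetical toy data: (article text, whether it is actually featured).
samples = [
    ("word " * 3000, True),
    ("word " * 150, False),
    ("word " * 2500, False),  # long but not featured: the rule gets this wrong
    ("word " * 4000, True),
]

# Hypothetical cutoff, assumed here purely for the example.
THRESHOLD = 2000
print(accuracy(samples, THRESHOLD))
```

The rule never inspects what an article says, which is the point of the argument above: a length check can agree with featured status without measuring comprehensiveness.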
For Wikipedia, then, it seems that if the featured article process is to serve as an effective means of quality control, it must be changed. The most obvious way to improve the process would be to include the input of experts. Wikipedia contributors have a tendency to reject the input of outsiders and to suggest that experts will not work for free, but involving a few outside reviewers in the featured article process would not be especially difficult. Over the first eight months of 2009, Wikipedia identified an average of 44 featured articles per month, which is hardly an overwhelming number to review. More importantly, given Wikipedia’s importance as a source of information for the general public, scholars are coming to recognize that they need to be concerned with its content. For example, an editorial in Nature called on scientific researchers to “read Wikipedia cautiously and amend it enthusiastically” (Nature, 2005). Laurent and Vickers (2009) suggest that medical doctors should do the same, in order to ensure that good health information is available to patients. Given these views from experts, and the small number of featured articles, it seems quite possible that experts could be involved in the process, improving its effectiveness. Furthermore, involving expert reviewers in the featured article process could serve as a gateway to introduce more scholars to Wikipedia.
For future scholars, the data presented here should serve as a caution when undertaking research into quality on Wikipedia. Previous efforts to determine what leads to the production of high–quality content in Wikipedia have often (as with Poderi (2009) or Huberman and Wilkinson (2007)) assumed that featured articles are of high quality, or at least represent the best of Wikipedia. If featured articles are not in fact of high quality, then such research does nothing more than show what leads to the production of featured articles, which is hardly an interesting research program.
The fact that featured articles are not necessarily of high quality, however, does not necessarily suggest that they are no better than other articles on Wikipedia. It seems almost absurd, considering that more than one million of Wikipedia’s articles are “stubs” (short articles of only a few sentences), to suggest that featured articles are no better than average. On the other hand, there is reason to believe that Wikipedia’s featured articles are not much better than other reasonably developed articles on the site. Two other surveys, both done by newspapers, have asked experts to grade Wikipedia entries on a scale of one to 10 (one by the Guardian and one by the Mail and Guardian; see van Noort, 2005). Together, these two studies produced assessments of 15 non–featured articles. The scores of these articles averaged a 6.2 (only slightly lower than the average of 7 for the featured articles evaluated for this paper) and two of the articles from these studies received a score of 10. Out of the 15 articles evaluated in these studies, seven scored a seven or better, indicating that they are comparable in quality to the average featured article. To put it simply, being a featured article may not mean much at all. Thus I suggest that rather than accepting Wikipedia’s assertion that its featured articles are the best, future scholars should use a more sophisticated approach.
About the author
David Lindsey is a student in the Edmund A. Walsh School of Foreign Service at Georgetown University and a frequent contributor to Wikipedia.
I would like to extend my deep gratitude to the scholars who gave generously of their time to participate in this study. Without their help, this project would not have been possible. They are as follows: David Archer, University of Chicago; Charles Esdaile, University of Liverpool; Hugh Hudson, University of Melbourne; Scott Hughes, MIT; Grigory Ioffe, Radford University; Jan Kubik, Rutgers University; Mark Lytle, Bard College; Bill Miscamble, University of Notre Dame; Ronald Paulson, Johns Hopkins University; James Pritchett; William Rebeck, Georgetown University; Malcolm Rohrbough, University of Iowa; Thomas Saine, University of California — Irvine; Nathan Schmiedicke, St. Charles Borromeo Seminary; Niall Sharples, Cardiff University; James Siddons; Martin Staab, Georgetown University; Richard Squier, Georgetown University; Stephen Turner, University of South Florida; Amy Werbel, St. Michael’s College; Daniel Wilson, Muhlenberg College; and, John van Wyhe, University of Cambridge.
1. Giles 2005, p. 900.
2. “Featured article criteria,” 2009.
3. Nature, 2005. “Wiki’s wild world,” p. 890.
“Articles for Deletion/Baldock Beer Disaster,” 2007. Wikipedia, at http://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion/Baldock_Beer_Disaster, accessed 3 April 2010.
Joshua Blumenstock, 2008. “Size matters: Word count as a measure of quality on Wikipedia,” WWW ’08: Proceedings of the 17th International Conference on the World Wide Web, pp. 1,095–1,096, and at http://www2008.org/papers/pdf/p1095-jblumenstock.pdf, accessed 3 April 2010.
“Can you trust Wikipedia,” 2005. Guardian (24 October), at http://www.guardian.co.uk/technology/2005/oct/24/comment.newmedia, accessed 3 April 2010.
“Featured article criteria,” 2009. Wikipedia, at http://en.wikipedia.org/wiki/Wikipedia:Featured_article_criteria, accessed 3 April 2010.
“Featured articles,” 2009. Wikipedia, at http://en.wikipedia.org/wiki/Wikipedia:Featured_articles, accessed 3 April 2010.
Shane Fitzgerald, 2009. “Lazy journalism exposed by online hoax,” Irish Times (7 May), at http://www.irishtimes.com/newspaper/opinion/2009/0507/1224246059241.html, accessed 3 April 2010.
Jim Giles, 2005. “Internet encyclopaedias go head to head,” Nature, volume 438, number 7070 (15 December), pp. 900–901, and at http://www.nature.com/nature/journal/v438/n7070/full/438900a.html, accessed 3 April 2010.
B.A. Huberman and D.M. Wilkinson, 2007. “Assessing the value of cooperation in Wikipedia,” First Monday, volume 12, number 4 (April), at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1763/1643, accessed 3 April 2010.
Michaël R. Laurent and Tim J. Vickers, 2009. “Seeking health information online: Does Wikipedia matter?” Journal of the American Medical Informatics Association, volume 16, number 4, pp. 471–479, and at http://www.jamia.org/cgi/content/abstract/16/4/4713, accessed 3 April 2010.
Nature, 2005. “Wiki’s wild world,” Nature, volume 438, number 7070 (15 December), p. 890, and at http://www.nature.com/nature/journal/v438/n7070/full/438890a.html, accessed 3 April 2010.
Elvira van Noort, 2005. “Can you trust Wikipedia,” Mail and Guardian (7 November), at http://www.mg.co.za/article/2005-11-07-can-you-trust-wikipedia, accessed 3 April 2010.
Giacomo Poderi, 2009. “Comparing featured article groups and revision patterns correlations in Wikipedia,” First Monday, volume 14, number 5 (May), at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2365/2182, accessed 3 April 2010.
“Recent additions,” 2005. Wikipedia, at http://en.wikipedia.org/wiki/Wikipedia:Recent_additions/2005/November, accessed 3 April 2010.
Katharine Q. Seelye, 2005. “Rewriting history; Snared in the Web of a Wikipedia liar,” New York Times (4 December), at http://www.nytimes.com/2005/12/04/weekinreview/04seelye.html, accessed 3 April 2010.
Paper received 13 October 2009; revised 10 March 2010; accepted 21 March 2010.
Copyright © 2010, First Monday.
Copyright © 2010, David Lindsey.
Evaluating quality control of Wikipedia’s feature articles
by David Lindsey.
First Monday, Volume 15, Number 4 - 5 April 2010