The Internet-based encyclopædia Wikipedia has grown to become one of the most visited Web sites on the Internet, but critics have questioned the quality of entries. An empirical study of Wikipedia found errors in a 2005 sample of science entries. Biased coverage and lack of sources are among the “Wikipedia risks.” This paper describes a simple assessment of these aspects by examining the outbound links from Wikipedia articles to articles in scientific journals, with a comparison against journal statistics from Journal Citation Reports such as impact factors. The results show an increasing use of structured citation markup and good agreement with citation patterns seen in the scientific literature, though with a slight tendency to cite articles in high-impact journals such as Nature and Science. These results increase confidence in Wikipedia as a reliable information resource for science in general.
Wikipedia’s popularity has been growing steadily, and the encyclopædia will probably become increasingly important as a vehicle for the dissemination of scientific research. But how can the articles of this freely edited Internet-based encyclopædia be trusted?
Inbound links can to some extent quantify the quality of a work; examples include Google’s PageRank for Web pages and the impact factor of scientific journals. The algorithms behind PageRank and Kleinberg’s HITS (Kleinberg, 1999) can be adapted to Wikipedia (Bellomi and Bonato, 2005), but it is not clear that there is an exact correlation between high quality of content and high rank. It has been suggested (Neus, 2001; Cross, 2006) that Wikipedia content surviving for a long period and many edits may be deemed of high quality. On the other hand, studies have found that highly edited articles are likely quality articles (Wilkinson and Huberman, 2007). Other proposals for quality assessment use revision history to compute a trust index for an article or an author reputation index (Zeng, et al., 2006; Adler and de Alfaro, 2007). Another feature of an article that may correlate with article quality is the amount of outbound citation to “trusted” material, such as scientific articles published in respected peer-reviewed journals. How prevalent are these citations, and does Wikipedia use them across scientific fields? Critics have noted that Wikipedia may be biased on the corpus level, leaning towards topics that interest the “young and Internet-savvy,” and a possible lack of sources has been noted (Denning, et al., 2005).
Extracting citations from Wikipedia
Contributors to Wikipedia can include scientific references by a variety of means, most simply by listing them at the end of a given article. A more structured approach uses the <ref> construct and the cite journal template, which allow for inline referencing and consistent formatting. A user of the cite journal template needs to fill out the appropriate bibliographic fields of the template, e.g., the fields for article title and journal name. The structured citation markup makes it relatively easy to extract bibliographic information and ask: How well do the outgoing scientific citations in Wikipedia compare with citations seen in traditional scientific journals?
To answer this question, Perl programs using regular expression matching extracted the journal titles from the cite journal template in all pages of the English version of Wikipedia, obtained as an XML database dump file. A small list was set up to match the different variations of journal titles, and the total number of citations was then counted for each individual journal. The Journal Citation Reports (JCR) for 2005 from Thomson Scientific provided statistics on citations for different scientific journals.
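The extraction step can be illustrated with a minimal sketch. It is written in Python rather than the Perl used for the study, and it is not the original program: the regular expressions, the alias list, the function names and the sample wikitext (including the article title "Some trial") are all assumptions made for illustration. It also ignores nested templates and piped wiki links, which a real dump would contain.

```python
import re
from collections import Counter

# Assumed patterns: match a {{cite journal ...}} template and its journal field.
# Nested templates and piped links such as [[Nature (journal)|Nature]] are
# not handled by this simplified sketch.
CITE_JOURNAL = re.compile(r"\{\{\s*cite[ _]journal\b(.*?)\}\}",
                          re.IGNORECASE | re.DOTALL)
JOURNAL_FIELD = re.compile(r"\|\s*journal\s*=\s*([^|}]*)", re.IGNORECASE)

# Hypothetical alias list mapping title variations to one canonical title.
ALIASES = {
    "bmj": "British Medical Journal",
    "jama": "Journal of the American Medical Association",
    "n engl j med": "New England Journal of Medicine",
}


def normalize(title):
    """Strip wiki markup and whitespace, then map known variants."""
    cleaned = title.replace("[[", "").replace("]]", "").strip(" '\t\n")
    cleaned = re.sub(r"\s+", " ", cleaned)
    return ALIASES.get(cleaned.lower(), cleaned)


def count_journals(wikitext):
    """Count citations per journal over all cite journal templates."""
    counts = Counter()
    for template in CITE_JOURNAL.finditer(wikitext):
        field = JOURNAL_FIELD.search(template.group(1))
        if field:
            title = normalize(field.group(1))
            if title:
                counts[title] += 1
    return counts


# Made-up wikitext for demonstration only.
sample = """
Text.<ref>{{cite journal |last=Giles |title=Internet encyclopaedias go head to head |journal=Nature |year=2005}}</ref>
More.<ref>{{Cite journal
 | title = Some trial
 | journal = ''BMJ''
 | year = 2006
}}</ref>
<ref>{{cite journal |title=Another |journal=British Medical Journal}}</ref>
{{cite news |title=Not counted |newspaper=The New York Times}}
"""

counts = count_journals(sample)
for journal, n in counts.most_common():
    print(journal, n)
```

In this sketch the alias list plays the role of the small list of title variations, so that "BMJ" and "British Medical Journal" are counted as the same journal, while the cite news template is skipped because only cite journal is matched.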
Results and discussion
The regular expression matched 30,368 outbound citations using the cite journal template in the database dump of 2 April 2007. The summary statistics for individual journals with the largest number of inbound citations from Wikipedia showed Nature (787), Science (669) and New England Journal of Medicine (NEJM) (446) at the top (number of citations in parentheses). A number of astronomy journals received many citations: Astrophysical Journal (424), Astronomy & Astrophysics (154), Icarus, International Journal of Solar System Studies (147) and Astronomical Journal (93). Apart from NEJM, other medical journals high on the list included The Lancet (268), Journal of the American Medical Association (JAMA) (217), British Medical Journal (187) and Annals of Internal Medicine (104). Some newspapers and non-scientific journals also received citations via the cite journal template, with The New York Times (69) among the most referenced. These non-scientific entries, as well as respected journals such as Scientific American and Physical Review (which, as a journal with variations in its title, may be referenced in several ways), were excluded, and the remaining values were correlated against numbers obtained from JCR (Figure 1). The Wikipedia citation numbers showed high correlation with JCR’s numbers for the total number of citations to a journal. Wikipedia’s citation numbers correlated less with JCR’s impact factor and JCR’s measure of the number of articles in a journal. At 47.4, the Annual Review of Immunology has the highest impact factor, but because it publishes few articles it receives relatively few citations both from scientific journals and from Wikipedia (18). The correlations depend on the number of journals included in the test, with the largest correlation observed for highly cited journals. This may simply reflect that journals with a small number of citations yield noisy and poor statistics.
In most cases the highest correlation could be obtained by multiplying the total number of citations by the impact factor, i.e., Wikipedia authors slightly overcite high-impact journals compared to JCR’s numbers. The high correlation among top-cited journals with this combined number means that the 10 journals with the highest value of this measure feature among the 19 most Wikipedia-referenced journals.
Figure 1: Correlations between citations to a journal from Wikipedia and from scientific journals. Kendall’s rank correlation (a) and its associated P-value (b) as a function of the number of journals included in the test; e.g., the value at 80 shows the correlation between Wikipedia citations and JCR numbers for the 80 journals most cited from Wikipedia. The number of citations from Wikipedia is compared with three series of numbers from JCR and one derived series: the total citations to a journal, its impact factor, the number of articles, and the product of the total citations and impact factor.
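The rank correlation analysis behind Figure 1 can be sketched as follows. This is an illustration under stated assumptions, not the code used for the study: the journal records are invented toy values (labeled A to E, not JCR data), the function and field names are hypothetical, and the correlation is the simple tau-a variant of Kendall’s coefficient, which does not correct for ties. The associated P-value is not computed here.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def tau_for_top_n(records, n, measure):
    """Correlate Wikipedia citations with a JCR-style measure for the
    n journals most cited from Wikipedia."""
    top = sorted(records, key=lambda r: r["wiki"], reverse=True)[:n]
    wiki = [r["wiki"] for r in top]
    other = [measure(r) for r in top]
    return kendall_tau(wiki, other)


# Toy records with made-up numbers, for illustration only.
records = [
    {"name": "A", "wiki": 100, "total_cites": 5000, "impact": 10.0},
    {"name": "B", "wiki": 80, "total_cites": 4000, "impact": 8.0},
    {"name": "C", "wiki": 60, "total_cites": 4500, "impact": 2.0},
    {"name": "D", "wiki": 40, "total_cites": 1000, "impact": 3.0},
    {"name": "E", "wiki": 20, "total_cites": 2000, "impact": 1.0},
]

measures = {
    "total citations": lambda r: r["total_cites"],
    "impact factor": lambda r: r["impact"],
    "total citations x impact factor": lambda r: r["total_cites"] * r["impact"],
}

for label, measure in measures.items():
    print(label, tau_for_top_n(records, 5, measure))
```

Calling tau_for_top_n with increasing values of n gives the kind of curve plotted in panel (a), one series per measure.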
When individual journals are examined, Wikipedia citations to astronomy journals stand out compared to the overall trend (Figure 2). Australian botany journals also received a considerable number of citations, e.g., Nuytsia (101), in part due to a concerted effort for the genus Banksia, where several Wikipedia articles on Banksia species have reached “featured article” status. Computer and Internet-related journals do not get as many citations as one would expect if Wikipedia showed bias towards fields for the “Internet-savvy”: Communications of the ACM (34) is the most referenced in this analysis. Of the medical journals, BMJ received relatively many Wikipedia citations. Authors in general cite freely available articles more often (Lawrence, 2001), and this may be particularly true for authors of articles for this free encyclopædia. Since BMJ’s research articles are openly accessible, the journal may gain extra citations from this effect.
Figure 2: Comparison between citations from scientific journals and from Wikipedia. Scatter plot with each dot representing the target journal receiving the citations, with one axis representing the number of citations from Wikipedia and the other the product of two numbers: JCR total citations and impact factor. It shows the 100 most Wikipedia-referenced journals. Not all journal titles are shown in the plot.
Citing Wikipedia as an authoritative source may be questionable given the present state of debate about content in this online encyclopædia; some universities have even banned citations to Wikipedia (Cohen, 2007). But when citations to trusted material support its statements, Wikipedia may be valuable for background reading. The present number of structured outbound citations from Wikipedia is quite small compared to the total number of scientific citations found in current scientific literature. With this low number, dedicated enthusiasts can influence the statistics with relatively few edits, as with the content providing details on Australian botany. However, the use of the cite journal template has grown from zero in February 2005, when it was first introduced, to 19,066 in November 2006, 24,656 in February 2007, and a total of 30,368 citations in April 2007. Reference management software, such as Zotero (http://www.zotero.org/), includes functionality for handling Wikipedia citations. Thus the use of structured scientific citations in Wikipedia will very likely continue to grow and increasingly benefit researchers who look for well-organized pointers to original research.
About the author
Finn Årup Nielsen is a postdoc in Informatics and Mathematical Modelling at the Technical University of Denmark and the Neurobiology Research Unit at the Copenhagen University Hospital Rigshospitalet. He does neuroinformatics work in the Lundbeck Foundation Center for Integrated Molecular Brain Imaging.
Email: fn [at] imm [dot] dtu [dot] dk
I thank D. Balslev, R. Jesus and L.K. Hansen for discussions and the Lundbeck Foundation for support.
1. See, e.g., McHenry, 2004; Denning, et al., 2005; Giles, 2005.
B. Thomas Adler and Luca de Alfaro, 2007. “A content-driven reputation system for the Wikipedia,” 16th International World Wide Web Conference, 8–12 May 2007 (Banff, Alberta), at http://www2007.org/paper692.php, accessed 15 May 2007.
F. Bellomi and R. Bonato, 2005. “Network analysis of Wikipedia,” Proceedings of Wikimania 2005 — First International Wikimedia Conference, at http://www.fran.it/articles/wikimania_bellomi_bonato.pdf, accessed 15 May 2007.
Noam Cohen, 2007. “A history department bans citing Wikipedia as a research source,” New York Times (21 February), and at http://www.nytimes.com/2007/02/21/education/21wikipedia.html?ex=1329714000&en=156f770bd93c4fa0&ei=5088&partner=rssnyt&emc=rss, accessed 21 July 2007.
Tom Cross, 2006. “Puppy smoothies: Improving the reliability of open, collaborative wikis,” First Monday, volume 11, number 9 (September), at http://www.firstmonday.org/issues/issue11_9/cross/, accessed 15 May 2007.
Peter Denning, Jim Horning, David Parnas and Lauren Weinstein, 2005. “Wikipedia risks,” Communications of the ACM, volume 48, number 12 (December), p. 152. http://dx.doi.org/10.1145/1101779.1101804
Jim Giles, 2005. “Internet encyclopaedias go head to head,” Nature, volume 438, number 7070 (15 December), pp. 900–901.
Jon M. Kleinberg, 1999. “Authoritative sources in a hyperlinked environment,” Journal of the ACM, volume 46, number 5 (September), pp. 604–632; also at http://www.cs.cornell.edu/home/kleinber/auth.pdf, accessed 21 July 2007.
Steve Lawrence, 2001. “Free online availability substantially increases a paper’s impact,” Nature, volume 411, number 6837 (31 May), p. 521.
Robert McHenry, 2004. “The Faith-Based Encyclopedia,” TCS Daily (15 November), at http://www.techcentralstation.com/111504A.html, accessed 15 May 2007.
Andreas Neus, 2001. “Managing information quality in virtual communities of practice,” In: E. Pierce and R. Katz-Haas (editors). Proceedings of the Sixth International Conference on Information Quality at MIT, Boston: Sloan School of Management, at http://opensource.mit.edu/papers/neus.pdf, accessed 15 May 2007.
Dennis M. Wilkinson and Bernardo A. Huberman, 2007. “Assessing the value of cooperation in Wikipedia,” First Monday, volume 12, number 4 (April), at http://www.firstmonday.org/issues/issue12_4/wilkinson/, accessed 15 May 2007.
Honglei Zeng, Maher Alhossaini, Li Ding, Richard Fikes, and Deborah L. McGuinness, 2006. “Computing trust from revision history,” Proceedings of the 2006 International Conference on Privacy, Security and Trust (30 October), at http://ebiquity.umbc.edu/paper/html/id/320/Computing-Trust-from-Revision-History, accessed 15 May 2007.
Paper received 16 May 2007; accepted 20 July 2007.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License.
This work is licensed under a GNU Free Documentation License version 1.2.
Scientific citations in Wikipedia by Finn Årup Nielsen
First Monday, volume 12, number 8 (August 2007),