A universal, Internet-based, bibliographic and citation database would link every scholarly work ever written - no matter how published - to every work that it cites and every work that cites it. Such a database could revolutionize many aspects of scholarly communication: literature research, keeping current with new literature, evaluation of scholarly work, choice of publication venue, among others. Models are proposed for the cost-effective operational and technical organization of such a database as well as for a feasible initial goal: the semi-universal citation database.
By definition, however, a universal citation database would provide search results that are comprehensive and up-to-date. As soon as a paper is published in any form, its citation data would be incorporated in the universal database. Citation searches could thus return the latest working papers or technical reports in an area. With well-formulated citation searches on a universal database, it might often be the case that no further searching need be done at all. One could fail to find papers only if the authors themselves had failed to cite the relevant literature.
As well as locating the papers of interest, a citation database system can also provide useful ways of filtering them. For example, one may be interested only in papers that cite two specific earlier works, or ones that both cite a particular work and use certain keywords in their abstracts or titles. One could also use citation counts for filtering; papers would only be selected if they had been cited a certain number of times (or by a certain number of distinct authors). This could be particularly useful for general or initial reading in an area; effort could be concentrated on those papers deemed important by the research community through their cumulative citation counts.
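The filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Paper` record, its field names and the sample identifiers are hypothetical and are not part of any proposed database format.

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    ident: str                                  # hypothetical identifier
    title: str
    cites: set = field(default_factory=set)     # identifiers of works cited
    cited_by_count: int = 0                     # citations received so far

def filter_papers(papers, must_cite=(), title_keyword=None, min_citations=0):
    """Keep papers that cite every work in must_cite, contain the keyword
    in their title (if one is given) and meet the citation threshold."""
    required = set(must_cite)
    selected = []
    for p in papers:
        if not required <= p.cites:
            continue
        if title_keyword and title_keyword.lower() not in p.title.lower():
            continue
        if p.cited_by_count < min_citations:
            continue
        selected.append(p)
    return selected

papers = [
    Paper("P1", "Citation indexing revisited", {"A", "B"}, 12),
    Paper("P2", "Electronic journals", {"A"}, 30),
    Paper("P3", "Citation networks", {"A", "B", "C"}, 2),
]
both = filter_papers(papers, must_cite=["A", "B"])
popular = filter_papers(papers, must_cite=["A", "B"], min_citations=10)
```

Here `both` holds the papers that cite the two earlier works A and B, and `popular` narrows that set to papers cited at least ten times. Counting distinct citing authors instead of raw citations would need author data that this sketch does not model.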
Another intriguing possibility is the use of citation data in current awareness services. Scholars would be able to establish search and filtering criteria matching their general and specialized interests. As publications and citation data are entered, any new papers matching the criteria would automatically be brought to the scholar's attention, e.g., by e-mail.
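A current awareness service of this kind might be sketched as follows. The profile layout, the matching rule (a new record matches if it cites a watched work or its title contains a watched keyword) and the e-mail addresses are all assumptions made for illustration; delivery of the e-mail itself is left out.

```python
def matches(profile, record):
    """A record matches when it cites any watched work or when its
    title contains any watched keyword (assumed matching rule)."""
    cites_watched = bool(profile["watched_works"] & record["cites"])
    title = record["title"].lower()
    has_keyword = any(k in title for k in profile["keywords"])
    return cites_watched or has_keyword

def notify(profiles, record):
    """Return the addresses of scholars whose profile matches a new record."""
    return [addr for addr, prof in profiles.items() if matches(prof, record)]

profiles = {
    "alice@example.edu": {"watched_works": {"J-ACM-TOPLAS:11@194"}, "keywords": set()},
    "bob@example.edu": {"watched_works": set(), "keywords": {"citation"}},
}
new_record = {"title": "Garbage collection revisited",
              "cites": {"J-ACM-TOPLAS:11@194"}}
recipients = notify(profiles, new_record)
```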
Citation analysis also has important indirect effects on scholarly evaluation through citation-based journal rankings. Journal Citation Reports annually ranks journals based on such measures as journal impact factor (the average number of citations received per article published in the journal [ 7 ]), cited half-life (the number of years covered by the most recent half of a journal's citations) and others. These citation measures are often used by research librarians in making or recommending journal purchasing or cancellation decisions [ 3 ]. Scholars are often evaluated by the prestige of the journals in which they publish; both the citation rankings themselves and the availability of journals on library shelves are contributing elements in establishing journal prestige.
However, there are serious methodological issues in the application of citation analysis to scholarly evaluation. Many studies have been criticized on the grounds that citation counts are sensitive to "fads, foibles and popular trends in science" [ 14 ] and that simple-minded citation counts are often used without correlation of those counts with other relevant data [ 17 ]. Furthermore, use of the existing citation indexes tends to overemphasize the role of the particular journals indexed and devalue all other forms of scholarly communication.
These inadequacies may well be ameliorated, however, with the development of a freely-available universal citation database.
The universality of the database would value all forms of publication equally, allowing the impact of works to be judged without measurement bias imposed by the inclusion or non-inclusion in present academic indices. The free availability of the database would allow data therefrom to be easily correlated with other information relevant to the evaluation of scholarly works.
In response to this crisis, there have been a number of efforts to reform the world of scholarly publication by developing and promoting lower-cost alternatives to expensive print journals. In particular, electronic publication and dissemination of scholarly work has emerged as a serious and potentially lower-cost alternative to the conventional academic journal on paper. Indeed, there have been a number of efforts to establish fully-refereed electronic journals that are distributed free of charge via the Internet. A number of the first generation e-journals, such as EJournal and Postmodern Culture, started with a no-frills, text-only (ASCII) format [ 1, 12 ]. Subsequently, high quality typography started to appear in freely-distributed electronic journals; in particular, mathematics journals such as the Electronic Journal of Combinatorics and the New York Journal of Mathematics use typesetting conventions based on the TeX system with the American Mathematical Society extensions. With the development of the World-Wide Web, originally ASCII-based e-journals have generally developed enriched graphics and hypertext formats.
Subscription-based electronic journals are also beginning to be widely available. Some of the efforts in this area, such as the Chicago Journal of Theoretical Computer Science and First Monday, have been to develop new electronic-only journals with a modest subscription basis to provide for long-term stability of the journal. Other projects, such as the Johns Hopkins University Press Project Muse [ 13 ] and the electronic journals of the American Mathematical Society, have been converting existing print journals to electronic form with reduced prices for electronic-only subscription. Many of the major commercial publishers are now starting to experiment with electronic subscriptions, but without commitment to reduced pricing.
At the same time, there have been efforts to raise awareness of the serials crisis among faculty members and discourage publication in high-cost print journals. One study of physics journals showed an 80-fold variation in the per character cost of refereed physics journals [ 2 ]; the author and the American Physical Society were subsequently sued by one of the publishers whose journals were rated least cost-effective [ 15 ]. Some groups have proposed university copyright policies that require faculty members to consider the cost and accessibility of articles before transferring copyright to journal publishers [ 6 ]. Tenure and promotion policies have been proposed suggesting that faculty members be evaluated based on the perceived quality and contribution of a small number of publications rather than the usual metric of total publication count.
Nevertheless, reform in scholarly communication is slow to come. Although most electronic journals have sought to establish their credibility through strong peer review processes - indeed some have developed novel processes that can be considered a significant improvement over conventional practices [ 10 ] - refereed electronic journals have not yet become widely accepted as the equal of refereed print journals [ 11 ]. When academic careers are on the line, faculty members are slow to eschew the expensive but well-established print journals in favor of lower-cost but unproved alternatives such as electronic journals.
A universal citation database has significant potential to act as a catalyst for reform in scholarly communication by leveling the playing field between alternative forms of scholarly publication. This would happen in two important ways.
First, the citation database would ensure that publications in any form are equally visible (but not necessarily equally accessible) to the literature research process. Regardless of which publication venue an author chooses, all that she/he needs to do to make her/his work visible is to cite appropriate previous works. Publication venues would then compete on the important values that they bring to the publication process, such as refereeing standards, editorial control, quality of presentation, timeliness of dissemination and so forth. Publications would no longer enjoy an unfair competitive advantage simply by virtue of being indexed in a particular literature database.
The second way that a universal citation database would promote fairer competition among publication venues is by providing a method for evaluating the significance of individual papers independent of the publication venue chosen. University faculty members are often critically concerned with the recognition that their work receives because of its importance to the evaluation of their academic careers. Because the significance of papers is often judged solely by the perceived quality of the venues in which they are published, this encourages a very conservative approach to choice of publication venues. By providing citation data as an independent means of demonstrating the significance of a particular work, a universal citation database has the potential to encourage authors to choose publication venues for other qualities.
The costs of production for conventional literature indexes arise primarily from three sources. The first is simply the cost of assembling the basic bibliographic data for published works. This includes the not insignificant costs of acquiring the relevant publications as well as those of preparing and entering bibliographic data in the standard format of a particular index. The second cost area is that of adding classification, review or other information to the basic bibliographic data to aid in bibliographic search and evaluation. This includes the indexing of works by keyword, the classification of works by subject hierarchy and the independent analysis and review of works. The third cost area is that of marketing and distribution. With a universal bibliographic database, there are significant potential cost savings available in each area.
In particular, the cost of assembling basic bibliographic data could be reduced tremendously with a universal bibliographic and citation database organized along the lines suggested in this paper. With present literature indexes, there is a significant cost of post-publication bibliographic data entry and this cost is multiplied by the number of indexes that separately index a work. It is not uncommon to find a journal indexed by ten or more separate literature services. Under a universal database as proposed here, bibliographic data is contributed at source, that is, by authors and/or publishers. As we argue in Section II, the bibliographic data is already routinely developed at source; the cost of developing and contributing it in a standardized format can be made almost trivial and more than compensated by the increased value that a universal database represents. Indeed, in a thoroughly integrated system, bibliographic and citation data would be a derived product of the publication process and there would hence be a complete elimination of costs associated with manual entry of this data.
The reduction of data entry costs could be realized as savings to research libraries in two ways: cancellation of some literature indexes and lower prices for others. Cancellation would be appropriate and logical for those indexes that do not add much value beyond the assembly of basic bibliographic data; they become redundant with the availability of the universal database. Cancellation pressure should also ensure that the remaining indexes pass on data entry savings in the form of reduced subscription prices.
Additional cost reductions may be achieved by canceling indexes even if they do add value in the form of keyword- or classification-based indexes. A universal bibliographic and citation database will provide, at a minimum, searches by author and/or title word as well as citation searching. Arguably, citation searching on a universal database would be sufficiently powerful for most literature research activities. In this event, the value added by manually assembled keyword or classification indexes may not be sufficiently high to justify their continuation.
With the advent of a universal bibliographic and citation database then, significant cost savings for research libraries should be realized through the cancellation of a number of conventional literature indexes. It should also be possible to arrange for the distribution of the universal database at a cost less than, or comparable to, that of distribution of the canceled indexes. With low and unduplicated data entry costs for entry of basic bibliographic data and no costs associated with manual "value-added" indexing, the universal bibliographic database should cost much less than the literature indexes supplanted.
Of course, the success of such an approach requires the cooperation of individual authors who may at first consider it a requirement for extra work. However, there is good reason to believe that preparing bibliographic and citation data in this way could in fact be a time saver. At present, authors or their assistants must spend considerable time in selecting and formatting their bibliographic references for every paper they write. Even with the use of standard bibliographic software packages such as BibTeX, ProCite, or EndNote, considerable drudgery is involved. Once a paper has been accepted for publication, additional work is often required to make the citations and references consistent with the ordering and formatting conventions of a particular journal, although the existing bibliographic packages may take care of this in some cases.
Imagine instead that authors have at their disposal a bibliographic software package that is integrated with the universal citation and bibliographic database and that both of these are based on canonical citation identifiers.
These identifiers would uniquely identify particular scholarly works and allow for full retrieval of the bibliographic entry from the universal citation database. The bibliographic software package would accept these identifiers and would be able to generate bibliographies ordered and formatted according to the requirements of any particular publication venue. The package could also translate occurrences of the identifiers in the text of papers to the appropriate numeric or symbolic labels associated with the citations. Literature searches using the universal citation database would return the appropriate canonical identifiers automatically. Personal bibliographic databases would no longer need to include the full bibliographic data for each paper, but could simply be lists of canonical identifiers. The net effect should be to considerably simplify the preparation of bibliographies and reference lists for scholarly works.
We should also be aware that scholars will not always have access to their computers, so it is important that the canonical identifiers use a mnemonic scheme that works when reading printed materials and making notes with pencil and paper. For example, a technical report might be cited as U-SFraser-CMPT-TR:95-07, where the U designates a university publication, SFraser is a standard abbreviation for Simon Fraser University from the list published annually by the Association of Commonwealth Universities, CMPT is the department code used at Simon Fraser for the School of Computing Science, TR indicates that this publication is a technical report (vs. MSc and PhD for theses, for example) and 95-07 is the assigned number in the Computing Science Technical Report series. A journal citation might be in the form J-ACM-TOPLAS:11@194 for the first article that begins on page 194 of volume 11 of ACM Transactions on Programming Languages and Systems. Here, the initial J identifies the citation as a reference to a journal article, ACM is a code for the publisher (Association for Computing Machinery) and TOPLAS is an abbreviation for the journal (unique for the publisher). The use of a separate publisher code helps avoid conflicts between journal abbreviations, while keeping to standard abbreviations used in a discipline. For example, TOPLAS is widely accepted in the computing science community as the abbreviation of this particular journal, but may conflict with other uses of that abbreviation. Unfortunately, it may not be possible to create meaningful canonical abbreviations in every case; in particular, it is difficult to see how books can be uniquely identified without resorting to non-mnemonic designations. Nevertheless, it is certainly feasible to devise some scheme for uniquely identifying publications, and making it as mnemonic as possible seems highly desirable.
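To make the structure of the two example identifiers concrete, the following sketch splits them into their named parts. It handles only the two forms illustrated above; the function name, the returned field names and the error handling are assumptions, not a definition of the identifier scheme.

```python
def parse_identifier(ident):
    """Split a canonical identifier of one of the two example forms
    (university report or journal article) into named parts."""
    prefix, sep, locator = ident.partition(":")
    if not sep or not locator:
        raise ValueError(f"missing ':' locator in {ident!r}")
    parts = prefix.split("-")
    kind = parts[0]
    if kind == "U" and len(parts) == 4:
        # U-<institution>-<department>-<series>:<number>
        return {"kind": "university", "institution": parts[1],
                "department": parts[2], "series": parts[3],
                "number": locator}
    if kind == "J" and len(parts) == 3:
        # J-<publisher>-<journal>:<volume>@<first page>
        volume, at, page = locator.partition("@")
        if not at:
            raise ValueError(f"missing '@' page in {ident!r}")
        return {"kind": "journal", "publisher": parts[1],
                "journal": parts[2], "volume": int(volume),
                "page": int(page)}
    raise ValueError(f"unrecognized identifier form: {ident!r}")

report = parse_identifier("U-SFraser-CMPT-TR:95-07")
article = parse_identifier("J-ACM-TOPLAS:11@194")
```

A complete scheme would need further forms (books, conference papers, theses) that the text leaves open, so the parser rejects anything it does not recognize rather than guessing.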
With such a bibliographic software system for preparing bibliographies and references, the task of submitting the required bibliographic and citation data for each paper becomes almost trivial. All that need be done is to extract the list of citation identifiers used in the paper (with an appropriate software tool) and to submit this list together with the paper's bibliographic data and abstract. Still, some authors will have difficulties with these tasks, so it will be important to train appropriate support staff (reference librarians and/or computing services personnel) to provide help. With these facilities in place, an institutional requirement that authors submit bibliographic data in the necessary form is certainly not onerous and should be more than compensated by the assistance provided in preparing and formatting reference lists.
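The extraction step, together with the translation of identifiers to numeric labels, might look like the sketch below. It assumes, purely for illustration, that authors write identifiers inside square brackets in their manuscripts; the regular expression covers only identifiers shaped like the two examples given earlier.

```python
import re

# Assumed manuscript convention: [KIND-CODE-...:LOCATOR], e.g. [J-ACM-TOPLAS:11@194]
CITE = re.compile(r"\[([A-Z]-[A-Za-z0-9-]+:[A-Za-z0-9@-]+)\]")

def extract_identifiers(text):
    """List the distinct identifiers in order of first appearance."""
    seen = []
    for ident in CITE.findall(text):
        if ident not in seen:
            seen.append(ident)
    return seen

def number_citations(text):
    """Replace each identifier with a numeric label and return the
    rewritten text plus the identifier list to be submitted."""
    order = extract_identifiers(text)
    labels = {ident: i + 1 for i, ident in enumerate(order)}
    rewritten = CITE.sub(lambda m: f"[{labels[m.group(1)]}]", text)
    return rewritten, order

draft = ("Earlier work [J-ACM-TOPLAS:11@194] was extended in "
         "[U-SFraser-CMPT-TR:95-07]; see again [J-ACM-TOPLAS:11@194].")
numbered, submitted = number_citations(draft)
```

The list in `submitted` is what an author would send along with the paper's bibliographic data and abstract; `numbered` is the text as a venue using numeric citations would print it.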
Although the requirement for preparation of the bibliographic and citation data has been suggested as an institutional requirement on authors, other mechanisms appropriate to a particular institution could be employed instead. The key point is that submission of the publication data be required at some level of institutional operation to ensure the universality of the database. Application of this requirement to authors, however, seems consistent with typical university practices in requiring faculty members to submit publication data for academic career evaluation and publicity purposes.
The model suggested here is but one of many possible ways of implementing a universal citation database. Other possibilities may involve more participation on the part of academic societies, research libraries, and/or publishers in data preparation. However, the development of data by the institutions that originate scholarly work has a number of advantages that should be considered in any scheme. First of all, preparation of the citation and bibliographic data, albeit in a different form, already takes place at the originating institutions as part of the process of writing a paper and including references. Regenerating this data at some later stage in the publication cycle would seem to be an unnecessary duplication of effort. Secondly, scholars will be able to include in the database all of their works that they deem of interest, regardless of publication venue. This freedom to have works included in the database seems highly appropriate in view of the potentially heavy use of the citation database for academic career evaluation. Thirdly, authors can also ensure the most rapid possible dissemination of their work through timely data preparation. Finally, development of data by the originating institutions can be carried out even if some publishers refuse to participate; in this way, the originating institutions regain a measure of control over their publications to balance the loss of copyright typically required for journal publication.
A natural and efficient approach to organizing the bibliographic database - for serial publications at least - is to distribute the data by publication, i.e., journal, conference, technical report series, and so on. Data for each publication would be provided by a logically distinct server on the Internet. Given any canonical citation identifier, a master index would allow the corresponding publication and its server (and perhaps backup servers) to be identified. Local copies of the master index will allow most of the publication server look-up traffic to be kept off the Internet. If the server for a particular publication changes, the original server would be expected to forward requests until local indexes are updated. For new publications not yet registered in the local copy, a query of the Internet master index server may be required.
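The look-up path just described can be sketched as follows. The assumption that everything before the colon in an identifier names the publication, the dictionary-based indexes and the server addresses are all hypothetical; `query_master` stands in for a network query to the Internet master index server.

```python
def publication_key(ident):
    """Assumed rule: the part before ':' names the publication."""
    return ident.partition(":")[0]

def resolve_server(ident, local_index, query_master):
    """Consult the local copy of the master index first; fall back to
    the Internet master index only for unregistered publications."""
    key = publication_key(ident)
    if key in local_index:
        return local_index[key]
    return query_master(key)

# Hypothetical data: the local copy predates the technical report series.
local_index = {"J-ACM-TOPLAS": "toplas.example.org"}
master = {"J-ACM-TOPLAS": "toplas.example.org",
          "U-SFraser-CMPT-TR": "cmpt-tr.example.org"}
remote_queries = []

def query_master(key):
    remote_queries.append(key)      # record traffic that leaves the site
    return master.get(key)

known = resolve_server("J-ACM-TOPLAS:11@194", local_index, query_master)
new = resolve_server("U-SFraser-CMPT-TR:95-07", local_index, query_master)
```

Only the second look-up generates Internet traffic, which is the point of keeping local copies of the master index. Forwarding by a publication's former server is not modeled here.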
Each publication server will provide both the basic bibliographic data and the citation data (if available) for articles published in the corresponding serial. The citation data will include both citations made and citations received by the article. Initially, a complete record of basic bibliographic data should be created at the time a canonical citation identifier is registered for an article. Ideally, the list of citations made in the article (i.e., canonical citation identifiers of cited articles) will also be entered at this time. In practice, it will be necessary to allow deferred entry of citation data. The field for citations received by an article will be initially set to empty. This field will be updated from time to time as citations from other articles become known. These updates will be received from the publication servers of the citing articles as they enter data for those articles. Thus, each time a publication server adds or amends the list of articles cited for a given article, it should transmit that article's citation identifier to the publication servers for each of the cited articles. The Internet traffic generated by this process will be quite modest: short messages (containing only canonical identifiers of citing and cited works) typically sent from a citing server to a small number of servers for cited works.
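A minimal in-memory sketch of this update flow is given below. The class, its method names and the `route` function (standing in for the master index look-up) are assumptions for illustration; real servers would exchange short network messages rather than call each other directly.

```python
class PublicationServer:
    """Holds records for one publication (hypothetical layout)."""

    def __init__(self, name):
        self.name = name
        self.records = {}   # identifier -> bibliographic and citation data

    def register(self, ident, biblio):
        # Citations made may be entered later; citations received start empty.
        self.records[ident] = {"biblio": biblio, "cites": set(), "cited_by": set()}

    def set_citations(self, ident, cited, route):
        """Record the citations made by ident, then send the citing
        identifier to the server of each cited work."""
        self.records[ident]["cites"] = set(cited)
        for target in cited:
            server = route(target)
            if server is not None:
                server.receive_citation(citing=ident, cited=target)

    def receive_citation(self, citing, cited):
        if cited in self.records:
            self.records[cited]["cited_by"].add(citing)

toplas = PublicationServer("J-ACM-TOPLAS")
reports = PublicationServer("U-SFraser-CMPT-TR")
servers = {"J-ACM-TOPLAS": toplas, "U-SFraser-CMPT-TR": reports}

def route(ident):
    # Assumed rule: the part before ':' names the publication.
    return servers.get(ident.partition(":")[0])

toplas.register("J-ACM-TOPLAS:11@194", {"title": "An earlier article"})
reports.register("U-SFraser-CMPT-TR:95-07", {"title": "A later report"})
reports.set_citations("U-SFraser-CMPT-TR:95-07", ["J-ACM-TOPLAS:11@194"], route)
```

Each message carries only two identifiers, which is why the traffic stays modest. The sketch does not handle the withdrawal of a citation when a list is amended; that would need a second message type.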
With the distribution of citation data from citing servers to cited servers, citation searching on the Internet becomes quite efficient. For example, to find all works that cite at least two of a set of four articles, queries are sent to the servers for each of the articles. Each server returns a short message containing lists of canonical citation identifiers of citing works. The client software will then determine which citation identifiers occur in two or more of the lists and retrieve the full bibliographic data for only those identifiers. This approach generally avoids transmission of full bibliographic records until any citation logic which may reduce the retrieval set has been applied.
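The client-side step in this example, finding identifiers that occur in two or more of the returned lists, can be sketched as below. The function name and the sample identifiers W1 to W4 are hypothetical; each inner list stands for the reply from the server of one of the four articles.

```python
from collections import Counter

def citing_at_least(cited_by_lists, k):
    """Return identifiers that appear in at least k of the lists of
    citing works, one list per queried article."""
    counts = Counter()
    for citing in cited_by_lists:
        counts.update(set(citing))   # count each work once per list
    return sorted(ident for ident, n in counts.items() if n >= k)

# Hypothetical replies from the servers of four articles.
responses = [
    ["W1", "W2"],
    ["W2", "W3"],
    ["W4"],
    ["W1", "W2"],
]
hits = citing_at_least(responses, 2)
```

Full bibliographic records would then be fetched only for the identifiers in `hits`, so the citation logic is applied before any large records cross the network.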
Of course, Internet traffic can be reduced even further when a local copy of the universal citation and bibliographic database is available. Even if Internet citation searching is used to obtain the most recent possible citations, the bulk of bibliographic record retrieval will normally take place from the local copy. However, it may also be reasonable to initially restrict citation searching to the local database alone. Before making an Internet query, users could be asked to analyze the results obtainable from a local search. These results should provide all relevant citing works up to the time that the local database copy was made. An option could be provided to update the results of a completed local search with information on the latest citations from the Internet. Indeed, it may be reasonable to restrict Internet access for certain groups of users (e.g., undergraduate students) to this form of query updating only.
In a minimal model, author/title and other forms of keyword searching would not be directly supported by the Internet component of the universal bibliographic and citation database. The reason is that such searches would be very costly in network resources; a keyword search could potentially match an article in any publication and hence would require that a query be issued to every publication server on the Internet. However, keyword searching could be supported indirectly as a filtering operation on a citation search; that is, once a set of records has been retrieved from the citation search, select only those records containing the keyword. Furthermore, general keyword searching could also be performed on the local copy of the universal database. This would miss some of the most recently published items, but would not be too different from keyword searching of present-day CD-ROM literature indexes.
Support for more general keyword and other forms of searching could also be provided through separately compiled subject indexes. For example, a particular scholarly society may want to create a subject index for a discipline X. It could establish a list of publications relevant to X and request that the publication servers for these publications forward bibliographic records as they are created. The scholarly society would be free to create any additional indexing information desired and could provide Internet access to the X subject index for accessing the most up-to-date material via keyword search.
Note that the results from different subject indexes could be freely combined under this model, providing that all search results are returned as lists of canonical citation identifiers. For example, one might be interested in interdisciplinary work involving concepts from two disciplines covered by different indexes. Searches of the disciplinary indexes could each identify papers relevant to one of the concepts; citation searching for papers that cite at least one paper in each list could provide a good starting point for the desired interdisciplinary literature.
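The interdisciplinary search in this example reduces to set operations on identifier lists, as the following sketch suggests. The `cited_by` mapping stands in for the citations-received data held by publication servers, and all identifiers are hypothetical.

```python
def citing_any(cited_by, works):
    """Union of the works that cite at least one work in the list."""
    result = set()
    for w in works:
        result |= cited_by.get(w, set())
    return result

def interdisciplinary(cited_by, list_a, list_b):
    """Works that cite at least one paper from each list."""
    return sorted(citing_any(cited_by, list_a) & citing_any(cited_by, list_b))

# Hypothetical citations-received data and subject-index results.
cited_by = {
    "A1": {"W1", "W2"},
    "A2": {"W2", "W4"},
    "B1": {"W3", "W4"},
    "B2": {"W1"},
}
index_a_hits = ["A1", "A2"]   # results from the first subject index
index_b_hits = ["B1", "B2"]   # results from the second subject index
starting_points = interdisciplinary(cited_by, index_a_hits, index_b_hits)
```

Because both indexes return plain lists of canonical identifiers, neither needs to know anything about the other for the combination to work.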
The model presented here for Internet operation of a universal citation database represents just one vision for how such a database could be organized. The key point is that organization by publication server allows efficient updating and access to the data by canonical citation identifier. Beyond this feasibility argument, however, there are many technical and institutional issues to be explored in the design and deployment of such a database.
Quite soon after initiation, however, a semi-universal citation database would have almost all of the benefits of a fully universal database. Consider the value of a semi-universal citation database five years past initiation. From each of the perspectives of literature research, evaluation of scholarly work and the reform of scholarly communication, this semi-universal citation database would be almost as effective as a fully universal database. From the literature research perspective, existing bibliographic databases would suffice for finding works published prior to the initiation date using standard searching techniques. More recent works could then be found using the semi-universal citation database. In evaluation of scholarly work, the semi-universal citation database helps with only the most recent five years of citations. But these include all the citations of recent work and all the recent citations of older work, precisely the citation information of greatest interest for evaluating academic careers (or departments or journals). Finally, in the reform of scholarly communication, citation databases have their value in modifying the publication behavior of scholars. Once initiated, a semi-universal citation database would have the same effect as a fully universal database in assuring scholars that their works - and the citation credits their works receive - are visible independent of the publication venue chosen.
Building upon existing sources of basic bibliographic data and instituting procedures for citation data collection for new works represents a realistic goal for initial citation database development. Once established, retrospective addition of historical citation data could be contemplated.
Another possibility is that some form of universal citation database will naturally grow out of World-Wide Web (WWW) developments on the Internet. Indeed, by using WWW "robots" to systematically explore Web space, prototype citation databases have already been created. At present, the usefulness of such databases is limited by several factors, including, in particular, the current implementation of WWW citations as uniform resource locators (URLs). URLs provide specific technical information about the protocol, computer address, port number and file location to be used in retrieving a document. As such, they cannot serve as permanent canonical identifiers of scholarly works. However, there are proposals to replace URLs with more abstract specifications based on uniform resource names (URNs) [ 18 ]. It is conceivable that a canonical naming scheme using URNs may serve as the basis of a useful citation database within the universe of WWW space. Nevertheless, it would be regrettable if such a development served to marginalize works because they are not published on the Web or overvalue works because they are.
An earlier version of this paper was published as TR 95-07 by the School of Computing Science, Simon Fraser University. This is a working draft; a revised version may be available at URL http://elib.cs.sfu.ca/project/papers/citebase/citebase.html. Multiple copies of this draft may be made for use in classrooms, discussion groups, or committee meetings, provided that notice of the intent and extent of the copying is sent to the author (e-mail is satisfactory). Archival copying of this preliminary version of the paper is not permitted. All copying requires that the integrity of the paper be preserved and that this copyright notice be reproduced in full.
2. Henry H. Barschall, 1988. "The cost-effectiveness of physics journals," Physics Today, Vol. 41, No. 7 (July), pp. 56-59. http://dx.doi.org/10.1063/1.881125
3. Robert N. Broadus, 1985. "A proposed method for eliminating titles from periodical subscription lists," College & Research Libraries, Vol. 46, No. 1 (January), pp. 31-35.
4. Tina E. Chrzastowski and Karen A. Schmidt, 1993. "Surveying the damage: Academic library serial cancellations 1987-88 through 1989-90," College & Research Libraries, Vol. 54, No. 2 (March), pp. 93-102.
5. Anthony M. Cummings, Marcia L. White, William B. Bowen, Laura O. Lazarus, and Richard H. Ekman, 1992. University Libraries and Scholarly Communication: A Study Prepared for the Andrew W. Mellon Foundation. Washington, D. C.: Association of Research Libraries, available at http://www.lib.virginia.edu/mellon/mellon.html
6. TRLN Copyright Policy Task Force, 1993. "Model university copyright policy regarding faculty publication in scholarly journals: A background paper and review of the issues," The Public-Access Computer Systems Review, Vol. 4, No. 4, pp. 4-25, available at http://info.lib.uh.edu/pr/v4/n4/trln.4n4
7. Eugene Garfield, 1972. "Citation analysis as a tool in journal evaluation," Science, Vol. 178, No. 4060 (November 3), pp. 471-479.
8. Philip Howard Gary, 1983. "Using science citation analysis to evaluate administrative accountability," American Psychologist, Vol. 38 (January), pp. 116-17. http://dx.doi.org/10.1037/0003-066X.38.1.116
9. Lowell L. Hargens, 1990. "Citation counts and social comparisons: Scientists' use and evaluation of," Social Science Research, Vol. 19 (January), pp. 205-221. http://dx.doi.org/10.1016/0049-089X(90)90006-5
10. Stevan Harnad, 1995. "Implementing peer review on the net: Scientific quality control in scholarly electronic journals," In: R. Peek and G. Newby, (eds.), Electronic Publishing Confronts Academia: The Agenda for the Year 2000. Cambridge, Mass.: MIT Press, available at ftp://princeton.edu/pub/harnad/harnad95.peer.review
11. Stephen P. Harter, 1996. "The impact of electronic journals on scholarly communication: A citation analysis," The Public-Access Computer Systems Review, Vol. 7, No. 5, pp. 5-34, available at http://info.lib.uh.edu/pr/v7/n5/hart7n5.html
12. Edward M. Jennings, 1991. "EJournal: An account of the first two years," The Public-Access Computer Systems Review, Vol. 2, No. 1, pp. 91-110, available at http://info.lib.uh.edu/pr/v2/n1/jennings.2n1
13. Susan Lewis, 1995. "From earth to ether: One publisher's reincarnation," Serials Librarian, Vol. 25, Nos. 3/4, pp. 173-180. http://dx.doi.org/10.1300/J123v25n03_19
14. D. Lindsey, 1989. "Using citation counts as a measure of quality in science: Measuring what's measurable rather than what's valid," Scientometrics, Vol. 15, Nos. 3-4, pp. 189-203. http://dx.doi.org/10.1007/BF02017198
15. Harry Lustig and Ken Ford, 1992. "Statement by the American Physical Society and the American Institute of Physics: Gordon and Breach press release is misleading," Newsletter on Serials Pricing Issues, Vol. 43, (August), article 2, available at ftp://ftp.lib.ncsu.edu/pub/stacks/prices/nspi-ns043
16. Kenneth E. Marks, Steven P. Nielsen, H. Craig Petersen, and Peter E. Wagner, 1991. "Longitudinal study of scientific journal prices in a research library," College & Research Libraries, Vol. 52, No. 2 (March), pp. 125-138.
17. A. Schubert and T. Braun, 1993. "Reference standards for citation based assessments," Scientometrics, Vol. 26, No. 1, pp. 21-35. http://dx.doi.org/10.1007/BF02016790
18. K. Sollins and L. Masinter, 1994. Functional requirements for uniform resource names. RFC 1737, Internet Engineering Task Force, (December), available at http://info.internet.isi.edu/in-notes/rfc/files/rfc1737.txt
19. Nicholas Wade, 1975. "Citation analysis: A new tool for science administrators," Science, Vol. 188, No. 4183 (May 2), pp. 429-432.
Copyright © 1997, First Monday
A Universal Citation Database as a Catalyst for Reform in Scholarly Communication by Robert D. Cameron
First Monday, volume 2, number 4 (April 1997),
URL: http://www.firstmonday.org/?journal=fm&page=article&op=view&path[]=522