The Google Books Project has drawn a great deal of attention, offering the prospect of the library of the future and rendering many other library and digitizing projects apparently superfluous. To grasp the value of Google's endeavor, we need, among other things, to assess its quality. On such a vast and undocumented project, the task is challenging. In this essay, I attempt an initial assessment in two steps. First, I argue that most quality assurance on the Web is provided either through innovation or through inheritance. In the latter case, Web sites rely heavily on institutional authority and quality assurance techniques that antedate the Web, assuming that they will carry across unproblematically into the digital world. I suggest that quality assurance in Google's Book Search and Google Books Library Project primarily comes through inheritance, drawing on the reputation of the libraries and, before them, the publishers involved. Then I choose one book to sample Google's Project, Laurence Sterne's Tristram Shandy. This book proved a difficult challenge for Project Gutenberg, but more surprisingly, it evidently challenged Google's approach, suggesting that quality is not automatically inherited. In conclusion, I suggest that a strain of romanticism may limit Google's ability to deal with that very awkward object, the book.
In trying to assess the quality of information available on the Internet, it can be helpful to consider two broad sources of quality assurance: innovation and inheritance. Innovative methods are those that, in the final analysis, could not have existed as they do now without the Internet itself. This category includes sources of quality that, while they may have existed in principle before, have been transformed by the new methods of communication. Thus, for example, peer review may be at least as old as the Royal Society, but the breadth and reach of peer review achieved in many Open Source software projects have never before been possible. In such cases, where quality depends to a significant degree on the number of participants corralled by the Net (a premise of Linus's Law (Raymond, 1998)), digital technology has introduced a quantitative revolution of multiple orders of magnitude.
Inheritance, by contrast, covers methods of quality assurance that existed before the Net and which the introduction of new technology has not significantly changed. In these cases, new means of communication still depend heavily on techniques developed by old institutions to convey authority, credibility, and quality in general. Online news services provide an example here. The sites of the New York Times, the BBC, or Le Monde are certainly creatures of the Net, but they bring with them both the reputation and many of the quality-assurance methods of news gathering that were established well before the Internet came into being. Clashes between the blogosphere and the conventional press often reflect the difference between these two views of quality, with many bloggers insisting that innovation should supersede inheritance, whereas conventional journalists hold that inheritance better assures quality.
In an earlier essay (Duguid, 2006), I discussed issues of quality through innovation. I asked, in particular, whether innovative quality assurance achieved in Open Source software writing was transferable to cultural open source projects, such as Gracenote, Wikipedia, or Project Gutenberg. In this essay I want to look much more briefly at the second issue, quality through inheritance, asking a slightly different question about transference: Is quality necessarily inherited when old institutions provide established content in new digitized forms, or may the process of migration too easily leave behind significant aspects of the quality it was presumed to be carrying along?
To address this question, I take the case of Google Book Search and its Book Library Project. The Project was made public in December 2004, and from the start many assumed that the qualitative effect would be cumulative. Libraries would provide collections assembled with well-honed skills of selection and preservation, while Google would add its remarkable technological expertise to prise the covers from the text, allowing us, as Geoffrey Nunberg says, to barrel sideways into books. This combination, it was hoped, would allow, as one newspaper claimed, the heft of these libraries to temper the wilder side of the Internet, while allowing Google to continue its mission of organizing the world's information, and helping users to search the full text of books to find ones that interest them. This is not to say that in this case the new merely sits atop the old. Google's implementation suggests that many conventional quality-assurance methods (reflected, for example, in metadata) can be innovatively replaced by means of search and search algorithms alone. Nonetheless, it is easy to assume that even with these changes, this combination will inherit quality from the library practices involved.
There are inevitably challenges to investigating this assumption. Google Books Library Project is not only vast; it is also mysteriously shrouded. Unlike the conventional library, Google's provides no catalog and does not even reveal how many books it contains. Testing for quality and understanding the significance of samples are thus difficult. Nonetheless, a start needs to be made. There have been rumblings around the Internet that quality is not particularly high. Robert Townsend, a blogger on the American Historical Association Web site, reported being deeply disconcerted, while the Wikipedia page on the Project links serendipitously to numerous defective page scans. Serendipity will always be a problem for attempts to detail the quality of a project whose central elements and organizing principles are hidden behind voluminous folds. It should not deter us. While many of the Project's strengths are in plain sight, we seem to have little option but to take its overall quality on faith (and on the reputation of the organizations involved) or to attempt, however ineptly, to discern, however blindly, the health of the overall elephant one toe at a time.
My approach to this question is simple, and in its simplicity prey, I am well aware, to all sorts of doubts about the choice, size, and significance of my sample; in short, about the quality of my own analysis. I shall not attempt to defend my methods other than to make the process as open to inspection as possible. Whatever my limits, I do suggest that the Google Books Project is now sufficiently advanced and sufficiently impressive to merit and to withstand serious questions about its reliability. The Project merits questioning because it brings together a company with a justifiably high reputation for data collection and mining and a growing number of institutions, again justifiably respected for their collections. In so doing, the Google Project has, however unintentionally, made not only conventional libraries themselves, but also other projects digitizing cultural artifacts, appear inept or inadequate. Project Gutenberg and its 17,000 books in ASCII appear insignificant and superfluous beside the millions of books that Google is contemplating. So do most scanning projects by conventional libraries. As a consequence of the assumed superiority of Google's approach, therefore, it is highly unlikely that either the funds or the energies for an alternative project of similar magnitude will become available, nor are the libraries that are lending their books (at significant costs to their funds, their books, and their users) likely to undertake such an effort a second time. With each scanned page, Google Books Library Project, by its quantity if not necessarily by its quality, makes the possibility of a better alternative unlikely. The Project may then become the library of the future, whatever its quality, by default. So it does seem important to probe what kind of quality the Google Books Project might present to the ordinary user whom Google envisages wanting to find a book.
For my probe, I have chosen the same book that I used to assess Project Gutenberg, The Life and Opinions of Tristram Shandy, Gentleman. Laurence Sterne's unconventional eighteenth-century book presented some significant challenges to Gutenberg's ASCII methods of presentation, but, in principle, as people often remind me, few to a scanning project like Google's. The Greek text, the footnotes, the black page, the blank page, or the marbled page, all of which were missing or compromised in the Gutenberg text, should scan without difficulty. In trying to assess different modes of presenting this book, however, we do need to remember that Sterne is known for his eccentricity. If the job is not well done, the putative reader in search of a book may have difficulty distinguishing modern technological limitations from Sterne's eighteenth-century typographical experimentations.
Results: Finding a cock and bull story 
We can begin, as the ordinary reader might, by typing Tristram Shandy, as the book is generally known, into the Google Book Project search box. This search returns the following page:
It's hard to know how Google sorts its results, but these results suggest that viewable text might have a high priority. Google clearly wants its users to be able to read the text they are searching for. The first link then leads to a Web page providing the opening page of this particular version of Sterne's novel, to which I'll return in a moment. The interface allows us not only to read on, by clicking forward from this page, but also to move to the front of the book, where we can get intimations of the inheritance of quality:
Harvard was an early partner in the Google Books Project and this is evidently a Harvard book, so we can return with some confidence to the opening page of Sterne's text:
To someone not sure what to expect from the eccentric Sterne and his typographic experiments, it may take a little time to realize that the book does not start with the word WISH. Rather, this version is missing the left-hand edge of the page and hence the opening word (not trivial in a splendidly egocentric book), "I". The problem is not unique, as an undaunted reader will find if he or she goes on:
Nor, as anyone who has clicked this far will know, is this a left-handed prejudice (or one related to the well-known scanning problem that books have gutters). Top and right-hand outer edges have their problems too, some of which are as remarkable as they are illegible:
By the time this page has been reached, the astute reader will also have noticed that the book has other quality-control problems. Famously, on the death of Parson Yorick, Sterne quoted Hamlet's phrase, "Alas, poor Yorick!", and inserted a black page of mourning. The version of Sterne's novel that Harvard offered and Google scanned evidently overlooked this iconic page, perhaps assuming it was an inky disaster in the print shop rather than part of the author's design. We can see the problem if we compare the Google page to the same page from the Penguin edition (Sterne, 1967):
Figure 6b: Tristram Shandy, Penguin edition.
Thus, where scanning should conquer the disparaged ASCII transcription of Project Gutenberg, in this case the text chosen by the scanning project and presented as first choice to the reader is so inadequate that there is little improvement for the ordinary reader, who can at least read the ASCII text.
At this point, a wise reader might abandon Google's first offering and go to its second (see Figure 1) and work with that. Here, if we again click back, we find a competing brand for the book, Stanford's.
Unfortunately, Stanford doesn't come out of this battle of giants much better. If we go to the first page of Sterne's text, skipping the introduction, we find the following:
This might appear to be yet another blank page amid the frontmatter (or even Sterne's own blank page; Tristram Shandy has one). By clicking one page on, we see it is not. Rather, this should be the first page of Sterne's text, but unfortunately its content is missing. The first page of Sterne's text that is not empty plunges reader and author, who struggled so hard to find his beginning, in medias res:
Given the difficulties with the Harvard text, we might persevere. Yet those who had ploughed as far as the second page of the Harvard text will notice a problem with Stanford's second. For, at the opening of chapter 2 and in contrast to Figure 9, Harvard's text gave the following:
An untutored reader might be able to tell that the first page of the book didn't open with the word WISH, but how would he or she know whether the second chapter opened:
"THEN, positively, there is nothing in the question that I can see, either good or bad." [Harvard]
Or "WHAT prodigious armies you had in Flanders!" [Stanford]?
This is clearly not a matter of overly subtle text editing. The problem becomes a little clearer if we click on the link "Table of Contents" that appears on the side of the Stanford pages (see Figure 9). The link doesn't actually take us to a table of contents:
Rather, it's a list of illustrations. (Google's no doubt intricate software clearly guessed wrong at this point about what kind of page it had before it.) But the list does at least tell us that this is volume II of Tristram Shandy. If, however, we miss this page and instead follow the "About this book" link at the side of the page, we are told only that this is
The Life and Opinions of Tristram Shandy,
Gentleman, by Laurence Sterne, Wilbur Lucius
Cross, Published 1904, J.F. Taylor &
Not a word is mentioned about multiple volumes or volume number. Indeed, a quick survey of the Google Book Project suggests that Google doesn't recognize volume numbers. Not only are the different editions (Harvard's from 1896, Stanford's from 1904) given exactly the same name, but also the different volumes of Stanford's multivolume edition are labeled identically. Consequently, whatever algorithm Google uses to find the book, it is quite likely, as in this case, to offer volume II first.
Cautious by now, we might take a look at the Penguin version of Tristram Shandy (Sterne, 1967), which, though presented as a single volume, acknowledges the divisions of the original edition. Unfortunately, there we discover that chapter 2 of volume II of Sterne's work begins presciently:
There is nothing so foolish, when you are at the expence of making an
entertainment of this kind, as to order things so badly, as to let your critics and
gentry of refined taste run it down.
Alas, poor Sterne! Evidently he had some premonition of what might become of his work. Like the Harvard edition, which ignored Sterne's black page, the Stanford work not only ignores Sterne's divisions, but introduces new ones of its own. Its chapter 2 has no bearing on Sterne's chapter 2 in either Volume I or any subsequent volume of the original text. This would matter little were it not that Sterne continuously refers back and forth to preceding or future pages, chapters, and books. Indeed, he even opens his second volume with an alert to his readers (and, perhaps, editors) that "I have begun a new book." That phrase is no doubt buried mystifyingly somewhere in the first volume of the Stanford edition which is, in turn, buried mystifyingly somewhere in Google Books Library Project.
Of course, it may be argued that as these editions are available on library shelves, there is no reason that they should not be available in digital libraries. But it isn't quite true that these are available on library shelves. A quick look at the online catalogue for Stanford's library shows that the Stanford volume presented as your second choice by Google Books is actually tucked away in the Stanford Auxiliary library along with infrequently used texts. An ordinary reader would have to work through some 80-plus catalogue entries and find their way into the auxiliary library before coming across this text (barriers enough to suggest treating this edition with caution), and even then he or she is likely to find the different volumes in the right order. Google may or may not be sucking the air out of other digitization projects, but like Project Gutenberg before it, it is certainly sucking better-forgotten versions of classic texts from justified oblivion and presenting them as the first choice to readers.
With this brief encounter between Google and Project Gutenberg in mind, we might take one more look at the links from the Google edition (Figure 8). These identify modern editions that the user might buy, another way in which Google distances itself from what it actually puts online. The first link offers an admirable, cheap Oxford edition at US$11.95. The discerning reader, however, might be seduced by the second, more expensive edition produced by Kessinger Publishing. This costs US$37.95, but as Kessinger produces rare reprints we might be tempted. Put down forty bucks and you will find yourself the owner of no less than an expensive dump of the Project Gutenberg edition, whose limitations many thought Google Books would inherently overcome. There's an elegant symmetry here, as the Kessinger edition, holding in June 2007 a rank close to 4,000,000 in order of popularity on Amazon, was also the first option Google's algorithm would mysteriously offer when Google Books initially went online.
Figure 12: Google search results.
The Google Books Project is no doubt an important, in many ways invaluable, project. It is also, on the brief evidence given here, a highly problematic one. Relying on the power of its search tools, Google has ignored elemental metadata, such as volume numbers. The quality of its scanning (and so we may presume its searching) is at times completely inadequate. The editions offered (by search or by sale) are, at best, regrettable. Curiously, this suggests to me that it may be Google's technicians, and not librarians, who are the great romanticisers of the book. Google Books takes books as a storehouse of wisdom to be opened up with new tools. They fail to see what librarians know: books can be obtuse, obdurate, even obnoxious things. As a group, they don't submit equally to a standard shelf, a standard scanner, or a standard ontology. Nor are their constraints overcome by scraping the text and developing search algorithms. Such strategies can undoubtedly be helpful, but in trying to do away with fairly simple constraints (like volumes), these strategies underestimate how a book's rigidities are often simultaneously resources deeply implicated in the ways in which authors and publishers sought to create the content, meaning, and significance that Google now seeks to liberate. Even with some of the best search and scanning technology in the world behind you, it is unwise to ignore the bookish character of books. More generally, transferring any complex communicative artifacts between generations of technology is always likely to be more problematic than automatic.
Finally, with regard to inheritance as a strategy for quality assurance, the question of quality in Google Books Library Project reminds us that the newer form is always in danger of a kind of patricide, destroying in the process the resources it hopes to inherit. This remains a puzzle, for example, for Google News. In its free provision of news, it risks undermining the income stream that allows the sources on which Google News relies for quality to survive. It may even be true, in a lesser way, for Google Books. Google relies here for quality assurance on the reputation of the grand libraries it has corralled for its project. Harvard and Stanford libraries certainly do not have their reputations enhanced by the dubious quality of Tristram Shandy, labeled with their names in the Google database. And Tristram Shandy is not alone. With each badly scanned page or badly catalogued book, Google threatens not only its own reputation for quality and technological sophistication, but also those of the institutions that have allied themselves to the project. The Google Book Project's Tristram Shandy may be, as Sterne said ruefully about his marbled page, the motley emblem of its work.
About the author
Paul Duguid is adjunct professor in the School of Information at the University of California, Berkeley; professorial research fellow at Queen Mary, University of London, where he was an ESRC-SSRC Visiting Fellow in the spring of 2005; and a research fellow at the Center for Science, Technology, and Society at Santa Clara University. He is also an honorary fellow of the Institute for Entrepreneurship and Enterprise Development at Lancaster University School of Management.
Email: duguid [at] ischool [dot] berkeley [dot] edu
Acknowledgments
This paper is based on a talk given to the Society of Scholarly Publishers, San Francisco, 6 June 2007. I am grateful to the Society for the invitation, and to Kathleen Vanden Heuvel and Andrew MacDiarmid for comments made on drafts of the paper.
Notes
1. For a summary of early enthusiasm see http://books.google.com/googlebooks/newsviews/media.html.
5. New partners are announced regularly. See, for example, http://www.cnn.com/2007/TECH/internet/06/08/big.ten.books.ap/index.html.
6. See Baker (2001) on the microfilming of libraries for a related argument about ways in which libraries locked themselves in to a high tech project with low quality. For an important, open alternative to Google's shrouded project, see the "Open Content Alliance": http://www.opencontentalliance.org/.
7. On the last page of the book, Tristram's mother asks what the book is about and Yorick replies: "A COCK and a BULL" (Sterne, 1967, p. 615).
8. All data gathered on 17 June 2007.
9. That confidence may be undermined if we note along the way two copies of the frontispiece. See http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPR2,M1 and http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPR4,M1.
11. I argued before (Duguid, 2006) that a minimum requirement of the Gracenote music database was that Act I of an opera should play before Act II. I don't think it raises the bar too much to ask of a digitized library that volume I appear before volume II.
14. There are actually more problems with the page shown in Figure 5c, which in the edition Google provides mixes body text and footnote text capriciously, than I care to enumerate, but one of the words lost in the blur is renfermé. If you search in this book for the word, Google can't find the match, suggesting that the character recognition, like the reader, suffers from the quality of the visible page.
15. For any reader who might want an online edition of Tristram Shandy, I would suggest avoiding both the Google and the Project Gutenberg texts. The editions on the Making of America Books site (University of Michigan) and at http://www2.hn.psu.edu/faculty/jmanis/l-sterne.htm both appear to offer the inheritance of academic credentials, but both are better avoided. The Making of America text (http://quod.lib.umich.edu/cgi/t/text/text-idx?c=moa;cc=moa;rgn=main;view=text;idno=AAN6595.0001.001) betrays itself in its presentation of the title of the book ( T R Ir STR A. 1 S E A N D Y) and the author (LAi1JRENCE $S E NE), while the Penn State edition is no more than the Gutenberg text, and so has managed to inherit all its quality. There is, however, an edition from the library of the University of California, Los Angeles, at the Open Content Alliance site (http://ia340914.us.archive.org/0/items/novelsoflaurence01steriala/novelsoflaurence01steriala.pdf), where the text seems reliable, the scanning error-free, and the metadata correct. And there has been a particularly elegant edition online from Gifu University in Japan since 1997 (http://www1.gifu-u.ac.jp/~masaru/TS/contents.html), where the HTML is very well done.
References
Nicholson Baker, 2001. Double Fold: Libraries and the Assault on Paper. New York: Random House.
Paul Duguid, 2006. "Limits of Self-Organization: Peer Production and the 'Laws of Quality'," First Monday, volume 11, number 10 (October), at http://www.firstmonday.org/issues/issue11_10/duguid/. http://dx.doi.org/10.5210/fm.v11i10.1405
Eric S. Raymond, 1998. "The Cathedral and the Bazaar," First Monday, volume 3, number 3 (March), at http://www.firstmonday.org/issues/issue3_3/raymond/.
Laurence Sterne, 1967. The Life and Opinions of Tristram Shandy, Gentleman. Graham Petrie, editor. Harmondsworth, Middlesex: Penguin Books. [First published, 1759-1767.]
Paper received 18 June 2007; revised 19 June 2007; accepted 15 July 2007.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License
Inheritance and loss? A brief survey of Google Books by Paul Duguid
First Monday, volume 12, number 8 (August 2007),