Inheritance and loss? A brief survey of Google Books

The Google Books Project has drawn a great deal of attention, offering the prospect of the library of the future and rendering many other library and digitizing projects apparently superfluous. To grasp the value of Google’s endeavor, we need, among other things, to assess its quality. On such a vast and undocumented project, the task is challenging. In this essay, I attempt an initial assessment in two steps. First, I argue that most quality assurance on the Web is provided either through innovation or through “inheritance.” In the latter case, Web sites rely heavily on institutional authority and quality assurance techniques that antedate the Web, assuming that they will carry across unproblematically into the digital world. I suggest that quality assurance in Google’s Book Search and Google Books Library Project primarily comes through inheritance, drawing on the reputation of the libraries and, before them, the publishers involved. Then I choose one book to sample Google’s Project, Laurence Sterne’s Tristram Shandy. This book proved a difficult challenge for Project Gutenberg, but more surprisingly, it evidently challenged Google’s approach, suggesting that quality is not automatically inherited. In conclusion, I suggest that a strain of romanticism may limit Google’s ability to deal with that very awkward object, the book.

Contents

Method: Searching for Shandy
Results: Finding a cock and bull story
Conclusion

In trying to assess the quality of information available on the Internet, it can be helpful to consider two broad sources of quality assurance: innovation and inheritance. Innovative methods are those that, in the final analysis, could not have existed as they do now without the Internet itself. This category includes sources of quality that, while they may have existed in principle before, have been transformed by the new methods of communication. Thus, for example, peer review may be at least as old as the Royal Society, but the breadth and reach of peer review achieved in many Open Source software projects have never before been possible. In such cases, where quality depends to a significant degree on the number of participants corralled by the Net (a premise of “Linus’s Law” (Raymond, 1998)), digital technology has introduced a “quantitative revolution” of multiple orders of magnitude.

Inheritance, by contrast, covers methods of quality assurance that existed before the Net and which the introduction of new technology has not significantly changed. In these cases, new means of communication still depend heavily on techniques developed by old institutions to convey authority, credibility, and quality in general. Online news services provide an example here. The sites of the New York Times, the BBC, or Le Monde are certainly creatures of the Net, but they bring with them both the reputation and many of the quality–assurance methods of news gathering that were established well before the Internet came into being. Clashes between the “blogosphere” and the conventional press often reflect the difference between these two views of quality, with many bloggers insisting that innovation should supersede inheritance, whereas conventional journalists hold that inheritance better assures quality.

In an earlier essay (Duguid, 2006), I discussed issues of quality through innovation. I asked, in particular, whether innovative quality assurance achieved in Open Source software writing was transferable to cultural “open source” projects, such as Gracenote, Wikipedia, or Project Gutenberg. In this essay I want to look much more briefly at the second issue, quality through inheritance, asking a slightly different question about transference: Is quality necessarily inherited when old institutions provide established content in new digitized forms, or may the process of migration too easily leave behind significant aspects of the quality it was presumed to be carrying along?

To address this question, I take the case of Google Book Search and its Book Library Project. The Project was made public in December 2004, and from the start many assumed that the qualitative effect would be cumulative [1]. Libraries would provide collections assembled with well–honed skills of selection and preservation, while Google would add its remarkable technological expertise to prise the covers from the text, allowing us, as Geoffrey Nunberg says, to “barrel sideways into books.” This combination, it was hoped, would allow, as one newspaper claimed, “the heft of these libraries” to temper the wilder side of the Internet, while allowing Google to continue its mission of “organizing the world’s information,” and helping users to “search the full text of books to find ones that interest” them [2]. This is not to say that in this case the new merely sits atop the old. Google’s implementation suggests that many conventional quality–assurance methods (reflected, for example, in metadata) can be innovatively replaced by means of search and search algorithms alone. Nonetheless, it is easy to assume that even with these changes, this combination will inherit quality from the library practices involved.

There are inevitably challenges to investigating this assumption. Google Books Library Project is not only vast, it is also mysteriously shrouded. Unlike the conventional library, Google’s provides no catalog and does not even reveal how many books it contains [3]. Testing for quality and understanding the significance of samples are thus difficult. Nonetheless, a start needs to be made. There have been rumblings around the Internet that quality is not particularly high. Robert Townsend, a blogger on the American Historical Association Web site, reported being “deeply disconcerted,” while the Wikipedia page on the Project links serendipitously to numerous defective page scans [4]. Serendipity will always be a problem for attempts to detail the quality of a project whose central elements and organizing principles are hidden behind voluminous folds. It should not deter us. While many of the Project’s strengths are in plain sight, we seem to have little option but to take its overall quality on faith (and on the reputation of the organizations involved) or to attempt, however ineptly, to discern, however blindly, the health of the overall elephant one toe at a time.

Method: Searching for Shandy

My approach to this question is simple, and in its simplicity prey, I am well aware, to all sorts of doubts about the choice, size, and significance of my sample: in short, about the quality of my own analysis. I shall not attempt to defend my methods other than to make the process as open to inspection as possible. Whatever my limits, I do suggest that Google Books Project is now sufficiently advanced and sufficiently impressive to merit and to withstand serious questions about its reliability. The Project merits questioning because it brings together a company with a justifiably high reputation for data collection and mining and a growing number of institutions again justifiably respected for their collections [5]. In so doing, the Google Project has, however unintentionally, made not only conventional libraries themselves, but other projects digitizing cultural artifacts appear inept or inadequate. Project Gutenberg and its 17,000 books in ascii appear insignificant and superfluous beside the millions of books that Google is contemplating. So do most scanning projects by conventional libraries. As a consequence of the assumed superiority of Google’s approach, therefore, it is highly unlikely that either the funds or the energies for an alternative project of similar magnitude will become available, nor are the libraries who are lending their books (at significant costs to their funds, their books, and their users) likely to undertake such an effort a second time [6]. With each scanned page, Google Books’ Library Project, by its quantity if not necessarily by its quality, makes the possibility of a better alternative unlikely. The Project may then become the library of the future, whatever its quality, by default. So it does seem important to probe what kind of quality Google Book Project might present to the ordinary user whom Google envisages wanting to find a book.

For my probe, I have chosen the same book that I used to assess Project Gutenberg, The Life and Opinions of Tristram Shandy, Gentleman. Laurence Sterne’s unconventional eighteenth–century book presented some significant challenges to Gutenberg’s ascii methods of presentation, but, in principle, as people often remind me, few to a scanning project like Google’s. The Greek text, the footnotes, the black page, the blank page, or the marbled page, all of which were missing or compromised in the Gutenberg text, should scan without difficulty. In trying to assess different modes of presenting this book, however, we do need to remember that Sterne is known for his eccentricity. If the job is not well done, the putative reader in search of a book may have difficulty distinguishing modern technological limitations from Sterne’s eighteenth–century typographical experimentations.

Results: Finding a cock and bull story [7]

We can begin, as the ordinary reader might, by typing “Tristram Shandy,” as the book is generally known, into the Google Book Project search box [8]. This search returns the following page:

Figure 1: [http://books.google.com/books?q=Tristram+Shandy&btnG=Search+Books].

It’s hard to know how Google sorts its results, but these results suggest that “viewable text” might have a high priority. Google clearly wants its users to be able to read the text they are searching for. The first link then leads to a Web page providing the opening page of this particular version of Sterne’s novel, to which I’ll return in a moment. The interface allows us not only to read on, by clicking forward from this page, but also to move to the front of the book, where we can get intimations of the inheritance of quality:

Figure 2: [http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPP2,M1].

Harvard was an early partner in the Google Books Project and this is evidently a Harvard book, so we can return with some confidence to the opening page of Sterne’s text [9]:

Figure 3: [http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPA1,M1].

To someone not sure what to expect from the eccentric Sterne and his typographic experiments, it may take a little time to realize that the book does not start with the word “WISH.” Rather, this version is missing the left–hand edge of the page and hence the opening word (not trivial in a splendidly egocentric book), “I.” The problem is not unique, as an undaunted reader will find if he or she goes on:

Figure 4: [http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPA17,M1].

Nor, as anyone who has clicked this far will know, is this a left–handed prejudice (or one related to the well–known scanning problem that books have gutters). Top and right–hand outer edges have their problems too, some of which are as remarkable as they are illegible:

Figure 5a: [http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPA7,M1].

Figure 5b: [http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPA26-IA3,M1].

Figure 5c: [http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPA41,M1].

By the time this page [27] has been reached, the astute reader will also have noticed that the book has other quality–control problems. Famously, on the death of Parson Yorick, Sterne quoted Hamlet’s phrase, “Alas, poor Yorick!”, and inserted a black page of mourning. The version of Sterne’s novel that Harvard offered and Google scanned evidently overlooked this iconic page, perhaps assuming it was an inky disaster in the print shop rather than part of the author’s design. We can see the problem if we compare the Google page to the same page from the Penguin edition (Sterne, 1967):

Figure 6a: [http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPA23,M1].

Figure 6b: Tristram Shandy, Penguin edition.

Thus, where scanning should conquer the disparaged ascii transcription of Project Gutenberg, in this case the text chosen by the scanning project and presented as first choice to the reader is so inadequate that it offers little improvement for the ordinary reader, who can at least read the ascii text.

At this point, a wise reader might abandon Google’s first offering and go to its second (see Figure 1) and work with that. Here, if we again click back, we find a competing brand for the book, Stanford’s.

Figure 7: [http://books.google.com/books?id=UXYLAAAAIAAJ&pg=PA1&dq=Tristram+Shandy#PPP2,M1].

Unfortunately, Stanford doesn’t come out of this battle of giants much better. If we go to the first page of Sterne’s text, skipping the introduction, we find the following:

Figure 8: [http://books.google.com/books?id=UXYLAAAAIAAJ&pg=PA1&dq=Tristram+Shandy#PPA4,M1].

This might appear to be yet another blank page amid the frontmatter (or even Sterne’s own blank page; Tristram Shandy has one). By clicking one page on, we see it is not. Rather, this should be the first page of Sterne’s text, but unfortunately its content is missing. The first page of Sterne’s text that is not empty plunges reader and author, who struggled so hard to find his beginning, in medias res:

Figure 9: [http://books.google.com/books?id=UXYLAAAAIAAJ&pg=PA1&dq=Tristram+Shandy#PPA5,M1].

Given the difficulties with the Harvard text, we might persevere. Yet, those who had ploughed as far as the second page of the Harvard text will notice a problem with Stanford’s second. For, at the opening of chapter 2 and in contrast to Figure 9, Harvard’s text gave the following:

Figure 10: [http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPA2,M1].

An untutored reader might be able to tell that the first page of the book didn’t open with the word “WISH”, but how would he or she know whether the second chapter opened:

“— THEN, positively, there is nothing in the question that I can see, either good or bad.”? [Harvard]

Or

“— WHAT prodigious armies you had in Flanders!”? [Stanford]

This is clearly not a matter of overly subtle text editing. The problem becomes a little clearer if we click on the link “Table of Contents” that appears on the side of the Stanford pages (see Figure 9). The link doesn’t actually take us to a table of contents:

Figure 11: [http://books.google.com/books?id=UXYLAAAAIAAJ&pg=PA1&dq=Tristram+Shandy#PPR7,M1].

Rather, it’s a list of illustrations. (Google’s no doubt intricate software clearly guessed wrong at this point about what kind of page it had before it.) But the list does at least tell us that this is volume II of Tristram Shandy. If, however, we miss this page and instead follow the “about this book” link at the side of the page, we are only told that this is

The Life and Opinions of Tristram Shandy,
Gentleman, by Laurence Sterne, Wilbur Lucius
Cross, Published 1904, J.F. Taylor &
Company. [10]

Not a word is mentioned about multiple volumes or volume number. Indeed, a quick survey of the Google Book Project suggests that Google doesn’t recognize volume numbers. Not only are the different editions (Harvard’s from 1896, Stanford’s from 1904) given exactly the same name, but also the different volumes of Stanford’s multivolume edition are labeled identically. Consequently, whatever algorithm Google uses to find the book, it is quite likely, as in this case, to offer volume II first [11].

Cautious by now, we might take a look at the Penguin version of Tristram Shandy (Sterne, 1967), which, though presented as a single volume, acknowledges the divisions of the original edition. Unfortunately, there we discover that chapter 2 of volume II of Sterne’s work begins presciently:

There is nothing so foolish, when you are at the expence of making an
entertainment of this kind, as to order things so badly, as to let your critics and
gentry of refined taste run it down.

Alas Poor Sterne! Evidently he had some premonition of what might become of his work. Like the Harvard edition, which ignored Sterne’s black page, the Stanford work not only ignores Sterne’s divisions, but introduces new ones of its own. Its chapter 2 has no bearing on Sterne’s chapter 2 in either Volume I or any subsequent volume of the original text. This would matter little were it not that Sterne continuously refers back and forth to preceding or future pages, chapters, and books. Indeed, he even opens his second volume with an alert to his readers (and, perhaps, editors) that “I have begun a new book.” That phrase is no doubt buried mystifyingly somewhere in the first volume of the Stanford edition which is, in turn, buried mystifyingly somewhere in Google Books Library Project.

Of course, it may be argued that as these editions are available on library shelves, there is no reason that they should not be available in digital libraries. But it isn’t quite true that these are available on library shelves. A quick look at the online catalogue for Stanford’s library shows that the Stanford volume presented as your second choice by Google Books is actually tucked away in the Stanford Auxiliary library along with “infrequently–used” texts. An ordinary reader would have to work through some 80–plus catalogue entries and find their way into the auxiliary library before coming across this text — barriers enough to suggest treating this edition with caution — and even then he or she is likely to find the different volumes in the right order [12]. Google may or may not be sucking the air out of other digitization projects, but like Project Gutenberg before, it is certainly sucking better–forgotten versions of classic texts from justified oblivion and presenting them as the first choice to readers.

With this brief encounter between Google and Project Gutenberg in mind, we might take one more look at the links from the Google edition (Figure 8). These identify modern editions that the user might buy — another way in which Google distances itself from what it actually puts online. The first link offers an admirable, cheap Oxford edition at US$11.95. The discerning reader, however, might be seduced by the second, more expensive edition produced by Kessinger Publishing. This costs US$37.95, but as Kessinger produces “rare reprints” we might be tempted [13]. Put down forty bucks and you will find yourself the owner of no less than an expensive dump of the Project Gutenberg edition, whose limitations many thought Google Books would inherently overcome. There’s an elegant symmetry here, as the Kessinger edition — holding in June 2007 a rank close to 4,000,000 in order of popularity on Amazon — was also the first option Google’s algorithm would mysteriously offer when Google Books initially went online.

Figure 12: Google search results.

Conclusion

The Google Books Project is no doubt an important, in many ways invaluable, project. It is also, on the brief evidence given here, a highly problematic one. Relying on the power of its search tools, Google has ignored elemental metadata, such as volume numbers. The quality of its scanning (and so we may presume its searching) is at times completely inadequate [14]. The editions offered (by search or by sale) are, at best, regrettable. Curiously, this suggests to me that it may be Google’s technicians, and not librarians, who are the great romanticisers of the book. Google Books takes books as a storehouse of wisdom to be opened up with new tools. They fail to see what librarians know: books can be obtuse, obdurate, even obnoxious things. As a group, they don’t submit equally to a standard shelf, a standard scanner, or a standard ontology. Nor are their constraints overcome by scraping the text and developing search algorithms. Such strategies can undoubtedly be helpful, but in trying to do away with fairly simple constraints (like volumes), these strategies underestimate how a book’s rigidities are often simultaneously resources deeply implicated in the ways in which authors and publishers sought to create the content, meaning, and significance that Google now seeks to liberate. Even with some of the best search and scanning technology in the world behind you, it is unwise to ignore the bookish character of books. More generally, transferring any complex communicative artifacts between generations of technology is always likely to be more problematic than automatic.

Finally, with regard to inheritance as a strategy for quality assurance, the question of quality in Google Books’ Library Project reminds us that the newer form is always in danger of a kind of patricide, destroying in the process the resources it hopes to inherit. This remains a puzzle, for example, for Google News. In its free provision of news, it risks undermining the income stream that allows the sources on which Google News relies for quality to survive. It may even be true, in a lesser way, for Google Books. Google relies here for quality assurance on the reputation of the grand libraries it has corralled for its project. Harvard and Stanford libraries certainly do not have their reputations enhanced by the dubious quality of Tristram Shandy, labeled with their names in the Google database. And Tristram Shandy is not alone. With each badly scanned page or badly catalogued book, Google threatens not only its own reputation for quality and technological sophistication, but also those of the institutions that have allied themselves to the project. The Google Book Project’s Tristram Shandy may be, as Sterne said ruefully about his marbled page, the “motley emblem” of its work [15].

About the author

Paul Duguid is adjunct professor in the School of Information at the University of California, Berkeley; professorial research fellow at Queen Mary, University of London, where he was an ESRC–SSRC Visiting Fellow in the spring of 2005; and a research fellow at the Center for Science, Technology, and Society at Santa Clara University. He is also an honorary fellow of the Institute for Entrepreneurship and Enterprise Development at Lancaster University School of Management.
E–mail: duguid [at] ischool [dot] berkeley [dot] edu

Acknowledgments
This paper is based on a talk given to the Society of Scholarly Publishers, San Francisco, 6 June 2007. I am grateful to the Society for the invitation, and to Kathleen Vanden Heuvel and Andrew MacDiarmid for comments made on drafts of the paper.

Notes

1. For a summary of early enthusiasm see http://books.google.com/googlebooks/newsviews/media.html.

2. See http://books.google.com/googlebooks/newsviews/history.html and http://books.google.com/googlebooks/about.html.

3. Google does describe the project as “an enhanced card catalog of the world’s books”. See http://books.google.com/googlebooks/about.html.

4. See http://blog.historians.org/articles/204/google-books-whats-not-to-like and http://en.wikipedia.org/wiki/Google_Book_Search.

5. New “partners” are announced regularly. See, for example, http://www.cnn.com/2007/TECH/internet/06/08/big.ten.books.ap/index.html.

6. See Baker (2001) on the microfilming of libraries for a related argument about ways in which libraries locked themselves in to a “high tech” project with low quality. For an important, “open” alternative to Google’s shrouded project, see the "Open Content Alliance": http://www.opencontentalliance.org/.

7. On the last page of the book, Tristram’s mother asks what the book is about and Yorick replies: “A COCK and a BULL” (Sterne, 1967, p. 615).

8. All data gathered on 17 June 2007.

9. That confidence may be undermined if we note along the way two copies of the frontispiece. See http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPR2,M1 and http://books.google.com/books?id=zC_UH934kncC&pg=PA1&dq=Tristram+Shandy#PPR4,M1.

10. See http://books.google.com/books?id=UXYLAAAAIAAJ&dq=Tristram+Shandy.

11. I argued before (Duguid, 2006) that a minimum requirement of the Gracenote music database was that Act I of an opera should play before Act II. I don’t think it raises the bar too much to ask of a digitized library that volume I appear before volume II.

12. http://jenson.stanford.edu/uhtbin/cgisirsi/TJcTFl1kt1/GREEN/8070095/9.

13. See http://www.kessingerpub.com/.

14. There are actually more problems with the page shown in Figure 5c, which in the edition Google provides mixes body text and footnote text capriciously, than I care to enumerate, but one of the words lost in the blur is “renfermé.” If you search “in this book” for the word, Google can’t find the match, suggesting that the character recognition, like the reader, suffers from the quality of the visible page.

15. For any reader who might want an online edition of Tristram Shandy, I would suggest avoiding both the Google and the Project Gutenberg texts. The editions on the Making of America Books site (University of Michigan) and at http://www2.hn.psu.edu/faculty/jmanis/l-sterne.htm both appear to offer the inheritance of academic credentials, but both are better avoided. The Making of America text ( http://quod.lib.umich.edu/cgi/t/text/text-idx?c=moa;cc=moa;rgn=main;view=text;idno=AAN6595.0001.001) betrays itself in its presentation of the title of the book ( T R Ir STR A. 1 S E A N D Y) and the author (LAi1JRENCE $S E NE), while the Penn State edition is no more than the Gutenberg text, and so has managed to inherit all its quality. There is, however, an edition from the library of the University of California, Los Angeles, at the Open Content Alliance site (http://ia340914.us.archive.org/0/items/novelsoflaurence01steriala/novelsoflaurence01steriala.pdf), where the text seems reliable, the scanning error free and the metadata correct. And there has been a particularly elegant edition online from Gifu University in Japan since 1997 (http://www1.gifu-u.ac.jp/~masaru/TS/contents.html), where the html is very well done.

References

Nicholson Baker, 2001. Double Fold: Libraries and the Assault on Paper. New York: Random House.

Paul Duguid, 2006. “Limits of Self–Organization: Peer Production and the ‘Laws of Quality’,” First Monday, volume 11 number 10 (October), at http://www.firstmonday.org/issues/issue11_10/duguid/. http://dx.doi.org/10.5210/fm.v11i10.1405

Eric S. Raymond, 1998. “The Cathedral and the Bazaar,” First Monday, volume 3, number 3 (March), at http://www.firstmonday.org/issues/issue3_3/raymond/.

Laurence Sterne, 1967. The Life and Opinions of Tristram Shandy, Gentleman. Graham Petrie, editor. Harmondsworth, Middlesex: Penguin Books. [First published, 1759–1767.]

Editorial history

Paper received 18 June 2007; revised 19 June 2007; accepted 15 July 2007.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License

Inheritance and loss? A brief survey of Google Books by Paul Duguid
First Monday, volume 12, number 8 (August 2007),
URL: http://firstmonday.org/issues/issue12_8/duguid/index.html