For the last ten years or so internet search engines, and Google in particular, have made the retrieval of digital information increasingly and surprisingly fast, reliable, and efficient. Google searches have already changed the way we live and work. And our expectations of a permanent, seamless access to all kinds of searchable data have grown accordingly, in quantity and quality. Google itself is constantly expanding the power and scope of its searches into new and sometimes unexpected domains. Given the pace of this ongoing revolution in global data processing, scholars and historians may be reasonably puzzled by the persistent opacity of most of the predigital accumulated lore of humankind, in print and in manuscript form. Why should the full content of all the books (and codices) in the world not be made as easily searchable as today's World Wide Web?
There are various reasons for that: technical, first of all, but also legal and cultural. All digital objects, regardless of how they manifest themselves to human senses, are recorded and transmitted as sequences of numbers, or digits. In the case of alphabetical texts, the 30-something letters of Western (and non-Western) alphabets and the 10 Hindu-Arabic numerals are first converted into a three-digit code (ASCII or one of its followers); this code is then translated into a binary notation (a series of zeroes and ones). Unlike humans, computers can easily search huge digital files looking for a precise sequence, or string, of zeroes and ones, thus retrieving all occurrences of the same words or numbers in a given data field.
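As a rough illustration of the encoding and search steps just described, the short Python sketch below converts a word into ASCII codes, writes those codes in binary notation, and then locates every occurrence of a string in a sample phrase. The sample phrase and the function names are hypothetical, chosen only for illustration; actual search engines rely on far more elaborate indexing than this plain scan.

```python
def to_ascii_codes(text):
    """Return the numeric code of each character (ASCII for plain English text)."""
    return [ord(ch) for ch in text]


def to_binary(codes):
    """Write each code as an 8-digit series of zeroes and ones."""
    return [format(code, "08b") for code in codes]


def find_all(haystack, needle):
    """Return the start position of every occurrence of needle in haystack."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


# Hypothetical sample phrase, used only to exercise the functions.
sample = "the book and the codex"
codes = to_ascii_codes("book")
print(codes)                    # [98, 111, 111, 107]
print(to_binary(codes))         # ['01100010', '01101111', '01101111', '01101011']
print(find_all(sample, "the"))  # [0, 13]
```

The search itself compares sequences of codes and takes no account of what the words mean, which is why it can be applied indiscriminately to any digitized text.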
In order to become digitally searchable, the content of printed books or manuscripts must first be digitized, that is, converted into a sequence of numbers. This used to be done manually (by typing texts into a word processor, for example), but is now more often done automatically, using machines that take sequential pictures of all the pages in a book, apparently without destroying the book itself. OCR (Optical Character Recognition) software is then used to "read" and convert most graphic signs into standard notations from a preset list (letters of the alphabet, both lowercase and uppercase; Hindu-Arabic numerals; punctuation and diacritic marks; and a few more). This operation is more successful when the original pages are clean and the lines of type are crisp and clear. Once disambiguated, each sign can be encoded (translated into a number), and the digital file resulting from this process becomes as searchable as any digitally born document.
Information retrieval systems existed before electronics, but they did not work this way, and they did not work so well. The traditional predigital way to search for data was based on classification: taxonomies and hierarchies. Information was sorted and stored in different physical locations based on its subject matter, and retrieved accordingly. One looked for a certain item based on the category to which it belonged and the place allocated to that category. This applied to shelves within a library, for example, or books on a shelf, or cards in a drawer.
New information retrieval systems and technologies flourished during the Renaissance, when books started to be multiplied by print. Giulio Camillo's celebrated and mysterious "memory theatre" (about 1530–44), in spite of its misleading name and of some more obscure applications, was originally a device designed to search and retrieve quotations from a corpus of Cicero's writings. Around the same time, Conrad Gesner's topical bibliographies and Pierre de La Ramée's "methodical" arborescences similarly aimed at providing a general classification of knowledge, to be used both as a memory aid and as a tool for storing and retrieving books and other bits of data. Following from the same principles, the two classification systems still used by most libraries around the world were created by Melvil Dewey and Paul Otlet between the end of the nineteenth century and the beginning of the twentieth. In the predigital age, finer searches were also performed via the arbitrary choice of a few keywords that were indexed and listed topically and/or alphabetically (indexes of names, places, etc.). Hand-made taxonomies still inspired the earliest internet search tools (famously, Yahoo!), but today's information retrieval technology is mostly based on across-the-board, indiscriminate full-text string searching. For the first time ever, today we can search without sorting, and we can find without placing.
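The contrast between the two models of retrieval can be sketched in a few lines of Python. The miniature "library" below, its titles, categories, and texts are all invented for the purpose of the example; the sketch shows only the principle, not how any actual catalogue or search engine is built.

```python
# Hypothetical miniature library: each book has a category (its "place") and a text.
books = [
    {"title": "Treatise A", "category": "architecture",
     "text": "columns and memory of ancient orders"},
    {"title": "Treatise B", "category": "rhetoric",
     "text": "the art of memory and of speech"},
    {"title": "Treatise C", "category": "botany",
     "text": "plants of the alpine valleys"},
]


def build_catalogue(library):
    """Predigital model: sort every title into the place allotted to its category."""
    catalogue = {}
    for book in library:
        catalogue.setdefault(book["category"], []).append(book["title"])
    return catalogue


def full_text_search(library, word):
    """Digital model: scan every text for the string, ignoring categories."""
    return [book["title"] for book in library if word in book["text"]]


catalogue = build_catalogue(books)
print(catalogue.get("rhetoric", []))      # ['Treatise B']
print(full_text_search(books, "memory"))  # ['Treatise A', 'Treatise B']
```

The catalogue finds a book only if one already knows the category under which it was filed, whereas the full-text search also retrieves the architectural treatise that mentions memory and that no classifier would have shelved under that subject.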
Not surprisingly, such searches generate so many results that the art of searching today depends on the talent that some search engines have to prioritize results in a way that meets or guesses users' expectations. This is where Google has been eminently successful, and it is known that its secret formula for ranking search results depends somehow on the links existing between web pages. This strategy cannot be easily extended from the web to books in print, as the printed equivalent and predecessors of digital hyperlinks, i.e., citations and cross-references, are too few and far apart, and often too specific to be statistically significant.[1] Nevertheless, in 2004 Google started an ambitious project to apply the almost miraculous power of its digital search engine to the printed domain, and the project (first called Google Print and Google Book Search, now Google Books) immediately ran into trouble.
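The general idea of ranking pages by the links they receive can be illustrated with a simplified, textbook-style iteration. This is emphatically not Google's actual formula, which is not public in its details; the four-page web, the damping value, and the function name below are assumptions made only for the sake of the example.

```python
def rank_by_links(links, damping=0.85, iterations=50):
    """Score pages by repeatedly passing each page's score along its outgoing links.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its score equally among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical web of four pages: "a", "b", and "d" all link to "c".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = rank_by_links(links)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0])  # c
```

With only a handful of links, as in most printed citations, scores of this kind carry little statistical weight, which is the difficulty the paragraph above points to.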
The trouble was not technical. Some copyright owners objected to having their work scanned and digitized without their consent. They objected even more loudly when they found that some of that copyrighted material was searchable and partly readable online. Many argued that one private company should not be given the power to choose which books would be digitized, and others, particularly in France, noted that an American company might favor works in English to the detriment of other languages.[2] Several lawsuits were filed against the project by associations of copyright owners, and at the time of writing (March 2010) Google states on its website that a resolution is imminent.[3] But the new Google Books that will emerge from this litigation promises to be quite different from the book search tool that Google had conceived only a few years ago.
The ten million books already digitized by Google are made available to the general public at different levels of visibility, based on their copyright status.[4] Some books appear only with an entry similar to a card catalogue, and they are not searchable at all; in fact, nothing proves that they were ever digitized, and their bibliographic entries, derived from ISBN numbers, are often inaccurate. Then come books that are searchable but not viewable: the results of the search are shown as snippets excerpted from a facsimile of the original text in print, each showing the search word highlighted in the context of one or two lines preceding and following it. (In the early days of information retrieval, similar searches were called KWIC, from "keywords in context.") Other books are both searchable and viewable, but the view (which Google Books calls "preview") is limited to a maximum number of non-consecutive pages per viewer, based on agreements between Google and the copyright holders. Evidently, Google keeps track of each visitor and closes the book after the reader has turned a given number of pages. As some pages are randomly omitted, no one should expect to read even a short chapter of any such book for free, or to view the same pages again on a later visit, except accidentally. In compensation, one can get a fair idea of what the book is about; and should one decide to buy the book, Google Books offers direct links to booksellers' web pages (typically, the publisher's, Amazon's, and a few others). Google Books also lists libraries where the book can be consulted or borrowed. Books that are out of copyright, however (old editions of the classics, for example), are fully searchable and fully viewable online, and even downloadable (as PDFs and in other formats, but some of the downloads are not searchable). All this is for free.
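A keyword-in-context search of the kind mentioned above can be sketched as follows. The sample sentence and the function name are hypothetical, and the context is counted here in words rather than in printed lines, a simplification adopted only to keep the example short; it makes no claim about how Google Books builds its snippets.

```python
def kwic(text, keyword, width=3):
    """Return each occurrence of keyword, bracketed, with `width` words on either side."""
    words = text.split()
    snippets = []
    for i, word in enumerate(words):
        # Ignore surrounding punctuation and letter case when comparing words.
        if word.strip(".,;:!?\"'()").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            snippets.append(f"{left} [{words[i]}] {right}".strip())
    return snippets


# Hypothetical sample text.
sample = "Old books are rare. Rare books are found in a few libraries."
for snippet in kwic(sample, "rare", width=2):
    print(snippet)
# books are [rare.] Rare books
# are rare. [Rare] books are
```

Each snippet discloses only a few words around the search term, which is what allows a copyrighted book to be searched without being made readable.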
The future of Google Books, as announced on the company's website and contingent upon the settlement of the ongoing litigation, will add two new modes of content delivery. Online access to digitized books that are still in copyright, but out of print, will be made available for individual and institutional purchase (and the income will be shared between Google and the copyright holders). This will extend the life span of copyrighted but out-of-print books through their digital substitute, but for the same reason will make actual reprints less likely. Finally, the sale of online access to books that are still both in copyright and in print, which is also announced by the company as forthcoming, will put Google Books in direct competition with printed books. In this case, too, copyrighted digital content will be sold by Google Books in partnership with book publishers, and publishers will ultimately have to choose between selling books as physical items or selling their digital content online, or any combination of the two that makes economic sense. Apparently, both these new modes of Google Books will be designed for online search and view (that is, not for download), but if the market for e-books takes off, it is easy to imagine that Google Books will follow the trend, and compete in the e-book marketplace (with the advantage of its search engine, which no e-book publisher can match).
All the older layers of access, search, and visibility will apparently be maintained alongside the new ones, and in the present version of Google Books, book searches and book views are already generating revenue. Contextual ads (or sponsored links) are shown next to the book content generated by a search, and the advertisers pay money to Google through the proverbial and proprietary Google AdWords program ("pay per click"). For the time being, contextual ads in Google Books seem rare and scarcely pertinent. Google claims that Amazon and other sellers and publishers whose links are not shown as sponsored links, and which appear next to a book which is being searched for content, do not pay for those links, and that Google does not profit from the eventual sale of the book.[5] It does not claim that it does not profit from the click on the link. On the opposite side of the balance sheet, Google pays for digitizing books, but in different measure depending on the books' provenance: some are scanned in partnership with major university libraries around the world; some are digitized by publishers that partner with Google to offer some of their own content online (for the time being, only as free "snippet views" or "limited previews"). But Google Books 2.0, if it develops as anticipated by the company, will be primarily designed to sell digital content online and to derive revenue from these transactions (which, due to the nature of the legal settlement under discussion, will be initially limited to the U.S.).
As a result, a project that began as an information retrieval system for a corpus of printed matter seems now poised to become a platform for the electronic distribution of digital copies of printed material, in some cases for free. In the process, a technology which was originally meant to make all books searchable may now make many physical books unnecessary, for better or for worse.
As scholars know all too well, many books are difficult to find. Old and rare books in particular are found in a few libraries around the world and to see them is a privilege (which, moreover, tends to favor scholars who live or work in privileged locations). Making those books freely accessible to a larger community is a worthy philanthropic endeavor, and Google Books is fortunately not alone in this. A simple Google search will reveal that many rare books (first editions of architectural treatises, for example) are already made accessible on the web by a variety of libraries, universities, and other cultural institutions, some of which create and curate digital editions specifically designed for free online access. The collection Architectura, a joint venture of the Centre d'études supérieures de la Renaissance in Tours and of the Institut National d'Histoire de l'Art in Paris, is a remarkable case in point, as texts are carefully transcribed prior to being put online, a laborious and time-consuming philological operation; but due to the absence of the power of Google, full-text searches of the original versions are not supported.[6]
The other side of the story, however, so vividly epitomized by the ongoing transformation of Google Books itself, is that the very technology that is making rare books more freely accessible may soon make new books in print themselves a rarity. Instead of being printed and then digitized, new books may soon be designed for digital distribution right from the start. And the technical logics of digital media and of printed media are so distant from, and alien to, one another that a book meant for digital use will soon cease to be similar to any book in print we know. It is an ironic but not infrequent pattern in the history of cultural technologies that a media shift may at the same time revive old content and kill the old media that had made that content possible in the first place.
[1] The scholarly system of referencing through citations, footnotes, and bibliographies inspired the first experiments in automated cross references and eventually the invention of digital hyperlinking, hypertexts, the HTTP (Hypertext Transfer Protocol), and the World Wide Web itself. In turn, the automatic indexation of hyperlinks is at the basis of the Google search technology and the Google formula for prioritizing search results (known as PageRank from the name of one of its inventors, Google cofounder Larry Page). Google Scholar, a parallel project to Google Books, still in an embryonic stage and aimed at scholarly and scientific publications, ranks search results based on the number of times each article or book is referred to by other sources within the Google Scholar database and on an assessment of their respective weight, thus emulating the original spirit of academic cross-referencing. For the time being, however, Google Scholar searches in the arts and humanities often produce quirky results.
[2] See for example the essay by Jean Noël Jeanneney (then President of the French National Library), Quand Google défie l'Europe: Plaidoyer pour un sursaut (Paris: Éditions Mille et une nuits, 2005); and the commentary published on the website of the French National Library: Marie-Noële Darmois, "Face au défi de Google: Une bibliothèque numérique européenne," http://chroniques.bnf.fr/archives/septembre2005/numero_courant/dossiers/biblio_numerique.htm. Gallica, the French National Library's own pioneering digital library, launched in 1997, currently offers over a million free documents, for the most part non-searchable, comprising primarily "works about France, in the French language, and published in France": "Gallica Digital Library Charter: 1997–2007," http://www.bnf.fr/en/professionals/a.gallica_digital_library_charter.html (accessed 20 April 2010).
[3] "Google Books Settlement Agreement," see http://books.google.com/googlebooks/agreement/ (accessed 20 April 2010).
[4] The figure was given by Google co-founder Sergey Brin in an article first published in the New York Times on October 6, 2009, and now posted on the official Google Blog site, http://googleblog.blogspot.com/2009/10/tale-of-10000000-books.html (accessed 20 April 2010).
[5] http://books.google.com/googlebooks/facts.html (accessed 20 April 2010).
[6] http://architectura.cesr.univ-tours.fr (accessed 20 April 2010).