HathiTrust as a Data Source for Researching Early Nineteenth-Century Library Collections

Identification, Coverage, and Methods


  • Julia Bauder, Grinnell College Libraries




An intriguing new opportunity for research into the nineteenth-century history of print culture, libraries, and local communities is performing full-text analyses on the corpus of books held by a specific library or group of libraries. Creating corpora from books that are known to have been owned by a given library at a given point in time is potentially feasible because digitized records of the books in several hundred nineteenth-century library collections are available in the form of scanned book catalogs: books or pamphlets that list all of the books available in a particular library. However, there are two potential problems with using those book catalogs to create corpora. First, it is not clear whether most or all of the books that were in these collections have been digitized. Second, the prospect of identifying the digital representations of the books listed in the catalogs is daunting, given the diversity of cataloging practices at the time. This article reports on progress toward developing an automated method to match entries in early nineteenth-century book catalogs with digitized versions of those books, and also provides estimates of the fraction of those libraries' holdings that has been digitized and made available in the Google Books/HathiTrust corpus.
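
To make the matching and coverage ideas in the abstract concrete, here is a minimal sketch in Python. It is not the method developed in the article; it only illustrates one simple way a catalog entry could be compared against digitized-book titles and how a coverage fraction could then be computed. It assumes plain string similarity from the standard-library `difflib` module, and the function names, sample titles, record identifiers, and the 0.8 threshold are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase a title, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def best_match(entry: str, records: dict, threshold: float = 0.8):
    """Return (record_id, score) for the most similar record title,
    or None if no record reaches the (hypothetical) threshold."""
    target = normalize(entry)
    best_id, best_score = None, 0.0
    for record_id, title in records.items():
        score = SequenceMatcher(None, target, normalize(title)).ratio()
        if score > best_score:
            best_id, best_score = record_id, score
    if best_id is not None and best_score >= threshold:
        return best_id, best_score
    return None


def coverage(entries: list, records: dict, threshold: float = 0.8) -> float:
    """Fraction of catalog entries with at least one acceptable match."""
    if not entries:
        return 0.0
    matched = sum(
        1 for entry in entries if best_match(entry, records, threshold) is not None
    )
    return matched / len(entries)


# Hypothetical sample data: digitized-book titles keyed by made-up record IDs,
# and three entries as they might appear in a printed library catalog.
records = {
    "rec1": "The History of America",
    "rec2": "Travels in the Interior of Africa",
}
entries = [
    "History of America.",
    "Travels in the Interior of Africa",
    "Sermons on Various Subjects",
]

for entry in entries:
    print(entry, "->", best_match(entry, records))
print("coverage:", coverage(entries, records))
```

Real catalog entries are often heavily abbreviated and inconsistently formatted, so a title-only comparison like this would likely need to be combined with author, date, or edition evidence before its matches could be trusted.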

Author Biography

Julia Bauder, Grinnell College Libraries

Social Studies and Data Services Librarian, Grinnell College Libraries

References


Jean-Baptiste Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331, no. 6014 (January 14, 2011): 176-82, https://doi.org/10.1126/science.1199644.

Jean M. Twenge, W. Keith Campbell, and Brittany Gentile, “Male and Female Pronoun Use in U.S. Books Reflects Women’s Status, 1900-2008,” Sex Roles 67, nos. 9-10 (November 2012): 488-93, https://doi.org/10.1007/BF00287963.

Patricia M. Greenfield, “The Changing Psychology of Culture from 1800 through 2000,” Psychological Science 24, no. 9 (September 2013): 1722-31, https://doi.org/10.1177/0956797613479387.

Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds, “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-cultural and Linguistic Evolution,” PLOS One 10, no. 10 (October 7, 2015): e0137041, https://doi.org/10.1371/journal.pone.0137041.

Alexander Koplenig, “The Impact of Lacking Metadata for the Measurement of Cultural and Linguistic Change Using the Google Ngram Data Sets—Reconstructing the Composition of the German Corpus in Times of WWII,” Digital Scholarship in the Humanities 32, no. 1 (April 2017): 169-88, https://doi.org/10.1093/llc/fqv037.

Pechenick et al., 2015; Lindsay DiCuirci, Colonial Revivals: The Nineteenth-Century Lives of Early American Books (Philadelphia: University of Pennsylvania Press, 2019).

Robert A. Gross, “Reconstructing Early American Libraries: Concord, Massachusetts, 1795-1850,” Proceedings of the American Antiquarian Society 97, no. 1 (January 1, 1987): 331-451.

Jennifer Howard, “What Ever Happened to Google’s Effort to Scan Millions of University Library Books?,” EdSurge, August 10, 2017, https://www.edsurge.com/news/2017-08-10-what-happened-to-google-s-effort-to-scan-millions-of-university-library-books.




How to Cite

Bauder, J. (2019). HathiTrust as a Data Source for Researching Early Nineteenth-Century Library Collections: Identification, Coverage, and Methods. Information Technology and Libraries, 38(4), 14–24. https://doi.org/10.6017/ital.v38i4.11251