Erin Engle (Library of Congress) has posted an interview with Kalev Hannes Leetaru (George Washington University Center for Cyber and Homeland Security) on The Signal, the Library of Congress digital preservation blog, detailing research he recently completed comparing the Internet Archive, HathiTrust, and Google Books Ngrams collections.
Kalev: … To explore this further, with the assistance of Google, Clemson University, the Internet Archive (IA), HathiTrust, and OCLC, I downloaded the English-language 1800-present digitized book collections of IA, HathiTrust, and Google Books Ngrams. I applied an array of data mining algorithms to them to explore what the collections look like in detail, and how similar the results are across the three collections. For each book in each collection, I calculated a list of all person names, organization names, millions of topics, and thousands of emotions from “happiness” to “anxiety,” “smugness” and “vanity” to “passivity” and “perception,” and disambiguated all textual references to locations, converting them to mappable coordinates. A full detailed summary of the findings is available online and all of the computed metadata is available for analysis through SQL queries.
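The per-book extraction Kalev describes (pulling out place names, resolving them to coordinates, and tallying emotion-bearing vocabulary) can be sketched in miniature. This is purely illustrative and not the research pipeline: the gazetteer and emotion lexicon below are tiny made-up stand-ins for the far larger resources such a project would actually use.

```python
# Minimal sketch of per-book metadata extraction: find known place
# names, map them to coordinates, and count emotion-bearing words.
# The gazetteer and lexicon are toy examples for illustration only.
import re

GAZETTEER = {
    "gettysburg": (39.8309, -77.2311),
    "richmond": (37.5407, -77.4360),
    "washington": (38.9072, -77.0369),
}

EMOTION_LEXICON = {
    "happiness": {"joy", "cheerful", "delight"},
    "anxiety": {"fear", "dread", "worry"},
}

def extract_metadata(text):
    """Return (place -> coordinates, emotion -> word count) for one text."""
    words = re.findall(r"[a-z]+", text.lower())
    places = {w: GAZETTEER[w] for w in words if w in GAZETTEER}
    emotions = {
        label: sum(1 for w in words if w in vocab)
        for label, vocab in EMOTION_LEXICON.items()
    }
    return places, emotions

places, emotions = extract_metadata(
    "Dread filled Richmond as the armies met near Gettysburg."
)
```

Run over every book in a collection, records like these become rows in a database, which is what makes the SQL-queryable metadata and the map visualizations mentioned in the interview possible.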
Erin: As you mentioned, you were able to compile a list of all people, organizations, and other names, and full-text geocode the data to plot points on a map. How do you think visualizing these collections helps researchers access or understand them, individually or collectively?
Kalev: Historically we’ve only been able to access digitized books through the same exact-match keyword search that we’ve been using for more than half a century. That lets a user find a book mentioning a particular word or passage of interest, but doesn’t let us look at the macro-level patterns that emerge when looking across the world’s books. Previous efforts at identifying patterns at scale have focused largely on the popularity of individual words over time, but that doesn’t get at the deeper questions of how all of those words fit together and what they tell us about the geographic, thematic, and emotional undertones of the world’s knowledge encoded in our books. For example, creating an animated map of every location mentioned in 213 years of books, or of all the locations mentioned in books about the American Civil War or World War I, is something that has simply never been possible before. In short, these algorithms allow us to rise above individual words, pages, and books, and look across all of this information in terms of what it tells us about the world.
The interview goes on to cover the challenges of working with the datasets, some of the research findings, Kalev’s observations about the shifting roles of libraries, museums, and archives in large-scale data mining, and his recommendations for institutions supporting this type of work.