POST: The World As Seen Through Books: An Interview with Kalev Hannes Leetaru

Erin Engle (Library of Congress) has posted an interview with Kalev Hannes Leetaru (George Washington University Center for Cyber and Homeland Security) on The Signal, the Library of Congress digital preservation blog, detailing recently completed research comparing the Internet Archive, HathiTrust, and Google Books Ngrams collections.

Kalev: … To explore this further, with the assistance of Google, Clemson University, the Internet Archive (IA), HathiTrust, and OCLC, I downloaded the English-language 1800-present digitized book collections of IA, HathiTrust, and Google Books Ngrams. I applied an array of data mining algorithms to them to explore what the collections look like in detail, and how similar the results are across the three collections. For each book in each collection, a list of all person names, organization names, millions of topics, and thousands of emotions from “happiness” to “anxiety,” “smugness” and “vanity” to “passivity” and “perception” was calculated, along with all textual references to location, disambiguated and converted to mappable coordinates. A full detailed summary of the findings is available online, and all of the computed metadata is available for analysis through SQL queries.

Erin: As you mentioned, you were able to compile a list of all people, organizations, and other names, and full-text geocode the data to plot points on a map. How do you think visualizing these collections helps researchers access or understand them, individually or collectively?

Kalev: Historically we’ve only been able to access digitized books through the same exact-match keyword search that we’ve been using for more than half a century. That lets a user find a book mentioning a particular word or passage of interest, but doesn’t let us look at the macro-level patterns that emerge when looking across the world’s books. Previous efforts at identifying patterns at scale have focused largely on the popularity of individual words over time, but that doesn’t get at the deeper questions of how all of those words fit together and what they tell us about the geographic, thematic, and emotional undertones of the world’s knowledge encoded in our books. For example, creating an animated map of every location mentioned in 213 years of books, or of all the locations mentioned in books about the American Civil War or World War I, is something that has simply never been possible before. In short, these algorithms allow us to rise above individual words, pages, and books, and look across all of this information in terms of what it tells us about the world.

The interview goes on to cover the challenges of working with the datasets, some of the research findings, and Kalev’s observations about the shifting roles of libraries, museums, and archives in large-scale data mining, as well as his recommendations for institutions supporting this type of work.
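At toy scale, the shape of the pipeline Leetaru describes (extract names from book text, geocode place references, and expose the computed metadata to SQL queries) might look like the Python sketch below. This is purely illustrative, not his system: spaCy’s off-the-shelf NER model and a hypothetical two-entry gazetteer stand in for the custom algorithms, topic and emotion mining, and full gazetteers the research actually used.

```python
import sqlite3

import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

# Hypothetical two-entry gazetteer standing in for a full place-name index.
GAZETTEER = {
    "Gettysburg": (39.8309, -77.2311),
    "Richmond": (37.5407, -77.4360),
}

def extract_metadata(book_id, text, db):
    """Run NER over one book's text and store entity mentions in SQL."""
    for ent in nlp(text).ents:
        if ent.label_ in ("PERSON", "ORG", "GPE", "LOC"):
            # Geocode location mentions to mappable coordinates where possible.
            lat, lon = GAZETTEER.get(ent.text, (None, None))
            db.execute(
                "INSERT INTO mentions VALUES (?, ?, ?, ?, ?)",
                (book_id, ent.text, ent.label_, lat, lon),
            )

db = sqlite3.connect("book_metadata.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS mentions "
    "(book_id TEXT, entity TEXT, label TEXT, lat REAL, lon REAL)"
)
extract_metadata("ia:example1863", "General Meade met Lee near Gettysburg.", db)
db.commit()

# The computed metadata is now open to exactly the kind of SQL analysis the
# interview mentions, e.g. every mappable location in the collection:
for row in db.execute("SELECT entity, lat, lon FROM mentions WHERE lat IS NOT NULL"):
    print(row)
```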

POST: On Capacity and Care

Bethany Nowviskie (CLIR) has shared a “blended” version of two recent talks in her post, “On Capacity and Care.” Addressing the dual topics of sustainable digital humanities and the future of graduate education, Nowviskie frames her discussion using the notion of “care”:

I offer care as a hard-nosed survival strategy, and as a strategy to increase the reach and grasp (which is at the root of the word “capacity”—the “capture”) of the humanities. We must take practical steps to prevent fatigue at the individual and community level in digital humanities and cultural heritage fields …

What if we more self-consciously drafted DH design specs to center on care? What if we did a better job of placing key humanities interests and concerns at the very heart of big-data humanities infrastructure? Doing so might help ensure our digital tools are truly open and more sustainably constructed, so that anyone with a reasonable level of training (the level I think our graduate and even undergraduate humanities programs might usefully provide) could look under the hood, change a spark plug in a moment of need—and build her next conveyance, to go further than we have imagined.

POST: Questions to Ask When You Learn of Digitization Projects

Sarah Werner highlights some of the thornier issues around access to digital collections in a recent blog post, “Questions to Ask When You Learn of Digitization Projects.” As news reports of increased access to historical materials make the rounds, Werner reminds us to interrogate the motives behind projects, noting the challenges of balancing the often competing interests of researchers and (frequently) commercial enterprises:

What are the ways that we—whether we work in libraries and museums as staff or as researchers—can create high quality digitizations without selling our cultural heritage to the highest bidder?

POST: Guidelines for Digital Dissertations in History

Sharon Leon (Roy Rosenzweig Center for History and New Media) has written a post discussing the genesis of the newly released “Digital Dissertation Guidelines” for George Mason University’s History and Art History Department. Leon points to some of the materials that were consulted in developing the guidelines, though she ends on a cautionary note:

[W]e still have serious issues to address in the realm of official deposit and preservation of digital dissertation work. This is usually the responsibility of the university library, but very few institutions are equipped to ingest and provide access to web archives, or to provide emulators for other kinds of digital work. Digital humanities scholars are going to need to enter into serious conversations with our university librarians and institutional repository administrators to develop an official submission process that preserves digital dissertation work.

POST: #contextiseverything Whyte Memorial Lecture 2015

Jaye Weatherburn (Swinburne University of Technology, Australia) authored a post summing up the Whyte Memorial Lecture delivered by Ross Harvey (Royal Melbourne Institute of Technology, Australia), “Keeping, Forgetting, and Misreading Digital Material: Libraries Learning from Archives and Recordkeeping Practice.” Harvey “extolled the benefits of archival principles, and called for them to be used for managing digital materials.” Weatherburn expanded on how archival thinking is essential for managing digital assets across multiple systems over many years:

It is such an exciting time to be working in this space, and with established professionals like Harvey telling us to ‘get our heads out of the sand’ and promote our skills more widely, it is most definitely time for radical-thinking information professionals to join forces and start making changes to the way we conduct our practice of managing information.

POST: No More Excuses

In “No More Excuses,” Jacqueline Wernimont (Arizona State University) calls for the end of all-male panels in the digital humanities and points to several crowdsourced resources for locating women in DH: Build a Better DH Syllabus, Build a Better List of Code Experts, and Build a Better Panel. Wernimont concludes:

There are no more excuses. You know we are here and that we do damn fine work. Going forward, all-male panels can only be construed as a choice, not an issue of ignorance. We have been busy building the communities we want to see within DH, and now we’ve taken time from our research, our teaching, our lives to pull together information for you – now it’s your turn to do your part.

POST: Introducing Git-Lit

Jonathan Reeve (Columbia University) has written a post, “Introducing Git-Lit,” inviting input on a project that turns digitized texts from the British Library into GitHub repositories:

Git-Lit aims to parse, version control, and post each work in the British Library’s corpus of digital texts. Parsing the texts will transform the machine-readable metadata into human-readable prefatory material; version controlling the texts will allow for collaborative editing and revision of the texts, effectively crowdsourcing the correction of OCR errors; and posting the texts to GitHub will ensure the texts’ visibility to the greater community.
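The three verbs in that description (parse, version control, post) suggest a small per-volume script. The sketch below is a loose, hypothetical illustration of the workflow rather than Git-Lit’s actual code; the JSON metadata format, file names, and README layout are all assumptions made for brevity.

```python
import json
import subprocess
from pathlib import Path

def make_repo(text_path: Path, metadata_path: Path, repo_dir: Path) -> None:
    """Turn one digitized text plus its metadata into a local git repository."""
    repo_dir.mkdir(parents=True, exist_ok=True)

    # "Parse": render machine-readable metadata as human-readable front matter.
    meta = json.loads(metadata_path.read_text())
    readme = f"# {meta.get('title', 'Untitled')}\n\nAuthor: {meta.get('author', 'Unknown')}\n"
    (repo_dir / "README.md").write_text(readme)
    (repo_dir / "text.md").write_text(text_path.read_text())

    # "Version control": commit the text, so later OCR corrections arrive as
    # ordinary, reviewable revisions.
    subprocess.run(["git", "init"], cwd=repo_dir, check=True)
    subprocess.run(["git", "add", "."], cwd=repo_dir, check=True)
    subprocess.run(
        ["git", "commit", "-m", "Initial import from digitized source"],
        cwd=repo_dir, check=True,
    )
    # "Post": pushing to a remote created through the GitHub API or the gh CLI
    # would then make the text visible to the wider community.
```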

POST: Digital humanities might never be evenly distributed

In a blog post titled, “Digital humanities might never be evenly distributed,” Ted Underwood (University of Illinois, Urbana-Champaign) considers the different ways in which digital humanities infrastructure, support, and collaboration work (or don’t work) across constituencies in institutions. He posits that DH can be understood “as an institutional achievement that happens to exist on some campuses and not others.” Underwood concludes with a prediction that institutions will do DH differently as a matter of local cultures and constraints.

The post is particularly relevant for librarians in that it describes the dispersed nature of DH activity on one (large) campus that is also home to a leading LIS program. It suggests that even without a centralized DH initiative, important successes in digital humanities can be found through the kinds of interdisciplinary connections and local attentiveness that academic librarians engage in as part of our professional practice.


POST: A Dataset for Distant-Reading Literature in English, 1700-1922

Ted Underwood has written a post describing his work, in collaboration with the HathiTrust Research Center, to create a dataset for scholars looking to get started with distant reading methods on English-language literature from 1700 to 1922.

Using the page-level wordcounts available in the HTRC and pairing them with page-level metadata created to track genre within each volume, Underwood has “created three genre-specific datasets of word counts covering poetry, fiction, and drama from 1700 to 1922.” He notes, however, that while the datasets are designed to ease literature scholars’ entry into distant reading, “This is not something that can be sliced easily using a tool like Excel. Someone involved with the project needs to be able to program in order to pair the metadata table with the files.”
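The pairing he refers to is, in essence, a join between the metadata table and the per-volume count files. A minimal sketch in Python/pandas, assuming a hypothetical file layout and column names (the dataset’s real schema differs), might look like this:

```python
from pathlib import Path

import pandas as pd

# Assumed layout: a metadata table with one row per volume, plus one
# tab-separated word-count file per volume, named by volume ID.
meta = pd.read_csv("metadata.csv")
fiction_ids = meta.loc[meta["genre"] == "fiction", "volume_id"]

frames = []
for vol_id in fiction_ids:
    path = Path("counts") / f"{vol_id}.tsv"
    if path.exists():
        counts = pd.read_csv(path, sep="\t", names=["word", "count"])
        counts["volume_id"] = vol_id
        frames.append(counts)

fiction = pd.concat(frames, ignore_index=True)

# Once metadata and counts are paired, genre-wide aggregates are one line:
print(fiction.groupby("word")["count"].sum().nlargest(20))
```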

POST: ADHO Announces New SIG, Libraries and Digital Humanities

The Alliance of Digital Humanities Organizations (ADHO) has announced the formation of a new Special Interest Group, Libraries and Digital Humanities:

ADHO’s Libraries and Digital Humanities SIG aims at fostering collaboration and communication among librarians and other scholars doing DH work, by showcasing the work of librarians engaged in DH projects, advocating for initiatives of interest and benefit to both libraries and DH, promoting librarians’ perspectives and skills to the rest of the DH community, and offering advice and support to new and emergent associations of librarians engaged in DH projects.

The Libraries and Digital Humanities SIG encourages membership from all fields and geographic regions: please visit its Twitter page or sign up for updates through a Google Form.

dh+lib will be hosting a longer post by the conveners of the Libraries and Digital Humanities SIG in the coming weeks; watch this space for more details!

POST: What is “extended” about Extended Collective Licensing?

Kevin Smith (Duke University) has authored a post on the Duke University Libraries’ Scholarly Communications @ Duke blog about the US Copyright Office’s new proposals on orphan works and mass digitization. Smith breaks down the proposals, including how the “extended collective licensing” provision would work:

Alongside this proposal for how to deter use of individual orphan works is a grander scheme to deter mass digitization projects, called extended collective licensing.  So what does “extended” mean in this context?  A normal collective licensing scheme means that rights holders get together and create a collective organization to administer the rights that they own.  Such organizations are usually inefficient and sometimes prone to corruption, but there is nothing inherently wrong with the idea behind them.  They could, if done well, increase efficiency for both rights holders and users (that is, for new creators).  When a collective licensing scheme is extended, however, it means that licenses are being sold for rights not held by any of the members of the collective society.  That is the point about orphan works — a collective society representing the traditional content industries would sell licenses for the use of works for which they do not, by definition, hold the rights.  They would collect licensing fees “on behalf” of the unknown owners.

Smith goes on to expand upon why extended collective licensing and the Copyright Office proposals in general are so harmful to libraries and to the creative sphere:

ECL is a form of tax on using orphan works.  The revenue from that tax will have no benefit in providing an incentive for further creation, because it will not go to the creators who made the works in the first place.  But a requirement to pay such an unproductive tax will certainly deter many digitization projects that could make rare historic materials available for research, study and teaching.  Thus productivity is lost without the benefits of an economic incentive.  It is, truly, a lose-lose situation.