POST: A Dataset for Distant-Reading Literature in English, 1700-1922

Ted Underwood has written a post describing his collaboration with the HathiTrust Research Center (HTRC) to create a dataset for scholars who want to get started with distant reading methods on English-language literature from 1700 to 1922.

By pairing the page-level word counts available from HTRC with page-level metadata that tracks genre within each volume, Underwood has “created three genre-specific datasets of word counts covering poetry, fiction, and drama from 1700 to 1922.” He notes, however, that although the datasets are designed to ease literature scholars' entry into distant reading, “This is not something that can be sliced easily using a tool like Excel. Someone involved with the project needs to be able to program in order to pair the metadata table with the files.”
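
To give a sense of what "pairing the metadata table with the files" might involve, here is a minimal Python sketch. It is not Underwood's code, and the layout it assumes is hypothetical: a metadata CSV with columns named `volume_id` and `year`, and one tab-separated word count file per volume named `<volume_id>.tsv`. The actual dataset may use different column names and file formats, so treat the names below as placeholders to adapt.

```python
import csv
from collections import Counter
from pathlib import Path


def load_metadata(metadata_path):
    """Read the metadata table into a dict keyed by volume ID.

    Assumes a CSV with hypothetical columns 'volume_id' and 'year'.
    """
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        return {row["volume_id"]: row for row in csv.DictReader(handle)}


def read_word_counts(counts_path):
    """Read one volume's word counts.

    Assumes one 'word<TAB>count' pair per line (a hypothetical format).
    """
    counts = Counter()
    with open(counts_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            word, count = line.split("\t")
            counts[word] += int(count)
    return counts


def counts_for_period(metadata_path, counts_dir, start_year, end_year):
    """Sum word counts over volumes dated within [start_year, end_year]."""
    metadata = load_metadata(metadata_path)
    total = Counter()
    for volume_id, row in metadata.items():
        if not start_year <= int(row["year"]) <= end_year:
            continue
        counts_path = Path(counts_dir) / f"{volume_id}.tsv"
        if counts_path.exists():  # skip volumes that have no counts file
            total.update(read_word_counts(counts_path))
    return total
```

A call such as `counts_for_period("fiction_metadata.csv", "fiction_counts", 1800, 1850)` (both paths are made up for illustration) would return a single `Counter` of word frequencies for that slice. This is the kind of join between a metadata table and many separate files that is awkward to do in a spreadsheet.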

Author: Roxanne Shirazi

Roxanne is the Dissertation Research Librarian at the Graduate Center, CUNY.