Ted Underwood has written a post describing his work with the HathiTrust Research Center (HTRC) to create a dataset for scholars looking to get started with distant reading methods on 18th- and 19th-century literature in English.
By pairing the page-level word counts available from HTRC with page-level metadata created to track genre within each volume, Underwood has “created three genre-specific datasets of word counts covering poetry, fiction, and drama from 1700 to 1922.” He notes, however, that although the datasets are designed to ease literature scholars' entry into distant reading, “This is not something that can be sliced easily using a tool like Excel. Someone involved with the project needs to be able to program in order to pair the metadata table with the files.”
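For readers wondering what "pairing the metadata table with the files" might involve, here is a minimal Python sketch. It is not Underwood's code, and the layout is assumed purely for illustration: a metadata CSV with hypothetical `htid` and `date` columns, and one tab-separated file of `word<TAB>count` lines per volume, named `<htid>.tsv`. The actual datasets may be organized differently, so check their documentation before adapting this.

```python
import csv
from collections import Counter
from pathlib import Path


def load_metadata(metadata_path):
    """Read the metadata table into a dict keyed by volume id.

    Assumes a CSV with a header row containing an 'htid' column (hypothetical).
    """
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        return {row["htid"]: row for row in csv.DictReader(handle)}


def read_word_counts(counts_path):
    """Read one volume's word counts from a two-column, tab-separated file."""
    counts = Counter()
    with open(counts_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) != 2 or not row[1].isdigit():
                continue  # skip blank or malformed lines
            counts[row[0]] += int(row[1])
    return counts


def counts_by_decade(metadata_path, counts_dir):
    """Pair each metadata row with its counts file and total words per decade.

    Assumes the 'date' column (hypothetical) holds a four-digit year and that
    each counts file is named '<htid>.tsv' (also hypothetical).
    """
    totals = {}
    for htid, row in load_metadata(metadata_path).items():
        counts_path = Path(counts_dir) / f"{htid}.tsv"
        if not counts_path.exists():
            continue  # metadata row without a matching file
        decade = int(row["date"]) // 10 * 10
        totals.setdefault(decade, Counter()).update(read_word_counts(counts_path))
    return totals
```

Calling `counts_by_decade("metadata.csv", "fiction/")` (both paths are placeholders) would return a dictionary mapping each decade to a `Counter` of word frequencies. The join between table and files is the step that is awkward to do in a spreadsheet.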
dh+lib Review
This post was produced through a collaboration among Rebecca Dowson, Kristen Mapes, Jason Mickel, A. Miller, Lisa Otty, Samuel Russell, Jeri Wieringa (Editors-at-large for the week), Roxanne Shirazi (Editor for the week), Zach Coble and Sarah Potvin (Site Editors), and Caitlin Christian-Lamb, Caro Pinto and Patrick Williams (dh+lib Review Editors).