POST: Twisty Little Passages: The Franklin D. Roosevelt Master Speech File

In response to the release of The Franklin D. Roosevelt Master Speech File, Thomas Padilla (Michigan State University Libraries) has written a tutorial on gathering and transforming data in bulk, using that dataset as an example. Padilla meticulously documents each phase of the process and offers tips on the tools and commands involved. The steps are:

1. compile a list of collection item links
2. use that list to instruct wget which content it should download
3. specify wget limitations so it doesn’t burden the source data server
4. mass append file extensions to items when they are missing them
5. mass remove the first page of every file (reasons will become clear)
6. extract OCR data from every PDF file
7. create plain text files with OCR data
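
To make steps 2 through 4 concrete, here is a minimal Python sketch. It is not taken from Padilla's tutorial, which works with command-line tools directly. The function names, the file names in the usage note, and the specific wait and rate values are assumptions chosen for illustration. The wget options themselves (`--input-file`, `--directory-prefix`, `--wait`, `--random-wait`, `--limit-rate`) are standard.

```python
from pathlib import Path


def build_wget_command(url_list, dest_dir, wait_seconds=2, rate_limit="200k"):
    """Build a wget argument list that downloads every URL listed in url_list.

    The wait and rate values are illustrative assumptions, not the
    tutorial's settings; tune them to what the source server can bear.
    """
    return [
        "wget",
        f"--input-file={url_list}",        # step 2: read the links from a file
        f"--directory-prefix={dest_dir}",  # save downloads into dest_dir
        f"--wait={wait_seconds}",          # step 3: pause between requests
        "--random-wait",                   # vary the pause between requests
        f"--limit-rate={rate_limit}",      # step 3: cap download bandwidth
    ]


def append_missing_extensions(directory, extension=".pdf"):
    """Step 4: add an extension to every file that lacks one.

    Returns the list of new paths. Hidden files (names starting with a
    dot) are left alone.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.name.startswith("."):
            continue
        if path.suffix == "":
            target = path.with_name(path.name + extension)
            path.rename(target)
            renamed.append(target)
    return renamed
```

To run the download, pass the list to `subprocess.run(build_wget_command("links.txt", "speeches"), check=True)` and then call `append_missing_extensions("speeches")`. Here `links.txt` and `speeches` are hypothetical names. Building the command as a list rather than a single string avoids shell quoting problems.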

The tutorial prepares users to experiment with the Master Speech File dataset using an approach that scales and adapts to their own datasets.

Author: Patrick Williams

Patrick Williams is Associate Librarian for Literature, Rhetoric, and Digital Humanities in the Syracuse University Libraries. He received his MSIS and PhD in Information Studies from the University of Texas at Austin. He is the editor of the poetry journal Really System.