In response to the release of The Franklin D. Roosevelt Master Speech File, Thomas Padilla (Michigan State University Libraries) has put together a tutorial on a bulk data gathering and transformation workflow, using that dataset as an example. Padilla meticulously documents each phase of the process and offers tips on the tools and commands used along the way. The steps are:
1. compile a list of collection item links
2. use that list to instruct wget which content it should download
3. specify wget limitations so it doesn’t burden the source data server
4. mass append file extensions to items that are missing them
5. mass remove the first page of every file (reasons will become clear)
6. extract OCR data from every PDF file
7. create plain text files with OCR data
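Steps 2 through 4 could be sketched in Python roughly as follows. This is an illustrative sketch under stated assumptions, not the commands Padilla uses: the function names, the `links.txt` list, the `speeches` directory, the `.pdf` extension, and the wait and rate values are all hypothetical. The wget options shown (`-i`, `-P`, `--wait`, `--limit-rate`) are standard ones. Steps 5 through 7 depend on PDF tooling that this summary does not name, so they are left out.

```python
from pathlib import Path


def build_wget_command(url_list, dest_dir, wait_seconds=2, rate_limit="200k"):
    """Build a wget argument list that downloads every URL in `url_list`.

    The wait and rate values are assumed defaults, chosen only to
    illustrate throttling so the source server is not overloaded.
    """
    return [
        "wget",
        "-i", str(url_list),            # read the URLs to fetch from a file
        "-P", str(dest_dir),            # save downloads under this directory
        f"--wait={wait_seconds}",       # pause between successive requests
        f"--limit-rate={rate_limit}",   # cap the download bandwidth
    ]


def append_missing_extension(directory, extension=".pdf"):
    """Append `extension` to every file in `directory` that has no suffix.

    Hidden files (names starting with a dot) are skipped.
    Returns the new names of the files that were renamed.
    """
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.name.startswith("."):
            continue
        if path.suffix == "":
            target = path.with_name(path.name + extension)
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

To run the download, the argument list can be passed to `subprocess.run(build_wget_command("links.txt", "speeches"), check=True)`, followed by `append_missing_extension("speeches")` once wget has finished.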
The tutorial prepares users to experiment with the Master Speech File dataset with an approach that is scalable and adaptable to their own datasets.
dh+lib Review
This post was produced through cooperation among Caroline Barratt, Rebecca Dowson, Melanie Hubbard, Alix Keener, Paula S. Kiser, Chella Vaidyanathan, and Amy Wickner (Editors-at-large for the week), Patrick Williams (Editor for the week), Sarah Potvin (Site Editor), and Caitlin Christian-Lamb, Caro Pinto, and Roxanne Shirazi (dh+lib Review Editors).