Thomas Padilla (Michigan State University) has written a post addressing a central concern in text analysis projects: how do you get the data you want, and how do you make it usable?
… more experienced Digital Humanists often have programmatic means of getting data and transforming it in such a way that it suits their needs. These means are not inaccessible to beginners but the path from DH interest to DH exploration is sometimes better wended via a route that poses the least resistance. In what follows, I will describe a method that kludges together a couple of different easy to use tools to download web pages en masse, remove markup, and convert them to .txt.
Padilla’s post goes on to provide a step-by-step tutorial, using UC Davis’ British Women Romantic Poets, 1789-1832 project.
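For readers curious about what the "programmatic means" mentioned in the excerpt might look like, here is a minimal Python sketch of the same three steps: download pages, remove markup, and save plain .txt files. It is not Padilla's method, which relies on easy-to-use tools rather than code. The function names, the output file naming scheme, and the UTF-8 encoding are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with spaces, then collapse runs of whitespace.
        # A word split by inline tags mid-word would gain a space here.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string with markup removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def save_pages_as_text(urls, out_dir):
    """Download each URL and write its text to page_001.txt, page_002.txt, ...

    Hypothetical helper: assumes the pages are UTF-8 and publicly reachable.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls, start=1):
        with urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
        (out / f"page_{i:03d}.txt").write_text(html_to_text(html), encoding="utf-8")
```

Calling save_pages_as_text with a list of page addresses and a folder name would leave one .txt file per page in that folder. Before downloading en masse, check the site's terms of use and pause between requests.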