RESOURCE: Crowdsourcing + Machine Learning: Nicholas Woodward at TCDL

Nicholas Woodward, Software Developer at the University of Texas Libraries, has shared the text of the talk he gave at the Texas Conference on Digital Libraries (TCDL). Woodward describes his novel approach to transcribing the Digital Archive of the Guatemalan National Police Historical Archive (AHPN), a collection of over 12 million pages:

My approach looks to break up documents into individual words with the idea that though no two documents are exactly alike they are likely to contain similar words. And across an entire corpus, particularly very large ones such as AHPN, words are likely to appear many times. Consequently, if users transcribe the words of one document, then I can use image matching algorithms to find other images of the same words and apply the crowdsourced transcription to the new images.
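To make the idea concrete, here is a minimal sketch of propagating crowdsourced transcriptions by image matching. It is not Woodward's implementation: the tiny binary "word images", the pixel-agreement similarity score, the 0.85 threshold, and the names `similarity` and `propagate_transcriptions` are all assumptions chosen for illustration. A real system would likely use a more robust image matching algorithm on actual scanned word crops.

```python
from typing import List, Optional, Tuple

# Toy stand-in for a cropped word image: 1 = ink, 0 = background.
Image = List[List[int]]


def similarity(a: Image, b: Image) -> float:
    """Return the fraction of pixels that agree between two binary images."""
    same_shape = len(a) == len(b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(a, b)
    )
    if not same_shape:
        # Assumption: crops of different sizes are treated as non-matches.
        return 0.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    agree = sum(
        pa == pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b)
    )
    return agree / total


def propagate_transcriptions(
    labeled: List[Tuple[Image, str]],
    unlabeled: List[Image],
    threshold: float = 0.85,
) -> List[Optional[str]]:
    """Give each unlabeled image the text of its best match, or None."""
    results: List[Optional[str]] = []
    for image in unlabeled:
        best_text, best_score = None, 0.0
        for reference, text in labeled:
            score = similarity(image, reference)
            if score > best_score:
                best_text, best_score = text, score
        # Only reuse a transcription when the match is strong enough.
        results.append(best_text if best_score >= threshold else None)
    return results


# Two words already transcribed by volunteers (hypothetical data).
labeled_words = [
    ([[1, 1, 1], [1, 0, 1], [1, 1, 1]], "policia"),
    ([[1, 0, 0], [1, 0, 0], [1, 1, 1]], "archivo"),
]

# A slightly degraded copy of the first word, and an unseen word.
noisy_copy = [[1, 1, 1], [1, 0, 1], [1, 1, 0]]
unseen_word = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]

print(propagate_transcriptions(labeled_words, [noisy_copy, unseen_word]))
```

In this sketch the degraded copy inherits the existing transcription, while the unseen word stays untranscribed and would go back to the crowd.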

This post was produced through cooperation between Trevor Muñoz, Kristen Andrews, and Elizabeth Lorang (Editors-at-Large for the week), Zach Coble (dh+lib review Editor for the week), and Caro Pinto and Roxanne Shirazi (dh+lib review Editors).