POST: Text Capture and Optical Character Recognition 101

In a recent blog post, Simon Tanner (King’s College London) offers an introduction to text capture and optical character recognition (OCR). Tanner outlines the ways in which digital humanities textual datasets are created from physical artifacts and weighs the strengths and weaknesses of OCR, rekeying, handwriting recognition, and speech recognition as methods for creating them.

This post is particularly helpful for those starting digitization projects from scratch, and it serves as a readable primer for those with little exposure to the processes through which print documents are transformed into digital textual data. Tanner also offers advice on choosing a suitable approach for one's own project, with attention to levels of representation, indexing, metadata, and mark-up.

dh+lib Review

This post was produced through cooperation among Nora Almeida, Corey Davis, Lisa Gonzalez, Anne Ligon Harding, Paula S. Kiser, Carli Spina, and Lauren Work (Editors-at-large for the week), Patrick Williams (Editor for the week), Zach Coble and Sarah Potvin (Site Editors), and Caitlin Christian-Lamb, Caro Pinto, and Roxanne Shirazi (dh+lib Review Editors).