The National Archives has announced the addition of Optical Character Recognition (OCR) search capabilities to its online catalog. Until now, the catalog was searchable only by a few metadata fields (including title and description) or by crowdsourced tags and transcriptions. OCR functionality will improve search across millions of pages and may make findable some of the text that appears in images.
The new OCR engine (built on Tesseract) is applied to records in JPG or PDF format added since June 2019, but NARA is working to extend OCR processing to older records.
The post includes a sample search experience and suggests a few recently added collections for readers who want to try it themselves.
This post was produced through a collaboration among Conor Dugan, Tierney Gleason, Jill Krefft, Amy Mallory-Kani, Jennifer Matthews, Adam Mazel, and Kristen Totleben (Editors-at-large for the week), Pamella Lach (Editor for the week), and Caitlin Christian-Lamb, Nickoal Eichmann-Kalwara, Linsey Ford, and Ian Goodale (dh+lib Review Editors).