POST: OCR Now Available in the National Archives Catalog

The National Archives has announced the addition of Optical Character Recognition (OCR) search capabilities to its online catalog. Until now, the catalog was searchable only by a few metadata fields, including title and description, or by crowdsourced tags and transcriptions. OCR functionality will improve search across millions of pages and potentially make findable some of the text that appears in images.

The new OCR engine (built on Tesseract) is currently applied to records in JPG or PDF format added since June 2019, and NARA is working to extend OCR processing to older records.

The post includes a sample search experience and points to a few recently added collections where you can try it yourself.