The rOpenSci project has released tabulizer, an R package that provides bindings to the Tabula Java library.
Tabula is a tool for extracting data from PDF tables:
If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful it is — there’s no easy way to copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface.
rOpenSci develops R packages “that provide programmatic access to a variety of scientific data, full-text of journal articles, and repositories.”
dh+lib Review
This post was produced through a collaboration among Gayle Fischer, Stephen Lingrell, Anna Newman, Kelley Rowan, Chelcie Rowell, and Ashley Zengerski (Editors-at-large for the week), Roxanne Shirazi (Editor for the week), Sarah Potvin (Site Editor), and Caitlin Christian-Lamb, Caro Pinto, and Patrick Williams (dh+lib Review Editors).