POST: Refining the Problem — More work with NYPL’s open data, Part Two

In part two of his experiment to create an index of items using the New York Public Library’s What’s on the menu? data set, Trevor Muñoz discusses his work with the data and some of the lessons he learned. Muñoz used OpenRefine and, finding the NYPL data set too large to work with easily in that tool, describes some of his workarounds. Muñoz concludes:

The larger question is whether there is still a plausible vision for how a data curator could add value to this data set. The need to script around limitations of a tool increases the cost of normalizing the NYPL data. At the same time, the ability to see the clusters of similar values that Refine produces increases my confidence that the potential gain in data quality could be very substantial in going from the raw crowdsourced data to an authoritative index.
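
For readers unfamiliar with what "clusters of similar values" means in practice, here is a minimal sketch of key-collision clustering in Python. It only approximates the general idea behind fingerprint-style clustering and is not Refine's actual code or Muñoz's scripts. The function names and the sample dish names are invented for illustration and do not come from the NYPL data.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build an approximate 'fingerprint' key for a string (hypothetical helper)."""
    # Fold accented characters to their closest ASCII form.
    folded = unicodedata.normalize("NFKD", value)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then drop ASCII punctuation.
    cleaned = folded.lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate the tokens, and sort them.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values whose fingerprints collide; keep only groups of two or more."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]


# Invented sample values, not taken from the NYPL data set.
dishes = [
    "Consommé, Hot",
    "hot consomme",
    "Consomme  hot.",
    "Roast Beef",
    "Beef, roast",
    "Apple Pie",
]
clusters = cluster(dishes)
for group in clusters:
    print(group)
```

Each printed group is a set of spellings that a curator might review and merge into a single authoritative form. Values with no near-duplicates, such as "Apple Pie" here, are left out.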

dh+lib Review

This post was produced through a cooperation between Hugh Burkhart, Christopher A. Miller, Caitlin Pollock, and Jennifer Snider (Editors-at-Large for the week), Zach Coble (Editor for the week), and Roxanne Shirazi and Caro Pinto (dh+lib Review Editors).