POST: The Magnificent Seven: Looking Back on a Year of Exploring the Web Archives Datasets

Pedro Gonzalez-Fernandez (Library of Congress) has authored a post on The Signal, “The Magnificent Seven: Looking Back on a Year of Exploring the Web Archives Datasets.” Gonzalez-Fernandez reviews the activities of LC’s Web Archiving Team over the past year:

Now at over 2 petabytes, the web archives are a complex aggregation of interrelated web objects that make up the internet as we know it (images, text, code, audio, video, etc.). In keeping with the Digital Strategy for the Library of Congress, we are working to “throw open the treasure chest” by making this digital content as broadly available as possible. However, without the proper tools to navigate this complex resource, users may think of the treasure chest as more of a Pandora’s box! Two broad goals directed our investigation: 1) to develop a better understanding of the individual media objects that comprise the web archives, and 2) to surface specific sets of individual resources from the web archives that will support users exploring research and creative uses of archived content.

The author links to the seven released web archives datasets and shares two creative uses of those datasets by Matt Miller (Library of Congress): Byzantine PDF, which “creates a ‘Frankenstein’ PDF document by cobbling together bits and pieces from the 1,000 PDFs in our dataset,” and Anaphora, which “uses AWS Transcribe to generate transcripts of audio files that can be used to find repeated phrases.”

dh+lib Review

This post was produced through a collaboration among Esther Brandon, Elisa Coghlan, Hannah Hopkins, Jennifer Matthews, Robin Miller, Race MoChridhe, and Isaac Williams (Editors-at-large for the week), Caitlin Christian-Lamb (Editor for the week), and Nickoal Eichmann-Kalwara, Linsey Ford, Ian Goodale, and Pamella Lach (dh+lib Review Editors).