The Library of Congress has released a set of 1,000 PDF files randomly selected from .gov domain sites. This is the first of many anticipated datasets that the Digital Content Management section is preparing in order to “extract and make available sets of files from the Library’s significant web archives holdings.” The broader project seeks to make the Library’s web archives more accessible and usable.
As the Library of Congress blog post explains:
Our aim in creating these sets is to identify reusable, “real world” content in the Library’s digital collections, which we can provide for public access. The outcome of the project will be a series of datasets, each containing 1,000 files of related media types selected from .gov domains… [T]he data will be made available through LC Labs. Although we invite usage and interest from a wide range of digital enthusiasts, we are particularly hoping to interest practitioners and scholars working on digital preservation education and digital scholarship projects.
Additional datasets will be posted over the next year on the Library of Congress’ blog, The Signal. Researchers, librarians, and citizens are invited to link to, download, and explore the datasets.