POST: The Magnificent Seven: Looking Back on a Year of Exploring the Web Archives Datasets

Pedro Gonzalez-Fernandez (Library of Congress) has authored a post on The Signal, “The Magnificent Seven: Looking Back on a Year of Exploring the Web Archives Datasets.” Gonzalez-Fernandez reviews the activities of LC’s Web Archiving Team over the past year:

Now at over 2 petabytes, the web archives are a complex aggregation of interrelated web objects that make up the internet as we know it (images, text, code, audio, video, etc.). In keeping with the Digital Strategy for the Library of Congress, we are working to “throw open the treasure chest” by making this digital content as broadly available as possible. However, without the proper tools to navigate this complex resource, users may think of the treasure chest as more of a Pandora’s box! Two broad goals directed our investigation: 1) to develop a better understanding of the individual media objects that comprise the web archives, and 2) to surface specific sets of individual resources from the web archives that will support users exploring research and creative uses of archived content.

The author links to the seven released web archives datasets and shares two creative uses of those datasets by Matt Miller (Library of Congress): Byzantine PDF, which “creates a ‘Frankenstein’ PDF document by cobbling together bits and pieces from the 1,000 PDFs in our dataset,” and Anaphora, which “uses AWS Transcribe to generate transcripts of audio files that can be used to find repeated phrases.”

POST: Co-creating Open Infrastructure to Support Epistemic Diversity and Knowledge Equity

Reflecting on the theme of this year’s Open Access Week, “Open For Whom,” Yasmeen Shorish and Leslie Chan turn their attention to scholarly infrastructure. In their post, Shorish and Chan call attention to questions of diversity, equity, and inclusion in scholarly communication, broadly construed to include digital humanities work happening within and beyond libraries.

They write, “Infrastructure comprises systems and social practices that reflect the values of its creators and, ideally, those who interact with it. Infrastructure, we contend, is never neutral but involves contest over power. Infrastructure not only determines how we access and who can access information, but whose voices count as ‘legitimate’ scholarship.”

They call attention to existing governance structures for open knowledge infrastructure, currently dominated by “a handful of powerful multinational publishers” who “are busy building end-to-end platforms, integrating once disparate journal production workflows, research tools, data services, and researcher profiles. This is done with the aim of extracting vast quantities of data that would allow them to develop new products and services for the global marketplace of metrics, analytics, and university rankings.” The drive for efficiency, they lament, makes it hard to break away from the “implicit value statements have been asserted by the infrastructure providers. Biases in language preference, research areas, publication venue, methodologies, modes of presentation, and even ‘excellence’ have become ‘standardized’ and reinforced as if there is only a single ‘universal’ set of practices.”

As a corrective, they urge librarians to “turn our support to autonomous, community-governed local initiatives, and by providing a network of solidarity for truly diverse and inclusive scholarly communication.” They provide several examples of promising projects, including the Community-led Open Publication Infrastructures for Monographs (COPIM) project, AmeliCA’s efforts to “co-create an open, non-commercial infrastructure for Latin America and other Global South journals,” and the Invest in Open Infrastructure (IOI) project.

POST: ‘Despatches from the Fourth Quadrant’: Three observations from this year’s Discovering Collections, Discovering Communities (DCDC) conference

Stephen Brooks shares reflections from the recent DCDC2019 in a post on Jisc’s Content and Digitisation blog, “‘Despatches from the Fourth Quadrant’: Three observations from this year’s Discovering Collections, Discovering Communities (DCDC) conference.” The conference was held November 12-14 in Birmingham, U.K.

In addition to providing a thoughtful and detailed overview of the conference and its many innovations, Brooks noted several key themes of the conference:

Theme 1: ‘Always on’ emphasized the importance of connection and engagement at the conference. He noted the diversity of roles present, observing that, “the movers and shakers of the information sector mingled with frontline librarians, local archivists, digital innovators and the technologically curious. There were representatives of galleries and museums, specialists in accessibility, publishers and a number of researchers.”

Theme 2: ‘We have more in common than we have differences’ praised the diversity of sessions as much as the overlapping and intersecting topics covered.

Theme 3: ‘You are the infrastructure’ reflects on the keynote talk by David De Roure, who “made it clear that we who ply our trades in the ‘fourth quadrant’ – where growth in both technology and human engagement gives rise to the creation of ‘social machines’ and new forms of research – are all responsible for building and maintaining the environment needed to support these new capabilities in learning and research.”

Brooks noted the importance of human stories across the many collections and communities shared at the conference.

POST: Chasing cows in a swamp? Perspectives on Plan S from Australia and the USA

In a recent post during Open Access Week, Beatrice Gini (University of Cambridge) interviewed Danny Kingsley (Scholarly Communication Consultant) and Micah Vandegrift (Open Knowledge Librarian at NC State University) about Plan S, an Open Access publishing initiative launched in 2018 that requires publicly funded scientific research to be published in open repositories and journals by 2021.

From the post:

In this Open Access Week, we look East and West to find out how Plan S is being received across the globe. Dr Danny Kingsley explores how reliance on foreign students has trapped Australian universities in a ‘Faustian bargain’ with publishers and reduced the scope for change. Micah Vandegrift reports on the type of conversations that Plan S has inspired in the USA, as well as the potential political barriers, sounding a note of cautious optimism.

The uptake of Plan S or equivalent principles in countries beyond Europe is crucial to the overall success of the movement. Publishers are using the fact that uptake currently has limited geographic scope to stall change, arguing that they cannot alter their model to suit the requirements of a relatively small percentage of authors. The number of supporting funders is still small and concentrated in Europe, with a few US players. China initially looked set to join in and thus change the game, but since the end of 2018 we have seen little progress on that front. Has Plan S been successful in shaping conversations around the world?

Hearing from our colleagues in other countries highlights some of the promises and challenges Plan S is facing in making an impact outside Europe. Learning about those raises a number of interesting points for how we advocate for open access at home too.

Kingsley and Vandegrift address the politics, economics, and researcher barriers and incentives that shape local and global implementation of Plan S and of a broader open research ecosystem.

POST: Intra-Campus Collaboration Around Research Support

Brian Lavoie, a research scientist at OCLC, has published a post about developing formal and informal intra-campus collaboration around research support. He shares several observations garnered from recent discussions by the OCLC Research Library Partnership (RLP) Research Support Interest Group:

  • Collaborative relationships on campus are both formal and informal
  • Who you know is important … build your networks
  • Sustainability can be an issue for informal collaborations
  • Appeal to partners’ missions

His discussion and suggestions are highly applicable to those of us working to build research-driven relationships and collaborations across our campuses.

POST: The Research Data Sharing Business Landscape

Rebecca Springer and Roger C. Schonfeld (both Ithaka S+R) have co-authored a post on The Scholarly Kitchen, “The Research Data Sharing Business Landscape.” Springer and Schonfeld examine the business of “large-scale generalist data repositories,” focusing on the four they deem “the most significant players in the landscape today”: Dryad, Mendeley Data, figshare, and Zenodo.

The authors point to a few key trends across these four repositories: increasing integration with open source options; embedding data publication and sharing within existing user workflows; and facilitating compliance with funding agency requirements for data sharing. These trends suggest that the four services are building on their existing strengths while attempting to make a business case for their existence. As Springer and Schonfeld point out, “there is little reason to believe that data sharing has developed into a sustainable, let alone profitable, line of business. With growing funder and publisher mandates, however, it is clear that several companies and not-for-profits are staking their claim on the belief that research data must be part of a comprehensive workflow solution.”

As users and as consultants for faculty, student, and staff projects that may require data sharing, dh+lib readers may be interested in keeping abreast of the changes and trends in these data repositories.

POST: OCR Now Available in the National Archives Catalog

The National Archives has announced the addition of Optical Character Recognition (OCR) search capabilities to its online catalog. Until now, the catalog was only searchable by a few metadata fields (including title and description) or by crowdsourced tags and transcriptions. OCR functionality will improve search across millions of pages, and potentially make findable some of the text that appears in images.

The new OCR engine (built on Tesseract) is applied to records in either JPG or PDF format added since June 2019, but NARA is working to apply OCR processing to older records.

The post includes a sample search experience and suggests a few recently added collections for readers who want to try it themselves.

POST: Libraries and Archivists Are Scanning and Uploading Books That Are Secretly in the Public Domain

This Motherboard post by Karl Bode details efforts of archivists, activists, and libraries to vastly expand the number of public domain books that are being digitized, with particular emphasis on books published between 1923 and 1964.

“As it currently stands, all books published in the U.S. before 1924 are in the public domain, meaning they’re publicly owned and can be freely used and copied. Books published in 1964 and after are still in copyright, and by law will be for 95 years from their publication date. But a copyright loophole means that up to 75 percent of books published between 1923 to 1964 are secretly in the public domain, meaning they are free to read and copy.”

The New York Public Library is leading the effort to identify appropriate titles, digitize them, and upload them to the Internet Archive. Using Python scripts to automate parts of the process, organizers and volunteers are striving to do this work at scale, including verifying that copyright was not renewed. Volunteers from Project Gutenberg and other organizations “are tasked with locating a copy of the book in question, scanning it, proofing it, then putting out HTML and plain-text editions.”
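As a rough illustration of the kind of check such scripts might perform, here is a minimal Python sketch. It is not the NYPL's code: the `Book` class, the `copyright_status` function, and the placeholder renewal entries are all hypothetical, and the year thresholds simply follow the passage quoted above (accurate as of 2019). A real workflow would match against actual renewal records and leave the final determination to a person.

```python
from dataclasses import dataclass

# Year thresholds follow the Motherboard passage quoted above (as of 2019).
PUBLIC_DOMAIN_BEFORE = 1924       # published before this year: public domain
RENEWAL_NOT_REQUIRED_FROM = 1964  # published in or after this year: in copyright


@dataclass(frozen=True)
class Book:
    """Hypothetical record for a candidate title."""
    title: str
    author: str
    year: int


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor differences do not block a match."""
    return " ".join(text.lower().split())


def copyright_status(book: Book, renewals: set) -> str:
    """Classify a book using its publication year and a set of known renewals.

    `renewals` is assumed to hold (title, author) pairs already normalized
    with `normalize`; how that set is built is outside this sketch.
    """
    if book.year < PUBLIC_DOMAIN_BEFORE:
        return "public domain"
    if book.year >= RENEWAL_NOT_REQUIRED_FROM:
        return "in copyright"
    key = (normalize(book.title), normalize(book.author))
    if key in renewals:
        return "renewed: in copyright"
    # Absence of a match is only a lead, not proof that no renewal exists.
    return "no renewal found: needs manual review"


# Placeholder data, invented purely for illustration.
renewals = {(normalize("Example Title A"), normalize("Author One"))}
candidates = [
    Book("Example Title A", "Author One", 1940),
    Book("Example  Title B", "Author Two", 1950),
    Book("Example Title C", "Author Three", 1910),
    Book("Example Title D", "Author Four", 1970),
]
for book in candidates:
    print(book.year, book.title, "->", copyright_status(book, renewals))
```

The sketch deliberately returns "needs manual review" rather than "public domain" when no renewal is found, since exact string matching can miss records that differ in spelling or punctuation.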

DH library folks might want to keep an eye on these efforts in order to help faculty and students access a broader range of texts, including computationally-ready plain-text files, to engage in textual analysis and other DH work.

POST: Connections in Sound: Irish traditional music at AFC

In her article “Connections in Sound: Irish traditional music at AFC,” Meghan Ferriter (The Signal) highlights Patrick Egan, a scholar and musician from Ireland. Patrick recently began a six-month residency with the Library of Congress as a Kluge Fellow in Digital Studies.

Throughout 2019, Patrick has a number of digital projects underway, sharing data about recordings of Irish traditional music collected and held by the American Folklife Center (AFC). Patrick’s research aims to understand more fully the role that archives and collections might play in the lives of performers, as a result of the digital turn. He’s created a number of prototypes for exploring the collections … Patrick agreed to share his research and these ongoing digital projects with the public as he creates them and he’s interested in receiving feedback from researchers and the Irish traditional music community.

Visualizations of some of his work are also available to better enable users to access details about the recordings. Several images are provided in the article.

POST: Should Libraries Be the Keepers of Their Cities’ Public Data?

Linda Poon’s (CityLab) article “Should Libraries Be the Keepers of Their Cities’ Public Data?” looks at the ways in which public libraries engage with open data, and in the process raises interesting questions about the ethics of data dissemination and the role of libraries in protecting privacy. From the article:

Libraries are committed to protecting patrons’ data—in fact, [Pratt Institute professor of library science Debbie] Rabina says it’s strongly emphasized in library science education—and they often delete records of searches. But what does it mean for the library’s commitment to their patrons when they are pressured into entering public-private partnerships with companies that think differently about consumers’ data?

While focused on public libraries, this article raises interesting questions that apply to academic environments as well, particularly as campus libraries increase their engagement with digital data, data literacy, and DH projects that involve public data.

POST: Possibilities for Digital Humanities at Community Colleges

Lisa Spiro (executive director of Digital Scholarship Services at Rice University’s Fondren Library) has posted the slides of a recent talk delivered at Houston Community College’s Spring 2019 English & Humanities Colloquium on “The Digital Classroom: Humanities, Literature & Composition.”

“Possibilities for Digital Humanities at Community Colleges,” built on Dr. Anne McGrail’s work on DH and community colleges, explores the relevance of DH in a community college context, the elements of digital (humanities) pedagogy, and the obstacles & practical solutions of engaging in DH work at community colleges. The talk provides many examples of project-based learning and tips for integrating DH work into the curriculum, which librarians at a range of institutions will find useful.

A PDF version of the slides can be accessed directly here.