POST: CNI’s Spring 2024 Project Briefings

The Coalition for Networked Information’s (CNI) Spring 2024 Membership Meeting consisted of plenaries and project briefings, which are recorded and publicly available on the CNI site. These meetings feature members’ semi-formal presentations on initiatives, projects, and research, both theoretical and practical.

Recordings also include Q&A sessions.

RECOMMENDED: Modeling Doubt: A Speculative Syllabus

Shannon Mattern (University of Pennsylvania) has published an open-access piece in the Journal of Visual Culture titled “Modeling doubt: a speculative syllabus.” Adapted from Mattern’s May 2023 King’s Public Lecture in Digital Humanities at King’s College London, the piece explores “where humanistic conceptions of doubt do, or could or should, reside within our digital systems: at the interface, within the code, or engineered into hardware and infrastructure.”

From the abstract:

In light of increasing artificial intelligence and proliferating conspiracy, technofetishism and moral panics, faith in ubiquitous data capture and mistrust of public institutions, the ascendance of STEM and the ‘deplatforming’ of the arts and humanities, this article considers doubt as an epistemological condition, a political tool, an ethical force, a rhetorical register, and an aesthetic category.

Of interest to DH practitioners, librarians, and higher ed administrators, this speculative discussion provides a valuable starting point for those wishing to dig deeper into the historical and recent conversations around the paradox of applying digital technologies to humanities data that is by nature incomplete and uncertain.

RECOMMENDED: Large Language Models and Academic Writing

The South African Journal of Science recently published an article by Martin Bekker (University of the Witwatersrand) that explores a tiered model for assessing academic authors’ engagement with large language models (LLMs) like OpenAI’s ChatGPT.

“Large language models and academic writing: Five tiers of engagement” offers guidance for academic journal editors, university instructors, and curriculum developers (as well as library workers) on thinking about the different modes of authorial engagement with LLMs for academic writing. The article proposes a five-tier system “to simplify thinking around permissions and prohibitions related to using LLMs for academic writing. While representing increasing ‘levels’ of LLM support that progress along a seeming continuum, the tiers in fact represent paradigmatically different types of mental undertakings” (p. 2).

The tiers include 1: Use ban, 2: Proofing tool, 3: Copyediting tool, 4: Drafting consultant, and 5: No limits. Bekker proposes an ethical framework for evaluating the potential harms and benefits of authors’ use of LLMs at each tier of engagement. Concluding with a brief discussion of “AI hype and despair,” this paper makes an interesting contribution to the ongoing conversations across global higher education around emerging AI technology’s use and its impact on academic publishing.

Read the full open-access article on the publisher’s website.

POST: AI Will Lead Us to Need More Garbage-subtraction

Todd Carpenter, Executive Director of NISO, writes for The Scholarly Kitchen in “AI Will Lead Us to Need More Garbage-subtraction.” Amid a flurry of recent articles in LIS journals and higher education blogs raising concerns about generative AI and large language models (LLMs) being trained on non-transparent, highly biased swaths of data culled from across the internet, Carpenter speculates on another unintended consequence of these technologies: they add to the growing amount of low-quality content shared online, in other words, more “garbage” for researchers to sift through in the search for valuable information.

From the article:

In a world of ubiquitous information, curation becomes the most coveted service. Reduction, selection, and curation become the highest value an organization can provide. We need to subtract from the flow of information, by “deleting the garbage….”

Into this environment, generative AI systems will only exacerbate that problem. In the same way that robotics have made manufacturing processes more exact, more efficient, faster, and cheaper, AI tools will help everyone generate ever more content. As large language models and generative text creation AI systems make the authorship of content easier, ultimately this will only generate more and more content.

POST: Global ‘Bit List’ of Endangered Digital Species 2023

On World Digital Preservation Day (November 2), the Digital Preservation Coalition released the 2023 edition of the “Global ‘Bit List’ of Endangered Digital Species” – an open, community-created resource listing the most at-risk digital materials. The list this year consists of 87 entries, with new entries including “First Nations Secret/Sacred Cultural Material.”

From the post announcing the resource, titled “Is data loss a choice?”:

“The most noticeable thing about the 2023 edition of the Bit List is actually how little has changed,” explains Dr. Amy Currie, Bit List Co-ordinator for the DPC. “The Bit List Council made only marginal changes from the recommendations in 2021 so, in this sense, the 2023 report has validated the broad conclusions of previous years, updating them rather than setting them aside. With a few honourable exceptions, there has been little or no improvement in the overall risk profile of digital assets.”

The Bit List 2023 is published in the shadow of a global pandemic, during a land war in Europe and a time of heightened tension and possible war in the Middle East. These threats to digital content coincide with a crisis of knowledge and fog of disinformation. Cyber-warfare can make battlefields and hostages of almost any connected device and data, and technical inter-dependency means that economic shocks threaten the digital memory of the world in ways we have barely begun to comprehend.

POST: Modeling Cultural Networks in the Classroom with Constellate

In recent years, JSTOR Labs launched Constellate, a modern successor to JSTOR Data for Research, to enable computational analysis of historical newspapers, scholarly journals, and other documents within the JSTOR corpora. With tutorials that include analysis in Python notebooks and dataset builders, Constellate focuses on data pedagogy to sharpen and enhance one’s data skills.

Dr. Natalie M. Susmann, digital scholarship librarian at Brandeis University and Mediterranean landscape archaeologist, shares her experience with Constellate in a recent post from JSTOR Daily, “Modeling Cultural Networks in the Classroom with Constellate: Using JSTOR’s Constellate lab to teach students how to do digital text analysis and data visualization for historical subjects.” She explores conducting text analysis on peer-reviewed articles related to the Argive Heraion sanctuary in Argos, Greece.

Her step-by-step process offers a use-case and template for exploring Constellate’s tools and data for your own research and especially for pedagogical purposes in courses and workshops.
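In the spirit of a Constellate-style notebook exercise, a term-frequency pass is often the first step in this kind of text analysis. The sketch below uses only the Python standard library; the two sample documents and the stopword list are invented placeholders, not JSTOR data or Susmann’s actual workflow.

```python
# Minimal term-frequency sketch, a common first step in notebook-based
# text analysis. Documents here are invented placeholder sentences.
from collections import Counter
import re

documents = [
    "The Argive Heraion was a major sanctuary in the Argolid.",
    "Excavations at the Heraion revealed votive offerings.",
]

# A tiny illustrative stopword list; real projects use fuller ones.
STOPWORDS = {"the", "a", "in", "at", "was", "were"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

counts = Counter()
for doc in documents:
    counts.update(tokenize(doc))

print(counts.most_common(3))  # "heraion" appears in both documents
```

From counts like these, a notebook would typically move on to visualization or to comparing frequencies across subsets of a corpus.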

POST: The Data Sitters Club #19: Shelley and the Bad Corpus

DSC #19 of The Data Sitters Club, a project that applies “digital humanities computational text analysis tools and methods” to a popular book series from the 1990s, looks at the corpus of works that make up the collection. The author of this chapter, “Shelley and the Bad Corpus,” Quinn Dombrowski (Stanford University), worked with Prof. Shelley Staples (University of Arizona), a corpus linguist, to look more closely at what constitutes a complete corpus. For this project, the items in the corpus were easy to identify, since the series is complete and has a finite number of installments, but that isn’t always the case. The author also looks to pizza to “illustrate the consequences of corpus choice and set up a discussion about the claims we can make.”

POST: Touching Data: Conducting a Survey With Paper and Thread

A recent piece in Nightingale, “Touching Data: Conducting a Survey With Paper and Thread,” explores using physical data methods to collect survey results. The process also produced an instant visualization of the results. While this survey was not on a humanities-specific topic, the method could provide interesting new ways to engage patrons with humanities data questions.


POST: John2Vec, or embedding Dewey’s philosophy

Elisabetta Rocchetti and Tommaso Locatelli (both University of Milan) have authored a post, “John2Vec, or embedding Dewey’s philosophy,” on Tales from the ISLab, the blog of the ISLab at the Università degli Studi di Milano. The post describes applying an artificial neural network (ANN), in this case word2vec, to the massive text corpus of the writings of philosopher John Dewey (1859-1952).

This technique allowed the researchers to identify “how vector operations relate to semantic relations”:

For instance, the nearest embedding to Kant is Hegel; Empiricism is placed next to Rationalism; nature and universe embeddings are next to each other. These examples show that Euclidean distances relate to semantic similarities.

To further demonstrate word2vec potential, we try to extract complex relations using vector operations. By adding the difference between two semantically related terms, such as idealism and Hegel, to another embedding, such as Kant, we obtain the vector associated to rationalism. Equivalently, we are comparing Hegel to Kant to find out which is Kant’s school of thoughts. This experiment demonstrates the possibility of extracting analogies through vector operations involving word2vec’s embeddings [2].
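The analogy arithmetic quoted above can be sketched with plain cosine similarity. The vectors below are invented toy values, not real embeddings from the authors’ Dewey model; in practice they would come from a trained word2vec model (e.g. via gensim).

```python
# Toy illustration of "Hegel : idealism :: Kant : ?" solved with
# vector arithmetic. All vectors are invented for demonstration.
import math

embeddings = {
    "hegel":       [0.9, 0.8, 0.1],
    "idealism":    [0.8, 0.9, 0.2],
    "kant":        [0.7, 0.2, 0.8],
    "rationalism": [0.6, 0.3, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Solve a : b :: c : ?  by computing vec(b) - vec(a) + vec(c)."""
    target = [bi - ai + ci for ai, bi, ci in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    # Return the remaining word whose embedding is nearest the target.
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))

print(analogy("hegel", "idealism", "kant"))  # -> "rationalism" with these toy vectors
```

With real embeddings the candidate pool is the whole vocabulary, so the nearest-neighbour step is what makes the analogy non-trivial.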

The authors go on to use Dewey’s writings to examine semantic shift:

Semantic shift is a phenomenon that concern the evolution of a word usage. Indeed, the meaning of a word is not fixed once for all and can change over generations, lifetimes or geographical regions.

In this case, semantic shift was used to track Dewey’s changing thought process over time – for instance, which philosophers he’s referencing at different periods of his life and career – as well as for the change in concepts mentioned in relation to education over time.

As the authors note in their closing paragraph, “Computational natural language processing methods can be of great interest for social sciences such as philosophy: experts in this fields can benefit from these tools and techniques to analyse its history and evolution, automatically extracting relevant concepts and thoughts.” Information workers can similarly use ANNs like word2vec in their own research, or as part of their toolset when working on collaborative projects with others.

POST: The Digital Campaigns Project

Writing for the Archive-It blog, Joshua Meyer-Gutbrod (University of South Carolina, Department of Political Science) describes the context and creation of the Digital Campaigns Project, a database of U.S. state legislative campaign websites from 2016-2022 that enables researchers to examine variation in state partisan agendas and rhetoric.

The project, initially begun in 2016, is led by Meyer-Gutbrod and co-created with undergraduate students from the University of South Carolina, with the goal of establishing a long-running data source on state campaign rhetoric by archiving campaign websites for state-level elected officials.

The blog post describes the process of finding local campaign websites and partnering with the Archive-It team to create a public repository of the website data. On the research potential of this database, Meyer-Gutbrod writes, “From a research perspective, both the text and image data stored through these archives provides a unique, expansive, and holistic look at state political rhetoric, which has historically been understudied due to the lack of availability of similar data.”

POST: Fight for the Future Statement on Libraries’ Digital Rights

Fight for the Future, a group of artists, engineers, activists, and technologists, published a statement about the ongoing lawsuit against the Internet Archive’s digital library. Oral arguments are scheduled for March 20, 2023 in this suit brought by four major publishers. The post quotes Lia Holland (they/she), Campaigns and Communications Director at Fight for the Future:

We’re eagerly awaiting the Internet Archive’s opportunity to have their day in court and speak up for the digital rights and future of all libraries in the US. This suit from major publishers has broad implications for libraries’ abilities to circulate digital books—namely, whether or not they are allowed to own and preserve digital books at all.

Currently, major publishers offer no option for libraries to permanently purchase digital books and carry out their traditional role of preservation. It is just as important to preserve digital books as paper books, given especially the rising popularity of digital books and the fact that many local and diverse voices are not published in print. We want a future where libraries are free to preserve digital book files and ensure they remain accessible to the public as well unaltered. Instead, libraries are forced to pay high licensing fees that regard patron privacy as a premium feature, and the third-party vendors like Overdrive that offer such licenses are vulnerable to censorship from book banners. Under this regime, publishers act as malicious gatekeepers, preventing the free flow of information and undermining libraries’ ability to serve their patrons.

Read more at the Fight for the Future website. Want to catch up on what this latest lawsuit filed against the Internet Archive could mean for the future of controlled digital lending? Check out the Internet Archive’s blog post about the lawsuit and the opposition briefs that have been submitted to date: “Internet Archive Opposes Publishers in Federal Lawsuit” (September 3, 2022).