POST: Models All The Way Down

Models All the Way Down, an essay from the Knowing Machines project written by Christo Buschek and Jer Thorp, breaks down the relationship between a large AI model and its training data.

The authors focus their article on LAION-5B, an enormous, open dataset of captioned images. The dataset is widely used in research and public-facing projects to generate new images or captions, but because of its size, very few people actually know what images it contains.
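One reason the contents are hard to see is scale: LAION-5B is distributed as metadata, billions of image URLs paired with captions, rather than as the images themselves. As a minimal sketch (not taken from the article), a researcher might skim a single metadata shard like this; the file name and the "URL"/"TEXT" column names are assumptions about one such shard, not a guaranteed schema.

```python
# Minimal sketch: skim a few caption/URL pairs from a LAION-style metadata shard.
# The file name and column names ("URL", "TEXT") are illustrative assumptions.
import pandas as pd

# Hypothetical local copy of one metadata shard.
df = pd.read_parquet("laion_shard_00000.parquet")

# Print a small random sample of caption/URL pairs.
sample = df.sample(n=5, random_state=0)
for _, row in sample.iterrows():
    print(row["TEXT"][:80], "->", row["URL"])
```

Even this kind of spot-checking only surfaces a vanishingly small slice of a five-billion-item collection, which is part of the authors' point.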

This is the major concern of the authors, who state “What this training set contains is extremely important. More than any other thing, it will influence what your model can do and how well it does it. Yet few people in the world have spent the time to look at what these sets that feed their models contain.”

In the article, the authors note that a project at Stanford University found that LAION-5B contains a significant number of images depicting NSFW content, including child sexual abuse material. The authors investigate how training sets are collected and curated (often by other ML models, as sketched below) to show the importance of transparent, ethical, and safe dataset curation practices that prevent exploitative and harmful collection, especially in machine learning contexts.
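As a rough illustration of what "curated by other ML models" can mean in practice: an image-text model scores how well each caption matches its image, and pairs scoring below a threshold are dropped. The sketch below uses random vectors as stand-ins for real image and caption embeddings, and the threshold value is illustrative only; it is not presented as LAION's actual pipeline.

```python
# Rough sketch of model-based dataset filtering: score each image-caption pair
# with a similarity measure and keep only pairs above a cut-off.
# Random vectors stand in for embeddings a model such as CLIP might produce.
import numpy as np

rng = np.random.default_rng(0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings for 10 candidate image-caption pairs.
image_embeddings = rng.normal(size=(10, 512))
caption_embeddings = rng.normal(size=(10, 512))

THRESHOLD = 0.28  # illustrative cut-off, chosen for this sketch

kept = [
    i
    for i in range(10)
    if cosine_similarity(image_embeddings[i], caption_embeddings[i]) >= THRESHOLD
]
print(f"kept {len(kept)} of 10 candidate image-caption pairs")
```

The design choice matters: whatever biases or blind spots the filtering model has are passed straight through to the dataset, and from there to every model trained on it.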

The authors' investigation of the LAION-5B dataset shows a lack of care and responsibility in dataset curation, even though these datasets are at the core of many ML models and products used every day by people everywhere. This concern is something the authors, and other members of the Knowing Machines project, are taking seriously.

Knowing Machines is a project investigating machine learning systems and how they are trained, in order to better understand the relationship between training data and the algorithms that data informs. From the project website:

We are developing critical methodologies and tools for understanding, analyzing, and investigating training datasets, and studying their role in the construction of “ground truth” for machine learning. Our research addresses how datasets index the world, make predictions, and structure knowledge cultures. Working with an international team, we aim to support the emerging field of critical data studies by contributing research, reading lists, research tools, and supporting communities of inquiry that are focused on the foundational epistemologies of machine learning.

Their homepage contains more information and articles like Models All the Way Down that make clear what is often hidden in large ML datasets.

dh+lib Review

This post was produced through a cooperation between Amy Gay, Abbie Norris-Davidson, Mariam Ismail, Carrie Pirmann, Trip Kirkpatrick, and Mimosa Shah (Editors-at-Large), Ruth Carpenter, Hillary Richardson, and Caitlin Christian-Lamb (Editors for the week), Claudia Berger, Molly McGuire, Nickoal Eichmann-Kalwara, Linsey Ford, Pamella Lach, Christine Salek, and Rachel Starry (dh+lib Review Editors), and Tom Lee (Technical Editor).
