POST: robots.txt

Ed Summers (MITH) has written a post in response to the Internet Archive’s recent comments about the difficulties that robots.txt files present for web archiving. In “Robots.txt meant for search engines don’t work well for web archives,” a post on the Internet Archive blog, Mark Graham explained that the web archiving service has already “stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages,” and that it was looking to expand that practice more broadly. Summers responds:

If the Internet Archive starts to ignore robots.txt it pushes the decisions about who and what to archive down into the unseen parts of web infrastructures. It introduces more uncertainty, and reduces transparency. It starts an arms race between the archive and the sites that do not want their content to be archived. It treats the web as one big public information space, and ignores the more complicated reality that there is a continuum between public and private. The idea that Internet Archive is simply a public good obscures the fact that ia_archiver is run by a subsidiary of Amazon, who sell the data, and also make it available to the Internet Archive through a special arrangement. This is a complicated situation and not about a simple technical fix.

Summers goes on to consider the role of the Internet Archive more broadly, calls for greater collaboration between web publishers and web archives, and asks for a more nuanced conversation about archiving the web.

Summers’s piece also generated an interesting discussion in the comments, with input from Jessamyn West and Dale Askey.

dh+lib Review

This post was produced through a collaboration among Amber D'Ambrosio, Rebecca Dowson, Benedikt Kroll, John Meyerhofer, Liz Rodrigues, Jordan Sly, and Mary Vasudeva (Editors-at-large for the week), Roxanne Shirazi (Editor for the week), and Caitlin Christian-Lamb, Caro Pinto, and Patrick Williams (dh+lib Review Editors).