Taking Action: Ethics, Outcomes, and Hope

Yasmeen Shorish is Associate Professor and Data Services Coordinator at James Madison University. Yasmeen is the Chair of the ACRL Research and Scholarly Environment Committee, Co-Founder of the Digital Library Federation Technologies of Surveillance Group, and Co-PI of the IMLS-funded project Supporting OA collections in the open: community requirements and principles. Yasmeen holds an M.S. in Library and Information Science, a B.S. in Biology, and a B.F.A. in Theatre from the University of Illinois.

Thomas: Thank you for taking the time for this conversation. You are engaged in efforts that span scholarly communication, data information literacy, and representation and social justice in librarianship. Over the course of this conversation we’ll touch upon each, but before digging in, can you talk a bit about what led you to where you are now?

Yasmeen: I had a very circuitous path getting to where I am now so it's difficult to capture succinctly. I have a BFA in technical theater and I worked in the entertainment industry in Chicago before transitioning to librarianship. That experience influenced a lot of how I approach the profession. For example, I'm very outcome-oriented and I look for collaborators with expertise in different areas. From my time in the entertainment industry I learned that diversifying expertise often creates a stronger process and product.

I have been an advocate for representation for as long as I can remember. I was part of the University of Illinois Asian American Artists Collective and a contributor to its publication Monsoon while I was still in high school! After 9/11 I did some public speaking in Chicago for Afghan women's rights, but it was hard to balance that work with my day job. I realized that I needed my career to be something that connected with my core beliefs and made a positive contribution to society. By that time, I had completed a second bachelor's degree in biology while working full-time. As I struggled with next steps, my advisor suggested librarianship. Once I learned more about the profession, I realized that it was the best path for me. I didn't realize at the time how monocultural librarianship is, but nonetheless it is refreshing to have the opportunity to advocate for representation and social justice as part of my work.

Thomas: Last year, you and Shea Swauger kickstarted the Digital Library Federation Technologies of Surveillance group. Can you speak a bit about why the group started? Have the goals changed or remained the same?

Yasmeen: For some time I had been really concerned with the proliferation of surveillance technology in our everyday lives. The market dominance of products like Alexa and other Internet of Things (IoT) innovations makes me really anxious. In librarianship we have a real need for conversations and practices that help us meet the challenges that these technologies pose. Fundamental library work positions us to be advocates in this space. For example, the "information has value" frame of information literacy and our work with metadata and systems provide a good foundation.

At RDAP 2017, I found that Shea and I shared some of the same concerns. At DLF 2017 we presented Surveyance or Surveillance?: Data ethics in library technologies. I asked Bethany Nowviskie if she had ideas for how we could increase engagement with the topic moving forward, and she encouraged us to think about forming a working group. She also provided inspiration for the working group's name, which she had coined in a few of her writings.

The goals of the working group, at the core, remain the same: to interrogate data collection technologies, to document their ethical implications, and to establish guidelines that support critical engagement throughout the profession.

As I mentioned, I am outcome-oriented, so it is important to me that the group produces something for the community to use. Our areas of focus have expanded based on community input and I think we have been able to bring things together in a complementary way. Thanks to the leadership of Dorothea Salo, the group has already produced a document – Ethics in Research Use of Library Patron Data: Glossary and Explainer. We hope it will be taken up and used by the community. The working group is striving to complete a more comprehensive document that encompasses the work of the various subgroups by DLF 2019.

So many people and projects inspire this work. Dorothea Salo, Bethany, Alison Macrina, Melissa Morrone … honestly I can't name everyone because it would take up pages. Projects and groups like Data & Society, EFF, the Library Freedom Project, the IMLS Data Doubles project … there are so many allied initiatives that inspire the working group.

Thomas: Two things struck me from your comments above – the reference to ethics and your own disposition toward outcomes. It has been my sense that in a number of library communities we are becoming better at recognizing the ethical implications of our work – especially as they intersect with the broader realities of a de facto data-driven culture. In terms of gaps, I'd like to see more discussion of ethics paired with the development of corresponding actions germane to community need. As you work toward outcomes, is there a particular form you would like them to take? Are there any models you look to that support making ethics actionable?

Yasmeen: This is complex. When I think about outcomes, I think of them as practical actions that can be taken up and implemented by a variety of practitioners. This requires that I think beyond the act of resource creation, i.e., "we made a document." I work to anticipate how a resource will be used by diverse communities. I try to determine if the requirements for implementation are problematic – potentially not achievable in the reality of a different environment.

I don't have a different style for "ethics work" compared to the rest of my work. I just employ different calculations. These calculations lead to a lot of internal struggle as I try to thread the needle of an optimal outcome in a sub-optimal environment.

Sometimes ideal products must be diluted in order for them to be adopted, implemented, and embraced by a community. Again, I have a lot of internal struggle about this. I want us to take ethically-grounded actions – the most just actions – and trust that they will be championed and radical change will occur. I want us to do that, but I donā€™t think enough people in positions of power are willing to champion radical change and motivate buy-in from their communities.

I have no formal training in social sciences, behavior studies, or even the humanities. My background is in fine art and biology. The models I use to make ethics actionable are primarily informed by lived experience.

The first is the "take me as I am" model. This is the most uncompromising and idealized model – "Here are the most just actions required for the most just outcomes." Often, the prevailing reality – for any institution or society – is not built with justice centered in the system, so actions associated with this model often require radical change. The second model requires more compromise and trust. Its actions take longer to implement and the outcomes are potentially weaker. Compromise is made in order to foster change and engender less defensiveness from those who hold power. I personally feel conflicted about this model. This approach is often exhausting and is sharply felt by nearly any person of color working to effect change in white spaces. Dealing with white fragility is a constant.

Thomas: You serve as the Chair of ACRL's Research and Scholarly Environment Committee (ReSEC). In consultation with ReSEC, Rebecca Kennison and Nancy Maron are working with a number of communities to develop the ACRL Research Agenda for Scholarly Communications and the Research Environment. I am encouraged by the focus on openness, inclusion, and equity – I am doubly encouraged by what appears to be a fairly robust discussion around what those terms could mean in the context of scholarly communication. In that context, what do these terms mean to you? What sort of research would you like to see within this frame? Who could benefit from it?

Yasmeen: In the context of scholarly communication I think "open," "inclusive," and "equitable" should mean that we participate in systems where myriad ways of knowing are valued, are accessible, and are supported.

In the United States, knowledge systems are guided by a "Western" or Euro-American culture. This isn't inherently bad, but the systems are exclusive, such that "other" practices are viewed as invalid, trivial, or suspect. I recently read an article by Kirsten Broadfoot and Debashish Munshi that is hugely relevant. They argue that implementation of a postcolonial academic culture is challenging because Western epistemology has become so widely internalized – a hegemonic state. When I think of how we can make scholarly communication more just, I consider a number of dimensions: academic disciplines, publishing, the marketplace, systemic racism, sexism, entrenched socio-political structures … it's a lot!

I'd like to see more research on the representation of authors and creators across fields: who is contributing to the information landscape; what topics are under- and over-represented; the presence or lack of publishing venues that are supportive of non-dominant epistemologies; the relationship between classification and "normative" experience; user privacy; the capacity for technical platforms to provide more just representations of knowledge. I'd like to see findings brought to bear on the library – the catalog, the institutional repository, and so forth.

Honestly, the whole thing is overwhelming. I have a hard time writing about the enormity of the situation and all the opportunities that we have to contribute to positive change. A more just system doesn't only benefit those of us who have been discouraged from participating or whose voices have been silenced. A more just system benefits humanity. Individuals or groups may lose positional or social power with some of these changes, but there will be more communication, more sharing, on more even footing.

These changes will not be without conflict and friction, but if we can move the needle towards a more collaborative and communicative society, shouldn't we make the effort? Not as advocates from the margins, but as a body, representing the best aspects of our profession?

This is why I have a fair amount of hope attached to the ACRL efforts with the research agenda. ACRL represents its members and parts of the profession. If the Association can provide the reasoning and the tools to help practitioners engage with these weighty and complex topics, we may see more movement across the profession to do this difficult but meaningful work.

Thomas: Following on that, you are Co-PI for the IMLS-funded Supporting OA Collections in the Open. What do you hope to achieve with this project? How can people interact with the project?

Yasmeen: Together with co-PI Liz Thompson and our project partner Judy Ruttenberg at ARL, I hope we can get at what the academic library community really requires to move towards authentic collective action on open access collection development. We're currently conducting focus groups with participants from various types of institutions (e.g., liberal arts colleges, community colleges, HBCUs, research universities) and from various roles within the library (e.g., acquisitions librarians, scholarly communication librarians, library deans, e-resource librarians, subject librarians). We are bringing them together in a room to try to get at their priorities and needs. After we've synthesized the focus group data, we will produce a white paper of our findings with the hope that others in the community will build upon those findings and move the project forward. We will present our findings and hopefully have a robust conversation about the project at the 2019 DLF Forum in October.

Thomas: Lastly, whose work would you like people to know more about?

Yasmeen: Seriously, way too many to name. This question is going to give me anxiety because I will think about all the names I want to add.

I have been incredibly lucky to know generous people who have shared their knowledge and experiences with me. I have mentioned several already, but I would like to add Emily Drabinski, for her work on labor/power and also her writing on the human condition; Bergis Jules, because even though our views sometimes differ, I appreciate and respect his perspectives on community-controlled cultural heritage so much; Mark Matienzo, because I appreciate the ways they think about technology and culture; Charlotte Roh, because she has such critical insights into publishing systems; Anasuya Sengupta and Siko Bouterse (both of Whose Knowledge?), because they are doing amazing work across the globe; and, lastly, authors whom I do not know personally but whom people should know more about: Rumi and Hafez, for writing about one's spirit through a lens of connectivity and also humor (and I'm talking about the Masnavi for Rumi – not the bite-sized phrases that are divorced from the work); Haruki Murakami, for his non-linear writing that is way better than Thomas Pynchon or William Burroughs (don't @ me); and Art Spiegelman, for writing Maus – a highly accessible work that helped me explain war to others who have never had to consider it and because it reminds us of the things that humans are capable of.

Data Librarianship: A Path and an Ethic

Vicky Steeves is the Librarian for Research Data Management and Reproducibility at New York University – a dual appointment between NYU's Division of Libraries and Center for Data Science. Vicky contributes to ReproZip, is a co-founder of LIS Scholarship Archive, and developed Women Working in Openness – an effort initiated by April Hathcock. Vicky holds a BS in Computer Science and an MS in Library and Information Science from Simmons College.

Thomas: I take it that you have a dual appointment in the libraries and an external center. Can you tell us more about your current work? Is your work novel? Might it suggest a model?

Vicky: Yes, I hold a dual appointment between the Libraries and the Center for Data Science. It's exciting because I can flex my computer science muscles, working half-time with the Center for Data Science supervised by one of the most badass women in computing around – Juliana Freire (the first female chair of ACM's Special Interest Group on Management of Data). I contribute to tools like ReproZip that make an immediate impact on researchers. Working with Juliana, Remi Rampin, and Fernando Chirigati, I learn a lot about reproducibility, with a particular emphasis on computational reproducibility (for an introduction, check out Ben Marwick's How Computers Broke Science). This work challenges me to think of scholarship more holistically, not just as an article and accompanying data, but as research materials and computational environment.

The other half of my time at NYU is spent building the libraries' data management service with the wonderful Nicholas Wolf. We teach classes in the library and embedded classes for faculty upon request. We also build collections, create resources, and provide consultations for the NYU community. Nick and I only teach with freely available open source tools. This was a purposeful choice. We want students to be able to take what we teach them and use it whether or not they are at NYU. We presented this model at the 2016 LITA Forum.

I am certainly not the first librarian to have a dual appointment with an outside institute, but I think the responsibilities of my job are novel. My job description explicitly requires me to support reproducibility initiatives. I receive a lot of questions from colleagues at other institutions about my role – my successes and failures. I wrote Reproducibility Librarianship, a "from the field" report, for Collaborative Librarianship that describes my day-to-day. The report is meant to be a resource for colleagues who want to fight for resources to support openness, reproducibility, and data management at their institutions. I think having well-resourced staff supporting reproducibility is important for enhancing and preserving the scholarly record.

Thomas: While you have a background in computer science and information science, I'd venture a guess that an understanding of these areas doesn't immediately translate into some of your current areas of focus. Could you tell us about the path that took you to a career in libraries with this particular area of focus?

Vicky: I knew I wanted to be a librarian when I started college. I went into undergrad assuming I would major in English, thinking it was the best path into librarianship. Nanette Veilleux, my advisor, convinced me that I should take at least one Computer Science course, and that it might be more beneficial to librarianship than an English degree. I took one class with her, and I was hooked. The same professor approached me later that year and asked if I wanted to be "Student Zero" of the newly formed 3 + 1 program. This program would have me complete my Computer Science (CS) degree in three years, and my Library and Information Science (LIS) degree in one year. I jumped on the opportunity, as paying for college was tough. I had to work four jobs on top of school all four years, so I was grateful for the chance to pay less, finish up early, and get going in my chosen field.

After I finished up my LIS degree and started looking for work, I saw the National Digital Stewardship Residency (NDSR) opportunity on the Simmons job list. I thought it would be a valuable chance to get more hands-on experience with digital preservation. I applied to NDSR NY and got into my first-pick host institution – the American Museum of Natural History.

This was my first run-in with research data. My project entailed interviewing all the curators and some of their staff and students to better understand their data storage, curation, and preservation needs. In the course of conducting these interviews I ended up answering questions from researchers like: "Why is the library doing this?", "Isn't this an IT job?", and "What do you mean by data documentation?" In the course of answering, I realized that I was explaining what archivists would call pre-custodial intervention – "acting to influence the arrangement, description, and appraisal of the materials by the creators before they are transferred to [a] repository" (Tatum 2010). Getting documentation together, migrating digital materials to open and preservable formats, making sure materials are stored and backed up securely – all of this was just digital preservation basics. I was explaining archiving and digital preservation to researchers and calling it research data management. It was a big lightbulb-over-the-head moment for me. I took the results of these interviews and recommended strategies for preserving digital assets.

Thomas: Where do you think the Computer Science degree factors into your work, past and present, if at all?

Vicky: Well, I'd begin by highlighting the fact that very few librarians in the reproducibility and/or data management space have or need a computer science degree. For me, the most helpful part of my CS degree was the emphasis on learning how to learn technology instead of focusing on a few specific tools or programming languages. I've been able to adapt to different tools very well because of this, and that has been immensely helpful.

My digital preservation classes introduced fundamental issues in managing and preserving all types of digital materials. The CS degree helped me understand the more technical aspects, e.g., how operating systems work, how file formats work, and so forth. I think computing or coding is seen as magical sometimes, and it's really not.

My CS degree required two philosophy classes and one ethics class. These were huge in shaping my professional identity. Challenging and interesting questions were presented about privacy, sharing, and privilege that are important for any information professional. How are our systems privileging some users/patrons over others? How are we protecting user/patron data?

When I did my first degree, there were only two full-time computer science professors at Simmons. I graduated with just five other people. We were a very tight-knit group. During graduation we all took The Pledge of the Computer Professional together and received a pin from the faculty that says 'HONOR' in ASCII.

It ended up being my second tattoo! The lines from the oath that have stuck with me are:

My work as a Computing Professional affects people's lives, both now and into the future.

As a result, I bear moral and ethical responsibilities to society.

Thomas: Can you talk about a specific situation where your professional ethics came into play? As you thought about a way to work through that situation were there other people or examples of work that inspired you?

Vicky: It affects my day-to-day: how I approach building services, how I recommend resources to patrons, how I do my research, what I choose to spend time on. One side project I decided to work on was the Women Working in Openness site. The website itself is open source and uses CC0 self-reported data. It's basically a searchable, sortable list of women who do work in the field of openness: open access, open science, open scholarship, open source code, open data, open education resources – anything open. The list started in April Hathcock's Google Doc. I just transformed it into a list and a map on the web to encourage folks to quit it with the all-male panels on openness.

Women Working in Openness
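For readers curious what that kind of transformation involves, here is a minimal sketch, assuming a CSV export of the shared spreadsheet with hypothetical name, areas, lat, and lon columns; the actual site is open source and its implementation may differ considerably.

```python
# Minimal sketch (not the site's actual code): turn CC0 self-reported entries
# into a filterable list and a world map. Column names are hypothetical.
import pandas as pd
import folium

entries = pd.read_csv("women_working_in_openness.csv")  # hypothetical export

# Filterable, sortable list: e.g. everyone who self-reports open data work
open_data = entries[entries["areas"].str.contains("open data", case=False, na=False)]
print(open_data.sort_values("name")[["name", "areas"]].to_string(index=False))

# One marker per person who supplied a location, saved as a static HTML map
world = folium.Map(location=[20, 0], zoom_start=2)
for _, row in entries.dropna(subset=["lat", "lon"]).iterrows():
    folium.Marker([row["lat"], row["lon"]], popup=row["name"]).add_to(world)
world.save("openness_map.html")
```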

When I started building services at NYU, I chose to only support freely available open source tools in my service area. This choice is guided by my ethics and is meant to help undermine lock-in with exorbitantly priced academic tools. Thinking back on The Pledge of the Computer Professional, this choice was made with the students who come to my classes and workshops in mind. Overwhelmingly I see graduate students, and I really don't think it's right to train them on software that costs upwards of $200 for a standard license. Just because NYU has it isn't a good enough reason. The students will leave someday, lose access (most likely), and have to learn something freely available anyway. It's an ethical and an efficient choice to use FOSS tools.

I think a lot about the corporate capture of the scholarly record, and how my work in data management and reproducibility can either contribute to or disrupt that. With the rise of reproducibility as a buzzword, there are plenty of commercial entities ready to profit from so-called 'reproducibility platforms'. This represents yet another corporate capture of scholarship. I try to disrupt this by advocating for community-run, open source software for reproducibility, such as ReproZip (which I work on), o2r, and Binder. The same goes for data management platforms. We're seeing a lot of new data services springing up from major publishers, and this is also something I am actively trying to combat.

ReproZip

With respect to ethics more broadly, I often refer to Dorothea Salo, April Hathcock, Amy Buckland, Jeff Spies, Megan Wacha, Sara Mannheimer, Jessica Schomberg, and Yasmeen Shorish. Some have been influential in writing, others in face-to-face meetings and talks, and others I go to directly when I am in an ethical crisis myself.

Thomas: Lastly, whose work would you like people to know more about?

Vicky: In addition to all the folks I listed above:

Shirley Zhao, the Data Science Librarian at the Eccles Health Sciences Library at the University of Utah, is doing some excellent work around community building in data management and reproducibility. She's currently running a course for librarians on data management and reproducibility via the National Library of Medicine. She also helps organize events like the Research Reproducibility Conference at the University of Utah and short courses like Principles and Practices for Reproducible Science.

Cynthia Hudson-Vitale, the Data Services Coordinator and Research Transparency Librarian in Data & GIS Services at Washington University in St. Louis (WU) Libraries, focuses her work on community infrastructure for scholarly communication and data curation. She is part of the core SHARE team as well as the Data Curation Network. Her work in providing open, public-goods infrastructure will help keep the scholarly record in the hands of researchers.

As for other data or data-adjacent librarians, there are far too many people doing great work to name. I especially follow the work of Jenny Muilenburg at the University of Washington, Kristin Briney at the University of Wisconsin, Amy Riegelman at the University of Minnesota, Renaine Julian at Florida State University, and Natalie Meyers at the University of Notre Dame & the Center for Open Science. But again, there are many folks doing excellent work around open infrastructure and databrarianship! I recommend readers follow the #datalibs hashtag on Twitter to find more of us and engage there.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Bias, Perception, and Archival Praxis

Elvia Arroyo-Ramirez is Processing Archivist for Latin American Collections at Princeton University Library. Elvia holds an MLIS with a concentration in Archives, Preservation, and Records Management from the University of Pittsburgh. She has presented widely on digital archives and diversity and is co-author with Rose L. Chou, Jenna Friedman, Simone Fujita, and Cynthia Mari Orozco of the forthcoming article ‘The Reach of a Long Arm Stapler: Calling in Microaggressions in the LIS Field through Zine Making’ (Library Trends, Spring 2018).

Thomas: In Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files you describe how technologies used to work with digital collections can channel bias – bias that is not just a historical legacy but very much a product of the here and now. Before we discuss this piece in detail, I'm curious to hear more about what experiences shaped how you see your work in archives. Perhaps what led you to the archival profession?

Elvia: My interest in archives evolved from my studies in art history as an undergraduate at UCLA. I took a class on Dada and was fascinated by the Dadaists' tendency to collect and piece together meaning from discarded and/or re-purposed materials. Marcel Duchamp's The Bride Stripped Bare by Her Bachelors, Even (The Large Glass) and Kurt Schwitters' Merzbau inspired a deep pathos that eventually became the catalyst for my move to a career in archives.

Merz Picture 32 A. The Cherry Picture (Merzbild 32 A. Das Kirschbild), Kurt Schwitters
1921

To provide a little more context on the catalyst – between 1923 and 1936, Schwitters collected and progressively pieced together his colossal Merzbau with objects gifted or left behind by friends and family, such as souvenirs, letters, clippings, and articles of clothing (some stolen by Schwitters). Everything that mattered to Schwitters became part of the bau. It was ultimately destroyed by an Allied air raid during World War II. Schwitters' loss struck a chord with me. His unconventional way of record keeping and memory construction made me curious about archival collections and the process of maintaining and making them available for access.

[pullquote]Who gets to be remembered and historicized by way of record creation?[/pullquote]Archival work requires an ethics of care for the deeply personal and the deeply political. My former boss at the Center for the Study of Political Graphics often said that all art is political. The same can be said about archives and archival work. Record creation, keeping, obstruction, or misrepresentation are all acts of identity and power. Who gets to be remembered and historicized by way of record creation? Who is forgotten or purposefully silenced in history by way of omission or destruction of records? How are records themselves (official records created for governmental purposes in particular) used to communicate misguided notions of holistic representation, truthfulness, neutrality, and objectivity? These are all questions that initially drew me to and continue to keep me in the profession.

Thomas: I've noticed that power and representation, or the lack thereof, are taking a more prominent place in Digital Humanities and digital library conferences. I gather that this focus in archival work isn't necessarily sparked by a transition to digital environments – rather, it predates that transition and maybe even runs alongside it. Do you think there is a reciprocal value to be gained from working across physical and digital legacies? What sorts of critical questions are raised when working with either? How are these questions different or similar depending on the medium and the technology?

Elvia: Issues of representation and power are fundamentally rooted in archival work, and there is rich critical scholarship that discusses these issues in the context of pre-digital archives. Sam Winn's piece The Hubris of Neutrality in Archives does an excellent job acknowledging some of the recent critical work in the archival profession that addresses issues of representation, and gives a nod to Howard Zinn's seminal address to the profession at the 1970 Society of American Archivists meeting. Scholars like Verne Harris, Cheryl Beredo, Randall Jimerson, and Michelle Caswell discuss issues of power, representation, and accountability by challenging the existing canon of archival neutrality and objectivity; speaking on colonialism, apartheid, and transitional democracies and their relationships to record keeping; and connecting these challenges to current archival practices. These scholars have built critical foundations for emerging scholarship that speaks to these same issues in the digital realm.

There is definite value to be gained from working across physical and digital legacies. The work helps us recognize our shortcomings. Jarrett Drake has pointed out that the archival profession's canonical principle of provenance is grounded in a 19th-century colonialist and imperialist era wherein legal property and ownership of records was limited to western white men. Historically, provenance has more or less worked well for archivists tasked with keeping a history of ownership. In digital media and environments, things are a bit different. What does the provenance of a collaboratively created or anonymously created Google Doc look like? In digital environments provenance is becoming increasingly difficult to pin down. I believe this will force the profession to re-evaluate how archivists should account for ownership, authenticity, and custody.

3.5″ floppy disks from the Juan Gelman Papers, Department of Rare Books and Special Collections, Princeton University Library, Elvia Arroyo-Ramirez.

[pullquote]Appraisal for digital collections is, I believe, slowly being shouldered by the processing archivist...[/pullquote] Of course privacy and volume are issues present in analog collections, but they are further problematized when we consider the digital deluge and the responsibility of determining permanent historical value. In analog archival collections, donors and creators can physically comb through and filter materials they do not want to deposit in an archival repository due to the presence of sensitive or personal information. Acquisition of entire hard drives makes appraisal for donors a lot more difficult and places the responsibility of protecting sensitive or personal data on archivists, who, on the whole, are not paid nearly enough; are not equipped with the necessary tools and infrastructure; and do not have enough hours in the day to devote the labor necessary to peruse every file. Appraisal for digital collections is, I believe, slowly being shouldered by the processing archivist without a donor, curator, or administrator understanding how much time it takes to do the work. How best to address privacy issues, and what to keep and what not to keep when we speak with our donors at the point of acquisition, is something archivists will have to continue to advocate for.

Thomas: At the end of your previous response you allude to what might be called "the weight of inheritance" – what is passed to us and the wherewithal we gather to deal with it. I sense a similar tension at work in Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files. In that piece you describe how the tools you inherit as an archivist carry a set of assumptions that bias the processing and representation of digital collections. Are there particular strategies for recognizing these biases and dealing with them? Particular readings or frameworks that guide you in the engagement?

Elvia: I recommend taking a deep dive into social justice and decolonizing technology readings (a trove of which are located here and here).

For me, it has become important to recognize that the tools archivists and other information managers are using (and developing) are part of a larger system that is complicit in propelling and replicating a hegemonic Global North. While technologies are marketed as decentralized, democratic products unbound by location (geographic, cultural), they are largely being developed by a relatively small minority of the world's population that holds majority control and can assert autonomous power. Understanding this, we begin to ask how this frame of thinking impacts an archivist's responsibility to collections on the margins of, or far from, the Global North.

I want to emphasize that what I was writing about in my experience processing the Gelman materials has more to do with recognizing our own biases and perceptions as practitioners learning to be technologists than with the current tools we have at our disposal.[pullquote]I also think about the weight of our ancestral and cultural inheritances and how we reckon (or not) with these as practitioners, users, and creators of digital collections.[/pullquote]You mentioned "the weight of inheritance" in the first part of your question – and besides having to reckon with the tools we use and their probable limitations, I also think about two other types of inheritances. I think about the technical language the digital curation community has inherited or adopted as its own and how potentially ill-fitting it can be when applied to cultural heritage collections. I also think about the weight of our ancestral and cultural inheritances and how we reckon (or not) with these as practitioners, users, and creators of digital collections. Tapping into my own cultural inheritances as a bilingual practicing archivist living in the U.S. with Mexican ancestral roots, I understood that removing diacritic characters from accented words would not only inherently change the meaning behind filenames, it would be an act of cultural erasure. We need more use cases like Gelman's in order to critically reflect on our current practices and make them better.
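To make the filename point concrete, here is a minimal sketch of my own (not drawn from the Gelman workflow itself) showing what naive ASCII normalization does to accented Spanish filenames; the example filenames are hypothetical.

```python
# Minimal illustration: forcing filenames into ASCII strips diacritics and,
# with them, meaning. The filenames below are hypothetical examples.
import unicodedata

def force_ascii(name: str) -> str:
    """Decompose accented characters, then drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# "años" (years) silently becomes "anos" (a very different word),
# and "poesía" (poetry) loses its accent entirely.
for original in ["20_años_de_poesía.txt", "compañera_carta_1978.txt"]:
    print(original, "->", force_ascii(original))
```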

Thomas: In the digital humanities, researchers and practitioners (myself included) often dig into the language that is used to describe data and how one works with data. Verbs like cleaning (see Katie Rawson and Trevor Muñoz's piece Against Cleaning) are problematized. The word data itself is questioned extensively – some even go the route of suggesting alternative nouns (see Johanna Drucker's argument for capta). Some question a terrestrial bias at work in our understanding of data (see Melody Jue's Wild Blue Media: Thinking through Seawater). An increasing number of scholars explore the genealogy of the word data (see Lisa Gitelman's Raw Data Is an Oxymoron). In your work with the Gelman files I was intrigued to see your focus on words like "clean", "compromise", and "illegal". I'm wondering if you might comment on possible alternatives in this space? Maybe models of collaboration and community that could lead to something that better approximates the diversity of a range of lived experience?

Elvia: I find the use of the term "illegal" irresponsible when it is applied outside the confines of the law. Contextualizing the term in our current sociopolitical moment and its application (among others) in the form of a noun to describe migrants not authorized to stay in their country of residence makes for a potentially dangerous association with the dehumanization of migrants. We (digital humanists/archivists) are in the business of preserving and making accessible collections that include a diversity of cultures, identities, and perspectives. Surely we can find more accurate descriptors to communicate what checks out or does not check out in the language we use to describe our practices.

Elvia Arroyo-Ramirez, Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files

[pullquote]... we should keep in mind that wholesale adoption of technological language that has been developed for and by other (dis)similar fields is potentially incongruous to our needs.[/pullquote]Katie Rawson and Trevor Muñoz are onto something when they point to the example of "data cleaning" and how this term is used as an opaque shorthand for a number of diverse actions and steps that are taken to render data usable. This work illustrates the point that emerging areas of work in this space have not fully developed the pointed language needed to communicate our processes and roadblocks. As we move forward we should keep in mind that wholesale adoption of technological language that has been developed for and by other (dis)similar fields is potentially incongruous to our needs. Even in my use of "our" (digital humanists/archivists) there are varying use/need cases.

I believe having conversations across similar fields with a diversity of practitioners is key to understanding how our practices and end goals are alike and dissimilar. Part of the issue is that we are so busy trying to figure out how to reach end goals that we are not quite familiar with the practices each of us employs en route. The proposal of the Collections as Data framework is certainly an opportunity to bring together varied practitioners and users of data to conceptualize or begin reimagining a shared terminology that is mapped to our respective practices and responsibilities.

The records continuum model may add to the collections as data conversation. The model was originally conceptualized to reflect the overlapping responsibilities of records managers and archivists, but I think it could potentially be expanded for those working on preserving and researching archival data. For instance, my goal as an archivist is to make little to no change to the structure and content of a collection while normalizing accessible content to make it as platform- and system-agnostic as possible. When I intervene (duplicate or irrecoverable files, etc.), I must document and justify why I had to. These decisions should be made transparent to our users. The goals of digital humanists are a lot more diverse (i.e., potentially a lot more "data cleaning"), but their ability to access the content they work on is potentially dependent on my labor to preserve and provide access to it.

While archivists and digital humanists might have different goals, we share similar processes and terminology. I think the records continuum model can reveal how much of our current practices we share, or potentially want to share. I would love to organize a think-out-loud meeting (a future Collections as Data meeting?) with data curators, archivists, digital humanists, systems administrators and developers, and whomever else is heavily thinking about this. We might create a shared lexicon that better describes our shared needs and practices.

Thomas: Lastly, whose work would you like people to know more about?

Elvia: Tara Robertson's presentation, Not All Information Wants to be Free, taught me that the library profession's blanket tendency to digitize pre-Internet print resources can be harmful, especially if it clashes with the original consent of the participants involved. In the case Tara highlights, materials from an underground print publication that was produced for a very specific target audience were digitized and made accessible to a general audience without taking care to reach out to individual participants to get their renewed consent. The act of digitizing for access, in this case, was an act of "outing" for some participants who relied on the relative obscurity print provides. Everyone should take pause and read it.

Angela Galvan's Architecture of Authority helps explore the differing and often conflicting core values libraries and vendors have and how these relationships affect the ways we provide access to our resources. The piece also complicates how we see our relationships to our users. My fellow co-presenter Giordana Mecagni gave an excellent talk, The Colonizing Gaze – Digitized Collections, Radical Communities and Paywalls, on this subject at this year's Society of American Archivists annual conference. Designer Jen Wang's Now you see it: Helvetica, Modernism, and the Status Quo of Design speaks on the history of design and its perpetuation of whiteness as aesthetic neutrality. Todd Honma's work on teaching community archives and zines can serve as lessons for librarians, archivists, and other information professionals on how to use zines, an originally analog medium, to better engage with broader communities. I've gathered much inspiration, perspective, and validation from these readings. I am also excited to hear more from students and new professionals like Itza Carbajal, Chido Muchemwa, Nikki Koehlert, Aliza Elkin, and Crystal Paull, all of whom I just had the pleasure of meeting recently.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Engaging Open Cultural Data

Mia Ridge is a Digital Curator working within the Digital Scholarship team at the British Library. Mia holds a PhD in Digital Humanities (Department of History, Open University). She has published and presented widely on user experience design, human-computer interaction, open cultural data, audience engagement and participation in the cultural heritage sector, and digital history. Her edited volume, 'Crowdsourcing our Cultural Heritage', was published in October 2014.

Thomas: Your dissertation and much of your recent work have focused on crowdsourcing and other forms of cultural heritage collection engagement. What came before? What experiences prepared you for this particular area of focus?

Mia: A little while ago I found a forum post from the late 90s where I enthused about the power of the internet to democratise knowledge, so I guess some of my views stem from that optimistic pre-first dotcom boom moment when the web seemed more focused on information and discussion than on e-commerce. Not too long after that, I started working with the Outreach team at Melbourne Museum / Museum Victoria, where the web was seen as just one potential form of outreach, one of many ways to connect people with historical, cultural, and scientific collections.
[pullquote]I’d noticed that people needed some kind of prompt to stop and look at collections – simply putting them online isn’t enough.[/pullquote]My work on crowdsourcing in cultural heritage connects that focus on shared knowledge and collections with the challenge of designing projects that provide enjoyable activities that in some way contribute to a greater good. Working in British museums gave me a sense of the size of collections and the vast amount of work required to document and make them discoverable online compared to the resources available to do that work. At the same time, I’d noticed that people needed some kind of prompt to stop and look at collections – simply putting them online isn’t enough. I have always believed that cultural heritage organisations should find ways to engage everyone with our shared histories and cultures, which sometimes means thinking creatively about reducing barriers to participation and access.

In the spirit of Luis von Ahn and Laura Dabbish's games for social good and building on projects like steve.museum, my 2011 master's dissertation for my MSc in Human-Computer Interaction explored crowdsourcing games for museums. The games I made were designed to encourage people to create metadata while playing with 'difficult' (i.e. less visually accessible) museum objects, particularly objects from history of science and social history collections. I found that the games gave people an excuse to spend time looking at collections they'd otherwise walk past, and that the act of looking at something long enough to describe it with keywords created a sense of engagement with that object. I hadn't expected to find that relationship between crowdsourcing and engagement, but it's informed my work since then.

Thomas: At Digital Humanities 2015 I had the distinct sense that many folks were griping (in a good-natured way) about all the distance they traveled to get to Australia. Of course, some of the Australians in the group reminded the rest that this sort of travel is par for the course for much of their engagement with the international community. I've noticed for a while now the wealth of unique, digital cultural heritage collection activity happening in Australia. Tim Sherratt, Deb Verhoeven, Fiona Tweedie, Sebastian Chan, and Mitchell Whitelaw come to mind. As an Australian, do you think there is something about distance or about the Australian cultural heritage community in particular that encourages some of the neat work we've been seeing? Is there any work in particular that is emblematic of that environment in your mind?

Mia: I think there is something about distance that affects our attitudes to overcoming obstacles. Australia is a sparsely populated country, and pre-digital organisations like the School of the Air and the Flying Doctors were well-known models for delivering services over vast distances. When I started working in the cultural sector in the late 1990s, people were still driving vans across the country to deliver training programmes to get people online or to take museum objects to sessions in rural schools and nursing homes. Working online was a continuation of those outreach programmes. Being an expensive and long flight away from things only available in Europe, Asia or the Americas – whether artworks, historical sites, theatre, festivals or gigs – might also mean that ordinary Australians intrinsically get the value of online experiences or information.

"World's First School of Air Opened", The Advertiser (Adelaide, SA: 1931–1954)

Perhaps there’s something about dealing with distance and the clear need for services for remote communities that focused minds on collaborative, productive approaches to making cultural heritage available online. I’ve always wondered how much portal projects like Australian Museums and Galleries Online (AMOL; see for example this 2002 description of what later became the Collections Australia Network), the National Library of Australia’s Trove, and the Canadian Heritage Information Network’s (CHIN) Virtual Museum of Canada, launched in 2001, were a response to distance. Large countries with slightly smaller populations, like New Zealand or Finland, have produced even more comprehensive infrastructures (DigitalNZ and CultureSampo respectively) for sharing their heritage online.

Trove, Recently Corrected Articles
Trove's crowdsourcing functionality allows users to register to correct metadata and transcription

Thomas: You've worked extensively in the museum community, particularly with projects and initiatives around the idea of "open cultural data." For our library audience, can you describe open cultural data as it's talked about in the museum community? What are some of the key challenges and opportunities afforded by this frame of thinking?

Mia: That's a doozy of a question! Answering it means thinking about the differences between museums and libraries (and the differences between research libraries and lending libraries, and between general and special collections, etc., etc.). I tend to use terms like 'cultural heritage' or 'cultural data' to intentionally include a range of institutions that collect and share any sort of historical, artistic or scientific artefact or object, but sometimes that risks eliding important differences between museums, libraries and archives, including how and why their collections were formed and catalogued. Just as a 'visit' to a museum is different than a visit to a library, so the outcomes, challenges and opportunities for open cultural data are different.

Open cultural data at its simplest is data – images of or descriptive metadata about historical, scientific or cultural items – that is freely available for re-use (avoiding messy issues of commercial and non-commercial uses). But not all data is created equal – catalogue records without images might be the mainstay of collections management systems, but putting them online won't in itself immediately serve a wide audience. Images of artworks lend themselves to a wider range of aesthetic responses and are more easily understood or re-purposed than images of social history objects. A digital image of a book cover or binding might be useful to scholars, but digital images of the pages, or even better, automatically transcribed text from those pages, are useful to a wider audience.
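As a rough illustration of that last point, the sketch below uses the open source Tesseract engine (via the pytesseract wrapper) to derive searchable text from a digitised page image; the filenames are hypothetical, and real digitisation pipelines add image pre-processing, layout analysis, and quality review.

```python
# Minimal sketch: derive searchable text from a digitised page image.
# Assumes Tesseract is installed locally; "page_042.tiff" is a hypothetical file.
from PIL import Image
import pytesseract

page = Image.open("page_042.tiff")
text = pytesseract.image_to_string(page, lang="eng")

# The transcription can now be indexed for full-text search or shared for re-use.
with open("page_042.txt", "w", encoding="utf-8") as out:
    out.write(text)
```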

The challenges are changing – there used to be resistance to sharing data online because staff feared it would lead to fewer visits to galleries or museums. Simultaneously, some feared that putting records online would lead to an increase in enquiries about the items (in reality, this was the more likely outcome). For museums, it's increasingly clear that access to digitised versions of objects or artworks increases interest in the original, and probably increases the chance that people will visit it when they can. Open cultural data can help make collections more discoverable – a small museum or library mightn't have much clout on Google, but its images might be encountered by tens of thousands via a Wikimedia article.

Mia Ridge, "Why do we digitise? 20 reasons in 20 pictures", http://www.slideshare.net/miaridge/why-do-we-digitise-20-reasons-in-20-pictures
Mia Ridge, “Why do we digitise? 20 reasons in 20 pictures”

A long-standing concern is the impact of freely sharing material that could be monetised by picture libraries and other projects. Good digitisation requires decent metadata and technical infrastructure, which isn’t cheap when it’s done to the standards heritage institutions rightly require, but many people find commercial digitisation models problematic.

The opportunities for open cultural data range from the mundane to the aspirational, providing access to information and to experiences.

Mario Klingemann, 251 Random Flowers Arranged by Similarity

[pullquote] . . . numbers don't tell the whole story.[/pullquote] Digitising books reduces pressure on the reading rooms and simultaneously makes them available to anyone anywhere in the world with a decent device and internet connection, at any time of day or night. Images from nineteenth-century books uploaded to Flickr have inspired pop videos, art installations, generative artworks and who knows what scholarly curiosity. That's a good reminder that tracking the use of open collections is itself a challenge, but have libraries ever really known what people do with the books they access? We get hung up on the number of people who access collections online, but numbers don't tell the whole story. With any luck, at least some of them are acting as intermediaries who will provide further experiences of collections through traditional scholarly publications or fancy apps or something you can buy on Etsy.

Thomas: I note that this is far from the first interview you've taken part in. What is one question you haven't received that you would like to answer? What would your response be?

Mia: I tend to be asked about crowdsourcing a lot, but my doctoral research on 'Making Digital History' included an examination of the ways family and local 'amateur' historians have collaborated to create digital resources. These community history projects are rarely shiny or trendy, but they've set the scene for more formal aggregation and digitisation projects. I'd love people to ask what their organisation could do to support non-academic groups using their collections. Should anyone ask me, as a user experience researcher I'm contractually obliged to give the answer 'it depends', then suggest they find groups who are already using or have expressed interest in their collections and ask them. Their responses, in turn, might range from 'provide meeting space' or 'license material for re-use' to 'provide access to internal specialists who can answer questions about collections.'

Thomas: Whose data praxis would you like to learn more about?

Mia: Throwing the ball back over the Atlantic, I’d love to hear from Effie Kapsalis or Meghan Ferriter, who have done outstanding work on open access data and crowdsourcing respectively while working within the Smithsonian. Closer to home, I’d love to hear from Wellcome’s Jenn Phillips-Bacher, who seems to be at the heart of their innovative explorations of digital collections.

Situated Interpretation, Capacious Computation, Empowered Discovery

Tanya Clement is Assistant Professor in the School of Information at the University of Texas at Austin. Her research centers on scholarly information infrastructure as it impacts academic research, research libraries, and the creation of research tools and resources in the digital humanities. She has published in a wide range of publications including Cultural Analytics, Digital Humanities Quarterly, and Information & Culture. Her research on the development of computational methods to enhance description, analysis, and access to audio collections has been supported by the National Endowment for the Humanities and the Institute of Museum and Library Services.

Thomas: You've spent a great deal of time researching and developing infrastructure to support computational analysis of recorded sound. A bit later I'll be asking more about that, but I'm curious where your interest in infrastructure, the affordances of digital vs. analog media, and the possibilities for Humanistic inquiry latent in various computational approaches began. Was it a product of your graduate training, or some other combination of experiences?

Tanya: Most of the time I spent on the family's Apple IIe when I was little was spent playing a bowling game. I did play one text-based game, but it wasn't one of the ones you hear spoken about by most DH scholars, not Oregon Trail or Colossal Cave Adventure. In the game I played, the title of which I cannot recall, I died from botulism after opening a can of food. I should have known – the can was dented.

cans by lwr

Most likely my original interest in infrastructure came from math. My older brother, who went on to become a kind of math whiz, somehow figured out early on that math was a creative endeavor. You could do it all kinds of ways, even if the teachers only show you how to do it one way in school. So, when I didn't understand things at school (which was often), I would go home and figure it out for myself – pencils and erasers and books spread out all over the glass kitchen table. I approached a math problem as a constructed thing. I learned that I just had to figure out the best ways to build the math in order to use it towards a solution.

[pullquote]Literature was (and remains) a miraculous thing to me, an incredible thing that humans build.[/pullquote] Literature was (and remains) a miraculous thing to me, an incredible thing that humans build. Many in DH, tinkerers all, talk about a desire to know how things are built. The same was true for me in math and in fiction. When I did my MFA in Fiction at UVa (1998-2000), my primary question was: how did that author work that language to that effect? How does she build a person, a family, a community, a society, or a universe out of words on the page? While I was working on my MFA, I had a GAship in the Institute for Advanced Technology in the Humanities at UVa, which was being run at the time by John Unsworth, and in the eText Center, which was headed by David Seaman. I worked at both for a semester, and these jobs led me to a job as a Project Designer at Apex CoVantage. At the time, Apex was contracted by ProQuest and Chadwyck-Healey to digitize their microfilm collections, one of which was Early English Books Online (EEBO). The history of EEBO's digitization is described elsewhere (see History of Early English Books Online and Transcribed by hand, owned by libraries, made for everyone: EEBO-TCP in 2012 [PDF]), alongside the collection's oddities, so it is not shocking to point out that digitizing EEBO was not an exact science and that the collection remains today riddled with inexactitudes beyond anyone's control. This job brought me close to the complexities behind building important digitized heritage collections, however, and that remains a central interest for me.

Whitney Trettien, “thumbprint of scanner visible”, STC / 887:16 EEBO

I could see that our cultural heritage and the infrastructures, both social and technical, that sustain it, preserve it, and make it accessible to us were constructed things – and, like all human-made things, constructed more or less well. In my graduate degree at the University of Maryland and in working on DH projects at the Maryland Institute for Technology in the Humanities (MITH), I learned that we have some agency and that telling the stories of a person, family, community, or society well depends on our enacting that agency in the humanities. Already attentive to the inexactitudes of representing the complexities of human culture, humanists are best situated to get our books and pencils out, spread them on the glass kitchen table, and work towards the best ways to build and sustain our cultural heritage in the digital age.

Thomas: I really appreciate the perspective of enacting agency through the Humanities. Over the past few years you’ve exercised that agency, in part, to develop computational use of audio collections in the Humanities. What are the primary opportunities and challenges in this space as they pertain to infrastructure?

Tanya: Libraries and archives hold large collections of audiovisual (AV) recordings from a diverse range of cultures and contexts that are of interest in multiple communities and disciplines. Historians, linguists, literary scholars, and biologists use AV recordings to document, preserve, and study dying languages, performances and storytelling practices, oral histories, and animal behaviors. Yet, libraries and archives lack tools to make AV collections discoverable, especially for those collections with noisy recordings – recordings created in the forest (or other “crowded” ecological landscapes), on the street, with a live audience, in multiple languages, or in languages for which there are no existing dictionaries. These “messy” spoken word and non-verbal recordings lie beyond the reach of emerging automatic transcription software, and, as a result, remain hidden from Web searches that rely on metadata and indexing for discoverability.

“Fiddling Bill Hensley and his rival for Old Time Fiddlers’ Championship (1938-1950), Asa Helton, both seated and talking, holding their fiddles, at the Mountain Music Festival, Asheville, North Carolina”, Library of Congress

Further, these large AV collections are not well represented in our National Digital Platform. The relative paucity of AV collections in the Europeana Collections, the Digital Public Library of America (DPLA), the HathiTrust Digital Library (HTDL), and the HathiTrust Research Center (HTRC), for instance, is a testament to the difficulties that the Galleries, Libraries, Archives, and Museums (GLAM) community faces in creating access to their AV collections. Europeana is composed of 59% images and 38% text objects, but only 1% sound objects and less than 1% video objects. DPLA reports that at the end of 2014 it was composed of 51% text and 48% images, with only 0.11% sound objects and 0.27% video objects. At this time, HTDL and HTRC do not have any AV materials.

The reasons behind this lack of resources range from copyright and sensitivity concerns to the current absence of efficient technological strategies for making digital real-time media accessible to researchers. CLIR and the LoC have called for “. . . new technologies for audio capture and automatic metadata extraction” (Smith et al., 2004 [PDF]), with a “. . . focus on developing, testing, and enhancing science-based approaches to all areas that affect audio preservation” (Nelson-Straus, B., Gevinson, A., and Brylawski, S. 2012, 15 [PDF]). Unfortunately, beyond simple annotation and visualization tools or expensive proprietary software, open access software for accessing and analyzing audio using “science-based approaches” has not been used widely. When it is used with some success, it is typically on well-produced performances recorded in studios, not, for example, on oral histories made ad hoc on the street.

[pullquote]Can we make data about sound collections verbose enough to enable an understanding of a collection even if and when that collection is out of hearing reach because of copyright or privacy restrictions?[/pullquote] We need to do a lot of work to better prepare for infrastructures that can better facilitate access to audio, especially at the level of usability, efficacy, and sustainability. For instance, what kinds of interfaces facilitate the broad use of large-scale “noisy” AV analyses by a diverse range of disciplines and communities? Sound analysis is pretty technical. How do we learn to engage with its complexities in approachable ways? Further, how much storage and processing power do users need to conduct local and large-scale AV analyses? Finally, what are the local and global scale sustainability issues? What metadata standards (descriptive or technical) map to these kinds of approaches? Can we make data about sound collections verbose enough to enable an understanding of a collection even if and when that collection is out of hearing reach because of copyright or privacy restrictions?

Thomas: Now seems like a good time to discuss your audio collection infrastructure development. If you were to focus on a couple of examples of how this work specifically supports access and analysis, what would they be?

Tanya: Specifically, we’ve been working on developing a tool called ARLO, which was originally created by David Tcheng, formerly a researcher at the Illinois Informatics Institute at the University of Illinois Urbana-Champaign, and Tony Borries, a consultant who lives in Austin, Texas. They created ARLO to help ornithologist David Enstrom (also from UIUC) use machine learning to find specific bird calls in the hundreds of hours of recordings that he had collected. Recording equipment has become much more powerful and cheaper over the last decade, and scholars and enthusiasts from all kinds of disciplines have more recordings than possible human hours to analyze them all. Our hope is to develop ARLO so that people have a means of creating data about their audio so that they can analyze that data, share it, or otherwise create information about the collections that can make them discoverable.
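To make the general workflow concrete, here is a minimal sketch of the kind of supervised approach described above: cut a recording into short windows, summarize each window with MFCC features, and train a small classifier to flag likely bird calls. This is an illustration of the technique only, not ARLO’s actual code; the file names, window length, and labels are hypothetical.

```python
# A minimal sketch (not ARLO itself) of finding a target sound, such as a
# specific bird call, in a long field recording with supervised learning.
# Assumes librosa, numpy, and scikit-learn are installed.
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mfcc_windows(path, win_seconds=1.0, sr=22050):
    """Cut a recording into fixed-length windows and summarise each as a mean MFCC vector."""
    audio, sr = librosa.load(path, sr=sr)
    hop = int(win_seconds * sr)
    feats, starts = [], []
    for start in range(0, len(audio) - hop, hop):
        window = audio[start:start + hop]
        mfcc = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))
        starts.append(start / sr)
    return np.array(feats), starts

# Hypothetical training data: windows hand-labelled 1 (bird call) or 0 (other).
X_train, _ = mfcc_windows("labelled_clips.wav")
y_train = np.loadtxt("labels.txt")          # one 0/1 label per training window

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Scan an unlabelled field recording and report windows the model flags as bird calls.
X_new, starts = mfcc_windows("field_recording.wav")
for t, pred in zip(starts, clf.predict(X_new)):
    if pred == 1:
        print(f"possible bird call near {t:.1f}s")
```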

Tanya Clement, “A visualization of a song in ARLO”, from Machinic Ballads: Alan Lomax’s Global Jukebox and the Categorization of Sound Culture

We are still very much in the research and development phase, but we have worked on a few of our own projects and helped some other groups in our attempt to learn more about what scholars and others might want in such a tool. For example, in a piece I wrote last year I talk about a project we undertook to analyze the Alan Lomax Collection in the John A. Lomax Collection in the UT Folklore Center Archives at the Dolph Briscoe Center for American History at the University of Texas at Austin. We used machine learning to find instrumental, sung, and spoken sections in the collection. Using that data, we visualized these patterns across the recordings. It was really the first time such a “view” was afforded of the collection, and it sparked a discussion about how the larger folklore collection at UT reflected the changing practices of ethnography and field research in folklore studies from the decades represented by the Lomax collection to those in more recent decades. With the help of Hannah Alpert-Abrams, PhD candidate in Comparative Literature at UT, we used ARLO in a graduate course called History of Modern America through Digital Archives, taught at LLILAS Benson Latin American Studies and Collections by Dr. Virginia Burnett. Students identified sounds of interest in the Radio Venceremos collection of digital audio recordings of guerrilla radio from the civil war in El Salvador. Some of the sounds students used ARLO to find included bird calls, gunfire, specific word sequences, and music.

In another project, School of Information PhD candidate Steve McLaughlin and I used ARLO to analyze patterns of applause across 2,000 of PennSound’s readings. We discovered different patterns of applause in the context of different poetry reading communities. These results are more provocative than prescriptive, but our hope was to show that these kinds of analyses were not only possible but productive. We are still working through how to approach challenges in this work that come up in the form of usability (What kinds of interfaces and workflows are most useful to the community?), efficacy (For what kinds of research, pedagogical, and practical goals could ARLO be most useful?), and scalability (How do we make such a tool accessible to as many people as possible?).

Thomas: In An Information Science Question in DH Feminism, you argue for a number of ways that feminist inquiry can help us better understand epistemologies that shape digital humanities and information science infrastructure development. How has this perspective concretely shaped your own work and thinking in this space?

Tanya: What has shaped my work is very much in line with this piece. Many people in STS (Science and Technology Studies) and information studies have written about the extent to which information work and information infrastructures are invisible work (Layers of Silence, Arenas of Voice: The Ecology of Visible and Invisible Work). Feminist inquiry has always been about making the invisible aspects of society more apparent, but it is also about how you take stock of those perspectives in your articulation of research. Everyone’s perspective is shaped by gender (or, really, the construct of gender), but it is also influenced by other aspects of your situated perspective in the world, including your nationality, your ability status, your day-to-day living as a parent, a child, a sibling, a spouse, a friend, or any other aspect of your personhood that shapes the way you address and understand the world.

[pullquote]I’ve tried to advocate for developing tools, infrastructures, and protocols that invite others to address research questions according to their own needs.[/pullquote] The concrete ways (the particular or specific ways) that my own situated look at the world has shaped my work are perhaps less interesting than the ways I’ve tried to advocate for developing tools, infrastructures, and protocols that invite others to address research questions according to their own needs. One aspect of ARLO that continues to intrigue me is the possibility of searching sound with sound. You choose a sound that interests you, you mark it, and you ask the machine to find more of those sounds. Now, what I like about this scenario is that a linguist might mark a sound because it includes a diphthong; someone else might mark the same sound because of the tone; a third person might be interested in the fact that this same snippet is spoken by an older man, a younger woman, or a child.

That our understanding of sound is based on a situated interpretation seems readily apparent, especially compared to search scenarios in which words seem to pass as tokens that once represented complex ideas. You can mark gunshots or laughter or code-switching moments when a person uses one language intermittently to express something that a society’s dominant language (let’s say English) can’t quite express. The general point or hope is that the process of choosing a sound for searching can be inviting in ways that are different from the process of choosing a single search term. In comparison to using search terms taken out of context, sound snippets remain more complex even with the absent presence of the missing context. It’s as if sounds have more dimensions, even if they are clipped from a longer recording. I like working with sound for these reasons.
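For readers curious what “searching sound with sound” might look like in code, here is a minimal sketch of the general idea: fingerprint a marked snippet, slide across a longer recording, and rank windows by spectral similarity. It illustrates query-by-example search under stated assumptions, not ARLO’s implementation; the file names are hypothetical.

```python
# A minimal query-by-example sketch: mark a short snippet you care about, then
# slide over a longer recording and rank windows by spectral similarity.
import librosa
import numpy as np

def mel_fingerprint(audio, sr):
    """Summarise a chunk of audio as a normalised mean log-mel vector."""
    mel = librosa.feature.melspectrogram(y=audio, sr=sr)
    vec = np.log1p(mel).mean(axis=1)
    return vec / (np.linalg.norm(vec) + 1e-9)

snippet, sr = librosa.load("marked_snippet.wav", sr=22050)   # the sound you marked
recording, _ = librosa.load("long_recording.wav", sr=22050)  # the collection item to search

query = mel_fingerprint(snippet, sr)
hop = len(snippet) // 2                      # 50% overlap between windows
scores = []
for start in range(0, len(recording) - len(snippet), hop):
    window = recording[start:start + len(snippet)]
    scores.append((float(np.dot(query, mel_fingerprint(window, sr))), start / sr))

# Print the five windows that sound most like the marked snippet.
for score, t in sorted(scores, reverse=True)[:5]:
    print(f"{t:7.1f}s  similarity {score:.3f}")
```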

Thomas: Whose data praxis would you like to learn more about?

Tanya: There is quite a bit of data work going on in digital humanities that is interesting. I appreciate Lauren Klein’s attempt to unravel different histories of data visualization to help us better understand where we are by looking at where we’ve come from, as well as Christine Borgman’s Big Data, Little Data, No Data: Scholarship in the Networked World, which exposes the daily practices of scholars who work with data and how those practices influence interpretation.

I have been lucky to participate in the inaugural issue of the Journal of Cultural Analytics, which is an attempt to provide a platform for showcasing how researchers in the humanities can use data to study literature, culture, media, and history. Further, in Digital Pedagogy in the Humanities: Concepts, Models, and Experiments, Daniel Carter and I have written about ten pedagogical assignments that seek to teach students about the situated elements of data in terms of its collection and use. With each of these examples, I am drawn to work that invites us to critique or understand data as a deeply political phenomenon.

Human Rights and Archival Practice: Equality, Inclusion, Accountability

Yvonne Ng is Senior Archivist at WITNESS, an organization that “trains and supports activists and citizens around the world to use video safely, ethically, and effectively to expose human rights abuse and fight for human rights change.” Yvonne currently serves on the Advisory Board of Documenting the Now, and previously worked as a Research Fellow for the Preserving Digital Public Television Project.

Thomas: To kick things off, how did you find yourself working for an organization like WITNESS? Were there particular experiences that you had that shaped your interest in the work?

Yvonne: I was very lucky to end up at WITNESS, an organization whose mission I truly believe in, and in a position that coincides with many of my interests. As Senior Archivist, I have the opportunity to collaborate with people around the world who are dedicated to defending human rights, to help develop new workflows for archiving video outside of traditional institutional contexts, and to manage a unique collection of human rights video. Since joining WITNESS in 2009, my job has grown to incorporate more outreach, engagement, and training functions, which has given me the opportunity to learn from so many people outside of archives, while also remaining active within the professional archives community.

WITNESS, “Film it”

Before joining WITNESS in 2009, I was a research fellow on the Preserving Digital Public Television project, where I studied economic and other factors affecting the sustainability of digital preservation initiatives in public television. Prior to that, I graduated from the Moving Image Archiving and Preservation (MIAP) program at NYU. My MIAP thesis focused on small arts organizations with audiovisual collections, and developed criteria for assessing their readiness to initiate preservation projects. So I came to WITNESS with a fresh awareness of the challenges of sustaining archives in non-institutional settings, as well as the potential value of archived collections to all different kinds of users.

Before I moved to New York for the MIAP program, I worked at the Canadian Filmmakers’ Distribution Centre in Toronto, one of the oldest artist-run centers in Canada, with an amazing de facto archive of important experimental works on film. I was also enrolled in a graduate program in film studies at York University. For a time, I thought I would pursue a Ph.D. in film studies, but while at York I became interested in working in a more hands-on way with media, with the people who created it, and in audiovisual archives. Film theory and academia started to seem distanced from the actual systems of media production, distribution, preservation, and use.

Throughout my life, I have always been involved in activist- or art-oriented avocations, ranging from community radio starting in high school, to activist-led student government in undergrad, to a collective that made art with kids. These days, I am really proud to be part of XFR Collective and Community Archiving Workshop. Working at WITNESS is the perfect fusion of so many aspects of my life and interests – human rights, archiving and preservation, video, community collaboration, creative production … the list goes on.

Thomas: Thinking about your work with one or more activist groups, can you describe specific challenges that WITNESS sought to help with?

[pullquote]We try to communicate the value of basic archival practices, such as keeping backups in geographically dispersed locations.[/pullquote]Yvonne: When it comes to video archiving, the primary challenges faced by the groups we work with are limited resources, and the fact that they are working in contexts where human rights abuses are taking place and their security is a concern. Groups that have implemented archival workflows tend to be ones that do a lot of video production and post-production, like La Sandia Digital in Mexico or Refugee Law Project in Uganda, and others focused on collecting evidentiary documentation, like SyrianArchive.org. For other groups that are not directly concerned with media management or preservation as part of their everyday activities, it can be difficult to set aside the additional time and resources to implement archival workflows.

We try to communicate the value of basic archival practices, such as keeping backups in geographically dispersed locations. Unfortunately, the importance of these practices sometimes does not sink in until something bad happens, like having a YouTube account taken down or having a hard drive fail. A positive example that I like to share is that of Kianga Mwamba, a woman who recorded her mistreatment by the Baltimore police on her phone. The recording was “accidentally” deleted while her phone was in police custody, but she fortunately had the foresight to activate her remote backup ahead of time. Because of that small act, she was later able to retrieve a copy of the video from the cloud and use it as evidence in her successful suit against the police department.
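For readers wondering what the most basic version of this practice can look like, here is a minimal sketch of copying a video to a second location and verifying the copy with a checksum. The paths are hypothetical, and a real backup would live on separate hardware or a remote service rather than another folder.

```python
# A minimal sketch of a basic archival practice: copy a video to a second
# location and verify the copy with a checksum so you know the backup is intact.
import hashlib
import shutil
from pathlib import Path

def sha256(path):
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

source = Path("footage/2016-03-01_protest.mp4")      # hypothetical paths
backup = Path("/mnt/offsite_drive") / source.name

shutil.copy2(source, backup)                          # copy file, preserving timestamps
assert sha256(source) == sha256(backup), "backup does not match the original"
print("backup verified:", backup)
```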

WITNESS, “Kianga Mwamba: Cops try to delete video of violent, unwarranted arrest, but fortunately it’s backed up to the cloud”

WITNESS provides archival training resources, in-person workshops, and ongoing consultation and collaboration. Our freely available resources include the Activists’ Guide to Archiving Video, which received the 2014 SAA Preservation Publication Award, and a number of short training videos. In our resources and workshops, we try to give guidance that is practical and realistic for small human rights groups, which is not necessarily what traditional collecting institutions might consider best practice. I do not, for example, talk much about metadata standards, but I do brainstorm with groups about the kinds of information they want to keep track of, why it’s useful to record that information in a consistent way, and at the same time discuss what information might put them at risk.

WITNESS, “Activists’ Guide to Archiving Video”

While I think we have been successful helping some groups with practical activities like collecting, organizing, and cataloging their videos, our training resources do not solve larger sustainability challenges, like staffing and the cost of ongoing collections maintenance. One way that groups might address this, however, is through local networks and collaboration – a good example of this is our partnership with XFR Collective.

One of the biggest ongoing challenges is how to archive and preserve, while at the same time protecting video documentation from authorities that seek to seize it, through subpoenas or other means, and use it against people. There are a variety of approaches to mitigating harm, such as concealing identities, informed consent, and encryption, but these are not always sufficient in every case.

Thomas: It is clear from your comments above that there is often an immediate tension between the ability to capture and share digital media with relative ease and an awareness of how the data and relatively automated systems of description around them hold the potential to endanger the creator. What technical solutions have been developed to help the groups you typically work with?

Yvonne: A number of technical tools are available that can make documenting abuse safer for victims, activists, and witnesses. There are dedicated apps, like CameraV, which we partnered with the Guardian Project to create. CameraV allows a user to capture important metadata from the device at the time of recording and then secure the video and data from unauthorized access through password-protection and encryption.
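To give a rough sense of the general pattern described here, the sketch below records contextual metadata alongside a clip and keeps both encrypted at rest. This is emphatically not CameraV’s actual code, format, or key handling; the metadata fields, file names, and key management are illustrative assumptions only.

```python
# A rough sketch of the general pattern: store capture-time metadata with a
# video and encrypt both so neither is readable without the key. Not CameraV.
import json
import time
from cryptography.fernet import Fernet

# Hypothetical capture-time metadata; a real app would pull these from device sensors.
metadata = {
    "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "device_model": "example-phone",
    "gps": {"lat": None, "lon": None},   # left empty when location would create risk
    "notes": "hand-held, filmed from across the street",
}

key = Fernet.generate_key()              # in practice, derived from a passphrase and stored safely
box = Fernet(key)

with open("clip.mp4", "rb") as f:
    video_bytes = f.read()

# Encrypt the video and its metadata so neither is readable without the key.
with open("clip.mp4.enc", "wb") as f:
    f.write(box.encrypt(video_bytes))
with open("clip.meta.enc", "wb") as f:
    f.write(box.encrypt(json.dumps(metadata).encode("utf-8")))
```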

Guardian Project, “CameraV”

Most people will not have a dedicated app like CameraV installed on their phones just in case they witness an event, of course, so it is important that they are aware of how to operate securely with their available settings and technologies. My colleague Morgan Hargrave wrote a great blog post that outlines some of these basic practices, like securing your device with a strong passkey, setting up automatic cloud backup, knowing your rights, and being conscious of potential risks before you share your video.

Knowing that people will usually use widely available tools and services rather than dedicated apps, we also work with technology companies to encourage the development of built-in security features that are easy to use and benefit all kinds of users beyond human rights defenders. The YouTube Custom Blurring tool, for example, makes it easy for users to obscure identifying information like faces and license plates when they share their videos.
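As a small illustration of the kind of redaction such tools make easy, here is a generic OpenCV sketch that detects faces in a single frame and blurs them before sharing. It is not YouTube’s implementation, and the input file name is hypothetical; it simply shows the underlying technique.

```python
# Generic face-blurring sketch with OpenCV; not the YouTube Custom Blurring tool.
import cv2

# Haar cascade face detector bundled with OpenCV.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

frame = cv2.imread("frame.png")                      # hypothetical still from a video
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    face = frame[y:y + h, x:x + w]
    frame[y:y + h, x:x + w] = cv2.GaussianBlur(face, (51, 51), 0)  # heavy blur over each face

cv2.imwrite("frame_blurred.png", frame)
```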

Thomas: You’ve mentioned, explicitly and implicitly, efforts you make to balance the needs of the groups you work with against professional archival practice. What framework or ethic guides you as you walk that line? What suggestions do you have for others working in related capacities?

[pullquote]I think that meeting the needs of the groups we work with is professional archival practice.[/pullquote]Yvonne: I wouldn’t frame it as a balance between those two things; I think that meeting the needs of the groups we work with is professional archival practice. That is, it’s the job of a professional archivist to be able to look at existing standards and best practices and judge how they can apply or be adapted to suit particular situations, which are almost always less than ideal.

But in terms of what frameworks or ethics guide my professional practice, I guess I can point to a few things. First, on the most basic level, I try to approach my work always keeping in mind human rights principles of dignity, equality, inclusion, and accountability. In the day-to-day, this means prioritizing people’s safety and well-being, working in participatory and inclusive ways, and supporting and following the lead of activists working in their communities. With regard to our archival collections at WITNESS, we have internal guidelines on access and use that emphasize our responsibility to people depicted in the footage, people who recorded the footage, and to communities affected by human rights abuse. The Ethical Guidelines for Using Eyewitness Videos published by the WITNESS Media Lab, aimed at people collecting and curating video from the web, reflects these principles as well.

WITNESS, “Video as Evidence: Ethical Guidelines”

Michelle Caswell draws a useful framework for human rights archiving that she calls a “survivor-centered approach,” which resonated a lot with me when I read it last year. I appreciate her assertion of an ethical imperative to prioritize survivors of human rights abuse and victims’ families over other stakeholders in the archive, like academic researchers. She also calls for an approach to human rights archiving that is participatory and inclusive, that embraces its work as a form of activism, and that is able to look critically at itself.

I suppose the suggestion I would have for others is similar to what Caswell argues for in her article, which is to be self-reflective about our role as archivists, and not to hide behind the false idea that archiving is a “neutral” act. The work we do has value, so it’s important to think about who it should serve and where we want to put our energy.

Thomas: Whose data praxis would you like to learn more about?

Yvonne: I’m always on the lookout for smaller organizations and groups with media archives. I’m interested in learning about their workflows, what systems and tools they use, their collections, and their specific challenges. There are many ways that groups can manage and preserve video with limited resources, and it’s instructive to hear about as many real-life examples as possible.

Community Oriented Research Data Curation and Reuse

Ixchel Faniel is a Research Scientist at OCLC Research. She has conducted extensive research into how various disciplinary communities approach research data reuse, and served as Principal Investigator on the Institute of Museum and Library Services funded DIPIR (Dissemination Information Packages for Information Reuse) project.

Thomas: Believe it or not, I think you may be the first Research Scientist I’ve corresponded with! Can you tell me a bit about your role at OCLC and what combination of experience, education, and interests led you to it?

Ixchel: I’m laughing because I didn’t set out to be a Research Scientist, but given my experience, education, and interests, I can see how I ended up at OCLC. My role is to conduct research that informs and impacts the communities OCLC serves in terms of what they do and how they do it – so libraries, archives, museums, their staff, and their patrons. Typically, my interests center on people and their needs. At a high level I’m interested in how people discover, access, and use content and how technology supports those things. My interests grew out of my experiences and education for sure.

After graduating from Tufts University with a B.S. in Computer Science, I worked at Andersen Consulting (now Accenture) as a consultant, designing and developing information systems to meet clients’ needs. I spent a lot of time at USC earning my MBA and Ph.D. in Business Administration at The Marshall School of Business. The time between the two degrees was spent at IBM selling mainframes and working with a team to propose total solutions – hardware, software, services – to solve a customer’s specific business problem. Working as an Assistant Professor at the School of Information, University of Michigan also shaped my interests, because I was exposed to different conversations – federal funding agency mandates to share research data for reuse, data repositories, research collaboratories, team science, cyberinfrastructure.

Primarily, my interests have focused on the reuse of research data within academic communities. I’m particularly interested in the kinds of contextual information that need to be captured along with the data and how that context can be curated and preserved. This line of work started with National Science Foundation funding when I studied how earthquake engineering researchers reused each other’s data. It continued with DIPIR (Dissemination Information Packages for Information Reuse), a multi-year Institute of Museum and Library Services funded project to study the data reuse practices of social scientists, zoologists, and archaeologists. Recently, work in this area has been extended with National Endowment for the Humanities funding for the Beyond Management: Data Curation as Scholarship in Archaeology project. It’s a longitudinal study of archaeologists’ data creation, management, and reuse practices to investigate data quality and modeling requirements for reuse by the larger community. A second line of related work I began at OCLC examines the role of academic libraries and librarians in supporting researchers’ data sharing, management, and reuse needs. It serves as a nice complement to studying the researchers, because librarians and libraries are viewed as a key stakeholder group. I’m currently examining their early experiences designing, developing, and delivering e-research support.

Thomas: I am intrigued by the link you drew between technology-supported knowledge reuse for innovation and data reuse within academic communities. Can you speak to this in greater detail?

Ixchel: Sure. So when I was studying knowledge management and knowledge reuse within organizations, one of the major issues was that knowledge was contextual in nature. In order for one colleague to understand another colleague’s knowledge well enough to apply it to a new situation, there was a need to know the context within which it was created. The difficulty was having access to all of those details in the absence of the colleague. In going about their work, employees were creating a paper trail for some things, but they weren’t necessarily capturing everything related to how and why they were doing what they were doing. In other cases, employees were capturing a summary of events in a final, formal document to be stored and shared as corporate memory, but keeping documents they generated in the course of creating the final document to themselves. Those additional documents served as a detailed reminder of how they arrived at the final document, but weren’t necessarily shared with others. In still other cases, employees were relying on their past and present experiences, and the knowledge they were using in the moment to make a decision, solve a problem, or develop a new product was tacit and not captured.

[pullquote]Similar to employees who generate knowledge, researchers who generate data may be recording some context but not necessarily all context, or they’re sharing some context but not necessarily all context.[/pullquote]These are some of the same issues we face in studying the reuse of research data within academic communities. Similar to employees who generate knowledge, researchers who generate data may be recording some context but not necessarily all context, or they’re sharing some context but not necessarily all context. They are not doing this to withhold information, necessarily, but because they don’t think to record or share certain kinds of context. In some cases the context represents something they do so regularly in the course of their research that it becomes tacit, or such a minor detail to them that they don’t think to include it. Before federal funding agency mandates to share data and write data management plans, researchers did not have to document data beyond their personal needs. Are they capturing enough about the context of data production such that others can come along and reuse that data to answer different research questions or make a new discovery? I wasn’t sure, and I didn’t think we would really know until we started asking researchers who reused data what they needed.

Thomas: Nearly five years ago, you and Ann Zimmerman wrote Beyond the Data Deluge: A Research Agenda for Large-Scale Data Sharing and Reuse. In that paper you outlined an ambitious research agenda for yourself and a wide field of scholars and practitioners. I imagine a fair amount of work has happened since then. Thinking on that work, where are we now?

Ixchel: Good question. Where are we now? Making lots of progress, but there is still more to do. Ann and I developed that research agenda around three activities we thought needed more attention – 1) broader participation in data sharing and reuse, 2) increases in the number and types of intermediaries, and 3) more digital products.

We’ve definitely seen more research examining data sharing and reuse. When we wrote the article, we were both in agreement that the research on reuse was sorely lacking, and we had both done early work in the area to examine data reuse practices among ecologists and earthquake engineering researchers. Interestingly, our article was written around the same time Elizabeth Yakel and I were waiting to hear whether the DIPIR project was going to get funded.

Luckily it was funded. As I mentioned earlier, DIPIR is a study of data reuse practices of social scientists, archaeologists, and zoologists. We were particularly interested in what kinds of contextual information were important for reuse and how it could best be curated and preserved. Over the years we’ve published studies considering how social scientists and archaeologists develop trust in repositories, examining internal and external factors that influence archaeologist and zoologist attitudes and actions around preservation, studying the relationship between data quality attributes and data reuse satisfaction, and discussing the topology of change at disciplinary repositories.

Most recently we’ve been doing a lot of work describing the context-driven approach to data curation for reuse that we used for the DIPIR project. Initially, our major goal with this approach was to give voice to the reusers’ perspective, understanding that it influences and is influenced by the other stakeholders involved in the process. Now we are working toward providing a more balanced picture of needs among the major stakeholders (data producers, data reusers, and repository staff), knowing that they may have different, sometimes competing, data and documentation needs. We started narrowly by bringing reusers’ perspectives to the fore, but we’ve been slowly expanding to include these other stakeholders.

Context Driven Approach to Data Curation

In February 2016, we convened a workshop at the International Digital Curation Conference to share our approach and findings. We collaborated with Kathleen Fear, Data Librarian at the University of Rochester, and Eric Kansa, Data Publisher at Open Context. Both also do research in the area, but their primary responsibility is supporting researchers’ data sharing, management, curation, and reuse needs. They talked about the impact DIPIR findings were having on their practices, including the tradeoffs they had to make. To complement the presentation, we engaged workshop participants in card sort exercises to consider the importance of different types of contextual information given the needs of data reusers vs. repository staff. With the workshop, one of our objectives was for participants to see the partnership, or the marriage, between research and practice, as well as differences in needs not only between data reusers and repository staff, but also among data reusers in different disciplines and repository staff supporting different designated communities of users.

IDCC2016 Card Sort Exercise

Thomas: Where do you think work of this kind is headed next?

Ixchel: I believe these kinds of connections – connections between researchers studying the phenomena and practitioners implementing them – are one thing that can help advance work in the area. And even among researchers, diverse experiences are important. Elizabeth and I came together given her background in users, archives, and preservation and my background in users of information systems and content. During our research we’ve engaged with different perspectives and literatures that have some similarities but haven’t always talked to or referenced one another. One of our goals was to bridge those areas.

[pullquote]My goal was to convince archivists to bring their expertise to the table in conjunction with an understanding of data reusers’ needs to inform not only the preservation of data’s meaning, but also other archival practices, particularly the partnerships they form.[/pullquote]Bridging is an important part of this effort. I wrote about a particular aspect of it in a two-part blog post after participating in a panel on Data Management and Curation in 21st Century Archives at the Society of American Archivists Annual Meeting in 2015. My goal was to convince archivists to bring their expertise to the table in conjunction with an understanding of data reusers’ needs to inform not only the preservation of data’s meaning, but also other archival practices, particularly the partnerships they form.

A key part of my message was that archivists cannot go it alone, because curation and management are bigger than the archive. My related study of librarians confirmed it; communication, coordination, and collaboration with other campus entities were particularly important when supporting research data services. Presentations from fellow panelists also confirmed it. What struck me about their presentations was whether and how they and their colleagues came to value each other’s complementarities in order to deliver more effective research data services.

It would be great to see more work about whether and how data and information researchers and professionals begin to partner with each other and with other organizations. There has been some work to frame the issue by Brian Lavoie and Constance Malpas, colleagues at OCLC, who conceptualize evolving stewardship models. Seeing additional research in this area, and how it’s done in practice, particularly within and across colleges and universities, would be interesting.

So when Ann and I talked about increases in the number and types of intermediaries, one of the areas we suggested examining was how education, roles, and responsibilities were changing given the evolving nature of data and information professionals. There has been nice progress in those areas from Liz Lyon at the University of Pittsburgh, Carole Palmer at the University of Washington, and Helen Tibbo and Cal Lee at UNC Chapel Hill. Going forward it will be interesting to examine career trajectories: how do these professionals advance in their fields, and what is rewarded vs. not?

With regard to the last area Ann and I discussed – more digital data products or new types of digital products that include or reference data – The Evolving Scholarly Record presents a framework to organize and drive discussions about it. More recently, Alastair Dunning wrote a nice blog post while at the International Digital Curation Conference summarizing Barend Mons and Eric Kansa’s approach to publishing data and how it benefits reuse – Atomising data: Rethinking data use in the age of explicitome. But that’s just the beginning. There’s definitely room for more work in this area, particularly on other approaches being taken in other disciplines given the needs of data producers, reusers, and repository staff.

Thomas: Whose data praxis would you like to learn more about?

Ixchel: That’s an interesting question. For me it’s the data producers and curators. So for the past several years I’ve been working with colleagues to get data reusers’ perspectives inserted into conversations, but by no means do I think it is enough. We’ve done some work examining Three Perspectives on Data Reuse: Producers, Curators, and Reusers, starting at the point of data sharing. It goes back to the context-driven approach. Data producers are really the ones who set the tone regarding data management, curation, and reuse, because they are upstream in the data cycle. The NEH funded project I discussed earlier – Beyond Management: Data Curation as Scholarship in Archaeology – aims to bridge data creation and reuse.

The project started in January 2016. We have this fantastic opportunity to interview and observe archaeologists while they are collecting and recording data in the field during archaeological excavations and surveys, and to interview archaeologists interested in reusing the data. The objective is to examine data creation practices in order to provide guidance about how to create higher quality data at the outset, with the hope that downstream data curation activities become easier and less time intensive, and that data creation practices are better aligned with meaningful reuse. We have another great group of people working on the project, specializing in archaeology, anthropology, data curation and publishing, information science, archives and preservation, etc., and we are all focused on studying data creation and reuse and impacting practice. I’m looking forward to seeing how it progresses. It should be a lot of fun.

Museum as Play: Iteration, Interactivity, and the Human Experience

Sebastian Chan is Chief Experience Officer at the Australian Centre for the Moving Image. Previously he held positions as Director of Digital & Emerging Media at the Cooper Hewitt Smithsonian Design Museum and Head of Digital, Social and Emerging Technologies at the Powerhouse Museum. His work spans consideration of digital and physical spaces and has been recognized by organizations including but not limited to Fast Company, Core77, the American Alliance of Museums, and Museums and the Web.

Thomas: Your positions at the Australian Centre for the Moving Image, the Cooper Hewitt Smithsonian Design Museum, and the Powerhouse Museum all in some way focus on the digital aspect of the museum experience. Looking across your career, what combination of experiences and dispositions led you to these types of roles and the responsibilities they come with?

Seb: To be perfectly honest, it’s been a journey of good fortune and having great managers and mentors.

I ended up in the cultural heritage world largely because I had had enough of writing a PhD on the geographies of music subcultures and was working in IT as an escape route. That led to a systems administration role at the Powerhouse Museum because their previous Y2K project manager had unexpectedly departed in mid-1999. The year 2000 was also the year that the Sydney Olympics happened, and the Powerhouse had a huge “Treasures of Ancient Olympia” exhibition planned. Tim Hart and Sarah Kenderdine were implementing an immersive 3D reconstruction of Olympia both online and in the exhibition (tiny remnants available), and one day Tim popped down to the IT department; he knew that I had some understanding of 3D graphics acceleration and gaming hardware from my time as a videogame reviewer, so I got drafted into the project to do specialist technical support on it. After that I was more heavily involved in web projects, and in 2003 I separated from IT and started an independent web unit which reported to Associate Director Kevin Sumption. This autonomy from both IT (and Marketing), and strong alignment with curatorial, meant that we were able to do some interesting projects that otherwise wouldn’t have happened, such as a series of “games” around exhibition content and themes: the design process, the mathematics of gambling, and environmental impact calculators. Two really important “failed” projects were Soundbyte, a music education resource and rudimentary social network connected to the museum’s digital music and media labs, now called Thinkspace, and Behind The Scenes, a back-of-house virtual tour and basic collections highlights experience.

Soundbyte ended up winning some awards – but it failed because we really underestimated the social part of it, both from a community management perspective and in terms of technical architecture. However, much of what we learned during that process helped us with the Powerhouse’s later push into social media and associated open content initiatives. Behind The Scenes was a faster failure – the Flash interface looked fantastic but its architecture was very problematic. However, in building Behind The Scenes we made a series of rudimentary connectors that opened up programmatic access to the collection management system – and these small bits of code ended up forming the basis of what would become the Electronic Swatchbook project and later the first version of the online collection database, OPAC2.0.

Electronic Swatchbook

OPAC2.0 was the start of a new wave of work at Powerhouse. The teams I managed grew significantly, and from 2006 onwards there was a lot of activity around getting the collection out to the world – first via the database, then data releases, an API, and various social platforms. The Powerhouse also launched Design Hub (later dHub) as a portal around design content and collections, and a new children’s site that distributed CC-licensed craft activities and games for under 8s and their parents. In 2008 this led to being commissioned to create cross-government experimental projects – a baby names voyager, an experimental semantic web collections portal, and a multi-agency events calendar and app for parents.

This was on top of all the exhibition projects and other things that the teams did. But at the end of the day it was the collection – its diversity and scale – that lay at the heart of most of this work. We pretty quickly realized that the value of a museum’s collection lay in the public’s ability to interact and engage with it, and so there was a lot of rapid experimentation around new interfaces and new platforms through which to provide access to the collection.

freshandnew.org

We even collaborated with artist Craig Walsh on what was meant to be a virtual monster inside a box that devoured collections – the web interface was the “feeding tube,” so to speak, and drew edible collections from visitors’ choices and uploads.

A huge amount of work was done – we made many things that didn’t work out as planned – and I worked with and managed a very talented pool of individuals, all of whom have gone on to bigger and better things all around the globe. As you move up the org chart you inevitably become further and further removed from production, and certainly from writing code – you’re more in the role of a conductor than a soloist.

Eventually my interest started to wane – and a set of coincidences meant that in mid-2011, after a visit to New York, I was on the end of a very late night Skype chat-turned-interview with then Director Bill Moggridge and his Director of Marketing, Jen Northrop, at the Cooper Hewitt, who were looking for a Director of Digital & Emerging Media. They knew of my work from the web and through some research collaborations a few years earlier, when I had taught a workshop at the Cooper Hewitt on social media and collections. Of course, I also knew Bill’s work and career from following IDEO, and the museum itself from its education programs and quirky exhibitions – and my wife had expressed a strong interest in living in New York – so it sounded like an interesting and unique challenge. Bill obviously understood the challenge of moving a family internationally and so went out of his way to make it work within the Smithsonian structure – in fact his negotiations with Washington took so long that by the time he emailed back with an offer, I thought they had probably hired someone else!

Cooper Hewitt was a fascinating experience – especially coming into an organization that had a strong desire, but few muscles, to bring to life a very different vision of the institution. When I left Powerhouse I became acutely aware of two things: one, that really significant change is easiest when you can stop everything else and close your main galleries, and two, that Australian institutions are much more inherently visitor-focused (and have been for a long time) than their North American counterparts.

Bill had a very generous way of working, and he wanted to make the most of the multi-million dollar renovation that Cooper Hewitt had just begun. There’s a great interview with him in Fast Company, published a few weeks before I landed, in which his discontent with the building, its architecture, and the ‘traditional visitor profile’ is obvious. It is also obvious that he treated the idea of the museum itself as a very malleable construct – and in those early months we got some major structural changes through that might have been more difficult in other circumstances. Three months into the job, the collection metadata had been released under a CC0 license – a first for the Smithsonian – and by mid-year I’d been able to grow my team by bringing Micah Walter on as a proper staff member and by hiring Aaron Cope, who had been thinking about his next steps after working at Stamen. The AV duo from Education were also added to my group – and Katie Shelly slowly transformed from video producer to a hybrid videographer and UX advocate. In the sprint to the museum opening we also added Sam Brenner, a super talented developer fresh from NYU’s ITP.

Cooper Hewitt Collection Data

We had also begun the concept stage of the new museum with Diller Scofidio + Renfro, and my team was working a lot with Local Projects, who had been hired as media designers. Then suddenly Bill took ill, and several weeks later he passed away from brain cancer.

Everyone was in shock.

Most of the work after that was driven by a sense of trying to bring the vision of a more porous, more generous, and more diverse and playful museum to reality. Most people know the story after that – many things got made, all of which are documented over at Cooper Hewitt Labs – and my team got to do some amazing work with lots of collaborators inside and outside the museum.

ACMI is a different beast altogether. It is really interesting to me because it means taking a museum that is already very successful – 1.25m visitors each year – and working with a dynamic executive team to create a more experimental and fluid institution, which almost certainly necessitates breaking a few of the very things that have led to its current success. I’m also perversely excited by the complete challenge of working with contemporary media and copyright – this is a museum that deals with cinema, TV, video games, and contemporary media art, so there’s very little that is simple in terms of IP. Similarly, the first question everyone asks me is “why would I go to a museum about things that I can watch on Netflix or play on my PlayStation, Xbox, or through Steam?” I think that needs a series of razor sharp responses – some of which will be visibly articulated in new ways in the coming months.

So that’s a potted history of how I’ve ended up where I am now.

The missing piece I haven’t mentioned is my other life in music as a DJ, event and festival producer. It has been that other life that really underpins my constant focus on, and interest in, improving access to, and the human experience of, both museums and their contents. Starting out in public radio during my final year of high school, I’ve been part of a DJ/live/FX duo for over 20 years that was all about introducing dance floors to new music – as well as creating physical events and environments in which people open up to new sonic and sensory experiences. Perhaps subconsciously I’ve treated museum collections like obscure records and sample sources – and the purpose of my work in the last 15 years or so has been about liberating those and making them not just accessible, but enticing and useful to the public.

Thomas: What work inspires you right now?

Seb: Right now I’m interested in the work of Anab Jain and her practice Superflux, Ingrid Burrington and her work on the infrastructure of the internet, Amy Rose and May Abdalla and their immersive documentary work as Anagram, Jason Scott’s continuing amazing work on video game archiving and preservation, those working at the intersection of cultural orgs/exhibits/digital like Tellart, as well as all the usual museum/cultural sector suspects who I’m sure everyone is already following and reading about.

Jason Scott, JSMESS

Thomas: You mentioned above that your museum collections work might be subconsciously influenced by your passion for music and fostering community around it. I really like that! If you were to distill some core lessons on building digital collections and establishing community in a digital environment around them, what would they be? Is there any particular project you have in mind that illustrates these lessons?

Seb: The core lessons are best told as a recounting of how my teams learned those lessons.

At the Powerhouse the OPAC2.0 project opened up a huge number of vectors into the collection – and we learned a huge amount about what did and didn’t work through that process. The Powerhouse was one of the first museums to release its collection (as a raw data file), which was closely followed by a public facing API, yet these were much more helpful internally than externally – in that they allowed us to work with and see the shape of the collection more easily. My team at Cooper Hewitt did the same – data release followed by the API – but at Cooper Hewitt, Aaron Cope spent a lot of time making the web interface itself a lot more linguistically inviting, which made all the difference.

Let me explain that a bit better.

When we were designing and building the Powerhouse online collection we were coming from a very low base. None of the collection was online, and we had just seen the failure of our Behind The Scenes (BTS) project. BTS had presented some top-level collection highlights but had been built in Flash on top of Coldfusion (anyone remember Coldfusion?!), and alongside that we had three old specialist collections in their own little portals – the Sydney Olympic Games collection, and two photographic collections, the Tyrrell archive and the Hedda Morrison archive. All of these specialist collections presumed an interest in and knowledge of the collections’ contents, and as a result they weren’t particularly browsable. Through the Electronic Swatchbook project, though, we had designed and built an interface that was entirely based on browsing and the interrelationships between objects using tagging, because the individual swatches weren’t catalogued (or able to be). And we’d also seen a lot of traffic and downloading of the swatches – which taught us the value of browsable interfaces and open access/public domain releases.
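To give a concrete sense of this kind of tag-driven browsing, here is a minimal sketch that relates objects to one another purely by shared tags, ranked by Jaccard overlap. It is an illustration of the general approach rather than the Electronic Swatchbook’s actual code, and the tag data is hypothetical.

```python
# A minimal sketch of tag-based "related objects" browsing: with no catalogue
# records, relate objects to one another purely by how many tags they share.
swatches = {
    "swatch-001": {"floral", "silk", "red"},
    "swatch-002": {"floral", "cotton", "blue"},
    "swatch-003": {"geometric", "silk", "red"},
    "swatch-004": {"stripe", "wool"},
}

def related(object_id, collection, top_n=3):
    """Rank other objects by Jaccard overlap of their tag sets."""
    tags = collection[object_id]
    scored = []
    for other_id, other_tags in collection.items():
        if other_id == object_id:
            continue
        overlap = len(tags & other_tags) / len(tags | other_tags)
        scored.append((overlap, other_id))
    return [oid for score, oid in sorted(scored, reverse=True)[:top_n] if score > 0]

print(related("swatch-001", swatches))   # ['swatch-003', 'swatch-002']
```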

So Giv Parvaneh and I started thinking about how we might apply a swatchbook-like approach to the whole collection. After all, in porting the BTS project from Coldfusion to PHP we had built a very rudimentary library for extracting object metadata from the enterprise collection management system. After OPAC2.0 went live in late 2006 we experienced extremely rapid traffic growth, almost entirely driven by the collection – and with that came other challenges. We were completely unprepared – and understaffed – for success. Imagine you were running a museum and suddenly, and consistently, double the number of visitors started arriving at your museum’s door each day asking new types of questions. You’d hire some more front-of-house staff, get curators and subject matter experts out on the floor, and deal with the increase – but online, when this happens, it remains business as usual.

Following OPAC2.0 came the work with Flickr and the Commons on Flickr. I spoke at WebDirections in 2007, and George Oates was on the lineup too. While she was in Sydney she came and visited the Powerhouse and told me that she was about to go live with an exciting collaboration with the Library of Congress. We agreed to keep in touch and try to get the Powerhouse’s historical photo archive online if LoC was open to expanding the project. LoC went live in January 2008, and then the Powerhouse became the first museum in the Commons on Flickr in April 2008. Flickr created a huge audience for the historical photographs, and during 2008/9 we did a lot of experiments in integrating the user-generated metadata from Flickr (tags) and comments into both the museum’s workflows and OPAC2.0. Paula Bray, who headed up Image Services, did a lot of work with Flickr, hosting events and even publishing a book of user-generated comments – a kind of user-generated catalogue.

from “Then and now: stories from the Commons”

The socialising of the collection on Flickr taught us a lot – that there were bigger, general audiences out there, but that the route to these audiences was often controlled by third parties who might have different business agendas. After George left Flickr, the Commons was effectively put on hold by Yahoo for several years – Flickr’s user base also changed as other products and services appeared on the market. The Powerhouse’s collections are still there – they weren’t removed, as Brooklyn Museum’s were – but the Flickr experience also demonstrated the importance of continually supporting and feeding the community. It wasn’t possible to simply outsource engagement.

There’s good documentation of this period on Fresh & New as well as in these Museums and the Web papers by various team members:

Tagging and Searching—Serendipity and Museum Collection Databases – which covers the OPAC2.0 project and very early results on usage and tagging behaviors. These changed quite substantially in the following years and so the early honeymoon period didn’t end up being representative of the longer term. Changes in the way that Google operated meant that a lot of the early SEO gains were diminished from 2010 onwards.

Uniting the Shanty Towns—Data Combining across Multiple Institutions – which covers the early work building About NSW with Dan McKinley and Greg Turner (who later founded the Interaction Consortium), and Renae Mason (now at the Museum of the City of New York).

Flickr Commons: Open Licensing and the Future for Collections – my former colleague Paula Bray writing on the experience with Flickr.

Reprogramming The Museum – former colleague Luke Dearnley writing on the museum’s data release and API as well as architectural decisions in that period.

Skip forward a couple of years to 2012 and Cooper Hewitt.

After the CC0 release of the Cooper Hewitt collection metadata in early 2012, we hired Aaron Cope onto the Digital & Emerging Media team. He wanted to be "head of internet typing" but we finally went with "head of engineering." He built the alpha version of the Cooper Hewitt collection site in his first three months, and it became a fertile proving ground for a lot of what would go into the making of the new exhibition experiences and The Pen.

The API at the center of the museum

Aaron, Micah Walter, and I did a lot of work to ensure that the collection was going to be at the heart of the new museum, and Aaron’s experience in building usable and successful APIs was key to putting Cooper Hewitt on the map. When I arrived at Cooper Hewitt at the end of 2011, there were only 10,000 objects online in the vanilla web interface from the collection management system vendor, and the museum was completely unknown in the digital humanities world. By the end of 2012, almost all of the collection was online and the team picked up awards from AAM and Museums and the Web for the alpha version.

Rapid, publicly visible work was key to Cooper Hewitt’s success here.

Thomas: Excellent lessons for anyone thinking about APIs and collections, interface development, platform utilization, and community engagement. In closing, whose data praxis would you like to learn more about?

Seb: Tim Sherratt, Mia Ridge, Mitchell Whitelaw, Geoff Hinchcliffe, Elisa Lee – all Australians doing fascinating work in the digital humanities and its intersection with interaction design.

Data-Driven Art History: Framing, Adapting, Documenting

This is the first post in Data Praxis, a new series edited by Thomas Padilla.

Matthew Lincoln is a PhD candidate in Art History at the University of Maryland, College Park. Matthew is interested in the potential for computer-aided analysis of cultural datasets to help model long-term artistic trends in iconography, art markets, and social relations between artists in the early modern period. Last summer, Matthew held a fellowship at the Harvard MetaLab workshop Beautiful Data, and presented research at the Alliance for Digital Humanities Organizations' annual international conference, DH2015, in Sydney, where his paper, "Modeling the (Inter)national Printmaking Networks of Early Modern Europe," was a finalist for the ADHO Paul Fortier Prize.

Thomas: I’m always interested in the hows and whys of folks getting involved in digitally inflected research. Can you tell us a bit about yourself and describe what motivated you to take a path that brings Art History and digital research together?

Matthew: I suppose my digital art history "origin story" is one of a series of coincidences. I've always been interested in programming, and, as an undergraduate, even took a few computer science courses while I was majoring in art history at Williams College. But I'd never seriously considered how to apply those digital skills to historical research while at Williams, nor did I start my graduate work at the University of Maryland with any intention of doing computationally aided art history. However, as it happened, the same generous donation that made my attendance at UMD possible (a Smith Doctoral Fellowship in Northern European Art) had also funded the Michelle Smith Collaboratory for Visual Culture, an innovative physical space in the Department of Art History & Archaeology that was intended to serve as a focal point for experimenting with new digital means for sharing ideas and research. I was already several years into my coursework before I took a semester-long graduate assistantship in the Collaboratory, where I was given remarkable leeway to explore how the so-called "digital humanities" might inflect research in art history. During that semester, I developed a little toy map generated from part of Albrecht Dürer's diary of his trip to the Netherlands in 1520–1521. But I also had my eyes opened to the vibrant discourse about digital research in the humanities that had, up to that point, been totally outside my field of view. What is more, data-driven approaches held particular promise for my own corner of art historical research on early modern etchings and engravings. Because of the volume of surviving impressions from this period, a lot of scholarship on printmakers and print publishers comprises a wealth of quantitative description and basic cataloging. My dissertation seeks to mine this existing work for larger synthetic conclusions about print production practices in the Dutch golden age.

Thomas: Over the summer you presented a paper at DH2015 that would become a finalist for the ADHO Paul Fortier Prize, "Modeling the (Inter)national Printmaking Networks of Early Modern Europe." What were the primary research questions in the paper, and what methods and tools (digital and otherwise) did you employ to pursue those questions?

Matthew: I'm interested in how etchings and engravings can serve as an index of past artistic and professional relationships. Most of these objects are the result of many hands' work: an artist who produced a drawn or painted design, a platecutter who rendered the image onto a printing plate, and often a publisher who coordinated this effort and printed impressions. Seen in this light, the extensive print holdings in modern-day collections offer an interesting opportunity to see what kinds of structures emerge from all of this collaboration. In this paper, I wanted to examine how artists tended to connect (or not) across national boundaries. In the history of seventeenth-century Dutch art in particular, there has been a lot of well-deserved attention on the influence and prestige of Dutch painters traveling abroad. But what about printmakers? Did Dutch printmakers tend to connect to fellow Dutch artists more frequently, or did they prefer international collaborators? And how might this ratio have changed over time? It's easy to intuitively argue either side of this question based on a basic understanding of Dutch history at the time, so this was a good opportunity to introduce some empirical observations and formal measurement to the discussion. In this vein, I'd argue one of my most crucial methods was doing a good old-fashioned literature review in order to properly understand the stakes of the question that I wanted to operationalize.

from DH2015 paper, “Modeling the (Inter)national Printmaking Networks of Early Modern Europe”

I drew on two major datasets for this paper: the collections data of the British Museum, and that of the Rijksmuseum. The British Museum has released their collections data as Linked Open Data, which meant that I needed to invest a considerable amount of time learning SPARQL (the query language for LOD databases) and how to build my own mirror of their datastore in Apache Fuseki, as my queries were too large to submit to their live server. On the other hand, once I had mastered the basic infrastructure of this graph database, it was easy to produce tables from these data exactly suited to the analyses I wanted to do. The Rijksmuseum offers a JSON API service, allowing you to download one detailed object record at a time. The learning curve for understanding the Rijksmuseum's data model was lower than that for the British Museum's LOD. However, I had to battle many more technical issues, from building Bash scripts to laboriously scrape every object from the Rijksmuseum's cantankerous API, to figuring out how to break out just the information I needed from the hierarchical JSON that I got in return (jq was a fantastic utility for doing this).
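To give a sense of that harvesting step, here is a minimal sketch of the same pattern in Python rather than the Bash and jq scripts Matthew describes: fetch object records one at a time and flatten a few fields into a table. The API key, object numbers, and field names are placeholders, and the endpoint is an assumption standing in for the Rijksmuseum collection API rather than a description of it.

```python
# A minimal sketch, not Matthew's actual Bash/jq scripts: pull object records
# one at a time from a JSON API and flatten a few fields into a CSV. The
# endpoint, API key, object numbers, and field names are assumptions standing
# in for the Rijksmuseum collection API.
import csv
import time
import requests

API_KEY = "YOUR_KEY"                                    # placeholder
BASE = "https://www.rijksmuseum.nl/api/en/collection"   # assumed endpoint
OBJECT_NUMBERS = ["RP-P-OB-0001", "RP-P-OB-0002"]       # placeholder identifiers

with open("prints.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["object_number", "title", "makers"])
    for number in OBJECT_NUMBERS:
        resp = requests.get(f"{BASE}/{number}",
                            params={"key": API_KEY, "format": "json"})
        resp.raise_for_status()
        record = resp.json().get("artObject", {})       # field names are illustrative
        makers = "; ".join(m.get("name", "")
                           for m in record.get("principalMakers", []))
        writer.writerow([number, record.get("title", ""), makers])
        time.sleep(1)                                    # be gentle with the API
```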

Because I was more interested in looking at particular metrics of these networks than in producing "spaghetti monster" visualizations of the sort you get from a program like Gephi, I turned to the statistical programming language R to perform the actual quantitative analyses. R has been fantastic for manipulating and structuring huge tables of data, running network analysis algorithms (or just about any other algorithm you'd like to run), and then producing publication-quality visualizations. Because everything is scripted, it was easy to document my work and iterate through several different versions of an analysis. In fact, you can download the data and scripts for my DH2015 paper yourself and reproduce every single visualization.
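To make that concrete, here is a minimal sketch of the kind of scripted, reproducible network measurement Matthew describes. His analysis used R; this Python version with networkx is only an illustration, and the collaboration rows are toy data rather than anything drawn from the museum datasets.

```python
# A minimal, illustrative sketch (Matthew's real analysis is in R): build a
# designer-engraver collaboration network from an edge list and compute a
# couple of simple measures. The rows below are toy data, not real records.
import networkx as nx

# One row per designer-engraver collaboration on a print (toy data).
collaborations = [
    {"designer": "designer_a", "engraver": "engraver_a", "same_nation": True},
    {"designer": "designer_b", "engraver": "engraver_b", "same_nation": False},
    {"designer": "designer_a", "engraver": "engraver_b", "same_nation": True},
]

G = nx.Graph()
for row in collaborations:
    G.add_edge(row["designer"], row["engraver"], same_nation=row["same_nation"])

# How often do collaborations cross national boundaries, and how connected
# is the network overall?
cross = sum(1 for _, _, attrs in G.edges(data=True) if not attrs["same_nation"])
print("cross-national share:", cross / G.number_of_edges())
print("connected components:", nx.number_connected_components(G))

# Because the whole analysis lives in a script like this, rerunning it against
# an updated dataset is just a matter of re-executing the file.
```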

from DH2015 paper, “Modeling the (Inter)national Printmaking Networks of Early Modern Europe”

Thomas: Based on your comments and prior blog posts such as “Tidy (Art) Historical Data,” it seems that you put a great deal of care into thinking about how your data and research processes are documented and shared. Perhaps it's a bit of a brusque way to ask, but what made you care? How did you learn how to care? Who did you learn from?

Matthew: I started caring because I saw smart people doing it. I still care because I experienced the practical benefits in a real way. Many of my DH role models put forward careful documentation of their work: Lincoln Mullen's openly-accessible code, Miriam Posner's bevy of public DH syllabi, or Caleb McDaniel's lengthy "backwards survey course" reflection. Here were people doing really useful work, and I was directly benefitting from their openness – so that was absolutely something that I wanted to emulate. On the other side of it, I've also had to deal with anti-patterns in documentation. Because I work almost exclusively with data that other people have assembled, I'm painfully conscious of how much the lack of documentation, and/or the assumption that people will only ever use your data the same way that you did, can hinder productive re-use of data.

Now, to be honest, I am not sure if anyone else has directly benefitted yet from looking at my code and data. However, I've certainly benefitted from my own documentation! I have been revising an article in response to peer reviews. We all know what that timeline looks like: I "completed" (ha!) the data analysis almost a year ago, finalized and submitted the text with my co-author a month or so after that, then waited many more months before the reviews came back. In just the past month I've had to go back in and re-run everything with an updated dataset, clarify some of the analytical decisions made, and enhance several of the visualizations. And I didn't need to rip my hair out, because all of my work is documented in simple code files, and I don't have to try and reverse-engineer my own product without the original recipe. (I should note that the R programming community is great for this. It is filled with particularly vocal advocates for reproducible code, like knitr author Yihui Xie, who produce great tools for practicing what they preach.)

By writing documentation notes as I go, I’ve also become much better at explaining – in natural language – what I am doing computationally. This is crucial for any kind of quantitative work, but all the more so in humanities computing, where you can usually count on the fact that most of your audience will have no background in your methodology.

Thomas: Thinking on the digitally inflected research you’ve conducted to date, and the directions you seek to go in the future, what are the most significant challenges you anticipate you will encounter? Accessing data? Sharing your data? Venturing into new methodological terrain? Recognition of the work en route to tenure?

Matthew: I agree with Jacob Price's assessment of data-driven methods in history: that, however promising, they present major challenges, not only in the logistics of producing interoperable data but also in producing interoperable scholarship. If the skills required to interpret and evaluate data-driven humanistic scholarship remain concentrated in a small corner of our respective fields, and never make it into, say, graduate methodology courses, then the long-term impact of that scholarship will also remain cloistered. One might argue this is surely a solvable problem… but I cite Price because he wrote that in 1969. I am excited to help other scholars implement these approaches in their own research (*cough* I'm available for hire! *cough*), but it is sobering to remember how enduring these problems have been.

Thomas: What recent research has inspired you?

Matthew: Ruth and Sebastian Ahnert’s recent article on English Protestant communities in the 1530s thoughtfully maps formal network concepts onto interesting disciplinary research questions – in their case, examining how Queen Mary I’s campaign to stifle evangelical organization failed to target the most structurally-important members of the dissident correspondence network. Also, I’ve found Ted Underwood’s and Jordan Sellers’ work on machine classification of literary standards to be one of the most fluently-written and compelling explanations of how predictive statistical tools can be used for hypothesis testing in the humanities.

Thomas: Whose data praxis would you like to learn more about?

Matthew: For all the work that I do with art history, I’ve actually done surprisingly little work directly with image data! There are some really interesting questions of stylistic history that I suspect could be informed by applying some fairly basic image processing techniques. I’d like to better understand methods for generating and managing image data and metadata (like color space information), from both the repository/museum perspective (how and why is it produced in the way it is?) as well as a computer vision perspective (how should that metadata be factored into analysis?).


This work is licensed under a Creative Commons Attribution 4.0 International License.