Community-Oriented Research Data Curation and Reuse

Ixchel Faniel is a Research Scientist at OCLC Research. She has conducted extensive research into how various disciplinary communities approach research data reuse, and she served as Principal Investigator on DIPIR (Dissemination Information Packages for Information Reuse), a project funded by the Institute of Museum and Library Services.

Thomas: Believe it or not, I think you may be the first Research Scientist I’ve corresponded with! Can you tell me a bit about your role at OCLC and what combination of experience, education, and interests led you to it?

Ixchel: I’m laughing because I didn’t set out to be a Research Scientist, but given my experience, education, and interests, I can see how I ended up at OCLC. My role is to conduct research that informs and impacts the communities OCLC serves – libraries, archives, museums, their staff, and their patrons – in terms of what they do and how they do it. Typically, my interests center on people and their needs. At a high level, I’m interested in how people discover, access, and use content and how technology supports those activities. My interests grew out of my experiences and education for sure.

After graduating from Tufts University with a B.S. in Computer Science, I worked at Andersen Consulting (now Accenture) as a consultant, designing and developing information systems to meet clients’ needs. I spent a lot of time at USC, earning my MBA and Ph.D. in Business Administration at the Marshall School of Business. The time between the two degrees was spent at IBM, selling mainframes and working with a team to propose total solutions – hardware, software, services – to solve a customer’s specific business problem. Working as an Assistant Professor at the School of Information, University of Michigan, also shaped my interests, because I was exposed to different conversations – federal funding agency mandates to share research data for reuse, data repositories, research collaboratories, team science, cyberinfrastructure.

Primarily, my interests have focused on the reuse of research data within academic communities. I’m particularly interested in the kinds of contextual information that need to be captured along with the data and in how that context can be curated and preserved. This line of work started with National Science Foundation funding, when I studied how earthquake engineering researchers reused each other’s data. It continued with DIPIR (Dissemination Information Packages for Information Reuse), a multi-year, Institute of Museum and Library Services funded project to study the data reuse practices of social scientists, zoologists, and archaeologists. Recently, work in the area has been extended with National Endowment for the Humanities funding for the Beyond Management: Data Curation as Scholarship in Archaeology project, a longitudinal study of archaeologists’ data creation, management, and reuse practices that investigates data quality and modeling requirements for reuse by the larger community. A second, related line of work I began at OCLC examines the role of academic libraries and librarians in supporting researchers’ data sharing, management, and reuse needs. It serves as a nice complement to studying the researchers, because librarians and libraries are viewed as a key stakeholder group. I’m currently examining their early experiences designing, developing, and delivering e-research support.

Thomas: I am intrigued by the link you drew between technology-supported knowledge reuse for innovation and data reuse within academic communities. Can you speak to this in greater detail?

Ixchel: Sure. So when I was studying knowledge management and knowledge reuse within organizations, one of the major issues was that knowledge was contextual in nature. In order for one colleague to understand another colleague’s knowledge well enough to apply it to a new situation, there was a need to know the context within which it was created. The difficulty was having access to all of those details in the absence of the colleague. In going about their work, employees were creating a paper trail for some things, but they weren’t necessarily capturing everything related to how and why they were doing what they were doing. In other cases, employees were capturing a summary of events in a final, formal document to be stored and shared as corporate memory, but keeping the documents they generated in the course of creating that final document to themselves. Those additional documents served as a detailed reminder of how they arrived at the final document, but weren’t necessarily shared with others. In still other cases, employees were relying on their past and present experiences, and the knowledge they were using in the moment to make a decision, solve a problem, or develop a new product was tacit and never captured.

[pullquote]Similar to employees who generate knowledge, researchers who generate data may be recording some context but not necessarily all context, or they’re sharing some context but not necessarily all context.[/pullquote]These are some of the same issues we face in studying reuse of research data within academic communities. Similar to employees who generate knowledge, researchers who generate data may be recording some context but not necessarily all context, or they’re sharing some context but not necessarily all context. They are doing this not necessarily to withhold information, but because they don’t think to record or share certain kinds of context. In some cases the context represents something they do so regularly in the course of their research that it becomes tacit, or such a minor detail to them that they don’t think to include it. Before federal funding agency mandates to share data and write data management plans, researchers did not have to document data beyond their personal needs. Are they capturing enough about the context of data production that others can come along and reuse the data to answer different research questions or make a new discovery? I wasn’t sure, and I didn’t think we would really know until we started asking researchers who reused data what they needed.

Thomas: Nearly five years ago, you and Ann Zimmerman wrote, Beyond the Data Deluge: A Research Agenda for Large-Scale Data Sharing and Reuse. In that paper you outlined an ambitious research agenda for yourself and a wide field of scholars and practitioners. I imagine a fair amount of work has happened since then. Thinking on that work, where are we now?

Ixchel: Good question. Where are we now?  Making lots of progress, but there is still more to do.  Ann and I developed that research agenda around three activities we thought needed more attention – 1) broader participation in data sharing and reuse, 2) increases in the number and types of intermediaries, and 3) more digital products.

We’ve definitely seen more research examining data sharing and reuse. When we wrote the article, we agreed that research on reuse was sorely lacking; we had both done early work in the area examining data reuse practices among ecologists and earthquake engineering researchers. Interestingly, our article was written around the same time Elizabeth Yakel and I were waiting to hear whether the DIPIR project was going to get funded.

Luckily it was funded.  As I mentioned earlier, DIPIR is a study of data reuse practices of social scientists, archaeologists, and zoologists.  We were particularly interested in what kinds of contextual information were important for reuse and how it could best be curated and preserved.  Over the years we’ve published studies considering how social scientists and archaeologists develop trust in repositories, examining internal and external factors that influence archaeologist and zoologist attitudes and actions around preservation, studying the relationship between data quality attributes and data reuse satisfaction, and discussing the topology of change at disciplinary repositories.

Most recently we’ve been doing a lot of work describing the context-driven approach to data curation for reuse that we used for the DIPIR project. Initially, our major goal with this approach was to give voice to the reusers’ perspective, understanding that it influences and is influenced by the other stakeholders involved in the process. Now we are working toward providing a more balanced picture of needs among the major stakeholders (data producers, data reusers, and repository staff), knowing that they may have different, sometimes competing, data and documentation needs. We started narrowly by bringing reusers’ perspectives to the fore, but we’ve been slowly expanding to include these other stakeholders.

Context Driven Approach to Data Curation

In February 2016, we convened a workshop at the International Digital Curation Conference to share our approach and findings. We collaborated with Kathleen Fear, Data Librarian at the University of Rochester, and Eric Kansa, Data Publisher at Open Context. Both also do research in the area, but their primary responsibility is supporting researchers’ data sharing, management, curation, and reuse needs. They talked about the impact DIPIR findings were having on their practices, including the tradeoffs they had to make. To complement the presentation, we engaged workshop participants in card sort exercises to consider the importance of different types of contextual information given the needs of data reusers vs. repository staff. One of our objectives for the workshop was for participants to see the partnership, or marriage, between research and practice, as well as the differences in needs not only between data reusers and repository staff, but also among data reusers within different disciplines and among repository staff supporting different designated communities of users.

IDCC2016 Card Sort Exercise

Thomas: Where do you think work of this kind is headed next?

Ixchel: I believe these kinds of connections – connections between researchers studying the phenomena and practitioners implementing it – are one thing that can help advance work in the area. And even among researchers, diverse experiences are important. Elizabeth and I came together given her background in users, archives, and preservation and my background in users of information systems and content. During our research we’ve engaged with different perspectives and literatures that have some similarities but haven’t always talked to or referenced one another. One of our goals was to bridge those areas.

[pullquote]My goal was to convince archivists to bring their expertise to the table in conjunction with an understanding of data reusers’ needs to inform not only the preservation of data’s meaning, but also other archival practices, particularly the partnerships they form.[/pullquote]Bridging is an important part of this effort. I wrote about a particular aspect of it in a two-part blog post after participating in a panel on Data Management and Curation in 21st Century Archives at the Society of American Archivists Annual Meeting in 2015. My goal was to convince archivists to bring their expertise to the table in conjunction with an understanding of data reusers’ needs to inform not only the preservation of data’s meaning, but also other archival practices, particularly the partnerships they form.

A key part of my message was that archivists cannot go it alone, because curation and management are bigger than the archive. My related study of librarians confirmed it; communication, coordination, and collaboration with other campus entities were particularly important when supporting research data services. Presentations from fellow panelists also confirmed it. What struck me about their presentations was the question of whether and how they and their colleagues came to value each other’s complementary strengths in order to deliver more effective research data services.

It would be great to see more work about whether and how data and information researchers and professionals begin to partner with each other and with other organizations. There has been some work to frame the issue by Brian Lavoie and Constance Malpas, colleagues at OCLC, who conceptualize evolving stewardship models. Seeing additional research in this area, and how it plays out in practice, particularly within and across colleges and universities, would be interesting.

So when Ann and I talked about increases in the number and types of intermediaries, one of the areas we suggested examining was how education, roles, and responsibilities were changing given the evolving nature of data and information professionals. There has been nice progress in those areas from Liz Lyon at the University of Pittsburgh, Carole Palmer at the University of Washington, and Helen Tibbo and Cal Lee at UNC Chapel Hill. Going forward, it will be interesting to examine career trajectories: how do these professionals advance, and what is rewarded vs. not?

With regard to the last area Ann and I discussed – more digital data products, or new types of digital products that include or reference data – The Evolving Scholarly Record presents a framework to organize and drive discussions about it. More recently, Alastair Dunning wrote a nice blog post while at the International Digital Curation Conference summarizing Barend Mons and Eric Kansa’s approach to publishing data and how it benefits reuse – Atomising data: Rethinking data use in the age of explicitome. But that’s just the beginning. There’s definitely room for more work in this area, particularly on the approaches being taken in other disciplines given the needs of data producers, reusers, and repository staff.

Thomas: Whose data praxis would you like to learn more about?

Ixchel: That’s an interesting question. For me it’s the data producers and curators. So for the past several years I’ve been working with colleagues to get data reusers’ perspectives inserted into conversations, but by no means do I think it is enough. We’ve done some work examining Three Perspectives on Data Reuse: Producers, Curators, and Reusers, starting at the point of data sharing. It goes back to the context-driven approach. Data producers really set the tone regarding data management, curation, and reuse, because they are upstream in the data lifecycle. The NEH funded project I discussed earlier – Beyond Management: Data Curation as Scholarship in Archaeology – aims to bridge data creation and reuse.

The project started in January 2016. We have this fantastic opportunity to interview and observe archaeologists while they are collecting and recording data in the field during archaeological excavations and surveys, and to interview archaeologists interested in reusing the data. The objective is to examine data creation practices in order to provide guidance about how to create higher quality data at the outset, with the hope that downstream data curation activities become easier and less time intensive and that data creation practices are better aligned with meaningful reuse. We have another great group of people working on the project, specializing in archaeology, anthropology, data curation and publishing, information science, and archives and preservation, and we are all focused on studying data creation and reuse and impacting practice. I’m looking forward to seeing how it progresses. It should be a lot of fun.

Data-Driven Art History: Framing, Adapting, Documenting

This is the first post in Data Praxis, a new series edited by Thomas Padilla.

Matthew Lincoln is a PhD candidate in Art History at the University of Maryland, College Park. Matthew is interested in the potential for computer-aided analysis of cultural datasets to help model long-term artistic trends in iconography, art markets, and social relations between artists in the early modern period. Last summer, Matthew held a fellowship at the Harvard MetaLab workshop Beautiful Data, and presented research at the Alliance for Digital Humanities Organizations’ annual international conference, DH2015, in Sydney, where his paper, “Modeling the (Inter)national Printmaking Networks of Early Modern Europe,” was a finalist for the ADHO Paul Fortier Prize.

Thomas: I’m always interested in the hows and whys of folks getting involved in digitally inflected research. Can you tell us a bit about yourself and describe what motivated you to take a path that brings Art History and digital research together?

Matthew: I suppose my digital art history “origin story” is one of a series of coincidences. I’ve always been interested in programming, and, as an undergraduate, even took a few computer science courses while I was majoring in art history at Williams College. But I’d never seriously considered how to apply those digital skills to historical research while at Williams, nor did I start my graduate work at the University of Maryland with any intention of doing computationally aided art history there, either. However, as it happened, the same generous donation that made my attendance at UMD possible (a Smith Doctoral Fellowship in Northern European Art) had also funded the Michelle Smith Collaboratory for Visual Culture, an innovative physical space in the Department of Art History & Archaeology that was intended to serve as a focal point for experimenting with new digital means for sharing ideas and research. I was already several years into my coursework before I took a semester-long graduate assistantship in the Collaboratory, where I was given remarkable leeway to explore how the so-called “digital humanities” might inflect research in art history. During that semester, I developed a little toy map generated from part of Albrecht Dürer’s diary of his trip to the Netherlands in 1520-1521. But I also had my eyes opened to the vibrant discourse about digital research in the humanities that had, up to that point, been totally outside my field of view. What is more, data-driven approaches held particular promise for my own corner of art historical research on early modern etchings and engravings. Because of the volume of surviving impressions from this period, a lot of scholarship on printmakers and print publishers comprises a wealth of quantitative description and basic cataloging. My dissertation seeks to mine this existing work for larger synthetic conclusions about print production practices in the Dutch golden age.

Thomas: Over the summer you presented a paper at DH2015 that would become a finalist for the ADHO Paul Fortier Prize, “Modeling the (Inter)national Printmaking Networks of Early Modern Europe.” What were the primary research questions in the paper, and what methods and tools (digital and otherwise) did you employ to pursue those questions?

Matthew: I’m interested in how etchings and engravings can serve as an index of past artistic and professional relationships. Most of these objects are the result of many hands’ work: an artist who produced a drawn or painted design, a platecutter who rendered the image onto a printing plate, and often a publisher who coordinated this effort and printed impressions. Seen in this light, the extensive print holdings in modern-day collections offer an interesting opportunity to see what kinds of structures emerge from all of this collaboration. In this paper, I wanted to examine how artists tended to connect (or not) across national boundaries. In the history of seventeenth-century Dutch art in particular, there has been a lot of well-deserved attention on the influence and prestige of Dutch painters traveling abroad. But what about printmakers? Did Dutch printmakers tend to connect to fellow Dutch artists more frequently, or did they prefer international collaborators? And how might this ratio have changed over time? It’s easy to intuitively argue either side of this question based on a basic understanding of Dutch history at the time, so this was a good opportunity to introduce some empirical observations and formal measurement to the discussion. In this vein, I’d argue one of my most crucial methods was doing a good old-fashioned literature review in order to properly understand the stakes of the question that I wanted to operationalize.

from DH2015 paper, “Modeling the (Inter)national Printmaking Networks of Early Modern Europe”

I drew on two major datasets for this paper: the collections data of the British Museum and that of the Rijksmuseum. The British Museum has released its collections data as Linked Open Data, which meant that I needed to invest a considerable amount of time learning SPARQL (the query language for LOD databases) and how to build my own mirror of their datastore in Apache Fuseki, as my queries were too large to submit to their live server. On the other hand, once I had mastered the basic infrastructure of this graph database, it was easy to produce tables from these data exactly suited to the analyses I wanted to do. The Rijksmuseum offers a JSON API service, allowing you to download one detailed object record at a time. The learning curve for understanding the Rijksmuseum’s data model was lower than that for the British Museum’s LOD. However, I had to battle many more technical issues, from building Bash scripts to laboriously scrape every object from the Rijksmuseum’s cantankerous API, to figuring out how to break out just the information I needed from the hierarchical JSON I got in return (jq was a fantastic utility for doing this).
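Matthew’s actual pipeline used Bash and jq for this extraction step, but the same flatten-the-JSON idea can be sketched in R, the language he used for the analysis itself. The endpoint shape and field names below are illustrative assumptions, not the Rijksmuseum’s documented schema:

```r
# Minimal sketch: fetch one object record from a JSON API and keep only
# the fields needed for analysis. The URL pattern and field names are
# assumptions for illustration, not the museum's actual schema.
library(jsonlite)

fetch_object <- function(object_id, api_key) {
  url <- sprintf(
    "https://www.rijksmuseum.nl/api/en/collection/%s?key=%s&format=json",
    object_id, api_key
  )
  record <- fromJSON(url)  # parses the hierarchical JSON into nested lists

  # Flatten just the fields of interest into a one-row data frame
  data.frame(
    id    = record$artObject$objectNumber,   # hypothetical field names
    maker = record$artObject$principalMaker,
    title = record$artObject$title,
    stringsAsFactors = FALSE
  )
}

# Looping over many ids builds the flat table the analysis needs:
# prints <- do.call(rbind, lapply(ids, fetch_object, api_key = "YOUR_KEY"))
```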

Because I was more interested in looking at particular metrics of these networks than in producing “spaghetti monster” visualizations like the ones you can make in a program like Gephi, I turned to the statistical programming language R to perform the actual quantitative analyses. R has been fantastic for manipulating and structuring huge tables of data, running network analysis algorithms (or just about any other algorithm you’d like to run), and then producing publication-quality visualizations. Because everything is scripted, it was easy to document my work and iterate through several different versions of an analysis. In fact, you can download the data and scripts for my DH2015 paper yourself and reproduce every single visualization.
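One standard way to formalize the question of whether artists connected within or across national lines is nominal assortativity, a network metric available in R’s igraph package. The following toy sketch uses invented names, edges, and nationalities; it is not Matthew’s code or data, just an illustration of the kind of metric he describes:

```r
# Toy sketch: do artists tend to link within their own nationality?
# All names, edges, and nationalities below are invented.
library(igraph)

nodes <- data.frame(name = c("A", "B", "C", "D"),
                    nationality = c("Dutch", "Dutch", "French", "Dutch"),
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c("A", "A", "B", "C"),
                    to   = c("B", "C", "D", "D"),
                    stringsAsFactors = FALSE)
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Positive assortativity: ties concentrate within a nationality;
# negative: cross-national ties dominate
assortativity_nominal(g, types = as.integer(factor(V(g)$nationality)),
                      directed = FALSE)

# The same tendency as a simple proportion of cross-national edges
nat <- setNames(nodes$nationality, nodes$name)
el  <- as_data_frame(g, what = "edges")
mean(nat[el$from] != nat[el$to])
```

A real analysis would compute such metrics per time slice to trace how the ratio changed over time, as the paper’s research questions suggest.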

from DH2015 paper, “Modeling the (Inter)national Printmaking Networks of Early Modern Europe”

Thomas: Based on your comments and prior blog posts such as “Tidy (Art) Historical Data,” it seems that you put a great deal of care into thinking about how your data and research processes are documented and shared. Perhaps it’s a bit of a brusque way to ask, but what made you care? How did you learn how to care? Who did you learn from?

Matthew: I started caring because I saw smart people doing it. I still care because I experienced the practical benefits in a real way. Many of my DH role models put forward careful documentation of their work: Lincoln Mullen’s openly accessible code, Miriam Posner’s bevy of public DH syllabi, or Caleb McDaniel’s lengthy “backwards survey course” reflection. Here were people doing really useful work, and I was directly benefitting from their openness – so that was absolutely something that I wanted to emulate. On the other side of it, I’ve also had to deal with anti-patterns in documentation. Because I work almost exclusively with data that other people have assembled, I’m painfully conscious of how much the lack of documentation, and/or the assumption that people will only ever use your data the same way that you did, can hinder productive re-use of data.

Now, to be honest, I am not sure if anyone else has directly benefitted yet from looking at my code and data. However, I’ve certainly benefitted from my own documentation! I have been revising an article in response to peer reviews. We all know what that timeline looks like: I “completed” (ha!) the data analysis almost a year ago, finalized and submitted the text with my co-author a month or so after that, then waited many more months before the reviews came back. In just the past month I’ve had to go back in and re-run everything with an updated dataset, clarify some of the analytical decisions made, and enhance several of the visualizations. And I didn’t need to rip my hair out, because all of my work is documented in simple code files, and I don’t have to try to reverse-engineer my own product without the original recipe. (I should note that the R programming community is great for this. It is filled with particularly vocal advocates for reproducible code, like knitr author Yihui Xie, who produce great tools for practicing what they preach.)
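The workflow Matthew describes is easy to picture as a single scripted pipeline: when the dataset is updated, you change one path and re-run. A minimal sketch along those lines, with invented file names and columns:

```r
# analysis.R -- the whole pipeline lives in one script, so re-running
# against an updated dataset means changing a single path. File names
# and column names here are invented for illustration.
library(ggplot2)

data_path <- "data/prints_2016.csv"  # update here when new data arrives
prints <- read.csv(data_path, stringsAsFactors = FALSE)

# Analytical decisions are recorded in code, with the reasoning inline
prints <- subset(prints, !is.na(decade))  # drop undated impressions

p <- ggplot(prints, aes(x = factor(decade))) +
  geom_bar() +
  labs(x = "Decade", y = "Impressions")
ggsave("figures/impressions_per_decade.pdf", p)
```

Tools like knitr take this a step further by weaving the code and its outputs directly into the text of the article itself.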

By writing documentation notes as I go, I’ve also become much better at explaining – in natural language – what I am doing computationally. This is crucial for any kind of quantitative work, but all the more so in humanities computing, where you can usually count on the fact that most of your audience will have no background in your methodology.

Thomas: Thinking on the digitally inflected research you’ve conducted to date, and the directions you seek to go in the future, what are the most significant challenges you anticipate you will encounter? Accessing data? Sharing your data? Venturing into new methodological terrain? Recognition of the work en route to tenure?

Matthew: I agree with Jacob Price’s assessment of data-driven methods in history: that, however promising, they present major challenges, both in the logistics of producing interoperable data and in producing interoperable scholarship. If the skills required to interpret and evaluate data-driven humanistic scholarship remain concentrated in a small corner of our respective fields, and never make it into, say, graduate methodology courses, then the long-term impact of that scholarship will also remain cloistered. One might argue this is surely a solvable problem… but I cite Price because he wrote that in 1969. I am excited to help other scholars implement these approaches in their own research (*cough* I’m available for hire! *cough*), but it is sobering to remember how enduring these problems have been.

Thomas: What recent research has inspired you?

Matthew: Ruth and Sebastian Ahnert’s recent article on English Protestant communities in the 1530s thoughtfully maps formal network concepts onto interesting disciplinary research questions – in their case, examining how Queen Mary I’s campaign to stifle evangelical organization failed to target the most structurally-important members of the dissident correspondence network. Also, I’ve found Ted Underwood’s and Jordan Sellers’ work on machine classification of literary standards to be one of the most fluently-written and compelling explanations of how predictive statistical tools can be used for hypothesis testing in the humanities.

Thomas: Whose data praxis would you like to learn more about?

Matthew: For all the work that I do with art history, I’ve actually done surprisingly little work directly with image data! There are some really interesting questions of stylistic history that I suspect could be informed by applying some fairly basic image processing techniques. I’d like to better understand methods for generating and managing image data and metadata (like color space information), from both the repository/museum perspective (how and why is it produced in the way it is?) as well as a computer vision perspective (how should that metadata be factored into analysis?).


This work is licensed under a Creative Commons Attribution 4.0 International License.

POST: Refining the Problem — More work with NYPL’s open data, Part Two

In part two of his experiment to create an index of items using the New York Public Library’s What’s on the menu? data set, Trevor Muñoz discusses his work with the data and some of the lessons he learned. Muñoz used the OpenRefine tool and, finding the NYPL data set too large to work with easily, describes some of his workarounds. Muñoz concludes,

The larger question is whether there is still a plausible vision for how a data curator could add value to this data set. The need to script around limitations of a tool increases the cost of normalizing the NYPL data. At the same time, the ability to see the clusters of similar values that Refine produces increases my confidence that the potential gain in data quality could be very substantial in going from the raw crowdsourced data to an authoritative index.
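For readers who haven’t used it, the clustering Muñoz refers to is OpenRefine’s key-collision method: each value is reduced to a normalized “fingerprint,” and values sharing a fingerprint are proposed as a cluster to merge. A rough R approximation of the idea (not Muñoz’s actual workflow, and with invented dish names):

```r
# Rough sketch of fingerprint (key-collision) clustering, the method
# behind OpenRefine's default cluster view. Example values are invented.
fingerprint <- function(x) {
  x <- tolower(trimws(x))                  # normalize case and whitespace
  x <- gsub("[[:punct:]]", "", x)          # strip punctuation
  tokens <- strsplit(x, "\\s+")            # split into word tokens
  vapply(tokens,
         function(t) paste(sort(unique(t)), collapse = " "),
         character(1))                     # sorted, deduped tokens as key
}

dishes <- c("Chicken Croquettes", "croquettes, chicken", "CHICKEN CROQUETTES.")
split(dishes, fingerprint(dishes))  # all three variants share one cluster
```

OpenRefine also offers fuzzier nearest-neighbor methods, but key collision like this is the cheap first pass that surfaces the clusters Muñoz describes.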

POST: What IS on the Menu? More Work with NYPL’s Open Data, Part One

Part of making the argument for open collections data is showing what can be done with it. Trevor Muñoz’s recent blog post, in which he plays with the NYPL’s open data from the “What’s on the Menu?” project, explains how he uses the collection data as a testbed for data curation work. As Muñoz states:

I’m particularly interested right now in work that data curators can do to build secondary and tertiary resources—reference materials, if you will—around data. I mean particularly reference materials that draw on the skills of people with training in library and information science, things like indexes. These types of organized systems of description can be one way to provide additional value over full text search (which, for many kinds of data sets, e.g., a table of numerical readings, is not particularly effective anyway).

After evaluating the data release against Tim Berners-Lee’s 5 Star Linked Open Data Scale, Muñoz begins the process of creating a useful index to the names of the dishes represented in the collection, introducing linked data concepts and showcasing the work (and potential work) of data curators along the way.


RESOURCE: Using Data Curation Profiles to Design the Datastar Dataset Registry

The current issue of D-Lib Magazine includes an article by Sarah J. Wright, Wendy A. Kozlowski, Dianne Dietrich, Huda J. Khan, Gail S. Steinhart, and Leslie McIntosh titled, Using Data Curation Profiles to Design the Datastar Dataset Registry. From the abstract:

The development of research data services in academic libraries is a topic of concern to many. Cornell University Library’s efforts in this area include the Datastar research data registry project. In order to ensure that Datastar development decisions were driven by real user needs, we interviewed researchers and created Data Curation Profiles (DCPs). Researchers supported providing public descriptions of their datasets; attitudes toward dataset citation, provenance, versioning, and domain specific standards for metadata also helped to guide development. These findings, as well as considerations for the use of this particular method for developing research data services in libraries are discussed in detail.

RESOURCE: Keeping Up With… Big Data

The latest issue of the Association of College and Research Libraries’ (ACRL) Keeping Up With… publication is devoted to big data. Written by Mark Bieraugel (Business Librarian at California Polytechnic State University), it covers the nuts and bolts of the topic and offers a bibliography that includes sections such as “Big Data and the Academy,” “Privacy and Criticism,” “Tutorials,” and “Sandboxes.”

Bieraugel advises humanities and social science librarians to recognize that “big data is becoming more commonplace in their disciplines as well, and is no longer restricted to corpus linguistics.” He goes on to advocate for the role of libraries in data curation: “Librarians also need to embrace a role in making big datasets more useful, visible and accessible by creating taxonomies, designing metadata schemes, and systematizing retrieval methods.”

RECOMMENDED: Data curation as publishing for digital humanists

Text and slides from a talk delivered by Trevor Muñoz, Assistant Dean for Digital Humanities Research at the University of Maryland Libraries, at the CIC Center for Library Initiatives conference. Muñoz presents an intriguing synthesis of a couple of growing trends in libraries – data curation and publishing. Data curation here is defined as “information work that integrates closely with the disciplinary work practices and needs of researchers in order to ‘maintain digital information that is produced in the course of research in a manner that preserves its meaning and usefulness as a potential input for further research.'” Muñoz argues that “data curation work would also be ‘publishing’ in the sense of ensuring quality and disseminating outputs to interested communities…By recognizing data curation work as a publishing activity, libraries would have a ‘market opportunity’ to address unmet needs in the digital humanities community.” More broadly,

Data curation as a “publishing” activity is increasingly relevant to the working lives of digital humanities scholars. Moreover, articulating connections between “publishing” and data curation is important in the context of strategic decision libraries might make and, in fact, are making about how to participate in “publishing.” Data curation as publishing is publishing work that draws directly on the unique skills of librarians and aligns directly with library missions and values in ways that other kinds of publishing endeavors may not.

JOBS: Fellowships in Data Curation for Medieval Studies

The CLIR/DLF Postdoctoral Fellowship in Data Curation for Medieval Studies is an expansion of the CLIR Postdoctoral Fellowship Program in Academic Libraries. These five fully funded fellowships will provide recent Ph.D.s with professional development, education, and training opportunities in data curation for Medieval Studies. Through this program, CLIR seeks to raise awareness and build capacity for sound data management practice throughout the academy.

Each fellowship is a two-year appointment, with a $60,000 salary, plus benefits and a yearly travel and research stipend.