Clifford Wulfman is the Coordinator of Library Digital Initiatives at the Firestone Library at Princeton University and a consultant to Princeton’s new Center for Digital Humanities. He has been involved with the Perseus Digital Library and the Modernist Journals Project, and is currently Director of the Blue Mountain Project, an NEH-funded project to digitize European and North American avant-garde art periodicals from 1848-1923. Roxanne Shirazi corresponded with Dr. Wulfman about his work and training, digital libraries, DH, alt-ac, and the future of digital collections.
Roxanne: You have an interesting combination of disciplinary training that spans both literature and computer science. Tell us about your background, and how you came to be involved in digital libraries.
Clifford: It’s a pretty long, convoluted story, actually. I was an English major in college and planned to go on to graduate school, but this was the early 1980s, and the job market was grim, and we were all discouraged from pursuing that path. (Little did anyone know how bad it would get later, of course.) I had friends and roommates who were into computers, though, and from them I got the idea that computational linguistics and artificial intelligence would be another way of thinking about language and thought. So—with a single programming course under my belt—I managed to talk my way into graduate school in computer science. I left grad school with a Master’s degree to go work in a medical AI lab, where I built graphical user interfaces for a number of years, but then, in my late twenties, I found myself itching to study literature, and in the theory and practice of hypertext I saw a way to bring my interests together.
So at age thirty I returned to grad school to pursue a PhD in English, writing on traumatic narrative structures in Faulkner and Woolf. I didn’t end up doing much with hypertext then—turns out I was a bit ahead of my department that way—but I kept my hand in, and by the time I graduated, when the job market truly was grim, I was fortunate enough to land a post-doc with the Perseus Digital Library Project at Tufts. That, I suppose, was the real point of intersection, because I was able to pursue interests in language, literature, and computer science among an enthusiastic group of colleagues who understood what I was talking about!
When the post-doc ended, I became project manager (and then technical director) of the Modernist Journals Project. That really brought things together—digital libraries and modernism—and I was able to grow the MJP from a single-title project to a well-used resource. Once again, I was fortunate to have a wonderful group of colleagues, who spoke all my languages, so when I finally had to find a permanent position I was encouraged to look to the library as a place where the work I had been doing could continue. And here we are.
Roxanne: Recently, the term “scholar-practitioner” has been taken up by many in the alt-ac movement, and the digital humanities community in particular has emphasized the value of humanities scholars who are also versed in computing. In what way has your PhD training influenced your work as a programmer, and vice versa?
Clifford: I came to computer science from a “close reading” background, so my concepts of language, knowledge, and meaning seemed quite different from those of the artificial-intelligence researchers and computational linguists I worked with in the 1980s. Their concepts were logic-based, for the most part, whereas mine were more associative, and their notions of “language understanding” seemed quite limited to me. But I think my CS training made me much more aware of structure and process, and I became interested in structuralism, post-structuralism, and psychoanalysis as a result. And in hypertext, of course, which seemed to me to be the reification of a great deal of literary and linguistic theory.
What is the value of this vice-versa versatility? First and foremost, it has made me pretty fearless: I’m almost always open to trying something new. And of course it has made it possible for me to see issues from many sides. But it’s interesting: for all our talk about academic diversity, one hears a great deal about “the scholar,” as though there were a single, monolithic figure embodying a consistent set of concerns and practices (picture Jerome in his study, or Rodin’s Thinker, or Ozymandias…). Likewise, the notion of “programming,” in the digital humanities, seems to cover everything from developing new algorithms for text analysis to making WordPress sites. I think I resist definition: I exist as an Other to mainstream notions of “scholar,” “academic,” “computer scientist,” and “programmer” (not to mention any notion of “librarian”!), and I think that bothers a lot of people who like nice, simple identities. A scholar is not an intellectual; a programmer is not a computer scientist, and yet in many ways I am all of these. No wonder I am “alt-ac!”
Roxanne: The area of overlap between digital libraries and digital humanities is quite large. How would you characterize the difference between these areas? How important is the distinction between, say, The Perseus Project as a digital library, and the Modernist Journals Project, which is frequently referred to as a DH project?
Clifford: First of all, I think there’s a crucial and fundamental difference between a “digital library” and a “digitized library.”
The terms digital and digitize are treacherously polysemous. Digital derives from the Latin word digitus, digiti, meaning finger or toe; it has become a metonym for the discrete values modern computers use to represent information. To digitize, then, is to represent information by means of discrete values, and digital data is simply information stored as ordered sequences of discrete states. These ordered sequences are often called files or streams, and they come in many varieties, but at the most basic level they are all the same: audio files, image files, text files are all just sequences of bits.
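As a minimal sketch of the point that text and images alike reduce to sequences of bits, the following Python snippet prints the bit patterns of a short word and of a toy image. The helper name `to_bits` and the four pixel values are illustrative assumptions, not anything drawn from the projects discussed here.

```python
def to_bits(data: bytes) -> str:
    """Render a byte sequence as space-separated groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in data)


# A text "file": a word encoded as UTF-8 bytes.
text_bytes = "digitus".encode("utf-8")

# A toy 2x2 grayscale "image": one byte per pixel (hypothetical values).
image_bytes = bytes([0, 255, 255, 0])

print(to_bits(text_bytes))
print(to_bits(image_bytes))
```

Both calls go through the same function: at this level nothing distinguishes the word from the picture except how a program chooses to interpret the bits.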
What, then, does it mean to digitize a book? The answer depends crucially on another question: what is a library?
From one perspective, a library is a hoard of physical artifacts whose principal function is to be looked at. Seen from that perspective, digitization is an image-making activity: rendering surfaces on which drawings and inscriptions appear into sequences of bits that a computer can use to produce a reflection of that surface. From another perspective, a library is a gathering of texts whose principal function is to be read, and from this perspective digitization is a linguistic activity: rendering words into sequences of bits that a computer can use to create linguistic symbols that can be analyzed and compared. It is the scholar’s privilege to regard the library from the latter perspective; it is the librarian’s burden to view it from the former.
The librarian’s concern with preservation and access puts a premium on creating large collections of digitally photo-duplicated holdings, with the same kind of metadata they’ve always created: information that facilitates maintenance of, and access to, institutional assets. I think “digital humanists” want more than that, or soon will. Perseus is a digital library, and not a digitized library, because its resources are not simply digital photo-duplications or transcriptions. They are encoded texts co-existing in a software environment that enables programs and people to interactively search, query, analyze, combine, and compare objects of investigation.
Perseus self-identifies as a digital library. The MJP is a digital library, too. There are lots of opinions and definitions out there, but I’d venture to say that a digital library is a resource whose mission is to support inquiry, while a digital humanities project is an endeavor whose mission is to answer a question, or make a claim, or engage in an argument. The distinction is not a clean one, but it is probably important for funding agencies and tenure committees, whose mission is to encourage work in particular directions.
Roxanne: There is an ongoing debate in digital humanities librarianship surrounding the idea of service, with some librarians pushing back against the sort of support role in digital projects connoted by the term. In your experience, how have librarians contributed to digital projects and what expertise do they offer that might not be readily apparent to scholars from other disciplines?
Clifford: That debate seems to be, not only ongoing, but also, like so much “debate” these days, drenched in epinephrine and vitriol! As I said earlier, I’m suspicious of identities, so I don’t really know what a “librarian” is. I think people who are able to think systematically about information; who are familiar with evolving standards and know the state of the various arts involved in creating digital projects; who are responsible for imparting a long-term perspective on intellectual work—the people with this knowledge are important partners in digital humanities projects. Are those people called “librarians?”
Roxanne: Fair enough. Let’s talk about your current work on the Blue Mountain Project.
Clifford: Blue Mountain is an ongoing project centered on the creation of granular, machine-readable facsimiles of avant-garde magazines for use by traditional scholars and students and also by researchers in the digital humanities. A grant from the NEH has enabled us to digitize 34 titles comprising approximately 53,000 pages in twelve languages, and we continue to add more. This collection presents a cross-pollination of image and text, a mix of genres and forms (poetry, prose, drama, music, opinion, advertisement), and diverse print spaces where conventions and reading habits are challenged by bold experimentation with typography and graphic design.
Blue Mountain, however, intends to be more than a simple digital library. The digital humanities have highlighted the need for other forms of discovery and access, beyond simple string searches. More and more, humanities scholars want to work with digitized materials at a higher degree: they want to explore relationships among named entities like people, places, things, dates; many want to be able to use extant tools like syntactic parsers, automatic translators, and statistical analysis programs; others, along with computational linguists, and computer and information scientists, want access to large text corpora to which they can apply their own algorithms. Blue Mountain has been designed to support the needs both of traditional humanities scholars and of this new breed of researcher.
To expand on the mountain metaphor a bit: Blue Mountain has four essential faces:
- The Archival Face: Blue Mountain is a curated repository of high-quality page images and high-quality metadata housed in a major research library.
- The Digital Library Face: Like most contemporary digital libraries, Blue Mountain provides a searchable and browsable catalog of its resources and a web-based graphical interface that enables users to look at the page images.
- The Research Face: Much of Blue Mountain’s value lies beneath the surface, in its richly encoded data (machine-readable transcriptions) and metadata (library-standard data about titles, authors, places of publication and the like). Blue Mountain currently makes all its data and metadata freely available, in raw form, on the popular web-based hosting service GitHub.
- The Community Face: Thematic research collections like Blue Mountain attract the interest of researchers, students, teachers, librarians, and archivists from many disciplines. In addition to providing access to resources, projects like Blue Mountain can function as hubs for collaborative research and integrated collection growth.
This multi-faceted design makes Blue Mountain a modern resource useful for the broadest range of scholars working in traditional area studies, modernist studies, periodical studies, library and archival science, as well as the digital humanities.
Blue Mountain’s data is already available on GitHub, for anyone who wants it.
The future of computer-assisted research in the humanities lies not in monolithic tools and websites but in networks of programs and data that can work together over the World Wide Web. Thus Blue Mountain is working not only to develop a set of tools that allow researchers and students to exploit its data, but also to develop Blue Mountain itself as a service-oriented framework. Such frameworks of program-accessible interfaces and standard data-interchange formats are the foundation of the emerging semantic web, and will lay the groundwork for realizing the full potential of avant-garde periodical materials in the context of the next generation of networked digital libraries.
So while we continue to add new titles to our collection, we’ve also begun to develop a couple of new technologies. One is a set of custom-made tools for visualization and analysis we call Blue Mountaineer (for exploring Blue Mountain); the other is a set of APIs we’re calling Blue Mountain Springs, because they will make Blue Mountain an abundant source of clean data that can be poured into tools outside our project.
Roxanne: Can you tell us about the infrastructure decisions your team made?
Clifford: I think the most important choice we made was to focus on the data, independent of its delivery. There are lots of catalog-searching, page-turning applications out there, with more coming along all the time. Rather than binding ourselves to one platform, we’ve chosen to encode our data and metadata using bona fide and de facto standard schemas like MODS, METS, ALTO, and TEI.
The most expensive and valuable part of Blue Mountain is the human labor that goes into creating, revising, and extending its underlying encodings, and by choosing to capture that work in standard forms, we think that we are greatly extending its useful lifetime and improving the return on investment.
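To illustrate why standard encodings outlive any one delivery platform, here is a minimal sketch that reads words and their page coordinates from an ALTO fragment using only the Python standard library. The sample XML is invented for illustration and is not taken from Blue Mountain's data; the ALTO version 2 namespace is an assumption, and the function name `words_with_positions` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: ALTO schema version 2.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Invented sample fragment: two words on one text line.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="avant" HPOS="100" VPOS="200" WIDTH="90" HEIGHT="30"/>
            <SP/>
            <String CONTENT="garde" HPOS="170" VPOS="200" WIDTH="95" HEIGHT="30"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def words_with_positions(alto_xml: str):
    """Return (word, horizontal position, vertical position) for each String."""
    root = ET.fromstring(alto_xml)
    return [
        (s.get("CONTENT"), float(s.get("HPOS")), float(s.get("VPOS")))
        for s in root.iterfind(".//alto:String", ALTO_NS)
    ]


for word, hpos, vpos in words_with_positions(SAMPLE_ALTO):
    print(word, hpos, vpos)
```

Because the transcription and the word positions live in the encoding rather than in a viewer, any tool that can parse XML can reuse them, whether for search, text analysis, or highlighting words on a page image.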
Roxanne: What do you see as some of the more exciting directions for digital libraries and digital humanities to take in the future?
Clifford: I think the semantic web offers exciting possibilities. I think we’ll see a renewed interest in expert systems—programs that reason about knowledge based on rules and inferences—that interact with users and data to perform complex tasks of discovery and analysis over the vast bodies of data we are now encoding in digital libraries. The challenges are enormous, but we’ve learned a lot since the AI winter.
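A rough sketch of what "reasoning about knowledge based on rules and inferences" can look like in miniature: facts stored as subject-predicate-object triples, plus one rule applied by forward chaining until nothing new can be derived. The facts, the predicate names, and the rule are all hypothetical, chosen only to show the mechanism, and this is far simpler than a real expert system or semantic-web reasoner.

```python
# Hypothetical facts as (subject, predicate, object) triples.
FACTS = {
    ("Author A", "contributed_to", "Magazine M"),
    ("Magazine M", "published_in", "City C"),
}


def contributor_active_in(facts):
    """Rule: whoever contributed to a magazine was active in its city."""
    derived = []
    for person, p1, magazine in facts:
        if p1 != "contributed_to":
            continue
        for subject, p2, city in facts:
            if p2 == "published_in" and subject == magazine:
                derived.append((person, "active_in", city))
    return derived


def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new facts are derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for new_fact in rule(known):
                if new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known


for fact in sorted(forward_chain(FACTS, [contributor_active_in])):
    print(fact)
```

The inferred triple linking the author to the city was never entered by hand; it follows from the rule. Scaling that idea to the volume and messiness of real library data is where the enormous challenges mentioned above come in.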
Minor updates to Clifford Wulfman’s byline at the top of the post were made on Nov. 3, 2014.
This work is licensed under a Creative Commons Attribution 4.0 International License.