RECOMMENDED: The Joy of Topic Modeling

Matt Burton, a graduate student at the University of Michigan School of Information, provides an accessible introduction to topic modeling. Aimed at beginners (though useful for everyone), the article unpacks the meaning of terms used in topic modeling, such as model, word, document, topic, tokenization, and stemming. For example:

At the start of any text mining adventure, the natural sequences of words, the sentences and paragraphs of written documents are broken up via a process called tokenization. Individual words become unigrams or individually unique tokens. Tokens are not always equivalent to words because the tokenization process may count two or more words together as a single token, creating what are called bigrams or ngrams. For example, the words “digital humanities” could be a bigram or two individual unigrams, “digital” and “humanities.” Tokenization is more of an art than a science, it requires subjective decisions as well as domain understanding of the texts being processed.
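To make the distinction between unigrams and bigrams concrete, here is a minimal sketch in Python. It is not taken from Burton's article: the function names `tokenize` and `ngrams`, the sample sentence, and the choice to lowercase and split on non-alphanumeric characters are all illustrative assumptions, and they are exactly the kind of subjective decisions the passage describes.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (unigrams).

    Assumption: a token is any run of ASCII letters or digits. Real projects
    often make different choices about case, punctuation, and hyphens.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into a single token."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, used only for illustration.
unigrams = tokenize("Digital humanities scholars use topic models.")
bigrams = ngrams(unigrams, 2)

print(unigrams)  # "digital" and "humanities" appear as separate tokens
print(bigrams)   # "digital humanities" appears as one token
```

Whether "digital humanities" should be counted as one token or two depends on the corpus and the research question, which is why the choice is left to the `n` parameter here rather than fixed in the code.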

Burton also describes the pros and cons of different types of generative topic models, and grounds his discussion in a topic model built from the text of 10 posts that were featured on Digital Humanities Now.

Author: Zach Coble

Zach is the Digital Scholarship Specialist at New York University.