Matt Burton, graduate student at the University of Michigan School of Information, provides an accessible introduction to topic modeling. Aimed at beginners (though useful for everyone), the article unpacks the meaning of the terms used in topic modeling, such as model, word, document, topic, tokenization and stemming. For example,
At the start of any text mining adventure, the natural sequences of words, the sentences and paragraphs of written documents are broken up via a process called tokenization. Individual words become unigrams or individually unique tokens. Tokens are not always equivalent to words because the tokenization process may count two or more words together as a single token, creating what are called bigrams or ngrams. For example, the words “digital humanities” could be a bigram or two individual unigrams, “digital” and “humanities.” Tokenization is more of an art than a science, it requires subjective decisions as well as domain understanding of the texts being processed.
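The unigram/bigram distinction Burton describes can be illustrated with a minimal sketch (this example is not from Burton's article; it uses only whitespace splitting, whereas real tokenizers make the domain-specific decisions he mentions):

```python
def tokenize(text):
    """Split text into lowercase unigram tokens (a deliberately naive tokenizer)."""
    return text.lower().split()

def ngrams(tokens, n):
    """Group adjacent tokens into n-grams, e.g. n=2 yields bigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The digital humanities community")
# tokens -> ['the', 'digital', 'humanities', 'community']
bigrams = ngrams(tokens, 2)
# bigrams -> [('the', 'digital'), ('digital', 'humanities'), ('humanities', 'community')]
```

Here "digital humanities" surfaces either as two unigrams or as the single bigram `('digital', 'humanities')`, depending on which representation the analyst chooses, which is exactly the kind of subjective decision Burton flags.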
Burton also describes the pros and cons of different types of generative topic models, and grounds his discussion in a topic model built from the text of 10 posts featured on Digital Humanities Now.
This post was produced through a cooperation between Trevor Muñoz, Kristen Andrews, and Elizabeth Lorang (Editors-at-Large for the week), Zach Coble (dh+lib review Editor for the week), and Caro Pinto and Roxanne Shirazi (dh+lib review Editors).