The Programming Historian recently published a new lesson, Clustering and Visualising Documents using Word Embeddings. Developed by Jonathan Reades and Jennie Williams, this lesson “uses word embeddings and clustering algorithms in Python to identify groups of similar documents in a corpus of approximately 9,000 academic abstracts. It will teach you the basics of dimensionality reduction for extracting structure from a large corpus and how to evaluate your results.”
Part of a special series in partnership with Jisc and The National Archives, the lesson includes background information, a case study, instructions in dimensionality reduction, hierarchical clustering, and validation, and a bibliography that lists other relevant tutorials. It is rated as high difficulty.
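For readers who want a feel for the core idea before starting the lesson, here is a minimal sketch of hierarchical (agglomerative) clustering on document vectors. It is not the lesson's own code: the four three-dimensional "embeddings" are made-up values, the names `docs`, `cosine_distance`, `average_linkage`, and `agglomerate` are hypothetical, and the choice of cosine distance with average linkage is an assumption made for illustration. It also skips dimensionality reduction and validation, which the lesson covers on a real corpus.

```python
import math
from itertools import combinations

# Toy "document embeddings" (invented values, not taken from the lesson's corpus).
docs = {
    "abstract_a": [0.9, 0.1, 0.0],
    "abstract_b": [0.8, 0.2, 0.1],
    "abstract_c": [0.1, 0.9, 0.2],
    "abstract_d": [0.0, 0.8, 0.3],
}

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def average_linkage(c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    dists = [cosine_distance(docs[a], docs[b]) for a in c1 for b in c2]
    return sum(dists) / len(dists)

def agglomerate(names, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        # combinations() yields index pairs with i < j, so deleting j is safe.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

clusters = agglomerate(sorted(docs), n_clusters=2)
print(clusters)
```

With these toy vectors, the two abstracts that point in roughly the same direction end up in the same group. On a corpus of thousands of abstracts, a library implementation would be the practical choice over this quadratic loop.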
dh+lib Review
This post was produced through a cooperation between Basia Kapolka, Miki Derdun, Jina FuVernay, Rachel Hogan, Gauri Jhangiani, Mak Jones, Jennifer Matthews, Mimosa Shah, Michelle Speed, Olivia Staciwa, David Sye, and Mark Szarko (Editors-at-large for the week), Hillary Richardson and Pamella Lach (Editors for the week), Claudia Berger, Nickoal Eichmann-Kalwara, Linsey Ford, John Russell, and Rachel Starry (dh+lib Review Editors).