Topic modeling – a particular use case of unsupervised text classification
Topic modeling is an unsupervised ML technique that’s used to discover abstract topics or themes within a large collection of documents. It assumes that each document can be represented as a mixture of topics, and each topic is represented as a distribution over words. The goal of topic modeling is to find the underlying topics and their word distributions, as well as the topic proportions for each document.
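To make the "mixture of topics" idea concrete, here is a minimal sketch in plain Python. The vocabulary, the two topic names, and every probability are made-up illustrative values, not the output of a trained model. It only shows how a document's topic proportions and the topics' word distributions combine into a single word distribution for that document.

```python
# Hypothetical toy numbers, chosen only for illustration (not learned from data).
vocabulary = ["goal", "match", "vote", "election"]

# Each topic is a probability distribution over the vocabulary (each row sums to 1).
topic_word = {
    "sports":   [0.50, 0.40, 0.05, 0.05],
    "politics": [0.05, 0.05, 0.40, 0.50],
}

# Each document is a mixture of topics (the proportions sum to 1).
doc_topic = {"sports": 0.7, "politics": 0.3}


def word_distribution(doc_topic, topic_word):
    """Weight each topic's word distribution by the document's topic proportion and sum."""
    n_words = len(next(iter(topic_word.values())))
    mixed = [0.0] * n_words
    for topic, weight in doc_topic.items():
        for i, p in enumerate(topic_word[topic]):
            mixed[i] += weight * p
    return mixed


doc_word = word_distribution(doc_topic, topic_word)
for word, p in zip(vocabulary, doc_word):
    print(f"{word}: {p:.3f}")
```

A topic model works in the opposite direction: it observes only the words in each document and tries to infer plausible values for both `topic_word` and `doc_topic`.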
There are several topic modeling algorithms, but one of the most popular and widely used is latent Dirichlet allocation (LDA). We will discuss LDA in detail, including its mathematical formulation.
LDA
LDA is a generative probabilistic model that assumes the following generative process for each document:
- Choose the number of words in the document.
- Choose a topic distribution (θ) for the document from a Dirichlet distribution with parameter α.
- For each word in the document, do the following...