Topic Modeling and Latent Dirichlet Allocation
Updated: Apr 28, 2020
Humans are good at understanding topics in a document. When we read a document, we can identify its underlying topics. We can detect context, understand the meaning of words, and pick up on similarities. Machines have a much harder time understanding the hidden meanings of texts. Let's start with the following example:
Mithila was on fire last night. She totally destroyed the other team in the singing competition!
To a human, it’s probably obvious what this sentence implicitly means by using words like 'on fire' or 'destroyed'. However, machines tend to take things a bit too 'literally' and do not understand the hidden meanings, expressions, or emotions used in human languages! This is because computers are bad at working with unstructured data, for which there are no standardized processing techniques. Natural language processing (NLP) is a field that enables computers to understand and process human languages as close as possible to a human-level understanding of language. Latent Dirichlet allocation, or the LDA topic model (Blei et al., 2003), is one of the popular NLP models that tries to solve this problem.
In this blog post, I will be presenting a high-level introduction to LDA topic modeling.
How Topic Models Work:
Topic models have some basic assumptions:
Documents are probability distributions over latent topics.
Topics are probability distributions over words.
Each document is a mixture drawn from a fixed number of topics.
Image source: Blei, D. M. (2012), Probabilistic topic models.
The models are built around the idea that the semantics of our documents are actually governed by some hidden, or “latent,” variables that we do not observe. Topic modeling finds the groups of words that best “fit” a corpus and uncovers the latent topics that represent each document. There are several methods for topic modeling. The most popular among them is the latent Dirichlet allocation model.
LDA Topic Modeling: “Plate representation”
Every topic model has the following generative process:
Image source: Wikipedia
1) First, we sample a document (d) from a collection of documents (M).
2) Then a topic (c) is selected from the collection of topics under that document. A single document has a total of N words, and each of these words is generated by a specific topic.
3) Next, a word (w) is selected from the collection of words that represent that topic (c).
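The three steps above can be sketched with plain NumPy. The topic count, vocabulary size, and hard-coded distributions below are illustrative assumptions, not values from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 3 topics over a 5-word vocabulary.
# Each row is one topic's word distribution (rows sum to 1).
topic_word = np.array([
    [0.60, 0.20, 0.10, 0.05, 0.05],  # topic 0
    [0.05, 0.60, 0.20, 0.10, 0.05],  # topic 1
    [0.05, 0.05, 0.10, 0.20, 0.60],  # topic 2
])

# Topic mixture for a single sampled document (step 1).
doc_topic = np.array([0.7, 0.2, 0.1])

# Steps 2-3: for each of N word slots, pick a topic, then a word from it.
N = 10
topics = rng.choice(3, size=N, p=doc_topic)
words = [rng.choice(5, p=topic_word[z]) for z in topics]
```

Each generated word thus carries a latent topic label; the model's job is to reverse this process from the observed words alone.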
Here is a plate representation of the more advanced LDA topic modeling setup:
Image source: Wikipedia
The LDA model requires two hyperparameters, α and 𝛽. α is the shape parameter of the Dirichlet prior for the per-document topic distribution. Picture three topics as the corners of a triangle (the simplex of topic proportions). At α=1, any point on the surface of the triangle is fair game (in other words, the topic mixtures are uniformly distributed). For α>1, the samples concentrate in the center of the triangle, representing an even mixture of all the topics. At low values (α<1), most of the topic-distribution samples fall in the corners (near individual topics), so documents will likely contain less of a mixture of topics. For each of the M documents, we draw a random sample representing that document's topic distribution from a Dirichlet distribution, Dir(α).
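A quick way to see the effect of α is to sample from a three-topic Dirichlet at a few values and measure how much probability mass lands on the single largest topic; the specific α values and sample count here are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
K = 3  # three topics -> samples live on a triangle (2-simplex)

for alpha in (0.1, 1.0, 10.0):
    samples = rng.dirichlet([alpha] * K, size=2000)
    # Mean fraction of probability mass on the single largest topic:
    # near 1.0 -> samples sit in the corners; near 1/K -> even mixtures.
    peakiness = samples.max(axis=1).mean()
    print(f"alpha={alpha:>4}: mean max component = {peakiness:.2f}")
```

Low α yields peaky, corner-hugging samples (documents dominated by one topic), while high α pulls samples toward the center (even mixtures).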
The document-topic distribution θ depends on the value of α, where, θ~Dirichlet(α). From θ, we are going to select a particular topic Z based on the distribution.
The 𝛽 hyperparameter controls the distribution of words per topic. It is the parameter of the Dirichlet prior for the per-topic word distribution, φ. For lower values of 𝛽, topics will likely contain fewer word varieties. We choose the word W from the word distribution φ of the topic Z, and this φ itself comes from a Dirichlet distribution: φ~Dir(𝛽).
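Putting θ~Dir(α), the topic choice Z, φ~Dir(𝛽), and the word choice W together, the full LDA generative process can be sketched as follows; all dimensions and hyperparameter values are made-up toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: M documents, K topics, V vocabulary words, N words each.
M, K, V, N = 4, 3, 8, 20
alpha, beta = 0.5, 0.1  # hypothetical hyperparameter values

# phi: one word distribution per topic, each drawn from Dir(beta).
phi = rng.dirichlet([beta] * V, size=K)  # shape (K, V)

corpus = []
for _ in range(M):
    theta = rng.dirichlet([alpha] * K)            # theta ~ Dir(alpha), per document
    z = rng.choice(K, size=N, p=theta)            # topic Z for each word slot
    w = [rng.choice(V, p=phi[zi]) for zi in z]    # word W drawn from phi[Z]
    corpus.append(w)
```

Inference in LDA runs this process in reverse: given only the words in `corpus`, it estimates the θ and φ distributions most likely to have generated them.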
Knowing the Building Blocks of LDA Topic Modeling:
Two output tables represent the building blocks of the LDA topic modeling process: 1) the term-topic matrix, which breaks topics down in terms of their word components, and 2) the document-topic matrix, which describes documents in terms of their topics.
a) Term-topic matrix: The columns correspond to topics and the rows to terms. A value in a cell of the matrix represents the probability of that word belonging to a particular topic.
b) Document-topic matrix: The columns correspond to topics and the rows to documents. The values in the cells indicate how much each topic "belongs" to each document. A document can be a mixture of different topics.
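As a rough sketch of how these two matrices can be obtained in practice, scikit-learn's LatentDirichletAllocation exposes the term-topic matrix via `components_` and the document-topic matrix via `transform`; the tiny corpus and topic count below are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the singing competition",
    "the singer performed a new song",
    "the model learns topics from documents",
    "topic models describe documents with word distributions",
]

# Bag-of-words counts: one row per document, one column per vocabulary term.
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Term-topic matrix: components_ has one row per topic, one column per term;
# normalizing each row gives per-topic word probabilities.
term_topic = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Document-topic matrix: one row per document, one column per topic.
doc_topic = lda.transform(X)
```

With a corpus this small the learned topics are not meaningful, but the shapes and interpretation of the two matrices match the description above: rows of `term_topic` and rows of `doc_topic` are probability distributions that sum to one.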
In the context of machine learning and NLP, LDA topic modeling is described as a method of uncovering hidden semantic structures in a text body, or corpus. It tries to figure out a "recipe" for how each document in the corpus could have been created. We just need to tell the model how many topics to construct, and it uses that recipe to generate topic and word distributions over the corpus.