LDA - Latent Dirichlet Allocation
Note: This summary is for the "two" papers. For simplicity, I follow the papers and use terms like "words" and "documents" as if we were dealing explicitly with text-domain IR, but the inference is based on probability, and the applicability of these methods is not confined to the text domain.
Terms used:
Word: w, a V-dimensional vector, where V is the vocabulary size. Exactly one dimension equals 1 and all other dimensions equal 0.
Document: d, consists of N words w, where N is different for each document.
Corpus: D, consists of M documents d.
Topic: z, a hidden variable; each dimension represents a "topic". The number of dimensions is assumed to be a fixed value K.
Word simplex: a (V-1)-dimensional simplex; each point represents a particular probability distribution over all words.
Latent semantic simplex: a (K-1)-dimensional sub-simplex of the word simplex, spanned by K latent semantic vectors.
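To make the terms above concrete, here is a minimal sketch of a one-hot word vector and a point on the word simplex. The three-word vocabulary and the helper names `one_hot` and `on_simplex` are made up for illustration; they do not come from the papers.

```python
def one_hot(index, vocab_size):
    """Return the V-dimensional indicator vector for the word at `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def on_simplex(p, tol=1e-9):
    """True if p is a probability distribution: non-negative and summing to 1."""
    return all(x >= 0 for x in p) and abs(sum(p) - 1.0) <= tol


# Hypothetical vocabulary, so V = 3 and the word simplex is 2-dimensional.
vocab = ["apple", "banana", "cherry"]

# The word "banana" as a one-hot vector.
w = one_hot(vocab.index("banana"), len(vocab))

# One point in the word simplex: a distribution over the three words.
point = [0.5, 0.3, 0.2]
```

A one-hot word vector is itself a corner of the word simplex, so `on_simplex(w)` holds as well.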
Problem:
Given a corpus, we want to
- Find a good representation for the documents in the corpus
- Model the corpus
Important assumptions:
- Words in a document are exchangeable: their order does not matter, or, put differently, the words in a document are "generated" independently given the parameters of the model
- Documents in the corpus are exchangeable
Model:
Unigram model - The entire corpus is modeled by one single distribution. The corpus is represented by one single point in the word simplex, with every dimension treated as equally significant. Each document is also modeled by one single point in the word simplex, and all the documents are centered around the point representing the corpus.
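As a rough sketch of the unigram model, the single point representing the corpus can be estimated from relative word frequencies, and every document is then scored against that same point. The toy corpus and the function names `fit_unigram` and `log_likelihood` are assumptions for illustration only.

```python
import math
from collections import Counter


def fit_unigram(corpus):
    """Estimate one word distribution for the whole corpus (relative frequencies)."""
    counts = Counter(word for doc in corpus for word in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def log_likelihood(doc, model):
    """Log-probability of a document when every word is drawn independently.

    No smoothing is applied, so a word absent from `model` raises KeyError.
    """
    return sum(math.log(model[word]) for word in doc)


# Hypothetical toy corpus of M = 3 documents.
corpus = [["a", "b", "a"], ["b", "c"], ["a"]]
model = fit_unigram(corpus)
score = log_likelihood(["a", "a"], model)
```

Because the order of the words never enters `log_likelihood`, the sketch also reflects the exchangeability assumption.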
Cluster model - The corpus is split into several "topics", and each topic has its own distribution. The corpus is represented by several points in the word simplex. Each document is assigned to one single topic, and documents belonging to the same topic are either represented by the same point or by points centered around the topic in the word simplex.
pLSI - Rather than lying anywhere in the word simplex, the corpus is assumed to lie in the latent semantic simplex. This sub-simplex is spanned by K latent semantic vectors, where each latent semantic vector is a distribution over words that represents a set of words sharing the same "latent semantic meaning". Note that the latent semantics are determined statistically rather than semantically. Each document is represented by a point in the latent semantic simplex. This model is superior to the previous models because:
- It captures the correlation between different words
- The latent semantic simplex usually has a much lower dimension than the word simplex
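The geometric picture of pLSI can be sketched as follows: a document's word distribution is a weighted mixture of the K latent semantic vectors, p(w | d) = sum over z of p(z | d) p(w | z), so it always lands inside the sub-simplex they span. The numbers and the function name `mix` below are invented for illustration, with K = 2 and V = 4.

```python
def mix(topic_word, doc_topic):
    """Compute p(w | d) = sum_z p(z | d) * p(w | z) for every word w."""
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    return [
        sum(doc_topic[z] * topic_word[z][w] for z in range(num_topics))
        for w in range(vocab_size)
    ]


# Hypothetical latent semantic vectors: each row is p(w | z), a point
# in the word simplex.
topic_word = [
    [0.7, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]

# Hypothetical document: its coordinates p(z | d) in the latent semantic simplex.
doc_topic = [0.25, 0.75]

# The same document expressed as a point in the word simplex.
p_w = mix(topic_word, doc_topic)
```

The document is described by K - 1 = 1 free number instead of V - 1 = 3, which is the dimensionality reduction mentioned in the list above.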
LSI -
It is similar to pLSI. It identifies the structure between words in the corpus, that is, the latent semantic vectors, and casts the problem from the word space into the latent semantic space. It is outperformed by pLSI, probably because it uses a less informative distance measure in model assessment. It does not, however, need a pre-defined dimension for the latent semantic space.
LDA -
Following the concept of the latent semantic simplex in pLSI, the corpus is modeled as a distribution over the latent semantic space. It is an extension of pLSI in that the structure of the corpus in the latent semantic simplex is captured by a probabilistic model. Because pLSI models each document as its own point in the latent semantic simplex, it suffers from too many unseen variables (one point to fit per document) and hence from overfitting. LDA resolves this.
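The generative view of LDA can be sketched roughly as below: each document draws its own point in the latent semantic simplex from a Dirichlet distribution, then draws a topic and a word for every position. This is only an illustration of the sampling process, not of inference; the parameter values and the function names `sample_dirichlet` and `generate_document` are assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(alpha, topic_word, n_words, rng):
    """Generate one document as a list of word indices.

    theta is the document's topic mixture, i.e. its point in the
    latent semantic simplex, drawn from Dirichlet(alpha).
    """
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topic_word)))
    word_ids = list(range(len(topic_word[0])))
    doc = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]         # pick a topic
        w = rng.choices(word_ids, weights=topic_word[z])[0]  # pick a word from it
        doc.append(w)
    return theta, doc


# Hypothetical parameters: K = 2 topics over a vocabulary of V = 4 words.
alpha = [0.5, 0.5]
topic_word = [
    [0.7, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]
theta, doc = generate_document(alpha, topic_word, 20, random.Random(0))
```

Only `alpha` and `topic_word` are corpus-level parameters here; `theta` is sampled per document rather than fitted, which is how the model avoids a parameter count that grows with the number of documents.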