spooky's blog: pLSI & LDA

pLSI - Probabilistic Latent Semantic Indexing
LDA - Latent Dirichlet Allocation

Note: This summary covers the two papers on pLSI and LDA. For simplicity, I follow the papers and use terms like "words" and "documents" as if we were dealing explicitly with text-domain IR, but the inference is purely probabilistic, so the methods are not confined to the text domain.

Terms used:
Word: w, a V-dimensional vector in which exactly one component equals 1 and all other components equal 0 (V is the vocabulary size).
Document: d, consisting of N words w, where N can differ from document to document.
Corpus: D, consisting of M documents d.
Topic: z, a hidden variable; each dimension represents a "topic". The number of dimensions is assumed to be a fixed value K.

Word simplex: a (V - 1)-dimensional simplex in which each point represents a particular probability distribution over all words.
Latent semantic simplex: a (K - 1)-dimensional sub-simplex of the word simplex, spanned by K latent semantic vectors.
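
To make the notation concrete, here is a minimal sketch of the one-hot word representation and of how a document maps to a point in the word simplex. The vocabulary and the document are made up for illustration; they do not come from the papers.

```python
import numpy as np

# Hypothetical toy vocabulary (V = 4); the words are invented for illustration.
vocab = ["topic", "model", "word", "simplex"]
V = len(vocab)
word_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the V-dimensional indicator vector for a word."""
    vec = np.zeros(V, dtype=int)
    vec[word_index[word]] = 1
    return vec

# A document d is a sequence of N words; N may differ between documents.
doc = ["topic", "model", "topic", "word"]
one_hot_rows = np.stack([one_hot(w) for w in doc])  # shape (N, V)

# Summing the one-hot rows gives the bag-of-words count vector.
counts = one_hot_rows.sum(axis=0)

# Normalizing the counts gives a probability distribution over words,
# i.e. a point in the (V - 1)-dimensional word simplex.
empirical = counts / counts.sum()
```

Because the word order is discarded when the rows are summed, this representation already relies on the exchangeability assumption discussed further down.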

Problem:
Given a corpus, we want to

Find a good representation for the documents in the corpus
Model the corpus

Important assumption:

Words in a document are exchangeable: their order does not matter, so they can be treated as "generated" independently given the document's latent parameters.
Documents in the corpus are exchangeable.

While the first assumption seems counterintuitive, these assumptions lead to the mixture models in pLSI and LDA. Those models introduce the latent semantic variable z, which matches the human intuition that some words are correlated rather than independent. This intuition was previously captured by LSI using SVD, and the approaches turn out to be mathematically similar.

Model:
Unigram model - The entire corpus is modeled by one single distribution. The corpus is represented by a single point in the word simplex, with every dimension treated as equally significant. Each document is also modeled by a single point in the word simplex, and all the documents cluster around the point representing the corpus.
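
As a rough sketch of the unigram model, the corpus point can be estimated by pooling all word counts and normalizing, and a document is then scored as a product of independent word probabilities. The count matrix below is a hypothetical toy corpus, and `unigram_log_likelihood` is a helper name introduced here for illustration.

```python
import numpy as np

# Hypothetical toy corpus as word-count vectors (M = 3 documents, V = 4 words).
corpus_counts = np.array([
    [2, 1, 1, 0],
    [0, 3, 0, 1],
    [1, 0, 2, 2],
])

# Maximum-likelihood unigram distribution: pooled counts, normalized.
# This is the single point in the word simplex that represents the corpus.
p_w = corpus_counts.sum(axis=0) / corpus_counts.sum()

def unigram_log_likelihood(doc_counts, p_w):
    """log p(d) = sum_w n(d, w) * log p(w), ignoring the multinomial coefficient."""
    mask = doc_counts > 0  # skip words that do not occur in the document
    return float(np.sum(doc_counts[mask] * np.log(p_w[mask])))

log_liks = [unigram_log_likelihood(d, p_w) for d in corpus_counts]
```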

Cluster model - The corpus is split into several "topics", each with its own distribution. The corpus is represented by several points in the word simplex. Each document is assigned to exactly one topic, and documents belonging to the same topic are either represented by the same point or by points centered around the topic's point in the word simplex.
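
The cluster model can be sketched as a mixture of unigrams: pick one topic for the whole document, then draw every word from that topic's distribution. The numbers in `pi` and `beta` are assumed values chosen for illustration, not estimates from any data.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 2, 4
pi = np.array([0.6, 0.4])            # p(z): how common each topic is (assumed)
beta = np.array([                    # beta[k] = p(w | z = k) (assumed)
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])

def sample_document(n_words):
    """Draw ONE topic for the document, then all words from that topic."""
    z = rng.choice(K, p=pi)
    words = rng.choice(V, size=n_words, p=beta[z])
    return z, words

def document_likelihood(words):
    """p(d) = sum_z p(z) * prod_n p(w_n | z)."""
    per_topic = np.array([np.prod(beta[k][words]) for k in range(K)])
    return float(pi @ per_topic)

z, words = sample_document(5)
lik = document_likelihood(words)
```

The single topic draw per document is the restriction that pLSI and LDA relax: there, each word may come from a different topic.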

pLSI - Rather than lying anywhere in the word simplex, the corpus is assumed to lie in the latent semantic simplex. This sub-simplex is spanned by K latent semantic vectors, each of which represents a set of words that share the same "latent semantic meaning". Note that the latent semantics are determined statistically rather than semantically. Each document is represented by a point in the latent semantic simplex. This model is superior to the previous models because:

It captures the correlation between different words
The latent semantic simplex usually has a much lower dimension than the word simplex
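
In pLSI the word distribution of a document is a mixture, p(w|d) = sum over z of p(w|z) p(z|d), so each document is a convex combination of the K latent semantic vectors. Below is a minimal sketch of fitting this with plain EM on a count matrix. The pLSI paper uses a tempered variant of EM; the function name `plsi_em`, the toy counts, and the fixed iteration count are assumptions made here for illustration.

```python
import numpy as np

def plsi_em(counts, K, n_iter=50, seed=0):
    """Fit pLSI with plain EM.

    counts: (M, V) matrix of word counts n(d, w).
    Returns p(z|d) with shape (M, K) and p(w|z) with shape (K, V).
    """
    rng = np.random.default_rng(seed)
    M, V = counts.shape
    p_z_d = rng.dirichlet(np.ones(K), size=M)   # random start, rows sum to 1
    p_w_z = rng.dirichlet(np.ones(V), size=K)   # random start, rows sum to 1
    eps = 1e-12                                  # guards against division by zero
    for _ in range(n_iter):
        # E-step: p(z | d, w) is proportional to p(z | d) * p(w | z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]          # (M, K, V)
        resp = joint / np.maximum(joint.sum(axis=1, keepdims=True), eps)
        # M-step: re-estimate both factors from expected counts.
        expected = counts[:, None, :] * resp                    # (M, K, V)
        p_w_z = expected.sum(axis=0)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), eps)
        p_z_d = expected.sum(axis=2)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), eps)
    return p_z_d, p_w_z

# Hypothetical toy corpus: two documents lean on words 0-1, two on words 2-3.
counts = np.array([
    [4, 3, 0, 0],
    [3, 5, 0, 1],
    [0, 0, 4, 4],
    [1, 0, 5, 3],
])
p_z_d, p_w_z = plsi_em(counts, K=2)

# Each row of p_z_d is a document's coordinates in the latent semantic simplex;
# mapping back to word space gives p(w | d).
p_w_d = p_z_d @ p_w_z
```

Note that `p_z_d` has one row per training document, which is the source of the parameter growth discussed under LDA.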

LSI -
It is similar to pLSI. It identifies the structure among words in the corpus, that is, the latent semantic vectors, and casts the problem from word space into latent semantic space. It is outperformed by pLSI, probably because it uses a less informative distance measure in model assessment. It does not, however, need a predefined dimension for the latent semantic space.

LDA -
Following the concept of the latent semantic simplex in pLSI, the corpus is modeled as a distribution over the latent semantic space. It extends pLSI in that the structure of the corpus in the latent semantic simplex is itself captured by a probabilistic model. Because pLSI models each training document as a separate point in the latent semantic simplex, it suffers from too many free parameters (the number grows with the number of documents) and hence overfitting. LDA resolves this.
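
A minimal sketch of the LDA generative process, assuming a symmetric Dirichlet prior with an arbitrary value of 0.5 and randomly drawn topic-word distributions: each document draws its own topic mixture from the Dirichlet, then each word draws a topic from that mixture. The last lines compare the parameter counts stated in the LDA paper (kV + kM for pLSI, k + kV for LDA) for this toy setting.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 6, 4
alpha = np.full(K, 0.5)                       # Dirichlet prior over topic mixtures (assumed value)
beta = rng.dirichlet(np.ones(V), size=K)      # beta[k] = p(w | z = k), drawn at random here

def generate_document(n_words):
    """Sample one document from the LDA generative process."""
    theta = rng.dirichlet(alpha)              # document's point in the latent semantic simplex
    z = rng.choice(K, size=n_words, p=theta)  # one topic per word
    words = np.array([rng.choice(V, p=beta[k]) for k in z])
    return theta, z, words

docs = [generate_document(8) for _ in range(M)]

# pLSI keeps a separate topic mixture for every training document,
# so its parameter count grows with M; LDA's does not.
plsi_params = K * V + K * M
lda_params = K + K * V
```

Because theta is drawn from a corpus-level distribution instead of being a free parameter per document, the same procedure also assigns probability to documents that were not in the training set.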

Wednesday, March 23, 2011