spooky's blog: 3月 2011

2011年3月23日星期三

pLSI & LDA

pLSI - Probabilistic Latent Semantic Indexing
LDA - Latent Dirichlet Allocation

note: This summary is for the "two" papers. For simplicity, I will follow these papers using terms like "words" and "documents" in the summary as if we are explicitly dealing with text domain IR, but the inference is based on probability and the applicability of these methods are not confined to text domain.

Term used:
Word: w, dimension V, each word has exactly one dimension equals 1, all other dimensions equal to 0.
Document: d, consist of N word w, where N is different for each document.
Corpus: D, consist of M documents d.
Topic: z, a hidden variable, each dimension represent a "topic". The dimension is assumed to be a fixed value K.

Word simplex: a (V-1) dimensional simplex, each point represents a particular probability distribution of all words.
Latent semantic simplex: a (K - 1) dimensional sub-simplex of word simplex, spanned by K latent semantic vector.

Problem:
Given a corpus, we want to

Find a good representation for the documents in the corpus
Model the corpus

Important assumption:

Words in a document is exchangeable, or say the words in a document are "generated" independently
Documents in the corpus is exchangeable

while the first assumption seems to be anti-intuitive, these assumption leads to the introduction of mixture model in pLSI and LDA, which introduce the latent semantic variable z that coincide with human intuition that some words are correlated rather than being independent. The intuition is previously captured by LSI using SVD, and it turns out that they are mathematically similar.

Model:
Unigram model - The entire corpus is modeled by one single distribution. The corpus is represented by one single point in the word simplex, with each dimension being as significant to each other. Each document is also modeled by one single point in the word simplex, where all the documents center around the point representing the corpus.

Cluster model - The corpus is split into several "topics", each topic has its own distribution. The corpus is represented by several points in the word simplex. Each document is assigned to one single topic, and documents belonging to the same topic are either being represented by the same point or are represented by points center around the topic in the word simplex.

pLSI - Rather than lying in the word simplex, the corpus is assumed to lie in the latent semantic simplex. The sub-simplex is spanned by K latent semantic vector, where each latent semantic vector represent a set of words that share the same "latent semantic meaning". Note the latent semantic is determined statistically rather than semantically. Each document is represented by a point in the latent semantic simplex. This model is superior to the previous model because:

It capture the correlation of different words
Latent semantic simplex usually has much lower dimension than word simplex

LSI -
It is similar to pLSI. It identify the structure between words in the corpus, or say the latent semantic vector, and cast the problem from word space to latent semantic space. It is outperformed by pLSI, probably because it use a less informative distance measure in model assessment. It, however, do not need a pre-definced dimension for latent semantic space.

LDA -
Following the concept of latent semantic simplex in pLSI, the corpus is modeled as a distribution in latent semantic space. It is an extension of pLSI in that the structure of the corpus in latent semantic simplex is captured by a probabilistic model. Because pLSI models each document as a point in latent semantic simplex, it suffers from the problem of too many unseen variable and hence over fitting. This is resolved by LDA.

2011年3月16日星期三

Lost in Binarization: Query-Adaptive Ranking for Similar Image Search with Compact Codes

The combination of local feature like SIFT and the BoW framework has been adopted in many system for image representation. But such representation suffers from:

High dimensionality
A long list of visual word for each image

which makes indexing and searching efficiently a hard problem. Binary embedding is so far the most promising indexing method, comparing with traditional tree-based method, which is not suitable for high dimensional data, and inverted file, which is not as efficient as it is in text domain due to 2. above.

But when performing image search using binary representation, many images will have the same distance to the query one, due to the discrete nature of Hamming distance, and it is unlikely that all these images are as similar to the query as others.The paper proposed a method that can perform fine-grained re-ranking for images with the same hamming distance.

The argument of the paper is that, although every bit in a binary representation has the same importance in Hamming distance, this is not the case for real-world data. Some bits is more discriminative than others, and the importance of each bit may vary in different class of images. According to the argument, the system is designed as followed:

For each semantic class, a "importance", or weight, of each bit in the binary representation is offline calculated. When a query image comes in, a weight for the query will be adaptively generated, based on the weight pre-calculated, and is than used to re-rank results with the same Hamming distance.

To generate the weight of each semantic class, two criteria are considered:

Intra-class similarity -
Image in the same class should remain close together after the re-weighting transform.
This is formulated as:

where ai is the weight of i'th semantic class, ci is the center of the i'th semantic class, and x is an image.
Inter-class relationship -
The distance relationship of each class should remain the same.
This is formulated as:

where sij is the distance between the center of class i and j.

So the problem is formulated as

and can be transformed into a quadratic-programing (QP) like problem

which can be solved iteratively.

As for the weight of query image, it is calculated using a pseudo-relevance feedback method. The weight is the average of the semantic class weight of the top k most similar images. The distance metric used in the system is

Comment:
1. The usage of semantic class is, although somehow reasonable, lack of validation. This is especially true when the dataset is more diverse in sense of semantic class or lack of labeling.

2011年3月9日星期三

Aggregating local descriptors into a compact image representation

This paper has two main contributions

VLAD - A new image descriptor based on local feature -
Based on SIFT local descriptor, VLAD aggregate local descriptors into a fixed-dimension representation for image.
The idea is an outgrow of BOF representation, but it retains the structure of SIFT descriptor. Unlike BOF representation that aggregate SIFT descriptors into a single number using tf-idf manner, VLAD describe each "Visual Word" by accumulating the difference of local descriptors with the centroid of the visual word. Experiment result shows that it outperforms BOF representation in image retrieval when they are comparable in size.
VLAD can also be viewed as a simplification of Fisher kernel, which utilize GMM model to describe "Visual Word".

PCA-ADC approach to "compress" descriptor -
Vector quantization is a widely used approach to reduce the size of descriptors.
ADC approach reduce the vector quantization complexity by dividing a vector into multiple small vectors and quantize each of them separately. The approach is combined with PCA, which is widely used in dimension reduction, to further reduce the data size.
The paper makes a formal analysis about the error caused by these descriptor compression methods, and concluded that the performance loss is acceptable. The analysis is supported by its experiment.

2011年3月2日星期三

Efficient Visual Search of Videos Cast as Text Retrieval

This paper introduces a method to transform the image retrieval problem into text retrieval problem.

Research in text domain retrieval has a long history and a relatively well studied problem, with many promising techniques available. These techniques have been applied to many application, such as search engine, and are proven to be very success. Image retrieval, on the other hand, is a relatively novel problem and is not as mature as text retrieval. So it will be very helpful if we can utilize the technique of text retrieval in image retrieval, which is the aim of this paper.

The key of the paper is to transform low-level visual feature into "Visual Word". Low-level visual features are first obtained by identifying interesting regions and using SIFT feature descriptor to represent these regions. Visual vocabulary are then build by performing clustering, with the center of each cluster being a "Visual Word", and each feature point extracted is cast into a particular "Visual Word". These visual words are analogy to regular words in text, and techniques in text retrieval can then be applied.

After visual words are obtained, they are processed as if they were words in a text, and the structure of the retrieval system follows that of regular text retrieval system:

Building Vocabulary -
- Term weighting using tf-idf
- Stop word removal
Indexing -
- Inverted file
Retrieval Model -
- Vector space model

The main benefit of the proposed method is its time efficiency. With very simple technique, it can greatly reduce the retrieval time while returning results comparable to pure visual matching method.

2011年3月1日星期二

Distinctive Image Features from Scale-Invariant Keypoints

This paper introduces the well-known SIFT local feature, which is a widely used local feature, and has been proven to work well on many application, especially in object recognition and image matching.

SIFT feature contains two main parts, with four stages to extract the feature:

Feature point identification

Scale-space extrema detection -
Identify interesting points by searching scale-space extrema. This is done by finding local minimum in the difference-of-Gaussian convolved image. It can be proved that the minimum is a good approximation of the laplacian of gaussian convovlved image, which has been shown to produce stable feature.
Following computation are done in the particular scale-space, and are thus scale invariant.
Keypoint localization -
The position of interesting points is locate more precisely by finding the exact local minimum point. This is done by expanding the DOG-convolved image around the interesting point find in step1 to second order, and the minimum is located analytically.
The keypoints founded are then filtered. This step is to eliminate poor keypoints that occurs due to edge response. This is done by thresholding the ratio of principal curvature and curvature of perpendicular direction, which will be large on edge.
Orientation assignement -
Each keypoint is then assigned with an principal direction. The direction is determined by creating an orientation histogram about the gradient, with the maximun being the principal direction. If there are multiple peaks in the histogram, a keypoint is created for each direction.
Following calculation are done with the direction of keypoint being aligned, and are thus rotation invariant.

Feature point descriptor

Keypoint descriptor -
The descriptor is an image gradient histogram. Points around the keypoints are divided into a 4*4 array of regions, a gradient histogram with 8 orientation bins is created for each region. Therefore, each keypoint is described by a 4*4*8 = 128 dimensional descriptor, along with the position and orientation.
The descriptor is invariant to affine changes in illumination.

Some desirable properties about SIFT feature make it a successful local feature. These properties includes invariant to rotation, scaling, affine distortion and change in illumination. Also, it extract large numbers of keypoints in general images, which makes it very distinctive.

訂閱：文章 (Atom)

2011年3月23日 星期三