2011年6月8日 星期三

Local Features Are Not Lonely - Laplacian Sparse Coding for Image Classification

Bag of Words (BoW) method is an efficient method for utilizing local feature in image classification and retrieval. Spatial pyramid (SPM) further improve the discriminative power of BoW, resulting in the good on several open data set.
However, according to these result, SPM descriptor with linear classifier doesn't work as well as that with non-linear classifier. Non-linear classifier requires O(n) time complexity even in testing (assuming the number of support vector is O(n)), and thus limit the scalability of system using SPM.

While it is suggest that linear classifier can work as well as non-linear classifier, given the descriptor is discriminative enough, the need for non-linear kernel in SPM is believe to due to the quantization error of BoW. In BoW, each feature point is represented by exactly one "visual word". However, the "coverage" of visual vocabulary is usually limited, since the dictionary size are generally small, so there may exist feature points that can not be well represented using any visual word in the dictionary. An even worse situation is that given a feature point that is about the same distance to two or more visual word, assigning it to one of them is obviously questionable.

A propose solution is to relax the hard assignment constraint, that is, represent a feature point using multiple visual word. Sparse constraint is added, because the dictionary is usually an over complete basis. Another effect of sparse constraint is to limit the visual words used to represent a feature point to those being close to the visual word. While this implicit requirement matches human intuition, it is not necessary the case given the sparse constraint, resulting in large variation in visual vocabulary space given a small distortion in local feature space, which is obviously not a desired property.

To remedy the problem, an additional constraint is added. The constraint is to penalize the difference of sparse code given the original feature are considered to be similar. This constraint explicitly limit the sparse code of similar feature points also being similar. Experimental result shows the distance between sparse codes has higher correlation with the distance of original feature.

Yet another work, "Locality-constrained Linear Coding for Image Classification", use the same concept in different way. Instead of consider the relationship between local features, they consider the distance between a given feature points and visual vocabularies. It penalize using visual vocabulary that is distant to the feature point in feature space. Sparse constraint is removed, because the locality constraint will enforce sparsity, while the reverse is not true.

2011年6月1日 星期三

Fast Concurrent Object Localization and Recognition

Goal - Given a set of candidate objects and an image, one want to simultaneously locate and recognize the object in reasonable time.

Object recognition and object localization are two important problem in computer vision. Existing method, although effective and efficient for a particular problem, can't solve them concurrently effectively as in human vision.

For object localization, the problem for applying existing method is that they handle only one single object at a time. Therefore, linear search over all possible objects is necessary, making it infeasible when the number of candidate object is large, which is very likely when we want to apply it to real application.
For object recognition, existing method improve the recognition speed with an implicit assumption that localization is either given or irrelevant, making them not suitable to apply to object localization.

In this work, local feature of image is used. The key for speed up concurrent localization and recognition (CLR) lies in the following observation: while multiple local features are used for localizing and recognizing an object, these local feature points can be processed independently and then aggregate together using a linear aggregating function. This means the local feature points can be processed in an incremental or decremental manner.

Definition -
Subspace: Defined by a five candidate set {X+,X-,Y+,Y-,C}.
Candidate Set: X+ and Y+ are the possible feature point for the upper bound of bounding box, and X-, Y- are for the lower bound. C is the set of possible label.
Upper Bound of Subspace: The highest prediction score of a subspace. The score is that of a subset of local feature on a possible label of the subspace.

The algorithm works as follow -
For each iteration, the subspace with highest upper bound is processed. The subspace is split into two, where the largest candidate set is split. That is, suppose Sc is the target subspace and Sc = {Xc+,Xc-,Yc+,Yc-,Cc}, the largest candidate set Ac will be pick up and split into Ac+ and Ac-. For Xc and Yc, they are split horizontally or vertically in the coordinate. For Cc, it is split according to the prediction score of each label, where those with higher score are in one set and those with lower score in the other. Sc is then split into two (denote Sc+ and Sc-) according to the split of candidate set, where all points in Ac+ is removed from all candidate set in Sc- and vice versa. The upper bound of Sc+ and Sc- are then computed, and go to the next iteration. The algorithm stop when the subspace with highest score can't be divided.