Bag of Words (BoW) method is an efficient method for utilizing local feature in image classification and retrieval. Spatial pyramid (SPM) further improve the discriminative power of BoW, resulting in the good on several open data set.
However, according to these result, SPM descriptor with linear classifier doesn't work as well as that with non-linear classifier. Non-linear classifier requires O(n) time complexity even in testing (assuming the number of support vector is O(n)), and thus limit the scalability of system using SPM.
While it is suggest that linear classifier can work as well as non-linear classifier, given the descriptor is discriminative enough, the need for non-linear kernel in SPM is believe to due to the quantization error of BoW. In BoW, each feature point is represented by exactly one "visual word". However, the "coverage" of visual vocabulary is usually limited, since the dictionary size are generally small, so there may exist feature points that can not be well represented using any visual word in the dictionary. An even worse situation is that given a feature point that is about the same distance to two or more visual word, assigning it to one of them is obviously questionable.
A propose solution is to relax the hard assignment constraint, that is, represent a feature point using multiple visual word. Sparse constraint is added, because the dictionary is usually an over complete basis. Another effect of sparse constraint is to limit the visual words used to represent a feature point to those being close to the visual word. While this implicit requirement matches human intuition, it is not necessary the case given the sparse constraint, resulting in large variation in visual vocabulary space given a small distortion in local feature space, which is obviously not a desired property.
To remedy the problem, an additional constraint is added. The constraint is to penalize the difference of sparse code given the original feature are considered to be similar. This constraint explicitly limit the sparse code of similar feature points also being similar. Experimental result shows the distance between sparse codes has higher correlation with the distance of original feature.
Yet another work, "Locality-constrained Linear Coding for Image Classification", use the same concept in different way. Instead of consider the relationship between local features, they consider the distance between a given feature points and visual vocabularies. It penalize using visual vocabulary that is distant to the feature point in feature space. Sparse constraint is removed, because the locality constraint will enforce sparsity, while the reverse is not true.
沒有留言:
張貼留言