Bag of Words (BoW) method is an efficient method for utilizing local feature in image classification and retrieval. Spatial pyramid (SPM) further improve the discriminative power of BoW, resulting in the good on several open data set.
However, according to these result, SPM descriptor with linear classifier doesn't work as well as that with non-linear classifier. Non-linear classifier requires O(n) time complexity even in testing (assuming the number of support vector is O(n)), and thus limit the scalability of system using SPM.
While it is suggest that linear classifier can work as well as non-linear classifier, given the descriptor is discriminative enough, the need for non-linear kernel in SPM is believe to due to the quantization error of BoW. In BoW, each feature point is represented by exactly one "visual word". However, the "coverage" of visual vocabulary is usually limited, since the dictionary size are generally small, so there may exist feature points that can not be well represented using any visual word in the dictionary. An even worse situation is that given a feature point that is about the same distance to two or more visual word, assigning it to one of them is obviously questionable.
A propose solution is to relax the hard assignment constraint, that is, represent a feature point using multiple visual word. Sparse constraint is added, because the dictionary is usually an over complete basis. Another effect of sparse constraint is to limit the visual words used to represent a feature point to those being close to the visual word. While this implicit requirement matches human intuition, it is not necessary the case given the sparse constraint, resulting in large variation in visual vocabulary space given a small distortion in local feature space, which is obviously not a desired property.
To remedy the problem, an additional constraint is added. The constraint is to penalize the difference of sparse code given the original feature are considered to be similar. This constraint explicitly limit the sparse code of similar feature points also being similar. Experimental result shows the distance between sparse codes has higher correlation with the distance of original feature.
Yet another work, "Locality-constrained Linear Coding for Image Classification", use the same concept in different way. Instead of consider the relationship between local features, they consider the distance between a given feature points and visual vocabularies. It penalize using visual vocabulary that is distant to the feature point in feature space. Sparse constraint is removed, because the locality constraint will enforce sparsity, while the reverse is not true.
spooky's blog
2011年6月8日 星期三
2011年6月1日 星期三
Fast Concurrent Object Localization and Recognition
Goal - Given a set of candidate objects and an image, one want to simultaneously locate and recognize the object in reasonable time.
Object recognition and object localization are two important problem in computer vision. Existing method, although effective and efficient for a particular problem, can't solve them concurrently effectively as in human vision.
For object localization, the problem for applying existing method is that they handle only one single object at a time. Therefore, linear search over all possible objects is necessary, making it infeasible when the number of candidate object is large, which is very likely when we want to apply it to real application.
For object recognition, existing method improve the recognition speed with an implicit assumption that localization is either given or irrelevant, making them not suitable to apply to object localization.
In this work, local feature of image is used. The key for speed up concurrent localization and recognition (CLR) lies in the following observation: while multiple local features are used for localizing and recognizing an object, these local feature points can be processed independently and then aggregate together using a linear aggregating function. This means the local feature points can be processed in an incremental or decremental manner.
Definition -
Subspace: Defined by a five candidate set {X+,X-,Y+,Y-,C}.
Candidate Set: X+ and Y+ are the possible feature point for the upper bound of bounding box, and X-, Y- are for the lower bound. C is the set of possible label.
Upper Bound of Subspace: The highest prediction score of a subspace. The score is that of a subset of local feature on a possible label of the subspace.
The algorithm works as follow -
For each iteration, the subspace with highest upper bound is processed. The subspace is split into two, where the largest candidate set is split. That is, suppose Sc is the target subspace and Sc = {Xc+,Xc-,Yc+,Yc-,Cc}, the largest candidate set Ac will be pick up and split into Ac+ and Ac-. For Xc and Yc, they are split horizontally or vertically in the coordinate. For Cc, it is split according to the prediction score of each label, where those with higher score are in one set and those with lower score in the other. Sc is then split into two (denote Sc+ and Sc-) according to the split of candidate set, where all points in Ac+ is removed from all candidate set in Sc- and vice versa. The upper bound of Sc+ and Sc- are then computed, and go to the next iteration. The algorithm stop when the subspace with highest score can't be divided.
Object recognition and object localization are two important problem in computer vision. Existing method, although effective and efficient for a particular problem, can't solve them concurrently effectively as in human vision.
For object localization, the problem for applying existing method is that they handle only one single object at a time. Therefore, linear search over all possible objects is necessary, making it infeasible when the number of candidate object is large, which is very likely when we want to apply it to real application.
For object recognition, existing method improve the recognition speed with an implicit assumption that localization is either given or irrelevant, making them not suitable to apply to object localization.
In this work, local feature of image is used. The key for speed up concurrent localization and recognition (CLR) lies in the following observation: while multiple local features are used for localizing and recognizing an object, these local feature points can be processed independently and then aggregate together using a linear aggregating function. This means the local feature points can be processed in an incremental or decremental manner.
Definition -
Subspace: Defined by a five candidate set {X+,X-,Y+,Y-,C}.
Candidate Set: X+ and Y+ are the possible feature point for the upper bound of bounding box, and X-, Y- are for the lower bound. C is the set of possible label.
Upper Bound of Subspace: The highest prediction score of a subspace. The score is that of a subset of local feature on a possible label of the subspace.
The algorithm works as follow -
For each iteration, the subspace with highest upper bound is processed. The subspace is split into two, where the largest candidate set is split. That is, suppose Sc is the target subspace and Sc = {Xc+,Xc-,Yc+,Yc-,Cc}, the largest candidate set Ac will be pick up and split into Ac+ and Ac-. For Xc and Yc, they are split horizontally or vertically in the coordinate. For Cc, it is split according to the prediction score of each label, where those with higher score are in one set and those with lower score in the other. Sc is then split into two (denote Sc+ and Sc-) according to the split of candidate set, where all points in Ac+ is removed from all candidate set in Sc- and vice versa. The upper bound of Sc+ and Sc- are then computed, and go to the next iteration. The algorithm stop when the subspace with highest score can't be divided.
2011年5月18日 星期三
PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
Motivation -
Various graph mining algorithms are important for data analysis in many researches and real world applications. While these algorithms are widely applied to Tera or Peta-bytes of data, most of them do not take the scaling issue into concern at development, but left it as an implementation issue.
For large scale data manipulation, MapReduce has become an promising and widely adopted solution. The PEGASUS package tries to utilize the MapReduce structure to solve the scalability problem of existing graph mining algorithms.
GIM-V -
Instead of tackle every algorithm seperately, the PEGASUS package first transforms all the algorithms into Generalized Iterative Matrix-Vector multiplication (GIM-V) scheme, and then parallelize GIM-V. By doing so, the package can focus on the optimization of GIM-V, where every algorithm can benefit from the optimization.
The original matrix-vector multiplication can be seperated into three steps:
Parallization of GIM-V -
Parallel GIM-V can be realized using a two stage MapReduce algorithm.
Optimization of pGIM-V -
The bottleneck of naive implementation of parallel GIM-V (pGIM-V) lies in the shuffling stage of MapReduce, where network transfer, sorting and disk I/O occur. So the key for optimization is to reduce the overhead of shuffle stage.
Various graph mining algorithms are important for data analysis in many researches and real world applications. While these algorithms are widely applied to Tera or Peta-bytes of data, most of them do not take the scaling issue into concern at development, but left it as an implementation issue.
For large scale data manipulation, MapReduce has become an promising and widely adopted solution. The PEGASUS package tries to utilize the MapReduce structure to solve the scalability problem of existing graph mining algorithms.
GIM-V -
Instead of tackle every algorithm seperately, the PEGASUS package first transforms all the algorithms into Generalized Iterative Matrix-Vector multiplication (GIM-V) scheme, and then parallelize GIM-V. By doing so, the package can focus on the optimization of GIM-V, where every algorithm can benefit from the optimization.
The original matrix-vector multiplication can be seperated into three steps:
- combine2: multiplication of matrix elements and vector elements
- combineAll: row-wise summation of the multiplication result
- assign: overwrite the vector
- combine2: combine matrix elements and vector elements
- combineAll: row-wise combination of the result from combine2
- assign: update vector element
Parallization of GIM-V -
Parallel GIM-V can be realized using a two stage MapReduce algorithm.
- Stage1(combine2)
- Mapper - emit matrix and vector elements as an (index, value) pair
- Reducer - evaluate the predefined combine2 function
- Stage2(combineAll, Assign)
- Mapper - emit the result of Stage1
- Reducer - evaluate the predefined combineAll and Assign function
Optimization of pGIM-V -
The bottleneck of naive implementation of parallel GIM-V (pGIM-V) lies in the shuffling stage of MapReduce, where network transfer, sorting and disk I/O occur. So the key for optimization is to reduce the overhead of shuffle stage.
- GIM-V BL: Block Multiplication
Instead of treating each element seperately in MapReduce, in GIM-V BL, the basic component for MapReduce is a block of matrix or a block of vector. This means that in Stage1 of GIM-V BL, a block of matrix elements and vector elements are emitted at one time, with the following operations all carry out on a block of elements. By doing so, the number of files that need to be handled in shuffle stage is decrease by b-square and b for matrix and vector repectively, which reduces the network transfer an I/O cost. More over, because the elements can be now indexed using the block coordinate, the number of bit required to index each element is reduced, which decreases the datasize.
- GIM-V CL: Clustered Edges
Because in GIM-V BL, only blocks that contains non-zero element will be emitted in stage1 (which means all unseen blocks are taken as zero by default), so if edge clustering is carried out before starting the GIM-V algorithm, the number of blocks that need to be handled by the algorithm may be greatly decreased, since the matrix is usually sparse in real application. Notice the edge clustering only needs to be carried out once before the entire algorithm start and can be used through out all the iterations.
2011年5月11日 星期三
Building a High-Level Dataflow System on top of Map-Reduce: Teh Big Experience
The need for Pig -
The feature of Pig -
Critical Issue during implementation -
- The increasing need for processing and analyzing ultra-large-scale data
- The Map-Reduce framework shows to be suitable for the task, especially for scalability issue
- Map-Reduce, however, provides only very simple "Mapper-Reducer" framework, which leads for several problem in real application:
- No direct support for multi-step data processing
- No explicit support for combined processing of multiple data sets
- All operations, even frequently and universally used ones, have to be coded from scatch
The feature of Pig -
- High scalability inherent from Map-Reduce
- A general framework and is easy to use, which reduce the application development and data analysis time
- Black box operation provides general optimization opportunities
- Customizable through the UDF and pipeline interface
Critical Issue during implementation -
- Memory control - the key idea is to avoid spill. Since automatic memory control is hard, part of the work has to be handled by user
- Flow Control - takes memory issue, pipline and UDF interface into consideration
- Combiner - while not a necessary part for user to implement when working on Map-Reduce structure, it contains many optimization opportunities
2011年5月4日 星期三
Where's Waldo: Matching People in Images of Crowds
Problem - Given a set of images taken from a social event where crowds of people are in each images, the work aims to find the appearance of a particular person in each image after user specified the person in one of the images.
Assumption - Rate of photo acquisition is fast compared to the rate of movement of people, which implies we can find a crowd of people at the same location in multiple images.
Challenges - Large amount of people in one image, low resolution, change of pose of target person.
The problem is solved by considering both
Visual appearance -
The visual appearance is captured by part based approach using pixel color as the visual feature.
Upon input, user will specify a person using two vertical points, along with body parts of the person. The "body part model" is the color model of the body part, which is a classifier that discriminates whether a pixel belongs to the body part.
The candidate location of the person in other images is located by projecting the two points into target image. For every candidate location, a score is calculated using the learnt body part model, where the score is the number of positive classified pixel. In a particular target image, the score of the candidate location with highest score is taken as the score of the image.
Contextual Cues -
Contextual Cues are based on two observation
Assumption - Rate of photo acquisition is fast compared to the rate of movement of people, which implies we can find a crowd of people at the same location in multiple images.
Challenges - Large amount of people in one image, low resolution, change of pose of target person.
The problem is solved by considering both
- Visual appearance
- Contextual Cues, which takes the co-appearance of people and time-stamp of images into account
Visual appearance -
The visual appearance is captured by part based approach using pixel color as the visual feature.
Upon input, user will specify a person using two vertical points, along with body parts of the person. The "body part model" is the color model of the body part, which is a classifier that discriminates whether a pixel belongs to the body part.
The candidate location of the person in other images is located by projecting the two points into target image. For every candidate location, a score is calculated using the learnt body part model, where the score is the number of positive classified pixel. In a particular target image, the score of the candidate location with highest score is taken as the score of the image.
Contextual Cues -
Contextual Cues are based on two observation
- If a group of people appears together in several images, they are likely to appear together in other images
- Images taken from the same place in a short period of time tend to contain the same group of people
2011年4月30日 星期六
Learning to Rank: From Pairwise Approach to Listwise Approach
Previous works on "Learning to Rank" usually use pairwise-based loss functional, which has several disadvantage:
A listwise approach is proposed as follow:
The ranking of the documents given a query can be viewed as a permutation, which is relevant to the list of score.
Now the loss function is defined as followed:
To reduce the computational cost, only the top k ranking documents are considered in the learning to rank process.
- The objective function does not guarantee the optimization of some important measure in retrieval such as AP. These measure better capture the user experience than the pairwise ordering.
- Document pairs are not generated independently, which violate the assumption of classification methods that is applied.
- The number of document pairs, which grows in n square order, from query to query is highly unbalanced. This will lead to a biased model in training.
A listwise approach is proposed as follow:
- For each query, a list of score over all documents is given as ground truth.
- A ranking function f, which is what we want to learn, takes both query and document as input, and gives a score for each document-query pair.
- For each query, a list of score is generated using the ranking function
- The loss function is defined over the "ground truth list" and the "ranking list" of each query as input.
The ranking of the documents given a query can be viewed as a permutation, which is relevant to the list of score.
Now the loss function is defined as followed:
- Defined the probability of each permutation based on the ordering and the score of each documents.
- Given two lists of score, the probability of the permutation can be calculated, and the cross entropy of the ground truth and the score generated by ranking function is used as the loss function.
To reduce the computational cost, only the top k ranking documents are considered in the learning to rank process.
2011年4月20日 星期三
Support Vector Learning for Ordinal Regression
Problem definition -
Given a set of samples S = {(xi,yi)}, where xi being the feature vector of sample and yi being the label, most machine learning problem can be classified into one of the three categories, depending on the property of label:
Problem formulation -
The formulation of ordinal regression problem is consist of two theorem:
Problem solution -
According to the above theorems, the ordinal regression problem can be transformed into a classification problem with largest margin criteria. This is the same criteria as SVM, and a formulation of the problem which is identical to SVM is provided. So the problem can be solved by first solve the classification problem using SVM solver, and the mapping U and boundary of different labels in R can be found accordingly. Combining mapping U and the boundary, we can obtain the desired mapping h.
Given a set of samples S = {(xi,yi)}, where xi being the feature vector of sample and yi being the label, most machine learning problem can be classified into one of the three categories, depending on the property of label:
- Classification: yi belongs to a finite set of discrete value, where the value is unordered.
- Regression: yi is in a continues metric space, where the value of yi contains relevancy information about the sample.
- Ordinal regression: yi belongs to a finite set of discrete value like classification, but there exist ordering relationship in the value as the regression. It can be interpreted as a ranking problem.
Problem formulation -
The formulation of ordinal regression problem is consist of two theorem:
- Theorem 1: Given a mapping function h: X->Y, we can transform the ranking problem into a classification problem. The classification problem is to find a mapping p: X*X->{-1,0,1}, where the input of the mapping is arbitrary two samples from X, and the output is relationship of h(x). The label of the classification problem is determined by the relationship between the actual rank of the two input samples. The empirical risk of the ranking problem is proportional to the classification problem, so the ranking problem is equivalent to the classification problem.
- Theorem 2: Given the mapping function h, The empirical risk is upper bounded. Suppose there exist a mapping U: X->R where samples of different label are divided in R, and the distribution of samples in R is ordered according to the order of original label. Then for every pair of consecutive labels, there exist a margin of samples in R. Suppose gamma equals the smallest margin, than the larger the margin is, the smaller the upper bound of the empirical risk is. So the optimal mapping h should be the one with largest gamma value, or say, with largest margin in R.
Problem solution -
According to the above theorems, the ordinal regression problem can be transformed into a classification problem with largest margin criteria. This is the same criteria as SVM, and a formulation of the problem which is identical to SVM is provided. So the problem can be solved by first solve the classification problem using SVM solver, and the mapping U and boundary of different labels in R can be found accordingly. Combining mapping U and the boundary, we can obtain the desired mapping h.
訂閱:
文章 (Atom)