Bag of Words (BoW) method is an efficient method for utilizing local feature in image classification and retrieval. Spatial pyramid (SPM) further improve the discriminative power of BoW, resulting in the good on several open data set.
However, according to these result, SPM descriptor with linear classifier doesn't work as well as that with non-linear classifier. Non-linear classifier requires O(n) time complexity even in testing (assuming the number of support vector is O(n)), and thus limit the scalability of system using SPM.
While it is suggest that linear classifier can work as well as non-linear classifier, given the descriptor is discriminative enough, the need for non-linear kernel in SPM is believe to due to the quantization error of BoW. In BoW, each feature point is represented by exactly one "visual word". However, the "coverage" of visual vocabulary is usually limited, since the dictionary size are generally small, so there may exist feature points that can not be well represented using any visual word in the dictionary. An even worse situation is that given a feature point that is about the same distance to two or more visual word, assigning it to one of them is obviously questionable.
A propose solution is to relax the hard assignment constraint, that is, represent a feature point using multiple visual word. Sparse constraint is added, because the dictionary is usually an over complete basis. Another effect of sparse constraint is to limit the visual words used to represent a feature point to those being close to the visual word. While this implicit requirement matches human intuition, it is not necessary the case given the sparse constraint, resulting in large variation in visual vocabulary space given a small distortion in local feature space, which is obviously not a desired property.
To remedy the problem, an additional constraint is added. The constraint is to penalize the difference of sparse code given the original feature are considered to be similar. This constraint explicitly limit the sparse code of similar feature points also being similar. Experimental result shows the distance between sparse codes has higher correlation with the distance of original feature.
Yet another work, "Locality-constrained Linear Coding for Image Classification", use the same concept in different way. Instead of consider the relationship between local features, they consider the distance between a given feature points and visual vocabularies. It penalize using visual vocabulary that is distant to the feature point in feature space. Sparse constraint is removed, because the locality constraint will enforce sparsity, while the reverse is not true.
2011年6月8日 星期三
2011年6月1日 星期三
Fast Concurrent Object Localization and Recognition
Goal - Given a set of candidate objects and an image, one want to simultaneously locate and recognize the object in reasonable time.
Object recognition and object localization are two important problem in computer vision. Existing method, although effective and efficient for a particular problem, can't solve them concurrently effectively as in human vision.
For object localization, the problem for applying existing method is that they handle only one single object at a time. Therefore, linear search over all possible objects is necessary, making it infeasible when the number of candidate object is large, which is very likely when we want to apply it to real application.
For object recognition, existing method improve the recognition speed with an implicit assumption that localization is either given or irrelevant, making them not suitable to apply to object localization.
In this work, local feature of image is used. The key for speed up concurrent localization and recognition (CLR) lies in the following observation: while multiple local features are used for localizing and recognizing an object, these local feature points can be processed independently and then aggregate together using a linear aggregating function. This means the local feature points can be processed in an incremental or decremental manner.
Definition -
Subspace: Defined by a five candidate set {X+,X-,Y+,Y-,C}.
Candidate Set: X+ and Y+ are the possible feature point for the upper bound of bounding box, and X-, Y- are for the lower bound. C is the set of possible label.
Upper Bound of Subspace: The highest prediction score of a subspace. The score is that of a subset of local feature on a possible label of the subspace.
The algorithm works as follow -
For each iteration, the subspace with highest upper bound is processed. The subspace is split into two, where the largest candidate set is split. That is, suppose Sc is the target subspace and Sc = {Xc+,Xc-,Yc+,Yc-,Cc}, the largest candidate set Ac will be pick up and split into Ac+ and Ac-. For Xc and Yc, they are split horizontally or vertically in the coordinate. For Cc, it is split according to the prediction score of each label, where those with higher score are in one set and those with lower score in the other. Sc is then split into two (denote Sc+ and Sc-) according to the split of candidate set, where all points in Ac+ is removed from all candidate set in Sc- and vice versa. The upper bound of Sc+ and Sc- are then computed, and go to the next iteration. The algorithm stop when the subspace with highest score can't be divided.
Object recognition and object localization are two important problem in computer vision. Existing method, although effective and efficient for a particular problem, can't solve them concurrently effectively as in human vision.
For object localization, the problem for applying existing method is that they handle only one single object at a time. Therefore, linear search over all possible objects is necessary, making it infeasible when the number of candidate object is large, which is very likely when we want to apply it to real application.
For object recognition, existing method improve the recognition speed with an implicit assumption that localization is either given or irrelevant, making them not suitable to apply to object localization.
In this work, local feature of image is used. The key for speed up concurrent localization and recognition (CLR) lies in the following observation: while multiple local features are used for localizing and recognizing an object, these local feature points can be processed independently and then aggregate together using a linear aggregating function. This means the local feature points can be processed in an incremental or decremental manner.
Definition -
Subspace: Defined by a five candidate set {X+,X-,Y+,Y-,C}.
Candidate Set: X+ and Y+ are the possible feature point for the upper bound of bounding box, and X-, Y- are for the lower bound. C is the set of possible label.
Upper Bound of Subspace: The highest prediction score of a subspace. The score is that of a subset of local feature on a possible label of the subspace.
The algorithm works as follow -
For each iteration, the subspace with highest upper bound is processed. The subspace is split into two, where the largest candidate set is split. That is, suppose Sc is the target subspace and Sc = {Xc+,Xc-,Yc+,Yc-,Cc}, the largest candidate set Ac will be pick up and split into Ac+ and Ac-. For Xc and Yc, they are split horizontally or vertically in the coordinate. For Cc, it is split according to the prediction score of each label, where those with higher score are in one set and those with lower score in the other. Sc is then split into two (denote Sc+ and Sc-) according to the split of candidate set, where all points in Ac+ is removed from all candidate set in Sc- and vice versa. The upper bound of Sc+ and Sc- are then computed, and go to the next iteration. The algorithm stop when the subspace with highest score can't be divided.
2011年5月18日 星期三
PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
Motivation -
Various graph mining algorithms are important for data analysis in many researches and real world applications. While these algorithms are widely applied to Tera or Peta-bytes of data, most of them do not take the scaling issue into concern at development, but left it as an implementation issue.
For large scale data manipulation, MapReduce has become an promising and widely adopted solution. The PEGASUS package tries to utilize the MapReduce structure to solve the scalability problem of existing graph mining algorithms.
GIM-V -
Instead of tackle every algorithm seperately, the PEGASUS package first transforms all the algorithms into Generalized Iterative Matrix-Vector multiplication (GIM-V) scheme, and then parallelize GIM-V. By doing so, the package can focus on the optimization of GIM-V, where every algorithm can benefit from the optimization.
The original matrix-vector multiplication can be seperated into three steps:
Parallization of GIM-V -
Parallel GIM-V can be realized using a two stage MapReduce algorithm.
Optimization of pGIM-V -
The bottleneck of naive implementation of parallel GIM-V (pGIM-V) lies in the shuffling stage of MapReduce, where network transfer, sorting and disk I/O occur. So the key for optimization is to reduce the overhead of shuffle stage.
Various graph mining algorithms are important for data analysis in many researches and real world applications. While these algorithms are widely applied to Tera or Peta-bytes of data, most of them do not take the scaling issue into concern at development, but left it as an implementation issue.
For large scale data manipulation, MapReduce has become an promising and widely adopted solution. The PEGASUS package tries to utilize the MapReduce structure to solve the scalability problem of existing graph mining algorithms.
GIM-V -
Instead of tackle every algorithm seperately, the PEGASUS package first transforms all the algorithms into Generalized Iterative Matrix-Vector multiplication (GIM-V) scheme, and then parallelize GIM-V. By doing so, the package can focus on the optimization of GIM-V, where every algorithm can benefit from the optimization.
The original matrix-vector multiplication can be seperated into three steps:
- combine2: multiplication of matrix elements and vector elements
- combineAll: row-wise summation of the multiplication result
- assign: overwrite the vector
- combine2: combine matrix elements and vector elements
- combineAll: row-wise combination of the result from combine2
- assign: update vector element
Parallization of GIM-V -
Parallel GIM-V can be realized using a two stage MapReduce algorithm.
- Stage1(combine2)
- Mapper - emit matrix and vector elements as an (index, value) pair
- Reducer - evaluate the predefined combine2 function
- Stage2(combineAll, Assign)
- Mapper - emit the result of Stage1
- Reducer - evaluate the predefined combineAll and Assign function
Optimization of pGIM-V -
The bottleneck of naive implementation of parallel GIM-V (pGIM-V) lies in the shuffling stage of MapReduce, where network transfer, sorting and disk I/O occur. So the key for optimization is to reduce the overhead of shuffle stage.
- GIM-V BL: Block Multiplication
Instead of treating each element seperately in MapReduce, in GIM-V BL, the basic component for MapReduce is a block of matrix or a block of vector. This means that in Stage1 of GIM-V BL, a block of matrix elements and vector elements are emitted at one time, with the following operations all carry out on a block of elements. By doing so, the number of files that need to be handled in shuffle stage is decrease by b-square and b for matrix and vector repectively, which reduces the network transfer an I/O cost. More over, because the elements can be now indexed using the block coordinate, the number of bit required to index each element is reduced, which decreases the datasize.
- GIM-V CL: Clustered Edges
Because in GIM-V BL, only blocks that contains non-zero element will be emitted in stage1 (which means all unseen blocks are taken as zero by default), so if edge clustering is carried out before starting the GIM-V algorithm, the number of blocks that need to be handled by the algorithm may be greatly decreased, since the matrix is usually sparse in real application. Notice the edge clustering only needs to be carried out once before the entire algorithm start and can be used through out all the iterations.
2011年5月11日 星期三
Building a High-Level Dataflow System on top of Map-Reduce: Teh Big Experience
The need for Pig -
The feature of Pig -
Critical Issue during implementation -
- The increasing need for processing and analyzing ultra-large-scale data
- The Map-Reduce framework shows to be suitable for the task, especially for scalability issue
- Map-Reduce, however, provides only very simple "Mapper-Reducer" framework, which leads for several problem in real application:
- No direct support for multi-step data processing
- No explicit support for combined processing of multiple data sets
- All operations, even frequently and universally used ones, have to be coded from scatch
The feature of Pig -
- High scalability inherent from Map-Reduce
- A general framework and is easy to use, which reduce the application development and data analysis time
- Black box operation provides general optimization opportunities
- Customizable through the UDF and pipeline interface
Critical Issue during implementation -
- Memory control - the key idea is to avoid spill. Since automatic memory control is hard, part of the work has to be handled by user
- Flow Control - takes memory issue, pipline and UDF interface into consideration
- Combiner - while not a necessary part for user to implement when working on Map-Reduce structure, it contains many optimization opportunities
2011年5月4日 星期三
Where's Waldo: Matching People in Images of Crowds
Problem - Given a set of images taken from a social event where crowds of people are in each images, the work aims to find the appearance of a particular person in each image after user specified the person in one of the images.
Assumption - Rate of photo acquisition is fast compared to the rate of movement of people, which implies we can find a crowd of people at the same location in multiple images.
Challenges - Large amount of people in one image, low resolution, change of pose of target person.
The problem is solved by considering both
Visual appearance -
The visual appearance is captured by part based approach using pixel color as the visual feature.
Upon input, user will specify a person using two vertical points, along with body parts of the person. The "body part model" is the color model of the body part, which is a classifier that discriminates whether a pixel belongs to the body part.
The candidate location of the person in other images is located by projecting the two points into target image. For every candidate location, a score is calculated using the learnt body part model, where the score is the number of positive classified pixel. In a particular target image, the score of the candidate location with highest score is taken as the score of the image.
Contextual Cues -
Contextual Cues are based on two observation
Assumption - Rate of photo acquisition is fast compared to the rate of movement of people, which implies we can find a crowd of people at the same location in multiple images.
Challenges - Large amount of people in one image, low resolution, change of pose of target person.
The problem is solved by considering both
- Visual appearance
- Contextual Cues, which takes the co-appearance of people and time-stamp of images into account
Visual appearance -
The visual appearance is captured by part based approach using pixel color as the visual feature.
Upon input, user will specify a person using two vertical points, along with body parts of the person. The "body part model" is the color model of the body part, which is a classifier that discriminates whether a pixel belongs to the body part.
The candidate location of the person in other images is located by projecting the two points into target image. For every candidate location, a score is calculated using the learnt body part model, where the score is the number of positive classified pixel. In a particular target image, the score of the candidate location with highest score is taken as the score of the image.
Contextual Cues -
Contextual Cues are based on two observation
- If a group of people appears together in several images, they are likely to appear together in other images
- Images taken from the same place in a short period of time tend to contain the same group of people
2011年4月30日 星期六
Learning to Rank: From Pairwise Approach to Listwise Approach
Previous works on "Learning to Rank" usually use pairwise-based loss functional, which has several disadvantage:
A listwise approach is proposed as follow:
The ranking of the documents given a query can be viewed as a permutation, which is relevant to the list of score.
Now the loss function is defined as followed:
To reduce the computational cost, only the top k ranking documents are considered in the learning to rank process.
- The objective function does not guarantee the optimization of some important measure in retrieval such as AP. These measure better capture the user experience than the pairwise ordering.
- Document pairs are not generated independently, which violate the assumption of classification methods that is applied.
- The number of document pairs, which grows in n square order, from query to query is highly unbalanced. This will lead to a biased model in training.
A listwise approach is proposed as follow:
- For each query, a list of score over all documents is given as ground truth.
- A ranking function f, which is what we want to learn, takes both query and document as input, and gives a score for each document-query pair.
- For each query, a list of score is generated using the ranking function
- The loss function is defined over the "ground truth list" and the "ranking list" of each query as input.
The ranking of the documents given a query can be viewed as a permutation, which is relevant to the list of score.
Now the loss function is defined as followed:
- Defined the probability of each permutation based on the ordering and the score of each documents.
- Given two lists of score, the probability of the permutation can be calculated, and the cross entropy of the ground truth and the score generated by ranking function is used as the loss function.
To reduce the computational cost, only the top k ranking documents are considered in the learning to rank process.
2011年4月20日 星期三
Support Vector Learning for Ordinal Regression
Problem definition -
Given a set of samples S = {(xi,yi)}, where xi being the feature vector of sample and yi being the label, most machine learning problem can be classified into one of the three categories, depending on the property of label:
Problem formulation -
The formulation of ordinal regression problem is consist of two theorem:
Problem solution -
According to the above theorems, the ordinal regression problem can be transformed into a classification problem with largest margin criteria. This is the same criteria as SVM, and a formulation of the problem which is identical to SVM is provided. So the problem can be solved by first solve the classification problem using SVM solver, and the mapping U and boundary of different labels in R can be found accordingly. Combining mapping U and the boundary, we can obtain the desired mapping h.
Given a set of samples S = {(xi,yi)}, where xi being the feature vector of sample and yi being the label, most machine learning problem can be classified into one of the three categories, depending on the property of label:
- Classification: yi belongs to a finite set of discrete value, where the value is unordered.
- Regression: yi is in a continues metric space, where the value of yi contains relevancy information about the sample.
- Ordinal regression: yi belongs to a finite set of discrete value like classification, but there exist ordering relationship in the value as the regression. It can be interpreted as a ranking problem.
Problem formulation -
The formulation of ordinal regression problem is consist of two theorem:
- Theorem 1: Given a mapping function h: X->Y, we can transform the ranking problem into a classification problem. The classification problem is to find a mapping p: X*X->{-1,0,1}, where the input of the mapping is arbitrary two samples from X, and the output is relationship of h(x). The label of the classification problem is determined by the relationship between the actual rank of the two input samples. The empirical risk of the ranking problem is proportional to the classification problem, so the ranking problem is equivalent to the classification problem.
- Theorem 2: Given the mapping function h, The empirical risk is upper bounded. Suppose there exist a mapping U: X->R where samples of different label are divided in R, and the distribution of samples in R is ordered according to the order of original label. Then for every pair of consecutive labels, there exist a margin of samples in R. Suppose gamma equals the smallest margin, than the larger the margin is, the smaller the upper bound of the empirical risk is. So the optimal mapping h should be the one with largest gamma value, or say, with largest margin in R.
Problem solution -
According to the above theorems, the ordinal regression problem can be transformed into a classification problem with largest margin criteria. This is the same criteria as SVM, and a formulation of the problem which is identical to SVM is provided. So the problem can be solved by first solve the classification problem using SVM solver, and the mapping U and boundary of different labels in R can be found accordingly. Combining mapping U and the boundary, we can obtain the desired mapping h.
2011年4月6日 星期三
Nonlinear Dimensionality Reduction by Locally Linear Embedding
Locally Linear Embedding (LLE) is an unsupervised algorithm that maps data points from the original high dimensional space into a low dimensional space.
Dimension reduction is beneficial in several aspect, for example
The key of this paper is that: given a nonlinear manifold, a local patch of the manifold can be approximated by a hyperplane. The local structure of the patch on the hyperplane is invariant to the projection from the original space to the manifold.
The main assumption, or requirement of the algorithm, is that data points are dense enough that each data point and its neighbors lies in a small patch that can be well approximated by a hyperplane.
So the algorithm can be summarized as followed:
Dimension reduction is beneficial in several aspect, for example
- Compact data representation
- Data visualization
- Relationship between data points are better described in some particular low dimensional space
The key of this paper is that: given a nonlinear manifold, a local patch of the manifold can be approximated by a hyperplane. The local structure of the patch on the hyperplane is invariant to the projection from the original space to the manifold.
The main assumption, or requirement of the algorithm, is that data points are dense enough that each data point and its neighbors lies in a small patch that can be well approximated by a hyperplane.
So the algorithm can be summarized as followed:
- Extract local structure information.
Because local patch can be approximated by hyperplane, each point can be reconstructed by linear combination of its neighbors. Such reconstruction contains local geometry information that is invariant to the projection form the original space to the low dimensional manifold. - Projection to the manifold.
A mapping that preserves the local structure information is found. Finding such mapping can be transformed to a optimization problem with close form solution. It can be solved using linear algebra method.
2011年3月23日 星期三
pLSI & LDA
pLSI - Probabilistic Latent Semantic Indexing
LDA - Latent Dirichlet Allocation
note: This summary is for the "two" papers. For simplicity, I will follow these papers using terms like "words" and "documents" in the summary as if we are explicitly dealing with text domain IR, but the inference is based on probability and the applicability of these methods are not confined to text domain.
Term used:
Word: w, dimension V, each word has exactly one dimension equals 1, all other dimensions equal to 0.
Document: d, consist of N word w, where N is different for each document.
Corpus: D, consist of M documents d.
Topic: z, a hidden variable, each dimension represent a "topic". The dimension is assumed to be a fixed value K.
Word simplex: a (V-1) dimensional simplex, each point represents a particular probability distribution of all words.
Latent semantic simplex: a (K - 1) dimensional sub-simplex of word simplex, spanned by K latent semantic vector.
Problem:
Given a corpus, we want to
Important assumption:
Model:
Unigram model - The entire corpus is modeled by one single distribution. The corpus is represented by one single point in the word simplex, with each dimension being as significant to each other. Each document is also modeled by one single point in the word simplex, where all the documents center around the point representing the corpus.
Cluster model - The corpus is split into several "topics", each topic has its own distribution. The corpus is represented by several points in the word simplex. Each document is assigned to one single topic, and documents belonging to the same topic are either being represented by the same point or are represented by points center around the topic in the word simplex.
pLSI - Rather than lying in the word simplex, the corpus is assumed to lie in the latent semantic simplex. The sub-simplex is spanned by K latent semantic vector, where each latent semantic vector represent a set of words that share the same "latent semantic meaning". Note the latent semantic is determined statistically rather than semantically. Each document is represented by a point in the latent semantic simplex. This model is superior to the previous model because:
LSI -
It is similar to pLSI. It identify the structure between words in the corpus, or say the latent semantic vector, and cast the problem from word space to latent semantic space. It is outperformed by pLSI, probably because it use a less informative distance measure in model assessment. It, however, do not need a pre-definced dimension for latent semantic space.
LDA -
Following the concept of latent semantic simplex in pLSI, the corpus is modeled as a distribution in latent semantic space. It is an extension of pLSI in that the structure of the corpus in latent semantic simplex is captured by a probabilistic model. Because pLSI models each document as a point in latent semantic simplex, it suffers from the problem of too many unseen variable and hence over fitting. This is resolved by LDA.
LDA - Latent Dirichlet Allocation
note: This summary is for the "two" papers. For simplicity, I will follow these papers using terms like "words" and "documents" in the summary as if we are explicitly dealing with text domain IR, but the inference is based on probability and the applicability of these methods are not confined to text domain.
Term used:
Word: w, dimension V, each word has exactly one dimension equals 1, all other dimensions equal to 0.
Document: d, consist of N word w, where N is different for each document.
Corpus: D, consist of M documents d.
Topic: z, a hidden variable, each dimension represent a "topic". The dimension is assumed to be a fixed value K.
Word simplex: a (V-1) dimensional simplex, each point represents a particular probability distribution of all words.
Latent semantic simplex: a (K - 1) dimensional sub-simplex of word simplex, spanned by K latent semantic vector.
Problem:
Given a corpus, we want to
- Find a good representation for the documents in the corpus
- Model the corpus
Important assumption:
- Words in a document is exchangeable, or say the words in a document are "generated" independently
- Documents in the corpus is exchangeable
Model:
Unigram model - The entire corpus is modeled by one single distribution. The corpus is represented by one single point in the word simplex, with each dimension being as significant to each other. Each document is also modeled by one single point in the word simplex, where all the documents center around the point representing the corpus.
Cluster model - The corpus is split into several "topics", each topic has its own distribution. The corpus is represented by several points in the word simplex. Each document is assigned to one single topic, and documents belonging to the same topic are either being represented by the same point or are represented by points center around the topic in the word simplex.
pLSI - Rather than lying in the word simplex, the corpus is assumed to lie in the latent semantic simplex. The sub-simplex is spanned by K latent semantic vector, where each latent semantic vector represent a set of words that share the same "latent semantic meaning". Note the latent semantic is determined statistically rather than semantically. Each document is represented by a point in the latent semantic simplex. This model is superior to the previous model because:
- It capture the correlation of different words
- Latent semantic simplex usually has much lower dimension than word simplex
LSI -
It is similar to pLSI. It identify the structure between words in the corpus, or say the latent semantic vector, and cast the problem from word space to latent semantic space. It is outperformed by pLSI, probably because it use a less informative distance measure in model assessment. It, however, do not need a pre-definced dimension for latent semantic space.
LDA -
Following the concept of latent semantic simplex in pLSI, the corpus is modeled as a distribution in latent semantic space. It is an extension of pLSI in that the structure of the corpus in latent semantic simplex is captured by a probabilistic model. Because pLSI models each document as a point in latent semantic simplex, it suffers from the problem of too many unseen variable and hence over fitting. This is resolved by LDA.
2011年3月16日 星期三
Lost in Binarization: Query-Adaptive Ranking for Similar Image Search with Compact Codes
The combination of local feature like SIFT and the BoW framework has been adopted in many system for image representation. But such representation suffers from:
But when performing image search using binary representation, many images will have the same distance to the query one, due to the discrete nature of Hamming distance, and it is unlikely that all these images are as similar to the query as others.The paper proposed a method that can perform fine-grained re-ranking for images with the same hamming distance.
The argument of the paper is that, although every bit in a binary representation has the same importance in Hamming distance, this is not the case for real-world data. Some bits is more discriminative than others, and the importance of each bit may vary in different class of images. According to the argument, the system is designed as followed:
For each semantic class, a "importance", or weight, of each bit in the binary representation is offline calculated. When a query image comes in, a weight for the query will be adaptively generated, based on the weight pre-calculated, and is than used to re-rank results with the same Hamming distance.
To generate the weight of each semantic class, two criteria are considered:
and can be transformed into a quadratic-programing (QP) like problem
which can be solved iteratively.
As for the weight of query image, it is calculated using a pseudo-relevance feedback method. The weight is the average of the semantic class weight of the top k most similar images. The distance metric used in the system is
Comment:
1. The usage of semantic class is, although somehow reasonable, lack of validation. This is especially true when the dataset is more diverse in sense of semantic class or lack of labeling.
- High dimensionality
- A long list of visual word for each image
But when performing image search using binary representation, many images will have the same distance to the query one, due to the discrete nature of Hamming distance, and it is unlikely that all these images are as similar to the query as others.The paper proposed a method that can perform fine-grained re-ranking for images with the same hamming distance.
The argument of the paper is that, although every bit in a binary representation has the same importance in Hamming distance, this is not the case for real-world data. Some bits is more discriminative than others, and the importance of each bit may vary in different class of images. According to the argument, the system is designed as followed:
For each semantic class, a "importance", or weight, of each bit in the binary representation is offline calculated. When a query image comes in, a weight for the query will be adaptively generated, based on the weight pre-calculated, and is than used to re-rank results with the same Hamming distance.
To generate the weight of each semantic class, two criteria are considered:
- Intra-class similarity -
Image in the same class should remain close together after the re-weighting transform.
This is formulated as:
where ai is the weight of i'th semantic class, ci is the center of the i'th semantic class, and x is an image. - Inter-class relationship -
The distance relationship of each class should remain the same.
This is formulated as:
where sij is the distance between the center of class i and j.
and can be transformed into a quadratic-programing (QP) like problem
which can be solved iteratively.
As for the weight of query image, it is calculated using a pseudo-relevance feedback method. The weight is the average of the semantic class weight of the top k most similar images. The distance metric used in the system is
Comment:
1. The usage of semantic class is, although somehow reasonable, lack of validation. This is especially true when the dataset is more diverse in sense of semantic class or lack of labeling.
2011年3月9日 星期三
Aggregating local descriptors into a compact image representation
This paper has two main contributions
- VLAD - A new image descriptor based on local feature -
Based on SIFT local descriptor, VLAD aggregate local descriptors into a fixed-dimension representation for image.
The idea is an outgrow of BOF representation, but it retains the structure of SIFT descriptor. Unlike BOF representation that aggregate SIFT descriptors into a single number using tf-idf manner, VLAD describe each "Visual Word" by accumulating the difference of local descriptors with the centroid of the visual word. Experiment result shows that it outperforms BOF representation in image retrieval when they are comparable in size.
VLAD can also be viewed as a simplification of Fisher kernel, which utilize GMM model to describe "Visual Word".
- PCA-ADC approach to "compress" descriptor -
Vector quantization is a widely used approach to reduce the size of descriptors.
ADC approach reduce the vector quantization complexity by dividing a vector into multiple small vectors and quantize each of them separately. The approach is combined with PCA, which is widely used in dimension reduction, to further reduce the data size.
The paper makes a formal analysis about the error caused by these descriptor compression methods, and concluded that the performance loss is acceptable. The analysis is supported by its experiment.
2011年3月2日 星期三
Efficient Visual Search of Videos Cast as Text Retrieval
This paper introduces a method to transform the image retrieval problem into text retrieval problem.
Research in text domain retrieval has a long history and a relatively well studied problem, with many promising techniques available. These techniques have been applied to many application, such as search engine, and are proven to be very success. Image retrieval, on the other hand, is a relatively novel problem and is not as mature as text retrieval. So it will be very helpful if we can utilize the technique of text retrieval in image retrieval, which is the aim of this paper.
The key of the paper is to transform low-level visual feature into "Visual Word". Low-level visual features are first obtained by identifying interesting regions and using SIFT feature descriptor to represent these regions. Visual vocabulary are then build by performing clustering, with the center of each cluster being a "Visual Word", and each feature point extracted is cast into a particular "Visual Word". These visual words are analogy to regular words in text, and techniques in text retrieval can then be applied.
After visual words are obtained, they are processed as if they were words in a text, and the structure of the retrieval system follows that of regular text retrieval system:
The main benefit of the proposed method is its time efficiency. With very simple technique, it can greatly reduce the retrieval time while returning results comparable to pure visual matching method.
Research in text domain retrieval has a long history and a relatively well studied problem, with many promising techniques available. These techniques have been applied to many application, such as search engine, and are proven to be very success. Image retrieval, on the other hand, is a relatively novel problem and is not as mature as text retrieval. So it will be very helpful if we can utilize the technique of text retrieval in image retrieval, which is the aim of this paper.
The key of the paper is to transform low-level visual feature into "Visual Word". Low-level visual features are first obtained by identifying interesting regions and using SIFT feature descriptor to represent these regions. Visual vocabulary are then build by performing clustering, with the center of each cluster being a "Visual Word", and each feature point extracted is cast into a particular "Visual Word". These visual words are analogy to regular words in text, and techniques in text retrieval can then be applied.
After visual words are obtained, they are processed as if they were words in a text, and the structure of the retrieval system follows that of regular text retrieval system:
- Building Vocabulary -
- Term weighting using tf-idf
- Stop word removal
- Indexing -
- Inverted file
- Retrieval Model -
- Vector space model
The main benefit of the proposed method is its time efficiency. With very simple technique, it can greatly reduce the retrieval time while returning results comparable to pure visual matching method.
2011年3月1日 星期二
Distinctive Image Features from Scale-Invariant Keypoints
This paper introduces the well-known SIFT local feature, which is a widely used local feature, and has been proven to work well on many application, especially in object recognition and image matching.
SIFT feature contains two main parts, with four stages to extract the feature:
Some desirable properties about SIFT feature make it a successful local feature. These properties includes invariant to rotation, scaling, affine distortion and change in illumination. Also, it extract large numbers of keypoints in general images, which makes it very distinctive.
SIFT feature contains two main parts, with four stages to extract the feature:
- Feature point identification
- Scale-space extrema detection -
Identify interesting points by searching scale-space extrema. This is done by finding local minimum in the difference-of-Gaussian convolved image. It can be proved that the minimum is a good approximation of the laplacian of gaussian convovlved image, which has been shown to produce stable feature.
Following computation are done in the particular scale-space, and are thus scale invariant.
- Keypoint localization -
The position of interesting points is locate more precisely by finding the exact local minimum point. This is done by expanding the DOG-convolved image around the interesting point find in step1 to second order, and the minimum is located analytically.
The keypoints founded are then filtered. This step is to eliminate poor keypoints that occurs due to edge response. This is done by thresholding the ratio of principal curvature and curvature of perpendicular direction, which will be large on edge.
- Orientation assignement -
Each keypoint is then assigned with an principal direction. The direction is determined by creating an orientation histogram about the gradient, with the maximun being the principal direction. If there are multiple peaks in the histogram, a keypoint is created for each direction.
Following calculation are done with the direction of keypoint being aligned, and are thus rotation invariant.
- Feature point descriptor
- Keypoint descriptor -
The descriptor is an image gradient histogram. Points around the keypoints are divided into a 4*4 array of regions, a gradient histogram with 8 orientation bins is created for each region. Therefore, each keypoint is described by a 4*4*8 = 128 dimensional descriptor, along with the position and orientation.
The descriptor is invariant to affine changes in illumination.
Some desirable properties about SIFT feature make it a successful local feature. These properties includes invariant to rotation, scaling, affine distortion and change in illumination. Also, it extract large numbers of keypoints in general images, which makes it very distinctive.
訂閱:
文章 (Atom)





