2011年5月18日 星期三

PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations

Motivation -
Various graph mining algorithms are important for data analysis in many researches and real world applications. While these algorithms are widely applied to Tera or Peta-bytes of data, most of them do not take the scaling issue into concern at development, but left it as an implementation issue.
 For large scale data manipulation, MapReduce has become an promising and widely adopted solution. The PEGASUS package tries to utilize the MapReduce structure to solve the scalability problem of existing graph mining algorithms.

GIM-V -
Instead of tackle every algorithm seperately, the PEGASUS package first transforms all the algorithms into Generalized Iterative Matrix-Vector multiplication (GIM-V) scheme, and then parallelize GIM-V. By doing so, the package can focus on the optimization of GIM-V, where every algorithm can benefit from the optimization.
The original matrix-vector multiplication can be seperated into three steps:
  1. combine2: multiplication of matrix elements and vector elements
  2. combineAll: row-wise summation of the multiplication result
  3. assign: overwrite the vector
In GIM-V, the three operation are generalized, following the regularization:
  1. combine2: combine matrix elements and vector elements
  2. combineAll: row-wise combination of the result from combine2
  3. assign: update vector element
By carefully design the three operations, algorithms such as PageRank, Random Walk with Restart, Diameter Estimation, Connected Components etc. can be realized using GIM-V scheme.

Parallization of GIM-V -
Parallel GIM-V can be realized using a two stage MapReduce algorithm.
  1. Stage1(combine2)
      Mapper - emit matrix and vector elements as an (index, value) pair
      Reducer - evaluate the predefined combine2 function
  2. Stage2(combineAll, Assign)
    • Mapper - emit the result of Stage1
      Reducer - evaluate the predefined combineAll and Assign function

Optimization of pGIM-V -
The bottleneck of naive implementation of parallel GIM-V (pGIM-V) lies in the shuffling stage of MapReduce, where network transfer, sorting and disk I/O occur. So the key for optimization is to reduce the overhead of shuffle stage.
  1. GIM-V BL: Block Multiplication
    Instead of treating each element seperately in MapReduce, in GIM-V BL, the basic component for MapReduce is a block of matrix or a block of vector. This means that in Stage1 of GIM-V BL, a block of matrix elements and vector elements are emitted at one time, with the following operations all carry out on a block of elements. By doing so, the number of files that need to be handled in shuffle stage is decrease by b-square and b for matrix and vector repectively, which reduces the network transfer an I/O cost. More over, because the elements can be now indexed using the block coordinate, the number of bit required to index each element is reduced, which decreases the datasize.
  2. GIM-V CL: Clustered Edges
    Because in GIM-V BL, only blocks that contains non-zero element will be emitted in stage1 (which means all unseen blocks are taken as zero by default), so if edge clustering is carried out before starting the GIM-V algorithm, the number of blocks that need to be handled by the algorithm may be greatly decreased, since the matrix is usually sparse in real application. Notice the edge clustering only needs to be carried out once before the entire algorithm start and can be used through out all the iterations.

2011年5月11日 星期三

Building a High-Level Dataflow System on top of Map-Reduce: Teh Big Experience

The need for Pig -
  1. The increasing need for processing and analyzing ultra-large-scale data
  2. The Map-Reduce framework shows to be suitable for the task, especially for scalability issue
  3. Map-Reduce, however, provides only very simple "Mapper-Reducer" framework, which leads for several problem in real application:
    • No direct support for multi-step data processing
    • No explicit support for combined processing of multiple data sets
    • All operations, even frequently and universally used ones, have to be coded from scatch
    This leads to the repeatedly implementation of same operation, which slow down the development, introduce mistakes and impede possible optimization.

The feature of Pig -
  1. High scalability inherent from Map-Reduce
  2. A general framework and is easy to use, which reduce the application development and data analysis time
  3. Black box operation provides general optimization opportunities
  4. Customizable through the UDF and pipeline interface

Critical Issue during implementation -
  • Memory control - the key idea is to avoid spill. Since automatic memory control is hard, part of the work has to be handled by user
  • Flow Control - takes memory issue, pipline and UDF interface into consideration
  • Combiner - while not a necessary part for user to implement when working on Map-Reduce structure, it contains many optimization opportunities

2011年5月4日 星期三

Where's Waldo: Matching People in Images of Crowds

Problem - Given a set of images taken from a social event where crowds of people are in each images, the work aims to find the appearance of a particular person in each image after user specified the person in one of the images.

Assumption - Rate of photo acquisition is fast compared to the rate of movement of people, which implies we can find a crowd of people at the same location in multiple images.

Challenges - Large amount of people in one image, low resolution, change of pose of target person.

The problem is solved by considering both
  1. Visual appearance
  2. Contextual Cues, which takes the co-appearance of people and time-stamp of images into account

Visual appearance -
The visual appearance is captured by part based approach using pixel color as the visual feature.
Upon input, user will specify a person using two vertical points, along with body parts of the person. The "body part model" is the color model of the body part, which is a classifier that discriminates whether a pixel belongs to the body part.
The candidate location of the person in other images is located by projecting the two points into target image. For every candidate location, a score is calculated using the learnt body part model, where the score is the number of positive classified pixel. In a particular target image, the score of the candidate location with highest score is taken as the score of the image.

Contextual Cues -
Contextual Cues are based on two observation
  1. If a group of people appears together in several images, they are likely to appear together in other images
  2. Images taken from the same place in a short period of time tend to contain the same group of people
Based on the two observation, the "affinity" of people and images, which is defined using the co-occurence and seperation in time respectively, are taken into account along with the score obtained from visual appearance to determined whether a person appears in an image.