Various graph mining algorithms are important for data analysis in many researches and real world applications. While these algorithms are widely applied to Tera or Peta-bytes of data, most of them do not take the scaling issue into concern at development, but left it as an implementation issue.
For large scale data manipulation, MapReduce has become an promising and widely adopted solution. The PEGASUS package tries to utilize the MapReduce structure to solve the scalability problem of existing graph mining algorithms.
GIM-V -
Instead of tackle every algorithm seperately, the PEGASUS package first transforms all the algorithms into Generalized Iterative Matrix-Vector multiplication (GIM-V) scheme, and then parallelize GIM-V. By doing so, the package can focus on the optimization of GIM-V, where every algorithm can benefit from the optimization.
The original matrix-vector multiplication can be seperated into three steps:
- combine2: multiplication of matrix elements and vector elements
- combineAll: row-wise summation of the multiplication result
- assign: overwrite the vector
- combine2: combine matrix elements and vector elements
- combineAll: row-wise combination of the result from combine2
- assign: update vector element
Parallization of GIM-V -
Parallel GIM-V can be realized using a two stage MapReduce algorithm.
- Stage1(combine2)
- Mapper - emit matrix and vector elements as an (index, value) pair
- Reducer - evaluate the predefined combine2 function
- Stage2(combineAll, Assign)
- Mapper - emit the result of Stage1
- Reducer - evaluate the predefined combineAll and Assign function
Optimization of pGIM-V -
The bottleneck of naive implementation of parallel GIM-V (pGIM-V) lies in the shuffling stage of MapReduce, where network transfer, sorting and disk I/O occur. So the key for optimization is to reduce the overhead of shuffle stage.
- GIM-V BL: Block Multiplication
Instead of treating each element seperately in MapReduce, in GIM-V BL, the basic component for MapReduce is a block of matrix or a block of vector. This means that in Stage1 of GIM-V BL, a block of matrix elements and vector elements are emitted at one time, with the following operations all carry out on a block of elements. By doing so, the number of files that need to be handled in shuffle stage is decrease by b-square and b for matrix and vector repectively, which reduces the network transfer an I/O cost. More over, because the elements can be now indexed using the block coordinate, the number of bit required to index each element is reduced, which decreases the datasize.
- GIM-V CL: Clustered Edges
Because in GIM-V BL, only blocks that contains non-zero element will be emitted in stage1 (which means all unseen blocks are taken as zero by default), so if edge clustering is carried out before starting the GIM-V algorithm, the number of blocks that need to be handled by the algorithm may be greatly decreased, since the matrix is usually sparse in real application. Notice the edge clustering only needs to be carried out once before the entire algorithm start and can be used through out all the iterations.