Wednesday, May 11, 2011

Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

The need for Pig -
  1. There is a growing need to process and analyze ultra-large-scale data
  2. The Map-Reduce framework has proven suitable for the task, especially in terms of scalability
  3. Map-Reduce, however, provides only a very simple "Mapper-Reducer" framework, which leads to several problems in real applications:
    • No direct support for multi-step data processing
    • No explicit support for combined processing of multiple data sets
    • All operations, even frequently and universally used ones, have to be coded from scratch
    As a result, the same operations are implemented over and over, which slows down development, introduces mistakes and impedes possible optimization.
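
To make the first and third problems concrete, here is a minimal sketch of a two-step analysis written directly against a toy, single-process stand-in for Map-Reduce. Everything in it is an assumption made for illustration: `run_map_reduce`, the mapper and reducer names, and the sample `visits` data are hypothetical and are not taken from Hadoop or from the paper. The point is only that the second job has to be chained by hand and that even a plain count has to be written out.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """Toy in-memory stand-in for one Map-Reduce job (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)   # "shuffle": group values by key
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Job 1: count visits per user.
def visits_mapper(record):
    user, _url = record
    yield user, 1

# A plain count, written by hand and shared by both jobs.
def count_reducer(key, values):
    yield key, sum(values)

# Job 2: count how many users share each visit count.
def histogram_mapper(record):
    _user, visit_count = record
    yield visit_count, 1

visits = [("amy", "a.com"), ("bob", "b.com"), ("amy", "c.com"), ("cat", "a.com")]

# The two steps are wired together manually: the output of job 1 is
# passed as the input of job 2.
per_user = run_map_reduce(visits, visits_mapper, count_reducer)
histogram = run_map_reduce(per_user, histogram_mapper, count_reducer)
```

In a dataflow language, the same analysis would be expressed as a group-and-count followed by another group-and-count, with the chaining left to the system.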

The features of Pig -
  1. High scalability inherited from Map-Reduce
  2. A general, easy-to-use framework, which reduces application development and data analysis time
  3. Black-box operations provide general optimization opportunities
  4. Customizable through the UDF and pipeline interfaces

Critical issues during implementation -
  • Memory control - the key idea is to avoid spilling data to disk. Since automatic memory control is hard, part of the work has to be handled by the user
  • Flow control - takes memory issues, the pipeline and the UDF interface into consideration
  • Combiner - although users are not required to implement one when working with Map-Reduce, it offers many optimization opportunities
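
The combiner point can be illustrated with a small sketch. It assumes a word-count style aggregation, where partial sums can be merged safely. The function names (`map_phase`, `combine`, `reduce_phase`) and the sample `splits` are hypothetical, and the code simulates the idea in a single process instead of using the real Hadoop or Pig APIs. It compares how many records would cross the shuffle with and without local pre-aggregation.

```python
from collections import Counter, defaultdict

def map_phase(split):
    """Emit one (word, 1) pair per word in an input split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Pre-aggregate the map output of one split before the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reduce_phase(shuffled):
    """Sum all values per key after the shuffle."""
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)

# Two hypothetical input splits, each handled by its own map task.
splits = ["pig pig hadoop", "pig hadoop hadoop hadoop"]

# Records sent to the reducers without and with a combiner.
without = [pair for s in splits for pair in map_phase(s)]
with_combiner = [pair for s in splits for pair in combine(map_phase(s))]

print(len(without), len(with_combiner))
print(reduce_phase(without) == reduce_phase(with_combiner))
```

Both paths give the same totals, but the combined path shuffles fewer records. This only works because addition is associative and commutative; an aggregate without that property cannot be pre-combined this simply.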
