spooky's blog: Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

Wednesday, May 11, 2011

Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

The need for Pig -

There is an increasing need to process and analyze ultra-large-scale data
The Map-Reduce framework has proven suitable for the task, especially with respect to scalability
Map-Reduce, however, provides only a very simple "Mapper-Reducer" model, which leads to several problems in real applications:

No direct support for multi-step data processing
No explicit support for combined processing of multiple data sets
All operations, even frequently and universally used ones, have to be coded from scratch
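
To make these three problems concrete, here is a minimal sketch of a two-step job (join two data sets, then count) written directly in Mapper-Reducer style. It is not Hadoop code: `run_map_reduce` is a toy single-process stand-in I made up for illustration, and the `visits` / `pages` data and all function names are hypothetical.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """Toy single-process map-reduce: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Hypothetical inputs: (user, url) visits and (url, category) page info.
visits = [("amy", "a.com"), ("bob", "b.com"), ("amy", "b.com")]
pages = [("a.com", "news"), ("b.com", "sports")]

# Problem 2: one job has one input, so both data sets are tagged by hand.
tagged = [("visit", v) for v in visits] + [("page", p) for p in pages]

def join_mapper(record):
    source, row = record
    if source == "visit":
        user, url = row
        yield url, ("visit", user)
    else:
        url, category = row
        yield url, ("page", category)

# Problem 3: the join itself is hand-written, although it is a standard operation.
def join_reducer(url, values):
    users = [v for tag, v in values if tag == "visit"]
    categories = [v for tag, v in values if tag == "page"]
    for user in users:
        for category in categories:
            yield (user, category)

def count_mapper(record):
    user, category = record
    yield category, 1

def count_reducer(category, ones):
    yield (category, sum(ones))

# Problem 1: the two steps are separate jobs that are chained manually.
joined = run_map_reduce(tagged, join_mapper, join_reducer)
counts = run_map_reduce(joined, count_mapper, count_reducer)
print(counts)
```

Even for this small task, most of the code is plumbing (tagging, joining, chaining) rather than the actual analysis.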

The features of Pig -

High scalability inherited from Map-Reduce
A general framework that is easy to use, which reduces application development and data analysis time
Black-box operations provide general optimization opportunities
Customizable through the UDF (user-defined function) and pipeline interfaces
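
As a rough illustration of what "built-in operations plus UDFs" buys you, here is a small sketch of a chainable dataflow style. This is not Pig Latin and not Pig's API: the `Relation` class, its methods, and the sample data are all hypothetical, and everything runs in memory rather than on Map-Reduce.

```python
from collections import defaultdict

class Relation:
    """Hypothetical in-memory relation with reusable operators (not Pig's API)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def join(self, other, left_key, right_key):
        # Built-in equi-join: index the right side, then probe with the left.
        index = defaultdict(list)
        for right in other.rows:
            index[right[right_key]].append(right)
        return Relation(
            left + right
            for left in self.rows
            for right in index[left[left_key]]
        )

    def apply(self, udf):
        # UDF hook: the user supplies only the per-row logic.
        return Relation(udf(row) for row in self.rows)

    def group_count(self, key):
        # Built-in group-and-count on one column, sorted by group key.
        counts = defaultdict(int)
        for row in self.rows:
            counts[row[key]] += 1
        return Relation(sorted(counts.items()))

visits = Relation([("amy", "a.com"), ("bob", "b.com"), ("amy", "b.com")])
pages = Relation([("a.com", "news"), ("b.com", "sports")])

result = (
    visits.join(pages, left_key=1, right_key=0)
    .apply(lambda row: (row[0], row[3].upper()))  # UDF: keep user, upper-case category
    .group_count(key=1)
)
print(result.rows)
```

The user writes only the lambda; the join and grouping are operators whose meaning the system knows, so in a real system the implementation behind them can be optimized without touching user code.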

Critical issues during implementation -

Memory control - the key idea is to avoid spilling to disk. Since automatic memory control is hard, part of the work has to be handled by the user
Flow control - takes memory issues, the pipeline, and the UDF interface into consideration
Combiner - while not something the user is required to implement when working with Map-Reduce, it offers many optimization opportunities
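
To show why the combiner matters, here is a minimal sketch of computing a per-key average with and without map-side combining. The trick is to ship partial `(sum, count)` pairs instead of raw values, which works because those pairs can be merged in any order. The function names, the `splits` data, and the `shipped` counter are my own assumptions for illustration, not code from the paper or from Hadoop.

```python
from collections import defaultdict

def map_phase(record):
    # Emit a partial aggregate (sum, count) for each input value.
    key, value = record
    yield key, (value, 1)

def combine(partials):
    # Merge partial (sum, count) pairs into one pair.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)

def finalize(partials):
    # Reduce side: merge whatever arrived and compute the average.
    total, count = combine(partials)
    return total / count

def run(splits, use_combiner):
    shuffled = defaultdict(list)
    shipped = 0  # number of records sent from map side to reduce side
    for split in splits:
        local = defaultdict(list)
        for record in split:
            for key, partial in map_phase(record):
                local[key].append(partial)
        for key, partials in local.items():
            out = [combine(partials)] if use_combiner else partials
            shuffled[key].extend(out)
            shipped += len(out)
    result = {key: finalize(partials) for key, partials in shuffled.items()}
    return result, shipped

# Two hypothetical input splits, each handled by one map task.
splits = [
    [("a", 1), ("a", 3), ("b", 10)],
    [("a", 5), ("b", 20), ("b", 30)],
]
with_combiner, shipped_with = run(splits, use_combiner=True)
without_combiner, shipped_without = run(splits, use_combiner=False)
print(with_combiner, shipped_with, shipped_without)
```

Both runs give the same averages, but the combined run ships one record per key per split instead of one per input value, which is where the savings come from on large inputs.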
