- There is an increasing need to process and analyze ultra-large-scale data
- The Map-Reduce framework has proven suitable for the task, especially with regard to scalability
- Map-Reduce, however, provides only a very simple "Mapper-Reducer" framework, which leads to several problems in real applications:
- No direct support for multi-step data processing
- No explicit support for combined processing of multiple data sets
- All operations, even frequently and universally used ones, have to be coded from scratch
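
To make these limitations concrete, here is a minimal sketch of what a join followed by an aggregation looks like when written by hand in the Mapper-Reducer style. It is a toy, single-process illustration, not real Hadoop code: `run_map_reduce`, the `visits` and `ranks` data sets, and all mapper and reducer names are hypothetical and were made up for this example.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Toy in-memory stand-in for one Map-Reduce job (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # stands in for the shuffle phase
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Two made-up data sets that we want to process together.
visits = [("alice", "a.com"), ("bob", "b.com"), ("alice", "b.com")]
ranks = [("a.com", 0.9), ("b.com", 0.4)]


def join_mapper(record):
    # Records must be tagged by hand so the reducer can tell the sources apart.
    tag, row = record
    if tag == "visit":
        user, url = row
        yield url, ("visit", user)
    else:
        url, rank = row
        yield url, ("rank", rank)


def join_reducer(url, values):
    users = [v for tag, v in values if tag == "visit"]
    url_ranks = [v for tag, v in values if tag == "rank"]
    for user in users:
        for rank in url_ranks:
            yield (user, rank)


def avg_mapper(record):
    user, rank = record
    yield user, rank


def avg_reducer(user, user_ranks):
    yield (user, sum(user_ranks) / len(user_ranks))


# Step 1: join the two data sets. Step 2: a second job, chained by hand.
tagged = [("visit", v) for v in visits] + [("rank", r) for r in ranks]
joined = run_map_reduce(tagged, join_mapper, join_reducer)
average_rank = run_map_reduce(joined, avg_mapper, avg_reducer)
```

Even in this toy form, the join logic, the source tagging, and the chaining of the two jobs all have to be written by the user, which is the kind of boilerplate the list above describes.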
Features of Pig -
- High scalability, inherited from Map-Reduce
- A general framework that is easy to use, which reduces application development and data analysis time
- Black-box operations provide general optimization opportunities
- Customizable through the user-defined function (UDF) and pipeline interfaces
Critical issues during implementation -
- Memory control - the key idea is to avoid spilling to disk. Since automatic memory control is hard, part of the work has to be handled by the user
- Flow control - takes memory issues, the pipeline, and the UDF interface into consideration
- Combiner - while the user is not required to implement one when working with Map-Reduce, it offers many optimization opportunities
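
As a rough illustration of why the combiner matters, the sketch below counts words with and without a local pre-aggregation step. It is a single-process toy under stated assumptions, not Pig or Hadoop code: `map_words`, `combine`, `reduce_counts`, and the `splits` input are hypothetical names invented for this example.

```python
from collections import Counter, defaultdict


def map_words(line):
    """Mapper: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_counts(shuffled):
    """Reducer: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in shuffled:
        totals[word] += count
    return dict(totals)


# One inner list per (hypothetical) mapper input split.
splits = [["pig pig latin", "pig"], ["latin pig"]]

# Without a combiner, every (word, 1) pair is sent to the reducers.
without_combiner = [
    pair for split in splits for line in split for pair in map_words(line)
]

# With a combiner, each mapper sends at most one pair per distinct word.
with_combiner = [
    pair
    for split in splits
    for pair in combine(p for line in split for p in map_words(line))
]

print(len(without_combiner), len(with_combiner))
print(reduce_counts(without_combiner) == reduce_counts(with_combiner))
```

The final result is the same either way, but fewer records cross the shuffle when the combiner runs. This only works because summation is associative and commutative; an operation without those properties could not be pre-aggregated this simply.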