Big data processing has become a priority for many companies, and plenty of tools and frameworks are available for it. The first widely adopted distributed processing framework was MapReduce, and many tools were later built on top of it, such as Hive and Pig. The need to process large datasets faster led to the development of Apache Spark, and for real-time processing we had Apache Storm. In this chapter, we will discuss some of the popular processing frameworks: Apache Spark, Apache Flink, and Apache Storm.
We are going to cover the following topics:
- Apache Spark architecture and its internals
- An example of running a Spark application
- Apache Flink architecture and its ecosystem
- Apache Flink APIs
- Apache Storm with Heron as its successor