In the previous chapter, we discussed several widely used components in data processing. We focused on a batch processing framework, Apache Pig, and talked about its architecture. We discussed a distributed columnar store database, HBase, and also covered the distributed messaging system, Kafka, which gives you the ability to store and persist real-time events. We also looked at Apache Flume, which can help in pulling real-time logs for further processing.
In this chapter, we will talk about some of the design considerations for application processing semantics. This chapter focuses on the following points:
- Different file formats available
- Advantages of using compression codecs
- Best data ingestion practices
- Design considerations for applications
- Data governance and its importance