In this chapter, we looked at working with data at scale. Processing large datasets requires a paradigm shift in how the data is handled: techniques that work well on smaller datasets are designed to run on a single computer, and they need to be re-engineered before they can cope with data that exceeds one machine's capacity. For scalability, we turn to distributed computing; however, this introduces significant additional complexity of its own, because computation now depends on a network, where partial failures are far more common. Relying on mature, time-tested frameworks such as Apache Spark, which take care of partitioning, scheduling, and fault tolerance for us, is the key to managing these concerns.
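To make this concrete, the following is a minimal sketch of the kind of code this paradigm enables. It is not taken from the chapter: the input file, its columns, and the aggregation are hypothetical, and it assumes PySpark is installed. The point is that the same program runs unchanged on a laptop or on a cluster, because Spark, not our code, handles data partitioning, task scheduling, and recovery from failures.

```python
# A hypothetical PySpark example: the same code scales from a single
# machine to a cluster; only the master URL changes.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# "local[*]" uses all local cores; on a cluster this would be the
# cluster manager's URL instead.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("scale-example")
    .getOrCreate()
)

# Spark reads the file lazily and splits it into partitions that are
# processed in parallel across executors. "sales.csv" and its columns
# are assumed for illustration.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A distributed group-by aggregation: each executor aggregates its own
# partitions, and the partial results are then shuffled and combined.
totals = (
    df.groupBy("region")
      .agg(F.sum("amount").alias("total_amount"))
      .orderBy(F.desc("total_amount"))
)

totals.show()
spark.stop()
```

Written this way, scaling up is largely a deployment decision rather than a rewrite, which is precisely the benefit of building on a framework like Spark instead of re-engineering single-machine code by hand.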