Spark RDD and dataframes
In this section, our focus turns to data and how Apache Spark represents data and organizes data. Here, we will provide an introduction to the Apache Spark RDD and Apache Spark dataframes.
After this section, readers will master these two fundamental Spark concepts, RDD and Spark dataframe, and be ready to utilize them for Machine Learning projects.
Spark RDD
Apache Spark's primary data abstraction is in the form of a distributed collection of items, which is called Resilient Distributed Dataset (RDD). RDD is Apache Spark's key innovation, which makes its computing faster and more efficient than others.
Specifically, an RDD is an immutable collection of objects, which spreads across a cluster. It is statically typed, for example RDD[T] has objects of type T. There are RDD of strings, RDD of integers, and RDD of objects.
On the other hand, RDDs:
- Are collections of objects across a cluster with user controlled partitioning
- Are built via parallel transformations like
map
andfilter
That is, an RDD is physically distributed across a cluster, but manipulated as one logical entity. RDDs on Spark have fault tolerant properties such that they can be automatically rebuilt on failure.
New RDDs can be created from Hadoop Input Formats (such as HDFS files) or by transforming other RDDs.
To create RDDs, users can either:
- Distribute a collection of objects from the driver program (using the parallelize method of the Spark context)
- Load an external dataset
- Transform an existing RDD
Spark's team call the above two types of RDD operations action and transformation.
RDDs can be operated by actions, which return values, or by transformations, which return pointers to new RDDs. Some examples of RDD actions are collect
, count
and take
.
Transformations are lazy evaluations. Some examples of RDD transformations are map
, filter
, and join
.
RDD actions and transformations may be combined to form complex computations.
Note
To learn more about RDD, please read the article at
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Spark dataframes
A Spark dataframe is a distributed collection of data as organized by columns, actually a distributed collection of data as grouped into named columns, that is, an RDD with a schema. In other words, Spark dataframe is an extension of Spark RDD.
Data frame = RDD where columns are named and can be manipulated by name instead of by index value.
A Spark dataframe is conceptually equivalent to a dataframe in R, and is similar to a table in a relational database, which helped Apache Spark to be quickly accepted by the machine learning community. With Spark dataframes, users can directly work with data elements like columns, which are not available when working with RDDs. With data scheme knowledge on hand, users can also apply their familiar SQL types of data re-organization techniques to data. Spark dataframes can be built from many kinds of raw data such as structured relational data files, Hive tables, or existing RDDs.
Apache Spark has built a special dataframe API and a Spark SQL to deal with Spark dataframes. The Spark SQL and Spark dataframe API are both available for Scala, Java, Python, and R. As an extension to the existing RDD API, the DataFrames API features:
- Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
- Support for a wide array of data formats and storage systems
- State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
- Seamless integration with all big data tooling and infrastructure via Spark
The Spark SQL works with Spark DataFrame very well, which allows users to do ETL easily, and also to work on subsets of any data easily. Then, users can transform them and make them available to other users including R users. Spark SQL can also be used alongside HiveQL, and runs very fast. With Spark SQL, users write less code as well, a lot less than working with Hadoop, and also less than working directly on RDDs.
Note
For more, please visit http://spark.apache.org/docs/latest/sql-programming-guide.html
Dataframes API for R
A dataframe is an essential element for machine learning programming. Apache Spark has made a dataframe API available for R as well as for Java and Python, so that users can operate Spark dataframes easily in their familiar environment with their familiar language. In this section, we provide a simple introduction to operating Spark dataframes, with some simple examples for R to start leading our readers into actions.
The entry point into all relational functionality in Apache Spark is its SQLContext
class, or one of its descendents. To create a basic SQLContext
, all users need is a SparkContext command as below:
sqlContext <- sparkRSQL.init(sc)
To create a Spark dataframe, users may perform the following:
sqlContext <- SQLContext(sc) df <- jsonFile(sqlContext, "examples/src/main/resources/people.json") # Displays the content of the DataFrame to stdout showDF(df)
For Spark dataframe operations, the following are some examples:
sqlContext <- sparkRSQL.init(sc) # Create the DataFrame df <- jsonFile(sqlContext, "examples/src/main/resources/people.json") # Show the content of the DataFrame showDF(df) ## age name ## null Michael ## 30 Andy ## 19 Justin # Print the schema in a tree format printSchema(df) ## root ## |-- age: long (nullable = true) ## |-- name: string (nullable = true) # Select only the "name" column showDF(select(df, "name")) ## name ## Michael ## Andy ## Justin # Select everybody, but increment the age by 1 showDF(select(df, df $name, df$age + 1)) ## name (age + 1) ## Michael null ## Andy 31 ## Justin 20 # Select people older than 21 showDF(where(df, df$age > 21)) ## age name ## 30 Andy # Count people by ageā© showDF(count(groupBy(df, "age"))) ## age count ## null 1 ## 19 1 ## 30 1
Note
For more info, please visit http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes
ML frameworks, RM4Es and Spark computing
In this section, we discuss machine learning frameworks with RM4Es as one of its examples, in relation to Apache Spark computing.
After this section, readers will master the concept of machine learning frameworks and some examples, and then be ready to combine them with Spark computing for planning and implementing machine learning projects.
ML frameworks
As discussed in earlier sections, Apache Spark computing is very different from Hadoop MapReduce. Spark is faster and easier to use than Hadoop MapReduce. There are many benefits to adopting Apache Spark computing for machine learning.
However, all the benefits for machine learning professionals will materialize only if Apache Spark can enable good ML frameworks. Here, an ML framework means a system or an approach that combines all the ML elements including ML algorithms to make ML most effective to its users. And specifically, it refers to the ways that data is represented and processed, how predictive models are represented and estimated, how modeling results are evaluated, and are utilized. From this perspective, ML Frameworks are different from each other, for their handling of data sources, conducting data pre-processing, implementing algorithms, and for their support for complex computation.
There are many ML frameworks, as there are also various computing platforms supporting these frameworks. Among the available ML frameworks, the frameworks stressing iterative computing and interactive manipulation are considered among the best, because these features can facilitate complex predictive model estimation and good researcher-data interaction. Nowadays, good ML frameworks also need to cover big data capabilities or fast processing at scale, as well as fault tolerance capabilities. Good frameworks always include a large number of machine learning algorithms and statistical tests ready to be used.
As mentioned in previous sections, Apache Spark has excellent iterative computing performance and is highly cost-effective, thanks to in-memory data processing. It's compatible with all of Hadoop's data sources and file formats and, thanks to friendly APIs that they are available in several languages, it also has a faster learning curve. Apache Spark even includes graph processing and machine-learning capabilities. For these reasons, Apache Spark based ML frameworks are favored by ML professionals.
However, Hadoop MapReduce is a more mature platform and it was built for batch processing. It can be more cost-effective than Spark, for some big data that doesn't fit in memory and also due to the greater availability of experienced staff. Furthermore, the Hadoop MapReduce ecosystem is currently bigger thanks to many supporting projects, tools, and cloud services.
But even if Spark looks like the big winner, the chances are that ML professionals won't use it on its own, ML professionals may still need HDFS to store the data and may want to use HBase, Hive, Pig, Impala, or other Hadoop projects. For many cases, this means ML professionals still need to run Hadoop and MapReduce alongside Apache Spark for a full Big Data package.
RM4Es
In a previous section, we have had some general discussion about machine learning frameworks. Specifically, a ML framework covers how to deal with data, analytical methods, analytical computing, results evaluation, and results utilization, which RM4Es represents nicely as a framework. The RM4Es (Research Methods Four Elements) is a good framework to summarize Machine Learning components and processes. The RM4Es include:
- Equation: Equations are used to represent the models for our research
- Estimation: Estimation is the link between equations (models) and the data used for our research
- Evaluation: Evaluation needs to be performed to assess the fit between models and the data
- Explanation: Explanation is the link between equations (models) and our research purposes. How we explain our research results often depends on our research purposes and also on the subject we are studying
The RM4Es are the key four aspects that distinguish one machine learning method from another. The RM4Es are sufficient to represent an ML status at any given moment. Furthermore, using RM4Es can easily and sufficiently represent ML workflows.
Related to what we discussed so far, Equation is like ML libraries, Estimation represents how computing is done, Evaluation is about how to tell whether a ML is better, and, as for iterative computer, whether we should continue or stop. Explanation is also a key part for ML as our goal is to turn data into insightful results that can be used.
Per the above, a good ML framework needs to deal with data abstraction and data pre-processing at scale, and also needs to deal with fast computing, interactive evaluation at scale and speed, as well as easy results interpretation and deployment.
The Spark computing framework
Earlier in the chapter, we discussed how Spark computing supports iterative ML computing. After reviewing machine learning frameworks and how Spark computing relates to ML frameworks, we are ready to understand more about why Spark computing should be selected for ML.
Spark was built to serve ML and data science, to make ML at scale and ML deployment easy. As discussed, Spark's core innovation on RDDs enables fast and easy computing, with good fault tolerance.
Spark is a general computing platform, and its program contains two programs: a driver program and a worker program.
To program, developers need to write a driver program that implements the high-level control flow of their application and also launches various operations in parallel. All the worker programs developed will run on cluster nodes or in local threads, and RDDs operate across all workers.
As mentioned, Spark provides two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets (invoked by passing a function to apply on a dataset).
In addition, Spark supports two restricted types of shared variables:
- Broadcast variables: If a large read-only piece of data (e.g., a lookup table) is used in multiple parallel operations, it is preferable to distribute it to the workers only once instead of packaging it with every closure.
- Accumulators: These are variables that workers can only add to using an associative operation, and that only the driver can read. They can be used to implement counters as in MapReduce and to provide a more imperative syntax for parallel sums. Accumulators can be defined for any type that has an add operation and a zero value. Due to their add-only semantics, they are easy to make fault-tolerant.
With all the above, the Apache Spark computing framework is capable of supporting various machine learning frameworks that need fast parallel computing with fault tolerance.
Note
See http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf for more.