Apache Spark for data processing
Apache Spark is a relatively new project (at least in the world of big data, which moves at warp speed) that integrates well with Hadoop but does not necessarily require Hadoop components to operate. It is a "fast and general engine for large-scale data processing," as described on the Spark project's welcome page. The tagline "lightning fast cluster computing" is a little catchier; we like that one better.
[Apache Spark logo]
What is Apache Spark?
Good question, glad you asked. Spark was built for distributed cluster computing, so workloads scale out across a cluster without code changes. The word "general" in that engine description is apt: it refers to the many and varied ways you can use Spark.
You can use it for ETL data processing, machine learning, graph processing, stream processing, and SQL queries over structured data. It is a boon for analytics in a distributed computing world.
It has APIs for multiple programming languages, including Scala, Java, Python, and R.