Writing to an HDFS cluster with Gobblin
Gobblin is a universal data ingestion framework for the extraction, transformation, and loading (ETL) of large volumes of data from a variety of data sources, such as databases, REST APIs, FTP/SFTP servers, and file systems, onto Hadoop.
Gobblin also handles the routine operations required by data ETL, such as job/task scheduling, state management, task partitioning, error handling, data quality checking, and data publishing.
Some features that make Gobblin particularly attractive are automatic scalability, extensibility, fault tolerance, data quality assurance, and the ability to handle data model evolution.
Getting ready
For this recipe, you need a Kafka cluster up and running, as well as an HDFS cluster into which the data will be written.
Gobblin must also be installed; follow the instructions on this page: http://gobblin.readthedocs.io/en/latest/Getting-Started.
How to do it...
- Edit a file called kafkagobblin.conf with the following contents; these properties tell Gobblin to read from Kafka and write into HDFS:
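A minimal sketch of such a configuration is shown below, adapted from the Kafka-to-HDFS quick-start example in the Gobblin documentation. The job name, topic name, broker address, HDFS URI, and directory paths are placeholder assumptions you should adjust for your cluster, and the source/writer class prefixes depend on the Gobblin version you installed (pre-Apache releases use the gobblin.* package prefix instead of org.apache.gobblin.*):

```
# Job metadata (names are placeholders)
job.name=KafkaToHdfsQuickStart
job.group=GobblinKafka
job.description=Pull data from Kafka topics and write it to HDFS
job.lock.enabled=false

# Kafka source: brokers to read from, topics to pull, and where to start
kafka.brokers=localhost:9092
topic.whitelist=test-topic
bootstrap.with.offset=earliest
source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=org.apache.gobblin.extract.kafka

# Writer and publisher: emit plain-text files onto HDFS, one directory per topic
writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher

# HDFS locations (adjust the NameNode URI and paths to your environment)
fs.uri=hdfs://localhost:9000
writer.fs.uri=hdfs://localhost:9000
state.store.fs.uri=hdfs://localhost:9000
state.store.dir=/gobblin-kafka/state-store
data.publisher.final.dir=/gobblin-kafka/job-output

mr.job.max.mappers=1
```

Depending on your deployment, the job can then be launched with Gobblin's standalone or MapReduce launcher, as described in the getting-started guide linked in the previous section.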