Spark and Kafka
Spark has a long history of supporting Kafka in both streaming and batch processing. Here, we will go over some of the Structured Streaming Kafka-related APIs.
The following is a streaming read from a Kafka cluster; it returns a streaming DataFrame:
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "<host>:<port>,<host>:<port>") \
  .option("subscribe", "<topic>") \
  .load()
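A small sketch of how this source might be wrapped in practice. The helper names, the server list, and the topic name are illustrative assumptions, not part of the Spark API; note that the Kafka source delivers `key` and `value` as binary, so they are typically cast to strings before use:

```python
# Sketch: build the option map for a Structured Streaming Kafka source.
# `kafka_stream_options` and `read_kafka_stream` are hypothetical helpers.
def kafka_stream_options(bootstrap_servers, topic):
    """Return the options for spark.readStream.format("kafka")."""
    return {
        # Comma-separated host:port pairs, as Kafka expects.
        "kafka.bootstrap.servers": ",".join(bootstrap_servers),
        "subscribe": topic,
    }

def read_kafka_stream(spark, bootstrap_servers, topic):
    """Assumes an active SparkSession and a reachable Kafka cluster."""
    reader = spark.readStream.format("kafka")
    for key, value in kafka_stream_options(bootstrap_servers, topic).items():
        reader = reader.option(key, value)
    # key/value arrive as binary; cast them to strings for downstream use.
    return reader.load().selectExpr(
        "CAST(key AS STRING)", "CAST(value AS STRING)"
    )
```

The option-building helper is plain Python, so it can be unit-tested without a cluster; only `read_kafka_stream` needs a live SparkSession and broker.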
Conversely, if you want to do a true batch process, you can also read from Kafka. Keep in mind that we have already covered techniques for creating a streaming context that uses a batch style to avoid rereading messages:
df = spark \
  .read \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "<host>:<port>,<host>:<port>") \
  ...
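For a batch read, Spark's Kafka source also accepts `startingOffsets` and `endingOffsets` options to bound which messages are consumed. A minimal sketch, assuming a running SparkSession and broker; the helper names and defaults are illustrative:

```python
# Sketch: options for a bounded (batch) Kafka read.
# `kafka_batch_options` and `read_kafka_batch` are hypothetical helpers.
def kafka_batch_options(bootstrap_servers, topic,
                        starting="earliest", ending="latest"):
    """Return the options for spark.read.format("kafka")."""
    return {
        "kafka.bootstrap.servers": ",".join(bootstrap_servers),
        "subscribe": topic,
        # Bound the batch: read from `starting` up to `ending`.
        "startingOffsets": starting,
        "endingOffsets": ending,
    }

def read_kafka_batch(spark, bootstrap_servers, topic):
    """Assumes an active SparkSession and a reachable Kafka cluster."""
    reader = spark.read.format("kafka")
    for key, value in kafka_batch_options(bootstrap_servers, topic).items():
        reader = reader.option(key, value)
    return reader.load()
```

Because the offsets are fixed when the read starts, the resulting DataFrame is a static snapshot rather than an unbounded stream.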