Managing data using TFRecords
In this section, we will demonstrate how to save image data from a Spark DataFrame to TFRecords and load it with TensorFlow in Azure Databricks. As our example, we will use the flowers dataset available in the Databricks filesystem, which contains flower photos stored under five subdirectories, one per class.
We will load the flowers Delta table, which holds the preprocessed flowers dataset. The table was built from the image files using the binary file data source, and we will read it as a Spark DataFrame. We will then use this data to demonstrate how you can save a Spark DataFrame to TFRecords:
- As the first step, we will load the data using PySpark:
    from pyspark.sql.functions import col
    import tensorflow as tf

    spark_df = spark.read.format("delta").load("/databricks-datasets/flowers/delta") \
        .select(col("content"), col("label_index")) \
        .limit(100)
- Next, we will save the loaded data to TFRecord-type files:
path =...
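To make the record layout concrete, here is a minimal sketch of how rows with a binary `content` field and an integer `label_index` field could be written to a TFRecord file and read back using TensorFlow's own writer. This is a small-data illustration that runs on a single machine, not the Spark-based save used in this section. The helper names (`to_example`, `write_tfrecords`, `read_tfrecords`), the sample byte strings, and the temporary file path are all assumptions made for the example.

```python
import os
import tempfile

import tensorflow as tf

# Schema assumed for illustration: one bytes field and one int64 field,
# mirroring the "content" and "label_index" columns selected earlier.
FEATURE_SPEC = {
    "content": tf.io.FixedLenFeature([], tf.string),
    "label_index": tf.io.FixedLenFeature([], tf.int64),
}


def to_example(content: bytes, label_index: int) -> tf.train.Example:
    """Wrap one (content, label_index) row in a tf.train.Example."""
    feature = {
        "content": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[content])),
        "label_index": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label_index])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def write_tfrecords(rows, path: str) -> None:
    """Serialize an iterable of (content, label_index) rows to one file."""
    with tf.io.TFRecordWriter(path) as writer:
        for content, label_index in rows:
            writer.write(to_example(content, label_index).SerializeToString())


def read_tfrecords(path: str) -> tf.data.Dataset:
    """Return a dataset of dicts parsed according to FEATURE_SPEC."""
    dataset = tf.data.TFRecordDataset(path)
    return dataset.map(
        lambda record: tf.io.parse_single_example(record, FEATURE_SPEC))


if __name__ == "__main__":
    # Hypothetical stand-in rows; real rows would carry encoded image bytes.
    sample_rows = [(b"fake-image-bytes-0", 0), (b"fake-image-bytes-1", 3)]
    # Hypothetical local path, used only for this demonstration.
    demo_path = os.path.join(tempfile.mkdtemp(), "flowers_demo.tfrecord")
    write_tfrecords(sample_rows, demo_path)
    for parsed in read_tfrecords(demo_path):
        print(parsed["label_index"].numpy(), len(parsed["content"].numpy()))
```

Writing row by row on the driver like this only suits small samples, such as the 100 rows selected above. For larger DataFrames, a distributed writer is the more appropriate choice.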