Data coming from a source is often raw, meaning that its format can't be used as is to train or test our models. Before using it, we need to tidy it up. This cleanup is done through one or more transformations that are applied before the data is passed as input to a given model.
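To make the idea of a cleanup done through several transformations concrete, here is a minimal sketch in plain Java. It does not use the DataVec or Spark APIs, and the six-column log layout (date, time, method, path, status, bytes), the "-" missing-value marker, and the RawLogTidy and Hit names are all assumptions made up for illustration; they are not the format of any real dataset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RawLogTidy {

    /** Tidy record holding typed fields only (hypothetical layout, for illustration). */
    static final class Hit {
        final String method;
        final String path;
        final int status;
        final long bytes;

        Hit(String method, String path, int status, long bytes) {
            this.method = method;
            this.path = path;
            this.status = status;
            this.bytes = bytes;
        }
    }

    /** Applies the cleanup steps in order and returns the tidy records. */
    static List<Hit> tidy(List<String> rawLines) {
        List<Hit> hits = new ArrayList<>();
        for (String line : rawLines) {
            String trimmed = line.trim();
            // Step 1: filter out blank lines and '#' header/comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Step 2: split the raw text into columns.
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 6) {
                continue; // unexpected column count: skip the row rather than guess
            }
            // Step 3: replace the assumed missing-value marker "-" with 0.
            String bytesText = cols[5].equals("-") ? "0" : cols[5];
            // Step 4: drop date and time (columns 0 and 1) and convert text to numbers.
            hits.add(new Hit(cols[2], cols[3],
                    Integer.parseInt(cols[4]), Long.parseLong(bytesText)));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up raw lines in the assumed layout.
        List<String> raw = Arrays.asList(
                "#Fields: date time method path status bytes",
                "2024-01-05 10:15:02 GET /index.html 200 5120",
                "2024-01-05 10:15:07 GET /missing.png 404 -",
                "");
        for (Hit h : tidy(raw)) {
            System.out.println(h.method + " " + h.path + " " + h.status + " " + h.bytes);
        }
    }
}
```

Each step here (filtering rows, splitting columns, replacing missing values, removing columns, converting types) is the kind of operation that a transformation library lets you declare once and apply to a whole dataset, rather than hand-coding it in a loop as this sketch does.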
For data transformation purposes, the DL4J DataVec library and Spark provide several facilities. Some of the concepts described in this section have already been explored in the Data ingestion through DataVec and transformation through Spark section, but here we are going to work through a more complex use case.
To understand how to use DataVec for transformation purposes, let's build a Spark application for web traffic log analysis. The dataset used is generally available for download at the MonitorWare website (http...