Versioning your data with Pachyderm
Data is the fundamental component for building your models. Without a retrievable version of the dataset a model was trained on, you cannot repeat a past training run and expect the same results. Data versioning lets you compare datasets and prevents the confusion that arises when data changes unnoticed. Together, these make a reproducible model training workflow possible. To learn more about Pachyderm, refer to the Pachyderm documentation at https://docs.pachyderm.com/.
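To make the idea of a retrievable, comparable dataset version concrete, here is a minimal conceptual sketch that uses only the Python standard library. It is not Pachyderm's API and does not describe how Pachyderm works internally. The dataset_version helper, the two in-memory snapshots, and their column names are all hypothetical, made up for illustration. The sketch derives an identifier from the dataset's bytes, so identical data always maps to the same identifier and any change produces a different one.

```python
import hashlib


def dataset_version(data: bytes) -> str:
    """Return a short identifier derived from the dataset's content (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()[:12]


# Two made-up snapshots of the same CSV file; the second has one edited value.
snapshot_v1 = b"alcohol,quality\n9.4,5\n9.8,5\n"
snapshot_v2 = b"alcohol,quality\n9.4,5\n9.8,6\n"

id_v1 = dataset_version(snapshot_v1)
id_v2 = dataset_version(snapshot_v2)

print("v1:", id_v1)
print("v2:", id_v2)
# Hashing the same bytes again yields the same identifier.
print("same data, same id:", dataset_version(snapshot_v1) == id_v1)
# A single changed value yields a different identifier.
print("changed data, new id:", id_v1 != id_v2)
```

If you record such an identifier alongside a trained model, you can later check whether the data you are about to train on is the same data that produced the earlier model. A data versioning system goes further by also storing each version so that it can be retrieved.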
To work with Pachyderm, you can use either the Pachyderm command-line tool, pachctl, or the Pachyderm Python library, which we will use in this book.
Before we start, let’s create a new bucket in your MinIO server. We will use this bucket to store the datasets. Let’s call this bucket raw-data. Then, upload the wine.csv file available in the Git repository of this book into this bucket. For the purpose of this exercise, set the raw-data bucket...