Getting started with MLflow
Next, we will install MLflow on your machine and prepare it for use in this chapter. You will have two options when it comes to installing MLflow. The first option is through a Docker container-based recipe provided in the repository of the book: https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Mlflow.git.
To install it, follow these instructions:
- Use the following commands to install the software:
$ git clone https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Mlflow.git
$ cd Machine-Learning-Engineering-with-Mlflow
$ cd Chapter01
- The Docker image is very simple at this stage: it simply contains MLflow and sklearn, the main tools to be used in this chapter of the book. For illustrative purposes, you can look at the content of the Dockerfile:
FROM jupyter/scipy-notebook
RUN pip install mlflow
RUN pip install sklearn
- To build the image, you should now run the following command:
docker build -t chapter_1_homlflow .
- Right after building the image, you can run the ./run.sh command:
./run.sh
Important note
It is important to ensure that you have the latest version of Docker installed on your machine.
- Open your browser to http://localhost:8888 and you should be able to navigate to the Chapter01 folder.
In the following section, we will be developing our first model with MLflow in the Jupyter environment created in the previous set of steps.
Developing your first model with MLflow
For simplicity, in this section we will use the built-in sample datasets in sklearn, the ML library that we will use initially to explore MLflow features. For this section, we will choose the famous Iris dataset to train a multi-class classifier using MLflow.
The Iris dataset (one of sklearn's built-in datasets, available from https://scikit-learn.org/stable/datasets/toy_dataset.html) contains the following elements as features: sepal length, sepal width, petal length, and petal width. The target variable is the class of the iris: Iris Setosa, Iris Versicolour, or Iris Virginica:
- Load the sample dataset:
from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.4)
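As a quick, optional check (this snippet is not part of the book's listing), you can print the feature and class names exposed by the dataset object to confirm what was loaded:
# Optional sanity check: the load_iris() Bunch object exposes descriptive names.
print(dataset.feature_names)   # ['sepal length (cm)', 'sepal width (cm)', ...]
print(dataset.target_names)    # ['setosa', 'versicolor', 'virginica']
print(X_train.shape, X_test.shape)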
- Next, let's train your model.
Training a simple machine learning model with a framework such as scikit-learn involves instantiating an estimator such as LogisticRegression and calling the fit command to execute training over the Iris dataset built into scikit-learn:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
The preceding lines of code are just a small portion of the ML Engineering process. As will be demonstrated, a non-trivial amount of code needs to be created in order to productionize and make sure that the preceding training code is usable and reliable. One of the main objectives of MLflow is to aid in the process of setting up ML systems and projects. In the following sections, we will demonstrate how MLflow can be used to make your solutions robust and reliable.
- Then, we will add MLflow.
With a few more lines of code, you should be able to start your first MLflow interaction. In the following code listing, we start by importing the mlflow module, followed by the LogisticRegression class in scikit-learn. You can use the accompanying Jupyter notebook to run the next section:
import mlflow
from sklearn.linear_model import LogisticRegression

mlflow.sklearn.autolog()

with mlflow.start_run():
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
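Autologging covers the common values for you; if you also want to record values of your own, MLflow exposes explicit logging calls such as mlflow.log_metric. The following variation is a minimal sketch (not part of the book's listing) that assumes the same X_train, X_test, y_train, and y_test variables as before:
import mlflow
from sklearn.linear_model import LogisticRegression

mlflow.sklearn.autolog()

with mlflow.start_run():
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    # Record an extra metric alongside whatever autologging captures.
    mlflow.log_metric("test_accuracy", clf.score(X_test, y_test))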
The mlflow.sklearn.autolog() instruction enables you to automatically log the experiment in the local directory. It captures the metrics produced by the underlying ML library in use. MLflow Tracking is the module responsible for handling metrics and logs. By default, the metadata of an MLflow run is stored in the local filesystem.
- If you run the following command in the accompanying notebook's root directory, you should see the files below in your home directory, including a new mlruns folder:
$ ls -l
total 24
-rw-r--r-- 1 jovyan users 12970 Oct 14 16:30 chapther_01_introducing_ml_flow.ipynb
-rw-r--r-- 1 jovyan users    53 Sep 30 20:41 Dockerfile
drwxr-xr-x 4 jovyan users   128 Oct 14 16:32 mlruns
-rwxr-xr-x 1 jovyan users    97 Oct 14 13:20 run.sh
The mlruns folder is generated alongside your notebook folder and contains all the experiments executed by your code in the current context.
The mlruns folder will contain a folder with a sequential number identifying your experiment. The outline of the folder will appear as follows:
├── 46dc6db17fb5471a9a23d45407da680f
│   ├── artifacts
│   │   └── model
│   │       ├── MLmodel
│   │       ├── conda.yaml
│   │       ├── input_example.json
│   │       └── model.pkl
│   ├── meta.yaml
│   ├── metrics
│   │   └── training_score
│   ├── params
│   │   ├── C
│   │   …..
│   └── tags
│       ├── mlflow.source.type
│       └── mlflow.user
└── meta.yaml
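You do not need to browse this folder tree by hand: the same tracking data can be queried programmatically. The following optional snippet is a minimal sketch, assuming it is executed from the notebook directory so that the default ./mlruns store is picked up:
import mlflow

# Query the default experiment (ID "0") backed by the local ./mlruns folder.
runs = mlflow.search_runs(experiment_ids=["0"])
# Each row is one run; autologged values appear as metrics.* and params.* columns.
print(runs[["run_id", "status", "metrics.training_score"]])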
So, with very little effort, we have a lot of traceability available to us, and a good foundation to improve upon.
Your experiment is identified by a UUID, which is 46dc6db17fb5471a9a23d45407da680f in the preceding sample. At the root of the directory, you have a yaml file named meta.yaml, which contains the following content:
artifact_uri: file:///home/jovyan/mlruns/0/518d3162be7347298abe4c88567ca3e7/artifacts
end_time: 1602693152677
entry_point_name: ''
experiment_id: '0'
lifecycle_stage: active
name: ''
run_id: 518d3162be7347298abe4c88567ca3e7
run_uuid: 518d3162be7347298abe4c88567ca3e7
source_name: ''
source_type: 4
source_version: ''
start_time: 1602693152313
status: 3
tags: []
user_id: jovyan
This is the basic metadata of your experiment, with information including the start time, end time, identification of the run (run_id and run_uuid), the lifecycle stage, and the user who executed the experiment. The settings are based on a default run, but they provide valuable and readable information regarding your experiment:
├── 46dc6db17fb5471a9a23d45407da680f
│   ├── artifacts
│   │   └── model
│   │       ├── MLmodel
│   │       ├── conda.yaml
│   │       ├── input_example.json
│   │       └── model.pkl
The model.pkl file contains a serialized version of the model. For a scikit-learn model, this is a pickled binary version of the trained model object. With autologging, the metrics are taken from the underlying machine learning library in use. The default packaging strategy is based on a conda.yaml file, which lists the dependencies required to deserialize and run the model.
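Because the model was logged with the sklearn flavor, you can load the serialized model straight back into Python; a minimal sketch, assuming you substitute a run ID from your own mlruns folder:
import mlflow.sklearn

# Replace the run ID below with one taken from your own mlruns folder.
model = mlflow.sklearn.load_model("runs:/46dc6db17fb5471a9a23d45407da680f/model")
print(model.predict(X_test[:5]))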
The MLmodel file is the main definition of the model from an MLflow project's perspective, with information related to how to run inference on the current model.
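In practice, this means any tool that understands the MLmodel definition can run inference through the generic python_function flavor; the following sketch (again with a placeholder run ID) shows the idea:
import mlflow.pyfunc

# The pyfunc flavor is the generic inference interface declared in the MLmodel file.
pyfunc_model = mlflow.pyfunc.load_model("runs:/46dc6db17fb5471a9a23d45407da680f/model")
print(pyfunc_model.predict(X_test[:5]))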
The metrics folder contains the training score value of this particular run of the training process, which can be used to benchmark the model against further model improvements down the line.
The params folder, shown in the first folder listing, contains the default parameters of the logistic regression model, with the different default values listed transparently and stored automatically.
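Both the metrics and the params folders map directly onto the run object exposed by the tracking API, so you can read them without touching the filesystem; a minimal sketch with a placeholder run ID:
import mlflow

# Replace with a run ID from your own mlruns folder.
run = mlflow.get_run("46dc6db17fb5471a9a23d45407da680f")
print(run.data.metrics)   # for example, {'training_score': ...}
print(run.data.params)    # the autologged LogisticRegression parameters, such as 'C'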