Scikit-learn is an amazing Python library with unrivaled documentation, designed to provide a consistent API to dozens of algorithms. It is built upon, and is itself a core component of, the Python scientific stack, which includes NumPy, SciPy, pandas, and matplotlib. Here are some of the areas scikit-learn covers: classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
We'll look at a few examples. First, we will build a classifier using our iris data, and then we'll look at how we can evaluate our model using the tools of scikit-learn:
- The first step to building a machine learning model in scikit-learn is understanding how the data must be structured.
- The independent variables should be a numeric n × m matrix, X, and the dependent variable, y, an n × 1 vector.
- The y vector may be either numeric (continuous or categorical) or a string categorical.
- These are then passed into the .fit() method on the chosen classifier.
- This is the great benefit of using scikit-learn: each classifier utilizes the same methods to the extent possible. This makes swapping them in and out a breeze.
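As a minimal, illustrative sketch of that shared interface (the particular estimators here are just examples and not part of the iris walkthrough), every scikit-learn classifier exposes the same fit() and predict() methods:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Illustrative sketch only: X is an n x m feature matrix, y an n-element label vector
X, y = load_iris(return_X_y=True)

# Swapping estimators changes only the line that creates the model
for model in (LogisticRegression(max_iter=200), DecisionTreeClassifier()):
    model.fit(X, y)              # train on the features and labels
    preds = model.predict(X)     # one predicted class label per row of X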
Let's see this in action in our first example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Instantiate the classifier: 10 trees, each limited to a maximum depth of 5
clf = RandomForestClassifier(max_depth=5, n_estimators=10)

# The first four columns are the features (X); the fifth is the class label (y)
X = df.iloc[:, :4]
y = df.iloc[:, 4]

# Shuffle and split: 70% of the rows for training, 30% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Line up predicted versus actual labels and flag the correct predictions
rf = pd.DataFrame(list(zip(y_pred, y_test)), columns=['predicted', 'actual'])
rf['correct'] = rf.apply(lambda r: 1 if r['predicted'] == r['actual'] else 0, axis=1)
rf
The preceding code generates the following output:
Now, let's execute the following line of code:
rf['correct'].sum()/rf['correct'].count()
The preceding code generates the following output:
In the preceding few lines of code, we built, trained, and tested a classifier that has a 95% accuracy level on our iris dataset. Let's unpack each of the steps. Up at the top, we made a couple of imports; both are from scikit-learn, which thankfully is shortened to sklearn in import statements. The first import is the random forest classifier, and the second is the train_test_split function for splitting your data into training and testing cohorts. This data partitioning is critical in building machine learning applications for a number of reasons. We'll get into this in later chapters, but suffice it to say at this point that it is a must. The train_test_split function also shuffles your data, which again is important, as the order can contain information that would bias your actual predictions.
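As a side note (an illustrative sketch, not part of the example above), train_test_split also accepts a random_state argument to make the shuffle reproducible and a stratify argument to preserve the class proportions in both partitions:
from sklearn.model_selection import train_test_split

# Illustrative: random_state fixes the shuffle; stratify=y keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=42, stratify=y)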
The first curious-looking line after the imports instantiates our classifier, in this case a random forest classifier. We select a forest that uses 10 decision trees, and each tree is allowed a maximum split depth of five. This is put in place to avoid overfitting, something we will discuss in depth in later chapters.
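One rough way to see why these limits matter (a hypothetical sketch, not part of the original example) is to compare training and test accuracy as the allowed depth grows; a widening gap between the two suggests overfitting:
# Hypothetical sketch: deeper trees fit the training data ever more closely,
# so a large gap between training and test accuracy is a sign of overfitting
for depth in (1, 3, 5, 10):
    model = RandomForestClassifier(max_depth=depth, n_estimators=10)
    model.fit(X_train, y_train)
    print(depth, model.score(X_train, y_train), model.score(X_test, y_test))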
The next two lines create our X matrix and y vector. If you remember our original iris DataFrame, it contained four features: petal width and length, and sepal width and length. These features are selected and become our independent feature matrix, X. The last column, the iris class names, then becomes our dependent y vector.
These are then passed into the train_test_split function, which shuffles and partitions our data into four subsets: X_train, X_test, y_train, and y_test. The test_size parameter is set to .3, which means 30% of our dataset will be allocated to the X_test and y_test partitions, while the rest will be allocated to the training partitions, X_train and y_train.
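A quick, optional check (purely illustrative) confirms the 70/30 split; with the 150 iris samples, roughly 105 rows land in the training partition and 45 in the test partition:
# Illustrative check of the partition sizes
print(X_train.shape, X_test.shape)   # e.g. (105, 4) (45, 4)
print(y_train.shape, y_test.shape)   # e.g. (105,) (45,)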
Next, our model is fitted using the training data. Having trained the model, we then call the predict method on our classifier using our test data. Remember, the test data is data the classifier has not seen. The return value is an array of predicted labels. We then create a DataFrame of the actual labels versus the predicted labels. We finally total the correct predictions and divide by the total number of instances, which we can see gave us a very accurate prediction. Let's now see which features gave us the most discriminative or predictive power:
# Per-feature importances and their names
f_importances = clf.feature_importances_
f_names = df.columns[:4]

# Standard deviation of each feature's importance across the individual trees
f_std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)

# Sort the features by importance, from most to least important
zz = zip(f_importances, f_names, f_std)
zzs = sorted(zz, key=lambda x: x[0], reverse=True)
imps = [x[0] for x in zzs]
labels = [x[1] for x in zzs]
errs = [x[2] for x in zzs]

# Bar chart of the importances, with the per-tree standard deviation as error bars
plt.bar(range(len(f_importances)), imps, color="r", yerr=errs, align="center")
plt.xticks(range(len(f_importances)), labels);
The preceding code generates the following output:
As we expected, based upon our earlier visual analysis, the petal length and width have more discriminative power when differentiating between the iris classes. Where exactly did these numbers come from, though? The random forest has an attribute called .feature_importances_ that returns the relative importance of each feature at the splits in the trees. If a feature is able to consistently and cleanly split a group into distinct classes, it will have a high feature importance. These importances are normalized so that they always sum to one. As you will notice here, we have also included the standard deviation, which helps to illustrate how consistent each feature is. This is generated by taking the feature importance of each feature, for each of the ten trees, and calculating the standard deviation.
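As a quick, illustrative sanity check, the importances can be confirmed to sum to one and paired with their feature names:
# Illustrative check: the normalized importances sum to one
print(clf.feature_importances_.sum())      # approximately 1.0
print(dict(zip(f_names, f_importances)))   # importance keyed by feature name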
Let's now take a look at one more example using scikit-learn. We will now switch out our classifier and use a support vector machine (SVM):
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Wrap a linear-kernel SVM in a one-vs-rest scheme for multiclass classification
clf = OneVsRestClassifier(SVC(kernel='linear'))

# Same feature matrix as before; the labels are cast to plain strings for the SVM
X = df.iloc[:, :4]
y = np.array(df.iloc[:, 4]).astype(str)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

rf = pd.DataFrame(list(zip(y_pred, y_test)), columns=['predicted', 'actual'])
rf['correct'] = rf.apply(lambda r: 1 if r['predicted'] == r['actual'] else 0, axis=1)
rf
The preceding code generates the following output:
Now, let's execute the following line of code:
rf['correct'].sum()/rf['correct'].count()
The preceding code generates the following output:
Here, we have swapped in an SVM without changing virtually any of our code. The only changes were the import of the SVM instead of the random forest and the line that instantiates the classifier. (I did have to make one small change to the format of the y labels, as the SVM wasn't able to interpret them as NumPy strings the way the random forest classifier was. Sometimes these data types have to be converted explicitly or an error will result, but it's a minor annoyance.)
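Because the evaluation step is just as classifier-agnostic as the training step, a convenience function such as sklearn.metrics.accuracy_score (shown here as an alternative sketch, not part of the original example) could replace the manual DataFrame tally for either model:
from sklearn.metrics import accuracy_score

# Alternative to the manual tally above: compare true and predicted test labels directly
print(accuracy_score(y_test, y_pred))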
This is only a small sample of the functionality of scikit-learn, but it should give you a hint of the power of this magnificent tool for machine learning applications. There are a number of additional machine learning libraries we won't have a chance to discuss here, but we will explore them in later chapters. If this is your first time utilizing a machine learning library and you want a strong general-purpose tool, I strongly suggest that scikit-learn should be your go-to choice.