We will use open source code from the scikit-learn site for this case study. The code is available at the following link:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py
Since we will be using regression for our analysis, we import the linear_model module along with the mean_squared_error and r2_score functions, as seen in the following code:
print(__doc__)
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
We import the diabetes data and perform the following actions:
- List the dimension and size
- List the features
The code for these steps is:
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
print(diabetes.data.shape) # gives the data size and dimensions
print(diabetes.feature_names)  # lists the feature names
print(diabetes.DESCR)
The dataset has 442 rows and 10 features. The features are:
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
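Because a feature is selected by its column position, it can help to look up the position by name rather than count by hand. The following is a minimal sketch; the list is copied from the output above so that the snippet runs without scikit-learn, and the variable names are illustrative only:

```python
# Feature names as printed by diabetes.feature_names (copied from the output above)
feature_names = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

# Python lists are zero-indexed, so the third name has index 2
bmi_index = feature_names.index('bmi')
bp_index = feature_names.index('bp')

print(bmi_index)  # column position of the bmi feature
print(bp_index)   # column position of the bp feature
```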
To train the model, we use a single feature, the bmi of the individual, which is the column at index 2, as shown:
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
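The np.newaxis in the indexing expression keeps the selected column two-dimensional, which matters because scikit-learn estimators expect the input as an array of shape (n_samples, n_features). A small sketch on a made-up array (a stand-in for diabetes.data, not the real values) shows the difference:

```python
import numpy as np

# Toy stand-in for diabetes.data: 4 samples, 3 features (values are made up)
data = np.arange(12, dtype=float).reshape(4, 3)

flat = data[:, 1]                # plain column selection gives a 1-D vector
column = data[:, np.newaxis, 1]  # np.newaxis keeps a feature axis of length 1

print(flat.shape)    # one dimension: samples only
print(column.shape)  # two dimensions: samples by features
```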
Earlier in the chapter, we discussed why selecting proper training and testing sets is essential. In our case, the last 20 items are kept for testing, as shown in the following code:
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]  # everything except the last twenty items
diabetes_X_test = diabetes_X[-20:]  # last twenty items in the array
Next, we split the targets into training and testing sets in the same way:
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]  # everything except the last twenty items
diabetes_y_test = diabetes.target[-20:]  # last twenty items
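To see what the negative slices produce, here is a brief sketch on a placeholder array with the same number of rows as the diabetes data. The array contents are made up; only the shapes are of interest. Note that this split is purely positional, so it assumes the row order carries no pattern of its own:

```python
import numpy as np

# Placeholder with 442 rows and one feature column (values are just row numbers)
X = np.arange(442).reshape(-1, 1)

X_train = X[:-20]  # rows 0 to 421
X_test = X[-20:]   # rows 422 to 441

print(X_train.shape, X_test.shape)
```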
Next, we perform regression on this data to generate results. We use the training data to fit the model and then use the fitted model to make predictions on the testing set that we held out, as seen in the following code:
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
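With a single feature, ordinary least squares has a simple closed form, which can make the fit less of a black box. The sketch below works on a tiny made-up dataset rather than the diabetes data, and the variable names are illustrative. For a one-feature model, the slope and intercept computed this way correspond to what the fitted object exposes as regr.coef_ and regr.intercept_:

```python
import numpy as np

# Made-up data that lies exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares slope: covariance of x and y divided by the variance of x
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# The fitted line passes through the point of means
intercept = y_mean - slope * x_mean

predictions = slope * x + intercept
print(slope, intercept)
```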
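We assess the goodness of fit with two metrics: the mean squared error (MSE), which measures how large the prediction errors are, and the R² score (coefficient of determination), which measures how much of the variance in the target the model explains, as follows: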
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Coefficient of determination (R^2): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
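Both metrics can be reproduced by hand, which clarifies what the two printed numbers mean. The following sketch uses small made-up arrays in place of diabetes_y_test and diabetes_y_pred:

```python
import numpy as np

# Made-up true values and predictions (not taken from the diabetes data)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R^2: one minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("Mean squared error: %.3f" % mse)
print("R^2 score: %.2f" % r2)
```

An R² of 1 means the predictions match the targets exactly, while a value near 0 means the model does no better than always predicting the mean of the targets.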
Finally, we plot the test data and the fitted regression line using Matplotlib, as follows:
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
The output graph looks as follows: