Linear regression for biostatistics in Python
We will continue working with the diabetes dataset from previous chapters, but this time we will pose research questions related to predictive biostatistics and build biostatistical models. Before proceeding, make sure you have downloaded the diabetes dataset, as in previous chapters, from https://data.mendeley.com/datasets/wj9rwkp9c2/1.
First, let’s load the required libraries. Notice that we are loading the stats library from the scipy package. This library will be used to create the predictive models:
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv(r'C:\Users\MEDIN\Downloads\Dataset of Diabetes .csv')
Datasets sometimes contain inconsistencies in letter case. In this dataset, this is the case in the gender column, where one entry uses a lowercase f instead of F. Here is how...
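As an illustration of this kind of cleanup, the following is a minimal sketch and not necessarily the approach used in this chapter. It assumes the column is named Gender and uses a small made-up DataFrame (sample) in place of the real data, so both the column name and the values are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the diabetes data; the column name "Gender"
# and the values below are assumptions made for illustration only.
sample = pd.DataFrame({"Gender": ["F", "M", "f", "M", "F"]})

# Inspect the distinct labels before cleaning: "f" and "F" are counted separately.
print(sample["Gender"].value_counts())

# Remove stray whitespace and convert every label to uppercase,
# so that "f" and "F" are treated as the same category.
sample["Gender"] = sample["Gender"].str.strip().str.upper()

# After cleaning, only the uppercase labels remain.
print(sample["Gender"].value_counts())
```

If the column in your copy of the dataset has a different name, adjust it accordingly. Comparing value_counts() before and after the change is a quick way to confirm that the categories were merged.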