Validating and describing the Diabetes dataset
After loading and examining the dataset, the next step is to validate and describe it. This involves several procedures: checking for missing values (nan), simplifying the dataset structure by removing unnecessary variables, fixing potentially incorrect names in the classes (the CLASS variable) and categories, and making sure the structure of the dataset matches the description on the official website.
We will perform each of these procedures here.
First, we check for missing values as follows:
# The .sum() after isna() outputs the number of empty cells in each column
data.isna().sum()
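To make the shape of that output concrete, here is a minimal sketch on a small made-up frame (not the Diabetes data; the name toy and its values are assumptions for illustration only). It shows that isna().sum() returns one count per column, and that chaining a second .sum() collapses those counts into a single total.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Diabetes data; the values are made up.
toy = pd.DataFrame({
    "AGE": [50, np.nan, 33],
    "HbA1c": [4.9, 5.2, np.nan],
    "CLASS": ["N", "Y", "P"],
})

# Per-column count of missing cells (a Series indexed by column name).
missing_per_column = toy.isna().sum()

# Grand total of missing cells across the whole frame (a single integer).
missing_total = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing cells:", missing_total)
```

A dataset with no missing values shows 0 next to every column name in the first output, and a total of 0 in the second.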
As there are no missing values (the output of data.isna().sum() is 0 for every column), we can proceed to simplify the dataset structure and remove the unnecessary variables. For this project, the ID and No_Pation variables (which are just unique identifiers for samples) are not needed, so they are removed to have a simpler structure...
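One possible way to remove identifier columns with pandas is sketched below. It runs on a small made-up frame rather than the real dataset (the name toy and its values are assumptions), and it is only an illustration of DataFrame.drop, not necessarily the exact call used in this project.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset; the values are made up.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "No_Pation": [101, 102, 103],
    "AGE": [50, 26, 33],
    "CLASS": ["N", "Y", "P"],
})

# Drop the identifier columns. drop() returns a new frame by default,
# so the original frame is left unchanged.
simplified = toy.drop(columns=["ID", "No_Pation"])

print(list(simplified.columns))
```

Assigning the result to a new name keeps the original frame available in case the identifiers are needed again later.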