Understanding data dimensionality and resolving data complexity
In this section, we work through a specific exercise, a mice protein analysis, that shows how high-dimensional datasets in biology can be and how to address that dimensionality.
The mice protein dataset is part of the UCI Machine Learning Repository and can be found at https://archive.ics.uci.edu/dataset/342/mice+protein+expression. It is licensed under Creative Commons Attribution 4.0 International and should be cited as: Clara Higuera, Katheleen Gardiner, and Krzysztof Cios (2015). Mice Protein Expression. UCI Machine Learning Repository. https://doi.org/10.24432/C50S3Z.
The dataset is also available on Kaggle at https://www.kaggle.com/datasets/ruslankl/mice-protein-expression.
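Once the file is downloaded, a quick first look at its dimensionality might resemble the following minimal sketch. It uses only Python's standard library and runs on a tiny made-up sample rather than the real data. The function name summarize_dimensions, the column names, and every value in the sample are assumptions chosen for illustration, so check them against the file you actually download.

```python
import csv
import io


def summarize_dimensions(csv_file):
    """Return (n_rows, n_columns, numeric_column_names) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    columns = reader.fieldnames or []

    def is_number(text):
        try:
            float(text)
            return True
        except (TypeError, ValueError):
            return False

    numeric = []
    for name in columns:
        # Ignore missing cells when deciding whether a column is numeric.
        values = [row[name] for row in rows if row[name] not in ("", None)]
        if values and all(is_number(v) for v in values):
            numeric.append(name)
    return len(rows), len(columns), numeric


# Made-up sample for illustration only; these are NOT real measurements,
# and the column names are assumptions about the file layout.
SAMPLE = """MouseID,DYRK1A_N,ITSN1_N,class
m1,0.50,0.74,c-CS-m
m2,0.51,,c-CS-m
m3,0.48,0.70,t-CS-s
"""

n_rows, n_cols, numeric_cols = summarize_dimensions(io.StringIO(SAMPLE))
print(n_rows, n_cols, numeric_cols)
```

To run the same check on the downloaded file, replace io.StringIO(SAMPLE) with a file opened through open(path, newline=""), where path points to wherever the CSV was saved. The number of numeric columns gives a first rough sense of how many measured variables the analysis has to handle.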
Before we define and explore the variables, let’s set up the framework for biological variables.
What needs to be defined at this stage is the experiment area, a biological variable, and the statistical...