What this book covers
Chapter 1, Introduction to Biostatistics, introduces the field of biostatistics and its use cases. You will learn why biostatistics is important for biomedicine, clinical trials, biology, biotechnology, and life sciences. You will also understand why it’s important to use computational programming languages such as Python to process biological and biomedical data and try to answer the research questions we may have.
This chapter lays the theoretical foundation needed to proceed with the use of biostatistics in life sciences fields and understand how Python can be used for biostatistical analysis within the biotech and life sciences research fields. You will need this understanding to proceed with the hands-on projects in the following chapters.
Chapter 2, Getting Started with Python for Biostatistics, walks you through installing Python and getting started with IDEs such as Spyder and Jupyter Notebook. You will learn how to install Python and its IDEs using the open source Anaconda distribution. Finally, you will learn how to navigate the interfaces of these IDEs.
Chapter 3, Exercise 1 – Cleaning and Describing Data Using Python, will help you learn more about the basics of data science, including data types, how to load data in Python, and more. Practical exercises for loading the famous Iris dataset, cleaning data, and describing data are among the topics in this chapter. The chapter prepares you for the next exercise on diabetes data. It introduces the concept of Exploratory Data Analysis (EDA), which will be used in many other chapters of the book.
Chapter 4, Part 1 Exemplar Project – Load, Clean, and Describe Diabetes Data in Python, is where you will apply what you learned in Chapter 3. The dataset is the Pima Indians Diabetes dataset. EDA, cleaning, and data visualizations as output are the practical goals of this chapter. The chapter also covers theoretical aspects associated with the dataset, such as the theoretical foundation of diabetes mellitus biomarkers.
Chapter 5, Introduction to Python for Biostatistics, covers the libraries used for specific biostatistical methods and how those methods work. You will learn about the libraries for hypothesis tests, effect size analysis, predictive analysis, and more. Toward the end of the chapter, you will learn how to select specific hypothesis tests and biostatistical implementations for different research questions. The main goal of the chapter is to introduce you to the Python framework for biostatistical analysis.
Chapter 6, Biostatistical Inference Using Hypothesis Tests and Effect Sizes, focuses on biostatistical inference. It covers how to apply hypothesis tests such as Student’s t-test, the Wilcoxon test, and the Chi-square test, and how to find associations between variables using correlation analysis. ANOVA and Kruskal–Wallis tests are also explored as ways to analyze multiple groups using Python.
Chapter 7, Predictive Biostatistics Using Python, looks at predictive biostatistics and its uses in different areas of biology, biomedicine, and other life sciences fields.
You will learn about different types of variables in relation to predictive analysis, such as dependent variables, independent variables, and latent variables. You will learn how to implement linear regression and logistic regression in Python. Finally, you will learn how to create and interpret multivariable regression models.
Chapter 8, Part 2 Exercise – T-Test, ANOVA, and Linear and Logistic Regression, is mostly about practical exercises in Python hypothesis testing and predictive analysis. You will learn how to implement Student’s t-test for comparing two groups and Analysis of Variance (ANOVA) to compare multiple groups in biological data. In this chapter, you will also learn how to practically define, create, and implement linear and logistic regression models using Python. At the end of each analysis, you will also learn how to create a publication-ready and intuitive data visualization using Python’s data visualization libraries.
Chapter 9, Biostatistical Inference and Predictive Analytics Using Cardiovascular Study Data, contains an exemplar project based on the Cleveland Heart Disease dataset. The main focus of this chapter is the practical implementation of biostatistical inference and predictive analytics in cardiology. The chapter includes both biological and statistical aspects of a practical cardiology project in the field of biostatistics. Hypothesis tests and linear and logistic regression are this time applied to a cardiovascular dataset, including cardiovascular disease modeling.
Chapter 10, Clinical Study Design, looks at study design, one of the most important aspects of any biostatistics project. The main topic of this chapter is understanding clinical studies from the design perspective. You will understand the principles of observational studies, including cohort and case-control studies, as well as the different designs of clinical trials. Furthermore, you will learn how to perform sample size calculations for study planning and design. Finally, you will learn how to define protocol documentation for clinical studies.
Chapter 11, Survival Analysis in Biomedical Research, will see you start by loading and understanding an oncology dataset (Veterans Oncology dataset) using scikit-learn. Then, you will learn how to use survival analysis and Kaplan-Meier (KM) curves to visualize and analyze survival in different groups of oncology patients. You will learn how to implement Cox Proportional Hazards regression models to perform survival analysis inference and identify the appropriate oncology survival models for the data.
Chapter 12, Meta-Analysis – Synthesizing Evidence from Multiple Studies, shows you how to synthesize evidence from multiple studies or analyses. This chapter lays the theoretical foundation to help you understand how to use meta-analysis to create overall estimates of treatment effects in biostatistics. You will learn the differences between random-effects and fixed-effects meta-analysis models and when to use them. You will also learn how to reason about and interpret forest and funnel plots, which are often the main focus of meta-analysis interpretation and data visualization.
Chapter 13, Survival Predictive Analysis and Meta-Analysis Practice, is about the practical implementation of meta-analysis code in Python. You will be using the PythonMeta package and the DerSimonian & Laird inverse variance method. You will learn about Overall Survival (OS), Progression-Free Survival (PFS), Disease-Free Survival (DFS), and Recurrence-Free Survival (RFS) metrics in oncology meta-analysis. Finally, the main outcome of the chapter is being able to practically implement meta-analysis and visualize and interpret results using Python.
Chapter 14, Part 3 Exemplar Project – Meta-Analysis of Survival Data in Clinical Research, starts with a Non-Small Cell Lung Cancer (NSCLC) dataset and Tyrosine Kinase Inhibitors (TKIs), the treatment used to target a specific molecule associated with this cancer. The project involves performing a real-world meta-analysis with data from real studies. This exemplar project is a simulation of a real-world oncology meta-analysis, all done using the powerful Python programming language.
Chapter 15, Understanding Biological Variables, looks at simplifying the complexity of biological systems by focusing on the observation and analysis of key variables. It explores latent variables and provides detailed guidance on selecting significant variables for biodata analysis. You will learn how to connect biological questions with observable variables, ensuring the meaningful interpretation of data. The chapter concludes with techniques for validating the biological relevance of data, reinforcing the connection between theory and practical application.
Chapter 16, Data Analysis Frameworks and Performance for Life Sciences Research, focuses on distinguishing between statistical data analysis frameworks. We discuss the frequentist and Bayesian frameworks, their differences, when to use them, and how to apply them to different research problems. You will also learn how to choose the correct statistical framework for your analysis. Finally, you will learn how to connect an experiment design with the statistical aspects of the analysis and perform in-depth interpretation of the results based on the statistical framework you choose.
Chapter 17, Part 4 Exercise – Performing Statistics for Biology Studies in Python, contains a state-of-the-art biology research exemplar project in which you apply Python programming and advanced statistical approaches. You start with the mice proteomics dataset, in which we explore the biological aspects of neuroscience and proteins associated with different conditions. The approaches included are Principal Component Analysis (PCA), Random Forest (RF) for feature selection, and Structural Equation Modeling (SEM). By the end of the chapter, you will know how to create protein association SEM models, how to relate biological domain knowledge to latent variables in protein SEM analysis, and how to statistically test biological pathways grounded in the theory of molecular biology.