Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Bioinformatics with Python Cookbook

You're reading from   Bioinformatics with Python Cookbook Use modern Python libraries and applications to solve real-world computational biology problems

Arrow left icon
Product type Paperback
Published in Sep 2022
Publisher Packt
ISBN-13 9781803236421
Length 360 pages
Edition 3rd Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Tiago Antao Tiago Antao
Author Profile Icon Tiago Antao
Tiago Antao
Arrow right icon
View More author details
Toc

Table of Contents (15) Chapters Close

Preface 1. Chapter 1: Python and the Surrounding Software Ecology 2. Chapter 2: Getting to Know NumPy, pandas, Arrow, and Matplotlib FREE CHAPTER 3. Chapter 3: Next-Generation Sequencing 4. Chapter 4: Advanced NGS Data Processing 5. Chapter 5: Working with Genomes 6. Chapter 6: Population Genetics 7. Chapter 7: Phylogenetics 8. Chapter 8: Using the Protein Data Bank 9. Chapter 9: Bioinformatics Pipelines 10. Chapter 10: Machine Learning for Bioinformatics 11. Chapter 11: Parallel Processing with Dask and Zarr 12. Chapter 12: Functional Programming for Bioinformatics 13. Index 14. Other Books You May Enjoy

Understanding NumPy as the engine behind Python data science and bioinformatics

Most of your analysis will make use of NumPy, even if you don’t use it explicitly. NumPy is an array manipulation library that is behind libraries such as pandas, Matplotlib, Biopython, and scikit-learn, among many others. While much of your bioinformatics work may not require explicit direct use of NumPy, you should be aware of its existence as it underpins almost everything you do, even if only indirectly via the other libraries.

In this recipe, we will use VAERS data to demonstrate how NumPy is behind many of the core libraries that we use. This is a very light introduction to the library so that you are aware that it exists and that it is behind almost everything. Our example will extract the number of cases from the five US states with more adverse effects, splitting them into age bins: 0 to 19 years, 20 to 39, up to 100 to 119.

Getting ready

Once again, we will be using the data from the first recipe, so make sure it’s available. The code for it can be found in Chapter02/NumPy.py.

How to do it…

Follow these steps:

  1. Let’s start by loading the data with pandas and reducing the data so that it’s related to the top five US states only:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    vdata = pd.read_csv(
        "2021VAERSDATA.csv.gz", encoding="iso-8859-1")
    vdata["STATE"] = vdata["STATE"].str.upper()
    top_states = pd.DataFrame({
        "size": vdata.groupby("STATE").size().sort_values(ascending=False).head(5)}).reset_index()
    top_states["rank"] = top_states.index
    top_states = top_states.set_index("STATE")
    top_vdata = vdata[vdata["STATE"].isin(top_states.index)]
    top_vdata["state_code"] = top_vdata["STATE"].apply(
        lambda state: top_states["rank"].at[state]
    ).astype(np.uint8)
    top_vdata = top_vdata[top_vdata["AGE_YRS"].notna()]
    top_vdata.loc[:,"AGE_YRS"] = top_vdata["AGE_YRS"].astype(int)
    top_states

The top states are as follows. This rank will be used later to construct a NumPy matrix:

Figure 2.2 – US states with largest numbers of adverse effects

Figure 2.2 – US states with largest numbers of adverse effects

  1. Now, let’s extract the two NumPy arrays that contain age and state data:
    age_state = top_vdata[["state_code", "AGE_YRS"]]
    age_state["state_code"]
    state_code_arr = age_state["state_code"].values
    type(state_code_arr), state_code_arr.shape, state_code_arr.dtype
    age_arr = age_state["AGE_YRS"].values
    type(age_arr), age_arr.shape, age_arr.dtype

Note that the data that underlies pandas is NumPy data (the values call for both Series returns NumPy types). Also, you may recall that pandas has properties such as .shape or .dtype: these were inspired by NumPy and behave the same.

  1. Now, let’s create a NumPy matrix from scratch (a 2D array), where each row is a state and each column represents an age group:
    age_state_mat = np.zeros((5,6), dtype=np.uint64)
    for row in age_state.itertuples():
        age_state_mat[row.state_code, row.AGE_YRS//20] += 1
    age_state_mat

The array has five rows – one for each state – and six columns – one for each age group. All the cells in the array must have the same type.

We initialize the array with zeros. There are many ways to initialize arrays, but if you have a very large array, initializing it may take a lot of time. Sometimes, depending on your task, it might be OK that the array is empty at the beginning (meaning it was initialized with random trash). In that case, using np.empty will be much faster. We use pandas iteration here: this is not the best way to do things from a pandas perspective, but we want to make the NumPy part very explicit.

  1. We can extract a single row – in our case, the data for a state – very easily. The same applies to a column. Let’s take California data and then the 0-19 age group:
    cal = age_state_mat[0,:]
    kids = age_state_mat[:,0]

Note the syntax to extract a row or a column. It should be familiar to you, given that pandas copied the syntax from NumPy and we encountered it in previous recipes.

  1. Now, let’s compute a new matrix where we have the fraction of cases per age group:
    def compute_frac(arr_1d):
        return arr_1d / arr_1d.sum()
    frac_age_stat_mat = np.apply_along_axis(compute_frac, 1, age_state_mat)

The last line applies the compute_frac function to all rows. compute_frac takes a single row and returns a new row where all the elements are divided by the total sum.

  1. Now, let’s create a new matrix that acts as a percentage instead of a fraction – simply because it reads better:
    perc_age_stat_mat = frac_age_stat_mat * 100
    perc_age_stat_mat = perc_age_stat_mat.astype(np.uint8)
    perc_age_stat_mat

The first line simply multiplies all the elements of the 2D array by 100. Matplotlib is smart enough to traverse different array structures. That line will work if it’s presented with an array with any dimensions and would do exactly what is expected.

Here is the result:

Figure 2.3 – A matrix representing the distribution of vaccine-adverse effects 
in the five US states with the most cases

Figure 2.3 – A matrix representing the distribution of vaccine-adverse effects in the five US states with the most cases

  1. Finally, let’s create a graphical representation of the matrix using Matplotlib:
    fig = plt.figure()
    ax = fig.add_subplot()
    ax.matshow(perc_age_stat_mat, cmap=plt.get_cmap("Greys"))
    ax.set_yticks(range(5))
    ax.set_yticklabels(top_states.index)
    ax.set_xticks(range(6))
    ax.set_xticklabels(["0-19", "20-39", "40-59", "60-79", "80-99", "100-119"])
    fig.savefig("matrix.png")

Do not dwell too much on the Matplotlib code – we are going to discuss it in the next recipe. The fundamental point here is that you can pass NumPy data structures to Matplotlib. Matplotlib, like pandas, is based on NumPy.

See also

The following is some extra information that may be useful:

  • NumPy has many more features than the ones we’ve discussed here. There are plenty of books and tutorials on them. The official documentation is a good place to start: https://numpy.org/doc/stable/.
  • There are many important issues to discover with NumPy, but probably one of the most important is broadcasting: NumPy’s ability to take arrays of different structures and get the operations right. For details, go to https://numpy.org/doc/stable/user/theory.broadcasting.html.
You have been reading a chapter from
Bioinformatics with Python Cookbook - Third Edition
Published in: Sep 2022
Publisher: Packt
ISBN-13: 9781803236421
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image