Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Bioinformatics with Python Cookbook
Bioinformatics with Python Cookbook

Bioinformatics with Python Cookbook: Learn how to use modern Python bioinformatics libraries and applications to do cutting-edge research in computational biology , Second Edition

eBook
€32.99 €47.99
Paperback
€59.99
Subscription
Free Trial
Renews at €18.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Bioinformatics with Python Cookbook

Python and the Surrounding Software Ecology

In this chapter, we will cover the following recipes:

  • Installing the required software with Anaconda
  • Installing the required software with Docker
  • Interfacing with R via rpy2
  • Performing R magic with Jupyter Notebook

Introduction

We will start by installing the required software. This will include the Python distribution, some fundamental Python libraries, and external bioinformatics software. Here, we will also be concerned with the world outside Python. In bioinformatics and big data, R is also a major player; therefore, you will learn how to interact with it via rpy2, which is a Python/R bridge. We will also explore the advantages that the IPython framework (via Jupyter Notebook) can give us in order to efficiently interface with R. This chapter will set the stage for all of the computational biology that we will perform in the rest of this book.

As different users have different requirements, we will cover two different approaches for installing the software. One approach is using the Anaconda Python (http://docs.continuum.io/anaconda/) distribution, and another approach to install the software is via Docker (a server virtualization method based on containers sharing the same operating system kernel—https://www.docker.com/). If you are using a Windows-based operating system, you are strongly encouraged to consider changing your operating system or use Docker via some of the existing options on Windows. On macOS, you might be able to install most of the software natively, though Docker is also available.

Installing the required software with Anaconda

Before we get started, we need to install some prerequisite software. The following sections will take you through the software and the steps needed to install them. An alternative way to start is to use the Docker recipe, after which everything will be taken care for you via a Docker container.

If you are already using a different Python version, you are strongly encouraged to consider Anaconda, as it has become the de facto standard for data science. Also, it is the distribution that will allow you to install software from Bioconda (https://bioconda.github.io/).

Getting ready

Python can be run on top of different environments. For instance, you can use Python inside the Java Virtual Machine (JVM) (via Jython) or with .NET (with IronPython). However, here, we are concerned not only with Python, but also with the complete software ecology around it; therefore, we will use the standard (CPython) implementation, since the JVM and .NET versions exist mostly to interact with the native libraries of these platforms. A potentially viable alternative would be to use the PyPy implementation of Python (not to be confused with Python Package Index (PyPI).

Save for noted exceptions, we will be using Python 3 only. If you were starting with Python and bioinformatics, any operating system will work, but here, we are mostly concerned with intermediate to advanced usage. So, while you can probably use Windows and macOS, most heavy-duty analysis will be done on Linux (probably on a Linux cluster). Next-generation sequencing (NGS) data analysis and complex machine learning is mostly performed on Linux clusters.

If you are on Windows, you should consider upgrading to Linux for your bioinformatics work because most modern bioinformatics software will not run on Windows. macOS will be fine for almost all analyses, unless you plan to use a computer cluster, which will probably be Linux-based.

If you are on Windows or macOS and do not have easy access to Linux, don't worry. Modern virtualization software (such as VirtualBox and Docker) will come to your rescue, which will allow you to install a virtual Linux on your operating system. If you are working with Windows and decide that you want to go native and not use Anaconda, be careful with your choice of libraries; you are probably safer if you install the 32-bit version for everything (including Python itself).


If you are on Windows, many tools will be unavailable to you.
Bioinformatics and data science are moving at breakneck speed; this is not just hype, it's a reality. When installing software libraries, choosing a version might be tricky. Depending on the code that you have, it might not work with some old versions, or maybe not even work with a newer version. Hopefully, any code that you use will indicate the correct dependencies—though this is not guaranteed.

The software developed for this book is available at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition. To access it, you will need to install Git. Alternatively, you can download the ZIP file that GitHub makes available (indeed, getting used to Git may be a good idea because lots of scientific computing software is being developed with it).

Before you install the Python stack properly, you will need to install all the external non-Python software that you will be interoperating with. The list will vary from chapter to chapter, and all chapter-specific packages will be explained in their respective chapters. Some less common Python libraries may also be referred to in their specific chapters. Fortunately, since the first edition of this book, most bioinformatics software can be easily installed with conda using the Bioconda project.

If you are not interested in a specific chapter, you can skip the related packages and libraries. Of course, you will probably have many other bioinformatics applications around—such as Burrows-Wheeler Aligner (bwa) or Genome Analysis Toolkit (GATK) for NGS—but we will not discuss these because we do not interact with them directly (although we might interact with their outputs).

You will need to install some development compilers and libraries, all of which are free. On Ubuntu, consider installing the build-essential package (apt-get it), and on macOS, consider Xcode (https://developer.apple.com/xcode/).

In the following table, you will find a list of the most important Python software:

Name Application URL Purpose
Project Jupyter All chapters https://jupyter.org/ Interactive computing
pandas All chapters https://pandas.pydata.org/ Data processing
NumPy All chapters http://www.numpy.org/ Array/matrix processing
SciPy All chapters https://www.scipy.org/ Scientific computing
Biopython All chapters https://biopython.org/ Bioinformatics library
PyVCF NGS https://pyvcf.readthedocs.io VCF processing
Pysam NGS https://github.com/pysam-developers/pysam SAM/BAM processing
HTSeq NGS/Genomes https://htseq.readthedocs.io NGS processing
simuPOP Population genetics http://simupop.sourceforge.net/ Population genetics simulation
DendroPY Phylogenetics https://dendropy.org/ Phylogenetics
scikit-learn Machine learning/population genetics http://scikit-learn.org Machine learning library
PyMol Proteomics https://pymol.org Molecular visualization
rpy2 Introduction https://rpy2.readthedocs.io R interface
seaborn All chapters http://seaborn.pydata.org/ Statistical chart library
Cython Big data http://cython.org/ High performance
Numba Big data https://numba.pydata.org/ High performance
Dask Big data http://dask.pydata.org Parallel processing

 

We have taken a somewhat conservative approach in most of the recipes with regard to the processing of tabled data. While we use pandas every now and then, most of the time, we use standard Python. As time advances and pandas becomes more pervasive, it will probably make sense to just process all tabular data with it (if it fits in-memory).

How to do it...

Take a look at the following steps to get started:

  1. Start by downloading the Anaconda distribution from https://www.anaconda.com/download. Choose Python version 3. In any case, this is not fundamental, because Anaconda will let you use Python 2 if you need it. You can accept all the installation defaults, but you may want to make sure that the conda binaries are in your path (do not forget to open a new window so that the path is updated). If you have another Python distribution, be careful with your PYTHONPATH and existing Python libraries. It's probably better to unset your PYTHONPATH. As much as possible, uninstall all other Python versions and installed Python libraries.
  2. Let's go ahead with the libraries. We will now create a new conda environment called bioinformatics with biopython=1.70, as shown in the following command:
conda create -n bioinformatics biopython biopython=1.70
  1. Let's activate the environment, as follows:
source activate bioinformatics
  1. Let's add the bioconda and conda-forge channel to our source list:
conda config --add channels bioconda
conda config --add channels conda-forge

Also, install the core packages:

conda install scipy matplotlib jupyter-notebook pip pandas cython numba scikit-learn seaborn pysam pyvcf simuPOP dendropy rpy2

Some of them will probably be installed with the core distribution anyway.

  1. We can even install R from conda:
conda install r-essentials r-gridextra

r-essentials installs a lot of R packages, including ggplot2, which we will use later. We also install r-gridextra, since we will be using it in the Notebook.

There's more...

Compared to the first edition of this book, this recipe is now highly simplified. There are two main reasons for this: the bioconda package, and the fact that we only need to support Anaconda as it has become a standard. If you feel strongly against using Anaconda, you will be able to install many of the Python libraries via pip. You will probably need quite a few compilers and build tools—not only C compilers, but also C++ and Fortran.

Installing the required software with Docker

Docker is the most widely-used framework for implementing operating system-level virtualization. This technology allows you to have an independent container: a layer that is lighter than a virtual machine, but still allows you to compartmentalize software. This mostly isolates all processes, making it feel like each container is a virtual machine.
Docker works quite well at both extremes of the development spectrum: it's an expedient way to set up the content of this book for learning purposes, and may become your platform of choice for deploying your applications in complex environments. This recipe is an alternative to the previous recipe.

However, for long-term development environments, something along the lines of the previous recipe is probably your best route, although it can entail a more laborious initial setup.

Getting ready

If you are on Linux, the first thing you have to do is install Docker. The safest solution is to get the latest version from https://www.docker.com/. While your Linux distribution may have a Docker package, it may be too old and buggy (remember the "advancing at breakneck speed" thing we mentioned?).

If you are on Windows or macOS, do not despair; take a look at the Docker site. Various options are available there to save you, but there is no clear-cut formula, as Docker advances quite quickly on those platforms. A fairly recent computer is necessary to run our 64-bit virtual machine. If you have any problems, reboot your machine and make sure that on the BIOS, VT-X or AMD-V is enabled. At the very least, you will need 6 GB of memory, preferably more.

This will require a very large download from the internet, so be sure that you have plenty of bandwidth. Also, be ready to wait for a long time.

How to do it...

Follow these steps to get started:

  1. Use the following command on your Docker shell:
docker build -t bio https://raw.githubusercontent.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/master/docker/Dockerfile

On Linux, you will either need to have root privileges or be added to the Docker Unix group.

  1. Now, you are ready to run the container, as follows:
docker run -ti -p 9875:9875 -v YOUR_DIRECTORY:/data bio

Replace YOUR_DIRECTORY with a directory on your operating system. This will be shared between your host operating system and the Docker container. YOUR_DIRECTORY will be seen in the container on /data and vice versa.
-p 9875:9875 will expose the container TCP port 9875 on the host computer port 9875.
Especially on Windows (and maybe on macOS), make sure that your directory is actually visible inside the Docker shell environment. If not, check the Docker documentation on how to expose directories.

  1. You are now ready to use the system. Point your browser to http://localhost:9875 and you should get the Jupyter environment.

If this does not work on Windows, check the Docker documentation (https://docs.docker.com/) on how to expose ports.

See also

  • Docker is the most widely used containerization software and has seen enormous growth in usage in recent times. You can read more about it at https://www.docker.com/.
  • A security-minded alternative to Docker is rkt, which can be found at https://coreos.com/rkt/.
  • If you are not able to use Docker; for example, if you do not have permissions, as will be the case on most computer clusters, then take a look at Singularity at https://www.sylabs.io/singularity/.

Interfacing with R via rpy2

If there is some functionality that you need and you cannot find it in a Python library, your first port of call is to check whether it's implemented in R. For statistical methods, R is still the most complete framework; moreover, some bioinformatics functionalities are also only available in R, most probably offered as a package belonging to the Bioconductor project.

rpy2 provides a declarative interface from Python to R. As you will see, you will be able to write very elegant Python code to perform the interfacing process. To show the interface (and try out one of the most common R data structures, the DataFrame, and one of the most popular R libraries, ggplot2), we will download its metadata from the Human 1,000 Genomes Project (http://www.1000genomes.org/). This is not a book on R, but we want to provide interesting and functional examples.

Getting ready

You will need to get the metadata file from the 1,000 Genomes sequence index. Please check https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb and download the sequence.index file. If you are using Jupyter Notebook, open the Chapter01/Interfacing_R.ipynb file and just execute the wget command on top.

This file has information about all of the FASTQ files in the project (we will use data from the Human 1,000 Genomes Project in the chapters to come). This includes the FASTQ file, the sample ID, and the population of origin, and important statistical information per lane, such as the number of reads and number of DNA bases read.

How to do it...

Follow these steps to get started:

  1. Let's start by doing some imports:
import os
from IPython.display import Image
import rpy2.robjects as robjects
import pandas as pd
from rpy2.robjects import pandas2ri
from rpy2.robjects import default_converter
from rpy2.robjects.conversion import localconverter

We will be using pandas on the Python side. R DataFrames map very well to pandas.

  1. We will read the data from our file using R's read.delim function:
read_delim = robjects.r('read.delim')
seq_data = read_delim('sequence.index', header=True, stringsAsFactors=False)
#In R:
# seq.data <- read.delim('sequence.index', header=TRUE, stringsAsFactors=FALSE)

The first thing that we do after importing is access the read.delim R function, which allows you to read files. The R language specification allows you to put dots in the names of objects. Therefore, we have to convert a function name to read_delim. Then, we call the function name proper; note the following highly declarative features. Firstly, most atomic objects, such as strings, can be passed without conversion. Secondly, argument names are converted seamlessly (barring the dot issue). Finally, objects are available in the Python namespace (but objects are actually not available in the R namespace; more about this later).

For reference, I have included the corresponding R code. I hope it's clear that it's an easy conversion. The seq_data object is a DataFrame. If you know basic R or pandas, you are probably aware of this type of data structure; if not, then this is essentially a table: a sequence of rows where each column has the same type.

  1. Let's perform a basic inspection of this DataFrame, as follows:
print('This dataframe has %d columns and %d rows' %
(seq_data.ncol, seq_data.nrow))
print(seq_data.colnames)
#In R:
# print(colnames(seq.data))
# print(nrow(seq.data))
# print(ncol(seq.data))

Again, note the code similarity.

  1. You can even mix styles using the following code:
my_cols = robjects.r.ncol(seq_data)
print(my_cols)

You can call R functions directly; in this case, we will call ncol if they do not have dots in their name; however, be careful. This will display an output, not 26 (the number of columns), but [26], which is a vector that's composed of the element 26. This is because, by default, most operations in R return vectors. If you want the number of columns, you have to perform my_cols[0]. Also, talking about pitfalls, note that R array indexing starts with 1, whereas Python starts with 0.

  1. Now, we need to perform some data cleanup. For example, some columns should be interpreted as numbers, but they are read as strings:
as_integer = robjects.r('as.integer')
match = robjects.r.match

my_col = match('READ_COUNT', seq_data.colnames)[0] # vector returned
print('Type of read count before as.integer: %s' % seq_data[my_col - 1].rclass[0])
seq_data[my_col - 1] = as_integer(seq_data[my_col - 1])
print('Type of read count after as.integer: %s' % seq_data[my_col - 1].rclass[0])

The match function is somewhat similar to the index method in Python lists. As expected, it returns a vector so that we can extract the 0 element. It's also 1-indexed, so we subtract 1 when working on Python. The as_integer function will convert a column into integers. The first print will show strings (values surrounded by " ), whereas the second print will show numbers.

  1. We will need to massage this table a bit more; details on this can be found on the Notebook, but here, we will finalize getting the DataFrame to R (remember that while it's an R object, it's actually visible on the Python namespace):
import rpy2.robjects.lib.ggplot2 as ggplot2

This will create a variable in the R namespace called seq.datawith the content of the DataFrame from the Python namespace. Note that after this operation, both objects will be independent (if you change one, it will not be reflected on the other).

While you can perform plotting on Python, R has default built-in plotting functionalities (which we will ignore here). It also has a library called ggplot2 that implements the Grammar of Graphics (a declarative language to specify statistical charts).
  1. With regard to our concrete example based on the Human 1,000 Genomes Project, we will first plot a histogram with the distribution of center names, where all sequencing lanes were generated. We will use ggplot for this:
from rpy2.robjects.functions import SignatureTranslatedFunction

ggplot2.theme = SignatureTranslatedFunction(ggplot2.theme, init_prm_translate = {'axis_text_x': 'axis.text.x'})

bar = ggplot2.ggplot(seq_data) + ggplot2.geom_bar() + ggplot2.aes_string(x='CENTER_NAME') + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1))
robjects.r.png('out.png', type='cairo-png')
bar.plot()
dev_off = robjects.r('dev.off')
dev_off()

The second line is a bit uninteresting, but is an important piece of boilerplate code. One of the R functions that we will call has a parameter with a dot in its name. As Python function calls cannot have this, we must map the axis.text.x R parameter name to the axis_text_r Python name in the function theme. We monkey patch it (that is, we replace ggplot2.theme with a patched version of itself).

We then draw the chart itself. Note the declarative nature of ggplot2 as we add features to the chart. First, we specify the seq_data DataFrame, then we use a histogram bar plot called geom_bar, followed by annotating the x variable (CENTER_NAME). Finally, we rotate the text of the x axis by changing the theme. We finalize this by closing the R printing device.

  1. We can now print the image on the Jupyter Notebook:
Image(filename='out.png')

The following chart is produced:

Figure 1: The ggplot2-generated histogram of center names, which is responsible for sequencing the lanes of the human genomic data from the 1,000 Genomes Project
  1. As a final example, we will now do a scatter plot of read and base counts for all the sequenced lanes for Yoruban (YRI) and Utah residents with ancestry from Northern and Western Europe (CEU), using the Human 1,000 Genomes Project (the summary of the data of this project, which we will use thoroughly, can be seen in the Working with modern sequence formats recipe in Chapter 2, Next-Generation Sequencing). We are also interested in the differences between the different types of sequencing (exome, high, and low coverage). First, we generate a DataFrame only just YRI and CEU lanes, and limit the maximum base and read counts:
robjects.r('yri_ceu <- seq.data[seq.data$POPULATION %in% c("YRI", "CEU") & seq.data$BASE_COUNT < 2E9 & seq.data$READ_COUNT < 3E7, ]')
yri_ceu = robjects.r('yri_ceu')
  1. We are now ready to plot:
scatter = ggplot2.ggplot(yri_ceu) + ggplot2.aes_string(x='BASE_COUNT', y='READ_COUNT', shape='factor(POPULATION)', col='factor(ANALYSIS_GROUP)') + ggplot2.geom_point()
robjects.r.png('out.png')
scatter.plot()

Hopefully, this example (refer to the following screenshot) makes the power of the Grammar of Graphics approach clear. We will start by declaring the DataFrame and the type of chart in use (the scatter plot implemented by geom_point).

Note how easy it is to express that the shape of each point depends on the POPULATION variable and the color on the ANALYSIS_GROUP:


Figure 2: The ggplot2-generated scatter plot with base and read counts for all sequencing lanes read; the color and shape of each dot reflects categorical data (population and the type of data sequenced)
  1. Because the R DataFrame is so close to pandas, it makes sense to convert between the two, as that is supported by rpy2:
pd_yri_ceu = pandas2ri.ri2py(yri_ceu)
del pd_yri_ceu['PAIRED_FASTQ']
no_paired = pandas2ri.py2ri(pd_yri_ceu)
robjects.r.assign('no.paired', no_paired)
robjects.r("print(colnames(no.paired))")

We start by importing the necessary conversion module. We then convert the R DataFrame (note that we are converting yri_ceu in the R namespace, not the one on the Python namespace). We delete the column that indicates the name of the paired FASTQ file on the pandas DataFrame and copy it back to the R namespace. If you print the column names of the new R DataFrame, you will see that PAIRED_FASTQ is missing.

There's more...

It's worth repeating that the advances in the Python software ecology are occurring at a breakneck pace. This means that if a certain functionality is not available today, it might be released sometime in the near future. So, if you are developing a new project, be sure to check for the very latest developments on the Python front before using functionality from an R package.

There are plenty of R packages for Bioinformatics in the Bioconductor project (http://www.bioconductor.org/). This should probably be your first port of call in the R world for bioinformatics functionalities. However, note that there are many R Bioinformatics packages that are not on Bioconductor, so be sure to search the wider R packages on Comprehensive R Archive Network (CRAN) (refer to CRAN at http://cran.rproject.org/).

There are plenty of plotting libraries for Python. Matplotlib is the most common library, but you also have a plethora of other choices. In the context of R, it's worth noting that there is a ggplot2-like implementation for Python based on the Grammar of Graphics description language for charts, and this is called—surprise, surprise—ggplot! (http://yhat.github.io/ggpy/).

See also

Performing R magic with Jupyter Notebook

You have probably heard of, and maybe used, the Jupyter Notebook. Among many other features, Juptyter provides a framework of extensible commands called magics (actually, this only works with the IPython kernel of Jupyter, but that is the one we are concerned with), which allow you to extend the language in many useful ways. There are magic functions to deal with R. As you will see in our example, it makes R interfacing much more declarative and easy. This recipe will not introduce any new R functionalities, but hopefully, it will make it clear how IPython can be an important productivity boost for scientific computing in this regard.

Getting ready

You will need to follow the previous Getting ready steps of the Interfacing with R via rpy2 recipe. The Notebook is Chapter01/R_magic.ipynb. The Notebook is more complete than the recipe presented here, and includes more chart examples. For brevity here, we will only concentrate on the fundamental constructs to interact with R using magics.

How to do it...

This recipe is an aggressive simplification of the previous one because it illustrates the conciseness and elegance of R magics:

  1. The first thing you need to do is load R magics and ggplot2:
import rpy2.robjects as robjects
import rpy2.robjects.lib.ggplot2 as ggplot2%load_ext rpy2.ipython

Note that the % starts an IPython-specific directive. Just as a simple example, you can write %R print(c(1, 2)) on a Jupyter cell.

Check out how easy it is to execute the R code without using the robjects package. Actually, rpy2 is being used to look under the hood.

  1. Let's read the sequence.index file that was downloaded in the previous recipe:
%%R
seq.data <- read.delim('sequence.index', header=TRUE, stringsAsFactors=FALSE)
seq.data$READ_COUNT <- as.integer(seq.data$READ_COUNT)
seq.data$BASE_COUNT <- as.integer(seq.data$BASE_COUNT)

You can then specify that the whole cell should be interpreted as an R code by using %%R (note the double %%).

  1. We can now transfer the variable to the Python namespace:
seq_data = %R seq.data
print(type(seq_data)) # pandas dataframe!

The type of the DataFrame is not a standard Python object, but a pandas DataFrame. This is a departure from previous versions of the R magic interface.

  1. As we have a pandas DataFrame, we can operate on it quite easily using pandas' interface:
my_col = list(seq_data.columns).index("CENTER_NAME")
seq_data['CENTER_NAME'] = seq_data['CENTER_NAME'].apply(lambda x: x.upper())
  1. Let's put this DataFrame back in the R namespace, as follows:
%R -i seq_data
%R print(colnames(seq_data))

The -i argument informs the magic system that the variable that follows on the Python space is to be copied in the R namespace. The second line just shows that the DataFrame is indeed available in R. The name that we are using is different from the original—it's seq_data instead of seq.data.

  1. Let's do some final cleanup (for details, see the precious recipe) and print the same bar chart as before:
%%R
bar <- ggplot(seq_data) + aes(factor(CENTER_NAME)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(bar)

The R magic system also allows you to reduce code, as it changes the behavior of the interaction of R with IPython. For example, in the ggplot2 code of the previous recipe, you do not need to use the .png and dev.off R functions, as the magic system will take care of this for you. When you tell R to print a chart, it will magically appear in your Notebook or graphical console.

There's more...

The R magics have seemed to have changed quite a lot over time in terms of interface. For example, I updated the R code for the first edition of this book a few times. The current version of DataFrame assignment returns pandas objects, which is a major change. Be careful with the version of Jupyter that you use as the %R code can be quite different. If this code does not work and you are using an older version, consult the Notebooks of the first edition of this book, as they might help.

See also

Left arrow icon Right arrow icon

Key benefits

  • Perform complex bioinformatics analysis using the most important Python libraries and applications
  • Implement next-generation sequencing, metagenomics, automating analysis, population genetics, and more
  • Explore various statistical and machine learning techniques for bioinformatics data analysis

Description

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data. This book covers next-generation sequencing, genomics, metagenomics, population genetics, phylogenetics, and proteomics. You'll learn modern programming techniques to analyze large amounts of biological data. With the help of real-world examples, you'll convert, analyze, and visualize datasets using various Python tools and libraries. This book will help you get a better understanding of working with a Galaxy server, which is the most widely used bioinformatics web-based pipeline system. This updated edition also includes advanced next-generation sequencing filtering techniques. You'll also explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks such as Dask and Spark. By the end of this book, you'll be able to use and implement modern programming techniques and frameworks to deal with the ever-increasing deluge of bioinformatics data.

Who is this book for?

This book is for Data data Scientistsscientists, Bioinformatics bioinformatics analysts, researchers, and Python developers who want to address intermediate-to-advanced biological and bioinformatics problems using a recipe-based approach. Working knowledge of the Python programming language is expected.

What you will learn

  • Learn how to process large next-generation sequencing (NGS) datasets
  • Work with genomic dataset using the FASTQ, BAM, and VCF formats
  • Learn to perform sequence comparison and phylogenetic reconstruction
  • Perform complex analysis with protemics data
  • Use Python to interact with Galaxy servers
  • Use High-performance computing techniques with Dask and Spark
  • Visualize protein dataset interactions using Cytoscape
  • Use PCA and Decision Trees, two machine learning techniques, with biological datasets
Estimated delivery fee Deliver to Cyprus

Premium delivery 7 - 10 business days

€32.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Nov 30, 2018
Length: 360 pages
Edition : 2nd
Language : English
ISBN-13 : 9781789344691
Languages :
Concepts :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Cyprus

Premium delivery 7 - 10 business days

€32.95
(Includes tracking information)

Product Details

Publication date : Nov 30, 2018
Length: 360 pages
Edition : 2nd
Language : English
ISBN-13 : 9781789344691
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
€18.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
€189.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts
€264.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total 142.97
Hands-On Image Processing with Python
€37.99
R Bioinformatics Cookbook
€44.99
Bioinformatics with Python Cookbook
€59.99
Total 142.97 Stars icon
Banner background image

Table of Contents

11 Chapters
Python and the Surrounding Software Ecology Chevron down icon Chevron up icon
Next-Generation Sequencing Chevron down icon Chevron up icon
Working with Genomes Chevron down icon Chevron up icon
Population Genetics Chevron down icon Chevron up icon
Population Genetics Simulation Chevron down icon Chevron up icon
Phylogenetics Chevron down icon Chevron up icon
Using the Protein Data Bank Chevron down icon Chevron up icon
Bioinformatics Pipelines Chevron down icon Chevron up icon
Python for Big Genomics Datasets Chevron down icon Chevron up icon
Other Topics in Bioinformatics Chevron down icon Chevron up icon
Advanced NGS Processing Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Half star icon Empty star icon 3.5
(4 Ratings)
5 star 50%
4 star 0%
3 star 25%
2 star 0%
1 star 25%
math_guy51 Jun 19, 2020
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I know of no other book covering the Advanced Bioinformatics/Biopython topics found in this book.This is not an introductory book ...
Amazon Verified review Amazon
Soup Isarangkoon Jul 10, 2020
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book is very insightful with all the step by step instructions on different projects one can try with Bioinformatics using Python. This book is not too simplistic (some previous knowledge assumed), meanwhile it is not too hard that readers cannot follow through. If one wants to jump into Bioinformatics with Python, this is the book to read!Word of Warning: Some knowledge in Bioinformatics and Python is assumed in the book.
Amazon Verified review Amazon
RUser Feb 09, 2021
Full star icon Full star icon Full star icon Empty star icon Empty star icon 3
Dieses Buch ist nicht für "Anfänger" geeignet. Was es nicht leistet: Es erklärt nicht die Sprache Python und nicht die im Buch verwendeten Algorithmen. Es ist ja auch eine Rezeptsammlung!!! Allerdings muss man die Küche vorher selbst bauen oder um es konkret zu sagen: Man sollte sich mit Anaconda auskennen - und das am besten auf Linux oder Ubuntu... Unter Windows wird man vermutlich nur den halben Spaß haben, da einige benötigte Tools dort nicht laufen. Der Autor erklärt das gut. Leider ist der Code insofern veraltet, als nicht mehr zugängliche Libraries verwendet werden. An einigen Stellen hilft dann Auskommentieren des Codes weiter. Der Autor hat auf Anfrage nach Hinweisen zur Korrektur des Codes auch bisher nur geantwortet, dass das Buch 2018 geschrieben sei und zu dem Zeitpunkt der Code noch funktionierte (die Abschaffung der problematischen Libraries wurde auch damals schon länger debattiert). Für ein Buch, zu dem eine Seite mit Druckfehlern existiert (die leer ist) und ein Repository im Internet (enthält aber nur die nicht rund laufenden Codes), ist das schwach. Ich hätte mir zumindest aktualisierten Code gewünscht. Immerhin merkt man dem Buch an, wie gerne der Autor seine Faszination weitergeben würde. Außerdem konnte ich ein Rezept nach einigen kleinen Korrekturen und Anpassungen auch für eine spielerische Analyse von COVID-DNA verwenden. Daher gebe ich gerne die drei Sterne als Bewertung. Aber ich sage es nochmal ausdrücklich: Um eine Sammlung von Rezepten, die man ohne Vorkenntnisse und Mühe nachkochen kann, handelt es sich nicht!
Amazon Verified review Amazon
Alejandro Marquez Apr 11, 2020
Full star icon Empty star icon Empty star icon Empty star icon Empty star icon 1
El libro llegó sin el empaquetado correcto (no emplayado como usualmente venden los libros), doblado de varias hojas y pasta, además las hojas del canto manchadas al parecer con lo que podrían ser dedos. Estamos en cuarentena por covid-19 y debería venir protegido el ejemplar lo mejor posible y contar con el emplayado habitual de los libros que es el que evita sea manoseado directamente.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact [email protected] with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at [email protected] using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on [email protected] with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on [email protected] within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on [email protected] who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on [email protected] within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela