The following image shows a graph that represents a series of web pages (numbered from 1 to 8):
An arrow from a node to another indicates the existence of a link from the web page, represented by the sending node, to the page represented by the receiving node. For example, the arrow from node 2 to node 1 indicates that there is a link in web page 2 pointing to web page 1. Notice how web page 4 has two outer links (to pages 2 and 8), and there are three pages that link to web page 4 (pages 2, 6, and 7). The pages represented by nodes 2, 4, and 8 seem to be the most popular at first sight.
Is there a mathematical way to actually express the popularity of a web page within a network? Researchers at Google came up with the idea of a PageRank
to roughly estimate this concept by counting the number and quality of links to a page. It goes like this:
- We construct a transition matrix of this graph,
T={a[i,j]}
, in the following fashion: the entry a[i,j]
is 1/k
if there is a link from web page i
to web page j
, and the total number of outer links in web page i
amounts to k
. Otherwise, the entry is just zero. The size of a transition matrix of N web pages is always N × N. In our case, the matrix has size 8 × 8:
Let us open an iPython session and load this particular matrix to memory.
Note
Remember that in Python, indices start from zero, not one.
In [1]: import numpy as np, matplotlib.pyplot as plt, \
...: scipy.linalg as spla, scipy.sparse as spsp, \
...: scipy.sparse.linalg as spspla
In [2]: np.set_printoptions(suppress=True, precision=3)
In [3]: cols = np.array([0,1,1,2,2,3,3,4,4,5,6,6,6,7,7]); \
...: rows = np.array([1,0,3,1,4,1,7,6,7,3,2,3,7,5,6]); \
...: data = np.array([1., 0.5, 0.5, 0.5, 0.5, \
...: 0.5, 0.5, 0.5, 0.5, 1., \
...: 1./3, 1./3, 1./3, 0.5, 0.5])
In [4]: T = np.zeros((8,8)); \
...: T[rows,cols] = data
From the transition matrix, we create a PageRank
matrix G
by fixing a positive constant p
between 0 and 1, and following the formula G = (1-p)*T + p*B for a suitable damping factor p
. Here, B
is a matrix with the same size as T
, with all its entries equal to 1/N
. For example, if we choose p = 0.15
, we obtain the following PageRank
matrix:
PageRank
matrices have some interesting properties:
- 1 is an eigenvalue of multiplicity one.
- 1 is actually the largest eigenvalue; all the other eigenvalues are in modulus smaller than 1.
- The eigenvector corresponding to eigenvalue 1 has all positive entries. In particular, for the eigenvalue 1, there exists a unique eigenvector with the sum of its entries equal to 1. This is what we call the
PageRank
vector.
A quick computation with scipy.linalg.eig
finds that eigenvector for us:
Those values correspond to the PageRank
of each of the eight web pages depicted on the graph. As expected, the maximum value of those is associated to the second web page (0.232
), closely followed by the fourth (0.219
) and then the eighth web page (0.157
). These values provide us with the information that we were seeking: the second web page is the most popular, followed by the fourth, and then, the eight.
Note
Note how this problem of networks of web pages has been translated into mathematical objects, to an equivalent problem involving matrices, eigenvalues, and eigenvectors, and has been solved with techniques of Linear Algebra.
The transition matrix is sparse: most of its entries are zeros. Sparse matrices with an extremely large size are of special importance in Numerical Linear Algebra, not only because they encode challenging scientific problems but also because it is extremely hard to manipulate them with basic algorithms.
Rather than storing to memory all values in the matrix, it makes sense to collect only the non-zero values instead, and use algorithms which exploit these smart storage schemes. The gain in memory management is obvious. These methods are usually faster for this kind of matrices and give less roundoff errors, since there are usually far less operations involved. This is another advantage of SciPy, since it contains numerous procedures to attack different problems where data is stored in this fashion. Let us observe its power with another example:
The University of Florida Sparse Matrix Collection is the largest database of matrices accessible online. As of January 2014, it contains 157 groups of matrices arising from all sorts of scientific disciplines. The sizes of the matrices range from very small (1 × 2) to insanely large (28 million × 28 million). More matrices are expected to be added constantly, as they arise in different engineering problems.
For example, the group with the most matrices in the database is the original Harwell-Boeing Collection, with 292 different sparse matrices. This group can also be accessed online at the Matrix Market: http://math.nist.gov/MatrixMarket/.
Each matrix in the database comes in three formats:
- Matrix Market Exchange format [Boisvert et al. 1997]
- Rutherford-Boeing Exchange format [Duff et al. 1997]
- Proprietary Matlab
.mat
format.
Let us import to our iPython session two matrices in the Matrix Market Exchange format from the collection, meant to be used in a solution of a least squares problem. These matrices are located at www.cise.ufl.edu/research/sparse/matrices/Bydder/mri2.html.The numerical values correspond to phantom data acquired on a Sonata 1.5-T scanner (Siemens, Erlangen, Germany) using a
magnetic resonance imaging (MRI) device. The object measured is a simulation of a human head made with several metallic objects. We download the corresponding tar bundle and untar it to get two ASCII files:
mri2.mtx
(the main matrix in the least squares problem)mri2_b.mtx
(the right-hand side of the equation)
The first twenty lines of the file mri2.mtx
read as follows:
The first sixteen lines are comments, and give us some information about the generation of the matrix.
- The computer vision problem where it arose: An MRI reconstruction
- Author information: Mark Bydder, UCSD
- Procedures to apply to the data: Solve a least squares problem A * x - b, and posterior visualization of the result
The seventeenth line indicates the size of the matrix, 63240
rows × 147456
columns, as well as the number of non-zero entries in the data, 569160
.
The rest of the file includes precisely 569160 lines, each containing two integer numbers, and a floating point number: These are the locations of the non-zero elements in the matrix, together with the corresponding values.
Tip
We need to take into account that these files use the FORTRAN convention of starting arrays from 1, not from 0.
A good way to read this file into ndarray
is by means of the function loadtxt
in NumPy. We can then use scipy
to transform the array into a sparse matrix with the function coo_matrix
in the module scipy.sparse
(coo stands for the coordinate internal format).
The best way to visualize the sparsity of this matrix is by means of the routine spy
from the module matplotlib.pyplot
.
Tip
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
We obtain the following image. Each pixel corresponds to an entry in the matrix; white indicates a zero value, and non-zero values are presented in different shades of blue, according to their magnitude (the higher, the darker):
These are the first ten lines from the second file, mri2_b.mtx
, which does not represent a sparse matrix, but a column vector:
Those are six commented lines with information, one more line indicating the shape of the vector (63240
rows and 1
column), and the rest of the lines contain two columns of floating point values, the real and imaginary parts of the corresponding data. We proceed to read this vector to memory, solve the least squares problem suggested, and obtain the following reconstruction that represents a slice of the simulated human head:
Tip
If interested in the theory behind the creation of this matrix and the particulars of this problem, read the article On the optimality of the Gridding Reconstruction Algorithm, by H. Sedarat and D. G. Nishimura, published in IEEE Trans. Medical Imaging, vol. 19, no. 4, pp. 306-317, 2000.
For matrices with a good structure, which are going to be exclusively involved in matrix multiplications, it is often possible to store the objects in smart ways. Let's consider an example.
A horizontal earthquake oscillation affects each floor of a tall building, depending on the natural frequencies of the oscillation of the floors. If we make certain assumptions, a model to quantize the oscillations on buildings with N floors can be obtained as a second-order system of N differential equations by competition: Newton's second law of force is set equal to the sum of Hooke's law of force, and the external force due to the earthquake wave.
These are the assumptions we will need:
- Each floor is considered a point of mass located at its center-of-mass. The floors have masses
m[1], m[2], ..., m[N]
. - Each floor is restored to its equilibrium position by a linear restoring force (Hooke's
-k * elongation
). The Hooke's constants for the floors are k[1], k[2], ..., k[N]
. - The locations of masses representing the oscillation of the floors are
x[1], x[2], ..., x[N]
. We assume all of them functions of time and that at equilibrium, they are all equal to zero. - For simplicity of exposition, we are going to assume no friction: all the damping effects on the floors will be ignored.
- The equations of a floor depend only on the neighboring floors.
Set M
, the mass matrix, to be a diagonal matrix containing the floor masses on its diagonal. Set K
, the Hooke's matrix, to be a tri-diagonal matrix with the following structure, for each row j
, all the entries are zero except for the following ones:
- Column
j-1
, which we set to be k[j+1]
, - Column
j
, which we set to -k[j+1]-k[j+1]
, and - Column
j+1
, which we set to k[j+2]
.
Set H
to be a column vector containing the external force on each floor due to the earthquake, and X
, the column vector containing the functions x[j]
.
We have then the system: M * X'' = K * X + H. The homogeneous part of this system is the product of the inverse of M
with K
, which we denote as A
.
To solve the homogeneous linear second-order system, X'' = A * X, we define the variable Y
to contain 2*N entries: all N
functions x[j]
, followed by their derivatives x'[j]
. Any solution of this second-order linear system has a corresponding solution on the first-order linear system Y' = C * Y, where C
is a block matrix of size 2*N × 2*N. This matrix C
is composed by a block of size N × N containing only zeros, followed horizontally by the identity (of size N × N), and below these two, the matrix A
followed horizontally by another N × N block of zeros.
It is not necessary to store this matrix C
into memory, or any of its factors or blocks. Instead, we will make use of its structure, and use a linear operator to represent it. Minimal data is then needed to generate this operator (only the values of the masses and the Hooke's coefficients), much less than any matrix representation of it.
Let us show a concrete example with six floors. We indicate first their masses and Hooke's constants, and then, proceed to construct a representation of A
as a linear operator:
The construction of C
is very simple now (much simpler than that of its matrix!):
A solution of this homogeneous system comes in the form of an action of the exponential of C
: Y(t) = expm(C*t)* Y(0), where expm()
here denotes a matrix exponential function. In SciPy, this operation is performed with the routine expm_multiply
in the module scipy.sparse.linalg
.
For example, in our case, given the initial value containing the values x[1](0)=0, ..., x[N](0)=0, x'[1](0)=1, ..., x'[N](0)=1
, if we require a solution Y(t)
for values of t
between 0 and 1 in steps of size 0.1
, we could issue the following:
Tip
It has been reported in some installations that, in the next step, a matrix for C must be given instead of the actual linear operator (thus contradicting the manual). If this is the case in your system, simply change C in the next lines to its matrix representation.
The oscillations of the six floors during the first second can then be calculated and plotted. For instance, to view the oscillation of the first floor, we could issue the following:
We obtain the following plot. Note how the first floor rises in the first tenth of a second, only to drop from 0.1 to 0.9 seconds from its original height to almost under a meter and then, start a slow rise:
Tip
For more details about systems of differential equations, and how to solve them with actions of exponentials, read, for example, the excellent book, Elementary Differential Equations 10 ed., by William E. Boyce and Richard C. DiPrima. Wiley, 2012.
These three examples illustrate the goal of this first chapter, Numerical Linear Algebra. In Python, this is accomplished first by storing the data in a matrix form, or as a related linear operator, by means of any of the following classes:
numpy.ndarray
(making sure that they are two-dimensional)numpy.matrix
scipy.sparse.bsr_matrix
(Block Sparse Row matrix)scipy.sparse.coo_matrix
(Sparse Matrix in COOrdinate format)scipy.sparse.csc_matrix
(Compressed Sparse Column matrix)scipy.sparse.csr_matrix
(Compressed Sparse Row matrix)scipy.sparse.dia_matrix
(Sparse matrix with DIAgonal storage)scipy.sparse.dok_matrix
(Sparse matrix based on a Dictionary of Keys)scipy.sparse.lil_matrix
(Sparse matrix based on a linked list)scipy.sparse.linalg.LinearOperator
As we have seen in the examples, the choice of different classes obeys mainly to the sparsity of data and the algorithms that we are to apply to them.
Tip
We will learn when to apply these choices in the following sections.
This choice then dictates the modules that we use for the different algorithms: scipy.linalg
for generic matrices and both scipy.sparse
and scipy.sparse.linalg
for sparse matrices or linear operators. These three SciPy modules are compiled on top of the highly optimized computer libraries BLAS (written in Fortran77), LAPACK (in Fortran90), ARPACK (in Fortran77), and SuperLU (in C).
Note
For a better understanding of these underlying packages, read the description and documentation from their creators:
Most of the routines in these three SciPy modules are wrappers to functions in the mentioned libraries. If we so desire, we also have the possibility to call the underlying functions directly. In the scipy.linalg
module, we have the following:
scipy.linalg.get_blas_funcs
to call routines from BLASscipy.linalg.get_lapack_funcs
to call routines from LAPACK
For example, if we want to use the BLAS
function NRM2
to compute Frobenius norms: