import os
import sys
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
%precision 4'ggplot')


SciPy’s official tutorial on Linear algebra

Linear Algebra and Matrix Decompositions

Large Linear Systems

This is the age of Big Data. Every second of every day, data is being recorded in countless systems over the world. Our shopping habits, book and movie preferences, key words typed into our email messages, medical records, NSA recordings of our telephone calls, genomic data - and none of it is any use without analysis.

Enormous data sets carry with them enormous challenges in data processing. Solving a system of \(10\) equations in \(10\) unknowns is easy, and one need not be terribly careful about methodolgy. But as the size of the system grows, algorithmic complexity and efficiency become critical.

Example: Netflix Competition (circa 2006-2009)

For a more complete description:

The whole technical story

In 2006, Netflix opened a competition where it provided ratings of over \(400,000\) for \(18,000\) movies. The goal was to make predict a user’s rating of a movie, based on previous ratings and ratings of ‘similar’ users. The task amounted to analysis of a \(400,000\times 18,000\) matrix! The wikipedia link above describes the contest and the second link is a very detailed description of the method (which took into account important characteristics such as how tastes may change over time). Part of the analysis is related to matrix decomposition - we won’t go into the details of the winning algorithm, but we will spend some time on basic matrix decompositions.

Matrix Decompositions

Matrix decompositions are an important step in solving linear systems in a computationally efficient manner.

LU Decomposition and Gaussian Elimination

LU stands for ‘Lower Upper’, and so an LU decomposition of a matrix \(A\) is a decomposition so that

\[A= LU\]

where \(L\) is lower triangular and \(U\) is upper triangular.

Now, LU decomposition is essentially gaussian elimination, but we work only with the matrix \(A\) (as opposed to the augmented matrix).

Let’s review how gaussian elimination (ge) works. We will deal with a \(3\times 3\) system of equations for conciseness, but everything here generalizes to the \(n\times n\) case. Consider the following equation:

\[\begin{split}\left(\begin{matrix}a_{11}&a_{12} & a_{13}\\a_{21}&a_{22}&a_{23}\\a_{31}&a_{32}&a_{33}\end{matrix}\right)\left(\begin{matrix}x_1\\x_2\\x_3\end{matrix}\right) = \left(\begin{matrix}b_1\\b_2\\b_3\end{matrix}\right)\end{split}\]

For simplicity, let us assume that the leftmost matrix \(A\) is non-singular. To solve the system using ge, we start with the ‘augmented matrix’:

\[\begin{split}\left(\begin{array}{ccc|c}a_{11}&a_{12} & a_{13}& b_1 \\a_{21}&a_{22}&a_{23}&b_2\\a_{31}&a_{32}&a_{33}&b_3\end{array}\right)\end{split}\]

We begin at the first entry, \(a_{11}\). If \(a_{11} \neq 0\), then we divide the first row by \(a_{11}\) and then subtract the appropriate multiple of the first row from each of the other rows, zeroing out the first entry of all rows. (If \(a_{11}\) is zero, we need to permute rows. We will not go into detail of that here.) The result is as follows:

\[\begin{split}\left(\begin{array}{ccc|c} 1 & \frac{a_{12}}{a_{11}} & \frac{a_{13}}{a_{11}} & \frac{b_1}{a_{11}} \\ 0 & a_{22} - a_{21}\frac{a_{12}}{a_{11}} & a_{23} - a_{21}\frac{a_{13}}{a_{11}} & b_2 - a_{21}\frac{b_1}{a_{11}}\\ 0&a_{32}-a_{31}\frac{a_{12}}{a_{11}} & a_{33} - a_{31}\frac{a_{13}}{a_{11}} &b_3- a_{31}\frac{b_1}{a_{11}}\end{array}\right)\end{split}\]

We repeat the procedure for the second row, first dividing by the leading entry, then subtracting the appropriate multiple of the resulting row from each of the third and first rows, so that the second entry in row 1 and in row 3 are zero. We could continue until the matrix on the left is the identity. In that case, we can then just ‘read off’ the solution: i.e., the vector \(x\) is the resulting column vector on the right. Usually, it is more efficient to stop at reduced row eschelon form (upper triangular, with ones on the diagonal), and then use back substitution to obtain the final answer.

Note that in some cases, it is necessary to permute rows to obtain reduced row eschelon form. This is called partial pivoting. If we also manipulate columns, that is called full pivoting.

It should be mentioned that we may obtain the inverse of a matrix using ge, by reducing the matrix \(A\) to the identity, with the identity matrix as the augmented portion.

Now, this is all fine when we are solving a system one time, for one outcome \(b\). Many applications involve solutions to multiple problems, where the left-hand-side of our matrix equation does not change, but there are many outcome vectors \(b\). In this case, it is more efficient to decompose \(A\).

First, we start just as in ge, but we ‘keep track’ of the various multiples required to eliminate entries. For example, consider the matrix

\[\begin{split}A = \left(\begin{matrix} 1 & 3 & 4 \\ 2& 1& 3\\ 4&1&2 \end{matrix}\right)\end{split}\]

We need to multiply row \(1\) by \(2\) and subtract from row \(2\) to eliminate the first entry in row \(2\), and then multiply row \(1\) by \(4\) and subtract from row \(3\). Instead of entering zeroes into the first entries of rows \(2\) and \(3\), we record the multiples required for their elimination, as so:

\[\begin{split}\left(\begin{matrix} 1 & 3 & 4 \\ (2)& -5 & -5\\ (4)&-11&-14 \end{matrix}\right)\end{split}\]

And then we eliminate the second entry in the third row:

\[\begin{split}\left(\begin{matrix} 1 & 3 & 4 \\ (2)& -5 & -5\\ (4)&(\frac{-11}{5})&-3 \end{matrix}\right)\end{split}\]

And now we have the decomposition:

\[\begin{split}L= \left(\begin{matrix} 1 & 0 & 0 \\ 2& 1 & 0\\ 4&\frac{-11}5&1 \end{matrix}\right) U = \left(\begin{matrix} 1 & 3 & 4 \\ 0& -5 & -5\\ 0&0&-3 \end{matrix}\right)\end{split}\]

We can solve the system by solving two back-substitution problems:

\[Ly = b\]



These are both \(O(n^2)\), so it is more efficient to decompose when there are multiple outcomes to solve for.

Let do this with numpy:

import numpy as np
import scipy.linalg as la

A = np.array([[1,3,4],[2,1,3],[4,1,2]])


P, L, U =
print(, A))
print(, U))
[[1 3 4]
 [2 1 3]
 [4 1 2]]
[[ 4.  1.  2.]
 [ 1.  3.  4.]
 [ 2.  1.  3.]]

[[ 4.  1.  2.]
 [ 1.  3.  4.]
 [ 2.  1.  3.]]
[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]
[[ 1.      0.      0.    ]
 [ 0.25    1.      0.    ]
 [ 0.5     0.1818  1.    ]]
[[ 4.      1.      2.    ]
 [ 0.      2.75    3.5   ]
 [ 0.      0.      1.3636]]

Note that the numpy decomposition uses partial pivoting (matrix rows are permuted to use the largest pivot). This is because small pivots can lead to numerical instability. Another reason why one should use library functions whenever possible!

Cholesky Decomposition

Recall that a square matrix \(A\) is positive definite if

\[\begin{split}u^TA u > 0\end{split}\]

for any non-zero n-dimensional vector \(u\),

and a symmetric, positive-definite matrix \(A\) is a positive-definite matrix such that

\[A = A^T\]

Let \(A\) be a symmetric, positive-definite matrix. There is a unique decomposition such that

\[A = L L^T\]

where \(L\) is lower-triangular with positive diagonal elements and \(L^T\) is its transpose. This decomposition is known as the Cholesky decompostion, and \(L\) may be interpreted as the ‘square root’ of the matrix \(A\).


Let \(A\) be an \(n\times n\) matrix. We find the matri \(L\) using the following iterative procedure:

\[\begin{split}A = \left(\begin{matrix}a_{11}&A_{12}\\A_{12}&A_{22}\end{matrix}\right) = \left(\begin{matrix}\ell_{11}&0\\ L_{12}&L_{22}\end{matrix}\right) \left(\begin{matrix}\ell_{11}&L_{12}\\0&L_{22}\end{matrix}\right)\end{split}\]

1.) Let \(\ell_{11} = \sqrt{a_{11}}\)

2.) \(L_{12} = \frac{1}{\ell_{11}}A_{12}\)

3.) Solve \(A_{22} - L_{12}L_{12}^T = L_{22}L_{22}^T\) for \(L_{22}\)


\[\begin{split}A = \left(\begin{matrix}1&3&5\\3&13&23\\5&23&42\end{matrix}\right)\end{split}\]
\[\ell_{11} = \sqrt{a_{11}} = 1\]
\[L_{12} = \frac{1}{\ell_{11}} A_{12} = A_{12}\]

And so we conclude that \(\ell_{33}=1\).

This yields the decomposition:

\[\begin{split}\left(\begin{matrix}1&3&5\\3&13&23\\5&23&42\end{matrix}\right) = \left(\begin{matrix}1&0&0\\3&2&0\\5&4&1\end{matrix}\right)\left(\begin{matrix}1&3&5\\0&2&4\\0&0&1\end{matrix}\right)\end{split}\]

Now, with numpy:

A = np.array([[1,3,5],[3,13,23],[5,23,42]])
L = la.cholesky(A)
print(, L))

[[  1.   3.   5.]
 [  3.  13.  23.]
 [  5.  23.  42.]]
[[ 1.  3.  5.]
 [ 0.  2.  4.]
 [ 0.  0.  1.]]
[[ 1  3  5]
 [ 3 13 23]
 [ 5 23 42]]

Cholesky decomposition is about twice as fast as LU decomposition (though both scale as \(n^3\)).

Matrix Decompositions for PCA and Least Squares


Eigenvectors and Eigenvalues

First recall that an eigenvector of a matrix \(A\) is a non-zero vector \(v\) such that

\[Av = \lambda v\]

for some scalar \(\lambda\)

The value \(\lambda\) is called an eigenvalue of \(A\).

If an \(n\times n\) matrix \(A\) has \(n\) linearly independent eigenvectors, then \(A\) may be decomposed in the following manner:

\[A = B\Lambda B^{-1}\]

where \(\Lambda\) is a diagonal matrix whose diagonal entries are the eigenvalues of \(A\) and the columns of \(B\) are the corresponding eigenvectors of \(A\).


  • An \(n\times n\) matrix is diagonizable \(\iff\) it has \(n\) linearly independent eigenvectors.

  • A symmetric, positive definite matrix has only positive eigenvalues and its eigendecomposition

    \[A=B\Lambda B^{-1}\]

is via an orthogonal transformation \(B\). (I.e. its eigenvectors are an orthonormal set)

Calculating Eigenvalues

It is easy to see from the definition that if \(v\) is an eigenvector of an \(n\times n\) matrix \(A\) with eigenvalue \(\lambda\), then

\[Av - \lambda I = \bf{0}\]

where \(I\) is the identity matrix of dimension \(n\) and \(\bf{0}\) is an n-dimensional zero vector. Therefore, the eigenvalues of \(A\) satisfy:

\[\det\left(A-\lambda I\right)=0\]

The left-hand side above is a polynomial in \(\lambda\), and is called the characteristic polynomial of \(A\). Thus, to find the eigenvalues of \(A\), we find the roots of the characteristic polynomial.

Computationally, however, computing the characteristic polynomial and then solving for the roots is prohibitively expensive. Therefore, in practice, numerical methods are used - both to find eigenvalues and their corresponding eigenvectors. We won’t go into the specifics of the algorithms used to calculate eigenvalues, but here is a numpy example:

A = np.array([[0,1,1],[2,1,0],[3,4,5]])

u, V = la.eig(A)
print(,, la.inv(V))))
[[-0.+0.j  1.+0.j  1.+0.j]
 [ 2.+0.j  1.+0.j  0.+0.j]
 [ 3.+0.j  4.+0.j  5.+0.j]]
[ 5.8541+0.j -0.8541+0.j  1.0000+0.j]

NB: Many matrices are not diagonizable, and many have complex eigenvalues (even if all entries are real).

A = np.array([[0,1],[-1,0]])

u, V = la.eig(A)
print(,, la.inv(V))))
[[ 0  1]
 [-1  0]]
[[ 0.+0.j  1.+0.j]
 [-1.+0.j  0.+0.j]]
[ 0.+1.j  0.-1.j]
# If you know the eigenvalues must be reeal
# because A is a positive definite (e.g. covariance) matrix
# use real_if_close

A = np.array([[0,1,1],[2,1,0],[3,4,5]])
u, V = la.eig(A)
print np.real_if_close(u)
[ 5.8541+0.j -0.8541+0.j  1.0000+0.j]
[ 5.8541 -0.8541  1.    ]

Singular Values

For any \(m\times n\) matrix \(A\), we define its singular values to be the square root of the eigenvalues of \(A^TA\). These are well-defined as \(A^TA\) is always symmetric, positive-definite, so its eigenvalues are real and positive. Singular values are important properties of a matrix. Geometrically, a matrix \(A\) maps the unit sphere in \(\mathbb{R}^n\) to an ellipse. The singular values are the lengths of the semi-axes.

Singular values also provide a measure of the stabilty of a matrix. We’ll revisit this in the end of the lecture.

QR decompositon

As with the previous decompositions, \(QR\) decomposition is a method to write a matrix \(A\) as the product of two matrices of simpler form. In this case, we want:

\[A= QR\]

where \(Q\) is an \(m\times n\) matrix with \(Q Q^T = I\) (i.e. \(Q\) is orthogonal) and \(R\) is an \(n\times n\) upper-triangular matrix.

This is really just the matrix form of the Gram-Schmidt orthogonalization of the columns of \(A\). The G-S algorithm itself is unstable, so various other methods have been developed to compute the QR decomposition. We won’t cover those in detail as they are a bit beyond our scope.

The first \(k\) columns of \(Q\) are an orthonormal basis for the column space of the first \(k\) columns of \(A\).

Iterative QR decomposition is often used in the computation of eigenvalues.

Singular Value Decomposition

Another important matrix decomposition is singular value decomposition or SVD. For any \(m\times n\) matrix \(A\), we may write:

\[A= UDV\]

where \(U\) is a unitary (orthogonal in the real case) \(m\times m\) matrix, \(D\) is a rectangular, diagonal \(m\times n\) matrix with diagonal entries \(d_1,...,d_m\) all non-negative. \(V\) is a unitary (orthogonal) \(n\times n\) matrix. SVD is used in principle component analysis and in the computation of the Moore-Penrose pseudo-inverse.

Stabilty and Condition Number

It is important that numerical algorithms be stable and efficient. Efficiency is a property of an algorithm, but stability can be a property of the system itself.


\[\begin{split}\left(\begin{matrix}8&6&4&1\\1&4&5&1\\8&4&1&1\\1&4&3&6\end{matrix}\right)x = \left(\begin{matrix}19\\11\\14\\14\end{matrix}\right)\end{split}\]
A = np.array([[8,6,4,1],[1,4,5,1],[8,4,1,1],[1,4,3,6]])
b = np.array([19,11,14,14])
array([ 1.,  1.,  1.,  1.])
b = np.array([19.01,11.05,14.07,14.05])
array([-2.34 ,  9.745, -4.85 , -1.34 ])

Note that the tiny perturbations in the outcome vector \(b\) cause large differences in the solution! When this happens, we say that the matrix \(A\) ill-conditioned. This happens when a matrix is ‘close’ to being singular (i.e. non-invertible).

Condition Number

A measure of this type of behavior is called the condition number. It is defined as:

\[cond(A) = ||A||\cdot ||A^{-1}||\]

In general, it is difficult to compute.


\[cond(A) = \frac{\lambda_1}{\lambda_n}\]

where \(\lambda_1\) is the maximum singular value of \(A\) and \(\lambda_n\) is the smallest. The higher the condition number, the more unstable the system. In general if there is a large discrepancy between minimal and maximal singular values, the condition number is large.


U, s, V = np.linalg.svd(A)
[ 15.5457   6.9002   3.8363   0.0049]


We can sometimes improve on this behavior by ‘pre-conditioning’. Instead of solving


we solve

\[D^{-1}Ax=D^{-1}b\]\[where :math:`D^{-1}A` has a lower condition number than :math:`A`\]


Preconditioning is a very involved topic, quite out of the range of this course. It is mentioned here only to make you aware that such a thing exists, should you ever run into an ill-conditioned problem!


1. Compute the LU decomposition of the following matrix by hand and using numpy

A = np.array([[1,2,3],[2,-4,6],[3,-9,-3]])
P, L , U =
[[ 1  2  3]
 [ 2 -4  6]
 [ 3 -9 -3]]
[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]
[[ 1.      0.      0.    ]
 [ 0.3333  1.      0.    ]
 [ 0.6667  0.4     1.    ]]
[[ 3.  -9.  -3. ]
 [ 0.   5.   4. ]
 [ 0.   0.   6.4]]

2. Compute the Cholesky decomposition of the following matrix by hand and using numpy

# Your code here

array([[ 2.    ,  0.    ,  0.    ],
       [ 1.    ,  1.7321,  0.    ],
       [ 1.5   ,  2.0207,  1.291 ]])

3. Write a function in Python to solve a system

\[Ax = b\]

using SVD decomposition. Your function should take \(A\) and \(b\) as input and return \(x\).

Your function should include the following:

  • First, check that \(A\) is invertible - return error message if it is not
  • Invert \(A\) using SVD and solve
  • return \(x\)

Test your function for correctness.

# Your code here

def svdsolver(A,b):
    U, s, V = np.linalg.svd(A)
    if == 0:
       print("Matrix is singular")
       return**(-1))), U.T),b)
A = np.array([[1,1],[1,2]])
b = np.array([3,1])
[ 5. -2.]
[ 5. -2.]