Statistical Computing

We now start a new module on the interface between statistics and computing. Statistical (or mathematical) concepts need to be implemented as algorithms and data structures, with key issues of accuracy, space, and time taken into account. Briefly, we look at the following:

  • Space complexity

  • Time complexity

  • Big O notation for space and time complexity

  • Computer representation of numbers

    • Integers

    • Floating point

    • Decimal

    • Arbitrary precision

    • Vectors

    • Dense and sparse matrices

    • Tensors

  • Accuracy considerations

    • Limits and overflow

    • Round-off errors

    • Catastrophic cancellation

    • Working in log space (see the short sketch after this outline)

    • Condition number

  • Linear algebra foundations

    • Many problems in statistics can be formulated as linear algebra problems

    • Vectors and vector spaces

      • Vector spaces and subspaces are closed under addition and scalar multiplication

      • Viewed as data

      • Viewed as mathematical object

      • Inner products

      • Outer products

      • Projection

      • Vector norms

      • Linear independence

    • Matrices

      • Types and examples

      • Important matrices

        • Symmetric positive definite

        • Orthogonal

      • Span, basis, rank

      • Matrix norms

      • Trace and determinant

      • Eigenvalues and eigenvectors

      • Column space

      • Null space

      • The four fundamental subspaces

    • How to think about matrix-vector multiplication

      • As linear transform - rotate, reflect, stretch, contract

      • As weighted combination of column vectors

    • How to think about matrix-matrix multiplication

      • Row times column

      • Column times row gives sum of rank one matrices

      • Upper times upper = upper

      • Orthogonal times orthogonal = orthogonal

    • Matrix factorization

      • Review \(A = LU\)

        • Row pivoting as a numerical consideration

        • Permutation matrix

      • Review \(A = QR\)

        • Gram-Schmidt procedure

        • Column pivoting as a numerical consideration

      • \(A = V \Lambda V^{-1}\)

        • Spectral theorem

        • Geometry

        • Similarity transforms preserve eigenvalues

      • \(A = U \Sigma V^{T}\)

        • SVD

        • Geometry

        • Pseudo-inverse

        • SVD generates basis for fundamental subspaces

      • Non-negative and sparse matrix factorizations

    • Important linear algebra problems

      • \(Ax = b\) when \(m = n = r\)

      • \(Ax = b\) when \(m > n\)

      • \(Ax = b\) when \(n > m\)

      • \(Ax = b\) when \(A\) is nearly singular

    • General approaches

      • Matrix factorization

      • Iteration

      • Randomization

      • Optimization

    • Application examples

      • Least squares regression

      • Markov chains

      • PCA

      • Graphs

  • Optimization foundations

    • Root finding

      • Bisection vs Newton-Raphson

      • Link with optimization

    • Zeroth order methods

    • Second order methods

      • Convexity and Newton’s method

    • First order methods

      • Gradient descent

      • Stochastic gradient descent

      • ADAM and friends

    • Constrained optimization

      • Lagrange multipliers
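As a preview of the accuracy topics above, here is a minimal sketch (an addition to these notes) of working in log space: summing likelihoods whose logs are large overflows naively, while the log-sum-exp trick stays finite.

[ ]:
import numpy as np

logs = np.array([1000.0, 1001.0, 1002.0])      # hypothetical log-likelihoods

naive = np.log(np.sum(np.exp(logs)))           # exp(1000) overflows, giving inf
m = logs.max()
stable = m + np.log(np.sum(np.exp(logs - m)))  # log-sum-exp: every exponent <= 0
print(naive, stable)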

Big O complexity

A function \(f(n)\) has Big O complexity \(g(n)\) if \(\vert f(n) \vert \le M g(n)\) for some constant \(M\) and all sufficiently large \(n\). For example, \(f(n) = 3n^2 + 10n\) is \(O(n^2)\), since \(3n^2 + 10n \le 13n^2\) for all \(n \ge 1\) (take \(M = 13\)). Common classes for \(g(n)\), in increasing order of complexity, are

  • \(\log n\)

  • linear \(n\)

  • \(n \log n\)

  • polynomial \(n^k\)

  • exponential \(e^n\)

  • factorial \(n!\)

Notes

  • Note 1: parallel processing in most cases gives at best a linear speedup.

  • Note 2: the constant factor can be important, especially for small to moderate values of \(n\) (see the toy cost model below).
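To illustrate Note 2 with a made-up cost model (an addition to the notes): an \(O(n)\) algorithm that costs \(1000n\) operations loses to an \(O(n^2)\) algorithm until \(n\) reaches 1000.

[ ]:
# Hypothetical cost model: 1000n operations vs n^2 operations.
# The quadratic cost is smaller until the crossover at n = 1000.
for n in (10, 100, 1000, 10000):
    print(n, 1000 * n, n ** 2)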

[4]:
import numpy as np
from functools import reduce
[5]:
def factorial(n):
    # multiply 1..n; the initializer 1 makes factorial(0) return 1 instead of raising
    return reduce(lambda a, b: a * b, range(1, n + 1), 1)
[6]:
for n in (5, 20, 50):
    print('n =', n)
    print('-'*40)
    print('log    ', int(np.log2(n)))
    print('linear ', n)
    print('n log n', int(n*np.log2(n)))
    print('n^2    ', n**2)
    print('n^3    ', n**3)
    print('e^n    ', int(np.exp(n)))
    print('n!     ', factorial(n))
    print()
n = 5
----------------------------------------
log     2
linear  5
n log n 11
n^2     25
n^3     125
e^n     148
n!      120

n = 20
----------------------------------------
log     4
linear  20
n log n 86
n^2     400
n^3     8000
e^n     485165195
n!      2432902008176640000

n = 50
----------------------------------------
log     5
linear  50
n log n 282
n^2     2500
n^3     125000
e^n     5184705528587072045056
n!      30414093201713378043612608166064768844377641568960512000000000000

Example

If you have to search a sequence container repeatedly, it is more efficient to sort it first and then use a bisection search.

  • The initial sort takes \(n \log n\) time

  • Each subsequent search takes \(\log n\)

  • The total time for \(k\) searches is \(n \log n + k \log n\), versus roughly \(k \times n/2\) for repeated linear search

[7]:
testx = np.random.randint(0, 2*n, 1000)  # n is still 50 from the loop above, so the test values lie in [0, 100)
[8]:
%%time

n = 10000
xs = list(range(n))
hits = 0
for x in testx:
    if x in xs:  # linear scan: O(n) per lookup on a list
        hits += 1
print(hits)
1000
CPU times: user 11.7 ms, sys: 492 µs, total: 12.2 ms
Wall time: 12.1 ms
[9]:
import bisect
[10]:
%%time

n = 10000
xs = list(range(n))
xs.sort()  # the one-off O(n log n) sort (a range is already sorted, but this shows the pattern)
hits = 0
for x in testx:
    # bisect_left gives the insertion point; x is present iff the element there equals x
    i = bisect.bisect_left(xs, x)
    if i != n and xs[i] == x:
        hits += 1
print(hits)
1000
CPU times: user 4.69 ms, sys: 1.03 ms, total: 5.73 ms
Wall time: 5.01 ms
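An alternative worth knowing (not covered in the bullets above) is a hash-based container: membership tests on a Python set take \(O(1)\) time on average, at the cost of extra memory and no ordering.

[ ]:
%%time

n = 10000
xs = set(range(n))   # hash set: O(1) average-time membership test
hits = 0
for x in testx:
    if x in xs:
        hits += 1
print(hits)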

Sorting algorithms

Generally, use the sort function provided by the language (e.g. sort, sorted). However, sort functions are useful for illustrating algorithmic concepts such as in-place editing, recursion, and algorithmic complexity, and you should know how to write simple sort functions.

  • How much memory does the sort use?

  • What is its big O complexity class?

  • Is it iterative or recursive? (Note: every recursive algorithm has an iterative equivalent, but some, e.g. quicksort, are easier to think of recursively; see the iterative sketch after this list.)
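On that last point, here is a minimal sketch (an addition to the notes) of quicksort written iteratively: the recursion is replaced by an explicit stack of index ranges, with an in-place Lomuto partition.

[ ]:
def quicksort_iterative(xs):
    """In-place quicksort using an explicit stack instead of recursion."""
    stack = [(0, len(xs) - 1)]          # index ranges still waiting to be sorted
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = xs[hi]                  # Lomuto partition around the last element
        i = lo
        for j in range(lo, hi):
            if xs[j] <= pivot:
                xs[i], xs[j] = xs[j], xs[i]
                i += 1
        xs[i], xs[hi] = xs[hi], xs[i]   # move the pivot into its final position
        stack.append((lo, i - 1))       # push both halves instead of recursing
        stack.append((i + 1, hi))
    return xs

quicksort_iterative([5, 1, 3, 4, 2])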

Bubble sort

[11]:
def bubblesort(xs):
    """Bubble sort (strictly, the exchange-sort variant: it swaps xs[i] with
    smaller later elements rather than swapping adjacent pairs; the textbook
    adjacent-swap version is sketched after this section)."""

    n = len(xs)
    for i in range(n):
        print(xs)  # show the list at the start of each pass
        for j in range(i+1, n):
            if xs[i] > xs[j]:  # pull the smallest remaining value into position i
                xs[i], xs[j] = xs[j], xs[i]
[12]:
xs = [5,1,3,4,2]
bubblesort(xs)
xs
[5, 1, 3, 4, 2]
[1, 5, 3, 4, 2]
[1, 2, 5, 4, 3]
[1, 2, 3, 5, 4]
[1, 2, 3, 4, 5]
[12]:
[1, 2, 3, 4, 5]
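For reference, here is a sketch of the textbook bubble sort (an addition to the notes): adjacent out-of-order pairs are swapped, so the largest remaining value bubbles to the end on each pass.

[ ]:
def bubblesort_adjacent(xs):
    """Textbook bubble sort: repeatedly swap adjacent out-of-order pairs."""
    n = len(xs)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):      # the last i items are already in place
            if xs[j] > xs[j+1]:
                xs[j], xs[j+1] = xs[j+1], xs[j]
                swapped = True
        if not swapped:                 # early exit when already sorted
            break

xs = [5, 1, 3, 4, 2]
bubblesort_adjacent(xs)
xs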

Selection sort

[13]:
def selectionsort(xs):
    """Selection sort."""

    n = len(xs)
    for i in range(n):
        best = xs[i]   # smallest value found so far in xs[i:]
        idx = i
        print(xs)      # show the list at the start of each pass
        for j in range(i+1, n):
            if xs[j] < best:
                best = xs[j]
                idx = j
        xs[i], xs[idx] = xs[idx], xs[i]  # one swap per pass: O(n) swaps in total
[14]:
xs = [5,1,3,4,2]
selectionsort(xs)
xs
[5, 1, 3, 4, 2]
[1, 5, 3, 4, 2]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
[14]:
[1, 2, 3, 4, 5]

Quicksort

[15]:
def quicksort(xs):
    """Quicksort (not in-place: new left/right lists are built at every level)."""

    if len(xs) < 2:
        return xs
    else:
        pivot = xs[0]  # first element as pivot: O(n^2) worst case on already-sorted input
        left = [x for x in xs[1:] if x <= pivot]
        right = [x for x in xs[1:] if x > pivot]
        print(pivot, left, right)
        return quicksort(left) + [pivot] + quicksort(right)
[16]:
xs = [5,1,3,4,2]
quicksort(xs)
5 [1, 3, 4, 2] []
1 [] [3, 4, 2]
3 [2] [4]
[16]:
[1, 2, 3, 4, 5]

Memory usage

[17]:
import sys
[18]:
xs = np.random.randint(0, 10, (100,100))
[19]:
sys.getsizeof(xs)
[19]:
80112
[20]:
xs.nbytes
[20]:
80000
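The 112-byte difference above is the ndarray object header, which sys.getsizeof includes but nbytes does not; nbytes counts only the data buffer, elements times bytes per element (8-byte integers here), as this quick check (an addition to the notes) confirms:

[ ]:
xs.size, xs.itemsize, xs.size * xs.itemsize   # 10000 elements * 8 bytes = 80000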

Timing

Note that sleep waits rather than computes, so the CPU times reported below are tiny while the wall time is close to the requested 0.1 s. %time runs a statement once; %timeit averages over repeated runs.

[21]:
from time import sleep
[22]:
%time sleep(0.1)
CPU times: user 628 µs, sys: 951 µs, total: 1.58 ms
Wall time: 104 ms
[23]:
%%time

sleep(0.1)
CPU times: user 882 µs, sys: 1.26 ms, total: 2.14 ms
Wall time: 103 ms
[24]:
%timeit sleep(0.1)
102 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
[25]:
%timeit -r 3 -n 10 sleep(0.1)
101 ms ± 142 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
[26]:
from timeit import timeit
[27]:
t = timeit('from time import sleep; sleep(0.1)', number=1)
t
[27]:
0.10093476400000156
[28]:
t = timeit(lambda: sleep(0.1), number=1)
t
[28]:
0.10023614800000047
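Outside IPython, a one-off measurement can be made by hand with time.perf_counter from the standard library (a sketch added to the notes):

[ ]:
from time import perf_counter, sleep

start = perf_counter()            # high-resolution wall-clock timer
sleep(0.1)
elapsed = perf_counter() - start  # seconds elapsed, ~0.1
elapsed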