Stat. 758: Computation and Programming
Eric B. Laber
Department of Statistics, North Carolina State University
Lecture 4a, Sept. 10, 2015
"Ambition is like a frog sitting on a Venus flytrap. The flytrap can bite and bite, but it won't bother the frog because it only has little tiny plant teeth. But some other stuff could happen and it could be like ambition." — Chiu Chang Suan Shu

"The word matrix means womb in Latin. The title of that stupid Keanu Reeves movie finally makes sense." — Terry L. Laber
Housekeeping
- Beach trip happened! (A thing you should do!)
- I will give a week's notice for each quiz
- HW 1 is due September 22, but HW 2 will be up before then
- Python is on the lab machines in SAS Hall
- Work together! But turn in your own HW and write your own code!
Warm-up
Explain to your stat buddy:
- How might linear systems arise in statistics?
- What is big-O notation?
- What is Gaussian elimination? (As a child, I always thought this must be some form of assassination)
True or false:
- The mathematician who coined the term matrix for arrangements of numbers as rectangular arrays fled the U.S. after killing a student with a newspaper stick
- Research on solving linear least squares problems ceased with the invention of modern computing software
- An Irish penny has a harp on one side and a chicken on the other (this is why they ask "harps or chickens?" before kickoff in Ireland)
Big-O
From your mathematics days, for $f, g : D \to \mathbb{R}$, we write $f(x) = O\{g(x)\}$ as $x \to x_0$ if
$$\limsup_{x \to x_0} \frac{|f(x)|}{|g(x)|} \le L$$
for some fixed constant $L < \infty$.
True or false, $f(x) = O\{g(x)\}$:
- $f(x) = 4x^2 - 10x$, $g(x) = 10x^2 + x + 2$, $x \to \infty$
- $f(x) = x\log(x^2)$, $g(x) = x\log(x)$, $x \to 0$
- $f(n) = \log(n!)$, $g(n) = n\log(n)$, $n \in \mathbb{Z}_+$, $n \to \infty$
- $f(x) = x^3 + x^2\log(x)$, $g(x) = 2x^2\log(x) + x$, $x \to \infty$
Little-o
Assume $f, g : D \to \mathbb{R}$. We write $f(x) = o\{g(x)\}$ as $x \to x_0$ if
$$\lim_{x \to x_0} \frac{f(x)}{g(x)} = 0;$$
alternatively, for any constant $\kappa > 0$ there exists $\epsilon_\kappa$ such that $|f(x)| \le \kappa |g(x)|$ if $|x - x_0| \le \epsilon_\kappa$.
True or false, $f(x) = o\{g(x)\}$:
- $f(x) = x^2 + x$, $g(x) = x + 1$ as $x \to 0$
- $f(x) = x\log(x)$, $g(x) = x^2$ as $x \to \infty$
- $f(x) = x\log(x)$, $g(x) = x + x\log(x)$ as $x \to \infty$
Oh-pee!
We often require probabilistic notions of big and little O.
$O_P$ through examples:
- We say the r.v. $X = O_P(1)$ if $\lim_{L \to \infty} P(|X| \le L) = 1$
- Given a sequence of r.v.s $\{(X_n, Y_n)\}_{n \ge 1}$, we say $X_n = O_P(Y_n)$ if for any $\epsilon > 0$ there exists $L_\epsilon$ s.t. for all sufficiently large $n$, $P(|X_n| \le L_\epsilon |Y_n|) \ge 1 - \epsilon$
- Ex. Suppose $X_1, \ldots, X_n$ are i.i.d. with finite mean $\mu$ and variance-covariance $\Sigma$; show $\bar{X}_n - \mu = O_P(n^{-1/2})$
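A quick Monte Carlo check of the example (a minimal sketch; the exponential population, sample-size grid, and 500 replications are my own arbitrary choices): if $\bar{X}_n - \mu = O_P(n^{-1/2})$, then $\sqrt{n}(\bar{X}_n - \mu)$ should stay bounded in probability, so its empirical quantiles should stabilize as $n$ grows.

```python
import numpy as np

# Empirical check that sqrt(n) * (Xbar_n - mu) is bounded in probability:
# the 95th percentile of its absolute value should be roughly constant in n.
rng = np.random.default_rng(0)
mu = 2.0
for n in [10**2, 10**3, 10**4]:
    reps = np.array([np.sqrt(n) * (rng.exponential(mu, size=n).mean() - mu)
                     for _ in range(500)])
    print(n, np.quantile(np.abs(reps), 0.95))
```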
Oh-pee! cont'd
More general defn: let $X(t), Y(t)$ be stochastic processes indexed by $t \in T$; say $X(t) = O_P\{Y(t)\}$ as $t \to t_0$ if for any $\epsilon > 0$ there exist $L_\epsilon > 0$ and $\delta_\epsilon > 0$ so that if $|t - t_0| \le \delta_\epsilon$ then
$$P\{|X(t)| \le L_\epsilon |Y(t)|\} \ge 1 - \epsilon$$
Oh-pee! cont'd
$o_P$ through examples:
- We say $X_n = o_P(a_n)$ to mean $P(|X_n|/a_n > \epsilon) \to 0$ as $n \to \infty$ for any $\epsilon > 0$ (think delta-method)
- We say $X_n = o_P(Y_n)$ if for any $\kappa > 0$ and $\epsilon > 0$, $P(|X_n| \le \kappa |Y_n|) \ge 1 - \epsilon$ for all sufficiently large $n$
- Ex. Prove that if $X_n = O_P(1)$ and $Y_n = o_P(1)$ then $X_n Y_n = o_P(1)$
Why do we care?
- Big-O notation is used to characterize deterministic algorithm complexity
- Big-$O_P$ notation is used heavily in asymptotic analyses:
  - Stochastic approximation algorithms
  - Bounding Monte Carlo error
  - Dealing with remainder terms in asymptotic expansions
Flops
- We describe algorithms in terms of the number of floating point operations (flops): addition, subtraction, multiplication, division
- Built-in functions, e.g., exp(), are harder to evaluate
- How many flops to compute $Av$ where $A \in \mathbb{R}^{n \times p}$ and $v \in \mathbb{R}^p$?
- How many flops to compute $A^{\intercal}A$?
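For intuition on the first question: each of the $n$ entries of $Av$ costs $p$ multiplications and $p-1$ additions, so roughly $2np$ flops. A timing sketch consistent with that count (the sizes and repetition count are arbitrary choices):

```python
import numpy as np
import time

# Matrix-vector product Av is ~2np flops, so with p fixed,
# doubling n should roughly double the runtime.
p = 1000
for n in [1000, 2000, 4000]:
    A, v = np.random.randn(n, p), np.random.randn(p)
    start = time.perf_counter()
    for _ in range(50):
        A @ v
    print(n, time.perf_counter() - start)
```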
Linear systems: warm-up
- Let $A \in \mathbb{R}^{p \times p}$ and $b \in \mathbb{R}^p$; we want the solution to $Ax = b$
- If $A$ were upper triangular, how would you solve for $x$? (See the sketch below)
- Go over linsys.ipynb
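A minimal back-substitution sketch for the upper-triangular case (the function name is mine; the notebook's version may differ): solve the last equation first, then work upward, for $O(p^2)$ flops total.

```python
import numpy as np

def back_substitute(U, b):
    """Solve Ux = b for upper-triangular U by back-substitution."""
    p = len(b)
    x = np.zeros(p)
    for i in range(p - 1, -1, -1):               # last row first
        x[i] = (b[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x
```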
Linear systems: Gaussian elimination
- Triangular systems are rare in practice
- Idea! Transform a general linear system to a triangular system
- $Ax = b \iff BAx = Bb$ if $B$ is invertible; thus it suffices to find $B$ so that $BA$ is triangular
Linear systems: Gaussian elimination cont'd
Primary school example, reduce to a triangular system:
$$\begin{pmatrix} 1.0 & 2.0 & 1.0 & 0.0 \\ 0.5 & 1.0 & 0.0 & 1.0 \\ 0.0 & 2.0 & 0.5 & 1.5 \\ 1.0 & 1.0 & 1.5 & 0.0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} 0.5 \\ 1.0 \\ 1.5 \\ 2.0 \end{pmatrix}$$
Linear systems: Gaussian elimination cont'd
Algorithm for Gaussian elimination to a triangular system:
- Set $A^{(0)} = A$; define $B^{(1)}$ by $(B^{(1)})_{i,1} = -A^{(0)}_{i,1}/A^{(0)}_{1,1}$ if $i > 1$, and $(B^{(1)})_{i,j} = 1_{i=j}$ if $j \ne 1$
- Recursively for $k = 1, \ldots, p-1$: $A^{(k)} = B^{(k)} A^{(k-1)}$, and $(B^{(k+1)})_{i,k+1} = -A^{(k)}_{i,k+1}/A^{(k)}_{k+1,k+1}$ if $i > k+1$, and $(B^{(k+1)})_{i,j} = 1_{i=j}$ if $j \ne k+1$
- We assume that $A^{(k)}_{k+1,k+1} \ne 0$ for all $k$, which need not hold in general; you will fix this in HW2! (A sketch without that fix follows)
- Back to linsys.ipynb
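A minimal sketch of the algorithm above, deliberately without the pivoting fix left for HW2 (so it fails whenever some diagonal entry $A^{(k)}_{k+1,k+1}$ is zero); it applies the elimination updates row-wise and finishes with back-substitution. The function name is mine.

```python
import numpy as np

def gaussian_elim_solve(A, b):
    """Solve Ax = b by elimination to upper-triangular form; no pivoting (HW2)."""
    A, b = A.astype(float), b.astype(float)
    p = len(b)
    for k in range(p - 1):                       # eliminate column k below the diagonal
        mult = A[k+1:, k] / A[k, k]              # assumes A[k, k] != 0
        A[k+1:, k:] -= np.outer(mult, A[k, k:])
        b[k+1:] -= mult * b[k]
    x = np.zeros(p)
    for i in range(p - 1, -1, -1):               # back-substitution
        x[i] = (b[i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x
```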
Iterative methods for large systems
- Gaussian elimination requires $O(p^3)$ operations
- Manageable for small/moderate-sized problems
- When $p$ is large, iterative methods may be preferable, especially if the matrix is sparse
- Canonical example: Gauss-Seidel iteration for $Ax = b$
- Suppose we knew $\{x_j : j \ne i\}$; solve for $x_i$ via
$$x_i = \frac{b_i - \sum_{k \ne i} A_{i,k} x_k}{A_{i,i}}$$
- Idea! Start with an initial guess, $x^{(0)}$, then repeatedly update each component of our guess using the above updates
Gauss-Seidel pseudocode
Input: $x^{(0)}$, $A$, $b$
Set $m = 0$
Repeat forever:
    $x^{(m+1)} = x^{(m)}$
    For $i = 1, \ldots, p$:
        $x^{(m+1)}_i = (b_i - \sum_{k \ne i} A_{i,k} x_k)/A_{i,i}$
    If $\|x^{(m+1)} - x^{(m)}\| \le \epsilon$: break
    $m = m + 1$
With your stat buddy: convert this to Python code! (One possible translation appears below)
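One possible translation (the function name and default tolerance are mine). Note that within a sweep the sum uses components of $x$ already refreshed on the current pass, which is what distinguishes Gauss-Seidel from Jacobi iteration.

```python
import numpy as np

def gauss_seidel(A, b, x0, eps=1e-8, max_iter=10_000):
    """Iteratively solve Ax = b, updating each x_i in place with the newest values."""
    x = x0.astype(float)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(len(b)):
            sigma = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (b[i] - sigma) / A[i, i]
        if np.linalg.norm(x - x_old) <= eps:
            break
    return x
```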
Sparse matrices
Many applications in statistics involve large sparse matrices:
- Functional data analysis
- Markov decision processes
- Matrix completion problems
- Graphical models
- ...
Computational savings are obtained by exploiting sparsity:
- Save memory: store only non-zero elements
- Save flops: matrix ops only with non-zero elements
Dictionary of keys
Suppose our lin. sys. $Ax = b$ has
$$A = \begin{pmatrix} 0 & 1 & 0 & 2 \\ 3 & 0 & 0 & 0 \\ 1 & 0 & 0 & 4 \\ 0 & 0 & 8 & 9 \end{pmatrix}$$
We can store this as the set of triples
$$\{(1,2,1), (1,4,2), (2,1,3), (3,1,1), (3,4,4), (4,3,8), (4,4,9)\}$$
More convenient to store as an associative array where each pair of indices is associated with its respective matrix value, i.e., $(1,2) \mapsto 1$, $(1,4) \mapsto 2$, ..., $(4,4) \mapsto 9$
Dictionary of keys cont'd
- Dictionary of keys (DOK) storage format is a set of key-value pairs
  - Key: indices of a non-zero matrix element
  - Value: the non-zero matrix element
- Store matrix $A$ as $\{(i,j) \mapsto A_{i,j} : A_{i,j} \ne 0\}$
- First part of sparsemats.ipynb (a dict-based sketch follows)
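A sketch of DOK in Python, using the 4×4 example above with 0-based indices (a plain dict would work the same way; scipy.sparse provides dok_matrix with this interface):

```python
from scipy.sparse import dok_matrix

# DOK storage of the 4x4 example; keys are (row, col), values are the entries.
A = dok_matrix((4, 4))
for (i, j), val in {(0, 1): 1, (0, 3): 2, (1, 0): 3,
                    (2, 0): 1, (2, 3): 4, (3, 2): 8, (3, 3): 9}.items():
    A[i, j] = val
print(A.toarray())   # recover the dense matrix to check
```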
Compressed row storage
- DOK is intuitive and useful for constructing sparse matrices
- Slower than alternatives for numerical operations; e.g., matrix-vector mult can be slower than the dense case
- Compressed row storage (CRS) is faster for numerical operations
- Pattern: construct with DOK, convert to CRS
Suppose that
$$A = \begin{pmatrix} 0 & 1 & 0 & 2 \\ 3 & 0 & 0 & 0 \\ 1 & 0 & 0 & 4 \\ 0 & 0 & 8 & 9 \end{pmatrix}$$
CRS stores this as three arrays:
Value: 1 2 3 1 4 8 9
Col.:  2 4 1 1 4 3 4
Row:   1 3 4 6 8
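The three arrays make matrix-vector multiplication a single pass over the nonzeros. A minimal sketch with 0-based indexing (so the Row pointer array starts at 0 rather than 1; the function name is mine), using the example above:

```python
import numpy as np

def crs_matvec(value, col, row_ptr, x):
    """Compute Ax from CRS arrays, touching only the nonzero entries."""
    p = len(row_ptr) - 1
    y = np.zeros(p)
    for i in range(p):
        for k in range(row_ptr[i], row_ptr[i + 1]):   # nonzeros in row i
            y[i] += value[k] * x[col[k]]
    return y

# The 4x4 example above, converted to 0-based indices.
value = np.array([1., 2., 3., 1., 4., 8., 9.])
col = np.array([1, 3, 0, 0, 3, 2, 3])
row_ptr = np.array([0, 2, 3, 5, 7])
print(crs_matvec(value, col, row_ptr, np.ones(4)))
```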
Compressed row storage cont'd
With your stat buddy:
- Convert to dense format:
  Value: 1 2 2 3 5 7 6 8 9
  Col.:  1 1 2 1 3 4 2 3 4
  Row:   1 2 4 7 10
- Convert to CRS format:
$$A = \begin{pmatrix} 1 & 1 & 0 & 2 \\ 0 & 0 & 4 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 8 & 9 \end{pmatrix}$$
- Back to sparsemats.ipynb
Cholesky decomposition
- If $A$ is symmetric positive definite then $A = LL^{\intercal}$, where $L$ is lower triangular
- Solve $Ax = b$ by solving the triangular systems $Ly = b$ and then $L^{\intercal}x = y$
Cholesky decomposition cont'd
- Algorithm to compute $A = LL^{\intercal}$ is similar to Gaussian elimination
- GE and Cholesky are both $O(p^3)$, but Cholesky has a better constant
- Generally Cholesky is more stable
- Generate $Z \sim \mathrm{Normal}_p(\mu, \Sigma)$ via:
  1. Compute $\Sigma = LL^{\intercal}$
  2. Generate $W \sim \mathrm{Normal}_p(0, I_p)$
  3. Set $Z = LW + \mu$
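A sketch of steps 1-3 with NumPy (np.linalg.cholesky returns the lower-triangular factor $L$; the function name and example $\Sigma$ are mine):

```python
import numpy as np

def mvn_via_cholesky(mu, Sigma, rng):
    """Generate Z ~ Normal_p(mu, Sigma) as Z = L W + mu with Sigma = L L'."""
    L = np.linalg.cholesky(Sigma)        # step 1: Sigma = L L'
    W = rng.standard_normal(len(mu))     # step 2: W ~ Normal_p(0, I_p)
    return L @ W + mu                    # step 3

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Z = mvn_via_cholesky(np.zeros(2), Sigma, rng)
```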
Break: Warm-up quiz II
Explain to your stat buddy:
- What is a random walk?
- What is a Brownian bridge?
- What is importance sampling?
True or false:
- Brownian motion was invented by Cavell Brownie
- The term Monte Carlo was a code-name for stochastic computer experiments related to nuclear research during WWII
- Hotter than Satan's Toenails is the name of a nail salon in Chattanooga, TN
Ex. Brownian motion
Brownian motion shows up frequently in asymptotic statistics.
Recall $\{X(t) : t \ge 0\}$ is a Brownian motion process if:
- (P1) $X(0) = 0$ w.p. 1
- (P2) $\{X(t) : t \ge 0\}$ has independent increments (what does this mean?)
- (P3) $X(t) \sim \mathrm{Normal}(0, c^2 t)$ for all $t \ge 0$
(We will assume $c = 1$ hereafter.)
Ex. Brownian motion cont'd
- Goal: simulate Brownian motion
- Problem: a computer cannot simulate continuous values
- Idea: discretize the interval $[0, T]$, $0 = t_0 < t_1 < \cdots < t_n = T$, and simulate $\{X(t_1), \ldots, X(t_n)\}$
- Fact: $\{X(t_1), \ldots, X(t_n)\}$ is normally distributed with mean 0 and variance-covariance $\Sigma_{i,j} = \min(t_i, t_j)$ (see HW2; a sketch using the increments instead follows)
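The covariance construction is HW2; a common alternative uses (P2) and (P3) directly, accumulating independent increments $X(t_{k+1}) - X(t_k) \sim \mathrm{Normal}(0, t_{k+1} - t_k)$ with a cumulative sum (a minimal sketch; the even grid and function name are my choices):

```python
import numpy as np

def brownian_path(T, n, rng):
    """Simulate X(t_1), ..., X(t_n) on an even grid via independent increments."""
    t = np.linspace(0.0, T, n + 1)
    increments = rng.standard_normal(n) * np.sqrt(np.diff(t))
    return t[1:], np.cumsum(increments)   # X(t_k) = sum of increments up to t_k

t, X = brownian_path(T=1.0, n=1000, rng=np.random.default_rng(0))
```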
Generating random functions
In some applications it is necessary to generate random smooth functions over some domain (e.g., time, space, etc.):
- Growth curves
- Depression scores
- Humidity
- ...
Basic idea:
- Choose a space for the domain of the random function
- Choose a basis for this space
- Generate random linear combinations of the basis functions
Review: basis functions
Recall: a basis for a space of functions, $\mathcal{F}$, is a collection $\{b_j\}_{j \ge 1}$ in $\mathcal{F}$ so that for any $f \in \mathcal{F}$ there exist $\{\lambda_j\}_{j \ge 1}$ that satisfy $f = \sum_{j \ge 1} \lambda_j b_j$.
Stone-Weierstrass theorem: every continuous function on $[0,1]$ can be uniformly approximated by a polynomial function. Thus, a basis for the space $C[0,1]$ is $\{x^{j-1}\}_{j \ge 1}$.
Generating a random function in $\mathcal{F}$
Goal: generate a random element of $\mathcal{F}$.
Random linear combination of basis functions:
- Let $\{b_j\}_{j \ge 1}$ be a basis for $\mathcal{F}$
- Choose a finite truncation $J$
- Generate random loadings $\lambda_1, \ldots, \lambda_J$, e.g., i.i.d. normal
- Define $f = \sum_{j=1}^J \lambda_j b_j$, i.e., $f(x) = \sum_{j=1}^J \lambda_j b_j(x)$
Ex. Fourier basis
A Fourier basis has the form
$$b_j(x) = \begin{cases} \cos(j\pi x/2) & \text{if } j \text{ even} \\ \sin\{(j+1)\pi x/2\} & \text{if } j \text{ odd} \end{cases}$$
Dense in $L^2[0,1]$ (square integrable functions on $[0,1]$)
Go over fourierrando.py (a sketch in the same spirit follows)
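A sketch in the spirit of fourierrando.py, which is not reproduced here (the helper names and the $1/j$ shrinkage of the loadings are my choices, made so that draws come out smooth): truncate at $J$ terms and draw independent normal loadings.

```python
import numpy as np

def fourier_basis(j, x):
    """b_j(x): cos(j*pi*x/2) for j even, sin((j+1)*pi*x/2) for j odd."""
    return np.cos(j * np.pi * x / 2) if j % 2 == 0 else np.sin((j + 1) * np.pi * x / 2)

def random_function(J, x, rng):
    """f(x) = sum_{j=1}^J lambda_j b_j(x) with lambda_j ~ Normal(0, 1/j^2)."""
    lam = rng.standard_normal(J) / np.arange(1, J + 1)
    return sum(l * fourier_basis(j, x) for j, l in enumerate(lam, start=1))

x = np.linspace(0, 1, 200)
f = random_function(J=15, x=x, rng=np.random.default_rng(0))
```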
In class example Dependent spatial binary data (on board)