Math 533 Extra Hour Material


1 Math 533 Extra Hour Material

2 A Justification for Regression

3 The Justification for Regression

It is well known that if we want to predict a random quantity Y using some quantity m according to a mean-squared error (MSE) criterion, then the optimal predictor is the expected value of Y, µ:

E[(Y - m)^2] = E[(Y - µ + µ - m)^2]
             = E[(Y - µ)^2] + E[(µ - m)^2] + 2 E[(Y - µ)(µ - m)]
             = E[(Y - µ)^2] + (µ - m)^2

which is minimized when m = µ.

4 The Justification for Regression (cont.)

Now suppose we have a joint distribution for Y and X, and wish to predict Y as a function of X, say m(x). Using the same MSE criterion, and writing µ(x) = E_{Y|X}[Y | X = x] for the conditional expectation of Y given X = x, we have

E_{X,Y}[(Y - m(X))^2] = E_{X,Y}[(Y - µ(X) + µ(X) - m(X))^2]
                      = E_{X,Y}[(Y - µ(X))^2] + E_{X,Y}[(µ(X) - m(X))^2] + 2 E_{X,Y}[(Y - µ(X))(µ(X) - m(X))]
                      = E_{X,Y}[(Y - µ(X))^2] + E_X[(µ(X) - m(X))^2] + 0

5 The Justification for Regression (cont.)

The cross term equates to zero by iterated expectation:

E_{X,Y}[(Y - µ(X))(µ(X) - m(X))] = E_X[ E_{Y|X}[(Y - µ(X))(µ(X) - m(X)) | X] ]

and for the internal expectation

E_{Y|X}[(Y - µ(X))(µ(X) - m(X)) | X] = (µ(X) - m(X)) E_{Y|X}[(Y - µ(X)) | X]

with E_{Y|X}[(Y - µ(X)) | X] = 0 a.s.

6 The Justification for Regression (cont.)

Thus the MSE is minimized over functions m(.) when E_X[(µ(X) - m(X))^2] is minimized, and this term can be made zero by setting m(x) = µ(x) = E_{Y|X}[Y | X = x]. Hence the MSE-optimal prediction is made by using µ(x).

Note: here X can be a single variable or a vector, and it can be random or non-random; the result holds in each case.
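As a quick numerical check of this result (an illustrative sketch, not part of the original notes; the mean function and predictor below are invented for the demonstration), the following R code compares the MSE of the conditional-mean predictor with that of an arbitrary competing predictor on simulated data.

set.seed(533)
n <- 1e5
x <- runif(n, -2, 2)
mu <- function(x) sin(2 * x)              # plays the role of E[Y | X = x]
y <- mu(x) + rnorm(n, sd = 0.5)

mse.mu  <- mean((y - mu(x))^2)            # MSE using the conditional mean
mse.lin <- mean((y - (0.1 + 0.8 * x))^2)  # MSE using a mis-specified linear predictor
c(conditional.mean = mse.mu, linear = mse.lin)  # the conditional mean attains the smaller MSE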

7 The Justification for Linear Modelling

Suppose that the true conditional mean function is represented by µ(x), where x is a single predictor. We have by Taylor expansion around x = 0 that

µ(x) = µ(0) + sum_{j=1}^{p-1} [µ^(j)(0) / j!] x^j + O(x^p)

where the remainder term O(x^p) represents terms of magnitude x^p or higher order, and

µ^(j)(0) = d^j µ(x) / dx^j evaluated at x = 0.

8 The Justification for Linear Modelling (cont.)

The derivatives of µ(x) at x = 0 may be treated as unspecified constants, in which case a reasonable approximating model takes the form

β_0 + sum_{j=1}^{p-1} β_j x^j

where β_j corresponds to µ^(j)(0)/j! for j = 0, 1, ..., p-1. Similar expansions hold if x is vector valued.

9 The Justification for Linear Modelling (cont.)

Finally, if Y and X are jointly normally distributed, then the conditional expectation of Y given X = x is linear in x; see the Multivariate Normal Distribution handout.

10 The General Linear Model

11 The General Linear Model

The linear model formulation that assumes

E_{Y|X}[Y | X] = Xβ,    Var_{Y|X}[Y | X] = σ^2 I_n

is actually quite a general formulation, as the rows x_i of X can be formed by using general transforms of the originally recorded predictors:

multiple regression:   x_i = [1  x_i1  x_i2  ...  x_ik]
polynomial regression: x_i = [1  x_i1  x_i1^2  ...  x_i1^k]
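For instance, polynomial regression is fitted exactly as an ordinary linear model once the powers of the predictor are placed in the design matrix; a minimal R sketch (not from the notes, with invented data):

set.seed(1)
x <- runif(50, 0, 3)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(50, sd = 0.3)

X <- cbind(1, x, x^2)                      # rows x_i = [1, x_i, x_i^2]
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # least squares via the normal equations
beta.hat
coef(lm(y ~ poly(x, 2, raw = TRUE)))       # the built-in fit gives the same estimates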

12 The General Linear Model (cont.)

harmonic regression: consider a single continuous predictor x measured on a bounded interval. Let ω_j = j/n for j = 0, 1, ..., K, where K = n/2, and then set

E_{Y_i|X}[Y_i | x_i] = β_0 + sum_{j=1}^{J} β_j1 cos(2πω_j x_i) + sum_{j=1}^{J} β_j2 sin(2πω_j x_i)

If J = K, we then essentially have an n × n matrix X specifying a linear transform of y in terms of the derived predictors (cos(2πω_j x_i), sin(2πω_j x_i)); the coefficients (β_0, β_11, β_12, ..., β_K) form the discrete Fourier transform of y.

13 The General Linear Model (cont.)

In this case, the columns of X are orthogonal, and

X'X = diag(n, n/2, n/2, ..., n/2, n)

that is, X'X is a diagonal matrix.
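A small numerical check of this orthogonality (illustrative only, not from the notes; it assumes equally spaced x_i = 1, ..., n with n even, so that the sine term at frequency 1/2 is identically zero and is dropped):

n <- 8
x <- 1:n
K <- n / 2
omega <- (1:K) / n
Xc <- sapply(omega, function(w) cos(2 * pi * w * x))
Xs <- sapply(omega[-K], function(w) sin(2 * pi * w * x))
X <- cbind(1, Xc, Xs)
round(t(X) %*% X, 10)   # diagonal, with entries n and n/2 as stated above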

14 The General Linear Model (cont.)

basis functions: truncated spline basis: for x ∈ R, let

x_i1 = (x - η_1)^α  if x > η_1,  and  0  if x ≤ η_1,

that is, x_i1 = (x - η_1)_+^α, for some fixed η_1 and α ∈ R. More generally,

E_{Y_i|X}[Y_i | x_i] = β_0 + sum_{j=1}^{J} β_j (x_i - η_j)_+^α

for fixed η_1 < η_2 < ... < η_J.

15 The General Linear Model (cont.)

piecewise constant: for x ∈ R, let

x_i1 = 1 if x ∈ A_1,  0 if x ∉ A_1,  that is, x_i1 = 1_{A_1}(x)

for some set A_1. More generally,

E_{Y_i|X}[Y_i | x_i] = β_0 + sum_{j=1}^{J} β_j 1_{A_j}(x)

for sets A_1, ..., A_J. If we want to use a partition of R, we may write this

E_{Y_i|X}[Y_i | x_i] = sum_{j=0}^{J} β_j 1_{A_j}(x),   where A_0 is the complement of ∪_{j=1}^{J} A_j.

16 The General Linear Model (cont.)

piecewise linear: specify

E_{Y_i|X}[Y_i | x_i] = sum_{j=0}^{J} 1_{A_j}(x_i)(β_j0 + β_j1 x_i)

piecewise linear & continuous: specify

E_{Y_i|X}[Y_i | x_i] = β_0 + β_1 x_i + sum_{j=1}^{J} β_j1 (x_i - η_j)_+

for fixed η_1 < η_2 < ... < η_J.

higher order piecewise functions (quadratic, cubic etc.): splines

17 Example: Motorcycle data

library(MASS)
# Motorcycle data
plot(mcycle, pch = 19, main = "Motorcycle accident data")

x <- mcycle$times
y <- mcycle$accel

18 Raw data. [Figure: scatterplot of the motorcycle accident data, accel vs. times.]

19 Example: Piecewise constant fit

# Knots
K <- 11
kappa <- as.numeric(quantile(x, probs = c(0:K)/K))

X <- (outer(x, kappa, "-") > 0)^2
X <- X[, -12]

fit.pwc <- lm(y ~ X)
summary(fit.pwc)

newx <- seq(0, max(x), length = 1001)
newX <- (outer(newx, kappa, "-") > 0)^2
newX <- cbind(rep(1, 1001), newX[, -12])

yhatc <- newX %*% coef(fit.pwc)
lines(newx, yhatc, col = "blue", lwd = 2)

20 Fit: Piecewise constant. [Figure: piecewise constant fit (blue) overlaid on the motorcycle accident data, accel vs. times.]

21 Example: Piecewise linear fit

X1 <- (outer(x, kappa, "-") > 0)^2
X1 <- X1[, -12]

X2 <- (outer(x, kappa, "-") > 0) * outer(x, kappa, "-")
X2 <- X2[, -12]

X <- cbind(X1, X2)

fit.pwl <- lm(y ~ X)
summary(fit.pwl)

newx <- seq(0, max(x), length = 1001)
newX1 <- (outer(newx, kappa, "-") > 0)^2
newX2 <- (outer(newx, kappa, "-") > 0) * outer(newx, kappa, "-")
newX <- cbind(rep(1, 1001), newX1[, -12], newX2[, -12])

yhatl <- newX %*% coef(fit.pwl)
lines(newx, yhatl, col = "red", lwd = 2)

22 Fit: ... + piecewise linear. [Figure: piecewise linear fit (red) added to the plot of the motorcycle accident data.]

23 Example: Piecewise continuous linear fit

X <- (outer(x, kappa, "-") > 0) * outer(x, kappa, "-")
X <- X[, -12]

fit.pwcl <- lm(y ~ X)
summary(fit.pwcl)

newx <- seq(0, max(x), length = 1001)
newX <- (outer(newx, kappa, "-") > 0) * outer(newx, kappa, "-")
newX <- cbind(rep(1, 1001), newX[, -12])

yhatcl <- newX %*% coef(fit.pwcl)
lines(newx, yhatcl, col = "green", lwd = 2)

24 Fit: ... + piecewise continuous linear. [Figure: continuous piecewise linear fit (green) added to the plot of the motorcycle accident data.]

25 Fit: ... + piecewise continuous quadratic. [Figure: continuous piecewise quadratic fit added to the plot of the motorcycle accident data.]

26 Fit: ... + piecewise continuous cubic. [Figure: continuous piecewise cubic fit added to the plot of the motorcycle accident data.]

27 Fit: B-spline. [Figure: B-spline fit of the motorcycle accident data, accel vs. times.]
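The B-spline fit shown in this figure can be reproduced along the following lines (an illustrative sketch, not the original code; the knot placement, taken from the kappa vector defined earlier, and the cubic degree are assumptions):

library(splines)
fit.bs <- lm(y ~ bs(x, knots = kappa[2:11], degree = 3))
newx <- seq(min(x), max(x), length = 1001)
yhat.bs <- predict(fit.bs, newdata = data.frame(x = newx))
lines(newx, yhat.bs, lwd = 2)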

28 Fit: Harmonic regression, K = ... [Figure: harmonic regression fit of the motorcycle accident data, accel vs. times.]

29 Fit: Harmonic regression, K = ... [Figure: harmonic regression fit of the motorcycle accident data, accel vs. times.]

30 Fit: Harmonic regression, K = ... [Figure: harmonic regression fit of the motorcycle accident data, accel vs. times.]

31 Fit: Harmonic regression, K = ... [Figure: harmonic regression fit of the motorcycle accident data, accel vs. times.]
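A sketch of how harmonic regression fits of this kind can be produced for the motorcycle data (illustrative only, not the original code; rescaling the predictor to [0, 1] and the choice J = 10 are assumptions):

harm.basis <- function(u, J) {
  # columns cos(2*pi*j*u), sin(2*pi*j*u) for j = 1, ..., J
  do.call(cbind, lapply(1:J, function(j) cbind(cos(2 * pi * j * u), sin(2 * pi * j * u))))
}
J <- 10
u <- (x - min(x)) / (max(x) - min(x))
fit.harm <- lm(y ~ harm.basis(u, J))

newu <- seq(0, 1, length = 1001)
yhat.harm <- cbind(1, harm.basis(newu, J)) %*% coef(fit.harm)
lines(min(x) + newu * (max(x) - min(x)), yhat.harm, lwd = 2)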

32 Reparameterization

33 Reparameterizing the model

For any model E_{Y|X}[Y | X] = Xβ we might consider a reparameterization of the model by writing

x_i^new = x_i A^{-1}

for some p × p non-singular matrix A. Then

E_{Y_i|X}[Y_i | x_i] = x_i β = (x_i^new A)β = x_i^new β^(new)

where β^(new) = Aβ.

34 Reparameterizing the model (cont.)

Then X^new = X A^{-1}, i.e. X = X^new A, and we may choose A such that

(X^new)'(X^new) = I_p

to give an orthogonal (actually, orthonormal) parameterization. Recall that if the design matrix X^new is orthonormal, we have for the OLS estimate

β̂^(new) = (X^new)' y.

Note however that the new predictors and their coefficients may not be readily interpretable, so it may be better to reparameterize back to β by defining

β̂ = A^{-1} β̂^(new).
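A minimal R sketch of such an orthonormal reparameterization (not from the notes; here A is taken to be the R factor of a QR decomposition, and the data are invented):

set.seed(2)
n <- 100; p <- 3
X <- cbind(1, rnorm(n), rnorm(n))
y <- drop(X %*% c(1, 2, -1) + rnorm(n))

qrX  <- qr(X)
Xnew <- qr.Q(qrX)                 # orthonormal columns: t(Xnew) %*% Xnew = I_p
A    <- qr.R(qrX)                 # X = Xnew %*% A
beta.new <- t(Xnew) %*% y         # OLS estimate in the orthonormal parameterization
beta.hat <- solve(A, beta.new)    # transform back: beta.hat = A^{-1} beta.new
cbind(beta.hat, coef(lm(y ~ X - 1)))  # agrees with ordinary least squares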

35 Coordinate methods for inversion

36 Coordinate methods for inversion

To find the ordinary least squares estimates, we solve the normal equations to obtain

β̂ = (X'X)^{-1} X'y

which requires us to invert the p × p matrix X'X. This is the minimum norm solution to

β̂ = arg min_b ||y - Xb||^2 = arg min_b sum_{i=1}^{n} (y_i - x_i b)^2

37 Coordinate methods for inversion (cont.)

We may solve this problem using coordinate descent rather than direct inversion; if b_2, ..., b_p are fixed, then the minimization problem for b_1 becomes

b̂_1 = arg min_{b_1} S(b_1 | b_2, ..., b_p) = arg min_{b_1} sum_{i=1}^{n} (y_i - b_1 x_i1 - c_i1)^2

where

c_i1 = sum_{j=2}^{p} b_j x_ij.

38 Coordinate methods for inversion (cont.)

Writing y_i* = y_i - c_i1, and the sum of squares as

sum_{i=1}^{n} (y_i* - b_1 x_i1)^2,

we have that

b̂_1 = sum_{i=1}^{n} x_i1 y_i* / sum_{i=1}^{n} x_i1^2.

39 Coordinate methods for inversion (cont.)

We may solve recursively in turn for each b_j; that is, after initialization b_j^(0) and at step t, update by minimizing

b_j^(t+1) = arg min_{b_j} S(b_j | b_1^(t+1), b_2^(t+1), ..., b_{j-1}^(t+1), b_{j+1}^(t), ..., b_p^(t))

which does not require any matrix inversion.
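An R implementation of this coordinate descent scheme (an illustrative sketch, not from the notes; the data are invented and the number of sweeps is fixed rather than chosen by a convergence test):

cd.ols <- function(X, y, nsweep = 100) {
  p <- ncol(X)
  b <- rep(0, p)                                   # initialization b^(0) = 0
  for (t in 1:nsweep) {
    for (j in 1:p) {
      r <- y - X[, -j, drop = FALSE] %*% b[-j]     # partial residual y_i - c_ij
      b[j] <- sum(X[, j] * r) / sum(X[, j]^2)      # one-dimensional least squares update
    }
  }
  b
}

set.seed(3)
X <- cbind(1, matrix(rnorm(200), 100, 2))
y <- drop(X %*% c(1, -2, 0.5) + rnorm(100))
cbind(cd.ols(X, y), solve(t(X) %*% X, t(X) %*% y))  # the two solutions agree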

40 Distributional Results

41 Some distributional results

Distributional results using the Normal distribution are key to many inference procedures for the linear model. Suppose that X_1, ..., X_n ~ Normal(µ, σ^2) are independent.

1. Z_i = (X_i - µ)/σ ~ Normal(0, 1);
2. Y_i = Z_i^2 ~ χ²_1;
3. U = sum_{i=1}^{n} Z_i^2 ~ χ²_n;
4. If U_1 ~ χ²_{n_1} and U_2 ~ χ²_{n_2} are independent, then

   V = (U_1/n_1) / (U_2/n_2) ~ Fisher(n_1, n_2)   and   1/V ~ Fisher(n_2, n_1).

42 Some distributional results (cont.)

5. If Z ~ Normal(0, 1) and U ~ χ²_ν are independent, then

   T = Z / sqrt(U/ν) ~ Student(ν).

6. If Y = AX + b, where X = (X_1, ..., X_n)' ~ Normal(0, σ^2 I_n), A is n × n (non-singular) and b is n × 1, then

   Y ~ Normal(b, σ^2 AA') ≡ Normal(µ, Σ), say,

   and (Y - µ)' Σ^{-1} (Y - µ) ~ χ²_n.

43 Some distributional results (cont.)

7. Non-central Chi-squared distribution: if X ~ Normal(µ, σ^2), we find the distribution of X²/σ². Let

   Y = X/σ ~ Normal(µ/σ, 1).

   By standard transformation results, if Q = Y², then

   f_Q(y) = (1 / (2√y)) [ φ(√y - µ/σ) + φ(√y + µ/σ) ],   y > 0.

   This is the density of the non-central chi-squared distribution with 1 degree of freedom and non-centrality parameter λ = (µ/σ)², written χ²_1(λ).
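A quick numerical check of this density against R's built-in non-central chi-squared density (illustrative, not from the notes; the values of µ and σ are arbitrary):

mu <- 1.5; sigma <- 2
lambda <- (mu / sigma)^2
q <- seq(0.01, 10, by = 0.01)
f.manual <- (dnorm(sqrt(q) - mu/sigma) + dnorm(sqrt(q) + mu/sigma)) / (2 * sqrt(q))
max(abs(f.manual - dchisq(q, df = 1, ncp = lambda)))   # essentially zero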

44 Some distributional results (cont.)

The non-central chi-squared distribution has many properties similar to those of the standard (central) chi-squared distribution. For example, if X_1, ..., X_n are independent with X_i ~ χ²_1(λ_i), then

sum_{i=1}^{n} X_i ~ χ²_n( sum_{i=1}^{n} λ_i ).

The non-central chi-squared distribution plays a role in testing for the linear regression model, as it characterizes the distribution of various sums of squares terms.

45 Some distributional results (cont.)

8. Quadratic forms: if A is a square, symmetric, idempotent matrix, and Z = (Z_1, ..., Z_n)' ~ Normal(0, I_n), then

   Z'AZ ~ χ²_ν,   where ν = Trace(A).

   To see this, use the singular value decomposition A = UDU', where U is an orthogonal matrix with U'U = I_n, and D is a diagonal matrix. Then

   Z'AZ = Z'UDU'Z = (U'Z)' D (U'Z) = {Z*}' D {Z*},

   say. But U'Z ~ Normal(0, U'U) = Normal(0, I_n).

46 Some distributional results (cont.)

Therefore Z'AZ = {Z*}' D {Z*}. But A is idempotent, so AA = A, that is,

UDU'UDU' = UDU'.

The left hand side simplifies, and we have UD²U' = UDU'.

47 Some distributional results (cont.)

Thus, pre-multiplying by U' and post-multiplying by U, we have

D² = D

and the diagonal elements of D must each be either zero or one, so

Z'AZ = {Z*}' D {Z*} ~ χ²_ν,   where ν = Trace(D) = Trace(A).
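A small simulation checking this result for the hat matrix of a regression design, which is symmetric and idempotent with trace p (illustrative, not from the notes):

set.seed(4)
n <- 20; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
H <- X %*% solve(t(X) %*% X) %*% t(X)            # hat matrix: symmetric, idempotent, Trace(H) = p

qf <- replicate(1e4, { z <- rnorm(n); drop(t(z) %*% H %*% z) })
c(mean = mean(qf), variance = var(qf))           # approximately p and 2p, as for a chi-squared_p variable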

48 Some distributional results (cont.)

9. If A_1 and A_2 are square, symmetric and orthogonal to each other, that is

   A_1 A_2 = 0,

   then Z'A_1 Z and Z'A_2 Z are independent. This result again uses the singular value decomposition; let

   V_1 = D_1 U_1' Z,   V_2 = D_2 U_2' Z.

   We have that

   Cov_{V_1,V_2}[V_1, V_2] = E_{V_1,V_2}[V_2 V_1'] = E_Z[D_2 U_2' Z Z' U_1 D_1] = D_2 U_2' U_1 D_1 = 0

   as, if A_1 and A_2 are orthogonal, then U_2' U_1 = 0.

49 Bayesian Regression and Penalized Least Squares

50 The Bayesian Linear Model

Under a Normality assumption

Y | X, β, σ² ~ Normal(Xβ, σ² I_n)

which defines the likelihood L(β, σ²; y, X), we may perform Bayesian inference by specifying a joint prior distribution

π_0(β, σ²) = π_0(β | σ²) π_0(σ²)

and computing the posterior distribution

π_n(β, σ² | y, X) ∝ L(β, σ²; y, X) π_0(β, σ²)

51 The Bayesian Linear Model (cont.)

Prior specification:

π_0(σ²) ≡ InvGamma(α/2, γ/2), that is, by definition, 1/σ² ~ Gamma(α/2, γ/2):

π_0(σ²) = [ (γ/2)^{α/2} / Γ(α/2) ] (1/σ²)^{α/2 + 1} exp{ -γ / (2σ²) }

π_0(β | σ²) ≡ Normal(θ, σ² Ψ), that is, β is conditionally Normally distributed in p dimensions,

where the parameters α, γ, θ, Ψ are fixed, known constants ("hyperparameters").

52 The Bayesian Linear Model (cont.)

In the calculation of π_n(β, σ² | y, X), we have after collecting terms

π_n(β, σ² | y, X) ∝ (1/σ²)^{(n+α+p)/2 + 1} exp{ -γ / (2σ²) } exp{ -Q(β, y, X) / (2σ²) }

where

Q(β, y, X) = (y - Xβ)'(y - Xβ) + (β - θ)' Ψ^{-1} (β - θ).

53 The Bayesian Linear Model (cont.)

By completing the square, we may write

Q(β, y, X) = (β - m)' M^{-1} (β - m) + c

where

M = (X'X + Ψ^{-1})^{-1};
m = (X'X + Ψ^{-1})^{-1} (X'y + Ψ^{-1}θ);
c = y'y + θ'Ψ^{-1}θ - m'M^{-1}m.

54 The Bayesian Linear Model (cont.)

From this we deduce that

π_n(β, σ² | y, X) ∝ (1/σ²)^{(n+α+p)/2 + 1} exp{ -(γ + c) / (2σ²) } exp{ -(β - m)' M^{-1} (β - m) / (2σ²) }

that is,

π_n(β, σ² | y, X) = π_n(β | σ², y, X) π_n(σ² | y, X)

with

π_n(β | σ², y, X) ≡ Normal(m, σ² M);
π_n(σ² | y, X) ≡ InvGamma((n + α)/2, (c + γ)/2).

55 The Bayesian Linear Model (cont.)

The Bayesian posterior mean/modal estimator of β based on this model is

β̂_B = m = (X'X + Ψ^{-1})^{-1} (X'y + Ψ^{-1}θ)
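A minimal R sketch of these posterior computations (not from the notes; the data and hyperparameter values are invented for illustration):

set.seed(5)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% c(2, 1, -1) + rnorm(n))

alpha <- 1; gamma <- 1
theta <- rep(0, p); Psi <- diag(100, p)        # fairly diffuse prior on beta

Psi.inv <- solve(Psi)
M <- solve(t(X) %*% X + Psi.inv)
m <- M %*% (t(X) %*% y + Psi.inv %*% theta)    # posterior mean/mode of beta
c.val <- drop(t(y) %*% y + t(theta) %*% Psi.inv %*% theta - t(m) %*% solve(M) %*% m)
c(shape = (n + alpha)/2, rate = (c.val + gamma)/2)   # InvGamma parameters for sigma^2 | y
cbind(posterior.mean = drop(m), ols = coef(lm(y ~ X - 1)))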

56 Ridge Regression

If, in the Bayesian estimation, we choose

θ = 0,   Ψ^{-1} = λ I_p

we have

β̂_B = m = (X'X + λ I_p)^{-1} X'y.

If λ = 0: β̂_B = β̂, the least squares estimate; as λ → ∞: β̂_B → 0.

Note that to make this specification valid, we should place all the predictors (the columns of X) on the same scale.
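A short R sketch of the ridge estimate over a grid of λ values (illustrative, not from the notes; the data are invented and the predictors are standardized as recommended above):

set.seed(6)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))          # predictors on a common scale
y <- drop(X %*% c(3, 0, -1) + rnorm(n)); y <- y - mean(y)

ridge <- function(X, y, lambda) solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
sapply(c(0, 1, 10, 100, 1000), function(l) ridge(X, y, l))  # estimates shrink towards zero as lambda grows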

57 Ridge Regression (cont.)

Consider the constrained least squares problem

minimize S(β) = sum_{i=1}^{n} (y_i - x_i β)²   subject to   sum_{j=1}^{k} β_j² ≤ t.

We solve this problem after centering the predictors, x_ij → x_ij - x̄_j, and centering the responses, y_i → y_i - ȳ. After this transformation, the intercept can be omitted.

58 Ridge Regression (cont.)

Suppose therefore that there are precisely p β parameters in the model. We solve the constrained minimization using Lagrange multipliers: we minimize S_λ(β), where

S_λ(β) = S(β) + λ( sum_{j=1}^{p} β_j² - t ).

We have that

∂S_λ(β)/∂β = ∂S(β)/∂β + 2λβ,

a p × 1 vector.

59 Ridge Regression (cont.)

By direct calculus, we have

∂S_λ(β)/∂β = -2X'(y - Xβ) + 2λβ

and equating to zero we have

X'y = (X'X + λI)β

so that

β̂_B = (X'X + λI)^{-1} X'y

60 Ridge Regression (cont.)

For statistical properties, we have

E_{Y|X}[β̂_B | X] = (X'X + λI)^{-1} X'X β ≠ β

so the ridge regression estimator is biased. However, the mean squared error (MSE) of β̂_B can be smaller than that of β̂.

61 Ridge Regression (cont.)

If the columns of the matrix X are orthonormal, so that

X'X = I_p,

then

β̂_Bj = β̂_j / (1 + λ),   so |β̂_Bj| < |β̂_j|.

In general the ridge regression estimates are shrunk towards zero compared to the least squares estimates.

62 Ridge Regression (cont.)

Using the singular value decomposition, write X = UDV', where

U is n × p; the columns of U are an orthonormal basis for the column space of X, and U'U = I_p.
V is p × p; the columns of V are an orthonormal basis for the row space of X, and V'V = I_p.
D is diagonal with elements d_1 ≥ d_2 ≥ ... ≥ d_p ≥ 0.

63 Ridge Regression (cont.)

Least squares: the predictions are

Ŷ = Xβ̂ = X(X'X)^{-1}X'y = (UDV')(VDU'UDV')^{-1}VDU'y = UU'y

as V'V = I_p = VV' (to see this, pre-multiply both sides by V), so that

(VDU'UDV')^{-1} = (VD²V')^{-1} = VD^{-2}V'.

64 Ridge Regression (cont.)

Ridge regression: the predictions are

Ŷ = Xβ̂_B = X(X'X + λI_p)^{-1}X'y = UD(D² + λI_p)^{-1}DU'y = sum_{j=1}^{p} ũ_j [ d_j² / (d_j² + λ) ] ũ_j' y

where ũ_j is the jth column of U.

65 Ridge Regression (cont.)

Ridge regression transforms the problem to one involving the orthogonal matrix U instead of X, and then shrinks the coefficients by the factor

d_j² / (d_j² + λ) ≤ 1.

For ridge regression, the hat matrix is

H_λ = X(X'X + λI_p)^{-1}X' = UD(D² + λI_p)^{-1}DU'

and the degrees of freedom of the fit is

Trace(H_λ) = sum_{j=1}^{p} d_j² / (d_j² + λ).
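The effective degrees of freedom are easy to compute from the singular values; a short R sketch (illustrative, not from the notes, with invented data):

set.seed(7)
X <- scale(matrix(rnorm(100 * 5), 100, 5))
d <- svd(X)$d
df.ridge <- function(lambda) sum(d^2 / (d^2 + lambda))
sapply(c(0, 1, 10, 100), df.ridge)   # decreases from p = 5 towards 0 as lambda grows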

66 Other Penalties

The LASSO penalty is

λ sum_{j=1}^{p} |β_j|

and we solve the minimization

min_β (y - Xβ)'(y - Xβ) + λ sum_{j=1}^{p} |β_j|.

No analytical solution is available, but the minimization can be achieved using coordinate descent. In this case, the minimization allows for the optimal value β̂_j to be precisely zero for some j.
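A coordinate descent sketch for this problem (illustrative, not from the notes): with the other coordinates held fixed, each one-dimensional update is a soft-thresholding operation. The data, the λ value, and the fixed number of sweeps below are all assumptions.

soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

lasso.cd <- function(X, y, lambda, nsweep = 200) {
  p <- ncol(X); b <- rep(0, p)
  for (t in 1:nsweep) {
    for (j in 1:p) {
      r <- y - X[, -j, drop = FALSE] %*% b[-j]            # partial residual
      b[j] <- soft(sum(X[, j] * r), lambda / 2) / sum(X[, j]^2)
    }
  }
  b
}

set.seed(8)
X <- scale(matrix(rnorm(100 * 5), 100, 5))
y <- drop(X %*% c(3, 0, 0, -2, 0) + rnorm(100)); y <- y - mean(y)
lasso.cd(X, y, lambda = 50)   # some coefficients are shrunk exactly to zero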

67 Other Penalties (cont.)

A general version of this penalty is

λ sum_{j=1}^{p} |β_j|^q

and if q ≤ 1, there is a possibility of an estimate being shrunk exactly to zero. For q ≥ 1 the problem is convex; for q < 1 the problem is non-convex, and harder to solve. However, if q ≤ 1, the shrinkage to zero allows for variable selection to be carried out automatically.

68 Model Selection using Information Criteria

69 Selection using Information Criteria

Suppose we wish to choose one from a collection of models described by densities f_1, f_2, ..., f_K with parameters θ_1, θ_2, ..., θ_K. Let the true, data-generating model be denoted f_0. We consider the KL divergence between f_0 and f_k:

KL(f_0, f_k) = ∫ log( f_0(x) / f_k(x; θ_k) ) f_0(x) dx

and aim to choose the model using the criterion

k̂ = arg min_k KL(f_0, f_k)

70 Selection using Information Criteria (cont.)

In reality, the θ_k are typically unknown, so we consider estimating them using maximum likelihood procedures. We consider

KL(f_0, f_k) = ∫ log( f_0(x) / f_k(x; θ̂_k) ) f_0(x) dx

where θ̂_k is obtained by maximizing the likelihood under model k, that is, maximizing

sum_{i=1}^{n} log f_k(y_i; θ)

with respect to θ for data y_1, ..., y_n.

71 Selection using Information Criteria (cont.)

We have that

KL(f_0, f_k) = ∫ log( f_0(x) ) f_0(x) dx - ∫ log( f_k(x; θ_k) ) f_0(x) dx

so we then may choose k by

k̂ = arg max_k ∫ log( f_k(x; θ_k) ) f_0(x) dx

or, if the parameters need to be estimated,

k̂ = arg max_k ∫ log( f_k(x; θ̂_k(y)) ) f_0(x) dx.

72 Asymptotic results for estimation under misspecification

73 Maximum likelihood as minimum KL

We may seek to define θ_k directly using the entropy criterion

θ_k = arg min_θ KL(f_0, f_k(θ)) = arg min_θ ∫ log( f_0(x) / f_k(x; θ) ) f_0(x) dx

and solve the problem using calculus by differentiating KL(f_0, f_k(θ)) with respect to θ and equating to zero. Note that under standard regularity conditions

∂/∂θ ∫ log( f_0(x) / f_k(x; θ) ) f_0(x) dx = -∫ ∂/∂θ { log f_k(x; θ) } f_0(x) dx = -E_X[ ∂/∂θ log f_k(X; θ) ]

74 Maximum likelihood as minimum KL (cont.)

Therefore θ_k solves

E_X[ ∂/∂θ log f_k(X; θ) |_{θ = θ_k} ] = 0.

Under identifiability assumptions, we assume that there is a single θ_k which solves this equation. The sample-based equivalent calculation dictates that the estimate θ̂_k satisfies

(1/n) sum_{i=1}^{n} ∂/∂θ log f_k(x_i; θ) |_{θ = θ̂_k} = 0

which coincides with ML estimation.

75 Maximum likelihood as minimum KL (cont.)

Under identifiability assumptions, since

(1/n) sum_{i=1}^{n} ∂/∂θ log f_k(X_i; θ) |_{θ = θ_k} → E_X[ ∂/∂θ log f_k(X; θ) |_{θ = θ_k} ] = 0

by the strong law of large numbers, we must have that θ̂_k → θ_k with probability 1 as n → ∞.

76 Maximum likelihood as minimum KL (cont.)

Let

I(θ_k) = E_X[ -∂² log f_k(X; θ) / ∂θ∂θ' |_{θ = θ_k} ]

be the Fisher information computed with respect to f_k(x; θ_k), and let

Ī(θ_k) = E_X[ -∂² log f_k(X; θ) / ∂θ∂θ' |_{θ = θ_k} ]

be the same expectation quantity computed with respect to f_0(x). The corresponding n-data versions, where log f_k(X; θ) is replaced by sum_{i=1}^{n} log f_k(X_i; θ), are

I_n(θ_k) = n I(θ_k),   Ī_n(θ_k) = n Ī(θ_k).

77 Maximum likelihood as minimum KL (cont.)

The quantity Î_n(θ_k) is the sample-based version

Î_n(θ_k) = -sum_{i=1}^{n} ∂²/∂θ∂θ' log f_k(x_i; θ) |_{θ = θ_k}

This quantity is evaluated at θ_k = θ̂_k to yield Î_n(θ̂_k); we have that

(1/n) Î_n(θ̂_k) → Ī(θ_k)

with probability 1 as n → ∞.

78 Maximum likelihood as minimum KL (cont.)

Another approach that is sometimes used is to use the equivalence between expressions involving the first and second derivative versions of I(θ); we have also that

I(θ_k) = E_X[ { ∂ log f_k(X; θ) / ∂θ }² |_{θ = θ_k} ] = 𝒥(θ_k), say,

with the expectation computed with respect to f_k(x; θ_k), where for a vector U we write U² = UU'. Let

J(θ_k) = E_X[ { ∂ log f_k(X; θ) / ∂θ }² |_{θ = θ_k} ]

be the equivalent calculation with the expectation computed with respect to f_0(x).

79 Maximum likelihood as minimum KL (cont.)

Now, by a second-order Taylor expansion of the first derivative of the log-likelihood: if

l̇_k(θ_k) = ∂ log f_k(x; θ) / ∂θ |_{θ = θ_k},   l̈_k(θ_k) = ∂² log f_k(x; θ) / ∂θ∂θ' |_{θ = θ_k},

we have

l̇_k(θ_k) = l̇_k(θ̂_k) + l̈_k(θ̂_k)(θ_k - θ̂_k) + R_n(θ*) = l̈_k(θ̂_k)(θ_k - θ̂_k) + R_n(θ*)

as l̇_k(θ̂_k) = 0, where ||θ_k - θ*|| < ||θ_k - θ̂_k|| and the remainder term, which involves the third derivative of the log-likelihood at θ*, is denoted R_n(θ*).

80 Maximum likelihood as minimum KL (cont.)

We have by definition that l̈_k(θ̂_k) = -Î_n(θ̂_k), and in the limit

R_n(θ*) / n →p 0,

that is, R_n(θ*) = o_p(n), and

(1/n) Î_n(θ̂_k) →a.s. Ī(θ_k)

as n → ∞. Rewriting the above approximation, we have

√n (θ̂_k - θ_k) = { (1/n) Î_n(θ̂_k) }^{-1} { (1/√n) l̇_k(θ_k) } - { (1/n) Î_n(θ̂_k) }^{-1} { R_n(θ*) / √n }

81 Maximum likelihood as minimum KL (cont.)

By the central limit theorem

√n (θ̂_k - θ_k) →d Normal(0, Σ)

say, where Σ denotes the limiting variance-covariance matrix when sampling is under f_0, and

(1/√n) l̇_k(θ_k) = (1/√n) sum_{i=1}^{n} ∂ log f_k(X_i; θ_k)/∂θ →d Normal(0, J(θ_k)).

Finally,

{ (1/n) Î_n(θ̂_k) }^{-1} { R_n(θ*) / √n } →p 0

by Slutsky's Theorem.

82 Maximum likelihood as minimum KL (cont.)

Therefore, by equating the asymptotic variances of the above quantities, we must have

J(θ_k) = Ī(θ_k) Σ Ī(θ_k)

yielding that

Σ = {Ī(θ_k)}^{-1} J(θ_k) {Ī(θ_k)}^{-1}

83 Model Selection using Minimum KL

84 The Expected KL Divergence

The quantity

∫ log( f_k(x; θ̂_k(Y)) ) f_0(x) dx = E_X[ log f_k(X; θ̂_k(Y)) ]

is a random quantity, a function of the data random quantities Y. Thus we instead decide to choose k by

k̂ = arg max_k E_Y[ E_X[ log f_k(X; θ̂_k(Y)) ] ]

where X and Y are drawn from the true model f_0.

85 The Expected KL Divergence (cont.)

We consider first the expansion of the inner integral around the true value θ_k for an arbitrary y; under regularity conditions

E_X[log f_k(X; θ̂_k(y))] = E_X[log f_k(X; θ_k)] + E_X[l̇_k(θ_k)]'(θ̂_k - θ_k)
                           + (1/2)(θ̂_k - θ_k)' E_X[l̈_k(θ_k)](θ̂_k - θ_k) + o_p(n)

86 The Expected KL Divergence (cont.)

By definition E_X[l̇_k(θ_k)] = 0, and E_X[l̈_k(θ_k)] = -Ī_n(θ_k), and so therefore

E_X[log f_k(X; θ̂_k(y))] = E_X[log f_k(X; θ_k)] - (1/2)(θ̂_k - θ_k)' Ī_n(θ_k)(θ̂_k - θ_k) + o_p(n)

We then must compute E_Y[ E_X[log f_k(X; θ̂_k(Y))] ]. The term E_X[log f_k(X; θ_k)] is a constant with respect to this expectation.

87 The Expected KL Divergence (cont.)

The expectation of the quadratic term can be computed by standard results for large n; under sampling Y from f_0, we have as before that

(θ̂_k - θ_k)' Ī_n(θ_k)(θ̂_k - θ_k)

can be rewritten

{ √n(θ̂_k - θ_k) }' { (1/n) Ī_n(θ_k) } { √n(θ̂_k - θ_k) }

where

√n(θ̂_k - θ_k) →d Normal(0, Σ)   and   (1/n) Ī_n(θ_k) →a.s. Ī(θ_k).

88 The Expected KL Divergence (cont.)

Therefore, by standard results for quadratic forms,

E_Y[ (θ̂_k(Y) - θ_k)' Ī_n(θ_k)(θ̂_k(Y) - θ_k) ] → Trace( Ī(θ_k) Σ )

and

E_Y[ E_X[log f_k(X; θ̂_k(Y))] ] = E_X[log f_k(X; θ_k)] - (1/2) Trace( Ī(θ_k) Σ ) + o_p(n)    (*)

However, the right hand side cannot be computed, as θ_k is not known and must be estimated.

89 The Expected KL Divergence (cont.)

Thus we repeat the same operation, but instead expand E_X[log f_k(X; θ_k)] about θ̂_k(x) for a fixed x. We have

log f_k(x; θ_k) = log f_k(x; θ̂_k(x)) + l̇_k(θ̂_k(x))'(θ_k - θ̂_k(x))
                  + (1/2)(θ̂_k(x) - θ_k)' l̈_k(θ̂_k(x))(θ̂_k(x) - θ_k) + o(n)

of which we need then to take the expectation with respect to X.

90 The Expected KL Divergence (cont.)

Again E_X[ l̇_k(θ̂_k(X)) ] = 0, and writing

(θ̂_k(x) - θ_k)' l̈_k(θ̂_k(x))(θ̂_k(x) - θ_k)

as

{ √n(θ̂_k(x) - θ_k) }' { (1/n) l̈_k(θ̂_k(x)) } { √n(θ̂_k(x) - θ_k) }

we have that the expectation over X of this quadratic form converges to -Trace( Ī(θ_k) Σ ), as by standard theory

√n(θ̂_k(x) - θ_k) →d Normal(0, Σ).

91 The Expected KL Divergence (cont.)

Thus

E_X[log f_k(X; θ_k)] = E_X[log f_k(X; θ̂_k(X))] - (1/2) Trace( Ī(θ_k) Σ ) + o_p(n).

Therefore, using the previous expression (*), we now have

E_Y[ E_X[log f_k(X; θ̂_k(Y))] ] = E_X[log f_k(X; θ̂_k(X))] - Trace( Ī(θ_k) Σ ) + o_p(n)

By the earlier identity, Ī(θ_k) Σ = J(θ_k) {Ī(θ_k)}^{-1}, and the right hand side can be consistently estimated by

log f_k(x; θ̂_k) - Trace( J(θ_k) {Ī(θ_k)}^{-1} )

where the estimated trace must be computed from the available data.

92 The Expected KL Divergence (cont.)

An obvious estimator of the trace is

Trace( Ĵ(θ̂_k(x)) { Î(θ̂_k(x)) }^{-1} )

although this might potentially be improved upon. In any case, if the approximating model f_k is close in KL terms to the true model f_0, we would expect that

Trace( J(θ_k) {Ī(θ_k)}^{-1} ) ≈ dim(θ_k)

as under regularity assumptions we would then have J(θ_k) ≈ Ī(θ_k).

93 The Expected KL Divergence (cont.)

This yields the criterion: choose k to maximize

log f_k(x; θ̂_k) - dim(θ_k)

or equivalently to minimize

-2 log f_k(x; θ̂_k) + 2 dim(θ_k)

This is Akaike's Information Criterion (AIC).

Note: the required regularity conditions on the model f_k are not too restrictive. However, the approximation Trace( J(θ_k) {Ī(θ_k)}^{-1} ) ≈ dim(θ_k) is potentially poor.
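For models fitted by lm this criterion is available directly; a brief illustration on the motorcycle data (the two candidate models compared here are arbitrary choices, not from the notes):

library(MASS)
fit1 <- lm(accel ~ times, data = mcycle)
fit2 <- lm(accel ~ poly(times, 5), data = mcycle)
AIC(fit1, fit2)   # AIC = -2 log-likelihood + 2 dim(theta); the smaller value is preferred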

94 The Bayesian Information Criterion

The Bayesian Information Criterion (BIC) uses an approximation to the marginal likelihood function to justify model selection. For data y = (y_1, ..., y_n), we have the posterior distribution for model k as

π_n(θ_k; y) = L_k(θ_k; y) π_0(θ_k) / ∫ L_k(t; y) π_0(t) dt

where L_k(θ; y) is the likelihood function. The denominator,

f_k(y) = ∫ L_k(t; y) π_0(t) dt,

is the marginal likelihood.

95 The Bayesian Information Criterion (cont.)

Let

l_k(θ; y) = log L_k(θ; y) = sum_{i=1}^{n} log f_k(y_i; θ)

denote the log-likelihood for model k. By a Taylor expansion of l_k(θ; y) around the ML estimate θ̂_k, we have

l_k(θ; y) = l_k(θ̂_k; y) + (θ - θ̂_k)' l̇_k(θ̂_k; y) + (1/2)(θ - θ̂_k)' l̈_k(θ̂_k; y)(θ - θ̂_k) + o(1)

We have by definition that l̇_k(θ̂_k) = 0, and we may write

l̈_k(θ̂_k; y) = -{ V_n(θ̂_k) }^{-1}

where V_n(θ̂_k) is derived from the Hessian matrix of the log-likelihood based on the n data points.

96 The Bayesian Information Criterion (cont.)

In ML theory, V_n(θ̂_k) estimates the variance of the ML estimator derived from n data points. For a random sample, we may write

V_n(θ̂_k) = (1/n) V_1(θ̂_k)

say, where

V_1(θ) = n { -sum_{i=1}^{n} ∂²/∂θ∂θ' log f_k(y_i; θ) }^{-1}

is a square symmetric matrix of dimension p_k = dim(θ_k), recording an estimate of the variance of the ML estimator for a single data point (n = 1).

97 The Bayesian Information Criterion (cont.)

Then, exponentiating, we have that

L_k(θ; y) ≈ L_k(θ̂_k; y) exp{ -(1/2)(θ - θ̂_k)' { V_n(θ̂_k) }^{-1} (θ - θ̂_k) }

This is a standard quadratic approximation to the likelihood around the ML estimate.

98 The Bayesian Information Criterion (cont.)

Suppose that the prior π_0(θ_k) is constant (equal to c, say) in the neighbourhood of θ̂_k. Then, for large n,

f_k(y) = ∫ L_k(t; y) π_0(t) dt
       ≈ c L_k(θ̂; y) ∫ exp{ -(1/2)(t - θ̂_k)' { V_n(θ̂_k) }^{-1} (t - θ̂_k) } dt
       = c L_k(θ̂; y) (2π)^{p_k/2} |V_n(θ̂_k)|^{1/2}

as the integrand is proportional to a Normal(θ̂_k, V_n(θ̂_k)) density.

99 The Bayesian Information Criterion (cont.)

But

|V_n(θ̂_k)|^{1/2} = |(1/n) V_1(θ̂_k)|^{1/2} = n^{-p_k/2} |V_1(θ̂_k)|^{1/2}.

Therefore the marginal likelihood becomes

f_k(y) ≈ c L_k(θ̂; y) (2π)^{p_k/2} n^{-p_k/2} |V_1(θ̂_k)|^{1/2}

or, on the -2 log scale, we have

-2 log f_k(y) ≈ -2 l_k(θ̂; y) + p_k log n + constant

where constant = -p_k log(2π) - log|V_1(θ̂_k)| - 2 log c.

100 The Bayesian Information Criterion (cont.)

The constant term is O(1), and is hence negligible relative to the terms that grow with n, so the BIC is defined as

BIC = -2 l_k(θ̂; y) + p_k log n.
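The same comparison as above under the BIC penalty (illustrative, not from the notes; the candidate models are the same arbitrary choices as before):

library(MASS)
fit1 <- lm(accel ~ times, data = mcycle)
fit2 <- lm(accel ~ poly(times, 5), data = mcycle)
BIC(fit1, fit2)                          # -2 log-likelihood + p_k log n
AIC(fit1, fit2, k = log(nrow(mcycle)))   # the same values via the AIC function with k = log n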
