Math 533 Extra Hour Material


1 Math 533 Extra Hour Material

2 A Justification for Regression

3 The Justification for Regression

It is well known that if we want to predict a random quantity Y using some quantity m according to a mean-squared error (MSE) criterion, then the optimal predictor is the expected value of Y, µ:

E[(Y - m)^2] = E[(Y - µ + µ - m)^2]
             = E[(Y - µ)^2] + E[(µ - m)^2] + 2 E[(Y - µ)(µ - m)]
             = E[(Y - µ)^2] + (µ - m)^2

which is minimized when m = µ.

4 The Justification for Regression (cont.)

Now suppose we have a joint distribution for Y and X, and wish to predict Y as a function of X, say m(x). Using the same MSE criterion, and writing µ(x) = E_{Y|X}[Y | X = x] for the conditional expectation of Y given X = x, we have

E_{X,Y}[(Y - m(X))^2] = E_{X,Y}[(Y - µ(X) + µ(X) - m(X))^2]
                      = E_{X,Y}[(Y - µ(X))^2] + E_{X,Y}[(µ(X) - m(X))^2] + 2 E_{X,Y}[(Y - µ(X))(µ(X) - m(X))]
                      = E_{X,Y}[(Y - µ(X))^2] + E_X[(µ(X) - m(X))^2] + 0

5 The Justification for Regression (cont.)

The cross term equates to zero by iterated expectation:

E_{X,Y}[(Y - µ(X))(µ(X) - m(X))] = E_X[ E_{Y|X}[(Y - µ(X))(µ(X) - m(X)) | X] ]

and for the internal expectation

E_{Y|X}[(Y - µ(X))(µ(X) - m(X)) | X] = (µ(X) - m(X)) E_{Y|X}[(Y - µ(X)) | X]

with E_{Y|X}[(Y - µ(X)) | X] = 0 a.s.

6 The Justification for Regression (cont.)

Thus the MSE is minimized over functions m(.) when E_X[(µ(X) - m(X))^2] is minimized, and this term can be made zero by setting m(x) = µ(x) = E_{Y|X}[Y | X = x]. Hence the MSE-optimal prediction is made by using µ(x).

Note: here X can be a single variable or a vector, and it can be random or non-random; the result holds in each case.
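As a quick numerical check of this result (an illustrative sketch, not part of the original notes; the mean function and predictor below are invented for the demonstration), the following R code compares the MSE of the conditional-mean predictor with that of an arbitrary competing predictor on simulated data.

set.seed(533)
n <- 1e5
x <- runif(n, -2, 2)
mu <- function(x) sin(2 * x)              # plays the role of E[Y | X = x]
y <- mu(x) + rnorm(n, sd = 0.5)

mse.mu  <- mean((y - mu(x))^2)            # MSE using the conditional mean
mse.lin <- mean((y - (0.1 + 0.8 * x))^2)  # MSE using a mis-specified linear predictor
c(conditional.mean = mse.mu, linear = mse.lin)  # the conditional mean attains the smaller MSE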

7 The Justification for Linear Modelling

Suppose that the true conditional mean function is represented by µ(x), where x is a single predictor. We have by Taylor expansion around x = 0 that

µ(x) = µ(0) + sum_{j=1}^{p-1} [µ^(j)(0) / j!] x^j + O(x^p)

where the remainder term O(x^p) represents terms of magnitude x^p or higher order, and

µ^(j)(0) = d^j µ(x) / dx^j evaluated at x = 0.

8 The Justification for Linear Modelling (cont.)

The derivatives of µ(x) at x = 0 may be treated as unspecified constants, in which case a reasonable approximating model takes the form

β_0 + sum_{j=1}^{p-1} β_j x^j

where β_j corresponds to µ^(j)(0)/j! for j = 0, 1, ..., p-1. Similar expansions hold if x is vector valued.

9 The Justification for Linear Modelling (cont.)

Finally, if Y and X are jointly normally distributed, then the conditional expectation of Y given X = x is linear in x; see the Multivariate Normal Distribution handout.

10 The General Linear Model

11 The General Linear Model

The linear model formulation that assumes

E_{Y|X}[Y | X] = Xβ,    Var_{Y|X}[Y | X] = σ^2 I_n

is actually quite a general formulation, as the rows x_i of X can be formed by using general transforms of the originally recorded predictors:

multiple regression:   x_i = [1  x_i1  x_i2  ...  x_ik]
polynomial regression: x_i = [1  x_i1  x_i1^2  ...  x_i1^k]
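For instance, polynomial regression is fitted exactly as an ordinary linear model once the powers of the predictor are placed in the design matrix; a minimal R sketch (not from the notes, with invented data):

set.seed(1)
x <- runif(50, 0, 3)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(50, sd = 0.3)

X <- cbind(1, x, x^2)                      # rows x_i = [1, x_i, x_i^2]
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # least squares via the normal equations
beta.hat
coef(lm(y ~ poly(x, 2, raw = TRUE)))       # the built-in fit gives the same estimates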

12 The General Linear Model (cont.)

harmonic regression: consider a single continuous predictor x measured on a bounded interval. Let ω_j = j/n for j = 0, 1, ..., K, where K = n/2, and then set

E_{Y_i|X}[Y_i | x_i] = β_0 + sum_{j=1}^{J} β_j1 cos(2πω_j x_i) + sum_{j=1}^{J} β_j2 sin(2πω_j x_i)

If J = K, we then essentially have an n × n matrix X specifying a linear transform of y in terms of the derived predictors (cos(2πω_j x_i), sin(2πω_j x_i)); the coefficients (β_0, β_11, β_12, ..., β_K) form the discrete Fourier transform of y.

13 The General Linear Model (cont.)

In this case, the columns of X are orthogonal, and

X'X = diag(n, n/2, n/2, ..., n/2, n)

that is, X'X is a diagonal matrix.
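A small numerical check of this orthogonality (illustrative only, not from the notes; it assumes equally spaced x_i = 1, ..., n with n even, so that the sine term at frequency 1/2 is identically zero and is dropped):

n <- 8
x <- 1:n
K <- n / 2
omega <- (1:K) / n
Xc <- sapply(omega, function(w) cos(2 * pi * w * x))
Xs <- sapply(omega[-K], function(w) sin(2 * pi * w * x))
X <- cbind(1, Xc, Xs)
round(t(X) %*% X, 10)   # diagonal, with entries n and n/2 as stated above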

14 The General Linear Model (cont.)

basis functions: truncated spline basis: for x ∈ R, let

x_i1 = (x - η_1)^α  if x > η_1,  and  0  if x ≤ η_1,

that is, x_i1 = (x - η_1)_+^α, for some fixed η_1 and α ∈ R. More generally,

E_{Y_i|X}[Y_i | x_i] = β_0 + sum_{j=1}^{J} β_j (x_i - η_j)_+^α

for fixed η_1 < η_2 < ... < η_J.

15 The General Linear Model (cont.)

piecewise constant: for x ∈ R, let

x_i1 = 1 if x ∈ A_1,  0 if x ∉ A_1,  that is, x_i1 = 1_{A_1}(x)

for some set A_1. More generally,

E_{Y_i|X}[Y_i | x_i] = β_0 + sum_{j=1}^{J} β_j 1_{A_j}(x)

for sets A_1, ..., A_J. If we want to use a partition of R, we may write this

E_{Y_i|X}[Y_i | x_i] = sum_{j=0}^{J} β_j 1_{A_j}(x),   where A_0 is the complement of ∪_{j=1}^{J} A_j.

16 The General Linear Model (cont.)

piecewise linear: specify

E_{Y_i|X}[Y_i | x_i] = sum_{j=0}^{J} 1_{A_j}(x_i)(β_j0 + β_j1 x_i)

piecewise linear & continuous: specify

E_{Y_i|X}[Y_i | x_i] = β_0 + β_1 x_i + sum_{j=1}^{J} β_j1 (x_i - η_j)_+

for fixed η_1 < η_2 < ... < η_J.

higher order piecewise functions (quadratic, cubic etc.): splines

17 Example: Motorcycle data

library(MASS)
# Motorcycle data
plot(mcycle, pch = 19, main = "Motorcycle accident data")

x <- mcycle$times
y <- mcycle$accel

18 Raw data. [Figure: scatterplot of the motorcycle accident data, accel vs. times.]

19 Example: Piecewise constant fit

# Knots
K <- 11
kappa <- as.numeric(quantile(x, probs = c(0:K)/K))

X <- (outer(x, kappa, "-") > 0)^2
X <- X[, -12]

fit.pwc <- lm(y ~ X)
summary(fit.pwc)

newx <- seq(0, max(x), length = 1001)
newX <- (outer(newx, kappa, "-") > 0)^2
newX <- cbind(rep(1, 1001), newX[, -12])

yhatc <- newX %*% coef(fit.pwc)
lines(newx, yhatc, col = "blue", lwd = 2)

20 Fit: Piecewise constant. [Figure: piecewise constant fit (blue) overlaid on the motorcycle accident data, accel vs. times.]

21 Example: Piecewise linear fit

X1 <- (outer(x, kappa, "-") > 0)^2
X1 <- X1[, -12]

X2 <- (outer(x, kappa, "-") > 0) * outer(x, kappa, "-")
X2 <- X2[, -12]

X <- cbind(X1, X2)

fit.pwl <- lm(y ~ X)
summary(fit.pwl)

newx <- seq(0, max(x), length = 1001)
newX1 <- (outer(newx, kappa, "-") > 0)^2
newX2 <- (outer(newx, kappa, "-") > 0) * outer(newx, kappa, "-")
newX <- cbind(rep(1, 1001), newX1[, -12], newX2[, -12])

yhatl <- newX %*% coef(fit.pwl)
lines(newx, yhatl, col = "red", lwd = 2)

22 Fit: ... + piecewise linear. [Figure: piecewise linear fit (red) added to the plot of the motorcycle accident data.]

23 Example: Piecewise continuous linear fit

X <- (outer(x, kappa, "-") > 0) * outer(x, kappa, "-")
X <- X[, -12]

fit.pwcl <- lm(y ~ X)
summary(fit.pwcl)

newx <- seq(0, max(x), length = 1001)
newX <- (outer(newx, kappa, "-") > 0) * outer(newx, kappa, "-")
newX <- cbind(rep(1, 1001), newX[, -12])

yhatcl <- newX %*% coef(fit.pwcl)
lines(newx, yhatcl, col = "green", lwd = 2)

24 Fit: ... + piecewise continuous linear. [Figure: continuous piecewise linear fit (green) added to the plot of the motorcycle accident data.]

25 Fit: ... + piecewise continuous quadratic. [Figure: continuous piecewise quadratic fit added to the plot of the motorcycle accident data.]

26 Fit: ... + piecewise continuous cubic. [Figure: continuous piecewise cubic fit added to the plot of the motorcycle accident data.]

27 Fit: B-spline. [Figure: B-spline fit of the motorcycle accident data, accel vs. times.]
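The B-spline fit shown in this figure can be reproduced along the following lines (an illustrative sketch, not the original code; the knot placement, taken from the kappa vector defined earlier, and the cubic degree are assumptions):

library(splines)
fit.bs <- lm(y ~ bs(x, knots = kappa[2:11], degree = 3))
newx <- seq(min(x), max(x), length = 1001)
yhat.bs <- predict(fit.bs, newdata = data.frame(x = newx))
lines(newx, yhat.bs, lwd = 2)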

28 Fit: Harmonic regression, K = ... [Figure: harmonic regression fit of the motorcycle accident data, accel vs. times.]

29 Fit: Harmonic regression, K = ... [Figure: harmonic regression fit of the motorcycle accident data, accel vs. times.]

30 Fit: Harmonic regression, K = ... [Figure: harmonic regression fit of the motorcycle accident data, accel vs. times.]

31 Fit: Harmonic regression, K = ... [Figure: harmonic regression fit of the motorcycle accident data, accel vs. times.]
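A sketch of how harmonic regression fits of this kind can be produced for the motorcycle data (illustrative only, not the original code; rescaling the predictor to [0, 1] and the choice J = 10 are assumptions):

harm.basis <- function(u, J) {
  # columns cos(2*pi*j*u), sin(2*pi*j*u) for j = 1, ..., J
  do.call(cbind, lapply(1:J, function(j) cbind(cos(2 * pi * j * u), sin(2 * pi * j * u))))
}
J <- 10
u <- (x - min(x)) / (max(x) - min(x))
fit.harm <- lm(y ~ harm.basis(u, J))

newu <- seq(0, 1, length = 1001)
yhat.harm <- cbind(1, harm.basis(newu, J)) %*% coef(fit.harm)
lines(min(x) + newu * (max(x) - min(x)), yhat.harm, lwd = 2)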

32 Reparameterization

33 Reparameterizing the model

For any model E_{Y|X}[Y | X] = Xβ we might consider a reparameterization of the model by writing

x_i^new = x_i A^{-1}

for some p × p non-singular matrix A. Then

E_{Y_i|X}[Y_i | x_i] = x_i β = (x_i^new A)β = x_i^new β^(new)

where β^(new) = Aβ.

34 Reparameterizing the model (cont.)

Then X^new = X A^{-1}, i.e. X = X^new A, and we may choose A such that

(X^new)'(X^new) = I_p

to give an orthogonal (actually, orthonormal) parameterization. Recall that if the design matrix X^new is orthonormal, we have for the OLS estimate

β̂^(new) = (X^new)' y.

Note however that the new predictors and their coefficients may not be readily interpretable, so it may be better to reparameterize back to β by defining

β̂ = A^{-1} β̂^(new).
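A minimal R sketch of such an orthonormal reparameterization (not from the notes; here A is taken to be the R factor of a QR decomposition, and the data are invented):

set.seed(2)
n <- 100; p <- 3
X <- cbind(1, rnorm(n), rnorm(n))
y <- drop(X %*% c(1, 2, -1) + rnorm(n))

qrX  <- qr(X)
Xnew <- qr.Q(qrX)                 # orthonormal columns: t(Xnew) %*% Xnew = I_p
A    <- qr.R(qrX)                 # X = Xnew %*% A
beta.new <- t(Xnew) %*% y         # OLS estimate in the orthonormal parameterization
beta.hat <- solve(A, beta.new)    # transform back: beta.hat = A^{-1} beta.new
cbind(beta.hat, coef(lm(y ~ X - 1)))  # agrees with ordinary least squares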

35 Coordinate methods for inversion

36 Coordinate methods for inversion

To find the ordinary least squares estimates, we solve the normal equations to obtain

β̂ = (X'X)^{-1} X'y

which requires us to invert the p × p matrix X'X. This is the minimum norm solution to

β̂ = arg min_b ||y - Xb||^2 = arg min_b sum_{i=1}^{n} (y_i - x_i b)^2

37 Coordinate methods for inversion (cont.)

We may solve this problem using coordinate descent rather than direct inversion; if b_2, ..., b_p are fixed, then the minimization problem for b_1 becomes

b̂_1 = arg min_{b_1} S(b_1 | b_2, ..., b_p) = arg min_{b_1} sum_{i=1}^{n} (y_i - b_1 x_i1 - c_i1)^2

where

c_i1 = sum_{j=2}^{p} b_j x_ij.

38 Coordinate methods for inversion (cont.)

Writing y_i* = y_i - c_i1, and the sum of squares as

sum_{i=1}^{n} (y_i* - b_1 x_i1)^2,

we have that

b̂_1 = sum_{i=1}^{n} x_i1 y_i* / sum_{i=1}^{n} x_i1^2.

39 Coordinate methods for inversion (cont.)

We may solve recursively in turn for each b_j; that is, after initialization b_j^(0) and at step t, update by minimizing

b_j^(t+1) = arg min_{b_j} S(b_j | b_1^(t+1), b_2^(t+1), ..., b_{j-1}^(t+1), b_{j+1}^(t), ..., b_p^(t))

which does not require any matrix inversion.
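An R implementation of this coordinate descent scheme (an illustrative sketch, not from the notes; the data are invented and the number of sweeps is fixed rather than chosen by a convergence test):

cd.ols <- function(X, y, nsweep = 100) {
  p <- ncol(X)
  b <- rep(0, p)                                   # initialization b^(0) = 0
  for (t in 1:nsweep) {
    for (j in 1:p) {
      r <- y - X[, -j, drop = FALSE] %*% b[-j]     # partial residual y_i - c_ij
      b[j] <- sum(X[, j] * r) / sum(X[, j]^2)      # one-dimensional least squares update
    }
  }
  b
}

set.seed(3)
X <- cbind(1, matrix(rnorm(200), 100, 2))
y <- drop(X %*% c(1, -2, 0.5) + rnorm(100))
cbind(cd.ols(X, y), solve(t(X) %*% X, t(X) %*% y))  # the two solutions agree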

40 Distributional Results

41 Some distributional results

Distributional results using the Normal distribution are key to many inference procedures for the linear model. Suppose that X_1, ..., X_n ~ Normal(µ, σ^2) are independent.

1. Z_i = (X_i - µ)/σ ~ Normal(0, 1);
2. Y_i = Z_i^2 ~ χ²_1;
3. U = sum_{i=1}^{n} Z_i^2 ~ χ²_n;
4. If U_1 ~ χ²_{n_1} and U_2 ~ χ²_{n_2} are independent, then

   V = (U_1/n_1) / (U_2/n_2) ~ Fisher(n_1, n_2)   and   1/V ~ Fisher(n_2, n_1).

42 Some distributional results (cont.)

5. If Z ~ Normal(0, 1) and U ~ χ²_ν are independent, then

   T = Z / sqrt(U/ν) ~ Student(ν).

6. If Y = AX + b, where X = (X_1, ..., X_n)' ~ Normal(0, σ^2 I_n), A is n × n (non-singular) and b is n × 1, then

   Y ~ Normal(b, σ^2 AA') ≡ Normal(µ, Σ), say,

   and (Y - µ)' Σ^{-1} (Y - µ) ~ χ²_n.

43 Some distributional results (cont.)

7. Non-central Chi-squared distribution: if X ~ Normal(µ, σ^2), we find the distribution of X²/σ². Let

   Y = X/σ ~ Normal(µ/σ, 1).

   By standard transformation results, if Q = Y², then

   f_Q(y) = (1 / (2√y)) [ φ(√y - µ/σ) + φ(√y + µ/σ) ],   y > 0.

   This is the density of the non-central chi-squared distribution with 1 degree of freedom and non-centrality parameter λ = (µ/σ)², written χ²_1(λ).
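A quick numerical check of this density against R's built-in non-central chi-squared density (illustrative, not from the notes; the values of µ and σ are arbitrary):

mu <- 1.5; sigma <- 2
lambda <- (mu / sigma)^2
q <- seq(0.01, 10, by = 0.01)
f.manual <- (dnorm(sqrt(q) - mu/sigma) + dnorm(sqrt(q) + mu/sigma)) / (2 * sqrt(q))
max(abs(f.manual - dchisq(q, df = 1, ncp = lambda)))   # essentially zero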

44 Some distributional results (cont.)

The non-central chi-squared distribution has many properties similar to those of the standard (central) chi-squared distribution. For example, if X_1, ..., X_n are independent with X_i ~ χ²_1(λ_i), then

sum_{i=1}^{n} X_i ~ χ²_n( sum_{i=1}^{n} λ_i ).

The non-central chi-squared distribution plays a role in testing for the linear regression model, as it characterizes the distribution of various sums of squares terms.

45 Some distributional results (cont.)

8. Quadratic forms: if A is a square, symmetric, idempotent matrix, and Z = (Z_1, ..., Z_n)' ~ Normal(0, I_n), then

   Z'AZ ~ χ²_ν,   where ν = Trace(A).

   To see this, use the singular value decomposition A = UDU', where U is an orthogonal matrix with U'U = I_n, and D is a diagonal matrix. Then

   Z'AZ = Z'UDU'Z = (U'Z)' D (U'Z) = {Z*}' D {Z*},

   say. But U'Z ~ Normal(0, U'U) = Normal(0, I_n).

46 Some distributional results (cont.)

Therefore Z'AZ = {Z*}' D {Z*}. But A is idempotent, so AA = A, that is,

UDU'UDU' = UDU'.

The left hand side simplifies, and we have UD²U' = UDU'.

47 Some distributional results (cont.)

Thus, pre-multiplying by U' and post-multiplying by U, we have

D² = D

and the diagonal elements of D must each be either zero or one, so

Z'AZ = {Z*}' D {Z*} ~ χ²_ν,   where ν = Trace(D) = Trace(A).
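A small simulation checking this result for the hat matrix of a regression design, which is symmetric and idempotent with trace p (illustrative, not from the notes):

set.seed(4)
n <- 20; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
H <- X %*% solve(t(X) %*% X) %*% t(X)            # hat matrix: symmetric, idempotent, Trace(H) = p

qf <- replicate(1e4, { z <- rnorm(n); drop(t(z) %*% H %*% z) })
c(mean = mean(qf), variance = var(qf))           # approximately p and 2p, as for a chi-squared_p variable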

48 Some distributional results (cont.)

9. If A_1 and A_2 are square, symmetric and orthogonal to each other, that is

   A_1 A_2 = 0,

   then Z'A_1 Z and Z'A_2 Z are independent. This result again uses the singular value decomposition; let

   V_1 = D_1 U_1' Z,   V_2 = D_2 U_2' Z.

   We have that

   Cov_{V_1,V_2}[V_1, V_2] = E_{V_1,V_2}[V_2 V_1'] = E_Z[D_2 U_2' Z Z' U_1 D_1] = D_2 U_2' U_1 D_1 = 0

   as, if A_1 and A_2 are orthogonal, then U_2' U_1 = 0.

49 Bayesian Regression and Penalized Least Squares

50 The Bayesian Linear Model

Under a Normality assumption

Y | X, β, σ² ~ Normal(Xβ, σ² I_n)

which defines the likelihood L(β, σ²; y, X), we may perform Bayesian inference by specifying a joint prior distribution

π_0(β, σ²) = π_0(β | σ²) π_0(σ²)

and computing the posterior distribution

π_n(β, σ² | y, X) ∝ L(β, σ²; y, X) π_0(β, σ²)

51 The Bayesian Linear Model (cont.)

Prior specification:

π_0(σ²) ≡ InvGamma(α/2, γ/2), that is, by definition, 1/σ² ~ Gamma(α/2, γ/2):

π_0(σ²) = [ (γ/2)^{α/2} / Γ(α/2) ] (1/σ²)^{α/2 + 1} exp{ -γ / (2σ²) }

π_0(β | σ²) ≡ Normal(θ, σ² Ψ), that is, β is conditionally Normally distributed in p dimensions,

where the parameters α, γ, θ, Ψ are fixed, known constants ("hyperparameters").

52 The Bayesian Linear Model (cont.)

In the calculation of π_n(β, σ² | y, X), we have after collecting terms

π_n(β, σ² | y, X) ∝ (1/σ²)^{(n+α+p)/2 + 1} exp{ -γ / (2σ²) } exp{ -Q(β, y, X) / (2σ²) }

where

Q(β, y, X) = (y - Xβ)'(y - Xβ) + (β - θ)' Ψ^{-1} (β - θ).

53 The Bayesian Linear Model (cont.)

By completing the square, we may write

Q(β, y, X) = (β - m)' M^{-1} (β - m) + c

where

M = (X'X + Ψ^{-1})^{-1};
m = (X'X + Ψ^{-1})^{-1} (X'y + Ψ^{-1}θ);
c = y'y + θ'Ψ^{-1}θ - m'M^{-1}m.

54 The Bayesian Linear Model (cont.)

From this we deduce that

π_n(β, σ² | y, X) ∝ (1/σ²)^{(n+α+p)/2 + 1} exp{ -(γ + c) / (2σ²) } exp{ -(β - m)' M^{-1} (β - m) / (2σ²) }

that is,

π_n(β, σ² | y, X) = π_n(β | σ², y, X) π_n(σ² | y, X)

with

π_n(β | σ², y, X) ≡ Normal(m, σ² M);
π_n(σ² | y, X) ≡ InvGamma((n + α)/2, (c + γ)/2).

55 The Bayesian Linear Model (cont.)

The Bayesian posterior mean/modal estimator of β based on this model is

β̂_B = m = (X'X + Ψ^{-1})^{-1} (X'y + Ψ^{-1}θ)
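A minimal R sketch of these posterior computations (not from the notes; the data and hyperparameter values are invented for illustration):

set.seed(5)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% c(2, 1, -1) + rnorm(n))

alpha <- 1; gamma <- 1
theta <- rep(0, p); Psi <- diag(100, p)        # fairly diffuse prior on beta

Psi.inv <- solve(Psi)
M <- solve(t(X) %*% X + Psi.inv)
m <- M %*% (t(X) %*% y + Psi.inv %*% theta)    # posterior mean/mode of beta
c.val <- drop(t(y) %*% y + t(theta) %*% Psi.inv %*% theta - t(m) %*% solve(M) %*% m)
c(shape = (n + alpha)/2, rate = (c.val + gamma)/2)   # InvGamma parameters for sigma^2 | y
cbind(posterior.mean = drop(m), ols = coef(lm(y ~ X - 1)))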

56 Ridge Regression

If, in the Bayesian estimation, we choose

θ = 0,   Ψ^{-1} = λ I_p

we have

β̂_B = m = (X'X + λ I_p)^{-1} X'y.

If λ = 0: β̂_B = β̂, the least squares estimate; as λ → ∞: β̂_B → 0.

Note that to make this specification valid, we should place all the predictors (the columns of X) on the same scale.
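A short R sketch of the ridge estimate over a grid of λ values (illustrative, not from the notes; the data are invented and the predictors are standardized as recommended above):

set.seed(6)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))          # predictors on a common scale
y <- drop(X %*% c(3, 0, -1) + rnorm(n)); y <- y - mean(y)

ridge <- function(X, y, lambda) solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
sapply(c(0, 1, 10, 100, 1000), function(l) ridge(X, y, l))  # estimates shrink towards zero as lambda grows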

57 Ridge Regression (cont.)

Consider the constrained least squares problem

minimize S(β) = sum_{i=1}^{n} (y_i - x_i β)²   subject to   sum_{j=1}^{k} β_j² ≤ t.

We solve this problem after centering the predictors, x_ij → x_ij - x̄_j, and centering the responses, y_i → y_i - ȳ. After this transformation, the intercept can be omitted.

58 Ridge Regression (cont.)

Suppose therefore that there are precisely p β parameters in the model. We solve the constrained minimization using Lagrange multipliers: we minimize S_λ(β), where

S_λ(β) = S(β) + λ( sum_{j=1}^{p} β_j² - t ).

We have that

∂S_λ(β)/∂β = ∂S(β)/∂β + 2λβ,

a p × 1 vector.

59 Ridge Regression (cont.)

By direct calculus, we have

∂S_λ(β)/∂β = -2X'(y - Xβ) + 2λβ

and equating to zero we have

X'y = (X'X + λI)β

so that

β̂_B = (X'X + λI)^{-1} X'y

60 Ridge Regression (cont.)

For statistical properties, we have

E_{Y|X}[β̂_B | X] = (X'X + λI)^{-1} X'X β ≠ β

so the ridge regression estimator is biased. However, the mean squared error (MSE) of β̂_B can be smaller than that of β̂.

61 Ridge Regression (cont.)

If the columns of the matrix X are orthonormal, so that

X'X = I_p,

then

β̂_Bj = β̂_j / (1 + λ),   so |β̂_Bj| < |β̂_j|.

In general the ridge regression estimates are shrunk towards zero compared to the least squares estimates.

62 Ridge Regression (cont.)

Using the singular value decomposition, write X = UDV', where

U is n × p; the columns of U are an orthonormal basis for the column space of X, and U'U = I_p.
V is p × p; the columns of V are an orthonormal basis for the row space of X, and V'V = I_p.
D is diagonal with elements d_1 ≥ d_2 ≥ ... ≥ d_p ≥ 0.

63 Ridge Regression (cont.)

Least squares: the predictions are

Ŷ = Xβ̂ = X(X'X)^{-1}X'y = (UDV')(VDU'UDV')^{-1}VDU'y = UU'y

as V'V = I_p = VV' (to see this, pre-multiply both sides by V), so that

(VDU'UDV')^{-1} = (VD²V')^{-1} = VD^{-2}V'.

64 Ridge Regression (cont.)

Ridge regression: the predictions are

Ŷ = Xβ̂_B = X(X'X + λI_p)^{-1}X'y = UD(D² + λI_p)^{-1}DU'y = sum_{j=1}^{p} ũ_j [ d_j² / (d_j² + λ) ] ũ_j' y

where ũ_j is the jth column of U.

65 Ridge Regression (cont.)

Ridge regression transforms the problem to one involving the orthogonal matrix U instead of X, and then shrinks the coefficients by the factor

d_j² / (d_j² + λ) ≤ 1.

For ridge regression, the hat matrix is

H_λ = X(X'X + λI_p)^{-1}X' = UD(D² + λI_p)^{-1}DU'

and the degrees of freedom of the fit is

Trace(H_λ) = sum_{j=1}^{p} d_j² / (d_j² + λ).
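The effective degrees of freedom are easy to compute from the singular values; a short R sketch (illustrative, not from the notes, with invented data):

set.seed(7)
X <- scale(matrix(rnorm(100 * 5), 100, 5))
d <- svd(X)$d
df.ridge <- function(lambda) sum(d^2 / (d^2 + lambda))
sapply(c(0, 1, 10, 100), df.ridge)   # decreases from p = 5 towards 0 as lambda grows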

66 Other Penalties

The LASSO penalty is

λ sum_{j=1}^{p} |β_j|

and we solve the minimization

min_β (y - Xβ)'(y - Xβ) + λ sum_{j=1}^{p} |β_j|.

No analytical solution is available, but the minimization can be achieved using coordinate descent. In this case, the minimization allows for the optimal value β̂_j to be precisely zero for some j.
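A coordinate descent sketch for this problem (illustrative, not from the notes): with the other coordinates held fixed, each one-dimensional update is a soft-thresholding operation. The data, the λ value, and the fixed number of sweeps below are all assumptions.

soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

lasso.cd <- function(X, y, lambda, nsweep = 200) {
  p <- ncol(X); b <- rep(0, p)
  for (t in 1:nsweep) {
    for (j in 1:p) {
      r <- y - X[, -j, drop = FALSE] %*% b[-j]            # partial residual
      b[j] <- soft(sum(X[, j] * r), lambda / 2) / sum(X[, j]^2)
    }
  }
  b
}

set.seed(8)
X <- scale(matrix(rnorm(100 * 5), 100, 5))
y <- drop(X %*% c(3, 0, 0, -2, 0) + rnorm(100)); y <- y - mean(y)
lasso.cd(X, y, lambda = 50)   # some coefficients are shrunk exactly to zero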

67 Other Penalties (cont.)

A general version of this penalty is

λ sum_{j=1}^{p} |β_j|^q

and if q ≤ 1, there is a possibility of an estimate being shrunk exactly to zero. For q ≥ 1 the problem is convex; for q < 1 the problem is non-convex, and harder to solve. However, if q ≤ 1, the shrinkage to zero allows for variable selection to be carried out automatically.

68 Model Selection using Information Criteria

69 Selection using Information Criteria

Suppose we wish to choose one from a collection of models described by densities f_1, f_2, ..., f_K with parameters θ_1, θ_2, ..., θ_K. Let the true, data-generating model be denoted f_0. We consider the KL divergence between f_0 and f_k:

KL(f_0, f_k) = ∫ log( f_0(x) / f_k(x; θ_k) ) f_0(x) dx

and aim to choose the model using the criterion

k̂ = arg min_k KL(f_0, f_k)

70 Selection using Information Criteria (cont.)

In reality, the θ_k are typically unknown, so we consider estimating them using maximum likelihood procedures. We consider

KL(f_0, f_k) = ∫ log( f_0(x) / f_k(x; θ̂_k) ) f_0(x) dx

where θ̂_k is obtained by maximizing the likelihood under model k, that is, maximizing

sum_{i=1}^{n} log f_k(y_i; θ)

with respect to θ for data y_1, ..., y_n.

71 Selection using Information Criteria (cont.)

We have that

KL(f_0, f_k) = ∫ log( f_0(x) ) f_0(x) dx - ∫ log( f_k(x; θ_k) ) f_0(x) dx

so we then may choose k by

k̂ = arg max_k ∫ log( f_k(x; θ_k) ) f_0(x) dx

or, if the parameters need to be estimated,

k̂ = arg max_k ∫ log( f_k(x; θ̂_k(y)) ) f_0(x) dx.

72 Asymptotic results for estimation under misspecification

73 Maximum likelihood as minimum KL

We may seek to define θ_k directly using the entropy criterion

θ_k = arg min_θ KL(f_0, f_k(θ)) = arg min_θ ∫ log( f_0(x) / f_k(x; θ) ) f_0(x) dx

and solve the problem using calculus by differentiating KL(f_0, f_k(θ)) with respect to θ and equating to zero. Note that under standard regularity conditions

∂/∂θ ∫ log( f_0(x) / f_k(x; θ) ) f_0(x) dx = -∫ ∂/∂θ { log f_k(x; θ) } f_0(x) dx = -E_X[ ∂/∂θ log f_k(X; θ) ]

74 Maximum likelihood as minimum KL (cont.)

Therefore θ_k solves

E_X[ ∂/∂θ log f_k(X; θ) |_{θ = θ_k} ] = 0.

Under identifiability assumptions, we assume that there is a single θ_k which solves this equation. The sample-based equivalent calculation dictates that the estimate θ̂_k satisfies

(1/n) sum_{i=1}^{n} ∂/∂θ log f_k(x_i; θ) |_{θ = θ̂_k} = 0

which coincides with ML estimation.

75 Maximum likelihood as minimum KL (cont.)

Under identifiability assumptions, since

(1/n) sum_{i=1}^{n} ∂/∂θ log f_k(X_i; θ) |_{θ = θ_k} → E_X[ ∂/∂θ log f_k(X; θ) |_{θ = θ_k} ] = 0

by the strong law of large numbers, we must have that θ̂_k → θ_k with probability 1 as n → ∞.

76 Maximum likelihood as minimum KL (cont.)

Let

I(θ_k) = E_X[ -∂² log f_k(X; θ) / ∂θ∂θ' |_{θ = θ_k} ]

be the Fisher information computed with respect to f_k(x; θ_k), and let

Ī(θ_k) = E_X[ -∂² log f_k(X; θ) / ∂θ∂θ' |_{θ = θ_k} ]

be the same expectation quantity computed with respect to f_0(x). The corresponding n-data versions, where log f_k(X; θ) is replaced by sum_{i=1}^{n} log f_k(X_i; θ), are

I_n(θ_k) = n I(θ_k),   Ī_n(θ_k) = n Ī(θ_k).

77 Maximum likelihood as minimum KL (cont.)

The quantity Î_n(θ_k) is the sample-based version

Î_n(θ_k) = -sum_{i=1}^{n} ∂²/∂θ∂θ' log f_k(x_i; θ) |_{θ = θ_k}

This quantity is evaluated at θ_k = θ̂_k to yield Î_n(θ̂_k); we have that

(1/n) Î_n(θ̂_k) → Ī(θ_k)

with probability 1 as n → ∞.

78 Maximum likelihood as minimum KL (cont.)

Another approach that is sometimes used is to use the equivalence between expressions involving the first and second derivative versions of I(θ); we have also that

I(θ_k) = E_X[ { ∂ log f_k(X; θ) / ∂θ }² |_{θ = θ_k} ] = 𝒥(θ_k), say,

with the expectation computed with respect to f_k(x; θ_k), where for a vector U we write U² = UU'. Let

J(θ_k) = E_X[ { ∂ log f_k(X; θ) / ∂θ }² |_{θ = θ_k} ]

be the equivalent calculation with the expectation computed with respect to f_0(x).

79 Maximum likelihood as minimum KL (cont.)

Now, by a second-order Taylor expansion of the first derivative of the log-likelihood: if

l̇_k(θ_k) = ∂ log f_k(x; θ) / ∂θ |_{θ = θ_k},   l̈_k(θ_k) = ∂² log f_k(x; θ) / ∂θ∂θ' |_{θ = θ_k},

we have

l̇_k(θ_k) = l̇_k(θ̂_k) + l̈_k(θ̂_k)(θ_k - θ̂_k) + R_n(θ*) = l̈_k(θ̂_k)(θ_k - θ̂_k) + R_n(θ*)

as l̇_k(θ̂_k) = 0, where ||θ_k - θ*|| < ||θ_k - θ̂_k|| and the remainder term, which involves the third derivative of the log-likelihood at θ*, is denoted R_n(θ*).

80 Maximum likelihood as minimum KL (cont.)

We have by definition that l̈_k(θ̂_k) = -Î_n(θ̂_k), and in the limit

R_n(θ*) / n →p 0,

that is, R_n(θ*) = o_p(n), and

(1/n) Î_n(θ̂_k) →a.s. Ī(θ_k)

as n → ∞. Rewriting the above approximation, we have

√n (θ̂_k - θ_k) = { (1/n) Î_n(θ̂_k) }^{-1} { (1/√n) l̇_k(θ_k) } - { (1/n) Î_n(θ̂_k) }^{-1} { R_n(θ*) / √n }

81 Maximum likelihood as minimum KL (cont.)

By the central limit theorem

√n (θ̂_k - θ_k) →d Normal(0, Σ)

say, where Σ denotes the limiting variance-covariance matrix when sampling is under f_0, and

(1/√n) l̇_k(θ_k) = (1/√n) sum_{i=1}^{n} ∂ log f_k(X_i; θ_k)/∂θ →d Normal(0, J(θ_k)).

Finally,

{ (1/n) Î_n(θ̂_k) }^{-1} { R_n(θ*) / √n } →p 0

by Slutsky's Theorem.

82 Maximum likelihood as minimum KL (cont.)

Therefore, by equating the asymptotic variances of the above quantities, we must have

J(θ_k) = Ī(θ_k) Σ Ī(θ_k)

yielding that

Σ = {Ī(θ_k)}^{-1} J(θ_k) {Ī(θ_k)}^{-1}

83 Model Selection using Minimum KL

84 The Expected KL Divergence

The quantity

∫ log( f_k(x; θ̂_k(Y)) ) f_0(x) dx = E_X[ log f_k(X; θ̂_k(Y)) ]

is a random quantity, a function of the data random quantities Y. Thus we instead decide to choose k by

k̂ = arg max_k E_Y[ E_X[ log f_k(X; θ̂_k(Y)) ] ]

where X and Y are drawn from the true model f_0.

85 The Expected KL Divergence (cont.)

We consider first the expansion of the inner integral around the true value θ_k for an arbitrary y; under regularity conditions

E_X[log f_k(X; θ̂_k(y))] = E_X[log f_k(X; θ_k)] + E_X[l̇_k(θ_k)]'(θ̂_k - θ_k)
                           + (1/2)(θ̂_k - θ_k)' E_X[l̈_k(θ_k)](θ̂_k - θ_k) + o_p(n)

86 The Expected KL Divergence (cont.)

By definition E_X[l̇_k(θ_k)] = 0, and E_X[l̈_k(θ_k)] = -Ī_n(θ_k), and so therefore

E_X[log f_k(X; θ̂_k(y))] = E_X[log f_k(X; θ_k)] - (1/2)(θ̂_k - θ_k)' Ī_n(θ_k)(θ̂_k - θ_k) + o_p(n)

We then must compute E_Y[ E_X[log f_k(X; θ̂_k(Y))] ]. The term E_X[log f_k(X; θ_k)] is a constant with respect to this expectation.

87 The Expected KL Divergence (cont.)

The expectation of the quadratic term can be computed by standard results for large n; under sampling Y from f_0, we have as before that

(θ̂_k - θ_k)' Ī_n(θ_k)(θ̂_k - θ_k)

can be rewritten

{ √n(θ̂_k - θ_k) }' { (1/n) Ī_n(θ_k) } { √n(θ̂_k - θ_k) }

where

√n(θ̂_k - θ_k) →d Normal(0, Σ)   and   (1/n) Ī_n(θ_k) →a.s. Ī(θ_k).

88 The Expected KL Divergence (cont.)

Therefore, by standard results for quadratic forms,

E_Y[ (θ̂_k(Y) - θ_k)' Ī_n(θ_k)(θ̂_k(Y) - θ_k) ] → Trace( Ī(θ_k) Σ )

and

E_Y[ E_X[log f_k(X; θ̂_k(Y))] ] = E_X[log f_k(X; θ_k)] - (1/2) Trace( Ī(θ_k) Σ ) + o_p(n)    (*)

However, the right hand side cannot be computed, as θ_k is not known and must be estimated.

89 The Expected KL Divergence (cont.)

Thus we repeat the same operation, but instead expand E_X[log f_k(X; θ_k)] about θ̂_k(x) for a fixed x. We have

log f_k(x; θ_k) = log f_k(x; θ̂_k(x)) + l̇_k(θ̂_k(x))'(θ_k - θ̂_k(x))
                  + (1/2)(θ̂_k(x) - θ_k)' l̈_k(θ̂_k(x))(θ̂_k(x) - θ_k) + o(n)

of which we need then to take the expectation with respect to X.

90 The Expected KL Divergence (cont.)

Again E_X[ l̇_k(θ̂_k(X)) ] = 0, and writing

(θ̂_k(x) - θ_k)' l̈_k(θ̂_k(x))(θ̂_k(x) - θ_k)

as

{ √n(θ̂_k(x) - θ_k) }' { (1/n) l̈_k(θ̂_k(x)) } { √n(θ̂_k(x) - θ_k) }

we have that the expectation over X of this quadratic form converges to -Trace( Ī(θ_k) Σ ), as by standard theory

√n(θ̂_k(x) - θ_k) →d Normal(0, Σ).

91 The Expected KL Divergence (cont.)

Thus

E_X[log f_k(X; θ_k)] = E_X[log f_k(X; θ̂_k(X))] - (1/2) Trace( Ī(θ_k) Σ ) + o_p(n).

Therefore, using the previous expression (*), we now have

E_Y[ E_X[log f_k(X; θ̂_k(Y))] ] = E_X[log f_k(X; θ̂_k(X))] - Trace( Ī(θ_k) Σ ) + o_p(n)

By the earlier identity, Ī(θ_k) Σ = J(θ_k) {Ī(θ_k)}^{-1}, and the right hand side can be consistently estimated by

log f_k(x; θ̂_k) - Trace( J(θ_k) {Ī(θ_k)}^{-1} )

where the estimated trace must be computed from the available data.

92 The Expected KL Divergence (cont.)

An obvious estimator of the trace is

Trace( Ĵ(θ̂_k(x)) { Î(θ̂_k(x)) }^{-1} )

although this might potentially be improved upon. In any case, if the approximating model f_k is close in KL terms to the true model f_0, we would expect that

Trace( J(θ_k) {Ī(θ_k)}^{-1} ) ≈ dim(θ_k)

as under regularity assumptions we would then have J(θ_k) ≈ Ī(θ_k).

93 The Expected KL Divergence (cont.)

This yields the criterion: choose k to maximize

log f_k(x; θ̂_k) - dim(θ_k)

or equivalently to minimize

-2 log f_k(x; θ̂_k) + 2 dim(θ_k)

This is Akaike's Information Criterion (AIC).

Note: the required regularity conditions on the model f_k are not too restrictive. However, the approximation Trace( J(θ_k) {Ī(θ_k)}^{-1} ) ≈ dim(θ_k) is potentially poor.
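For models fitted by lm this criterion is available directly; a brief illustration on the motorcycle data (the two candidate models compared here are arbitrary choices, not from the notes):

library(MASS)
fit1 <- lm(accel ~ times, data = mcycle)
fit2 <- lm(accel ~ poly(times, 5), data = mcycle)
AIC(fit1, fit2)   # AIC = -2 log-likelihood + 2 dim(theta); the smaller value is preferred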

94 The Bayesian Information Criterion

The Bayesian Information Criterion (BIC) uses an approximation to the marginal likelihood function to justify model selection. For data y = (y_1, ..., y_n), we have the posterior distribution for model k as

π_n(θ_k; y) = L_k(θ_k; y) π_0(θ_k) / ∫ L_k(t; y) π_0(t) dt

where L_k(θ; y) is the likelihood function. The denominator,

f_k(y) = ∫ L_k(t; y) π_0(t) dt,

is the marginal likelihood.

95 The Bayesian Information Criterion (cont.)

Let

l_k(θ; y) = log L_k(θ; y) = sum_{i=1}^{n} log f_k(y_i; θ)

denote the log-likelihood for model k. By a Taylor expansion of l_k(θ; y) around the ML estimate θ̂_k, we have

l_k(θ; y) = l_k(θ̂_k; y) + (θ - θ̂_k)' l̇_k(θ̂_k; y) + (1/2)(θ - θ̂_k)' l̈_k(θ̂_k; y)(θ - θ̂_k) + o(1)

We have by definition that l̇_k(θ̂_k) = 0, and we may write

l̈_k(θ̂_k; y) = -{ V_n(θ̂_k) }^{-1}

where V_n(θ̂_k) is derived from the Hessian matrix of the log-likelihood based on the n data points.

96 The Bayesian Information Criterion (cont.)

In ML theory, V_n(θ̂_k) estimates the variance of the ML estimator derived from n data points. For a random sample, we may write

V_n(θ̂_k) = (1/n) V_1(θ̂_k)

say, where

V_1(θ) = n { -sum_{i=1}^{n} ∂²/∂θ∂θ' log f_k(y_i; θ) }^{-1}

is a square symmetric matrix of dimension p_k = dim(θ_k), recording an estimate of the variance of the ML estimator for a single data point (n = 1).

97 The Bayesian Information Criterion (cont.)

Then, exponentiating, we have that

L_k(θ; y) ≈ L_k(θ̂_k; y) exp{ -(1/2)(θ - θ̂_k)' { V_n(θ̂_k) }^{-1} (θ - θ̂_k) }

This is a standard quadratic approximation to the likelihood around the ML estimate.

98 The Bayesian Information Criterion (cont.)

Suppose that the prior π_0(θ_k) is constant (equal to c, say) in the neighbourhood of θ̂_k. Then, for large n,

f_k(y) = ∫ L_k(t; y) π_0(t) dt
       ≈ c L_k(θ̂; y) ∫ exp{ -(1/2)(t - θ̂_k)' { V_n(θ̂_k) }^{-1} (t - θ̂_k) } dt
       = c L_k(θ̂; y) (2π)^{p_k/2} |V_n(θ̂_k)|^{1/2}

as the integrand is proportional to a Normal(θ̂_k, V_n(θ̂_k)) density.

99 The Bayesian Information Criterion (cont.)

But

|V_n(θ̂_k)|^{1/2} = |(1/n) V_1(θ̂_k)|^{1/2} = n^{-p_k/2} |V_1(θ̂_k)|^{1/2}.

Therefore the marginal likelihood becomes

f_k(y) ≈ c L_k(θ̂; y) (2π)^{p_k/2} n^{-p_k/2} |V_1(θ̂_k)|^{1/2}

or, on the -2 log scale, we have

-2 log f_k(y) ≈ -2 l_k(θ̂; y) + p_k log n + constant

where constant = -p_k log(2π) - log|V_1(θ̂_k)| - 2 log c.

100 The Bayesian Information Criterion (cont.)

The constant term is O(1), and is hence negligible relative to the terms that grow with n, so the BIC is defined as

BIC = -2 l_k(θ̂; y) + p_k log n.
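The same comparison as above under the BIC penalty (illustrative, not from the notes; the candidate models are the same arbitrary choices as before):

library(MASS)
fit1 <- lm(accel ~ times, data = mcycle)
fit2 <- lm(accel ~ poly(times, 5), data = mcycle)
BIC(fit1, fit2)                          # -2 log-likelihood + p_k log n
AIC(fit1, fit2, k = log(nrow(mcycle)))   # the same values via the AIC function with k = log n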
