Math 533 Extra Hour Material
|
|
- Branden Price
- 6 years ago
- Views:
Transcription
1 Math 533 Extra Hour Material
2 A Justification for Regression
3 The Justification for Regression It is well-known that if we want to predict a random quantity Y using some quantity m according to a mean-squared error MSE, then the optimal predictor is the expected value of Y, µ; E[(Y m) 2 ] = E[(Y µ + µ m) 2 ] = E[(Y µ) 2 ] + E[(µ m) 2 ] + 2E[(Y µ)(µ m)] = E[(Y µ) 2 ] + (µ m) which is minimized when m = µ. 1
4 The Justification for Regression (cont.) Now suppose we have a joint distribution between Y and X, and wish to predict the Y as a function of X, m(x) say. Using the same MSE criterion, if we write µ(x) = E Y X [Y X = x] to represent the conditional expectation of Y given X = x, we have E X,Y [(Y m(x)) 2 ] = E X,Y [(Y µ(x) + µ(x) m(x)) 2 ] = E X,Y [(Y µ(x)) 2 ] + E X,Y [(µ(x) m(x)) 2 ] + 2E X,Y [(Y µ(x))(µ(x) m(x))] = E X,Y [(Y µ(x)) 2 ] + E X [(µ(x) m(x)) 2 ] + 0 2
5 The Justification for Regression (cont.) The cross term equates to zero by noting that by iterated expectation E X,Y [(Y µ(x))(µ(x) m(x))] = E X [E Y X [(Y µ(x))(µ(x) m(x))]] and for the internal expectation E Y X [(Y µ(x))(µ(x) m(x)) X] = (µ(x) m(x))e Y X [(Y µ(x)) X] and E Y X [(Y µ(x)) X] = 0 a.s. 3
6 The Justification for Regression (cont.) This the MSE is minimized over functions m(.) when E X [(µ(x) m(x)) 2 ] is minimized, but this term can be made zero by setting m(x) = µ(x) = E Y X [Y X = x]. Thus the MSE-optimal prediction is made by using µ(x). Note: Here X can be a single variable, or a vector; it can be random or non-random the result holds. 4
7 The Justification for Linear Modelling Suppose that the true conditional mean function is represented by µ(x) where x is a single predictor. We have by Taylor expansion around x = 0 that p 1 µ(x) = µ(0) + j=1 µ (j) (0) x j + O(x p ) j! where the remainder term O(x p ) represents terms of x p in magnitude or higher order terms, and µ (j) (0) = dj µ(x) dx j x=0 5
8 The Justification for Linear Modelling (cont.) The derivatives of µ(x) at x = 0 may be treated as unspecified constants, in which case a reasonable approximating model takes the form p 1 β 0 + β j x j j=1 where β j µ (j) (0) for j = 0, 1,..., p 1. Similar expansions hold if x is vector valued. 6
9 The Justification for Linear Modelling (cont.) Finally, if Y and X are jointly normally distributed, then the conditional expectation of Y given X = x is linear in x. see Multivariate Normal Distribution handout. 7
10 The General Linear Model
11 The General Linear Model The linear model formulation that assumes E Y X [Y X] = Xβ Var Y X [Y X] = σ 2 I n is actually quite a general formulation as the rows x i of X can be formed by using general transforms of the originally recorded predictors. multiple regression: x i = [1 x i1 x i2 x ik ] polynomial regression: x i = [1 x i1 x 2 i1 xk i1 ] 8
12 The General Linear Model (cont.) harmonic regression: consider single continuous predictor x measured on a bounded interval. Let ω j = j/n, j = 0, 1,..., n/2 = K, and then set E Yi X[Y i x i ] = β 0 + J β j1 cos(2πω j x i ) + j=1 J β j2 sin(2πω j x i ) j=1 If J = K, we then essentially have an n n matrix X specifying a linear transform of y in terms of the derived predictors the coefficients (cos(2πω j x i ), sin(2πω j x i )). (β 0, β 11, β 12,..., β K ) form the discrete Fourier transform of y 9
13 The General Linear Model (cont.) In this case, the columns of X are orthogonal, and X X = diag(n, n/2, n/2,..., n/2, n) that is, X X is a diagonal matrix. 10
14 The General Linear Model (cont.) basis functions: truncated spline basis: for x R, let x i1 = { (x η1 ) α x > η 1 0 x η 1 = (x η 1 ) α + for some fixed η 1, and α R. More generally, E Yi X[Y i x i ] = β 0 + for fixed η 1 < η 2 < < η J. J β j (x i η j ) α + j=1 11
15 The General Linear Model (cont.) piecewise constant: : for x R, let { 1 x A1 x i1 = = 1 0 x / A A1 (x) 1 for some set A 1. More generally, E Yi X[Y i x i ] = β 0 + J β j 1 Aj (x) for sets A 1,..., A J. If we want to use a partition of R, we may write this J E Yi X[Y i x i ] = β j 1 Aj (x). where j=0 J A 0 = j=1 A j j=1 12
16 The General Linear Model (cont.) piecewise linear: specify E Yi X[Y i x i ] = J 1 Aj (x)(β j0 + β j1 x i ) j=0 piecewise linear & continuous: specify J E Yi X[Y i x i ] = β 0 + β 1 x + β j1 (x i η j ) + j=1 for fixed η 1 < η 2 < < η J. higher order piecewise functions (quadratic, cubic etc.) splines 13
17 Example: Motorcycle data 1 library(mass) 2 #Motorcycle data 3 plot(mcycle,pch=19,main= Motorcycle accident data ) 4 5 x<-mcycle$times 6 y<-mcycle$accel 14
18 Raw data Motorcycle accident data times accel 15
19 Example: Piecewise constant fit 1 #Knots 2 K<-11 3 kappa<-as.numeric(quantile(x,probs=c(0:k)/k)) 4 5 X<-(outer(x,kappa, - )>0)^2 6 X<-X[,-12] 7 8 fit.pwc<-lm(y X) 9 summary(fit.pwc) 10 newx<-seq(0,max(x),length=1001) newx<-(outer(newx,kappa, - )>0)^2 13 newx<-cbind(rep(1,1001),newx[,-12]) yhatc<-newx %*% coef(fit.pwc) 16 lines(newx,yhatc,col= blue,lwd=2) 16
20 Fit: Piecewise constant Motorcycle accident data times accel 17
21 Example: Piecewise linear fit 17 X1<-(outer(x,kappa, - )>0)^2 18 X1<-X1[,-12] X2<-(outer(x,kappa, - )>0)*outer(x,kappa, - ) 21 X2<-X2[,-12] X<-cbind(X1,X2) fit.pwl<-lm(y X) 26 summary(fit.pwl) 27 newx<-seq(0,max(x),length=1001) newx1<-(outer(newx,kappa, - )>0)^2 30 newx2<-(outer(newx,kappa, - )>0)*outer(newx,kappa, - ) newx<-cbind(rep(1,1001),newx1[,-12],newx2[,-12]) yhatl<-newx %*% coef(fit.pwl) 35 lines(newx,yhatl,col= red,lwd=2) 18
22 Fit:... + piecewise linear Motorcycle accident data times accel 19
23 Example: Piecewise linear fit 36 X<-(outer(x,kappa, - )>0)*outer(x,kappa, - ) 37 X<-X[,-12] fit.pwcl<-lm(y X) 40 summary(fit.pwcl) 41 newx<-seq(0,max(x),length=1001) newx<-(outer(newx,kappa, - )>0)*outer(newx,kappa, - ) newx<-cbind(rep(1,1001),newx[,-12]) yhatcl<-newx %*% coef(fit.pwcl) 48 lines(newx,yhatcl,col= green,lwd=2) 20
24 Fit:... + piecewise continuous linear Motorcycle accident data times accel 21
25 Fit:... + piecewise continuous quadratic Motorcycle accident data times accel 22
26 Fit:... + piecewise continuous cubic Motorcycle accident data times accel 23
27 Fit: B-spline Motorcycle accident data times accel 24
28 Fit: Harmonic regression K = Motorcycle accident data times accel 25
29 Fit: Harmonic regression K = Motorcycle accident data times accel 26
30 Fit: Harmonic regression K = Motorcycle accident data times accel 27
31 Fit: Harmonic regression K = Motorcycle accident data times accel 28
32 Reparameterization
33 Reparameterizing the model For any model E Y X [Y X] = Xβ we might consider a reparameterization of the model by writing x new i = x i A 1 for some p p non-singular matrix A. Then E Yi X[Y i x i ] = x i β = (x new i A)β = x new i β (new) where β (new) = Aβ 29
34 Reparameterizing the model (cont.) Then X new = XA 1 X = X new A and we may choose A such that {X new } {X new } = I n to give an orthogonal (actually, orthonormal) parameterization. Recall that if the design matrix X new is orthonormal, we have for the OLS estimate β (new) = {X new } y. Note however that the new predictors and their coefficients may not be readily interpretable, so it may be better to reparameterize back to β by defining 1 β = A β(new) 30
35 Coordinate methods for inversion
36 Coordinate methods for inversion To find the ordinary least squares estimates, we solve the normal equations to obtain β = (X X) 1 X y which requires us to invert the p p matrix X X. This is the minimum norm solution to β = arg min y Xb 2 = arg min b b n (y i x i b) 2 i=1 31
37 Coordinate methods for inversion (cont.) We may solve this problem using coordinate descent rather than direct inversion; if b 2,..., b p are fixed, then the minimization problem for b 1 becomes b1 = arg min S(b 1 b 2,..., b p ) = arg min b 1 b 1 n (y i b 1 x i1 c i1 ) 2 i=1 where p c i1 = b j x ij. j=2 32
38 Coordinate methods for inversion (cont.) Writing y i = y i c i1, and the sum of squares as n (y i b 1 x i1 ) 2, i=1 we have that b1 = n x i1 y i. n i=1 x 2 i1 i=1 33
39 Coordinate methods for inversion (cont.) We may solve recursively in turn for each b j, that is, after initialization and at step t, update by minimizing b(t) j b (t+1) j min S(b j b (t+1) 1, b (t+1) 2,..., b (t+1) j 1, b (t) j+1..., b(t) bj which does not require any matrix inversion. p ) 34
40 Distributional Results
41 Some distributional results Distributional results using the Normal distribution are key to many inference procedures for the linear model. Suppose that are independent. X 1,..., X n Normal(µ, σ 2 ) 1. Z i = (X i µ)/σ Normal(0, 1); 2. Y i = Z 2 i χ 2 1 ; 3. U = n i=1 Z2 i χ 2 n; 4. If U 1 χ 2 n 1 and U 2 χ 2 n 2 are independent, then V = U 1/n 1 U 2 /n 2 Fisher(n 1, n 2 ) and 1 V Fisher(n 2, n 1 ) 35
42 Some distributional results (cont.) 5. If Z Normal(0, 1) and U χ 2 ν, then T = Z U/ν Student(ν) 6. If where Y = AX + b X = (X1,..., X n ) Normal(0, σ 2 I n ); A is n n; b is n 1; then say, and Y Normal(b, σ 2 AA ) Normal(µ, Σ) (Y µ) Σ 1 (Y µ) χ 2 n. 36
43 Some distributional results (cont.) 7. Non-central Chi-squared distribution: If X Normal(µ, σ 2 ), we find the distribution of X 2 /σ 2. Y = X σ Normal(µ/σ, 1) By standard transformation results, if Q = Y 2, then f Q (y) = 1 y [φ( y µ/σ) + φ( y µ/σ)] This is the density of the non-central chi-squared distribution with 1 degree of freedom and non-centrality parameter λ = (µ/σ) 2, written χ 2 1 (λ). 37
44 Some distributional results (cont.) The non-central chi-squared distribution has many similar properties to the standard (central) chi-squared distribution. For example if X 1,..., X n are independent, with X i χ 2 1 (λ i), then ( n n ) X i χ 2 n λ i. i=1 The non-central chi-squared distribution plays a role in testing for the linear regression model as it characterizes the distribution of various sums of squares terms. i=1 38
45 Some distributional results (cont.) 8. Quadratic forms: If A is a square symmetric idempotent matrix, and Z = (Z 1,..., Z n ) Normal(0, I n ), then where ν = Trace(A). Z AZ χ 2 ν To see this, use the singular value decomposition A = UDU where U is an orthogonal matrix with U U = I n, and D is a diagonal matrix. Then Z AZ = Z UDU Z = (Z U)D(U Z) = {Z } D{Z } say. But U Z Normal(0, U U) = Normal(0, I n ) 39
46 Some distributional results (cont.) Therefore Z AZ = {Z } D{Z }. But A is idempotent, so AA = A that is, UDU UDU = UDU. The left hand side simplifies, and we have UD 2 U = UDU. 40
47 Some distributional results (cont.) Thus, pre-multiplying by U, and post-multiplying by U, we have D 2 = D and the diagonal elements of D must either be zero or one, so Z AZ = {Z } D{Z } χ 2 ν where ν = Trace(D) = Trace(A). 41
48 as if A 1 and A 2 are orthogonal, then U 2 U 1 = Some distributional results (cont.) 9. If A 1 and A 2 are square, symmetric and orthogonal, and then A 1 A 2 = 0 Z A 1 Z and Z A 2 Z are independent. This result again uses the singular value decomposition; let We have that V 1 = D 1 U 1 Z V 2 = D 2 U 2 Z. Cov V1,V 2 [V 1, V 2 ] = E V1,V 2 [V 2 V 1 ] = E Z [D 2 U 2 ZZ U 1 D 1 ] = D 2 U 2 U 1 D 1 = 0
49 Bayesian Regression and Penalized Least Squares
50 The Bayesian Linear Model Under a Normality assumption Y X, β, σ 2 Normal(Xβ, σ 2 I n ) which defines the likelihood L(β, σ 2 ; y, X), we may perform Bayesian inference by specifying a joint prior distribution π 0 (β, σ 2 ) = π 0 (β σ 2 )π 0 (σ 2 ) and computing the posterior distribution π n (β, σ 2 y, X) L(β, σ 2 ; y, X)π 0 (β, σ 2 ) 43
51 The Bayesian Linear Model (cont.) Prior specification: π 0 (σ 2 ) InvGamma(α/2, γ/2), that is, by definition, 1/σ 2 Gamma(α/2, γ/2) π 0 (σ 2 ) = (γ/2)α/2) Γ(α/2) ( ) 1 α/2 1 { σ 2 exp γ } 2σ 2 π 0 (β σ 2 ) Normal(θ, σ 2 Ψ), that is, β is conditionally Normally distributed in p dimensions, where parameters α, γ, θ, Ψ are fixed, known constants ( hyperparameters ) 44
52 The Bayesian Linear Model (cont.) In the calculation of π n (β, σ 2 y, X), we have after collecting terms π n (β, σ 2 y, X) where ( ) (n+α+p)/2 1 1 { σ 2 exp γ } { } Q(β, y, X) 2σ 2 exp 2σ 2 Q(β, y, X) = (y Xβ) (y Xβ) + (β θ) Ψ 1 (β θ). 45
53 The Bayesian Linear Model (cont.) By completing the square, may write Q(β, y, X) = (β m) M 1 (β m) + c where M = (X X + Ψ 1 ) 1 ; m = (X X + Ψ 1 ) 1 (X y + Ψ 1 θ); c = y y + θ Ψ 1 θ m M 1 m 46
54 The Bayesian Linear Model (cont.) From this we deduce that π n (β, σ 2 y, X) ( ) 1 (n+α+p)/2 1 { σ 2 exp exp } (γ + c) 2σ 2 { 1 2σ 2 (β m) M 1 (β m) } that is π n (β, σ 2 y, X) π n (β σ 2, y, X)π n (σ 2 y, X) π n (β σ 2, y, X) Normal(m, σ 2 M); π n (σ 2 y, X) InvGamma((n + α)/2, (c + γ)/2). 47
55 The Bayesian Linear Model (cont.) The Bayesian posterior mean/modal estimator of β based on this model is β B = m = (X X + Ψ 1 ) 1 (X y + Ψ 1 θ) 48
56 Ridge Regression If, in the Bayesian estimation, we choose θ = 0 Ψ 1 = λi p we have β B = m = (X X + λi p ) 1 X y. If λ = 0: β B = β, the least squares estimate. λ = : β B = 0. Note that to make this specification valid, we should place all the predictors (the columns of X) on the same scale. 49
57 Ridge Regression (cont.) Consider the constrained least squares problem minimize S(β) = n (y i x i β) 2 subject to k β 2 j t i=1 j=1 We solve this problem after centering the predictors x ij x ij x j and centering the responses y i y i y. After this transformation, the intercept can be omitted. 50
58 Ridge Regression (cont.) Suppose that therefore there are precisely p β parameters in the model. We solve the constrained minimization using Lagrange multipliers: we minimize S λ (β) p S λ (β) = S(β) + λ βj 2 t j=1 We have that a p 1 vector. S λ (β) β = S(β) β + 2λβ 51
59 Ridge Regression (cont.) By direct calculus, we have S λ (β) β = 2X (y Xβ) + 2λβ and equating to zero we have X y = (X X + λi)β so that β B = (X X + λi) 1 X y 52
60 Ridge Regression (cont.) For statistical properties, we have E Y X [ β B X] = (X X + λi) 1 X Xβ β so the ridge regression estimator is biased. However the mean squared error (MSE) of β B can be smaller than that of β. 53
61 Ridge Regression (cont.) If the columns of matrix X is orthogonal, so that then X X = I p β Bj = λ β j < β j. In general the ridge regression estimates are shrunk towards zero compared to the least squares estimates. 54
62 Ridge Regression (cont.) Using the singular value decomposition, write where X = UDV U is n p, columns of U are an orthonormal basis for the column space of X, and U U = I p. V is p p, columns of V are an orthonormal basis for the row space of X; V V = I p. D is diagonal with elements d 1 d 2 d p 0. 55
63 Ridge Regression (cont.) Least squares: Predictions are Ŷ = X β = X(X X) 1 X y = (UDV )(VDU UDV ) 1 VDU y = UU y. as V V = I p = VV = I p (to see this pre-multiply both sides by V ) so that (VDU UDV ) 1 (VD 2 V ) 1 = VD 2 V 56
64 Ridge Regression (cont.) Ridge regression: Predictions are Ŷ = X β B = X(X X + λi p ) 1 X y = UD(D 2 + λi p ) 1 DU y = p ( ) d 2 j ũ j dj 2 ũ j y + λ j=1 where ũ j is the jth column of U. 57
65 Ridge Regression (cont.) Ridge regression transforms the problem to one involving the orthogonal matrix U instead of X, and the shrinks the coefficients by d 2 j d 2 j + λ 1. For ridge regression, the hat matrix is H λ = X(X X + λi p ) 1 X = UD(D 2 + λi p ) 1 DU and the degrees of freedom of the fit is Trace(H λ ) = p j=1 d 2 j d 2 j + λ 58
66 Other Penalties The LASSO penalty is and we solve the minimization p λ β j j=1 min(y Xβ) (y Xβ) + λ β j. β No analytical solution is available, but the minimization can be achieved using coordinate descent. In this case, the minimization allows for the optimal value β j to be precisely zero for some j. p j=1 59
67 Other Penalties (cont.) A general version of this penalty is p λ β j q and if q 1, there is a possibility of an estimate being shrunk exactly to zero. For q 1, the problem is convex; for q < 1 the problem is non-convex, and harder to solve. However, if q 1, the shrinkage to zero allows for variable selection to be carried out automatically. j=1 60
68 Model Selection using Information Criteria
69 Selection using Information Criteria Suppose we wish to choose one from a collection of models described by densities f 1, f 2,..., f K with parameters θ 1, θ 2,..., θ K. Let the true, data generating model be denoted f 0. We consider the KL divergence between f 0 and f k : KL(f 0, f k ) = ( ) f0 (x) log f 0 (x)dx f k (x; θ k ) and aim to choose the model using the criterion k = arg min k KL(f 0, f k ) 61
70 Selection using Information Criteria (cont.) In reality, θ k are typically unknown, so we consider estimating them using maximum likelihood procedures. We consider KL(f 0, f k ) = log ( f 0 (x) f k (x; θ k ) ) f 0 (x)dx where θ k is obtained by maximizing the likelihood under model k, that is maximizing n log f k (y i ; θ) i=1 with respect to θ for data y 1,..., y n. 62
71 Selection using Information Criteria (cont.) We have that KL(f 0, f k ) = log f 0 (x)f 0 (x)dx log f k (x; θ k )f 0 (x)dx so we then may choose k by k = arg max k log f k (x; θ k )f 0 (x)dx or if the parameters need to be estimated k = arg max log f k (x; θ k (y))f 0 (x)dx. k 63
72 Asymptotic results for estimation under misspecification
73 Maximum likelihood as minimum KL We may seek to define θ k directly using the entropy criterion θ k = arg min θ KL(f 0, f k (θ)) = arg min θ ( ) f0 (x) log f 0 (x)dx f k (x; θ) and solve the problem using calculus by differentiating KL(f 0, f k (θ)) with respect to θ and equating to zero. Note that under standard regularity conditions { ( ) } f0 (x) log f 0 (x)dx = { θ f k (x; θ) θ } log f k (x; θ)f 0 (x)dx { } = θ log f k(x; θ) f 0 (x)dx [ ] = E X θ log f k(x; θ) 64
74 Maximum likelihood as minimum KL (cont.) Therefore θ k solves E X [ θ log f k(x; θ) θ=θk ] = 0. Under identifiability assumptions, we assume that is that there is a single θ k which solves this equation. The sample-based equivalent calculation dictates that for the estimate θ k 1 n n θ log f k(x i ; θ) θ=θk = 0 i=1 which coincides with ML estimation. 65
75 Maximum likelihood as minimum KL (cont.) Under identifiability assumptions, as 1 n n i=1 θ log f k(x i ; θ) θ=θk 0, by the strong law of large numbers we must have that with probability 1 as n. θ k θ k 66
76 Maximum likelihood as minimum KL (cont.) Let I(θ k ) = E X [ 2 log f k (X; θ) θ θ ] ; θ k θ=θk be the Fisher information computed wrt f k (x; θ k ), and [ ] I(θ k ) = E X 2 log f k (X; θ) θ θ θ=θk be the same expectation quantity computed wrt f 0 (x). The corresponding n data versions, where log f k (X; θ) is replaced by are n log f k (X i ; θ) i=1 I n (θ k ) = ni(θ k ) I n (θ k ) = ni(θ k ). 67
77 Maximum likelihood as minimum KL (cont.) The quantity Î n (θ k ) is the sample based version Î n (θ k ) = { n i=1 } 2 θ θ log f k(x i ; θ) θ=θ k This quantity is evaluated at θ k = θ k to yield Î n ( θ k ); we have that with probability 1 as n. Î n ( θ k ) I n (θ k ) 68
78 Maximum likelihood as minimum KL (cont.) Another approach that is sometimes used is to uses the equivalence between expressions involving the first and second derivative versions of I(θ); we have also that I(θ k ) = E X [ log f k (X; θ) θ ] 2 ; θ k = J (θ k ) θ=θ k with the expectation computed wrt f k (x; θ k ), where for vector U Let J(θ k ) = E X [ U 2 = UU. log f k (X; θ) θ 2 θ=θ k be the equivalent calculation with the expectation computed wrt f 0 (x). ] 69
79 Maximum likelihood as minimum KL (cont.) Now by a second order Taylor expansion of the first derivative of the loglikelihood, if l k (θ k ) = log f k(x; θ) θ at θ = θ k, we have lk (θ k ) = 2 log f k (x; θ) θ=θk θ θ θ=θk l k (θ k ) = l k ( θ k ) + l k ( θ k )(θ k θ k ) (θ k θ k )... l n(θ )(θ k θ k ) = l k ( θ k )(θ k θ k ) + R n (θ ) as l k ( θ k ) = 0, θ k θ < θ k θ k, and where the remainder term is being denoted R n (θ ). 70
80 Maximum likelihood as minimum KL (cont.) We have by definition that and in the limit that is, R n (θ ) = o p (n), and as n. l k ( θ k ) = Î n ( θ k ). R n (θ ) n p 0 1 n În( θ k ) a.s. I(θ k ) Rewriting the above approximation, we have { } 1 1 { } { } { n( θk θ k ) = n În( θ k ) n lk (θ k ) + n În( θ Rn (θ } ) k ) n 71
81 Maximum likelihood as minimum KL (cont.) By the central limit theorem n( θk θ k ) d Normal(0, Σ) say, where Σ denotes the limiting variance-covariance matrix when sampling is under f 0, and 1 lk (θ k ) = 1 n n Finally, n i=1 by Slutsky s Theorem. { 1 n În( θ k ) d log f θ k (X i ; θ k ) Normal(0, J(θ k )). k } 1 { Rn (θ ) n } p 0 72
82 Maximum likelihood as minimum KL (cont.) Therefore, by equating the asymptotic variances of the above quantities, we must have J(θ k ) = I(θ k )ΣI(θ k ) yielding that Σ = {I(θ k )} 1 J(θ k ){I(θ k )} 1 73
83 Model Selection using Minimum KL
84 The Expected KL Divergence The quantity log f k (x; θ k (Y ))f 0 (x)dx = E X [log f k (X; θ k (Y ))] is a random quantity, a function of data random quantities Y. Thus we instead decide to choose k by k = arg max E Y [E X [log f k (X; θ k (Y ))]] k where X and Y are drawn from the true model f 0. 74
85 The Expected KL Divergence (cont.) We consider first the expansion of the inner integral around the true value θ k for an arbitrary y; under regularity conditions E X [log f k (X; θ k (y))] = E X [log f k (X; θ k )] + E X [ lk (θ k )] ( θk θ k ) + 1 ] 2 ( θ k θ k ) E X [ lk (θ k ) ( θ k θ k ) + o p (n) 75
86 The Expected KL Divergence (cont.) [ ] By definition E X lk (θ k ) = 0, and so therefore ] E X [ lk (θ k ) = I n (θ k ) E X [log f k (X; θ k (y))] = E X [log f k (X; θ k )] 1 2 ( θ k θ k ) I n (θ k )( θ k θ k )+o p (n) We then must compute E Y [E X [log f k (X; θ k (Y ))]] The term E X [log f k (X; θ k )] is a constant wrt this expectation. 76
87 The Expected KL Divergence (cont.) The expectation of the quadratic term can be computed by standard results for large n; under sampling Y from f 0, we have as before that ( θ k θ k ) I n (θ k )( θ k θ k ) can be rewritten { { } 1 n( θ k θ k )} n I n(θ k ) { n( θ k θ k )} where and n( θk θ k ) d Normal(0, Σ) 1 n I n(θ k ) a.s. I(θ k ) 77
88 The Expected KL Divergence (cont.) Therefore, by standard results for quadratic forms and ] E Y [( θ k (Y ) θ k ) a.s. I n (θ k )( θ k (Y ) θ k ) Trace (I(θ k )Σ) E Y [E X [log f k (X; θ k (Y ))]] = E X [log f k (X; θ k )] 1 2 Trace (I(θ k)σ)+o p (n) However, the right hand side cannot be computed, as θ k is not known and must be estimated. (*) 78
89 The Expected KL Divergence (cont.) Thus we repeat the same operation, but instead expand E X [log f k (X; θ k )] about θ k (x) for a fixed x. We have log f k (x; θ k ) = log f k (x; θ k (x)) + l k ( θ k (x)) (θ k θ k (x)) ( θ k (x) θ k ) lk ( θ k (x))( θ k (x) θ k ) + o(n) of which we need then to take the expectation wrt X. 79
90 The Expected KL Divergence (cont.) [ ] Again E X lk ( θ k (X)) = 0, and writing ( θ k (x) θ k ) lk ( θ k (x))( θ k (x) θ k ) as { n( θ k (x) θ k )} { 1 } n l k ( θ k (x)) { n( θ k (x) θ k )} we have that the expectation over X of this quadratic form converges to Trace (I(θ k )Σ) as by standard theory n( θk (x) θ k ) d Normal(0, Σ). 80
91 The Expected KL Divergence (cont.) Thus E X [log f k (X; θ k )] = E X [log f k (X; θ k (X))] 1 2 Trace (I(θ k)σ) + o p (n). Therefore, using the previous expression (*) we now have E Y [E X [log f k (X; θ k )]] = E X [log f k (X; θ k (X))] Trace (I(θ k )Σ)+o p (n) By the earlier identity I(θ k )Σ = J(θ k ) {I(θ k )} 1 and the right hand side can be consistently estimated by log f k (x; θ k ) Trace (J(θ k ) {I(θ k )} 1) where the estimated trace must be computed from the available data. 81
92 The Expected KL Divergence (cont.) On obvious estimator of the trace is ( } ) 1 Trace Ĵ( θ k (x)) {Î( θk (x)) although this might potentially be improved upon. In any case, if the approximating model f k is close in KL terms to the true model f 0, we would expect that ( Trace J(θ k ) {I(θ k )} 1) dim(θ k ) as we would have under regularity assumptions J(θ k ) I(θ k ) 82
93 The Expected KL Divergence (cont.) This yields the criterion: choose k to maximize or equivalently to minimize log f k (x; θ k ) dim(θ k ) 2 log f k (x; θ k ) + 2dim(θ k ) This is Akaike s Information Criterion ( AIC). Note: The required regularity conditions on the f k model are not too restrictive. However, the approximation ( Trace J(θ k ) {I(θ k )} 1) dim(θ k ) is potentially poor. 83
94 The Bayesian Information Criterion The Bayesian Information Criterion (BIC) uses an approximation to the marginal likelihood function to justify model selection. For data y = (y 1,..., y n ), we have the posterior distribution for model k as π n (θ k ; y) = L k(θ k ; y)π 0 (θ k ) Lk (t; y)π 0 (t) dt. where L k (θ; y) is the likelihood function. The denominator is f k (y) = L k (t; y)π 0 (t) dt that is, the marginal likelihood. 84
95 The Bayesian Information Criterion (cont.) Let l k (θ; y) = log L k (θ; y) = n log f k (y i ; θ) denote the log-likelihood for model k. By a Taylor expansion of l k (θ; y) around ML estimate θ k, we have i=1 l k (θ; y) = l k ( θ k ; y) + (θ θ k ) l( θk ; y) (θ θ k ) l( θk ; y)(θ θ k ) + o(1) We have by definition that l( θ k ) = 0, and we may write l( θ k ; y) = { V n ( θ k )} 1. V n ( θ k ) is the Hessian matrix derived from n data points. 85
96 The Bayesian Information Criterion (cont.) In ML theory, V n ( θ k ) estimates the variance of the ML estimator derived from n data. We may denote that, for a random sample, that say, where V 1 (θ) = n V n ( θ k ) = 1 n V 1( θ k ) { n i=1 } 1 2 θ θ log f k(y i ; θ) is a square symmetric matrix of dimension p k = dim(θ k ) recording an estimate of the the variance of the ML estimator for a single data point n = 1. 86
97 The Bayesian Information Criterion (cont.) Then, exponentiating, we have that { L k (θ; y) L k ( θ k ; y) exp 1 2 (θ θ { } 1 k ) V n ( θ k )} (θ θk ) This is a standard quadratic approximation to the likelihood around the ML estimate. 87
98 The Bayesian Information Criterion (cont.) Suppose that prior π 0 (θ k ) is constant (equal to c, say) in the neighbourhood of θ k. Then, for large n f k (y) = L k (t; y)π 0 (t) dt { cl k ( θ; y) exp 1 2 (t θ { } 1 k ) V n ( θ k )} (t θk ) dt = cl k ( θ; y)(2π) p k/2 V n ( θ k ) 1/2 as the integrand is proportional to a Normal( θ k, V n ( θ k )) distribution. 88
99 The Bayesian Information Criterion (cont.) But V n ( θ k ) 1/2 = 1 n V 1( θ k ) 1/2 Therefore the marginal likelihood becomes = n p k/2 V 1 ( θ k ) f k (y) = cl k ( θ; y)(2π) pk/2 n p k/2 V 1 ( θ k ) 1/2 1/2 or on the 2 log scale we have 2 log f k (y) 2l k ( θ; y) + p k log n + constant where constant = p k log(2π) log V 1 ( θ k ) 2 log c. 89
100 The Bayesian Information Criterion (cont.) The constant term is o(1) and is hence negligible, so the BIC is defined as BIC = 2l k ( θ; y) + p k log n. 90
MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2
MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized
More informationFinal Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58
Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple
More informationMath 423/533: The Main Theoretical Topics
Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)
More informationStatistics 203: Introduction to Regression and Analysis of Variance Penalized models
Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Jonathan Taylor - p. 1/15 Today s class Bias-Variance tradeoff. Penalized regression. Cross-validation. - p. 2/15 Bias-variance
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More information9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures
FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationNonparametric Regression. Badr Missaoui
Badr Missaoui Outline Kernel and local polynomial regression. Penalized regression. We are given n pairs of observations (X 1, Y 1 ),...,(X n, Y n ) where Y i = r(x i ) + ε i, i = 1,..., n and r(x) = E(Y
More informationA Modern Look at Classical Multivariate Techniques
A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico
More informationRidge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation
Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking
More informationLinear Models for Regression
Linear Models for Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationSpatial Process Estimates as Smoothers: A Review
Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed
More informationLinear Regression Linear Regression with Shrinkage
Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle
More informationData Mining Stat 588
Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic
More informationLinear Regression Linear Regression with Shrinkage
Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationLECTURE 2 LINEAR REGRESSION MODEL AND OLS
SEPTEMBER 29, 2014 LECTURE 2 LINEAR REGRESSION MODEL AND OLS Definitions A common question in econometrics is to study the effect of one group of variables X i, usually called the regressors, on another
More informationSummer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University.
Summer School in Statistics for Astronomers V June 1 - June 6, 2009 Regression Mosuk Chow Statistics Department Penn State University. Adapted from notes prepared by RL Karandikar Mean and variance Recall
More informationLecture 5: A step back
Lecture 5: A step back Last time Last time we talked about a practical application of the shrinkage idea, introducing James-Stein estimation and its extension We saw our first connection between shrinkage
More informationThe Statistical Property of Ordinary Least Squares
The Statistical Property of Ordinary Least Squares The linear equation, on which we apply the OLS is y t = X t β + u t Then, as we have derived, the OLS estimator is ˆβ = [ X T X] 1 X T y Then, substituting
More informationLinear Methods for Prediction
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationRegularization: Ridge Regression and the LASSO
Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7
MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is
More informationRegression, Ridge Regression, Lasso
Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.
More informationPenalized Regression
Penalized Regression Deepayan Sarkar Penalized regression Another potential remedy for collinearity Decreases variability of estimated coefficients at the cost of introducing bias Also known as regularization
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationHigh-dimensional regression modeling
High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making
More informationGeneralized Linear Models. Kurt Hornik
Generalized Linear Models Kurt Hornik Motivation Assuming normality, the linear model y = Xβ + e has y = β + ε, ε N(0, σ 2 ) such that y N(μ, σ 2 ), E(y ) = μ = β. Various generalizations, including general
More informationWe ve seen today that a square, symmetric matrix A can be written as. A = ODO t
Lecture 5: Shrinky dink From last time... Inference So far we ve shied away from formal probabilistic arguments (aside from, say, assessing the independence of various outputs of a linear model) -- We
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationFrequentist-Bayesian Model Comparisons: A Simple Example
Frequentist-Bayesian Model Comparisons: A Simple Example Consider data that consist of a signal y with additive noise: Data vector (N elements): D = y + n The additive noise n has zero mean and diagonal
More informationMATH 205C: STATIONARY PHASE LEMMA
MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)
More informationLecture 16 Solving GLMs via IRWLS
Lecture 16 Solving GLMs via IRWLS 09 November 2015 Taylor B. Arnold Yale Statistics STAT 312/612 Notes problem set 5 posted; due next class problem set 6, November 18th Goals for today fixed PCA example
More informationBusiness Statistics. Tommaso Proietti. Model Evaluation and Selection. DEF - Università di Roma 'Tor Vergata'
Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Model Evaluation and Selection Predictive Ability of a Model: Denition and Estimation We aim at achieving a balance between parsimony
More informationRandom Matrix Eigenvalue Problems in Probabilistic Structural Mechanics
Random Matrix Eigenvalue Problems in Probabilistic Structural Mechanics S Adhikari Department of Aerospace Engineering, University of Bristol, Bristol, U.K. URL: http://www.aer.bris.ac.uk/contact/academic/adhikari/home.html
More informationStatistics 203: Introduction to Regression and Analysis of Variance Course review
Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying
More informationCS540 Machine learning Lecture 5
CS540 Machine learning Lecture 5 1 Last time Basis functions for linear regression Normal equations QR SVD - briefly 2 This time Geometry of least squares (again) SVD more slowly LMS Ridge regression 3
More informationA Probability Review
A Probability Review Outline: A probability review Shorthand notation: RV stands for random variable EE 527, Detection and Estimation Theory, # 0b 1 A Probability Review Reading: Go over handouts 2 5 in
More informationLecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices
Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is
More informationRegression: Lecture 2
Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and
More informationLecture 14: Shrinkage
Lecture 14: Shrinkage Reading: Section 6.2 STATS 202: Data mining and analysis October 27, 2017 1 / 19 Shrinkage methods The idea is to perform a linear regression, while regularizing or shrinking the
More informationRandom Eigenvalue Problems Revisited
Random Eigenvalue Problems Revisited S Adhikari Department of Aerospace Engineering, University of Bristol, Bristol, U.K. Email: S.Adhikari@bristol.ac.uk URL: http://www.aer.bris.ac.uk/contact/academic/adhikari/home.html
More informationSTAT 100C: Linear models
STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix
More informationOutline. Remedial Measures) Extra Sums of Squares Standardized Version of the Multiple Regression Model
Outline 1 Multiple Linear Regression (Estimation, Inference, Diagnostics and Remedial Measures) 2 Special Topics for Multiple Regression Extra Sums of Squares Standardized Version of the Multiple Regression
More informationLinear Regression Models. Based on Chapter 3 of Hastie, Tibshirani and Friedman
Linear Regression Models Based on Chapter 3 of Hastie, ibshirani and Friedman Linear Regression Models Here the X s might be: p f ( X = " + " 0 j= 1 X j Raw predictor variables (continuous or coded-categorical
More informationAdaptive Piecewise Polynomial Estimation via Trend Filtering
Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering
More informationCovariance function estimation in Gaussian process regression
Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian
More informationRidge Regression. Flachs, Munkholt og Skotte. May 4, 2009
Ridge Regression Flachs, Munkholt og Skotte May 4, 2009 As in usual regression we consider a pair of random variables (X, Y ) with values in R p R and assume that for some (β 0, β) R +p it holds that E(Y
More informationRestricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model
Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives
More informationGaussian Processes for Machine Learning
Gaussian Processes for Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics Tübingen, Germany carl@tuebingen.mpg.de Carlos III, Madrid, May 2006 The actual science of
More informationBayesian linear regression
Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding
More informationSTA 2101/442 Assignment 3 1
STA 2101/442 Assignment 3 1 These questions are practice for the midterm and final exam, and are not to be handed in. 1. Suppose X 1,..., X n are a random sample from a distribution with mean µ and variance
More informationLecture 6: Methods for high-dimensional problems
Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,
More informationEM Algorithm II. September 11, 2018
EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data
More informationIntroduction to Statistical modeling: handout for Math 489/583
Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect
More informationSpring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM
University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 14 GEE-GMM Throughout the course we have emphasized methods of estimation and inference based on the principle
More informationBiostatistics Advanced Methods in Biostatistics IV
Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results
More informationSTATS306B STATS306B. Discriminant Analysis. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010
STATS306B Discriminant Analysis Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010 Classification Given K classes in R p, represented as densities f i (x), 1 i K classify
More informationGaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature
More informationComputer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression
Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
More informationStandard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j
Standard Errors & Confidence Intervals β β asy N(0, I( β) 1 ), where I( β) = [ 2 l(β, φ; y) ] β i β β= β j We can obtain asymptotic 100(1 α)% confidence intervals for β j using: β j ± Z 1 α/2 se( β j )
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationAPPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.
APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product
More information1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.
Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification
More informationSTAT 540: Data Analysis and Regression
STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationPartial factor modeling: predictor-dependent shrinkage for linear regression
modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework
More informationNonconcave Penalized Likelihood with A Diverging Number of Parameters
Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized
More informationComputer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationChapter 17: Undirected Graphical Models
Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)
More informationInverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1
Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is
More informationModel comparison and selection
BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationSo far our focus has been on estimation of the parameter vector β in the. y = Xβ + u
Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,
More informationLinear models. Linear models are computationally convenient and remain widely used in. applied econometric research
Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y
More informationMachine Learning. Part 1. Linear Regression. Machine Learning: Regression Case. .. Dennis Sun DATA 401 Data Science Alex Dekhtyar..
.. Dennis Sun DATA 401 Data Science Alex Dekhtyar.. Machine Learning. Part 1. Linear Regression Machine Learning: Regression Case. Dataset. Consider a collection of features X = {X 1,...,X n }, such that
More information10. Linear Models and Maximum Likelihood Estimation
10. Linear Models and Maximum Likelihood Estimation ECE 830, Spring 2017 Rebecca Willett 1 / 34 Primary Goal General problem statement: We observe y i iid pθ, θ Θ and the goal is to determine the θ that
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationApproximate Likelihoods
Approximate Likelihoods Nancy Reid July 28, 2015 Why likelihood? makes probability modelling central l(θ; y) = log f (y; θ) emphasizes the inverse problem of reasoning y θ converts a prior probability
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1
MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical
More informationRegression. Oscar García
Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is
More informationStat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13)
Stat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13) 1. Weighted Least Squares (textbook 11.1) Recall regression model Y = β 0 + β 1 X 1 +... + β p 1 X p 1 + ε in matrix form: (Ch. 5,
More informationLinear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77
Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical
More informationLinear Model Selection and Regularization
Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In
More informationMidterm. Introduction to Machine Learning. CS 189 Spring You have 1 hour 20 minutes for the exam.
CS 189 Spring 2013 Introduction to Machine Learning Midterm You have 1 hour 20 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. Please use non-programmable calculators
More informationPenalized Splines, Mixed Models, and Recent Large-Sample Results
Penalized Splines, Mixed Models, and Recent Large-Sample Results David Ruppert Operations Research & Information Engineering, Cornell University Feb 4, 2011 Collaborators Matt Wand, University of Wollongong
More informationModel Selection. Frank Wood. December 10, 2009
Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide
More informationLinear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,
Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationUNIVERSITETET I OSLO
UNIVERSITETET I OSLO Det matematisk-naturvitenskapelige fakultet Examination in: STK4030 Modern data analysis - FASIT Day of examination: Friday 13. Desember 2013. Examination hours: 14.30 18.30. This
More informationOptimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.
Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may
More informationGeneralized Linear Models
Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n
More informationISyE 691 Data mining and analytics
ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)
More informationThe Multivariate Normal Distribution. In this case according to our theorem
The Multivariate Normal Distribution Defn: Z R 1 N(0, 1) iff f Z (z) = 1 2π e z2 /2. Defn: Z R p MV N p (0, I) if and only if Z = (Z 1,..., Z p ) T with the Z i independent and each Z i N(0, 1). In this
More informationRecap. HW due Thursday by 5 pm Next HW coming on Thursday Logistic regression: Pr(G = k X) linear on the logit scale Linear discriminant analysis:
1 / 23 Recap HW due Thursday by 5 pm Next HW coming on Thursday Logistic regression: Pr(G = k X) linear on the logit scale Linear discriminant analysis: Pr(G = k X) Pr(X G = k)pr(g = k) Theory: LDA more
More informationDATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane
DATA MINING AND MACHINE LEARNING Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Linear models for regression Regularized
More informationCOS 424: Interacting with Data
COS 424: Interacting with Data Lecturer: Rob Schapire Lecture #14 Scribe: Zia Khan April 3, 2007 Recall from previous lecture that in regression we are trying to predict a real value given our data. Specically,
More information