PLS: theoretical results for the chemometrics use of PLS
Liliana Forzani, Facultad de Ingeniería Química, UNL, Argentina
Joint work with R. Dennis Cook
Example in chemometrics
A concrete situation: y is a chemical variable (protein content or fat content) and x = (x₁, ..., x_p) are absorptions or reflectances measured at p different wavelengths using some kind of spectroscopic instrument. We have simultaneous measurements of x and y on n chemical samples (the calibration set), and we want to use these measurements to predict y from x measured on new specimens.
Setting in chemometrics
Goal: predict a random variable y from p random variables x = (x₁, ..., x_p).
Statistical model: linear regression

    y = μ_y + βᵀ(x − μ_x) + ε,   (1)

with ε ~ N(0, σ²_ε), σ²_ε = σ²_y − σᵀΣ⁻¹σ, where σ²_y = var(y), Σ = cov(x), σ = cov(x, y).
Least squares solution: β = Σ⁻¹σ.
Algorithm
PLS started as an algorithm to avoid (when n < p) the problem of inverting the covariance matrix Σ to estimate β = Σ⁻¹σ. Let us call that estimator β̂_PLS.
It was set in motion by Herman Wold in the late 1960s to address problems in path modeling, and was adapted in 1977 by Svante Wold for prediction in chemometrics. It was an easy-to-compute algorithm that worked well in chemometrics even when p > n.
PLS algorithm. Martens and Naes (1989)
The algorithm in this version is as follows:
- Choose a d (there are ways to choose d).
- Compute Ŝ = {σ̂, Σ̂σ̂, ..., Σ̂^{d−1}σ̂}, with σ̂ and Σ̂ the sample versions of σ and Σ.
- Choose β̂ ∈ span(Ŝ) to minimize the squared error of Y − Xβ̂.
Questions: How to choose d? What if d = p? Does the algorithm converge, and to what?
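The steps above can be sketched directly from the Krylov form of the estimator (a minimal numpy sketch, not the numerically preferred NIPALS/SIMPLS implementations; the function name and toy data are mine):

```python
import numpy as np

def pls_krylov(X, y, d):
    """PLS coefficient estimate via the Krylov form
    beta_hat = S (S' Sigma_hat S)^{-1} S' sigma_hat, with
    S = [sigma_hat, Sigma_hat sigma_hat, ..., Sigma_hat^{d-1} sigma_hat]."""
    Xc = X - X.mean(axis=0)                 # center predictors
    yc = y - y.mean()                       # center response
    Sigma = Xc.T @ Xc / len(y)              # sample covariance of x
    sigma = Xc.T @ yc / len(y)              # sample cov(x, y)
    S = np.column_stack([np.linalg.matrix_power(Sigma, k) @ sigma
                         for k in range(d)])
    # beta restricted to span(S) with minimum squared error
    return S @ np.linalg.solve(S.T @ Sigma @ S, S.T @ sigma)
```

With d = p (and Σ̂ nonsingular) the Krylov subspace generically fills all of R^p, so the estimator reduces to the least squares solution Σ̂⁻¹σ̂; smaller d gives the usual PLS fits.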
PLS works
The chemometrics community has been using PLS for calibration ever since.
Chemometricians tend not to address population PLS models or regression coefficients, but instead deal directly with predictions resulting from PLS algorithms.
The method works even for n < p, but there was no consistent theory to support the claim of why it works and where it is going.
The statistics community did not pay attention to PLS (maybe this is why there were no asymptotics).
The PLS tradition is perhaps more akin to conventions in machine learning or data science than to statistical customs. There is now a vast literature on PLS within chemometrics, some of it refining and extending the methodology and some of it affirming it, like the paper "PLS works" by Bro and Eldén (2009).
Constraints on the parameters. Helland
A statistician appears. Helland (1990) realized that the algorithm was a plug-in estimator for β of the form

    β_PLS = S(SᵀΣS)⁻¹Sᵀσ,   (2)   with S = {σ, Σσ, ..., Σ^{d−1}σ},

i.e.

    β̂_PLS = Ŝ(ŜᵀΣ̂Ŝ)⁻¹Ŝᵀσ̂.   (3)

As a consequence, for p fixed, β̂_PLS is a consistent estimator of β_PLS. There was still a mystery about the shape of β_PLS in (2).
But the statisticians did show up
The mystery disappears: a model in the population (Cook, Helland and Su, 2013).
Idea: when d = 1, PLS in the population gives β = σ(σᵀΣσ)⁻¹σᵀσ.
1. Hence β = cσ for some scalar c.
2. Recall that β = Σ⁻¹σ.
3. (1) and (2) together: Σ⁻¹σ = cσ, i.e. Σσ = (1/c)σ.
4. Then σ is one of the eigenvectors of Σ.
Moreover, if d > 1, PLS in the population means that β involves only d eigenvectors of Σ, and therefore β can be enveloped by the span of d eigenvectors of Σ.
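The d = 1 identity can be checked numerically (a small sketch; the eigenvalues below and the function name are made-up illustration values):

```python
import numpy as np

# Sigma with known orthonormal eigenvectors; eigenvalues are illustrative numbers.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # columns: orthonormal eigenvectors
lam = np.array([4.0, 3.0, 2.0, 1.0, 0.5])
Sigma = Q @ np.diag(lam) @ Q.T

def pop_pls_d1(sigma, Sigma):
    """Population PLS with d = 1: beta_PLS = sigma (sigma' Sigma sigma)^{-1} sigma' sigma."""
    return sigma * (sigma @ sigma) / (sigma @ Sigma @ sigma)

sigma = Q[:, 2]                                # an eigenvector of Sigma
beta = np.linalg.solve(Sigma, sigma)           # true beta = Sigma^{-1} sigma
exact = np.allclose(pop_pls_d1(sigma, Sigma), beta)        # True when sigma is an eigenvector

sigma_mix = (Q[:, 0] + Q[:, 4]) / np.sqrt(2)   # mixes two eigenvectors (a d = 2 situation)
beta_mix = np.linalg.solve(Sigma, sigma_mix)
off = np.allclose(pop_pls_d1(sigma_mix, Sigma), beta_mix)  # False: d = 1 is no longer enough
```

When σ mixes two eigenvectors, the one-component population PLS no longer recovers β; this is exactly why d > 1 components (an envelope of d eigenvectors) are needed in general.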
More about constraints. MLE
Cook, Helland and Su (2013): informally, β involves only a few eigenvectors of Σ. Formally, there exists Γ ∈ R^{p×u}, with u ≤ p, such that the columns of Γ are u eigenvectors of Σ (not necessarily the first ones) and β = ΓU for some U ∈ R^{u×1}. Since β = Σ⁻¹σ, we have

    Σ = Γ Λ_Γ Γᵀ + Γ₀ Λ_{Γ₀} Γ₀ᵀ   (since the columns of Γ are eigenvectors of Σ)
    β = Γ(ΓᵀΣΓ)⁻¹Γᵀσ

Γ = S? Remember that β_PLS = S(SᵀΣS)⁻¹Sᵀσ.
They found the MLE for β_PLS and proved its consistency and efficiency for p fixed as n → ∞.
But the chemometrics community used it for p increasing!!!
Setting: p > n. The MLE does not exist. No hope?
The β̂_PLS algorithm works if d < min{p, n}, and works (pretty well) when n < p. Recall β̂_PLS = Ŝ(ŜᵀΣ̂Ŝ)⁻¹Ŝᵀσ̂.
In view of the apparent success that PLS has had in chemometrics and elsewhere, we might anticipate that it has reasonable statistical properties in high-dimensional regression.
But the statisticians did show up again... but with bad news
Chun and Keleş (2010) provided a piece of the puzzle by showing that, within a certain modeling framework, the PLS estimator of the coefficient vector in linear regression is inconsistent unless p/n → 0. They then used this as motivation for their development of a sparse version of PLS.
A dilemma
The Chun–Keleş result poses a dilemma. On the one hand, decades of experience support PLS as a useful method; on the other, its inconsistency when p/n → c > 0 casts doubt on its usefulness in high-dimensional regression, which is one of the contexts in which PLS undeniably stands out by virtue of its widespread application. There are several possible explanations for this conflict, including:
- consistency does not always signal the value of a method in practice,
- the literature is largely wrong about the value of PLS, and
- the modeling construct used by Chun and Keleş does not adequately reflect the range of applications in which PLS is employed.
The model in Chun and Keleş's paper
The model for x (the predictor) given y is

    x | y = μ_x + Θν_y + ω,   (4)

where ν_y ∈ R^d, ν_y ~ N(0, I_d), Θ ∈ R^{p×d}, ω ∈ R^p, ω ~ N(0, π²I_p).
As a consequence, x ⫫ y | Θᵀx, and thus the d linear combinations Θᵀx carry all of the information that x has about y.
The variance of x can be expressed as

    Σ = ΘΘᵀ + π²I_p = H(ΘᵀΘ + π²I_d)Hᵀ + π²Q_H,

where H = Θ(ΘᵀΘ)^{−1/2} is a semi-orthogonal basis matrix for span(Θ) and Q_H = I_p − HHᵀ projects onto its orthogonal complement.
Assumptions in Chun and Keleş's paper
They require the columns of Θ to be orthogonal, with bounded norms that converge as sequences. As a consequence Σ is bounded:

    Σ = ΘΘᵀ + π²I_p = H(ΘᵀΘ + π²I_d)Hᵀ + π²Q_H.

But in spectroscopy data it seems entirely plausible that notable signal comes from many wavelengths, not just a few. When this happens, many rows of Θ are non-zero in such a way that Σᵢ₌₁ᵖ ‖θᵢ‖² diverges, and we are outside Chun and Keleş's assumptions for inconsistency.
In conclusion, Chun and Keleş's paper effectively imposes sparsity to obtain inconsistency.
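A quick Monte Carlo check of the covariance identity above (a sketch with arbitrary small dimensions and a made-up Θ; only the identities for Σ are from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, pi, n = 4, 2, 0.7, 200_000
Theta = rng.normal(size=(p, d))                   # made-up loading matrix

# Draw x | y = mu_x + Theta nu_y + omega, with nu_y ~ N(0, I_d), omega ~ N(0, pi^2 I_p)
# (mu_x = 0 for simplicity; y enters only through nu_y, which does not affect var(x)).
nu = rng.normal(size=(n, d))
omega = pi * rng.normal(size=(n, p))
x = nu @ Theta.T + omega

Sigma_model = Theta @ Theta.T + pi ** 2 * np.eye(p)   # Sigma = Theta Theta' + pi^2 I_p
Sigma_hat = np.cov(x, rowvar=False)                   # Monte Carlo estimate, close for large n

# Second form: Sigma = H (Theta'Theta + pi^2 I_d) H' + pi^2 Q_H, H = Theta (Theta'Theta)^{-1/2}
evals, evecs = np.linalg.eigh(Theta.T @ Theta)
H = Theta @ evecs @ np.diag(evals ** -0.5) @ evecs.T
Q_H = np.eye(p) - H @ H.T
Sigma_alt = H @ (Theta.T @ Theta + pi ** 2 * np.eye(d)) @ H.T + pi ** 2 * Q_H
```

The algebraic rewriting `Sigma_alt` matches `Sigma_model` exactly, and the empirical covariance `Sigma_hat` approaches both as n grows.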
Wold, remember who he was? Sparsity vs. abundance
A quote from "Hierarchical multiblock PLS and PC models for easier model interpretation and as an alternative to variable selection" by Wold, Kettaneh and Tjessem. Who is Wold?

"In situations with many variables, more than say 50 or 100, there is a strong temptation to drastically reduce the number of variables in the model. This temptation is further strengthened by the regression tradition to reduce the variables as far as possible to get the X matrix well conditioned. As discussed below, however, this reduction of variables often removes information, makes the interpretation misleading and increases the risk of spurious models. An often better alternative than variable reduction is to divide the variables into conceptually meaningful blocks and then apply hierarchical multi-block PLS (or PC) models. These ideas were presented by Wold, Martens and co-workers around 1986, but in rather obscure papers.

With multivariate projection models such as PLS and PCA, however, the situation is different. These methods work well also with many variables even when the number of observations, N, is small. In fact, the larger the number of relevant variables, the more precise are the scores t (and u in PLS), because they have the characteristics of weighted averages of all the X- or Y-variables, and an average is more precise the larger the number of elements forming the basis of the average. There is therefore no real need for keeping the number of variables small; only really unimportant variables should be deleted to stabilize the model and its predictions."
And the statisticians did show up again, now with good news
Let us assume we are under the same model

    x | y = μ_x + Θν_y + ω,   (5)

where ν_y ∈ R^d, ν_y ~ N(0, I_d), Θ ∈ R^{p×d}, ω ∈ R^p, ω ~ N(0, π²I_p). Again

    Σ = ΘΘᵀ + π²I_p = H(ΘᵀΘ + π²I_d)Hᵀ + π²Q_H.

Then the rate of convergence for the squared prediction error, using PLS, is of order

    p / ((Σᵢ₌₁ᵖ ‖θᵢ‖²) n).
Consequences
Order of convergence of the squared prediction error: p / ((Σᵢ₌₁ᵖ ‖θᵢ‖²) n).
- Chun and Keleş's case: Σᵢ₌₁ᵖ ‖θᵢ‖² bounded, so we have consistency only if p/n → 0.
- If Σᵢ₌₁ᵖ ‖θᵢ‖² ≍ p^α, the squared prediction error converges at rate p^{1−α}/n.
- When the maximum amount of information is accumulated (Σᵢ₌₁ᵖ ‖θᵢ‖² ≍ p), we recover the traditional √n-consistency.
More? Yes, we have a general consistency result (not only for the model presented in Chun and Keleş's paper). The consistency of the prediction and the rate of convergence depend, roughly, on the ratio between the information that new predictors contribute about y and the amount of noise they contribute.
Simulation, n = p/2

    x | y = μ_x + Θν_y + ω.   (6)

The columns of Θ were constructed to be orthogonal, with diagonal elements diag(ΘᵀΘ) = (4pᵃ, pᵃ) for a = 1/2, 3/4, 1, and diag(ΘᵀΘ) = (4c, c) = c(4p⁰, p⁰) with c constant.
The theoretical result for this case indicates D_N = O_p(√φ) with φ = p / (n Σᵢ₌₁ᵖ ‖θᵢ‖²). Here Σᵢ₌₁ᵖ ‖θᵢ‖² ≍ pᵃ with a = 1, 3/4, 1/2 and 0.
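The rate factor φ for this design can be tabulated directly (a small sketch; under diag(ΘᵀΘ) = (4pᵃ, pᵃ) we get Σᵢ‖θᵢ‖² = 5pᵃ, and n = p/2 as on the slide):

```python
def phi(p, a):
    """phi = p / (n * sum_i ||theta_i||^2), with n = p/2 and sum = 5 p^a,
    so phi = 2 / (5 p^a): shrinking for a > 0, constant for a = 0."""
    n = p / 2
    return p / (n * 5 * p ** a)

for a in (1.0, 0.75, 0.5, 0.0):
    print(a, [phi(p, a) for p in (100, 1000, 10000)])
```

So D_N = O_p(√φ) shrinks as p grows whenever a > 0, matching the abundant-signal story, and stays flat at 2/5 in the bounded (Chun–Keleş) case a = 0.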
Theoretical result: D_N = O_p(√(p / (n Σᵢ₌₁ᵖ ‖θᵢ‖²))). The only case without convergence is diag(ΘᵀΘ) ≍ c.
Tetracycline data
Goicoechea and Olivieri (1999) used PLS to develop a predictor of tetracycline concentration in human blood. The 50 training samples were constructed by spiking blank sera with various amounts of tetracycline in the range 0–4 µg ml⁻¹. A validation set of 57 samples was constructed in the same way. For each sample, the values of the predictors were determined by measuring fluorescence intensity at p = 101 equally spaced wavelengths. The authors determined, using leave-one-out cross-validation, that the best predictions of the training data were obtained with d = 4 linear combinations of the original 101 predictors.
Tetracycline data
We use these data to illustrate the behavior of PLS predictions in chemometrics as the number of predictors increases. We used PLS with d = 4 to predict the validation data based on p equally spaced spectral channels, with p ranging between 10 and 101. For each of those five values of p we computed the root mean squared error (RMSE).
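The experiment can be sketched on synthetic data (the real tetracycline spectra are not distributed with the slides, so the smooth "spectral" profile, the noise level, and the Krylov-form PLS fit below are all stand-ins, meant only to show error decreasing as informative channels accumulate):

```python
import numpy as np

def pls_fit(X, y, d):
    """Krylov-form PLS regression; returns a prediction function. Works for n < p."""
    mx, my = X.mean(axis=0), y.mean()
    Xc, yc = X - mx, y - my
    Sigma = Xc.T @ Xc / len(y)
    sigma = Xc.T @ yc / len(y)
    S = np.column_stack([np.linalg.matrix_power(Sigma, k) @ sigma
                         for k in range(d)])
    beta = S @ np.linalg.solve(S.T @ Sigma @ S, S.T @ sigma)
    return lambda Xnew: my + (Xnew - mx) @ beta

rng = np.random.default_rng(0)
p_full, n_train, n_val = 101, 50, 57
grid = np.linspace(0.0, 1.0, p_full)
profile = np.exp(-((grid - 0.5) ** 2) / 0.02)     # smooth synthetic spectrum
y_tr = rng.uniform(0.0, 4.0, n_train)             # "concentrations" in 0-4
y_va = rng.uniform(0.0, 4.0, n_val)
X_tr = np.outer(y_tr, profile) + 0.3 * rng.normal(size=(n_train, p_full))
X_va = np.outer(y_va, profile) + 0.3 * rng.normal(size=(n_val, p_full))

rmses = {}
for p in (10, 25, 50, 101):
    idx = np.linspace(0, p_full - 1, p).astype(int)   # p equally spaced channels
    predict = pls_fit(X_tr[:, idx], y_tr, d=4)
    rmses[p] = float(np.sqrt(np.mean((predict(X_va[:, idx]) - y_va) ** 2)))
    print(p, rmses[p])
```

Note that for p = 101 > n_train = 50 the sample covariance is singular, yet the d = 4 Krylov fit still works, which is the point of the example.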
RMSE for tetracycline data for different values of p
Relatively steep drop in RMSE for small p, say less than 30, and a slow but steady decrease in RMSE thereafter.
57 Thanks!
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57
More informationDimension Reduction Methods
Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationData Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.
TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin
More informationInternational Journal of Pure and Applied Mathematics Volume 19 No , A NOTE ON BETWEEN-GROUP PCA
International Journal of Pure and Applied Mathematics Volume 19 No. 3 2005, 359-366 A NOTE ON BETWEEN-GROUP PCA Anne-Laure Boulesteix Department of Statistics University of Munich Akademiestrasse 1, Munich,
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationLinear Methods for Prediction
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationSufficient Dimension Reduction for Longitudinally Measured Predictors
Sufficient Dimension Reduction for Longitudinally Measured Predictors Ruth Pfeiffer National Cancer Institute, NIH, HHS joint work with Efstathia Bura and Wei Wang TU Wien and GWU University JSM Vancouver
More informationFinal Exam. Economics 835: Econometrics. Fall 2010
Final Exam Economics 835: Econometrics Fall 2010 Please answer the question I ask - no more and no less - and remember that the correct answer is often short and simple. 1 Some short questions a) For each
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationRegression diagnostics
Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model
More informationChapter 3. Linear Models for Regression
Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear
More informationChapter 4: Factor Analysis
Chapter 4: Factor Analysis In many studies, we may not be able to measure directly the variables of interest. We can merely collect data on other variables which may be related to the variables of interest.
More informationA Significance Test for the Lasso
A Significance Test for the Lasso Lockhart R, Taylor J, Tibshirani R, and Tibshirani R Ashley Petersen June 6, 2013 1 Motivation Problem: Many clinical covariates which are important to a certain medical
More informationLECTURE NOTE #NEW 6 PROF. ALAN YUILLE
LECTURE NOTE #NEW 6 PROF. ALAN YUILLE 1. Introduction to Regression Now consider learning the conditional distribution p(y x). This is often easier than learning the likelihood function p(x y) and the
More informationLinear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation
More informationStatistics 910, #15 1. Kalman Filter
Statistics 910, #15 1 Overview 1. Summary of Kalman filter 2. Derivations 3. ARMA likelihoods 4. Recursions for the variance Kalman Filter Summary of Kalman filter Simplifications To make the derivations
More informationCPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017
CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code
More informationFactor Analysis (10/2/13)
STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.
More informationLECTURE NOTE #10 PROF. ALAN YUILLE
LECTURE NOTE #10 PROF. ALAN YUILLE 1. Principle Component Analysis (PCA) One way to deal with the curse of dimensionality is to project data down onto a space of low dimensions, see figure (1). Figure
More informationChemometrics: Classification of spectra
Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture
More information1 Regression with High Dimensional Data
6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:
More informationLinear regression. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda
Linear regression DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall15 Carlos Fernandez-Granda Linear models Least-squares estimation Overfitting Example:
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationMachine Learning (Spring 2012) Principal Component Analysis
1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in
More informationMachine learning - HT Maximum Likelihood
Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationECE 275A Homework 6 Solutions
ECE 275A Homework 6 Solutions. The notation used in the solutions for the concentration (hyper) ellipsoid problems is defined in the lecture supplement on concentration ellipsoids. Note that θ T Σ θ =
More informationRestricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model
Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives
More informationRegression Models - Introduction
Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent
More information9.1 Orthogonal factor model.
36 Chapter 9 Factor Analysis Factor analysis may be viewed as a refinement of the principal component analysis The objective is, like the PC analysis, to describe the relevant variables in study in terms
More informationIntroduction to Machine Learning
10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what
More informationIntroduction to Simple Linear Regression
Introduction to Simple Linear Regression Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 68 About me Faculty in the Department
More informationIntroduction to Maximum Likelihood Estimation
Introduction to Maximum Likelihood Estimation Eric Zivot July 26, 2012 The Likelihood Function Let 1 be an iid sample with pdf ( ; ) where is a ( 1) vector of parameters that characterize ( ; ) Example:
More informationMidterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.
CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic
More informationLectures on Simple Linear Regression Stat 431, Summer 2012
Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population
More informationKernel Methods. Machine Learning A W VO
Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance
More informationReducedPCR/PLSRmodelsbysubspaceprojections
ReducedPCR/PLSRmodelsbysubspaceprojections Rolf Ergon Telemark University College P.O.Box 2, N-9 Porsgrunn, Norway e-mail: rolf.ergon@hit.no Published in Chemometrics and Intelligent Laboratory Systems
More informationLecture 6 Multiple Linear Regression, cont.
Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationTechniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods
Techniques for Dimensionality Reduction PCA and Other Matrix Factorization Methods Outline Principle Compoments Analysis (PCA) Example (Bishop, ch 12) PCA as a mixture model variant With a continuous latent
More information11 : Gaussian Graphic Models and Ising Models
10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood
More informationMultiple Linear Regression
Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from
More informationSparse Approximation and Variable Selection
Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation
More informationPartial factor modeling: predictor-dependent shrinkage for linear regression
modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework
More informationNearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns
More informationPRINCIPAL COMPONENTS ANALYSIS
121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves
More information