Foundation of Intelligent Systems, Part I. Regression 2
1 Foundation of Intelligent Systems, Part I Regression 2 mcuturi@i.kyoto-u.ac.jp FIS
2 Some Words on the Survey Not enough answers to say anything meaningful! Try again: survey.
3 Last Week Regression: highlight a functional relationship between a predicted variable and predictors
4 Last Week Regression: find a function f such that, for every pair (x, y) that can appear, f(x) ≈ y
5 Last Week Regression: to find an accurate function f, use a data set and the least-squares criterion: min_{f ∈ F} (1/N) Σ_{j=1}^N (y_j − f(x_j))²
6 Last Week Regression: when regressing a real number vs. a real number: [scatter plot of Rent (×1000 JPY) vs. Surface, with linear, quadratic, cubic, 4th- and 5th-degree polynomial fits; the fitted coefficients did not survive transcription]. The least-squares criterion L(b, a_1, …, a_p) is used to fit lines and polynomials, and results in solving a linear system: L is of 2nd order in (b, a_1, …, a_p), so each partial derivative ∂L/∂a_k is linear in (b, a_1, …, a_p). Setting ∂L/∂a_k = 0 for each parameter, we get p+1 linear equations for p+1 variables.
7 Last Week Regression: when regressing a real number vs. d real numbers (vectors in R^d), find the best fit α ∈ R^{d+1} such that α^T x + α_0 ≈ y. Add to the d × N data matrix a row of 1's to get the predictor matrix X, and let Y be the row of predicted values. The least-squares criterion also applies: L(α) = ||Y − α^T X||² = α^T X X^T α − 2 Y X^T α + ||Y||². Setting ∇_α L = 0 gives α = (X X^T)^{-1} X Y^T. This works if X X^T ∈ R^{(d+1)×(d+1)} is invertible.
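As a sanity check, the closed form α = (X X^T)^{-1} X Y^T can be specialised to one predictor plus an intercept row of 1's, where the 2×2 normal equations can be solved by hand. A minimal sketch (the function name and toy data are ours, for illustration):

```python
# Least-squares fit of y ≈ a*x + b, i.e. alpha = (X X^T)^{-1} X Y^T
# where X is the 2 x N matrix with a row of x values and a row of 1's.
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations: [[sxx, sx], [sx, n]] @ (a, b) = (sxy, sy),
    # solved by Cramer's rule (valid whenever X X^T is invertible).
    det = sxx * n - sx * sx
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies exactly on y = 2x + 1
```

On this noiseless data the fit recovers a = 2 and b = 1 exactly.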
8 Last Week In Matlab: >> (X*X')\(X*Y') returns the fitted weights for the predictors x_age, x_surface and x_distance, plus the intercept in JPY (the numeric values did not survive transcription).
9 Today A statistical / probabilistic perspective on LS regression; a few words on polynomials in higher dimensions; a geometric perspective; variable co-linearity and the overfitting problem; some solutions, i.e. advanced regression techniques: subset selection, ridge regression, the Lasso
10 A (very) few words on the statistical / probabilistic interpretation of LS
11 The Statistical Perspective on Regression Assume that the values of y are stochastically linked to the observations x through y − (α^T x + β) ∼ N(0, σ²). This difference is a random variable ε called the residual.
12 The Statistical Perspective on Regression This can be rewritten as y = (α^T x + β) + ε, with ε ∼ N(0, σ²): we assume that the difference between y and (α^T x + β) behaves like a Gaussian (normally distributed) random variable. Goal as a statistician: estimate α and β given the observations.
13 Identically Independently Distributed (i.i.d.) Observations Statistical hypothesis: assume that the parameters are α = a, β = b.
14 In such a case, what would be the probability of each observation (x_j, y_j)?
15 For each couple (x_j, y_j), j = 1, …, N: P(x_j, y_j | α = a, β = b) = (1/√(2πσ²)) exp(−(y_j − (a^T x_j + b))² / (2σ²))
16 Since each measurement (x_j, y_j) has been independently sampled: P({(x_j, y_j)}_{j=1,…,N} | α = a, β = b) = Π_{j=1}^N (1/√(2πσ²)) exp(−(y_j − (a^T x_j + b))² / (2σ²))
17 Seen as a function of a and b, this quantity is the likelihood of the dataset {(x_j, y_j)}_{j=1,…,N}: L_{{(x_j,y_j)}}(a, b) = Π_{j=1}^N (1/√(2πσ²)) exp(−(y_j − (a^T x_j + b))² / (2σ²))
18 Maximum Likelihood Estimation (MLE) of Parameters Hence, for (a, b), the likelihood function on the dataset {(x_j, y_j)}_{j=1,…,N} is L(a, b) = Π_{j=1}^N (1/√(2πσ²)) exp(−(y_j − (a^T x_j + b))² / (2σ²))
19 Why not use the likelihood to guess (a, b) given the data?
20 The MLE approach selects the values of (a, b) which maximize L(a, b).
21 Since max_{(a,b)} L(a, b) ⇔ max_{(a,b)} log L(a, b), and log L(a, b) = C − (1/(2σ²)) Σ_{j=1}^N (y_j − (a^T x_j + b))², we get max_{(a,b)} L(a, b) ⇔ min_{(a,b)} Σ_{j=1}^N (y_j − (a^T x_j + b))²: under Gaussian noise, MLE recovers least squares.
22 Statistical Approach to Linear Regression Properties of the MLE estimator: convergence of α̂ to a? Confidence intervals for the coefficients; test procedures to assess whether the model fits the data; [histogram of the residuals' relative frequencies; axis values lost in transcription]. Bayesian approaches: instead of looking for one optimal fit (a, b), juggle with a whole density on (a, b) to make decisions, etc.
23 A few words on polynomials in higher dimensions
24 A few words on polynomials in higher dimensions For d variables, that is for points x ∈ R^d, the space of polynomials in these variables up to degree p is generated by {x^u | u ∈ N^d, u = (u_1, …, u_d), Σ_{i=1}^d u_i ≤ p}, where the monomial x^u is defined as x_1^{u_1} x_2^{u_2} ⋯ x_d^{u_d}. Recurrence for the dimension of that space: dim_{p+1} = dim_p + C(d+p, p+1). For d = 20 and p = 5 the dimension is C(25, 5) = 53130. The problem with polynomial interpolation in high dimensions is this explosion in the number of relevant variables (one for each monomial).
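The count is easy to check directly: the number of monomials in d variables of total degree at most p is C(d+p, p), and it satisfies the recurrence above via Pascal's rule. A small sketch (the helper name is ours):

```python
from math import comb

def n_monomials(d, p):
    """Dimension of the space of d-variate polynomials of degree <= p."""
    return comb(d + p, p)

# The recurrence dim_{p+1} = dim_p + comb(d+p, p+1) adds exactly the
# monomials of total degree p+1 (Pascal's rule on binomial coefficients).
d = 20
assert all(n_monomials(d, p + 1) == n_monomials(d, p) + comb(d + p, p + 1)
           for p in range(6))

print(n_monomials(20, 5))  # 53130 monomials for d = 20, p = 5
```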
25 Geometric Perspective
26 Back to Basics Recall the problem: X = [x_1 x_2 ⋯ x_N] ∈ R^{(d+1)×N} and Y = [y_1 ⋯ y_N] ∈ R^N. We look for α such that α^T X ≈ Y.
27 Back to Basics If we transpose this expression we get X^T α ≈ Y^T: the N × (d+1) matrix whose j-th row is (1, x_{1,j}, …, x_{d,j}), multiplied by the column (α_0, …, α_d), should approximate the column (y_1, …, y_N). Using the notation Y = Y^T, X = X^T and X_k for the (k+1)-th column of X, this reads Σ_{k=0}^d α_k X_k ≈ Y. Note how X_k collects all values taken by the k-th variable. Problem: approximate/reconstruct Y ∈ R^N using X_0, X_1, …, X_d ∈ R^N.
28 Linear System Reconstructing Y ∈ R^N using the vectors X_0, X_1, …, X_d of R^N: our ability to approximate Y depends implicitly on the space spanned by X_0, X_1, …, X_d. Consider the observed vector Y ∈ R^N of predicted values.
29 Plot the first regressor X_0...
30 Assume the next regressor X_1 is colinear to X_0...
31 ...and so is X_2...
32 Very few choices to approximate Y...
33 Suppose X_2 is actually not colinear to X_0.
34 This opens new ways to reconstruct Y.
35 When X_0, X_1, X_2 are linearly independent,
36 Y is in their span, since that span then has dimension 3 (the whole space in the figure).
37 Linear System The dimension of that space is Rank(X), the rank of X: Rank(X) ≤ min(d+1, N).
38 Linear System Three cases depending on Rank(X), d and N: 1. Rank(X) < N: the d+1 column vectors do not span R^N, so for arbitrary Y there is no solution to α^T X = Y. 2. Rank(X) = N and d+1 > N: too many variables; they span the whole of R^N, so there are infinitely many solutions to α^T X = Y. 3. Rank(X) = N and d+1 = N: # variables = # observations, with an exact and unique solution α = X^{-1} Y satisfying α^T X = Y. In most applications d+1 ≠ N, so we are either in case 1 or case 2.
39 Case 1: Rank(X) < N No solution to α^T X = Y (equivalently Xα = Y) in the general case. What about the orthogonal projection of Y onto the image of X, span{X_0, X_1, …, X_d}? Namely, the point Ŷ such that Ŷ = argmin_{u ∈ span{X_0, X_1, …, X_d}} ||Y − u||.
40 Case 1: Rank(X) < N Lemma 1: {X_0, X_1, …, X_d} is a linearly independent family ⇔ X^T X is invertible.
41 Case 1: Rank(X) < N Computing the projection ω̂ of a point ω on a subspace V is well understood. In particular, if (X_0, X_1, …, X_d) is a basis of span{X_0, X_1, …, X_d} (that is, {X_0, X_1, …, X_d} is a linearly independent family), then X^T X is invertible and Ŷ = X(X^T X)^{-1} X^T Y. This gives us the vector of weights we are looking for: α̂ = (X^T X)^{-1} X^T Y, so that Ŷ = X α̂ ≈ Y, i.e. α̂^T X ≈ Y. What can go wrong?
42 Case 1: Rank(X) < N If X^T X is invertible, Ŷ = X(X^T X)^{-1} X^T Y. If X^T X is not invertible... we have a problem. If X^T X's condition number, λ_max(X^T X) / λ_min(X^T X), is very large, a small change in Y can cause dramatic changes in α. In this case the linear system is said to be badly conditioned... and using the formula Ŷ = X(X^T X)^{-1} X^T Y might return garbage, as can be seen in the following Matlab example.
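The effect can be reproduced in a few lines (the two-column design below is invented for the example): take two almost-colinear regressors and watch the condition number of X^T X blow up as they get closer.

```python
import math

def cond_gram(eps):
    """Condition number of X^T X for the two columns (1, 1) and (1, 1 + eps)."""
    # Entries of the 2x2 Gram matrix X^T X
    a = 2.0                       # X0 . X0
    b = 2.0 + eps                 # X0 . X1
    c = 2.0 + 2 * eps + eps ** 2  # X1 . X1
    # Eigenvalues of a symmetric 2x2 matrix from its trace and determinant
    tr, det = a + c, a * c - b * b
    root = math.sqrt(tr * tr - 4 * det)
    lam_max, lam_min = (tr + root) / 2, (tr - root) / 2
    return lam_max / lam_min

print(cond_gram(1e-1))  # already large
print(cond_gram(1e-4))  # enormous: the system is badly conditioned
```

As eps shrinks, the determinant of the Gram matrix goes to 0 and the condition number grows roughly like 16/eps², so tiny perturbations of Y swing the fitted weights wildly.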
43 Case 2: Rank(X) = N and d+1 > N The high-dimensional, low-sample setting. Ill-posed inverse problem: the set {α ∈ R^{d+1} | Xα = Y} is a whole affine subspace, so we need to choose one among many admissible points. When does this happen? In high-dimensional, low-sample settings (DNA chips, multimedia, etc.). How to solve this? Use something called regularization.
44 A practical perspective: Colinearity and Overfitting
45 A Few High-dimension, Low-sample Settings DNA chips are very long vectors of measurements, one for each gene. Task: regress a health-related variable against gene expression levels.
46 E-mails represented as bag-of-words vectors in R^d, e.g. please = 2, send = 1, money = 2, assignment = 0. Task: regress the probability that an e-mail is spam against its bag-of-words vector.
47 Correlated Variables Suppose you run a real-estate company. For each apartment you have compiled a few hundred predictor variables, e.g. distances to the nearest convenience store, pharmacy, supermarket, parking lot, etc.; distances to all main locations in Kansai; socio-economic variables of the neighborhood; characteristics of the apartment. Some are obviously correlated (correlated ≈ almost colinear), e.g. distance to the Post Office vs. distance to the nearest postal ATM. In that case, we may have some problems (Matlab example).
48 Overfitting Given d variables (including the constant variable), consider the least-squares criterion L_d(α_1, …, α_d) = Σ_j (y_j − Σ_{i=1}^d α_i x_{i,j})². Add any variable vector x_{d+1,j}, j = 1, …, N, and define L_{d+1}(α_1, …, α_d, α_{d+1}) = Σ_j (y_j − Σ_{i=1}^d α_i x_{i,j} − α_{d+1} x_{d+1,j})².
49 Then min_{α ∈ R^{d+1}} L_{d+1}(α) ≤ min_{α ∈ R^d} L_d(α).
50 Why? Because L_d(α_1, …, α_d) = L_{d+1}(α_1, …, α_d, 0).
51 The residual sum of squares goes down... but is it relevant to add variables?
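The inequality min L_{d+1} ≤ min L_d is easy to see on a toy example (data invented): fitting a constant alone, then the same constant plus one extra variable, can only lower the residual sum of squares.

```python
def rss_const(ys):
    """Best RSS using only the constant regressor (the optimum is the mean)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def rss_const_plus_x(xs, ys):
    """Best RSS using the constant regressor plus one variable x."""
    n, sx, sy = len(xs), sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Closed-form least-squares fit y ≈ a*x + b via the 2x2 normal equations
    det = sxx * n - sx * sx
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))

xs, ys = [0.0, 1.0, 2.0], [1.0, 2.0, 4.0]
assert rss_const_plus_x(xs, ys) <= rss_const(ys)  # adding a variable never raises the RSS
```

Here the extra variable happens to be informative, but the inequality would hold even for a column of pure noise, which is exactly why a shrinking RSS is not evidence that the added variable is relevant.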
52 Occam's Razor A formalization of overfitting. Minimizing least squares (RSS) is not clever enough; we need another idea to avoid overfitting. Occam's razor, lex parsimoniae (law of parsimony): the principle that recommends selecting the hypothesis that makes the fewest assumptions; one should always opt for an explanation in terms of the fewest possible causes, factors, or variables (Wikipedia). William of Ockham, died 1347.
53 Advanced Regression Techniques
54 Quick Reminder on Vector Norms For a vector a ∈ R^d, the Euclidean norm is the quantity ||a||_2 = √(Σ_{i=1}^d a_i²). More generally, the q-norm is, for q > 0, ||a||_q = (Σ_{i=1}^d |a_i|^q)^{1/q}. In particular, for q = 1, ||a||_1 = Σ_{i=1}^d |a_i|. In the limits q → ∞ and q → 0, ||a||_∞ = max_{i=1,…,d} |a_i| and ||a||_0 = #{i | a_i ≠ 0}.
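These quantities are one-liners to compute; a short sketch (helper names are ours):

```python
def q_norm(a, q):
    """||a||_q for q > 0 (a genuine norm only when q >= 1)."""
    return sum(abs(x) ** q for x in a) ** (1.0 / q)

def inf_norm(a):
    """Limit q -> infinity: the largest absolute entry."""
    return max(abs(x) for x in a)

def zero_norm(a):
    """Limit q -> 0: the number of nonzero entries (not an actual norm)."""
    return sum(1 for x in a if x != 0)

a = [3.0, -4.0, 0.0]
print(q_norm(a, 2), q_norm(a, 1), inf_norm(a), zero_norm(a))  # 5.0 7.0 4.0 2
```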
55 Tikhonov Regularization - Ridge Regression Tikhonov's motivation: solve ill-posed inverse problems by regularization. If min_α L(α) is achieved on many points... consider min_α L(α) + λ||α||₂². We can show that this leads to selecting α̂ = (X^T X + λ I_{d+1})^{-1} X^T Y. The condition number has changed to (λ_max(X^T X) + λ) / (λ_min(X^T X) + λ).
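A minimal ridge sketch for two regressors (an intercept column and one variable; the data are invented): solving (X^T X + λI)α = X^T Y by the 2×2 closed form shows the coefficients shrinking toward 0 as λ grows.

```python
def ridge_2d(rows, ys, lam):
    """Solve (X^T X + lam*I) alpha = X^T y, with rows (x0, x1) of the design matrix."""
    # Gram matrix X^T X plus the ridge term lam*I
    g00 = sum(r[0] * r[0] for r in rows) + lam
    g01 = sum(r[0] * r[1] for r in rows)
    g11 = sum(r[1] * r[1] for r in rows) + lam
    # Right-hand side X^T y
    b0 = sum(r[0] * y for r, y in zip(rows, ys))
    b1 = sum(r[1] * y for r, y in zip(rows, ys))
    # 2x2 inverse via Cramer's rule
    det = g00 * g11 - g01 * g01
    return ((g11 * b0 - g01 * b1) / det, (g00 * b1 - g01 * b0) / det)

rows = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]  # intercept + one variable
ys = [1.0, 3.0, 5.0, 7.0]                                # exactly y = 2x + 1
a0 = ridge_2d(rows, ys, 0.0)    # lam = 0 recovers plain least squares: (1, 2)
a1 = ridge_2d(rows, ys, 10.0)   # lam > 0 shrinks the coefficients toward 0
```

Note the regularized Gram matrix is always invertible for λ > 0, which is precisely Tikhonov's cure for the ill-posed case.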
56 Subset Selection: Exhaustive Search Following Occam's razor, ideally we would like to solve, for any value p, min_{α, ||α||_0 = p} L(α): select the best vector α which gives weights to only p variables, i.e. find the best combination of p variables. Practical implementation: for p ≤ n there are C(n, p) possible combinations of p variables. Brute-force approach: generate the C(n, p) regression problems and select the one that achieves the best RSS. Impossible in practice with moderately large n and p... C(30, 5) = 142506.
57 Subset Selection: Forward Search Since exact search is intractable in practice, consider the forward heuristic. Forward search: define I_1 = {0}. Given a set I_k ⊂ {0, …, d} of k variables, which is the most informative variable one could add? Compute, for each variable i in {0, …, d} \ I_k: t_i = min_{(α_k)_{k∈I_k}, α} Σ_{j=1}^N (y_j − Σ_{k∈I_k} α_k x_{k,j} − α x_{i,j})². Set I_{k+1} = I_k ∪ {i*} for any i* such that i* = argmin_i t_i; increment k; repeat until the desired number of variables is reached.
58 Subset Selection: Backward Search ... or the backward heuristic. Backward search: define I_{d+1} = {0, 1, …, d}. Given a set I_k ⊂ {0, …, d} of k variables, which is the least informative variable one could remove? Compute, for each variable i in I_k: t_i = min_{(α_k)_{k∈I_k\{i}}} Σ_{j=1}^N (y_j − Σ_{k∈I_k\{i}} α_k x_{k,j})². Set I_{k−1} = I_k \ {i*} for any i* such that i* = argmin_i t_i (the variable whose removal degrades the fit least); decrement k; repeat until the desired number of variables is reached.
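The forward heuristic fits in a few lines; here is a sketch in which ols_rss refits at each step by solving the normal equations with Gaussian elimination (all helper names and the toy data are ours):

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for c in range(n - 1, -1, -1):
        x[c] = (M[c][n] - sum(M[c][k] * x[k] for k in range(c + 1, n))) / M[c][c]
    return x

def ols_rss(cols, y):
    """RSS of the least-squares fit on the given regressor columns."""
    d, n = len(cols), len(y)
    G = [[sum(ci[t] * cj[t] for t in range(n)) for cj in cols] for ci in cols]
    b = [sum(ci[t] * y[t] for t in range(n)) for ci in cols]
    a = solve(G, b)
    return sum((y[t] - sum(a[i] * cols[i][t] for i in range(d))) ** 2
               for t in range(n))

def forward_select(cols, y, k):
    """Greedily add the variable whose inclusion yields the smallest RSS."""
    chosen, remaining = [], list(range(len(cols)))
    for _ in range(k):
        best = min(remaining,
                   key=lambda i: ols_rss([cols[j] for j in chosen + [i]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

cols = [[1.0, 1.0, 1.0, 1.0],   # constant
        [0.0, 1.0, 2.0, 3.0],   # informative: y = 2x + 1
        [5.0, 2.0, 7.0, 1.0]]   # irrelevant
y = [1.0, 3.0, 5.0, 7.0]
print(forward_select(cols, y, 2))  # picks the informative variable, then the constant
```

Backward search is the mirror image: start from all variables and repeatedly drop the one whose removal increases the RSS least.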
59 Subset Selection: LASSO Naive least squares: min_α L(α). Best fit with p variables (Occam!): min_{α, ||α||_0 = p} L(α). Tikhonov-regularized least squares: min_α L(α) + λ||α||₂². LASSO (least absolute shrinkage and selection operator): min_α L(α) + λ||α||₁.
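To make the contrast concrete, here is a coordinate-descent sketch for the objective ½||y − Σ_i α_i X_i||² + λ||α||₁, a standard way to solve the LASSO (the algorithm and toy data are ours, not from the slides): each coordinate update is a soft-thresholding, which is what sets coefficients exactly to zero.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding: the proximal operator of lam * |.|."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(cols, y, lam, sweeps=100):
    """Coordinate descent for min_a 0.5 * ||y - sum_i a_i X_i||^2 + lam * ||a||_1."""
    d, n = len(cols), len(y)
    a = [0.0] * d
    z = [sum(c[t] ** 2 for t in range(n)) for c in cols]
    for _ in range(sweeps):
        for i in range(d):
            # Residual with coordinate i left out
            r = [y[t] - sum(a[j] * cols[j][t] for j in range(d) if j != i)
                 for t in range(n)]
            rho = sum(cols[i][t] * r[t] for t in range(n))
            a[i] = soft_threshold(rho, lam) / z[i]
    return a

cols = [[1.0, 2.0, 3.0, 4.0],    # informative: y = 3 * X0
        [1.0, -1.0, 1.0, -1.0]]  # irrelevant
y = [3.0, 6.0, 9.0, 12.0]
a = lasso_cd(cols, y, lam=5.0)
print(a)  # the irrelevant coefficient is exactly 0; the useful one is slightly shrunk
```

Unlike the ridge penalty, which merely shrinks weights, the λ||α||₁ term drives small coefficients exactly to zero, so the LASSO performs variable selection and shrinkage at once.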
The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2
More information14 Multiple Linear Regression
B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in
More informationStatistics 203: Introduction to Regression and Analysis of Variance Penalized models
Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Jonathan Taylor - p. 1/15 Today s class Bias-Variance tradeoff. Penalized regression. Cross-validation. - p. 2/15 Bias-variance
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods
More information(II.B) Basis and dimension
(II.B) Basis and dimension How would you explain that a plane has two dimensions? Well, you can go in two independent directions, and no more. To make this idea precise, we formulate the DEFINITION 1.
More informationVectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =
Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.
More informationIntroduction to Statistical modeling: handout for Math 489/583
Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect
More informationCSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes
CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationMS&E 226. In-Class Midterm Examination Solutions Small Data October 20, 2015
MS&E 226 In-Class Midterm Examination Solutions Small Data October 20, 2015 PROBLEM 1. Alice uses ordinary least squares to fit a linear regression model on a dataset containing outcome data Y and covariates
More informationCS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS
CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Problems
More information18.06 Problem Set 3 - Solutions Due Wednesday, 26 September 2007 at 4 pm in
8.6 Problem Set 3 - s Due Wednesday, 26 September 27 at 4 pm in 2-6. Problem : (=2+2+2+2+2) A vector space is by definition a nonempty set V (whose elements are called vectors) together with rules of addition
More informationcxx ab.ec Warm up OH 2 ax 16 0 axtb Fix any a, b, c > What is the x 2 R that minimizes ax 2 + bx + c
Warm up D cai.yo.ie p IExrL9CxsYD Sglx.Ddl f E Luo fhlexi.si dbll Fix any a, b, c > 0. 1. What is the x 2 R that minimizes ax 2 + bx + c x a b Ta OH 2 ax 16 0 x 1 Za fhkxiiso3ii draulx.h dp.d 2. What is
More informationMathematical statistics
October 1 st, 2018 Lecture 11: Sufficient statistic Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. October 7, Efficiency: If size(w) = 100B, each prediction is expensive:
Simple Variable Selection LASSO: Sparse Regression Machine Learning CSE546 Carlos Guestrin University of Washington October 7, 2013 1 Sparsity Vector w is sparse, if many entries are zero: Very useful
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationLecture 6: Linear Regression (continued)
Lecture 6: Linear Regression (continued) Reading: Sections 3.1-3.3 STATS 202: Data mining and analysis October 6, 2017 1 / 23 Multiple linear regression Y = β 0 + β 1 X 1 + + β p X p + ε Y ε N (0, σ) i.i.d.
More informationLinear Models Review
Linear Models Review Vectors in IR n will be written as ordered n-tuples which are understood to be column vectors, or n 1 matrices. A vector variable will be indicted with bold face, and the prime sign
More informationCOS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10
COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationMatrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =
30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can
More informationIs the test error unbiased for these programs?
Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 24) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu October 2, 24 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 24) October 2, 24 / 24 Outline Review
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7
MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,
More informationRegression Models - Introduction
Regression Models - Introduction In regression models, two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent variable,
More informationContainment restrictions
Containment restrictions Tibor Szabó Extremal Combinatorics, FU Berlin, WiSe 207 8 In this chapter we switch from studying constraints on the set operation intersection, to constraints on the set relation
More informationMatrices and vectors A matrix is a rectangular array of numbers. Here s an example: A =
Matrices and vectors A matrix is a rectangular array of numbers Here s an example: 23 14 17 A = 225 0 2 This matrix has dimensions 2 3 The number of rows is first, then the number of columns We can write
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationMethods for sparse analysis of high-dimensional data, II
Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationScientific Computing: Dense Linear Systems
Scientific Computing: Dense Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 February 9th, 2012 A. Donev (Courant Institute)
More informationRegression: Lecture 2
Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and
More information11 : Gaussian Graphic Models and Ising Models
10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood
More information