Foundation of Intelligent Systems, Part I. Regression 2
1 Foundation of Intelligent Systems, Part I Regression 2 mcuturi@i.kyoto-u.ac.jp FIS
2 Some Words on the Survey Not enough answers to say anything meaningful! Try again: survey.
3 Last Week Regression: highlight a functional relationship between a predicted variable and predictors
4 Last Week Regression: find a function f such that, for every pair (x, y) that can appear, f(x) ≈ y
5 Last Week Regression: to find an accurate function f, use a data set and the least-squares criterion: min_{f ∈ F} (1/N) Σ_{j=1}^N (y_j − f(x_j))²
6 Last Week Regression: when regressing a real number vs. a real number: [scatter plot of Rent (×1000 JPY) vs. Surface, with linear, quadratic, cubic, 4th- and 5th-degree polynomial fits; the fitted coefficients did not survive transcription]. The least-squares criterion L(b, a_1, …, a_p) is used to fit lines and polynomials, and results in solving a linear system: L is of 2nd order in (b, a_1, …, a_p), so each partial derivative ∂L/∂a_k is linear in (b, a_1, …, a_p). Setting ∂L/∂a_k = 0 for each parameter, we get p+1 linear equations for p+1 variables.
7 Last Week Regression: when regressing a real number vs. d real numbers (vectors in R^d), find the best fit α ∈ R^{d+1} such that α^T x + α_0 ≈ y. Add to the d × N data matrix a row of 1's to get the predictor matrix X, and let Y be the row of predicted values. The least-squares criterion also applies: L(α) = ||Y − α^T X||² = α^T X X^T α − 2 Y X^T α + ||Y||². Setting ∇_α L = 0 gives α = (X X^T)^{-1} X Y^T. This works if X X^T ∈ R^{(d+1)×(d+1)} is invertible.
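As a sanity check, the closed form α = (X X^T)^{-1} X Y^T can be specialised to one predictor plus an intercept row of 1's, where the 2×2 normal equations can be solved by hand. A minimal sketch (the function name and toy data are ours, for illustration):

```python
# Least-squares fit of y ≈ a*x + b, i.e. alpha = (X X^T)^{-1} X Y^T
# where X is the 2 x N matrix with a row of x values and a row of 1's.
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations: [[sxx, sx], [sx, n]] @ (a, b) = (sxy, sy),
    # solved by Cramer's rule (valid whenever X X^T is invertible).
    det = sxx * n - sx * sx
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies exactly on y = 2x + 1
```

On this noiseless data the fit recovers a = 2 and b = 1 exactly.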
8 Last Week In Matlab: >> (X*X')\(X*Y') returns the fitted weights for the predictors x_age, x_surface and x_distance, plus the intercept in JPY (the numeric values did not survive transcription).
9 Today A statistical / probabilistic perspective on LS regression; a few words on polynomials in higher dimensions; a geometric perspective; variable co-linearity and the overfitting problem; some solutions, i.e. advanced regression techniques: subset selection, ridge regression, the Lasso
10 A (very) few words on the statistical / probabilistic interpretation of LS
11 The Statistical Perspective on Regression Assume that the values of y are stochastically linked to the observations x through y − (α^T x + β) ∼ N(0, σ²). This difference is a random variable ε called the residual.
12 The Statistical Perspective on Regression This can be rewritten as y = (α^T x + β) + ε, with ε ∼ N(0, σ²): we assume that the difference between y and (α^T x + β) behaves like a Gaussian (normally distributed) random variable. Goal as a statistician: estimate α and β given the observations.
13 Identically Independently Distributed (i.i.d.) Observations Statistical hypothesis: assume that the parameters are α = a, β = b.
14 In such a case, what would be the probability of each observation (x_j, y_j)?
15 For each couple (x_j, y_j), j = 1, …, N: P(x_j, y_j | α = a, β = b) = (1/√(2πσ²)) exp(−(y_j − (a^T x_j + b))² / (2σ²))
16 Since each measurement (x_j, y_j) has been independently sampled: P({(x_j, y_j)}_{j=1,…,N} | α = a, β = b) = Π_{j=1}^N (1/√(2πσ²)) exp(−(y_j − (a^T x_j + b))² / (2σ²))
17 Seen as a function of a and b, this quantity is the likelihood of the dataset {(x_j, y_j)}_{j=1,…,N}: L_{{(x_j,y_j)}}(a, b) = Π_{j=1}^N (1/√(2πσ²)) exp(−(y_j − (a^T x_j + b))² / (2σ²))
18 Maximum Likelihood Estimation (MLE) of Parameters Hence, for (a, b), the likelihood function on the dataset {(x_j, y_j)}_{j=1,…,N} is L(a, b) = Π_{j=1}^N (1/√(2πσ²)) exp(−(y_j − (a^T x_j + b))² / (2σ²))
19 Why not use the likelihood to guess (a, b) given the data?
20 The MLE approach selects the values of (a, b) which maximize L(a, b).
21 Since max_{(a,b)} L(a, b) ⇔ max_{(a,b)} log L(a, b), and log L(a, b) = C − (1/(2σ²)) Σ_{j=1}^N (y_j − (a^T x_j + b))², we get max_{(a,b)} L(a, b) ⇔ min_{(a,b)} Σ_{j=1}^N (y_j − (a^T x_j + b))²: under Gaussian noise, MLE recovers least squares.
22 Statistical Approach to Linear Regression Properties of the MLE estimator: convergence of α̂ to a? Confidence intervals for the coefficients; test procedures to assess whether the model fits the data; [histogram of the residuals' relative frequencies; axis values lost in transcription]. Bayesian approaches: instead of looking for one optimal fit (a, b), juggle with a whole density on (a, b) to make decisions, etc.
23 A few words on polynomials in higher dimensions
24 A few words on polynomials in higher dimensions For d variables, that is for points x ∈ R^d, the space of polynomials in these variables up to degree p is generated by {x^u | u ∈ N^d, u = (u_1, …, u_d), Σ_{i=1}^d u_i ≤ p}, where the monomial x^u is defined as x_1^{u_1} x_2^{u_2} ⋯ x_d^{u_d}. Recurrence for the dimension of that space: dim_{p+1} = dim_p + C(d+p, p+1). For d = 20 and p = 5 the dimension is C(25, 5) = 53130. The problem with polynomial interpolation in high dimensions is this explosion in the number of relevant variables (one for each monomial).
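The count is easy to check directly: the number of monomials in d variables of total degree at most p is C(d+p, p), and it satisfies the recurrence above via Pascal's rule. A small sketch (the helper name is ours):

```python
from math import comb

def n_monomials(d, p):
    """Dimension of the space of d-variate polynomials of degree <= p."""
    return comb(d + p, p)

# The recurrence dim_{p+1} = dim_p + comb(d+p, p+1) adds exactly the
# monomials of total degree p+1 (Pascal's rule on binomial coefficients).
d = 20
assert all(n_monomials(d, p + 1) == n_monomials(d, p) + comb(d + p, p + 1)
           for p in range(6))

print(n_monomials(20, 5))  # 53130 monomials for d = 20, p = 5
```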
25 Geometric Perspective
26 Back to Basics Recall the problem: X = [x_1 x_2 ⋯ x_N] ∈ R^{(d+1)×N} and Y = [y_1 ⋯ y_N] ∈ R^N. We look for α such that α^T X ≈ Y.
27 Back to Basics If we transpose this expression we get X^T α ≈ Y^T: the N × (d+1) matrix whose j-th row is (1, x_{1,j}, …, x_{d,j}), multiplied by the column (α_0, …, α_d), should approximate the column (y_1, …, y_N). Using the notation Y = Y^T, X = X^T and X_k for the (k+1)-th column of X, this reads Σ_{k=0}^d α_k X_k ≈ Y. Note how X_k collects all values taken by the k-th variable. Problem: approximate/reconstruct Y ∈ R^N using X_0, X_1, …, X_d ∈ R^N.
28 Linear System Reconstructing Y ∈ R^N using the vectors X_0, X_1, …, X_d of R^N: our ability to approximate Y depends implicitly on the space spanned by X_0, X_1, …, X_d. Consider the observed vector Y ∈ R^N of predicted values.
29 Plot the first regressor X_0...
30 Assume the next regressor X_1 is colinear to X_0...
31 ...and so is X_2...
32 Very few choices to approximate Y...
33 Suppose X_2 is actually not colinear to X_0.
34 This opens new ways to reconstruct Y.
35 When X_0, X_1, X_2 are linearly independent,
36 Y is in their span, since that span then has dimension 3 (the whole space in the figure).
37 Linear System The dimension of that space is Rank(X), the rank of X: Rank(X) ≤ min(d+1, N).
38 Linear System Three cases depending on Rank(X), d and N: 1. Rank(X) < N: the d+1 column vectors do not span R^N, so for arbitrary Y there is no solution to α^T X = Y. 2. Rank(X) = N and d+1 > N: too many variables; they span the whole of R^N, so there are infinitely many solutions to α^T X = Y. 3. Rank(X) = N and d+1 = N: # variables = # observations, with an exact and unique solution α = X^{-1} Y satisfying α^T X = Y. In most applications d+1 ≠ N, so we are either in case 1 or case 2.
39 Case 1: Rank(X) < N No solution to α^T X = Y (equivalently Xα = Y) in the general case. What about the orthogonal projection of Y onto the image of X, span{X_0, X_1, …, X_d}? Namely, the point Ŷ such that Ŷ = argmin_{u ∈ span{X_0, X_1, …, X_d}} ||Y − u||.
40 Case 1: Rank(X) < N Lemma 1: {X_0, X_1, …, X_d} is a linearly independent family ⇔ X^T X is invertible.
41 Case 1: Rank(X) < N Computing the projection ω̂ of a point ω on a subspace V is well understood. In particular, if (X_0, X_1, …, X_d) is a basis of span{X_0, X_1, …, X_d} (that is, {X_0, X_1, …, X_d} is a linearly independent family), then X^T X is invertible and Ŷ = X(X^T X)^{-1} X^T Y. This gives us the vector of weights we are looking for: α̂ = (X^T X)^{-1} X^T Y, so that Ŷ = X α̂ ≈ Y, i.e. α̂^T X ≈ Y. What can go wrong?
42 Case 1: Rank(X) < N If X^T X is invertible, Ŷ = X(X^T X)^{-1} X^T Y. If X^T X is not invertible... we have a problem. If X^T X's condition number, λ_max(X^T X) / λ_min(X^T X), is very large, a small change in Y can cause dramatic changes in α. In this case the linear system is said to be badly conditioned... and using the formula Ŷ = X(X^T X)^{-1} X^T Y might return garbage, as can be seen in the following Matlab example.
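The effect can be reproduced in a few lines (the two-column design below is invented for the example): take two almost-colinear regressors and watch the condition number of X^T X blow up as they get closer.

```python
import math

def cond_gram(eps):
    """Condition number of X^T X for the two columns (1, 1) and (1, 1 + eps)."""
    # Entries of the 2x2 Gram matrix X^T X
    a = 2.0                       # X0 . X0
    b = 2.0 + eps                 # X0 . X1
    c = 2.0 + 2 * eps + eps ** 2  # X1 . X1
    # Eigenvalues of a symmetric 2x2 matrix from its trace and determinant
    tr, det = a + c, a * c - b * b
    root = math.sqrt(tr * tr - 4 * det)
    lam_max, lam_min = (tr + root) / 2, (tr - root) / 2
    return lam_max / lam_min

print(cond_gram(1e-1))  # already large
print(cond_gram(1e-4))  # enormous: the system is badly conditioned
```

As eps shrinks, the determinant of the Gram matrix goes to 0 and the condition number grows roughly like 16/eps², so tiny perturbations of Y swing the fitted weights wildly.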
43 Case 2: Rank(X) = N and d+1 > N The high-dimensional, low-sample setting. Ill-posed inverse problem: the set {α ∈ R^{d+1} | Xα = Y} is a whole affine subspace, so we need to choose one among many admissible points. When does this happen? In high-dimensional, low-sample settings (DNA chips, multimedia, etc.). How to solve this? Use something called regularization.
44 A practical perspective: Colinearity and Overfitting
45 A Few High-dimension, Low-sample Settings DNA chips are very long vectors of measurements, one for each gene. Task: regress a health-related variable against gene expression levels.
46 E-mails represented as bag-of-words vectors in R^d, e.g. please = 2, send = 1, money = 2, assignment = 0. Task: regress the probability that an e-mail is spam against its bag-of-words vector.
47 Correlated Variables Suppose you run a real-estate company. For each apartment you have compiled a few hundred predictor variables, e.g. distances to the nearest convenience store, pharmacy, supermarket, parking lot, etc.; distances to all main locations in Kansai; socio-economic variables of the neighborhood; characteristics of the apartment. Some are obviously correlated (correlated ≈ almost colinear), e.g. distance to the Post Office vs. distance to the nearest postal ATM. In that case, we may have some problems (Matlab example).
48 Overfitting Given d variables (including the constant variable), consider the least-squares criterion L_d(α_1, …, α_d) = Σ_j (y_j − Σ_{i=1}^d α_i x_{i,j})². Add any variable vector x_{d+1,j}, j = 1, …, N, and define L_{d+1}(α_1, …, α_d, α_{d+1}) = Σ_j (y_j − Σ_{i=1}^d α_i x_{i,j} − α_{d+1} x_{d+1,j})².
49 Then min_{α ∈ R^{d+1}} L_{d+1}(α) ≤ min_{α ∈ R^d} L_d(α).
50 Why? Because L_d(α_1, …, α_d) = L_{d+1}(α_1, …, α_d, 0).
51 The residual sum of squares goes down... but is it relevant to add variables?
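The inequality min L_{d+1} ≤ min L_d is easy to see on a toy example (data invented): fitting a constant alone, then the same constant plus one extra variable, can only lower the residual sum of squares.

```python
def rss_const(ys):
    """Best RSS using only the constant regressor (the optimum is the mean)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def rss_const_plus_x(xs, ys):
    """Best RSS using the constant regressor plus one variable x."""
    n, sx, sy = len(xs), sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Closed-form least-squares fit y ≈ a*x + b via the 2x2 normal equations
    det = sxx * n - sx * sx
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))

xs, ys = [0.0, 1.0, 2.0], [1.0, 2.0, 4.0]
assert rss_const_plus_x(xs, ys) <= rss_const(ys)  # adding a variable never raises the RSS
```

Here the extra variable happens to be informative, but the inequality would hold even for a column of pure noise, which is exactly why a shrinking RSS is not evidence that the added variable is relevant.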
52 Occam's Razor A formalization of overfitting. Minimizing least squares (RSS) is not clever enough; we need another idea to avoid overfitting. Occam's razor, lex parsimoniae (law of parsimony): the principle that recommends selecting the hypothesis that makes the fewest assumptions; one should always opt for an explanation in terms of the fewest possible causes, factors, or variables (Wikipedia). William of Ockham, died 1347.
53 Advanced Regression Techniques
54 Quick Reminder on Vector Norms For a vector a ∈ R^d, the Euclidean norm is the quantity ||a||_2 = √(Σ_{i=1}^d a_i²). More generally, the q-norm is, for q > 0, ||a||_q = (Σ_{i=1}^d |a_i|^q)^{1/q}. In particular, for q = 1, ||a||_1 = Σ_{i=1}^d |a_i|. In the limits q → ∞ and q → 0, ||a||_∞ = max_{i=1,…,d} |a_i| and ||a||_0 = #{i | a_i ≠ 0}.
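These quantities are one-liners to compute; a short sketch (helper names are ours):

```python
def q_norm(a, q):
    """||a||_q for q > 0 (a genuine norm only when q >= 1)."""
    return sum(abs(x) ** q for x in a) ** (1.0 / q)

def inf_norm(a):
    """Limit q -> infinity: the largest absolute entry."""
    return max(abs(x) for x in a)

def zero_norm(a):
    """Limit q -> 0: the number of nonzero entries (not an actual norm)."""
    return sum(1 for x in a if x != 0)

a = [3.0, -4.0, 0.0]
print(q_norm(a, 2), q_norm(a, 1), inf_norm(a), zero_norm(a))  # 5.0 7.0 4.0 2
```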
55 Tikhonov Regularization - Ridge Regression Tikhonov's motivation: solve ill-posed inverse problems by regularization. If min_α L(α) is achieved on many points... consider min_α L(α) + λ||α||₂². We can show that this leads to selecting α̂ = (X^T X + λ I_{d+1})^{-1} X^T Y. The condition number has changed to (λ_max(X^T X) + λ) / (λ_min(X^T X) + λ).
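A minimal ridge sketch for two regressors (an intercept column and one variable; the data are invented): solving (X^T X + λI)α = X^T Y by the 2×2 closed form shows the coefficients shrinking toward 0 as λ grows.

```python
def ridge_2d(rows, ys, lam):
    """Solve (X^T X + lam*I) alpha = X^T y, with rows (x0, x1) of the design matrix."""
    # Gram matrix X^T X plus the ridge term lam*I
    g00 = sum(r[0] * r[0] for r in rows) + lam
    g01 = sum(r[0] * r[1] for r in rows)
    g11 = sum(r[1] * r[1] for r in rows) + lam
    # Right-hand side X^T y
    b0 = sum(r[0] * y for r, y in zip(rows, ys))
    b1 = sum(r[1] * y for r, y in zip(rows, ys))
    # 2x2 inverse via Cramer's rule
    det = g00 * g11 - g01 * g01
    return ((g11 * b0 - g01 * b1) / det, (g00 * b1 - g01 * b0) / det)

rows = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]  # intercept + one variable
ys = [1.0, 3.0, 5.0, 7.0]                                # exactly y = 2x + 1
a0 = ridge_2d(rows, ys, 0.0)    # lam = 0 recovers plain least squares: (1, 2)
a1 = ridge_2d(rows, ys, 10.0)   # lam > 0 shrinks the coefficients toward 0
```

Note the regularized Gram matrix is always invertible for λ > 0, which is precisely Tikhonov's cure for the ill-posed case.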
56 Subset Selection: Exhaustive Search Following Occam's razor, ideally we would like to solve, for any value p, min_{α, ||α||_0 = p} L(α): select the best vector α which gives weights to only p variables, i.e. find the best combination of p variables. Practical implementation: for p ≤ n there are C(n, p) possible combinations of p variables. Brute-force approach: generate the C(n, p) regression problems and select the one that achieves the best RSS. Impossible in practice with moderately large n and p... C(30, 5) = 142506.
57 Subset Selection: Forward Search Since exact search is intractable in practice, consider the forward heuristic. Forward search: define I_1 = {0}. Given a set I_k ⊂ {0, …, d} of k variables, which is the most informative variable one could add? Compute, for each variable i in {0, …, d} \ I_k: t_i = min_{(α_k)_{k∈I_k}, α} Σ_{j=1}^N (y_j − Σ_{k∈I_k} α_k x_{k,j} − α x_{i,j})². Set I_{k+1} = I_k ∪ {i*} for any i* such that i* = argmin_i t_i; increment k; repeat until the desired number of variables is reached.
58 Subset Selection: Backward Search ... or the backward heuristic. Backward search: define I_{d+1} = {0, 1, …, d}. Given a set I_k ⊂ {0, …, d} of k variables, which is the least informative variable one could remove? Compute, for each variable i in I_k: t_i = min_{(α_k)_{k∈I_k\{i}}} Σ_{j=1}^N (y_j − Σ_{k∈I_k\{i}} α_k x_{k,j})². Set I_{k−1} = I_k \ {i*} for any i* such that i* = argmin_i t_i (the variable whose removal degrades the fit least); decrement k; repeat until the desired number of variables is reached.
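The forward heuristic fits in a few lines; here is a sketch in which ols_rss refits at each step by solving the normal equations with Gaussian elimination (all helper names and the toy data are ours):

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for c in range(n - 1, -1, -1):
        x[c] = (M[c][n] - sum(M[c][k] * x[k] for k in range(c + 1, n))) / M[c][c]
    return x

def ols_rss(cols, y):
    """RSS of the least-squares fit on the given regressor columns."""
    d, n = len(cols), len(y)
    G = [[sum(ci[t] * cj[t] for t in range(n)) for cj in cols] for ci in cols]
    b = [sum(ci[t] * y[t] for t in range(n)) for ci in cols]
    a = solve(G, b)
    return sum((y[t] - sum(a[i] * cols[i][t] for i in range(d))) ** 2
               for t in range(n))

def forward_select(cols, y, k):
    """Greedily add the variable whose inclusion yields the smallest RSS."""
    chosen, remaining = [], list(range(len(cols)))
    for _ in range(k):
        best = min(remaining,
                   key=lambda i: ols_rss([cols[j] for j in chosen + [i]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

cols = [[1.0, 1.0, 1.0, 1.0],   # constant
        [0.0, 1.0, 2.0, 3.0],   # informative: y = 2x + 1
        [5.0, 2.0, 7.0, 1.0]]   # irrelevant
y = [1.0, 3.0, 5.0, 7.0]
print(forward_select(cols, y, 2))  # picks the informative variable, then the constant
```

Backward search is the mirror image: start from all variables and repeatedly drop the one whose removal increases the RSS least.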
59 Subset Selection: LASSO Naive least squares: min_α L(α). Best fit with p variables (Occam!): min_{α, ||α||_0 = p} L(α). Tikhonov-regularized least squares: min_α L(α) + λ||α||₂². LASSO (least absolute shrinkage and selection operator): min_α L(α) + λ||α||₁.
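To make the contrast concrete, here is a coordinate-descent sketch for the objective ½||y − Σ_i α_i X_i||² + λ||α||₁, a standard way to solve the LASSO (the algorithm and toy data are ours, not from the slides): each coordinate update is a soft-thresholding, which is what sets coefficients exactly to zero.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding: the proximal operator of lam * |.|."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(cols, y, lam, sweeps=100):
    """Coordinate descent for min_a 0.5 * ||y - sum_i a_i X_i||^2 + lam * ||a||_1."""
    d, n = len(cols), len(y)
    a = [0.0] * d
    z = [sum(c[t] ** 2 for t in range(n)) for c in cols]
    for _ in range(sweeps):
        for i in range(d):
            # Residual with coordinate i left out
            r = [y[t] - sum(a[j] * cols[j][t] for j in range(d) if j != i)
                 for t in range(n)]
            rho = sum(cols[i][t] * r[t] for t in range(n))
            a[i] = soft_threshold(rho, lam) / z[i]
    return a

cols = [[1.0, 2.0, 3.0, 4.0],    # informative: y = 3 * X0
        [1.0, -1.0, 1.0, -1.0]]  # irrelevant
y = [3.0, 6.0, 9.0, 12.0]
a = lasso_cd(cols, y, lam=5.0)
print(a)  # the irrelevant coefficient is exactly 0; the useful one is slightly shrunk
```

Unlike the ridge penalty, which merely shrinks weights, the λ||α||₁ term drives small coefficients exactly to zero, so the LASSO performs variable selection and shrinkage at once.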
The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2
More information14 Multiple Linear Regression
B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in
More informationStatistics 203: Introduction to Regression and Analysis of Variance Penalized models
Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Jonathan Taylor - p. 1/15 Today s class Bias-Variance tradeoff. Penalized regression. Cross-validation. - p. 2/15 Bias-variance
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods
More information(II.B) Basis and dimension
(II.B) Basis and dimension How would you explain that a plane has two dimensions? Well, you can go in two independent directions, and no more. To make this idea precise, we formulate the DEFINITION 1.
More informationVectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =
Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.
More informationIntroduction to Statistical modeling: handout for Math 489/583
Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect
More informationCSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes
CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationMS&E 226. In-Class Midterm Examination Solutions Small Data October 20, 2015
MS&E 226 In-Class Midterm Examination Solutions Small Data October 20, 2015 PROBLEM 1. Alice uses ordinary least squares to fit a linear regression model on a dataset containing outcome data Y and covariates
More informationCS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS
CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Problems
More information18.06 Problem Set 3 - Solutions Due Wednesday, 26 September 2007 at 4 pm in
8.6 Problem Set 3 - s Due Wednesday, 26 September 27 at 4 pm in 2-6. Problem : (=2+2+2+2+2) A vector space is by definition a nonempty set V (whose elements are called vectors) together with rules of addition
More informationcxx ab.ec Warm up OH 2 ax 16 0 axtb Fix any a, b, c > What is the x 2 R that minimizes ax 2 + bx + c
Warm up D cai.yo.ie p IExrL9CxsYD Sglx.Ddl f E Luo fhlexi.si dbll Fix any a, b, c > 0. 1. What is the x 2 R that minimizes ax 2 + bx + c x a b Ta OH 2 ax 16 0 x 1 Za fhkxiiso3ii draulx.h dp.d 2. What is
More informationMathematical statistics
October 1 st, 2018 Lecture 11: Sufficient statistic Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. October 7, Efficiency: If size(w) = 100B, each prediction is expensive:
Simple Variable Selection LASSO: Sparse Regression Machine Learning CSE546 Carlos Guestrin University of Washington October 7, 2013 1 Sparsity Vector w is sparse, if many entries are zero: Very useful
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationLecture 6: Linear Regression (continued)
Lecture 6: Linear Regression (continued) Reading: Sections 3.1-3.3 STATS 202: Data mining and analysis October 6, 2017 1 / 23 Multiple linear regression Y = β 0 + β 1 X 1 + + β p X p + ε Y ε N (0, σ) i.i.d.
More informationLinear Models Review
Linear Models Review Vectors in IR n will be written as ordered n-tuples which are understood to be column vectors, or n 1 matrices. A vector variable will be indicted with bold face, and the prime sign
More informationCOS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10
COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationMatrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =
30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can
More informationIs the test error unbiased for these programs?
Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 24) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu October 2, 24 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 24) October 2, 24 / 24 Outline Review
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7
MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,
More informationRegression Models - Introduction
Regression Models - Introduction In regression models, two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent variable,
More informationContainment restrictions
Containment restrictions Tibor Szabó Extremal Combinatorics, FU Berlin, WiSe 207 8 In this chapter we switch from studying constraints on the set operation intersection, to constraints on the set relation
More informationMatrices and vectors A matrix is a rectangular array of numbers. Here s an example: A =
Matrices and vectors A matrix is a rectangular array of numbers Here s an example: 23 14 17 A = 225 0 2 This matrix has dimensions 2 3 The number of rows is first, then the number of columns We can write
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationMethods for sparse analysis of high-dimensional data, II
Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationScientific Computing: Dense Linear Systems
Scientific Computing: Dense Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 February 9th, 2012 A. Donev (Courant Institute)
More informationRegression: Lecture 2
Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and
More information11 : Gaussian Graphic Models and Ising Models
10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood
More information