Foundation of Intelligent Systems, Part I. Regression

Size: px

Start display at page:

Download "Foundation of Intelligent Systems, Part I. Regression"

Rhoda Sharp
5 years ago
Views:

1 Foundation of Intelligent Systems, Part I Regression mcuturi@i.kyoto-u.ac.jp FIS

2 Before starting Please take this survey before the end of this week. Here are a few books which you can check beyond the slides. Elements of Statistical Learning, Hastie Tibshirani Friedman Pattern Recognition, Theodoridis Koutroumbras Pattern Recognition & Machine Learning, Bishop You can also check Andrew Ng s video lectures (Stanford) FIS

3 Fundamentals in Regression Can be studied from different viewpoints: statistical, linear algebra, AI... etc.. Linear regression is currently revived by different ideas in sparsity Lasso (1996 ) SVM for regression (1996 ) Compressed Sensing (2002 ) FIS

4 One of the most standard data analysis tasks: Regression Data: many observations of the same data type We have a database {x 1,,x N }. Each datapoint x j can be encoded as a vector of features x j = x 1,j x 2,j. x d,j Each feature x i,j (1 i d) of a given point x j is a number. FIS

5 One of the most standard data analysis tasks: Regression This database can be seen as a R d N matrix {x 1,,x N } x 1,1 x 1,2 x 1,N x 2,1 x 2,2 x 2,N x 3,1 x 3,2 x 3,N... x d,1 x d,2 x d,n FIS

6 Examples Credit card holder x j = Patient x j = Blog x j = Income Age. Work history (months) Family # Credit Incidents height weight. # minutes exercise/week LDL cholesterol HDL cholesterol avg. pages view/month # posts. avg. # comments/month revenue from ads/month FIS

7 Within such variables... Some variables are very cheap to measure, others very expensive. Some variables might have a causal effect on other variables. In the regression setting, the d variables are split between k regressor (or predictor) variables d k response (or predicted) variables. FIS

8 Regression In the regression setting, the d variables are split between k regressor (or predictor) variables d k response (or predicted) variables. guess response(expensive) variables using regressor(cheap) ones. FIS

9 The Regression Problem Given, A database {x 1,,x N } X = x 1,1 x 1,2 x 1,N x 2,1 x 2,2 x 2,N x 3,1 x 3,2 x 3,N... x d,1 x d,2 x d,n FIS

10 The Regression Problem Given, A database {x 1,,x N } X = x 1,1 x 1,2 x 1,N x 2,1 x 2,2 x 2,N x 3,1 x 3,2 x 3,N... x d 1,1 x d 1,2 x d 1,N x d,1 x d,2 x d,n A set of k regressors variables Reg {1,,d} A set of d k response variable Res {1,,d} FIS

11 The Regression Problem Regression = build a function f : R k R d-k such that, x,f((x i ) i Reg ) (x k ) k Res. e.g. if d = 6, k = 4, Reg= {1,2,3,4}, Res= {5,6} we look for a function f : R 4 R 2, f(x 1,x 2,x 3,x 4 ) (x 5,x 6 ) FIS

12 Examples continued Credit card holder x j = Patient x j = Blog x j = Income Age. Work history (months) Family # Credit Incidents height weight. # minutes exercise/week LDL cholesterol HDL cholesterol avg. pages view/month # posts. avg. # comments/month revenue from ads/month FIS

13 Examples continued Credit card holder x j = Patient x j = Blog x j = Income Age. Work history (months) Family # Credit Incidents height weight. # minutes exercise/week LDL cholesterol HDL cholesterol avg. pages view/month # posts. avg. # comments/month revenue from ads/month FIS

14 In the following slides... We only consider tasks with one response variable All other variables are regressors. We rename the response variable y and reassign x 1,,x d for the regressors predicting more than one variable? heavier mathematically, but similar. FIS

15 In the following slides... We assume that y takes continuous values. When y takes discrete values, notably binary {0, 1} things change a bit. Yet... binary real : regression techniques work on discrete data but real binary... we ll discuss that later. FIS

16 Today s Example: Your apartment Collected information about 285 (out of 1226) apartments close to Kyoto U. Source: FIS

17 Today s Example: Your apartment Collected information about 285 (out of 1226) apartments close to Kyoto U. Kept 4 variables: Surface, Rent, Age of Building, Walking distance to station. FIS

18 What does the matrix look like? imagecs(h); colorbar; 285 columns, 4 lines. Each column represents one apartment. In these slides, we will regress the rent using age, surface and distance FIS

19 Regression: one variable vs. another FIS

20 Rent vs. Surface 14 Scatter plot of Rent vs. Surface apartment 12 Rent (x JPY) Surface Note that the dataset has been censored above JPY FIS

21 Rent vs. Surface 14 y = 0.13*x Scatter plot of Rent vs. Surface apartment linear Rent (x JPY) Surface Using the linear tool in curve fitting, we obtain the approximation y = 0.13x+2.4 FIS

22 Rent vs. Surface Scatter plot of Rent vs. Surface y = 0.13*x y = *x *x 0.71 y = 6.2e 05*x *x *x 3.1 y = 1.4e 06*x *x *x *x 1.2 y = 1.8e 07*x e 05*x *x *x 2 1.1*x apartment linear quadratic cubic 4th degree 5th degree Rent (x JPY) Surface We can use higher order polynomials... yet look at the results. FIS

23 Behind the curve fitting tool Matlab selects these curves using the least-squares criterion e.g min f F N (y j f(x j )) 2 where F is a class of functions Matlab considers a few function classes for F. FIS

24 Behind the curve fitting tool Matlab selects these curves using the least-squares criterion e.g min f F N (y j f(x j )) 2 where F is a class of functions Matlab considers a few function classes for F. Among them.. Linear min b,a 1 R N (y j (b + a 1 x j )) 2 FIS

25 Behind the curve fitting tool Matlab selects these curves using the least-squares criterion e.g min f F N (y j f(x j )) 2 where F is a class of functions Matlab considers a few function classes for F. Among them.. Linear min b,a 1 R N (y j (b + a 1 x j )) 2 Quadratic min b,a 1,a 2 R N ) 2 (y j (b + a 1 x j +a 2 x 2j ) FIS

26 Behind the curve fitting tool Matlab selects these curves using the least-squares criterion e.g min f F N (y j f(x j )) 2 where F is a class of functions Matlab considers a few function classes for F. Among them.. Linear min b,a 1 R N (y j (b + a 1 x j )) 2 Quadratic min b,a 1,a 2 R Cubic etc. min b,a 1,a 2,a 3 R N ) 2 (y j (b + a 1 x j +a 2 x 2j ) N ) 2 (y j (b + a 1 x j +a 2 x 2j +a 3x 3j ) FIS

27 How can we solve this? The linear case Let s take a look at the function (a,b) N (y j (b+ax j )) 2. Using the notations Rent Y = [ y 1 y 2 y N ] Surface X = [ x 1 x 2 x N ] Constant 1 N = [ ] we have that N (y j (ax j +b)) 2 = Y ax b1 N 2 FIS

28 Contour plot of (a,b) Y ax b1 N 2 Since we only handle 2 parameters, we can make a contour plot 3 norm(h(:,4) a H(:,2) b) b a This validates the equation y = 0.13x+2.4. How to get there? FIS

29 We define the function L as Some linear algebra L : (a,b) N (y j (b+ax j )) 2 The partial derivatives of L can be computed. N L a = 2 L b = 2 N (y j (b+ax j ))x j y j (b+ax j ) Any minimum (a,b ) of L must be a saddle point. FIS

30 Some linear algebra Namely, the partial derivatives of L at (a,b ) must be zero L ( a = 2 a x 2 j +b x j ) y j x j L ( b = 2 Nb y j +a ) x j Hence (a,b ) must satisfy the linear system 0 = a x 2 j +b x j y j x j 0 = Nb y j +a x j Namely, [ a b ] = [ ] x 2 1 [ ] j xj yj yj x j xj N ans = FIS

31 Rent vs. Surface Scatter plot of Rent vs. Surface y = 0.13*x y = *x *x 0.71 y = 6.2e 05*x *x *x 3.1 y = 1.4e 06*x *x *x *x 1.2 y = 1.8e 07*x e 05*x *x *x 2 1.1*x apartment linear quadratic cubic 4th degree 5th degree Rent (x JPY) Surface We understood how to get the linear curve. What about the quadratic? FIS

32 same idea... define What about the quadratic case? Quadratic min b,a 1,a 2 R N ) (y j (b + a 1 x j + a 2 x 2j ) L : (a 1,a 2,b) N ( yj ( b+a 1 x j +a 2 x 2 2 j)) look at the objective s derivatives... L a 2 = 2 L a 1 = 2 N N L b = 2 N ( yj ( b+a 1 x j +a 2 x 2 j)) x 2 j ( yj ( b+a 1 x j +a 2 x 2 )) j xj ( yj ( b+a 1 x j +a 2 x 2 j)) FIS

33 What about the quadratic case? We consider the equations that a saddle point must verify: 0 = 0 = N N 0 = ( yj ( b +a 1x j +a 2x 2 j)) x 2 j ( yj ( )) b +a 1x j +a 2x 2 j xj N ( yj ( b +a 1x j +a 2x 2 j)) a 2 a 1 b = x 4 j x 3 j x 2 j x 3 j x 2 j xj x 2 j xj N 1 yj x 2 j yj yj x j ans = FIS

34 Higher order polynomials Intuitively, for polynomial up to degree p we would have to Build the corresponding Toeplitz Matrix Build the corresponding vector with y and x combined at different exponents Solve the linear system Surprisingly Finding the best p th order polynomial with least-squares Solving a p dimensional linear system Not so surprising after all: Least-squares: objective of degree 2 in coefficients Minimum saddle point system of degree 1.. Least-squares has been chosen because it yields a linear system... FIS

35 The general case: one vs. all rest What about using all other variables? Rent Y = [ ] y 1 y 2 y N... All other variables X = x 1 x 2 x N... FIS

36 The general case We assume that we have d regressor variables, 1 response variable. Consider again the linear approach. We look for a function f of the form f(x) = α 0 +α 1 x 1 +α 2 x 2 + +α d x d. We want to determine d+1 weights, a constant α 0 1 i d,α i weights for each variable. Least squares: L(α 0,α 1,α 2,,α d ) = N ( ) 2 y j (α 0 +α 1 x 1,j +α 2 x 2,j + +α d x d,j ) FIS

37 The general case Notice that L(α 0,α 1,α 2,,α d ) N i=1 y i α 0 + α 1. α d T x i 2 = α 0. α d T X Y 2, where and X = x 1 x 2 x N Rd+1 N... Y = [ y 1 y N ] R N. We write α for the d+1 dimensional vector α 0. α d. FIS

38 Linear least squares Expanding this expression, L(α) = ( α T XX T α 2YX T α+ Y 2) Consider the gradient of that function L = 2XX T α 2XY T Hence this gradient is zero for α = (XX T ) 1 XY T XX T S n +, that is XX T is a positive (semi)definite matrix. This works if XX T R d+1 is invertible, that is XX T S n ++. FIS

$Considering again rents vs the rest Getting the data again, adding a line of 1 s >> (X X )\(X$

39 Considering again rents vs the rest Getting the data again, adding a line of 1 s >> (X X )\(X Y ) ans = FIS

40 What went wrong? 14 Scatter plot of Rent vs. Surface apartment 12 Rent (x JPY) Surface rent= age surf dist JPY FIS

41 What happens if we remove outliers? (surf> 40) 9 Scatter plot of Rent vs. Surface 8 7 Rent (x JPY) Surface >> (X X )\(X Y ) ans = x age x surface x distance JPY Moral of the story: easy to draw wrong conclusions even with simple tools FIS

42 What else can go wrong? Next time... What happens when d n? (XX T ) is no longer invertible... high-dimensional data in genomics, images analysis (lots of features) What happens when (XX T ) is badly conditioned ( λ min(xx T ) λ max (XX T ) 0)? if λ min (XX T ) = 1e 10, λ max ( (XX T ) 1) = 1e10!! Very bad numerical stability of the solution... When d n, we might want to do variable selection, i.e.pick a subset d of the d variables which is relevant to predict y. i.e.favor vectors β such that β 0 = cardβ i 0 is small. FIS

Foundation of Intelligent Systems, Part I. Regression 2

Foundation of Intelligent Systems, Part I Regression 2 mcuturi@i.kyoto-u.ac.jp FIS - 2013 1 Some Words on the Survey Not enough answers to say anything meaningful! Try again: survey. FIS - 2013 2 Last