CS-E3210 Machine Learning: Basic Principles
Slide 1: CS-E3210 Machine Learning: Basic Principles
Lecture 3: Regression I
slides by Markus Heinonen
Department of Computer Science, Aalto University, School of Science
Autumn (Period I) 2017
Slide 2: In a nutshell
- today and on Friday we consider regression problems
- data points $x^{(i)} \in \mathbb{R}^d$ and continuous targets $y^{(i)} \in \mathbb{R}$
- we want to learn a function with $h(x^{(i)}) \approx y^{(i)}$
- the prediction $h(x)$ is continuous; in classification both the target $y$ and $h(x)$ are binary
- a function $h(\cdot)$ is represented by parameters $w$
- the parameters $w$ need to fit the data $X = (x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$
Slide 3: Can we predict apartment rent?
[Figure: rent prediction, output $y$: rent against input $x$: house size (sqm)]
- we observe rents $y^{(i)}$ for $i = 1, \ldots, 11$ houses $x^{(i)}$
- we learn from this data to predict the rent $h(x) \in \mathbb{R}$ given $d$ house properties $x \in \mathbb{R}^d$
- (designing a good $h(x)$ by hand is not machine learning)
Slide 4: Which features do we have access to?
[Figure: two rent-prediction panels, output $y$: rent against input $x_1$: house size (sqm) and against input $x_2$: house age]
- house size $x_{\text{size}}$ can predict a linear trend in the rent $y$
- house age $x_{\text{age}}$ gives non-linear information about $y$: new and old houses seem expensive, with little effect from the 40s to the 90s
- informative features add accuracy (e.g. location, condition)
- non-informative features add noise (e.g. house color)
Slide 5: Alternative hypotheses $h(x)$: which to choose?
[Figure: two rent-prediction panels over input $x$: house size (sqm), a linear fit $h(x) = 8.5x$ and a complex non-linear fit $h(x)$]
- linear functions are surprisingly powerful
- non-linear functions can achieve low error, but can still err badly
- a model should learn the underlying function and generalise to future data (Lectures 7 & 8)
Slide 6: Alternative hypotheses $h(x)$: which to choose?
[Figure: two rent-prediction panels, output $y$: rent against input $x_2$: house age, with linear and non-linear fits]
- a linear function cannot explain the bimodal behaviour of $x_{\text{age}}$; this motivates basis functions
Slide 7: Outline
1. Linear regression
2. Basis functions: polynomial basis, Gaussian basis
Slide 8: A regression problem
- inputs $x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_d)^T \in \mathbb{R}^d$ with $d$ features/properties/dimensions/covariates
- a scalar target/response/output/label $y^{(i)} \in \mathbb{R}$
- a dataset of $N$ data points $X = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\} = \{x^{(i)}, y^{(i)}\}_{i=1}^N$
- in matrix form the dataset is
  $X = \begin{pmatrix} x^{(1)}_1 & \cdots & x^{(1)}_d \\ \vdots & & \vdots \\ x^{(N)}_1 & \cdots & x^{(N)}_d \end{pmatrix} = \begin{pmatrix} x^{(1)T} \\ \vdots \\ x^{(N)T} \end{pmatrix} \in \mathbb{R}^{N \times d}, \quad y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix} \in \mathbb{R}^N$
- learn a function $h(\cdot): \mathbb{R}^d \to \mathbb{R}$ with $y^{(i)} \approx h(x^{(i)})$
- two questions: (1) which function family $h(x)$ to choose? (2) how to measure $h(x) \approx y$?
Slide 9: Linear regression
- for multivariate inputs $x \in \mathbb{R}^d$, linear regression defines
  $h_w(x) = \sum_{j=0}^{d} w_j x_j = w^T x$
  where $w \in \mathbb{R}^{d+1}$ are the linear weight parameters
- encode $x_0 = 1$; then $w_0$ encodes the intercept
- the hypothesis class is $\mathcal{H} = \{h_w : w \in \mathbb{R}^{d+1}\}$
- all predictions in matrix notation:
  $\begin{pmatrix} h(x^{(1)}) \\ \vdots \\ h(x^{(N)}) \end{pmatrix} = \begin{pmatrix} w^T x^{(1)} \\ \vdots \\ w^T x^{(N)} \end{pmatrix} = Xw$
- measure the prediction error by the squared error/loss $L((x^{(i)}, y^{(i)}), h(\cdot)) = (y^{(i)} - h(x^{(i)}))^2$
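To make the matrix notation concrete, a minimal NumPy sketch (not from the slides; the toy numbers are borrowed from the rent example) that computes all predictions $Xw$ and the per-point squared losses at once:

```python
import numpy as np

# toy data: N = 3 points with d = 1 feature, plus the bias trick x_0 = 1
X = np.array([[1.0, 31.0],
              [1.0, 49.0],
              [1.0, 101.0]])   # shape (N, d+1), first column encodes x_0 = 1
y = np.array([705.0, 840.0, 1200.0])

w = np.array([0.0, 9.0])       # w_0 (intercept) and w_1 (slope)

predictions = X @ w            # h(x^(i)) = w^T x^(i) for all i at once
losses = (y - predictions)**2  # squared loss per data point
print(predictions, losses)
```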
Slide 10: Can we predict apartment rent?
[Figure: rent prediction, output $y$: rent against input $x$: house size (sqm)]
   i    input x^(i)   output y^(i)
   1        31            705
   2        33            540
   3        31            650
   4        49            840
   5        53            890
   6        69            850
   7       101           1200
   8        99           1150
   9       143           1700
  10       132            900
  11       109           1550
- we observe the data $X = (x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$ with $N = 11$
- we assume $y^{(i)} \approx f(x^{(i)})$ where $f(\cdot)$ is the true function
Slide 11: Can we predict apartment rent?
[Figure: rent prediction with the fit $h(x) = 9x + 400$ over input $x$: house size (sqm)]
   i    input x^(i)   output y^(i)   h(x^(i)) = 9 x^(i) + 400
   1        31            705              679
   2        33            540              697
   3        31            650              679
   4        49            840              841
   5        53            890              877
   6        69            850             1021
   7       101           1200             1309
   8        99           1150             1291
   9       143           1700             1687
  10       132            900             1588
  11       109           1550             1381
- linear hypothesis class $h_w(x) = w_1 x + w_0 = w^T x$
- encode $x = (x, 1)^T$ with $w = (w_1, w_0)^T$
- compute the losses $(y^{(i)} - h(x^{(i)}))^2$
Slide 12: Which parameters to choose?
[Figure: rent prediction, output $y$: rent against input $x$: house size (sqm)]
- choose the parameters to minimize the empirical risk (mean loss):
  $\hat{w} = \arg\min_w E(h(\cdot) \mid X)$, where
  $E(h(\cdot) \mid X) = \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - h(x^{(i)}))^2 = \frac{1}{N} \lVert y - Xw \rVert^2$
Slide 13: Empirical risk
[Figure: left, the rent data with the fit $h(x) = 5x$ (intercept fixed to $b = 0$); right, the empirical risk as a function of $w$, marked at $w = 5$]
- the empirical risk quantifies how well the function fits the data
- here $h(x) = w_1 x + 0$ with $w_1 = 5$
Slide 14: Empirical risk
[Figure: left, the rent data with the fits $h(x) = 5x$ and $h(x) = 11.7x$; right, the empirical risk marked at $w = 5$ and $w = 11.7$]
- the empirical risk quantifies how well the function fits the data
- here $h(x) = w_1 x + 0$ with $w_1 = 11.7$
Slide 15: Empirical risk
[Figure: left, the rent data with the fits $h(x) = 5x$, $h(x) = 11.7x$ and $h(x) = 15x$; right, the empirical risk marked at $w = 5$, $11.7$ and $15$]
- the empirical risk quantifies how well the function fits the data
- here $h(x) = w_1 x + 0$ with $w_1 = 15$
- the best hypothesis was $w_1 = 11.7$ when $w_0 = 0$ (only on this data $X$!)
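As a sketch of slides 13 to 15, the empirical risk of $h(x) = w_1 x$ can be evaluated for the three candidate slopes on the 11 rent observations from slide 10:

```python
import numpy as np

# the 11 rent observations from slide 10
x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)

def empirical_risk(w1):
    """Mean squared error of h(x) = w1 * x with the intercept fixed to 0."""
    return np.mean((y - w1 * x) ** 2)

for w1 in [5.0, 11.7, 15.0]:
    print(f"w1 = {w1:5.1f}   empirical risk = {empirical_risk(w1):10.1f}")
# w1 = 11.7 gives the smallest risk of the three candidates
```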
Slide 16: Empirical risk
[Figure: 2D empirical risk surface over $(w_0, w_1)$]
Slide 17: Derivatives
- let's minimize the empirical risk
- minimization of functions is based on derivatives:
  $\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
- the negative derivative gives the direction of steepest descent
Slide 18: Derivatives
- the derivative of the empirical error with respect to $w$ is (for a 1-D problem)
  $\frac{\partial E(h(\cdot) \mid X)}{\partial w} = \frac{\partial}{\partial w} \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - w x^{(i)})^2 = \frac{2}{N} \sum_{i=1}^{N} (y^{(i)} - w x^{(i)}) \frac{\partial (y^{(i)} - w x^{(i)})}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} x^{(i)} \underbrace{(y^{(i)} - w x^{(i)})}_{i\text{th data error}}$
- the gradient of the empirical error with respect to $w = (w_1, \ldots, w_d)^T$ is
  $\nabla_w E(h_w(\cdot) \mid X) = \left( \frac{\partial E(h_{w_1,\ldots,w_d}(\cdot) \mid X)}{\partial w_1}, \ldots, \frac{\partial E(h_{w_1,\ldots,w_d}(\cdot) \mid X)}{\partial w_d} \right)^T$
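A quick way to sanity-check the derived gradient is to compare it against a finite-difference approximation; a short sketch (not from the slides), again on the rent data:

```python
import numpy as np

x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)

def risk(w):
    return np.mean((y - w * x) ** 2)

def analytic_grad(w):
    # dE/dw = -(2/N) sum_i x^(i) (y^(i) - w x^(i)), as derived on slide 18
    return -2.0 * np.mean(x * (y - w * x))

w, h = 5.0, 1e-4
numeric = (risk(w + h) - risk(w - h)) / (2 * h)   # central difference
print(analytic_grad(w), numeric)                   # the two values should agree closely
```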
Slide 19: Iterative gradient descent
- choose an initial parameter $w^{(0)}$ (e.g. all zeros) and a stepsize $\alpha$
- iterative gradient descent (GD): for $k = 1, \ldots, K$, update
  $w^{(k+1)} = w^{(k)} - \alpha \underbrace{\nabla_w E(h(\cdot) \mid X)}_{\text{gradient}} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} \underbrace{(y^{(i)} - w^{(k)T} x^{(i)})}_{i\text{th data point error}}$
- output: the final $K$th regression weight vector $w^{(K)}$
- the choice of the step size or learning rate $\alpha$ is crucial:
  - if $\alpha$ is too large, the iterations may not converge
  - if $\alpha$ is too small, convergence is very slow
  - $\alpha$ is usually chosen by trial and error
- the gradient $\nabla_w E(h(\cdot) \mid X)$ points in the direction of the maximal rate of increase of $E(h(\cdot) \mid X)$ at the current value $w$; subtracting the gradient from $w^{(k)}$ maximally decreases $E(h(\cdot) \mid X)$
- computational complexity: $O(KNd)$
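A minimal gradient-descent sketch on the rent data; the step size and iteration count here are trial-and-error choices of mine, not values from the slides:

```python
import numpy as np

# rent data from slide 10, with the bias trick x_0 = 1
x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)      # initial parameters w^(0)
alpha = 1e-4         # step size, found by trial and error; larger values diverge here
K = 200_000          # the unscaled features make the intercept direction converge slowly

for k in range(K):
    residual = y - X @ w                              # y^(i) - w^(k)T x^(i) for all i
    w = w + (2 * alpha / len(y)) * (X.T @ residual)   # gradient descent update

print(w)   # approaches the least-squares solution (w_0, w_1)
```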
Slide 20: Gradient minimization
- we use the update equation
  $w^{(k+1)} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$
- here the stepsize $\alpha$ is good
Slide 21: Gradient minimization
- we use the update equation
  $w^{(k+1)} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$
- with too large an $\alpha$, we are not converging
Slide 22: Stochastic gradient descent
- in gradient descent, each data point pulls on the parameters:
  $w^{(k+1)} = w^{(k)} - \alpha \nabla_w E(h(\cdot) \mid X) = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} \underbrace{(y^{(i)} - w^{(k)T} x^{(i)})}_{i\text{th data point error}}$
- in stochastic gradient descent (SGD) we compute the gradient over random minibatches $I \subset \{1, \ldots, N\}$ of size $M < N$:
  $w^{(k+1)} = w^{(k)} - \alpha \nabla_w E(h(\cdot) \mid X_I) = w^{(k)} + \frac{2\alpha}{M} \sum_{i \in I} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$
- computational complexity: $O(KMd)$
- SGD is one of the most powerful optimizers for large models
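A matching SGD sketch; the minibatch size, step size and iteration count are again hand-picked assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# rent data from slide 10, with the bias trick x_0 = 1
x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
alpha, M = 2e-5, 4                 # step size and minibatch size M < N (tuned by hand)

for k in range(100_000):
    I = rng.choice(len(y), size=M, replace=False)    # random minibatch indices I
    residual = y[I] - X[I] @ w
    w = w + (2 * alpha / M) * (X[I].T @ residual)    # gradient over the minibatch only

print(w)   # with a constant step size, SGD hovers near the least-squares optimum
```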
Slide 23: Analytical solution for linear regression
- to minimize $E(h(\cdot) \mid X)$ we can directly solve for where its gradient is zero: $\nabla_w E(h(\cdot) \mid X) = 0$
- the solution is (DL book 5.1.4)
  $\hat{w} = (X^T X)^{-1} X^T y$
- we get the global optimum, since the empirical risk (of linear regression) is convex
- $X^T X$ needs to be invertible (Regression Home Assignment)
- the matrix inverse is an $O(d^3)$ operation
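A sketch of the analytical solution; solving the normal equations $(X^T X) w = X^T y$ with np.linalg.solve is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)
X = np.column_stack([np.ones_like(x), x])   # bias trick: x_0 = 1

# solve (X^T X) w = X^T y instead of computing (X^T X)^{-1} explicitly
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)

# equivalently, np.linalg.lstsq also handles the rank-deficient case
w_hat2, *_ = np.linalg.lstsq(X, y, rcond=None)
```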
Slide 24: ID card of linear regression
- input/feature space: $\mathcal{X} = \mathbb{R}^d$
- target space: $\mathcal{Y} = \mathbb{R}$
- function family: $h(x) = w^T x = \sum_{j=0}^{d} w_j x_j$ (bias trick: $x_0 = 1$ and $j$ starts from 0)
- loss function: $L((x, y), h(\cdot)) = (h(x) - y)^2$
- empirical risk: $E(h(\cdot) \mid X) = \frac{1}{N} \lVert Xw - y \rVert_2^2$
- empirical risk minimization leads to the parameters
  $\hat{w} = (X^T X)^{-1} X^T y$, or iteratively $w^{(k+1)} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$
- DL book: covered in chapter 5.1
Slide 25: Case study: predict red wine quality with linear regression?
- one wants to understand what makes a wine taste good
- we have measured the chemical composition of many wines ($x$) and tasting evaluations that rate the wines ($y$)
- task: predict the wine quality $h(x)$ given its composition $x$
Slide 26: Wine measurement data
- we construct a dataset $X$ of $N = 1599$ wine measurements $x$
- we manually obtain a rating $y \in [0, 10]$ for each wine from subjective tastings
- the 11 features $x^{(i)}_1, \ldots, x^{(i)}_{11}$ are: fixed acid, volatile acid, citric acid, sugar, chlorides, free sulfur, total sulfur, density, pH, sulphates, alcohol; the target is the quality rating
  $X = \begin{pmatrix} x^{(1)T} \\ x^{(2)T} \\ \vdots \\ x^{(1599)T} \end{pmatrix}, \quad y = \begin{pmatrix} 5 \\ 5 \\ \vdots \\ 6 \end{pmatrix}$
*P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Slide 27: Linear regression on wine
- linear hypothesis space $\mathcal{H} = \{h_w(x) = w^T x : w \in \mathbb{R}^{11}\}$
- the empirical risk minimizer (fits the 1599 wines best) is
  $\hat{w} = \arg\min_w \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - w^T x^{(i)})^2 = (X^T X)^{-1} X^T y$
- [Table: the fitted coefficient vector $\hat{w}$ alongside the 11 features (fixed acid, volatile acid, citric acid, sugar, chlorides, free sulfur, total sulfur, density, pH, sulphates, alcohol) and the quality targets $y$]
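A hedged end-to-end sketch of this case study, assuming the semicolon-separated UCI file winequality-red.csv from the Cortez et al. reference is available locally (the filename and column layout follow the UCI distribution and are assumptions here):

```python
import numpy as np
import pandas as pd

# assumes the UCI 'winequality-red.csv' (semicolon-separated) is in the working directory
data = pd.read_csv("winequality-red.csv", sep=";")
X = data.drop(columns="quality").to_numpy()     # N = 1599 wines, d = 11 features
y = data["quality"].to_numpy(dtype=float)

# empirical risk minimizer, no intercept, matching the slide's hypothesis space
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
mse = np.mean((X @ w_hat - y) ** 2)             # empirical risk on the training data
print(w_hat, mse)
```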
Slide 28: Predictions
- predictions: $h(x^{(i)}) = \sum_j \hat{w}_j x^{(i)}_j = \hat{w}^T x^{(i)}$
- in matrix notation: $X\hat{w} = (\hat{w}^T x^{(1)}, \hat{w}^T x^{(2)}, \ldots, \hat{w}^T x^{(1599)})^T$
- [Worked example: $h(x^{(1)})$ and $h(x^{(2)})$ expanded as explicit dot products of the fitted weights (one visible coefficient: $-1.09$) with the feature values]
Slide 29: Result on wine
- we achieve the empirical risk (mean squared error)
  $E(h(\cdot) \mid X) = \frac{1}{N} \sum_{i=1}^{N} (h(x^{(i)}) - y^{(i)})^2$
- [Example: a true rating $y = 5$ compared against its prediction from $X\hat{w}$]
Slide 30: Outline
1. Linear regression
2. Basis functions: polynomial basis, Gaussian basis
Slide 31: Non-linearity
- so far we have analysed linear models, where each feature's contribution to the output is summed independently
- most machine learning problems are non-linear:
  - non-linear effects, e.g. $\log(x_{\text{alcohol}})$
  - combined effects, e.g. $x_{\text{sugar}} \cdot x_{\text{alcohol}}$
- let's expand the feature space by considering $n$ basis functions
  $h(x) = \sum_{j=0}^{n} w_j \phi_j(x) = w^T \phi(x)$
  where $\phi(x): \mathbb{R}^d \to \mathbb{R}^n$, usually with $n > d$ and $\phi_0(x) = 1$
- the dataset is then $\Phi = (\phi(x^{(1)}), \ldots, \phi(x^{(N)}))^T \in \mathbb{R}^{N \times n}$
- risk: $\frac{1}{N} \lVert \Phi w - y \rVert_2^2$, solution: $\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$
Slide 32: Outline
1. Linear regression
2. Basis functions: polynomial basis, Gaussian basis
Slide 33: Polynomial expansion
- map $\phi: (x_1, x_2) \mapsto (x_1, x_2, x_1^2 x_2^2)$
- the product $x_1^2 x_2^2$ solves the problem (feature expansion)
- the trivial solution is now $w_3 = 1$
Slide 34: Polynomial basis functions
- let's consider non-additive effects via $M$th-order polynomial basis functions:
  $\phi^{(M)}(x) = \{x_{j_1} x_{j_2} \cdots x_{j_M} : j_1, \ldots, j_M \in \{1, \ldots, d\}\}$
  where
  $\phi^{(0)}(x) = 1$
  $\phi^{(1)}(x) = (x_1, x_2, \ldots, x_d)^T$
  $\phi^{(2)}(x) = (x_1^2, x_1 x_2, \ldots, x_{d-1} x_d, x_d^2)^T$
- $d = 11$ features give 55 pairwise terms, 165 triplets, etc.
- basis expansion dramatically increases the hypothesis space
- the bases are precomputed to produce the $\Phi$ matrix
- basis functions result in a non-linear prediction
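For example, scikit-learn's PolynomialFeatures is one standard way to precompute the $\Phi$ matrix; a degree-2 sketch on made-up data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])            # two points with d = 2 features

phi = PolynomialFeatures(degree=2)    # features 1, x1, x2, x1^2, x1*x2, x2^2
Phi = phi.fit_transform(X)            # the precomputed basis matrix Phi
print(phi.get_feature_names_out())
print(Phi)
```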
Slide 35: Polynomial basis example
- sample 100 points with $x^{(i)} \in [-1, 1]$ and $y^{(i)} = \sin(\pi x^{(i)}) + \varepsilon$
[Figure: black dots: 7 data points; red dots: more samples; the true curve $\sin(\pi x)$ and a degree-1 polynomial fit $h(x) = 1.37x$]
Slide 36: Polynomial regressor, M = 0
[Figure: data, the true curve $\sin(\pi x)$, and a degree-0 polynomial fit]
- $h(x) = \hat{w}_0$

Slide 37: Polynomial regressor, M = 1
[Figure: data, the true curve $\sin(\pi x)$, and a degree-1 polynomial fit]
- $h(x) = \hat{w}_0 + \hat{w}_1 x$

Slide 38: Polynomial regressor, M = 2
[Figure: data, the true curve $\sin(\pi x)$, and a degree-2 polynomial fit]
- $h(x) = \hat{w}^T \phi(x) = \hat{w}_0 + \hat{w}_1 x + \hat{w}_2 x^2$

Slide 39: Polynomial regressor, M = 3
[Figure: data, the true curve $\sin(\pi x)$, and a degree-3 polynomial fit]
- $h(x) = \hat{w}^T \phi(x) = \hat{w}_0 + \hat{w}_1 x + \hat{w}_2 x^2 + \hat{w}_3 x^3$

Slide 40: Polynomial regressor, M = 5
[Figure: data, the true curve $\sin(\pi x)$, and a degree-5 polynomial fit]
- $h(x) = \hat{w}^T \phi(x) = \hat{w}_0 + \hat{w}_1 x + \hat{w}_2 x^2 + \hat{w}_3 x^3 + \hat{w}_4 x^4 + \hat{w}_5 x^5$
Slide 41: Polynomial regressor, M = 5, with enough data
[Figure: with enough samples, the degree-5 polynomial fit closely follows $\sin(\pi x)$]
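A sketch reproducing this experiment: sample noisy $\sin(\pi x)$ data and fit polynomials of increasing degree by least squares on the $\Phi$ matrix (the noise level 0.1 is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=N)   # noise scale 0.1 is an assumption

for M in [0, 1, 3, 5]:
    Phi = np.vander(x, M + 1, increasing=True)     # columns 1, x, ..., x^M
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    mse = np.mean((Phi @ w_hat - y) ** 2)
    print(f"degree {M}: training MSE = {mse:.4f}")
```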
Slide 42: Outline
1. Linear regression
2. Basis functions: polynomial basis, Gaussian basis
Slide 43: Kernel basis functions
- a kernel function $K(x, x') \in \mathbb{R}$ measures the similarity of two vectors $x, x' \in \mathbb{R}^d$
- it is the opposite concept to a distance function $D(x, x')$
- a common kernel is the Gaussian kernel
  $K(x, x') = \exp\left( -\frac{1}{2} \frac{\lVert x - x' \rVert^2}{\sigma^2} \right)$
- a kernel basis function encodes the feature $\phi_i(x)$ as the similarity to another point $m^{(i)}$: $\phi_i(x) = K(x, m^{(i)})$
- how do we choose the basis points $m^{(i)}$?
Slide 44: Feature mapping with 3 Gaussian bases
- 3 features $\phi_j(x) = e^{-\frac{(x - m^{(j)})^2}{2\sigma^2}}$ at $m^{(j)} = 50, 100, 150$
- feature mapping $\phi: x \mapsto (\phi_1(x), \phi_2(x), \phi_3(x))$
- e.g. $x = 31$ becomes $\phi(31) = (0.74, 0.02, 0.00)$
- e.g. $x = 69$ becomes $\phi(69) = (0.74, 0.46, 0.00)$
- e.g. $x = 143$ becomes $\phi(143) = (0.00, 0.22, 0.96)$
Slide 45: 3 Gaussian bases on 1D
- three Gaussian features $\phi_j(x) = e^{-\frac{(x - m^{(j)})^2}{2\sigma^2}}$ with $(m^{(1)}, m^{(2)}, m^{(3)}) = (50, 100, 150)$
- the hypothesis is a sum of weighted Gaussian features:
  $h(x) = \sum_{j=1}^{3} w_j \phi_j(x)$
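A sketch of this three-Gaussian feature map; the bandwidth $\sigma \approx 24.5$ is an assumption, chosen so that the mapped values match the slide's examples, e.g. $\phi(31) \approx (0.74, 0.02, 0.00)$:

```python
import numpy as np

centers = np.array([50.0, 100.0, 150.0])   # basis points m^(1), m^(2), m^(3)
sigma = 24.5                               # bandwidth chosen to match the slide's numbers

def phi(x):
    """Map a scalar x to three Gaussian similarity features."""
    return np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))

for x in [31.0, 69.0, 143.0]:
    print(x, np.round(phi(x), 2))   # e.g. phi(31) ~ [0.74, 0.02, 0.00]
```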
Slide 46: ID card of linear basis regression
- input space: $\mathcal{X} = \mathbb{R}^d$
- feature space: $\mathcal{F} = \mathbb{R}^n$ via the basis function $\phi(x) \in \mathbb{R}^n$
- the dataset is then $\Phi = (\phi(x^{(1)}), \ldots, \phi(x^{(N)}))^T \in \mathbb{R}^{N \times n}$
- target space: $\mathcal{Y} = \mathbb{R}$
- function family: $h(x) = w^T \phi(x)$
- loss function: $L((x, y), h(\cdot)) = (h(x) - y)^2$
- empirical risk: $E(h(\cdot) \mid X) = \frac{1}{N} \lVert \Phi w - y \rVert_2^2$
- empirical risk minimization leads to the parameters $\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$
Slide 47: Basis function summary
- basis functions $\phi: \mathbb{R}^d \to \mathbb{R}^n$ project the data into a higher-dimensional space (if $n > d$)
- linear regression on the high-dimensional data points $\phi(x)$ leads to a non-linear hypothesis $h(\phi(x))$
- selecting informative basis functions is a difficult task
- polynomial bases take combinations (products) of existing features
- Gaussian bases generate a new feature mapping
Slide 48: Next steps
- next lecture: Regression II, with kernel methods and Bayesian regression, on Friday at 10:15
- DL book: read chapters 5.1 and 5.2 on linear regression
- more information about basis functions: Hastie's book [1], chapters 3.2 & 5; Bishop's book [2], chapter 3.1
- fill out the post-lecture questionnaire in MyCourses! We read and appreciate all feedback.

[1] Elements of Statistical Learning, Springer.
[2] Pattern Recognition and Machine Learning, Springer.