CS-E3210 Machine Learning: Basic Principles

1 CS-E3210 Machine Learning: Basic Principles, Lecture 4: Regression II. Slides by Markus Heinonen, Department of Computer Science, Aalto University, School of Science. Autumn (Period I) 2017. 1 / 61

2 Today's introduction: we still want to learn a continuous hypothesis function $h(x^{(i)}) \approx y^{(i)}$ from $N$ observed data points $(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$. Today we consider kernel regression and Bayesian methods: in kernel regression a non-linear hypothesis is learned, and in Bayesian learning we learn a distribution of hypotheses instead of a single hypothesis. 2 / 61

3 Outline: 1. Kernel regression (1D example, 2D example); 2. Bayesian learning; 3. Bayesian linear regression. 3 / 61

4 Kernel smoothing: $h(x) = \frac{\sum_{i=1}^N y^{(i)} K(x, x^{(i)})}{\sum_{l=1}^N K(x, x^{(l)})} = \sum_{i=1}^N y^{(i)} \bar{K}(x, x^{(i)})$, where the normalised kernel $\bar{K}(x, x^{(i)}) = \frac{K(x, x^{(i)})}{\sum_{l=1}^N K(x, x^{(l)})}$ sums to one over the data. The hypothesis becomes a weighted average of the $y^{(i)}$, and every data point becomes a basis point (cf. Lecture 3). We assume a Gaussian kernel $K_\sigma(x, x') = \exp\left(-\frac{1}{2}\frac{\|x - x'\|^2}{\sigma^2}\right)$. 4 / 61
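To make the estimator above concrete, here is a minimal NumPy sketch of Nadaraya-Watson kernel smoothing with the Gaussian kernel, assuming 1D inputs; the function names and the toy data are illustrative, not from the lecture material.

```python
import numpy as np

def gaussian_kernel(x, x_basis, sigma):
    """K_sigma(x, x') = exp(-0.5 * (x - x')^2 / sigma^2) for scalar inputs."""
    return np.exp(-0.5 * (x - x_basis) ** 2 / sigma ** 2)

def nw_kernel_regression(x_query, x_train, y_train, sigma):
    """Nadaraya-Watson estimate h(x) = sum_i y_i K(x, x_i) / sum_l K(x, x_l)."""
    weights = gaussian_kernel(x_query[:, None], x_train[None, :], sigma)  # (Q, N)
    weights /= weights.sum(axis=1, keepdims=True)  # normalise rows to sum to one
    return weights @ y_train                       # weighted average of outputs

# toy usage: predict rent-like outputs on a grid of query points
x_train = np.array([20., 35., 50., 65., 80.])
y_train = np.array([400., 550., 700., 820., 950.])
x_query = np.linspace(15., 85., 8)
print(nw_kernel_regression(x_query, x_train, y_train, sigma=10.0))
```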

5 Outline (current section: kernel regression, 1D example). 5 / 61

6 Kernel on 1D rent data: the kernel $K_\sigma(x, x^{(3)}) = \exp\left(-\frac{1}{2}\frac{(x - x^{(3)})^2}{\sigma^2}\right)$ gives similarities to neighbouring points (a univariate Gaussian kernel function since $x$ is scalar). 6 / 61

7 Kernel on 1D rent data: the kernel $K_\sigma(x, x^{(6)}) = \exp\left(-\frac{1}{2}\frac{(x - x^{(6)})^2}{\sigma^2}\right)$ gives similarities to neighbouring points. 7 / 61

8 Kernel on 1D rent data: the kernel $K_\sigma(x, x^{(9)}) = \exp\left(-\frac{1}{2}\frac{(x - x^{(9)})^2}{\sigma^2}\right)$ gives similarities to neighbouring points. 8 / 61

9 Kernel on 1D rent data: the normalised kernel $\bar{K}_\sigma(x, x^{(9)}) = \frac{K_\sigma(x, x^{(9)})}{\sum_{l=1}^N K_\sigma(x, x^{(l)})}$ scales the similarities to percentages (note the different color scale). 9 / 61

10 Kernel on 1D rent data: the kernel matrix and the normalised (rows sum to one) kernel matrix of the 11 data point inputs. 10 / 61
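A short sketch of how such a kernel matrix and its row-normalised version can be computed; the 11 inputs below are made-up placeholders for the rent data, and the helper name is hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(x, sigma):
    """Pairwise Gaussian kernel matrix K[i, l] = exp(-0.5 * (x_i - x_l)^2 / sigma^2)."""
    diffs = x[:, None] - x[None, :]
    return np.exp(-0.5 * diffs ** 2 / sigma ** 2)

x = np.linspace(20.0, 100.0, 11)           # 11 hypothetical 1D inputs (apartment sizes)
K = gaussian_kernel_matrix(x, sigma=10.0)  # raw similarities
K_norm = K / K.sum(axis=1, keepdims=True)  # normalised kernel matrix
print(np.allclose(K_norm.sum(axis=1), 1.0))  # rows sum to one -> True
```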

11 Kernel on 1D rent data: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$. 11 / 61

12 Kernel on 1D rent data: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$ with $\sigma = 50$; we get linear regression as $\sigma$ increases. 12 / 61

13 Kernel on 1D rent data: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$ with a still larger $\sigma$: we get a constant hypothesis. 13 / 61

14 Kernel on 1D rent data: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$ with $\sigma = 5$. 14 / 61

15 Kernel on 1D rent data: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$ with $\sigma = 1$; nearest neighbour smoothing as $\sigma \to 0$. 15 / 61
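A small self-contained check of the two bandwidth limits discussed on these slides, using made-up 1D data: with a small $\sigma$ the estimate essentially returns the nearest neighbour's output, and with a very large $\sigma$ it approaches the global mean of the outputs (a constant hypothesis).

```python
import numpy as np

def nw_estimate(x_query, x_train, y_train, sigma):
    """Nadaraya-Watson estimate at a single query point."""
    w = np.exp(-0.5 * (x_query - x_train) ** 2 / sigma ** 2)
    return np.sum(w * y_train) / np.sum(w)

x_train = np.array([20., 35., 50., 65., 80.])
y_train = np.array([400., 550., 700., 820., 950.])
x0 = 37.0

print(nw_estimate(x0, x_train, y_train, sigma=1.0))   # small sigma: ~ nearest neighbour's y (550)
print(nw_estimate(x0, x_train, y_train, sigma=1e4))   # large sigma: ~ global mean of y (684)
print(y_train[np.argmin(np.abs(x_train - x0))], y_train.mean())
```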

16 Outline (current section: kernel regression, 2D example). 16 / 61

17 2D rent data with kernel regression: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$ with $\sigma = 2$, a nearest neighbour model. 17 / 61

18 2D rent data with kernel regression: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$ with $\sigma = 5$. 18 / 61

19 2D rent data with kernel regression: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$. 19 / 61

20 2D rent data with kernel regression: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$. 20 / 61

21 2D rent data with kernel regression: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$. 21 / 61

22 2D rent data with kernel regression: kernel regression $h(x) = \sum_{i=1}^N y^{(i)} \bar{K}_\sigma(x, x^{(i)})$ with $\sigma = 50$; we get linear regression. 22 / 61

23 2D rent data with kernel regression: the data supports the hypothesis function, or surface, $h(x)$; kernel regression interpolates the observed outputs (e.g. rents) based on input geometry (similarity). 23 / 61

24 Kernel method summary: there are no parameters (like $w$) in kernel regression (!), $h(x) = \frac{\sum_{i=1}^N y^{(i)} K(x, x^{(i)})}{\sum_{l=1}^N K(x, x^{(l)})}$; all datapoints act as parameters ($N$-dimensional model); the input features $x^{(i)}_j$ are not directly weighted by parameters $w_j$, instead they go inside the kernel; the kernel can have hyperparameters, e.g. the variance $\sigma^2$ (Lectures 7 & 8: selection of hyperparameters). The 3 pillars of ML: neural networks, Bayesian learning, kernel methods (see the course CS-E Kernel Methods in Machine Learning). 24 / 61

25 Parametric vs non-parametric methods: a parametric method compresses all information in the dataset into a set of model parameters, e.g. the linear regression parameters $w$ with hypothesis $h(x) = w^T x$; the model (of size $d$) is smaller than the data ($Nd$); the model is a low-rank explanation of the data and is often interpretable; a new prediction $h(x^{(N+1)}) = w^T x^{(N+1)}$ does not depend on the datapoints $(x^{(i)}, y^{(i)})$. A non-parametric method uses the dataset as its parameters; the whole dataset needs to be stored (increased memory); the model has no interpretable parameters; e.g. the NW kernel method, where $h(x^{(N+1)}) = \sum_{i=1}^N y^{(i)} \frac{K(x^{(N+1)}, x^{(i)})}{\sum_l K(x^{(N+1)}, x^{(l)})}$ depends on all observed datapoints. 25 / 61

26 ID card of NW kernel regression: input/feature space $\mathcal{X} = \mathbb{R}^d$; target space $\mathcal{Y} = \mathbb{R}$; function family $h(x) = \frac{\sum_{i=1}^N y^{(i)} K_\theta(x, x^{(i)})}{\sum_{l=1}^N K_\theta(x, x^{(l)})}$; multiple choices of kernel function $K_\theta(x, x^{(i)})$; the selection of kernel hyperparameters $\theta$ is crucial; choosing kernel parameters to minimize empirical risk can lead to overfitting (Lectures 7 & 8). 26 / 61

27 Outline (current section: Bayesian learning). 27 / 61

28 Statistical learning: previously we studied regression as deterministic function fitting (minimize empirical risk); we can also treat regression as a statistical problem of inferring a distribution $w \sim p(\cdot)$ of the random-variable parameter after observing data $(X, y)$. The key concepts of statistics: probability density function (pdf) $p(\theta) \ge 0$; probability $P(a \le \theta \le b) = \int_a^b p(\theta)\,d\theta \in [0, 1]$; expectation $E_{p(\theta)}[g(\theta)] = \int g(\theta)\, p(\theta)\,d\theta \in \mathbb{R}$. Read DL book chapter 3! 28 / 61

29 Some common continuous distributions 29 / 61

30 Generative model: let's consider the task of estimating the ratio $\theta \in [0, 1]$ of genders of newborn babies, given a dataset of $N$ observed newborn genders $y = \{y^{(1)}, \ldots, y^{(N)}\}$ with $y^{(i)} \in \{0, 1\}$; the global true ratio is perhaps $\theta \approx 0.52$. We assume that each birth results in a boy or a girl with a Bernoulli probability $p(y^{(i)} \mid \theta) = \mathrm{Ber}(y^{(i)} \mid \theta) = \theta^{y^{(i)}} (1 - \theta)^{1 - y^{(i)}}$; assume all births are independent, $p(y^{(i)}, y^{(j)}) = p(y^{(i)})\, p(y^{(j)})$; assume all births $i$ follow the same $\mathrm{Ber}(\theta)$ distribution; the parameter $\theta$ contains everything needed to compute the probability of an observation $p(y^{(i)} \mid \theta)$. Assume we observe $N = 5$ births $y = (0, 0, 1, 0, 0)^T$. 30 / 61

31 Data likelihood: the data likelihood $p(y \mid \theta)$ is the probability of seeing this data $y$ assuming parameters $\theta$: $p(y \mid \theta) = \prod_{i=1}^N \mathrm{Ber}(y^{(i)} \mid \theta) = \prod_{i=1}^N \theta^{y^{(i)}} (1 - \theta)^{1 - y^{(i)}}$. In maximum likelihood (ML) inference we maximise the (log) likelihood: $\theta_{ML} = \arg\max_\theta \log p(y \mid \theta) = \frac{1}{N} \sum_{i=1}^N y^{(i)} = 0.2$. 31 / 61
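A tiny sketch of this maximum likelihood computation for the five observed births, comparing the closed form $\frac{1}{N}\sum_i y^{(i)}$ with a brute-force search over the log-likelihood; the variable names are illustrative.

```python
import numpy as np

y = np.array([0, 0, 1, 0, 0])            # N = 5 observed births

# closed-form ML estimate: mean of the Bernoulli observations
theta_ml = y.mean()                      # 0.2

# brute-force check: maximise the log-likelihood over a grid of theta values
thetas = np.linspace(0.001, 0.999, 999)
loglik = y.sum() * np.log(thetas) + (len(y) - y.sum()) * np.log(1 - thetas)
print(theta_ml, thetas[np.argmax(loglik)])   # both approximately 0.2
```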

32 Prior distribution: the ML gender ratio $\theta_{ML} = 0.2$ is unlikely! If we add more data, the ML solution will surely converge to a reasonable value around $\theta \approx 0.5$; but what if we don't have more data? We should also consider our prior beliefs on $\theta$, which would reject $\theta = 0.2$. Let's encode the prior belief as a distribution $p(\theta) = \mathrm{Beta}(\theta \mid \alpha, \beta)$ and subjectively choose $\alpha = \beta = 20$. How do we combine prior and likelihood? 32 / 61

33 Posterior distribution combines prior and likelihood. Bayes' rule $p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$, hence $p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$ gives the posterior distribution $p(\theta \mid y)$; the evidence $p(y)$ is constant wrt $\theta$; exactly what we wanted! Maximum a posteriori (MAP) inference: $\theta_{MAP} = \arg\max_\theta \underbrace{p(\theta \mid y)}_{\propto\, p(y \mid \theta)\, p(\theta)}$. 33 / 61

34 Bayesian estimators: the maximum likelihood estimator $\theta_{ML} = \arg\max_\theta \log p(y \mid \theta)$; the maximum a posteriori estimator $\theta_{MAP} = \arg\max_\theta \log p(\theta \mid y)$ (a "maximum a priori" estimator is not used). 34 / 61

35 Data decreases parameter variance 35 / 61

36 Solving posteriors: the crucial part of Bayesian modelling is solving the posterior. The naive solution is to search for the $\theta$ that maximises the posterior $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$, which is not feasible if the parameter space is large. Conjugate distributions are combinations of priors and likelihoods that have known analytical posterior distributions; for instance, with $p(y \mid \theta) = \prod_{i=1}^N \mathrm{Ber}(y^{(i)} \mid \theta)$ and $p(\theta) = \mathrm{Beta}(\alpha, \beta)$ the posterior is $p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \mathrm{Beta}\!\left(\alpha + \sum_{i=1}^N y^{(i)},\ \beta + N - \sum_{i=1}^N y^{(i)}\right)$. 36 / 61
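A short sketch of this conjugate Beta-Bernoulli update for the running example ($\alpha = \beta = 20$ and the five births above), assuming SciPy is available for the Beta distribution.

```python
import numpy as np
from scipy import stats

y = np.array([0, 0, 1, 0, 0])
alpha, beta = 20, 20                      # subjective prior Beta(20, 20)

# conjugate update: Beta(alpha + sum(y), beta + N - sum(y))
alpha_post = alpha + y.sum()              # 21
beta_post = beta + len(y) - y.sum()       # 24
posterior = stats.beta(alpha_post, beta_post)

theta_map = (alpha_post - 1) / (alpha_post + beta_post - 2)  # mode of the Beta posterior
print(theta_map, posterior.mean())        # ~0.465 and ~0.467, pulled back towards 0.5
```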

37 ID card for Bayesian modelling: a data set of observations $y = (y^{(i)})_{i=1}^N$; define a generative probability model $p(y \mid \theta) \in \mathbb{R}_+$; define a likelihood $p(y \mid \theta)$ of observing dataset $y$ given model $\theta$; define a prior belief $p(\theta)$ on parameter values; solve the posterior belief $p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta)$ of parameters $\theta$ given observations $y$; several likelihood/prior pairs have known posterior solutions. 37 / 61

38 Outline (current section: Bayesian linear regression). 38 / 61

39 Bayesian linear regression (BLR) recipe: Bayesian linear regression has several differences to deterministic linear regression; the parameters $w$ are modeled as a distribution $p(w)$, and predictions $h(x)$ are distributions $p(y \mid w)$. BLR recipe: (1) we need to define a likelihood $p(y \mid w)$; (2) we need to define a prior $p(w)$; (3) the optimal parameters are represented by the posterior $p(w \mid y)$. 39 / 61

40 Bayesian linear regression: data and hypothesis. Assume a 1D dataset with $x = (x^{(1)}, \ldots, x^{(N)})^T \in \mathbb{R}^N$ and $y = (y^{(1)}, \ldots, y^{(N)})^T \in \mathbb{R}^N$; assume the bias trick $\phi(x) = (1, x)^T \in \mathbb{R}^2$; assume parameters $w = (w_0, w_1)^T \in \mathbb{R}^2$ and linear regression $h_w(x) = w_0 + w_1 x = \sum_{j=0}^1 w_j \phi_j(x) = w^T \phi(x)$; the data feature matrix is $\Phi = (\phi(x^{(1)}), \ldots, \phi(x^{(N)}))^T \in \mathbb{R}^{N \times 2}$. 40 / 61

41 Bayesian linear regression: variance model. Assume the variance model $y = \underbrace{h_w(x)}_{\text{clean output}} + \underbrace{\varepsilon}_{\text{perturbation}}$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. If $A \sim \mathcal{N}(\mu, \sigma^2)$, then the new random variable $B = cA + d \sim \mathcal{N}(c\mu + d, c^2 \sigma^2)$ for any real $c, d$; this rule gives $y = h_w(x) + \varepsilon \sim \mathcal{N}(w^T \phi(x), \sigma^2)$. 41 / 61

42 Bayesian linear regression: (1) likelihood. The likelihood is then $p(y \mid w, \sigma) = \prod_{i=1}^N p(y^{(i)} \mid w, \sigma) = \prod_{i=1}^N \mathcal{N}(y^{(i)} \mid w^T \phi(x^{(i)}), \sigma^2) = \mathcal{N}(y \mid \Phi w, \sigma^2 I_N) = \frac{1}{\sqrt{(2\pi)^N |\sigma^2 I|}} \exp\left(-\frac{1}{2} (y - \Phi w)^T (\sigma^2 I)^{-1} (y - \Phi w)\right) \propto \exp\left(-\frac{1}{2\sigma^2} (y - \Phi w)^T (y - \Phi w)\right)$. 42 / 61

43 Bayesian linear regression: (2) prior. A bivariate Gaussian prior for the parameters $w = (w_0, w_1)^T$: $p(w) = \mathcal{N}\!\left(\underbrace{\begin{pmatrix} w_0 \\ w_1 \end{pmatrix}}_{w} \,\middle|\, \underbrace{\begin{pmatrix} m_{0,0} \\ m_{0,1} \end{pmatrix}}_{m_0}, \underbrace{\begin{pmatrix} \sigma_0^2 & 0 \\ 0 & \sigma_1^2 \end{pmatrix}}_{S_0}\right) \propto \exp\left(-\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)\right)$. 43 / 61

44 Bayesian linear regression: (3) posterior. The (conjugate) posterior is now (DL book 5.6) $p(w \mid y, \sigma) = \frac{p(y \mid w, \sigma)\, p(w)}{p(y)} \propto p(y \mid w, \sigma)\, p(w) = \mathcal{N}(y \mid \Phi w, \sigma^2 I)\, \mathcal{N}(w \mid m_0, S_0) \propto \exp\left(-\frac{1}{2\sigma^2} (y - \Phi w)^T (y - \Phi w)\right) \exp\left(-\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)\right) \propto \exp\left(-\frac{1}{2} (w - m_N)^T S_N^{-1} (w - m_N)\right) \propto \mathcal{N}(w \mid m_N, S_N)$, where $S_N = (S_0^{-1} + \sigma^{-2} \Phi^T \Phi)^{-1}$ and $m_N = S_N (S_0^{-1} m_0 + \sigma^{-2} \Phi^T y)$. 44 / 61

45 BLR summary: (1) likelihood, (2) prior and (3) posterior. Data likelihood $p(y \mid w, \sigma) = \prod_{i=1}^N \mathcal{N}(y^{(i)} \mid w^T \phi(x^{(i)}), \sigma^2) = \mathcal{N}(y \mid \Phi w, \sigma^2 I_N)$; parameter prior $p(w) = \mathcal{N}(w \mid m_0, S_0)$; parameter posterior $p(w \mid y, \sigma) = \frac{p(y \mid w, \sigma)\, p(w)}{p(y \mid \sigma)} = \mathcal{N}(w \mid m_N, S_N)$, where $m_N = S_N (S_0^{-1} m_0 + \sigma^{-2} \Phi^T y)$ and $S_N = (S_0^{-1} + \sigma^{-2} \Phi^T \Phi)^{-1}$. 45 / 61
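A compact NumPy sketch of these posterior formulas with the bias-trick features $\phi(x) = (1, x)$ and a diagonal prior as in the slides; the data values below are made up, not the lecture's rent dataset.

```python
import numpy as np

def blr_posterior(x, y, m0, S0, sigma):
    """Posterior N(m_N, S_N) for Bayesian linear regression with features phi(x) = (1, x)."""
    Phi = np.column_stack([np.ones_like(x), x])           # N x 2 feature matrix (bias trick)
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)  # S_N = (S0^-1 + sigma^-2 Phi^T Phi)^-1
    m_N = S_N @ (S0_inv @ m0 + Phi.T @ y / sigma**2)      # m_N = S_N (S0^-1 m0 + sigma^-2 Phi^T y)
    return m_N, S_N

x = np.array([23., 30., 42., 55., 61., 70.])              # e.g. apartment sizes
y = np.array([520., 610., 700., 810., 880., 990.])        # e.g. rents
m0 = np.zeros(2)
S0 = np.diag([300.0**2, 10.0**2])                         # prior std 300 for w0, 10 for w1
m_N, S_N = blr_posterior(x, y, m0, S0, sigma=100.0)
print(m_N)                     # posterior mean of (w0, w1)
print(np.sqrt(np.diag(S_N)))   # posterior standard deviations
```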

46 BLR Prior ($\sigma = 100$): let's pick $m_0 = (0, 0)^T$ and $\sigma_0 = 300$ and $\sigma_1 = 10$ (why?); let's sample 5 parameters from the prior, $w^{(j)} \sim \mathcal{N}(m_0, S_0)$, $j = 1, \ldots, 5$, and plot the 5 hypotheses $h^{(j)}(x) = w^{(j)T} \phi(x)$, $j = 1, \ldots, 5$. 46 / 61
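A sketch of this prior-sampling step (plotting omitted; the sampled hypotheses are simply evaluated on a small grid). The random seed and grid are arbitrary choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
m0 = np.zeros(2)
S0 = np.diag([300.0**2, 10.0**2])    # prior covariance: std 300 for w0, std 10 for w1

w_samples = rng.multivariate_normal(m0, S0, size=5)          # 5 draws w^(j) ~ N(m0, S0)
x_grid = np.linspace(20.0, 80.0, 4)
Phi_grid = np.column_stack([np.ones_like(x_grid), x_grid])   # phi(x) = (1, x)

for j, w in enumerate(w_samples):
    print(f"h^({j+1})(x) =", Phi_grid @ w)   # sampled hypothesis evaluated on the grid
```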

47 BLR, first data point: let's sample 5 new parameters from the posterior, $w^{(j)} \sim \mathcal{N}(m_1, S_1)$, where $m_1 = \sigma^{-2} S_1 \Phi_{1:1}^T y_{1:1}$ and $S_1 = (\alpha^{-2} I + \sigma^{-2} \Phi_{1:1}^T \Phi_{1:1})^{-1}$, and draw these 5 hypotheses $h^{(j)}(x) = w^{(j)T} \phi(x)$. 47 / 61

48 BLR, two data points: posterior after seeing (random) 2 out of 11 datapoints. 48 / 61

49 BLR 3 data points 49 / 61

50 BLR 5 data points 50 / 61

51 BLR 7 data points 51 / 61

52 BLR, all data points: the posterior has converged to $p(w \mid y, \sigma) = \mathcal{N}(m_{11}, S_{11})$ with $m_{11} = \begin{pmatrix} E[w_0 \mid y, \sigma] \\ E[w_1 \mid y, \sigma] \end{pmatrix} = \begin{pmatrix} 441 \\ 7.2 \end{pmatrix}$ and covariance $S_{11} = \begin{pmatrix} 4185 & \cdots \\ \cdots & \cdots \end{pmatrix}$. 52 / 61

53 BLR final model: let's zoom into the posterior and draw 100 samples; the posterior represents all hypotheses that match the data and our prior assumptions (!). Which one should we predict with? All of them, and take the average! 53 / 61

54 BLR final model: the predictive hypothesis $h(x) = E_{w \sim p(w \mid y)}[h_w(x)] = \int w^T \phi(x)\, \mathcal{N}(w \mid m_N, S_N)\, dw$, giving $h(x) \sim \mathcal{N}(m_N^T \phi(x),\ \phi(x)^T S_N \phi(x))$; finally $y(x) = h(x) + \varepsilon \sim \mathcal{N}(m_N^T \phi(x),\ \phi(x)^T S_N \phi(x) + \sigma^2)$. 54 / 61
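A small self-contained sketch of this predictive distribution, using the same hypothetical data and prior as the earlier BLR sketch and evaluating the predictive mean and standard deviation at a new input.

```python
import numpy as np

# same hypothetical setup as the earlier BLR sketch
x = np.array([23., 30., 42., 55., 61., 70.])
y = np.array([520., 610., 700., 810., 880., 990.])
sigma = 100.0
Phi = np.column_stack([np.ones_like(x), x])
S0_inv = np.linalg.inv(np.diag([300.0**2, 10.0**2]))
S_N = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)
m_N = S_N @ (Phi.T @ y / sigma**2)        # m0 = 0 here

def predictive(x_new):
    """Predictive N(m_N^T phi(x), phi(x)^T S_N phi(x) + sigma^2) at a new input."""
    phi = np.array([1.0, x_new])
    mean = m_N @ phi
    var = phi @ S_N @ phi + sigma**2
    return mean, np.sqrt(var)

print(predictive(45.0))   # predictive mean and standard deviation at x = 45
```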

55 What did we learn? The posterior $p(w \mid y, \sigma)$ is the belief in any specific parameter value $w = (w_0, w_1)$ after observing data; our solution is a distribution instead of a single value $w$. Instead of predicting with an optimal $w_{MAP} = \arg\max_w p(w \mid y) = m_N = (S_0^{-1} + \sigma^{-2} \Phi^T \Phi)^{-1} (S_0^{-1} m_0 + \sigma^{-2} \Phi^T y)$, we average the predictions from all posterior solutions. The posterior concentrates as data increases; the predictive distribution averages over all hypotheses, giving a more robust model with no need to choose a single parameter value; there is rarely a true underlying parameter value, but rather a continuum of compatible parameters. 55 / 61

56 On priors: a prior is a subjective distribution defined by the modeler. Bayesian theory argues that the prior should be subjective and represent our prior beliefs about which hypotheses are expected; our analysis is thus not objective, but it does not need to be! The posterior encodes our degree of belief in certain models/parameters given our very explicit prior assumptions and the data. A wide prior is recommended: we don't want to exclude hypotheses. The prior assigns a probability to each hypothesis in the hypothesis space. Priors can also be used as a way to constrain the parameter values not to have crazy extreme values, or to favour simple models (Lectures 7 & 8). 56 / 61

57 Regularising with priors: a prior that constrains parameters to nice values is a regulariser. Assume the 1st data point is $x^{(1)} = 0.6$, $y^{(1)} = 1$: the likelihood is now an infinite band in the $(w_0, w_1)$ plane along the line $w_1 x^{(1)} + w_0 = y^{(1)}$. We assume huge values are clearly wrong; a zero-mean prior encodes this assumption. 57 / 61
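To make the regulariser reading concrete, here is a short derivation that is not on the slide but follows directly from the BLR formulas above, assuming an isotropic zero-mean prior $m_0 = 0$, $S_0 = \alpha^2 I$:

```latex
\log p(w \mid y, \sigma)
  = -\tfrac{1}{2\sigma^2}\,\|y - \Phi w\|^2
    - \tfrac{1}{2\alpha^2}\,\|w\|^2 + \text{const},
\qquad\text{so}\qquad
w_{\mathrm{MAP}}
  = \arg\min_w \Big( \|y - \Phi w\|^2 + \underbrace{\tfrac{\sigma^2}{\alpha^2}}_{\lambda}\,\|w\|^2 \Big),
```

i.e. the MAP estimate coincides with ridge (L2-regularised) least squares with regularisation strength $\lambda = \sigma^2 / \alpha^2$.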

58 ID card of Bayesian linear regression: input/feature space $\mathcal{X} = \mathbb{R}^d$, target space $\mathcal{Y} = \mathbb{R}$, feature mapping $\phi(x) \in \mathbb{R}^n$; linear function plus noise $y = w^T \phi(x) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$; Normal prior $p(w) = \mathcal{N}(w \mid m_0, S_0)$; Normal likelihood $p(y \mid w, \sigma) = \mathcal{N}(y \mid \Phi w, \sigma^2 I_N)$; Normal posterior $p(w \mid y, \sigma) = \mathcal{N}(w \mid m_N, S_N)$, where $m_N = S_N (S_0^{-1} m_0 + \sigma^{-2} \Phi^T y)$ and $S_N = (S_0^{-1} + \sigma^{-2} \Phi^T \Phi)^{-1}$; predictive posterior $h(x) \sim \mathcal{N}(m_N^T \phi(x),\ \phi(x)^T S_N \phi(x) + \sigma^2)$; MAP solution $w_{MAP} = m_N$. The 3 pillars of ML: neural networks, Bayesian learning, kernel methods (see the courses CS-E Machine Learning: Advanced Probabilistic Methods and CS-E Bayesian Data Analysis). 58 / 61

59 Recap of course so far, hypothesis space: the hypothesis space of linear regression is $\mathcal{H} = \{h(x) = w^T x \text{ where } x \in \mathbb{R}^d\}$; each unique parameter vector $w$ gives a linear hypothesis $h_w(x)$, hence the number of possible hypotheses is the number of $d$-dimensional real-valued vectors (infinite!); the hypothesis space grows with the input dimension $d$. In Bayesian learning the hypothesis space does not change, but the hypotheses $h_w(\cdot)$ have prior probabilities $p(h_w) = p(w)$. The concept of hypothesis space will be discussed in Lectures 7 & 8. 59 / 61

60 Recap of course so far, loss function: the loss quantifies the error we make when predicting $h(x^{(i)})$ for the $i$th datapoint when its true value was $y^{(i)}$; the square loss function is $L((x, y), h(\cdot)) = (y - h(x))^2$; the empirical risk is the average square error over the dataset, $E(h_w \mid X, y) = \frac{1}{N} \sum_{i=1}^N L((x^{(i)}, y^{(i)}), h(\cdot)) = \frac{1}{N} \sum_{i=1}^N (y^{(i)} - h(x^{(i)}))^2$; the hypothesis that minimizes the empirical risk has the optimal fit to the data $X, y$; in Bayesian learning the role of the loss function is played by the likelihood. 60 / 61
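A minimal sketch of this empirical risk computation under the square loss; the hypothesis, weights and data below are toy values for illustration only.

```python
import numpy as np

def empirical_risk(h, X, y):
    """Average squared loss (1/N) * sum_i (y_i - h(x_i))^2 over the dataset."""
    predictions = np.array([h(x) for x in X])
    return np.mean((y - predictions) ** 2)

# toy linear hypothesis h(x) = w^T x with bias-trick features x = (1, size)
w = np.array([400.0, 7.0])
X = np.array([[1.0, 25.0], [1.0, 40.0], [1.0, 60.0]])
y = np.array([600.0, 700.0, 830.0])
print(empirical_risk(lambda x: w @ x, X, y))
```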

61 Next steps: next lecture is Classification I at 10:15. DL book: read chapters 5.5 and 5.6. More information on kernel methods: Hastie's book, chapter 6.1 (+ 6.2 & 6.3); Bishop's book, chapter 6.3. Bayesian linear regression: Bishop's book, chapter 3.3. Remember the post-lecture feedback questionnaire for this lecture. 61 / 61
