Linear Models for Regression

1 Linear Models for Regression Machine Learning Torsten Möller Möller/Mori 1

2 Reading Chapter 3 of Pattern Recognition and Machine Learning by Bishop; Chapter … of The Elements of Statistical Learning by Hastie, Tibshirani, Friedman Möller/Mori 2

3 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 3

5 Regression Given training set $\{(\mathbf{x}_1, t_1), \ldots, (\mathbf{x}_N, t_N)\}$; $t_i$ is continuous: regression. For now, assume $t_i \in \mathbb{R}$, $\mathbf{x}_i \in \mathbb{R}^D$. E.g. $t_i$ is stock price, $\mathbf{x}_i$ contains company profit, debt, cash flow, gross sales, number of spam emails sent, ... Möller/Mori 5

6 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 6

7 Linear Functions A function $f(\cdot)$ is linear if $f(\alpha x_1 + \beta x_2) = \alpha f(x_1) + \beta f(x_2)$. Linear functions will lead to simple algorithms, so let's see what we can do with them. Möller/Mori 7

10 Linear Regression Simplest linear model for regression: $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$. Remember, we're learning $\mathbf{w}$. Set $\mathbf{w}$ so that $y(\mathbf{x}, \mathbf{w})$ aligns with the target value in the training data. This is a very simple model, limited in what it can do. Möller/Mori 10

11 Linear Basis Function Models Simplest linear model $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$ was linear in $\mathbf{x}$ and $\mathbf{w}$. Linear in $\mathbf{w}$ is what will be important for simple algorithms. Extend to include fixed non-linear functions of the data: $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \ldots + w_{M-1} \phi_{M-1}(\mathbf{x})$. Linear combinations of these basis functions are also linear in the parameters. Möller/Mori 11

14 Linear Basis Function Models Bias parameter allows a fixed offset in the data: $y(\mathbf{x}, \mathbf{w}) = \underbrace{w_0}_{\text{bias}} + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \ldots + w_{M-1} \phi_{M-1}(\mathbf{x})$. Think of simple 1-D $x$: $y(x, \mathbf{w}) = \underbrace{w_0}_{\text{intercept}} + \underbrace{w_1}_{\text{slope}} x$. For notational convenience, define $\phi_0(\mathbf{x}) = 1$: $y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$. Möller/Mori 14

19 Linear Basis Function Models Function for regression $y(\mathbf{x}, \mathbf{w})$ is a non-linear function of $\mathbf{x}$, but linear in $\mathbf{w}$: $y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$. Polynomial regression is an example of this. Order-$M$ polynomial regression, $\phi_j(x) = ?$ Answer: $\phi_j(x) = x^j$, so $y(x, \mathbf{w}) = w_0 x^0 + w_1 x^1 + \ldots + w_M x^M$. Möller/Mori 19
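
To make the basis-function notation concrete, here is a minimal NumPy sketch (not from the slides) that builds the polynomial design matrix and evaluates $y(x, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(x)$; the helper names poly_basis and predict are made up for illustration.

```python
import numpy as np

def poly_basis(x, M):
    """Rows are phi(x_n) = (x_n^0, x_n^1, ..., x_n^M) for each scalar input x_n."""
    return np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)

def predict(x, w, M):
    """y(x, w) = w^T phi(x), evaluated for all inputs at once."""
    return poly_basis(x, M) @ w
```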

20 Basis Functions: Feature Functions Often we extract features from $\mathbf{x}$. An intuitive way to think of $\phi_j(\mathbf{x})$ is as feature functions. E.g. an automatic project report grading system. $\mathbf{x}$ is the text of the report: "In this project we apply the algorithm of Möller [2] to recognizing blue objects. We test this algorithm on pictures of you and I from my holiday photo collection...." $\phi_1(\mathbf{x})$ is the count of occurrences of "Möller ["; $\phi_2(\mathbf{x})$ is the count of occurrences of "of you and I". Regression grade: $y(\mathbf{x}, \mathbf{w}) = 20\phi_1(\mathbf{x}) - 10\phi_2(\mathbf{x})$. Möller/Mori 20
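
A toy sketch of the grading example (the substrings and weights come straight from the slide; the -10 sign and the helper names phi1, phi2, grade are my reading of it):

```python
def phi1(report):
    # feature: number of occurrences of the citation string "Möller ["
    return report.count("Möller [")

def phi2(report):
    # feature: number of occurrences of the grammatical error "of you and I"
    return report.count("of you and I")

def grade(report):
    # regression grade y(x, w) = 20*phi1(x) - 10*phi2(x)
    return 20 * phi1(report) - 10 * phi2(report)
```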

24 Other Non-linear Basis Functions Polynomial: $\phi_j(x) = x^j$. Gaussian: $\phi_j(x) = \exp\left\{-\frac{(x - \mu_j)^2}{2s^2}\right\}$. Sigmoidal: $\phi_j(x) = \frac{1}{1 + \exp((\mu_j - x)/s)}$. Möller/Mori 24
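
The three families of basis functions, written out as a small sketch (mu and s are user-chosen centre and width parameters):

```python
import numpy as np

def poly(x, j):
    return x ** j

def gaussian(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))

def sigmoidal(x, mu, s):
    return 1.0 / (1.0 + np.exp((mu - x) / s))
```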

25 Example - Gaussian Basis Functions: Temperature Use Gaussian basis functions, regression on temperature Möller/Mori 25

28 Example - Gaussian Basis Functions: Temperature $\mu_1$ = Vancouver, $\mu_2$ = San Francisco, $\mu_3$ = Oakland. Temperature in $x$ = Seattle? $y(x, \mathbf{w}) = w_1 \exp\left\{-\frac{(x - \mu_1)^2}{2s^2}\right\} + w_2 \exp\left\{-\frac{(x - \mu_2)^2}{2s^2}\right\} + w_3 \exp\left\{-\frac{(x - \mu_3)^2}{2s^2}\right\}$. Compute distances to all $\mu$; $y(x, \mathbf{w}) \approx w_1$. Möller/Mori 28

29 Example - Gaussian Basis Functions Could define $\phi_j(x) = \exp\left\{-\frac{(x - x_j)^2}{2s^2}\right\}$: a Gaussian around each training data point $x_j$, $N$ of them. Could use for modelling temperature or resource availability at spatial location $x$. Overfitting - interpolates the data. Example of a kernel method, more later. Möller/Mori 29

34 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 34

35 Loss Functions for Regression We want to find the best set of coefficients $\mathbf{w}$. Recall, one way to define best was minimizing squared error: $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2$. We will now look at another way, based on maximum likelihood. Möller/Mori 35

36 Gaussian Noise Model for Regression We are provided with a training set $\{(\mathbf{x}_i, t_i)\}$. Let's assume $t$ arises from a deterministic function plus Gaussian distributed (with precision $\beta$) noise: $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$. The probability of observing a target value $t$ is then: $p(t|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$. Notation: $\mathcal{N}(x \mid \mu, \sigma^2)$; $x$ drawn from a Gaussian with mean $\mu$, variance $\sigma^2$. Möller/Mori 36

38 Maximum Likelihood for Regression The likelihood of the data $\mathbf{t} = \{t_i\}$ using this Gaussian noise model is:
$$p(\mathbf{t}|\mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$$
The log-likelihood is:
$$\ln p(\mathbf{t}|\mathbf{w}, \beta) = \ln \prod_{n=1}^{N} \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{\beta}{2}(t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2\right) = \underbrace{\frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)}_{\text{const. wrt } \mathbf{w}} - \beta \underbrace{\frac{1}{2}\sum_{n=1}^{N}(t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2}_{\text{squared error}}$$
Sum of squared errors is maximum likelihood under a Gaussian noise model. Möller/Mori 38

42 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 42

43 Finding Optimal Weights How do we maximize the likelihood wrt $\mathbf{w}$ (or minimize the squared error)? Take the gradient of the log-likelihood wrt $\mathbf{w}$:
$$\frac{\partial}{\partial w_i} \ln p(\mathbf{t}|\mathbf{w}, \beta) = \beta \sum_{n=1}^{N} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)) \phi_i(\mathbf{x}_n)$$
In vector form:
$$\nabla \ln p(\mathbf{t}|\mathbf{w}, \beta) = \beta \sum_{n=1}^{N} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)) \boldsymbol{\phi}(\mathbf{x}_n)^T$$
Möller/Mori 43

46 Finding Optimal Weights Set the gradient to 0:
$$\mathbf{0}^T = \sum_{n=1}^{N} t_n \boldsymbol{\phi}(\mathbf{x}_n)^T - \mathbf{w}^T \left( \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n) \boldsymbol{\phi}(\mathbf{x}_n)^T \right)$$
Maximum likelihood estimate for $\mathbf{w}$:
$$\mathbf{w}_{ML} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{t}, \qquad \boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \ldots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \ldots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & & & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \ldots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$$
$\boldsymbol{\Phi}^\dagger = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T$ is known as the pseudo-inverse (pinv in MATLAB). Möller/Mori 46
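
A sketch of the closed-form fit in NumPy, assuming the design matrix Phi (N x M) and target vector t are already built, e.g. with the polynomial basis above; np.linalg.pinv plays the role of MATLAB's pinv:

```python
import numpy as np

def fit_least_squares(Phi, t):
    """w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via the pseudo-inverse."""
    return np.linalg.pinv(Phi) @ t

# An equivalent and numerically more robust alternative:
# w_ML, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```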

48 Geometry of Least Squares [Figure: target vector $\mathbf{t}$ and its projection $\mathbf{y}$ onto the subspace $\mathcal{S}$ spanned by $\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2$] $\mathbf{t} = (t_1, \ldots, t_N)$ is the target value vector. $\mathcal{S}$ is the space spanned by $\boldsymbol{\varphi}_j = (\phi_j(\mathbf{x}_1), \ldots, \phi_j(\mathbf{x}_N))$. The solution $\mathbf{y}$ lies in $\mathcal{S}$. The least squares solution is the orthogonal projection of $\mathbf{t}$ onto $\mathcal{S}$. Can verify this by looking at $\mathbf{y} = \boldsymbol{\Phi} \mathbf{w}_{ML} = \boldsymbol{\Phi} \boldsymbol{\Phi}^\dagger \mathbf{t} = \mathbf{P} \mathbf{t}$, where $\mathbf{P}^2 = \mathbf{P}$ and $\mathbf{P} = \mathbf{P}^T$. Möller/Mori 48
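
The projection-matrix properties are easy to check numerically; this is a quick sanity-check sketch on random data, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))      # 10 data points, 3 basis functions
P = Phi @ np.linalg.pinv(Phi)       # P = Phi Phi^+

print(np.allclose(P @ P, P))        # idempotent: P^2 = P
print(np.allclose(P, P.T))          # symmetric:  P = P^T
```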

53 Sequential Learning In practice $N$ might be huge, or data might arrive online. Can use a gradient descent method: start with an initial guess for $\mathbf{w}$; update by taking a step in the gradient direction $\nabla E$ of the error function. Modify to use stochastic / sequential gradient descent: if the error function decomposes as $E = \sum_n E_n$ (e.g. least squares), update by taking a step in the gradient direction $\nabla E_n$ for one example. Details about the step size are important; decrease the step size at the end. Möller/Mori 53
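
A minimal sketch of the sequential (stochastic) update for the least squares error, assuming Phi and t as before; the particular decaying step-size schedule is just one simple choice:

```python
import numpy as np

def fit_sgd(Phi, t, eta0=0.01, epochs=20, seed=0):
    """One gradient step on E_n per training example, with a decreasing step size."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    step = 0
    for _ in range(epochs):
        for n in rng.permutation(N):
            step += 1
            eta = eta0 / (1.0 + 0.001 * step)   # decrease step size over time
            err = t[n] - w @ Phi[n]             # t_n - w^T phi(x_n)
            w += eta * err * Phi[n]             # step that reduces E_n
    return w
```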

57 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 57

58 Regularization We discussed regularization as a technique to avoid over-fitting:
$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \underbrace{\frac{\lambda}{2} \|\mathbf{w}\|^2}_{\text{regularizer}}$$
Next on the menu: other regularizers; Bayesian learning and the quadratic regularizer. Möller/Mori 58

59 Other Regularizers [Figure: contours of the regularizer for q = 0.5, q = 1, q = 2, q = 4] Can use different norms for the regularizer:
$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$$
e.g. $q = 2$: ridge regression; $q = 1$: lasso. The math is easiest with ridge regression. Möller/Mori 59

60 Optimization with a Quadratic Regularizer With $q = 2$, the total error is still a nice quadratic:
$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}$$
Calculus...
$$\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t} \quad \text{(regularized)}$$
Similar to unregularized least squares. Advantage: $(\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})$ is well conditioned, so inversion is stable. Möller/Mori 60
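
The regularized solution as a sketch (lam is the regularization strength $\lambda$); solving the well-conditioned linear system is the stable route the slide alludes to:

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)
```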

62 Ridge Regression vs. Lasso [Figure: error contours and the ridge vs. lasso constraint regions in $(w_1, w_2)$ space, with the constrained minimum $\mathbf{w}$ marked] Ridge regression aka parameter shrinkage: weights $\mathbf{w}$ shrink back towards the origin. Lasso leads to sparse models: components of $\mathbf{w}$ tend to 0 with large $\lambda$ (strong regularization). Intuitively, once the minimum is achieved at a large radius, the minimum is on $w_1 = 0$. Möller/Mori 62

63 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 63

64 Bayesian Linear Regression Start with a prior over the parameters $\mathbf{w}$. The conjugate prior is a Gaussian: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$. This simple form will make the math easier; it can be done for an arbitrary Gaussian too. Data likelihood, Gaussian model as before: $p(t|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$. Möller/Mori 64

66 Bayesian Linear Regression Posterior distribution on $\mathbf{w}$:
$$p(\mathbf{w}|\mathbf{t}) \propto \left( \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta) \right) p(\mathbf{w}) = \left[ \prod_{n=1}^{N} \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2 \right) \right] \left( \frac{\alpha}{2\pi} \right)^{M/2} \exp\left( -\frac{\alpha}{2} \mathbf{w}^T \mathbf{w} \right)$$
Take the log:
$$\ln p(\mathbf{w}|\mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2 - \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}$$
$L_2$ regularization is maximum a posteriori (MAP) with a Gaussian prior, with $\lambda = \alpha/\beta$. Möller/Mori 66
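
Completing the square in this log posterior gives a Gaussian posterior $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$; the slide does not spell out the closed form, so the expressions below are the standard result for this prior/likelihood pair, sketched in NumPy. The posterior mean coincides with the ridge solution for $\lambda = \alpha/\beta$:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior p(w | t) = N(w | m_N, S_N) for the zero-mean Gaussian prior above."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# MAP estimate = m_N; it equals fit_ridge(Phi, t, lam=alpha/beta) above.
```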

70 Bayesian Linear Regression - Intuition Simple example: $x, t \in \mathbb{R}$, $y(x, \mathbf{w}) = w_0 + w_1 x$. Start with a Gaussian prior in parameter space; samples shown in data space. Receive data points (blue circles in data space). Compute the likelihood. The posterior is the prior (or previous posterior) times the likelihood. Möller/Mori 70

76 Predictive Distribution A single estimate of $\mathbf{w}$ (ML or MAP) doesn't tell the whole story. We have a distribution over $\mathbf{w}$, and can use it to make predictions. Given a new value of $\mathbf{x}$, we can compute a distribution over $t$:
$$p(t|\mathbf{t}, \alpha, \beta) = \int p(t, \mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} = \underbrace{\int}_{\text{sum}} \underbrace{p(t|\mathbf{w}, \beta)}_{\text{predict}}\, \underbrace{p(\mathbf{w}|\mathbf{t}, \alpha, \beta)}_{\text{probability}}\, d\mathbf{w}$$
i.e. for each value of $\mathbf{w}$, let it make a prediction, multiply by its probability, and sum over all $\mathbf{w}$. For arbitrary models as the distributions, this integral may not be computationally tractable. Möller/Mori 76
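
For this all-Gaussian model the integral has a closed form; the sketch below uses the standard predictive mean and variance (not written out on the slide), with m_N and S_N from the posterior sketch above:

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """p(t | x, training data) = N(t | mean, var) for a new feature vector phi(x)."""
    mean = m_N @ phi_x                         # m_N^T phi(x)
    var = 1.0 / beta + phi_x @ S_N @ phi_x     # noise variance + parameter uncertainty
    return mean, var
```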

79 Predictive Distribution With the Gaussians we've used for these distributions, the predictive distribution will also be Gaussian (math on convolutions of Gaussians). [Figure: green line is the true (unobserved) curve, blue circles are data points, red line is the mean, pink region is one standard deviation.] Uncertainty is small around the data points. The pink region shrinks with more data. Möller/Mori 79

80 Bayesian Model Selection So what do the Bayesians say about model selection? Model selection is choosing a model $\mathcal{M}_i$, e.g. the degree of the polynomial or the type of basis function $\phi$. Don't select, just integrate:
$$p(t|\mathbf{x}, \mathcal{D}) = \sum_{i=1}^{L} \underbrace{p(t|\mathbf{x}, \mathcal{M}_i, \mathcal{D})}_{\text{predictive dist.}}\, p(\mathcal{M}_i|\mathcal{D})$$
Average together the results of all models. Could instead choose the most likely model a posteriori, $p(\mathcal{M}_i|\mathcal{D})$: more efficient, an approximation. Möller/Mori 80

84 Bayesian Model Selection How do we compute the posterior over models? $p(\mathcal{M}_i|\mathcal{D}) \propto p(\mathcal{D}|\mathcal{M}_i)\, p(\mathcal{M}_i)$. Another likelihood + prior combination. Likelihood: $p(\mathcal{D}|\mathcal{M}_i) = \int p(\mathcal{D}|\mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w}|\mathcal{M}_i)\, d\mathbf{w}$. Möller/Mori 84

86 Conclusion Readings: Ch. 3.1, ..., 3.4. Linear Models for Regression: linear combination of (non-linear) basis functions. Fitting the parameters of a regression model: least squares; maximum likelihood (can be = least squares). Controlling over-fitting: regularization; Bayesian, use a prior (can be = regularization). Model selection: cross-validation (use held-out data); Bayesian (use model evidence, likelihood). Möller/Mori 86
