Linear Models for Regression

1 Linear Models for Regression Machine Learning Torsten Möller Möller/Mori 1

2 Reading Chapter 3 of Pattern Recognition and Machine Learning by Bishop; Chapter … of The Elements of Statistical Learning by Hastie, Tibshirani, Friedman Möller/Mori 2

3 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 3

5 Regression Given training set $\{(\mathbf{x}_1, t_1), \ldots, (\mathbf{x}_N, t_N)\}$; $t_i$ is continuous: regression. For now, assume $t_i \in \mathbb{R}$, $\mathbf{x}_i \in \mathbb{R}^D$. E.g. $t_i$ is stock price, $\mathbf{x}_i$ contains company profit, debt, cash flow, gross sales, number of spam emails sent, ... Möller/Mori 5

6 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 6

7 Linear Functions A function $f(\cdot)$ is linear if $f(\alpha x_1 + \beta x_2) = \alpha f(x_1) + \beta f(x_2)$. Linear functions will lead to simple algorithms, so let's see what we can do with them. Möller/Mori 7

10 Linear Regression Simplest linear model for regression: $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$. Remember, we're learning $\mathbf{w}$. Set $\mathbf{w}$ so that $y(\mathbf{x}, \mathbf{w})$ aligns with the target value in the training data. This is a very simple model, limited in what it can do. Möller/Mori 10

11 Linear Basis Function Models Simplest linear model $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$ was linear in $\mathbf{x}$ and $\mathbf{w}$. Linear in $\mathbf{w}$ is what will be important for simple algorithms. Extend to include fixed non-linear functions of the data: $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \ldots + w_{M-1} \phi_{M-1}(\mathbf{x})$. Linear combinations of these basis functions are also linear in the parameters. Möller/Mori 11

14 Linear Basis Function Models Bias parameter allows a fixed offset in the data: $y(\mathbf{x}, \mathbf{w}) = \underbrace{w_0}_{\text{bias}} + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \ldots + w_{M-1} \phi_{M-1}(\mathbf{x})$. Think of simple 1-D $x$: $y(x, \mathbf{w}) = \underbrace{w_0}_{\text{intercept}} + \underbrace{w_1}_{\text{slope}} x$. For notational convenience, define $\phi_0(\mathbf{x}) = 1$: $y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$. Möller/Mori 14

19 Linear Basis Function Models Function for regression $y(\mathbf{x}, \mathbf{w})$ is a non-linear function of $\mathbf{x}$, but linear in $\mathbf{w}$: $y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$. Polynomial regression is an example of this. Order-$M$ polynomial regression, $\phi_j(x) = ?$ Answer: $\phi_j(x) = x^j$, so $y(x, \mathbf{w}) = w_0 x^0 + w_1 x^1 + \ldots + w_M x^M$. Möller/Mori 19
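
To make the basis-function notation concrete, here is a minimal NumPy sketch (not from the slides) that builds the polynomial design matrix and evaluates $y(x, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(x)$; the helper names poly_basis and predict are made up for illustration.

```python
import numpy as np

def poly_basis(x, M):
    """Rows are phi(x_n) = (x_n^0, x_n^1, ..., x_n^M) for each scalar input x_n."""
    return np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)

def predict(x, w, M):
    """y(x, w) = w^T phi(x), evaluated for all inputs at once."""
    return poly_basis(x, M) @ w
```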

20 Basis Functions: Feature Functions Often we extract features from $\mathbf{x}$. An intuitive way to think of $\phi_j(\mathbf{x})$ is as feature functions. E.g. an automatic project report grading system. $\mathbf{x}$ is the text of the report: "In this project we apply the algorithm of Möller [2] to recognizing blue objects. We test this algorithm on pictures of you and I from my holiday photo collection...." $\phi_1(\mathbf{x})$ is the count of occurrences of "Möller ["; $\phi_2(\mathbf{x})$ is the count of occurrences of "of you and I". Regression grade: $y(\mathbf{x}, \mathbf{w}) = 20\phi_1(\mathbf{x}) - 10\phi_2(\mathbf{x})$. Möller/Mori 20
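
A toy sketch of the grading example (the substrings and weights come straight from the slide; the -10 sign and the helper names phi1, phi2, grade are my reading of it):

```python
def phi1(report):
    # feature: number of occurrences of the citation string "Möller ["
    return report.count("Möller [")

def phi2(report):
    # feature: number of occurrences of the grammatical error "of you and I"
    return report.count("of you and I")

def grade(report):
    # regression grade y(x, w) = 20*phi1(x) - 10*phi2(x)
    return 20 * phi1(report) - 10 * phi2(report)
```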

24 Other Non-linear Basis Functions Polynomial: $\phi_j(x) = x^j$. Gaussian: $\phi_j(x) = \exp\left\{-\frac{(x - \mu_j)^2}{2s^2}\right\}$. Sigmoidal: $\phi_j(x) = \frac{1}{1 + \exp((\mu_j - x)/s)}$. Möller/Mori 24
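
The three families of basis functions, written out as a small sketch (mu and s are user-chosen centre and width parameters):

```python
import numpy as np

def poly(x, j):
    return x ** j

def gaussian(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))

def sigmoidal(x, mu, s):
    return 1.0 / (1.0 + np.exp((mu - x) / s))
```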

25 Example - Gaussian Basis Functions: Temperature Use Gaussian basis functions, regression on temperature Möller/Mori 25

28 Example - Gaussian Basis Functions: Temperature $\mu_1$ = Vancouver, $\mu_2$ = San Francisco, $\mu_3$ = Oakland. Temperature in $x$ = Seattle? $y(x, \mathbf{w}) = w_1 \exp\left\{-\frac{(x - \mu_1)^2}{2s^2}\right\} + w_2 \exp\left\{-\frac{(x - \mu_2)^2}{2s^2}\right\} + w_3 \exp\left\{-\frac{(x - \mu_3)^2}{2s^2}\right\}$. Compute distances to all $\mu$; $y(x, \mathbf{w}) \approx w_1$. Möller/Mori 28

29 Example - Gaussian Basis Functions Could define $\phi_j(x) = \exp\left\{-\frac{(x - x_j)^2}{2s^2}\right\}$: a Gaussian around each training data point $x_j$, $N$ of them. Could use for modelling temperature or resource availability at spatial location $x$. Overfitting - interpolates the data. Example of a kernel method, more later. Möller/Mori 29

34 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 34

35 Loss Functions for Regression We want to find the best set of coefficients $\mathbf{w}$. Recall, one way to define best was minimizing squared error: $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2$. We will now look at another way, based on maximum likelihood. Möller/Mori 35

36 Gaussian Noise Model for Regression We are provided with a training set $\{(\mathbf{x}_i, t_i)\}$. Let's assume $t$ arises from a deterministic function plus Gaussian distributed (with precision $\beta$) noise: $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$. The probability of observing a target value $t$ is then: $p(t|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$. Notation: $\mathcal{N}(x \mid \mu, \sigma^2)$; $x$ drawn from a Gaussian with mean $\mu$, variance $\sigma^2$. Möller/Mori 36

38 Maximum Likelihood for Regression The likelihood of the data $\mathbf{t} = \{t_i\}$ using this Gaussian noise model is:
$$p(\mathbf{t}|\mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$$
The log-likelihood is:
$$\ln p(\mathbf{t}|\mathbf{w}, \beta) = \ln \prod_{n=1}^{N} \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{\beta}{2}(t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2\right) = \underbrace{\frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)}_{\text{const. wrt } \mathbf{w}} - \beta \underbrace{\frac{1}{2}\sum_{n=1}^{N}(t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2}_{\text{squared error}}$$
Sum of squared errors is maximum likelihood under a Gaussian noise model. Möller/Mori 38

42 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 42

43 Finding Optimal Weights How do we maximize the likelihood wrt $\mathbf{w}$ (or minimize the squared error)? Take the gradient of the log-likelihood wrt $\mathbf{w}$:
$$\frac{\partial}{\partial w_i} \ln p(\mathbf{t}|\mathbf{w}, \beta) = \beta \sum_{n=1}^{N} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)) \phi_i(\mathbf{x}_n)$$
In vector form:
$$\nabla \ln p(\mathbf{t}|\mathbf{w}, \beta) = \beta \sum_{n=1}^{N} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)) \boldsymbol{\phi}(\mathbf{x}_n)^T$$
Möller/Mori 43

46 Finding Optimal Weights Set the gradient to 0:
$$\mathbf{0}^T = \sum_{n=1}^{N} t_n \boldsymbol{\phi}(\mathbf{x}_n)^T - \mathbf{w}^T \left( \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n) \boldsymbol{\phi}(\mathbf{x}_n)^T \right)$$
Maximum likelihood estimate for $\mathbf{w}$:
$$\mathbf{w}_{ML} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{t}, \qquad \boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \ldots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \ldots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & & & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \ldots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$$
$\boldsymbol{\Phi}^\dagger = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T$ is known as the pseudo-inverse (pinv in MATLAB). Möller/Mori 46
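
A sketch of the closed-form fit in NumPy, assuming the design matrix Phi (N x M) and target vector t are already built, e.g. with the polynomial basis above; np.linalg.pinv plays the role of MATLAB's pinv:

```python
import numpy as np

def fit_least_squares(Phi, t):
    """w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via the pseudo-inverse."""
    return np.linalg.pinv(Phi) @ t

# An equivalent and numerically more robust alternative:
# w_ML, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```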

48 Geometry of Least Squares [Figure: target vector $\mathbf{t}$ and its projection $\mathbf{y}$ onto the subspace $\mathcal{S}$ spanned by $\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2$] $\mathbf{t} = (t_1, \ldots, t_N)$ is the target value vector. $\mathcal{S}$ is the space spanned by $\boldsymbol{\varphi}_j = (\phi_j(\mathbf{x}_1), \ldots, \phi_j(\mathbf{x}_N))$. The solution $\mathbf{y}$ lies in $\mathcal{S}$. The least squares solution is the orthogonal projection of $\mathbf{t}$ onto $\mathcal{S}$. Can verify this by looking at $\mathbf{y} = \boldsymbol{\Phi} \mathbf{w}_{ML} = \boldsymbol{\Phi} \boldsymbol{\Phi}^\dagger \mathbf{t} = \mathbf{P} \mathbf{t}$, where $\mathbf{P}^2 = \mathbf{P}$ and $\mathbf{P} = \mathbf{P}^T$. Möller/Mori 48
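
The projection-matrix properties are easy to check numerically; this is a quick sanity-check sketch on random data, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))      # 10 data points, 3 basis functions
P = Phi @ np.linalg.pinv(Phi)       # P = Phi Phi^+

print(np.allclose(P @ P, P))        # idempotent: P^2 = P
print(np.allclose(P, P.T))          # symmetric:  P = P^T
```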

53 Sequential Learning In practice $N$ might be huge, or data might arrive online. Can use a gradient descent method: start with an initial guess for $\mathbf{w}$; update by taking a step in the gradient direction $\nabla E$ of the error function. Modify to use stochastic / sequential gradient descent: if the error function decomposes as $E = \sum_n E_n$ (e.g. least squares), update by taking a step in the gradient direction $\nabla E_n$ for one example. Details about the step size are important; decrease the step size at the end. Möller/Mori 53
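
A minimal sketch of the sequential (stochastic) update for the least squares error, assuming Phi and t as before; the particular decaying step-size schedule is just one simple choice:

```python
import numpy as np

def fit_sgd(Phi, t, eta0=0.01, epochs=20, seed=0):
    """One gradient step on E_n per training example, with a decreasing step size."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    step = 0
    for _ in range(epochs):
        for n in rng.permutation(N):
            step += 1
            eta = eta0 / (1.0 + 0.001 * step)   # decrease step size over time
            err = t[n] - w @ Phi[n]             # t_n - w^T phi(x_n)
            w += eta * err * Phi[n]             # step that reduces E_n
    return w
```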

57 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 57

58 Regularization We discussed regularization as a technique to avoid over-fitting:
$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \underbrace{\frac{\lambda}{2} \|\mathbf{w}\|^2}_{\text{regularizer}}$$
Next on the menu: other regularizers; Bayesian learning and the quadratic regularizer. Möller/Mori 58

59 Other Regularizers [Figure: contours of the regularizer for q = 0.5, q = 1, q = 2, q = 4] Can use different norms for the regularizer:
$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$$
e.g. $q = 2$: ridge regression; $q = 1$: lasso. The math is easiest with ridge regression. Möller/Mori 59

60 Optimization with a Quadratic Regularizer With $q = 2$, the total error is still a nice quadratic:
$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}$$
Calculus...
$$\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t} \quad \text{(regularized)}$$
Similar to unregularized least squares. Advantage: $(\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})$ is well conditioned, so inversion is stable. Möller/Mori 60
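
The regularized solution as a sketch (lam is the regularization strength $\lambda$); solving the well-conditioned linear system is the stable route the slide alludes to:

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)
```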

62 Ridge Regression vs. Lasso [Figure: error contours and the ridge vs. lasso constraint regions in $(w_1, w_2)$ space, with the constrained minimum $\mathbf{w}$ marked] Ridge regression aka parameter shrinkage: weights $\mathbf{w}$ shrink back towards the origin. Lasso leads to sparse models: components of $\mathbf{w}$ tend to 0 with large $\lambda$ (strong regularization). Intuitively, once the minimum is achieved at a large radius, the minimum is on $w_1 = 0$. Möller/Mori 62

63 Outline Regression Linear Basis Function Models Loss Functions for Regression Finding Optimal Weights Regularization Bayesian Linear Regression Möller/Mori 63

64 Bayesian Linear Regression Start with a prior over the parameters $\mathbf{w}$. The conjugate prior is a Gaussian: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$. This simple form will make the math easier; it can be done for an arbitrary Gaussian too. Data likelihood, Gaussian model as before: $p(t|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$. Möller/Mori 64

66 Bayesian Linear Regression Posterior distribution on $\mathbf{w}$:
$$p(\mathbf{w}|\mathbf{t}) \propto \left( \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta) \right) p(\mathbf{w}) = \left[ \prod_{n=1}^{N} \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2 \right) \right] \left( \frac{\alpha}{2\pi} \right)^{M/2} \exp\left( -\frac{\alpha}{2} \mathbf{w}^T \mathbf{w} \right)$$
Take the log:
$$\ln p(\mathbf{w}|\mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2 - \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}$$
$L_2$ regularization is maximum a posteriori (MAP) with a Gaussian prior, with $\lambda = \alpha/\beta$. Möller/Mori 66
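
Completing the square in this log posterior gives a Gaussian posterior $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$; the slide does not spell out the closed form, so the expressions below are the standard result for this prior/likelihood pair, sketched in NumPy. The posterior mean coincides with the ridge solution for $\lambda = \alpha/\beta$:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior p(w | t) = N(w | m_N, S_N) for the zero-mean Gaussian prior above."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# MAP estimate = m_N; it equals fit_ridge(Phi, t, lam=alpha/beta) above.
```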

70 Bayesian Linear Regression - Intuition Simple example: $x, t \in \mathbb{R}$, $y(x, \mathbf{w}) = w_0 + w_1 x$. Start with a Gaussian prior in parameter space; samples shown in data space. Receive data points (blue circles in data space). Compute the likelihood. The posterior is the prior (or previous posterior) times the likelihood. Möller/Mori 70

76 Predictive Distribution A single estimate of $\mathbf{w}$ (ML or MAP) doesn't tell the whole story. We have a distribution over $\mathbf{w}$, and can use it to make predictions. Given a new value of $\mathbf{x}$, we can compute a distribution over $t$:
$$p(t|\mathbf{t}, \alpha, \beta) = \int p(t, \mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} = \underbrace{\int}_{\text{sum}} \underbrace{p(t|\mathbf{w}, \beta)}_{\text{predict}}\, \underbrace{p(\mathbf{w}|\mathbf{t}, \alpha, \beta)}_{\text{probability}}\, d\mathbf{w}$$
i.e. for each value of $\mathbf{w}$, let it make a prediction, multiply by its probability, and sum over all $\mathbf{w}$. For arbitrary models as the distributions, this integral may not be computationally tractable. Möller/Mori 76
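
For this all-Gaussian model the integral has a closed form; the sketch below uses the standard predictive mean and variance (not written out on the slide), with m_N and S_N from the posterior sketch above:

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """p(t | x, training data) = N(t | mean, var) for a new feature vector phi(x)."""
    mean = m_N @ phi_x                         # m_N^T phi(x)
    var = 1.0 / beta + phi_x @ S_N @ phi_x     # noise variance + parameter uncertainty
    return mean, var
```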

79 Predictive Distribution With the Gaussians we've used for these distributions, the predictive distribution will also be Gaussian (math on convolutions of Gaussians). [Figure: green line is the true (unobserved) curve, blue circles are data points, red line is the mean, pink region is one standard deviation.] Uncertainty is small around the data points. The pink region shrinks with more data. Möller/Mori 79

80 Bayesian Model Selection So what do the Bayesians say about model selection? Model selection is choosing a model $\mathcal{M}_i$, e.g. the degree of the polynomial or the type of basis function $\phi$. Don't select, just integrate:
$$p(t|\mathbf{x}, \mathcal{D}) = \sum_{i=1}^{L} \underbrace{p(t|\mathbf{x}, \mathcal{M}_i, \mathcal{D})}_{\text{predictive dist.}}\, p(\mathcal{M}_i|\mathcal{D})$$
Average together the results of all models. Could instead choose the most likely model a posteriori, $p(\mathcal{M}_i|\mathcal{D})$: more efficient, an approximation. Möller/Mori 80

84 Bayesian Model Selection How do we compute the posterior over models? $p(\mathcal{M}_i|\mathcal{D}) \propto p(\mathcal{D}|\mathcal{M}_i)\, p(\mathcal{M}_i)$. Another likelihood + prior combination. Likelihood: $p(\mathcal{D}|\mathcal{M}_i) = \int p(\mathcal{D}|\mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w}|\mathcal{M}_i)\, d\mathbf{w}$. Möller/Mori 84

86 Conclusion Readings: Ch. 3.1, ..., 3.4. Linear Models for Regression: linear combination of (non-linear) basis functions. Fitting the parameters of a regression model: least squares; maximum likelihood (can be = least squares). Controlling over-fitting: regularization; Bayesian, use a prior (can be = regularization). Model selection: cross-validation (use held-out data); Bayesian (use model evidence, likelihood). Möller/Mori 86
