Linear Models for Classification


1 Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4

2 Classification: Hand-written Digit Recognition Each input vector is classified into one of K discrete classes. Denote classes by C_k. Represent each input image as a vector x_i ∈ R^784, with target vector t_i ∈ {0, 1}^10, e.g. t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0) (a 1-of-10 encoding). Given a training set {(x_1, t_1), ..., (x_N, t_N)}, the learning problem is to construct a good function y(x) from these, y : R^784 → R^10.

3 Generalized Linear Models As in the previous chapter on linear models for regression, we will use a linear model for classification: y(x) = f(w^T x + w_0). This is called a generalized linear model. f(·) is a fixed non-linear function, e.g. f(u) = 1 if u ≥ 0 and 0 otherwise. The decision boundary between classes will be a linear function of x. We can also apply a non-linearity to x, as in the basis functions φ_i(x) used for regression.
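As a concrete illustration, here is a minimal sketch (not from the slides; the weights and data are arbitrary) of such a generalized linear model with a step-function non-linearity:

```python
import numpy as np

def step(u):
    """Fixed non-linearity f(u): 1 if u >= 0, else 0."""
    return (u >= 0).astype(int)

def predict(X, w, w0):
    """Generalized linear model y(x) = f(w^T x + w0), applied to each row of X."""
    return step(X @ w + w0)

# Example: classify 2-D points against the (arbitrary) decision line x1 + x2 = 1.
w, w0 = np.array([1.0, 1.0]), -1.0
X = np.array([[0.2, 0.3], [0.9, 0.8]])
print(predict(X, w, w0))  # [0 1]
```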

6 Overview Linear regression for classification The Fisher Linear Discriminant, or How to Draw a Line Between Classes The Perceptron, or The Smallest Neural Net Logistic Regression, or The Statistician's Classifier

7 Outline Discriminant Functions Generative Models Discriminative Models

8 Discriminant Functions with Two Classes [figure: geometry of the linear discriminant, with regions R_1 (y > 0) and R_2 (y < 0) separated by the decision surface y = 0] Start with the 2-class problem, t_i ∈ {0, 1}. Simple linear discriminant: y(x) = w^T x + w_0; apply a threshold function to get the classification. The decision surface is a line (hyperplane), orthogonal to w. The projection of x in the direction of w is w^T x / ||w||.

9 Multiple Classes A linear discriminant between two classes separates with a hyperplane. How to use this for multiple classes? One-versus-the-rest method: build K−1 classifiers, between C_k and all others. One-versus-one method: build K(K−1)/2 classifiers, between all pairs. [figure: both constructions leave regions of input space with ambiguous class assignments]

15 Multiple Classes A solution is to build K linear functions y_k(x) = w_k^T x + w_k0 and assign x to the class argmax_k y_k(x). This gives connected, convex decision regions: for any x_A, x_B in region R_k and x̂ = λ x_A + (1−λ) x_B with 0 ≤ λ ≤ 1, linearity gives y_k(x̂) = λ y_k(x_A) + (1−λ) y_k(x_B), so y_k(x̂) > y_j(x̂) for all j ≠ k.
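A minimal sketch of this construction (illustrative only; the weight vectors below are made up):

```python
import numpy as np

def predict_multiclass(X, W, w0):
    """Evaluate y_k(x) = w_k^T x + w_k0 for each class k and assign argmax_k y_k(x).

    X: (N, D) inputs, W: (K, D) weight vectors, w0: (K,) biases."""
    scores = X @ W.T + w0          # (N, K) matrix of y_k(x_n)
    return np.argmax(scores, axis=1)

# Three classes in 2-D with arbitrary example weights.
W = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0]])
w0 = np.zeros(3)
X = np.array([[2.0, 0.5], [-2.0, 0.5], [0.0, 3.0]])
print(predict_multiclass(X, W, w0))  # [0 1 2]
```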

17 Least Squares for Classification How do we learn the decision boundaries (w_k, w_k0)? One approach is to use least squares, similar to regression. Find W to minimize the squared error over all examples and all components of the label vector: E(W) = (1/2) Σ_{n=1}^N Σ_{k=1}^K (y_k(x_n) − t_nk)^2. With some algebra, we get a solution using the pseudo-inverse, as in regression.
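A minimal sketch of that solution, assuming 1-of-K target vectors and an input matrix augmented with a column of ones for the biases; the pseudo-inverse gives the weights as in regression (the data below are a made-up toy example):

```python
import numpy as np

def fit_least_squares(X, T):
    """Least-squares classifier: minimize (1/2) sum_n sum_k (y_k(x_n) - t_nk)^2.

    X: (N, D) inputs, T: (N, K) one-hot targets.
    Returns an augmented weight matrix of shape (D+1, K), bias row first."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias feature
    return np.linalg.pinv(X_aug) @ T                   # pseudo-inverse solution

def predict(X, W_aug):
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_aug @ W_aug, axis=1)

# Tiny two-class example with one-hot targets.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(predict(X, fit_least_squares(X, T)))  # [0 0 1 1]
```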

20 Problems with Least Squares Looks okay... the least squares decision boundary is similar to the logistic regression decision boundary (more later). But it gets worse by adding easy points?! Why? If the target value is 1, points far from the boundary will have a high value of y(x), say 10; this is a large squared error, so the boundary is moved.

24 More Least Squares Problems Classes easily separated by hyperplanes, but not found using least squares! Remember that least squares is MLE under a Gaussian noise model for a continuous target - we've got discrete targets. We'll address these problems later with better models. First, a look at a different criterion for a linear discriminant.

25 Fisher's Linear Discriminant The two-class linear discriminant acts as a projection y = w^T x, followed by a threshold (classify as C_1 if y ≥ −w_0). In which direction w should we project? One which separates the classes well. Intuition: we want the (projected) centres of the classes to be far apart, and each class (projection) to be clustered around its centre.

29 Fisher's Linear Discriminant A natural idea would be to project in the direction of the line connecting the class means. However, this is problematic if the classes have variance in this direction (i.e. are not clustered around the mean). Fisher criterion: maximize the ratio of inter-class separation (between classes) to intra-class variance (within each class).

33 Math time - FLD Projection: y_n = w^T x_n. Inter-class separation is the distance between the projected class means (good): m_k = (1/N_k) Σ_{n∈C_k} w^T x_n. Intra-class variance (bad): s_k^2 = Σ_{n∈C_k} (y_n − m_k)^2. Fisher criterion: J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2); maximize with respect to w.

37 Math time - FLD J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2) = (w^T S_B w) / (w^T S_W w). Between-class covariance: S_B = (m_2 − m_1)(m_2 − m_1)^T. Within-class covariance: S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T. Lots of math: w ∝ S_W^{-1} (m_2 − m_1). If the within-class covariance S_W is isotropic (proportional to the unit matrix), this reduces to the class mean difference vector.
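A minimal sketch of the Fisher direction computed from data (illustrative; the two point clouds are synthetic with arbitrary parameters): it forms the within-class scatter S_W and returns w ∝ S_W^{-1}(m_2 − m_1).

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant direction w proportional to S_W^{-1} (m2 - m1).

    X1, X2: arrays of shape (N1, D) and (N2, D) holding the two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# Two elongated clouds: large variance along x1, small along x2.
rng = np.random.default_rng(0)
cov = np.array([[3.0, 0.0], [0.0, 0.2]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 2.0], cov, size=200)
# The Fisher direction leans toward the low-variance axis, roughly [0.07, 1.0],
# unlike the normalized mean-difference direction [0.71, 0.71].
print(fisher_direction(X1, X2))
```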

39 FLD Summary FLD is a dimensionality reduction technique (more later in the course) Criterion for choosing projection based on class labels Still suffers from outliers (e.g. earlier least squares example)

40 Perceptrons The term perceptron is used to refer to many neural network structures (more next week). The classic type is a fixed non-linear transformation of the input, one layer of adaptive weights, and a threshold: y(x) = f(w^T φ(x)). Developed by Rosenblatt in the 50s. The main difference compared to the methods we've seen so far is the learning algorithm.

41 Perceptron Learning Two-class problem. For ease of notation, we will use t = 1 for class C_1 and t = −1 for class C_2. We saw that squared error was problematic; instead, we'd like to minimize the number of misclassified examples. An example is mis-classified if w^T φ(x_n) t_n < 0. Perceptron criterion: E_P(w) = −Σ_{n∈M} w^T φ(x_n) t_n, where the sum is over mis-classified examples only.

45 Perceptron Learning Algorithm Minimize the error function using stochastic gradient descent (gradient descent per example): w^(τ+1) = w^(τ) − η∇E_P(w) = w^(τ) + η φ(x_n) t_n, applied only if x_n is incorrectly classified. Iterate over all training examples, changing w only when an example is mis-classified. Guaranteed to converge if the data are linearly separable; will not converge if not. May take many iterations. Sensitive to initialization.
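A minimal sketch of this algorithm (illustrative; assumes targets in {−1, +1} and φ(x) = x with a bias feature appended, and a made-up separable dataset):

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """Stochastic perceptron learning: w <- w + eta * phi(x_n) * t_n on mistakes.

    X: (N, D) inputs, t: (N,) targets in {-1, +1}. Uses phi(x) = [1, x]."""
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])     # bias feature
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if w @ phi_n * t_n <= 0:                   # mis-classified (or on the boundary)
                w = w + eta * phi_n * t_n
                mistakes += 1
        if mistakes == 0:                              # converged: every example correct
            break
    return w

# Linearly separable toy data: class +1 above the line x2 = x1, class -1 below.
X = np.array([[0.0, 1.0], [1.0, 2.0], [1.0, 0.0], [2.0, 1.0]])
t = np.array([1, 1, -1, -1])
w = perceptron_train(X, t)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))    # [ 1.  1. -1. -1.]
```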

48 Perceptron Learning Illustration [figures omitted] Note there are many hyperplanes with 0 error; support vector machines (in a few weeks) have a nice way of choosing one.

53 Limitations of Perceptrons Perceptrons (Rosenblatt, 1957, 1960) with a step function can only solve linearly separable problems in feature space (e.g. AND, OR, NOT, majority). Same as the other models in this chapter. The canonical example of a non-separable problem is XOR. [figure: I_1 AND I_2 and I_1 OR I_2 are linearly separable, I_1 XOR I_2 is not] Real datasets can look like this too.

54 Outline Discriminant Functions Generative Models Discriminative Models

55 Intuition for Logistic Regression in Generative Models: Classification with Joint Probabilities With 2 classes, C_1 and C_2: choose C_1 if 1 < p(C_1|x)/p(C_2|x) = p(C_1, x)/p(C_2, x), equivalently 0 < ln( p(C_1|x)/p(C_2|x) ) = ln p(C_1|x) − ln p(C_2|x). The quantity a = ln( p(C_1|x)/p(C_2|x) ) is called the log-odds. Logistic regression assumes that the log-odds are a linear function of the feature vector: a = w^T x + w_0. This is often true given assumptions about the class-conditional densities p(x|C_k).

59 Probabilistic Generative Models With 2 classes, C_1 and C_2: p(C_1|x) = p(x|C_1) p(C_1) / p(x) (Bayes' rule) = p(x|C_1) p(C_1) / ( p(x, C_1) + p(x, C_2) ) (sum rule) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) ) (product rule). In generative models we specify the class-conditional distribution p(x|C_k) which generates the data for each class.

63 Probabilistic Generative Models - Example Let's say we observe x, the current temperature. Determine if we are in Vancouver (C_1) or Honolulu (C_2). Generative model: p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) ). p(x|C_1) is a distribution over typical temperatures in Vancouver, e.g. p(x|C_1) = N(x; 10, 5); p(x|C_2) is a distribution over typical temperatures in Honolulu, e.g. p(x|C_2) = N(x; 25, 5). Class priors: p(C_1) = 0.1, p(C_2) = 0.9. Then p(C_1|x = 15) = ?
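A minimal sketch of this worked example, using the densities and priors quoted on the slide (assumption: the second argument of N(x; 10, 5) is read as the standard deviation):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_vancouver_given_temp(x):
    p_x_c1 = gaussian_pdf(x, mu=10.0, sigma=5.0)   # typical temperatures in Vancouver
    p_x_c2 = gaussian_pdf(x, mu=25.0, sigma=5.0)   # typical temperatures in Honolulu
    prior_c1, prior_c2 = 0.1, 0.9                  # class priors from the slide
    return p_x_c1 * prior_c1 / (p_x_c1 * prior_c1 + p_x_c2 * prior_c2)

print(round(p_vancouver_given_temp(15.0), 3))  # about 0.33 under this reading
```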

65 Generalized Linear Models Suppose we have built a model for predicting the log-odds. We can use it to compute the class probability as follows: p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) ) = 1 / (1 + exp(−a)) = σ(a), where a = ln[ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ] = ln[ p(x, C_1) / p(x, C_2) ].

67 Logistic Sigmoid The function σ(a) = 1 / (1 + exp(−a)) is known as the logistic sigmoid. It squashes the real axis down to [0, 1]. It is continuous and differentiable.

68 Multi-class Extension There is a generalization of the logistic sigmoid to K > 2 classes: p(C_k|x) = p(x|C_k) p(C_k) / Σ_j p(x|C_j) p(C_j) = exp(a_k) / Σ_j exp(a_j), where a_k = ln[ p(x|C_k) p(C_k) ], a.k.a. the softmax function. If some a_k ≫ a_j (for all j ≠ k), p(C_k|x) goes to 1.
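A minimal sketch of both squashing functions (the max-subtraction in the softmax is only for numerical stability and does not change the result):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: squashes the real axis into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Softmax over a vector of activations a_k = ln p(x|C_k)p(C_k)."""
    e = np.exp(a - np.max(a))        # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # approximately [0.12 0.5 0.88]
print(softmax(np.array([2.0, 1.0, 0.1])))    # puts most mass on the largest a_k
```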

70 Example Logistic Regression

71 Gaussian Class-Conditional Densities Back to the log-odds a in the logistic sigmoid for 2 classes. Let's assume the class-conditional densities p(x|C_k) are Gaussians with the same covariance matrix Σ: p(x|C_k) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp{ −(1/2)(x − μ_k)^T Σ^{-1} (x − μ_k) }. Then a takes a simple form: a = ln[ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ] = w^T x + w_0. Note that the quadratic terms x^T Σ^{-1} x cancel.
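A minimal sketch of the resulting linear parameters under this shared-covariance assumption: w = Σ^{-1}(μ_1 − μ_2) and w_0 = −(1/2) μ_1^T Σ^{-1} μ_1 + (1/2) μ_2^T Σ^{-1} μ_2 + ln(p(C_1)/p(C_2)), so that p(C_1|x) = σ(w^T x + w_0). The numbers below are arbitrary example parameters.

```python
import numpy as np

def gaussian_logodds_params(mu1, mu2, Sigma, prior1, prior2):
    """Linear log-odds a = w^T x + w0 for shared-covariance Gaussian classes."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    return w, w0

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, w0 = gaussian_logodds_params(mu1, mu2, Sigma, prior1=0.5, prior2=0.5)
x = np.array([1.0, 1.0])       # halfway between the two means
print(sigmoid(w @ x + w0))     # 0.5: with equal priors the midpoint lies on the boundary
```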

75 Maximum Likelihood Learning We can fit the parameters of this model using maximum likelihood. The parameters are μ_1, μ_2, Σ, and the priors p(C_1) ≡ π, p(C_2) = 1 − π; refer to them collectively as θ. For a datapoint x_n from class C_1 (t_n = 1): p(x_n, C_1) = p(C_1) p(x_n|C_1) = π N(x_n | μ_1, Σ). For a datapoint x_n from class C_2 (t_n = 0): p(x_n, C_2) = p(C_2) p(x_n|C_2) = (1 − π) N(x_n | μ_2, Σ).

78 Maximum Likelihood Learning The likelihood of the training data is: p(t | π, μ_1, μ_2, Σ) = Π_{n=1}^N [π N(x_n | μ_1, Σ)]^{t_n} [(1 − π) N(x_n | μ_2, Σ)]^{1 − t_n}. As usual, ln is our friend: l(t; θ) = Σ_{n=1}^N [ t_n ln π + (1 − t_n) ln(1 − π) ] + [ t_n ln N(x_n | μ_1, Σ) + (1 − t_n) ln N(x_n | μ_2, Σ) ]. The first bracket depends only on π and the second only on μ_1, μ_2, Σ, so each can be maximized separately.

80 Maximum Likelihood Learning - Class Priors Maximization with respect to the class prior parameter π is straightforward: setting ∂l(t; θ)/∂π = Σ_{n=1}^N [ t_n/π − (1 − t_n)/(1 − π) ] = 0 gives π = N_1 / (N_1 + N_2), where N_1 and N_2 are the numbers of training points in each class. The prior is simply the fraction of points in each class.

83 Maximum Likelihood Learning - Gaussian Parameters The other parameters can be found in the same fashion. Class means: μ_1 = (1/N_1) Σ_{n=1}^N t_n x_n, μ_2 = (1/N_2) Σ_{n=1}^N (1 − t_n) x_n, i.e. the means of the training examples from each class. Shared covariance matrix: Σ = (N_1/N) S_1 + (N_2/N) S_2 with S_k = (1/N_k) Σ_{n∈C_k} (x_n − μ_k)(x_n − μ_k)^T, a weighted average of the class covariances.
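A minimal sketch of these closed-form estimates (illustrative; assumes a binary label vector t with t_n = 1 for class C_1, and a made-up toy dataset):

```python
import numpy as np

def fit_generative_gaussian(X, t):
    """ML estimates for the shared-covariance Gaussian generative model.

    X: (N, D) inputs, t: (N,) labels, 1 for class C1 and 0 for class C2."""
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N                                  # prior p(C1): fraction of points in C1
    mu1 = X[t == 1].mean(axis=0)                 # class means
    mu2 = X[t == 0].mean(axis=0)
    S1 = np.cov(X[t == 1].T, bias=True)          # per-class ML covariances (divide by N_k)
    S2 = np.cov(X[t == 0].T, bias=True)
    Sigma = (N1 / N) * S1 + (N2 / N) * S2        # weighted average of class covariances
    return pi, mu1, mu2, Sigma

X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
t = np.array([1, 1, 0, 0, 0])
pi, mu1, mu2, Sigma = fit_generative_gaussian(X, t)
print(pi, mu1, mu2)  # 0.4, [0.1 0.05], approximately [2.0 2.03]
```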

85 Probabilistic Generative Models Summary Fitting a Gaussian using the ML criterion is sensitive to outliers. The simple linear form for a in the logistic sigmoid occurs for more than just Gaussian distributions: it arises for any distribution in the exponential family, a large class of distributions.

86 Outline Discriminant Functions Generative Models Discriminative Models

87 Probabilistic Discriminative Models The generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian), which resulted in a logistic sigmoid of a linear function of x. A discriminative model instead explicitly uses the functional form p(C_1|x) = σ(w^T x + w_0) = 1 / (1 + exp(−(w^T x + w_0))) and finds w directly. For the generative model we had 2M + M(M + 1)/2 + 1 parameters, where M is the dimensionality of x; the discriminative model will have only M + 1 parameters.

90 Maximum Likelihood Learning - Discriminative Model As usual we can use the maximum likelihood criterion for learning: p(t|w) = Π_{n=1}^N y_n^{t_n} (1 − y_n)^{1 − t_n}, where y_n = p(C_1|x_n). Taking ln and differentiating, the gradient of the error E(w) = −ln p(t|w) is ∇E(w) = Σ_{n=1}^N (y_n − t_n) x_n. This time there is no closed-form solution, since y_n = σ(w^T x_n). We could use (stochastic) gradient descent, but Iterative Reweighted Least Squares (IRLS) is a better technique.
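A minimal sketch of discriminative training by plain batch gradient descent on E(w), whose gradient is Σ_n (y_n − t_n) x_n as above (IRLS would replace the fixed step size with a Newton step; the toy data and step size below are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(X, t, eta=0.1, n_iters=5000):
    """Maximum-likelihood logistic regression via batch gradient descent.

    X: (N, D) inputs, t: (N,) targets in {0, 1}. Uses an appended bias feature."""
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)                 # y_n = p(C1 | x_n)
        grad = Phi.T @ (y - t)               # gradient of E(w) = -ln p(t|w)
        w -= eta * grad
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([0, 0, 1, 1])
w = fit_logistic_regression(X, t)
print(sigmoid(np.hstack([np.ones((4, 1)), X]) @ w))  # close to [0, 0, 1, 1]
```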

94 Generative vs. Discriminative Generative models: can generate synthetic example data (perhaps accurate classification is equivalent to accurate synthesis); support learning with missing data; tend to have more parameters; require a good model of the class distributions. Discriminative models: only usable for classification; don't solve a harder problem than you need to; tend to have fewer parameters; require a good model of the decision boundary.

95 Conclusion Readings: Bishop PRML Ch. 4 (including 4.1.7). Generalized linear models: y(x) = f(w^T x + w_0). Threshold/max function for f(·): minimize with least squares, the Fisher criterion (class separation), or the perceptron criterion (mis-classified examples). Probabilistic models: logistic sigmoid / softmax for f(·). Generative model: assume class-conditional densities in the exponential family; obtain the sigmoid. Discriminative model: directly model the posterior using the sigmoid (a.k.a. logistic regression, though used for classification). Either can be learned using maximum likelihood. All of these models are limited to linear decision boundaries in feature space.
