Linear Models for Classification

Oliver Schulte - CMPT 726
Bishop PRML Ch. 4

Classification: Hand-written Digit Recognition
- Each input vector is classified into one of K discrete classes; denote the classes by C_k.
- Represent an input image as a vector x_i ∈ R^784, with target vector t_i ∈ {0, 1}^10, e.g. t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0).
- Given a training set {(x_1, t_1), ..., (x_N, t_N)}, the learning problem is to construct a good function y(x) from these, y : R^784 → R^10.

Generalized Linear Models
- As in the previous chapter on linear models for regression, we will use a linear model for classification: y(x) = f(w^T x + w_0).
- This is called a generalized linear model.
- f(·) is a fixed non-linear function, e.g. f(u) = 1 if u ≥ 0 and f(u) = 0 otherwise.
- The decision boundary between classes will be a linear function of x.
- Can also apply a non-linearity to x, as with the basis functions φ_i(x) for regression.
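
As a concrete illustration (a minimal sketch, not from the slides; the weights and points are made up), the model is just a linear score followed by the threshold non-linearity f:

```python
import numpy as np

def glm_predict(X, w, w0):
    """y(x) = f(w^T x + w_0) with f the step function: 1 if the score >= 0, else 0."""
    return (X @ w + w0 >= 0).astype(int)

# Toy example: decision boundary x1 + x2 = 1 in 2D.
X = np.array([[0.0, 0.0], [2.0, 1.0]])
print(glm_predict(X, w=np.array([1.0, 1.0]), w0=-1.0))   # -> [0 1]
```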

Overview
- Linear regression for classification
- The Fisher linear discriminant, or: how to draw a line between classes
- The perceptron, or: the smallest neural net
- Logistic regression, or: the statistician's classifier

Outline
- Discriminant Functions
- Generative Models
- Discriminative Models

Discriminant Functions with Two Classes
- Start with the 2-class problem, t_i ∈ {0, 1}.
- Simple linear discriminant y(x) = w^T x + w_0; apply a threshold function to y(x) to get the classification.
- The decision surface y(x) = 0 is a line (hyperplane) orthogonal to w; the signed projection of x in the direction of w is w^T x / ‖w‖.

Multiple Classes
- A linear discriminant between two classes separates with a hyperplane.
- How do we use this for multiple classes?
- One-versus-the-rest method: build K − 1 classifiers, between C_k and all others.
- One-versus-one method: build K(K − 1)/2 classifiers, between all pairs.
- (Figures: both constructions leave ambiguous regions of input space, marked "?" in the diagrams.)

Multiple Classes
- A solution is to build K linear functions y_k(x) = w_k^T x + w_{k0} and assign x to the class arg max_k y_k(x).
- This gives connected, convex decision regions: for x̂ = λ x_A + (1 − λ) x_B we have y_k(x̂) = λ y_k(x_A) + (1 − λ) y_k(x_B), so if y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B) for all j ≠ k, then also y_k(x̂) > y_j(x̂).
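
A small sketch of the arg-max rule (illustrative only; the weight-matrix layout is my own convention, not from the slides):

```python
import numpy as np

def multiclass_predict(X, W, w0):
    """Evaluate y_k(x) = w_k^T x + w_k0 for every class and pick the largest.
    W is an M x K matrix with one column per class; w0 is a length-K bias vector."""
    scores = X @ W + w0            # N x K matrix of discriminant values y_k(x_n)
    return np.argmax(scores, axis=1)
```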

Least Squares for Classification
- How do we learn the decision boundaries (w_k, w_{k0})?
- One approach is to use least squares, similar to regression.
- Find W to minimize the squared error over all examples and all components of the label vector:
  E(W) = (1/2) Σ_{n=1}^N Σ_{k=1}^K (y_k(x_n) − t_{nk})^2
- After some algebra, we get a closed-form solution using the pseudo-inverse, as in regression.
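
A minimal sketch of that pseudo-inverse solution (assuming one-hot target rows T and a prepended bias feature; not code from the course):

```python
import numpy as np

def fit_least_squares_classifier(X, T):
    """Minimize the summed squared error to one-hot targets T via the pseudo-inverse."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias feature x_0 = 1
    return np.linalg.pinv(X_tilde) @ T                   # (M+1) x K weight matrix

def predict_least_squares(X, W_tilde):
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)          # assign to the largest y_k(x)
```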

Problems with Least Squares
- The least-squares decision boundary can look okay, and similar to the logistic regression decision boundary (more later).
- But it gets worse when we add "easy" points far from the boundary. Why?
- If the target value is 1, points far from the boundary will have a high predicted value, say 10; this counts as a large error, so the boundary is moved to reduce it.

More Least Squares Problems
- Some datasets are easily separated by hyperplanes, but the separating hyperplane is not found using least squares!
- Remember that least squares is maximum likelihood under a Gaussian noise model for a continuous target; here we have discrete targets.
- We'll address these problems later with better models.
- First, a look at a different criterion for a linear discriminant.

Fisher's Linear Discriminant
- The two-class linear discriminant acts as a projection y = w^T x, followed by a threshold (classify as C_1 if y ≥ −w_0).
- In which direction w should we project? One which separates the classes well.
- Intuition: we want the projected centres of the classes to be far apart, and each class's projection to be clustered around its centre.

Fisher's Linear Discriminant
- A natural idea would be to project in the direction of the line connecting the class means.
- However, this is problematic if the classes have variance in this direction (i.e. are not clustered around the mean).
- Fisher criterion: maximize the ratio of inter-class separation (between classes) to intra-class variance (within classes).

Math time - FLD
- Projection: y_n = w^T x_n.
- Inter-class separation is the distance between the projected class means (good): m_k = (1/N_k) Σ_{n ∈ C_k} w^T x_n.
- Intra-class variance (bad): s_k^2 = Σ_{n ∈ C_k} (y_n − m_k)^2.
- Fisher criterion: J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2); maximize with respect to w.

Math time - FLD
- J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2) = (w^T S_B w) / (w^T S_W w)
- Between-class covariance: S_B = (m_2 − m_1)(m_2 − m_1)^T, where m_1, m_2 here denote the class means in input space.
- Within-class covariance: S_W = Σ_{n ∈ C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n ∈ C_2} (x_n − m_2)(x_n − m_2)^T
- Lots of math: w ∝ S_W^{-1} (m_2 − m_1).
- If the covariance S_W is isotropic (proportional to the unit matrix, so the within-class scatter is the same in every direction), this reduces to the class-mean difference vector.
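
A compact sketch of the resulting rule (illustrative; the two classes are passed as row-wise arrays, and choosing the threshold on the projected values is left to the user):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return the unit-norm Fisher direction w ∝ S_W^{-1} (m_2 - m_1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)                          # solve rather than invert S_W
    return w / np.linalg.norm(w)
```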

FLD Summary
- FLD is a dimensionality reduction technique (more later in the course).
- It is a criterion for choosing the projection based on class labels.
- It still suffers from outliers (e.g. the earlier least squares example).

Perceptrons
- "Perceptron" is used to refer to many neural network structures (more next week).
- The classic type is a fixed non-linear transformation of the input, one layer of adaptive weights, and a threshold: y(x) = f(w^T φ(x)).
- Developed by Rosenblatt in the 1950s.
- The main difference compared to the methods we've seen so far is the learning algorithm.

Perceptron Learning
- Two-class problem; for ease of notation, we will use t = +1 for class C_1 and t = −1 for class C_2.
- We saw that squared error was problematic. Instead, we'd like to minimize the number of misclassified examples.
- An example is misclassified if w^T φ(x_n) t_n < 0.
- Perceptron criterion: E_P(w) = − Σ_{n ∈ M} w^T φ(x_n) t_n, where the sum runs over the misclassified examples only.

Perceptron Learning Algorithm
- Minimize the error function using stochastic gradient descent (a gradient step per example):
  w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φ(x_n) t_n   (update only if x_n is misclassified)
- Iterate over all training examples; only change w if the example is misclassified.
- Guaranteed to converge if the data are linearly separable; will not converge if they are not.
- May take many iterations, and is sensitive to initialization.
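
The whole algorithm fits in a few lines. A sketch (assuming ±1 targets and a design matrix Phi that already contains the basis functions, including a constant feature if a bias is wanted):

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=100, seed=0):
    """Stochastic-gradient perceptron: w <- w + eta * phi_n * t_n on each mistake."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for n in rng.permutation(len(t)):
            if w @ Phi[n] * t[n] <= 0:        # misclassified (or exactly on the boundary)
                w += eta * Phi[n] * t[n]
                mistakes += 1
        if mistakes == 0:                     # a full pass with no errors: converged
            break
    return w
```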

Perceptron Learning Illustration
- (Figures: the decision boundary moves each time a misclassified point triggers an update, until all points are correctly classified.)
- Note there are many hyperplanes with 0 error; support vector machines (in a few weeks) have a nice way of choosing one.

Limitations of Perceptrons
- A perceptron with a step function (Rosenblatt, 1957, 1960) can only solve linearly separable problems in feature space, the same as the other models in this chapter.
- It can represent Boolean functions such as OR, NOT, majority, etc., and corresponds to a linear separator (w^T x > 0) in canonical input space.
- The classic example of a non-separable problem is XOR; real datasets can look like this too.

Outline
- Discriminant Functions
- Generative Models
- Discriminative Models

Intuition for Logistic Regression in Generative Models: Classification with Joint Probabilities
- With 2 classes, C_1 and C_2: choose C_1 if
  1 < p(C_1|x) / p(C_2|x) = p(C_1, x) / p(C_2, x), i.e.
  0 < ln( p(C_1|x) / p(C_2|x) ) = ln p(C_1|x) − ln p(C_2|x).
- The quantity a = ln( p(C_1|x) / p(C_2|x) ) is called the log-odds.
- Logistic regression assumes that the log-odds are a linear function of the feature vector: a = w^T x + w_0.
- This is often true given assumptions about the class-conditional densities p(x|C_k).

Probabilistic Generative Models
- With 2 classes, C_1 and C_2:
  p(C_1|x) = p(x|C_1) p(C_1) / p(x)                                       (Bayes' rule)
           = p(x|C_1) p(C_1) / ( p(x, C_1) + p(x, C_2) )                  (sum rule)
           = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) )      (product rule)
- In generative models we specify the class-conditional distribution p(x|C_k), which generates the data for each class.

Probabilistic Generative Models - Example
- Let's say we observe x, the current temperature, and want to determine whether we are in Vancouver (C_1) or Honolulu (C_2).
- Generative model: p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) ).
- p(x|C_1) is a distribution over typical temperatures in Vancouver, e.g. p(x|C_1) = N(x; 10, 5).
- p(x|C_2) is a distribution over typical temperatures in Honolulu, e.g. p(x|C_2) = N(x; 25, 5).
- Class priors: p(C_1) = 0.1, p(C_2) = 0.9.
- p(C_1|x = 15) = ?
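
Plugging the numbers in directly (a sketch; it assumes the second argument of N(x; 10, 5) is the standard deviation, which the slide leaves unspecified):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density with mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

x = 15.0
num = normal_pdf(x, 10, 5) * 0.1                   # p(x | Vancouver) * p(Vancouver)
den = num + normal_pdf(x, 25, 5) * 0.9             # ... + p(x | Honolulu) * p(Honolulu)
print(num / den)                                   # p(Vancouver | x = 15), roughly 0.33
```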

Generalized Linear Models
- Suppose we have built a model for predicting the log-odds. We can use it to compute the class probability as follows:
  p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) )
           = 1 / (1 + exp(−a)) ≡ σ(a)
  where a = ln [ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ] = ln [ p(x, C_1) / p(x, C_2) ].

Logistic Sigmoid
- The function σ(a) = 1 / (1 + exp(−a)) is known as the logistic sigmoid.
- It squashes the real axis down into (0, 1).
- It is continuous and differentiable.

Multi-class Extension
- There is a generalization of the logistic sigmoid to K > 2 classes:
  p(C_k|x) = p(x|C_k) p(C_k) / Σ_j p(x|C_j) p(C_j) = exp(a_k) / Σ_j exp(a_j)
  where a_k = ln [ p(x|C_k) p(C_k) ].
- This is known as the softmax function.
- If some a_k ≫ a_j (for all j ≠ k), p(C_k|x) goes to 1.
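
A small, numerically stable implementation (a sketch; subtracting max(a) changes nothing because the shift cancels in the ratio):

```python
import numpy as np

def softmax(a):
    """p(C_k | x) = exp(a_k) / sum_j exp(a_j), computed stably."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())        # shift for numerical stability; cancels in the ratio
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))    # -> approximately [0.66, 0.24, 0.10]
```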

Example: Logistic Regression
- (Figure.)

Gaussian Class-Conditional Densities
- Back to the log-odds a in the logistic sigmoid for 2 classes.
- Let's assume the class-conditional densities p(x|C_k) are Gaussians with the same covariance matrix Σ:
  p(x|C_k) = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) exp{ −(1/2) (x − μ_k)^T Σ^{-1} (x − μ_k) }
- Then a takes a simple form:
  a = ln [ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ] = w^T x + w_0
- Note that the quadratic terms x^T Σ^{-1} x cancel.
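
Carrying out the cancellation explicitly (a standard identity for shared-covariance Gaussians; the slide states only the conclusion) gives the coefficients

  w = Σ^{-1} (μ_1 − μ_2)
  w_0 = −(1/2) μ_1^T Σ^{-1} μ_1 + (1/2) μ_2^T Σ^{-1} μ_2 + ln [ p(C_1) / p(C_2) ].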

Maximum Likelihood Learning
- We can fit the parameters of this model using maximum likelihood.
- The parameters are μ_1, μ_2, Σ, p(C_1) ≡ π, p(C_2) = 1 − π; refer to them collectively as θ.
- For a datapoint x_n from class C_1 (t_n = 1): p(x_n, C_1) = p(C_1) p(x_n|C_1) = π N(x_n | μ_1, Σ).
- For a datapoint x_n from class C_2 (t_n = 0): p(x_n, C_2) = p(C_2) p(x_n|C_2) = (1 − π) N(x_n | μ_2, Σ).

Maximum Likelihood Learning
- The likelihood of the training data is:
  p(t | π, μ_1, μ_2, Σ) = Π_{n=1}^N [ π N(x_n | μ_1, Σ) ]^{t_n} [ (1 − π) N(x_n | μ_2, Σ) ]^{1 − t_n}
- As usual, ln is our friend:
  ℓ(t; θ) = Σ_{n=1}^N { t_n ln π + (1 − t_n) ln(1 − π) } + { t_n ln N(x_n | μ_1, Σ) + (1 − t_n) ln N(x_n | μ_2, Σ) }
  The first group of terms involves only π, the second only μ_1, μ_2, Σ.
- Maximize for each separately.

Maximum Likelihood Learning - Class Priors
- Maximization with respect to the class prior parameter π is straightforward:
  ∂ℓ(t; θ)/∂π = Σ_{n=1}^N ( t_n / π − (1 − t_n) / (1 − π) ) = 0  ⟹  π = N_1 / (N_1 + N_2)
- N_1 and N_2 are the numbers of training points in each class.
- The prior is simply the fraction of points in each class.

Maximum Likelihood Learning - Gaussian Parameters
- The other parameters can be found in the same fashion.
- Class means:
  μ_1 = (1/N_1) Σ_{n=1}^N t_n x_n,   μ_2 = (1/N_2) Σ_{n=1}^N (1 − t_n) x_n
  i.e. the means of the training examples from each class.
- Shared covariance matrix:
  Σ = (N_1/N) S_1 + (N_2/N) S_2,   where S_k = (1/N_k) Σ_{n ∈ C_k} (x_n − μ_k)(x_n − μ_k)^T
  i.e. a weighted average of the class covariances.
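
These estimates translate almost line-for-line into code. A sketch (t is a 0/1 label vector; the second helper maps the fitted parameters to the sigmoid's weights using the w, w_0 expressions quoted earlier):

```python
import numpy as np

def fit_generative_model(X, t):
    """ML parameters of the two-class, shared-covariance Gaussian generative model."""
    X, t = np.asarray(X, dtype=float), np.asarray(t)
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)                               # p(C_1)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)           # weighted average of class covariances
    return pi, mu1, mu2, Sigma

def posterior_weights(pi, mu1, mu2, Sigma):
    """w, w_0 such that p(C_1 | x) = sigma(w^T x + w_0)."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi / (1 - pi))
    return w, w0
```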

Probabilistic Generative Models Summary
- Fitting a Gaussian using the ML criterion is sensitive to outliers.
- The simple linear form for a in the logistic sigmoid occurs for more than just Gaussian distributions.
- It arises for any distribution in the exponential family, a large class of distributions.

Outline
- Discriminant Functions
- Generative Models
- Discriminative Models

Probabilistic Discriminative Models
- The generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian), which resulted in a logistic sigmoid of a linear function of x.
- A discriminative model explicitly uses that functional form, p(C_1|x) = σ(w^T x + w_0), and finds w directly.
- For the generative model we had 2M + M(M + 1)/2 + 1 parameters, where M is the dimensionality of x.
- The discriminative model has only M + 1 parameters.

Maximum Likelihood Learning - Discriminative Model
- As usual, we can use the maximum likelihood criterion for learning:
  p(t|w) = Π_{n=1}^N y_n^{t_n} (1 − y_n)^{1 − t_n},   where y_n = p(C_1|x_n) = σ(w^T x_n)
- Taking the ln and differentiating gives the gradient of the negative log-likelihood:
  ∇E(w) = Σ_{n=1}^N (y_n − t_n) x_n
- This time there is no closed-form solution, since y_n = σ(w^T x_n).
- We could use (stochastic) gradient descent, but Iterative Reweighted Least Squares (IRLS) is a better technique.
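
Both options are short to write down. A sketch (assuming 0/1 targets and a design matrix X that already carries a bias feature; the IRLS step is the Newton update with weighting matrix R = diag(y_n (1 − y_n))):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_regression_gd(X, t, eta=0.1, n_iters=1000):
    """Batch gradient descent on the negative log-likelihood, gradient sum_n (y_n - t_n) x_n."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y = sigmoid(X @ w)
        w -= eta * X.T @ (y - t)
    return w

def irls_step(X, t, w):
    """One IRLS (Newton) update: w <- w - (X^T R X)^{-1} X^T (y - t)."""
    y = sigmoid(X @ w)
    R = y * (1 - y)
    H = X.T @ (X * R[:, None])               # Hessian X^T R X
    return w - np.linalg.solve(H, X.T @ (y - t))
```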

Generative vs. Discriminative
- Generative models:
  - Can generate synthetic example data; perhaps accurate classification is equivalent to accurate synthesis.
  - Support learning with missing data.
  - Tend to have more parameters.
  - Require a good model of the class distributions.
- Discriminative models:
  - Only usable for classification, but don't solve a harder problem than you need to.
  - Tend to have fewer parameters.
  - Require a good model of the decision boundary.

Conclusion
- Readings: Ch. 4 (selected sections, including 4.1.7).
- Generalized linear models: y(x) = f(w^T x + w_0).
- Threshold/max function for f(·): minimize with least squares; Fisher criterion (class separation); perceptron criterion (misclassified examples).
- Probabilistic models: logistic sigmoid / softmax for f(·).
- Generative model: assume class-conditional densities in the exponential family; obtain the sigmoid.
- Discriminative model: directly model the posterior using the sigmoid (a.k.a. logistic regression, though used for classification).
- Either can be learned using maximum likelihood.
- All of these models are limited to linear decision boundaries in feature space.