Machine Learning 2017


1 Machine Learning 2017
Volker Roth, Department of Mathematics & Computer Science, University of Basel
21st March 2017

2 Section 5: Classification

3 Outline
Goal: understanding the problem settings in classification and regression estimation. Focus on classification:
- Understanding the concept of Bayesian decision theory.
- Understanding the notion of discriminant functions.
- Understanding generalized linear discriminant functions and separating hyperplanes.
- Understanding some simple classification algorithms.

4 Classification Example
Sorting incoming fish on a conveyor according to species using optical sensing. Features: length, brightness, width, shape of fins.

5 Classification
Given an environment characterized by p(x, y), with k classes c_1, ..., c_k and y a class indicator variable, a supervisor classifies x into the k classes according to P(c | x).
Goal: construct a function that classifies in the same manner as the supervisor.
Methods: Linear Discriminant Analysis, Nearest Neighbor, Support Vector Machines, etc.

6 Bayesian Decision Theory
Assign an observed x in R^d to one of k classes. A classifier is a mapping that assigns labels to observations, f_alpha: x -> {1, ..., k}. For any observation x there is a set of k possible actions alpha_i, i.e. k different assignments of labels. The loss L incurred for taking action alpha_i when the true label is j is given by a loss matrix L_ij = L(alpha_i | c = j).
A natural 0-1 loss function is obtained by simply counting misclassifications: L_ij = 1 - delta_ij, where delta_ij = 1 if i = j and 0 otherwise.

7 Bayesian Decision Theory (cont'd)
A classifier is trained on a set of observed pairs {(x_1, c_1), ..., (x_n, c_n)} drawn i.i.d. from p(x, c) = p(c | x) p(x). The probability that a given x is a member of class c_j is obtained via Bayes' rule:
P(c_j | x) = p(x | c = j) P(c = j) / p(x), where p(x) = sum_{j=1}^k p(x | c = j) P(c = j).
Given an observation x, the expected loss associated with choosing action alpha_i (the conditional risk) is
R(f_{alpha_i} | x) = sum_{j=1}^k L_ij P(c_j | x), which for the 0-1 loss becomes sum_{j != i} P(c_j | x) = 1 - P(c_i | x).

8 Bayesian Decision Theory (cont'd)
Goal: minimize the overall risk
R(f_alpha) = integral over R^d of R(f_alpha(x) | x) p(x) dx.
If f_alpha(x) minimizes the conditional risk R(f_alpha(x) | x) for every x, the overall risk is minimized as well. This is achieved by the Bayes-optimal classifier, which chooses the mapping
f(x) = argmin_i sum_{j=1}^k L_ij P(c = j | x).
For the 0-1 loss this reduces to classifying x into the class with the highest posterior probability: f(x) = argmax_i P(c = i | x).
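
To make the decision rule concrete, here is a minimal sketch (hypothetical posterior values, numpy assumed) that computes the conditional risk of every action for a loss matrix L and picks the risk-minimizing class; with the 0-1 loss this reduces to the argmax of the posterior.

```python
import numpy as np

# Hypothetical posteriors P(c = j | x) for one observation x and k = 3 classes.
posterior = np.array([0.2, 0.5, 0.3])

k = len(posterior)
L = 1.0 - np.eye(k)                  # 0-1 loss matrix: L_ij = 1 - delta_ij

risk = L @ posterior                 # conditional risks R(alpha_i | x) = sum_j L_ij P(c_j | x)
print(risk)                          # here: 1 - posterior, i.e. [0.8, 0.5, 0.7]
print(np.argmin(risk), np.argmax(posterior))   # Bayes action coincides with the posterior argmax
```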

9 Bayesian Decision Theory (cont'd)
Simplification: only 2 classes, so c is a Binomial RV. The Bayes-optimal classifier is defined by the zero crossings of the Bayes (optimal) discriminant function
G(x) = P(c_1 | x) - P(c_2 | x), or g(x) = log [P(c_1 | x) / P(c_2 | x)].
Link to regression: use the encoding {+1, -1} for the two possible states c_{1,2} of c. The conditional expectation of c given x then equals the Bayes discriminant function:
E[c | x] = sum_{c in {+1, -1}} c P(c | x) = P(c_1 | x) - P(c_2 | x) = G(x).
Classification can be viewed as a (local) approximation of G(x) = E[c | x] near its zero crossings.

10 Linear Discriminant Functions
Problem: direct approximation of G would require knowledge of the Bayes-optimal discriminant. We therefore have to define a hypothesis class from which a function is chosen by some inference mechanism. One such class is the set of linear discriminant functions
g(x; w) = w_0 + w^T x.
Decide c_1 if g(x; w) > 0 and c_2 if g(x; w) < 0. The equation g(x; w) = 0 defines the decision surface. By linearity of g(x; w) this surface is a hyperplane, and w is orthogonal to any vector lying in the plane. The hyperplane divides the feature space into the half-spaces R_1 (the "positive side") and R_2 (the "negative side").

11 Decision Hyperplanes
g(x; w) determines the distance r from x to the hyperplane: writing x = x_p + r w/||w||, where x_p is the projection of x onto the hyperplane, we have g(x_p) = 0 and hence g(x) = r ||w||, i.e. r = g(x)/||w||.
(Figure: a point x, its projection x_p onto the hyperplane H = {x : g(x) = 0} with normal vector w and offset w_0/||w|| from the origin, and the half-spaces R_1 and R_2.)
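
A small sketch of this geometry (hypothetical numbers): the signed distance of a point to the hyperplane g(x; w) = 0 is g(x)/||w||, and subtracting r w/||w|| from x lands on the plane.

```python
import numpy as np

w, w0 = np.array([2.0, 1.0]), -4.0          # hypothetical hyperplane parameters
x = np.array([3.0, 2.0])

g = w0 + w @ x                              # discriminant value g(x; w)
r = g / np.linalg.norm(w)                   # signed distance from x to the hyperplane
x_p = x - r * w / np.linalg.norm(w)         # projection of x onto the hyperplane
print(r, w0 + w @ x_p)                      # second value is ~0: x_p lies on the plane
```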

12 Generalized Linear Discriminant Functions
Use basis functions {b_1(x), ..., b_m(x)}, where each b_i(x): R^d -> R, and
g(x; w) = w_0 + w_1 b_1(x) + ... + w_m b_m(x) =: w^T y
(note that we have redefined y here in order to be consistent with the following figure).
(Figure 5.5 from Duda, Hart & Stork, Pattern Classification: the mapping y = (1, x, x^2)^T takes a line and transforms it into a parabola in three dimensions. A plane splits the resulting y-space into regions corresponding to two categories, which in turn gives a non-simply connected decision region in the one-dimensional x-space.)

13 Generalized Linear Discriminant Functions
Use basis functions {b_1(x), ..., b_m(x)}, where each b_i(x): R^d -> R, and g(x; w) = w_0 + w_1 b_1(x) + ... + w_m b_m(x) =: w^T y.
(Figure 5.6 from Duda, Hart & Stork, Pattern Classification: the two-dimensional input space x is mapped through the polynomial mapping y_1 = x_1, y_2 = x_2, y_3 = alpha x_1 x_2; a linear discriminant in y-space yields a nonlinear decision boundary in x-space.)
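
As a toy illustration of such a generalized linear discriminant (hypothetical weights, not from the slides), the sketch below maps a 2-d input to y = (1, x_1, x_2, x_1 x_2) and applies a linear discriminant in y-space, which is nonlinear in the original x-space.

```python
import numpy as np

def basis(x):
    # Polynomial basis b(x) = (1, x1, x2, x1*x2), similar to the mapping in the figure above
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

w = np.array([0.0, 0.0, 0.0, 1.0])            # hypothetical weights: g(x) = x1 * x2
for x in (np.array([1.0, 2.0]), np.array([-1.0, 2.0])):
    g = w @ basis(x)                          # linear in y = b(x), nonlinear in x
    print(x, g, "class 1" if g > 0 else "class 2")
```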

14 Separable Case
Consider a sample {(y_i, c_i)}_{i=1}^n. If there exists f(y; w) = y^T w which is positive for all examples in class 1 and negative for all examples in class 2, we say that the sample is linearly separable.
"Normalization": replace all samples labeled c_2 by their negatives; then we can simply require y^T w > 0 for all samples. Each sample places a constraint on the possible location of w, which defines a solution region.
(Figure 5.8 from Duda, Hart & Stork, Pattern Classification: four training samples and the solution region in feature space, shown for the raw data on the left and for the "normalized" data on the right.)

15 Separable Case: margin
Different solution vectors may have different margins b: y^T w >= b > 0. Intuitively, large margins are good.
(Figure 5.9 from Duda, Hart & Stork, Pattern Classification: the effect of the margin on the solution region. On the left, the case of no margin (b = 0); on the right, a margin b > 0 shrinks the solution region by b/||y_i|| for each constraint.)

16 Gradient Descent
Solve y^T w > 0 by defining a criterion J(w) such that a minimizer of J is a solution. Start with an initial w(1), and choose the next value by moving in the direction of steepest descent:
w(k+1) = w(k) - eta(k) grad J(w(k)).
Alternatively, use a second-order expansion (Newton's method):
w(k+1) = w(k) - H^{-1} grad J(w(k)).
(Figure: the criterion function J over the weight space.)

17 Minimizing the Perceptron Criterion
Solve y^T w > 0 by defining J(w) such that a minimizer of J is a solution. The most obvious criterion, the number of misclassifications, is not differentiable. An alternative choice is the perceptron criterion
J_p(w) = sum_{y in M} (-y^T w),
where M(w) is the set of samples misclassified by w. Since y^T w < 0 for all y in M, J_p is non-negative, and zero only if w is a solution. The gradient is
grad J_p(w) = - sum_{y in M} y, so the update reads w(k+1) = w(k) + eta(k) sum_{y in M} y.
This defines the Batch Perceptron algorithm.
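
A minimal sketch of the resulting Batch Perceptron (synthetic separable data, fixed learning rate; all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic separable data: augmented samples y = (1, x), labels +1 / -1.
X = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
c = np.hstack([np.ones(20), -np.ones(20)])
Y = np.hstack([np.ones((40, 1)), X]) * c[:, None]   # "normalization": negate class-2 samples

w, eta = np.zeros(3), 1.0
for k in range(100):
    M = Y[Y @ w <= 0]                    # samples misclassified by the current w
    if len(M) == 0:                      # all constraints y^T w > 0 satisfied
        break
    w = w + eta * M.sum(axis=0)          # batch step: add the sum of misclassified samples
print(k, w, (Y @ w > 0).all())
```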

18 Minimizing the Perceptron Criterion (2)
(Figure: the perceptron criterion J_p as a function of the weights a_1, a_2, shown together with the training samples y_1, y_2, y_3 and the solution region.)

19 Fixed-Increment Single Sample Perceptron
Fix the learning rate eta(k) = 1 and perform sequential single-sample updates: use superscripts y^1, y^2, ... for the misclassified samples y in M; the ordering is irrelevant. Simple algorithm:
w(1) arbitrary, w(k+1) = w(k) + y^k, k >= 1.
Perceptron Convergence Theorem: if the samples are linearly separable, the sequence of weight vectors given by the Fixed-Increment Single Sample Perceptron algorithm terminates at a solution vector. (Proof: exercises.)
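
The single-sample variant can be sketched as follows (normalized samples as rows of Y, so the goal is y^T w > 0 for every row; names are illustrative, as in the batch sketch above):

```python
import numpy as np

def single_sample_perceptron(Y, max_epochs=100):
    """Fixed-increment single-sample perceptron on normalized samples Y (rows y_i)."""
    w = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:                      # ordering of the samples is irrelevant
            if y @ w <= 0:               # y is misclassified by the current w
                w = w + y                # fixed increment: eta = 1
                errors += 1
        if errors == 0:                  # all samples satisfy y^T w > 0: done
            return w
    return w
```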

20 Issues
There are a number of problems with the perceptron algorithm:
- When the data are separable, there are many solutions, and which one is found depends on the starting values. In particular, no separation margin can be guaranteed (although modified versions exist that address this).
- The number of steps can be very large.
- When the data are not separable, the algorithm will not necessarily converge, and cycles may occur. The cycles can be long and therefore hard to detect.
- The method is technical in nature, with no (obvious) connection to learning theory.

21 Informative (or Generative) vs Discriminative
Recall: classification can be viewed as a (local) approximation of G(x) = E[c | x] near its zero crossings. There are two main strategies for solving this problem:
- Informative: informative classifiers model the class densities. The likelihood of each class is examined, and classification is done by assigning to the most likely class.
- Discriminative: these classifiers focus on modeling the class boundaries or the class membership probabilities directly. No attempt is made to model the underlying class-conditional densities.

22 Informative Classifiers
Central idea: model the conditional class densities p(x | c) and use Bayes' rule to compute the posteriors:
P(c_j | x) = p(x | c = j) P(c = j) / p(x).
Assuming a parametrized class-conditional density p_{theta_j}(x | c = j) and collecting all model parameters in a vector theta, these parameters are usually estimated by maximizing the full log-likelihood
theta_hat_MLE = argmax_theta sum_{i=1}^n log p_theta(x_i, c_i) = argmax_theta sum_{i=1}^n [log p_theta(x_i | c_i) + log P_theta(c_i)].

23 Informative Classifiers: LDA
In Linear Discriminant Analysis (LDA), a Gaussian model is used in which all classes share a common covariance matrix Sigma: p_theta(x | c = j) = N(x; mu_j, Sigma). The resulting discriminant function is linear:
g(x) = log [P(c = 1) N(x; mu_1, Sigma) / (P(c = 2) N(x; mu_2, Sigma))]
     = [log(pi_1/pi_2) - (1/2)(mu_1 + mu_2)^T Sigma^{-1} (mu_1 - mu_2)] + (mu_1 - mu_2)^T Sigma^{-1} x
     = w_0 + w^T x,
where the bracketed term is w_0 and (mu_1 - mu_2)^T Sigma^{-1} x = w^T x.

24 Informative Classifiers: LDA
The discriminant function g(x) measures differences in the Mahalanobis distances from x to the k-th class mean:
D(x, mu_k) = (x - mu_k)^T Sigma^{-1} (x - mu_k).
Bayes-optimal decision rule: measure the Mahalanobis distance to each class mean and assign x to the class with the (prior-corrected) smallest distance.
Connection to the LDA algorithm: let Sigma_W be an estimate of the shared covariance matrix Sigma, and m_j an estimate of mu_j. LDA finds the weight vector w_LDA = Sigma_W^{-1} (m_1 - m_2). This w_LDA asymptotically coincides with the Bayes-optimal w (if the Gaussian model is correct!).

25 LDA: Algorithmic view
(Figure 3.5 from Duda, Hart & Stork, Pattern Classification: projection of the same set of samples onto two different lines in the directions marked w; the figure on the right shows greater separation between the red and black projected points.)
Question: how should we select the projection vector so that it optimally discriminates between the different classes?

26 LDA: Algorithmic view
Samples: {(x_1, c_1), ..., (x_n, c_n)}. Partition the sample set into class-specific subsets X_c. Projected samples: x_tilde_i = w^T x_i, 1 <= i <= n. What is a measure for the discrimination of the projected points?
Sample averages: m_c = (1/n_c) sum_{x in X_c} x, with n_c = |X_c|.
Sample averages of the projected points: m_tilde_c = (1/n_c) sum_{x in X_c} w^T x = w^T m_c.
Distance of the averages in the 2-class case: m_tilde_1 - m_tilde_2 = w^T (m_1 - m_2).

27 LDA: Algorithmic view
Scatter of class c: Sigma_c = sum_{x in X_c} (x - m_c)(x - m_c)^T. The scatter matrix is an unscaled estimate of the covariance matrix.
Within-class scatter: Sigma_W = Sigma_1 + Sigma_2 (unscaled "average").
Scatter of the projected data:
Sigma_tilde_c = sum_{x in X_c} (w^T x - m_tilde_c)(w^T x - m_tilde_c)^T = sum_{x in X_c} w^T (x - m_c)(x - m_c)^T w = w^T Sigma_c w.
For 2 classes: Sigma_tilde_1 + Sigma_tilde_2 = w^T Sigma_W w.

28 LDA: Algorithmic view
Fisher criterion:
J(w) = (m_tilde_1 - m_tilde_2)^2 / (Sigma_tilde_1 + Sigma_tilde_2) = (w^T Sigma_B w) / (w^T Sigma_W w), with Sigma_B := (m_1 - m_2)(m_1 - m_2)^T.
Idea: the optimal w can be found by maximization of J(w)!
d/dw J(w) = d/dw [(w^T Sigma_W w)^{-1} w^T Sigma_B w]
          = -2 Sigma_W w (w^T Sigma_W w)^{-2} (w^T Sigma_B w) + 2 Sigma_B w (w^T Sigma_W w)^{-1} = 0
==> Sigma_B w = [(w^T Sigma_B w) / (w^T Sigma_W w)] Sigma_W w.

29 LDA: Algorithmic view
Generalized eigenvalue problem: let Sigma_W be non-singular; then
Sigma_W^{-1} Sigma_B w = lambda w, with lambda = (w^T Sigma_B w) / (w^T Sigma_W w).
Since Sigma_B w = (m_1 - m_2)(m_1 - m_2)^T w always points in the direction of (m_1 - m_2), the unscaled solution is
w_hat = Sigma_W^{-1} (m_1 - m_2).
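
A sketch of the resulting LDA computation on synthetic two-class Gaussian data (illustrative numbers; the equal-prior threshold is an added assumption): estimate the class means and the within-class scatter, then take w_hat = Sigma_W^{-1} (m_1 - m_2).

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([2, 0], [[1.0, 0.5], [0.5, 1.0]], size=100)   # class 1
X2 = rng.multivariate_normal([0, 2], [[1.0, 0.5], [0.5, 1.0]], size=100)   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)    # class means
S1 = (X1 - m1).T @ (X1 - m1)                 # scatter matrix of class 1
S2 = (X2 - m2).T @ (X2 - m2)                 # scatter matrix of class 2
Sw = S1 + S2                                 # within-class scatter

w = np.linalg.solve(Sw, m1 - m2)             # Fisher/LDA direction: Sw^{-1} (m1 - m2)
w0 = -0.5 * w @ (m1 + m2)                    # threshold assuming equal class priors
pred = np.where(X1 @ w + w0 > 0, 1, 2)       # classify the class-1 samples
print(w, (pred == 1).mean())                 # most class-1 points land on the positive side
```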

30 Discriminative classifiers
Discriminative classifiers focus directly on the discriminant function. In general, they are more flexible with regard to the class-conditional densities they are capable of modeling. By Bayes' formula,
g(x) = log [P(c = 1 | x) / P(c = 2 | x)] = log [p(x | c = 1) P(c = 1) / (p(x | c = 2) P(c = 2))].
They can model any class-conditional probabilities that are exponential tilts of each other:
p(x | c = 1) = e^{g(x)} [P(c = 2) / P(c = 1)] p(x | c = 2).

31 Discriminative classifiers
Since the discriminative model focuses directly on the posteriors, it is natural to estimate the model parameters theta in p_theta(x | c) by maximizing the conditional log-likelihood
theta_hat_DISCR = argmax_theta sum_{i=1}^n log p_theta(c_i | x_i).
The discriminative model makes no use of the information contained in the marginal density p(x). Thus, if the density model is correct, useful information is ignored and the informative approach may give better results. However, if the class models are incorrect, it may be better not to use them.

32 Logistic Regression (LOGREG)
Logistic regression uses a linear discriminant function, i.e. g(x) = w^T x + w_0. For the special case p(x | c in {1, 0}) = N(x; mu_{1,0}, Sigma), LDA and LOGREG coincide:
p(x | c = 1) = N(x; mu_1, Sigma) = e^{w^T x + w_0} (pi_0/pi_1) N(x; mu_0, Sigma)
<=> g(x) = w_0 + w^T x = log [pi_1 N(x; mu_1, Sigma) / (pi_0 N(x; mu_0, Sigma))].

33 Logistic Regression (LOGREG)
Two-class problem with a Binomial RV c taking values in {0, 1}: it is sufficient to represent P(1 | x), since P(0 | x) = 1 - P(1 | x). The success probability of the Binomial RV is pi(x) := P(1 | x).
(Basketball example: figure.)

34 Logistic Regression (LOGREG)
LOGREG: log [P(c = 1 | x) / P(c = 0 | x)] = log [pi(x) / (1 - pi(x))] = w^T x + w_0.
Thus, the conditional expectation of c given x satisfies the logistic regression model:
pi(x) := E[c | x] = exp{w^T x + w_0} / (1 + exp{w^T x + w_0}) = sigma(w^T x + w_0).
The logistic "squashing" function sigma(z) = e^z/(1 + e^z) = 1/(1 + e^{-z}) turns linear predictions into probabilities.

35 Logistic Regression (LOGREG)
Assume that w_0 is absorbed in w by using the augmented input x <- (1 : x). Estimate w by maximizing the conditional likelihood
w_hat_DISCR = argmax_w prod_{i=1}^n p(c_i | x_i; w) = argmax_w prod_{i=1}^n pi(x_i; w)^{c_i} (1 - pi(x_i; w))^{1 - c_i},
or by maximizing the corresponding log-likelihood
l(w) = sum_{i=1}^n [c_i log pi(x_i; w) + (1 - c_i) log(1 - pi(x_i; w))].
The score function is defined as
s(w) = d l(w)/dw = sum_{i=1}^n x_i (c_i - pi_i).
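
A minimal numpy sketch of these quantities on hypothetical data (x already augmented with a leading 1): the logistic probabilities pi_i, the conditional log-likelihood l(w), and the score s(w) = sum_i x_i (c_i - pi_i).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical augmented design matrix (first column = 1) and binary labels.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]])
c = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([0.1, 0.8])                          # some current parameter value

pi = sigmoid(X @ w)                               # pi_i = P(c_i = 1 | x_i; w)
loglik = np.sum(c * np.log(pi) + (1 - c) * np.log(1 - pi))   # conditional log-likelihood l(w)
score = X.T @ (c - pi)                            # score s(w) = sum_i x_i (c_i - pi_i)
print(loglik, score)
```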

36 Logistic Regression (LOGREG)
Since the pi_i depend non-linearly on w, the equation system s(w) = 0 usually cannot be solved analytically, and iterative techniques must be applied. Newton's method updates the parameter estimates w at the r-th step by
w^{(r+1)} = w^{(r)} - {H^{(r)}}^{-1} s^{(r)},
where H^{(r)} is the Hessian of l, evaluated at w^{(r)}:
H^{(r)} = (d^2 l / dw dw^T) at w = w^{(r)} = - sum_{i=1}^n pi_i^{(r)} (1 - pi_i^{(r)}) x_i x_i^T.

37 Logistic Regression (LOGREG)
The Newton updates can be restated as an Iteratively Re-weighted Least Squares (IRLS) problem:
- The Hessian is equal to -X^T W X, with W = diag{pi_1(1 - pi_1), ..., pi_n(1 - pi_n)}.
- The gradient of l (the scores) can be written as s = X^T W e, where e is a vector with entries e_j = (c_j - pi_j)/W_jj.
- With q^{(r)} := X w^{(r)} + e, the updates read (X^T W^{(r)} X) w^{(r+1)} = X^T W^{(r)} q^{(r)}.
These are the normal equations of a least-squares problem with input matrix (W^{(r)})^{1/2} X and right-hand side (W^{(r)})^{1/2} q^{(r)}. The values W_ii are functions of the current w^{(r)}, so iteration is needed.
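
A sketch of the IRLS iteration described above (synthetic data; function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def irls_logreg(X, c, n_iter=25):
    """Fit logistic regression by IRLS; X is the augmented design matrix, c are 0/1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ w)))          # current probabilities
        W = pi * (1.0 - pi)                          # diagonal of the weight matrix
        e = (c - pi) / W                             # working residuals e_j = (c_j - pi_j) / W_jj
        q = X @ w + e                                # adjusted response q = Xw + e
        # Normal equations of the weighted LS problem: (X^T W X) w = X^T W q
        w = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * q))
    return w

# Example with synthetic data: one feature plus an intercept column.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
c = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(2.0 * x - 0.5)))).astype(float)
X = np.column_stack([np.ones_like(x), x])
print(irls_logreg(X, c))                             # roughly recovers (-0.5, 2.0)
```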

38 Logistic Regression (LOGREG)
(Figure: a simple binary classification problem in R^2, solved with LOGREG using polynomial basis functions.)

39 Loss functions
LOGREG maximizes the log-likelihood
l(w) = sum_{i=1}^n [c_i log pi(x_i; w) + (1 - c_i) log(1 - pi(x_i; w))], where pi = 1/(1 + e^{-z}) and z = w^T x.
This is the same as minimizing
-l(w) = sum_{i=1}^n [-c_i log pi(x_i; w) - (1 - c_i) log(1 - pi(x_i; w))] = sum_{i=1}^n [log(1 + e^{-z_i}) + (1 - c_i) z_i] =: sum_{i=1}^n Loss(c_i, z_i).
Loss function: Loss(c, z) = (1 - c) z + log(1 + e^{-z}).
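
A quick numerical check (assuming numpy) that the per-sample loss (1 - c)z + log(1 + e^{-z}) coincides with the negative log-likelihood term -[c log pi + (1 - c) log(1 - pi)] with pi = sigma(z):

```python
import numpy as np

z = np.linspace(-5, 5, 11)
for c in (0.0, 1.0):
    pi = 1.0 / (1.0 + np.exp(-z))
    nll = -(c * np.log(pi) + (1 - c) * np.log(1 - pi))   # negative log-likelihood term
    loss = (1 - c) * z + np.log1p(np.exp(-z))            # Loss(c, z) from the slide
    print(c, np.allclose(nll, loss))                     # True for both labels
```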

40 Loss functions
(Figure: plot of the loss function.)

41 Summing up
Classification can be viewed as a (local) approximation of G(x) = E[c | x] near its zero crossings. We have to define a hypothesis class from which a function is chosen by some inference mechanism. One such class is the set of linear discriminant functions g(x; w) = w_0 + w^T x, which can easily be generalized to arbitrary basis functions.
Two main strategies for solving this problem: informative (use the marginal density p(x)) and discriminative (ignore p(x)). Two simple models: LDA and LOGREG.
