Machine Learning 2017
1 Machine Learning 2017
Volker Roth
Department of Mathematics & Computer Science, University of Basel
21st March 2017
2 Section 5: Classification
3 Outline
Goal: understanding the problem settings in classification and regression estimation. Focus on classification:
- Understanding the concept of Bayesian decision theory.
- Understanding the notion of discriminant functions.
- Understanding generalized linear discriminant functions and separating hyperplanes.
- Understanding some simple classification algorithms.
4 Classification Example
Sorting incoming fish on a conveyor according to species, using optical sensing. Features:
- Length
- Brightness
- Width
- Shape of fins
5 Classification
Given an environment characterized by $p(x, y)$:
- $k$ classes $c_1, \dots, c_k$; $y$ is a class indicator variable.
- A supervisor classifies $x$ into the $k$ classes according to $P(c \mid x)$.
Goal: construct a function that classifies in the same manner as the supervisor.
Methods: Linear Discriminant Analysis, Nearest Neighbor, Support Vector Machines, etc.
6 Bayesian Decision Theory
Assign an observed $x \in \mathbb{R}^d$ to one of $k$ classes. A classifier is a mapping that assigns labels to observations, $f_\alpha : x \mapsto \{1, \dots, k\}$. For any observation $x$ there is a set of $k$ possible actions $\alpha_i$, i.e. $k$ different assignments of labels. The loss $L$ incurred for taking action $\alpha_i$ when the true label is $j$ is given by a loss matrix $L_{ij} = L(\alpha_i \mid c = j)$. A natural 0-1 loss function is obtained by simply counting misclassifications:
$$L_{ij} = 1 - \delta_{ij}, \quad \text{where } \delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$
7 Bayesian Decision Theory (cont'd)
A classifier is trained on a set of observed pairs $\{(x_1, c_1), \dots, (x_n, c_n)\}$ drawn i.i.d. from $p(x, c) = p(c \mid x)\, p(x)$. The probability that a given $x$ is a member of class $c_j$ is obtained via the Bayes rule:
$$P(c_j \mid x) = \frac{p(x \mid c = j)\, P(c = j)}{p(x)}, \quad \text{where } p(x) = \sum_{j=1}^k p(x \mid c = j)\, P(c = j).$$
Given an observation $x$, the expected loss associated with choosing action $\alpha_i$ (the conditional risk) is, for 0-1 loss,
$$R(f_{\alpha_i} \mid x) = \sum_{j=1}^k L_{ij}\, P(c_j \mid x) = \sum_{j \neq i} P(c_j \mid x) = 1 - P(c_i \mid x).$$
8 Bayesian Decision Theory (cont'd)
Goal: minimize the overall risk
$$R(f_\alpha) = \int_{\mathbb{R}^d} R\big(f_\alpha(x) \mid x\big)\, p(x)\, dx.$$
If $f_\alpha(x)$ minimizes the conditional risk $R(f_\alpha(x) \mid x)$ for every $x$, the overall risk is minimized as well. This is achieved by the Bayes optimal classifier, which chooses the mapping
$$f^*(x) = \arg\min_i \sum_{j=1}^k L_{ij}\, P(c = j \mid x).$$
For 0-1 loss this reduces to classifying $x$ to the class with the highest posterior probability:
$$f^*(x) = \arg\max_i P(c = i \mid x).$$
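The Bayes rule above can be sketched numerically. A minimal illustration with two classes (all priors and likelihood values below are invented for the example): posteriors follow from Bayes' rule, and with 0-1 loss the minimum-risk action coincides with the maximum-posterior class.

```python
import numpy as np

# Hypothetical two-class setup: priors and class-conditional densities
# evaluated at a single observation x (all numbers invented).
priors = np.array([0.3, 0.7])          # P(c=1), P(c=2)
likelihoods = np.array([0.9, 0.2])     # p(x | c=1), p(x | c=2)

# Bayes rule: P(c=j | x) = p(x | c=j) P(c=j) / p(x)
joint = likelihoods * priors
posteriors = joint / joint.sum()

# 0-1 loss: L_ij = 1 - delta_ij, so the conditional risk of action i
# is 1 - P(c=i | x), and minimizing risk = maximizing the posterior.
L = 1.0 - np.eye(2)
risks = L @ posteriors                 # R(f_alpha_i | x) for each action i
bayes_action = int(np.argmin(risks))
assert bayes_action == int(np.argmax(posteriors))
```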
9 Bayesian Decision Theory (cont'd)
Simplification: only 2 classes, so $c$ is a Binomial RV. The Bayes optimal classifier is defined by the zero crossings of the Bayes (optimal) discriminant function
$$G(x) = P(c_1 \mid x) - P(c_2 \mid x), \quad \text{or} \quad g(x) = \log \frac{P(c_1 \mid x)}{P(c_2 \mid x)}.$$
Link to regression: use the encoding $\{+1, -1\}$ for the two possible states $c_{1,2}$ of $c$. The conditional expectation of $c \mid x$ equals the Bayes discriminant function:
$$E[c \mid x] = \sum_{c \in \{+1, -1\}} c\, P(c \mid x) = P(c_1 \mid x) - P(c_2 \mid x) = G(x).$$
Classification can be viewed as a (local) approximation of $G(x) = E[c \mid x]$ near its zero crossings.
10 Linear Discriminant Functions
Problem: direct approximation of $G$ would require knowledge of the Bayes optimal discriminant. We therefore have to define a hypothesis class from which a function is chosen by some inference mechanism. One such class is the set of linear discriminant functions
$$g(x; w) = w_0 + w^T x.$$
Decide $c_1$ if $g(x; w) > 0$ and $c_2$ if $g(x; w) < 0$. The equation $g(x; w) = 0$ defines the decision surface. Linearity of $g(x; w)$ implies that this surface is a hyperplane: $w$ is orthogonal to any vector lying in the plane. The hyperplane divides the feature space into half-spaces $R_1$ ("positive side") and $R_2$ ("negative side").
11 Decision Hyperplanes
$g(x; w)$ defines the distance $r$ from $x$ to the hyperplane: write $x = x_p + r\, \frac{w}{\|w\|}$ with $x_p$ on the plane. Since $g(x_p) = 0$, it follows that $g(x) = r\, \|w\|$, i.e. $r = g(x) / \|w\|$.
(Figure: hyperplane $H$ with normal vector $w$, regions $R_1$ and $R_2$, a point $x$ at distance $r$ from its projection $x_p$ onto $H$, and distance $w_0 / \|w\|$ from the origin to $H$.)
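The distance formula $r = g(x)/\|w\|$ is easy to check numerically. A small sketch with an invented hyperplane: moving $x$ back by $r$ along $w/\|w\|$ must land on a point $x_p$ with $g(x_p) = 0$.

```python
import numpy as np

# Toy hyperplane g(x; w) = w0 + w.x (numbers are made up).
w = np.array([3.0, 4.0])
w0 = -5.0

def signed_distance(x):
    """r = g(x) / ||w||: positive on the R1 side, negative on R2."""
    return (w0 + w @ x) / np.linalg.norm(w)

x = np.array([3.0, 4.0])
r = signed_distance(x)                  # g(x) = -5 + 25 = 20, ||w|| = 5
# Projecting x onto the plane by moving -r along w/||w|| gives x_p
# with g(x_p) = 0, confirming x = x_p + r * w/||w||.
x_p = x - r * w / np.linalg.norm(w)
```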
12 Generalized Linear Discriminant Functions
Use basis functions $\{b_1(x), \dots, b_m(x)\}$, where each $b_i : \mathbb{R}^d \to \mathbb{R}$, and
$$g(x; w) = w_0 + w_1 b_1(x) + \dots + w_m b_m(x) =: w^T y$$
(note that we have redefined $y$ here in order to be consistent with the following figure).
(FIGURE 5.5. The mapping $y = (1, x, x^2)^T$ takes a line and transforms it to a parabola in three dimensions. A plane splits the resulting $y$-space into regions corresponding to two categories, and this in turn gives a non-simply connected decision region in the one-dimensional $x$-space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.)
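The figure's quadratic map can be sketched directly. With the feature map $y = (1, x, x^2)^T$ and an illustrative weight vector (chosen here, not from the slides) $w = (-1, 0, 1)^T$, the linear discriminant $g(x) = w^T y = x^2 - 1$ assigns class 1 to the two disjoint intervals $|x| > 1$, i.e. a non-simply connected region in $x$-space:

```python
import numpy as np

# The quadratic feature map from the figure: y = (1, x, x^2)^T.
def phi(x):
    return np.array([1.0, x, x * x])

# A discriminant that is linear in y-space: g(x) = w^T phi(x).
# With w = (-1, 0, 1) this is g(x) = x^2 - 1, so class 1 is the
# non-connected region |x| > 1 in the original one-dimensional space.
w = np.array([-1.0, 0.0, 1.0])
g = lambda x: w @ phi(x)
labels = [1 if g(x) > 0 else 2 for x in (-2.0, 0.0, 2.0)]
```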
13 Generalized Linear Discriminant Functions
Use basis functions $\{b_1(x), \dots, b_m(x)\}$, where each $b_i : \mathbb{R}^d \to \mathbb{R}$, and $g(x; w) = w_0 + w_1 b_1(x) + \dots + w_m b_m(x) =: w^T y$.
(FIGURE 5.6. The two-dimensional input space $x$ is mapped through a polynomial function $f$ to $y$. Here the mapping is $y_1 = x_1$, $y_2 = x_2$ and $y_3 = \alpha x_1 x_2$. A linear discriminant in the transformed space $y$ then yields a nonlinear decision boundary in the original space $x$. From: Duda, Hart, and Stork, Pattern Classification, © 2001 by John Wiley & Sons, Inc.)
14 Separable Case
Consider a sample $\{y_i, c_i\}_{i=1}^n$. If there exists $f(y; w) = y^T w$ which is positive for all examples in class 1 and negative for all examples in class 2, we say that the sample is linearly separable. "Normalization": replace all samples labeled $c_2$ by their negatives; then we can simply write $y^T w > 0$ for all samples. Each sample places a constraint on the possible location of $w$, defining a solution region.
(FIGURE 5.8. Four training samples (black for $\omega_1$, red for $\omega_2$) and the solution region in feature space. The figure on the left shows the raw data with a separating plane; the figure on the right shows the normalized data, where every solution vector yields a "separating" plane. From: Duda, Hart, and Stork, Pattern Classification, © 2001 by John Wiley & Sons, Inc.)
15 Separable Case: margin
Different solution vectors may have different margins $b$: $y^T w \geq b > 0$. Intuitively, large margins are good.
(FIGURE 5.9. The effect of the margin on the solution region. At the left is the case of no margin ($b = 0$), equivalent to the case shown at the left in Fig. 5.8. At the right, requiring a margin $b$ shrinks the solution region by $b / \|y_i\|$ for each sample. From: Duda, Hart, and Stork, Pattern Classification, © 2001 by John Wiley & Sons, Inc.)
16 Gradient Descent
Solve $y^T w > 0$ by defining a criterion $J(w)$ such that a minimizer of $J$ is a solution. Start with an initial $w(1)$, and choose the next value by moving in the direction of steepest descent:
$$w(k+1) = w(k) - \eta(k)\, \nabla J(w(k)).$$
Alternatively, use a second-order expansion (Newton's method):
$$w(k+1) = w(k) - H^{-1}\, \nabla J(w(k)).$$
(Figure: contours of $J$ over weight space with a descent path.)
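The two update rules can be sketched on a toy criterion. As an illustration only (the quadratic objective and all numbers are invented), gradient descent approaches the minimizer step by step, while the Newton step reaches it in one iteration because the Hessian of a quadratic is constant:

```python
import numpy as np

# Toy criterion J(w) = 0.5 * ||w - w_star||^2 with known minimizer.
w_star = np.array([2.0, -1.0])
grad = lambda w: w - w_star            # gradient of J

# Gradient descent: w(k+1) = w(k) - eta * grad J(w(k)).
w = np.zeros(2)
eta = 0.5
for _ in range(50):
    w = w - eta * grad(w)

# Newton: w(k+1) = w(k) - H^{-1} grad J(w(k)); here H = I, so one
# step from w(1) = 0 jumps straight to the minimizer.
H = np.eye(2)
w_newton = np.zeros(2) - np.linalg.inv(H) @ grad(np.zeros(2))
```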
17 Minimizing the Perceptron Criterion
Solve $y^T w > 0$ by defining $J(w)$ such that a minimizer of $J$ is a solution. The most obvious criterion is the number of misclassifications, but it is not differentiable. An alternative choice is the perceptron criterion
$$J_p(w) = \sum_{y \in M} (-y^T w),$$
where $M(w)$ is the set of samples misclassified by $w$. Since $y^T w < 0$ for all $y \in M$, $J_p$ is non-negative, and zero only if $w$ is a solution. Gradient: $\nabla J_p(w) = -\sum_{y \in M} y$, giving the update
$$w(k+1) = w(k) + \eta(k) \sum_{y \in M} y.$$
This defines the Batch Perceptron algorithm.
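The batch update above can be sketched in a few lines. A minimal illustration on an invented, well-separated toy set (class-2 rows are already negated, as in the "normalization" on slide 14); this is a sketch, not a general-purpose implementation:

```python
import numpy as np

# "Normalized" samples: a solution w must satisfy y^T w > 0 for
# every row y (class-2 samples have been negated). Data invented.
Y = np.array([[1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [-1.0, 2.0, 0.0],    # negated class-2 sample
              [-1.0, 3.0, 1.0]])   # negated class-2 sample

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    w = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        M = Y[Y @ w <= 0]          # currently misclassified samples
        if len(M) == 0:
            return w               # J_p(w) = 0: solution found
        w = w + eta * M.sum(axis=0)  # w <- w + eta * sum of M
    return w

w = batch_perceptron(Y)
```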
18 Minimizing the Perceptron Criterion (2)
(Figure: the perceptron criterion $J_p$ plotted over weight space, together with the training samples $y_1, y_2, y_3$ and the solution region.)
19 Fixed-Increment Single Sample Perceptron
Fix the learning rate $\eta(k) = 1$ and perform sequential single-sample updates: use superscripts $y^1, y^2, \dots$ for misclassified samples $y \in M$; the ordering is irrelevant. Simple algorithm:
$$w(1) \text{ arbitrary}, \quad w(k+1) = w(k) + y^k, \quad k \geq 1.$$
Perceptron Convergence Theorem: If the samples are linearly separable, the sequence of weight vectors given by the Fixed-Increment Single Sample Perceptron algorithm will terminate at a solution vector. Proof: exercises.
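A minimal sketch of the single-sample variant, again on invented normalized data: cycle through the samples and add any misclassified $y$ to $w$; by the convergence theorem, the loop terminates because the data are separable.

```python
import numpy as np

# Fixed-increment single-sample perceptron (eta = 1) on toy
# normalized data (class-2 rows already negated; numbers invented).
Y = [np.array([1.0, 2.0, 0.0]),
     np.array([1.0, 3.0, 1.0]),
     np.array([-1.0, 2.0, 0.0]),
     np.array([-1.0, 3.0, 1.0])]

w = np.zeros(3)
changed = True
while changed:                      # terminates: samples are separable
    changed = False
    for y in Y:
        if y @ w <= 0:              # misclassified -> fixed increment
            w = w + y
            changed = True
```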
20 Issues
A number of problems with the perceptron algorithm:
- When the data are separable, there are many solutions, and which one is found depends on the starting values. In particular, no separation margin can be guaranteed (however, there exist modified versions...).
- The number of steps can be very large.
- When the data are not separable, the algorithm will not necessarily converge, and cycles may occur. The cycles can be long and therefore hard to detect.
- The method is technical in nature, with no (obvious) connection to learning theory.
21 Informative (or Generative) vs Discriminative
Recall: classification can be viewed as a (local) approximation of $G(x) = E[c \mid x]$ near its zero crossings. There are two main strategies for solving this problem:
- Informative: informative classifiers model the class densities. The likelihood of each class is examined, and classification is done by assigning to the most likely class.
- Discriminative: these classifiers focus on modeling the class boundaries or the class membership probabilities directly. No attempt is made to model the underlying class-conditional densities.
22 Informative Classifiers
Central idea: model the conditional class densities $p(x \mid c)$. Use Bayes rule to compute the posteriors:
$$P(c_j \mid x) = \frac{p(x \mid c = j)\, P(c = j)}{p(x)}.$$
Assuming a parametrized class-conditional density $p_{\theta_j}(x \mid c = j)$ and collecting all model parameters in a vector $\theta$, these parameters are usually estimated by maximizing the full log likelihood:
$$\hat\theta_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x_i, c_i) = \arg\max_\theta \sum_{i=1}^n \big( \log p_\theta(x_i \mid c_i) + \log P_\theta(c_i) \big).$$
23 Informative Classifiers: LDA
In Linear Discriminant Analysis (LDA), a Gaussian model is used in which all classes share a common covariance matrix $\Sigma$:
$$p_\theta(x \mid c = j) = \mathcal{N}(x;\, \mu_j, \Sigma).$$
The resulting discriminant functions are linear:
$$g(x) = \log \frac{P(c=1)\, \mathcal{N}(x; \mu_1, \Sigma)}{P(c=2)\, \mathcal{N}(x; \mu_2, \Sigma)} = \underbrace{\log \frac{\pi_1}{\pi_2} - \tfrac{1}{2} (\mu_1 + \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)}_{w_0} + \underbrace{(\mu_1 - \mu_2)^T \Sigma^{-1} x}_{w^T x} = w_0 + w^T x.$$
24 Informative Classifiers: LDA
The discriminant function $g(x)$ measures differences in the Mahalanobis distances from $x$ to the $k$-th class mean:
$$D(x, \mu_k) = (x - \mu_k)^T \Sigma^{-1} (x - \mu_k).$$
Bayes-optimal decision rule: measure the Mahalanobis distance to each class and assign $x$ to the class with the smallest (prior-corrected) distance. Connection to the LDA algorithm: let $\Sigma_w$ be an estimate of the shared covariance matrix $\Sigma$, and $m_j$ an estimate of $\mu_j$. LDA finds the weight vector
$$w_{\mathrm{LDA}} = \Sigma_w^{-1} (m_1 - m_2).$$
This $w_{\mathrm{LDA}}$ asymptotically coincides with the Bayes-optimal $w$ (if the Gaussian model is correct!).
25 LDA: Algorithmic view
(FIGURE 3.5. Projection of the same set of samples onto two different lines in the directions marked $w$. The figure on the right shows greater separation between the red and black projected points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.)
Question: How should we select the projection vector $w$ which optimally discriminates between the different classes?
26 LDA: Algorithmic view
Samples: $\{(x_1, c_1), \dots, (x_n, c_n)\}$. Partition the sample set into class-specific subsets $X_c$. Projected samples: $\tilde{x}_i = w^T x_i$, $1 \leq i \leq n$. What is a good measure for the discrimination of the projected points? Sample averages: $m_c = \frac{1}{n_c} \sum_{x \in X_c} x$, with $n_c = |X_c|$. Sample averages of the projected points:
$$\tilde{m}_c = \frac{1}{n_c} \sum_{x \in X_c} w^T x = w^T m_c.$$
Distance of the averages in the 2-class case: $\tilde{m}_1 - \tilde{m}_2 = w^T (m_1 - m_2)$.
27 LDA: Algorithmic view
Scatter of class $c$: $\Sigma_c = \sum_{x \in X_c} (x - m_c)(x - m_c)^T$. The scatter matrix is an unscaled estimate of the covariance matrix. Within-class scatter: $\Sigma_W = \Sigma_1 + \Sigma_2$ (unscaled "average"). Scatter of the projected data:
$$\tilde{\Sigma}_c = \sum_{x \in X_c} (w^T x - \tilde{m}_c)^2 = \sum_{x \in X_c} w^T (x - m_c)(x - m_c)^T w = w^T \Sigma_c w,$$
and in the 2-class case, $\tilde{\Sigma}_1 + \tilde{\Sigma}_2 = w^T \Sigma_W w$.
28 LDA: Algorithmic view
$$J(w) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{\Sigma}_1 + \tilde{\Sigma}_2} = \frac{w^T \overbrace{(m_1 - m_2)(m_1 - m_2)^T}^{=: \Sigma_B} w}{w^T \Sigma_W w}.$$
Idea: the optimal $w$ can be found by maximization of $J(w)$:
$$\frac{d}{dw} J(w) = \frac{d}{dw} \left[ (w^T \Sigma_W w)^{-1}\, w^T \Sigma_B w \right] = -2\, \Sigma_W w\, (w^T \Sigma_W w)^{-1}\, \frac{w^T \Sigma_B w}{w^T \Sigma_W w} + 2\, \Sigma_B w\, (w^T \Sigma_W w)^{-1} = 0$$
$$\Rightarrow \quad \Sigma_B w = \Sigma_W w\, \frac{w^T \Sigma_B w}{w^T \Sigma_W w}.$$
29 LDA: Algorithmic view
This is a generalized eigenvalue problem. Let $\Sigma_W$ be non-singular:
$$\Sigma_W^{-1} \underbrace{\Sigma_B w}_{\propto\, (m_1 - m_2)} = \lambda w, \quad \text{with } \lambda = \frac{w^T \Sigma_B w}{w^T \Sigma_W w}.$$
Unscaled solution: $\hat{w} = \Sigma_W^{-1} (m_1 - m_2)$.
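The closed-form solution $\hat{w} = \Sigma_W^{-1} (m_1 - m_2)$ can be sketched from scatter matrices directly. A minimal illustration on two small invented point clouds in $\mathbb{R}^2$; the projections of the two classes onto $\hat{w}$ should end up well separated:

```python
import numpy as np

# Two made-up classes of points in R^2.
X1 = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
X2 = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])

def scatter(X):
    """Unscaled covariance estimate: sum of outer products of deviations."""
    D = X - X.mean(axis=0)
    return D.T @ D

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = scatter(X1) + scatter(X2)        # within-class scatter Sigma_W
w = np.linalg.solve(S_W, m1 - m2)      # Fisher direction, up to scale

# Projected samples: the two classes should not overlap along w.
proj1, proj2 = X1 @ w, X2 @ w
```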
30 Discriminative classifiers
Discriminative classifiers focus directly on the discriminant function. In general, they are more flexible with regard to the class-conditional densities they are capable of modeling. Bayes formula:
$$g(x) = \log \frac{P(c=1 \mid x)}{P(c=2 \mid x)} = \log \frac{p(x \mid c=1)\, P(c=1)}{p(x \mid c=2)\, P(c=2)}.$$
Such a model can represent any conditional probabilities that are exponential "tilts" of each other:
$$p(x \mid c=1) = e^{g(x)}\, p(x \mid c=2)\, \frac{P(c=2)}{P(c=1)}.$$
31 Discriminative classifiers
Since the discriminative model directly focuses on the posteriors, it is natural to estimate the model parameters $\theta$ in $p_\theta(x \mid c)$ by maximizing the conditional log likelihood
$$\hat\theta_{\mathrm{DISCR}} = \arg\max_\theta \sum_{i=1}^n \log p_\theta(c_i \mid x_i).$$
The discriminative model makes no use of the information contained in the marginal density $p(x)$. Thus, if the density model is correct, useful information is ignored and the informative approach may give better results. However, if the class models are incorrect, it may be better not to use them.
32 Logistic Regression (LOGREG)
Logistic regression uses a linear discriminant function, i.e. $g(x) = w^T x + w_0$. For the special case $p(x \mid c \in \{1, 0\}) = \mathcal{N}(x; \mu_{1,0}, \Sigma)$, LDA and LOGREG coincide:
$$p(x \mid c=1) = \mathcal{N}(x; \mu_1, \Sigma) = e^{w^T x + w_0}\, \mathcal{N}(x; \mu_0, \Sigma)\, \frac{\pi_0}{\pi_1}$$
$$\Rightarrow \quad g(x) = w_0 + w^T x = \log \frac{\pi_1\, \mathcal{N}(x; \mu_1, \Sigma)}{\pi_0\, \mathcal{N}(x; \mu_0, \Sigma)}.$$
33 Logistic Regression (LOGREG)
Two-class problem with a Binomial RV $c$ taking values in $\{0, 1\}$: it is sufficient to represent $P(1 \mid x)$, since $P(0 \mid x) = 1 - P(1 \mid x)$. Success probability of the Binomial RV: $\pi(x) := P(1 \mid x)$.
(Figure: basketball example.)
34 Logistic Regression (LOGREG)
LOGREG model:
$$\log \frac{P(c=1 \mid x)}{P(c=0 \mid x)} = \log \frac{\pi(x)}{1 - \pi(x)} = w^T x + w_0.$$
Thus, the conditional expectation of $c \mid x$ satisfies the logistic regression model:
$$\pi(x) := E[c \mid x] = \frac{\exp\{w^T x + w_0\}}{1 + \exp\{w^T x + w_0\}} = \sigma(w^T x + w_0).$$
The logistic squashing function $\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}$ turns linear predictions into probabilities.
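The squashing function is a one-liner. A small sketch (the weights below are illustrative, not from the slides) showing how $\sigma$ maps any linear score into $(0, 1)$:

```python
import numpy as np

# Logistic squashing function sigma(z) = 1 / (1 + e^{-z}).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# pi(x) = sigma(w^T x + w0) for some illustrative weights.
w, w0 = np.array([1.0, -2.0]), 0.5
pi = lambda x: sigmoid(w @ x + w0)
```

Note that $\sigma(0) = 1/2$ (the decision boundary $g(x) = 0$ corresponds to posterior probability one half) and $\sigma(z) + \sigma(-z) = 1$.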
35 Logistic Regression (LOGREG)
Assume that $w_0$ is absorbed in $w$ by extending each input as $x \leftarrow (1, x)$. Estimate $w$ by maximizing the conditional likelihood
$$\hat{w}_{\mathrm{DISCR}} = \arg\max_w \prod_{i=1}^n p(c_i \mid x_i; w) = \arg\max_w \prod_{i=1}^n \pi(x_i; w)^{c_i}\, \big(1 - \pi(x_i; w)\big)^{1 - c_i},$$
or by maximizing the corresponding log likelihood
$$l(w) = \sum_{i=1}^n \big[ c_i \log \pi(x_i; w) + (1 - c_i) \log(1 - \pi(x_i; w)) \big].$$
The score functions are defined as
$$s(w) = \frac{\partial}{\partial w}\, l(w) = \sum_{i=1}^n x_i (c_i - \pi_i).$$
36 Logistic Regression (LOGREG)
Since the $\pi_i$ depend non-linearly on $w$, the equation system $s(w) = 0$ usually cannot be solved analytically, and iterative techniques must be applied. Newton's method updates the parameter estimate $w$ at the $r$-th step by
$$w^{(r+1)} = w^{(r)} - \{H^{(r)}\}^{-1} s^{(r)},$$
where $H^{(r)}$ is the Hessian of $l$, evaluated at $w^{(r)}$:
$$H^{(r)} = \left( \frac{\partial^2 l}{\partial w\, \partial w^T} \right)_{w = w^{(r)}} = -\sum_{i=1}^n \pi_i^{(r)} \big(1 - \pi_i^{(r)}\big)\, x_i x_i^T.$$
37 Logistic Regression (LOGREG)
The Newton updates can be restated as an Iteratively Re-weighted Least Squares (IRLS) problem. The Hessian is equal to $-X^T W X$, with
$$W = \mathrm{diag}\{\pi_1 (1 - \pi_1), \dots, \pi_n (1 - \pi_n)\}.$$
The gradient of $l$ (the scores) can be written as $s = X^T W e$, where $e$ is a vector with entries $e_j = (c_j - \pi_j) / W_{jj}$. With $q^{(r)} := X w^{(r)} + e$, the updates read
$$\big(X^T W^{(r)} X\big)\, w^{(r+1)} = X^T W^{(r)} q^{(r)}.$$
These are the normal equations of a LS problem with input matrix $(W^{(r)})^{1/2} X$ and r.h.s. $(W^{(r)})^{1/2} q^{(r)}$. The values $W_{ii}$ are functions of the current $w^{(r)}$, so iteration is needed.
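The IRLS loop above can be sketched directly. A minimal illustration on a tiny invented 1-D data set with an intercept column; the labels overlap, so the maximum-likelihood solution is finite, and at convergence the score $s(w) = X^T(c - \pi)$ vanishes:

```python
import numpy as np

# Tiny made-up, non-separable data set (intercept in the first column).
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
c = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# IRLS: each step solves (X^T W X) w_new = X^T W q, with
# q = X w + e and e_j = (c_j - pi_j) / W_jj.
w = np.zeros(2)
for _ in range(25):
    pi = 1.0 / (1.0 + np.exp(-X @ w))
    Wd = pi * (1.0 - pi)               # diagonal of W
    e = (c - pi) / Wd
    q = X @ w + e
    w = np.linalg.solve(X.T @ (Wd[:, None] * X), X.T @ (Wd * q))

pi = 1.0 / (1.0 + np.exp(-X @ w))
score = X.T @ (c - pi)                 # s(w): should vanish at the MLE
```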
38 Logistic Regression (LOGREG)
(Figure: a simple binary classification problem in $\mathbb{R}^2$, solved with LOGREG using polynomial basis functions.)
39 Loss functions
LOGREG maximizes the log likelihood
$$l(w) = \sum_{i=1}^n \big[ c_i \log \pi(x_i; w) + (1 - c_i) \log(1 - \pi(x_i; w)) \big], \quad \text{where } \pi = \frac{1}{1 + e^{-z}},\ z = w^T x.$$
This is the same as minimizing
$$-l(w) = \sum_{i=1}^n \big[ -c_i \log \pi(x_i; w) - (1 - c_i) \log(1 - \pi(x_i; w)) \big] = \sum_{i=1}^n \big[ \log(1 + e^{-z_i}) + (1 - c_i)\, z_i \big] =: \sum_{i=1}^n \mathrm{Loss}(c_i, z_i).$$
Loss function: $\mathrm{Loss}(c, z) = (1 - c)\, z + \log(1 + e^{-z})$.
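The algebraic identity between the per-sample loss and the negative Bernoulli log likelihood can be verified numerically; a quick sketch:

```python
import numpy as np

# Loss(c, z) = (1 - c) z + log(1 + e^{-z}) from the slide.
def loss(c, z):
    return (1.0 - c) * z + np.log1p(np.exp(-z))

# Negative Bernoulli log likelihood with pi = sigma(z); expanding
# log(pi) and log(1 - pi) shows both expressions are identical.
def neg_log_lik(c, z):
    pi = 1.0 / (1.0 + np.exp(-z))
    return -c * np.log(pi) - (1.0 - c) * np.log(1.0 - pi)

zs = np.linspace(-4.0, 4.0, 9)
ok = all(abs(loss(c, z) - neg_log_lik(c, z)) < 1e-9
         for c in (0.0, 1.0) for z in zs)
```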
40 Loss functions
(Figure.)
41 Summing up
- Classification can be viewed as a (local) approximation of $G(x) = E[c \mid x]$ near its zero crossings.
- We have to define a hypothesis class from which a function is chosen by some inference mechanism. One such class is the set of linear discriminant functions $g(x; w) = w_0 + w^T x$, which can easily be generalized to arbitrary basis functions.
- Two main strategies for solving this problem: informative (use the marginal density $p(x)$) and discriminative (ignore $p(x)$).
- Two simple models: LDA and LOGREG.
More informationp(x ω i 0.4 ω 2 ω
p(x ω i ).4 ω.3.. 9 3 4 5 x FIGURE.. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationMachine Learning. Lecture 3: Logistic Regression. Feng Li.
Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification
More informationMachine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More informationIntroduction to Machine Learning
Introduction to Machine Learning Bayesian Classification Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationIntroduction to Machine Learning
Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},
More informationBayesian Decision Theory
Bayesian Decision Theory Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Bayesian Decision Theory Bayesian classification for normal distributions Error Probabilities
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationBayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory
Bayesian decision theory 8001652 Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Jussi Tohka jussi.tohka@tut.fi Institute of Signal Processing Tampere University of Technology
More informationParametric Techniques Lecture 3
Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to
More informationA vector from the origin to H, V could be expressed using:
Linear Discriminant Function: the linear discriminant function: g(x) = w t x + ω 0 x is the point, w is the weight vector, and ω 0 is the bias (t is the transpose). Two Category Case: In the two category
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationExample - basketball players and jockeys. We will keep practical applicability in mind:
Sonka: Pattern Recognition Class 1 INTRODUCTION Pattern Recognition (PR) Statistical PR Syntactic PR Fuzzy logic PR Neural PR Example - basketball players and jockeys We will keep practical applicability
More informationLINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning
LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary
More informationENG 8801/ Special Topics in Computer Engineering: Pattern Recognition. Memorial University of Newfoundland Pattern Recognition
Memorial University of Newfoundland Pattern Recognition Lecture 6 May 18, 2006 http://www.engr.mun.ca/~charlesr Office Hours: Tuesdays & Thursdays 8:30-9:30 PM EN-3026 Review Distance-based Classification
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationAn Introduction to Statistical and Probabilistic Linear Models
An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationPattern Recognition 2
Pattern Recognition 2 KNN,, Dr. Terence Sim School of Computing National University of Singapore Outline 1 2 3 4 5 Outline 1 2 3 4 5 The Bayes Classifier is theoretically optimum. That is, prob. of error
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest
More informationClassification. Sandro Cumani. Politecnico di Torino
Politecnico di Torino Outline Generative model: Gaussian classifier (Linear) discriminative model: logistic regression (Non linear) discriminative model: neural networks Gaussian Classifier We want to
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More informationLecture 5: LDA and Logistic Regression
Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationMachine Learning - Waseda University Logistic Regression
Machine Learning - Waseda University Logistic Regression AD June AD ) June / 9 Introduction Assume you are given some training data { x i, y i } i= where xi R d and y i can take C different values. Given
More informationCSC321 Lecture 4: Learning a Classifier
CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 28 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron
More informationBayesian Decision Theory
Bayesian Decision Theory Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent University) 1 / 46 Bayesian
More informationLinear Classification
Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015 Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationECE662: Pattern Recognition and Decision Making Processes: HW TWO
ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are
More informationMachine Learning. Theory of Classification and Nonparametric Classifier. Lecture 2, January 16, What is theoretically the best classifier
Machine Learning 10-701/15 701/15-781, 781, Spring 2008 Theory of Classification and Nonparametric Classifier Eric Xing Lecture 2, January 16, 2006 Reading: Chap. 2,5 CB and handouts Outline What is theoretically
More informationLearning Methods for Linear Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2011/2012 Lesson 20 27 April 2012 Contents Learning Methods for Linear Detectors Learning Linear Detectors...2
More informationDEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE
Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures
More informationBayesian Decision Theory Lecture 2
Bayesian Decision Theory Lecture 2 Jason Corso SUNY at Buffalo 14 January 2009 J. Corso (SUNY at Buffalo) Bayesian Decision Theory Lecture 2 14 January 2009 1 / 58 Overview and Plan Covering Chapter 2
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More information1 Machine Learning Concepts (16 points)
CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions
More information