Machine Learning Lecture 7
Course Outline
- Fundamentals (2 weeks): Bayes Decision Theory, Probability Density Estimation
- Discriminative Approaches (5 weeks): Linear Discriminant Functions, Statistical Learning Theory & SVMs, Ensemble Methods & Boosting, Randomized Trees, Forests & Ferns
- Generative Models (4 weeks): Bayesian Networks, Markov Random Fields

Bastian Leibe, RWTH Aachen, leibe@vision.rwth-aachen.de
Many slides adapted from B. Schiele.

Topics of This Lecture
- Recap: Linear Discriminant Functions
- Recap: Generalized Linear Discriminants
- Logistic Regression: probabilistic discriminative models, the logistic sigmoid (logit function), the cross-entropy error, gradient descent, Iteratively Reweighted Least Squares, a note on error functions
- Statistical Learning Theory: generalization and overfitting, empirical and actual risk, the VC dimension, Empirical Risk Minimization, Structural Risk Minimization

Recap: Linear Discriminant Functions
Basic idea: directly encode the decision boundary and minimize the misclassification probability directly. A linear discriminant function has the form

    y(x) = w^T x + w_0

with weight vector w and bias w_0 (= threshold). Together, w and w_0 define a hyperplane in R^D: the decision boundary is y = 0, with y > 0 on one side and y < 0 on the other. If a data set can be perfectly classified by a linear discriminant, we call it linearly separable.

Recap: Extension to Nonlinear Basis Functions
Generalization: transform the vector x with M nonlinear basis functions φ_j(x):

    y_k(x) = Σ_{j=1}^{M} w_{kj} φ_j(x) + w_{k0}

Advantages: the transformation allows non-linear decision boundaries, and by choosing the right φ_j, every continuous function can (in principle) be approximated with arbitrary accuracy.

Recap: Basis Functions
Generally, we consider models of the form y_k(x) = Σ_j w_{kj} φ_j(x), where the φ_j(x) are known as basis functions. In the simplest case, we use linear basis functions: φ_d(x) = x_d. Other popular choices are polynomial, Gaussian, and sigmoid basis functions.
Disadvantage: the error function can in general no longer be minimized in closed form, so we resort to minimization with gradient descent.
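To make the recap concrete, here is a minimal Python sketch of a generalized linear discriminant with Gaussian basis functions. It is not from the lecture: the function names, the basis width s, and the toy data are our own choices.

```python
import numpy as np

def gaussian_basis(x, centers, s=1.0):
    """Gaussian basis functions phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2))."""
    # x: (D,) input; centers: (M, D) basis-function centers mu_j
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * s ** 2))

def linear_discriminant(x, w, w0, centers):
    """Generalized linear discriminant y(x) = w^T phi(x) + w_0."""
    return w @ gaussian_basis(x, centers) + w0

# Toy usage: 2-D input, M = 3 Gaussian basis functions
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 2))
w, w0 = rng.normal(size=3), 0.1
x = np.array([0.5, -0.2])
label = 1 if linear_discriminant(x, w, w0, centers) > 0 else -1  # decide by sign of y(x)
print(label)
```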
Recap: Gradient Descent
Iterative minimization: start with an initial guess for the parameter values w^(0) and move towards a (local) minimum by following the negative gradient:

    w_{kj}^(τ+1) = w_{kj}^(τ) - η ∂E(w)/∂w_{kj} |_{w^(τ)}

Basic strategies: batch updating (accumulate the gradient over all data points before each step) or sequential updating (one data point at a time, using E(w) = Σ_n E_n(w)).

Example: for the quadratic error function E(w) = Σ_n (y(x_n; w) - t_n)^2, sequential updating leads to the delta rule (= LMS rule):

    w_{kj}^(τ+1) = w_{kj}^(τ) - η (y_k(x_n; w) - t_{kn}) φ_j(x_n) = w_{kj}^(τ) - η δ_{kn} φ_j(x_n),   δ_{kn} = y_k(x_n; w) - t_{kn}

That is, we simply feed back the input data point, weighted by the classification error.

For the case of a differentiable, non-linear activation function g,

    y_k(x) = g(a_k) = g( Σ_{j=0}^{M} w_{kj} φ_j(x) ),

gradient descent (again with the quadratic error function) yields updates of the same form, w_{kj}^(τ+1) = w_{kj}^(τ) - η δ_{kn} φ_j(x_n), now with δ_{kn} = g'(a_k) (y_k(x_n; w) - t_{kn}).

Recap: Classification as Dimensionality Reduction
Interpret linear classification as a projection onto a lower-dimensional space, y = w^T x. The learning problem then becomes: find the projection vector w that maximizes class separation (the same data can yield a bad or a good separation, depending on w). [Image source: C.M. Bishop, 2006]

Recap: Fisher's Linear Discriminant Analysis
Maximize the distance between the classes while minimizing the distance within each class. Criterion:

    J(w) = (w^T S_B w) / (w^T S_W w)

where S_B is the between-class scatter matrix and S_W the within-class scatter matrix. The optimal solution for w can be obtained as

    w ∝ S_W^{-1} (m_2 - m_1)

Classification function: y = w^T x + w_0 with w_0 = -w^T m, i.e., the decision threshold lies at the projected mean m. [Slide adapted from Ales Leonardis]
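The closed-form Fisher solution above translates directly into a few lines of NumPy. A sketch follows; placing the threshold at the projected midpoint of the two class means is one common choice for w_0, and that choice, the function name, and the toy data are ours.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher's linear discriminant: w ~ S_W^{-1} (m2 - m1).
    X1: (N1, D), X2: (N2, D) samples from the two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: summed outer products of the centered samples
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # direction maximizing J(w)
    w0 = -w @ ((m1 + m2) / 2)           # threshold at the projected midpoint of the means
    return w, w0

# Toy usage: two Gaussian blobs in 2-D
rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 0.5, size=(50, 2))
X2 = rng.normal([2, 1], 0.5, size=(50, 2))
w, w0 = fisher_lda(X1, X2)
print((X2 @ w + w0 > 0).mean())  # fraction of class-2 points classified as class 2
```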
Probabilistic Discriminative Models
We have seen that we can write the posterior as

    p(C_1|x) = σ(a) = 1 / (1 + exp(-a))        (logistic sigmoid function)

We can obtain the familiar probabilistic model by setting

    a = ln [ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ]

Or we can use generalized linear discriminant models: a = w^T x or a = w^T φ(x).

In the following, we will consider models of the form

    p(C_1|φ) = y(φ) = σ(w^T φ),   p(C_2|φ) = 1 - p(C_1|φ)

This model is called logistic regression. Why should we do this? What advantage does such a model have compared to modeling the probabilities through Bayes' rule,

    p(C_1|φ) = p(φ|C_1) p(C_1) / ( p(φ|C_1) p(C_1) + p(φ|C_2) p(C_2) )?

Any ideas?

Comparison
Let's look at the number of parameters. Assume we have an M-dimensional feature space φ, and assume we represent the p(φ|C_k) by Gaussians with a shared covariance matrix. How many parameters do we need? 2M for the means and M(M+1)/2 for the covariance; together with the class prior, this gives M(M+5)/2 + 1 parameters. How many parameters do we need for logistic regression? Just the values of w: M parameters. For large M, logistic regression has clear advantages!

Logistic Sigmoid
Properties:
- Definition: σ(a) = 1 / (1 + exp(-a))
- Inverse: a = ln( σ / (1 - σ) )        (the logit function)
- Symmetry: σ(-a) = 1 - σ(a)
- Derivative: dσ/da = σ(1 - σ)

Logistic Regression
Let's consider a data set {φ_n, t_n} with n = 1,...,N, where φ_n = φ(x_n), t_n ∈ {0, 1}, and t = (t_1,...,t_N)^T. With y_n = p(C_1|φ_n) = σ(w^T φ_n), we can write the likelihood as

    p(t|w) = Π_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}

Define the error function as the negative log-likelihood:

    E(w) = -ln p(t|w) = -Σ_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }

This is the so-called cross-entropy error function.

Gradient of the Error Function
Using dy_n/dw = y_n (1 - y_n) φ_n, the gradient is

    ∇E(w) = -Σ_n { t_n y_n(1 - y_n)/y_n - (1 - t_n) y_n(1 - y_n)/(1 - y_n) } φ_n
          = -Σ_n { t_n - t_n y_n - y_n + t_n y_n } φ_n
          = Σ_n (y_n - t_n) φ_n
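Since this derivation is easy to get wrong, a small sketch that checks ∇E(w) = Σ_n (y_n - t_n) φ_n against finite differences may help; all names and the random data below are our own.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n [t_n ln y_n + (1-t_n) ln(1-y_n)], y_n = sigmoid(w^T phi_n)."""
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, Phi, t):
    """grad E(w) = sum_n (y_n - t_n) phi_n = Phi^T (y - t)."""
    return Phi.T @ (sigmoid(Phi @ w) - t)

# Check the analytic gradient against central finite differences
rng = np.random.default_rng(2)
Phi = rng.normal(size=(20, 3))            # design matrix of basis-function outputs
t = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)
eps = 1e-6
g_fd = np.array([(cross_entropy(w + eps * e, Phi, t)
                  - cross_entropy(w - eps * e, Phi, t)) / (2 * eps)
                 for e in np.eye(3)])
print(np.allclose(gradient(w, Phi, t), g_fd, atol=1e-4))  # True: derivation checks out
```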
Gradient of the Error Function (cont'd)
The gradient for logistic regression is

    ∇E(w) = Σ_n (y_n - t_n) φ_n

Does this look familiar to you? It is the same result as for the delta (= LMS) rule:

    w_{kj}^(τ+1) = w_{kj}^(τ) - η (y_k(x_n; w) - t_{kn}) φ_j(x_n)

We can use this to derive a sequential estimation algorithm. However, it will be quite slow...

A More Efficient Iterative Method
Second-order Newton-Raphson gradient descent scheme:

    w^(τ+1) = w^(τ) - H^{-1} ∇E(w)

where H = ∇∇E(w) is the Hessian matrix, i.e., the matrix of second derivatives.
Properties: local quadratic approximation to the log-likelihood; faster convergence.

Newton-Raphson for Least-Squares Estimation
Let's first apply Newton-Raphson to the least-squares error function:

    E(w) = (1/2) Σ_n (w^T φ_n - t_n)^2
    ∇E(w) = Σ_n (w^T φ_n - t_n) φ_n = Φ^T Φ w - Φ^T t
    H = ∇∇E(w) = Σ_n φ_n φ_n^T = Φ^T Φ

where Φ = (φ_1^T; ...; φ_N^T) stacks the feature vectors as rows. The resulting update scheme is

    w^(τ+1) = w^(τ) - (Φ^T Φ)^{-1} (Φ^T Φ w^(τ) - Φ^T t) = (Φ^T Φ)^{-1} Φ^T t

A closed-form solution!

Newton-Raphson for Logistic Regression
Now, let's try Newton-Raphson on the cross-entropy error function:

    E(w) = -Σ_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
    ∇E(w) = Σ_n (y_n - t_n) φ_n = Φ^T (y - t)            (using dy_n/dw = y_n(1 - y_n) φ_n)
    H = ∇∇E(w) = Σ_n y_n (1 - y_n) φ_n φ_n^T = Φ^T R Φ

where R is an N×N diagonal matrix with R_nn = y_n (1 - y_n). The Hessian is no longer constant, but depends on w through the weighting matrix R.

Iteratively Reweighted Least Squares
Update equations:

    w^(τ+1) = w^(τ) - (Φ^T R Φ)^{-1} Φ^T (y - t)
            = (Φ^T R Φ)^{-1} { Φ^T R Φ w^(τ) - Φ^T (y - t) }
            = (Φ^T R Φ)^{-1} Φ^T R z,   with z = Φ w^(τ) - R^{-1} (y - t)

Again a very similar form (normal equations), but now with a non-constant weighting matrix R that depends on w. We therefore need to apply the normal equations iteratively: Iteratively Reweighted Least Squares (IRLS). A short implementation sketch follows after the summary below.

Summary: Logistic Regression
Properties:
- Directly represents the posterior distribution p(C_k|φ); requires fewer parameters than modeling the likelihood and prior. Very often used in statistics.
- It can be shown that the cross-entropy error function is convex, so optimization leads to a unique minimum. But no closed-form solution exists, hence iterative optimization (IRLS); both online and batch optimizations exist.
- There is a multi-class version, described in Bishop, Ch. 4.3.4.
Caveat: logistic regression tends to systematically overestimate odds ratios when the sample size is small.
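A minimal IRLS sketch, assuming the update equations above. The small floor on R_nn and the label noise in the toy data are our own safeguards: on perfectly separable data the logistic-regression weights grow without bound.

```python
import numpy as np

def irls(Phi, t, n_iter=20):
    """Iteratively Reweighted Least Squares for logistic regression.
    Phi: (N, M) design matrix of basis-function outputs; t: (N,) targets in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))   # y_n = sigma(w^T phi_n)
        R = y * (1.0 - y)                    # diagonal entries R_nn of the weighting matrix
        # z = Phi w - R^{-1} (y - t); the floor on R avoids division by zero
        z = Phi @ w - (y - t) / np.maximum(R, 1e-10)
        # Weighted normal equations: (Phi^T R Phi) w = Phi^T R z
        w = np.linalg.solve(Phi.T @ (R[:, None] * Phi), Phi.T @ (R * z))
    return w

# Toy usage; label noise keeps the classes non-separable, so IRLS converges
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
t = (X @ np.array([1.5, -2.0]) + rng.normal(0, 0.5, size=100) > 0).astype(float)
Phi = np.hstack([np.ones((100, 1)), X])      # phi_0(x) = 1 encodes the bias w_0
w = irls(Phi, t)
print(((Phi @ w > 0) == (t > 0.5)).mean())   # training accuracy
```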
A Note on Error Functions
In the following, the error functions are considered as functions of z_n = t_n y(x_n); z_n < 0 corresponds to a misclassification. [Figures: Bishop, 2006]

Ideal misclassification error (the 0/1 loss): this is what we want to approximate. Unfortunately, it is not differentiable, and its gradient is zero for misclassified points, so we cannot minimize it by gradient descent.

Squared error: used in least-squares classification; very popular, since it leads to closed-form solutions. However, it is sensitive to outliers due to the squared penalty, and it penalizes "too correct" data points (points classified correctly with high confidence). It generally does not lead to good classifiers.

Squared error with a sigmoid (tanh) activation function: fixes the problems with outliers and "too correct" data points. But the gradient is zero for confidently misclassified data points. It will give better performance than the original squared error, but still does not fix all problems.

Cross-entropy error: the minimizer of this error is given by the posterior class probabilities. The error function is convex, so a unique minimum exists. It is robust to outliers, since the error increases only roughly linearly for strongly negative z_n. But no closed-form solution exists; it requires iterative estimation.
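These qualitative statements can be checked numerically. Below, each error is written as a function of the margin z = t·y(x) with t ∈ {-1, 1}; expressing the squared error as (z - 1)^2 and the cross-entropy as ln(1 + e^{-z}) is our reparameterization, not the lecture's notation.

```python
import numpy as np

# Evaluate the error functions at a few margins z = t * y(x);
# z < 0 means misclassified, z >> 1 means "too correct".
z = np.array([-3.0, -0.5, 0.5, 1.0, 3.0])

misclass = (z <= 0).astype(float)        # ideal 0/1 misclassification error
squared = (z - 1.0) ** 2                 # squared error, target z = 1
squared_tanh = (np.tanh(z) - 1.0) ** 2   # squared error after tanh activation
cross_ent = np.log(1.0 + np.exp(-z))     # cross-entropy (logistic loss) in margin form

for name, e in [("0/1", misclass), ("squared", squared),
                ("squared+tanh", squared_tanh), ("cross-entropy", cross_ent)]:
    print(f"{name:14s}", np.round(e, 3))
# Note how squared error explodes at z = -3 (outlier) and stays nonzero at z = 3
# ("too correct"), while cross-entropy grows only roughly linearly for z << 0.
```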
Generalization and Overfitting
Goal: predict the class labels of new observations. We train the classification model on a limited training set. The further we optimize the model parameters, the more the training error will decrease. However, at some point the test error will go up again: the model overfits to the training set. [Figure: training error keeps decreasing while test error eventually rises again. Image source: B. Schiele]

Example: Linearly Separable Data
Overfitting is often a problem with linearly separable data. Which of the many possible decision boundaries is correct? All of them have zero error on the training set, but they will most likely result in different predictions on novel test data, i.e., different generalization performance. How do we select the classifier with the best generalization performance?

A Broader View on Statistical Learning
Formal treatment: Statistical Learning Theory. Supervised learning involves:
- Environment: assumed stationary, i.e., the data x have an unknown but fixed probability density p_X(x).
- Teacher: specifies for each data point x the desired classification y (where y may be subject to noise): y = g(x; ν) with noise ν.
- Learning machine: represented by a class of functions that produce for each x an output y: y = f(x; α) with parameters α.

Supervised learning, from the learning machine's view: given training examples {(x_i, y_i)}_{i=1}^{N}, select a specific function f(x; α) such that the desired response y is approximated optimally. Optimality is measured by a loss function L(y, f(x; α)), for example the quadratic loss L(y, f(x; α)) = (y - f(x; α))^2.

Risk
We measure optimality by the risk (= expected loss). The difficulty is: how should the risk be estimated? What we are really interested in is the actual risk (= expected risk):

    R(α) = ∫ L(y, f(x; α)) dP_{X,Y}(x, y)

where P_{X,Y}(x, y) is the probability distribution of (x, y). It is fixed, but typically unknown, so in general we cannot compute the actual risk directly. The expected risk is the expectation of the error on all data, i.e., the expected value of the generalization error.

The practical way out is the empirical risk, measured on the training/validation set:

    R_emp(α) = (1/N) Σ_{i=1}^{N} L(y_i, f(x_i; α))

For the quadratic loss function, for example, R_emp(α) = (1/N) Σ_i (y_i - f(x_i; α))^2.
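A sketch of estimating the empirical risk for the two losses just mentioned; empirical_risk, the candidate predictor f, and the data are hypothetical illustrations, not part of the lecture.

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """R_emp = (1/N) sum_i L(y_i, f(x_i)); f already has its parameters alpha bound."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

quadratic = lambda y, fx: (y - fx) ** 2           # quadratic loss from the slide
zero_one = lambda y, fx: float(np.sign(fx) != y)  # 0/1 misclassification loss

# Toy usage: a fixed linear predictor on noisy random data
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -1.0]) + rng.normal(0, 0.3, size=200))
f = lambda x: x @ np.array([0.9, -1.1])           # some candidate f(x; alpha)
print(empirical_risk(f, X, y, zero_one))          # estimate of the classification risk
```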
Summary: Risk
- Actual risk. Advantage: it is a measure of the generalization ability. Disadvantage: in general, we do not know P_{X,Y}(x, y).
- Empirical risk. Disadvantage: it is no direct measure of the generalization ability. Advantage: it does not depend on P_{X,Y}(x, y), and we typically know learning algorithms that minimize the empirical risk.
There is therefore strong interest in the connection between both types of risk.

Statistical Learning Theory
Idea: compute an upper bound on the actual risk based on the empirical risk,

    R(α) ≤ R_emp(α) + ε(N, p*, h)

where N is the number of training examples, p* is the probability that the bound is correct, and h is the capacity of the learning machine (the "VC dimension"). Side note: this idea of specifying a bound that only holds with a certain probability is explored in a branch of learning theory called Probably Approximately Correct (PAC) learning.

VC Dimension
The Vapnik-Chervonenkis dimension is a measure for the capacity of a learning machine. Formal definition: if a given set of ℓ points can be labeled in all possible 2^ℓ ways, and for each labeling, a member of the set {f(α)} can be found which correctly assigns those labels, we say that the set of points is shattered by the set of functions. The VC dimension of the set of functions {f(α)} is defined as the maximum number of training points that can be shattered by {f(α)}.

Interpretation as a two-player game:
- Opponent's turn: he says a number N.
- Our turn: we specify a set of N points {x_1,...,x_N}.
- Opponent's turn: he gives us a labeling (y_1,...,y_N) ∈ {0,1}^N.
- Our turn: we specify a function f(α) which correctly classifies all N points.
If we can do that for all 2^N possible labelings, then the VC dimension is at least N.

Example: the VC dimension of all oriented lines in R^2 is 3.
1. Shattering 3 points with an oriented line is possible.
2. More difficult to show: it is not possible to shatter 4 points (the XOR configuration).
More generally, the VC dimension of all hyperplanes in R^n is n+1. [Image source: C. Burges, 1998]

Intuitive feeling (unfortunately wrong): the VC dimension has a direct connection with the number of parameters. Counterexample:

    f(x; α) = g(sin(α x)),   g(x) = 1 if x > 0, -1 if x ≤ 0

Just a single parameter α, yet infinite VC dimension. Proof sketch: choose x_i = 10^{-i}, i = 1,...,ℓ; then for any labeling y_i ∈ {-1, 1}, the choice

    α = π ( 1 + Σ_{i=1}^{ℓ} (1 - y_i) 10^i / 2 )

classifies all ℓ points correctly.
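The sin(αx) counterexample can be verified numerically: for ℓ = 5 points, the single parameter α realizes all 2^5 = 32 labelings. The script below follows Burges' construction; the loop structure and bit-twiddling are our own.

```python
import math

# Verify the counterexample: f(x; alpha) = sign(sin(alpha * x)), one parameter,
# shatters the points x_i = 10^{-i}.
ell = 5
xs = [10.0 ** (-i) for i in range(1, ell + 1)]

ok = True
for labels in range(2 ** ell):                 # enumerate all 2^ell labelings
    y = [1 if (labels >> i) & 1 else -1 for i in range(ell)]
    # alpha = pi * (1 + sum_i (1 - y_i) * 10^i / 2), with the sum kept as an exact integer
    S = 1 + sum((1 - y[i - 1]) * 10 ** i // 2 for i in range(1, ell + 1))
    alpha = math.pi * S
    pred = [1 if math.sin(alpha * x) > 0 else -1 for x in xs]
    ok &= (pred == y)
print(ok)  # True: every labeling is realized by a suitable alpha
```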
Upper Bound on the Risk
Important result (Vapnik 1979, 1995): with probability (1 - η), the following bound holds:

    R(α) ≤ R_emp(α) + sqrt( ( h (log(2N/h) + 1) - log(η/4) ) / N )

The second term is called the VC confidence. This bound is independent of P_{X,Y}(x, y)! Typically, we cannot compute the left-hand side (the actual risk), but if we know h (the VC dimension), we can easily compute the risk bound

    R(α) ≤ R_emp(α) + ε(N, p*, h)

Structural Risk Minimization
How can we implement the bound R(α) ≤ R_emp(α) + ε(N, p*, h)?
- Classic approach: keep ε(N, p*, h) constant and minimize R_emp(α). ε(N, p*, h) can be kept constant by controlling the model parameters.
- Support Vector Machines (SVMs): keep R_emp(α) constant and minimize ε(N, p*, h). In fact, R_emp(α) = 0 for separable data. ε(N, p*, h) is controlled by adapting the VC dimension, i.e., by controlling the capacity of the classifier.

References and Further Reading
More information on SVMs can be found in Chapter 7.1 of Bishop's book:
- Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
Additional information about Statistical Learning Theory and a more in-depth introduction to SVMs are available in the following tutorial:
- C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, Vol. 2(2), pp. 121-167, 1998.
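To close, a numeric illustration of the VC confidence term from this lecture: a sketch assuming natural logarithms, as in Burges' tutorial, and an arbitrary choice of η = 0.05.

```python
import numpy as np

def vc_confidence(N, h, eta=0.05):
    """VC confidence term: sqrt((h (ln(2N/h) + 1) - ln(eta/4)) / N).
    The full bound R <= R_emp + vc_confidence(N, h) holds with probability 1 - eta."""
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

# The bound tightens with more data N and loosens with higher capacity h
for N in [1000, 10000, 100000]:
    for h in [10, 100]:
        print(f"N={N:6d}  h={h:3d}  eps={vc_confidence(N, h):.3f}")
```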