Machine Learning Lecture 5
1 Machine Learning, Lecture 5: Linear Discriminant Functions. Bastian Leibe, RWTH Aachen, leibe@vision.rwth-aachen.de
2 Course Outline
Fundamentals: Bayes Decision Theory; Probability Density Estimation
Classification Approaches: Linear Discriminants; Support Vector Machines; Ensemble Methods & Boosting; Randomized Trees, Forests & Ferns
Deep Learning: Foundations; Convolutional Neural Networks; Recurrent Neural Networks
3 Recap: Mixture of Gaussians (MoG)
Generative model:
$p(j) = \pi_j$ (weight of mixture component $j$)
$p(x \mid \theta_j)$ (mixture component)
$p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j)\, p(j)$ (mixture density)
Slide credit: Bernt Schiele
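4 Recap: Estimating MoGs, Iterative Strategy
Assuming we knew the values of the hidden variable, i.e. the assignments $h(j=1 \mid x_n)$ and $h(j=2 \mid x_n)$ are assumed known, each Gaussian can be estimated separately by maximum likelihood (ML):
ML for Gaussian #1: $\mu_1 = \frac{\sum_{n=1}^{N} h(j=1 \mid x_n)\, x_n}{\sum_{n=1}^{N} h(j=1 \mid x_n)}$
ML for Gaussian #2: $\mu_2 = \frac{\sum_{n=1}^{N} h(j=2 \mid x_n)\, x_n}{\sum_{n=1}^{N} h(j=2 \mid x_n)}$
Slide credit: Bernt Schiele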
5 Recap: Estimating MoGs, Iterative Strategy
Assuming we knew the mixture components, i.e. $p(j=1 \mid x)$ and $p(j=2 \mid x)$ are assumed known, we can assign each sample with the Bayes decision rule:
Decide $j = 1$ if $p(j=1 \mid x_n) > p(j=2 \mid x_n)$.
Slide credit: Bernt Schiele
6 Recap: K-Means Clustering
Iterative procedure:
1. Initialization: pick K arbitrary centroids (cluster means).
2. Assign each sample to the closest centroid.
3. Adjust the centroids to be the means of the samples assigned to them.
4. Go to step 2 (until no change).
The algorithm is guaranteed to converge after a finite number of iterations, but only to a local optimum: the final result depends on the initialization.
Slide credit: Bernt Schiele
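The four K-means steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `kmeans`, the `seed` and `max_iter` parameters, and the choice to keep an old centroid when a cluster becomes empty are assumptions of this example, not part of the lecture.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples (hypothetical helper)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: arbitrary initial centroids
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each sample to the closest centroid (squared distance).
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:  # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its assigned samples.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment
```

Running it with different `seed` values on less well-separated data is a simple way to see the dependence on initialization mentioned on the slide.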
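7 Recap: EM Algorithm
Expectation-Maximization (EM) Algorithm
E-Step: softly assign samples to mixture components:
$\gamma_j(x_n) \leftarrow \frac{\pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)} \quad \forall j = 1, \dots, K,\; n = 1, \dots, N$
M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments:
$\hat{N}_j \leftarrow \sum_{n=1}^{N} \gamma_j(x_n)$ (soft number of samples labeled $j$)
$\hat{\pi}_j^{\text{new}} \leftarrow \frac{\hat{N}_j}{N}$
$\hat{\mu}_j^{\text{new}} \leftarrow \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, x_n$
$\hat{\Sigma}_j^{\text{new}} \leftarrow \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, (x_n - \hat{\mu}_j^{\text{new}})(x_n - \hat{\mu}_j^{\text{new}})^T$
Slide adapted from Bernt Schiele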
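As a rough illustration of one E-step followed by one M-step, here is a sketch for the one-dimensional case, where each covariance $\Sigma_j$ reduces to a scalar variance. The names `gaussian` and `em_step` and the small variance floor are assumptions made for this example only.

```python
import math

def gaussian(x, mu, var):
    """Density of a 1-D normal distribution N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(xs, pis, mus, variances, var_floor=1e-6):
    """One EM iteration for a 1-D mixture of Gaussians (hypothetical helper)."""
    K, N = len(pis), len(xs)
    # E-step: responsibilities gamma_j(x_n), normalized over the components.
    gammas = []
    for x in xs:
        weighted = [pis[j] * gaussian(x, mus[j], variances[j]) for j in range(K)]
        total = sum(weighted)
        gammas.append([w / total for w in weighted])
    # M-step: re-estimate each component from its soft assignments.
    new_pis, new_mus, new_vars = [], [], []
    for j in range(K):
        n_j = sum(g[j] for g in gammas)  # soft number of samples labeled j
        mu_j = sum(g[j] * x for g, x in zip(gammas, xs)) / n_j
        var_j = sum(g[j] * (x - mu_j) ** 2 for g, x in zip(gammas, xs)) / n_j
        new_pis.append(n_j / N)
        new_mus.append(mu_j)
        new_vars.append(max(var_j, var_floor))  # assumption: avoid collapse to 0
    return new_pis, new_mus, new_vars
```

Calling `em_step` repeatedly, feeding its output back in as the next input, gives the iterative strategy of the recap slides.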
8 Topics of This Lecture
Linear discriminant functions: Definition; Extension to multiple classes
Least-squares classification: Derivation; Shortcomings
Generalized linear models: Connection to neural networks; Generalized linear discriminants & gradient descent
9 Discriminant Functions
Bayesian Decision Theory: $p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)}$
Model the conditional probability densities $p(x \mid C_k)$ and priors $p(C_k)$.
Compute the posteriors $p(C_k \mid x)$ (using Bayes' rule).
Minimize the probability of misclassification by maximizing $p(C \mid x)$.
New approach:
Directly encode the decision boundary, without explicit modeling of probability densities.
Minimize the misclassification probability directly.
Slide credit: Bernt Schiele
10 Recap: Discriminant Functions
Formulate classification in terms of comparisons:
Discriminant functions $y_1(x), \dots, y_K(x)$
Classify $x$ as class $C_k$ if $y_k(x) > y_j(x) \;\forall j \neq k$
Examples (Bayes Decision Theory):
$y_k(x) = p(C_k \mid x)$
$y_k(x) = p(x \mid C_k)\, p(C_k)$
$y_k(x) = \log p(x \mid C_k) + \log p(C_k)$
Slide credit: Bernt Schiele
11 Discriminant Functions
Example: 2 classes
$y_1(x) > y_2(x) \;\Leftrightarrow\; y_1(x) - y_2(x) > 0 \;\Leftrightarrow\; y(x) > 0$
Decision functions (from Bayes Decision Theory):
$y(x) = p(C_1 \mid x) - p(C_2 \mid x)$
$y(x) = \ln \frac{p(x \mid C_1)}{p(x \mid C_2)} + \ln \frac{p(C_1)}{p(C_2)}$
Slide credit: Bernt Schiele
12 Learning Discriminant Functions
General classification problem:
Goal: take a new input $x$ and assign it to one of $K$ classes $C_k$.
Given: training set $X = \{x_1, \dots, x_N\}$ with target values $T = \{t_1, \dots, t_N\}$.
Learn a discriminant function $y(x)$ to perform the classification.
2-class problem: binary target values, $t_n \in \{0, 1\}$
K-class problem: 1-of-K coding scheme, e.g. $t_n = (0, 1, 0, 0, 0)^T$
13 Linear Discriminant Functions
2-class problem: if $y(x) > 0$, decide for class $C_1$, else for class $C_2$.
In the following, we focus on linear discriminant functions
$y(x) = w^T x + w_0$
where $w$ is the weight vector and $w_0$ is the "bias" (= threshold).
If a data set can be perfectly classified by a linear discriminant, then we call it linearly separable.
Slide credit: Bernt Schiele
14 Linear Discriminant Functions
Decision boundary: $y(x) = 0$ defines a hyperplane, with $y(x) = w^T x + w_0$.
Normal vector: $w$
Offset: $\frac{-w_0}{\|w\|}$
The hyperplane $y = 0$ separates the region $y > 0$ from the region $y < 0$.
Slide credit: Bernt Schiele
15 Linear Discriminant Functions
Notation:
$D$: number of dimensions
$x = (x_1, \dots, x_D)^T$, $w = (w_1, \dots, w_D)^T$
$y(x) = w^T x + w_0 = \sum_{i=1}^{D} w_i x_i + w_0 = \sum_{i=0}^{D} w_i x_i$ with $x_0 = 1$ constant
Slide credit: Bernt Schiele
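To make the notation concrete, the following sketch evaluates $y(x)$ both in the form $w^T x + w_0$ and in the augmented form with $x_0 = 1$, and computes the offset of the hyperplane from the origin. The function names are illustrative choices for this example.

```python
import math

def discriminant(w, w0, x):
    """y(x) = w^T x + w_0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def discriminant_augmented(w_tilde, x):
    """Same value, with the bias folded in: w_tilde = [w_0, w_1, ..., w_D]."""
    x_tilde = [1.0] + list(x)  # x_0 = 1 constant
    return sum(wi * xi for wi, xi in zip(w_tilde, x_tilde))

def offset_from_origin(w, w0):
    """Signed distance -w_0 / ||w|| of the hyperplane y(x) = 0 from the origin."""
    return -w0 / math.sqrt(sum(wi * wi for wi in w))

def decide(w, w0, x):
    """Return 1 for class C_1 if y(x) > 0, else 2 for class C_2."""
    return 1 if discriminant(w, w0, x) > 0 else 2
```

For example, with `w = [3.0, 4.0]` and `w0 = -5.0`, the hyperplane lies at distance 1 from the origin along the direction of `w`.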
16 Extension to Multiple Classes
Two simple strategies: one-vs-all classifiers and one-vs-one classifiers.
How many classifiers do we need in both cases? What difficulties do you see for those strategies?
Image source: C.M. Bishop, 2006
17 Extension to Multiple Classes
Problem: both strategies result in regions for which the pure classification result ($y_k > 0$) is ambiguous. In the one-vs-all case, it is still possible to classify those inputs based on the continuous classifier outputs, $y_k > y_j \;\forall j \neq k$.
Solution: we can avoid those difficulties by taking $K$ linear functions of the form $y_k(x) = w_k^T x + w_{k0}$ and defining the decision boundaries directly by deciding for $C_k$ iff $y_k > y_j \;\forall j \neq k$.
This corresponds to a 1-of-K coding scheme, $t_n = (0, 1, 0, \dots, 0, 0)^T$.
Image source: C.M. Bishop, 2006
18 Extension to Multiple Classes
K-class discriminant: combination of $K$ linear functions $y_k(x) = w_k^T x + w_{k0}$.
Resulting decision hyperplanes: $(w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0$
It can be shown that the decision regions of such a discriminant are always singly connected and convex.
This makes linear discriminant models particularly suitable for problems for which the class-conditional densities $p(x \mid \omega_i)$ are unimodal.
Image source: C.M. Bishop, 2006
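The K-class decision rule can be sketched as an arg-max over the $K$ linear functions. The helper names `scores`, `classify`, and `one_hot` are assumptions of this example; ties are broken in favor of the lowest class index, which the slides do not specify.

```python
def scores(W, biases, x):
    """y_k(x) = w_k^T x + w_k0 for every class k (W holds one weight list per class)."""
    return [sum(wi * xi for wi, xi in zip(w_k, x)) + b_k
            for w_k, b_k in zip(W, biases)]

def classify(W, biases, x):
    """Decide for the class index k whose y_k(x) is largest."""
    y = scores(W, biases, x)
    return max(range(len(y)), key=lambda k: y[k])

def one_hot(k, K):
    """1-of-K target vector for class index k."""
    return [1 if i == k else 0 for i in range(K)]
```

Because every input receives exactly one arg-max label, the ambiguous regions of the one-vs-all and one-vs-one constructions do not arise.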
20 General Classification Problem
Let's consider $K$ classes described by linear models
$y_k(x) = w_k^T x + w_{k0}, \quad k = 1, \dots, K$
We can group those together using vector notation,
$y(x) = \widetilde{W}^T \tilde{x}$
where
$\widetilde{W} = [\tilde{w}_1, \dots, \tilde{w}_K] = \begin{pmatrix} w_{10} & \dots & w_{K0} \\ w_{11} & \dots & w_{K1} \\ \vdots & \ddots & \vdots \\ w_{1D} & \dots & w_{KD} \end{pmatrix}$
The output will again be in 1-of-K notation. We can directly compare it to the target value $t = [t_1, \dots, t_K]^T$.
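21 General Classification Problem
For the entire dataset, we can write
$Y(\widetilde{X}) = \widetilde{X} \widetilde{W}$
and compare this to the target matrix $T$, where
$\widetilde{W} = [\tilde{w}_1, \dots, \tilde{w}_K], \quad \widetilde{X} = \begin{pmatrix} \tilde{x}_1^T \\ \vdots \\ \tilde{x}_N^T \end{pmatrix}, \quad T = \begin{pmatrix} t_1^T \\ \vdots \\ t_N^T \end{pmatrix}$
Result of the comparison: $\widetilde{X} \widetilde{W} - T$
Goal: choose $\widetilde{W}$ such that this is minimal!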
22 Least-Squares Classification
Simplest approach: directly try to minimize the sum-of-squares error.
We could write this as
$E(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( y_k(x_n; w) - t_{kn} \right)^2$
But let's stick with the matrix notation for now. (The result will be simpler to express and we'll learn some nice matrix algebra rules along the way.)
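23 Least-Squares Classification
Multi-class case: let's formulate the sum-of-squares error in matrix notation,
$E_D(\widetilde{W}) = \frac{1}{2} \mathrm{Tr}\left\{ (\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T) \right\}$
using $\sum_{i,j} a_{ij}^2 = \mathrm{Tr}\{A^T A\}$.
Taking the derivative (chain rule, together with $\frac{\partial}{\partial A} \mathrm{Tr}\{A\} = I$) yields
$\frac{\partial}{\partial \widetilde{W}} E_D(\widetilde{W}) = \frac{1}{2} \frac{\partial}{\partial \widetilde{W}} \mathrm{Tr}\left\{ (\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T) \right\} = \widetilde{X}^T (\widetilde{X}\widetilde{W} - T)$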
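24 Least-Squares Classification
Minimizing the sum-of-squares error: set the derivative to zero,
$\frac{\partial}{\partial \widetilde{W}} E_D(\widetilde{W}) = \widetilde{X}^T (\widetilde{X}\widetilde{W} - T) \overset{!}{=} 0$
$\widetilde{X}^T \widetilde{X} \widetilde{W} = \widetilde{X}^T T$
$\widetilde{W} = (\widetilde{X}^T \widetilde{X})^{-1} \widetilde{X}^T T = \widetilde{X}^{\dagger} T$
where $\widetilde{X}^{\dagger}$ is the "pseudo-inverse" of $\widetilde{X}$.
We then obtain the discriminant function as
$y(x) = \widetilde{W}^T \tilde{x} = T^T \left( \widetilde{X}^{\dagger} \right)^T \tilde{x}$
This is an exact, closed-form solution for the discriminant function parameters.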
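The closed-form solution can be tried out numerically. The sketch below assumes NumPy is available and uses `numpy.linalg.pinv` for the pseudo-inverse; the function names `augment`, `fit_least_squares`, and `predict` are choices made for this example.

```python
import numpy as np

def augment(X):
    """Prepend the constant x_0 = 1 to every row of X (shape N x D)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_least_squares(X, T):
    """Return W_tilde = pinv(X_tilde) @ T, shape (D+1) x K, for 1-of-K targets T."""
    return np.linalg.pinv(augment(X)) @ T

def predict(W_tilde, X):
    """Assign each row of X to the class with the largest output y_k(x)."""
    return np.argmax(augment(X) @ W_tilde, axis=1)
```

A possible usage: build `T = np.eye(K)[labels]` from integer class labels, call `fit_least_squares(X, T)`, and compare `predict(W_tilde, X)` with `labels`.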
25 Problems with Least Squares
Least-squares is very sensitive to outliers! The error function penalizes predictions that are "too correct".
Image source: C.M. Bishop, 2006
26 Problems with Least-Squares
Another example: 3 classes (red, green, blue). The problem is linearly separable, yet in the least-squares solution most green points are misclassified!
Deeper reason for the failure: least-squares corresponds to Maximum Likelihood under the assumption of a Gaussian conditional distribution. However, our binary target vectors have a distribution that is clearly non-Gaussian! Least-squares is the wrong probabilistic tool in this case!
Image source: C.M. Bishop, 2006
28 Generalized Linear Models
Linear model: $y(x) = w^T x + w_0$
Generalized linear model: $y(x) = g(w^T x + w_0)$
$g(\cdot)$ is called an activation function and may be nonlinear.
The decision surfaces correspond to $y(x) = \text{const.} \;\Leftrightarrow\; w^T x + w_0 = \text{const.}$
If $g$ is monotonic (which is typically the case), the resulting decision boundaries are still linear functions of $x$.
29 Generalized Linear Models
Consider 2 classes:
$p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)} = \frac{1}{1 + \frac{p(x \mid C_2)\, p(C_2)}{p(x \mid C_1)\, p(C_1)}} = \frac{1}{1 + \exp(-a)} \equiv g(a)$
with $a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}$
Slide credit: Bernt Schiele
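30 Logistic Sigmoid Activation Function
$g(a) \equiv \frac{1}{1 + \exp(-a)}$
Example: normal distributions with identical covariance.
[Figure: class-conditional densities $p(x \mid a)$, $p(x \mid b)$ and the corresponding posteriors $p(a \mid x)$, $p(b \mid x)$, plotted over $x$]
Slide credit: Bernt Schiele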
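The identity between the two-class posterior and the logistic sigmoid of the log-odds $a$ can be checked numerically. The sketch below assumes two one-dimensional normal class-conditional densities with a shared variance; all function and parameter names are illustrative.

```python
import math

def sigmoid(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def gaussian(x, mu, var):
    """Density of a 1-D normal distribution N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_c1(x, mu1, mu2, var, prior1):
    """p(C_1 | x) computed directly from Bayes' rule."""
    joint1 = gaussian(x, mu1, var) * prior1
    joint2 = gaussian(x, mu2, var) * (1.0 - prior1)
    return joint1 / (joint1 + joint2)

def log_odds(x, mu1, mu2, var, prior1):
    """a = ln [ p(x|C_1) p(C_1) / (p(x|C_2) p(C_2)) ]."""
    joint1 = gaussian(x, mu1, var) * prior1
    joint2 = gaussian(x, mu2, var) * (1.0 - prior1)
    return math.log(joint1 / joint2)
```

With identical variances the quadratic terms in $x$ cancel in $a$, so $a$ is a linear function of $x$; for means $\pm 1$, unit variance, and equal priors it works out to $a = 2x$.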
31 Normalized Exponential
General case of $K > 2$ classes:
$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
with $a_k = \ln p(x \mid C_k)\, p(C_k)$
This is known as the normalized exponential or softmax function. It can be regarded as a multiclass generalization of the logistic sigmoid.
Slide credit: Bernt Schiele
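A minimal sketch of the normalized exponential follows. Subtracting the maximum activation before exponentiating is a common numerical-stability measure added here as an assumption; it does not change the result, because the common factor cancels in the ratio.

```python
import math

def softmax(a):
    """Normalized exponential: exp(a_k) / sum_j exp(a_j) for a list of activations."""
    m = max(a)  # assumption: shift by the maximum to avoid overflow in exp
    exps = [math.exp(ak - m) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]
```

Feeding in $a_k = \ln p(x \mid C_k)\, p(C_k)$ returns the posteriors $p(C_k \mid x)$, and for two classes `softmax([a, 0.0])[0]` coincides with the logistic sigmoid of `a`.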
32 Relationship to Neural Networks
2-class case, with $x_0 = 1$ constant.
Neural network ("single-layer perceptron")
Slide credit: Bernt Schiele
33 Relationship to Neural Networks
Multi-class case, with $x_0 = 1$ constant.
Multi-class perceptron
Slide credit: Bernt Schiele
34 Logistic Discrimination
If we use the logistic sigmoid activation function
$g(a) \equiv \frac{1}{1 + \exp(-a)}, \quad y(x) = g(w^T x + w_0)$
then we can interpret the $y(x)$ as posterior probabilities!
Slide adapted from Bernt Schiele
35 Other Motivation for Nonlinearity
Recall least-squares classification: one of the problems was that data points that are "too correct" have a strong influence on the decision surface under a squared-error criterion,
$E(w) = \sum_{n=1}^{N} \left( y(x_n; w) - t_n \right)^2$
Reason: the output of $y(x_n; w)$ can grow arbitrarily large for some $x_n$:
$y(x; w) = w^T x + w_0$
By choosing a suitable nonlinearity (e.g. a sigmoid), we can limit those influences:
$y(x; w) = g(w^T x + w_0)$
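A small numeric comparison illustrates the point. Take a sample with target $t = 1$ that lies far on the correct side of the boundary, so that $w^T x + w_0 = 10$. The helper names below are assumptions of this sketch.

```python
import math

def sigmoid(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def squared_error_linear(a, t):
    """Per-sample squared error when the output is the raw activation a."""
    return (a - t) ** 2

def squared_error_sigmoid(a, t):
    """Per-sample squared error when the output is squashed by the sigmoid."""
    return (sigmoid(a) - t) ** 2
```

With `a = 10.0` and `t = 1.0`, the linear output is charged an error of 81 even though the sample is classified correctly, whereas the sigmoid output is charged almost nothing. Since the sigmoid output and the target both lie in $[0, 1]$, no single sample can contribute more than 1.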
36 Discussion: Generalized Linear Models
Advantages:
The nonlinearity gives us more flexibility.
It can be used to limit the effect of outliers.
The choice of a sigmoid leads to a nice probabilistic interpretation.
Disadvantage:
Least-squares minimization in general no longer leads to a closed-form analytical solution. We need to apply iterative methods, e.g. gradient descent.
37 Linear Separability
Up to now: restrictive assumption. We only consider linear decision boundaries.
Classical counterexample: XOR
[Figure: XOR configuration in the $(x_1, x_2)$ plane]
Slide credit: Bernt Schiele
38 Generalized Linear Discriminants
Generalization: transform the vector $x$ with $M$ nonlinear basis functions $\phi_j(x)$:
$y_k(x) = \sum_{j=1}^{M} w_{kj}\, \phi_j(x) + w_{k0}$
Purpose of the basis functions $\phi_j(x)$:
They allow non-linear decision boundaries.
By choosing the right $\phi_j$, every continuous function can (in principle) be approximated with arbitrary accuracy.
Notation: $y_k(x) = \sum_{j=0}^{M} w_{kj}\, \phi_j(x)$ with $\phi_0(x) = 1$
Slide credit: Bernt Schiele
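As a small illustration of how basis functions enable non-linear decision boundaries, the sketch below solves the XOR configuration with a discriminant that is linear in the transformed features. The particular basis ($\phi_1 = x_1$, $\phi_2 = x_2$, $\phi_3 = x_1 x_2$) and the hand-picked weights are assumptions chosen for this example, not taken from the lecture.

```python
def basis(x):
    """Hypothetical basis: phi_0 = 1, phi_1 = x1, phi_2 = x2, phi_3 = x1 * x2."""
    x1, x2 = x
    return [1.0, x1, x2, x1 * x2]

# Hand-picked weights (w_0, w_1, w_2, w_3) for a single output.
WEIGHTS = [-0.5, 1.0, 1.0, -2.0]

def y(x):
    """y(x) = sum_j w_j * phi_j(x): linear in phi(x), non-linear in x."""
    return sum(w * p for w, p in zip(WEIGHTS, basis(x)))

def classify(x):
    """Return 1 if y(x) > 0, else 0."""
    return 1 if y(x) > 0 else 0
```

Here $y(x) = x_1 + x_2 - 2 x_1 x_2 - 0.5$ is positive exactly on the inputs $(0, 1)$ and $(1, 0)$, which no discriminant that is linear in $x$ alone can achieve.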
39 Generalized Linear Discriminants
Model: $y_k(x) = \sum_{j=0}^{M} w_{kj}\, \phi_j(x) = y_k(x; w)$, with $K$ functions (outputs) $y_k(x; w)$.
Learning in Neural Networks:
Single-layer networks: the $\phi_j$ are fixed, only the weights $w$ are learned.
Multi-layer networks: both the $w$ and the $\phi_j$ are learned.
We will take a closer look at neural networks from lecture 11 on. For now, let's first consider generalized linear discriminants in general.
Slide credit: Bernt Schiele
40 Gradient Descent
Learning the weights $w$:
$N$ training data points: $X = \{x_1, \dots, x_N\}$
$K$ outputs of decision functions: $y_k(x_n; w)$
Target vector for each data point: $T = \{t_1, \dots, t_N\}$
Error function (least-squares error) of the linear model:
$E(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( y_k(x_n; w) - t_{kn} \right)^2 = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{j=1}^{M} w_{kj}\, \phi_j(x_n) - t_{kn} \right)^2$
Slide credit: Bernt Schiele
41 Gradient Descent
Problem: the error function can in general no longer be minimized in closed form.
Idea (Gradient Descent): iterative minimization.
Start with an initial guess for the parameter values, $w_{kj}^{(0)}$.
Move towards a (local) minimum by following the gradient:
$w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left. \frac{\partial E(w)}{\partial w_{kj}} \right|_{w^{(\tau)}}$
where $\eta$ is the learning rate.
This simple scheme corresponds to a first-order Taylor expansion. (There are more complex procedures available.)
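42 Gradient Descent: Basic Strategies
"Batch learning":
$w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left. \frac{\partial E(w)}{\partial w_{kj}} \right|_{w^{(\tau)}}$
where $\eta$ is the learning rate.
Compute the gradient $\frac{\partial E(w)}{\partial w_{kj}}$ based on all training data.
Slide credit: Bernt Schiele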
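Batch learning for the least-squares error of a linear model can be sketched as follows. The weights are stored as a list of $K$ rows with $M+1$ entries each, and every feature list `phi` starts with the constant $\phi_0 = 1$; the function names, the zero initialization, and the default `eta` and `steps` values are assumptions of this example.

```python
def predict(W, phi):
    """y_k = sum_j w_kj * phi_j for every output k (phi[0] is the constant 1)."""
    return [sum(w * p for w, p in zip(row, phi)) for row in W]

def batch_gradient(W, Phi, T):
    """Gradient of E(w) = 1/2 sum_n sum_k (y_k(x_n) - t_kn)^2 over all data."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for phi, t in zip(Phi, T):
        y = predict(W, phi)
        for k in range(len(W)):
            for j in range(len(phi)):
                grad[k][j] += (y[k] - t[k]) * phi[j]
    return grad

def batch_gradient_descent(Phi, T, eta=0.5, steps=200):
    """Repeat w <- w - eta * dE/dw, starting from all-zero weights."""
    K, M1 = len(T[0]), len(Phi[0])
    W = [[0.0] * M1 for _ in range(K)]
    for _ in range(steps):
        grad = batch_gradient(W, Phi, T)
        W = [[W[k][j] - eta * grad[k][j] for j in range(M1)] for k in range(K)]
    return W
```

The learning rate has to be small enough for the iteration to converge; the default used here is only suitable for tiny, well-scaled examples.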
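43 Gradient Descent: Basic Strategies
"Sequential updating": with $E(w) = \sum_{n=1}^{N} E_n(w)$,
$w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left. \frac{\partial E_n(w)}{\partial w_{kj}} \right|_{w^{(\tau)}}$
where $\eta$ is the learning rate.
Compute the gradient $\frac{\partial E_n(w)}{\partial w_{kj}}$ based on a single data point at a time.
Slide credit: Bernt Schiele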
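44 Gradient Descent
Error function:
$E(w) = \sum_{n=1}^{N} E_n(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{j=1}^{M} w_{kj}\, \phi_j(x_n) - t_{kn} \right)^2$
$E_n(w) = \frac{1}{2} \sum_{k=1}^{K} \left( \sum_{j=1}^{M} w_{kj}\, \phi_j(x_n) - t_{kn} \right)^2$
$\frac{\partial E_n(w)}{\partial w_{kj}} = \left( \sum_{\tilde{j}=1}^{M} w_{k\tilde{j}}\, \phi_{\tilde{j}}(x_n) - t_{kn} \right) \phi_j(x_n) = \left( y_k(x_n; w) - t_{kn} \right) \phi_j(x_n)$
Slide credit: Bernt Schiele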
45 Gradient Descent
Delta rule (= LMS rule):
$w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left( y_k(x_n; w) - t_{kn} \right) \phi_j(x_n) = w_{kj}^{(\tau)} - \eta\, \delta_{kn}\, \phi_j(x_n)$
where $\delta_{kn} = y_k(x_n; w) - t_{kn}$
Simply feed back the input data point, weighted by the classification error.
Slide credit: Bernt Schiele
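The delta rule translates almost literally into code. The sketch below updates the weights in place for one data point at a time; the function names, the zero initialization, and the fixed pass order over the data are assumptions of this example.

```python
def delta_rule_update(W, phi, t, eta):
    """One sequential update: w_kj <- w_kj - eta * delta_kn * phi_j(x_n)."""
    y = [sum(w * p for w, p in zip(row, phi)) for row in W]
    for k in range(len(W)):
        delta = y[k] - t[k]  # delta_kn = y_k(x_n; w) - t_kn
        for j in range(len(phi)):
            W[k][j] -= eta * delta * phi[j]

def train_sequential(Phi, T, eta=0.5, epochs=200):
    """Sweep over the data repeatedly, applying the delta rule to each point."""
    W = [[0.0] * len(Phi[0]) for _ in range(len(T[0]))]
    for _ in range(epochs):
        for phi, t in zip(Phi, T):
            delta_rule_update(W, phi, t, eta)
    return W
```

Compared with batch learning, each step only needs a single data point, which is why this scheme is also used when data arrive as a stream.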
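46 Gradient Descent
Cases with a differentiable, non-linear activation function:
$y_k(x) = g(a_k) = g\left( \sum_{j=0}^{M} w_{kj}\, \phi_j(x_n) \right)$
Gradient descent:
$\frac{\partial E_n(w)}{\partial w_{kj}} = \frac{\partial g(a_k)}{\partial a_k} \left( y_k(x_n; w) - t_{kn} \right) \phi_j(x_n)$
$w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\, \delta_{kn}\, \phi_j(x_n)$
$\delta_{kn} = \frac{\partial g(a_k)}{\partial a_k} \left( y_k(x_n; w) - t_{kn} \right)$
Slide credit: Bernt Schiele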
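For the logistic sigmoid, the derivative has the well-known form $g'(a) = g(a)\,(1 - g(a))$, so the modified delta rule is easy to write down. The sketch below treats a single output with the per-sample error $E_n = \frac{1}{2}(y - t)^2$; the function names are assumptions of this example.

```python
import math

def sigmoid(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def error_single(w, phi, t):
    """E_n(w) = 1/2 * (g(w^T phi) - t)^2 for one data point and one output."""
    y = sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))
    return 0.5 * (y - t) ** 2

def gradient_single(w, phi, t):
    """dE_n/dw_j = g'(a) * (y - t) * phi_j, using g'(a) = y * (1 - y)."""
    y = sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))
    delta = y * (1.0 - y) * (y - t)
    return [delta * pi for pi in phi]

def sgd_step(w, phi, t, eta):
    """One sequential update: w_j <- w_j - eta * delta * phi_j."""
    grad = gradient_single(w, phi, t)
    return [wi - eta * gi for wi, gi in zip(w, grad)]
```

Comparing `gradient_single` against a finite-difference approximation of `error_single` is a quick sanity check on the derivation.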
47 Summary: Generalized Linear Discriminants
Properties:
General class of decision functions.
The nonlinearity $g(\cdot)$ and the basis functions $\phi_j$ allow us to address linearly non-separable problems.
We have shown a simple sequential learning approach for parameter estimation using gradient descent.
Better second-order gradient descent approaches are available (e.g. Newton-Raphson).
Limitations / Caveats:
The flexibility of the model is limited by the curse of dimensionality: $g(\cdot)$ and $\phi_j$ often introduce additional parameters, and models are either limited to a lower-dimensional input space or need to share parameters.
The linearly separable case often leads to overfitting: several possible parameter choices minimize the training error.
48 References and Further Reading
More information on Linear Discriminant Functions can be found in Chapter 4 of Bishop's book (in particular Chapter 4.1).
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer,
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationLecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016
Lecture 7 Logistic Regression Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 11, 2016 Luigi Freda ( La Sapienza University) Lecture 7 December 11, 2016 1 / 39 Outline 1 Intro Logistic
More informationMulticlass Logistic Regression
Multiclass Logistic Regression Sargur. Srihari University at Buffalo, State University of ew York USA Machine Learning Srihari Topics in Linear Classification using Probabilistic Discriminative Models
More informationMultivariate statistical methods and data mining in particle physics
Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general
More informationMachine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationMidterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.
CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic
More informationCSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression
CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html
More informationFeed-forward Network Functions
Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationFINAL: CS 6375 (Machine Learning) Fall 2014
FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationLearning with Noisy Labels. Kate Niehaus Reading group 11-Feb-2014
Learning with Noisy Labels Kate Niehaus Reading group 11-Feb-2014 Outline Motivations Generative model approach: Lawrence, N. & Scho lkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of
More informationThe classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know
The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is
More informationThe classifier. Linear discriminant analysis (LDA) Example. Challenges for LDA
The Bayes classifier Linear discriminant analysis (LDA) Theorem The classifier satisfies In linear discriminant analysis (LDA), we make the (strong) assumption that where the min is over all possible classifiers.
More informationLinear Discrimination Functions
Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach
More informationCOMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017
COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SOFT CLUSTERING VS HARD CLUSTERING
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationy(x n, w) t n 2. (1)
Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising a set of input vector {x n }, n = 1,...N,
More informationNeural networks and support vector machines
Neural netorks and support vector machines Perceptron Input x 1 Weights 1 x 2 x 3... x D 2 3 D Output: sgn( x + b) Can incorporate bias as component of the eight vector by alays including a feature ith
More information4. Multilayer Perceptrons
4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output
More informationNeural Networks. Nicholas Ruozzi University of Texas at Dallas
Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify
More informationLogistic Regression. Seungjin Choi
Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationMachine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall
Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationNeural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17
3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationSGN (4 cr) Chapter 5
SGN-41006 (4 cr) Chapter 5 Linear Discriminant Analysis Jussi Tohka & Jari Niemi Department of Signal Processing Tampere University of Technology January 21, 2014 J. Tohka & J. Niemi (TUT-SGN) SGN-41006
More information