Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers


Phil Woodland: Michaelmas 2012

Introduction

Previously, generative models with Gaussian and GMM class-conditional PDFs were discussed. For Gaussian PDFs with common covariance matrices, the decision boundary is linear. The linear decision boundary from the generative model was briefly compared with a discriminative model, logistic regression.

An alternative to modelling the class-conditional PDFs is to select a functional form for a discriminant function and construct a mapping from the observation to the class directly.

Here we will concentrate on the construction of linear classifiers; however, it is also possible to construct quadratic or other non-linear decision boundaries (this is how some types of neural networks operate).

There are several methods of classifier construction for a linear discriminant function. The following schemes will be examined:
- Iterative solution for the weights via the perceptron algorithm, which directly minimises the number of misclassifications
- Fisher linear discriminant, which directly maximises a measure of class separability

Simple Vowel Classifier

Select two vowels to classify with a linear decision boundary.

[Figure: linear classifier separating the two vowels, plotted as Formant Two against Formant One.]

Most of the data is correctly classified, but classification errors occur:
- the realisation of vowels varies from speaker to speaker
- the same vowel varies for the same speaker as well (vowel context etc.)

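Single Layer Perceptron

[Figure: single layer perceptron. The inputs $x_1, x_2, \ldots, x_d$ are weighted by $w_1, w_2, \ldots, w_d$ and summed together with the bias $w_0$ to give $z$, which is passed through the activation function to give $y(\mathbf{x})$.]

Typically uses a threshold activation function. The d-dimensional input vector $\mathbf{x}$ and scalar value $z$ are related by

$$z = \mathbf{w}^{\top}\mathbf{x} + w_0$$

$z$ is then fed to the activation function to yield $y(\mathbf{x})$. The parameters of this system are
- weights: $\mathbf{w} = [w_1, \ldots, w_d]^{\top}$, selects the direction of the decision boundary
- bias: $w_0$, sets the position of the decision boundary.

Parameters are often combined into a single composite vector, $\tilde{\mathbf{w}}$, and an extended input vector, $\tilde{\mathbf{x}}$:

$$\tilde{\mathbf{w}} = \begin{bmatrix} w_1 \\ \vdots \\ w_d \\ w_0 \end{bmatrix}; \qquad \tilde{\mathbf{x}} = \begin{bmatrix} x_1 \\ \vdots \\ x_d \\ 1 \end{bmatrix}$$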

Single Layer Perceptron (cont)

Using the composite weight vector $\tilde{\mathbf{w}}$ and the extended input vector $\tilde{\mathbf{x}}$ we can then write

$$z = \tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}$$

The task is to train the set of model parameters $\tilde{\mathbf{w}}$. For this example a decision boundary is placed at $z = 0$. The decision rule is

$$y(\mathbf{x}) = \begin{cases} 1, & z \geq 0 \\ -1, & z < 0 \end{cases}$$

If the training data is linearly separable in the d-dimensional space then, using an appropriate training algorithm, perfect classification (on the training data at least!) can be achieved.

Is the solution unique? The precise solution depends on the training criterion/algorithm used.
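The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function name `perceptron_output`, the weights `w_demo` and the two test points are hypothetical, and the outputs are assumed to be +1 and -1 on either side of $z = 0$.

```python
def perceptron_output(w_ext, x):
    """Threshold perceptron on an extended input.

    w_ext = [w_1, ..., w_d, w_0] is the composite weight vector;
    x = [x_1, ..., x_d] is the raw observation.
    """
    x_ext = list(x) + [1.0]                            # append 1 for the bias term
    z = sum(wi * xi for wi, xi in zip(w_ext, x_ext))   # z = w~^T x~
    return 1 if z >= 0 else -1                         # decision boundary at z = 0


# Hypothetical weights: the decision boundary is the line x_2 = 0.5
w_demo = [0.0, 1.0, -0.5]
above = perceptron_output(w_demo, [0.2, 0.9])   # point above the boundary
below = perceptron_output(w_demo, [0.2, 0.1])   # point below the boundary
print(above, below)
```

Appending the constant 1 to the input is what lets the bias be treated as just another weight.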

Parameter Optimisation

First a cost or loss function of the weights is defined, $E(\tilde{\mathbf{w}})$. A learning process which minimises the cost function is then used. One basic procedure is gradient descent:

1. Start with some initial estimate $\tilde{\mathbf{w}}[0]$, $\tau = 0$.
2. Compute the gradient $\nabla E(\tilde{\mathbf{w}})\big|_{\tilde{\mathbf{w}}[\tau]}$.
3. Update the weights by moving a small distance in the steepest downhill direction, to give the estimate of the weights at iteration $\tau + 1$, $\tilde{\mathbf{w}}[\tau+1]$:

$$\tilde{\mathbf{w}}[\tau+1] = \tilde{\mathbf{w}}[\tau] - \eta\, \nabla E(\tilde{\mathbf{w}})\big|_{\tilde{\mathbf{w}}[\tau]}$$

   Set $\tau = \tau + 1$.
4. Repeat steps (2) and (3) until convergence, or until the optimisation criterion is satisfied.

If using gradient descent, the cost function $E(\tilde{\mathbf{w}})$ must be differentiable (and hence continuous). This means mis-classification cannot be used as the cost function for gradient descent schemes. Gradient descent is not usually guaranteed to decrease the cost function at every step (for example, when the step size is too large).
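The four steps can be sketched as follows. The cost function here is a hypothetical quadratic chosen only because its minimum, at (3, -1), is known; the stopping test on the gradient norm is one possible convergence criterion, not one prescribed by the lecture.

```python
def gradient_descent(grad, w0, eta=0.1, tol=1e-8, max_iter=10000):
    """Plain gradient descent; returns the final weights and iteration count."""
    w = list(w0)                                        # step 1: initial estimate
    for tau in range(max_iter):
        g = grad(w)                                     # step 2: gradient at w[tau]
        if sum(gi * gi for gi in g) ** 0.5 < tol:       # assumed stopping criterion
            return w, tau
        w = [wi - eta * gi for wi, gi in zip(w, g)]     # step 3: downhill update
    return w, max_iter                                  # step 4 ran out of iterations


# Hypothetical cost E(w) = (w_1 - 3)^2 + 2 (w_2 + 1)^2
def grad_E(w):
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


w_hat, n_iter = gradient_descent(grad_E, [0.0, 0.0])
print(w_hat, n_iter)
```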

Choice of η

In the previous slide the learning rate term η was used in the optimisation scheme.

[Figure: cost E against the parameter, showing slow descent for a small step size, divergence for a step size that is too large, and the desired minimum.]

η is positive. When setting η we need to consider:
- if η is too small, convergence is slow;
- if η is too large, we may overshoot the solution and diverge.

Later in the course we will examine improvements for gradient descent schemes. Some of these give automated techniques for setting η.
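The effect of η can be seen on a one-dimensional toy cost. The sketch below assumes $E(w) = w^2$ (a hypothetical example, not from the slides), for which each update multiplies $w$ by $(1 - 2\eta)$.

```python
def run(eta, w=1.0, steps=20):
    """Apply `steps` gradient descent updates to E(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w


slow = run(0.01)       # factor 0.98 per step: still far from the minimum at 0
good = run(0.4)        # factor 0.2 per step: converges quickly
diverged = run(1.1)    # factor -1.2 per step: overshoots and grows in magnitude
print(slow, good, diverged)
```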

Local Minima

Another problem with gradient descent is local minima. At all maxima and minima $\nabla E(\tilde{\mathbf{w}}) = 0$, so
- gradient descent stops at all maxima/minima
- the estimated parameters will depend on the starting point
- there is a single solution if the function is convex

[Figure: criterion plotted against θ, showing the global minimum and local minima.]

Expectation-Maximisation, for example, is also only guaranteed to find a local maximum of the likelihood function.
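The dependence on the starting point can be illustrated with a small non-convex example. The cost below is hypothetical, chosen to have a global minimum near θ = -1.04 and a local minimum near θ = 0.96.

```python
def E(theta):
    """Hypothetical non-convex cost with two minima."""
    return (theta ** 2 - 1.0) ** 2 + 0.3 * theta


def dE(theta):
    """Derivative of E with respect to theta."""
    return 4.0 * theta ** 3 - 4.0 * theta + 0.3


def descend(theta, eta=0.01, steps=5000):
    """Fixed-step gradient descent from the given starting point."""
    for _ in range(steps):
        theta -= eta * dE(theta)
    return theta


left = descend(-2.0)    # ends in the global minimum
right = descend(2.0)    # ends in the local minimum
print(left, E(left), right, E(right))
```

Both runs stop where the derivative vanishes, but only the run started on the left reaches the lower of the two minima.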

Perceptron Criterion

The decision boundary should be constructed to minimise the number of misclassified training examples. For each training example $\tilde{\mathbf{x}}_i$

$$\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}_i > 0 \;\Rightarrow\; \mathbf{x}_i \in \omega_1; \qquad \tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}_i < 0 \;\Rightarrow\; \mathbf{x}_i \in \omega_2$$

The distance from the decision boundary is proportional to $\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}$, but we cannot simply sum the values of $\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}$ as a cost function since the sign depends on the class. For training, we replace all the observations of class $\omega_2$ by their negative value. Thus

$$\bar{\mathbf{x}} = \begin{cases} \tilde{\mathbf{x}}, & \text{belongs to } \omega_1 \\ -\tilde{\mathbf{x}}, & \text{belongs to } \omega_2 \end{cases}$$

This means that for a correctly classified example $\tilde{\mathbf{w}}^{\top}\bar{\mathbf{x}} > 0$, and for a mis-classified training example $\tilde{\mathbf{w}}^{\top}\bar{\mathbf{x}} < 0$. We will refer to this as normalising the data.

Perceptron Solution Region

- Each sample $\bar{\mathbf{x}}_i$ places a constraint on the possible location of a solution vector that classifies all samples correctly.
- $\tilde{\mathbf{w}}^{\top}\bar{\mathbf{x}}_i = 0$ defines a hyperplane through the origin of the weight-space of $\tilde{\mathbf{w}}$ vectors, with $\bar{\mathbf{x}}_i$ as a normal vector.
- For normalised data, the solution vector lies on the positive side of every such hyperplane.
- The solution vector, if it exists, is not unique: it lies anywhere within the solution region.

[Figure (from DHS): 4 training points and the solution region, shown for both un-normalised data (with the separating plane) and normalised data (with the "separating" plane).]

Perceptron Criterion (cont)

The perceptron criterion may be expressed as

$$E(\tilde{\mathbf{w}}) = \sum_{\bar{\mathbf{x}} \in \mathcal{Y}} \left(-\tilde{\mathbf{w}}^{\top}\bar{\mathbf{x}}\right)$$

where $\mathcal{Y}$ is the set of mis-classified points. We now want to minimise the perceptron criterion. We can use gradient descent. It is simple to show that

$$\nabla E(\tilde{\mathbf{w}}) = \sum_{\bar{\mathbf{x}} \in \mathcal{Y}} \left(-\bar{\mathbf{x}}\right)$$

Hence the GD update rule is

$$\tilde{\mathbf{w}}[\tau+1] = \tilde{\mathbf{w}}[\tau] + \eta \sum_{\bar{\mathbf{x}} \in \mathcal{Y}[\tau]} \bar{\mathbf{x}}$$

where $\mathcal{Y}[\tau]$ is the set of mis-classified points using $\tilde{\mathbf{w}}[\tau]$. For the case when the samples are linearly separable, using a value of $\eta = 1$ is guaranteed to converge.

The basic algorithm is:

1. Take the extended observations $\tilde{\mathbf{x}}$ for class $\omega_2$ and invert the sign to give $\bar{\mathbf{x}}$.
2. Initialise the weight vector $\tilde{\mathbf{w}}[0]$, $\tau = 0$.
3. Using $\tilde{\mathbf{w}}[\tau]$ produce the set of mis-classified samples $\mathcal{Y}[\tau]$.
4. Use the update rule

$$\tilde{\mathbf{w}}[\tau+1] = \tilde{\mathbf{w}}[\tau] + \sum_{\bar{\mathbf{x}} \in \mathcal{Y}[\tau]} \bar{\mathbf{x}}$$

   then set $\tau = \tau + 1$.
5. Repeat steps (3) and (4) until the convergence criterion is satisfied.
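A minimal sketch of these five steps follows. The two-class data set is hypothetical (it is not the data of the lecture's worked example), and a sample with $\tilde{\mathbf{w}}^{\top}\bar{\mathbf{x}} = 0$ is treated as mis-classified here, an assumption made so that training can start from an all-zero weight vector.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def train_perceptron(samples, labels, w_init, eta=1.0, max_iter=1000):
    """Batch perceptron; labels are 1 (class omega_1) or 2 (class omega_2)."""
    # Step 1: extend each observation with a 1, then negate the class-2 samples
    data = []
    for x, label in zip(samples, labels):
        sign = 1.0 if label == 1 else -1.0
        data.append([sign * v for v in list(x) + [1.0]])
    w = list(w_init)                                   # step 2: initial weights
    for tau in range(max_iter):
        # Step 3: mis-classified set Y[tau], computed with the current weights
        Y = [x for x in data if dot(w, x) <= 0]
        if not Y:                                      # convergence: nothing mis-classified
            return w, tau
        for x in Y:                                    # step 4: add eta * sum of Y[tau]
            w = [wi + eta * xi for wi, xi in zip(w, x)]
    raise RuntimeError("no separating weight vector found within max_iter")


# Hypothetical linearly separable data
samples = [(2.0, 1.0), (3.0, 2.0), (2.5, 0.5), (0.0, 1.0), (-1.0, 0.0), (0.5, 2.0)]
labels = [1, 1, 1, 2, 2, 2]
w_final, n_iter = train_perceptron(samples, labels, [0.0, 0.0, 0.0])
print(w_final, n_iter)
```

The mis-classified set is collected before any weight is changed, so each pass is a single batch update as in step 4 rather than a sample-by-sample update.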

A Simple Example

Class 1 has points: [point values not recoverable]
Class 2 has points: [point values not recoverable]
Initial estimate: $\tilde{\mathbf{w}}[0]$ = [values not recoverable]

This yields the following initial estimate of the decision boundary.

[Figure: training points with the initial decision boundary.]

Given this initial estimate we need to train the decision boundary.

Simple Example (cont)

First use the current decision boundary to obtain the set of mis-classified points.

[Tables: value of z and assigned class for each point of class $\omega_1$ and of class $\omega_2$; entries not recoverable.]

The set of mis-classified points, $\mathcal{Y}$, is

$$\mathcal{Y} = \{1_{\omega_1},\, 4_{\omega_1},\, 2_{\omega_2},\, 3_{\omega_2}\}$$

From the perceptron update rule this yields the updated vector $\tilde{\mathbf{w}}[1]$ = [values not recoverable].

Simple Example (cont)

Apply the decision rule again to the class $\omega_1$ data and to the class $\omega_2$ data.

[Tables: value of z and assigned class for each point of class $\omega_1$ and of class $\omega_2$; entries not recoverable.]

All points are correctly classified: the algorithm has converged. This yields the following decision boundary.

[Figure: training points with the final decision boundary.]

Is this a good decision boundary?

Fisher's Discriminant Analysis

An alternative approach to training the perceptron uses Fisher's discriminant analysis. The aim is to find the projection that maximises the distance between the class means, whilst minimising the within-class variance. Note that only the projection $\mathbf{w}$ is determined. The following cost function is used

$$E(\mathbf{w}) = \frac{(\bar{\mu}_1 - \bar{\mu}_2)^2}{s_1 + s_2}$$

where $s_j$ and $\bar{\mu}_j = \mathbf{w}^{\top}\boldsymbol{\mu}_j$ are the projected scatter and mean for class $\omega_j$. The projected scatter is defined as

$$s_j = \sum_{\mathbf{x}_i \in \omega_j} \left(\mathbf{w}^{\top}\mathbf{x}_i - \bar{\mu}_j\right)^2$$

The cost function may be expressed as

$$E(\mathbf{w}) = \frac{\mathbf{w}^{\top}\mathbf{S}_b\mathbf{w}}{\mathbf{w}^{\top}\mathbf{S}_w\mathbf{w}}$$

where

$$\mathbf{S}_b = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top}, \qquad \mathbf{S}_w = \mathbf{S}_1 + \mathbf{S}_2, \qquad \mathbf{S}_j = \sum_{\mathbf{x}_i \in \omega_j} (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^{\top}$$

and the class mean $\boldsymbol{\mu}_j$ is defined as usual.

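Fisher's Discriminant Analysis (cont)

Differentiating $E(\mathbf{w})$ with respect to the weights, it is maximised when

$$(\hat{\mathbf{w}}^{\top}\mathbf{S}_b\hat{\mathbf{w}})\,\mathbf{S}_w\hat{\mathbf{w}} = (\hat{\mathbf{w}}^{\top}\mathbf{S}_w\hat{\mathbf{w}})\,\mathbf{S}_b\hat{\mathbf{w}}$$

From the definition of $\mathbf{S}_b$

$$\mathbf{S}_b\hat{\mathbf{w}} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top}\hat{\mathbf{w}} = \left((\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top}\hat{\mathbf{w}}\right)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

We therefore know that

$$\mathbf{S}_w\hat{\mathbf{w}} \propto (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

Multiplying both sides by $\mathbf{S}_w^{-1}$ yields

$$\hat{\mathbf{w}} \propto \mathbf{S}_w^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

This gives the direction of the projection, and hence the orientation of the decision boundary. However we still need the bias value $b$. If the data is separable using the Fisher's discriminant, it makes sense to select the value of $b$ that maximises the margin. Simply put, this means that, given no additional information, the boundary should be equidistant from the two points either side of the decision boundary.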
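The direction and bias can be computed directly for two-dimensional data. The sketch below uses a hypothetical data set (not the lecture's example) and assumes the classes are separable along the projection, with class 1 projecting above class 2, so that the midpoint rule for $b$ applies; all function names are illustrative.

```python
def mean2(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]


def scatter2(points, mu):
    """Within-class scatter S_j = sum (x - mu)(x - mu)^T for 2-D points."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for r in range(2):
            for c in range(2):
                S[r][c] += d[r] * d[c]
    return S


def fisher(class1, class2):
    """Return (w, b) with w = S_w^{-1} (mu_1 - mu_2) and a midpoint bias."""
    mu1, mu2 = mean2(class1), mean2(class2)
    S1, S2 = scatter2(class1, mu1), scatter2(class2, mu2)
    Sw = [[S1[r][c] + S2[r][c] for c in range(2)] for r in range(2)]
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]    # assumed non-zero
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    # Closed-form 2x2 inverse applied to the mean difference
    w = [(Sw[1][1] * diff[0] - Sw[0][1] * diff[1]) / det,
         (-Sw[1][0] * diff[0] + Sw[0][0] * diff[1]) / det]
    proj = lambda p: w[0] * p[0] + w[1] * p[1]
    # Midpoint between the closest projected points of the two classes
    b = -(min(proj(p) for p in class1) + max(proj(p) for p in class2)) / 2.0
    return w, b


# Hypothetical data
class1 = [(2.0, 1.0), (3.0, 2.0), (2.5, 0.5)]
class2 = [(0.0, 1.0), (-1.0, 0.0), (0.5, 2.0)]
w_hat, b = fisher(class1, class2)
z = lambda p: w_hat[0] * p[0] + w_hat[1] * p[1] + b    # positive for class 1
print(w_hat, b)
```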

Example

Using the previously described data

$\boldsymbol{\mu}_1$ = [values not recoverable]; $\boldsymbol{\mu}_2$ = [values not recoverable]; $\mathbf{S}_w$ = [values not recoverable]

Solving yields the direction $\hat{\mathbf{w}}$ = [values not recoverable].

Taking the midpoint between the observations closest to the decision boundary gives the final weight vector [values not recoverable].

Logistic Regression/Classification

It is interesting to contrast the perceptron algorithm with logistic regression/classification. This is a linear classifier, but a discriminative model, not a discriminant function. The training criterion is

$$\mathcal{L}(\tilde{\mathbf{w}}) = \sum_{i=1}^{n} \left[ y_i \log\left(\frac{1}{1 + \exp(-\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}_i)}\right) + (1 - y_i)\log\left(\frac{1}{1 + \exp(\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}_i)}\right) \right]$$

where (note the change of label definition)

$$y_i = \begin{cases} 1, & \mathbf{x}_i \text{ generated by class } \omega_1 \\ 0, & \mathbf{x}_i \text{ generated by class } \omega_2 \end{cases}$$

It is simple to show that (this has a nice intuitive feel)

$$\nabla \mathcal{L}(\tilde{\mathbf{w}}) = \sum_{i=1}^{n} \left( y_i - \frac{1}{1 + \exp(-\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}_i)} \right) \tilde{\mathbf{x}}_i$$

Maximising the log-likelihood is equivalent to minimising the negative log-likelihood, $E(\tilde{\mathbf{w}}) = -\mathcal{L}(\tilde{\mathbf{w}})$. The update rule then becomes

$$\tilde{\mathbf{w}}[\tau+1] = \tilde{\mathbf{w}}[\tau] + \eta\, \nabla \mathcal{L}(\tilde{\mathbf{w}})\big|_{\tilde{\mathbf{w}}[\tau]}$$

Compared to the perceptron algorithm:
- need an appropriate method to select η;
- does not necessarily correctly classify the training data even if linearly separable;
- not necessary for the data to be linearly separable;
- yields a class posterior (use the Bayes decision rule to get the hypothesis).
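The update rule can be sketched as below. The one-dimensional data set is hypothetical and deliberately not linearly separable; the learning rate and iteration count are illustrative choices, not values from the lecture.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, labels, eta=0.1, n_iter=500):
    """Gradient ascent on the log-likelihood; labels are 1 or 2."""
    data = [list(x) + [1.0] for x in samples]             # extended inputs
    y = [1.0 if label == 1 else 0.0 for label in labels]  # y_i = 1 for class 1
    w = [0.0] * len(data[0])
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for x, yi in zip(data, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for k in range(len(w)):
                grad[k] += (yi - p) * x[k]                # sum of (y_i - p_i) x_i
        w = [wi + eta * gi for wi, gi in zip(w, grad)]    # ascent step on L
    return w


def posterior(w, x):
    """P(class 1 | x) under the trained model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])))


# Hypothetical overlapping classes: 1.0 (class 1) lies below 1.5 (class 2)
samples = [(2.0,), (3.0,), (1.0,), (0.0,), (-1.0,), (1.5,)]
labels = [1, 1, 1, 2, 2, 2]
w_lr = train_logistic(samples, labels)
print(w_lr, posterior(w_lr, (3.0,)), posterior(w_lr, (-1.0,)))
```

Each sample contributes the gap between its label and the current posterior, scaled by its input, so well-explained samples barely move the weights.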

Kesler's Construct

So far we have only examined binary classifiers. Direct use of multiple binary classifiers can result in no-decision regions (see examples paper).

The multi-class problem can be converted to a 2-class problem. Consider an extended observation $\tilde{\mathbf{x}}$ which belongs to class $\omega_1$. Then to be correctly classified

$$\tilde{\mathbf{w}}_1^{\top}\tilde{\mathbf{x}} - \tilde{\mathbf{w}}_j^{\top}\tilde{\mathbf{x}} > 0, \qquad j = 2, \ldots, K$$

There are therefore $K - 1$ inequalities, requiring that the $K(d+1)$-dimensional vector

$$\boldsymbol{\alpha} = \begin{bmatrix} \tilde{\mathbf{w}}_1 \\ \vdots \\ \tilde{\mathbf{w}}_K \end{bmatrix}$$

correctly classifies all $K - 1$ of the set of $K(d+1)$-dimensional samples

$$\boldsymbol{\gamma}_{12} = \begin{bmatrix} \tilde{\mathbf{x}} \\ -\tilde{\mathbf{x}} \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \end{bmatrix}, \quad \boldsymbol{\gamma}_{13} = \begin{bmatrix} \tilde{\mathbf{x}} \\ \mathbf{0} \\ -\tilde{\mathbf{x}} \\ \vdots \\ \mathbf{0} \end{bmatrix}, \quad \ldots, \quad \boldsymbol{\gamma}_{1K} = \begin{bmatrix} \tilde{\mathbf{x}} \\ \mathbf{0} \\ \mathbf{0} \\ \vdots \\ -\tilde{\mathbf{x}} \end{bmatrix}$$

The multi-class problem has now been transformed to a two-class problem, at the expense of increasing the effective dimensionality of the data and increasing the number of training samples. We now simply optimise for $\boldsymbol{\alpha}$.
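Building the expanded samples is mechanical, as the following sketch shows. The function name `kesler_samples` and the example observation are hypothetical; classes are numbered 1 to K.

```python
def kesler_samples(x, label, K):
    """Return the K-1 expanded vectors for one observation of class `label`.

    Each vector has K blocks of length d+1: the extended observation sits in
    block `label`, its negative in block j, and zeros elsewhere.
    """
    x_ext = list(x) + [1.0]                    # extended observation
    d1 = len(x_ext)                            # block length d + 1
    gammas = []
    for j in range(1, K + 1):
        if j == label:
            continue
        g = [0.0] * (K * d1)
        g[(label - 1) * d1:label * d1] = x_ext           # +x~ in the true-class block
        g[(j - 1) * d1:j * d1] = [-v for v in x_ext]     # -x~ in the competing block
        gammas.append(g)
    return gammas


# Hypothetical observation from class 1 in a 3-class problem (d = 2)
gammas = kesler_samples([2.0, 3.0], label=1, K=3)
for g in gammas:
    print(g)
```

Because $\boldsymbol{\alpha}^{\top}\boldsymbol{\gamma}_{1j} = \tilde{\mathbf{w}}_1^{\top}\tilde{\mathbf{x}} - \tilde{\mathbf{w}}_j^{\top}\tilde{\mathbf{x}}$, requiring every expanded sample to score above zero is exactly the set of inequalities above, so a two-class trainer such as the perceptron algorithm can be applied to the expanded samples.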

Limitations of Linear Decision Boundaries

Perceptrons were very popular until, in 1969, it was realised that they couldn't solve the XOR problem.

We can use perceptrons to solve the binary logic operators AND, OR, NAND, NOR.

[Figure: four single-layer perceptrons with inputs $x_1$, $x_2$ and a summation unit: (a) AND operator, threshold 1.5; (b) OR operator, threshold 0.5; (c) NAND operator, threshold 1.5; (d) NOR operator, threshold 0.5.]
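One way to realise the four operators with threshold units is sketched below. The thresholds 1.5 and 0.5 follow the figure labels, but the input weights of +1 and -1 are an assumption, since the figure's weights are not recoverable; outputs are coded as 1 and 0 to match the logic values.

```python
def gate(w1, w2, w0):
    """Single threshold unit: outputs 1 when w1*x1 + w2*x2 + w0 >= 0, else 0."""
    return lambda x1, x2: 1 if w1 * x1 + w2 * x2 + w0 >= 0 else 0


# Assumed weights; only the 1.5 / 0.5 magnitudes come from the figure
AND = gate(1.0, 1.0, -1.5)
OR = gate(1.0, 1.0, -0.5)
NAND = gate(-1.0, -1.0, 1.5)
NOR = gate(-1.0, -1.0, 0.5)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
for name, g in [("AND", AND), ("OR", OR), ("NAND", NAND), ("NOR", NOR)]:
    print(name, [g(a, b) for a, b in inputs])
```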

XOR (cont)

But XOR may be written in terms of AND, NAND and OR gates.

[Figure: two-layer network for XOR. The inputs $x_1$, $x_2$ feed an OR unit (threshold 0.5) and a NAND unit (threshold 1.5), whose outputs feed an AND unit (threshold 1.5).]

This yields the decision boundaries

[Figure: decision boundaries of the two-layer XOR network.]

So XOR can be solved using a two-layer network. The problem is how to train multi-layer perceptrons. In the 1980s an algorithm for training such networks was proposed: error back propagation.
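The two-layer construction can be checked directly. As before, the unit weights of +1 and -1 are assumed for illustration, and the helper `unit` is hypothetical; only the thresholds and the OR/NAND/AND wiring come from the slide.

```python
def unit(w1, w2, w0):
    """Single threshold unit with 0/1 output."""
    return lambda x1, x2: 1 if w1 * x1 + w2 * x2 + w0 >= 0 else 0


OR_unit = unit(1.0, 1.0, -0.5)       # first layer
NAND_unit = unit(-1.0, -1.0, 1.5)    # first layer
AND_unit = unit(1.0, 1.0, -1.5)      # second layer


def XOR(x1, x2):
    """XOR = AND(OR(x1, x2), NAND(x1, x2))."""
    return AND_unit(OR_unit(x1, x2), NAND_unit(x1, x2))


truth_table = [XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(truth_table)
```

The weights here are set by hand; no single threshold unit can produce this truth table, which is why a training method for the hidden layer was needed.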


Lecture 16: Modern Classification (I) - Separating Hyperplanes Lecture 16: Modern Classification (I) - Separating Hyperplanes Outline 1 2 Separating Hyperplane Binary SVM for Separable Case Bayes Rule for Binary Problems Consider the simplest case: two classes are

6.867 Machine Learning 6.867 Machine Learning Problem Set 2 Due date: Wednesday October 6 Please address all questions and comments about this problem set to 6867-staff@csail.mit.edu. You will need to use MATLAB for some of

LEARNING & LINEAR CLASSIFIERS LEARNING & LINEAR CLASSIFIERS 1/26 J. Matas Czech Technical University, Faculty of Electrical Engineering Department of Cybernetics, Center for Machine Perception 121 35 Praha 2, Karlovo nám. 13, Czech

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

Machine Learning and Data Mining. Linear classification. Kalev Kask Machine Learning and Data Mining Linear classification Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ = f(x ; q) Parameters q Program ( Learner ) Learning algorithm Change q

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,

Simple Neural Nets For Pattern Classification CHAPTER 2 Simple Neural Nets For Pattern Classification Neural Networks General Discussion One of the simplest tasks that neural nets can be trained to perform is pattern classification. In pattern classification

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

Course 395: Machine Learning - Lectures Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture

Perceptron (Theory) + Linear Regression 10601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Perceptron (Theory) Linear Regression Matt Gormley Lecture 6 Feb. 5, 2018 1 Q&A

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

Introduction to Machine Learning 1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

Neural Networks (Part 1) Goals for the lecture Neural Networks (Part ) Mark Craven and David Page Computer Sciences 760 Spring 208 www.biostat.wisc.edu/~craven/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some Incorporating detractors into SVM classification AGH University of Science and Technology 1 2 3 4 5 (SVM) SVM - are a set of supervised learning methods used for classification and regression SVM maximal