Linear discriminant functions

Andrea Passerini
passerini@disi.unitn.it
Machine Learning

Discriminative learning

Discriminative vs generative
Generative learning assumes knowledge of the distribution governing the data.
Discriminative learning focuses on directly modelling the discriminant function: e.g. for classification, it directly models the decision boundaries (rather than inferring them from the modelled data distributions).

Discriminative learning

Pros
- When data are complex, modelling their distribution can be very difficult.
- If data discrimination is the goal, modelling the data distribution is not needed.
- Parameters (and thus the available training examples) are focused on the desired goal.

Cons
- The learned model is less flexible in its usage.
- It does not allow arbitrary inference tasks to be performed: e.g. it is not possible to efficiently generate new data from a certain class.

Linear discriminant functions

Description
The discriminant function is a linear combination of the example features:
f(x) = w^T x + w_0
where w_0 is called bias or threshold.
It is the simplest possible discriminant function. Depending on the complexity of the task and the amount of data, it can be the best option available (at least it is the first one to try).

Linear binary classifier

Description
f(x) = sign(w^T x + w_0)
It is obtained by taking the sign of the linear function.
The decision boundary (f(x) = 0) is a hyperplane H.
The weight vector w is orthogonal to the decision hyperplane: for any x, x' with f(x) = f(x') = 0,
w^T x + w_0 - (w^T x' + w_0) = 0  =>  w^T (x - x') = 0

Linear binary classifier

Functional margin
The value f(x) of the function for a certain point x is called the functional margin. It can be seen as a confidence in the prediction.

Geometric margin
The distance of x from the hyperplane is called the geometric margin:
r^x = f(x) / ||w||
It is a normalized version of the functional margin.
The distance of the origin from the hyperplane is:
r^0 = f(0) / ||w|| = w_0 / ||w||

Linear binary classifier

Geometric margin (cont.)
A point can be expressed as its projection on H plus its distance from H times the unit vector in that direction:
x = x^p + r^x (w / ||w||)

Linear binary classifier

Geometric margin (cont.)
Then, since f(x^p) = 0:
f(x) = w^T x + w_0
     = w^T (x^p + r^x w / ||w||) + w_0
     = w^T x^p + w_0 + r^x w^T w / ||w||
     = f(x^p) + r^x ||w||
     = r^x ||w||
so that r^x = f(x) / ||w||.
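A minimal numpy sketch of these quantities; the weight vector, bias and test point below are arbitrary illustrative values:

```python
import numpy as np

# Illustrative parameters of a linear binary classifier f(x) = sign(w^T x + w_0)
w = np.array([2.0, -1.0])   # weight vector (orthogonal to the decision hyperplane)
w0 = 0.5                    # bias / threshold
x = np.array([1.0, 3.0])    # an arbitrary test point

f_x = w @ x + w0                              # linear discriminant value
functional_margin = f_x                       # confidence of the prediction for x
geometric_margin = f_x / np.linalg.norm(w)    # signed distance from x to the hyperplane

# Projection of x onto the hyperplane: x = x_p + r^x * w / ||w||
x_p = x - geometric_margin * w / np.linalg.norm(w)
assert abs(w @ x_p + w0) < 1e-9               # x_p lies on the decision boundary
print(functional_margin, geometric_margin)
```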

Biological motivation

Human brain
Composed of a densely interconnected network of neurons. A neuron is made of:
- soma: a central body containing the nucleus
- dendrites: a set of filaments departing from the body
- axon: a longer filament (up to 100 times the body diameter)
- synapses: connections between dendrites and axons from other neurons

Biological motivation

Human brain
Electrochemical reactions allow signals to propagate along neurons via axons, synapses and dendrites. Synapses can either excite or inhibit the potential of a neuron. Once a neuron's potential exceeds a certain threshold, a signal is generated and transmitted along the axon.

Perceptron

Single neuron architecture
f(x) = sign(w^T x + w_0)
- Linear combination of the input features
- Threshold activation function

Perceptron

Representational power
- Linearly separable sets of examples, e.g. the primitive boolean functions (AND, OR, NAND, NOT); see the sketch below.
- Any logic formula can be represented by a network of two levels of perceptrons (in disjunctive or conjunctive normal form).

Problems
- Non-linearly separable sets of examples cannot be separated.
- Representing complex logic formulas can require a number of perceptrons exponential in the size of the input.
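As a small illustration of the first point, a single perceptron can implement the boolean AND; the weights below (w = [1, 1], w_0 = -1.5) are one hand-picked choice among many:

```python
import numpy as np

def perceptron(x, w, w0):
    """Threshold unit: returns 1 if w^T x + w_0 > 0, else 0."""
    return int(np.dot(w, x) + w0 > 0)

# Hand-picked weights implementing boolean AND on inputs in {0, 1}
w, w0 = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, w0))   # fires only for (1, 1)
```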

Parameter learning

Error minimization
- We need a function of the parameters to be optimized (like the likelihood in maximum-likelihood estimation for probability distributions).
- A reasonable function is a measure of the error on the training set D (i.e. the loss \ell):
  E(w; D) = \sum_{(x,y) \in D} \ell(y, f(x))
- There is a problem of overfitting the training data (less severe for linear classifiers; we will discuss it later).

Parameter learning

Gradient descent
1. Initialize w (e.g. w = 0).
2. Iterate until the gradient is approximately zero:
   w = w - \eta \nabla E(w; D)

Notes
- \eta is called the learning rate and controls the amount of movement at each gradient step.
- The algorithm is guaranteed to converge to a local optimum of E(w; D) (for small enough \eta).
- Too low an \eta implies slow convergence.
- Techniques exist to adaptively modify \eta.
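A minimal sketch of this batch gradient descent loop; grad_E stands for a hypothetical function returning \nabla E(w; D) for the chosen loss:

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, tol=1e-6, max_iter=10000):
    """Batch gradient descent: repeat w <- w - eta * grad until the gradient is ~0."""
    w = w0.astype(float)
    for _ in range(max_iter):
        g = grad_E(w)
        if np.linalg.norm(g) < tol:   # gradient approximately zero: stop
            break
        w = w - eta * g
    return w

# Toy example: minimize E(w) = ||w - c||^2, whose gradient is 2 (w - c)
c = np.array([1.0, -2.0])
print(gradient_descent(lambda w: 2 * (w - c), w0=np.zeros(2)))   # approaches c
```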

Parameter learning

Problem
The misclassification loss is piecewise constant, so it is a poor candidate for gradient descent.

Perceptron training rule
E(w; D) = - \sum_{(x,y) \in D_E} y f(x)
where D_E is the set of current training errors, i.e. the examples for which:
y f(x) <= 0
The error is the sum of the functional margins of incorrectly classified examples (negated, so that it is non-negative).

Parameter learning

Perceptron training rule
We focus on the unbiased (i.e. w_0 = 0) perceptron for simplicity:
E(w; D) = - \sum_{(x,y) \in D_E} y f(x) = - \sum_{(x,y) \in D_E} y (w^T x)
\nabla E(w; D) = - \sum_{(x,y) \in D_E} y x
so the amount of update is:
- \eta \nabla E(w; D) = \eta \sum_{(x,y) \in D_E} y x

Perceptron learning

Stochastic perceptron training rule
1. Initialize the weights randomly.
2. Iterate until all examples are correctly classified:
   - For each misclassified training example (x, y), update each weight:
     w_i <- w_i + \eta y x_i

Note on stochasticity
- We make a gradient step for each training error (rather than on the sum of them, as in batch learning).
- Each gradient step is very fast.
- Stochasticity can sometimes help to avoid local minima, since the optimization is guided by a different gradient for each training example (and these do not have the same local minima in general).
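A minimal sketch of this rule on a toy linearly separable dataset; the data and learning rate are illustrative assumptions, labels are in {-1, +1}, and a constant feature equal to 1 is appended so the bias can be learned as an ordinary weight:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Stochastic perceptron rule: w_i <- w_i + eta * y * x_i on each misclassified example."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # misclassified (functional margin <= 0)
                w += eta * yi * xi
                errors += 1
        if errors == 0:                 # all examples correctly classified
            break
    return w

# Toy linearly separable data; the last column is the constant bias feature
X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0], [-1.0, -1.5, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(X @ w))   # should match y
```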

Perceptron regression

Exact solution
Let X \in R^{n x d} be the input training matrix (i.e. X = [x_1 ... x_n]^T, with n = |D| and d = |x| the number of features).
Let y \in R^n be the output training vector (y_i is the output for example x_i).
Regression learning can be stated as a set of linear equations:
X w = y
giving as solution:
w = X^{-1} y

Perceptron regression

Problem
- The matrix X is rectangular, usually with more rows than columns.
- The system of equations is overdetermined (more equations than unknowns).
- No exact solution typically exists.

Perceptron regression

Mean squared error (MSE)
We resort to loss minimization. The standard loss for regression is the mean squared error:
E(w; D) = \sum_{(x,y) \in D} (y - f(x))^2 = (y - Xw)^T (y - Xw)
- A closed-form solution exists.
- It can always also be solved by gradient descent (which can be faster).
- It can also be used as a classification loss.

Perceptron regression

Closed form solution
\nabla E(w; D) = \nabla \sum_{(x,y) \in D} (y - w^T x)^2 = \sum_{(x,y) \in D} 2 (y - w^T x) (-x) = -2 X^T (y - Xw) = 0
X^T y = X^T X w
w = (X^T X)^{-1} X^T y

Perceptron regression

Note
w = (X^T X)^{-1} X^T y
- (X^T X)^{-1} X^T is called the pseudoinverse.
- If X is square and nonsingular, inverse and pseudoinverse coincide and the MSE solution corresponds to the exact one.
- The pseudoinverse always exists, so an approximate solution can always be found.
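A minimal numpy sketch of the closed-form MSE solution; the toy data below is an illustrative assumption, and np.linalg.lstsq gives the same pseudoinverse solution in a numerically stabler way than forming (X^T X)^{-1} explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                 # overdetermined: more examples than features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy linear targets

# Normal equations: w = (X^T X)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent route via least squares / pseudoinverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_closed, w_lstsq)   # both close to w_true
```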

Perceptron regression

Gradient descent
The gradient of (one half of) the squared error with respect to a single weight is:
\partial E / \partial w_i = \partial/\partial w_i (1/2) \sum_{(x,y) \in D} (y - f(x))^2
                          = (1/2) \sum_{(x,y) \in D} \partial/\partial w_i (y - f(x))^2
                          = (1/2) \sum_{(x,y) \in D} 2 (y - f(x)) \partial/\partial w_i (y - w^T x)
                          = \sum_{(x,y) \in D} (y - f(x)) (-x_i)
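The same gradient in vectorized form, plugged into a plain gradient descent loop; the data, learning rate and iteration count below are illustrative assumptions:

```python
import numpy as np

def mse_gradient_descent(X, y, eta=0.005, n_iter=1000):
    """Minimize (1/2) * sum (y - Xw)^2 using the gradient -X^T (y - Xw)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ w)   # vectorized sum over examples of (y - f(x)) * (-x)
        w -= eta * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + 0.05 * rng.normal(size=100)
print(mse_gradient_descent(X, y))   # close to [3, -1]
```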

Multiclass classification

One-vs-all
Learn one binary classifier for each class:
- positive examples are the examples of that class
- negative examples are the examples of all other classes
Predict a new example in the class with maximum functional margin (a sketch follows below).
Decision boundaries for which f_i(x) = f_j(x) are pieces of hyperplanes:
w_i^T x = w_j^T x  =>  (w_i - w_j)^T x = 0
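A minimal sketch of one-vs-all training and prediction; as the binary learner it reuses the MSE solution (mentioned above as usable for classification), and the three-cluster toy data is an illustrative assumption:

```python
import numpy as np

def one_vs_all_train(X, Y, n_classes):
    """Train one linear classifier per class with MSE (targets +1 for the class, -1 otherwise)."""
    W = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        t = np.where(Y == c, 1.0, -1.0)                 # this class vs. all other classes
        W[c], *_ = np.linalg.lstsq(X, t, rcond=None)    # closed-form MSE solution
    return W

def one_vs_all_predict(W, x):
    """Predict the class whose classifier gives the maximum functional margin w_c^T x."""
    return int(np.argmax(W @ x))

# Toy 3-class data: three well-separated clusters, plus a constant bias feature
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in ([0, 0], [3, 0], [0, 3])])
X = np.hstack([X, np.ones((60, 1))])
Y = np.repeat([0, 1, 2], 20)
W = one_vs_all_train(X, Y, 3)
print(one_vs_all_predict(W, np.array([2.9, 0.1, 1.0])))   # expected: class 1
```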

Multiclass classification

All-pairs
Learn one binary classifier for each pair of classes:
- positive examples from one class
- negative examples from the other
Predict a new example in the class winning the largest number of pairwise classifications.
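A sketch of the all-pairs voting step, assuming pairwise linear classifiers are already available; the weight vectors below are illustrative values for three classes over two features plus a bias term:

```python
import numpy as np
from itertools import combinations
from collections import Counter

def all_pairs_predict(pairwise_w, x, n_classes):
    """Classifier (i, j) votes for i if w^T x > 0, else for j; return the majority class."""
    votes = Counter()
    for i, j in combinations(range(n_classes), 2):
        votes[i if pairwise_w[(i, j)] @ x > 0 else j] += 1
    return votes.most_common(1)[0][0]

# Illustrative pairwise weight vectors (features x1, x2 and a constant bias feature)
pairwise_w = {(0, 1): np.array([-1.0, 0.0, 1.5]),
              (0, 2): np.array([0.0, -1.0, 1.5]),
              (1, 2): np.array([1.0, -1.0, 0.0])}
print(all_pairs_predict(pairwise_w, np.array([3.0, 0.0, 1.0]), 3))   # votes favour class 1
```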

Generative linear classifiers

Gaussian distributions
Linear decision boundaries are obtained when the covariance is shared among classes (\Sigma_i = \Sigma).

Naive Bayes classifier
f_i(x) = P(x | y_i) P(y_i)
       = \prod_{j=1}^{|x|} \prod_{k=1}^{K} \theta_{k y_i}^{z_k(x[j])} \cdot |D_i| / |D|
       = \prod_{k=1}^{K} \theta_{k y_i}^{N_{kx}} \cdot |D_i| / |D|
where N_{kx} is the number of times feature k (e.g. a word) appears in x, and z_k(x[j]) indicates whether the j-th element of x is feature k.

Generative linear classifiers

Naive Bayes classifier (cont.)
log f_i(x) = \sum_{k=1}^{K} N_{kx} \log \theta_{k y_i} + \log(|D_i| / |D|) = w^T x + w_0
with
x = [N_{1x} ... N_{Kx}]^T
w = [\log \theta_{1 y_i} ... \log \theta_{K y_i}]^T
w_0 = \log(|D_i| / |D|)
Naive Bayes is thus a log-linear model (as are Gaussian distributions with shared \Sigma).
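A small sketch of this log-linear view: given per-class log parameters and log priors, the naive Bayes score is a linear function of the count vector. The parameter values below are illustrative, not estimated from data:

```python
import numpy as np

# Illustrative multinomial naive Bayes parameters for 2 classes over a vocabulary of K = 4 words
log_theta = np.log(np.array([[0.5, 0.3, 0.1, 0.1],    # P(word k | class 0)
                             [0.1, 0.1, 0.4, 0.4]]))  # P(word k | class 1)
log_prior = np.log(np.array([0.6, 0.4]))              # P(class) = |D_i| / |D|

x = np.array([3, 1, 0, 1])                            # word counts N_kx for a document

# log f_i(x) = sum_k N_kx * log theta_{k y_i} + log(|D_i|/|D|) = w_i^T x + w_0i
scores = log_theta @ x + log_prior
print(scores.argmax())                                # predicted class (0 for these counts)
```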