Linear Classification Models. Fabio A. González, Ph.D. Depto. de Ing. de Sistemas e Industrial, Universidad Nacional de Colombia, Bogotá. April 9, 2018.



Outline

1. Linear classification models
2. The perceptron
3. Logistic regression
4. Optimization

Classification problems

$$\mathrm{predict}(x) = \begin{cases} C_1, & y(x) \ge \text{threshold} \\ C_2, & y(x) < \text{threshold} \end{cases}$$

with threshold = 0 or threshold = 0.5.

Three ways to address the classification problem:

1. Directly model the discriminant function, e.g. $y(x) = w^T x + w_0$.
2. Generative model: $y(x) = P(C_k \mid x) = \dfrac{P(x \mid C_k)\, P(C_k)}{P(x)}$.
3. Discriminative model: $y(x) = P(C_k \mid x) = f(x)$, with $f$ an arbitrary function.

Linear classification models

$$y(x) = f(w^T x + w_0)$$

$f(\cdot)$: activation function, may be non-linear.
Even if $f(\cdot)$ is non-linear, the decision boundary is linear.
Also called generalized linear models.
Applicable if instead of $x$ we use a vector of basis functions $\phi(x)$, corresponding to features in a feature space.
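
A minimal Python sketch of this idea (not from the lecture): a generalized linear discriminant $y(x) = f(w^T x + w_0)$ thresholded to pick a class. The function name `predict`, the example weights, and the use of `np.sign`/`np.tanh` as activations are illustrative assumptions.

```python
import numpy as np

def predict(x, w, w0, f=np.sign, threshold=0.0):
    """Generalized linear discriminant: y(x) = f(w^T x + w0),
    predicting C1 when y(x) >= threshold and C2 otherwise (illustrative sketch)."""
    y = f(w @ x + w0)
    return "C1" if y >= threshold else "C2"

# Even with a non-linear activation such as tanh, the decision boundary
# w^T x + w0 = 0 is still a hyperplane, i.e. linear in x.
w, w0 = np.array([1.0, -2.0]), 0.5
print(predict(np.array([3.0, 1.0]), w, w0, f=np.tanh))   # -> C1
```

With a sigmoid activation one would use threshold = 0.5 instead of 0, matching the two thresholds mentioned on the slide.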

Using regression for classification

We can use a regression model, such as least squares, to fit a linear classification model with a linear activation function:

$$\min_{w, w_0} \sum_{i=1}^{l} \left( t_i - (w^T x_i + w_0) \right)^2,$$

where $t_i \in \{-1, 1\}$ is the label of the i-th training sample, but this strategy does not work well:

[Figure 4.4, Bishop PRML: The left plot shows data from two classes, denoted by red crosses and blue circles, together with the decision boundary found by least squares (magenta curve) and also by the logistic regression model (green curve), which is discussed later in Section 4.3.2. The right-hand plot shows the corresponding results obtained when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression.]
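
A short Python sketch of this least-squares approach, assuming targets in {-1, +1} and an appended bias column; the helper names (`least_squares_classifier`, `predict_ls`) are made up for illustration. It reproduces the strategy the slide warns about, not a recommended classifier.

```python
import numpy as np

def least_squares_classifier(X, t):
    """Fit w, w0 by minimizing sum_i (t_i - (w^T x_i + w0))^2, with t_i in {-1, +1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])        # append a bias column
    w_full, *_ = np.linalg.lstsq(Xb, t, rcond=None)  # ordinary least squares
    return w_full[:-1], w_full[-1]                   # (w, w0)

def predict_ls(X, w, w0):
    """Classify by the sign of the fitted linear function."""
    return np.where(X @ w + w0 >= 0, 1, -1)
```

As in Bishop's Figure 4.4, a few outliers can drag this boundary far from the one found by logistic regression.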

The perceptron

Rosenblatt's perceptron

Designed by Frank Rosenblatt in 1957.
A hardware implementation of the perceptron learning algorithm.
A precursor of neural networks.
Criticized by Marvin Minsky, producing a decline in research funding.

Perceptron learning

Activation function:
$$f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$$

Loss function:
$$E_p(w, w_0) = -\sum_{n=1}^{l} f(w^T x_n + w_0)\, t_n$$

Learning rule:
$$w^{(n)} = w^{(n-1)} - \eta\, (f_n - t_n)\, x_n,$$
where $f_n = f(w^T x_n + w_0)$.
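
A compact Python sketch of perceptron training under these definitions; the function name `perceptron_train` and the default hyperparameters are illustrative assumptions, not from the lecture.

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """Perceptron learning: step activation, updates on misclassified samples.

    X : (l, d) inputs, t : (l,) labels in {-1, +1}.
    """
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_n, t_n in zip(X, t):
            f_n = 1.0 if w @ x_n + w0 >= 0 else -1.0   # f(w^T x_n + w0)
            if f_n != t_n:                             # only mistakes change w
                w = w - eta * (f_n - t_n) * x_n        # learning rule from the slide
                w0 = w0 - eta * (f_n - t_n)
                errors += 1
        if errors == 0:    # all samples classified correctly: converged
            break
    return w, w0
```

On linearly separable data the loop stops after a finite number of epochs, which is the behaviour the convergence theorem on the next slide guarantees.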

Perceptron convergence

If the training points are linearly separable, the algorithm converges (perceptron convergence theorem).
It could converge to different solutions depending on the order of presentation of the training samples.

[Figure 4.7, Bishop PRML: Illustration of the convergence of the perceptron learning algorithm, showing data points from two classes.]

Perceptron problems

Non-probabilistic outputs.
Non-convex problem.
No convergence guarantee if the samples are not linearly separable.
Its power may be increased by stacking several layers (multilayer perceptrons) and using smooth activation functions.

Logistic regression

Parametric discrimination

These three conditions are equivalent:
$$P(C_1 \mid x) \ge 0.5$$
$$\frac{P(C_1 \mid x)}{1 - P(C_1 \mid x)} \ge 1$$
$$\mathrm{logit}(P(C_1 \mid x)) = \log \frac{P(C_1 \mid x)}{1 - P(C_1 \mid x)} \ge 0$$

If we assume that $P(x \mid C_1)$ and $P(x \mid C_2)$ are normally distributed sharing the same covariance matrix:
$$\mathrm{logit}(P(C_1 \mid x)) = \log \frac{P(C_1 \mid x)}{P(C_2 \mid x)} = w^T x + w_0,$$
where
$$w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad w_0 = -\tfrac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_1 - \mu_2) + \log \frac{P(C_1)}{P(C_2)}.$$
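
A numpy sketch of these formulas: estimating $w$ and $w_0$ from two classes assumed Gaussian with a shared covariance. The function name, the pooled-covariance estimator, and taking the priors from class counts are assumptions for illustration.

```python
import numpy as np

def gaussian_shared_cov_discriminant(X1, X2):
    """w = Sigma^{-1}(mu1 - mu2),
    w0 = -1/2 (mu1 + mu2)^T Sigma^{-1} (mu1 - mu2) + log(P(C1)/P(C2))."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # pooled estimate of the shared covariance matrix
    S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n1 + n2 - 2)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (mu1 - mu2)
    w0 = -0.5 * (mu1 + mu2) @ S_inv @ (mu1 - mu2) + np.log(n1 / n2)  # priors from counts
    return w, w0   # predict C1 when w @ x + w0 >= 0, i.e. when the logit is >= 0
```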

The logistic function

logit function:
$$\mathrm{logit}(P(C_1 \mid x)) = \log \frac{P(C_1 \mid x)}{1 - P(C_1 \mid x)} = w^T x + w_0$$

inverse-logit:
$$P(C_1 \mid x) = \sigma(w^T x + w_0) = \frac{1}{1 + e^{-(w^T x + w_0)}}$$

$\sigma$ is called the logistic or sigmoid function.
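
A tiny numpy sketch of the two functions and of the fact that they are inverses of each other (the function names are illustrative):

```python
import numpy as np

def sigmoid(a):
    """Logistic (sigmoid) function: the inverse of the logit."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

a = np.array([-2.0, 0.0, 3.5])
assert np.allclose(logit(sigmoid(a)), a)   # round trip recovers a
```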

Logistic regression

$$y(x) = P(C_1 \mid x) = \sigma(w^T x)$$

Find $w$ using maximum likelihood estimation:
$$p(\mathbf{t} \mid w) = \prod_{n=1}^{l} y_n^{t_n} (1 - y_n)^{1 - t_n},$$
where $\mathbf{t} = \{t_1, \ldots, t_l\}$ and $y_n = y(x_n)$.

Cross-entropy error:
$$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{l} \left[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right]$$
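
A short numpy sketch computing this cross-entropy error for given weights, assuming the bias is folded into $w$ and a column of ones into the inputs; `cross_entropy` is an illustrative name.

```python
import numpy as np

def cross_entropy(w, X, t):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], y_n = sigma(w^T x_n).

    X : (l, d) inputs (bias column included), t : (l,) targets in {0, 1}.
    """
    y = 1.0 / (1.0 + np.exp(-X @ w))
    eps = 1e-12                         # guard against log(0)
    return -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
```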

Multiclass logistic regression

$$y_k(x) = P(C_k \mid x) = \frac{e^{w_k^T x}}{\sum_j e^{w_j^T x}}$$

Likelihood:
$$p(\mathbf{T} \mid w_1, \ldots, w_K) = \prod_{n=1}^{l} \prod_{k=1}^{K} y_{nk}^{t_{nk}},$$
where $y_{nk} = y_k(x_n)$ and $\mathbf{T} \in \mathbb{R}^{l \times K}$ is a matrix of target variables with elements $t_{nk}$.

Multiclass cross-entropy error:
$$E(w_1, \ldots, w_K) = -\ln p(\mathbf{T} \mid w_1, \ldots, w_K) = -\sum_{n=1}^{l} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$
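
A numpy sketch of the softmax model and its cross-entropy error; the function names, the (d, K) weight-matrix layout, and the one-hot target encoding are assumptions for illustration.

```python
import numpy as np

def softmax(A):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(W, X, T):
    """E(w_1, ..., w_K) = -sum_n sum_k t_nk ln y_nk.

    W : (d, K) with one column w_k per class, X : (l, d), T : (l, K) one-hot targets.
    """
    Y = softmax(X @ W)                  # y_nk = P(C_k | x_n)
    return -np.sum(T * np.log(Y + 1e-12))
```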

Optimization

Optimization problem

$$\min_w E(w) = \min_w \; -\sum_{n=1}^{l} \left[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right]$$

Gradient:
$$\nabla E(w) = \sum_{n=1}^{l} (y_n - t_n)\, \phi_n$$

Gradient descent update:
$$w^{(\tau+1)} = w^{(\tau)} - \eta \sum_{n=1}^{l} (y_n - t_n)\, \phi_n$$
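
A batch gradient-descent sketch implementing exactly this update; the learning rate, iteration count, and function name are illustrative assumptions.

```python
import numpy as np

def logistic_gradient_descent(Phi, t, eta=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.

    Phi : (l, m) design matrix with rows phi_n, t : (l,) targets in {0, 1}.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))   # y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)               # sum_n (y_n - t_n) phi_n
        w = w - eta * grad                   # w^(tau+1) = w^(tau) - eta * grad
    return w
```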

Newton-Raphson

$$w^{(\tau+1)} = w^{(\tau)} - H^{-1} \nabla E(w)$$

$$\nabla E(w) = \Phi^T (\mathbf{y} - \mathbf{t})$$

$$H = \nabla\nabla E(w) = \Phi^T R \Phi,$$
with $R$ a diagonal matrix with $R_{nn} = y_n(1 - y_n)$.

The resulting algorithm is called iterative reweighted least squares (IRLS).
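
A sketch of the IRLS / Newton-Raphson iteration in numpy; the small ridge term added to the Hessian and the iteration count are my own numerical-stability assumptions, not part of the lecture.

```python
import numpy as np

def irls(Phi, t, n_iters=20):
    """Iterative reweighted least squares for logistic regression."""
    m = Phi.shape[1]
    w = np.zeros(m)
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        R = np.diag(y * (1.0 - y))                 # R_nn = y_n (1 - y_n)
        grad = Phi.T @ (y - t)                     # gradient of E(w)
        H = Phi.T @ R @ Phi + 1e-8 * np.eye(m)     # Hessian (tiny ridge for stability)
        w = w - np.linalg.solve(H, grad)           # Newton-Raphson step
    return w
```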

Regularization

$$\min_w \; -C \sum_{n=1}^{l} \left[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right] + \|w\|^2$$

Prevents overfitting.
Equivalent to the inclusion of a prior and finding a MAP solution for $w$.
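
A gradient-descent sketch of this regularized objective; the step size, iteration count, and function name are illustrative. The gradient simply adds the penalty term $2w$ to the data term.

```python
import numpy as np

def regularized_logistic_gd(Phi, t, C=1.0, eta=0.01, n_iters=1000):
    """Minimize  -C * sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)] + ||w||^2."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        grad = C * (Phi.T @ (y - t)) + 2.0 * w   # data-term gradient + penalty gradient
        w = w - eta * grad
    return w
```

This C-weighted formulation matches the convention used by, e.g., scikit-learn's LogisticRegression, where a smaller C means stronger regularization.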

Stochastic gradient descent

$$\min_w Q(w) = \min_w \sum_{i=1}^{n} Q_i(w)$$

Batch gradient descent:
$$w^{(\tau+1)} = w^{(\tau)} - \alpha \nabla Q(w) = w^{(\tau)} - \alpha \sum_{i=1}^{n} \nabla Q_i(w)$$

On-line gradient descent:
$$w^{(\tau+1)} = w^{(\tau)} - \alpha \nabla Q_i(w)$$
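
An on-line (stochastic) gradient-descent sketch for the logistic-regression objective, where each update uses a single term $\nabla Q_i(w)$; the random sample order, step size, and epoch count are illustrative assumptions.

```python
import numpy as np

def logistic_sgd(Phi, t, alpha=0.05, n_epochs=10, seed=0):
    """On-line gradient descent: one weight update per training sample."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(t)):          # visit samples in random order
            y_i = 1.0 / (1.0 + np.exp(-Phi[i] @ w))
            w = w - alpha * (y_i - t[i]) * Phi[i]  # gradient of Q_i(w)
    return w
```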

References

Alpaydin, E. (2010). Introduction to Machine Learning, 2nd Ed. MIT Press. (Chap. 10)
Russell, S. and Norvig, P. (2010). Artificial Intelligence: A Modern Approach, 3rd Ed. Prentice-Hall. (Sect. 18.6)