Machine Learning: Logistic Regression. Lecture 04

Machine Learning: Logistic Regression Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu

Supervised Learning
Task = learn an (unknown) function t : X → T that maps input instances x ∈ X to output targets t(x) ∈ T:
Classification: the output t(x) ∈ T is one of a finite set of discrete categories.
Regression: the output t(x) ∈ T is continuous, or has a continuous component.
The target function t(x) is known only through a (noisy) set of training examples: (x_1,t_1), (x_2,t_2), ..., (x_n,t_n).

Supervised Learning
Training: training examples (x_k, t_k) → learning algorithm → model h.
Testing: test examples (x, t) → model h → generalization performance.

Parametric Approaches to Supervised Learning
Task = build a function h(x) such that:
h matches t well on the training data => h is able to fit data that it has seen.
h also matches t well on test data => h is able to generalize to unseen data.
Task = choose h from a "nice" class of functions that depend on a vector of parameters w: h(x) ≡ h_w(x) ≡ h(w, x).
What classes of functions are "nice"?

Neurons
Soma is the central part of the neuron: where the input signals are combined.
Dendrites are cellular extensions: where the majority of the input occurs.
Axon is a fine, long projection: carries nerve signals to other neurons.
Synapses are molecular structures between axon terminals and other neurons: where the communication takes place.

Neuron Models https://www.research.ibm.com/software/ibmresearch/multimedia/ijcnn2013.neuronmodel.pdf

Spiking/LIF Neuron Function http://ee.princeton.edu/research/prucnal/sites/default/files/06497478.pdf

McCulloch-Pitts Neuron Function
[Diagram: inputs x_0 = 1, x_1, x_2, x_3 with weights w_0, ..., w_3 feed a summation Σ w_i x_i, followed by an activation/output function f, producing h_w(x).]
Algebraic interpretation:
The output of the neuron is a linear combination of the inputs from other neurons, rescaled by the synaptic weights.
The weights w_i correspond to the synaptic weights (activating or inhibiting).
The summation corresponds to the combination of signals in the soma.
It is often transformed through an activation / output function.

Activation Functions
unit step: f(z) = 0 if z < 0, 1 if z ≥ 0  (Perceptron)
logistic: f(z) = 1 / (1 + e^{-z})  (Logistic Regression)
identity: f(z) = z  (Linear Regression)
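
As a quick illustration (not from the slides), these three activations can be written directly in NumPy; the function names below are my own.

```python
import numpy as np

def unit_step(z):
    # Perceptron activation: 0 if z < 0, 1 if z >= 0
    return np.where(z < 0, 0.0, 1.0)

def logistic(z):
    # Logistic Regression activation: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def identity(z):
    # Linear Regression activation: f(z) = z
    return z

z = np.array([-2.0, 0.0, 2.0])
print(unit_step(z), logistic(z), identity(z))
```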

Linear Regression
[Diagram: inputs x_0 = 1, x_1, x_2, x_3 with weights w_0, ..., w_3 feed a summation Σ w_i x_i and the identity activation f(z) = z, so h_w(x) = Σ w_i x_i.]
Polynomial curve fitting is Linear Regression: x = φ(x) = [1, x, x^2, ..., x^M]^T, h(x) = w^T x.
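
A minimal sketch of this view, assuming a one-dimensional input x and a hand-picked degree M; phi and h are hypothetical names, not from the slides.

```python
import numpy as np

def phi(x, M):
    # Polynomial basis expansion: [1, x, x^2, ..., x^M]
    return np.array([x**m for m in range(M + 1)])

def h(w, x, M):
    # Linear regression output: w^T phi(x)
    return w @ phi(x, M)

w = np.array([0.5, -1.0, 2.0])   # example weights for M = 2
print(h(w, 1.5, M=2))
```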

McCulloch-Pitts Neuron Function
[Diagram: inputs x_0 = 1, x_1, x_2, x_3 with weights w_0, ..., w_3 feed a summation Σ w_i x_i, followed by an activation/output function f, producing h_w(x).]
Algebraic interpretation:
The output of the neuron is a linear combination of the inputs from other neurons, rescaled by the synaptic weights.
The weights w_i correspond to the synaptic weights (activating or inhibiting).
The summation corresponds to the combination of signals in the soma.
It is often transformed through a monotonic activation / output function.

Logistic Regression
[Diagram: inputs x_0 = 1, x_1, x_2, x_3 with weights w_0, ..., w_3 feed a summation Σ w_i x_i and the logistic activation f(z) = 1 / (1 + exp(-z)), so h_w(x) = 1 / (1 + exp(-w^T x)).]
Training set is (x_1,t_1), (x_2,t_2), ..., (x_n,t_n), with x = [1, x_1, x_2, ..., x_k]^T and h(x) = σ(w^T x).
Can be used for both classification and regression:
Classification: T = {C_1, C_2} = {1, 0}.
Regression: T = [0, 1] (i.e. the output needs to be normalized).

Logistic Regression for Binary Classification
The model output can be interpreted as posterior class probabilities:
p(C_1|x) = σ(w^T x) = 1 / (1 + exp(-w^T x))
p(C_2|x) = 1 − σ(w^T x) = exp(-w^T x) / (1 + exp(-w^T x))
How do we train a logistic regression model? What error/cost function should we minimize?
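
A small sketch of these two posteriors in NumPy, assuming the input vector x already carries the bias component x_0 = 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posteriors(w, x):
    # p(C1|x) = sigma(w^T x), p(C2|x) = 1 - sigma(w^T x)
    p_c1 = sigmoid(w @ x)
    return p_c1, 1.0 - p_c1

w = np.array([0.1, 0.5, -0.3])
x = np.array([1.0, 2.0, 1.0])    # first component is the bias input
print(posteriors(w, x))
```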

Logistic Regression Learning
Learning = finding the "right" parameters w^T = [w_0, w_1, ..., w_k].
Find w that minimizes an error function E(w) which measures the misfit between h(x_n, w) and t_n.
Expect that h(x, w) performing well on training examples x_n ⇒ h(x, w) will perform well on arbitrary test examples x ∈ X.
Least squares error function?
E(w) = (1/2) Σ_{n=1}^N {h(x_n, w) − t_n}^2
Differentiable => can use gradient descent.
Non-convex => not guaranteed to find the global optimum.

Maximum Likelihood
Training set is D = {⟨x_n, t_n⟩ | t_n ∈ {0,1}, n ∈ 1..N}.
Let h_n = p(C_1|x_n), i.e. h_n = p(t_n = 1|x_n) = σ(w^T x_n).
Maximum Likelihood (ML) principle: find the parameters w that maximize the likelihood of the labels.
The likelihood function is:
p(t|w) = Π_{n=1}^N h_n^{t_n} (1 − h_n)^{1 − t_n}
The negative log-likelihood (cross-entropy) error function:
E(w) = −ln p(t|w) = −Σ_{n=1}^N { t_n ln h_n + (1 − t_n) ln(1 − h_n) }
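
The cross-entropy error can be computed directly from this formula; a sketch assuming the design matrix X stores one example per row with a leading bias column:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, t, eps=1e-12):
    # E(w) = -sum_n [ t_n ln h_n + (1 - t_n) ln(1 - h_n) ], with h_n = sigma(w^T x_n)
    h = sigmoid(X @ w)
    h = np.clip(h, eps, 1.0 - eps)   # guard against log(0)
    return -np.sum(t * np.log(h) + (1 - t) * np.log(1 - h))

X = np.array([[1.0, 0.2], [1.0, -1.5], [1.0, 3.0]])
t = np.array([0.0, 0.0, 1.0])
print(cross_entropy(np.zeros(2), X, t))
```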

Maximum Likelihood Learning for Logistic Regression
The ML solution is:
w_ML = argmax_w p(t|w) = argmin_w E(w)
E(w) is convex in w.
The ML solution is given by ∇E(w) = 0. Cannot solve analytically => solve numerically with gradient-based methods: (stochastic) gradient descent, conjugate gradient, L-BFGS, etc.
The gradient is (prove it):
∇E(w) = Σ_{n=1}^N (h_n − t_n) x_n^T
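
A minimal batch gradient descent sketch using this gradient; the learning rate and iteration count are arbitrary choices, not prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, t, lr=0.1, n_iters=1000):
    # Batch gradient descent on the cross-entropy error.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ w)            # h_n = sigma(w^T x_n)
        grad = X.T @ (h - t)          # sum_n (h_n - t_n) x_n
        w -= lr * grad
    return w

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
print(fit_logistic(X, t))
```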

Regularized Logistic Regression
Use a Gaussian prior over the parameters w = [w_0, w_1, ..., w_M]^T:
p(w) = N(w | 0, α^{-1} I) = (α/2π)^{(M+1)/2} exp(−(α/2) w^T w)
Bayes' theorem:
p(w|t) = p(t|w) p(w) / p(t) ∝ p(t|w) p(w)
MAP solution:
w_MAP = argmax_w p(w|t)

Regularized Logistic Regression
MAP solution:
w_MAP = argmax_w p(w|t)
      = argmax_w p(t|w) p(w)
      = argmin_w −ln p(t|w) p(w)
      = argmin_w −ln p(t|w) − ln p(w)
      = argmin_w E_D(w) − ln p(w)
      = argmin_w E_D(w) + (α/2) w^T w
      = argmin_w E_D(w) + E_w(w)
where the data term is E_D(w) = −Σ_{n=1}^N { t_n ln y_n + (1 − t_n) ln(1 − y_n) }
and the regularization term is E_w(w) = (α/2) w^T w.

Regularized Logistic Regression
MAP solution:
w_MAP = argmin_w E_D(w) + E_w(w), still convex in w.
The MAP solution is given by ∇E(w) = 0, where
∇E(w) = ∇E_D(w) + ∇E_w(w) = Σ_{n=1}^N (h_n − t_n) x_n^T + α w^T, with h_n = σ(w^T x_n).
Cannot solve analytically => solve numerically: (stochastic) gradient descent [PRML 3.1.3], Newton-Raphson iterative optimization [PRML 4.3.3], conjugate gradient, L-BFGS.
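
The only change relative to the unregularized gradient is the extra α w term (this sketch regularizes the bias weight w_0 as well, matching the formula above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_gradient(w, X, t, alpha):
    # grad E(w) = sum_n (h_n - t_n) x_n + alpha * w
    h = sigmoid(X @ w)
    return X.T @ (h - t) + alpha * w

w = np.zeros(3)
X = np.array([[1.0, 0.5, -1.0], [1.0, 2.0, 0.3]])
t = np.array([1.0, 0.0])
print(regularized_gradient(w, X, t, alpha=0.1))
```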

Softmax Regression = Logistic Regression for Multiclass Classification
Multiclass classification: T = {C_1, C_2, ..., C_K} = {1, 2, ..., K}.
Training set is (x_1,t_1), (x_2,t_2), ..., (x_n,t_n), with x = [1, x_1, x_2, ..., x_M]^T and t_1, t_2, ..., t_n ∈ {1, 2, ..., K}.
One weight vector w_k per class [PRML 4.3.4]:
p(C_k|x) = exp(w_k^T x) / Σ_j exp(w_j^T x)

Softmax Regression (K ≥ 2)
Inference:
C* = argmax_{C_k} p(C_k|x)
   = argmax_{C_k} exp(w_k^T x) / Z(x), where Z(x) = Σ_j exp(w_j^T x) is a normalization constant
   = argmax_{C_k} exp(w_k^T x)
   = argmax_{C_k} w_k^T x
Training using:
Maximum Likelihood (ML)
Maximum A Posteriori (MAP) with a Gaussian prior on w.
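
A sketch of softmax inference with one weight vector per class stored as the rows of a matrix W; as noted above, the normalizer Z(x) can be dropped when taking the argmax:

```python
import numpy as np

def softmax_probs(W, x):
    # p(C_k|x) = exp(w_k^T x) / sum_j exp(w_j^T x)
    scores = W @ x
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

def predict(W, x):
    # argmax_k p(C_k|x) = argmax_k w_k^T x  (Z(x) is not needed)
    return int(np.argmax(W @ x))

W = np.array([[0.2, -1.0, 0.5],
              [0.1,  0.3, -0.2],
              [-0.4, 0.8, 0.9]])        # K = 3 classes, rows are the w_k
x = np.array([1.0, 0.5, 2.0])
print(softmax_probs(W, x), predict(W, x))
```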

Softmax Regression
The negative log-likelihood error function is:
E_D(w) = −(1/N) Σ_{n=1}^N ln p(t_n|x_n)
       = −(1/N) Σ_{n=1}^N ln [ exp(w_{t_n}^T x_n) / Z(x_n) ]
       = −(1/N) Σ_{n=1}^N Σ_{k=1}^K δ_k(t_n) ln [ exp(w_k^T x_n) / Z(x_n) ]
which is convex in w, where δ_k(t) = 1 if k = t and 0 if k ≠ t is the Kronecker delta function.

Softmax Regression
The ML solution is:
w_ML = argmin_w E_D(w)
The gradient is (prove it):
∇_{w_k} E_D(w) = −(1/N) Σ_{n=1}^N ( δ_k(t_n) − p(C_k|x_n) ) x_n
              = −(1/N) Σ_{n=1}^N ( δ_k(t_n) − exp(w_k^T x_n) / Z(x_n) ) x_n
∇E_D(w) = [ ∇_{w_1}^T E_D(w), ∇_{w_2}^T E_D(w), ..., ∇_{w_K}^T E_D(w) ]^T

Regularized Softmax Regression
The new cost function is:
E(w) = E_D(w) + E_w(w) = −(1/N) Σ_{n=1}^N Σ_{k=1}^K δ_k(t_n) ln [ exp(w_k^T x_n) / Z(x_n) ] + (α/2) Σ_{k=1}^K w_k^T w_k
The new gradient is (prove it):
∇_{w_k} E(w) = −(1/N) Σ_{n=1}^N ( δ_k(t_n) − p(C_k|x_n) ) x_n^T + α w_k^T
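
A sketch of the regularized cost and its per-class gradients, assuming X holds one example per row, the labels t are integers in {0, ..., K-1}, and W stores the w_k as rows; the function name is hypothetical:

```python
import numpy as np

def softmax_cost_grad(W, X, t, alpha):
    # W: K x D weights, X: N x D inputs, t: N integer labels in {0, ..., K-1}
    N = X.shape[0]
    scores = X @ W.T                           # scores[n, k] = w_k^T x_n
    scores -= scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)          # P[n, k] = p(C_k | x_n)
    delta = np.zeros_like(P)
    delta[np.arange(N), t] = 1.0               # delta_k(t_n)
    cost = -np.sum(delta * np.log(P)) / N + 0.5 * alpha * np.sum(W * W)
    grad = -(delta - P).T @ X / N + alpha * W  # K x D matrix of per-class gradients
    return cost, grad

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
t = np.array([0, 1, 2])
W = np.zeros((3, 2))
print(softmax_cost_grad(W, X, t, alpha=0.01))
```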

Softmax Regression
The ML solution is given by ∇E_D(w) = 0. Cannot solve analytically.
Solve numerically, by plugging the [cost, gradient] = [E_D(w), ∇E_D(w)] values into general convex solvers:
L-BFGS
Newton methods
conjugate gradient
(stochastic / minibatch) gradient-based methods:
gradient descent (with / without momentum)
AdaGrad, AdaDelta
RMSProp
ADAM, ...

Implementation
Need to compute [cost, gradient]:
cost = −(1/N) Σ_{n=1}^N Σ_{k=1}^K δ_k(t_n) ln p(C_k|x_n) + (α/2) Σ_{k=1}^K w_k^T w_k
gradient_k = −(1/N) Σ_{n=1}^N ( δ_k(t_n) − p(C_k|x_n) ) x_n^T + α w_k^T
=> need to compute, for k = 1, ..., K, the output p(C_k|x_n) = exp(w_k^T x_n) / Σ_j exp(w_j^T x_n).
Overflow when the w_k^T x_n are too large.

Implementation: Preventing Overflows
Subtract from each product w_k^T x_n the maximum product:
c = max_{1 ≤ k ≤ K} w_k^T x_n
p(C_k|x_n) = exp(w_k^T x_n − c) / Σ_j exp(w_j^T x_n − c)
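
The same max-subtraction trick in isolation, contrasted with the naive computation that overflows:

```python
import numpy as np

def softmax_naive(scores):
    e = np.exp(scores)                 # overflows for large scores
    return e / e.sum()

def softmax_stable(scores):
    c = scores.max()                   # c = max_k w_k^T x_n
    e = np.exp(scores - c)             # exp(w_k^T x_n - c) never overflows
    return e / e.sum()

scores = np.array([1000.0, 1001.0, 1002.0])   # e.g. large values of w_k^T x_n
print(softmax_naive(scores))    # nan entries due to overflow
print(softmax_stable(scores))   # well-defined probabilities
```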

Implementation: Gradient Checking
Want to minimize J(θ), where θ is a scalar.
Mathematical definition of the derivative:
dJ(θ)/dθ = lim_{ε→0} [ J(θ + ε) − J(θ − ε) ] / (2ε)
Numerical approximation of the derivative:
dJ(θ)/dθ ≈ [ J(θ + ε) − J(θ − ε) ] / (2ε), where ε = 0.0001

Implementation: Gradient Checking
If θ is a vector of parameters θ_i:
Compute the numerical derivative with respect to each θ_i.
Aggregate all derivatives into the numerical gradient G_num(θ).
Compare the numerical gradient G_num(θ) with the implementation of the gradient G_imp(θ):
||G_num(θ) − G_imp(θ)|| / ( ||G_num(θ)|| + ||G_imp(θ)|| ) ≤ 10^{-6}
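
A sketch of this check for a generic cost function J and an implemented gradient; ε = 0.0001 and the 10^{-6} threshold follow the slides:

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    # Central differences: [J(theta + eps e_i) - J(theta - eps e_i)] / (2 eps)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

def gradient_check(J, grad_impl, theta):
    g_num = numerical_gradient(J, theta)
    g_imp = grad_impl(theta)
    # Relative difference; should be around 1e-6 or smaller for a correct gradient
    return np.linalg.norm(g_num - g_imp) / (np.linalg.norm(g_num) + np.linalg.norm(g_imp))

# Example: J(theta) = theta^T theta, with gradient 2 theta
theta = np.array([1.0, -2.0, 0.5])
print(gradient_check(lambda th: th @ th, lambda th: 2 * th, theta))
```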