Machine Learning: Logistic Regression Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu
Supervised Learning Task = learn an (unkon function t : X T that maps input instances x Î X to output targets t(x Î T: Classification: The output t(x Î T is one of a finite set of discrete categories. Regression: The output t(x Î T is continuous, or has a continuous component. Target function t(x is knon (only through (noisy set of training examples: (x 1,t 1, (x 2,t 2, (x n,t n
Supervised Learning Training Training Examples (x k, t k Learning Algorithm Model h Testing Test Examples (x, t Model h Generalization Performance
Parametric Approaches to Supervised Learning Task = build a function h(x such that: h matches t ell on the training data: => h is able to fit data that it has seen. h also matches t ell on test data: => h is able to generalize to unseen data. Task = choose h from a nice class of functions that depend on a vector of parameters : h(x º h (x º h(,x hat classes of functions are nice?
Neurons Soma is the central part of the neuron: here the input signals are combined. Dendrites are cellular extensions: here majority of the input occurs. Axon is a fine, long projection: carries nerve signals to other neurons. Synapses are molecular structures beteen axon terminals and other neurons: here the communication takes place.
Neuron Models https://.research.ibm.com/softare/ibmresearch/multimedia/ijcnn2013.neuronmodel.pdf
Spiking/LIF Neuron Function http://ee.princeton.edu/research/prucnal/sites/default/files/06497478.pdf
Neuron Models https://.research.ibm.com/softare/ibmresearch/multimedia/ijcnn2013.neuronmodel.pdf
McCulloch-Pitts Neuron Function x 0 1 0 activation / output function x 1 x 2 1 2 3 Σ i x i f h (x x 3 Algebraic interpretation: The output of the neuron is a linear combination of inputs from other neurons, rescaled by the synaptic eights. eights i correspond to the synaptic eights (activating or inhibiting. summation corresponds to combination of signals in the soma. It is often transformed through an activation / output function.
Activation Functions " $ unit step f (z = # %$ Perceptron 0 if z < 0 1 if z 0 1 1 logistic f (z = 1+ e z Logistic Regression identity f (z = z Linear Regression 0
Linear Regression x 0 1 0 activation / output function x 1 1 2 Σ f x 2 3 i x i f (z = z h (x = i x i x 3 Polynomial curve fitting is Linear Regression: x = φ(x = [1, x, x 2,..., x M ] T h(x = T x
McCulloch-Pitts Neuron Function x 0 1 0 activation / output function x 1 x 2 1 2 3 Σ i x i f h (x x 3 Algebraic interpretation: The output of the neuron is a linear combination of inputs from other neurons, rescaled by the synaptic eights. eights i correspond to the synaptic eights (activating or inhibiting. summation corresponds to combination of signals in the soma. It is often transformed through a monotonic activation / output function.
Logistic Regression x 0 x 1 x 2 x 3 1 0 1 2 Σ activation function f i x i 1 h (x = 3 1 f (z = 1+ exp( T x 1+ exp( z Training set is (x 1,t 1, (x 2,t 2, (x n,t n. x = [1, x 1, x 2,..., x k ] T h(x = σ( T x Can be used for both classification and regression: Classification: T = {C 1, C 2 } = {1, 0}. Regression: T = [0, 1] (i.e. output needs to be normalized.
Logistic Regression for Binary Classification Model output can be interpreted as posterior class probabilities: p(c 1 x = σ ( T x = 1 1+ exp( T x p(c 2 x =1 σ ( T x = exp( T x 1+ exp( T x Ho do e train a logistic regression model? What error/cost function to minimize?
Logistic Regression Learning Learning = finding the right parameters T = [ 0, 1,, k ] Find that minimizes an error function E( hich measures the misfit beteen h(x n, and t n. Expect that h(x, performing ell on training examples x n Þ h(x, ill perform ell on arbitrary test examples x Î X. Least Squares error function? E( = 1 2 N n=1 {h(x n, t n } 2 Differentiable => can use gradient descent Non-convex => not guaranteed to find the global optimum
Maximum Likelihood Training set is D = {áx n, t n ñ t n Î {0,1}, n Î 1 N} Let h n = p(c 1 x n h n = p(t n =1 x n = σ ( T x n Maximum Likelihood (ML principle: find parameters that maximize the likelihood of the labels. The likelihood function is N p(t = h n t n (1 h n (1 t n n=1 The negative log-likelihood (cross entropy error function: N { } E( = ln p(t x = t n lnh n + (1 t n ln(1 h n n=1
Maximum Likelihood Learning for Logistic Regression The ML solution is: ML = argmax p(t = argmin E( convex in ML solution is given by ÑE( = 0. Cannot solve analytically => solve numerically ith gradient based methods: (stochastic gradient descent, conjugate gradient, L-BFGS, etc. Gradient is (prove it: E( = N n=1 (h n t n x n T
Regularized Logistic Regression Use a Gaussian prior over the parameters: = [ 0, 1,, M ] T Bayes Theorem: MAP solution: þ ý ü î í ì - ø ö ç è æ = = + - I 0 T M Ν p 2 exp 2, ( ( 2 1 / ( 1 a p a a ( ( ( ( ( ( t t t t p p p p p p µ = ( max arg t p MAP =
Regularized Logistic Regression MAP solution: = MAP arg max p( t = arg max p ( t p ( = arg min- ln p( t p( = arg min- ln p( t - ln p( = arg min E D ( - ln p( a T = arg min E D ( + 2 = arg mine D ( + E ( E E D ( ( = - N å{ tn ln yn + (1 - tnln(1 - yn } n= 1 a = 2 T regularization term data term
Regularized Logistic Regression MAP solution: MAP = arg min E D ( + E ( still convex in ML solution is given by ÑE( = 0. ÑE( = ÑE D ( + ÑE ( Cannot solve analytically => solve numerically: N = (h n t n x n T +α T n=1 (stochastic gradient descent [PRML 3.1.3], Neton Raphson iterative optimization [PRML 4.3.3], conjugate gradient, LBFGS. here h n = σ ( T x n
Softmax Regression = Logistic Regression for Multiclass Classification Multiclass classification: T = {C 1, C 2,..., C K } = {1, 2,..., K}. Training set is (x 1,t 1, (x 2,t 2, (x n,t n. x = [1, x 1, x 2,..., x M ] t 1, t 2, t n Î {1, 2,..., K} One eight vector per class [PRML 4.3.4]: p(c k x = exp( k T x exp( j T x j
Softmax Regression (K ³ 2 Inference: C* = arg max p( Ck x C k = argmax C k exp( T k x exp( T j x j Z(x a normalization constant Training using: = argmax C k exp( k T x = argmax C k k T x Maximum Likelihood (ML Maximum A Posteriori (MAP ith a Gaussian prior on.
Softmax Regression The negative log-likelihood error function is: E D ( = 1 N ln p(t n x n = 1 N = 1 N N n=1 N N n=1 ln exp( T t n x n Z(x n K n=1 k=1 δ k (t n ln exp( T k x n Z(x n convex in here d ( x t = ì1 í î0 x x = ¹ t t is the Kronecker delta function.
Softmax Regression The ML solution is: = arg min ML E D ( The gradient is (prove it: k E D ( = 1 N = 1 N N n=1 N n=1 ( δ k (t n p(c k x n x n δ k (t n exp( T k x n Z(x n x n E D ( = " # T E 1 D (, T 2 E D (,, T K E D ( $ % T
Regularized Softmax Regression The ne cost function is: E( = E D (+ E ( = 1 N N K δ k (t n ln exp( T k x n + α Z(x n 2 n=1 k=1 K k=1 T k k The ne gradient is (prove it: k E( = 1 N N n=1 T ( δ k (t n p(c k x n x n +α k T
Softmax Regression ML solution is given by ÑE D ( = 0. Cannot solve analytically. Solve numerically, by pluging [cost, gradient] = [E D (, ÑE D (] values into general convex solvers: L-BFGS Neton methods conjugate gradient (stochastic / minibatch gradient-based methods. gradient descent (ith / ithout momentum. AdaGrad, AdaDelta RMSProp ADAM,...
Implementation Need to compute [cost, gradient]: cost = 1 N gradient k N δ k (t n ln p(c k x n + α 2 n=1 k=1 => need to compute, for k = 1,..., K: K = 1 N N n=1 K k=1 T k k T T ( δ k (t n p(c k x n x n +α k output p(c k x n = exp( T k x n exp( T j x n j Overflo hen kt x n are too large.
Implementation: Preventing Overflos Subtract from each product kt x n the maximum product: c = max 1 k K k T x n p(c k x n = exp( k T x n c exp( j T x n c j
Implementation: Gradient Checking Want to minimize J(θ, here θ is a scalar. Mathematical definition of derivative: d J(θ +ε J(θ ε J(θ = lim dθ ε 2ε Numerical approximation of derivative: d J(θ +ε J(θ ε J(θ dθ 2ε here ε = 0.0001
Implementation: Gradient Checking If θ is a vector of parameters θ i, Compute numerical derivative ith respect to each θ i. Aggregate all derivatives into numerical gradient G num (θ. Compare numerical gradient G num (θ ith implementation of gradient G imp (θ: G num (θ G imp (θ G num (θ+ G imp (θ 10 6