Introduction to Logistic Regression and Support Vector Machine

Introduction to Logistic Regression and Support Vector Machine
Guest lecturer: Ming-Wei Chang
CS 446, Fall 2009

Before we start
- Feel free to ask questions anytime.
- The slides are newly made. Please tell me if you find any mistakes.

Today: supervised learning algorithms
- Pipeline (diagram): training data -> machine learning model -> testing data
- Supervised learning algorithms we have mentioned:
  Decision Tree
  Online Learning: Perceptron, Winnow, ...
  Generative Model: Naive Bayes
- What are we going to talk about today? Modern supervised learning algorithms: specifically, logistic regression and the support vector machine.

Motivation
- Logistic regression and support vector machines are both very popular!
- Batch learning algorithms that use optimization algorithms as training algorithms: an important technique we need to be familiar with. Learn not to be afraid of these algorithms.
- Understand the relationships between these algorithms and the algorithms we have already learned.

Review: Naive Bayes
- Notation: input x, output y in {+1, -1}. Assume each x has m features; we use x_j to denote the j-th feature of x.
- Graphical model: y -> x_1, x_2, x_3, x_4, x_5 (the features are conditionally independent given y).

Review: Naive Bayes
- Model: P(y, x) = P(y) * prod_{j=1}^{m} P(x_j | y)
- Training: maximize the likelihood P(D) = P(Y, X) = prod_{i=1}^{l} P(y_i, x_i).
  Algorithm: estimate P(y = +1) and P(y = -1) by counting; estimate P(x_j | y) by counting.
- Testing: predict +1 if P(y=+1 | x) / P(y=-1 | x) = P(y=+1, x) / P(y=-1, x) >= 1.
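A minimal sketch of the counting-based training and testing described above, assuming binary features, labels in {+1, -1}, and add-alpha smoothing; the function names and smoothing choice are illustrative, not from the lecture.

```python
import numpy as np

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate P(y) and P(x_j | y) by counting, with add-alpha smoothing.
    X: (l, m) binary feature matrix, y: (l,) labels in {+1, -1}."""
    params = {}
    for label in (+1, -1):
        Xc = X[y == label]
        # P(y = label): fraction of training examples with this label
        prior = len(Xc) / len(X)
        # P(x_j = 1 | y = label): per-feature counts, smoothed
        cond = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        params[label] = (prior, cond)
    return params

def nb_predict(params, x):
    """Return +1 if P(y=+1, x) >= P(y=-1, x), else -1."""
    scores = {}
    for label, (prior, cond) in params.items():
        # log P(y) + sum_j log P(x_j | y)
        scores[label] = np.log(prior) + np.sum(
            np.where(x == 1, np.log(cond), np.log(1 - cond)))
    return +1 if scores[+1] >= scores[-1] else -1
```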

Review: Naive Bayes
- The prediction function of a Naive Bayes model is a linear function. In previous lectures, we have shown that
  log [P(y=+1 | x) / P(y=-1 | x)] >= 0  <=>  w^T x + b >= 0,
  i.e. the counting results can be re-expressed as a linear function.
- Key observation: Naive Bayes cannot express all possible linear functions. Intuition: the conditional independence assumption.
- In the next few slides we will introduce a model (logistic regression) that can express all possible linear functions.
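A sketch of the re-expression claimed above, assuming binary features and the train_naive_bayes helper from the earlier sketch; it simply expands log P(y=+1, x) - log P(y=-1, x) term by term.

```python
import numpy as np

def nb_to_linear(params):
    """Re-express counting-based Naive Bayes (binary features) as w^T x + b >= 0.
    `params` is the dict returned by the train_naive_bayes sketch above."""
    (p_pos, cond_pos), (p_neg, cond_neg) = params[+1], params[-1]
    # log P(y=+1, x) - log P(y=-1, x)
    #   = log P(+)/P(-) + sum_j [ x_j log(c+_j/c-_j) + (1 - x_j) log((1-c+_j)/(1-c-_j)) ]
    w = np.log(cond_pos / cond_neg) - np.log((1 - cond_pos) / (1 - cond_neg))
    b = np.log(p_pos / p_neg) + np.sum(np.log((1 - cond_pos) / (1 - cond_neg)))
    return w, b
```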

Modeling the conditional probability with a linear function
- Starting point: the prediction function of Naive Bayes,
  log [P(y=+1 | x) / P(y=-1 | x)] = w^T x + b
- Since P(y=-1 | x) = 1 - P(y=+1 | x):
  P(y=+1 | x) / (1 - P(y=+1 | x)) = e^{w^T x + b}
  P(y=+1 | x) = e^{w^T x + b} / (1 + e^{w^T x + b}) = 1 / (1 + e^{-(w^T x + b)})
- The conditional probability: P(y | x) = 1 / (1 + e^{-y (w^T x + b)})
- To simplify the notation, absorb the bias: w^T <- [w^T  b], x^T <- [x^T  1].
  Using this bias trick, P(y | x) = 1 / (1 + e^{-y (w^T x)}).
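A small sketch of the conditional probability above, using the bias trick of appending a constant 1 to each x; the function names are illustrative.

```python
import numpy as np

def add_bias(X):
    """Bias trick: append a constant 1 so that w absorbs the bias b."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def cond_prob(w, x, y):
    """P(y | x, w) = 1 / (1 + exp(-y * w^T x)) for y in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))
```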

Logistic regression: introduction
- Naive Bayes models P(y, x) with the conditional independence assumption: training = counting, but not all w are possible.
- In the testing phase, we just showed that the conditional probability can be expressed as
  P(y | x, w) = 1 / (1 + e^{-y (w^T x)})    (1)
- Logistic regression maximizes the conditional likelihood P(y | x) directly in the training phase.
- How to find w?
  w = argmax_w P(Y | X, w) = argmax_w prod_{i=1}^{l} P(y_i | x_i, w)
- Over all possible w, find the one that maximizes the conditional likelihood: we drop the conditional independence assumption!

Logistic regression: the final objective function
- w = argmax_w P(Y | X, w) = argmax_w prod_{i=1}^{l} P(y_i | x_i, w)
- Finding w as an optimization problem:
  w = argmax_w log P(Y | X, w) = argmin_w -log P(Y | X, w)
    = argmin_w sum_{i=1}^{l} -log [ 1 / (1 + e^{-y_i (w^T x_i)}) ]
    = argmin_w sum_{i=1}^{l} log(1 + e^{-y_i (w^T x_i)})
- Properties of this optimization problem: it is a convex optimization problem.

Adding regularization
- Empirical loss: log(1 + e^{-y_i (w^T x_i)}). As y_i (w^T x_i) increases, log(1 + e^{-y_i (w^T x_i)}) decreases.
- In order to minimize the empirical loss, w will tend to be large. Therefore, to prevent over-fitting, we add a regularization term:
  min_w  (1/2) w^T w + C sum_{i=1}^{l} log(1 + e^{-y_i (w^T x_i)})
  where (1/2) w^T w is the regularization term, the sum is the empirical loss, and C is the balance parameter.
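A sketch of the regularized objective and its gradient, which the optimization methods on the next slide need; the vectorized form and helper names are my assumptions, not part of the lecture.

```python
import numpy as np

def lr_objective(w, X, y, C):
    """(1/2) w^T w + C * sum_i log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)                      # y_i * w^T x_i for all i
    return 0.5 * w @ w + C * np.sum(np.log1p(np.exp(-margins)))

def lr_gradient(w, X, y, C):
    """Gradient: w - C * sum_i y_i x_i * sigma(-y_i w^T x_i)."""
    margins = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))      # = sigma(-margin), per-example weight
    return w - C * X.T @ (y * sigma)
```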

Optimization
- This is an unconstrained problem, so we can use the gradient descent algorithm! However, it is quite slow.
- Many other methods exist: iterative scaling; non-linear conjugate gradient; quasi-Newton methods; truncated Newton methods; trust-region Newton methods. All are iterative methods that generate a sequence w_k converging to the optimal solution of the optimization problem above.
- The choice of optimization technique is a trade-off: low cost per iteration but slow convergence (e.g. iterative scaling, which updates one component of w at a time) versus high cost per iteration but fast convergence (e.g. Newton methods).
- Currently: limited-memory BFGS (L-BFGS) is very popular in the NLP community.
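As one concrete example from the quasi-Newton family mentioned above, SciPy's L-BFGS routine can minimize the regularized objective directly. This is a usage sketch on toy data, assuming the add_bias, lr_objective, and lr_gradient helpers sketched earlier; it is not the optimizer used in the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data; add_bias appends the constant-1 column from the bias trick.
rng = np.random.default_rng(0)
X = add_bias(rng.normal(size=(100, 5)))
y = np.sign(rng.normal(size=100))
y[y == 0] = 1

result = minimize(
    fun=lambda w: lr_objective(w, X, y, C=1.0),
    x0=np.zeros(X.shape[1]),
    jac=lambda w: lr_gradient(w, X, y, C=1.0),
    method="L-BFGS-B",          # limited-memory BFGS
)
w_star = result.x
```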

Logistic regression versus Naive Bayes

                        Logistic regression        Naive Bayes
  Training              maximize P(Y | X)          maximize P(Y, X)
  Training algorithm    optimization algorithms    counting
  Testing               P(y | x) >= 0.5?           P(y | x) >= 0.5?

  Table: Comparison between Naive Bayes and logistic regression

- LR and NB are both linear functions in the testing phase; however, their training agendas are very different.

Support Vector Machine: another loss function
- No magic. Just another loss function.
- Logistic regression:  min_w (1/2) w^T w + C sum_{i=1}^{l} log(1 + e^{-y_i (w^T x_i)})
- L1-loss SVM:          min_w (1/2) w^T w + C sum_{i=1}^{l} max(0, 1 - y_i w^T x_i)
- L2-loss SVM:          min_w (1/2) w^T w + C sum_{i=1}^{l} max(0, 1 - y_i w^T x_i)^2

Compare these loss functions
[Figure: the three per-example losses plotted as functions of the margin y_i w^T x_i]
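As a concrete companion to this comparison, a small sketch that evaluates the three per-example losses at a few margin values; the grid of margins is illustrative.

```python
import numpy as np

def logistic_loss(margin):
    return np.log1p(np.exp(-margin))

def l1_hinge(margin):
    return np.maximum(0.0, 1.0 - margin)

def l2_hinge(margin):
    return np.maximum(0.0, 1.0 - margin) ** 2

# All three losses shrink as the margin y_i w^T x_i grows.
for m in np.linspace(-2, 3, 6):
    print(f"margin={m:+.1f}  logistic={logistic_loss(m):.3f}  "
          f"L1-hinge={l1_hinge(m):.3f}  L2-hinge={l2_hinge(m):.3f}")
```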

The regularization term: maximizing the margin
- The L1-loss SVM:  min_w (1/2) w^T w + C sum_{i=1}^{l} max(0, 1 - y_i w^T x_i)
- Rewrite it using slack variables (why are they the same?):
  min_{w, xi}  (1/2) w^T w + C sum_{i=1}^{l} xi_i
  s.t.  y_i w^T x_i >= 1 - xi_i,  xi_i >= 0
- If there is no training error, what is the margin of w? It is 1 / ||w||.
- Maximizing 1 / ||w|| is equivalent to minimizing w^T w. SVM regularization: find the linear separator that maximizes the margin.
- Learning theory: link to SVM theory notes.
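To answer the "why are they the same?" question above: for any fixed w, each slack variable xi_i appears only through the term C * xi_i and its two constraints, so at the optimum it takes the smallest feasible value. A one-line sketch of the substitution (my wording, not the lecture's):

```latex
\xi_i^{\star} = \max\!\left(0,\; 1 - y_i w^{T} x_i\right)
\quad\Longrightarrow\quad
\min_{w}\; \tfrac{1}{2} w^{T} w + C \sum_{i=1}^{l} \max\!\left(0,\; 1 - y_i w^{T} x_i\right),
```

which recovers the L1-loss SVM objective.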

Balance between regularization and empirical loss
(a) Training data and an overfitting classifier   (b) Testing data and an overfitting classifier
- The maximal-margin line with 0 training error: is it the best?

Balance between regularization and empirical loss
(c) Training data and a better classifier   (d) Testing data and a better classifier
- If we allow some training error, we can find a better line.
- We need to balance the regularization term and the empirical loss term. This is a problem of model selection: select the balance parameter C with cross validation.

Primal and Dual Formulations
- Explaining the primal-dual relationship: link to the lecture notes, 07-LecSvm-opt.pdf.
- Why the primal-dual relationship is useful: link to a talk by Professor Chih-Jen Lin in 2005, "Optimization, Support Vector Machines, and Machine Learning" (talk in DIS, University of Rome and IASI, CNR, Italy, September 1-2, 2005). We will only use slides 1-20. Link to notes.

Nonlinear SVM
- SVM tries to find a linear separator that maximizes the margin. Why do people talk about non-linear SVM?
- Usually this means: map x -> phi(x) and find a linear function of phi(x). This can be a non-linear function of x.
- Primal:
  min_{w, xi}  (1/2) w^T w + C sum_{i=1}^{l} xi_i
  s.t.  y_i w^T phi(x_i) >= 1 - xi_i,  xi_i >= 0,  i = 1 ... l
- Dual:
  min_alpha  (1/2) alpha^T Q alpha - e^T alpha
  s.t.  0 <= alpha_i <= C for all i,
  where Q is an l-by-l matrix with Q_ij = y_i y_j K(x_i, x_j).
- The same holds for the kernel perceptron: find a linear function on phi(x). (Demo)
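A sketch of how the dual's Q matrix can be built, assuming an RBF kernel as one common choice of K; the kernel and the gamma value are my illustration, not specified in the lecture.

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=0.5):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), one common kernel choice."""
    diff = xi - xj
    return np.exp(-gamma * np.dot(diff, diff))

def build_Q(X, y, kernel=rbf_kernel):
    """Q_ij = y_i y_j K(x_i, x_j): the l-by-l matrix in the dual objective."""
    l = len(X)
    Q = np.empty((l, l))
    for i in range(l):
        for j in range(l):
            Q[i, j] = y[i] * y[j] * kernel(X[i], X[j])
    return Q
```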

Solving SVM
- Both the primal and the dual problems have constraints, so we cannot use the gradient descent algorithm directly.
- For the linear dual SVM, there is a simple optimization algorithm: the coordinate descent method!
  min_alpha  (1/2) alpha^T Q alpha - e^T alpha   s.t.  0 <= alpha_i <= C for all i
- The number of variables alpha_i equals the number of training examples. The idea: pick one example i and optimize alpha_i only.

Coordinate descent algorithm
- Run through the training data multiple times.
- Pick a random example i among the training data.
- Fix alpha_1, alpha_2, ..., alpha_{i-1}, alpha_{i+1}, ..., alpha_l and only change alpha_i: alpha_i = alpha_i + s.
- Solve the problem (see the sketch after this list):
  min_s  (1/2) (alpha + s d)^T Q (alpha + s d) - e^T (alpha + s d)
  s.t.  0 <= alpha_i + s <= C
  There is only one constraint; d is a vector of l zeros whose i-th component is 1.
- This is a single-variable problem. We know how to solve it.
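A minimal sketch of this idea for the linear L1-loss SVM without a bias term, in the spirit of dual coordinate descent: for the linear kernel the single-variable subproblem has a closed-form clipped solution, and maintaining w = sum_i alpha_i y_i x_i keeps each step cheap. The variable names and the fixed number of epochs are my assumptions.

```python
import numpy as np

def dual_coordinate_descent_svm(X, y, C=1.0, epochs=10, seed=0):
    """Dual coordinate descent for the linear L1-loss SVM (no bias term).
    Keeps w = sum_i alpha_i * y_i * x_i up to date after every single-variable step."""
    rng = np.random.default_rng(seed)
    l, n = X.shape
    alpha = np.zeros(l)
    w = np.zeros(n)
    Qii = np.einsum("ij,ij->i", X, X)           # Q_ii = x_i^T x_i for the linear kernel
    for _ in range(epochs):                     # run through the data multiple times
        for i in rng.permutation(l):            # pick examples in random order
            if Qii[i] == 0:
                continue
            G = y[i] * np.dot(w, X[i]) - 1.0    # gradient of the dual w.r.t. alpha_i
            new_alpha = np.clip(alpha[i] - G / Qii[i], 0.0, C)   # closed-form clipped step
            w += (new_alpha - alpha[i]) * y[i] * X[i]            # perceptron-like update
            alpha[i] = new_alpha
    return w, alpha
```

The w update in the inner loop is exactly the w <- w + (alpha_i' - alpha_i) y_i x_i step discussed on the next slide.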

Coordinate descent algorithm
- Assume that the optimal s is s*. We can update alpha_i using alpha_i = alpha_i + s*. This is similar to the dual perceptron.
- Given that w = sum_{i=1}^{l} alpha_i y_i x_i, this is equivalent to w <- w + (alpha_i' - alpha_i) y_i x_i, where alpha_i' is the updated value. This is similar to the primal perceptron.
- Isn't this familiar?
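To make the "isn't this familiar?" comparison concrete, a minimal primal perceptron pass; the mistake-driven condition and the learning rate are the standard ones, not taken from this lecture. The SVM coordinate step w += (alpha_i' - alpha_i) * y_i * x_i has the same shape, only the step size differs.

```python
import numpy as np

def perceptron_epoch(w, X, y, lr=1.0):
    """One pass of the primal perceptron over the training data."""
    for i in range(len(X)):
        if y[i] * np.dot(w, X[i]) <= 0:         # mistake-driven: update only on errors
            w = w + lr * y[i] * X[i]            # w <- w + y_i x_i
    return w
```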

Relationships between linear classifiers
- NB, LR, the perceptron, and SVM are all linear classifiers.
- NB and LR have the same interpretation of the conditional probability:
  P(y | x, w) = 1 / (1 + e^{-y (w^T x)})    (2)
- The difference between LR and SVM is their loss functions, but the two are quite similar!
- The perceptron algorithm and the coordinate descent algorithm for SVM are very similar.

Summary
- Logistic regression:
  Maximizes P(Y | X), while Naive Bayes maximizes the joint probability P(Y, X).
  Models the conditional probability using a linear function and drops the conditional independence assumption.
  Many methods are available for optimizing the objective function.
- Support Vector Machine:
  Similar to logistic regression, with a different loss function.
  Maximizes the margin and has many nice theoretical properties.
  Interesting primal-dual relationship, which allows us to choose the easier problem to solve.
  Many methods are available for optimizing the objective function; the linear dual coordinate descent method turns out to be similar to the perceptron.