Machine Learning Basics Lecture 2: Linear Classification
Princeton University COS 495
Instructor: Yingyu Liang
Review: machine learning basics
Math formulation
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$, find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$, such that the expected loss $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$ is small.
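A minimal sketch of the empirical loss under these definitions; the model $f$ and the squared loss below are illustrative placeholders, not part of the lecture:

```python
import numpy as np

# Empirical loss: L_hat(f) = (1/n) * sum_i l(f, x_i, y_i).
def empirical_loss(f, loss, X, y):
    return np.mean([loss(f, xi, yi) for xi, yi in zip(X, y)])

def squared_loss(f, x, y):
    return (f(x) - y) ** 2

# Example: a fixed linear predictor on toy data.
f = lambda x: 2.0 * x
X = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 2.2, 3.9])
print(empirical_loss(f, squared_loss, X, y))
```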
Machine learning 1-2-3
- Collect data and extract features (experience)
- Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$ (prior knowledge)
- Optimization: minimize the empirical loss
Example: linear regression
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$, find $f_w(x) = w^T x$ (linear model $\mathcal{H}$) that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$ (the $l_2$ loss).
Why the $l_2$ loss?
Why not choose another loss ($l_1$ loss, hinge loss, exponential loss, ...)?
- Empirical: easy to optimize; for the linear case there is a closed form $w = (X^T X)^{-1} X^T y$.
- Theoretical: a way to encode prior knowledge.
Questions: What kind of prior knowledge? Is there a principled way to derive the loss?
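A small sketch of this closed-form solution; the data shapes and values below are synthetic and illustrative:

```python
import numpy as np

# Closed-form least squares: w = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # n = 100 examples, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)    # noisy linear targets

# Solving the normal equations is preferred over forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # should be close to w_true
```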
Maximum likelihood estimation
Maximum likelihood estimation (MLE)
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$, let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$. We would like to pick $\theta$ so that $P_\theta(x, y)$ fits the data well.
- Fitness of $\theta$ to one data point $(x_i, y_i)$: $\text{likelihood}(\theta; x_i, y_i) := P_\theta(x_i, y_i)$.
- Fitness of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$: $\text{likelihood}(\theta; \{x_i, y_i\}) := P_\theta(\{x_i, y_i\}) = \prod_i P_\theta(x_i, y_i)$.
- MLE: maximize the fitness of $\theta$ to the i.i.d. data points:
$\theta_{ML} = \operatorname*{argmax}_{\theta \in \Theta} \prod_i P_\theta(x_i, y_i)$.
Since $\log$ is monotone, this is equivalent to
$\theta_{ML} = \operatorname*{argmax}_{\theta \in \Theta} \log \Big[ \prod_i P_\theta(x_i, y_i) \Big] = \operatorname*{argmax}_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)$.
Equivalently, MLE minimizes the negative log-likelihood loss:
$l(P_\theta, x_i, y_i) = -\log P_\theta(x_i, y_i)$, so $\hat{L}(P_\theta) = -\sum_i \log P_\theta(x_i, y_i)$.
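To make the recipe concrete, here is a minimal MLE sketch for a toy model; the model family, data, and grid below are illustrative assumptions, not from the lecture. We fit the mean of a Normal with unit variance, where minimizing the negative log-likelihood recovers the sample mean:

```python
import numpy as np

# Model family: P_theta(y) = Normal(y; theta, 1), theta a scalar.
# Negative log-likelihood of i.i.d. data:
#   L(theta) = -sum_i log P_theta(y_i) = 0.5 * sum_i (y_i - theta)^2 + const.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=1000)  # synthetic i.i.d. data

def neg_log_likelihood(theta, y):
    # Drop the constant log(2*pi)/2 terms, which do not depend on theta.
    return 0.5 * np.sum((y - theta) ** 2)

# Minimize by scanning a grid of candidate theta values.
grid = np.linspace(0.0, 4.0, 401)
nll = np.array([neg_log_likelihood(t, y) for t in grid])
theta_ml = grid[np.argmin(nll)]

print(theta_ml, y.mean())  # the grid minimizer matches the sample mean
```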
MLE: conditional log-likelihood
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$, let $\{P_\theta(y \mid x) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$.
MLE with the negative conditional log-likelihood loss:
$\theta_{ML} = \operatorname*{argmax}_{\theta \in \Theta} \sum_i \log P_\theta(y_i \mid x_i)$,
where $l(P_\theta, x_i, y_i) = -\log P_\theta(y_i \mid x_i)$ and $\hat{L}(P_\theta) = -\sum_i \log P_\theta(y_i \mid x_i)$.
We only care about predicting $y$ from $x$; we do not care about $p(x)$. Modeling $P(y \mid x)$ is discriminative; modeling $P(x, y)$ is generative.
Example: $l_2$ loss
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$, find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$.
Define $P_\theta(y \mid x) = \text{Normal}(y; f_\theta(x), \sigma^2)$. Then
$\log P_\theta(y_i \mid x_i) = -\frac{1}{2\sigma^2} (f_\theta(x_i) - y_i)^2 - \log \sigma - \frac{1}{2} \log(2\pi)$,
so
$\theta_{ML} = \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$.
$l_2$ loss: Normal + MLE.
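Spelling out the step from the per-example log-likelihood to the argmin (a short derivation, treating $\sigma$ as a fixed constant):

```latex
\begin{aligned}
\theta_{ML}
&= \operatorname*{argmax}_{\theta \in \Theta} \sum_{i=1}^{n} \log P_\theta(y_i \mid x_i) \\
&= \operatorname*{argmax}_{\theta \in \Theta} \sum_{i=1}^{n}
   \left[ -\tfrac{1}{2\sigma^2} \left( f_\theta(x_i) - y_i \right)^2
          - \log \sigma - \tfrac{1}{2} \log(2\pi) \right] \\
&= \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^{n} \left( f_\theta(x_i) - y_i \right)^2
   \quad \text{(drop terms constant in $\theta$, flip the sign)} \\
&= \operatorname*{argmin}_{\theta \in \Theta} \tfrac{1}{n} \sum_{i=1}^{n} \left( f_\theta(x_i) - y_i \right)^2 .
\end{aligned}
```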
Linear classification
Example 1: image classification
[Figure: example photographs labeled "indoor" vs. "outdoor".]
Example 2: spam detection

            # "$"   # "Mr."   # "sale"   Spam?
Email 1       2        1         1       Yes
Email 2       0        1         0       No
Email 3       1        1         1       Yes
...
Email n       0        0         0       No
New email     0        0         1       ??
Why classification?
Classification is a kind of summary: easy to interpret, and easy to use for making decisions.
Linear classification
[Figure: hyperplane $w^T x = 0$ with normal vector $w$; the halfspace $w^T x > 0$ is Class 1 and $w^T x < 0$ is Class 0.]
Linear classification: natural attempt
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$. Hypothesis (linear model $\mathcal{H}$): $f_w(x) = w^T x$, with $y = 1$ if $w^T x > 0$ and $y = 0$ if $w^T x < 0$. Prediction: $y = \text{step}(f_w(x)) = \text{step}(w^T x)$.
Linear classification: natural attempt
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$, find $f_w(x) = w^T x$ to minimize the 0-1 loss $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}[\text{step}(w^T x_i) \ne y_i]$. Drawback: difficult to optimize (NP-hard in the worst case).
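A minimal sketch of this 0-1 empirical loss (array names are illustrative); note that the loss is piecewise constant in $w$, which is one reason it is hard to optimize directly:

```python
import numpy as np

def step(z):
    # step(z) = 1 if z > 0 else 0, applied elementwise
    return (z > 0).astype(int)

def zero_one_loss(w, X, y):
    # Fraction of training points where step(w^T x_i) != y_i.
    preds = step(X @ w)
    return np.mean(preds != y)

# The loss is piecewise constant in w: small changes in w usually change no
# predictions, so the gradient is zero almost everywhere.
```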
Linear classification: simple approach
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$, find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$. This reduces to linear regression, ignoring the fact that $y \in \{0, 1\}$.
Linear classification: simple approach
Drawback: not robust to outliers.
[Figure borrowed from Pattern Recognition and Machine Learning, Bishop.]
Compare the two
[Figure: $y$ plotted against $w^T x$ for the linear output $y = w^T x$ and the thresholded output $y = \text{step}(w^T x)$.]
Between the two
We want a prediction that is bounded in $[0,1]$ and smooth. Sigmoid: $\sigma(a) = \frac{1}{1 + \exp(-a)}$.
[Figure borrowed from Pattern Recognition and Machine Learning, Bishop.]
Linear classification: sigmoid prediction
Squash the output of the linear function: $\text{Sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$. Find $w$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (\sigma(w^T x_i) - y_i)^2$.
Linear classification: logistic regression
Squash the output of the linear function: $\sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$. A better approach: interpret the output as a probability,
$P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$, and $P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x)$.
Linear classification: logistic regression
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$, find $w$ that minimizes the negative conditional log-likelihood
$\hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \log P_w(y_i \mid x_i) = -\frac{1}{n} \sum_{i: y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n} \sum_{i: y_i = 0} \log[1 - \sigma(w^T x_i)]$.
Logistic regression: MLE with sigmoid. There is no closed-form solution; we need to use gradient descent.
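Using the sigmoid gradient identity from the next slide, the gradient of this loss works out to $\frac{1}{n} \sum_i (\sigma(w^T x_i) - y_i) x_i$. Below is a plain gradient-descent sketch on synthetic data; the data, step size, and iteration count are illustrative choices, not from the lecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neg_log_likelihood(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    # d/dw of the negative log-likelihood: (1/n) * X^T (sigmoid(Xw) - y)
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Synthetic linearly separable data (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)

w = np.zeros(2)
lr = 0.5  # step size, chosen by hand
for _ in range(1000):
    w -= lr * gradient(w, X, y)

print(w, neg_log_likelihood(w, X, y))
```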
Properties of the sigmoid function
- Bounded: $\sigma(a) = \frac{1}{1 + \exp(-a)} \in (0, 1)$.
- Symmetric: $1 - \sigma(a) = \frac{\exp(-a)}{1 + \exp(-a)} = \frac{1}{\exp(a) + 1} = \sigma(-a)$.
- Gradient: $\sigma'(a) = \frac{\exp(-a)}{(1 + \exp(-a))^2} = \sigma(a)(1 - \sigma(a))$.
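A quick numerical check of the symmetry and gradient identities (a sketch; the test points and tolerances are arbitrary choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 11)
eps = 1e-6

# Symmetry: 1 - sigmoid(a) == sigmoid(-a)
assert np.allclose(1 - sigmoid(a), sigmoid(-a))

# Gradient: central finite differences vs. sigmoid(a) * (1 - sigmoid(a))
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(numeric, sigmoid(a) * (1 - sigmoid(a)), atol=1e-8)
```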