Machine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang

1 Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang

2 Review: machine learning basics

3 Math formulation
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Find $y = f(x) \in \mathcal{H}$ that minimizes $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$,
s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$.
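To make the empirical loss on this slide concrete, here is a minimal sketch that averages a per-example loss over a toy dataset. The names `squared_loss`, `empirical_loss`, `double`, `identity`, and the data values are assumptions made for illustration; they do not come from the lecture.

```python
def squared_loss(f, x, y):
    """Per-example loss l(f, x, y); the squared error is just one possible choice."""
    return (f(x) - y) ** 2


def empirical_loss(f, data, loss):
    """Average of loss(f, x_i, y_i) over the n training pairs."""
    return sum(loss(f, x, y) for x, y in data) / len(data)


def double(x):
    """Hypothetical hypothesis f(x) = 2x."""
    return 2.0 * x


def identity(x):
    """Hypothetical hypothesis f(x) = x."""
    return x


# Toy training data generated by y = 2x (illustrative values only).
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]

loss_double = empirical_loss(double, data, squared_loss)
loss_identity = empirical_loss(identity, data, squared_loss)
```

On this toy data `double` fits every point, so its empirical loss is lower than that of `identity`. The expected loss cannot be computed this way because it needs the unknown distribution $D$.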

4 Machine learning
Collect data and extract features.
Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$.
Optimization: minimize the empirical loss.

5 Machine learning
Collect data and extract features (experience).
Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$ (prior knowledge).
Optimization: minimize the empirical loss.

6 Example: Linear regression
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Find $f_w(x) = w^T x$ (linear model $\mathcal{H}$) that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$ ($l_2$ loss).

7 Why $l_2$ loss
Why not choose another loss: $l_1$ loss, hinge loss, exponential loss, ...?
Empirical: easy to optimize. For the linear case: $w = (X^T X)^{-1} X^T y$.
Theoretical: a way to encode prior knowledge.
Questions: What kind of prior knowledge? Is there a principled way to derive the loss?
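The closed form $w = (X^T X)^{-1} X^T y$ can be sketched in plain Python for the special case of two features, where the $2 \times 2$ inverse can be written out by hand. The function name `fit_least_squares_2d` and the toy data are assumptions for illustration. A practical implementation would normally use a linear algebra library rather than an explicit inverse.

```python
def fit_least_squares_2d(X, y):
    """Return w = (X^T X)^{-1} X^T y when every row of X has exactly two features."""
    # Entries of the symmetric 2x2 matrix X^T X = [[a, b], [b, d]].
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    # Entries of the vector X^T y = [u, v].
    u = sum(r[0] * t for r, t in zip(X, y))
    v = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X is singular; the closed form does not apply")
    # Multiply the explicit inverse (1/det) * [[d, -b], [-b, a]] by [u, v].
    return [(d * u - b * v) / det, (a * v - b * u) / det]


# Each row is [1, x]: a constant feature for the intercept plus one input.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]  # generated by y = 1 + 2x, so the fit should recover [1, 2]

w = fit_least_squares_2d(X, y)
```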

8 Maximum likelihood Estimation

9 Maximum likelihood Estimation (MLE)
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$.
Would like to pick $\theta$ so that $P_\theta(x, y)$ fits the data well.

10 Maximum likelihood Estimation (MLE)
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$.
"Fitness" of $\theta$ to one data point $(x_i, y_i)$: $\mathrm{likelihood}(\theta; x_i, y_i) := P_\theta(x_i, y_i)$.

11 Maximum likelihood Estimation (MLE)
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$.
"Fitness" of $\theta$ to i.i.d. data points $\{(x_i, y_i)\}$: $\mathrm{likelihood}(\theta; \{x_i, y_i\}) := P_\theta(\{x_i, y_i\}) = \prod_i P_\theta(x_i, y_i)$.

12 Maximum likelihood Estimation (MLE)
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$.
MLE: maximize "fitness" of $\theta$ to i.i.d. data points $\{(x_i, y_i)\}$:
$\theta_{ML} = \mathrm{argmax}_{\theta \in \Theta} \prod_i P_\theta(x_i, y_i)$.

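13 Maximum likelihood Estimation (MLE)
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$.
MLE: maximize "fitness" of $\theta$ to i.i.d. data points $\{(x_i, y_i)\}$:
$\theta_{ML} = \mathrm{argmax}_{\theta \in \Theta} \log\left[\prod_i P_\theta(x_i, y_i)\right]$
$\theta_{ML} = \mathrm{argmax}_{\theta \in \Theta} \sum_i \log[P_\theta(x_i, y_i)]$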
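Because the logarithm is increasing, the product form and the sum-of-logs form share the same maximizer. The sketch below checks this on a deliberately simple family that is an assumption for illustration only: a Bernoulli distribution over binary outcomes with parameter `theta`, with the inputs $x$ ignored and the maximization done by grid search. The function names and data are hypothetical.

```python
import math


def likelihood(theta, ys):
    """Product over i of P_theta(y_i) for a Bernoulli(theta) family."""
    p = 1.0
    for y in ys:
        p *= theta if y == 1 else 1.0 - theta
    return p


def log_likelihood(theta, ys):
    """Sum over i of log P_theta(y_i) for the same family."""
    return sum(math.log(theta if y == 1 else 1.0 - theta) for y in ys)


# Toy observations: six ones and two zeros (illustrative values only).
ys = [1, 1, 0, 1, 1, 0, 1, 1]

# Candidate parameters strictly inside (0, 1) so every logarithm is defined.
grid = [k / 100 for k in range(1, 100)]

theta_from_product = max(grid, key=lambda t: likelihood(t, ys))
theta_from_log_sum = max(grid, key=lambda t: log_likelihood(t, ys))
```

Both searches pick the same grid point, which here is the fraction of ones in the data. The sum of logs is usually preferred in practice because a long product of probabilities can underflow to zero.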

14 Maximum likelihood Estimation (MLE)
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$.
MLE: negative log-likelihood loss
$\theta_{ML} = \mathrm{argmax}_{\theta \in \Theta} \sum_i \log(P_\theta(x_i, y_i))$
$l(P_\theta, x_i, y_i) = -\log(P_\theta(x_i, y_i))$
$\hat{L}(P_\theta) = -\sum_i \log(P_\theta(x_i, y_i))$

15 MLE: conditional log-likelihood
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Let $\{P_\theta(y \mid x) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$.
MLE: negative conditional log-likelihood loss
$\theta_{ML} = \mathrm{argmax}_{\theta \in \Theta} \sum_i \log(P_\theta(y_i \mid x_i))$
$l(P_\theta, x_i, y_i) = -\log(P_\theta(y_i \mid x_i))$
$\hat{L}(P_\theta) = -\sum_i \log(P_\theta(y_i \mid x_i))$
Only care about predicting $y$ from $x$; do not care about $p(x)$.

16 MLE: conditional log-likelihood
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Let $\{P_\theta(y \mid x) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$.
MLE: negative conditional log-likelihood loss
$\theta_{ML} = \mathrm{argmax}_{\theta \in \Theta} \sum_i \log(P_\theta(y_i \mid x_i))$
$l(P_\theta, x_i, y_i) = -\log(P_\theta(y_i \mid x_i))$
$\hat{L}(P_\theta) = -\sum_i \log(P_\theta(y_i \mid x_i))$
$P(y \mid x)$: discriminative; $P(x, y)$: generative.

17 Example: $l_2$ loss
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$.

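18 Example: $l_2$ loss
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$.
Define $P_\theta(y \mid x) = \mathrm{Normal}(y; f_\theta(x), \sigma^2)$. Then
$\log(P_\theta(y_i \mid x_i)) = -\frac{1}{2\sigma^2} (f_\theta(x_i) - y_i)^2 - \log(\sigma) - \frac{1}{2} \log(2\pi)$
$\theta_{ML} = \mathrm{argmin}_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$
$l_2$ loss: Normal + MLE.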
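The claim that the Normal model plus MLE gives the $l_2$ loss can be checked numerically. The sketch below assumes a hypothetical one-parameter model $f_\theta(x) = \theta x$, a fixed `sigma`, toy data, and a grid search; none of these come from the lecture. It compares the minimizer of the Gaussian negative log-likelihood with the minimizer of the mean squared error.

```python
import math


def normal_log_density(y, mean, sigma):
    """log Normal(y; mean, sigma^2), written term by term as on the slide."""
    return (-((mean - y) ** 2) / (2 * sigma ** 2)
            - math.log(sigma)
            - 0.5 * math.log(2 * math.pi))


def negative_log_likelihood(theta, data, sigma):
    """Negative conditional log-likelihood for the assumed model f_theta(x) = theta * x."""
    return -sum(normal_log_density(y, theta * x, sigma) for x, y in data)


def mean_squared_error(theta, data):
    """l2 loss (1/n) * sum of (f_theta(x_i) - y_i)^2 for the same model."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


# Toy data roughly following y = 2x (illustrative values only).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# Candidate values of theta from 1.00 to 3.00 in steps of 0.01.
grid = [k / 100 for k in range(100, 301)]

theta_nll = min(grid, key=lambda t: negative_log_likelihood(t, data, sigma=0.5))
theta_mse = min(grid, key=lambda t: mean_squared_error(t, data))
```

The two minimizers coincide because, for fixed `sigma`, the negative log-likelihood is the squared error scaled by a positive constant plus terms that do not depend on `theta`.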

19 Linear classification

20 Example 1: image classification (figure: example images labeled indoor or outdoor)

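21 Example 2: Spam detection (table: each email is a row with count features # "$", # "Mr.", # "sale", ... and a label column "Spam?" with values Yes or No; the label of a new email is unknown)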

22 Why classification
Classification: a kind of summary.
Easy to interpret.
Easy for making decisions.

23 Linear classification
Decision boundary: $w^T x = 0$, with normal vector $w$.
$w^T x > 0$: Class 1.
$w^T x < 0$: Class 0.

24 Linear classification: natural attempt
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Hypothesis $f_w(x) = w^T x$ (linear model $\mathcal{H}$):
$y = 1$ if $w^T x > 0$
$y = 0$ if $w^T x < 0$
Prediction: $y = \mathrm{step}(f_w(x)) = \mathrm{step}(w^T x)$.
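The prediction rule on this slide can be sketched in a few lines. The helper names `dot`, `step`, and `predict` and the example weights are assumptions for illustration. The slide leaves the case $w^T x = 0$ unspecified, and assigning it to class 0 here is an arbitrary choice.

```python
def dot(w, x):
    """Inner product w^T x for two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def step(a):
    """Return 1 when a > 0 and 0 otherwise (the a == 0 case is our assumption)."""
    return 1 if a > 0 else 0


def predict(w, x):
    """Predicted class y = step(w^T x)."""
    return step(dot(w, x))


w = [1.0, -2.0]  # hypothetical weight vector

label_a = predict(w, [3.0, 1.0])  # w^T x = 1, positive side of the boundary
label_b = predict(w, [1.0, 1.0])  # w^T x = -1, negative side of the boundary
```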

25 Linear classification: natural attempt
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Find $f_w(x) = w^T x$ to minimize $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}[\mathrm{step}(w^T x_i) \ne y_i]$ (0-1 loss).
Drawback: difficult to optimize; NP-hard in the worst case.
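A small sketch of the 0-1 loss gives some intuition for why it is hard to optimize: the loss is piecewise constant in $w$, so a small change in the weights usually leaves it unchanged and offers no gradient to follow. The helper names, weights, and toy data below are assumptions for illustration. The example does not demonstrate the NP-hardness result itself.

```python
def dot(w, x):
    """Inner product w^T x for two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def step(a):
    """Return 1 when a > 0 and 0 otherwise."""
    return 1 if a > 0 else 0


def zero_one_loss(w, data):
    """Fraction of training pairs with step(w^T x_i) != y_i."""
    errors = sum(1 for x, y in data if step(dot(w, x)) != y)
    return errors / len(data)


# Toy data: feature vectors with labels in {0, 1} (illustrative values only).
data = [
    ([1.0, 2.0], 1),
    ([1.0, -1.0], 0),
    ([1.0, 0.5], 0),
    ([1.0, -0.5], 1),
]

loss_at_w = zero_one_loss([0.0, 1.0], data)
# Nudging the weights slightly does not move any point across the boundary.
loss_nearby = zero_one_loss([0.0, 1.001], data)
```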

26 Linear classification: simple approach
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$.
Reduce to linear regression; ignore the fact that $y \in \{0, 1\}$.

27 Linear classification: simple approach Drawback: not robust to outliers Figure borrowed from Pattern Recognition and Machine Learning, Bishop

28 Compare the two (figure: $y$ plotted against $w^T x$ for $y = w^T x$ and $y = \mathrm{step}(w^T x)$)

29 Between the two
Prediction bounded in $[0, 1]$.
Smooth.
Sigmoid: $\sigma(a) = \frac{1}{1 + \exp(-a)}$.
Figure borrowed from Pattern Recognition and Machine Learning, Bishop
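The sigmoid can be sketched directly from its definition. The version below branches on the sign of the input, a common numerical precaution that is not part of the lecture: in Python, `math.exp` raises `OverflowError` for large positive arguments, which the direct formula would hit for very negative inputs.

```python
import math


def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a)), arranged so exp never sees a large positive argument."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    # For negative a, use the algebraically equal form exp(a) / (1 + exp(a)).
    e = math.exp(a)
    return e / (1.0 + e)


middle = sigmoid(0.0)        # the midpoint of the curve
far_left = sigmoid(-1000.0)  # very close to 0, computed without overflow
far_right = sigmoid(1000.0)  # very close to 1
```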

30 Linear classification: sigmoid prediction
Squash the output of the linear function:
$\mathrm{Sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
Find $w$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (\sigma(w^T x_i) - y_i)^2$.

31 Linear classification: logistic regression
Squash the output of the linear function:
$\mathrm{Sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
A better approach: interpret as a probability
$P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
$P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x)$

32 Linear classification: logistic regression
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Find $w$ that minimizes
$\hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \log P_w(y_i \mid x_i)$
$\hat{L}(w) = -\frac{1}{n} \sum_{y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n} \sum_{y_i = 0} \log[1 - \sigma(w^T x_i)]$
Logistic regression: MLE with sigmoid.

33 Linear classification: logistic regression
Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$.
Find $w$ that minimizes
$\hat{L}(w) = -\frac{1}{n} \sum_{y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n} \sum_{y_i = 0} \log[1 - \sigma(w^T x_i)]$
No closed-form solution; need to use gradient descent.
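A minimal sketch of gradient descent on the logistic regression loss follows. It relies on the gradient $\nabla \hat{L}(w) = \frac{1}{n} \sum_i (\sigma(w^T x_i) - y_i) x_i$, which follows from the loss above but is not stated on the slide. The function names, learning rate, step count, and toy data are assumptions for illustration, not the lecture's settings.

```python
import math


def sigmoid(a):
    """Numerically careful sigma(a) = 1 / (1 + exp(-a))."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def dot(w, x):
    """Inner product w^T x for two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def logistic_loss(w, data):
    """Average negative conditional log-likelihood over the data."""
    total = 0.0
    for x, y in data:
        p = sigmoid(dot(w, x))
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(data)


def gradient(w, data):
    """(1/n) * sum over i of (sigma(w^T x_i) - y_i) * x_i."""
    g = [0.0] * len(w)
    for x, y in data:
        err = sigmoid(dot(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj
    return [gj / len(data) for gj in g]


def gradient_descent(data, dim, lr=0.5, steps=500):
    """Plain gradient descent from w = 0 with a fixed step size."""
    w = [0.0] * dim
    for _ in range(steps):
        g = gradient(w, data)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w


# Toy data: each x is [1, v] (constant feature plus one input); the classes overlap.
data = [
    ([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, -0.5], 1),
    ([1.0, 0.5], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1),
]

w_fit = gradient_descent(data, dim=2)
loss_start = logistic_loss([0.0, 0.0], data)
loss_fit = logistic_loss(w_fit, data)
```

Starting from `w = 0`, every prediction is 0.5 and the loss equals $\log 2$. After the updates the loss is lower and the weight on the input feature is positive. The toy classes overlap on purpose: with perfectly separable data the weights would keep growing.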

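34 Properties of sigmoid function
Bounded: $\sigma(a) = \frac{1}{1 + \exp(-a)} \in (0, 1)$.
Symmetric: $1 - \sigma(a) = \frac{\exp(-a)}{1 + \exp(-a)} = \frac{1}{\exp(a) + 1} = \sigma(-a)$.
Gradient: $\sigma'(a) = \frac{\exp(-a)}{(1 + \exp(-a))^2} = \sigma(a)(1 - \sigma(a))$.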
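The symmetry and gradient identities can be sanity-checked numerically. The sketch below compares $\sigma(a)(1 - \sigma(a))$ with a central finite difference at a few arbitrary points. The helper names, the test points, and the step size `h` are assumptions for illustration.

```python
import math


def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a)); fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_grad(a):
    """Closed-form derivative sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


def numeric_grad(f, a, h=1e-6):
    """Central finite-difference estimate of f'(a)."""
    return (f(a + h) - f(a - h)) / (2 * h)


points = [-3.0, -0.5, 0.0, 1.2, 4.0]

# Largest deviation from the symmetry 1 - sigma(a) = sigma(-a).
symmetry_gap = max(abs(1.0 - sigmoid(a) - sigmoid(-a)) for a in points)
# Largest difference between the closed-form and the numerical derivative.
gradient_gap = max(abs(sigmoid_grad(a) - numeric_grad(sigmoid, a)) for a in points)
```

Both gaps come out close to zero, limited only by floating-point rounding and the finite-difference approximation.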
