Machine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang

Review: machine learning basics

Math formulation. Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$, find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$, such that the expected loss $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$ is small.
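As an illustration of the notation (not from the slides; the per-example loss and predictor below are placeholder choices), a minimal sketch of computing the empirical loss:

```python
import numpy as np

def empirical_loss(f, loss, xs, ys):
    """Average per-example loss: (1/n) * sum_i l(f, x_i, y_i)."""
    return np.mean([loss(f, x, y) for x, y in zip(xs, ys)])

# Illustrative choices: squared loss and a fixed linear predictor.
squared_loss = lambda f, x, y: (f(x) - y) ** 2
f = lambda x: 2.0 * x
xs, ys = np.array([1.0, 2.0, 3.0]), np.array([2.1, 3.9, 6.2])
print(empirical_loss(f, squared_loss, xs, ys))
```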

Machine learning 1-2-3: (1) collect data and extract features; (2) build the model: choose a hypothesis class $\mathcal{H}$ and a loss function $l$; (3) optimization: minimize the empirical loss.

Machine learning 1-2-3, annotated: the collected data and extracted features are the experience; the choice of hypothesis class $\mathcal{H}$ and loss function $l$ encodes prior knowledge; optimization then minimizes the empirical loss.

Example: linear regression. Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$, find $f_w(x) = w^T x$ (linear model $\mathcal{H}$) that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} (w^T x_i - y_i)^2$ (the $l_2$ loss).

Why the $l_2$ loss? Why not choose another loss: the $l_1$ loss, hinge loss, exponential loss, ...? Empirical reason: it is easy to optimize; for the linear case there is the closed form $w = (X^T X)^{-1} X^T y$. Theoretical reason: it is a way to encode prior knowledge. Questions: what kind of prior knowledge? Is there a principled way to derive the loss?
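A minimal sketch of the closed-form solution for the linear case, $w = (X^T X)^{-1} X^T y$, on synthetic data (the data, dimensions, and noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                  # design matrix, one row per example
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy linear targets

# Closed-form minimizer of the l2 loss; solving the normal equations is
# preferred over explicitly inverting X^T X for numerical stability.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to w_true
```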

Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE). Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$. Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$. We would like to pick $\theta$ so that $P_\theta(x, y)$ fits the data well.

Maximum Likelihood Estimation (MLE). Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$. Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$. Fitness of $\theta$ to one data point $(x_i, y_i)$: $\text{likelihood}(\theta; x_i, y_i) = P_\theta(x_i, y_i)$.

Maximum Likelihood Estimation (MLE). Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$. Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$. Fitness of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$: $\text{likelihood}(\theta; \{(x_i, y_i)\}) = P_\theta(\{(x_i, y_i)\}) = \prod_i P_\theta(x_i, y_i)$.

Maximum Likelihood Estimation (MLE). Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$. Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$. MLE: maximize the fitness of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$, i.e., $\theta_{ML} = \arg\max_{\theta \in \Theta} \prod_i P_\theta(x_i, y_i)$.

Maximum Likelihood Estimation (MLE). Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$. Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$. Equivalently, maximize the log-likelihood: $\theta_{ML} = \arg\max_{\theta \in \Theta} \log\big[\prod_i P_\theta(x_i, y_i)\big] = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)$.

Maximum Likelihood Estimation (MLE). Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$. Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$. MLE corresponds to the negative log-likelihood loss: $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)$, i.e., $l(P_\theta, x_i, y_i) = -\log P_\theta(x_i, y_i)$ and $\hat{L}(P_\theta) = -\sum_i \log P_\theta(x_i, y_i)$.
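A small illustration (not from the slides; the Bernoulli family, the observed data, and the parameter grid are made up for the example): MLE by minimizing the negative log-likelihood over a grid of candidate parameters.

```python
import numpy as np

# Bernoulli family {P_theta(y) = theta^y * (1 - theta)^(1 - y)}.
ys = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # observed coin flips

thetas = np.linspace(0.01, 0.99, 99)
nll = [-np.sum(ys * np.log(t) + (1 - ys) * np.log(1 - t)) for t in thetas]
theta_ml = thetas[np.argmin(nll)]          # minimize negative log-likelihood
print(theta_ml, ys.mean())                 # MLE matches the empirical frequency (0.75)
```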

MLE: conditional log-likelihood. Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$. Let $\{P_\theta(y \mid x) : \theta \in \Theta\}$ be a family of conditional distributions indexed by $\theta$. MLE with the negative conditional log-likelihood loss: $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(y_i \mid x_i)$, i.e., $l(P_\theta, x_i, y_i) = -\log P_\theta(y_i \mid x_i)$ and $\hat{L}(P_\theta) = -\sum_i \log P_\theta(y_i \mid x_i)$. Here we only care about predicting $y$ from $x$ and do not model $p(x)$: modeling $P(y \mid x)$ is discriminative, while modeling $P(x, y)$ is generative.

Example: $l_2$ loss. Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$, find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n}\sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$.

Example: $l_2$ loss. Define $P_\theta(y \mid x) = \mathcal{N}(y; f_\theta(x), \sigma^2)$. Then $-\log P_\theta(y_i \mid x_i) = \frac{1}{2\sigma^2}(f_\theta(x_i) - y_i)^2 + \log(\sigma) + \frac{1}{2}\log(2\pi)$, so $\theta_{ML} = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$. In short: $l_2$ loss = Normal noise model + MLE.
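A quick numerical sanity check of this equivalence on synthetic data (the linear model, noise level, and the perturbation used below are assumptions for illustration): the least-squares solution also minimizes the Gaussian negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([0.7, -1.3]) + 0.2 * rng.normal(size=50)
sigma = 0.2

def gaussian_nll(w):
    # Negative log-likelihood under y_i ~ Normal(w^T x_i, sigma^2); the terms
    # besides the squared residuals do not depend on w.
    resid = y - X @ w
    return np.sum(resid**2 / (2 * sigma**2) + np.log(sigma) + 0.5 * np.log(2 * np.pi))

w_ls = np.linalg.solve(X.T @ X, X.T @ y)     # least-squares minimizer
# Perturbing w_ls can only increase the NLL, consistent with l2 = Normal + MLE.
print(gaussian_nll(w_ls) <= gaussian_nll(w_ls + np.array([0.05, -0.05])))   # True
```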

Linear classification

Example 1: image classification, labeling each image as indoor or outdoor. [Slide shows example indoor and outdoor photos.]

Example 2: spam detection. Each email is represented by word-count features, and the task is to predict whether it is spam:

            # "$"   # "Mr."   # "sale"   Spam?
Email 1       2        1         1        Yes
Email 2       0        1         0        No
Email 3       1        1         1        Yes
...
Email n       0        0         0        No
New email     0        0         1        ??

Why classification? Classification is a kind of summary: it is easy to interpret and easy to use for making decisions.

Linear classification. The hyperplane $w^T x = 0$ (with normal vector $w$) splits the input space into two half-spaces: $w^T x > 0$ for Class 1 and $w^T x < 0$ for Class 0. [Slide shows a 2D picture of the separating hyperplane.]

Linear classification: natural attempt. Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$. Hypothesis (linear model $\mathcal{H}$): $f_w(x) = w^T x$, predicting $y = 1$ if $w^T x > 0$ and $y = 0$ if $w^T x < 0$; that is, the prediction is $y = \text{step}(f_w(x)) = \text{step}(w^T x)$.

Linear classification: natural attempt. Find $f_w(x) = w^T x$ to minimize the 0-1 loss $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}[\text{step}(w^T x_i) \ne y_i]$. Drawback: this is difficult to optimize (NP-hard in the worst case).
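A small sketch on synthetic, linearly separable data (assumed for illustration) of the step prediction and its 0-1 training error; the loss is piecewise constant in $w$, which is part of why it is hard to optimize.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.0, -1.0]) > 0).astype(int)    # labels in {0, 1}

def zero_one_loss(w):
    preds = (X @ w > 0).astype(int)                # step(w^T x)
    return np.mean(preds != y)

print(zero_one_loss(np.array([1.0, -1.0])))   # 0.0 on this separable data
print(zero_one_loss(np.array([1.0, 1.0])))    # much larger error
```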

Linear classification: simple approach. Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$, find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} (w^T x_i - y_i)^2$, i.e., reduce to linear regression and ignore the fact that $y \in \{0, 1\}$.

Linear classification: simple approach. Drawback: not robust to outliers. [Figure borrowed from Pattern Recognition and Machine Learning, Bishop.]

Compare the two: the linear prediction $y = w^T x$ versus the thresholded prediction $y = \text{step}(w^T x)$. [Figure: both plotted as functions of $w^T x$.]

Between the two, we want a prediction that is bounded in $[0, 1]$ and smooth. The sigmoid has both properties: $\sigma(a) = \frac{1}{1 + \exp(-a)}$. [Figure borrowed from Pattern Recognition and Machine Learning, Bishop.]

Linear classification: sigmoid prediction. Squash the output of the linear function: $\text{Sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$. Find $w$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} (\sigma(w^T x_i) - y_i)^2$.

Linear classification: logistic regression. Squash the output of the linear function with the sigmoid, but take a better approach: interpret the output as a probability, $P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$ and $P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x)$.

Linear classification: logistic regression. Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$, find $w$ that minimizes the negative conditional log-likelihood $\hat{L}(w) = -\frac{1}{n}\sum_{i=1}^{n} \log P_w(y_i \mid x_i) = -\frac{1}{n}\sum_{i: y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n}\sum_{i: y_i = 0} \log[1 - \sigma(w^T x_i)]$. Logistic regression is exactly MLE with the sigmoid model.

Linear classification: logistic regression. Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$, find $w$ that minimizes $\hat{L}(w) = -\frac{1}{n}\sum_{i: y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n}\sum_{i: y_i = 0} \log[1 - \sigma(w^T x_i)]$. There is no closed-form solution; we need to use gradient descent.
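A minimal gradient-descent sketch on synthetic data (the data-generating weights, learning rate, and iteration count are assumptions for illustration); the gradient of the negative log-likelihood is $\frac{1}{n}\sum_i (\sigma(w^T x_i) - y_i)\, x_i$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(3)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
y = (rng.uniform(size=500) < sigmoid(X @ w_true)).astype(float)   # labels in {0, 1}

w = np.zeros(2)
learning_rate = 0.5
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient of the NLL
    w -= learning_rate * grad

print(w)   # roughly recovers w_true, up to sampling noise
```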

Properties of the sigmoid function. Bounded: $\sigma(a) = \frac{1}{1 + \exp(-a)} \in (0, 1)$. Symmetric: $1 - \sigma(a) = \frac{\exp(-a)}{1 + \exp(-a)} = \frac{1}{\exp(a) + 1} = \sigma(-a)$. Gradient: $\sigma'(a) = \frac{\exp(-a)}{(1 + \exp(-a))^2} = \sigma(a)(1 - \sigma(a))$.
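A quick numerical check (illustrative; the grid of points and step size are arbitrary) of the symmetry and gradient identities above:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 11)
eps = 1e-6
numeric_grad = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)   # central difference

print(np.allclose(1 - sigmoid(a), sigmoid(-a)))                    # True: sigma(-a) = 1 - sigma(a)
print(np.allclose(numeric_grad, sigmoid(a) * (1 - sigmoid(a))))    # True: sigma' = sigma * (1 - sigma)
```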