Linear classifiers: Overfitting and regularization


Emily Fox, University of Washington, January 25, 2017

Logistic regression recap

Thus far, we focused on decision boundaries

Score(x_i) = w_0 h_0(x_i) + w_1 h_1(x_i) + ... + w_D h_D(x_i) = w^T h(x_i)

(Figure: decision boundary in the #awesome vs. #awful plane, with Score(x) > 0 on one side and Score(x) < 0 on the other.)

How do we relate Score(x_i) to P(y=+1 | x, w)?

Compare and contrast regression models

Linear regression with Gaussian errors:
y_i = w^T h(x_i) + ε_i,  ε_i ~ N(0, σ^2)
p(y | x, w) = N(y; w^T h(x), σ^2)

Logistic regression:
P(y=+1 | x, w) = 1 / (1 + e^(-w^T h(x)))
P(y=-1 | x, w) = e^(-w^T h(x)) / (1 + e^(-w^T h(x)))

Understanding the logistic regression model

P(y=+1 | x_i, w) = sigmoid(Score(x_i)) = 1 / (1 + e^(-w^T h(x_i)))

(Figure: the sigmoid curve mapping Score(x_i) to P(y=+1 | x_i, w).)

Predict most likely class
Input: x (e.g., a sentence from a review)
P(y | x) = estimate of class probabilities
If P(y=+1 | x) > 0.5: ŷ = +1. Else: ŷ = -1.
Estimating P(y | x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are
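A minimal sketch of this prediction rule in NumPy; the feature matrix H (rows h(x_i)) and the weight vector w are hypothetical stand-ins, not names from the slides:

```python
import numpy as np

def sigmoid(z):
    # Logistic function 1 / (1 + exp(-z)); clipping avoids overflow for extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def predict_proba(H, w):
    # P(y=+1 | x_i, w) = sigmoid(w^T h(x_i)) for each row of H.
    return sigmoid(H @ w)

def predict_class(H, w):
    # y_hat = +1 if P(y=+1 | x) > 0.5, else -1.
    return np.where(predict_proba(H, w) > 0.5, 1, -1)
```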

Gradient ascent algorithm, cont'd

Step 5: Gradient over all data points:

∂ℓ(w)/∂w_j = Σ_{i=1}^N h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w))

a sum over data points of the feature value h_j(x_i) times the difference between the truth 1[y_i = +1] and the prediction P(y=+1 | x_i, w).

Summary of gradient ascent for logistic regression

init w^(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w^(t))|| > ε:
    for j = 0, ..., D:
        partial[j] = Σ_{i=1}^N h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w^(t)))
        w_j^(t+1) ← w_j^(t) + η partial[j]
    t ← t + 1
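A runnable version of this loop, as a sketch: H holds the features h(x_i) as rows (first column all ones for h_0), y holds labels in {-1, +1}, and the step size and tolerance are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def logistic_gradient_ascent(H, y, eta=0.1, eps=1e-4, max_iter=10000):
    N, D = H.shape
    w = np.zeros(D)                          # init w = 0
    indicator = (y == 1).astype(float)       # 1[y_i = +1]
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-H @ w))     # P(y=+1 | x_i, w)
        partial = H.T @ (indicator - p)      # gradient: sum over data points
        if np.linalg.norm(partial) < eps:    # stop once the gradient is tiny
            break
        w = w + eta * partial                # ascent step
    return w
```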

Overfitting in classification

Learned decision boundary (linear features in 2d)

Feature | Value | Coefficient learned
h_0(x)  | 1     | …
h_1(x)  | x[1]  | 1.12
h_2(x)  | x[2]  | …

Quadratic features (in 2d; note: we are not including cross terms for simplicity)

Feature | Value    | Coefficient learned
h_0(x)  | 1        | …
h_1(x)  | x[1]     | 1.39
h_2(x)  | x[2]     | …
h_3(x)  | (x[1])^2 | …
h_4(x)  | (x[2])^2 | …

Degree 6 features (in 2d; note: we are not including cross terms for simplicity)

Feature | Value    | Coefficient learned
h_0(x)  | 1        | …
h_1(x)  | x[1]     | 5.3
h_2(x)  | x[2]     | …
h_3(x)  | (x[1])^2 | …
h_4(x)  | (x[2])^2 | …
...     | ...      | ...
h_11(x) | (x[1])^6 | …
h_12(x) | (x[2])^6 | …

Coefficient values are getting large.

Degree 20 features (in 2d; note: we are not including cross terms for simplicity)

Feature | Value     | Coefficient learned
h_0(x)  | 1         | …
h_1(x)  | x[1]      | 5.1
h_2(x)  | x[2]      | 78.7
...     | ...       | ...
h_37(x) | (x[1])^19 | -2×10^-6
h_38(x) | (x[2])^19 | …
h_39(x) | (x[1])^20 | …
h_40(x) | (x[2])^20 | …

Often, overfitting is associated with very large estimated coefficients.

Overfitting in classification

Error = classification error.
Overfitting if there exists w* such that:
- training_error(w*) > training_error(ŵ), and
- true_error(w*) < true_error(ŵ)

(Figure: training error and true error as a function of model complexity; the gap between them grows as complexity increases.)

Overfitting in classifiers leads to overconfident predictions.

Logistic regression model

P(y=+1 | x_i, w) = sigmoid(w^T h(x_i)) = 1 / (1 + e^(-w^T h(x_i)))

The subtle (negative) consequence of overfitting in logistic regression:
overfitting → large coefficient values → w^T h(x_i) is very positive (or very negative) → sigmoid(w^T h(x_i)) goes to 1 (or to 0) → the model becomes extremely overconfident in its predictions.

Effect of coefficients on logistic regression model

Input x: #awesome = 2, #awful = 1. Three coefficient settings:
- w_0 = 0, w_#awesome = +1, w_#awful = -1
- w_0 = 0, w_#awesome = +2, w_#awful = -2
- w_0 = 0, w_#awesome = +6, w_#awful = -6

(Figure: P(y=+1 | x, w) plotted against #awesome − #awful for each setting; as the coefficients scale up, the sigmoid becomes steeper and the predictions more extreme.)

Learned probabilities (linear features)

Feature | Value | Coefficient learned
h_0(x)  | 1     | …
h_1(x)  | x[1]  | 1.12
h_2(x)  | x[2]  | …
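The steepening effect of the three coefficient settings above can be checked numerically; this is a toy computation, not output from the slides:

```python
import numpy as np

score = 2 - 1                 # #awesome - #awful for the input x above
for scale in (1, 2, 6):      # w_0 = 0, w_#awesome = +scale, w_#awful = -scale
    p = 1.0 / (1.0 + np.exp(-scale * score))
    print(f"scale {scale}: P(y=+1 | x, w) = {p:.3f}")
# Prints 0.731, 0.881, 0.998: larger coefficients push the same input toward certainty.
```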

Quadratic features: Learned probabilities

Feature | Value    | Coefficient learned
h_0(x)  | 1        | …
h_1(x)  | x[1]     | 1.39
h_2(x)  | x[2]     | …
h_3(x)  | (x[1])^2 | …
h_4(x)  | (x[2])^2 | …

Overfitting leads to overconfident predictions.

Degree 6: Learned probabilities. Tiny uncertainty regions. Overfitting, and overconfident about it!!! We are sure we are right, when we are surely wrong!

Degree 20: Learned probabilities

(Figures: predicted-probability surfaces; the higher-degree models carve out near-certain regions with vanishing bands of uncertainty.)

Overfitting in logistic regression: Another perspective

Linearly-separable data. Data are linearly separable if there exist coefficients ŵ such that:
- Score(x_i) = ŵ^T h(x_i) > 0 for all positive training data
- Score(x_i) = ŵ^T h(x_i) < 0 for all negative training data

Note 1: If you are using D features, linear separability happens in a D-dimensional space.
Note 2: If you have enough features, data are (almost) always linearly separable, so training_error(ŵ) = 0.

Effect of linear separability on coefficients

(Figure: linearly separable data in the x[1] = #awesome, x[2] = #awful plane, with Score(x) > 0 on one side of the boundary and Score(x) < 0 on the other.)

- Data are linearly separable with ŵ_1 = 1.0 and ŵ_2 = -1.5
- Data are also linearly separable with ŵ_1 = 10 and ŵ_2 = -15
- Data are also linearly separable with ŵ_1 = 10^9 and ŵ_2 = -1.5×10^9

Maximum likelihood estimation (MLE) prefers the most certain model, so the coefficients go to infinity for linearly-separable data!!!

(Figure: P(y=+1 | x_i, ŵ) becomes a sharper and sharper step as the coefficients are scaled up.)
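A small experiment (hypothetical 1-d data, not from the slides) showing the blow-up: unregularized gradient ascent on a linearly separable problem keeps increasing the likelihood, so the coefficient never stops growing.

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])        # linearly separable in 1d
y = np.array([-1, -1, 1, 1])
w = 0.0
for t in range(1, 100_001):
    p = 1.0 / (1.0 + np.exp(-w * x))        # P(y=+1 | x_i, w)
    w += 0.5 * np.sum(x * ((y == 1) - p))   # full-gradient ascent step
    if t in (10, 100, 1_000, 10_000, 100_000):
        print(t, round(w, 2))               # w keeps growing, roughly like log t
```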

Overfitting in logistic regression is twice as bad

- Learning tries to find a decision boundary that separates the data → overly complex boundary
- If the data are linearly separable → coefficients go to infinity! (e.g., ŵ_1 = 10^9, ŵ_2 = -1.5×10^9)

Penalizing large coefficients to mitigate overfitting

Desired total cost format

Want to balance:
i. How well the function fits the data
ii. The magnitude of the coefficients

Total quality = measure of fit − measure of magnitude of coefficients
- measure of fit (data likelihood): large # = good fit to training data
- measure of magnitude of coefficients: large # = overfit

Consider the resulting objective. Select ŵ to maximize:

ℓ(w) − λ ||w||_2^2

where the tuning parameter λ balances fit and magnitude. This is L2 regularized logistic regression. Pick λ using:
- a validation set (for large datasets)
- cross-validation (for smaller datasets)
(as in ridge/lasso regression)
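Written as code, the quantity being maximized might look like this; a sketch assuming labels in {-1, +1}, with all coefficients penalized (excluding the intercept w_0 from the penalty is a common variation the slides do not address):

```python
import numpy as np

def total_quality(w, H, y, lam):
    # l(w) - lambda * ||w||_2^2
    scores = H @ w
    log_likelihood = -np.sum(np.log1p(np.exp(-y * scores)))  # l(w), y in {-1,+1}
    return log_likelihood - lam * np.sum(w ** 2)
```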

Degree 20 features: effect of the regularization penalty λ

(Table/figure: decision boundaries and coefficient ranges for λ from 0 up through 1 and 10; as λ grows, the range of the learned coefficients shrinks dramatically, with the coefficients reaching at most 0.22 at λ = 10, and the decision boundary becomes smoother.)

Coefficient path

(Figure: each coefficient ŵ_j plotted as a function of λ; all coefficients shrink toward 0 as λ increases.)

Degree 20 features: regularization reduces overconfidence

(Table/figure: learned probabilities for λ from 0 up through 1; as λ grows, the coefficient range shrinks, reaching at most 0.57 at λ = 1, and the predicted probabilities become less extreme, with wider bands of uncertainty.)

Finding the best L2 regularized linear classifier with gradient ascent

Understanding the contribution of the L2 regularization

The term from the L2 penalty is ∂/∂w_j [−λ ||w||_2^2] = −2 λ w_j. Its impact on w_j:
- If w_j > 0, the term is negative, so it decreases w_j.
- If w_j < 0, the term is positive, so it increases w_j.
In both cases, it pulls w_j toward 0.

Summary of gradient ascent for logistic regression + L2 regularization

init w^(1) = 0 (or randomly, or smartly), t = 1
while not converged:
    for j = 0, ..., D:
        partial[j] = Σ_{i=1}^N h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w^(t)))
        w_j^(t+1) ← w_j^(t) + η (partial[j] − 2 λ w_j^(t))
    t ← t + 1
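The same update in runnable NumPy, as a sketch with illustrative hyperparameters; a fixed iteration count stands in for "while not converged":

```python
import numpy as np

def l2_logistic_gradient_ascent(H, y, lam, eta=0.1, n_iter=5000):
    N, D = H.shape
    w = np.zeros(D)
    indicator = (y == 1).astype(float)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-H @ w))
        partial = H.T @ (indicator - p)          # unregularized gradient
        w = w + eta * (partial - 2 * lam * w)    # -2*lambda*w_j shrinks each w_j
    return w
```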

Summary of overfitting in logistic regression

What you can do now:
- Identify when overfitting is happening
- Relate large learned coefficients to overfitting
- Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
- Motivate the form of the L2 regularized logistic regression quality metric
- Describe what happens to estimated coefficients as the tuning parameter λ is varied
- Interpret a coefficient path plot
- Estimate L2 regularized logistic regression coefficients using gradient ascent
- Describe the use of L1 regularization to obtain sparse logistic regression solutions

Linear classifiers: Scaling up learning via SGD
Emily Fox, University of Washington, January 25, 2017

Stochastic gradient descent: Learning, one data point at a time

Why gradient ascent is slow

To move from w^(t) to w^(t+1), we must compute the gradient ∇ℓ(w^(t)), which requires a full pass over the data; then another full pass for ∇ℓ(w^(t+1)), and so on. Every update requires a full pass over the data.

More formally: How expensive is gradient ascent?

∇ℓ(w) = Σ_{i=1}^N ∇ℓ_i(w)

a sum over data points of the contribution ∇ℓ_i(w) of data point (x_i, y_i) to the gradient.

Every step requires touching every data point!!!

Time to compute contribution of (x_i, y_i) | # of data points (N) | Total time to compute 1 step of gradient ascent
1 millisecond | 1,000      | 1 second
1 millisecond | 10 million | 2.8 hours
1 millisecond | 10 billion | 115.7 days

Stochastic gradient ascent

Instead of a full pass per update, use only small subsets of the data to compute the gradient: update the coefficients after each subset, giving many updates for each pass over the data.
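The totals in the table above are just N times the per-point cost; a few lines of arithmetic reproduce them:

```python
for n in (1_000, 10_000_000, 10_000_000_000):
    seconds = n * 1e-3      # 1 millisecond per data point
    print(f"N = {n:>14,}: {seconds:>12,.0f} s "
          f"= {seconds / 3600:,.1f} h = {seconds / 86400:,.1f} days")
# N = 1,000 -> 1 s; N = 10,000,000 -> 2.8 h; N = 10,000,000,000 -> 115.7 days
```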

Example: Instead of all data points for the gradient, use 1 data point only???

Gradient ascent: partial[j] = Σ_{i=1}^N h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w)), a sum over data points.
Stochastic gradient ascent: partial[j] = h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w)), each time picking a different data point i.

Stochastic gradient ascent for logistic regression

init w^(1) = 0, t = 1
until converged:
    for i = 1, ..., N:                (each time, pick a different data point i)
        for j = 0, ..., D:
            partial[j] = h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w^(t)))
            w_j^(t+1) ← w_j^(t) + η partial[j]
        t ← t + 1
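A runnable sketch of this loop; shuffling the order each pass is a common practical touch not shown in the slide's pseudocode:

```python
import numpy as np

def logistic_sgd(H, y, eta=0.01, n_passes=5, seed=0):
    rng = np.random.default_rng(seed)
    N, D = H.shape
    w = np.zeros(D)
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):                   # a different data point each time
            p_i = 1.0 / (1.0 + np.exp(-H[i] @ w))      # P(y=+1 | x_i, w)
            w = w + eta * H[i] * (indicator[i] - p_i)  # one-point gradient step
    return w
```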

Stochastic gradient for the L2-regularized objective

What about the regularization term? The total derivative is

partial[j] − 2 λ w_j

In stochastic gradient ascent, each time we pick a different data point i and use

partial_i[j] − (2 λ / N) w_j

i.e., each data point contributes 1/N of the regularization term.

Comparing computational time per step

Time to compute contribution of (x_i, y_i) | # of data points (N) | Total time for 1 step of gradient | Total time for 1 step of stochastic gradient
1 millisecond | 1,000      | 1 second   | 1 millisecond
1 millisecond | 10 million | 2.8 hours  | 1 millisecond
1 millisecond | 10 billion | 115.7 days | 1 millisecond
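Folding the 1/N share of the penalty into the single-point update of the logistic_sgd sketch above gives the regularized variant (again a sketch with hypothetical hyperparameters):

```python
import numpy as np

def l2_logistic_sgd(H, y, lam, eta=0.01, n_passes=5, seed=0):
    rng = np.random.default_rng(seed)
    N, D = H.shape
    w = np.zeros(D)
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):
            p_i = 1.0 / (1.0 + np.exp(-H[i] @ w))
            # Each data point carries 1/N of the -2*lambda*w penalty term:
            w = w + eta * (H[i] * (indicator[i] - p_i) - (2 * lam / N) * w)
    return w
```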

Which one is better??? Depends.

Algorithm | Time per iteration | Total time to convergence for large data (in theory) | (in practice) | Sensitivity to parameters
Gradient            | Slow for large data | Slower | Often slower | Moderate
Stochastic gradient | Always fast         | Faster | Often faster | Very high

Summary of stochastic gradient
- Tiny change to gradient ascent
- Much better scalability
- Huge impact in the real world
- Very tricky to get right in practice

Why would stochastic gradient ever work???

The gradient is the direction of steepest ascent. The gradient is the best direction, but any direction that goes up would be useful.

In ML, the steepest direction is a sum of little directions, one from each data point

∇ℓ(w) = Σ_{i=1}^N ∇ℓ_i(w)   (sum over data points)

For most data points, the contribution points up. Stochastic gradient: pick a data point and move in its direction. Most of the time, the total likelihood will increase.

Stochastic gradient ascent: Most iterations increase the likelihood, but some decrease it. On average, we make progress.

until converged:
    for i = 1, ..., N:
        for j = 0, ..., D:
            w_j^(t+1) ← w_j^(t) + η h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w^(t)))
        t ← t + 1

Convergence path

Convergence paths

(Figure: parameter-space paths; gradient ascent follows a smooth path to the optimum, while stochastic gradient takes a noisy, zig-zagging path.)

Stochastic gradient convergence is noisy

(Figure: average log likelihood vs. total time, which is proportional to # passes over the data; gradient usually increases the likelihood smoothly, while stochastic gradient makes noisy progress but achieves a higher likelihood sooner.)

Eventually, gradient catches up

(Figure: average log likelihood vs. time for gradient and stochastic gradient; note that you should only trust the average quality of stochastic gradient.)

The last coefficients may be really good or really bad!! Stochastic gradient will eventually oscillate around a solution: w^(1000) might be bad while w^(1005) was good. How do we minimize the risk of picking bad coefficients? Minimize the noise: don't return the last learned coefficients. Output the average:

ŵ = (1/T) Σ_{t=1}^T w^(t)
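Averaging is a one-line postprocessing step over saved iterates (w_history is a hypothetical list of the w^(t) vectors; in practice one often averages only the tail of the run):

```python
import numpy as np

def average_iterates(w_history):
    # w_hat = (1/T) * sum_t w^(t): trust the average, not the last iterate.
    return np.mean(np.asarray(w_history), axis=0)
```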

Summary of why stochastic gradient works
- Gradient finds the direction of steepest ascent
- Gradient is a sum of contributions from each data point
- Stochastic gradient uses the direction from 1 data point
- On average it increases the likelihood, though sometimes it decreases it
- Stochastic gradient has noisy convergence

Online learning: Fitting models from streaming data

Batch vs online learning

Batch learning: all data is available at the start of training time.
Online learning: data arrives (streams in) over time; we must train the model as the data arrives!

(Figure: timeline t = 1, 2, 3, 4; at each step new data flows into the ML algorithm, which outputs updated coefficients ŵ^(1), ŵ^(2), ŵ^(3), ŵ^(4).)

Online learning example: Ad targeting

A website must choose among Ad1, Ad2, Ad3. Input x_t: user info, page text. The ML algorithm suggests ads (ŷ); the user clicks on Ad2, so y_t = Ad2, and the coefficients are updated from ŵ^(t) to ŵ^(t+1).

Online learning problem

Data arrives over each time step t:
- Observe input x_t (info of the user, text of the webpage)
- Make a prediction ŷ_t (which ad to show)
- Observe the true output y_t (which ad the user clicked on)

We need an ML algorithm that updates the coefficients each time step! Stochastic gradient ascent can be used for online learning!!!

init w^(1) = 0, t = 1
Each time step t:
- Observe input x_t
- Make a prediction ŷ_t
- Observe the true output y_t
- Update coefficients:
    for j = 0, ..., D:
        w_j^(t+1) ← w_j^(t) + η partial[j]
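An online-learning skeleton built on the stochastic update; a sketch in which stream, featurize, and the {-1, +1} label encoding are all assumptions for illustration:

```python
import numpy as np

def online_logistic_learner(stream, featurize, eta=0.01, D=100):
    # stream yields (x_t, y_t) pairs over time, with y_t in {-1, +1};
    # featurize maps a raw input x_t to a length-D feature vector h(x_t).
    w = np.zeros(D)
    for x_t, y_t in stream:
        h = featurize(x_t)
        p = 1.0 / (1.0 + np.exp(-h @ w))
        yield 1 if p > 0.5 else -1                 # make the prediction y_hat_t
        w = w + eta * h * (float(y_t == 1) - p)    # then update the coefficients
        # Simplification: a real ad system observes y_t only after serving ŷ_t.
```

Using a generator lets the caller act on each prediction before the corresponding update is applied, matching the observe-predict-update order of the slide.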

Summary of online learning
- Data arrives over time
- Must make a prediction every time a new data point arrives
- Observe the true class after the prediction is made
- Want to update the parameters immediately

Summary of stochastic gradient descent

What you can do now:
- Significantly speed up a learning algorithm using stochastic gradient
- Describe the intuition behind why stochastic gradient works
- Apply stochastic gradient in practice
- Describe online learning problems
- Relate stochastic gradient to online learning

Generalized linear models

Models for data

Linear regression with Gaussian errors:
y_i = w^T h(x_i) + ε_i,  ε_i ~ N(0, σ^2)
p(y | x, w) = N(y; w^T h(x), σ^2)

Logistic regression:
P(y=+1 | x, w) = 1 / (1 + e^(-w^T h(x)))

Examples of generalized linear models

A generalized linear model has E[y | x, w] = g(w^T h(x)).

Model | Mean E[y | x, w] | Link function g
Linear regression | w^T h(x) | identity function
Logistic regression (assuming y in {0,1} instead of {-1,1} or some other set of values) | P(y=+1 | x, w) (the mean is different if y is not in {0,1}) | sigmoid function (needs slight modification if y is not in {0,1})
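The two rows of the table differ only in the function g applied to the score; as a sketch:

```python
import numpy as np

def glm_mean(H, w, g):
    # E[y | x, w] = g(w^T h(x)) for a generalized linear model.
    return g(H @ w)

identity = lambda z: z                          # linear regression
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic regression, y in {0, 1}
```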

Stochastic gradient descent, more formally

Learning problems as expectations

Minimizing loss on training data:
- Given a dataset {(x_i, y_i)}_{i=1}^N sampled i.i.d. from some distribution p(x) on the features
- A loss function ℓ(w; x), e.g., hinge loss, logistic loss, ...
- We often minimize the loss on the training data:

min_w (1/N) Σ_{i=1}^N ℓ(w; x_i)

However, we should really minimize the expected loss on all data:

min_w E_x[ℓ(w; x)] = min_w ∫ p(x) ℓ(w; x) dx

So, we are approximating the integral by the average over the training data.

Gradient ascent in terms of expectations

True objective function: ℓ(w) = E_x[ℓ(w; x)]
Taking the gradient: ∇ℓ(w) = E_x[∇ℓ(w; x)]
True gradient ascent rule: w^(t+1) ← w^(t) + η E_x[∇ℓ(w^(t); x)]
How do we estimate the expected gradient?

SGD: Stochastic gradient ascent (or descent)

True gradient: E_x[∇ℓ(w; x)]
Sample-based approximation: (1/N) Σ_{i=1}^N ∇ℓ(w; x_i)
What if we estimate the gradient with just one sample???
- Unbiased estimate of the gradient
- Very noisy!
- Called stochastic gradient ascent (or descent), among many other names
- VERY useful in practice!!!
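A quick numerical check on toy data (an illustration, not from the slides): averaging the N single-point gradients exactly recovers the full-data gradient, which is why a uniformly sampled single point gives an unbiased, if noisy, estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 3))
y = rng.choice([-1, 1], size=50)
w = rng.normal(size=3)

p = 1.0 / (1.0 + np.exp(-H @ w))
per_point = H * ((y == 1) - p)[:, None]    # gradient contribution of each point
full_gradient = per_point.sum(axis=0)
print(np.allclose(len(y) * per_point.mean(axis=0), full_gradient))  # True
```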
