Introduction to Logistic Regression and Support Vector Machine

Guest lecturer: Ming-Wei Chang. CS 446, Fall 2009.

Before we start
Feel free to ask questions at any time. The slides are newly made, so please tell me if you find any mistakes.

Today: supervised learning algorithms
(Diagram: training data → machine learning → model → testing data.)
Supervised learning algorithms we have mentioned: Decision Tree; Online Learning (Perceptron, Winnow, ...); Generative Model (Naive Bayes).
What are we going to talk about today? Modern supervised learning algorithms, specifically logistic regression and support vector machine.

Motivation
Logistic regression and support vector machine are both very popular!
They are batch learning algorithms that use optimization algorithms as their training algorithms, an important technique we need to be familiar with; learn not to be afraid of these algorithms.
We also want to understand the relationships between these algorithms and the algorithms we have already learned.

Review: Naive Bayes
Notation: input $x$, output $y \in \{+1, -1\}$. Assume each $x$ has $m$ features; we use $x_j$ to represent the $j$-th feature of $x$.
(Figure: the Naive Bayes graphical model, with $y$ pointing to $x_1, x_2, x_3, x_4, x_5$: the features are conditionally independent given $y$.)

Review: Naive Bayes
Model: $P(y, x) = P(y) \prod_{j=1}^{m} P(x_j \mid y)$
Training: maximize the likelihood $P(D) = P(Y, X) = \prod_{i=1}^{l} P(y_i, x_i)$.
Algorithm: estimate $P(y = +1)$ and $P(y = -1)$ by counting; estimate $P(x_j \mid y)$ by counting.
Testing: is $\frac{P(y=+1 \mid x)}{P(y=-1 \mid x)} = \frac{P(y=+1,\, x)}{P(y=-1,\, x)} \ge 1$?
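
To make the counting-based training concrete, here is a minimal sketch (my own illustration, not from the slides) that estimates the Naive Bayes parameters for binary features by counting; the add-one smoothing and the variable names are assumptions of this example.

    import numpy as np

    def train_naive_bayes(X, y):
        """Estimate P(y) and P(x_j = 1 | y) by counting (binary features, add-one smoothing)."""
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        pos, neg = X[y == +1], X[y == -1]
        prior_pos = (len(pos) + 1.0) / (len(X) + 2.0)        # P(y = +1)
        p_pos = (pos.sum(axis=0) + 1.0) / (len(pos) + 2.0)   # P(x_j = 1 | y = +1)
        p_neg = (neg.sum(axis=0) + 1.0) / (len(neg) + 2.0)   # P(x_j = 1 | y = -1)
        return prior_pos, p_pos, p_neg

    # toy data: four examples with three binary features each
    X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
    y = [+1, +1, -1, -1]
    print(train_naive_bayes(X, y))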

Review: Naive Bayes
The prediction function of a Naive Bayes model is a linear function. In previous lectures, we have shown that $\log \frac{P(y=+1 \mid x)}{P(y=-1 \mid x)} \ge 0 \iff w^T x + b \ge 0$: the counting results can be re-expressed as a linear function.
Key observation: Naive Bayes cannot express all possible linear functions. Intuition: the conditional independence assumption.
In the next few slides we will propose a model (logistic regression) that can express all possible linear functions.
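
To see concretely how the counting results become a linear function, the sketch below (my own, with made-up probabilities) folds the estimated parameters of a binary-feature Naive Bayes model into a weight vector $w$ and bias $b$ so that the log-odds equal $w^T x + b$.

    import numpy as np

    prior_pos = 0.5                          # P(y = +1), illustrative value
    p_pos = np.array([0.8, 0.6, 0.3])        # P(x_j = 1 | y = +1)
    p_neg = np.array([0.2, 0.5, 0.7])        # P(x_j = 1 | y = -1)

    # weights and bias of the equivalent linear function
    w = np.log(p_pos / p_neg) - np.log((1 - p_pos) / (1 - p_neg))
    b = (np.log(prior_pos / (1 - prior_pos))
         + np.sum(np.log((1 - p_pos) / (1 - p_neg))))

    # the log-odds computed directly and via the linear form agree
    x = np.array([1, 0, 1])
    log_odds = (np.log(prior_pos) - np.log(1 - prior_pos)
                + np.sum(x * np.log(p_pos) + (1 - x) * np.log(1 - p_pos))
                - np.sum(x * np.log(p_neg) + (1 - x) * np.log(1 - p_neg)))
    assert np.isclose(log_odds, w @ x + b)

Note that $w$ is entirely determined by the counted probabilities, which is why Naive Bayes cannot reach every possible linear function.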

Modeling conditional probability using a linear function
Starting point: the prediction function of Naive Bayes, $\log \frac{P(y=+1 \mid x)}{P(y=-1 \mid x)} = w^T x + b$
$\frac{P(y=+1 \mid x)}{1 - P(y=+1 \mid x)} = e^{w^T x + b}$
$P(y = +1 \mid x) = \frac{e^{w^T x + b}}{1 + e^{w^T x + b}} = \frac{1}{1 + e^{-(w^T x + b)}}$
The conditional probability: $P(y \mid x) = \frac{1}{1 + e^{-y(w^T x + b)}}$
In order to simplify the notation, let $w^T \leftarrow [\, w^T \;\; b \,]$ and $x^T \leftarrow [\, x^T \;\; 1 \,]$.
Using this bias trick, $P(y \mid x) = \frac{1}{1 + e^{-y(w^T x)}}$
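
As a small sanity check (not part of the slides), the conditional probability with the bias trick can be computed directly; the weights and the example point below are arbitrary.

    import numpy as np

    def p_y_given_x(w, x, y):
        """P(y | x, w) = 1 / (1 + exp(-y * w^T x)), with the bias folded into w."""
        return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))

    w = np.array([0.5, -1.0, 2.0])   # the last entry plays the role of b
    x = np.array([1.0, 3.0, 1.0])    # the input is augmented with a trailing 1
    print(p_y_given_x(w, x, +1) + p_y_given_x(w, x, -1))   # the two probabilities sum to 1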

Logistic regression: introduction
Naive Bayes models $P(y, x)$; because of the conditional independence assumption, training = counting, but not all $w$ are possible.
In the testing phase, we just showed that the conditional probability can be expressed as $P(y \mid x, w) = \frac{1}{1 + e^{-y(w^T x)}}$
Logistic regression maximizes the conditional likelihood $P(y \mid x)$ directly in the training phase.
How to find $w$? $w = \arg\max_w P(Y \mid X, w) = \arg\max_w \prod_{i=1}^{l} P(y_i \mid x_i, w)$
Among all possible $w$, find the one that maximizes the conditional likelihood: we drop the conditional independence assumption!

Logistic regression: the final objective function
$w = \arg\max_w P(Y \mid X, w) = \arg\max_w \prod_{i=1}^{l} P(y_i \mid x_i, w)$
Finding $w$ as an optimization problem:
$w = \arg\max_w \log P(Y \mid X, w) = \arg\min_w -\log P(Y \mid X, w) = \arg\min_w \sum_{i=1}^{l} -\log \frac{1}{1 + e^{-y_i (w^T x_i)}} = \arg\min_w \sum_{i=1}^{l} \log\left(1 + e^{-y_i (w^T x_i)}\right)$
Properties of this optimization problem: it is a convex optimization problem.
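
The last step uses $-\log P(y_i \mid x_i, w) = \log(1 + e^{-y_i (w^T x_i)})$; the short check below (my own, on random data) verifies this identity numerically.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=4)
    X = rng.normal(size=(5, 4))
    y = rng.choice([-1, 1], size=5)

    margins = y * (X @ w)                             # y_i * w^T x_i
    nll = -np.log(1.0 / (1.0 + np.exp(-margins)))     # -log P(y_i | x_i, w)
    loss = np.log1p(np.exp(-margins))                 # log(1 + exp(-y_i w^T x_i))
    assert np.allclose(nll, loss)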

Adding regularization
Empirical loss: $\log(1 + e^{-y_i (w^T x_i)})$. As $y_i (w^T x_i)$ increases, $\log(1 + e^{-y_i (w^T x_i)})$ decreases, so in order to minimize the empirical loss, $w$ will tend to be large. Therefore, to prevent over-fitting, we add a regularization term:
$\min_w \ \underbrace{\tfrac{1}{2} w^T w}_{\text{regularization term}} \; + \; C \, \underbrace{\sum_{i=1}^{l} \log\left(1 + e^{-y_i (w^T x_i)}\right)}_{\text{empirical loss}}$
where $C$ is the balance parameter.
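
Putting the two terms together, a minimal sketch (mine) of the regularized objective; $C$ is the balance parameter from the slide.

    import numpy as np

    def lr_objective(w, X, y, C):
        """0.5 * w^T w + C * sum_i log(1 + exp(-y_i * w^T x_i))."""
        margins = y * (X @ w)
        return 0.5 * np.dot(w, w) + C * np.sum(np.log1p(np.exp(-margins)))

    # at w = 0 every example contributes log(2), so the value is C * l * log(2)
    print(lr_objective(np.zeros(3), np.eye(3), np.array([1, -1, 1]), C=1.0))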

Optimization
This is an unconstrained problem, so we can use the gradient descent algorithm! However, it is quite slow.
There are many other methods: iterative scaling, nonlinear conjugate gradient, quasi-Newton methods, truncated Newton methods, trust-region Newton methods. All of them are iterative methods that generate a sequence $w^k$ converging to the optimal solution of the optimization problem above.
Choice of optimization techniques: methods range from low cost per iteration with slow convergence (e.g. iterative scaling, which updates one component of $w$ at a time) to high cost per iteration with fast convergence (e.g. Newton methods).
Currently, limited-memory BFGS is very popular in the NLP community.
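
Gradient descent is the simplest of the methods above. A rough sketch (my own; the fixed step size and iteration count are arbitrary choices, and real implementations use the smarter methods listed on this slide), using the gradient $w - C \sum_i y_i x_i\, \sigma(-y_i w^T x_i)$ of the regularized objective.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lr_gradient_descent(X, y, C=1.0, step=0.1, iters=1000):
        """Plain gradient descent on 0.5 w^T w + C * sum_i log(1 + exp(-y_i w^T x_i))."""
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            margins = y * (X @ w)
            grad = w - C * (X.T @ (y * sigmoid(-margins)))
            w -= step * grad
        return w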

Logistic regression versus Naive Bayes

                      Logistic regression       Naive Bayes
  Training            maximize P(Y|X)           maximize P(Y,X)
  Training algorithm  optimization algorithms   counting
  Testing             P(y|x) >= 0.5?            P(y|x) >= 0.5?

Table: Comparison between Naive Bayes and logistic regression
LR and NB are both linear functions in the testing phase; however, their training agendas are very different.

Support Vector Machine: another loss function
No magic, just another loss function.
Logistic regression: $\min_w \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i (w^T x_i)}\right)$
L1-loss SVM: $\min_w \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \max(0,\, 1 - y_i w^T x_i)$
L2-loss SVM: $\min_w \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \max(0,\, 1 - y_i w^T x_i)^2$

Compare these loss functions
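
The comparison on this slide was presumably a plot, which is not reproduced in the transcription; the sketch below (mine) evaluates the three losses at a few values of the margin $y_i w^T x_i$, which is what such a plot compares.

    import numpy as np

    m = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])       # margins y_i * w^T x_i
    logistic = np.log1p(np.exp(-m))                  # logistic regression loss
    hinge    = np.maximum(0.0, 1.0 - m)              # L1-loss SVM (hinge loss)
    sq_hinge = np.maximum(0.0, 1.0 - m) ** 2         # L2-loss SVM (squared hinge)
    for row in zip(m, logistic, hinge, sq_hinge):
        print("margin %+.1f  logistic %.3f  hinge %.3f  squared hinge %.3f" % row)

All three losses shrink as the margin grows; the hinge losses are exactly zero once the margin exceeds 1, while the logistic loss only approaches zero.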

The regularization term: maximize margin
The L1-loss SVM: $\min_w \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \max(0,\, 1 - y_i w^T x_i)$
Rewrite it using slack variables (why are they the same?):
$\min_{w, \xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{s.t. } y_i w^T x_i \ge 1 - \xi_i,\ \xi_i \ge 0$
If there is no training error, what is the margin of $w$? It is $\frac{1}{\|w\|}$.
Maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $w^T w$. SVM regularization: find the separating line that maximizes the margin.
Learning theory: link to the SVM theory notes.
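
To answer "why are they the same?", here is a one-line argument (my own wording): the two constraints force each slack variable to be at least the hinge loss, and the objective pushes it no higher, so at the optimum

    \xi_i \ge 1 - y_i w^T x_i \quad\text{and}\quad \xi_i \ge 0
    \;\Longrightarrow\; \xi_i \ge \max(0,\, 1 - y_i w^T x_i),

and since the objective is increasing in each $\xi_i$, equality holds at the minimum, recovering the unconstrained $\max(0,\, 1 - y_i w^T x_i)$ form.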

Balance between regularization and empirical loss
(Figures: (a) training data and an overfitting classifier; (b) testing data and an overfitting classifier.)
The maximal-margin line with 0 training error: is it the best?

Balance between regularization and empirical loss
(Figures: (c) training data and a better classifier; (d) testing data and a better classifier.)
If we allow some training error, we can find a better line. We need to balance the regularization term and the empirical loss term. This is a problem of model selection: select the balance parameter $C$ with cross validation.
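
A common way to do this model selection is a grid search over $C$ with cross validation. A sketch assuming scikit-learn is available; the synthetic dataset and the grid of $C$ values are placeholders.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    # placeholder data; in practice, use the actual training set
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    search = GridSearchCV(LinearSVC(loss="hinge", max_iter=10000),
                          param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                          cv=5)
    search.fit(X, y)
    print(search.best_params_)    # the C with the best cross-validated accuracy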

Primal and Dual Formulations
Explaining the primal-dual relationship: link to the lecture notes (07-LecSvm-opt.pdf).
Why the primal-dual relationship is useful: link to a talk by Professor Chih-Jen Lin, "Optimization, Support Vector Machines, and Machine Learning" (DIS, University of Rome and IASI, CNR, Italy). We will only use the slides from pages 1-20. Link to notes.

Nonlinear SVM
SVM tries to find a linear separator that maximizes the margin, so why do people talk about nonlinear SVM?
Usually this means: map $x \mapsto \phi(x)$ and find a linear function of $\phi(x)$; this can be a nonlinear function of $x$.
Primal: $\min_{w, \xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{s.t. } y_i w^T \phi(x_i) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l$
Dual: $\min_\alpha \ \tfrac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t. } 0 \le \alpha_i \le C \ \forall i$
where $Q$ is an $l \times l$ matrix with $Q_{ij} = y_i y_j K(x_i, x_j)$.
The same holds for the kernel perceptron: find a linear function on $\phi(x)$. (Demo)
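
A small sketch (mine) of building the matrix $Q$ that appears in the dual; the RBF kernel and its width are illustrative assumptions, and any kernel $K$ could be used.

    import numpy as np

    def rbf_kernel(xi, xj, gamma=0.5):
        """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
        return np.exp(-gamma * np.sum((xi - xj) ** 2))

    def build_Q(X, y, kernel=rbf_kernel):
        """Q_ij = y_i y_j K(x_i, x_j): the l-by-l matrix from the dual problem."""
        l = len(X)
        Q = np.empty((l, l))
        for i in range(l):
            for j in range(l):
                Q[i, j] = y[i] * y[j] * kernel(X[i], X[j])
        return Q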

Solving SVM
Both the primal and the dual problems have constraints, so we cannot use the gradient descent algorithm. For the linear dual SVM, however, there is a simple optimization algorithm: the coordinate descent method!
$\min_\alpha \ \tfrac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t. } 0 \le \alpha_i \le C \ \forall i$
The number of variables $\alpha_i$ equals the number of training examples.
The idea: pick one example $i$ and optimize $\alpha_i$ only.

Coordinate Descent Algorithm
Run through the training data multiple times:
Pick a random example $i$ among the training data.
Fix $\alpha_1, \alpha_2, \ldots, \alpha_{i-1}, \alpha_{i+1}, \ldots, \alpha_l$ and only change $\alpha_i$: $\alpha_i \leftarrow \alpha_i + s$.
Solve the problem $\min_s \ \tfrac{1}{2} (\alpha + s d)^T Q (\alpha + s d) - e^T (\alpha + s d) \quad \text{s.t. } 0 \le \alpha_i + s \le C$ (only one constraint), where $d$ is a vector of $l$ zeros whose $i$-th component is 1.
This is a single-variable problem; we know how to solve it.

Coordinate Descent Algorithm
Assume that the optimal step is $s^*$. We can update $\alpha_i$ using $\alpha_i' = \alpha_i + s^*$ (similar to the dual perceptron).
Given that $w = \sum_{i=1}^{l} \alpha_i y_i x_i$, this is equivalent to $w \leftarrow w + (\alpha_i' - \alpha_i)\, y_i x_i$ (similar to the primal perceptron).
Isn't this familiar?
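
For the linear kernel ($\phi(x) = x$) the single-variable subproblem has a closed-form solution, and maintaining $w = \sum_i \alpha_i y_i x_i$ makes each update look like a weighted perceptron update. The sketch below is my own; the clipped update is the standard one for the L1-loss dual, but treat it as an illustration rather than the lecture's exact algorithm.

    import numpy as np

    def dual_cd_linear_svm(X, y, C=1.0, epochs=20, seed=0):
        """Dual coordinate descent for the L1-loss linear SVM (bias folded into x)."""
        rng = np.random.default_rng(seed)
        l, d = X.shape
        alpha = np.zeros(l)
        w = np.zeros(d)                   # maintain w = sum_i alpha_i y_i x_i
        for _ in range(epochs):
            for i in rng.permutation(l):
                grad = y[i] * (w @ X[i]) - 1.0                       # d/d(alpha_i) of the dual
                new_alpha = np.clip(alpha[i] - grad / (X[i] @ X[i]), 0.0, C)
                w += (new_alpha - alpha[i]) * y[i] * X[i]            # perceptron-like update
                alpha[i] = new_alpha
        return w, alpha

Each inner update costs $O(d)$, just like a perceptron update, which is why the two algorithms end up looking so similar.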

Relationships between linear classifiers
NB, LR, Perceptron, and SVM are all linear classifiers.
NB and LR have the same interpretation of the conditional probability: $P(y \mid x, w) = \frac{1}{1 + e^{-y(w^T x)}}$
The difference between LR and SVM is their loss functions, but the two losses are quite similar!
The Perceptron algorithm and the coordinate descent algorithm for SVM are very similar.

Summary
Logistic regression
Maximizes $P(Y \mid X)$, while Naive Bayes maximizes the joint probability $P(Y, X)$.
Models the conditional probability using a linear function.
Drops the conditional independence assumption.
Many methods are available for optimizing the objective function.
Support Vector Machine
Similar to logistic regression, with a different loss function.
Maximizes the margin; has many nice theoretical properties.
Interesting primal-dual relationship, which allows us to choose the easier problem to solve.
Many methods are available for optimizing the objective function.
The linear dual coordinate descent method turns out to be similar to the Perceptron.
