Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang


1 Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set of notes is based on internet resources and KP Murphy (2012). Machine learning: a probabilistic perspective. MIT Press. (Chapter 8) Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT press. (Chapter 4) Andrew Ng. Lecture Notes on Machine Learning. Stanford. Hal Daume. A Course on Machine Learning. Nevin L. Zhang (HKUST) Machine Learning 1 / 50

2 Logistic Regression Outline 1 Logistic Regression 2 Gradient Descent 3 Newton's Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 2 / 50

3 Logistic Regression Recap of Probabilistic Linear Regression Training set D = {x_i, y_i}_{i=1}^N, where y_i ∈ R. Probabilistic model: p(y | x, θ) = N(y | µ(x), σ²) = N(y | wᵀx, σ²) Learning: Determining w by minimizing the negative loglikelihood NLL(w, σ) = −log P(D | w, σ) Point estimation of y: ŷ = µ(x) = wᵀx Nevin L. Zhang (HKUST) Machine Learning 3 / 50

4 Logistic Regression Logistic Regression (for Classification) Training set D = {x_i, y_i}_{i=1}^N, where y_i ∈ {0, 1}. Probabilistic model: p(y | x, w) = Ber(y | σ(wᵀx)) σ(z) is the sigmoid/logistic function: σ(z) = 1 / (1 + exp(−z)) = e^z / (e^z + 1) It maps the real line R to (0, 1). Not to be confused with the variance σ² in the Gaussian distribution. Nevin L. Zhang (HKUST) Machine Learning 4 / 50
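
As a computational aside (not from the slides): evaluating σ(z) naively can overflow for large negative z, so implementations usually branch on the sign of z. A minimal NumPy sketch, with illustrative function names:

    import numpy as np

    def sigmoid(z):
        """Numerically stable sigmoid: equals 1 / (1 + exp(-z)) = e^z / (e^z + 1)."""
        z = np.asarray(z, dtype=float)
        out = np.empty_like(z)
        pos = z >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # safe when z >= 0
        ez = np.exp(z[~pos])                       # safe when z < 0
        out[~pos] = ez / (1.0 + ez)
        return out

    print(sigmoid(np.array([-5.0, 0.0, 5.0])))     # approx [0.0067 0.5 0.9933]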

5 Logistic Regression Logistic Regression Rephrasing the probabilistic model: p(y = 1 | x, w) = σ(wᵀx) = 1 / (1 + exp(−wᵀx)) Equivalently, log [p / (1 − p)] = wᵀx, where p = p(y = 1 | x, w); the quantity log [p / (1 − p)] is called the logit of p. Learning: Determining w by minimizing the negative loglikelihood NLL(w) = −log P(D | w) Nevin L. Zhang (HKUST) Machine Learning 5 / 50

6 Logistic Regression Logistic Regression: Decision Rule To classify instances, we obtain a point estimate of y: ŷ = arg max_y p(y | x, w) In other words, the decision/classification rule is: ŷ = 1 iff p(y = 1 | x, w) > 0.5 This is called the optimal Bayes classifier: suppose the same situation is to occur many times. You will make some mistakes no matter what decision rule you use, but the probability of a mistake is minimized if you use the above rule. Nevin L. Zhang (HKUST) Machine Learning 6 / 50

7 Logistic Regression Logistic Regression: Example Solid black dots are the data. Those at the bottom are the SAT scores of applicants rejected by a university, and those at the top are the SAT scores of applicants accepted by a university. The red circles are the predicted probabilities that the applicants would be accepted. Nevin L. Zhang (HKUST) Machine Learning 7 / 50

8 Logistic Regression Logistic Regression: 2D Example The decision boundary p(y = 1 | x, w) = 0.5 is a straight line in the feature space. Nevin L. Zhang (HKUST) Machine Learning 8 / 50

9 Logistic Regression Logistic Regression is a Linear Classifier In fact, the decision/classification rule in logistic regression is equivalent to: ŷ = 1 iff wᵀx > 0 So, it is a linear classifier with a linear decision boundary. Example: whether a student is admitted based on the results of two exams: Nevin L. Zhang (HKUST) Machine Learning 9 / 50

10 Logistic Regression Some Math Since P(y = 1 | x, w) = σ(wᵀx), we have P(y = 0 | x, w) = 1 − σ(wᵀx). We can combine the two equations into one: P(y | x, w) = (σ(wᵀx))^y (1 − σ(wᵀx))^(1−y) Hence, log P(y | x, w) = y log σ(wᵀx) + (1 − y) log(1 − σ(wᵀx)) Nevin L. Zhang (HKUST) Machine Learning 10 / 50

11 Logistic Regression Model Fitting We would like to find the MLE of w, i.e., the value of w that minimizes the negative loglikelihood: NLL(w) = −log P(D | w) = −Σ_{i=1}^N log P(y_i | x_i, w) = −Σ_{i=1}^N [y_i log σ(wᵀx_i) + (1 − y_i) log(1 − σ(wᵀx_i))] Unlike linear regression, we can no longer write down the MLE in closed form. Instead, we need to use optimization algorithms to compute it: Gradient descent Newton's method Nevin L. Zhang (HKUST) Machine Learning 11 / 50
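
A small sketch of this objective in NumPy, assuming the convention x_{i,0} = 1 so that w_0 acts as the bias; the function and variable names are illustrative only:

    import numpy as np

    def nll(w, X, y):
        """NLL(w) = -sum_i [y_i log sigma(w^T x_i) + (1 - y_i) log(1 - sigma(w^T x_i))].
        X is the N x D design matrix (first column all ones), y holds labels in {0, 1}."""
        p = 1.0 / (1.0 + np.exp(-X @ w))           # sigma(w^T x_i) for every i
        eps = 1e-12                                # guard against log(0)
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
    y = np.array([1.0, 0.0, 1.0])
    print(nll(np.zeros(2), X, y))                  # 3 log 2, about 2.079, when w = 0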

12 Gradient Descent Outline 1 Logistic Regression 2 Gradient Descent 3 Newton's Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 12 / 50

13 Gradient Descent Gradient Descent Consider a function y = J(w) of a scalar variable w. The derivative of J(w) is defined as follows: J′(w) = dJ(w)/dw = lim_{ɛ→0} [J(w + ɛ) − J(w)] / ɛ When ɛ is small, we have J(w + ɛ) ≈ J(w) + ɛJ′(w) This equation tells us how to reduce J(w) by changing w in small steps: If J′(w) > 0, make ɛ negative, i.e., decrease w; if J′(w) < 0, make ɛ positive, i.e., increase w. In other words, move in the opposite direction of the derivative (gradient). Nevin L. Zhang (HKUST) Machine Learning 13 / 50

14 Gradient Descent Gradient Descent To implement the idea, we update w as follows: w ← w − αJ′(w) The term −J′(w) means that we move in the opposite direction of the derivative, and α determines how much we move in that direction. It is called the step size in optimization and the learning rate in machine learning. Nevin L. Zhang (HKUST) Machine Learning 14 / 50

15 Gradient Descent Gradient Descent Consider a function y = J(w), where w = (w_0, w_1, ..., w_D). The gradient of J at w is defined as ∇_w J = (∂J/∂w_0, ∂J/∂w_1, ..., ∂J/∂w_D) The gradient is the direction along which J increases the fastest. If we want to reduce J as fast as possible, move in the opposite direction of the gradient. Nevin L. Zhang (HKUST) Machine Learning 15 / 50

16 Gradient Descent Gradient Descent The method of steepest descent/gradient descent for minimizing J(w): 1 Initialize w 2 Repeat until convergence: w ← w − α ∇_w J(w) The learning rate α usually changes from iteration to iteration. Nevin L. Zhang (HKUST) Machine Learning 16 / 50
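
A minimal sketch of this loop in NumPy, with a fixed learning rate and a stopping test on the gradient norm; the quadratic test function and all names are illustrative assumptions, not part of the slides:

    import numpy as np

    def gradient_descent(grad, w0, alpha=0.1, tol=1e-6, max_iter=1000):
        """Repeat w <- w - alpha * grad(w) until the gradient is (nearly) zero."""
        w = np.array(w0, dtype=float)
        for _ in range(max_iter):
            g = grad(w)
            if np.linalg.norm(g) < tol:            # convergence test
                break
            w = w - alpha * g
        return w

    # Example: J(w) = (w_0 - 3)^2 + (w_1 + 1)^2 has gradient 2(w - [3, -1]).
    grad_J = lambda w: 2.0 * (w - np.array([3.0, -1.0]))
    print(gradient_descent(grad_J, np.zeros(2)))   # approx [3. -1.]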

17 Gradient Descent Choice of Learning Rate A constant learning rate is difficult to set: if it is too small, convergence will be very slow; if it is too large, the method can fail to converge at all. It is better to vary the learning rate. We will discuss this more in L07. Nevin L. Zhang (HKUST) Machine Learning 17 / 50

18 Gradient Descent Gradient Descent Gradient descent can get stuck at local minima or saddle points. Nonetheless, it usually works well. Nevin L. Zhang (HKUST) Machine Learning 18 / 50

19 Gradient Descent Derivative of the Sigmoid Function To apply gradient descent to logistic regression, we need to compute the partial derivative of the NLL w.r.t. each weight w_j. Before doing that, first consider the derivative of the sigmoid function: σ′(z) = dσ(z)/dz = d/dz [1 / (1 + e^{−z})] = −1 / (1 + e^{−z})² · d(1 + e^{−z})/dz = e^{−z} / (1 + e^{−z})² = [1 / (1 + e^{−z})] · [e^{−z} / (1 + e^{−z})] = σ(z)(1 − σ(z)) Nevin L. Zhang (HKUST) Machine Learning 19 / 50
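
The identity σ′(z) = σ(z)(1 − σ(z)) is easy to sanity-check numerically with a central finite difference; a tiny sketch (the printed values are approximate):

    import numpy as np

    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    z, h = 0.7, 1e-6
    numeric = (sigma(z + h) - sigma(z - h)) / (2 * h)   # central finite difference
    analytic = sigma(z) * (1 - sigma(z))                # formula derived above
    print(numeric, analytic)                            # both approx 0.2217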

20 Gradient Descent Derivative of NLL Let z_i = wᵀx_i. ∂NLL(w)/∂w_j = −Σ_{i=1}^N [y_i ∂log σ(z_i)/∂w_j + (1 − y_i) ∂log(1 − σ(z_i))/∂w_j] = −Σ_{i=1}^N [y_i / σ(z_i) − (1 − y_i) / (1 − σ(z_i))] ∂σ(z_i)/∂w_j = −Σ_{i=1}^N [y_i / σ(z_i) − (1 − y_i) / (1 − σ(z_i))] σ(z_i)(1 − σ(z_i)) ∂z_i/∂w_j = −Σ_{i=1}^N [y_i (1 − σ(z_i)) − (1 − y_i) σ(z_i)] x_{i,j} = −Σ_{i=1}^N [y_i − σ(z_i)] x_{i,j} Nevin L. Zhang (HKUST) Machine Learning 20 / 50

21 Gradient Descent Batch Gradient Descent The Batch Gradient Descent algorithm for logistic regression: Repeat until convergence w_j ← w_j + α Σ_{i=1}^N [y_i − σ(wᵀx_i)] x_{i,j} Interpretation: (Assume the x_i are positive vectors) If the predicted value σ(wᵀx_i) is smaller than the actual value y_i, there is reason to increase w_j. The increment is proportional to x_{i,j}. If the predicted value σ(wᵀx_i) is larger than the actual value y_i, there is reason to decrease w_j. The decrement is proportional to x_{i,j}. Nevin L. Zhang (HKUST) Machine Learning 21 / 50
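
A vectorized sketch of this algorithm, assuming labels in {0, 1} and a constant feature x_{i,0} = 1; the matrix product X.T @ (y - p) computes all D sums Σ_i (y_i − σ(wᵀx_i)) x_{i,j} at once (names and the fixed iteration count are illustrative):

    import numpy as np

    def fit_logistic_batch(X, y, alpha=0.1, n_iters=500):
        """Batch gradient descent: w_j <- w_j + alpha * sum_i (y_i - sigma(w^T x_i)) x_ij."""
        N, D = X.shape
        w = np.zeros(D)
        for _ in range(n_iters):
            p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted probabilities sigma(w^T x_i)
            w = w + alpha * X.T @ (y - p)          # vectorized form of the update above
        return w

    X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    w = fit_logistic_batch(X, y)
    print((1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int))   # [0 0 1 1]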

22 Gradient Descent Stochastic Gradient Descent Batch gradient descent is costly when N is large. In such cases, we usually use Stochastic Gradient Descent: Repeat until convergence Randomly choose B ⊂ {1, 2, ..., N} w_j ← w_j + α Σ_{i∈B} [y_i − σ(wᵀx_i)] x_{i,j} The randomly picked subset B is called a minibatch. Its size is called the batch size. Nevin L. Zhang (HKUST) Machine Learning 22 / 50
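
The only change relative to the batch version is that each update uses a random minibatch B instead of the full dataset; a sketch under the same assumptions as above (the batch size and seed are arbitrary illustrative choices):

    import numpy as np

    def fit_logistic_sgd(X, y, alpha=0.1, batch_size=2, n_iters=2000, seed=0):
        """Stochastic gradient descent for logistic regression with random minibatches."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        w = np.zeros(D)
        for _ in range(n_iters):
            B = rng.choice(N, size=batch_size, replace=False)   # random minibatch indices
            p = 1.0 / (1.0 + np.exp(-X[B] @ w))
            w = w + alpha * X[B].T @ (y[B] - p)                 # update uses only the minibatch
        return w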

23 Gradient Descent L2 Regularization Just as we prefer ridge regression to plain linear regression, we prefer to add a regularization term to the objective function of logistic regression: J(w) = NLL(w) + (λ/2) ‖w‖₂² In this case, we have ∂J(w)/∂w_j = −Σ_{i=1}^N [y_i − σ(z_i)] x_{i,j} + λ w_j The weight update formula is: w_j ← w_j + α [−λ w_j + Σ_{i∈B} (y_i − σ(wᵀx_i)) x_{i,j}] Nevin L. Zhang (HKUST) Machine Learning 23 / 50

24 Gradient Descent L2 Regularization Update rule without regularization: w_j ← w_j + α Σ_{i∈B} [y_i − σ(wᵀx_i)] x_{i,j} Update rule with regularization: w_j ← (1 − αλ) w_j + α Σ_{i∈B} (y_i − σ(wᵀx_i)) x_{i,j} Regularization forces the weights to be smaller: since 0 < αλ < 1, the factor (1 − αλ) shrinks w_j toward zero at every update. Nevin L. Zhang (HKUST) Machine Learning 24 / 50
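
A one-step sketch of the regularized update in this "weight decay" form; X_B and y_B stand for the minibatch and all names are hypothetical:

    import numpy as np

    def sgd_step_l2(w, X_B, y_B, alpha=0.1, lam=0.01):
        """w <- (1 - alpha*lam) * w + alpha * sum_{i in B} (y_i - sigma(w^T x_i)) x_i."""
        p = 1.0 / (1.0 + np.exp(-X_B @ w))
        return (1.0 - alpha * lam) * w + alpha * X_B.T @ (y_B - p)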

25 Newton's Method Outline 1 Logistic Regression 2 Gradient Descent 3 Newton's Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 25 / 50

26 Newton's Method Newton's Method Consider again a function J(w) of a scalar variable w. We want to minimize J(w). At a minimum point, the derivative J′(w) = 0. Let f(w) = J′(w). We want to find points where f(w) = 0. Newton's method (aka the Newton-Raphson method) is a commonly used method for solving this problem: Start with some initial value w Repeat the following update until convergence: w ← w − f(w) / f′(w) Nevin L. Zhang (HKUST) Machine Learning 26 / 50
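
A scalar sketch of this Newton-Raphson iteration; the root-finding example f(w) = w² − 2 is illustrative only:

    import numpy as np

    def newton_root(f, fprime, w0, n_iters=20):
        """Newton-Raphson: repeat w <- w - f(w) / f'(w) to find a root of f."""
        w = float(w0)
        for _ in range(n_iters):
            w = w - f(w) / fprime(w)
        return w

    print(newton_root(lambda w: w**2 - 2, lambda w: 2*w, w0=1.0))   # approx 1.41421 = sqrt(2)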

27 Newton's Method Newton's Method Interpretation of Newton's method: Approximate the function f via a linear function that is tangent to f at the current guess w_i: y = f′(w_i) w + f(w_i) − f′(w_i) w_i Solve for where that linear function equals zero to get the next w: f′(w_i) w + f(w_i) − f′(w_i) w_i = 0 ⟹ w_{i+1} = w_i − f(w_i) / f′(w_i) Nevin L. Zhang (HKUST) Machine Learning 27 / 50

28 Newton's Method Newton's Method Minimize J(w) using Newton's method: Start with some initial value w Repeat the following update until convergence: w ← w − J′(w) / J″(w) where J″ is the second derivative of J. Nevin L. Zhang (HKUST) Machine Learning 28 / 50

29 Newton's Method Newton's Method Now consider minimizing a function J(w) of a vector w = (w_0, w_1, ..., w_D). The first derivative of J(w) is the gradient vector ∇_w J(w). The second derivative of J(w) is the Hessian matrix H = [H_ij] with H_ij = ∂²J(w) / ∂w_i ∂w_j Minimize J(w) using Newton's method: Start with some initial value w Repeat the following update until convergence: w ← w − H⁻¹ ∇_w J(w) Nevin L. Zhang (HKUST) Machine Learning 29 / 50
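
For logistic regression this update has a concrete form: the gradient −Σ_i (y_i − σ(wᵀx_i)) x_i follows from the earlier derivative slide, while the Hessian XᵀSX with S = diag(σ_i(1 − σ_i)) is stated here without derivation, so treat the NumPy sketch below as an illustration under that assumption (the small lam ridge term is a hypothetical safeguard against a singular Hessian):

    import numpy as np

    def fit_logistic_newton(X, y, n_iters=10, lam=1e-6):
        """Newton's method for logistic regression: w <- w - H^{-1} grad NLL(w)."""
        N, D = X.shape
        w = np.zeros(D)
        for _ in range(n_iters):
            p = 1.0 / (1.0 + np.exp(-X @ w))
            grad = -X.T @ (y - p)                               # gradient of the NLL
            S = p * (1.0 - p)                                   # sigma_i (1 - sigma_i)
            H = X.T @ (X * S[:, None]) + lam * np.eye(D)        # Hessian X^T S X (+ safeguard)
            w = w - np.linalg.solve(H, grad)                    # Newton update
        return w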

30 Newton's Method Newton's Method: Another Perspective By Taylor expansion, we have J(w + ɛ) ≈ J(w) + J′(w) ɛ + (1/2) J″(w) ɛ² We want dJ(w + ɛ)/dɛ = 0. So, J′(w) + J″(w) ɛ = 0 Solving the equation, we get ɛ = −J′(w) / J″(w) Nevin L. Zhang (HKUST) Machine Learning 30 / 50

31 Newton's Method Newton's Method vs Gradient Descent In gradient descent, we have only a first order term in the approximation (gradient): J(w + ɛ) ≈ J(w) + ɛ J′(w) In Newton's method, we also have a second order term in the approximation (curvature): J(w + ɛ) ≈ J(w) + J′(w) ɛ + (1/2) J″(w) ɛ² Nevin L. Zhang (HKUST) Machine Learning 31 / 50

32 Newton's Method Newton's Method vs Gradient Descent Use of curvature allows us to better predict how J(w) changes when we change w: 1 With negative curvature, J actually decreases faster than the gradient predicts. 2 With no curvature, the gradient predicts the decrease correctly. 3 With positive curvature, the function decreases more slowly than expected. The use of curvature mitigates the problems in cases 1 and 3. Nevin L. Zhang (HKUST) Machine Learning 32 / 50

33 Newton's Method Newton's Method vs Gradient Descent Newton's method (red) typically enjoys faster convergence than (batch) gradient descent (green), and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires computing and inverting the Hessian matrix. Nevin L. Zhang (HKUST) Machine Learning 33 / 50

34 Softmax Regression Outline 1 Logistic Regression 2 Gradient Descent 3 Newton's Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 34 / 50

35 Softmax Regression Multi-Class Logistic Regression Training set D = {x_i, y_i}_{i=1}^N, where y_i ∈ {1, 2, ..., C}, and C ≥ 2. Multi-Class Logistic Regression (aka softmax regression) uses the following probability model: There is a weight vector w_c = (w_{c,0}, w_{c,1}, ..., w_{c,D}) for each class c. W = (w_1, ..., w_C) is the weight matrix. The probability of a particular class c given x is: p(y = c | x, W) = exp(w_cᵀx) / Z(x, W) where Z(x, W) = Σ_{c′=1}^C exp(w_{c′}ᵀx) is the normalization constant that ensures the probabilities of all classes sum to 1. It is called the partition function. The classification rule is: ŷ = arg max_y p(y | x, W) Nevin L. Zhang (HKUST) Machine Learning 35 / 50
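
A small NumPy sketch of this model; subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the probabilities unchanged (function and variable names are illustrative):

    import numpy as np

    def softmax_probs(W, x):
        """p(y = c | x, W) = exp(w_c^T x) / Z(x, W), with one row of W per class."""
        scores = W @ x                      # w_c^T x for every class c
        scores = scores - np.max(scores)    # stability; cancels in the ratio
        e = np.exp(scores)
        return e / e.sum()                  # divide by the partition function Z

    W = np.array([[0.5, -1.0], [0.0, 0.0], [-0.5, 1.0]])   # C = 3 classes, D = 2 features
    x = np.array([1.0, 2.0])
    p = softmax_probs(W, x)
    print(p, p.sum())                       # three probabilities that sum to 1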

36 Softmax Regression Relationship with Logistic Regression When C = 2, name the two classes {0, 1} instead of {1, 2}: p(y = 0 | x, W) = exp(w_0ᵀx) / [exp(w_0ᵀx) + exp(w_1ᵀx)], p(y = 1 | x, W) = exp(w_1ᵀx) / [exp(w_0ᵀx) + exp(w_1ᵀx)] Dividing both numerator and denominator by exp(w_1ᵀx), we get p(y = 0 | x, W) = exp((w_0 − w_1)ᵀx) / [exp((w_0 − w_1)ᵀx) + 1], p(y = 1 | x, W) = 1 / [exp((w_0 − w_1)ᵀx) + 1] Letting w = w_1 − w_0, we get the logistic regression model: p(y = 0 | x, w) = exp(−wᵀx) / [1 + exp(−wᵀx)], p(y = 1 | x, w) = 1 / [1 + exp(−wᵀx)] Nevin L. Zhang (HKUST) Machine Learning 36 / 50

37 Softmax Regression Negative Loglikelihood The negative loglikelihood of the softmax regression model is: NLL(W) = −log P(D | W) = −Σ_{i=1}^N log P(y_i | x_i, W) = −Σ_{i=1}^N Σ_{c=1}^C 1(y_i = c) log P(y_i = c | x_i, W) = −Σ_{i=1}^N Σ_{c=1}^C 1(y_i = c) log [exp(w_cᵀx_i) / Σ_{c′=1}^C exp(w_{c′}ᵀx_i)] = −Σ_{i=1}^N Σ_{c=1}^C 1(y_i = c) [w_cᵀx_i − log Σ_{c′=1}^C exp(w_{c′}ᵀx_i)] = −Σ_{i=1}^N [Σ_{c=1}^C 1(y_i = c) w_cᵀx_i − log Σ_{c′=1}^C exp(w_{c′}ᵀx_i)] Nevin L. Zhang (HKUST) Machine Learning 37 / 50

38 Softmax Regression Partial Derivative of NLL We would like to find the MLE of W. To do so, we need to minimize the NLL. There is no closed-form solution. So, we resort to gradient descent. The partial derivative of the NLL w.r.t. each w_{c,j} is: ∂NLL(W)/∂w_{c,j} = −Σ_{i=1}^N [∂(Σ_{c″=1}^C 1(y_i = c″) w_{c″}ᵀx_i)/∂w_{c,j} − ∂(log Σ_{c′=1}^C exp(w_{c′}ᵀx_i))/∂w_{c,j}] = −Σ_{i=1}^N [1(y_i = c) x_{i,j} − exp(w_cᵀx_i) / Σ_{c′=1}^C exp(w_{c′}ᵀx_i) · x_{i,j}] = −Σ_{i=1}^N [1(y_i = c) − p(y_i = c | x_i, W)] x_{i,j} Compare this to the gradient of logistic regression: ∂NLL(w)/∂w_j = −Σ_{i=1}^N [y_i − σ(wᵀx_i)] x_{i,j} Nevin L. Zhang (HKUST) Machine Learning 38 / 50

39 Softmax Regression Gradient of NLL The gradient of the NLL w.r.t. each w_c is: ∇_{w_c} NLL(W) = (∂NLL(W)/∂w_{c,0}, ∂NLL(W)/∂w_{c,1}, ..., ∂NLL(W)/∂w_{c,D}) = −Σ_{i=1}^N [1(y_i = c) − p(y_i = c | x_i, W)] x_i Update rule in gradient descent: w_c ← w_c + α Σ_{i=1}^N [1(y_i = c) − p(y_i = c | x_i, W)] x_i Nevin L. Zhang (HKUST) Machine Learning 39 / 50
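
A vectorized sketch of this update, performing all C class updates at once via a one-hot matrix Y with Y[i, c] = 1(y_i = c); classes are 0-indexed here, unlike the 1, ..., C convention of the slides, and all names are illustrative:

    import numpy as np

    def fit_softmax_gd(X, y, C, alpha=0.1, n_iters=500):
        """Gradient descent for softmax regression:
        w_c <- w_c + alpha * sum_i [1(y_i = c) - p(y_i = c | x_i, W)] x_i, for every c."""
        N, D = X.shape
        W = np.zeros((C, D))
        Y = np.eye(C)[y]                                        # one-hot labels
        for _ in range(n_iters):
            scores = X @ W.T
            scores -= scores.max(axis=1, keepdims=True)         # numerical stability
            P = np.exp(scores)
            P /= P.sum(axis=1, keepdims=True)                   # P[i, c] = p(y_i = c | x_i, W)
            W = W + alpha * (Y - P).T @ X                       # all C updates at once
        return W

    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # constant feature x_0 = 1
    y = np.array([0, 0, 1, 2])
    W = fit_softmax_gd(X, y, C=3)
    print((X @ W.T).argmax(axis=1))                             # predicted class per training point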

40 Optimization Approach to Classification Outline 1 Logistic Regression 2 Gradient Descent 3 Newton's Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 40 / 50

41 Optimization Approach to Classification The Probabilistic Approach to Classification So far, we have been talking about the probabilistic approach to classification: Start with a training set D = {x_i, y_i}_{i=1}^N, where y_i ∈ {0, 1}. Learn a probabilistic model p(y | x, w) = Ber(y | σ(wᵀx)) by minimizing the negative loglikelihood NLL(w) = −log P(D | w) Classify new instances using: ŷ = 1 iff p(y = 1 | x, w) > 0.5 Nevin L. Zhang (HKUST) Machine Learning 41 / 50

42 Optimization Approach to Classification The Optimization Approach Next, we will briefly talk about the optimization approach to classification: Start with a training set D = {x_i, y_i}_{i=1}^N, where y_i ∈ {−1, 1}. Learn a linear classifier ŷ = sign(wᵀx + b) where x = (x_1, ..., x_D) and w = (w_1, ..., w_D). We drop the convention x_0 = 1 and use b to replace w_0. Nevin L. Zhang (HKUST) Machine Learning 42 / 50

43 Optimization Approach to Classification The Sign Function sign(x) = 1 if x > 0; sign(x) = 0 if x = 0; sign(x) = −1 if x < 0. Nevin L. Zhang (HKUST) Machine Learning 43 / 50

44 Optimization Approach to Classification The Optimization Approach We determine w and b by minimizing the training/empirical error L(w, b) = Σ_{i=1}^N L_{0/1}(y_i, ŷ_i) where L_{0/1}(y_i, ŷ_i) = 1(y_i ≠ ŷ_i) is called the zero/one loss function. Let z = wᵀx + b. It is easy to see that L_{0/1}(y, ŷ) = 1(y(wᵀx + b) ≤ 0) = 1(yz ≤ 0) To overload the notation, we also write it as L_{0/1}(y, z). So, the loss function of linear classifiers is often written as L(w, b) = Σ_{i=1}^N 1(y_i(wᵀx_i + b) ≤ 0) Nevin L. Zhang (HKUST) Machine Learning 44 / 50
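
A direct sketch of this training error in NumPy, assuming labels in {−1, +1} (all names and the toy data are illustrative):

    import numpy as np

    def training_error(w, b, X, y):
        """L(w, b) = sum_i 1(y_i (w^T x_i + b) <= 0): the number of 0/1-loss mistakes."""
        z = X @ w + b
        return int(np.sum(y * z <= 0))

    X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
    y = np.array([-1.0, -1.0, 1.0, 1.0])
    print(training_error(np.array([1.0]), 0.0, X, y))   # 0 mistakes for this w and b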

45 Optimization Approach to Classification A Problem with the Zero/One Loss Function The zero/one loss function, as a function of w and b, is not convex and hence difficult to optimize. (In the plot, the x-axis is yz.) Nevin L. Zhang (HKUST) Machine Learning 45 / 50

46 Optimization Approach to Classification Convex Function A real-valued function f is convex if the line segment between any two points on the graph of the function lies above or on the graph. Equivalently, for any two points x_1 < x_2 and any t ∈ [0, 1], f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2) f is strictly convex if the inequality is strict whenever x_1 ≠ x_2 and t ∈ (0, 1). Nevin L. Zhang (HKUST) Machine Learning 46 / 50

47 Optimization Approach to Classification Surrogate Loss Functions Convex functions are easy to minimize. Intuitively, if you drop a ball anywhere in a convex function, it will eventually get to the minimum. This is not true for non-convex functions. Since the zero/one loss is hard to optimize, we want to approximate it using a convex function and optimize that function. This approximating function will be called a surrogate loss. The surrogate losses need to be upper bounds on the true loss function: this guarantees that if you minimize the surrogate loss, you are also pushing down the real loss. Nevin L. Zhang (HKUST) Machine Learning 47 / 50

48 Optimization Approach to Classification Logistic Loss One common surrogate loss function is the logistic loss: L_log(y, z) = (1 / log 2) log(1 + exp(−yz)) = (1 / log 2) log(1 + exp(−(wᵀx + b))) if y = 1, and (1 / log 2) log(1 + exp(wᵀx + b)) if y = −1. On the other hand, each term in the NLL of logistic regression has the following form: −log P(y | x, w) = −[y log σ(wᵀx) + (1 − y) log(1 − σ(wᵀx))] = −log [1 / (1 + exp(−wᵀx))] if y = 1, and −log [1 − 1 / (1 + exp(−wᵀx))] if y = 0 = log(1 + exp(−wᵀx)) if y = 1, and log(1 + exp(wᵀx)) if y = 0. So, up to the constant factor 1 / log 2, logistic regression is the same as a linear classifier with the logistic surrogate loss function. Nevin L. Zhang (HKUST) Machine Learning 48 / 50

49 Optimization Approach to Classification Other Surrogate Loss Functions Zero/one loss: L_{0/1}(y, z) = 1(yz ≤ 0). Logistic loss: L_log(y, z) = (1 / log 2) log(1 + exp(−yz)). Hinge loss: L_hin(y, z) = max{0, 1 − yz}. Exponential loss: L_exp(y, z) = exp(−yz). Squared loss: L_sqr(y, z) = (y − z)². Nevin L. Zhang (HKUST) Machine Learning 49 / 50
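
All of these losses are simple functions of y and z, so they are easy to write down directly and compare numerically; a sketch assuming y ∈ {−1, +1} (names are illustrative):

    import numpy as np

    # The surrogate losses above, written as functions of y and z = w^T x + b.
    zero_one    = lambda y, z: (y * z <= 0).astype(float)
    logistic    = lambda y, z: np.log1p(np.exp(-y * z)) / np.log(2)
    hinge       = lambda y, z: np.maximum(0.0, 1.0 - y * z)
    exponential = lambda y, z: np.exp(-y * z)
    squared     = lambda y, z: (y - z) ** 2

    y, z = 1.0, np.array([-2.0, 0.0, 2.0])
    for name, L in [("0/1", zero_one), ("log", logistic), ("hinge", hinge),
                    ("exp", exponential), ("sqr", squared)]:
        print(name, L(y, z))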

50 Optimization Approach to Classification Loss and Error When a surrogate loss function is used, there are two ways to measure how a classifier performs on the training set: training loss and training error. There are also two ways to measure how a classifier performs on the test set: test loss and test error. As in the case of regression, test loss/error can be improved using regularization. Nevin L. Zhang (HKUST) Machine Learning 50 / 50
