Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang
|
|
- Hubert Jenkins
- 5 years ago
- Views:
Transcription
1 Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set of notes is based on internet resources and KP Murphy (2012). Machine learning: a probabilistic perspective. MIT Press. (Chapter 8) Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT press. (Chapter 4) Andrew Ng. Lecture Notes on Machine Learning. Stanford. Hal Daume. A Course on Machine Learning. Nevin L. Zhang (HKUST) Machine Learning 1 / 50
2 Logistic Regression Outline 1 Logistic Regression 2 Gradient Descent 3 Newton s Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 2 / 50
3 Logistic Regression Recap of Probabilistic Linear Regression Training set D = {x i, y i } N i=1, where y i R. Probabilistic model: p(y x, θ) = N (y µ(x), σ 2 ) = N (y w x, σ 2 ) Learning: Determining w by minimizing the negative loglikelihood NLL(w, σ) = log P(D w, σ) Point estimation of y: ŷ = µ(x) = w x Nevin L. Zhang (HKUST) Machine Learning 3 / 50
4 Logistic Regression Logistic Regression (for Classification) Training set D = {x i, y i } N i=1, where y i {0, 1}. Probabilistic model: p(y x, w) = Ber(y σ(w x)) σ(z) is the sigmoid/logistic/logit function. σ(z) = exp( z) = ez e z + 1 It maps the real line R to [0, 1]. Not to be confused with the variance σ 2 in the Gaussian distribution Nevin L. Zhang (HKUST) Machine Learning 4 / 50
5 Logistic Regression Logistic Regression Rephrase of the probabilistic model, log p 1 p log p(y = 1 x, w) = σ(w x) = p(y = 1 x, w) 1 p(y = 1 x, w) is called the logit of p. = w x exp( w x) Learning: Determining w by minimizing the negative loglikelihood NLL(w) = log P(D w) Nevin L. Zhang (HKUST) Machine Learning 5 / 50
6 Logistic Regression Logistic Regression: Decision Rule To classify instances, we obtain a point estimation of y: ŷ = arg max y p(y x, w) In other words, the decision/classification rule is: ŷ = 1 iff p(y = 1 x, w) > 0.5 This is called the optimal Bayes classifier: Suppose the same situation is to occur many times. You will always make mistakes no matter what decision rule to use. The probability of mistakes is minimized if you use the above rule. Nevin L. Zhang (HKUST) Machine Learning 6 / 50
7 Logistic Regression Logistic Regression: Example Solid black dots are the data. Those at the bottom are the SAT scores of applicants rejected by a university, and those at the top are the SAT scores of applicants accepted by a university. The red circles are the predicted probabilities that the applicants would be accepted. Nevin L. Zhang (HKUST) Machine Learning 7 / 50
8 Logistic Regression Logistic Regression: 2D Example The decision boundary p(y = 1 x, w) = 0.5 is a straight line in the feature space. Nevin L. Zhang (HKUST) Machine Learning 8 / 50
9 Logistic Regression Logistic Regression is a Linear Classifier In fact, the decision/classification rule in logistic regression is equivalent to: ŷ = 1 iff w x > 0 So,it is a linear classifier with a decision boundary. Example: Whether a students is admitted based on the results of two exams: Nevin L. Zhang (HKUST) Machine Learning 9 / 50
10 Logistic Regression Some Math Since P(y = 1 x, w) = σ(w x), we have P(y = 0 x, w) = 1 σ(w x). We can combine the two equations into one: P(y x, w) = (σ(w x)) y (1 σ(w x)) 1 y Hence, log P(y x, w) = y log σ(w x) + (1 y) log(1 σ(w x)) Nevin L. Zhang (HKUST) Machine Learning 10 / 50
11 Logistic Regression Model fitting We would like to find the MLE of w, i.e, the values of w that minimizes the negative loglikelihood: NLL(w) = log P(D w) = = N log P(y i x i, w) i=1 N [y i log σ(w x i ) + (1 y i ) log(1 σ(w x i ))] i=1 Unlike linear regression, we can no longer write down the MLE in closed form. Instead, we need to use optimization algorithms to compute it. Gradient descent Newton s method Nevin L. Zhang (HKUST) Machine Learning 11 / 50
12 Gradient Descent Outline 1 Logistic Regression 2 Gradient Descent 3 Newton s Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 12 / 50
13 Gradient Descent Gradient Descent Consider a function y = J(w) of a scalar variable w. The derivative of J(w) is defined as follows: J (w) = When ɛ is small, we have df (w) dw = lim J(w + ɛ) J(w) ɛ 0 ɛ J(w + ɛ) J(w) + ɛj (w) This equation tells use how to reduce J(w) by changing w in small steps: If J (w) > 0, make ɛ negative, i.e., decrease w; If J (w) < 0, make ɛ positive, i.e., increase w. In other words, move in the opposite direction of the derivative (gradient) Nevin L. Zhang (HKUST) Machine Learning 13 / 50
14 Gradient Descent Gradient Descent To implement the idea, we update w as follows: w w αj (w) The term J (w) means that we move in the opposite direction of the derivative, and α determines how much we move in that direction. It is called the step size in optimization and the learning rate in machine learning Nevin L. Zhang (HKUST) Machine Learning 14 / 50
15 Gradient Descent Gradient Descent Consider a function y = J(w), where w = (w 0, w 1,..., w D ). The gradient of J at w is defined as w J = ( dj dw 0, dj dw 1,..., dj dw D ) The gradient is the direction along which J increases the fastest. If we want to reduce J as fast as possible, move in the opposite direction of the gradient. Nevin L. Zhang (HKUST) Machine Learning 15 / 50
16 Gradient Descent Gradient Descent The method of steepest descent/gradient descent for minimizing J(w) 1 Initialize w 2 Repeat until convergence w w α w J(w) The learning rate α usually changes from iteration to iteration. Nevin L. Zhang (HKUST) Machine Learning 16 / 50
17 Gradient Descent Choice of Learning Rate Constant learning rate is difficult to set: Too small, convergence will be very slow Too large, the method can fail to converge at all. Better to vary the learning rate. Will discuss this more in L07. Nevin L. Zhang (HKUST) Machine Learning 17 / 50
18 Gradient Descent Gradient Descent Gradient descent can get stuck at local minima or saddle points Nonetheless, it usually works well. Nevin L. Zhang (HKUST) Machine Learning 18 / 50
19 Gradient Descent Derivative of the Sigmoid Function To apply gradient descent to logistic regression, we need to compute the partial derivative of the NLL w.r.t each weight w j. Before doing that, first consider the derivative of the sigma function: σ (z) = dσ(z) dz e z = d dz 1 d(1 + e z ) = (1 + e z ) 2 dz 1 = (1 + e z ) 2 e z 1 = 1 + e z ( e z ) = σ(z)(1 σ(z)) Nevin L. Zhang (HKUST) Machine Learning 19 / 50
20 Gradient Descent Derivative of NLL Let z i = w x i. NLL(w) w j = = = = = N i=1 [y i log σ(z i ) w j + (1 y i ) log(1 σ(z i)) w j ] N 1 [y i σ(z i ) (1 y 1 i) 1 σ(z i ) ] σ(z i) w j i=1 N 1 [y i σ(z i ) (1 y 1 i) 1 σ(z i ) ]σ(z i)(1 σ(z i )) z i w j i=1 N [y i (1 σ(z i )) (1 y i )σ(z i )]x i,j i=1 N [y i σ(z i )]x i,j i=1 Nevin L. Zhang (HKUST) Machine Learning 20 / 50
21 Gradient Descent Batch Gradient Descent The Batch Gradient Descent algorithm for logistic regression Repeat until convergence w j w j + α N [y i σ(w x i )]x i,j Interpretation: (Assume x i are positive vectors) If predicted value σ(w x i ) is smaller than the actual value y i, there is reason to increase w j. The increment is proportional to x i,j. If predicted value σ(w x i ) is larger than the actual value y i, there is reason to decrease w j. The decrement is proportional to x i,j. i=1 Nevin L. Zhang (HKUST) Machine Learning 21 / 50
22 Gradient Descent Stochastic Gradient Descent Batch gradient descent is costly when N is large. In such case, we usually use Stochastic Gradient Descent: Repeat until convergence Randomly choose B {1, 2,..., N} w j w j + α i B[y i σ(w x i )]x i,j The randomly picked subset B is called a minibatch. Its size is called the batch size. Nevin L. Zhang (HKUST) Machine Learning 22 / 50
23 Gradient Descent L 2 Regularization Just as we prefer ridge regression to linear regression, we prefer to add a regularization term in the objective function of logistic regression: J(w) = NLL(w) λ w 2 2. In this case, we have J(w) w j N = [y i σ(z i )]x i,j + λw j i=1 The weight update formula is: w j w j + α[ λw j + i B(y i σ(w x i ))x i,j ] Nevin L. Zhang (HKUST) Machine Learning 23 / 50
24 Gradient Descent L 2 Regularization Update rule without regularization: w j w j + α i B[y i σ(w x i )]x i,j Update rule with regularization: w j (1 αλ)w j + α i B(y i σ(w x i ))x i,j Regularization forces the weights to be smaller: as 1 > αλ > 0. (1 αλ)w j < w j Nevin L. Zhang (HKUST) Machine Learning 24 / 50
25 Newton s Method Outline 1 Logistic Regression 2 Gradient Descent 3 Newton s Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 25 / 50
26 Newton s Method Newton s Method Consider again a function J(w) of a scalar variable w. We want to minimize J(w). At a minimum point, the derivative J (w) = 0. Let f (w) = J (w). We want to find points where f (w) = 0. Newton s method (aka the Newton-Raphson method) is a commonly used method to solve the problem: Start some initial value w Repeat the following update until convergence w w f (w) f (w) Nevin L. Zhang (HKUST) Machine Learning 26 / 50
27 Newton s Method Newton s Method Interpretation of Newton s method Approximate the function f via a linear function that is tangent to f at the current guess w i : y = f (w i )w + f (w i ) f (w i )w i Solve for where that linear function equals to zero to get next w. f (w i )w + f (w i ) f (w i )w i = 0 w i+1 = w i f (w i) f (w i ) Nevin L. Zhang (HKUST) Machine Learning 27 / 50
28 Newton s Method Newton s Method Minimize J(w) using Newton s method: Start some initial value w Repeat the following update until convergence w w J (w) J (w) where f is the second derivative of f. Nevin L. Zhang (HKUST) Machine Learning 28 / 50
29 Newton s Method Newton s Method New consider minimizing a J(w) of a vector w = (w 0, w 1,..., w D ). The first derivative of J(w) is the gradient vector w J(w). The second derivative of J(w) is the Hessian matrix H = [H ij ] H ij = 2 J(w) w i w j Minimize J(w) using Newton s method Start some initial value w Repeat the following update until convergence w w H 1 w J(w) Nevin L. Zhang (HKUST) Machine Learning 29 / 50
30 Newton s Method Newton s Method: Another Perspective By Taylor expansion, we have J(w + ɛ) J(w) + J (w)ɛ J (w)ɛ 2 We want dj(w+ɛ) dɛ = 0. So, Solving the equation, we get J (w) + J (w)ɛ = 0 ɛ = J (w) J (w) Nevin L. Zhang (HKUST) Machine Learning 30 / 50
31 Newton s Method Newton s Method vs Gradient Descent In gradient descent, we have only a first order term in the approximation (gradient). J(w + ɛ) J(w) + ɛj (w) In Newton s method, we also have a second order term in the approximation (curvature) J(w + ɛ) J(w) + J (w)ɛ J (w)ɛ 2 Nevin L. Zhang (HKUST) Machine Learning 31 / 50
32 Newton s Method Newton s Method vs Gradient Descent Use of curvature allows us to better predict how J(w) changes when we change w 1 With negative curvature, J actually decreases faster than the gradient predicts. 2 With no curvature, the gradient predicts the decrease correctly. 3 With positive curvature, the function decreases more slowly than expected. The use of curvature mitigates with problems with case 1 and 3. Nevin L. Zhang (HKUST) Machine Learning 32 / 50
33 Newton s Method One iteration of Newtons can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting an Hessian matrix. Nevin L. Zhang (HKUST) Machine Learning 33 / 50 Newton s Method vs Gradient Descent Newton s method (red) typically enjoys faster convergence than (batch) gradient descent (green), and requires many fewer iterations to get very close to the minimum.
34 Softmax Regression Outline 1 Logistic Regression 2 Gradient Descent 3 Newton s Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 34 / 50
35 Softmax Regression Multi-Class Logistic Regression Training set D = {x i, y i } N i=1, where y i {1, 2,..., C}, and C 2. Multi-Class Logistic Regression (aka softmax regression) uses the following probability model: There is a weight vector w c = (w c,0, w c,1,..., w c,d ) for each class c. W = (w 1,..., w C ) is the weight matrix. The probability of a particular class c given x is: p(y = c x, W) = 1 Z(x, W) exp(w c x) where Z(x, W) = C c =1 exp(w c x) is the normalization constant to ensure the probabilities of all classes sum to 1. It is called the partition function. The classification rule is: ŷ = arg max p(y x, W) y Nevin L. Zhang (HKUST) Machine Learning 35 / 50
36 Softmax Regression Relationship with Logistic Regression When C = 2, name the two classes as {0, 1} instead of {1, 2}: p(y = 0 x, W) = exp(w0 x) exp(w0 x) + exp(w 1 x), p(y = 1 x, W) = exp(w1 x) exp(w0 x) + exp(w 1 x) Dividing both numerator and denominator by exp(w 1 x), we get p(y = 0 x, W) = exp((w 0 w 1 )x) exp((w 0 w 1 )x) + 1, p(y = 1 x, W) = 1 exp((w 0 w 1 )x) + 1 Let w = w 1 w 0, we get the logistic regression model: p(y = 0 x, w) = exp( w x) 1 + exp( w x), p(y = 1 x, w) = 1 exp(1 + w x) Nevin L. Zhang (HKUST) Machine Learning 36 / 50
37 Softmax Regression Negative Loglikelihood The negative loglikelihood of the softmax regression model is: NLL(w) = log P(D w) = = = = = N i=1 c=1 N i=1 c=1 N i=1 c=1 N log P(y i x i, w) i=1 C 1(y i = c) log P(y i = c x i, w) C 1(y i = c) log C 1(y i = c)[wc x i log N C [ 1(y i = c)wc x i log i=1 c=1 exp(w c x i ) C c =1 exp(w c x i) C exp(wc x i)] c =1 C exp(wc x i)] c =1 Nevin L. Zhang (HKUST) Machine Learning 37 / 50
38 Softmax Regression Partial Derivative of NLL We would like to find the MLE of W. To do so, we need to minimize the NLL. There is no closed-form solution. So, we resort to gradient descent. The partial derivative of NLL w.r.t each w c,j is: NLL(w) w c,j = = = N [ i=1 i=1 1 w c,j C c=1 1(y i = c)w c x i 1 w c,j log N exp(wc x i ) [1(y i = c)x i,j C c =1 exp(w c x i) x i,j] N [1(y i = c) p(y i = c x i, W)]x i,j i=1 Compare this to the gradient of logistic regression: NLL(w) N = [y i σ(w x i )]x i,j w j i=1 C exp(wc x i)] Nevin L. Zhang (HKUST) Machine Learning 38 / 50 c =1
39 Softmax Regression Gradient of NLL The gradient of NLL w.r.t each w c is: wc NLL(w) = ( NLL(w) = w c,0, NLL(w) w c,1,..., NLL(w) ) w c,d N [1(y i = c) p(y i = c x i, W)]x i i=1 Update rule in Gradient Descent: w c w c + α N [1(y i = c) p(y i = c x i, W)]x i i=1 Nevin L. Zhang (HKUST) Machine Learning 39 / 50
40 Optimization Approach to Classification Outline 1 Logistic Regression 2 Gradient Descent 3 Newton s Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 40 / 50
41 Optimization Approach to Classification The Probabilistic Approach to Classification So far, we have been talking about the probabilistic approach to classification: Start with training set D = {x i, y i } N i=1, where y i {0, 1}. Learn a probabilistic model: p(y x, w) = Ber(y σ(w x)) by minimizing the negative loglikelihood Classify new instances using: NLL(w) = log P(D w) ŷ = 1 iff p(y = 1 x, w) > 0.5 Nevin L. Zhang (HKUST) Machine Learning 41 / 50
42 Optimization Approach to Classification The Optimization Approach Next, we will briefly talk about the optimization approach to classification: Start with training set D = {x i, y i } N i=1, where y i { 1, 1}, Learn a linear classifier ŷ = sign(w x + b) where x = (x 1,..., x D ) and w = (w 1,..., w D ). We drop the convention x 0 = 1 and use b to replace w 0. Nevin L. Zhang (HKUST) Machine Learning 42 / 50
43 Optimization Approach to Classification The Sign Function signx = 1 if x > 0; signx = 0 if x = 0; signx = 1 if x < 0. Nevin L. Zhang (HKUST) Machine Learning 43 / 50
44 Optimization Approach to Classification The Optimization Approach We determine w and b by minimizing the training/empirical error L(w, b) = N L 0/1 (y i, ŷ i ) i=1 where L 0/1 (y i, ŷ i ) = 1(y i ŷ i ) is called the zero/one loss function. Let z = w x + b. It is easy to see that L 0/1 (y, ŷ) = 1(y(w x + b) 0) = 1(yz 0) To overload the notation, we also write it as L 0/1 (y, z) So, the loss function of linear classifiers is often written as L(w, b) = N 1(y i (w x i + b) 0) i=1 Nevin L. Zhang (HKUST) Machine Learning 44 / 50
45 Optimization Approach to Classification A problem the Zero/One Loss Function The zero/one loss function, as a function of w and b, is not convex and hence difficult to optimize X -axis is yz. Nevin L. Zhang (HKUST) Machine Learning 45 / 50
46 Optimization Approach to Classification Convex Function A real-valued function f is convex if the line segment between any two points on the graph of the function lies above or on the graph. For any two points x 1 < x 2 and any t [0, 1] f (tx 1 + (1 t)x 2 ) tf (x 1 ) + (1 t)f (x 2 ) f is strictly convex if the the equality holds only when x 1 = x 2. Nevin L. Zhang (HKUST) Machine Learning 46 / 50
47 Optimization Approach to Classification Surrogate Loss Functions Convex functions are easy to minimize. Intuitively, if you drop a ball anywhere in a convex function, it will eventually get to the minimum. This is not true for non-convex functions. Since the zero/one loss is hard to optimize, we want to approximate it using a convex function and optimize that function. This approximating function will be called a surrogate loss. The surrogate losses need to be upper bounds on the true loss function: this guarantees that if you minimize the surrogate loss, you are also pushing down the real loss. Nevin L. Zhang (HKUST) Machine Learning 47 / 50
48 Optimization Approach to Classification Logistic Loss One common surrogate loss function is the logistic loss: 1 L log (y, z) = log(1 + exp( yz)) log 2 = { 1 log 2 log(1 + exp( (w x + b))) if y = 1 1 log 2 log(1 + exp(w x + b)) if y = 1 On the other hand, each term in the NLL of logistic regression has the following form log P(y x, w) = {y log σ(w x) + (1 y) log(1 σ(w x))} { log 1 if y = 1 1+exp( w = x) 1 log(1 1+exp( w x) ) if y = 0 { log(1 + exp( w = x)) if y = 1 log(1 + exp(w x)) if y = 0 So, logistic regression is the same as linear classifier with logistic surrogate loss function. Nevin L. Zhang (HKUST) Machine Learning 48 / 50
49 Optimization Approach to Classification Other Surrogate Loss Functions Zero/one loss: L 0/1 (y, z) = 1(yz 0). Logistic loss: L log (y, z) = 1 log 2 log(1 + exp( yz)) Hinge loss: L hin (y, z) = max{0, 1 yz} Exponential loss: L exp (y, z) = exp( yz) Squared loss: L sqr (y, z) = (y z) 2 Nevin L. Zhang (HKUST) Machine Learning 49 / 50
50 Optimization Approach to Classification Loss and Error When a surrogate loss function is used, There are two ways to measure how a classifier performs on the training set: training loss and training error. There are also two ways to measure how a classifier performs on the set set: test loss and test error. As in the case of regression, test loss/error can be improved using regularization. Nevin L. Zhang (HKUST) Machine Learning 50 / 50
Lecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016
Lecture 7 Logistic Regression Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 11, 2016 Luigi Freda ( La Sapienza University) Lecture 7 December 11, 2016 1 / 39 Outline 1 Intro Logistic
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More informationLinear and logistic regression
Linear and logistic regression Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Linear and logistic regression 1/22 Outline 1 Linear regression 2 Logistic regression 3 Fisher discriminant analysis
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More informationIntroduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationMachine Learning - Waseda University Logistic Regression
Machine Learning - Waseda University Logistic Regression AD June AD ) June / 9 Introduction Assume you are given some training data { x i, y i } i= where xi R d and y i can take C different values. Given
More informationIntroduction to gradient descent
6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our
More informationLogistic Regression. Stochastic Gradient Descent
Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationCS 340 Lec. 16: Logistic Regression
CS 34 Lec. 6: Logistic Regression AD March AD ) March / 6 Introduction Assume you are given some training data { x i, y i } i= where xi R d and y i can take C different values. Given an input test data
More informationMachine Learning Basics III
Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient
More informationLogistic Regression. Seungjin Choi
Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationLecture 3 - Linear and Logistic Regression
3 - Linear and Logistic Regression-1 Machine Learning Course Lecture 3 - Linear and Logistic Regression Lecturer: Haim Permuter Scribe: Ziv Aharoni Throughout this lecture we talk about how to use regression
More informationCSC 411: Lecture 04: Logistic Regression
CSC 411: Lecture 04: Logistic Regression Raquel Urtasun & Rich Zemel University of Toronto Sep 23, 2015 Urtasun & Zemel (UofT) CSC 411: 04-Prob Classif Sep 23, 2015 1 / 16 Today Key Concepts: Logistic
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationCSC321 Lecture 4: Learning a Classifier
CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 31 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationMidterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.
CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic
More informationLecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data
Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationLecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning
Lecture 0 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a
More informationMachine Learning. Lecture 3: Logistic Regression. Feng Li.
Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification
More informationLinear Regression. S. Sumitra
Linear Regression S Sumitra Notations: x i : ith data point; x T : transpose of x; x ij : ith data point s jth attribute Let {(x 1, y 1 ), (x, y )(x N, y N )} be the given data, x i D and y i Y Here D
More informationLinear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging
More informationLogistic Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com
Logistic Regression These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationDeep Learning. Authors: I. Goodfellow, Y. Bengio, A. Courville. Chapter 4: Numerical Computation. Lecture slides edited by C. Yim. C.
Chapter 4: Numerical Computation Deep Learning Authors: I. Goodfellow, Y. Bengio, A. Courville Lecture slides edited by 1 Chapter 4: Numerical Computation 4.1 Overflow and Underflow 4.2 Poor Conditioning
More informationLogistic Regression. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 25, / 48
Logistic Regression Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 25, 2017 1 / 48 Outline 1 Administration 2 Review of last lecture 3 Logistic regression
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your
More informationCSC321 Lecture 4: Learning a Classifier
CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 28 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationLINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,
More informationLeast Mean Squares Regression
Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationLogistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 8 Feb. 12, 2018 1 10-601 Introduction
More informationComments. x > w = w > x. Clarification: this course is about getting you to be able to think as a machine learning expert
Logistic regression Comments Mini-review and feedback These are equivalent: x > w = w > x Clarification: this course is about getting you to be able to think as a machine learning expert There has to be
More informationCS229 Supplemental Lecture notes
CS229 Supplemental Lecture notes John Duchi Binary classification In binary classification problems, the target y can take on at only two values. In this set of notes, we show how to model this problem
More informationA summary of Deep Learning without Poor Local Minima
A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given
More informationWarm up: risk prediction with logistic regression
Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationOPTIMIZATION METHODS IN DEEP LEARNING
Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate
More information1 Machine Learning Concepts (16 points)
CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions
More informationLinear Discrimination Functions
Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach
More informationLeast Mean Squares Regression. Machine Learning Fall 2018
Least Mean Squares Regression Machine Learning Fall 2018 1 Where are we? Least Squares Method for regression Examples The LMS objective Gradient descent Incremental/stochastic gradient descent Exercises
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationWeek 5: Logistic Regression & Neural Networks
Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationSequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015
Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationCOMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16
COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training-
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationCh 4. Linear Models for Classification
Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,
More informationClassification Logistic Regression
Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham
More informationMulticlass Logistic Regression
Multiclass Logistic Regression Sargur. Srihari University at Buffalo, State University of ew York USA Machine Learning Srihari Topics in Linear Classification using Probabilistic Discriminative Models
More informationHomework #3 RELEASE DATE: 10/28/2013 DUE DATE: extended to 11/18/2013, BEFORE NOON QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FORUM.
Homework #3 RELEASE DATE: 10/28/2013 DUE DATE: extended to 11/18/2013, BEFORE NOON QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FORUM. Unless granted by the instructor in advance, you must turn
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationIntroduction to Machine Learning
Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1
More informationNumerical Optimization
Numerical Optimization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationOutline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation
More informationReading Group on Deep Learning Session 1
Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular
More informationEngineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers
Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:
More informationDATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane
DATA MINING AND MACHINE LEARNING Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Linear models for regression Regularized
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationDeep Learning & Neural Networks Lecture 4
Deep Learning & Neural Networks Lecture 4 Kevin Duh Graduate School of Information Science Nara Institute of Science and Technology Jan 23, 2014 2/20 3/20 Advanced Topics in Optimization Today we ll briefly
More informationIntroduction to Machine Learning
How o you estimate p(y x)? Outline Contents Introuction to Machine Learning Logistic Regression Varun Chanola April 9, 207 Generative vs. Discriminative Classifiers 2 Logistic Regression 2 3 Logistic Regression
More informationClassification Based on Probability
Logistic Regression These slides were assembled by Byron Boots, with only minor modifications from Eric Eaton s slides and grateful acknowledgement to the many others who made their course materials freely
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationMachine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015
Machine Learning Classification, Discriminative learning Structured output, structured input, discriminative function, joint input-output features, Likelihood Maximization, Logistic regression, binary
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationIs the test error unbiased for these programs?
Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x
More informationLecture 3 Feedforward Networks and Backpropagation
Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression
More informationLecture 12. Neural Networks Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 12 Neural Networks 10.12.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationCPSC 340 Assignment 4 (due November 17 ATE)
CPSC 340 Assignment 4 due November 7 ATE) Multi-Class Logistic The function example multiclass loads a multi-class classification datasetwith y i {,, 3, 4, 5} and fits a one-vs-all classification model
More informationLecture 16 Solving GLMs via IRWLS
Lecture 16 Solving GLMs via IRWLS 09 November 2015 Taylor B. Arnold Yale Statistics STAT 312/612 Notes problem set 5 posted; due next class problem set 6, November 18th Goals for today fixed PCA example
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More information