1 Logistic Regression. Professor Ameet Talwalkar. CS260 Machine Learning Algorithms, January 25.

2 Outline: 1 Administration, 2 Review of last lecture, 3 Logistic regression

3 HW1: Will be returned today in class during our break. Can also pick up from Brooke during her office hours.

4 Outline: 1 Administration, 2 Review of last lecture (Naive Bayes), 3 Logistic regression

5 How to tell spam from ham? FROM THE DESK OF MR. AMINU SALEH, DIRECTOR, FOREIGN OPERATIONS DEPARTMENT, AFRI BANK PLC, Afribank Plaza, 14th Floor, 51/55 Broad Street, P.M.B, Lagos-Nigeria. Attention: Honorable Beneficiary, IMMEDIATE PAYMENT NOTIFICATION VALUED AT US$10 MILLION. Versus: Dear Ameet, Do you have 10 minutes to get on a videocall before 2pm? Thanks, Stefano

6 Simple strategy: count the words. Each document is summarized by its word counts, e.g., one message with counts free: 100, money: 2, account: 2 and another with free: 1, money: 1, account: 2. Bag-of-words representation of documents (and textual data).
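
To make the bag-of-words representation concrete, here is a minimal sketch (my own illustration, not part of the slides) that turns two made-up messages into count vectors over a small, made-up vocabulary.

```python
from collections import Counter

# Hypothetical vocabulary and documents, purely for illustration.
vocab = ["free", "money", "account"]
docs = [
    "free money money free account free",   # a spam-like message
    "please check the account balance",     # a ham-like message
]

def bag_of_words(text, vocab):
    """Return the count vector z = (z_1, ..., z_K) over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

for doc in docs:
    print(bag_of_words(doc, vocab))
# e.g. [3, 2, 1] and [0, 0, 1]: word order is discarded, only counts remain
```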

7 Naive Bayes (in our Spam Setting). Assume X ∈ R^D, all X_d ∈ [K], and z_k is the number of times k appears in X. P(X = x, Y = c) = P(Y = c) P(X = x | Y = c)

8 Naive Bayes (in our Spam Setting). Assume X ∈ R^D, all X_d ∈ [K], and z_k is the number of times k appears in X. P(X = x, Y = c) = P(Y = c) P(X = x | Y = c) = P(Y = c) ∏_k P(k | Y = c)^{z_k} = π_c ∏_k θ_ck^{z_k}. Key assumptions made?

9 Naive Bayes (in our Spam Setting). Assume X ∈ R^D, all X_d ∈ [K], and z_k is the number of times k appears in X. P(X = x, Y = c) = P(Y = c) P(X = x | Y = c) = P(Y = c) ∏_k P(k | Y = c)^{z_k} = π_c ∏_k θ_ck^{z_k}. Key assumptions made? Conditional independence: P(X_i, X_j | Y = c) = P(X_i | Y = c) P(X_j | Y = c). P(X_i | Y = c) depends only on the value of X_i, not on i itself (the order of words does not matter in the bag-of-words representation of documents).

10 Learning problem. Training data: D = {({z_nk}_{k=1}^K, y_n)}_{n=1}^N. Goal:

11 Learning problem. Training data: D = {({z_nk}_{k=1}^K, y_n)}_{n=1}^N. Goal: learn π_c, c = 1, 2, ..., C, and θ_ck, c ∈ [C], k ∈ [K], under the constraints:

12 Learning problem. Training data: D = {({z_nk}_{k=1}^K, y_n)}_{n=1}^N. Goal: learn π_c, c = 1, 2, ..., C, and θ_ck, c ∈ [C], k ∈ [K], under the constraints: ∑_c π_c = 1, ∑_k θ_ck = ∑_k P(k | Y = c) = 1, and all quantities should be nonnegative.

13 Likelihood Function. Let X_1, ..., X_N be IID with PDF f(x | θ) (also written as f(x; θ)). The likelihood function L(θ | x) (also written as L(θ; x)) is defined by L(θ | x) = ∏_{i=1}^N f(x_i; θ). Notes: the likelihood function is just the joint density of the data, except that we treat it as a function of the parameter θ, L : Θ → [0, ∞).

14 Maximum Likelihood Estimator. Definition: the maximum likelihood estimator (MLE) θ̂ is the value of θ that maximizes L(θ). The log-likelihood function is defined by ℓ(θ) = log L(θ); its maximum occurs at the same place as that of the likelihood function.

15 Maximum Likelihood Estimator. Definition: the maximum likelihood estimator (MLE) θ̂ is the value of θ that maximizes L(θ). The log-likelihood function is defined by ℓ(θ) = log L(θ); its maximum occurs at the same place as that of the likelihood function. Using logs simplifies mathematical expressions (converts exponents to products and products to sums). Using logs helps with numerical stability. The same is true of the likelihood function times any constant, thus we shall often drop constants in the likelihood function.
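
A quick numerical aside (my own illustration, not from the slides) on why we work with the log-likelihood: multiplying many small per-sample densities underflows in floating point, while summing their logs stays perfectly representable.

```python
import numpy as np

# 2000 hypothetical per-sample densities, each equal to 0.1.
probs = np.full(2000, 0.1)

likelihood = np.prod(probs)              # 0.1**2000 underflows to 0.0
log_likelihood = np.sum(np.log(probs))   # about -4605.17, no underflow

print(likelihood, log_likelihood)
```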

16 Bayes Rule. For any document x, we want to compare p(spam | x) and p(ham | x). Axiom of probability: p(spam, x) = p(spam | x) p(x) = p(x | spam) p(spam). This gives us (via Bayes rule): p(spam | x) = p(x | spam) p(spam) / p(x), and p(ham | x) = p(x | ham) p(ham) / p(x).

17 Bayes Rule. For any document x, we want to compare p(spam | x) and p(ham | x). Axiom of probability: p(spam, x) = p(spam | x) p(x) = p(x | spam) p(spam). This gives us (via Bayes rule): p(spam | x) = p(x | spam) p(spam) / p(x), and p(ham | x) = p(x | ham) p(ham) / p(x). The denominators are the same, and it is easier to compute with logarithms, so we compare: log[p(x | spam) p(spam)] versus log[p(x | ham) p(ham)].

18 Our hammer: maximum likelihood estimation. Log-likelihood of the training data: L = log P(D) = log ∏_{n=1}^N π_{y_n} P(x_n | y_n)

19 Our hammer: maximum likelihood estimation. Log-likelihood of the training data: L = log P(D) = log ∏_{n=1}^N π_{y_n} P(x_n | y_n) = log ∏_{n=1}^N ( π_{y_n} ∏_k θ_{y_n k}^{z_nk} )

20 Our hammer: maximum likelihood estimation. Log-likelihood of the training data: L = log P(D) = log ∏_{n=1}^N π_{y_n} P(x_n | y_n) = log ∏_{n=1}^N ( π_{y_n} ∏_k θ_{y_n k}^{z_nk} ) = ∑_n ( log π_{y_n} + ∑_k z_nk log θ_{y_n k} )

21 Our hammer: maximum likelihood estimation. Log-likelihood of the training data: L = log P(D) = log ∏_{n=1}^N π_{y_n} P(x_n | y_n) = log ∏_{n=1}^N ( π_{y_n} ∏_k θ_{y_n k}^{z_nk} ) = ∑_n ( log π_{y_n} + ∑_k z_nk log θ_{y_n k} ) = ∑_n log π_{y_n} + ∑_{n,k} z_nk log θ_{y_n k}

22 Our hammer: maximum likelihood estimation. Log-likelihood of the training data: L = log P(D) = log ∏_{n=1}^N π_{y_n} P(x_n | y_n) = log ∏_{n=1}^N ( π_{y_n} ∏_k θ_{y_n k}^{z_nk} ) = ∑_n ( log π_{y_n} + ∑_k z_nk log θ_{y_n k} ) = ∑_n log π_{y_n} + ∑_{n,k} z_nk log θ_{y_n k}. Optimize it! (π_c, θ_ck) = argmax ∑_n log π_{y_n} + ∑_{n,k} z_nk log θ_{y_n k}

23 Details. Note the separation of parameters in the likelihood: ∑_n log π_{y_n} + ∑_{n,k} z_nk log θ_{y_n k}, which implies that {π_c} and {θ_ck} can be estimated separately. Reorganize terms: ∑_n log π_{y_n} = ∑_c log π_c · (# of data points labeled as c)

24 Details. Note the separation of parameters in the likelihood: ∑_n log π_{y_n} + ∑_{n,k} z_nk log θ_{y_n k}, which implies that {π_c} and {θ_ck} can be estimated separately. Reorganize terms: ∑_n log π_{y_n} = ∑_c log π_c · (# of data points labeled as c), and ∑_{n,k} z_nk log θ_{y_n k} = ∑_c ∑_{n: y_n = c} ∑_k z_nk log θ_ck = ∑_c ∑_{n: y_n = c, k} z_nk log θ_ck. The latter implies that {θ_ck, k = 1, 2, ..., K} and {θ_c'k, k = 1, 2, ..., K} can be estimated independently.

25 Estimating {π_c}. We want to maximize ∑_c log π_c · (# of data points labeled as c). Intuition: similar to rolling a die (or flipping a coin): each side of the die shows up with probability π_c (C sides in total), and we have N trials of rolling this die in total. Solution:

26 Estimating {π_c}. We want to maximize ∑_c log π_c · (# of data points labeled as c). Intuition: similar to rolling a die (or flipping a coin): each side of the die shows up with probability π_c (C sides in total), and we have N trials of rolling this die in total. Solution: π_c = (# of data points labeled as c) / N

27 Estimating {θ_ck, k = 1, 2, ..., K}. We want to maximize ∑_{n: y_n = c, k} z_nk log θ_ck. Intuition: again similar to rolling a die: each side of the die shows up with probability θ_ck (K sides in total), and we have ∑_{n: y_n = c, k} z_nk trials in total. Solution:

28 Estimating {θ_ck, k = 1, 2, ..., K}. We want to maximize ∑_{n: y_n = c, k} z_nk log θ_ck. Intuition: again similar to rolling a die: each side of the die shows up with probability θ_ck (K sides in total), and we have ∑_{n: y_n = c, k} z_nk trials in total. Solution: θ_ck = (# of times side k shows up in data points labeled as c) / (total # of trials for data points labeled as c)

29 Translating back to our problem of detecting spam emails. Collect a lot of ham and spam emails as training examples. Estimate the bias: p(ham) = (# of ham emails) / (# of emails), p(spam) = (# of spam emails) / (# of emails). Estimate the weights (i.e., p(funny word | ham), etc.): p(funny word | ham) = (# of occurrences of funny word in ham emails) / (# of words in ham emails) (1), p(funny word | spam) = (# of occurrences of funny word in spam emails) / (# of words in spam emails) (2).

30 Classification rule. Given an unlabeled data point x = {z_k, k = 1, 2, ..., K}, label it with y = argmax_{c ∈ [C]} P(y = c | x) = argmax_{c ∈ [C]} P(y = c) P(x | y = c) = argmax_c [ log π_c + ∑_k z_k log θ_ck ]
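
The estimation and classification rules above fit in a few lines of code. The sketch below is my own illustration with made-up count data (not course code): it estimates π_c and θ_ck from bag-of-words counts and labels a new document with argmax_c [ log π_c + ∑_k z_k log θ_ck ].

```python
import numpy as np

# Hypothetical training data: rows are documents, columns are word counts z_nk.
Z = np.array([[3, 2, 1],    # labeled spam
              [4, 1, 0],    # labeled spam
              [0, 1, 2],    # labeled ham
              [1, 0, 3]])   # labeled ham
y = np.array([0, 0, 1, 1])  # class 0 = spam, class 1 = ham (labels made up)
C = 2

# MLE: pi_c = (# docs labeled c) / N; theta_ck = (count of word k in class c) / (total count in class c)
pi = np.array([(y == c).mean() for c in range(C)])
theta = np.array([Z[y == c].sum(axis=0) / Z[y == c].sum() for c in range(C)])

# Classification rule: argmax_c [ log pi_c + sum_k z_k log theta_ck ]
z_new = np.array([2, 1, 0])                      # counts of a new document
scores = np.log(pi) + z_new @ np.log(theta).T
print(scores, "->", scores.argmax())             # class 0 (spam) wins for this toy input
```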

31 Constrained optimization. Equality constraints: min f(x) s.t. g(x) = 0. Method of Lagrange multipliers: construct the following function (the Lagrangian): L(x, λ) = f(x) + λ g(x)

32 A short derivation of the maximum likelihood estimation. To maximize ∑_{n: y_n = c, k} z_nk log θ_ck we can use Lagrange multipliers

33 A short derivation of the maximum likelihood estimation. To maximize ∑_{n: y_n = c, k} z_nk log θ_ck we can use Lagrange multipliers: f(θ) = ∑_{n: y_n = c, k} z_nk log θ_ck, g(θ) = 1 − ∑_k θ_ck = 0. Lagrangian: L(θ, λ) = f(θ) + λ g(θ) = ∑_{n: y_n = c, k} z_nk log θ_ck + λ (1 − ∑_k θ_ck)

34 L(θ, λ) = ∑_{n: y_n = c, k} z_nk log θ_ck + λ (1 − ∑_k θ_ck). First take the derivative with respect to θ_ck and find the stationary point: ∑_{n: y_n = c} z_nk / θ_ck − λ = 0 ⟹ θ_ck = (1/λ) ∑_{n: y_n = c} z_nk

35 L(θ, λ) = ∑_{n: y_n = c, k} z_nk log θ_ck + λ (1 − ∑_k θ_ck). First take the derivative with respect to θ_ck and find the stationary point: ∑_{n: y_n = c} z_nk / θ_ck − λ = 0 ⟹ θ_ck = (1/λ) ∑_{n: y_n = c} z_nk. Plug the expression above for θ_ck into the constraint ∑_k θ_ck = 1, solve for λ, and plug this expression for λ back into the expression for θ_ck to get: θ_ck = ∑_{n: y_n = c} z_nk / ∑_k ∑_{n: y_n = c} z_nk. Chris will review this in section on Friday.
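
As a sanity check on this derivation, here is my own sketch using sympy with made-up counts (K = 2, not values from the slides): solving the stationarity conditions of the Lagrangian recovers exactly the count-ratio solution from the slide.

```python
import sympy as sp

# Hypothetical aggregated counts for one class c: sum over {n: y_n = c} of z_nk.
z = [3, 7]

t1, t2, lam = sp.symbols("theta1 theta2 lam", positive=True)
L = z[0] * sp.log(t1) + z[1] * sp.log(t2) + lam * (1 - t1 - t2)

# Stationary point: dL/dtheta_k = 0 for each k, plus the constraint dL/dlam = 0.
sol = sp.solve([sp.diff(L, t1), sp.diff(L, t2), sp.diff(L, lam)],
               [t1, t2, lam], dict=True)
print(sol)   # theta1 = 3/10, theta2 = 7/10, lam = 10: counts divided by total counts
```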

36 Summary. Things you should know: the form of the naive Bayes model; write down the joint distribution; explain the conditional independence assumption implied by the model; explain how this model can be used to classify spam vs ham emails; explain how it could be used for categorical variables. Be able to go through the short derivation for parameter estimation. The model illustrated here is called discrete Naive Bayes. HW2 asks you to apply the same principle to other variants of naive Bayes; the derivations are very similar except that you need to estimate different model parameters.

37 Moving forward. Examine the classification rule for naive Bayes: y = argmax_c log π_c + ∑_k z_k log θ_ck. For binary classification, we thus determine the label based on the sign of log π_1 + ∑_k z_k log θ_1k − ( log π_2 + ∑_k z_k log θ_2k ). This is just a linear function of the features {z_k}

38 Moving forward. Examine the classification rule for naive Bayes: y = argmax_c log π_c + ∑_k z_k log θ_ck. For binary classification, we thus determine the label based on the sign of log π_1 + ∑_k z_k log θ_1k − ( log π_2 + ∑_k z_k log θ_2k ). This is just a linear function of the features {z_k}: w_0 + ∑_k z_k w_k, where we absorb w_0 = log π_1 − log π_2 and w_k = log θ_1k − log θ_2k.

39 Naive Bayes is a linear classifier. Fundamentally, what really matters in deciding the decision boundary is w_0 + ∑_k z_k w_k. This motivates many new methods, including logistic regression, to be discussed next.

40 Outline: 1 Administration, 2 Review of last lecture, 3 Logistic regression (general setup, maximum likelihood estimation, gradient descent, Newton's method)

41 Logistic classification. Setup for two classes. Input: x ∈ R^D. Output: y ∈ {0, 1}. Training data: D = {(x_n, y_n), n = 1, 2, ..., N}

42 Logistic classification. Setup for two classes. Input: x ∈ R^D. Output: y ∈ {0, 1}. Training data: D = {(x_n, y_n), n = 1, 2, ..., N}. Model: p(y = 1 | x; b, w) = σ[g(x)], where g(x) = b + ∑_d w_d x_d = b + w^T x, and σ[·] stands for the sigmoid function σ(a) = 1 / (1 + e^{−a})
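
A tiny sketch (mine, not from the slides) of the model's forward computation: a numerically safe sigmoid applied to g(x) = b + w^T x, with parameters and input made up for illustration.

```python
import numpy as np

def sigmoid(a):
    # Numerically stable evaluation of 1 / (1 + exp(-a)).
    return np.where(a >= 0, 1.0 / (1.0 + np.exp(-a)), np.exp(a) / (1.0 + np.exp(a)))

# Hypothetical parameters and input, for illustration only.
b, w = -1.0, np.array([2.0, -0.5])
x = np.array([1.0, 3.0])

g = b + w @ x                 # linear score g(x) = b + w^T x
print(g, sigmoid(g))          # g = -0.5, so p(y = 1 | x) is about 0.378
```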

43 Why the sigmoid function?

44 Why the sigmoid function? What does it look like? σ(a) = 1 / (1 + e^{−a}), where a = b + w^T x. Properties:

45 Why the sigmoid function? What does it look like? σ(a) = 1 / (1 + e^{−a}), where a = b + w^T x. Properties: bounded between 0 and 1, thus interpretable as a probability; monotonically increasing, thus usable to derive classification rules: σ(a) > 0.5, positive (classify as 1); σ(a) < 0.5, negative (classify as 0); σ(a) = 0.5, undecidable. Nice computational properties, as we will see soon.

46 Why the sigmoid function? What does it look like? σ(a) = 1 / (1 + e^{−a}), where a = b + w^T x. Properties: bounded between 0 and 1, thus interpretable as a probability; monotonically increasing, thus usable to derive classification rules: σ(a) > 0.5, positive (classify as 1); σ(a) < 0.5, negative (classify as 0); σ(a) = 0.5, undecidable. Nice computational properties, as we will see soon. Linear or nonlinear classifier?

47 Linear or nonlinear? σ(a) is nonlinear; however, the decision boundary is determined by σ(a) = 0.5

48 Linear or nonlinear? σ(a) is nonlinear; however, the decision boundary is determined by σ(a) = 0.5 ⟺ a = 0 ⟺ g(x) = b + w^T x = 0, which is a linear function in x. We often call b the offset term.

49 Contrast Naive Bayes and our new model Similar

50 Contrast Naive Bayes and our new model Similar Both classification models are linear functions of features Different

51 Contrast Naive Bayes and our new model. Similar: both classification models are linear functions of the features. Different: Naive Bayes models the joint distribution, P(X, Y) = P(Y) P(X | Y); logistic regression models the conditional distribution, P(Y | X). Generative vs. discriminative: NB is a generative model, LR is a discriminative model. We will talk more about the differences later.

52 Likelihood function. Probability of a single training sample (x_n, y_n): p(y_n | x_n; b, w) = σ(b + w^T x_n) if y_n = 1, and 1 − σ(b + w^T x_n) otherwise

53 Likelihood function. Probability of a single training sample (x_n, y_n): p(y_n | x_n; b, w) = σ(b + w^T x_n) if y_n = 1, and 1 − σ(b + w^T x_n) otherwise. Compact expression, exploiting that y_n is either 1 or 0: p(y_n | x_n; b, w) = σ(b + w^T x_n)^{y_n} [1 − σ(b + w^T x_n)]^{1 − y_n}

54 Log Likelihood or Cross Entropy Error. Log-likelihood of the whole training data D: log P(D) = ∑_n { y_n log σ(b + w^T x_n) + (1 − y_n) log[1 − σ(b + w^T x_n)] }

55 Log Likelihood or Cross Entropy Error. Log-likelihood of the whole training data D: log P(D) = ∑_n { y_n log σ(b + w^T x_n) + (1 − y_n) log[1 − σ(b + w^T x_n)] }. It is convenient to work with its negation, which is called the cross-entropy error function: E(b, w) = −∑_n { y_n log σ(b + w^T x_n) + (1 − y_n) log[1 − σ(b + w^T x_n)] }

56 Shorthand notation. This is for convenience. Append 1 to x: x ← [1 x_1 x_2 ... x_D]. Append b to w: w ← [b w_1 w_2 ... w_D]. The cross-entropy is then E(w) = −∑_n { y_n log σ(w^T x_n) + (1 − y_n) log[1 − σ(w^T x_n)] }
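
With the shorthand (a 1 prepended to each x and b absorbed into w), the cross-entropy error is a one-liner. The sketch below is my own illustration with random data, not course code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, y):
    """E(w) = -sum_n [ y_n log sigma(w^T x_n) + (1 - y_n) log(1 - sigma(w^T x_n)) ]."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # prepend the constant 1
y = rng.integers(0, 2, size=5)
w = np.zeros(3)                                            # b and the weights together
print(cross_entropy(w, X, y))   # equals 5 * log(2), about 3.466, at w = 0
```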

57 How to find the optimal parameters for logistic regression? We will minimize the error function E(w) = −∑_n { y_n log σ(w^T x_n) + (1 − y_n) log[1 − σ(w^T x_n)] }. However, this function is complex and we cannot find a simple solution as we did in Naive Bayes, so we need to use numerical methods. Numerical methods are messier, in contrast to cleaner analytic solutions. In practice, we often have to tune a few optimization parameters; patience is necessary.

58 An overview of numerical methods. We describe two. Gradient descent (our focus in lecture): simple, especially effective for large-scale problems. Newton's method: a classical and powerful method. Gradient descent is often referred to as a first-order method; it requires computation of gradients (i.e., first-order derivatives). Newton's method is often referred to as a second-order method; it requires computation of second derivatives.

59 Gradient Descent. Start at a random point. [Figure: f(w) versus w, with starting point w0 and minimizer w*]

60 Gradient Descent. Start at a random point. Determine a descent direction.

61 Gradient Descent. Start at a random point. Determine a descent direction. Choose a step size.

62 Gradient Descent. Start at a random point. Determine a descent direction. Choose a step size. Update. [Figure: the iterate moves from w0 to w1]

63-68 Gradient Descent. Start at a random point. Repeat: determine a descent direction, choose a step size, update. Until the stopping criterion is satisfied. [Figure: successive iterates w0, w1, w2 move toward w*]

69 Where Will We Converge? [Figure: a convex function f(w) and a non-convex function g(w)] Convex: any local minimum is a global minimum. Non-convex: multiple local minima may exist. Least Squares, Ridge Regression and Logistic Regression are all convex!

70 Why do we move in the direction opposite the gradient?

71 Choosing Descent Direction (1D). [Figure: f(w) versus w, with minimizer w* and current point w0] If the slope is positive, go left! If the slope is negative, go right! If it is zero, done! We can only move in two directions; the negative slope is the direction of descent! Update rule: w_{i+1} = w_i − α_i df/dw(w_i), where α_i is the step size and −df/dw(w_i) is the negative slope.

72 Choosing Descent Direction (2D). Example: function values are shown in black/white (black represents higher values) and arrows are gradients ("Gradient2" by Sarang, licensed under CC BY-SA 2.5 via Wikimedia Commons). We can move anywhere in R^d; the negative gradient is the direction of steepest descent! Update rule: w_{i+1} = w_i − α_i ∇f(w_i), where α_i is the step size and −∇f(w_i) is the negative gradient.

73 Example: min f(θ) = 0.5 (θ_1^2 − θ_2)^2 + 0.5 (θ_1 − 1)^2

74 Example: min f(θ) = 0.5 (θ_1^2 − θ_2)^2 + 0.5 (θ_1 − 1)^2. We compute the gradients: ∂f/∂θ_1 = 2(θ_1^2 − θ_2) θ_1 + θ_1 − 1, ∂f/∂θ_2 = −(θ_1^2 − θ_2)

75 Example: min f(θ) = 0.5 (θ_1^2 − θ_2)^2 + 0.5 (θ_1 − 1)^2. We compute the gradients: ∂f/∂θ_1 = 2(θ_1^2 − θ_2) θ_1 + θ_1 − 1, ∂f/∂θ_2 = −(θ_1^2 − θ_2). Use the following iterative procedure for gradient descent: 1 Initialize θ_1^(0) and θ_2^(0), and set t = 0. 2 Do: θ_1^(t+1) ← θ_1^(t) − η [ 2((θ_1^(t))^2 − θ_2^(t)) θ_1^(t) + θ_1^(t) − 1 ], θ_2^(t+1) ← θ_2^(t) − η [ −((θ_1^(t))^2 − θ_2^(t)) ], t ← t + 1; until f(θ^(t)) does not change much.
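
Here is a direct transcription of that procedure into code (a sketch of mine; the starting point and step size η are arbitrary choices, not values given on the slides).

```python
import numpy as np

def f(t):
    return 0.5 * (t[0] ** 2 - t[1]) ** 2 + 0.5 * (t[0] - 1) ** 2

def grad(t):
    return np.array([2 * (t[0] ** 2 - t[1]) * t[0] + t[0] - 1,
                     -(t[0] ** 2 - t[1])])

eta = 0.1                        # step size (arbitrary choice)
theta = np.array([0.0, 0.0])     # arbitrary initialization
prev = np.inf
while prev - f(theta) > 1e-10:   # stop when f no longer changes much
    prev = f(theta)
    theta = theta - eta * grad(theta)

print(theta)   # approaches the minimizer (1, 1), where both squared terms vanish
```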

76 Impact of step size Choosing the right η is important

77 Impact of step size Choosing the right η is important small η is too slow? large η is too unstable?

78 Impact of step size Choosing the right η is important small η is too slow? large η is too unstable?

79 Gradient descent. General form for minimizing f(θ): θ^(t+1) ← θ^(t) − η ∂f/∂θ. Remarks: η is the step size, i.e., how far we go in the direction of the negative gradient. The step size needs to be chosen carefully to ensure convergence. The step size can be adaptive, e.g., we can use line search. We are minimizing a function, hence the subtraction (−η). With a suitable choice of η, we converge to a stationary point, ∂f/∂θ = 0. A stationary point is not always a global minimum (but we are happy when the function is convex). A popular variant is called stochastic gradient descent.

80 Gradient Descent Update for Logistic Regression. Simple fact: derivative of σ(a): dσ(a)/da = d/da (1 + e^{−a})^{−1} = e^{−a} / (1 + e^{−a})^2 = [1 / (1 + e^{−a})] · [e^{−a} / (1 + e^{−a})] = σ(a) [1 − σ(a)]
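
A quick finite-difference check of this identity (my own sketch, not from the slides): the numerical derivative of σ at a few points agrees with σ(a)[1 − σ(a)].

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.array([-2.0, 0.0, 1.5])
h = 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)   # central difference
analytic = sigmoid(a) * (1 - sigmoid(a))                # sigma(a)[1 - sigma(a)]
print(np.max(np.abs(numeric - analytic)))               # agreement to roughly 1e-10
```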

81 Gradients of the cross-entropy error function. Cross-entropy error function: E(w) = −∑_n { y_n log σ(w^T x_n) + (1 − y_n) log[1 − σ(w^T x_n)] }. Gradient: ∂E(w)/∂w = −∑_n { y_n [1 − σ(w^T x_n)] x_n − (1 − y_n) σ(w^T x_n) x_n } = ∑_n { σ(w^T x_n) − y_n } x_n. Remark:

82 Gradients of the cross-entropy error function. Cross-entropy error function: E(w) = −∑_n { y_n log σ(w^T x_n) + (1 − y_n) log[1 − σ(w^T x_n)] }. Gradient: ∂E(w)/∂w = −∑_n { y_n [1 − σ(w^T x_n)] x_n − (1 − y_n) σ(w^T x_n) x_n } = ∑_n { σ(w^T x_n) − y_n } x_n. Remark: e_n = σ(w^T x_n) − y_n is called the error for the nth training sample.

83 Numerical optimization. Gradient descent for logistic regression: choose a proper step size η > 0, then iteratively update the parameters following the negative gradient to minimize the error function: w^(t+1) ← w^(t) − η ∑_n { σ(w^T x_n) − y_n } x_n
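
Putting the error e_n and the update rule together gives a complete trainer in a few lines. This is a sketch of mine on synthetic data (the data, step size, and iteration count are arbitrary choices), not reference code from the course.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Synthetic data; the first column of X is the constant 1, so w[0] plays the role of b.
rng = np.random.default_rng(0)
N, D = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
w_true = np.array([-0.5, 2.0, -1.0])                  # made-up generating parameters
y = (rng.random(N) < sigmoid(X @ w_true)).astype(float)

eta = 0.01                        # step size (arbitrary choice)
w = np.zeros(D + 1)
for _ in range(5000):
    errors = sigmoid(X @ w) - y           # e_n = sigma(w^T x_n) - y_n
    w = w - eta * X.T @ errors            # w <- w - eta * sum_n e_n x_n

print(w)   # should land reasonably close to w_true, up to sampling noise
```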

84 Intuition for Newton's method. Approximate the true function with an easy-to-solve optimization problem. [Figure: f(x) and its quadratic approximation f_quad(x) around the current point x_k, with the step to x_k + d_k]

85 Intuition for Newton's method. Approximate the true function with an easy-to-solve optimization problem. In particular, we can approximate the cross-entropy error function around w^(t) by a quadratic function, and then minimize this quadratic function.

86 Approximation. Second-order Taylor expansion around x_t: f(x) ≈ f(x_t) + f′(x_t)(x − x_t) + (1/2) f″(x_t)(x − x_t)^2

87 Approximation. Second-order Taylor expansion around x_t: f(x) ≈ f(x_t) + f′(x_t)(x − x_t) + (1/2) f″(x_t)(x − x_t)^2. Taylor expansion of the cross-entropy error function around w^(t): E(w) ≈ E(w^(t)) + (w − w^(t))^T ∇E(w^(t)) + (1/2)(w − w^(t))^T H^(t) (w − w^(t)), where ∇E(w^(t)) is the gradient and H^(t) is the Hessian matrix evaluated at w^(t)

88 So what is the Hessian matrix? The matrix of second-order derivatives. In other words, H = ∂²E(w) / ∂w ∂w^T, with H_ij = ∂/∂w_j ( ∂E(w)/∂w_i ). So the Hessian matrix is in R^{D×D}, where w ∈ R^D.

89 Optimizing the approximation. Minimize the approximation E(w) ≈ E(w^(t)) + (w − w^(t))^T ∇E(w^(t)) + (1/2)(w − w^(t))^T H^(t) (w − w^(t)) and use the solution as the new estimate of the parameters: w^(t+1) ← argmin_w (w − w^(t))^T ∇E(w^(t)) + (1/2)(w − w^(t))^T H^(t) (w − w^(t))

90 Optimizing the approximation. Minimize the approximation E(w) ≈ E(w^(t)) + (w − w^(t))^T ∇E(w^(t)) + (1/2)(w − w^(t))^T H^(t) (w − w^(t)) and use the solution as the new estimate of the parameters: w^(t+1) ← argmin_w (w − w^(t))^T ∇E(w^(t)) + (1/2)(w − w^(t))^T H^(t) (w − w^(t)). The quadratic function minimization has a closed form, thus we have w^(t+1) ← w^(t) − (H^(t))^{−1} ∇E(w^(t)), i.e., Newton's method.
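
To see the Newton update in action (without touching the cross-entropy Hessian, which is left as homework), here is my own sketch applying w ← w − H⁻¹ ∇f to the two-variable example from the gradient-descent slides; the gradient and Hessian of that specific f are computed by hand for this sketch and are not given on the slides.

```python
import numpy as np

# f(theta) = 0.5*(theta1^2 - theta2)^2 + 0.5*(theta1 - 1)^2, as in the earlier example.
def grad(t):
    return np.array([2 * (t[0] ** 2 - t[1]) * t[0] + t[0] - 1,
                     -(t[0] ** 2 - t[1])])

def hessian(t):
    # Second derivatives of f, computed by hand for this sketch.
    return np.array([[6 * t[0] ** 2 - 2 * t[1] + 1, -2 * t[0]],
                     [-2 * t[0], 1.0]])

theta = np.array([0.0, 0.0])             # arbitrary starting point
for _ in range(10):
    step = np.linalg.solve(hessian(theta), grad(theta))   # H^{-1} grad, no eta needed
    theta = theta - step

print(theta)   # reaches the minimizer (1, 1) after just a couple of iterations
```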

91 Contrast gradient descent and Newton's method. Similar: both are iterative procedures. Different: Newton's method requires second-order derivatives; Newton's method does not have the magic η to be set.

92 Other important things about the Hessian. Our cross-entropy error function is convex: ∂E(w)/∂w = ∑_n { σ(w^T x_n) − y_n } x_n (3), H = ∂²E(w) / ∂w ∂w^T = homework (4)

93 Other important things about the Hessian. Our cross-entropy error function is convex: ∂E(w)/∂w = ∑_n { σ(w^T x_n) − y_n } x_n (3), H = ∂²E(w) / ∂w ∂w^T = homework (4). For any vector v, v^T H v = homework ≥ 0; thus H is positive semi-definite, and thus the cross-entropy error function is convex, with only one global optimum.

94 Good about Newton's method. Fast (in terms of convergence)! Newton's method finds the optimal point in a single iteration when the function we're optimizing is quadratic. In general, the better our Taylor approximation, the more quickly Newton's method will converge.

95 Bad about Newton's method. Not scalable! Computing and inverting the Hessian matrix can be very expensive for large-scale problems where the dimensionality D is very large. There are fixes and alternatives, such as quasi-Newton / quasi-second-order methods.

96 Summary. Setup for 2 classes: logistic regression models the conditional distribution as p(y = 1 | x; w) = σ[g(x)], where g(x) = w^T x. Linear decision boundary: g(x) = w^T x = 0. Minimizing the cross-entropy error (negative log-likelihood): E(b, w) = −∑_n { y_n log σ(b + w^T x_n) + (1 − y_n) log[1 − σ(b + w^T x_n)] }. No closed-form solution; must rely on iterative solvers. Numerical optimization: gradient descent is simple and scalable to large-scale problems (move in the direction opposite the gradient; the gradient of the logistic function takes a nice form); Newton's method is fast to converge but not scalable (at each iteration, find the optimal point of the 2nd-order Taylor expansion; a closed-form solution exists for each iteration).

97 Naive Bayes and logistic regression: two different modeling paradigms. Maximize the joint likelihood ∑_n log p(x_n, y_n) (generative, NB). Maximize the conditional likelihood ∑_n log p(y_n | x_n) (discriminative, LR). More on this next class.
