Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5


Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5 Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph Machine Learning: Chenhao Tan Boulder 1 of 27

Quiz question For a test instance (x, y) and a naïve Bayes classifier $\hat{P}$, which of the following statements is true?
(A) $\sum_c \hat{P}(y = c \mid x) = 1$
(B) $\sum_c \hat{P}(x \mid c) = 1$
(C) $\sum_c \hat{P}(x \mid c)\,\hat{P}(c) = 1$
Machine Learning: Chenhao Tan Boulder 2 of 27

Overview Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 3 of 27

Reminder: Logistic Regression
$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (1)$$
$$P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (2)$$
Discriminative prediction: P(y | x). Classification uses: sentiment analysis, spam detection. What we didn't talk about is how to learn β from data.
Machine Learning: Chenhao Tan Boulder 4 of 27
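To make equations (1) and (2) concrete, here is a minimal Python sketch (not part of the original slides; the weights and feature values are invented for illustration) that computes P(Y = 1 | X) for a single example:

```python
import numpy as np

def predict_proba(beta0, beta, x):
    """P(Y = 1 | X = x) under logistic regression; P(Y = 0 | x) is 1 minus this."""
    z = beta0 + np.dot(beta, x)            # beta_0 + sum_i beta_i x_i
    return np.exp(z) / (1.0 + np.exp(z))   # equation (2); equivalently 1 / (1 + exp(-z))

# Illustrative (made-up) weights and features
beta0, beta = -1.0, np.array([0.5, 2.0])
x = np.array([1.0, 0.3])
print(predict_proba(beta0, beta, x))       # ~0.52
```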

Objective function Outline Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 5 of 27

Objective function Logistic Regression: Objective Function
Maximize likelihood:
$$\text{Obj} \equiv \log P(Y \mid X, \beta) = \sum_j \log P(y^{(j)} \mid x^{(j)}, \beta) = \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \log \left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right]$$
Machine Learning: Chenhao Tan Boulder 6 of 27


Objective function Logistic Regression: Objective Function
Minimize negative log likelihood (loss):
$$\mathcal{L} \equiv -\log P(Y \mid X, \beta) = -\sum_j \log P(y^{(j)} \mid x^{(j)}, \beta) = \sum_j \left[ -y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) + \log \left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right]$$
Training data {(x, y)} are fixed. The objective function is a function of β... what values of β give a good value?
$$\beta^* = \arg\min_\beta \mathcal{L}(\beta)$$
Machine Learning: Chenhao Tan Boulder 7 of 27
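A small Python sketch of this loss (the data below are invented for illustration), matching the form −y(·) + log(1 + exp(·)) summed over examples:

```python
import numpy as np

def neg_log_likelihood(beta0, beta, X, y):
    """L(beta) = sum_j [ -y^(j) z_j + log(1 + exp(z_j)) ], with z_j = beta_0 + beta . x^(j)."""
    z = beta0 + X @ beta
    return np.sum(-y * z + np.log1p(np.exp(z)))

# Toy data (invented): 3 examples, 2 features
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
print(neg_log_likelihood(0.0, np.zeros(2), X, y))  # 3 * log(2) when beta = 0
```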

Objective function Convexity
$\mathcal{L}(\beta)$ is convex for logistic regression.
Proof. The logistic loss $-yv + \log(1 + \exp(v))$ is convex in $v$. Composition with a linear function maintains convexity. A sum of convex functions is convex.
Machine Learning: Chenhao Tan Boulder 8 of 27

Gradient Descent Outline Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 9 of 27

Gradient Descent Convexity Convex function: it doesn't matter where you start, if you go down along the gradient. Gradient! Machine Learning: Chenhao Tan Boulder 10 of 27

Gradient Descent Convexity It would have been much harder if this were not convex. Machine Learning: Chenhao Tan Boulder 11 of 27

Gradient Descent Gradient Descent (non-convex)
Goal: Optimize the loss function with respect to the variables β.
[Figure: objective plotted against the parameter; successive iterates 0, 1, 2, 3 step downhill along the gradient toward the minimum, labeled "Undiscovered Country".]
Machine Learning: Chenhao Tan Boulder 12 of 27

Gradient Descent Gradient Descent (non-convex)
Goal: Optimize the loss function with respect to the variables β:
$$\beta_j^{l+1} = \beta_j^{l} - \eta \frac{\partial \mathcal{L}}{\partial \beta_j}$$
Luckily, (vanilla) logistic regression is convex.
Machine Learning: Chenhao Tan Boulder 12 of 27

Gradient Descent Gradient for Logistic Regression
To ease notation, let's define
$$\pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \quad (3)$$
Our objective function is
$$\mathcal{L} = -\sum_i \log p(y_i \mid x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} -\log \pi_i & \text{if } y_i = 1 \\ -\log(1 - \pi_i) & \text{if } y_i = 0 \end{cases} \quad (4)$$
Machine Learning: Chenhao Tan Boulder 13 of 27

Gradient Descent Taking the Derivative
Apply the chain rule:
$$\frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\beta)}{\partial \beta_j} = \sum_i \begin{cases} -\frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ \frac{1}{1 - \pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 0 \end{cases} \quad (5)$$
If we plug in the derivative
$$\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_{ij}, \quad (6)$$
we can merge these two cases:
$$\frac{\partial \mathcal{L}_i}{\partial \beta_j} = -(y_i - \pi_i) x_{ij}. \quad (7)$$
Machine Learning: Chenhao Tan Boulder 14 of 27
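As a sanity check on equation (7), a short finite-difference comparison can confirm the closed-form derivative; the β, x, and y values below are arbitrary illustrative choices:

```python
import numpy as np

def pi(beta, x):
    """pi_i from equation (3)."""
    z = beta @ x
    return np.exp(z) / (1.0 + np.exp(z))

def loss_i(beta, x, y):
    """Per-example loss L_i from equation (4)."""
    p = pi(beta, x)
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

# Compare the closed form -(y_i - pi_i) x_ij against a numerical derivative
beta = np.array([0.2, -0.5, 1.0])    # arbitrary illustrative values
x, y, j, eps = np.array([1.0, 3.0, 0.5]), 1, 1, 1e-6
analytic = -(y - pi(beta, x)) * x[j]
bp, bm = beta.copy(), beta.copy()
bp[j] += eps
bm[j] -= eps
numeric = (loss_i(bp, x, y) - loss_i(bm, x, y)) / (2 * eps)
print(analytic, numeric)             # the two values should agree closely
```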

Gradient Descent Gradient for Logistic Regression
Gradient:
$$\nabla_\beta \mathcal{L}(\beta) = \left[ \frac{\partial \mathcal{L}(\beta)}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\beta)}{\partial \beta_n} \right] \quad (8)$$
Update:
$$\beta \leftarrow \beta - \eta \nabla_\beta \mathcal{L}(\beta) \quad (9)$$
$$\beta_i \leftarrow \beta_i - \eta \frac{\partial \mathcal{L}(\beta)}{\partial \beta_i} \quad (10)$$
η: step size, must be greater than zero.
Why are we subtracting? What would we do if we wanted to do ascent?
Machine Learning: Chenhao Tan Boulder 15 of 27
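Putting equations (8)–(10) together, a minimal batch gradient descent loop might look like the sketch below (not the course's reference implementation; the toy data are invented):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, iters=1000):
    """Batch gradient descent on the (unregularized) negative log likelihood.
    X is assumed to already contain a leading column of 1s for the bias term beta_0."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))    # predicted P(y = 1 | x) for every example
        grad = -X.T @ (y - pi)                    # dL/dbeta_j = -sum_i (y_i - pi_i) x_ij
        beta -= eta * grad                        # descend: beta <- beta - eta * gradient
    return beta

# Toy run on invented data
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(gradient_descent(X, y))
```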

Gradient Descent Choosing Step Size
[Figure: objective plotted against the parameter, illustrating how different step sizes affect the descent path.]
Machine Learning: Chenhao Tan Boulder 16 of 27

Gradient Descent Remaining issues When to stop? What if β keeps getting bigger? Machine Learning: Chenhao Tan Boulder 17 of 27

Gradient Descent Regularized Conditional Log Likelihood
Unregularized:
$$\beta^* = \arg\min_\beta \left[ -\sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \right] \quad (11)$$
Regularized:
$$\beta^* = \arg\min_\beta \left[ -\sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) + \frac{1}{2} \mu \sum_i \beta_i^2 \right] \quad (12)$$
µ is a regularization parameter that trades off between the likelihood and having small parameters.
Machine Learning: Chenhao Tan Boulder 18 of 27

Outline Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 19 of 27

Approximating the Gradient
Our datasets are too big to fit into memory... or the data are changing / streaming.
It is hard to compute the true gradient
$$\nabla \mathcal{L}(\beta) = \mathbb{E}_x \left[ \nabla \mathcal{L}(\beta, x) \right], \quad (13)$$
an average over all observations.
What if we compute an update just from one observation?
Machine Learning: Chenhao Tan Boulder 20 of 27

Stochastic Gradient Descent Getting to Union Station Pretend it's a pre-smartphone world and you want to get to Union Station. Machine Learning: Chenhao Tan Boulder 21 of 27

Stochastic Gradient for Regularized Regression
$$\mathcal{L} = -\log p(y \mid x; \beta) + \frac{1}{2} \mu \sum_j \beta_j^2 \quad (14)$$
Taking the derivative (with respect to example $x_i$):
$$\frac{\partial \mathcal{L}}{\partial \beta_j} = -(y_i - \pi_i) x_{ij} + \mu \beta_j \quad (15)$$
Machine Learning: Chenhao Tan Boulder 22 of 27

Stochastic Gradient for Logistic Regression
Given a single observation $x_i$ chosen at random from the dataset,
$$\beta_j \leftarrow \beta_j - \eta \left( \mu \beta_j - x_{ij} \left[ y_i - \pi_i \right] \right) \quad (16)$$
Machine Learning: Chenhao Tan Boulder 23 of 27
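A minimal sketch of one stochastic pass implementing update (16) (toy data invented for illustration; note this version also regularizes the bias weight, which practical implementations often avoid):

```python
import numpy as np

def sgd_epoch(beta, X, y, eta=0.1, mu=0.01, seed=0):
    """One stochastic pass over the data with L2 regularization, per equation (16):
    beta_j <- beta_j - eta * (mu * beta_j - x_ij * (y_i - pi_i))."""
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(y)):              # visit the examples in random order
        pi_i = 1.0 / (1.0 + np.exp(-(beta @ X[i])))
        beta = beta - eta * (mu * beta - X[i] * (y[i] - pi_i))
    return beta

# Toy run on invented data; the first column of 1s is the bias feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(sgd_epoch(np.zeros(2), X, y))
```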

Example Documents
Update rule (without regularization): $\beta_j \leftarrow \beta_j + \eta (y_i - \pi_i) x_{ij}$. Assume step size η = 1.0.
Features: bias, A, B, C, D. Initialize β_bias = 0, β_A = 0, β_B = 0, β_C = 0, β_D = 0, i.e. β = ⟨0, 0, 0, 0, 0⟩.
Documents (feature values are word counts, plus a constant bias feature of 1):
y_1 = 1: A A A A B B B C   (x_1 = ⟨1, 4, 3, 1, 0⟩)
y_2 = 0: B C C C D D D D   (x_2 = ⟨1, 0, 1, 3, 4⟩)
You first see the positive example. First, compute π_1:
π_1 = Pr(y_1 = 1 | x_1) = exp(β^T x_1) / (1 + exp(β^T x_1)) = exp(0) / (exp(0) + 1) = 0.5
Updates from the first example:
β_bias = β_bias + η (y_1 − π_1) x_1,bias = 0.0 + 1.0 · (1.0 − 0.5) · 1.0 = 0.5
β_A = β_A + η (y_1 − π_1) x_1,A = 0.0 + 1.0 · (1.0 − 0.5) · 4.0 = 2.0
β_B = β_B + η (y_1 − π_1) x_1,B = 0.0 + 1.0 · (1.0 − 0.5) · 3.0 = 1.5
β_C = β_C + η (y_1 − π_1) x_1,C = 0.0 + 1.0 · (1.0 − 0.5) · 1.0 = 0.5
β_D = β_D + η (y_1 − π_1) x_1,D = 0.0 + 1.0 · (1.0 − 0.5) · 0.0 = 0.0
So β = ⟨0.5, 2, 1.5, 0.5, 0⟩.
Now you see the negative example. What's π_2?
π_2 = Pr(y_2 = 1 | x_2) = exp(β^T x_2) / (1 + exp(β^T x_2)) = exp(0.5 + 1.5 + 1.5 + 0) / (exp(0.5 + 1.5 + 1.5 + 0) + 1) ≈ 0.97
Updates from the second example:
β_bias = β_bias + η (y_2 − π_2) x_2,bias = 0.5 + 1.0 · (0.0 − 0.97) · 1.0 = −0.47
β_A = β_A + η (y_2 − π_2) x_2,A = 2.0 + 1.0 · (0.0 − 0.97) · 0.0 = 2.0
β_B = β_B + η (y_2 − π_2) x_2,B = 1.5 + 1.0 · (0.0 − 0.97) · 1.0 = 0.53
β_C = β_C + η (y_2 − π_2) x_2,C = 0.5 + 1.0 · (0.0 − 0.97) · 3.0 = −2.41
β_D = β_D + η (y_2 − π_2) x_2,D = 0.0 + 1.0 · (0.0 − 0.97) · 4.0 = −3.88
Machine Learning: Chenhao Tan Boulder 24 of 27
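The worked example above can be reproduced in a few lines of Python (a verification sketch, not course code; as in the slides, the regularization term is omitted from the update):

```python
import numpy as np

# Features: [bias, A, B, C, D]; values are word counts in each document
x1, y1 = np.array([1.0, 4.0, 3.0, 1.0, 0.0]), 1.0   # "A A A A B B B C"
x2, y2 = np.array([1.0, 0.0, 1.0, 3.0, 4.0]), 0.0   # "B C C C D D D D"

beta, eta = np.zeros(5), 1.0
for x, y in [(x1, y1), (x2, y2)]:
    pi = np.exp(beta @ x) / (1.0 + np.exp(beta @ x))
    beta = beta + eta * (y - pi) * x                 # unregularized stochastic update
    print(np.round(pi, 2), np.round(beta, 2))
# First step:  pi = 0.5,  beta = [ 0.5   2.    1.5   0.5   0.  ]
# Second step: pi = 0.97, beta = [-0.47  2.    0.53 -2.41 -3.88]
```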


Algorithm
1. Initialize a vector β to be all zeros.
2. For t = 1, ..., T:
   For each example (x_i, y_i) and feature j:
     Compute $\pi_i \equiv \Pr(y_i = 1 \mid x_i)$
     Set $\beta_j = \beta_j - \eta \left( \mu \beta_j - (y_i - \pi_i) x_{ij} \right)$
3. Output the parameters β_1, ..., β_d.
Any issues? Inefficiency: we update every j even if x_ij = 0. Lazy sparse updates are covered in the readings.
Machine Learning: Chenhao Tan Boulder 25 of 27
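The lazy sparse update mentioned above can be sketched as follows: only weights of features that are active in the current example are touched, and the accumulated regularization shrinkage from skipped steps is applied when a feature next appears. This is an illustrative sketch of the idea with invented data; the readings' exact bookkeeping may differ.

```python
import numpy as np

def lazy_sparse_sgd(examples, n_features, eta=0.1, mu=0.01, epochs=1):
    """Sketch of lazy sparse updates: only weights of features active in the current
    example are touched; the L2 shrinkage for skipped steps is applied on demand."""
    beta = np.zeros(n_features)
    last = np.zeros(n_features, dtype=int)   # step at which each beta_j was last brought up to date
    t = 0
    for _ in range(epochs):
        for x, y in examples:                # x is a sparse dict {feature index: value}
            t += 1
            for j in x:                      # catch up on shrinkage from skipped steps
                beta[j] *= (1.0 - eta * mu) ** (t - 1 - last[j])
                last[j] = t - 1
            z = sum(beta[j] * v for j, v in x.items())
            pi = 1.0 / (1.0 + np.exp(-z))
            for j, v in x.items():           # regularized step, only for active features
                beta[j] = beta[j] * (1.0 - eta * mu) + eta * (y - pi) * v
                last[j] = t
    beta *= (1.0 - eta * mu) ** (t - last)   # final catch-up before returning
    return beta

# Toy sparse data (invented); feature 0 is the bias
examples = [({0: 1.0, 1: 2.0}, 1.0), ({0: 1.0, 2: 3.0}, 0.0)]
print(lazy_sparse_sgd(examples, n_features=3))
```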

Proofs about Stochastic Gradient Convergence depends on the convexity of the objective and how close (ε) you want to get to the actual answer. The best bounds depend on changing η over time and per dimension (not all features are created equal). Machine Learning: Chenhao Tan Boulder 26 of 27

Running time
Number of iterations to get to accuracy $\mathcal{L}(\beta) - \mathcal{L}(\beta^*) \le \epsilon$:
Gradient descent: if $\mathcal{L}$ is strongly convex, $O(\log(1/\epsilon))$ iterations.
Stochastic gradient descent: if $\mathcal{L}$ is strongly convex, $O(1/\epsilon)$ iterations.
Machine Learning: Chenhao Tan Boulder 27 of 27