CS489/698: Intro to ML

1 CS489/698: Intro to ML. Lecture 04: Logistic Regression

2 Outline Announcements Baseline Learning Machine Learning Pyramid Regression or Classification (that's it!) History of Classification History of Solvers (Analytical to Convex to Non-Convex but smooth) Convexity SGD Perceptron Review Bernoulli model / Logistic Regression Tensorflow Playground / Demo code Multiclass

3 Announcements: Assignment 1 due next Tuesday

5 Baseline Assuming Lin Alg Basics: 9/21/17 Francois Chaubard and Agastya Kalra 5 f cal Systems

7 ML Pyramid: Deep Learning; Machine Learning (anything fancier than simple Sklearn fit / predict calls); Software Engineering; Convex Optimization; Linear Algebra; Information Theory; Probability and Statistics

9 Regression or Classification

11 Classification: Higher-complexity datasets need higher-capacity models (formally, higher VC dimension). [Figure labels: Perceptron, Logistic Regression, MLP, Feature Eng, etc.]

13 History of Solvers: Closed Form (<1950s): Least Squares, etc. Iterative Methods + Convex (1950-2012): Interior Point Methods, Log Barrier, etc. Iterative Methods + Smooth (2012+): Deep Learning, etc.

15 Convexity: Jensen's Inequality
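The slide's own formulation is not preserved in this transcript. For reference, the standard textbook statement of Jensen's inequality for a convex function $f$ (the symbols $\lambda$, $x$, $y$, and the random variable $X$ are introduced here only for illustration) is:

```latex
f\bigl(\lambda x + (1-\lambda)\, y\bigr) \;\le\; \lambda f(x) + (1-\lambda) f(y),
\qquad \lambda \in [0,1],
\qquad\text{and more generally}\qquad
f\bigl(\mathbb{E}[X]\bigr) \;\le\; \mathbb{E}\bigl[f(X)\bigr].
```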

17 SGD

19 Perceptron Issues:

21 Bernoulli model: Let $P(Y=1 \mid X=x) = p(x; w)$, parameterized by $w$. Conditional likelihood on $\{(x_1, y_1), \ldots, (x_n, y_n)\}$: $P(Y_1 = y_1, \ldots, Y_n = y_n \mid X_1 = x_1, \ldots, X_n = x_n)$, which simplifies if independence holds: $\prod_{i=1}^{n} P(Y_i = y_i \mid X_i = x_i) = \prod_{i=1}^{n} p(x_i; w)^{y_i} (1 - p(x_i; w))^{1 - y_i}$, assuming $y_i$ is $\{0,1\}$-valued.
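A minimal sketch of evaluating this conditional likelihood on a toy dataset. The slide leaves the form of p(x; w) open; the sigmoid of an inner product used in `p_logistic` is an assumption made only so the example can run, and the data points are made up.

```python
import math

def p_logistic(x, w):
    # Assumed form of p(x; w) for illustration: sigmoid of the inner product w^T x.
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def conditional_likelihood(xs, ys, w, p=p_logistic):
    # prod_i p(x_i; w)^{y_i} * (1 - p(x_i; w))^{1 - y_i}, with y_i in {0, 1}
    likelihood = 1.0
    for x, y in zip(xs, ys):
        pi = p(x, w)
        likelihood *= pi if y == 1 else (1.0 - pi)
    return likelihood

xs = [(1.0, 2.0), (1.0, -1.0), (1.0, 0.5)]  # hypothetical inputs
ys = [1, 0, 1]                              # hypothetical labels

# With w = 0 every p(x_i; w) equals 1/2, so the likelihood is (1/2)^3.
print(conditional_likelihood(xs, ys, (0.0, 0.0)))  # 0.125
print(conditional_likelihood(xs, ys, (0.0, 1.0)))  # larger: this w fits the labels better
```

For large n the product underflows quickly, which is one reason to work with the log-likelihood instead.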

22 Naïve solution: Find $w$ to maximize the conditional likelihood $\prod_{i=1}^{n} p(x_i; w)^{y_i} (1 - p(x_i; w))^{1 - y_i}$. What is the solution if $p(x; w)$ does not depend on $x$? What is the solution if $p(x; w)$ does not depend on?
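One way to explore the first question: if p(x; w) is a constant p that ignores x, the likelihood becomes $p^{\sum_i y_i} (1-p)^{n - \sum_i y_i}$, and the standard Bernoulli result is that it is maximized at the fraction of positive labels. The sketch below checks that closed form against a coarse grid search; the labels are made up.

```python
import math

def log_likelihood(p, ys):
    # log of prod_i p^{y_i} (1 - p)^{1 - y_i} for a constant p in (0, 1)
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in ys)

ys = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical labels

# Closed-form maximizer: the fraction of positive labels.
p_hat = sum(ys) / len(ys)

# Coarse grid search over (0, 1) as a sanity check on the closed form.
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, ys))

print(p_hat, p_grid)
```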

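23 Generalized linear models (GLM): $y \sim \mathrm{Bernoulli}(p)$ with $p = p(x; w)$: logistic regression. $y \sim \mathrm{Normal}(\mu, \sigma^2)$ with $\mu = \mu(x; w)$: (weighted) least-squares regression. GLM: $y \sim \exp(\theta\, \phi(y) - A(\theta))$, where $\theta$ is the natural parameter, $\phi(y)$ the sufficient statistics, and $A(\theta)$ the log-partition function.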

24 Logit transform: $p(x; w) = w^\top x$? $p \ge 0$ is not guaranteed. $\log p(x; w) = w^\top x$? Better, but the LHS is negative while the RHS is real-valued. Logit transform: $\log \frac{p(x; w)}{1 - p(x; w)} = w^\top x$, where $\frac{p}{1 - p}$ is the odds ratio, or equivalently $p(x; w) = \frac{1}{1 + \exp(-w^\top x)}$.
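A small sketch showing that the logit and the sigmoid are inverses of each other, which is what "or equivalently" on the slide relies on. The function names `logit` and `sigmoid` are chosen here for illustration.

```python
import math

def logit(p):
    # log-odds: log(p / (1 - p)); maps (0, 1) onto the real line
    return math.log(p / (1.0 - p))

def sigmoid(z):
    # inverse of the logit: 1 / (1 + exp(-z)); maps the real line into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

for z in (-3.0, 0.0, 2.5):
    p = sigmoid(z)
    # The last column recovers z up to floating-point rounding.
    print(z, p, logit(p))
```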

25 Prediction with confidence: $p(x; w) = \frac{1}{1 + \exp(-w^\top x)}$. $\hat{y} = 1$ if $p = P(Y=1 \mid X=x) > \tfrac{1}{2}$, iff $w^\top x > 0$. Decision boundary: $w^\top x = 0$. $\hat{y} = \mathrm{sign}(w^\top x)$ as before, but with confidence $p(x; w)$.
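A sketch of prediction with confidence, assuming {0, 1} labels (so the predicted label is 1 or 0 rather than the sign). The weight vector and inputs are hypothetical.

```python
import math

def predict_with_confidence(x, w):
    z = sum(wj * xj for wj, xj in zip(w, x))  # w^T x
    p = 1.0 / (1.0 + math.exp(-z))            # P(Y = 1 | X = x)
    y_hat = 1 if p > 0.5 else 0               # same decision as w^T x > 0
    return y_hat, p

w = (1.0, -2.0)  # hypothetical weights

print(predict_with_confidence((3.0, 1.0), w))  # w^T x = 1 > 0, so label 1
print(predict_with_confidence((1.0, 1.0), w))  # w^T x = -1 < 0, so label 0
```

Points near the decision boundary get p close to 1/2, while points far from it get p close to 0 or 1.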

26 Not just a classification algorithm: Logistic regression does more than classification: it estimates conditional probabilities, under the logit transform assumption. Having confidence in a prediction is nice; the price is an assumption that may or may not hold. If classification is the sole goal, then this is doing extra work; as we shall see, SVM only estimates the decision boundary.

27 More than logistic regression: $F(p)$ transforms $p$ from $[0,1]$ to $\mathbb{R}$; we then equate $F(p)$ to a linear function $w^\top x$. But there are many other choices for $F$: precisely the inverse of any distribution function!
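As one illustration of "other choices for F", the sketch below compares the logit (inverse of the standard logistic CDF) with the inverse of the standard normal CDF, commonly called the probit link. The probit is not named on the slide; it is used here only as a familiar example, via `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

def logit_link(p):
    # inverse of the standard logistic CDF
    return math.log(p / (1.0 - p))

def probit_link(p):
    # inverse of the standard normal CDF (an alternative choice of F)
    return NormalDist().inv_cdf(p)

for p in (0.1, 0.5, 0.9):
    # Both links map (0, 1) to the real line and send p = 1/2 to 0.
    print(p, logit_link(p), probit_link(p))
```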

28 Logistic distribution: Cumulative distribution function $F(x; \mu, s) = \frac{1}{1 + \exp\left(-\frac{x - \mu}{s}\right)}$. Mean $\mu$, variance $s^2 \pi^2 / 3$.
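A rough numerical check of the stated variance, assuming the density is the derivative of the CDF, $F(1-F)/s$. The integration range, step count, and the values of `mu` and `s` are arbitrary choices for this sketch.

```python
import math

def logistic_cdf(x, mu=0.0, s=1.0):
    # F(x; mu, s) = 1 / (1 + exp(-(x - mu) / s))
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def logistic_pdf(x, mu=0.0, s=1.0):
    # derivative of the CDF: F * (1 - F) / s
    F = logistic_cdf(x, mu, s)
    return F * (1.0 - F) / s

def numeric_variance(mu, s, half_width=40.0, steps=100_000):
    # Midpoint rule for the integral of (x - mu)^2 * pdf(x)
    # over [mu - half_width * s, mu + half_width * s].
    a = mu - half_width * s
    h = 2.0 * half_width * s / steps
    total = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * h
        total += (x - mu) ** 2 * logistic_pdf(x, mu, s)
    return total * h

mu, s = 1.0, 2.0
# The two printed values should agree to many decimal places.
print(numeric_variance(mu, s), s ** 2 * math.pi ** 2 / 3)
```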

30 Playground

31 Tensorflow coding example

34 More than 2 classes: Softmax $P(Y = c \mid x, w) = \frac{\exp(w_c^\top x)}{\sum_{q=1}^{k} \exp(w_q^\top x)}$. Again, nonnegative and sum to 1. Negative log-likelihood ($y$ is one-hot): $-\log \prod_{i=1}^{n} \prod_{c=1}^{k} p_{ic}^{y_{ic}} = -\sum_{i=1}^{n} \sum_{c=1}^{k} y_{ic} \log p_{ic}$.
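A minimal sketch of the softmax probabilities and the one-hot negative log-likelihood. The container `W` (a list of the k weight vectors $w_c$), the toy inputs, and the max-subtraction inside `softmax` (a common overflow guard that leaves the result unchanged) are choices made for this example, not taken from the slides.

```python
import math

def softmax(scores):
    # Subtract the max score before exponentiating to avoid overflow;
    # the shift cancels between numerator and denominator.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(x, W):
    # W is a list of k weight vectors w_c; returns P(Y = c | x) for each class c.
    scores = [sum(wj * xj for wj, xj in zip(w_c, x)) for w_c in W]
    return softmax(scores)

def negative_log_likelihood(xs, Y, W):
    # Y holds one-hot rows y_i; NLL = -sum_i sum_c y_ic * log p_ic
    nll = 0.0
    for x, y in zip(xs, Y):
        p = class_probabilities(x, W)
        nll -= sum(y_c * math.log(p_c) for y_c, p_c in zip(y, p))
    return nll

W = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]  # k = 3 classes, all-zero weights
xs = [(1.0, 2.0), (0.5, -1.0)]            # hypothetical inputs
Y = [(1, 0, 0), (0, 0, 1)]                # hypothetical one-hot labels

print(class_probabilities(xs[0], W))      # uniform over the 3 classes
print(negative_log_likelihood(xs, Y, W))  # equals 2 * log(3) for uniform predictions
```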

35 Questions?

36 Backup

37 Classification revisited: $\hat{y} = \mathrm{sign}(x^\top w + b)$. How confident are we about $\hat{y}$? $x^\top w + b$ seems a good indicator, but it is real-valued and hard to interpret; there are ways to transform it into $[0,1]$. Better(?) idea: learn confidence directly.

38 Conditional probability: $P(Y=1 \mid X=x)$: conditional on seeing $x$, what is the chance of this instance being positive, i.e., $Y=1$? Obviously, a value in $[0,1]$. $P(Y=0 \mid X=x) = 1 - P(Y=1 \mid X=x)$ if there are two classes; more generally, the class probabilities sum to 1. Notation (Simplex): $\Delta_{k-1} := \{ p \in \mathbb{R}^k : p \ge 0, \sum_i p_i = 1 \}$.

39 Reduction to a harder problem: $P(Y=1 \mid X=x) = E(\mathbf{1}_{Y=1} \mid X=x)$, where $\mathbf{1}_A = 1$ if $A$ is true and $0$ if $A$ is false. Let $Z = \mathbf{1}_{Y=1}$; then this is the regression function for $(X, Z)$. Use linear regression for binary $Z$? Exploit structure: conditional probabilities are in a simplex. Never reduce to an unnecessarily harder problem.

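40 Maximum likelihood: maximize $\prod_{i=1}^{n} p(x_i; w)^{y_i} (1 - p(x_i; w))^{1 - y_i}$ with $p(x; w) = \frac{1}{1 + \exp(-w^\top x)}$. Equivalently, minimize the negative log-likelihood $\sum_i \log\left(e^{(1 - y_i) w^\top x_i} + e^{-y_i w^\top x_i}\right) = \sum_i \log\left(1 + e^{-\tilde{y}_i w^\top x_i}\right)$, where $\tilde{y}_i = 2 y_i - 1 \in \{-1, +1\}$.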
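A sketch of minimizing this negative log-likelihood with plain (full-batch) gradient descent. The toy data, the fixed step size, and the iteration count are assumptions for illustration; a stochastic variant would use one example (or a mini-batch) per update instead of the full sum.

```python
import math

def nll(w, xs, ys):
    # sum_i log(1 + exp(-ytilde_i * w^T x_i)), with ytilde_i = 2 * y_i - 1
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(wj * xj for wj, xj in zip(w, x))
        total += math.log1p(math.exp(-(2 * y - 1) * z))
    return total

def nll_gradient(w, xs, ys):
    # gradient of the NLL: sum_i (p_i - y_i) * x_i, with p_i = sigmoid(w^T x_i)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        z = sum(wj * xj for wj, xj in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    return grad

# Toy, non-separable data; the first feature is a constant 1 acting as a bias.
xs = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 0.5)]
ys = [0, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
step = 0.1  # hypothetical fixed step size
history = [nll(w, xs, ys)]
for _ in range(200):
    g = nll_gradient(w, xs, ys)
    w = [wj - step * gj for wj, gj in zip(w, g)]
    history.append(nll(w, xs, ys))

print(w, history[0], history[-1])
```

The data are deliberately not linearly separable; on separable data the likelihood has no finite maximizer and the weights grow without bound.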

41 Newton's algorithm: $w_{t+1} \leftarrow w_t - \eta_t [\nabla^2 f(w_t)]^{-1} \nabla f(w_t)$, with $\nabla f(w_t) = X^\top (p - y)$, $\nabla^2 f(w_t) = \sum_i p_i (1 - p_i) x_i x_i^\top$ (PSD), and $p_i = \frac{1}{1 + e^{-w_t^\top x_i}}$. $\eta = 1$: iterative weighted least-squares. Uncertain predictions get bigger weight.
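A sketch of the Newton update with $\eta = 1$ for two-dimensional inputs, so that the linear system can be solved by hand with Cramer's rule and no linear-algebra library is needed. The toy data and the iteration count are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_step(w, xs, ys, eta=1.0):
    # Gradient X^T (p - y) and Hessian sum_i p_i (1 - p_i) x_i x_i^T for 2-d x_i.
    g = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(xs, ys):
        p = sigmoid(w[0] * x[0] + w[1] * x[1])
        weight = p * (1.0 - p)  # largest when p is near 1/2 (uncertain prediction)
        for a in range(2):
            g[a] += (p - y) * x[a]
            for b in range(2):
                H[a][b] += weight * x[a] * x[b]
    # Solve H d = g with Cramer's rule (2x2 case only).
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    d0 = (g[0] * H[1][1] - H[0][1] * g[1]) / det
    d1 = (H[0][0] * g[1] - H[1][0] * g[0]) / det
    # Return the updated weights and the gradient at the *old* weights.
    return [w[0] - eta * d0, w[1] - eta * d1], g

# Toy, non-separable data; the first feature is a constant 1 acting as a bias.
xs = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 0.5)]
ys = [0, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
for t in range(10):
    w, g = newton_step(w, xs, ys)
    print(t, w, g)
```

On this small problem the gradient shrinks to rounding level within a handful of iterations, compared with the hundreds of steps a fixed-step gradient method typically needs.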

42 A word about implementation: Numerically computing the exponential can be tricky: it easily underflows or overflows. The usual trick: estimate the range of the exponents, then shift the mean of the exponents to 0.
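A sketch of the shifting idea applied to $\log \sum_j e^{a_j}$. Note the swap: the slide shifts by the mean of the exponents, whereas this example shifts by the maximum, a common variant that guarantees every shifted exponent is at most 0 so `math.exp` cannot overflow. The helper names are hypothetical.

```python
import math

def log_sum_exp(exponents):
    # log(sum_j exp(a_j)) = c + log(sum_j exp(a_j - c)) for any constant c.
    # Here c is the maximum exponent (the slide mentions the mean instead).
    c = max(exponents)
    return c + math.log(sum(math.exp(a - c) for a in exponents))

def log1p_exp(t):
    # log(1 + exp(t)) written as log(exp(0) + exp(t)); safe for large |t|
    return log_sum_exp([0.0, t])

print(log1p_exp(1000.0))   # 1000.0, whereas math.exp(1000.0) raises OverflowError
print(log1p_exp(-1000.0))  # 0.0, since exp(-1000.0) underflows to 0.0
print(log1p_exp(0.0))      # log(2)
```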

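43 Robustness: $\ell(t) = \log(1 + e^{t})$, $L(\hat{y}, y) = \ell(-\hat{y} y)$, $\hat{y} = w^\top x$. Bounded derivative: $\ell'(t) = \frac{e^{t}}{1 + e^{t}} = \frac{1}{1 + e^{-t}}$. Variational exponential: $\log(1 + e^{t}) = \min_{0 \le \eta \le 1} \eta e^{t} - \log(\eta) + \eta - 1$. Larger exp loss gets smaller weights.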
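A numerical check of the variational form, reading the slide as $\ell(t) = \log(1 + e^{t})$ with $\log(1 + e^{t}) = \min_{0 \le \eta \le 1} \eta e^{t} - \log \eta + \eta - 1$. Setting the derivative in $\eta$ to zero gives the minimizer $\eta^* = 1/(1 + e^{t})$, so a larger $e^{t}$ gets a smaller weight. The grid resolution and the test values of `t` are arbitrary choices for this sketch.

```python
import math

def logistic_loss(t):
    # l(t) = log(1 + e^t)
    return math.log1p(math.exp(t))

def logistic_loss_derivative(t):
    # l'(t) = e^t / (1 + e^t) = 1 / (1 + e^{-t}), always strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-t))

def variational_objective(eta, t):
    # eta * e^t - log(eta) + eta - 1, to be minimized over eta in (0, 1]
    return eta * math.exp(t) - math.log(eta) + eta - 1.0

for t in (-2.0, 0.0, 3.0):
    eta_star = 1.0 / (1.0 + math.exp(t))  # stationary point of the objective
    grid_min = min(variational_objective(k / 10000, t) for k in range(1, 10001))
    # The loss, the objective at eta_star, and the grid minimum should nearly coincide.
    print(t, logistic_loss(t), variational_objective(eta_star, t), grid_min)
```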

Warm up. Regrade requests are submitted directly in Gradescope; do not email instructors. 1 float in NumPy = 8 bytes. $10^6 \approx 2^{20}$ bytes = 1 MB. $10^9 \approx 2^{30}$ bytes = 1 GB. For each block compute the memory required