Machine Learning & Data Mining CMS/CS/CNS/EE 155. Lecture 2: Perceptron & Gradient Descent


1 Machine Learning & Data Mining CMS/CS/CNS/EE 155 Lecture 2: Perceptron & Gradient Descent

2 Announcements: Homework 1 is out, due Tuesday, Jan 12th at 2pm, via Moodle. Sign up for Moodle & Piazza if you haven't yet; announcements are made via Piazza. Recitation on Python programming tonight, 7:30pm in Annenberg 105.

3 Recap: Basic Recipe. Training Data: S = {(x_i, y_i)}_{i=1}^N, x_i ∈ R^D, y_i ∈ {−1, +1}. Model Class (Linear Models): f(x | w, b) = w^T x − b. Loss Function (Squared Loss): L(a, b) = (a − b)^2. Learning Objective (Optimization Problem): argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)).

4 Recap: Bias-Variance Trade-off. [Figure: panels plotting bias vs. variance across model complexities.]

5 Recap: Complete Pipeline. Training Data: S = {(x_i, y_i)}_{i=1}^N → Model Class(es): f(x | w, b) = w^T x − b → Loss Function: L(a, b) = (a − b)^2 → argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) → Cross Validation & Model Selection → Profit!

6 Today: two basic learning approaches, the Perceptron algorithm and Gradient Descent. Aka, actually solving the optimization problem.

7 The Perceptron. One of the earliest learning algorithms (1957, by Frank Rosenblatt). Still a great algorithm: fast, clean analysis, precursor to Neural Networks. [Photo: Frank Rosenblatt with the Mark 1 Perceptron Machine.]

8 Perceptron Learning Algorithm (Linear Classification Model). Model: f(x | w) = sign(w^T x − b). Training Set: S = {(x_i, y_i)}_{i=1}^N, y ∈ {+1, −1}. Initialize w_1 = 0, b_1 = 0. For t = 1, 2, …: receive example (x, y); if f(x | w_t) = y, then [w_{t+1}, b_{t+1}] = [w_t, b_t]; else w_{t+1} = w_t + y x and b_{t+1} = b_t − y. Go through the training set in arbitrary order (e.g., randomly).
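A minimal NumPy sketch of this algorithm, under the slide's convention f(x | w) = sign(w^T x − b); the epoch cap, the mistake-free stopping test, and the random ordering are illustrative assumptions:

```python
import numpy as np

def perceptron(X, y, epochs=100, seed=0):
    """Perceptron learning. X: (N, D) array; y: (N,) array with entries in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in rng.permutation(N):           # arbitrary (random) order
            if np.sign(X[i] @ w - b) != y[i]:  # misclassified -> update
                w += y[i] * X[i]
                b -= y[i]                      # moves w^T x - b toward y
                mistakes += 1
        if mistakes == 0:                      # mistake-free pass: done
            break
    return w, b
```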

9 Aside: Hyperplane Distance. A line is 1D, a plane is 2D; a hyperplane generalizes to many dimensions (and includes lines and planes). A hyperplane is defined by (w, b); its distance from the origin is b / ‖w‖. Distance from a point x to the hyperplane: |w^T x − b| / ‖w‖. Signed distance: (w^T x − b) / ‖w‖. Linear model = un-normalized signed distance!
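A quick numeric check of these formulas, using a hypothetical hyperplane and point:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), 2.0           # hypothetical hyperplane w^T x = b
x = np.array([1.0, 1.0])

signed = (w @ x - b) / np.linalg.norm(w)   # signed distance: (3 + 4 - 2) / 5 = 1.0
distance = abs(signed)                     # unsigned distance
origin_gap = b / np.linalg.norm(w)         # hyperplane's distance from origin: 0.4
```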

10–22 Perceptron Learning. [Figure sequence: a walkthrough on a 2D training set. Examples are visited in turn; correctly classified points ("Correct!") leave the model unchanged, while each misclassified point ("Misclassified!") triggers an update to w and b ("Update!"). After several updates, all training examples are correctly classified.]

23–45 Perceptron Learning: Start Again. [Figure sequence: a second run on the same data from a different ordering. Again, misclassified examples trigger updates until all training examples are correctly classified; the resulting model differs from the first run.]

46 Recap: Perceptron Learning Algorithm (Linear Classification Model). Model: f(x | w) = sign(w^T x − b). Training Set: S = {(x_i, y_i)}_{i=1}^N, y ∈ {+1, −1}. Initialize w_1 = 0, b_1 = 0. For t = 1, 2, …: receive example (x, y); if f(x | w_t) = y, then [w_{t+1}, b_{t+1}] = [w_t, b_t]; else w_{t+1} = w_t + y x and b_{t+1} = b_t − y. Go through the training set in arbitrary order (e.g., randomly).

47 Comparing the Two Models. [Figure: the separating hyperplanes learned in the two runs.]

48 Convergence to Mistake-Free = Linearly Separable!

49 Margin: γ = max_w min_{(x,y)} y (w^T x) / ‖w‖

50 Linear Separability. A classification problem is Linearly Separable if there exists a w with perfect classification accuracy. Separable with Margin γ: γ = max_w min_{(x,y)} y (w^T x) / ‖w‖. Linearly Separable: γ > 0.
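For a fixed w, the inner minimum in this definition can be evaluated directly; a sketch (the outer maximization over w is a separate optimization, omitted here):

```python
import numpy as np

def margin_of(w, X, y):
    """min over (x_i, y_i) of y_i * (w^T x_i) / ||w||, per the slide's definition."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)
```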

51 Perceptron Mistake Bound. Holds for any ordering of the training examples! Radius of feature space: R = max_x ‖x‖. If linearly separable with margin γ, the number of mistakes is bounded by R^2 / γ^2.

52 In the Real World. Most problems are NOT linearly separable! The perceptron may never converge. So what to do? Use a validation set!

53 Early Stopping via Validation. Run perceptron learning on the training set; evaluate the current model on the validation set; terminate when validation accuracy stops improving. https://en.wikipedia.org/wiki/Early_stopping
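A sketch of this early-stopping loop wrapped around the perceptron; the accuracy helper and the strict "stop at the first non-improvement" rule are simplifying assumptions:

```python
import numpy as np

def accuracy(w, b, X, y):
    return np.mean(np.sign(X @ w - b) == y)

def perceptron_early_stop(X_tr, y_tr, X_val, y_val, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X_tr.shape[1]), 0.0
    best_w, best_b, best_acc = w.copy(), b, accuracy(w, b, X_val, y_val)
    for _ in range(epochs):
        for i in rng.permutation(len(y_tr)):        # one perceptron pass
            if np.sign(X_tr[i] @ w - b) != y_tr[i]:
                w += y_tr[i] * X_tr[i]
                b -= y_tr[i]
        acc = accuracy(w, b, X_val, y_val)
        if acc <= best_acc:                         # validation stopped improving
            break                                   # keep the best model seen so far
        best_w, best_b, best_acc = w.copy(), b, acc
    return best_w, best_b
```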

54 Online Learning vs Batch Learning. Online learning: receive a stream of data (x, y) and make incremental updates; perceptron learning is an instance of online learning. Batch learning: train over all data simultaneously. Online learning algorithms can be used for batch learning, e.g., by streaming the data to the learning algorithm. https://en.wikipedia.org/wiki/Online_machine_learning

55 Recap: Perceptron. One of the first machine learning algorithms. Benefits: simple and fast, clean analysis. Drawbacks: might not converge to a very good model. What is the objective function?

56 (Stochastic) Gradient Descent

57 Back to Optimizing Objective Functions. Training Data: S = {(x_i, y_i)}_{i=1}^N, x_i ∈ R^D, y_i ∈ {−1, +1}. Model Class (Linear Models): f(x | w, b) = w^T x − b. Loss Function (Squared Loss): L(a, b) = (a − b)^2. Learning Objective (Optimization Problem): argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)).

58 Back to Optimizing Objective Functions. argmin_{w,b} L(w, b | S), where L(w, b | S) = Σ_{i=1}^N L(y_i, f(x_i | w, b)). Typically requires an optimization algorithm; the simplest is Gradient Descent. This lecture: stick with squared loss. We'll talk about various loss functions next lecture.

59 Gradient Review for Squared Loss.
∇_w L(w, b | S) = ∇_w Σ_{i=1}^N L(y_i, f(x_i | w, b))
= Σ_{i=1}^N ∇_w L(y_i, f(x_i | w, b))   [linearity of differentiation]
= Σ_{i=1}^N −2 (y_i − f(x_i | w, b)) ∇_w f(x_i | w, b)   [chain rule, L(a, b) = (a − b)^2]
= Σ_{i=1}^N −2 (y_i − f(x_i | w, b)) x_i   [since f(x | w, b) = w^T x − b]
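The derived gradient as NumPy functions; grad_b follows the same chain-rule steps with ∂f/∂b = −1 (an added detail, not spelled out on the slide):

```python
import numpy as np

def grad_w(w, b, X, y):
    """Sum over i of -2 (y_i - f(x_i | w, b)) x_i, from the derivation above."""
    residual = y - (X @ w - b)     # y_i - f(x_i | w, b)
    return -2.0 * (X.T @ residual)

def grad_b(w, b, X, y):
    """Since df/db = -1, the chain rule gives +2 (y_i - f(x_i | w, b)), summed over i."""
    return 2.0 * np.sum(y - (X @ w - b))
```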

60 Gradient Descent. Initialize w_1 = 0, b_1 = 0. For t = 1, 2, …:
w_{t+1} = w_t − η_{t+1} ∇_w L(w_t, b_t | S)
b_{t+1} = b_t − η_{t+1} ∇_b L(w_t, b_t | S)
where η is the step size.
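A direct transcription of this update rule, reusing grad_w and grad_b from the sketch above; the fixed step size is a placeholder (choosing it is discussed next):

```python
import numpy as np

def gradient_descent(X, y, eta=0.001, iters=1000):
    """Plain gradient descent on the squared loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        gw, gb = grad_w(w, b, X, y), grad_b(w, b, X, y)
        w, b = w - eta * gw, b - eta * gb   # simultaneous update from (w_t, b_t)
    return w, b
```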

61–68 How to Choose Step Size? [Figure sequence on the 1D example with ∇_w L(w) = −2 (1 − w): with η = 1 the iterates oscillate infinitely; with a very small η the iterates converge but take a really long time.]

69 How to Choose Step Size? As large as possible, without diverging! [Plot: training loss vs. iterations for several step sizes.] Note that the absolute scale is not meaningful; focus on the relative magnitude differences.

70 Being Scale Invariant. Consider the following two gradient updates: w_{t+1} = w_t − η_{t+1} ∇_w L(w_t, b_t | S) and w_{t+1} = w_t − η̂_{t+1} ∇_w L̂(w_t, b_t | S). Suppose L̂ = 1000 L. How are the two step sizes related? Since ∇_w L̂ = 1000 ∇_w L, the updates match when η̂_{t+1} = η_{t+1} / 1000.

71 Practical Rules of Thumb. Divide the loss function by the number of examples: w_{t+1} = w_t − η_{t+1} ∇_w L(w_t, b_t | S) / N. Start with a large step size; if the loss plateaus, divide the step size by 2. (Can also use advanced optimization methods.) (The step size must decrease over time to guarantee convergence to the global optimum.)
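One way these rules of thumb might be implemented; the plateau test and its tolerance are assumptions:

```python
import numpy as np

def gd_with_halving(X, y, eta=1.0, iters=10000, tol=1e-4):
    """Gradient descent on the mean squared loss; halve eta when the loss plateaus."""
    N = len(y)
    w, b = np.zeros(X.shape[1]), 0.0
    prev_loss = np.inf
    for _ in range(iters):
        residual = y - (X @ w - b)
        loss = np.mean(residual ** 2)           # loss divided by N, per the slide
        if prev_loss - loss < tol * prev_loss:  # relative progress stalled...
            eta /= 2.0                          # ...so halve the step size
        prev_loss = loss
        gw = -2.0 / N * (X.T @ residual)
        gb =  2.0 / N * np.sum(residual)
        w, b = w - eta * gw, b - eta * gb
    return w, b
```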

72 Aside: Convexity. [Figure: a convex function vs. a non-convex function.] Convex: easy to find global optima! Strictly convex if the difference in the convexity inequality is always > 0. Image source: http://en.wikipedia.org/wiki/Convex_function

73 Aside: Convexity. L(x_2) ≥ L(x_1) + ∇L(x_1)^T (x_2 − x_1): the function is always above the locally linear extrapolation.
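A quick numeric check of this inequality for the 1D squared loss from the step-size example:

```python
L  = lambda w: (1.0 - w) ** 2        # 1D squared loss L(w) = (1 - w)^2
dL = lambda w: -2.0 * (1.0 - w)      # its gradient

w1, w2 = 0.5, 3.0
assert L(w2) >= L(w1) + dL(w1) * (w2 - w1)   # 4.0 >= 0.25 + (-1.0)(2.5) = -2.25
```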

74 Aside: Convexity. All local optima are global optima, so Gradient Descent will find an optimum (assuming the step size is chosen safely). Strictly convex: unique global optimum. Almost all standard objectives are (strictly) convex: squared loss, SVMs, logistic regression, Ridge, Lasso. We will see non-convex objectives in the 2nd half of the course.

75 Convergence. Assume L is convex. How many iterations to achieve L(w) − L(w*) ≤ ε?
If |L(a) − L(b)| ≤ ρ ‖a − b‖ (L is ρ-Lipschitz), then O(1/ε^2) iterations.
If ‖∇L(a) − ∇L(b)‖ ≤ ρ ‖a − b‖ (L is ρ-smooth), then O(1/ε) iterations.
If L(a) ≥ L(b) + ∇L(b)^T (a − b) + (ρ/2) ‖a − b‖^2 (L is ρ-strongly convex), then O(log(1/ε)) iterations.
More details: Bubeck textbook, Chapter 3.

76 Convergence. In general, it takes infinite time to reach the global optimum. But in general, we don't care, as long as we're close enough to the global optimum! [Plot: loss vs. iterations, annotated "How do we know if we're here, and not here?"]

77 When to Stop? Convergence analyses give worst-case upper bounds. What to do in practice? Stop when progress is sufficiently small (e.g., relative reduction below some small threshold); stop after a pre-specified number of iterations (e.g., Yisong prefers this option); or stop when validation error stops going down.

78 Limitation of Gradient Descent. Requires a full pass over the training set per iteration: ∇_w L(w, b | S) = Σ_{i=1}^N ∇_w L(y_i, f(x_i | w, b)). Very expensive if the training set is huge. Do we need to do a full pass over the data?

79 Stochastic Gradient Descent. Suppose the loss function decomposes additively: L(w, b) = (1/N) Σ_{i=1}^N L_i(w, b) = E_i[L_i(w, b)], where each L_i(w, b) = (y_i − f(x_i | w, b))^2 corresponds to a single data point. Then the gradient is the expected gradient of the sub-functions: ∇_w L(w, b) = ∇_w E_i[L_i(w, b)].

80 Stochastic Gradient Descent. It suffices to take a random gradient update, so long as it matches the true gradient in expectation. Each iteration t: choose i at random, then
w_{t+1} = w_t − η_{t+1} ∇_w L_i(w, b)
b_{t+1} = b_t − η_{t+1} ∇_b L_i(w, b)
The expected value of the update direction is ∇_w L(w, b). SGD is an online learning algorithm!
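A sketch of this loop for the squared loss; the fixed step size (rather than a decaying schedule η_t) is a simplification:

```python
import numpy as np

def sgd(X, y, eta=0.001, iters=100000, seed=0):
    """SGD on the squared loss: each update uses one randomly chosen example."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        i = rng.integers(N)                      # choose i at random
        residual = y[i] - (X[i] @ w - b)         # y_i - f(x_i | w, b)
        w = w - eta * (-2.0 * residual * X[i])   # gradient of L_i w.r.t. w
        b = b - eta * ( 2.0 * residual)          # gradient of L_i w.r.t. b
    return w, b
```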

81 Mini-Batch SGD. Each L_i is computed over a small batch of training examples. Can leverage vector operations; decreases the volatility of gradient updates. Industry state-of-the-art: everyone uses mini-batch SGD, often parallelized (e.g., different cores work on different mini-batches).
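The mini-batch variant of the same loop; the batch size here is an arbitrary choice. The batched matrix products are where the vector-operations benefit shows up:

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.01, iters=10000, batch=64, seed=0):
    """Mini-batch SGD: one vectorized gradient step per random batch."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        idx = rng.integers(N, size=batch)          # random mini-batch of indices
        residual = y[idx] - (X[idx] @ w - b)
        gw = -2.0 / batch * (X[idx].T @ residual)  # averaged batch gradient
        gb =  2.0 / batch * np.sum(residual)
        w, b = w - eta * gw, b - eta * gb
    return w, b
```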

82 Checking for Convergence. How to check for convergence? Evaluating the loss on the entire training set seems expensive. [Plot: loss vs. iterations.]

83 Checking for Convergence. How to check for convergence? Evaluating the loss on the entire training set seems expensive. Don't check after every iteration (e.g., check every 1000 iterations), and evaluate the loss on a subset of the training data (e.g., the previous 5000 examples).
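A sketch combining both tricks inside an SGD loop, tracking a running window of per-example losses; the window size and check interval follow the slide's examples, while the stopping tolerance is an assumption:

```python
from collections import deque
import numpy as np

def sgd_with_check(X, y, eta=0.001, max_iters=10**6, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    w, b = np.zeros(X.shape[1]), 0.0
    recent = deque(maxlen=5000)        # losses on the previous 5000 examples
    prev_avg = np.inf
    for t in range(1, max_iters + 1):
        i = rng.integers(N)
        residual = y[i] - (X[i] @ w - b)
        recent.append(residual ** 2)
        w = w - eta * (-2.0 * residual * X[i])
        b = b - eta * ( 2.0 * residual)
        if t % 1000 == 0:              # check every 1000 iterations, not every one
            avg = np.mean(recent)
            if prev_avg - avg < tol:   # progress sufficiently small -> stop
                break
            prev_avg = avg
    return w, b
```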

84 Recap: Stochastic Gradient Descent. Conceptually: decompose the loss function additively, choose a component randomly, take a gradient update. Benefits: avoids iterating over the entire dataset for every update, and the gradient update is consistent (in expectation). Industry standard.

85 Perceptron Revisited (What is the Objective Function?). Model: f(x | w) = sign(w^T x − b). Training Set: S = {(x_i, y_i)}_{i=1}^N, y ∈ {+1, −1}. Initialize w_1 = 0, b_1 = 0. For t = 1, 2, …: receive example (x, y); if f(x | w_t) = y, then [w_{t+1}, b_{t+1}] = [w_t, b_t]; else w_{t+1} = w_t + y x and b_{t+1} = b_t − y. Go through the training set in arbitrary order (e.g., randomly).

86 Perceptron (Implicit) Objective: L_i(w, b) = max{0, −y_i f(x_i | w, b)}. [Plot: L_i as a function of y f(x), zero when y f(x) ≥ 0 and increasing linearly as y f(x) goes negative.]
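This objective explains the update rule: for a misclassified example, the subgradient of L_i is −y_i x_i with respect to w and +y_i with respect to b (since ∂f/∂b = −1), so an SGD step with η = 1 is exactly the perceptron update. A sketch:

```python
import numpy as np

def perceptron_subgrad_step(w, b, x, y):
    """One SGD step (eta = 1) on L_i(w, b) = max(0, -y * (w^T x - b))."""
    if -y * (x @ w - b) > 0:    # hinge active: example misclassified
        w = w + y * x           # w - 1 * (-y x)
        b = b - y               # b - 1 * (+y)
    return w, b                 # otherwise subgradient is 0: no change
```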

87 Recap: Complete Pipeline. Training Data: S = {(x_i, y_i)}_{i=1}^N → Model Class(es): f(x | w, b) = w^T x − b → Loss Function: L(a, b) = (a − b)^2 → argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) — use SGD! → Cross Validation & Model Selection → Profit!

88 Next Week. Different loss functions: hinge loss (SVM) and log loss (logistic regression). Non-linear model classes: neural nets. Regularization. Recitation on Python programming tonight!
