Gradient Ascent Chris Piech CS109, Stanford University


1 Gradient Ascent Chris Piech CS109, Stanford University

2 Our Path: Deep Learning, Linear Regression, Naïve Bayes, Logistic Regression, Parameter Estimation

3 Our Path: Deep Learning, Linear Regression, Naïve Bayes, Logistic Regression; Parameter Estimation: unbiased estimators, maximizing likelihood, Bayesian estimation

4 Review

5

6 Parameter Learning. Consider n I.I.D. random variables X_1, X_2, ..., X_n, where each X_i is a sample from the density function f(x_i | θ). What is the best choice of the parameters θ?

7 Likelihood (of data given parameters): L(θ) = ∏_{i=1}^{n} f(X_i | θ)

8 Maximum Likelihood Estimation
L(θ) = ∏_{i=1}^{n} f(X_i | θ)
LL(θ) = Σ_{i=1}^{n} log f(X_i | θ)
θ̂ = argmax_θ LL(θ)

9 Argmax?

10 Option #1: Straight optimization

11 Computing the MLE
General approach for finding the MLE of θ:
1. Determine a formula for LL(θ)
2. Differentiate LL(θ) w.r.t. (each) θ: ∂LL(θ)/∂θ
3. To maximize, set ∂LL(θ)/∂θ = 0
4. Solve the resulting (simultaneous) equations to get θ_MLE
o Make sure the derived θ̂_MLE is actually a maximum (and not a minimum or saddle point), e.g., check LL(θ_MLE ± ε) < LL(θ_MLE). This step is often ignored in expository derivations, so we'll ignore it here too (and won't require it in this class).
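For models where this derivative has a closed form, the recipe can even be carried out symbolically. Below is a minimal sketch, assuming Python with the sympy package (a tooling choice the slides do not specify); it previews the Bernoulli derivation worked through on the following slides, with Y standing for the number of successes in n trials.

    import sympy as sp

    # Bernoulli log-likelihood, with Y = number of successes out of n trials:
    # LL(p) = Y*log(p) + (n - Y)*log(1 - p)
    p = sp.symbols('p', positive=True)
    n, Y = sp.symbols('n Y', positive=True)
    LL = Y * sp.log(p) + (n - Y) * sp.log(1 - p)

    # Steps 2-4: differentiate w.r.t. p, set to 0, and solve.
    print(sp.solve(sp.Eq(sp.diff(LL, p), 0), p))  # [Y/n] -- the sample mean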

12 End Review

13 Maximizing Likelihood with Bernoulli
Consider I.I.D. random variables X_1, X_2, ..., X_n, with X_i ~ Ber(p)
Probability mass function, f(x_i | p):

14 Maximizing Likelihood with Bernoulli
Consider I.I.D. random variables X_1, X_2, ..., X_n, with X_i ~ Ber(p)
Probability mass function: f(X_i | p) = p^{x_i} (1 − p)^{1−x_i}, where x_i = 0 or 1
[Figure: bar chart of the Bernoulli PMF for p = 0.2, with mass 1 − p = 0.8 at x = 0 and p = 0.2 at x = 1.]

15 Bernoulli PMF
X ~ Ber(p): f(X = x | p) = p^x (1 − p)^{1−x}

16 Maximizing Likelihood with Bernoulli
Consider I.I.D. random variables X_1, X_2, ..., X_n, with X_i ~ Ber(p). The probability mass function f(X_i | p) can be written as: f(X_i | p) = p^{x_i} (1 − p)^{1−x_i}, where x_i = 0 or 1
Likelihood: L(θ) = ∏_{i=1}^{n} p^{X_i} (1 − p)^{1−X_i}
Log-likelihood: LL(θ) = Σ_{i=1}^{n} [X_i log(p) + (1 − X_i) log(1 − p)] = Y log(p) + (n − Y) log(1 − p), where Y = Σ_{i=1}^{n} X_i
Differentiate w.r.t. p, and set to 0: ∂LL(p)/∂p = Y/p − (n − Y)/(1 − p) = 0 ⇒ p_MLE = Y/n = (1/n) Σ_{i=1}^{n} X_i
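As a quick numeric sanity check, here is a minimal sketch assuming Python with numpy (the coin-flip data is made up for illustration): the closed-form p_MLE should match a brute-force argmax of LL(p) over a grid of candidate values.

    import numpy as np

    x = np.array([1, 0, 0, 1, 1, 1, 0, 1])  # hypothetical Bernoulli samples
    p_mle = x.mean()                         # closed form: (1/n) * sum(x_i)

    # Brute force: evaluate LL(p) = Y*log(p) + (n-Y)*log(1-p) on a grid.
    grid = np.linspace(0.01, 0.99, 999)
    ll = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)
    print(p_mle, grid[np.argmax(ll)])        # both ~0.625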

17 Isn't that the same as the sample mean?

18 Yes. For Bernoulli.

19

20 Maximum Likelihood Algorithm
1. Decide on a model for the distribution of your samples. Define the PMF / PDF for your sample.
2. Write out the log likelihood function.
3. State that the optimal parameters are the argmax of the log likelihood function.
4. Use an optimization algorithm to calculate the argmax.

21

22 Maximizing Likelihood with Poisson
Consider I.I.D. random variables X_1, X_2, ..., X_n, with X_i ~ Poi(λ)
PMF: f(X_i | λ) = e^{−λ} λ^{X_i} / X_i!
Likelihood: L(θ) = ∏_{i=1}^{n} e^{−λ} λ^{X_i} / X_i!
Log-likelihood: LL(θ) = Σ_{i=1}^{n} log(e^{−λ} λ^{X_i} / X_i!) = −nλ + log(λ) Σ_{i=1}^{n} X_i − Σ_{i=1}^{n} log(X_i!)
Differentiate w.r.t. λ, and set to 0: ∂LL(λ)/∂λ = −n + (1/λ) Σ_{i=1}^{n} X_i = 0 ⇒ λ_MLE = (1/n) Σ_{i=1}^{n} X_i
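The same kind of numeric check works here. A minimal sketch assuming Python with numpy and scipy (the count data is illustrative): the sample mean should agree with a numeric maximizer of the Poisson log-likelihood.

    import numpy as np
    from scipy import optimize, stats

    x = np.array([3, 1, 4, 2, 2, 5, 0, 3])   # hypothetical count data
    lam_mle = x.mean()                        # closed form: (1/n) * sum(x_i)

    # Numerically maximize LL(lam) by minimizing its negation.
    neg_ll = lambda lam: -stats.poisson.logpmf(x, lam).sum()
    res = optimize.minimize_scalar(neg_ll, bounds=(1e-6, 20), method='bounded')
    print(lam_mle, res.x)                     # both ~2.5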

23 It is so general!

24 Maximizing Likelihood with Uniform
Consider I.I.D. random variables X_1, X_2, ..., X_n, with X_i ~ Uni(a, b)
PDF: f(X_i | a, b) = 1/(b − a) if a ≤ x_i ≤ b, and 0 otherwise
Likelihood: L(θ) = (1/(b − a))^n if a ≤ x_1, x_2, ..., x_n ≤ b, and 0 otherwise
o The constraint a ≤ x_1, x_2, ..., x_n ≤ b makes differentiation tricky
o Intuition: we want the interval size (b − a) to be as small as possible, to maximize the likelihood function for each data point
o But we need to make sure all the observed data is contained in the interval; if any observed data falls outside it, then L(θ) = 0
Solution: a_MLE = min(x_1, ..., x_n), b_MLE = max(x_1, ..., x_n)
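A small sketch assuming Python with numpy (reusing the observations from the next slide) shows why min/max is the answer: any wider interval lowers the likelihood, and any narrower one zeroes it out.

    import numpy as np

    x = np.array([0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75])
    a_mle, b_mle = x.min(), x.max()

    def likelihood(a, b):
        # (1/(b-a))^n if all points lie in [a, b], else 0
        inside = (x.min() >= a) and (x.max() <= b)
        return (1.0 / (b - a)) ** len(x) if inside else 0.0

    print(likelihood(a_mle, b_mle))  # ~35.7: the maximum
    print(likelihood(0.0, 1.0))      # 1.0: wider than necessary
    print(likelihood(0.2, 0.75))     # 0.0: excludes the point at 0.15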

25 Understanding MLE with Uniform
Consider I.I.D. random variables X_1, X_2, ..., X_n, with X_i ~ Uni(0, 1)
Observe data: 0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75
[Figures: the likelihood L(a, 1) plotted as a function of a, maximized at a = 0.15, and L(0, b) plotted as a function of b, maximized at b = 0.75.]

26 Small Samples = Problems
How do small samples affect MLE?
Estimating Normal: µ_MLE = (1/n) Σ_{i=1}^{n} X_i = sample mean
o Unbiased. Not too shabby
σ²_MLE = (1/n) Σ_{i=1}^{n} (X_i − µ_MLE)²
o Biased. Underestimates for small n (e.g., 0 for n = 1)
As seen with Uniform, a_MLE ≥ a and b_MLE ≤ b
o Biased. Problematic for small n (e.g., a_MLE = b_MLE when n = 1)
Small sample phenomena intuitively make sense:
o Maximum likelihood ⇒ best explain the data we've seen
o Does not attempt to generalize to unseen data
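A quick simulation, sketched here assuming Python with numpy, makes the variance bias concrete: dividing by n underestimates σ² by a factor of (n − 1)/n on average, which is exactly what the unbiased n − 1 denominator corrects.

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 5, 100_000
    samples = rng.normal(size=(trials, n))            # true sigma^2 = 1

    mu_mle = samples.mean(axis=1, keepdims=True)
    var_mle = ((samples - mu_mle) ** 2).mean(axis=1)  # divide by n
    var_unbiased = ((samples - mu_mle) ** 2).sum(axis=1) / (n - 1)

    print(var_mle.mean())       # ~0.8 = (n-1)/n, biased low
    print(var_unbiased.mean())  # ~1.0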

27 Properties of MLE
Maximum Likelihood Estimators are generally:
Consistent: lim_{n→∞} P(|θ̂ − θ| < ε) = 1, for ε > 0
Potentially biased (though asymptotically less so)
Asymptotically optimal
o Has the smallest variance of good estimators for large samples
Often used in practice, where the sample size is large relative to the parameter space
o But be careful: there are some very large parameter spaces
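Consistency is easy to see by simulation. In this sketch (assuming Python with numpy; the true p and the tolerance ε are arbitrary illustrative choices), the probability that the Bernoulli MLE lands within ε of the true p approaches 1 as n grows.

    import numpy as np

    rng = np.random.default_rng(1)
    p_true, eps = 0.3, 0.05

    for n in [10, 100, 1000, 10000]:
        # 10,000 repeated experiments of n Bernoulli(p_true) flips each
        p_hat = rng.binomial(n, p_true, size=10_000) / n
        print(n, (np.abs(p_hat - p_true) < eps).mean())  # -> 1 as n grows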

28 Maximum Likelihood Estimation
L(θ) = ∏_{i=1}^{n} f(X_i | θ)
LL(θ) = Σ_{i=1}^{n} log f(X_i | θ)
θ̂ = argmax_θ LL(θ)

29

30 Argmax 2: Gradient Ascent

31 Argmax Option #1: Straight optimization

32 Argmax Option #2: Gradient Ascent

33 Gradient Ascent. [Figure: plot of p(samples | θ) against θ, with the argmax marked.] Walk uphill and you will find a local maximum (if your step size is small enough)

34 Gradient Ascent. [Figure: plot of p(samples | θ) against θ, with the argmax marked.] Walk uphill and you will find a local maximum (if your step size is small enough)

35 Gradient Ascent. [Figure: surface of p(samples | θ) over parameters θ_1 and θ_2.] Especially good if the function is convex. Walk uphill and you will find a local maximum (if your step size is small enough)

36 Gradient Ascent. Repeat many times: θ_j^new = θ_j^old + η · ∂LL(θ^old)/∂θ_j^old. This is some profound life philosophy: walk uphill and you will find a local maximum (if your step size is small enough)

37 Gradient ascent is your bread-and-butter algorithm for optimization (e.g., argmax)

38 Gradient Ascent
Initialize: θ_j = 0 for all 0 ≤ j ≤ m
Calculate all ∂LL(θ)/∂θ_j

39 Gradient Ascent
Initialize: θ_j = 0 for all 0 ≤ j ≤ m
Repeat many times:
    gradient[j] = 0 for all 0 ≤ j ≤ m
    Calculate all gradient[j]'s based on data
    θ_j += η · gradient[j] for all 0 ≤ j ≤ m
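Here is a minimal runnable version of that loop, sketched in Python with numpy; the gradient function, step size η (eta), and iteration count below are illustrative choices, not something the slides fix.

    import numpy as np

    def gradient_ascent(grad_ll, m, eta=0.01, iters=1000):
        """Generic gradient ascent: climb the log-likelihood surface."""
        theta = np.zeros(m)            # Initialize: theta_j = 0 for all j
        for _ in range(iters):         # Repeat many times
            gradient = grad_ll(theta)  # Calculate all gradient[j]'s from data
            theta += eta * gradient    # theta_j += eta * gradient[j]
        return theta

    # Example: maximize LL(theta) = -(theta_0 - 3)^2 - (theta_1 + 1)^2,
    # whose gradient is [-2(theta_0 - 3), -2(theta_1 + 1)].
    grad = lambda t: np.array([-2 * (t[0] - 3), -2 * (t[1] + 1)])
    print(gradient_ascent(grad, m=2))  # ~[3, -1]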

40 Review: Maximum Likelihood Algorithm
1. Decide on a model for the likelihood of your samples. This is often done using a PMF or PDF.
2. Write out the log likelihood function.
3. State that the optimal parameters are the argmax of the log likelihood function.
4. Use an optimization algorithm to calculate the argmax.

41 Review: Maximum Likelihood Algorithm
1. Decide on a model for the likelihood of your samples. This is often done using a PMF or PDF.
2. Write out the log likelihood function.
3. State that the optimal parameters are the argmax of the log likelihood function.
4. Calculate the derivative of LL with respect to theta.
5. Use an optimization algorithm to calculate the argmax.

42 Linear Regression Lite

43 Predicting Warriors
X_1 = Opposing team ELO
X_2 = Points in last game
X_3 = Curry playing?
X_4 = Playing at home?
Y = Warriors points

44 Predicting CO_2 (simple)
X = CO_2 level, Y = Average Global Temperature
N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))
Linear Regression Lite Model: Y = θX + Z, where Z ~ N(0, σ²)
Equivalently: Y | X ~ N(θX, σ²)

45 1) Write Likelihood Fn
N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))
Model: Y | X ~ N(θX, σ²)
First, calculate the likelihood of the data: L(θ) = ∏_{i=1}^{n} f(y^(i) | x^(i), θ)
Shorthand for the joint density f(y^(1), ..., y^(n) | x^(1), ..., x^(n), θ), which factors into the product because the datapoints are I.I.D.

46 1) Write Likelihood Fn
N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))
Model: Y | X ~ N(θX, σ²)
First, calculate the likelihood of the data. Plugging in the normal PDF:
L(θ) = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(y^(i) − θx^(i))² / (2σ²))

47 2) Write Log Likelihood Fn
N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))
Likelihood function: L(θ) = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(y^(i) − θx^(i))² / (2σ²))
Second, calculate the log likelihood of the data:
LL(θ) = Σ_{i=1}^{n} [−log √(2πσ²) − (y^(i) − θx^(i))² / (2σ²)]

48 3) State MLE as Optimization
N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))
Log likelihood: LL(θ) = Σ_{i=1}^{n} [−log √(2πσ²) − (y^(i) − θx^(i))² / (2σ²)]
The first term and the 1/(2σ²) factor do not depend on θ, so:
θ̂_MLE = argmax_θ LL(θ) = argmax_θ −Σ_{i=1}^{n} (y^(i) − θx^(i))²
Third, celebrate!

49 4) Find derivative
N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))
Goal: ∂LL(θ)/∂θ. Up to constants, ∂/∂θ [−Σ_{i=1}^{n} (y^(i) − θx^(i))²] = Σ_{i=1}^{n} 2(y^(i) − θx^(i)) x^(i)
Fourth, optimize!

50 5) Run optimization code
N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))

51 Gradient Ascent
Initialize: θ_j = 0 for all 0 ≤ j ≤ m
Repeat many times:
    gradient[j] = 0 for all 0 ≤ j ≤ m
    Calculate all gradient[j]'s based on data and the current setting of theta
    θ_j += η · gradient[j] for all 0 ≤ j ≤ m

52 Gradient Ascent for Linear Regression (simple)
Initialize: θ = 0
Repeat many times:
    gradient = 0
    Calculate gradient based on data
    θ += η · gradient

53 Linear Regression (simple)
Initialize: θ = 0
Repeat many times:
    gradient = 0
    For each training example (x, y):
        Update gradient for current training example
    θ += η · gradient

54 Linear Regression (simple)
Initialize: θ = 0
Repeat many times:
    gradient = 0
    For each training example (x, y):
        gradient += 2(y − θx)(x)
    θ += η · gradient
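A runnable version of that pseudocode, sketched in Python with numpy; the synthetic dataset, true θ, step size, and iteration count are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    theta_true = 1.7
    x = rng.uniform(0, 1, size=50)
    y = theta_true * x + rng.normal(0, 0.1, size=50)  # Y = theta*X + Z

    theta, eta = 0.0, 0.01              # Initialize theta = 0; step size
    for _ in range(2000):               # Repeat many times
        gradient = 0.0
        for xi, yi in zip(x, y):        # For each training example (x, y)
            gradient += 2 * (yi - theta * xi) * xi
        theta += eta * gradient
    print(theta)                        # ~1.7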

55 Linear Regression

56 Predicting CO_2
X_1 = Temperature
X_2 = Elevation
X_3 = CO_2 level yesterday
X_4 = GDP of region
X_5 = Acres of forest growth
Y = CO_2 levels

57 Linear Regression
Problem: Predict a real value Y based on observing a variable X
Model: Linearly weight every feature: Ŷ = θ_1 X_1 + ... + θ_m X_m + θ_{m+1} = θ^T X (with a constant feature X_{m+1} = 1 for the intercept)
Training: Gradient ascent to choose the best thetas to describe your data:
θ̂_MLE = argmax_θ −Σ_{i=1}^{n} (Y^(i) − θ^T x^(i))²

58 Linear Regression
Initialize: θ_j = 0 for all 0 ≤ j ≤ m
Repeat many times:
    gradient[j] = 0 for all 0 ≤ j ≤ m
    For each training example (x, y):
        For each parameter j:
            gradient[j] += (y − θ^T x)(x[j])
    θ_j += η · gradient[j] for all 0 ≤ j ≤ m
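The multivariate loop, sketched runnable in Python with numpy; the synthetic features, true parameters (echoing the magnitudes on the next slide), step size, and iteration count are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    n, m = 200, 3
    X = np.hstack([rng.normal(size=(n, m)), np.ones((n, 1))])  # last column: constant feature
    theta_true = np.array([-2.3, 1.2, 10.2, 3.3])
    y = X @ theta_true + rng.normal(0, 0.5, size=n)

    theta, eta = np.zeros(m + 1), 0.001     # Initialize theta_j = 0; step size
    for _ in range(5000):                   # Repeat many times
        gradient = np.zeros(m + 1)
        for xi, yi in zip(X, y):            # For each training example (x, y)
            gradient += (yi - theta @ xi) * xi  # ascent direction for -SSE
        theta += eta * gradient
    print(theta)                            # ~[-2.3, 1.2, 10.2, 3.3]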

59 Predicting Warriors
Y = Warriors points
Ŷ = θ_1 X_1 + ... + θ_m X_m + θ_{m+1} = θ^T X
X_1 = Opposing team ELO: θ_1 = −2.3
X_2 = Points in last game: θ_2 = +1.2
X_3 = Curry playing?: θ_3 = +10.2
X_4 = Playing at home?: θ_4 = +3.3
Intercept: θ_5 = +95.4

60 The Machine Learning Process
[Figure: training data → learning algorithm → g(x) (classifier) → output (class).]
Training data: a set of N pre-classified data instances
o N training pairs: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))
o Superscripts denote the i-th training instance
Learning algorithm: a method for determining g(x)
o Given a new input observation x = x_1, x_2, ..., x_m
o Use g(x) to compute a corresponding output (prediction)
