GRADIENT DESCENT. CSE 559A: Computer Vision



CSE 559A: Computer Vision
Fall 2017: T-R, Lopata 101
Instructor: Ayan Chakrabarti (ayan@wustl.edu). Staff: Abby Stylianou (abby@wustl.edu), Jarett Gross (jarett@wustl.edu)
Nov 9, 2017

For Binary Classification:

$$f(x; w) = \sigma(w^T \tilde{x}) = \frac{\exp(w^T \tilde{x})}{1 + \exp(w^T \tilde{x})} \in [0, 1]$$

The output is interpreted as the probability $\Pr(y = 1)$.
$w^T \tilde{x}$ are the log-odds (here the parameters $\theta$ are the weights $w$):

$$f(x; \theta) = \frac{1}{1 + \exp(-w^T \tilde{x})}, \qquad 1 - f(x; \theta) = \frac{1}{1 + \exp(w^T \tilde{x})}$$

$$w^T \tilde{x} = \log \frac{f(x; \theta)}{1 - f(x; \theta)} = \log \frac{\Pr(y = 1)}{1 - \Pr(y = 1)} = \log \frac{\Pr(y = 1)}{\Pr(y = 0)}$$

To classify, $y = 1$ if $f(x; w) > 0.5$ and 0 otherwise.
Equivalently, $y = 1$ if $w^T \tilde{x} > 0$ and 0 otherwise.

$\tilde{x}$ is some augmented variable of $x$.
"Linear Classifier" if $\tilde{x} = [x; 1]$.
Could be polynomial: $\tilde{x} = [1, x, x^2, x^3]$.
Or other arbitrary non-linear functions of $x$.

$\tilde{x}$ is often called the "feature" vector: some encoding of the input.
$x$ could even be non-numeric, and you could define some numeric features.
"Feature engineering" or "Feature selection" was / is often useful.
E.g., "census transform".
Feature vectors based on histograms of gradient orientations (HoG/SIFT).

Note: The classifier is linear in the chosen encoding. $w^T \tilde{x} = 0$ defines a "separating hyperplane" between the positive and negative parts of the space of $\tilde{x}$.
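The classification rule above can be sketched in a few lines of plain Python. This is only an illustration: the names `sigmoid`, `augment`, and `predict`, the cubic encoding of a scalar input, and the weight values are assumptions made for the example, not part of the slides.

```python
import math

def sigmoid(p):
    """Logistic function exp(p) / (1 + exp(p)), written to avoid overflow."""
    if p >= 0:
        return 1.0 / (1.0 + math.exp(-p))
    e = math.exp(p)
    return e / (1.0 + e)

def augment(x):
    """Hypothetical polynomial encoding of a scalar input: [1, x, x^2, x^3]."""
    return [1.0, x, x ** 2, x ** 3]

def predict(w, x):
    """Return (Pr(y = 1), hard label) for a scalar input x."""
    log_odds = sum(wj * xj for wj, xj in zip(w, augment(x)))  # w^T x~
    prob = sigmoid(log_odds)
    return prob, (1 if prob > 0.5 else 0)

# Illustrative weights (not learned): the log-odds are simply x.
w = [0.0, 1.0, 0.0, 0.0]
print(predict(w, 2.0))
print(predict(w, -2.0))
```

Thresholding the probability at 0.5 and thresholding the log-odds at 0 give the same label, which is why only the sign of $w^T \tilde{x}$ matters at test time.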

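$$f(x; w) = \sigma(w^T \tilde{x}) = \frac{\exp(w^T \tilde{x})}{1 + \exp(w^T \tilde{x})}$$

Cross-entropy / Negative Likelihood Loss

$$L(y, f(x; w)) = -y \log f(x; w) - (1 - y) \log\left(1 - f(x; w)\right)$$

Substituting $f(x; \theta) = \frac{1}{1 + \exp(-w^T \tilde{x})}$ and $1 - f(x; \theta) = \frac{1}{1 + \exp(w^T \tilde{x})}$:

$$L(y, f(x; w)) = y \log\left[1 + \exp(-w^T \tilde{x})\right] + (1 - y) \log\left[1 + \exp(w^T \tilde{x})\right]$$

Putting it all together, given a training set of $\{(x_t, y_t)\}_{t=1}^{T}$:

$$w = \arg\min_w \frac{1}{T} \sum_{t=1}^{T} y_t \log\left[1 + \exp(-w^T \tilde{x}_t)\right] + (1 - y_t) \log\left[1 + \exp(w^T \tilde{x}_t)\right]$$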
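As a rough sketch of how the loss could be evaluated, the snippet below computes the second form of the cross-entropy directly from the log-odds $p = w^T \tilde{x}$. The helper names `softplus` and `cross_entropy` are illustrative assumptions, and the overflow-safe rewrite of $\log(1 + \exp(z))$ is a standard numerical trick rather than something the slides prescribe.

```python
import math

def softplus(z):
    """log(1 + exp(z)), rewritten so that large |z| does not overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def cross_entropy(y, p):
    """Loss for a label y in {0, 1} given the log-odds p = w^T x~."""
    return y * softplus(-p) + (1 - y) * softplus(p)

# A confident correct prediction costs little, a confident wrong one a lot.
print(cross_entropy(1, 4.0))
print(cross_entropy(0, 4.0))
```

At $p = 0$ the predicted probability is 0.5 and the loss is $\log 2$ for either label.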

$$w = \arg\min_w \frac{1}{T} \sum_{t=1}^{T} y_t \log\left[1 + \exp(-w^T \tilde{x}_t)\right] + (1 - y_t) \log\left[1 + \exp(w^T \tilde{x}_t)\right]$$

You can show that this loss is a convex function of $w$ (compute the Hessian matrix and show that its eigenvalues are non-negative).
So it has a single global minimum. But how do we find it?

More General Form

$$w = \arg\min_w C(w), \qquad C(w) = \frac{1}{T} \sum_t C_t(w)$$

Minimize a cost that is a function of the parameters, and a sum of cost functions, each coming from a different training sample.

Gradient Descent

Begin with an initial guess $w_0$.
At each iteration $i$:

$$w_{i+1} \leftarrow w_i - \gamma \nabla_w C(w_i)$$

At each iteration, we update the parameters by "moving", in $w$-space, in the opposite direction of the gradient (at that point $w_i$).
We also move by a length proportional to the magnitude of the gradient.
$\gamma$ is the step size. When running optimization for training, it is often called the "learning rate".
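The update rule above can be sketched generically, with the gradient supplied as a function. The quadratic cost used here is a hypothetical stand-in chosen because its minimum is known; the function names, step size, and iteration count are likewise assumptions for illustration.

```python
def gradient_descent(grad_fn, w0, gamma, num_iters):
    """Repeat w <- w - gamma * grad_fn(w), starting from the guess w0."""
    w = list(w0)
    for _ in range(num_iters):
        g = grad_fn(w)
        w = [wj - gamma * gj for wj, gj in zip(w, g)]
    return w

def quadratic_grad(w):
    """Gradient of the toy cost (w0 - 3)^2 + 2 * (w1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]

# The toy cost is convex with its minimum at (3, -1).
w_star = gradient_descent(quadratic_grad, [0.0, 0.0], gamma=0.1, num_iters=200)
print(w_star)
```

With a fixed step size, convergence depends on $\gamma$ being small enough for the curvature of the cost. For this toy problem a much larger step (for example 0.6) would make the second coordinate diverge.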

If you select the optimal step size by doing a "line search" for $\gamma$, you can prove that gradient descent will converge.
If the function is convex, it converges to the unique global minimum.

Second order variants that consider the Hessian matrix:
Newton & Quasi-Newton Methods
Gauss-Newton, Levenberg-Marquardt, ...

But simple gradient descent suffices / is our only choice when:
The function isn't convex.
We can't afford to do a line search.
There are so many parameters that we can't compute the Hessian.
Also, there are no theoretical guarantees. Theory is still catching up.
Meanwhile, we'll try to understand the "behavior" of the gradients.

What is $\nabla_w C_t(w)$, the gradient of the loss from a single training example?

If $C(w) = \frac{1}{T} \sum_t C_t(w)$, then $\nabla_w C(w) = \frac{1}{T} \sum_t \nabla_w C_t(w)$.

$$C_t(w) = y_t \log\left[1 + \exp(-w^T \tilde{x}_t)\right] + (1 - y_t) \log\left[1 + \exp(w^T \tilde{x}_t)\right]$$

OK, what is the derivative of $C_t$ with respect to $p = w^T \tilde{x}_t$ (here $p$ is a scalar)?

$$C_t(p) = y_t \log\left[1 + \exp(-p)\right] + (1 - y_t) \log\left[1 + \exp(p)\right]$$

$$\frac{\partial}{\partial p} C_t(p) = -y_t \frac{\exp(-p)}{1 + \exp(-p)} + (1 - y_t) \frac{\exp(p)}{1 + \exp(p)}$$

$$= \frac{\exp(p)}{1 + \exp(p)} - y_t \left[ \frac{\exp(-p)}{1 + \exp(-p)} + \frac{\exp(p)}{1 + \exp(p)} \right]$$

$$= \frac{\exp(p)}{1 + \exp(p)} - y_t \left[ \frac{1}{1 + \exp(p)} + \frac{\exp(p)}{1 + \exp(p)} \right] = \frac{\exp(p)}{1 + \exp(p)} - y_t$$
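One way to sanity-check the derivative with respect to $p$ is to compare the closed form $\frac{\exp(p)}{1 + \exp(p)} - y_t$ against a central finite difference. The sketch below does this; the function names and the step `h` are assumptions for the example.

```python
import math

def cost(y, p):
    """Per-sample loss C_t as a function of the scalar log-odds p."""
    return y * math.log(1.0 + math.exp(-p)) + (1 - y) * math.log(1.0 + math.exp(p))

def dcost_dp(y, p):
    """Closed-form derivative: exp(p) / (1 + exp(p)) - y."""
    return math.exp(p) / (1.0 + math.exp(p)) - y

def numeric_derivative(y, p, h=1e-6):
    """Central finite-difference approximation of dC_t/dp."""
    return (cost(y, p + h) - cost(y, p - h)) / (2.0 * h)

for y in (0, 1):
    for p in (-2.0, 0.0, 3.0):
        print(y, p, dcost_dp(y, p), numeric_derivative(y, p))
```

The two columns should agree to several decimal places for moderate values of $p$. For very large $|p|$ the naive `cost` would overflow, so this check is only meant for small inputs.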

$$C_t(w) = y_t \log\left[1 + \exp(-w^T \tilde{x}_t)\right] + (1 - y_t) \log\left[1 + \exp(w^T \tilde{x}_t)\right]$$

$$C_t(p) = y_t \log\left[1 + \exp(-p)\right] + (1 - y_t) \log\left[1 + \exp(p)\right], \qquad \frac{\partial}{\partial p} C_t(p) = \frac{\exp(p)}{1 + \exp(p)} - y_t$$

Observations

$\frac{\exp(p)}{1 + \exp(p)}$ is basically the output $f(x_t; w)$, the predicted probability that $y_t = 1$.
Remember: this is the expression for the gradient with respect to $p$, i.e. the logit / log-odds.
The gradient is 0 if $y_t = 0$ and the probability is 0, or $y_t = 1$ and the probability is 1. Do nothing if predicting the right answer with perfect confidence.
If we say probability $> 0$, and $y_t = 0$: the gradient is positive.
If we say probability $< 1$, and $y_t = 1$: the gradient is negative.
Remember that we move in the opposite direction of the gradient.
Also, changing $p$ makes a much bigger difference in the corresponding probability when $p$ is near 0 / the probability is near 0.5.

But this is still the derivative with respect to $p$. We want the gradient with respect to $w$. By the chain rule, with $p(w) = w^T \tilde{x}_t$:

$$\nabla_w C_t(w) = \frac{\partial C_t(p)}{\partial p} \nabla_w p(w), \qquad \nabla_w (w^T \tilde{x}_t) = \tilde{x}_t$$

$$\nabla_w C_t(w) = \tilde{x}_t \left[ \frac{\exp(w^T \tilde{x}_t)}{1 + \exp(w^T \tilde{x}_t)} - y_t \right]$$

Putting it together:
At each iteration $i$,
Based on the current $w_i$, compute $f(x_t; w_i) = \hat{y}_t$.
Compute the derivative of the "output" as $\hat{y}_t - y_t$.
Multiply by $\tilde{x}_t$ to get $\nabla_w C_t(w_i)$.
Change $w_i$ by subtracting some $\gamma$ times this gradient.
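The four steps above can be sketched for a single training example as follows. The function names, the example feature vector, and the step size are hypothetical choices for illustration only.

```python
import math

def sigmoid(p):
    """Logistic function, written to avoid overflow for large |p|."""
    if p >= 0:
        return 1.0 / (1.0 + math.exp(-p))
    e = math.exp(p)
    return e / (1.0 + e)

def sample_gradient(w, x_aug, y):
    """Gradient of C_t with respect to w: (y_hat - y) * x~."""
    p = sum(wj * xj for wj, xj in zip(w, x_aug))   # log-odds w^T x~
    y_hat = sigmoid(p)                             # predicted Pr(y = 1)
    return [(y_hat - y) * xj for xj in x_aug]

def update(w, grad, gamma):
    """One gradient step: w <- w - gamma * grad."""
    return [wj - gamma * gj for wj, gj in zip(w, grad)]

# With w = 0 the prediction is 0.5, so for y = 1 the output derivative is -0.5.
w = [0.0, 0.0]
x_aug = [2.0, 1.0]   # hypothetical augmented input [x, 1]
grad = sample_gradient(w, x_aug, 1)
w_new = update(w, grad, gamma=0.1)
print(grad, w_new)
```

After the step, the log-odds for this example move from 0 to a positive value, so the predicted probability of the correct label rises above 0.5.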

Putting it together:

$$w = \arg\min_w \frac{1}{T} \sum_{t=1}^{T} y_t \log\left[1 + \exp(-w^T \tilde{x}_t)\right] + (1 - y_t) \log\left[1 + \exp(w^T \tilde{x}_t)\right]$$

At each iteration $i$,
Based on the current $w_i$, compute $f(x_t; w_i) = \hat{y}_t$ for every training sample.
Compute the derivative of the "output" as $\hat{y}_t - y_t$ for every training sample.
Multiply by $\tilde{x}_t$ and average across all training samples to get $\nabla_w C(w_i)$.
Change $w_i$ by subtracting some $\gamma$ times this gradient.

Expensive when we have a LOT of training data.

STOCHASTIC GRADIENT DESCENT

$$w = \arg\min_w \frac{1}{T} \sum_t C(x_t, y_t; w), \qquad \nabla_w = \frac{1}{T} \sum_t \nabla_w C(x_t, y_t; w)$$

Remember, the summation over training samples is meant to approximate an expectation over $P_{XY}(x, y)$:

$$\frac{1}{T} \sum_t C(x_t, y_t; w) \approx \mathbb{E}_{(x,y) \sim P_{XY}}\, C(x, y; w)$$

In other words, we are approximating the "true" gradient with gradients over samples:

$$\frac{1}{T} \sum_t \nabla_w C(x_t, y_t; w) \approx \mathbb{E}_{(x,y) \sim P_{XY}}\, \nabla_w C(x, y; w)$$

What if we used a smaller number of samples in each iteration, but different samples in different iterations?

Single sample:

$$w_{i+1} \leftarrow w_i - \gamma \nabla_w C_t(x_t, y_t; w_i)$$

At each iteration, choose a random $t \in \{1, 2, \ldots, T\}$.

"Mini"-batched SGD (sometimes GD is called Batched GD):

$$w_{i+1} \leftarrow w_i - \gamma \frac{1}{B} \sum_{t \in \mathcal{B}} \nabla_w C_t(x_t, y_t; w_i)$$

At each iteration, choose a random smaller batch $\mathcal{B}$ of size $B \ll T$.
With replacement? Without replacement?

In practice:
Shuffle the order of the training examples.
Choose a batch size.
Take consecutive groups of $B$ samples as you loop through iterations: $[1, B]$ in iteration 1, $[B+1, 2B]$ in iteration 2, ...
Once you reach the end of the training set (called one "epoch"), shuffle the order again.
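The "in practice" recipe (shuffle, take consecutive groups of $B$ samples, reshuffle after each epoch) might look roughly like the sketch below, applied to the logistic loss. The toy dataset, the function names, the seed, and the hyperparameter values are all assumptions made for the example.

```python
import math
import random

def sigmoid(p):
    if p >= 0:
        return 1.0 / (1.0 + math.exp(-p))
    e = math.exp(p)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def average_loss(w, xs, ys):
    """Mean cross-entropy over the whole training set."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = dot(w, x)
        total += y * math.log1p(math.exp(-p)) + (1 - y) * math.log1p(math.exp(p))
    return total / len(xs)

def minibatch_sgd(xs, ys, batch_size, gamma, num_epochs, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    order = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(order)                        # new order every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad = [0.0] * len(w)
            for t in batch:                       # average gradient over the batch
                err = sigmoid(dot(w, xs[t])) - ys[t]
                for j, xj in enumerate(xs[t]):
                    grad[j] += err * xj / len(batch)
            w = [wj - gamma * gj for wj, gj in zip(w, grad)]
    return w

# Toy 1-D problem with augmented inputs [x, 1]; labels follow the sign of x.
xs = [[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [0, 0, 1, 1]
w = minibatch_sgd(xs, ys, batch_size=2, gamma=0.5, num_epochs=50)
print(average_loss([0.0, 0.0], xs, ys), average_loss(w, xs, ys))
```

Shuffling once per epoch samples without replacement within an epoch, so every training example is visited exactly once before any is repeated.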

STOCHASTIC GRADIENT DESCENT

General Notes

$$w_{i+1} \leftarrow w_i - \gamma \frac{1}{B} \sum_{t \in \mathcal{B}} \nabla_w C_t(x_t, y_t; w_i)$$

The gradient over a mini-batch is an "approximation", or a "noisy" version, of the gradient over the true training set:

$$\frac{1}{B} \sum_{t \in \mathcal{B}} \nabla_w C_t(x_t, y_t; w_i) = \frac{1}{T} \sum_{t=1}^{T} \nabla_w C_t(x_t, y_t; w_i) + \epsilon$$

Typically, if you decrease the batch size, you will want to decrease your step size (because you are "less sure" about the gradient).

Say your cost function is convex, and you care only about decreasing this cost (not worried about overfitting).
A larger batch size will always give you "better" gradients.
But there are diminishing returns after a certain batch size.
Computational cost is (number of examples per iteration) $\times$ (number of iterations for convergence).
A higher batch size means more computation per iteration, but may mean fewer iterations are required to converge.
The best combination of step size and batch size is an empirical question.

Another factor: parallelism.
Note that you can compute the gradients of all samples in your batch in parallel.
Ideally, you want to at least "saturate" all available parallel threads.

If your cost function is NOT convex, and/or you are worried about overfitting:
Noise in your gradients might be a good thing!
It might help you escape local minima.
It might prevent you from overfitting to the train set.
Try different batch sizes, and check performance on the dev set, not just the train set.

Momentum

Standard SGD:

$$g_{i+1} = \frac{1}{B} \sum_{t \in \mathcal{B}} \nabla_w C_t(x_t, y_t; w_i), \qquad w_{i+1} \leftarrow w_i - \gamma g_{i+1}$$

With Momentum, for $\beta < 1$:

$$g_{i+1} = \frac{1}{B} \sum_{t \in \mathcal{B}} \nabla_w C_t(x_t, y_t; w_i) + \beta g_i, \qquad w_{i+1} \leftarrow w_i - \gamma g_{i+1}$$

Keep adding the gradient from a previous batch, again and again across iterations, with decaying weight.
Remember: $g_i$ was computed with respect to a different position in $w$ space.
People often use $\beta$ as high as 0.9.
You will need to revisit the "best" value of $\gamma$ when you change $\beta$.
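A minimal sketch of the momentum update follows, assuming the mini-batch gradient is provided by a function `grad_fn`. Here a deterministic toy gradient stands in for the noisy mini-batch gradient, and all names and values are hypothetical.

```python
def sgd_momentum(grad_fn, w0, gamma, beta, num_iters):
    """g <- grad + beta * g, then w <- w - gamma * g, as in the momentum rule."""
    w = list(w0)
    g = [0.0] * len(w)                    # running gradient, starts at zero
    for _ in range(num_iters):
        batch_grad = grad_fn(w)           # stand-in for the mini-batch gradient
        g = [bg + beta * gj for bg, gj in zip(batch_grad, g)]
        w = [wj - gamma * gj for wj, gj in zip(w, g)]
    return w

def toy_grad(w):
    """Gradient of the toy cost 0.5 * ||w||^2, whose minimum is at w = 0."""
    return list(w)

plain = sgd_momentum(toy_grad, [5.0], gamma=0.1, beta=0.0, num_iters=2)
heavy = sgd_momentum(toy_grad, [5.0], gamma=0.1, beta=0.9, num_iters=2)
print(plain, heavy)
```

With `beta=0.0` this reduces to the standard update. With `beta=0.9` the second step is larger because the first gradient is added in again, which illustrates why the best $\gamma$ usually has to be re-tuned when $\beta$ changes.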

LOGISTIC REGRESSION

$$w = \arg\min_w \frac{1}{T} \sum_t C_t(w), \qquad C_t(w) = y_t \log\left[1 + \exp(-w^T \tilde{x}_t)\right] + (1 - y_t) \log\left[1 + \exp(w^T \tilde{x}_t)\right]$$

Defined a linear classifier on the augmented vector $\tilde{x}$.
Used gradient descent to learn $w$.
Looked at the behavior of the gradients.
Simplified computation with stochasticity.
At test time, the sign of $w^T \tilde{x}$ gives us our label.

The problem is:
The definition of the augmented vector $\tilde{x}$ is hand-crafted.
We have manually engineered our features.
The only thing we're learning is a linear classifier on top.
We want to learn the features themselves!
Given that SGD works, what's stopping us from learning a $g(x) = \tilde{x}$?
