Convergence rate of SGD

Theorem (see Nemirovski et al. 2009 from the readings): Let f be a strongly convex stochastic function, and assume its gradient is Lipschitz continuous and bounded. Then, for step sizes $\eta_t \propto 1/t$, the expected loss decreases as $O(1/t)$:
$$ \mathbb{E}\big[\ell(w^{(t)})\big] - \ell(w^*) = O(1/t). $$

Convergence rates for gradient descent/ascent versus SGD

Number of iterations to reach accuracy $\ell(w^*) - \ell(w) \le \epsilon$:
Gradient descent: if the function is strongly convex, $O(\ln(1/\epsilon))$ iterations.
Stochastic gradient descent: if the function is strongly convex, $O(1/\epsilon)$ iterations.
SGD seems exponentially worse, but the comparison is much more subtle. Consider total running time, e.g., for logistic regression with $n$ examples and $d$ features: each gradient descent iteration costs $O(nd)$, while each SGD iteration costs only $O(d)$, so the totals are roughly $O(nd \ln(1/\epsilon))$ versus $O(d/\epsilon)$. SGD can win when we have a lot of data. And when analyzing the true error, the situation is even more subtle: the expected running times are about the same (see readings).
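A minimal sketch (my own illustration, not from the slides) contrasting the per-iteration cost of batch gradient descent and SGD on L2-regularized logistic regression; the data, regularization strength, and step sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lambda_reg = 1e-3  # illustrative regularization strength

def full_gradient(w):
    # O(nd) per iteration: touches every example.
    return X.T @ (sigmoid(X @ w) - y) / n + lambda_reg * w

def stochastic_gradient(w, i):
    # O(d) per iteration: touches a single example.
    return X[i] * (sigmoid(X[i] @ w) - y[i]) + lambda_reg * w

# Batch gradient descent: O(ln(1/eps)) iterations for strongly convex objectives.
w_gd = np.zeros(d)
for t in range(100):
    w_gd -= 0.5 * full_gradient(w_gd)

# SGD with an eta_t ~ 1/t schedule: O(1/eps) iterations, but each one is n times cheaper.
w_sgd = np.zeros(d)
for t in range(1, 10_001):
    i = rng.integers(n)
    w_sgd -= (1.0 / t) * stochastic_gradient(w_sgd, i)
```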
Motivating AdaGrad (Duchi, Hazan & Singer 2011)

Assuming $w \in \mathbb{R}^d$, standard stochastic (sub)gradient descent updates are of the form:
$$ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta\, g_{t,i}. $$
Should all features share the same learning rate?
We often have high-dimensional feature spaces; many features are irrelevant, and rare features are often very informative.
AdaGrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations.

Why adapt to geometry?

[Table from Duchi et al. ISMP 2012 slides: columns $x_{t,1}, x_{t,2}, x_{t,3}, y_t$ contrasting a frequent but irrelevant feature with infrequent but predictive features; a single shared learning rate treats them all alike. See the sketch below.]
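A minimal sketch (my own toy data, not the table from the slides) of why one shared learning rate is a poor fit: a frequent-but-irrelevant feature receives many gradient updates, while a rare-but-predictive feature receives very few.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5_000, 3
eta = 0.1  # one learning rate shared by every coordinate (illustrative value)

w = np.zeros(d)
update_counts = np.zeros(d)
for t in range(T):
    x = np.zeros(d)
    x[0] = rng.choice([-1.0, 1.0])      # feature 0: appears every round, irrelevant
    if rng.random() < 0.01:             # features 1-2: appear in ~1% of rounds, predictive
        x[1] = x[2] = 1.0
    y = 1.0 if x[1] > 0 else -1.0       # label driven by the rare features
    # hinge-loss subgradient for a linear predictor
    g = -y * x if y * (w @ x) < 1 else np.zeros(d)
    w -= eta * g
    update_counts += (g != 0)

print("nonzero-gradient counts per feature:", update_counts)
print("learned weights:", w)
```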
Not All Features are Created Equal

Examples:
Text data: "The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier." (The Atlantic, July/August 2010.) Each word is a feature, and most words appear rarely.
High-dimensional image features.
(Images from Duchi et al. ISMP 2012 slides.)

Projected Gradient

Brief aside: consider an arbitrary feature space $w \in W$. So far our updates have been unconstrained,
$$ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta\, g_{t,i}. $$
If $W$ is a convex subset of $\mathbb{R}^d$, we can instead use projected (sub)gradient descent:
$$ w^{(t+1)} = \arg\min_{w \in W} \big\| w - (w^{(t)} - \eta\, g_t) \big\|_2^2. $$
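A minimal sketch of one projected stochastic (sub)gradient step, under the assumption that $W$ is the L2 ball of radius R, a convex set with a closed-form projection (the constraint set and all numbers below are my own illustration).

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection of v onto {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def projected_sgd_step(w, g, eta, radius):
    """w^{(t+1)} = argmin_{w in W} ||w - (w^{(t)} - eta * g_t)||_2^2."""
    return project_l2_ball(w - eta * g, radius)

# Usage: one step with an arbitrary gradient.
w = np.array([2.0, -1.0, 0.5])
g = np.array([0.3, -0.2, 0.1])
w_next = projected_sgd_step(w, g, eta=0.1, radius=1.0)
print(w_next)
```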
Regret Minimization

How do we assess the performance of an online algorithm? The algorithm iteratively predicts $w^{(t)}$ and incurs loss $f_t(w^{(t)})$.
Regret: the total incurred loss of the algorithm relative to the best fixed choice of $w$ that could have been made retrospectively,
$$ R(T) = \sum_{t=1}^{T} f_t\big(w^{(t)}\big) - \inf_{w \in W} \sum_{t=1}^{T} f_t(w). $$

Regret Bounds for Standard SGD

Standard projected gradient stochastic updates:
$$ w^{(t+1)} = \arg\min_{w \in W} \big\| w - (w^{(t)} - \eta\, g_t) \big\|_2^2. $$
Standard regret bound:
$$ \sum_{t=1}^{T} f_t\big(w^{(t)}\big) - \sum_{t=1}^{T} f_t(w^*) \;\le\; \frac{1}{2\eta} \big\| w^{(1)} - w^* \big\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \| g_t \|_2^2. $$
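A minimal sketch (my own toy setup, not from the slides) of measuring regret: run online gradient descent on scalar losses $f_t(w) = (w - z_t)^2$ and compare the cumulative loss to the best fixed $w$ in hindsight, which here is the mean of the $z_t$.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1_000
z = rng.normal(loc=0.5, scale=1.0, size=T)

w, eta = 0.0, 0.05
online_loss = 0.0
for t in range(T):
    online_loss += (w - z[t]) ** 2   # incur loss f_t(w^{(t)})
    g = 2.0 * (w - z[t])             # gradient of f_t at w^{(t)}
    w -= eta * g

w_best = z.mean()                    # minimizer of sum_t f_t(w) in hindsight
best_fixed_loss = ((w_best - z) ** 2).sum()
regret = online_loss - best_fixed_loss
print(f"regret after {T} rounds: {regret:.2f}")
```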
Projected Gradient using Mahalanobis

Standard projected gradient stochastic updates:
$$ w^{(t+1)} = \arg\min_{w \in W} \big\| w - (w^{(t)} - \eta\, g_t) \big\|_2^2. $$
What if, instead of the L2 metric for the projection, we considered the Mahalanobis norm $\|x\|_A^2 = x^\top A x$ induced by a positive definite matrix $A$?
$$ w^{(t+1)} = \arg\min_{w \in W} \big\| w - (w^{(t)} - \eta\, A^{-1} g_t) \big\|_A^2. $$

Mahalanobis Regret Bounds

What $A$ should we choose? The regret bound becomes:
$$ \sum_{t=1}^{T} f_t\big(w^{(t)}\big) - \sum_{t=1}^{T} f_t(w^*) \;\le\; \frac{1}{2\eta} \big\| w^{(1)} - w^* \big\|_A^2 + \frac{\eta}{2} \sum_{t=1}^{T} \big\langle g_t, A^{-1} g_t \big\rangle. $$
What if we minimize this upper bound on the regret with respect to $A$ in hindsight?
$$ \min_{A} \; \sum_{t=1}^{T} \big\langle g_t, A^{-1} g_t \big\rangle $$
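A minimal sketch of a gradient step in the Mahalanobis metric, assuming $W = \mathbb{R}^d$ so the projection is trivial and the update reduces to $w \leftarrow w - \eta A^{-1} g$; the matrix $A$ and numbers are illustrative.

```python
import numpy as np

def mahalanobis_sgd_step(w, g, A, eta):
    """One gradient step in the metric ||x||_A^2 = x^T A x (no projection, W = R^d)."""
    # Solve A u = g rather than forming A^{-1} explicitly.
    return w - eta * np.linalg.solve(A, g)

# Usage: A stretches the metric along the first coordinate,
# so that coordinate effectively receives a smaller learning rate.
A = np.diag([10.0, 1.0])
w = np.array([1.0, 1.0])
g = np.array([0.5, 0.5])
w_next = mahalanobis_sgd_step(w, g, A, eta=0.1)
print(w_next)
```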
Mahalanobis Regret Minimization

Objective:
$$ \min_{A} \; \sum_{t=1}^{T} \big\langle g_t, A^{-1} g_t \big\rangle \quad \text{subject to } A \succeq 0,\ \mathrm{tr}(A) \le C. $$
Solution:
$$ A = c \left( \sum_{t=1}^{T} g_t g_t^\top \right)^{1/2}. $$
For the proof, see the lemma in Appendix E of Duchi et al. (2011); it uses the trace trick and a Lagrangian.
$A$ defines the norm of the metric space we should be operating in.

AdaGrad Algorithm

$$ w^{(t+1)} = \arg\min_{w \in W} \big\| w - (w^{(t)} - \eta\, A^{-1} g_t) \big\|_A^2 $$
At time $t$, estimate the optimal (sub)gradient modification $A$ by
$$ A_t = \left( \sum_{\tau=1}^{t} g_\tau g_\tau^\top \right)^{1/2}. $$
For large $d$, $A_t$ is computationally intensive to compute, so instead use the diagonal approximation $\mathrm{diag}(A_t)$. Then the algorithm is a simple modification of the normal updates:
$$ w^{(t+1)} = \arg\min_{w \in W} \big\| w - \big(w^{(t)} - \eta\, \mathrm{diag}(A_t)^{-1} g_t\big) \big\|_{\mathrm{diag}(A_t)}^2. $$
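A minimal sketch of the diagonal AdaGrad update, assuming $W = \mathbb{R}^d$ (no projection) and adding a small epsilon for numerical stability, which the slides omit.

```python
import numpy as np

class DiagonalAdaGrad:
    def __init__(self, d, eta=0.1, eps=1e-8):
        self.eta = eta
        self.eps = eps
        self.sum_sq_grad = np.zeros(d)  # running sum of g_{tau,i}^2, i.e. diag(A_t)^2

    def step(self, w, g):
        self.sum_sq_grad += g ** 2
        # diag(A_t)_ii = sqrt(sum_tau g_{tau,i}^2); per-coordinate rate eta / diag(A_t)_ii
        return w - self.eta * g / (np.sqrt(self.sum_sq_grad) + self.eps)

# Usage: one update with an arbitrary gradient.
opt = DiagonalAdaGrad(d=3)
w = np.zeros(3)
w = opt.step(w, np.array([0.5, 0.0, -0.2]))
print(w)
```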
AdaGrad in Euclidean Space

For $W = \mathbb{R}^d$, for each feature dimension $i$,
$$ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_{t,i}\, g_{t,i}, \qquad \text{where} \qquad \eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}. $$
That is,
$$ w_i^{(t+1)} \leftarrow w_i^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i}. $$
Each feature dimension has its own learningrate! It adapts with $t$ and takes the geometry of the past observations into account. The primary role of $\eta$ is determining the rate the first time a feature is encountered.

AdaGrad Theoretical Guarantees

AdaGrad regret bound:
$$ \sum_{t=1}^{T} f_t\big(w^{(t)}\big) - \sum_{t=1}^{T} f_t(w^*) \;\le\; 2 R_\infty \sum_{j=1}^{d} \big\| g_{1:T,j} \big\|_2, \qquad R_\infty := \max_t \big\| w^{(t)} - w^* \big\|_\infty. $$
So, what does this mean in practice? There are many cool examples, and AdaGrad really is used in practice! Let's just examine one.
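A minimal sketch (my own toy comparison, not from the slides) of the per-coordinate AdaGrad update on data with one frequent irrelevant feature and one rare predictive feature; note how the rarely updated coordinate keeps a large effective learning rate.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d, eta = 5_000, 2, 0.5

w = np.zeros(d)
sum_sq = np.zeros(d)
for t in range(T):
    x = np.array([rng.choice([-1.0, 1.0]), 0.0])  # feature 0: frequent, irrelevant
    if rng.random() < 0.01:
        x[1] = 1.0                                 # feature 1: rare, predictive
    y = 1.0 if x[1] > 0 else -1.0
    g = -y * x if y * (w @ x) < 1 else np.zeros(d)  # hinge-loss subgradient
    sum_sq += g ** 2
    w -= eta * g / (np.sqrt(sum_sq) + 1e-8)

print("effective per-coordinate rates:", eta / (np.sqrt(sum_sq) + 1e-8))
print("learned weights:", w)
```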
AdaGrad Theoretical Example

Expect AdaGrad to out-perform when gradient vectors are sparse.
SVM hinge loss example: $f_t(w) = [1 - y_t \langle x_t, w \rangle]_+$ where $x_t \in \{-1, 0, 1\}^d$.
If feature $j$ is nonzero ($x_{t,j} \neq 0$) with probability proportional to $1/j^{\alpha}$, $\alpha \ge 2$, then AdaGrad achieves
$$ \mathbb{E}\!\left[ f\big(\bar{w}^{(T)}\big) \right] - f(w^*) = O\!\left( \| w^* \|_\infty \, \frac{\max\{\log d,\ d^{1-\alpha/2}\}}{\sqrt{T}} \right), $$
while the previously best known method achieves
$$ \mathbb{E}\!\left[ f\big(\bar{w}^{(T)}\big) \right] - f(w^*) = O\!\left( \| w^* \|_\infty \, \sqrt{\frac{d}{T}} \right). $$

Neural Network Learning

Wildly non-convex problem, but we use SGD methods anyway:
$$ f(x; \theta) = \log\!\Big( 1 + \exp\big( \big\langle [\,p(\langle x_1, \theta_1 \rangle), \ldots, p(\langle x_k, \theta_k \rangle)\,],\ \theta_0 \big\rangle \big) \Big), \qquad \text{where } p(\alpha) = \frac{1}{1 + \exp(-\alpha)}. $$
[Figure from Duchi et al. ISMP 2012 slides and Dean et al. 2012: average frame accuracy on the test set versus time (hours) for SGD on a GPU, Downpour SGD, Downpour SGD with AdaGrad, and Sandblaster L-BFGS. Distributed training with $d = 1.7 \times 10^9$ parameters; SGD and AdaGrad use 80 machines (1000 cores), L-BFGS uses 800 machines (10000 cores).]
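A minimal sketch (my own small-scale rendering, not the Dean et al. system) of the single-hidden-layer loss from the slide: sigmoid hidden units combined by $\theta_0$ inside a logistic loss. It is simplified so that every hidden unit sees the same input x; the sizes and random initialization are arbitrary.

```python
import numpy as np

def p(a):
    """Sigmoid hidden-unit activation."""
    return 1.0 / (1.0 + np.exp(-a))

def nn_loss(x, theta_hidden, theta0):
    """f(x; theta) = log(1 + exp(<[p(<x, theta_1>), ..., p(<x, theta_k>)], theta_0>))."""
    hidden = p(theta_hidden @ x)           # k hidden activations
    return np.log1p(np.exp(hidden @ theta0))

# Usage: evaluate the (non-convex in theta) loss at a random point.
rng = np.random.default_rng(4)
k, dim = 4, 10
x = rng.normal(size=dim)
theta_hidden = rng.normal(size=(k, dim))
theta0 = rng.normal(size=k)
print(nn_loss(x, theta_hidden, theta0))
```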
What you should know about Logistic Regression (LR) and Click Prediction

Click prediction problem: estimate the probability of a click; can be modeled as logistic regression.
Logistic regression model: a linear model; gradient ascent to optimize the conditional likelihood; overfitting and regularization; regularized optimization; convergence rates and stopping criteria.
Stochastic gradient ascent for large/streaming data; convergence rates of SGD.
AdaGrad motivation, derivation, and algorithm.