Stochastic Optimization Methods
1  Stochastic Optimization Methods

Lecturer: Pradeep Ravikumar
Co-instructor: Aarti Singh
Convex Optimization
Adapted from slides from Ryan Tibshirani
2  Stochastic gradient descent

Consider minimizing a sum of functions

    min_x (1/n) Σ_{i=1}^n f_i(x)

Gradient descent applied to this problem would repeat

    x^(k) = x^(k-1) - t_k · (1/n) Σ_{i=1}^n ∇f_i(x^(k-1)),   k = 1, 2, 3, ...

In comparison, stochastic gradient descent (or incremental gradient descent) repeats

    x^(k) = x^(k-1) - t_k · ∇f_{i_k}(x^(k-1)),   k = 1, 2, 3, ...

where i_k ∈ {1, ..., n} is some chosen index at iteration k
3  Notes:

- Typically we make a (uniform) random choice i_k ∈ {1, ..., n}
- Also common: mini-batch stochastic gradient descent, where we choose a random subset I_k ⊆ {1, ..., n} of size b ≪ n, and update according to

    x^(k) = x^(k-1) - t_k · (1/b) Σ_{i ∈ I_k} ∇f_i(x^(k-1)),   k = 1, 2, 3, ...

In both cases, we are approximating the full gradient by a noisy estimate, and our noisy estimate is unbiased:

    E[∇f_{i_k}(x)] = ∇f(x),   E[(1/b) Σ_{i ∈ I_k} ∇f_i(x)] = ∇f(x)

The mini-batch reduces the variance by a factor of 1/b, but is also b times more expensive!
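Both claims (unbiasedness, and the 1/b variance reduction) are easy to check numerically. A quick sketch on a made-up least-squares sum, with all sizes chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, b = 1000, 3, 10
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
x = rng.standard_normal(p)

grads = (A @ x - y)[:, None] * A   # row i holds the gradient of f_i at x
full = grads.mean(axis=0)          # the exact (full) gradient

# Many single-index estimates, and many size-b mini-batch estimates
single = grads[rng.integers(n, size=100_000)]
batch = np.array([grads[rng.choice(n, size=b, replace=False)].mean(axis=0)
                  for _ in range(20_000)])

bias = single.mean(axis=0) - full                           # near zero: unbiased
ratio = batch.var(axis=0).sum() / single.var(axis=0).sum()  # near 1/b
```

(Sampling without replacement shaves the variance slightly below 1/b, by the finite-population factor (n-b)/(n-1).)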
4  Example: regularized logistic regression

Given labels y_i ∈ {0, 1} and features x_i ∈ R^p, i = 1, ..., n, consider logistic regression with ridge regularization:

    min_{β ∈ R^p} (1/n) Σ_{i=1}^n ( -y_i x_i^T β + log(1 + e^{x_i^T β}) ) + (λ/2) ||β||_2^2

Write the criterion as f(β) = (1/n) Σ_{i=1}^n f_i(β), where

    f_i(β) = -y_i x_i^T β + log(1 + e^{x_i^T β}) + (λ/2) ||β||_2^2

The gradient computation

    ∇f(β) = (1/n) Σ_{i=1}^n (p_i(β) - y_i) x_i + λβ,   where p_i(β) = e^{x_i^T β} / (1 + e^{x_i^T β}),

is doable when n is moderate, but not when n is huge. Note that:

- One batch update costs O(np)
- One stochastic update costs O(p)
- One mini-batch update costs O(bp)
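A sketch of the per-sample gradient for this criterion on synthetic data (n, p, and λ are chosen arbitrarily), checking that the O(p) per-sample gradients average to the O(np) full gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 200, 20, 0.01
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)
beta = 0.1 * rng.standard_normal(p)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_fi(beta, i):
    # gradient of f_i(beta) = -y_i x_i^T beta + log(1 + exp(x_i^T beta))
    #                         + (lam/2) ||beta||_2^2; costs O(p)
    return (sigmoid(X[i] @ beta) - y[i]) * X[i] + lam * beta

def full_grad(beta):
    # costs O(np): this is what the stochastic updates avoid
    return X.T @ (sigmoid(X @ beta) - y) / n + lam * beta

avg = np.mean([grad_fi(beta, i) for i in range(n)], axis=0)
```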
5  Example with n = 10,000, p = 20; all methods employ fixed step sizes (diminishing step sizes give roughly similar results):

[Figure: criterion value f_k versus iteration number k, comparing full gradient, stochastic gradient, and mini-batch (b = 10, b = 100) methods]
6  What is happening? Iterations make better progress as the mini-batch size b gets bigger. But now let's parametrize by flops:

[Figure: criterion value f_k versus flop count (1e+02 to 1e+06), comparing full gradient, stochastic gradient, and mini-batch (b = 10, b = 100) methods]
7  Convergence rates

Recall that, under suitable step sizes, when f is convex and has a Lipschitz gradient, full gradient (FG) descent satisfies

    f(x^(k)) - f* = O(1/k)

What about stochastic gradient (SG) descent? Under diminishing step sizes, when f is convex (plus other conditions),

    E[f(x^(k))] - f* = O(1/√k)

Finally, what about mini-batch stochastic gradient? Again, under diminishing step sizes, for f convex (plus other conditions),

    E[f(x^(k))] - f* = O(1/√(bk) + 1/k)

But each iteration here is b times more expensive... and (for small b), in terms of flops, this is the same rate
8  Back to our ridge logistic regression example: we gain important insight by looking at the suboptimality gap (on a log scale):

[Figure: criterion gap f_k - f* (log scale, 1e-12 to 1e-03) versus iteration number k, comparing full gradient, stochastic gradient, and mini-batch (b = 10, b = 100) methods]
9  Recall that, under suitable step sizes, when f is strongly convex with a Lipschitz gradient, gradient descent satisfies

    f(x^(k)) - f* = O(ρ^k)

where ρ < 1. But, under diminishing step sizes, when f is strongly convex (plus other conditions), stochastic gradient descent gives

    E[f(x^(k))] - f* = O(1/k)

So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity.

For a while, this was believed to be inevitable, as Nemirovski and others had established matching lower bounds... but these applied to stochastic minimization of criteria of the form f(x) = ∫ F(x, ξ) dξ. Can we do better for finite sums?
10  Stochastic Gradient: Bias

For solving min_x f(x), stochastic gradient is actually a class of algorithms that use the iterates

    x^(k) = x^(k-1) - η_k g(x^(k-1); ξ_k),

where g(x^(k-1); ξ_k) is a stochastic gradient of the objective f(x) evaluated at x^(k-1).

Bias: the bias of the stochastic gradient is defined as

    bias(g(x^(k-1); ξ_k)) := E_{ξ_k}[g(x^(k-1); ξ_k)] - ∇f(x^(k-1))

Unbiased: when E_{ξ_k}[g(x^(k-1); ξ_k)] = ∇f(x^(k-1)), the stochastic gradient is said to be unbiased (e.g., the stochastic gradient scheme discussed so far).

Biased: we might also be interested in biased estimators, but where the bias is small, so that E_{ξ_k}[g(x^(k-1); ξ_k)] ≈ ∇f(x^(k-1)).
11  Stochastic Gradient: Variance

Variance: in addition to small (or zero) bias, we also want the variance of the estimator to be small:

    variance(g(x^(k-1); ξ_k)) := E_{ξ_k} ||g(x^(k-1); ξ_k) - E_{ξ_k}[g(x^(k-1); ξ_k)]||^2 ≤ E_{ξ_k} ||g(x^(k-1); ξ_k)||^2

The caveat with the stochastic gradient scheme we have seen so far is that its variance is large, and in particular doesn't decay to zero with the iteration index.

Loosely: because of the above, we have to decay the step size η_k to zero, which in turn means we can't take large steps, and hence the convergence rate is slow.

Can we get the variance to be small, and decay to zero with the iteration index?
12  Variance Reduction

Consider an estimator X for a parameter θ. Note that for an unbiased estimator, E(X) = θ. Now consider the following modified estimator:

    Z := X - Y,   where E(Y) ≈ 0

Then the bias of Z is also close to zero, since E(Z) = E(X) - E(Y) ≈ θ. If E(Y) = 0, then Z is unbiased iff X is unbiased.

What about the variance of the estimator Z?

    Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y)

This can be seen to be much less than Var(X) if Y is highly correlated with X.

Thus, given any estimator X, we can reduce its variance if we can construct a Y that (a) has expectation (close to) zero, and (b) is highly correlated with X. This is the abstract template followed by SAG, SAGA, SVRG, SDCA, ...
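The template is easy to verify numerically. Below, X is a noisy unbiased measurement of θ, and Y is a zero-mean variable built to be highly correlated with X; the distributions are invented purely for the demo:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 2.0
N = 200_000

noise = rng.standard_normal(N)
X = theta + noise                          # E[X] = theta, Var(X) = 1
Y = noise + 0.1 * rng.standard_normal(N)   # E[Y] = 0, Cov(X, Y) = 1
Z = X - Y                                  # still unbiased, and
                                           # Var(Z) = Var(X) + Var(Y) - 2 Cov(X, Y)
```

Here Var(X) = 1, Var(Y) = 1.01, Cov(X, Y) = 1, so Var(Z) = 0.01: a hundredfold reduction with no bias introduced.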
13  Outline

Rest of today:
- Stochastic average gradient (SAG)
- SAGA (does this stand for something?)
- Stochastic Variance Reduced Gradient (SVRG)
- Many, many others
14  Stochastic average gradient

Stochastic average gradient or SAG (Schmidt, Le Roux, Bach 2013) is a breakthrough method in stochastic optimization. The idea is fairly simple:

- Maintain a table containing the gradient g_i of f_i, i = 1, ..., n
- Initialize x^(0), and g_i^(0) = ∇f_i(x^(0)), i = 1, ..., n
- At steps k = 1, 2, 3, ..., pick a random i_k ∈ {1, ..., n} and then let

    g_{i_k}^(k) = ∇f_{i_k}(x^(k-1))   (most recent gradient of f_{i_k})

  Set all other g_i^(k) = g_i^(k-1), i ≠ i_k, i.e., these stay the same
- Update

    x^(k) = x^(k-1) - t_k · (1/n) Σ_{i=1}^n g_i^(k)
15  Notes:

- The key of SAG is to allow each f_i, i = 1, ..., n to communicate a part of the gradient estimate at each step
- This basic idea can be traced back to incremental aggregated gradient (Blatt, Hero, Gauchman, 2006)
- SAG gradient estimates are no longer unbiased, but they have greatly reduced variance
- Isn't it expensive to average all these gradients? (Especially if n is huge?) This is basically just as efficient as stochastic gradient descent, as long as we're clever:

    x^(k) = x^(k-1) - t_k · ( g_{i_k}^(k)/n - g_{i_k}^(k-1)/n + (1/n) Σ_{i=1}^n g_i^(k-1) )

  where the last term is the old table average, and the whole quantity in parentheses is the new table average
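Putting the table and the cleverly maintained running average together gives a compact SAG sketch. This is on a synthetic smooth least-squares sum, with the step size and iteration count hand-picked for the example (not the theoretical 1/(16L) schedule):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_fi(x, i):
    # f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(p)
g = np.array([grad_fi(x, i) for i in range(n)])  # table of gradients g_i
g_avg = g.mean(axis=0)                           # running table average
t = 0.02
for k in range(20_000):
    i = rng.integers(n)              # random index i_k
    g_new = grad_fi(x, i)            # most recent gradient of f_i
    g_avg += (g_new - g[i]) / n      # refresh the average in O(p), not O(np)
    g[i] = g_new
    x = x - t * g_avg                # step along the table average
```

The `g_avg` update is the whole trick: replacing one table entry changes the average by (g_new - g[i])/n, so each step stays as cheap as plain SGD.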
16  SAG Variance Reduction

The stochastic gradient used by SAG, written here scaled by n (the SAG update applies an overall factor of 1/n):

    g_{i_k}^(k)  -  ( g_{i_k}^(k-1) - Σ_{i=1}^n g_i^(k-1) )
        [X]                       [Y]

It can be seen that E(X) = ∇f(x^(k-1)). But E(Y) ≠ 0, so we have a biased estimator. We do have that Y is correlated with X (in line with the variance reduction template).

In particular, we have that X - Y → 0 as k → ∞: since x^(k-1) and x^(k) converge to x*, the difference between the first two terms converges to zero, and the sum in the last term converges to n times the gradient at the optimum, i.e., also to zero. Thus the overall estimator's ℓ2 norm (and accordingly its variance) decays to zero.
17  SAG convergence analysis

Assume that f(x) = (1/n) Σ_{i=1}^n f_i(x), where each f_i is differentiable and ∇f_i is Lipschitz with constant L. Denote by x̄^(k) = (1/k) Σ_{l=0}^{k-1} x^(l) the average iterate after k-1 steps.

Theorem (Schmidt, Le Roux, Bach): SAG, with a fixed step size t = 1/(16L), and an initialization satisfying

    g_i^(0) = ∇f_i(x^(0)) - ∇f(x^(0)),   i = 1, ..., n,

satisfies

    E[f(x̄^(k))] - f* ≤ (48n/k) (f(x^(0)) - f*) + (128L/k) ||x^(0) - x*||_2^2

where the expectation is taken over the random choice of index at each iteration.
18  Notes:

- The result is stated in terms of the average iterate x̄^(k), but can also be shown to hold for the best iterate x_best^(k) seen so far
- This is an O(1/k) convergence rate for SAG. Compare to the O(1/k) rate for FG, and the O(1/√k) rate for SG
- But the constants are different! Bounds after k steps:

    SAG:  ( 48n (f(x^(0)) - f*) + 128L ||x^(0) - x*||_2^2 ) / k
    FG:   L ||x^(0) - x*||_2^2 / (2k)
    SG:   L√5 ||x^(0) - x*||_2 / (2√k)   (*not a real bound; loose translation)

- So the first term in the SAG bound suffers from a factor of n; the authors suggest a smarter initialization to make f(x^(0)) - f* small (e.g., they suggest using the result of n SG steps)
19  Convergence analysis under strong convexity

Assume further that each f_i is strongly convex with parameter m.

Theorem (Schmidt, Le Roux, Bach): SAG, with a step size t = 1/(16L) and the same initialization as before, satisfies

    E[f(x^(k))] - f* ≤ ( 1 - min{ m/(16L), 1/(8n) } )^k · ( (3/2) (f(x^(0)) - f*) + (4L/n) ||x^(0) - x*||_2^2 )

More notes:

- This is a linear convergence rate O(ρ^k) for SAG. Compare this to O(ρ^k) for FG, and only O(1/k) for SG
- Like FG, we say SAG is adaptive to strong convexity (it achieves the better rate with the same settings)
- Proofs of these results are not easy: 15 pages, computer-aided!
20  Back to our ridge logistic regression example, SG versus SAG, over 30 reruns of these randomized algorithms:

[Figure: criterion gap f_k - f* versus iteration number k, for SG and SAG]
21  SAG does well, but it did not work out of the box; it required a specific setup:

- Took one full cycle of SG (one pass over the data) to get β^(0), and then started SG and SAG both from β^(0). This warm start helped a lot
- SAG was initialized at g_i^(0) = ∇f_i(β^(0)), i = 1, ..., n, computed during the initial SG cycle. Centering these gradients was much worse (and so was initializing them at 0)
- Tuning the fixed step size for SAG was very finicky; here it was hand-tuned to be about as large as possible before it diverges
22  Experiments from Schmidt, Le Roux, Bach (each plot is a different problem setting):

[Figure 1 from the paper: comparison of different FG and SG optimization strategies. Nine panels plot objective minus optimum versus effective passes, comparing AFG, ASG, IAG, L-BFGS, SG, and SAG-LS across problem settings]
23  SAGA

SAGA (Defazio, Bach, Lacoste-Julien, 2014) is another recent stochastic method, similar in spirit to SAG. The idea is again simple:

- Maintain a table containing the gradient g_i of f_i, i = 1, ..., n
- Initialize x^(0), and g_i^(0) = ∇f_i(x^(0)), i = 1, ..., n
- At steps k = 1, 2, 3, ..., pick a random i_k ∈ {1, ..., n} and then let

    g_{i_k}^(k) = ∇f_{i_k}(x^(k-1))   (most recent gradient of f_{i_k})

  Set all other g_i^(k) = g_i^(k-1), i ≠ i_k, i.e., these stay the same
- Update

    x^(k) = x^(k-1) - t_k · ( g_{i_k}^(k) - g_{i_k}^(k-1) + (1/n) Σ_{i=1}^n g_i^(k-1) )
24  Notes:

- SAGA gradient estimate:

    g_{i_k}^(k) - g_{i_k}^(k-1) + (1/n) Σ_{i=1}^n g_i^(k-1)

  versus the SAG gradient estimate:

    (1/n) g_{i_k}^(k) - (1/n) g_{i_k}^(k-1) + (1/n) Σ_{i=1}^n g_i^(k-1)

- Recall that the SAG estimate is biased; remarkably, the SAGA estimate is unbiased!
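The same table machinery gives SAGA; the only change from the SAG sketch is that the correction term g_new - g[i] enters the step without the 1/n scaling. Again a synthetic least-squares sum with a hand-picked step size:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_fi(x, i):
    # f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(p)
g = np.array([grad_fi(x, i) for i in range(n)])  # gradient table
g_avg = g.mean(axis=0)                           # running table average
t = 0.05
for k in range(20_000):
    i = rng.integers(n)
    g_new = grad_fi(x, i)
    x = x - t * (g_new - g[i] + g_avg)  # unbiased SAGA direction
    g_avg += (g_new - g[i]) / n         # then refresh the table and average
    g[i] = g_new
```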
25  SAGA Variance Reduction

The stochastic gradient in SAGA:

    g_{i_k}^(k)  -  ( g_{i_k}^(k-1) - (1/n) Σ_{i=1}^n g_i^(k-1) )
        [X]                        [Y]

It can be seen that E(X) = ∇f(x^(k-1)). And E(Y) = 0, so we have an unbiased estimator. Moreover, Y is correlated with X (in line with the variance reduction template).

In particular, we have that X - Y → 0 as k → ∞: since x^(k-1) and x^(k) converge to x*, the difference between the first two terms converges to zero, and the last term converges to the gradient at the optimum, i.e., also to zero. Thus the overall estimator's ℓ2 norm (and accordingly its variance) decays to zero.
26  SAGA basically matches the strong convergence rates of SAG (for both the Lipschitz gradient and strongly convex cases), but the proofs here are much simpler.

Another strength of SAGA is that it extends to composite problems of the form

    min_x (1/n) Σ_{i=1}^n f_i(x) + h(x)

where each f_i is smooth and convex, and h is convex and nonsmooth but has a known prox. The updates are now

    x^(k) = prox_{h, t_k}( x^(k-1) - t_k · ( g_{i_k}^(k) - g_{i_k}^(k-1) + (1/n) Σ_{i=1}^n g_i^(k-1) ) )

It is not known whether SAG is generally convergent under such a scheme.
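The composite update is a one-line change to the SAGA sketch: apply prox_{h,t} after the step. Here h(x) = λ ||x||_1, whose prox is soft-thresholding; the data, λ, and step size are made up for the illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, lam = 100, 5, 0.1
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_fi(x, i):
    # smooth part: f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def prox_l1(z, s):
    # prox of s * lam * ||.||_1, i.e. soft-thresholding at level s * lam
    return np.sign(z) * np.maximum(np.abs(z) - s * lam, 0.0)

x = np.zeros(p)
g = np.array([grad_fi(x, i) for i in range(n)])
g_avg = g.mean(axis=0)
t = 0.05
for k in range(20_000):
    i = rng.integers(n)
    g_new = grad_fi(x, i)
    x = prox_l1(x - t * (g_new - g[i] + g_avg), t)  # prox after the SAGA step
    g_avg += (g_new - g[i]) / n
    g[i] = g_new
```

A natural sanity check is the fixed-point condition of the composite problem: at the minimizer, x = prox_{h,t}(x - t ∇f(x)).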
27  Back to our ridge logistic regression example, now adding SAGA to the mix:

[Figure: criterion gap f_k - f* versus iteration number k, for SG, SAG, and SAGA]
28  SAGA does well, but again it required a somewhat specific setup:

- As before, took one full cycle of SG (one pass over the data) to get β^(0), and then started SG, SAG, and SAGA all from β^(0). This warm start helped a lot
- SAGA was initialized at g_i^(0) = ∇f_i(β^(0)), i = 1, ..., n, computed during the initial SG cycle. Centering these gradients was much worse (and so was initializing them at 0)
- Tuning the fixed step size for SAGA was fine; seemingly on par with tuning for SG, and more robust than tuning for SAG
- Interestingly, the SAGA criterion curves look like SG curves (realizations being jagged and highly variable); SAG looks very different, and this really emphasizes the fact that its updates have much lower variance
29  Stochastic Variance Reduced Gradient (SVRG)

The Stochastic Variance Reduced Gradient (SVRG) algorithm (Johnson, Zhang, 2013) runs in epochs:

- Initialize x̃^(0)
- For k = 1, 2, ...:
  - Set x̃ = x̃^(k-1). Compute μ := ∇f(x̃). Set x^(0) = x̃
  - For l = 1, ..., m:
    - Pick an index i_l at random from {1, ..., n}
    - Set

        x^(l) = x^(l-1) - η ( ∇f_{i_l}(x^(l-1)) - ∇f_{i_l}(x̃) + μ )

  - Set x̃^(k) = x^(m)
30  Stochastic Variance Reduced Gradient (SVRG)

Just like SAG/SAGA, but it does not store a full table of gradients, just an average, and updates this occasionally:

- SAGA: ∇f_{i_k}(x^(k-1)) - g_{i_k}^(k-1) + (1/n) Σ_{i=1}^n g_i^(k-1)
- SVRG: ∇f_{i_l}(x^(l-1)) - ∇f_{i_l}(x̃) + (1/n) Σ_{i=1}^n ∇f_i(x̃)

SVRG can be shown to achieve variance reduction similar to SAGA. Its convergence rates are similar to SAGA, but the formal analysis is much simpler.
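The epoch structure above transcribes almost line for line, again on a synthetic least-squares sum; η, m, and the number of epochs are hand-picked for the example:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_fi(x, i):
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / n

x_tilde = np.zeros(p)
eta, m = 0.05, 2 * n
for k in range(50):                # outer epochs
    mu = full_grad(x_tilde)        # one full gradient per epoch, at the snapshot
    x = x_tilde.copy()
    for l in range(m):             # m cheap stochastic steps
        i = rng.integers(n)
        x = x - eta * (grad_fi(x, i) - grad_fi(x_tilde, i) + mu)
    x_tilde = x                    # new snapshot
```

Note the storage: only the snapshot x̃ and its full gradient μ are kept, rather than SAG/SAGA's n-row table, at the cost of one full-gradient pass per epoch.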
31  Many, many others

A lot of recent work is revisiting stochastic optimization:

- SDCA (Shalev-Shwartz, Zhang, 2013): applies randomized coordinate ascent to the dual of ridge-regularized problems. The effective primal updates are similar to SAG/SAGA
- There are also S2GD (Konecny, Richtarik, 2014), MISO (Mairal, 2013), Finito (Defazio, Caetano, Domke, 2014), etc.
32  Basic summary of method properties (from Defazio, Bach, Lacoste-Julien, 2014); question marks denote properties that are unproven, but not experimentally ruled out:

                          SAGA   SAG   SDCA    SVRG   FINITO
    Strongly Convex (SC)   ✓      ✓     ✓       ✓      ✓
    Convex, Non-SC*        ✓      ✓     ✗       ?      ?
    Prox Reg.              ✓      ?     ✓[6]    ✓      ✗
    Non-smooth             ✗      ✗     ✓       ✗      ✗
    Low Storage Cost       ✗      ✗     ✗       ✓      ✗
    Simple(-ish) Proof     ✓      ✗     ✓       ✓      ✓
    Adaptive to SC         ✓      ✓     ✗       ?      ?

Are we approaching optimality with these methods? Agarwal and Bottou (2014) recently proved nonmatching lower bounds for minimizing finite sums. This leaves three possibilities: (i) the algorithms we currently have are not optimal; (ii) the lower bounds can be tightened; or (iii) the upper bounds can be tightened. This is a very active area of research, and it will likely be sorted out soon.
33  References and further reading

- D. Bertsekas (2010), "Incremental gradient, subgradient, and proximal methods for convex optimization: a survey"
- A. Defazio, F. Bach, and S. Lacoste-Julien (2014), "SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives"
- R. Johnson and T. Zhang (2013), "Accelerating stochastic gradient descent using predictive variance reduction"
- A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro (2009), "Robust stochastic approximation approach to stochastic programming"
- M. Schmidt, N. Le Roux, and F. Bach (2013), "Minimizing finite sums with the stochastic average gradient"
- S. Shalev-Shwartz and T. Zhang (2013), "Stochastic dual coordinate ascent methods for regularized loss minimization"
Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected
More informationChapter - 2. Distribution System Power Flow Analysis
Chapter - 2 Dstrbuton System Power Flow Analyss CHAPTER - 2 Radal Dstrbuton System Load Flow 2.1 Introducton Load flow s an mportant tool [66] for analyzng electrcal power system network performance. Load
More informationProblem Set 9 Solutions
Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem
More informationSTAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16
STAT 39: MATHEMATICAL COMPUTATIONS I FALL 218 LECTURE 16 1 why teratve methods f we have a lnear system Ax = b where A s very, very large but s ether sparse or structured (eg, banded, Toepltz, banded plus
More informationHidden Markov Models
Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,
More informationExcess Error, Approximation Error, and Estimation Error
E0 370 Statstcal Learnng Theory Lecture 10 Sep 15, 011 Excess Error, Approxaton Error, and Estaton Error Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton So far, we have consdered the fnte saple
More informationBoostrapaggregating (Bagging)
Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod
More informationLinear Feature Engineering 11
Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19
More informationEconomics 101. Lecture 4 - Equilibrium and Efficiency
Economcs 0 Lecture 4 - Equlbrum and Effcency Intro As dscussed n the prevous lecture, we wll now move from an envronment where we looed at consumers mang decsons n solaton to analyzng economes full of
More information3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X
Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number
More informationEdge Isoperimetric Inequalities
November 7, 2005 Ross M. Rchardson Edge Isopermetrc Inequaltes 1 Four Questons Recall that n the last lecture we looked at the problem of sopermetrc nequaltes n the hypercube, Q n. Our noton of boundary
More informationEstimation: Part 2. Chapter GREG estimation
Chapter 9 Estmaton: Part 2 9. GREG estmaton In Chapter 8, we have seen that the regresson estmator s an effcent estmator when there s a lnear relatonshp between y and x. In ths chapter, we generalzed the
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationExpected Value and Variance
MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or
More informationVector Norms. Chapter 7 Iterative Techniques in Matrix Algebra. Cauchy-Bunyakovsky-Schwarz Inequality for Sums. Distances. Convergence.
Vector Norms Chapter 7 Iteratve Technques n Matrx Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematcs Unversty of Calforna, Berkeley Math 128B Numercal Analyss Defnton A vector norm
More informationarxiv: v1 [math.oc] 6 Jan 2016
arxv:1601.01174v1 [math.oc] 6 Jan 2016 THE SUPPORTING HALFSPACE - QUADRATIC PROGRAMMING STRATEGY FOR THE DUAL OF THE BEST APPROXIMATION PROBLEM C.H. JEFFREY PANG Abstract. We consder the best approxmaton
More informationEcon107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)
I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes
More informationOutline and Reading. Dynamic Programming. Dynamic Programming revealed. Computing Fibonacci. The General Dynamic Programming Technique
Outlne and Readng Dynamc Programmng The General Technque ( 5.3.2) -1 Knapsac Problem ( 5.3.3) Matrx Chan-Product ( 5.3.1) Dynamc Programmng verson 1.4 1 Dynamc Programmng verson 1.4 2 Dynamc Programmng
More informationFast Stochastic Optimization Algorithms for ML
Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2
More informationEcon Statistical Properties of the OLS estimator. Sanjaya DeSilva
Econ 39 - Statstcal Propertes of the OLS estmator Sanjaya DeSlva September, 008 1 Overvew Recall that the true regresson model s Y = β 0 + β 1 X + u (1) Applyng the OLS method to a sample of data, we estmate
More informationx = , so that calculated
Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to
More informationVariance-Reduced Stochastic Gradient Descent on Streaming Data
Varance-Reduced Stochastc Gradent Descent on Streamng Data Ellango Jothmurugesan Carnege Mellon Unversty ejothmu@cs.cmu.edu Phllp B. Gbbons Carnege Mellon Unversty gbbons@cs.cmu.edu Ashraf Tahmasb Iowa
More informationResearch Article. Almost Sure Convergence of Random Projected Proximal and Subgradient Algorithms for Distributed Nonsmooth Convex Optimization
To appear n Optmzaton Vol. 00, No. 00, Month 20XX, 1 27 Research Artcle Almost Sure Convergence of Random Projected Proxmal and Subgradent Algorthms for Dstrbuted Nonsmooth Convex Optmzaton Hdea Idua a
More informationLine Drawing and Clipping Week 1, Lecture 2
CS 43 Computer Graphcs I Lne Drawng and Clppng Week, Lecture 2 Davd Breen, Wllam Regl and Maxm Peysakhov Geometrc and Intellgent Computng Laboratory Department of Computer Scence Drexel Unversty http://gcl.mcs.drexel.edu
More informationConvergence rates of proximal gradient methods via the convex conjugate
Convergence rates of proxmal gradent methods va the convex conjugate Davd H Gutman Javer F Peña January 8, 018 Abstract We gve a novel proof of the O(1/ and O(1/ convergence rates of the proxmal gradent
More informationGlobal Sensitivity. Tuesday 20 th February, 2018
Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 15 Scribe: Jieming Mao April 1, 2013
COS 511: heoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 15 Scrbe: Jemng Mao Aprl 1, 013 1 Bref revew 1.1 Learnng wth expert advce Last tme, we started to talk about learnng wth expert advce.
More informationCalculation of time complexity (3%)
Problem 1. (30%) Calculaton of tme complexty (3%) Gven n ctes, usng exhaust search to see every result takes O(n!). Calculaton of tme needed to solve the problem (2%) 40 ctes:40! dfferent tours 40 add
More information6) Derivatives, gradients and Hessian matrices
30C00300 Mathematcal Methods for Economsts (6 cr) 6) Dervatves, gradents and Hessan matrces Smon & Blume chapters: 14, 15 Sldes by: Tmo Kuosmanen 1 Outlne Defnton of dervatve functon Dervatve notatons
More informationLinear Regression Analysis: Terminology and Notation
ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented
More informationAdditional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty
Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,
More informationEconomics 130. Lecture 4 Simple Linear Regression Continued
Economcs 130 Lecture 4 Contnued Readngs for Week 4 Text, Chapter and 3. We contnue wth addressng our second ssue + add n how we evaluate these relatonshps: Where do we get data to do ths analyss? How do
More information