Stochastic Dual Ascent: Linear Systems, Quasi-Newton Updates and Matrix Inversion
1 Stochastic Dual Ascent: Linear Systems, Quasi-Newton Updates and Matrix Inversion. Peter Richtárik (joint work with Robert M. Gower), University of Edinburgh. Oberwolfach, March 8, 2016
2 Part I: Stochastic Dual Ascent for Linear Systems
3 Robert Mansel Gower (Edinburgh). Robert Mansel Gower and P.R. Randomized Iterative Methods for Linear Systems. SIAM Journal on Matrix Analysis and Applications 36(4), 2015 [GR15a]. Robert Mansel Gower and P.R. Stochastic Dual Ascent for Solving Linear Systems. arXiv:1512.06890, 2015 [GR15b]
4 The Problem
5 The Problem: Solve a Linear System. Find x in R^n such that Ax = b, where A is an m x n matrix and b is in R^m. Assumption 1: the system is consistent (i.e., has a solution).
6 Optimization Formulation. Primal problem: minimize P(x) := (1/2) ||x - c||_B^2 subject to Ax = b, where B is positive definite, A in R^{m x n}, x in R^n, and ||x - c||_B^2 := (x - c)^T B (x - c). Dual problem: maximize D(y) := (b - Ac)^T y - (1/2) ||A^T y||_{B^{-1}}^2 subject to y in R^m. This is an unconstrained, non-strongly concave quadratic maximization problem.
7 Dual Correspondence Lemma. The affine mapping x(y) := c + B^{-1} A^T y takes R^m to R^n. Lemma (GR15b): if y* is any dual optimal point, then x(y*) is the primal optimal point x*, and D(y*) - D(y) = (1/2) ||x(y) - x*||_B^2; that is, the dual error (in function values) equals the primal error (in distance).
8 New Algorithm: Stochastic Dual Ascent (SDA)
9 Stochastic Dual Ascent. Iterate y^{t+1} = y^t + S lambda^t, where S is a random m x q matrix drawn i.i.d. from a distribution D in each iteration, and lambda^t is the least-norm element of arg max_lambda D(y^t + S lambda), given explicitly by lambda^t = (S^T A B^{-1} A^T S)^† S^T (b - A(c + B^{-1} A^T y^t)), where † is the Moore-Penrose pseudoinverse of a small matrix.
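The dual step above is directly implementable. Below is a minimal numpy sketch (not the authors' code): the problem sizes, sketch width q and iteration count are illustrative assumptions, and the primal image x(y) = c + B^{-1} A^T y is used to check convergence.

```python
import numpy as np

def sda_step(y, A, b, c, Binv, S):
    """One SDA step: y <- y + S @ lam, with lam the least-norm maximizer
    of D(y + S lam), computed via a Moore-Penrose pseudoinverse."""
    M = S.T @ A @ Binv @ A.T @ S               # small q x q matrix
    r = S.T @ (b - A @ (c + Binv @ A.T @ y))   # sketched residual
    return y + S @ (np.linalg.pinv(M) @ r)

rng = np.random.default_rng(0)
m, n, q = 10, 8, 3
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star                 # consistent by construction
c = np.zeros(n)
Binv = np.eye(n)               # B = I

y = np.zeros(m)
for _ in range(400):
    S = rng.standard_normal((m, q))   # Gaussian sketch, drawn i.i.d.
    y = sda_step(y, A, b, c, Binv, S)

x = c + Binv @ A.T @ y         # primal image of the dual iterate
assert np.linalg.norm(A @ x - b) < 1e-6
```

Since A has full column rank here, the primal image converges to the unique solution x*.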
10 Primal Method = Linear Image of the Dual Method. The corresponding primal iterates are x^t := x(y^t) = c + B^{-1} A^T y^t, where y^t are the dual iterates produced by SDA.
11 Main Assumption. Assumption 2: with H := S (S^T A B^{-1} A^T S)^† S^T, the matrix E_{S~D}[H] is nonsingular.
12 Complexity of SDA. Let rho := 1 - lambda_min(B^{-1/2} A^T E[H] A B^{-1/2}) and U_0 := (1/2) ||x^0 - x*||_B^2. Theorem (GR15b):
Primal iterates: E[(1/2) ||x^t - x*||_B^2] <= rho^t U_0 [GR15a]
Residual: E[||Ax^t - b||_B] <= rho^{t/2} ||A||_B sqrt(2 U_0)
Dual error: E[OPT - D(y^t)] <= rho^t U_0
Primal error: E[P(x^t) - OPT] <= rho^t U_0 + 2 rho^{t/2} sqrt(OPT * U_0)
Duality gap: E[P(x^t) - D(y^t)] <= 2 rho^t U_0 + 2 rho^{t/2} sqrt(OPT * U_0)
13 The Rate: Lower and Upper Bounds. Note that Rank(S^T A) = dim(Range(B^{-1} A^T S)) = Tr(B^{-1} Z). Theorem [GR15a,b]: 0 <= 1 - E[Rank(S^T A)] / Rank(A) <= rho <= 1. Insight: rho <= 1 always, and rho < 1 if Assumption 2 holds. Insight: the lower bound is good when (i) the dimension of the search space in the constrain-and-approximate viewpoint is large, and (ii) the rank of A is small.
14 The Primal Iterates: 6 Equivalent Viewpoints. Recall x^t := x(y^t) = c + B^{-1} A^T y^t: the primal iterates corresponding to the dual iterates produced by SDA.
15 1. Relaxation Viewpoint: Sketch and Project. x^{t+1} = arg min_{x in R^n} ||x - x^t||_B^2 subject to S^T A x = S^T b. If S is the identity matrix, the method converges in one step.
16 2. Approximation Viewpoint: Constrain and Approximate. x^{t+1} = arg min_{x in R^n} ||x - x*||_B^2 subject to x = x^t + B^{-1} A^T S lambda, where lambda is free.
17 3. Geometric Viewpoint: Random Intersect. (1) x^{t+1} = arg min_x ||x - x^t||_B subject to S^T A x = S^T b; (2) x^{t+1} = arg min_x ||x - x*||_B subject to x = x^t + B^{-1} A^T S lambda. The new iterate is the intersection of an affine space through x* with one through x^t: {x^{t+1}} = (x* + Null(S^T A)) ∩ (x^t + Range(B^{-1} A^T S)).
18 4. Algebraic Viewpoint: Random Linear Solve. x^{t+1} is the solution in (x, lambda) of the linear system S^T A x = S^T b, x = x^t + B^{-1} A^T S lambda (both x and lambda are unknown).
19 5. Algebraic Viewpoint: Random Update. x^{t+1} = x^t - B^{-1} A^T S (S^T A B^{-1} A^T S)^† S^T (Ax^t - b), where † denotes the Moore-Penrose pseudoinverse.
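The random-update form is the one you would implement in the primal. A minimal numpy sketch (illustrative, not from the talk): it uses a genuinely non-identity positive definite weight B and a Gaussian sketch of width 2, both assumptions for the example.

```python
import numpy as np

def sketch_project_step(x, A, b, Binv, S):
    """x <- x - B^{-1} A^T S (S^T A B^{-1} A^T S)^† S^T (Ax - b)."""
    G = Binv @ A.T @ S
    lam = np.linalg.pinv(S.T @ A @ G) @ (S.T @ (A @ x - b))
    return x - G @ lam

rng = np.random.default_rng(1)
m, n = 12, 6
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star                       # consistent system
L = rng.standard_normal((n, n))
B = L @ L.T + n * np.eye(n)          # arbitrary positive definite weight
Binv = np.linalg.inv(B)

x = np.zeros(n)
for _ in range(800):
    S = rng.standard_normal((m, 2))  # random sketch, drawn i.i.d.
    x = sketch_project_step(x, A, b, Binv, S)

assert np.linalg.norm(x - x_star) < 1e-6
```

Every special case below (RK, Gaussian descent, RCD, RN) is this step with a particular choice of B and of the distribution of S.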
20 6. Analytic Viewpoint: Random Fixed Point. Let Z := A^T S (S^T A B^{-1} A^T S)^† S^T A (random iteration matrix). Then x^{t+1} - x* = (I - B^{-1} Z)(x^t - x*). Moreover (B^{-1} Z)^2 = B^{-1} Z and (I - B^{-1} Z)^2 = I - B^{-1} Z: B^{-1} Z projects orthogonally onto Range(B^{-1} A^T S), and I - B^{-1} Z projects orthogonally onto Null(S^T A).
21 Special Case: Randomized Kaczmarz Method
22 Randomized Kaczmarz (RK) Method. M. S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres, Classe des Sciences Mathématiques et Naturelles, Série A, Sciences Mathématiques 35, 1937 (the Kaczmarz method, 1937). T. Strohmer and R. Vershynin. A Randomized Kaczmarz Algorithm with Exponential Convergence. Journal of Fourier Analysis and Applications 15(2), 2009 (the randomized Kaczmarz method, 2009). RK arises as a special case for the parameters B = I and S = e_i = (0,...,0,1,0,...,0) with probability p_i: x^{t+1} = x^t - ((A_{i:} x^t - b_i) / ||A_{i:}||_2^2) (A_{i:})^T. RK was analyzed for p_i = ||A_{i:}||_2^2 / ||A||_F^2.
23 RK: Derivation and Rate. General method: x^{t+1} = x^t - B^{-1} A^T S (S^T A B^{-1} A^T S)^† S^T (Ax^t - b). Special choice of parameters: B = I and S = e_i with probability p_i = ||A_{i:}||_2^2 / ||A||_F^2, giving x^{t+1} = x^t - ((A_{i:} x^t - b_i) / ||A_{i:}||_2^2) (A_{i:})^T. Complexity rate: E ||x^t - x*||_2^2 <= (1 - lambda_min(A^T A) / ||A||_F^2)^t ||x^0 - x*||_2^2.
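The RK special case in code, as a minimal sketch (problem sizes and iteration count are illustrative assumptions), with the row-norm sampling p_i = ||A_{i:}||^2 / ||A||_F^2 used in the analysis:

```python
import numpy as np

def randomized_kaczmarz(A, b, iters, rng):
    """Randomized Kaczmarz: project the iterate onto a randomly
    sampled row equation A_{i:} x = b_i each iteration."""
    m, n = A.shape
    row_norms2 = np.einsum('ij,ij->i', A, A)   # squared row norms
    p = row_norms2 / row_norms2.sum()          # sampling probabilities
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.choice(m, p=p)
        x -= (A[i] @ x - b[i]) / row_norms2[i] * A[i]
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 10))
x_star = rng.standard_normal(10)
b = A @ x_star                     # consistent, unique solution

x = randomized_kaczmarz(A, b, 2000, rng)
assert np.linalg.norm(x - x_star) < 1e-6
```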
24 RK: Further Reading.
D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT 50(2), 2010.
D. Needell and J. Tropp. Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra and its Applications 441, 2012.
D. Needell, N. Srebro and R. Ward. Stochastic gradient descent, weighted sampling and the randomized Kaczmarz algorithm. Mathematical Programming, 2015.
A. Ramdas. Rows vs columns for linear systems of equations: randomized Kaczmarz or coordinate descent? arXiv, 2014.
25 Special Case: Gaussian Descent
26 Gaussian Descent. General method: x^{t+1} = x^t - B^{-1} A^T S (S^T A B^{-1} A^T S)^† S^T (Ax^t - b). Special choice of parameters: S ~ N(0, Sigma), with Sigma a positive definite covariance matrix, giving x^{t+1} = x^t - (S^T (Ax^t - b) / (S^T A B^{-1} A^T S)) B^{-1} A^T S. Complexity rate: E ||x^t - x*||_B^2 <= rho^t ||x^0 - x*||_B^2.
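The Gaussian descent step is a rank-one update along a random direction. A minimal sketch (illustrative assumptions: B = I, Sigma = I, a well-conditioned symmetric positive definite A, and the iteration count):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # positive definite test matrix
x_star = rng.standard_normal(n)
b = A @ x_star

# Gaussian descent with B = I, S ~ N(0, I) drawn fresh each iteration:
# x <- x - (s^T (Ax - b) / (s^T A A^T s)) A^T s
x = np.zeros(n)
for _ in range(4000):
    s = rng.standard_normal(n)
    g = A.T @ s
    x -= (s @ (A @ x - b)) / (g @ g) * g

assert np.linalg.norm(x - x_star) < 1e-6
```

Note that for B = I, S^T A B^{-1} A^T S = ||A^T s||^2, which is what the denominator computes.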
27 Change of variables. Writing xi := B^{-1/2} A^T S with S ~ N(0, Sigma), we have xi ~ N(0, Sigma-bar) where Sigma-bar := B^{-1/2} A^T Sigma A B^{-1/2}; in the transformed variable the iteration is a step x^{t+1} = x^t - h^t along the random Gaussian direction xi.
28 Gaussian Descent: The Rate. Lemma [GR15]: the rate is governed by the matrix E[xi xi^T / ||xi||^2] for xi ~ N(0, Sigma-bar), and satisfies 1 - 1/n <= rho <= 1 - (1/2) lambda_min(Sigma-bar) / Tr(Sigma-bar). The lower bound follows from the general lower bound on the rate.
29 Gaussian Descent: Further Reading.
Yurii Nesterov. Random gradient-free minimization of convex functions. CORE Discussion Paper 2011/1, 2011.
S. U. Stich, C. L. Müller and B. Gärtner. Optimization of convex functions with random pursuit. SIAM Journal on Optimization 23(2), 2013.
S. U. Stich. Convex optimization with random pursuit. PhD Thesis, ETH Zurich, 2014.
30 Special Case: Randomized Coordinate Descent
31 Randomized Coordinate Descent (RCD). D. Leventhal and A. S. Lewis. Randomized methods for linear constraints: convergence rates and conditioning. Mathematics of OR 35(3), 2010. Problem: min_{x in R^n} f(x) = (1/2) x^T A x - b^T x, assuming A is positive definite, so x* = A^{-1} b. RCD arises as a special case for the parameters B = A and S = e_i = (0,...,0,1,0,...,0) with probability p_i (recall: in RK we had B = I): x^{t+1} = x^t - ((A_{i:} x^t - b_i) / A_{ii}) e_i. RCD was analyzed for p_i = A_{ii} / Tr(A).
32 RCD: Derivation and Rate. General method: x^{t+1} = x^t - B^{-1} A^T S (S^T A B^{-1} A^T S)^† S^T (Ax^t - b). Special choice of parameters: B = A and S = e_i with probability p_i = A_{ii} / Tr(A), giving x^{t+1} = x^t - ((A_{i:} x^t - b_i) / A_{ii}) e_i. Complexity rate: E ||x^t - x*||_A^2 <= (1 - lambda_min(A) / Tr(A))^t ||x^0 - x*||_A^2.
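The RCD special case in code, as a minimal sketch (the test matrix and iteration count are illustrative assumptions), with the diagonal sampling p_i = A_ii / Tr(A) used in the analysis:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # positive definite
x_star = rng.standard_normal(n)
b = A @ x_star

# RCD: B = A, S = e_i with probability p_i = A_ii / Tr(A);
# each step exactly minimizes f along coordinate i.
p = np.diag(A) / np.trace(A)
x = np.zeros(n)
for _ in range(3000):
    i = rng.choice(n, p=p)
    x[i] -= (A[i] @ x - b[i]) / A[i, i]

assert np.linalg.norm(x - x_star) < 1e-6
```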
33 RCD: Standard Optimization Form. Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization 22(2), 2012 (CORE Discussion Paper 2010/2). Nesterov considered the problem min_{x in R^n} f(x), with f convex and smooth, and assumed that the following inequality holds for all x, h and i: f(x + h e_i) <= f(x) + grad_i f(x) h + (L_i / 2) h^2. Given a current iterate x, choosing h by minimizing the right-hand side gives Nesterov's RCD method: x^{t+1} = x^t - (1 / L_i) grad_i f(x^t) e_i. For f(x) = (1/2) x^T A x - b^T x we have L_i = A_{ii} and grad_i f(x) = A_{i:} x - b_i, so we recover RCD as we have seen it: x^{t+1} = x^t - ((A_{i:} x^t - b_i) / A_{ii}) e_i.
34 Special Case: Randomized Newton Method
35 Randomized Newton (RN). Z. Qu, P.R., M. Takáč and O. Fercoq. Stochastic Dual Newton Ascent for Empirical Risk Minimization. arXiv, 2015. Problem: min_{x in R^n} f(x) = (1/2) x^T A x - b^T x, assuming A is positive definite, so x* = A^{-1} b. RN arises as a special case for the parameters B = A and S = I_{:C} with probability p_C, where p_C >= 0 for all C subset of {1,...,n} and sum over C subset of {1,...,n} of p_C = 1. RCD is the special case with p_C = 0 whenever |C| != 1.
36 RN: Derivation. General method: x^{t+1} = x^t - B^{-1} A^T S (S^T A B^{-1} A^T S)^† S^T (Ax^t - b). Special choice of parameters: B = A and S = I_{:C} with probability p_C, giving x^{t+1} = x^t - I_{:C} ((I_{:C})^T A I_{:C})^{-1} (I_{:C})^T (Ax^t - b). This method minimizes f exactly over a random subspace spanned by the coordinates belonging to C.
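The RN step in code, as a minimal sketch. The choice of subset distribution (uniform subsets of size 3) and the problem size are illustrative assumptions; any distribution over subsets with probabilities summing to one fits the framework.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # positive definite
x_star = rng.standard_normal(n)
b = A @ x_star

# Randomized Newton: pick a random coordinate subset C and minimize f
# exactly over x^t + span{e_i : i in C}, i.e. solve the |C| x |C|
# principal subsystem (I_{:C})^T A I_{:C} exactly.
x = np.zeros(n)
for _ in range(1000):
    C = rng.choice(n, size=3, replace=False)
    g = A[C] @ x - b[C]                      # (I_{:C})^T (Ax - b)
    x[C] -= np.linalg.solve(A[np.ix_(C, C)], g)

assert np.linalg.norm(x - x_star) < 1e-6
```

With |C| = 1 this collapses to the RCD update from the previous slides.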
37 [Figure: the step from x^t to x^{t+1} for C = {2, 7}, |C| = 2: exact minimization over the plane through x^t spanned by e_2 and e_7.]
38 Summary: Linear Systems. SDA: a new class of randomized optimization algorithms. Extremely versatile: works for almost any random S. Several existing algorithms arise as special cases (RK, RCD, RN, RBK), and many new algorithms arise as special cases. Linear convergence despite the lack of strong concavity. RK in the primal = RCD in the dual. Did not talk about: randomized gossip, a distributed variant, optimal sampling via SDP, experiments.
40 Part II: Stochastic Dual Ascent for Matrix Inversion
41 Robert Mansel Gower (Edinburgh). Robert Mansel Gower and P.R. Randomized Quasi-Newton Methods are Linearly Convergent Matrix Inversion Algorithms. arXiv:1602.01768, 2016
42 The Problem: Invert a Matrix. Find X in R^{n x n} such that AX = I, where A is in R^{n x n} and I is the n x n identity matrix. Assumption 1: the matrix A is invertible.
43 Inverting Symmetric Matrices
44 1. Sketch and Project. Define the weighted Frobenius norm ||X||_{F(B)} := sqrt(Tr(X^T B X B)). X^{t+1} = arg min_{X in R^{n x n}} ||X - X^t||_{F(B)}^2 subject to S^T A X = S^T, X = X^T. Quasi-Newton updates are of this form, with S a deterministic column vector; with random matrix S we get randomized block versions of quasi-Newton updates. Randomized quasi-Newton updates are linearly convergent matrix inversion methods. Interpretation: Gaussian inference (Hennig, 2015). Donald Goldfarb. A Family of Variable-Metric Methods Derived by Variational Means. Mathematics of Computation 24(109), 1970.
45 Gaussian Inference. Philipp Hennig. Probabilistic Interpretation of Linear Solvers. SIAM Journal on Optimization 25(1), 2015. The new iterate X^{k+1} can be interpreted as the mean of a posterior distribution under a Gaussian prior with mean X^k and a noiseless (and random) linear observation of A^{-1}.
46 Randomized QN Updates.
B = I, equation AX = I: Powell-Symmetric-Broyden (PSB).
B = A^{-1}, equation XA = I: Davidon-Fletcher-Powell (DFP).
B = A, equation AX = I: Broyden-Fletcher-Goldfarb-Shanno (BFGS).
All these QN methods arise as special cases of our framework, and all are linearly convergent, with explicit convergence rates. We also recover non-symmetric updates such as Bad Broyden and Good Broyden, we get block versions, and we get randomized versions of new QN updates.
47 2. Constrain and Approximate. X^{t+1} = arg min_{X in R^{n x n}} ||X - A^{-1}||_{F(B)}^2 subject to X = X^t + S Lambda^T A B^{-1} + B^{-1} A Lambda S^T, where Lambda in R^{n x q} is free. Randomized BFGS (B = A): X^{t+1} = arg min_{X in R^{n x n}} ||X - A^{-1}||_{F(A)}^2 = ||AX - I||_F^2 subject to X = X^t + S Lambda^T + Lambda S^T, Lambda free. This is a new formulation even for standard QN methods; RBFGS performs the best symmetric rank-2 update.
48 4. Random Update. With H := S (S^T A B^{-1} A^T S)^† S^T: X^{t+1} = X^t - (X^t A - I) H A B^{-1} + B^{-1} A H (A X^t - I)(A H A B^{-1} - I). 6. Random Fixed Point: X^{t+1} - A^{-1} = (I - B^{-1} A^T H A)(X^t - A^{-1})(I - A H A^T B^{-1}).
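For B = A the factors A B^{-1} and B^{-1} A above become the identity, so the random update simplifies to a fully computable iteration, i.e. a randomized block BFGS-type inversion method. A minimal numpy sketch (the test matrix, Gaussian sketch of width 2, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # symmetric positive definite

# B = A special case of the random update:
#   H = S (S^T A S)^{-1} S^T
#   X <- X - (XA - I) H + H (AX - I) (AH - I)
# The iterates stay symmetric and converge linearly to A^{-1}.
I = np.eye(n)
X = np.zeros((n, n))
for _ in range(500):
    S = rng.standard_normal((n, 2))
    H = S @ np.linalg.solve(S.T @ A @ S, S.T)
    X = X - (X @ A - I) @ H + H @ (A @ X - I) @ (A @ H - I)

assert np.linalg.norm(A @ X - I) < 1e-6
```

The fixed-point form makes the convergence mechanism visible: the error X^t - A^{-1} is multiplied by complementary projections on both sides at every step.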
49 Complexity / Convergence. Define ||M||_B := ||B^{1/2} M B^{1/2}||_2, and let rho be the rate from Part I. Theorem [GR16]: if E[H] is positive definite then rho < 1, and (1) ||E[X^t] - A^{-1}||_B <= rho^t ||X^0 - A^{-1}||_B; (2) E[||X^t - A^{-1}||_{F(B)}^2] <= rho^t ||X^0 - A^{-1}||_{F(B)}^2.
50 Summary: Matrix Inversion. Block versions of QN updates. New points of view (constrain and approximate, ...). New link between QN methods and approximate inverse preconditioning. First time randomized QN updates are proposed. First stochastic method for matrix inversion (with complexity bounds)? Linear convergence under weak assumptions. Did not talk about: nonsymmetric variants, theoretical bounds for discretely distributed S, adaptive randomized BFGS, limited-memory and factored implementations, experiments (Newton-Schulz; MinRes), use in empirical risk minimization (Gower, Goldfarb & R. '16), extension to the computation of a pseudoinverse.
51 The End
Chapter 2 Optimization Gradients, convexity, and ALS Contents Background Gradient descent Stochastic gradient descent Newton s method Alternating least squares KKT conditions 2 Motivation We can solve
More informationAgenda. 1 Duality for LP. 2 Theorem of alternatives. 3 Conic Duality. 4 Dual cones. 5 Geometric view of cone programs. 6 Conic duality theorem
Agenda 1 Duality for LP 2 Theorem of alternatives 3 Conic Duality 4 Dual cones 5 Geometric view of cone programs 6 Conic duality theorem 7 Examples Lower bounds on LPs By eliminating variables (if needed)
More informationMethodological Foundations of Biomedical Informatics (BMSC-GA 4449) Optimization
Methodological Foundations of Biomedical Informatics (BMSCGA 4449) Optimization Op#miza#on A set of techniques for finding the values of variables at which the objec#ve func#on amains its cri#cal (minimal
More information10-725/36-725: Convex Optimization Prerequisite Topics
10-725/36-725: Convex Optimization Prerequisite Topics February 3, 2015 This is meant to be a brief, informal refresher of some topics that will form building blocks in this course. The content of the
More informationOutline Introduction: Problem Description Diculties Algebraic Structure: Algebraic Varieties Rank Decient Toeplitz Matrices Constructing Lower Rank St
Structured Lower Rank Approximation by Moody T. Chu (NCSU) joint with Robert E. Funderlic (NCSU) and Robert J. Plemmons (Wake Forest) March 5, 1998 Outline Introduction: Problem Description Diculties Algebraic
More informationLecture 7 Unconstrained nonlinear programming
Lecture 7 Unconstrained nonlinear programming Weinan E 1,2 and Tiejun Li 2 1 Department of Mathematics, Princeton University, weinan@princeton.edu 2 School of Mathematical Sciences, Peking University,
More informationNONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM
NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM JIAYI GUO AND A.S. LEWIS Abstract. The popular BFGS quasi-newton minimization algorithm under reasonable conditions converges globally on smooth
More informationFundamentals of Data Assimila1on
2015 GSI Community Tutorial NCAR Foothills Campus, Boulder, CO August 11-14, 2015 Fundamentals of Data Assimila1on Milija Zupanski Cooperative Institute for Research in the Atmosphere Colorado State University
More informationORIE 6326: Convex Optimization. Quasi-Newton Methods
ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted
More informationNumerical Methods I: Orthogonalization and Newton s method
1/42 Numerical Methods I: Orthogonalization and Newton s method Georg Stadler Courant Institute, NYU stadler@cims.nyu.edu October 5, 2017 Overview 2/42 Linear least squares and orthogonalization methods
More informationLMI Methods in Optimal and Robust Control
LMI Methods in Optimal and Robust Control Matthew M. Peet Arizona State University Lecture 02: Optimization (Convex and Otherwise) What is Optimization? An Optimization Problem has 3 parts. x F f(x) :
More informationParallel Coordinate Optimization
1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a
More informationMore First-Order Optimization Algorithms
More First-Order Optimization Algorithms Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 3, 8, 3 The SDM
More informationDual and primal-dual methods
ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method
More informationImproving L-BFGS Initialization for Trust-Region Methods in Deep Learning
Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University
More informationDS-GA 1002 Lecture notes 10 November 23, Linear models
DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.
More informationReduced-Hessian Methods for Constrained Optimization
Reduced-Hessian Methods for Constrained Optimization Philip E. Gill University of California, San Diego Joint work with: Michael Ferry & Elizabeth Wong 11th US & Mexico Workshop on Optimization and its
More informationHomework 5. Convex Optimization /36-725
Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationChapter 6 - Orthogonality
Chapter 6 - Orthogonality Maggie Myers Robert A. van de Geijn The University of Texas at Austin Orthogonality Fall 2009 http://z.cs.utexas.edu/wiki/pla.wiki/ 1 Orthogonal Vectors and Subspaces http://z.cs.utexas.edu/wiki/pla.wiki/
More informationRandom coordinate descent algorithms for. huge-scale optimization problems. Ion Necoara
Random coordinate descent algorithms for huge-scale optimization problems Ion Necoara Automatic Control and Systems Engineering Depart. 1 Acknowledgement Collaboration with Y. Nesterov, F. Glineur ( Univ.
More information