A NON-MONOTONE ALTERNATING UPDATING METHOD FOR A CLASS OF MATRIX FACTORIZATION PROBLEMS

LEI YANG, TING KEI PONG, AND XIAOJUN CHEN

Abstract. In this paper we consider a general matrix factorization model which covers a large class of existing models with many applications in areas such as machine learning and imaging sciences. To solve this possibly nonconvex, nonsmooth and non-Lipschitz problem, we develop a non-monotone alternating updating method based on a potential function. Our method essentially updates two blocks of variables in turn by inexactly minimizing this potential function, and updates another auxiliary block of variables using an explicit formula. The special structure of our potential function allows us to take advantage of efficient computational strategies for non-negative matrix factorization to perform the alternating minimization over the two blocks of variables. A suitable line search criterion is also incorporated to improve the numerical performance. Under some mild conditions, we show that the line search criterion is well defined, and establish that the sequence generated is bounded and any cluster point of the sequence is a stationary point. Finally, we conduct some numerical experiments using real datasets to compare our method with some existing efficient methods for non-negative matrix factorization and matrix completion. The numerical results show that our method can outperform these methods for these specific applications.

Key words. Matrix factorization; non-monotone line search; stationary point; alternating updating.

AMS subject classifications. 90C26, 90C30, 90C90, 65K05

1. Introduction. In this paper we consider a class of matrix factorization problems, which can be modeled as

(1.1)    min_{X,Y} F(X, Y) := Ψ(X) + Φ(Y) + (1/2)‖A(XY^T) − b‖^2,

where X ∈ R^{m×r} and Y ∈ R^{n×r} are decision variables with r ≤ min{m, n}, the functions Ψ : R^{m×r} → R ∪ {∞} and Φ : R^{n×r} → R ∪ {∞} are proper closed but possibly nonconvex, nonsmooth and non-Lipschitz, b ∈ R^q is a given vector and A : R^{m×n} → R^q is a linear map with q ≤ mn and AA^* = I_q (I_q denotes the identity map from R^q to R^q). Model (1.1) covers many existing widely-studied models in many application areas such as machine learning [35] and imaging sciences [44]. In particular, Ψ(X) and Φ(Y) can be various regularizers for inducing desired structures, and A can be suitably chosen to model different scenarios. For example, when Ψ(X) and Φ(Y) are chosen as the indicator functions (see the next section for notation and definitions) of X = {X ∈ R^{m×r} : X ≥ 0} and Y = {Y ∈ R^{n×r} : Y ≥ 0}, respectively, and A is the identity map, (1.1) reduces to the non-negative matrix factorization (NMF) problem, which has been widely used in data mining applications to provide interpretable decompositions of data. NMF was first introduced by Paatero and Tapper [25], and then popularized by Lee and Seung [17]. The basic task of NMF is to find two nonnegative matrices X ∈ R^{m×r}_+ and Y ∈ R^{n×r}_+ such that M ≈ XY^T for a given nonnegative data matrix M ∈ R^{m×n}_+. We refer readers to [2, 9, 10, 18, 37] for more information on NMF and its variants. Another example of (1.1) arises in recent models of the matrix completion (MC) problem (see [30, 31, 32]), where Ψ(X)

Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, P.R. China (lei.yang@connect.polyu.hk, tk.pong@polyu.edu.hk, xiaojun.chen@polyu.edu.hk). The second author's work was supported in part by Hong Kong Research Grants Council PolyU153085/16p. The third author's work was supported in part by Hong Kong Research Grants Council PolyU153000/15p.

and Φ(Y) are chosen as the Schatten-p_1 quasi-norm and the Schatten-p_2 quasi-norm for suitable p_1, p_2 > 0, respectively, and A is the sampling map. The MC problem aims to recover an unknown low-rank matrix from a sample of its entries and arises in various applications (see, for example, [3, 2, 27, 33]). Many widely-studied models for MC are based on nuclear-norm minimization [5, 6, 26], or, more generally, Schatten-p (0 < p ≤ 1) (quasi-)norm minimization [16, 23, 24]. Recently, models based on low-rank matrix factorization such as (1.1) have become popular because singular value decompositions or eigenvalue decompositions of huge (m × n) matrices are not required for solving these models (see, for example, [15, 30, 31, 32, 34, 38]). More examples of (1.1) can be found in recent surveys [35, 44].

Problem (1.1) is in general nonconvex (even when Ψ, Φ are convex) and NP-hard^1. Therefore, in this paper, we focus on finding a stationary point of the objective F in (1.1). Note that F involves two blocks of variables. This kind of structure has been widely studied in the literature; see, for example, [1, 4, 13, 14, 40, 41, 43]. One popular class of methods for tackling this kind of problems is the alternating direction method of multipliers (ADMM) (see, for example, [41, 43]), in which each iteration consists of an alternating minimization of an augmented Lagrangian function that involves X, Y and some auxiliary variables, followed by updates of the associated multipliers. However, the conditions presented in [41, 43] that guarantee convergence of the ADMM are too restrictive. Moreover, updating the auxiliary variables and the multipliers can be expensive for large-scale problems. Another class of methods for (1.1) is the alternating-minimization-based (or block-coordinate-descent-type) methods (see [1, 4, 8, 11, 20, 21, 40]), which alternately (exactly or inexactly) minimize F(X, Y) over each block of variables and converge under some mild conditions. When A is not the identity map, the majorization technique can be used to simplify the subproblems. Some representative algorithms of this class are proximal alternating linearized minimization (PALM) [4], hierarchical alternating least squares (HALS) (for NMF only; see [8, 11, 20, 21]) and block coordinate descent (BCD) [40]. Compared with ADMM, it was reported in [40] that BCD outperforms ADMM in both CPU time and solution quality for NMF. PALM, HALS and BCD are currently the state-of-the-art algorithms for solving problems of the form (1.1). In this paper, we develop a new iterative method for (1.1), which, according to our numerical experiments in Section 6, outperforms HALS and BCD for NMF, and PALM for MC. Our method is based on the following potential function (specifically constructed for F in (1.1)):

(1.2)    Θ_{α,β}(X, Y, Z) := Ψ(X) + Φ(Y) + (α/2)‖XY^T − Z‖_F^2 + (β/2)‖A(Z) − b‖^2,

where α and β are real numbers. Instead of alternately (exactly or inexactly) minimizing F(X, Y) or the augmented Lagrangian function, our method alternately updates X and Y by inexactly minimizing Θ_{α,β}(X, Y, Z) over X and Y, and then updates Z by an explicit formula. Note that the coupled variable XY^T is now separated from A in our potential function. Thus, one can readily take advantage of efficient computational strategies for NMF, such as those used in HALS (see the hierarchical-prox updating strategy in Section 4), for inexactly minimizing Θ_{α,β}(X, Y, Z) over X or Y.

^1 Problem (1.1) is NP-hard because it contains NMF as a special case, which is NP-hard in general [36].
Furthermore, our method can be implemented for NMF and MC without explicitly forming the huge (m × n) matrix Z (see (6.3) and (6.5)) in each iteration. This significantly reduces the computational cost per iteration.

Finally, a suitable non-monotone line search criterion, which is motivated by recent studies on non-monotone algorithms (see, for example, [7, 12, 39]), is also incorporated to improve the numerical performance.

In the rest of this paper, we first present notation and preliminaries in Section 2. We then study the properties of our potential function Θ_{α,β} in Section 3. Specifically, if AA^* = I_q and α, β are chosen such that αI + βA^*A ≻ 0 and 1/α + 1/β = 1, then the problem min_{X,Y,Z} {Θ_{α,β}(X, Y, Z)} is equivalent to (1.1) (see Theorem 3.2). Furthermore, under the weaker conditions that AA^* = I_q and 1/α + 1/β = 1, we can show that (i) a stationary point of Θ_{α,β} gives a stationary point of F; (ii) a stationary point of F can be used to construct a stationary point of Θ_{α,β} (see Theorem 3.3). Thus, one can find a stationary point of F by finding a stationary point of Θ_{α,β}. In Section 4, we develop a non-monotone alternating updating method to find a stationary point of Θ_{α,β}, and hence of F. The convergence analysis of our method is presented in Section 5. We show that our non-monotone line search criterion is well defined and any cluster point of the sequence generated by our method is a stationary point of F under some mild conditions. Section 6 gives numerical experiments to evaluate the performance of our method for NMF and MC on real datasets. Our computational results illustrate the efficiency of our method. Finally, some concluding remarks are given in Section 7.

2. Notation and preliminaries. In this paper, for a vector x ∈ R^m, x_i denotes its i-th entry, ‖x‖ denotes the Euclidean norm of x and Diag(x) denotes the diagonal matrix whose i-th diagonal element is x_i. For a matrix X ∈ R^{m×n}, x_{ij} denotes its (i,j)-th entry, x_j denotes the j-th column of X and tr(X) denotes the trace of X. The Schatten-p (quasi-)norm (0 < p < ∞) of X is defined as ‖X‖_{S_p} = (Σ_{i=1}^{min{m,n}} ς_i(X)^p)^{1/p}, where ς_i(X) is the i-th singular value of X. For p = 2, the Schatten-2 norm reduces to the Frobenius norm ‖X‖_F, and for p = 1, the Schatten-1 norm reduces to the nuclear norm ‖X‖_*. Moreover, the spectral norm is denoted by ‖X‖, which is the largest singular value of X; and the l_1-norm and l_p-quasi-norm (0 < p < 1) of X are given by ‖X‖_1 := Σ_{i=1}^m Σ_{j=1}^n |x_{ij}| and ‖X‖_p := (Σ_{i=1}^m Σ_{j=1}^n |x_{ij}|^p)^{1/p}, respectively. For two matrices X and Y of the same size, we denote their trace inner product by ⟨X, Y⟩ := Σ_{i=1}^m Σ_{j=1}^n x_{ij} y_{ij}. We also use X ≥ Y (resp., X ≤ Y) to denote x_{ij} ≥ y_{ij} (resp., x_{ij} ≤ y_{ij}) for all (i, j). Furthermore, for a linear map A : R^{m×n} → R^q, A^* denotes the adjoint linear map and ‖A‖ denotes the induced operator norm of A, i.e., ‖A‖ = sup{‖A(X)‖ : ‖X‖_F ≤ 1}. A linear self-map T is said to be symmetric if T = T^*. For a symmetric linear self-map T : R^{m×n} → R^{m×n}, we say that T is positive definite, denoted by T ≻ 0, if ⟨X, T(X)⟩ > 0 for all X ≠ 0. The identity map from R^{m×n} to R^{m×n} is denoted by I and the identity map from R^q to R^q is denoted by I_q. Finally, for a nonempty closed set C ⊆ R^{m×n}, its indicator function δ_C is defined by δ_C(X) = 0 if X ∈ C, and δ_C(X) = +∞ otherwise. For an extended-real-valued function f : R^{m×n} → [−∞, ∞], we say that it is proper if f(X) > −∞ for all X ∈ R^{m×n} and its domain dom f := {X ∈ R^{m×n} : f(X) < ∞} is nonempty. A function f : R^{m×n} → [−∞, ∞] is level-bounded [28, Definition 1.8] if for every α ∈ R, the set {X ∈ R^{m×n} : f(X) ≤ α} is bounded (possibly empty). For a proper function f : R^{m×n} → (−∞, ∞], we use the notation

Y →_f X to denote Y → X (i.e., ‖Y − X‖_F → 0) and f(Y) → f(X). The (limiting) subdifferential [28, Definition 8.3] of f at X ∈ dom f used in this paper, denoted by ∂f(X), is defined as

∂f(X) := { D ∈ R^{m×n} : ∃ X^k →_f X and D^k → D with D^k ∈ ∂̂f(X^k) for all k },

where ∂̂f(Ỹ) denotes the Fréchet subdifferential of f at Ỹ ∈ dom f, which is the set of all D ∈ R^{m×n} satisfying

lim inf_{Y→Ỹ, Y≠Ỹ}  [ f(Y) − f(Ỹ) − ⟨D, Y − Ỹ⟩ ] / ‖Y − Ỹ‖_F  ≥ 0.

From the above definition, we can easily observe (see, for example, [28, Proposition 8.7]) that

(2.1)    { D ∈ R^{m×n} : ∃ X^k →_f X, D^k → D, D^k ∈ ∂f(X^k) } ⊆ ∂f(X).

When f is continuously differentiable or convex, the above subdifferential coincides with the classical concept of derivative or convex subdifferential of f; see, for example, [28, Exercise 8.8] and [28, Proposition 8.12]. In this paper, we say that X^* is a stationary point of f if 0 ∈ ∂f(X^*). For a proper closed function g : R^m → (−∞, ∞], the proximal mapping Prox_g : R^m → R^m of g is defined by

Prox_g(z) := Argmin_{x ∈ R^m} { g(x) + (1/2)‖x − z‖^2 }.

For any ν > 0, the matrix shrinkage operator S_ν : R^{m×n} → R^{m×n} is defined by

S_ν(X) := U Diag(s̄) V^T  with  s̄_i = s_i − ν if s_i − ν > 0, and s̄_i = 0 otherwise,

where U ∈ R^{m×t}, s ∈ R^t_+ and V ∈ R^{n×t} are given by the singular value decomposition of X, i.e., X = U Diag(s) V^T. We now present two propositions, which will be useful for developing our method in Section 4.

Proposition 2.1. Suppose that AA^* = I_q and α(α + β) ≠ 0. Then, αI + βA^*A is invertible and its inverse is given by (1/α)(I − (β/(α+β)) A^*A).

Proof. It is easy to check that (1/α)(I − (β/(α+β)) A^*A) is well defined since α(α+β) ≠ 0, and that (αI + βA^*A)((1/α)(I − (β/(α+β)) A^*A)) = I. This completes the proof.
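As a quick illustration of the proximal mapping and the matrix shrinkage operator defined above, the following NumPy sketch (ours, not part of the paper) implements S_ν by soft-thresholding the singular values; this is also the well-known proximal mapping of ν times the nuclear norm.

```python
import numpy as np

def matrix_shrinkage(X, nu):
    """Matrix shrinkage operator S_nu(X): soft-threshold the singular values of X by nu."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # reduced SVD: X = U diag(s) V^T
    s_bar = np.maximum(s - nu, 0.0)                    # keep s_i - nu when positive, else 0
    return (U * s_bar) @ Vt                            # U diag(s_bar) V^T
```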

Proposition 2.2. Let ψ : R^m → (−∞, ∞] and φ : R^n → (−∞, ∞] be proper closed functions. Given P, Q ∈ R^{m×n} and a ∈ R^n, b ∈ R^m with a ≠ 0, b ≠ 0, the following statements hold.
(i) The problem min_{x ∈ R^m} { ψ(x) + (1/2)‖xa^T − P‖_F^2 } is equivalent to min_{x ∈ R^m} { ψ(x) + (‖a‖^2/2)‖x − Pa/‖a‖^2‖^2 };
(ii) The problem min_{y ∈ R^n} { φ(y) + (1/2)‖by^T − Q‖_F^2 } is equivalent to min_{y ∈ R^n} { φ(y) + (‖b‖^2/2)‖y − Q^T b/‖b‖^2‖^2 }.

Proof. Statement (i) can be easily proved by noticing that

‖xa^T − P‖_F^2 = ‖xa^T‖_F^2 − 2⟨xa^T, P⟩ + ‖P‖_F^2 = ‖a‖^2‖x‖^2 − 2⟨x, Pa⟩ + ‖P‖_F^2 = ‖a‖^2 ‖x − Pa/‖a‖^2‖^2 − ‖Pa‖^2/‖a‖^2 + ‖P‖_F^2.

Then, statement (ii) can be easily proved by using statement (i) and ‖by^T − Q‖_F = ‖yb^T − Q^T‖_F.

Before ending this section, we discuss the first-order necessary conditions for (1.1). First, from [28, Exercise 8.8] and [28, Proposition 10.5], we see that

∂F(X, Y) = ( ∂Ψ(X) + A^*(A(XY^T) − b)Y ) × ( ∂Φ(Y) + (A^*(A(XY^T) − b))^T X ).

Then, it follows from the generalized Fermat's rule [28, Theorem 10.1] that any local minimizer (X̄, Ȳ) of (1.1) satisfies 0 ∈ ∂F(X̄, Ȳ), i.e.,

(2.2)    0 ∈ ∂Ψ(X̄) + A^*(A(X̄Ȳ^T) − b)Ȳ,    0 ∈ ∂Φ(Ȳ) + (A^*(A(X̄Ȳ^T) − b))^T X̄,

which implies that (X̄, Ȳ) is a stationary point of F. In this paper, we focus on finding a stationary point (X^*, Y^*) of F, i.e., (X^*, Y^*) satisfies (2.2) in place of (X̄, Ȳ).

3. The potential function for F. In this section, we analyze the relation between F and its potential function Θ_{α,β} defined in (1.2). Intuitively, Θ_{α,β} originates from F by separating the coupled variable XY^T from the linear map A via introducing an auxiliary variable Z and penalizing XY^T = Z. We will see later that the stationary points of F can be characterized by the stationary points of Θ_{α,β}. Before proceeding, we prove the following technical lemma.

Lemma 3.1. Suppose that AA^* = I_q and 1/α + 1/β = 1. Then, for any (X, Y, Z) satisfying

(3.1)    Z = (I − (β/(α+β)) A^*A)(XY^T) + (β/(α+β)) A^*(b),

we have F(X, Y) = Θ_{α,β}(X, Y, Z).

Proof. First, from (3.1), we have

(3.2)    XY^T − Z = (β/(α+β)) A^*(A(XY^T) − b),
(3.3)    A(Z) − b = A(XY^T) − (β/(α+β)) AA^*(A(XY^T) − b) − b = (α/(α+β)) (A(XY^T) − b),

where the last equality follows from AA^* = I_q. Then, we see that

(α/2)‖XY^T − Z‖_F^2 + (β/2)‖A(Z) − b‖^2
= (αβ^2/(2(α+β)^2)) ‖A^*(A(XY^T) − b)‖_F^2 + (α^2β/(2(α+β)^2)) ‖A(XY^T) − b‖^2
= (αβ^2/(2(α+β)^2)) ⟨A(XY^T) − b, AA^*(A(XY^T) − b)⟩ + (α^2β/(2(α+β)^2)) ‖A(XY^T) − b‖^2
= (αβ^2/(2(α+β)^2)) ‖A(XY^T) − b‖^2 + (α^2β/(2(α+β)^2)) ‖A(XY^T) − b‖^2
= (αβ/(2(α+β))) ‖A(XY^T) − b‖^2,

where the first equality follows from (3.2) and (3.3), and the third equality follows from AA^* = I_q. This, together with 1/α + 1/β = 1 and the definitions of F and Θ_{α,β}, completes the proof.

Based on the above lemma, we now establish the following property of Θ_{α,β}.

Theorem 3.2. Suppose that AA^* = I_q. If α and β are chosen such that αI + βA^*A ≻ 0 and 1/α + 1/β = 1, then the problem min_{X,Y,Z} {Θ_{α,β}(X, Y, Z)} is equivalent to (1.1).

Proof. First, it is easy to see from αI + βA^*A ≻ 0 that the function Z ↦ Θ_{α,β}(X, Y, Z) is strongly convex. Thus, for any fixed X and Y, the optimal solution Z^* to the problem min_Z {Θ_{α,β}(X, Y, Z)} exists and is unique, and can be obtained explicitly. Indeed, from the optimality condition, we have

α(Z^* − XY^T) + βA^*(A(Z^*) − b) = 0.

Then, since αI + βA^*A is invertible (as αI + βA^*A ≻ 0), we see that

Z^* = (αI + βA^*A)^{−1}[αXY^T + βA^*(b)]
    = (1/α)(I − (β/(α+β)) A^*A)[αXY^T + βA^*(b)]
    = (I − (β/(α+β)) A^*A)(XY^T) + (β/α) A^*(b) − (β^2/(α(α+β))) A^*(AA^*(b))
    = (I − (β/(α+β)) A^*A)(XY^T) + (β/α − β^2/(α(α+β))) A^*(b)
    = (I − (β/(α+β)) A^*A)(XY^T) + (β/(α+β)) A^*(b),

where the second equality follows from Proposition 2.1 and the fourth equality follows from AA^* = I_q. This, together with Lemma 3.1, implies that F(X, Y) = Θ_{α,β}(X, Y, Z^*). Then, we have that

min_{X,Y,Z} {Θ_{α,β}(X, Y, Z)} = min_{X,Y} { min_Z Θ_{α,β}(X, Y, Z) } = min_{X,Y} {Θ_{α,β}(X, Y, Z^*)} = min_{X,Y} {F(X, Y)}.

This completes the proof.

Remark 3.1. From the proof of Lemma 3.1, we see that if Φ and Ψ are the indicator functions of some nonempty closed sets, then F(X, Y) = (1/α + 1/β) Θ_{α,β}(X, Y, Z) holds with the special choice of Z in (3.1) whenever AA^* = I_q and 1/α + 1/β > 0. Thus, the result in Theorem 3.2 remains valid whenever AA^* = I_q and α, β are chosen such that αI + βA^*A ≻ 0 and 1/α + 1/β > 0.

We see from Theorem 3.2 that (1.1) is equivalent to minimizing Θ_{α,β} with some suitable choices of α and β. On the other hand, we can also characterize the relation between the stationary points of F and Θ_{α,β} under weaker conditions on α and β.

Theorem 3.3. Suppose that AA^* = I_q and α, β are chosen such that 1/α + 1/β = 1. Then, the following statements hold.
(i) If (X^*, Y^*, Z^*) is a stationary point of Θ_{α,β}, then (X^*, Y^*) is a stationary point of F;
(ii) If (X^*, Y^*) is a stationary point of F, then (X^*, Y^*, Z^*) is a stationary point of Θ_{α,β}, where Z^* is given by

(3.4)    Z^* = (I − (β/(α+β)) A^*A)(X^*(Y^*)^T) + (β/(α+β)) A^*(b).

Proof. First, if (X^*, Y^*, Z^*) is a stationary point of Θ_{α,β}, then we have 0 ∈ ∂Θ_{α,β}(X^*, Y^*, Z^*), i.e.,

(3.5a)    0 ∈ ∂Ψ(X^*) + α(X^*(Y^*)^T − Z^*)Y^*,
(3.5b)    0 ∈ ∂Φ(Y^*) + α(X^*(Y^*)^T − Z^*)^T X^*,
(3.5c)    0 = α(Z^* − X^*(Y^*)^T) + βA^*(A(Z^*) − b).

Since 1/α + 1/β = 1, we have α(α+β) ≠ 0 and hence αI + βA^*A is invertible from Proposition 2.1. Then, using the same arguments as in the proof of Theorem 3.2, we see from (3.5c) that (X^*, Y^*, Z^*) satisfies (3.4). Moreover, using (3.4) and the same arguments as in (3.2) and (3.3), we have

(3.6)    X^*(Y^*)^T − Z^* = (β/(α+β)) A^*(A(X^*(Y^*)^T) − b),
(3.7)    A(Z^*) − b = (α/(α+β)) (A(X^*(Y^*)^T) − b).

Thus, substituting (3.6) into (3.5a) and (3.5b), we see that

(3.8)    0 ∈ ∂Ψ(X^*) + (αβ/(α+β)) A^*(A(X^*(Y^*)^T) − b)Y^*,    0 ∈ ∂Φ(Y^*) + (αβ/(α+β)) (A^*(A(X^*(Y^*)^T) − b))^T X^*.

This together with 1/α + 1/β = 1 implies that (X^*, Y^*) is a stationary point of F. This proves statement (i).

We now prove statement (ii). First, if (X^*, Y^*) is a stationary point of F, then invoking 1/α + 1/β = 1 and (2.2), we have (3.8). Next, we consider (X^*, Y^*, Z^*) with Z^* given by (3.4). Then, (X^*, Y^*, Z^*) satisfies (3.6) and (3.7). Thus, substituting (3.6) into (3.8), we obtain (3.5a) and (3.5b). Moreover, we have from (3.6) and (3.7) that

(3.9)    α(Z^* − X^*(Y^*)^T) + βA^*(A(Z^*) − b) = −(αβ/(α+β)) A^*(A(X^*(Y^*)^T) − b) + βA^*((α/(α+β))(A(X^*(Y^*)^T) − b)) = 0.

This together with (3.5a) and (3.5b) implies that (X^*, Y^*, Z^*) is a stationary point of Θ_{α,β}. This proves statement (ii).

Remark 3.2. From the proof of Theorem 3.3, one can see that if ∂Ψ and ∂Φ are cones, Theorem 3.3 remains valid under the weaker conditions that AA^* = I_q and 1/α + 1/β > 0.

From Theorem 3.3, we see that a stationary point of F can be obtained from a stationary point of Θ_{α,β} with a suitable choice of α and β, i.e., 1/α + 1/β = 1. Since the linear map A is no longer associated with the coupled variable XY^T in Θ_{α,β}, finding a stationary point of Θ_{α,β} is conceivably easier. Thus, one can consider finding a stationary point of Θ_{α,β} in order to find a stationary point of F. Note that some existing alternating-minimization-based methods (see, for example, [1, 40]) can be used to find a stationary point of Θ_{α,β}, and hence of F, under the conditions that AA^* = I_q and α, β are chosen so that αI + βA^*A ≻ 0 and 1/α + 1/β = 1. These conditions further imply that α > 1 and β = α/(α−1) > 1. However, as we will see from our numerical results in Section 6, finding a stationary point of Θ_{α,β} with α > 1 can be slow. In view of this, in the next section, we develop a new non-monotone alternating updating method for finding a stationary point of Θ_{α,β} (and hence of F) under the weaker conditions that AA^* = I_q and 1/α + 1/β = 1. This allows more flexibility in choosing α and β.
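As a concrete illustration of the explicit formula (3.4) (equivalently (3.1), and (4.5) below), the following NumPy sketch (ours, not from the paper) evaluates Z^* when A is the sampling map P_Ω of the matrix completion setting, for which AA^* = I_q. Representing b as a dense array supported on Ω is an implementation assumption.

```python
import numpy as np

def z_update(X, Y, b_obs, mask, alpha, beta):
    """Closed-form Z from (3.4)/(3.1) when A = P_Omega (sampling map).

    mask:  boolean (m, n) array marking the observed entries Omega.
    b_obs: (m, n) array holding the observed values on Omega (arbitrary elsewhere).
    """
    w = beta / (alpha + beta)
    Z = X @ Y.T                                      # off Omega, (I - w A*A) acts as the identity
    Z[mask] = (1 - w) * Z[mask] + w * b_obs[mask]    # on Omega, blend X Y^T with the data
    return Z
```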

4. Non-monotone alternating updating method. In this section, we consider a non-monotone alternating updating method (NAUM) for finding a stationary point of Θ_{α,β} with 1/α + 1/β = 1. Compared to existing alternating-minimization-based methods [1, 40] applied to Θ_{α,β}, which update X, Y, Z by alternately solving subproblems related to Θ_{α,β}, NAUM updates Z by an explicit formula (see (4.5)) and updates X, Y by solving subproblems related to Θ_{α,β} in a Gauss-Seidel manner. Before presenting the complete algorithm, we first comment on the updates of X and Y.

Let (X^k, Y^k) denote the value of (X, Y) after the (k−1)-th iteration, and let (U^k, V^k) denote the candidate for (X^{k+1}, Y^{k+1}) at the k-th iteration (we will set (X^{k+1}, Y^{k+1}) to be (U^k, V^k) if a line search criterion is satisfied; more details can be found in Algorithm 1). For notational simplicity, we also define

H_α(X, Y, Z) := (α/2)‖XY^T − Z‖_F^2

for any (X, Y, Z). Then, at the k-th iteration, we first compute Z^k by (4.5) and, in the line search loop, we compute U^k in one of the following 3 ways for a given μ_k > 0:

Proximal
(4.1a)    U^k ∈ Argmin_X { Ψ(X) + H_α(X, Y^k, Z^k) + (μ_k/2)‖X − X^k‖_F^2 }.

Prox-linear
(4.1b)    U^k ∈ Argmin_X { Ψ(X) + ⟨∇_X H_α(X^k, Y^k, Z^k), X − X^k⟩ + (μ_k/2)‖X − X^k‖_F^2 }.

Hierarchical-prox
If Ψ is column-wise separable, i.e., Ψ(X) = Σ_{i=1}^r ψ_i(x_i) for X = [x_1, ..., x_r] ∈ R^{m×r}, we can update U^k column by column. Specifically, for i = 1, 2, ..., r, compute

(4.1c)    u_i^k ∈ Argmin_{x_i} { ψ_i(x_i) + H_α([u^k_{j<i}, x_i, x^k_{j>i}], Y^k, Z^k) + (μ_k/2)‖x_i − x_i^k‖^2 },

where u^k_{j<i} denotes (u_1^k, ..., u_{i−1}^k) and x^k_{j>i} denotes (x_{i+1}^k, ..., x_r^k).

After computing U^k, we compute V^k in one of the following 3 ways for a given σ_k > 0:

Proximal
(4.2a)    V^k ∈ Argmin_Y { Φ(Y) + H_α(U^k, Y, Z^k) + (σ_k/2)‖Y − Y^k‖_F^2 }.

Prox-linear
(4.2b)    V^k ∈ Argmin_Y { Φ(Y) + ⟨∇_Y H_α(U^k, Y^k, Z^k), Y − Y^k⟩ + (σ_k/2)‖Y − Y^k‖_F^2 }.

Hierarchical-prox
If Φ is column-wise separable, i.e., Φ(Y) = Σ_{i=1}^r φ_i(y_i) for Y = [y_1, ..., y_r] ∈ R^{n×r}, we can update V^k column by column. Specifically, for i = 1, 2, ..., r, compute

(4.2c)    v_i^k ∈ Argmin_{y_i} { φ_i(y_i) + H_α(U^k, [v^k_{j<i}, y_i, y^k_{j>i}], Z^k) + (σ_k/2)‖y_i − y_i^k‖^2 },

where v^k_{j<i} denotes (v_1^k, ..., v_{i−1}^k) and y^k_{j>i} denotes (y_{i+1}^k, ..., y_r^k).
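To make the prox-linear option concrete, the following NumPy sketch (ours) carries out the update (4.1b) for a generic, user-supplied proximal mapping prox_psi, which is an assumption and not part of the paper.

```python
import numpy as np

def prox_linear_X_update(X, Y, Z, mu, alpha, prox_psi):
    """Prox-linear update (4.1b): linearize H_alpha at (X^k, Y^k, Z^k), then take a prox step.

    prox_psi(V, t) should return argmin_X { Psi(X) + (1/(2t)) ||X - V||_F^2 }.
    """
    grad = alpha * (X @ Y.T - Z) @ Y          # gradient of H_alpha with respect to X
    return prox_psi(X - grad / mu, 1.0 / mu)  # proximal step with step size 1/mu
```

For instance, when Ψ is the indicator function of the nonnegative orthant, prox_psi(V, t) is simply np.maximum(V, 0.0), independently of t.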

For notational simplicity, we further let

(4.3)    ρ := ‖I − (β/(α+β)) A^*A‖

and let γ ≥ 0 be a nonnegative number satisfying

(4.4)    (α + γ)I + βA^*A ⪰ 0.

Remark 4.1 (Comments on hierarchical-prox). The hierarchical-prox updating scheme requires the column-wise separability of Ψ or Φ. This is satisfied for many common regularizers, for example, ‖·‖_F^2, ‖·‖_1, ‖·‖_p^p (0 < p < 1), and the indicator function of the nonnegativity (or box) constraint.

Remark 4.2 (Comments on ρ and γ). Since AA^* = I_q, we see that the eigenvalues of A^*A are either 0 or 1. Then, the eigenvalues of I − (β/(α+β)) A^*A must be either 1 or α/(α+β), and hence ρ = max{1, |α/(α+β)|}. Similarly, the eigenvalues of αI + βA^*A are either α or (α+β). Then, (4.4) is satisfied whenever γ ≥ max{0, −α, −(α+β)}.

Now, we are ready to present NAUM as Algorithm 1.

Algorithm 1 NAUM for finding a stationary point of F
Input: (X^0, Y^0), α and β such that 1/α + 1/β = 1, ρ as in (4.3), γ ≥ 0 satisfying (4.4), τ > 1, c > 0, μ_min > 0, σ_max > σ_min > 0, and an integer N ≥ 0. Set k = 0.
while a termination criterion is not met, do
  Step 1. Compute Z^k by
  (4.5)    Z^k = (I − (β/(α+β)) A^*A)(X^k(Y^k)^T) + (β/(α+β)) A^*(b).
  Step 2. Choose μ_k^0 ≥ μ_min and σ_k^0 ∈ [σ_min, σ_max] arbitrarily. Set μ_k = μ_k^0, σ_k = σ_k^0 and μ_k^max = (α + 2γρ^2)‖Y^k‖^2 + c.
    (2a) Set μ_k ← min{μ_k, μ_k^max}. Compute U^k by either (4.1a), (4.1b) or (4.1c).
    (2b) Compute V^k by either (4.2a), (4.2b) or (4.2c).
    (2c) If
    (4.6)    F(U^k, V^k) ≤ max_{[k−N]_+ ≤ i ≤ k} F(X^i, Y^i) − (c/2)(‖U^k − X^k‖_F^2 + ‖V^k − Y^k‖_F^2),
    then go to Step 3.
    (2d) If μ_k = μ_k^max, set σ_k^max = (α + 2γρ^2)‖U^k‖^2 + c, σ_k ← min{τσ_k, σ_k^max} and then go to step (2b); otherwise, set μ_k ← τμ_k and σ_k ← τσ_k and then go to step (2a).
  Step 3. Set X^{k+1} ← U^k, Y^{k+1} ← V^k, μ̄_k ← μ_k, σ̄_k ← σ_k, k ← k + 1 and go to Step 1.
end while
Output: (X^k, Y^k)
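The following NumPy sketch (ours, not the authors' implementation) illustrates the outer structure of Algorithm 1 for the special case A = I, using the prox-linear updates (4.1b)/(4.2b) for brevity. The objective pieces Psi, Phi and their proximal mappings prox_psi, prox_phi are user-supplied assumptions, and some bookkeeping of Step 2 is simplified.

```python
import numpy as np

def naum_identity(M, X, Y, Psi, Phi, prox_psi, prox_phi, alpha,
                  c=1e-4, tau=4.0, N=3, mu_min=1.0, sig_min=1.0, sig_max=1e6, iters=100):
    """Sketch of NAUM (Algorithm 1) for A = I, with prox-linear updates.

    prox_psi(V, t) ~ argmin_X Psi(X) + (1/(2t))||X - V||_F^2, and similarly prox_phi.
    """
    beta = alpha / (alpha - 1.0)                  # enforce 1/alpha + 1/beta = 1 (requires alpha != 1)
    w = beta / (alpha + beta)
    rho = max(1.0, abs(alpha / (alpha + beta)))   # Remark 4.2
    gamma = max(0.0, -alpha, -(alpha + beta))     # satisfies (4.4)
    F = lambda X, Y: Psi(X) + Phi(Y) + 0.5 * np.linalg.norm(X @ Y.T - M, 'fro')**2
    hist, mu_bar, sig_bar = [F(X, Y)], 1.0, 1.0
    for k in range(iters):
        Z = (1 - w) * (X @ Y.T) + w * M           # explicit Z-update (4.5) with A = I
        mu = max(0.1 * mu_bar, mu_min)
        sig = min(max(0.1 * sig_bar, sig_min), sig_max)
        mu_max = (alpha + 2 * gamma * rho**2) * np.linalg.norm(Y, 2)**2 + c
        while True:                               # non-monotone line search (Step 2)
            mu = min(mu, mu_max)
            U = prox_psi(X - alpha * (X @ Y.T - Z) @ Y / mu, 1.0 / mu)      # (4.1b)
            V = prox_phi(Y - alpha * (U @ Y.T - Z).T @ U / sig, 1.0 / sig)  # (4.2b)
            thresh = max(hist[-(N + 1):]) - 0.5 * c * (
                np.linalg.norm(U - X, 'fro')**2 + np.linalg.norm(V - Y, 'fro')**2)
            if F(U, V) <= thresh:                 # acceptance test (4.6)
                break
            if mu == mu_max:                      # step (2d), first branch
                sig_cap = (alpha + 2 * gamma * rho**2) * np.linalg.norm(U, 2)**2 + c
                sig = min(tau * sig, sig_cap)
            else:                                 # step (2d), second branch
                mu, sig = tau * mu, tau * sig
        X, Y, mu_bar, sig_bar = U, V, mu, sig     # Step 3
        hist.append(F(X, Y))
    return X, Y
```

In this sketch U^k is recomputed even when μ_k has reached its cap; with deterministic proximal mappings this coincides with step (2d), which only recomputes V^k.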

In Algorithm 1, the update for Z is given explicitly. This is motivated by the condition on Z at a stationary point of Θ_{α,β}; see (3.5c). In fact, following the same arguments as in (3.9), we see that (3.5c) always holds at (X^k, Y^k, Z^k) with Z^k given in (4.5) when AA^* = I_q and 1/α + 1/β = 1. If, in addition, αI + βA^*A ≻ 0 holds, one can show that Z^k is actually the optimal solution to the problem min_Z {Θ_{α,β}(X^k, Y^k, Z)}. In this case, our NAUM (with N = 0 in (4.6)) can be viewed as an alternating-minimization-based method (see, for example, [1, 40]) applied to the problem min_{X,Y,Z} {Θ_{α,β}(X, Y, Z)}. However, if αI + βA^*A is not positive semidefinite, then the corresponding inf_Z {Θ_{α,β}(X^k, Y^k, Z)} = −∞ for all k, and Z^k is only a stationary point of Z ↦ Θ_{α,β}(X^k, Y^k, Z).^2 In this case, the function value of Θ_{α,β} may increase after updating Z by (4.5). Fortunately, as we shall see later in (5.8) and (5.9), as long as AA^* = I_q and 1/α + 1/β = 1, we still have Θ_{α,β}(X^{k+1}, Y^{k+1}, Z^k) < Θ_{α,β}(X^k, Y^k, Z^k) by updating X^{k+1} and Y^{k+1} with properly chosen parameters μ_k and σ_k. Thus, if the possible increase in Θ_{α,β} induced by the Z-update is not too large, one can still ensure Θ_{α,β}(X^{k+1}, Y^{k+1}, Z^{k+1}) < Θ_{α,β}(X^k, Y^k, Z^k). Moreover, it can be seen from Lemma 3.1 and (4.5) that F(X^k, Y^k) = Θ_{α,β}(X^k, Y^k, Z^k), and hence the decrease of Θ_{α,β} translates to that of F (see Lemma 5.1 below). In view of this, Θ_{α,β} is a valid potential function for minimizing F as long as AA^* = I_q and 1/α + 1/β = 1, even when β < 0 or α < 0. Allowing negative α or β makes our NAUM (even with N = 0 in (4.6)) different from the classical alternating minimization schemes.

Our NAUM also allows U^k and V^k to be updated in three different ways respectively, and hence there are 9 possible combinations. Thus, one can choose suitable updating schemes to fit different applications. In particular, if Ψ or Φ is column-wise separable, taking advantage of the structure of Θ_{α,β} and the fact that XY^T can be written as Σ_{i=1}^r x_i y_i^T with X = [x_1, ..., x_r] ∈ R^{m×r} and Y = [y_1, ..., y_r] ∈ R^{n×r}, one can update X (or Y) column-wise even when A ≠ I. The motivation for updating X (or Y) column-wise rather than updating the whole X (or Y) is that the resulting subproblems (4.1c) (or (4.2c)) can be reduced to the computation of the proximal mapping of ψ_i (or φ_i), which is easy for many commonly used ψ_i (or φ_i). Indeed, from (4.1c) and (4.2c), u_i^k and v_i^k are given by

(4.7)    u_i^k ∈ Argmin_{x_i} { ψ_i(x_i) + (α/2)‖x_i(y_i^k)^T − P_i^k‖_F^2 + (μ_k/2)‖x_i − x_i^k‖^2 },
         v_i^k ∈ Argmin_{y_i} { φ_i(y_i) + (α/2)‖u_i^k y_i^T − Q_i^k‖_F^2 + (σ_k/2)‖y_i − y_i^k‖^2 },

where P_i^k and Q_i^k are defined by

(4.8)    P_i^k := Z^k − Σ_{j=1}^{i−1} u_j^k(y_j^k)^T − Σ_{j=i+1}^r x_j^k(y_j^k)^T,    Q_i^k := Z^k − Σ_{j=1}^{i−1} u_j^k(v_j^k)^T − Σ_{j=i+1}^r u_j^k(y_j^k)^T.

Then, from Proposition 2.2, we can reformulate the subproblems in (4.7) and obtain the corresponding solutions by computing the proximal mappings of ψ_i and φ_i, which can be computed efficiently when ψ_i and φ_i are some common regularizers used in the literature. In particular, when ψ_i (and φ_i) are ‖·‖_1, ‖·‖^2 or the indicator function of the box constraint, these subproblems have closed-form solutions. This updating strategy has also been used for NMF; see, for example, [8, 20, 21]. However, the methods used in [8, 20, 21] can only be applied to some specific problems with A = I, while NAUM can be applied to more general problems with AA^* = I_q.

Our NAUM adopts a non-monotone line search criterion (see Step 2 in Algorithm 1) to improve the numerical performance. This is motivated by recent studies on

^2 This may happen when 0 < α < 1 so that β = α(α−1)^{−1} < 0, or 0 < β < 1 so that α = β(β−1)^{−1} < 0.

non-monotone algorithms with promising performance; see, for example, [7, 12, 39]. However, different from the non-monotone line search criteria used there, NAUM only includes (U^k, V^k) in the line search loop and checks the stopping criterion (4.6) after updating a pair (U^k, V^k), rather than checking (4.6) immediately once U^k or V^k is updated. Thus, we do not need to compute the function value after updating each block of variables. This may reduce the cost of the line search and make NAUM more practical, especially when computing the function value is relatively expensive.

Before moving on to the convergence analysis of NAUM, we would like to point out an interesting connection between NAUM and the low-rank matrix fitting algorithm, LMaFit [38], for solving the following matrix completion model without regularizers:

min_{X,Y} (1/2)‖P_Ω(XY^T − M)‖_F^2,

where Ω is the index set of the known entries of M, and P_Ω(Z) keeps the entries of Z in Ω and sets the remaining ones to zero. If we apply our NAUM with (4.1a) and (4.2a), then at the k-th iteration, the iterates Z^k, X^{k+1} and Y^{k+1} are given by

Z^k = (I − (β/(α+β)) P_Ω)(X^k(Y^k)^T) + (β/(α+β)) P_Ω(M),
X^{k+1} = (μ_k X^k + αZ^k Y^k)(μ_k I + α(Y^k)^T Y^k)^{−1},
Y^{k+1} = (σ_k Y^k + α(Z^k)^T X^{k+1})(σ_k I + α(X^{k+1})^T X^{k+1})^{−1}.

One can verify that the sequence {(Z^k, X^{k+1}, Y^{k+1})} above can be equivalently generated by the following scheme with Z̃^0 = P_Ω(M) + P_{Ω^c}(X^0(Y^0)^T):

Z^k = (β/(α+β)) Z̃^k + (1 − β/(α+β)) X^k(Y^k)^T,
X^{k+1} = (μ_k X^k + αZ^k Y^k)(μ_k I + α(Y^k)^T Y^k)^{−1},
Y^{k+1} = (σ_k Y^k + α(Z^k)^T X^{k+1})(σ_k I + α(X^{k+1})^T X^{k+1})^{−1},
Z̃^{k+1} = P_Ω(M) + P_{Ω^c}(X^{k+1}(Y^{k+1})^T),

where Ω^c is the complement set of Ω. Surprisingly, when μ_k = σ_k = 0, this scheme is exactly the SOR (successive over-relaxation)-like scheme used in LMaFit (see [38, Eq. (2.11)]) with ω := β/(α+β) being an over-relaxation weight. With this connection, our NAUM, in some sense, can be viewed as an SOR-based algorithm. Moreover, just like the classical SOR for solving a system of linear equations, LMaFit with ω > 1 also appears to be more efficient from the extensive numerical experiments reported in [38]. Then, it is natural to consider β/(α+β) > 1, and hence 1/α > 1 (since 1/α + 1/β = 1), in NAUM. This also gives some insights into the necessity of allowing more flexibility in choosing α and β, and the promising performance of NAUM with a relatively small α ∈ (0, 1), as we shall see in Section 6.

5. Convergence analysis of NAUM. In this section, we discuss the convergence properties of Algorithm 1. First, we present the first-order optimality conditions for the three different updating schemes in (2a) of Algorithm 1 as follows:

Proximal
(5.1a)    0 ∈ ∂Ψ(U^k) + α(U^k(Y^k)^T − Z^k)Y^k + μ_k(U^k − X^k).

Prox-linear
(5.1b)    0 ∈ ∂Ψ(U^k) + α(X^k(Y^k)^T − Z^k)Y^k + μ_k(U^k − X^k).

Hierarchical-prox
For i = 1, 2, ..., r,
(5.1c)    0 ∈ ∂ψ_i(u_i^k) + α( Σ_{j=1}^i u_j^k(y_j^k)^T + Σ_{j=i+1}^r x_j^k(y_j^k)^T − Z^k ) y_i^k + μ_k(u_i^k − x_i^k).

Similarly, the first-order optimality conditions for the three different updating schemes in (2b) of Algorithm 1 are

Proximal
(5.2a)    0 ∈ ∂Φ(V^k) + α(U^k(V^k)^T − Z^k)^T U^k + σ_k(V^k − Y^k).

Prox-linear
(5.2b)    0 ∈ ∂Φ(V^k) + α(U^k(Y^k)^T − Z^k)^T U^k + σ_k(V^k − Y^k).

Hierarchical-prox
For i = 1, 2, ..., r,
(5.2c)    0 ∈ ∂φ_i(v_i^k) + α( Σ_{j=1}^i u_j^k(v_j^k)^T + Σ_{j=i+1}^r u_j^k(y_j^k)^T − Z^k )^T u_i^k + σ_k(v_i^k − y_i^k).

We also need to make the following assumptions.

Assumption 5.1.
(a1) Ψ, Φ are proper, closed, level-bounded functions and are continuous on their domains, respectively;
(a2) AA^* = I_q;
(a3) 1/α + 1/β = 1.

Remark 5.1. (i) From (a1), one can see from [28, Theorem 1.9] that inf Ψ and inf Φ are finite, i.e., Ψ and Φ are bounded from below. In particular, the iterates in (4.1a), (4.1b), (4.1c), (4.2a), (4.2b) and (4.2c) are well defined; (ii) The continuity assumption in (a1) holds for many common regularizers, for example, the l_1-norm, the nuclear norm and the indicator function of a nonempty closed set; (iii) (a2) is satisfied for some common linear maps, for example, the identity map and the sampling map.

We start our convergence analysis by proving the following auxiliary lemma.

Lemma 5.1 (Sufficient descent of F). Suppose that Assumption 5.1 holds. Let (X^k, Y^k) be generated by Algorithm 1 at the k-th iteration, and let (U^k, V^k) be the candidate for (X^{k+1}, Y^{k+1}) generated by steps (2a) and (2b). Then, for any integer k ≥ 0, we have

(5.3)    F(U^k, V^k) ≤ F(X^k, Y^k) − [(μ_k − (α + 2γρ^2)‖Y^k‖^2)/2] ‖U^k − X^k‖_F^2 − [(σ_k − (α + 2γρ^2)‖U^k‖^2)/2] ‖V^k − Y^k‖_F^2.

Proof. First, from Lemma 3.1 and (4.5), we see that F(X^k, Y^k) = Θ_{α,β}(X^k, Y^k, Z^k). For any (U^k, V^k), let

(5.4)    W^k = (I − (β/(α+β)) A^*A)(U^k(V^k)^T) + (β/(α+β)) A^*(b).

Then, from Lemma 3.1, we have F(U^k, V^k) = Θ_{α,β}(U^k, V^k, W^k). Thus, to establish (5.3), we only need to consider the difference Θ_{α,β}(U^k, V^k, W^k) − Θ_{α,β}(X^k, Y^k, Z^k). We start by noting that

(5.5)    A^*(A(W^k)) = A^*(A(U^k(V^k)^T)) − (β/(α+β)) A^*(AA^*(A(U^k(V^k)^T))) + (β/(α+β)) A^*(AA^*(b)) = (α/(α+β)) A^*(A(U^k(V^k)^T)) + (β/(α+β)) A^*(b),

where the last equality follows from (a2) in Assumption 5.1. Then, we obtain that

∇_Z Θ_{α,β}(U^k, V^k, W^k) = α(W^k − U^k(V^k)^T) + βA^*(A(W^k)) − βA^*(b)
= α[ −(β/(α+β)) A^*(A(U^k(V^k)^T)) + (β/(α+β)) A^*(b) ] + β[ (α/(α+β)) A^*(A(U^k(V^k)^T)) + (β/(α+β)) A^*(b) ] − βA^*(b) = 0,

where the second equality follows from (5.4) and (5.5). Moreover, since γ is chosen such that (α + γ)I + βA^*A ⪰ 0 (see (4.4)), we see that, for any k ≥ 0, the function Z ↦ Θ_{α,β}(U^k, V^k, Z) + (γ/2)‖Z − Z^k‖_F^2 is convex and hence

Θ_{α,β}(U^k, V^k, Z^k) + (γ/2)‖Z^k − Z^k‖_F^2 ≥ Θ_{α,β}(U^k, V^k, W^k) + (γ/2)‖W^k − Z^k‖_F^2 + ⟨∇_Z Θ_{α,β}(U^k, V^k, W^k) + γ(W^k − Z^k), Z^k − W^k⟩,

where ‖Z^k − Z^k‖_F^2 = 0 and ∇_Z Θ_{α,β}(U^k, V^k, W^k) = 0, which implies that

(5.6)    Θ_{α,β}(U^k, V^k, W^k) − Θ_{α,β}(U^k, V^k, Z^k) ≤ (γ/2)‖W^k − Z^k‖_F^2.

Then, substituting (4.5) and (5.4) into (5.6), we obtain

(5.7)    Θ_{α,β}(U^k, V^k, W^k) − Θ_{α,β}(U^k, V^k, Z^k) ≤ (γ/2)‖(I − (β/(α+β)) A^*A)(U^k(V^k)^T − X^k(Y^k)^T)‖_F^2
≤ (γ/2)‖I − (β/(α+β)) A^*A‖^2 ‖U^k(V^k)^T − X^k(Y^k)^T‖_F^2 = (γρ^2/2)‖U^k(V^k − Y^k)^T + (U^k − X^k)(Y^k)^T‖_F^2
≤ (γρ^2/2)(‖U^k(V^k − Y^k)^T‖_F + ‖(U^k − X^k)(Y^k)^T‖_F)^2
(i) ≤ (γρ^2/2)(‖U^k‖‖V^k − Y^k‖_F + ‖Y^k‖‖U^k − X^k‖_F)^2
(ii) ≤ γρ^2(‖U^k‖^2‖V^k − Y^k‖_F^2 + ‖Y^k‖^2‖U^k − X^k‖_F^2),

where the equality follows from the definition of ρ in (4.3); (i) follows from the relation ‖AB‖_F ≤ ‖A‖‖B‖_F; and (ii) follows from the relation (a + b)^2 ≤ 2a^2 + 2b^2. Next, we claim that

(5.8)    Θ_{α,β}(U^k, V^k, Z^k) − Θ_{α,β}(U^k, Y^k, Z^k) ≤ [(α‖U^k‖^2 − σ_k)/2] ‖V^k − Y^k‖_F^2,
(5.9)    Θ_{α,β}(U^k, Y^k, Z^k) − Θ_{α,β}(X^k, Y^k, Z^k) ≤ [(α‖Y^k‖^2 − μ_k)/2] ‖U^k − X^k‖_F^2.

Below, we will only prove (5.8). The proof of (5.9) can be done in a similar way. To prove (5.8), we consider the following three cases.

Proximal: In this case, we have

Θ_{α,β}(U^k, V^k, Z^k) − Θ_{α,β}(U^k, Y^k, Z^k) = Φ(V^k) + H_α(U^k, V^k, Z^k) − Φ(Y^k) − H_α(U^k, Y^k, Z^k)
= [ Φ(V^k) + H_α(U^k, V^k, Z^k) + (σ_k/2)‖V^k − Y^k‖_F^2 ] − [ Φ(Y^k) + H_α(U^k, Y^k, Z^k) ] − (σ_k/2)‖V^k − Y^k‖_F^2
≤ −(σ_k/2)‖V^k − Y^k‖_F^2,

where the inequality follows from the definition of V^k as a minimizer in (4.2a). This implies (5.8).

Prox-linear: In this case, we have

Θ_{α,β}(U^k, V^k, Z^k) − Θ_{α,β}(U^k, Y^k, Z^k) = Φ(V^k) + H_α(U^k, V^k, Z^k) − Φ(Y^k) − H_α(U^k, Y^k, Z^k)
≤ Φ(V^k) + H_α(U^k, Y^k, Z^k) + ⟨∇_Y H_α(U^k, Y^k, Z^k), V^k − Y^k⟩ + (α‖U^k‖^2/2)‖V^k − Y^k‖_F^2 − Φ(Y^k) − H_α(U^k, Y^k, Z^k)
= [ Φ(V^k) + ⟨∇_Y H_α(U^k, Y^k, Z^k), V^k − Y^k⟩ + (σ_k/2)‖V^k − Y^k‖_F^2 ] − Φ(Y^k) + [(α‖U^k‖^2 − σ_k)/2]‖V^k − Y^k‖_F^2
≤ [(α‖U^k‖^2 − σ_k)/2]‖V^k − Y^k‖_F^2,

where the first inequality follows from the fact that Y ↦ ∇_Y H_α(X, Y, Z) is Lipschitz continuous with modulus α‖X‖^2, and the last inequality follows from the definition of V^k as a minimizer in (4.2b).

Hierarchical-prox: In this case, for any 1 ≤ i ≤ r, we have

Θ_{α,β}(U^k, (v^k_{j<i}, v_i^k, y^k_{j>i}), Z^k) − Θ_{α,β}(U^k, (v^k_{j<i}, y_i^k, y^k_{j>i}), Z^k)
= φ_i(v_i^k) + H_α(U^k, (v^k_{j<i}, v_i^k, y^k_{j>i}), Z^k) − φ_i(y_i^k) − H_α(U^k, (v^k_{j<i}, y_i^k, y^k_{j>i}), Z^k)
= [ φ_i(v_i^k) + H_α(U^k, (v^k_{j<i}, v_i^k, y^k_{j>i}), Z^k) + (σ_k/2)‖v_i^k − y_i^k‖^2 ] − [ φ_i(y_i^k) + H_α(U^k, (v^k_{j<i}, y_i^k, y^k_{j>i}), Z^k) ] − (σ_k/2)‖v_i^k − y_i^k‖^2
≤ −(σ_k/2)‖v_i^k − y_i^k‖^2,

where the inequality follows from the definition of v_i^k as a minimizer in (4.2c). Then, summing the above relation from i = r to i = 1 and simplifying the resulting inequality, we obtain (5.8). The inequality (5.9) can be obtained via a similar argument.

Now, summing (5.7), (5.8) and (5.9), and using F(U^k, V^k) = Θ_{α,β}(U^k, V^k, W^k) and F(X^k, Y^k) = Θ_{α,β}(X^k, Y^k, Z^k), we obtain (5.3). This completes the proof.

From Lemma 5.1, we see that the sufficient descent of F(X^k, Y^k) can be guaranteed as long as μ_k and σ_k are sufficiently large. Thus, based on this lemma, we can show in the following proposition that our non-monotone line search criterion in Algorithm 1 is well defined.

Proposition 5.2 (Well-definedness of the non-monotone line search criterion). Suppose that Assumption 5.1 holds and Algorithm 1 is applied. Then, for each k ≥ 0, the line search criterion (4.6) is satisfied after finitely many inner iterations.

Proof. We prove this proposition by contradiction. Assume that there exists a k ≥ 0 such that the line search criterion (4.6) cannot be satisfied after finitely many inner iterations. Note from (2a) and (2d) in Step 2 of Algorithm 1 that μ_k ≤ μ_k^max = (α + 2γρ^2)‖Y^k‖^2 + c and hence μ_k = μ_k^max must be satisfied after finitely many inner iterations. Let n_k denote the number of inner iterations when μ_k = μ_k^max is satisfied for the first time. If μ_k^0 ≥ μ_k^max, then n_k = 1; otherwise, we have

μ_min τ^{n_k−2} ≤ μ_k^0 τ^{n_k−2} < μ_k^max,

which implies that

(5.10)    n_k ≤ (log(μ_k^max) − log(μ_min))/log τ + 2.

Then, from (2d) in Step 2 of Algorithm 1, we have U^k = U^k_{μ_k^max} and σ_k^max = (α + 2γρ^2)‖U^k_{μ_k^max}‖^2 + c after at most n_k + 1 inner iterations, where U^k_{μ_k^max} is computed by (4.1a), (4.1b) or (4.1c) with μ_k = μ_k^max. Moreover, we see that σ_k = σ_k^max must be satisfied after finitely many inner iterations. Similarly, let n̂_k denote the number of inner iterations when σ_k = σ_k^max is satisfied for the first time. If σ_k^0 > σ_k^max, then n̂_k = n_k; if σ_k^0 = σ_k^max, then n̂_k = 0; otherwise, we have

σ_min τ^{n̂_k−1} ≤ σ_k^0 τ^{n̂_k−1} < σ_k^max,

which implies that

n̂_k ≤ (log(σ_k^max) − log(σ_min))/log τ + 1.

Thus, after at most max{n_k, n̂_k} + 1 inner iterations, we must have V^k = V^k_{σ_k^max}, where V^k_{σ_k^max} is computed by (4.2a), (4.2b) or (4.2c) with σ_k = σ_k^max. Therefore, after at most max{n_k, n̂_k} + 1 inner iterations, we have

F(U^k_{μ_k^max}, V^k_{σ_k^max}) ≤ F(X^k, Y^k) − [(μ_k^max − (α + 2γρ^2)‖Y^k‖^2)/2]‖U^k_{μ_k^max} − X^k‖_F^2 − [(σ_k^max − (α + 2γρ^2)‖U^k_{μ_k^max}‖^2)/2]‖V^k_{σ_k^max} − Y^k‖_F^2
= F(X^k, Y^k) − (c/2)(‖U^k_{μ_k^max} − X^k‖_F^2 + ‖V^k_{σ_k^max} − Y^k‖_F^2),

where the inequality follows from (5.3) and the equality follows from μ_k^max = (α + 2γρ^2)‖Y^k‖^2 + c and σ_k^max = (α + 2γρ^2)‖U^k_{μ_k^max}‖^2 + c. This together with

F(X^k, Y^k) ≤ max_{[k−N]_+ ≤ i ≤ k} F(X^i, Y^i)

implies that (4.6) must be satisfied after at most max{n_k, n̂_k} + 1 inner iterations, which leads to a contradiction.

Now, we are ready to prove our main convergence result, which characterizes a cluster point of the sequence generated by Algorithm 1. Our proof of statement (ii) in the following theorem is similar to that of [39, Lemma 4]. However, the arguments involved are more intricate since we have two blocks of variables in our line search loop.

Theorem 5.3. Suppose that Assumption 5.1 holds. Let {(X^k, Y^k)} be the sequence generated by Algorithm 1. Then,
(i) (boundedness of sequence) {(X^k, Y^k)}, {μ_k} and {σ_k} are bounded;
(ii) (diminishing successive changes) lim_{k→∞} ‖X^{k+1} − X^k‖_F + ‖Y^{k+1} − Y^k‖_F = 0;
(iii) (global subsequential convergence) any cluster point (X^*, Y^*) of {(X^k, Y^k)} is a stationary point of F.

Proof. Statement (i). We first show that

(5.11)    F(X^k, Y^k) ≤ F(X^0, Y^0)  for all k ≥ 1.

We will prove it by induction. Indeed, for k = 1, it follows from Proposition 5.2 that

F(X^1, Y^1) ≤ F(X^0, Y^0) − (c/2)(‖X^1 − X^0‖_F^2 + ‖Y^1 − Y^0‖_F^2) ≤ F(X^0, Y^0)

is satisfied after finitely many inner iterations. Hence, (5.11) holds for k = 1. We now suppose that (5.11) holds for all k ≤ K for some integer K ≥ 1. Then, we only need to show that (5.11) also holds for k = K + 1. For k = K + 1, we have

F(X^{K+1}, Y^{K+1}) − F(X^0, Y^0) ≤ F(X^{K+1}, Y^{K+1}) − max_{[K−N]_+ ≤ i ≤ K} F(X^i, Y^i) ≤ −(c/2)(‖X^{K+1} − X^K‖_F^2 + ‖Y^{K+1} − Y^K‖_F^2) ≤ 0,

where the first inequality follows from the induction hypothesis and the second inequality follows from (4.6). Hence, (5.11) holds for k = K + 1. This completes the induction.

Then, from (5.11), we have that for any k ≥ 0,

F(X^0, Y^0) ≥ F(X^k, Y^k) = Ψ(X^k) + Φ(Y^k) + (1/2)‖A(X^k(Y^k)^T) − b‖^2,

which, together with (a1) in Assumption 5.1, implies that the sequences {X^k}, {Y^k} and {‖A(X^k(Y^k)^T) − b‖} are bounded. Moreover, from Step 2 and Step 3 in Algorithm 1, it is easy to see that μ_k ≤ μ_k^max = (α + 2γρ^2)‖Y^k‖^2 + c for all k. Since {Y^k} is bounded, the sequences {μ_k^max} and {μ_k} are bounded. Next, we prove the boundedness of {σ_k}. Indeed, at the k-th iteration, there are three possibilities:

μ_k < μ_k^max: In this case, we have σ_k ≤ σ_k^0 τ^{ñ_k} ≤ σ_max τ^{ñ_k}, where ñ_k denotes the number of inner iterations for the line search at the k-th iteration and ñ_k ≤ max{(log(μ_k^max) − log(μ_min))/log τ + 2, 1} (see (5.10) and the discussions preceding it).

μ_k = μ_k^max and σ_k^0 > σ_k^max: In this case, we have σ_k ≤ σ_k^0 τ^{ñ_k} ≤ σ_max τ^{ñ_k}, where ñ_k ≤ max{(log(μ_k^max) − log(μ_min))/log τ + 2, 1}.

Otherwise, we have σ_k ≤ σ_k^max = (α + 2γρ^2)‖X^{k+1}‖^2 + c.

Note that {ñ_k} is bounded as {μ_k^max} is bounded. Thus, {σ_k} is bounded as the sequences {X^k} and {ñ_k} are bounded. This proves statement (i).

Statement (ii). We first claim that any cluster point of {(X^k, Y^k)} is in dom F. Since {(X^k, Y^k)} is bounded from statement (i), there exists at least one cluster point. Suppose that (X^*, Y^*) is a cluster point of {(X^k, Y^k)} and let {(X^{k_i}, Y^{k_i})} be a convergent subsequence such that lim_{i→∞} (X^{k_i}, Y^{k_i}) = (X^*, Y^*). Then, from the lower semicontinuity of F (since Ψ, Φ are closed by (a1) in Assumption 5.1) and (5.11), we have

F(X^*, Y^*) ≤ liminf_{i→∞} F(X^{k_i}, Y^{k_i}) ≤ F(X^0, Y^0),

which implies that F(X^*, Y^*) is finite and hence (X^*, Y^*) ∈ dom F. For notational simplicity, from now on, we let

(5.12)    ΔX^k := X^{k+1} − X^k,  ΔY^k := Y^{k+1} − Y^k,  ΔZ^k := Z^{k+1} − Z^k  and  ℓ(k) := arg max_i { F(X^i, Y^i) : i = [k−N]_+, ..., k }.

Then, the line search criterion (4.6) can be rewritten as

(5.13)    F(X^{k+1}, Y^{k+1}) ≤ F(X^{ℓ(k)}, Y^{ℓ(k)}) − (c/2)(‖ΔX^k‖_F^2 + ‖ΔY^k‖_F^2).

Observe that

F(X^{ℓ(k+1)}, Y^{ℓ(k+1)}) = max_{[k+1−N]_+ ≤ i ≤ k+1} F(X^i, Y^i) = max{ F(X^{k+1}, Y^{k+1}), max_{[k+1−N]_+ ≤ i ≤ k} F(X^i, Y^i) }
(i) ≤ max{ F(X^{ℓ(k)}, Y^{ℓ(k)}), max_{[k+1−N]_+ ≤ i ≤ k} F(X^i, Y^i) }
≤ max{ F(X^{ℓ(k)}, Y^{ℓ(k)}), max_{[k−N]_+ ≤ i ≤ k} F(X^i, Y^i) }
(ii) = max{ F(X^{ℓ(k)}, Y^{ℓ(k)}), F(X^{ℓ(k)}, Y^{ℓ(k)}) } = F(X^{ℓ(k)}, Y^{ℓ(k)}),

where (i) follows from (5.13) and (ii) follows from (5.12). Therefore, the sequence {F(X^{ℓ(k)}, Y^{ℓ(k)})} is non-increasing. Since F(X^{ℓ(k)}, Y^{ℓ(k)}) is also bounded from below (due to (a1) in Assumption 5.1), we conclude that there exists a number F̄ such that

(5.14)    lim_{k→∞} F(X^{ℓ(k)}, Y^{ℓ(k)}) = F̄.

We next prove by induction that for all j ≥ 1,

(5.15a)    lim_{k→∞} ΔX^{ℓ(k)−j} = lim_{k→∞} ΔY^{ℓ(k)−j} = 0,
(5.15b)    lim_{k→∞} F(X^{ℓ(k)−j}, Y^{ℓ(k)−j}) = F̄.

We first prove (5.15a) and (5.15b) for j = 1. Applying (5.13) with k replaced by ℓ(k) − 1, we obtain

F(X^{ℓ(k)}, Y^{ℓ(k)}) ≤ F(X^{ℓ(ℓ(k)−1)}, Y^{ℓ(ℓ(k)−1)}) − (c/2)(‖ΔX^{ℓ(k)−1}‖_F^2 + ‖ΔY^{ℓ(k)−1}‖_F^2),

which, together with (5.14), implies that

(5.16)    lim_{k→∞} ΔX^{ℓ(k)−1} = lim_{k→∞} ΔY^{ℓ(k)−1} = 0.

Then, from (5.14) and (5.16), we have

F̄ = lim_{k→∞} F(X^{ℓ(k)}, Y^{ℓ(k)}) = lim_{k→∞} F(X^{ℓ(k)−1} + ΔX^{ℓ(k)−1}, Y^{ℓ(k)−1} + ΔY^{ℓ(k)−1}) = lim_{k→∞} F(X^{ℓ(k)−1}, Y^{ℓ(k)−1}),

where the last equality follows because {(X^k, Y^k)} is bounded, any cluster point of {(X^k, Y^k)} is in dom F and F is uniformly continuous on any compact subset of dom F under (a1) in Assumption 5.1. Thus, (5.15a) and (5.15b) hold for j = 1.

We next suppose that (5.15a) and (5.15b) hold for j = J for some J ≥ 1. It remains to show that they also hold for j = J + 1. Indeed, from (5.13) with k replaced by ℓ(k) − J − 1 (here, without loss of generality, we assume that k is large enough so that ℓ(k) − J − 1 is nonnegative), we have

F(X^{ℓ(k)−J}, Y^{ℓ(k)−J}) ≤ F(X^{ℓ(ℓ(k)−J−1)}, Y^{ℓ(ℓ(k)−J−1)}) − (c/2)(‖ΔX^{ℓ(k)−J−1}‖_F^2 + ‖ΔY^{ℓ(k)−J−1}‖_F^2),

which implies that

‖ΔX^{ℓ(k)−J−1}‖_F^2 + ‖ΔY^{ℓ(k)−J−1}‖_F^2 ≤ (2/c)( F(X^{ℓ(ℓ(k)−J−1)}, Y^{ℓ(ℓ(k)−J−1)}) − F(X^{ℓ(k)−J}, Y^{ℓ(k)−J}) ).

This together with (5.14) and the induction hypothesis implies that

lim_{k→∞} ΔX^{ℓ(k)−(J+1)} = lim_{k→∞} ΔY^{ℓ(k)−(J+1)} = 0.

Thus, (5.15a) holds for j = J + 1. From this, we further have

lim_{k→∞} F(X^{ℓ(k)−(J+1)}, Y^{ℓ(k)−(J+1)}) = lim_{k→∞} F(X^{ℓ(k)−J} − ΔX^{ℓ(k)−(J+1)}, Y^{ℓ(k)−J} − ΔY^{ℓ(k)−(J+1)}) = lim_{k→∞} F(X^{ℓ(k)−J}, Y^{ℓ(k)−J}) = F̄,

where the second equality follows because {(X^k, Y^k)} is bounded, any cluster point of {(X^k, Y^k)} is in dom F and F is uniformly continuous on any compact subset of dom F under (a1) in Assumption 5.1. Hence, (5.15b) also holds for j = J + 1. This completes the induction.

We are now ready to prove the main result in this statement. Indeed, from (5.12), we can see that k − N ≤ ℓ(k) ≤ k (without loss of generality, we assume that k is large enough such that k ≥ N). Thus, for any k, we must have k − N − 1 = ℓ(k) − j_k for some 1 ≤ j_k ≤ N + 1. Then, we have

‖ΔX^{k−N−1}‖_F = ‖ΔX^{ℓ(k)−j_k}‖_F ≤ max_{1 ≤ j ≤ N+1} ‖ΔX^{ℓ(k)−j}‖_F,
‖ΔY^{k−N−1}‖_F = ‖ΔY^{ℓ(k)−j_k}‖_F ≤ max_{1 ≤ j ≤ N+1} ‖ΔY^{ℓ(k)−j}‖_F.

This together with (5.15a) implies that

lim_{k→∞} ΔX^k = lim_{k→∞} ΔX^{k−N−1} = 0,    lim_{k→∞} ΔY^k = lim_{k→∞} ΔY^{k−N−1} = 0.

This proves statement (ii).

Statement (iii). Again, let (X^*, Y^*) be a cluster point of {(X^k, Y^k)} and let {(X^{k_i}, Y^{k_i})} be a convergent subsequence such that lim_{i→∞} (X^{k_i}, Y^{k_i}) = (X^*, Y^*). Recall that (X^*, Y^*) ∈ dom F. On the other hand, it is easy to see from (4.5) that lim_{i→∞} Z^{k_i} = Z^*, where Z^* is given by (3.4). Thus, it can be shown as in (3.9) that

(5.17)    α(Z^* − X^*(Y^*)^T) + βA^*(A(Z^*) − b) = 0.

We next show that

(5.18a)    0 ∈ ∂Ψ(X^*) + α(X^*(Y^*)^T − Z^*)Y^*,
(5.18b)    0 ∈ ∂Φ(Y^*) + α(X^*(Y^*)^T − Z^*)^T X^*.

We start by showing (5.18a) in the following cases:

Proximal & Prox-linear: In these two cases, passing to the limit along {(X^{k_i}, Y^{k_i})} in (5.1a) or (5.1b) with X^{k_i+1} in place of U^{k_i} and μ_{k_i} in place of μ_k, and invoking (a1) in Assumption 5.1, statements (i), (ii), (X^*, Y^*) ∈ dom F and (2.1), we obtain (5.18a).

Hierarchical-prox: In this case, passing to the limit along {(X^{k_i}, Y^{k_i})} in (5.1c) with X^{k_i+1} in place of U^{k_i} and μ_{k_i} in place of μ_k, and invoking (a1) in Assumption 5.1, statements (i), (ii), (X^*, Y^*) ∈ dom F and (2.1), we have 0 ∈ ∂ψ_i(x_i^*) + α(X^*(Y^*)^T − Z^*)y_i^* for any i = 1, 2, ..., r. Then, stacking them up, we obtain (5.18a).

Similarly, we can obtain (5.18b). Thus, combining (5.17), (5.18a) and (5.18b), we see that (X^*, Y^*, Z^*) is a stationary point of Θ_{α,β}, which further implies that (X^*, Y^*) is a stationary point of F from Theorem 3.3. This proves statement (iii).

Remark 5.2 (Comment on (a3) in Assumption 5.1). If Φ and Ψ are the indicator functions of some nonempty closed sets, Theorem 5.3 remains valid under the weaker condition on α and β that 1/α + 1/β > 0, with a slight modification in (4.6) of Algorithm 1. Indeed, when Φ and Ψ are the indicator functions, one can see from Remark 3.1 and the proofs of Lemma 5.1 and Proposition 5.2 that if 1/α + 1/β > 0, then

F(U^k, V^k) − F(X^k, Y^k) = (1/α + 1/β)( Θ_{α,β}(U^k, V^k, W^k) − Θ_{α,β}(X^k, Y^k, Z^k) )
≤ −(1/α + 1/β)( [(μ_k − (α + 2γρ^2)‖Y^k‖^2)/2]‖U^k − X^k‖_F^2 + [(σ_k − (α + 2γρ^2)‖U^k‖^2)/2]‖V^k − Y^k‖_F^2 ),

and the line search criterion is well defined with c replaced by (1/α + 1/β)c. Moreover, recalling [28, Exercise 8.14], we see that ∂Ψ and ∂Φ are normal cones. Thus, following Remark 3.2 and similar arguments to those in Theorem 5.3, we can obtain the same results when 1/α + 1/β > 0, with c replaced by (1/α + 1/β)c in (4.6) of Algorithm 1.

Remark 5.3 (Comments on updating μ_k^max and σ_k^max). In Algorithm 1, we need to evaluate μ_k^max = (α + 2γρ^2)‖Y^k‖^2 + c and σ_k^max = (α + 2γρ^2)‖U^k‖^2 + c in each iteration. However, computing the spectral norms of Y^k and U^k might be costly, especially when r is large. Hence, in our experiments, instead of computing ‖Y^k‖^2 and ‖U^k‖^2, we compute ‖Y^k‖_F^2 and ‖U^k‖_F^2, and update μ_k^max and σ_k^max by μ_k^max = (α + 2γρ^2)‖Y^k‖_F^2 + c and σ_k^max = (α + 2γρ^2)‖U^k‖_F^2 + c instead. Since ‖Y^k‖ ≤ ‖Y^k‖_F and ‖U^k‖ ≤ ‖U^k‖_F, it follows from (5.3) that

F(U^k, V^k) ≤ F(X^k, Y^k) − [(μ_k − (α + 2γρ^2)‖Y^k‖_F^2)/2]‖U^k − X^k‖_F^2 − [(σ_k − (α + 2γρ^2)‖U^k‖_F^2)/2]‖V^k − Y^k‖_F^2.

Then, one can show that Proposition 5.2 and Theorem 5.3 remain valid. In addition, we compute the quantities ‖U^k‖_F^2 and ‖Y^k‖_F^2 by tr((U^k)^T U^k) and tr((Y^k)^T Y^k), respectively. For some cases, the matrices (U^k)^T U^k and (Y^k)^T Y^k can be used repeatedly in updating the variables and evaluating the objective value and successive changes to reduce the cost of the line search; see a concrete example in Section 6.1.

6. Numerical experiments. In this section, we conduct numerical experiments to test our algorithm for NMF and MC on real datasets. All experiments are run in MATLAB R2015b on a 64-bit PC with an Intel Core i7 CPU (3.60 GHz) and 32 GB of RAM equipped with Windows 10 OS.

6.1. Non-negative matrix factorization. We first consider NMF:

(6.1)    min_{X,Y} (1/2)‖XY^T − M‖_F^2   s.t.  X ≥ 0, Y ≥ 0,

where X ∈ R^{m×r} and Y ∈ R^{n×r} are decision variables. Note that the feasible set of (6.1) is unbounded. We hence focus on the following model:

(6.2)    min_{X,Y} (1/2)‖XY^T − M‖_F^2   s.t.  0 ≤ X ≤ X^max, 0 ≤ Y ≤ Y^max,

where X^max ≥ 0 and Y^max ≥ 0 are upper bound matrices. One can show that, when X^max_{ij} and Y^max_{ij} are sufficiently large^3, solving (6.2) gives a solution of (6.1).

^3 The estimation of X^max_{ij} and Y^max_{ij} has been discussed in [9, Page 67].
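Since Ψ and Φ in (6.2) are indicator functions of boxes, their proximal mappings reduce to entrywise clipping. The following one-line NumPy helper (ours, not from the paper) can play the role of prox_psi or prox_phi in the earlier sketches of the (4.1)/(4.2) updates.

```python
import numpy as np

def prox_box(V, upper, t=None):
    """Prox of the indicator of {0 <= X <= upper}: the projection, i.e. entrywise clipping.
    The step size t is irrelevant for indicator functions; it is kept only for interface
    compatibility with the earlier prox-based sketches."""
    return np.clip(V, 0.0, upper)
```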

In our experiments, for simplicity, we set X^max_{ij} = … and Y^max_{ij} = … for all (i, j). Now, we see that (6.2) corresponds to (1.1) with Ψ(X) = δ_X(X), Φ(Y) = δ_Y(Y) and A = I, where X = {X ∈ R^{m×r} : 0 ≤ X ≤ X^max} and Y = {Y ∈ R^{n×r} : 0 ≤ Y ≤ Y^max}. We apply NAUM to solve (6.2), and use (4.1c) and (4.2c) to update U^k and V^k. The specific updates of Z^k, u_i^k and v_i^k are

Z^k = (α/(α+β)) X^k(Y^k)^T + (β/(α+β)) M,
u_i^k = max{ 0, min{ x_i^max, (α P_i^k y_i^k + μ_k x_i^k) / (α‖y_i^k‖^2 + μ_k) } },   i = 1, ..., r,
v_i^k = max{ 0, min{ y_i^max, (α (Q_i^k)^T u_i^k + σ_k y_i^k) / (α‖u_i^k‖^2 + σ_k) } },   i = 1, ..., r,

where P_i^k and Q_i^k are defined in (4.8). Note that here it is not necessary to update Z^k explicitly. Indeed, we can directly compute P_i^k y_i^k and (Q_i^k)^T u_i^k by substituting Z^k as below:

(6.3)    P_i^k y_i^k = (α/(α+β)) X^k(Y^k)^T y_i^k + (β/(α+β)) M y_i^k − Σ_{j=1}^{i−1} u_j^k ((y_j^k)^T y_i^k) − Σ_{j=i+1}^r x_j^k ((y_j^k)^T y_i^k),
         (Q_i^k)^T u_i^k = (α/(α+β)) Y^k(X^k)^T u_i^k + (β/(α+β)) M^T u_i^k − Σ_{j=1}^{i−1} v_j^k ((u_j^k)^T u_i^k) − Σ_{j=i+1}^r y_j^k ((u_j^k)^T u_i^k).

When computing X^k(Y^k)^T y_i^k and Y^k(X^k)^T u_i^k in the above, we first compute (Y^k)^T y_i^k and (X^k)^T u_i^k to avoid forming the huge (m × n) matrix X^k(Y^k)^T. Moreover, the matrices (X^k)^T U^k, (U^k)^T U^k, (Y^k)^T V^k and M^T U^k that have been computed in (6.3) can be used again to evaluate the successive changes and the objective value as follows:

‖U^k − X^k‖_F^2 = tr((U^k)^T U^k) − 2 tr((X^k)^T U^k) + tr((X^k)^T X^k),
‖V^k − Y^k‖_F^2 = tr((V^k)^T V^k) − 2 tr((Y^k)^T V^k) + tr((Y^k)^T Y^k),
‖U^k(V^k)^T − M‖_F^2 = tr(((U^k)^T U^k)((V^k)^T V^k)) − 2 tr((M^T U^k)(V^k)^T) + ‖M‖_F^2.

In the above relations, tr((X^k)^T X^k) and tr((Y^k)^T Y^k) can be obtained from (U^{k−1})^T U^{k−1} and (V^{k−1})^T V^{k−1} in the previous iteration, respectively, and ‖M‖_F^2 can be computed in advance. Additionally, as we discussed in Remark 5.3, tr((Y^k)^T Y^k) and tr((U^k)^T U^k) can also be used in computing μ_k^max and σ_k^max, respectively. These techniques were also used in many popular algorithms for NMF to reduce the computational cost (see, for example, [2, 9, 10, 18, 37]).

The experiments are conducted on face datasets (dense matrices) and text datasets (sparse matrices). For the face datasets, we use CBCL^4, ORL^5 [29] and the extended Yale Face Database B (e-YaleB)^6 [19] for our test. CBCL contains 2429 images of faces with 19 × 19 pixels, ORL contains 400 images of faces with 112 × 92 pixels, and e-YaleB contains 2414 images of faces with 192 × 168 pixels. In our experiments, for each face dataset, each image is vectorized and stacked as a column of a data matrix M of size m × n. For the text datasets, we use three datasets from the CLUTO toolkit^7. The specific values of m and n for each dataset and the values of r used for our tests are summarized in Table 2.

The parameters in NAUM are set as follows: μ_min = μ̄_{−1} = 1, σ_min = σ̄_{−1} = 1, σ_max = 10^6, τ = 4, c = 10^{−4}, N = 3, μ_k^0 = max{0.1 μ̄_{k−1}, μ_min} and σ_k^0 =

^4 Available at … ^5 Available at … ^6 Available at …iskwak/extyaledatabase/extyaleb.html. ^7 Available at …
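To illustrate the formulas above, here is a NumPy sketch (ours, not the authors' MATLAB code) of one sweep of the column-wise updates for (6.2). It computes P_i^T y_i and (Q_i)^T u_i via cached Gram matrices as in (6.3), so the (m × n) product X Y^T is never formed; the scalar bounds x_max and y_max stand in for the entries of X^max and Y^max.

```python
import numpy as np

def naum_nmf_sweep(X, Y, M, alpha, beta, mu, sigma, x_max=np.inf, y_max=np.inf):
    """One pass of the hierarchical-prox updates (4.1c)/(4.2c) for the NMF model (6.2),
    using the Gram-matrix identities in (6.3). Returns the candidates (U, V)."""
    r = X.shape[1]
    w1, w2 = alpha / (alpha + beta), beta / (alpha + beta)
    U, V = X.copy(), Y.copy()
    YtY = Y.T @ Y                   # (Y^k)^T Y^k, reused for every column i
    MY = M @ Y                      # M y_i^k for all i
    for i in range(r):
        # P_i^T y_i without forming Z or X Y^T (first line of (6.3))
        Py = w1 * (X @ YtY[:, i]) + w2 * MY[:, i] \
             - U[:, :i] @ YtY[:i, i] - X[:, i + 1:] @ YtY[i + 1:, i]
        U[:, i] = np.clip((alpha * Py + mu * X[:, i]) / (alpha * YtY[i, i] + mu), 0.0, x_max)
    UtU = U.T @ U                   # (U^k)^T U^k, reused for every column i
    MtU = M.T @ U                   # M^T u_i^k for all i
    XtU = X.T @ U                   # (X^k)^T u_i^k for all i
    for i in range(r):
        # Q_i^T u_i without forming Z or X Y^T (second line of (6.3))
        Qu = w1 * (Y @ XtU[:, i]) + w2 * MtU[:, i] \
             - V[:, :i] @ UtU[:i, i] - Y[:, i + 1:] @ UtU[i + 1:, i]
        V[:, i] = np.clip((alpha * Qu + sigma * Y[:, i]) / (alpha * UtU[i, i] + sigma), 0.0, y_max)
    return U, V
```

The cached matrices (for example U^T U, X^T U and M^T U) can then be reused when evaluating the objective value, the successive changes and μ_k^max, σ_k^max, as described above.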


More information

A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang

A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES Fenghui Wang Department of Mathematics, Luoyang Normal University, Luoyang 470, P.R. China E-mail: wfenghui@63.com ABSTRACT.

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

COMPLEXITY OF A QUADRATIC PENALTY ACCELERATED INEXACT PROXIMAL POINT METHOD FOR SOLVING LINEARLY CONSTRAINED NONCONVEX COMPOSITE PROGRAMS

COMPLEXITY OF A QUADRATIC PENALTY ACCELERATED INEXACT PROXIMAL POINT METHOD FOR SOLVING LINEARLY CONSTRAINED NONCONVEX COMPOSITE PROGRAMS COMPLEXITY OF A QUADRATIC PENALTY ACCELERATED INEXACT PROXIMAL POINT METHOD FOR SOLVING LINEARLY CONSTRAINED NONCONVEX COMPOSITE PROGRAMS WEIWEI KONG, JEFFERSON G. MELO, AND RENATO D.C. MONTEIRO Abstract.

More information

Inexact Alternating Direction Method of Multipliers for Separable Convex Optimization

Inexact Alternating Direction Method of Multipliers for Separable Convex Optimization Inexact Alternating Direction Method of Multipliers for Separable Convex Optimization Hongchao Zhang hozhang@math.lsu.edu Department of Mathematics Center for Computation and Technology Louisiana State

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

A proximal-like algorithm for a class of nonconvex programming

A proximal-like algorithm for a class of nonconvex programming Pacific Journal of Optimization, vol. 4, pp. 319-333, 2008 A proximal-like algorithm for a class of nonconvex programming Jein-Shan Chen 1 Department of Mathematics National Taiwan Normal University Taipei,

More information

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS WEI DENG AND WOTAO YIN Abstract. The formulation min x,y f(x) + g(y) subject to Ax + By = b arises in

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions

A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions Angelia Nedić and Asuman Ozdaglar April 15, 2006 Abstract We provide a unifying geometric framework for the

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Reweighted Nuclear Norm Minimization with Application to System Identification

Reweighted Nuclear Norm Minimization with Application to System Identification Reweighted Nuclear Norm Minimization with Application to System Identification Karthi Mohan and Maryam Fazel Abstract The matrix ran minimization problem consists of finding a matrix of minimum ran that

More information

Lecture 7: Semidefinite programming

Lecture 7: Semidefinite programming CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 7: Semidefinite programming This lecture is on semidefinite programming, which is a powerful technique from both an analytic and computational

More information

Contents: 1. Minimization. 2. The theorem of Lions-Stampacchia for variational inequalities. 3. Γ -Convergence. 4. Duality mapping.

Contents: 1. Minimization. 2. The theorem of Lions-Stampacchia for variational inequalities. 3. Γ -Convergence. 4. Duality mapping. Minimization Contents: 1. Minimization. 2. The theorem of Lions-Stampacchia for variational inequalities. 3. Γ -Convergence. 4. Duality mapping. 1 Minimization A Topological Result. Let S be a topological

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5 STAT 39: MATHEMATICAL COMPUTATIONS I FALL 17 LECTURE 5 1 existence of svd Theorem 1 (Existence of SVD) Every matrix has a singular value decomposition (condensed version) Proof Let A C m n and for simplicity

More information

HYBRID JACOBIAN AND GAUSS SEIDEL PROXIMAL BLOCK COORDINATE UPDATE METHODS FOR LINEARLY CONSTRAINED CONVEX PROGRAMMING

HYBRID JACOBIAN AND GAUSS SEIDEL PROXIMAL BLOCK COORDINATE UPDATE METHODS FOR LINEARLY CONSTRAINED CONVEX PROGRAMMING SIAM J. OPTIM. Vol. 8, No. 1, pp. 646 670 c 018 Society for Industrial and Applied Mathematics HYBRID JACOBIAN AND GAUSS SEIDEL PROXIMAL BLOCK COORDINATE UPDATE METHODS FOR LINEARLY CONSTRAINED CONVEX

More information

Block Coordinate Descent for Regularized Multi-convex Optimization

Block Coordinate Descent for Regularized Multi-convex Optimization Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Subdifferential representation of convex functions: refinements and applications

Subdifferential representation of convex functions: refinements and applications Subdifferential representation of convex functions: refinements and applications Joël Benoist & Aris Daniilidis Abstract Every lower semicontinuous convex function can be represented through its subdifferential

More information

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Mathematical Programming manuscript No. (will be inserted by the editor) Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato D. C. Monteiro

More information

A Primal-Dual Interior-Point Method for Nonlinear Programming with Strong Global and Local Convergence Properties

A Primal-Dual Interior-Point Method for Nonlinear Programming with Strong Global and Local Convergence Properties A Primal-Dual Interior-Point Method for Nonlinear Programming with Strong Global and Local Convergence Properties André L. Tits Andreas Wächter Sasan Bahtiari Thomas J. Urban Craig T. Lawrence ISR Technical

More information

Introduction to Alternating Direction Method of Multipliers

Introduction to Alternating Direction Method of Multipliers Introduction to Alternating Direction Method of Multipliers Yale Chang Machine Learning Group Meeting September 29, 2016 Yale Chang (Machine Learning Group Meeting) Introduction to Alternating Direction

More information

PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT

PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT Linear and Nonlinear Analysis Volume 1, Number 1, 2015, 1 PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT KAZUHIRO HISHINUMA AND HIDEAKI IIDUKA Abstract. In this

More information

The Split Hierarchical Monotone Variational Inclusions Problems and Fixed Point Problems for Nonexpansive Semigroup

The Split Hierarchical Monotone Variational Inclusions Problems and Fixed Point Problems for Nonexpansive Semigroup International Mathematical Forum, Vol. 11, 2016, no. 8, 395-408 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/imf.2016.6220 The Split Hierarchical Monotone Variational Inclusions Problems and

More information

A NOVEL FILLED FUNCTION METHOD FOR GLOBAL OPTIMIZATION. 1. Introduction Consider the following unconstrained programming problem:

A NOVEL FILLED FUNCTION METHOD FOR GLOBAL OPTIMIZATION. 1. Introduction Consider the following unconstrained programming problem: J. Korean Math. Soc. 47, No. 6, pp. 53 67 DOI.434/JKMS..47.6.53 A NOVEL FILLED FUNCTION METHOD FOR GLOBAL OPTIMIZATION Youjiang Lin, Yongjian Yang, and Liansheng Zhang Abstract. This paper considers the

More information

Maximization of Submodular Set Functions

Maximization of Submodular Set Functions Northeastern University Department of Electrical and Computer Engineering Maximization of Submodular Set Functions Biomedical Signal Processing, Imaging, Reasoning, and Learning BSPIRAL) Group Author:

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

An inexact subgradient algorithm for Equilibrium Problems

An inexact subgradient algorithm for Equilibrium Problems Volume 30, N. 1, pp. 91 107, 2011 Copyright 2011 SBMAC ISSN 0101-8205 www.scielo.br/cam An inexact subgradient algorithm for Equilibrium Problems PAULO SANTOS 1 and SUSANA SCHEIMBERG 2 1 DM, UFPI, Teresina,

More information

3.10 Lagrangian relaxation

3.10 Lagrangian relaxation 3.10 Lagrangian relaxation Consider a generic ILP problem min {c t x : Ax b, Dx d, x Z n } with integer coefficients. Suppose Dx d are the complicating constraints. Often the linear relaxation and the

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

A Proximal Alternating Direction Method for Semi-Definite Rank Minimization (Supplementary Material)

A Proximal Alternating Direction Method for Semi-Definite Rank Minimization (Supplementary Material) A Proximal Alternating Direction Method for Semi-Definite Rank Minimization (Supplementary Material) Ganzhao Yuan and Bernard Ghanem King Abdullah University of Science and Technology (KAUST), Saudi Arabia

More information

SOS TENSOR DECOMPOSITION: THEORY AND APPLICATIONS

SOS TENSOR DECOMPOSITION: THEORY AND APPLICATIONS SOS TENSOR DECOMPOSITION: THEORY AND APPLICATIONS HAIBIN CHEN, GUOYIN LI, AND LIQUN QI Abstract. In this paper, we examine structured tensors which have sum-of-squares (SOS) tensor decomposition, and study

More information

AN AUGMENTED LAGRANGIAN AFFINE SCALING METHOD FOR NONLINEAR PROGRAMMING

AN AUGMENTED LAGRANGIAN AFFINE SCALING METHOD FOR NONLINEAR PROGRAMMING AN AUGMENTED LAGRANGIAN AFFINE SCALING METHOD FOR NONLINEAR PROGRAMMING XIAO WANG AND HONGCHAO ZHANG Abstract. In this paper, we propose an Augmented Lagrangian Affine Scaling (ALAS) algorithm for general

More information

Value and Policy Iteration

Value and Policy Iteration Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite

More information

Research Article Modified Halfspace-Relaxation Projection Methods for Solving the Split Feasibility Problem

Research Article Modified Halfspace-Relaxation Projection Methods for Solving the Split Feasibility Problem Advances in Operations Research Volume 01, Article ID 483479, 17 pages doi:10.1155/01/483479 Research Article Modified Halfspace-Relaxation Projection Methods for Solving the Split Feasibility Problem

More information

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Proximal-like contraction methods for monotone variational inequalities in a unified framework Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

Convex Functions and Optimization

Convex Functions and Optimization Chapter 5 Convex Functions and Optimization 5.1 Convex Functions Our next topic is that of convex functions. Again, we will concentrate on the context of a map f : R n R although the situation can be generalized

More information

A Regularized Directional Derivative-Based Newton Method for Inverse Singular Value Problems

A Regularized Directional Derivative-Based Newton Method for Inverse Singular Value Problems A Regularized Directional Derivative-Based Newton Method for Inverse Singular Value Problems Wei Ma Zheng-Jian Bai September 18, 2012 Abstract In this paper, we give a regularized directional derivative-based

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić August 15, 2015 Abstract We consider distributed optimization problems

More information

A Bregman alternating direction method of multipliers for sparse probabilistic Boolean network problem

A Bregman alternating direction method of multipliers for sparse probabilistic Boolean network problem A Bregman alternating direction method of multipliers for sparse probabilistic Boolean network problem Kangkang Deng, Zheng Peng Abstract: The main task of genetic regulatory networks is to construct a

More information

A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions

A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions Angelia Nedić and Asuman Ozdaglar April 16, 2006 Abstract In this paper, we study a unifying framework

More information

Convex Optimization Notes

Convex Optimization Notes Convex Optimization Notes Jonathan Siegel January 2017 1 Convex Analysis This section is devoted to the study of convex functions f : B R {+ } and convex sets U B, for B a Banach space. The case of B =

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly

More information

17 Solution of Nonlinear Systems

17 Solution of Nonlinear Systems 17 Solution of Nonlinear Systems We now discuss the solution of systems of nonlinear equations. An important ingredient will be the multivariate Taylor theorem. Theorem 17.1 Let D = {x 1, x 2,..., x m

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

Sequential Unconstrained Minimization: A Survey

Sequential Unconstrained Minimization: A Survey Sequential Unconstrained Minimization: A Survey Charles L. Byrne February 21, 2013 Abstract The problem is to minimize a function f : X (, ], over a non-empty subset C of X, where X is an arbitrary set.

More information

Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms

Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms JOTA manuscript No. (will be inserted by the editor) Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms Peter Ochs Jalal Fadili Thomas Brox Received: date / Accepted: date Abstract

More information

Second order forward-backward dynamical systems for monotone inclusion problems

Second order forward-backward dynamical systems for monotone inclusion problems Second order forward-backward dynamical systems for monotone inclusion problems Radu Ioan Boţ Ernö Robert Csetnek March 6, 25 Abstract. We begin by considering second order dynamical systems of the from

More information

The following definition is fundamental.

The following definition is fundamental. 1. Some Basics from Linear Algebra With these notes, I will try and clarify certain topics that I only quickly mention in class. First and foremost, I will assume that you are familiar with many basic

More information

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction

More information

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS RALPH HOWARD DEPARTMENT OF MATHEMATICS UNIVERSITY OF SOUTH CAROLINA COLUMBIA, S.C. 29208, USA HOWARD@MATH.SC.EDU Abstract. This is an edited version of a

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

An Augmented Lagrangian Approach for Sparse Principal Component Analysis

An Augmented Lagrangian Approach for Sparse Principal Component Analysis An Augmented Lagrangian Approach for Sparse Principal Component Analysis Zhaosong Lu Yong Zhang July 12, 2009 Abstract Principal component analysis (PCA) is a widely used technique for data analysis and

More information

On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean

On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean Renato D.C. Monteiro B. F. Svaiter March 17, 2009 Abstract In this paper we analyze the iteration-complexity

More information

On Penalty and Gap Function Methods for Bilevel Equilibrium Problems

On Penalty and Gap Function Methods for Bilevel Equilibrium Problems On Penalty and Gap Function Methods for Bilevel Equilibrium Problems Bui Van Dinh 1 and Le Dung Muu 2 1 Faculty of Information Technology, Le Quy Don Technical University, Hanoi, Vietnam 2 Institute of

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions

Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions March 6, 2013 Contents 1 Wea second variation 2 1.1 Formulas for variation........................

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Math. Program., Ser. B 2013) 140:125 161 DOI 10.1007/s10107-012-0629-5 FULL LENGTH PAPER Gradient methods for minimizing composite functions Yu. Nesterov Received: 10 June 2010 / Accepted: 29 December

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

Nonlinear Programming Algorithms Handout

Nonlinear Programming Algorithms Handout Nonlinear Programming Algorithms Handout Michael C. Ferris Computer Sciences Department University of Wisconsin Madison, Wisconsin 5376 September 9 1 Eigenvalues The eigenvalues of a matrix A C n n are

More information