Generalized Singular Value Thresholding

Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence

Generalized Singular Value Thresholding

Canyi Lu (1), Changbo Zhu (1), Chunyan Xu (2), Shuicheng Yan (1), Zhouchen Lin (3)
(1) Department of Electrical and Computer Engineering, National University of Singapore
(2) School of Computer Science and Technology, Huazhong University of Science and Technology
(3) Key Laboratory of Machine Perception (MOE), School of EECS, Peking University
canyilu@nus.edu.sg, zhuchangbo@gmail.com, xuchunyan@gmail.com, eleyans@nus.edu.sg, zlin@pku.edu.cn

Corresponding author: Zhouchen Lin. Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

This work studies the Generalized Singular Value Thresholding (GSVT) operator Prox_g^σ(·),

    Prox_g^σ(B) = argmin_X  Σ_{i=1}^m g(σ_i(X)) + (1/2)‖X − B‖_F^2,

associated with a nonconvex function g defined on the singular values of X. We prove that GSVT can be obtained by performing the proximal operator of g (denoted as Prox_g(·)) on the singular values, since Prox_g(·) is monotone when g is lower bounded. If the nonconvex g satisfies some conditions (many popular nonconvex surrogates of the l0-norm, e.g., the lp-norm with 0 < p < 1, are special cases), a general solver for finding Prox_g(b) is proposed for any b ≥ 0. GSVT greatly generalizes the known Singular Value Thresholding (SVT), which is a basic subroutine in many convex low rank minimization methods. We are able to solve nonconvex low rank minimization problems by using GSVT in place of SVT.

Introduction

Sparse and low rank structures have received much attention in recent years. There have been many applications which exploit these two structures, such as face recognition (Wright et al. 2009), subspace clustering (Cheng et al. 2010; Liu et al. 2013) and background modeling (Candès et al. 2011). To achieve sparsity, a principled approach is to use the convex l1-norm. However, l1-minimization may be suboptimal, since the l1-norm is a loose approximation of the l0-norm and often leads to an over-penalized problem. This brings attention back to nonconvex surrogates which interpolate between the l0-norm and the l1-norm. Many nonconvex penalties have been proposed, including the lp-norm with 0 < p < 1 (Frank and Friedman 1993), Smoothly Clipped Absolute Deviation (SCAD) (Fan and Li 2001), Logarithm (Friedman 2012), Minimax Concave Penalty (MCP) (Zhang 2010), Geman (Geman and Yang 1995) and Laplace (Trzasko and Manduca 2009). Their definitions are shown in Table 1. Numerical studies (Candès, Wakin, and Boyd 2008) have shown that nonconvex optimization usually outperforms convex models.

Table 1: Popular nonconvex surrogate functions of the l0-norm (θ ≥ 0, λ > 0).

    Penalty     Formula g(θ)
    lp-norm     λ θ^p, 0 < p < 1
    SCAD        λθ, if θ ≤ λ;  (−θ^2 + 2γλθ − λ^2)/(2(γ − 1)), if λ < θ ≤ γλ;  λ^2(γ + 1)/2, if θ > γλ  (γ > 2)
    Logarithm   λ/log(γ + 1) · log(γθ + 1)
    MCP         λθ − θ^2/(2γ), if θ < γλ;  γλ^2/2, if θ ≥ γλ
    Geman       λθ/(θ + γ)
    Laplace     λ(1 − exp(−θ/γ))

The low rank structure is an extension of sparsity defined on the singular values of a matrix. A principled way is to use the nuclear norm, which is a convex surrogate of the rank function (Recht, Fazel, and Parrilo 2010). However, it suffers from the same suboptimality issue as the l1-norm in many cases. Very recently, many of the popular nonconvex surrogate functions in Table 1 have been extended to the singular values to better approximate the rank function (Lu et al. 2014). However, different from the convex case, nonconvex low rank minimization is much more challenging than nonconvex sparse minimization.
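The penalties in Table 1 are easy to evaluate numerically. The following short NumPy sketch (not part of the original paper) implements a few of them; the argument names lam, gamma and p mirror λ, γ and p above and are our own choices.

```python
import numpy as np

def lp_norm(theta, lam, p):
    # l_p penalty: lam * theta^p, with 0 < p < 1
    return lam * np.power(theta, p)

def logarithm(theta, lam, gamma):
    # Logarithm penalty: lam / log(gamma + 1) * log(gamma * theta + 1)
    return lam / np.log(gamma + 1.0) * np.log(gamma * theta + 1.0)

def mcp(theta, lam, gamma):
    # MCP: lam*theta - theta^2/(2*gamma) if theta < gamma*lam, else gamma*lam^2/2
    return np.where(theta < gamma * lam,
                    lam * theta - theta ** 2 / (2.0 * gamma),
                    gamma * lam ** 2 / 2.0)

def geman(theta, lam, gamma):
    # Geman penalty: lam * theta / (theta + gamma)
    return lam * theta / (theta + gamma)

def laplace(theta, lam, gamma):
    # Laplace penalty: lam * (1 - exp(-theta / gamma))
    return lam * (1.0 - np.exp(-theta / gamma))

theta = np.linspace(0.0, 5.0, 6)
print(mcp(theta, lam=1.0, gamma=1.5))
```

All of these are concave and nondecreasing on [0, +∞), which is the property exploited throughout the paper.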
The Iteratively Reweighted Nuclear Norm (IRNN) method was proposed to solve the following nonconvex low rank minimization problem (Lu et al. 2014):

    min_X  F(X) = Σ_{i=1}^m g(σ_i(X)) + h(X),    (1)

where σ_i(X) denotes the i-th singular value of X ∈ R^{m×n} (we assume m ≤ n in this work). The penalty g: R+ → R+ is continuous, concave and nondecreasing on [0, +∞); the popular nonconvex surrogate functions in Table 1 are examples. The loss h: R^{m×n} → R+ has Lipschitz continuous gradient. IRNN updates X^{k+1} by minimizing a surrogate function which upper bounds the objective in (1); the surrogate is constructed by linearizing g and h at X^k simultaneously. In theory, IRNN is guaranteed to decrease the objective value of (1) in each iteration. However, it may decrease slowly, since the upper bound surrogate may be quite loose.

Figure 1: Gradients ∇g(θ) of some nonconvex penalties: (a) lp-norm (p = 0.5), (b) SCAD, (c) Logarithm, (d) MCP, (e) Geman, (f) Laplace; λ and γ are fixed across penalties.

It is expected that minimizing a tighter surrogate will lead to faster convergence. A tighter surrogate of the objective in (1) is obtained by keeping g and relaxing h only. This leads to the following updating rule, which is named the Generalized Proximal Gradient (GPG) method in this work:

    X^{k+1} = argmin_X  Σ_{i=1}^m g(σ_i(X)) + h(X^k) + ⟨∇h(X^k), X − X^k⟩ + (μ/2)‖X − X^k‖_F^2
            = argmin_X  Σ_{i=1}^m g(σ_i(X)) + (μ/2)‖X − X^k + (1/μ)∇h(X^k)‖_F^2,    (2)

where μ > L(h), the Lipschitz constant of ∇h, guarantees the convergence of GPG, as shown later. It can be seen that solving (2) requires solving the following problem:

    Prox_g^σ(B) = argmin_X  Σ_{i=1}^m g(σ_i(X)) + (1/2)‖X − B‖_F^2.    (3)

In this work, the mapping Prox_g^σ(·) is called the Generalized Singular Value Thresholding (GSVT) operator associated with the function Σ_{i=1}^m g(·) defined on the singular values. If g(x) = λx, then Σ_{i=1}^m g(σ_i(X)) degenerates to the convex nuclear norm λ‖X‖_*. In that case (3) has the closed form solution Prox_g^σ(B) = U Diag(D_λ(σ(B))) V^T, where D_λ(σ(B)) = {(σ_i(B) − λ)_+}_{i=1}^m, and U and V are from the SVD of B, i.e., B = U Diag(σ(B)) V^T. This is the known Singular Value Thresholding (SVT) operator associated with the convex nuclear norm (when g(x) = λx) (Cai, Candès, and Shen 2010). More generally, for a convex g, the solution to (3) is

    Prox_g^σ(B) = U Diag(Prox_g(σ(B))) V^T,    (4)

where Prox_g(·) is defined element-wise as follows:

    Prox_g(b) = argmin_{x ≥ 0}  f_b(x) = g(x) + (1/2)(x − b)^2,    (5)

and Prox_g(·) is the known proximal operator associated with a convex g (Combettes and Pesquet 2011). (For x < 0 we define g(x) = g(−x). If b ≥ 0, then Prox_g(b) ≥ 0; if b < 0, then Prox_g(b) = −Prox_g(−b). So we only need to discuss the case b ≥ 0, x ≥ 0 in this work.) That is to say, solving (3) is equivalent to performing Prox_g(·) on each singular value of B. In this case the mapping Prox_g(·) is unique, i.e., (5) has a unique solution. More importantly, Prox_g(·) is monotone, i.e., Prox_g(b_1) ≥ Prox_g(b_2) for any b_1 ≥ b_2. This guarantees that the nonincreasing order of the singular values is preserved after shrinkage and thresholding by the mapping Prox_g(·).

For a nonconvex g, we still call Prox_g(·) the proximal operator, but note that such a mapping may not be unique. It is still an open problem whether Prox_g(·) is monotone for a nonconvex g. Without proving the monotonicity of Prox_g(·), one cannot simply apply it to the singular values of B to obtain the solution to (3), as is done in SVT. Even if Prox_g(·) is monotone, since it is not unique, one also needs to carefully choose the solutions p_i ∈ Prox_g(σ_i(B)) such that p_1 ≥ p_2 ≥ ⋯ ≥ p_m. Another challenging problem is that there does not exist a general solver for (5) for a general nonconvex g.

It is worth mentioning that some previous works studied the solution to (3) for some special choices of nonconvex g (Nie, Huang, and Ding 2012; Chartrand 2012; Liu et al. 2013a). However, none of their proofs was rigorous, since they ignored proving the monotonicity of Prox_g(·). See the detailed discussions in the next section. Another recent work (Gu et al. 2014) considered the following problem related to the weighted nuclear norm:

    min_X  f_{w,B}(X) = Σ_{i=1}^m w_i σ_i(X) + (1/2)‖X − B‖_F^2,    (6)

where w_i ≥ 0, i = 1, …, m. Problem (6) is a little more general than (3), as it takes different g_i(x) = w_i x.
It is claimed in (Gu et al. 2014) that the solution to (6) is

    X* = U Diag({Prox_{g_i}(σ_i(B))}_{i=1}^m) V^T,    (7)

where B = U Diag(σ(B)) V^T is the SVD of B, and Prox_{g_i}(σ_i(B)) = max{σ_i(B) − w_i, 0}. However, such a result and its proof are not correct in general. A counterexample can be constructed with a 2×2 matrix B and weights satisfying w_1 > w_2: the X* given by (7) is then not optimal to (6), since there exists an X° with f_{w,B}(X°) < f_{w,B}(X*). The reason is that

    (Prox_{g_i}(σ_i(B)) − Prox_{g_j}(σ_j(B)))(σ_i(B) − σ_j(B)) ≥ 0    (8)

is not guaranteed to hold for every pair i ≤ j. Note that (8) does hold when w_1 ≤ w_2 ≤ ⋯ ≤ w_m, and thus (7) is optimal to (6) in that case.

In this work, we give the first rigorous proof that Prox_g(·) is monotone for any lower bounded function g (regardless of the convexity of g). Then solving (3) degenerates to solving (5) for each b = σ_i(B). The Generalized Singular Value Thresholding (GSVT) operator Prox_g^σ(·) in (3), associated with any lower bounded function, is much more general than the known SVT associated with the convex nuclear norm (Cai, Candès, and Shen 2010).
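A minimal NumPy sketch (not from the paper) of the GSVT operator in (3)–(4): compute the SVD of B and apply an element-wise proximal mapping to the singular values. The helper soft_threshold corresponds to the convex case g(x) = λx, in which GSVT reduces to classical SVT; any monotone scalar Prox_g can be substituted for it.

```python
import numpy as np

def soft_threshold(sigma, lam):
    # Proximal operator of g(x) = lam * x: max(sigma - lam, 0).
    return np.maximum(sigma - lam, 0.0)

def gsvt(B, prox, *args):
    # Generalized Singular Value Thresholding: apply a (monotone) scalar
    # proximal operator to the singular values of B and rebuild the matrix.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_shrunk = prox(s, *args)   # element-wise shrinkage of the singular values
    return U @ np.diag(s_shrunk) @ Vt

# Example: with soft_threshold this is exactly the classical SVT operator.
B = np.random.randn(5, 8)
X = gsvt(B, soft_threshold, 0.5)
```

Because np.linalg.svd returns singular values in descending order, a monotone prox automatically preserves that order, which is precisely the property established below for nonconvex g.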

In order to compute GSVT, we analyze the solution to (5) for certain types of g (some special cases are shown in Table 1) in theory, and propose a general solver for (5). Finally, with GSVT, we can solve (1) by the Generalized Proximal Gradient (GPG) algorithm shown in (2). We test both Iteratively Reweighted Nuclear Norm (IRNN) and GPG on the matrix completion problem. Both synthetic and real data experiments show that GPG outperforms IRNN in terms of the recovery error and the objective function value.

Generalized Singular Value Thresholding

Problem Reformulation

A main goal of this work is to compute GSVT (3) and to use it to solve (1). We will show that, if Prox_g(·) is monotone, problem (3) can be reformulated into an equivalent problem which is much easier to solve.

Lemma 1 (von Neumann's trace inequality (Rhea 2011)). For any matrices A, B ∈ R^{m×n} (m ≤ n), Tr(A^T B) ≤ Σ_{i=1}^m σ_i(A) σ_i(B), where σ_1(A) ≥ σ_2(A) ≥ ⋯ ≥ 0 and σ_1(B) ≥ σ_2(B) ≥ ⋯ ≥ 0 are the singular values of A and B, respectively. The equality holds if and only if there exist unitaries U and V such that A = U Diag(σ(A)) V^T and B = U Diag(σ(B)) V^T are the SVDs of A and B, simultaneously.

Theorem 1. Let g: R+ → R+ be a function such that Prox_g(·) is monotone. Let B = U Diag(σ(B)) V^T be the SVD of B ∈ R^{m×n}. Then an optimal solution to (3) is

    X* = U Diag(ϱ*) V^T,    (9)

where ϱ* satisfies ϱ*_1 ≥ ϱ*_2 ≥ ⋯ ≥ ϱ*_m ≥ 0, and

    ϱ*_i ∈ Prox_g(σ_i(B)) = argmin_{ϱ_i ≥ 0}  g(ϱ_i) + (1/2)(ϱ_i − σ_i(B))^2,  i = 1, …, m.    (10)

Proof. Denote σ_1(X) ≥ ⋯ ≥ σ_m(X) ≥ 0 as the singular values of X. Problem (3) can be rewritten as

    min_{ϱ: ϱ_1 ≥ ⋯ ≥ ϱ_m ≥ 0}  min_{X: σ(X) = ϱ}  { Σ_{i=1}^m g(ϱ_i) + (1/2)‖X − B‖_F^2 }.    (11)

By using von Neumann's trace inequality in Lemma 1, we have

    (1/2)‖X − B‖_F^2 = (1/2)( Tr(X^T X) − 2 Tr(X^T B) + Tr(B^T B) )
                     = (1/2)( Σ_i σ_i^2(X) − 2 Tr(X^T B) + Σ_i σ_i^2(B) )
                     ≥ (1/2)( Σ_i σ_i^2(X) − 2 Σ_i σ_i(X) σ_i(B) + Σ_i σ_i^2(B) )
                     = (1/2) Σ_i (σ_i(X) − σ_i(B))^2.

The above equality holds when X admits the singular value decomposition X = U Diag(σ(X)) V^T, where U and V are the left and right orthonormal matrices in the SVD of B. In this case, problem (11) reduces to

    min_{ϱ: ϱ_1 ≥ ⋯ ≥ ϱ_m ≥ 0}  Σ_{i=1}^m ( g(ϱ_i) + (1/2)(ϱ_i − σ_i(B))^2 ).    (12)

Since Prox_g(·) is monotone and σ_1(B) ≥ σ_2(B) ≥ ⋯ ≥ σ_m(B), there exist ϱ*_i ∈ Prox_g(σ_i(B)) such that ϱ*_1 ≥ ϱ*_2 ≥ ⋯ ≥ ϱ*_m. Such a choice of ϱ* is optimal to (12), and thus (9) is optimal to (3).

From the above proof, it can be seen that the monotonicity of Prox_g(·) is the key condition which makes problem (11) conditionally separable. Thus the solution (9) to (3) shares a similar formulation with the known Singular Value Thresholding (SVT) operator associated with the convex nuclear norm (Cai, Candès, and Shen 2010). Note that for a convex g, Prox_g(·) is always monotone. Indeed,

    (Prox_g(b_1) − Prox_g(b_2))(b_1 − b_2) ≥ (Prox_g(b_1) − Prox_g(b_2))^2,  for all b_1, b_2 ∈ R+.

The above inequality follows from the optimality of Prox_g(·) and the convexity of g. The monotonicity of Prox_g(·) for a nonconvex g was previously unknown. Some previous works (Nie, Huang, and Ding 2012; Chartrand 2012; Liu et al. 2013a) claimed that the solution (9) is optimal to (3) for some special choices of nonconvex g. However, their results are not rigorous, since the monotonicity of Prox_g(·) is not proved. Surprisingly, we find that the monotonicity of Prox_g(·) holds for any lower bounded function g.

Theorem 2. For any lower bounded function g, its proximal operator Prox_g(·) is monotone, i.e., for any p_i ∈ Prox_g(x_i), i = 1, 2, we have p_1 ≥ p_2 when x_1 > x_2.

Note that it is possible that σ_i(B) = σ_j(B) for some i < j in (10). Since Prox_g(·) may not be unique, we need to choose ϱ*_i ∈ Prox_g(σ_i(B)) and ϱ*_j ∈ Prox_g(σ_j(B)) such that ϱ*_i ≥ ϱ*_j. This is the only difference between GSVT and SVT.
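Theorem 1 is easy to sanity-check numerically. The sketch below (not from the paper) uses a nonconvex, lower bounded penalty g(x) = min(x, 1) — a capped-l1 penalty of our choosing — solves the scalar problem (10) by brute-force grid search, assembles (9), and verifies that random perturbations do not improve the objective in (3).

```python
import numpy as np

rng = np.random.default_rng(0)

def capped_l1(x):
    # A nonconvex, lower bounded penalty: g(x) = min(x, 1).
    return np.minimum(x, 1.0)

def objective(X, B):
    # Objective of problem (3) for the capped-l1 penalty.
    s = np.linalg.svd(X, compute_uv=False)
    return capped_l1(s).sum() + 0.5 * np.linalg.norm(X - B, "fro") ** 2

B = rng.standard_normal((6, 9))
U, s, Vt = np.linalg.svd(B, full_matrices=False)

# Scalar proximal operator of capped_l1 by grid search (illustration only).
grid = np.linspace(0.0, s.max() + 1.0, 20001)
rho = np.array([grid[np.argmin(capped_l1(grid) + 0.5 * (grid - b) ** 2)] for b in s])
X_star = U @ np.diag(rho) @ Vt      # the GSVT solution suggested by Theorem 1

# X_star should not be beaten by random perturbations of itself.
trials = [objective(X_star + 0.1 * rng.standard_normal(B.shape), B) for _ in range(200)]
print(objective(X_star, B) <= min(trials))
```

The grid search stands in for the general solver developed in the next section; it is only meant to illustrate the statement of Theorem 1.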
Proximal Operator of a Nonconvex Function

So far, we have proved that for any lower bounded function g, solving (3) is equivalent to solving (5) for each b = σ_i(B), i = 1, …, m. For a nonconvex g, the candidate solutions to (5) have a closed form only in some special cases (Gong et al. 2013); there does not exist a general solver for a more general nonconvex g. In this section, we analyze the solution to (5) for a broad family of nonconvex g. A general solver is then proposed in the next section.

Assumption 1. g: R+ → R+, g(0) = 0. g is concave, nondecreasing and differentiable. The gradient ∇g is convex.

In this work, we are interested in nonconvex surrogates of the l0-norm. Except for the differentiability of g and the convexity of ∇g, all the other conditions in Assumption 1 are necessary for constructing a surrogate of the l0-norm. As shown later, these two additional conditions make our analysis much easier. Note that the assumptions on the nonconvex functions considered in Assumption 1 are quite general.

Algorithm 1: A General Solver to (5) in which g satisfies Assumption 1

    Input: b ≥ 0.
    Output: Identify an optimal solution, 0 or x̂_b = max{x | ∇f_b(x) = 0, x ≥ 0}.
    if ∇g(b) = 0 then
        return x̂_b = b;
    else
        // find x̂_b by fixed point iteration
        x^0 = b;  // initialization
        while not converged do
            x^{k+1} = b − ∇g(x^k);
            if x^{k+1} < 0 then
                return x̂_b = 0; break;
            end
        end
    end
    Compare f_b(0) and f_b(x̂_b) to identify the optimal one.

It is easy to verify that many popular surrogates of the l0-norm in Table 1 satisfy Assumption 1, including the lp-norm, Logarithm, MCP, Geman and Laplace penalties. Only the SCAD penalty violates the convexity assumption on ∇g, as shown in Figure 1.

Proposition 1. Given g satisfying Assumption 1, the optimal solution to (5) lies in [0, b].

The above fact is obvious, since both g(x) and (1/2)(x − b)^2 are nondecreasing on [b, +∞). Such a result limits the solution space, and thus is very useful for our analysis. Our general solver to (5) is also based on Proposition 1. Note that the solutions to (5) lie in {0} or in the set of local points {x | ∇f_b(x) = ∇g(x) + x − b = 0}. Our analysis is mainly based on the number of intersection points of D(x) = ∇g(x) and the line C_b(x) = b − x. Let b̄ = sup{b | C_b(x) and D(x) have no intersection}. We give the solution to (5) in different cases; please refer to the supplementary material for the detailed proofs.

Proposition 2. Given g satisfying Assumption 1 and ∇g(0) = +∞. Restricted on [0, +∞), when b > b̄, C_b(x) and D(x) have two intersection points, denoted as P_1^b = (x_1^b, y_1^b) and P_2^b = (x_2^b, y_2^b) with x_1^b < x_2^b. If there does not exist b > b̄ such that f_b(0) = f_b(x_2^b), then Prox_g(b) = 0 for all b ≥ 0. If there exists b > b̄ such that f_b(0) = f_b(x_2^b), let b* = inf{b > b̄ | f_b(0) = f_b(x_2^b)}. Then we have

    Prox_g(b) = argmin_{x ≥ 0} f_b(x) = { x_2^b, if b > b*;  0, if b ≤ b*. }

Proposition 3. Given g satisfying Assumption 1 and ∇g(0) < +∞. Restricted on [0, +∞), if C_{∇g(0)}(x) = ∇g(0) − x ≤ ∇g(x) for all x ∈ (0, ∇g(0)), then C_b(x) and D(x) have only one intersection point (x_1^b, y_1^b) when b > ∇g(0). Furthermore,

    Prox_g(b) = argmin_{x ≥ 0} f_b(x) = { x_1^b, if b > ∇g(0);  0, if b ≤ ∇g(0). }

Suppose instead that there exists 0 < x̂ < ∇g(0) such that C_{∇g(0)}(x̂) = ∇g(0) − x̂ > ∇g(x̂). Then, when ∇g(0) > b > b̄, C_b(x) and D(x) have two intersection points, denoted as P_1^b = (x_1^b, y_1^b) and P_2^b = (x_2^b, y_2^b) with x_1^b < x_2^b; when b > ∇g(0), C_b(x) and D(x) have only one intersection point (x_2^b, y_2^b). Also, there exists b such that ∇g(0) > b > b̄ and f_b(0) = f_b(x_2^b). Let b* = inf{b | ∇g(0) > b > b̄, f_b(0) = f_b(x_2^b)}. We have

    Prox_g(b) = argmin_{x ≥ 0} f_b(x) = { x_2^b, if b > ∇g(0);  x_2^b, if ∇g(0) ≥ b > b*;  0, if b ≤ b*. }

Corollary 1. Given g satisfying Assumption 1, denote x̂_b = max{x | ∇f_b(x) = 0, x ≥ 0} and x_b* = argmin_{x ∈ {0, x̂_b}} f_b(x). Then x_b* is optimal to (5).

The results in Propositions 2 and 3 give the solution to (5) in different cases, while Corollary 1 summarizes them. It can be seen that one only needs to compute x̂_b, the largest local minimum; comparing the objective values at 0 and at x̂_b then yields an optimal solution to (5).

Figure 2: Plots of b versus Prox_g(b) for different choices of g: the convex l1-norm and popular nonconvex functions satisfying Assumption 1 in Table 1. Panels: (a) l1-norm, (b) lp-norm, (c) MCP, (d) Logarithm, (e) Laplace, (f) Geman.

Algorithms

In this section, we first give a general solver to (5) for g satisfying Assumption 1. Then we are able to solve the GSVT problem (3). With GSVT, problem (1) can be solved by the Generalized Proximal Gradient (GPG) algorithm shown in (2). We also give the convergence guarantee of GPG.

A General Solver to (5)

Given g satisfying Assumption 1, as shown in Corollary 1, 0 and x̂_b = max{x | ∇f_b(x) = 0, x ≥ 0} are the candidate solutions to (5).
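A minimal Python sketch of Algorithm 1 (not from the paper), instantiated with the Logarithm penalty from Table 1, is given below; the stopping tolerance and iteration cap are our own arbitrary choices. It runs the fixed point iteration x^{k+1} = b − ∇g(x^k) to locate x̂_b and then compares the objective at 0 and at x̂_b, as in Corollary 1. Why the fixed point loop converges to x̂_b is explained in the discussion that follows.

```python
import numpy as np

def log_penalty(x, lam, gamma):
    # Logarithm penalty g(x) = lam / log(gamma + 1) * log(gamma * x + 1).
    return lam / np.log(gamma + 1.0) * np.log(gamma * x + 1.0)

def grad_log_penalty(x, lam, gamma):
    # Gradient of the Logarithm penalty.
    return lam * gamma / (np.log(gamma + 1.0) * (gamma * x + 1.0))

def prox_nonconvex(b, lam, gamma, tol=1e-10, max_iter=100):
    # Sketch of Algorithm 1: general solver to (5) for g satisfying Assumption 1.
    f = lambda x: log_penalty(x, lam, gamma) + 0.5 * (x - b) ** 2
    if grad_log_penalty(b, lam, gamma) == 0.0:
        x_hat = b
    else:
        x = b                     # initialization x^0 = b
        x_hat = 0.0
        for _ in range(max_iter):
            x_new = b - grad_log_penalty(x, lam, gamma)
            if x_new < 0.0:       # no stationary point in [0, b]
                x_hat = 0.0
                break
            if abs(x_new - x) < tol:
                x_hat = x_new
                break
            x = x_new
        else:
            x_hat = x
    # Compare the two candidates 0 and x_hat (Corollary 1).
    return 0.0 if f(0.0) <= f(x_hat) else x_hat

print(prox_nonconvex(b=3.0, lam=1.0, gamma=2.0))
```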
The remaining task is to find x̂_b, which is the largest local minimum point near x = b. So we can start searching for x̂_b from x^0 = b by the fixed point iteration algorithm.

Note that this search is very fast, since we only need to search within [0, b]. The whole procedure for finding x̂_b is given in Algorithm 1. In theory, it can be proved that the fixed point iteration is guaranteed to find x̂_b. If g is nonsmooth or ∇g is nonconvex, the fixed point iteration algorithm may still be applicable; the key is to find all the local solutions with carefully chosen initial points, and all the nonsmooth points should also be considered as candidates.

All the nonconvex surrogates g in Table 1 except SCAD satisfy Assumption 1, and thus the solution to (5) can be obtained by Algorithm 1. Figure 2 illustrates the shrinkage effect of the proximal operators of these functions and of the convex l1-norm. The shrinkage and thresholding effects of these proximal operators are similar when b is relatively small. However, when b is relatively large, the proximal operators of the nonconvex functions are nearly unbiased, i.e., they keep b nearly unchanged, as the l0-norm does. In contrast, the proximal operator of the convex l1-norm is biased. In this case, the l1-norm may be over-penalized and may perform quite differently from the l0-norm. This also supports the necessity of using nonconvex penalties on the singular values to approximate the rank function.

Generalized Proximal Gradient Algorithm for (1)

Given g satisfying Assumption 1, we are now able to obtain the optimal solution to (3) by (9) and Algorithm 1. We therefore have a better solver than IRNN for (1), based on the updating rule (2), or equivalently

    X^{k+1} = Prox^σ_{g/μ}( X^k − (1/μ) ∇h(X^k) ).

The above updating rule is named the Generalized Proximal Gradient (GPG) method for the nonconvex problem (1); it generalizes some previous methods (Beck and Teboulle 2009; Gong et al. 2013). The main per-iteration cost of GPG is to compute an SVD, which is the same as in many convex methods (Toh and Yun 2010a; Lin, Chen, and Ma 2009). In theory, we have the following convergence results for GPG.

Theorem 3. If μ > L(h), the sequence {X^k} generated by (2) satisfies the following properties:
(1) F(X^k) is monotonically decreasing;
(2) lim_{k→+∞} (X^k − X^{k+1}) = 0;
(3) if F(X) → +∞ when ‖X‖_F → +∞, then any limit point of {X^k} is a stationary point.

It is expected that GPG decreases the objective function value faster than IRNN, since it uses a tighter surrogate function. This will be verified by the experiments.
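Before turning to the experiments, here is a minimal sketch (not from the paper) of a GPG loop for a matrix completion loss h(X) = (1/2)‖P_Ω(X) − P_Ω(M)‖_F^2, whose gradient has Lipschitz constant 1, so any μ > 1 is admissible. It assumes the prox_nonconvex helper sketched earlier is in scope; note that the scalar subproblem in (2) involves g/μ, which for the Logarithm penalty simply rescales λ to λ/μ.

```python
import numpy as np

def gpg_matrix_completion(M_obs, mask, lam, gamma, mu=1.1, iters=200):
    """Sketch of GPG for min_X sum_i g(sigma_i(X)) + 0.5*||P_Omega(X - M)||_F^2.

    mask is a boolean array marking the observed entries Omega.
    Requires prox_nonconvex(b, lam, gamma) from the earlier sketch.
    """
    X = np.zeros_like(M_obs)
    for _ in range(iters):
        grad = np.where(mask, X - M_obs, 0.0)    # gradient of the smooth loss h
        Y = X - grad / mu                         # gradient step X^k - (1/mu) grad h(X^k)
        # GSVT step: apply the scalar prox of g/mu to the singular values of Y.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        s_new = np.array([prox_nonconvex(b, lam / mu, gamma) for b in s])
        X = U @ np.diag(s_new) @ Vt
    return X
```

In practice one would also decrease λ along the iterations (the continuation technique mentioned below) and monitor the objective for stopping; those details are omitted here.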
Experiments

In this section, we conduct experiments on the matrix completion problem to test the proposed GPG algorithm:

    min_X  Σ_{i=1}^m g(σ_i(X)) + (1/2)‖P_Ω(X) − P_Ω(M)‖_F^2,    (13)

where Ω is the index set of observed entries, and P_Ω: R^{m×n} → R^{m×n} is a linear operator that keeps the entries in Ω unchanged and sets those outside Ω to zero. Given P_Ω(M), the goal of matrix completion is to recover M, which is of low rank. Note that there are many choices of g satisfying Assumption 1; we simply test the Logarithm penalty, since it is suggested in (Lu et al. 2014; Candès, Wakin, and Boyd 2008) that it usually performs well compared with other nonconvex penalties. Problem (13) can be solved by GPG by using GSVT (9) in each iteration. We compare GPG with IRNN on both synthetic and real data. The continuation technique is used to enhance low rank matrix recovery in GPG: the initial value of λ in the Logarithm penalty is set to λ_0, and it is dynamically decreased until reaching λ_t.

Figure 3: Experimental results of low rank matrix recovery on random data. (a) Frequency of Success (FoS) for the noise free case. (b) Relative error for the noisy case. (c) Convergence curves of IRNN and GPG for the noisy case.

Low-Rank Matrix Recovery on Random Data

We conduct two experiments on synthetic data, without and with noise (Lu et al. 2014). For the noise free case, we generate M = M_1 M_2, where M_1 ∈ R^{m×r} and M_2 ∈ R^{r×n} are i.i.d. random matrices, with m = n = 150. The underlying rank r varies up to 33. Half of the elements in M are missing. We set λ_0 = 0.9‖P_Ω(M)‖_∞ and λ_t = 10^{-5} λ_0. The relative error RelErr = ‖X* − M‖_F / ‖M‖_F is used to evaluate the recovery performance; if RelErr is smaller than 10^{-3}, X* is regarded as a successful recovery of M. We repeat the experiment for each r and report the frequency of success. We compare GPG (using GSVT) with IRNN and the convex Augmented Lagrange Multiplier (ALM) method (Lin, Chen, and Ma 2009). Figure 3 (a) plots r versus the frequency of success. It can be seen

that GPG is slightly better than IRNN when r is relatively small, while both IRNN and GPG fail when r approaches the largest tested rank. Both of them outperform the convex ALM method, since the nonconvex Logarithm penalty approximates the rank function better than the convex nuclear norm.

For the noisy case, the data matrix M is generated in the same way, but with additional noise 0.1E added, where E is an i.i.d. random matrix. For this task, λ_0 is again set proportional to ‖P_Ω(M)‖_∞, with a larger λ_t than in the noise free case. The convex APGL algorithm (Toh and Yun 2010) is also compared on this task. Each method is run repeatedly for each tested rank r. Figure 3 (b) shows the mean relative error. It can be seen that GPG, by using GSVT in each iteration, significantly outperforms IRNN and APGL. The reason is that λ_t is not as small as in the noise free case, so the upper bound surrogate of g in IRNN is much looser than the surrogate used in GPG. Figure 3 (c) plots some convergence curves of GPG and IRNN; GPG, which does not relax g, decreases the objective function value faster.

Applications on Real Data

Matrix completion can be applied to image inpainting, since the main information of an image is dominated by its top singular values. For a color image, assume that 40% of the pixels are uniformly missing. They can be recovered by applying low rank matrix completion to each channel (red, green and blue) of the image independently. Besides the relative error defined above, we also use the Peak Signal-to-Noise Ratio (PSNR) to evaluate the recovery performance.

Figure 4: Image inpainting by APGL, IRNN, and GPG. Panels: (a) original image; (b) image with missing pixels; (c) recovered by APGL; (d) recovered by IRNN; (e) recovered by GPG; (f) PSNR and relative error of the three methods.

Figure 4 shows two images recovered by APGL, IRNN and GPG, respectively. It can be seen that GPG achieves the best performance, i.e., the largest PSNR value and the smallest relative error.

We also apply matrix completion to collaborative filtering. The task of collaborative filtering is to predict the unknown preferences of a user on a set of unrated items, based on other similar users or similar items. We test on the MovieLens data sets (Herlocker et al. 1999), which include three problems: movie-100k, movie-1M and movie-10M. Since only the entries in Ω of M are known, we use the Normalized Mean Absolute Error (NMAE), ‖P_Ω(X*) − P_Ω(M)‖_1 / |Ω|, to evaluate the performance, as in (Toh and Yun 2010). As shown in Table 2, GPG achieves the lowest NMAE on all three problems. The improvement benefits from the GPG algorithm, which uses a fast and exact solver for GSVT (9).

Table 2: Comparison of NMAE of APGL, IRNN and GPG for collaborative filtering; GPG attains the lowest NMAE on each problem.

    Problem      size of M: (m, n)
    movie-100k   (943, 1682)
    movie-1M     (6040, 3706)
    movie-10M    (71567, 10677)

Conclusions

This paper studied the Generalized Singular Value Thresholding (GSVT) operator associated with a nonconvex function g applied to the singular values. We proved that the proximal operator Prox_g(·) of any lower bounded function g is monotone, so GSVT can be obtained by performing Prox_g(·) on the singular values separately. Given b ≥ 0, we also proposed a general solver for finding Prox_g(b) for a certain family of g. Finally, we applied the generalized proximal gradient algorithm, with GSVT as its subroutine, to solve the nonconvex low rank minimization problem (1). Experimental results showed that it outperforms previous methods, with smaller recovery error and objective function value.
For nonconvex low rank minimization, GSVT plays the same role as SVT does in convex minimization. One may extend other convex low rank models to their nonconvex counterparts and solve them by using GSVT in place of SVT. An interesting direction for future work is to solve the nonconvex low rank minimization problem with affine constraints by ALM (Lin, Chen, and Ma 2009) and to prove its convergence.

Acknowledgements

This research is supported by the Singapore National Research Foundation under its International Research Funding Initiative and administered by the IDM Programme Office. Z. Lin is supported by NSF China, the 973 Program of China (grant no. 2015CB352502) and the MSRA Collaborative Research Program. C. Lu is supported by the MSRA fellowship 2014.

References

Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences.
Cai, J.-F.; Candès, E. J.; and Shen, Z. 2010. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4).
Candès, E. J.; Li, X.; Ma, Y.; and Wright, J. 2011. Robust principal component analysis? Journal of the ACM 58(3).
Candès, E. J.; Wakin, M. B.; and Boyd, S. P. 2008. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications 14(5-6).
Chartrand, R. 2012. Nonconvex splitting for regularized low-rank + sparse decomposition. IEEE Transactions on Signal Processing 60(11).
Cheng, B.; Yang, J.; Yan, S.; Fu, Y.; and Huang, T. S. 2010. Learning with l1-graph for image analysis. TIP 19.
Combettes, P. L., and Pesquet, J.-C. 2011. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering.
Fan, J., and Li, R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456).
Frank, L., and Friedman, J. 1993. A statistical view of some chemometrics regression tools. Technometrics.
Friedman, J. 2012. Fast sparse regression and classification. International Journal of Forecasting 28(3).
Geman, D., and Yang, C. 1995. Nonlinear image recovery with half-quadratic regularization. TIP 4(7).
Gong, P.; Zhang, C.; Lu, Z.; Huang, J.; and Ye, J. 2013. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In ICML.
Gu, S.; Zhang, L.; Zuo, W.; and Feng, X. 2014. Weighted nuclear norm minimization with application to image denoising. In CVPR.
Herlocker, J. L.; Konstan, J. A.; Borchers, A.; and Riedl, J. 1999. An algorithmic framework for performing collaborative filtering. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
Lin, Z.; Chen, M.; and Ma, Y. 2009. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. UIUC Technical Report UILU-ENG-09-2215.
Liu, D.; Zhou, T.; Qian, H.; Xu, C.; and Zhang, Z. 2013a. A nearly unbiased matrix completion approach. In Machine Learning and Knowledge Discovery in Databases.
Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; and Ma, Y. 2013. Robust recovery of subspace structures by low-rank representation. TPAMI 35(1):171–184.
Lu, C.; Tang, J.; Yan, S.; and Lin, Z. 2014. Generalized nonconvex nonsmooth low-rank minimization. In CVPR.
Nie, F.; Huang, H.; and Ding, C. H. 2012. Low-rank matrix recovery via efficient Schatten p-norm minimization. In AAAI.
Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3):471–501.
Rhea, D. 2011. The case of equality in the von Neumann trace inequality. Preprint.
Toh, K., and Yun, S. 2010a. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization.
Toh, K., and Yun, S. 2010. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization 6:615–640.
Trzasko, J., and Manduca, A. 2009. Highly undersampled magnetic resonance image reconstruction via homotopic l0-minimization. IEEE Transactions on Medical Imaging 28(1).
Wright, J.; Yang, A. Y.; Ganesh, A.; Sastry, S. S.; and Ma, Y. 2009. Robust face recognition via sparse representation. TPAMI 31(2):210–227.
Zhang, C.-H. 2010. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2).

Supplementary Material of Generalized Singular Value Thresholding

Canyi Lu (1), Changbo Zhu (1), Chunyan Xu (2), Shuicheng Yan (1), Zhouchen Lin (3)
(1) Department of Electrical and Computer Engineering, National University of Singapore
(2) School of Computer Science and Technology, Huazhong University of Science and Technology
(3) Key Laboratory of Machine Perception (MOE), School of EECS, Peking University
canyilu@nus.edu.sg, zhuchangbo@gmail.com, xuchunyan@gmail.com, eleyans@nus.edu.sg, zlin@pku.edu.cn

1 Analysis of the Proximal Operator of a Nonconvex Function

In the following development, we consider the problem

    Prox_g(b) = argmin_{x ≥ 0}  f_b(x) = g(x) + (1/2)(x − b)^2,    (1)

where g(x) satisfies the following assumption.

Assumption 1. g: R+ → R+, g(0) = 0. g is concave, nondecreasing and differentiable. The gradient ∇g is convex.

Set C_b(x) = b − x and D(x) = ∇g(x). Let b̄ = sup{b | C_b(x) and D(x) have no intersection}, and x_2^{b̄} = inf{x | (x, y) is an intersection point of C_{b̄}(x) and D(x)}.

1.1 Proof of Proposition 2

Proposition 2. Given g satisfying Assumption 1 and ∇g(0) = +∞. Restricted on [0, +∞), when b > b̄, C_b(x) and D(x) have two intersection points, denoted as P_1^b = (x_1^b, y_1^b) and P_2^b = (x_2^b, y_2^b) with x_1^b < x_2^b. If there does not exist b > b̄ such that f_b(0) = f_b(x_2^b), then Prox_g(b) = 0 for all b ≥ 0. If there exists b > b̄ such that f_b(0) = f_b(x_2^b), let b* = inf{b > b̄ | f_b(0) = f_b(x_2^b)}. Then we have

    Prox_g(b) = argmin_{x ≥ 0} f_b(x) = { x_2^b, if b > b*;  0, if b ≤ b*. }    (2)

Remark: When b* exists and b > b*, because D(x) = ∇g(x) is convex and decreasing, we can conclude that C_b(x) and D(x) have exactly two intersection points. When b ≤ b*, C_b(x) and D(x) may have multiple intersection points.

Proof. When b > b̄, since ∇f_b(x) = D(x) − C_b(x), we can easily see that f_b is increasing on (0, x_1^b), decreasing on (x_1^b, x_2^b) and increasing on (x_2^b, b). So 0 and x_2^b are the two local minimum points of f_b(x) on [0, b].

Case 1: There exists b > b̄ such that f_b(0) = f_b(x_2^b); denote b* = inf{b > b̄ | f_b(0) = f_b(x_2^b)}.

First, we consider b > b*. Let b = b* + ε for some ε > 0. We have

    f_b(x_2^{b*}) − f_b(0) = (1/2)(x_2^{b*} − b* − ε)^2 + g(x_2^{b*}) − (1/2)(b* + ε)^2
                           = (1/2)(x_2^{b*} − b*)^2 + g(x_2^{b*}) − (1/2)(b*)^2 − ε x_2^{b*}
                           = f_{b*}(x_2^{b*}) − f_{b*}(0) − ε x_2^{b*}
                           = −ε x_2^{b*} < 0.

Since f_b is decreasing on [x_2^{b*}, x_2^b], we conclude that f_b(0) > f_b(x_2^{b*}) ≥ f_b(x_2^b). So, when b > b*, x_2^b is the global minimum of f_b(x) on [0, b].

Second, we consider b̄ < b < b*. We show that f_b(0) ≤ f_b(x_2^b) by contradiction. Suppose that there exists such a b with f_b(0) > f_b(x_2^b). Since f_{b̄} is strictly increasing on (0, x_2^{b̄}), we have f_{b̄}(x_2^{b̄}) > f_{b̄}(0). Because f_{b̄}(x_2^{b̄}) > f_{b̄}(0) and f_b(x_2^b) < f_b(0), a direct computation gives

    2 g(x_2^{b̄}) − 2 x_2^{b̄} ∇g(x_2^{b̄}) − (x_2^{b̄})^2 > 0,
    2 g(x_2^{b}) − 2 x_2^{b} ∇g(x_2^{b}) − (x_2^{b})^2 < 0.

According to the intermediate value theorem, there exists x̃ with x_2^{b̄} < x̃ < x_2^{b} such that 2 g(x̃) − 2 x̃ ∇g(x̃) − x̃^2 = 0. Let b̃ = ∇g(x̃) + x̃. Then (x̃, ∇g(x̃)) is an intersection point of C_{b̃}(x) and D(x) such that f_{b̃}(x̃) = f_{b̃}(0). Since x_2^{b̄} < x̃ < x_2^{b} and ∇g is convex and nonincreasing, we conclude that b̄ < b̃ < b < b*, which contradicts the minimality of b*. Also, when b ≤ b̄, we have ∇f_b(x) = D(x) − C_b(x) ≥ 0, because D(x) is above C_b(x). So the global minimum of f_b(x) on [0, b] is 0.

Case 2: Suppose that for all b > b̄, f_b(0) ≠ f_b(x_2^b). Since f_{b̄} is increasing on (0, x_2^{b̄}), we have f_{b̄}(x_2^{b̄}) > f_{b̄}(0). We now show that f_b(x_2^b) ≥ f_b(0) for all b > b̄. Suppose this is not true and there exists b > b̄ such that f_b(x_2^b) < f_b(0). Because f_{b̄}(x_2^{b̄}) > f_{b̄}(0) and f_b(x_2^b) < f_b(0), a direct computation gives

    2 g(x_2^{b̄}) − 2 x_2^{b̄} ∇g(x_2^{b̄}) − (x_2^{b̄})^2 > 0,
    2 g(x_2^{b}) − 2 x_2^{b} ∇g(x_2^{b}) − (x_2^{b})^2 < 0.

So, according to the intermediate value theorem, there exists x̃ with x_2^{b̄} < x̃ < x_2^{b} such that 2 g(x̃) − 2 x̃ ∇g(x̃) − x̃^2 = 0. Let b̃ = ∇g(x̃) + x̃. Then (x̃, ∇g(x̃)) is an intersection point of C_{b̃}(x) and D(x) such that f_{b̃}(x̃) = f_{b̃}(0). Since x_2^{b̄} < x̃ < x_2^{b} and ∇g is convex and nonincreasing, we conclude that b̄ < b̃ < b, which contradicts f_b(0) ≠ f_b(x_2^b) for all b > b̄. So, for all b > b̄, 0 is the minimum of f_b(x) on [0, b]. Similarly, when b ≤ b̄, we have ∇f_b(x) = D(x) − C_b(x) ≥ 0 because D(x) is above C_b(x), so the global minimum of f_b(x) on [0, b] is 0. The proof is completed.

1.2 Proof of Proposition 3

Proposition 3. Given g satisfying Assumption 1 and ∇g(0) < +∞. Restricted on [0, +∞), if C_{∇g(0)}(x) = ∇g(0) − x ≤ ∇g(x) for all x ∈ (0, ∇g(0)), then C_b(x) and D(x) have only one intersection point (x_1^b, y_1^b) when b > ∇g(0). Furthermore,

    Prox_g(b) = argmin_{x ≥ 0} f_b(x) = { x_1^b, if b > ∇g(0);  0, if b ≤ ∇g(0). }    (3)

Suppose instead that there exists 0 < x̂ < ∇g(0) such that C_{∇g(0)}(x̂) = ∇g(0) − x̂ > ∇g(x̂). Then, when ∇g(0) > b > b̄, C_b(x) and D(x) have two intersection points, denoted as P_1^b = (x_1^b, y_1^b) and P_2^b = (x_2^b, y_2^b) with x_1^b < x_2^b; when b > ∇g(0), C_b(x) and D(x) have only one intersection point (x_2^b, y_2^b). Also, there exists b such that ∇g(0) > b > b̄ and f_b(0) = f_b(x_2^b). Let b* = inf{b | ∇g(0) > b > b̄, f_b(0) = f_b(x_2^b)}. We have

    Prox_g(b) = argmin_{x ≥ 0} f_b(x) = { x_2^b, if b > ∇g(0);  x_2^b, if ∇g(0) ≥ b > b*;  0, if b ≤ b*. }    (4)

Remark: If b* exists, then when b ≤ b*, it is possible that C_b(x) and D(x) have more than two intersection points. If b* does not exist, then when b ≤ ∇g(0), it is also possible that C_b(x) and D(x) have more than two intersection points.

Proof. Case 1: Suppose that C_{∇g(0)}(x) = ∇g(0) − x ≤ ∇g(x) for all x ∈ (0, ∇g(0)). Notice that for all b ≤ ∇g(0) we have ∇g(x) = D(x) ≥ C_b(x), so the minimum point of f_b(x) is 0. For all b > ∇g(0), C_b(x) = b − x and D(x) have only one intersection point, denoted as (x_1^b, y_1^b). Then we can easily see that f_b is decreasing on (0, x_1^b) and increasing on (x_1^b, b). So, when b > ∇g(0), the minimum point of f_b(x) is x_1^b.

Case 2: Suppose that there exists 0 < x̂ < ∇g(0) such that C_{∇g(0)}(x̂) = ∇g(0) − x̂ > ∇g(x̂). Then D(x) and C_{∇g(0)}(x) have two intersection points, i.e., (0, ∇g(0)) and (x_2^{∇g(0)}, y_2^{∇g(0)}). It is easily checked that f_{∇g(0)} is strictly decreasing on (0, x_2^{∇g(0)}), so we have f_{∇g(0)}(x_2^{∇g(0)}) < f_{∇g(0)}(0). Also, since f_{b̄} is strictly increasing on (0, x_2^{b̄}), we have f_{b̄}(x_2^{b̄}) > f_{b̄}(0). Because f_{b̄}(x_2^{b̄}) > f_{b̄}(0) and f_{∇g(0)}(x_2^{∇g(0)}) < f_{∇g(0)}(0), a direct computation gives

    2 g(x_2^{b̄}) − 2 x_2^{b̄} ∇g(x_2^{b̄}) − (x_2^{b̄})^2 > 0,
    2 g(x_2^{∇g(0)}) − 2 x_2^{∇g(0)} ∇g(x_2^{∇g(0)}) − (x_2^{∇g(0)})^2 < 0.

So, according to the intermediate value theorem, there exists x̃ with x_2^{b̄} < x̃ < x_2^{∇g(0)} such that 2 g(x̃) − 2 x̃ ∇g(x̃) − x̃^2 = 0. Let b̃ = ∇g(x̃) + x̃. Then (x̃, ∇g(x̃)) is an intersection point of C_{b̃}(x) and D(x) such that f_{b̃}(x̃) = f_{b̃}(0). Since x_2^{b̄} < x̃ < x_2^{∇g(0)} and ∇g is convex and nonincreasing, we conclude that b̄ < b̃ < ∇g(0). Next, we set b* = inf{b | b̄ < b < ∇g(0), f_b(0) = f_b(x_2^b)}.

Given ∇g(0) > b > b*, we can easily see that f_b is increasing on (0, x_1^b), decreasing on (x_1^b, x_2^b) and increasing on (x_2^b, b). So 0 and x_2^b are the two local minimum points of f_b(x) on [0, b]. Next, for ∇g(0) > b > b*, set b = b* + ε for some ε > 0. We have

    f_b(x_2^{b*}) − f_b(0) = (1/2)(x_2^{b*} − b* − ε)^2 + g(x_2^{b*}) − (1/2)(b* + ε)^2
                           = (1/2)(x_2^{b*} − b*)^2 + g(x_2^{b*}) − (1/2)(b*)^2 − ε x_2^{b*}
                           = f_{b*}(x_2^{b*}) − f_{b*}(0) − ε x_2^{b*}
                           = −ε x_2^{b*} < 0.

Since f_b is decreasing on (x_2^{b*}, x_2^b), we conclude that f_b(0) > f_b(x_2^{b*}) ≥ f_b(x_2^b). So, when b > b*, x_2^b is the global minimum of f_b(x) on [0, b].

Then, for all b < b*, we show that f_b(0) ≤ f_b(x_2^b). We prove this by contradiction. Suppose that there exists b < b* such that f_b(0) > f_b(x_2^b). Because f_{b̄}(x_2^{b̄}) > f_{b̄}(0) and f_b(x_2^b) < f_b(0), a direct computation gives

    2 g(x_2^{b̄}) − 2 x_2^{b̄} ∇g(x_2^{b̄}) − (x_2^{b̄})^2 > 0,
    2 g(x_2^{b}) − 2 x_2^{b} ∇g(x_2^{b}) − (x_2^{b})^2 < 0.

So, according to the intermediate value theorem, there exists x̃' with x_2^{b̄} < x̃' < x_2^{b} such that 2 g(x̃') − 2 x̃' ∇g(x̃') − (x̃')^2 = 0. Let b̃' = ∇g(x̃') + x̃'. Then (x̃', ∇g(x̃')) is an intersection point of C_{b̃'}(x) and D(x) such that f_{b̃'}(x̃') = f_{b̃'}(0). Since x_2^{b̄} < x̃' < x_2^{b} and ∇g is convex and nonincreasing, we conclude that b̄ < b̃' < b < b*, which contradicts the minimality of b*. Next, when b ≤ b̄, we have ∇f_b(x) = D(x) − C_b(x) ≥ 0, so the global minimum of f_b(x) on [0, b] is 0. Also, when b > ∇g(0), C_b(x) = b − x and D(x) have only one intersection point (x_2^b, y_2^b); then f_b is decreasing on (0, x_2^b) and increasing on (x_2^b, b). So, when b > ∇g(0), the global minimum point of f_b(x) is x_2^b.

1.3 Proof of Corollary 1

Corollary 1. Given g satisfying Assumption 1 in problem (1), denote x̂_b = max{x | ∇f_b(x) = 0, x ≥ 0} and x_b* = argmin_{x ∈ {0, x̂_b}} f_b(x). Then x_b* is optimal to (1), i.e., x_b* ∈ Prox_g(b).

Proof. As shown in Propositions 2 and 3, when b is larger than a certain threshold, Prox_g(b) (x_2^b in (2) and (4), or x_1^b in (3)) is unique. Actually this unique solution is the largest intersection point of C_b(x) and ∇g(x), i.e., Prox_g(b) = x̂_b = max{x | ∇f_b(x) = 0, x ≥ 0}. For all the other choices of b, 0 ∈ Prox_g(b). Thus one of 0 and x̂_b is optimal to (1), and therefore x_b* = argmin_{x ∈ {0, x̂_b}} f_b(x) is optimal to (1).

2 Proof of Theorem 2

Theorem 2. For any lower bounded function g, its proximal operator Prox_g(·) is monotone, i.e., for any p_i ∈ Prox_g(x_i), i = 1, 2, we have p_1 ≥ p_2 when x_1 > x_2.

Proof. The lower boundedness of g guarantees that problem (1) has a finite solution. By the optimality of p_1 and p_2, we have

    g(p_1) + (1/2)(p_1 − x_1)^2 ≤ g(p_2) + (1/2)(p_2 − x_1)^2,    (5)
    g(p_2) + (1/2)(p_2 − x_2)^2 ≤ g(p_1) + (1/2)(p_1 − x_2)^2.    (6)

Summing them together gives

    (1/2)(p_1 − x_1)^2 + (1/2)(p_2 − x_2)^2 ≤ (1/2)(p_2 − x_1)^2 + (1/2)(p_1 − x_2)^2.    (7)

It reduces to

    (p_1 − p_2)(x_1 − x_2) ≥ 0.    (8)

Thus p_1 ≥ p_2 when x_1 > x_2.

3 Convergence Analysis of Algorithm 1

Assume that there exists

    x̂_b = max{x | ∇f_b(x) = ∇g(x) + x − b = 0, x ≥ 0};

otherwise, 0 is a solution to (1). We only need to prove that the fixed point iteration is guaranteed to find x̂_b. First, if ∇g(b) = 0, then we have found x̂_b = b. For the case x̂_b < b, we prove that the fixed point iteration, starting from x^0 = b, converges to x̂_b. Indeed, we have b − ∇g(x) < x for any x > x̂_b. We prove this by contradiction: assume there exists x̃ > x̂_b such that b − ∇g(x̃) > x̃. Notice that g satisfies Assumption 1, so ∇g is continuous, nonincreasing and nonnegative. Then we have b − ∇g(b) < b (note that ∇g(b) > 0 since b > x̂_b). Thus there must exist some x' ∈ (min(b, x̃), max(b, x̃)), with x' > x̂_b, such that b − ∇g(x') = x'. This contradicts the definition of x̂_b. So we have x^{k+1} = b − ∇g(x^k) < x^k whenever x^k > x̂_b. On the other hand, {x^k} is lower bounded by x̂_b. So {x^k} has a limit, denoted as x̄, which is no less than x̂_b. Letting k → +∞ on both sides of x^{k+1} = b − ∇g(x^k), we see that x̄ = b − ∇g(x̄). So x̄ = x̂_b, i.e., lim_{k→+∞} x^k = x̂_b.
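Before turning to the convergence of GPG, a quick numerical sanity check of Theorem 2 (not part of the supplementary material): for a nonconvex penalty that is only assumed to be lower bounded, a brute-force grid-search proximal operator is nondecreasing in b. The capped-l1 penalty and the grid resolution are illustrative choices of ours.

```python
import numpy as np

def prox_brute(b, g, xs):
    # Brute-force proximal operator on a grid: argmin_x g(x) + 0.5*(x - b)^2.
    vals = g(xs) + 0.5 * (xs - b) ** 2
    return xs[np.argmin(vals)]

# Nonconvex, lower bounded penalty: capped-l1, g(x) = min(x, 1).
g = lambda x: np.minimum(x, 1.0)

xs = np.linspace(0.0, 10.0, 100001)
bs = np.linspace(0.0, 5.0, 201)
p = np.array([prox_brute(b, g, xs) for b in bs])
# Monotonicity (Theorem 2): p should be nondecreasing in b.
print(np.all(np.diff(p) >= -1e-8))
```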

4 Convergence Analysis of the Generalized Proximal Gradient Algorithm

Consider the following problem:

    min_X  F(X) = Σ_{i=1}^m g(σ_i(X)) + h(X),    (9)

where g: R+ → R+ is continuous, concave and nondecreasing on [0, +∞), and h: R^{m×n} → R+ has Lipschitz continuous gradient with Lipschitz constant L(h). The Generalized Proximal Gradient (GPG) algorithm solves the above problem by the following updating rule:

    X^{k+1} = argmin_X  Σ_{i=1}^m g(σ_i(X)) + h(X^k) + ⟨∇h(X^k), X − X^k⟩ + (μ/2)‖X − X^k‖_F^2
            = argmin_X  Σ_{i=1}^m g(σ_i(X)) + (μ/2)‖X − X^k + (1/μ)∇h(X^k)‖_F^2.    (10)

Then we have the following results.

Theorem 3. If μ > L(h), the sequence {X^k} generated by (10) satisfies the following properties:
(1) F(X^k) is monotonically decreasing. Indeed,

    F(X^k) − F(X^{k+1}) ≥ ((μ − L(h))/2) ‖X^k − X^{k+1}‖_F^2;

(2) lim_{k→+∞} (X^k − X^{k+1}) = 0;
(3) if F(X) → +∞ when ‖X‖_F → +∞, then any limit point of {X^k} is a stationary point.

Proof. Since X^{k+1} is optimal to (10), we have

    Σ_{i=1}^m g(σ_i(X^{k+1})) + h(X^k) + ⟨∇h(X^k), X^{k+1} − X^k⟩ + (μ/2)‖X^{k+1} − X^k‖_F^2
        ≤ Σ_{i=1}^m g(σ_i(X^k)) + h(X^k) + ⟨∇h(X^k), X^k − X^k⟩ + (μ/2)‖X^k − X^k‖_F^2
        = Σ_{i=1}^m g(σ_i(X^k)).    (11)

On the other hand, since h has Lipschitz continuous gradient, we have [1]

    h(X^{k+1}) ≤ h(X^k) + ⟨∇h(X^k), X^{k+1} − X^k⟩ + (L(h)/2)‖X^{k+1} − X^k‖_F^2.    (12)

Combining (11) and (12) leads to

    F(X^k) − F(X^{k+1}) = Σ_{i=1}^m g(σ_i(X^k)) + h(X^k) − Σ_{i=1}^m g(σ_i(X^{k+1})) − h(X^{k+1})
                        ≥ ((μ − L(h))/2) ‖X^{k+1} − X^k‖_F^2.    (13)

Thus μ > L(h) guarantees that F(X^k) ≥ F(X^{k+1}). Summing (13) over k = 0, 1, 2, …, we get

    F(X^0) ≥ ((μ − L(h))/2) Σ_{k=0}^{+∞} ‖X^{k+1} − X^k‖_F^2.    (14)

This implies that

    lim_{k→+∞} (X^k − X^{k+1}) = 0.    (15)

Furthermore, since F(X) → +∞ when ‖X‖_F → +∞, the sequence {X^k} is bounded. There exist X* and a subsequence {X^{k_j}} such that lim_{j→+∞} X^{k_j} = X*. By using (15), we get lim_{j→+∞} X^{k_j+1} = X*. Considering that X^{k_j+1} is optimal to (10), there exists Q^{k_j+1} ∈ ∂( Σ_{i=1}^m g(σ_i(X^{k_j+1})) ) [3] such that

    Q^{k_j+1} + ∇h(X^{k_j}) + μ(X^{k_j+1} − X^{k_j}) = 0.    (16)

Letting j → +∞ in (16) and using the upper semi-continuity of the subdifferential [2], there exists Q* ∈ ∂( Σ_{i=1}^m g(σ_i(X*)) ) such that

    0 = Q* + ∇h(X*) ∈ ∂F(X*).    (17)

Thus X* is a stationary point of (9).

References

[1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2009.
[2] Frank Clarke. Nonsmooth analysis and optimization. In Proceedings of the International Congress of Mathematicians, 1983.
[3] Adrian S. Lewis and Hristo S. Sendov. Nonsmooth analysis of singular values. Part I: Theory. Set-Valued Analysis, 13(3):213–241, 2005.


More information

Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰北京大学

Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰北京大学 Linearized Alternating Direction Method: Two Blocks and Multiple Blocks Zhouchen Lin 林宙辰北京大学 Dec. 3, 014 Outline Alternating Direction Method (ADM) Linearized Alternating Direction Method (LADM) Two Blocks

More information

Nonconvex Sparse Logistic Regression with Weakly Convex Regularization

Nonconvex Sparse Logistic Regression with Weakly Convex Regularization Nonconvex Sparse Logistic Regression with Weakly Convex Regularization Xinyue Shen, Student Member, IEEE, and Yuantao Gu, Senior Member, IEEE Abstract In this work we propose to fit a sparse logistic regression

More information

Accelerated primal-dual methods for linearly constrained convex problems

Accelerated primal-dual methods for linearly constrained convex problems Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize

More information

Low-rank Matrix Completion with Noisy Observations: a Quantitative Comparison

Low-rank Matrix Completion with Noisy Observations: a Quantitative Comparison Low-rank Matrix Completion with Noisy Observations: a Quantitative Comparison Raghunandan H. Keshavan, Andrea Montanari and Sewoong Oh Electrical Engineering and Statistics Department Stanford University,

More information

Exact Recoverability of Robust PCA via Outlier Pursuit with Tight Recovery Bounds

Exact Recoverability of Robust PCA via Outlier Pursuit with Tight Recovery Bounds Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Exact Recoverability of Robust PCA via Outlier Pursuit with Tight Recovery Bounds Hongyang Zhang, Zhouchen Lin, Chao Zhang, Edward

More information

Low Rank Matrix Completion Formulation and Algorithm

Low Rank Matrix Completion Formulation and Algorithm 1 2 Low Rank Matrix Completion and Algorithm Jian Zhang Department of Computer Science, ETH Zurich zhangjianthu@gmail.com March 25, 2014 Movie Rating 1 2 Critic A 5 5 Critic B 6 5 Jian 9 8 Kind Guy B 9

More information

A Local Non-Negative Pursuit Method for Intrinsic Manifold Structure Preservation

A Local Non-Negative Pursuit Method for Intrinsic Manifold Structure Preservation Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence A Local Non-Negative Pursuit Method for Intrinsic Manifold Structure Preservation Dongdong Chen and Jian Cheng Lv and Zhang Yi

More information

arxiv: v3 [stat.me] 8 Jun 2018

arxiv: v3 [stat.me] 8 Jun 2018 Between hard and soft thresholding: optimal iterative thresholding algorithms Haoyang Liu and Rina Foygel Barber arxiv:804.0884v3 [stat.me] 8 Jun 08 June, 08 Abstract Iterative thresholding algorithms

More information

Sparse Regularization via Convex Analysis

Sparse Regularization via Convex Analysis Sparse Regularization via Convex Analysis Ivan Selesnick Electrical and Computer Engineering Tandon School of Engineering New York University Brooklyn, New York, USA 29 / 66 Convex or non-convex: Which

More information

PRINCIPAL Component Analysis (PCA) is a fundamental approach

PRINCIPAL Component Analysis (PCA) is a fundamental approach IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Tensor Robust Principal Component Analysis with A New Tensor Nuclear Norm Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Member, IEEE, Zhouchen

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Fast proximal gradient methods

Fast proximal gradient methods L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient

More information

Analysis of Robust PCA via Local Incoherence

Analysis of Robust PCA via Local Incoherence Analysis of Robust PCA via Local Incoherence Huishuai Zhang Department of EECS Syracuse University Syracuse, NY 3244 hzhan23@syr.edu Yi Zhou Department of EECS Syracuse University Syracuse, NY 3244 yzhou35@syr.edu

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

Application of Tensor and Matrix Completion on Environmental Sensing Data

Application of Tensor and Matrix Completion on Environmental Sensing Data Application of Tensor and Matrix Completion on Environmental Sensing Data Michalis Giannopoulos 1,, Sofia Savvaki 1,, Grigorios Tsagkatakis 1, and Panagiotis Tsakalides 1, 1- Institute of Computer Science

More information

Homotopy methods based on l 0 norm for the compressed sensing problem

Homotopy methods based on l 0 norm for the compressed sensing problem Homotopy methods based on l 0 norm for the compressed sensing problem Wenxing Zhu, Zhengshan Dong Center for Discrete Mathematics and Theoretical Computer Science, Fuzhou University, Fuzhou 350108, China

More information

Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint

Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with Ronny Luss Optimization and

More information

Minimizing a convex separable exponential function subject to linear equality constraint and bounded variables

Minimizing a convex separable exponential function subject to linear equality constraint and bounded variables Minimizing a convex separale exponential function suect to linear equality constraint and ounded variales Stefan M. Stefanov Department of Mathematics Neofit Rilski South-Western University 2700 Blagoevgrad

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22

More information

A Customized ADMM for Rank-Constrained Optimization Problems with Approximate Formulations

A Customized ADMM for Rank-Constrained Optimization Problems with Approximate Formulations A Customized ADMM for Rank-Constrained Optimization Problems with Approximate Formulations Chuangchuang Sun and Ran Dai Abstract This paper proposes a customized Alternating Direction Method of Multipliers

More information

Non-Convex Rank/Sparsity Regularization and Local Minima

Non-Convex Rank/Sparsity Regularization and Local Minima Non-Convex Rank/Sparsity Regularization and Local Minima Carl Olsson, Marcus Carlsson Fredrik Andersson Viktor Larsson Department of Electrical Engineering Chalmers University of Technology {calle,mc,fa,viktorl}@maths.lth.se

More information

An iterative hard thresholding estimator for low rank matrix recovery

An iterative hard thresholding estimator for low rank matrix recovery An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical

More information

Fantope Regularization in Metric Learning

Fantope Regularization in Metric Learning Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

Exact penalty decomposition method for zero-norm minimization based on MPEC formulation 1

Exact penalty decomposition method for zero-norm minimization based on MPEC formulation 1 Exact penalty decomposition method for zero-norm minimization based on MPEC formulation Shujun Bi, Xiaolan Liu and Shaohua Pan November, 2 (First revised July 5, 22) (Second revised March 2, 23) (Final

More information

Stochastic dynamical modeling:

Stochastic dynamical modeling: Stochastic dynamical modeling: Structured matrix completion of partially available statistics Armin Zare www-bcf.usc.edu/ arminzar Joint work with: Yongxin Chen Mihailo R. Jovanovic Tryphon T. Georgiou

More information

Supplemental Figures: Results for Various Color-image Completion

Supplemental Figures: Results for Various Color-image Completion ANONYMOUS AUTHORS: SUPPLEMENTAL MATERIAL (NOVEMBER 7, 2017) 1 Supplemental Figures: Results for Various Color-image Completion Anonymous authors COMPARISON WITH VARIOUS METHODS IN COLOR-IMAGE COMPLETION

More information

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

EE 381V: Large Scale Optimization Fall Lecture 24 April 11 EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that

More information

EUSIPCO

EUSIPCO EUSIPCO 013 1569746769 SUBSET PURSUIT FOR ANALYSIS DICTIONARY LEARNING Ye Zhang 1,, Haolong Wang 1, Tenglong Yu 1, Wenwu Wang 1 Department of Electronic and Information Engineering, Nanchang University,

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon

More information

Matrix Completion for Structured Observations

Matrix Completion for Structured Observations Matrix Completion for Structured Observations Denali Molitor Department of Mathematics University of California, Los ngeles Los ngeles, C 90095, US Email: dmolitor@math.ucla.edu Deanna Needell Department

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

MATRIX RECOVERY FROM QUANTIZED AND CORRUPTED MEASUREMENTS

MATRIX RECOVERY FROM QUANTIZED AND CORRUPTED MEASUREMENTS MATRIX RECOVERY FROM QUANTIZED AND CORRUPTED MEASUREMENTS Andrew S. Lan 1, Christoph Studer 2, and Richard G. Baraniuk 1 1 Rice University; e-mail: {sl29, richb}@rice.edu 2 Cornell University; e-mail:

More information

ACCELERATED LINEARIZED BREGMAN METHOD. June 21, Introduction. In this paper, we are interested in the following optimization problem.

ACCELERATED LINEARIZED BREGMAN METHOD. June 21, Introduction. In this paper, we are interested in the following optimization problem. ACCELERATED LINEARIZED BREGMAN METHOD BO HUANG, SHIQIAN MA, AND DONALD GOLDFARB June 21, 2011 Abstract. In this paper, we propose and analyze an accelerated linearized Bregman (A) method for solving the

More information

A Primal-dual Three-operator Splitting Scheme

A Primal-dual Three-operator Splitting Scheme Noname manuscript No. (will be inserted by the editor) A Primal-dual Three-operator Splitting Scheme Ming Yan Received: date / Accepted: date Abstract In this paper, we propose a new primal-dual algorithm

More information

A Proximal Alternating Direction Method for Semi-Definite Rank Minimization (Supplementary Material)

A Proximal Alternating Direction Method for Semi-Definite Rank Minimization (Supplementary Material) A Proximal Alternating Direction Method for Semi-Definite Rank Minimization (Supplementary Material) Ganzhao Yuan and Bernard Ghanem King Abdullah University of Science and Technology (KAUST), Saudi Arabia

More information

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Jialin Dong ShanghaiTech University 1 Outline Introduction FourVignettes: System Model and Problem Formulation Problem Analysis

More information

Coherent imaging without phases

Coherent imaging without phases Coherent imaging without phases Miguel Moscoso Joint work with Alexei Novikov Chrysoula Tsogka and George Papanicolaou Waves and Imaging in Random Media, September 2017 Outline 1 The phase retrieval problem

More information

Low-Rank Tensor Completion by Truncated Nuclear Norm Regularization

Low-Rank Tensor Completion by Truncated Nuclear Norm Regularization Low-Rank Tensor Completion by Truncated Nuclear Norm Regularization Shengke Xue, Wenyuan Qiu, Fan Liu, and Xinyu Jin College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou,

More information