Reweighted Nuclear Norm Minimization with Application to System Identification


Karthik Mohan and Maryam Fazel

Abstract: The matrix rank minimization problem consists of finding a matrix of minimum rank that satisfies given convex constraints. It is NP-hard in general and has applications in control, system identification, and machine learning. Reweighted trace minimization has been considered as an iterative heuristic for this problem. In this paper, we analyze the convergence of this iterative heuristic, showing that the difference between successive iterates tends to zero. Then, after reformulating the heuristic as reweighted nuclear norm minimization, we propose an efficient gradient-based implementation that takes advantage of the new formulation and opens the way to solving large-scale problems. We apply this algorithm to the problem of low-order system identification from input-output data. Numerical examples demonstrate that reweighted nuclear norm minimization makes model order selection easier and results in lower-order models compared to nuclear norm minimization without weights.

(The authors are with the Electrical Engineering Department, University of Washington, Seattle. Email: karna@u.washington.edu, mfazel@u.washington.edu. Research funded in part by NSF CAREER grant ECCS-0847077.)

I. INTRODUCTION

A. Background

The matrix rank minimization problem consists of finding a matrix of minimum rank that satisfies given convex constraints, i.e.,

    minimize    rank(X)
    subject to  X ∈ C,                                              (1)

where X ∈ R^{m×n} is the optimization variable and C is a convex set. When C is described by affine equality constraints, (1) is the matrix extension of the popular sparse signal recovery problem in compressed sensing. The rank minimization problem arises in a diverse set of fields where notions of order, complexity, or dimension are expressed by means of the rank of an appropriate matrix. Applications include system identification, low-order controller design, collaborative filtering in machine learning, and Euclidean embedding problems (see [14] and references therein).

Problem (1) is NP-hard in general. A common convex heuristic [7] replaces rank with the nuclear norm (also known as the Schatten 1-norm or trace norm) of the matrix, denoted by ‖X‖_* = Σ_{i=1}^r σ_i(X), where σ_i(X) are the singular values and r = rank(X). The heuristic solves the convex problem

    minimize    ‖X‖_*
    subject to  X ∈ C.                                              (2)

This heuristic and its variants have lately received a lot of interest, one reason being the recent progress in both the theory and the algorithms for the heuristic. Several conditions have been studied that guarantee that the heuristic yields an exact solution when the constraints are linear equalities (e.g., [14]). The development of various classes of algorithms for this heuristic that exploit specific problem structure is also an active research area (e.g., [17]). Finally, new applications have initiated interest in special cases of the rank minimization problem, e.g., the low-rank matrix completion problem [4] arising in machine learning. We also mention that matrix rank and nuclear norm minimization have a natural connection to vector sparsity and the ℓ1-norm: the former reduces to the latter if the matrix variable is taken to be diagonal.

A variation on this basic heuristic that helps reduce the rank of the solution further is to use a weighted objective (see [3], [6] for the vector version and [8] for the matrix version). In this paper we study the reweighted trace heuristic, which is based on using a nonconvex surrogate function for the rank and solving the resulting problem locally via a sequence of convex problems.
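To make the basic heuristic (2) concrete, the following is a minimal sketch using the cvxpy modeling package, with C taken to be a set of affine equality constraints ⟨A_i, X⟩ = b_i; the sizes, random measurement matrices, and data are illustrative stand-ins, not from the paper.

```python
# Nuclear norm heuristic (2) for affine equality constraints, via cvxpy.
# All problem data below are synthetic and purely illustrative.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 20, 20, 120
X_true = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))  # rank 2
A_ops = rng.standard_normal((p, m, n))               # measurement matrices
b = np.array([np.sum(Ai * X_true) for Ai in A_ops])  # b_i = <A_i, X_true>

X = cp.Variable((m, n))
constraints = [cp.sum(cp.multiply(Ai, X)) == bi for Ai, bi in zip(A_ops, b)]
cp.Problem(cp.Minimize(cp.norm(X, "nuc")), constraints).solve()
print("recovered rank:", np.linalg.matrix_rank(X.value, tol=1e-6))
```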
First, note that problem (1) can also be expressed in a positive semidefinite form [9]:

    minimize    rank(Y) + rank(Z)
    subject to  [Y  X; X^T  Z] ⪰ 0,  X ∈ C,                         (3)

where X ∈ R^{m×n}, Y ∈ R^{m×m}, and Z ∈ R^{n×n} are the optimization variables. Replacing rank with trace, we obtain a semidefinite programming problem that is equivalent to (2) [7]. The heuristic given in [8] replaces the rank of the positive semidefinite matrices Y, Z by a surrogate function as follows:

    minimize    log det(Y + δI) + log det(Z + δI)
    subject to  [Y  X; X^T  Z] ⪰ 0,  X ∈ C,                         (4)

where δ > 0 is a small regularization constant. Problem (4) can be solved locally by iterative linearization of the objective. The kth step of this algorithm solves

    minimize    Tr((Y^k + δI)^{-1} Y) + Tr((Z^k + δI)^{-1} Z)
    subject to  [Y  X; X^T  Z] ⪰ 0,  X ∈ C                          (5)

to get X^{k+1}, Y^{k+1}, Z^{k+1}. Throughout this paper, we refer to (5) as the reweighted trace heuristic (RTH). We often take Y^0 and Z^0 to be the identity, so the first iteration simply minimizes Tr Y + Tr Z.
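Each RTH step (5) is itself an SDP. The following cvxpy sketch of one step encodes the LMI in (3) by slicing a single PSD block variable M = [Y X; X^T Z]; the affine constraints standing in for C are again an illustrative assumption, not the paper's setup.

```python
# One step of the reweighted trace heuristic (5), written as an SDP in cvxpy.
import cvxpy as cp
import numpy as np

def rth_step(Yk, Zk, A_ops, b, delta=1e-2):
    m, n = Yk.shape[0], Zk.shape[0]
    M = cp.Variable((m + n, m + n), PSD=True)      # M = [Y X; X^T Z] >= 0
    Y, X, Z = M[:m, :m], M[:m, m:], M[m:, m:]
    W1 = np.linalg.inv(Yk + delta * np.eye(m))     # (Y^k + delta I)^{-1}
    W2 = np.linalg.inv(Zk + delta * np.eye(n))     # (Z^k + delta I)^{-1}
    objective = cp.trace(W1 @ Y) + cp.trace(W2 @ Z)
    constraints = [cp.sum(cp.multiply(Ai, X)) == bi     # X in C (affine,
                   for Ai, bi in zip(A_ops, b)]         #  illustrative)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return X.value, Y.value, Z.value
```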

B. Summary of results

We examine the convergence of the reweighted trace heuristic, showing that the difference between the successive iterates of the heuristic tends to zero. We give an example of other concave functions that can be used as surrogates for rank in (3), giving rise to other heuristics with similar convergence properties. RTH as given in (5) requires solving a semidefinite program (SDP) at each iteration; we reformulate the iterations in terms of the matrix nuclear norm, to which several first-order gradient-based methods can be applied efficiently. We apply the reweighted nuclear norm heuristic to a classic system identification problem: finding a low-order system from input-output data. We use the gradient projection method to implement the heuristic efficiently. We give a numerical example (from the system identification database [11]) and show that the reweighted nuclear norm heuristic gives a clearer indication of the model order and results in lower-order models compared to a simple subspace method and to unweighted nuclear norm minimization as in (2).

C. Related work

The papers [3], [6] study reweighted ℓ1 minimization as a heuristic for finding the sparsest vector in a convex set (a special case of the rank minimization problem); it is shown that this heuristic works better in practice than ℓ1 minimization [3]. RTH was first proposed in [8]. While that paper shows that the objective function value converges, it does not discuss whether the iterates themselves converge or whether the difference between successive iterates converges. That the difference between iterates goes to zero was shown for the reweighted ℓ1 heuristic in [6], but the proof there does not extend to the matrix case. The paper [17] gives an interior-point method for nuclear norm minimization and applies it to system identification as an alternative to subspace-based identification methods (see, e.g., [12], [5]); it shows that nuclear norm minimization determines the lowest system order better than existing methods. The interior-point implementation is more efficient than generic SDP solvers, but it does not scale as well as the first-order implementation we discuss here.

II. CONVERGENCE OF THE REWEIGHTED TRACE HEURISTIC

Let X ∈ R^{m×n}, Y ∈ S^m_+, Z ∈ S^n_+. Define the function g : R^{m×n} × S^m_+ × S^n_+ → R as

    g(X, Y, Z) = log det(Y + δI) + log det(Z + δI).

Note that g is continuously differentiable over its domain, with gradient

    ∇g(X, Y, Z) = (0, (Y + δI)^{-1}, (Z + δI)^{-1}).

Let W̄ = (W_1, W_2, W_3) and X̄ = (X, Y, Z) be two points in the domain of g. Define the inner product on the product space R^{m×n} × S^m_+ × S^n_+ as

    ⟨W̄, X̄⟩_c = ⟨W_1, X⟩ + ⟨W_2, Y⟩ + ⟨W_3, Z⟩,

where ⟨X, Y⟩ is the standard inner product between two matrices. Since g is strictly concave on its domain, we have

    g(X̄) < g(W̄) + ⟨∇g(W̄), X̄ − W̄⟩_c =: h(X̄, W̄)

for all W̄ ≠ X̄ in D, where

    D = {(X, Y, Z) : [Y  X; X^T  Z] ⪰ 0, X ∈ C}.

Since g(X̄) < h(X̄, W̄) for all W̄ ≠ X̄ in D and g(X̄) = h(X̄, X̄) for all X̄ in D, the function h majorizes g. Writing X̄^k for (X^k, Y^k, Z^k), RTH is thus a majorization-minimization (MM) algorithm:

    X̄^{k+1} = argmin_{X̄ ∈ D} h(X̄, X̄^k) = argmin_{X̄ ∈ D} ⟨∇g(X̄^k), X̄⟩_c
             = argmin_{X̄ ∈ D} Tr((Y^k + δI)^{-1} Y) + Tr((Z^k + δI)^{-1} Z).    (6)

Note that if X̄^k ≠ X̄^{k+1},

    g(X̄^{k+1}) < h(X̄^{k+1}, X̄^k) ≤ h(X̄^k, X̄^k) = g(X̄^k).          (7)
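As a quick sanity check on the majorization property behind (7), the following numpy snippet evaluates g and its linearization h at random PSD points (the sizes and the value of δ are arbitrary choices) and verifies that g ≤ h:

```python
# Numerical check that the linearization h majorizes the concave g.
import numpy as np

rng = np.random.default_rng(1)
m, n, delta = 5, 4, 1e-2

def rand_psd(d):
    A = rng.standard_normal((d, d))
    return A @ A.T

def g(Y, Z):
    return (np.linalg.slogdet(Y + delta * np.eye(len(Y)))[1]
            + np.linalg.slogdet(Z + delta * np.eye(len(Z)))[1])

Y1, Z1 = rand_psd(m), rand_psd(n)            # plays the role of Xbar
Y0, Z0 = rand_psd(m), rand_psd(n)            # plays the role of Wbar
gY = np.linalg.inv(Y0 + delta * np.eye(m))   # gradient blocks at Wbar
gZ = np.linalg.inv(Z0 + delta * np.eye(n))
h = g(Y0, Z0) + np.sum(gY * (Y1 - Y0)) + np.sum(gZ * (Z1 - Z0))
assert g(Y1, Z1) <= h + 1e-9                 # concavity: g(Xbar) <= h(Xbar, Wbar)
print(g(Y1, Z1), "<=", h)
```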
We first define a stationary point of a function before stating the convergence theorem.

Definition II.1: x* is a stationary point of a continuously differentiable function f over a set D if x* ∈ argmin_{y ∈ D} ∇f(x*)^T y.

Theorem II.2: Every convergent subsequence of the reweighted trace heuristic (5) converges to a stationary point of g over D. Further, the norm of the difference between successive iterates tends to zero: ‖X̄^{k+1} − X̄^k‖_F → 0.

Proof: Let X̄^1 = (X^1, Y^1, Z^1) be the solution to the nuclear norm minimization problem (2). The set {X̄ : g(X̄) ≤ g(X̄^1)} is bounded, because ‖Y‖_F, ‖Z‖_F → ∞ implies g(X̄) → ∞. Also, if ‖X‖_F → ∞, then ‖Y‖_F ‖Z‖_F → ∞ (since the block matrix is positive semidefinite), and hence g(X̄) → ∞. Thus {X̄ : g(X̄) ≤ g(X̄^1)} is a compact set, and the iterates of RTH are bounded (since g(X̄^{k+1}) ≤ g(X̄^k) for all k). Therefore the sequence {X̄^k} has a convergent subsequence; denote it {X̄^{n_i}}, i = 1, 2, ..., with limit X̄*, and let X̄^{n_i+1} → X̄_a. Assume that X̄* ≠ X̄_a. Now, (7) together with the fact that g is continuously differentiable implies that

    g(X̄*) = lim g(X̄^{n_i}) > lim g(X̄^{n_i+1}) = g(X̄_a).           (8)

Inequality (7) also implies that g(X̄^{i+1}) ≤ g(X̄^i) for all i. Since g is bounded below on the compact level set above, {g(X̄^i)} converges, and

    lim g(X̄^i) = lim g(X̄^{n_i}) = g(X̄*) = lim g(X̄^{n_i+1}) = g(X̄_a).    (9)

But (9) contradicts (8); hence X̄* = X̄_a, and by definition this implies that X̄* is a stationary point. Now, assume there exist a subsequence and an ε > 0 such that ‖X̄^{n_i} − X̄^{n_i+1}‖_F > ε for i = 1, 2, .... Let {X̄^{n_i}} → X̄* and {X̄^{n_i+1}} → X̄_a ≠ X̄*. Then, by the same argument as above, we arrive at a contradiction. Thus X̄* = X̄_a, and it follows that ‖X̄^{k+1} − X̄^k‖_F → 0.
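The theorem can be illustrated numerically: running RTH on a small synthetic instance and tracking ‖X^{k+1} − X^k‖_F (here only the X component of X̄, for brevity) shows the differences shrinking toward zero. This sketch reuses rth_step from the earlier snippet; the random rank-one instance is purely illustrative.

```python
# Toy RTH run illustrating ||X^{k+1} - X^k||_F -> 0 (Theorem II.2).
import numpy as np

rng = np.random.default_rng(2)
m, n, p = 8, 8, 30
X_true = rng.standard_normal((m, 1)) @ rng.standard_normal((1, n))  # rank 1
A_ops = rng.standard_normal((p, m, n))
b = np.array([np.sum(Ai * X_true) for Ai in A_ops])

Yk, Zk, Xk = np.eye(m), np.eye(n), np.zeros((m, n))  # Y^0 = Z^0 = I, so the
for k in range(10):                                  # first step solves (2)
    Xn, Yn, Zn = rth_step(Yk, Zk, A_ops, b)
    print(f"k={k}  ||X(k+1)-X(k)||_F = {np.linalg.norm(Xn - Xk):.2e}")
    Xk, Yk, Zk = Xn, Yn, Zn
```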

A. Convergence through the conditional gradient method

For a generic problem with a continuously differentiable objective function f : R^n → R and a convex, compact constraint set C, the conditional gradient method [1] yields the iterates

    x̄^k = argmin_{x ∈ C} ⟨∇f(x^k), x⟩,    x^{k+1} = x^k + α^k (x̄^k − x^k).

If we use the so-called limited minimization rule to pick the step size α^k, we get α^k = 1, since the objective function f is concave. Thus, the heuristic (5) is the same as the conditional gradient method applied to problem (4). It is known (e.g., [1]) that every cluster point of the conditional gradient iterates is a stationary point of f over the set C, and we obtain the first part of the convergence result in the previous theorem.

B. Other surrogate functions

We considered the concave surrogate function log det(·) in (4). We can similarly use other concave surrogates, such as Tr(X^{1/2}) (see, e.g., [2] for a proof of concavity), and apply the conditional gradient method or the majorization-minimization algorithm to obtain other heuristics for which the same convergence results hold. For example, the surrogate function Tr(X^{1/2}) yields the heuristic

    X̄^{k+1} = argmin_{X̄ ∈ D} Tr((Y^k + δI)^{-1/2} Y) + Tr((Z^k + δI)^{-1/2} Z),    (10)

with D as defined above. Comparing the performance of these heuristics is a direction for future work.

This section established that the difference of successive iterates of RTH converges to zero and that the limit point of every convergent subsequence is a stationary point. However, we cannot say whether the trace heuristic achieves the global minimum of (4). We initialize the weighted iterations with the solution of nuclear norm minimization (2); thus RTH can be thought of as improving on the solution of the nuclear norm heuristic.

III. REWEIGHTED NUCLEAR NORM HEURISTIC

Recall from (5) that in the kth iteration of the reweighted trace heuristic we solve the problem

    minimize    Tr((Y^k + δI)^{-1} Y) + Tr((Z^k + δI)^{-1} Z)
    subject to  [Y  X; X^T  Z] ⪰ 0,  X ∈ C.                         (11)

In this section, we reformulate this problem as a (reweighted) nuclear norm minimization problem. This reformulation is algorithmically beneficial: it allows us to exploit the properties of the nuclear norm and the problem structure to obtain an efficient first-order algorithm for solving the heuristic. Let W_1^k = (Y^k + δI)^{-1/2} and W_2^k = (Z^k + δI)^{-1/2}. Since W_1^k, W_2^k are positive definite for any feasible Y^k, Z^k and δ > 0, the constraint in (11) is equivalent to

    [W_1^k  0; 0  W_2^k] [Y  X; X^T  Z] [W_1^k  0; 0  W_2^k] ⪰ 0.

Thus, problem (11) is equivalent to

    minimize    (1/2)(Tr(W_1^k Y W_1^k) + Tr(W_2^k Z W_2^k))
    subject to  [W_1^k Y W_1^k   W_1^k X W_2^k; W_2^k X^T W_1^k   W_2^k Z W_2^k] ⪰ 0,  X ∈ C.    (12)

Using the following characterization of the nuclear norm (see, e.g., [7], [14]),

    ‖X‖_* = (1/2) min { Tr Y + Tr Z : [Y  X; X^T  Z] ⪰ 0 },

we can write problem (12) as

    minimize    ‖W_1^k X W_2^k‖_*
    subject to  X ∈ C,                                              (13)

which is a (weighted) nuclear norm minimization. Once the optimal solution X^{k+1} is found, the weights W_1^{k+1} and W_2^{k+1} are updated as follows. Let W_1^k X^{k+1} W_2^k = U Σ V^T be the reduced singular value decomposition, where U ∈ R^{m×r}, Σ ∈ R^{r×r}, and V ∈ R^{n×r}. It can be checked [14] that the optimal Y^{k+1} and Z^{k+1} in (12) are given by

    Y^{k+1} = (W_1^k)^{-1} U Σ U^T (W_1^k)^{-1},    Z^{k+1} = (W_2^k)^{-1} V Σ V^T (W_2^k)^{-1},    (14)

so the weights can be updated as

    W_1^{k+1} = (Y^{k+1} + δI)^{-1/2},    W_2^{k+1} = (Z^{k+1} + δI)^{-1/2}.    (15)

The update equations (14), (15) together with (13) describe the reweighted nuclear norm heuristic. If the set C in (13) is described by convex constraints f_i(X) ≤ 0, i = 1, ..., m, we can write the problem in the regularized form

    X^{k+1} = argmin_X  Σ_i λ_i f_i(X) + ‖W_1^k X W_2^k‖_*,         (16)

with W_1^k, W_2^k defined above and a suitable choice of λ_i.
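In numpy, the SVD-based updates (14)-(15) might look as follows. This is a sketch under the assumption that the weights remain well conditioned; the tolerance used for the numerical rank is an arbitrary choice.

```python
# Weight updates (14)-(15): reduced SVD of W1 X W2, then new weights.
import numpy as np

def inv_sqrt_psd(S):
    """(S)^{-1/2} for symmetric positive definite S, via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T

def update_weights(W1, W2, X_next, delta):
    U, s, Vt = np.linalg.svd(W1 @ X_next @ W2, full_matrices=False)
    r = int(np.sum(s > 1e-12))                  # numerical rank (tolerance
    U, s, V = U[:, :r], s[:r], Vt[:r].T         #  is an arbitrary choice)
    W1i, W2i = np.linalg.inv(W1), np.linalg.inv(W2)
    Y_next = W1i @ (U * s) @ U.T @ W1i          # eq. (14)
    Z_next = W2i @ (V * s) @ V.T @ W2i
    W1_new = inv_sqrt_psd(Y_next + delta * np.eye(Y_next.shape[0]))   # (15)
    W2_new = inv_sqrt_psd(Z_next + delta * np.eye(Z_next.shape[0]))
    return W1_new, W2_new
```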
We note that if, in addition to X ∈ C, we have the constraint that X be positive semidefinite, then the reweighted nuclear norm heuristic reduces to

    X^{k+1} = argmin_{X ∈ C, X ⪰ 0} Tr((X^k + δI)^{-1} X).          (17)

In the next section, we apply this regularized reweighted nuclear norm heuristic (which we abbreviate as RRNH) to a system identification problem using an efficient first-order method.

IV. EFFICIENT IMPLEMENTATION OF THE RRNH FOR SYSTEM IDENTIFICATION

Consider the problem of identifying a discrete-time, linear time-invariant state-space model

    x(t+1) = A x(t) + B u(t)
    y(t)   = C x(t) + D u(t)

from input-output data.
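For reference, the data-generation model above is easy to simulate; the system matrices, initial state, and inputs below would be placeholders for illustration.

```python
# Simulate the LTI state-space model: one column of U / Y per time step.
import numpy as np

def simulate(A, B, C, D, x0, U):
    x, ys = x0, []
    for t in range(U.shape[1]):
        ys.append(C @ x + D @ U[:, t])   # y(t)   = C x(t) + D u(t)
        x = A @ x + B @ U[:, t]          # x(t+1) = A x(t) + B u(t)
    return np.column_stack(ys)
```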

We are given inputs u(t) ∈ R^m and noisy measured outputs y_meas(t) ∈ R^p, for t = 0, 1, ..., N−1. Here x(t) ∈ R^n is the state of the system at time t, and n is the order of the model. We would like to find the matrices A, B, C, D, the initial state x(0), and the lowest possible order n such that y(t) ≈ y_meas(t). Let

    Ŷ = [y(0), ..., y(N−1)] ∈ R^{p×N},
    Y_meas = [y_meas(0), ..., y_meas(N−1)] ∈ R^{p×N},
    Û = [u(0), ..., u(N−1)] ∈ R^{m×N}.

Define the linear operator H_r as

    H_r(Ŷ) = [ y(0)    y(1)    ...   y(N−r−1)
               y(1)    y(2)    ...   y(N−r)
               ...
               y(r)    y(r+1)  ...   y(N−1) ],                      (18)

a block-Hankel matrix in R^{(r+1)p×(N−r)}, with Ŷ as defined above. The adjoint operator H_r^* sums the blocks of its argument along the block anti-diagonals: if W ∈ R^{(r+1)p×(N−r)} has p×1 blocks w_{ij}, i = 1, ..., r+1, j = 1, ..., N−r, then

    H_r^*(W) = [ w_{11},  w_{12} + w_{21},  w_{13} + w_{22} + w_{31},  ...,  w_{r+1,N−r} ].

Define X = [x(0), x(1), ..., x(N−r−1)], U = H_r(Û), and Y = H_r(Ŷ), with Û and Ŷ as defined earlier, and let

    G = [ C^T  (CA)^T  ...  (CA^r)^T ]^T,

    F = [ D          0          0         ...  0
          CB         D          0         ...  0
          CAB        CB         D         ...  0
          ...
          CA^{r−1}B  CA^{r−2}B  CA^{r−3}B ...  D ].

It is easy to see that Y = GX + FU, and thus Y U^⊥ = G X U^⊥, where U^⊥ ∈ R^{(N−r)×q} is a matrix whose columns form an orthonormal basis for the null space of U. If X has rank n and there is no rank cancellation in X U^⊥, one can find the system order from the rank of Y U^⊥ (see, e.g., [17], [16] for more details). Liu and Vandenberghe [16], [17] propose a nuclear norm heuristic for minimizing the rank of Y U^⊥:

    minimize_Ŷ  ‖H_r(Ŷ) U^⊥‖_* + (λ/2) ‖Ŷ − Y_meas‖_F^2,            (19)

with λ a positive parameter. We apply the RRNH to find a minimum-order system, giving the iterative minimization

    Ŷ^{k+1} = argmin_Ŷ  (λ/2) ‖Ŷ − Y_meas‖_F^2 + ‖W_1^k H_r(Ŷ) U^⊥ W_2^k‖_*,    (20)

where we take f_1 = (1/2)‖Ŷ − Y_meas‖_F^2 in (16) and W_1^k, W_2^k are as in (13). Once we obtain an optimal Ŷ, we compute the rank of H_r(Ŷ)U^⊥ by looking at its singular values. The rule of thumb we use is to take the rank to be the number of singular values after which there is a sharp drop (separating the significant singular values from the insignificant ones); if we do not observe a sharp drop, we take the rank to be the number of singular values within 10 percent of the largest singular value (as in [17]). Once we identify the rank of Y U^⊥, the matrices A, B, C, D of the LTI state-space model can be estimated as detailed in Section 5 of [17].

A. Problem Reformulation

To solve (20) efficiently, we reformulate it by making use of the structure of the regularized problem (19) and the fact that the nuclear norm is the dual of the spectral norm:

    ‖Y‖_* = max { ⟨Y, Z⟩ : ‖Z‖_2 ≤ 1 }.                             (21)

Using (21), the primal problem (20) at the kth iteration can be written as follows (note that switching Z to −Z does not change (21)):

    min_Ŷ  max_{Z : Z^T Z ⪯ I}  (λ/2) ‖Ŷ − Y_meas‖_F^2 − ⟨Z, W_1^k H_r(Ŷ) U^⊥ W_2^k⟩.    (22)

The corresponding dual problem is obtained by interchanging the min and the max:

    max_{Z : Z^T Z ⪯ I}  min_Ŷ  (λ/2) ‖Ŷ − Y_meas‖_F^2 − ⟨Z, W_1^k H_r(Ŷ) U^⊥ W_2^k⟩.

Define the operator Φ_k : R^{p×N} → R^{(r+1)p×q}, k = 0, 1, 2, ..., by Φ_k(Ŷ) = W_1^k H_r(Ŷ) U^⊥ W_2^k. It is easy to check that the adjoint operator Φ_k^* : R^{(r+1)p×q} → R^{p×N} is given by Φ_k^*(Z) = H_r^*(W_1^k Z W_2^k U^{⊥T}). The dual problem can now be rewritten as

    max_{Z : Z^T Z ⪯ I}  min_Ŷ  (λ/2) ‖Ŷ − Y_meas‖_F^2 − ⟨Φ_k^*(Z), Ŷ⟩.    (23)

Minimizing over Ŷ, the optimality condition gives

    λ (Ŷ − Y_meas) − Φ_k^*(Z) = 0.                                  (24)

Note that the primal (22) is a convex problem and satisfies Slater's condition, hence the duality gap between (22) and (23) is zero. Thus the primal optimal solution can be obtained from the dual optimal solution, which is the basis for the implementation described later. Substituting Ŷ from (24) back into (23), the dual problem reduces to

    min_{Z : Z^T Z ⪯ I}  (1/(2λ)) ‖Φ_k^*(Z)‖_F^2 + ⟨Y_meas, Φ_k^*(Z)⟩.    (25)

Rescaling Z by λ makes the objective independent of λ:

    min_{Z : λ^2 Z^T Z ⪯ I}  (1/2) ‖Φ_k^*(Z)‖_F^2 + ⟨Y_meas, Φ_k^*(Z)⟩.    (26)
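To fix shapes and make these operators concrete, here is a numpy sketch of H_r, its adjoint, and Φ_k, Φ_k^*; the helper names are ours, and W_1, W_2 are assumed symmetric positive definite as in the text.

```python
# Block-Hankel map (18), its adjoint, and Phi_k / Phi_k^* from (22)-(26).
import numpy as np

def hankel(Yhat, r):
    """H_r(Yhat): stack the r+1 shifted p x (N-r) column windows."""
    p, N = Yhat.shape
    return np.vstack([Yhat[:, i:i + N - r] for i in range(r + 1)])

def hankel_adjoint(W, p, N, r):
    """Adjoint of H_r: accumulate the p-row blocks along anti-diagonals."""
    Yhat = np.zeros((p, N))
    for i in range(r + 1):
        Yhat[:, i:i + N - r] += W[i * p:(i + 1) * p, :]
    return Yhat

def make_phi(W1, W2, Uperp, p, N, r):
    """Phi(Yhat) = W1 H_r(Yhat) Uperp W2 and its adjoint (W1, W2 symmetric)."""
    phi = lambda Yhat: W1 @ hankel(Yhat, r) @ Uperp @ W2
    phi_star = lambda Z: hankel_adjoint(W1 @ Z @ W2 @ Uperp.T, p, N, r)
    return phi, phi_star
```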
The RRNH for the system identification application, implemented with an efficient first-order method (the gradient projection method applied to the dual; see, e.g., [15]), can be summarized as follows:

1) Set k = 0. Initialize W_1^0 = I, W_2^0 = I.
2) Solve the dual problem (26) using the gradient projection algorithm to obtain Z^{k+1}.
3) Obtain Ŷ^{k+1} = Y_meas + (1/λ) Φ_k^*(λ Z^{k+1}) (using (24)).
4) Let Y^{k+1} = H_r(Ŷ^{k+1}) U^⊥, and let U Σ V^T be the reduced SVD of Ȳ^{k+1} = W_1^k Y^{k+1} W_2^k. Set
       W_1^{k+1} = ((W_1^k)^{-1} U Σ U^T (W_1^k)^{-1} + δI)^{-1/2},
       W_2^{k+1} = ((W_2^k)^{-1} V Σ V^T (W_2^k)^{-1} + δI)^{-1/2}.
5) Stop if the termination criterion is satisfied; otherwise set k = k + 1 and go to step 2.

Define

    D_k(Z) = (1/2) ‖Φ_k^*(Z)‖_F^2 + ⟨Y_meas, Φ_k^*(Z)⟩

to be the dual objective in (26). Step 2 of the algorithm applies the gradient projection method to solve the dual (26). We note that the method works well when the step size in the gradient-descent step is chosen inversely proportional to the Lipschitz constant of ∇D_k (see, e.g., [15]). An estimate of this Lipschitz constant is L_k = r λ_max^2(W_1^k) λ_max^2(W_2^k), with the details given in the Appendix.
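Putting the pieces together, a hedged sketch of steps 1-5 follows, reusing hankel, hankel_adjoint, make_phi, and inv_sqrt_psd from the earlier snippets. The projection onto {Z : λ^2 Z^T Z ⪯ I} clips singular values at 1/λ, the step size is 1/L_k, and the fixed iteration counts are illustrative defaults rather than the stopping rules used in the experiments below.

```python
# RRNH outer loop with gradient projection on the scaled dual (26).
import numpy as np
from scipy.linalg import null_space

def rrnh(Y_meas, Uhat, r, lam, delta, outer=4, inner=500):
    p, N = Y_meas.shape
    Uperp = null_space(hankel(Uhat, r))          # (N-r) x q orthonormal basis
    q = Uperp.shape[1]
    W1, W2 = np.eye((r + 1) * p), np.eye(q)      # step 1
    for k in range(outer):
        phi, phi_star = make_phi(W1, W2, Uperp, p, N, r)
        Lk = r * np.linalg.norm(W1, 2)**2 * np.linalg.norm(W2, 2)**2
        Z = np.zeros(((r + 1) * p, q))
        for _ in range(inner):                   # step 2: gradient projection
            grad = phi(phi_star(Z) + Y_meas)     # gradient of D_k at Z
            U, s, Vt = np.linalg.svd(Z - grad / Lk, full_matrices=False)
            Z = U @ np.diag(np.minimum(s, 1.0 / lam)) @ Vt   # project
        Yhat = Y_meas + phi_star(lam * Z) / lam  # step 3, via (24)
        Ybar = phi(Yhat)                         # W1 H_r(Yhat) Uperp W2
        U, s, Vt = np.linalg.svd(Ybar, full_matrices=False)
        W1i, W2i = np.linalg.inv(W1), np.linalg.inv(W2)
        Ynew = W1i @ (U * s) @ U.T @ W1i         # step 4, as in (14)
        Znew = W2i @ (Vt.T * s) @ Vt @ W2i
        W1 = inv_sqrt_psd(Ynew + delta * np.eye(Ynew.shape[0]))
        W2 = inv_sqrt_psd(Znew + delta * np.eye(Znew.shape[0]))
    return Yhat
```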
B. Numerical results

In [16], Liu and Vandenberghe note that the main advantage of the nuclear norm technique is that it makes the selection of an appropriate model order easier. We present an example showing that the RRNH improves on the nuclear norm technique for model order selection. We apply the RRNH implementation (the algorithm summarized above) to one of the data sets (96-006, [11]) available from the DaISy collection (http://homes.esat.kuleuven.be/~smc/daisy/). The parameter r, which sets the number of block rows in the block-Hankel matrices U = H_r(Û) and Y = H_r(Ŷ), is chosen so that the number of rows is greater than the expected system order; we choose r sufficiently large, namely r such that rp = 60, where p is the size of the output of the system. The parameter λ is chosen to give approximately the smallest identification error when RRNH is run for one iteration (i.e., just the nuclear norm heuristic as in (19)). The identification error is given by

    e_I = ( ‖Y_meas,I − Ỹ‖_F^2 / ‖Y_meas,I − Ȳ_I‖_F^2 )^{1/2},      (27)

where Ỹ = [ỹ(0), ..., ỹ(N_I − 1)] denotes the output of the identified state-space model, Y_meas,I ∈ R^{p×N_I} denotes the first N_I output measurements, and Ȳ_I has each of its N_I columns equal to the mean of the columns of Y_meas,I. The validation error e_V is obtained by replacing N_I with N_V in the computations in (27). The number of data points used for the identification experiment is N_I = 50, and N_V = 400 points are used for computing the validation error.

[Fig. 1: Trade-off curve between the identification error, the validation error, and the nuclear norm of Y U^⊥ for the data set. The bigger dot on the identification error curve corresponds to a very small identification error of 0.0725 and to λ = 6.]

The trade-off curve between the identification error, the validation error, and the nuclear norm is shown in Fig. 1. We pick λ = 6, which corresponds approximately to the smallest identification error, 0.0725, as indicated in Fig. 1. The normalized singular values of Y U^⊥ (obtained by setting the maximum singular value to 1) using just the nuclear norm heuristic as in (19) are shown in Fig. 2.

[Fig. 2: Normalized singular values of Y U^⊥, with Y = H_r(Ŷ) obtained through the nuclear norm heuristic for the data set; the normalized singular values of Y_meas,I U^⊥ (original) are also shown.]

As can be seen from Fig. 2, there is no sharp drop in the singular values that would clearly indicate the rank of Y U^⊥; we therefore use the rule of thumb described earlier, which gives a rank (and system order) of 6. The termination criterion we use for RRNH is to stop after 4 iterations, since we observe empirically that there is no significant change in the optimized variable beyond that point. For the dual gradient method used in each iteration of RRNH, the termination criterion is that the number of iterations equals Q = min(Q_1, Q_2), where Q_1 is the number of iterations for the duality gap to fall below a tolerance of 10^{-4} and Q_2 = 4000. Fig. 3 shows the results of RRNH for λ = 6 and different values of the parameter δ. The identification error and validation error were 0.069 and 0.54 respectively, comparable to the errors (0.069 and 0.2 respectively) obtained for this data set in [17]. The parameter δ, which acts as a regularization term in the weights W_1, W_2, appears to influence the singular values of H_r(Ŷ)U^⊥ and thus its rank.

[Fig. 3: Normalized singular values of Y U^⊥, with Y = H_r(Ŷ) obtained through the regularized reweighted nuclear norm heuristic (RRNH) for the data set, for several values of δ; the normalized singular values of Y_meas,I U^⊥ (original) are also shown.]

We observe empirically that, as λ increases, smaller values of δ give a clearer rank description for H_r(Ŷ)U^⊥. As can be seen from Fig. 3, δ = 0.5 gives a clearer description of the rank, namely 4. Smaller values of δ (less than 0.01) do not seem to provide a clear description of the rank; this observation mirrors those made in [3] about the choice of ε for which the iterative reweighted ℓ1 algorithm recovers sparse solutions. Thus, for the data set considered, using RRNH we obtain a reduction in model order (from 6 to 4) as well as a much clearer indication of the model order compared to the nuclear norm heuristic.

V. CONCLUSIONS

We explored the convergence properties of the reweighted trace heuristic, showing that the difference between successive iterates of this heuristic goes to zero and that every convergent subsequence converges to a stationary point of the concave surrogate function. We gave a reformulation of this heuristic as the reweighted nuclear norm heuristic, which allows for an efficient and scalable implementation through first-order gradient methods, such as the gradient projection and conditional gradient methods, in contrast to the reweighted trace formulation, which requires solving an SDP at each iteration. We applied the RRNH to a system identification application and showed that it provides a clearer indication of the matrix rank (and hence the system order) through a sharp drop in the singular value plot of H_r(Ŷ)U^⊥. We also observed that the RRNH gives a lower system order than the unweighted nuclear norm heuristic for the data set considered, with identification and validation errors comparable to those obtained for this data set in [17]. We observed empirically that as λ increases, smaller values of δ give a clearer rank description for H_r(Ŷ)U^⊥; it would be useful to understand precisely how δ interacts with λ in providing a clear rank description. Finally, since the RRNH allows an efficient implementation of the reweighted trace heuristic, it would be useful to quantify its efficiency and scalability and compare it with the nuclear norm heuristic implemented using the interior-point method detailed in [17].

ACKNOWLEDGEMENTS

We would like to thank Paul Tseng for very helpful discussions and for his implementation of the dual gradient projection method, which we adapted for our algorithm.

APPENDIX

Estimate of the Lipschitz constant. For any Z_1, Z_2 ∈ R^{(r+1)p×q},

    ‖∇D_k(Z_1) − ∇D_k(Z_2)‖_F = ‖Φ_k Φ_k^*(Z_1 − Z_2)‖_F.           (28)

Note that H_r H_r^* and Φ_k Φ_k^* are self-adjoint operators. Φ_k Φ_k^* is a compact operator, since its range is finite-dimensional, and it is positive, since ⟨Z, Φ_k Φ_k^*(Z)⟩ = ⟨Φ_k^*(Z), Φ_k^*(Z)⟩ = ‖Φ_k^*(Z)‖_F^2 ≥ 0. We apply the Rayleigh-Ritz theory for self-adjoint, positive, compact operators [13] to express the maximum eigenvalue of Φ_k Φ_k^* as

    λ_max(Φ_k Φ_k^*) = sup_{W : ‖W‖_F = 1} ⟨W, Φ_k Φ_k^*(W)⟩,

and get

    ‖Φ_k Φ_k^*(Z_1 − Z_2)‖_F^2 / ‖Z_1 − Z_2‖_F^2 ≤ sup_W ( ‖Φ_k Φ_k^*(W)‖_F^2 / ‖W‖_F^2 ) = λ_max^2(Φ_k Φ_k^*).

Thus an upper bound on λ_max(Φ_k Φ_k^*) can be used as an estimate L_k of the Lipschitz constant of the gradient of the dual objective D_k. We obtain such a bound below. From (18), it is easy to see using the properties of norms that ‖H_r^*(X)‖_F^2 ≤ r ‖X‖_F^2. Also, using the properties of the trace and the fact that U^⊥ has orthonormal columns, we have

    ‖Φ_k^*(Z)‖_F^2 ≤ r ⟨W_1^k Z W_2^k, W_1^k Z W_2^k⟩.
Let ρ_1^2 ≥ ρ_2^2 ≥ ... ≥ ρ_q^2 be the eigenvalues of (W_2^k)^2, and let γ_1^2 ≥ γ_2^2 ≥ ... ≥ γ_{(r+1)p}^2 be the eigenvalues of (W_1^k)^2. Using von Neumann's trace inequality (see, e.g., [10]), it can be shown that

    ⟨W_1^k Z W_2^k, W_1^k Z W_2^k⟩ ≤ ρ_1^2 γ_1^2 ‖Z‖_F^2.

Thus we have

    ‖Φ_k Φ_k^*(Z_1 − Z_2)‖_F^2 / ‖Z_1 − Z_2‖_F^2 ≤ λ_max^2(Φ_k Φ_k^*) ≤ ( sup_W ‖Φ_k^*(W)‖_F^2 / ‖W‖_F^2 )^2 ≤ r^2 ρ_1^4 γ_1^4.    (29)

Thus L_k = r ρ_1^2 γ_1^2 = r λ_max^2(W_1^k) λ_max^2(W_2^k), an upper bound on λ_max(Φ_k Φ_k^*), is an estimate of the Lipschitz constant of ∇D_k.

REFERENCES

[1] D. P. Bertsekas. Nonlinear Programming. 1999.
[2] J. Brinkhuis, Z.-Q. Luo, and S. Zhang. Matrix convex functions with applications to weighted centers for semidefinite programming. 2005.
[3] E. J. Candès, M. B. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14:877-905, 2008.
[4] E. J. Candès and B. Recht. Exact matrix completion via convex optimization.
[5] L. Ljung. System Identification: Theory for the User. 2002.
[6] M. S. Lobo, M. Fazel, and S. Boyd. Portfolio optimization with linear and fixed transaction costs.
[7] M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, 2001.
[8] M. Fazel, H. Hindi, and S. Boyd. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proceedings of the American Control Conference, 2003.
[9] M. Fazel, H. Hindi, and S. Boyd. Rank minimization and applications in system theory. In Proceedings of the American Control Conference, 2004.
[10] L. Mirsky. A trace inequality of John von Neumann. Monatshefte für Mathematik, 1975.

[11] B. De Moor, P. De Gersem, B. De Schutter, and W. Favoreel. DaISy: A database for the identification of systems.
[12] B. De Moor, M. Moonen, L. Vandenberghe, and J. Vandewalle. A geometrical approach for the identification of state space models with the singular value decomposition. In Proceedings of the 1988 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1988.
[13] A. W. Naylor and G. R. Sell. Linear Operator Theory in Engineering and Science. 1982.
[14] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. Accepted for publication, SIAM Review.
[15] T. K. Pong, P. Tseng, S. Ji, and J. Ye. Trace norm regularization: Formulations, algorithms, and multi-task learning. 2009.
[16] Z. Liu and L. Vandenberghe. Semidefinite programming methods for system realization and identification. In Proceedings of the Conference on Decision and Control, 2009.
[17] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. Submitted to SIAM Journal on Matrix Analysis and Applications, 2009.