Matrix Support Functional and its Applications

Matrix Support Functional and its Applications
James V. Burke, Mathematics, University of Washington
Joint work with Yuan Gao (UW) and Tim Hoheisel (McGill)
CORS, Banff 2016, June 1, 2016

Connections

What do the following topics have in common?

Quadratic Optimization Problems with Equality Constraints
The Matrix Fractional Function and its Generalization
Ky Fan (p,k) Norms
K-means Clustering
Best Affine Unbiased Estimator
Supervised Representation Learning
Multi-task Learning
Variational Gram Functions

Answer: They can all be represented using a matrix support functional that is smooth on the interior of its domain.

A Matrix Support Functional (B.-Hoheisel (2015))

Given $A \in \mathbb{R}^{p \times n}$ and $B \in \mathbb{R}^{p \times m}$, set
$$D(A,B) := \left\{ \left(Y, -\tfrac{1}{2} Y Y^{T}\right) \in \mathbb{R}^{n \times m} \times S^{n} \;\middle|\; Y \in \mathbb{R}^{n \times m} :\ AY = B \right\}.$$
We consider the support functional for $D(A,B)$:
$$\sigma\big((X,V) \mid D(A,B)\big) = \sup_{AY = B}\ \Big\langle (X,V),\ \Big(Y, -\tfrac{1}{2} Y Y^{T}\Big) \Big\rangle = -\inf_{AY=B}\ \tfrac{1}{2}\operatorname{tr}\big(Y^{T} V Y\big) - \langle X, Y \rangle.$$

Support Functions

$$\sigma_{S}(x) := \sigma(x \mid S) := \sup_{y \in S} \langle x, y \rangle$$

$$\overline{\operatorname{conv}}\, S = \bigcap_{x} \{\, y \mid \langle x, y \rangle \le \sigma(x \mid S) \,\}$$

$$\sigma_{S} = \sigma_{\operatorname{cl} S} = \sigma_{\operatorname{conv} S} = \sigma_{\overline{\operatorname{conv}}\, S}$$

When $S$ is a closed convex set, then $\partial\sigma_{S}(x) = \operatorname*{arg\,max}_{y \in S} \langle x, y \rangle$.
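A quick worked example, not on the original slides, to make these objects concrete: take $S$ to be the closed unit Euclidean ball. Then
$$S = \{\, y \mid \|y\|_{2} \le 1 \,\}: \qquad \sigma_{S}(x) = \sup_{\|y\|_{2} \le 1} \langle x, y \rangle = \|x\|_{2}, \qquad \partial\sigma_{S}(x) = \Big\{ \tfrac{x}{\|x\|_{2}} \Big\}\ \ (x \neq 0), \qquad \partial\sigma_{S}(0) = S.$$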

Epigraph

$$\operatorname{epi} f := \{\, (x, \mu) \mid f(x) \le \mu \,\}, \qquad f^{*}(y) = \sigma\big((y, -1) \mid \operatorname{epi} f\big).$$

A Representation for σ((X,V) | D(A,B))

Let $A \in \mathbb{R}^{p \times n}$ and $B \in \mathbb{R}^{p \times m}$ be such that $\operatorname{rge} B \subseteq \operatorname{rge} A$. Then
$$\sigma\big((X,V) \mid D(A,B)\big) = \begin{cases} \tfrac{1}{2}\operatorname{tr}\!\left( \begin{pmatrix} X \\ B \end{pmatrix}^{\!T} M(V)^{\dagger} \begin{pmatrix} X \\ B \end{pmatrix} \right) & \text{if } \operatorname{rge}\begin{pmatrix} X \\ B \end{pmatrix} \subseteq \operatorname{rge} M(V) \text{ and } V|_{\ker A} \succeq 0, \\[4pt] +\infty & \text{else,} \end{cases}$$
where
$$M(V) := \begin{pmatrix} V & A^{T} \\ A & 0 \end{pmatrix}.$$

In particular,
$$\operatorname{dom} \sigma(\,\cdot \mid D(A,B)) = \operatorname{dom} \partial\sigma(\,\cdot \mid D(A,B)) = \left\{ (X,V) \in \mathbb{R}^{n \times m} \times S^{n} \;\middle|\; \operatorname{rge}\begin{pmatrix} X \\ B \end{pmatrix} \subseteq \operatorname{rge} M(V),\ V|_{\ker A} \succeq 0 \right\},$$
with
$$\operatorname{int}\big(\operatorname{dom} \sigma(\,\cdot \mid D(A,B))\big) = \{\, (X,V) \in \mathbb{R}^{n \times m} \times S^{n} \mid V|_{\ker A} \succ 0 \,\}.$$

The inverse $M(V)^{-1}$ exists when $V|_{\ker A} \succ 0$ and $A$ is surjective.
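As a numerical sanity check of the formula as reconstructed above (not part of the original slides), the following Python sketch evaluates the trace expression with the pseudoinverse of M(V) and compares it against a direct maximization over {Y : AY = B} via a null-space parameterization. All dimensions and data are arbitrary illustrations.

```python
# Minimal sketch: compare the trace formula with a direct evaluation of the sup.
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
n, m, p = 6, 3, 2
A = rng.standard_normal((p, n))                # full row rank (generically)
B = A @ rng.standard_normal((n, m))            # guarantees rge B ⊆ rge A
X = rng.standard_normal((n, m))
V = rng.standard_normal((n, n)); V = V @ V.T + np.eye(n)   # V ≻ 0, so V|ker A ≻ 0

# Trace formula: 1/2 tr([X; B]^T M(V)^† [X; B])
M = np.block([[V, A.T], [A, np.zeros((p, p))]])
XB = np.vstack([X, B])
sigma_formula = 0.5 * np.trace(XB.T @ np.linalg.pinv(M) @ XB)

# Direct evaluation: sup_{AY=B} <X, Y> - 1/2 tr(Y^T V Y), with Y = Y0 + N Z,
# A Y0 = B and the columns of N spanning ker A.
Y0 = np.linalg.lstsq(A, B, rcond=None)[0]
N = null_space(A)
Z = np.linalg.solve(N.T @ V @ N, N.T @ (X - V @ Y0))   # stationarity of the concave quadratic
Y = Y0 + N @ Z
sigma_direct = np.sum(X * Y) - 0.5 * np.trace(Y.T @ V @ Y)

print(sigma_formula, sigma_direct)             # the two values should agree
```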

Relationship to Equality Constrained QP

Consider an equality-constrained QP:
$$\nu(x, V) := \inf_{u \in \mathbb{R}^{n}} \left\{ \tfrac{1}{2} u^{T} V u - x^{T} u \;\middle|\; Au = b \right\}.$$
The Lagrangian is
$$L(u, \lambda) = \tfrac{1}{2} u^{T} V u - x^{T} u + \lambda^{T}(Au - b),$$
and the optimality conditions are
$$Vu + A^{T}\lambda - x = 0, \qquad Au = b.$$
This is equivalent to
$$M(V) \begin{pmatrix} u \\ \lambda \end{pmatrix} = \begin{pmatrix} V & A^{T} \\ A & 0 \end{pmatrix} \begin{pmatrix} u \\ \lambda \end{pmatrix} = \begin{pmatrix} x \\ b \end{pmatrix}.$$
Hence
$$\nu(x, V) = -\sigma\big((x, V) \mid D(A, b)\big).$$
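The sign convention $\nu(x,V) = -\sigma((x,V)\mid D(A,b))$ above is reconstructed from the definitions; the short Python sketch below (not from the slides) checks it on a random instance by solving the KKT system with M(V).

```python
# Minimal sketch: nu(x, V) vs. -sigma((x, V) | D(A, b)) on a random instance.
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)
x = rng.standard_normal(n)
V = rng.standard_normal((n, n)); V = V @ V.T + np.eye(n)   # V ≻ 0

# Solve the KKT system M(V) [u; lam] = [x; b]
M = np.block([[V, A.T], [A, np.zeros((p, p))]])
sol = np.linalg.solve(M, np.concatenate([x, b]))
u, lam = sol[:n], sol[n:]

nu = 0.5 * u @ V @ u - x @ u                   # optimal value of the QP
sigma = 0.5 * np.concatenate([x, b]) @ sol     # 1/2 [x; b]^T M(V)^{-1} [x; b]
print(nu, -sigma)                              # should coincide
```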

Maximum Likelihood Estimation

$$L(\mu, \Sigma; Y) := (2\pi)^{-mN/2}\, |\Sigma|^{-N/2} \prod_{i=1}^{N} \exp\!\big(-\tfrac{1}{2}(y_{i} - \mu)^{T} \Sigma^{-1} (y_{i} - \mu)\big)$$

Up to a constant, the negative log-likelihood is
$$-\ln L(\mu, \Sigma; Y) = \tfrac{N}{2} \ln\det \Sigma + \tfrac{1}{2}\operatorname{tr}\big((Y - M)^{T} \Sigma^{-1} (Y - M)\big) = \sigma\big(((Y - M), \Sigma) \mid D(0,0)\big) - \tfrac{N}{2} \ln\det\big(\Sigma^{-1}\big),$$
where $Y = (y_{1}, \dots, y_{N})$ and $M$ has $\mu$ in each column.

The Matrix Fractional Function: take A = 0 and B = 0, and set
$$\gamma(X, V) := \sigma\big((X, V) \mid D(0,0)\big) = \begin{cases} \tfrac{1}{2}\operatorname{tr}\big(X^{T} V^{\dagger} X\big) & \text{if } \operatorname{rge} X \subseteq \operatorname{rge} V,\ V \succeq 0, \\ +\infty & \text{else.} \end{cases}$$

conv D(A,B)

Recall
$$\partial\sigma_{C}(x) = \{\, y \in \overline{\operatorname{conv}}\, C \mid \langle x, y \rangle = \sigma_{C}(x) \,\}.$$
So for $\partial\sigma\big((X,V) \mid D(A,B)\big)$ we need $\overline{\operatorname{conv}}\,(D(A,B))$.

Set
$$S^{n}_{+}(\ker A) := \{\, W \in S^{n} \mid u^{T} W u \ge 0\ \ \forall\, u \in \ker A \,\} = \{\, W \mid W|_{\ker A} \succeq 0 \,\}.$$
Then $S^{n}_{+}(\ker A)$ is a closed convex cone whose polar is given by
$$S^{n}_{+}(\ker A)^{\circ} = \{\, W \in S^{n} \mid W = PWP \preceq 0 \,\},$$
where $P$ is the orthogonal projection onto $\ker A$.

For $D(A,B) := \{\, (Y, -\tfrac{1}{2} Y Y^{T}) \in \mathbb{R}^{n \times m} \times S^{n} \mid Y \in \mathbb{R}^{n \times m} :\ AY = B \,\}$,
$$\overline{\operatorname{conv}}\, D(A,B) = \Omega(A,B) := \left\{ (Y, W) \in \mathbb{R}^{n \times m} \times S^{n} \;\middle|\; AY = B \text{ and } \tfrac{1}{2} Y Y^{T} + W \in S^{n}_{+}(\ker A)^{\circ} \right\}.$$

Applications

Quadratic Optimization Problems with Equality Constraints
The Matrix Fractional Function and its Generalization
Ky Fan (p,k) Norms
K-means Clustering
Best Affine Unbiased Estimator
Supervised Representation Learning
Multi-task Learning
Variational Gram Functions

Motivating Examples

Recall the matrix fractional function
$$\gamma(X, V) := \sigma\big((X, V) \mid D(0,0)\big) = \begin{cases} \tfrac{1}{2}\operatorname{tr}\big(X^{T} V^{\dagger} X\big) & \text{if } \operatorname{rge} X \subseteq \operatorname{rge} V,\ V \succeq 0, \\ +\infty & \text{else.} \end{cases}$$

We have the following two representations of the nuclear norm:
$$\|X\|_{*} = \min_{V}\ \gamma(X, V) + \tfrac{1}{2}\operatorname{tr} V, \qquad\quad \tfrac{1}{2}\|X\|_{*}^{2} = \min_{V}\ \gamma(X, V) + \delta\big(V \mid \operatorname{tr}(V) \le 1\big).$$
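A small check of the first representation (not from the slides): for the standard choice $V = (XX^{T})^{1/2}$, which we assume here attains the minimum, one gets $\gamma(X,V) = \tfrac{1}{2}\operatorname{tr} V$, so the objective evaluates to $\operatorname{tr} V = \|X\|_{*}$.

```python
# Sketch: evaluate gamma(X, V) + 1/2 tr V at V = (X X^T)^{1/2} and compare with ||X||_*.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 5))                # X X^T is full rank here
V = np.real(sqrtm(X @ X.T))                    # (X X^T)^{1/2}, PSD with rge X ⊆ rge V

gamma = 0.5 * np.trace(X.T @ np.linalg.pinv(V) @ X)
print(gamma + 0.5 * np.trace(V))               # matrix fractional representation
print(np.linalg.svd(X, compute_uv=False).sum())  # nuclear norm ||X||_*
```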

Infimal Projections

For a closed proper convex function h, define the infimal projection
$$\varphi(X) := \inf_{V}\ \sigma\big((X, V) \mid D(A,B)\big) + h(V).$$

Theorem. If $\operatorname{dom} h \cap S^{n}_{++}(\ker A) \neq \emptyset$, then
$$\varphi^{*}(Y) = \inf\ \{\, h^{*}(-W) \mid (Y, W) \in \overline{\operatorname{conv}}\, D(A,B) \,\}.$$

Infimal Projections with Indicators

When h is the indicator of a closed convex set $\mathcal{V}$, then
$$\varphi_{\mathcal{V}}(X) := \inf_{V \in \mathcal{V}}\ \sigma\big((X, V) \mid D(A,B)\big),$$
$$\varphi_{\mathcal{V}}^{*}(Y) = \tfrac{1}{2}\, \sigma\big(Y Y^{T} \mid \{\, V \in \mathcal{V} \mid V|_{\ker A} \succeq 0 \,\}\big) + \delta\big(Y \mid AY = B\big) = \tfrac{1}{2}\, \sigma\big(Y Y^{T} \mid \mathcal{V} \cap S^{n}_{+}(\ker A)\big) + \delta\big(Y \mid AY = B\big).$$

Note that when B = 0, both $\varphi_{\mathcal{V}}$ and $\varphi_{\mathcal{V}}^{*}$ are positively homogeneous of degree 2. When A = 0 and B = 0, $\varphi_{\mathcal{V}}^{*}$ is called a variational Gram function in Jalali-Xiao-Fazel (2016).

Ky Fan (p,k) Norm

For $p \ge 1$ and $1 \le k \le \min\{m, n\}$, the Ky Fan (p,k)-norm of a matrix $X \in \mathbb{R}^{n \times m}$ is given by
$$\|X\|_{p,k} = \left( \sum_{i=1}^{k} \sigma_{i}^{p} \right)^{1/p},$$
where the $\sigma_{i}$ are the singular values of X sorted in nonincreasing order.

The Ky Fan (p, min{m,n})-norm is the Schatten-p norm. The Ky Fan (1,k)-norm is the standard Ky Fan k-norm.

Corollary.
$$\tfrac{1}{2}\|X\|^{2}_{\frac{2p}{p+1},\,\min\{m,n\}} = \inf_{\|V\|_{p,\min\{m,n\}} \le 1} \gamma(X, V).$$
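For concreteness, here is a direct implementation of the definition (the helper name is ours, not from the talk), together with two special cases as a check.

```python
# A direct implementation of the Ky Fan (p,k)-norm from the definition.
import numpy as np

def ky_fan_pk_norm(X, p, k):
    """( sum_{i<=k} sigma_i(X)^p )^(1/p), singular values in nonincreasing order."""
    s = np.linalg.svd(X, compute_uv=False)     # returned in nonincreasing order
    return (s[:k] ** p).sum() ** (1.0 / p)

X = np.random.default_rng(3).standard_normal((4, 6))
r = min(X.shape)
print(ky_fan_pk_norm(X, 2, r), np.linalg.norm(X, 'fro'))   # Schatten-2 = Frobenius
print(ky_fan_pk_norm(X, 1, 1), np.linalg.norm(X, 2))       # Ky Fan (1,1) = spectral norm
```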

Ky Fan (p,k) Norm

Corollary.
$$\tfrac{1}{2}\|X\|^{2}_{\frac{2p}{p+1},\,\min\{m,n\}} = \inf_{\|V\|_{p,\min\{m,n\}} \le 1} \gamma(X, V).$$

Proof. Recall $\gamma(X, V) = \sigma((X, V) \mid D(0,0))$, $\varphi_{\mathcal{V}}(X) := \inf_{V \in \mathcal{V}} \sigma((X, V) \mid D(A,B))$, and set $\gamma_{\mathcal{V}}(X) := \inf_{V \in \mathcal{V}} \sigma((X, V) \mid D(0,0))$. With $\mathcal{V} = \{\, V \mid \|V\|_{p,\min\{m,n\}} \le 1 \,\}$,
$$\gamma_{\mathcal{V}}^{*}(X) = \sigma\big(\tfrac{1}{2} X X^{T} \mid \{\, V \succeq 0 \mid \|V\|_{p,\min\{m,n\}} \le 1 \,\}\big) = \tfrac{1}{2}\|X X^{T}\|_{\frac{p}{p-1},\,\min\{m,n\}} = \tfrac{1}{2}\|X\|^{2}_{\frac{2p}{p-1},\,\min\{m,n\}},$$
which is the conjugate of $\tfrac{1}{2}\|X\|^{2}_{\frac{2p}{p+1},\,\min\{m,n\}}$; taking conjugates again gives the result.

Ky Fan (p,k) Norm

As a special case, when p = 1:

Corollary.
$$\tfrac{1}{2}\|X\|_{*}^{2} = \min_{\operatorname{tr} V \le 1} \gamma(X, V).$$

Lemma. Let $\mathcal{V}$ be the set of rank-k orthogonal projection matrices, $\mathcal{V} = \{\, U U^{T} \mid U \in \mathbb{R}^{n \times k},\ U^{T} U = I_{k} \,\}$. Then
$$\tfrac{1}{2}\|X\|^{2}_{2,k} = \sigma\big(\tfrac{1}{2} X X^{T} \mid \mathcal{V}\big).$$

Proof. A consequence of the following fact [Fillmore-Williams 1971]:
$$\operatorname{conv}\{\, U U^{T} \mid U \in \mathbb{R}^{n \times k},\ U^{T} U = I_{k} \,\} = \{\, V \in S^{n} \mid I \succeq V \succeq 0,\ \operatorname{tr} V = k \,\}.$$

K-means Clustering [Zha-He-Ding-Gu-Simon 2001]

Given $X \in \mathbb{R}^{n \times m}$, the k-means objective is
$$K(X) := \min_{C,E}\ \tfrac{1}{2}\|X - EC\|_{2}^{2},$$
where $C \in \mathbb{R}^{k \times m}$ represents the k centers and $E$ is an $n \times k$ assignment matrix, each of whose rows is one of $e_{1}^{T}, \dots, e_{k}^{T}$, corresponding to the k cluster assignments.

The optimal C is given by $C = (E^{T}E)^{-1}E^{T}X$. Define $P_{E} := E(E^{T}E)^{-1}E^{T}$; then $P_{E}$ is an orthogonal projection and
$$K(X) = \min_{E}\ \tfrac{1}{2}\|(I - P_{E})X\|_{2}^{2} = \tfrac{1}{2}\min_{E}\ \operatorname{tr}\big((I - P_{E})X X^{T}\big) \ \ge\ \tfrac{1}{2}\|X\|_{2}^{2} - \sigma_{\mathcal{P}_{k}}\big(\tfrac{1}{2} X X^{T}\big) = \tfrac{1}{2}\big(\|X\|_{2}^{2} - \|X\|_{2,k}^{2}\big),$$
where $\mathcal{P}_{k}$ denotes the set of rank-k orthogonal projection matrices.
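The Python sketch below (not from the slides) compares the k-means objective of one arbitrary assignment with the spectral lower bound $\tfrac{1}{2}(\|X\|_{2}^{2} - \|X\|_{2,k}^{2})$; the data and the assignment are illustrative only, not the k-means optimum.

```python
# Sketch: spectral lower bound on the k-means objective vs. one assignment's objective.
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 30, 2, 3
X = rng.standard_normal((n, m))

# Arbitrary assignment matrix E (rows are unit vectors e_j^T)
labels = rng.integers(0, k, size=n)
E = np.eye(k)[labels]                           # n x k one-hot rows
C = np.linalg.lstsq(E, X, rcond=None)[0]        # optimal centers (E^T E)^{-1} E^T X
obj = 0.5 * np.linalg.norm(X - E @ C, 'fro') ** 2

s = np.linalg.svd(X, compute_uv=False)
bound = 0.5 * (np.linalg.norm(X, 'fro') ** 2 - (s[:k] ** 2).sum())
print(obj, bound)                               # obj >= bound for any assignment
```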

Best Affine Unbiased Estimator

For a linear regression model $y = A^{T}\beta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^{2} V)$, and a given matrix B, an affine unbiased estimator of $B^{T}\beta$ is an estimator of the form $\hat\theta = X^{T} y + c$ satisfying $\mathbb{E}\hat\theta = B^{T}\beta$. Best means $\operatorname{Var}(\hat\theta^{*}) \preceq \operatorname{Var}(\hat\theta)$ for every affine unbiased estimator $\hat\theta$.

If a solution $X^{*}$ of
$$v(A, B, V) := \min_{X:\ AX = B}\ \tfrac{1}{2}\operatorname{tr}\big(X^{T} V X\big)$$
exists and is unique, then $\hat\theta^{*} = (X^{*})^{T} y$. Moreover,
$$v(A, B, V) = -\sigma\big((0, V) \mid D(A,B)\big),$$
and the optimal solution $X^{*}$ satisfies
$$M(V) \begin{pmatrix} X^{*} \\ W \end{pmatrix} = \begin{pmatrix} 0 \\ B \end{pmatrix}$$
for some multiplier matrix $W$.
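A small sketch (not from the slides) computing $X^{*}$ from the block system and comparing it with the classical generalized-least-squares form of the BLUE weights, which we assume here purely for the comparison; A surjective and $V \succ 0$ are assumed so that M(V) is invertible.

```python
# Sketch: BLUE weights X* from M(V) [X*; W] = [0; B].
import numpy as np

rng = np.random.default_rng(5)
p, n, q = 3, 8, 2                               # beta in R^3, y in R^8, estimate B^T beta in R^2
A = rng.standard_normal((p, n))                 # model: y = A^T beta + eps, Cov(eps) = sigma^2 V
B = rng.standard_normal((p, q))
V = rng.standard_normal((n, n)); V = V @ V.T + np.eye(n)

M = np.block([[V, A.T], [A, np.zeros((p, p))]])
rhs = np.vstack([np.zeros((n, q)), B])
sol = np.linalg.solve(M, rhs)
X_star, W = sol[:n], sol[n:]

print(np.allclose(A @ X_star, B))               # unbiasedness: A X* = B
# Classical closed form of the BLUE/GLS weights, assumed here for comparison:
X_gls = np.linalg.solve(V, A.T) @ np.linalg.solve(A @ np.linalg.solve(V, A.T), B)
print(np.allclose(X_star, X_gls))
```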

Supervised Representation Learning

Consider a binary classification problem where we are given training data $(x_{1}, y_{1}), \dots, (x_{n}, y_{n}) \in \mathbb{R}^{m} \times \{-1, 1\}$ and test data $x_{n+1}, \dots, x_{n+t} \in \mathbb{R}^{m}$. Representation learning aims to learn a feature mapping $\Phi : \mathbb{R}^{m} \to \mathcal{H}$ that maps the data points to a feature space in which the two classes are well separated.

Kernel methods: instead of specifying the function $\Phi$ explicitly, kernel methods map the data points to a reproducing kernel Hilbert space $\mathcal{H}$ so that the kernel matrix $K \in S^{n+t}_{+}$, where $K_{ij} = \langle \Phi(x_{i}), \Phi(x_{j}) \rangle$, implicitly determines the mapping $\Phi$.

Supervised Representation Learning

Let $\mathcal{K} \subseteq S^{n+t}_{+}$ be a set of candidate kernels. The best $K \in \mathcal{K}$ can be selected by maximizing its alignment with the kernel specified by the training labels:
$$\varphi_{\mathcal{V}}^{*}(y) = \max_{K \in \mathcal{K}}\ \tfrac{1}{2}\big\langle K_{1:n,1:n},\ y y^{T} \big\rangle,$$
where
$$\mathcal{V} = \mathcal{K} \subseteq S^{n+t}_{+}, \qquad A = \begin{pmatrix} 0_{n \times n} & 0 \\ 0 & I_{t \times t} \end{pmatrix}, \qquad B = 0_{(n+t) \times 1}.$$

Multi-task Learning

In multi-task learning, T sets of labelled training data $(x_{t1}, y_{t1}), \dots, (x_{tn}, y_{tn}) \in \mathbb{R}^{m} \times \mathbb{R}$ are given, representing T learning tasks.

Assumption: a linear feature map $h_{i}(x) = \langle u_{i}, x \rangle$, $i = 1, \dots, m$, where $U = (u_{1}, \dots, u_{m})$ is an $m \times m$ orthogonal matrix, and the predictor for each task is $f_{t}(x) := \langle a_{t}, h(x) \rangle$.

The multi-task learning problem is then
$$\min_{A, U}\ \sum_{t=1}^{T} \sum_{i=1}^{n} L_{t}\big(y_{ti}, \langle a_{t}, U^{T} x_{ti} \rangle\big) + \mu \|A\|_{2,1}^{2},$$
where $A = (a_{1}, \dots, a_{T})$, $\|A\|_{2,1}^{2}$ is the square of the sum of the 2-norms of the rows of A, and $L_{t}$ is a loss function for each task.

Setting $W = UA$, the nonconvex problem is equivalent to the following convex problem [Argyriou-Evgeniou-Pontil 2006]:
$$\min_{W, D}\ \sum_{t=1}^{T} \sum_{i=1}^{n} L_{t}\big(y_{ti}, \langle w_{t}, x_{ti} \rangle\big) + 2\mu\, \gamma(W, D) \quad \text{s.t. } \operatorname{tr} D \le 1,$$
which in turn is equivalent to
$$\min_{W}\ \sum_{t=1}^{T} \sum_{i=1}^{n} L_{t}\big(y_{ti}, \langle w_{t}, x_{ti} \rangle\big) + \mu \|W\|_{*}^{2}.$$
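To connect the last two formulations numerically, the sketch below (not from the slides) evaluates $2\mu\,\gamma(W,D)$ at the choice $D = (WW^{T})^{1/2}/\operatorname{tr}(WW^{T})^{1/2}$, assumed here to be the minimizer over $\operatorname{tr} D \le 1$, and compares it with $\mu\|W\|_{*}^{2}$.

```python
# Sketch: the variational penalty at D = S / tr S, S = (W W^T)^{1/2}, vs. mu * ||W||_*^2.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(6)
W = rng.standard_normal((4, 7))                 # feature-by-task weight matrix
mu = 0.3

S = np.real(sqrtm(W @ W.T))
D = S / np.trace(S)                             # feasible: D ⪰ 0, tr D = 1
gamma = 0.5 * np.trace(W.T @ np.linalg.pinv(D) @ W)

print(2 * mu * gamma)
print(mu * np.linalg.svd(W, compute_uv=False).sum() ** 2)   # mu * ||W||_*^2
```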

Thank you!

References

J. V. Burke and T. Hoheisel: Matrix Support Functionals for Inverse Problems, Regularization, and Learning. SIAM Journal on Optimization, 25(2):1135-1159, 2015.