Matrix Support Functional and its Applications

Matrix Support Functional and its Applications
James V. Burke, Mathematics, University of Washington
Joint work with Yuan Gao (UW) and Tim Hoheisel (McGill)
CORS, Banff 2016, June 1, 2016

Connections

What do the following topics have in common?

Quadratic Optimization Problems with Equality Constraints
The Matrix Fractional Function and its Generalization
Ky Fan (p,k) Norms
K-means Clustering
Best Affine Unbiased Estimator
Supervised Representation Learning
Multi-task Learning
Variational Gram Functions

Answer: They can all be represented using a matrix support functional that is smooth on the interior of its domain.

A Matrix Support Functional (B.-Hoheisel (2015))

Given $A \in \mathbb{R}^{p \times n}$ and $B \in \mathbb{R}^{p \times m}$, set
$$D(A,B) := \left\{ \left(Y, -\tfrac{1}{2} Y Y^{T}\right) \in \mathbb{R}^{n \times m} \times S^{n} \;\middle|\; Y \in \mathbb{R}^{n \times m} :\ AY = B \right\}.$$
We consider the support functional for $D(A,B)$:
$$\sigma\big((X,V) \mid D(A,B)\big) = \sup_{AY = B}\ \Big\langle (X,V),\ \Big(Y, -\tfrac{1}{2} Y Y^{T}\Big) \Big\rangle = -\inf_{AY=B}\ \tfrac{1}{2}\operatorname{tr}\big(Y^{T} V Y\big) - \langle X, Y \rangle.$$

Support Functions

$$\sigma_{S}(x) := \sigma(x \mid S) := \sup_{y \in S} \langle x, y \rangle$$

$$\overline{\operatorname{conv}}\, S = \bigcap_{x} \{\, y \mid \langle x, y \rangle \le \sigma(x \mid S) \,\}$$

$$\sigma_{S} = \sigma_{\operatorname{cl} S} = \sigma_{\operatorname{conv} S} = \sigma_{\overline{\operatorname{conv}}\, S}$$

When $S$ is a closed convex set, then $\partial\sigma_{S}(x) = \operatorname*{arg\,max}_{y \in S} \langle x, y \rangle$.
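A quick worked example, not on the original slides, to make these objects concrete: take $S$ to be the closed unit Euclidean ball. Then
$$S = \{\, y \mid \|y\|_{2} \le 1 \,\}: \qquad \sigma_{S}(x) = \sup_{\|y\|_{2} \le 1} \langle x, y \rangle = \|x\|_{2}, \qquad \partial\sigma_{S}(x) = \Big\{ \tfrac{x}{\|x\|_{2}} \Big\}\ \ (x \neq 0), \qquad \partial\sigma_{S}(0) = S.$$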

Epigraph

$$\operatorname{epi} f := \{\, (x, \mu) \mid f(x) \le \mu \,\}, \qquad f^{*}(y) = \sigma\big((y, -1) \mid \operatorname{epi} f\big).$$

A Representation for σ((X,V) | D(A,B))

Let $A \in \mathbb{R}^{p \times n}$ and $B \in \mathbb{R}^{p \times m}$ be such that $\operatorname{rge} B \subseteq \operatorname{rge} A$. Then
$$\sigma\big((X,V) \mid D(A,B)\big) = \begin{cases} \tfrac{1}{2}\operatorname{tr}\!\left( \begin{pmatrix} X \\ B \end{pmatrix}^{\!T} M(V)^{\dagger} \begin{pmatrix} X \\ B \end{pmatrix} \right) & \text{if } \operatorname{rge}\begin{pmatrix} X \\ B \end{pmatrix} \subseteq \operatorname{rge} M(V) \text{ and } V|_{\ker A} \succeq 0, \\[4pt] +\infty & \text{else,} \end{cases}$$
where
$$M(V) := \begin{pmatrix} V & A^{T} \\ A & 0 \end{pmatrix}.$$

In particular,
$$\operatorname{dom} \sigma(\,\cdot \mid D(A,B)) = \operatorname{dom} \partial\sigma(\,\cdot \mid D(A,B)) = \left\{ (X,V) \in \mathbb{R}^{n \times m} \times S^{n} \;\middle|\; \operatorname{rge}\begin{pmatrix} X \\ B \end{pmatrix} \subseteq \operatorname{rge} M(V),\ V|_{\ker A} \succeq 0 \right\},$$
with
$$\operatorname{int}\big(\operatorname{dom} \sigma(\,\cdot \mid D(A,B))\big) = \{\, (X,V) \in \mathbb{R}^{n \times m} \times S^{n} \mid V|_{\ker A} \succ 0 \,\}.$$

The inverse $M(V)^{-1}$ exists when $V|_{\ker A} \succ 0$ and $A$ is surjective.
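As a numerical sanity check of the formula as reconstructed above (not part of the original slides), the following Python sketch evaluates the trace expression with the pseudoinverse of M(V) and compares it against a direct maximization over {Y : AY = B} via a null-space parameterization. All dimensions and data are arbitrary illustrations.

```python
# Minimal sketch: compare the trace formula with a direct evaluation of the sup.
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
n, m, p = 6, 3, 2
A = rng.standard_normal((p, n))                # full row rank (generically)
B = A @ rng.standard_normal((n, m))            # guarantees rge B ⊆ rge A
X = rng.standard_normal((n, m))
V = rng.standard_normal((n, n)); V = V @ V.T + np.eye(n)   # V ≻ 0, so V|ker A ≻ 0

# Trace formula: 1/2 tr([X; B]^T M(V)^† [X; B])
M = np.block([[V, A.T], [A, np.zeros((p, p))]])
XB = np.vstack([X, B])
sigma_formula = 0.5 * np.trace(XB.T @ np.linalg.pinv(M) @ XB)

# Direct evaluation: sup_{AY=B} <X, Y> - 1/2 tr(Y^T V Y), with Y = Y0 + N Z,
# A Y0 = B and the columns of N spanning ker A.
Y0 = np.linalg.lstsq(A, B, rcond=None)[0]
N = null_space(A)
Z = np.linalg.solve(N.T @ V @ N, N.T @ (X - V @ Y0))   # stationarity of the concave quadratic
Y = Y0 + N @ Z
sigma_direct = np.sum(X * Y) - 0.5 * np.trace(Y.T @ V @ Y)

print(sigma_formula, sigma_direct)             # the two values should agree
```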

Relationship to Equality Constrained QP

Consider an equality-constrained QP:
$$\nu(x, V) := \inf_{u \in \mathbb{R}^{n}} \left\{ \tfrac{1}{2} u^{T} V u - x^{T} u \;\middle|\; Au = b \right\}.$$
The Lagrangian is
$$L(u, \lambda) = \tfrac{1}{2} u^{T} V u - x^{T} u + \lambda^{T}(Au - b),$$
and the optimality conditions are
$$Vu + A^{T}\lambda - x = 0, \qquad Au = b.$$
This is equivalent to
$$M(V) \begin{pmatrix} u \\ \lambda \end{pmatrix} = \begin{pmatrix} V & A^{T} \\ A & 0 \end{pmatrix} \begin{pmatrix} u \\ \lambda \end{pmatrix} = \begin{pmatrix} x \\ b \end{pmatrix}.$$
Hence
$$\nu(x, V) = -\sigma\big((x, V) \mid D(A, b)\big).$$
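The sign convention $\nu(x,V) = -\sigma((x,V)\mid D(A,b))$ above is reconstructed from the definitions; the short Python sketch below (not from the slides) checks it on a random instance by solving the KKT system with M(V).

```python
# Minimal sketch: nu(x, V) vs. -sigma((x, V) | D(A, b)) on a random instance.
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)
x = rng.standard_normal(n)
V = rng.standard_normal((n, n)); V = V @ V.T + np.eye(n)   # V ≻ 0

# Solve the KKT system M(V) [u; lam] = [x; b]
M = np.block([[V, A.T], [A, np.zeros((p, p))]])
sol = np.linalg.solve(M, np.concatenate([x, b]))
u, lam = sol[:n], sol[n:]

nu = 0.5 * u @ V @ u - x @ u                   # optimal value of the QP
sigma = 0.5 * np.concatenate([x, b]) @ sol     # 1/2 [x; b]^T M(V)^{-1} [x; b]
print(nu, -sigma)                              # should coincide
```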

Maximum Likelihood Estimation

$$L(\mu, \Sigma; Y) := (2\pi)^{-mN/2}\, |\Sigma|^{-N/2} \prod_{i=1}^{N} \exp\!\big(-\tfrac{1}{2}(y_{i} - \mu)^{T} \Sigma^{-1} (y_{i} - \mu)\big)$$

Up to a constant, the negative log-likelihood is
$$-\ln L(\mu, \Sigma; Y) = \tfrac{N}{2} \ln\det \Sigma + \tfrac{1}{2}\operatorname{tr}\big((Y - M)^{T} \Sigma^{-1} (Y - M)\big) = \sigma\big(((Y - M), \Sigma) \mid D(0,0)\big) - \tfrac{N}{2} \ln\det\big(\Sigma^{-1}\big),$$
where $Y = (y_{1}, \dots, y_{N})$ and $M$ has $\mu$ in each column.

The Matrix Fractional Function: take A = 0 and B = 0, and set
$$\gamma(X, V) := \sigma\big((X, V) \mid D(0,0)\big) = \begin{cases} \tfrac{1}{2}\operatorname{tr}\big(X^{T} V^{\dagger} X\big) & \text{if } \operatorname{rge} X \subseteq \operatorname{rge} V,\ V \succeq 0, \\ +\infty & \text{else.} \end{cases}$$

conv D(A,B)

Recall
$$\partial\sigma_{C}(x) = \{\, y \in \overline{\operatorname{conv}}\, C \mid \langle x, y \rangle = \sigma_{C}(x) \,\}.$$
So for $\partial\sigma\big((X,V) \mid D(A,B)\big)$ we need $\overline{\operatorname{conv}}\,(D(A,B))$.

Set
$$S^{n}_{+}(\ker A) := \{\, W \in S^{n} \mid u^{T} W u \ge 0\ \ \forall\, u \in \ker A \,\} = \{\, W \mid W|_{\ker A} \succeq 0 \,\}.$$
Then $S^{n}_{+}(\ker A)$ is a closed convex cone whose polar is given by
$$S^{n}_{+}(\ker A)^{\circ} = \{\, W \in S^{n} \mid W = PWP \preceq 0 \,\},$$
where $P$ is the orthogonal projection onto $\ker A$.

For $D(A,B) := \{\, (Y, -\tfrac{1}{2} Y Y^{T}) \in \mathbb{R}^{n \times m} \times S^{n} \mid Y \in \mathbb{R}^{n \times m} :\ AY = B \,\}$,
$$\overline{\operatorname{conv}}\, D(A,B) = \Omega(A,B) := \left\{ (Y, W) \in \mathbb{R}^{n \times m} \times S^{n} \;\middle|\; AY = B \text{ and } \tfrac{1}{2} Y Y^{T} + W \in S^{n}_{+}(\ker A)^{\circ} \right\}.$$

Applications

Quadratic Optimization Problems with Equality Constraints
The Matrix Fractional Function and its Generalization
Ky Fan (p,k) Norms
K-means Clustering
Best Affine Unbiased Estimator
Supervised Representation Learning
Multi-task Learning
Variational Gram Functions

Motivating Examples

Recall the matrix fractional function
$$\gamma(X, V) := \sigma\big((X, V) \mid D(0,0)\big) = \begin{cases} \tfrac{1}{2}\operatorname{tr}\big(X^{T} V^{\dagger} X\big) & \text{if } \operatorname{rge} X \subseteq \operatorname{rge} V,\ V \succeq 0, \\ +\infty & \text{else.} \end{cases}$$

We have the following two representations of the nuclear norm:
$$\|X\|_{*} = \min_{V}\ \gamma(X, V) + \tfrac{1}{2}\operatorname{tr} V, \qquad\quad \tfrac{1}{2}\|X\|_{*}^{2} = \min_{V}\ \gamma(X, V) + \delta\big(V \mid \operatorname{tr}(V) \le 1\big).$$
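A small check of the first representation (not from the slides): for the standard choice $V = (XX^{T})^{1/2}$, which we assume here attains the minimum, one gets $\gamma(X,V) = \tfrac{1}{2}\operatorname{tr} V$, so the objective evaluates to $\operatorname{tr} V = \|X\|_{*}$.

```python
# Sketch: evaluate gamma(X, V) + 1/2 tr V at V = (X X^T)^{1/2} and compare with ||X||_*.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 5))                # X X^T is full rank here
V = np.real(sqrtm(X @ X.T))                    # (X X^T)^{1/2}, PSD with rge X ⊆ rge V

gamma = 0.5 * np.trace(X.T @ np.linalg.pinv(V) @ X)
print(gamma + 0.5 * np.trace(V))               # matrix fractional representation
print(np.linalg.svd(X, compute_uv=False).sum())  # nuclear norm ||X||_*
```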

Infimal Projections

For a closed proper convex function h, define the infimal projection
$$\varphi(X) := \inf_{V}\ \sigma\big((X, V) \mid D(A,B)\big) + h(V).$$

Theorem. If $\operatorname{dom} h \cap S^{n}_{++}(\ker A) \neq \emptyset$, then
$$\varphi^{*}(Y) = \inf\ \{\, h^{*}(-W) \mid (Y, W) \in \overline{\operatorname{conv}}\, D(A,B) \,\}.$$

Infimal Projections with Indicators

When h is the indicator of a closed convex set $\mathcal{V}$, then
$$\varphi_{\mathcal{V}}(X) := \inf_{V \in \mathcal{V}}\ \sigma\big((X, V) \mid D(A,B)\big),$$
$$\varphi_{\mathcal{V}}^{*}(Y) = \tfrac{1}{2}\, \sigma\big(Y Y^{T} \mid \{\, V \in \mathcal{V} \mid V|_{\ker A} \succeq 0 \,\}\big) + \delta\big(Y \mid AY = B\big) = \tfrac{1}{2}\, \sigma\big(Y Y^{T} \mid \mathcal{V} \cap S^{n}_{+}(\ker A)\big) + \delta\big(Y \mid AY = B\big).$$

Note that when B = 0, both $\varphi_{\mathcal{V}}$ and $\varphi_{\mathcal{V}}^{*}$ are positively homogeneous of degree 2. When A = 0 and B = 0, $\varphi_{\mathcal{V}}^{*}$ is called a variational Gram function in Jalali-Xiao-Fazel (2016).

Ky Fan (p,k) Norm

For $p \ge 1$ and $1 \le k \le \min\{m, n\}$, the Ky Fan (p,k)-norm of a matrix $X \in \mathbb{R}^{n \times m}$ is given by
$$\|X\|_{p,k} = \left( \sum_{i=1}^{k} \sigma_{i}^{p} \right)^{1/p},$$
where the $\sigma_{i}$ are the singular values of X sorted in nonincreasing order.

The Ky Fan (p, min{m,n})-norm is the Schatten-p norm. The Ky Fan (1,k)-norm is the standard Ky Fan k-norm.

Corollary.
$$\tfrac{1}{2}\|X\|^{2}_{\frac{2p}{p+1},\,\min\{m,n\}} = \inf_{\|V\|_{p,\min\{m,n\}} \le 1} \gamma(X, V).$$
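For concreteness, here is a direct implementation of the definition (the helper name is ours, not from the talk), together with two special cases as a check.

```python
# A direct implementation of the Ky Fan (p,k)-norm from the definition.
import numpy as np

def ky_fan_pk_norm(X, p, k):
    """( sum_{i<=k} sigma_i(X)^p )^(1/p), singular values in nonincreasing order."""
    s = np.linalg.svd(X, compute_uv=False)     # returned in nonincreasing order
    return (s[:k] ** p).sum() ** (1.0 / p)

X = np.random.default_rng(3).standard_normal((4, 6))
r = min(X.shape)
print(ky_fan_pk_norm(X, 2, r), np.linalg.norm(X, 'fro'))   # Schatten-2 = Frobenius
print(ky_fan_pk_norm(X, 1, 1), np.linalg.norm(X, 2))       # Ky Fan (1,1) = spectral norm
```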

Ky Fan (p,k) Norm

Corollary.
$$\tfrac{1}{2}\|X\|^{2}_{\frac{2p}{p+1},\,\min\{m,n\}} = \inf_{\|V\|_{p,\min\{m,n\}} \le 1} \gamma(X, V).$$

Proof. Recall $\gamma(X, V) = \sigma((X, V) \mid D(0,0))$, $\varphi_{\mathcal{V}}(X) := \inf_{V \in \mathcal{V}} \sigma((X, V) \mid D(A,B))$, and set $\gamma_{\mathcal{V}}(X) := \inf_{V \in \mathcal{V}} \sigma((X, V) \mid D(0,0))$. With $\mathcal{V} = \{\, V \mid \|V\|_{p,\min\{m,n\}} \le 1 \,\}$,
$$\gamma_{\mathcal{V}}^{*}(X) = \sigma\big(\tfrac{1}{2} X X^{T} \mid \{\, V \succeq 0 \mid \|V\|_{p,\min\{m,n\}} \le 1 \,\}\big) = \tfrac{1}{2}\|X X^{T}\|_{\frac{p}{p-1},\,\min\{m,n\}} = \tfrac{1}{2}\|X\|^{2}_{\frac{2p}{p-1},\,\min\{m,n\}},$$
which is the conjugate of $\tfrac{1}{2}\|X\|^{2}_{\frac{2p}{p+1},\,\min\{m,n\}}$; taking conjugates again gives the result.

Ky Fan (p,k) Norm

As a special case, when p = 1:

Corollary.
$$\tfrac{1}{2}\|X\|_{*}^{2} = \min_{\operatorname{tr} V \le 1} \gamma(X, V).$$

Lemma. Let $\mathcal{V}$ be the set of rank-k orthogonal projection matrices, $\mathcal{V} = \{\, U U^{T} \mid U \in \mathbb{R}^{n \times k},\ U^{T} U = I_{k} \,\}$. Then
$$\tfrac{1}{2}\|X\|^{2}_{2,k} = \sigma\big(\tfrac{1}{2} X X^{T} \mid \mathcal{V}\big).$$

Proof. A consequence of the following fact [Fillmore-Williams 1971]:
$$\operatorname{conv}\{\, U U^{T} \mid U \in \mathbb{R}^{n \times k},\ U^{T} U = I_{k} \,\} = \{\, V \in S^{n} \mid I \succeq V \succeq 0,\ \operatorname{tr} V = k \,\}.$$

K-means Clustering [Zha-He-Ding-Gu-Simon 2001]

Given $X \in \mathbb{R}^{n \times m}$, the k-means objective is
$$K(X) := \min_{C,E}\ \tfrac{1}{2}\|X - EC\|_{2}^{2},$$
where $C \in \mathbb{R}^{k \times m}$ represents the k centers and $E$ is an $n \times k$ assignment matrix, each of whose rows is one of $e_{1}^{T}, \dots, e_{k}^{T}$, corresponding to the k cluster assignments.

The optimal C is given by $C = (E^{T}E)^{-1}E^{T}X$. Define $P_{E} := E(E^{T}E)^{-1}E^{T}$; then $P_{E}$ is an orthogonal projection and
$$K(X) = \min_{E}\ \tfrac{1}{2}\|(I - P_{E})X\|_{2}^{2} = \tfrac{1}{2}\min_{E}\ \operatorname{tr}\big((I - P_{E})X X^{T}\big) \ \ge\ \tfrac{1}{2}\|X\|_{2}^{2} - \sigma_{\mathcal{P}_{k}}\big(\tfrac{1}{2} X X^{T}\big) = \tfrac{1}{2}\big(\|X\|_{2}^{2} - \|X\|_{2,k}^{2}\big),$$
where $\mathcal{P}_{k}$ denotes the set of rank-k orthogonal projection matrices.
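The Python sketch below (not from the slides) compares the k-means objective of one arbitrary assignment with the spectral lower bound $\tfrac{1}{2}(\|X\|_{2}^{2} - \|X\|_{2,k}^{2})$; the data and the assignment are illustrative only, not the k-means optimum.

```python
# Sketch: spectral lower bound on the k-means objective vs. one assignment's objective.
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 30, 2, 3
X = rng.standard_normal((n, m))

# Arbitrary assignment matrix E (rows are unit vectors e_j^T)
labels = rng.integers(0, k, size=n)
E = np.eye(k)[labels]                           # n x k one-hot rows
C = np.linalg.lstsq(E, X, rcond=None)[0]        # optimal centers (E^T E)^{-1} E^T X
obj = 0.5 * np.linalg.norm(X - E @ C, 'fro') ** 2

s = np.linalg.svd(X, compute_uv=False)
bound = 0.5 * (np.linalg.norm(X, 'fro') ** 2 - (s[:k] ** 2).sum())
print(obj, bound)                               # obj >= bound for any assignment
```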

Best Affine Unbiased Estimator

For a linear regression model $y = A^{T}\beta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^{2} V)$, and a given matrix B, an affine unbiased estimator of $B^{T}\beta$ is an estimator of the form $\hat\theta = X^{T} y + c$ satisfying $\mathbb{E}\hat\theta = B^{T}\beta$. Best means $\operatorname{Var}(\hat\theta^{*}) \preceq \operatorname{Var}(\hat\theta)$ for every affine unbiased estimator $\hat\theta$.

If a solution $X^{*}$ of
$$v(A, B, V) := \min_{X:\ AX = B}\ \tfrac{1}{2}\operatorname{tr}\big(X^{T} V X\big)$$
exists and is unique, then $\hat\theta^{*} = (X^{*})^{T} y$. Moreover,
$$v(A, B, V) = -\sigma\big((0, V) \mid D(A,B)\big),$$
and the optimal solution $X^{*}$ satisfies
$$M(V) \begin{pmatrix} X^{*} \\ W \end{pmatrix} = \begin{pmatrix} 0 \\ B \end{pmatrix}$$
for some multiplier matrix $W$.
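A small sketch (not from the slides) computing $X^{*}$ from the block system and comparing it with the classical generalized-least-squares form of the BLUE weights, which we assume here purely for the comparison; A surjective and $V \succ 0$ are assumed so that M(V) is invertible.

```python
# Sketch: BLUE weights X* from M(V) [X*; W] = [0; B].
import numpy as np

rng = np.random.default_rng(5)
p, n, q = 3, 8, 2                               # beta in R^3, y in R^8, estimate B^T beta in R^2
A = rng.standard_normal((p, n))                 # model: y = A^T beta + eps, Cov(eps) = sigma^2 V
B = rng.standard_normal((p, q))
V = rng.standard_normal((n, n)); V = V @ V.T + np.eye(n)

M = np.block([[V, A.T], [A, np.zeros((p, p))]])
rhs = np.vstack([np.zeros((n, q)), B])
sol = np.linalg.solve(M, rhs)
X_star, W = sol[:n], sol[n:]

print(np.allclose(A @ X_star, B))               # unbiasedness: A X* = B
# Classical closed form of the BLUE/GLS weights, assumed here for comparison:
X_gls = np.linalg.solve(V, A.T) @ np.linalg.solve(A @ np.linalg.solve(V, A.T), B)
print(np.allclose(X_star, X_gls))
```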

Supervised Representation Learning

Consider a binary classification problem where we are given training data $(x_{1}, y_{1}), \dots, (x_{n}, y_{n}) \in \mathbb{R}^{m} \times \{-1, 1\}$ and test data $x_{n+1}, \dots, x_{n+t} \in \mathbb{R}^{m}$. Representation learning aims to learn a feature mapping $\Phi : \mathbb{R}^{m} \to \mathcal{H}$ that maps the data points to a feature space in which the two classes are well separated.

Kernel methods: instead of specifying the function $\Phi$ explicitly, kernel methods map the data points to a reproducing kernel Hilbert space $\mathcal{H}$ so that the kernel matrix $K \in S^{n+t}_{+}$, where $K_{ij} = \langle \Phi(x_{i}), \Phi(x_{j}) \rangle$, implicitly determines the mapping $\Phi$.

Supervised Representation Learning

Let $\mathcal{K} \subseteq S^{n+t}_{+}$ be a set of candidate kernels. The best $K \in \mathcal{K}$ can be selected by maximizing its alignment with the kernel specified by the training labels:
$$\varphi_{\mathcal{V}}^{*}(y) = \max_{K \in \mathcal{K}}\ \tfrac{1}{2}\big\langle K_{1:n,1:n},\ y y^{T} \big\rangle,$$
where
$$\mathcal{V} = \mathcal{K} \subseteq S^{n+t}_{+}, \qquad A = \begin{pmatrix} 0_{n \times n} & 0 \\ 0 & I_{t \times t} \end{pmatrix}, \qquad B = 0_{(n+t) \times 1}.$$

Multi-task Learning

In multi-task learning, T sets of labelled training data $(x_{t1}, y_{t1}), \dots, (x_{tn}, y_{tn}) \in \mathbb{R}^{m} \times \mathbb{R}$ are given, representing T learning tasks.

Assumption: a linear feature map $h_{i}(x) = \langle u_{i}, x \rangle$, $i = 1, \dots, m$, where $U = (u_{1}, \dots, u_{m})$ is an $m \times m$ orthogonal matrix, and the predictor for each task is $f_{t}(x) := \langle a_{t}, h(x) \rangle$.

The multi-task learning problem is then
$$\min_{A, U}\ \sum_{t=1}^{T} \sum_{i=1}^{n} L_{t}\big(y_{ti}, \langle a_{t}, U^{T} x_{ti} \rangle\big) + \mu \|A\|_{2,1}^{2},$$
where $A = (a_{1}, \dots, a_{T})$, $\|A\|_{2,1}^{2}$ is the square of the sum of the 2-norms of the rows of A, and $L_{t}$ is a loss function for each task.

Setting $W = UA$, the nonconvex problem is equivalent to the following convex problem [Argyriou-Evgeniou-Pontil 2006]:
$$\min_{W, D}\ \sum_{t=1}^{T} \sum_{i=1}^{n} L_{t}\big(y_{ti}, \langle w_{t}, x_{ti} \rangle\big) + 2\mu\, \gamma(W, D) \quad \text{s.t. } \operatorname{tr} D \le 1,$$
which in turn is equivalent to
$$\min_{W}\ \sum_{t=1}^{T} \sum_{i=1}^{n} L_{t}\big(y_{ti}, \langle w_{t}, x_{ti} \rangle\big) + \mu \|W\|_{*}^{2}.$$
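To connect the last two formulations numerically, the sketch below (not from the slides) evaluates $2\mu\,\gamma(W,D)$ at the choice $D = (WW^{T})^{1/2}/\operatorname{tr}(WW^{T})^{1/2}$, assumed here to be the minimizer over $\operatorname{tr} D \le 1$, and compares it with $\mu\|W\|_{*}^{2}$.

```python
# Sketch: the variational penalty at D = S / tr S, S = (W W^T)^{1/2}, vs. mu * ||W||_*^2.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(6)
W = rng.standard_normal((4, 7))                 # feature-by-task weight matrix
mu = 0.3

S = np.real(sqrtm(W @ W.T))
D = S / np.trace(S)                             # feasible: D ⪰ 0, tr D = 1
gamma = 0.5 * np.trace(W.T @ np.linalg.pinv(D) @ W)

print(2 * mu * gamma)
print(mu * np.linalg.svd(W, compute_uv=False).sum() ** 2)   # mu * ||W||_*^2
```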

Thank you!

References

J. V. Burke and T. Hoheisel: Matrix Support Functionals for Inverse Problems, Regularization, and Learning. SIAM Journal on Optimization, 25(2):1135-1159, 2015.