Variational Functions of Gram Matrices: Convex Analysis and Applications

1 Variational Functions of Gram Matrices: Convex Analysis and Applications. Maryam Fazel, University of Washington. Joint work with: Amin Jalali (UW), Lin Xiao (Microsoft Research). IMA Workshop on Resource Tradeoffs: Computation, Communication, Information, May 2016

4 For vectors x_1, ..., x_m ∈ R^n and a compact set 𝓜 ⊂ S^m, define the variational Gram function (VGF)

Ω_𝓜(x_1, ..., x_m) = max_{M ∈ 𝓜} Σ_{i,j=1}^m M_ij x_i^T x_j

Let X = [x_1 ... x_m]; the pairwise inner products x_i^T x_j are the entries of the Gram matrix X^T X, so

Ω_𝓜(X) = max_{M ∈ 𝓜} ⟨X^T X, M⟩ = max_{M ∈ 𝓜} tr(X M X^T)

a.k.a. the support function of the set 𝓜, evaluated at X^T X (recall the support function of a set 𝓜: S_𝓜(Y) = max_{M ∈ 𝓜} ⟨Y, M⟩)

6 Examples: box. 𝓜 = {M : −M̄_ij ≤ M_ij ≤ M̄_ij}

Ω(X) = max_{|M_ij| ≤ M̄_ij} Σ_{i,j=1}^m M_ij x_i^T x_j = Σ_{i,j=1}^m M̄_ij |x_i^T x_j|   (...more on this later)

box with n = 1: Ω(x) = |x|^T M̄ |x|
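The box example above is easy to check numerically. A minimal NumPy sketch (random data of my own choosing, not from the talk) verifying that the closed form Σ_ij M̄_ij |x_i^T x_j| upper-bounds every feasible M and is attained at M_ij = M̄_ij sign(x_i^T x_j):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X = rng.standard_normal((n, m))           # columns x_1, ..., x_m
M_bar = np.abs(rng.standard_normal((m, m)))
M_bar = (M_bar + M_bar.T) / 2             # symmetric nonnegative weights

G = X.T @ X                               # Gram matrix, G_ij = x_i^T x_j

# closed form: the max over |M_ij| <= M_bar_ij is attained at
# M_ij = M_bar_ij * sign(G_ij), giving sum_ij M_bar_ij * |G_ij|
omega = np.sum(M_bar * np.abs(G))

# sanity check against random feasible M's: none beats the closed form
for _ in range(1000):
    M = M_bar * rng.uniform(-1, 1, size=(m, m))
    assert np.sum(M * G) <= omega + 1e-12
```

The check is deterministic: for any feasible M, Σ M_ij G_ij ≤ Σ M̄_ij |G_ij| term by term.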

7 Outline: motivating applications; convex analysis of VGFs: convexity, conjugate, subdifferential; optimization algorithms for regularized loss minimization min_X L(X) + λ Ω(X); application to a hierarchical classification problem

11 Applications: a machine learning application: hierarchical classification vs. flat classification; three classes; recursive labeling; x_child ⊥ x_parent. Classifiers at different levels use different features (or different combinations of features); classifiers are desired to be orthogonal to their parent classifiers (hierarchy via orthogonal transfer [Zhou, Xiao, Wu '11]). Encourage x_l ⊥ x_f and x_b ⊥ x_f: Ω(x_m, x_f, x_l, x_b) = w_1 |x_l^T x_f| + w_2 |x_b^T x_f|. Other transfer learning methods: e.g., [Cai, Hoffman '04; Dekel et al. '04]

12 Application: multitask learning. Learn multiple tasks simultaneously using shared information among tasks, e.g. structural assumptions on a matrix of classifiers X = [x_1 x_2 ... x_7]. [figure: sparsity pattern of X, with tasks as columns and features as rows]

13 Promoting pairwise structure. For x_1, ..., x_m ∈ R^n, the inner products x_i^T x_j reveal information about relative orientations; they can serve as a measure for properties such as orthogonality. E.g., minimizing

Ω(x_1, ..., x_m) = Σ_{i,j=1}^m M̄_ij |x_i^T x_j|

promotes pairwise orthogonality for certain pairs, specified by M̄. [Zhou, Xiao, Wu '11] introduced this penalty for hierarchical classification.

15 Promoting pairwise structure. Ω(x_1, ..., x_m) = Σ_{i,j} M̄_ij |x_i^T x_j| — when is it convex?

Theorem (Zhou, Xiao, Wu '11). Ω is convex if M̄ ≥ 0 and M̃, the comparison matrix of M̄, is PSD, where M̃_ij = −M̄_ij for i ≠ j and M̃_ii = M̄_ii. The condition is also necessary if n ≥ m − 1.

Proof: brute force (verify the definition of convexity). Question: when is a general VGF convex?
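The comparison-matrix test above is straightforward to implement. A small sketch (the weight matrices are illustrative choices of mine, not from the talk):

```python
import numpy as np

def comparison_matrix(M_bar):
    """M_tilde: keep the diagonal of M_bar, negate the off-diagonal entries."""
    M_t = -np.abs(M_bar).astype(float)
    np.fill_diagonal(M_t, np.diag(M_bar))
    return M_t

def is_convex_box_vgf(M_bar, tol=1e-10):
    """Zhou-Xiao-Wu test: the comparison matrix being PSD is sufficient
    for convexity of the box VGF, and also necessary when n >= m - 1."""
    M_t = comparison_matrix(M_bar)
    return np.min(np.linalg.eigvalsh(M_t)) >= -tol

# diagonally dominant weights -> PSD comparison matrix -> convex
M1 = np.array([[2.0, 1.0], [1.0, 2.0]])
# off-diagonal weight too large -> comparison matrix indefinite
M2 = np.array([[1.0, 3.0], [3.0, 1.0]])
print(is_convex_box_vgf(M1), is_convex_box_vgf(M2))  # -> True False
```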

18 Convexity. Given a compact (not necessarily convex) set 𝓜:

Theorem (Jalali, F., Xiao). Ω(X) = max_{M ∈ 𝓜} tr(X M X^T) is convex if and only if for every X there exists a PSD M ∈ 𝓜 satisfying Ω(X) = tr(X M X^T).

Corollary: when Ω is convex, √Ω is a pointwise max of weighted Frobenius norms:

√Ω(X) = max_{M ∈ 𝓜 ∩ S_+} ‖X M^{1/2}‖_F

But when is the condition satisfied?

19 Convexity. Polytope: 𝓜 = conv{M_1, ..., M_p}. Let 𝓜_eff be the smallest subset satisfying

max_{M ∈ 𝓜} tr(X M X^T) = max_{M ∈ 𝓜_eff} tr(X M X^T)

Theorem. If 𝓜 is a polytope, Ω is convex if and only if 𝓜_eff ⊂ S^m_+.

[figure: gray = set 𝓜; red = maximal points w.r.t. the PSD cone; green = 𝓜_eff] Convexity test: check whether the green vertices are PSD.

20 Convexity examples. For 𝓜 = {M : |M_ij| ≤ M̄_ij}, Ω(X) = Σ_{i,j} M̄_ij |x_i^T x_j|, and 𝓜_eff ⊂ {M : M_ii = M̄_ii, |M_ij| = M̄_ij for i ≠ j}. If n ≥ m − 1, 𝓜_eff ⊂ S^m_+ is equivalent to: the comparison matrix of M̄ is PSD.

23 Convexity. Squared norm ‖x‖² for x ∈ R^m: a convex VGF corresponding to 𝓜 = {u u^T : ‖u‖ ≤ 1}. For 𝓜 = {M : Σ_{i,j=1}^m (M_ij / M̄_ij)² ≤ 1},

Ω(X) = ( Σ_{i,j=1}^m M̄_ij² (x_i^T x_j)² )^{1/2}

M̄_ij ≥ 0 ensures convexity (proof by examining 𝓜_eff)
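The ellipsoid example admits its closed form via Cauchy–Schwarz: the maximizing M is proportional to M̄² ⊙ G. A quick numerical sketch (random data, my own setup) checking feasibility and attainment:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3
X = rng.standard_normal((n, m))
M_bar = np.abs(rng.standard_normal((m, m))) + 0.1
M_bar = (M_bar + M_bar.T) / 2             # M_bar_ij >= 0

G = X.T @ X
# closed form from the slide: Omega(X) = sqrt(sum_ij M_bar_ij^2 * (x_i^T x_j)^2)
omega = np.sqrt(np.sum((M_bar * G) ** 2))

# the maximizer over {M : sum (M_ij / M_bar_ij)^2 <= 1} is
# M_ij = M_bar_ij^2 * G_ij / omega  (Cauchy-Schwarz is tight there)
M_star = M_bar ** 2 * G / omega
assert np.sum((M_star / M_bar) ** 2) <= 1 + 1e-9   # feasible
assert abs(np.sum(M_star * G) - omega) < 1e-9      # attains the value
```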

26 Conjugate Function. The conjugate function of Ω(X) = max_{M ∈ 𝓜} tr(X M X^T) is

Ω*(Y) = (1/2) inf_{M ∈ 𝓜} { tr(Y M† Y^T) : range(Y^T) ⊆ range(M) }
      = (1/2) inf { tr(C) : [M, Y^T; Y, C] ⪰ 0, M ∈ 𝓜 }

and is semidefinite representable. The dual norm (if the M's are invertible):

√(2 Ω*(X)) = inf_{M ∈ 𝓜} ‖X M^{−1/2}‖_F

special case: 𝓜 = {M : αI ⪯ M ⪯ βI, tr(M) = γ} gives the cluster norm defined by [Jacob, Bach, Vert '08]

27 Proximal Operator. The proximal operator is prox_{τh}(x) = argmin_u h(u) + (1/(2τ)) ‖u − x‖²₂. prox_{τΩ}(X) can be computed by an SDP:

argmin_Y min_{M,C} ‖Y − X‖²₂ + τ tr(C)  subject to  [M, Y^T; Y, C] ⪰ 0, M ∈ 𝓜

28 Proximal Operator. Use the QR decomposition X = Q R_X to get an SDP of size 2m:

argmin_R min_{M,C} ‖R − R_X‖²₂ + τ tr(C)  subject to  [M, R^T; R, C] ⪰ 0, M ∈ 𝓜

and by the Moreau decomposition, prox_{τ Ω(·)}(X) = X − prox_{(τ Ω)*(·)}(X)
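As a sanity check on the prox machinery, the simplest convex VGF — 𝓜 = {I}, so Ω(X) = tr(X I X^T) = ‖X‖²_F — has the closed-form prox X / (1 + 2τ), which a generic numerical solver reproduces. A sketch with made-up data (for a general 𝓜 one would solve the SDP above with an SDP solver instead):

```python
import numpy as np
from scipy.optimize import minimize

tau = 0.7
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))

def objective(u_flat):
    # Omega(U) + ||U - X||^2 / (2*tau) with Omega(U) = ||U||_F^2
    U = u_flat.reshape(X.shape)
    return np.sum(U ** 2) + np.sum((U - X) ** 2) / (2 * tau)

res = minimize(objective, X.ravel())
prox_numeric = res.x.reshape(X.shape)
prox_closed = X / (1 + 2 * tau)          # closed form for Omega = ||.||_F^2
print(np.allclose(prox_numeric, prox_closed, atol=1e-4))  # -> True
```

The closed form follows from the first-order condition 2u + (u − x)/τ = 0, i.e. u = x / (1 + 2τ).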

29 Subdifferential. Ω(X) = max_{M ∈ 𝓜} tr(X M X^T) = max_{M ∈ 𝓜} Σ_{i,j} M_ij x_i^T x_j. Subdifferential:

∂Ω(X) = 2 conv{ XM : M ∈ 𝓜, tr(X M X^T) = Ω(X) }

Example: for Ω(X) = Σ_{i,j} M̄_ij |x_i^T x_j|,

∂Ω(X) = 2 conv{ XM : M_ij = M̄_ij sign(x_i^T x_j) if x_i^T x_j ≠ 0, |M_ij| ≤ M̄_ij otherwise }

([Zhou et al. '11] give just one subgradient)
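Where all x_i^T x_j are nonzero the box VGF is differentiable and the formula above collapses to the single gradient 2XM. A small finite-difference sketch (random data, my own choosing) confirming this:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X = rng.standard_normal((n, m))
M_bar = np.abs(rng.standard_normal((m, m)))
M_bar = (M_bar + M_bar.T) / 2

def omega(X):
    return np.sum(M_bar * np.abs(X.T @ X))

# at a generic X all x_i^T x_j are nonzero, so the subdifferential is
# the single gradient 2 X M with M_ij = M_bar_ij * sign(x_i^T x_j)
M = M_bar * np.sign(X.T @ X)
grad_formula = 2 * X @ M

# central finite differences
eps = 1e-6
grad_fd = np.zeros_like(X)
for idx in np.ndindex(X.shape):
    E = np.zeros_like(X)
    E[idx] = eps
    grad_fd[idx] = (omega(X + E) - omega(X - E)) / (2 * eps)

assert np.max(np.abs(grad_formula - grad_fd)) < 1e-4
```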

30 Outline: motivating applications; convex analysis of VGFs: convexity, conjugate, subdifferential; optimization algorithms for regularized loss minimization min_X L(X) + λ Ω(X); application to a hierarchical classification problem

33 Regularized Loss Minimization. J_opt = min_X L(X; data) + λ Ω(X). Common losses: norm loss, Huber loss, hinge, logistic, etc. With a smooth loss (and a cheap prox), iteratively update X^(t):

X^(t+1) = prox_{γ_t λΩ}( X^(t) − γ_t ∇L(X^(t)) ),  t = 0, 1, 2, ...,  γ_t is the step size

When L(X) is not smooth: subgradient-based methods, e.g. Regularized Dual Averaging [Xiao '11]; convergence can be very slow.
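To make the update concrete, here is a minimal proximal-gradient loop on a toy instance of my own (not the talk's experiment): a least-squares loss with the simplest VGF Ω(x) = ‖x‖², whose prox is just a rescaling, so the iterates should reach the closed-form ridge solution:

```python
import numpy as np

# min_x 0.5*||A x - b||^2 + lam*||x||^2, with prox_{tau*||.||^2}(u) = u/(1+2*tau);
# the minimizer is the ridge solution (A^T A + 2*lam*I)^{-1} A^T b
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.3

L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
gamma = 1.0 / L                          # constant step size
x = np.zeros(5)
for _ in range(2000):
    grad = A.T @ (A @ x - b)             # gradient step ...
    x = (x - gamma * grad) / (1 + 2 * gamma * lam)   # ... then prox step

x_ridge = np.linalg.solve(A.T @ A + 2 * lam * np.eye(5), A.T @ b)
print(np.allclose(x, x_ridge, atol=1e-8))  # -> True
```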

37 VGF with Structured Loss Functions. Exploit the smooth variational representation of a VGF:

J_opt = min_X max_{M ∈ 𝓜} L(X; data) + λ tr(X M X^T)

and consider losses with a nice representation (called Fenchel-type):

L(X) = max_{G ∈ 𝒢} ⟨X, D(G)⟩ − L̂(G)

where L̂(·) is convex, 𝒢 is compact, and D(·) is a linear operator. Luckily, this covers many important cases: norm loss, Huber loss, binary and multi-class hinge loss... Then

J_opt = min_X max_{M ∈ 𝓜, G ∈ 𝒢} ⟨X, D(G)⟩ − L̂(G) + λ tr(X M X^T)

— a smooth convex-concave saddle-point problem!

38 Mirror-Prox Algorithm. J_opt = min_X max_{M ∈ 𝓜, G ∈ 𝒢} ⟨X, D(G)⟩ − L̂(G) + λ tr(X M X^T). Setup: find the saddle points of smooth convex-concave functions min_{x ∈ X} max_{y ∈ Y} f(x, y). Mirror-prox [Nemirovski '04]: O(1/t) convergence; O(1/t²) convergence if strongly convex; useful if the projection (or prox) is cheap; often one can preprocess to reduce to a smaller dimension.
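In the Euclidean setup, mirror-prox reduces to the extragradient method: take a gradient step, then update using gradients evaluated at the extrapolated point. A toy sketch of my own on a bilinear saddle problem, where plain simultaneous gradient descent–ascent would not converge:

```python
import numpy as np

# min_x max_y f(x, y) = x^T A y; for invertible A the unique saddle point is (0, 0)
rng = np.random.default_rng(0)
A = np.eye(3) + 0.1 * rng.standard_normal((3, 3))  # well-conditioned coupling
x, y = rng.standard_normal(3), rng.standard_normal(3)
eta = 0.5 / np.linalg.norm(A, 2)                   # step size below 1/Lipschitz

for _ in range(500):
    # 1) extrapolation step
    xh = x - eta * (A @ y)
    yh = y + eta * (A.T @ x)
    # 2) update with gradients taken at the extrapolated point
    x = x - eta * (A @ yh)
    y = y + eta * (A.T @ xh)

assert np.linalg.norm(x) < 1e-6 and np.linalg.norm(y) < 1e-6
```

The full mirror-prox method replaces these Euclidean steps with prox-mappings adapted to the geometry of X and Y.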

40 Experiment: Text Categorization. Text categorization on Reuters Corpus Volume 1, an archive of manually categorized news stories. Part of the category hierarchy: corporate; funding; ownership changes; markets; bonds; acquisition; loans; asset transfer; domestic markets; privatization; market share.

minimize_{X, ξ}  (1/N) Σ_{s=1}^N ξ_s + λ Ω(X)
subject to  x_{z_s}^T y_s − x_j^T y_s ≥ 1 − ξ_s,  for all j ∈ A(z_s),  s ∈ {1, ..., N}

where y_s ∈ R^n are the samples and z_s ∈ {1, ..., m} are the labels, s = 1, ..., N.

41 Sanity check: angles between pairs of classifiers. Left: flat classification; right: hierarchical classification. Red: pairs desired to be orthogonal.

42 Experiment: Text Categorization.

method                      | objective function                        | convergence rate
Subgradient method          | non-smooth, convex                        | O(1/√t)
Regularized Dual Averaging  | non-smooth, strongly convex (σ)           | O(ln(t)/(σt))
Mirror-prox                 | smooth variational form, convex           | O(1/t)
Mirror-prox                 | smooth variational form, strongly convex  | O(1/t²)

Prediction error on test data (from [Zhou, Xiao, Wu '11]): FlatMult 21.39 (.29); HierMult 21.41 (.29); Transfer 21.0 (.31); TreeLoss 26.32 (.39); Orthogonal Transfer 17.46 (.74).

43 Summary, future work. VGFs: functions of the Gram matrix, defined via a weight set 𝓜; unify special cases and lead to new functions; convex analysis (conjugate, subdifferential, prox); efficient algorithms. Future work: design 𝓜 for different applications; other applications: multitask learning (with clustered/diverse tasks), disjoint visual features, ... Reference: A. Jalali, L. Xiao, M. Fazel, Variational Gram Functions: Convex Analysis and Optimization, preliminary draft available at faculty.washington.edu/mfazel

44 Figure: value of the loss function vs. iteration t for the MP and RDA algorithms (log scale)

45 Figure: ‖X_t − X_final‖_F vs. iteration t for the MP and RDA algorithms (log scale)


More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

INTERPLAY BETWEEN C AND VON NEUMANN

INTERPLAY BETWEEN C AND VON NEUMANN INTERPLAY BETWEEN C AND VON NEUMANN ALGEBRAS Stuart White University of Glasgow 26 March 2013, British Mathematical Colloquium, University of Sheffield. C AND VON NEUMANN ALGEBRAS C -algebras Banach algebra

More information

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015 Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative

More information

Convex Optimization in Classification Problems

Convex Optimization in Classification Problems New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1

More information

Sparse Gaussian conditional random fields

Sparse Gaussian conditional random fields Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian

More information

LECTURE 13 LECTURE OUTLINE

LECTURE 13 LECTURE OUTLINE LECTURE 13 LECTURE OUTLINE Problem Structures Separable problems Integer/discrete problems Branch-and-bound Large sum problems Problems with many constraints Conic Programming Second Order Cone Programming

More information

Adaptive discretization and first-order methods for nonsmooth inverse problems for PDEs

Adaptive discretization and first-order methods for nonsmooth inverse problems for PDEs Adaptive discretization and first-order methods for nonsmooth inverse problems for PDEs Christian Clason Faculty of Mathematics, Universität Duisburg-Essen joint work with Barbara Kaltenbacher, Tuomo Valkonen,

More information

Lecture 3: Semidefinite Programming

Lecture 3: Semidefinite Programming Lecture 3: Semidefinite Programming Lecture Outline Part I: Semidefinite programming, examples, canonical form, and duality Part II: Strong Duality Failure Examples Part III: Conditions for strong duality

More information

Lecture 8 : Eigenvalues and Eigenvectors

Lecture 8 : Eigenvalues and Eigenvectors CPS290: Algorithmic Foundations of Data Science February 24, 2017 Lecture 8 : Eigenvalues and Eigenvectors Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Hermitian Matrices It is simpler to begin with

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

A Bahadur Representation of the Linear Support Vector Machine

A Bahadur Representation of the Linear Support Vector Machine A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support

More information

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear

More information

Classification and Support Vector Machine

Classification and Support Vector Machine Classification and Support Vector Machine Yiyong Feng and Daniel P. Palomar The Hong Kong University of Science and Technology (HKUST) ELEC 5470 - Convex Optimization Fall 2017-18, HKUST, Hong Kong Outline

More information

NOTES ON SOME EXERCISES OF LECTURE 5, MODULE 2

NOTES ON SOME EXERCISES OF LECTURE 5, MODULE 2 NOTES ON SOME EXERCISES OF LECTURE 5, MODULE 2 MARCO VITTURI Contents 1. Solution to exercise 5-2 1 2. Solution to exercise 5-3 2 3. Solution to exercise 5-7 4 4. Solution to exercise 5-8 6 5. Solution

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We

More information

Application: Can we tell what people are looking at from their brain activity (in real time)? Gaussian Spatial Smooth

Application: Can we tell what people are looking at from their brain activity (in real time)? Gaussian Spatial Smooth Application: Can we tell what people are looking at from their brain activity (in real time? Gaussian Spatial Smooth 0 The Data Block Paradigm (six runs per subject Three Categories of Objects (counterbalanced

More information

Convex Optimization Theory. Athena Scientific, Supplementary Chapter 6 on Convex Optimization Algorithms

Convex Optimization Theory. Athena Scientific, Supplementary Chapter 6 on Convex Optimization Algorithms Convex Optimization Theory Athena Scientific, 2009 by Dimitri P. Bertsekas Massachusetts Institute of Technology Supplementary Chapter 6 on Convex Optimization Algorithms This chapter aims to supplement

More information

Learning with Noisy Labels. Kate Niehaus Reading group 11-Feb-2014

Learning with Noisy Labels. Kate Niehaus Reading group 11-Feb-2014 Learning with Noisy Labels Kate Niehaus Reading group 11-Feb-2014 Outline Motivations Generative model approach: Lawrence, N. & Scho lkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Recap from previous lecture

Recap from previous lecture Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience

More information

Support Vector and Kernel Methods

Support Vector and Kernel Methods SIGIR 2003 Tutorial Support Vector and Kernel Methods Thorsten Joachims Cornell University Computer Science Department tj@cs.cornell.edu http://www.joachims.org 0 Linear Classifiers Rules of the Form:

More information

Stochastic Composition Optimization

Stochastic Composition Optimization Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Adversarial Surrogate Losses for Ordinal Regression

Adversarial Surrogate Losses for Ordinal Regression Adversarial Surrogate Losses for Ordinal Regression Rizal Fathony, Mohammad Bashiri, Brian D. Ziebart (Illinois at Chicago) August 22, 2018 Presented by: Gregory P. Spell Outline 1 Introduction Ordinal

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization

More information

Homework 5. Convex Optimization /36-725

Homework 5. Convex Optimization /36-725 Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

Recovery of Simultaneously Structured Models using Convex Optimization

Recovery of Simultaneously Structured Models using Convex Optimization Recovery of Simultaneously Structured Models using Convex Optimization Maryam Fazel University of Washington Joint work with: Amin Jalali (UW), Samet Oymak and Babak Hassibi (Caltech) Yonina Eldar (Technion)

More information

Learning Task Grouping and Overlap in Multi-Task Learning

Learning Task Grouping and Overlap in Multi-Task Learning Learning Task Grouping and Overlap in Multi-Task Learning Abhishek Kumar Hal Daumé III Department of Computer Science University of Mayland, College Park 20 May 2013 Proceedings of the 29 th International

More information

Manifold Regularization

Manifold Regularization Manifold Regularization Vikas Sindhwani Department of Computer Science University of Chicago Joint Work with Mikhail Belkin and Partha Niyogi TTI-C Talk September 14, 24 p.1 The Problem of Learning is

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course otes for EE7C (Spring 018): Conve Optimization and Approimation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Ma Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

Dual Decomposition.

Dual Decomposition. 1/34 Dual Decomposition http://bicmr.pku.edu.cn/~wenzw/opt-2017-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/34 1 Conjugate function 2 introduction:

More information