Moreau-Yosida Regularization for Grouped Tree Structure Learning

Jun Liu (Computer Science and Engineering, Arizona State University, J.Liu@asu.edu)
Jieping Ye (Computer Science and Engineering, Arizona State University, Jieping.Ye@asu.edu)

Abstract

We consider the tree structured group Lasso, where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.

1 Introduction

Many machine learning algorithms can be formulated as a penalized optimization problem:

    min_x l(x) + λ φ(x),    (1)

where l(x) is the empirical loss function (e.g., the least squares loss or the logistic loss), λ > 0 is the regularization parameter, and φ(x) is the penalty term. Recently, sparse learning via ℓ1 regularization [20] and its various extensions has received increasing attention in many areas including machine learning, signal processing, and statistics. In particular, the group Lasso [1, 16, 22] utilizes the group information of the features and yields a solution with grouped sparsity. The traditional group Lasso assumes that the groups are non-overlapping. However, in many applications the features may form more complex overlapping groups. Zhao et al. [23] extended the group Lasso to the case of overlapping groups, imposing hierarchical relationships among the features. Jacob et al. [6] considered the group Lasso with overlaps and studied theoretical properties of the estimator. Jenatton et al. [7] considered the consistency property of the structured overlapping group Lasso and designed an active set algorithm.

In many applications, the features can naturally be represented using certain tree structures. For example, the image pixels of the face image shown in Figure 1 can be represented as a tree, where each parent node contains a series of child nodes that enjoy spatial locality; genes/proteins may form certain hierarchical tree structures. Kim and Xing [9] studied the tree structured group Lasso for multi-task learning, where multiple related tasks follow a tree structure. One challenge in the practical application of the tree structured group Lasso is that the resulting optimization problem is much more difficult to solve than Lasso and group Lasso, due to the complex regularization.

Figure 1: Illustration of the tree structure of a two-dimensional face image (panels (a)-(d)). The 64 × 64 image (a) can be divided into 16 sub-images in (b) according to the spatial locality, where the sub-images can be viewed as the child nodes of (a). Similarly, each 16 × 16 sub-image in (b) can be divided into 16 sub-images in (c), and such a process is repeated for the sub-images in (c) to get (d).

Figure 2: A sample index tree for illustration. Root: G^0_1 = {1, 2, 3, 4, 5, 6, 7, 8}. Depth 1: G^1_1 = {1, 2}, G^1_2 = {3, 4, 5, 6}, G^1_3 = {7, 8}. Depth 2: G^2_1 = {1}, G^2_2 = {2}, G^2_3 = {3, 4}, G^2_4 = {5, 6}.

In this paper, we develop an efficient algorithm for the tree structured group Lasso, i.e., the optimization problem (1) with φ(·) being the grouped tree structure regularization (see Equation 2). One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization [17, 21] associated with the grouped tree structure. The main technical contributions of this paper include: (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and the resulting algorithm for the tree structured group Lasso has a time complexity comparable to Lasso and group Lasso, and (2) we develop an efficient algorithm for determining the effective interval for the parameter λ, which is important in the practical application of the algorithm. We have performed experimental studies using the AR and JAFFE face data sets, where the features form a hierarchical tree structure based on the spatial locality as shown in Figure 1. Our experimental results demonstrate the efficiency and effectiveness of the proposed algorithm. Note that, while the present paper was under review, we became aware of a recent work by Jenatton et al. [8], which applied block coordinate ascent in the dual and showed that the algorithm converges in one pass.

2 Grouped Tree Structure Regularization

We begin with the definition of the so-called index tree:

Definition 1. For an index tree T of depth d, we let T_i = {G^i_1, G^i_2, ..., G^i_{n_i}} contain all the node(s) corresponding to depth i, where n_0 = 1, G^0_1 = {1, 2, ..., p}, and n_i ≥ 1 for i = 1, 2, ..., d. The nodes satisfy the following conditions: 1) the nodes from the same depth level have non-overlapping indices, i.e., G^i_j ∩ G^i_k = ∅ for all i = 1, ..., d and j ≠ k, 1 ≤ j, k ≤ n_i; and 2) if G^{i-1}_{j_0} is the parent node of a non-root node G^i_j, then G^i_j ⊆ G^{i-1}_{j_0}.

Figure 2 shows a sample index tree. We can observe that 1) the index sets from different nodes may overlap, e.g., any parent node overlaps with its child nodes; 2) the nodes from the same depth level do not overlap; and 3) the index set of a child node is a subset of that of its parent node.

The grouped tree structure regularization is defined as:

    φ(x) = Σ_{i=0}^{d} Σ_{j=1}^{n_i} w^i_j ||x_{G^i_j}||,    (2)

where x ∈ R^p, w^i_j ≥ 0 (i = 0, 1, ..., d, j = 1, 2, ..., n_i) is the pre-defined weight for the node G^i_j, ||·|| is the Euclidean norm, and x_{G^i_j} is the vector composed of the entries of x with indices in G^i_j.
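To make the penalty (2) concrete, here is a minimal Python sketch (not from the paper) that evaluates φ(x) for the sample index tree of Figure 2; the node encoding, the unit weights, and the 0-based indexing are illustrative choices.

    import numpy as np

    # Index tree of Figure 2 as (depth, 0-based indices, weight w^i_j); all weights set to 1 here.
    nodes = [
        (0, [0, 1, 2, 3, 4, 5, 6, 7], 1.0),                                 # root G^0_1
        (1, [0, 1], 1.0), (1, [2, 3, 4, 5], 1.0), (1, [6, 7], 1.0),         # depth 1
        (2, [0], 1.0), (2, [1], 1.0), (2, [2, 3], 1.0), (2, [4, 5], 1.0),   # depth 2
    ]

    def tree_penalty(x, nodes):
        """phi(x) in (2): weighted sum of Euclidean norms of x restricted to each node."""
        return sum(w * np.linalg.norm(x[idx]) for _, idx, w in nodes)

    x = np.array([0, 0, 0, 0, 1, 1, 0, 0], dtype=float)
    print(tree_penalty(x, nodes))  # only G^2_4, G^1_2 and the root are active: 3*sqrt(2)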

In the next section, we study the Moreau-Yosida regularization [17, 21] associated with (2), develop an analytical solution for such a regularization, propose an efficient algorithm for solving (1), and specify the meaningful interval for the regularization parameter λ.

3 Moreau-Yosida Regularization of φ(·)

The Moreau-Yosida regularization associated with the grouped tree structure regularization φ(·), for a given v ∈ R^p, is given by:

    φ_λ(v) = min_x { f(x) = (1/2) ||x − v||^2 + λ Σ_{i=0}^{d} Σ_{j=1}^{n_i} w^i_j ||x_{G^i_j}|| },    (3)

for some λ > 0. Denote the minimizer of (3) by π_λ(v). The Moreau-Yosida regularization has many useful properties: 1) φ_λ(·) is continuously differentiable despite the fact that φ(·) is non-smooth; 2) π_λ(·) is a non-expansive operator. More properties on the general Moreau-Yosida regularization can be found in [5, 10]. Note that f(·) in (3) is indeed a special case of the problem (1) with l(x) = (1/2)||x − v||^2. Our recent study has shown that the efficient optimization of the Moreau-Yosida regularization is key to many optimization algorithms [13, Section 2]. Next, we focus on the efficient optimization of (3). For convenience of subsequent discussion, we denote λ^i_j = λ w^i_j.

3.1 An Analytical Solution

We show that the minimization of (3) admits an analytical solution. We first present the detailed procedure for finding the minimizer in Algorithm 1.

Algorithm 1 Moreau-Yosida Regularization of the tree structured group Lasso (MY_tgLasso)
Input: v ∈ R^p, the index tree T with nodes G^i_j (i = 0, 1, ..., d, j = 1, 2, ..., n_i) that satisfy Definition 1, the weights w^i_j ≥ 0 (i = 0, 1, ..., d, j = 1, 2, ..., n_i), λ > 0, and λ^i_j = λ w^i_j
Output: u^0 ∈ R^p
1: Set
       u^{d+1} = v,    (4)
2: for i = d down to 0 do
3:   for j = 1 to n_i do
4:     Compute
       u^i_{G^i_j} = 0                                                               if ||u^{i+1}_{G^i_j}|| ≤ λ^i_j,
       u^i_{G^i_j} = ((||u^{i+1}_{G^i_j}|| − λ^i_j) / ||u^{i+1}_{G^i_j}||) u^{i+1}_{G^i_j}    if ||u^{i+1}_{G^i_j}|| > λ^i_j,    (5)
5:   end for
6: end for

In the implementation of the MY_tgLasso algorithm, we only need to maintain a working variable u, which is initialized with v. We then traverse the index tree T in the reverse breadth-first order to update u. At the traversed node G^i_j, we update u_{G^i_j} according to the operation in (5), which reduces the Euclidean norm of u_{G^i_j} by at most λ^i_j. The time complexity of MY_tgLasso is O(Σ_{i=0}^{d} Σ_{j=1}^{n_i} |G^i_j|). By using Definition 1, we have Σ_{j=1}^{n_i} |G^i_j| ≤ p. Therefore, the time complexity of MY_tgLasso is O(pd). If the tree is balanced, i.e., d = O(log p), then the time complexity of MY_tgLasso is O(p log p).

MY_tgLasso can help explain why the structured group sparsity can be induced. Let us analyze the tree given in Figure 2, with the solution denoted by x*. We let w^i_j = 1 for all i, j, λ = √2, and v = [1, 2, 1, 1, 4, 4, 1, 1]^T. After traversing the nodes of depth 2, we can get that the elements of x* with indices in G^2_1 and G^2_3 are zero; when the traversal continues to the nodes of depth 1, the elements of x* with indices in G^1_1 and G^1_3 are set to zero, but those with indices in G^2_4 are still nonzero. Finally, after traversing the root node, we obtain x* = [0, 0, 0, 0, 1, 1, 0, 0]^T.
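For illustration (this sketch is not from the paper), Algorithm 1 can be written in a few lines of Python: traverse the nodes from the deepest level to the root and apply the group-wise shrinkage in (5). On the example above (all w^i_j = 1, λ = √2, v = [1, 2, 1, 1, 4, 4, 1, 1]^T) it returns [0, 0, 0, 0, 1, 1, 0, 0]^T, matching the discussion; the node encoding reuses the list from the sketch after Figure 2.

    import numpy as np

    def my_tglasso(v, nodes, lam):
        """Algorithm 1 (MY_tgLasso): nodes is a list of (depth, 0-based indices, weight w^i_j).
        Returns the minimizer u^0 of the Moreau-Yosida regularization (3)."""
        u = np.array(v, dtype=float)
        for depth, idx, w in sorted(nodes, key=lambda n: -n[0]):  # reverse breadth-first order
            lam_ij = lam * w
            norm = np.linalg.norm(u[idx])
            if norm <= lam_ij:
                u[idx] = 0.0                        # first case of (5)
            else:
                u[idx] *= (norm - lam_ij) / norm    # second case of (5)
        return u

    # Sample index tree of Figure 2 (same encoding as in the previous sketch).
    nodes = [
        (0, [0, 1, 2, 3, 4, 5, 6, 7], 1.0),
        (1, [0, 1], 1.0), (1, [2, 3, 4, 5], 1.0), (1, [6, 7], 1.0),
        (2, [0], 1.0), (2, [1], 1.0), (2, [2, 3], 1.0), (2, [4, 5], 1.0),
    ]
    v = np.array([1, 2, 1, 1, 4, 4, 1, 1], dtype=float)
    print(my_tglasso(v, nodes, lam=np.sqrt(2)))  # -> approximately [0 0 0 0 1 1 0 0]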

Next, we show that MY_tgLasso finds the exact minimizer of (3). The main result is summarized in the following theorem:

Theorem 1. u^0 returned by Algorithm 1 is the unique solution to (3).

Before giving the detailed proof of Theorem 1, we introduce some notation and present several technical lemmas. Define the mapping φ^i_j : R^p → R as

    φ^i_j(x) = ||x_{G^i_j}||.    (6)

We can then express λφ(x), with φ(x) defined in (2), as

    λ φ(x) = Σ_{i=0}^{d} Σ_{j=1}^{n_i} λ^i_j φ^i_j(x).

The subdifferential of f(·) defined in (3) at the point x can be written as:

    ∂f(x) = x − v + Σ_{i=0}^{d} Σ_{j=1}^{n_i} λ^i_j ∂φ^i_j(x),    (7)

where

    ∂φ^i_j(x) = { y ∈ R^p : ||y|| ≤ 1, y_{Ḡ^i_j} = 0 }                                  if x_{G^i_j} = 0,
    ∂φ^i_j(x) = { y ∈ R^p : y_{G^i_j} = x_{G^i_j} / ||x_{G^i_j}||, y_{Ḡ^i_j} = 0 }       if x_{G^i_j} ≠ 0,    (8)

and Ḡ^i_j denotes the complementary set of G^i_j.

Lemma 1. For any 1 ≤ i ≤ d and 1 ≤ j ≤ n_i, we can find a unique path from the node G^i_j to the root node G^0_1. Let the nodes on this path be G^l_{r_l}, for l = 0, 1, ..., i, with r_0 = 1 and r_i = j. We have

    G^i_j ⊆ G^l_{r_l}, l = 0, 1, ..., i − 1,    (9)

    G^i_j ∩ G^l_r = ∅, ∀ r ≠ r_l, l = 1, 2, ..., i − 1, r = 1, 2, ..., n_l.    (10)

Proof: According to Definition 1, we can find a unique path from the node G^i_j to the root node G^0_1. In addition, based on the structure of the index tree, we have (9) and (10).

Lemma 2. For any i = 1, 2, ..., d and j = 1, 2, ..., n_i, we have

    u^{i+1}_{G^i_j} − u^i_{G^i_j} ∈ λ^i_j ( ∂φ^i_j(u^i) )_{G^i_j},    (11)

    ∂φ^i_j(u^i) ⊆ ∂φ^i_j(u^0).    (12)

Proof: We can verify (11) using (5), (6) and (8). For (12), it follows from (6) and (8) that it is sufficient to verify that

    u^0_{G^i_j} = α^i_j u^i_{G^i_j}, for some α^i_j ≥ 0.    (13)

It follows from Lemma 1 that we can find a unique path from G^i_j to G^0_1. Denote the nodes on this path by G^l_{r_l}, where l = 0, 1, ..., i, r_i = j, and r_0 = 1. We first analyze the relationship between u^i and u^{i−1}. If ||u^i_{G^{i−1}_{r_{i−1}}}|| ≤ λ^{i−1}_{r_{i−1}}, we have u^{i−1}_{G^{i−1}_{r_{i−1}}} = 0, which leads to u^{i−1}_{G^i_j} = 0 by using (9). Otherwise, if ||u^i_{G^{i−1}_{r_{i−1}}}|| > λ^{i−1}_{r_{i−1}}, we have

    u^{i−1}_{G^{i−1}_{r_{i−1}}} = ((||u^i_{G^{i−1}_{r_{i−1}}}|| − λ^{i−1}_{r_{i−1}}) / ||u^i_{G^{i−1}_{r_{i−1}}}||) u^i_{G^{i−1}_{r_{i−1}}},

which, restricted to G^i_j and using (9), leads to

    u^{i−1}_{G^i_j} = β^i u^i_{G^i_j}, for some β^i ≥ 0.    (14)

By a similar argument, we have

    u^{l−1}_{G^l_{r_l}} = β^l u^l_{G^l_{r_l}}, β^l ≥ 0, l = 1, 2, ..., i − 1.    (15)

Together with (9), we have

    u^{l−1}_{G^i_j} = β^l u^l_{G^i_j}, β^l ≥ 0, l = 1, 2, ..., i − 1.    (16)

From (14) and (16), (13) holds with α^i_j = Π_{l=1}^{i} β^l. This completes the proof.

We are now ready to prove our main result:

Proof of Theorem 1: It is easy to verify that f(·) defined in (3) is strongly convex; thus it admits a unique minimizer. Our methodology for the proof is to show that

    0 ∈ ∂f(u^0),    (17)

which is the sufficient and necessary condition for u^0 to be the minimizer of f(·).

According to Definition 1, the leaf nodes are non-overlapping. We assume that the union of the leaf nodes equals {1, 2, ..., p}; otherwise, we can add to the index tree additional leaf nodes with weight 0 to satisfy this assumption. Clearly, the original index tree and the new index tree with the additional leaf nodes of weight 0 yield the same penalty φ(·) in (2), the same Moreau-Yosida regularization in (3), and the same solution from Algorithm 1. Therefore, to prove (17), it suffices to show 0 ∈ (∂f(u^0))_{G^i_j} for all the leaf nodes G^i_j. Next, we focus on establishing the following relationship:

    0 ∈ (∂f(u^0))_{G^d_1}.    (18)

It follows from Lemma 1 that we can find a unique path from the node G^d_1 to the root G^0_1. Let the nodes on this path be G^l_{r_l}, for l = 0, 1, ..., d, with r_0 = 1 and r_d = 1. By using (10) of Lemma 1, we can get that the nodes containing the index set G^d_1 are exactly those on the aforementioned path. In other words, for all x we have

    ( ∂φ^l_r(x) )_{G^d_1} = {0}, ∀ r ≠ r_l, l = 1, 2, ..., d, r = 1, 2, ..., n_l,    (19)

by using (6) and (8). Applying (11) and (12) of Lemma 2 to each node on the aforementioned path, we have

    u^{l+1}_{G^l_{r_l}} − u^l_{G^l_{r_l}} ∈ λ^l_{r_l} ( ∂φ^l_{r_l}(u^l) )_{G^l_{r_l}} ⊆ λ^l_{r_l} ( ∂φ^l_{r_l}(u^0) )_{G^l_{r_l}}, l = 0, 1, ..., d.    (20)

Making use of (9), we obtain from (20) the following relationship:

    u^{l+1}_{G^d_1} − u^l_{G^d_1} ∈ λ^l_{r_l} ( ∂φ^l_{r_l}(u^0) )_{G^d_1}, l = 0, 1, ..., d.    (21)

Adding (21) for l = 0, 1, ..., d, we have

    u^{d+1}_{G^d_1} − u^0_{G^d_1} ∈ Σ_{l=0}^{d} λ^l_{r_l} ( ∂φ^l_{r_l}(u^0) )_{G^d_1}.    (22)

It follows from (4), (7), (19) and (22) that (18) holds. Similarly, we have 0 ∈ (∂f(u^0))_{G^i_j} for the other leaf nodes G^i_j. Thus, we have (17).

3.2 The Proposed Optimization Algorithm

With the analytical solution for π_λ(·), the minimizer of (3), we can apply many existing methods for solving (1). First, we show in the following lemma that the optimal solution to (1) can be computed as a fixed point. We shall show in Section 3.3 that the result in this lemma can also help determine the interval for the values of λ.

Lemma 3. Let x* be an optimal solution to (1). Then x* satisfies:

    x* = π_{λτ}( x* − τ l′(x*) ), ∀ τ > 0.    (23)
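As an illustration (not from the paper), the fixed point in Lemma 3 suggests the basic iteration x ← π_{λτ}(x − τ l′(x)); the sketch below instantiates it for the least squares loss l(x) = (1/2)||Ax − y||², reusing the my_tglasso prox sketched after Algorithm 1 (the proof of Lemma 3 follows below). The constant step size τ = 1/||A||² and the fixed iteration count are illustrative choices; the paper instead employs the accelerated gradient method discussed after the proof.

    import numpy as np

    def solve_tree_lasso(A, y, nodes, lam, n_iter=500):
        """Proximal-gradient iteration of (23) for l(x) = 0.5*||Ax - y||^2.
        Calls my_tglasso (Algorithm 1) to evaluate pi_{lam*tau}; see the earlier sketch."""
        tau = 1.0 / np.linalg.norm(A, 2) ** 2        # 1/L, with L the Lipschitz constant of l'
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            grad = A.T @ (A @ x - y)                 # l'(x)
            x = my_tglasso(x - tau * grad, nodes, lam * tau)
        return x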

Proof: x* is an optimal solution to (1) if and only if

    0 ∈ l′(x*) + λ ∂φ(x*),    (24)

which leads to

    0 ∈ x* − (x* − τ l′(x*)) + λτ ∂φ(x*), ∀ τ > 0.    (25)

Thus, we have x* = arg min_x (1/2) ||x − (x* − τ l′(x*))||^2 + λτ φ(x). Recall that π_λ(·) is the minimizer of (3). We have (23).

It follows from Lemma 3 that we can apply the fixed point continuation method [4] for solving (1). It is interesting to note that, with an appropriately chosen τ, the scheme in (23) indeed corresponds to the gradient method developed for composite function optimization [2, 19], achieving the global convergence rate of O(1/k) for k iterations. In addition, the scheme in (23) can be accelerated to obtain the accelerated gradient descent [2, 19], where the Moreau-Yosida regularization also needs to be evaluated in each of its iterations. We employ the accelerated gradient descent developed in [2] for the optimization in this paper. The algorithm is called tgLasso, which stands for the tree structured group Lasso. Note that tgLasso includes our previous algorithm [11] as a special case, when the index tree is of depth 1 and w^0_1 = 0.

3.3 The Effective Interval for the Values of λ

When estimating the model parameters via (1), a key issue is to choose appropriate values for the regularization parameter λ. A commonly used approach is to select the regularization parameter from a set of candidate values, which, however, need to be pre-specified in advance. Therefore, it is essential to specify the effective interval for the values of λ. An analysis of MY_tgLasso in Algorithm 1 shows that, with increasing λ, the entries of the solution to (3) are monotonically decreasing. Intuitively, the solution to (3) shall be exactly zero if λ is sufficiently large and all the entries of x are penalized in φ(x). Next, we summarize the main results of this subsection.

Theorem 2. The zero point is a solution to (1) if and only if the zero point is a solution to (3) with v = −l′(0). For the penalty φ(x), let us assume that all entries of x are penalized, i.e., for every index k ∈ {1, 2, ..., p}, there exists at least one node G^i_j that contains k and meanwhile w^i_j > 0. Then, for any 0 < ||l′(0)|| < +∞, there exists a unique λ_max < +∞ satisfying: 1) if λ ≥ λ_max, the zero point is a solution to (1), and 2) if 0 < λ < λ_max, the zero point is not a solution to (1).

Proof: If x = 0 is a solution to (1), we have (24). Setting τ = 1 in (23), we obtain that x = 0 is also the solution to (3) with v = −l′(0). Conversely, if x = 0 is the solution to (3) with v = −l′(0), we have 0 ∈ l′(0) + λ ∂φ(0), which indicates that x = 0 is a solution to (1).

The function φ(x) is closed convex. According to [18, Chapter 3.1.5], ∂φ(0) is a closed convex, non-empty and bounded set. From (8), it is clear that 0 ∈ ∂φ(0). Therefore, we have ||x|| ≤ R, ∀ x ∈ ∂φ(0), where R is a finite radius constant. Let S = {x : x = −α R l′(0)/||l′(0)||, α ∈ [0, 1]} be the line segment from 0 to −R l′(0)/||l′(0)||. It is obvious that S is closed convex and bounded. Define I = S ∩ ∂φ(0), which is clearly closed convex and bounded. Define λ̃_max = ||l′(0)|| / max_{x∈I} ||x||. It follows from ||l′(0)|| > 0 and the boundedness of I that λ̃_max > 0.

We first show λ̃_max < +∞. Otherwise, we have I = {0}. Thus, for all λ > 0, we have −l′(0)/λ ∉ ∂φ(0), which indicates that 0 is neither a solution to (1) nor to (3) with v = −l′(0). Recall the assumption that, for every index k ∈ {1, 2, ..., p}, there exists at least one node G^i_j that contains k and meanwhile w^i_j > 0. It follows from Algorithm 1 that there exists a λ̄ < +∞ such that, when λ > λ̄, 0 is a solution to (3) with v = −l′(0), leading to a contradiction. Therefore, we have 0 < λ̃_max < +∞. Let λ_max = λ̃_max. The arguments hold since 1) if λ ≥ λ_max, then −l′(0)/λ ∈ I ⊆ ∂φ(0); and 2) if 0 < λ < λ_max, then −l′(0)/λ ∉ ∂φ(0).

When l′(0) = 0, the problem (1) has a trivial zero solution. We next focus on the nontrivial case l′(0) ≠ 0. We present the algorithm for efficiently computing λ_max in Algorithm 2. In Step 1, λ_0 is an initial guess of the solution. Our empirical study shows that

    λ_0 = ||l′(0)||_2 / ( Σ_{i=0}^{d} Σ_{j=1}^{n_i} (w^i_j)^2 )^{1/2}

works quite well. In Steps 2-6, we specify an interval [λ_1, λ_2] in which λ_max resides. Finally, in Steps 7-14, we apply bisection to compute λ_max.
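As a usage illustration (not from the paper or the SLEP package), the following Python sketch mirrors the bisection just described; Algorithm 2 below gives the pseudocode. It reuses the my_tglasso prox from the earlier sketch, and for the least squares loss l(x) = (1/2)||Ax − y||² one would pass grad0 = l′(0) = −Aᵀy.

    import numpy as np

    def lambda_max_bisection(grad0, nodes, tol=1e-10):
        """Bisection for lambda_max: the smallest lambda with pi_lambda(-l'(0)) = 0.
        grad0 is l'(0); nodes are (depth, 0-based indices, weight) as before."""
        prox_is_zero = lambda lam: not np.any(my_tglasso(-grad0, nodes, lam))
        # Step 1: initial guess lambda_0 = ||l'(0)||_2 / sqrt(sum of squared node weights)
        lam0 = np.linalg.norm(grad0) / np.sqrt(sum(w ** 2 for _, _, w in nodes))
        lo = hi = lam0
        while not prox_is_zero(hi):            # grow the upper end of the bracket
            lo, hi = hi, 2.0 * hi
        while prox_is_zero(lo) and lo > tol:   # shrink the lower end of the bracket
            lo, hi = lo / 2.0, lo
        while hi - lo >= tol:                  # Steps 7-14: bisection on [lo, hi]
            mid = (lo + hi) / 2.0
            lo, hi = (lo, mid) if prox_is_zero(mid) else (mid, hi)
        return hi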

Algorithm 2 Finding λ_max via Bisection
Input: l′(0), the index tree T with nodes G^i_j (i = 0, 1, ..., d, j = 1, 2, ..., n_i), the weights w^i_j ≥ 0 (i = 0, 1, ..., d, j = 1, 2, ..., n_i), λ_0, and δ = 10^{-10}
Output: λ_max
1: Set λ = λ_0
2: if π_λ(−l′(0)) = 0 then
3:   Set λ_2 = λ, and find the largest λ_1 = 2^{−i} λ, i = 1, 2, ..., such that π_{λ_1}(−l′(0)) ≠ 0
4: else
5:   Set λ_1 = λ, and find the smallest λ_2 = 2^{i} λ, i = 1, 2, ..., such that π_{λ_2}(−l′(0)) = 0
6: end if
7: while λ_2 − λ_1 ≥ δ do
8:   Set λ = (λ_1 + λ_2)/2
9:   if π_λ(−l′(0)) = 0 then
10:    Set λ_2 = λ
11:  else
12:    Set λ_1 = λ
13:  end if
14: end while
15: λ_max = λ

4 Experiments

We have conducted experiments to evaluate the efficiency and effectiveness of the proposed tgLasso algorithm on the face data sets JAFFE [14] and AR [15]. JAFFE contains 213 images of ten Japanese actresses with seven facial expressions: neutral, happy, disgust, fear, anger, sadness, and surprise. We used a subset of AR that contains 400 images corresponding to 100 subjects, with each subject having four facial expressions: neutral, smile, anger, and scream. For both data sets, we resize the images to 64 × 64 and make use of the tree structure depicted in Figure 1. Our task is to discriminate each facial expression from the remaining ones. Thus, we have seven and four binary classification tasks for JAFFE and AR, respectively. We employ the least squares loss for l(·), and set the regularization parameter λ = r × λ_max, where λ_max is computed using Algorithm 2 and r ∈ {5×10^{-1}, 2×10^{-1}, 1×10^{-1}, 5×10^{-2}, 2×10^{-2}, 1×10^{-2}, 5×10^{-3}, 2×10^{-3}}. The source codes, included in the SLEP package [12], are available online¹.

Table 1: Computational time (seconds) for one binary classification task (averaged over 7 and 4 runs for JAFFE and AR, respectively). The total time for all eight regularization parameters is reported.

                                 JAFFE      AR
    tgLasso                         30      73
    alternating algorithm [9]     4054    5155

Efficiency of the Proposed tgLasso. We compare our proposed tgLasso with the recently proposed alternating algorithm [9] designed for the tree-guided group Lasso. We report in Table 1 the total computational time (seconds) for running one binary classification task (averaged over 7 and 4 tasks for JAFFE and AR, respectively) over the eight regularization parameters. We can observe that tgLasso is much more efficient than the alternating algorithm. We note that the key step of tgLasso in each iteration is the associated Moreau-Yosida regularization, which can be efficiently computed due to the existence of an analytical solution; the key step of the alternating algorithm in each iteration is a matrix inversion, which does not scale well to high-dimensional data.

Classification Performance. We compare the classification performance of tgLasso with Lasso. On AR, we use 50 subjects for training and the remaining 50 subjects for testing; on JAFFE, we use 8 subjects for training and the remaining 2 subjects for testing. This subject-independent setting is challenging, as the subjects to be tested are not included in the training set. The reported results are averaged over 10 runs with randomly chosen subjects. For each binary classification task, we compute the balanced error rate [3] to cope with the unbalanced positive and negative samples.

¹ http://www.public.asu.edu/~jye02/software/slep/

Figure 3: Classification performance comparison between Lasso and the tree structured group Lasso on AR (left) and JAFFE (right). The vertical axis is the balanced error rate (%); the horizontal axis corresponds to different regularization parameters λ = r × λ_max, with r ranging from 5e-1 down to 2e-3.

Figure 4: Markers obtained by Lasso and the tree structured group Lasso (white pixels correspond to the markers). Columns: Neutral, Smile, Anger, Scream. First row: face images of the four expressions from the AR data set; second row: the markers identified by the tree structured group Lasso; third row: the markers identified by Lasso.

We report the averaged results in Figure 3. The results show that tgLasso outperforms Lasso in both cases. This verifies the effectiveness of tgLasso in incorporating the tree structure, i.e., the spatial locality information of the face images, into the formulation. Figure 4 shows the markers identified by tgLasso and Lasso under the best regularization parameter. We can observe from the figure that tgLasso results in a block sparsity solution, and most of the selected pixels are around the mouths and eyes.

5 Conclusion

In this paper, we consider the efficient optimization of the tree structured group Lasso. Our main technical result shows that the Moreau-Yosida regularization associated with the tree structured group Lasso admits an analytical solution. Based on the Moreau-Yosida regularization, we design an efficient algorithm for solving the grouped tree structure regularized optimization problem for smooth convex loss functions, and we develop an efficient algorithm for determining the effective interval for the parameter λ. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm. We plan to apply the proposed algorithm to other applications in computer vision and bioinformatics involving tree structures.

Acknowledgments

This work was supported by NSF IIS-0612069, IIS-0812551, CCF-0811790, IIS-0953662, NGA HM1582-08-1-0016, NSFC 60905035, 61035003, and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the US Army.

References

[1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In International Conference on Machine Learning, 2004.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[3] I. Guyon, A. B. Hur, S. Gunn, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Neural Information Processing Systems, pages 545-552, 2004.
[4] E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107-1130, 2008.
[5] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer Verlag, Berlin, 1993.
[6] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning, 2009.
[7] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523v2, 2009.
[8] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In International Conference on Machine Learning, 2010.
[9] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In International Conference on Machine Learning, 2010.
[10] C. Lemaréchal and C. Sagastizábal. Practical aspects of the Moreau-Yosida regularization I: Theoretical properties. SIAM Journal on Optimization, 7(2):367-385, 1997.
[11] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Uncertainty in Artificial Intelligence, 2009.
[12] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[13] J. Liu, L. Yuan, and J. Ye. An efficient algorithm for a class of fused lasso problems. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.
[14] M. J. Lyons, J. Budynek, and S. Akamatsu. Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1357-1362, 1999.
[15] A. M. Martinez and R. Benavente. The AR face database. Technical report, 1998.
[16] L. Meier, S. Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B, 70:53-71, 2008.
[17] J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273-299, 1965.
[18] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
[19] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper, 2007.
[20] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267-288, 1996.
[21] K. Yosida. Functional Analysis. Springer Verlag, Berlin, 1964.
[22] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68(1):49-67, 2006.
[23] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468-3497, 2009.