Multi-task Learning and Structured Sparsity


1 Multi-task Learning and Structured Sparsity
Massimiliano Pontil
Department of Computer Science
Centre for Computational Statistics and Machine Learning
University College London

2 Outline
- Problem formulation and examples
- Classes of regularizers
- Statistical analysis
- Optimization methods
- Multilinear models
- Sparse coding

3 Problem formulation
Let $\mu_1, \ldots, \mu_T$ be prescribed probability distributions on $X \times Y$.
Goal: find functions $f_t : X \to Y$ which minimize
$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t}\, L(y, f_t(x))$$
Learning from data $(x_t^1, y_t^1), \ldots, (x_t^n, y_t^n) \sim \mu_t$, $t = 1, \ldots, T$:
$$\min_{f_1, \ldots, f_T} \left\{ \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} L(y_t^i, f_t(x_t^i)) + \lambda\, \Omega(f_1, \ldots, f_T) \right\}$$
The penalty $\Omega$ encourages common structure among the functions.
Focus on linear regression: $X \subseteq \mathbb{R}^d$, $Y \subseteq \mathbb{R}$ and $f(x) = \langle w, x \rangle$.

4 Problem formulation (cont.)
Linear regression model: $y_t^i = \langle w_t, x_t^i \rangle + \epsilon_t^i$
$$\min_{w_1, \ldots, w_T} \underbrace{\frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} L(y_t^i, \langle w_t, x_t^i \rangle)}_{\text{training error, task } t} \;+\; \underbrace{\lambda\, \Omega(w_1, \ldots, w_T)}_{\text{joint regularizer}}$$
Single task learning: $\Omega(w_1, \ldots, w_T) = \sum_t \Omega_t(w_t)$
Typical scenario: many tasks but only few examples per task.
If the tasks are related, learning them jointly should improve over learning each task independently.
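To make the training objective concrete, here is a minimal numpy sketch of the regularized multi-task objective for the linear model above. The square loss and the single-task penalty used in the example are illustrative choices, not prescribed by the slides, and all function and variable names are made up for this sketch.

import numpy as np

def mtl_objective(W, X, Y, lam, penalty):
    """W: (T, d) stacked task vectors; X: (T, n, d) inputs; Y: (T, n) outputs.
    Returns the average training error over tasks plus lam * penalty(W)."""
    preds = np.einsum('tnd,td->tn', X, W)      # <w_t, x_t^i> for every task and example
    train_err = np.mean((Y - preds) ** 2)      # (1/T) sum_t (1/n) sum_i square loss
    return train_err + lam * penalty(W)

# Example usage with the single-task penalty sum_t ||w_t||^2:
rng = np.random.default_rng(0)
T, n, d = 5, 10, 3
X, Y, W = rng.normal(size=(T, n, d)), rng.normal(size=(T, n)), rng.normal(size=(T, d))
print(mtl_objective(W, X, Y, lam=0.1, penalty=lambda M: np.sum(M ** 2)))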

5 Examples
1. User modelling
   - each task is to predict a user's ratings of products [Lenk et al. 1996, ...]
   - the ways different people make decisions about products are related, e.g. small variance of the parameters
   - special case (matrix completion): $X = \{e_1, \ldots, e_d\}$
2. Multiple object detection in scenes
   - detection of each object corresponds to a binary classification task
   - learning common features enhances performance [Torralba et al. 2004, ...]
Early work in ML used neural nets with shared hidden weights [Baxter 1996, Caruana 1997, Silver and Mercer 1996, ...]
More applications in bioinformatics, finance, neuroimaging, NLP, etc.

6 Quadratic regularizer
$$\min_{w_1, \ldots, w_T} \sum_{t,i} L(y_t^i, \langle w_t, x_t^i \rangle) + \lambda\, \Omega(w_1, \ldots, w_T)$$
Let $\Omega(w) = \langle w, E w \rangle$, with $w := (w_1, \ldots, w_T) \in \mathbb{R}^{dT}$ and $E \succeq 0$.
Independent task learning if $E$ is block diagonal.
Encourage linear relationships between tasks, e.g. [Evgeniou & P., 2004]:
$$\Omega(w) = \Big\|\frac{1}{T}\sum_{s=1}^{T} w_s\Big\|^2 + \frac{1}{cT}\sum_{t=1}^{T}\Big\|w_t - \frac{1}{T}\sum_{s=1}^{T} w_s\Big\|^2, \qquad c \in [0, 1]$$
c = 1: independent tasks; c = 0: identical tasks.
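A short numpy sketch of this quadratic coupling penalty, in the variance-around-the-task-mean form written above; the convention that tasks are stored as rows of W is an assumption made for illustration. As c decreases the penalty pulls all task vectors towards their mean, and at c = 1 it reduces to a uniform ridge penalty.

import numpy as np

def quadratic_penalty(W, c):
    """W: (T, d) matrix of task vectors, 0 < c <= 1."""
    T = W.shape[0]
    w_bar = W.mean(axis=0)                               # task mean (1/T) sum_s w_s
    return np.sum(w_bar ** 2) + np.sum((W - w_bar) ** 2) / (c * T)

# c = 1 gives (1/T) sum_t ||w_t||^2 (independent ridge); small c forces near-identical tasks.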

7 Equivalent formulation [Evgeniou et al. 2005]
Choose $v \in \mathbb{R}^p$, $B_t \in \mathbb{R}^{p \times d}$ and set $w_t = B_t^\top v$.
Equivalent problem:
$$\min_{v} \sum_{t,i} L(y_t^i, \langle v, B_t x_t^i \rangle) + \lambda \langle v, v \rangle$$
Previous example: $B_t^\top = \big[\,(1-c)^{\frac12} I_{d\times d},\; \underbrace{0_{d\times d}, \ldots, 0_{d\times d}}_{t-1},\; (cT)^{\frac12} I_{d\times d},\; \underbrace{0_{d\times d}, \ldots, 0_{d\times d}}_{T-t}\,\big]$
so that $w_t = (1-c)^{\frac12} v_0 + (cT)^{\frac12} v_t$ = common + task specific.
Extension to coupled hierarchical structures: $w_t = \sum_{k \in A(t)} v_k$
Connection to kernel methods: learn a map $(x, t) \mapsto f_t(x)$ using the kernel $K((x_1, t_1), (x_2, t_2)) = \langle B_{t_1} x_1, B_{t_2} x_2 \rangle$
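The feature maps B_t of the previous example can be written down directly. The sketch below builds the common-plus-task-specific B_t and evaluates the induced multi-task kernel; it is only an illustration of the construction, with all names chosen here.

import numpy as np

def B(t, T, d, c):
    """Return B_t as a (d*(T+1), d) matrix, so that B_t x stacks sqrt(1-c)*x in the common
    block and sqrt(c*T)*x in the block of task t (0-based), with zeros elsewhere."""
    out = np.zeros((d * (T + 1), d))
    out[:d] = np.sqrt(1 - c) * np.eye(d)
    out[d * (t + 1): d * (t + 2)] = np.sqrt(c * T) * np.eye(d)
    return out

def mt_kernel(x1, t1, x2, t2, T, c):
    """K((x1, t1), (x2, t2)) = <B_{t1} x1, B_{t2} x2>."""
    d = len(x1)
    return float(np.dot(B(t1, T, d, c) @ x1, B(t2, T, d, c) @ x2))

# For this choice the kernel simplifies to ((1-c) + c*T*[t1 == t2]) * <x1, x2>.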

8 Structured sparsity: few shared variables [Argyriou et al. 2006; Lounici et al. 2009]
Favour matrices with many zero rows:
$$\|W\|_{2,1} := \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} w_{tj}^2}$$
[Figure: example matrices W favoured by the number of non-zero rows, the $\|\cdot\|_{2,1}$ norm, and the $\ell_1$-norm (green = 0, blue = 1).]
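A small numpy sketch of the three quantities compared in the figure, for a matrix W whose rows index variables and whose columns index tasks (an assumed storage convention).

import numpy as np

def num_nonzero_rows(W):
    return int(np.sum(np.any(W != 0, axis=1)))

def l21_norm(W):
    return float(np.sum(np.sqrt(np.sum(W ** 2, axis=1))))   # sum of row l2-norms

def l1_norm(W):
    return float(np.sum(np.abs(W)))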

9 Statistical analysis
Linear regression model: $y_t^i = \langle w_t, x_t^i \rangle + \epsilon_t^i$, with $\epsilon_t^i$ i.i.d. $N(0, \sigma^2)$, $i = 1, \ldots, n$, $d \gg n$; use the square loss $L(y, y') = (y - y')^2$.
Assume $\mathrm{card}\big\{ j : \sum_{t=1}^{T} w_{tj}^2 > 0 \big\} \le s$.
Variables not too correlated: $\frac{1}{n} \big| \sum_{i=1}^{n} x_{tj}^i x_{tk}^i \big| \le \frac{1}{7 \rho s}$, for all $t$ and $j \ne k$.
Theorem [Lounici et al. 2011]. If $\lambda = \frac{4\sigma}{\sqrt{nT}} \sqrt{1 + \frac{A \log d}{\sqrt{T}}}$ with $A \ge 4$, then w.h.p.
$$\frac{1}{T} \sum_{t=1}^{T} \|\hat{w}_t - w_t\|^2 \le \frac{c \sigma^2}{\rho}\, \frac{s}{n} \Big( 1 + \frac{A \log d}{\sqrt{T}} \Big)$$
Dependency on the dimension d is negligible for large T.
Compare to the Lasso: $\frac{1}{T} \sum_{t=1}^{T} \|w_t^{(L)} - w_t\|^2 \approx c\, \frac{s}{n}\, \log(d\,T)$
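Purely as an illustration of how the dimension enters each bound, the snippet below compares the two rates above with the constants and the correlation factor dropped; the specific numbers plugged in are arbitrary.

import numpy as np

def multitask_rate(s, n, d, T, A=4.0):
    return (s / n) * (1 + A * np.log(d) / np.sqrt(T))

def lasso_rate(s, n, d, T):
    return (s / n) * np.log(d * T)

print(multitask_rate(s=5, n=50, d=10_000, T=100))   # the log d contribution shrinks as T grows
print(lasso_rate(s=5, n=50, d=10_000, T=100))       # the log(dT) factor does not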

10 Multitask feature learning [Argyriou et al. 2006, 2008]
Extend the above formulation to learn a low dimensional representation:
$$\min_{U, A} \sum_{t,i} L(y_t^i, \langle a_t, U^\top x_t^i \rangle) + \lambda \|A\|_{2,1} \; : \; U^\top U = I_{d \times d},\; A \in \mathbb{R}^{d \times T}$$
Let $W = UA$ and minimize over orthogonal $U$:
$$\min_{U} \|U^\top W\|_{2,1} = \|W\|_{\mathrm{tr}} := \sum_{j=1}^{r} \sigma_j(W)$$
Equivalent to trace norm regularization:
$$\min_{W} \sum_{t,i} L(y_t^i, \langle w_t, x_t^i \rangle) + \lambda \|W\|_{\mathrm{tr}}$$
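A quick numpy check of the identity above: the trace norm is the sum of singular values, and rotating W by its left singular vectors makes the l2,1 norm of U^T W equal to it. The example matrix is random and only illustrative.

import numpy as np

def trace_norm(W):
    return float(np.sum(np.linalg.svd(W, compute_uv=False)))

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
l21_rotated = np.sum(np.linalg.norm(U.T @ W, axis=1))   # row l2-norms of U^T W are the sigma_j
print(np.isclose(trace_norm(W), l21_rotated))            # True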

11 Variational form and alternate minimization
Fact: $\|W\|_{\mathrm{tr}} = \frac12 \inf_{D \succ 0} \mathrm{tr}(D^{-1} W W^\top + D)$, and the infimizer is $(W W^\top)^{\frac12}$.
$$\min_{W,\, D \succ 0} \sum_{t=1}^{T} \sum_{i=1}^{n} L(y_t^i, \langle w_t, x_t^i \rangle) + \frac{\lambda}{2} \Big[ \underbrace{\mathrm{tr}(W^\top D^{-1} W)}_{\sum_{t=1}^{T} \langle w_t, D^{-1} w_t \rangle} + \mathrm{tr}(D) \Big]$$
Requires a perturbation step to ensure convergence.
See [Dudík et al. 2012] for comparative results.
Diagonal constraints: $\|W\|_{2,1} = \frac12 \inf_{z > 0} \sum_{j=1}^{d} \Big( \frac{\|w_{:,j}\|^2}{z_j} + z_j \Big)$
Further constraints on z [Micchelli et al. 2010], e.g. ordered structures.
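A sketch of the alternating scheme above for the square loss, assuming the perturbed update D = (W W^T + eps I)^{1/2}; the data layout, solver choices and fixed iteration count are illustrative, not taken from the slides.

import numpy as np

def trace_norm_mtl(Xs, ys, lam, iters=50, eps=1e-6):
    """Xs: list of (n_t, d) design matrices, ys: list of (n_t,) targets. Returns W of shape (d, T)."""
    d = Xs[0].shape[1]
    D = np.eye(d)
    for _ in range(iters):
        Dinv = np.linalg.inv(D)
        # W-step: for fixed D the problem splits into T generalized ridge regressions
        W = np.column_stack([np.linalg.solve(X.T @ X + 0.5 * lam * Dinv, X.T @ y)
                             for X, y in zip(Xs, ys)])
        # D-step: perturbed minimizer (W W^T + eps I)^{1/2} of the variational bound
        vals, vecs = np.linalg.eigh(W @ W.T + eps * np.eye(d))
        D = (vecs * np.sqrt(vals)) @ vecs.T
    return W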

12 Risk bound
Theorem [Maurer and P. 2012]. Let $R(W) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t} L(y, \langle w_t, x \rangle)$ and $\hat{R}(W)$ the empirical error. Assume $L(y, \cdot)$ is $\phi$-Lipschitz and $\|x_t^i\| \le 1$. If $\hat{W} \in \mathrm{argmin} \big\{ \hat{R}(W) : \|W\|_{\mathrm{tr}} \le B \sqrt{T} \big\}$, then with prob. at least $1 - \delta$
$$R(\hat{W}) - R(W^*) \le 2 \phi B \left( \sqrt{\frac{\|\hat{C}\|_\infty}{n}} + \sqrt{\frac{2 (\ln(nT) + 1)}{nT}} \right) + \sqrt{\frac{8 \ln(3/\delta)}{nT}}$$
with $\hat{C} = \frac{1}{nT} \sum_{t,i} x_t^i \otimes x_t^i$ and $W^* \in \mathrm{argmin}\, R(W)$.
Interpretation: assume $\mathrm{rank}(W^*) = K$, $\|w_t\| \le 1$ and let $B = \sqrt{K}$. If the inputs are uniformly distributed, as T grows we have a $O(\sqrt{K/(nd)})$ bound, as compared to $O(\sqrt{1/n})$ for single task learning.

13 Experiment (computer survey)
[Figures: test error vs. number of tasks; eigenvalues of D.]
Performance improves with more tasks.
A single most important feature is shared by everyone.
Dataset [Lenk et al. 1996]: consumers' ratings of PC models; 180 persons (tasks), 8 training and 4 test points each, 13 inputs (RAM, CPU, price, etc.), output in {0, ..., 10} (likelihood of purchase).

14 Experiment (computer survey)
[Figure: components of the first eigenvector of D over the 13 inputs (TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR).]

Method                         Test error
Independent
Aggregate                      5.52
Quadratic (best c in [0,1])    4.37
Structured sparsity            4.04
Trace norm                     3.72
Quadratic + trace              3.20

The most important feature (1st eigenvector of D) weighs technical characteristics (RAM, CPU, CD-ROM) vs. price.

15 More complex models
Several possible extensions of the above formulations:
- Multiple low dimensional subspaces [Argyriou et al. 2008b, Kang et al. 2011]
- Encourage heterogeneous features [Romera-Paredes et al. 2012]
- Sparse coding [Kumar and Daumé III 2012, Maurer et al. 2013]
- Multilinear models [Romera-Paredes et al. 2013]

16 Multilinear MTL [Romera-Paredes et al. 2013]
Tasks associated with a multi-index, e.g. $t = (t_1, t_2)$.
Example: predict action-unit (e.g. cheek raiser) activation for different people [Lucey et al. 2011]: $t_1$ = identity, $t_2$ = action-unit.

17 Multilinear MTL (cont.)
Let $W \in \mathbb{R}^{T_1 \times T_2 \times d}$, with $W_{t_1, t_2, :} \in \mathbb{R}^d$ the $(t_1, t_2)$-th regression task, for $t_1 = 1, \ldots, T_1$, $t_2 = 1, \ldots, T_2$.
Goal: control the rank of each matricization of W:
$$\mathrm{rank}(W_{(1)}) + \mathrm{rank}(W_{(2)}) + \mathrm{rank}(W_{(3)})$$
where $W_{(n)}$ is the mode-n matricization of W.
Convex lower bound [Liu et al. 2011, Gandy et al. 2011, Signoretto et al. 2012]:
$$\sum_{n=1}^{3} \|W_{(n)}\|_{\mathrm{tr}}$$
Regularization problem solved by the alternating direction method of multipliers [Gandy et al. 2011].
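The convex surrogate above is easy to evaluate: unfold the task tensor along each mode and sum the trace norms of the unfoldings. The sketch below does this with numpy; the unfolding convention is one standard choice and the names are illustrative.

import numpy as np

def mode_n_unfold(W, n):
    """Mode-n matricization: move axis n to the front and flatten the remaining axes."""
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def sum_of_trace_norms(W):
    return sum(float(np.sum(np.linalg.svd(mode_n_unfold(W, n), compute_uv=False)))
               for n in range(W.ndim))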

18 Multilinear MTL (cont.)
Alternative approach using the Tucker decomposition:
$$W_{t_1, t_2, j} = \sum_{s_1=1}^{S_1} \sum_{s_2=1}^{S_2} \sum_{k=1}^{K} G_{s_1, s_2, k}\, A_{t_1, s_1}\, B_{t_2, s_2}\, C_{j, k}$$
Transfer learning experiment (more general setting involving distinct sets of training and target tasks).
[Figure: correlation vs. m (training set size) for GRR, GTNR, GMF, TTNR and TF.]
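A small numpy illustration of the Tucker form above, reconstructing the task tensor from a core G and factor matrices A, B, C; the dimensions are arbitrary example values.

import numpy as np

rng = np.random.default_rng(0)
T1, T2, d, S1, S2, K = 4, 3, 5, 2, 2, 3
G = rng.normal(size=(S1, S2, K))                 # core tensor
A = rng.normal(size=(T1, S1))                    # factor over the first task index
B = rng.normal(size=(T2, S2))                    # factor over the second task index
C = rng.normal(size=(d, K))                      # factor over the input variables
W = np.einsum('abk,ta,ub,jk->tuj', G, A, B, C)   # W[t1, t2, j] as in the formula above
print(W.shape)                                   # (4, 3, 5)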

19 Exploiting unrelated groups of tasks [Romera-Paredes et al. 2012]
Example: recognizing identity and emotion on a set of faces (emotion-related features vs. identity-related features).
Assumption:
1. Low rank within each group
2. Tasks from different groups tend to use orthogonal features
[Figure: misclassification error rate vs. training set size for Ridge Regression, MTL, MTL-2G, OrthoMTL and OrthoMTL-EN.]
$$\min_{W, V} \big\{ \hat{R}_{\mathrm{em}}(W) + \hat{R}_{\mathrm{id}}(V) + \lambda \|[W, V]\|_{\mathrm{tr}} + \rho \|W^\top V\|_{\mathrm{Fr}}^2 \big\}$$
Related convex problem under conditions.
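A short numpy sketch of the two penalty terms in the objective above: the trace norm of the concatenated matrix [W, V] keeps the two groups jointly low rank, while the squared Frobenius norm of W^T V vanishes when the groups use orthogonal features. Shapes and names are assumptions made for illustration.

import numpy as np

def group_penalty(W, V, lam, rho):
    """W: (d, T_em) emotion tasks, V: (d, T_id) identity tasks, sharing the feature dimension d."""
    concat_trace = float(np.sum(np.linalg.svd(np.hstack([W, V]), compute_uv=False)))
    ortho = float(np.sum((W.T @ V) ** 2))        # squared Frobenius norm of W^T V
    return lam * concat_trace + rho * ortho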

20 Multi-task learning with dictionaries [Maurer, P., Romera-Paredes 2013]
Natural extension of sparse coding [Olshausen and Field 1996]:
$$\min_{U, A} \Big\{ \sum_{t,i} L(y_t^i, \langle w_t, x_t^i \rangle) \; : \; w_t = U a_t,\; \|u_k\|_2 \le 1,\; \|a_t\|_1 \le \alpha \Big\}$$
[Figure: multiclass accuracy vs. number of training examples per task for RR, MTFL and SC-MTL.]
Related method with a Frobenius norm bound [Kumar and Daumé III, 2012].
Estimation bounds indicate potential improvement over single task learning.
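The constraint set above is simple to write down. The sketch below evaluates the dictionary-based objective with the square loss and checks the two constraints; how U and A would actually be optimized is omitted, and all names are illustrative.

import numpy as np

def dict_mtl_objective(U, A, Xs, ys, alpha):
    """U: (d, K) dictionary, A: (K, T) sparse codes, Xs: list of (n_t, d), ys: list of (n_t,)."""
    assert np.all(np.linalg.norm(U, axis=0) <= 1 + 1e-9), "each atom must satisfy ||u_k||_2 <= 1"
    assert np.all(np.sum(np.abs(A), axis=0) <= alpha + 1e-9), "each code must satisfy ||a_t||_1 <= alpha"
    W = U @ A                                    # w_t = U a_t, one column per task
    return sum(float(np.sum((y - X @ W[:, t]) ** 2))
               for t, (X, y) in enumerate(zip(Xs, ys)))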

21 Conclusions
- Multi-task learning is ubiquitous: exploiting task relatedness provides substantial improvement over independent task learning.
- Presented families of regularizers which naturally extend complexity notions (smoothness and sparsity) used for single-task learning; touched upon statistical analyses and optimisation methods.
- Need for scalable algorithms, particularly in more complex task relatedness scenarios such as multilinear models.
- Nonlinear MTL via reproducing kernel Hilbert spaces.

22 Thanks
Andreas Argyriou, Nadia Bianchi-Berthouze, Andrea Caponnetto, Theodoros Evgeniou, Karim Lounici, Andreas Maurer, Charles Micchelli, Bernardino Romera-Paredes, Alexandre Tsybakov, Sara van de Geer, Yiming Ying

23 References
[Ando and Zhang] A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR.
[Argyriou, Evgeniou, Pontil] Multi-task feature learning. NIPS.
[Argyriou, Evgeniou, Pontil] Convex multi-task feature learning. Machine Learning.
[Argyriou, Maurer, Pontil] An algorithm for transfer learning in a heterogeneous environment. ECML 2008b.
[Argyriou, Micchelli, Pontil] When is there a representer theorem? Vector versus matrix regularizers. JMLR.
[Argyriou, Micchelli, Pontil, Shen, Xu] Efficient first order methods for linear composite regularizers. arXiv preprint.
[Baxter] A model for inductive bias learning. JAIR.
[Ben-David and Schuller] Exploiting task relatedness for multiple task learning. COLT.
[Caponnetto, Micchelli, Pontil, Ying] Universal multi-task kernels. JMLR.
[Caruana] Multitask learning. Machine Learning.

24
[Dudík, Harchaoui, Malik] Lifted coordinate descent for learning with trace-norm regularization. AISTATS.
[Evgeniou and Pontil] Regularized multi-task learning. SIGKDD.
[Evgeniou, Micchelli, Pontil] Learning multiple tasks with kernel methods. JMLR.
[Lounici, Pontil, Tsybakov, van de Geer] Taking advantage of sparsity in multi-task learning. COLT.
[Lounici, Pontil, Tsybakov, van de Geer] Oracle inequalities and optimal inference under group sparsity. Annals of Statistics.
[Kakade, Shalev-Shwartz, Tewari] Regularization techniques for learning with matrices. JMLR.
[Kang, Grauman, Sha] Learning with whom to share in multi-task feature learning. ICML.
[Kumar and Daumé III] Learning task grouping and overlap in multi-task learning. ICML 2012.
[Lenk, DeSarbo, Green, Young] Hierarchical Bayes conjoint analysis: recovery of partworth heterogeneity from reduced experimental designs. Marketing Science.
[Maurer] Bounds for linear multi-task learning. JMLR.

25
[Maurer and Pontil] K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11).
[Maurer and Pontil] Structured sparsity and generalization. JMLR.
[Maurer, Pontil, Romera-Paredes] Sparse coding for multitask and transfer learning. ICML.
[Micchelli, Morales, Pontil] A family of penalty functions for structured sparsity. NIPS.
[Micchelli and Pontil] On learning vector-valued functions. Neural Computation.
[Romera-Paredes, Argyriou, Bianchi-Berthouze, Pontil] Exploiting unrelated tasks in multi-task learning. AISTATS.
[Romera-Paredes, Aung, Bianchi-Berthouze, Pontil] Multilinear multitask learning. ICML.
[Salakhutdinov, Torralba, Tenenbaum] Learning to share visual appearance for multiclass object detection. CVPR.
[Silver and Mercer] The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connection Science.
[Yu, Tresp, Schwaighofer] Learning Gaussian processes from multiple tasks. ICML.
[Thrun and Pratt] Learning to learn. Springer.
[Torralba, Murphy, Freeman] Sharing features: efficient boosting procedures for multiclass object detection. CVPR.
