Learning Task Grouping and Overlap in Multi-Task Learning


1 Learning Task Grouping and Overlap in Multi-Task Learning
Abhishek Kumar, Hal Daumé III
Department of Computer Science, University of Maryland, College Park
20 May 2013
Proceedings of the 29th International Conference on Machine Learning
Presented by Kyle Ulrich
A. Kumar and H. Daumé III (UMD) GO-MTL 20 May / 20

2 Outline
1. Multi-Task Learning
2. Learning Task Grouping and Overlap
   - Regression: Squared Loss
   - Classification: Logistic Loss
3. Experiments
   - Synthetic Data
   - Real Data

4 Multi-Task Learning
- Simultaneously learning multiple prediction tasks that are related
- Idea: common information is shared among tasks, and learning them jointly can result in better performance
- Introduce an inductive bias in the joint hypothesis space (a prior belief about the relationship structure)
- Common bias assumptions:
  - Clustering, disjoint groups (Jacob et al., 2008; Xue et al., 2007)
  - Manifold assumption (Agarwal et al., 2010)
  - Shared sparsity in features (Jalali et al., 2010; Chen et al., 2011)
  - Subspace assumption (Argyriou et al., 2008)

5 Subspace Assumption
- Task parameters lie in a low dimensional subspace that captures the predictive structure for all the tasks
- Concept of latent tasks:
  - Individual tasks from different groups are allowed to overlap in one or more bases
  - Latent tasks can influence more than one group
- Tasks form clusters, and tasks within each cluster are a sparse combination of a task dictionary
- Nonparametric Bayesian approach in (Passos et al., 2012)

7 GO-MTL Model
- Grouping and Overlap in Multi-Task Learning (GO-MTL)
- We have $T$ tasks, and $Z_t = \{(x_{ti}, y_{ti}) : i = 1, 2, \ldots, N_t\}$ is the training set for each task $t = 1, 2, \ldots, T$
- Task weight matrix $W \in \mathbb{R}^{d \times T}$, where $d$ is the feature dimension
- Let $W = LS$, where $L \in \mathbb{R}^{d \times k}$ holds the latent task parameters and $S \in \mathbb{R}^{k \times T}$ contains sparse weights for each task
- Each task weight vector $w_t = L s_t$ is a linear combination of latent tasks
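As a rough illustration of the factorization W = LS, the sketch below builds a small example in NumPy. The sizes and the particular sparsity pattern in `S` are made up for illustration and do not come from the slides; the point is only that tasks sharing a nonzero row of `S` share a latent task.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 6, 3, 5                      # hypothetical sizes: features, latent tasks, tasks

# Columns of L are the latent tasks (basis vectors in feature space).
L = rng.standard_normal((d, k))

# Column t of S holds the sparse weights s_t of task t (pattern chosen by hand).
S = np.array([
    [1.0, 0.5, 0.0,  0.0, 0.0],        # latent task 0: used by tasks 0 and 1
    [0.0, 0.0, 2.0, -1.0, 0.3],        # latent task 1: used by tasks 2, 3 and 4
    [0.0, 0.7, 0.0,  0.0, 1.5],        # latent task 2: shared by tasks 1 and 4
])

W = L @ S                              # column t is the weight vector w_t = L s_t

# Tasks 1 and 4 sit in different groups but overlap through latent task 2.
shared = np.flatnonzero((S[:, 1] != 0) & (S[:, 4] != 0))
```

Here `shared` lists the latent tasks that tasks 1 and 4 have in common, which is the kind of overlap a disjoint-group model cannot represent.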

8 GO-MTL Objective Function
The cost function takes the form:
$$\sum_{t} \sum_{(x_{ti}, y_{ti}) \in Z_t} \mathcal{L}(y_{ti}, s_t^\top L^\top x_{ti}) + \mu \|S\|_1 + \lambda \|L\|_F^2 \qquad (1)$$
For a convex empirical loss function $\mathcal{L}$, this is convex in $L$ for a fixed $S$ and convex in $S$ for a fixed $L$. Therefore, an alternating optimization procedure is used.
For a fixed $L$, the problem decomposes into individual problems for each $s_t$:
$$s_t = \arg\min_{s} \sum_{(x_{ti}, y_{ti}) \in Z_t} \mathcal{L}(y_{ti}, s^\top L^\top x_{ti}) + \mu \|s\|_1 \qquad (2)$$
For a fixed $S$:
$$\min_{L} \sum_{t=1}^{T} \sum_{(x_{ti}, y_{ti}) \in Z_t} \mathcal{L}(y_{ti}, s_t^\top L^\top x_{ti}) + \lambda \|L\|_F^2 \qquad (3)$$
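To make the three terms of Eq. 1 concrete, here is a minimal sketch that evaluates the cost for given L and S. It assumes each task's data is stored as a d x N_t matrix `X_t` with one example per column, and it plugs in squared loss as an example; the function names are illustrative, not from the slides.

```python
import numpy as np

def squared_loss(y, pred):
    """Per-example squared loss (a - b)^2."""
    return (y - pred) ** 2

def go_mtl_objective(L, S, Xs, ys, mu, lam, loss=squared_loss):
    """Evaluate Eq. 1: summed per-example loss + mu * ||S||_1 + lam * ||L||_F^2.

    Xs[t] is a d x N_t matrix (examples as columns), ys[t] a length-N_t vector.
    """
    data_term = 0.0
    for t, (X_t, y_t) in enumerate(zip(Xs, ys)):
        pred = X_t.T @ (L @ S[:, t])          # s_t^T L^T x_ti for every example i
        data_term += loss(y_t, pred).sum()
    l1_penalty = mu * np.abs(S).sum()          # entrywise L1 norm of S
    frob_penalty = lam * np.sum(L ** 2)        # squared Frobenius norm of L
    return data_term + l1_penalty + frob_penalty
```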

9 Algorithm 1 GO-MTL: Grouping and Overlap for Multi-Task Learning
 1: Input:
      $Z_t$: labeled training data for all tasks
      $k$: number of latent tasks
      $\mu$: parameter for controlling sparsity
 2: Output: task predictor matrix $W$, $L$ and $S$
 3: Learn individual predictors for each task using only its own data
 4: Let $W_0$ be the matrix that contains these initial predictors as columns
 5: Compute top-$k$ singular vectors: $W_0 = U \Sigma V^\top$
 6: Initialize $L$ to the first $k$ columns of $U$
 7: while not converged do
 8:   for $t = 1$ to $T$ do
 9:     Solve Eq. 2 to obtain $s_t$
10:   end for
11:   Construct matrix $S = [s_1\ s_2\ \ldots\ s_T]$
12:   Save the previous $L$: $L_{old} = L$
13:   Fix $S$ and solve Eq. 3 to obtain $L$
14: end while
15: Return outputs: $L = L_{old}$, $S$ and $W = L_{old} S$
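The loop in Algorithm 1 can be sketched for the squared-loss case roughly as follows. Several choices here are assumptions rather than the authors' method: the s_t subproblem is solved with ISTA (proximal gradient with soft-thresholding) instead of the two-metric projection method, the single-task initializers use a small ridge term for stability, and a fixed number of outer iterations stands in for the convergence test. Data matrices are d x N_t with examples as columns, and the loss is averaged over N_t as in the squared-loss objective.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def update_s(L, X, y, mu, n_steps=200):
    """Approximately solve the s_t subproblem with ISTA (swapped-in solver)."""
    N = y.size
    A = X.T @ L                                        # N x k design in latent coordinates
    lip = 2.0 / N * np.linalg.norm(A, 2) ** 2 + 1e-12  # Lipschitz constant of the smooth part
    s = np.zeros(L.shape[1])
    for _ in range(n_steps):
        grad = 2.0 / N * A.T @ (A @ s - y)
        s = soft_threshold(s - grad / lip, mu / lip)
    return s

def update_L(S, Xs, ys, lam):
    """Solve the L subproblem through the vectorized linear system (column-major vec)."""
    d, k = Xs[0].shape[0], S.shape[0]
    M = lam * np.eye(d * k)
    b = np.zeros(d * k)
    for t, (X, y) in enumerate(zip(Xs, ys)):
        s = S[:, t]
        M += np.kron(np.outer(s, s), X @ X.T) / y.size
        b += np.outer(X @ y, s).ravel(order="F") / y.size
    return np.linalg.solve(M, b).reshape((d, k), order="F")

def go_mtl(Xs, ys, k, mu, lam=0.1, n_iters=20, ridge=1e-3):
    """Alternating optimization; requires k <= min(d, T) and n_iters >= 1."""
    d = Xs[0].shape[0]
    # Steps 3-4: independent single-task fits (ridge term is an assumption).
    W0 = np.column_stack([
        np.linalg.solve(X @ X.T + ridge * np.eye(d), X @ y) for X, y in zip(Xs, ys)
    ])
    # Steps 5-6: initialize L with the top-k left singular vectors of W0.
    U, _, _ = np.linalg.svd(W0, full_matrices=False)
    L = U[:, :k]
    for _ in range(n_iters):                           # stands in for "while not converged"
        S = np.column_stack([update_s(L, X, y, mu) for X, y in zip(Xs, ys)])
        L_old = L
        L = update_L(S, Xs, ys, lam)
    return L_old, S, L_old @ S                         # step 15: L = L_old, W = L_old S
```

Returning `L_old` together with `S` mirrors step 15: the returned S was fitted against `L_old`, so the pair is consistent.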

10 Regression: Squared Loss
Consider squared loss $\mathcal{L}(a, b) = (a - b)^2$. With this loss, the objective becomes
$$\min_{L,S} \sum_{t=1}^{T} \frac{1}{N_t} \| y_t - X_t^\top L s_t \|^2 + \mu \|S\|_1 + \lambda \|L\|_F^2 \qquad (4)$$
where $X_t \in \mathbb{R}^{d \times N_t}$ is the data matrix for task $t$ and $y_t$ is the vector of its $N_t$ targets.
For fixed $L$, optimize for $s_t$ with the two-metric projection method, which requires the gradient and Hessian of the squared loss function $f(s_t)$:
$$\nabla_{s_t} f(s_t) = \frac{2}{N_t} L^\top X_t (X_t^\top L s_t - y_t)$$
$$\nabla^2_{s_t} f(s_t) = \frac{2}{N_t} L^\top X_t X_t^\top L$$
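The gradient and Hessian of the per-task squared loss can be checked numerically. The sketch below assumes f(s_t) = (1/N_t) ||y_t - X_t^T L s_t||^2 with `X` stored as a d x N_t matrix; the helper names are illustrative only.

```python
import numpy as np

def f(s, L, X, y):
    """Smooth part of the per-task objective: mean squared residual."""
    r = y - X.T @ L @ s
    return r @ r / y.size

def grad_f(s, L, X, y):
    """Gradient (2 / N_t) L^T X_t (X_t^T L s_t - y_t)."""
    return 2.0 / y.size * L.T @ X @ (X.T @ L @ s - y)

def hess_f(L, X, y):
    """Hessian (2 / N_t) L^T X_t X_t^T L; it does not depend on s_t."""
    return 2.0 / y.size * L.T @ X @ X.T @ L

def numeric_grad(fun, s, eps=1e-6):
    """Central finite differences, used only to sanity-check grad_f."""
    g = np.zeros_like(s)
    for j in range(s.size):
        e = np.zeros_like(s)
        e[j] = eps
        g[j] = (fun(s + e) - fun(s - e)) / (2 * eps)
    return g
```

Because f is quadratic in s_t, the second-order Taylor expansion around zero built from `grad_f` and `hess_f` reproduces f exactly, which gives a second consistency check.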

11 Regression: Squared Loss (continued)
For a fixed $S$, setting the gradient of Equation 4 with respect to $L$ to zero gives
$$\sum_{t=1}^{T} \frac{1}{N_t} X_t y_t s_t^\top = \sum_{t=1}^{T} \frac{1}{N_t} X_t X_t^\top L s_t s_t^\top + \lambda L$$
which is a linear equation in $L$. To solve this, we vectorize both sides:
$$\sum_{t=1}^{T} \frac{1}{N_t} \mathrm{vec}\left(X_t y_t s_t^\top\right) = \mathrm{vec}\left(\sum_{t=1}^{T} \frac{1}{N_t} X_t X_t^\top L s_t s_t^\top + \lambda L\right) = \left[\sum_{t=1}^{T} \frac{1}{N_t} (s_t s_t^\top) \otimes (X_t X_t^\top) + \lambda I\right] \mathrm{vec}(L)$$
since $\mathrm{vec}(AXB) = (B^\top \otimes A)\,\mathrm{vec}(X)$. The symbol $\otimes$ represents the Kronecker product.
This linear equation can now be solved in many ways, such as a matrix inverse, the LU decomposition, or by iterative methods.
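A small NumPy check of the vectorization step may help. It assumes `vec` stacks columns (column-major order), which is the convention under which vec(AXB) = (B^T ⊗ A) vec(X) holds, and it solves the resulting system with a dense solver; `solve_L` and `stationarity_residual` are illustrative names.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

def solve_L(S, Xs, ys, lam):
    """Build [sum_t (s_t s_t^T kron X_t X_t^T) / N_t + lam I] vec(L) = rhs and solve it."""
    d, k = Xs[0].shape[0], S.shape[0]
    M = lam * np.eye(d * k)
    rhs = np.zeros(d * k)
    for t, (X, y) in enumerate(zip(Xs, ys)):
        s = S[:, t]
        M += np.kron(np.outer(s, s), X @ X.T) / y.size
        rhs += vec(np.outer(X @ y, s)) / y.size
    return np.linalg.solve(M, rhs).reshape((d, k), order="F")

def stationarity_residual(L, S, Xs, ys, lam):
    """Largest entry of (left side - right side) of the matrix equation for L."""
    lhs = sum(np.outer(X @ y, S[:, t]) / y.size for t, (X, y) in enumerate(zip(Xs, ys)))
    rhs = sum(X @ X.T @ L @ np.outer(S[:, t], S[:, t]) / y.size
              for t, (X, y) in enumerate(zip(Xs, ys))) + lam * L
    return np.abs(lhs - rhs).max()
```

The dk x dk system grows quickly with d and k, which is one reason iterative solvers are mentioned as an alternative to a direct factorization.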

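12 Classification: Logistic Loss
Consider logistic loss: $\mathcal{L}(y, f(x)) = \log(1 + \exp(-y f(x)))$
For fixed $L$, again solve for $s_t$ using the two-metric projection method. To do so, we need the gradient and Hessian of the loss function w.r.t. $s_t$:
$$f(s_t) = \frac{1}{N_t} \sum_{i=1}^{N_t} \log\left(1 + \exp(-y_{ti}\, s_t^\top L^\top x_{ti})\right)$$
$$\nabla_{s_t} f(s_t) = -\frac{1}{N_t} \sum_{i=1}^{N_t} \left(y_{ti} - \sigma(w_t^\top x_{ti})\right) L^\top x_{ti}$$
$$\nabla^2_{s_t} f(s_t) = \frac{1}{N_t} \sum_{i=1}^{N_t} \sigma(w_t^\top x_{ti}) \left(1 - \sigma(w_t^\top x_{ti})\right) L^\top x_{ti} x_{ti}^\top L$$
where $w_t = L s_t$ is the weight vector for task $t$, and $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the logistic function. The gradient is written in the form obtained when labels are coded as $y_{ti} \in \{0, 1\}$; the loss above uses the equivalent coding $y_{ti} \in \{-1, +1\}$.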
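The logistic-loss gradient and Hessian with respect to s_t can be sketched and verified numerically as below. This assumes labels coded as 0/1 for the gradient formula, with `X` a d x N_t matrix of examples as columns; `logistic_loss_pm1` is included only to show that the -1/+1 form of the loss gives the same value. All function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(s, L, X, y01):
    """Mean negative log-likelihood with labels in {0, 1}: log(1 + e^z) - y z."""
    z = X.T @ L @ s
    return np.mean(np.logaddexp(0.0, z) - y01 * z)

def logistic_loss_pm1(s, L, X, ypm):
    """Same loss in the log(1 + exp(-y f(x))) form, labels in {-1, +1}."""
    z = X.T @ L @ s
    return np.mean(np.logaddexp(0.0, -ypm * z))

def logistic_grad(s, L, X, y01):
    """-(1 / N_t) sum_i (y_ti - sigma(w_t^T x_ti)) L^T x_ti."""
    z = X.T @ L @ s
    return -(L.T @ X @ (y01 - sigmoid(z))) / y01.size

def logistic_hess(s, L, X, y01):
    """(1 / N_t) sum_i sigma (1 - sigma) L^T x_ti x_ti^T L."""
    A = X.T @ L                                   # row i is (L^T x_ti)^T
    p = sigmoid(A @ s)
    return (A * (p * (1.0 - p))[:, None]).T @ A / y01.size
```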

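13 Classification: Logistic Loss (continued)
For fixed $S$, the objective is convex in $L$ and can be solved in several ways.
Gradient update, using the gradient with respect to $L$:
$$\nabla_L = -\sum_{t=1}^{T} \frac{1}{N_t} \sum_{i=1}^{N_t} \left(y_{ti} - \sigma(w_t^\top x_{ti})\right) x_{ti} s_t^\top + 2\lambda L$$
Newton-Raphson update: use a Taylor series expansion up to second order around $L$ to obtain the step direction $M$:
$$\left[\sum_{t=1}^{T} \frac{1}{N_t} \sum_{i=1}^{N_t} \delta_{ti}\, \mathrm{vec}(x_{ti} s_t^\top)\, \mathrm{vec}(x_{ti} s_t^\top)^\top + 2\lambda I\right] \mathrm{vec}(M) = \mathrm{vec}\left(\sum_{t=1}^{T} \frac{1}{N_t} \sum_{i=1}^{N_t} \left(y_{ti} - \sigma(w_t^\top x_{ti})\right) x_{ti} s_t^\top - 2\lambda L\right)$$
where $\delta_{ti} = \sigma(w_t^\top x_{ti})\left(1 - \sigma(w_t^\top x_{ti})\right)$. The step size is computed using the Armijo rule.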
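As a rough sketch of an Armijo-rule step size, the code below applies backtracking to a plain gradient step on L rather than to the Newton direction M, which keeps the example short; that substitution, the 0/1 label coding, and the constants `beta` and `c` are assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(L, S, Xs, ys, lam):
    """Sum over tasks of the mean logistic loss (labels in {0, 1}) plus lam * ||L||_F^2."""
    total = lam * np.sum(L ** 2)
    for t, (X, y) in enumerate(zip(Xs, ys)):
        z = X.T @ L @ S[:, t]
        total += np.mean(np.logaddexp(0.0, z) - y * z)
    return total

def gradient(L, S, Xs, ys, lam):
    """-sum_t (1 / N_t) sum_i (y_ti - sigma(w_t^T x_ti)) x_ti s_t^T + 2 lam L."""
    G = 2.0 * lam * L
    for t, (X, y) in enumerate(zip(Xs, ys)):
        s = S[:, t]
        resid = y - sigmoid(X.T @ L @ s)
        G -= np.outer(X @ resid, s) / y.size
    return G

def armijo_gradient_step(L, S, Xs, ys, lam, alpha=1.0, beta=0.5, c=1e-4, max_halvings=50):
    """Shrink alpha until the sufficient-decrease (Armijo) condition holds."""
    G = gradient(L, S, Xs, ys, lam)
    f0 = objective(L, S, Xs, ys, lam)
    g2 = np.sum(G ** 2)
    for _ in range(max_halvings):
        L_new = L - alpha * G
        if objective(L_new, S, Xs, ys, lam) <= f0 - c * alpha * g2:
            return L_new, alpha
        alpha *= beta
    return L, 0.0                                   # no acceptable step found
```

For the Newton variant, the same backtracking loop would be run along the direction M, with the decrease test using the inner product of the gradient and M in place of `g2`.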

15 Parameter Selection
$$\sum_{t} \sum_{(x_{ti}, y_{ti}) \in Z_t} \mathcal{L}(y_{ti}, s_t^\top L^\top x_{ti}) + \mu \|S\|_1 + \lambda \|L\|_F^2$$
- $\lambda$ is fixed at 0.1 in all experiments
- $\mu$ is selected using cross validation with four different random splits (70% train, 30% validation) on the training set
- Selection of $k$ (number of latent tasks):
  - $k < T$
  - A large value of $k$ is controlled by increasing the sparsity penalty $\mu$
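The selection of µ over four random 70/30 splits might be organized along these lines. This is a simplified single-dataset sketch: `fit` and `score` are hypothetical callables supplied by the caller (for example a GO-MTL fit and a validation RMSE), and the candidate grid is not specified in the slides.

```python
import numpy as np

def random_splits(n, n_splits=4, train_frac=0.7, seed=0):
    """Yield (train_idx, val_idx) pairs from independent random permutations."""
    rng = np.random.default_rng(seed)
    cut = int(round(train_frac * n))
    for _ in range(n_splits):
        perm = rng.permutation(n)
        yield perm[:cut], perm[cut:]

def select_mu(candidates, fit, score, X, y, **split_kwargs):
    """Return the candidate mu with the lowest mean validation error.

    fit(X_train, y_train, mu) -> model and score(model, X_val, y_val) -> error
    are hypothetical callables; X is d x N with examples as columns.
    """
    mean_err = {}
    for mu in candidates:
        errs = [
            score(fit(X[:, tr], y[tr], mu), X[:, va], y[va])
            for tr, va in random_splits(y.size, **split_kwargs)   # same splits for every mu
        ]
        mean_err[mu] = float(np.mean(errs))
    best = min(mean_err, key=mean_err.get)
    return best, mean_err
```

Reusing the same seed for every candidate means all values of µ are compared on identical splits.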

16 Baseline Comparisons
- No-group MTL (Argyriou et al., 2008): All tasks are assumed to be related, and the task parameters are assumed to lie in a low dimensional subspace. This is done by penalizing the nuclear norm of the weight matrix.
- Disjoint-Group MTL (DG-MTL) (Kang et al., 2011): Assumes multiple disjoint groups of tasks. Task parameters within a group lie in a low dimensional space.
- Single Task Learning (STL): Tasks are learned independently.

17 Synthetic Dataset 1
- Three disjoint groups of tasks
- Each latent task generates one group
- Able to recover the true support, which is given by three latent tasks (even for $k > 3$)
Figure: Left: RMSE with DG-MTL vs. number of groups (G). Right: RMSE with GO-MTL vs. number of latent tasks (k).
Figure: Recovered sparsity patterns (the matrix S) with GO-MTL with 3 disjoint groups. The horizontal and vertical axes show the observed tasks and the latent tasks, respectively. Top: k = 3. Middle: k = 10 after three iterations. Bottom: k = 10 after convergence (15 iterations).

18 Synthetic Dataset 2
- Three overlapping groups of tasks
- Generated by 4 latent tasks
Figure: Left: RMSE with DG-MTL vs. number of groups (G). Right: RMSE with GO-MTL vs. number of latent tasks (k).
Figure: Sparsity patterns (the matrix S) with three overlapping groups generated by 4 latent tasks. Bottom: true task grouping structure. Top: recovered support for k = 5. Only the first four latent tasks are active.

19 Real Data
Table: Results on different datasets. Reported numbers are root mean square error (RMSE) for regression datasets and multi-class classification errors for MNIST and USPS. Numbers in parentheses are std. dev.
Regression:
- Computer Survey data: 190 tasks, 20 examples per task, 75-25% split
- School data: 139 tasks, 100 examples per task on avg, 60-40% split
Classification:
- USPS: 10 classes, 1000 training examples, 500 test examples
- MNIST: 10 classes, 1000 training examples, 500 test examples

20 Conclusion
- Framework for learning grouping and overlap structure in multi-task learning
- Parameters of each task group lie in a low dimensional subspace
- Does not assume disjoint grouping structure
- Tasks in different groups are allowed to overlap through the sharing of latent basis tasks
