Multi-task Learning and Structured Sparsity


1 Multi-task Learning and Structured Sparsity
Massimiliano Pontil
Department of Computer Science
Centre for Computational Statistics and Machine Learning
University College London

2 Outline
- Problem formulation and examples
- Classes of regularizers
- Statistical analysis
- Optimization methods
- Multilinear models
- Sparse coding

3 Problem formulation
Let $\mu_1, \ldots, \mu_T$ be prescribed probability distributions on $X \times Y$.
Goal: find functions $f_t : X \to Y$ which minimize
$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t}\, L(y, f_t(x))$$
Learning from data $(x_t^1, y_t^1), \ldots, (x_t^n, y_t^n) \sim \mu_t$, $t = 1, \ldots, T$:
$$\min_{f_1, \ldots, f_T} \left\{ \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} L(y_t^i, f_t(x_t^i)) + \lambda\, \Omega(f_1, \ldots, f_T) \right\}$$
The penalty $\Omega$ encourages common structure among the functions.
Focus on linear regression: $X \subseteq \mathbb{R}^d$, $Y \subseteq \mathbb{R}$ and $f(x) = \langle w, x \rangle$.

4 Problem formulation (cont.)
Linear regression model: $y_t^i = \langle w_t, x_t^i \rangle + \epsilon_t^i$
$$\min_{w_1, \ldots, w_T} \underbrace{\frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} L(y_t^i, \langle w_t, x_t^i \rangle)}_{\text{training error, task } t} \;+\; \underbrace{\lambda\, \Omega(w_1, \ldots, w_T)}_{\text{joint regularizer}}$$
Single task learning: $\Omega(w_1, \ldots, w_T) = \sum_t \Omega_t(w_t)$
Typical scenario: many tasks but only few examples per task.
If the tasks are related, learning them jointly should improve over learning each task independently.
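To make the training objective concrete, here is a minimal numpy sketch of the regularized multi-task objective for the linear model above. The square loss and the single-task penalty used in the example are illustrative choices, not prescribed by the slides, and all function and variable names are made up for this sketch.

import numpy as np

def mtl_objective(W, X, Y, lam, penalty):
    """W: (T, d) stacked task vectors; X: (T, n, d) inputs; Y: (T, n) outputs.
    Returns the average training error over tasks plus lam * penalty(W)."""
    preds = np.einsum('tnd,td->tn', X, W)      # <w_t, x_t^i> for every task and example
    train_err = np.mean((Y - preds) ** 2)      # (1/T) sum_t (1/n) sum_i square loss
    return train_err + lam * penalty(W)

# Example usage with the single-task penalty sum_t ||w_t||^2:
rng = np.random.default_rng(0)
T, n, d = 5, 10, 3
X, Y, W = rng.normal(size=(T, n, d)), rng.normal(size=(T, n)), rng.normal(size=(T, d))
print(mtl_objective(W, X, Y, lam=0.1, penalty=lambda M: np.sum(M ** 2)))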

5 Examples
1. User modelling
   - each task is to predict a user's ratings of products [Lenk et al. 1996, ...]
   - the ways different people make decisions about products are related, e.g. small variance of the parameters
   - special case (matrix completion): $X = \{e_1, \ldots, e_d\}$
2. Multiple object detection in scenes
   - detection of each object corresponds to a binary classification task
   - learning common features enhances performance [Torralba et al. 2004, ...]
Early work in ML used neural nets with shared hidden weights [Baxter 1996, Caruana 1997, Silver and Mercer 1996, ...]
More applications in bioinformatics, finance, neuroimaging, NLP, etc.

6 Quadratic regularizer
$$\min_{w_1, \ldots, w_T} \sum_{t,i} L(y_t^i, \langle w_t, x_t^i \rangle) + \lambda\, \Omega(w_1, \ldots, w_T)$$
Let $\Omega(w) = \langle w, E w \rangle$, with $w := (w_1, \ldots, w_T) \in \mathbb{R}^{dT}$ and $E \succeq 0$.
Independent task learning if $E$ is block diagonal.
Encourage linear relationships between tasks, e.g. [Evgeniou & P., 2004]:
$$\Omega(w) = \Big\|\frac{1}{T}\sum_{s=1}^{T} w_s\Big\|^2 + \frac{1}{cT}\sum_{t=1}^{T}\Big\|w_t - \frac{1}{T}\sum_{s=1}^{T} w_s\Big\|^2, \qquad c \in [0, 1]$$
c = 1: independent tasks; c = 0: identical tasks.
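A short numpy sketch of this quadratic coupling penalty, in the variance-around-the-task-mean form written above; the convention that tasks are stored as rows of W is an assumption made for illustration. As c decreases the penalty pulls all task vectors towards their mean, and at c = 1 it reduces to a uniform ridge penalty.

import numpy as np

def quadratic_penalty(W, c):
    """W: (T, d) matrix of task vectors, 0 < c <= 1."""
    T = W.shape[0]
    w_bar = W.mean(axis=0)                               # task mean (1/T) sum_s w_s
    return np.sum(w_bar ** 2) + np.sum((W - w_bar) ** 2) / (c * T)

# c = 1 gives (1/T) sum_t ||w_t||^2 (independent ridge); small c forces near-identical tasks.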

7 Equivalent formulation [Evgeniou et al. 2005]
Choose $v \in \mathbb{R}^p$, $B_t \in \mathbb{R}^{p \times d}$ and set $w_t = B_t^\top v$.
Equivalent problem:
$$\min_{v} \sum_{t,i} L(y_t^i, \langle v, B_t x_t^i \rangle) + \lambda \langle v, v \rangle$$
Previous example: $B_t^\top = \big[\,(1-c)^{\frac12} I_{d\times d},\; \underbrace{0_{d\times d}, \ldots, 0_{d\times d}}_{t-1},\; (cT)^{\frac12} I_{d\times d},\; \underbrace{0_{d\times d}, \ldots, 0_{d\times d}}_{T-t}\,\big]$
so that $w_t = (1-c)^{\frac12} v_0 + (cT)^{\frac12} v_t$ = common + task specific.
Extension to coupled hierarchical structures: $w_t = \sum_{k \in A(t)} v_k$
Connection to kernel methods: learn a map $(x, t) \mapsto f_t(x)$ using the kernel $K((x_1, t_1), (x_2, t_2)) = \langle B_{t_1} x_1, B_{t_2} x_2 \rangle$
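The feature maps B_t of the previous example can be written down directly. The sketch below builds the common-plus-task-specific B_t and evaluates the induced multi-task kernel; it is only an illustration of the construction, with all names chosen here.

import numpy as np

def B(t, T, d, c):
    """Return B_t as a (d*(T+1), d) matrix, so that B_t x stacks sqrt(1-c)*x in the common
    block and sqrt(c*T)*x in the block of task t (0-based), with zeros elsewhere."""
    out = np.zeros((d * (T + 1), d))
    out[:d] = np.sqrt(1 - c) * np.eye(d)
    out[d * (t + 1): d * (t + 2)] = np.sqrt(c * T) * np.eye(d)
    return out

def mt_kernel(x1, t1, x2, t2, T, c):
    """K((x1, t1), (x2, t2)) = <B_{t1} x1, B_{t2} x2>."""
    d = len(x1)
    return float(np.dot(B(t1, T, d, c) @ x1, B(t2, T, d, c) @ x2))

# For this choice the kernel simplifies to ((1-c) + c*T*[t1 == t2]) * <x1, x2>.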

8 Structured sparsity: few shared variables [Argyriou et al. 2006; Lounici et al. 2009]
Favour matrices with many zero rows:
$$\|W\|_{2,1} := \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} w_{tj}^2}$$
[Figure: example matrices W favoured by the number of non-zero rows, the $\|\cdot\|_{2,1}$ norm, and the $\ell_1$-norm (green = 0, blue = 1).]
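A small numpy sketch of the three quantities compared in the figure, for a matrix W whose rows index variables and whose columns index tasks (an assumed storage convention).

import numpy as np

def num_nonzero_rows(W):
    return int(np.sum(np.any(W != 0, axis=1)))

def l21_norm(W):
    return float(np.sum(np.sqrt(np.sum(W ** 2, axis=1))))   # sum of row l2-norms

def l1_norm(W):
    return float(np.sum(np.abs(W)))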

9 Statistical analysis
Linear regression model: $y_t^i = \langle w_t, x_t^i \rangle + \epsilon_t^i$, with $\epsilon_t^i$ i.i.d. $N(0, \sigma^2)$, $i = 1, \ldots, n$, $d \gg n$; use the square loss $L(y, y') = (y - y')^2$.
Assume $\mathrm{card}\big\{ j : \sum_{t=1}^{T} w_{tj}^2 > 0 \big\} \le s$.
Variables not too correlated: $\frac{1}{n} \big| \sum_{i=1}^{n} x_{tj}^i x_{tk}^i \big| \le \frac{1}{7 \rho s}$, for all $t$ and $j \ne k$.
Theorem [Lounici et al. 2011]. If $\lambda = \frac{4\sigma}{\sqrt{nT}} \sqrt{1 + \frac{A \log d}{\sqrt{T}}}$ with $A \ge 4$, then w.h.p.
$$\frac{1}{T} \sum_{t=1}^{T} \|\hat{w}_t - w_t\|^2 \le \frac{c \sigma^2}{\rho}\, \frac{s}{n} \Big( 1 + \frac{A \log d}{\sqrt{T}} \Big)$$
Dependency on the dimension d is negligible for large T.
Compare to the Lasso: $\frac{1}{T} \sum_{t=1}^{T} \|w_t^{(L)} - w_t\|^2 \approx c\, \frac{s}{n}\, \log(d\,T)$
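Purely as an illustration of how the dimension enters each bound, the snippet below compares the two rates above with the constants and the correlation factor dropped; the specific numbers plugged in are arbitrary.

import numpy as np

def multitask_rate(s, n, d, T, A=4.0):
    return (s / n) * (1 + A * np.log(d) / np.sqrt(T))

def lasso_rate(s, n, d, T):
    return (s / n) * np.log(d * T)

print(multitask_rate(s=5, n=50, d=10_000, T=100))   # the log d contribution shrinks as T grows
print(lasso_rate(s=5, n=50, d=10_000, T=100))       # the log(dT) factor does not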

10 Multitask feature learning [Argyriou et al. 2006, 2008]
Extend the above formulation to learn a low dimensional representation:
$$\min_{U, A} \sum_{t,i} L(y_t^i, \langle a_t, U^\top x_t^i \rangle) + \lambda \|A\|_{2,1} \; : \; U^\top U = I_{d \times d},\; A \in \mathbb{R}^{d \times T}$$
Let $W = UA$ and minimize over orthogonal $U$:
$$\min_{U} \|U^\top W\|_{2,1} = \|W\|_{\mathrm{tr}} := \sum_{j=1}^{r} \sigma_j(W)$$
Equivalent to trace norm regularization:
$$\min_{W} \sum_{t,i} L(y_t^i, \langle w_t, x_t^i \rangle) + \lambda \|W\|_{\mathrm{tr}}$$
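A quick numpy check of the identity above: the trace norm is the sum of singular values, and rotating W by its left singular vectors makes the l2,1 norm of U^T W equal to it. The example matrix is random and only illustrative.

import numpy as np

def trace_norm(W):
    return float(np.sum(np.linalg.svd(W, compute_uv=False)))

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
l21_rotated = np.sum(np.linalg.norm(U.T @ W, axis=1))   # row l2-norms of U^T W are the sigma_j
print(np.isclose(trace_norm(W), l21_rotated))            # True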

11 Variational form and alternate minimization
Fact: $\|W\|_{\mathrm{tr}} = \frac12 \inf_{D \succ 0} \mathrm{tr}(D^{-1} W W^\top + D)$, and the infimizer is $(W W^\top)^{\frac12}$.
$$\min_{W,\, D \succ 0} \sum_{t=1}^{T} \sum_{i=1}^{n} L(y_t^i, \langle w_t, x_t^i \rangle) + \frac{\lambda}{2} \Big[ \underbrace{\mathrm{tr}(W^\top D^{-1} W)}_{\sum_{t=1}^{T} \langle w_t, D^{-1} w_t \rangle} + \mathrm{tr}(D) \Big]$$
Requires a perturbation step to ensure convergence.
See [Dudík et al. 2012] for comparative results.
Diagonal constraints: $\|W\|_{2,1} = \frac12 \inf_{z > 0} \sum_{j=1}^{d} \Big( \frac{\|w_{:,j}\|^2}{z_j} + z_j \Big)$
Further constraints on z [Micchelli et al. 2010], e.g. ordered structures.
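A sketch of the alternating scheme above for the square loss, assuming the perturbed update D = (W W^T + eps I)^{1/2}; the data layout, solver choices and fixed iteration count are illustrative, not taken from the slides.

import numpy as np

def trace_norm_mtl(Xs, ys, lam, iters=50, eps=1e-6):
    """Xs: list of (n_t, d) design matrices, ys: list of (n_t,) targets. Returns W of shape (d, T)."""
    d = Xs[0].shape[1]
    D = np.eye(d)
    for _ in range(iters):
        Dinv = np.linalg.inv(D)
        # W-step: for fixed D the problem splits into T generalized ridge regressions
        W = np.column_stack([np.linalg.solve(X.T @ X + 0.5 * lam * Dinv, X.T @ y)
                             for X, y in zip(Xs, ys)])
        # D-step: perturbed minimizer (W W^T + eps I)^{1/2} of the variational bound
        vals, vecs = np.linalg.eigh(W @ W.T + eps * np.eye(d))
        D = (vecs * np.sqrt(vals)) @ vecs.T
    return W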

12 Risk bound
Theorem [Maurer and P. 2012]. Let $R(W) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t} L(y, \langle w_t, x \rangle)$ and $\hat{R}(W)$ the empirical error. Assume $L(y, \cdot)$ is $\phi$-Lipschitz and $\|x_t^i\| \le 1$. If $\hat{W} \in \mathrm{argmin} \big\{ \hat{R}(W) : \|W\|_{\mathrm{tr}} \le B \sqrt{T} \big\}$, then with prob. at least $1 - \delta$
$$R(\hat{W}) - R(W^*) \le 2 \phi B \left( \sqrt{\frac{\|\hat{C}\|_\infty}{n}} + \sqrt{\frac{2 (\ln(nT) + 1)}{nT}} \right) + \sqrt{\frac{8 \ln(3/\delta)}{nT}}$$
with $\hat{C} = \frac{1}{nT} \sum_{t,i} x_t^i \otimes x_t^i$ and $W^* \in \mathrm{argmin}\, R(W)$.
Interpretation: assume $\mathrm{rank}(W^*) = K$, $\|w_t\| \le 1$ and let $B = \sqrt{K}$. If the inputs are uniformly distributed, as T grows we have a $O(\sqrt{K/(nd)})$ bound, as compared to $O(\sqrt{1/n})$ for single task learning.

13 Experiment (computer survey)
[Figures: test error vs. number of tasks; eigenvalues of D.]
Performance improves with more tasks.
A single most important feature is shared by everyone.
Dataset [Lenk et al. 1996]: consumers' ratings of PC models; 180 persons (tasks), 8 training and 4 test points each, 13 inputs (RAM, CPU, price, etc.), output in {0, ..., 10} (likelihood of purchase).

14 Experiment (computer survey)
[Figure: components of the first eigenvector of D over the 13 inputs (TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR).]

Method                         Test error
Independent
Aggregate                      5.52
Quadratic (best c in [0,1])    4.37
Structured sparsity            4.04
Trace norm                     3.72
Quadratic + trace              3.20

The most important feature (1st eigenvector of D) weighs technical characteristics (RAM, CPU, CD-ROM) vs. price.

15 More complex models
Several possible extensions of the above formulations:
- Multiple low dimensional subspaces [Argyriou et al. 2008b, Kang et al. 2011]
- Encourage heterogeneous features [Romera-Paredes et al. 2012]
- Sparse coding [Kumar and Daumé III 2012, Maurer et al. 2013]
- Multilinear models [Romera-Paredes et al. 2013]

16 Multilinear MTL [Romera-Paredes et al. 2013]
Tasks associated with a multi-index, e.g. $t = (t_1, t_2)$.
Example: predict action-unit (e.g. cheek raiser) activation for different people [Lucey et al. 2011]: $t_1$ = identity, $t_2$ = action-unit.

17 Multilinear MTL (cont.)
Let $W \in \mathbb{R}^{T_1 \times T_2 \times d}$, with $W_{t_1, t_2, :} \in \mathbb{R}^d$ the $(t_1, t_2)$-th regression task, for $t_1 = 1, \ldots, T_1$, $t_2 = 1, \ldots, T_2$.
Goal: control the rank of each matricization of W:
$$\mathrm{rank}(W_{(1)}) + \mathrm{rank}(W_{(2)}) + \mathrm{rank}(W_{(3)})$$
where $W_{(n)}$ is the mode-n matricization of W.
Convex lower bound [Liu et al. 2011, Gandy et al. 2011, Signoretto et al. 2012]:
$$\sum_{n=1}^{3} \|W_{(n)}\|_{\mathrm{tr}}$$
Regularization problem solved by the alternating direction method of multipliers [Gandy et al. 2011].
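The convex surrogate above is easy to evaluate: unfold the task tensor along each mode and sum the trace norms of the unfoldings. The sketch below does this with numpy; the unfolding convention is one standard choice and the names are illustrative.

import numpy as np

def mode_n_unfold(W, n):
    """Mode-n matricization: move axis n to the front and flatten the remaining axes."""
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def sum_of_trace_norms(W):
    return sum(float(np.sum(np.linalg.svd(mode_n_unfold(W, n), compute_uv=False)))
               for n in range(W.ndim))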

18 Multilinear MTL (cont.)
Alternative approach using the Tucker decomposition:
$$W_{t_1, t_2, j} = \sum_{s_1=1}^{S_1} \sum_{s_2=1}^{S_2} \sum_{k=1}^{K} G_{s_1, s_2, k}\, A_{t_1, s_1}\, B_{t_2, s_2}\, C_{j, k}$$
Transfer learning experiment (more general setting involving distinct sets of training and target tasks).
[Figure: correlation vs. m (training set size) for GRR, GTNR, GMF, TTNR and TF.]
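A small numpy illustration of the Tucker form above, reconstructing the task tensor from a core G and factor matrices A, B, C; the dimensions are arbitrary example values.

import numpy as np

rng = np.random.default_rng(0)
T1, T2, d, S1, S2, K = 4, 3, 5, 2, 2, 3
G = rng.normal(size=(S1, S2, K))                 # core tensor
A = rng.normal(size=(T1, S1))                    # factor over the first task index
B = rng.normal(size=(T2, S2))                    # factor over the second task index
C = rng.normal(size=(d, K))                      # factor over the input variables
W = np.einsum('abk,ta,ub,jk->tuj', G, A, B, C)   # W[t1, t2, j] as in the formula above
print(W.shape)                                   # (4, 3, 5)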

19 Exploiting unrelated groups of tasks [Romera-Paredes et al. 2012]
Example: recognizing identity and emotion on a set of faces (emotion-related features vs. identity-related features).
Assumption:
1. Low rank within each group
2. Tasks from different groups tend to use orthogonal features
[Figure: misclassification error rate vs. training set size for Ridge Regression, MTL, MTL-2G, OrthoMTL and OrthoMTL-EN.]
$$\min_{W, V} \big\{ \hat{R}_{\mathrm{em}}(W) + \hat{R}_{\mathrm{id}}(V) + \lambda \|[W, V]\|_{\mathrm{tr}} + \rho \|W^\top V\|_{\mathrm{Fr}}^2 \big\}$$
Related convex problem under conditions.
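A short numpy sketch of the two penalty terms in the objective above: the trace norm of the concatenated matrix [W, V] keeps the two groups jointly low rank, while the squared Frobenius norm of W^T V vanishes when the groups use orthogonal features. Shapes and names are assumptions made for illustration.

import numpy as np

def group_penalty(W, V, lam, rho):
    """W: (d, T_em) emotion tasks, V: (d, T_id) identity tasks, sharing the feature dimension d."""
    concat_trace = float(np.sum(np.linalg.svd(np.hstack([W, V]), compute_uv=False)))
    ortho = float(np.sum((W.T @ V) ** 2))        # squared Frobenius norm of W^T V
    return lam * concat_trace + rho * ortho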

20 Multi-task learning with dictionaries [Maurer, P., Romera-Paredes 2013]
Natural extension of sparse coding [Olshausen and Field 1996]:
$$\min_{U, A} \Big\{ \sum_{t,i} L(y_t^i, \langle w_t, x_t^i \rangle) \; : \; w_t = U a_t,\; \|u_k\|_2 \le 1,\; \|a_t\|_1 \le \alpha \Big\}$$
[Figure: multiclass accuracy vs. number of training examples per task for RR, MTFL and SC-MTL.]
Related method with a Frobenius norm bound [Kumar and Daumé III, 2012].
Estimation bounds indicate potential improvement over single task learning.
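The constraint set above is simple to write down. The sketch below evaluates the dictionary-based objective with the square loss and checks the two constraints; how U and A would actually be optimized is omitted, and all names are illustrative.

import numpy as np

def dict_mtl_objective(U, A, Xs, ys, alpha):
    """U: (d, K) dictionary, A: (K, T) sparse codes, Xs: list of (n_t, d), ys: list of (n_t,)."""
    assert np.all(np.linalg.norm(U, axis=0) <= 1 + 1e-9), "each atom must satisfy ||u_k||_2 <= 1"
    assert np.all(np.sum(np.abs(A), axis=0) <= alpha + 1e-9), "each code must satisfy ||a_t||_1 <= alpha"
    W = U @ A                                    # w_t = U a_t, one column per task
    return sum(float(np.sum((y - X @ W[:, t]) ** 2))
               for t, (X, y) in enumerate(zip(Xs, ys)))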

21 Conclusions
- Multi-task learning is ubiquitous: exploiting task relatedness provides substantial improvement over independent task learning.
- Presented families of regularizers which naturally extend complexity notions (smoothness and sparsity) used for single-task learning; touched upon statistical analyses and optimisation methods.
- Need for scalable algorithms, particularly in more complex task relatedness scenarios such as multilinear models.
- Nonlinear MTL via reproducing kernel Hilbert spaces.

22 Thanks
Andreas Argyriou, Nadia Bianchi-Berthouze, Andrea Caponnetto, Theodoros Evgeniou, Karim Lounici, Andreas Maurer, Charles Micchelli, Bernardino Romera-Paredes, Alexandre Tsybakov, Sara van de Geer, Yiming Ying

23 References
[Ando and Zhang] A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR.
[Argyriou, Evgeniou, Pontil] Multi-task feature learning. NIPS.
[Argyriou, Evgeniou, Pontil] Convex multi-task feature learning. Machine Learning.
[Argyriou, Maurer, Pontil] An algorithm for transfer learning in a heterogeneous environment. ECML 2008b.
[Argyriou, Micchelli, Pontil] When is there a representer theorem? Vector versus matrix regularizers. JMLR.
[Argyriou, Micchelli, Pontil, Shen, Xu] Efficient first order methods for linear composite regularizers. arXiv preprint.
[Baxter] A model for inductive bias learning. JAIR.
[Ben-David and Schuller] Exploiting task relatedness for multiple task learning. COLT.
[Caponnetto, Micchelli, Pontil, Ying] Universal multi-task kernels. JMLR.
[Caruana] Multitask learning. Machine Learning.

24
[Dudík, Harchaoui, Malik] Lifted coordinate descent for learning with trace-norm regularization. AISTATS.
[Evgeniou and Pontil] Regularized multi-task learning. SIGKDD.
[Evgeniou, Micchelli, Pontil] Learning multiple tasks with kernel methods. JMLR.
[Lounici, Pontil, Tsybakov, van de Geer] Taking advantage of sparsity in multi-task learning. COLT.
[Lounici, Pontil, Tsybakov, van de Geer] Oracle inequalities and optimal inference under group sparsity. Annals of Statistics.
[Kakade, Shalev-Shwartz, Tewari] Regularization techniques for learning with matrices. JMLR.
[Kang, Grauman, Sha] Learning with whom to share in multi-task feature learning. ICML.
[Kumar and Daumé III] Learning task grouping and overlap in multi-task learning. ICML 2012.
[Lenk, DeSarbo, Green, Young] Hierarchical Bayes conjoint analysis: recovery of partworth heterogeneity from reduced experimental designs. Marketing Science.
[Maurer] Bounds for linear multi-task learning. JMLR.

25
[Maurer and Pontil] K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11).
[Maurer and Pontil] Structured sparsity and generalization. JMLR.
[Maurer, Pontil, Romera-Paredes] Sparse coding for multitask and transfer learning. ICML.
[Micchelli, Morales, Pontil] A family of penalty functions for structured sparsity. NIPS.
[Micchelli and Pontil] On learning vector-valued functions. Neural Computation.
[Romera-Paredes, Argyriou, Bianchi-Berthouze, Pontil] Exploiting unrelated tasks in multi-task learning. AISTATS.
[Romera-Paredes, Aung, Bianchi-Berthouze, Pontil] Multilinear multitask learning. ICML.
[Salakhutdinov, Torralba, Tenenbaum] Learning to share visual appearance for multiclass object detection. CVPR.
[Silver and Mercer] The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connection Science.
[Yu, Tresp, Schwaighofer] Learning Gaussian processes from multiple tasks. ICML.
[Thrun and Pratt] Learning to learn. Springer.
[Torralba, Murphy, Freeman] Sharing features: efficient boosting procedures for multiclass object detection. CVPR.
