Multi-task Learning and Structured Sparsity
Slide 1: Multi-task Learning and Structured Sparsity

Massimiliano Pontil
Department of Computer Science
Centre for Computational Statistics and Machine Learning
University College London

Massimiliano Pontil (UCL), Multi-Task Learning, Oxford, 6/5/13
Slide 2: Outline

- Problem formulation and examples
- Classes of regularizers
- Statistical analysis
- Optimization methods
- Multilinear models
- Sparse coding
Slide 3: Problem formulation

Let $\mu_1, \dots, \mu_T$ be prescribed probability distributions on $X \times Y$.

Goal: find functions $f_t : X \to Y$ which minimize
$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t} L(y, f_t(x)).$$

Learning from data $(x_t^1, y_t^1), \dots, (x_t^n, y_t^n) \sim \mu_t$, $t = 1, \dots, T$:
$$\min_{f_1, \dots, f_T} \left\{ \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} L(y_t^i, f_t(x_t^i)) \right\} + \lambda\, \Omega(f_1, \dots, f_T)$$

The penalty $\Omega$ encourages common structure among the functions.

Focus on linear regression: $X \subseteq \mathbb{R}^d$, $Y \subseteq \mathbb{R}$ and $f(x) = \langle w, x \rangle$.
Slide 4: Problem formulation (cont.)

Linear regression model: $y_t^i = \langle w_t, x_t^i \rangle + \epsilon_t^i$

$$\min_{w_1, \dots, w_T} \underbrace{\frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} L(y_t^i, \langle w_t, x_t^i \rangle)}_{\text{training error, task } t} + \lambda\, \underbrace{\Omega(w_1, \dots, w_T)}_{\text{joint regularizer}}$$

Single task learning: $\Omega(w_1, \dots, w_T) = \sum_t \Omega_t(w_t)$

Typical scenario: many tasks but only few examples per task.

If the tasks are related, learning them jointly should improve over learning each task independently.
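As a concrete reference point, here is a minimal numpy sketch of this regularized objective with the square loss; the function name and array layout are illustrative, not from the slides.

```python
import numpy as np

def mtl_objective(W, X, Y, penalty, lam):
    """Average training error over tasks plus a joint regularizer.

    W : (d, T) array, one weight column w_t per task.
    X : (T, n, d) inputs, Y : (T, n) outputs.
    penalty : callable implementing Omega(W).
    """
    T = W.shape[1]
    loss = 0.0
    for t in range(T):
        residuals = Y[t] - X[t] @ W[:, t]   # square loss for task t
        loss += np.mean(residuals ** 2)
    return loss / T + lam * penalty(W)
```

Different choices of `penalty` below (quadratic coupling, mixed norms, trace norm) instantiate the different multi-task regularizers.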
Slide 5: Examples

1. User modelling
   - each task is to predict a user's ratings of products [Lenk et al. 1996, ...]
   - the ways different people make decisions about products are related, e.g. small variance of parameters
   - special case (matrix completion): $X = \{e_1, \dots, e_d\}$

2. Multiple object detection in scenes
   - detection of each object corresponds to a binary classification task
   - learning common features enhances performance [Torralba et al. 2004, ...]
   - early work in ML using neural nets with shared hidden weights [Baxter 1996, Caruana 1997, Silver and Mercer 1996, ...]

More applications in bioinformatics, finance, neuroimaging, NLP, etc.
Slide 6: Quadratic regularizer

$$\min_{w_1, \dots, w_T} \sum_{t,i} L(y_t^i, \langle w_t, x_t^i \rangle) + \lambda\, \Omega(w_1, \dots, w_T)$$

Let $\Omega(w) = \langle w, E w \rangle$, with $w := (w_1, \dots, w_T) \in \mathbb{R}^{dT}$ and $E \succeq 0$.

Independent task learning if $E$ is block diagonal.

Encourage linear relationships between tasks, e.g. [Evgeniou & P., 2004], writing $\bar{w} = \frac{1}{T} \sum_{s=1}^{T} w_s$:
$$\Omega(w) = \|\bar{w}\|^2 + \frac{1}{cT} \sum_{t=1}^{T} \|w_t - \bar{w}\|^2, \qquad c \in [0, 1]$$

c = 1: independent tasks; c = 0: identical tasks.
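A small numpy version of this penalty, under the reconstruction above (the helper name is ours). With c = 1 it reduces to $\frac{1}{T}\sum_t \|w_t\|^2$, i.e. independent ridge penalties; as c → 0 any spread around the mean is penalized without bound, forcing the tasks to coincide.

```python
import numpy as np

def quadratic_mtl_penalty(W, c):
    """Omega(w) = ||w_bar||^2 + (1/(cT)) sum_t ||w_t - w_bar||^2.

    W : (d, T) with one column per task; 0 < c <= 1.
    """
    T = W.shape[1]
    w_bar = W.mean(axis=1)                        # average task vector
    variance = ((W - w_bar[:, None]) ** 2).sum()  # spread around the mean
    return float(w_bar @ w_bar + variance / (c * T))
```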
Slide 7: Equivalent formulation [Evgeniou et al. 2005]

Choose $v \in \mathbb{R}^p$, $B_t \in \mathbb{R}^{p \times d}$ and set $w_t = B_t^\top v$.

Equivalent problem:
$$\min_v \sum_{t,i} L(y_t^i, \langle v, B_t x_t^i \rangle) + \lambda \langle v, v \rangle$$

Previous example: $v = (v_0, v_1, \dots, v_T)$ and
$$B_t^\top = \Big[\, (1-c)^{1/2} I_{d \times d},\ \underbrace{0_{d \times d}, \dots, 0_{d \times d}}_{t-1},\ (cT)^{1/2} I_{d \times d},\ \underbrace{0_{d \times d}, \dots, 0_{d \times d}}_{T-t} \,\Big]$$
so that $w_t = (1-c)^{1/2} v_0 + (cT)^{1/2} v_t$ = common + task specific.

Extension to coupled hierarchical structures: $w_t = \sum_{k \in A(t)} v_k$

Connection to kernel methods: learn a map $(x, t) \mapsto f_t(x)$ using the kernel
$$K((x_1, t_1), (x_2, t_2)) = \langle B_{t_1} x_1, B_{t_2} x_2 \rangle$$
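For the common-plus-task-specific $B_t$ above, the multi-task kernel works out to a shared term plus a same-task term; a sketch (the function name is ours):

```python
import numpy as np

def multitask_kernel(x1, t1, x2, t2, T, c):
    """<B_{t1} x1, B_{t2} x2>: the shared block always contributes
    (1-c)<x1, x2>; the task-specific block adds cT<x1, x2> only when
    the two examples come from the same task."""
    k = (1.0 - c) * np.dot(x1, x2)
    if t1 == t2:
        k += c * T * np.dot(x1, x2)
    return k
```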
Slide 8: Structured sparsity: few shared variables [Argyriou et al. 2006; Lounici et al. 2009]

Favour matrices with many zero rows:
$$\|W\|_{2,1} := \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} w_{tj}^2}$$

[Figure: example matrices W favoured by different regularizers, compared under the number of non-zero rows, the $\|\cdot\|_{2,1}$ norm, and the $\ell_1$ norm (green = 0, blue = 1).]
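The mixed norm is one line of numpy; a sketch, laid out with one column per task so the outer sum runs over variables:

```python
import numpy as np

def norm_21(W):
    """||W||_{2,1}: for each variable (row of the d x T matrix W),
    take the l2 norm across tasks, then sum over variables."""
    return np.linalg.norm(W, axis=1).sum()
```

Because whole rows must vanish to shrink this norm, its minimizers tend to select a few variables shared across all tasks, unlike the $\ell_1$ norm, which zeroes entries independently.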
Slide 9: Statistical analysis

Linear regression model: $y_t^i = \langle w_t, x_t^i \rangle + \epsilon_t^i$, with $\epsilon_t^i$ i.i.d. $N(0, \sigma^2)$, $i = 1, \dots, n$, $d \gg n$; use the square loss $L(y, y') = (y - y')^2$.

Assume $\mathrm{card}\left\{ j : \sum_{t=1}^{T} w_{tj}^2 > 0 \right\} \le s$.

Variables not too correlated: $\left| \frac{1}{n} \sum_{i=1}^{n} x_{tj}^i x_{tk}^i \right| \le \frac{1-\rho}{7s}$ for all $t$ and $j \ne k$.

Theorem [Lounici et al. 2011]. If $\lambda = \frac{4\sigma}{\sqrt{nT}} \sqrt{1 + \frac{A \log d}{T}}$ with $A \ge 4$, then w.h.p.
$$\frac{1}{T} \sum_{t=1}^{T} \|\hat{w}_t - w_t\|^2 \le \frac{c\, \sigma^2 s}{\rho\, n} \left( 1 + \sqrt{\frac{A \log d}{T}} \right)$$

Dependency on the dimension $d$ is negligible for large $T$.

Compare to the Lasso: $\frac{1}{T} \sum_{t=1}^{T} \|\hat{w}_t^{(L)} - w_t\|^2 \le c\, \frac{s}{n} \log(dT)$.
Slide 10: Multitask feature learning [Argyriou et al. 2006, 2008]

Extend the above formulation to learn a low dimensional representation:
$$\min_{U, A} \left\{ \sum_{t,i} L(y_t^i, \langle a_t, U^\top x_t^i \rangle) + \lambda \|A\|_{2,1} \;:\; U^\top U = I_{d \times d},\ A \in \mathbb{R}^{d \times T} \right\}$$

Let $W = UA$ and minimize over orthogonal $U$:
$$\min_U \|U^\top W\|_{2,1} = \|W\|_{\mathrm{tr}} := \sum_{j=1}^{r} \sigma_j(W)$$
(the sum of the singular values of $W$).

Equivalent to trace norm regularization:
$$\min_W \sum_{t,i} L(y_t^i, \langle w_t, x_t^i \rangle) + \lambda \|W\|_{\mathrm{tr}}$$
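A quick numerical check of the identity $\min_U \|U^\top W\|_{2,1} = \|W\|_{\mathrm{tr}}$ at the minimizing $U$ (the left singular vectors of $W$); a sketch:

```python
import numpy as np

def trace_norm(W):
    """Trace (nuclear) norm: the sum of singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
rows = U.T @ W   # row j equals s[j] * Vt[j], so its l2 norm is s[j]
assert np.isclose(np.linalg.norm(rows, axis=1).sum(), trace_norm(W))
```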
Slide 11: Variational form and alternate minimization

Fact: $\|W\|_{\mathrm{tr}} = \frac{1}{2} \inf_{D \succ 0} \mathrm{tr}(D^{-1} W W^\top + D)$, with infimizer $D = (W W^\top)^{1/2}$.

$$\min_{W,\, D \succ 0} \sum_{t=1}^{T} \sum_{i=1}^{n} L(y_t^i, \langle w_t, x_t^i \rangle) + \frac{\lambda}{2} \Big[ \underbrace{\mathrm{tr}(W^\top D^{-1} W)}_{\sum_{t=1}^{T} \langle w_t, D^{-1} w_t \rangle} + \mathrm{tr}(D) \Big]$$

Requires a perturbation step to ensure convergence.

See [Dudík et al. 2012] for comparative results.

Diagonal constraints: $\|W\|_{2,1} = \frac{1}{2} \inf_{z > 0} \sum_{j=1}^{d} \left( \frac{\|w_{:,j}\|^2}{z_j} + z_j \right)$

Further constraints on $z$ [Micchelli et al. 2010], e.g. ordered structures.
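A compact sketch of the resulting alternating scheme with the perturbation step (our naming; square loss assumed, so each task update has a ridge-style closed form):

```python
import numpy as np

def alt_min_trace_norm(X, Y, lam, eps=1e-6, iters=50):
    """Alternate between solving each task given D and the closed-form
    D update; eps perturbs W W^T to keep D invertible (the convergence fix).
    X : (T, n, d), Y : (T, n)."""
    T, n, d = X.shape
    W, D = np.zeros((d, T)), np.eye(d)
    for _ in range(iters):
        D_inv = np.linalg.inv(D)
        for t in range(T):
            # argmin_w ||Y_t - X_t w||^2 + (lam/2) <w, D^{-1} w>
            W[:, t] = np.linalg.solve(X[t].T @ X[t] + 0.5 * lam * D_inv,
                                      X[t].T @ Y[t])
        # D = (W W^T + eps I)^{1/2} via an eigendecomposition
        vals, vecs = np.linalg.eigh(W @ W.T + eps * np.eye(d))
        D = (vecs * np.sqrt(vals)) @ vecs.T
    return W
```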
Slide 12: Risk bound

Theorem [Maurer and P. 2012]. Let $R(W) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t} L(y, \langle w_t, x \rangle)$ and $\hat{R}(W)$ the empirical error. Assume $L(y, \cdot)$ is $\phi$-Lipschitz and $\|x_t^i\| \le 1$. If $\hat{W} \in \mathrm{argmin}\big\{ \hat{R}(W) : \|W\|_{\mathrm{tr}} \le B\sqrt{T} \big\}$ then with probability at least $1 - \delta$
$$R(\hat{W}) - R(W^*) \le 2\phi B \left( \sqrt{\frac{\|\hat{C}\|_\infty}{n}} + \sqrt{\frac{2(\ln(nT) + 1)}{nT}} \right) + \sqrt{\frac{8 \ln(3/\delta)}{nT}}$$
with $\hat{C} = \frac{1}{nT} \sum_{t,i} x_t^i (x_t^i)^\top$ and $W^* \in \mathrm{argmin}\, R(W)$.

Interpretation: assume $\mathrm{rank}(W^*) = K$, $\|w_t\| \le 1$ and let $B = \sqrt{K}$. If the inputs are uniformly distributed, as $T$ grows we have an $O(\sqrt{K/(nd)})$ bound, as compared to $O(\sqrt{1/n})$ for single task learning.
Slide 13: Experiment (computer survey)

[Figure: test error vs. number of tasks; eigenvalues of D.]

- Performance improves with more tasks.
- A single most important feature is shared by everyone.

Dataset [Lenk et al. 1996]: consumers' ratings of PC models: 180 persons (tasks), 8 training and 4 test points each, 13 inputs (RAM, CPU, price, etc.), output in {0, ..., 10} (likelihood of purchase).
Slide 14: Experiment (computer survey, cont.)

Method                        Test error
Independent                   -
Aggregate                     5.52
Quadratic (best c in [0,1])   4.37
Structured sparsity           4.04
Trace norm                    3.72
Quadratic + trace             3.20

The most important feature (1st eigenvector of D) weighs technical characteristics (RAM, CPU, CD-ROM) against price.

[Figure: first eigenvector of D across the 13 inputs (TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR).]
Slide 15: More complex models

Several possible extensions of the above formulations:
- Multiple low dimensional subspaces [Argyriou et al. 2008b, Kang et al. 2011]
- Encourage heterogeneous features [Romera-Paredes et al. 2012]
- Sparse coding [Kumar and Daumé III 2012, Maurer et al. 2013]
- Multilinear models [Romera-Paredes et al. 2013]
Slide 16: Multilinear MTL [Romera-Paredes et al. 2013]

Tasks associated with a multi-index, e.g. $t = (t_1, t_2)$.

Example: predict action-unit (e.g. cheek raiser) activation for different people [Lucey et al. 2011]: $t_1$ = identity, $t_2$ = action-unit.
Slide 17: Multilinear MTL (cont.)

Let $W \in \mathbb{R}^{T_1 \times T_2 \times d}$, with $W_{t_1, t_2, :} \in \mathbb{R}^d$ the $(t_1, t_2)$-th regression task, for $t_1 = 1, \dots, T_1$, $t_2 = 1, \dots, T_2$.

Goal: control the rank of each matricization of $W$:
$$\mathrm{rank}(W_{(1)}) + \mathrm{rank}(W_{(2)}) + \mathrm{rank}(W_{(3)})$$
where $W_{(n)}$ is the mode-$n$ matricization of $W$.

Convex lower bound [Liu et al. 2011, Gandy et al. 2011, Signoretto et al. 2012]:
$$\sum_{n=1}^{3} \|W_{(n)}\|_{\mathrm{tr}}$$

Regularization problem solved by the alternating direction method of multipliers [Gandy et al. 2011].
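The convex surrogate is easy to evaluate; a numpy sketch (helper name ours):

```python
import numpy as np

def sum_matricization_trace_norms(W):
    """sum_n ||W_(n)||_tr for a 3-way tensor W: unfold along each mode
    and add up the singular values of each unfolding."""
    total = 0.0
    for n in range(W.ndim):
        W_n = np.moveaxis(W, n, 0).reshape(W.shape[n], -1)  # mode-n matricization
        total += np.linalg.svd(W_n, compute_uv=False).sum()
    return total
```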
Slide 18: Multilinear MTL (cont.)

Alternative approach using a Tucker decomposition:
$$W_{t_1, t_2, j} = \sum_{s_1=1}^{S_1} \sum_{s_2=1}^{S_2} \sum_{k=1}^{K} G_{s_1, s_2, k}\, A_{t_1, s_1}\, B_{t_2, s_2}\, C_{j, k}$$

Transfer learning experiment (more general setting involving distinct sets of training and target tasks).

[Figure: correlation vs. m (training set size) for GRR, GTNR, GMF, TTNR and TF.]
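The Tucker reconstruction above is a single tensor contraction; a numpy sketch:

```python
import numpy as np

def tucker_reconstruct(G, A, B, C):
    """W[t1, t2, j] = sum_{s1, s2, k} G[s1, s2, k] * A[t1, s1] * B[t2, s2] * C[j, k].
    G : (S1, S2, K), A : (T1, S1), B : (T2, S2), C : (d, K)."""
    return np.einsum('abk,ta,ub,jk->tuj', G, A, B, C)
```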
Slide 19: Exploiting unrelated groups of tasks [Romera-Paredes et al. 2012]

Example: recognizing identity and emotion on a set of faces (emotion-related features vs. identity-related features).

Assumptions:
1. Low rank within each group.
2. Tasks from different groups tend to use orthogonal features.

[Figure: misclassification error rate vs. training set size for Ridge Regression, MTL, MTL-2G, OrthoMTL and OrthoMTL-EN.]

$$\min_{W, V} \left\{ \hat{R}_{\mathrm{em}}(W) + \hat{R}_{\mathrm{id}}(V) + \lambda \big\| [W, V] \big\|_{\mathrm{tr}} + \rho \big\| W^\top V \big\|_{\mathrm{Fr}}^2 \right\}$$

Related convex problem under conditions.
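A numpy sketch of the two penalty terms, assuming the orthogonality term is $\|W^\top V\|_{\mathrm{Fr}}^2$ as reconstructed above (the helper name is ours):

```python
import numpy as np

def unrelated_groups_penalty(W, V, lam, rho):
    """Trace norm of the stacked matrix [W, V] couples all tasks;
    ||W^T V||_Fr^2 pushes the two groups toward orthogonal features.
    W : (d, T_em), V : (d, T_id)."""
    trace = np.linalg.svd(np.hstack([W, V]), compute_uv=False).sum()
    overlap = np.linalg.norm(W.T @ V, 'fro') ** 2
    return lam * trace + rho * overlap
```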
Slide 20: Multi-task learning with dictionaries [Maurer, Pontil, Romera-Paredes, 2013]

Natural extension of sparse coding [Olshausen and Field 1996]:
$$\min_{U, A} \left\{ \sum_{t,i} L(y_t^i, \langle w_t, x_t^i \rangle) \;:\; w_t = U a_t,\ \|u_k\|_2 \le 1,\ \|a_t\|_1 \le \alpha \right\}$$

[Figure: multiclass accuracy vs. training set size for RR, MTFL and SC-MTL.]

Related method with Frobenius norm bound [Kumar and Daumé III, 2012].

Estimation bounds indicate potential improvement over single task learning.
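One way to handle the two constraint sets in an alternating scheme is by projection; a sketch of the standard building blocks (not the authors' exact algorithm): rescale the dictionary atoms to the unit ball, and project each code vector onto the $\ell_1$ ball.

```python
import numpy as np

def project_l1_ball(a, alpha):
    """Euclidean projection of a onto {x : ||x||_1 <= alpha}
    (sort-based method in the style of Duchi et al. 2008)."""
    if np.abs(a).sum() <= alpha:
        return a
    u = np.sort(np.abs(a))[::-1]           # magnitudes, descending
    css = np.cumsum(u)
    idx = np.arange(1, a.size + 1)
    k = np.nonzero(u * idx > css - alpha)[0][-1]
    tau = (css[k] - alpha) / (k + 1)       # soft-threshold level
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def project_atoms(U):
    """Rescale each dictionary atom u_k (column of U) so that ||u_k||_2 <= 1."""
    return U / np.maximum(np.linalg.norm(U, axis=0), 1.0)
```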
Slide 21: Conclusions

- Multi-task learning is ubiquitous: exploiting task relatedness provides substantial improvement over independent task learning.
- Presented families of regularizers which naturally extend complexity notions (smoothness and sparsity) used for single-task learning; touched upon statistical analyses and optimisation methods.
- Need for scalable algorithms, particularly in more complex task-relatedness scenarios such as multilinear models.
- Nonlinear MTL via reproducing kernel Hilbert spaces.
Slide 22: Thanks

Andreas Argyriou, Nadia Bianchi-Berthouze, Andrea Caponnetto, Theodoros Evgeniou, Karim Lounici, Andreas Maurer, Charles Micchelli, Bernardino Romera-Paredes, Alexandre Tsybakov, Sara van de Geer, Yiming Ying
Slides 23-25: References

[Ando and Zhang] A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR.
[Argyriou, Evgeniou, Pontil] Multi-task feature learning. NIPS.
[Argyriou, Evgeniou, Pontil] Convex multi-task feature learning. Machine Learning.
[Argyriou, Maurer, Pontil] An algorithm for transfer learning in a heterogeneous environment. ECML 2008b.
[Argyriou, Micchelli, Pontil] When is there a representer theorem? Vector versus matrix regularizers. JMLR.
[Argyriou, Micchelli, Pontil, Shen, Xu] Efficient first order methods for linear composite regularizers. arXiv preprint.
[Baxter] A model for inductive bias learning. JAIR.
[Ben-David and Schuller] Exploiting task relatedness for multiple task learning. COLT.
[Caponnetto, Micchelli, Pontil, Ying] Universal multi-task kernels. JMLR.
[Caruana] Multitask learning. Machine Learning.
[Dudík, Harchaoui, Malik] Lifted coordinate descent for learning with trace-norm regularization. AISTATS.
[Evgeniou and Pontil] Regularized multi-task learning. SIGKDD.
[Evgeniou, Micchelli, Pontil] Learning multiple tasks with kernel methods. JMLR.
[Lounici, Pontil, Tsybakov, van de Geer] Taking advantage of sparsity in multi-task learning. COLT.
[Lounici, Pontil, Tsybakov, van de Geer] Oracle inequalities and optimal inference under group sparsity. Annals of Statistics.
[Kakade, Shalev-Shwartz, Tewari] Regularization techniques for learning with matrices. JMLR.
[Kang, Grauman, Sha] Learning with whom to share in multi-task feature learning. ICML.
[Kumar and Daumé III] Learning task grouping and overlap in multi-task learning. ICML 2012.
[Lenk, DeSarbo, Green, Young] Hierarchical Bayes conjoint analysis: recovery of partworth heterogeneity from reduced experimental designs. Marketing Science.
[Maurer] Bounds for linear multi-task learning. JMLR.
[Maurer and Pontil] K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11).
[Maurer and Pontil] Structured sparsity and generalization. JMLR.
[Maurer, Pontil, Romera-Paredes] Sparse coding for multitask and transfer learning. ICML.
[Micchelli, Morales, Pontil] A family of penalty functions for structured sparsity. NIPS.
[Micchelli and Pontil] On learning vector-valued functions. Neural Computation.
[Romera-Paredes, Argyriou, Bianchi-Berthouze, Pontil] Exploiting unrelated tasks in multi-task learning. AISTATS.
[Romera-Paredes, Aung, Bianchi-Berthouze, Pontil] Multilinear multitask learning. ICML.
[Salakhutdinov, Torralba, Tenenbaum] Learning to share visual appearance for multiclass object detection. CVPR.
[Silver and Mercer] The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connection Science.
[Yu, Tresp, Schwaighofer] Learning Gaussian processes from multiple tasks. ICML.
[Thrun and Pratt] Learning to Learn. Springer.
[Torralba, Murphy, Freeman] Sharing features: efficient boosting procedures for multiclass object detection. CVPR.