Some tensor decomposition methods for machine learning

1 Some tensor decomposition methods for machine learning
Massimiliano Pontil, Istituto Italiano di Tecnologia and University College London
16 August

2 Outline
- Problem and motivation
- Tucker decomposition
- Two convex relaxations
- Applications
- Extensions

3 Problem
Learning a tensor from a set of linear measurements. Main example: tensor completion.
- Video denoising/completion
- Context-aware recommendation
- Multisensor data analysis
- Entity-relationship learning
- ...

4 Multilinear multitask learning
- Tasks are associated with multiple indices, e.g. predict the rating given to different aspects of a restaurant by different critics.
- Task regression vectors are vertical fibers of the tensor, e.g. the fiber ( · , food ).

5 Zero-shot transfer learning
Learning tasks for which no training instances are provided.

6 Setting
We consider a supervised learning setting in which we wish to learn a tensor from a set of measurements
    y = I(W) + ε
where I : R^{d_1 × ··· × d_N} → R^m is a linear (sampling) operator, e.g. one that returns a subset of the tensor entries.
The number of parameters explodes with the order of the tensor, so regularization is key:
    minimize_W  E(W) + λ R(W)
for example E(W) = ‖y − I(W)‖².
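To make the setting concrete, here is a minimal numpy sketch of a mask-based sampling operator and the regularized objective; the helper names (sampling_operator, objective) and the 30% observation rate are illustrative choices, not taken from the slides.

```python
import numpy as np

def sampling_operator(W, mask):
    """I(W): return the entries of W selected by a boolean mask (tensor completion)."""
    return W[mask]

def objective(W, y, mask, reg, lam):
    """E(W) + lambda * R(W), with squared-error data fit E(W) = ||y - I(W)||^2."""
    residual = y - sampling_operator(W, mask)
    return np.sum(residual ** 2) + lam * reg(W)

# Toy instance: observe 30% of the entries of a noisy third-order tensor.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(5, 6, 7))
mask = rng.random(W_true.shape) < 0.3
y = sampling_operator(W_true, mask) + 0.01 * rng.normal(size=mask.sum())
```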

7 Matrix case
The matrix case has been thoroughly studied, particularly through spectral regularizers which encourage low-rank matrices:
- low-rank matrix completion
- multitask feature learning
There are different notions of rank for a tensor (some computationally intractable); many are reviewed in [Kolda and Bader, 2009].
Which ones are simple and effective?

8 Objectives
We want some kind of guarantees:
- statistical: bounds on the generalization / out-of-sample error
- optimization: convergence, rates
- function approximation
We discuss two approaches based on:
- B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, M. Pontil. Multilinear multitask learning. 30th International Conference on Machine Learning, 2013.
- B. Romera-Paredes and M. Pontil. A new convex relaxation for tensor completion. Advances in Neural Information Processing Systems 26, 2013.

9 Function approximation
Learn a function parametrized by a tensor from a sample.
Example: x = (x_1, x_2, x_3) ∈ V_1 × V_2 × V_3, three vector spaces,
    f(x) = ⟨W, φ_1(x_1) ⊗ φ_2(x_2) ⊗ φ_3(x_3)⟩,   W ∈ R^{d_1 × d_2 × d_3}
with feature maps φ_i : V_i → R^{d_i}.
    min_W  Σ_{i=1}^m Loss(y_i, f(x_i)) + γ Ω(W)
Dual problem using kernel methods: K(x, t) = K_1(x_1, t_1) K_2(x_2, t_2) K_3(x_3, t_3).
Matrix case: [Abernethy, Bach, Evgeniou, and Vert, JMLR 2009]. Tensor case: [Signoretto, De Lathauwer, Suykens, Preprint 2013].
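For intuition, here is a small numpy sketch (assuming N = 3 and explicit finite-dimensional feature vectors; the function name f and the toy dimensions are illustrative) showing how ⟨W, φ_1(x_1) ⊗ φ_2(x_2) ⊗ φ_3(x_3)⟩ can be evaluated by contracting one mode at a time, without ever forming the rank-one tensor.

```python
import numpy as np

def f(W, phi1, phi2, phi3):
    """<W, phi1 ⊗ phi2 ⊗ phi3>: contract each mode of W with the matching feature vector."""
    return np.einsum('ijk,i,j,k->', W, phi1, phi2, phi3)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3, 5))
phi1, phi2, phi3 = rng.normal(size=4), rng.normal(size=3), rng.normal(size=5)
print(f(W, phi1, phi2, phi3))
```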

10 Basic operations
Tensor W ∈ R^{d_1 × ··· × d_N}, with entries W_{i_1,...,i_N}.
A mode-n fiber is a vector in R^{d_n} formed by the elements of the tensor obtained by fixing all indices but the n-th one; e.g. in the above example W_{i_1, i_2, :} ∈ R^{d_3} is the (i_1, i_2)-th regression vector.

11 Basic operations (cont.)
Matricization is the process of rearranging the tensor into a matrix.
The mode-n matricization W_{(n)} ∈ R^{d_n × J_n}, where J_n = ∏_{k ≠ n} d_k, is the matrix whose columns are the mode-n fibers of W.
[Figure: illustration of the mode-1 matricization W_{(1)} and the mode-3 matricization W_{(3)}]
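A minimal numpy sketch of fibers and mode-n matricization. The exact column ordering of an unfolding is a convention that differs across papers; the reshape below simply fixes one such ordering, and the helper name matricize is illustrative.

```python
import numpy as np

def matricize(W, n):
    """Mode-n matricization W_(n): its columns are the mode-n fibers of W."""
    # Move mode n to the front, then flatten all remaining modes into columns.
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

W = np.arange(24).reshape(2, 3, 4)
fiber = W[0, 1, :]                  # a mode-3 fiber: fix i1 = 0, i2 = 1
print(fiber)                        # [4 5 6 7]
print(matricize(W, 0).shape)        # (2, 12)
print(matricize(W, 2).shape)        # (4, 6)
```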

12 Tucker rank
    TR(W) = (rank(W_{(1)}), ..., rank(W_{(N)}))
A natural regularizer associated with this is the sum of the ranks of the matricizations:
    R(W) := Σ_{n=1}^N rank(W_{(n)})

13 Tucker decomposition
    W = G ×_1 A^{(1)} ×_2 ··· ×_N A^{(N)}
meaning that
    W_{i_1,...,i_N} = Σ_{j_1=1}^{K_1} ··· Σ_{j_N=1}^{K_N} G_{j_1,...,j_N} A^{(1)}_{i_1 j_1} ··· A^{(N)}_{i_N j_N}
with K_n ≤ d_n (typically much smaller).
- A^{(n)} ∈ R^{d_n × K_n}, n ∈ {1,...,N}, are the factor matrices.
- G ∈ R^{K_1 × ··· × K_N} is called the core tensor and models the interaction between the factors.
- By construction rank(W_{(n)}) ≤ K_n.
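A minimal numpy sketch of the Tucker construction for N = 3 (the helper name tucker_to_tensor and the toy sizes are illustrative): the core is contracted with one factor matrix per mode, and the resulting tensor indeed has mode-n matricizations of rank at most K_n.

```python
import numpy as np

def tucker_to_tensor(G, A1, A2, A3):
    """W = G x_1 A1 x_2 A2 x_3 A3 for a third-order core G."""
    return np.einsum('abc,ia,jb,kc->ijk', G, A1, A2, A3)

rng = np.random.default_rng(0)
K, d = (2, 3, 2), (5, 6, 7)
G = rng.normal(size=K)
A1, A2, A3 = (rng.normal(size=(d[n], K[n])) for n in range(3))
W = tucker_to_tensor(G, A1, A2, A3)
print(W.shape)                                     # (5, 6, 7)
print(np.linalg.matrix_rank(W.reshape(d[0], -1)))  # mode-1 rank <= K_1 = 2
```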

14 Nonconvex approach
The Tucker decomposition is invariant under multiplication and division of different factors by the same scalar. With the aim of avoiding this issue and reducing overfitting, we add Frobenius-norm regularization terms on the components:
    minimize_{G, A^{(1)},...,A^{(N)}}  E(G ×_1 A^{(1)} ×_2 ··· ×_N A^{(N)}) + α [ ‖G‖_F² + Σ_{n=1}^N ‖A^{(n)}‖_F² ]
where α is a regularization parameter.
- We solve this problem by alternating minimization.
- Each step is detailed in the paper (code also available).
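The paper gives the exact block updates; the sketch below is only an illustration of the alternating structure, assuming a masked squared-error loss and N = 3, and replacing each exact block solve with a single gradient step. All names (alt_min_step, lr) and defaults are illustrative, not the authors' algorithm.

```python
import numpy as np

def reconstruct(G, A1, A2, A3):
    return np.einsum('abc,ia,jb,kc->ijk', G, A1, A2, A3)

def alt_min_step(G, A1, A2, A3, Y, mask, alpha, lr=1e-3):
    """One sweep of block updates for E(W) + alpha*(||G||_F^2 + sum_n ||A_n||_F^2),
    where E is a masked squared error. Each block solve is replaced by a gradient step."""
    for block in ('A1', 'A2', 'A3', 'G'):
        R = np.where(mask, reconstruct(G, A1, A2, A3) - Y, 0.0)  # masked residual
        if block == 'A1':
            A1 = A1 - lr * (2 * np.einsum('ijk,abc,jb,kc->ia', R, G, A2, A3) + 2 * alpha * A1)
        elif block == 'A2':
            A2 = A2 - lr * (2 * np.einsum('ijk,abc,ia,kc->jb', R, G, A1, A3) + 2 * alpha * A2)
        elif block == 'A3':
            A3 = A3 - lr * (2 * np.einsum('ijk,abc,ia,jb->kc', R, G, A1, A2) + 2 * alpha * A3)
        else:
            G = G - lr * (2 * np.einsum('ijk,ia,jb,kc->abc', R, A1, A2, A3) + 2 * alpha * G)
    return G, A1, A2, A3
```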

15 Convex approach
An alternative approach is to relax the combinatorial problem
    argmin_W  E(W) + γ Σ_{n=1}^N rank(W_{(n)})
The trace norm is a widely used convex surrogate for the rank. Therefore, we can consider the following convex relaxation:
    argmin_W  E(W) + γ Σ_{n=1}^N ‖W_{(n)}‖_Tr
Tensor completion: [Liu et al, 2009; Gandy et al, 2011; Signoretto et al, 2012].
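Evaluating the relaxed regularizer only requires the nuclear (trace) norm of each unfolding. A minimal numpy sketch (the helper name sum_of_trace_norms is illustrative):

```python
import numpy as np

def matricize(W, n):
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def sum_of_trace_norms(W):
    """Sum over modes of the trace (nuclear) norm of each matricization."""
    return sum(np.linalg.norm(matricize(W, n), ord='nuc') for n in range(W.ndim))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 6, 7))
print(sum_of_trace_norms(W))
```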

16 Discussion
- In the Tucker decomposition the factors are explicit, hence we can add a new factor without relearning the full tensor.
- Space complexity: O(Σ_{n=1}^N d_n K_n + ∏_{n=1}^N K_n), which can be much smaller than that of the convex approach, particularly if K_n ≪ d_n.
- Main drawback: no guarantee of finding even a local minimum.
[Figure: correlation vs. training set size m, comparing GRR, GTNR, GMF, TTNR and TF]

17 Alternating Direction Method of Multipliers (ADMM)
- The regularizer Σ_{n=1}^N ‖W_{(n)}‖_Tr is composite; ADMM decouples the problem.
- Introduce auxiliary tensors B_n, n ∈ {1,...,N}, accounting for each term in the sum, adding the constraints B_n = W.
- Optimizing the resulting Lagrangian w.r.t. each B_n only involves computing the proximal operator of the trace norm:
    prox_Tr(V) = argmin_{X ∈ R^{d × d}} { (1/2) ‖X − V‖_F² + ‖X‖_Tr }
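The proximal operator of the trace norm is singular-value soft-thresholding. A minimal numpy sketch (tau is the threshold, equal to 1 in the display above and to 1/β in the ADMM steps that follow; the function name is illustrative):

```python
import numpy as np

def prox_trace_norm(V, tau=1.0):
    """prox of tau*||.||_Tr at V: soft-threshold the singular values of V by tau."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

V = np.random.default_rng(0).normal(size=(4, 6))
X = prox_trace_norm(V, tau=0.5)   # singular values <= tau are zeroed out
print(np.linalg.matrix_rank(X))
```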

18 ADMM
Want to minimize
    (1/γ) E(W) + Σ_{n=1}^N Ψ(W_{(n)})
Decouple the regularization term:
    min_{W, B_1,...,B_N} { (1/γ) E(W) + Σ_{n=1}^N Ψ(B_{n(n)}) :  B_n = W,  n = 1,...,N }
Augmented Lagrangian:
    L(W, B, C) = (1/γ) E(W) + Σ_{n=1}^N [ Ψ(B_{n(n)}) − ⟨C_n, W − B_n⟩ + (β/2) ‖W − B_n‖² ]

19 ADMM (cont.)
    L(W, B, C) = (1/γ) E(W) + Σ_{n=1}^N [ Ψ(B_{n(n)}) − ⟨C_n, W − B_n⟩ + (β/2) ‖W − B_n‖² ]
Updating equations:
    W^{[i+1]} ← argmin_W  L(W, B^{[i]}, C^{[i]})
    B_n^{[i+1]} ← argmin_{B_n}  L(W^{[i+1]}, B, C^{[i]})
    C_n^{[i+1]} ← C_n^{[i]} − β (W^{[i+1]} − B_n^{[i+1]})
- The 2nd step involves the computation of the proximity operator of Ψ.
- Convergence properties of ADMM are detailed e.g. in [Eckstein and Bertsekas, Mathematical Programming 1992].
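Putting the pieces together, here is a sketch of the full ADMM loop for tensor completion with Ψ the trace norm. The slides leave E(W) generic; this sketch assumes the masked squared error E(W) = ‖y − I(W)‖², so that the W-step has an entrywise closed form. Helper names (admm_completion, fold) and the default parameters are illustrative.

```python
import numpy as np

def matricize(W, n):
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def fold(M, n, shape):
    """Inverse of matricize: refold a mode-n unfolding back into a tensor of the given shape."""
    rest = shape[:n] + shape[n + 1:]
    return np.moveaxis(M.reshape((shape[n],) + rest), 0, n)

def prox_trace_norm(V, tau):
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def admm_completion(Y, mask, gamma=1.0, beta=1.0, iters=100):
    """ADMM sketch for min_W (1/gamma)||y - I(W)||^2 + sum_n ||W_(n)||_Tr."""
    N, shape = Y.ndim, Y.shape
    W = np.where(mask, Y, 0.0)
    B = [W.copy() for _ in range(N)]
    C = [np.zeros(shape) for _ in range(N)]
    for _ in range(iters):
        # W-step: quadratic in W, solved entrywise in closed form.
        numer = (2.0 / gamma) * np.where(mask, Y, 0.0) + sum(C[n] + beta * B[n] for n in range(N))
        denom = (2.0 / gamma) * mask + N * beta
        W = numer / denom
        for n in range(N):
            # B-step: prox of the trace norm applied to the mode-n unfolding of W - C_n / beta.
            A = matricize(W - C[n] / beta, n)
            B[n] = fold(prox_trace_norm(A, 1.0 / beta), n, shape)
            # Dual update.
            C[n] = C[n] - beta * (W - B[n])
    return W
```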

20 Proximity operator
Let B = B_{n(n)} and A = (W − (1/β) C_n)_{(n)}. Rewrite the 2nd step as
    B̂ = prox_{(1/β)Ψ}(A) := argmin_B { (1/2) ‖B − A‖² + (1/β) Ψ(B) }
Case of interest: Ψ(B) = ψ(σ(B)).
By von Neumann's matrix inequality,
    prox_{(1/β)Ψ}(A) = U_A diag( prox_{(1/β)ψ}(σ_A) ) V_A^⊤
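The identity above gives a generic recipe for the prox of any spectral function: take an SVD, apply the corresponding vector prox to the singular values, and rebuild the matrix. A short numpy sketch (spectral_prox and the example vector prox are illustrative names; the ℓ1 case recovers singular-value soft-thresholding):

```python
import numpy as np

def spectral_prox(A, vector_prox):
    """prox of a spectral function Psi(B) = psi(sigma(B)):
    apply the prox of psi to the singular values of A and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(vector_prox(s)) @ Vt

# Example: psi = (1/beta) * ||.||_1 gives soft-thresholding of the singular values.
beta = 2.0
soft_threshold = lambda s: np.maximum(s - 1.0 / beta, 0.0)
A = np.random.default_rng(0).normal(size=(4, 6))
B_hat = spectral_prox(A, soft_threshold)
print(B_hat.shape)
```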

21 Rethinking the convex approach
The convex envelope of a function f on a set S is the largest convex function, denoted f̌, majorized by f at every point in S.
E.g., cardinality of a vector: f(v) = card(v), S = {v : ‖v‖_∞ ≤ α}; then f̌(v) = ‖v‖_1 / α.
The smaller S, the tighter the convex envelope.
[Figure: card(v) and its convex envelopes f̌(v) for several values of α]
In practice α is unknown and tuned by cross-validation.

22 Rethinking the convex approach
The same reasoning applies to matrices: ‖W‖_Tr / α is the convex envelope of rank(W) on the spectral-norm ball of radius α [Fazel, Hindi, & Boyd, 2001].
By using the regularizer Σ_{n=1}^N ‖W_{(n)}‖_Tr we implicitly assume the same α for the different matricizations.
Difficulty with tensors: the spectrum of W_{(n)} varies with n!

23 Rethinking the convex approach
The spectrum of W_{(n)} varies with n. Let us consider an example tensor W:
[Figure: vectors of singular values of each matricization; smallest ℓ∞ ball containing each of the vectors; smallest ℓ∞ ball containing all vectors]

24 Rethinking the convex approach
- We are interested in convex functions of matrices which are invariant to the matricization operation.
- The Frobenius norm is very appealing: it is also a spectral function.
- Therefore, we consider the set S = {W : ‖W‖_F ≤ α}.
- Calculating the convex envelope of the rank on S can be reduced to calculating the convex envelope of card(v) on the set {v : ‖v‖_2 ≤ α}, where v is the vector of singular values of W (this follows by von Neumann's matrix inequality).

25 Convex envelope of the cardinality of a vector in the ℓ2 ball
[Figure: vectors of singular values of each matricization; smallest ℓ∞ ball containing each of the vectors; smallest ℓ∞ ball containing all vectors; smallest ℓ2 ball containing all vectors]

26 Convex envelope of the cardinality of a vector in the ℓ2 ball
Aim: derive the convex envelope of f_α(x) = card(x) on the set {x ∈ R^d : ‖x‖_2 ≤ α}.
The conjugate of f_α at s ∈ R^d is
    f_α*(s) = sup_{x ∈ αB_2} ⟨s, x⟩ − card(x) = max_{r ∈ {0,...,d}} { α ‖s_{1:r}‖_2 − r }
where s_{1:r} denotes the r largest-magnitude entries of s.
Biconjugate of f_α:
    f_α**(v) = sup_{s ∈ R^d} ⟨s, v⟩ − f_α*(s),   ‖v‖_2 ≤ α
We do not know how to compute f_α** in closed form, but we can compute it and its proximal operator by projected subgradient descent.
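The conjugate, by contrast, has the simple finite maximum above. A small numpy check of that formula (conjugate_card and the test vector are illustrative):

```python
import numpy as np

def conjugate_card(s, alpha):
    """f_alpha*(s) = max over r in {0,...,d} of alpha * ||s_{1:r}||_2 - r,
    where s_{1:r} keeps the r largest-magnitude entries of s."""
    mags = np.sort(np.abs(s))[::-1]
    partial_norms = np.sqrt(np.cumsum(mags ** 2))      # ||s_{1:r}||_2 for r = 1,...,d
    values = np.concatenate(([0.0], alpha * partial_norms - np.arange(1, s.size + 1)))
    return values.max()

s = np.array([3.0, -0.5, 0.2, 1.0])
print(conjugate_card(s, alpha=1.0))   # 2.0, attained at r = 1
```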

27 Quality of relaxation (cont.)
Lemma. If ‖x‖_2 = α then ω_α(x) = card(x), where ω_α denotes the convex envelope f_α** above.
Let
    Ω_α(W) = Σ_{n=1}^N ω_α(σ(W_{(n)})),    ‖W‖_tr = Σ_{n=1}^N ‖σ(W_{(n)})‖_1
Implication:
Theorem. If W satisfies (a), (b), (c) below then Ω_{p_min}(W) > ‖W‖_tr
a) ‖W_{(n)}‖ ≤ 1 for every n
b) ‖W‖_2 = p_min
c) min_n rank(W_{(n)}) < max_n rank(W_{(n)})
On the other hand, ω_1 is the convex envelope of card on the ℓ2 unit ball, so
    Ω_1(W) ≥ ‖W‖_tr for every W with ‖W‖_2 ≤ 1.

28 Experiments on tensor completion
Datasets (both cast as tensor completion problems):
- Video compression
- Exam score prediction
[Figures: RMSE vs. training set size m for the tensor trace norm and the proposed regularizer on both datasets, and a running-time comparison in seconds]

29 Extensions and further work
- Latent tensor norm
- Tensor nuclear norm and an open problem
- Controlling the rank of all matricizations (quantum physics)
- Kernel methods
- Statistical analysis

30 Latent tensor nuclear norm
Defined by the variational problem [Tomioka and Suzuki 2013, Wimalawarne et al 2014]
    ‖W‖_LNN = inf { Σ_{j=1}^m ‖σ(V_j)‖_1 : Σ_{j=1}^m M_j V_j = W }
where M_j is the adjoint of the matricization operator.
The associated unit ball is conv{ W : rank(W_{(n)}) ≤ 1, ‖W_{(n)}‖ ≤ 1 for some n }.

31 Latent tensor nuclear norm
If we change the above set to conv{ W : rank(W_{(n)}) ≤ k, ‖W_{(n)}‖_p ≤ 1 for some n }, the induced norm becomes [Combettes et al 2016]
    ‖W‖_LNN = inf { Σ_{j=1}^m ‖σ(V_j)‖_{p,k} : Σ_{j=1}^m M_j V_j = W }
where ‖·‖_{p,k} is the (p,k)-support norm [McDonald, Pontil, Stamos 2016]. Its unit ball is
    conv{ x ∈ R^d : card(x) ≤ k, ‖x‖_p ≤ 1 }
Ongoing experiments...

32 Tensor nuclear norm
Defined by
    ‖W‖_{TNN,p} = inf { ‖λ‖_p : Σ_{k=1}^K λ_k u_k^{(1)} ⊗ ··· ⊗ u_k^{(N)} = W }
where the infimum is over K ∈ N, λ = (λ_1,...,λ_K) ∈ R^K and vectors u_k^{(n)} ∈ R^{d_n} such that ‖u_k^{(n)}‖ = 1 for all n, k.
Originally proposed and claimed to be a norm in [Lim & Comon, 2010]; however, [Friedland & Lim, 2016] show that this quantity is well defined and a norm only when p = 1, while for p > 1 the infimum is always zero, disproving the claim in the former paper.

33 Tensor nuclear norm
[Friedland & Lim, 2016] also show that computing ‖W‖_TNN is NP-hard. Assume for simplicity d_1 = ··· = d_N = d.
Claim 1. We may restrict w.l.o.g. K ≤ d^{N−1}.
Claim 2. We can rewrite the norm as
    ‖W‖_TNN = inf { Σ_{k=1}^K ∏_{n=1}^N ‖v_k^{(n)}‖ : Σ_{k=1}^K v_k^{(1)} ⊗ ··· ⊗ v_k^{(N)} = W }
where now the v_k^{(n)} ∈ R^d are not constrained to have unit norm.

34 Tensor nuclear norm
Claim 3. Let
    φ(W) = E(W) + γ ‖W‖_TNN
and
    h(V^{(1)},..., V^{(N)}) = φ(V^{(1)} ∘ ··· ∘ V^{(N)}),
where V^{(1)} ∘ ··· ∘ V^{(N)} denotes the tensor Σ_k v_k^{(1)} ⊗ ··· ⊗ v_k^{(N)} built from the columns v_k^{(n)} of the factor matrices V^{(n)}.
If (V^{(1)},..., V^{(N)}) is a local minimum of h then W = V^{(1)} ∘ ··· ∘ V^{(N)} is a local minimum of φ.
True for N = 2 (easy to show).

35 Controlling the rank of all matricizations
Let c ⊆ {1,...,N} and W_{(c, c̄)} be the matricization obtained by taking the modes in c as rows and those in the complement c̄ as columns.
Fact: if
    max_{c ⊆ {1,...,N}} rank(W_{(c, c̄)}) ≤ k
then W has a compact decomposition, called a matrix product state in quantum physics [Vidal 2003]:
    W_{i_1,...,i_N} = trace(A^1_{i_1} ··· A^N_{i_N})
where the A^n_i are k × k matrices, so we can describe the tensor with O(k² Σ_n d_n) parameters.
However, with this method we do not have control of max_n rank(W_{(n)}).
If N ≤ 6 we may still obtain a latent-norm convex relaxation.
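A minimal numpy sketch of evaluating entries of a matrix-product-state tensor in the trace form above (the helper mps_entry, the bond dimension k and the toy sizes are illustrative). Note that for this periodic (trace) form the single-mode unfoldings can have rank up to k² rather than k, one way to see the last point about the lack of direct control on max_n rank(W_{(n)}).

```python
import numpy as np

def mps_entry(cores, indices):
    """W_{i_1,...,i_N} = trace(A^1_{i_1} A^2_{i_2} ... A^N_{i_N}).
    cores[n] has shape (d_n, k, k): one k x k matrix per value of the n-th index."""
    M = cores[0][indices[0]]
    for core, i in zip(cores[1:], indices[1:]):
        M = M @ core[i]
    return np.trace(M)

# Toy MPS with N = 3 modes, dimensions d = 4 and bond dimension k = 2.
rng = np.random.default_rng(0)
d, k, N = 4, 2, 3
cores = [rng.normal(size=(d, k, k)) for _ in range(N)]
W = np.array([[[mps_entry(cores, (i, j, l)) for l in range(d)]
               for j in range(d)] for i in range(d)])
print(W.shape)                                   # (4, 4, 4)
print(np.linalg.matrix_rank(W.reshape(d, -1)))   # bounded by k**2 for the trace form
```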

36 THANK YOU!
