Multi-Task Learning and Matrix Regularization

1 Multi-Task Learning and Matrix Regularization Andreas Argyriou Department of Computer Science University College London

2 Collaborators T. Evgeniou (INSEAD) R. Hauser (University of Oxford) M. Herbster (University College London) A. Maurer (Stemmer Imaging) C.A. Micchelli (SUNY Albany) M. Pontil (University College London) Y. Ying (University of Bristol) 1

3 Main Themes Machine learning Convex optimization Sparse recovery 2

4 Outline Multi-task learning and related problems Matrix learning and an alternating algorithm Extensions of the method Multi-task representer theorems Kernel hyperparameter learning; convex kernel learning 3

5 Supervised Learning (Single-Task) $m$ examples are given: $(x_1, y_1), \dots, (x_m, y_m) \in X \times Y$. Predict using a function $f : X \to Y$. Want the function to generalize well over the whole of $X \times Y$. Includes regression, classification, etc. Task = probability measure on $X \times Y$

6 Multi-Task Learning Tasks $t = 1, \dots, n$; $m$ examples per task are given: $(x_{t1}, y_{t1}), \dots, (x_{tm}, y_{tm}) \in X \times Y$. Predict using functions $f_t : X \to Y$, $t = 1, \dots, n$. When the tasks are related, learning the tasks jointly should perform better than learning each task independently. Especially important when few data points are available per task (small $m$); in such cases, independent learning is not successful

7 Multi-Task Learning (contd.) One goal is to learn what structure is common across the $n$ tasks. Want simple, interpretable models that can explain multiple tasks. Want good generalization on the $n$ given tasks but also on new tasks (transfer learning). Given a few examples from a new task $t'$, $\{(x_{t'1}, y_{t'1}), \dots, (x_{t'l}, y_{t'l})\}$, want to learn $f_{t'}$ using just the learned task structure

8 Learning Theoretic View: Environment of Tasks Environment = probability distribution on a set of learning tasks [Baxter, 1996]. To sample a task-specific sample from the environment: draw a function $f_t$ from the environment, then generate a sample $\{(x_{t1}, y_{t1}), \dots, (x_{tm}, y_{tm})\} \in (X \times Y)^m$ using $f_t$. Multi-task learning means learning a common hypothesis space

9 Learning Theoretic View (contd.) Baxter's results: as $n$ (#tasks) increases, $m$ (#examples per task needed) decreases as $O\!\left(\frac{1}{n}\right)$. Once we have learned a hypothesis space $H$, we can use it to learn a new task drawn from the same environment; the sample complexity depends on the log-capacity of $H$. Other results: task relatedness due to input transformations gives improved multi-task bounds in some cases [Ben-David & Schuller, 2003]; using common feature maps (bounded linear operators), error bounds depend on the Hilbert-Schmidt norm [Maurer, 2006]

10 Multi-Task Applications Multi-task learning is ubiquitous. Human intelligence relies on transferring learned knowledge from previous tasks to new tasks. E.g. character recognition (very few examples should be needed to recognize new characters). Integration of medical / bioinformatics databases

11 Multi-Task Applications (contd.) Marketing databases, collaborative filtering, recommendation systems (e.g. Netflix); task = product preferences for each person 10

12 Multi-Task Applications (contd.) Multiple object classification in scenes: an image may contain multiple objects; learning common visual features enhances performance 11

13 Related Problems Sparse coding (some images share common basis images) Vector-valued / structured output Multi-class problems Regression with grouped variables, multifactor ANOVA in statistics; (selection of groups of variables) Multi-task learning is a broad problem; no single method can solve everything 12

14 Learning Multiple Tasks with a Common Kernel Let $X \subseteq \mathbb{R}^d$, $Y \subseteq \mathbb{R}$ and let us learn $n$ linear functions $f_t(x) = \langle w_t, x \rangle$, $t = 1, \dots, n$ (we ignore nonlinearities for the moment). Want to impose common structure / relatedness across tasks. Idea: use a common linear kernel for all tasks, $K(x, x') = \langle x, D x' \rangle$ (where $D \succeq 0$)

15 Learning Multiple Tasks with a Common Kernel
For every $t = 1, \dots, n$ solve
$$\min_{w_t \in \mathbb{R}^d} \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \gamma \langle w_t, D^{-1} w_t \rangle$$
Adding up, we obtain the equivalent problem
$$\min_{w_1, \dots, w_n \in \mathbb{R}^d} \sum_{t=1}^n \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \gamma \sum_{t=1}^n \langle w_t, D^{-1} w_t \rangle$$
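For concreteness, here is a minimal numpy sketch (not from the slides; it assumes the square loss and uses made-up data) of one per-task problem: with the common linear kernel $K(x, x') = \langle x, D x' \rangle$ this is just kernel ridge regression, and the weight vector is recovered as $w_t = D X_t^\top \alpha$, without ever inverting $D$.

```python
import numpy as np

def solve_task(X, y, D, gamma):
    """min_w  sum_i (<w, x_i> - y_i)^2 + gamma <w, D^{-1} w>  (square loss).
    Solved in the dual with Gram matrix X D X^T, so D is never inverted."""
    G = X @ D @ X.T                                    # common linear kernel
    alpha = np.linalg.solve(G + gamma * np.eye(len(y)), y)
    return D @ X.T @ alpha                             # w = D X^T alpha

# toy usage with synthetic data
rng = np.random.default_rng(0)
X, y = rng.standard_normal((20, 5)), rng.standard_normal(20)
D = np.eye(5) / 5                                      # a common metric with tr(D) = 1
w = solve_task(X, y, D, gamma=0.1)
```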

16 Learning Multiple Tasks with a Common Kernel
For multi-task learning, we want to learn the common kernel from a convex set of kernels:
$$\inf_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ D \succeq 0,\ \mathrm{tr}(D) \le 1}} \sum_{t=1}^n \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \gamma\, \mathrm{tr}(W^\top D^{-1} W) \qquad (MTL)$$
We denote $W = [w_1 \cdots w_n]$ and note that $\mathrm{tr}(W^\top D^{-1} W) = \sum_{t=1}^n \langle w_t, D^{-1} w_t \rangle$

17 Learning Multiple Tasks with a Common Kernel
Jointly convex problem in $(W, D)$. The constraint $\mathrm{tr}(D) \le 1$ is important. Fixing $W$, the optimal $D(W)$ is $D(W) \propto (WW^\top)^{1/2}$ ($D(W)$ is usually not in the feasible set, because of the inf). Once we have learned $\hat{D}$, we can transfer it to the learning of a new task $t'$:
$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^m E\big(\langle w, x_{t'i} \rangle, y_{t'i}\big) + \gamma \langle w, \hat{D}^{-1} w \rangle$$

18 Alternating Minimization Algorithm
Alternating minimization over $W$ (supervised learning) and $D$ (unsupervised correlation of tasks).
Initialization: set $D = \frac{I_{d \times d}}{d}$
while convergence condition is not true do
  for $t = 1, \dots, n$: learn $w_t$ independently by minimizing $\sum_{i=1}^m E\big(\langle w, x_{ti} \rangle, y_{ti}\big) + \gamma \langle w, D^{-1} w \rangle$
  end for
  set $D = \dfrac{(WW^\top)^{1/2}}{\mathrm{tr}\big((WW^\top)^{1/2}\big)}$
end while
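A compact numpy sketch of the loop above, assuming the square loss in the $w_t$ steps and adding a small $\varepsilon$-perturbation in the $D$ step for numerical stability (this anticipates the perturbed problem on the next slide); all names and the toy data are illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def alternating_mtl(Xs, ys, gamma=0.1, eps=1e-6, n_iter=50):
    """Alternate the supervised w_t steps (square loss, dual form) with the
    closed-form D step.  Xs, ys: one (m x d) matrix and (m,) vector per task."""
    d = Xs[0].shape[1]
    D = np.eye(d) / d                                  # initialization: D = I/d
    for _ in range(n_iter):
        cols = []
        for X, y in zip(Xs, ys):                       # supervised step per task
            alpha = np.linalg.solve(X @ D @ X.T + gamma * np.eye(len(y)), y)
            cols.append(D @ X.T @ alpha)               # w_t = D X^T alpha
        W = np.column_stack(cols)
        # D step: D = (W W^T + eps I)^{1/2} / tr((W W^T + eps I)^{1/2})
        C = np.real(sqrtm(W @ W.T + eps * np.eye(d)))
        D = C / np.trace(C)
    return W, D

# toy usage: 10 tasks whose weight vectors share a 2-dimensional subspace of R^8
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((8, 2)))[0]
Xs = [rng.standard_normal((15, 8)) for _ in range(10)]
ys = [X @ U @ rng.standard_normal(2) + 0.1 * rng.standard_normal(15) for X in Xs]
W, D = alternating_mtl(Xs, ys)
```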

19 Alternating Minimization (contd.) Each $w_t$ step is a regularization problem (e.g. SVM, ridge regression, etc.); it does not require computation of the (pseudo)inverse of $D$. Each $D$ step requires an SVD; this is usually the most costly step

20 Alternating Minimization (contd.)
The algorithm (with some perturbation) converges to an optimal solution of
$$\min_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ D \succeq 0,\ \mathrm{tr}(D) \le 1}} \sum_{t=1}^n \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \gamma\, \mathrm{tr}\big(D^{-1}(WW^\top + \varepsilon I)\big) \qquad (R_\varepsilon)$$
Theorem. An alternating algorithm for problem $(R_\varepsilon)$ has the property that its iterates $(W^{(k)}, D^{(k)})$ converge to the minimizer of $(R_\varepsilon)$ as $k \to \infty$.
Theorem. Consider a sequence $\varepsilon_l \to 0^+$ and let $(W_l, D_l)$ be the minimizer of $(R_{\varepsilon_l})$. Then any limit point of the sequence $\{(W_l, D_l)\}$ is an optimal solution of the problem (MTL).
Note: the starting value of $D$ does not matter

21 Alternating Minimization (contd.) [Plots: objective function vs. seconds, #iterations, and #tasks, comparing the alternating algorithm (green/blue) with gradient descent at learning rates $\eta = 0.05, 0.03, 0.01$.] Compare computational cost with a gradient descent approach ($\eta$ := learning rate)

22 Alternating Minimization (contd.) Typically fewer than 50 iterations needed in experiments At least an order of magnitude fewer iterations than gradient descent (but cost per iteration is larger) Scales better with the number of tasks Both methods require SVD (costly if d is large) Alternative algorithms: SOCP methods [Srebro et al. 2005, Liu and Vandenberghe 2008], gradient descent on matrix factors [Rennie & Srebro 2005], singular value thresholding [Cai et al. 2008] 21

23 Trace Norm Regularization
Eliminating $D$ in optimization problem (MTL) yields
$$\min_{W \in \mathbb{R}^{d \times n}} \sum_{t=1}^n \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \gamma\, \|W\|_{\mathrm{tr}}^2 \qquad (TR)$$
The trace norm (or nuclear norm) $\|W\|_{\mathrm{tr}}$ is the sum of the singular values of $W$. There has been recent interest in trace norm / rank problems in matrix factorization, statistics, matrix completion, etc. [Cai et al. 2008, Fazel et al. 2001, Izenman 1975, Liu and Vandenberghe 2008, Srebro et al. 2005]
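A quick numerical check (illustrative, not from the slides): the trace norm can be computed either as the sum of singular values or as $\mathrm{tr}\big((WW^\top)^{1/2}\big)$, the quantity that appears once $D$ is eliminated.

```python
import numpy as np
from scipy.linalg import sqrtm

W = np.random.default_rng(0).standard_normal((5, 5))
trace_norm = np.linalg.svd(W, compute_uv=False).sum()   # sum of singular values
via_sqrtm = np.trace(np.real(sqrtm(W @ W.T)))           # tr((W W^T)^{1/2})
assert np.isclose(trace_norm, via_sqrtm)
```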

24 Trace Norm vs. Rank
Problem (TR) is a convex relaxation of the problem
$$\min_{W \in \mathbb{R}^{d \times n}} \sum_{t=1}^n \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \gamma\, \mathrm{rank}(W)$$
which is NP-hard (at least as hard as Boolean LP). Rank and trace norm correspond to $L_0$, $L_1$ on the vector of singular values. Multi-task intuition: we want the task parameter vectors $w_t$ to lie on a low dimensional subspace

25 Connection to Group Lasso
Problem (MTL) is equivalent to
$$\min_{\substack{A \in \mathbb{R}^{d \times n} \\ U \in \mathbb{R}^{d \times d},\ U^\top U = I}} \sum_{t=1}^n \sum_{i=1}^m E\big(\langle a_t, U^\top x_{ti} \rangle, y_{ti}\big) + \gamma\, \|A\|_{2,1}^2$$
where $\|A\|_{2,1} := \sum_{i=1}^d \sqrt{\sum_{t=1}^n a_{it}^2}$
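An illustrative check of this connection (made-up data): taking $U$ from the SVD of $W$ and $A = U^\top W$, the $(2,1)$ norm of $A$, i.e. the sum of its row norms, equals the trace norm of $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))                  # d x n matrix of task vectors

U, s, Vt = np.linalg.svd(W, full_matrices=True)  # U is the d x d left factor
A = U.T @ W                                      # rotate into the learned feature basis
l21 = np.sum(np.sqrt(np.sum(A ** 2, axis=1)))    # ||A||_{2,1}: sum of row norms

assert np.isclose(l21, s.sum())                  # equals the trace norm of W
```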

26 Experiment (Computer Survey) Consumers' ratings of products [Lenk et al. 1996]: 180 persons (tasks); 8 PC models (training examples), 4 PC models (test examples); 13 binary input variables (RAM, CPU, price, etc.) + bias term; integer output in {0,..., 10} (likelihood of purchase). The square loss was used

27 Experiment (Computer Survey) [Plots: test error vs. #tasks; eigenvalues of $D$.] Performance improves with more tasks (for learning the tasks independently, error = 16.53). A single most important feature is shared by all persons

28 Experiment (Computer Survey)
[Bar plot: components of the dominant eigenvector $u$ of $D$ over the input variables (TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR).]
Method                              RMSE
Alternating Alg.
Hierarchical Bayes [Lenk et al.]    1.90
Independent                         3.88
Aggregate                           2.35
Group Lasso                         2.01
The most important feature (eigenvector of $D$) weighs technical characteristics (RAM, CPU, CD-ROM) vs. price

29 Spectral Regularization
Generalize (MTL):
$$\inf_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ D \succ 0,\ \mathrm{tr}(D) \le 1}} \sum_{t=1}^n \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \gamma\, \mathrm{tr}(W^\top F(D) W)$$
where $F$ is a spectral matrix function induced by $f : (0, +\infty) \to (0, +\infty)$, i.e. $F(U \Lambda U^\top) = U\, \mathrm{diag}[f(\lambda_1), \dots, f(\lambda_d)]\, U^\top$. Eliminating $D$ gives a problem of the form
$$\min_{W \in \mathbb{R}^{d \times n}} \sum_{t=1}^n \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \gamma\, \Omega(W)$$
It can be shown that $\Omega(W)$ is a function of the singular values of $W$
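A small numpy helper (a sketch; the function name is made up) showing how the spectral matrix function $F$ acts on a symmetric $D \succ 0$: apply $f$ to the eigenvalues and rotate back. With $f(\lambda) = \lambda^{-1}$ this recovers the $D^{-1}$ regularizer of (MTL).

```python
import numpy as np

def spectral_apply(f, D):
    """F(U Lambda U^T) = U diag[f(lambda_1), ..., f(lambda_d)] U^T for symmetric D > 0."""
    lam, U = np.linalg.eigh(D)
    return (U * f(lam)) @ U.T            # scale each eigenvector by f(eigenvalue)

# example: f(lambda) = 1/lambda gives F(D) = D^{-1}
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
D = M @ M.T + 0.1 * np.eye(4)            # a positive definite matrix
assert np.allclose(spectral_apply(lambda x: 1 / x, D), np.linalg.inv(D))
```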

30 Spectral Regularization (contd.)
In particular, if $f(\lambda) = \lambda^{1 - \frac{2}{p}}$, $p \in (0, 2]$, we have $\Omega(W) = \|W\|_p^2$, where $\|W\|_p$ is the Schatten $L_p$ (pre)norm of the singular values of $W$.
Theorem. The regularizer $\mathrm{tr}(W^\top F(D) W)$ is jointly convex if and only if $\frac{1}{f}$ is matrix concave of order $d$, that is,
$$\mu \left(\tfrac{1}{F}\right)(A) + (1 - \mu) \left(\tfrac{1}{F}\right)(B) \preceq \left(\tfrac{1}{F}\right)(\mu A + (1 - \mu) B)$$
for all $A, B \succ 0$, $\mu \in [0, 1]$.
Spectral problems appear also in graph applications, control theory, etc.

31 Learning Groups of Tasks
Assume a heterogeneous environment, i.e. $K$ low dimensional subspaces. Learn a partition of the tasks into $K$ groups:
$$\inf_{\substack{D_1, \dots, D_K \succeq 0 \\ \mathrm{tr}(D_k) \le 1}} \sum_{t=1}^n \min_{k = 1, \dots, K}\ \min_{w_t \in \mathbb{R}^d} \left\{ \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \gamma \langle w_t, D_k^{-1} w_t \rangle \right\}$$
The representation learned is $(\hat{D}_1, \dots, \hat{D}_K)$; we can transfer this representation to easily learn a new task. Non-convex problem; we use stochastic gradient descent
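A sketch of the discrete part of this formulation (square loss, dual-form solver, hypothetical helper names): for fixed $D_1, \dots, D_K$, each task is scored against every group and assigned to the one attaining the inner minimum.

```python
import numpy as np

def task_cost(X, y, D, gamma):
    """min_w ||X w - y||^2 + gamma <w, D^{-1} w> (square loss, D positive definite)."""
    alpha = np.linalg.solve(X @ D @ X.T + gamma * np.eye(len(y)), y)
    w = D @ X.T @ alpha                              # optimal w for this (task, group)
    # <w, D^{-1} w> = alpha^T X w, since w = D X^T alpha
    return np.sum((X @ w - y) ** 2) + gamma * alpha @ (X @ w)

def assign_groups(Xs, ys, Ds, gamma=0.1):
    """For each task, pick the group k whose metric D_k attains the inner minimum."""
    return [int(np.argmin([task_cost(X, y, D, gamma) for D in Ds]))
            for X, y in zip(Xs, ys)]
```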

32 Experiment (Character Recognition - Projection on Image Halves) 6 vs. 1 task (on the right half) Binary classification tasks on images One half of the image contains the relevant character, the other half contains a randomly chosen character Two groups of tasks (with probabilities 50-50%): projection on left half - projection on right half 31

33 Experiment (Character Recognition - Projection on Image Halves) Training set contains pairs of alphabetic characters 1000 tasks, 10 examples per task Wish to obtain a representation that represents rotation invariance on either half of the image Wish to transfer this representation to pairs of digits 32

34 Experiment (Character Recognition - Projection on Image Halves)
[Plot: transfer error for different methods (Independent, K = 1, K = 2).]
Assignment of tasks to groups (with K = 2):
                             D_1      D_2
All digits (left & right)    48.2%    51.8%
Left                         99.2%    0.8%
Right                        1.4%     98.6%
Training data                50.7%    49.3%

35 Experiment (Character Recognition - Projection on Image Halves) [Images] Dominant eigenvectors of $D$ (for $K = 1$) and of $D_1$, $D_2$ (for $K = 2$)

36 Experiment (Character Recognition - Projection on Image Halves) [Plots] Spectrum of $D$ learned with $K = 1$ (left); spectra of $D_1$ (middle) and $D_2$ (right) learned with $K = 2$

37 Representer Theorems
All previous formulations satisfy a multi-task representer theorem:
$$\hat{w}_t = \sum_{s=1}^n \sum_{i=1}^m c^{(t)}_{si} x_{si}, \qquad t \in \{1, \dots, n\} \qquad (R.T.)$$
Consequently, a nonlinear kernel can be used. All tasks are involved in this expression (unlike the single-task representer theorem obtained with Frobenius norm regularization). Generally, consider any matrix optimization problem of the form
$$\min_{w_1, \dots, w_n \in \mathbb{R}^d} \sum_{t=1}^n \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \Omega(W)$$

38 Representer Theorems (contd.)
Definitions: $S^n_+$ = the positive semidefinite cone. The function $h : S^n_+ \to \mathbb{R}$ is matrix nondecreasing if $h(A) \le h(B)$ for all $A, B \in S^n_+$ s.t. $A \preceq B$.
Theorem. The representer theorem (R.T.) holds if and only if there exists a matrix nondecreasing function $h : S^n_+ \to \mathbb{R}$ such that $\Omega(W) = h(W^\top W)$ for all $W \in \mathbb{R}^{d \times n}$ (under differentiability assumptions)

39 Representer Theorems (contd.)
Corollary. The standard representer theorem for single-task learning ($n = 1$),
$$\hat{w} = \sum_{i=1}^m c_i x_i$$
holds if and only if there exists a nondecreasing function $h : \mathbb{R}_+ \to \mathbb{R}$ such that $\Omega(w) = h(\langle w, w \rangle)$ for all $w \in \mathbb{R}^d$.
Sufficiency of the condition has been known [Kimeldorf & Wahba, 1970, Schölkopf et al., 2001, etc.]
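A small numerical illustration of the corollary for ridge regression ($\Omega(w) = \langle w, w \rangle$, square loss, made-up data): the minimizer lies in the span of the training inputs, $\hat{w} = X^\top c$ with $c = (XX^\top + \gamma I)^{-1} y$.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y, gamma = rng.standard_normal((10, 30)), rng.standard_normal(10), 0.5

# primal ridge solution: argmin_w ||X w - y||^2 + gamma <w, w>
w_primal = np.linalg.solve(X.T @ X + gamma * np.eye(30), X.T @ y)

# representer theorem: w = X^T c, i.e. a combination of the 10 training inputs
c = np.linalg.solve(X @ X.T + gamma * np.eye(10), y)
assert np.allclose(w_primal, X.T @ c)
```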

40 Implications Kernelization. In single-task learning, the choice of $h$ does not matter essentially; however, in multi-task learning, the choice of $h$ is important (since $\preceq$ is a partial ordering). Many valid regularizers: Schatten $L_p$ norms $\|\cdot\|_p$, rank, orthogonally invariant norms, norms of the type $\|W M\|_p$, etc. In matrix learning, kernels and sparsity can be exploited in the same model

41 Connection to Learning the Kernel
Recall problem (MTL):
$$\min_{D \succeq 0,\ \mathrm{tr}(D) \le 1}\ \min_{w_1, \dots, w_n \in \mathbb{R}^d} \sum_{t=1}^n \left[ \sum_{i=1}^m E\big(\langle w_t, x_{ti} \rangle, y_{ti}\big) + \gamma \langle w_t, D^{-1} w_t \rangle \right] \qquad (MTL)$$
This amounts to learning a common kernel $K(x, x') = \langle x, D x' \rangle$ within the convex hull of an infinite number of linear kernels. It extends the formulation of [Lanckriet et al. 2004] (single task),
$$\min_{K \in \mathcal{K}}\ \min_{c \in \mathbb{R}^m} \sum_{i=1}^m E\big((Kc)_i, y_i\big) + \gamma \langle c, Kc \rangle$$
in which $\mathcal{K}$ was a polytope (convex hull of a finite set of kernels)

42 A General Framework for Learning the Kernel
The convex set $\mathcal{K}$ is generated by basic kernels: $\mathcal{K} = \mathrm{conv}(\mathcal{B})$.
Example 1: finite set of basic kernels (aka "multiple kernel learning").
Example 2: linear basic kernels $B(x, x') = \langle x, D x' \rangle$, where $D$ belongs to a bounded convex set (e.g. (MTL)).
Example 3: Gaussian basic kernels $B(x, x') = e^{-\langle x - x',\ \Sigma^{-1}(x - x') \rangle}$, where $\Sigma$ belongs to a convex subset of the p.s.d. cone
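Illustrative constructions (numpy; the function names are hypothetical) of Gram matrices for Examples 2 and 3 on a sample $x_1, \dots, x_m$; Example 1 is simply a finite list of precomputed Gram matrices.

```python
import numpy as np

def linear_basic_kernel(X, D):
    """Example 2: B(x, x') = <x, D x'> for a fixed p.s.d. matrix D."""
    return X @ D @ X.T

def gaussian_basic_kernel(X, Sigma):
    """Example 3: B(x, x') = exp(-<x - x', Sigma^{-1} (x - x')>)."""
    P = np.linalg.inv(Sigma)
    cross = np.einsum('id,de,je->ij', X, P, X)        # x_i^T P x_j
    diag = np.diag(cross)
    return np.exp(-(diag[:, None] + diag[None, :] - 2 * cross))
```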

43 Learning the Kernel and Structured Sparsity
Interpretation of LTK in the feature space [Bach et al. 2004, Micchelli & Pontil 2005]:
$$\min_{v_1, \dots, v_N} \sum_{i=1}^m E\left(\sum_{j=1}^N \langle v_j, \Phi_j(x_i) \rangle,\ y_i\right) + \gamma \left(\sum_{j=1}^N \|v_j\|\right)^2$$
This is a Group Lasso in the feature space. The $L_{2,1}$ norm tends to favor a small number of feature maps / kernels in the solution

44 Why Learn Kernels in Convex Sets? Data fusion (e.g. in bioinformatics) Kernel hyperparameter learning (e.g. Gaussian kernel parameters) Multi-task learning Metric learning Semi-supervised learning (learning the graph) Efficient alternative to cross validation; exploits the power of convex optimization 43

45 Properties of the Solution of LTK
Formulation for learning the kernel:
$$\min_{K \in \mathcal{K}}\ \min_{c \in \mathbb{R}^m} \sum_{i=1}^m E\big((Kc)_i, y_i\big) + \gamma \langle c, Kc \rangle \qquad (LTK)$$
where $\mathcal{K} = \mathrm{conv}(\mathcal{B})$, i.e. solutions of (LTK) are of the form $\hat{K} = \sum_{i=1}^N \hat{\lambda}_i \hat{B}_i$, where $\hat{\lambda}_i \ge 0$, $\sum_{i=1}^N \hat{\lambda}_i = 1$, $\hat{B}_i \in \mathcal{B}$.
Any solution $(\hat{c}, \hat{K})$ of (LTK) is a saddle point of a minimax problem

46 Properties of the Solution of LTK (contd.)
Theorem. $(\hat{c}, \hat{K})$ solves (LTK) if and only if
1. $\langle \hat{c}, \hat{B}_i \hat{c} \rangle = \max_{B \in \mathcal{B}} \langle \hat{c}, B \hat{c} \rangle$ for $i = 1, \dots, N$
2. $\hat{c}$ is the solution to $\min_{c \in \mathbb{R}^m} \sum_{i=1}^m E\big((\hat{K} c)_i, y_i\big) + \gamma \langle c, \hat{K} c \rangle$
Moreover, there exists a solution involving at most $m + 1$ kernels: $\hat{K} = \sum_{i=1}^{m+1} \hat{\lambda}_i \hat{B}_i$

47 A General Algorithm for Learning the Kernel
Incrementally builds an estimate of the solution, $K^{(k)} = \sum_{i=1}^k \lambda_i K^{(i)}$
Initialization: given an initial kernel $K^{(1)}$ in the convex set $\mathcal{K}$
while convergence condition is not true do
  1. Compute $\hat{c} = \operatorname{argmin}_{c \in \mathbb{R}^m} \sum_{i=1}^m E\big((K^{(k)} c)_i, y_i\big) + \gamma \langle c, K^{(k)} c \rangle$
  2. Find a basic kernel $\hat{B} \in \operatorname{argmax}_{B \in \mathcal{B}} \langle \hat{c}, B \hat{c} \rangle$
  3. Compute $K^{(k+1)}$ as the optimal convex combination of $\hat{B}$ and $K^{(k)}$
end while

48 A General Algorithm for Learning the Kernel (contd.) Theorem. There exists a limit point of the LTK algorithm and any limit point is a solution of (LTK). Step 1 is standard regularization (SVM, ridge regression, etc.). Step 2 is the hardest: tractable for e.g. finite $\mathcal{B}$ or MTL; non-convex for e.g. Gaussian kernels. Step 3 is convex in one variable (requires solving a few regularization problems). Some non-convex cases (e.g. few-parameter Gaussians) are solvable
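A minimal sketch of the loop for the simplest case: a finite set $\mathcal{B}$ of basic Gram matrices with the square loss, so that step 2 is an exhaustive maximization and step 3 a crude one-dimensional search over convex combinations. The names and the line-search grid are illustrative, not the authors' implementation.

```python
import numpy as np

def krr(K, y, gamma):
    """Step 1 with the square loss: c = (K + gamma I)^{-1} y."""
    return np.linalg.solve(K + gamma * np.eye(len(y)), y)

def objective(K, c, y, gamma):
    return np.sum((K @ c - y) ** 2) + gamma * c @ K @ c

def learn_kernel(basics, y, gamma=0.1, n_iter=20):
    """Greedy loop over a finite set of basic Gram matrices (conv(B) is a polytope)."""
    K = basics[0].copy()                               # initial kernel K^(1)
    for _ in range(n_iter):
        c = krr(K, y, gamma)                           # step 1: regularization
        B = max(basics, key=lambda Bk: c @ Bk @ c)     # step 2: best basic kernel
        # step 3: line search over convex combinations (1 - t) K + t B
        def combo_value(t):
            Kt = (1 - t) * K + t * B
            return objective(Kt, krr(Kt, y, gamma), y, gamma)
        t = min(np.linspace(0.0, 1.0, 51), key=combo_value)
        K = (1 - t) * K + t * B
    return K, krr(K, y, gamma)
```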

49 Character Recognition Experiments MNIST classification of digits. 1st experiment: Gaussian basic kernels with one parameter $\sigma$; compared with a continuously parameterized set + local search, a finite grid of basic kernels, and SVM. 2nd experiment: Gaussian basic kernels with two parameters $\sigma_1, \sigma_2$ (left, right halves of images); compared with varying finite grids.

50 Character Recognition Experiments (contd.)
[Tables: test errors on the tasks odd vs. even, 3 vs. 8, and 4 vs. 7, comparing LTK, local search, a finite grid, and SVM for $\sigma \in [75, 25000]$, $[100, 10000]$, $[500, 5000]$; and LTK with two kernel parameters over the same ranges.]
Using two parameters improves performance. Continuous vs. finite grid: more efficient, robust w.r.t. the parameter range, does not overfit

51 Character Recognition Experiments (contd.) Learned kernel coefficients. From left to right: odd vs. even (1 and 2 kernel params.), 3 vs. 8 and 4 vs. 7 (2 params.) 50

52 Learning the Graph in Semi-Supervised Learning We can also apply LTK when there are few labeled data points. A number of graphs are given, e.g. generated using k-NN, exponential decay weights, etc. To exploit the structure of each graph, the Gram matrices $B_i$ are taken to be the pseudoinverses of the graph Laplacians
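A short sketch (numpy; the adjacency input is hypothetical) of building one such basic Gram matrix: form the graph Laplacian from an adjacency matrix and take its pseudoinverse.

```python
import numpy as np

def laplacian_kernel(A):
    """Basic Gram matrix for a graph: pseudoinverse of the Laplacian L = deg(A) - A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.pinv(L)

# toy usage: a 4-node path graph
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
B = laplacian_kernel(A)          # one element of the set of basic kernels
```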

53 Conclusion Multi-task learning is ubiquitous; exploiting task relatedness can enhance learning performance significantly Proposed an alternating algorithm to learn tasks that lie on a common subspace; this algorithm is simple and efficient Nonlinear kernels can be introduced via a multi-task representer theorem; also proposed spectral and non-convex matrix learning Multi-task learning can be viewed as an instance of learning combinations of infinite kernels Generally, we can use a greedy incremental algorithm to learn combinations of (finite or infinite) kernels 52

54 Future Work Convergence rates for the algorithms proposed Task relatedness can be of many types (hierarchical features, input transformation invariances etc.) Convex relaxation techniques for the non-convex formulations presented Algorithms and representer theorems for special types of norms Related problems (sparse coding, structured output, metric learning etc.) 53
