SPARROW: SPARse approximation Weighted regression. Pardis Noorzad, Department of Computer Engineering and IT, Amirkabir University of Technology.

SPARROW: SPARse approximation Weighted regression. Pardis Noorzad, Department of Computer Engineering and IT, Amirkabir University of Technology. Université de Montréal, March 12, 2012.

Outline
- Introduction: Motivation; Local Methods
- SPARROW is a Local Method
- Defining the Effective Weights
- Defining the Observation Weights

Motivation: Problem setting. Given $\mathcal{D} := \{(x_i, y_i) : i = 1, \ldots, N\}$, where $y_i \in \mathbb{R}$ is the output at the input $x_i := [x_{i1}, \ldots, x_{iM}]^T \in \mathbb{R}^M$, our task is to estimate the regression function $f : \mathbb{R}^M \to \mathbb{R}$ such that $y_i = f(x_i) + \epsilon_i$, where the $\epsilon_i$ are independent with zero mean.

Motivation: Global methods. In parametric approaches, the form of the regression function is assumed known. For example, in multiple linear regression (MLR) we assume
$$f(x_0) = \sum_{j=1}^{M} \beta_j x_{0j} + \epsilon.$$
We can also add higher-order terms and still have a model that is linear in the parameters $\beta_j, \gamma_j$:
$$f(x_0) = \sum_{j=1}^{M} \left( \beta_j x_{0j} + \gamma_j x_{0j}^2 \right) + \epsilon.$$
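As a concrete illustration of the point above: the squared terms simply enter the design matrix as extra columns, so the model is still fit by ordinary least squares. A minimal sketch (NumPy; the synthetic data, the added intercept, and all names are my own choices, not part of the slides):

```python
# Minimal sketch (not from the slides): MLR with added quadratic terms is
# still linear in its parameters, so it can be fit by ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 3
X = rng.uniform(-1.0, 1.0, size=(N, M))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * X[:, 0] ** 2 + 0.1 * rng.standard_normal(N)

# Design matrix: intercept (an addition for convenience), linear terms x_j, squared terms x_j^2
Phi = np.hstack([np.ones((N, 1)), X, X ** 2])
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(coef)  # [intercept, beta_1..beta_M, gamma_1..gamma_M]
```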

Motivation: Global methods, continued. Example of a global nonparametric approach: $\epsilon$-support vector regression ($\epsilon$-SVR) (Smola and Schölkopf, 2004),
$$f(x_0) = \sum_{i=1}^{N} \beta_i K(x_0, x_i) + \epsilon.$$

Local methods. A successful nonparametric approach to regression: local estimation (Hastie and Loader, 1993; Härdle and Linton, 1994; Ruppert and Wand, 1994). In local methods,
$$f(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i + \epsilon.$$

Local methods, continued. For example, in $k$-nearest neighbor regression ($k$-NNR),
$$f(x_0) = \sum_{i=1}^{N} \frac{\alpha_i(x_0)}{\sum_{p=1}^{N} \alpha_p(x_0)}\, y_i,$$
where $\alpha_i(x_0) := \mathbb{I}_{\mathcal{N}_k(x_0)}(x_i)$ and $\mathcal{N}_k(x_0) \subseteq \mathcal{D}$ is the set of the $k$ nearest neighbors of $x_0$.

Local methods, continued. In weighted $k$-NNR (W$k$-NNR),
$$f(x_0) = \sum_{i=1}^{N} \frac{\alpha_i(x_0)}{\sum_{p=1}^{N} \alpha_p(x_0)}\, y_i, \qquad \alpha_i(x_0) := S(x_0, x_i)^{-1}\, \mathbb{I}_{\mathcal{N}_k(x_0)}(x_i),$$
where $S(x_0, x_i) = (x_0 - x_i)^T V^{-1} (x_0 - x_i)$ is the scaled Euclidean distance.
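A small NumPy sketch of the W$k$-NNR estimate defined above, assuming a diagonal scaling matrix $V$ built from per-feature variances; the function and variable names are illustrative, not taken from any SPARROW implementation:

```python
# Hedged sketch of weighted k-NNR: observation weights are inverse scaled
# Euclidean distances, restricted to the k nearest neighbors of x0.
import numpy as np

def wknn_predict(x0, X, y, k=4):
    V_inv = np.linalg.inv(np.diag(np.var(X, axis=0)))    # assumed diagonal scaling V
    diff = X - x0
    d = np.einsum('ij,jk,ik->i', diff, V_inv, diff)      # S(x0, x_i) for all i
    nn = np.argsort(d)[:k]                                # indices in N_k(x0)
    alpha = np.zeros(len(X))
    alpha[nn] = 1.0 / np.maximum(d[nn], 1e-12)            # alpha_i = S(x0, x_i)^{-1}
    return alpha @ y / alpha.sum()                        # weighted average of the y_i

# Example usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] + 0.1 * rng.standard_normal(50)
print(wknn_predict(np.zeros(2), X, y, k=4))
```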

Local methods, continued. Just so you know, here is another example of a local method: the additive model (AM) (Buja et al., 1989),
$$f(x_0) = \sum_{j=1}^{M} f_j(x_{0j}) + \epsilon,$$
where the univariate functions of the predictors are estimated locally.

Local methods, continued. In local methods, we estimate the regression function locally by a simple parametric model. In local polynomial regression, we estimate the regression function locally by a Taylor polynomial. This is what happens in SPARROW, as we will explain.

SPARROW is a Local Method

SPARROW is a Local Method: I meant this sparrow.

SPARROW is a local method. Before we get into the details, here are a few examples showing the benefits of local methods; then we'll talk about SPARROW.

Figure: Our generated dataset. $y_i = f(x_i) + \epsilon_i$, where $f(x) = (x^3 + x^2)\,\mathbb{I}(x) + \sin(x)\,\mathbb{I}(-x)$.
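One plausible way to generate data of this form, for readers who want to reproduce the toy example; the sample size and noise level are guesses, not taken from the slides:

```python
# Sketch: synthetic data y_i = f(x_i) + eps_i with
# f(x) = (x^3 + x^2) for x >= 0 and sin(x) for x < 0.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)             # sample size is a guess
f = np.where(x >= 0, x ** 3 + x ** 2, np.sin(x))
y = f + 0.05 * rng.standard_normal(x.size)      # noise level is a guess
```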

Figure: Multiple linear regression with first-, second-, and third-order terms.

Figure: $\epsilon$-support vector regression with an RBF kernel.

Figure: 4-nearest neighbor regression.

Defining the Effective Weights: Effective weights in SPARROW. In local methods,
$$f(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i + \epsilon.$$
Now we define $l_i(x_0)$.

Defining the Effective Weights: Local estimation by a Taylor polynomial. To locally estimate the regression function near $x_0$, we approximate $f(x)$ by a second-degree Taylor polynomial about $x_0$:
$$P_2(x) = \phi + (x - x_0)^T \theta + \tfrac{1}{2}(x - x_0)^T H (x - x_0), \quad (1)$$
where $\phi := f(x_0)$, $\theta := \nabla f(x_0)$ is the gradient of $f$, and $H := \nabla^2 f(x_0)$ is its Hessian, both evaluated at $x_0$.

Defining the Effective Weights: Local estimation by a Taylor polynomial, continued. With $P_2(x)$ as in (1), we need to solve the locally weighted least squares problem
$$\min_{\phi, \theta, H} \sum_{i \in \Omega} \alpha_i \{ y_i - P_2(x_i) \}^2. \quad (2)$$

Defining the Effective Weights: Local estimation by a Taylor polynomial, continued. Express (2) as
$$\min_{\Theta(x_0)} \left\| A^{1/2} \{ y - X\,\Theta(x_0) \} \right\|^2, \quad (3)$$
where $A$ is diagonal with $a_{ii} = \alpha_i$, $y := [y_1, y_2, \ldots, y_N]^T$,
$$X := \begin{bmatrix} 1 & (x_1 - x_0)^T & \mathrm{vech}^T\{(x_1 - x_0)(x_1 - x_0)^T\} \\ \vdots & \vdots & \vdots \\ 1 & (x_N - x_0)^T & \mathrm{vech}^T\{(x_N - x_0)(x_N - x_0)^T\} \end{bmatrix},$$
and the parameter supervector is $\Theta(x_0) := \left[ \phi,\ \theta^T,\ \mathrm{vech}^T(H) \right]^T$.

Defining the Effective Weights: Local estimation by a Taylor polynomial, continued. The solution of (3) is
$$\hat{\Theta}(x_0) = \left( X^T A X \right)^{-1} X^T A\, y,$$
and so the local quadratic estimate is
$$\hat{\phi} = \hat{f}(x_0) = e_1^T \left( X^T A X \right)^{-1} X^T A\, y.$$
Since $\hat{f}(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i$, the vector of effective weights for SPARROW is
$$[l_1(x_0), \ldots, l_N(x_0)]^T = A^T X \left( X^T A X \right)^{-1} e_1.$$
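A NumPy sketch of the effective-weight computation above: build the local quadratic design matrix $X$ (intercept, linear, and vech of the outer-product terms), the diagonal weight matrix $A$, and form $l(x_0) = A X (X^T A X)^{-1} e_1$. The formulas follow the slides; the small ridge term for numerical stability and all names are my own additions:

```python
import numpy as np

def vech(S):
    """Stack the lower triangle of a symmetric matrix into a vector."""
    return S[np.tril_indices(S.shape[0])]

def effective_weights(x0, X, alpha, ridge=1e-10):
    # Rows of the design matrix: [1, (x_i - x0)^T, vech^T{(x_i - x0)(x_i - x0)^T}]
    rows = []
    for xi in X:
        d = xi - x0
        rows.append(np.concatenate(([1.0], d, vech(np.outer(d, d)))))
    Xd = np.asarray(rows)                         # N x (1 + M + M(M+1)/2)
    A = np.diag(alpha)                            # observation weights on the diagonal
    G = Xd.T @ A @ Xd + ridge * np.eye(Xd.shape[1])  # ridge term: my addition for stability
    e1 = np.zeros(Xd.shape[1]); e1[0] = 1.0
    return A @ Xd @ np.linalg.solve(G, e1)        # vector [l_1(x0), ..., l_N(x0)]

# The local quadratic estimate is then f_hat(x0) = effective_weights(x0, X, alpha) @ y.
```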

Defining the Effective Weights: Local estimation by a Taylor polynomial, continued. The local constant regression estimate is
$$\hat{f}(x_0) = (\mathbf{1}^T A \mathbf{1})^{-1} \mathbf{1}^T A\, y = \sum_{i=1}^{N} \frac{\alpha_i(x_0)}{\sum_{k=1}^{N} \alpha_k(x_0)}\, y_i.$$
Look familiar?

Defining the Observation Weights: Observation weights in SPARROW. We still have to assign the weights in
$$\min_{\phi, \theta, H} \sum_{i \in \Omega} \alpha_i \{ y_i - f(x_i) \}^2,$$
that is, the diagonal elements of $A$ in
$$\min_{\Theta(x_0)} \left\| A^{1/2} \{ y - X\,\Theta(x_0) \} \right\|^2. \quad (4)$$

Defining the Observation Weights: Observation weights in SPARROW, continued. To find the $\alpha_i$ we solve the following problem (Chen et al., 1995):
$$\min_{\alpha \in \mathbb{R}^N} \|\alpha\|_1 \quad \text{subject to} \quad \|x_0 - D\alpha\|_2^2 \le \sigma, \quad (5)$$
where $\sigma > 0$ limits the maximum approximation error and
$$D := \left[ \frac{x_1}{\|x_1\|_2},\ \frac{x_2}{\|x_2\|_2},\ \ldots,\ \frac{x_N}{\|x_N\|_2} \right].$$
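Problem (5) is basis pursuit denoising. A common practical route, sketched below, is to solve its Lagrangian (LASSO) form for some regularization weight corresponding to $\sigma$; this substitution, the use of scikit-learn's Lasso, and all names are my choices, not prescribed by the slides:

```python
import numpy as np
from sklearn.linear_model import Lasso

def observation_weights(x0, X, lam=0.01):
    # Dictionary D with columns x_i / ||x_i||_2 (training points as atoms)
    D = X.T / np.linalg.norm(X, axis=1)
    # Lagrangian relaxation of (5): min ||x0 - D a||_2^2 / (2m) + lam * ||a||_1
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    model.fit(D, x0)
    alpha = model.coef_                           # sparse vector in R^N
    # Note: the slides do not say how negative coefficients are handled when
    # used as least-squares weights; taking absolute values is one option.
    return alpha
```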

Defining the Observation Weights: Power family of penalties. The $\ell_p$ norms raised to the $p$th power:
$$\|x\|_p^p = \sum_i |x_i|^p. \quad (6)$$
For $1 \le p < \infty$, (6) is convex; $0 < p \le 1$ is the range of $p$ useful for measuring sparsity.

Figure: As $p$ goes to 0, $|x|^p$ approaches the indicator function and $\|x\|_p^p$ approaches a count of the nonzeros in $x$ (Bruckstein et al., 2009).

Defining the Observation Weights: Representation by sparse approximation, continued. To motivate this idea, let's look at feature learning with sparse coding, and at sparse representation classification (SRC), an example of exemplar-based sparse approximation.

Defining the Observation Weights: Unsupervised feature learning, with application to image classification. An example is the recent work by Coates and Ng (2011): $x_0 = D\alpha$, where $x_0$ is the input vector (it could be a vectorized image patch or a SIFT descriptor), $\alpha$ is the higher-dimensional sparse representation of $x_0$, and $D$ is usually learned.

Figure: Image classification (Coates et al., 2011).

Defining the Observation Weights: Multiclass classification (Wright et al., 2009). Let $\mathcal{D} := \{ (x_i, y_i) : x_i \in \mathbb{R}^m,\ y_i \in \{1, \ldots, c\},\ i \in \{1, \ldots, N\} \}$. Given a test sample $x_0$:
1. Solve $\min_{\alpha \in \mathbb{R}^N} \|\alpha\|_1$ subject to $\|x_0 - D\alpha\|_2^2 \le \sigma$.
2. Define $\{\alpha_y : y \in \{1, \ldots, c\}\}$, where $[\alpha_y]_i = \alpha_i$ if $x_i$ belongs to class $y$, and 0 otherwise.
3. Construct $\mathcal{X}(\alpha) := \{ \hat{x}_y(\alpha) = D\alpha_y : y \in \{1, \ldots, c\} \}$.
4. Predict $\hat{y} := \arg\min_{y \in \{1, \ldots, c\}} \|x_0 - \hat{x}_y(\alpha)\|_2^2$.
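A sketch of the SRC procedure listed above, reusing the Lagrangian LASSO relaxation from the previous sketch for step 1; the regularization weight, data layout, and helper names are illustrative only:

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_predict(x0, X, labels, lam=0.01):
    # Step 1: sparse code of x0 over the normalized training dictionary
    D = X.T / np.linalg.norm(X, axis=1)
    alpha = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(D, x0).coef_
    # Steps 2-4: per-class coefficients, class-wise reconstructions, smallest residual
    best_y, best_res = None, np.inf
    for y in np.unique(labels):
        alpha_y = np.where(labels == y, alpha, 0.0)   # keep only class-y coefficients
        res = np.linalg.norm(x0 - D @ alpha_y) ** 2   # ||x0 - D alpha_y||_2^2
        if res < best_res:
            best_y, best_res = y, res
    return best_y
```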

Figure: SRC on a handwritten image dataset.

Figure: SVM with a linear kernel on a handwritten image dataset.

Back to SPARROW, with evaluation on the MPG dataset: the Auto MPG Data Set from the UCI Machine Learning Repository (Frank and Asuncion, 2010). The data concern city-cycle fuel consumption in miles per gallon, to be predicted from 3 multivalued discrete and 4 continuous attributes. Number of instances: 392. Number of attributes: 7 (cylinders, displacement, horsepower, weight, acceleration, model year, origin).

Figure: Average mean squared error (MSE) achieved by NWR, 4-NNR, W4-NNR, SPARROW, MLR, and LASSO over 10-fold cross-validation.

Looking ahead. What is causing the success of SPARROW and SRC? How important is the bandwidth? What about in SRC?

References I
Alfred M. Bruckstein, David L. Donoho, and Michael Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.
Andreas Buja, Trevor Hastie, and Robert Tibshirani. Linear smoothers and additive models. The Annals of Statistics, 17(2):435–555, 1989.
Scott S. Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, May 1995.
Adam Coates and Andrew Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning (ICML), pages 921–928, 2011.
Adam Coates, Honglak Lee, and Andrew Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In International Conference on AI and Statistics, 2011.

References II
A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.
Wolfgang Härdle and Oliver Linton. Applied nonparametric methods. Technical Report 1069, Yale University, 1994.
T. J. Hastie and C. Loader. Local regression: Automatic kernel carpentry. Statistical Science, 8(2):120–129, 1993.
D. Ruppert and M. P. Wand. Multivariate locally weighted least squares regression. The Annals of Statistics, 22:1346–1370, 1994.
Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14:199–222, August 2004.
John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:210–227, 2009.

Acknowledgements. This is ongoing work carried out under the supervision of Prof. Bob L. Sturm of Aalborg University Copenhagen. Thanks to Isaac Nickaein and Sheida Bijani for helping out with the slides.