L0 methods

H.J. Kappen, Donders Institute for Neuroscience, Radboud University, Nijmegen, the Netherlands. December 5, 2011.


Outline

- George & McCulloch's model
- The Variational Garrote

Linear regression

Given data $x^\mu_i, y^\mu$, $\mu = 1,\dots,p$, find weights $w_i$ that best describe the relation
$$y^\mu = \sum_{i=1}^n w_i x^\mu_i + \xi^\mu$$
Ordinary least squares (OLS) minimizes
$$E_{\mathrm{OLS}} = \sum_\mu \Big( y^\mu - \sum_{i=1}^n w_i x^\mu_i \Big)^2$$
The solution is given by
$$w = \chi^{-1} b, \qquad \chi_{ij} = \frac{1}{p}\sum_\mu x^\mu_i x^\mu_j, \qquad b_i = \frac{1}{p}\sum_\mu x^\mu_i y^\mu$$
Problems: low accuracy due to overfitting (when $p < n$) and interpretation: the OLS solution is not sparse.
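A minimal numpy sketch of the OLS estimate in this notation (the data, dimensions and noise level below are synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 5                       # p samples, n input dimensions
X = rng.standard_normal((p, n))     # X[mu, i] = x^mu_i
w_true = np.array([1.0, 0.0, 0.0, 0.5, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(p)

chi = X.T @ X / p                   # chi_ij = (1/p) sum_mu x^mu_i x^mu_j
b = X.T @ y / p                     # b_i    = (1/p) sum_mu x^mu_i y^mu
w_ols = np.linalg.solve(chi, b)     # w = chi^{-1} b
```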

Ridge regression

Add a regularization term:
$$E_{\mathrm{Ridge}} = E_{\mathrm{OLS}} + \lambda \sum_i w_i^2, \qquad \lambda > 0, \qquad w = (\chi + \lambda I)^{-1} b$$
Ridge regression
- improves the prediction accuracy
- $\chi + \lambda I$ has maximal rank
- the solution is not sparse

Lasso

Solve the OLS problem under the linear constraint $\sum_i |w_i| \le t$. Equivalently, add a regularization term
$$E_{\mathrm{Lasso}} = E_{\mathrm{OLS}} + \lambda \sum_i |w_i|, \qquad \lambda > 0$$
There exist efficient methods to solve this quadratic programming problem. The solution tends to be sparse, which improves both the prediction accuracy and the interpretability of the solution.

Both ridge regression and Lasso are shrinkage methods: they find a solution that is biased towards smaller $w$.
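As a concrete illustration (not from the slides), both shrinkage methods are available off the shelf; the sketch below fits them with scikit-learn on synthetic data with $p < n$, where only the Lasso solution comes out sparse. The data, penalty strengths and library choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
p, n = 50, 100                                 # fewer samples than features (p < n)
X = rng.standard_normal((p, n))
w_true = np.zeros(n)
w_true[:3] = [1.0, -0.5, 0.8]                  # only three truly relevant features
y = X @ w_true + 0.1 * rng.standard_normal(p)

ridge = Ridge(alpha=1.0).fit(X, y)                       # penalizes sum_i w_i^2
lasso = Lasso(alpha=0.05, max_iter=10_000).fit(X, y)     # penalizes sum_i |w_i|

print("ridge non-zeros:", np.sum(np.abs(ridge.coef_) > 1e-6))   # typically all n
print("lasso non-zeros:", np.sum(np.abs(lasso.coef_) > 1e-6))   # typically a few
```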

Spike and slab

Introduce a prior distribution over $w_i$:
$$p(w_i | s_i, \beta_\pm) = (1 - s_i)\, N(w_i | 0, \sigma^2_{\mathrm{spike}}) + s_i\, N(w_i | 0, \sigma^2_{\mathrm{slab}}), \qquad s_i = 0, 1$$
$$p(s_i | \gamma) \propto \exp(\gamma s_i), \qquad \gamma < 0$$
$$p(w_i) = \sum_{s_i = 0,1} p(w_i | s_i)\, p(s_i | \gamma)$$
with $1/\beta_- = \sigma^2_{\mathrm{spike}}$ and $1/\beta_+ = \sigma^2_{\mathrm{slab}}$.

Likelihood:
$$p(y | x, w, \beta_0) = \sqrt{\frac{\beta_0}{2\pi}} \exp\left( -\frac{\beta_0}{2} \Big( y - \sum_{i=1}^n w_i x_i \Big)^2 \right), \qquad p(D | w, \beta_0) = \prod_\mu p(y^\mu | x^\mu, w, \beta_0)$$
George and McCulloch in addition assume priors over $\beta_\pm$, $\beta_0$ (and $\gamma$).

Posterior

The posterior becomes
$$p(w, s | D, \beta_0, \beta_\pm, \gamma) = \frac{p(D | w, \beta_0)\, p(w | s, \beta_\pm)\, p(s | \gamma)}{p(D)}$$
where $D$ is the data and
$$p(D | w, \beta_0) = \prod_{\mu=1}^p p(y^\mu | x^\mu, w) \propto \exp\left( -\frac{\beta_0}{2} \sum_\mu \Big( y^\mu - \sum_i w_i x^\mu_i \Big)^2 \right) \propto \exp\left( -\frac{\beta_0 p}{2} \Big( \sum_{ij} w_i w_j \chi_{ij} - 2 \sum_i w_i b_i \Big) \right)$$
$$p(w | s, \beta_\pm) \propto \exp\left( -\frac{1}{2} \sum_{i=1}^n \big( s_i \beta_+ w_i^2 + (1 - s_i) \beta_- w_i^2 \big) \right)$$

$$p(s | \gamma) = \prod_i \frac{\exp(\gamma s_i)}{1 + \exp(\gamma)}$$
For given $s$, the posterior distribution is Gaussian in $w$:
$$p(w | s) \propto \exp\left( -\frac{1}{2} \sum_{ij} (w_i - \bar w_i)\, A_{ij}(s)\, (w_j - \bar w_j) \right)$$
$$A_{ij}(s) = \beta_0\, p\, \chi_{ij} + \big( s_i \beta_+ + (1 - s_i) \beta_- \big)\delta_{ij}, \qquad \sum_j A_{ij} \bar w_j = \beta_0\, p\, b_i$$
For given $w$, the posterior factorizes over $s$:
$$p(s | w) \propto \exp\left( \gamma \sum_i s_i - \frac{1}{2} \sum_{i=1}^n \big( s_i \beta_+ w_i^2 + (1 - s_i) \beta_- w_i^2 \big) \right)$$

Gibbs sampling

Sample $w$ conditioned on $s$:
$$w \sim N\big( \bar w(s),\, A(s)^{-1} \big)$$
Sample the $s_i$ independently:
$$p(s_i = 1) = \frac{\exp\big( \gamma - \tfrac{1}{2}\beta_+ w_i^2 \big)}{\exp\big( \gamma - \tfrac{1}{2}\beta_+ w_i^2 \big) + \exp\big( -\tfrac{1}{2}\beta_- w_i^2 \big)}$$
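A minimal numpy sketch of this Gibbs sampler, directly transcribing the two conditionals above; the hyperparameter values are placeholders and there are no convergence checks:

```python
import numpy as np

def gibbs_spike_slab(X, y, beta0=1.0, beta_plus=0.01, beta_minus=100.0,
                     gamma=-2.0, n_sweeps=500, rng=None):
    """Gibbs sampler for the spike-and-slab linear model (illustrative sketch).
    beta_plus = 1/sigma^2_slab, beta_minus = 1/sigma^2_spike, beta0 = noise precision."""
    rng = rng or np.random.default_rng(0)
    p, n = X.shape
    chi = X.T @ X / p
    b = X.T @ y / p
    s = np.zeros(n, dtype=int)
    samples_w, samples_s = [], []
    for _ in range(n_sweeps):
        # 1) w | s is Gaussian with precision A(s) and mean solving A w_bar = beta0 p b
        A = beta0 * p * chi + np.diag(s * beta_plus + (1 - s) * beta_minus)
        cov = np.linalg.inv(A)
        w = rng.multivariate_normal(cov @ (beta0 * p * b), cov)
        # 2) s | w factorizes; p(s_i = 1) follows from the formula above
        log_odds = gamma - 0.5 * beta_plus * w**2 + 0.5 * beta_minus * w**2
        prob1 = 1.0 / (1.0 + np.exp(-np.clip(log_odds, -30, 30)))
        s = (rng.random(n) < prob1).astype(int)
        samples_w.append(w)
        samples_s.append(s)
    return np.array(samples_w), np.array(samples_s)
```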

Spike and slab

The advantage of the spike and slab model is that it does not shrink $w$. However, MCMC is complex and time-consuming.

The Garrote vil

The variational Garrote

Introduce binary variables $s_i = 0, 1$ that select features. The regression model becomes
$$y^\mu = \sum_{i=1}^n w_i s_i x^\mu_i + \xi^\mu, \qquad s_i = 0, 1$$
Optimizing the $s_i$ is equivalent to finding the optimal subset of relevant features. Since the number of subsets is exponential in $n$, one has to resort to heuristic methods to find a good subset of features. Here we propose a variational approximation.

The likelihood term is given by
$$p(y | x, s, w, \beta) = \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2} \Big( y - \sum_{i=1}^n w_i s_i x_i \Big)^2 \right)$$
$$p(D | s, w, \beta) = \prod_\mu p(y^\mu | x^\mu, s, w, \beta) = \left( \frac{\beta}{2\pi} \right)^{p/2} \exp\left( -\frac{\beta p}{2} \Big( \sum_{i,j=1}^n s_i s_j w_i w_j \chi_{ij} - 2 \sum_{i=1}^n w_i s_i b_i + \sigma_y^2 \Big) \right)$$
with $b_i, \chi_{ij}$ as before and $\sigma_y^2 = \frac{1}{p} \sum_\mu (y^\mu)^2$.

For concreteness, we assume that the prior over $s$ factorizes,
$$p(s | \gamma) = \prod_{i=1}^n p(s_i | \gamma), \qquad p(s_i | \gamma) = \frac{\exp(\gamma s_i)}{1 + \exp(\gamma)}$$
with $\gamma$ given; it specifies the sparsity of the solution. We further assume priors $p(\beta, w)$.

The posterior becomes
$$p(s, w, \beta | D, \gamma) = \frac{p(w, \beta)\, p(s | \gamma)\, p(D | s, w, \beta)}{p(D | \gamma)}$$
The posterior is intractable. Possible approaches:
- MCMC
- Variational Bayes
- Variational MAP
- BP, CVM, ...

Here we compute a variational MAP estimate: we approximate the marginal posterior
$$p(w, \beta | D, \gamma) = \sum_s p(s, w, \beta | D, \gamma)$$
and compute the MAP solution with respect to $w, \beta$.

Breiman's Garrote method

The proposed model is similar to Breiman's Garrote:
$$y^\mu = \sum_{i=1}^n w_i s_i x^\mu_i + \xi^\mu$$
which assumes $s_i \ge 0$ instead of binary $s_i$. It computes $w_i$ using OLS and then finds the $s_i$ by minimizing
$$\sum_\mu \Big( y^\mu - \sum_{i=1}^n x^\mu_i w_i s_i \Big)^2 \qquad \text{subject to } s_i \ge 0, \quad \sum_i s_i \le t$$
We refer to our method as the Binary Garrote (BG).
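For reference (not from the slides), the two-step nonnegative Garrote can be written as a small constrained quadratic program; a sketch with scipy, assuming enough data for the OLS step and an arbitrary budget $t$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
p, n = 200, 10
X = rng.standard_normal((p, n))
w_true = np.concatenate([np.ones(3), np.zeros(n - 3)])
y = X @ w_true + 0.5 * rng.standard_normal(p)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]       # step 1: ordinary least squares
Xw = X * w_ols                                     # column i holds x^mu_i * w_i

def loss(s):                                       # step 2: shrink with nonnegative s
    r = y - Xw @ s
    return r @ r

t = 4.0                                            # illustrative budget on sum_i s_i
res = minimize(loss, x0=np.full(n, 0.5), method="SLSQP",
               bounds=[(0.0, None)] * n,
               constraints=[{"type": "ineq", "fun": lambda s: t - s.sum()}])
w_garrote = res.x * w_ols                          # irrelevant s_i are driven to (numerically) zero
```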

The variational approximation

We compute the variational approximation using Jensen's inequality:
$$-\log \sum_s p(s | \gamma)\, p(D | s, w, \beta) \le \sum_s q(s) \log \frac{q(s)}{p(s | \gamma)\, p(D | s, w, \beta)} = F(q, w, \beta)$$
The optimal $q(s)$ is found by minimizing $F(q, w, \beta)$ with respect to $q(s)$. We consider the simplest, fully factorized case
$$q(s) = \prod_{i=1}^n q_i(s_i), \qquad q_i(s_i) = m_i s_i + (1 - m_i)(1 - s_i)$$
so $q(s)$ is parametrized by $m$.

The expectation values with respect to $q$ can now be easily evaluated, and the result is
$$F = -\frac{p}{2}\log\frac{\beta}{2\pi} + \frac{\beta p}{2}\left( \sum_{i,j} v_i v_j \chi_{ij} + \sum_i \frac{1 - m_i}{m_i} v_i^2 \chi_{ii} - 2 \sum_{i=1}^n v_i b_i + \sigma_y^2 \right) - \gamma \sum_{i=1}^n m_i + n \log(1 + \exp(\gamma)) + \sum_{i=1}^n \big( m_i \log m_i + (1 - m_i) \log(1 - m_i) \big)$$
where we have defined $v_i = m_i w_i$.
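A direct numpy transcription of this expression, useful for monitoring convergence (a sketch; it assumes $0 < m_i < 1$ so the entropy terms are finite):

```python
import numpy as np

def free_energy(X, y, m, v, beta, gamma):
    """Variational free energy F(m, v, beta) at sparsity level gamma (sketch)."""
    p, n = X.shape
    chi = X.T @ X / p
    b = X.T @ y / p
    sigma_y2 = np.mean(y ** 2)
    quad = (v @ chi @ v + np.sum((1 - m) / m * v ** 2 * np.diag(chi))
            - 2 * v @ b + sigma_y2)
    entropy_term = np.sum(m * np.log(m) + (1 - m) * np.log(1 - m))
    return (-0.5 * p * np.log(beta / (2 * np.pi)) + 0.5 * beta * p * quad
            - gamma * np.sum(m) + n * np.log1p(np.exp(gamma)) + entropy_term)
```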

The approximate marginal posterior is then
$$p(w, \beta | D, \gamma) \propto p(w, \beta) \sum_s p(s | \gamma)\, p(D | s, w, \beta) \approx p(w, \beta) \exp\big( -F(m, w, \beta, \gamma) \big) = \exp\big( -G(m, w, \beta, \gamma) \big)$$
$$G(m, w, \beta, \gamma) = F(m, w, \beta, \gamma) - \log p(w, \beta)$$
We can compute the variational approximation $m$ for given $w, \beta, \gamma$ by minimizing $F$ with respect to $m$. In addition, $p(w, \beta | D, \gamma)$ needs to be maximized with respect to $w, \beta$.

Taking the derivatives of $G$ with respect to $m, v, \beta$ and setting them equal to zero gives the following set of fixed point equations:
$$m_i = \sigma\left( \gamma + \frac{\beta p}{2} \frac{v_i^2 \chi_{ii}}{m_i^2} \right), \qquad \sigma(x) = \frac{1}{1 + \exp(-x)}$$
$$v = \tilde\chi^{-1} b, \qquad \tilde\chi_{ij} = \chi_{ij} + \frac{1 - m_i}{m_i} \chi_{ii}\, \delta_{ij}$$
$$\frac{1}{\beta} = \sigma_y^2 - \sum_{i=1}^n v_i b_i$$
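A direct numpy transcription of these fixed point equations (a sketch: γ is held fixed, the $m$ update is damped, the initialization is my own choice, and the iteration simply runs for a fixed number of steps with no convergence guarantees):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def variational_garrote(X, y, gamma, n_iter=200, eta=0.5, eps=1e-12):
    """Fixed point iteration for the variational Garrote at a given gamma (sketch)."""
    p, n = X.shape
    chi = X.T @ X / p
    b = X.T @ y / p
    sigma_y2 = np.mean(y ** 2)
    chi_diag = np.diag(chi)
    m = np.full(n, 0.5)                     # m_i = q(s_i = 1)
    beta = 1.0 / sigma_y2
    for _ in range(n_iter):
        # v solves chi_tilde v = b, with chi_tilde = chi + diag((1-m)/m * chi_ii)
        chi_tilde = chi + np.diag((1.0 - m) / np.maximum(m, eps) * chi_diag)
        v = np.linalg.solve(chi_tilde, b)
        # 1/beta = sigma_y^2 - sum_i v_i b_i
        beta = 1.0 / max(sigma_y2 - v @ b, eps)
        # m_i = sigma(gamma + beta p/2 * v_i^2 chi_ii / m_i^2), damped for stability
        m_new = sigmoid(gamma + 0.5 * beta * p * v ** 2 * chi_diag
                        / np.maximum(m, eps) ** 2)
        m = (1.0 - eta) * m + eta * m_new
    w = v / np.maximum(m, eps)
    return w, m, v, beta
```

In practice γ is swept over a range and selected on a validation set, as on the later slides.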

Comments

The variational approximation is not simply the substitution $w_i s_i \to w_i m_i$. If it were, the substitution $v_i = w_i m_i$ would remove $m_i$ from the equations and the OLS problem would be recovered. The reason is that $\langle s_i s_j \rangle = m_i m_j$ for $i \ne j$, but $\langle s_i^2 \rangle = \langle s_i \rangle = m_i$.

$\tilde\chi$ differs from $\chi$ by the addition of a positive diagonal, making $\tilde\chi$ automatically of maximal rank when $m_i < 1$. Roughly speaking, if $\chi$ has rank $p < n$, $\tilde\chi$ can still be of rank $n$ when no more than $p$ of the $m_i = 1$, the remaining $n - p$ of the $m_i < 1$ making up for the rank deficiency.

Independent inputs

When the inputs are uncorrelated, $\chi_{ij} = \delta_{ij}$:
$$w_i = b_i = \langle x_i y \rangle, \qquad m_i = \sigma\left( \gamma + \frac{\beta p}{2} b_i^2 \right), \qquad \frac{1}{\beta} = \sigma_y^2 - \sum_i b_i^2 m_i$$
$\sum_i b_i^2 m_i$ is the explained variance. The Garrote solution with $m_i < 1$ has reduced explained variance, with (hopefully) better prediction accuracy and interpretability.

Univariate case

In the 1-dimensional case these equations become
$$m = \sigma\left( \gamma + \frac{p}{2} \frac{\rho}{1 - \rho m} \right) = f(m), \qquad \frac{1}{\beta} = \sigma_y^2 (1 - m\rho)$$
with $\rho = b^2/\sigma_y^2$ the squared correlation coefficient.
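A small script (my own; the parameter values are hand-picked for illustration) that counts how often $f(m)$ crosses the line $m$ on a fine grid, showing that one or three fixed points can occur:

```python
import numpy as np

def f(m, gamma, p, rho):
    # f(m) = sigma(gamma + (p/2) * rho / (1 - rho*m))
    return 1.0 / (1.0 + np.exp(-(gamma + 0.5 * p * rho / (1.0 - rho * m))))

m = np.linspace(1e-4, 1.0 - 1e-4, 200001)
settings = [(-1.0, 2, 0.3),     # weak correlation: a single fixed point
            (-2.65, 2, 0.9)]    # strong correlation, tuned gamma: three fixed points
for gamma, p, rho in settings:
    g = f(m, gamma, p, rho) - m
    crossings = int(np.sum(np.sign(g[1:]) != np.sign(g[:-1])))
    print(f"gamma={gamma}, p={p}, rho={rho}: {crossings} fixed point(s)")
```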

$f(m)$ is an increasing function of $m$ and crosses the line $m$ either once or three times, depending on the values of $p, \gamma, \rho$.

[Figure: $f(m)$ versus $m$. Left: different curves correspond to different values of $0 < \rho < 1$. Right: a case with three solutions for $m$.]

The solutions close to $m = 0, 1$ correspond to local minima of $F$. The intermediate solution corresponds to a local maximum of $F$.

One can compute the critical $p_0$ above which multiple solutions occur, from the condition that $f(m) = m$ and $f'(m) = 1$ hold simultaneously; $p_0$ is a decreasing function of $\rho$. For $p > p_0$ we find two (locally stable) solutions for $m$; for $p < p_0$ we find one solution for $m$.

[Figure: Left: phase plot in the $(\rho, \gamma)$ plane; the dotted line is the solution for $\gamma$ when $m = 1/2$. Right: $m$ versus $\rho$ for two settings of $(\gamma, p)$ (top and bottom).]

Transfer function

Suppose that data are generated from the model $y = wx + \xi$, with $\langle \xi^2 \rangle = \langle x^2 \rangle = 1$.

[Figure: estimated $w$ versus true $w$ for the Binary Garrote (VG), ridge regression, Breiman's Garrote, and Lasso.]

Numerical examples

Inputs are generated from a mean-zero multivariate Gaussian distribution with a specified covariance structure. We generate outputs $y^\mu = \sum_i \hat w_i x^\mu_i + \xi^\mu$ with $\xi^\mu \sim N(0, \hat\sigma)$. For each example we generate a training set, a validation set and a test set (of sizes $p / p_v / p_t$). For each value of the hyperparameter (γ in the case of BG, λ in the case of ridge regression and Lasso) we optimize the model parameters on the training set. We optimize the hyperparameter on the validation set.
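In code, this protocol is a simple loop over the hyperparameter; a sketch for the γ sweep, reusing the hypothetical variational_garrote function from the earlier sketch (the γ grid is arbitrary):

```python
import numpy as np

def select_gamma(X_tr, y_tr, X_val, y_val, gammas=np.linspace(-10.0, 0.0, 21)):
    """Fit the BG for each gamma on the training set, keep the best validation error."""
    best = (np.inf, None, None)
    for gamma in gammas:
        w, m, v, beta = variational_garrote(X_tr, y_tr, gamma)  # sketch defined above
        val_err = np.mean((y_val - X_val @ (m * w)) ** 2)       # predict with v = m*w
        if val_err < best[0]:
            best = (val_err, gamma, m * w)
    return best   # (validation error, selected gamma, effective weights)
```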

Example 1

$x^\mu_i \sim N(0, 1)$ independently; $\hat w = (1, 0, \dots, 0)$.

[Figure: Binary Garrote on Example 1. Left: $F$ versus $\gamma$ for the forward and backward sweeps. Middle: training and validation error versus $\gamma$. Right: the solution components $v_1$ and $v_{2:n}$ versus $\gamma$.]

[Figure: Lasso (top row) and ridge regression (bottom row) on Example 1: training and validation error versus $\lambda$, and the solution components $w_1$ and $w_{2:n}$ versus $\lambda$.]

Method | Train | Val | Test | # non-zero | $\delta w_1$ | $\delta w_2$
Ridge | .44 ± .29 | .75 ± .3 | .79 ± .8 | - | 4.4 ± .8 | .8 ± .5
Lasso | .8 ± .22 | .6 ± .24 | .5 ± .23 | 3.8 ± 2.4 | .64 ± .39 | .6 ± .6
BG | .83 ± .8 | .89 ± .9 | .2 ± .9 | .76 ± .36 | .28 ± .26 | .4 ± .4
True | .93 ± .4 | .87 ± .2 | .98 ± .4 | | |

Results on random instances of Example 1.

Example 2

$x^\mu \sim N(0, \Sigma)$ with $\Sigma_{ij} = \delta^{|i-j|}$, $\delta = 0.5$. $\hat w_i = 1$ for five components ($i = 1, 2, 5, \dots$) and all other $\hat w_i = 0$; $n$, $\hat\sigma$ and $p/p_v/p_t$ as in Example 1.

Method | Train | Val | Test | # non-zero | $\delta w_1$ | $\delta w_2$
Lasso | .78 ± .47 | .4 ± .3 | .49 ± .23 | .2 ± 3.2 | 2.9 ± .77 | .55 ± .22
BG | .8 ± .2 | .8 ± .2 | .2 ± .6 | 5.5 ± .7 | . ± .46 | .32 ± .37
True | . ± .8 | .97 ± .9 | .99 ± .7 | 5 | |

Results on random instances.

[Figure: comparison of the Lasso and BG solutions on Example 2.]

Dependence on noise

Data as in Example 1.

[Figure: test error, gap, and $\delta w$ versus the noise level $\sigma$, for Lasso and VG.]

All results are averages over multiple runs.

Implementation issues for high-dimensional problems

For large $n$, the most expensive part of the computation is the inversion of $\tilde\chi$. Note that the free energy can also be written as
$$F = -\frac{p}{2}\log\frac{\beta}{2\pi} + \frac{\beta p}{2}\left( \frac{1}{p}\sum_\mu (z^\mu)^2 + \sum_i \frac{1 - m_i}{m_i} v_i^2 \chi_{ii} - 2 \sum_{i=1}^n v_i b_i + \sigma_y^2 \right) - \gamma \sum_{i=1}^n m_i + n \log(1 + \exp(\gamma)) + \sum_{i=1}^n \big( m_i \log m_i + (1 - m_i) \log(1 - m_i) \big)$$
with $z^\mu = \sum_i x^\mu_i v_i$. We can thus minimize $F$ with respect to $v, z$ under these linear constraints, without the need to compute the covariance matrix $\chi$. This is a quadratic optimization problem.
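One way to exploit this in practice (my own sketch, not the MOSEK-based QP of the next slide): for fixed $m$, the $v$-update solves $\tilde\chi v = b$, which can be done matrix-free with conjugate gradients, touching $X$ only through matrix-vector products.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_v_matrix_free(X, y, m, eps=1e-12):
    """Solve chi_tilde v = b without forming the n x n matrix chi.
    chi_tilde v = (X^T X / p) v + d * v, with d_i = (1 - m_i)/m_i * chi_ii."""
    p, n = X.shape
    b = X.T @ y / p
    chi_diag = np.einsum('ui,ui->i', X, X) / p          # only the diagonal of chi
    d = (1.0 - m) / np.maximum(m, eps) * chi_diag
    def matvec(v):
        return X.T @ (X @ v) / p + d * v                # two O(p*n) products
    A = LinearOperator((n, n), matvec=matvec)
    v, info = cg(A, b)                                  # chi_tilde is SPD for m_i < 1
    return v
```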

The quadratic program can be solved in time linear in $n$.

[Table: CPU times in seconds for solving $v$ by matrix inversion versus solving the QP problem using MOSEK, for increasing $n$; the problem is as described in Example 1.]

[Table: error, $\delta w_2$ and CPU time (sec) for Lasso and BG as a function of $n$.]

Discussion

Local minima:
- appear for few and noisy data
- seem modest for (very) sparse problems
- increasing γ increases β and works as an annealing schedule

Extensions:
- MAP: TAP, BP, CVM
- full Bayes: MCMC, VB, ...
- use of priors (on γ) instead of cross-validation

Applications:
- finding the structure of networks, both static and dynamic
- finding genes in GWAS
- ...

arxiv.org/abs/1109.0486