L0 methods. H.J. Kappen, Donders Institute for Neuroscience, Radboud University, Nijmegen, the Netherlands. December 5, 2011.


1 L0 methods. H.J. Kappen, Donders Institute for Neuroscience, Radboud University, Nijmegen, the Netherlands. December 5, 2011.

2 Outline: George and McCulloch's model (spike and slab); the Variational Garrote.

3 Linear regression

Given data x^µ_i, y^µ, µ = 1, ..., p, find weights w_i that best describe the relation

y^µ = \sum_{i=1}^n w_i x^µ_i + ξ^µ

The ordinary least squares (OLS) method minimizes

OLS = \sum_µ ( y^µ − \sum_{i=1}^n w_i x^µ_i )^2

The solution is given by

w = χ^{-1} b,   χ_{ij} = (1/p) \sum_µ x^µ_i x^µ_j,   b_i = (1/p) \sum_µ x^µ_i y^µ

Problems: low accuracy due to overfitting (p < n), and interpretation: the OLS solution is not sparse.
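The OLS solution above can be sketched in a few lines of numpy (a minimal illustration with assumed data shapes, not code from the slides):

```python
# Minimal sketch: OLS via the normal equations w = chi^{-1} b,
# with chi_ij = (1/p) sum_mu x^mu_i x^mu_j and b_i = (1/p) sum_mu x^mu_i y^mu.
import numpy as np

def ols(X, y):
    """X: (p, n) design matrix with rows x^mu, y: (p,) targets."""
    p = X.shape[0]
    chi = X.T @ X / p               # input covariance chi_ij
    b = X.T @ y / p                 # input-output covariance b_i
    return np.linalg.solve(chi, b)  # w = chi^{-1} b; requires chi to be invertible (p >= n)
```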

4 Ridge regression

Add a regularization term:

Ridge = OLS + λ \sum_i w_i^2,   λ > 0

w = (χ + λ I)^{-1} b

Ridge regression
- improves the prediction accuracy
- makes χ + λ I of maximal rank
- but the solution is not sparse

5 Lasso

Solve the OLS problem under the linear constraint \sum_i |w_i| ≤ t. Equivalently, add a regularization term:

Lasso = OLS + λ \sum_i |w_i|,   λ > 0

There exist efficient methods to solve this quadratic programming problem. The solution tends to be sparse, which improves both the prediction accuracy and the interpretability of the solution.

Both ridge regression and Lasso are shrinkage methods: they find a solution that is biased towards smaller w.
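For comparison, a minimal scikit-learn sketch on made-up data (an illustration of the shrinkage behaviour, not the experiments reported later in these slides):

```python
# Minimal sketch: ridge versus lasso on a toy p < n problem;
# alpha plays the role of the regularization strength lambda above.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
p, n = 50, 100                      # fewer samples than features
X = rng.normal(size=(p, n))
w_true = np.zeros(n)
w_true[:5] = 1.0                    # sparse ground truth
y = X @ w_true + 0.5 * rng.normal(size=p)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("non-zero ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all n
print("non-zero lasso coefficients:", np.sum(np.abs(lasso.coef_) > 1e-6))  # typically only a few
```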

6 Spike and slab

Introduce a prior distribution over w_i:

p(w_i | s_i, β_±) = (1 − s_i) N(w_i | 0, σ_spike) + s_i N(w_i | 0, σ_slab),   s_i = 0, 1
p(s_i | γ) ∝ exp(γ s_i),   γ < 0
p(w_i) = \sum_{s_i = 0, 1} p(w_i | s_i) p(s_i | γ)

with 1/β_− = σ_spike and 1/β_+ = σ_slab.

The likelihood is

p(y | x, w, β) = \sqrt{β / 2π} exp( −(β/2) ( y − \sum_{i=1}^n w_i x_i )^2 )
p(D | w, β) = \prod_µ p(y^µ | x^µ, w, β)

George and McCulloch in addition assume a prior over β_±, β (and γ).

7 Posterior

The posterior becomes

p(w, s | D, β, β_±, γ) = p(D | w, β) p(w | s, β_±) p(s | γ) / p(D)

where D is the data and

p(D | w, β) = \prod_{µ=1}^p p(y^µ | x^µ, w) ∝ exp( −(β/2) \sum_µ ( y^µ − \sum_i w_i x^µ_i )^2 )
            ∝ exp( −(βp/2) ( \sum_{ij} w_i w_j χ_{ij} − 2 \sum_i w_i b_i ) )

p(w | s, β_±) ∝ exp( −(1/2) \sum_{i=1}^n ( s_i β_+ w_i^2 + (1 − s_i) β_− w_i^2 ) )

8

p(s | γ) = \prod_i exp(γ s_i) / (1 + exp(γ))

For given s, the posterior distribution is Gaussian in w:

p(w | s) ∝ exp( −(1/2) \sum_{ij} (w_i − \bar w_i) A_{ij}(s) (w_j − \bar w_j) )
A_{ij}(s) = βp χ_{ij} + ( s_i β_+ + (1 − s_i) β_− ) δ_{ij},   \sum_j A_{ij} \bar w_j = βp b_i

For given w, the posterior factorizes over the s_i:

p(s | w) ∝ exp( γ \sum_i s_i − (1/2) \sum_{i=1}^n ( s_i β_+ w_i^2 + (1 − s_i) β_− w_i^2 ) )

9 Gibbs sampling

Sample w conditioned on s:

w ∼ N( \bar w(s), A(s)^{-1} )

Sample the s_i independently:

p(s_i = 1) = exp( γ − (1/2) β_+ w_i^2 ) / ( exp( γ − (1/2) β_+ w_i^2 ) + exp( −(1/2) β_− w_i^2 ) )
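One sweep of this Gibbs sampler can be sketched as follows (my own minimal illustration, assuming the data covariances χ, b defined earlier and fixed β, β_±, γ):

```python
# Minimal sketch: one Gibbs sweep for the spike-and-slab posterior.
# chi, b are the data covariances defined on the regression slides; s is a 0/1 vector.
import numpy as np

def gibbs_sweep(s, chi, b, p, beta, beta_plus, beta_minus, gamma, rng):
    n = len(s)
    # w | s is Gaussian with precision A(s); its mean solves A(s) w_bar = beta * p * b
    A = beta * p * chi + np.diag(s * beta_plus + (1 - s) * beta_minus)
    w_bar = np.linalg.solve(A, beta * p * b)
    w = rng.multivariate_normal(w_bar, np.linalg.inv(A))
    # s | w factorizes over components, with p(s_i = 1) as on this slide
    log_on = gamma - 0.5 * beta_plus * w**2
    log_off = -0.5 * beta_minus * w**2
    p_on = 1.0 / (1.0 + np.exp(log_off - log_on))
    s_new = (rng.random(n) < p_on).astype(float)
    return w, s_new
```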

10 Spike and slab

The advantage of the spike and slab model is that it does not shrink w. However, MCMC is complex and time consuming.

11 The Garrote

12 The variational Garrote

Introduce variables s_i = 0, 1 that select features. The regression model becomes

y^µ = \sum_{i=1}^n w_i s_i x^µ_i + ξ^µ,   s_i = 0, 1

Optimizing the s_i is equivalent to finding the optimal subset of relevant features. Since the number of subsets is exponential in n, one has to resort to heuristic methods to find a good subset of features. Here we propose a variational approximation.

13 The variational Garrote

The likelihood term is given by

p(y | x, s, w, β) = \sqrt{β / 2π} exp( −(β/2) ( y − \sum_{i=1}^n w_i s_i x_i )^2 )

p(D | s, w, β) = \prod_µ p(y^µ | x^µ, s, w, β)
              = ( β / 2π )^{p/2} exp( −(βp/2) ( \sum_{i,j=1}^n s_i s_j w_i w_j χ_{ij} − 2 \sum_{i=1}^n w_i s_i b_i + σ_y^2 ) )

with b_i, χ_{ij} as before and σ_y^2 = (1/p) \sum_µ (y^µ)^2.

14 The variational Garrote

For concreteness, we assume the prior over s

p(s | γ) = \prod_{i=1}^n p(s_i | γ),   p(s_i | γ) = exp(γ s_i) / (1 + exp(γ))

with γ given, which specifies the sparsity of the solution. We further assume priors p(β, w).

15 The variational Garrote

The posterior becomes

p(s, w, β | D, γ) = p(w, β) p(s | γ) p(D | s, w, β) / p(D | γ)

The posterior is intractable. Possible approaches:
- MCMC
- Variational Bayes
- Variational MAP
- BP, CVM, ...

Here we compute a variational MAP estimate: we approximate the marginal posterior

p(w, β | D, γ) = \sum_s p(s, w, β | D, γ)

and compute the MAP solution with respect to w, β.

16 Breiman's Garrote method

The proposed model is similar to Breiman's Garrote method:

y^µ = \sum_{i=1}^n w_i s_i x^µ_i + ξ^µ

which assumes continuous s_i ≥ 0 instead of binary s_i. It computes w_i using OLS and then finds the s_i by minimizing

\sum_µ ( y^µ − \sum_{i=1}^n x^µ_i w_i s_i )^2   subject to   s_i ≥ 0,   \sum_i s_i ≤ t

We refer to our method as the Binary Garrote (BG).

17 The variational approximation

We compute the variational approximation using Jensen's inequality:

−log \sum_s p(s | γ) p(D | s, w, β) ≤ \sum_s q(s) log( q(s) / ( p(s | γ) p(D | s, w, β) ) ) = F(q, w, β)

The optimal q(s) is found by minimizing F(q, w, β) with respect to q(s). We consider the simplest case

q(s) = \prod_{i=1}^n q_i(s_i),   q_i(s_i) = m_i s_i + (1 − m_i)(1 − s_i)

so q(s) is parametrized by m.

18 The variational approximation

The expectation values with respect to q can now be easily evaluated and the result is

F = −(p/2) log( β / 2π )
    + (βp/2) ( \sum_{i,j} v_i v_j χ_{ij} + \sum_i ((1 − m_i)/m_i) v_i^2 χ_{ii} − 2 \sum_{i=1}^n v_i b_i + σ_y^2 )
    − γ \sum_{i=1}^n m_i + n log(1 + exp(γ))
    + \sum_{i=1}^n ( m_i log m_i + (1 − m_i) log(1 − m_i) )

where we have defined v_i = m_i w_i.
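This expression translates directly into code (a small sketch of my own; it assumes 0 < m_i < 1 so that the entropy terms are finite):

```python
# Minimal sketch: evaluate the variational free energy F(m, w, beta, gamma).
import numpy as np

def free_energy(m, w, beta, gamma, chi, b, sigma_y2, p):
    v = m * w                                    # v_i = m_i w_i
    n = len(m)
    quad = (v @ chi @ v
            + np.sum((1 - m) / m * v**2 * np.diag(chi))
            - 2 * v @ b + sigma_y2)
    entropy = np.sum(m * np.log(m) + (1 - m) * np.log(1 - m))
    return (-0.5 * p * np.log(beta / (2 * np.pi))
            + 0.5 * beta * p * quad
            - gamma * np.sum(m)
            + n * np.log(1 + np.exp(gamma))
            + entropy)
```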

19 The variational approximation

The approximate marginal posterior is then

p(w, β | D, γ) ∝ p(w, β) \sum_s p(s | γ) p(D | s, w, β) ≈ p(w, β) exp( −F(m, w, β, γ) ) = exp( −G(m, w, β, γ) )

G(m, w, β, γ) = F(m, w, β, γ) − log p(w, β)

We compute the variational approximation m for given w, β, γ by minimizing F with respect to m. In addition, p(w, β | D, γ) needs to be maximized with respect to w, β.

20 The variational approximation

Taking the derivative of G with respect to m, v, β and setting the derivatives equal to zero gives the following set of fixed point equations:

m_i = σ( γ + (βp/2) v_i^2 χ_{ii} / m_i^2 ),   with σ(x) = 1 / (1 + exp(−x))

v = χ̃^{-1} b,   χ̃_{ij} = χ_{ij} + ((1 − m_i)/m_i) χ_{ii} δ_{ij}

1/β = −\sum_{i=1}^n v_i b_i + σ_y^2
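Iterating the three updates gives a simple fixed point scheme (a minimal sketch of my own, not the authors' released code; in practice some damping, or annealing in γ as discussed on the last slide, may be needed):

```python
# Minimal sketch: fixed point iteration for the variational Garrote at fixed gamma.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def variational_garrote(chi, b, sigma_y2, p, gamma, n_iter=200, eps=1e-6):
    n = len(b)
    m = np.full(n, 0.5)                          # variational marginals q(s_i = 1)
    beta = 1.0 / sigma_y2                        # initial noise precision
    chi_ii = np.diag(chi)
    for _ in range(n_iter):
        # v = chi_tilde^{-1} b, where chi_tilde adds a positive diagonal to chi
        chi_tilde = chi + np.diag((1.0 - m) / np.maximum(m, eps) * chi_ii)
        v = np.linalg.solve(chi_tilde, b)
        # stationarity conditions for m and beta
        m = sigmoid(gamma + 0.5 * beta * p * v**2 * chi_ii / np.maximum(m, eps)**2)
        beta = 1.0 / max(sigma_y2 - v @ b, eps)
    w = v / np.maximum(m, eps)                   # recover w_i from v_i = m_i w_i
    return m, w, beta
```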

21 Comments

The variational approximation is not simply the substitution w_i s_i → w_i m_i. If it were, the substitution v_i = w_i m_i would remove m_i from the equations and the OLS problem would be recovered. The reason is that ⟨s_i s_j⟩ = m_i m_j for i ≠ j, but ⟨s_i^2⟩ = ⟨s_i⟩ = m_i.

χ̃ differs from χ by adding a positive diagonal to it, making χ̃ automatically of maximal rank when m_i < 1. Roughly speaking, if χ has rank p < n, χ̃ can still be of rank n when no more than p of the m_i = 1, the remaining n − p of the m_i < 1 making up for the rank deficiency.

22 Independent inputs

When the inputs are uncorrelated, χ_{ij} = δ_{ij}:

w_i = b_i = ⟨x_i y⟩
m_i = σ( γ + (βp/2) b_i^2 )
1/β = σ_y^2 − \sum_i b_i^2 m_i

\sum_i b_i^2 m_i is the explained variance. The Garrote solution with m_i < 1 has reduced explained variance with (hopefully) a better prediction accuracy and interpretability.

23 Univariate case

In the 1-dimensional case these equations become

m = σ( γ + (p/2) ρ / (1 − ρm) ) ≡ f(m)
1/β = σ_y^2 (1 − ρm)

with ρ = b^2 / σ_y^2 the squared correlation coefficient.
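The univariate fixed point is easy to explore numerically (illustrative parameter values of my own choosing):

```python
# Minimal sketch: iterate the univariate map m -> f(m) from different starting points.
import numpy as np

def f(m, p, gamma, rho):
    return 1.0 / (1.0 + np.exp(-(gamma + 0.5 * p * rho / (1.0 - rho * m))))

def solve_m(p, gamma, rho, m0, n_iter=500):
    m = m0
    for _ in range(n_iter):
        m = f(m, p, gamma, rho)   # monotone iteration converges to a stable crossing of f(m) = m
    return m

# For these values f crosses the diagonal three times, so the two
# initializations converge to different stable solutions (one near 0, one near 1).
print(solve_m(p=100, gamma=-35.0, rho=0.5, m0=0.01),
      solve_m(p=100, gamma=-35.0, rho=0.5, m0=0.99))
```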

24 Univariate case

f(m) is an increasing function of m and crosses the line m either once or three times, depending on the values of p, γ, ρ.

[Figure: f(m) versus m. Left: different lines correspond to different values of 0 < ρ < 1. Right: a case with three solutions for m.]

The solutions close to m = 0 and m = 1 correspond to local minima of F. The intermediate solution corresponds to a local maximum of F.

25 Univariate case

One can compute the critical p* for which multiple solutions occur. p* is a decreasing function of ρ: for p > p* we find two solutions for m, and for p < p* we find one solution for m.

[Figure. Left: phase plot in (ρ, γ); the dotted line is the solution for γ when m = 1/2. Right: m versus ρ for two settings of (γ, p).]

26 Transfer function

Suppose that data are generated from the model y = wx + ξ, with ⟨ξ^2⟩ = ⟨x^2⟩ = 1.

[Figure: estimated w versus true w for the Binary Garrote (VG), ridge regression, Breiman's garrote and lasso, each at a fixed setting of its regularization parameter.]

27 Numerical examples

Inputs are generated from a mean-zero multivariate Gaussian distribution with a specified covariance structure. We generate outputs y^µ = \sum_i ŵ_i x^µ_i + ξ^µ with ξ^µ ∼ N(0, σ̂).

For each example, we generate a training set, a validation set and a test set (of sizes p / p_v / p_t). For each value of the hyperparameter (γ in the case of BG, λ in the case of ridge regression and Lasso), we optimize the model parameters on the training set. We optimize the hyperparameter on the validation set.
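Schematically, the protocol looks as follows (`fit_predict` is a hypothetical stand-in for fitting any of the methods at a given hyperparameter value; not code from the slides):

```python
# Minimal sketch of the train/validation/test protocol described above.
import numpy as np

def select_and_evaluate(hypers, fit_predict, X_tr, y_tr, X_val, y_val, X_te, y_te):
    # fit on the training set for each hyperparameter value, score on the validation set
    val_err = [np.mean((y_val - fit_predict(h, X_tr, y_tr, X_val)) ** 2) for h in hypers]
    best = hypers[int(np.argmin(val_err))]      # hyperparameter chosen on the validation set
    test_err = np.mean((y_te - fit_predict(best, X_tr, y_tr, X_te)) ** 2)
    return best, test_err                       # performance reported on the held-out test set
```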

28 Example 1

x^µ_i ∼ N(0, 1) independently, ŵ = (1, 0, ..., 0).

[Figure: Binary Garrote. Free energy F (forward and backward passes) versus γ; training and validation error versus γ; and the solution components v_1 and v_{2:n} versus γ.]

29 Example 1

x^µ_i ∼ N(0, 1) independently, ŵ = (1, 0, ..., 0).

[Figure: Lasso (top row) and ridge regression (bottom row). Training and validation error versus λ, and the solution components w_1 and w_{2:n} versus λ.]

30 Example 1

x^µ_i ∼ N(0, 1) independently, ŵ = (1, 0, ..., 0).

[Table: results on random instances. Columns: training, validation and test error, number of non-zero coefficients, δw and δw^2; rows: Ridge, Lasso, BG and the true model.]

31 Example 2

x^µ ∼ N(0, Σ) with Σ_{ij} = δ^{|i−j|}, δ = 0.5. ŵ_i = 1 for five of the components and ŵ_i = 0 for all others.

[Table: training, validation and test error, number of non-zero coefficients, δw and δw^2 for Lasso, BG and the true model, on random instances.]

[Figure: comparison of the Lasso and BG solutions.]

32 Dependence on noise

Data as in Example 1.

[Figure: test error, gap, and δw versus the noise level σ, for Lasso and VG. All results are averages over runs.]

33 Implementation issues for high dimensional problems

For large n, the most expensive part of the computation is the inversion of χ̃. Note that the free energy can also be written as

F = −(p/2) log( β / 2π )
    + (βp/2) ( (1/p) \sum_µ (z^µ)^2 + \sum_i ((1 − m_i)/m_i) v_i^2 χ_{ii} − 2 \sum_{i=1}^n v_i b_i + σ_y^2 )
    − γ \sum_{i=1}^n m_i + n log(1 + exp(γ))
    + \sum_{i=1}^n ( m_i log m_i + (1 − m_i) log(1 − m_i) )

with z^µ = \sum_i x^µ_i v_i. We can thus minimize F with respect to v, z under linear constraints, without the need to compute the covariance matrix χ. This is a quadratic optimization problem.
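A sketch of this idea (my own unconstrained variant using L-BFGS on the v-dependent part of F, rather than the constrained QP solved with MOSEK on the next slide; the n × n matrix χ is never formed):

```python
# Minimal sketch: minimize F over v for fixed m and beta, using z = X v directly.
import numpy as np
from scipy.optimize import minimize

def solve_v(X, y, m, beta, eps=1e-8):
    p, n = X.shape
    b = X.T @ y / p
    chi_diag = np.sum(X**2, axis=0) / p             # only the diagonal of chi is needed
    d = (1.0 - m) / np.maximum(m, eps) * chi_diag   # extra diagonal term in F

    def objective(v):
        z = X @ v                                   # z^mu = sum_i x^mu_i v_i
        val = 0.5 * beta * p * (z @ z / p + d @ v**2 - 2.0 * v @ b)
        grad = beta * p * (X.T @ z / p + d * v - b)
        return val, grad                            # terms of F independent of v are dropped

    res = minimize(objective, np.zeros(n), jac=True, method="L-BFGS-B")
    return res.x
```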

34 Implementation issues for high dimensional problems

The quadratic program can be solved in time that scales linearly with n.

[Table: CPU times in seconds for solving for v by matrix inversion (Regression) versus solving the QP problem using MOSEK, for increasing n; the problem is as described in Example 1.]

[Table: error, δw^2 and CPU time in seconds for Lasso and BG, for increasing n.]

35 Discussion

Local minima:
- appear for few and noisy data
- seem modest for (very) sparse problems

Increasing γ increases β and works as an annealing schedule.

Extensions:
- MAP: TAP, BP, CVM
- Full Bayes: MCMC, VB, ...
- Use of priors (on γ) instead of cross validation

Applications:
- Finding the structure of networks, both static and dynamic
- Finding genes in GWAS
- ...

arxiv.org/abs/1109.0486
