Bayesian Feature Selection with Strongly Regularizing Priors Maps to the Ising Model

LETTER  Communicated by Ilya M. Nemenman

Neural Computation 27, 2411-2422 (2015), doi:10.1162/NECO_a_00780. © 2015 Massachusetts Institute of Technology.

Bayesian Feature Selection with Strongly Regularizing Priors Maps to the Ising Model

Charles K. Fisher
Pankaj Mehta
Department of Physics, Boston University, Boston, MA 02215, U.S.A.

Identifying small subsets of features that are relevant for prediction and classification tasks is a central problem in machine learning and statistics. The feature selection task is especially important, and computationally difficult, for modern data sets where the number of features can be comparable to or even exceed the number of samples. Here, we show that feature selection with Bayesian inference takes a universal form and reduces to calculating the magnetizations of an Ising model under some mild conditions. Our results exploit the observation that the evidence takes a universal form for strongly regularizing priors, that is, priors that have a large effect on the posterior probability even in the infinite data limit. We derive explicit expressions for feature selection for generalized linear models, a large class of statistical techniques that includes linear and logistic regression. We illustrate the power of our approach by analyzing feature selection in a logistic regression-based classifier trained to distinguish between the letters B and D in the notMNIST data set.

1 Introduction

Modern technological advances have fueled the growth of large data sets where thousands, or even millions, of features can be measured simultaneously. Dealing with such a large number of features is often impractical, however, especially when the sample size is limited. Feature selection is one promising approach for dealing with such complex data sets (Bishop, 2006; MacKay, 2003). The goal of feature selection is to identify a subset of features relevant for statistical tasks such as prediction or classification. Feature selection is a difficult problem because features are often redundant and strongly correlated with each other. Furthermore, the relevance of a feature depends not only on the data set being analyzed but also on the statistical techniques being employed. While powerful methods for feature selection exist for a handful of techniques such as linear regression (Tibshirani, 1996; Zou & Hastie, 2005), there is no unified framework for performing feature selection in a computationally tractable manner.

To address this shortcoming, we present a unified approach for Bayesian feature selection that is applicable to a large class of commonly used statistical models.

In principle, Bayesian inference provides a unified framework for feature selection (O'Hagan, Forster, & Kendall, 2004; MacKay, 2003). Unfortunately, Bayesian methods often require extensive Monte Carlo simulations that become computationally intractable for very high-dimensional problems. In the Bayesian framework, a statistical model is defined by its likelihood function, $p_x(y \mid \theta)$, which describes the observed data, $y$, as a function of some features, $x$, and model parameters, $\theta$. This likelihood function is supplemented by a prior, $p(\theta)$, that encodes our belief about the model parameters in the absence of data. Using Bayes' rule, one can define the posterior probability, $p_x(\theta \mid y) \propto p_x(y \mid \theta)\,p(\theta)$, that describes our belief about the parameters $\theta$ given the observed data $y$. Notice that for a flat (constant) prior, maximizing the posterior distribution is equivalent to maximizing the likelihood, and Bayesian inference reduces to the usual maximum likelihood framework. More generally, the log of the prior can be interpreted as a penalty function that regularizes the model parameters. For example, the choice of a gaussian and a Laplace prior corresponds to imposing an L2 and an L1 penalty on the model parameters, respectively (Bishop, 2006).

Since the ultimate goal of feature selection is to identify a subset of features, it is helpful to introduce a set of indicator variables, $s$, that indicates whether a feature is included in the statistical model, with $s_i = +1$ if feature $i$ is included and $s_i = -1$ if it is not. The posterior distribution for $s$ can be computed as (MacKay, 1992)

$p_x(s \mid y) \propto p_x(y \mid s)\, p_0(s)$,   (1.1)

where $p_0(s)$ is a prior distribution describing any a priori information we have about which variables are relevant, and $p_x(y \mid s)$, often referred to as the evidence, is given by

$p_x(y \mid s) = \int d\theta\; p_x(y \mid \theta, s)\, p(\theta \mid s)$.   (1.2)

For simplicity, we will assume that $p_0(s) \propto 1$ is a flat prior for the rest of this letter; it is easy to extend our results to other cases. Feature selection can be performed by choosing features with large posterior probabilities.
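To make equations 1.1 and 1.2 concrete, the sketch below enumerates the posterior over the indicator spins $s$ by brute force for a small number of features. The `log_evidence` function is a hypothetical stand-in for $\log p_x(y \mid s)$ rather than an evaluation of the integral in equation 1.2; the point is only to illustrate the combinatorial object that the approximation developed below replaces, since the $2^p$ enumeration is exactly what becomes intractable for the problems of interest.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
p = 8
scores = rng.normal(size=p)  # hypothetical per-feature "usefulness"

def log_evidence(s):
    """Stand-in for log p_x(y | s); not the paper's model."""
    s = np.array(s)
    return float(scores[s == 1].sum())

# enumerate all 2^p spin configurations s_i = +/- 1 (equation 1.1, flat prior p_0)
configs = list(itertools.product([-1, 1], repeat=p))
log_post = np.array([log_evidence(s) for s in configs])
post = np.exp(log_post - log_post.max())
post /= post.sum()

# marginal posterior probability that each feature is relevant, p(s_i = +1 | y)
marginals = np.array([sum(w for w, s in zip(post, configs) if s[i] == 1)
                      for i in range(p)])
print(np.round(marginals, 3))
```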

Our central result is that the logarithm of the evidence maps to the energy of an Ising model for a large class of priors (i.e., $p(\theta)$) that we term strongly regularizing priors. Thus, computing the marginal posterior probabilities for the relevance of each feature reduces to computing the magnetizations of an Ising model. The key property of strongly regularizing priors is that they affect the posterior probability even when the sample size goes to infinity (Fisher & Mehta, 2015). This should be contrasted with the usual assumption of Bayesian inference that the effect of the prior on posterior probabilities diminishes inversely with the number of samples (MacKay, 2003). We expect that strongly regularizing priors are especially useful for feature selection problems where the number of potential features is greater than, or comparable to, the number of data points. Surprisingly, our results do not depend on the specific choice of prior or likelihood function, under some mild conditions, suggesting that Bayesian feature selection has universal properties for strongly regularizing priors. Practically, our Bayesian Ising approximation (BIA) provides an efficient method for computing the posterior probabilities necessary for feature selection with many commonly employed statistical procedures such as generalized linear models (GLMs). We envision using the BIA as part of a two-stage procedure in which it is applied to rapidly screen out irrelevant candidates, that is, those features that rank low in posterior probability, before a more computationally intensive cross-validation procedure is used to infer the model parameters with the remaining features.

2 General Formulas

For concreteness, consider a series of $n$ independent observations of some dependent variable $y$, $y = (y^1, \ldots, y^n)$, that we want to explain using a set of $p$ potential features $x = (x_1, \ldots, x_p)$. Furthermore, assume that to each of these potential features, $x_i$, we associate a model parameter $\theta_i$. Let $S$ denote the index set specifying the positions $j$ for which $s_j = +1$. The cardinality of $S$, that is, the number of indicator variables with $s_j = +1$, is denoted $M$. Also, we define a vector $\tilde\theta$ of length $M$ that contains the model parameters corresponding to the active features, $S$. With these definitions, we define the log likelihood for the observed data as a function of the active features as $L_X(\tilde\theta \mid y) = \log p_X(y \mid \tilde\theta)$. Throughout, we assume that the log likelihood is twice differentiable with respect to $\tilde\theta$.

In addition to the log likelihood, we need to specify prior distributions for the parameters. Here, we will work with factorized priors of the form

$P(\tilde\theta) = \prod_{j \in S} \frac{1}{Z(\lambda)}\, e^{-\lambda f(\tilde\theta_j)}$,   (2.1)

where $\lambda$ is a parameter that controls the strength of the regularization, $f(\theta_j)$ is a twice differentiable convex function minimized at a point $\bar\theta_j$ with $f(\bar\theta_j) = 0$, and $Z(\lambda)$ is the normalization constant. As the function is convex, we are assuming that its second derivative evaluated at $\bar\theta_j$, denoted $\partial^2 f$, is positive. Many commonly used priors take this form, including gaussian and hyperbolic priors. Plugging these expressions into equation 1.2 yields an expression for the evidence in which only the subset of features, $S$, is included:

$p_X(y \mid S) = Z^{-M}(\lambda) \int d\tilde\theta\; e^{\,L_X(\tilde\theta \mid y) - \lambda \sum_{j \in S} f(\tilde\theta_j)}$.   (2.2)

By definition, this integral is dominated by the second term in the exponent for strongly regularizing priors. Since the log likelihood is extensive in the number of data points, $n$, this generally requires that the regularization strength be much larger than the number of data points, $\lambda \gg n$. For such strongly regularizing priors, we can perform a saddle-point approximation for the evidence. This yields, up to an irrelevant constant,

$L_X(y \mid S) = \log p_X(y \mid S) = -\frac{1}{2} \log\big|I - \bar\lambda^{-1} H\big| + \frac{1}{2\bar\lambda}\, b^\top \big(I - \bar\lambda^{-1} H\big)^{-1} b$,   (2.3)

where $\bar\lambda = \lambda\,(\partial^2 f)$ is a renormalized regularization strength,

$b = \nabla L_X(\tilde\theta \mid y)$   (2.4)

is the gradient of the log likelihood, and

$H_{ij} = \frac{\partial^2 L_X(\tilde\theta \mid y)}{\partial \theta_i\, \partial \theta_j}$   (2.5)

is the Hessian of the log likelihood, with all derivatives evaluated at $\theta_j = \bar\theta_j$. We emphasize that the large parameter in our saddle-point approximation is the regularization strength $\bar\lambda \gg n$. This differs from previous statistical physics approaches to Bayesian feature selection, which commonly assume that $n \to \infty$. In that case, one can take $n$ as the large parameter in the saddle-point approximation, which leads to an approximation for the evidence related to the Bayesian information criterion (Kinney & Atwal, 2014; Balasubramanian, 1997; Nemenman & Bialek, 2002). The fundamental reason for this difference is that we work in the strongly regularizing regime where the prior is assumed to be important even in the infinite data limit.

Since for strongly regularizing priors $\bar\lambda$ is the largest scale in the problem, we can expand the logarithm in equation 2.3 in a power series in $\epsilon = 1/\bar\lambda$ to order $O(\epsilon^3)$ to get

$L_X(y \mid S) \approx \frac{\epsilon}{2}\left(\mathrm{Tr}[H] + b^\top b\right) + \frac{\epsilon^2}{2}\left(\frac{1}{2}\mathrm{Tr}[H^2] + b^\top H b\right)$.   (2.6)
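The short numerical check below compares the saddle-point evidence of equation 2.3 with its second-order expansion, equation 2.6, for a synthetic gradient and Hessian. Both equations are used in the form reconstructed above, and the data are entirely made up; the sketch only illustrates that the expansion tracks the exact expression in the strongly regularizing regime $\bar\lambda \gg n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 1000, 5                        # samples, number of active features
A = rng.normal(size=(M, M)) / np.sqrt(M)
H = -n * (A @ A.T + 0.1 * np.eye(M))  # synthetic Hessian of the log likelihood (negative definite, extensive in n)
b = n * rng.normal(scale=0.1, size=M) # synthetic gradient of the log likelihood at theta_bar

lam_bar = 100 * n                     # strongly regularizing regime: lambda_bar >> n
eps = 1.0 / lam_bar
K = np.eye(M) - H / lam_bar

exact = -0.5 * np.linalg.slogdet(K)[1] + 0.5 / lam_bar * b @ np.linalg.solve(K, b)  # equation 2.3
expanded = 0.5 * eps * (np.trace(H) + b @ b) \
    + 0.5 * eps ** 2 * (0.5 * np.trace(H @ H) + b @ H @ b)                          # equation 2.6
print(exact, expanded)                # agreement up to O(eps^3) corrections
```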

One of the most striking aspects of this expression is that it is independent of the detailed form of the prior function, $f(\theta)$. All that is required is that the prior is a twice differentiable convex function with a global minimum at which $f(\bar\theta_j) = 0$. Different choices of prior simply renormalize the effective regularization strength. Rewriting this expression in terms of the spin variables, $s$, we see that the evidence takes the Ising form

$L_X(y \mid s) = \epsilon\left(\sum_i s_i h_i + \sum_{ij} J_{ij}\, s_i s_j\right)$,   (2.7)

with fields and couplings given by

$h_i = \frac{1}{4}\big(H_{ii} + b_i^2\big) + 2\sum_j J_{ij}$,   (2.8)

$J_{ij} = \frac{\epsilon}{16}\big(H_{ij}^2 + 2\, b_i H_{ij} b_j\big)$.   (2.9)

Notice that the couplings, which are proportional to the small parameter $\epsilon$, are weak. According to Bayes' rule, $L(s \mid y) = L(y \mid s) + \text{constant}$ if the prior on $s$ is flat, so the posterior probability of a set of features $s$ is described by an Ising model of the form

$p(s \mid y) = Z^{-1}\, e^{\,\epsilon\left(\sum_i s_i h_i + \sum_{ij} s_i s_j J_{ij}\right)}$,   (2.10)

where $Z$ is the partition function that normalizes the probability distribution. We term this the Bayesian Ising approximation (BIA) (Fisher & Mehta, 2015). Finally, for future use, it is useful to define a scale, $\bar\lambda^*$, below which the Ising approximation breaks down. This scale can be computed by requiring that the power series expansion used to derive equation 2.6 converges (Fisher & Mehta, 2015).

We demonstrated the utility of the BIA for feature selection in the specific context of linear regression in a recent paper and can adapt that machinery to the current problem (Fisher & Mehta, 2015). It is useful to explicitly indicate the dependence of the various expressions on the regularization strength, $\bar\lambda$. We want to compute the marginal posterior probability that a feature, $j$, is relevant:

$p_{\bar\lambda}(s_j = 1 \mid y) = \big(1 + m_j(\bar\lambda)\big)/2$,   (2.11)

where we have defined the magnetizations $m_j(\bar\lambda) = \langle s_j \rangle$.
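A minimal sketch of the mapping in equations 2.8 and 2.9, assuming the reconstructed forms above: given the gradient $b$ and Hessian $H$ of the log likelihood evaluated at the prior's minimum, it returns the fields and couplings of the Ising model for a given $\bar\lambda$.

```python
import numpy as np

def ising_coefficients(b, H, lam_bar):
    """Fields h_i and couplings J_ij (equations 2.8-2.9) at regularization strength lam_bar."""
    eps = 1.0 / lam_bar
    # J_ij = (eps / 16) * (H_ij^2 + 2 b_i H_ij b_j)
    J = (eps / 16.0) * (H ** 2 + 2.0 * np.outer(b, b) * H)
    # h_i = (1/4) (H_ii + b_i^2) + 2 sum_j J_ij
    h = 0.25 * (np.diag(H) + b ** 2) + 2.0 * J.sum(axis=1)
    return h, J
```

With $b$ and $H$ computed for a particular model (see section 3 for logistic regression), these coefficients fully specify the Ising model of equation 2.10.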

While there are many techniques for calculating the magnetizations of an Ising model, we focus on the mean field approximation, which leads to a self-consistent equation (Opper & Winther, 2001):

$m_i(\bar\lambda) = \tanh\!\left[\frac{n^2}{4\bar\lambda}\left(h_i(\bar\lambda) + \frac{1}{2}\sum_{j \neq i} J_{ij}(\bar\lambda)\, m_j(\bar\lambda)\right)\right]$.   (2.12)

Our expressions depend on a free parameter ($\bar\lambda$) that determines the strength of the prior distribution. As it is usually difficult, in practice, to choose a specific value of $\bar\lambda$ ahead of time, it is often helpful to compute the feature selection path, that is, to compute $m_j(\bar\lambda)$ over a wide range of $\bar\lambda$'s. Indeed, computing the variable selection path is a common practice when applying other feature selection techniques such as Lasso regression (Tibshirani, 1996). To obtain the mean field variable selection path as a function of $\epsilon = 1/\bar\lambda$, we notice that $\lim_{\epsilon \to 0} m_j(\epsilon) = 0$ and so define the recursive formula

$m_i(\epsilon + \delta_\epsilon) = \tanh\!\left[\frac{(\epsilon + \delta_\epsilon)\, n^2}{4}\left(h_i(\epsilon + \delta_\epsilon) + \frac{1}{2}\sum_{j \neq i} J_{ij}(\epsilon + \delta_\epsilon)\, m_j(\epsilon)\right)\right]$,   (2.13)

with a small step size $\delta_\epsilon \ll 1/\bar\lambda^*$. We have set $\delta_\epsilon = 0.05/\bar\lambda^*$ in all of the examples we present.
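The recursion in equation 2.13 can be iterated directly to trace out the mean field selection path, as in the sketch below. It takes callables returning $h_i(\epsilon)$ and $J_{ij}(\epsilon)$; the $n^2/4$ and $1/2$ prefactors follow the reconstructed form of equation 2.13 given above and should be treated as an assumption rather than a definitive statement of the update.

```python
import numpy as np

def selection_path(h_of_eps, J_of_eps, lam_star, n, n_steps=20):
    """Iterate equation 2.13 from eps = 0 up to eps = n_steps * 0.05 / lam_star.

    h_of_eps and J_of_eps are callables returning the fields and couplings at a given eps.
    Returns the posterior probabilities (1 + m_j) / 2 of equation 2.11 at each step.
    """
    d_eps = 0.05 / lam_star             # step size used in the examples of this letter
    m = np.zeros(len(h_of_eps(d_eps)))  # lim_{eps -> 0} m_j(eps) = 0
    path = []
    for k in range(1, n_steps + 1):
        eps = k * d_eps
        h, J = h_of_eps(eps), J_of_eps(eps)
        coupling = J @ m - np.diag(J) * m   # sum over j != i of J_ij m_j, using the previous m
        m = np.tanh(eps * n ** 2 / 4.0 * (h + 0.5 * coupling))
        path.append((1.0 + m) / 2.0)
    return np.array(path)
```

With the default 20 steps, the path terminates at $\epsilon = 1/\bar\lambda^*$, that is, at $\bar\lambda = \bar\lambda^*$, the scale below which the approximation is expected to break down.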

3 Examples

Logistic regression is a commonly used statistical method for modeling categorical data (Bishop, 2006). To simplify notation, it is also useful to define an extra feature variable $x_0 = 1$ that is always equal to 1 and a $(p+1)$-dimensional vector of features, $x = (x_0, x_1, \ldots, x_p)$, with corresponding parameters $\theta = (\theta_0, \theta_1, \ldots, \theta_p)$. In terms of these vectors, the likelihood function for logistic regression takes the compact form

$P_x(y = 1 \mid \theta) = 1 - P_x(y = 0 \mid \theta) = \frac{e^{\theta^\top x}}{1 + e^{\theta^\top x}}$,   (3.1)

where $\theta^\top$ denotes the transpose of $\theta$. If we have $n$ independent observations of the data (labeled by an index $l = 1, \ldots, n$), the log likelihood can be written as

$L_X(y \mid \theta) = \sum_l \left[\, y^l (\theta^\top x^l) - \log\big(1 + e^{\theta^\top x^l}\big)\right]$.   (3.2)

We supplement this likelihood function with an L2 norm on the parameters of the form

$p(\theta) = \prod_{j=1}^{p} \sqrt{\frac{\lambda}{2\pi}}\; e^{-\lambda(\theta_j - \bar\theta_j)^2/2}$,   (3.3)

with $\bar\theta = (\bar\theta_0, 0, \ldots, 0)$ and where $\bar\theta_0$ is chosen to match the observed probability that $y = 1$ in the data:

$P_{\mathrm{obs}}(y) = \frac{e^{\bar\theta_0}}{1 + e^{\bar\theta_0}}$.   (3.4)

Using these expressions, we can calculate the gradient and the Hessian of the log likelihood,

$b_i \equiv \left.\frac{\partial L_X(y \mid \theta)}{\partial \theta_i}\right|_{\theta = \bar\theta} = \sum_l x_i^l\,\big(y^l - P_{\mathrm{obs}}(y)\big)$   (3.5)

and

$H_{ij} = -P_{\mathrm{obs}}(y)\big(1 - P_{\mathrm{obs}}(y)\big)\sum_l x_i^l x_j^l$.   (3.6)

Plugging these into equations 2.8 and 2.9 yields the fields and couplings for the Ising model in equation 2.10:

$h_i = \frac{1}{4}\left[\Big(\sum_l x_i^l\, \Delta y^l\Big)^2 - P_{\mathrm{obs}}(y)\big(1 - P_{\mathrm{obs}}(y)\big)\sum_l (x_i^l)^2\right] + 2\sum_j J_{ij}$,   (3.7)

$J_{ij} = \frac{\epsilon}{16}\Big[P_{\mathrm{obs}}(y)\big(1 - P_{\mathrm{obs}}(y)\big)\Big]^2\Big(\sum_l x_i^l x_j^l\Big)^2 - \frac{\epsilon}{8}\, P_{\mathrm{obs}}(y)\big(1 - P_{\mathrm{obs}}(y)\big)\Big(\sum_l x_i^l x_j^l\Big)\Big(\sum_l x_i^l\, \Delta y^l\Big)\Big(\sum_l x_j^l\, \Delta y^l\Big)$,   (3.8)

where we have defined $\Delta y^l = y^l - P_{\mathrm{obs}}(y)$. Notice that the gradient, $b_i$, is proportional to the correlation between $y$ and $x_i$. Furthermore, except for a multiplicative constant reflecting the variance of $y$, $H_{ij}$ is just the correlation between $x_i$ and $x_j$. Thus, as in linear regression (Fisher & Mehta, 2015), the coefficients of the Ising model are related to correlations between the features, the data, or both.
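For logistic regression, equations 3.4 to 3.6 only require the observed class frequency and two matrix products, as in the minimal sketch below. It assumes a 0/1 label vector `y` and an $n \times p$ feature matrix `X`, and the sign convention on $H$ follows the reconstruction above.

```python
import numpy as np

def logistic_b_H(X, y):
    """Gradient (eq. 3.5) and Hessian (eq. 3.6) of the logistic log likelihood at theta_bar.

    X: (n, p) array of features; y: length-n array of 0/1 labels.
    """
    p_obs = y.mean()                        # P_obs(y), equation 3.4
    dy = y - p_obs                          # Delta y^l = y^l - P_obs(y)
    b = X.T @ dy                            # b_i = sum_l x_i^l (y^l - P_obs(y))
    H = -p_obs * (1.0 - p_obs) * (X.T @ X)  # H_ij = -P_obs (1 - P_obs) sum_l x_i^l x_j^l
    return b, H
```

Feeding these into the `ising_coefficients` sketch above reproduces equations 3.7 and 3.8.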

In fact, it is easy to show that this is the case for all generalized linear models (see the appendix).

Figure 1: Three randomly chosen examples of Bs (top) and Ds (bottom) from the notMNIST data set. Each letter in the notMNIST data set is represented by a 28 × 28 pixel grayscale image.

To illustrate our approach, we used the BIA for a logistic regression-based classifier designed to classify Bs and Ds in the notMNIST data set, which consists of diverse images of the letters A to J composed from publicly available computer fonts. Each letter is represented as a grayscale image where pixel intensities vary between 0 and 255. Figure 1 shows three randomly chosen examples of the letters B and D from the notMNIST data set. The BIA was performed using 500 randomly chosen examples of each letter. Notice that the number of examples (500 of each letter) is comparable to the number of pixels (784) in an image, suggesting that strongly regularizing the problem is appropriate. Using expressions 3.7 and 3.8, we calculated the couplings for the Ising model describing our logistic-regression-based classifier and calculated the feature selection path as a function of $\bar\lambda/\bar\lambda^*$ using the mean field approximation. As in linear regression, we used $\bar\lambda^* = n(1 + p\bar r)$, where $n = 1000$ is the number of samples, $p = 784$ is the number of potential features, and $\bar r$ is the root-mean-squared correlation between pixels (see Figure 2A). Figure 2B shows the posterior probabilities of all 784 pixels when $\bar\lambda = \bar\lambda^*$. To better visualize this, we have labeled the pixels with the highest posterior probabilities in red in the feature selection path in Figure 2A and in the sample images shown in Figure 2C. The results agree well with our basic intuition about which pixels are important for distinguishing the letters B and D.

Figure 2: Identifying the most informative pixels for distinguishing Bs and Ds. (A) We used the Bayesian Ising approximation (BIA) to perform feature selection for a logistic regression-based classifier designed to classify Bs and Ds. The model was trained using 500 randomly chosen examples of each letter from the notMNIST data set. The resulting feature selection path is shown as a function of the regularization strength, $\bar\lambda$. Each line corresponds to one of 784 pixels. We expect the BIA to break down when $\bar\lambda < \bar\lambda^*$. Pixels in red have the highest posterior probability and hence can be used for distinguishing between Bs and Ds. (B) Posterior probabilities of all 784 pixels when $\bar\lambda/\bar\lambda^* = 1$. (C) Visualization of the most informative pixels labeled in red in panel A.
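The sketch below strings the previous pieces together on synthetic data standing in for the notMNIST images (which are not reproduced here). It reuses the hypothetical helpers `logistic_b_H`, `ising_coefficients`, and `selection_path` from the earlier sketches and evaluates the path at $\bar\lambda^* = n(1 + p\bar r)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 50
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(float)  # two informative features
X = (X - X.mean(axis=0)) / X.std(axis=0)                              # standardized features

b, H = logistic_b_H(X, y)                                             # equations 3.5-3.6

# lambda_bar_star = n (1 + p * r_bar), r_bar = RMS off-diagonal correlation between features
corr = np.corrcoef(X.T)
r_bar = np.sqrt(np.mean(corr[~np.eye(p, dtype=bool)] ** 2))
lam_star = n * (1.0 + p * r_bar)

h_of_eps = lambda eps: ising_coefficients(b, H, 1.0 / eps)[0]
J_of_eps = lambda eps: ising_coefficients(b, H, 1.0 / eps)[1]
path = selection_path(h_of_eps, J_of_eps, lam_star, n)

print(np.argsort(path[-1])[::-1][:5])   # highest-posterior features at lambda_bar = lambda_bar_star
```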

4 Discussion and Conclusion

We have presented a general framework for Bayesian feature selection in a large class of statistical models. In contrast to previous Bayesian approaches that assume that the effect of the prior vanishes inversely with the amount of data, we have used strongly regularizing priors that have large effects on the posterior distribution even in the infinite data limit. We have shown that in the strongly regularized limit, Bayesian feature selection takes the universal form of an Ising model. Thus, the marginal posterior probabilities that each feature is relevant can be efficiently computed using a mean field approximation. Furthermore, for generalized linear models, we have shown that the coefficients of the Ising model can be calculated directly from correlations between the data and features.

Methods for feature selection, including the BIA, generally attempt to identify a subset of features in which each feature is strongly correlated with the response but only weakly correlated with the other selected features. Most of these feature selection algorithms, such as the Lasso, identify a subset of features that minimizes a chosen objective function (Tibshirani, 1996; Zou & Hastie, 2005). This can also be viewed as finding a subset of features with maximum posterior probability with a suitably chosen likelihood and prior. It is typically difficult, however, to assess the significance of the selected features obtained from these maximum a posteriori methods (Lockhart, Taylor, Tibshirani, & Tibshirani, 2014). By contrast, the BIA directly computes the marginal posterior probabilities that each of the features is relevant. Thus, selecting a subset of features with the BIA is equivalent to choosing a threshold for statistical significance in a Bayesian framework in which significance is measured by posterior probabilities.

Like other approaches to feature selection, the BIA has a hyperparameter ($\bar\lambda$) that must be specified. Here, we have set $\bar\lambda = \bar\lambda^*$ because it is the weakest prior for which the mapping to the Ising model still holds. In practical applications, however, it will be worthwhile to explore other approaches to specifying a value for $\bar\lambda$, for example by maximizing the evidence or through cross-validation. Moreover, we have used a flat prior on $s$ throughout this work, but it is trivial to extend our results to other cases. Including a prior on $s$ would introduce other hyperparameters, which also have to be specified.

Surprisingly, aside from some mild regularity conditions, our approach is independent of the choice of prior or likelihood. This suggests that it may be possible to obtain general results about strongly regularizing priors. It will be interesting to explore Bayesian inference in this new limit.

Our approach also gives a practical algorithm for quickly performing feature selection in many commonly employed statistical and machine learning approaches. The methods outlined here are especially well suited for modern data sets where the number of potential features can vastly exceed the number of independent samples. We envision using the Bayesian feature selection algorithm outlined here as part of a two-stage procedure. One can use the BIA to rapidly screen out irrelevant candidates and reduce the complexity of the data set before applying a more comprehensive cross-validation procedure. More generally, it will be interesting to develop statistical physics-based approaches for the analysis of complex data.

Appendix: Formulas for Generalized Linear Models

The gradient, $b$, and Hessian, $H$, of the log likelihood have particularly simple definitions for generalized linear models (GLMs), which extend the exponential family of distributions. In the exponential family, we can write the distribution in the form

$p(y \mid \theta) = g(y)\, e^{\eta(\theta)^\top T(y) + F(\eta)}$,   (A.1)

where $T(y)$ is a vector of sufficient statistics and the $\eta$ are called natural parameters. Notice that for these distributions, we have that

$\frac{\partial F}{\partial \eta_i} = -\langle T_i(y) \rangle$   (A.2)

and

$\frac{\partial^2 F}{\partial \eta_i\, \partial \eta_j} = -\mathrm{Cov}\big[T_i(y), T_j(y)\big]$,   (A.3)

where Cov denotes the covariance (connected correlation function).
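Equations A.2 and A.3 (with the signs as reconstructed above) are easy to check numerically for a specific member of the exponential family. The sketch below does so for a Bernoulli distribution written in the form of equation A.1 with $g(y) = 1$, $T(y) = y$, and $F(\eta) = -\log(1 + e^{\eta})$.

```python
import numpy as np

def F(eta):
    """Log-normalizer term for a Bernoulli written as p(y|eta) = exp(eta * y + F(eta))."""
    return -np.log1p(np.exp(eta))

eta, d = 0.7, 1e-4
mean_y = np.exp(eta) / (1.0 + np.exp(eta))             # <T(y)> = <y>
var_y = mean_y * (1.0 - mean_y)                        # Cov[T(y), T(y)] = Var[y]

dF = (F(eta + d) - F(eta - d)) / (2 * d)               # finite-difference dF/deta
d2F = (F(eta + d) - 2 * F(eta) + F(eta - d)) / d ** 2  # finite-difference d2F/deta2

print(np.isclose(dF, -mean_y), np.isclose(d2F, -var_y, atol=1e-6))  # equations A.2 and A.3
```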

In a GLM, we restrict ourselves to distributions over scalar quantities where $T(y) = y$ and set $\eta = \theta^\top x$. Then we can write the likelihood as

$p(y \mid \theta, x) = g(y)\, e^{(\theta^\top x)\,y + F(\theta^\top x)}$.   (A.4)

If we have $n$ independent data points with $l = 1, \ldots, n$, then we can write the log likelihood for such a distribution as

$L_X(y \mid \theta) = \sum_l \left[\log g(y^l) + (\theta^\top x^l)\, y^l + F(\theta^\top x^l)\right]$.   (A.5)

Using the expressions above for the exponential family and equation 2.4, we have that

$b_i = \left.\frac{\partial L_X(y \mid \theta)}{\partial \theta_i}\right|_{\theta = \bar\theta} = \sum_l \big(y^l x_i^l - x_i^l \langle y \rangle_{\bar\theta}\big)$,   (A.6)

where $\langle y \rangle_{\bar\theta}$ is the expectation value of $y$ for the choice of parameters $\theta = \bar\theta$. If we choose $\bar\theta$ to reproduce the empirical probability, we get

$b_i \approx n\,\mathrm{Cov}[y, x_i]$.   (A.7)

Moreover, the entries of the Hessian are given by

$H_{ij} = \frac{\partial^2 L_X(y \mid \theta)}{\partial \theta_i\, \partial \theta_j} = -\sum_l x_i^l x_j^l\, \mathrm{Var}[y]\Big|_{\theta = \bar\theta}$.   (A.8)

If we consider standardized variables, $x$, then we can write

$H_{ij} \approx -n\,\mathrm{Cov}[x_i, x_j]\,\mathrm{Var}[y]$.   (A.9)

Acknowledgments

We thank Alex Lang, Javad Noorbakhsh, and David Schwab for helpful discussions. This work was supported by a Sloan Research Fellowship and a Simons Investigator Award (to P.M.), and internal funding from Boston University.

References

Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Computation, 9(2).

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
Fisher, C. K., & Mehta, P. (2015). Bayesian feature selection for high-dimensional linear regression via the Ising approximation with applications to genomics. Bioinformatics, btv037.
Kinney, J. B., & Atwal, G. S. (2014). Parametric inference in the large data limit using maximally informative models. Neural Computation, 26(4).
Lockhart, R., Taylor, J., Tibshirani, R. J., & Tibshirani, R. (2014). A significance test for the Lasso. Annals of Statistics, 42(2).
MacKay, D. J. (1992). Bayesian interpolation. Neural Computation, 4(3).
MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.
Nemenman, I., & Bialek, W. (2002). Occam factors and model independent Bayesian learning of continuous distributions. Physical Review E, 65(2).
O'Hagan, A., Forster, J., & Kendall, M. G. (2004). Bayesian inference. London: Arnold.
Opper, M., & Winther, O. (2001). From naive mean field theory to the TAP equations. In M. Opper & D. Saad (Eds.), Advanced mean field methods: Theory and practice. Cambridge, MA: MIT Press.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2).

Received April 17, 2015; accepted June 29, 2015.
