Variable Selection in Data Mining Project


Gilles Godbout
IFT6266 Algorithmes d'Apprentissage, Session Project
Dept. Informatique et Recherche Opérationnelle
Université de Montréal
Montreal, QC, Canada H3C 3J7

Editor: Yoshua Bengio

Keywords: Data Mining, Variable Selection, $\ell_1$-norm Regularization, Gradient Directed Regularization

1. Introduction

Bishop (2) establishes that, in practical applications of data mining, the choice of preprocessing applied to the available data is one of the most significant factors in determining the performance of the final system. Data mining problems are often characterized by a large number of raw input variables, sometimes in the tens of thousands, and comparatively few training examples. Learning algorithms perform well on domains with a relatively small number of relevant variables. They tend to degrade, however, in the presence of a large number of variables that may carry irrelevant and redundant information.

Many approaches have been proposed to address the problems of relevance and dimensionality reduction of the input variables. They include algorithms for feature extraction, variable and feature selection, and example selection, to name a few.

This document is the report on a session project for the course IFT6266 Algorithmes d'Apprentissage. The objectives of the project are described in the next section. Section 3 briefly documents the concepts around variable selection. In section 4, we present a specific data mining problem and propose different approaches to variable selection to address the problem of relevance and dimensionality reduction. We document the results of our experimentation in section 5 and our conclusions in section 6.

This is a preliminary version. At this stage, it is presented as a plan. It contains several elements that are incomplete and require further research and/or discussion.

2. Project Objectives

This project concentrates on the study of various techniques for variable selection. Its objectives are three-fold:

- to familiarize the author with the spectrum of approaches proposed to address the problems of relevance and dimensionality reduction in the current data mining field;
- to compare various selection algorithms on a specific data mining problem;
- to complement the LISA project PLearn library with one of the studied techniques: Gradient Directed Regularization for Linear Classification.

3. Variable Selection

We can identify three benefits of variable selection:

- to improve the prediction performance of the predictor;
- to reduce the processing requirements of the predictor;
- to provide a better understanding of the data by identifying the variables most relevant to the problem at hand.

The question is how to identify a subset of the input variables that leads to a good predictor. Guyon & Elisseeff (5) classify the currently proposed methods into three categories: filters, wrappers and embedded methods.

3.1 Filters

Filters select subsets of variables as a pre-processing step, independently of the chosen predictor. They rank variables according to their predictive power, which can be measured in various ways. Many filter techniques rely on empirical estimates of the mutual information between each variable and the target. Ranking the input variables by their predictive power clearly provides some understanding of the data in relation to the problem at hand. However, because variables are typically scored one at a time, these techniques have more difficulty identifying combinations of variables with high predictive power. They are also prone to selecting redundant subsets of variables, and hence do not address the issue of dimensionality as well as one would wish.
For these reasons, we have chosen not to experiment with any of these techniques in this project.

3.2 Wrappers

Wrappers use the chosen predictor to score subsets of variables according to their predictive power. The main idea is to train the predictor with many different subsets of the input variables and choose the subset providing the best generalization performance. In most cases, trying every possible subset would be prohibitive, since the number of subsets grows exponentially with the number of variables. The question then becomes how to search the space of all possible variable subsets. Greedy forward and backward search strategies have been proposed. Their processing requirements do not grow exponentially with the number of variables. Forward selection starts with an empty subset and progressively adds the next most promising variable. Conversely, backward selection starts with the set of all variables and progressively eliminates the least promising one. Cross-validation can be used to assess the performance of the predictor with the various subsets and to select the subset with the best generalization power.

In this project, we will experiment with a greedy forward selection algorithm to try to improve the generalization performance of a chosen predictor. The algorithm developed is described in more detail in section 4.3.

3.3 Embedded Methods

Embedded methods perform implicit variable selection in the process of training the predictor itself. One way of achieving this is to add a regularization term to the optimization criterion, which controls the complexity of the model by keeping some of the weights at zero, in effect applying the predictor to a subset of the input variables. In this project, we will experiment with this concept using an $\ell_1$-norm regularization term. The $\ell_1$-norm regularization approach is described in more detail in section 4.4.

Other approaches to implicit variable selection are intrinsic to the training algorithm itself. We will implement Gradient Directed Regularization, a gradient descent algorithm in this category. We will use it to learn the weight vector and we will emphasize its variable selection capacity. Gradient Directed Regularization is described in more detail in section 4.5.

4. The Data Mining Problem

This section describes the various components of the experimentation that we want to carry out.

4.1 The Data

This project is centered around a data set that presents many of the characteristics typical of data mining problems.
It is a classification task with a binary target (0 or 1). The inputs are noisy and of very high dimension (1092), so overfitting is likely to occur (the "curse of dimensionality"). There is a large imbalance between the two classes: only 8.7% of the examples belong to class 1. Finally, the number of examples is relatively small (8176), given the large dimension of the input variables.

4.2 The Predictor

Because of the imbalance between the two classes, classification accuracy is not a good measure of performance: with only 8.7% of the examples in class 1, a predictor that always answers class 0 would already be 91.3% accurate. We propose instead to use logistic regression to build an estimator of $P(Y \mid X)$. In logistic regression, the probability of an example $x_i$ being drawn from the positive class is estimated by:

$$P(Y = +1 \mid X = x_i) = \mathrm{sigmoid}\Big(\sum_{k=0}^{d} w_k x_{ik}\Big)$$

where $\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$, $1 - \mathrm{sigmoid}(z) = \mathrm{sigmoid}(-z)$, and $d$ is the number of input variables. Note that we have added to each example a feature $x_{i0} = 1$, so that $w_0$ acts as a bias parameter. We will also pre-process the available data to replace the class label 0 with -1. Then we can write:

$$P(Y = -1 \mid X = x_i) = 1 - P(Y = +1 \mid X = x_i) = 1 - \mathrm{sigmoid}\Big(\sum_{k=0}^{d} w_k x_{ik}\Big) = \mathrm{sigmoid}\Big(-\sum_{k=0}^{d} w_k x_{ik}\Big)$$

We can therefore combine the two formulas to facilitate the implementation:

$$P(Y = y \mid X = x_i) = \mathrm{sigmoid}\Big(y \sum_{k=0}^{d} w_k x_{ik}\Big)$$

We will learn the parameters so as to maximize the likelihood of the training data. Hence, we will use the negative log-likelihood, summed over the training examples $i$, as the performance criterion to minimize:

$$L(w) = -\sum_{i} \log\Big(\mathrm{sigmoid}\Big(y_i \sum_{k=0}^{d} w_k x_{ik}\Big)\Big)$$

The weight vector $w$ will be learned through gradient descent. Note that $\frac{\partial\, \mathrm{sigmoid}(z)}{\partial z} = \mathrm{sigmoid}(z)\, \mathrm{sigmoid}(-z)$. We can compute the gradient of the negative log-likelihood with respect to each $w_j$ with the following formula:

$$\frac{\partial L(w)}{\partial w_j} = -\sum_{i} y_i x_{ij}\, \mathrm{sigmoid}\Big(-y_i \sum_{k=0}^{d} w_k x_{ik}\Big)$$

Under a scenario with no regularization, the parameter update formula is:

$$w_j^{t} = w_j^{t-1} - \epsilon \frac{\partial L(w)}{\partial w_j}$$

We will stop learning when the sum of the absolute values of the gradient components, to which the weight differences are proportional, falls below a threshold:

$$\sum_{j=0}^{d} \left| \frac{\partial L(w)}{\partial w_j} \right| < \gamma$$

The parameter $\epsilon$ is the learning rate and we call $\gamma$ the early stopping parameter. Under this model, both the learning rate and the early stopping parameter are hyper-parameters. They will be selected by cross-validation.

Some questions to consider: Should we use stochastic gradient descent to accelerate learning, or does the class imbalance preclude it? Should we use a momentum term to progressively reduce the learning rate? The two questions are related.
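The formulas of this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the PLearn implementation: the function names, the default values of the learning rate and of the threshold, the step limit and the synthetic data are all hypothetical, and full-batch gradient descent is used since the stochastic variant is still an open question above.

```python
import numpy as np

def sigmoid(z):
    # Logistic function 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    # Negative log-likelihood: -sum_i log sigmoid(y_i * <w, x_i>).
    # logaddexp(0, -m) equals -log(sigmoid(m)) and is numerically stable.
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))

def nll_gradient(w, X, y):
    # dL/dw_j = -sum_i y_i * x_ij * sigmoid(-y_i * <w, x_i>).
    margins = y * (X @ w)
    return -(X.T @ (y * sigmoid(-margins)))

def fit_logistic(X, y, learning_rate=0.005, gamma=1e-3, max_steps=10000):
    # X must already contain the constant column x_i0 = 1; y is in {-1, +1}.
    w = np.zeros(X.shape[1])
    for _ in range(max_steps):
        grad = nll_gradient(w, X, y)
        # Stopping rule: sum of absolute gradient components below gamma.
        if np.sum(np.abs(grad)) < gamma:
            break
        w = w - learning_rate * grad
    return w

# Tiny synthetic example (hypothetical data, not the project data set).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
X_demo = np.hstack([np.ones((200, 1)), features])  # prepend the bias feature
true_w = np.array([-1.0, 2.0, 0.0, -1.5])
y_demo = np.where(rng.random(200) < sigmoid(X_demo @ true_w), 1.0, -1.0)
w_demo = fit_logistic(X_demo, y_demo)
```

Encoding the labels as -1 and +1 is what lets a single expression, `y * (X @ w)`, serve both classes in the likelihood and in its gradient.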

4.3 The Greedy Forward Selection Algorithm

This forward selection algorithm starts with an empty subset and progressively adds the next most promising variable. It then uses cross-validation to assess the performance of the predictor with the various subsets and to select the subset with the best generalization power. The pseudo-code of the algorithm is as follows:

    010 we define d as the dimension of the input
    020 we define A as our predictor
    030 we define {X_i} as the subset of the input containing dimension i
    040 initialize subset B_0 = {} (an empty subset)
    050 for d iterations with i = 1 to d do
    060     for all {X_j} not yet selected in B_{i-1} do
    070         train A with B_{i-1} ∪ {X_j}
    080     select C_i as the {X_j} that gave the best performance criterion for
    090         A trained with B_{i-1} ∪ {X_j}
    100     set B_i = B_{i-1} ∪ C_i
    110 use cross-validation to select between all B_i

Note that on line 080 we can choose between the {X_j} directly from the training results, since we are choosing between models of the same complexity.

4.4 The $\ell_1$-norm Regularization

The $\ell_1$-norm of a vector $v$ is defined by:

$$\|v\|_1 = \sum_{i} |v_i|$$

Tibshirani (11) has demonstrated that introducing a constraint of the form $\|v\|_1 \le t$ on a parameter vector $v$ shrinks some coefficients and sets others to zero. It will therefore achieve implicit variable subset selection if we apply this constraint to the non-bias components of our weight vector. This is equivalent to rewriting the performance criterion to minimize as:

$$L_{\ell_1}(w, \lambda) = -\sum_{i} \log\Big(\mathrm{sigmoid}\Big(y_i \sum_{k=0}^{d} w_k x_{ik}\Big)\Big) + \lambda \sum_{k=1}^{d} |w_k|$$

To overcome the difficulty of $|w_k|$ being a non-differentiable function, we transform $w$ into $w^+ - w^-$ and add the $2d$ additional constraints:

$$w_k^+ \ge 0 \quad \text{and} \quad w_k^- \ge 0 \quad \text{for } k \in \{1, \dots, d\}$$

We fix $w_0^- = 0$ and let $w_0^+$ move freely as our bias parameter, since we do not need two.
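A short argument, added here as a sketch, shows why the split into two non-negative parts recovers the $\ell_1$ penalty. The offset $c$ is an auxiliary symbol introduced only for this illustration. For a fixed value of $w_k$, every feasible pair $(w_k^+, w_k^-)$ can be written with some $c \ge 0$ as:

```latex
\begin{aligned}
w_k^+ &= \max(w_k, 0) + c, \qquad w_k^- = \max(-w_k, 0) + c, \qquad c \ge 0, \\
w_k^+ - w_k^- &= w_k, \\
w_k^+ + w_k^- &= |w_k| + 2c \;\ge\; |w_k|.
\end{aligned}
```

The likelihood term depends only on the difference $w_k^+ - w_k^-$, so for $\lambda > 0$ any minimizer has $c = 0$, and the penalty $\lambda (w_k^+ + w_k^-)$ then equals $\lambda |w_k|$.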
The criterion to minimize then becomes:

$$L_{\ell_1}(w, \lambda) = -\sum_{i} \log\Big(\mathrm{sigmoid}\Big(y_i \sum_{k=0}^{d} (w_k^+ - w_k^-)\, x_{ik}\Big)\Big) + \lambda \sum_{k=1}^{d} (w_k^+ + w_k^-)$$

When $j = 0$, the formulas to compute the gradient and update the parameter $w_0^+$ remain the same as in section 4.2. For $j > 0$, the formulas to compute the gradient of our criterion with respect to each $w_j^s$, with $s \in \{+, -\}$, become:

$$\frac{\partial L_{\ell_1}(w, \lambda)}{\partial w_j^+} = -\sum_{i} y_i x_{ij}\, \mathrm{sigmoid}\Big(-y_i \sum_{k=0}^{d} (w_k^+ - w_k^-)\, x_{ik}\Big) + \lambda$$

$$\frac{\partial L_{\ell_1}(w, \lambda)}{\partial w_j^-} = \sum_{i} y_i x_{ij}\, \mathrm{sigmoid}\Big(-y_i \sum_{k=0}^{d} (w_k^+ - w_k^-)\, x_{ik}\Big) + \lambda$$

Under this scenario with regularization, the parameter update formulas for $j > 0$ are modified as follows to maintain the additional constraints:

$$w_j^{s(t)} = \begin{cases} w_j^{s(t-1)} - \epsilon \dfrac{\partial L_{\ell_1}(w, \lambda)}{\partial w_j^s} & \text{if the result is greater than 0} \\ 0 & \text{otherwise} \end{cases}$$

The hyper-parameter $\lambda$ is the regularization parameter, which will be selected by cross-validation together with the learning rate and the early stopping parameter. We need to verify that this is the proper way of maintaining the non-negativity constraint on all the parameters.

How does this achieve implicit variable subset selection? Choosing a zero regularization parameter $\lambda$ reverts to the same algorithm as the one described in section 4.2, with no subset selection. As we increase $\lambda$, we progressively force the weights of the least promising variables to zero, hence moving progressively towards an empty subset.

4.5 The Gradient Directed Regularization

Instead of using a regularization term to do implicit subset selection as in the previous section, this approach uses the parameter update algorithm of the gradient descent itself as the means of regularization. In this algorithm, we start with an empty subset of selected variables and we set all the variable weights to zero. Then, at each update step of the gradient descent, we first identify which variable weight shows the largest gradient in absolute value. If that variable has not yet been selected, we include it in the subset of selected variables. After that, we update only the weights of the variables in the subset of selected variables, using the same formula as in section 4.2. Compared with the approach in the previous section, this is equivalent to starting with a very large value of the parameter $\lambda$ and moving progressively towards a situation where $\lambda = 0$.
The obvious advantage is that we no longer have to select an additional hyper-parameter. Have we understood this properly? We have not found references for this approach. The closest are Least Angle Regression by Efron et al. (3) and Forward Stagewise Linear Regression by Hastie et al. (6). Friedman & Popescu (4) allude to it in reference to the previous two. It makes sense to us, but we need to confirm it before spending more time on it.
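The update scheme described in this section can be sketched as follows. The sketch follows the description given here rather than any published formulation, and several choices are assumptions made for illustration: the function names, the fixed number of steps standing in for early stopping, the learning rate, the synthetic data, and the fact that a bias column, if present, would be treated like any other variable.

```python
import numpy as np

def sigmoid(z):
    # Logistic function 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def nll_gradient(w, X, y):
    # Gradient of -sum_i log sigmoid(y_i * <w, x_i>), labels y in {-1, +1}.
    return -(X.T @ (y * sigmoid(-y * (X @ w))))

def gradient_directed_selection(X, y, learning_rate=0.005, n_steps=50):
    # Start from zero weights and an empty set of selected variables.
    w = np.zeros(X.shape[1])
    selected = np.zeros(X.shape[1], dtype=bool)
    for _ in range(n_steps):
        grad = nll_gradient(w, X, y)
        # The variable with the largest absolute gradient joins the subset.
        selected[np.argmax(np.abs(grad))] = True
        # Only the weights of already selected variables are updated.
        w[selected] -= learning_rate * grad[selected]
    return w, selected

# Hypothetical data: only the first two of ten variables carry signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 10))
logits = 3.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]
y_demo = np.where(rng.random(300) < sigmoid(logits), 1.0, -1.0)
w_demo, selected_demo = gradient_directed_selection(X_demo, y_demo, n_steps=5)
```

Because at most one variable enters the subset per step, the number of steps bounds the number of non-zero weights. Stopping early therefore plays the role that a large $\lambda$ plays in section 4.4.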

5. Experimentation Results

This section will be developed after the previous section is completely finalized. It will present and compare the results of:

- training the chosen predictor without any regularization,
- training the predictor using greedy forward selection to choose a subset of variables,
- training the predictor with an $\ell_1$-norm regularization term as described in section 4.4,
- training the predictor using the gradient directed regularization algorithm as described in section 4.5.

One of the major remaining tasks is to map out the partitioning of the available data in order to do all the training, validation and comparison implied in this section.

6. Conclusions

This section will be developed after the results of the experimentation are known. It should establish whether we have indeed been able to improve the generalization performance of our chosen predictor by limiting the complexity of the model through variable selection. It will also identify additional work worth pursuing on the subject of pre-processing for the purpose of relevance and dimensionality reduction.

References

[1] Yoshua Bengio & Nicolas Chapados: Extensions to Metric Based Models, Journal of Machine Learning Research 3 (2003)
[2] Christopher M. Bishop: Chapter 8: Pre-Processing and Feature Extraction, Neural Networks for Pattern Recognition, Oxford University Press (1995)
[3] Bradley Efron, Trevor Hastie, Iain Johnstone & Robert Tibshirani: Least Angle Regression, Web-Published Writings (2003), 1-44
[4] Jerome H. Friedman & Bogdan E. Popescu: Gradient Directed Regularization for Linear Regression and Classification, Web-Published Writings (2004), 1-40
[5] Isabelle Guyon & André Elisseeff: An Introduction to Variable and Feature Selection, Journal of Machine Learning Research 3 (2003)
[6] Trevor Hastie, Robert Tibshirani & Jerome Friedman: Section Regularization, The Elements of Statistical Learning, Springer Publishing (2003)
[7] Simon Latendresse & Yoshua Bengio: Linear Regression and the Optimization of Hyper-Parameters, Web-Published Writings, 1-7
[8] Baranidharan Raman & Thomas R. Ioerger: Enhancing Learning using Feature and Example Selection, Journal of Machine Learning Research 3 (2003), 1-37
[9] Jason Rennie: Logistic Regression, Web-Published Writings (2003), 1-3
[10] Saharon Rosset & Ji Zhu: Piecewise Linear Regularized Solution Paths, Submission for a Workshop at NIPS (2003), 1-20
[11] Robert Tibshirani: Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B, Volume 58, No. 1 (1996)


More information


ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Lasso Regression: Regularization for feature selection

Lasso Regression: Regularization for feature selection Lasso Regression: Regularization for feature selection Emily Fox University of Washington January 18, 2017 1 Feature selection task 2 1 Why might you want to perform feature selection? Efficiency: - If

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives

A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives Paul Grigas May 25, 2016 1 Boosting Algorithms in Linear Regression Boosting [6, 9, 12, 15, 16] is an extremely

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

Ordinal Classification with Decision Rules

Ordinal Classification with Decision Rules Ordinal Classification with Decision Rules Krzysztof Dembczyński 1, Wojciech Kotłowski 1, and Roman Słowiński 1,2 1 Institute of Computing Science, Poznań University of Technology, 60-965 Poznań, Poland

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information

Links between Perceptrons, MLPs and SVMs

Links between Perceptrons, MLPs and SVMs Links between Perceptrons, MLPs and SVMs Ronan Collobert Samy Bengio IDIAP, Rue du Simplon, 19 Martigny, Switzerland Abstract We propose to study links between three important classification algorithms:

More information

Bayesian Support Vector Machines for Feature Ranking and Selection

Bayesian Support Vector Machines for Feature Ranking and Selection Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information


MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information


COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16 COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training-

More information

BACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation

BACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation BACKPROPAGATION Neural network training optimization problem min J(w) w The application of gradient descent to this problem is called backpropagation. Backpropagation is gradient descent applied to J(w)

More information

Logistic Regression Trained with Different Loss Functions. Discussion

Logistic Regression Trained with Different Loss Functions. Discussion Logistic Regression Trained with Different Loss Functions Discussion CS640 Notations We restrict our discussions to the binary case. g(z) = g (z) = g(z) z h w (x) = g(wx) = + e z = g(z)( g(z)) + e wx =

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Statistical NLP for the Web

Statistical NLP for the Web Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Neural Networks. Haiming Zhou. Division of Statistics Northern Illinois University.

Neural Networks. Haiming Zhou. Division of Statistics Northern Illinois University. Neural Networks Haiming Zhou Division of Statistics Northern Illinois University Neural Networks The term neural network has evolved to encompass a large class of models and learning methods.

More information

Machine Learning

Machine Learning Machine Learning 10-601 Maria Florina Balcan Machine Learning Department Carnegie Mellon University 02/10/2016 Today: Artificial neural networks Backpropagation Reading: Mitchell: Chapter 4 Bishop: Chapter

More information

Logistic Regression & Neural Networks

Logistic Regression & Neural Networks Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability

More information

Boosting as a Regularized Path to a Maximum Margin Classifier

Boosting as a Regularized Path to a Maximum Margin Classifier Journal of Machine Learning Research () Submitted 5/03; Published Boosting as a Regularized Path to a Maximum Margin Classifier Saharon Rosset Data Analytics Research Group IBM T.J. Watson Research Center

More information

Machine Learning Linear Models

Machine Learning Linear Models Machine Learning Linear Models Outline II - Linear Models 1. Linear Regression (a) Linear regression: History (b) Linear regression with Least Squares (c) Matrix representation and Normal Equation Method

More information

COMP-4360 Machine Learning Neural Networks

COMP-4360 Machine Learning Neural Networks COMP-4360 Machine Learning Neural Networks Jacky Baltes Autonomous Agents Lab University of Manitoba Winnipeg, Canada R3T 2N2 Email: WWW:

More information

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition Ad Feelders Universiteit Utrecht Department of Information and Computing Sciences Algorithmic Data

More information

Variations of Logistic Regression with Stochastic Gradient Descent

Variations of Logistic Regression with Stochastic Gradient Descent Variations of Logistic Regression with Stochastic Gradient Descent Panqu Wang( Phuc Xuan Nguyen( January 26, 2012 Abstract In this paper, we extend the traditional logistic

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting

More information

Optimization and Gradient Descent

Optimization and Gradient Descent Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function

More information


CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Learning Neural Networks

Learning Neural Networks Learning Neural Networks Neural Networks can represent complex decision boundaries Variable size. Any boolean function can be represented. Hidden units can be interpreted as new features Deterministic

More information

The Entire Regularization Path for the Support Vector Machine

The Entire Regularization Path for the Support Vector Machine The Entire Regularization Path for the Support Vector Machine Trevor Hastie Department of Statistics Stanford University Stanford, CA 905, USA Saharon Rosset IBM Watson Research Center

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

CSC 578 Neural Networks and Deep Learning

CSC 578 Neural Networks and Deep Learning CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross

More information

Multilayer Neural Networks

Multilayer Neural Networks Multilayer Neural Networks Introduction Goal: Classify objects by learning nonlinearity There are many problems for which linear discriminants are insufficient for minimum error In previous methods, the

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li.

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information