Variable Selection in Data Mining Project
Gilles Godbout
IFT6266 Algorithmes d'Apprentissage, Session Project
Dept. Informatique et Recherche Opérationnelle
Université de Montréal, Montreal, QC, Canada H3C 3J7
godbougi@iro.umontreal.ca

Editor: Yoshua Bengio

Keywords: Data Mining, Variable Selection, l1-norm Regularization, Gradient Directed Regularization

1. Introduction

Bishop (2) establishes that in practical applications of data mining, the choice of pre-processing of the available data will be one of the most significant factors determining the performance of the final system. Data mining situations are often characterized by the availability of a large number of raw input variables, sometimes in the tens of thousands, and comparably few training examples. Learning algorithms perform well on domains with a relatively small number of relevant variables. They tend to degrade, however, in the presence of a large number of variables carrying possibly irrelevant and redundant information. Many approaches have been proposed to address the problems of relevance and dimensionality reduction of the input variables. They include algorithms for feature extraction, variable and feature selection, and example selection, to name a few.

This document presents the report on a session project for the course IFT6266 Algorithmes d'Apprentissage. The objectives of the project are described in the next section. Section 3 briefly documents the concepts around variable selection. In section 4, we present a specific data mining problem and propose different variable selection approaches to address the problems of relevance and dimensionality reduction. We document the results of our experimentation in section 5 and our conclusions in section 6.

This is a preliminary version. At this stage, it is presented as a plan. It contains several elements that are incomplete and require further research and/or discussion.
2. Project Objectives

This project will concentrate on the study of various techniques for variable selection. The objectives of this project are three-fold:

- to familiarize the author with the spectrum of approaches proposed to address the problems of relevance and dimensionality reduction in the current data mining field;
- to compare various selection algorithms in order to solve a specific data mining problem;
- to complement the LISA project PLearn library with one of the studied techniques: the Gradient Directed Regularization for Linear Classification.

3. Variable Selection

We can identify three benefits to variable selection:

- to improve the prediction performance of the predictor;
- to reduce the processing requirements of the predictor;
- to provide for a better understanding of the data by identifying the variables most relevant to the problem at hand.

The question is how to identify a subset of the input variables that will lead to the building of a good predictor. Guyon & Elisseeff (5) classified the currently proposed methods into three categories: filters, wrappers and embedded methods.

3.1 Filters

Filters select subsets of variables as a pre-processing step, independently of the chosen predictor. They rank variables according to their predictive power, which can be measured in various ways; many filter techniques rely on empirical estimates of the mutual information between each variable and the target. Clearly, ranking the input variables by their predictive power does provide some understanding of the data in relation to the problem at hand. However, these techniques have more difficulty identifying combinations of variables with high predictive power. They can also be prone to selecting redundant subsets of variables, hence not addressing the issue of dimensionality as well as one would wish.
For these reasons, we have chosen not to include any experimentation with these techniques in this project.

3.2 Wrappers

Wrappers use the chosen predictor to score subsets of variables according to their predictive power. The main idea is to train the predictor with many different subsets of the input variables and choose the subset providing the best generalization performance. In most cases, it would be prohibitive to try every possible subset, so the question becomes how to search the space of all possible variable subsets. Greedy forward and backward search strategies have been proposed; their processing requirements do not grow exponentially with the number of variables. Forward selection algorithms start with an empty subset and progressively add the next most promising variable. Conversely, backward selection starts with the set of all variables and progressively eliminates the least promising one. Cross-validation can be used to assess the performance of the predictor with the various subsets and to select the subset with the best generalization power.

In this project, we will experiment with a greedy forward selection algorithm to try to improve the generalization performance of a chosen predictor. The algorithm developed is described in more detail in section 4.3.

3.3 Embedded Methods

Embedded methods perform implicit variable selection in the process of training the predictor itself. One way of achieving this is to add a regularization term to the optimization criterion, which will control the complexity of the model by keeping some of the weights at zero, hence in fact applying the predictor to a subset of the input variables. In this project, we will experiment with this concept using an l1-norm regularization term. The l1-norm regularization approach is described in more detail in section 4.4.

Other approaches to implicit variable selection are intrinsic to the training algorithm itself. We will implement the Gradient Directed Regularization, a gradient descent algorithm in this category. We will use it to learn the weight vector and we will emphasize its variable selection capacity. The Gradient Directed Regularization is described in more detail in section 4.5.

4. The Data Mining Problem

This section describes the various components of the experimentation that we want to carry out.

4.1 The Data

This project is centered around a set of available data which presents many of the characteristics typical of data mining problems.
It is a classification task and the target is binary (0 or 1). The inputs are noisy and of very high dimension (1092), so overfitting is likely to occur (the "curse of dimensionality"). There is a large imbalance between the two classes: only 8.7% of the examples belong to class 1. Finally, the number of examples is relatively small (8176) given the large dimension of the input.
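To make the imbalance concrete, a degenerate predictor that always outputs the majority class already scores high on plain accuracy. A quick back-of-the-envelope check, using only the figures quoted above:

```python
# Why raw accuracy is misleading on this data: a classifier that always
# predicts the majority class is right on every negative example.
n_examples = 8176
positive_fraction = 0.087  # 8.7% of the examples belong to class 1

n_positive = round(n_examples * positive_fraction)        # ~711 examples
majority_accuracy = (n_examples - n_positive) / n_examples

print(f"always-predict-0 accuracy: {majority_accuracy:.1%}")  # 91.3%
```

This 91.3% baseline is why accuracy is rejected as a performance measure and a likelihood-based criterion is used instead.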
4.2 The Predictor

Because of the imbalance between the two classes, classification accuracy will not be a good measure of performance. We propose instead to use logistic regression to build an estimator of P(Y | X). In logistic regression, the probability of an example x_i being drawn from the positive class is estimated by:

P(Y = +1 | X = x_i) = \mathrm{sigmoid}(\sum_k w_k x_{ik})

where \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}} and 1 - \mathrm{sigmoid}(z) = \mathrm{sigmoid}(-z). Note that we have added to each example a feature x_{i0} = 1, with w_0 as a bias parameter. We will also pre-process the available data to replace the class label 0 with -1. Then, we can write:

P(Y = -1 | X = x_i) = 1 - P(Y = +1 | X = x_i) = 1 - \mathrm{sigmoid}(\sum_k w_k x_{ik}) = \mathrm{sigmoid}(-\sum_k w_k x_{ik})

Therefore we can combine the two formulas to facilitate the implementation:

P(Y = y | X = x_i) = \mathrm{sigmoid}(y \sum_k w_k x_{ik})

We will learn the parameters in order to maximize the likelihood of the training data. Hence, we will use the negative log-likelihood as the performance criterion to minimize:

L(w) = -\sum_i \log(\mathrm{sigmoid}(y_i \sum_k w_k x_{ik}))

The weight vector w will be learned through gradient descent. Note that \frac{d\,\mathrm{sigmoid}(z)}{dz} = \mathrm{sigmoid}(z)\,\mathrm{sigmoid}(-z). We can compute the gradient of the negative log-likelihood with respect to each w_j with the following formula:

\frac{\partial L(w)}{\partial w_j} = -\sum_i y_i x_{ij}\,\mathrm{sigmoid}(-y_i \sum_k w_k x_{ik})

Under a scenario with no regularization, the parameter update formula will be:

w_j^{t} = w_j^{t-1} - \epsilon \frac{\partial L(w)}{\partial w_j}

We will stop the learning when the sum of the absolute values of the weight differences is less than a threshold:

\sum_{j=0}^{d} \left| \epsilon \frac{\partial L(w)}{\partial w_j} \right| < \gamma

The parameter \epsilon is the learning rate and we call the parameter \gamma the early stopping threshold. Under this model, both the learning rate and the early stopping threshold are hyper-parameters. They will be learned with cross-validation.

Some questions to consider: should we use stochastic gradient descent to accelerate learning, or does the imbalance preclude it?
Should we use a momentum term and progressively reduce the learning rate? The two questions are related.
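The training procedure of section 4.2 can be sketched as follows. This is a minimal NumPy illustration on synthetic data, not the project's PLearn implementation; the function name and toy dataset are our own, and the stochastic and momentum variants raised above are left out:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, epsilon=0.1, gamma=1e-4, max_iters=10_000):
    """Batch gradient descent on the negative log-likelihood of section 4.2.

    X : (n, d) inputs; a constant feature x_i0 = 1 is prepended as the bias.
    y : (n,) labels in {-1, +1}.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # bias feature x_i0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        # dL/dw_j = -sum_i y_i x_ij sigmoid(-y_i sum_k w_k x_ik)
        grad = -(y * sigmoid(-y * (X @ w))) @ X
        w -= epsilon * grad
        # stop when the summed absolute weight change falls below gamma
        if epsilon * np.abs(grad).sum() < gamma:
            break
    return w

# toy usage: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]
w = train_logistic(X, y)
pred = np.sign(np.hstack([np.ones((100, 1)), X]) @ w)
```

The combined formula P(Y = y | X = x_i) = sigmoid(y * sum_k w_k x_ik) is what lets the gradient be written in one line regardless of the class label.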
4.3 The Greedy Forward Selection Algorithm

This forward selection algorithm starts with an empty subset and progressively adds the next most promising variable. It then uses cross-validation to assess the performance of the predictor with the various subsets and to select the subset with the best generalization power. The pseudo-code of the algorithm is as follows:

010 we define d as the dimension of the input
020 we define A as our predictor
030 we define {X_i} as the subset of the input containing dimension i
040 initialize subset B_0 = {} (an empty subset)
050 for d iterations, with i = 1 to d, do
060     for all {X_j} not yet selected in B_{i-1} do
070         train A with B_{i-1} ∪ {X_j}
080     select C_i as the {X_j} that maximized the performance criterion of
090         A trained with B_{i-1} ∪ {X_j}
100     set B_i = B_{i-1} ∪ C_i
110 use cross-validation to select between all B_i

Note that on line 080, we can select between the {X_j} directly from the training results, since we are choosing between models of the same complexity.

4.4 The l1-norm Regularization

The l1-norm of a vector v is defined by:

\|v\|_1 = \sum_i |v_i|

Tibshirani (11) has demonstrated that introducing a constraint of the form \|v\|_1 \le t on a parameter vector v shrinks some coefficients and sets others to zero, and hence will achieve implicit variable subset selection if we apply this constraint to the non-bias components of our weight vector. This is equivalent to rewriting our performance criterion to minimize as:

L_{l_1}(w, \lambda) = -\sum_i \log(\mathrm{sigmoid}(y_i \sum_k w_k x_{ik})) + \lambda \sum_{k=1}^{d} |w_k|

To overcome the difficulty of \sum_k |w_k| being a non-differentiable function, we will transform w into w^+ - w^- and add the 2d additional constraints:

w_k^+ \ge 0 and w_k^- \ge 0 for k \in \{1, \dots, d\}

We will fix w_0^- = 0 and let w_0^+ move freely as our bias parameter, since we do not need two.
Then our criterion to minimize becomes:

L_{l_1}(w, \lambda) = -\sum_i \log(\mathrm{sigmoid}(y_i \sum_{k=0}^{d} (w_k^+ - w_k^-) x_{ik})) + \lambda \sum_{k=1}^{d} (w_k^+ + w_k^-)

When j = 0, the formulas to compute the gradient and update the parameter w_0^+ remain the same as in section 4.2. For j > 0, the formulas to compute the gradient of our criterion with respect to each w_j^+ and w_j^- become:

\frac{\partial L_{l_1}(w, \lambda)}{\partial w_j^+} = -\sum_i y_i x_{ij}\,\mathrm{sigmoid}(-y_i \sum_k (w_k^+ - w_k^-) x_{ik}) + \lambda

\frac{\partial L_{l_1}(w, \lambda)}{\partial w_j^-} = +\sum_i y_i x_{ij}\,\mathrm{sigmoid}(-y_i \sum_k (w_k^+ - w_k^-) x_{ik}) + \lambda

Under this scenario with regularization, the parameter update formulas for j > 0 will be modified as follows to maintain the additional constraints (for s \in \{+, -\}):

w_j^{s(t)} = \begin{cases} w_j^{s(t-1)} - \epsilon \frac{\partial L_{l_1}(w, \lambda)}{\partial w_j^s} & \text{if the result is greater than } 0 \\ 0 & \text{otherwise} \end{cases}

The hyper-parameter \lambda is the regularization parameter, which will be learned together with the learning rate and the early stopping threshold by cross-validation. We need to verify that this is the proper way of maintaining the positivity constraint on all the parameters.

How is this achieving implicit variable subset selection? We can easily see that choosing a zero regularization parameter \lambda reverts to the same algorithm as the one described in section 4.2, with no subset selection. As we grow \lambda, we progressively force the least promising variable weights to zero, hence moving progressively towards an empty subset.

4.5 The Gradient Directed Regularization

Instead of using a regularization term to do implicit subset selection as in the previous section, this approach uses the parameter update algorithm of the gradient descent itself as the means of regularization. In this algorithm, we start with an empty subset of selected variables and we set all the variable weights to zero. Then, at each update step of the gradient descent, we first identify which variable weight shows the largest gradient in absolute value. If that variable has not yet been selected, we include it in the subset of selected variables. After that, we only update the weights of the variables in the subset of selected variables, using the same formula as in section 4.2. Compared with the approach of the previous section, this is equivalent to starting with a very large value of the parameter \lambda and moving progressively towards a situation where \lambda = 0.
The advantage is obviously to do away with having to learn an additional hyper-parameter.

Have we understood this properly? We have not found references for this exact approach. The closest are Efron et al. (3), Least Angle Regression, and Hastie et al. (6), Forward Stagewise Linear Regression. Friedman & Popescu (4) allude to it in reference to the previous two. It makes sense to us, but we need to confirm before spending more time on it.
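The update rule described in this section can be sketched as follows. This is only an illustration of the description given above (which, as just noted, still awaits confirmation against the literature); the function name and synthetic data are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_gdr(X, y, epsilon=0.1, n_steps=200):
    """Gradient directed regularization as described in section 4.5 (a sketch
    of this report's reading, not necessarily the exact published algorithm).
    X already includes the bias feature in column 0; y has labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)          # all variable weights start at zero
    active = set()           # subset of selected variables starts empty
    for _ in range(n_steps):
        grad = -(y * sigmoid(-y * (X @ w))) @ X
        # select the variable whose gradient is largest in absolute value
        j = int(np.argmax(np.abs(grad)))
        active.add(j)
        # update only the weights of the selected variables (rule of sec. 4.2)
        for k in active:
            w[k] -= epsilon * grad[k]
    return w, active

# toy usage: one informative feature (column 1) among eight noise features
rng = np.random.default_rng(1)
n = 200
signal = rng.normal(0, 1, n)
y = np.sign(signal + 0.1 * rng.normal(0, 1, n))
X = np.column_stack([np.ones(n), signal, rng.normal(0, 1, (n, 8))])
w, active = train_gdr(X, y)
```

On this toy problem the informative column enters the active set immediately, which mirrors the claim that the method behaves like an l1 path traversed from large \lambda towards \lambda = 0.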
5. Experimentation Results

This section will be developed after the previous section is completely finalized. It will present and compare the results of:

- training the chosen predictor without any regularization;
- training the predictor using greedy forward selection to choose a subset of variables;
- training the predictor with an l1-norm regularization term as described in section 4.4;
- training the predictor using the gradient directed regularization algorithm as described in section 4.5.

One of the big to-dos is to map out the partitioning of the available data in order to do all the training, validation and comparison implied in this section.

6. Conclusions

This section will be developed after the results of the experimentation are known. It should establish whether we have indeed been able to improve the generalization performance of our chosen predictor by limiting the complexity of the model through variable selection. It will also identify additional work worth pursuing on the subject of pre-processing for the purpose of relevance and dimensionality reduction.
References

[1] Yoshua Bengio & Nicolas Chapados: Extensions to Metric-Based Models, Journal of Machine Learning Research 3 (2003).
[2] Christopher M. Bishop: Chapter 8: Pre-Processing and Feature Extraction, in Neural Networks for Pattern Recognition, Oxford University Press (1995).
[3] Bradley Efron, Trevor Hastie, Iain Johnstone & Robert Tibshirani: Least Angle Regression, web-published manuscript (2003), 1-44.
[4] Jerome H. Friedman & Bogdan E. Popescu: Gradient Directed Regularization for Linear Regression and Classification, web-published manuscript (2004), 1-40.
[5] Isabelle Guyon & André Elisseeff: An Introduction to Variable and Feature Selection, Journal of Machine Learning Research 3 (2003).
[6] Trevor Hastie, Robert Tibshirani & Jerome Friedman: Regularization, in The Elements of Statistical Learning, Springer (2003).
[7] Simon Latendresse & Yoshua Bengio: Linear Regression and the Optimization of Hyper-Parameters, web-published manuscript, 1-7.
[8] Baranidharan Raman & Thomas R. Ioerger: Enhancing Learning using Feature and Example Selection, Journal of Machine Learning Research 3 (2003), 1-37.
[9] Jason Rennie: Logistic Regression, web-published notes (2003), 1-3.
[10] Saharon Rosset & Ji Zhu: Piecewise Linear Regularized Solution Paths, submission for a workshop at NIPS (2003), 1-20.
[11] Robert Tibshirani: Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B, 58(1) (1996).
A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives Paul Grigas May 25, 2016 1 Boosting Algorithms in Linear Regression Boosting [6, 9, 12, 15, 16] is an extremely
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationOrdinal Classification with Decision Rules
Ordinal Classification with Decision Rules Krzysztof Dembczyński 1, Wojciech Kotłowski 1, and Roman Słowiński 1,2 1 Institute of Computing Science, Poznań University of Technology, 60-965 Poznań, Poland
More informationRegularization Path Algorithms for Detecting Gene Interactions
Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable
More informationLinks between Perceptrons, MLPs and SVMs
Links between Perceptrons, MLPs and SVMs Ronan Collobert Samy Bengio IDIAP, Rue du Simplon, 19 Martigny, Switzerland Abstract We propose to study links between three important classification algorithms:
More informationBayesian Support Vector Machines for Feature Ranking and Selection
Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction
More informationNeural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann
Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable
More informationMIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,
MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run
More informationCOMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16
COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training-
More informationBACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation
BACKPROPAGATION Neural network training optimization problem min J(w) w The application of gradient descent to this problem is called backpropagation. Backpropagation is gradient descent applied to J(w)
More informationLogistic Regression Trained with Different Loss Functions. Discussion
Logistic Regression Trained with Different Loss Functions Discussion CS640 Notations We restrict our discussions to the binary case. g(z) = g (z) = g(z) z h w (x) = g(wx) = + e z = g(z)( g(z)) + e wx =
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationStatistical NLP for the Web
Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationNeural Networks. Haiming Zhou. Division of Statistics Northern Illinois University.
Neural Networks Haiming Zhou Division of Statistics Northern Illinois University zhouh@niu.edu Neural Networks The term neural network has evolved to encompass a large class of models and learning methods.
More informationMachine Learning
Machine Learning 10-601 Maria Florina Balcan Machine Learning Department Carnegie Mellon University 02/10/2016 Today: Artificial neural networks Backpropagation Reading: Mitchell: Chapter 4 Bishop: Chapter
More informationLogistic Regression & Neural Networks
Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability
More informationBoosting as a Regularized Path to a Maximum Margin Classifier
Journal of Machine Learning Research () Submitted 5/03; Published Boosting as a Regularized Path to a Maximum Margin Classifier Saharon Rosset Data Analytics Research Group IBM T.J. Watson Research Center
More informationMachine Learning Linear Models
Machine Learning Linear Models Outline II - Linear Models 1. Linear Regression (a) Linear regression: History (b) Linear regression with Least Squares (c) Matrix representation and Normal Equation Method
More informationCOMP-4360 Machine Learning Neural Networks
COMP-4360 Machine Learning Neural Networks Jacky Baltes Autonomous Agents Lab University of Manitoba Winnipeg, Canada R3T 2N2 Email: jacky@cs.umanitoba.ca WWW: http://www.cs.umanitoba.ca/~jacky http://aalab.cs.umanitoba.ca
More informationCLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition
CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition Ad Feelders Universiteit Utrecht Department of Information and Computing Sciences Algorithmic Data
More informationVariations of Logistic Regression with Stochastic Gradient Descent
Variations of Logistic Regression with Stochastic Gradient Descent Panqu Wang(pawang@ucsd.edu) Phuc Xuan Nguyen(pxn002@ucsd.edu) January 26, 2012 Abstract In this paper, we extend the traditional logistic
More informationIntroduction to Machine Learning
Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................
More informationCSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning
CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting
More informationOptimization and Gradient Descent
Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods
More informationLearning Neural Networks
Learning Neural Networks Neural Networks can represent complex decision boundaries Variable size. Any boolean function can be represented. Hidden units can be interpreted as new features Deterministic
More informationThe Entire Regularization Path for the Support Vector Machine
The Entire Regularization Path for the Support Vector Machine Trevor Hastie Department of Statistics Stanford University Stanford, CA 905, USA hastie@stanford.edu Saharon Rosset IBM Watson Research Center
More informationLASSO Review, Fused LASSO, Parallel LASSO Solvers
Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable
More informationCSC 578 Neural Networks and Deep Learning
CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross
More informationMultilayer Neural Networks
Multilayer Neural Networks Introduction Goal: Classify objects by learning nonlinearity There are many problems for which linear discriminants are insufficient for minimum error In previous methods, the
More informationApplied Machine Learning Annalisa Marsico
Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost
More information