Dimensionality reduction Feature selection


CS 2750 Machine Learning
Lecture 3: Dimensionality reduction. Feature selection.
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Dimensionality reduction. Motivation.
Classification problem example: we have input data {x_1, x_2, ..., x_N} such that x_i = (x_i^1, x_i^2, ..., x_i^d), and a set of corresponding output labels {y_1, y_2, ..., y_N}. Assume the dimension d of the data points is very large, and we want to classify x.
Problems with high-dimensional input vectors:
- A large number of parameters to learn; if the dataset is small, this can result in large variance of the estimates and overfitting.
- It becomes hard to explain which features are important in the model (too many choices, some of them substitutable).

Dimensionality reduction. Solutions:
- Selection of a smaller subset of inputs (features) from a large set of inputs; train the classifier on the reduced input set.
- Combination of the high-dimensional inputs into a smaller set of features φ_k(x); train the classifier on the new features.

Feature selection
How do we find a good subset of inputs/features? We need:
- a criterion for ranking good inputs/features,
- a search procedure for finding a good set of features.
The feature selection process can be:
- Dependent on the learning task (e.g. classification): the selection of features is affected by what we want to predict.
- Independent of the learning task: the inputs are reduced without looking at the output (PCA, component analysis, clustering of inputs); these may lack the accuracy needed for classification/regression tasks.

Task-dependent feature selection
Assume a classification problem: x is the input vector, y the output, and the feature mappings are φ = {φ_1(x), φ_2(x), ..., φ_k(x), ...}.
Objective: find a subset of features that gives/preserves most of the output prediction capabilities.
Selection approaches:
- Filtering approaches: filter out features with small predictive potential; done before classification, typically using univariate analysis.
- Wrapper approaches: select features that directly optimize the accuracy of the multivariate classifier.
- Embedded methods: feature selection and learning are closely tied in the method.

Feature selection through filtering
Assume a classification problem: x is the input vector, y the output, with inputs or feature mappings φ_k(x).
How to select the features: univariate analysis. Pretend that only one variable, φ_k, exists and see how well it predicts the output y alone.
Example: differentially expressed features (or inputs), i.e. features giving a good separation in binary (case/control) settings.

Differentially expressed features
Criteria for measuring the differential expression:
- t-test score (Baldi & Long): based on testing whether the two groups come from the same population.
- Fisher score: $\mathrm{Fisher}(i) = \dfrac{\left(\mu_i^{(+)} - \mu_i^{(-)}\right)^2}{\left(\sigma_i^{(+)}\right)^2 + \left(\sigma_i^{(-)}\right)^2}$
- Area under the Receiver Operating Characteristic curve (AUC) score.
Problem: if there are many random features, some features with a good differential-expression score will arise purely by chance. Techniques to reduce the FDR (false discovery rate) and FWER (family-wise error rate) address this.

Feature filtering
Other univariate scores:
- Correlation coefficient $\rho(\phi_k, y) = \dfrac{\mathrm{Cov}(\phi_k, y)}{\sqrt{\mathrm{Var}(\phi_k)\,\mathrm{Var}(y)}}$, which measures linear dependences.
- Mutual information $I(\tilde{\phi}_k, y) = \sum_{i,j} P(\tilde{\phi}_k = j, y = i) \log \dfrac{P(\tilde{\phi}_k = j, y = i)}{P(\tilde{\phi}_k = j)\,P(y = i)}$.
Univariate assumptions: only one feature and its effect on y is incorporated in the score; the effects of two features on y are treated as independent. What do we do if a combination of features gives the best prediction?
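To make the univariate filter scores above concrete, here is a minimal sketch (assuming NumPy and scikit-learn; the synthetic dataset and helper names are illustrative, not from the lecture) that ranks the features of a binary problem by the Fisher score and by an estimate of the mutual information:

```python
# Univariate filter scores on a synthetic binary-labeled data matrix X (N x d)
# with labels y in {0, 1}. Names and data are illustrative.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def fisher_scores(X, y):
    """Fisher score per feature: (mu+ - mu-)^2 / (sigma+^2 + sigma-^2)."""
    Xp, Xn = X[y == 1], X[y == 0]
    num = (Xp.mean(axis=0) - Xn.mean(axis=0)) ** 2
    den = Xp.var(axis=0) + Xn.var(axis=0)
    return num / (den + 1e-12)          # guard against zero variance

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)  # only feature 0 is informative

fisher = fisher_scores(X, y)
mi = mutual_info_classif(X, y, random_state=0)   # estimates I(phi_k, y)
print("top features by Fisher score:", np.argsort(fisher)[::-1][:5])
print("top features by mutual info :", np.argsort(mi)[::-1][:5])
```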

Feature selection: dependent features
Filtering with dependent features: let φ be the current set of features (starting from the complete set). We can remove feature φ_k(x) from it when
$P(y \mid \phi \setminus \phi_k) \approx P(y \mid \phi)$ for all values of φ_k and y.
Repeat the removals until the two probabilities start to differ too much.
Problem: how do we compute/estimate $P(y \mid \phi \setminus \phi_k)$ and $P(y \mid \phi)$? Solution: make some simplifying assumption about the underlying probabilistic model, for example use a Naïve Bayes model (a small sketch of this idea appears after the wrapper slide below).
Advantages: speed, modularity, applied before classification. Disadvantage: may not be as accurate.

Feature selection: wrappers
Wrapper approach: the feature selection is driven by the prediction accuracy of the classifier (or regressor) actually built.
How do we find the appropriate feature set? Idea: greedy search in the space of classifiers.
- Gradually add the features that improve the quality score the most, or
- gradually remove the features that affect the accuracy the least.
The score should reflect the accuracy of the classifier (its error) and also prevent overfitting. The standard way to measure the quality: internal cross-validation (m-fold cross-validation).
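Returning to the Naïve-Bayes-based backward filtering described at the start of this slide pair, here is one possible sketch: a feature is dropped when the estimated class posteriors barely change without it. The Gaussian Naïve Bayes model, the 0.05 threshold, and the synthetic data are assumptions made for illustration, not choices from the lecture.

```python
# Backward filtering with a simplifying Naive Bayes model: drop a feature if
# P(y | features) barely changes when the feature is removed.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

def posterior(X, y, cols):
    """P(y | selected features) estimated with a Gaussian Naive Bayes model."""
    return GaussianNB().fit(X[:, cols], y).predict_proba(X[:, cols])

selected = list(range(X.shape[1]))
changed = True
while changed and len(selected) > 1:
    changed = False
    base = posterior(X, y, selected)
    for k in list(selected):
        reduced = [j for j in selected if j != k]
        diff = np.abs(base - posterior(X, y, reduced)).max()
        if diff < 0.05:        # posteriors barely differ: feature k looks redundant
            selected = reduced
            changed = True
            break

print("features kept:", selected)
```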

Feature selection: wrappers
Example of a greedy (forward) search with a logistic regression model with features:
- Start with $p(y = 1 \mid x, w) = g(w_0)$.
- Choose the feature φ_i(x) with the best score: $p(y = 1 \mid x, w) = g(w_0 + w_i \phi_i(x))$.
- Choose the feature φ_j(x) with the best score: $p(y = 1 \mid x, w) = g(w_0 + w_i \phi_i(x) + w_j \phi_j(x))$.
- Etc. When do we stop?

Internal cross-validation
Goal: stop the learning when the generalization error (the performance on the population from which the data were drawn) is smallest.
A test set can be used to estimate the generalization error; it consists of data different from the training set. An internal validation set is a test set used to stop the learning process, e.g. the feature selection process.
Cross-validation (m-fold):
- Divide the data into m equal partitions (of size N/m).
- Hold out one partition for validation and train the classifier on the rest of the data.
- Repeat so that every partition is held out once.
- The estimate of the generalization error of the learner is the mean of the errors of all the classifiers.
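A minimal sketch of the wrapper approach, assuming scikit-learn: greedy forward selection for a logistic regression classifier, with each candidate feature set scored by internal 5-fold cross-validation, and the search stopped once adding a feature no longer improves the score (one reasonable stopping rule, not the only one).

```python
# Greedy forward feature selection wrapped around logistic regression,
# scored by internal m-fold (here 5-fold) cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Score every candidate feature added to the current set.
    scores = {k: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [k]], y, cv=5).mean()
              for k in remaining}
    k_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:       # stop when adding a feature no longer helps
        break
    selected.append(k_best)
    remaining.remove(k_best)
    best_score = s_best

print("selected features:", selected, "CV accuracy:", round(best_score, 3))
```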

Embedded methods
Feature selection and classification model learning are done together. Embedded models:
- Regularized models: models of higher complexity are explicitly penalized, leading to the virtual removal of inputs from the model (regularized logistic/linear regression).
- Support vector machines: the optimization of the margins penalizes nonzero weights.
- CART / decision trees.

Principal component analysis (PCA)
Objective: we want to replace a high-dimensional input with a small set of features (obtained by combining the inputs). This is different from feature subset selection!
PCA: a linear transformation of the d-dimensional input x into an M-dimensional feature vector z such that M < d, under which the retained variance is maximal. Equivalently, it is the linear projection for which the sum-of-squares reconstruction cost is minimized.

[Figures: PCA example. The data are projected onto the principal directions Xprim = 0.04x + 0.06y - 0.99z and Yprim = 0.70x + 0.70y + 0.07z, retaining 97% of the variance; only axis ticks of the scatter plots survived extraction.]

Principal component analysis (PCA)
PCA: a linear transformation of the d-dimensional input x into an M-dimensional feature vector z such that M < d, under which the retained variance is maximal. Task independent.
Fact: a vector x can be represented using a set of d orthonormal vectors u_i:
$x = \sum_{i=1}^{d} z_i u_i$
This leads to a transformation of coordinates (from x to z using the u_i's):
$z_i = u_i^T x$

PCA
Idea: replace the d coordinates with M of the z coordinates to represent x. We want to find the best subset of M basis vectors:
$\tilde{x} = \sum_{i=1}^{M} z_i u_i + \sum_{i=M+1}^{d} b_i u_i$
where the b_i are constants, fixed for all data points.
How do we choose the best set of basis vectors? We want the subset that gives the best approximation of the data in the dataset on average (we use a least-squares fit). The error for data entry x^n is
$x^n - \tilde{x}^n = \sum_{i=M+1}^{d} (z_i^n - b_i)\, u_i$
so the average reconstruction error over the dataset is
$E_M = \frac{1}{2N} \sum_{n=1}^{N} \| x^n - \tilde{x}^n \|^2 = \frac{1}{2N} \sum_{n=1}^{N} \sum_{i=M+1}^{d} (z_i^n - b_i)^2$

PCA
Differentiating the error function with respect to all b_i and setting the derivatives equal to 0, we get
$b_i = \frac{1}{N} \sum_{n=1}^{N} z_i^n = u_i^T \bar{x}$
Then we can rewrite the error as
$E_M = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T \Sigma u_i, \qquad \Sigma = \frac{1}{N} \sum_{n=1}^{N} (x^n - \bar{x})(x^n - \bar{x})^T$
The error function is optimized when the basis vectors satisfy
$\Sigma u_i = \lambda_i u_i, \qquad E_M = \frac{1}{2} \sum_{i=M+1}^{d} \lambda_i$
The best M basis vectors: discard the vectors with the d - M smallest eigenvalues (or, equivalently, keep the vectors with the M largest eigenvalues). Each such eigenvector u_i is called a principal component.

PCA
Once the eigenvectors with the largest eigenvalues are identified, they are used to transform the original d-dimensional data to M dimensions: $z = (u_1^T x, \ldots, u_M^T x)$.
To find the true dimensionality of the data we can just look at the eigenvalues that contribute the most (small eigenvalues are disregarded).
Problem: PCA is a linear method, so the true dimensionality can be overestimated; there can be non-linear correlations.
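The derivation above translates directly into a short NumPy sketch: form the sample covariance matrix, take its eigendecomposition, keep the M eigenvectors with the largest eigenvalues, and project. The synthetic data and M = 3 are illustrative choices.

```python
# PCA via eigendecomposition of the covariance matrix, as derived above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 10))    # 200 points, d = 10, low-rank structure
M = 3

x_bar = X.mean(axis=0)
Sigma = (X - x_bar).T @ (X - x_bar) / X.shape[0]            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)                     # ascending eigenvalues
order = np.argsort(eigvals)[::-1]                            # largest first
U = eigvecs[:, order[:M]]                                    # top-M principal components

Z = (X - x_bar) @ U                      # reduced M-dimensional representation
X_rec = Z @ U.T + x_bar                  # reconstruction from M components
E_M = 0.5 * eigvals[order[M:]].sum()     # error predicted by the discarded eigenvalues
emp = 0.5 * np.mean(((X - X_rec) ** 2).sum(axis=1))   # empirical reconstruction error
retained = eigvals[order[:M]].sum() / eigvals.sum()
print(f"retained variance: {retained:.2%}, E_M: {E_M:.4f}, empirical: {emp:.4f}")
```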

Dimensionality reduction with neural nets
PCA is limited to linear dimensionality reduction. To do non-linear reductions we can use neural nets.
Auto-associative network: a neural network with the same inputs and outputs (x_1, ..., x_d), with a middle layer z = (z_1, ..., z_M) that corresponds to the reduced dimensions.

Dimensionality reduction with neural nets
Error criterion:
$E = \frac{1}{2} \sum_{n=1}^{N} \sum_{i=1}^{d} \left( y_i(x^n) - x_i^n \right)^2$
The error measure tries to recover the original data through the limited number of dimensions in the middle layer. Non-linearities are modeled through intermediate layers between the middle layer and the input/output; if no intermediate layers are used, the model replicates PCA optimization through learning.
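A minimal sketch of such an auto-associative network, written in PyTorch (a library choice assumed here, not prescribed by the lecture); the intermediate tanh layers provide the non-linearity and the M-unit middle layer gives the reduced representation.

```python
# Auto-associative (autoencoder) network: same inputs and outputs, trained to
# minimize the reconstruction error E described above.
import torch
import torch.nn as nn

d, M = 10, 2                                   # input dimension and reduced dimension
X = torch.randn(500, 5) @ torch.randn(5, d)    # synthetic data with low-rank structure

encoder = nn.Sequential(nn.Linear(d, 16), nn.Tanh(), nn.Linear(16, M))
decoder = nn.Sequential(nn.Linear(M, 16), nn.Tanh(), nn.Linear(16, d))
model = nn.Sequential(encoder, decoder)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(200):
    optimizer.zero_grad()
    loss = ((model(X) - X) ** 2).sum(dim=1).mean()   # reconstruction error
    loss.backward()
    optimizer.step()

Z = encoder(X).detach()      # the M-dimensional middle-layer representation
print("reconstruction error:", round(loss.item(), 4), "Z shape:", tuple(Z.shape))
```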

Dimensionality reduction through clustering
Clustering algorithms group together similar instances in the data sample. Dimensionality reduction based on clustering: replace a high-dimensional data entry with a cluster label.
Problem: deterministic clustering gives only one label per input, which may not be enough to represent the data for prediction.
Solutions:
- clustering over subsets of the input data,
- soft clustering (the probability of a cluster is used directly).

Dimensionality reduction through clustering
Soft clustering (e.g. a mixture of Gaussians) attempts to cover all instances in the data sample with a small number of groups. Each group is more or less responsible for a data entry (the responsibility is the posterior of the group given the data entry). For a mixture of Gaussians, the responsibility of group l is
$h_l = \frac{\pi_l \, p(x \mid u_l = 1)}{\sum_{l'} \pi_{l'} \, p(x \mid u_{l'} = 1)}$
Dimensionality reduction based on soft clustering: replace the high-dimensional data x with the set of group posteriors, and feed all the posteriors to the learner, e.g. a linear regressor or classifier.
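A minimal sketch of soft-clustering-based reduction, assuming scikit-learn: a Gaussian mixture is fit to the inputs, its responsibilities (group posteriors) replace the original features, and a classifier is trained on them. The number of groups and the diagonal covariances are illustrative choices.

```python
# Replace high-dimensional inputs with mixture-of-Gaussians responsibilities
# and feed them to a downstream classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(X)
H = gmm.predict_proba(X)                 # responsibilities h_l: 500 x 8 instead of 500 x 50

clf = LogisticRegression(max_iter=1000).fit(H, y)
print("accuracy on responsibility features:", round(clf.score(H, y), 3))
```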

Dimensionality reduction through clustering
We can use the idea of soft clustering before applying regression/classification learning; this gives two-stage algorithms:
- learn the clustering,
- learn the classification.
Input of the clustering: x (high dimensional). Output of the clustering (input of the classifier): p(c = i | x). Output of the classifier: y.
Example: networks with Radial Basis Functions (RBFs).
Problem: the clustering learns based on p(x) and disregards the target, while the prediction is based on p(y | x).

Networks with radial basis functions
An alternative to multilayer neural networks for modeling non-linearities.
Radial basis functions: $f(x) = w_0 + \sum_{j=1}^{k} w_j \phi_j(x)$
- Based on interpolations of prototype points (the means); each basis function is affected by the distance between x and its mean.
- Fit the outputs of the basis functions through the linear model.
Choice of basis functions, e.g. Gaussian: $\phi_j(x) = \exp\left( -\frac{\| x - \mu_j \|^2}{2 \sigma_j^2} \right)$
Learning: in practice RBF networks seem to work OK for up to about 10 dimensions; for higher dimensions, ridge functions (e.g. logistic units) and combinations of multiple learners seem to do a better job.
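A minimal sketch of the two-stage RBF idea, assuming NumPy and scikit-learn: k-means supplies the prototype means (the clustering stage), Gaussian basis functions are evaluated at those means, and the output weights are fit by least squares (the linear stage). The basis width and the number of prototypes are illustrative choices.

```python
# Two-stage RBF network: k-means prototypes + Gaussian basis functions
# + linear least-squares fit of the output weights.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * np.cos(X[:, 1]) + 0.1 * rng.normal(size=200)

k = 10
mu = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_
sigma = 1.0

def design_matrix(X):
    """Phi[n, j] = exp(-||x_n - mu_j||^2 / (2 sigma^2)), plus a bias column."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), Phi])     # implements w_0 + sum_j w_j phi_j(x)

Phi = design_matrix(X)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)            # fit the linear output layer
print("training RMSE:", np.sqrt(np.mean((Phi @ w - y) ** 2)).round(3))
```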