Dimensionality reduction. Feature selection.


CS 1675 Introduction to Machine Learning
Lecture: Dimensionality reduction. Feature selection.
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Dimensionality reduction. Motivation.
ML methods are sensitive to the dimensionality of the data.
Question: Is there a lower dimensional representation of the data that captures well its characteristics?
Objective of dimensionality reduction: find a lower dimensional representation of the data.
Two learning problems:
- Supervised: D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_n = (x_{n,1}, ..., x_{n,d})
- Unsupervised: D = {x_1, x_2, ..., x_N}, where x_n = (x_{n,1}, ..., x_{n,d})
Goal: replace x = (x_1, x_2, ..., x_d) with x' of dimensionality d' < d.

Dimensionality reduction. Solutions:
- Selection of a smaller subset of inputs (features) from a large set of inputs; train the classifier on the reduced input set.
- Combination of high dimensional inputs into a smaller set of features; train the classifier on the new features.
[Figure: d inputs mapped to k features, by selection vs. by combination.]

Task-dependent feature selection
Assume a classification problem: x input vector, y output.
Objective: find a subset of inputs/features that gives/preserves most of the output prediction capabilities.
Selection approaches:
- Filtering approaches (last lecture): filter out features with small predictive potential. Done before classification; typically uses univariate analysis.
- Wrapper approaches: select features that directly optimize the accuracy of the multivariate classifier.
- Embedded methods: feature selection and learning closely tied in the method (regularization methods, decision tree methods).

Feature selection through filtering
Assume a classification problem: x input vector, y output.
How to select the features/inputs? For each input x_i:
- calculate a score reflecting how well x_i alone predicts the output y;
- pick the inputs with the best scores (or, equivalently, eliminate/filter the inputs with the worst scores).

Feature scoring for classification
Scores for measuring the differential expression:
- T-test score (Baldi & Long): based on the test of whether the two groups come from the same population. Null hypothesis: mean of class 0 = mean of class 1.
[Figure: class-conditional distributions of a feature for Class 0 and Class 1.]
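The t-score formula itself did not survive the transcription; the sketch below assumes the usual unpooled two-sample statistic and numpy, so treat it as an illustration rather than the slides' exact score.

import numpy as np

def t_scores(X, y):
    # Two-sample t-score of each feature (column of X) for classes y in {0, 1}.
    # A large |t| suggests the feature separates the two classes well.
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
    return (m0 - m1) / np.sqrt(v0 / len(X0) + v1 / len(X1))

def filter_top_k(X, y, k):
    # Keep the k inputs with the best scores; equivalently, filter out the rest.
    idx = np.argsort(-np.abs(t_scores(X, y)))[:k]
    return X[:, idx], idx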

Feature scoring for classification
Scores for measuring the differential expression:
- Fisher score.
- AUROC score: Area under the Receiver Operating Characteristic curve.
[Figure: Fisher score illustrated on the class-conditional distributions of Class 0 and Class 1.]

Feature scoring
- Correlation coefficients: measure linear dependences,
  rho(x_k, y) = Cov(x_k, y) / sqrt( Var(x_k) Var(y) ).
- Mutual information: measures dependences; needs discretized input values,
  I(x_k, y) = Σ_i Σ_j P̃(x_k = i, y = j) log [ P̃(x_k = i, y = j) / ( P̃(x_k = i) P̃(y = j) ) ].
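A minimal sketch of the two scores whose formulas appear above, assuming numpy arrays and, for the mutual information, inputs that are already discretized:

import numpy as np

def correlation_score(x, y):
    # Absolute Pearson correlation |Cov(x, y)| / sqrt(Var(x) Var(y));
    # captures only linear dependences between feature x and target y.
    return abs(np.cov(x, y, ddof=1)[0, 1]) / (x.std(ddof=1) * y.std(ddof=1))

def mutual_information(x, y):
    # Empirical mutual information I(x, y) from relative frequencies;
    # x must already be discretized into a small set of values.
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi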

Feature/input dependences
Univariate score assumptions:
- Only one input and its effect on y is incorporated in the score.
- Effects of two features on y are considered to be independent.
Correlation-based feature selection: a partial solution to the above problem.
Idea: good feature subsets contain features that are highly correlated with the class but independent of each other.
Assume a set of features S of size k. Then
  Merit(S) = k r̄_{xy} / sqrt( k + k(k-1) r̄_{xx} ),
where r̄_{xy} is the average correlation between the x's and the class y, and r̄_{xx} is the average correlation between pairs of x's. (A code sketch of the merit follows after the next slide.)

Feature selection: low sample size
Problem: many inputs and a low sample size. If there are many random features and not many instances we can learn from, then features with a good differential-expression score may arise simply by chance. The probability of this happening can be quite large.
Techniques to address the problem: reduce FDR (False Discovery Rate) and FWER (Family-Wise Error Rate).
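A minimal sketch of the merit computation, assuming numpy; the helper name cfs_merit and the subset-of-column-indices interface are illustrative:

import numpy as np

def cfs_merit(X, y, subset):
    # Merit of a feature subset: high average feature-class correlation,
    # low average feature-feature correlation. subset indexes columns of X.
    k = len(subset)
    r_xy = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_xy
    r_xx = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_xy / np.sqrt(k + k * (k - 1) * r_xx)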

Feature selection: wrappers
Wrapper approach: the input/feature selection is driven by the prediction accuracy of the classifier (regressor) we actually want to build.
How to find the appropriate feature subset S? For d inputs/features there are 2^d different feature subsets.
Idea: greedy search in the space of classifiers.
- Gradually add features that improve the quality of the model, or
- gradually remove features that affect the accuracy the least.
The score should reflect the accuracy of the classifier (error) and also prevent overfit.
Standard way to measure the quality of the model: internal cross-validation (k-fold cross-validation).

Internal cross-validation
Split the train set into internal train and test sets:
- Internal train set: train different models (defined, e.g., on different subsets of features).
- Internal test set(s): estimate the generalization error and select the best model among the possible models.
Internal cross-validation (k-fold):
- Divide the train data into k equal partitions of size N/k.
- Hold out one partition for validation; train the classifiers on the rest of the data.
- Repeat such that every partition is held out once.
- The estimate of the generalization error of the learner is the mean of the errors on all partitions.
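A minimal sketch of the internal k-fold estimate for one candidate feature subset, assuming scikit-learn's LogisticRegression and KFold (any classifier with fit/predict would do; cv_error is a hypothetical helper name):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cv_error(X, y, subset, k=5):
    # Mean held-out error of a classifier trained on the given feature subset.
    errors = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[np.ix_(train_idx, subset)], y[train_idx])
        errors.append(np.mean(model.predict(X[np.ix_(test_idx, subset)]) != y[test_idx]))
    return np.mean(errors)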

Feature selection: wrappers
Example: greedy forward search. Assume a logistic regression model.
- Start with a simple model: p(y = 1 | x, w) = g(w_0).
- Choose the feature x_j with the best error in the internal step: p(y = 1 | x, w) = g(w_0 + w_j x_j).
- Choose the next feature x_i with the best error in the internal step: p(y = 1 | x, w) = g(w_0 + w_i x_i + w_j x_j).
- Etc. (A sketch of this loop follows after the next slide.)
When to stop? Goal: stop adding features when the internal error on the data stops improving.

Embedded methods
Feature selection + classification model learning done jointly.
Examples of embedded methods:
- Regularized models: models of higher complexity are explicitly penalized, leading to virtual removal of inputs from the model. Covers: regularized logistic/linear regression; support vector machines (optimization of margins penalizes nonzero weights). Function to optimize:
  J(w, D) = L(w, D) + λ R(w),
  where L(w, D) is the loss function (fit of the data) and R(w) is the regularization penalty.
- CART/decision trees.
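Continuing that sketch, a greedy forward pass that stops once the internal error no longer improves; it reuses the hypothetical cv_error helper from the previous block:

import numpy as np

def greedy_forward(X, y, max_features=None):
    # Add one feature at a time, keeping the addition that lowers the
    # internal CV error the most; stop when no addition improves it.
    selected, best_err = [], np.inf
    remaining = list(range(X.shape[1]))
    while remaining and (max_features is None or len(selected) < max_features):
        errs = {j: cv_error(X, y, selected + [j]) for j in remaining}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:
            break  # internal error stopped improving
        best_err = errs[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_err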

Unsupervised dimensionality reduction
Is there a lower dimensional representation of the data that captures well its characteristics?
Assume: we have data D = {x_1, x_2, ..., x_N} such that x_n = (x_{n,1}, ..., x_{n,d}). Assume the dimension d of a data point is very large. We want to analyze x; there is no class label y.
Our goal: find a lower dimensional representation of the data of dimension d' < d.

Principal component analysis (PCA)
Objective: we want to replace a high dimensional input x with a small set of inputs obtained by combining the inputs. (Different from feature subset selection!)
PCA: a linear transformation of the d-dimensional input x to an m-dimensional feature vector z such that z = A x.
Many different transformations exist; which one to pick? PCA selects the linear transformation for which the retained variance is maximal. Or, equivalently, it is the linear transformation for which the sum-of-squares reconstruction cost is minimized.

PCA: example
[Figure: a data scatter and its projections onto different axes, compared with the PCA projection.]

PCA
[Figure: PCA projection to the 2-dimensional space: Xprm = 0.04x + 0.06y - 0.99z, Yprm = 0.70x + 0.70y + 0.07z; 97% of the variance is retained.]

Principal component analysis (PCA)
PCA: a linear transformation of the d-dimensional input x to an m-dimensional vector z such that z = A x, under which the retained variance is maximal. (Remember: no y is needed.)
Fact: a vector x can be represented using a set of orthonormal vectors u_i:
  x = Σ_{i=1}^{d} z_i u_i.
This leads to a transformation of coordinates from x to z using the u_i's:
  z_i = u_i^T x.
New basis: u_1, u_2, u_3. Standard basis: (1,0,0); (0,1,0); (0,0,1).
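A quick numerical check of this fact, assuming numpy and a hypothetical orthonormal basis of R^3:

import numpy as np

# an orthonormal basis of R^3 (a rotation of the standard basis); columns u_1, u_2, u_3
U = np.array([[1, 1, 0],
              [1, -1, 0],
              [0, 0, np.sqrt(2)]]) / np.sqrt(2)

x = np.array([3.0, -1.0, 2.0])
z = U.T @ x                    # coordinates z_i = u_i^T x
x_back = U @ z                 # x = sum_i z_i u_i
print(np.allclose(x, x_back))  # True: the representation is exact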

PCA
Idea: replace the d coordinates with m < d coordinates to represent x. We want to find the best subset of m basis vectors.
How to choose the best set of basis vectors? We want the subset that gives the best approximation of the data x in the dataset on average (we use a least squares fit):
  x̃_n = Σ_{i=1}^{m} z_{n,i} u_i + Σ_{i=m+1}^{d} b_i u_i,  with the b_i constant and fixed.
Error for data entry x_n:
  x_n - x̃_n = Σ_{i=m+1}^{d} (z_{n,i} - b_i) u_i.
Reconstruction error over the dataset:
  E_m = (1/2) Σ_{n=1}^{N} ||x_n - x̃_n||².

PCA (cont.)
Differentiating the error function with regard to all b_i and setting the result equal to 0, we get b_i = u_i^T x̄, where x̄ = (1/N) Σ_{n=1}^{N} x_n.
Then we can rewrite the error as
  E_m = (N/2) Σ_{i=m+1}^{d} u_i^T Σ u_i,  where Σ = (1/N) Σ_{n=1}^{N} (x_n - x̄)(x_n - x̄)^T is the covariance matrix.
The error function is optimized when the basis vectors satisfy Σ u_i = λ_i u_i, so E_m = (N/2) Σ_{i=m+1}^{d} λ_i.
The best basis vectors: discard the vectors with the (d - m) smallest eigenvalues or, equivalently, keep the vectors with the m largest eigenvalues. Eigenvector u_i is called a principal component.
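A minimal numpy sketch of the resulting recipe (eigendecomposition of the sample covariance, keeping the m eigenvectors with the largest eigenvalues):

import numpy as np

def pca(X, m):
    # Project the rows of X onto the m eigenvectors of the sample
    # covariance with the largest eigenvalues (the principal components).
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    cov = Xc.T @ Xc / len(X)                # covariance matrix Sigma
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :m]             # m principal components
    Z = Xc @ U                              # z_{n,i} = u_i^T (x_n - x_bar)
    return Z, U, eigvals[::-1]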

PCA
Once the m eigenvectors u_i with the largest eigenvalues are identified, they are used to transform the original d-dimensional data to m dimensions.
To find the true dimensionality of the data we can just look at the eigenvalues that contribute the most (small eigenvalues are disregarded).
Problem: PCA is a linear method. The true dimensionality can be overestimated; there can be non-linear correlations. Modifications for nonlinearities: kernel PCA.

Dimensionality reduction with neural nets
PCA is limited to linear dimensionality reduction. To do non-linear reductions we can use neural nets.
Auto-associative (auto-encoder) network: a neural network with the same inputs and outputs. The middle layer z = (z_1, z_2) corresponds to the reduced dimensions.

Dimensionality reduction with neural nets
Error criterion:
  E = (1/2) Σ_{n=1}^{N} Σ_{i=1}^{d} ( y_i(x_n) - x_{n,i} )².
The error measure tries to recover the original data through a limited number of dimensions in the middle layer. Non-linearities are modeled through intermediate layers between the middle layer and the input/output. If no intermediate layers are used, the model replicates the PCA optimization through learning. (A code sketch follows after the next slide.)

Dimensionality reduction through clustering
Clustering algorithms group together similar instances in the data sample.
Dimensionality reduction based on clustering: replace a high dimensional data entry with a cluster label.
Problem: deterministic clustering gives only one label per input; this may not be enough to represent the data for prediction.
Solutions: clustering over subsets of the input data; soft clustering (the probability of a cluster is used directly).
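A minimal auto-encoder sketch, assuming PyTorch; the layer widths, the tanh intermediate layers, and the placeholder data are illustrative choices, not taken from the slides:

import torch
import torch.nn as nn

d, m = 20, 2                       # input dimension, middle-layer dimension
autoencoder = nn.Sequential(
    nn.Linear(d, 10), nn.Tanh(),   # intermediate layer (models non-linearities)
    nn.Linear(10, m),              # middle layer: the reduced dimensions z
    nn.Linear(m, 10), nn.Tanh(),
    nn.Linear(10, d),              # outputs try to reproduce the inputs
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
X = torch.randn(500, d)            # placeholder data
for _ in range(1000):
    opt.zero_grad()
    # reconstruction error E (the constant 1/2 does not change the optimum)
    loss = ((autoencoder(X) - X) ** 2).sum()
    loss.backward()
    opt.step()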

Dimensionality reduction through clustering
Soft clustering (e.g. a mixture of Gaussians) attempts to cover all instances in the data sample with a small number of groups. Each group is more or less responsible for a data entry; the responsibility is the posterior of a group given the data entry:
  h_l = p(u = l | x).
Dimensionality reduction based on soft clustering: replace the high dimensional data x with the set of group posteriors (h_1, ..., h_k); feed all posteriors to the learner (e.g. a linear regressor or classifier).

Dimensionality reduction through clustering (cont.)
We can use the idea of soft clustering before applying regression/classification learning. Two-stage algorithms:
1. Learn the clustering.
2. Learn the classification.
Input of the clustering: the high dimensional x. Output of the clustering: p(u = l | x).
Input of the classifier: p(u = l | x). Output of the classifier: y.
Example: networks with Radial Basis Functions (RBFs).
Problem: a clustering learned based on p(x) disregards the target; the prediction is based on p(y | x).
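A minimal sketch of the two-stage recipe, with scikit-learn's GaussianMixture and LogisticRegression standing in for the clustering and classification stages:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def soft_cluster_features(X_train, y_train, X_test, k=5):
    # Stage 1: learn a mixture of Gaussians on the inputs alone.
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    H_train = gmm.predict_proba(X_train)   # responsibilities h_l = p(u = l | x)
    H_test = gmm.predict_proba(X_test)
    # Stage 2: train a classifier on the group posteriors, not the raw inputs.
    clf = LogisticRegression(max_iter=1000).fit(H_train, y_train)
    return clf.predict(H_test)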