COMP 551 Applied Machine Learning Lecture 13: Dimension reduction and feature selection

COMP 551 Applied Machine Learning Lecture 13: Dimension reduction and feature selection. Instructor: Herke van Hoof (herke.vanhoof@cs.mcgill.ca). Based on slides by: Jackie Chi Kit Cheung. Class web page: www.cs.mcgill.ca/~hvanho2/comp551. Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor's written permission.

Stacking and overfitting. To see the difference between the strategies for using the cross-validation set, I worked out an example here: https://drive.google.com/file/d/1crzbv8xpy07ayxbxxijvfx-ovsrup1kq/view?usp=sharing 2

Instruction team. Last lecture by me! The second half of the course is taught by Ryan Lowe. I'll still have office hours until the end of February; for questions regarding the first half of the course, try to come by before then! 3

<latexit sha1_base64="skdgcxz5fpmp3p7bt6nkqpimmc=">aaadlhichvjdb9mwfhuwpkb56ucbb3iwqiy6cvvjhqq8tjqykhhca1e2qs6r47itncfxbgdbzfwfhv8cf55xvko0ayjk0u5vvec4yvfm0rotimi78fweo36jzvbt3q379y9d7/8czlkpf6iquvfankdaum0enhhlot6sioe85pu5pd6v68tlvmhxik1ljosvxqra5i9j4vnl/hussiu7nzohsumdcyqxwylnilpmyigqvkzgiimvzslhf0uzbfdiiyffvatbs9vslbem1ijvih2pyoif9qtsk6obxqdycawjjknkjfqcsigvohsecaz2ni2lm3tywwqnrovjtickpxtcphwlnvm9s/xqo7vpmbuef8p8wsm7rba413qvp56zy7pu3vqv/fdtwpr5q5llqpagctjcnc85naws5gazpigxfoubjor5xifzyowj8dpqiuevsjhn2l8psguenvpipvgsovqnfdyi6lav20hsnim9pxcdpg/n3zsrbwfetwhjnqxyfruv9v7pbdjvfqcq8w1/grz9elttu8bzwuoznl/k46ezbs8lresuhj//g0otir2s3l5hflnc0vsnxdh6tgmh69hkufxgwo3rsbsg0eg6dgcglwehyadaitaabp4mnww7wlhwu7oeh4duguhw0modgi8l3vwbbjs6w</latexit> Find good non-linear expansions Outside of language data, we have mainly looked at polynomial expansions, e.g. polynomial expansion up to second order: x1 x 2 = When is this a good idea? What are the alternatives? 0 B @ 1 x 1 x 2 x 2 1 x 2 2 x 1 x 2 1 C A 4

Find good non-linear expansions. Sometimes, we can find good features by prior knowledge about the type of problem, e.g. language features. Robot arm example (Figure: arm of length l at angle x, with height y): y = l cos(x), so use φ(x) = cos(x) to find l. 5

Find good non-linear expansions. Sometimes, we can find good features by prior knowledge about the type of problem. Other types of knowledge: Is there a periodic aspect to the data (e.g. temperature over days or years)? -> use cosine or sine with the proper period. Are there special points in the function, e.g. a discontinuity at 0? -> add a binary feature to indicate when it happens. Do we know certain factors to be independent? -> we can avoid adding cross-terms for these features. 6

Find good non-linear expansions. Sometimes, we can find good features by analyzing the data set. (Figure: scatter plot of labelled data in the (x_1, x_2) plane.) 7

Find good non-linear expansions. Sometimes, we can find good features by analyzing the data set. (Figure: scatter plot of labelled data in the (x_1, x_2) plane.) It seems distance from the origin is a good feature; we could use d = sqrt(x_1^2 + x_2^2), or x_1^2 and x_2^2. 8

Find good non-linear expansions. Sometimes, we can find good features by analyzing the data set. It seems distance from the origin is a good feature; we could use d = sqrt(x_1^2 + x_2^2), or x_1^2 and x_2^2. (Figures: the data in the original (x_1, x_2) plane and in the transformed (x_1^2, x_2^2) plane.) 9

<latexit sha1_base64="qingchd2dmeq58e4nehqwxhwgve=">aaaddnichzlnihqxemcz7dc6fu3q0utweeaqovse9sasevg4iumutldnol2zezbpzjp0rkpmq3jxqm/hsbz6cj6e72d6q3f6bqua/qfq968uoqrfmbfx/gmuntt/4eklrcvjk1evxbxvxpzjzg1pjcnkkt9ubadnfuwt8xyofaaicg47bdhz5v6/glow2t12q4vziicvmzjklehlazqxxkgnj3b1mpylncrv4reh6muf97ou7o59pkwktolkue2mwsaxs5oi2jhlw47q2oag9ioewcliiakzm2qe9vhsyjv5khb7k4jb7t8mrycxafieuxk7msnyk/1vb1hb5ohosurwfinyxlwuorctnccsaacwr4mgvlmwk6yrogm14z3gaqwnvapbqtklhetlcvc/sllu0awkuuubw7vwk8r7j8cpb5sg7je/2urjzujab8gmdewif9/jb/nn1enp0sxvutpcxevfo56z/3e4wee4zjsixfh0we37k06tv2pmxzi2ywr6vzpyc0wfiqzrsnzmx8wezklxz6c7d7rn2ul3uz30bql6bhars/qhpojiht6id6hz9gh6ev0nfrwodgo99xcgxf9/wvsdqk5</latexit> Consideration for generic expansions With enough terms, polynomial can fit any function (generic) Polynomial features and fitted function (6 features) 10 i = x i 8 1.5 6 1 4 2 0.5 0 0-2 -4-0.5-6 -1-8 -10-1.5-1 -0.5 0 0.5 1 1.5-1.5-1.5-1 -0.5 0 0.5 1 1.5 Assumes function is relatively flat around 0 Tends to to extrapolate wildly away from 0: can we do better? 10

<latexit sha1_base64="dk4i8u3ediuyxudqhrh2svjtiy=">aaadixichzlnihqxemcz7dc6fs3quytgimycdn0iqadh0yvhvrx3yti06xtnttikk03suw4xj1/fi1d9c0/itxwg38h0dcvorgbb0/uv0r1aeklbh1afq9l5w7fhipa3l/stxr12/mdicaq2jcymcwuosiobcermdjubbxoa1qwavalodnff8ejowqeu2wgmashlz8zhl1mzup7hc94dnhtzfhyo6i46ie/zyqzthfyqfddjyuap8vwsegqiu9flv3k5sk1riqxws1dpql2s08ny4zaafpaguasin6cnmokyrbzvzqpwkfzmlnistn8rhvfzvh6fs2qusiimpw9jnwpp8v21au/njmeevrh1urd1oxgvsfg4ubzfcahnigqvlhsdzmvtqq5mlv9cnfzwyjswtsk8kjcotyggazdqpzialpglondipsxac7hmbjktun7/ercvb2tjewucn5wl4t3rbsbpwuqon1jsh7co8k9c7jvp/c4tnjzhzvyc4i9hg9xmb91yncdudlzlxz539gxmfxqblndtgrjg/gt8bpy4fd3wfdpmyh2gugqempuk76axaqxpe0hv0ex1cn5mpyzfka/ktrzne57mf1il58quupwnj</latexit> Consideration for generic expansions Alternative: Fourier-based transform (Konidaris et al, 2011) Scale between 0 and 1, then i = cos( x i) 1 0.8 1.5 0.6 1 0.4 0.2 0.5 0 0-0.2-0.4-0.5-0.6-1 -0.8-1 -1.5-1 -0.5 0 0.5 1 1.5-1.5-1.5-1 -0.5 0 0.5 1 1.5 Behaves better outside of the data! But requires range of x Easy to model periodic behavior (scale between -1 and 1) 11

<latexit sha1_base64="jr1sc0wbw5zkumgxrhyqrc3vuae=">aaaduxichzlnbhmxemedhis8phckytfhjribbqbiqehpaouxjakirrshfzerzexaq9d29smmn4qxovlr3dmbtjhzs4vszeyydq/z34znh1nqjgznoouw3ojzu3bu/c6e7evxf/qw/v4scjs03ohegu9umkdewsobpllkcnslmsuk6p09m3vfz4ngrdzphrrhsdctwvwm4itsgv9n4hy3hg3dinbb8itwajg68gokufom3t4adlghm3wmidemuop49g2oidjslhc5is/ncdpnepxpfa4pxrdyipmjsknlr/uszjkwghsucgzoni2vndmvlckei0pdfsanee6nqrzyudnz6//28gnwzdcxopzcwrx37wyhhterkqzsylsw27hka/ytlt5i5ljhsotluj9uf5yacwshggzpimxfbuejpqfxifz4dang0bdrqw9ifiixgqopzjn55t4atyrl1vdkjtuvaqf68fee9gnw7yoch/4zsjkckzwpgg2qcih/zf11/ih30q1vca683v9grn74bpx5ow3ox4r5ywlibd2wcl266tak79jzow4qxqv5tjvctrlcxivl0o18vkpho5it4/6xbjzlbzwgt8aaxoa5oarvwrgyaakgkvwhfxof2v/6obou0bbrsbnediwzu5vurozzq==</latexit> Consideration for generic expansions Alternative: Radial basis functions Pick representative points 1 x i, i =exp (x x) 2 2 2 0.9 1.5 0.8 1 0.7 0.6 0.5 0.5 0 0.4 0.3-0.5 0.2-1 0.1 0-1.5-1 -0.5 0 0.5 1 1.5-1.5-1.5-1 -0.5 0 0.5 1 1.5 Again well-behaved outside of the data! Flexible, but requires knowing range of x and setting σ 12

Consideration for generic expansions. Side-by-side comparison, each using 6 features. (Figure: fitted functions with polynomial, Fourier, and RBF features.) All three can model any function given enough bases. Polynomial does relatively poorly. The alternatives require knowing / normalizing the range of x. 13

Main types of machine learning problems Supervised learning Classification Regression Unsupervised learning Dimensionality reduction Reinforcement learning 14

Applications with lots of features Any kind of task involving images or videos - object recognition, face recognition. Lots of pixels! Classifying from gene expression data. Lots of different genes! Number of data examples: 100 Number of variables: 6000 to 60,000 Natural language processing tasks. Lots of possible words! Tasks where we constructed many features using e.g. nonlinear expansion (e.g. polynomial) 15

Dimensionality reduction. Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines. (Figure: an n × m data matrix X reduced to fewer columns.) slide by Isabelle Guyon 16

Dimensionality reduction. Principal Component Analysis (PCA): combine existing features into a smaller set of features. Also called (truncated) Singular Value Decomposition, or Latent Semantic Indexing in NLP. Variable ranking: think of features as random variables, find how strongly they are associated with the output prediction, and remove the ones that are not highly associated, either before training or during training. Representation learning techniques like word2vec. 17

Principal Component Analysis (PCA). Idea: Project data into a lower-dimensional sub-space, R^m -> R^m', where m' < m. 18

Principal Component Analysis (PCA). Idea: Project data into a lower-dimensional sub-space, R^m -> R^m', where m' < m. Consider a linear mapping, x_i -> Wx_i. W is the compression matrix with dimension R^(m' × m). Assume there is a decompression matrix U of dimension R^(m × m'). 19

Principal Component Analysis (PCA). Idea: Project data into a lower-dimensional sub-space, R^m -> R^m', where m' < m. Consider a linear mapping, x_i -> Wx_i. W is the compression matrix with dimension R^(m' × m). Assume there is a decompression matrix U of dimension R^(m × m'). Solve the following problem: argmin_{W,U} Σ_{i=1:n} ||x_i - UWx_i||^2. 20

Principal Component Analysis (PCA). Idea: Project data into a lower-dimensional sub-space, R^m -> R^m', where m' < m. Consider a linear mapping, x_i -> Wx_i. W is the compression matrix with dimension R^(m' × m). Assume there is a decompression matrix U of dimension R^(m × m'). Solve the following problem: argmin_{W,U} Σ_{i=1:n} ||x_i - UWx_i||^2. Select the projection dimension, m', using cross-validation. Typically center the examples before applying PCA (subtract the mean). 21

Principal Component Analysis (PCA). This problem has a closed-form solution: the rows of W are the m' eigenvectors corresponding to the largest eigenvalues of the covariance matrix of X. (For centered data the sample covariance is simply X^T X, up to a factor of 1/n.) 22
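
A minimal numpy sketch of this closed-form recipe, under the convention that the rows of W hold the leading eigenvectors; the toy data and the function name are my own.

```python
import numpy as np

def pca(X, m_prime):
    """PCA sketch: rows of W are the m' leading eigenvectors of the covariance of centered X."""
    Xc = X - X.mean(axis=0)                      # center the examples
    cov = Xc.T @ Xc / len(Xc)                    # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m_prime]  # indices of the m' largest eigenvalues
    W = eigvecs[:, order].T                      # compression matrix, shape (m', m)
    Z = Xc @ W.T                                 # compressed representation of the data
    return W, Z

# Example: correlated 2-D data reduced to 1 dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
W, Z = pca(X, m_prime=1)
print(W.shape, Z.shape)   # (1, 2) (200, 1)
```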

Principal Component Analysis (PCA) Eigenvector equation: λw= Σw 23

Principal Component Analysis (PCA). (Figure: vectors u, v, w and their images Σu, Σv, Σw under the covariance matrix.) Eigenvector equation: λw = Σw. 24

Principal Component Analysis (PCA). (Figure: data scatter with eigenvectors w_1 and w_2.) Eigenvector equation: λw = Σw. 25

Principal Component Analysis (PCA). (Figure: eigenvectors w_1 and w_2.) W specifies a new basis. 26

Principal Component Analysis (PCA). (Figure: eigenvectors w_1 and w_2 with eigenvalues λ_1 and λ_2.) W specifies a new basis. 27

Principal Component Analysis (PCA). (Figure: data with eigenvectors w_1 and w_2.) Now it is easy to specify the most relevant dimension: reduce the dimensionality to 1d. The found dimension is a combination of x_1 and x_2. 28

Principal Component Analysis (PCA). (Figure: data projected onto w_1 only.) Now it is easy to specify the most relevant dimension: reduce the dimensionality to 1d. The found dimension is a combination of x_1 and x_2. 29

Principal Component Analysis (PCA). PCA: calculate the full matrix W (eigenvectors of the covariance); take the m' most relevant components (largest eigenvalues). Covariance between variables i and j after the transformation? (Xw_i)^T (Xw_j) = w_i^T X^T X w_j (the w_j are eigenvectors, so X^T X w_j = λ_j w_j) = λ_j w_i^T w_j = 0 (i ≠ j), λ_j (i = j). 30
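
A quick numerical check of this decorrelation claim on toy data of my own: the covariance of the transformed variables comes out diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))   # correlated data
Xc = X - X.mean(axis=0)

eigvals, W = np.linalg.eigh(Xc.T @ Xc)   # columns of W are eigenvectors of X^T X
Z = Xc @ W                               # transformed variables

# Off-diagonal covariances vanish; the diagonal entries equal the eigenvalues.
print(np.round(Z.T @ Z, 6))
print(np.round(eigvals, 6))
```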


Principal Component Analysis (PCA). Coloring transform: multiply by Λ^(1/2), then by W. Whitening transform: multiply by W^T, then by Λ^(-1/2). 32
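
A sketch of both transforms on toy data of my own, written with the eigenvectors in the columns of W (so "multiply by W^T" becomes a right-multiplication by W):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)

eigvals, W = np.linalg.eigh(Xc.T @ Xc / len(Xc))   # eigen-decomposition of the covariance

# Whitening: rotate into the eigenbasis, then rescale each axis by 1/sqrt(lambda).
Z = Xc @ W @ np.diag(eigvals ** -0.5)
print(np.round(Z.T @ Z / len(Z), 3))    # approximately the identity matrix

# Coloring: the inverse map, rescale by sqrt(lambda) and rotate back.
X_back = Z @ np.diag(eigvals ** 0.5) @ W.T
print(np.allclose(X_back, Xc))          # True
```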

Eigenfaces. Turk & Pentland (1991) used the PCA method to represent face images. Assume all faces are about the same size. Represent each face image as a data vector. Each eigenvector is an image, called an eigenface. (Figure: eigenfaces and the average image.) 33

Training images. Original images in R^(50×50). Projection to R^10 and reconstruction. 34

Training images. Original images in R^(50×50). Projection to R^10 and reconstruction. Projection to R^2 (Figure: 2-D scatter; different marks indicate different individuals). 35
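
For readers who want to reproduce a similar eigenfaces pipeline, here is a hedged sketch using scikit-learn's Olivetti faces dataset; this is not the dataset from the slides, just an illustration of the same steps (and it downloads the data on first use).

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()           # 400 images of 64x64 pixels, flattened to 4096 features
pca = PCA(n_components=10)               # keep the 10 leading eigenfaces
Z = pca.fit_transform(faces.data)        # 10-dimensional representation of each face

eigenfaces = pca.components_.reshape(-1, 64, 64)   # each principal component is itself an image
reconstructed = pca.inverse_transform(Z)           # back to pixel space for visual comparison
print(Z.shape, eigenfaces.shape, reconstructed.shape)
```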

Feature selection methods The goal: Find the input representation that produces the best generalization error. Two classes of approaches: Wrapper & Filter methods: Feature selection is applied as a preprocessing step. Embedded methods: Feature selection is integrated in the learning (optimization) method, e.g. Regularization 36

Variable Ranking. Idea: Rank features by a scoring function defined for individual features, independently of the context of others. Choose the m' highest-ranked features. Pros / cons: 37

Variable Ranking. Idea: Rank features by a scoring function defined for individual features, independently of the context of others. Choose the m' highest-ranked features. Pros / cons: Need to select a scoring function. Must select the subset size (m'): cross-validation. Simple and fast: just need to compute the scoring function m times and sort the m scores. 38

Scoring function: Correlation Criteria. R(j) = Σ_{i=1:n} (x_ij - x̄_j)(y_i - ȳ) / sqrt( Σ_{i=1:n} (x_ij - x̄_j)^2 · Σ_{i=1:n} (y_i - ȳ)^2 ). slide by Michel Verleysen 39
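
A minimal numpy sketch of this correlation-based score, computing R(j) for every feature column at once; the toy data is mine.

```python
import numpy as np

def correlation_scores(X, y):
    """Pearson correlation R(j) between each feature column x_j and the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den

# Example: rank 5 random features against a target that depends mostly on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
print(np.argsort(-np.abs(correlation_scores(X, y))))   # feature 0 should be ranked first
```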

Scoring function: Mutual information. Think of X_j and Y as random variables. Mutual information between variable X_j and target Y: I(j) = ∫_{X_j} ∫_{Y} p(x_j, y) log [ p(x_j, y) / (p(x_j) p(y)) ] dx_j dy. Empirical estimate from data (assume discretized variables): I(j) = Σ_{x_j ∈ X_j} Σ_{y ∈ Y} P(X_j = x_j, Y = y) log [ P(X_j = x_j, Y = y) / (P(X_j = x_j) P(Y = y)) ]. 40

Nonlinear dependencies with MI. Mutual information identifies nonlinear relationships between variables. Example: x uniformly distributed over [-1, 1]; y = x^2 + noise; z uniformly distributed over [-1, 1]; z and x are independent. slide by Michel Verleysen 41
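
A sketch of the empirical (discretized) estimate from the previous slide applied to this example; the binning choice is mine. The estimate for (x, y) should come out much larger than for (z, y), even though the linear correlation between x and y is near zero.

```python
import numpy as np

def mutual_information(a, b, bins=10):
    """Empirical mutual information between two variables after discretizing into bins."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])).sum())

# The slide's example: y depends on x nonlinearly, z is independent of y.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)
y = x ** 2 + 0.05 * rng.normal(size=5000)
z = rng.uniform(-1, 1, 5000)
print(mutual_information(x, y), mutual_information(z, y))
```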

Variable Ranking. Idea: Rank features by a scoring function defined for individual features, independently of the context of others. Choose the m' highest-ranked features. Pros / cons? 42

Variable Ranking. Idea: Rank features by a scoring function defined for individual features, independently of the context of others. Choose the m' highest-ranked features. Pros / cons? Need to select a scoring function. Must select the subset size (m'): cross-validation. Simple and fast: just need to compute the scoring function m times and sort the m scores. The scoring function is defined for individual features (not subsets). 43

Best-Subset selection Idea: Consider all possible subsets of the features, measure performance on a validation set, and keep the subset with the best performance. Pros / cons? We get the best model! Very expensive to compute, since there is a combinatorial number of subsets. 44

Search space of subsets (Kohavi & John, 1997): n features, 2^n - 1 possible feature subsets! slide by Isabelle Guyon 45

Subset selection in practice. Formulate as a search problem, where the state is the feature set that is used, and search operators involve adding or removing features from the set. Constructive methods like forward/backward search; local search methods; genetic algorithms. Use domain knowledge to help you group features together, to reduce the size of the search space; e.g., in NLP, group syntactic features together, semantic features, etc. 46
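
A minimal sketch of greedy forward search, using the cross-validated score of a linear model as the subset criterion; the model and scoring choices here are mine, not prescribed by the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features):
    """Greedy forward search: repeatedly add the feature that most improves the CV score."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_features):
        scores = {j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Example: 10 candidate features, only the first two actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=300)
print(forward_selection(X, y, max_features=2))   # expected to pick features 0 and 1
```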

Steps to solving a supervised learning problem: 1. Decide what the input-output pairs are. 2. Decide how to encode inputs and outputs; this defines the input space X and output space Y. 3. Choose a class of hypotheses / representations H, e.g. linear functions. 4. Choose an error function (cost function) to define the best hypothesis, e.g. least-mean squares. 5. Choose an algorithm for searching through the space of hypotheses. Evaluate choices on the validation set; evaluate the final model on the test set. If doing k-fold cross-validation, re-do feature selection for each fold. 47

Regularization. Idea: Modify the objective function to constrain the model choice, typically by adding a term (Σ_{j=1:m} |w_j|^p)^(1/p). Linear regression -> ridge regression, lasso. Challenge: need to adapt the optimization procedure (e.g. handle a non-convex objective). This approach is often used for very large natural (non-constructed) feature sets, e.g. images, speech, text, video. With lasso (p=1), some weights tend to become 0, so it can be interpreted as feature selection. 48
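
A short sketch of lasso-based (embedded) feature selection with scikit-learn; the data and the regularization strength alpha are my own choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: 20 features, but only features 0, 5 and 10 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 5] + 3 * X[:, 10] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # features whose weight was not driven to zero
print(selected)                          # expected to include 0, 5 and 10
```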

Final comments. Classic paper on this: I. Guyon, A. Elisseeff, An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 (2003), pp. 1157-1182. http://machinelearning.wustl.edu/mlpapers/paper_files/guyone03.pdf (and references therein). More recently, there has been a move towards learning the features end-to-end, using neural network architectures (more on this next week). 49

What you should know. Properties of some families of generic features. The basic idea and algorithm of principal component analysis. Mutual information and correlation. Feature selection approaches: wrapper & filter, embedded. 50