COMP 551 Applied Machine Learning
Lecture 13: Dimension reduction and feature selection
Instructor: Herke van Hoof (herke.vanhoof@cs.mcgill.ca)
Based on slides by Jackie Chi Kit Cheung
Class web page: www.cs.mcgill.ca/~hvanho2/comp551
Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.
Stacking and overfitting
To see the difference between the strategies for using the cross-validation set, I worked out an example here:
https://drive.google.com/file/d/1crzbv8xpy07ayxbxxijvfx-ovsrup1kq/view?usp=sharing
Instruction team
Last lecture by me! The second half of the course is taught by Ryan Lowe.
I'll still have office hours until the end of February.
For questions regarding the first half of the course, try to come by before then!
<latexit sha1_base64="skdgcxz5fpmp3p7bt6nkqpimmc=">aaadlhichvjdb9mwfhuwpkb56ucbb3iwqiy6cvvjhqq8tjqykhhca1e2qs6r47itncfxbgdbzfwfhv8cf55xvko0ayjk0u5vvec4yvfm0rotimi78fweo36jzvbt3q379y9d7/8czlkpf6iquvfankdaum0enhhlot6sioe85pu5pd6v68tlvmhxik1ljosvxqra5i9j4vnl/hussiu7nzohsumdcyqxwylnilpmyigqvkzgiimvzslhf0uzbfdiiyffvatbs9vslbem1ijvih2pyoif9qtsk6obxqdycawjjknkjfqcsigvohsecaz2ni2lm3tywwqnrovjtickpxtcphwlnvm9s/xqo7vpmbuef8p8wsm7rba413qvp56zy7pu3vqv/fdtwpr5q5llqpagctjcnc85naws5gazpigxfoubjor5xifzyowj8dpqiuevsjhn2l8psguenvpipvgsovqnfdyi6lav20hsnim9pxcdpg/n3zsrbwfetwhjnqxyfruv9v7pbdjvfqcq8w1/grz9elttu8bzwuoznl/k46ezbs8lresuhj//g0otir2s3l5hflnc0vsnxdh6tgmh69hkufxgwo3rsbsg0eg6dgcglwehyadaitaabp4mnww7wlhwu7oeh4duguhw0modgi8l3vwbbjs6w</latexit> Find good non-linear expansions Outside of language data, we have mainly looked at polynomial expansions, e.g. polynomial expansion up to second order: x1 x 2 = When is this a good idea? What are the alternatives? 0 B @ 1 x 1 x 2 x 2 1 x 2 2 x 1 x 2 1 C A 4
Find good non-linear expansions
Sometimes, we can find good features by prior knowledge about the type of problem
E.g. language features
E.g. a robot arm where $y = l \cos(x)$: use $\phi(x) = \cos(x)$ to find $l$ with a linear method
Find good non-linear expansions
Sometimes, we can find good features by prior knowledge about the type of problem
Other types of knowledge:
Is there a periodic aspect to the data (e.g. temperature over days or years)? -> use cosine or sine with the proper period
Are there special points in the function, e.g. a discontinuity at 0? -> add a binary feature to indicate when it happens
Do we know certain factors to be independent? -> we can avoid adding cross-terms for these features
Find good non-linear expansions
Sometimes, we can find good features by analyzing the data set
[Scatter plot in the $(x_1, x_2)$ plane: one class surrounds the other, so no linear boundary separates them]
It seems distance from the origin is a good feature; we could use $d = \sqrt{x_1^2 + x_2^2}$, or $x_1^2$ and $x_2^2$
[The same data re-plotted with axes $x_1^2$ and $x_2^2$: the classes become linearly separable]
<latexit sha1_base64="qingchd2dmeq58e4nehqwxhwgve=">aaaddnichzlnihqxemcz7dc6fu3q0utweeaqovse9sasevg4iumutldnol2zezbpzjp0rkpmq3jxqm/hsbz6cj6e72d6q3f6bqua/qfq968uoqrfmbfx/gmuntt/4eklrcvjk1evxbxvxpzjzg1pjcnkkt9ubadnfuwt8xyofaaicg47bdhz5v6/glow2t12q4vziicvmzjklehlazqxxkgnj3b1mpylncrv4reh6muf97ou7o59pkwktolkue2mwsaxs5oi2jhlw47q2oag9ioewcliiakzm2qe9vhsyjv5khb7k4jb7t8mrycxafieuxk7msnyk/1vb1hb5ohosurwfinyxlwuorctnccsaacwr4mgvlmwk6yrogm14z3gaqwnvapbqtklhetlcvc/sllu0awkuuubw7vwk8r7j8cpb5sg7je/2urjzujab8gmdewif9/jb/nn1enp0sxvutpcxevfo56z/3e4wee4zjsixfh0we37k06tv2pmxzi2ywr6vzpyc0wfiqzrsnzmx8wezklxz6c7d7rn2ul3uz30bql6bhars/qhpojiht6id6hz9gh6ev0nfrwodgo99xcgxf9/wvsdqk5</latexit> Consideration for generic expansions With enough terms, polynomial can fit any function (generic) Polynomial features and fitted function (6 features) 10 i = x i 8 1.5 6 1 4 2 0.5 0 0-2 -4-0.5-6 -1-8 -10-1.5-1 -0.5 0 0.5 1 1.5-1.5-1.5-1 -0.5 0 0.5 1 1.5 Assumes function is relatively flat around 0 Tends to to extrapolate wildly away from 0: can we do better? 10
<latexit sha1_base64="dk4i8u3ediuyxudqhrh2svjtiy=">aaadixichzlnihqxemcz7dc6fs3quytgimycdn0iqadh0yvhvrx3yti06xtnttikk03suw4xj1/fi1d9c0/itxwg38h0dcvorgbb0/uv0r1aeklbh1afq9l5w7fhipa3l/stxr12/mdicaq2jcymcwuosiobcermdjubbxoa1qwavalodnff8ejowqeu2wgmashlz8zhl1mzup7hc94dnhtzfhyo6i46ie/zyqzthfyqfddjyuap8vwsegqiu9flv3k5sk1riqxws1dpql2s08ny4zaafpaguasin6cnmokyrbzvzqpwkfzmlnistn8rhvfzvh6fs2qusiimpw9jnwpp8v21au/njmeevrh1urd1oxgvsfg4ubzfcahnigqvlhsdzmvtqq5mlv9cnfzwyjswtsk8kjcotyggazdqpzialpglondipsxac7hmbjktun7/ercvb2tjewucn5wl4t3rbsbpwuqon1jsh7co8k9c7jvp/c4tnjzhzvyc4i9hg9xmb91yncdudlzlxz539gxmfxqblndtgrjg/gt8bpy4fd3wfdpmyh2gugqempuk76axaqxpe0hv0ex1cn5mpyzfka/ktrzne57mf1il58quupwnj</latexit> Consideration for generic expansions Alternative: Fourier-based transform (Konidaris et al, 2011) Scale between 0 and 1, then i = cos( x i) 1 0.8 1.5 0.6 1 0.4 0.2 0.5 0 0-0.2-0.4-0.5-0.6-1 -0.8-1 -1.5-1 -0.5 0 0.5 1 1.5-1.5-1.5-1 -0.5 0 0.5 1 1.5 Behaves better outside of the data! But requires range of x Easy to model periodic behavior (scale between -1 and 1) 11
<latexit sha1_base64="jr1sc0wbw5zkumgxrhyqrc3vuae=">aaaduxichzlnbhmxemedhis8phckytfhjribbqbiqehpaouxjakirrshfzerzexaq9d29smmn4qxovlr3dmbtjhzs4vszeyydq/z34znh1nqjgznoouw3ojzu3bu/c6e7evxf/qw/v4scjs03ohegu9umkdewsobpllkcnslmsuk6p09m3vfz4ngrdzphrrhsdctwvwm4itsgv9n4hy3hg3dinbb8itwajg68gokufom3t4adlghm3wmidemuop49g2oidjslhc5is/ncdpnepxpfa4pxrdyipmjsknlr/uszjkwghsucgzoni2vndmvlckei0pdfsanee6nqrzyudnz6//28gnwzdcxopzcwrx37wyhhterkqzsylsw27hka/ytlt5i5ljhsotluj9uf5yacwshggzpimxfbuejpqfxifz4dang0bdrqw9ifiixgqopzjn55t4atyrl1vdkjtuvaqf68fee9gnw7yoch/4zsjkckzwpgg2qcih/zf11/ih30q1vca683v9grn74bpx5ow3ox4r5ywlibd2wcl266tak79jzow4qxqv5tjvctrlcxivl0o18vkpho5it4/6xbjzlbzwgt8aaxoa5oarvwrgyaakgkvwhfxof2v/6obou0bbrsbnediwzu5vurozzq==</latexit> Consideration for generic expansions Alternative: Radial basis functions Pick representative points 1 x i, i =exp (x x) 2 2 2 0.9 1.5 0.8 1 0.7 0.6 0.5 0.5 0 0.4 0.3-0.5 0.2-1 0.1 0-1.5-1 -0.5 0 0.5 1 1.5-1.5-1.5-1 -0.5 0 0.5 1 1.5 Again well-behaved outside of the data! Flexible, but requires knowing range of x and setting σ 12
Consideration for generic expansions
Side-by-side comparison: each uses 6 features
[Plots of the fitted functions: Polynomial, Fourier, RBF]
All three can model any function given enough basis functions
Polynomial does relatively poorly
Alternatives require knowing / normalizing the range of $x$
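To make the three families concrete, here is a minimal sketch of each expansion (the function names are ours, and the RBF centers and sigma are illustrative assumptions):

```python
import numpy as np

def poly_features(x, k=6):
    """Polynomial basis: phi_i(x) = x**i for i = 0..k-1."""
    return np.array([x**i for i in range(k)])

def fourier_features(x, k=6):
    """Fourier basis (Konidaris et al., 2011): phi_i(x) = cos(i*pi*x),
    assuming x has been pre-scaled to [0, 1]."""
    return np.array([np.cos(i * np.pi * x) for i in range(k)])

def rbf_features(x, centers=np.linspace(0.0, 1.0, 6), sigma=0.2):
    """Radial basis functions around fixed centers; sigma is a design choice."""
    return np.exp(-(x - centers) ** 2 / (2 * sigma**2))
```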
Main types of machine learning problems
Supervised learning: classification, regression
Unsupervised learning: dimensionality reduction
Reinforcement learning
Applications with lots of features
Any kind of task involving images or videos: object recognition, face recognition. Lots of pixels!
Classifying from gene expression data. Lots of different genes! (Number of data examples: ~100; number of variables: 6,000 to 60,000)
Natural language processing tasks. Lots of possible words!
Tasks where we constructed many features using e.g. non-linear expansion (e.g. polynomial)
Dimensionality reduction
Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines.
[Diagram: an n × m data matrix X reduced to n × m′]
(slide by Isabelle Guyon)
Dimensionality reduction
Principal Component Analysis (PCA): combine existing features into a smaller set of features. Also called (truncated) Singular Value Decomposition, or Latent Semantic Indexing in NLP.
Variable ranking: think of features as random variables. Find how strongly they are associated with the output, and remove the ones that are not highly associated, either before training or during training.
Representation learning techniques like word2vec
Principal Component Analysis (PCA)
Idea: Project data into a lower-dimensional sub-space, $\mathbb{R}^m \to \mathbb{R}^{m'}$, where $m' < m$.
Consider a linear mapping, $x_i \mapsto W x_i$
$W$ is the compression matrix, $W \in \mathbb{R}^{m' \times m}$. Assume there is a decompression matrix $U \in \mathbb{R}^{m \times m'}$.
Solve the following problem: $\arg\min_{W,U} \sum_{i=1}^{n} \| x_i - U W x_i \|^2$
Select the projection dimension, $m'$, using cross-validation.
Typically center the examples before applying PCA (subtract the mean).
Principal Component Analysis (PCA)
This problem has a closed-form solution: the rows of $W$ are the $m'$ eigenvectors corresponding to the largest eigenvalues of the covariance matrix of $X$.
(For centered data, the sample covariance is proportional to $X^T X$.)
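A minimal numpy sketch of this closed-form solution (the helper name pca_fit is ours; for orthonormal eigenvectors the decompression matrix can be taken as $U = W^T$):

```python
import numpy as np

def pca_fit(X, m_prime):
    """Return the compression matrix W (m' x m) for data X (n x m)."""
    Xc = X - X.mean(axis=0)                 # center the examples
    cov = Xc.T @ Xc / len(Xc)               # sample covariance (up to scaling)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh since covariance is symmetric
    order = np.argsort(eigvals)[::-1]       # largest eigenvalues first
    return eigvecs[:, order[:m_prime]].T    # top-m' eigenvectors as rows

# Compress: z = W @ x; decompress: x_hat = W.T @ z (U = W.T for orthonormal W)
```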
Principal Component Analysis (PCA)
Eigenvector equation: $\Sigma w = \lambda w$
[Diagram: vectors $u$, $v$, $w$ and their images $\Sigma u$, $\Sigma v$, $\Sigma w$; an eigenvector is only scaled, not rotated]
[Scatter plot of data with the covariance eigenvectors $w_1$ and $w_2$ overlaid]
Principal Component Analysis (PCA)
$W$ specifies a new basis
[Scatter plot: $w_1$ and $w_2$ form a new orthogonal basis; the eigenvalues $\lambda_1$, $\lambda_2$ indicate the variance along each basis direction]
Principal Component Analysis (PCA)
Now it is easy to specify the most relevant dimension:
Reduce the dimensionality to 1d
The found dimension is a combination of $x_1$ and $x_2$
[Plot: the data projected onto $w_1$ alone]
Principal Component Analysis (PCA)
PCA:
Calculate the full matrix $W$ (eigenvectors of the covariance)
Take the $m'$ most relevant components (largest eigenvalues)
Covariance between variables $i$ and $j$ after the transformation?
$(X w_i)^T (X w_j) = w_i^T X^T X w_j = \lambda_j w_i^T w_j$  (since $w_j$ is an eigenvector: $X^T X w_j = \lambda_j w_j$)
$= 0$ for $i \neq j$, and $\lambda_j$ for $i = j$
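A quick numerical check of this decorrelation claim (a sketch, reusing the hypothetical pca_fit helper from above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=1000)

W = pca_fit(X, m_prime=2)           # keep the full basis: both eigenvectors
Z = (X - X.mean(axis=0)) @ W.T      # data expressed in the eigenbasis

print(np.cov(Z.T))                  # off-diagonal entries are ~0: decorrelated
```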
Principal Component Analysis (PCA)
Whitening transform: multiply by $W$ (rotate into the eigenbasis), then by $\Lambda^{-1/2}$ (rescale each direction to unit variance)
Coloring transform (the inverse): multiply by $\Lambda^{1/2}$, then by $W^T$
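A minimal sketch of whitening under these definitions (assumes the eigendecomposition notation of the previous slides; watch out for near-zero eigenvalues in practice):

```python
import numpy as np

def whiten(X):
    """Rotate centered data into the eigenbasis and rescale to unit variance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)
    Z = Xc @ eigvecs              # multiply by W (rotate)
    return Z / np.sqrt(eigvals)   # multiply by Lambda^{-1/2} (rescale)

# After whitening, np.cov(whiten(X).T) is approximately the identity matrix.
```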
Eigenfaces
Turk & Pentland (1991) used PCA to represent face images.
Assume all faces are about the same size.
Represent each face image as a data vector.
Each eigenvector is an image, called an eigenface.
[Images: the average face and the top eigenfaces]
Training images
[Images: original faces in $\mathbb{R}^{50 \times 50}$; their projection to $\mathbb{R}^{10}$ and reconstruction]
[Scatter plot: projection to $\mathbb{R}^2$; different marks indicate different individuals]
Feature selection methods
The goal: Find the input representation that produces the best generalization error.
Two classes of approaches:
Wrapper & filter methods: feature selection is applied as a preprocessing step.
Embedded methods: feature selection is integrated in the learning (optimization) method, e.g. regularization.
Variable Ranking
Idea: Rank features by a scoring function defined for individual features, independently of the context of others. Choose the $m'$ highest-ranked features.
Pros / cons:
Need to select a scoring function.
Must select the subset size ($m'$): cross-validation.
Simple and fast: just need to compute a scoring function $m$ times and sort the $m$ scores.
Scoring function: Correlation Criteria
$R(j) = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
(slide by Michel Verleysen)
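A direct implementation of this ranking score (a sketch; score_correlation is our name for it):

```python
import numpy as np

def score_correlation(X, y):
    """Absolute Pearson correlation of each feature column of X with the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc**2).sum(axis=0) * (yc**2).sum())
    return np.abs(num / den)

# Keep the m' best features: np.argsort(score_correlation(X, y))[::-1][:m_prime]
```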
Scoring function: Mutual information
Think of $X_j$ and $Y$ as random variables. Mutual information between variable $X_j$ and target $Y$:
$I(j) = \int_{X_j} \int_{Y} p(x_j, y) \log \frac{p(x_j, y)}{p(x_j) p(y)} \, dx_j \, dy$
Empirical estimate from data (assume discretized variables):
$I(j) = \sum_{x_j} \sum_{y} P(X_j = x_j, Y = y) \log \frac{P(X_j = x_j, Y = y)}{P(X_j = x_j) P(Y = y)}$
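A minimal sketch of the discretized empirical estimate (mutual_information is our helper name; it bins continuous variables with a 2-d histogram):

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Empirical mutual information between two variables, from a 2-d histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()               # joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of y
    mask = p_xy > 0                          # avoid log(0) on empty cells
    return (p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum()
```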
Nonlinear dependencies with MI
Mutual information identifies nonlinear relationships between variables.
Example:
$x$ uniformly distributed over $[-1, 1]$
$y = x^2 +$ noise
$z$ uniformly distributed over $[-1, 1]$; $z$ and $x$ are independent
(slide by Michel Verleysen)
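Running the estimator above on this example illustrates the point (exact values depend on the noise level and binning):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)
y = x**2 + 0.05 * rng.normal(size=5000)   # nonlinear dependence on x
z = rng.uniform(-1, 1, 5000)              # independent of x and y

print(np.corrcoef(x, y)[0, 1])   # ~0: correlation misses the x -> y relation
print(mutual_information(x, y))  # clearly > 0
print(mutual_information(z, y))  # ~0 (up to estimation bias): truly independent
```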
Variable Ranking
Idea: Rank features by a scoring function defined for individual features, independently of the context of others. Choose the $m'$ highest-ranked features.
Pros / cons?
Need to select a scoring function.
Must select the subset size ($m'$): cross-validation.
Simple and fast: just need to compute a scoring function $m$ times and sort the $m$ scores.
The scoring function is defined for individual features (not subsets), so interactions and redundancy between features are missed.
Best-subset selection
Idea: Consider all possible subsets of the features, measure performance on a validation set, and keep the subset with the best performance.
Pros / cons?
We get the best model!
Very expensive to compute, since there is a combinatorial number of subsets.
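A brute-force sketch of best-subset selection (evaluate is a hypothetical stand-in for training a model on a subset and measuring validation error):

```python
from itertools import combinations

def best_subset(features, evaluate):
    """Try every non-empty subset; return the one with the lowest validation error."""
    best, best_err = None, float("inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):   # 2^n - 1 subsets in total
            err = evaluate(subset)                 # train + validate on this subset
            if err < best_err:
                best, best_err = subset, err
    return best
```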
Search space of subsets (Kohavi & John, 1997)
[Diagram: the lattice of all feature subsets]
$n$ features, $2^n - 1$ possible feature subsets!
(slide by Isabelle Guyon)
Subset selection in practice
Formulate it as a search problem, where the state is the feature set in use and the search operators add or remove features:
Constructive methods like forward/backward search (see the sketch below)
Local search methods, genetic algorithms
Use domain knowledge to help you group features together, to reduce the size of the search space
e.g., in NLP, group syntactic features together, semantic features, etc.
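A greedy forward-search sketch (again with a hypothetical evaluate routine that trains and validates on a given feature set):

```python
def forward_selection(features, evaluate, m_prime):
    """Greedily add, one at a time, the feature that most reduces validation error."""
    selected = []
    while len(selected) < m_prime:
        candidates = [f for f in features if f not in selected]
        # pick the candidate whose addition gives the lowest validation error
        best = min(candidates, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
    return selected
```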
Steps to solving a supervised learning problem
1. Decide what the input-output pairs are.
2. Decide how to encode inputs and outputs. This defines the input space X and output space Y.
3. Choose a class of hypotheses / representations H, e.g. linear functions.
4. Choose an error function (cost function) to define the best hypothesis, e.g. least-mean squares.
5. Choose an algorithm for searching through the space of hypotheses.
Evaluate candidate feature sets on the validation set; evaluate the final choice on the test set.
If doing k-fold cross-validation, re-do the feature selection for each fold.
Regularization
Idea: Modify the objective function to constrain the model choice, typically by adding a term $(\sum_{j=1}^{m} |w_j|^p)^{1/p}$.
Linear regression -> ridge regression, lasso
Challenge: need to adapt the optimization procedure (e.g. handle a non-smooth or non-convex objective).
This approach is often used for very large natural (non-constructed) feature sets, e.g. images, speech, text, video.
With lasso (p=1), some weights tend to become exactly 0, so it can be interpreted as feature selection.
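A brief sketch of lasso-based feature selection with scikit-learn (the synthetic data and the regularization strength alpha are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=200)  # only features 0, 3 matter

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices of the non-zero weights
print(selected)                          # typically recovers features 0 and 3
```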
Final comments
Classic paper on this: I. Guyon, A. Elisseeff, An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 (2003), pp. 1157-1182. http://machinelearning.wustl.edu/mlpapers/paper_files/guyone03.pdf (and references therein)
More recently, there has been a move towards learning the features end-to-end using neural network architectures (more on this next week).
What you should know
Properties of some families of generic features
Basic idea and algorithm of principal component analysis
Mutual information and correlation
Feature selection approaches: wrapper & filter, embedded