Machine learning for pervasive systems
Classification in high-dimensional spaces
Department of Communications and Networking
Aalto University, School of Electrical Engineering
stephan.sigg@aalto.fi
Version 1.1
Lecture overview
02.11.2016  Introduction
09.11.2016  Decision Trees
16.11.2016  Regression
23.11.2016  Training Principles, Performance Evaluation
30.11.2016  PCA, LSI and SVM
07.12.2016  Instance-based learning
14.12.2016  Artificial Neural Networks
21.12.2016  No lecture (ELEC Christmas party)
11.01.2017  Probabilistic Graphical Models
18.01.2017  Unsupervised learning
25.01.2017  Anomaly detection, online learning, recommender systems
01.02.2017  Trend mining and Topic Models
08.02.2017  Feature selection
15.02.2017  Presentations
Outline
Principal Component Analysis
Latent Semantic Indexing
Support Vector Machines: Cost function, Hypothesis, Kernels
Principal Component Analysis

Find a lower-dimensional surface onto which to project the data.
Principal Component Analysis

PCA finds k vectors v^{(1)}, ..., v^{(k)} onto which to project the data such that the projection error is minimized. In particular, we find values z^{(i)} that represent the x^{(i)} in the k-dimensional vector space spanned by the v^{(i)}.
Principal Component Analysis

1. Compute the covariance matrix from the x^{(i)}:
   C = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T
   where each x^{(i)} is an n-dimensional column vector, so that x^{(i)} (x^{(i)})^T and C are n \times n matrices.
   (We assume that all features are mean-normalized and scaled into [0, 1].)

   Covariance: a measure of the spread of a set of points around their center of mass.

2. The principal components are found by computing the eigenvectors and eigenvalues of C, i.e. by solving (C - \lambda I_n) u = 0.

   When a matrix C is multiplied with a vector u, this usually results in a new vector Cu of different direction than u. There are a few vectors u, however, which keep their direction (Cu = \lambda u). These are the eigenvectors of C, and the \lambda are the corresponding eigenvalues.

   Eigenvectors and eigenvalues: the (orthogonal) eigenvectors, sorted by their eigenvalues, correspond to the directions of greatest variance in the data.

3. Choose the k eigenvectors with the largest eigenvalues to span the projection space U.

4. These k eigenvectors in U are used to transform the inputs x^{(i)} to z^{(i)} = U^T x^{(i)}.
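As an illustration, here is a minimal NumPy sketch of these four steps (the function name and the random toy data are illustrative, not part of the lecture; the rows of X are assumed to be mean-normalized samples):

```python
import numpy as np

def pca(X, k):
    """Minimal PCA sketch: X holds one mean-normalized sample x^(i) per row."""
    m = X.shape[0]
    # 1. covariance matrix C = (1/m) * sum_i x^(i) (x^(i))^T
    C = (X.T @ X) / m
    # 2. eigenvectors / eigenvalues of the symmetric matrix C
    eigvals, eigvecs = np.linalg.eigh(C)
    # 3. keep the k eigenvectors with the largest eigenvalues
    order = np.argsort(eigvals)[::-1]
    U = eigvecs[:, order[:k]]            # n x k projection space
    # 4. transform every x^(i) to z^(i) = U^T x^(i)
    Z = X @ U                            # one z^(i) per row
    return Z, U, eigvals[order]

X = np.random.randn(100, 5)
X = X - X.mean(axis=0)                   # mean-normalization
Z, U, lam = pca(X, k=2)
```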
Principal Component Analysis: How to choose the number k of dimensions?

We can calculate

\frac{\text{average squared projection error}}{\text{total variation in the data}} = \frac{\frac{1}{m}\sum_{i=1}^{m} \|x^{(i)} - x^{(i)}_{\text{approx}}\|^2}{\frac{1}{m}\sum_{i=1}^{m} \|x^{(i)}\|^2} = 1 - \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{n} \lambda_j} = d

as the accuracy of the projection using k principal components, expressed as a function of the eigenvalues.

We say that 100 \cdot (1 - d)\% of the variance is retained. (Typically, d \in [0.01, 0.05].)
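A short sketch of this selection rule, assuming the eigenvalues lam from the PCA sketch above:

```python
import numpy as np

def choose_k(lam, d=0.01):
    """Smallest k such that 1 - sum(lam[:k]) / sum(lam) <= d, i.e. at least
    100*(1-d)% of the variance is retained."""
    lam = np.sort(lam)[::-1]
    retained = np.cumsum(lam) / np.sum(lam)   # variance retained for k = 1, 2, ...
    return int(np.argmax(retained >= 1.0 - d)) + 1

# e.g. choose_k(lam, d=0.05) keeps at least 95% of the variance
```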
Latent Semantic Indexing: Motivation

In information retrieval, a common task is to obtain, from a large collection of documents, the subset that best matches a query.

Typical feature representations of the documents are term-document matrices. These matrices are typically huge but sparse.

How can we identify those feature dimensions (or combinations thereof) which are most meaningful?
Latent Semantic Indexing: Singular Value Decomposition

Any m \times n matrix C can be represented by a singular value decomposition of the form C = U \Sigma V^T, where
U: m \times m matrix whose columns are the orthogonal eigenvectors of CC^T
V: n \times n matrix whose columns are the orthogonal eigenvectors of C^T C
\Sigma: diagonal matrix with \Sigma_{ii} = \sqrt{\lambda_i} (the singular values) and \Sigma_{ij} = 0 for i \neq j

Keeping only the first k eigenvectors (equivalently, the k largest singular values) maps the document vectors to a lower-dimensional representation. It can be shown that this mapping results in the k-dimensional space with the smallest distance to the original space.
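A small sketch of this decomposition with NumPy (the tiny matrix below is only a stand-in for a real term-document matrix):

```python
import numpy as np

# rows: terms, columns: documents (illustrative counts only)
C = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 1., 1.],
              [0., 0., 0., 1.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)   # singular values in s, descending

k = 2
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation of C
docs_k = np.diag(s[:k]) @ Vt[:k, :]                # k-dim. document representations
```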
Latent Semantic Indexing: Example
[Figure: a small term-document matrix decomposed into U, \Sigma and V^T; truncating \Sigma to the two largest singular values yields the rank-2 approximation C_2, and documents are then compared by cosine similarity in the reduced space.]
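To make the cosine-similarity step concrete, a short sketch on hypothetical reduced document vectors (the numbers are made up; in practice they are the columns of \Sigma_k V_k^T from the decomposition above):

```python
import numpy as np

# three documents represented in a 2-dimensional latent space (made-up values)
docs_k = np.array([[1.6, 1.4, 0.3],
                   [0.2, -0.1, 1.1]])

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim(docs_k[:, 0], docs_k[:, 1]))   # high: documents 1 and 2 are similar
print(cos_sim(docs_k[:, 0], docs_k[:, 2]))   # lower: document 3 differs
```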
Support Vector Machines
For our previous classifiers, we have designed an objective function of sufficient dimension. An alternative to designing complex non-linear functions is to change the dimension of the input space so that a linear separator becomes possible again.
An SVM pre-processes the data to represent patterns in a high-dimensional space, often of much higher dimension than the original feature space. A hyperplane is then inserted in order to separate the data.
The goal of a support vector machine is to find a separating hyperplane with the largest margin to the outermost points of all classes. If no such hyperplane exists, the points are mapped into a higher-dimensional space until such a plane exists.
A simple extension to several classes is the iterative one-versus-all approach: does a sample belong to class 1 or not? Does it belong to class 2 or not? ... (a minimal sketch follows below). The search for the optimum mapping between input space and feature space is complicated (no optimal approach is known).
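A minimal one-versus-all sketch with scikit-learn (assuming it is available; the toy data and class labels are illustrative only):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# toy data: 3 classes, 20 samples each
X = np.vstack([np.random.randn(20, 2) + c for c in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 20)

# one binary linear SVM per class: "belongs to class c or not?"
clf = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(clf.predict([[3.5, 0.5], [0.2, 4.1]]))
```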
Cost function

Contribution of a single sample to the overall cost:

Logistic regression:
-y \log \frac{1}{1 + e^{-W^T x}} - (1 - y) \log \left( 1 - \frac{1}{1 + e^{-W^T x}} \right)

SVM:
y\, \text{cost}_{y=1}(W^T x) + (1 - y)\, \text{cost}_{y=0}(W^T x)
Cost function

Logistic regression:
\min_W \frac{1}{m} \sum_{i=1}^{m} \left[ -y_i \log \frac{1}{1 + e^{-W^T x_i}} - (1 - y_i) \log \left( 1 - \frac{1}{1 + e^{-W^T x_i}} \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

SVM:
\min_W C \sum_{i=1}^{m} \left[ y_i\, \text{cost}_{y=1}(W^T x_i) + (1 - y_i)\, \text{cost}_{y=0}(W^T x_i) \right] + \frac{1}{2} \sum_{j=1}^{n} w_j^2

C here plays a similar role as \frac{1}{\lambda}.
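A short sketch comparing the two per-sample costs. Assumption (not stated explicitly on the slide): cost_{y=1}(z) and cost_{y=0}(z) are taken as the usual hinge-style surrogates max(0, 1 - z) and max(0, 1 + z):

```python
import numpy as np

def logistic_cost(z, y):
    h = 1.0 / (1.0 + np.exp(-z))                 # z = W^T x
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

def svm_cost(z, y):
    # hinge-style surrogates assumed for cost_{y=1} and cost_{y=0}
    return y * np.maximum(0.0, 1.0 - z) + (1 - y) * np.maximum(0.0, 1.0 + z)

for z in (-2.0, 0.0, 0.5, 2.0):
    print(f"z={z:+.1f}  logistic={logistic_cost(z, 1):.3f}  svm={svm_cost(z, 1):.3f}")
# the SVM cost is exactly zero once z >= 1; the logistic cost never reaches zero
```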
SVM hypothesis

\min_W C \sum_{i=1}^{m} \left[ y_i\, \text{cost}_{y=1}(W^T x_i) + (1 - y_i)\, \text{cost}_{y=0}(W^T x_i) \right] + \frac{1}{2} \sum_{j=1}^{n} w_j^2

For a correct classification it would be sufficient that W^T x \geq 0 (for y = 1) or W^T x < 0 (for y = 0). The SVM instead demands W^T x \geq 1 or W^T x \leq -1, which builds confidence into the decision.

Outliers: elastic decision boundary. A large C enforces a stricter boundary at the cost of a smaller margin; a small C tolerates outliers. (A short sketch of this trade-off follows below.)
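A small sketch of this trade-off, assuming scikit-learn's LinearSVC as the solver (toy data; the printed quantity 2/||W|| is the width of the margin):

```python
import numpy as np
from sklearn.svm import LinearSVC

# two slightly overlapping classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, (40, 2)), rng.normal(1.5, 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

for C in (0.01, 1.0, 100.0):
    w = LinearSVC(C=C).fit(X, y).coef_.ravel()
    print(f"C={C:>6}: margin width 2/||W|| = {2.0 / np.linalg.norm(w):.2f}")
# large C: stricter fit, smaller margin; small C: wider margin, tolerates outliers
```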
Large margin classifier

\min_W C \sum_{i=1}^{m} \left[ y_i\, \text{cost}_{y=1}(W^T x_i) + (1 - y_i)\, \text{cost}_{y=0}(W^T x_i) \right] + \frac{1}{2} \sum_{j=1}^{n} w_j^2

Rewrite the SVM optimisation problem as

\min_W \frac{1}{2} \sum_{j=1}^{n} w_j^2 \quad \text{s.t.} \quad W^T x_i \geq 1 \text{ if } y_i = 1, \qquad W^T x_i \leq -1 \text{ if } y_i = 0
Large margin classifier

\min_W \frac{1}{2} \sum_{j=1}^{n} w_j^2 = \frac{1}{2} \left( w_1^2 + \dots + w_n^2 \right) = \frac{1}{2} \|W\|^2
\text{s.t.} \quad W^T x_i \geq 1 \text{ if } y_i = 1 \;(\Leftrightarrow \|W\|\, p_i \geq 1), \qquad W^T x_i \leq -1 \text{ if } y_i = 0 \;(\Leftrightarrow \|W\|\, p_i \leq -1)

Here p_i is the signed length of the projection of x_i onto W; in the two-dimensional case, W^T x = w_1 x_1 + w_2 x_2 = \|W\|\, p.

Which decision boundary is found? With h(x) = w_1 x_1 + w_2 x_2, W is orthogonal to all x with h(x) = 0. Minimising \frac{1}{2} \|W\|^2 while requiring \|W\|\, p_i \geq 1 necessitates large projections p_i, i.e. the samples must lie far from the decision boundary: the SVM finds the separating hyperplane with the largest margin.
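A tiny numeric check of the geometric argument (toy numbers, two-dimensional case):

```python
import numpy as np

W = np.array([2.0, 1.0])
x = np.array([1.0, 3.0])

p = (W @ x) / np.linalg.norm(W)                    # signed length of the projection of x onto W
assert np.isclose(W @ x, np.linalg.norm(W) * p)    # W^T x = ||W|| * p

# The constraint ||W|| * p_i >= 1 combined with minimising ||W|| forces the
# projections p_i, and hence the margin, to be large.
```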
Kernels

Non-linear decision boundary: the hypothesis predicts 1 if
w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + \dots \geq 0.
Instead of such polynomial terms we can use kernel features:
w_0 + w_1 k_1 + w_2 k_2 + w_3 k_3 + \dots

Kernel: define the kernel via landmarks l_i.

Gaussian kernel: k(x, l_i) = e^{-\frac{\|x - l_i\|^2}{2\sigma^2}}
For x \approx l_i we have k(x, l_i) \approx 1 (and k(x, l_i) tends towards 0 otherwise); \sigma controls the width of the Gaussian.

Example: w_0 = -0.5, w_1 = 1, w_2 = 1, w_3 = 0, with
h(x) = w_0 + w_1 k(x, l_1) + w_2 k(x, l_2) + w_3 k(x, l_3).
[Figures illustrate the resulting decision regions for \sigma = 1, \sigma = 0.5 and \sigma = 2.]
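A sketch of the Gaussian kernel and the example hypothesis above (the landmark positions here are chosen only for illustration):

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    return np.exp(-np.linalg.norm(x - l) ** 2 / (2.0 * sigma ** 2))

l1, l2, l3 = np.array([3., 5.]), np.array([5., 2.]), np.array([8., 8.])
w0, w1, w2, w3 = -0.5, 1.0, 1.0, 0.0          # weights from the example

def h(x, sigma=1.0):
    s = w0 + w1 * gaussian_kernel(x, l1, sigma) \
           + w2 * gaussian_kernel(x, l2, sigma) \
           + w3 * gaussian_kernel(x, l3, sigma)
    return 1 if s >= 0 else 0

print(h(np.array([3., 4.])))    # close to l1, so k(x, l1) is near 1: predicts 1
print(h(np.array([9., 0.])))    # far from l1 and l2, all kernels near 0: predicts 0
```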
Kernels: placement of landmarks

Possible choice of initial landmarks: all training-set samples.

Training of the w_j: map each x_i to the kernel feature vector
f_i = \left( k(x_i, l_1), \dots, k(x_i, l_m) \right)^T
and solve
\min_W C \sum_{i=1}^{m} \left[ y_i\, \text{cost}_{y_i=1}(W^T f_i) + (1 - y_i)\, \text{cost}_{y_i=0}(W^T f_i) \right] + \frac{1}{2} \sum_{j=1}^{m} w_j^2
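A sketch of this feature mapping with all training samples as landmarks (toy data; in practice a library such as scikit-learn's SVC(kernel="rbf") performs this implicitly):

```python
import numpy as np

def kernel_features(X, sigma=1.0):
    """F[i, j] = k(x_i, l_j) with landmarks l_j = x_j (Gaussian kernel)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.random.randn(20, 2)        # toy training samples
F = kernel_features(X)            # m x m matrix of kernel feature vectors f_i
# F (plus a bias term for w_0) then replaces the raw inputs x_i in the SVM cost above
```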
Questions?
stephan.sigg@aalto.fi
Le Nguyen: le.ngu.nguyen@aalto.fi
Literature
C.M. Bishop: Pattern Recognition and Machine Learning, Springer, 2007.
R.O. Duda, P.E. Hart, D.G. Stork: Pattern Classification, Wiley, 2001.