Machine learning for pervasive systems: Classification in high-dimensional spaces

1 Machine learning for pervasive systems: Classification in high-dimensional spaces. Department of Communications and Networking, Aalto University, School of Electrical Engineering. Version 1.1

2 Lecture overview
- Introduction
- Decision Trees
- Regression
- Training Principles, Performance Evaluation
- PCA, LSI and SVM
- Instance-based learning
- Artificial Neural Networks
- No lecture (ELEC Christmas party)
- Probabilistic Graphical Models
- Unsupervised learning
- Anomaly detection, online-learning, recommender systems
- Trend mining and Topic Models
- Feature selection
- Presentations

3 Outline: Principal Component Analysis, Latent Semantic Indexing, Support Vector Machines (Cost function, Hypothesis, Kernels)

4 Principal Component Analysis: Find a lower-dimensional surface onto which to project the data.

8 Principal Component Analysis: PCA finds $k$ vectors $v^{(1)}, \dots, v^{(k)}$ onto which to project the data such that the projection error is minimized. In particular, we find values $z^{(i)}$ to represent the $x^{(i)}$ in the $k$-dimensional vector space spanned by these vectors.

10 Principal Component Analysis

1. Compute the covariance matrix from the $x^{(i)}$:

   $C = \frac{1}{m} \sum_{i=1}^{n} \underbrace{x^{(i)}}_{m \times 1} \underbrace{(x^{(i)})^T}_{1 \times m}$,

   an $m \times m$ matrix. (We assume that all features are mean-normalized and scaled into $[0, 1]$.)

   Covariance: a measure of the spread of a set of points around their center of mass.

2. The principal components are found by computing the eigenvectors and eigenvalues of $C$ (solving $(C - \lambda I_m)u = 0$).

   When a matrix $C$ is multiplied with a vector $u$, this usually results in a new vector $Cu$ with a direction different from that of $u$. There are a few vectors $u$, however, which keep their direction ($Cu = \lambda u$). These are the eigenvectors of $C$, and the $\lambda$ are the eigenvalues. The (orthogonal) eigenvectors are sorted by their eigenvalues with respect to the direction of greatest variance in the data.

3. Choose the $k$ eigenvectors with the largest eigenvalues to form the projection matrix $U$.

4. These $k$ eigenvectors in $U$ are used to transform the inputs $x^{(i)}$ to $z^{(i)}$: $z^{(i)} = U^T x^{(i)}$.
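
The four steps above can be sketched in a few lines of NumPy (not part of the slides; the variable names X, k and the simple [0, 1] scaling are assumptions):

```python
import numpy as np

def pca(X, k):
    """PCA as in the four steps above: X holds one sample per row."""
    # Preprocessing assumed by the slide: scale each feature into [0, 1], then mean-normalize.
    span = X.max(axis=0) - X.min(axis=0)
    X = (X - X.min(axis=0)) / np.where(span == 0, 1, span)
    X = X - X.mean(axis=0)

    # 1. Covariance matrix: average of the outer products x_i x_i^T.
    C = (X.T @ X) / X.shape[0]

    # 2. Eigenvectors and eigenvalues of the symmetric matrix C.
    eigvals, eigvecs = np.linalg.eigh(C)                 # returned in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

    # 3. Keep the k eigenvectors with the largest eigenvalues as projection matrix U.
    U = eigvecs[:, :k]

    # 4. Transform the inputs: z_i = U^T x_i (done for all samples at once).
    Z = X @ U
    return Z, U, eigvals

# Usage (random placeholder data): Z, U, lam = pca(np.random.rand(100, 5), k=2)
```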

18 Principal Component Analysis: How to choose the number $k$ of dimensions? We can calculate

   $\frac{\text{Average squared projection error}}{\text{Total variation in the data}} = \frac{\frac{1}{m}\sum_{i=1}^{m} \|x^{(i)} - x^{(i)}_{\text{approx}}\|^2}{\frac{1}{m}\sum_{i=1}^{m} \|x^{(i)}\|^2}$

as a measure of the accuracy of the projection using $k$ principal components; expressed through the eigenvalues, this ratio is

   $1 - \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{m} \lambda_j} = d.$

We say that $100(1-d)\%$ of the variance is retained. (Typically $d \in [0.01, 0.05]$.)
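
A small sketch of this criterion, reusing the eigenvalues from the PCA sketch above (the threshold d and the example values are made up):

```python
import numpy as np

def choose_k(eigvals, d=0.01):
    """Smallest k with 1 - sum(top-k eigenvalues) / sum(all eigenvalues) <= d."""
    eigvals = np.sort(eigvals)[::-1]
    retained = np.cumsum(eigvals) / eigvals.sum()    # retained variance for k = 1, 2, ...
    return int(np.argmax(1.0 - retained <= d)) + 1   # first k where the error ratio drops below d

# Example (made-up eigenvalues): choose_k([5.0, 2.0, 0.02, 0.01], d=0.05) returns 2,
# i.e. two components already retain more than 95% of the variance.
```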

20 Outline: Principal Component Analysis, Latent Semantic Indexing, Support Vector Machines (Cost function, Hypothesis, Kernels)

21 Latent Semantic Indexing: Motivation. In information retrieval, a common task is to obtain, from many documents, the subset that best matches a query. Typical feature representations of documents are term-document matrices. These matrices are typically huge but sparse. How can we identify those feature dimensions (or combinations thereof) which are most meaningful?

26 Latent Semantic Indexing: Singular Value Decomposition. Any $m \times n$ matrix $C$ can be represented as a singular value decomposition of the form $C = U \Sigma V^T$, where

   U: $m \times m$ matrix whose columns are the orthogonal eigenvectors of $CC^T$
   V: $n \times n$ matrix whose columns are the orthogonal eigenvectors of $C^T C$
   $\Sigma$: diagonal matrix with the singular values $\Sigma_{ii} = \sqrt{\lambda_i}$ and $\Sigma_{ij} = 0$ for $i \neq j$

The first $k$ eigenvectors map document vectors to a lower-dimensional representation. It can be shown that this mapping results in the $k$-dimensional space with the smallest distance to the original space.
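
A brief NumPy illustration of the rank-k truncation; the toy term-document matrix and the choice k = 2 are invented for the example:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (entries are term counts).
C = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)   # C = U diag(s) Vt

k = 2                                              # keep the k largest singular values
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation of C

# Representation of the documents in the k-dimensional latent space (one column per document):
docs_k = np.diag(s[:k]) @ Vt[:k, :]
```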

29 Latent Semantic Indexing: Example. (The slides show a worked example: a term-document matrix is decomposed into $U$, $\Sigma$ and $V^T$; truncating $\Sigma$ yields the low-rank approximation $C_2$, and documents are then compared by cosine similarity in the reduced space.)
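
Continuing that sketch, similarity between documents in the reduced space can be computed with the cosine measure; this is only an illustration, not the numbers from the slides:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# With docs_k from the previous snippet (columns = documents in the latent space):
# cosine_similarity(docs_k[:, 0], docs_k[:, 2]) compares documents 0 and 2 after reduction.
```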

33 Outline: Principal Component Analysis, Latent Semantic Indexing, Support Vector Machines (Cost function, Hypothesis, Kernels)

34 For our previous classifier, we designed an objective function of sufficient dimension. An alternative to designing complex non-linear functions: change the dimension of the input space so that a linear separator is again possible.

37 The SVM pre-processes the data to represent patterns in a high-dimensional space, often of much higher dimension than the original feature space. A hyperplane is then inserted to separate the data.

38 The goal of support vector machines is to find a separating hyperplane with the largest margin to the outer points of all classes. If no such hyperplane exists, map all points into a higher-dimensional space until such a plane exists.

39 Simple application to several classes by an iterative one-versus-all approach: does a sample belong to class 1 or not? to class 2 or not? ... The search for an optimal mapping between input space and feature space is complicated (no optimal approach is known).
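
A minimal sketch of the one-versus-all scheme using linear SVMs from scikit-learn (the library choice and the random placeholder data are assumptions, not part of the lecture):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))
y = rng.integers(0, 3, size=90)                  # three classes: 0, 1, 2 (placeholder labels)

# One binary classifier per class: "belongs to class c or not?"
classes = sorted(np.unique(y))
clfs = {c: LinearSVC(C=1.0).fit(X, (y == c).astype(int)) for c in classes}

# Predict the class whose classifier is most confident about "yes".
scores = np.column_stack([clfs[c].decision_function(X) for c in classes])
y_pred = np.array(classes)[scores.argmax(axis=1)]
```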

40 Outline: Principal Component Analysis, Latent Semantic Indexing, Support Vector Machines (Cost function, Hypothesis, Kernels)

41 Cost function: contribution of a single sample to the overall cost.

Logistic regression: $-y \log \frac{1}{1+e^{-W^T x}} - (1-y) \log\left(1 - \frac{1}{1+e^{-W^T x}}\right)$

SVM: $y \cdot \text{cost}_{y=1}(W^T x) + (1-y) \cdot \text{cost}_{y=0}(W^T x)$

44 Cost function

Logistic regression: $\min_W \frac{1}{m} \left[ \sum_{i=1}^{m} -y_i \log \frac{1}{1+e^{-W^T x_i}} - (1-y_i) \log\left(1 - \frac{1}{1+e^{-W^T x_i}}\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$

SVM: $\min_W C \sum_{i=1}^{m} \left[ y_i \, \text{cost}_{y=1}(W^T x_i) + (1-y_i) \, \text{cost}_{y=0}(W^T x_i) \right] + \frac{1}{2} \sum_{j=1}^{n} w_j^2$

$C$ here plays a similar role as $\frac{1}{\lambda}$.
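
A short NumPy sketch contrasting the two per-sample costs. The concrete shape of cost_{y=1} and cost_{y=0} (the hinge-like pieces below) is an assumption consistent with the usual SVM cost plots, since the text above only references them by name:

```python
import numpy as np

def logistic_cost(z, y):
    """Per-sample logistic-regression cost, with z = W^T x."""
    h = 1.0 / (1.0 + np.exp(-z))
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

def svm_cost(z, y):
    """Per-sample SVM cost; cost_{y=1} and cost_{y=0} taken as hinge-like pieces."""
    cost1 = np.maximum(0.0, 1.0 - z)   # cost_{y=1}(z): zero once z >= 1
    cost0 = np.maximum(0.0, 1.0 + z)   # cost_{y=0}(z): zero once z <= -1
    return y * cost1 + (1 - y) * cost0

z = np.linspace(-3, 3, 7)
print(np.round(logistic_cost(z, 1), 3))   # smooth, never exactly zero
print(svm_cost(z, 1))                     # exactly zero for z >= 1
```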

45 Outline: Principal Component Analysis, Latent Semantic Indexing, Support Vector Machines (Cost function, Hypothesis, Kernels)

46 SVM hypothesis

$\min_W C \sum_{i=1}^{m} \left[ y_i \, \text{cost}_{y=1}(W^T x_i) + (1-y_i) \, \text{cost}_{y=0}(W^T x_i) \right] + \frac{1}{2} \sum_{j=1}^{n} w_j^2$

For classification, $W^T x \ge 0$ vs. $W^T x < 0$ would be sufficient; the SVM instead demands $W^T x \ge 1$ vs. $W^T x \le -1$ for confidence.

Outliers: elastic decision boundary. A large $C$ gives a stricter boundary at the cost of a smaller margin; a small $C$ tolerates outliers.
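
The role of C can be explored by training the same linear SVM with different values; a hedged scikit-learn sketch with placeholder data:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping point clouds (placeholder data), so some outliers are unavoidable.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: training accuracy {clf.score(X, y):.2f}, "
          f"support vectors {len(clf.support_)}")
```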

54 Large margin classifier

$\min_W C \sum_{i=1}^{m} \left[ y_i \, \text{cost}_{y=1}(W^T x_i) + (1-y_i) \, \text{cost}_{y=0}(W^T x_i) \right] + \frac{1}{2} \sum_{j=1}^{n} w_j^2$

Rewrite the SVM optimisation problem as

$\min_W \frac{1}{2} \sum_{j=1}^{n} w_j^2 = \frac{1}{2}\left(w_1^2 + \dots + w_n^2\right) = \frac{1}{2}\|W\|^2$

s.t. $W^T x_i \ge 1$ if $y_i = 1$ (equivalently $\|W\| p_i \ge 1$)
     $W^T x_i \le -1$ if $y_i = 0$ (equivalently $\|W\| p_i \le -1$),

where $p_i$ is the projection of $x_i$ onto $W$; in two dimensions, $W^T x = w_1 x_1 + w_2 x_2 = \|W\| p$.

Which decision boundary is found? With $h(x) = w_1 x_1 + w_2 x_2$, $W$ is orthogonal to all $x$ with $h(x) = 0$. Minimising $\frac{1}{2}\|W\|^2$ while satisfying $\|W\| p_i \ge 1$ necessitates larger projections $p_i$, i.e. a larger margin.
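
The link between minimising $\frac{1}{2}\|W\|^2$ and the margin can be checked numerically: for a trained linear SVM, the distance between the hyperplanes $W^T x = 1$ and $W^T x = -1$ is $2 / \|W\|$. A sketch under the assumption that scikit-learn and toy data are acceptable stand-ins:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated point clouds (toy data).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.4, size=(40, 2)),
               rng.normal(+2.0, 0.4, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1000.0).fit(X, y)   # large C: essentially a hard margin
w, b = clf.coef_[0], clf.intercept_[0]

print("||W||          :", np.linalg.norm(w))
print("margin 2/||W|| :", 2.0 / np.linalg.norm(w))
# Support vectors sit (approximately) on the confidence hyperplanes |W^T x + b| = 1.
print("|W^T x + b| at support vectors:", np.abs(X[clf.support_] @ w + b))
```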

69 Outline: Principal Component Analysis, Latent Semantic Indexing, Support Vector Machines (Cost function, Hypothesis, Kernels)

70 Kernels: non-linear decision boundary

Hypothesis = 1 if $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + \dots \ge 0$

Instead of the polynomial terms, use kernel features: $w_0 + w_1 k_1 + w_2 k_2 + w_3 k_3 + \dots$

Kernel: define the kernel via landmarks $l_i$. Gaussian kernel: $k(x, l_i) = e^{-\frac{\|x - l_i\|^2}{2\sigma^2}}$. For $x \approx l_i$, $k(x, l_i) \approx 1$ (towards 0 otherwise); $\sigma$ controls the width of the Gaussian.

Example: $w_0 = -0.5$, $w_1 = 1$, $w_2 = 1$, $w_3 = 0$, with $h(x) = w_0 + w_1 k(x, l_1) + w_2 k(x, l_2) + w_3 k(x, l_3)$. (The slides show the resulting decision regions for $\sigma = 1$, $\sigma = 0.5$ and $\sigma = 2$.)
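
The example above can be reproduced directly; the landmark positions are not given on the slides, so the ones below are placeholders:

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    """k(x, l) = exp(-||x - l||^2 / (2 sigma^2)): close to 1 near the landmark, towards 0 far away."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

# Weights from the example above; the landmark positions l1, l2, l3 are placeholders.
w0, w1, w2, w3 = -0.5, 1.0, 1.0, 0.0
l1, l2, l3 = np.array([1.0, 1.0]), np.array([3.0, 1.0]), np.array([2.0, 3.0])

def h(x, sigma=1.0):
    return (w0 + w1 * gaussian_kernel(x, l1, sigma)
               + w2 * gaussian_kernel(x, l2, sigma)
               + w3 * gaussian_kernel(x, l3, sigma))

print(h(np.array([1.1, 0.9])) >= 0)   # near l1: predict 1
print(h(np.array([5.0, 5.0])) >= 0)   # far from all landmarks: predict 0
# Varying sigma widens or narrows the region where the hypothesis predicts 1.
```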

83 Kernels: placement of landmarks. A possible choice of initial landmarks: all training-set samples. For training the $w_i$, each sample is mapped to the feature vector $f_i = \left(k(x_i, l_1), \dots, k(x_i, l_m)\right)^T$ and we solve

$\min_W C \sum_{i=1}^{m} \left[ y_i \, \text{cost}_{y_i=1}(W^T f_i) + (1-y_i) \, \text{cost}_{y_i=0}(W^T f_i) \right] + \frac{1}{2} \sum_{j=1}^{m} w_j^2$
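
A sketch of this construction with every training sample used as a landmark; scikit-learn's LinearSVC stands in for minimising the cost above, and the data is an invented non-linear toy problem:

```python
import numpy as np
from sklearn.svm import LinearSVC

def kernel_features(X, landmarks, sigma=1.0):
    """Row i is f_i = (k(x_i, l_1), ..., k(x_i, l_m)) for the Gaussian kernel."""
    d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)   # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Invented toy problem with a non-linear (circular) class boundary.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

F = kernel_features(X, landmarks=X)                 # landmarks = all training samples (m x m)
clf = LinearSVC(C=1.0, max_iter=10000).fit(F, y)    # linear classifier on the kernel features
print("training accuracy:", clf.score(F, y))
```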

84 Outline: Principal Component Analysis, Latent Semantic Indexing, Support Vector Machines (Cost function, Hypothesis, Kernels)

85 Questions? Le Nguyen

86 Literature
C.M. Bishop: Pattern Recognition and Machine Learning, Springer.
R.O. Duda, P.E. Hart, D.G. Stork: Pattern Classification, Wiley.
