MACHINE LEARNING 2012

MACHINE LEARNING: kernel CCA, kernel K-means, Spectral Clustering

Change in timetable: we have a practical session next week!

Structure of today's and next week's class:
1) Briefly go through some extensions or variants of the principle of kernel PCA, namely kernel CCA.
2) Look at one particular set of clustering algorithms for structure discovery, kernel K-means.
3) Describe the general concept of Spectral Clustering, highlighting the equivalence between kernel PCA, ISOMAP, etc. and their use for clustering.
4) Introduce the notions of unsupervised and semi-supervised learning and how these can be used to evaluate clustering methods.
5) Compare kernel K-means and the Gaussian Mixture Model for unsupervised and semi-supervised clustering.

Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y}\ \mathrm{corr}\left(w_x^T X,\ w_y^T Y\right)$$

[Figure: video description and audio description of the same data.]

Determine features in two (or more) separate descriptions of the dataset that best explain each datapoint. Extract hidden structure that maximizes the correlation across the two different projections.

Canonical Correlation Analysis (CCA)

Pair of multidimensional zero-mean variables $x \in \mathbb{R}^N$ and $y \in \mathbb{R}^q$. We have M instances of the pairs: $\{(x^i, y^i)\}_{i=1}^M$.

Search for two projections $w_x$ and $w_y$, with $X' = w_x^T X$ and $Y' = w_y^T Y$, solutions of:

$$\max_{w_x, w_y}\ \mathrm{corr}(X', Y') = \max_{w_x, w_y}\ \mathrm{corr}\left(w_x^T X,\ w_y^T Y\right)$$

Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y}\ \mathrm{corr}(X', Y') = \max_{w_x, w_y} \frac{w_x^T\, E\!\left[XY^T\right] w_y}{\sqrt{w_x^T\, E\!\left[XX^T\right] w_x}\ \sqrt{w_y^T\, E\!\left[YY^T\right] w_y}} = \max_{w_x, w_y} \frac{w_x^T C_{xy}\, w_y}{\sqrt{w_x^T C_x w_x}\ \sqrt{w_y^T C_y w_y}}$$

The cross-covariance matrix $C_{xy} = E\!\left[XY^T\right]$ is $N \times q$ and measures the cross-correlation between X and Y. The covariance matrices are $C_x = E\!\left[XX^T\right]$ ($N \times N$) and $C_y = E\!\left[YY^T\right]$ ($q \times q$).

Canonical Correlation Analysis (CCA)

Since the correlation is not affected by rescaling the norm of the vectors, we can require $w_x^T C_x w_x = w_y^T C_y w_y = 1$, giving the constrained problem:

$$\max_{w_x, w_y}\ w_x^T C_{xy}\, w_y \quad \text{s.t.} \quad w_x^T C_x w_x = w_y^T C_y w_y = 1$$

Canonical Correlation Analysis (CCA)

To determine the optimum (maximum), solve by Lagrange multipliers:

$$\max_{w_x, w_y}\ w_x^T C_{xy}\, w_y - \frac{\lambda_x}{2}\left(w_x^T C_x w_x - 1\right) - \frac{\lambda_y}{2}\left(w_y^T C_y w_y - 1\right)$$

Setting the derivatives to zero leads to $C_{xy} C_y^{-1} C_{yx}\, w_x = \lambda^2\, C_x w_x$: a generalized eigenvalue problem, which can be reduced to a classical eigenvalue problem if $C_x$ is invertible.
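
As a concrete illustration, here is a minimal numpy/scipy sketch of linear CCA solved through this generalized eigenvalue problem; the function name, the small ridge term and the toy two-view data are illustrative assumptions, not part of the lecture material.

```python
import numpy as np
from scipy.linalg import eigh

def linear_cca(X, Y, reg=1e-6):
    """Linear CCA; X is (N x M), Y is (q x M), columns are samples."""
    X = X - X.mean(axis=1, keepdims=True)          # zero-mean the two views
    Y = Y - Y.mean(axis=1, keepdims=True)
    M = X.shape[1]
    Cx = X @ X.T / M + reg * np.eye(X.shape[0])    # C_x (small ridge for invertibility)
    Cy = Y @ Y.T / M + reg * np.eye(Y.shape[0])    # C_y
    Cxy = X @ Y.T / M                              # C_xy (cross-covariance)
    # Generalized eigenvalue problem: C_xy C_y^{-1} C_yx w_x = rho^2 C_x w_x
    A = Cxy @ np.linalg.solve(Cy, Cxy.T)
    vals, Wx = eigh(A, Cx)                         # symmetric generalized eigensolver
    order = np.argsort(vals)[::-1]
    rho = np.sqrt(np.clip(vals[order], 0.0, 1.0))  # canonical correlations
    Wx = Wx[:, order]
    Wy = np.linalg.solve(Cy, Cxy.T @ Wx)           # matching y-projections (up to scale)
    return rho, Wx, Wy

# Toy usage: two noisy views sharing one latent signal.
rng = np.random.default_rng(0)
z = rng.normal(size=(1, 200))
X = np.vstack([z, rng.normal(size=(2, 200))])
Y = np.vstack([0.5 * z + 0.1 * rng.normal(size=(1, 200)), rng.normal(size=(1, 200))])
rho, Wx, Wy = linear_cca(X, Y)
print(rho)   # the first canonical correlation should be close to 1
```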

Kernel Canonical Correlation Analysis

- CCA finds basis vectors such that the correlation between the projections is mutually maximized.
- It is a generalized version of PCA for two or more multi-dimensional datasets.
- CCA depends on the coordinate system in which the variables are described: even if there is a strong linear relationship between the variables, depending on the coordinate system used this relationship might not be visible as a correlation; hence kernel CCA.

Principle of Kernel Methods

Determine a metric which brings out features of the data so as to make subsequent computation easier.

[Figure: original space vs. data lifted into feature space.]

The data become linearly separable when using an RBF kernel and projecting onto the first 2 principal components of kernel PCA.

Principle of Kernel Methods (Recall)

Idea: send the data $X = \{x^i\}_{i=1}^M$, $x^i \in \mathbb{R}^N$, into a feature space $H$ through the nonlinear map $\phi$: $\phi(X) = \{\phi(x^1), \ldots, \phi(x^M)\}$. In feature space, perform the classical linear computation.

[Figure: original space mapped by $\phi$ into $H$, where the linear transformation is performed.]

Kernel CCA

Data: $X = \{x^i\}_{i=1}^M$, $x^i \in \mathbb{R}^N$, and $Y = \{y^i\}_{i=1}^M$, $y^i \in \mathbb{R}^q$.

Project into a feature space: $\phi_x(x^i)$ and $\phi_y(y^i)$, with $\sum_{i=1}^M \phi_x(x^i) = 0$ and $\sum_{i=1}^M \phi_y(y^i) = 0$ (centered features).

Construct the associated kernel matrices $K_x = F_x^T F_x$ and $K_y = F_y^T F_y$, where the columns of $F_x$, $F_y$ are the $\phi_x(x^i)$, $\phi_y(y^i)$.

The projection vectors can be expressed as linear combinations in feature space: $w_x = F_x \alpha$ and $w_y = F_y \beta$.

Kernel CCA

Kernel CCA then becomes the optimization problem:

$$\max_{\alpha, \beta} \frac{\alpha^T K_x K_y\, \beta}{\left(\alpha^T K_x^2\, \alpha\right)^{1/2} \left(\beta^T K_y^2\, \beta\right)^{1/2}} \quad \text{s.t.} \quad \alpha^T K_x^2\, \alpha = \beta^T K_y^2\, \beta = 1$$

In linear CCA, we were solving:

$$\max_{w_x, w_y} \frac{w_x^T C_{xy}\, w_y}{\sqrt{w_x^T C_x w_x}\ \sqrt{w_y^T C_y w_y}} \quad \text{s.t.} \quad w_x^T C_x w_x = w_y^T C_y w_y = 1$$

Kernel CCA

$$\max_{\alpha, \beta} \frac{\alpha^T K_x K_y\, \beta}{\left(\alpha^T K_x^2\, \alpha\right)^{1/2} \left(\beta^T K_y^2\, \beta\right)^{1/2}} \quad \text{s.t.} \quad \alpha^T K_x^2\, \alpha = \beta^T K_y^2\, \beta = 1$$

In practice, when the intersection between the spaces spanned by $K_x$ and $K_y$ is non-zero, the problem has a trivial solution, since a perfect correlation $\cos(K_x\alpha, K_y\beta) \approx 1$ can then be achieved.

Kernel CCA

Generalized eigenvalue problem:

$$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$

Add a regularization term to increase the rank of the matrix and make it invertible: replace $K_x^2$ by $\left(K_x + \frac{M\kappa}{2} I\right)^2$ (and similarly for $K_y$).

Several methods have been proposed to choose the regularization term carefully, so as to get projections that are as close as possible to the true projections.

Kernel CCA

$$\underbrace{\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}}_{A}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \underbrace{\begin{pmatrix} \left(K_x + \frac{M\kappa}{2} I\right)^2 & 0 \\ 0 & \left(K_y + \frac{M\kappa}{2} I\right)^2 \end{pmatrix}}_{B}\begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$

Set $B = C C^T$ (e.g. by Cholesky decomposition); the problem then becomes a classical eigenvalue problem for $C^{-1} A C^{-T}$.
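
A compact sketch of this regularized two-view kernel CCA with RBF kernels follows; the kernel widths, the regularization constant kappa and the toy paired data are illustrative assumptions, and for simplicity the generalized problem is handed directly to a generalized symmetric eigensolver rather than being reduced through a Cholesky factor.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def rbf_gram(X, width):
    """RBF Gram matrix; X is (M x d), rows are samples."""
    return np.exp(-cdist(X, X, 'sqeuclidean') / (2.0 * width**2))

def center_gram(K):
    """Center the data in feature space (zero-mean features)."""
    M = K.shape[0]
    H = np.eye(M) - np.ones((M, M)) / M
    return H @ K @ H

def kernel_cca(Kx, Ky, kappa=0.1):
    """Regularized two-view kernel CCA as a generalized eigenvalue problem."""
    M = Kx.shape[0]
    Z = np.zeros((M, M))
    Rx = Kx + (M * kappa / 2.0) * np.eye(M)        # regularized kernel matrices
    Ry = Ky + (M * kappa / 2.0) * np.eye(M)
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])     # left-hand-side matrix
    B = np.block([[Rx @ Rx, Z], [Z, Ry @ Ry]])     # regularized block-diagonal right-hand side
    vals, vecs = eigh(A, B)                        # generalized symmetric eigensolver
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:M, order], vecs[M:, order]   # correlations, alpha, beta

# Usage on toy paired data: two nonlinearly related views of the same angle t.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, 100)
X = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(100, 2))
Y = np.c_[t, np.cos(2 * t)] + 0.05 * rng.normal(size=(100, 2))
corr, alpha, beta = kernel_cca(center_gram(rbf_gram(X, 1.0)),
                               center_gram(rbf_gram(Y, 1.0)))
print(corr[:3])    # leading (regularized) canonical correlations
```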

Kernel CCA

The two-dataset case ($X = \{x^i\}_{i=1}^M$, $x^i \in \mathbb{R}^N$; $Y = \{y^i\}_{i=1}^M$, $y^i \in \mathbb{R}^q$) can be extended to multiple datasets: L datasets $X_1, \ldots, X_L$ with M observations each and dimensions $N_1, \ldots, N_L$, i.e. $X_i : N_i \times M$. Applying a non-linear transformation $\phi_i$ to each $X_i$, construct L Gram matrices $K_1, \ldots, K_L$.

Kernel CCA

The two-dataset case was formulated as the generalized eigenvalue problem:

$$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} \left(K_x + \frac{M\kappa}{2} I\right)^2 & 0 \\ 0 & \left(K_y + \frac{M\kappa}{2} I\right)^2 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$

It can be extended to multiple datasets:

$$\begin{pmatrix} 0 & K_1 K_2 & \cdots & K_1 K_L \\ K_2 K_1 & 0 & \cdots & K_2 K_L \\ \vdots & & \ddots & \vdots \\ K_L K_1 & K_L K_2 & \cdots & 0 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_L \end{pmatrix} = \lambda \begin{pmatrix} \left(K_1 + \frac{M\kappa}{2} I\right)^2 & & 0 \\ & \ddots & \\ 0 & & \left(K_L + \frac{M\kappa}{2} I\right)^2 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_L \end{pmatrix}$$

Kernel CCA

M. Kuss and T. Graepel, The Geometry of Kernel Canonical Correlation Analysis, Technical Report, Max Planck Institute, 2003.

Kernel CCA: Matlab Example. File: demo_cca-kcca.m

Applications of Kernel CCA

Goal: measure the correlation between heterogeneous datasets and extract sets of genes which share similarities with respect to multiple biological attributes. The kernel matrices K1, K2 and K3 correspond to gene-gene similarities in pathways, genome position, and microarray expression data respectively. An RBF kernel with fixed kernel width is used.

[Figure: correlation scores in MKCCA, pathway vs. genome vs. expression.]

Y. Yamanishi, J.-P. Vert, A. Nakaya, M. Kanehisa, Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis, Bioinformatics.

Applications of Kernel CCA

Goal: measure the correlation between heterogeneous datasets and extract sets of genes which share similarities with respect to multiple biological attributes.

[Figure: pairwise correlations between K1 and K2 and between K1 and K3; correlation scores in MKCCA, pathway vs. genome vs. expression.]

Two clusters correspond to genes close to each other with respect to their positions in the pathways, in the genome, and to their expression. A readout of the entries with equal projection onto the first canonical vectors gives the genes which belong to each cluster.

Y. Yamanishi, J.-P. Vert, A. Nakaya, M. Kanehisa, Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis, Bioinformatics.

Applications of Kernel CCA

Goal: construct appearance models for estimating an object's pose from raw brightness images. X: set of images; Y: pose parameters (pan and tilt angle of the object w.r.t. the camera, in degrees).

[Figure: example of two image datapoints with different poses.]

A linear kernel is used on X and an RBF kernel on Y, and the performance is compared to applying PCA to the (X, Y) dataset directly.

T. Melzer, M. Reiter and H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recognition 36 (2003).

Applications of Kernel CCA

Goal: construct appearance models for estimating an object's pose from raw brightness images.

[Figure: pose estimation error vs. testing/training ratio.]

Kernel CCA performs better than PCA, especially for a small testing/training ratio k (i.e., for larger training sets). The kernel CCA estimators tend to produce fewer outliers (gross errors) and consequently yield a smaller standard deviation of the pose estimation error than their PCA-based counterparts. For very small training sets, the performance of both approaches becomes similar.

T. Melzer, M. Reiter and H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recognition 36 (2003).

Kernel K-means and Spectral Clustering

Structure Discovery: Clustering

Groups pairs of points according to how similar they are. Density-based clustering methods (soft K-means, kernel K-means, Gaussian Mixture Models) compare the relative distributions.

K-means

K-means is a hard partitioning of the space into K clusters, equidistant according to the 2-norm. The distribution of data within each cluster is encapsulated in a sphere.

[Figure: three cluster centroids m1, m2, m3.]

K-means Algorithm

1. Initialization: pick K centroids $m^k$, $k = 1 \ldots K$.

K-means Algorithm: Iterative Method (variant of Expectation-Maximization)

2. Calculate the distance from each data point to each centroid: $d(x^j, m^k) = \|x^j - m^k\|_p$.
3. Assignment Step (E-step): assign the responsibility of each data point to its closest centroid, $k^j = \arg\min_k d(x^j, m^k)$. If a tie happens (i.e. two centroids are equidistant from a data point), assign the data point to the winning centroid with the smallest index.

K-means Algorithm: Iterative Method (variant of Expectation-Maximization)

2. Calculate the distance from each data point to each centroid: $d(x^j, m^k) = \|x^j - m^k\|_p$.
3. Assignment Step (E-step): assign the responsibility of each data point to its closest centroid, $k^j = \arg\min_k d(x^j, m^k)$, breaking ties towards the centroid with the smallest index.
4. Update Step (M-step): adjust the centroids to be the means of all data points assigned to them.
5. Go back to step 2 and repeat the process until the clusters are stable.
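
A short numpy sketch of this loop (steps 1-5) for the standard 2-norm case; the toy three-blob data, the iteration cap and the random initialization are illustrative choices.

```python
import numpy as np

def kmeans(X, K, n_iter=100, rng=None):
    """Plain K-means with the 2-norm; X is (M x N). Returns labels and centroids."""
    rng = np.random.default_rng(rng)
    centroids = X[rng.choice(len(X), K, replace=False)]    # 1. initialization
    for _ in range(n_iter):
        # 2-3. E-step: distance to every centroid, assign to the closest
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                          # ties -> smallest index
        # 4. M-step: centroids become the mean of their assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):          # 5. stop when stable
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 0], [0, 3])])
labels, centroids = kmeans(X, K=3, rng=0)
print(centroids)
```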

K-means Clustering: Weaknesses

Two hyperparameters: the number of clusters K and the power p of the metric. Very sensitive to the choice of the number of clusters K and to the initialization.

K-means Clustering: Hyperparameters

Two hyperparameters: the number of clusters K and the power p of the metric. The choice of the power determines the form of the decision boundaries.

[Figure: decision boundaries for p = 1, 2, 3, 4.]

Kernel K-means

The K-means algorithm consists of the minimization of:

$$J(m^1, \ldots, m^K) = \sum_{k=1}^K \sum_{x^j \in C_k} \left\|x^j - m^k\right\|^p, \qquad m^k = \frac{1}{|C_k|}\sum_{x^i \in C_k} x^i,$$

where $|C_k|$ is the number of datapoints in cluster k.

Projecting into a feature space:

$$J(m^1, \ldots, m^K) = \sum_{k=1}^K \sum_{x^j \in C_k} \left\|\phi(x^j) - m_\phi^k\right\|^2$$

We cannot observe the mean in feature space; we construct the mean in feature space using the images of the points in the same cluster.

Kernel K-means

$$J(m^1, \ldots, m^K) = \sum_{k=1}^K \sum_{x^j \in C_k} \left\|\phi(x^j) - m_\phi^k\right\|^2 = \sum_{k=1}^K \sum_{x^j \in C_k}\left( k(x^j, x^j) - \frac{2}{|C_k|}\sum_{x^l \in C_k} k(x^j, x^l) + \frac{1}{|C_k|^2}\sum_{x^i, x^l \in C_k} k(x^i, x^l)\right)$$

with the feature-space mean $m_\phi^k = \frac{1}{|C_k|}\sum_{x^i \in C_k} \phi(x^i)$.

Kernel K-means

The kernel K-means algorithm is also an iterative procedure:

1. Initialization: pick K clusters.
2. Assignment Step (E-step): assign each data point to its closest centroid (with ties broken towards the centroid with the smallest index), computing the distance in feature space:

$$\min_k d^2\!\left(\phi(x^i), m_\phi^k\right) = \min_k \left( k(x^i, x^i) - \frac{2}{|C_k|}\sum_{x^j \in C_k} k(x^i, x^j) + \frac{1}{|C_k|^2}\sum_{x^j, x^l \in C_k} k(x^j, x^l) \right)$$

3. Update Step (M-step): update the list of points belonging to each centroid.
4. Go back to step 2 and repeat the process until the clusters are stable.
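
Below is a sketch of this procedure operating directly on a precomputed Gram matrix (e.g. an RBF Gram matrix); the random initialization by labels and the empty-cluster guard are implementation choices, not part of the slides.

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=100, rng=None):
    """Kernel K-means on a precomputed Gram matrix K (M x M)."""
    M = K.shape[0]
    rng = np.random.default_rng(rng)
    labels = rng.integers(n_clusters, size=M)             # 1. random initial clusters
    for _ in range(n_iter):
        # 2. E-step: squared feature-space distance of each point to each cluster mean,
        #    d^2 = k(x,x) - 2/|C| sum_j k(x, x_j) + 1/|C|^2 sum_{j,l} k(x_j, x_l)
        d2 = np.zeros((M, n_clusters))
        for k in range(n_clusters):
            mask = labels == k
            nk = max(mask.sum(), 1)                        # guard against empty clusters
            d2[:, k] = (np.diag(K)
                        - 2.0 * K[:, mask].sum(axis=1) / nk
                        + K[np.ix_(mask, mask)].sum() / nk**2)
        new_labels = d2.argmin(axis=1)                     # ties -> smallest index
        if np.array_equal(new_labels, labels):             # 4. stop when stable
            break
        labels = new_labels                                # 3. M-step: update memberships
    return labels

# Usage (hypothetical): labels = kernel_kmeans(rbf_gram_matrix, n_clusters=2, rng=0)
```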

Kernel K-means

$$d^2\!\left(\phi(x^i), m_\phi^k\right) = k(x^i, x^i) - \frac{2}{|C_k|}\sum_{x^j \in C_k} k(x^i, x^j) + \frac{1}{|C_k|^2}\sum_{x^j, x^l \in C_k} k(x^j, x^l)$$

With an RBF kernel: the first term is a constant of value 1; if $x^i$ is close to all points in cluster k, the second sum is close to 1; if the points are well grouped in cluster k, the third sum is close to 1.

What about a homogeneous polynomial kernel?

Kernel K-means

$$d^2\!\left(\phi(x^i), m_\phi^k\right) = k(x^i, x^i) - \frac{2}{|C_k|}\sum_{x^j \in C_k} k(x^i, x^j) + \frac{1}{|C_k|^2}\sum_{x^j, x^l \in C_k} k(x^j, x^l)$$

With a polynomial kernel: the first term is a positive value; some of the other terms change sign depending on their position with respect to the origin. If the points are aligned in the same quadrant, the sum is maximal.

Kernel K-means: examples. RBF kernel, 2 clusters.

Kernel K-means: examples. RBF kernel, 2 clusters.

Kernel K-means: examples. RBF kernel, 2 clusters. [Figure: kernel width 0.5 vs. kernel width 0.05.]

Kernel K-means: examples. Polynomial kernel, 2 clusters.

Kernel K-means: examples. Polynomial kernel (p=8), 2 clusters.

Kernel K-means: examples. Polynomial kernel, 2 clusters. [Figure: polynomial kernels of order 2, 4 and 6.]

Kernel K-means: examples. Polynomial kernel, 2 clusters.

The separating line will always be perpendicular to the line passing through the origin (which is located at the mean of the datapoints) and parallel to the axis of the ordinates (because of the change of sign of the cosine in the inner product). No better than linear K-means!

Kernel K-means: examples. Polynomial kernel, 4 clusters.

[Figure: solutions found with kernel K-means vs. solutions found with K-means.]

It can only group datapoints that do not overlap across quadrants with respect to the origin (careful: the data are centered!). No better than linear K-means, except that it is less sensitive to random initialization.

Kernel K-means: Limitations. The choice of the number of clusters in kernel K-means is important.

Kernel K-means: Limitations. The choice of the number of clusters in kernel K-means is important.

Kernel K-means: Limitations. The choice of the number of clusters in kernel K-means is important.

Limitations of kernel K-means: Raw Data

Limitations of kernel K-means: kernel K-means with K=3, RBF kernel

From Non-Linear Manifolds (Laplacian Eigenmaps, Isomap) to Spectral Clustering

Non-Linear Manifolds

PCA and kernel PCA belong to a more general class of methods that create non-linear manifolds based on spectral decomposition. (Spectral decomposition of matrices is more frequently referred to as an eigenvalue decomposition.) Depending on which matrix we decompose, we get a different set of projections:
- PCA decomposes the covariance matrix of the dataset: it generates a rotation and projection in the original space.
- Kernel PCA decomposes the Gram matrix: it partitions or regroups the datapoints.
- The Laplacian matrix is a matrix representation of a graph; its spectral decomposition can be used for clustering.

Embed Data in a Graph

[Figure: original dataset and its graph representation.]

Build a similarity graph: each vertex of the graph is a datapoint.

Measure Distances in the Graph

Construct the similarity matrix S, which denotes whether points are close or far away, to weight the edges of the graph.

Disconnected Graphs

Disconnected graph: two data points are connected if a) the similarity between them is higher than a threshold, or b) they are k-nearest neighbors (according to the similarity metric).

Graph Laplacian

Given the similarity matrix S, construct the diagonal matrix D composed of the sum of each row of S:

$$D = \begin{pmatrix} \sum_i S_{1i} & 0 & \cdots & 0 \\ 0 & \sum_i S_{2i} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sum_i S_{Mi} \end{pmatrix}$$

and then build the Laplacian matrix $L = D - S$. L is positive semi-definite, so a spectral decomposition is possible.

Graph Laplacian

Eigenvalue decomposition of the Laplacian matrix: $L = U \Lambda U^T$. All eigenvalues of L are non-negative and the smallest eigenvalue of L is zero. If we order the eigenvalues by increasing order: $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_M$. If the graph has k connected components, then the eigenvalue $\lambda = 0$ has multiplicity k.

Spectral Clustering

The multiplicity of the eigenvalue 0 determines the number of connected components in the graph, and the associated eigenvectors identify these connected components: for an eigenvalue $\lambda_i = 0$, the corresponding eigenvector $e^i$ takes the same value for all vertices within a component and a different value for each of the other components. Identifying the clusters is then trivial when the similarity matrix is composed of zeros and ones (as when using k-nearest neighbors). What happens when the similarity matrix is full?
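
The claim about the multiplicity of the zero eigenvalue can be checked numerically on a tiny hand-built graph (a triangle plus an isolated edge); this toy graph is an illustration, not taken from the slides.

```python
import numpy as np

# Binary similarity matrix of a 5-vertex graph with two connected components:
# vertices {0,1,2} form a triangle, vertices {3,4} form an isolated edge.
S = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

D = np.diag(S.sum(axis=1))           # degree matrix (row sums of S)
L = D - S                            # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)

# Eigenvalues are approximately [0, 0, 2, 3, 3]: eigenvalue 0 has multiplicity 2,
# matching the two connected components (up to numerical precision).
print(np.round(eigvals, 6))

# Any basis of the zero-eigenvalue eigenspace is piecewise constant per component,
# so equal entries in these eigenvectors identify the clusters.
print(np.round(eigvecs[:, :2], 3))
```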

Spectral Clustering

The similarity map $S : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ can either be binary (k-nearest neighbors) or continuous, with a Gaussian kernel:

$$S(x^i, x^j) = e^{-\frac{\|x^i - x^j\|^2}{2\sigma^2}}$$

Spectral Clustering

1) Build the Laplacian matrix $L = D - S$.
2) Do the eigenvalue decomposition of the Laplacian matrix: $L = U \Lambda U^T$.
3) Order the eigenvalues by increasing order: $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_M$.

The first eigenvalue is still zero, but now with multiplicity 1 only (fully connected graph)! Idea: the smallest eigenvalues are close to zero and hence also provide information on the partitioning of the graph (see exercise session). As before, the similarity matrix S can be binary (k-nearest neighbors) or continuous with a Gaussian kernel.

Spectral Clustering

Eigenvalue decomposition of the Laplacian matrix: $L = U \Lambda U^T$, with $U = \left(e^1\ e^2\ \ldots\ e^M\right)$.

Construct an embedding of each of the $i = 1 \ldots M$ datapoints through $y^i = \left(e^1_i, \ldots, e^K_i\right)$, i.e. reduce the dimensionality by picking only the K < M first projections.

With a clear partitioning of the graph, the entries in $y$ are split into sets of equal values. Each group of points with the same value belongs to the same partition (cluster).

Spectral Clustering

Eigenvalue decomposition of the Laplacian matrix: $L = U \Lambda U^T$, with $U = \left(e^1\ e^2\ \ldots\ e^M\right)$. Construct an embedding of each of the $i = 1 \ldots M$ datapoints through $y^i = \left(e^1_i, \ldots, e^K_i\right)$, picking only the K < M first projections.

When we have a fully connected graph, the entries in $y$ can take any real value.

Spectral Clustering

Example: 3 datapoints in a graph composed of 2 partitions (the first two datapoints are connected; the third forms its own component). L has eigenvalue $\lambda = 0$ with multiplicity two. Several choices are possible for the two associated eigenvectors (the null space is two-dimensional), but for any such choice the entries of the eigenvectors are equal for the two first datapoints, which therefore belong to the same partition.

Spectral Clustering

Example: 3 datapoints in a fully connected graph. L has eigenvalue $\lambda = 0$ with multiplicity 1. The second eigenvalue is small (0.04), whereas the 3rd one is large. The entries in the 2nd eigenvector for the two first datapoints are almost equal, so the first two points have almost the same coordinates in the embedding. Reduce the dimensionality by considering the smallest eigenvalues.

Spectral Clustering

Example: 3 datapoints in a fully connected graph. L has eigenvalue $\lambda = 0$ with multiplicity 1. The second and third eigenvalues are now both large (≈ 2.23): the 3rd point is now closer to the two other points, the entries in the 2nd eigenvector for the two first datapoints are no longer equal, and the first two points no longer have the same coordinates in the embedding.

Spectral Clustering

Idea: points close to one another have almost the same coordinates on the eigenvectors of L with small eigenvalues.

Step 1: Do an eigenvalue decomposition of the Laplacian matrix L and project the datapoints onto the first K eigenvectors with smallest eigenvalues (hence reducing the dimensionality).

Spectral Clustering

Step 2: Perform K-means on the set of embedded vectors $y^1, \ldots, y^M$; cluster the datapoints according to their clustering in the embedding space.
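
Putting steps 1 and 2 together, here is a compact numpy/scikit-learn sketch of this unnormalized spectral clustering pipeline; the Gaussian-similarity graph, the kernel width and the two-rings toy data are illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters, width=1.0):
    """Unnormalized spectral clustering: Gaussian similarity -> Laplacian
    -> embed on the eigenvectors with smallest eigenvalues -> K-means."""
    S = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * width**2))   # full similarity graph
    np.fill_diagonal(S, 0.0)
    L = np.diag(S.sum(axis=1)) - S                              # L = D - S
    eigvals, eigvecs = np.linalg.eigh(L)                        # ascending eigenvalues
    Y = eigvecs[:, :n_clusters]                                 # step 1: embedding y^i
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Y)  # step 2: K-means on y^i

# Toy usage: two concentric rings, which plain K-means cannot separate.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[np.full(100, 1.0), np.full(100, 3.0)] + 0.05 * rng.normal(size=200)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
labels = spectral_clustering(X, n_clusters=2, width=0.5)   # expected to split the rings
```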

Equivalence to other non-linear Embeddings

Spectral decomposition of the similarity matrix (which is already positive semi-definite): the embedding weights each eigenvector by its eigenvalue, $y^i = \left(\lambda_1 e^1_i, \ldots, \lambda_K e^K_i\right)$. In Isomap, the embedding is normalized by the eigenvalues and uses the geodesic distance to build the similarity matrix; see the supplementary material.

Laplacian Eigenmaps

Solve the generalized eigenvalue problem $L y = \lambda D y$. This is the solution to the optimization problem $\min_y y^T L y$ such that $y^T D y = 1$, which ensures minimal distortion while preventing arbitrary scaling.

[Figure: Swiss-roll example, projections on each Laplacian eigenvector. Image courtesy of A. Singh.]

The vectors $y^i$, $i = 1 \ldots M$, form an embedding of the datapoints.

Equivalence to other non-linear Embeddings

Kernel PCA: eigenvalue decomposition of the similarity matrix S, $S = U D U^T$. The choice of parameters in kernel K-means can be initialized by doing a readout of the Gram matrix after kernel PCA.

Kernel K-means and Kernel PCA

The optimization problem of kernel K-means is equivalent to:

$$\max_H\ \mathrm{tr}\left(H^T K H\right), \qquad H = Y D^{-\frac{1}{2}}$$

(see the paper by M. Welling, supplementary document on the website), where $Y : M \times K$ has entry 1 if the datapoint belongs to cluster k and 0 otherwise, and $D : K \times K$ is diagonal, with the number of datapoints in cluster k as the diagonal element.

Since $\mathrm{tr}\left(H^T K H\right) \le \sum_{i=1}^K \lambda_i$, with $\lambda_1, \ldots, \lambda_M$ the eigenvalues resulting from the eigenvalue decomposition of the Gram matrix, look at the eigenvalues to determine the optimal number of clusters.

Kernel PCA projections can also help determine the kernel width. [Figure, from top to bottom: kernel widths of 0.8, 1.5, 2.5.]

Kernel PCA projections can help determine the kernel width: the sum of the eigenvalues grows as we get a better clustering.
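
A small sketch of this readout, computing the spectrum of the centered RBF Gram matrix for a few candidate kernel widths; the helper name and the example widths (echoing the values shown on the previous slide) are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gram_spectrum(X, width):
    """Eigenvalues of the centered RBF Gram matrix (the kernel PCA spectrum)."""
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * width**2))
    M = K.shape[0]
    H = np.eye(M) - np.ones((M, M)) / M            # centering in feature space
    return np.linalg.eigvalsh(H @ K @ H)[::-1]     # descending eigenvalues

# Read out the spectrum for a few candidate kernel widths: a clear gap after the
# first K eigenvalues hints at K clusters, and a larger sum of the leading
# eigenvalues indicates a better clustering structure (X is assumed given).
# for w in (0.8, 1.5, 2.5):
#     ev = gram_spectrum(X, w)
#     print(w, ev[:5], ev[:3].sum())
```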

Quick Recap of the Gaussian Mixture Model

Clustering with Mixtures of Gaussians

An alternative to K-means: soft partitioning with elliptic clusters instead of spheres.

[Figure: clustering with mixtures of Gaussians using spherical Gaussians (left) and non-spherical Gaussians, i.e. with full covariance matrices (right). Notice how the clusters become elongated along the direction of the data; the grey circles represent the first and second variances of the distributions.]

Gaussian Mixture Model (GMM)

Using a set of M N-dimensional training datapoints $X = \{x^j\}_{j=1}^M$, $x^j \in \mathbb{R}^N$, the pdf of X is modeled through a mixture of K Gaussians:

$$p(x) = \sum_{i=1}^K \alpha_i\, p\!\left(x \mid \mu^i, \Sigma^i\right), \qquad p\!\left(x \mid \mu^i, \Sigma^i\right) = \mathcal{N}\!\left(x;\, \mu^i, \Sigma^i\right),$$

where $\mu^i, \Sigma^i$ are the mean and covariance matrix of Gaussian i, and the mixing coefficients satisfy $\sum_{i=1}^K \alpha_i = 1$.

The probability that the data was explained by Gaussian i:

$$p(i) = \frac{1}{M}\sum_{j=1}^M p\!\left(i \mid x^j\right)$$

Gaussian Mixture Modeling

The parameters of a GMM are the means, covariance matrices and prior pdfs: $\left\{\mu^1, \ldots, \mu^K,\ \Sigma^1, \ldots, \Sigma^K,\ \alpha_1, \ldots, \alpha_K\right\}$.

Estimation of all the parameters can be done through Expectation-Maximization (EM). EM tries to find the optimum of the likelihood of the model given the data, i.e.:

$$\max_\Theta\ L(\Theta \mid X) = \max_\Theta\ p(X \mid \Theta)$$

See the lecture notes for details.
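
For reference, a minimal scikit-learn sketch of fitting such a mixture by EM; the two-blob toy data, K=2 and the number of restarts are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a GMM with full covariance matrices by Expectation-Maximization
# on toy two-blob data (the data and K=2 are illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], 150),
               rng.multivariate_normal([4, 1], [[0.5, 0.0], [0.0, 0.5]], 150)])

gmm = GaussianMixture(n_components=2, covariance_type='full', n_init=5).fit(X)

print(gmm.weights_)           # mixing coefficients alpha_i
print(gmm.means_)             # means mu_i
print(gmm.covariances_)       # covariance matrices Sigma_i
resp = gmm.predict_proba(X)   # p(i | x^j): soft responsibilities from the E-step
```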

Gaussian Mixture Model

Gaussian Mixture Model: GMM using 4 Gaussians with random initialization.

Gaussian Mixture Model: Expectation-Maximization is very sensitive to initial conditions. GMM using 4 Gaussians with a new random initialization.

Gaussian Mixture Model: very sensitive to the choice of the number of Gaussians. The number of Gaussians can be optimized iteratively using AIC or BIC, as for K-means. Here, a GMM using 8 Gaussians.

Evaluation of Clustering Methods

Evaluation of Clustering Methods

Clustering methods rely on hyperparameters (number of clusters, kernel parameters); we need to determine the goodness of these choices. Clustering is unsupervised classification: we do not know the real number of clusters nor the data labels, so it is difficult to evaluate these choices without ground truth.

Evaluation of Clustering Methods

Two types of measures: internal versus external measures. Internal measures rely on a measure of similarity (e.g. intra-cluster distance versus inter-cluster distance). For example, the Residual Sum of Squares is an internal measure (available in mldemos); it gives the squared distance of each vector from its centroid, summed over all vectors:

$$\mathrm{RSS} = \sum_{k=1}^K \sum_{x \in C_k} \left\|x - m^k\right\|^2$$

Internal measures are problematic, as the metric of similarity is often already optimized by the clustering algorithm.

Evaluation of Clustering Methods

K-means, soft K-means and GMM have several hyperparameters (fixed number of clusters, beta, number of Gaussian functions). We need a measure to determine how well the choice of hyperparameters fits the dataset (a maximum-likelihood measure). With X the dataset, M the number of datapoints and B the number of free parameters:

- Akaike Information Criterion: $\mathrm{AIC} = -2 \ln L + 2B$
- Bayesian Information Criterion: $\mathrm{BIC} = -2 \ln L + B \ln M$

where L is the maximum likelihood of the model given the B parameters; the $2B$ and $B \ln M$ terms penalize the increase in model complexity and hence in computational cost.

Choosing AIC versus BIC depends on the application: is the purpose of the analysis to make predictions, or to decide which model best represents reality? AIC may have better predictive ability than BIC, but BIC finds a computationally more efficient solution. A lower BIC implies either fewer explanatory variables, a better fit, or both. As the number of datapoints (observations) increases, BIC assigns more weight to simpler models than AIC.
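
A minimal scikit-learn sketch of using these criteria to pick the number of Gaussians (GaussianMixture exposes aic() and bic() directly); the search range and random seed are illustrative.

```python
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_max=10):
    """Pick the number of Gaussians by the lowest BIC (AIC reported for comparison)."""
    scores = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, covariance_type='full',
                              n_init=5, random_state=0).fit(X)
        # BIC = -2 ln L + B ln M, AIC = -2 ln L + 2B
        scores.append((k, gmm.bic(X), gmm.aic(X)))
    best_k = min(scores, key=lambda s: s[1])[0]      # lowest BIC wins
    return best_k, scores

# Usage (X assumed given): best_k, scores = select_k_by_bic(X)
```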

Evaluation of Clustering Methods

Two types of measures: internal versus external measures. External measures assume that a subset of datapoints has class labels and measure how well these datapoints are clustered. This requires having an idea of the classes and having labeled some datapoints; it is interesting mainly in cases where labeling is highly time-consuming because the dataset is very large (e.g. in speech recognition).

Evaluation of Clustering Methods: Raw Data

Semi-Supervised Learning

Clustering F-Measure (careful: similar to, but not the same as, the F-measure we will see for classification!). It trades off clustering all datapoints of the same class correctly into the same cluster against making sure that each cluster contains points of only one class.

Notation: M is the number of datapoints, C the set of classes, K the number of clusters, $n_{c_i}$ the number of members of class $c_i$, $n_k$ the number of members of cluster k, and $n_{c_i k}$ the number of members of both class $c_i$ and cluster k.

$$F(C, K) = \sum_{c_i \in C} \frac{n_{c_i}}{M} \max_k F(c_i, k), \qquad F(c_i, k) = \frac{2\, R(c_i, k)\, P(c_i, k)}{R(c_i, k) + P(c_i, k)}, \qquad R(c_i, k) = \frac{n_{c_i k}}{n_{c_i}}, \quad P(c_i, k) = \frac{n_{c_i k}}{n_k}$$

The measure picks for each class the cluster with the maximal number of its datapoints. Recall: proportion of the class's datapoints correctly clustered. Precision: proportion of datapoints of the same class in the cluster.
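
A short numpy sketch of this measure evaluated on a labeled subset; the function and variable names are illustrative.

```python
import numpy as np

def clustering_f_measure(class_labels, cluster_labels):
    """Clustering F-measure: for each class take the best-matching cluster,
    then average the per-class F1 weighted by class size."""
    class_labels = np.asarray(class_labels)
    cluster_labels = np.asarray(cluster_labels)
    M = len(class_labels)
    total = 0.0
    for c in np.unique(class_labels):
        in_class = class_labels == c
        n_c = in_class.sum()
        best = 0.0
        for k in np.unique(cluster_labels):
            in_cluster = cluster_labels == k
            n_ck = np.logical_and(in_class, in_cluster).sum()
            if n_ck == 0:
                continue
            recall = n_ck / n_c                     # fraction of class c captured by cluster k
            precision = n_ck / in_cluster.sum()     # purity of cluster k w.r.t. class c
            best = max(best, 2 * recall * precision / (recall + precision))
        total += (n_c / M) * best
    return total

# Usage on a labeled subset (semi-supervised evaluation):
# f = clustering_f_measure(true_labels_subset, predicted_clusters_subset)
```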

Evaluation of Clustering Methods

RSS with K-means can find the true optimal number of clusters, but it is very sensitive to random initialization (left and right: two different runs). RSS finds an optimum for K=4 and for K=5 in the right run.

Evaluation of Clustering Methods: BIC (left) and AIC (right) perform very poorly here, splitting some clusters into two halves.

Evaluation of Clustering Methods: BIC (left) and AIC (right) perform much better at picking the right number of clusters in GMM.

Evaluation of Clustering Methods: Raw Data

Evaluation of Clustering Methods: optimization with BIC using K-means

Evaluation of Clustering Methods: optimization with AIC using K-means. AIC tends to find more clusters.

Evaluation of Clustering Methods: Raw Data

Evaluation of Clustering Methods: optimization with AIC using kernel K-means with an RBF kernel

Evaluation of Clustering Methods: optimization with BIC using kernel K-means with an RBF kernel

Semi-Supervised Learning: Raw Data, 3 classes

Semi-Supervised Learning: clustering with RBF kernel K-means after optimization with BIC

Semi-Supervised Learning: after semi-supervised learning

Summary

We have seen several methods for extracting structure in data using the notion of kernel, and discussed their similarities, differences and complementarity:
- Kernel CCA is a generalization of kernel PCA to determine partial groupings in different dimensions.
- Kernel PCA can be used to bootstrap the choice of hyperparameters in kernel K-means.
- We have compared the geometrical division of space yielded by RBF and polynomial kernels.
- We have seen that techniques simpler than kernel K-means, such as K-means with the p-norm and mixtures of Gaussians, can yield complex non-linear clustering.

Summary

When to use what? E.g. K-means versus kernel K-means versus GMM. When using any machine learning algorithm, you have to balance a number of factors:
- Computing time at training and at testing
- Number of open parameters
- Curse of dimensionality (order of growth with the number of datapoints, the dimension, etc.)
- Robustness to initial conditions, optimality of the solution
