Statistical Learning. Dong Liu, Dept. EEIS, USTC

Chapter 6. Unsupervised and Semi-Supervised Learning
1. Unsupervised learning
2. k-means
3. Gaussian mixture model
4. Other approaches to clustering
5. Principal component analysis
6. Other approaches to dimensionality reduction
7. Semi-supervised learning

Section 6.1 Unsupervised Learning

What is unsupervised learning? Supervised learning aims to identify relations within data, in order to solve predictive tasks. Unsupervised learning aims to discover patterns in data, in order to solve descriptive tasks. Which patterns? Association analysis, clustering, anomaly detection, dimensionality reduction, ...

Association analysis. Input: data with multiple attributes. Output: which attributes are associated, i.e. frequently co-occur? Example: mining shopping carts, recorded as a goods-by-customers table (rows Coffee, Tea, Milk, Beer, Diaper, Aspirin; columns customers A, B, C, D, E, ...; a 1 marks a purchase), from which we look for association rules such as r(Milk, Coffee), r(Milk, Tea), r(Beer, Diaper), ... Challenge: computational efficiency.
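To make the co-occurrence idea concrete, here is a small sketch (not from the lecture) that counts how often pairs of goods appear together in shopping carts and reports their support; the cart contents are made-up example data.

```python
# Toy association analysis: count pairwise co-occurrences in hypothetical carts.
from itertools import combinations
from collections import Counter

carts = [  # made-up example transactions
    {"Milk", "Coffee", "Beer", "Diaper"},
    {"Milk", "Tea"},
    {"Beer", "Diaper", "Aspirin"},
    {"Milk", "Coffee"},
]

pair_counts = Counter()
for cart in carts:
    for a, b in combinations(sorted(cart), 2):
        pair_counts[(a, b)] += 1

# Support of a pair = fraction of carts containing both items
for pair, cnt in pair_counts.most_common(3):
    print(pair, "support =", cnt / len(carts))
```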

Clustering. Input: data (with no class labels). Output: which data points belong to the same cluster?

Examples of clustering. Market segmentation: divide customers into clusters. Document clustering: divide the documents retrieved for a query (e.g. Amazon) into clusters. Image segmentation: divide pixels into clusters.

Anomaly detection. Input: data (with no class labels). Output: which data are normal and which are abnormal? Example: detecting credit card fraud. Given credit card transactions, try to detect which are normal and which are fraudulent. Also useful in supervised learning: remove outliers to clean the data.

Dimensionality reduction. Input: high-dimensional data (with many attributes). Output: a low-dimensional representation (with few attributes). Often used as a data preprocessing step. Techniques: in a broad sense, feature extraction, feature selection, etc.; in a narrow sense, a transform that reduces dimension. Benefits: reduce the computational cost, alleviate noise and irrelevant attributes, avoid the curse of dimensionality.

Curse of dimensionality 1/2. Many statistical learning methods depend on a distance measure, but distances become less discriminative in high-dimensional spaces. Experiment: randomly place several points in a hypercube in a high-dimensional space, measure the Euclidean distances between the points, and compare the maximum and minimum distances.
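This claim is easy to check numerically. The following sketch (assuming NumPy and SciPy are available) samples points uniformly in the unit hypercube and reports the ratio of the maximum to the minimum pairwise distance, which shrinks toward 1 as the dimension grows.

```python
# Curse of dimensionality: the max/min pairwise distance ratio shrinks toward 1 as D grows.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for D in [2, 10, 100, 1000]:
    X = rng.random((500, D))        # 500 points drawn uniformly in the D-dimensional unit hypercube
    d = pdist(X)                    # all pairwise Euclidean distances
    print(f"D={D:5d}  max/min distance ratio = {d.max() / d.min():.2f}")
```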

Curse of dimensionality 2/2. Nearest neighbors are all far away: for a point in a $D$-dimensional space, its nearest neighbors (within distance $r$) lie in the outer spherical shell (between radius $\alpha r$ and $r$) with probability $1 - \alpha^D$, which approaches 1 as $D$ grows. Why? Because the volume of a high-dimensional space is so huge that samples are very sparse in it. Dimensionality reduction can find a low-dimensional structure (manifold) in the high-dimensional space and unfold this inherent structure.

Section 6.2 k-means

Prototype-based clustering. Each cluster has a prototype, and the distance to the prototypes decides cluster membership. k-means is the most well-known representative.

k-means algorithm. Input: dataset $\{x_1, \dots, x_N\}$, number of clusters $k$. Output: cluster assignments $q(x_i) \in \{1, \dots, k\}$.
1: Initialize $k$ centroids $\{c_1, \dots, c_k\}$
2: repeat
3:   for $i = 1, \dots, N$ do
4:     $q(x_i) \leftarrow \arg\min_j \|x_i - c_j\|$
5:   end for
6:   for $j = 1, \dots, k$ do
7:     $c_j \leftarrow \mathrm{mean}(\{x_i : q(x_i) = j\})$
8:   end for
9: until the centroids do not change
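A minimal NumPy sketch of the algorithm above (not the course's reference implementation); it initializes the centroids by sampling from the dataset and stops when the centroids no longer change.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: X is (N, D); returns a cluster index per point and the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialize from the data
    for _ in range(max_iter):
        # Assignment step: nearest centroid in Euclidean distance
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```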

Illustration of k-means 1/4. Initialize $k$ centroids. Usually we select the centroids from the dataset. We will see that the initial centroids are crucial for the final result. Is it wise to select scattered initial centroids?

Illustration of k-means 2/4. Assign data points to clusters. Complexity $O(kN)$. Here we measure Euclidean distance; can another distance metric be used?

Illustration of k-means 3/4. Update the cluster centroids. Complexity $O(N + k)$. Here we use the (arithmetic) mean; can another method be used?

Illustration of k-means 4/4. Repeat until the centroids do not change. Will k-means converge? It does: each iteration never increases the objective, so the algorithm terminates, though possibly at a local optimum (see the interpretation below).

Interpretation of k-means. k-means was first proposed for vector quantization, an extension of scalar quantization to Euclidean space; the centroids are known as codewords, which constitute the codebook. k-means solves $\min_{q, \{c_j\}} \sum_i \|x_i - c_{q(x_i)}\|^2$. Heuristically, k-means updates $q$ and $\{c_j\}$ alternately; it is greedy and cannot ensure a (global) optimum.

k-means finds a local minimum. Case 1 versus Case 2: in Case 2 we select more scattered initial centroids, yet the result is worse. In practice, we often run k-means multiple times with different initializations and choose the best result.

Limitation of k-means: Outliers

Limitation of k-means: Different-sized clusters. Left: ideal clusters; Right: k-means results.

Limitation of k-means: Clusters with different densities. Left: ideal clusters; Right: k-means results.

Limitation of k-means: Clusters having irregular shapes. Left: ideal clusters; Right: k-means results.

A remedy for k-means: Over-segmentation and post-processing. Left: for clusters with different densities; Right: for clusters having irregular shapes.

Section 6.3 Gaussian mixture model

Distribution-based clustering. Each cluster corresponds to a (unimodal) distribution, and posterior probabilities are calculated to decide cluster membership. The Gaussian mixture model is the most well-known representative.

Gaussian mixture model. A mixture model is a combination of multiple unimodal distributions. In a Gaussian mixture model (GMM), each component is a Gaussian. One-dimensional case: $p(x) = \sum_{j=1}^k w_j \mathcal{N}(x \mid \mu_j, \sigma_j^2)$, where $\sum_j w_j = 1$. Multi-dimensional case: $p(x) = \sum_{j=1}^k w_j \mathcal{N}(x \mid \mu_j, \Sigma_j)$.

GMM for clustering. Assume we know the parameters of the GMM; then we can calculate the posterior (responsibility) as $p(q(x_i) = j) = \gamma_{ij} = \frac{w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{j'=1}^k w_{j'} \mathcal{N}(x_i \mid \mu_{j'}, \Sigma_{j'})}$, and then $q(x_i) = \arg\max_j \gamma_{ij}$. Now the problem is how to estimate the GMM parameters $\vartheta = \{w_j, \mu_j, \Sigma_j \mid j = 1, \dots, k\}$, and we consider maximum likelihood estimation: $\hat\vartheta = \arg\max_\vartheta \prod_{i=1}^N p(x_i \mid \vartheta)$.

Intuitive solution 1/3. We want to maximize $\prod_{i=1}^N \sum_{j=1}^k w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$, which is quite difficult directly. Can we borrow the idea of k-means? First, we initialize the parameters.

Intuitive solution 2/3. We want to maximize $\prod_{i=1}^N \sum_{j=1}^k w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$, which is quite difficult directly. Can we borrow the idea of k-means? Second, we calculate the responsibilities (i.e. "assign data to clusters").

Intuitive solution 3/3. We want to maximize $\prod_{i=1}^N \sum_{j=1}^k w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$, which is quite difficult directly. Can we borrow the idea of k-means? Third, we update the parameters. Note that $p(q(x_i) = j) = \gamma_{ij}$, so it is intuitive to set $w_j = \frac{\sum_i \gamma_{ij}}{N}$, $\mu_j = \frac{\sum_i \gamma_{ij} x_i}{\sum_i \gamma_{ij}}$, $\Sigma_j = \frac{\sum_i \gamma_{ij} (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_i \gamma_{ij}}$. The responsibility step and the update step are executed alternately until convergence.
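The alternating scheme above is exactly EM for a GMM (formalized in the next slides). A compact NumPy/SciPy sketch, under the usual assumptions (full covariances, a small ridge added to each covariance for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=50, seed=0):
    """EM for a Gaussian mixture: returns weights w, means mu, covariances Sigma, responsibilities gamma."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)
    mu = X[rng.choice(N, size=k, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * k)
    for _ in range(n_iter):
        # E-step: gamma_ij = w_j N(x_i | mu_j, Sigma_j) / sum_j' w_j' N(x_i | mu_j', Sigma_j')
        gamma = np.stack([w[j] * multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(k)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate w_j, mu_j, Sigma_j from the responsibilities
        Nj = gamma.sum(axis=0)
        w = Nj / N
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return w, mu, Sigma, gamma
```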

Example of GMM result

k-means is a special case of GMM. In this special case, we set $w_j \equiv \frac{1}{k}$, $\Sigma_j \equiv \alpha I$, and we quantize $\gamma_{ij}$ to 0 or 1. Initializing the parameters = initializing the $\mu_j$'s; calculating the responsibilities = finding the nearest $\mu_j$ and setting that $\gamma_{ij} = 1$; updating the parameters = updating the $\mu_j$'s as means. This explains why k-means has limitations for clusters having different sizes, different densities, or irregular shapes.

Example of comparison between GMM and k-means

Interpretation: Expectation maximization. Introduce latent variables $z_i \in \{1, \dots, k\}$ representing the true cluster that generates $x_i$. Consider maximizing $\prod_i p(x_i, z_i \mid \vartheta) = \prod_i \prod_j (w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j))^{I(z_i = j)}$, or equivalently $\sum_i \sum_j I(z_i = j) \log(w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j))$. Expectation maximization (EM) executes two steps alternately: 1. E-step: given $\vartheta^t$, calculate the expectation of the objective function, eliminating the latent variables; since $\gamma_{ij}$ is the expectation of $I(z_i = j)$, we obtain $\sum_i \sum_j \gamma_{ij} \log(w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j))$. 2. M-step: maximize this expectation to find $\vartheta^{t+1}$; using the constraint $\sum_j w_j = 1$, we can derive the GMM update equations.

EM as an algorithm. Input: $p(X, Z \mid \vartheta)$, with $X = \{x_1, \dots, x_N\}$ observed and $Z$ unobserved. Output: $\hat\vartheta$.
1: $t \leftarrow 0$, initialize $\vartheta^0$
2: repeat
3:   Given $\vartheta^t$, calculate the expectation of $\log p(X, Z \mid \vartheta)$ with $Z$ eliminated; denote this expectation by $Q(\vartheta, \vartheta^t)$
4:   $\vartheta^{t+1} \leftarrow \arg\max_\vartheta Q(\vartheta, \vartheta^t)$
5: until $\vartheta^{t+1}$ is similar to $\vartheta^t$
6: $\hat\vartheta = \vartheta^{t+1}$

Is EM correct? Consider $\log p(X \mid \vartheta) = \sum_Z p(Z \mid X, \vartheta^t) \log p(X \mid \vartheta) = \sum_Z p(Z \mid X, \vartheta^t) \log p(X, Z \mid \vartheta) - \sum_Z p(Z \mid X, \vartheta^t) \log p(Z \mid X, \vartheta) = Q(\vartheta, \vartheta^t) + H(\vartheta, \vartheta^t)$. The first term is exactly $Q(\vartheta, \vartheta^t)$; since $\vartheta^{t+1} = \arg\max_\vartheta Q(\vartheta, \vartheta^t)$, we know $Q(\vartheta^{t+1}, \vartheta^t) \ge Q(\vartheta^t, \vartheta^t)$. The second term $H(\vartheta, \vartheta^t) = -\sum_Z p(Z \mid X, \vartheta^t) \log p(Z \mid X, \vartheta)$ is the cross-entropy between $p(Z \mid X, \vartheta^t)$ and $p(Z \mid X, \vartheta)$, so for any $\vartheta$, $H(\vartheta, \vartheta^t) \ge H(\vartheta^t, \vartheta^t)$ (the latter is an entropy, and the difference is the K-L divergence). Thus EM ensures $\log p(X \mid \vartheta^{t+1}) \ge \log p(X \mid \vartheta^t)$. It is a greedy algorithm for maximizing the likelihood; it is guaranteed to converge, but cannot ensure a global optimum.

Variants of EM. We may try different initial values to escape local optima. We may not fully maximize $Q(\vartheta, \vartheta^t)$; ensuring $Q(\vartheta^{t+1}, \vartheta^t) \ge Q(\vartheta^t, \vartheta^t)$ is enough (e.g. by gradient ascent). This is especially helpful when the maximization problem has no closed-form solution.

Section 6.4 Other approaches to clustering

Density-based clustering. Clusters correspond to high-density regions, while low-density regions separate clusters. Noisy data and outliers are excluded. Mean-shift and DBSCAN are the most well-known representatives.

Mean-shift. Mean-shift is an iterative algorithm for locating modes (i.e. local maxima of the density), where the density is estimated with a Parzen window (non-parametric). Initialize a mode estimate $x^0$ and iteratively refine it: given $x^t$, the local density is estimated with a kernel function $K(\cdot)$, and the local mean is $m(x^t) = \frac{\sum_{x_i \in N(x^t)} x_i K(x_i - x^t)}{\sum_{x_i \in N(x^t)} K(x_i - x^t)}$; let $x^{t+1} \leftarrow m(x^t)$. Then find the next mode from another starting point, and so on.
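A sketch of one mean-shift trajectory with a Gaussian kernel; the kernel choice and the bandwidth h are assumptions for illustration, not prescribed by the slide.

```python
import numpy as np

def mean_shift_mode(X, x0, h=1.0, n_iter=100, tol=1e-5):
    """Iterate x^{t+1} <- weighted mean of the data around x^t, using a Gaussian kernel of bandwidth h."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))  # kernel weights K(x_i - x)
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()            # local mean m(x^t)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x  # an estimated mode (local maximum of the density)
```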

DBSCAN. Density-based spatial clustering of applications with noise (DBSCAN) seeks clusters as high-density regions. For each point, find its neighbors; if the number of neighboring points is less than a threshold, the point is regarded as noise. Otherwise, let this point and its neighbors belong to a cluster, and expand the cluster until reaching a low-density region.

Connectivity-based clustering. Use a graph to represent the data points and their relations, and find clusters as subgraphs that are highly connected. Also called graph-based clustering.

Agglomerative clustering. Bottom-up strategy: combine points into clusters progressively.

Divisive clustering. Also called highly connected subgraphs (HCS) clustering. Top-down strategy: split a graph into two subgraphs by finding a minimum cut, and keep splitting the subgraphs until each subgraph is highly connected. To evaluate whether a subgraph is highly connected, we use the edge-density ratio $\frac{2 n_e}{n_v (n_v - 1)}$, where $n_e$ and $n_v$ are the numbers of edges and vertices.

Section 6.5 Principal component analysis (PCA)

Projection for dimensionality reduction. Given $x \in \mathbb{R}^D$, we want to find a projection matrix $P \in \mathbb{R}^{K \times D}$ with $K < D$; the dimensionality-reduced representation is then $y = Px$, $y \in \mathbb{R}^K$. This is linear dimensionality reduction, and the problem is how to find $P$. Principal component analysis (PCA) is the most widely used method.

Motivation of PCA. PCA seeks a projection that preserves the data's Euclidean distances as much as possible. Note that $\sum_i \sum_j \|x_i - x_j\|^2$ is related to the data variance.

PCA 1/3. Step 1: subtract the mean $\bar x = \frac{\sum_i x_i}{N}$. Let $X = \begin{pmatrix} (x_1 - \bar x)^T \\ \vdots \\ (x_N - \bar x)^T \end{pmatrix}$; then $X^T X$ is the data covariance matrix (up to a constant factor).

PCA 2/3. Step 2: rotate $x_i - \bar x$. Let $C = X^T X$ and calculate its eigenvalues and eigenvectors, $C u_i = \lambda_i u_i$. Then we have an orthonormal matrix $U = [u_1, \dots, u_D]$ with $CU = U\Lambda$, where $\Lambda = \mathrm{diag}\{\lambda_1, \dots, \lambda_D\}$ and $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_D \ge 0$. Let $\tilde X = XU$; then $\tilde X^T \tilde X = \Lambda$ is diagonal.

PCA 3/3. Step 3: select the $K$ largest eigenvalues among $\{\lambda_1, \dots, \lambda_D\}$, and let the corresponding columns of $U$ (transposed) constitute the rows of $P$. Letting $y_i = P(x_i - \bar x)$ completes the dimensionality reduction.
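The three steps map directly onto a few NumPy calls; this is a sketch of the eigendecomposition route described above (in practice an SVD of the centered data is the more numerically stable choice).

```python
import numpy as np

def pca(X, K):
    """Project the rows of X (N, D) onto the K principal components; returns Y (N, K), P (K, D), mean."""
    x_bar = X.mean(axis=0)                       # Step 1: subtract the mean
    Xc = X - x_bar
    C = Xc.T @ Xc                                # (proportional to) the covariance matrix
    eigvals, U = np.linalg.eigh(C)               # Step 2: eigendecomposition (eigenvalues ascending)
    order = np.argsort(eigvals)[::-1][:K]        # Step 3: keep the K largest eigenvalues
    P = U[:, order].T                            # rows of P are the top-K eigenvectors
    Y = Xc @ P.T                                 # y_i = P (x_i - x_bar)
    return Y, P, x_bar
```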

Example: Eigenface 1/2. Consider applying PCA to face data; the calculated eigenvectors are termed eigenfaces. Note that we usually reshape each image into a vector; otherwise we need 2D-PCA.

Example: Eigenface 2/2. Using the eigenfaces, for a new input face we can perform dimensionality reduction $y_{N+1} = P x_{N+1}$ (omitting the mean-subtraction step). Since the rows of $P$ are orthonormal ($P P^T = I$), we have $x_{N+1} \approx P^T y_{N+1}$, i.e. a decomposition over eigenfaces. This is also a commonly used step for feature extraction.

Kernel PCA 1/3. PCA finds a linear projection; how do we deal with nonlinearity? Consider using basis functions $\phi_i = \phi(x_i)$; then the data covariance matrix is $\Phi^T \Phi$, where $\Phi = \begin{pmatrix} \phi_1^T \\ \vdots \\ \phi_N^T \end{pmatrix}$ (here we assume $\sum_i \phi_i = 0$). We may also use the kernel trick, where $k(x, y) = \phi(x) \cdot \phi(y)$ but $\phi(\cdot)$ is never used explicitly. Note that we can calculate $K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \cdots & k(x_N, x_N) \end{pmatrix} = \Phi \Phi^T$, and we can calculate its eigenvalues and eigenvectors: $K u_i = \lambda_i u_i$.

Kernel PCA 2/3. Note that $\Phi^T \Phi \Phi^T u_i = \Phi^T K u_i = \lambda_i \Phi^T u_i$, which means $\Phi^T u_i$ is an eigenvector of $\Phi^T \Phi$. We cannot ensure $\Phi^T u_i$ is a unit vector, so we normalize it: $\tilde u_i = \frac{\Phi^T u_i}{\|\Phi^T u_i\|} = \frac{\Phi^T u_i}{\sqrt{u_i^T \Phi \Phi^T u_i}} = \frac{\Phi^T u_i}{\sqrt{\lambda_i}}$. Then let $P = \begin{pmatrix} \tilde u_1^T \\ \vdots \\ \tilde u_K^T \end{pmatrix}$, where the corresponding $K$ eigenvalues are the largest. The projection is $y_i = P \phi_i = \begin{pmatrix} u_1^T / \sqrt{\lambda_1} \\ \vdots \\ u_K^T / \sqrt{\lambda_K} \end{pmatrix} \begin{pmatrix} k(x_1, x_i) \\ \vdots \\ k(x_N, x_i) \end{pmatrix}$.

Kernel PCA 3/3. Previously we assumed $\sum_i \phi_i = 0$, but this is not satisfied by an arbitrary kernel, so we have to centralize the kernel. Since $k_{ij} = \phi_i \cdot \phi_j$, we now want $\tilde k_{ij} = (\phi_i - \bar\phi) \cdot (\phi_j - \bar\phi)$, where $\bar\phi = \frac{\sum_i \phi_i}{N}$: $\tilde k_{ij} = k_{ij} - \frac{1}{N} \sum_{i'} k_{i'j} - \frac{1}{N} \sum_{j'} k_{ij'} + \frac{1}{N^2} \sum_{i'} \sum_{j'} k_{i'j'}$. Replace $K$ with $\tilde K$ in the previous steps.
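Putting the three kernel PCA slides together, a hedged NumPy sketch; the polynomial kernel is only an example choice, and we assume the top K eigenvalues of the centered kernel matrix are positive.

```python
import numpy as np

def kernel_pca(X, K_dims, kernel=lambda a, b: (1 + a @ b.T) ** 2):
    """Kernel PCA: center the kernel matrix, eigendecompose it, and project the training points."""
    N = len(X)
    Kmat = kernel(X, X)                                        # K = Phi Phi^T
    one = np.full((N, N), 1.0 / N)
    Kc = Kmat - one @ Kmat - Kmat @ one + one @ Kmat @ one     # centering in feature space
    eigvals, U = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:K_dims]                 # top eigenpairs of the centered kernel
    lam, U = eigvals[order], U[:, order]
    Y = Kc @ (U / np.sqrt(lam))                                # component j of y_i is u_j^T k(., x_i) / sqrt(lambda_j)
    return Y
```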

Example of kernel PCA. Left: input points in the 2-D plane; Right: projected points using $k(x, y) = (1 + x \cdot y)^2$.

Section 6.6 Other approaches to dimensionality reduction

Nonlinear dimensionality reduction. Kernel PCA; ISOMAP; locally linear embedding (LLE); self-organizing map (Chap. 10); autoencoder (Chap. 10); Laplacian eigenmap; t-distributed stochastic neighbor embedding (t-SNE); ...

Manifold learning. A manifold is a topological space that locally resembles Euclidean space near each point. 1-D manifolds include lines and circles, but not figure-eight curves; 2-D manifolds are also called surfaces, such as spheres. The intrinsic dimension of a manifold can be lower than that of its ambient space. Manifold learning aims to identify such low-dimensional structure from high-dimensional data. For manifold learning, Euclidean distance is not appropriate and is replaced by geodesic distance.

ISOMAP. While PCA seeks to preserve the data's Euclidean distances as much as possible, ISOMAP seeks to preserve the data's geodesic distances as much as possible. In ISOMAP, the geodesic distance is defined as the shortest-path distance on a graph, where the graph is constructed from the nearest neighbors of each point. Given $d_{ij} = d(x_i, x_j)$, we seek $\arg\min_{y_1, \dots, y_N} \sum_i \sum_{j \ne i} (\|y_i - y_j\| - d_{ij})^2$, which is solved by multi-dimensional scaling (MDS).

Example of MDS. MDS seeks an appropriate coordinate system for a given distance matrix.
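One standard realization is classical MDS: double-center the squared distance matrix to obtain a Gram matrix and take its top eigenvectors. A sketch, assuming the given distances are at least approximately Euclidean:

```python
import numpy as np

def classical_mds(D, K):
    """Classical MDS: recover K-dim coordinates whose Euclidean distances approximate the matrix D."""
    N = len(D)
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances (a Gram matrix)
    eigvals, U = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:K]        # keep the K largest eigenvalues
    lam = np.clip(eigvals[order], 0, None)       # clip tiny negative values from numerical noise
    return U[:, order] * np.sqrt(lam)            # coordinates Y (N, K)
```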

Locally linear embedding. Locally linear embedding (LLE) seeks to preserve locally linear relations as much as possible. In LLE, first, a locally linear relation matrix $W$ is found: $\arg\min_W \sum_i \|x_i - \sum_{x_j \in N(x_i)} w_{ij} x_j\|^2$, s.t. $\sum_j w_{ij} = 1$; if $x_j \notin N(x_i)$, then $w_{ij} = 0$. Second, a low-dimensional embedding is found: $\arg\min_{y_1, \dots, y_N} \sum_i \|y_i - \sum_j w_{ij} y_j\|^2$.
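For the first LLE step, the weights of each point can be obtained in closed form from the local Gram matrix; the sketch below adds a small regularization term for numerical stability, which is an assumption not stated in the slide.

```python
import numpy as np

def lle_weights(x_i, neighbors, reg=1e-3):
    """Solve min_w ||x_i - sum_j w_j n_j||^2 subject to sum_j w_j = 1 for one point."""
    Z = neighbors - x_i                          # shift neighbors so x_i sits at the origin
    G = Z @ Z.T                                  # local Gram matrix
    G += reg * np.trace(G) * np.eye(len(G))      # regularize (helps when there are more neighbors than dims)
    w = np.linalg.solve(G, np.ones(len(G)))      # Lagrangian solution: G w is proportional to the ones vector
    return w / w.sum()                           # enforce the sum-to-one constraint
```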

Section 6.7 Semi-supervised learning

Supervised versus unsupervised learning. Consider, for example, classification versus clustering. Classification is good at predicting the true class, but requires many labeled data for training. Clustering is able to find clusters, but clusters are not necessarily equal to classes; it requires no labels, just data. Can we combine supervised and unsupervised learning?

Motivation of semi-supervised learning. One practical difficulty of supervised learning is the lack of accurate labels. Semi-supervised learning tries to use unlabeled data in addition to labeled data. This includes weakly-supervised learning, where we have labels but they are inaccurate or noisy, and transductive learning, where we do not build a model but only want predictions for the unlabeled data. In summary: supervised learning uses labeled data and outputs a model $\hat y = f(x)$ or $q(y \mid x)$; semi-supervised (inductive) learning uses both labeled and unlabeled data and also outputs a model; transductive learning uses both labeled and unlabeled data and outputs predictions for the unlabeled data.

Examples of semi-supervised learning. Image classification with unlabeled images; image classification from search-engine click-through data; image segmentation with only bounding-box labels; anomaly detection with limited labels; ...

Why does semi-supervised learning work? For generative methods: labeled data provide information about $p(x, y)$, and unlabeled data provide additional information about $p(x)$, which is helpful for estimating $p(x, y)$. For discriminative methods: 1. Cluster assumption: if two points belong to the same cluster, they are likely to belong to the same class. 2. Density assumption: the decision boundary should lie in low-density regions that separate high-density regions.

A generative method for semi-supervised classification. For labeled data we know $(x_i, y_i)$, $i = 1, \dots, N$; for unlabeled data we know only $x_j$, $j = 1, \dots, M$. We consider a generative model for the labeled data, $p(x_i, y_i) = p(y_i) p(x_i \mid y_i)$, and a mixture model for the unlabeled data, $p(x_j) = \sum_{y_j} p(y_j) p(x_j \mid y_j)$. We further parameterize the probabilities as $p(y) = p(y \mid \vartheta_0)$ and $p(x \mid y) = p(x \mid y, \vartheta_1)$. Then we maximize the log-likelihood $\sum_i \log(p(y_i \mid \vartheta_0) p(x_i \mid y_i, \vartheta_1)) + \sum_j \log\big(\sum_{y_j} p(y_j \mid \vartheta_0) p(x_j \mid y_j, \vartheta_1)\big)$.

EM for semi-supervised classification. Treat the $y_j$'s as latent variables; then the expectation of $\log p(\{x_i, y_i\}, \{x_j, y_j\} \mid \vartheta)$ with respect to $p(\{y_j\} \mid \vartheta^t)$ is $Q(\vartheta, \vartheta^t) = \sum_i \log(p(y_i \mid \vartheta_0) p(x_i \mid y_i, \vartheta_1)) + \sum_j \sum_k \gamma_{jk}^t \log(p(y_j = k \mid \vartheta_0) p(x_j \mid y_j = k, \vartheta_1))$, where $\gamma_{jk}^t = p(y_j = k \mid x_j, \vartheta^t) \propto p(y_j = k \mid \vartheta_0^t) p(x_j \mid y_j = k, \vartheta_1^t)$.

Semi-supervised SVM 1/2. Recall the (supervised) SVM: $\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$, s.t. $\forall i,\ \xi_i \ge 0,\ y_i(w^T x_i + b) \ge 1 - \xi_i$.

Semi-supervised SVM 2/2. The semi-supervised SVM, also called the transductive SVM, solves $\min_{w, b, \xi, y, \zeta} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i + C' \sum_j \zeta_j$, s.t. $\forall i,\ \xi_i \ge 0,\ y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\forall j,\ y_j \in \{+1, -1\},\ \zeta_j \ge 0,\ y_j(w^T x_j + b) \ge 1 - \zeta_j$, where the labels $y_j$ of the unlabeled data are also optimization variables.

A graph-based method for transductive classification. Input: $(x_i, y_i)$, $i = 1, \dots, N$; $x_j$, $j = 1, \dots, M$. Output: $\hat y_j$.
1: Construct a graph whose vertices are the $x_i$ and $x_j$
2: In the graph, if $j' \in N(i')$, construct an edge from $i'$ to $j'$ with weight $w_{i'j'} = \frac{1}{d(x_{i'}, x_{j'})}$
3: $t \leftarrow 0$, initialize the class-probability vectors $p_i^0$
4: repeat
5:   $\forall j$, $p_j^{t+1} \leftarrow 0$
6:   $\forall i' \in \{1, \dots, M, M+1, \dots, M+N\}$, $\forall j$: if $w_{i'j} \ne 0$, $p_j^{t+1} \leftarrow p_j^{t+1} + \frac{w_{i'j}}{\sum_k w_{i'k}} p_{i'}^t$
7:   $\forall j$, normalize $p_j^{t+1}$ to be a unit vector
8: until convergence
9: $\hat y_j = \arg\max p_j^{t+1}$
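A compact sketch of this propagation loop over a precomputed weight matrix W. Clamping the labeled nodes back to their known labels in every round is our reading of the pseudocode (their $p_i$ are initialized from the labels), so treat that detail as an assumption; we also assume every node has at least one neighbor.

```python
import numpy as np

def propagate_labels(W, y_labeled, n_classes, labeled_idx, n_iter=100):
    """Propagate class-probability vectors over a graph with weight matrix W (n x n).
    labeled_idx: indices of labeled nodes; y_labeled: their class ids."""
    n = W.shape[0]
    P = np.full((n, n_classes), 1.0 / n_classes)            # initialize every p_i uniformly
    P[labeled_idx] = np.eye(n_classes)[y_labeled]           # labeled nodes start at their true class
    T = W / W.sum(axis=1, keepdims=True)                    # T[i, j] = w_ij / sum_k w_ik
    for _ in range(n_iter):
        P = T.T @ P                                         # p_j <- sum_i (w_ij / sum_k w_ik) p_i
        P /= P.sum(axis=1, keepdims=True)                   # renormalize to probability vectors
        P[labeled_idx] = np.eye(n_classes)[y_labeled]       # clamp labeled nodes (our reading of the slide)
    return P.argmax(axis=1)                                 # predicted class per node
```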

Random walk and PageRank. The loop above is actually a random walk. For example, PageRank is an unsupervised method based on a random walk that determines the relative importance of webpages. Input: a graph of webpages and hyperlinks. Output: relative importance of all webpages.
1: $t \leftarrow 0$
2: Initialize the importance values $r_i^0$
3: repeat
4:   $\forall j$, $r_j^{t+1} \leftarrow 0$
5:   $\forall i, \forall j$: if $w_{ij} \ne 0$, $r_j^{t+1} \leftarrow r_j^{t+1} + \frac{w_{ij}}{\sum_k w_{ik}} r_i^t$
6:   $\forall j$, $r_j^{t+1} \leftarrow \beta r_j^{t+1} + \frac{1 - \beta}{N}$, where $\beta$ is called the damping factor, used to avoid traps (the slide shows an example of a trap)
7: until convergence
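The same kind of loop specializes to PageRank. A minimal power-iteration sketch with damping; the value of beta and the matrix encoding of the link graph are assumptions for the example.

```python
import numpy as np

def pagerank(W, beta=0.85, n_iter=100):
    """Power iteration for PageRank on a link/weight matrix W (n x n); returns importance scores."""
    n = W.shape[0]
    deg = W.sum(axis=1, keepdims=True)                            # total outgoing weight per node
    T = np.divide(W, deg, out=np.zeros((n, n)), where=deg != 0)   # row-normalized transition weights
    r = np.full(n, 1.0 / n)                                       # initialize r^0 uniformly
    for _ in range(n_iter):
        r = beta * (T.T @ r) + (1 - beta) / n                     # damped update; (1-beta)/n escapes traps
    return r
```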

In this chapter. Dictionary: clustering, curse of dimensionality, dimensionality reduction, manifold learning, semi-supervised learning, transductive learning, unsupervised learning. Toolbox: expectation maximization, Gaussian mixture model, k-means, principal component analysis (and kernel PCA), random walk, transductive support vector machine.