Active and Semi-supervised Kernel Classification

1 Active and Semi-supervised Kernel Classification
Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London
Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU), and Hyun-Chul Kim (Gatsby / POSTECH, Korea)
Tübingen, Sep 23

2 Outline
- Part I: EM-EP for Gaussian Process Classification
- Part II: Semi-supervised Classification
- Part III: Active Semi-supervised Classification

3 Part I: EM-EP for Gaussian Process Classification

4 Introduction to Gaussian Process Regression
GPs are Bayesian Kernel Regression Machines.
A Gaussian Process places a prior directly on the space of functions by assuming that any finite selection of points gives rise to a multivariate Gaussian distribution over the corresponding function values.
The covariance between two function values under the prior is given by the covariance function, which typically decays monotonically with the distance between the corresponding inputs, encoding smoothness.
[Figure: two panels, "Samples from the Prior" and "Samples from the Posterior"; axes: input, x and output, y(x)]
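The "Samples from the Prior" panel can be imitated with a few lines of NumPy. This is only a sketch: the squared-exponential covariance, the names `rbf_kernel` and `sample_gp_prior`, and the values of `length_scale`, `variance` and `jitter` are assumptions made for illustration, not settings taken from the slides.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs (assumed form)."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale ** 2)

def sample_gp_prior(x, n_samples=3, jitter=1e-6, seed=0):
    """Draw functions from a zero-mean GP prior evaluated at the points x."""
    rng = np.random.default_rng(seed)
    # A small jitter on the diagonal keeps the Cholesky factorisation stable.
    K = rbf_kernel(x, x) + jitter * np.eye(len(x))
    chol = np.linalg.cholesky(K)
    # Each row of the result is one sampled function, evaluated on the grid x.
    return (chol @ rng.standard_normal((len(x), n_samples))).T

x = np.linspace(-2.0, 2.0, 50)
samples = sample_gp_prior(x)
```

Any finite grid `x` yields a multivariate Gaussian over the function values, which is exactly the defining property stated on the slide; a shorter `length_scale` gives wigglier samples.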

5 Using Gaussian Processes for Classification
Binary classification problem: Given a data set, infer class label probabilities at new points.
There are many ways to relate function values to class probabilities:
- sigmoid
- cumulative normal
- threshold
- noisy threshold
[Figure: one panel per link function, plotting the class probability against the function value f]
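The four link functions named on this slide can be written down directly. The sketch below assumes the usual textbook definitions; in particular, the `label_noise` argument of `noisy_threshold` (the probability that a label is flipped) is an assumed parameterisation, since the slide's own formulas did not survive.

```python
import math

def logistic_sigmoid(f):
    """p(y = +1 | f) under the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-f))

def cumulative_normal(f):
    """p(y = +1 | f) under the standard normal CDF (probit link)."""
    return 0.5 * (1.0 + math.erf(f / math.sqrt(2.0)))

def threshold(f):
    """Hard threshold: the label is determined by the sign of f."""
    return 1.0 if f > 0 else 0.0

def noisy_threshold(f, label_noise):
    """Threshold whose output is flipped with probability label_noise (assumed form)."""
    return 1.0 - label_noise if f > 0 else label_noise
```

For example, `noisy_threshold(2.0, 0.1)` returns `0.9`: the point lies on the positive side, but a label flip is still given 10% probability.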

6 Gaussian Process Classification (GPC) as a Probabilistic Graphical Model
Graphical model for GPCs with $n$ training data points and one test data point:
- $x_1, \dots, x_n$ and $y_1, \dots, y_n$ are observed training points and labels
- $\tilde{x}$ is the test point with unknown label $\tilde{y}$
- $f_1, \dots, f_n$ and $\tilde{f}$ are latent function values with joint Gaussian prior, hence the undirected edges
- $\Theta$ are the parameters of the covariance function (i.e. kernel)

7 The Expectation Maximization (EM) Algorithm for GPCs
We can view GPCs as a latent variable model. This suggests using the EM algorithm, which iterates between:
- E step: infer the latent function values $f$ holding $\Theta$ fixed; we approximate their posterior using a multivariate Gaussian $q(f)$
- M step: optimize the hyperparameters $\Theta$ holding $q(f)$ fixed
How do we obtain the approximation $q(f)$?

8 The Expectation Propagation (EP) Algorithm
EP approximates a product of complicated factors by a product of simpler factors, e.g. Gaussians (Minka, 2; Opper and Winther, 2).
Approximate the distribution with a product of simple terms, then iterate the following procedure until convergence:
1. Remove one approximate term from the current approximation.
2. Multiply in the corresponding true factor.
3. Find the new approximation such that it minimizes the KL divergence to the result of step 2.
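The three EP steps can be written out in the standard notation of the EP literature. The symbols below ($t_i$ for a true likelihood factor, $\tilde{t}_i$ for its Gaussian approximation, $q^{\setminus i}$ for the cavity distribution) are assumptions, since the slide's own formulas were lost in extraction.

```latex
% Approximation: the prior times a product of simple (Gaussian) terms
q(f) \;\propto\; p(f) \prod_{i=1}^{n} \tilde{t}_i(f_i)

% 1. Remove the approximate term \tilde{t}_i from q
q^{\setminus i}(f) \;\propto\; \frac{q(f)}{\tilde{t}_i(f_i)}

% 2. Multiply in the true factor t_i(f_i) = p(y_i \mid f_i)
\hat{p}(f) \;\propto\; q^{\setminus i}(f)\, t_i(f_i)

% 3. Find the new q that minimizes the KL divergence (moment matching),
%    then recover the updated approximate term
q^{\mathrm{new}}(f) \;=\; \arg\min_{q}\, \mathrm{KL}\!\left(\hat{p}(f)\,\|\,q(f)\right),
\qquad
\tilde{t}_i^{\,\mathrm{new}}(f_i) \;\propto\; \frac{q^{\mathrm{new}}(f)}{q^{\setminus i}(f)}
```

When $q$ is Gaussian, step 3 reduces to matching the mean and covariance of $\hat{p}$.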

9 The EM-EP Algorithm
Combining the EM outer loop with the EP inner loop:
- E step: infer the latent function values $f$ holding $\Theta$ fixed, using the Expectation Propagation (EP) algorithm, which approximates their posterior with a multivariate Gaussian $q(f)$
- M step: optimize the hyperparameters $\Theta$, holding $q(f)$ fixed. By Jensen's inequality, $q(f)$ defines a lower bound on the likelihood, and the M step maximizes this bound with respect to $\Theta$
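The bound used in the M step is the generic EM lower bound. It is written here as a sketch in assumed notation ($X$ for the training inputs, $y$ for the labels, $q(f)$ for the Gaussian returned by EP); the exact expression on the slide was lost.

```latex
% Jensen's inequality applied to the log marginal likelihood
\log p(y \mid X, \Theta)
  \;=\; \log \int p(y, f \mid X, \Theta)\, df
  \;\ge\; \int q(f) \log \frac{p(y, f \mid X, \Theta)}{q(f)}\, df
  \;=\; \mathcal{F}(q, \Theta)

% M step: maximize the bound over the hyperparameters with q held fixed;
% the entropy of q does not depend on \Theta and drops out
\Theta^{\mathrm{new}}
  \;=\; \arg\max_{\Theta} \int q(f) \log p(y, f \mid X, \Theta)\, df
```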

10 The Form of the Covariance Function
Covariance function used in our experiments: [equation lost in extraction], where $x_m$ is the $m$th element of $x$ and the per-feature term takes one form if $x_m$ is continuous and another if $x_m$ is discrete.
The parameters of the model are:
- signal variance; class boundary softness
- length-scale (relevance) of input feature $m$
- bias term
- the label noise probability
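Because the formula itself is missing, the following is only one plausible reading of a covariance function with a per-feature relevance, a signal variance and a bias term, handling continuous and discrete features differently. The function name, the squared-difference and mismatch distances, and the exact way the terms combine are all assumptions, not the authors' definition.

```python
import numpy as np

def ard_covariance(x, x_prime, relevance, is_discrete,
                   signal_variance=1.0, bias=0.0):
    """Covariance with one relevance weight per input feature (assumed form)."""
    x = np.asarray(x, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    # Continuous features: squared difference. Discrete features: 0/1 mismatch.
    sq_diff = (x - x_prime) ** 2
    mismatch = (x != x_prime).astype(float)
    d = np.where(np.asarray(is_discrete), mismatch, sq_diff)
    weighted = np.sum(np.asarray(relevance, dtype=float) * d)
    return signal_variance * np.exp(-0.5 * weighted) + bias

k = ard_covariance([1.0, 2.0, 0.0], [1.5, 2.0, 1.0],
                   relevance=[1.0, 1.0, 2.0],
                   is_discrete=[False, False, True])
```

A feature whose relevance is driven to zero no longer influences the covariance, which is how a model of this shape can switch off irrelevant inputs.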

11 The Role of the Label Noise Parameter
[Figure: four panels showing training data and test data for classes c1 and c2, with outliers marked, and test results without and with the label noise parameter]

12 The Role of the Length-Scale Parameters
Data set: 6-dimensional data set with three relevant features and three irrelevant features. For each data point, the relevant features depend on its class label, while the irrelevant features do not.
Result: multiple length-scales improve the likelihood and classification error rates, compared to a single-lengthscale model.
[Figure: error rates by method, single lengthscale vs. multiple lengthscales]

13 Experiments on Real World Data Sets
Table 1: Detailed information on the real world data sets from UCI
Columns: Data set; # of discrete variables; # of continuous variables; # of classes; # of data points
Rows: Thyroid; Heart disease; Ionosphere; Crabs; Pima (Tr)
[Cell values were lost in extraction; the only surviving fragment is "Thyroid 5 3 (to 2) 25".]

Table 2: Legend for description of learning algorithms compared
1-NN: One nearest neighbour
LDA: Linear Discriminant Analysis
SVM: Support Vector Machine
GPC: Gaussian Process Classifier
CV: Cross-validation used to determine kernel width
EP: EM-EP used to determine kernel parameters
s: a single length-scale parameter for all features
m: multiple length-scale parameters
VL: GPC trained using variational lower bound (Gibbs & MacKay, 2)
VU: GPC trained using variational upper bound (Gibbs & MacKay, 2)
L: GPC trained using Laplace's approximation (Williams & Barber, 1998)
soft: for SVM means soft-margin; for GPC means [condition lost in extraction]
hard: for SVM means hard-margin; for GPC means [condition lost in extraction]

14 Experiments on Real World Data Sets: Results
Table 3: Classification error rates
Columns: Heart disease; Thyroid; Ionosphere; Crabs; Pima
Rows: 1-NN; LDA; SVM-CV(hard); SVM-CV(soft); SVM-EP(s,soft); GPC-EP(s,hard); GPC-EP(s,soft); GPC-VL(m); GPC-VU(m); GPC-L(m); GPC-EP(m,soft)
[Error-rate values were lost in extraction.]
Comments:
- Using EM-EP to determine the kernel gives the best performance on all five data sets.
- Multiple length-scales help on some data sets (Crabs) but not on many others; these UCI data sets do not seem to have many irrelevant inputs.
- Using the kernel from a GPC trained via EM-EP in an SVM seems to work better than cross-validation on the SVM.

15 Part I: Summary and Conclusions
- Gaussian Process Classifiers are a Bayesian method for kernel classification
- Advantages over other, non-probabilistic kernel classifiers:
  - Principled model and kernel selection based on likelihood
  - Class probabilities rather than hard class labels
  - Prior knowledge can be incorporated
- We presented the EM-EP algorithm for learning GPCs
- The EP approximation appears superior to other methods of training GPCs
- Disadvantages:
  - Computationally costly for large data sets; however, sparse approximations have been developed (Csato & Opper, 2; Seeger et al., 2002)
- Extension to multi-class problems has also been developed

16 Part II: Semi-supervised Learning

17 Labeled and Unlabeled Data as a Graph
- Idea: Construct a random field on the graph
- Intuition: Similar examples have similar labels
- Information propagates from labeled examples
- Graph encodes prior intuition

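18 The Graph
- nodes: instances in the labeled and unlabeled data. Binary labels
- edges: local similarity. Symmetric weight matrix $W$ (entries $w_{ij}$) assumed given
- energy: [formula lost in extraction]
[Figure: two example labelings, annotated "happy, low energy" and "unhappy, high energy"]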
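A common choice of energy for such a graph, and the one used in the Gaussian fields and harmonic functions paper this part of the talk is based on, is the weighted sum of squared label differences across edges. The sketch below assumes that form; `graph_energy` is a name introduced here for illustration.

```python
import numpy as np

def graph_energy(W, y):
    """E(y) = 1/2 * sum_ij w_ij (y_i - y_j)^2 for a symmetric weight matrix W."""
    y = np.asarray(y, dtype=float)
    diff = y[:, None] - y[None, :]
    # The factor 1/2 compensates for visiting every edge twice (i,j and j,i).
    return 0.5 * np.sum(W * diff ** 2)

# Chain graph 0 - 1 - 2 - 3 with unit weights.
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0

smooth = graph_energy(W, [0, 0, 1, 1])   # one edge joins different labels
rough = graph_energy(W, [0, 1, 0, 1])    # every edge joins different labels
```

With binary labels and unit weights the energy simply counts the edges whose endpoints disagree, so the "happy" labeling is the one that cuts few edges.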

19 Label Propagation
Conditioned on labeled data:
[Figure: candidate labelings annotated with their energies (legible values: energy=4, energy=2), with a "Low energy" annotation]

20 Discrete Markov Random Fields [Zhu & Ghahramani 2]

21 Graph Mincut [Blum & Chawla ]
- graph mincut = min energy = discrete MRF mode
- Multiple minima, may be unbalanced
- Hard to compute when multi-class

22 We propose to use Gaussian random fields

23 Discrete Markov Random Fields, revisited

24 Gaussian Random Fields

25 The Laplacian

26 Gaussian Random Fields
- The field is Gaussian: [equation lost in extraction]
- The mean is: [equation lost in extraction]

27 The Mean
- The mean is also the mode of the Gaussian random field
- Unique minimum-energy state
- Harmonic, or soft, labels
- Related to heat kernels etc. in spectral graph theory
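The mean of the field has the closed form given in the Gaussian fields and harmonic functions paper, $f_u = (D_{uu} - W_{uu})^{-1} W_{ul} f_l$, where $D$ is the diagonal degree matrix and the subscripts $l$ and $u$ pick out labeled and unlabeled nodes. The sketch below assumes that formula; the function name and argument names are illustrative.

```python
import numpy as np

def harmonic_solution(W, labeled_idx, f_l):
    """Soft labels on the unlabeled nodes: solve (D_uu - W_uu) f_u = W_ul f_l."""
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    # Combinatorial graph Laplacian: degree matrix minus weight matrix.
    laplacian = np.diag(W.sum(axis=1)) - W
    L_uu = laplacian[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_u = np.linalg.solve(L_uu, W_ul @ np.asarray(f_l, dtype=float))
    return unlabeled_idx, f_u

# Chain graph 0 - 1 - 2 - 3; node 0 has label 0 and node 3 has label 1.
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0

unlabeled_idx, f_u = harmonic_solution(W, labeled_idx=[0, 3], f_l=[0.0, 1.0])
# f_u is [1/3, 2/3]: each value is the average of its neighbours' values.
```

The averaging property is what "harmonic" means here, and it is why the solution interpolates smoothly between the two labeled ends of the chain.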

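28 Interpretation: Random Walks
The soft label at node $i$ is the probability that a random walk starting from $i$ reaches a node with label 1 before a node with label 0.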

29 Interpretation: Electric Networks
Edges are resistors with $R_{ij} = 1/w_{ij}$; labeled nodes are held at +1 volt and 0 volt.

30 Classification
- naive: threshold $f$ at 0.5. Classification often unbalanced.
- incorporating Class Priors (heuristic): e.g. given a prior on the class proportions, minimize [objective lost in extraction] subject to [constraint lost in extraction]
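The "CMN" curves in the result plots refer to class mass normalization, the prior-matching heuristic described in the Gaussian fields and harmonic functions paper: the total "mass" of each class is rescaled to match a desired class prior before deciding. The sketch below follows that description; it may differ from the constrained formulation this slide originally showed, and the argument name `prior_pos` is an assumption.

```python
import numpy as np

def threshold_labels(f_u):
    """Naive rule: label 1 whenever the soft label exceeds 0.5."""
    return (np.asarray(f_u, dtype=float) > 0.5).astype(int)

def class_mass_normalization(f_u, prior_pos):
    """Label 1 iff the prior-weighted, mass-normalized score favours class 1."""
    f_u = np.asarray(f_u, dtype=float)
    pos_score = prior_pos * f_u / f_u.sum()
    neg_score = (1.0 - prior_pos) * (1.0 - f_u) / (1.0 - f_u).sum()
    return (pos_score > neg_score).astype(int)

f_u = [0.1, 0.2, 0.4, 0.6]
naive = threshold_labels(f_u)                           # [0, 0, 0, 1]
balanced = class_mass_normalization(f_u, prior_pos=0.5)  # [0, 0, 1, 1]
```

In this toy example the soft labels lean towards class 0, so thresholding at 0.5 assigns three of four points to it, while the normalized rule restores the balanced split requested by the prior.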

31 OCR Ten Digits
[Figure: accuracy against labeled set size; curves: CMN, NN, thresh]

32 Newsgroups (PC vs. MAC)
[Figure: accuracy against labeled set size; curves: CMN, thresh, VP; annotation: "Threads?"]

33 Hyperparameter Learning
Learn the graph weights (or hyperparameters):
- NN unweighted graph, [parameter lost], etc.
- NN unweighted graph, [parameter lost]
- [weight formula lost in extraction], length scales

34 Hyperparameter Learning
- Minimize entropy on the unlabeled data (maximize label confidence)
- Evidence maximization with Gaussian process classifiers [tech report CMU-CS-3-75]

35 Hyperparameter Learning: OCR Digits vs. 2
[Table: entropy (bits) and GF accuracy (%) at the start and end of learning; values lost in extraction]

36 Summary
Summary:
- An approach to semi-supervised learning based on Gaussian fields and harmonic functions.
Ongoing work:
- Semi-supervised graph kernels for Gaussian process classifiers

37 Part III: Active Semi-Supervised Learning

38 When Labeled Data Is Scarce
- Semi-supervised learning uses unlabeled data to help classification.
- Active learning (pool based) selects queries in the unlabeled data to ask for labels.
- Putting the two together gives a better query selection criterion than naively selecting the point with maximum label ambiguity.

39 Reminder: Semi-supervised Learning with Harmonic Functions
- Input: Graph weights $W$, labels.
- Output: Harmonic function, giving soft labels; the mean of a Gaussian random field; head probability of coin flips.
[Figure: electric network with $R_{ij} = 1/w_{ij}$, +1 volt and 0 volt]

40 Active Learning
Select a query to minimize the estimated generalization error, not by maximum ambiguity.
[Figure: example graph with nodes a (soft label 0.5) and B (soft label 0.4)]

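41 Active Learning
- true generalization error: err = [equation lost in extraction; it compares the sign (sgn) of the harmonic function with the true labels]
- approximation: estimated generalization error [equation lost in extraction]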

42 Active Learning
- err: estimated generalization error after querying a point and receiving its label
- select the query that minimizes this estimated error (arg min over candidate queries)
- re-training is fast for the harmonic function
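The selection rule can be sketched as follows, using the formulation from the authors' paper on combining active learning with harmonic functions: the risk of a soft labeling is estimated as the sum of `min(f_i, 1 - f_i)`, the unknown answer to a query is averaged out using the current soft label as its probability, and re-training is a rank-one update of the harmonic solution. Function and variable names are illustrative, and the code is not tuned for large graphs.

```python
import numpy as np

def select_query(W, labeled_idx, f_l):
    """Return the unlabeled node whose query minimizes the expected estimated risk."""
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    laplacian = np.diag(W.sum(axis=1)) - W
    # G is the inverse of the unlabeled block of the Laplacian.
    G = np.linalg.inv(laplacian[np.ix_(unlabeled_idx, unlabeled_idx)])
    f_u = G @ (W[np.ix_(unlabeled_idx, labeled_idx)] @ np.asarray(f_l, dtype=float))

    best_pos, best_risk = -1, np.inf
    for k in range(len(unlabeled_idx)):
        expected_risk = 0.0
        # Average over the two possible answers, weighted by the soft label.
        for y_k, prob in ((0.0, 1.0 - f_u[k]), (1.0, f_u[k])):
            # Fast re-training: harmonic solution after fixing node k to y_k.
            f_new = f_u + (y_k - f_u[k]) * G[:, k] / G[k, k]
            expected_risk += prob * np.minimum(f_new, 1.0 - f_new).sum()
        if expected_risk < best_risk:
            best_pos, best_risk = k, expected_risk
    return int(unlabeled_idx[best_pos]), float(best_risk)

# Nodes 0 and 1 are labeled (0 and 1). Node 2 ("a") touches only the labeled
# nodes; node 3 ("B") is a hub with three leaf nodes 4, 5, 6 hanging off it.
edges = [(0, 2, 1.0), (1, 2, 1.0), (0, 3, 3.0), (1, 3, 2.0),
         (3, 4, 1.0), (3, 5, 1.0), (3, 6, 1.0)]
W = np.zeros((7, 7))
for i, j, w in edges:
    W[i, j] = W[j, i] = w

query, risk = select_query(W, labeled_idx=[0, 1], f_l=[0.0, 1.0])
```

In this toy graph node 2 is the most ambiguous point (soft label 0.5), but labeling it tells us nothing about the others. Node 3 has soft label 0.4 and determines its three leaves, so it is the one selected, mirroring the a-versus-B example of the talk.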

43 OCR Digits vs. 2
[Figure: accuracy against labeled set size; curves: Our Method, Most Uncertain Query, Random Query, SVM]

44 Newsgroups PC vs. MAC
[Figure: accuracy against labeled set size; curves: Our Method, Most Uncertain Query, Random Query, SVM]

45 Conclusions
- EM-EP is an effective way to learn the kernel of a Gaussian Process Classifier
- Semi-supervised learning with harmonic functions: Gaussian process with data-dependent covariance function
- Active semi-supervised learning using harmonic functions by minimizing expected generalization error

46 Comparison to Learning with Local and Global Consistency
- Zhu, Ghahramani, and Lafferty (ICML, 2003): [equation lost in extraction]
- Zhou, Bousquet, Lal, Weston, and Schölkopf (to appear in NIPS, 2003): [equation lost in extraction]
- In practice [parameter lost] is chosen near [value lost].
- ZBLWS claims far superior performance to ZGL. However, these results are not comparisons to the class normalized version advocated by ZGL.
- In fact, we have recently compared the class normalized version of ZGL, and it performs either comparably to or better than ZBLWS on the same data sets.

48 Connection: Graph Kernels
- $t$-step random walk: [equation lost in extraction] [Szummer & Jaakkola ]
- Diffusion kernel at time $t$: [equation lost in extraction] [Kondor & Lafferty 2]
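The diffusion kernel of Kondor and Lafferty is the matrix exponential of the negated graph Laplacian scaled by the diffusion time. The sketch below computes it through an eigendecomposition; the function name and the chain-graph example are assumptions made for illustration.

```python
import numpy as np

def diffusion_kernel(W, t):
    """K_t = exp(-t * Laplacian), computed via the eigendecomposition of the Laplacian."""
    laplacian = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    # exp(-t L) = V diag(exp(-t * lambda)) V^T for a symmetric L.
    return (eigvecs * np.exp(-t * eigvals)) @ eigvecs.T

# Chain graph 0 - 1 - 2 - 3 with unit weights.
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0

K = diffusion_kernel(W, t=0.5)
```

At `t = 0` the kernel is the identity; as `t` grows, similarity spreads along the edges, so nodes that are close in the graph receive larger kernel values than distant ones.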

49 Connection: Graph Kernels
Integrate diffusion kernels over $t$: [equation lost in extraction]

50 Gaussian Field Kernel vs. RBF Kernel
[Figure comparison; surviving annotations: "GF uses ..." and "RBF ignores ..."]
