Data and Structure in Machine Learning


1 Data and Structure in Machine Learning
John Lafferty, School of Computer Science, Carnegie Mellon University
Based on joint work with Zoubin Ghahramani, Risi Kondor, Guy Lebanon, Jerry Zhu

2 The Data Century
"The coming century is surely the century of data. Our society is investing massively in the collection and processing of data of all kinds, on scales unimaginable until recently."
David Donoho (August 2000)
Satellite images, DNA sequences, Internet portals, sensor networks, text and hypertext, financial transactions, environment monitoring, astrostatistics, ...

3 Data Maven: IBM's John Cocke
- Language models from data
- Machine translation from data
- Optimizing compilers

4 Scholarly Documents

5 Relating Phenotype and Genotype

6 NCAR Mass Storage System
Central facility for storing data used and generated by climate models, field experiments, and other earth-science models.
- Current size (October 2003): 1.5 petabytes
- Growth rate: 50 terabytes/month
- Coupled to, but exceeding, Moore's law growth in compute cycles

7 Garbage Out: Municipal Waste
(Figure: U.S. population in millions vs. municipal waste in millions of tons.)
Waste importers: 1. Pennsylvania, 2. Virginia, 3. Michigan, 4. Illinois, 5. Indiana.
Waste exporters: 1. New York, 2. New Jersey, 3. Missouri, 4. Maryland, 5. Massachusetts.
(Tonnage figures not preserved in this transcription.)

8 Using Labeled and Unlabeled Data
- The standard paradigm in machine learning is supervised learning: examples given by teacher to student; well understood theoretically, with startling successes
- Labeled data is limited: labeling examples is often very expensive (example: protein shape classification, crystallography)
- How can unlabeled data be most effectively leveraged?
Making theoretical and practical progress on the labeled/unlabeled data question is currently one of the most important and interesting problems in machine learning.

9 Labeled and Unlabeled Data as a Graph

10 Outline
- Laplacians and graph kernels: Laplacians on graphs; diffusion kernels on graphs; diffusion kernels on statistical manifolds
- Semi-supervised learning: Gaussian fields and graph kernels; active learning; experiments with text and images
- Kernel conditional random fields
- Challenges

11 Kernels (Aizerman, Braverman & Rozonoér, 1964; Boser, Guyon & Vapnik, 1992)
- A powerful and elegant tool: dramatic improvements in accuracy (and sometimes computational efficiency)
- Non-linear classifiers, with an implicit representation of the feature space
- A popular indoor sport: "kernel X", where X is PCA, ICA, clustering, ...
- Primarily a tool for Euclidean space

12 Linear Classifiers vs. Kernel Classifiers
$\hat{f}(x) = \sum_{i=1}^N \alpha_i y_i \langle x, x_i \rangle$ versus $\hat{f}(x) = \sum_{i=1}^N \alpha_i y_i K(x, x_i)$

13 Popular Kernels
Gaussian: $K_\sigma(x, x') = \exp\left( -\|x - x'\|^2 / 2\sigma^2 \right)$
Polynomial: $K_d(x, x') = (1 + \langle x, x' \rangle)^d$
Processing data into vectors in $\mathbb{R}^n$ is the "dirty laundry" of machine learning (Dietterich); a proper set of features is crucial. Often data is shoehorned into Euclidean space (e.g., the vector space representation for text processing). Are there methods that take advantage of the natural structure of the data and problem?
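
A minimal NumPy sketch of the two kernels above (the function names are illustrative, not from the talk):

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for all pairs of rows
        sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq / (2 * sigma**2))

    def polynomial_kernel(X, Y, d=2):
        # K(x, x') = (1 + <x, x'>)^d
        return (1 + X @ Y.T) ** d

    X = np.random.randn(5, 3)
    K = gaussian_kernel(X, X)   # 5x5 Gram matrix, symmetric positive semidefinite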

14 Structured Data
What if data lies on a graph or other discrete structure? (Pictured: a parse tree with nodes S, NP, VP, N, V over the words "time flies like ..."; a hyperlink graph over Google, Cornell, CMU, foobar.com, NSF.)

15 An Elliptical Sun
"The Laplace operator in its various manifestations is the most beautiful and central object in all of mathematics. Probability theory, mathematical physics, Fourier analysis, partial differential equations, the theory of Lie groups, and differential geometry all revolve around this sun, and its light even penetrates such obscure regions as number theory and algebraic geometry." Edward Nelson, Tensor Analysis
"There is nothing new under the sun." Ecclesiastes 1:9

16 Laplacian on a Riemannian Manifold
On a Riemannian manifold with metric g, the geometric Laplacian is $\Delta = -\mathrm{div} \circ \mathrm{grad}$, or
$\Delta f = -\frac{1}{\sqrt{\det g}} \sum_i \partial_i \left( g^{ij} \sqrt{\det g}\; \partial_j f \right) = -\sum_{ij} g^{ij} \partial_j \partial_i f + \text{(lower order terms)}$
More generally, $\Delta = d d^* + d^* d$.

17 Laplacian: Invariant Definition
$\Delta f(p) = \lim_{r \to 0} \frac{2n}{r^2} \left( f(p) - \frac{1}{\mathrm{vol}(S_r)} \int_{S_r(p)} f \right)$
f is harmonic if $\Delta f = 0$.

18 Combinatorial Laplacian
Think of edge $e$ as a tangent vector at $e^-$. For $f : V \to \mathbb{R}$, $df : E \to \mathbb{R}$ is the 1-form
$df(e) = f(e^+) - f(e^-)$
Then $\Delta = d^* d$ is the discrete analogue of $-\mathrm{div} \circ \mathrm{grad}$.

19 Combinatorial Laplacian
Since $\langle \Delta f, g \rangle = \langle df, dg \rangle$, $\Delta$ is self-adjoint and positive. $\Delta = D - W$ is an averaging operator:
$\Delta f(x) = \sum_{y \sim x} w_{xy} \left( f(x) - f(y) \right) = d(x)\, f(x) - \sum_{y \sim x} w_{xy} f(y)$
We say f is harmonic if $\Delta f = 0$.
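
A small sketch of the combinatorial Laplacian in NumPy, checking the averaging form above on a toy graph (the weight matrix is illustrative):

    import numpy as np

    # Combinatorial Laplacian Delta = D - W of a weighted graph
    W = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)   # symmetric edge weights
    D = np.diag(W.sum(axis=1))                  # degrees d(x)
    Delta = D - W

    # (Delta f)(x) = d(x) f(x) - sum_{y ~ x} w_xy f(y), as on the slide
    f = np.random.randn(4)
    print(np.allclose(Delta @ f, D @ f - W @ f))   # True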

20 Diffusion Kernels on Graphs (Kondor and L., ICML 2002)
Kernels are generalized covariances. The crucial positive semidefiniteness property required by RKHS theory is in general difficult to obtain.
Let $\Delta$ be the graph Laplacian. In analogy with the continuous setting,
$\frac{d}{dt} K_t = -\Delta K_t$
is the heat equation on a graph. The solution $K_t = e^{-t\Delta}$ is called the heat kernel or diffusion kernel.

21 Simple Example: $K_n$
For the complete graph $K_n$ on n vertices, $\Delta_{ij} = n\,\delta(i, j) - 1$ and
$K_t(i, j) = \begin{cases} \dfrac{1 + (n-1)e^{-nt}}{n} & i = j \\[4pt] \dfrac{1 - e^{-nt}}{n} & i \neq j \end{cases}$
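
A quick numerical check of this closed form against the matrix exponential (a sketch; SciPy's expm computes $e^{-t\Delta}$ directly):

    import numpy as np
    from scipy.linalg import expm

    n, t = 5, 0.3
    Delta = n * np.eye(n) - np.ones((n, n))    # Delta_ij = n*delta(i,j) - 1
    K_t = expm(-t * Delta)                     # heat kernel K_t = exp(-t Delta)

    diag = (1 + (n - 1) * np.exp(-n * t)) / n  # closed form, i = j
    off = (1 - np.exp(-n * t)) / n             # closed form, i != j
    expected = off * np.ones((n, n)) + (diag - off) * np.eye(n)
    print(np.allclose(K_t, expected))          # True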

22 Building Up Kernels
If $K_t^{(i)}$ are kernels on $X_i$, then $K_t = \prod_{i=1}^n K_t^{(i)}$ is a kernel on $X_1 \times \cdots \times X_n$.
For the hypercube, we get
$K_t(x, x') \propto (\tanh t)^{d(x, x')}$
where $d(x, x')$ is the Hamming distance (checked numerically in the sketch below).
Other graphs with explicit diffusion kernels:
- Infinite trees (Chung & Yau, 1999)
- Rooted trees
- Cycles
- Strings with wildcards
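
A sketch verifying the hypercube formula: the n-cube is the n-fold product of the two-vertex complete graph, so its heat kernel factors, and the off-diagonal decay is exactly $(\tanh t)^d$ in the Hamming distance d:

    import numpy as np
    from scipy.linalg import expm
    from itertools import product

    n, t = 3, 0.7
    verts = np.array(list(product([0, 1], repeat=n)))   # 2^n hypercube vertices

    # Adjacency: pairs at Hamming distance 1; Laplacian L = D - A
    d = np.abs(verts[:, None, :] - verts[None, :, :]).sum(-1)  # Hamming distances
    A = (d == 1).astype(float)
    L = np.diag(A.sum(axis=1)) - A

    K = expm(-t * L)
    print(np.allclose(K / K[0, 0], np.tanh(t) ** d))    # True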

23 RKHS Representation
The general spectral representation of a kernel,
$K(x, y) = \sum_{i=1}^n \lambda_i \phi_i(x) \phi_i(y)$
leads to the reproducing kernel Hilbert space inner product
$\left\langle \sum_i a_i \phi_i,\ \sum_i b_i \phi_i \right\rangle_{H_K} = \sum_i \frac{a_i b_i}{\lambda_i}$
For the diffusion kernel, the RKHS inner product is
$\langle f, g \rangle_{H_K} = \sum_i e^{t \mu_i} \hat{f}_i \hat{g}_i$
Regularization interpretation: functions with small norm don't oscillate rapidly on the graph.
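
A small sketch of this RKHS norm for the diffusion kernel, computed from the Laplacian spectrum and checked against $f^\top K_t^{-1} f$ (toy path graph, purely illustrative):

    import numpy as np
    from scipy.linalg import expm

    W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
    Delta = np.diag(W.sum(axis=1)) - W
    mu, phi = np.linalg.eigh(Delta)             # eigenvalues mu_i, eigenvectors phi_i

    t, f = 1.0, np.array([1.0, -2.0, 1.0])
    fhat = phi.T @ f                            # graph Fourier coefficients
    norm_sq = np.sum(np.exp(t * mu) * fhat**2)  # sum_i e^{t mu_i} fhat_i^2

    K_t = expm(-t * Delta)
    print(np.isclose(norm_sq, f @ np.linalg.solve(K_t, f)))   # True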

24 Results on UCI Datasets
Using the voted perceptron (Freund & Schapire, 1999). Test error with the Hamming-distance kernel, and the reduction in support vectors with the diffusion kernel (the remaining table entries did not survive transcription):

Data Set        Hamming kernel error    Diffusion kernel: SV improvement
Breast Cancer   7.64%                   83%
Hepatitis       17.98%                  58%
Income          19.19%                  8%
Mushroom        3.36%                   70%
Votes           4.69%                   12%

For categorical attributes $A_i$,
$K(x, x') \propto \prod_i \left( \frac{1 - e^{-|A_i| t}}{1 + (|A_i| - 1)\, e^{-|A_i| t}} \right)^{d(x_i, x'_i)}$

25 Recent Application of Diffusion Kernels (Vert and Kanehisa, NIPS 02)
Task: gene function prediction from microarray data.
Question: can knowledge of metabolic pathways and catalyzing enzymes improve the performance?
Idea: the documented network of biochemical reactions forms a graph whose vertices are genes. Relevant features are likely to exhibit correlation with respect to the topology of the graph. Combine the diffusion kernel with expression profiles using kernel CCA (Bach and Jordan, 2002).
Experiments on the genes of the yeast S. cerevisiae yield consistent improvement in predicting genes' functional class.

26 Diffusion on Statistical Manifolds (L. and Lebanon, NIPS 02)
Similarity between x and x' may be difficult to quantify, even for domain experts. Generative models are fairly easy to derive, and a good generative model will use domain knowledge. The Fisher information on the family codifies this into a distance, giving the statistical family the structure of a Riemannian manifold.
Associate each data point with a model, $x \mapsto \theta(x)$, and consider the diffusion kernel on the information manifold.

27 Information Geometry
$\mathcal{F} = \{ p(\cdot \mid \theta),\ \theta \in \Theta \subset \mathbb{R}^d \}$ is a d-dimensional statistical family. The Fisher information matrix $[g_{ij}(\theta)]$,
$g_{ij}(\theta) = \int_X \partial_i \log p(x \mid \theta)\; \partial_j \log p(x \mid \theta)\; p(x \mid \theta)\, dx$
defines a Riemannian metric on $\Theta$ which is invariant under reparameterization. See (Amari, 00; Kass & Vos, 97).

28 Special Case: Multinomial
The Fisher metric maps the simplex isometrically onto (a portion of) the sphere; the geodesic distance is
$d(\theta, \theta') = 2 \arccos\left( \sum_{i=1}^{d+1} \sqrt{\theta_i\, \theta'_i} \right)$

29 Parametrices for the Heat Kernel
The heat kernel has an asymptotic expansion
$K_t(x, y) \sim (4\pi t)^{-n/2} \exp\left( -\frac{d^2(x, y)}{4t} \right) \left( \sum_{i=0}^{N} \psi_i(x, y)\, t^i + O(t^N) \right)$
in terms of the parametrices $\psi_i$. Much of modern differential geometry is based on Laplacians, heat kernels, and related constructions.

30 Special Case: Multinomial
The parametrix expansion leads to the approximation
$K_t(x, x') \approx \frac{1}{(4\pi t)^{d/2}} \exp\left( -\frac{1}{t} \arccos^2 \left( \sum_{i=1}^{d+1} \sqrt{\theta_i(x)\, \theta_i(x')} \right) \right)$
Positive definite for small t.
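
A sketch of this approximate kernel in NumPy. Here $\theta(x)$ is taken to be a document's L1-normalized term-frequency vector, one natural choice for the association $x \mapsto \theta(x)$ (an assumption for illustration):

    import numpy as np

    def multinomial_diffusion_kernel(theta, theta_p, t=0.1):
        # theta, theta_p: points on the d-simplex (nonnegative, summing to 1)
        d = len(theta) - 1
        cos = np.clip(np.sqrt(theta * theta_p).sum(), 0.0, 1.0)
        return (4 * np.pi * t) ** (-d / 2) * np.exp(-np.arccos(cos) ** 2 / t)

    tf1 = np.array([3.0, 1.0, 0.0, 2.0])   # toy term-frequency vectors
    tf2 = np.array([1.0, 1.0, 1.0, 1.0])
    print(multinomial_diffusion_kernel(tf1 / tf1.sum(), tf2 / tf2.sum()))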

31 SVM Decision Boundaries: Trinomial
(Figure: decision boundaries on the trinomial simplex for the Gaussian kernel vs. the information diffusion kernel.)

32 WebKB Text Classification Experiments
(Plots: test set error rate vs. number of training examples for linear, RBF, and diffusion kernels, on two tasks.)
Intuition: the improvement comes from the metric, which places more emphasis on values close to the boundary of the simplex.
Related work: the Fisher kernel of Jaakkola and Haussler.

33 Bounds on Covering Numbers and Generalization Error
Theorem. Suppose the Fisher information manifold M is compact, with positive curvature, dimension d, and volume V. Then the SVM hypothesis class covering numbers $\mathcal{N}(\epsilon, F_R(x))$ for the diffusion kernel $K_t$ on M satisfy
$\log \mathcal{N}(\epsilon, F_R(x)) = O\left( \frac{V}{t^{d/2}} \log^{\frac{d+2}{2}} \left( \frac{1}{\epsilon} \right) \right)$
Based on eigenvalue bounds in differential geometry:
$c_1(d) \left( \frac{j}{V} \right)^{2/d} \leq \mu_j \leq c_2(d) \left( \frac{j+1}{V} \right)^{2/d}$
Better error bounds are now available based on Rademacher averages.

34 Semi-Supervised Learning: Initial Explorations
Two main frameworks: generative and discriminative. According to standard statistical theory, only generative models can benefit from unlabeled data.
- Castelli and Cover (Trans. IT, 1996): assumes a simple, identifiable generative mixture model; quantifies the relative value of labeled/unlabeled data in terms of Fisher information; asymptotic analysis
- Blum and Mitchell (COLT, 1998), Co-Training: two discriminative classifiers, sufficiently redundant; can PAC-learn from unlabeled data; unrealistic independence assumptions

35 Labeled and Unlabeled Data as a Graph

36 Labeled and Unlabeled Data as a Graph
Idea: construct a random field on the graph.
Intuition: similar examples have similar labels; information propagates from labeled examples.
The graph encodes prior information, which may be inaccurate or misleading!

37 Random Fields
Energy function
$E(y) = \sum_{ij} w_{ij} (y_i - y_j)^2$
with (conditional) random field $p(y \mid x) \propto \exp(-\beta E(y))$.
Discrete case, $y_i \in \{+1, -1\}$: the most probable configuration is given by graph mincuts (Blum and Chawla, 2001); not unique; intractable to compute marginals.
Our framework: relaxation to the continuous case, $y_i \in \mathbb{R}$: convex, with a unique mode; a Gaussian field, with efficient algorithms.

38 Gaussian Fields and Harmonic Functions
Partition the combinatorial Laplacian over the labeled and unlabeled points:
$\Delta = \begin{bmatrix} \Delta_{LL} & \Delta_{LU} \\ \Delta_{UL} & \Delta_{UU} \end{bmatrix} = D - W$
where $W = [w_{ij}]$ is the weight matrix and $D = \mathrm{diag}(d_1, \ldots, d_n)$.
The field is clamped on the labeled data. On the unlabeled data it is Gaussian,
$y_U \sim N\left( f_U,\ \Delta_{UU}^{-1} \right)$, with harmonic mean $f_U = -\Delta_{UU}^{-1} \Delta_{UL}\, f_L$
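
A minimal sketch of the harmonic solution on a toy graph (the weights and labels are illustrative):

    import numpy as np

    W = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    Delta = np.diag(W.sum(axis=1)) - W

    L, U = [0, 4], [1, 2, 3]        # labeled / unlabeled vertex indices
    f_L = np.array([1.0, 0.0])      # clamped labels

    # f_U = -Delta_UU^{-1} Delta_UL f_L
    f_U = -np.linalg.solve(Delta[np.ix_(U, U)], Delta[np.ix_(U, L)] @ f_L)
    print(f_U)                      # harmonic values, interpolating in [0, 1]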

39 Random Walks and Electric Networks
(Figure: the graph as an electric network with edge resistances $R_{ij} = 1/w_{ij}$, a +1 volt source at the positively labeled points and 0 volts at the negatively labeled points.)
Related recent work: Szummer and Jaakkola 2002, Belkin and Niyogi 2002, Chapelle et al. 2003, Deng et al.

40 Graph Kernels
Integrate the diffusion kernel over time:
$\Delta_{UU}^{-1} = \int_0^\infty e^{-t\, \Delta_{UU}}\, dt$
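
A quick numerical check of this identity on a tiny positive-definite $\Delta_{UU}$ (invertible when the graph is connected and at least one point is labeled):

    import numpy as np
    from scipy.linalg import expm

    Delta_UU = np.array([[2.0, -1.0], [-1.0, 2.0]])
    dt = 0.01
    integral = sum(expm(-t * Delta_UU) * dt for t in np.arange(0.0, 40.0, dt))
    print(np.allclose(integral, np.linalg.inv(Delta_UU), atol=1e-2))   # True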

41 Gaussian Fields vs. Gaussian Processes
Gaussian field: $p(y) \propto \exp\left( -y^\top \Delta\, y \right)$, which uses the unlabeled data.
Gaussian process: $p(y) \propto \exp\left( -y^\top \Sigma^{-1} y \right)$, which ignores the unlabeled data.

42 Text Classification: PC vs. MAC (Zhu, Ghahramani, and L., ICML 2003)
(Plot: accuracy vs. labeled set size for the thresholded Gaussian field vs. the SVM.)

43 Digit Recognition (all 10)
(Plot: accuracy vs. labeled set size for the thresholded Gaussian field vs. the SVM.)

44 Active Learning (Experimental Design)
Idea: choose the query point to be labeled. The common heuristic is to query the most uncertain example. Our approach: minimize the (estimated) expected risk, combining semi-supervised and active learning. This can be implemented efficiently in the Gaussian field/process view.
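
A brute-force sketch of this criterion on the Gaussian field: for each candidate query, average the post-query risk over its two possible labels, weighted by the current harmonic prediction, and query the minimizer. (The talk notes this can be done efficiently, e.g. with rank-one updates; this naive version and its helper names are ours.)

    import numpy as np

    def harmonic(Delta, L, U, f_L):
        return -np.linalg.solve(Delta[np.ix_(U, U)], Delta[np.ix_(U, L)] @ f_L)

    def risk(f_U):
        # estimated risk: sum over unlabeled points of min(f, 1 - f)
        return np.minimum(f_U, 1 - f_U).sum()

    def best_query(Delta, L, U, f_L):
        f_U = harmonic(Delta, L, U, f_L)
        expected = {}
        for idx, k in enumerate(U):
            U_k = [u for u in U if u != k]
            expected[k] = sum(
                prob * risk(harmonic(Delta, L + [k], U_k, np.append(f_L, lab)))
                for lab, prob in ((1.0, f_U[idx]), (0.0, 1 - f_U[idx])))
        return min(expected, key=expected.get)   # vertex with least expected risk

    W = np.array([[0, 1, 1, 0, 0], [1, 0, 1, 0, 0], [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]], dtype=float)
    Delta = np.diag(W.sum(axis=1)) - W
    print(best_query(Delta, [0, 4], [1, 2, 3], np.array([1.0, 0.0])))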

45 Active Learning: Text Classification
(Plot: accuracy vs. labeled set size for active learning, random query, and the SVM.)

46 Active Learning: Digit Classification
(Plot: accuracy vs. labeled set size for active learning, random query, and the SVM.)

47 Why Does This Work So Well?
The data matches our assumptions: clusters in the graph have like labels. In newsgroups data, one posting will quote part of a previous posting, creating threads of discussion that are well captured by the graph. In active learning, a single label suffices to disambiguate an entire thread.
If your computer could ask you just one question, should it ask the one it's most unsure about, or the one from which it can learn the most? The graph structure gives the program a way to estimate the impact of each such question.

48 Generalization Error Bounds
Gaussian prior over y; the covariance kernel uses the unlabeled data. What if the prior assumptions are wrong?
Information-theoretic, PAC bounds (McAllester, 2000; Seeger, 2002; Derbeko et al., 2003):
$\epsilon_{\mathrm{gen}} \leq \epsilon_{\mathrm{emp}} + O\left( \sqrt{ \frac{ D(Q \,\|\, P) + 1 + \log(1/\delta) }{ n } } \right)$
Q is the approximate posterior (e.g., Laplace, Expectation Propagation), P is the prior. If the classification task doesn't match the prior, the bound blows up.

49 Inference for the Kernel (Prior)
The generalization bound suggests minimizing $D(Q \,\|\, P)$. With a Gaussian approximation,
$D(Q \,\|\, P) = \log \det K + \mathrm{tr}\left( f^\top K^{-1} f \right) + \text{constant}$
This is a convex optimization problem for graph selection, closely related to computing the channel capacity of Gaussian channels.

50 Recent Work: Free Food Cam Data
Our goal is not to solve the object/person recognition problem, but to develop machine learning methods that complement image processing techniques, improving the system by leveraging different sources of information.

51 Extension: Conditional Random Fields (L., McCallum & Pereira, 2001)
$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_s \lambda_s \phi_s(x, y) \right)$
Global normalization. Undirected graphical models: a general and expressive modeling technique.
- Using features, incorporate domain knowledge without increasing the state space
- Do not make strong independence assumptions
- Comparable to HMMs in computational efficiency
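
A toy sketch of this model: a two-feature linear-chain CRF with $Z(x)$ computed by brute-force enumeration (in practice one uses forward-backward; the features and weights here are made up for illustration):

    import numpy as np
    from itertools import product

    def score(x, y, lam):
        # phi_1: how often the label matches the observation (emission feature)
        # phi_2: how often consecutive labels agree (transition feature)
        phi1 = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
        phi2 = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
        return lam[0] * phi1 + lam[1] * phi2

    def p_y_given_x(x, y, lam, labels=(0, 1)):
        # p(y|x) = exp(score(x, y)) / Z(x), with Z(x) summed over all sequences
        Z = sum(np.exp(score(x, yp, lam)) for yp in product(labels, repeat=len(x)))
        return np.exp(score(x, y, lam)) / Z

    x = (0, 1, 1, 0)
    print(p_y_given_x(x, (0, 1, 1, 0), lam=(2.0, 0.5)))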

52 Protein Secondary Structure Prediction
Predict protein secondary structure: coil, α-helix, or β-strand. The current state of the art: SVM window-based classifiers using a polynomial kernel derived from PSI-BLAST queries. CRFs give improvements by modeling sequential structure.
Our research question: can unlabeled protein data be effectively incorporated using a graph kernel?
Current work with Yan Liu and Jerry Zhu, part of a joint UPenn/CMU NSF ITR grant.

53 Kernel Conditional Random Fields
Proposition. For each clique c, let $K_c$ be a Mercer kernel with associated RKHS norm $\|\cdot\|_c$, and let $\Omega_c : \mathbb{R} \to \mathbb{R}$ be a strictly monotonic function. Then the minimizer
$\{f_c^\star\} = \operatorname{argmin}_{f_c \in \mathcal{H}_c} \sum_{i=1}^m l\left( x_i, y_i, \{ f_c(x_i, y_c) \}_{c, y_c} \right) + \sum_{c=1}^C \Omega_c\left( \|f_c\|_c \right)$
if it exists, has the form
$f_c^\star(\cdot) = \sum_{i, y_c} \alpha_{c,i}(y_c)\, K_c(x_i, y_c;\ \cdot)$
The dual parameters $\alpha_{c,i}(y_c)$ depend on the i-th instance and all of the clique label assignments $y_c$.

54 Challenges
- Fundamental limits for using unlabeled data: capacity and risk; consistency and error bounds
- Algorithms for graph construction: random forests; multiple similarity metrics
- Approximation methods and scaling to large data sets: propagation algorithms; unbiasedness vs. computation
