Review. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
1 Review Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
2 What is Machine Learning (ML)? The study of algorithms that improve their performance at some task with experience.
3 Graphical Models. Representation: directed vs. undirected; conditional independence semantics; factorization. Inference: message passing algorithm (tree vs. general graph); junction tree for general graphs; variational inference vs. sampling. Learning: directed vs. undirected; fully observed vs. latent variables; structure learning.
4 Conditional Independence Assumptions. Bayesian networks: local Markov assumption X ⊥ NonDescendants_X | Pa_X; further independencies are read off via d-separation (active trails). Markov networks: global Markov assumption A ⊥ B | C whenever sep_G(A; B | C); derived local assumption X ⊥ TheRest | MB(X) and pairwise assumption X ⊥ Y | TheRest whenever there is no edge X–Y. Examples on the Allergy–Flu–Sinus–Headache–Nose network: in the BN, A ⊥ F but not (A ⊥ F | S); in the MN, the Markov blanket of a node is its set of neighbors, e.g., MB(X) = {A, B, C, D}.
5 Distribution Factorization. Bayesian networks (directed graphical models): if G is an I-map for P, i.e., I_l(G) ⊆ I(P), then P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | Pa_{X_i}), with one conditional probability table (CPT) per variable. Markov networks (undirected graphical models): for strictly positive P, if G is an I-map for P, i.e., I(G) ⊆ I(P), then P(X_1, ..., X_n) = (1/Z) ∏_{i=1}^m Ψ_i(D_i), where the Ψ_i(D_i) are clique potentials over maximal cliques and Z = Σ_{x_1, ..., x_n} ∏_{i=1}^m Ψ_i(D_i) is the normalization (partition function).
6 Representation Power? Can we convert between BN and MN representations of the same distribution P? For a BN, the minimal I-map is not unique and a P-map does not always exist; for a MN, the minimal I-map is unique, but a P-map does not always exist either. Example: the four-cycle MN encoding X_1 ⊥ X_3 | {X_2, X_4} and X_2 ⊥ X_4 | {X_1, X_3} has no BN P-map, while the v-structure BN encoding A ⊥ F but not (A ⊥ F | S) has no MN P-map.
7 Inference in Graphical Models. General form of the inference problem: P(X_1, ..., X_n) ∝ ∏_i Ψ(D_i). We want to query a variable Y given evidence e, and we do not care about a set of variables Z. Compute τ(Y, e) = Σ_Z ∏_i Ψ(D_i) using variable elimination, then renormalize to obtain the conditional P(Y | e) = τ(Y, e) / Σ_Y τ(Y, e). Two examples (a DAG over A, ..., H and a chain A–B–C–D–E) show how the graph structure is used to order the computation.
8 Message passing algorithm. Message from node j to node i: m_{ji}(X_i) = Σ_{X_j} Ψ(X_i, X_j) Ψ(X_j) ∏_{k ∈ N(j)\i} m_{kj}(X_j): take the product of the incoming messages, multiply by the local potentials, and sum out X_j. Node j can send its message to i once the incoming messages from all neighbors in N(j)\i have arrived.
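To make the update concrete, here is a small added sketch (not part of the original slides) of sum-product message passing on a chain of four binary variables; the potentials are randomly generated, purely illustrative values.

```python
import numpy as np

# Minimal sum-product on a chain X1 - X2 - ... - Xn of binary variables.
# Unary potentials phi[i] and pairwise potentials psi[i] (between X_i and X_{i+1})
# are arbitrary illustrative choices.
n = 4
rng = np.random.default_rng(0)
phi = [rng.random(2) + 0.1 for _ in range(n)]
psi = [rng.random((2, 2)) + 0.1 for _ in range(n - 1)]

# Forward messages m_{i-1 -> i}(x_i) = sum_{x_{i-1}} phi(x_{i-1}) psi(x_{i-1}, x_i) m_{i-2 -> i-1}(x_{i-1})
fwd = [np.ones(2) for _ in range(n)]
for i in range(1, n):
    fwd[i] = psi[i - 1].T @ (phi[i - 1] * fwd[i - 1])

# Backward messages m_{i+1 -> i}(x_i) = sum_{x_{i+1}} phi(x_{i+1}) psi(x_i, x_{i+1}) m_{i+2 -> i+1}(x_{i+1})
bwd = [np.ones(2) for _ in range(n)]
for i in range(n - 2, -1, -1):
    bwd[i] = psi[i] @ (phi[i + 1] * bwd[i + 1])

# Node marginals: product of the local potential and both incoming messages, then normalize.
marginals = [phi[i] * fwd[i] * bwd[i] for i in range(n)]
marginals = [m / m.sum() for m in marginals]
print(marginals)
```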
9 Junction tree algorithm for a DAG. Moralize the graph, then triangulate it, then collect the maximal cliques (in the example: BC, CDE, ADE, AEF, EFH, GE) into a clique graph whose edges are labeled by separators (C, DE, AE, EF, E). A maximum spanning tree of the clique graph, with separator sizes as weights, gives the junction tree.
10 Message passing in junction trees. Message from clique D_j to clique D_i over the separator S_{ji}: m_{D_j→D_i}(S_{ji}) = Σ_{D_j \ S_{ji}} Φ_{D_j} ∏_{D_t ∈ N(D_j)\D_i} m_{D_t→D_j}(S_{tj}): take the product of the incoming messages, multiply by the local potential, and sum out the variables not in the separator, where the separator is S_{kj} = D_k ∩ D_j. The same updates can also be applied to loopy clique graphs for approximate inference.
11 Variational Inference. What approximating structure should Q have? How do we measure the goodness of the approximation of Q(X_1, ..., X_n) to the original P(X_1, ..., X_n)? Use the reverse KL divergence KL(Q ‖ P). How do we compute the new parameters? By optimization: Q* = argmin_Q KL(Q ‖ P), which leads to mean field.
12 Mean Field Algorithm. Initialize Q(X_1, ..., X_n) = ∏_i Q_i(X_i) (e.g., randomly or smartly) and mark all variables as unprocessed. Pick an unprocessed variable X_i and update Q_i(X_i) = (1/Z_i) exp( Σ_{D_j : X_i ∈ D_j} E_Q[ln Ψ(D_j)] ). Mark X_i as processed; if Q_i changed, mark the neighbors of X_i as unprocessed. The algorithm is guaranteed to converge.
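As an added illustration (not on the slide), for an Ising model p(x) ∝ exp(Σ_i h_i x_i + Σ_{i<j} J_ij x_i x_j) with x_i ∈ {−1, +1}, the mean-field update reduces to the fixed-point equation m_i = tanh(h_i + Σ_j J_ij m_j); the couplings and fields below are made-up values.

```python
import numpy as np

# Naive mean field for an Ising model. The factorized Q(x) = prod_i Q_i(x_i) has
# mean parameters m_i = E_Q[x_i] satisfying m_i = tanh(h_i + sum_j J_ij m_j).
rng = np.random.default_rng(0)
n = 5
J = rng.normal(scale=0.3, size=(n, n))
J = np.triu(J, 1); J = J + J.T            # symmetric couplings, zero diagonal (illustrative)
h = rng.normal(scale=0.5, size=n)          # external fields (illustrative)

m = np.zeros(n)                            # initialize mean parameters
for sweep in range(200):
    for i in range(n):                     # coordinate-wise updates, as in the algorithm above
        m[i] = np.tanh(h[i] + J[i] @ m)

q_plus = (1 + m) / 2                       # Q_i(x_i = +1)
print(q_plus)
```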
13 Why Sampling. The previous inference tasks focused on obtaining the entire posterior distribution P(X_i | e). Often we instead want expectations: the mean μ_{X_i|e} = E[X_i | e] = ∫ X_i P(X_i | e) dX_i, the variance σ²_{X_i|e} = E[(X_i − μ_{X_i|e})² | e] = ∫ (X_i − μ_{X_i|e})² P(X_i | e) dX_i, or more generally E[f] = ∫ f(X) P(X | e) dX, which can be difficult to compute analytically. Key idea: approximate the expectation by a sample average, E[f] ≈ (1/N) Σ_{i=1}^N f(x^i), where x^1, ..., x^N are drawn independently and identically from P(X | e).
14 Sampling Methods. Direct sampling: works only for easy distributions (multinomial, Gaussian, etc.). Rejection sampling: create samples as in direct sampling, but only count the samples consistent with the given evidence. Importance sampling: create samples as in direct sampling and assign weights to them. Gibbs sampling: often used for high-dimensional problems; each variable is sampled conditioned on its Markov blanket.
15 Gibbs Sampling in formulae. Initialize X = x^0. For t = 1 to N: x_1^t ~ P(X_1 | x_2^{t−1}, ..., x_K^{t−1}); x_2^t ~ P(X_2 | x_1^t, x_3^{t−1}, ..., x_K^{t−1}); ...; x_K^t ~ P(X_K | x_1^t, ..., x_{K−1}^t). For graphical models we only need to condition on the variables in the Markov blanket (e.g., X_2, X_3, X_4, X_5 for X_1). Variants: randomly pick the variable to sample, or sample block by block, e.g., (x_1^t, x_2^t) ~ P(X_1, X_2 | x_3^{t−1}, ..., x_K^{t−1}).
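A minimal added example (not from the slides): a Gibbs sampler for a bivariate Gaussian, where each full conditional is a univariate Gaussian in closed form; the correlation and chain length are illustrative choices.

```python
import numpy as np

# Gibbs sampling for (X1, X2) ~ N(0, [[1, rho], [rho, 1]]).
# Full conditionals: X1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for X2.
rng = np.random.default_rng(0)
rho, T = 0.8, 10000
x1, x2 = 0.0, 0.0
samples = np.empty((T, 2))
for t in range(T):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # sample X1 given the current x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # sample X2 given the new x1
    samples[t] = (x1, x2)

burn = samples[1000:]                                   # discard burn-in
print(burn.mean(axis=0), np.corrcoef(burn.T)[0, 1])     # roughly (0, 0) and roughly rho
```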
16 Learning for GMs. With fully observable data, learning is relatively easy when the structure is known and hard when it is unknown; with missing data, it is hard (EM) for known structure and very hard for unknown structure. Estimation principles: maximum likelihood estimation and Bayesian estimation. Common features: make use of the distribution's factorization, make use of inference algorithms, and make use of regularization/priors.
17 Bayesian Parameter Estimation. Bayesians treat the unknown parameters as a random variable whose distribution can be inferred using Bayes' rule: P(θ | D) = P(D | θ) P(θ) / P(D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ. In words: posterior = likelihood × prior / marginal likelihood. For iid coin-flip data the likelihood is P(D | θ) = ∏_{i=1}^N P(x_i | θ) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1−x_i} = θ^{#heads} (1 − θ)^{#tails}. The prior P(θ) encodes our prior knowledge of the domain; different priors P(θ) end up with different estimates P(θ | D).
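To make the coin-flip example concrete, here is an added sketch assuming a conjugate Beta(a, b) prior; the hyperparameters and data are made-up.

```python
import numpy as np

# Beta-Bernoulli: the prior theta ~ Beta(a, b) is conjugate to the coin-flip likelihood
# theta^{#heads} (1-theta)^{#tails}, so the posterior is Beta(a + #heads, b + #tails).
a, b = 2.0, 2.0
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])            # 1 = head, 0 = tail (illustrative)
heads, tails = int(data.sum()), int(len(data) - data.sum())

post_a, post_b = a + heads, b + tails                 # posterior Beta parameters
posterior_mean = post_a / (post_a + post_b)           # Bayesian point summary
theta_mle = heads / len(data)                         # maximum likelihood estimate
theta_map = (post_a - 1) / (post_a + post_b - 2)      # MAP estimate (mode of the posterior)
print(posterior_mean, theta_mle, theta_map)
```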
18 Frequentist Parameter Estimation. Bayesian estimation has been criticized for being subjective. Frequentists think of a parameter as a fixed, unknown constant, not a random variable, and hence use different, objective estimators instead of Bayes' rule. These estimators have different properties, such as being unbiased, minimum variance, etc. A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties: θ^ = argmax_θ P(D | θ) = argmax_θ ∏_{i=1}^N P(x_i | θ).
19 How should estimators be used? θ_MAP is not Bayesian (even though it uses a prior), since it is a point estimate. Consider predicting the future: a sensible way is to combine predictions based on all possible values of θ, weighted by their posterior probability; this is called Bayesian prediction: P(x_new | D) = ∫ P(x_new, θ | D) dθ = ∫ P(x_new | θ, D) P(θ | D) dθ = ∫ P(x_new | θ) P(θ | D) dθ. A frequentist prediction will typically use a plug-in estimator such as ML or MAP: P(x_new | D) = P(x_new | θ^_ML) or P(x_new | D) = P(x_new | θ^_MAP).
20 Decomposable likelihood of a directed model. l(θ; D) = log P(D | θ) = Σ_i log P(a_i | θ_a) + Σ_i log P(f_i | θ_f) + Σ_i log P(s_i | a_i, f_i, θ_s) + Σ_i log P(h_i | s_i, θ_h): one term for each CPT, which breaks the MLE problem into independent subproblems. Because of the factorization of the distribution, we can estimate each CPT separately (Allergy, Flu, Sinus, Headache are learned separately).
21 Bayesian estimation for directed models. Factorization: P(X = x) = ∏_i P(x_i | pa_{X_i}, θ_i). The local CPTs are multinomial distributions: P(X_i = k | Pa_{X_i} = j) = θ_{kj}. The prior over the parameters is also factorized, P(θ_a) P(θ_f) P(θ_s) P(θ_h), one factor per CPT of the Allergy–Flu–Sinus–Headache network.
22 MLE Learning Algorithm for Exponential Models. max_θ l(θ; D) is a convex optimization problem and can be solved by many methods, such as gradient descent or conjugate gradient. Initialize the model parameters θ, then loop until convergence: compute the gradient ∂l(θ; D)/∂θ_ij = E_{P̂(X_i, X_j)}[X_i X_j] − E_{P(X|θ)}[X_i X_j] (empirical expectation minus model expectation) and update θ_ij ← θ_ij + η ∂l(θ; D)/∂θ_ij.
23 Partially observed graphical models: mixture models and hidden Markov models.
24 Why is learning hard? In the fully observed iid setting, the log-likelihood decomposes into a sum of local terms: l(θ; D) = log p(x, z | θ) = log P(z | θ_1) + log p(x | z, θ_2). With latent variables, all the parameters become coupled together via marginalization: l(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(x | z, θ_2) P(z | θ_1).
25 EM algorithm. EM (expectation-maximization) finds θ maximizing l(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(x | z, θ_2) P(z | θ_1). Iterate between the E-step and the M-step until convergence. Expectation step (E-step): f(θ) = E_{q(z)}[log p(x, z | θ)], where q(z) = P(z | x, θ^t). Maximization step (M-step): θ^{t+1} = argmax_θ f(θ).
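An added sketch (not from the slides) of EM for a two-component univariate Gaussian mixture, where both steps have closed forms; the synthetic data and initialization are made-up.

```python
import numpy as np

# EM for a 1-D mixture of two Gaussians with unknown weights, means, and variances.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])  # synthetic data

pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for it in range(100):
    # E-step: responsibilities q(z) = P(z | x, theta_t)
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: maximize E_q[log p(x, z | theta)] in closed form
    Nk = resp.sum(axis=0)
    pi = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)   # should roughly recover the mixing weights, means, and variances
```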
26 Structure Learning. The goal: given a set of independent samples (assignments of the random variables), find the best (most likely) graphical model structure. For example, given samples (A, F, S, N, H) = (T, F, F, T, F), (T, F, T, T, F), (F, T, T, T, T), ..., score candidate structures over A, F, S, N, H using maximum likelihood, a Bayesian score, or the margin.
27 Chow-Liu algorithm. The optimal tree is T* = argmax_T [ M Σ_{(i,j)∈T} I(x_i, x_j) − M Σ_i H(x_i) ]; since the entropy term does not depend on T, it suffices to maximize the sum of edge mutual informations. Chow-Liu algorithm: for each pair of variables X_i, X_j, compute their empirical mutual information I(x_i, x_j); now you have a complete graph connecting the variable nodes, with edge weights equal to I(x_i, x_j); run a maximum spanning tree algorithm.
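An added sketch of the algorithm on binary data (not from the slides): estimate the pairwise mutual informations and grow a maximum-weight spanning tree with Prim's method; the synthetic chain-structured data are made-up.

```python
import numpy as np

def empirical_mi(a, b):
    """Empirical mutual information between two binary columns a and b."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

rng = np.random.default_rng(0)
n, d = 1000, 5
X = np.zeros((n, d), dtype=int)                 # synthetic data generated from a chain
X[:, 0] = rng.integers(0, 2, n)
for j in range(1, d):
    flip = rng.random(n) < 0.2
    X[:, j] = np.where(flip, 1 - X[:, j - 1], X[:, j - 1])

W = np.zeros((d, d))                            # edge weights = pairwise mutual information
for i in range(d):
    for j in range(i + 1, d):
        W[i, j] = W[j, i] = empirical_mi(X[:, i], X[:, j])

# Prim's algorithm for the maximum spanning tree over the complete MI graph.
in_tree, edges = {0}, []
while len(in_tree) < d:
    i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
               key=lambda e: W[e])
    edges.append((i, j))
    in_tree.add(j)
print(edges)                                    # should recover the chain edges
```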
28 Kernel methods. Kernels: a similarity measure between a pair of data points; positive definite kernel matrix; designing and combining kernels; fast kernel computation. Kernelizing algorithms: use inner products between data points to express the algorithm; the learned function is a linear combination of data points; replace inner products with kernels. Examples: SVM, ridge regression, clustering, PCA, CCA, ICA, statistical tests. Gaussian processes: covariance functions are kernel functions.
29 Support Vector Machines (SVM). min_w (1/2) w⊤w + C Σ_j ξ_j subject to y_j (w⊤x_j + b) ≥ 1 − ξ_j and ξ_j ≥ 0 for all j, where the ξ_j are slack variables.
30 SVM for nonlinear problems. Solve a nonlinear problem with a linear relation in feature space: transform the data points so that a nonlinear decision boundary in the input space becomes a linear decision boundary in the feature space. The same idea gives nonlinear clustering, principal component analysis, and canonical correlation analysis.
31 SVM for nonlinear problems. Some problems need complicated or even infinite-dimensional features, e.g., φ(x) = (x, x², x³, x⁴, ...). Explicitly computing high-dimensional features is time consuming and makes the subsequent optimization costly.
32 Kernel trick. In the dual problem of the SVM, replace the inner product by a kernel: max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j φ(x_i)⊤φ(x_j), subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C, where φ(x_i)⊤φ(x_j) is replaced by k(x_i, x_j); the corresponding kernel matrix is positive semidefinite. This is a quadratic program; solve for α, then w = Σ_j α_j y_j φ(x_j) and b = y_k − w⊤φ(x_k) for any k such that 0 < α_k < C. Evaluate the decision function on a new data point as f(x) = w⊤φ(x) = (Σ_j α_j y_j φ(x_j))⊤ φ(x) = Σ_j α_j y_j k(x_j, x).
33 Typical kernels for vector data. Polynomial of degree d: k(x, y) = (x⊤y)^d. Polynomial of degree up to d: k(x, y) = (x⊤y + c)^d. Gaussian RBF kernel: k(x, y) = exp(−‖x − y‖² / (2σ²)). Laplace kernel: k(x, y) = exp(−‖x − y‖ / (2σ²)).
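An added sketch (not from the slides) computing the polynomial and Gaussian RBF kernels on toy data and checking that the resulting Gram matrices are positive semidefinite; the data and bandwidth are illustrative choices.

```python
import numpy as np

def poly_kernel(X, Y, d=2, c=1.0):
    """Polynomial kernel k(x, y) = (x'y + c)^d for the rows of X and Y."""
    return (X @ Y.T + c) ** d

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # toy data (illustrative)
for K in (poly_kernel(X, X), rbf_kernel(X, X)):
    eigvals = np.linalg.eigvalsh(K)
    print(eigvals.min() >= -1e-10)           # Gram matrices are positive semidefinite
```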
34 Kernel Functions. Denote the inner product as a function: k(x_i, x_j) = φ(x_i)⊤φ(x_j). The slide illustrates kernels on structured data: graph kernels that map a graph to counts of substructures (# nodes, # edges, # triangles, # rectangles, # pentagons) and compare two graphs through these counts (e.g., K(G, G′) = 0.6, 0.2, 0.5 for different pairs), and string kernels that compare DNA sequences (e.g., K(s, s′) = 0.7).
35 Combining kernels. Positive weighted combinations of kernels are kernels: if k_1(x, y) and k_2(x, y) are kernels and α, β ≥ 0, then k(x, y) = α k_1(x, y) + β k_2(x, y) is a kernel. Products of kernels are kernels: if k_1(x, y) and k_2(x, y) are kernels, then k(x, y) = k_1(x, y) k_2(x, y) is a kernel. Mappings between spaces give you kernels: if k(x, y) is a kernel, then k(φ(x), φ(y)) is a kernel, e.g., k(x, y) = x² y².
36 Principal component analysis. Given a set of M centered observations x_k ∈ R^d, PCA finds the direction that maximizes the variance. With X = [x_1, x_2, ..., x_M], w* = argmax_{‖w‖=1} (1/M) Σ_k (w⊤x_k)² = argmax_{‖w‖=1} (1/M) w⊤XX⊤w. With C = (1/M) XX⊤, w can be found by solving the eigenvalue problem Cw = λw.
37 Alternative expression for PCA. The principal component lies in the span of the data: w = Σ_k α_k x_k = Xα. Plugging this in gives Cw = (1/M) XX⊤Xα = λXα. Furthermore, for each data point x_k, the relation x_k⊤Cw = (1/M) x_k⊤XX⊤Xα = λ x_k⊤Xα holds for all k. In matrix form, (1/M) X⊤XX⊤Xα = λ X⊤Xα, which only depends on the inner product matrix.
38 Kernel PCA. Key idea: replace the inner product matrix by the kernel matrix. PCA: (1/M) X⊤XX⊤Xα = λX⊤Xα; substitute x_k → φ(x_k), Φ = [φ(x_1), ..., φ(x_M)], K = Φ⊤Φ, with nonlinear component w = Φα. Kernel PCA: (1/M) KKα = λKα, which is equivalent to (1/M) Kα = λα. First form the M × M kernel matrix K, then perform an eigendecomposition of K.
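An added sketch of the procedure (not from the slides), including the feature-space centering that the "centered observations" assumption hides; the data and kernel bandwidth are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # toy data (illustrative)

# Gaussian RBF kernel matrix K = Phi' Phi
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

# Center in feature space: K_c = H K H with H = I - 11'/M
M = len(X)
H = np.eye(M) - np.ones((M, M)) / M
Kc = H @ K @ H

# Eigendecomposition of (1/M) K alpha = lambda alpha; keep the top two components
lam, alpha = np.linalg.eigh(Kc / M)
lam, alpha = lam[::-1], alpha[:, ::-1]
alpha = alpha[:, :2] / np.sqrt(np.maximum(lam[:2] * M, 1e-12))  # scale so that w'w = 1

projections = Kc @ alpha                                # nonlinear principal components
print(projections.shape)
```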
39 CCA in inner product format. Similar to PCA, the directions of projection lie in the span of the data: with X = (x_1, ..., x_m) and Y = (y_1, ..., y_m), w_x = Xα and w_y = Yβ, and C_xy = (1/m) XY⊤, C_xx = (1/m) XX⊤, C_yy = (1/m) YY⊤. Plugging w_x = Xα and w_y = Yβ into the earlier CCA objective gives max_{α,β} α⊤X⊤XY⊤Yβ / (√(α⊤X⊤XX⊤Xα) √(β⊤Y⊤YY⊤Yβ)): the data only appear in inner products.
40 Kernel CCA. Replace the inner product matrices by kernel matrices: max_{α,β} α⊤K_x K_y β / (√(α⊤K_x K_x α) √(β⊤K_y K_y β)), where K_x is the kernel matrix for the data X, with entries K_x(i, j) = k(x_i, x_j), and similarly for K_y. The solution is obtained by solving the corresponding generalized eigenvalue problem.
41 Embedding with kernel features. Transform a distribution into a (possibly infinite-dimensional) feature-space vector: a rich representation that captures the mean, variance, and higher-order moments.
42 Estimating embedding distance. Finite sample estimator: form a kernel matrix with 4 blocks (the two within-sample blocks and the two cross-sample blocks) and average the entries within each block; the block averages combine to give the estimate of the squared distance between the embeddings.
43 Measure Dependence via Embeddings. Use the squared feature-space distance between the embedding of the joint distribution and the embedding of the product of the marginals to measure the dependence between X and Y [Smola, Gretton, Song and Scholkopf, 2007]. The dependence measure is useful for dimensionality reduction, clustering, and matching.
44 Estimating embedding distances. Given samples (x_1, y_1), ..., (x_m, y_m) ~ P(X, Y), the dependence measure can be expressed in terms of inner products: ‖μ_XY − μ_X μ_Y‖² = ‖E_XY[φ(X)ψ(Y)] − E_X[φ(X)] E_Y[ψ(Y)]‖² = ⟨μ_XY, μ_XY⟩ − 2⟨μ_XY, μ_X μ_Y⟩ + ⟨μ_X μ_Y, μ_X μ_Y⟩. As a kernel matrix operation (with H = I − (1/m) 11⊤, and the X and Y data ordered in the same way) this becomes trace(H K_x H K_y), where [K_x]_ij = k(x_i, x_j) and [K_y]_ij = k(y_i, y_j).
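An added sketch (not from the slides) of the finite-sample estimator trace(H K_x H K_y) above, normalized by m², on made-up data with one dependent and one independent pair for comparison.

```python
import numpy as np

def rbf_gram(Z, sigma=1.0):
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def dependence(X, Y, sigma=1.0):
    """Biased finite-sample dependence estimate trace(Kx H Ky H) / m^2."""
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(rbf_gram(X, sigma) @ H @ rbf_gram(Y, sigma) @ H) / m ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y_dep = np.sin(3 * X) + 0.1 * rng.normal(size=(200, 1))   # Y depends on X
Y_ind = rng.normal(size=(200, 1))                          # Y independent of X
print(dependence(X, Y_dep), dependence(X, Y_ind))          # first value clearly larger
```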
45 Other advanced methods. Combining classifiers: bagging, stacking, boosting (AdaBoost). Semisupervised learning: graph-based methods (label propagation), co-training, semisupervised SVM. Active learning. Tensor data decomposition: Parafac and Tucker decomposition.
46 What is a Gaussian Process? A Gaussian process is a generalization of the multivariate Gaussian distribution to infinitely many variables. Formally: a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions. Informally: an infinitely long vector with dimensions indexed by x, i.e., a function f(x). A Gaussian process is fully specified by a mean function m(x) = E[f(x)] and a covariance function k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]; we write f(x) ~ GP(m(x), k(x, x′)), with x the indices.
47 Covariance function of Gaussian processes. For any finite collection of indices x_1, x_2, ..., x_n, the covariance matrix Σ = K with entries K_ij = k(x_i, x_j) must be positive semidefinite, so the covariance function needs to be a kernel function over the indices, e.g., the Gaussian RBF kernel k(x, x′) = exp(−(1/2)‖x − x′‖²).
48 Samples from GPs with different kernels, e.g., k(x_i, x_j) = v_0 exp(−(|x_i − x_j| / r)^α) + v_1 + v_2 δ_ij.
49 Using Gaussian processes for nonlinear regression. We observe a dataset D = {(x_i, y_i)}_{i=1}^n. The prior P(f) is a Gaussian process and, just as with a multivariate Gaussian, the posterior over f is also a Gaussian process, by Bayes' rule P(f | D) = P(D | f) P(f) / P(D). Everything else about GPs follows the basic rules of probability applied to multivariate Gaussians.
50 Noisy observations. With y | x, f(x) ~ N(f, σ²_noise I) and Y = (y_1, ..., y_n), the posterior process is f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x′)) with m_post(x) = k(x, X) (K + σ²_noise I)^{-1} Y and k_post(x, x′) = k(x, x′) − k(x, X) (K + σ²_noise I)^{-1} k(X, x′).
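An added sketch (not from the slides) implementing the posterior mean and covariance above with a Gaussian RBF covariance function; the training data, noise level, and bandwidth are made-up.

```python
import numpy as np

def k(A, B, sigma=0.5):
    """Gaussian RBF covariance function over scalar inputs."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20)                       # training inputs (illustrative)
Y = np.sin(X) + 0.1 * rng.normal(size=20)       # noisy observations
noise = 0.1 ** 2

Xs = np.linspace(0, 5, 100)                     # test inputs
Kxx = k(X, X) + noise * np.eye(len(X))
Ksx = k(Xs, X)

# Posterior mean and covariance of the noise-free function at the test inputs
m_post = Ksx @ np.linalg.solve(Kxx, Y)
k_post = k(Xs, Xs) - Ksx @ np.linalg.solve(Kxx, Ksx.T)
print(m_post[:5], np.sqrt(np.diag(k_post))[:5])
```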
51 Relating GPs to class probabilities. Transform the continuous output of the Gaussian process to a value in [−1, 1] or [0, 1]. With binary outputs, the joint distribution of all variables in the model is no longer Gaussian; the likelihood is also not Gaussian, so we need approximate inference to compute the posterior GP (Laplace approximation, sampling).
52 Kernel low-rank approximation. Incomplete Cholesky factorization approximates the n × n kernel matrix K by R⊤R, with R of size d × n and d ≪ n. Substituting the low-rank factor into the GP posterior gives f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x′)) with m_post(x) = R_x (RR⊤ + σ²_noise I)^{-1} R Y and k_post(x, x′) = R_{xx′} − R_x (RR⊤ + σ²_noise I)^{-1} (RR⊤) R_{x′}, where R_x denotes the low-rank feature of the test input.
53 Incomplete Cholesky Decomposition. We have a few things to understand. Gram-Schmidt orthogonalization: given a set of vectors V = {v_1, v_2, ..., v_n}, find a set of orthonormal basis vectors Q = {u_1, u_2, ..., u_n} with u_i⊤u_j = 0 for i ≠ j and u_i⊤u_i = 1. QR decomposition: given the orthonormal basis Q, compute the projection of V onto Q, v_i = Σ_j r_ji u_j, R = [r_ji], so V = QR. Cholesky decomposition with pivots: V ≈ Q(:, 1:k) R(1:k, :). Kernelization: V⊤V = R⊤Q⊤QR = R⊤R ≈ R(1:k, :)⊤ R(1:k, :), and likewise K = Φ⊤Φ ≈ R(1:k, :)⊤ R(1:k, :).
54 Incomplete Cholesky decomposition in Matlab (the code listing is not reproduced in this transcription). Kernel entries can be computed on the fly, so the full kernel matrix never needs to be stored; computation: O(nd²) (number of kernel evaluations).
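Since the Matlab listing is missing here, the following is a hedged Python sketch of pivoted incomplete Cholesky that evaluates kernel entries on the fly; the kernel, tolerance, and data are assumptions.

```python
import numpy as np

def incomplete_cholesky(kern, n, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky: returns R (d x n) with K approximately R.T @ R.
    kern(i, j) evaluates a single kernel entry on demand."""
    d = np.array([kern(i, i) for i in range(n)])         # residual diagonal of K
    R, max_rank = [], max_rank or n
    for _ in range(max_rank):
        i = int(np.argmax(d))
        if d[i] <= tol:                                   # remaining error below tolerance
            break
        col = np.array([kern(i, j) for j in range(n)])    # one column of K, computed on the fly
        for r in R:
            col -= r[i] * r                               # subtract contribution of previous pivots
        col /= np.sqrt(d[i])
        R.append(col)
        d -= col ** 2                                     # update the residual diagonal
    return np.array(R)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                             # toy data (illustrative)
kern = lambda i, j: float(np.exp(-np.sum((X[i] - X[j]) ** 2) / 2))
R = incomplete_cholesky(kern, len(X), tol=1e-4)
print(R.shape)                                            # rank d is typically much smaller than n
```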
55 Random features. What basis should we use? e^{jω⊤(x−y)} can be replaced by cos(ω⊤(x − y)), since both k(x − y) and p(ω) are real functions, and cos(ω⊤(x − y)) = cos(ω⊤x) cos(ω⊤y) + sin(ω⊤x) sin(ω⊤y); so for each ω, use the feature [cos(ω⊤x), sin(ω⊤x)]. What randomness should we use? Randomly draw ω from p(ω); e.g., for the Gaussian RBF kernel, ω is drawn from a Gaussian.
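An added sketch (not from the slides) of random Fourier features for the Gaussian RBF kernel: draw ω from a Gaussian and use the [cos(ω⊤x), sin(ω⊤x)] features described above; the dimensions and bandwidth are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D, sigma = 200, 3, 500, 1.0
X = rng.normal(size=(n, d))                      # toy data (illustrative)

# For k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), draw omega ~ N(0, I / sigma^2).
Omega = rng.normal(scale=1.0 / sigma, size=(d, D))
Z = np.hstack([np.cos(X @ Omega), np.sin(X @ Omega)]) / np.sqrt(D)

K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
K_approx = Z @ Z.T                               # random-feature approximation of K
print(np.abs(K_exact - K_approx).max())          # small approximation error
```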
56 Other advanced methods. Combining classifiers: bagging, stacking, boosting (AdaBoost). Semi-supervised learning: graph-based methods (label propagation), co-training, semi-supervised SVM. Active learning: active learning for SVM. Tensor data analysis: Parafac and Tucker decomposition; connection with latent variable models.
57 Bagging. Bagging: bootstrap aggregating. Generate B bootstrap samples of the training data by uniformly random sampling with replacement from the original training set. Train a classifier or a regression function using each bootstrap sample. For classification: take a majority vote over the classification results; for regression: average the predicted values.
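An added sketch of bagging (not from the slides) using axis-aligned decision stumps as the base learner; the base learner, the number of bootstrap samples, and the synthetic data are assumptions.

```python
import numpy as np

def fit_stump(X, y):
    """Best single-feature threshold rule (decision stump); labels in {-1,+1}."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.mean(pred != y)
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def predict_stump(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] > thr, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)              # synthetic labels (illustrative)

B, n = 25, len(X)
stumps = []
for _ in range(B):                                       # B bootstrap samples, one stump each
    idx = rng.integers(0, n, n)                          # sampling with replacement
    stumps.append(fit_stump(X[idx], y[idx]))

votes = np.sum([predict_stump(s, X) for s in stumps], axis=0)
y_hat = np.where(votes >= 0, 1, -1)                      # majority vote
print(np.mean(y_hat == y))                               # training accuracy of the ensemble
```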
58 Stacking classifiers. Level-0 models are based on different learning models and use the original data (level-0 data). Level-1 models are based on the results of the level-0 models (the level-1 data are the outputs of the level-0 models); the level-1 model is also called the generalizer. If you have lots of models, you can stack them into deeper hierarchies.
59 Boosting. Boosting: a general method for converting rough rules of thumb into a highly accurate prediction rule. It is a family of methods which produce a sequence of classifiers, where each classifier depends on the previous one and focuses on the previous one's errors: examples that are incorrectly predicted by the previous classifiers are chosen more often or weighted more heavily when estimating the new classifier. Questions: how to choose the hardest examples, and how to combine these classifiers?
60 AdaBoost flow chart. The original training set is reweighted into data set 1, data set 2, ..., data set T, from which learner 1, learner 2, ..., learner T are trained; training instances that are wrongly predicted by learner 1 play more important roles in the training of learner 2, and the final prediction is a weighted combination of the learners.
61 AdaBoost (the algorithm details are not transcribed; a sketch follows below).
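Since the algorithm on this slide is not transcribed, the following is a hedged sketch of standard AdaBoost with weighted decision stumps; the base learner, data, and number of rounds are assumptions.

```python
import numpy as np

def fit_weighted_stump(X, y, w):
    """Decision stump minimizing the weighted error; labels in {-1,+1}."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best

def predict_stump(j, thr, sign, X):
    return sign * np.where(X[:, j] > thr, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0.2, 1, -1)       # synthetic labels (illustrative)

w = np.ones(len(X)) / len(X)                             # uniform initial weights
ensemble = []
for t in range(30):
    err, j, thr, sign = fit_weighted_stump(X, y, w)
    err = max(err, 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)                # weight of this classifier
    pred = predict_stump(j, thr, sign, X)
    w *= np.exp(-alpha * y * pred)                       # upweight misclassified examples
    w /= w.sum()
    ensemble.append((alpha, j, thr, sign))

F = sum(a * predict_stump(j, thr, s, X) for a, j, thr, s in ensemble)
print(np.mean(np.sign(F) == y))                          # accuracy of the weighted combination
```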
62 Graph-based methods. Idea: construct a graph with edges between very similar examples; unlabeled data can help glue the objects of the same class together. Suppose there are just two labels, 0 and 1. Solve for labels f(x) on the unlabeled examples x to minimize an objective such as: label propagation: each label is the average of its neighbors' labels; minimum cut: Σ_{e=(u,v)} |f(u) − f(v)|; minimum soft cut: Σ_{e=(u,v)} (f(u) − f(v))²; spectral partitioning.
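An added sketch of label propagation (not from the slides) on a k-nearest-neighbor graph: each unlabeled node repeatedly takes the average of its neighbors' values while the labeled nodes stay clamped; the data and graph construction are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.7, (50, 2)), rng.normal(2, 0.7, (50, 2))])  # two clusters
y = np.full(100, np.nan)
y[0], y[50] = 0.0, 1.0                       # only one labeled example per class

# Build a symmetric k-nearest-neighbor adjacency matrix (k is an illustrative choice).
k = 5
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.zeros((100, 100))
for i in range(100):
    for j in np.argsort(D[i])[1:k + 1]:
        W[i, j] = W[j, i] = 1.0

f = np.where(np.isnan(y), 0.5, y)            # initialize unlabeled nodes at 0.5
labeled = ~np.isnan(y)
for _ in range(200):
    f = W @ f / W.sum(axis=1)                # each node takes the average of its neighbors
    f[labeled] = y[labeled]                  # clamp the labeled nodes

pred = (f > 0.5).astype(int)
print(pred[:5], pred[50:55])                 # cluster memberships recovered from two labels
```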
63 Passive Learning (Non-sequential Design). The data source feeds the learning algorithm (estimator) with data points labeled by an expert/oracle, and the algorithm outputs a classifier.
64 Active Learning (Sequential Design). The learning algorithm requests the label of a data point from the expert/oracle and receives that label, then requests the label of another data point, and so on; finally the algorithm outputs a classifier.
65 Active Learning (Sequential Design), continued. With the same request-and-label loop as on the previous slide, the key question is: how many label requests are required to learn? This quantity is the label complexity.