Need for Sampling in Machine Learning. Sargur Srihari



2 Rationale for Sampling
1. ML methods model data with probability distributions, e.g., $p(x, y; \theta)$.
2. Models are used to answer queries, e.g., $p(y \mid x; \theta)$.
3. Sometimes answering queries is intractable:
   $p(y \mid x) = \frac{p(x, y)}{\sum_{y} p(x, y)}$
4. Sampling provides an approximate answer.
Generative models (Bayesian networks, Markov networks) can be readily used to generate samples.

3 1. Why Sampling?
- Many ML algorithms are based on drawing samples from some probability distribution and using these samples to form a Monte Carlo estimate of some desired quantity.
- Sampling provides a flexible way to approximate many sums and integrals at reduced cost.
- Sometimes it is used to speed up a costly but tractable sum, e.g., subsampling the training cost with minibatches.
- In other cases, learning algorithms require us to approximate an intractable sum or integral, e.g., the gradient of the log partition function of an undirected model.

4 Basics of MC sampling
- A sum or integral can be intractable, e.g., it has an exponential number of terms and no exact simplification is known.
- It can then often be approximated using Monte Carlo (MC) sampling.
- The idea is to view the sum or integral as an expectation under some distribution and to approximate the expectation by a corresponding average.

5 Summation → Expectation → Average
The sum or integral to approximate, rewritten as an expectation, is
   $s = \sum_{x} p(x) f(x) = E_p[f(x)]$
or
   $s = \int p(x) f(x)\,dx = E_p[f(x)]$
with the constraint that p is a probability distribution (for the sum) or a pdf (for the integral).
Approximate s by drawing n samples $x^{(1)}, \ldots, x^{(n)}$ from p and then forming the empirical average
   $\hat{s}_n = \frac{1}{n} \sum_{i=1}^{n} f(x^{(i)})$
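The recipe above can be sketched in a few lines. This is only an illustration: the three-valued distribution `probs` and the function `f(x) = x^2` are assumptions chosen so that the exact sum is available for comparison; they do not come from the slides.

```python
import random

def mc_estimate(sample_p, f, n, rng):
    """Approximate s = E_p[f(x)] by the empirical average of f over n draws from p."""
    return sum(f(sample_p(rng)) for _ in range(n)) / n

# Hypothetical discrete distribution p over {0, 1, 2} (illustrative values only).
values = [0, 1, 2]
probs = [0.5, 0.3, 0.2]

def sample_p(rng):
    """Draw one value from p."""
    return rng.choices(values, weights=probs, k=1)[0]

def f(x):
    return x * x

# The sum is tractable here, so the exact value is available for comparison.
exact_s = sum(p * f(x) for x, p in zip(values, probs))

rng = random.Random(0)
s_hat = mc_estimate(sample_p, f, 20000, rng)
print(exact_s, s_hat)
```

In a real application the exact sum would not be available; only `s_hat` would be computed.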

6 Justification of Approximation
The sample average approximation is justified by a few different properties.
1. The estimator $\hat{s}_n = \frac{1}{n} \sum_{i=1}^{n} f(x^{(i)})$ is unbiased, since
   $E[\hat{s}_n] = \frac{1}{n} \sum_{i=1}^{n} E[f(x^{(i)})] = \frac{1}{n} \sum_{i=1}^{n} s = s$
2. The law of large numbers states that if the samples $x^{(i)}$ are i.i.d., then the average converges almost surely to the expected value,
   $\lim_{n \to \infty} \hat{s}_n = s$, where $s = E_p[f(x)]$,
   provided the variance of the individual terms, $\mathrm{Var}[f(x^{(i)})]$, is bounded.

7 Variance of estimate
- Consider the variance of $\hat{s}_n$ as n increases. $\mathrm{Var}[\hat{s}_n]$ decreases and converges to 0 if $\mathrm{Var}[f(x^{(i)})] < \infty$:
   $\mathrm{Var}[\hat{s}_n] = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}[f(x)] = \frac{\mathrm{Var}[f(x)]}{n}$
- This result tells us how to estimate the error in the MC average or, equivalently, the expected error of the approximation.
- Compute the empirical average of the $f(x^{(i)})$ and their empirical variance, then divide the empirical variance by n to obtain an estimator of $\mathrm{Var}[\hat{s}_n]$.
- The Central Limit Theorem tells us that, for large n, the distribution of the average $\hat{s}_n$ is approximately normal with mean s and variance $\mathrm{Var}[\hat{s}_n]$. This allows us to estimate confidence intervals around the estimate $\hat{s}_n$ using the normal cdf.
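A minimal sketch of the error estimate described above, under assumptions that are not in the slides: x is uniform on [0, 1] and f(x) = x^2 (so the true value is 1/3), and the multiplier z = 1.96 is used for an approximate 95% interval.

```python
import math
import random
import statistics

def mc_with_confidence(samples_f, z=1.96):
    """Return (s_hat, standard_error, (lo, hi)) for i.i.d. values f(x^(i)).

    Var[s_hat] is estimated as (empirical variance of f) / n. With z = 1.96 the
    interval is an approximate 95% interval under the CLT normal approximation.
    """
    n = len(samples_f)
    s_hat = statistics.fmean(samples_f)
    var_s_hat = statistics.variance(samples_f) / n
    se = math.sqrt(var_s_hat)
    return s_hat, se, (s_hat - z * se, s_hat + z * se)

rng = random.Random(1)
# Assumed example: x ~ Uniform(0, 1), f(x) = x^2, so the true value is s = 1/3.
fx = [rng.random() ** 2 for _ in range(10000)]
s_hat, se, (lo, hi) = mc_with_confidence(fx)
print(s_hat, se, lo, hi)
```

Because the variance of the estimate scales as 1/n, quadrupling the number of samples should roughly halve the reported standard error.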

8 Sampling from base distribution p(x)
- Sampling from p(x) is not always possible. In such a case, use importance sampling.
- A more general approach is Markov chain Monte Carlo (MCMC), which forms a sequence of estimators that converge towards the distribution of interest.
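Importance sampling rests on the identity E_p[f(x)] = E_q[f(x) p(x)/q(x)]: draw from a proposal q that is easy to sample and reweight each draw. The sketch below is illustrative only. It assumes a standard normal target p (treated as if it could not be sampled directly), a wider normal proposal q, and f(x) = x^2; none of these choices come from the slides.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def importance_estimate(f, p_pdf, q_pdf, q_sample, n, rng):
    """Estimate E_p[f(x)] using draws from q, weighting each draw by p(x) / q(x)."""
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / n

rng = random.Random(2)
is_hat = importance_estimate(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),   # assumed target p
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),   # assumed proposal q, wider than p
    q_sample=lambda r: r.gauss(0.0, 2.0),
    n=50000,
    rng=rng,
)
print(is_hat)  # E_p[x^2] = 1 for a standard normal p
```

The proposal is deliberately wider than the target so that the weights p(x)/q(x) stay bounded; a proposal with lighter tails than p can make the estimator's variance very large.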

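9 How are distributions modeled in AI/ML?
Bayesian Networks
- CPDs: $p(x_i \mid pa(x_i))$
- Joint distribution over $x = \{x_1, \ldots, x_n\}$: $p(x) = \prod_{i=1}^{n} p(x_i \mid pa(x_i))$
- Example: $P(D, I, G, S, L) = P(D)\,P(I)\,P(G \mid D, I)\,P(S \mid I)\,P(L \mid G)$
Markov Networks
- $p(x) = \frac{1}{Z} \tilde{p}(x)$, with $\tilde{p}(x) = \prod_{C \in G} \phi(C)$ and $Z = \int \tilde{p}(x)\,dx$
- Example: $p(a, b, c, d, e, f) = \frac{1}{Z} \phi_{a,b}(a, b)\,\phi_{b,c}(b, c)\,\phi_{a,d}(a, d)\,\phi_{b,e}(b, e)\,\phi_{e,f}(e, f)$
Energy model
- $E(a, b, c, d, e, f) = E_{a,b}(a, b) + E_{b,c}(b, c) + E_{a,d}(a, d) + E_{b,e}(b, e) + E_{e,f}(e, f)$
- $\phi_{a,b}(a, b) = \exp(-E_{a,b}(a, b))$
Restricted Boltzmann machine
- $E(v, h) = -b^{T} v - c^{T} h - v^{T} W h$
- $p(h \mid v) = \prod_{i} p(h_i \mid v)$ and $p(v \mid h) = \prod_{i} p(v_i \mid h)$
Deep Belief Network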
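To make the role of Z concrete, the sketch below builds the six-variable Markov network from this slide with binary variables and computes Z by brute-force enumeration. The pairwise energy `edge_energy` is a hypothetical choice made for illustration; the slides do not specify the energies.

```python
import itertools
import math

# Edges of the example Markov network from the slide.
EDGES = [("a", "b"), ("b", "c"), ("a", "d"), ("b", "e"), ("e", "f")]
NAMES = ["a", "b", "c", "d", "e", "f"]

def edge_energy(u, v, coupling=1.0):
    """Hypothetical pairwise energy: lower (more probable) when neighbours agree."""
    return -coupling if u == v else 0.0

def energy(state):
    """E(a, ..., f) as a sum of pairwise energies over the edges."""
    return sum(edge_energy(state[i], state[j]) for i, j in EDGES)

def unnormalized_p(state):
    """p~(x) = exp(-E(x)), i.e. the product of the factors phi = exp(-E_edge)."""
    return math.exp(-energy(state))

# Z sums p~ over all 2^6 joint states; feasible only because the model is tiny.
states = [dict(zip(NAMES, bits)) for bits in itertools.product([0, 1], repeat=6)]
Z = sum(unnormalized_p(s) for s in states)
probs = [unnormalized_p(s) / Z for s in states]
print(Z, sum(probs))
```

With d binary variables the enumeration has 2^d terms, which is why Z, and quantities derived from it, are usually approximated by sampling in larger models.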

10 Why is sampling needed in ML?
- Inference is the task of answering probabilistic queries from a model.
- When exact inference is intractable, we need some form of approximation. This is true of probabilistic models of practical significance.
- Samples can always be used to construct distributions.
- Inference methods based on numerical sampling are known as Monte Carlo techniques.
- Most situations require evaluating expectations of unobserved variables, e.g., to make predictions, rather than the posterior distribution itself.

11 Query Types
1. Probability queries: given x, give the distribution of y.
2. MAP (maximum a posteriori probability) queries: what is the most likely setting of y?
3. Marginal MAP queries: when some variables are known.

12 Probability Queries
- The most common type of query is a probability query.
- A query has two parts:
  - Evidence: a subset E of variables and their instantiation e
  - Query variables: a subset Y of random variables in the network
- Inference task: $P(Y \mid E = e)$, the posterior probability distribution over values y of Y, conditioned on the fact E = e.
- It can be viewed as the marginal over Y in the distribution we obtain by conditioning on e:
   $P(Y \mid E = e) = \frac{P(Y, E = e)}{P(E = e)}$
- This is an intractable problem (#P-complete):
   $P(E = e) = \sum_{X \setminus E} \prod_{i=1}^{n} P(X_i \mid pa(X_i)) \Big|_{E = e}$
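For a very small network the probability query can be answered exactly by enumeration, which makes the structure of the computation visible. The sketch below uses the factorization P(D, I, G, S, L) = P(D) P(I) P(G | D, I) P(S | I) P(L | G) with all variables assumed binary; every number in the CPD tables is a hypothetical value chosen for illustration.

```python
import itertools

# Hypothetical CPDs for binary D, I, G, S, L (values 0/1); numbers are illustrative only.
P_D = {1: 0.4, 0: 0.6}
P_I = {1: 0.3, 0: 0.7}
P_G1 = {(0, 0): 0.3, (0, 1): 0.9, (1, 0): 0.05, (1, 1): 0.5}  # P(G=1 | D, I)
P_S1 = {0: 0.05, 1: 0.8}   # P(S=1 | I)
P_L1 = {0: 0.1, 1: 0.9}    # P(L=1 | G)

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(d, i, g, s, lt):
    """P(D,I,G,S,L) = P(D) P(I) P(G|D,I) P(S|I) P(L|G)."""
    return (P_D[d] * P_I[i] * bern(P_G1[(d, i)], g)
            * bern(P_S1[i], s) * bern(P_L1[g], lt))

def query(var_index, evidence):
    """P(Y | E = e) by brute-force enumeration; evidence maps position -> value."""
    scores = {0: 0.0, 1: 0.0}
    for assignment in itertools.product([0, 1], repeat=5):
        if all(assignment[k] == v for k, v in evidence.items()):
            scores[assignment[var_index]] += joint(*assignment)
    p_evidence = scores[0] + scores[1]  # P(E = e)
    return {y: scores[y] / p_evidence for y in scores}

# Positions: D=0, I=1, G=2, S=3, L=4. Query P(I | L = 1).
posterior_I = query(1, {4: 1})
print(posterior_I)
```

The loop visits every joint assignment, so its cost grows exponentially with the number of variables; this is the blow-up that motivates sampling-based inference.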

13 Need for Sampling in Bayesian Prediction
- Given training data $\mathbf{x}$ and $\mathbf{t}$ and a new test point x, the goal is to predict the value of t, i.e., we wish to evaluate the predictive distribution $p(t \mid x, \mathbf{x}, \mathbf{t})$.
- Predictive distribution (where the parameter w has a prior and posterior):
   $p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, w)\, p(w \mid \mathbf{x}, \mathbf{t})\, dw$
- With Gaussian noise:
   $p(t \mid x, w) = \mathcal{N}(t \mid y(x, w), \beta^{-1})$
- The convolution of two Gaussians is Gaussian, giving a closed-form solution.
- When the distributions are complex, the integration can be replaced by sampling.
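One way to read "the integration can be replaced by sampling" is as a Monte Carlo average over parameter samples. The following is a sketch that assumes samples $w^{(i)}$ can be drawn from the posterior $p(w \mid \mathbf{x}, \mathbf{t})$; the symbols $w^{(i)}$ and n are introduced here for illustration.

```latex
p(t \mid x, \mathbf{x}, \mathbf{t})
  = \int p(t \mid x, w)\, p(w \mid \mathbf{x}, \mathbf{t})\, dw
  = E_{p(w \mid \mathbf{x}, \mathbf{t})}\!\left[ p(t \mid x, w) \right]
  \approx \frac{1}{n} \sum_{i=1}^{n} p(t \mid x, w^{(i)}),
  \qquad w^{(i)} \sim p(w \mid \mathbf{x}, \mathbf{t})
```

This has the same form as the empirical average $\hat{s}_n$, with $f(w) = p(t \mid x, w)$ and the posterior playing the role of p.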

14 Samples generated by GANs
[Figure: samples generated by a Deep Convolutional Generative Adversarial Network and by a Laplacian Pyramid GAN]

15 What is a sample?
- Given a set of variables $x = \{x_1, \ldots, x_d\}$, a sample is an assignment of a value to every variable: $x^{t} = \{x_1^{t}, \ldots, x_d^{t}\}$, where t indicates the sample index.
- Each variable in a sample takes one allowable value in its domain, according to a probability distribution defined over x.

16 An example of samples
- Scalar variable x which can take one of K values $\{x_0, \ldots, x_{K-1}\}$, with probability distribution P(x).
- With K = 4: [Table: $P(x = x_j)$ for j = 0, ..., 3]
- Examples of samples $x^{1}, x^{2}, x^{3}, x^{4}, \ldots$: $x^{1} = x_2$, $x^{2} = x_0$, $x^{3} = x_1$, $x^{4} = x_0$, ...
- $x_0$ repeats more often than the others since it is more probable.

17 Algorithm to generate univariate samples
The domain of x is $\{x_0, \ldots, x_{K-1}\}$ and the probability distribution P(x) is discrete.
1. Divide the real interval [0, 1] into K intervals such that the width of interval j is proportional to $P(x = x_j)$.
2. Draw a random number $r \in [0, 1]$ uniformly.
3. Determine the interval j in which r lies, and output $x_j$.
[Figure: the interval [0, 1] divided into segments for $x_0, x_1, x_2, x_3$ with widths $P(x = x_j)$]
Exercise: for random number r = 0.2929, x = ? For random number r = 0.5209, x = ?
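The three steps above can be sketched as follows. The probabilities in `probs` are hypothetical values for K = 4 chosen for illustration; they are not the values from the slide's table, so the output does not answer the exercise.

```python
import bisect
import itertools
import random

def make_sampler(values, probs):
    """Build a sampler for a discrete distribution by partitioning [0, 1]."""
    # Right endpoints of the K intervals; interval j has width probs[j].
    edges = list(itertools.accumulate(probs))

    def sample(r):
        # Index of the first interval whose right endpoint exceeds r.
        j = bisect.bisect_right(edges, r)
        return values[min(j, len(values) - 1)]  # guard against rounding when r is near 1

    return sample

# Hypothetical probabilities for K = 4 (illustrative, not taken from the slide).
values = ["x0", "x1", "x2", "x3"]
probs = [0.4, 0.3, 0.2, 0.1]
sample = make_sampler(values, probs)

rng = random.Random(3)
draws = [sample(rng.random()) for _ in range(10000)]
print({v: draws.count(v) / len(draws) for v in values})
```

The empirical frequencies should be close to `probs`, which is one quick way to check that the interval widths were set up correctly.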

18 Ancestral Sampling for BNs
- Start with the lowest-numbered node: draw a sample from the distribution $p(x_1)$, which we call $\hat{x}_1$.
- Work through each of the nodes in order: for node n, draw a sample from the conditional distribution $p(x_n \mid pa_n)$, where the parent variables are set to their sampled values.
- Once the final variable $x_K$ is sampled, we have achieved the objective of obtaining a single sample from the joint distribution.
- To sample from a marginal distribution, sample from the full distribution and discard the unnecessary values. E.g., to draw from the distribution $p(x_2, x_4)$, simply sample from the full distribution, retain the values $\hat{x}_2, \hat{x}_4$ and discard the remaining values $\{\hat{x}_{j \neq 2, 4}\}$.
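A minimal sketch of ancestral sampling, including the marginal $p(x_2, x_4)$ obtained by discarding the other variables. The four-node binary network, its edges, and all conditional probabilities are assumptions made for illustration.

```python
import random

# Hypothetical binary network, numbered so every parent precedes its children:
# x1 -> x2, x1 -> x3, (x2, x3) -> x4. Each entry gives P(x_n = 1 | parent values).
PARENTS = {1: (), 2: (1,), 3: (1,), 4: (2, 3)}
P_ONE = {
    1: {(): 0.6},
    2: {(0,): 0.2, (1,): 0.7},
    3: {(0,): 0.5, (1,): 0.1},
    4: {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.95},
}

def ancestral_sample(rng):
    """Draw one joint sample by visiting nodes in order, conditioning on sampled parents."""
    x_hat = {}
    for n in sorted(PARENTS):
        parent_values = tuple(x_hat[p] for p in PARENTS[n])
        x_hat[n] = 1 if rng.random() < P_ONE[n][parent_values] else 0
    return x_hat

def marginal_sample(rng, keep=(2, 4)):
    """Sample the full joint, keep the requested variables, discard the rest."""
    x_hat = ancestral_sample(rng)
    return tuple(x_hat[n] for n in keep)

rng = random.Random(4)
pairs = [marginal_sample(rng) for _ in range(20000)]
freq_x2 = sum(x2 for x2, _ in pairs) / len(pairs)
print(freq_x2)  # exact P(x2 = 1) = 0.4 * 0.2 + 0.6 * 0.7 = 0.5
```

The only requirement on the node numbering is that parents come before children, so each conditional is evaluated with its parents already sampled.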


19 Samples from a trained RBM
- Gibbs sampling (model trained on MNIST data).
- Each column is a separate Gibbs process.
- Each row represents the output of another 1000 steps of Gibbs sampling.
- Successive samples are highly correlated.
[Figure: samples from the RBM and the corresponding weight vectors]
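The Gibbs process can be sketched using the factorized conditionals of the RBM. The sketch assumes binary units, for which the energy $E(v, h) = -b^{T} v - c^{T} h - v^{T} W h$ gives sigmoid conditionals. The tiny parameter values are hand-picked for illustration and are not a model trained on MNIST.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bernoulli(p, rng):
    """Return 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0

def sample_h_given_v(v, W, c, rng):
    """p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W[i][j]); hidden units are independent given v."""
    return [bernoulli(sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v)))), rng)
            for j in range(len(c))]

def sample_v_given_h(h, W, b, rng):
    """p(v_i = 1 | h) = sigmoid(b_i + sum_j W[i][j] h_j); visible units are independent given h."""
    return [bernoulli(sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h)))), rng)
            for i in range(len(b))]

def gibbs_chain(v0, W, b, c, steps, rng):
    """Run block Gibbs sampling for `steps` alternations; return the final visible vector."""
    v = list(v0)
    for _ in range(steps):
        h = sample_h_given_v(v, W, c, rng)
        v = sample_v_given_h(h, W, b, rng)
    return v

# Hypothetical tiny RBM: 3 visible units, 2 hidden units, hand-picked parameters.
W = [[1.0, -0.5], [0.5, 0.5], [-1.0, 1.0]]
b = [0.0, 0.1, -0.1]
c = [0.2, -0.2]

rng = random.Random(5)
v = [0, 0, 0]
for row in range(3):
    v = gibbs_chain(v, W, b, c, steps=1000, rng=rng)  # one "row": another 1000 steps
    print(row, v)
```

Each pass of the loop continues the same chain, mirroring the rows described above; running several chains from different starting vectors would correspond to the separate columns.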
