Intelligent Systems I

12. Sampling Methods: The Last Thing You Should Ever Try
Philipp Hennig & Stefan Harmeling
Max Planck Institute for Intelligent Systems, Department of Empirical Inference
23 January 2014

Recap: the course so far
1. Intro: intelligence is reasoning under uncertainty.
2. Probability theory specifies the mathematics of uncertainty.
3. Graphical models: independence is crucial for probabilistic computations.
4. Gaussians map probabilistic reasoning onto linear algebra.
5. Regression: Gaussian algebra allows learning functional relationships.
6. Kernels: it is possible to learn infinitely complex functions.
7. Classification: non-Gaussian extension of regression, with approximate inference.
8. SVM / kPCA: special link function choices speed up inference.
9. Reproducing kernel Hilbert spaces: the capacity of kernel models.
10. Statistical learning theory: the Bayes estimator is statistically optimal if the prior is correct. On VC-finite spaces, there are other methods (ERM) that converge to it for large data sets.
11. Neural networks: learning features, by maximum likelihood.

Today: probabilistic inference in intractable models
Probabilistic inference requires expectations and marginals:
$$\langle f \rangle_p = \mathbb{E}_p[f] = \int f(x)\,p(x)\,dx; \qquad p(y) = \int p(y \mid x)\,p(x)\,dx$$
Example: marginalizing over hyperparameters, e.g. remember Gaussian process regression:
$$p(f_x \mid X, y) = \int \mathcal{GP}(f_x \mid X, y, \theta)\,p(\theta)\,d\theta$$
So far:
- analytic integrals using Gaussian models
- ignore the probability measure, only use maximum posterior/likelihood
- approximate the posterior with a simple Gaussian by Laplace approximation
Not covered in this course: more elaborate approximate posterior distributions using approximate inference, e.g. variational bounds, moment matching, kernel mean embeddings, ...

Samples from a probability distribution can be used to estimate expectations, roughly, without having to design an elaborate integration algorithm.

Sampling Methods, aka Monte Carlo Methods [DJC MacKay, chapters 29-32]
Instead, the simplest thing to do: replace the integral with a sum,
$$\int f(x)\,p(x)\,dx \approx \frac{1}{S}\sum_i f(x_i); \qquad \int p(x, y)\,dx \approx \frac{1}{S}\sum_i p(y \mid x_i) \qquad \text{if } x_i \sim p(x)$$
This requires being able to sample $x_i \sim p(x)$.
Definition 9.1 (Monte Carlo method): Algorithms that compute expectations in the above way, using samples $x_i \sim p(x)$, are called Monte Carlo methods (Stanisław Ulam, John von Neumann).
Pictures: Wikipedia
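The definition above can be illustrated with a minimal Python sketch. The choice of $p = \mathcal{N}(0, 1)$ and $f(x) = x^2$ (exact answer 1) is an assumption made purely for illustration, as are the function names; none of it comes from the slides.

```python
import random

def monte_carlo_expectation(f, sampler, num_samples):
    """Estimate E_p[f] as the average of f over samples x_i ~ p."""
    total = 0.0
    for _ in range(num_samples):
        total += f(sampler())
    return total / num_samples

random.seed(0)  # fixed seed, only so that this sketch is reproducible

# Illustrative assumption: p = N(0, 1) and f(x) = x^2, so E_p[f] = 1 exactly.
estimate = monte_carlo_expectation(
    lambda x: x * x,
    lambda: random.gauss(0.0, 1.0),
    100_000,
)
print(estimate)
```

The estimator needs nothing from $p$ except the ability to draw from it, which is exactly the requirement stated on the slide.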

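Why Sampling is a Good Idea: sampling gives a rough estimate, fast
The sampling estimator is unbiased:
$$\hat\phi = \frac{1}{S}\sum_s f(x_s), \qquad \langle\hat\phi\rangle = \frac{1}{S}\sum_s \langle f(x_s)\rangle = \phi$$
Its variance ("error") drops over time, i.e. with the number of samples $S$:
$$\left\langle\left(\hat\phi - \langle\hat\phi\rangle\right)^2\right\rangle = \left\langle\left[\frac{1}{S}\sum_s \left(f(x_s) - \phi\right)\right]^2\right\rangle = \frac{1}{S^2}\sum_s\sum_r \left\langle f(x_s)f(x_r) - \phi f(x_s) - f(x_r)\phi + \phi^2\right\rangle = \frac{1}{S^2}\sum_s \left\langle\left(f(x_s) - \phi\right)^2\right\rangle = \frac{1}{S}\operatorname{var}(f)$$
The cross terms with $s \neq r$ vanish because the samples are drawn independently.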
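The $\operatorname{var}(f)/S$ scaling can be checked empirically. The sketch below assumes, for illustration only, $f(x) = x$ with $x \sim \mathrm{Uniform}[0, 1]$, so that $\operatorname{var}(f) = 1/12$; the function names are hypothetical.

```python
import random
import statistics

def estimator(num_samples):
    """Monte Carlo estimate of E[x] for x ~ Uniform[0, 1]."""
    return sum(random.random() for _ in range(num_samples)) / num_samples

def estimator_variance(num_samples, repeats=2000):
    """Empirical variance of the estimator over many independent repetitions."""
    return statistics.pvariance([estimator(num_samples) for _ in range(repeats)])

random.seed(1)  # fixed seed, only for reproducibility
var_f = 1.0 / 12.0  # variance of a Uniform[0, 1] variable
for S in (10, 100):
    # The two printed numbers should roughly agree for each S.
    print(S, estimator_variance(S), var_f / S)
```

Increasing $S$ by a factor of 10 reduces the variance by about a factor of 10, so the standard deviation shrinks only by about $\sqrt{10}$.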

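Sampling is for rough guesses: a dumb way to learn π
The ratio of the area inside the quarter circle to the area of the unit square is $(\pi/4)/1$, so
$$\pi = 4\int \mathbb{I}(x^\top x < 1)\,u(x)\,dx$$
Draw $x \sim u(x)$ (uniform on the unit square), check $x^\top x < 1$, count:
4 * sum(sum(rand(100,2).^2,2) < 1) / 100
4 * sum(sum(rand(1e6,2).^2,2) < 1) / 1e6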
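For readers without MATLAB, the same experiment can be sketched in Python using only the standard library. The function name is an assumption made for this illustration.

```python
import random

def estimate_pi(num_samples):
    """Estimate pi as 4 times the fraction of uniform points in the unit
    square that fall inside the quarter circle x^2 + y^2 < 1."""
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / num_samples

random.seed(0)  # fixed seed, only for reproducibility
print(estimate_pi(100))        # rough guess
print(estimate_pi(1_000_000))  # better, but still only a few correct digits
```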

Sampling is for rough guesses: a dumb way to learn π
[Figure: plots against # samples; legend: var(f)/S]
- need only 9 samples to get the order of magnitude right ($\operatorname{std}(\hat\phi) = \operatorname{std}(f)/3$)
- need on the order of $10^{14}$ samples for single-precision ($\sim 10^{-7}$) calculations, because the error falls only as $1/\sqrt{S}$!
- sampling is good for rough estimates, not for precise calculations
Always think of other options before trying to sample!

Samples from a probability distribution can be used to estimate expectations, roughly, without having to design an elaborate integration algorithm.
How do we generate samples from p(x)?

What is a Random Number? What is a sequence of random numbers?
Candidate sequences:
- dice, doubled
- 41-70th digits of π
- digits of $\tfrac{1}{2}(1 + \sqrt{5})$
- von Neumann method, seeded
- bits from G. Marsaglia's diehard CD
For use in Monte Carlo, the important property is freedom from patterns (because it implies anytime unbiasedness). In fact, use for MC only really requires the right density; disorder is just helpful for the argument.
For use in cryptography, the important property is unpredictability.
Randomness is a philosophically dodgy concept. Uncertainty is a much clearer idea (see first lectures).
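The "von Neumann method" in the list above presumably refers to the middle-square generator: square the current number and keep its middle digits as the next number. A minimal sketch follows; the seed 1234 and the 4-digit width are hypothetical choices, since the seed shown on the slide did not survive.

```python
def middle_square(seed, count, digits=4):
    """von Neumann's middle-square generator.

    Squares the current state, zero-pads the square to 2 * digits characters,
    and keeps the middle `digits` characters as the next state.
    """
    values = []
    state = seed
    for _ in range(count):
        squared = str(state * state).zfill(2 * digits)
        start = digits // 2
        state = int(squared[start:start + digits])
        values.append(state)
    return values

# Hypothetical seed, chosen only for illustration.
print(middle_square(1234, 5))
```

The sequence is entirely deterministic and eventually falls into short cycles, which fits the slide's point: such numbers are predictable, and what matters for Monte Carlo is how little structure they have.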

Turning uniform samples into other distributions: inverse transform sampling
[Figure: p(x) over x, with a uniform draw u]
- requires the cumulative distribution $P(x) = \int_{-\infty}^{x} p(\tilde{x})\,d\tilde{x}$
- requires solving $x(u) = P^{-1}(u)$, for $u \sim \mathrm{Uniform}[0, 1]$
- tricky for multivariate distributions

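Some special cases: sampling from an exponential distribution is analytic
[Figure: p(x) and its integral over x, with a uniform draw u]
$$p(x) = \frac{1}{\lambda}e^{-x/\lambda}, \qquad \int_0^x p(\tilde{x})\,d\tilde{x} = 1 - e^{-x/\lambda}$$
$$1 - u = 1 - e^{-x/\lambda} \quad\Rightarrow\quad x = -\lambda \log(u)$$
Setting the cumulative distribution equal to $1 - u$ instead of $u$ is allowed because $1 - u$ is uniform on $[0, 1]$ whenever $u$ is.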
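The closed-form inverse above translates directly into code. This is a minimal sketch; the value $\lambda = 2$ is an arbitrary choice for illustration.

```python
import math
import random

def sample_exponential(lam):
    """Draw x with density p(x) = (1/lam) * exp(-x/lam) via x = -lam * log(u)."""
    u = 1.0 - random.random()  # uniform on (0, 1], which avoids log(0)
    return -lam * math.log(u)

random.seed(0)  # fixed seed, only for reproducibility
lam = 2.0       # illustrative scale parameter
samples = [sample_exponential(lam) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)  # should be close to the mean of the distribution, lam
```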

Some special cases: the Box-Muller method for sampling from a Gaussian
- draw a squared radius: $r^2 = -2\log u_1$, with $p(r^2) = \tfrac{1}{2}\exp\left(-\tfrac{r^2}{2}\right)$
- draw an angle uniformly on the unit circle: $\phi = 2\pi u_2$, with $p(\phi) = 1/(2\pi)$
- write in Euclidean coordinates:
$$z_1 = r\cos\phi = \sqrt{-2\log u_1}\,\cos(2\pi u_2), \qquad z_2 = r\sin\phi = \sqrt{-2\log u_1}\,\sin(2\pi u_2)$$
$$p(r^2, \phi)\,d(r^2)\,d\phi = \frac{1}{2}\exp\left(-\frac{r^2}{2}\right)d(r^2)\,\frac{1}{2\pi}\,d\phi = \frac{1}{2\pi}\exp\left(-\frac{1}{2}r^2\right)r\,dr\,d\phi = \frac{1}{2\pi}\exp\left(-\frac{1}{2}(z_1^2 + z_2^2)\right)dz_1\,dz_2 = \mathcal{N}(z_1; 0, 1)\,\mathcal{N}(z_2; 0, 1)$$
Requires 1 uniform random number, 1 sin, 0.5 sqrt, 0.5 log per draw.
GEP Box, ME Muller, Note on the Generation of Random Normal Deviates, Ann. Math. Stat.
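The three steps of the Box-Muller construction can be sketched as follows. The function name and the sample count are illustrative assumptions.

```python
import math
import random

def box_muller():
    """Turn two independent uniform draws into two independent N(0, 1) draws."""
    u1 = 1.0 - random.random()  # uniform on (0, 1], which avoids log(0)
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))  # radius, from r^2 = -2 log(u1)
    phi = 2.0 * math.pi * u2            # angle, uniform on the unit circle
    return r * math.cos(phi), r * math.sin(phi)

random.seed(0)  # fixed seed, only for reproducibility
zs = []
for _ in range(50_000):
    zs.extend(box_muller())

mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / len(zs)
print(mean, var)  # should be close to 0 and 1
```

Each call consumes two uniforms and returns two Gaussian draws, which is where the "0.5 sqrt, 0.5 log per draw" accounting on the slide comes from.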

The Box-Muller method, graphically, and why it requires two independent samples
[Figure: p(r) over r, with the coordinates r cos(φ) and r sin(φ)]

Specialised direct sampling methods exist
It is worth thinking about hard problems if they are important. For example:
Ziggurat Method: For monotonically decaying pdfs: approximate the pdf with an upper and a lower bound step function, and only spend computational effort on the difficult cases in between (currently the best way to draw normal samples).
G. Marsaglia; W.W. Tsang. The Ziggurat Method for Generating Random Variables. J of Statistical Software 5 (8), 2000
Subgroup Algorithm: To generate uniformly random elements from a compact group, iteratively draw in an expanding chain of subgroups by drawing uniformly from the cosets (e.g. used to draw rotation or permutation matrices at random).
P. Diaconis; M. Shahshahani. The subgroup algorithm for generating uniform random variables. Probability in the Engineering and Informational Sciences, 1987

- Samples from a probability distribution can be used to estimate expectations, roughly.
- Random numbers don't really need to be unpredictable, as long as they have as little structure as possible.
- Uniformly distributed random numbers can be transformed into other distributions. This can be done numerically efficiently in some cases, and it is worth thinking about doing so.
What do we do if we don't know a good transformation?

Rejection Sampling: a simple method [Georges-Louis Leclerc, Comte de Buffon]
- for any $p(x) = \tilde{p}(x)/Z$ (normalizer $Z$ not required)
- choose $q(x)$ s.t. $c\,q(x) \geq \tilde{p}(x)$
- draw $s \sim q(x)$, $u \sim \mathrm{Uniform}[0, c\,q(s)]$
- reject if $u > \tilde{p}(s)$
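The steps above can be sketched in a few lines. The target and proposal here are assumptions chosen so that the bound can be verified by hand: $\tilde{p}(x) = \exp(-x^2/2)$ (an unnormalized standard Gaussian) and $q = \mathcal{N}(0, 2^2)$, for which $c = 2\sqrt{2\pi}$ gives $c\,q(x) = \exp(-x^2/8) \geq \tilde{p}(x)$.

```python
import math
import random

SIGMA_Q = 2.0                            # proposal standard deviation (assumed)
C = SIGMA_Q * math.sqrt(2.0 * math.pi)   # smallest c with c * q(x) >= p_tilde(x)

def p_tilde(x):
    """Unnormalized target: a standard Gaussian without its 1/sqrt(2 pi)."""
    return math.exp(-0.5 * x * x)

def q_pdf(x):
    """Proposal density N(x; 0, SIGMA_Q^2)."""
    return math.exp(-0.5 * (x / SIGMA_Q) ** 2) / (SIGMA_Q * math.sqrt(2.0 * math.pi))

def rejection_sample(num_proposals):
    """Return the accepted draws out of num_proposals proposals."""
    accepted = []
    for _ in range(num_proposals):
        s = random.gauss(0.0, SIGMA_Q)           # s ~ q(x)
        u = random.uniform(0.0, C * q_pdf(s))    # u ~ Uniform[0, c q(s)]
        if u <= p_tilde(s):                      # reject if u > p_tilde(s)
            accepted.append(s)
    return accepted

random.seed(0)  # fixed seed, only for reproducibility
num_proposals = 100_000
samples = rejection_sample(num_proposals)
acceptance = len(samples) / num_proposals
variance = sum(s * s for s in samples) / len(samples)
print(acceptance, variance)  # roughly 0.5 and 1 for this choice of p_tilde and q
```

About half of the proposals are thrown away even in this benign one-dimensional case.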

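The Problem with Rejection Sampling: the curse of dimensionality [MacKay, 29.3]
[Figure: p(x) under the envelope c q(x)]
Example:
$$p(x) = \mathcal{N}(x; 0, \sigma_p^2), \qquad q(x) = \mathcal{N}(x; 0, \sigma_q^2), \qquad \sigma_q > \sigma_p$$
- optimal $c = \dfrac{(2\pi\sigma_q^2)^{D/2}}{(2\pi\sigma_p^2)^{D/2}} = \exp\left(D \ln\dfrac{\sigma_q}{\sigma_p}\right)$
- acceptance rate is the ratio of volumes: $1/c$
- rejection rate rises exponentially in $D$
- for $\sigma_q/\sigma_p = 1.1$, $D = 100$: $1/c < 10^{-4}$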
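The acceptance rate $1/c = \exp(-D \ln(\sigma_q/\sigma_p))$ is easy to tabulate. The following sketch evaluates it for a few dimensions; the function name is an assumption.

```python
import math

def acceptance_rate(sigma_ratio, dim):
    """Acceptance rate 1/c = exp(-D * ln(sigma_q / sigma_p)) for the optimal c."""
    return math.exp(-dim * math.log(sigma_ratio))

# sigma_q / sigma_p = 1.1 as in the example; the dimensions are illustrative.
for dim in (1, 10, 100, 1000):
    print(dim, acceptance_rate(1.1, dim))
```

Even a proposal that is only 10% too wide per dimension accepts fewer than one in ten thousand draws at $D = 100$.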

Importance Sampling: a slightly less simple method
Computing $\tilde{p}(x)$, $q(x)$, then throwing them away seems wasteful. Instead, rewrite (assume $q(x) > 0$ if $p(x) > 0$):
$$\phi = \int f(x)\,p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}\,q(x)\,dx \approx \frac{1}{S}\sum_s f(x_s)\frac{p(x_s)}{q(x_s)} \qquad \text{if } x_s \sim q(x)$$
This is just using a new function $g(x) = f(x)p(x)/q(x)$, so it is an unbiased estimator.
If the normalization is unknown, can also use $\tilde{p}(x) = Z\,p(x)$:
$$\int f(x)\,p(x)\,dx \approx \frac{1}{Z}\frac{1}{S}\sum_s f(x_s)\frac{\tilde{p}(x_s)}{q(x_s)} \approx \frac{1}{S}\sum_s f(x_s)\frac{\tilde{p}(x_s)/q(x_s)}{\frac{1}{S}\sum_{s'}\tilde{p}(x_{s'})/q(x_{s'})} = \sum_s f(x_s)\,\tilde{w}_s$$
This is consistent, but biased, because of double estimation.
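Both estimators can be compared in a short sketch. The setup is an assumption for illustration: target $p = \mathcal{N}(0, 1)$ with $\tilde{p}(x) = \exp(-x^2/2)$, proposal $q = \mathcal{N}(0, 2^2)$, and $f(x) = x^2$, so the exact answer is 1.

```python
import math
import random

SIGMA_Q = 2.0  # proposal standard deviation (assumed for illustration)

def p_tilde(x):
    """Unnormalized target density; the true normalizer is Z = sqrt(2 pi)."""
    return math.exp(-0.5 * x * x)

def p_pdf(x):
    """Normalized target density N(x; 0, 1)."""
    return p_tilde(x) / math.sqrt(2.0 * math.pi)

def q_pdf(x):
    """Proposal density N(x; 0, SIGMA_Q^2)."""
    return math.exp(-0.5 * (x / SIGMA_Q) ** 2) / (SIGMA_Q * math.sqrt(2.0 * math.pi))

def importance_estimates(f, num_samples):
    """Return (unbiased estimate using p, self-normalized estimate using p_tilde)."""
    xs = [random.gauss(0.0, SIGMA_Q) for _ in range(num_samples)]  # x_s ~ q
    plain = sum(f(x) * p_pdf(x) / q_pdf(x) for x in xs) / num_samples
    raw_weights = [p_tilde(x) / q_pdf(x) for x in xs]
    total = sum(raw_weights)  # estimates S * Z
    self_normalized = sum(f(x) * w / total for x, w in zip(xs, raw_weights))
    return plain, self_normalized

random.seed(0)  # fixed seed, only for reproducibility
plain, self_normalized = importance_estimates(lambda x: x * x, 50_000)
print(plain, self_normalized)  # both should be close to 1
```

The self-normalized version never uses $Z$: the same weights estimate both the numerator and the normalizer, which is the "double estimation" that makes it biased but consistent.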

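What's wrong with Importance Sampling? The curse of dimensionality, revisited
- recall that $\operatorname{var}\hat\phi = \operatorname{var}(f)/S$
- importance sampling replaces $\operatorname{var}(f)$ with $\operatorname{var}(g) = \operatorname{var}\left(f\,\frac{p}{q}\right)$
- $\operatorname{var}\left(f\,\frac{p}{q}\right)$ can be very large if $q \ll p$ somewhere
- in many dimensions, usually $q \ll p$ all but everywhere!
- if $p$ has undiscovered islands, some samples have extremely large weights $p(x)/q(x)$
[Figure: f(x) and g(x) over x; sample count]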

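Marginalizing Hyperparameters of a Gaussian process regression model
[Figure: data y over x]
Assume $p(f \mid \lambda) = \mathcal{GP}(0, k(\lambda))$ and $p(y \mid f(x), \sigma) = \mathcal{N}(y; f_x, \sigma^2)$, with kernel $k(x, x', \lambda) = \exp\left(-\tfrac{1}{2}(x - x')^2/\lambda^2\right)$.
We would like
$$p(f \mid y) = \int p(f \mid y, \sigma, \lambda)\,p(\sigma, \lambda)\,d\sigma\,d\lambda = \int \mathcal{GP}(f; \mu_{y,x,\sigma,\lambda}, k_{x,\sigma,\lambda})\,p(\sigma, \lambda)\,d\sigma\,d\lambda$$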

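Marginalizing Hyperparameters of a Gaussian process regression model
[Figure: log likelihood over λ and σ]
Log likelihood:
$$\log p(y \mid \sigma, \lambda) = -\frac{1}{2}\,y^\top\left(k_{xx}(\lambda) + \sigma^2 I\right)^{-1}y - \frac{1}{2}\log\left|k_{xx}(\lambda) + \sigma^2 I\right| - \frac{N}{2}\log 2\pi$$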
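The log likelihood above can be evaluated with a Cholesky factorization of $k_{xx}(\lambda) + \sigma^2 I$. The sketch below is a pure-Python illustration under stated assumptions: the hand-written `cholesky` helper, the function names, and the toy data are all hypothetical, and practical code would normally use a linear algebra library instead.

```python
import math

def rbf_kernel(x1, x2, lam):
    """Kernel k(x, x', lam) = exp(-(x - x')^2 / (2 lam^2))."""
    return math.exp(-0.5 * (x1 - x2) ** 2 / lam ** 2)

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def log_marginal_likelihood(xs, ys, lam, sigma):
    """log p(y | sigma, lam) for a zero-mean GP with Gaussian noise."""
    n = len(xs)
    gram = [[rbf_kernel(xs[i], xs[j], lam) + (sigma ** 2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    low = cholesky(gram)
    # Forward substitution solves L z = y, so y^T K^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((ys[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)

# Toy data, invented for this sketch.
xs = [0.0, 1.0, 2.5]
ys = [0.3, -0.1, 0.8]
print(log_marginal_likelihood(xs, ys, lam=1.0, sigma=0.5))
```

Evaluating this function at hyperparameter samples $(\lambda_i, \sigma_i)$ is one way to obtain weights for an importance-sampling approximation of the marginal over hyperparameters.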

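Marginalizing Hyperparameters of a Gaussian process regression model
[Figure: proposal distribution over λ and σ]
Proposal distribution:
$$p(\sigma, \lambda) = \mathcal{N}\left[\begin{pmatrix}\lambda \\ \sigma\end{pmatrix}; (\ldots), (0.2\,\ldots)\right]\mathbb{I}(\sigma > 0)/Z \qquad \text{with } Z = \ldots$$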

Marginalizing Hyperparameters of a Gaussian process regression model
[Figures: f(x) over x, one weighted posterior per hyperparameter sample]
$w_1\,p(f \mid \lambda_1, \sigma_1, x, y)$, then $w_2\,p(f \mid \lambda_2, \sigma_2, x, y)$, then $w_3\,p(f \mid \lambda_3, \sigma_3, x, y)$, and finally
$$\sum_i w_i\,p(f \mid \lambda_i, \sigma_i, x, y) \approx p(f \mid x, y)$$

Summary: Sampling (Monte Carlo) Methods
Sampling is a way of performing rough probabilistic computations, in particular for expectations (including marginalization).
- Samples from a probability distribution can be used to estimate expectations, roughly.
- Random numbers don't really need to be unpredictable, as long as they have as little structure as possible.
- Uniformly distributed random numbers can be transformed into other distributions. This can be done numerically efficiently in some cases, and it is worth thinking about doing so.
- Rejection sampling is a primitive but exact method that works with intractable models.
- Importance sampling makes more efficient use of samples, but can have high variance (and this may not be obvious).
In two weeks: Markov Chain Monte Carlo methods are more elaborate ways of getting approximate answers to intractable problems.

Next Week: guest lecture on causality by Dominik Janzing (MPI IS).
