Searching for the Principles of Reasoning and Intelligence

1 Searching for the Principles of Reasoning and Intelligence Shakir Mohamed DALI

2 Statistical Operations: Estimation and Learning, Inference, Hypothesis Testing, Summarisation, Comparison, Modelling, Data Enumeration, Experimental Design. Efron, 1981 Wald Lecture.

3 The core questions of AGI are those of probabilistic inference.
Autonomous Systems.
Triple Aims of Healthcare: (1) better clinical outcomes, (2) enhance patient and clinician experience, (3) reduce costs.
Fair and safe ML.

4 Inferential Questions. Probabilistic dexterity is needed to solve the fundamental problems of machine learning and artificial intelligence.
Evidence Estimation: $p(x) = \int p(x, z)\,dz$
Moment Computation: $\mathbb{E}[f(z) \mid x] = \int f(z)\, p(z \mid x)\,dz$
Parameter Estimation. Prediction. Planning.
Hypothesis Testing: $B = \log p(x \mid H_1) - \log p(x \mid H_2)$
Experimental Design.
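As a rough illustration of evidence estimation, the sketch below approximates $p(x) = \int p(x \mid z)\, p(z)\,dz$ by simple Monte Carlo. The toy model ($z \sim N(0, 1)$, $x \mid z \sim N(z, 1)$) and all function names are assumptions chosen for this example, not something taken from the slides; the model is picked only because its evidence has the closed form $N(x; 0, 2)$ to compare against.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def evidence_monte_carlo(x, num_samples=50_000, seed=0):
    """Estimate p(x) by averaging p(x | z) over samples z drawn from the prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)         # assumed prior: z ~ N(0, 1)
        total += normal_pdf(x, z, 1.0)  # assumed likelihood: x | z ~ N(z, 1)
    return total / num_samples


def evidence_exact(x):
    """Closed form for this toy model only: p(x) = N(x; 0, 2)."""
    return normal_pdf(x, 0.0, 2.0)


x_obs = 1.0
estimate = evidence_monte_carlo(x_obs)
exact = evidence_exact(x_obs)
print(f"Monte Carlo: {estimate:.4f}, exact: {exact:.4f}")
```

Sampling from the prior is the simplest possible estimator and tends to degrade quickly in higher dimensions, which is one motivation for the variational methods on the following slides.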

5 Latent Variable Models. Introduce an unobserved random variable for every observed data point to explain hidden causes.
Prescribed models: Use observer likelihoods and assume observation noise.
Implicit models: Likelihood-free or simulation-based models.
Diggle and Gratton (1984); Mohamed and Lakshminarayanan (2016)

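6 Variational Inference. An approximation class $q_\phi(z)$ is fitted to the true posterior by reducing $\mathrm{KL}[q(z \mid y) \,\|\, p(z \mid y)]$.
Learning principle: Model Evidence, $p(x) = \int p(x, z)\,dz$
$\mathcal{F}(x, q) = \mathbb{E}_{q(z)}[\log p(x \mid z)] - \mathrm{KL}[q(z) \,\|\, p(z)]$
Here $q(z)$ is the approximate posterior, the expectation is the reconstruction term, and the KL term is the penalty.
Q: How can we turn variational inference into a generic tool for inference?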
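A minimal sketch of the bound $\mathcal{F}(x, q) = \mathbb{E}_{q(z)}[\log p(x \mid z)] - \mathrm{KL}[q(z) \,\|\, p(z)]$, assuming a toy model that is not from the slides: $z \sim N(0, 1)$, $x \mid z \sim N(z, 1)$, with a Gaussian $q(z) = N(m, s^2)$. In this assumed setting both terms are available in closed form, so the code can check that the bound sits below $\log p(x)$ and touches it when $q$ is the true posterior $N(x/2, 1/2)$.

```python
import math


def log_evidence(x):
    """Exact log p(x) for the assumed toy model, where p(x) = N(x; 0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0


def elbo(x, m, s2):
    """F(x, q) for q(z) = N(m, s2): reconstruction term minus KL penalty."""
    # E_q[log N(x; z, 1)] = -0.5 log(2 pi) - 0.5 ((x - m)^2 + s2)
    reconstruction = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL[N(m, s2) || N(0, 1)] = 0.5 (s2 + m^2 - 1 - log s2)
    penalty = 0.5 * (s2 + m ** 2 - 1.0 - math.log(s2))
    return reconstruction - penalty


x_obs = 1.0
bound_at_posterior = elbo(x_obs, m=x_obs / 2.0, s2=0.5)  # q is the true posterior
bound_at_prior = elbo(x_obs, m=0.0, s2=1.0)              # q is the prior
print(bound_at_prior, bound_at_posterior, log_evidence(x_obs))
```

The gap between the bound and $\log p(x)$ is the KL divergence from $q$ to the true posterior, so it vanishes only when the approximation class contains the posterior.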

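7 Amortised Inference. $\mathcal{F}(x, q) = \mathbb{E}_{q(z)}[\log p(x \mid z)] - \mathrm{KL}[q(z) \,\|\, p(z)]$
Alternating optimisation for the variational parameters and then the model parameters (VEM). Repeat:
E-step (variational parameters): $\phi \propto \nabla_\phi \mathcal{F}(x, q)$
M-step (model parameters): $\theta \propto \nabla_\theta \mathcal{F}(x, q)$
[Figure: $\mathcal{F}(x, q)$ approaches $\log p(x)$ from initialisation at $t = 1$ to convergence; the remaining gap is $\mathrm{KL}[q \,\|\, p]$.]
Rezende, Mohamed, Wierstra (2014)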

8 Amortised Inference. Repeat:
E-step (compute q), for $n = 1, \dots, N$: $\phi_n \propto \nabla_\phi \mathbb{E}_{q_\phi(z)}[\log p_\theta(x_n \mid z_n)] - \nabla_\phi \mathrm{KL}[q(z_n) \,\|\, p(z)]$
M-step: $\theta \propto \frac{1}{N} \sum_n \mathbb{E}_{q_\phi(z)}[\nabla_\theta \log p_\theta(x_n \mid z_n)]$
Instead of solving for every observation, amortise using a model.
Inference network: q is an encoder, an inverse model, a recognition model.
Parameters of q are now a set of global parameters used for inference of all data points, test and train.
Amortise (spread) the cost of inference over all data.
Joint optimisation of variational and model parameters.
[Figure: an inference network $q(z \mid x)$ maps data $x$ to samples $z \sim q(z \mid x)$.]
Inference networks provide an efficient mechanism for posterior inference with memory. Rezende, Mohamed, Wierstra (2014)
Q: How to understand correctness, design principles, missing data?
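The idea of replacing per-observation variational parameters with one shared inference network can be sketched on a toy problem. Everything below is an assumption made for illustration: the model ($z \sim N(0, 1)$, $x \mid z \sim N(z, 1)$) is held fixed, and the "network" is the two-parameter map $q(z \mid x) = N(a x, e^{r})$. For this model the gradients of the averaged bound have closed forms, and the optimum is $a = 1/2$ with variance $1/2$, which matches the exact posterior.

```python
import math


def fit_inference_network(data, steps=500, lr=0.1):
    """Gradient ascent on the average bound for q(z | x) = N(a * x, exp(r))."""
    a, r = 0.0, 0.0  # global variational parameters shared by every data point
    mean_x2 = sum(x * x for x in data) / len(data)
    for _ in range(steps):
        grad_a = mean_x2 * (1.0 - 2.0 * a)  # d/da of the bound averaged over data
        grad_r = 0.5 - math.exp(r)          # d/dr of the bound, variance = exp(r)
        a += lr * grad_a
        r += lr * grad_r
    return a, math.exp(r)


def encode(x, a, variance):
    """Amortised posterior for any x, seen or unseen: (mean, variance) of q(z | x)."""
    return a * x, variance


train = [-2.0, -0.5, 0.3, 1.0, 1.7]
a, variance = fit_inference_network(train)
test_mean, test_variance = encode(3.0, a, variance)  # no per-point optimisation
print(a, variance, test_mean)
```

The point of the sketch is the last call: once the shared parameters are fitted, inference for a new observation is a single function evaluation rather than a fresh optimisation.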

9 Posterior Approximation. Q: General-purpose posterior approximation? High dimensions? Hierarchical models?
Families of posterior approximations: normalising flows, structured mean-field, covariance models, auxiliary variables, mixtures, fully-factorised.
Most expressive (true posterior): $q^*(z \mid x) \propto p(x \mid z)\, p(z)$
Least expressive (fully-factorised): $q_{MF}(z \mid x) = \prod_k q(z_k)$
Rezende and Mohamed (2015)

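10 Stochastic Optimisation. Target gradient: $\nabla_\phi \mathbb{E}_{q_\phi(z)}[f_\theta(z)] = \nabla_\phi \int q_\phi(z)\, f_\theta(z)\,dz$
Pathwise estimator: when an easy-to-use transformation is available and $f$ is differentiable. With $z = g(\epsilon, \phi)$ and $\epsilon \sim p(\epsilon)$, for example $z = \mu + R\epsilon$:
$= \mathbb{E}_{p(\epsilon)}[\nabla_\phi f_\theta(g(\epsilon, \phi))]$
Score-function estimator: when $f$ is non-differentiable and $q(z)$ is easy to sample from:
$= \mathbb{E}_{q(z)}[f_\theta(z)\, \nabla_\phi \log q_\phi(z)]$
Q: New estimators, probabilistic programming, variance properties.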
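The two estimators can be compared on a case where the answer is known. The sketch below assumes $q(z) = N(\mu, \sigma^2)$ and the test function $f(z) = z^2$, neither of which comes from the slides; for that choice $\nabla_\mu \mathbb{E}_q[f(z)] = 2\mu$ exactly, so both Monte Carlo estimates should land near $2\mu$.

```python
import random


def f(z):
    """Assumed test function; differentiable, so both estimators apply."""
    return z * z


def pathwise_gradient(mu, sigma, num_samples, rng):
    """Average of d f(mu + sigma * eps) / d mu over eps ~ N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps  # reparameterisation z = g(eps, phi)
        total += 2.0 * z      # f'(z) * dz/dmu, where dz/dmu = 1
    return total / num_samples


def score_function_gradient(mu, sigma, num_samples, rng):
    """Average of f(z) * d log q(z) / d mu over z ~ N(mu, sigma^2)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)
        score = (z - mu) / sigma ** 2  # d/dmu of log N(z; mu, sigma^2)
        total += f(z) * score
    return total / num_samples


mu, sigma, n = 1.5, 1.0, 20_000
rng = random.Random(0)
grad_pathwise = pathwise_gradient(mu, sigma, n, rng)
grad_score = score_function_gradient(mu, sigma, n, rng)
print(grad_pathwise, grad_score, 2.0 * mu)
```

In this example the score-function estimate has a noticeably larger per-sample variance than the pathwise one, which is why it is usually paired with variance-reduction techniques; the score-function form remains the only option of the two when $f$ cannot be differentiated.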

11 Estimation-by-Comparison. For some models, we only have access to an unnormalised probability or partial knowledge of the distribution.
We compare the estimated distribution $q(x)$ to the true distribution $p^*(x)$ using samples.
Learning principle: Two-sample tests.
Ratios: $\frac{p(x^{(1)})}{p(x^{(2)})}$, and $\frac{p^*(x)}{q(x)} = 1$ when $p^*(x) = q(x)$.
Interest is not in estimating the marginal probabilities, only in how they are related.

12 Density Estimation by Comparison. Test $H_0: p = q$ vs $H_1: p \neq q$ with a loss $\mathcal{L}(\theta, \phi)$.
Density difference, $r = p - q$: Max Mean Discrepancy, Moment Matching (figure: mixtures with identical moments).
Density ratio, $r = p / q$: Bregman Divergence $B_f[r^* \,\|\, r]$, Class Probability Estimation, f-divergence with $f(u) = u \log u - (u + 1) \log(u + 1)$.
Comparison: Use a test or comparison to tell how simulated data differs from observed data.
Estimation: Adjust the model to match the data distribution using the comparison.
Mohamed and Lakshminarayanan (2016)
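As one concrete instance of a density-difference comparison, the sketch below computes a biased (V-statistic) estimate of the squared maximum mean discrepancy between two sample sets. The Gaussian kernel, its bandwidth, the sample sizes and the two simulators are all assumptions made for this example rather than choices stated in the slides.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-0.5 * (a - b) ** 2 / bandwidth ** 2)


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2: mean k(x, x') + mean k(y, y') - 2 mean k(x, y)."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)
observed = [rng.gauss(0.0, 1.0) for _ in range(200)]
simulated_good = [rng.gauss(0.0, 1.0) for _ in range(200)]  # same distribution
simulated_bad = [rng.gauss(2.0, 1.0) for _ in range(200)]   # shifted mean

mmd_good = mmd_squared(observed, simulated_good)
mmd_bad = mmd_squared(observed, simulated_bad)
print(mmd_good, mmd_bad)
```

The comparison step only needs samples from each side, which is what makes it usable with implicit models; the estimation step would then adjust the simulator to push this discrepancy down.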

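13 Density-ratio Estimation. $\frac{p^*(x)}{q(x)} = \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}$, with $p(y = +1 \mid x) = D_\theta(x)$
$\mathcal{F}(x, \theta, \phi) = \mathbb{E}_{p^*(x)}[\log D_\theta(x)] + \mathbb{E}_{q_\phi(x)}[\log(1 - D_\theta(x))]$
Alternating optimisation: $\min_\phi \max_\theta \mathcal{F}(x, \theta, \phi)$
Instances of testing and inference: unsupervised-as-supervised learning; classifier ABC; noise-contrastive estimation; adversarial learning, GANs.
Generative Adversarial Networks.
Comparison: $\theta \propto \nabla_\theta \mathbb{E}_{p^*(x)}[\log D_\theta(x)] + \nabla_\theta \mathbb{E}_{q_\phi(x)}[\log(1 - D_\theta(x))]$
Estimation: $\phi \propto -\nabla_\phi \mathbb{E}_{q(z)}[\log(1 - D_\theta(f_\phi(z)))]$
[Figure: latent samples $z$ are mapped by $f_\phi(z)$ to generated data $x_{gen}$, which a classifier compares with real data.]
Mohamed and Lakshminarayanan (2016); Rosca et al. (2017)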
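The classifier view of the density ratio can be checked on a case with a known answer. The sketch below is an assumption-laden toy, not the slides' setup: real data come from $N(1, 1)$, generated data from $N(0, 1)$, and the classifier is a one-dimensional logistic regression trained by full-batch gradient ascent on $\mathbb{E}_{p^*}[\log D(x)] + \mathbb{E}_{q}[\log(1 - D(x))]$. With equal sample sizes the classifier's logit estimates $\log p^*(x)/q(x)$, which for these two Gaussians is exactly $x - 0.5$.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_classifier(real, generated, steps=300, lr=0.5):
    """Logistic regression D(x) = sigmoid(w * x + b), with label +1 for real data."""
    w, b = 0.0, 0.0
    n = len(real) + len(generated)
    for _ in range(steps):
        grad_w, grad_b = 0.0, 0.0
        for x in real:        # gradient of log D(x)
            err = 1.0 - sigmoid(w * x + b)
            grad_w += err * x
            grad_b += err
        for x in generated:   # gradient of log(1 - D(x))
            err = -sigmoid(w * x + b)
            grad_w += err * x
            grad_b += err
        w += lr * grad_w / n  # ascent step on the classification objective
        b += lr * grad_b / n
    return w, b


def log_ratio(x, w, b):
    """Estimated log p*(x) / q(x), read off as the classifier's logit."""
    return w * x + b


rng = random.Random(0)
real = [rng.gauss(1.0, 1.0) for _ in range(2000)]
generated = [rng.gauss(0.0, 1.0) for _ in range(2000)]
w, b = fit_classifier(real, generated)
print(w, b)  # compare with the exact log-ratio x - 0.5, i.e. w = 1, b = -0.5
```

This covers only the comparison step with a fixed generator; in adversarial training the generator parameters would be updated in alternation with the classifier.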

14 Method-of-Moments. Moment estimator. Tangent of posterior odds.
[Figure: a model $g(z)$ with loss $\mathcal{L}_G$ produces generated data; a moment network $f(x)$ with loss $\mathcal{L}_M$ is applied to real and generated data, and the moment vector is built from its layer activations $h_l$ and gradients $\nabla f$.]
Consistent estimators: the number of moments is greater than the number of model parameters.
Features should not be co-linear.
More stable than adversarial training.
Does not require frequent updating of the classifier.
Ravuri et al. (2018)
Q: Right type of feature functions? Connection between adversarial training and statistical efficiency?

15 Convergent Approaches. Q: Scale to higher dimensions? Meaningful evaluation? Better samples?
Replace density ratios by classifiers, replace posteriors with implicit models, view as optimal transport in primal cases, and connections to integral probability metrics.
Bellemare et al. (2017), Rosca et al. (2018)

16 [Summary figure with panels labelled Inference, Summarisation, Comparison and Data Enumeration, reusing the variational inference, inference network, real-versus-generated comparison and moment network diagrams from the earlier slides.]
shakir@deepmind.com

17 Referenced in slides
Efron, B. "Maximum likelihood and decision theory." The Annals of Statistics, Jun 1.
Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. "Stochastic backpropagation and approximate inference in deep generative models." ICML 2014.
Rezende, Danilo Jimenez, and Shakir Mohamed. "Variational inference with normalizing flows." ICML.
Mohamed, S., and Lakshminarayanan, B. "Learning in implicit generative models."
Rosca, Mihaela, Balaji Lakshminarayanan, and Shakir Mohamed. "Distribution Matching in Variational Inference." 2018.

Other Important References
Frey, Brendan J., and Geoffrey E. Hinton. "Variational learning in nonlinear Gaussian belief networks." Neural Computation 11, no. 1 (1999).
Kingma, Durk, and Max Welling. "Auto-encoding Variational Bayes." ICLR (2014).
Ranganath, Rajesh, Sean Gerrish, and David M. Blei. "Black Box Variational Inference." In AISTATS.
Mnih, Andriy, and Karol Gregor. "Neural variational inference and learning in belief networks." arXiv preprint (2014).
Lázaro-Gredilla, Miguel. "Doubly stochastic variational Bayes for non-conjugate inference." (2014).
Wingate, David, and Theophane Weber. "Automated variational inference in probabilistic programming." arXiv preprint (2013).
Paisley, John, David Blei, and Michael Jordan. "Variational Bayesian inference with stochastic search." arXiv preprint (2012).
Glasserman, Paul. Monte Carlo Methods in Financial Engineering. 2003.
Fu, Michael C. "Gradient estimation." Handbooks in Operations Research and Management Science, 2006.
Fan, K., Wang, Z., Beck, J., Kwok, J., and Heller, K. A. "Fast second order stochastic backpropagation for variational inference." In Advances in Neural Information Processing Systems, 2015.
Dayan, Peter, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. "The Helmholtz machine." Neural Computation 7, no. 5 (1995).
Gershman, Samuel J., and Noah D. Goodman. "Amortized inference in probabilistic reasoning." In Proceedings of the 36th Annual Conference of the Cognitive Science Society.
Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. "DRAW: A recurrent neural network for image generation." ICML (2015).
Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. "Auxiliary deep generative models." ICML 2016.
Tabak, E. G., and Cristina V. Turner. "A family of nonparametric density estimation algorithms." Communications on Pure and Applied Mathematics 66, no. 2 (2013).
Kingma, D. P., Salimans, T., and Welling, M. "Improving variational inference with inverse autoregressive flow." arXiv preprint.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. "Density estimation using Real NVP." arXiv preprint.
Diggle, P. J., and Gratton, R. J. "Monte Carlo methods of inference for implicit statistical models." Journal of the Royal Statistical Society, Series B (Methodological), Jan 1.
Sugiyama, Masashi, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." In Advances in Neural Information Processing Systems.
Gutmann, M. U., Dutta, R., Kaski, S., and Corander, J. "Likelihood-free inference via classification." Statistics and Computing, Mar 13: 1-5.
Nowozin, S., Cseke, B., and Tomioka, R. "f-GAN: Training generative neural samplers using variational divergence minimization." In Advances in Neural Information Processing Systems, 2016.
Friedman, J., Hastie, T., and Tibshirani, R. The Elements of Statistical Learning, section on "Unsupervised as supervised learning." New York: Springer Series in Statistics.
