Building an Automatic Statistician


1 Building an Automatic Statistician
Zoubin Ghahramani, Department of Engineering, University of Cambridge
ALT-Discovery Science Conference, October 2014
With James Robert Lloyd (Cambridge), David Duvenaud (Cambridge, now Harvard), Roger Grosse (MIT, now Toronto), Josh Tenenbaum (MIT)

2 THERE IS A GROWING NEED FOR DATA ANALYSIS
- We live in an era of abundant data
- The McKinsey Global Institute claims: "The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data."
- Diverse fields increasingly rely on expert statisticians, machine learning researchers and data scientists, e.g.
  - computational sciences (e.g. biology, astronomy, ...)
  - online advertising
  - quantitative finance
  - ...

3 GOALS OF THE AUTOMATIC STATISTICIAN PROJECT
- Provide a set of tools for understanding data that require minimal expert input
- Uncover challenging research problems in e.g.
  - automated inference
  - model construction and comparison
  - data visualisation and interpretation
- Advance the field of machine learning in general

4 Background
- Probabilistic modelling
- Model selection and marginal likelihoods
- Bayesian nonparametrics
- Gaussian processes

5 Probabilistic Modelling
- A model describes data that one could observe from a system
- If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model...
- ...then inverse probability (i.e. Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions and learn from data.

6 Bayesian Machine Learning
Everything follows from two simple rules:
Sum rule: $P(x) = \sum_y P(x, y)$
Product rule: $P(x, y) = P(x)\, P(y \mid x)$
Learning:
$$P(\theta \mid D, m) = \frac{P(D \mid \theta, m)\, P(\theta \mid m)}{P(D \mid m)}$$
where $P(D \mid \theta, m)$ is the likelihood of parameters $\theta$ in model $m$, $P(\theta \mid m)$ is the prior probability of $\theta$, and $P(\theta \mid D, m)$ is the posterior of $\theta$ given data $D$.
Prediction: $P(x \mid D, m) = \int P(x \mid \theta, D, m)\, P(\theta \mid D, m)\, d\theta$
Model comparison: $P(m \mid D) = \frac{P(D \mid m)\, P(m)}{P(D)}$, where $P(D \mid m) = \int P(D \mid \theta, m)\, P(\theta \mid m)\, d\theta$
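As a minimal aside (not from the slides), the two rules can be checked numerically on a small discrete joint distribution; the numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical 2x2 joint distribution P(x, y); rows index x, columns index y.
P_joint = np.array([[0.10, 0.30],
                    [0.20, 0.40]])

P_x = P_joint.sum(axis=1)               # sum rule: P(x) = sum_y P(x, y)
P_y_given_x = P_joint / P_x[:, None]    # product rule: P(x, y) = P(x) P(y|x)

assert np.allclose(P_joint, P_x[:, None] * P_y_given_x)
print(P_x)          # [0.4 0.6]
print(P_y_given_x)  # each row is a conditional distribution summing to 1
```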

7 Model Comparison
[Figure: data fitted with polynomials of order M = 0, 1, 2, ..., 7]

8 Bayesian Occam's Razor and Model Comparison
Compare model classes, e.g. $m$ and $m'$, using posterior probabilities given $D$:
$$p(m \mid D) = \frac{p(D \mid m)\, p(m)}{p(D)}, \qquad p(D \mid m) = \int p(D \mid \theta, m)\, p(\theta \mid m)\, d\theta$$
Interpretations of the marginal likelihood ("model evidence"):
- The probability that randomly selected parameters from the prior would generate $D$.
- Probability of the data under the model, averaging over all possible parameter values.
- $\log_2 \frac{1}{p(D \mid m)}$ is the number of bits of surprise at observing data $D$ under model $m$.
Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.
[Figure: $P(D \mid m)$ plotted over all possible data sets of size $n$, for models that are too simple, "just right", and too complex]

9 Bayesian Model Comparison: Occam's Razor at Work
[Figure: polynomial fits for M = 0, ..., 7, alongside the model evidence $P(Y \mid M)$ as a function of M]
For example, for quadratic polynomials ($m = 2$): $y = a_0 + a_1 x + a_2 x^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and parameters $\theta = (a_0\ a_1\ a_2\ \sigma)$.
demo: polybayes
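A rough sketch of the idea behind the polybayes demo (my reconstruction, not the original demo code; the data, noise level and prior variance are invented): for each polynomial order the marginal likelihood is available in closed form, and it typically peaks near the true order.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 25)
y = 0.5 - x + 2 * x**2 + 0.1 * rng.standard_normal(x.size)  # true order is 2

sigma2, alpha = 0.1**2, 1.0   # noise variance, prior variance on weights
for M in range(8):
    Phi = np.vander(x, M + 1, increasing=True)   # features 1, x, ..., x^M
    # With w ~ N(0, alpha I): marginally y ~ N(0, alpha Phi Phi^T + sigma2 I)
    cov = alpha * Phi @ Phi.T + sigma2 * np.eye(x.size)
    log_ev = multivariate_normal(mean=np.zeros(x.size), cov=cov).logpdf(y)
    print(f"M = {M}: log evidence = {log_ev:.1f}")
# The evidence peaks near the true order: Occam's razor at work.
```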

10 Parametric vs Nonparametric Models
Parametric models assume some finite set of parameters $\theta$. Given the parameters, future predictions $x$ are independent of the observed data $D$:
$$P(x \mid \theta, D) = P(x \mid \theta)$$
so $\theta$ captures everything there is to know about the data. The complexity of the model is therefore bounded even if the amount of data is unbounded. This makes parametric models not very flexible.
Non-parametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters. But they can often be defined by assuming an infinite-dimensional $\theta$. Usually we think of $\theta$ as a function. The amount of information that $\theta$ can capture about the data $D$ can grow as the amount of data grows. This makes non-parametric models more flexible.

11 Bayesian Nonparametrics
A simple framework for modelling complex data. Nonparametric models can be viewed as having infinitely many parameters.
Examples of non-parametric models:

| Parametric | Non-parametric | Application |
|---|---|---|
| polynomial regression | Gaussian processes | function approx. |
| logistic regression | Gaussian process classifiers | classification |
| mixture models, k-means | Dirichlet process mixtures | clustering |
| hidden Markov models | infinite HMMs | time series |
| factor analysis / pPCA / PMF | infinite latent factor models | feature discovery |
| ... | | |

12 Nonlinear Regression and Gaussian Processes
Consider the problem of nonlinear regression: you want to learn a function $f$ with error bars from data $D = \{X, y\}$.
[Figure: noisy data points with a smooth fit and error bars]
A Gaussian process defines a distribution over functions $p(f)$ which can be used for Bayesian regression:
$$p(f \mid D) = \frac{p(f)\, p(D \mid f)}{p(D)}$$
Let $f = (f(x_1), f(x_2), \ldots, f(x_n))$ be an $n$-dimensional vector of function values evaluated at $n$ points $x_i \in X$. Note, $f$ is a random variable.
Definition: $p(f)$ is a Gaussian process if for any finite subset $\{x_1, \ldots, x_n\} \subset X$, the marginal distribution over that subset, $p(f)$, is multivariate Gaussian.
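A minimal numerical sketch of GP regression (assuming an SE kernel and made-up data); the posterior mean and variance follow from conditioning a multivariate Gaussian:

```python
import numpy as np

def se_kernel(a, b, ell=0.5, sf2=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors a and b."""
    d = a[:, None] - b[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

X = np.array([-1.5, -0.5, 0.3, 1.1])        # hypothetical training inputs
y = np.sin(3 * X)                           # hypothetical targets
Xs = np.linspace(-2, 2, 100)                # test inputs
sn2 = 0.01                                  # noise variance

K = se_kernel(X, X) + sn2 * np.eye(len(X))  # k(X, X) + noise
Ks = se_kernel(Xs, X)                       # k(X*, X)
mean = Ks @ np.linalg.solve(K, y)           # posterior mean of f(X*)
cov = se_kernel(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
std = np.sqrt(np.clip(np.diag(cov), 0, None))  # error bars (clip numerical noise)
```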

13 A Picture
[Diagram: a cube relating Linear Regression, Logistic Regression, Bayesian Linear Regression, Bayesian Logistic Regression, Kernel Regression, Kernel Classification, GP Regression and GP Classification, along three axes: Classification, Kernel, Bayesian]

14 Gaussian Process Covariance Functions (Kernels)
$p(f)$ is a Gaussian process if for any finite subset $\{x_1, \ldots, x_n\} \subset X$, the marginal distribution over that finite subset $p(f)$ has a multivariate Gaussian distribution.
Gaussian processes (GPs) are parameterized by a mean function, $\mu(x)$, and a covariance function, or kernel, $K(x, x')$.
$$p(f(x), f(x')) = N(\mu, \Sigma), \quad \text{where} \quad \mu = \begin{bmatrix} \mu(x) \\ \mu(x') \end{bmatrix}, \quad \Sigma = \begin{bmatrix} K(x, x) & K(x, x') \\ K(x', x) & K(x', x') \end{bmatrix}$$
and similarly for $p(f(x_1), \ldots, f(x_n))$, where now $\mu$ is an $n \times 1$ vector and $\Sigma$ is an $n \times n$ matrix.

15 Gaussian Process Covariance Functions
Gaussian processes (GPs) are parameterized by a mean function, $\mu(x)$, and a covariance function, $K(x, x')$, where $\mu(x) = E(f(x))$ and $K(x, x') = \mathrm{Cov}(f(x), f(x'))$.
An example covariance function:
$$K(x, x') = v_0 \exp\left( - \left( \frac{|x - x'|}{r} \right)^{\alpha} \right) + v_1 + v_2\, \delta_{ij}$$
with parameters $(v_0, v_1, v_2, r, \alpha)$. These kernel parameters are interpretable and can be learned from data:

| $v_0$ | signal variance |
| $v_1$ | variance of bias |
| $v_2$ | noise variance |
| $r$ | lengthscale |
| $\alpha$ | roughness |

Once the mean and covariance functions are defined, everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.
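For concreteness, a direct transcription of this example covariance into code (the parameter values are arbitrary, and the Kronecker delta is approximated by coincident inputs):

```python
import numpy as np

def example_kernel(x, xp, v0=1.0, v1=0.1, v2=0.01, r=0.5, alpha=2.0):
    """K(x, x') = v0 exp(-(|x - x'| / r)^alpha) + v1 + v2 * delta."""
    d = np.abs(x[:, None] - xp[None, :])
    K = v0 * np.exp(-(d / r) ** alpha) + v1   # signal variance + bias variance
    K = K + v2 * (d == 0.0)                   # noise only on coincident inputs
    return K
```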

16 Samples from GPs with Different $K(x, x')$
[Figure: twelve sample paths $f(x)$ drawn from GP priors with different covariance functions]
demo: gpdemogen
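In the spirit of gpdemogen (a sketch, not the original demo), prior sample paths can be drawn by Cholesky-factoring the covariance matrix; the SE kernel and the lengthscales below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
d = np.abs(x[:, None] - x[None, :])

samples = {}
for ell in [0.2, 0.5, 1.0, 2.0]:                       # different lengthscales
    K = np.exp(-0.5 * (d / ell) ** 2)                  # SE covariance matrix
    L = np.linalg.cholesky(K + 1e-9 * np.eye(x.size))  # jitter for stability
    samples[ell] = L @ rng.standard_normal(x.size)     # one sample path each
```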

17 Prediction Using GPs with Different $K(x, x')$
A sample from the prior for each covariance function:
[Figure: three prior sample paths $f(x)$]
Corresponding predictions, mean with two standard deviations:
[Figure: posterior means with two-standard-deviation error bars]
demo: gpdemo

18 The Automatic Statistician

19 WHAT WOULD AN AUTOMATIC STATISTICIAN DO?
[Flowchart of the system: Language of models, Translation, Data, Search, Model, Prediction, Report, Evaluation, Checking]

20 GOALS OF THE AUTOMATIC STATISTICIAN PROJECT
- Provide a set of tools for understanding data that require minimal expert input
- Uncover challenging research problems in e.g.
  - automated inference
  - model construction and comparison
  - data visualisation and interpretation
- Advance the field of machine learning in general

21 INGREDIENTS OF AN AUTOMATIC STATISTICIAN
[Flowchart of the system: Language of models, Translation, Data, Search, Model, Prediction, Report, Evaluation, Checking]
- An open-ended language of models
  - expressive enough to capture real-world phenomena...
  - ...and the techniques used by human statisticians
- A search procedure
  - to efficiently explore the language of models
- A principled method of evaluating models
  - trading off complexity and fit to data
- A procedure to automatically explain the models
  - making the assumptions of the models explicit...
  - ...in a way that is intelligible to non-experts

22 PREVIEW: AN ENTIRELY AUTOMATIC ANALYSIS
[Figure: raw data; full model posterior with extrapolations]
Four additive components have been identified in the data:
- A linearly increasing function.
- An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.
- A smooth function.
- Uncorrelated noise with linearly increasing standard deviation.

23 DEFINING A LANGUAGE OF MODELS
[Flowchart of the system: Language of models, Translation, Data, Search, Model, Prediction, Report, Evaluation, Checking]

24 DEFINING A LANGUAGE OF REGRESSION MODELS
- Regression consists of learning a function $f : X \to Y$ from example input/output pairs
- The language should include simple parametric forms...
  - e.g. linear functions, polynomials, exponential functions
- ...as well as functions specified by high-level properties
  - e.g. smoothness, periodicity
- Inference should be tractable for all models in the language

25 WE CAN BUILD REGRESSION MODELS WITH GAUSSIAN PROCESSES
- GPs are distributions over functions such that any finite subset of function evaluations, $(f(x_1), f(x_2), \ldots, f(x_N))$, has a joint Gaussian distribution
- A GP is completely specified by
  - a mean function, $\mu(x) = E(f(x))$
  - a covariance / kernel function, $k(x, x') = \mathrm{Cov}(f(x), f(x'))$
- Denoted $f \sim GP(\mu, k)$
[Figure: GP posterior mean and posterior uncertainty, updated as data points are added]

26 A LANGUAGE OF GAUSSIAN PROCESS KERNELS
- It is common practice to use a zero mean function since the mean can be marginalised out
- Suppose $f(x) \mid a \sim GP(a\,\mu(x), k(x, x'))$ where $a \sim \mathcal{N}(0, 1)$
- Then, equivalently, $f(x) \sim GP(0,\ \mu(x)\mu(x') + k(x, x'))$
- We therefore define a language of GP regression models by specifying a language of kernels

27 THE ATOMS OF OUR LANGUAGE
Five base kernels, encoding the following types of functions:

| Kernel | Encodes |
|---|---|
| Squared exp. (SE) | smooth functions |
| Periodic (PER) | periodic functions |
| Linear (LIN) | linear functions |
| Constant (C) | constant functions |
| White noise (WN) | Gaussian noise |

28 THE COMPOSITION RULES OF OUR LANGUAGE
- Two main operations: addition and multiplication

| Expression | Encodes |
|---|---|
| LIN × LIN | quadratic functions |
| SE × PER | locally periodic functions |
| LIN + PER | periodic plus linear trend |
| SE + PER | periodic plus smooth trend |
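A small sketch of these composition rules as code (the kernel forms and hyperparameters are my own simplified choices): kernels are functions returning matrices, and addition and multiplication act elementwise:

```python
import numpy as np

def LIN(x, xp):
    return x[:, None] * xp[None, :]

def PER(x, xp, p=1.0, ell=1.0):
    d = np.abs(x[:, None] - xp[None, :])
    return np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ell**2)

def SE(x, xp, ell=1.0):
    return np.exp(-0.5 * ((x[:, None] - xp[None, :]) / ell) ** 2)

def add(k1, k2):
    return lambda x, xp: k1(x, xp) + k2(x, xp)

def mul(k1, k2):
    return lambda x, xp: k1(x, xp) * k2(x, xp)

quadratic        = mul(LIN, LIN)   # LIN x LIN: quadratic functions
locally_periodic = mul(SE, PER)    # SE x PER: locally periodic
periodic_trend   = add(LIN, PER)   # LIN + PER: periodic plus linear trend
```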

29 MODELING CHANGEPOINTS
Assume $f_1(x) \sim GP(0, k_1)$ and $f_2(x) \sim GP(0, k_2)$. Define:
$$f(x) = (1 - \sigma(x))\, f_1(x) + \sigma(x)\, f_2(x)$$
where $\sigma$ is a sigmoid function between 0 and 1. Then $f \sim GP(0, k)$, where
$$k(x, x') = (1 - \sigma(x))\, k_1(x, x')\, (1 - \sigma(x')) + \sigma(x)\, k_2(x, x')\, \sigma(x')$$
We define the changepoint operator $k = CP(k_1, k_2)$.
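The changepoint operator translates directly into code under the same kernel-function conventions as above (the sigmoid location and steepness are free parameters):

```python
import numpy as np

def sigmoid(x, x0=0.0, s=1.0):
    """Sigmoid in the input; x0 is the changepoint location, s the steepness."""
    return 1.0 / (1.0 + np.exp(-(x - x0) / s))

def CP(k1, k2, x0=0.0, s=1.0):
    """k(x,x') = (1-sig(x)) k1(x,x') (1-sig(x')) + sig(x) k2(x,x') sig(x')."""
    def k(x, xp):
        sx = sigmoid(x, x0, s)[:, None]
        sxp = sigmoid(xp, x0, s)[None, :]
        return (1 - sx) * k1(x, xp) * (1 - sxp) + sx * k2(x, xp) * sxp
    return k
```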

30 AN EXPRESSIVE LANGUAGE OF MODELS

| Regression model | Kernel |
|---|---|
| GP smoothing | SE + WN |
| Linear regression | C + LIN + WN |
| Multiple kernel learning | Σ SE + WN |
| Trend, cyclical, irregular | Σ SE + Σ PER + WN |
| Fourier decomposition | C + Σ cos + WN |
| Sparse spectrum GPs | Σ cos + WN |
| Spectral mixture | Σ SE × cos + WN |
| Changepoints | e.g. CP(SE, SE) + WN |
| Heteroscedasticity | e.g. SE + LIN × WN |

Note: cos is a special case of our version of PER.

31 DISCOVERING A GOOD MODEL VIA SEARCH
[Flowchart of the system: Language of models, Translation, Data, Search, Model, Prediction, Report, Evaluation, Checking]

32 DISCOVERING A GOOD MODEL VIA SEARCH
- Language defined as the arbitrary composition of five base kernels (WN, C, LIN, SE, PER) via three operators (+, ×, CP)
- The space spanned by this language is open-ended and can have a high branching factor, requiring a judicious search
- We propose a greedy search for its simplicity and similarity to human model-building
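A structural sketch of the greedy search (my own skeleton, not the authors' code); `score` stands in for fitting the GP and returning the BIC-penalised evidence described on the evaluation slides:

```python
BASE = ["WN", "C", "LIN", "SE", "PER"]

def expand(expr):
    """One-step expansions of a kernel expression, written as strings."""
    out = [f"({expr} + {b})" for b in BASE]
    out += [f"({expr} * {b})" for b in BASE]
    out += [f"CP({expr}, {b})" for b in BASE]
    return out

def greedy_search(score, depth=3):
    best = max(BASE, key=score)                 # best base kernel to start
    for _ in range(depth):
        champ = max(expand(best), key=score)    # best one-step expansion
        if score(champ) <= score(best):         # stop when nothing improves
            return best
        best = champ
    return best

# Toy run with a stand-in score that just prefers shorter expressions:
print(greedy_search(lambda e: -len(e)))   # -> "C"
```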

33 EXAMPLE: MAUNA LOA KEELING CURVE
[Figure: the data fit under the current best model, RQ, shown over the search tree: Start → {SE, RQ, LIN, PER}; RQ → {SE + RQ, ..., PER + RQ, ..., PER × RQ}; PER + RQ → {SE + PER + RQ, ..., SE × (PER + RQ)}]

34 EXAMPLE: MAUNA LOA KEELING CURVE
[Figure: the fit under (PER + RQ), one level deeper in the same search tree]

35 EXAMPLE: MAUNA LOA KEELING CURVE
[Figure: the fit under SE × (PER + RQ), one level deeper in the same search tree]

36 EXAMPLE: MAUNA LOA KEELING CURVE
[Figure: the fit under (SE + SE × (PER + RQ)), the final model in the same search tree]

37 MODEL EVALUATION
[Flowchart of the system: Language of models, Translation, Data, Search, Model, Prediction, Report, Evaluation, Checking]

38 MODEL EVALUATION
- After proposing a new model, its kernel parameters are optimised by conjugate gradients
- We evaluate each optimised model, $M$, using the model evidence (marginal likelihood), which can be computed analytically for GPs
- We penalise the marginal likelihood for the optimised kernel parameters using the Bayesian Information Criterion (BIC):
$$\mathrm{BIC}(M) = \log p(D \mid M) - \frac{p}{2} \log n$$
where $p$ is the number of kernel parameters, $D$ represents the data, and $n$ is the number of data points.
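Both quantities on this slide are short to implement; a minimal sketch (my own, assuming a precomputed kernel matrix K over the training inputs):

```python
import numpy as np

def gp_log_evidence(K, y):
    """log p(D|M) = -1/2 y^T K^{-1} y - 1/2 log|K| - (n/2) log(2 pi)."""
    L = np.linalg.cholesky(K)                     # K = L L^T
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ a
            - np.log(np.diag(L)).sum()            # equals 1/2 log|K|
            - 0.5 * len(y) * np.log(2 * np.pi))

def bic(log_evidence, p, n):
    """BIC(M) = log p(D|M) - (p/2) log n."""
    return log_evidence - 0.5 * p * np.log(n)

# e.g. rank a kernel with 5 free parameters fitted to n = 144 points:
# bic(gp_log_evidence(K, y), p=5, n=144)
```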

39 AUTOMATIC TRANSLATION OF MODELS
[Flowchart of the system: Language of models, Translation, Data, Search, Model, Prediction, Report, Evaluation, Checking]

40 AUTOMATIC TRANSLATION OF MODELS
- Search can produce arbitrarily complicated models from an open-ended language, but two main properties allow description to be automated
  - Kernels can be decomposed into a sum of products
    - a sum of kernels corresponds to a sum of functions
    - therefore, we can describe each product of kernels separately
  - Each kernel in a product modifies a model in a consistent way
    - each kernel roughly corresponds to an adjective
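A toy illustration (mine, with phrase fragments taken from the examples later in this deck) of the "each kernel is an adjective" idea: one kernel in a product contributes the noun phrase, the others contribute modifiers:

```python
NOUN_PRIORITY = ["PER", "WN", "SE", "LIN", "C"]   # which kernel names the function
NOUN = {"PER": "periodic function", "WN": "uncorrelated noise",
        "SE": "smooth function", "LIN": "linear function",
        "C": "constant function"}
MODIFIER = {"SE": "approximately", "LIN": "with linearly growing amplitude"}

def describe(product):
    """Describe a product of base kernels as a noun phrase with modifiers."""
    head = min(product, key=NOUN_PRIORITY.index)  # highest-priority noun
    mods = [MODIFIER[k] for k in product if k != head and k in MODIFIER]
    pre = [m for m in mods if not m.startswith("with")]
    post = [m for m in mods if m.startswith("with")]
    return " ".join(pre + [NOUN[head]] + post)

print(describe(["SE", "PER", "LIN"]))
# -> "approximately periodic function with linearly growing amplitude"
```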

41 SUM OF PRODUCTS NORMAL FORM
Suppose the search finds the following kernel:
SE × (WN × LIN + CP(C, PER))

42 SUM OF PRODUCTS NORMAL FORM
Suppose the search finds the following kernel:
SE × (WN × LIN + CP(C, PER))
The changepoint can be converted into a sum of products (σ and σ̄ denote the sigmoid factors from the changepoint):
SE × (WN × LIN + C × σ + PER × σ̄)

43 SUM OF PRODUCTS NORMAL FORM
Suppose the search finds the following kernel:
SE × (WN × LIN + CP(C, PER))
The changepoint can be converted into a sum of products:
SE × (WN × LIN + C × σ + PER × σ̄)
Multiplication can be distributed over addition:
SE × WN × LIN + SE × C × σ + SE × PER × σ̄

44 SUM OF PRODUCTS NORMAL FORM
Suppose the search finds the following kernel:
SE × (WN × LIN + CP(C, PER))
The changepoint can be converted into a sum of products:
SE × (WN × LIN + C × σ + PER × σ̄)
Multiplication can be distributed over addition:
SE × WN × LIN + SE × C × σ + SE × PER × σ̄
Simplification rules (e.g. SE × WN → WN and SE × C → SE) are applied:
WN × LIN + SE × σ + SE × PER × σ̄
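The normal-form conversion on slides 41-44 is ordinary symbolic algebra; a sketch with sympy (my own, not the authors' implementation), treating kernel names as commuting symbols and starting after the changepoint has been rewritten with sigmoid factors:

```python
import sympy as sp

SE, WN, LIN, C, PER, s, sbar = sp.symbols("SE WN LIN C PER sigma sigmabar")

expr = SE * (WN * LIN + C * s + PER * sbar)
expanded = sp.expand(expr)      # distribute multiplication over addition
print(expanded)                 # C*SE*sigma + LIN*SE*WN + PER*SE*sigmabar

# Apply simplification rules like SE*WN -> WN and SE*C -> SE:
simplified = expanded.subs({SE * WN: WN, SE * C: SE})
print(simplified)               # LIN*WN + SE*sigma + PER*SE*sigmabar
```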

45 SUMS OF KERNELS ARE SUMS OF FUNCTIONS
If $f_1 \sim GP(0, k_1)$ and independently $f_2 \sim GP(0, k_2)$, then
$$f_1 + f_2 \sim GP(0, k_1 + k_2)$$
[Figure: a posterior decomposed into the posteriors of its additive components]
We can therefore describe each component separately.
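In code, this means each additive component can be predicted separately against the shared posterior; a minimal sketch (mine) under the kernel-function conventions used in the earlier snippets, with k = k1 + ... + kJ:

```python
import numpy as np

def component_posterior_mean(kj, kernels, X, y, Xs, sn2=0.01):
    """Posterior mean of component j at Xs: K*_j (K + sn2 I)^{-1} y."""
    K = sum(k(X, X) for k in kernels) + sn2 * np.eye(len(X))
    return kj(Xs, X) @ np.linalg.solve(K, y)
```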

46 PRODUCTS OF KERNELS
PER: "periodic function"
On their own, each kernel is described by a standard noun phrase.

47 PRODUCTS OF KERNELS - SE
SE × PER: "approximately periodic function"
Multiplication by SE removes long-range correlations from a model, since SE(x, x') decreases monotonically to 0 as |x - x'| increases.

48 PRODUCTS OF KERNELS - LIN
SE × PER × LIN: "approximately periodic function with linearly growing amplitude"
Multiplication by LIN is equivalent to multiplying the function being modeled by a linear function. If $f(x) \sim GP(0, k)$, then $x f(x) \sim GP(0, k \times \mathrm{LIN})$. This causes the standard deviation of the model to vary linearly without affecting the correlation.

49 PRODUCTS OF KERNELS - CHANGEPOINTS
SE × PER × LIN × σ: "approximately periodic function with linearly growing amplitude until 1700"
Multiplication by σ is equivalent to multiplying the function being modeled by a sigmoid.

50 AUTOMATICALLY GENERATED REPORTS
[Flowchart of the system: Language of models, Translation, Data, Search, Model, Prediction, Report, Evaluation, Checking]

51 EXAMPLE: AIRLINE PASSENGER VOLUME
[Figure: raw data; full model posterior with extrapolations]
Four additive components have been identified in the data:
- A linearly increasing function.
- An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.
- A smooth function.
- Uncorrelated noise with linearly increasing standard deviation.

52 EXAMPLE: AIRLINE PASSENGER VOLUME
This component is linearly increasing.
[Figure: posterior of component 1; sum of components up to component 1]

53 EXAMPLE: AIRLINE PASSENGER VOLUME
This component is approximately periodic with a period of 1.0 years and varying amplitude. Across periods the shape of this function varies very smoothly. The amplitude of the function increases linearly. The shape of this function within each period has a typical lengthscale of 6.0 weeks.
[Figure: posterior of component 2; sum of components up to component 2]

54 EXAMPLE: AIRLINE PASSENGER VOLUME
This component is a smooth function with a typical lengthscale of 8.1 months.
[Figure: posterior of component 3; sum of components up to component 3]

55 EXAMPLE: AIRLINE PASSENGER VOLUME
This component models uncorrelated noise. The standard deviation of the noise increases linearly.
[Figure: posterior of component 4; sum of components up to component 4]

56 EXAMPLE: SOLAR IRRADIANCE
This component is constant.
[Figure: posterior of component 1; sum of components up to component 1]

57 EXAMPLE: SOLAR IRRADIANCE
This component is constant. This component applies from 1643 until 1716.
[Figure: posterior of component 2; sum of components up to component 2]

58 EXAMPLE: SOLAR IRRADIANCE
This component is a smooth function with a typical lengthscale of 23.1 years. This component applies until 1643 and from 1716 onwards.
[Figure: posterior of component 3; sum of components up to component 3]

59 EXAMPLE: SOLAR IRRADIANCE
This component is approximately periodic with a period of 10.8 years. Across periods the shape of this function varies smoothly with a typical lengthscale of 36.9 years. The shape of this function within each period is very smooth and resembles a sinusoid. This component applies until 1643 and from 1716 onwards.
[Figure: posterior of component 4; sum of components up to component 4]

60 OTHER EXAMPLES
See the project website for other examples.

61 GOOD PREDICTIVE PERFORMANCE AS WELL
[Figure: standardised RMSE over 13 data sets for ABCD (accuracy), ABCD (interpretability), spectral kernels, trend-cyclical-irregular, Bayesian MKL, Eureqa, changepoints, squared exponential, and linear regression]
- Tweaks can be made to the algorithm to improve the accuracy or the interpretability of the models produced...
- ...but both variants are highly competitive at extrapolation (shown above) and interpolation

62 MODEL CHECKING AND CRITICISM
- Good statistical modelling should include model criticism:
  - Does the data match the assumptions of the model?
  - For example, if the model assumed Gaussian noise, does a Q-Q plot reveal non-Gaussian residuals?
- Our automatic statistician performs posterior predictive checks, dependence tests and residual tests
- We have also been developing more systematic nonparametric approaches to model criticism using kernel two-sample testing with MMD
Lloyd, J. R. and Ghahramani, Z. (2014) Statistical Model Criticism using Kernel Two Sample Tests.
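For reference, the statistic behind such kernel two-sample tests is short to sketch (a biased MMD² estimate with an RBF kernel; the lengthscale and test data below are arbitrary choices, and a real test would calibrate the statistic against a null distribution):

```python
import numpy as np

def mmd2(X, Y, ell=1.0):
    """Biased MMD^2 between 1-D samples X and Y under an RBF kernel."""
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
print(mmd2(rng.normal(size=200), rng.normal(size=200)))       # near 0: same dist.
print(mmd2(rng.normal(size=200), rng.normal(1.0, 1.0, 200)))  # clearly > 0
```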

63 CHALLENGES
- Interpretability / accuracy trade-off
- Increasing the expressivity of the language
  - e.g. monotonicity, positive functions, symmetries
- Computational complexity of searching through a huge space of models
- Extending the automatic reports to multidimensional datasets
  - Search and descriptions naturally extend to multiple dimensions, but automatically generating relevant visual summaries is harder

64 CURRENT AND FUTURE DIRECTIONS
- Automatic statistician for:
  - one-dimensional time series
  - linear regression (classical)
  - multivariate nonlinear regression (cf. Duvenaud, Lloyd et al., ICML 2013)
  - multivariate classification (w/ Nikola Mrksic)
  - completing and interpreting tables and databases (w/ Kee Chong Tan)
- Probabilistic programming
  - Probabilistic models are expressed in a general (Turing-complete) programming language
  - A universal inference engine can then be used to infer unobserved variables given observed data
  - This can be used to implement search over the model space in an automatic statistician

65 SUMMARY
- We have presented the beginnings of an automatic statistician
- Our system
  - defines an open-ended language of models
  - searches greedily through this space
  - produces detailed reports describing patterns in data
  - performs automatic model criticism
- Extrapolation and interpolation performance is highly competitive
- We believe this line of research has the potential to make powerful statistical model-building techniques accessible to non-experts

66 REFERENCES
Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2013) Structure Discovery in Nonparametric Regression through Compositional Kernel Search. ICML 2013.
Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2014) Automatic Construction and Natural-language Description of Nonparametric Regression Models. AAAI 2014.
Lloyd, J. R. and Ghahramani, Z. (2014) Statistical Model Criticism using Kernel Two Sample Tests.
Ghahramani, Z. (2013) Bayesian nonparametrics and the probabilistic approach to modelling. Philosophical Trans. Royal Society A 371.
Looking for postdocs!
