How to build an automatic statistician


1 How to build an automatic statistician James Robert Lloyd 1, David Duvenaud 1, Roger Grosse 2, Joshua Tenenbaum 2, Zoubin Ghahramani 1 1: Department of Engineering, University of Cambridge, UK 2: Massachusetts Institute of Technology, USA November 19, 2014

2 THERE IS A GROWING NEED FOR DATA ANALYSIS We live in an era of abundant data The McKinsey Global Institute claim The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data. Diverse fields increasingly relying on expert statisticians, machine learning researchers and data scientists e.g. Computational sciences (e.g. biology, astronomy,... ) Online advertising Quantitative finance... James Robert Lloyd 2 / 64

3 WHAT WOULD AN AUTOMATIC STATISTICIAN DO? [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 3 / 64

4 PREVIEW: AN ENTIRELY AUTOMATIC ANALYSIS [Plots: raw data; full model posterior with extrapolations] Four additive components have been identified in the data A linearly increasing function. An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude. A smooth function. Uncorrelated noise with linearly increasing standard deviation. James Robert Lloyd 4 / 64

5 AGENDA Introduction to Bayesian statistics Introduction to Gaussian processes via linear regression Description of automatic statistician Examples of automatically generated output James Robert Lloyd 5 / 64

6 HOW TO DO LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise James Robert Lloyd 6 / 64

7 HOW TO DO LINEAR REGRESSION Common approach: Estimate m by least squares James Robert Lloyd 6 / 64

8 HOW TO DO BAYESIAN LINEAR REGRESSION Bayesian approach: Specify beliefs about m using probability distributions James Robert Lloyd 7 / 64

9 HOW TO DO BAYESIAN LINEAR REGRESSION Bayesian approach: Specify prior beliefs about m using probability distributions Follow rules of probability to update beliefs rationally after observing data James Robert Lloyd 7 / 64

10 BAYESIAN STATISTICS Suppose we wish to fit a model M with parameters θ to data D M defined by p(D | θ, M) - the likelihood James Robert Lloyd 8 / 64

11 BAYESIAN STATISTICS Suppose we wish to fit a model M with parameters θ to data D M defined by p(D | θ, M) - the likelihood We represent our uncertainty about θ with a probability distribution p(θ | M) - the prior James Robert Lloyd 8 / 64

12 BAYESIAN STATISTICS Suppose we wish to fit a model M with parameters θ to data D M defined by p(D | θ, M) - the likelihood We represent our uncertainty about θ with a probability distribution p(θ | M) - the prior We now derive p(θ | D, M) - the posterior James Robert Lloyd 8 / 64

13 BAYES RULE Suppose we wish to fit a model M with parameters θ to data D M defined by p(D | θ, M) - the likelihood We represent our uncertainty about θ with a probability distribution p(θ | M) - the prior We then derive p(θ | D, M) - the posterior $p(\theta \mid D, M) = \dfrac{p(D \mid \theta, M)\, p(\theta \mid M)}{\int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta}$ (posterior = likelihood × prior / marginal likelihood) James Robert Lloyd 9 / 64

14 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise James Robert Lloyd 10 / 64

15 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise Represent prior beliefs about m using a probability distribution James Robert Lloyd 10 / 64

16 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise Represent prior beliefs about m using a probability distribution e.g. m ~ Normal(mean = 0, variance = 1) James Robert Lloyd 10 / 64

17 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise Represent prior beliefs about m using a probability distribution e.g. m ~ N(0, 1) James Robert Lloyd 10 / 64

18 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise Represent prior beliefs about m using a probability distribution e.g. m ~ N(0, 1) Define likelihood by representing prior beliefs about ε using a probability distribution e.g. ε_i ~ N(0, σ_ε²) independently for all i James Robert Lloyd 10 / 64

19 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise Represent prior beliefs about m using a probability distribution e.g. m ~ N(0, 1) Define likelihood by representing prior beliefs about ε using a probability distribution e.g. ε_i ~ N(0, σ_ε²) independently for all i Apply Bayes rule James Robert Lloyd 10 / 64

20 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) = \dfrac{p(y \mid m, x)\, p(m)}{\int p(y \mid m, x)\, p(m)\, dm}$ James Robert Lloyd 11 / 64

21 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) = \dfrac{p(y \mid m, x)\, p(m)}{\int p(y \mid m, x)\, p(m)\, dm} \propto p(y \mid m, x)\, p(m)$ James Robert Lloyd 11 / 64

22 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) \propto p(y \mid m, x)\, p(m) = \left(\prod_i \tfrac{1}{\sqrt{2\pi}\,\sigma_\varepsilon} e^{-(y_i - m x_i)^2/(2\sigma_\varepsilon^2)}\right) \tfrac{1}{\sqrt{2\pi}} e^{-m^2/2}$ James Robert Lloyd 11 / 64

23 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) \propto p(y \mid m, x)\, p(m) \propto e^{-\sum_i (y_i - m x_i)^2/(2\sigma_\varepsilon^2)\; -\; m^2/2}$ James Robert Lloyd 11 / 64

24 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) \propto e^{-\frac{\sum_i x_i^2 + \sigma_\varepsilon^2}{2\sigma_\varepsilon^2}\left(m - \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)^2}$ James Robert Lloyd 11 / 64

25 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) \propto e^{-\frac{\sum_i x_i^2 + \sigma_\varepsilon^2}{2\sigma_\varepsilon^2}\left(m - \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)^2}$ This is a normal distribution: $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 11 / 64

26 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64

27 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64

28 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64

29 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64

30 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64

31 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64
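
As a concrete illustration of the posterior just derived, here is a minimal numpy sketch (not from the talk) that generates synthetic data from y_i = m x_i + ε_i and evaluates the closed-form posterior over the slope m; the seed, noise level and "true" slope are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: closed-form posterior over the slope m, assuming
# y_i = m * x_i + eps_i with prior m ~ N(0, 1) and noise eps_i ~ N(0, sigma_eps^2).
rng = np.random.default_rng(0)
sigma_eps = 0.5
m_true = 1.3                                   # illustrative "true" slope
x = rng.uniform(-2, 2, size=20)
y = m_true * x + sigma_eps * rng.standard_normal(20)

# Posterior: m | y, x ~ N( sum(x*y) / (sum(x^2) + sigma^2),  sigma^2 / (sum(x^2) + sigma^2) )
denom = np.sum(x**2) + sigma_eps**2
post_mean = np.sum(x * y) / denom
post_var = sigma_eps**2 / denom
print(f"posterior mean {post_mean:.3f}, posterior sd {np.sqrt(post_var):.3f}")
```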

32 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i James Robert Lloyd 13 / 64

33 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) James Robert Lloyd 13 / 64

34 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) y_i | m, ε_i = m x_i + ε_i James Robert Lloyd 13 / 64

35 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) y_i | m, ε_i = m x_i + ε_i This implicitly defined a joint prior on {y_i : i = 1,..., n} James Robert Lloyd 13 / 64

36 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) y_i | m, ε_i = m x_i + ε_i This implicitly defined a joint prior on {y_i : i = 1,..., n} y_i ~ N(0, x_i² + σ_ε²) (sum of two normals) James Robert Lloyd 13 / 64

37 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) y_i | m, ε_i = m x_i + ε_i This implicitly defined a joint prior on {y_i : i = 1,..., n} y_i ~ N(0, x_i² + σ_ε²) (sum of two normals) cov(y_i, y_j) = x_i x_j for i ≠ j James Robert Lloyd 13 / 64

38 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) y_i | m, ε_i = m x_i + ε_i This implicitly defined a joint prior on {y_i : i = 1,..., n} y_i ~ N(0, x_i² + σ_ε²) (sum of two normals) cov(y_i, y_j) = x_i x_j for i ≠ j {y_i : i = 1,..., n} has a multivariate normal distribution y ~ N(0, K) where k_ij = x_i x_j + δ_ij σ_ε² James Robert Lloyd 13 / 64

39 BAYESIAN LINEAR REGRESSION REWRITTEN {y_i : i = 1,..., n} has a multivariate normal distribution y ~ N(0, K) where k_ij = x_i x_j + δ_ij σ_ε² James Robert Lloyd 14 / 64

40 BAYESIAN LINEAR REGRESSION REWRITTEN {y_i : i = 1,..., n} has a multivariate normal distribution y ~ N(0, K) where k_ij = x_i x_j + δ_ij σ_ε² The covariance can be written as a function of pairs of x_i k_ij = k(x_i, x_j) = x_i x_j + δ_{x_i = x_j} σ_ε² James Robert Lloyd 14 / 64

41 BAYESIAN LINEAR REGRESSION REWRITTEN {y_i : i = 1,..., n} has a multivariate normal distribution y ~ N(0, K) where k_ij = x_i x_j + δ_ij σ_ε² The covariance can be written as a function of pairs of x_i k_ij = k(x_i, x_j) = x_i x_j + δ_{x_i = x_j} σ_ε² Do we define a valid probability distribution if {x_i} = ℝ? James Robert Lloyd 14 / 64
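
The equivalence above can be checked numerically. The following sketch (illustrative, not from the talk) samples m and ε and confirms that the empirical covariance of y matches K with k_ij = x_i x_j + δ_ij σ_ε²; the inputs, seed and noise level are arbitrary choices.

```python
import numpy as np

# Sketch: integrating out m ~ N(0, 1) in y_i = m x_i + eps_i gives a joint
# Gaussian y ~ N(0, K) with K_ij = x_i x_j + delta_ij sigma_eps^2.
rng = np.random.default_rng(1)
sigma_eps = 0.3
x = np.array([-1.0, 0.5, 2.0])

K = np.outer(x, x) + sigma_eps**2 * np.eye(len(x))    # kernel matrix

# Monte Carlo check: sample m and eps, compare the empirical covariance of y with K.
n_samples = 200_000
m = rng.standard_normal(n_samples)                     # m ~ N(0, 1)
eps = sigma_eps * rng.standard_normal((n_samples, len(x)))
y = m[:, None] * x[None, :] + eps
print(np.round(np.cov(y, rowvar=False), 2))            # empirical covariance
print(np.round(K, 2))                                  # analytic kernel matrix
```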

42 GAUSSIAN PROCESSES A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution James Robert Lloyd 15 / 64

43 GAUSSIAN PROCESSES A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution We can write this collection of random variables as {f(x) : x ∈ X} i.e. a function f evaluated at inputs x James Robert Lloyd 15 / 64

44 GAUSSIAN PROCESSES A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution We can write this collection of random variables as {f(x) : x ∈ X} i.e. a function f evaluated at inputs x A GP is completely specified by Mean function, µ(x) = E(f(x)) Covariance / kernel function, k(x, x′) = Cov(f(x), f(x′)) Denoted f ~ GP(µ, k) James Robert Lloyd 15 / 64

45 GAUSSIAN PROCESSES A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution We can write this collection of random variables as {f(x) : x ∈ X} i.e. a function f evaluated at inputs x A GP is completely specified by Mean function, µ(x) = E(f(x)) Covariance / kernel function, k(x, x′) = Cov(f(x), f(x′)) Denoted f ~ GP(µ, k) Can be thought of as a probability distribution on functions James Robert Lloyd 15 / 64

46 LINEAR KERNEL = LINEAR FUNCTIONS k(x, x′) = x x′ James Robert Lloyd 16 / 64

47 WHAT ABOUT OTHER KERNELS? k(x, x′) = e^{−(x − x′)²} James Robert Lloyd 17 / 64
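
To see what these two kernels imply about functions, one can draw samples from the corresponding GP priors. A minimal sketch, assuming a zero mean function, unit lengthscale and a small jitter term added for numerical stability:

```python
import numpy as np

# Sketch: draw functions from GP priors with the linear kernel k(x, x') = x x'
# and the squared exponential kernel k(x, x') = exp(-(x - x')^2).
rng = np.random.default_rng(2)
xs = np.linspace(-3, 3, 100)

def k_lin(a, b):
    return np.outer(a, b)

def k_se(a, b):
    return np.exp(-(a[:, None] - b[None, :])**2)

for name, k in [("linear", k_lin), ("squared exponential", k_se)]:
    K = k(xs, xs) + 1e-8 * np.eye(len(xs))            # jitter for numerical stability
    L = np.linalg.cholesky(K)
    samples = L @ rng.standard_normal((len(xs), 3))   # three draws from GP(0, k)
    print(name, "samples shape:", samples.shape)      # plotted, these are lines vs smooth curves
```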

48 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

49 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

50 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

51 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

52 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

53 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

54 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

55 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

56 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

57 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

58 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

59 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

60 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

61 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

62 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

63 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

64 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

65 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

66 WHICH KERNEL FUNCTION? How far can we go with kernel functions? James Robert Lloyd 21 / 64

67 DEFINING A LANGUAGE OF MODELS [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 22 / 64

68 THE ATOMS OF OUR LANGUAGE Five base kernels, each encoding a type of function: Squared exp. (SE) - smooth functions; Periodic (PER) - periodic functions; Linear (LIN) - linear functions; Constant (C) - constant functions; White noise (WN) - Gaussian noise James Robert Lloyd 23 / 64

69 THE COMPOSITION RULES OF OUR LANGUAGE Two main operations: addition, multiplication LIN × LIN - quadratic functions; SE × PER - locally periodic; LIN + PER - periodic plus linear trend; SE + PER - periodic plus smooth trend James Robert Lloyd 24 / 64
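
A minimal sketch of these base kernels and composition rules as plain numpy functions; the hyperparameters (lengthscales, periods, variances) are illustrative defaults, not values from the talk.

```python
import numpy as np

# Five base kernels (illustrative hyperparameters) and example compositions.
def SE(x, y, ell=1.0):
    return np.exp(-0.5 * (x[:, None] - y[None, :])**2 / ell**2)

def PER(x, y, period=1.0, ell=1.0):
    return np.exp(-2 * np.sin(np.pi * np.abs(x[:, None] - y[None, :]) / period)**2 / ell**2)

def LIN(x, y):
    return x[:, None] * y[None, :]

def C(x, y, variance=1.0):
    return variance * np.ones((len(x), len(y)))

def WN(x, y, variance=1.0):
    return variance * (x[:, None] == y[None, :]).astype(float)

xs = np.linspace(0, 4, 200)
quadratic     = LIN(xs, xs) * LIN(xs, xs)      # LIN x LIN: quadratic functions
loc_periodic  = SE(xs, xs) * PER(xs, xs)       # SE x PER: locally periodic
per_plus_lin  = LIN(xs, xs) + PER(xs, xs)      # LIN + PER: periodic plus linear trend

# One draw from the locally periodic prior (jitter added for numerical stability).
rng = np.random.default_rng(3)
sample = np.linalg.cholesky(loc_periodic + 1e-8 * np.eye(len(xs))) @ rng.standard_normal(len(xs))
```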

70 MODELING CHANGEPOINTS Time series data often exhibit changepoints: [Plots of a time series exhibiting a changepoint] James Robert Lloyd 25 / 64

71 MODELING CHANGEPOINTS Time series data often exhibit changepoints: [Plots of a time series exhibiting a changepoint] We can model this by assuming f_1(x) ~ GP(0, k_1) and f_2(x) ~ GP(0, k_2) and then defining f(x) = (1 − σ(x)) f_1(x) + σ(x) f_2(x) where σ is a sigmoid function between 0 and 1. James Robert Lloyd 25 / 64

72 MODELING CHANGEPOINTS We can model this by assuming f_1(x) ~ GP(0, k_1) and f_2(x) ~ GP(0, k_2) and then defining f(x) = (1 − σ(x)) f_1(x) + σ(x) f_2(x) where σ is a sigmoid function between 0 and 1. Then f ~ GP(0, k), where k(x, x′) = (1 − σ(x)) k_1(x, x′) (1 − σ(x′)) + σ(x) k_2(x, x′) σ(x′) We define the changepoint operator k = CP(k_1, k_2). James Robert Lloyd 26 / 64
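
A sketch of the CP operator as defined above, using a logistic sigmoid; the changepoint location, steepness and the two example kernels are illustrative assumptions.

```python
import numpy as np

# Changepoint operator: k(x, x') = (1 - s(x)) k1(x, x') (1 - s(x')) + s(x) k2(x, x') s(x').
def sigmoid(x, location=0.0, steepness=2.0):
    return 1.0 / (1.0 + np.exp(-steepness * (x - location)))

def CP(k1, k2, location=0.0, steepness=2.0):
    def k(x, y):
        sx, sy = sigmoid(x, location, steepness), sigmoid(y, location, steepness)
        return ((1 - sx)[:, None] * k1(x, y) * (1 - sy)[None, :]
                + sx[:, None] * k2(x, y) * sy[None, :])
    return k

# e.g. smooth behaviour before the changepoint, periodic behaviour after it
k_se  = lambda x, y: np.exp(-0.5 * (x[:, None] - y[None, :])**2)
k_per = lambda x, y: np.exp(-2 * np.sin(np.pi * np.abs(x[:, None] - y[None, :]))**2)
k_cp = CP(k_se, k_per, location=0.0)

xs = np.linspace(-3, 3, 100)
K = k_cp(xs, xs)                                   # covariance matrix of the changepoint model
```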

73 AN EXPRESSIVE LANGUAGE OF MODELS Regression model - Kernel: GP smoothing - SE + WN; Linear regression - C + LIN + WN; Multiple kernel learning - ∑ SE + WN; Trend, cyclical, irregular - ∑ SE + ∑ PER + WN; Fourier decomposition - C + ∑ cos + WN; Sparse spectrum GPs - ∑ cos + WN; Spectral mixture - ∑ SE × cos + WN; Changepoints - e.g. CP(SE, SE) + WN; Heteroscedasticity - e.g. SE + LIN × WN Note: cos is a special case of our version of PER James Robert Lloyd 27 / 64

74 DISCOVERING A GOOD MODEL VIA SEARCH [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 28 / 64

75 DISCOVERING A GOOD MODEL VIA SEARCH Language defined as the arbitrary composition of five base kernels (WN, C, LIN, SE, PER) via three operators (+, ×, CP). The space spanned by this language is open-ended and can have a high branching factor requiring a judicious search We propose a greedy search for its simplicity and similarity to human model-building James Robert Lloyd 29 / 64
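
The greedy search can be sketched schematically as below. Kernel expressions are represented as nested tuples, and `score` is a stand-in for the real evaluation step (fitting the GP and computing a penalised marginal likelihood, covered later); the function names `expand` and `greedy_search` are hypothetical, not from the talk's implementation.

```python
# Schematic sketch of greedy search over kernel expressions.
BASE = ["WN", "C", "LIN", "SE", "PER"]

def expand(expr):
    """All one-step expansions of an expression: combine with a base kernel via +, x or CP."""
    for b in BASE:
        yield ("+", expr, b)
        yield ("*", expr, b)
        yield ("CP", expr, b)

def greedy_search(data, score, depth=3):
    """Start from the best base kernel, then repeatedly take the best one-step expansion."""
    best = max(BASE, key=lambda e: score(e, data))
    for _ in range(depth - 1):
        candidate = max(expand(best), key=lambda e: score(e, data))
        if score(candidate, data) <= score(best, data):
            break                                  # no improvement: stop early
        best = candidate
    return best

# e.g. with a stand-in score (a real system would use the BIC-penalised marginal likelihood)
print(greedy_search(data=None, score=lambda e, d: -len(str(e)), depth=3))
```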

76 EXAMPLE: MAUNA LOA KEELING CURVE [Plot of current model: RQ] Search tree: Start → SE, RQ, LIN, PER → SE + RQ,..., PER + RQ,..., PER × RQ → SE + PER + RQ,..., SE × (PER + RQ) James Robert Lloyd 30 / 64

77 EXAMPLE: MAUNA LOA KEELING CURVE [Plot of current model: ( PER + RQ )] Search tree: Start → SE, RQ, LIN, PER → SE + RQ,..., PER + RQ,..., PER × RQ → SE + PER + RQ,..., SE × (PER + RQ) James Robert Lloyd 30 / 64

78 EXAMPLE: MAUNA LOA KEELING CURVE [Plot of current model: SE × ( PER + RQ )] Search tree: Start → SE, RQ, LIN, PER → SE + RQ,..., PER + RQ,..., PER × RQ → SE + PER + RQ,..., SE × (PER + RQ) James Robert Lloyd 30 / 64

79 EXAMPLE: MAUNA LOA KEELING CURVE [Plot of current model: ( SE + SE × ( PER + RQ ) )] Search tree: Start → SE, RQ, LIN, PER → SE + RQ,..., PER + RQ,..., PER × RQ → SE + PER + RQ,..., SE × (PER + RQ) James Robert Lloyd 30 / 64

80 MODEL EVALUATION [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 31 / 64

81 BAYESIAN MODEL SELECTION Suppose we have a collection of models {M_i} and some data D James Robert Lloyd 32 / 64

82 BAYESIAN MODEL SELECTION Suppose we have a collection of models {M_i} and some data D Bayes rule tells us $p(M_i \mid D) = \dfrac{p(D \mid M_i)\, p(M_i)}{p(D)}$ James Robert Lloyd 32 / 64

83 BAYESIAN MODEL SELECTION Suppose we have a collection of models {M_i} and some data D Bayes rule tells us $p(M_i \mid D) = \dfrac{p(D \mid M_i)\, p(M_i)}{p(D)}$ If p(M_i) is equal for all i (prior ignorance) then $p(M_i \mid D) \propto p(D \mid M_i) = \int p(D \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i$ James Robert Lloyd 32 / 64

84 BAYESIAN MODEL SELECTION Suppose we have a collection of models {M_i} and some data D Bayes rule tells us $p(M_i \mid D) = \dfrac{p(D \mid M_i)\, p(M_i)}{p(D)}$ If p(M_i) is equal for all i (prior ignorance) then $p(M_i \mid D) \propto p(D \mid M_i) = \int p(D \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i$ i.e. The most likely model has the highest marginal likelihood James Robert Lloyd 32 / 64
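
For GP models the marginal likelihood is available in closed form, so model comparison reduces to evaluating log N(y; 0, K_M) for each candidate. A minimal sketch, assuming two candidate kernels (the linear-regression kernel from earlier and an SE kernel) and synthetic near-linear data; the hyperparameter values are illustrative.

```python
import numpy as np

# Sketch of Bayesian model comparison for GP models: with equal p(M_i), the
# preferred model is the one with the highest marginal likelihood log N(y; 0, K_M).
def log_marginal_likelihood(y, K):
    n = len(y)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # K^{-1} y via Cholesky
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(4)
x = np.linspace(-2, 2, 40)
y = 0.8 * x + 0.1 * rng.standard_normal(40)                    # synthetic, nearly linear data

sigma2 = 0.1**2
K_lin = np.outer(x, x) + sigma2 * np.eye(40)                   # Bayesian linear regression kernel
K_se = np.exp(-0.5 * (x[:, None] - x[None, :])**2) + sigma2 * np.eye(40)

print("log p(D | linear model):", log_marginal_likelihood(y, K_lin))
print("log p(D | SE model):    ", log_marginal_likelihood(y, K_se))
```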

85 AUTOMATIC TRANSLATION OF MODELS [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 33 / 64

86 AUTOMATIC TRANSLATION OF MODELS Search can produce arbitrarily complicated models from open-ended language but two main properties allow description to be automated Kernels can be decomposed into a sum of products A sum of kernels corresponds to a sum of functions Therefore, we can describe each product of kernels separately Each kernel in a product modifies a model in a consistent way Each kernel roughly corresponds to an adjective James Robert Lloyd 34 / 64

87 SUM OF PRODUCTS NORMAL FORM Suppose the search finds the following kernel SE × (WN × LIN + CP(C, PER)) James Robert Lloyd 35 / 64

88 SUM OF PRODUCTS NORMAL FORM Suppose the search finds the following kernel SE × (WN × LIN + CP(C, PER)) The changepoint can be converted into a sum of products SE × (WN × LIN + C × σ + PER × σ) James Robert Lloyd 35 / 64

89 SUM OF PRODUCTS NORMAL FORM Suppose the search finds the following kernel SE × (WN × LIN + CP(C, PER)) The changepoint can be converted into a sum of products SE × (WN × LIN + C × σ + PER × σ) Multiplication can be distributed over addition SE × WN × LIN + SE × C × σ + SE × PER × σ James Robert Lloyd 35 / 64

90 SUM OF PRODUCTS NORMAL FORM Suppose the search finds the following kernel SE × (WN × LIN + CP(C, PER)) The changepoint can be converted into a sum of products SE × (WN × LIN + C × σ + PER × σ) Multiplication can be distributed over addition SE × WN × LIN + SE × C × σ + SE × PER × σ Simplification rules are applied WN × LIN + SE × σ + SE × PER × σ James Robert Lloyd 35 / 64
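
The conversion into sum-of-products form can be sketched as a small symbolic rewrite; the tuple representation and function names here are hypothetical, and the final simplification rules (e.g. SE × WN = WN) are not shown.

```python
# Sketch of rewriting a kernel expression into sum-of-products form by expanding
# changepoints and distributing multiplication over addition. Expressions are
# nested tuples, e.g. ("*", "SE", ("+", ("*", "WN", "LIN"), ("CP", "C", "PER"))).
def expand_cp(expr):
    """Replace CP(k1, k2) by k1 * sigma + k2 * sigma (sigmoid factors written as 'sigma')."""
    if isinstance(expr, str):
        return expr
    op, a, b = expr
    a, b = expand_cp(a), expand_cp(b)
    if op == "CP":
        return ("+", ("*", a, "sigma"), ("*", b, "sigma"))
    return (op, a, b)

def distribute(expr):
    """Distribute products over sums, returning a list of products (each a list of base kernels)."""
    if isinstance(expr, str):
        return [[expr]]
    op, a, b = expr
    if op == "+":
        return distribute(a) + distribute(b)
    if op == "*":
        return [pa + pb for pa in distribute(a) for pb in distribute(b)]
    raise ValueError(op)

expr = ("*", "SE", ("+", ("*", "WN", "LIN"), ("CP", "C", "PER")))
print(distribute(expand_cp(expr)))
# -> [['SE', 'WN', 'LIN'], ['SE', 'C', 'sigma'], ['SE', 'PER', 'sigma']]
```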

91 SUMS OF KERNELS ARE SUMS OF FUNCTIONS If f_1 ~ GP(0, k_1) and independently f_2 ~ GP(0, k_2) then f_1 + f_2 ~ GP(0, k_1 + k_2) [Plots: full model posterior with extrapolations decomposed as a sum of posteriors of individual components] We can therefore describe each component separately James Robert Lloyd 36 / 64
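
Concretely, when the kernel is a sum, the posterior mean of each additive component at the training inputs is K_j (K + σ²I)⁻¹ y, which is what the per-component plots in the reports show. A minimal sketch, assuming an illustrative linear-plus-periodic dataset and kernels.

```python
import numpy as np

# Sketch of describing additive components separately: with k = k_lin + k_per (+ noise),
# the posterior mean of component j at the training inputs is K_j (K_total)^{-1} y.
rng = np.random.default_rng(5)
x = np.linspace(0, 6, 120)
y = 0.5 * x + np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(len(x))   # trend + periodic + noise

K_lin = np.outer(x, x)                                                     # linear component kernel
K_per = np.exp(-2 * np.sin(np.pi * np.abs(x[:, None] - x[None, :]))**2)   # periodic (period 1) kernel
sigma2 = 0.1**2
K = K_lin + K_per + sigma2 * np.eye(len(x))                                # full model covariance

alpha = np.linalg.solve(K, y)
trend_component = K_lin @ alpha          # posterior mean of the linear component
periodic_component = K_per @ alpha       # posterior mean of the periodic component
```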

92 PRODUCTS OF KERNELS PER - periodic function On their own, each kernel is described by a standard noun phrase James Robert Lloyd 37 / 64

93 PRODUCTS OF KERNELS - SE SE × PER - approximately periodic function Multiplication by SE removes long range correlations from a model since SE(x, x′) decreases monotonically to 0 as |x − x′| increases. James Robert Lloyd 38 / 64

94 PRODUCTS OF KERNELS - LIN SE × PER × LIN - approximately periodic function with linearly growing amplitude Multiplication by LIN is equivalent to multiplying the function being modeled by a linear function. If f(x) ~ GP(0, k), then x f(x) ~ GP(0, k × LIN). This causes the standard deviation of the model to vary linearly without affecting the correlation. James Robert Lloyd 39 / 64

95 PRODUCTS OF KERNELS - CHANGEPOINTS SE × PER × LIN × σ - approximately periodic function with linearly growing amplitude until 1700 Multiplication by σ is equivalent to multiplying the function being modeled by a sigmoid. James Robert Lloyd 40 / 64

96 AUTOMATICALLY GENERATED REPORTS [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 41 / 64

97 EXAMPLE: AIRLINE PASSENGER VOLUME [Plots: raw data; full model posterior with extrapolations] Four additive components have been identified in the data A linearly increasing function. An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude. A smooth function. Uncorrelated noise with linearly increasing standard deviation. James Robert Lloyd 42 / 64

98 EXAMPLE: AIRLINE PASSENGER VOLUME This component is linearly increasing. [Plots: posterior of component 1; sum of components up to component 1] James Robert Lloyd 43 / 64

99 EXAMPLE: AIRLINE PASSENGER VOLUME This component is approximately periodic with a period of 1.0 years and varying amplitude. Across periods the shape of this function varies very smoothly. The amplitude of the function increases linearly. The shape of this function within each period has a typical lengthscale of 6.0 weeks. [Plots: posterior of component 2; sum of components up to component 2] James Robert Lloyd 44 / 64

100 EXAMPLE: AIRLINE PASSENGER VOLUME This component is a smooth function with a typical lengthscale of 8.1 months. [Plots: posterior of component 3; sum of components up to component 3] James Robert Lloyd 45 / 64

101 EXAMPLE: AIRLINE PASSENGER VOLUME This component models uncorrelated noise. The standard deviation of the noise increases linearly. [Plots: posterior of component 4; sum of components up to component 4] James Robert Lloyd 46 / 64

102 EXAMPLE: SOLAR IRRADIANCE This component is constant. [Plots: posterior of the component; sum of components so far] James Robert Lloyd 47 / 64

103 EXAMPLE: SOLAR IRRADIANCE This component is constant. This component applies from 1643 until 1716. [Plots: posterior of the component; sum of components so far] James Robert Lloyd 48 / 64

104 EXAMPLE: SOLAR IRRADIANCE This component is a smooth function with a typical lengthscale of 23.1 years. This component applies until 1643 and from 1716 onwards. [Plots: posterior of the component; sum of components so far] James Robert Lloyd 49 / 64

105 EXAMPLE: SOLAR IRRADIANCE This component is approximately periodic with a period of 10.8 years. Across periods the shape of this function varies smoothly with a typical lengthscale of 36.9 years. The shape of this function within each period is very smooth and resembles a sinusoid. This component applies until 1643 and from 1716 onwards. [Plots: posterior of the component; sum of components so far] James Robert Lloyd 50 / 64

106 EXAMPLE: FULL REPORT See pdf James Robert Lloyd 51 / 64

107 SUMMARY We have briefly introduced Bayesian statistics and Gaussian processes Described the beginnings of an automatic statistician for regression Defines an open-ended language of models Searches greedily through this space Produces detailed reports describing patterns in data We believe this line of research has the potential to make powerful statistical model-building techniques accessible to non-experts James Robert Lloyd 52 / 64

108 THANKS Thanks James Robert Lloyd 53 / 64

109 APPENDIX James Robert Lloyd 54 / 64

110 SUM AND PRODUCT RULES OF PROBABILITY Sum rule: P(X = x) = ∑_y P(X = x, Y = y) Product rule: P(X = x, Y = y) = P(X = x) P(Y = y | X = x) James Robert Lloyd 55 / 64

111 SUM AND PRODUCT RULES OF PROBABILITY Sum rule: p(x) = ∫ p(x, y) dy Product rule: p(x, y) = p(x) p(y | x) James Robert Lloyd 55 / 64

112 DERIVING BAYES RULE p(θ, D | M) = p(θ, D | M) James Robert Lloyd 56 / 64

113 DERIVING BAYES RULE p(θ, D | M) = p(θ, D | M) p(θ | D, M) p(D | M) = p(D | θ, M) p(θ | M) (product rule) James Robert Lloyd 56 / 64

114 DERIVING BAYES RULE p(θ, D | M) = p(θ, D | M) p(θ | D, M) p(D | M) = p(D | θ, M) p(θ | M) (product rule) p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M) (divide) James Robert Lloyd 56 / 64

115 DERIVING BAYES RULE p(θ, D | M) = p(θ, D | M) p(θ | D, M) p(D | M) = p(D | θ, M) p(θ | M) (product rule) p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M) (divide) p(θ | D, M) = p(D | θ, M) p(θ | M) / ∫ p(D, θ | M) dθ (sum rule) James Robert Lloyd 56 / 64

116 DERIVING BAYES RULE p(θ, D | M) = p(θ, D | M) p(θ | D, M) p(D | M) = p(D | θ, M) p(θ | M) (product rule) p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M) (divide) p(θ | D, M) = p(D | θ, M) p(θ | M) / ∫ p(D, θ | M) dθ (sum rule) p(θ | D, M) = p(D | θ, M) p(θ | M) / ∫ p(D | θ, M) p(θ | M) dθ (product rule) James Robert Lloyd 56 / 64

117 DEFINING A LANGUAGE OF REGRESSION MODELS Regression consists of learning a function f : X → Y from inputs to outputs from example input / output pairs Language should include simple parametric forms... e.g. Linear functions, Polynomials, Exponential functions... as well as functions specified by high level properties e.g. Smoothness, Periodicity Inference should be tractable for all models in language James Robert Lloyd 57 / 64

118 A LANGUAGE OF GAUSSIAN PROCESS KERNELS It is common practice to use a zero mean function since the mean can be marginalised out Suppose f(x) | a ~ GP(a µ(x), k(x, x′)) where a ~ N(0, 1) Then equivalently, f(x) ~ GP(0, µ(x)µ(x′) + k(x, x′)) We therefore define a language of GP regression models by specifying a language of kernels James Robert Lloyd 58 / 64

119 MODEL EVALUATION After proposing a new model its kernel parameters are optimised by conjugate gradients We evaluate each optimised model, M, using the marginal likelihood which can be computed analytically for GPs We penalise the marginal likelihood for the optimised kernel parameters using the Bayesian Information Criterion (BIC): −0.5 BIC(M) = log p(D | M) − (p / 2) log n where p is the number of kernel parameters, D represents the data, and n is the number of data points. James Robert Lloyd 59 / 64
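
A minimal sketch of this model score (log marginal likelihood with a BIC penalty), assuming fixed, already-optimised hyperparameters; the hyperparameter optimisation by conjugate gradients is omitted and the example kernel values are illustrative.

```python
import numpy as np

# Sketch of the score used to evaluate each proposed model: the GP log marginal
# likelihood penalised for the number of kernel parameters via the BIC.
def gp_log_marginal_likelihood(y, K):
    n = len(y)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # K^{-1} y via Cholesky
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

def bic_score(y, K, num_params):
    """Returns -0.5 * BIC(M) = log p(D | M) - (p / 2) * log n (higher is better)."""
    return gp_log_marginal_likelihood(y, K) - 0.5 * num_params * np.log(len(y))

# e.g. an SE + WN model with three parameters (lengthscale, signal variance, noise variance)
x = np.linspace(0, 1, 30)
y = np.sin(6 * x)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.2**2) + 0.01 * np.eye(30)
print(bic_score(y, K, num_params=3))
```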

120 GOOD PREDICTIVE PERFORMANCE AS WELL [Bar chart: standardised RMSE over 13 data sets for ABCD accuracy, ABCD interpretability, spectral kernels, trend/cyclical/irregular, Bayesian MKL, Eureqa, changepoints, squared exponential and linear regression] Tweaks can be made to the algorithm to improve accuracy or interpretability of models produced but both methods are highly competitive at extrapolation (shown above) and interpolation James Robert Lloyd 60 / 64

121 GOALS OF THE AUTOMATIC STATISTICIAN PROJECT Provide a set of tools for understanding data that require minimal expert input Uncover challenging research problems in e.g. Automated inference Model construction and comparison Data visualisation and interpretation Advance the field of machine learning in general James Robert Lloyd 61 / 64

122 INGREDIENTS OF AN AUTOMATIC STATISTICIAN [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] An open-ended language of models Expressive enough to capture real-world phenomena and the techniques used by human statisticians A search procedure To efficiently explore the language of models A principled method of evaluating models Trading off complexity and fit to data A procedure to automatically explain the models Making the assumptions of the models explicit in a way that is intelligible to non-experts James Robert Lloyd 62 / 64

123 CHALLENGES Interpretability / accuracy Increasing the expressivity of language e.g. Monotonicity, positive functions, symmetries Computational complexity of searching through a huge space of models Extending the automatic reports to multidimensional datasets Search and descriptions naturally extend to multiple dimensions, but automatically generating relevant visual summaries is harder James Robert Lloyd 63 / 64

124 CURRENT AND FUTURE DIRECTIONS Automatic statistician for: Multivariate nonlinear regression Classification Completing and interpreting tables and databases Probabilistic programming Probabilistic models are expressed in a general (Turing complete) programming language (e.g. Church/Venture/Anglican) A universal inference engine can then be used to infer unobserved variables given observed data This can be used to implement search over the model space in an automated statistician James Robert Lloyd 64 / 64
