How to build an automatic statistician


1 How to build an automatic statistician James Robert Lloyd 1, David Duvenaud 1, Roger Grosse 2, Joshua Tenenbaum 2, Zoubin Ghahramani 1 1: Department of Engineering, University of Cambridge, UK 2: Massachusetts Institute of Technology, USA November 19, 2014

2 THERE IS A GROWING NEED FOR DATA ANALYSIS We live in an era of abundant data The McKinsey Global Institute claim The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data. Diverse fields increasingly relying on expert statisticians, machine learning researchers and data scientists e.g. Computational sciences (e.g. biology, astronomy,... ) Online advertising Quantitative finance... James Robert Lloyd 2 / 64

3 WHAT WOULD AN AUTOMATIC STATISTICIAN DO? [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 3 / 64

4 PREVIEW: AN ENTIRELY AUTOMATIC ANALYSIS [Plots: raw data; full model posterior with extrapolations] Four additive components have been identified in the data A linearly increasing function. An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude. A smooth function. Uncorrelated noise with linearly increasing standard deviation. James Robert Lloyd 4 / 64

5 AGENDA Introduction to Bayesian statistics Introduction to Gaussian processes via linear regression Description of automatic statistician Examples of automatically generated output James Robert Lloyd 5 / 64

6 HOW TO DO LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise James Robert Lloyd 6 / 64

7 HOW TO DO LINEAR REGRESSION Common approach: Estimate m by least squares James Robert Lloyd 6 / 64

8 HOW TO DO BAYESIAN LINEAR REGRESSION Bayesian approach: Specify beliefs about m using probability distributions James Robert Lloyd 7 / 64

9 HOW TO DO BAYESIAN LINEAR REGRESSION Bayesian approach: Specify prior beliefs about m using probability distributions Follow rules of probability to update beliefs rationally after observing data James Robert Lloyd 7 / 64

10 BAYESIAN STATISTICS Suppose we wish to fit a model M with parameters θ to data D M defined by p(D | θ, M) - the likelihood James Robert Lloyd 8 / 64

11 BAYESIAN STATISTICS Suppose we wish to fit a model M with parameters θ to data D M defined by p(D | θ, M) - the likelihood We represent our uncertainty about θ with a probability distribution p(θ | M) - the prior James Robert Lloyd 8 / 64

12 BAYESIAN STATISTICS Suppose we wish to fit a model M with parameters θ to data D M defined by p(D | θ, M) - the likelihood We represent our uncertainty about θ with a probability distribution p(θ | M) - the prior We now derive p(θ | D, M) - the posterior James Robert Lloyd 8 / 64

13 BAYES RULE Suppose we wish to fit a model M with parameters θ to data D M defined by p(D | θ, M) - the likelihood We represent our uncertainty about θ with a probability distribution p(θ | M) - the prior We then derive p(θ | D, M) - the posterior $p(\theta \mid D, M) = \dfrac{p(D \mid \theta, M)\, p(\theta \mid M)}{\int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta}$ (posterior = likelihood × prior / marginal likelihood) James Robert Lloyd 9 / 64

14 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise James Robert Lloyd 10 / 64

15 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise Represent prior beliefs about m using a probability distribution James Robert Lloyd 10 / 64

16 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise Represent prior beliefs about m using a probability distribution e.g. m ~ Normal(mean = 0, variance = 1) James Robert Lloyd 10 / 64

17 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise Represent prior beliefs about m using a probability distribution e.g. m ~ N(0, 1) James Robert Lloyd 10 / 64

18 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise Represent prior beliefs about m using a probability distribution e.g. m ~ N(0, 1) Define likelihood by representing prior beliefs about ε using a probability distribution e.g. ε_i ~ N(0, σ_ε²) independently for all i James Robert Lloyd 10 / 64

19 BAYESIAN LINEAR REGRESSION Linear regression starts by assuming a model of the form y_i = m x_i + ε_i where x and y are inputs and outputs respectively and ε are errors or noise Represent prior beliefs about m using a probability distribution e.g. m ~ N(0, 1) Define likelihood by representing prior beliefs about ε using a probability distribution e.g. ε_i ~ N(0, σ_ε²) independently for all i Apply Bayes rule James Robert Lloyd 10 / 64

20 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) = \dfrac{p(y \mid m, x)\, p(m)}{\int p(y \mid m, x)\, p(m)\, dm}$ James Robert Lloyd 11 / 64

21 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) = \dfrac{p(y \mid m, x)\, p(m)}{\int p(y \mid m, x)\, p(m)\, dm} \propto p(y \mid m, x)\, p(m)$ James Robert Lloyd 11 / 64

22 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) \propto p(y \mid m, x)\, p(m) = \left(\prod_i \tfrac{1}{\sqrt{2\pi}\,\sigma_\varepsilon} e^{-(y_i - m x_i)^2/(2\sigma_\varepsilon^2)}\right) \tfrac{1}{\sqrt{2\pi}} e^{-m^2/2}$ James Robert Lloyd 11 / 64

23 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) \propto p(y \mid m, x)\, p(m) \propto e^{-\sum_i (y_i - m x_i)^2/(2\sigma_\varepsilon^2)\; -\; m^2/2}$ James Robert Lloyd 11 / 64

24 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) \propto e^{-\frac{\sum_i x_i^2 + \sigma_\varepsilon^2}{2\sigma_\varepsilon^2}\left(m - \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)^2}$ James Robert Lloyd 11 / 64

25 BAYES RULE APPLIED TO LINEAR REGRESSION $p(m \mid y, x) \propto e^{-\frac{\sum_i x_i^2 + \sigma_\varepsilon^2}{2\sigma_\varepsilon^2}\left(m - \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)^2}$ This is a normal distribution: $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 11 / 64

26 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64

27 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64

28 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64

29 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64

30 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64

31 BAYESIAN LINEAR REGRESSION $m \mid y, x \sim \mathcal{N}\!\left(\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma_\varepsilon^2},\ \frac{\sigma_\varepsilon^2}{\sum_i x_i^2 + \sigma_\varepsilon^2}\right)$ James Robert Lloyd 12 / 64
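
As a concrete illustration of the posterior just derived, here is a minimal numpy sketch (not from the talk) that generates synthetic data from y_i = m x_i + ε_i and evaluates the closed-form posterior over the slope m; the seed, noise level and "true" slope are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: closed-form posterior over the slope m, assuming
# y_i = m * x_i + eps_i with prior m ~ N(0, 1) and noise eps_i ~ N(0, sigma_eps^2).
rng = np.random.default_rng(0)
sigma_eps = 0.5
m_true = 1.3                                   # illustrative "true" slope
x = rng.uniform(-2, 2, size=20)
y = m_true * x + sigma_eps * rng.standard_normal(20)

# Posterior: m | y, x ~ N( sum(x*y) / (sum(x^2) + sigma^2),  sigma^2 / (sum(x^2) + sigma^2) )
denom = np.sum(x**2) + sigma_eps**2
post_mean = np.sum(x * y) / denom
post_var = sigma_eps**2 / denom
print(f"posterior mean {post_mean:.3f}, posterior sd {np.sqrt(post_var):.3f}")
```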

32 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i James Robert Lloyd 13 / 64

33 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) James Robert Lloyd 13 / 64

34 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) y_i | m, ε_i = m x_i + ε_i James Robert Lloyd 13 / 64

35 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) y_i | m, ε_i = m x_i + ε_i This implicitly defined a joint prior on {y_i : i = 1,..., n} James Robert Lloyd 13 / 64

36 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) y_i | m, ε_i = m x_i + ε_i This implicitly defined a joint prior on {y_i : i = 1,..., n} y_i ~ N(0, x_i² + σ_ε²) (sum of two normals) James Robert Lloyd 13 / 64

37 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) y_i | m, ε_i = m x_i + ε_i This implicitly defined a joint prior on {y_i : i = 1,..., n} y_i ~ N(0, x_i² + σ_ε²) (sum of two normals) cov(y_i, y_j) = x_i x_j for i ≠ j James Robert Lloyd 13 / 64

38 BAYESIAN LINEAR REGRESSION REWRITTEN We defined a Bayesian linear regression model by specifying priors on m and ε_i m ~ N(0, 1), ε_i ~ iid N(0, σ_ε²) y_i | m, ε_i = m x_i + ε_i This implicitly defined a joint prior on {y_i : i = 1,..., n} y_i ~ N(0, x_i² + σ_ε²) (sum of two normals) cov(y_i, y_j) = x_i x_j for i ≠ j {y_i : i = 1,..., n} has a multivariate normal distribution y ~ N(0, K) where k_ij = x_i x_j + δ_ij σ_ε² James Robert Lloyd 13 / 64

39 BAYESIAN LINEAR REGRESSION REWRITTEN {y_i : i = 1,..., n} has a multivariate normal distribution y ~ N(0, K) where k_ij = x_i x_j + δ_ij σ_ε² James Robert Lloyd 14 / 64

40 BAYESIAN LINEAR REGRESSION REWRITTEN {y_i : i = 1,..., n} has a multivariate normal distribution y ~ N(0, K) where k_ij = x_i x_j + δ_ij σ_ε² The covariance can be written as a function of pairs of x_i k_ij = k(x_i, x_j) = x_i x_j + δ_{x_i = x_j} σ_ε² James Robert Lloyd 14 / 64

41 BAYESIAN LINEAR REGRESSION REWRITTEN {y_i : i = 1,..., n} has a multivariate normal distribution y ~ N(0, K) where k_ij = x_i x_j + δ_ij σ_ε² The covariance can be written as a function of pairs of x_i k_ij = k(x_i, x_j) = x_i x_j + δ_{x_i = x_j} σ_ε² Do we define a valid probability distribution if {x_i} = ℝ? James Robert Lloyd 14 / 64
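
The equivalence above can be checked numerically. The following sketch (illustrative, not from the talk) samples m and ε and confirms that the empirical covariance of y matches K with k_ij = x_i x_j + δ_ij σ_ε²; the inputs, seed and noise level are arbitrary choices.

```python
import numpy as np

# Sketch: integrating out m ~ N(0, 1) in y_i = m x_i + eps_i gives a joint
# Gaussian y ~ N(0, K) with K_ij = x_i x_j + delta_ij sigma_eps^2.
rng = np.random.default_rng(1)
sigma_eps = 0.3
x = np.array([-1.0, 0.5, 2.0])

K = np.outer(x, x) + sigma_eps**2 * np.eye(len(x))    # kernel matrix

# Monte Carlo check: sample m and eps, compare the empirical covariance of y with K.
n_samples = 200_000
m = rng.standard_normal(n_samples)                     # m ~ N(0, 1)
eps = sigma_eps * rng.standard_normal((n_samples, len(x)))
y = m[:, None] * x[None, :] + eps
print(np.round(np.cov(y, rowvar=False), 2))            # empirical covariance
print(np.round(K, 2))                                  # analytic kernel matrix
```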

42 GAUSSIAN PROCESSES A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution James Robert Lloyd 15 / 64

43 GAUSSIAN PROCESSES A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution We can write this collection of random variables as {f(x) : x ∈ X} i.e. a function f evaluated at inputs x James Robert Lloyd 15 / 64

44 GAUSSIAN PROCESSES A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution We can write this collection of random variables as {f(x) : x ∈ X} i.e. a function f evaluated at inputs x A GP is completely specified by Mean function, µ(x) = E(f(x)) Covariance / kernel function, k(x, x′) = Cov(f(x), f(x′)) Denoted f ~ GP(µ, k) James Robert Lloyd 15 / 64

45 GAUSSIAN PROCESSES A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution We can write this collection of random variables as {f(x) : x ∈ X} i.e. a function f evaluated at inputs x A GP is completely specified by Mean function, µ(x) = E(f(x)) Covariance / kernel function, k(x, x′) = Cov(f(x), f(x′)) Denoted f ~ GP(µ, k) Can be thought of as a probability distribution on functions James Robert Lloyd 15 / 64

46 LINEAR KERNEL = LINEAR FUNCTIONS k(x, x′) = x x′ James Robert Lloyd 16 / 64

47 WHAT ABOUT OTHER KERNELS? k(x, x′) = e^{−(x − x′)²} James Robert Lloyd 17 / 64
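
To see what these two kernels imply about functions, one can draw samples from the corresponding GP priors. A minimal sketch, assuming a zero mean function, unit lengthscale and a small jitter term added for numerical stability:

```python
import numpy as np

# Sketch: draw functions from GP priors with the linear kernel k(x, x') = x x'
# and the squared exponential kernel k(x, x') = exp(-(x - x')^2).
rng = np.random.default_rng(2)
xs = np.linspace(-3, 3, 100)

def k_lin(a, b):
    return np.outer(a, b)

def k_se(a, b):
    return np.exp(-(a[:, None] - b[None, :])**2)

for name, k in [("linear", k_lin), ("squared exponential", k_se)]:
    K = k(xs, xs) + 1e-8 * np.eye(len(xs))            # jitter for numerical stability
    L = np.linalg.cholesky(K)
    samples = L @ rng.standard_normal((len(xs), 3))   # three draws from GP(0, k)
    print(name, "samples shape:", samples.shape)      # plotted, these are lines vs smooth curves
```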

48 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

49 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

50 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

51 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

52 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

53 BAYESIAN NON-LINEAR REGRESSION James Robert Lloyd 18 / 64

54 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

55 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

56 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

57 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

58 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

59 BAYESIAN LINEAR REGRESSION GONE WRONG James Robert Lloyd 19 / 64

60 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

61 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

62 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

63 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

64 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

65 NON-LINEARITY TO THE RESCUE James Robert Lloyd 20 / 64

66 WHICH KERNEL FUNCTION? How far can we go with kernel functions? James Robert Lloyd 21 / 64

67 DEFINING A LANGUAGE OF MODELS [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 22 / 64

68 THE ATOMS OF OUR LANGUAGE Five base kernels, each encoding a type of function: Squared exp. (SE) - smooth functions; Periodic (PER) - periodic functions; Linear (LIN) - linear functions; Constant (C) - constant functions; White noise (WN) - Gaussian noise James Robert Lloyd 23 / 64

69 THE COMPOSITION RULES OF OUR LANGUAGE Two main operations: addition, multiplication LIN × LIN - quadratic functions; SE × PER - locally periodic; LIN + PER - periodic plus linear trend; SE + PER - periodic plus smooth trend James Robert Lloyd 24 / 64
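
A minimal sketch of these base kernels and composition rules as plain numpy functions; the hyperparameters (lengthscales, periods, variances) are illustrative defaults, not values from the talk.

```python
import numpy as np

# Five base kernels (illustrative hyperparameters) and example compositions.
def SE(x, y, ell=1.0):
    return np.exp(-0.5 * (x[:, None] - y[None, :])**2 / ell**2)

def PER(x, y, period=1.0, ell=1.0):
    return np.exp(-2 * np.sin(np.pi * np.abs(x[:, None] - y[None, :]) / period)**2 / ell**2)

def LIN(x, y):
    return x[:, None] * y[None, :]

def C(x, y, variance=1.0):
    return variance * np.ones((len(x), len(y)))

def WN(x, y, variance=1.0):
    return variance * (x[:, None] == y[None, :]).astype(float)

xs = np.linspace(0, 4, 200)
quadratic     = LIN(xs, xs) * LIN(xs, xs)      # LIN x LIN: quadratic functions
loc_periodic  = SE(xs, xs) * PER(xs, xs)       # SE x PER: locally periodic
per_plus_lin  = LIN(xs, xs) + PER(xs, xs)      # LIN + PER: periodic plus linear trend

# One draw from the locally periodic prior (jitter added for numerical stability).
rng = np.random.default_rng(3)
sample = np.linalg.cholesky(loc_periodic + 1e-8 * np.eye(len(xs))) @ rng.standard_normal(len(xs))
```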

70 MODELING CHANGEPOINTS Time series data often exhibit changepoints: [Plots of a time series exhibiting a changepoint] James Robert Lloyd 25 / 64

71 MODELING CHANGEPOINTS Time series data often exhibit changepoints: [Plots of a time series exhibiting a changepoint] We can model this by assuming f_1(x) ~ GP(0, k_1) and f_2(x) ~ GP(0, k_2) and then defining f(x) = (1 − σ(x)) f_1(x) + σ(x) f_2(x) where σ is a sigmoid function between 0 and 1. James Robert Lloyd 25 / 64

72 MODELING CHANGEPOINTS We can model this by assuming f_1(x) ~ GP(0, k_1) and f_2(x) ~ GP(0, k_2) and then defining f(x) = (1 − σ(x)) f_1(x) + σ(x) f_2(x) where σ is a sigmoid function between 0 and 1. Then f ~ GP(0, k), where k(x, x′) = (1 − σ(x)) k_1(x, x′) (1 − σ(x′)) + σ(x) k_2(x, x′) σ(x′) We define the changepoint operator k = CP(k_1, k_2). James Robert Lloyd 26 / 64
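
A sketch of the CP operator as defined above, using a logistic sigmoid; the changepoint location, steepness and the two example kernels are illustrative assumptions.

```python
import numpy as np

# Changepoint operator: k(x, x') = (1 - s(x)) k1(x, x') (1 - s(x')) + s(x) k2(x, x') s(x').
def sigmoid(x, location=0.0, steepness=2.0):
    return 1.0 / (1.0 + np.exp(-steepness * (x - location)))

def CP(k1, k2, location=0.0, steepness=2.0):
    def k(x, y):
        sx, sy = sigmoid(x, location, steepness), sigmoid(y, location, steepness)
        return ((1 - sx)[:, None] * k1(x, y) * (1 - sy)[None, :]
                + sx[:, None] * k2(x, y) * sy[None, :])
    return k

# e.g. smooth behaviour before the changepoint, periodic behaviour after it
k_se  = lambda x, y: np.exp(-0.5 * (x[:, None] - y[None, :])**2)
k_per = lambda x, y: np.exp(-2 * np.sin(np.pi * np.abs(x[:, None] - y[None, :]))**2)
k_cp = CP(k_se, k_per, location=0.0)

xs = np.linspace(-3, 3, 100)
K = k_cp(xs, xs)                                   # covariance matrix of the changepoint model
```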

73 AN EXPRESSIVE LANGUAGE OF MODELS Regression model - Kernel: GP smoothing - SE + WN; Linear regression - C + LIN + WN; Multiple kernel learning - ∑ SE + WN; Trend, cyclical, irregular - ∑ SE + ∑ PER + WN; Fourier decomposition - C + ∑ cos + WN; Sparse spectrum GPs - ∑ cos + WN; Spectral mixture - ∑ SE × cos + WN; Changepoints - e.g. CP(SE, SE) + WN; Heteroscedasticity - e.g. SE + LIN × WN Note: cos is a special case of our version of PER James Robert Lloyd 27 / 64

74 DISCOVERING A GOOD MODEL VIA SEARCH [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 28 / 64

75 DISCOVERING A GOOD MODEL VIA SEARCH Language defined as the arbitrary composition of five base kernels (WN, C, LIN, SE, PER) via three operators (+, ×, CP). The space spanned by this language is open-ended and can have a high branching factor requiring a judicious search We propose a greedy search for its simplicity and similarity to human model-building James Robert Lloyd 29 / 64
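
The greedy search can be sketched schematically as below. Kernel expressions are represented as nested tuples, and `score` is a stand-in for the real evaluation step (fitting the GP and computing a penalised marginal likelihood, covered later); the function names `expand` and `greedy_search` are hypothetical, not from the talk's implementation.

```python
# Schematic sketch of greedy search over kernel expressions.
BASE = ["WN", "C", "LIN", "SE", "PER"]

def expand(expr):
    """All one-step expansions of an expression: combine with a base kernel via +, x or CP."""
    for b in BASE:
        yield ("+", expr, b)
        yield ("*", expr, b)
        yield ("CP", expr, b)

def greedy_search(data, score, depth=3):
    """Start from the best base kernel, then repeatedly take the best one-step expansion."""
    best = max(BASE, key=lambda e: score(e, data))
    for _ in range(depth - 1):
        candidate = max(expand(best), key=lambda e: score(e, data))
        if score(candidate, data) <= score(best, data):
            break                                  # no improvement: stop early
        best = candidate
    return best

# e.g. with a stand-in score (a real system would use the BIC-penalised marginal likelihood)
print(greedy_search(data=None, score=lambda e, d: -len(str(e)), depth=3))
```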

76 EXAMPLE: MAUNA LOA KEELING CURVE [Plot of current model: RQ] Search tree: Start → SE, RQ, LIN, PER → SE + RQ,..., PER + RQ,..., PER × RQ → SE + PER + RQ,..., SE × (PER + RQ) James Robert Lloyd 30 / 64

77 EXAMPLE: MAUNA LOA KEELING CURVE [Plot of current model: ( PER + RQ )] Search tree: Start → SE, RQ, LIN, PER → SE + RQ,..., PER + RQ,..., PER × RQ → SE + PER + RQ,..., SE × (PER + RQ) James Robert Lloyd 30 / 64

78 EXAMPLE: MAUNA LOA KEELING CURVE [Plot of current model: SE × ( PER + RQ )] Search tree: Start → SE, RQ, LIN, PER → SE + RQ,..., PER + RQ,..., PER × RQ → SE + PER + RQ,..., SE × (PER + RQ) James Robert Lloyd 30 / 64

79 EXAMPLE: MAUNA LOA KEELING CURVE [Plot of current model: ( SE + SE × ( PER + RQ ) )] Search tree: Start → SE, RQ, LIN, PER → SE + RQ,..., PER + RQ,..., PER × RQ → SE + PER + RQ,..., SE × (PER + RQ) James Robert Lloyd 30 / 64

80 MODEL EVALUATION [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 31 / 64

81 BAYESIAN MODEL SELECTION Suppose we have a collection of models {M_i} and some data D James Robert Lloyd 32 / 64

82 BAYESIAN MODEL SELECTION Suppose we have a collection of models {M_i} and some data D Bayes rule tells us $p(M_i \mid D) = \dfrac{p(D \mid M_i)\, p(M_i)}{p(D)}$ James Robert Lloyd 32 / 64

83 BAYESIAN MODEL SELECTION Suppose we have a collection of models {M_i} and some data D Bayes rule tells us $p(M_i \mid D) = \dfrac{p(D \mid M_i)\, p(M_i)}{p(D)}$ If p(M_i) is equal for all i (prior ignorance) then $p(M_i \mid D) \propto p(D \mid M_i) = \int p(D \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i$ James Robert Lloyd 32 / 64

84 BAYESIAN MODEL SELECTION Suppose we have a collection of models {M_i} and some data D Bayes rule tells us $p(M_i \mid D) = \dfrac{p(D \mid M_i)\, p(M_i)}{p(D)}$ If p(M_i) is equal for all i (prior ignorance) then $p(M_i \mid D) \propto p(D \mid M_i) = \int p(D \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i$ i.e. The most likely model has the highest marginal likelihood James Robert Lloyd 32 / 64
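
For GP models the marginal likelihood is available in closed form, so model comparison reduces to evaluating log N(y; 0, K_M) for each candidate. A minimal sketch, assuming two candidate kernels (the linear-regression kernel from earlier and an SE kernel) and synthetic near-linear data; the hyperparameter values are illustrative.

```python
import numpy as np

# Sketch of Bayesian model comparison for GP models: with equal p(M_i), the
# preferred model is the one with the highest marginal likelihood log N(y; 0, K_M).
def log_marginal_likelihood(y, K):
    n = len(y)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # K^{-1} y via Cholesky
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(4)
x = np.linspace(-2, 2, 40)
y = 0.8 * x + 0.1 * rng.standard_normal(40)                    # synthetic, nearly linear data

sigma2 = 0.1**2
K_lin = np.outer(x, x) + sigma2 * np.eye(40)                   # Bayesian linear regression kernel
K_se = np.exp(-0.5 * (x[:, None] - x[None, :])**2) + sigma2 * np.eye(40)

print("log p(D | linear model):", log_marginal_likelihood(y, K_lin))
print("log p(D | SE model):    ", log_marginal_likelihood(y, K_se))
```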

85 AUTOMATIC TRANSLATION OF MODELS [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 33 / 64

86 AUTOMATIC TRANSLATION OF MODELS Search can produce arbitrarily complicated models from open-ended language but two main properties allow description to be automated Kernels can be decomposed into a sum of products A sum of kernels corresponds to a sum of functions Therefore, we can describe each product of kernels separately Each kernel in a product modifies a model in a consistent way Each kernel roughly corresponds to an adjective James Robert Lloyd 34 / 64

87 SUM OF PRODUCTS NORMAL FORM Suppose the search finds the following kernel SE × (WN × LIN + CP(C, PER)) James Robert Lloyd 35 / 64

88 SUM OF PRODUCTS NORMAL FORM Suppose the search finds the following kernel SE × (WN × LIN + CP(C, PER)) The changepoint can be converted into a sum of products SE × (WN × LIN + C × σ + PER × σ) James Robert Lloyd 35 / 64

89 SUM OF PRODUCTS NORMAL FORM Suppose the search finds the following kernel SE × (WN × LIN + CP(C, PER)) The changepoint can be converted into a sum of products SE × (WN × LIN + C × σ + PER × σ) Multiplication can be distributed over addition SE × WN × LIN + SE × C × σ + SE × PER × σ James Robert Lloyd 35 / 64

90 SUM OF PRODUCTS NORMAL FORM Suppose the search finds the following kernel SE × (WN × LIN + CP(C, PER)) The changepoint can be converted into a sum of products SE × (WN × LIN + C × σ + PER × σ) Multiplication can be distributed over addition SE × WN × LIN + SE × C × σ + SE × PER × σ Simplification rules are applied WN × LIN + SE × σ + SE × PER × σ James Robert Lloyd 35 / 64
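
The conversion into sum-of-products form can be sketched as a small symbolic rewrite; the tuple representation and function names here are hypothetical, and the final simplification rules (e.g. SE × WN = WN) are not shown.

```python
# Sketch of rewriting a kernel expression into sum-of-products form by expanding
# changepoints and distributing multiplication over addition. Expressions are
# nested tuples, e.g. ("*", "SE", ("+", ("*", "WN", "LIN"), ("CP", "C", "PER"))).
def expand_cp(expr):
    """Replace CP(k1, k2) by k1 * sigma + k2 * sigma (sigmoid factors written as 'sigma')."""
    if isinstance(expr, str):
        return expr
    op, a, b = expr
    a, b = expand_cp(a), expand_cp(b)
    if op == "CP":
        return ("+", ("*", a, "sigma"), ("*", b, "sigma"))
    return (op, a, b)

def distribute(expr):
    """Distribute products over sums, returning a list of products (each a list of base kernels)."""
    if isinstance(expr, str):
        return [[expr]]
    op, a, b = expr
    if op == "+":
        return distribute(a) + distribute(b)
    if op == "*":
        return [pa + pb for pa in distribute(a) for pb in distribute(b)]
    raise ValueError(op)

expr = ("*", "SE", ("+", ("*", "WN", "LIN"), ("CP", "C", "PER")))
print(distribute(expand_cp(expr)))
# -> [['SE', 'WN', 'LIN'], ['SE', 'C', 'sigma'], ['SE', 'PER', 'sigma']]
```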

91 SUMS OF KERNELS ARE SUMS OF FUNCTIONS If f_1 ~ GP(0, k_1) and independently f_2 ~ GP(0, k_2) then f_1 + f_2 ~ GP(0, k_1 + k_2) [Plots: full model posterior with extrapolations decomposed as a sum of posteriors of individual components] We can therefore describe each component separately James Robert Lloyd 36 / 64
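
Concretely, when the kernel is a sum, the posterior mean of each additive component at the training inputs is K_j (K + σ²I)⁻¹ y, which is what the per-component plots in the reports show. A minimal sketch, assuming an illustrative linear-plus-periodic dataset and kernels.

```python
import numpy as np

# Sketch of describing additive components separately: with k = k_lin + k_per (+ noise),
# the posterior mean of component j at the training inputs is K_j (K_total)^{-1} y.
rng = np.random.default_rng(5)
x = np.linspace(0, 6, 120)
y = 0.5 * x + np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(len(x))   # trend + periodic + noise

K_lin = np.outer(x, x)                                                     # linear component kernel
K_per = np.exp(-2 * np.sin(np.pi * np.abs(x[:, None] - x[None, :]))**2)   # periodic (period 1) kernel
sigma2 = 0.1**2
K = K_lin + K_per + sigma2 * np.eye(len(x))                                # full model covariance

alpha = np.linalg.solve(K, y)
trend_component = K_lin @ alpha          # posterior mean of the linear component
periodic_component = K_per @ alpha       # posterior mean of the periodic component
```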

92 PRODUCTS OF KERNELS PER - periodic function On their own, each kernel is described by a standard noun phrase James Robert Lloyd 37 / 64

93 PRODUCTS OF KERNELS - SE SE × PER - approximately periodic function Multiplication by SE removes long range correlations from a model since SE(x, x′) decreases monotonically to 0 as |x − x′| increases. James Robert Lloyd 38 / 64

94 PRODUCTS OF KERNELS - LIN SE × PER × LIN - approximately periodic function with linearly growing amplitude Multiplication by LIN is equivalent to multiplying the function being modeled by a linear function. If f(x) ~ GP(0, k), then x f(x) ~ GP(0, k × LIN). This causes the standard deviation of the model to vary linearly without affecting the correlation. James Robert Lloyd 39 / 64

95 PRODUCTS OF KERNELS - CHANGEPOINTS SE × PER × LIN × σ - approximately periodic function with linearly growing amplitude until 1700 Multiplication by σ is equivalent to multiplying the function being modeled by a sigmoid. James Robert Lloyd 40 / 64

96 AUTOMATICALLY GENERATED REPORTS [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] James Robert Lloyd 41 / 64

97 EXAMPLE: AIRLINE PASSENGER VOLUME [Plots: raw data; full model posterior with extrapolations] Four additive components have been identified in the data A linearly increasing function. An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude. A smooth function. Uncorrelated noise with linearly increasing standard deviation. James Robert Lloyd 42 / 64

98 EXAMPLE: AIRLINE PASSENGER VOLUME This component is linearly increasing. [Plots: posterior of component 1; sum of components up to component 1] James Robert Lloyd 43 / 64

99 EXAMPLE: AIRLINE PASSENGER VOLUME This component is approximately periodic with a period of 1.0 years and varying amplitude. Across periods the shape of this function varies very smoothly. The amplitude of the function increases linearly. The shape of this function within each period has a typical lengthscale of 6.0 weeks. [Plots: posterior of component 2; sum of components up to component 2] James Robert Lloyd 44 / 64

100 EXAMPLE: AIRLINE PASSENGER VOLUME This component is a smooth function with a typical lengthscale of 8.1 months. [Plots: posterior of component 3; sum of components up to component 3] James Robert Lloyd 45 / 64

101 EXAMPLE: AIRLINE PASSENGER VOLUME This component models uncorrelated noise. The standard deviation of the noise increases linearly. [Plots: posterior of component 4; sum of components up to component 4] James Robert Lloyd 46 / 64

102 EXAMPLE: SOLAR IRRADIANCE This component is constant. [Plots: posterior of the component; sum of components so far] James Robert Lloyd 47 / 64

103 EXAMPLE: SOLAR IRRADIANCE This component is constant. This component applies from 1643 until 1716. [Plots: posterior of the component; sum of components so far] James Robert Lloyd 48 / 64

104 EXAMPLE: SOLAR IRRADIANCE This component is a smooth function with a typical lengthscale of 23.1 years. This component applies until 1643 and from 1716 onwards. [Plots: posterior of the component; sum of components so far] James Robert Lloyd 49 / 64

105 EXAMPLE: SOLAR IRRADIANCE This component is approximately periodic with a period of 10.8 years. Across periods the shape of this function varies smoothly with a typical lengthscale of 36.9 years. The shape of this function within each period is very smooth and resembles a sinusoid. This component applies until 1643 and from 1716 onwards. [Plots: posterior of the component; sum of components so far] James Robert Lloyd 50 / 64

106 EXAMPLE: FULL REPORT See pdf James Robert Lloyd 51 / 64

107 SUMMARY We have briefly introduced Bayesian statistics and Gaussian processes Described the beginnings of an automatic statistician for regression Defines an open-ended language of models Searches greedily through this space Produces detailed reports describing patterns in data We believe this line of research has the potential to make powerful statistical model-building techniques accessible to non-experts James Robert Lloyd 52 / 64

108 THANKS Thanks James Robert Lloyd 53 / 64

109 APPENDIX James Robert Lloyd 54 / 64

110 SUM AND PRODUCT RULES OF PROBABILITY Sum rule: P(X = x) = ∑_y P(X = x, Y = y) Product rule: P(X = x, Y = y) = P(X = x) P(Y = y | X = x) James Robert Lloyd 55 / 64

111 SUM AND PRODUCT RULES OF PROBABILITY Sum rule: p(x) = ∫ p(x, y) dy Product rule: p(x, y) = p(x) p(y | x) James Robert Lloyd 55 / 64

112 DERIVING BAYES RULE p(θ, D | M) = p(θ, D | M) James Robert Lloyd 56 / 64

113 DERIVING BAYES RULE p(θ, D | M) = p(θ, D | M) p(θ | D, M) p(D | M) = p(D | θ, M) p(θ | M) (product rule) James Robert Lloyd 56 / 64

114 DERIVING BAYES RULE p(θ, D | M) = p(θ, D | M) p(θ | D, M) p(D | M) = p(D | θ, M) p(θ | M) (product rule) p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M) (divide) James Robert Lloyd 56 / 64

115 DERIVING BAYES RULE p(θ, D | M) = p(θ, D | M) p(θ | D, M) p(D | M) = p(D | θ, M) p(θ | M) (product rule) p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M) (divide) p(θ | D, M) = p(D | θ, M) p(θ | M) / ∫ p(D, θ | M) dθ (sum rule) James Robert Lloyd 56 / 64

116 DERIVING BAYES RULE p(θ, D | M) = p(θ, D | M) p(θ | D, M) p(D | M) = p(D | θ, M) p(θ | M) (product rule) p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M) (divide) p(θ | D, M) = p(D | θ, M) p(θ | M) / ∫ p(D, θ | M) dθ (sum rule) p(θ | D, M) = p(D | θ, M) p(θ | M) / ∫ p(D | θ, M) p(θ | M) dθ (product rule) James Robert Lloyd 56 / 64

117 DEFINING A LANGUAGE OF REGRESSION MODELS Regression consists of learning a function f : X → Y from inputs to outputs from example input / output pairs Language should include simple parametric forms... e.g. Linear functions, Polynomials, Exponential functions... as well as functions specified by high level properties e.g. Smoothness, Periodicity Inference should be tractable for all models in language James Robert Lloyd 57 / 64

118 A LANGUAGE OF GAUSSIAN PROCESS KERNELS It is common practice to use a zero mean function since the mean can be marginalised out Suppose f(x) | a ~ GP(a µ(x), k(x, x′)) where a ~ N(0, 1) Then equivalently, f(x) ~ GP(0, µ(x)µ(x′) + k(x, x′)) We therefore define a language of GP regression models by specifying a language of kernels James Robert Lloyd 58 / 64

119 MODEL EVALUATION After proposing a new model its kernel parameters are optimised by conjugate gradients We evaluate each optimised model, M, using the marginal likelihood which can be computed analytically for GPs We penalise the marginal likelihood for the optimised kernel parameters using the Bayesian Information Criterion (BIC): −0.5 BIC(M) = log p(D | M) − (p / 2) log n where p is the number of kernel parameters, D represents the data, and n is the number of data points. James Robert Lloyd 59 / 64
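
A minimal sketch of this model score (log marginal likelihood with a BIC penalty), assuming fixed, already-optimised hyperparameters; the hyperparameter optimisation by conjugate gradients is omitted and the example kernel values are illustrative.

```python
import numpy as np

# Sketch of the score used to evaluate each proposed model: the GP log marginal
# likelihood penalised for the number of kernel parameters via the BIC.
def gp_log_marginal_likelihood(y, K):
    n = len(y)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # K^{-1} y via Cholesky
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

def bic_score(y, K, num_params):
    """Returns -0.5 * BIC(M) = log p(D | M) - (p / 2) * log n (higher is better)."""
    return gp_log_marginal_likelihood(y, K) - 0.5 * num_params * np.log(len(y))

# e.g. an SE + WN model with three parameters (lengthscale, signal variance, noise variance)
x = np.linspace(0, 1, 30)
y = np.sin(6 * x)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.2**2) + 0.01 * np.eye(30)
print(bic_score(y, K, num_params=3))
```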

120 GOOD PREDICTIVE PERFORMANCE AS WELL [Bar chart: standardised RMSE over 13 data sets for ABCD accuracy, ABCD interpretability, spectral kernels, trend/cyclical/irregular, Bayesian MKL, Eureqa, changepoints, squared exponential and linear regression] Tweaks can be made to the algorithm to improve accuracy or interpretability of models produced but both methods are highly competitive at extrapolation (shown above) and interpolation James Robert Lloyd 60 / 64

121 GOALS OF THE AUTOMATIC STATISTICIAN PROJECT Provide a set of tools for understanding data that require minimal expert input Uncover challenging research problems in e.g. Automated inference Model construction and comparison Data visualisation and interpretation Advance the field of machine learning in general James Robert Lloyd 61 / 64

122 INGREDIENTS OF AN AUTOMATIC STATISTICIAN [Pipeline diagram with components: Language of models, Data, Search, Model, Evaluation, Prediction, Checking, Description, Report] An open-ended language of models Expressive enough to capture real-world phenomena and the techniques used by human statisticians A search procedure To efficiently explore the language of models A principled method of evaluating models Trading off complexity and fit to data A procedure to automatically explain the models Making the assumptions of the models explicit in a way that is intelligible to non-experts James Robert Lloyd 62 / 64

123 CHALLENGES Interpretability / accuracy Increasing the expressivity of language e.g. Monotonicity, positive functions, symmetries Computational complexity of searching through a huge space of models Extending the automatic reports to multidimensional datasets Search and descriptions naturally extend to multiple dimensions, but automatically generating relevant visual summaries is harder James Robert Lloyd 63 / 64

124 CURRENT AND FUTURE DIRECTIONS Automatic statistician for: Multivariate nonlinear regression Classification Completing and interpreting tables and databases Probabilistic programming Probabilistic models are expressed in a general (Turing complete) programming language (e.g. Church/Venture/Anglican) A universal inference engine can then be used to infer unobserved variables given observed data This can be used to implement search over the model space in an automated statistician James Robert Lloyd 64 / 64
