Probabilistic Reasoning in Deep Learning


Probabilistic Reasoning in Deep Learning
Dr Konstantina Palla, PhD
September 2017, Deep Learning Indaba, Johannesburg

OVERVIEW OF THE TALK
Basics of Bayesian inference
Bayesian reasoning inside DNNs
> The Bayesian paradigm as an approach to account for uncertainty in DNNs.
Connection to other statistical models (nonparametric models)
> How kernels and Gaussian processes fit in the picture of NNs.

PART I
Bayesian Inference: Basics

BAYESIAN INFERENCE: BASICS
Data: observations Y = {y^(1), y^(2), ..., y^(N)}
Problem: How was it generated?
Answer: Make distributional assumptions about the generating process:
    y^(i) | θ ~ p(y | θ)
For example, y^(i) | θ ~ N(y; µ, σ²), with θ = {µ, σ²}.
Likelihood: the probability of the data under the model parameters θ,
    p(Y | θ) = ∏_{i=1}^N p(y^(i) | θ) = ∏_{i=1}^N N(y^(i); µ, σ²)

BAYESIAN INFERENCE: BASICS
Bayesian principle: all forms of uncertainty should be expressed by means of probability distributions.
Prior: θ ~ p(θ), our prior belief before seeing the data.
Posterior: p(θ | Y), our updated belief after having seen the data, obtained via Bayes' rule:
    p(θ | Y) = p(Y | θ) p(θ) / p(Y)

BAYESIAN INFERENCE: BASICS
Bayes' rule:
    p(θ | Y) = p(Y | θ) p(θ) / p(Y)
The posterior is proportional to likelihood × prior: update my prior beliefs after I see the data.

BAYESIAN INFERENCE: BASICS
    p(θ | Y) ∝ p(Y | θ) p(θ)
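To see this proportionality in action, here is a minimal numpy sketch that computes a posterior on a grid for a simple coin-flip model (Bernoulli likelihood, uniform prior over the bias θ); the data counts are invented for illustration:

```python
import numpy as np

# Data: 7 heads out of 10 flips.
heads, flips = 7, 10

theta = np.linspace(0.001, 0.999, 999)          # grid over the unknown bias theta
prior = np.ones_like(theta)                     # uniform prior p(theta)
likelihood = theta**heads * (1 - theta)**(flips - heads)   # p(Y | theta)

unnorm = likelihood * prior                     # p(theta | Y) is proportional to p(Y | theta) p(theta)
posterior = unnorm / np.trapz(unnorm, theta)    # normalize so it integrates to 1

print("posterior mean:", np.trapz(theta * posterior, theta))  # roughly (7+1)/(10+2) = 0.667
```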

BAYESIAN INFERENCE: BASICS
Make predictions: for an unseen y*, marginalise!
    p(y* | Y) = ∫ p(y* | θ) p(θ | Y) dθ
The prediction does not depend on a point estimate of θ; it is a weighted average over all possible values of θ.
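To make the marginalisation concrete, a minimal numpy sketch under a conjugate assumption (Gaussian likelihood with known variance and a Gaussian prior on the mean), where both the posterior and the posterior predictive are available in closed form; the numbers and variable names are illustrative, not from the slides:

```python
import numpy as np

# Observations y^(1), ..., y^(N), assumed drawn from N(mu, sigma2) with known sigma2.
y = np.array([1.2, 0.8, 1.5, 0.9, 1.1])
sigma2 = 0.25            # known observation noise variance
mu0, tau0_2 = 0.0, 1.0   # prior over the unknown mean: mu ~ N(mu0, tau0_2)

N = len(y)
# Posterior p(mu | Y) is Gaussian (conjugacy): precision and mean update.
post_var = 1.0 / (1.0 / tau0_2 + N / sigma2)
post_mean = post_var * (mu0 / tau0_2 + y.sum() / sigma2)

# Posterior predictive p(y* | Y) = integral of p(y* | mu) p(mu | Y) dmu, again Gaussian:
pred_mean = post_mean
pred_var = post_var + sigma2   # uncertainty in mu plus observation noise

print(f"posterior:  N({post_mean:.3f}, {post_var:.3f})")
print(f"predictive: N({pred_mean:.3f}, {pred_var:.3f})")
```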

BAYESIAN INFERENCE: BASICS
Why is this useful?
> Uncertainty is taken into account: no point estimates, but a distribution over the unknown θ, p(θ | Y).
> Natural handling of missing data: a probability distribution is estimated for each missing value.
> Addresses issues like regularization and overfitting, useful in deep neural networks.

PART II
Bayesian Reasoning in (Deep) Neural Networks
> Overview of (D)NNs
> Bayesian reasoning in DNNs

OVERVIEW OF (DEEP) NEURAL NETWORKS
[Figure: input layer, hidden layers, output layer]
What is deep learning? A framework for constructing flexible models: a way of constructing outputs using functions on inputs.

OVERVIEW OF DEEP NNS
Repetition of building blocks.

OVERVIEW OF DEEP NNS
Building block:
> Output: y = f(η)
> Linear predictor: η = Σ_{m=1}^M β_m x_m = βᵀx (linear); the coefficients β are the weights.
> Inverse link function: f(η) (non-linear), the activation function.
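As a small illustration of one such block, a minimal numpy sketch of a single layer; the tanh activation and random weights are arbitrary choices for the example, not prescribed by the slides:

```python
import numpy as np

def layer(x, beta, f=np.tanh):
    """One building block: linear predictor eta = beta^T x, then activation f(eta)."""
    eta = beta.T @ x          # linear part
    return f(eta)             # non-linear part

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # input with M = 4 features
beta = rng.normal(size=(4, 3))    # weights mapping 4 inputs to 3 hidden units
h = layer(x, beta)                # output of the block
print(h)
```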

OVERVIEW OF DEEP NNS
In each layer l:
> the input is linearly composed, η_l = β_lᵀ x_l
> a non-linear function is applied, f_l(η_l)
A deep neural network is a construction of recursive building blocks.

OVERVIEW OF DEEP NNS
x → η = βᵀx → f(η) → ... → η_l = β_lᵀ x_l → f_l(η_l) → ... → y
A deep neural network can represent functions of increasing complexity.

DEEP NNS AS A PROBABILISTIC MODEL
A (D)NN maps x to y and defines a parametric conditional distribution p(y | x; β). Parametric!
Example: regression with identity activation, f(η) = η = βᵀx:
    y = βᵀx + ɛ,  ɛ ~ N(0, σ²)
    p(y | x, β) = N(βᵀx, σ²)
Train the network: learn β?

DEEP NNS AS A PROBABILISTIC MODEL
Learn β: Maximum Likelihood Principle
Minimize the negative log-likelihood:
    J(β) = -log p(y | x, β),   β_MLE = arg min_β J(β)
Regression, p(y | x, β) = N(βᵀx, σ²): squared loss function
    J(β) = (1/2) Σ_{i=1}^N {y^(i) - βᵀx^(i)}² + const.
> Solve using stochastic gradient descent, back-propagation, etc.
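A minimal numpy sketch of this maximum-likelihood training for the linear-regression case, using plain gradient descent on the squared loss; the synthetic data, learning rate, and number of steps are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 200, 3
X = rng.normal(size=(N, M))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)   # y = beta^T x + noise

beta = np.zeros(M)
lr = 0.01
for _ in range(500):
    resid = X @ beta - y                 # beta^T x^(i) - y^(i)
    grad = X.T @ resid                   # gradient of 0.5 * sum(resid^2)
    beta -= lr * grad / N                # gradient descent step
print("beta_MLE ~", beta)
```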

DEEP NNS AS A PROBABILISTIC MODEL
Maximum likelihood estimation: β is fixed but unknown (a point mass δ(β)).
- Point estimates
- Overfitting
Bayesian approach: β is unknown/uncertain, with a distribution p(β).
+ Accounts for uncertainty: p(β), p(β | y)
+ Regularisation

BAYESIAN REASONING IN DEEP LEARNING
Maximum a Posteriori Estimation
Place a distribution over the network weights β:
    β ~ p(β),   y | x, β ~ p(y | x, β)
Update the belief over β with Bayes' rule (likelihood × prior):
    p(β | Y) ∝ p(Y | β, X) p(β)
Maximum a posteriori: find the β_MAP that maximizes log p(β | Y), i.e. minimizes J(β) = -log p(β | Y):
    β_MAP = arg min_β J(β)

BAYESIAN REASONING IN DEEP LEARNING
Regression: Maximum a Posteriori
Prior over the weights (multivariate Gaussian): p(β) = N(m_β, λ⁻¹I)
Likelihood: y = βᵀx + ɛ, so p(y | x, β) = N(βᵀx, σ²)
Loss function J(β) = -log p(β | Y):
    J(β) = (1/2) Σ_{i=1}^N {y^(i) - βᵀx^(i)}² + (λ/2) βᵀβ + const.
The (λ/2) βᵀβ term is weight decay: the posterior distribution over the weights allows a natural regularizer to arise, a Bayesian side effect!
    β_MAP = arg min_β J(β)
Predictions: y* | Y, x* ~ N(β_MAPᵀ x*, σ²)
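A minimal numpy sketch of the corresponding MAP estimate for linear regression, assuming a zero-mean prior (m_β = 0) so that β_MAP takes the familiar ridge-regression closed form; the data and λ are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 100, 3
X = rng.normal(size=(N, M))
beta_true = np.array([1.5, -0.7, 0.3])
y = X @ beta_true + rng.normal(scale=0.2, size=N)

lam = 1.0  # prior precision, i.e. the weight-decay strength
# Minimizing 0.5*||y - X beta||^2 + (lam/2)*beta^T beta gives the closed form:
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

x_star = rng.normal(size=M)
print("beta_MAP:", beta_map)
print("prediction at x*:", beta_map @ x_star)
```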

BAYESIAN REASONING IN DEEP LEARNING
What we have seen so far...
> Being Bayesian over the weights β, i.e. putting a prior distribution on them, allows for a better loss function in learning: natural regularization.
> (D)NNs don't handle missing data: they model p(y | x; β) from train data {x^(i), y^(i)}, i = 1, ..., N, but if some x^(i) is missing there is nothing to fill it in with. Need for p(x): x ~ p(x).
> (D)NNs don't generate new examples. With the joint p(x, y) = p(y | x) p(x) we could draw x^(i) ~ p(x^(i)) and then y^(i) ~ p(y^(i) | x^(i)) (see the sketch below).
Need for a fully probabilistic model.
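The generative recipe p(x, y) = p(y | x) p(x) can be made concrete with a tiny ancestral-sampling sketch; the particular choices p(x) = N(0, 1) and p(y | x) = N(βx, σ²) are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(6)
beta, sigma = 1.5, 0.3

def sample_pair():
    x = rng.normal()                     # draw x ~ p(x)
    y = rng.normal(beta * x, sigma)      # draw y ~ p(y | x)
    return x, y

# Generate new (x, y) examples, something a purely discriminative p(y | x; beta) cannot do.
samples = [sample_pair() for _ in range(5)]
print(samples)
```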

BAYESIAN REASONING IN DEEP LEARNING
Autoencoder as a fully probabilistic neural network

BAYESIAN REASONING IN DEEP LEARNING
Autoencoder: input y → encoder → code z → decoder → output ỹ
> Two parts:
> Encoder: "encodes" the input to a hidden code (representation)
> Decoder: "decodes" the code back to the input
> Minimize the reconstruction error L = ||y - ỹ||²₂

BAYESIAN REASONING IN DEEP LEARNING
Autoencoder: input y → encoder → code z → decoder → output ỹ
Unsupervised method: experiences data y and attempts to learn the distribution over y, i.e. p(y | z).

BAYESIAN REASONING IN DEEP LEARNING
Encoder: p_φ(z | y);  Decoder: p_θ(y | z);  {θ, φ} are the NN's weights.

BAYESIAN REASONING IN DEEP LEARNING
Decoder = generative model, a latent variable model:
    z ~ p(z),  ỹ ~ p_θ(y | z)    (learn θ)
Encoder = inference mechanism:
    learn p_φ(z | y)             (learn φ)
Inference: learn the weights φ and θ!

BAYESIAN REASONING IN DEEP LEARNING
Variational Inference
    p_φ(z | y) = p(y | z) p(z) / p(y)   ?
We cannot sample from p_φ(z | y): we cannot construct it using the NN's structure (weights φ).
Main principle: use an approximate distribution q_φ(z) ≈ p(z | y) and measure the mismatch with KL[q_φ(z) || p(z | y)], which measures the information lost when using q to approximate p.

BAYESIAN REASONING IN DEEP LEARNING
Variational Inference: cost function
A metric for how well {φ, θ} capture the data-generating distribution p_θ(y | z) and the latent distribution p_φ(z | y): minimize L = -log p(y | θ, φ). Minimize F:
    F(φ, θ; y) = -E_{q_φ(z|y)}[log p_θ(y | z)] + KL[q_φ(z) || p(z | y)]
> Reconstruction cost: the expected log-likelihood measures how well samples from q_φ(z) are able to explain the data y.
> Penalty: this divergence measures how much information is lost (in units of nats) when using q_φ(z) to represent p(z | y).
The penalty is derived from your model and does not need to be designed.
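A minimal numpy sketch of this objective for a single data point, assuming a diagonal-Gaussian approximate posterior, a standard-normal prior on z (so the KL term has a closed form, as in the usual variational autoencoder), a Gaussian decoder, and stand-in linear maps in place of the encoder/decoder networks:

```python
import numpy as np

rng = np.random.default_rng(3)
D, K = 5, 2                      # data and latent dimensions
y = rng.normal(size=D)           # one observation

# Stand-in "encoder": parameters of q_phi(z | y) = N(m, diag(s^2))
W_enc = rng.normal(size=(K, D)) * 0.1
m = W_enc @ y
log_s = np.zeros(K)              # log std of q

# Reparameterization: sample z = m + s * eps with eps ~ N(0, I)
eps = rng.normal(size=K)
z = m + np.exp(log_s) * eps

# Stand-in "decoder": p_theta(y | z) = N(W_dec z, sigma2 I)
W_dec = rng.normal(size=(D, K)) * 0.1
sigma2 = 0.5
recon = -0.5 * np.sum((y - W_dec @ z) ** 2) / sigma2   # one-sample estimate of E_q[log p(y|z)], up to a constant

# Closed-form KL[ N(m, s^2) || N(0, 1) ], summed over latent dimensions
kl = 0.5 * np.sum(np.exp(2 * log_s) + m ** 2 - 1.0 - 2 * log_s)

F = -recon + kl                  # reconstruction cost plus penalty, to be minimized
print("F =", F)
```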

BAYESIAN REASONING IN DEEP LEARNING
Bayesian & DL marriage - what have we gained?
[Figure: model z ~ p(z), ỹ ~ p_θ(y | z) alongside inference q_φ(z)]
++ Principled inference approach (Bayesian): the penalty is an intrinsic part of the model!
++ Encode uncertainty
++ Impute missing data

Deep Learning as a kernel method
> So far... predictions are made using point estimates of β (the weights): parametric.
> Now... store the entire training set in order to make predictions: the kernel approach, non-parametric.

DEEP LEARNING AS A KERNEL METHOD
Regression example, with basis functions φ(·):
    y = βᵀφ(x) = Σ_{m=1}^M β_m φ_m(x)

DEEP LEARNING AS A KERNEL METHOD
Parametric approach:
    J(β) = (1/2) Σ_{n=1}^N {βᵀφ(x_n) - y_n}² + (λ/2) βᵀβ
> MLE training: point estimates of β. Setting ∇J(β) = 0 gives
    β_MLE = -(1/λ) Σ_{n=1}^N {βᵀφ(x_n) - y_n} φ(x_n)
> Predictions: y* = β_MLEᵀ φ(x*)
Predictions are purely based on the learned parameters; all the information from the data is transferred into a set of parameters. Parametric!

DEEP LEARNING AS A KERNEL METHOD
Change of scenery... let the data speak (dual representation):
    J(β) = (1/2) Σ_{n=1}^N {βᵀφ(x_n) - y_n}² + (λ/2) βᵀβ
    ∇J(β) = 0  ⇒  β_MLE = -(1/λ) Σ_{n=1}^N {βᵀφ(x_n) - y_n} φ(x_n) = Σ_{n=1}^N α_n φ(x_n) = Φᵀα
where Φ is the N × M design matrix and α_n = -(1/λ)(βᵀφ(x_n) - y_n).
Solving ∇J(α) = 0 gives
    α = (K + λI_N)⁻¹ y
where K is the N × N Gram matrix with κ_lm = <φ(x_l), φ(x_m)> = k(x_l, x_m), the kernel function.
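A minimal numpy sketch of this dual, kernelised view (kernel ridge regression with an RBF kernel); the kernel choice and synthetic data are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 l^2)) for all pairs of rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=50)

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha = (K + lambda I)^-1 y

X_star = np.array([[0.5], [2.0]])
k_star = rbf_kernel(X_star, X)                          # k(x*, x_n) for every training point
y_star = k_star @ alpha                                 # predictions use *all* the data
print(y_star)
```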

DEEP LEARNING AS A KERNEL METHOD
Why?
Parametric:
    β_MLE = -(1/λ) Σ_{n=1}^N {βᵀφ(x_n) - y_n} φ(x_n)    point estimate, no memory of the data
    predictions: y* = β_MLEᵀ φ(x*)                        memory lives in β_MLE
Non-parametric:
    α = (K + λI_N)⁻¹ y                                    uses all the data
    predictions: y* = k(x, x*)ᵀ (K + λI)⁻¹ y              memory by actually storing all the data

DEEP LEARNING AS A KERNEL METHOD
Deep networks ←(dual)→ kernel methods

DEEP LEARNING AS A KERNEL METHOD
Step back...
    y = βᵀφ(x) = Σ_{m=1}^M β_m φ_m(x),   y = f(x)
TASK: learn the best regression function f(·).
What about a bit more Bayesian reasoning? Consider f = [f(x_1), f(x_2), ..., f(x_n)] as a multivariate variable and put a distribution on it: f ~ p(f).

DEEP LEARNING AS A KERNEL METHOD
Take the number of hidden units φ_m(x) to infinity, M → ∞, with β ~ N(m, Σ). Then
    f ~ N(0, K),   K_ij = k(x_i, x_j)
and as N → ∞ we sample from a Gaussian process, f ~ GP(0, K).
Predict y* = f(x*):
    p(f* | X, y, x*) = ∫ p(f* | x*, f, X, y) p(f | X, y) df
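A minimal numpy sketch of Gaussian-process regression with an RBF kernel, computing the predictive mean and variance at a test input under a zero-mean GP prior; the kernel, noise level, and data are illustrative:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=30)
noise = 0.01

K = rbf(X, X) + noise * np.eye(len(X))     # prior covariance plus observation noise
X_star = np.array([[0.0]])
k_star = rbf(X_star, X)                    # cross-covariance k(x*, X)

# Predictive distribution p(f* | X, y, x*) is Gaussian:
mean = k_star @ np.linalg.solve(K, y)
var = rbf(X_star, X_star) - k_star @ np.linalg.solve(K, k_star.T)
print("mean:", mean, "var:", var)
```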

DEEP LEARNING AS A KERNEL METHOD
Number of hidden units φ_m(x) to infinity, M → ∞:
Gaussian processes ←(infinite limits)→ deep networks

Thank you!
