Bayesian Interpretations of Regularization

1 Bayesian Interpretations of Regularization. Charlie Frogner. 9.520 Class 15, April 1, 2009.

2 The Plan. Regularized least squares maps $S = \{(x_i, y_i)\}_{i=1}^n$ to a function that minimizes the regularized loss:
$$f_S = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2$$
Can we justify Tikhonov regularization from a probabilistic point of view?

3 The Plan.
- Bayesian estimation basics
- Bayesian interpretation of ERM
- Bayesian interpretation of linear RLS
- Bayesian interpretation of kernel RLS
- Transductive model
- Infinite dimensions = weird

4 Some notation. $S = \{(x_i, y_i)\}_{i=1}^n$ is the set of observed input/output pairs in $\mathbb{R}^d \times \mathbb{R}$ (the training set). $X$ and $Y$ denote the matrices $[x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ and $[y_1, \ldots, y_n]^T \in \mathbb{R}^n$, respectively. $\theta$ is a vector of parameters in $\mathbb{R}^p$. $p(Y|X, \theta)$ is the joint distribution over outputs $Y$ given inputs $X$ and the parameters.

5 Estimation. The setup: a model relates observed quantities ($\alpha$) to an unobserved quantity (say $\beta$). Want: an estimator that maps observed data $\alpha$ back to an estimate of the unobserved $\beta$. Nothing new yet... Estimator: $\beta \in B$ is unobserved, $\alpha \in A$ is observed. An estimator for $\beta$ is a function $\hat{\beta} : A \to B$ such that $\hat{\beta}(\alpha)$ is an estimate of $\beta$ given an observation $\alpha$.

6 Estimation. Tikhonov regularization fits in the estimation framework:
$$f_S = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2$$
Regression model: $y_i = f(x_i) + \varepsilon_i$, with $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I)$.
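
To make the regression model concrete, here is a minimal synthetic-data sketch (not from the slides; the sine function and the noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma_eps = 30, 0.3              # assumed noise standard deviation

def f_true(x):                      # hypothetical "true" regression function
    return np.sin(3 * x)

X = rng.uniform(-1, 1, size=(n, 1))
Y = f_true(X[:, 0]) + sigma_eps * rng.standard_normal(n)   # y_i = f(x_i) + eps_i
```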

7 Bayesian Estimation. Difference: a Bayesian model specifies the joint $p(\beta, \alpha)$, usually via a measurement model $p(\alpha|\beta)$ and a prior $p(\beta)$. Bayesian model: $\beta$ is unobserved, $\alpha$ is observed, and $p(\beta, \alpha) = p(\alpha|\beta)\, p(\beta)$.

8 ERM as a Maximum Likelihood Estimator. (Linear) empirical risk minimization:
$$f_S(x) = x^T \hat{\theta}_{ERM}(S), \quad \hat{\theta}_{ERM}(S) = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \theta)^2$$
Measurement model: $Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$. $X$ is fixed/non-random, $\theta$ is unknown.

10 ERM as a Maximum Likelihood Estimator. Measurement model: $Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$. We want to estimate $\theta$, and can do so without defining a prior on $\theta$: maximize the likelihood, i.e. the probability of the observations. Likelihood: the likelihood of any fixed parameter vector $\theta$ is $L(\theta|X) = p(Y|X, \theta)$. Note: we always condition on $X$.

11 ERM as a Maximum Likelihood Estimator. Measurement model: $Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$. Likelihood:
$$L(\theta|X) = \mathcal{N}(Y;\, X\theta,\, \sigma_\varepsilon^2 I) \propto \exp\left(-\frac{1}{2\sigma_\varepsilon^2} \|Y - X\theta\|^2\right)$$
The maximum likelihood estimator is ERM: $\arg\min_\theta \frac{1}{2}\|Y - X\theta\|^2$.
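
As a quick numerical check (a sketch, not from the slides), maximizing the Gaussian log-likelihood and solving ordinary least squares give the same $\hat{\theta}$; the synthetic data and $\sigma_\varepsilon$ below are assumed for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d, sigma_eps = 50, 3, 0.5
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + sigma_eps * rng.standard_normal(n)

# ERM: ordinary least squares
theta_erm = np.linalg.lstsq(X, Y, rcond=None)[0]

# ML: minimize the negative log-likelihood of N(Y; X theta, sigma_eps^2 I),
# which is ||Y - X theta||^2 / (2 sigma_eps^2) up to an additive constant
neg_log_lik = lambda th: np.sum((Y - X @ th) ** 2) / (2 * sigma_eps ** 2)
theta_ml = minimize(neg_log_lik, np.zeros(d)).x

print(np.allclose(theta_erm, theta_ml, atol=1e-4))   # True: the ML estimate is the ERM estimate
```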

13 Really? $\frac{1}{2}\|Y - X\theta\|^2$

14 Really? $e^{-\frac{1}{2\sigma_\varepsilon^2}\|Y - X\theta\|^2}$

15 What about regularization? Linear regularized least squares:
$$f_S(x) = x^T \hat{\theta}_{RLS}(S), \quad \hat{\theta}_{RLS}(S) = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 + \lambda \|\theta\|^2$$
Is there a model of $Y$ and $\theta$ that yields linear RLS? Yes:
$$e^{-\frac{1}{2\sigma_\varepsilon^2}\left(\sum_{i=1}^n (y_i - x_i^T \theta)^2 + \lambda \|\theta\|^2\right)} \ \propto\ p(Y|X, \theta)\, p(\theta)$$

17 What about regularization? Linear regularized least squares:
$$f_S(x) = x^T \hat{\theta}_{RLS}(S), \quad \hat{\theta}_{RLS}(S) = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 + \lambda \|\theta\|^2$$
Is there a model of $Y$ and $\theta$ that yields linear RLS? Yes:
$$e^{-\frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^n (y_i - x_i^T \theta)^2} \cdot e^{-\lambda\, \|\theta\|^2} \ \propto\ p(Y|X, \theta)\, p(\theta)$$

19 Linear RLS as a MAP estimator. Measurement model: $Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$. Add a prior: $\theta \sim \mathcal{N}(0, I)$. So $\sigma_\varepsilon^2 = \lambda$. How do we estimate $\theta$?

20 The Bayesian method. Take $p(Y|X, \theta)$ and $p(\theta)$. Apply Bayes' rule to get the posterior:
$$p(\theta | X, Y) = \frac{p(Y|X, \theta)\, p(\theta)}{p(Y|X)} = \frac{p(Y|X, \theta)\, p(\theta)}{\int p(Y|X, \theta)\, p(\theta)\, d\theta}$$
Use the posterior to estimate $\theta$.

21 Estimators that use the posterior. Bayes least squares estimator: the Bayes least squares estimator for $\theta$ given the observed $Y$ is
$$\hat{\theta}_{BLS}(Y|X) = \mathbb{E}_{\theta|X,Y}[\theta],$$
i.e. the mean of the posterior. Maximum a posteriori estimator: the MAP estimator for $\theta$ given the observed $Y$ is
$$\hat{\theta}_{MAP}(Y|X) = \arg\max_\theta p(\theta|X, Y),$$
i.e. a mode of the posterior.

23 Linear RLS as a MAP estimator. Model: $Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$. Joint over $Y$ and $\theta$:
$$\begin{bmatrix} Y \\ \theta \end{bmatrix} \Big|\, X \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} XX^T + \sigma_\varepsilon^2 I & X \\ X^T & I \end{bmatrix} \right)$$
Condition on $Y|X$.

24 Linear RLS as a MAP estimator. Model: $Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$. Posterior: $\theta | X, Y \sim \mathcal{N}(\mu_{\theta|X,Y}, \Sigma_{\theta|X,Y})$, where
$$\mu_{\theta|X,Y} = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y, \quad \Sigma_{\theta|X,Y} = I - X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} X$$
This is Gaussian, so
$$\hat{\theta}_{MAP}(Y|X) = \hat{\theta}_{BLS}(Y|X) = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y$$

26 Linear RLS as a MAP estimator. Model: $Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$, giving
$$\hat{\theta}_{MAP}(Y|X) = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y$$
Recall the linear RLS solution:
$$\hat{\theta}_{RLS}(Y|X) = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 + \lambda \|\theta\|^2 = X^T (XX^T + \lambda I)^{-1} Y$$
How do we write the estimated function?
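
A small numerical sketch (not from the slides) checking that the posterior-mean/MAP expression above coincides with the usual ridge closed form $(X^T X + \lambda I)^{-1} X^T Y$ when $\lambda = \sigma_\varepsilon^2$; the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma2 = 40, 5, 0.25                  # sigma2 plays the role of sigma_eps^2 = lambda
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + np.sqrt(sigma2) * rng.standard_normal(n)

# MAP / posterior mean, as on the slide
theta_map = X.T @ np.linalg.solve(X @ X.T + sigma2 * np.eye(n), Y)

# Ridge (linear RLS) closed form with lambda = sigma2
theta_rls = np.linalg.solve(X.T @ X + sigma2 * np.eye(d), X.T @ Y)

print(np.allclose(theta_map, theta_rls))    # True: the two expressions agree
```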

28 What about Kernel RLS? We can use basically the same trick to derive kernel RLS. How?
$$f_S = \arg\min_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_K}^2$$
Feature space: $f(x) = \langle \theta, \phi(x) \rangle_{\mathcal{F}}$, so
$$e^{-\frac{1}{2\sigma_\varepsilon^2}\left(\sum_{i=1}^n (y_i - \phi(x_i)^T \theta)^2 + \lambda\, \theta^T \theta\right)}$$
Feature space must be finite-dimensional.

30 What about Kernel RLS? We can use basically the same trick to derive kernel RLS. How?
$$f_S = \arg\min_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_K}^2$$
Feature space: $f(x) = \langle \theta, \phi(x) \rangle_{\mathcal{F}}$, so
$$e^{-\frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^n (y_i - \phi(x_i)^T \theta)^2} \cdot e^{-\lambda\, \theta^T \theta}$$
Feature space must be finite-dimensional.

32 More notation. $\phi(X) = [\phi(x_1), \ldots, \phi(x_n)]^T$. $K(X, X)$ is the kernel matrix: $[K(X, X)]_{ij} = K(x_i, x_j)$. $K(x, X) = [K(x, x_1), \ldots, K(x, x_n)]$. $f(X) = [f(x_1), \ldots, f(x_n)]^T$.

33 What about Kernel RLS? Model: $Y | X, \theta \sim \mathcal{N}(\phi(X)\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$. Then:
$$\hat{\theta}_{MAP}(Y|X) = \phi(X)^T (\phi(X)\phi(X)^T + \sigma_\varepsilon^2 I)^{-1} Y$$
What is $\phi(X)\phi(X)^T$? It's $K(X, X)$.

35 What about Kernel RLS? Model: $Y | X, \theta \sim \mathcal{N}(\phi(X)\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$. Then:
$$\hat{\theta}_{MAP}(Y|X) = \phi(X)^T (K(X, X) + \sigma_\varepsilon^2 I)^{-1} Y$$
Estimated function?
$$\hat{f}_{MAP}(x) = \phi(x)\, \hat{\theta}_{MAP}(Y|X) = \phi(x)\phi(X)^T (K(X, X) + \sigma_\varepsilon^2 I)^{-1} Y = K(x, X)(K(X, X) + \lambda I)^{-1} Y = \hat{f}_{RLS}(x)$$
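
Here is a minimal kernel RLS sketch of the prediction formula above (not from the slides; the Gaussian kernel and its width are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """[K(A, B)]_ij = exp(-gamma * ||a_i - b_j||^2), an assumed kernel choice."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(30, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 0.1                                      # plays the role of sigma_eps^2
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)   # (K(X,X) + lambda I)^{-1} Y

x_star = np.linspace(-1, 1, 5).reshape(-1, 1)
f_hat = gaussian_kernel(x_star, X) @ alpha     # f(x) = K(x, X) (K(X,X) + lambda I)^{-1} Y
print(f_hat)
```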

37 A prior over functions. Model: $Y | X, \theta \sim \mathcal{N}(\phi(X)\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$. Can we write this as a prior on $\mathcal{H}_K$?

38 A prior over functions. Remember Mercer's theorem:
$$K(x_i, x_j) = \sum_k \nu_k\, \psi_k(x_i)\, \psi_k(x_j), \quad \text{where} \quad \nu_k\, \psi_k(\cdot) = \int K(\cdot, y)\, \psi_k(y)\, dy \ \text{for all } k.$$
The functions $\{\sqrt{\nu_k}\, \psi_k(\cdot)\}$ form an orthonormal basis for $\mathcal{H}_K$. Let $\phi(\cdot) = [\sqrt{\nu_1}\, \psi_1(\cdot), \ldots, \sqrt{\nu_p}\, \psi_p(\cdot)]$. Then $\mathcal{H}_K = \{\theta^T \phi(\cdot) \mid \theta \in \mathbb{R}^p\}$.

39 A prior over functions. We showed: when $\theta \sim \mathcal{N}(0, I)$,
$$\hat{f}_{MAP}(\cdot) = \hat{\theta}_{MAP}(Y|X)^T \phi(\cdot) = \hat{f}_{RLS}(\cdot)$$
Taking $\phi(\cdot) = [\sqrt{\nu_1}\, \psi_1, \ldots, \sqrt{\nu_p}\, \psi_p]$, this prior is equivalently $f(\cdot) = \theta^T \phi(\cdot)$ with $\theta \sim \mathcal{N}(0, I)$, i.e. the functions in $\mathcal{H}_K$ are Gaussian distributed:
$$p(f) \propto \exp\left(-\tfrac{1}{2} \|f\|_{\mathcal{H}_K}^2\right) = \exp\left(-\tfrac{1}{2}\, \theta^T \theta\right)$$
Note: again we need $\mathcal{H}_K$ to be finite-dimensional.

40 A prior over functions. So: the prior $p(f) \propto \exp(-\tfrac{1}{2}\|f\|_{\mathcal{H}_K}^2)$, i.e. $\theta \sim \mathcal{N}(0, I)$, gives $\hat{f}_{MAP} = \hat{f}_{RLS}$. Assuming $p(f) \propto \exp(-\tfrac{1}{2}\|f\|_{\mathcal{H}_K}^2)$,
$$p(f | X, Y) = \frac{p(Y | X, f)\, p(f)}{p(Y | X)} \propto \exp\left(-\tfrac{1}{2}\|Y - f(X)\|^2\right) \exp\left(-\tfrac{1}{2}\|f\|_{\mathcal{H}_K}^2\right) = \exp\left(-\tfrac{1}{2}\|Y - f(X)\|^2 - \tfrac{1}{2}\|f\|_{\mathcal{H}_K}^2\right)$$
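
As an illustration (a sketch, not from the slides): under this prior the function values on any finite grid are distributed as $f(X) \sim \mathcal{N}(0, K(X, X))$, so one can draw sample functions directly; the Gaussian kernel is an assumed choice:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=5.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(4)
xs = np.linspace(-1, 1, 100).reshape(-1, 1)
K = gaussian_kernel(xs, xs) + 1e-10 * np.eye(len(xs))   # small jitter for numerical stability

# Each row is one draw of f evaluated on the grid, sampled from N(0, K(X, X))
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
print(samples.shape)   # (3, 100)
```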

41 A quick recap. We wanted to know if RLS has a probabilistic interpretation.
- Empirical risk minimization is ML: $p(Y|X, \theta) \propto e^{-\frac{1}{2}\|Y - X\theta\|^2}$.
- Linear RLS is MAP: $p(Y, \theta|X) \propto e^{-\frac{1}{2}\|Y - X\theta\|^2}\, e^{-\lambda\, \theta^T \theta}$.
- Kernel RLS is also MAP: $p(Y, \theta|X) \propto e^{-\frac{1}{2}\|Y - \phi(X)\theta\|^2}\, e^{-\lambda\, \theta^T \theta}$.
- This is equivalent to a Gaussian prior on $\mathcal{H}_K$: $p(Y, f|X) \propto e^{-\frac{1}{2}\|Y - f(X)\|^2}\, e^{-\lambda \|f\|_{\mathcal{H}_K}^2}$.
But these don't work for infinite-dimensional function spaces...

45 Transductive setting. We hinted at problems if $\dim \mathcal{H}_K = \infty$. Idea: forget about estimating $\theta$ (i.e. $f$). Instead, estimate the predicted outputs $Y^* = [y_1^*, \ldots, y_M^*]^T$ at test inputs $X^* = [x_1^*, \ldots, x_M^*]^T$. Need the joint distribution over $Y$ and $Y^*$.

46 Transductive setting. Say $Y$ and $Y^*$ are jointly Gaussian:
$$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_Y \\ \mu_{Y^*} \end{bmatrix}, \begin{bmatrix} \Lambda_Y & \Lambda_{YY^*} \\ \Lambda_{Y^*Y} & \Lambda_{Y^*} \end{bmatrix} \right)$$
Want: kernel RLS. General form for the posterior: $Y^* | X, Y \sim \mathcal{N}(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})$, where
$$\mu_{Y^*|X,Y} = \mu_{Y^*} + \Lambda_{YY^*}^T \Lambda_Y^{-1} (Y - \mu_Y), \quad \Sigma_{Y^*|X,Y} = \Lambda_{Y^*} - \Lambda_{YY^*}^T \Lambda_Y^{-1} \Lambda_{YY^*}$$

48 Transductive setting. Set $\Lambda_Y = K(X, X) + \sigma_\varepsilon^2 I$, $\Lambda_{YY^*} = K(X, X^*)$, $\Lambda_{Y^*} = K(X^*, X^*)$. Posterior: $Y^* | X, Y \sim \mathcal{N}(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})$, where
$$\mu_{Y^*|X,Y} = \mu_{Y^*} + K(X^*, X)(K(X, X) + \sigma_\varepsilon^2 I)^{-1} (Y - \mu_Y)$$
$$\Sigma_{Y^*|X,Y} = K(X^*, X^*) - K(X^*, X)(K(X, X) + \sigma_\varepsilon^2 I)^{-1} K(X, X^*)$$
So: $\hat{Y}^*_{MAP} = \hat{f}_{RLS}(X^*)$.

49 Transductive setting. Model:
$$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_Y \\ \mu_{Y^*} \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)$$
The MAP estimate (posterior mean) equals the RLS function at every point $x^*$, regardless of $\dim \mathcal{H}_K$. Are the prior and posterior (on points!) consistent with a distribution on $\mathcal{H}_K$?
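
A compact Gaussian-process-regression sketch of this transductive posterior (not from the slides; zero prior means, a Gaussian kernel, and the noise level are assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=5.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(25, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(25)
X_star = np.linspace(-1, 1, 200).reshape(-1, 1)

sigma2 = 0.01                                          # assumed noise variance
K = gaussian_kernel(X, X) + sigma2 * np.eye(len(X))    # Lambda_Y = K(X, X) + sigma^2 I
K_star = gaussian_kernel(X_star, X)                    # K(X*, X)

# Posterior mean and covariance at the test inputs (zero prior means assumed)
mu_star = K_star @ np.linalg.solve(K, Y)
Sigma_star = gaussian_kernel(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)

# Pointwise 95% confidence intervals from the posterior variance
std = np.sqrt(np.clip(np.diag(Sigma_star), 0, None))
lower, upper = mu_star - 1.96 * std, mu_star + 1.96 * std
```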

50 Transductive setting. Strictly speaking, $\theta$ and $f$ don't come into play here at all. Have: $p(Y^*|X, Y)$. Do not have: $p(\theta|X, Y)$ or $p(f|X, Y)$. But, if $\mathcal{H}_K$ is finite-dimensional, the joint over $Y$ and $Y^*$ is consistent with $Y = f(X) + \varepsilon$, $Y^* = f(X^*)$, where $f \in \mathcal{H}_K$ is a random trajectory from a Gaussian process over the domain, with mean $\mu$ and covariance $K$. (Ergo, people call this "Gaussian process regression.") (Also "kriging," named after D. G. Krige.)

52 Recap redux.
- Empirical risk minimization is the maximum likelihood estimator when $y = x^T \theta + \varepsilon$.
- Linear RLS is the MAP estimator when $y = x^T \theta + \varepsilon$ with $\theta \sim \mathcal{N}(0, I)$.
- Kernel RLS is the MAP estimator when $y = \phi(x)^T \theta + \varepsilon$ with $\theta \sim \mathcal{N}(0, I)$, in finite-dimensional $\mathcal{H}_K$.
- Kernel RLS is the MAP estimator at points when
  $$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_Y \\ \mu_{Y^*} \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)$$
  in possibly infinite-dimensional $\mathcal{H}_K$.

53 Is this useful in practice? Want confidence intervals + believe the posteriors are meaningful = yes. Maybe other reasons? [Figure: Gaussian regression with confidence intervals, showing the regression function and observed points.]

54 What is going on with infinite-dimensional $\mathcal{H}_K$? We wrote down statements like $\theta \sim \mathcal{N}(0, I)$ and $f \sim \mathcal{N}(0, I)$. The space $\mathcal{H}_K$ can be written
$$\mathcal{H}_K = \{f : \|f\|_{\mathcal{H}_K} < \infty\} = \left\{\theta^T \phi(\cdot) : \sum_{i=1}^\infty \theta_i^2 < \infty\right\}$$
Difference between finite and infinite: not every $\theta$ yields a function $\theta^T \phi(\cdot)$ in $\mathcal{H}_K$. A hint: $\theta \sim \mathcal{N}(0, I) \Rightarrow \mathbb{E}\, \|\theta^T \phi(\cdot)\|_{\mathcal{H}_K}^2 = \infty$. In fact: $\theta \sim \mathcal{N}(0, I) \Rightarrow \theta^T \phi(\cdot) \in \mathcal{H}_K$ with probability zero. So be careful out there.

57 A hint that things are amiss. Assume: $\theta \sim \mathcal{N}(0, I)$, with $\{\phi_i\}_{i=1}^\infty$ an orthonormal basis in $\mathcal{H}_K$. Then
$$\mathbb{E}\, \|\theta^T \phi\|_{\mathcal{H}_K}^2 = \mathbb{E}\, \Big\|\sum_{i=1}^\infty \theta_i \phi_i\Big\|_{\mathcal{H}_K}^2 = \mathbb{E} \sum_{i=1}^\infty \theta_i^2 = \sum_{i=1}^\infty 1 = \infty$$
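
A tiny numerical sketch of this divergence (not from the slides): with an orthonormal basis, $\|\theta^T \phi\|_{\mathcal{H}_K}^2 = \sum_i \theta_i^2$, whose expectation under $\theta \sim \mathcal{N}(0, I_p)$ is exactly $p$, so it grows without bound with the truncation dimension $p$:

```python
import numpy as np

rng = np.random.default_rng(6)

for p in [10, 100, 1000, 10000]:
    theta = rng.standard_normal((2000, p))   # 2000 Monte Carlo draws of theta ~ N(0, I_p)
    norms_sq = (theta ** 2).sum(axis=1)      # ||theta||^2 = sum_i theta_i^2 for each draw
    print(p, norms_sq.mean())                # approximately p in every case
```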
