Bayesian Interpretations of Regularization
1 Bayesian Interpretations of Regularization
Charlie Frogner. 9.520 Class 15, April 1, 2009.
2 The Plan
Regularized least squares maps {(x_i, y_i)}_{i=1}^n to a function that minimizes the regularized loss:
f_S = argmin_{f∈H} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_H
Can we justify Tikhonov regularization from a probabilistic point of view?
3 The Plan
- Bayesian estimation basics
- Bayesian interpretation of ERM
- Bayesian interpretation of linear RLS
- Bayesian interpretation of kernel RLS
- Transductive model
- Infinite dimensions = weird
4 Some notation
S = {(x_i, y_i)}_{i=1}^n is the set of observed input/output pairs in R^d × R (the training set).
X and Y denote the matrices [x_1, …, x_n]^T ∈ R^{n×d} and [y_1, …, y_n]^T ∈ R^n, respectively.
θ is a vector of parameters in R^p.
p(Y|X, θ) is the joint distribution over the outputs Y, given the inputs X and the parameters.
5 Estimation
The setup: a model relates observed quantities (say α) to an unobserved quantity (say β). We want an estimator: a map from the observed data α back to an estimate of the unobserved β. Nothing new yet...
Estimator: β ∈ B is unobserved, α ∈ A is observed. An estimator for β is a function β̂ : A → B such that β̂(α) is an estimate of β given an observation α.
6 Estimation
Tikhonov regularization fits in the estimation framework:
f_S = argmin_{f∈H} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_H
Regression model: y_i = f(x_i) + ε_i, with ε_i ~ N(0, σ_ε²).
7 Bayesian Estimation
Difference: a Bayesian model specifies the joint distribution p(β, α), usually via a measurement model p(α|β) and a prior p(β).
Bayesian model: β is unobserved, α is observed, and
p(β, α) = p(α|β) p(β)
8 ERM as a Maximum Likelihood Estimator (Linear)
Empirical risk minimization:
f_S(x) = x^T θ̂_ERM(S), θ̂_ERM(S) = argmin_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)²
Measurement model:
Y|X, θ ~ N(Xθ, σ_ε² I)
X is fixed/non-random; θ is unknown.
10 ERM as a Maximum Likelihood Estimator
Measurement model: Y|X, θ ~ N(Xθ, σ_ε² I).
We want to estimate θ. We can do this without defining a prior on θ: maximize the likelihood, i.e. the probability of the observations.
Likelihood: the likelihood of any fixed parameter vector θ is
L(θ) = p(Y|X, θ)
Note: we always condition on X.
11 ERM as a Maximum Likelihood Estimator
Measurement model: Y|X, θ ~ N(Xθ, σ_ε² I).
Likelihood:
L(θ) = N(Y; Xθ, σ_ε² I) ∝ exp(−(1/(2σ_ε²)) ‖Y − Xθ‖²)
So the maximum likelihood estimator is ERM:
argmax_θ L(θ) = argmin_θ ‖Y − Xθ‖²
13 Really?
Yes: minimizing ‖Y − Xθ‖² is the same as maximizing exp(−(1/(2σ_ε²)) ‖Y − Xθ‖²).
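The equivalence above is easy to check numerically. Below is a minimal sketch (synthetic data and names of my own choosing, not from the slides) showing that the maximizer of the Gaussian likelihood is exactly the least-squares / ERM solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the measurement model Y = X theta + eps, eps ~ N(0, sigma^2 I).
n, d = 50, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=n)

# Maximizing the Gaussian likelihood is minimizing ||Y - X theta||^2,
# so the least-squares solution is the maximum likelihood estimate.
theta_ml = np.linalg.lstsq(X, Y, rcond=None)[0]

# The same minimizer via the normal equations X^T X theta = X^T Y.
theta_ne = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(theta_ml, theta_ne))  # the two coincide
```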
15 What about regularization?
Linear regularized least squares:
f_S(x) = x^T θ̂, θ̂_RLS(S) = argmin_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)² + λ‖θ‖²
Is there a model of Y and θ that yields linear RLS? Yes:
exp(−(1/(2σ_ε²)) (Σ_{i=1}^n (y_i − x_i^T θ)² + λn ‖θ‖²)) ∝ p(Y|X, θ) p(θ)
17 What about regularization?
Linear regularized least squares:
f_S(x) = x^T θ̂, θ̂_RLS(S) = argmin_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)² + λ‖θ‖²
Is there a model of Y and θ that yields linear RLS? Yes, factoring the exponential:
exp(−(1/(2σ_ε²)) Σ_{i=1}^n (y_i − x_i^T θ)²) · exp(−(λn/(2σ_ε²)) ‖θ‖²) ∝ p(Y|X, θ) · p(θ)
The first factor is the measurement model p(Y|X, θ); the second is the prior p(θ).
19 Linear RLS as a MAP estimator
Measurement model: Y|X, θ ~ N(Xθ, σ_ε² I).
Add a prior: θ ~ N(0, I).
Matching the RLS objective then requires σ_ε² = λn.
How do we estimate θ?
20 The Bayesian method
Take p(Y|X, θ) and p(θ). Apply Bayes' rule to get the posterior:
p(θ|X, Y) = p(Y|X, θ) p(θ) / p(Y|X) = p(Y|X, θ) p(θ) / ∫ p(Y|X, θ) p(θ) dθ
Use the posterior to estimate θ.
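As a sanity check on Bayes' rule, the sketch below (a hypothetical scalar model of my own choosing, not from the slides) computes the posterior two ways: by normalizing likelihood × prior on a grid, and by the conjugate closed form for a Gaussian prior and Gaussian likelihood. The two posterior means agree:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scalar model: y_i = theta + eps_i, eps_i ~ N(0, sigma2), prior theta ~ N(0, 1).
sigma2 = 0.5
y = 1.3 + np.sqrt(sigma2) * rng.normal(size=20)
n = y.size

# Bayes' rule on a grid: posterior proportional to likelihood x prior, normalized numerically.
grid = np.linspace(-5.0, 5.0, 20001)
dx = grid[1] - grid[0]
log_post = -0.5 * ((y[:, None] - grid[None, :]) ** 2).sum(axis=0) / sigma2 - 0.5 * grid ** 2
post = np.exp(log_post - log_post.max())
post /= post.sum() * dx
mean_grid = (grid * post).sum() * dx

# Conjugate closed form: Gaussian likelihood + Gaussian prior -> Gaussian posterior.
post_var = 1.0 / (n / sigma2 + 1.0)
post_mean = post_var * y.sum() / sigma2

print(mean_grid, post_mean)  # the two agree to grid precision
```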
21 Estimators that use the posterior
Bayes least squares estimator: the Bayes least squares estimator for θ, given the observed Y, is
θ̂_BLS(Y|X) = E[θ | X, Y]
i.e. the mean of the posterior.
Maximum a posteriori estimator: the MAP estimator for θ, given the observed Y, is
θ̂_MAP(Y|X) = argmax_θ p(θ|X, Y)
i.e. a mode of the posterior.
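For a Gaussian posterior the mean and the mode coincide, so BLS = MAP; in general they differ. A small illustration with a hypothetical Beta posterior (an example of my own choosing, using the standard Beta mean and mode formulas):

```python
import numpy as np

# Hypothetical Beta(2, 5) posterior (e.g. from coin flips with a Beta prior):
# the mean (BLS) and the mode (MAP) are different estimates.
a, b = 2.0, 5.0
bls = a / (a + b)              # posterior mean: 2/7
map_ = (a - 1) / (a + b - 2)   # posterior mode: 1/5

# Check the mode numerically against the (unnormalized) Beta density.
t = np.linspace(1e-6, 1 - 1e-6, 100001)
dens = t ** (a - 1) * (1 - t) ** (b - 1)
print(bls, map_, t[np.argmax(dens)])  # mean and mode do not coincide here
```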
23 Linear RLS as a MAP estimator
Model: Y|X, θ ~ N(Xθ, σ_ε² I), θ ~ N(0, I).
Joint over Y and θ:
[Y; θ] | X ~ N([0; 0], [[XX^T + σ_ε² I, X], [X^T, I]])
Condition on the observed Y.
24 Linear RLS as a MAP estimator
Model: Y|X, θ ~ N(Xθ, σ_ε² I), θ ~ N(0, I).
Posterior:
θ | X, Y ~ N(μ_{θ|X,Y}, Σ_{θ|X,Y})
where
μ_{θ|X,Y} = X^T (XX^T + σ_ε² I)^{-1} Y
Σ_{θ|X,Y} = I − X^T (XX^T + σ_ε² I)^{-1} X
This is Gaussian, so
θ̂_MAP(Y|X) = θ̂_BLS(Y|X) = X^T (XX^T + σ_ε² I)^{-1} Y
26 Linear RLS as a MAP estimator
Model: Y|X, θ ~ N(Xθ, σ_ε² I), θ ~ N(0, I), so that
θ̂_MAP(Y|X) = X^T (XX^T + σ_ε² I)^{-1} Y
Recall the linear RLS solution:
θ̂_RLS(Y|X) = argmin_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)² + λ‖θ‖² = X^T (XX^T + λn I)^{-1} Y
The two coincide when σ_ε² = λn. How do we write the estimated function?
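The identity between the MAP estimate and the RLS solution can be verified directly. The sketch below uses synthetic data and assumes the correspondence σ_ε² = λn (matching the 1/n factor in the empirical loss); it compares the posterior-mean formula with the primal ridge solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 0.1
sigma2 = n * lam  # noise variance matching the RLS tradeoff: sigma_eps^2 = lambda * n

# MAP estimate (posterior mean): X^T (X X^T + sigma2 I)^{-1} Y.
theta_map = X.T @ np.linalg.solve(X @ X.T + sigma2 * np.eye(n), Y)

# RLS / ridge solution of the primal problem: (X^T X + lambda n I)^{-1} X^T Y.
theta_rls = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

print(np.allclose(theta_map, theta_rls))  # equal, by the push-through identity
```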
28 What about Kernel RLS?
We can use basically the same trick to derive kernel RLS. How?
f_S = argmin_{f∈H_K} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_{H_K}
Feature space: f(x) = ⟨θ, φ(x)⟩_F, giving
exp(−(1/(2σ_ε²)) (Σ_{i=1}^n (y_i − φ(x_i)^T θ)² + λn θ^T θ))
The feature space must be finite-dimensional.
30 What about Kernel RLS?
We can use basically the same trick to derive kernel RLS. How?
f_S = argmin_{f∈H_K} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_{H_K}
Feature space: f(x) = ⟨θ, φ(x)⟩_F, and factoring as before:
exp(−(1/(2σ_ε²)) Σ_{i=1}^n (y_i − φ(x_i)^T θ)²) · exp(−(λn/(2σ_ε²)) θ^T θ)
i.e. a measurement model times a prior. The feature space must be finite-dimensional.
32 More notation
φ(X) = [φ(x_1), …, φ(x_n)]^T
K(X, X) is the kernel matrix: [K(X, X)]_ij = K(x_i, x_j)
K(x, X) = [K(x, x_1), …, K(x, x_n)]
f(X) = [f(x_1), …, f(x_n)]^T
33 What about Kernel RLS?
Model: Y|X, θ ~ N(φ(X)θ, σ_ε² I), θ ~ N(0, I).
Then:
θ̂_MAP(Y|X) = φ(X)^T (φ(X)φ(X)^T + σ_ε² I)^{-1} Y
What is φ(X)φ(X)^T? It's K(X, X).
35 What about Kernel RLS?
Model: Y|X, θ ~ N(φ(X)θ, σ_ε² I), θ ~ N(0, I).
Then:
θ̂_MAP(Y|X) = φ(X)^T (K(X, X) + σ_ε² I)^{-1} Y
Estimated function?
f̂_MAP(x) = φ(x)^T θ̂_MAP(Y|X)
= φ(x)^T φ(X)^T (K(X, X) + σ_ε² I)^{-1} Y
= K(x, X) (K(X, X) + λn I)^{-1} Y
= f̂_RLS(x)
(using σ_ε² = λn).
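The kernel step above can be checked with a kernel whose feature map is explicitly finite-dimensional. The sketch below uses the polynomial kernel K(x, z) = (1 + xz)² with feature map φ(x) = [1, √2·x, x²] (an illustrative choice of my own, not from the slides), verifying both that φ(X)φ(X)^T = K(X, X) and that the feature-space MAP estimate and the kernel-only formula give identical predictions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Explicit finite-dimensional feature map for K(x, z) = (1 + x z)^2 on 1-D inputs.
def phi(x):
    return np.stack([np.ones_like(x), np.sqrt(2.0) * x, x ** 2], axis=-1)

def kern(a, b):
    return (1.0 + np.outer(a, b)) ** 2

x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=30)
sigma2 = 0.5

Phi = phi(x)    # n x p feature matrix phi(X)
K = kern(x, x)  # kernel matrix K(X, X); equals Phi @ Phi.T

# Feature-space MAP estimate, then evaluate f(x*) = phi(x*)^T theta_MAP ...
theta_map = Phi.T @ np.linalg.solve(K + sigma2 * np.eye(x.size), y)
xs = np.linspace(-1, 1, 7)
f_feat = phi(xs) @ theta_map

# ... versus the kernel-only formula f(x*) = K(x*, X)(K(X, X) + sigma2 I)^{-1} Y.
f_kern = kern(xs, x) @ np.linalg.solve(K + sigma2 * np.eye(x.size), y)
print(np.allclose(Phi @ Phi.T, K), np.allclose(f_feat, f_kern))
```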
37 A prior over functions
Model: Y|X, θ ~ N(φ(X)θ, σ_ε² I), θ ~ N(0, I).
Can we write this as a prior on H_K?
38 A prior over functions
Remember Mercer's theorem:
K(x_i, x_j) = Σ_k ν_k ψ_k(x_i) ψ_k(x_j)
where ν_k ψ_k(·) = ∫ K(·, y) ψ_k(y) dy for all k. The functions {√ν_k ψ_k(·)} form an orthonormal basis for H_K.
Let φ(·) = [√ν_1 ψ_1(·), …, √ν_p ψ_p(·)]. Then:
H_K = {θ^T φ(·) | θ ∈ R^p}
39 A prior over functions
We showed: when θ ~ N(0, I),
f̂_MAP(·) = θ̂_MAP(Y|X)^T φ(·) = f̂_RLS(·)
Taking φ(·) = [√ν_1 ψ_1, …, √ν_p ψ_p], the prior θ ~ N(0, I) is equivalently a prior on the functions f(·) = θ^T φ(·); i.e., the functions in H_K are Gaussian distributed:
p(f) ∝ exp(−½ ‖f‖²_{H_K}) = exp(−½ θ^T θ)
Note: again we need H_K to be finite-dimensional.
40 A prior over functions
So: p(f) ∝ exp(−½ ‖f‖²_{H_K}) corresponds to θ ~ N(0, I), and f̂_MAP = f̂_RLS.
Assuming p(f) ∝ exp(−½ ‖f‖²_{H_K}),
p(f|X, Y) = p(Y|X, f) p(f) / p(Y|X)
∝ exp(−(1/(2σ_ε²)) ‖Y − f(X)‖²) · exp(−½ ‖f‖²_{H_K})
= exp(−(1/(2σ_ε²)) ‖Y − f(X)‖² − ½ ‖f‖²_{H_K})
41 A quick recap
We wanted to know if RLS has a probabilistic interpretation.
Empirical risk minimization is ML:
p(Y|X, θ) ∝ e^{−(1/(2σ_ε²)) ‖Y − Xθ‖²}
Linear RLS is MAP:
p(Y, θ|X) ∝ e^{−(1/(2σ_ε²)) ‖Y − Xθ‖²} e^{−½ θ^T θ}
Kernel RLS is also MAP:
p(Y, θ|X) ∝ e^{−(1/(2σ_ε²)) ‖Y − φ(X)θ‖²} e^{−½ θ^T θ}
This is equivalent to a Gaussian prior on H_K:
p(Y, f|X) ∝ e^{−(1/(2σ_ε²)) ‖Y − f(X)‖²} e^{−½ ‖f‖²_{H_K}}
But these don't work for infinite-dimensional function spaces...
45 Transductive setting
We hinted at problems if dim H_K = ∞. Idea: forget about estimating θ (i.e. f). Instead, estimate the predicted outputs Y* = [y*_1, …, y*_M]^T at the test inputs X* = [x*_1, …, x*_M]^T.
We need the joint distribution over Y and Y*.
46 Transductive setting
Say Y and Y* are jointly Gaussian:
[Y; Y*] ~ N([μ_Y; μ_Y*], [[Λ_Y, Λ_{YY*}], [Λ_{Y*Y}, Λ_{Y*}]])
We want: kernel RLS. General form for the posterior:
Y* | X*, Y ~ N(μ_{Y*|Y}, Σ_{Y*|Y})
where
μ_{Y*|Y} = μ_{Y*} + Λ_{Y*Y} Λ_Y^{-1} (Y − μ_Y)
Σ_{Y*|Y} = Λ_{Y*} − Λ_{Y*Y} Λ_Y^{-1} Λ_{YY*}
48 Transductive setting
Set Λ_Y = K(X, X) + σ_ε² I, Λ_{YY*} = K(X, X*), Λ_{Y*} = K(X*, X*).
Posterior:
Y* | X*, Y ~ N(μ_{Y*|Y}, Σ_{Y*|Y})
where
μ_{Y*|Y} = μ_{Y*} + K(X*, X) (K(X, X) + σ_ε² I)^{-1} (Y − μ_Y)
Σ_{Y*|Y} = K(X*, X*) − K(X*, X) (K(X, X) + σ_ε² I)^{-1} K(X, X*)
So, with zero means: Ŷ*_MAP = f̂_RLS(X*).
49 Transductive setting
Model:
[Y; Y*] ~ N([μ_Y; μ_Y*], [[K(X, X) + σ_ε² I, K(X, X*)], [K(X*, X), K(X*, X*)]])
The MAP estimate (= posterior mean) is the RLS function at every point x*, regardless of dim H_K.
Are the prior and posterior (on points!) consistent with a distribution on H_K?
50 Transductive setting
Strictly speaking, θ and f don't come into play here at all.
Have: p(Y*|X*, Y). Do not have: p(θ|X, Y) or p(f|X, Y).
But if H_K is finite-dimensional, the joint over Y and Y* is consistent with:
Y = f(X) + ε, Y* = f(X*),
where f ∈ H_K is a random trajectory from a Gaussian process over the domain, with mean μ and covariance K.
(Ergo, people call this "Gaussian process regression." Also "kriging," because of a guy.)
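The transductive posterior above is exactly what Gaussian process regression code computes. A minimal sketch with zero prior means and an assumed squared-exponential kernel (my choice of K and of all names; the slides do not mandate a kernel):

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(a, b, ell=0.3):
    # Squared-exponential kernel (an assumed choice of K, not from the slides).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

x = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)
sigma2 = 0.01

xs = np.linspace(0.0, 1.0, 50)            # test inputs X*
Ky = rbf(x, x) + sigma2 * np.eye(x.size)  # Lambda_Y = K(X, X) + sigma2 I
Ksx = rbf(xs, x)                          # K(X*, X)

# Posterior over Y* given Y (zero prior means):
mu = Ksx @ np.linalg.solve(Ky, y)                      # posterior mean = RLS prediction
cov = rbf(xs, xs) - Ksx @ np.linalg.solve(Ky, Ksx.T)   # posterior covariance
sd = np.sqrt(np.clip(np.diag(cov), 0.0, None))         # for confidence intervals

print(mu.shape, sd.max())  # posterior variance never exceeds the prior variance (1 here)
```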
52 Recap redux
Empirical risk minimization is the maximum likelihood estimator when:
y = x^T θ + ε
Linear RLS is the MAP estimator when:
y = x^T θ + ε, θ ~ N(0, I)
Kernel RLS is the MAP estimator when:
y = φ(x)^T θ + ε, θ ~ N(0, I)
in finite-dimensional H_K.
Kernel RLS is the MAP estimator at points when:
[Y; Y*] ~ N([μ_Y; μ_Y*], [[K(X, X) + σ_ε² I, K(X, X*)], [K(X*, X), K(X*, X*)]])
in possibly infinite-dimensional H_K.
53 Is this useful in practice?
Want confidence intervals + believe the posteriors are meaningful ⇒ yes. Maybe other reasons?
[Figure: Gaussian process regression with confidence intervals, showing the regression function and the observed points.]
54 What is going on with infinite-dimensional H_K?
We wrote down statements like θ ~ N(0, I) and f ~ N(0, I). The space H_K can be written
H_K = {f : ‖f‖_{H_K} < ∞} = {θ^T φ(·) : Σ_{i=1}^∞ θ_i² < ∞}
Difference between finite and infinite dimensions: not every θ yields a function θ^T φ(·) in H_K.
A hint: θ ~ N(0, I) ⇒ E ‖θ^T φ(·)‖²_{H_K} = ∞.
In fact: θ ~ N(0, I) ⇒ θ^T φ(·) ∈ H_K with probability zero.
So be careful out there.
57 A hint that things are amiss
Assume θ ~ N(0, I) and {φ_i}_{i=1}^∞ an orthonormal basis of H_K. Then
E ‖θ^T φ‖²_{H_K} = E ‖Σ_{i=1}^∞ θ_i φ_i‖²_{H_K} = Σ_{i=1}^∞ E θ_i² = Σ_{i=1}^∞ 1 = ∞
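A Monte Carlo sketch of the same computation: truncating the expansion at dimension d, the expected squared norm is d, so it grows without bound as the truncation increases (the dimensions, draw count, and seed are illustrative choices of my own):

```python
import numpy as np

rng = np.random.default_rng(5)

# Truncate theta ~ N(0, I) at dimension d: the expected squared norm of the
# truncated expansion is sum_{i<=d} E theta_i^2 = d, which diverges as d grows.
for d in (10, 100, 1000, 10000):
    theta = rng.normal(size=(200, d))      # 200 Monte Carlo draws of theta
    est = (theta ** 2).sum(axis=1).mean()  # estimate of E ||theta||^2
    print(d, est)                          # est tracks d in every case
```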
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School
More information20: Gaussian Processes
10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction
More information11 : Gaussian Graphic Models and Ising Models
10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood
More informationPractical Bayesian Optimization of Machine Learning. Learning Algorithms
Practical Bayesian Optimization of Machine Learning Algorithms CS 294 University of California, Berkeley Tuesday, April 20, 2016 Motivation Machine Learning Algorithms (MLA s) have hyperparameters that
More informationHierarchical Models & Bayesian Model Selection
Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationProbabilistic Graphical Models
Parameter Estimation December 14, 2015 Overview 1 Motivation 2 3 4 What did we have so far? 1 Representations: how do we model the problem? (directed/undirected). 2 Inference: given a model and partially
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationMachine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari
Machine Learning Basics: Stochastic Gradient Descent Sargur N. srihari@cedar.buffalo.edu 1 Topics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation Sets
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationCSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes
CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail
More informationIntroduction into Bayesian statistics
Introduction into Bayesian statistics Maxim Kochurov EF MSU November 15, 2016 Maxim Kochurov Introduction into Bayesian statistics EF MSU 1 / 7 Content 1 Framework Notations 2 Difference Bayesians vs Frequentists
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationBayesian RL Seminar. Chris Mansley September 9, 2008
Bayesian RL Seminar Chris Mansley September 9, 2008 Bayes Basic Probability One of the basic principles of probability theory, the chain rule, will allow us to derive most of the background material in
More informationDD Advanced Machine Learning
Modelling Carl Henrik {chek}@csc.kth.se Royal Institute of Technology November 4, 2015 Who do I think you are? Mathematically competent linear algebra multivariate calculus Ok programmers Able to extend
More informationLecture 4: Probabilistic Learning
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More informationChapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)
HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter
More informationComputer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationIntroduction to Machine Learning
Introduction to Machine Learning Bayesian Classification Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative
More informationBayesian Inference. Chapter 9. Linear models and regression
Bayesian Inference Chapter 9. Linear models and regression M. Concepcion Ausin Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in Mathematical Engineering
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationKernel Bayes Rule: Nonparametric Bayesian inference with kernels
Kernel Bayes Rule: Nonparametric Bayesian inference with kernels Kenji Fukumizu The Institute of Statistical Mathematics NIPS 2012 Workshop Confluence between Kernel Methods and Graphical Models December
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More information