Algebraic Geometry and Model Selection


1 Algebraic Geometry and Model Selection
American Institute of Mathematics, 2011/Dec/12-16
I would like to thank Prof. Russell Steele, Prof. Bernd Sturmfels, and all participants. Thank you very much.
Sumio Watanabe, Tokyo Institute of Technology

2 Contents
1. AIC, BIC, and DIC
2. Birational Invariants
3. Singular Fluctuation
4. Open Questions

3 1. AIC, BIC, and DIC

4 Statistical Model and True Distribution
x ∈ R^N, w ∈ W (compact) ⊂ R^d.
(1) True distribution q(x); i.i.d. samples X_1, X_2, ..., X_n.
(2) Statistical model p(x|w).
(3) Prior distribution φ(w).
(Remark) E[ · ] denotes the expectation over X_1, X_2, ..., X_n; E_x[ · ] denotes the expectation over X whose probability distribution is q(x).

5 Statistical Estimation
Posterior distribution expectation:
E_w[ · ] = ∫ ( · ) ∏_{i=1}^n p(X_i|w) φ(w) dw / ∫ ∏_{i=1}^n p(X_i|w) φ(w) dw
Predictive distribution:
p*(x) = E_w[ p(x|w) ]
(Remark) In Bayesian estimation, the true distribution q(x) is estimated by p*(x).
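As a concrete illustration of the predictive distribution, here is a minimal sketch assuming posterior samples w_s are already available (e.g. from MCMC); the function names, array layout, and Gaussian toy model are placeholders, not part of the slides.

```python
import numpy as np

def predictive_density(x, w_samples, logp):
    """p*(x) = E_w[p(x|w)], approximated by averaging the model density
    over posterior samples w_samples (e.g. MCMC output)."""
    # log p(x|w_s) for each posterior sample w_s
    logs = np.array([logp(x, w) for w in w_samples])
    # numerically stable average of exp(logs)
    m = logs.max()
    return np.exp(m) * np.mean(np.exp(logs - m))

# toy example: N(w, 1) model; w_samples stands in for MCMC output
logp = lambda x, w: -0.5 * np.log(2 * np.pi) - 0.5 * (x - w) ** 2
w_samples = np.random.normal(0.1, 0.05, size=2000)
print(predictive_density(0.0, w_samples, logp))
```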

6 Stochastic Complexity and Generalization Loss
(1) Stochastic complexity = - log (Bayes marginal likelihood):
F = - log ∫ ∏_{i=1}^n p(X_i|w) φ(w) dw
(2) Generalization loss:
G = - E_x[ log p*(x) ] = S + KL( q(x) || p*(x) )
(Remark) S = - E_x[ log q(x) ] is the entropy of q(x).
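The decomposition of the generalization loss into entropy plus a Kullback-Leibler term follows in one line from the definitions above:
G = - E_x[ log p*(x) ] = - E_x[ log q(x) ] + E_x[ log ( q(x) / p*(x) ) ] = S + KL( q(x) || p*(x) ),
so minimizing the generalization loss over models is the same as bringing the predictive distribution close to q(x) in Kullback-Leibler divergence.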

7 BIC (Schwarz, 1978)
(1) BIC = - Σ_{i=1}^n log p(X_i | w_MLE) + (d/2) log n
If the posterior is approximately a normal distribution, F = BIC + O_p(1).
In general, as we learned yesterday,
F = - Σ_{i=1}^n log p(X_i | w_MLE) + λ log n - (m-1) log log n + O_p(1),
where λ is the real log canonical threshold (RLCT) and m is its multiplicity.
In our workshop: Dr. Lin, relation to asymptotic integrals; Dr. Drton, how to use it in statistics; Dr. Leykin, D-modules and b-functions.

8 AIC (Akaike, 1974) & DIC (Spiegelhalter et al., 2002)
(2) AIC = - Σ_{i=1}^n log p(X_i | w_MLE) + d
(3) DIC = - Σ_{i=1}^n log E_w[ p(X_i|w) ] + 2 Σ_{i=1}^n { - E_w[ log p(X_i|w) ] + log p(X_i | E_w[w]) }
If the posterior is approximately a normal distribution,
E[AIC] = n E[G] + o(1),  E[DIC] = n E[G] + o(1).
Otherwise, such relations do not hold.
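For reference, a small sketch that evaluates the three criteria as they are written on slides 7-8 (note the slides use half of the conventional -2 log-likelihood scale); the function names, the posterior-sample matrix layout, and the helper are my own assumptions, and the DIC form follows the slide's transcription rather than the usual plug-in deviance.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """log of the mean of exp(a) along an axis, computed stably."""
    m = a.max(axis=axis)
    return m + np.log(np.mean(np.exp(a - m), axis=axis))

def aic(loglik_mle, d):
    """AIC on the slides' scale: -sum_i log p(X_i|w_MLE) + d."""
    return -loglik_mle + d

def bic(loglik_mle, d, n):
    """BIC on the slides' scale: -sum_i log p(X_i|w_MLE) + (d/2) log n."""
    return -loglik_mle + 0.5 * d * np.log(n)

def dic(logp_matrix, logp_at_post_mean):
    """DIC as written on slide 8.

    logp_matrix[s, i]    = log p(X_i | w_s) for posterior samples w_s
    logp_at_post_mean[i] = log p(X_i | E_w[w])
    """
    # first term: -sum_i log E_w[p(X_i|w)]
    first = -np.sum(log_mean_exp(logp_matrix, axis=0))
    # penalty: 2 * sum_i { -E_w[log p(X_i|w)] + log p(X_i|E_w[w]) }
    penalty = 2.0 * np.sum(-logp_matrix.mean(axis=0) + logp_at_post_mean)
    return first + penalty
```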

9 Estimation of G and F in regular cases
          | Main term estimated | Consistency in model selection | Unbiased estimator of G
AIC, DIC  | Random (order 1)    | No                             | Yes
BIC       | Constant (log n)    | Yes                            | No

10 2. Birational Statistics

11 Birational Invariant
If a value that is defined using a resolution of singularities does not depend on the choice of the resolution, then it is called a birational invariant.
[Diagram: several resolution maps g_1: U_1 → W, g_2: U_2 → W, g_3: U_3 → W of the same parameter set W.]

12 Nature in Statistics
It is natural that statistical theory should be invariant under birational transforms. Fisher's asymptotic theory does not satisfy such invariance; for example, it is not invariant under blow-up.
Y = aX + b + noise, with the blow-up (a, b) = (c, cd) in one chart and (a, b) = (cd, d) in the other.
Asymptotic normality of (a, b) holds, whereas that of (c, d) in projective space does not.
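A quick check of why normality fails in the blow-up chart (a worked example of my own, assuming Gaussian noise of unit variance and the chart a = c, b = cd): for the model Y = cX + cd + ε with ε ~ N(0,1),
∂_c log p(y|x,c,d) = (x + d)(y - cx - cd),   ∂_d log p(y|x,c,d) = c (y - cx - cd),
so the Fisher information matrix in the (c, d) chart is
I(c,d) = [ E[(X+d)^2]   c E[X+d] ;  c E[X+d]   c^2 ],
whose determinant is c^2 Var(X). It vanishes on the exceptional set {c = 0}, which is exactly the true parameter set when a = b = 0, so the usual asymptotic normality for (c, d) breaks down there.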

13 Differential and Birational
Algebraic geometry studies mathematical properties that are invariant under birational transforms (just as differential geometry studies properties invariant under diffeomorphisms).
To construct a birational statistics might be one of the purposes of algebraic statistics.

14 3. Singular Fluctuation and model selection

15 Loss for estimation
Predictive distribution: p*(x) = E_w[ p(x|w) ]
Generalization loss: G_n = - E_x[ log E_w[ p(x|w) ] ]
Training loss: T_n = - (1/n) Σ_{i=1}^n log E_w[ p(X_i|w) ]

16 Functional Cumulant
Def. Two cumulant generating functions:
g(α) = - E_x[ log E_w[ p(x|w)^α ] ]
t(α) = - (1/n) Σ_{i=1}^n log E_w[ p(X_i|w)^α ]
Then g(0) = t(0) = 0, and G_n = g(1), T_n = t(1).

17 Invariance
The two functions g(α) and t(α) are invariant under a change of variables w = g(u):
p(x|w) = p(x|g(u)),   φ(w) dw = φ(g(u)) |g'(u)| du.
Cumulant generating function = birational invariant generating function.
Example: λ = lim_{n→∞} n { E[g(1)] - S }.
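The invariance can be checked in one step from the definitions:
E_w[ p(x|w)^α ] = ∫ p(x|w)^α ∏_{i=1}^n p(X_i|w) φ(w) dw / ∫ ∏_{i=1}^n p(X_i|w) φ(w) dw,
and substituting w = g(u) replaces φ(w) dw by φ(g(u)) |g'(u)| du in both the numerator and the denominator, so the ratio, and hence g(α) and t(α), is unchanged.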

18 Notation
Def. Log density ratio function: f(x,w) = log( q(x) / p(x|w) ).
Then log p(x|w) = log q(x) - f(x,w).

19 Expansion of g(α)
g(α) is rewritten as
g(α) = α S - E_x[ log E_w[ exp(-α f(x,w)) ] ].
Therefore
g'(0) = S + E_x[ E_w[ f(x,w) ] ],
g''(0) = - E_x[ E_w[ f(x,w)^2 ] - E_w[ f(x,w) ]^2 ].
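Both derivatives follow by differentiating under E_x; writing h(α) = E_w[ exp(-α f(x,w)) ],
g'(α) = S + E_x[ E_w[ f e^{-αf} ] / h(α) ],   g''(α) = - E_x[ E_w[ f^2 e^{-αf} ] / h(α) - ( E_w[ f e^{-αf} ] / h(α) )^2 ],
and setting α = 0 (where h(0) = 1) gives the stated formulas.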

20 Expansion of t(α)
In the same way,
t(α) = α S_n - (1/n) Σ_{i=1}^n log E_w[ exp(-α f(X_i, w)) ],
where S_n = - (1/n) Σ_{i=1}^n log q(X_i) is the empirical entropy. Therefore
t'(0) = S_n + (1/n) Σ_{i=1}^n E_w[ f(X_i, w) ],
t''(0) = - (1/n) Σ_{i=1}^n { E_w[ f(X_i, w)^2 ] - E_w[ f(X_i, w) ]^2 }.

21 Functional Variance
Def. Two random variables:
V_1 = 2 { n E_x[ E_w[ f(x,w) ] ] - λ },
V_2 = Σ_{i=1}^n { E_w[ (log p(X_i|w))^2 ] - E_w[ log p(X_i|w) ]^2 }.
Remark. V_2 can be calculated from the samples and the model without any information about the true distribution. In order to calculate V_1, we need information about the true distribution.
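V_2 is straightforward to evaluate from MCMC output; a minimal sketch (the array layout and the function name are my own assumptions):

```python
import numpy as np

def functional_variance(logp_matrix):
    """V_2 = sum_i { E_w[(log p(X_i|w))^2] - E_w[log p(X_i|w)]^2 },
    i.e. the posterior variance of log p(X_i|w) summed over data points,
    with the posterior approximated by MCMC samples.

    logp_matrix[s, i] = log p(X_i | w_s) for posterior sample w_s.
    """
    return np.sum(np.var(logp_matrix, axis=0))
```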

22 Singular Fluctuation
Theorem. The convergences V_1 → V_1*, V_2 → V_2*, E[V_1] → E[V_1*], E[V_2] → E[V_2*] hold.
Theorem and Def. (Singular fluctuation)  E[V_1*] = E[V_2*] = 2ν.
Remark. In order to prove the above theorems, we need the resolution theorem and empirical process theory. In regular cases, λ = ν = d/2.

23 Outline of Proofs
Posterior distribution measure: exp( -n K_n(w) ) φ(w) dw, where K_n(w) = (1/n) Σ_{i=1}^n f(X_i, w).
After resolution of singularities w = g(u), this becomes
exp( -n u^{2k} + n^{1/2} u^k ξ_n(u) ) φ(g(u)) |g'(u)| du
= ∫_0^∞ dt ∫ δ( t - n u^{2k} ) u^h b(u) exp( -t + t^{1/2} ξ_n(u) ) du
≅ ( (log n)^{m-1} / n^λ ) ∫ dt ∫ t^{λ-1} exp( -t + t^{1/2} ξ_n(u) ) D(u) du.

24 Outline of Proofs 2
Def. Expectation over the limit posterior distribution:
⟨ · ⟩ = ∫ dt ∫ du D(u) ( · ) t^{λ-1} exp( -t + t^{1/2} ξ(u) ) / ∫ dt ∫ du D(u) t^{λ-1} exp( -t + t^{1/2} ξ(u) ).
Lemma. For s ≥ 0,
n^{s/2} E_w[ f(x,w)^s ] → ⟨ t^{s/2} a(x,u)^s ⟩.

25 Cumulants --- Random Variables
g'(0) = S + (λ + V_1/2)/n + o_p(1/n)
g''(0) = - V_2/n + o_p(1/n)
t'(0) = S_n + (λ - V_1/2)/n + o_p(1/n)
t''(0) = - V_2/n + o_p(1/n)

26 Cumulants --- Birational Invariants
E[ g'(0) ] = S + (λ + ν)/n + o(1/n)
E[ g''(0) ] = - 2ν/n + o(1/n)
E[ t'(0) ] = S + (λ - ν)/n + o(1/n)
E[ t''(0) ] = - 2ν/n + o(1/n)

27 Theorem
Generalization loss: G_n = S + [ λ + V_1/2 - V_2/2 ] / n + o_p(1/n)
Training loss: T_n = S_n + [ λ - V_1/2 - V_2/2 ] / n + o_p(1/n)

28 Theorem
When n tends to infinity,
E[ G_n ] = S + λ/n + o(1/n),
E[ T_n ] = S + (λ - 2ν)/n + o(1/n),
E[ V_2 ] = 2ν + o(1).
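Combining these three statements already shows why the correction on the next slide works: adding V_2/n to the training loss removes the bias in expectation,
E[ T_n + V_2/n ] = S + (λ - 2ν)/n + 2ν/n + o(1/n) = S + λ/n + o(1/n) = E[ G_n ] + o(1/n);
the theorem on the next slide sharpens this to o(1/n^2).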

29 Def. WAIC
WAIC is defined by W_n = T_n + V_2/n.
Theorem. For an arbitrary triple (q(x), p(x|w), φ(w)),
E[ G_n ] = E[ W_n ] + o(1/n^2),
(G_n - S) + (W_n - S_n) = 2λ/n + o_p(1/n).
Remark. WAIC is asymptotically equivalent to Bayes cross validation (Watanabe, JMLR, Vol. 11, 2010).
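A minimal WAIC computation from an MCMC log-likelihood matrix, following the slide's definition W_n = T_n + V_2/n (the array layout and names are the same assumptions as in the earlier sketches):

```python
import numpy as np

def waic(logp_matrix):
    """W_n = T_n + V_2/n, with logp_matrix[s, i] = log p(X_i | w_s).

    T_n = -(1/n) sum_i log E_w[p(X_i|w)]   (training loss, slide 15)
    V_2 =  sum_i Var_w[log p(X_i|w)]       (functional variance, slide 21)
    """
    n = logp_matrix.shape[1]
    # log E_w[p(X_i|w)], computed stably for each data point i
    m = logp_matrix.max(axis=0)
    log_pred = m + np.log(np.mean(np.exp(logp_matrix - m), axis=0))
    t_n = -np.mean(log_pred)
    v_2 = np.sum(np.var(logp_matrix, axis=0))
    return t_n + v_2 / n
```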

30 [Figure: numerical experiment with reduced rank regression (posterior by the Metropolis method, n = 200, 100 independent trials). Red: W_n - S_n; blue: G_n - S; the theoretical value λ/n with λ = 12 is shown for comparison.]

31 [Figure: scatter plot of G_n - S against W_n - S_n over the trials, illustrating the relation (G_n - S) + (W_n - S_n) = 2λ/n.]

32 WAIC
(1) E[ W_n ] = E[ G_n ] + o(1/n^2) holds even if q(x) is not realizable by p(x|w).
(2) The essential main term fluctuates, so there is no consistency in model selection.
(3) If the posterior can be approximated by a normal distribution, WAIC is equivalent to AIC and DIC as a random variable.
(4) Otherwise, WAIC is an unbiased estimator of the generalization error, whereas AIC and DIC are not.

33 4. Open Problems

34 Two Birational Invariants
If the regularity condition is satisfied, λ = ν = d/2. In general, they are different.
λ = a dimension that shows how fast the posterior shrinks.
ν = a dimension that shows how strongly the posterior fluctuates.
Q. Mathematically, what is ν?

35 Generalized RLCT
g(α) = - E_x[ log E_w[ p(x|w)^α ] ]
t(α) = - (1/n) Σ_{i=1}^n log E_w[ p(X_i|w)^α ]
Q. E[ (d/dα)^k g(0) ] and E[ (d/dα)^k t(0) ] (k = 1, 2, ...) are all birational invariants. They are statistical generalizations of the real log canonical threshold. What are they?

36 Variance of Information Criterion
E[ G_n ] = E[ W_n ] + o(1/n^2),
(G_n - S) + (W_n - S_n) = 2λ/n + o_p(1/n).
Q. These equations suggest that WAIC has the smallest variance among the information criteria whose averages equal that of G_n. Can we prove this?

37 Bayes Hypothesis Testing
For a given prior φ(w), we define
F(φ) = - log ∫ ∏_{i=1}^n p(X_i|w) φ(w) dw.
If the null hypothesis is φ_0(w) and the alternative is φ_1(w), then the most powerful test is given by ΔF = F(φ_1) - F(φ_0).
Q. In order to construct the most powerful hypothesis test, the probability distribution of ΔF is necessary.
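As a numerical illustration of ΔF (not from the slides): for small problems both marginal likelihoods can be estimated by naive Monte Carlo over the respective priors; the toy model, priors, and data below are placeholders.

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal(loglik, prior_sampler, n_draws=20000, seed=0):
    """Estimate log of the Bayes marginal likelihood by averaging the
    likelihood over draws from the prior (naive Monte Carlo)."""
    rng = np.random.default_rng(seed)
    ws = prior_sampler(rng, n_draws)
    lls = np.array([loglik(w) for w in ws])
    return logsumexp(lls) - np.log(n_draws)

# toy example: data from N(0.3, 1); phi_0 concentrated near 0, phi_1 diffuse
rng = np.random.default_rng(0)
data = rng.normal(0.3, 1.0, size=50)
loglik = lambda w: -0.5 * np.sum((data - w) ** 2) - 0.5 * len(data) * np.log(2 * np.pi)
prior0 = lambda r, m: r.normal(0.0, 0.01, size=m)  # null hypothesis prior phi_0
prior1 = lambda r, m: r.normal(0.0, 1.0, size=m)   # alternative prior phi_1

F0 = -log_marginal(loglik, prior0, seed=1)
F1 = -log_marginal(loglik, prior1, seed=2)
print("Delta F = F(phi_1) - F(phi_0) =", F1 - F0)
```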
