Lecture 7 Introduction to Statistical Decision Theory


1 Lecture 7: Introduction to Statistical Decision Theory. I-Hsiang Wang, Department of Electrical Engineering, National Taiwan University. ihwang@ntu.edu.tw. December 20. (1 / 55, I-Hsiang Wang, IT Lecture 7)

2 In the rest of this course we switch gears to the interplay between information theory and statistics.
1) In this lecture, we introduce the basic elements of statistical decision theory: how to make decisions from data samples collected from a statistical model, and how to evaluate decision-making algorithms (decision rules) under a statistical model. This lecture also serves as an overview of the contents to be covered in the follow-up lectures.
2) In the follow-up lectures, we go into the details of several topics, including:
- Hypothesis testing: large-sample asymptotic performance limits.
- Point estimation: Bayes vs. Minimax, lower-bounding techniques, high-dimensional problems, etc.
Alongside, we introduce tools and techniques for investigating the asymptotic performance of several statistical problems and show their interplay with information theory:
- Tools from probability theory: large deviations, concentration inequalities, etc.
- Elements from information theory: information measures, lower-bounding techniques, etc.

3 Overview of this Lecture. The goal of this lecture is to establish the basics of statistical decision theory.
1) We begin by setting up the framework of statistical decision theory, including:
- Statistical experiment: parameter space, data samples, statistical model.
- Decision rule: deterministic vs. randomized.
- Performance evaluation: loss function, risk, Minimax vs. Bayes.
2) Next, we introduce two basic statistical decision-making problems: hypothesis testing and point estimation.

4 Outline:
1 Statistical Model and Decision Making: Basic Framework; Examples; Paradigms.
2 Hypothesis Testing: Basics.
3 Estimation: Mean-Squared Error (MSE) and Cramér-Rao Lower Bound; Maximum Likelihood Estimator, Consistency, and Efficiency.

5 (Outline) Next: Statistical Model and Decision Making — Basic Framework.

6 [Block diagram: a statistical experiment P_θ(·) generates data X, which is fed to the decision-making rule, producing τ(X) = T̂.]

7 Statistical Experiment.
- Statistical model: a collection of data-generating distributions P ≜ {P_θ : θ ∈ Θ}, where Θ is called the parameter space; it could be finite, countably infinite, or uncountable. P_θ(·) is a probability distribution that accounts for the implicit randomness in experiments, sampling, or making observations.
- Data (sample/outcome/observation): X is generated by a random draw from P_θ, that is, X ∼ P_θ. X could be a random variable, vector, matrix, process, etc.

8 Inference Task.
- Objective: T(θ), a function of the parameter θ. From the data X ∼ P_θ, one would like to infer T(θ).
Decision Rule.
- Deterministic: τ(·) is a function of X, and T̂ = τ(X) is the inferred result.
- Randomized: τ(·, ·) is a function of (X, U), where U is external randomness, and T̂ = τ(X, U) is the inferred result.

9 Performance Evaluation: how good is a decision rule τ?
- Loss function: ℓ(T(θ), τ(X)) measures how bad the decision rule τ is at a specific data point X. Note: since X is random, ℓ(T(θ), τ(X)) is also random.
- Risk: L_θ(τ) ≜ E_{X∼P_θ}[ℓ(T(θ), τ(X))] measures on average how bad the decision rule τ is when the true parameter is θ.

10 Performance Evaluation: what if the decision rule τ is randomized?
- The loss function becomes ℓ(T(θ), τ(X, U)).
- The risk becomes L_θ(τ) ≜ E_{U, X∼P_θ}[ℓ(T(θ), τ(X, U))].
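As a small numerical sketch of the risk definition above (the Bernoulli model, sample sizes, and function names here are illustrative assumptions, not from the lecture), one can estimate L_θ(τ) = E[ℓ(T(θ), τ(X))] by Monte Carlo and compare it to the known closed form for the sample mean under squared-error loss:

```python
import random

def risk_mc(theta, decision_rule, loss, n_samples, n_trials, seed=0):
    """Monte Carlo estimate of the risk L_theta(tau) = E[loss(T(theta), tau(X))]
    for the illustrative model: X = n_samples i.i.d. Bernoulli(theta) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        x = [1 if rng.random() < theta else 0 for _ in range(n_samples)]
        total += loss(theta, decision_rule(x))
    return total / n_trials

# Squared-error loss and the sample-mean rule for estimating T(theta) = theta.
sq_loss = lambda t, t_hat: (t - t_hat) ** 2
sample_mean = lambda x: sum(x) / len(x)

# For Bernoulli, the exact risk of the sample mean is theta*(1-theta)/n.
est = risk_mc(0.3, sample_mean, sq_loss, n_samples=50, n_trials=20000)
exact = 0.3 * 0.7 / 50
```

The same `risk_mc` skeleton works for any rule and loss: only the sampling line ties it to the Bernoulli model.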

11 (Outline) Next: Statistical Model and Decision Making — Examples.

12 Sometimes we care about the inferred object itself.

13 Example: Decoding. Decoding in channel coding over a DMC is one example we are familiar with.
- Parameter is the message: θ ≜ m.
- Parameter space is the message set: Θ ≜ {1, 2, ..., 2^{NR}}.
- Data is the received sequence: X ≜ Y^N.
- Statistical model is encoder + channel: P_θ(x) ≜ ∏_{i=1}^N P_{Y|X}(y_i | x_i(m)).
- Task is to decode the message: T(θ) ≜ m.
- Decision rule is the decoding algorithm: τ(X) ≜ dec(Y^N).
- Loss function is the 0-1 loss: ℓ(T(θ), τ(X)) ≜ 1{m ≠ dec(Y^N)}.
- Risk is the decoding error probability: L_θ(τ) ≜ λ_{m,dec} ≜ P{m ≠ dec(Y^N) | m is sent}.
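A toy instance of this example can be coded up directly (the repetition code, BSC parameters, and helper names are my illustrative assumptions; the lecture only states the general DMC setup). For a 1-bit message sent through a binary symmetric channel via a length-N repetition code, ML decoding with p < 1/2 reduces to a majority vote, and the risk is the decoding error probability:

```python
import random

def bsc(bits, p, rng):
    """Pass bits through a binary symmetric channel with crossover probability p."""
    return [b ^ (1 if rng.random() < p else 0) for b in bits]

def ml_decode_repetition(y):
    """ML decoding of a 1-bit message sent as a length-N repetition code over a
    BSC with p < 1/2: equivalent to a majority vote on the received bits."""
    return 1 if sum(y) * 2 > len(y) else 0

# Estimate the risk L_theta(tau) (decoding error probability) for message m = 1.
rng = random.Random(0)
N, p, trials = 9, 0.1, 10000
errors = sum(ml_decode_repetition(bsc([1] * N, p, rng)) != 1 for _ in range(trials))
err_prob = errors / trials
```

With N = 9 and p = 0.1 the exact error probability is P{Bin(9, 0.1) ≥ 5} ≈ 9·10⁻⁴, so `err_prob` should come out small.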

14 Example: Hypothesis Testing. Decoding in channel coding belongs to a more general class of problems called hypothesis testing.
- Parameter space is a finite set: |Θ| < ∞.
- Task is to infer the parameter θ: T(θ) = θ.
- Loss function is the 0-1 loss: ℓ(T(θ), τ(X)) = 1{θ ≠ τ(X)}.
- Risk is the probability of error: L_θ(τ) = P_{X∼P_θ}{θ ≠ τ(X)}.

15 Example: Density Estimation. Estimate the probability density function from the collected samples.
- Parameter space is a (huge) set of density functions: Θ ≜ F = {f : R → [0, +∞) | f is concave/continuous/Lipschitz continuous/etc.}.
- Data is the observed i.i.d. sequence: X ≜ X^n, with X_i i.i.d. ∼ f(·).
- Task is to infer the density function f(·): T(θ) ≜ f.
- Decision rule is the density estimator: τ(X) ≜ f̂_{X^n}(·).
- Loss function is some sort of divergence: ℓ(T(θ), τ(X)) ≜ D(f ‖ f̂_{x^n}).
- Risk is the expected loss: L_θ(τ) ≜ E_{X^n∼f^n}[D(f ‖ f̂_{X^n})].
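A minimal density estimator in this spirit is a histogram (my illustrative choice; the lecture leaves the estimator and the divergence abstract, and here I simply compare the estimate to the true density pointwise instead of computing a divergence):

```python
import random

def histogram_density(samples, bins):
    """A minimal density estimator on [0, 1): a histogram with equal-width bins,
    normalized so the estimated density integrates to 1."""
    counts = [0] * bins
    for s in samples:
        counts[min(int(s * bins), bins - 1)] += 1
    n, width = len(samples), 1.0 / bins
    return [c / (n * width) for c in counts]

rng = random.Random(0)
# True density f(x) = 2x on [0, 1): sample via inverse CDF, F^{-1}(u) = sqrt(u).
samples = [rng.random() ** 0.5 for _ in range(50000)]
fhat = histogram_density(samples, bins=20)

# Compare the estimate against f at the bin centers.
centers = [(i + 0.5) / 20 for i in range(20)]
max_err = max(abs(fhat[i] - 2 * centers[i]) for i in range(20))
```

With 50000 samples and 20 bins the pointwise error at the bin centers is small; refining the bins as n grows is exactly the non-parametric flavor of the problem.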

16 Sometimes we care about the utility of the inferred object.

17 Example: Classification/Prediction. A basic problem in learning is to train a classifier that predicts the category of a new object.
- Parameter space is a collection of labelings: Θ ≜ H = {h : X → [1 : K]}.
- Data is the training data set: X ≜ (X^n, Y^n), with labels Y_i ∈ [1 : K].
- Statistical model is the noisy labeling: P_θ(x) ≜ ∏_{i=1}^n P_X(x_i) P_{Y|h(X)}(y_i | h(x_i)).
- Task is to infer the true labeling h ∈ H: T(θ) ≜ h(·), τ(X) ≜ ĥ_{X^n,Y^n}(·).
- Loss function is the prediction error probability: ℓ(T, τ(X)) ≜ E_{X∼P_X}[1{h(X) ≠ ĥ(X)}]. (Note: this is still random, as ĥ depends on the randomly drawn training data (X, Y)^n.)
- Risk is the loss averaged over the training data: L_θ(τ) ≜ E_{(X^n,Y^n)∼(P_X P_{Y|h(X)})^n}[E_{X∼P_X}[1{h(X) ≠ ĥ(X)}]].

18 Example: Regression. Another example we are very familiar with is regression under mean squared error.
- Parameter space is a collection of functions: Θ ≜ F = {f : R^p → R}.
- Data is the training data: X ≜ (X^n, Y^n).
- Statistical model is the noisy observation Y = f(X) + Z, where Z is the observation noise: P_θ(x) ≜ ∏_{i=1}^n P_X(x_i) P_Z(y_i − f(x_i)).
- Task is to infer f ∈ F: T(θ) ≜ f, τ(X) ≜ f̂_{X^n,Y^n}(·).
- Loss function is the mean squared error: ℓ(T, τ(X)) ≜ E_{(X,Y)∼P_f}[(Y − f̂(X))²]. (Note: this is still random, as f̂ depends on the randomly drawn training data (X, Y)^n.)
- Risk is the loss averaged over the training data: L_θ(τ) ≜ E_{(X^n,Y^n)∼P_f^n}[E_{(X,Y)∼P_f}[(Y − f̂(X))²]].

19 Example: Linear Regression. For linear regression, the underlying model is linear: f(x) = β⊤x + γ.
- Parameter is (β, γ) ∈ R^p × R: θ ≜ (β, γ), Θ ≜ R^p × R.
- Data is the training data: X ≜ (X^n, Y^n).
- Statistical model is the noisy linear model: P_θ(x) ≜ ∏_{i=1}^n P_X(x_i) P_Z(y_i − γ − β⊤x_i).
- Task is to infer (β, γ): T(θ) ≜ (β, γ), τ(X) ≜ (β̂, γ̂)_{X^n,Y^n}.
- Loss function is the mean squared error: ℓ(T, τ(X)) ≜ E_{(X,Y)∼P_{β,γ}}[(Y − β̂⊤X − γ̂)²]. (Note: this is still random, as (β̂, γ̂) depends on the randomly drawn training data (X, Y)^n.)
- Risk is the loss averaged over the random training data: L_θ(τ) ≜ E_{(X^n,Y^n)∼P_{β,γ}^n}[E_{(X,Y)∼P_{β,γ}}[(Y − β̂⊤X − γ̂)²]].
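For the scalar case p = 1, the classical decision rule here is ordinary least squares; a quick sketch (the data-generating numbers and function name are illustrative assumptions) shows the rule recovering (β, γ) from noisy training data:

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for the scalar model Y = beta*X + gamma + Z."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta_hat = sxy / sxx
    return beta_hat, my - beta_hat * mx

rng = random.Random(0)
beta, gamma = 2.0, -1.0
xs = [rng.uniform(-1, 1) for _ in range(5000)]
ys = [beta * x + gamma + rng.gauss(0, 0.5) for x in xs]
beta_hat, gamma_hat = fit_line(xs, ys)
```

With 5000 samples the estimates land close to the true (β, γ) = (2, −1); their squared-error fluctuation around the truth is exactly what the risk above averages over.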

20 (Outline) Next: Statistical Model and Decision Making — Paradigms.

21 How to Determine the Best Estimator?
Recall: the risk of a decision rule τ is L_θ(τ), defined according to the underlying statistical model, the task, and the loss function. Note that the risk depends on the hidden parameter θ.
If we could find a decision rule τ* such that L_θ(τ*) ≤ L_θ(τ) for all θ ∈ Θ and for every other decision rule τ, then τ* would obviously be the best. Unfortunately this is unlikely to happen. For example, suppose the loss function is the ℓ₂-norm between θ and θ̂ = τ(X). Then the constant rule τ(X) = θ minimizes the risk L_θ(τ) when the true parameter is θ. This means that no single τ can simultaneously minimize L_θ for all θ ∈ Θ!
There are two main paradigms that resolve this issue: the average-case paradigm (Bayes) and the worst-case paradigm (Minimax).
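The "no uniformly best rule" point can be made concrete with two estimators of a Bernoulli parameter whose exact risks are known in closed form (the specific rules and numbers are my illustrative choices): the sample mean of n draws, and the constant rule τ(X) ≡ 1/2. Neither dominates the other:

```python
def risk_sample_mean(theta, n):
    # Exact MSE of the sample mean of n Bernoulli(theta) draws.
    return theta * (1 - theta) / n

def risk_constant_half(theta):
    # Exact MSE of the constant rule tau(X) = 1/2 (ignores the data).
    return (theta - 0.5) ** 2

n = 10
# At theta = 0.5 the constant rule wins; at theta = 0.1 the sample mean wins.
print(risk_sample_mean(0.5, n), risk_constant_half(0.5))  # → 0.025 0.0
print(risk_sample_mean(0.1, n), risk_constant_half(0.1))
```

Since the risk curves cross, comparing rules requires either averaging over θ (Bayes) or taking the worst case (Minimax), which is exactly what the two paradigms below do.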

22 Bayes Paradigm. In the Bayes paradigm, a prior distribution π(·) over the parameter space Θ is required, and the performance of a decision rule is evaluated according to an average-case analysis with respect to π.
Definition 1 (Bayes Risk). The average risk of a decision rule τ with respect to the prior π is defined as R_π(τ) ≜ E_{Θ∼π}[L_Θ(τ)]. The Bayes risk for a prior π is defined as the minimum average risk, that is, R*_π ≜ inf_τ R_π(τ), (1) which is attained by a Bayes decision rule τ*_π (which may not be unique).
Note that in information theory most coding theorems are derived in the Bayes paradigm: we assume random sources and messages. We will give more results later when we talk about hypothesis testing and point estimation.

23 Minimax Paradigm. In the Minimax paradigm, the performance of a decision rule is evaluated according to its worst-case risk over the entire parameter space.
Definition 2 (Minimax Risk). The worst-case risk of a decision rule τ is defined as R(τ) ≜ sup_{θ∈Θ} L_θ(τ). The Minimax risk is defined as the minimum worst-case risk, that is, R* ≜ inf_τ R(τ) = inf_τ sup_{θ∈Θ} L_θ(τ), (2) which is attained by a Minimax decision rule τ* (which may not be unique).
A main criticism of the Bayes paradigm is that in many applications there is no prior over the parameter space, because the statistical task is done one-shot. In such cases, the Minimax paradigm provides a conservative but robust evaluation and theoretical guarantees.

24 Bayes vs. Minimax. A simple yet fundamental relationship between Minimax and Bayes is stated in the theorem below.
Theorem 1 (Minimax Risk ≥ Worst-Case Bayes Risk). In general, the Minimax risk is not smaller than any Bayes risk, that is, R* ≥ sup_π R*_π. (3) Furthermore, the above inequality becomes an equality in the following two cases: 1) the parameter space Θ and the data alphabet X are both finite; 2) |Θ| < ∞ and the loss function is bounded from below.
Remark: when deriving lower bounds on the Minimax risk, inequality (3) is useful.
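On a tiny finite problem both quantities can be computed by brute force (the two distributions and the uniform prior are my illustrative assumptions, and for simplicity the minimization runs over deterministic rules only, which still upper-bounds R* and so suffices to witness inequality (3)):

```python
from itertools import product

# A toy finite problem: theta in {0, 1}, X in {0, 1, 2}, 0-1 loss.
P = {0: [0.6, 0.3, 0.1],   # P_0(x)
     1: [0.1, 0.3, 0.6]}   # P_1(x)

def risk(theta, rule):
    """L_theta(tau) = P_{X~P_theta}{ tau(X) != theta } under 0-1 loss."""
    return sum(p for x, p in enumerate(P[theta]) if rule[x] != theta)

rules = list(product([0, 1], repeat=3))   # all 8 deterministic rules X -> {0, 1}
minimax = min(max(risk(t, r) for t in (0, 1)) for r in rules)
bayes_uniform = min(0.5 * risk(0, r) + 0.5 * risk(1, r) for r in rules)
```

Here the worst-case-optimal value is 0.4 while the uniform-prior Bayes risk is 0.25, consistent with R* ≥ sup_π R*_π.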

25 Proof: note that R* = inf_τ sup_{θ∈Θ} L_θ(τ) and sup_π R*_π = sup_π inf_τ E_{Θ∼π}[L_Θ(τ)]. For any decision rule τ and prior distribution π, we have sup_{θ∈Θ} L_θ(τ) ≥ E_{Θ∼π}[L_Θ(τ)]. Hence, for any prior distribution π, R* = inf_τ sup_{θ∈Θ} L_θ(τ) ≥ inf_τ E_{Θ∼π}[L_Θ(τ)] = R*_π. Therefore, R* ≥ sup_π R*_π.
Proving the two sufficient conditions for equality requires some knowledge of convex optimization theory; essentially, these are two sufficient conditions under which "strong duality" holds.

26 Deterministic vs. Randomized Decision Rules. Randomization does not always help. Below we give two such scenarios.
Proposition 1. 1) If the loss function τ ↦ ℓ(T, τ) is convex, then randomization does not help. 2) In the Bayes paradigm, there always exists a deterministic decision rule that is Bayes optimal.
Proof: first, for any randomized decision rule τ(X, U), by Jensen's inequality we have L_θ(τ) = E_{U,X∼P_θ}[ℓ(T(θ), τ(X, U))] ≥ E_{X∼P_θ}[ℓ(T(θ), E_U[τ(X, U) | X])] = L_θ(E_U[τ(X, U) | X]). Second, for any randomized decision rule τ(X, U) and prior Θ ∼ π, R_π(τ) = E_{Θ,U,X}[ℓ(T(Θ), τ(X, U))] = E_U[R_π(τ(·, U))] ≥ R_π(τ(·, u)) for some u. Hence the average risk of a randomized decision rule is always lower-bounded by that of a deterministic one; namely, some τ(·, u) attains an R_π(τ(·, u)) no larger than the average.

27 Parametric vs. Non-Parametric Models. Conventionally, a parametric model refers to the case where the parameter of interest is finite-dimensional, while in a non-parametric model the parameter is infinite-dimensional (e.g., a function).
The parametric framework is useful when one is familiar with certain properties of the data and has a good statistical model for it; the parameter space Θ is then fixed, not scaling with the number of data samples. In contrast, if such knowledge about the underlying data is insufficient, the non-parametric framework may be more suitable. This distinction, however, is vague and not always well-defined.
- Parametric: hypothesis testing, mean estimation, covariance matrix estimation, etc.
- Non-parametric: density estimation, regression, etc.

28 (Outline) Next: Hypothesis Testing.

29 (Outline) Next: Hypothesis Testing — Basics.

30 Basic Setup. We begin with the simplest setup: binary hypothesis testing.
1) Two hypotheses regarding the observation X, indexed by θ ∈ {0, 1}: H_0 : X ∼ P_0 (null hypothesis, θ = 0); H_1 : X ∼ P_1 (alternative hypothesis, θ = 1).
2) Goal: design a decision-making algorithm (a test) ϕ : X → {0, 1}, x ↦ θ̂, to choose one of the two hypotheses based on the observed realization of X, so that a certain risk is minimized.
3) The loss function is the 0-1 loss, giving rise to two kinds of error probabilities:
- Probability of false alarm (false positive; type I error): α_ϕ ≜ P_FA(ϕ) ≜ P{H_1 is chosen | H_0}.
- Probability of miss detection (false negative; type II error): β_ϕ ≜ P_MD(ϕ) ≜ P{H_0 is chosen | H_1}.

31 Deterministic Testing Algorithm ⟺ Decision Regions. A test ϕ : X → {0, 1} is equivalently characterized by its corresponding acceptance (decision) regions A_θ̂(ϕ) ≜ ϕ^{−1}(θ̂) = {x ∈ X : ϕ(x) = θ̂}, for θ̂ = 0, 1, where A_1(ϕ) is the acceptance region of H_1 and A_0(ϕ) is the acceptance region of H_0.
The two types of error probability can be equivalently represented as α_ϕ = ∑_{x∈A_1(ϕ)} P_0(x) = ∑_{x∈X} ϕ(x) P_0(x), and β_ϕ = ∑_{x∈A_0(ϕ)} P_1(x) = ∑_{x∈X} (1 − ϕ(x)) P_1(x).
When the context is clear, we often drop the dependency on the test ϕ when dealing with the acceptance regions A_θ̂.

32 Likelihood Ratio Test.
Definition 3 (Likelihood Ratio Test). A (deterministic) likelihood ratio test (LRT) is a test ϕ_τ, parametrized by a constant τ > 0 (called the threshold), defined as follows: ϕ_τ(x) = 1 if P_1(x) > τ P_0(x), and ϕ_τ(x) = 0 if P_1(x) ≤ τ P_0(x).
For x ∈ supp P_0, the likelihood ratio is L(x) ≜ P_1(x) / P_0(x); hence an LRT is a thresholding algorithm on the likelihood ratio L(x).
Remark: for computational convenience, one often works with the log-likelihood ratio (LLR) log L(x) = log P_1(x) − log P_0(x).
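Definition 3 translates directly into a few lines of code for a finite alphabet (the two toy distributions below are my illustrative assumptions):

```python
def lrt(p0, p1, tau):
    """Deterministic LRT phi_tau: decide 1 iff P_1(x) > tau * P_0(x).
    Returns the test as a dict x -> {0, 1}."""
    return {x: 1 if p1[x] > tau * p0[x] else 0 for x in p0}

def error_probs(p0, p1, phi):
    """(alpha, beta) = (false-alarm, miss-detection) probabilities of test phi."""
    alpha = sum(p0[x] for x in p0 if phi[x] == 1)
    beta = sum(p1[x] for x in p1 if phi[x] == 0)
    return alpha, beta

# Toy distributions on X = {0, 1, 2} (illustrative values).
p0 = {0: 0.6, 1: 0.3, 2: 0.1}
p1 = {0: 0.1, 1: 0.3, 2: 0.6}
phi = lrt(p0, p1, tau=1.0)          # accept H_1 exactly where L(x) > 1
alpha, beta = error_probs(p0, p1, phi)
```

For these numbers the LRT accepts H_1 only at x = 2, giving α = 0.1 and β = 0.4; sweeping τ traces out the trade-off curve discussed next.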

33 Trade-Off Between α (P_FA) and β (P_MD).
Theorem 2 (Neyman-Pearson Lemma). For a likelihood ratio test ϕ_τ and any other deterministic test ϕ, α_ϕ ≤ α_{ϕ_τ} ⟹ β_ϕ ≥ β_{ϕ_τ}.
Proof: observe that for all x ∈ X, 0 ≤ (ϕ_τ(x) − ϕ(x))(P_1(x) − τ P_0(x)), because: if P_1(x) − τ P_0(x) > 0, then ϕ_τ(x) = 1, so ϕ_τ(x) − ϕ(x) ≥ 0; if P_1(x) − τ P_0(x) ≤ 0, then ϕ_τ(x) = 0, so ϕ_τ(x) − ϕ(x) ≤ 0. Summing over all x ∈ X, we get 0 ≤ (1 − β_{ϕ_τ}) − (1 − β_ϕ) − τ(α_{ϕ_τ} − α_ϕ) = (β_ϕ − β_{ϕ_τ}) + τ(α_ϕ − α_{ϕ_τ}). Since τ > 0, we conclude that α_ϕ ≤ α_{ϕ_τ} ⟹ β_ϕ ≥ β_{ϕ_τ}.
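On a three-point alphabet the lemma can be checked exhaustively: enumerate all 2³ deterministic tests and confirm none beats the LRT on both error probabilities at once (the distributions are the same illustrative toy values as above):

```python
from itertools import product

p0 = [0.6, 0.3, 0.1]   # illustrative P_0 on X = {0, 1, 2}
p1 = [0.1, 0.3, 0.6]   # illustrative P_1

def alpha_beta(phi):
    """Error probabilities of a deterministic test given as a 0/1 tuple."""
    a = sum(p0[x] for x in range(3) if phi[x] == 1)
    b = sum(p1[x] for x in range(3) if phi[x] == 0)
    return a, b

tau = 1.0
phi_lrt = tuple(1 if p1[x] > tau * p0[x] else 0 for x in range(3))
a_lrt, b_lrt = alpha_beta(phi_lrt)

# Neyman-Pearson: every deterministic test with alpha <= alpha_LRT has beta >= beta_LRT.
violations = [phi for phi in product([0, 1], repeat=3)
              if alpha_beta(phi)[0] <= a_lrt and alpha_beta(phi)[1] < b_lrt]
```

The `violations` list comes out empty, as the lemma guarantees.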

34 [Figure: the trade-off region between P_FA (α) and P_MD (β), with both axes running from 0 to 1.] Question: what is the optimal trade-off curve, and what is the optimal test achieving it?

35 Randomized Testing Algorithm. Randomized tests include deterministic tests as special cases.
Definition 4 (Randomized Test). A randomized test decides θ̂ = 1 with probability ϕ(x) and θ̂ = 0 with probability 1 − ϕ(x), where ϕ is a mapping ϕ : X → [0, 1]. Note: a randomized test is characterized by ϕ, just as in the deterministic case.
Definition 5 (Randomized LRT). A randomized likelihood ratio test (LRT) is a test ϕ_{τ,γ}, parametrized by constants τ > 0 and γ ∈ (0, 1), defined as follows: ϕ_{τ,γ}(x) = 1 if P_1(x) > τ P_0(x), ϕ_{τ,γ}(x) = 0 if P_1(x) < τ P_0(x), and ϕ_{τ,γ}(x) = γ if P_1(x) = τ P_0(x).

36 Randomized LRT Achieves the Optimal Trade-Off. Consider the following optimization problem.
Neyman-Pearson Problem: minimize β_ϕ over ϕ : X → [0, 1], subject to α_ϕ ≤ α.
Theorem 3 (Neyman-Pearson). A randomized LRT ϕ_{τ*,γ*} with parameters (τ*, γ*) satisfying α = α_{ϕ_{τ*,γ*}} attains optimality for the Neyman-Pearson problem.

37 Proof: first, argue that for any α ∈ (0, 1), one can find (τ*, γ*) such that α = α_{ϕ_{τ*,γ*}} = ∑_{x∈X} ϕ_{τ*,γ*}(x) P_0(x) = ∑_{x: L(x)>τ*} P_0(x) + γ* ∑_{x: L(x)=τ*} P_0(x). For any test ϕ, by an argument similar to that in Theorem 2, we have, for all x ∈ X, (ϕ_{τ*,γ*}(x) − ϕ(x))(P_1(x) − τ* P_0(x)) ≥ 0. Summing over all x ∈ X, we similarly get (β_ϕ − β_{ϕ_{τ*,γ*}}) + τ*(α_ϕ − α_{ϕ_{τ*,γ*}}) ≥ 0. Hence, for any feasible test ϕ with α_ϕ ≤ α = α_{ϕ_{τ*,γ*}}, its type II error probability satisfies β_ϕ ≥ β_{ϕ_{τ*,γ*}}.
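The first step of the proof — choosing (τ*, γ*) so the false-alarm constraint is met with equality — is constructive for finite alphabets. A sketch (distributions and the target α are illustrative assumptions; exact rationals avoid floating-point equality pitfalls on the tie set L(x) = τ):

```python
from fractions import Fraction as F

def randomized_lrt_params(p0, p1, alpha):
    """Find (tau, gamma) so the randomized LRT phi_{tau,gamma} has false-alarm
    probability exactly alpha. Sketch for finite alphabets with 0 < alpha < 1:
    sweep thresholds down the sorted likelihood ratios, then randomize on the tie set."""
    ratios = sorted({p1[x] / p0[x] for x in p0}, reverse=True)
    mass_above = F(0)
    for tau in ratios:
        mass_at = sum(p0[x] for x in p0 if p1[x] == tau * p0[x])
        if mass_above + mass_at >= alpha:
            return tau, (alpha - mass_above) / mass_at
        mass_above += mass_at
    raise ValueError("alpha not reachable")

def false_alarm(p0, p1, tau, gamma):
    """alpha = sum_x phi_{tau,gamma}(x) P_0(x)."""
    return sum(p0[x] * (1 if p1[x] > tau * p0[x] else gamma if p1[x] == tau * p0[x] else 0)
               for x in p0)

p0 = {0: F(6, 10), 1: F(3, 10), 2: F(1, 10)}
p1 = {0: F(1, 10), 1: F(3, 10), 2: F(6, 10)}
tau, gamma = randomized_lrt_params(p0, p1, alpha=F(1, 4))
```

Here α = 1/4 is not reachable by any deterministic LRT (which only achieves α ∈ {0, 1/10, 4/10, 1}), but randomizing with γ = 1/2 on the tie set at τ = 1 hits it exactly.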

38 Bayes Risk. Sometimes the prior probabilities of the two hypotheses are known: π_θ ≜ P{H_θ is true}, θ = 0, 1, with π_0 + π_1 = 1. One can then view the index Θ as a (binary) random variable with prior distribution P{Θ = θ} = π_θ, for θ = 0, 1.
With prior probabilities, it makes sense to talk about the average probability of error of a test ϕ, or more generally its average risk: P_e(ϕ) ≜ π_0 α_ϕ + π_1 β_ϕ = E_{Θ,X}[1{Θ ≠ ϕ(X)}]; R_π(ϕ) ≜ E_{Θ,X}[ℓ(Θ, ϕ(X))].
The Bayes hypothesis testing problem is to test the two hypotheses, with knowledge of the prior probabilities, so that the average probability of error (or, in general, a risk function) is minimized.

39 Minimizing Bayes Risk. Consider the following problem of minimizing the Bayes risk.
Bayes Problem: minimize R_π(ϕ) ≜ E_{Θ,X}[ℓ(Θ, ϕ(X))] over ϕ : X → [0, 1], with known (π_0, π_1) and ℓ(θ, θ̂).
Theorem 4 (LRT is an Optimal Bayes Test). Assume ℓ(0,0) < ℓ(0,1) and ℓ(1,1) < ℓ(1,0). A deterministic LRT ϕ_τ with threshold τ = ((ℓ(0,1) − ℓ(0,0)) π_0) / ((ℓ(1,0) − ℓ(1,1)) π_1) attains optimality for the Bayes problem.

40 Proof: R_π(ϕ) = ∑_{x∈X} ℓ(0,0) π_0 P_0(x)(1 − ϕ(x)) + ∑_{x∈X} ℓ(0,1) π_0 P_0(x) ϕ(x) + ∑_{x∈X} ℓ(1,0) π_1 P_1(x)(1 − ϕ(x)) + ∑_{x∈X} ℓ(1,1) π_1 P_1(x) ϕ(x) = ℓ(0,0) π_0 + ∑_{x∈X} (ℓ(0,1) − ℓ(0,0)) π_0 P_0(x) ϕ(x) + ℓ(1,0) π_1 + ∑_{x∈X} (ℓ(1,1) − ℓ(1,0)) π_1 P_1(x) ϕ(x) = ∑_{x∈X} [(ℓ(0,1) − ℓ(0,0)) π_0 P_0(x) − (ℓ(1,0) − ℓ(1,1)) π_1 P_1(x)] ϕ(x) (∗) + ℓ(0,0) π_0 + ℓ(1,0) π_1. For each x ∈ X, we shall choose ϕ(x) ∈ [0, 1] such that (∗) is minimized. It is then obvious that we should choose ϕ(x) = 1 if (ℓ(0,1) − ℓ(0,0)) π_0 P_0(x) − (ℓ(1,0) − ℓ(1,1)) π_1 P_1(x) < 0, and ϕ(x) = 0 if it is ≥ 0, which is exactly the LRT with the claimed threshold.
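For the 0-1 loss, the threshold in Theorem 4 reduces to τ = π_0/π_1, and on a finite alphabet this can be checked against brute force over all deterministic tests (the distributions and prior are my illustrative assumptions; exact rationals keep the comparison exact):

```python
from itertools import product
from fractions import Fraction as F

p0 = {0: F(6, 10), 1: F(3, 10), 2: F(1, 10)}   # illustrative P_0, P_1
p1 = {0: F(1, 10), 1: F(3, 10), 2: F(6, 10)}
pi0, pi1 = F(3, 4), F(1, 4)                     # prior on the two hypotheses

def bayes_risk(phi):
    """Average error probability pi_0*alpha + pi_1*beta under 0-1 loss."""
    alpha = sum(p0[x] for x in p0 if phi[x] == 1)
    beta = sum(p1[x] for x in p1 if phi[x] == 0)
    return pi0 * alpha + pi1 * beta

# Theorem 4 with 0-1 loss: the optimal threshold is tau = pi_0 / pi_1.
tau = pi0 / pi1
phi_bayes = {x: 1 if p1[x] > tau * p0[x] else 0 for x in p0}
best = min(bayes_risk(dict(zip(p0, phi))) for phi in product([0, 1], repeat=3))
```

The LRT with τ = 3 accepts H_1 only at x = 2 and attains the brute-force minimum 7/40: the skewed prior pulls the threshold above 1, demanding stronger evidence for the less likely hypothesis.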

41 Discussions. For binary hypothesis testing problems, the likelihood ratio L(x) ≜ P_1(x)/P_0(x) is a sufficient statistic. Moreover, a likelihood ratio test (LRT) is optimal both in the Neyman-Pearson and in the Bayes settings.
Extensions include M-ary hypothesis testing, Minimax risk optimization (with unknown prior), composite hypothesis testing, etc. We do not pursue these directions further at the moment; we leave some of them to later lectures, including the asymptotic behavior of hypothesis testing and its close connection with information theory, in particular the information divergence. Next, we quickly overview the asymptotic evaluation of performance in estimation.

42 (Outline) Next: Estimation.

43 (Outline) Next: Estimation — Mean-Squared Error (MSE) and Cramér-Rao Lower Bound.

44 Estimator, Bias, Mean Squared Error.
Definition 6 (Estimator). Consider X ∼ P_θ randomly generating the data x, where θ ∈ Θ is an unknown parameter. An estimator of θ based on the observed x is a mapping ϕ : X → Θ, x ↦ θ̂. An estimator of a function z(θ) is a mapping ζ : X → z(Θ), x ↦ ẑ.
When Θ ⊆ R or R^n, it is reasonable to consider the following two measures of performance.
Definition 7 (Bias, Mean Squared Error). For an estimator ϕ(x) of θ, Bias_θ(ϕ) ≜ E_{X∼P_θ}[ϕ(X)] − θ, and MSE_θ(ϕ) ≜ E_{X∼P_θ}[|ϕ(X) − θ|²].

45 Fact 1 (MSE = Variance + (Bias)²). For an estimator ϕ(x) of θ, MSE_θ(ϕ) = Var_{P_θ}[ϕ(X)] + (Bias_θ(ϕ))².
Proof: MSE_θ(ϕ) ≜ E_{P_θ}[|ϕ(X) − θ|²] = E_{P_θ}[(ϕ(X) − E_{P_θ}[ϕ(X)] + E_{P_θ}[ϕ(X)] − θ)²] = Var_{P_θ}[ϕ(X)] + (Bias_θ(ϕ))² + 2 Bias_θ(ϕ) E_{P_θ}[ϕ(X) − E_{P_θ}[ϕ(X)]], where the last expectation is zero.
Note: the MSE is the risk of an estimator when the loss function is the squared-error loss. In the following we provide a parameter-dependent lower bound on the MSE of unbiased estimators, namely the Cramér-Rao inequality.
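The decomposition in Fact 1 holds for empirical moments as well, so it can be checked on simulated data with a deliberately biased estimator (the shrinkage rule and the Gaussian model are my illustrative assumptions):

```python
import random

rng = random.Random(0)
theta, n, trials = 1.0, 20, 2000

# A deliberately biased shrinkage estimator of a Gaussian mean: 0.8 * sample mean.
estimates = []
for _ in range(trials):
    x = [rng.gauss(theta, 1.0) for _ in range(n)]
    estimates.append(0.8 * sum(x) / n)

m = sum(estimates) / trials
mse = sum((e - theta) ** 2 for e in estimates) / trials   # empirical MSE
var = sum((e - m) ** 2 for e in estimates) / trials       # empirical variance
bias = m - theta                                          # empirical bias
# Fact 1: MSE = Var + Bias^2 holds exactly for these empirical moments.
```

Because the decomposition is an algebraic identity, the equality here is exact up to floating-point rounding, not just approximate for large `trials`.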

46 Lower Bound on the MSE of Unbiased Estimators. Below we deal with densities and hence change notation from P_θ to f_θ.
Definition 8 (Fisher Information). The Fisher information of θ is defined as J(θ) ≜ E_{f_θ}[(∂/∂θ ln f_θ(X))²].
Definition 9 (Unbiased Estimator). An estimator ϕ is unbiased if Bias_θ(ϕ) = 0 for all θ ∈ Θ.
Now we are ready to state the theorem.
Theorem 5 (Cramér-Rao). For any unbiased estimator ϕ, MSE_θ(ϕ) ≥ 1/J(θ) for all θ ∈ Θ.
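A quick sanity check of Theorem 5 (the Gaussian model and all numbers are my illustrative choices): for X^n i.i.d. N(θ, σ²), the score is ∑_i (x_i − θ)/σ², the Fisher information is J_n(θ) = n/σ², and the sample mean — which is unbiased — meets the bound σ²/n with equality. Both sides can be estimated by Monte Carlo:

```python
import random

rng = random.Random(0)
theta, sigma, n, trials = 2.0, 1.5, 25, 4000

sq_errs, score_sq = [], []
for _ in range(trials):
    x = [rng.gauss(theta, sigma) for _ in range(n)]
    mean = sum(x) / n
    sq_errs.append((mean - theta) ** 2)                          # squared error of sample mean
    score_sq.append((sum(xi - theta for xi in x) / sigma ** 2) ** 2)  # squared score

mse = sum(sq_errs) / trials        # MSE of the (unbiased) sample mean
J_hat = sum(score_sq) / trials     # Monte Carlo estimate of J_n(theta)
crlb = sigma ** 2 / n              # Cramer-Rao lower bound 1 / J_n(theta)
```

Here `mse`, `1 / J_hat`, and `crlb` all cluster around σ²/n = 0.09, illustrating that the sample mean is an efficient estimator of a Gaussian mean.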

47 Proof: the proof is essentially an application of the Cauchy-Schwarz inequality. Begin with the observation that J(θ) = Var_{f_θ}[s_θ(X)], where the score s_θ(X) ≜ ∂/∂θ ln f_θ(X) = (1/f_θ(X)) ∂/∂θ f_θ(X), because E_{f_θ}[s_θ(X)] = ∫ f_θ(x) (1/f_θ(x)) ∂/∂θ f_θ(x) dx = d/dθ ∫ f_θ(x) dx = 0. Hence, by the Cauchy-Schwarz inequality, we have (Cov_{f_θ}(s_θ(X), ϕ(X)))² ≤ Var_{f_θ}[s_θ(X)] Var_{f_θ}[ϕ(X)]. Since Bias_θ(ϕ) = 0, we have MSE_θ(ϕ) = Var_{f_θ}[ϕ(X)], and hence MSE_θ(ϕ) J(θ) ≥ (Cov_{f_θ}(s_θ(X), ϕ(X)))².

48 It remains to prove that Cov_{f_θ}(s_θ(X), ϕ(X)) = 1: Cov_{f_θ}(s_θ(X), ϕ(X)) = E_{f_θ}[s_θ(X) ϕ(X)] − E_{f_θ}[s_θ(X)] E_{f_θ}[ϕ(X)] = E_{f_θ}[s_θ(X) ϕ(X)] = E_{f_θ}[(1/f_θ(X)) (∂/∂θ f_θ(X)) ϕ(X)] = d/dθ ∫ f_θ(x) ϕ(x) dx = d/dθ E_{f_θ}[ϕ(X)] = (a) d/dθ θ = 1, where (a) holds because ϕ is unbiased. The proof is complete.
Remark: the Cramér-Rao inequality can be extended to vector estimators, biased estimators, estimators of a function of θ, etc.

49 Extensions of the Cramér-Rao Inequality. Below we list some extensions and leave the proofs as exercises.
Exercise 1 (Cramér-Rao Inequality for Unbiased Functional Estimators). Prove that for any unbiased estimator ζ of z(θ), MSE_θ(ζ) ≥ (1/J(θ)) (d/dθ z(θ))².
Exercise 2 (Cramér-Rao Inequality for Biased Estimators). Prove that for any estimator ϕ of the parameter θ, MSE_θ(ϕ) ≥ (1/J(θ)) (1 + d/dθ Bias_θ(ϕ))² + (Bias_θ(ϕ))².
Exercise 3 (Attainment of Cramér-Rao). Show that a necessary and sufficient condition for an unbiased estimator ϕ to attain the Cramér-Rao lower bound is that there exists some function g such that, for all x, g(θ)(ϕ(x) − θ) = ∂/∂θ ln f_θ(x).

50 More on Fisher Information. The Fisher information plays a key role in the Cramér-Rao lower bound. We make some further remarks about it.
1) J(θ) ≜ E_{f_θ}[(s_θ(X))²] = Var_{f_θ}[s_θ(X)], where the score of θ, s_θ(X) ≜ ∂/∂θ ln f_θ(X) = (1/f_θ(X)) ∂/∂θ f_θ(X), is zero-mean.
2) Suppose X_i i.i.d. ∼ f_θ. Then, for the estimation problem with observation X^n, the Fisher information is J_n(θ) = n J(θ), where J(θ) is the Fisher information when the observation is just X ∼ f_θ.
3) For an exponential family {f_θ | θ ∈ Θ}, it can be shown that J(θ) = −E_{f_θ}[∂²/∂θ² ln f_θ(X)], which often makes the computation of J(θ) simpler.

51 (Outline) Next: Estimation — Maximum Likelihood Estimator, Consistency, and Efficiency.

52 Maximum Likelihood Estimator. The maximum likelihood estimator (MLE) is a widely used estimator.
Definition 10 (Maximum Likelihood Estimator). The maximum likelihood estimator (MLE) for estimating θ from a randomly drawn X ∼ P_θ is ϕ_MLE(x) ≜ arg max_{θ∈Θ} P_θ(x). Here P_θ(x) is called the likelihood function.
Exercise 4 (MLE of a Gaussian with Unknown Mean and Variance). Consider X_i i.i.d. ∼ N(µ, σ²) for i = 1, 2, ..., n, where θ ≜ (µ, σ²) denotes the unknown parameter. Let x̄ ≜ (1/n) ∑_{i=1}^n x_i. Show that ϕ_MLE(x^n) = (x̄, (1/n) ∑_{i=1}^n (x_i − x̄)²).
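The closed form claimed in Exercise 4 is easy to probe numerically (the simulated data and function names are illustrative): compute the closed-form candidate and confirm that no nearby perturbation of (µ, σ²) achieves a higher log-likelihood, consistent with it being the maximizer.

```python
import math
import random

def log_likelihood(xs, mu, var):
    """Gaussian log-likelihood: sum_i ln f_{mu,var}(x_i)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

def gaussian_mle(xs):
    """Closed-form MLE (xbar, (1/n) * sum (x_i - xbar)^2) from Exercise 4."""
    n = len(xs)
    xbar = sum(xs) / n
    return xbar, sum((x - xbar) ** 2 for x in xs) / n

rng = random.Random(0)
xs = [rng.gauss(3.0, 2.0) for _ in range(200)]
mu_hat, var_hat = gaussian_mle(xs)

# Sanity check: the closed form beats nearby parameter perturbations.
ll_star = log_likelihood(xs, mu_hat, var_hat)
assert all(ll_star >= log_likelihood(xs, mu_hat + dm, var_hat + dv)
           for dm in (-0.1, 0.0, 0.1) for dv in (-0.1, 0.0, 0.1))
```

Note the MLE of the variance divides by n, not n − 1, so it is slightly biased downward; the bias vanishes as n → ∞, which is consistent with the asymptotic results that follow.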

53 In the following we consider the observation of n i.i.d. samples X_i ∼ P_θ, i = 1, ..., n, and give two ways of evaluating the performance of a sequence of estimators {ϕ_n(x^n) | n ∈ N} as n → ∞.
1) Consistency: the estimator output converges to the true parameter as the sample size n → ∞.
2) Efficiency: the estimator output achieves the Cramér-Rao lower bound on the MSE as the sample size n → ∞.
We will see that the MLE is not only consistent but also asymptotically efficient.

54 Asymptotic Evaluations: Consistency.
Definition 11 (Consistency). A sequence of estimators {ζ_n(x^n) | n ∈ N} is consistent if, for every ε > 0, lim_{n→∞} P_{X_i i.i.d. ∼ P_θ}{|ζ_n(X^n) − z(θ)| < ε} = 1 for all θ ∈ Θ. In other words, ζ_n(X^n) converges in probability to z(θ) for all θ ∈ Θ.
Theorem 6 (MLE is Consistent). For a family of densities {f_θ | θ ∈ Θ}, under some regularity conditions on f_θ(x), the plug-in estimator z(ϕ_MLE(x^n)) is a consistent estimator of z(θ), where z is a continuous function of θ.
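Definition 11 can be watched in action by estimating the probability P{|ζ_n(X^n) − z(θ)| < ε} at increasing sample sizes (the Gaussian-variance MLE, ε, and all numbers are my illustrative choices):

```python
import random

def mle_var(xs):
    """MLE of the Gaussian variance: (1/n) * sum (x_i - xbar)^2."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / n

rng = random.Random(0)
true_var, trials, eps = 4.0, 400, 0.5

def hit_rate(n):
    """Empirical P{ |zeta_n(X^n) - z(theta)| < eps } at sample size n."""
    hits = sum(abs(mle_var([rng.gauss(0.0, 2.0) for _ in range(n)]) - true_var) < eps
               for _ in range(trials))
    return hits / trials

rates = [hit_rate(n) for n in (20, 200, 2000)]
# Consistency: the hit rate climbs toward 1 as n grows.
```

At n = 20 the estimator misses the ε-band often; by n = 2000 it almost never does, which is exactly the convergence-in-probability statement.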

55 Asymptotic Evaluations: Efficiency.
Definition 12 (Efficiency). A sequence of estimators {ζ_n(x^n) | n ∈ N} is asymptotically efficient if √n (ζ_n(X^n) − z(θ)) converges in distribution to N(0, (1/J(θ)) (d/dθ z(θ))²) as n → ∞.
Theorem 7 (MLE is Asymptotically Efficient). For a family of densities {f_θ | θ ∈ Θ}, under some regularity conditions on f_θ(x), the plug-in estimator z(ϕ_MLE(x^n)) is an asymptotically efficient estimator of z(θ), where z is a continuous function of θ.


More information

6.1 Variational representation of f-divergences

6.1 Variational representation of f-divergences ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016

More information

Mathematical statistics

Mathematical statistics October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter

More information

A Few Notes on Fisher Information (WIP)

A Few Notes on Fisher Information (WIP) A Few Notes on Fisher Information (WIP) David Meyer dmm@{-4-5.net,uoregon.edu} Last update: April 30, 208 Definitions There are so many interesting things about Fisher Information and its theoretical properties

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information

Review. December 4 th, Review

Review. December 4 th, Review December 4 th, 2017 Att. Final exam: Course evaluation Friday, 12/14/2018, 10:30am 12:30pm Gore Hall 115 Overview Week 2 Week 4 Week 7 Week 10 Week 12 Chapter 6: Statistics and Sampling Distributions Chapter

More information

ECE 275A Homework 7 Solutions

ECE 275A Homework 7 Solutions ECE 275A Homework 7 Solutions Solutions 1. For the same specification as in Homework Problem 6.11 we want to determine an estimator for θ using the Method of Moments (MOM). In general, the MOM estimator

More information

Lecture 4 Channel Coding

Lecture 4 Channel Coding Capacity and the Weak Converse Lecture 4 Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 15, 2014 1 / 16 I-Hsiang Wang NIT Lecture 4 Capacity

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

DA Freedman Notes on the MLE Fall 2003

DA Freedman Notes on the MLE Fall 2003 DA Freedman Notes on the MLE Fall 2003 The object here is to provide a sketch of the theory of the MLE. Rigorous presentations can be found in the references cited below. Calculus. Let f be a smooth, scalar

More information

Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics. 1 Executive summary

Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics. 1 Executive summary ECE 830 Spring 207 Instructor: R. Willett Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we saw that the likelihood

More information

Detection theory. H 0 : x[n] = w[n]

Detection theory. H 0 : x[n] = w[n] Detection Theory Detection theory A the last topic of the course, we will briefly consider detection theory. The methods are based on estimation theory and attempt to answer questions such as Is a signal

More information

Lecture 12 November 3

Lecture 12 November 3 STATS 300A: Theory of Statistics Fall 2015 Lecture 12 November 3 Lecturer: Lester Mackey Scribe: Jae Hyuck Park, Christian Fong Warning: These notes may contain factual and/or typographic errors. 12.1

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

ST5215: Advanced Statistical Theory

ST5215: Advanced Statistical Theory Department of Statistics & Applied Probability Wednesday, October 5, 2011 Lecture 13: Basic elements and notions in decision theory Basic elements X : a sample from a population P P Decision: an action

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Mathematical statistics

Mathematical statistics October 18 th, 2018 Lecture 16: Midterm review Countdown to mid-term exam: 7 days Week 1 Chapter 1: Probability review Week 2 Week 4 Week 7 Chapter 6: Statistics Chapter 7: Point Estimation Chapter 8:

More information

557: MATHEMATICAL STATISTICS II BIAS AND VARIANCE

557: MATHEMATICAL STATISTICS II BIAS AND VARIANCE 557: MATHEMATICAL STATISTICS II BIAS AND VARIANCE An estimator, T (X), of θ can be evaluated via its statistical properties. Typically, two aspects are considered: Expectation Variance either in terms

More information

Econometrics I, Estimation

Econometrics I, Estimation Econometrics I, Estimation Department of Economics Stanford University September, 2008 Part I Parameter, Estimator, Estimate A parametric is a feature of the population. An estimator is a function of the

More information

Lecture notes on statistical decision theory Econ 2110, fall 2013

Lecture notes on statistical decision theory Econ 2110, fall 2013 Lecture notes on statistical decision theory Econ 2110, fall 2013 Maximilian Kasy March 10, 2014 These lecture notes are roughly based on Robert, C. (2007). The Bayesian choice: from decision-theoretic

More information

Variations. ECE 6540, Lecture 10 Maximum Likelihood Estimation

Variations. ECE 6540, Lecture 10 Maximum Likelihood Estimation Variations ECE 6540, Lecture 10 Last Time BLUE (Best Linear Unbiased Estimator) Formulation Advantages Disadvantages 2 The BLUE A simplification Assume the estimator is a linear system For a single parameter

More information

21.1 Lower bounds on minimax risk for functional estimation

21.1 Lower bounds on minimax risk for functional estimation ECE598: Information-theoretic methods in high-dimensional statistics Spring 016 Lecture 1: Functional estimation & testing Lecturer: Yihong Wu Scribe: Ashok Vardhan, Apr 14, 016 In this chapter, we will

More information

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing STAT 135 Lab 5 Bootstrapping and Hypothesis Testing Rebecca Barter March 2, 2015 The Bootstrap Bootstrap Suppose that we are interested in estimating a parameter θ from some population with members x 1,...,

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Scalable robust hypothesis tests using graphical models

Scalable robust hypothesis tests using graphical models Scalable robust hypothesis tests using graphical models Umamahesh Srinivas ipal Group Meeting October 22, 2010 Binary hypothesis testing problem Random vector x = (x 1,...,x n ) R n generated from either

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Graduate Econometrics I: Unbiased Estimation

Graduate Econometrics I: Unbiased Estimation Graduate Econometrics I: Unbiased Estimation Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Unbiased Estimation

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Statistics Ph.D. Qualifying Exam: Part I October 18, 2003

Statistics Ph.D. Qualifying Exam: Part I October 18, 2003 Statistics Ph.D. Qualifying Exam: Part I October 18, 2003 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. 1 2 3 4 5 6 7 8 9 10 11 12 2. Write your answer

More information

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Lecture 22: Error exponents in hypothesis testing, GLRT

Lecture 22: Error exponents in hypothesis testing, GLRT 10-704: Information Processing and Learning Spring 2012 Lecture 22: Error exponents in hypothesis testing, GLRT Lecturer: Aarti Singh Scribe: Aarti Singh Disclaimer: These notes have not been subjected

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

Suggested solutions to written exam Jan 17, 2012

Suggested solutions to written exam Jan 17, 2012 LINKÖPINGS UNIVERSITET Institutionen för datavetenskap Statistik, ANd 73A36 THEORY OF STATISTICS, 6 CDTS Master s program in Statistics and Data Mining Fall semester Written exam Suggested solutions to

More information

ECE531 Lecture 10b: Maximum Likelihood Estimation

ECE531 Lecture 10b: Maximum Likelihood Estimation ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So

More information

Chapter 3. Point Estimation. 3.1 Introduction

Chapter 3. Point Estimation. 3.1 Introduction Chapter 3 Point Estimation Let (Ω, A, P θ ), P θ P = {P θ θ Θ}be probability space, X 1, X 2,..., X n : (Ω, A) (IR k, B k ) random variables (X, B X ) sample space γ : Θ IR k measurable function, i.e.

More information

Lecture 2: Basic Concepts of Statistical Decision Theory

Lecture 2: Basic Concepts of Statistical Decision Theory EE378A Statistical Signal Processing Lecture 2-03/31/2016 Lecture 2: Basic Concepts of Statistical Decision Theory Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: John Miller and Aran Nayebi In this lecture

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

2 Statistical Estimation: Basic Concepts

2 Statistical Estimation: Basic Concepts Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Lecture 1: Introduction

Lecture 1: Introduction Principles of Statistics Part II - Michaelmas 208 Lecturer: Quentin Berthet Lecture : Introduction This course is concerned with presenting some of the mathematical principles of statistical theory. One

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)?

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)? ECE 830 / CS 76 Spring 06 Instructors: R. Willett & R. Nowak Lecture 3: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n

Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n Jiantao Jiao (Stanford EE) Joint work with: Kartik Venkat Yanjun Han Tsachy Weissman Stanford EE Tsinghua

More information

10. Composite Hypothesis Testing. ECE 830, Spring 2014

10. Composite Hypothesis Testing. ECE 830, Spring 2014 10. Composite Hypothesis Testing ECE 830, Spring 2014 1 / 25 In many real world problems, it is difficult to precisely specify probability distributions. Our models for data may involve unknown parameters

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

10-701/ Machine Learning, Fall

10-701/ Machine Learning, Fall 0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

Hypothesis testing: theory and methods

Hypothesis testing: theory and methods Statistical Methods Warsaw School of Economics November 3, 2017 Statistical hypothesis is the name of any conjecture about unknown parameters of a population distribution. The hypothesis should be verifiable

More information

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes Neyman-Pearson paradigm. Suppose that a researcher is interested in whether the new drug works. The process of determining whether the outcome of the experiment points to yes or no is called hypothesis

More information

A first model of learning

A first model of learning A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) We observe the data where each Suppose we are given an ensemble of possible hypotheses / classifiers

More information

Detection theory 101 ELEC-E5410 Signal Processing for Communications

Detection theory 101 ELEC-E5410 Signal Processing for Communications Detection theory 101 ELEC-E5410 Signal Processing for Communications Binary hypothesis testing Null hypothesis H 0 : e.g. noise only Alternative hypothesis H 1 : signal + noise p(x;h 0 ) γ p(x;h 1 ) Trade-off

More information

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection SG 21006 Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection Ioan Tabus Department of Signal Processing Tampere University of Technology Finland 1 / 28

More information

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources STA 732: Inference Notes 10. Parameter Estimation from a Decision Theoretic Angle Other resources 1 Statistical rules, loss and risk We saw that a major focus of classical statistics is comparing various

More information

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables?

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables? Linear Regression Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2014 1 What about continuous variables? n Billionaire says: If I am measuring a continuous variable, what

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

Institute of Actuaries of India

Institute of Actuaries of India Institute of Actuaries of India Subject CT3 Probability & Mathematical Statistics May 2011 Examinations INDICATIVE SOLUTION Introduction The indicative solution has been written by the Examiners with the

More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

Week 5: Logistic Regression & Neural Networks

Week 5: Logistic Regression & Neural Networks Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and

More information

DETECTION theory deals primarily with techniques for

DETECTION theory deals primarily with techniques for ADVANCED SIGNAL PROCESSING SE Optimum Detection of Deterministic and Random Signals Stefan Tertinek Graz University of Technology turtle@sbox.tugraz.at Abstract This paper introduces various methods for

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

STATISTICS SYLLABUS UNIT I

STATISTICS SYLLABUS UNIT I STATISTICS SYLLABUS UNIT I (Probability Theory) Definition Classical and axiomatic approaches.laws of total and compound probability, conditional probability, Bayes Theorem. Random variable and its distribution

More information

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013 Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you

More information

Generalization Bounds

Generalization Bounds Generalization Bounds Here we consider the problem of learning from binary labels. We assume training data D = x 1, y 1,... x N, y N with y t being one of the two values 1 or 1. We will assume that these

More information

If there exists a threshold k 0 such that. then we can take k = k 0 γ =0 and achieve a test of size α. c 2004 by Mark R. Bell,

If there exists a threshold k 0 such that. then we can take k = k 0 γ =0 and achieve a test of size α. c 2004 by Mark R. Bell, Recall The Neyman-Pearson Lemma Neyman-Pearson Lemma: Let Θ = {θ 0, θ }, and let F θ0 (x) be the cdf of the random vector X under hypothesis and F θ (x) be its cdf under hypothesis. Assume that the cdfs

More information

ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM

ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM c 2007-2016 by Armand M. Makowski 1 ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM 1 The basic setting Throughout, p, q and k are positive integers. The setup With

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

Lecture 5 September 19

Lecture 5 September 19 IFT 6269: Probabilistic Graphical Models Fall 2016 Lecture 5 September 19 Lecturer: Simon Lacoste-Julien Scribe: Sébastien Lachapelle Disclaimer: These notes have only been lightly proofread. 5.1 Statistical

More information

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:

More information

Economics 620, Lecture 9: Asymptotics III: Maximum Likelihood Estimation

Economics 620, Lecture 9: Asymptotics III: Maximum Likelihood Estimation Economics 620, Lecture 9: Asymptotics III: Maximum Likelihood Estimation Nicholas M. Kiefer Cornell University Professor N. M. Kiefer (Cornell University) Lecture 9: Asymptotics III(MLE) 1 / 20 Jensen

More information

19.1 Problem setup: Sparse linear regression

19.1 Problem setup: Sparse linear regression ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 19: Minimax rates for sparse linear regression Lecturer: Yihong Wu Scribe: Subhadeep Paul, April 13/14, 2016 In

More information

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1) HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter

More information

Composite Hypotheses and Generalized Likelihood Ratio Tests

Composite Hypotheses and Generalized Likelihood Ratio Tests Composite Hypotheses and Generalized Likelihood Ratio Tests Rebecca Willett, 06 In many real world problems, it is difficult to precisely specify probability distributions. Our models for data may involve

More information

Maximum Likelihood Estimation. only training data is available to design a classifier

Maximum Likelihood Estimation. only training data is available to design a classifier Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional

More information

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That Statistics Lecture 2 August 7, 2000 Frank Porter Caltech The plan for these lectures: The Fundamentals; Point Estimation Maximum Likelihood, Least Squares and All That What is a Confidence Interval? Interval

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Introduction to Statistical Learning Theory

Introduction to Statistical Learning Theory Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated

More information