Lecture 2: Statistical Decision Theory


EE378A Statistical Signal Processing, Lecture 2 - 04/06/2017

Lecture 2: Statistical Decision Theory

Lecturer: Jiantao Jiao. Scribe: Andrew Hilger

In this lecture, we discuss a unified theoretical framework of statistics proposed by Abraham Wald,^1 which is named statistical decision theory.

1 Goals

1. Evaluation: The theoretical framework should aid fair comparisons between algorithms (e.g., maximum entropy vs. maximum likelihood vs. method of moments).
2. Achievability: The theoretical framework should be able to inspire the construction of statistical algorithms that are (nearly) optimal under the optimality criteria introduced in the framework.

2 Basic Elements of Statistical Decision Theory

1. Statistical experiment: a family of probability measures $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where $\theta$ is a parameter and $P_\theta$ is a probability distribution indexed by the parameter.
2. Data: $X \sim P_\theta$, where $X$ is a random variable observed for some parameter value $\theta$.
3. Objective: $g(\theta)$, e.g., inference on the entropy of the distribution $P_\theta$.
4. Decision rule: $\delta(X)$. The decision rule need not be deterministic; there can be a randomized decision rule with an associated conditional distribution $P_{\delta|X}$.
5. Loss function: $L(\theta, \delta)$. The loss function tells us how bad we feel about our decision once we find out the true value of the parameter $\theta$ chosen by nature.

Example: $P_\theta(x) = \frac{1}{\sqrt{2\pi}} e^{-(x-\theta)^2/2}$, $g(\theta) = \theta$, and $L(\theta, \delta) = (\theta - \delta)^2$. In other words, $X$ is normally distributed with mean $\theta$ and unit variance, $X \sim N(\theta, 1)$, and we are trying to estimate the mean $\theta$. We judge our success (or failure) using mean-square error.

3 Risk Function

Definition 1 (Risk Function).
$$R(\theta, \delta) \triangleq E[L(\theta, \delta(X))] \quad (1)$$
$$= \int L(\theta, \delta(x)) P_\theta(dx) \quad (2)$$
$$= \int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_\theta(dx). \quad (3)$$

^1 See Wald, Abraham. Statistical decision functions. In Breakthroughs in Statistics. Springer New York, 1992.
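To make Definition 1 concrete, here is a minimal simulation sketch (our addition, not part of the scribed notes) that approximates the risk numerically for the Gaussian example above, using the rule $\delta(x) = x$, whose exact risk is $R(\theta, \delta) = 1$ for every $\theta$:

```python
# Monte Carlo approximation of R(theta, delta) = E[(theta - delta(X))^2]
# for X ~ N(theta, 1); a sketch, not from the original notes.
import numpy as np

def monte_carlo_risk(delta, theta, num_trials=100_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=theta, scale=1.0, size=num_trials)
    return np.mean((theta - delta(x)) ** 2)

for theta in [-2.0, 0.0, 3.0]:
    print(theta, monte_carlo_risk(lambda x: x, theta))  # each is close to 1.0
```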

Figure 1: Example risk functions computed over a range of parameter values for two different decision rules.

A risk function evaluates a decision rule's success over a large number of experiments with a fixed parameter value $\theta$. By the law of large numbers, if we observe $X$ many times independently, the average empirical loss of the decision rule $\delta$ will converge to the risk $R(\theta, \delta)$.

Even after determining the risk functions of two decision rules, it may still be unclear which is better. Consider the example of Figure 1. Two different decision rules $\delta_1$ and $\delta_2$ result in two different risk functions $R(\theta, \delta_1)$ and $R(\theta, \delta_2)$ evaluated over different values of the parameter $\theta$. The first decision rule $\delta_1$ is inferior for low and high values of the parameter $\theta$ but is superior for the middle values. Thus, even after computing a risk function $R$, it can still be unclear which decision rule is better. We need new ideas to enable us to compare different decision rules.

4 Optimality Criteria for Decision Rules

Given the risk functions of various decision rules as functions of the parameter $\theta$, there are various approaches to determining which decision rule is optimal.

4.1 Restrict the Competitors

This is a traditional set of methods that were overshadowed by other approaches that we introduce later. A decision rule $\delta$ is eliminated (or, formally, is inadmissible) if there is another decision rule $\delta'$ that is strictly better, i.e., $R(\theta, \delta') \le R(\theta, \delta)$ for every $\theta \in \Theta$, with strict inequality for at least one $\theta_0 \in \Theta$. However, the problem is that many decision rules cannot be eliminated in this way, and we still lack a criterion to determine which one is better. To aid in selection, the rationale of restricting the competitors is that only decision rules that are members of a certain decision rule class $\mathcal{D}$ are considered. The advantage is that sometimes all but one decision rule in $\mathcal{D}$ is inadmissible, and we just use the only admissible one.

1. Example 1: the class of unbiased decision rules $\mathcal{D} = \{\delta : E[\delta(X)] = g(\theta), \forall \theta \in \Theta\}$.
2. Example 2: the class of invariant estimators.

However, a serious drawback of this approach is that $\mathcal{D}$ may be an empty set for various decision-theoretic problems.

4.2 Bayesian: Average Risk Optimality

The idea is to use averaging to reduce the risk function $R(\theta, \delta)$ to a single number for any given $\delta$.

Definition 2 (Average risk under prior $\Lambda(d\theta)$).
$$r(\Lambda, \delta) = \int R(\theta, \delta) \Lambda(d\theta) \quad (4)$$

Here $\Lambda$ is the prior distribution, a probability measure on $\Theta$. The Bayesians and the frequentists disagree about $\Lambda$; namely, the frequentists do not believe in the existence of the prior. However, there exist more justifications of the Bayesian approach than the interpretation of $\Lambda$ as prior belief: indeed, the complete class theorem in statistical decision theory asserts that in various decision-theoretic problems, all admissible decision rules can be approximated by Bayes estimators.^2

Definition 3 (Bayes estimator).
$$\delta_\Lambda = \arg\min_\delta r(\Lambda, \delta) \quad (5)$$

The Bayes estimator $\delta_\Lambda$ can usually be found via the posterior distribution. Note that
$$r(\Lambda, \delta) = \int\int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_\theta(dx) \Lambda(d\theta) \quad (6)$$
$$= \int \left( \int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_{\theta|X}(d\theta \mid x) \right) P_X(dx) \quad (7)$$
where $P_X(dx)$ is the marginal distribution of $X$ and $P_{\theta|X}(d\theta \mid x)$ is the posterior distribution of $\theta$ given $X$. In Equation (6), $P_\theta(dx)\Lambda(d\theta)$ is the joint distribution of $\theta$ and $X$. In Equation (7), we only have to minimize the portion in parentheses to minimize $r(\Lambda, \delta)$, because $P_X(dx)$ does not depend on $\delta$.

Theorem 4.^3 Under mild conditions,
$$\delta_\Lambda(x) = \arg\min_\delta E[L(\theta, \delta) \mid X = x] \quad (8)$$
$$= \arg\min_{P_{\delta|X}} \int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_{\theta|X}(d\theta \mid x) \quad (9)$$

Lemma 5. If $L(\theta, \delta)$ is convex in $\delta$, it suffices to consider deterministic rules $\delta(x)$.

Proof. Jensen's inequality:
$$\int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_{\theta|X}(d\theta \mid x) \ge \int L\left(\theta, \int \delta \, P_{\delta|X}(d\delta \mid x)\right) P_{\theta|X}(d\theta \mid x). \quad (10)$$

Examples.
1. $L(\theta, \delta) = (g(\theta) - \delta)^2 \Rightarrow \delta_\Lambda(x) = E[g(\theta) \mid X = x]$. In other words, the Bayes estimator under squared error loss is the conditional expectation of $g(\theta)$ given $x$.
2. $L(\theta, \delta) = |g(\theta) - \delta| \Rightarrow \delta_\Lambda(x)$ is any median of the posterior distribution $P_{\theta|X=x}$.
3. $L(\theta, \delta) = \mathbb{1}(g(\theta) \ne \delta) \Rightarrow \delta_\Lambda(x) = \arg\max_\theta P_{\theta|X}(\theta \mid x) = \arg\max_\theta P_\theta(x)\Lambda(\theta)$. In other words, the indicator loss function results in the maximum a posteriori (MAP) estimator.^4

^2 See Chapter 3 of Friedrich Liese and Klaus-J. Miescke. Statistical Decision Theory: Estimation, Testing, and Selection. Springer (2009).

^3 See Theorem 1.1, Chapter 4 of Lehmann, E.L. and Casella, G. Theory of Point Estimation. Springer Science & Business Media.

^4 Next week, we will cover special cases of $P_\theta$ and how to compute the Bayes estimator in a computationally efficient way. In the general case, however, computing the posterior distribution may be difficult.
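As a quick illustration of Example 1 (our addition, with an assumed conjugate prior that is not part of the scribed notes), take $\theta \sim N(0, \tau^2)$ and $X \mid \theta \sim N(\theta, 1)$; then the posterior mean has the closed form $E[\theta \mid X = x] = \tau^2 x / (1 + \tau^2)$, and its average risk $\tau^2/(1+\tau^2)$ beats that of $\delta(x) = x$:

```python
# Sketch: Bayes estimator under squared error = posterior mean (Example 1).
# Assumes the conjugate pair theta ~ N(0, tau^2), X | theta ~ N(theta, 1).
import numpy as np

rng = np.random.default_rng(1)
tau2 = 4.0
theta = rng.normal(0.0, np.sqrt(tau2), size=200_000)   # theta ~ Lambda
x = theta + rng.normal(size=theta.shape)               # X | theta ~ N(theta, 1)

delta_bayes = tau2 * x / (1.0 + tau2)                  # posterior mean E[theta | X]
print("average risk of Bayes rule:", np.mean((theta - delta_bayes) ** 2))  # ~0.8
print("average risk of delta(x)=x:", np.mean((theta - x) ** 2))            # ~1.0
```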

4.3 Frequentist: Worst-Case Optimality (Minimax)

Definition 6 (Minimax estimator). The decision rule $\delta^*$ is minimax among all decision rules in $\mathcal{D}$ iff
$$\sup_{\theta \in \Theta} R(\theta, \delta^*) = \inf_{\delta \in \mathcal{D}} \sup_{\theta \in \Theta} R(\theta, \delta). \quad (11)$$

4.3.1 First observation

Since
$$R(\theta, \delta) = \int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_\theta(dx) \quad (12)$$
is linear in $P_{\delta|X}$, it is a convex function of the (randomized) decision rule. The supremum of convex functions is convex, so finding the optimal decision rule is a convex optimization problem. However, solving this convex optimization problem may be computationally intractable. For example, it may not even be computationally tractable to compute the supremum of the risk function $R(\theta, \delta)$ over $\theta \in \Theta$. Hence, finding the exact minimax estimator is usually hard.

4.3.2 Second observation

Due to the difficulty of finding the exact minimax estimator, we turn to another goal: we wish to find an estimator $\delta^*$ such that
$$\inf_\delta \sup_\theta R(\theta, \delta) \le \sup_\theta R(\theta, \delta^*) \le c \cdot \inf_\delta \sup_\theta R(\theta, \delta) \quad (13)$$
where $c > 1$ is a constant. The left inequality is trivially true. For the right inequality, in practice one can usually choose some specific $\delta^*$ and evaluate an upper bound on $\sup_\theta R(\theta, \delta^*)$ explicitly. However, it remains to find a lower bound on $\inf_\delta \sup_\theta R(\theta, \delta)$. To solve the problem (and save the world), we can use the minimax theorem.

Theorem 7 (Minimax Theorem (Sion-Kakutani)). Let $\Lambda, X$ be two compact, convex sets in some topological vector spaces. Let $H(\lambda, x) : \Lambda \times X \to \mathbb{R}$ be a continuous function such that:
1. $H(\lambda, \cdot)$ is convex for any fixed $\lambda \in \Lambda$;
2. $H(\cdot, x)$ is concave for any fixed $x \in X$.
Then:
1. Strong duality: $\max_\lambda \min_x H(\lambda, x) = \min_x \max_\lambda H(\lambda, x)$;
2. Existence of a saddle point $(\lambda^*, x^*)$: $H(\lambda, x^*) \le H(\lambda^*, x^*) \le H(\lambda^*, x)$ for all $\lambda \in \Lambda$, $x \in X$.

The existence of a saddle point implies strong duality. We note that, other than strong duality, the following weak duality is always true without assumptions on $H$:
$$\sup_\lambda \inf_x H(\lambda, x) \le \inf_x \sup_\lambda H(\lambda, x). \quad (14)$$

We define the quantity $r_\Lambda \triangleq \inf_\delta r(\Lambda, \delta)$ as the Bayes risk under prior distribution $\Lambda$. We have the following line of argument using weak duality:
$$\inf_\delta \sup_\theta R(\theta, \delta) = \inf_\delta \sup_\Lambda r(\Lambda, \delta) \quad (15)$$
$$\ge \sup_\Lambda \inf_\delta r(\Lambda, \delta) \quad (16)$$
$$= \sup_\Lambda r_\Lambda \quad (17)$$
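To make the chain (15)-(17) concrete, here is a standard worked example (our addition, not in the scribed notes) in the Gaussian setting of Section 2:

```latex
% Worked example: the minimax risk for Gaussian mean estimation.
Let $X \sim N(\theta, 1)$, $L(\theta,\delta) = (\theta-\delta)^2$, and take the
prior $\Lambda = N(0, \tau^2)$. The Bayes estimator is the posterior mean, with
Bayes risk
\[
  r_\Lambda = \inf_\delta r(\Lambda, \delta) = \frac{\tau^2}{1 + \tau^2}.
\]
By (17), $\inf_\delta \sup_\theta R(\theta, \delta) \ge \tau^2/(1+\tau^2)$ for
every $\tau^2 > 0$; letting $\tau^2 \to \infty$ shows that the minimax risk is
at least $1$. Since $\delta(x) = x$ achieves $R(\theta, \delta) \equiv 1$, it
is minimax.
```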

Equation (17) gives us a strong tool for lower bounding the minimax risk: for any prior distribution $\Lambda$, the Bayes risk under $\Lambda$ is a lower bound on the corresponding minimax risk. When the condition of the minimax theorem is satisfied (which may be expected due to the bilinearity of $r(\Lambda, \delta)$ in the pair $(\Lambda, \delta)$), equation (16) achieves equality, which shows that there exists a sequence of priors such that the corresponding Bayes risk sequence converges to the minimax risk. In practice, it suffices to choose some appropriate prior distribution $\Lambda$ in order to obtain a (nearly) minimax estimator.

5 Decision-Theoretic Interpretation of Maximum Entropy

Definition 8 (Logarithmic Loss). For a distribution $P$ over $\mathcal{X}$ and $x \in \mathcal{X}$,
$$L(x, P) \triangleq \log \frac{1}{P(x)}. \quad (18)$$

Note that $P(x)$ is the discrete mass on $x$ in the discrete case, and is the density at $x$ in the continuous case. By definition, if $P(x)$ is small, then the loss is large.

Now we see how the maximum likelihood estimator reduces to so-called maximum entropy under the logarithmic loss function. Specifically, suppose we have functions $r_1(x), \ldots, r_k(x)$ and compute the following empirical averages:
$$\beta_i \triangleq \frac{1}{n} \sum_{j=1}^n r_i(x_j) = E_{P_n}[r_i(X)], \quad i = 1, \ldots, k \quad (19)$$
where $P_n$ denotes the empirical distribution based on $n$ samples $x_1, \ldots, x_n$. By the law of large numbers, it is expected that $\beta_i$ is close to the true expectation under $P$, so we may restrict the true distribution $P$ to the following set:
$$\mathcal{P} = \{P : E_P[r_i(X)] = \beta_i, \ 1 \le i \le k\}. \quad (20)$$
Also, denote by $\mathcal{Q}$ the set of all probability distributions which belong to the exponential family with central statistics $r_1(x), \ldots, r_k(x)$, i.e.,
$$\mathcal{Q} = \left\{Q : Q(x) = \frac{1}{Z(\lambda)} e^{\sum_{i=1}^k \lambda_i r_i(x)}, \ x \in \mathcal{X}\right\} \quad (21)$$
where $Z(\lambda)$ is the normalization factor.

Now we consider two estimators of $P$. The first one is the maximum likelihood estimator over $\mathcal{Q}$, i.e.,
$$\hat{P}_{ML} = \arg\max_{P \in \mathcal{Q}} \frac{1}{n} \sum_{j=1}^n \log P(x_j). \quad (22)$$
The second one is the maximum entropy estimator over $\mathcal{P}$, i.e.,
$$\hat{P}_{ME} = \arg\max_{P \in \mathcal{P}} H(P) \quad (23)$$
where $H(P)$ is the entropy of $P$, which is $H(P) = -\sum_{x \in \mathcal{X}} P(x) \log P(x)$ in the discrete case and $H(P) = -\int_{\mathcal{X}} P(x) \log P(x) \, dx$ in the continuous case.

The key result is as follows:^5

Theorem 9. In the above setting, we have $\hat{P}_{ML} = \hat{P}_{ME}$.

^5 Berger, Adam L., Vincent J. Della Pietra, and Stephen A. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics 22.1 (1996).

Proof. We begin with the following lemma, which is the key motivation for the use of the cross-entropy loss in machine learning.

Lemma 10. The true distribution minimizes the expected logarithmic loss:
$$P = \arg\min_Q E_P[L(x, Q)] \quad (24)$$
and the corresponding minimum $E_P[L(x, P)]$ is the entropy of $P$.

Proof. The assertion follows from the non-negativity of the KL divergence: $D(P \| Q) \ge 0$, where $D(P \| Q) = E_P\left[\ln \frac{dP}{dQ}\right]$.

Equipped with this lemma, and noting that $E_P\left[\log \frac{1}{Q(x)}\right]$ is linear in $P$ and convex in $Q$, by the minimax theorem we have
$$\max_{P \in \mathcal{P}} H(P) = \max_{P \in \mathcal{P}} \min_Q E_P\left[\log \frac{1}{Q(x)}\right] = \min_Q \max_{P \in \mathcal{P}} E_P\left[\log \frac{1}{Q(x)}\right]. \quad (25)$$
In particular, we know that $(\hat{P}_{ME}, \hat{P}_{ME})$ is a saddle point in $\mathcal{P} \times \mathcal{M}$, where $\mathcal{M}$ denotes the set of all probability measures on $\mathcal{X}$.

We first take a look at $\hat{P}_{ME}$, which maximizes the LHS. By variational methods,^6 it is easy to see that the entropy-maximizing distribution $\hat{P}_{ME}$ belongs to the exponential family $\mathcal{Q}$, i.e.,
$$\hat{P}_{ME}(x) = \frac{1}{Z(\lambda)} e^{\sum_{i=1}^k \lambda_i r_i(x)} \quad (26)$$
for some parameters $\lambda_1, \ldots, \lambda_k$. Now, by the definition of saddle points, for any $Q \in \mathcal{M}$ we have
$$E_{\hat{P}_{ME}}\left[\log \frac{1}{\hat{P}_{ME}(x)}\right] \le E_{\hat{P}_{ME}}\left[\log \frac{1}{Q(x)}\right] \quad (27)$$
while for any $Q \in \mathcal{Q}$ with possibly different coefficients $\lambda'$,
$$E_{\hat{P}_{ME}}\left[\log \frac{1}{Q(x)}\right] = E_{\hat{P}_{ME}}\left[\log Z(\lambda') - \sum_i \lambda'_i r_i(x)\right] \quad \text{[since } Q \in \mathcal{Q}\text{]} \quad (28)$$
$$= \log Z(\lambda') - \sum_i \lambda'_i \beta_i \quad \text{[since } \hat{P}_{ME} \in \mathcal{P}\text{]} \quad (29)$$
$$= E_{P_n}\left[\log Z(\lambda') - \sum_i \lambda'_i r_i(x)\right] \quad \text{[since } P_n \in \mathcal{P}\text{]} \quad (30)$$
$$= E_{P_n}\left[\log \frac{1}{Q(x)}\right] \quad (31)$$
$$= \frac{1}{n} \sum_{j=1}^n \log \frac{1}{Q(x_j)}. \quad (32)$$
As a result, the saddle point condition reduces to
$$\frac{1}{n} \sum_{j=1}^n \log \frac{1}{\hat{P}_{ME}(x_j)} \le \frac{1}{n} \sum_{j=1}^n \log \frac{1}{Q(x_j)}, \quad \forall Q \in \mathcal{Q} \quad (33)$$
establishing the fact that $\hat{P}_{ME} = \hat{P}_{ML}$.

^6 See Chapter 12 of Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons.
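The following small numerical sketch (our addition; the alphabet, statistic, and data are made up for illustration) exhibits Theorem 9 on a finite alphabet: the entropy-maximizing distribution subject to $E_P[r(X)] = \beta$ is the member of the exponential family $Q_\lambda(x) \propto e^{\lambda r(x)}$ whose moment matches $\beta$:

```python
# Sketch of Theorem 9 with one constraint function r(x) = x on {0,...,4}.
import numpy as np

alphabet = np.arange(5)
r = alphabet.astype(float)
samples = np.array([0, 1, 1, 2, 4, 4, 4, 3])   # hypothetical data
beta = r[samples].mean()                       # empirical moment E_{P_n}[r(X)]

def moment(lam):
    q = np.exp(lam * r)
    q /= q.sum()
    return q @ r

# moment(lam) is increasing in lam, so solve moment(lam) = beta by bisection.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if moment(mid) < beta else (lo, mid)
q = np.exp(0.5 * (lo + hi) * r)
q /= q.sum()
print("fitted moment:", q @ r, "| target beta:", beta)
print("max-entropy / ML distribution:", q)
```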

EE378A Statistical Signal Processing, Lecture 3 - 04/11/2017

Lecture 3: State Estimation in Hidden Markov Processes

Lecturer: Tsachy Weissman. Scribes: Alexandros Anemogiannis, Vighnesh Sachidananda

(Reading: Ephraim & Merhav 2002; lectures 9, 10 of the 2016 notes.)

In this lecture, we define Hidden Markov Processes (HMPs) as well as the forward-backward recursion method that efficiently estimates the posteriors in HMPs.

1 Markov Chain

We begin by introducing the Markov triplet, which is useful in the design and analysis of algorithms for state estimation in Hidden Markov Processes. Let $X, Y, Z$ denote discrete random variables. We say that $X$, $Y$, and $Z$ form a Markov triplet (which we denote as $X - Y - Z$) given the following relationships:
$$X - Y - Z \iff p(x, z \mid y) = p(x \mid y) \, p(z \mid y) \quad (1)$$
$$\iff p(z \mid x, y) = p(z \mid y) \quad (2)$$

If $X, Y, Z$ form a Markov triplet, their joint distribution enjoys the important property that it can be factored into the product of two functions $\phi_1$ and $\phi_2$.

Lemma 1. $X - Y - Z \iff \exists \, \phi_1, \phi_2$ s.t. $p(x, y, z) = \phi_1(x, y) \phi_2(y, z)$.

Proof. (Necessity) Assume that $X - Y - Z$. Then by the definition of the Markov chain, we have
$$p(x, y, z) = p(x \mid y, z) \, p(y, z) = p(x \mid y) \, p(y, z).$$
Taking $\phi_1(x, y) = p(x \mid y)$ and $\phi_2(y, z) = p(y, z)$ proves this direction.

(Sufficiency) Now suppose the functions $\phi_1, \phi_2$ exist as mentioned in the statement. Then we have:
$$p(x \mid y, z) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x, y, z)}{\sum_{\tilde{x}} p(\tilde{x}, y, z)} = \frac{\phi_1(x, y) \phi_2(y, z)}{\phi_2(y, z) \sum_{\tilde{x}} \phi_1(\tilde{x}, y)} = \frac{\phi_1(x, y)}{\sum_{\tilde{x}} \phi_1(\tilde{x}, y)}.$$
The final expression shows that $p(x \mid y, z)$ is not a function of $z$; hence $X - Y - Z$ follows by definition.

2 Hidden Markov Processes

Based on our understanding of the Markov triplet, we define the Hidden Markov Process (HMP) as follows:

Definition 2 (Hidden Markov Process). The process $\{(X_n, Y_n)\}_{n \ge 1}$ is a Hidden Markov Process if

1. $\{X_n\}_{n \ge 1}$ is a Markov process, i.e.,
$$\forall n: \quad p(x^n) = \prod_{t=1}^n p(x_t \mid x_{t-1}),$$
where by convention we take $X_0 \equiv 0$ and hence $p(x_1 \mid x_0) = p(x_1)$.

2. $\{Y_n\}_{n \ge 1}$ is determined by a memoryless channel, i.e.,
$$\forall n: \quad p(y^n \mid x^n) = \prod_{t=1}^n p(y_t \mid x_t).$$

The processes $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ are called the state process and the noisy observations, respectively.

In view of the above definition, we can derive the joint distribution of the state and the observation as follows:
$$p(x^n, y^n) = p(x^n) \, p(y^n \mid x^n) = \prod_{t=1}^n p(x_t \mid x_{t-1}) \, p(y_t \mid x_t).$$

The main problem in Hidden Markov Models is to compute the posterior probability of the state at any time, given all the observations up to that time, i.e., $p(x_t \mid y^t)$. The naive approach is to simply use the definition of conditional probability:
$$p(x_t \mid y^t) = \frac{p(x_t, y^t)}{p(y^t)} = \frac{\sum_{x^{t-1}} p(y^t \mid x^{t-1}, x_t) \, p(x^{t-1}, x_t)}{\sum_{\tilde{x}_t} \sum_{x^{t-1}} p(y^t \mid x^{t-1}, \tilde{x}_t) \, p(x^{t-1}, \tilde{x}_t)}.$$
The above approach needs exponentially many computations for both the numerator and the denominator. To avoid this problem, we develop an efficient method for computing the posterior probabilities using forward recursion. Before getting to the algorithm, we establish some conditional independence relations.

3 Conditional Independence Relations

We introduce some conditional independence relations which help us find more efficient ways of calculating the posterior.

(a) $(X^{t-1}, Y^t) - X_t - (X_{t+1}^n, Y_{t+1}^n)$:

Proof. We have
$$p(x^n, y^n) = \prod_{i=1}^n p(x_i \mid x_{i-1}) \, p(y_i \mid x_i) = \underbrace{\left[\prod_{i=1}^t p(x_i \mid x_{i-1}) \, p(y_i \mid x_i)\right]}_{\phi_1(x_t, \, (x^{t-1}, y^t))} \underbrace{\left[\prod_{i=t+1}^n p(x_i \mid x_{i-1}) \, p(y_i \mid x_i)\right]}_{\phi_2(x_t, \, (x_{t+1}^n, y_{t+1}^n))}$$
The statement then follows from Lemma 1.

(b) $(X^{t-1}, Y^{t-1}) - X_t - (X_{t+1}^n, Y_t^n)$:

Proof. Note that if $\{(X_i, Y_i)\}_{i=1}^n$ is a Hidden Markov Process, then the time-reversed process $\{(X_{n-i+1}, Y_{n-i+1})\}_{i=1}^n$ is also a Hidden Markov Process. Applying (a) to this reversed process proves (b).

(c) $X_t - (X_{t+1}, Y^t) - (X_{t+2}^n, Y_{t+1}^n)$:

Proof. Left as an exercise.

4 Causal Inference via Forward Recursion

We now derive the forward recursion algorithm as an efficient method to sequentially compute the causal posterior probabilities. Note that we have
$$p(x_t \mid y^t) = \frac{p(x_t, y^t)}{\sum_{\tilde{x}_t} p(\tilde{x}_t, y^t)} = \frac{p(x_t, y^{t-1}) \, p(y_t \mid x_t, y^{t-1})}{\sum_{\tilde{x}_t} p(\tilde{x}_t, y^{t-1}) \, p(y_t \mid \tilde{x}_t, y^{t-1})} = \frac{p(y^{t-1}) \, p(x_t \mid y^{t-1}) \, p(y_t \mid x_t)}{p(y^{t-1}) \sum_{\tilde{x}_t} p(\tilde{x}_t \mid y^{t-1}) \, p(y_t \mid \tilde{x}_t)} = \frac{p(x_t \mid y^{t-1}) \, p(y_t \mid x_t)}{\sum_{\tilde{x}_t} p(\tilde{x}_t \mid y^{t-1}) \, p(y_t \mid \tilde{x}_t)} \quad \text{(by (b))}$$

Define $\alpha_t(x_t) = p(x_t \mid y^t)$ and $\beta_t(x_t) = p(x_t \mid y^{t-1})$; then the above computation can be summarized as
$$\alpha_t(x_t) = \frac{\beta_t(x_t) \, p(y_t \mid x_t)}{\sum_{\tilde{x}_t} \beta_t(\tilde{x}_t) \, p(y_t \mid \tilde{x}_t)}. \quad (3)$$

Similarly, we can write $\beta$ as
$$\beta_{t+1}(x_{t+1}) = p(x_{t+1} \mid y^t) = \sum_{x_t} p(x_{t+1}, x_t \mid y^t) = \sum_{x_t} p(x_t \mid y^t) \, p(x_{t+1} \mid y^t, x_t) = \sum_{x_t} p(x_t \mid y^t) \, p(x_{t+1} \mid x_t) \quad \text{(by (a))}$$
which can be summarized as
$$\beta_{t+1}(x_{t+1}) = \sum_{x_t} \alpha_t(x_t) \, p(x_{t+1} \mid x_t). \quad (4)$$

Equations (3) and (4) indicate that $\alpha_t$ and $\beta_t$ can be sequentially computed from each other for $t = 1, \ldots, n$ with the initialization $\beta_1(x_1) = p(x_1)$. Hence, with this simple algorithm, causal inference can be done efficiently in terms of both computation and memory. This is called the forward recursion.

5 Non-causal Inference via Backward Recursion

Now suppose that we are interested in finding $p(x_t \mid y^n)$ for $t = 1, 2, \ldots, n$ in a non-causal manner. For this purpose, we can derive a backward recursion algorithm as follows:
$$p(x_t \mid y^n) = \sum_{x_{t+1}} p(x_t, x_{t+1} \mid y^n) = \sum_{x_{t+1}} p(x_{t+1} \mid y^n) \, p(x_t \mid x_{t+1}, y^t) \quad \text{(by (c))}$$
$$= \sum_{x_{t+1}} p(x_{t+1} \mid y^n) \, \frac{p(x_t \mid y^t) \, p(x_{t+1} \mid x_t)}{p(x_{t+1} \mid y^t)}. \quad \text{(by (a))}$$

Now let $\gamma_t(x_t) = p(x_t \mid y^n)$. Then the above equation can be summarized as
$$\gamma_t(x_t) = \sum_{x_{t+1}} \gamma_{t+1}(x_{t+1}) \, \frac{\alpha_t(x_t) \, p(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})}. \quad (5)$$

Equation (5) indicates that $\gamma_t$ can be recursively computed from the $\alpha_t$'s and $\beta_t$'s for $t = n-1, n-2, \ldots, 1$, with the initialization $\gamma_n(x_n) = \alpha_n(x_n)$. This is called the backward recursion.
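As a sanity check on the recursions (3)-(5), here is a self-contained numpy sketch (our addition; the toy chain and channel are made-up parameters) that runs the full forward-backward pass for a finite-alphabet HMP:

```python
# Forward-backward recursions (3)-(5) for a finite-alphabet HMP.
# trans[i, j] = p(x_{t+1} = j | x_t = i); emit[i, k] = p(y_t = k | x_t = i).
import numpy as np

def forward_backward(pi1, trans, emit, y):
    n, S = len(y), len(pi1)
    alpha = np.zeros((n, S))   # alpha[t] ~ p(x_{t+1} | y^{t+1}), 0-based t
    beta = np.zeros((n, S))    # beta[t]  ~ p(x_{t+1} | y^t)
    beta[0] = pi1                                  # beta_1(x_1) = p(x_1)
    for t in range(n):
        unnorm = beta[t] * emit[:, y[t]]           # measurement update (3)
        alpha[t] = unnorm / unnorm.sum()
        if t + 1 < n:
            beta[t + 1] = alpha[t] @ trans         # time update (4)
    gamma = np.zeros((n, S))
    gamma[n - 1] = alpha[n - 1]                    # gamma_n = alpha_n
    for t in range(n - 2, -1, -1):                 # backward recursion (5)
        gamma[t] = (trans * alpha[t][:, None] / beta[t + 1][None, :]) @ gamma[t + 1]
    return alpha, beta, gamma

# Toy binary state chain observed through a binary asymmetric channel:
pi1 = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
emit = np.array([[0.8, 0.2], [0.3, 0.7]])
alpha, beta, gamma = forward_backward(pi1, trans, emit, y=[0, 1, 1, 0])
print(gamma)   # row t is the posterior p(x_{t+1} = . | y^n)
```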

EE378A Statistical Signal Processing, Lecture 4 - 04/13/2017

Lecture 4: State Estimation in Hidden Markov Models (cont.)

Lecturer: Tsachy Weissman. Scribe: David Wugofski

In this lecture we build on our previous analysis of state estimation for Hidden Markov Models to create an algorithm for estimating the posterior distribution of the states given the observations.

1 Notation

A quick summary of the notation:

1. Random variables (objects), used more loosely: $X, Y, U, V$
2. Alphabets: $\mathcal{X}, \mathcal{Y}, \mathcal{U}, \mathcal{V}$
3. Specific realizations of random variables: $x, y, u, v$
4. Vector notation: for a vector $(x_1, \ldots, x_n)$, we denote the vector of values from the $i$-th component to the $j$-th component (inclusive) as $x_i^j$, with the convention $x^i \triangleq x_1^i$.
5. Markov chains: we use the standard Markov chain notation $X - Y - Z$ to denote that $X, Y, Z$ form a Markov triplet.

For a discrete random variable (object), we write $p(x)$ for the pmf $P(X = x)$, and similarly $p(x, y)$ for $p_{X,Y}(x, y)$ and $p(y \mid x)$ for $p_{Y|X}(y \mid x)$, etc.

2 Recap of HMP

For a set of unknown states $\{X_n\}_{n \ge 1}$ of a Markov process and noisy observations $\{Y_n\}_{n \ge 1}$ of those states taken through some memoryless channel, we can express the joint distribution of the states and outputs as
$$p(x^n, y^n) = \prod_{t=1}^n p(x_t \mid x_{t-1}) \prod_{t=1}^n p(y_t \mid x_t) \quad (1)$$
where $x_0$ is taken to be some fixed arbitrary value (e.g., 0). Additionally, we have the following principles for this setting:
$$(X^{t-1}, Y^t) - X_t - (X_{t+1}^n, Y_{t+1}^n) \quad \text{(a)}$$
$$(X^{t-1}, Y^{t-1}) - X_t - (X_{t+1}^n, Y_t^n) \quad \text{(b)}$$
$$X_t - (X_{t+1}, Y^t) - (X_{t+2}^n, Y_{t+1}^n) \quad \text{(c)}$$

In the last lecture, through the forward-backward recursion, we derived the following updates for computing various posteriors:
$$\alpha_t(x_t) = p_{X_t|Y^t}(x_t \mid y^t) = \frac{\beta_t(x_t) \, p_{Y_t|X_t}(y_t \mid x_t)}{\sum_{\tilde{x}_t} \beta_t(\tilde{x}_t) \, p_{Y_t|X_t}(y_t \mid \tilde{x}_t)} \quad \text{(Measurement Update)}$$
$$\beta_{t+1}(x_{t+1}) = p_{X_{t+1}|Y^t}(x_{t+1} \mid y^t) = \sum_{x_t} \alpha_t(x_t) \, p_{X_{t+1}|X_t}(x_{t+1} \mid x_t) \quad \text{(Time Update)}$$
$$\gamma_t(x_t) = p_{X_t|Y^n}(x_t \mid y^n) = \sum_{x_{t+1}} \gamma_{t+1}(x_{t+1}) \, \frac{\alpha_t(x_t) \, p_{X_{t+1}|X_t}(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})} \quad \text{(Backward Recursion)}$$

3 Factor Graphs

Let us derive a graphical representation of the joint distribution expressed in (1) as follows:

1. Create a node on the graph for each random variable: $X_1, \ldots, X_n, Y_1, \ldots, Y_n$.
2. For each pair of nodes, create an edge if there is some factor in (1) which contains both variables.

This form of graph is called a factor graph.

Example: Figure 1 shows the factor graph for the hidden Markov process in our setting.

Figure 1: The factor graph for our hidden Markov process

For any partitioning of the graph into three node sets $S_1$, $S_2$, and $S_3$ such that any path from $S_1$ to $S_3$ passes through some node in $S_2$, these sets form a Markov triplet $S_1 - S_2 - S_3$ (to see this, simply split the joint pmf into factors). Hence, for any $s_i \in S_i$, $i = 1, 2, 3$, we have
$$p(s_1, s_3 \mid s_2) = \phi_1(s_1, s_2) \, \phi_2(s_2, s_3) \quad (2)$$
for some functions $\phi_1, \phi_2$ which consist of the factors used to form the factor graph.

Using the factor graph, it is quite easy to establish principles (a)-(c) via graphs. For example, principles (a) and (b) can be proved via Figures 2 and 3 shown below.

Figure 2: The factor graph partition for (a)

Figure 3: The factor graph partition for (b)

We can also construct an additional principle from the factor graph partitioning presented in Figure 4:
$$X^{t-1} - (X_t, Y^n) - X_{t+1}^n \quad \text{(d)}$$

Figure 4: The graphical proof of (d)

4 General Forward-Backward Recursion: Module Functions

Let us consider a generic discrete random variable $U \sim p_U$ passed through a memoryless channel characterized by $p_{V|U}$ to get a discrete random variable $V$. By Bayes' rule we know
$$p_{U|V}(u \mid v) = \frac{p_U(u) \, p_{V|U}(v \mid u)}{\sum_{\tilde{u}} p_U(\tilde{u}) \, p_{V|U}(v \mid \tilde{u})}. \quad (3)$$

If we create the probability vector of $p_{U|V}$ for each value of $U$, we can define a vector function $F$ s.t.
$$\{p_{U|V}(u \mid v)\}_u = \left\{\frac{p_U(u) \, p_{V|U}(v \mid u)}{\sum_{\tilde{u}} p_U(\tilde{u}) \, p_{V|U}(v \mid \tilde{u})}\right\}_u \triangleq F(p_U, p_{V|U}, v). \quad (4)$$
Additionally, overloading the notation of $F$, we can define the matrix function $F$ as
$$F(p_U, p_{V|U}) = \{F(p_U, p_{V|U}, v)\}_v. \quad (5)$$
This function $F$ acts as a module function for calculating the posterior distribution of a random variable given a prior distribution and a channel distribution.

In addition to $F$, we can write a module function for calculating the marginal distribution of the output given a prior distribution on the input and a channel distribution. We label this vector function $G$ and define it as follows:
$$p_V = \{p_V(v)\}_v = \left\{\sum_{\tilde{u}} p_U(\tilde{u}) \, p_{V|U}(v \mid \tilde{u})\right\}_v \triangleq G(p_U, p_{V|U}). \quad (6)$$

With these module functions in mind, we can express our recursions as follows:
$$\alpha_t = F(\beta_t, p_{Y_t|X_t}, y_t) \quad \text{(Measurement Update)}$$
$$\beta_{t+1} = G(\alpha_t, p_{X_{t+1}|X_t}) \quad \text{(Time Update)}$$
$$\gamma_t = G(\gamma_{t+1}, F(\alpha_t, p_{X_{t+1}|X_t})) \quad \text{(Backward Recursion)}$$
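The module functions are only a few lines in code. A compact sketch (our addition; the row-stochastic matrix convention `ch[u, v] = p(v | u)` is our choice, not the notes'):

```python
# Module functions F and G of (4)-(6), with channels as row-stochastic matrices.
import numpy as np

def F(p_u, ch, v):
    """Posterior p(. | v) via Bayes' rule (3), for a single observed v."""
    unnorm = p_u * ch[:, v]
    return unnorm / unnorm.sum()

def F_matrix(p_u, ch):
    """Overloaded F of (5): column v holds the posterior given V = v."""
    return np.stack([F(p_u, ch, v) for v in range(ch.shape[1])], axis=1)

def G(p_u, ch):
    """Marginal of the channel output, as in (6)."""
    return p_u @ ch

# The recursions then read exactly as above; for one time step:
#   alpha_t    = F(beta_t, emit, y_t)                        (measurement update)
#   beta_{t+1} = G(alpha_t, trans)                           (time update)
#   gamma_t    = G(gamma_{t+1}, F_matrix(alpha_t, trans).T)  (backward recursion)
# since F_matrix(alpha_t, trans)[x_t, x_{t+1}] = p(x_t | x_{t+1}, y^t), whose
# transpose is the reverse channel from X_{t+1} back to X_t.
```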

5 State Estimation: the Viterbi Algorithm

Now we are ready to derive the joint posterior distribution $p(x^n \mid y^n)$:
$$p(x^n \mid y^n) = p(x_n \mid y^n) \prod_{t=1}^{n-1} p(x_t \mid x_{t+1}^n, y^n)$$
$$= p(x_n \mid y^n) \prod_{t=1}^{n-1} p(x_t \mid x_{t+1}, y^n) \quad \text{(by principle (d))}$$
$$= p(x_n \mid y^n) \prod_{t=1}^{n-1} p(x_t \mid x_{t+1}, y^t) \quad \text{(by principle (c))}$$
$$= p(x_n \mid y^n) \prod_{t=1}^{n-1} \frac{p(x_t \mid y^t) \, p(x_{t+1} \mid x_t, y^t)}{p(x_{t+1} \mid y^t)} \quad \text{(by principle (a))}$$
$$= \gamma_n(x_n) \prod_{t=1}^{n-1} \frac{\alpha_t(x_t) \, p(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})}$$

Writing
$$\ln p(x^n \mid y^n) = \sum_{t=1}^n g_t(x_t, x_{t+1}) \quad \text{with} \quad g_t(x_t, x_{t+1}) \triangleq \begin{cases} \ln \frac{\alpha_t(x_t) \, p(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})} & t = 1, \ldots, n-1 \\ \ln \gamma_n(x_n) & t = n \end{cases}$$
we can solve for the MAP estimator $\hat{x}^{MAP}$ with the help of the following definition:

Definition 1. For $1 \le k \le n$, let
$$M_k(x_k) := \max_{x_{k+1}^n} \sum_{t=k}^n g_t(x_t, x_{t+1}).$$

It is straightforward from the definition that $M_1(\hat{x}_1^{MAP}) = \ln p(\hat{x}^{MAP} \mid y^n) = \max_{x^n} \ln p(x^n \mid y^n)$ and
$$M_k(x_k) = \max_{x_{k+1}} \max_{x_{k+2}^n} \left( g_k(x_k, x_{k+1}) + \sum_{t=k+1}^n g_t(x_t, x_{t+1}) \right) = \max_{x_{k+1}} \left( g_k(x_k, x_{k+1}) + \max_{x_{k+2}^n} \sum_{t=k+1}^n g_t(x_t, x_{t+1}) \right) = \max_{x_{k+1}} \left( g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}) \right).$$

Since $M_n(x_n) = g_n(x_n, x_{n+1}) = \ln \gamma_n(x_n)$ depends only on the single term $x_n$, we may start from $k = n$ and use the recursive formula above to obtain $M_1(x_1)$. This is called the Viterbi algorithm.

The Bellman equations, referred to in the communication setting as the Viterbi algorithm, are an application of a technique called dynamic programming. Using the recursive equations derived above, we can express the Viterbi algorithm as follows.

 1: function Viterbi
 2:   $M_n(x_n) \leftarrow \ln \gamma_n(x_n)$  // initialization of the log-likelihood
 3:   $\hat{x}^{MAP}(x_n) \leftarrow (x_n)$  // initialization of the MAP estimator
 4:   for $k = n-1, \ldots, 1$ do
 5:     $M_k(x_k) \leftarrow \max_{x_{k+1}} (g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}))$  // maximum of log-likelihood
 6:     $\hat{x}_{k+1}(x_k) \leftarrow \arg\max_{x_{k+1}} (g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}))$
 7:     $\hat{x}^{MAP}(x_k) \leftarrow [\hat{x}_{k+1}(x_k), \hat{x}^{MAP}(\hat{x}_{k+1}(x_k))]$  // maximizing sequence with leading term $x_k$
 8:   end for
 9:   $M \leftarrow \max_{x_1} M_1(x_1)$  // maximum of the overall log-likelihood
10:   $\hat{x}_1 \leftarrow \arg\max_{x_1} M_1(x_1)$
11:   $\hat{x}^{MAP} \leftarrow [\hat{x}_1, \hat{x}^{MAP}(\hat{x}_1)]$  // overall maximizing sequence
12: end function
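For reference, here is a runnable counterpart (our addition, a sketch rather than the lecture's canonical implementation) that consumes $\alpha_t$, $\beta_t$, and $\gamma_n$ from a forward pass such as the one in Lecture 3:

```python
# Viterbi / Bellman recursion over g_t(x_t, x_{t+1}), entirely in log space.
# log_trans[i, j] = ln p(x_{t+1} = j | x_t = i); alpha, beta as in Lecture 3.
import numpy as np

def viterbi_map(alpha, beta, gamma_n, log_trans):
    n, S = alpha.shape
    # g_t for t = 1, ..., n-1 (0-based indices t = 0, ..., n-2)
    g = [np.log(alpha[t])[:, None] + log_trans - np.log(beta[t + 1])[None, :]
         for t in range(n - 1)]
    M = np.log(gamma_n)                    # M_n(x_n) = ln gamma_n(x_n)
    back = []                              # backpointers  hat{x}_{k+1}(x_k)
    for k in range(n - 2, -1, -1):         # k = n-1, ..., 1 in the notes
        scores = g[k] + M[None, :]
        back.append(np.argmax(scores, axis=1))
        M = np.max(scores, axis=1)         # M_k(x_k)
    x = [int(np.argmax(M))]                # hat{x}_1 = argmax of M_1
    for bp in reversed(back):              # unroll the maximizing sequence
        x.append(int(bp[x[-1]]))
    return x                               # MAP state sequence (0-based symbols)
```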

EE378A Statistical Signal Processing, Lecture 5 - 04/18/2017

Lecture 5: Particle Filtering and Introduction to Bayesian Inference

Lecturer: Tsachy Weissman. Scribe: Charles Hale

In this lecture, we introduce new methods for state estimation in Hidden Markov Processes beyond discrete alphabets. This first requires learning about estimating pdfs from samples drawn from a different distribution. At the end, we briefly introduce Bayesian inference.

1 Recap: Inference on Hidden Markov Processes (HMPs)

1.1 Setting

A quick recap of the setting of an HMP:

1. $\{X_n\}_{n \ge 1}$ is a Markov process, called the state process.
2. $\{Y_n\}_{n \ge 1}$ is the observation process, where $Y_i$ is the output of $X_i$ sent through a memoryless channel characterized by the distribution $P_{Y|X}$.

Alternatively, we showed in HW1 that the above is totally equivalent to the following:

1. $\{X_n\}_{n \ge 1}$ is the state process defined by $X_t = f_t(X_{t-1}, W_t)$, where $\{W_n\}_{n \ge 1}$ is an independent process, i.e., each $W_t$ is independent of all the others.
2. $\{Y_n\}_{n \ge 1}$ is the observation process related to $\{X_n\}_{n \ge 1}$ by $Y_t = g_t(X_t, N_t)$, where $\{N_n\}_{n \ge 1}$ is an independent process.
3. The processes $\{N_n\}_{n \ge 1}$ and $\{W_n\}_{n \ge 1}$ are independent of each other.

1.2 Goal

Since the state process $\{X_n\}_{n \ge 1}$ is hidden from us, we only get to observe $\{Y_n\}_{n \ge 1}$ and wish to find a way of estimating $\{X_n\}_{n \ge 1}$ based on $\{Y_n\}_{n \ge 1}$. To do this, we define the forward recursions for causal estimation:
$$\alpha_t(x_t) = F(\beta_t, P_{Y_t|X_t}, y_t) \quad (1)$$
$$\beta_{t+1}(x_{t+1}) = G(\alpha_t, P_{X_{t+1}|X_t}) \quad (2)$$
where $F$ and $G$ are the operators defined in Lecture 4.

1.3 Challenges

For general alphabets $\mathcal{X}, \mathcal{Y}$, computing the forward recursion is a very difficult problem. However, we know efficient algorithms that can be applied in the following situations:

1. All random variables are discrete with a finite alphabet. We can then solve these equations by brute force in time proportional to the alphabet sizes.

2. For all $t$, $f_t$ and $g_t$ are linear and $N_t$, $W_t$, $X_1$ are Gaussian. The solution is the Kalman filter algorithm that we derive in Homework 2.

Beyond these special situations, there are many heuristic algorithms for this computation:

1. We can compute the forward recursions approximately by quantizing the alphabets. This leads to a tradeoff between model accuracy and computation cost.
2. This lecture will focus on particle filtering, which is a way of estimating the forward recursions with an adaptive quantization.

2 Importance Sampling

Before we can learn about particle filtering, we first need to discuss importance sampling (see the Spring 2013 lecture notes for further reference).

2.1 Simple Setting

Let $X_1, \ldots, X_N$ be i.i.d. random variables with density $f$. We wish to approximate $f$ by a probability mass function $\hat{f}_N$ using our data. The idea: we can approximate $f$ by taking
$$\hat{f}_N = \frac{1}{N} \sum_{i=1}^N \delta(x - X_i) \quad (3)$$
where $\delta(x)$ is a Dirac delta distribution.

Claim: $\hat{f}_N$ is a good estimate of $f$. Here we define the goodness of our estimate by looking at $E_{X \sim \hat{f}_N}[g(X)]$ (4) and seeing how close it is to $E_{X \sim f}[g(X)]$ (5) for any function $g$ with $E|g(X)| < \infty$. We show that by this definition, (3) is a good estimate of $f$.

Proof:
$$E_{X \sim \hat{f}_N}[g(X)] = \int \hat{f}_N(x) g(x) \, dx \quad (6)$$
$$= \int \frac{1}{N} \sum_{i=1}^N \delta(x - X_i) g(x) \, dx \quad (7)$$
$$= \frac{1}{N} \sum_{i=1}^N \int \delta(x - X_i) g(x) \, dx \quad (8)$$
$$= \frac{1}{N} \sum_{i=1}^N g(X_i) \quad (9)$$

By the law of large numbers, as $N \to \infty$,
$$\frac{1}{N} \sum_{i=1}^N g(X_i) \xrightarrow{a.s.} E_{X \sim f}[g(X)]. \quad (10)$$
Hence,
$$E_{X \sim \hat{f}_N}[g(X)] \to E_{X \sim f}[g(X)] \quad (11)$$
so $\hat{f}_N$ is indeed a good estimate of $f$.

2.2 General Setting

Suppose now that $X_1, \ldots, X_N$ are i.i.d. with common density $q$, but we wish to estimate a different pdf $f$.

Theorem: The estimator
$$\hat{f}_N(x) = \sum_{i=1}^N w_i \delta(x - X_i) \quad (12)$$
with weights
$$w_i = \frac{f(X_i)/q(X_i)}{\sum_{j=1}^N f(X_j)/q(X_j)} \quad (13)$$
is a good estimator of $f$.

Proof: It is easy to see that
$$E_{X \sim \hat{f}_N}[g(X)] = \sum_{i=1}^N w_i g(X_i) = \frac{\frac{1}{N}\sum_{i=1}^N \frac{f(X_i)}{q(X_i)} g(X_i)}{\frac{1}{N}\sum_{j=1}^N \frac{f(X_j)}{q(X_j)}}. \quad (14)$$
As $N \to \infty$, the numerator and denominator converge almost surely to their expectations under $q$, so the above expression converges to
$$\frac{E_q\left[\frac{f(X)}{q(X)} g(X)\right]}{E_q\left[\frac{f(X)}{q(X)}\right]} = \frac{\int \frac{f(x)}{q(x)} g(x) \, q(x) \, dx}{\int \frac{f(x)}{q(x)} \, q(x) \, dx} = \frac{\int f(x) g(x) \, dx}{\int f(x) \, dx} = \int f(x) g(x) \, dx = E_{X \sim f}[g(X)]. \quad (21)$$

Hence,
$$E_{X \sim \hat{f}_N}[g(X)] \to E_{X \sim f}[g(X)] \quad (22)$$
as $N \to \infty$.

2.3 Implications

The above theorem is useful in situations where we cannot draw samples $X_i \sim f$ and may not even know $f$ exactly, but we know that $f(x) = c \cdot h(x)$ for some constant $c$ (in many practical cases it is computationally prohibitive to obtain the normalization constant $c = (\int h(x) \, dx)^{-1}$). Then
$$w_i = \frac{f(X_i)/q(X_i)}{\sum_{j=1}^N f(X_j)/q(X_j)} = \frac{c \, h(X_i)/q(X_i)}{\sum_{j=1}^N c \, h(X_j)/q(X_j)} = \frac{h(X_i)/q(X_i)}{\sum_{j=1}^N h(X_j)/q(X_j)}. \quad (25)$$
Using these $w_i$'s, we see that we can approximate $f$ using $h$ and samples drawn from a different distribution.

2.4 Application

Let $X \sim f_X$ and let $Y$ be the observation of $X$ through a memoryless channel characterized by $f_{Y|X}$. For MAP reconstruction of $X$, we must compute
$$f_{X|Y}(x \mid y) = \frac{f_X(x) \, f_{Y|X}(y \mid x)}{\int f_X(\tilde{x}) \, f_{Y|X}(y \mid \tilde{x}) \, d\tilde{x}}. \quad (26)$$
In general, this computation may be hard due to the integral appearing in the denominator. But the denominator is just a constant. Hence, suppose we have $X_i$ i.i.d. $\sim f_X$ for $1 \le i \le N$. By the previous theorem, we can take as an approximation
$$\hat{f}_{X|y} = \sum_{i=1}^N w_i(y) \, \delta(x - X_i) \quad (27)$$
where
$$w_i(y) = \frac{f_{X|y}(X_i)/f_X(X_i)}{\sum_{j=1}^N f_{X|y}(X_j)/f_X(X_j)} = \frac{f_X(X_i) f_{Y|X}(y \mid X_i)/f_X(X_i)}{\sum_{j=1}^N f_X(X_j) f_{Y|X}(y \mid X_j)/f_X(X_j)} = \frac{f_{Y|X}(y \mid X_i)}{\sum_{j=1}^N f_{Y|X}(y \mid X_j)}. \quad (30)$$
Hence, using the samples $X_1, X_2, \ldots, X_N$ and the above procedure, we can build a good approximation of $f_{X|Y}(\cdot \mid y)$ for fixed $y$.
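The following sketch (our addition; the standard-normal prior and additive-noise channel are assumed for illustration) carries out exactly this program, estimating a posterior mean from prior samples without computing the normalizing integral:

```python
# Importance-sampling approximation of f_{X|Y}(. | y) from prior samples,
# using the weights in (30). Prior: X ~ N(0, 1); channel: Y = X + N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
N, y = 100_000, 1.5
x_samples = rng.normal(size=N)              # X_i ~ f_X
log_like = -0.5 * (y - x_samples) ** 2      # log f_{Y|X}(y | X_i), up to a constant
w = np.exp(log_like - log_like.max())       # shift for numerical stability
w /= w.sum()                                # normalized weights (30)

# Posterior mean E[X | Y = y]; the exact answer is y/2 = 0.75 here.
print("importance-sampling estimate:", np.sum(w * x_samples))
```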

Let us denote this approximation as
$$\hat{f}_N(X_1, X_2, \ldots, X_N, f_X, f_{Y|X}, y) = \sum_{i=1}^N w_i \delta(x - X_i). \quad (31)$$

3 Particle Filtering

Particle filtering is a way to approximate the forward recursions of an HMP, using importance sampling to approximate the generated distributions rather than computing the true solutions.

Algorithm 1: Particle Filter

  particles <- generate $\{X_1^{(i)}\}_{i=1}^N$ i.i.d. $\sim f_{X_1}$;
  for $t \ge 1$ do
    $\hat{\alpha}_t = \hat{F}_N(\{X_t^{(i)}\}_{i=1}^N, \hat{\beta}_t, P_{Y_t|X_t}, y_t)$;  // estimate $\alpha_t$ using importance sampling as in (31)
    nextparticles <- generate $\{\tilde{X}_t^{(i)}\}_{i=1}^N$ i.i.d. $\sim \hat{\alpha}_t$;
    reset particles;
    for $1 \le i \le N$ do
      particles <- generate $X_{t+1}^{(i)} \sim f_{X_{t+1}|X_t}(\cdot \mid \tilde{X}_t^{(i)})$;
    end for
    $\hat{\beta}_{t+1} = \frac{1}{N} \sum_{i=1}^N \delta(x - X_{t+1}^{(i)})$;
  end for

This algorithm requires some art to use. It is not well understood how errors propagate through these successive approximations. In practice, the parameters need to be periodically reset, and choosing these hyperparameters is an application-specific task.

4 Inference under Logarithmic Loss

Recall Bayesian decision theory, and let us have the following:

- $x \in \mathcal{X}$ is something we want to infer, with $X \sim p_X$ (assume that $X$ is discrete);
- $\hat{x} \in \hat{\mathcal{X}}$ is something we wish to reconstruct;
- $\Lambda : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}$ is our loss function.

First we assume that we do not have any observation, and define the Bayes envelope as
$$U(p_X) = \min_{\hat{x}} E[\Lambda(X, \hat{x})] = \min_{\hat{x}} \sum_x p_X(x) \Lambda(x, \hat{x}) \quad (32)$$
and the Bayes response as
$$\hat{X}_{Bayes}(p_X) = \arg\min_{\hat{x}} E[\Lambda(X, \hat{x})]. \quad (33)$$

Example: Let $\mathcal{X} = \hat{\mathcal{X}} = \mathbb{R}$ and $\Lambda(x, \hat{x}) = (x - \hat{x})^2$. Then
$$U(p_X) = \min_{\hat{x}} E_{X \sim p_X}(X - \hat{x})^2 = \text{Var}(X) \quad (34)$$
$$\hat{X}_{Bayes}(p_X) = E[X]. \quad (35)$$

Consider now the case where we have some observation (side information) $Y$, where our reconstruction $\hat{X}$ can be a function of $Y$. Then
$$E[\Lambda(X, \hat{X}(Y))] = E[E[\Lambda(X, \hat{X}(y)) \mid Y = y]] \quad (36)$$
$$= \sum_y E[\Lambda(X, \hat{X}(y)) \mid Y = y] \, p_Y(y) \quad (37)$$
$$= \sum_y p_Y(y) \sum_x p_{X|Y=y}(x) \Lambda(x, \hat{X}(y)). \quad (38)$$
This implies
$$\min_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))] = \min_{\hat{X}(\cdot)} \sum_y p_Y(y) \sum_x p_{X|Y=y}(x) \Lambda(x, \hat{X}(y)) \quad (40)$$
$$= \sum_y p_Y(y) U(p_{X|Y=y}) \quad (41)$$
and thus
$$\hat{X}_{opt}(y) = \hat{X}_{Bayes}(p_{X|Y=y}). \quad (42)$$

EE378A Statistical Signal Processing, Lecture 6 - 04/20/2017

Lecture 6: Bayesian Decision Theory and Justification of Log Loss

Lecturer: Tsachy Weissman. Scribe: Okan Atalar

(Reading: Lecture 3 of the EE378A Spring 2016 notes.)

In this lecture, we introduce the topic of Bayesian decision theory and define the concept of a loss function for assessing the accuracy of an estimate. The log loss function is presented as an important special case.

1 Notation

A quick summary of the notation:

1. Random variables (objects), used more loosely: $X, Y$
2. PMF (PDF), used more loosely: $P_X, P_Y$
3. Alphabets: $\mathcal{X}, \hat{\mathcal{X}}$
4. Estimate of $X$: $\hat{x}$
5. Specific values: $x, y$
6. Loss function: $\Lambda$
7. Markov triplet: the notation $X - Y - Z$ denotes a Markov triplet
8. Logarithm: $\log$ denotes $\log_2$, unless otherwise stated

For a discrete random variable (object) $U$ with p.m.f. $P_U(u) = P(U = u)$, we will often just write $p(u)$; similarly, $p(x, y)$ for $P_{X,Y}(x, y)$ and $p(y \mid x)$ for $P_{Y|X}(y \mid x)$, etc.

2 Bayesian Decision Theory

Let $X \in \mathcal{X}$, $X \sim P_X$, $\hat{x} \in \hat{\mathcal{X}}$, and $\Lambda : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}$.

Definition 1 (Bayes Envelope).
$$U(P_X) \triangleq \min_{\hat{x}} E[\Lambda(X, \hat{x})] = \min_{\hat{x}} \sum_x P_X(x) \Lambda(x, \hat{x}) \quad (1)$$

Definition 2 (Bayes Response).
$$\hat{X}_{Bayes}(P_X) \triangleq \arg\min_{\hat{x}} E[\Lambda(X, \hat{x})] \quad (2)$$

Essentially, given a prior distribution on $X$, we are trying to achieve the best estimate by minimizing the expected value of some loss function that we have defined. If, in addition to the prior distribution of $X$, we are also given an observation $Y$, the associated Bayes envelope becomes
$$\min_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))] = \sum_y U(P_{X|Y=y}) P_Y(y) = E[U(P_{X|Y})]. \quad (3)$$
In this case, the Bayes response takes the form
$$\hat{X}(y) = \hat{X}_{Bayes}(P_{X|Y=y}). \quad (4)$$
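For a finite alphabet, Definitions 1 and 2 reduce to a matrix-vector product followed by a minimum; a minimal sketch (our addition, with a made-up Hamming-loss example):

```python
# Bayes envelope U(P_X) and Bayes response for a loss matrix Lam[x, xhat].
import numpy as np

def bayes_envelope_response(p_x, Lam):
    expected = p_x @ Lam                 # expected loss of each action xhat
    return expected.min(), int(expected.argmin())

Lam = 1.0 - np.eye(3)                    # Hamming loss on a ternary alphabet
p_x = np.array([0.2, 0.5, 0.3])
U, xhat = bayes_envelope_response(p_x, Lam)
print(U, xhat)                           # 0.5 and action 1: the posterior mode
```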

3 Bayesian Inference Under Logarithmic Loss

In this section, we will use the logarithm as the loss function.

Definition 3 (Logarithmic Loss Function).
$$\Lambda(x, Q) \triangleq \log \frac{1}{Q(x)}$$
where $x$ is the realization of $X$ and $Q$ is the estimate of the pmf of $X$.

Using Definition 3, the expected value of the log loss can be expressed as:
$$E[\Lambda(X, Q)] = \sum_x P_X(x) \Lambda(x, Q) = \sum_x P_X(x) \log \frac{1}{Q(x)} \quad (5)$$
$$= \sum_x P_X(x) \log \frac{1}{P_X(x)} + \sum_x P_X(x) \log \frac{P_X(x)}{Q(x)} \quad (6, 7)$$

Definition 4 (Entropy).
$$H(X) \triangleq \sum_x P_X(x) \log \frac{1}{P_X(x)} \quad (8)$$

Definition 5 (Relative Entropy, or the Kullback-Leibler Divergence).
$$D(P_X \| Q) \triangleq \sum_x P_X(x) \log \frac{P_X(x)}{Q(x)} \quad (9)$$

Using Equations (6, 7) and Definitions (4, 5):
$$E[\Lambda(X, Q)] = H(X) + D(P_X \| Q) \quad (10)$$

Claim: $D(P_X \| Q) \ge 0$, and $D(P_X \| Q) = 0 \iff Q = P_X$.

Before attempting the proof, we recall Jensen's inequality.

Jensen's Inequality: Let $Q$ denote a concave function and $X$ be any random variable. Then
$$E[Q(X)] \le Q(E[X]) \quad (11)$$
Further, if $Q$ is strictly concave, $E[Q(X)] = Q(E[X]) \iff X$ is a constant. In particular, $Q(x) = \log(x)$ is a concave function, thus for a random variable $X \ge 0$:
$$E[\log X] \le \log E[X]. \quad (12)$$

Proof of the Claim:
$$-D(P_X \| Q) = E\left[\log \frac{Q(x)}{P_X(x)}\right] \le \log E\left[\frac{Q(x)}{P_X(x)}\right] = \log \sum_x P_X(x) \frac{Q(x)}{P_X(x)} = \log \sum_x Q(x) = 0 \quad (13)$$
The last equality follows from $\sum_x Q(x) = 1$, since $Q(x)$ is a pmf and thus sums to one. Hence, it follows from the claim that
$$\min_Q E[\Lambda(X, Q)] = H(X) = U(P_X) \quad (14)$$

where the minimizer is $Q = P_X$. Therefore:
$$\hat{X}_{Bayes}(P_X) = P_X. \quad (15)$$

Introducing the notation $H(X) \equiv H(P_X)$ and making use of Equation (3):
$$\min_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))] = E[U(P_{X|Y})] = \sum_y H(P_{X|Y=y}) P_Y(y) \quad (16)$$

Definition 6 (Conditional Entropy).
$$H(X \mid Y) \triangleq \sum_y H(P_{X|Y=y}) P_Y(y) \quad (17)$$

Definition 7 (Mutual Information).
$$I(X; Y) \triangleq H(X) - H(X \mid Y) \quad (18)$$

Definition 8 (Conditional Relative Entropy).
$$D(P_{X|Y} \| Q_{X|Y} \mid P_Y) \triangleq \sum_y D(P_{X|Y=y} \| Q_{X|Y=y}) P_Y(y) \quad (19)$$

Definition 9 (Conditional Mutual Information).
$$I(X; Y \mid Z) \triangleq H(X \mid Z) - H(X \mid Y, Z) \quad (20)$$

Exercises:
(a) $H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)$
(b) $I(X; Y) = H(Y) - H(Y \mid X)$
(c) $I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1)$
(d) $D(P_{X_1,X_2} \| Q_{X_1,X_2}) = D(P_{X_1} \| Q_{X_1}) + D(P_{X_2|X_1} \| Q_{X_2|X_1} \mid P_{X_1})$

4 Justification of Log Loss as the Right Loss Function

For a general loss function $\Lambda$:

Definition 10 (Coherent Dependence Function).
$$C(\Lambda, P_{X,Y}) \triangleq \inf_{\hat{x}} E[\Lambda(X, \hat{x})] - \inf_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))] \quad (21)$$

The first term corresponds to the minimum expected error (depending on the choice of loss function) in estimating $X$ based only on the prior distribution of $X$. The second term corresponds to the minimum expected error in estimating $X$ by also taking the observation $Y$ into account. Therefore, the coherent dependence function measures how much the estimate improves given the observation $Y$, compared to estimating $X$ based only on the prior distribution of $X$. Two examples are provided below to make this concept more clear:

(1) $\Lambda(x, \hat{x}) = (x - \hat{x})^2 \Rightarrow C(\Lambda, P_{X,Y}) = \text{Var}(X) - E[\text{Var}(X \mid Y)]$
(2) $\Lambda(x, Q) = \log \frac{1}{Q(x)} \Rightarrow C(\Lambda, P_{X,Y}) = I(X; Y)$

Proving the expressions below is left as an exercise:
(a) $C(\Lambda, P_{X,Y}) \ge 0$, and $C(\Lambda, P_{X,Y}) = 0 \iff X$ and $Y$ are independent

(b) $C(\Lambda, P_{X,Y}) \ge C(\Lambda, P_{X,Z})$ when $X - Y - Z$

Natural Requirement: If $T = T(X)$ is a function of $X$, then $C(\Lambda, P_{T,Y}) \le C(\Lambda, P_{X,Y})$. This requirement basically states that a function of $X$ cannot carry more information about $X$ than $X$ itself.

Definition 11 (Sufficient Statistic). $T = T(X)$ is a sufficient statistic of $X$ for $Y$ if $X - T - Y$.

Modest Requirement: If $T$ is a sufficient statistic of $X$ for $Y$, then $C(\Lambda, P_{T,Y}) \ge C(\Lambda, P_{X,Y})$.

Theorem: Assume $|\mathcal{X}| \ge 3$. The only $C(\Lambda, P_{X,Y})$ satisfying the modest requirement is $C(\Lambda, P_{X,Y}) = I(X; Y)$ (up to a multiplicative factor).

Since $C(\Lambda, P_{X,Y}) = I(X; Y)$ if (and only if) the logarithmic loss function is used ($\Lambda(x, Q) = \log \frac{1}{Q(x)}$), it follows that the logarithmic loss function is the only loss function which satisfies the modest requirement.
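To close the lecture, a quick numeric check (our addition; the two distributions are arbitrary) of the decomposition (10) that underlies these results:

```python
# Verify E[Lambda(X, Q)] = H(X) + D(P_X || Q), all logs base 2 as in the notes.
import numpy as np

def H(p):
    return -np.sum(p * np.log2(p))

def D(p, q):
    return np.sum(p * np.log2(p / q))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.6, 0.2, 0.2])
expected_log_loss = -np.sum(p * np.log2(q))
print(expected_log_loss, H(p) + D(p, q))   # the two numbers coincide
```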

EE378A Statistical Signal Processing, Lecture 7 - 04/25/2017

Lecture 7: Prediction

Lecturer: Tsachy Weissman. Scribe: Cem Kalkanli

In this lecture we study prediction in the Bayesian setting under general loss and log loss. We also briefly start the subject of prediction of individual sequences.

1 Notation

A quick summary of the notation:

1. Random variables (objects), used more loosely: $X, Y, U, V$
2. Alphabets: $\mathcal{X}, \hat{\mathcal{X}}$
3. Sequence of random variables: $X_1, X_2, \ldots, X_t$
4. Loss function: $\Lambda(x, \hat{x})$
5. Estimate of $X$: $\hat{x}$
6. Set of distributions on the alphabet $\mathcal{X}$: $\mathcal{M}(\mathcal{X})$

For a discrete random variable (object) $U$ with p.m.f. $P_U(u) = P(U = u)$, we will often just write $p(u)$; similarly, $p(x, y)$ for $P_{X,Y}(x, y)$ and $p(y \mid x)$ for $P_{Y|X}(y \mid x)$, etc.

2 Prediction

Consider $X_1, X_2, \ldots, X_n \in \mathcal{X}$.

Definition 1 (Predictor). A predictor $F$ is a sequence of functions $F = (F_t)_{t \ge 1}$ with $F_t : \mathcal{X}^{t-1} \to \hat{\mathcal{X}}$.

Definition 2 (Prediction Loss). Given a loss function $\Lambda : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}_+$, the prediction loss incurred by the predictor $F$ is defined as
$$L_F(X^n) = \frac{1}{n} \sum_{t=1}^n \Lambda(X_t, F_t(X^{t-1})). \quad (1)$$

2.1 Prediction in the Bayesian Setting

In the Bayesian setting, we assume $X_1, X_2, \ldots$ are governed jointly by some underlying law $P$. Then we have the following prediction loss.

Definition 3 (Prediction Loss in the Bayesian Setting).
$$\min_F E_P[L_F(X^n)] = \frac{1}{n} \sum_{t=1}^n \min_{F_t} E_P[\Lambda(X_t, F_t(X^{t-1}))] = \frac{1}{n} \sum_{t=1}^n E_P[U(P_{X_t|X^{t-1}})] \quad (2)$$
where $U(\cdot)$ is the Bayes envelope we defined in the previous lecture.

As we can see from (2), our initial objective was to minimize the expected prediction loss with respect to the whole sequence of functions. However, since we can interchange summation and expectation, we can treat the overall predictor $F$ as the composition of the sequentially optimal predictors $F_t$. Finally, we note that the overall loss is just the sum of expectations of Bayes envelopes.

Theorem 4 (Mismatch Loss). The additional prediction loss incurred by doing prediction based on a law $Q$ instead of $P$ is upper bounded by some form of relative entropy:
$$E_P[L_{F^Q}(X^n) - L_{F^P}(X^n)] \le \Lambda_{max} \sqrt{\frac{2}{n} D(P_{X^n} \| Q_{X^n})}, \quad (3)$$
where $\Lambda_{max} = \sup_{x, \hat{x}} \Lambda(x, \hat{x})$.

We first introduce Lemma 5, which will be used to prove Theorem 4.

Lemma 5.
$$E_P[\Lambda(X, \hat{X}_{Bayes}(Q)) - \Lambda(X, \hat{X}_{Bayes}(P))] \le \Lambda_{max} \|P - Q\|_1 \le \Lambda_{max} \sqrt{2 D(P \| Q)} \quad (4)$$

Proof.
$$E_P[\Lambda(X, \hat{X}_{Bayes}(Q)) - \Lambda(X, \hat{X}_{Bayes}(P))]$$
$$\le E_P[\Lambda(X, \hat{X}_{Bayes}(Q)) - \Lambda(X, \hat{X}_{Bayes}(P))] + E_Q[-\Lambda(X, \hat{X}_{Bayes}(Q)) + \Lambda(X, \hat{X}_{Bayes}(P))] \quad (5)$$
$$= \sum_x (P(x) - Q(x))\left(\Lambda(x, \hat{X}_{Bayes}(Q)) - \Lambda(x, \hat{X}_{Bayes}(P))\right) \quad (6)$$
$$\le \sum_x |P(x) - Q(x)| \, \Lambda_{max} \quad (7)$$
$$= \Lambda_{max} \|P - Q\|_1 \quad (8)$$
$$\le \Lambda_{max} \sqrt{2 D(P \| Q)} \quad (9)$$
where (5) holds since the added term is non-negative by the definition of the Bayes response under $Q$, and in the last step we have used Pinsker's inequality $D(P \| Q) \ge \frac{1}{2} \|P - Q\|_1^2$.

Now we prove Theorem 4 as follows:
$$E_P[L_{F^Q}(X^n) - L_{F^P}(X^n)] \quad (10)$$
$$= E_P\left[\frac{1}{n} \sum_{t=1}^n \left(\Lambda(X_t, F_t^Q(X^{t-1})) - \Lambda(X_t, F_t^P(X^{t-1}))\right)\right] \quad (11)$$
$$= \frac{1}{n} \sum_{t=1}^n E_P\left[E_P\left[\Lambda(X_t, F_t^Q(X^{t-1})) - \Lambda(X_t, F_t^P(X^{t-1})) \,\middle|\, X^{t-1}\right]\right] \quad \text{(iterated expectation)} \quad (12)$$
$$\le \frac{1}{n} \sum_{t=1}^n E_P\left[\Lambda_{max} \sqrt{2 D(P_{X_t|X^{t-1}} \| Q_{X_t|X^{t-1}})}\right] \quad \text{(using Lemma 5)} \quad (13)$$
$$\le \Lambda_{max} \sqrt{\frac{2}{n} \sum_{t=1}^n E_P\left[D(P_{X_t|X^{t-1}} \| Q_{X_t|X^{t-1}})\right]} \quad \text{(Jensen's inequality)} \quad (14)$$
$$= \Lambda_{max} \sqrt{\frac{2}{n} D(P_{X^n} \| Q_{X^n})} \quad \text{(chain rule for relative entropy)} \quad (15)$$
as desired.

3 Prediction Under Logarithmic Loss

In this section, we specialize our setup to the logarithmic loss while changing our predictor set into a set of distributions. So here we have $\hat{\mathcal{X}} = \mathcal{M}(\mathcal{X})$ and $\Lambda(x, P) = \log \frac{1}{P(x)}$ as the logarithmic loss. Also note that $F_t(X^{t-1}) \in \mathcal{M}(\mathcal{X})$. So we can represent our predictor with three different setups as follows:

1. Obviously, we can carry over the setup we worked with previously and see our predictor as a set of functions $\{F_t\}_{t \ge 1}$.
2. Another approach is to see each predictor as a sequence of conditional distributions $\{Q_{X_t|X^{t-1}}\}_{t \ge 1}$, where $F_t(X^{t-1}) = Q_{X_t|X^{t-1}}$.
3. Finally, we can see our predictor as a model on the data. Then $Q$ would be the law we assign to the sequence we observe.

We can immediately adapt the previous results to the log loss case. The first one is $L_F(X^n)$:
$$L_F(X^n) = \frac{1}{n} \sum_{t=1}^n \log \frac{1}{F_t(X^{t-1})[x_t]} \quad (16)$$
$$= \frac{1}{n} \sum_{t=1}^n \log \frac{1}{Q_{X_t|X^{t-1}}(x_t)} \quad (17)$$
$$= \frac{1}{n} \log \frac{1}{Q(X^n)} \quad \text{(where now } Q \text{ is our model of the data)} \quad (18)$$

Now let's look at the logarithmic loss in the Bayesian setting, assuming the sequence $X_1, X_2, \ldots$ is governed by $P$:
$$E_P[L_F(X^n)] = E_P\left[\frac{1}{n} \log \frac{1}{Q(X^n)}\right] \quad (19)$$
$$= E_P\left[\frac{1}{n} \log \frac{P(X^n)}{Q(X^n) P(X^n)}\right] \quad (20)$$
$$= \frac{1}{n} \left[H(X^n) + D(P_{X^n} \| Q_{X^n})\right] \quad (21)$$
Then we can see that $\min_F E_P[L_F(X^n)] = \frac{1}{n} H(X^n)$, and it is achieved when $F_t(X^{t-1}) = P_{X_t|X^{t-1}}$.

4 Prediction of Individual Sequences

For the last part, we look at a different set of predictors. In this setup we are given a class of predictors $\mathcal{F}$, and we seek a predictor $F$ such that
$$\lim_{n \to \infty} \max_{X^n} \left[L_F(X^n) - \min_{G \in \mathcal{F}} L_G(X^n)\right] = 0. \quad (22)$$

4.1 Prediction of Individual Sequences under Log Loss

Now the question arises: given a class of laws $\mathcal{F}$, can we find $P$ such that (23) is small for any $X^n$?
$$\frac{1}{n} \log \frac{1}{P(X^n)} - \min_{Q \in \mathcal{F}} \frac{1}{n} \log \frac{1}{Q(X^n)} = \frac{1}{n} \log \frac{\max_{Q \in \mathcal{F}} Q(X^n)}{P(X^n)}. \quad (23)$$
This problem will be solved in the upcoming lecture.

EE378A Statistical Signal Processing, Lecture 8 - 04/27/2017

Lecture 8: Prediction of Individual Sequences

Lecturer: Tsachy Weissman. Scribe: Chuan-Zheng Lee

In the last lecture we studied the prediction problem in the Bayesian setting; in this lecture we study the prediction problem under logarithmic loss in the individual-sequence setting.

1 Problem setup

We first recall the problem setup: given a class of probability laws $\mathcal{F}$, we wish to find some law $P$ with a small worst-case regret
$$\text{WCR}_n(P, \mathcal{F}) \triangleq \max_{x^n} \left[\log \frac{1}{P(x^n)} - \min_{Q \in \mathcal{F}} \log \frac{1}{Q(x^n)}\right]. \quad (1)$$
A first natural question is as follows: can we find some $P$ such that
$$\text{WCR}_n(P, \mathcal{F}) \le n \epsilon_n \quad (2)$$
with $\epsilon_n \to 0$ as $n \to \infty$?

2 First answer when $|\mathcal{F}| < \infty$: uniform mixture

The answer to the previous question turns out to be affirmative when $|\mathcal{F}| < \infty$. Our construction of $P$ is as follows:
$$P_{\text{Unif}}(x^n) = \frac{1}{|\mathcal{F}|} \sum_{Q \in \mathcal{F}} Q(x^n). \quad (3)$$
The performance of $P_{\text{Unif}}$ is summarized in the following lemma:

Lemma 1. The worst-case regret of $P_{\text{Unif}}$ satisfies
$$\text{WCR}_n(P_{\text{Unif}}, \mathcal{F}) \le \log |\mathcal{F}|. \quad (4)$$

Proof. Just apply $Q(x^n) \ge 0$ for any $Q \in \mathcal{F}$ and $x^n$ to conclude that
$$\log \frac{1}{P(x^n)} - \min_{Q \in \mathcal{F}} \log \frac{1}{Q(x^n)} = \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P(x^n)} \quad (5)$$
$$= \log |\mathcal{F}| + \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{\sum_{Q \in \mathcal{F}} Q(x^n)} \quad (6)$$
$$\le \log |\mathcal{F}|. \quad (7)$$

By Lemma 1, we know that we can choose $\epsilon_n = \frac{\log |\mathcal{F}|}{n} \to 0$ as $n \to \infty$, which provides an affirmative answer to our question.

3 Second answer: normalized maximum likelihood

From the previous section we know that $\epsilon_n \to 0$ is possible when $|\mathcal{F}| < \infty$, but we are still looking for some law $P$ which gives the minimum $\epsilon_n$, or equivalently, the minimum $\text{WCR}_n(P, \mathcal{F})$. Complicated as the definition of $\text{WCR}_n(P, \mathcal{F})$ may appear, surprisingly its minimum admits a closed-form expression:

Theorem 2 (Normalized Maximum Likelihood). The minimum worst-case regret is given by
$$\min_P \text{WCR}_n(P, \mathcal{F}) = \log \sum_{x^n} \max_{Q \in \mathcal{F}} Q(x^n). \quad (8)$$
This bound is attained by the normalized maximum likelihood predictor
$$P_{\text{NML}}(x^n) = \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{\sum_{\tilde{x}^n} \max_{Q \in \mathcal{F}} Q(\tilde{x}^n)}. \quad (9)$$

Proof. Since $\frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P_{\text{NML}}(x^n)}$ is a constant, it is clear that $P_{\text{NML}}$ attains the desired equality. To prove optimality, suppose $P$ is any other predictor. Since both $P_{\text{NML}}$ and $P$ are probability distributions, there must exist some $x^n$ such that
$$P(x^n) \le P_{\text{NML}}(x^n). \quad (10)$$
Considering this sequence $x^n$ gives
$$\text{WCR}_n(P, \mathcal{F}) \ge \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P(x^n)} \quad (11)$$
$$\ge \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P_{\text{NML}}(x^n)} \quad (12)$$
$$= \log \sum_{\tilde{x}^n} \max_{Q \in \mathcal{F}} Q(\tilde{x}^n) \quad (13)$$
as desired.

4 Third answer: Dirichlet mixture

Although the normalized maximum likelihood predictor gives the optimal worst-case regret, it has a severe drawback: it is usually very hard to compute in practice. In contrast, the mixture idea used in the first answer can usually be computed efficiently (as will be shown below). This observation motivates us to come up with a mixture-based predictor $P$ which enjoys a worst-case regret close to that of $P_{\text{NML}}$.

The general mixture idea is as follows: suppose the model class $\mathcal{F} = (Q_\theta)_{\theta \in \Theta}$ can be parametrized by $\theta$. Consider some prior $w(\cdot)$ on $\Theta$ and choose
$$P_w(x^n) = \int_\Theta Q_\theta(x^n) \, w(d\theta) \quad (14)$$

as our predictor. Note that in this case,
$$P_w(x_t \mid x^{t-1}) = \frac{P_w(x^t)}{P_w(x^{t-1})} \quad (15)$$
$$= \frac{\int_\Theta Q_\theta(x^t) \, w(d\theta)}{\int_\Theta Q_\theta(x^{t-1}) \, w(d\theta)} \quad (16)$$
$$= \frac{\int_\Theta Q_\theta(x_t \mid x^{t-1}) \, Q_\theta(x^{t-1}) \, w(d\theta)}{\int_\Theta Q_\theta(x^{t-1}) \, w(d\theta)} \quad (17)$$
$$= \int_\Theta Q_\theta(x_t \mid x^{t-1}) \, \frac{Q_\theta(x^{t-1})}{\int_\Theta Q_{\theta'}(x^{t-1}) \, w(d\theta')} \, w(d\theta) \quad (18)$$
$$= \int_\Theta Q_\theta(x_t \mid x^{t-1}) \, w(d\theta \mid x^{t-1}) \quad (19)$$
where
$$w(d\theta \mid x^{t-1}) = \frac{Q_\theta(x^{t-1}) \, w(d\theta)}{\int_\Theta Q_{\theta'}(x^{t-1}) \, w(d\theta')}. \quad (20)$$

Then how should we choose the prior $w(\cdot)$? On one hand, the prior should be chosen such that it assigns roughly equal weights to the representative sources. On the other hand, it should be chosen in a way such that it is computationally tractable for the sequential probability assignments. We take a look at some concrete examples.

4.1 Bernoulli case

In the first example, we assume that $Q_\theta = \text{Bern}(\theta)^{\otimes n}$ and $\mathcal{F} = \{Q_\theta\}_{\theta \in [0,1]}$. Note that in this case the model class consists of all i.i.d. distributions. We assign the Dirichlet prior
$$w(d\theta) = \frac{1}{\pi} \frac{d\theta}{\sqrt{\theta(1-\theta)}} = \text{Beta}\left(\frac{1}{2}, \frac{1}{2}\right). \quad (21)$$
The properties of the resulting predictor are left as exercises:

Exercise 3. Denote by $n_0 = n_0(x^n)$ and $n_1 = n_1(x^n)$ the number of 0's and 1's in the sequence $x^n$, respectively. Prove the following:

(a) The conditional probability is given by
$$P_w(x_t \mid x^{t-1}) = \frac{n_{x_t}(x^{t-1}) + \frac{1}{2}}{t}. \quad (22)$$
This is called the Krichevsky-Trofimov sequential probability assignment.

(b) Show that $\text{WCR}_n(P_w, \mathcal{F}) = \frac{1}{2} \log n + O(1)$.

(c) Considering the normalized maximum likelihood predictor, also show that $\min_P \text{WCR}_n(P, \mathcal{F}) = \frac{1}{2} \log n + O(1)$. Hence, our Dirichlet mixture $P_w$ attains the optimal leading term of the worst-case regret.

In fact, there is a general theorem on the performance of the mixture-based predictor:

Theorem 4 (Rissanen, 1984). Let $\mathcal{F} = \{Q_\theta\}_{\theta \in \Theta}$ with $\Theta \subseteq \mathbb{R}^d$. Under benign regularity and smoothness conditions, we have
$$\min_P \text{WCR}_n(P, \mathcal{F}) = \frac{d}{2} \log n + o(\log n). \quad (23)$$

Moreover, there exists some Dirichlet prior $w(\cdot)$ such that
$$\text{WCR}_n(P_w, \mathcal{F}) = \frac{d}{2} \log n + o(\log n). \quad (24)$$

4.2 Markov sources

Given that we can compete with i.i.d. Bernoulli sequences, how can we compete under logarithmic loss with respect to Markov models? Assume $\mathcal{X} = \{0, 1\}$ and denote by $\mathcal{M}_k$ the set of all $k$-th order Markov models. Note that for $Q \in \mathcal{M}_k$,
$$Q(x^n \mid x_{-k+1}^0) = \prod_{i=1}^n Q(x_i \mid x_{i-k}^{i-1}) \quad (25)$$
$$= \prod_{\text{all possible } k\text{-tuples } u^k} \ \prod_{i : x_{i-k}^{i-1} = u^k} Q(x_i \mid u^k). \quad (26)$$
The product $\prod_{i : x_{i-k}^{i-1} = u^k} Q(x_i \mid u^k)$ gives us the probability of the subsequence $\{x_i\}_{x_{i-k}^{i-1} = u^k}$ under $\text{Bern}(Q(1 \mid u^k))$. It therefore makes sense to employ the KT estimator on each of these subsequences. Consider the following $P$ such that
$$P(x_t \mid x_{t-k}^{t-1}) = \frac{n_{x_{t-k}^{t-1}, x_t}(x^{t-1}) + 1/2}{n_{x_{t-k}^{t-1}}(x^{t-1}) + 1} \quad (27)$$
where $n_{u^k}(x^n) = \sum_{i=1}^n \mathbb{1}(x_{i-k}^{i-1} = u^k)$ counts the occurrences of the context $u^k$, and $n_{u^k, x}(x^n)$ counts the occurrences of the context $u^k$ followed by the symbol $x$. Then, for this $P$, we have
$$\text{WCR}_n(P, \mathcal{M}_k) \le \sum_{u^k} \frac{1}{2} \ln(n_{u^k}(x^n)) + o(\ln n) \quad (28)$$
$$\le \frac{2^k}{2} \ln n + o(\ln n), \quad (29)$$
which matches the Rissanen lower bound in the leading term.
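A short sketch (our addition; padding the first $k$ contexts with zeros is our convention, where the notes instead condition on $x_{-k+1}^0$) of the KT assignment (22) and its per-context extension (27):

```python
# Sequential Krichevsky-Trofimov probability assignment; k = 0 gives (22),
# k >= 1 applies KT separately within each length-k context as in (27).
import numpy as np

def kt_log2_prob(x, k=0):
    counts = {}                  # context tuple -> [n_0, n_1]
    log_p = 0.0
    x = [0] * k + list(x)        # assumed zero padding for the first k symbols
    for i in range(k, len(x)):
        ctx, sym = tuple(x[i - k:i]), x[i]
        n = counts.setdefault(ctx, [0, 0])
        log_p += np.log2((n[sym] + 0.5) / (n[0] + n[1] + 1.0))
        n[sym] += 1
    return log_p                 # log2 P(x^n) under the per-context KT mixture

x = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
for k in [0, 1, 2]:
    print(k, kt_log2_prob(x, k))
```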

EE378A Statistical Signal Processing, Lecture 9 - 05/02/2017

Lecture 9: Context Tree Weighting

Lecturer: Tsachy Weissman. Scribe: Chuan-Zheng Lee

In the last lecture we talked about the mixture idea for i.i.d. sources and Markov sources. Specifically, we introduced the Krichevsky-Trofimov probability assignment for a binary vector $x^n \in \{0, 1\}^n$, given by
$$P_{KT}(x^n) = \int_0^1 \omega(\theta) \, \theta^{n_1(x^n)} (1-\theta)^{n_0(x^n)} \, d\theta = \frac{\Gamma(n_0 + \frac{1}{2}) \, \Gamma(n_1 + \frac{1}{2})}{(\Gamma(\frac{1}{2}))^2 \, \Gamma(n + 1)}, \quad (1)$$
where $\omega(\theta) = \frac{1}{\pi \sqrt{\theta(1-\theta)}}$, $n_1 \triangleq n_1(x^n)$ is the number of 1's in $x^n$, $n_0 \triangleq n_0(x^n)$ is the number of 0's (so that $n = n_0 + n_1$), and $\Gamma(\cdot)$ is the gamma function. Note that $\omega(\theta)$ is a Dirichlet prior with parameters $(\frac{1}{2}, \frac{1}{2})$, and the integrand is just an (unnormalized) Dirichlet distribution with parameters $(n_0 + \frac{1}{2}, n_1 + \frac{1}{2})$.

The main result is that, for regular models with $d$ degrees of freedom, the Dirichlet mixing idea achieves the optimal worst-case regret $\frac{d}{2} \log n + o(\log n)$ for individual sequences of length $n$. In this lecture, we will extend this idea to general tree sources.

1 Tree Sources

We first begin with an informal definition of a tree source.

Definition 1 (Context of a symbol). A context of a symbol $x_t$ is a suffix of the sequence that precedes $x_t$, that is, of $(\ldots, x_{t-2}, x_{t-1})$.

Definition 2 (Tree source). A tree source is a source for which a set of suffixes is sufficient to determine the probability distribution of the next item in the sequence (regardless of the rest of the past).

A tree source is hence a generalization of an order-one Markov source, where only the previous element in the sequence determines the next: now, instead of just one element being sufficient, a suffix is sufficient. It is perhaps best elucidated by example.

Example 3. Let $S = \{00, 10, 1\}$ and $\Theta = \{\theta_s, s \in S\}$. Then
$$p(x_t = 1 \mid X_{t-2} = 0, X_{t-1} = 0) = \theta_{00}$$
$$p(x_t = 1 \mid X_{t-2} = 1, X_{t-1} = 0) = \theta_{10}$$
$$p(x_t = 1 \mid X_{t-1} = 1) = \theta_1$$
We can represent this structure in a binary tree whose leaves are labeled $\theta_1$, $\theta_{10}$, and $\theta_{00}$.

Remark: Note that the source in Example 3 is not first-order Markov, but can be cast as a special case of a second-order Markov source. We denote by $P_{S,\theta}(x^n)$ the associated probability mass function on $x^n \in \mathcal{X}^n$.
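To close, a tiny sketch (our addition; the $\theta_s$ values are made up) of how the tree source of Example 3 assigns next-symbol probabilities by suffix matching:

```python
# Tree source of Example 3: S = {'00', '10', '1'}; theta_s = p(X_t = 1 | s).
theta = {'00': 0.1, '10': 0.4, '1': 0.7}

def prob_one(past):
    """p(X_t = 1 | past), with `past` a string x_1 ... x_{t-1} such as '0110'."""
    for length in range(1, len(past) + 1):
        suffix = past[-length:]
        if suffix in theta:
            return theta[suffix]
    raise ValueError("no suffix of the past lies in S")

print(prob_one('0110'))   # suffix '10' matches, so the answer is theta_10 = 0.4
print(prob_one('0111'))   # suffix '1' matches, so the answer is theta_1 = 0.7
```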


More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources STA 732: Inference Notes 10. Parameter Estimation from a Decision Theoretic Angle Other resources 1 Statistical rules, loss and risk We saw that a major focus of classical statistics is comparing various

More information

Lecture 4 Noisy Channel Coding

Lecture 4 Noisy Channel Coding Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem

More information

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010 Hidden Lecture 4: Hidden : An Introduction to Dynamic Decision Making November 11, 2010 Special Meeting 1/26 Markov Model Hidden When a dynamical system is probabilistic it may be determined by the transition

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Lecture 22: Error exponents in hypothesis testing, GLRT

Lecture 22: Error exponents in hypothesis testing, GLRT 10-704: Information Processing and Learning Spring 2012 Lecture 22: Error exponents in hypothesis testing, GLRT Lecturer: Aarti Singh Scribe: Aarti Singh Disclaimer: These notes have not been subjected

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

Dynamic Approaches: The Hidden Markov Model

Dynamic Approaches: The Hidden Markov Model Dynamic Approaches: The Hidden Markov Model Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Inference as Message

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Variational Inference II: Mean Field Method and Variational Principle Junming Yin Lecture 15, March 7, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3

More information

CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory. Tim Campbell CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm 1. Feedback does not increase the capacity. Consider a channel with feedback. We assume that all the recieved outputs are sent back immediately

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Gaussian Estimation under Attack Uncertainty

Gaussian Estimation under Attack Uncertainty Gaussian Estimation under Attack Uncertainty Tara Javidi Yonatan Kaspi Himanshu Tyagi Abstract We consider the estimation of a standard Gaussian random variable under an observation attack where an adversary

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

x log x, which is strictly convex, and use Jensen s Inequality:

x log x, which is strictly convex, and use Jensen s Inequality: 2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

More information

Lecture 6: Gaussian Channels. Copyright G. Caire (Sample Lectures) 157

Lecture 6: Gaussian Channels. Copyright G. Caire (Sample Lectures) 157 Lecture 6: Gaussian Channels Copyright G. Caire (Sample Lectures) 157 Differential entropy (1) Definition 18. The (joint) differential entropy of a continuous random vector X n p X n(x) over R is: Z h(x

More information

Lecture notes on statistical decision theory Econ 2110, fall 2013

Lecture notes on statistical decision theory Econ 2110, fall 2013 Lecture notes on statistical decision theory Econ 2110, fall 2013 Maximilian Kasy March 10, 2014 These lecture notes are roughly based on Robert, C. (2007). The Bayesian choice: from decision-theoretic

More information

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Probabilities Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: AI systems need to reason about what they know, or not know. Uncertainty may have so many sources:

More information

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 Please submit the solutions on Gradescope. EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 1. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code

More information

Chapter I: Fundamental Information Theory

Chapter I: Fundamental Information Theory ECE-S622/T62 Notes Chapter I: Fundamental Information Theory Ruifeng Zhang Dept. of Electrical & Computer Eng. Drexel University. Information Source Information is the outcome of some physical processes.

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Lecture 5 - Information theory

Lecture 5 - Information theory Lecture 5 - Information theory Jan Bouda FI MU May 18, 2012 Jan Bouda (FI MU) Lecture 5 - Information theory May 18, 2012 1 / 42 Part I Uncertainty and entropy Jan Bouda (FI MU) Lecture 5 - Information

More information

Lecture 1: Introduction, Entropy and ML estimation

Lecture 1: Introduction, Entropy and ML estimation 0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual

More information

Lecture 5: GPs and Streaming regression

Lecture 5: GPs and Streaming regression Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015 10-704 Homework 1 Due: Thursday 2/5/2015 Instructions: Turn in your homework in class on Thursday 2/5/2015 1. Information Theory Basics and Inequalities C&T 2.47, 2.29 (a) A deck of n cards in order 1,

More information

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get 18:2 1/24/2 TOPIC. Inequalities; measures of spread. This lecture explores the implications of Jensen s inequality for g-means in general, and for harmonic, geometric, arithmetic, and related means in

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

LECTURE 2. Convexity and related notions. Last time: mutual information: definitions and properties. Lecture outline

LECTURE 2. Convexity and related notions. Last time: mutual information: definitions and properties. Lecture outline LECTURE 2 Convexity and related notions Last time: Goals and mechanics of the class notation entropy: definitions and properties mutual information: definitions and properties Lecture outline Convexity

More information

Series 7, May 22, 2018 (EM Convergence)

Series 7, May 22, 2018 (EM Convergence) Exercises Introduction to Machine Learning SS 2018 Series 7, May 22, 2018 (EM Convergence) Institute for Machine Learning Dept. of Computer Science, ETH Zürich Prof. Dr. Andreas Krause Web: https://las.inf.ethz.ch/teaching/introml-s18

More information

The binary entropy function

The binary entropy function ECE 7680 Lecture 2 Definitions and Basic Facts Objective: To learn a bunch of definitions about entropy and information measures that will be useful through the quarter, and to present some simple but

More information

9 Forward-backward algorithm, sum-product on factor graphs

9 Forward-backward algorithm, sum-product on factor graphs Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 9 Forward-backward algorithm, sum-product on factor graphs The previous

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

Information Theory in Intelligent Decision Making

Information Theory in Intelligent Decision Making Information Theory in Intelligent Decision Making Adaptive Systems and Algorithms Research Groups School of Computer Science University of Hertfordshire, United Kingdom June 7, 2015 Information Theory

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Machine Learning Lecture Notes

Machine Learning Lecture Notes Machine Learning Lecture Notes Predrag Radivojac January 25, 205 Basic Principles of Parameter Estimation In probabilistic modeling, we are typically presented with a set of observations and the objective

More information

Chapter 4. Data Transmission and Channel Capacity. Po-Ning Chen, Professor. Department of Communications Engineering. National Chiao Tung University

Chapter 4. Data Transmission and Channel Capacity. Po-Ning Chen, Professor. Department of Communications Engineering. National Chiao Tung University Chapter 4 Data Transmission and Channel Capacity Po-Ning Chen, Professor Department of Communications Engineering National Chiao Tung University Hsin Chu, Taiwan 30050, R.O.C. Principle of Data Transmission

More information

Lecture 2: Statistical Decision Theory (Part I)

Lecture 2: Statistical Decision Theory (Part I) Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15

More information

Introduction to Statistical Learning Theory

Introduction to Statistical Learning Theory Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

Bioinformatics: Biology X

Bioinformatics: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA Model Building/Checking, Reverse Engineering, Causality Outline 1 Bayesian Interpretation of Probabilities 2 Where (or of what)

More information

Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

INTRODUCTION TO BAYESIAN METHODS II

INTRODUCTION TO BAYESIAN METHODS II INTRODUCTION TO BAYESIAN METHODS II Abstract. We will revisit point estimation and hypothesis testing from the Bayesian perspective.. Bayes estimators Let X = (X,..., X n ) be a random sample from the

More information

Lecture 7 October 13

Lecture 7 October 13 STATS 300A: Theory of Statistics Fall 2015 Lecture 7 October 13 Lecturer: Lester Mackey Scribe: Jing Miao and Xiuyuan Lu 7.1 Recap So far, we have investigated various criteria for optimal inference. We

More information

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows. Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage

More information

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

U Logo Use Guidelines

U Logo Use Guidelines Information Theory Lecture 3: Applications to Machine Learning U Logo Use Guidelines Mark Reid logo is a contemporary n of our heritage. presents our name, d and our motto: arn the nature of things. authenticity

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

1 Random Variable: Topics

1 Random Variable: Topics Note: Handouts DO NOT replace the book. In most cases, they only provide a guideline on topics and an intuitive feel. 1 Random Variable: Topics Chap 2, 2.1-2.4 and Chap 3, 3.1-3.3 What is a random variable?

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Lecture 1: Bayesian Framework Basics

Lecture 1: Bayesian Framework Basics Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of

More information