Lecture 2: Statistical Decision Theory


EE378A Statistical Signal Processing, Lecture 2 - 04/06/2017

Lecture 2: Statistical Decision Theory

Lecturer: Jiantao Jiao. Scribe: Andrew Hilger

In this lecture, we discuss a unified theoretical framework of statistics proposed by Abraham Wald,^1 which is named statistical decision theory.

1 Goals

1. Evaluation: The theoretical framework should aid fair comparisons between algorithms (e.g., maximum entropy vs. maximum likelihood vs. method of moments).
2. Achievability: The theoretical framework should be able to inspire the construction of statistical algorithms that are (nearly) optimal under the optimality criteria introduced in the framework.

2 Basic Elements of Statistical Decision Theory

1. Statistical experiment: a family of probability measures $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where $\theta$ is a parameter and $P_\theta$ is a probability distribution indexed by the parameter.
2. Data: $X \sim P_\theta$, where $X$ is a random variable observed for some parameter value $\theta$.
3. Objective: $g(\theta)$, e.g., inference on the entropy of the distribution $P_\theta$.
4. Decision rule: $\delta(X)$. The decision rule need not be deterministic; there can be a randomized decision rule with an associated conditional distribution $P_{\delta|X}$.
5. Loss function: $L(\theta, \delta)$. The loss function tells us how bad we feel about our decision once we find out the true value of the parameter $\theta$ chosen by nature.

Example: $P_\theta(x) = \frac{1}{\sqrt{2\pi}} e^{-(x-\theta)^2/2}$, $g(\theta) = \theta$, and $L(\theta, \delta) = (\theta - \delta)^2$. In other words, $X$ is normally distributed with mean $\theta$ and unit variance, $X \sim N(\theta, 1)$, and we are trying to estimate the mean $\theta$. We judge our success (or failure) using mean-square error.

3 Risk Function

Definition 1 (Risk Function).
$$R(\theta, \delta) \triangleq E[L(\theta, \delta(X))] \quad (1)$$
$$= \int L(\theta, \delta(x)) P_\theta(dx) \quad (2)$$
$$= \int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_\theta(dx). \quad (3)$$

^1 See Wald, Abraham. Statistical decision functions. In Breakthroughs in Statistics. Springer New York, 1992.
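To make Definition 1 concrete, here is a minimal simulation sketch (our addition, not part of the scribed notes) that approximates the risk numerically for the Gaussian example above, using the rule $\delta(x) = x$, whose exact risk is $R(\theta, \delta) = 1$ for every $\theta$:

```python
# Monte Carlo approximation of R(theta, delta) = E[(theta - delta(X))^2]
# for X ~ N(theta, 1); a sketch, not from the original notes.
import numpy as np

def monte_carlo_risk(delta, theta, num_trials=100_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=theta, scale=1.0, size=num_trials)
    return np.mean((theta - delta(x)) ** 2)

for theta in [-2.0, 0.0, 3.0]:
    print(theta, monte_carlo_risk(lambda x: x, theta))  # each is close to 1.0
```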

Figure 1: Example risk functions computed over a range of parameter values for two different decision rules.

A risk function evaluates a decision rule's success over a large number of experiments with a fixed parameter value $\theta$. By the law of large numbers, if we observe $X$ many times independently, the average empirical loss of the decision rule $\delta$ will converge to the risk $R(\theta, \delta)$.

Even after determining the risk functions of two decision rules, it may still be unclear which is better. Consider the example of Figure 1. Two different decision rules $\delta_1$ and $\delta_2$ result in two different risk functions $R(\theta, \delta_1)$ and $R(\theta, \delta_2)$ evaluated over different values of the parameter $\theta$. The first decision rule $\delta_1$ is inferior for low and high values of the parameter $\theta$ but is superior for the middle values. Thus, even after computing a risk function $R$, it can still be unclear which decision rule is better. We need new ideas to enable us to compare different decision rules.

4 Optimality Criteria for Decision Rules

Given the risk functions of various decision rules as functions of the parameter $\theta$, there are various approaches to determining which decision rule is optimal.

4.1 Restrict the Competitors

This is a traditional set of methods that were overshadowed by other approaches that we introduce later. A decision rule $\delta$ is eliminated (or, formally, is inadmissible) if there is another decision rule $\delta'$ that is strictly better, i.e., $R(\theta, \delta') \le R(\theta, \delta)$ for every $\theta \in \Theta$, with strict inequality for at least one $\theta_0 \in \Theta$. However, the problem is that many decision rules cannot be eliminated in this way, and we still lack a criterion to determine which one is better. To aid in selection, the rationale of restricting the competitors is that only decision rules that are members of a certain decision rule class $\mathcal{D}$ are considered. The advantage is that sometimes all but one decision rule in $\mathcal{D}$ is inadmissible, and we just use the only admissible one.

1. Example 1: the class of unbiased decision rules $\mathcal{D} = \{\delta : E[\delta(X)] = g(\theta), \forall \theta \in \Theta\}$.
2. Example 2: the class of invariant estimators.

However, a serious drawback of this approach is that $\mathcal{D}$ may be an empty set for various decision-theoretic problems.

4.2 Bayesian: Average Risk Optimality

The idea is to use averaging to reduce the risk function $R(\theta, \delta)$ to a single number for any given $\delta$.

Definition 2 (Average risk under prior $\Lambda(d\theta)$).
$$r(\Lambda, \delta) = \int R(\theta, \delta) \Lambda(d\theta) \quad (4)$$

Here $\Lambda$ is the prior distribution, a probability measure on $\Theta$. The Bayesians and the frequentists disagree about $\Lambda$; namely, the frequentists do not believe in the existence of the prior. However, there exist more justifications of the Bayesian approach than the interpretation of $\Lambda$ as prior belief: indeed, the complete class theorem in statistical decision theory asserts that in various decision-theoretic problems, all admissible decision rules can be approximated by Bayes estimators.^2

Definition 3 (Bayes estimator).
$$\delta_\Lambda = \arg\min_\delta r(\Lambda, \delta) \quad (5)$$

The Bayes estimator $\delta_\Lambda$ can usually be found via the posterior distribution. Note that
$$r(\Lambda, \delta) = \int\int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_\theta(dx) \Lambda(d\theta) \quad (6)$$
$$= \int \left( \int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_{\theta|X}(d\theta \mid x) \right) P_X(dx) \quad (7)$$
where $P_X(dx)$ is the marginal distribution of $X$ and $P_{\theta|X}(d\theta \mid x)$ is the posterior distribution of $\theta$ given $X$. In Equation (6), $P_\theta(dx)\Lambda(d\theta)$ is the joint distribution of $\theta$ and $X$. In Equation (7), we only have to minimize the portion in parentheses to minimize $r(\Lambda, \delta)$, because $P_X(dx)$ does not depend on $\delta$.

Theorem 4.^3 Under mild conditions,
$$\delta_\Lambda(x) = \arg\min_\delta E[L(\theta, \delta) \mid X = x] \quad (8)$$
$$= \arg\min_{P_{\delta|X}} \int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_{\theta|X}(d\theta \mid x) \quad (9)$$

Lemma 5. If $L(\theta, \delta)$ is convex in $\delta$, it suffices to consider deterministic rules $\delta(x)$.

Proof. Jensen's inequality:
$$\int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_{\theta|X}(d\theta \mid x) \ge \int L\left(\theta, \int \delta \, P_{\delta|X}(d\delta \mid x)\right) P_{\theta|X}(d\theta \mid x). \quad (10)$$

Examples.
1. $L(\theta, \delta) = (g(\theta) - \delta)^2 \Rightarrow \delta_\Lambda(x) = E[g(\theta) \mid X = x]$. In other words, the Bayes estimator under squared error loss is the conditional expectation of $g(\theta)$ given $x$.
2. $L(\theta, \delta) = |g(\theta) - \delta| \Rightarrow \delta_\Lambda(x)$ is any median of the posterior distribution $P_{\theta|X=x}$.
3. $L(\theta, \delta) = \mathbb{1}(g(\theta) \ne \delta) \Rightarrow \delta_\Lambda(x) = \arg\max_\theta P_{\theta|X}(\theta \mid x) = \arg\max_\theta P_\theta(x)\Lambda(\theta)$. In other words, the indicator loss function results in the maximum a posteriori (MAP) estimator.^4

^2 See Chapter 3 of Friedrich Liese and Klaus-J. Miescke. Statistical Decision Theory: Estimation, Testing, and Selection. Springer (2009).

^3 See Theorem 1.1, Chapter 4 of Lehmann, E.L. and Casella, G. Theory of Point Estimation. Springer Science & Business Media.

^4 Next week, we will cover special cases of $P_\theta$ and how to compute the Bayes estimator in a computationally efficient way. In the general case, however, computing the posterior distribution may be difficult.
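As a quick illustration of Example 1 (our addition, with an assumed conjugate prior that is not part of the scribed notes), take $\theta \sim N(0, \tau^2)$ and $X \mid \theta \sim N(\theta, 1)$; then the posterior mean has the closed form $E[\theta \mid X = x] = \tau^2 x / (1 + \tau^2)$, and its average risk $\tau^2/(1+\tau^2)$ beats that of $\delta(x) = x$:

```python
# Sketch: Bayes estimator under squared error = posterior mean (Example 1).
# Assumes the conjugate pair theta ~ N(0, tau^2), X | theta ~ N(theta, 1).
import numpy as np

rng = np.random.default_rng(1)
tau2 = 4.0
theta = rng.normal(0.0, np.sqrt(tau2), size=200_000)   # theta ~ Lambda
x = theta + rng.normal(size=theta.shape)               # X | theta ~ N(theta, 1)

delta_bayes = tau2 * x / (1.0 + tau2)                  # posterior mean E[theta | X]
print("average risk of Bayes rule:", np.mean((theta - delta_bayes) ** 2))  # ~0.8
print("average risk of delta(x)=x:", np.mean((theta - x) ** 2))            # ~1.0
```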

4.3 Frequentist: Worst-Case Optimality (Minimax)

Definition 6 (Minimax estimator). The decision rule $\delta^*$ is minimax among all decision rules in $\mathcal{D}$ iff
$$\sup_{\theta \in \Theta} R(\theta, \delta^*) = \inf_{\delta \in \mathcal{D}} \sup_{\theta \in \Theta} R(\theta, \delta). \quad (11)$$

4.3.1 First observation

Since
$$R(\theta, \delta) = \int\int L(\theta, \delta) P_{\delta|X}(d\delta \mid x) P_\theta(dx) \quad (12)$$
is linear in $P_{\delta|X}$, it is a convex function of the (randomized) decision rule. The supremum of convex functions is convex, so finding the optimal decision rule is a convex optimization problem. However, solving this convex optimization problem may be computationally intractable. For example, it may not even be computationally tractable to compute the supremum of the risk function $R(\theta, \delta)$ over $\theta \in \Theta$. Hence, finding the exact minimax estimator is usually hard.

4.3.2 Second observation

Due to the difficulty of finding the exact minimax estimator, we turn to another goal: we wish to find an estimator $\delta^*$ such that
$$\inf_\delta \sup_\theta R(\theta, \delta) \le \sup_\theta R(\theta, \delta^*) \le c \cdot \inf_\delta \sup_\theta R(\theta, \delta) \quad (13)$$
where $c > 1$ is a constant. The left inequality is trivially true. For the right inequality, in practice one can usually choose some specific $\delta^*$ and evaluate an upper bound on $\sup_\theta R(\theta, \delta^*)$ explicitly. However, it remains to find a lower bound on $\inf_\delta \sup_\theta R(\theta, \delta)$. To solve the problem (and save the world), we can use the minimax theorem.

Theorem 7 (Minimax Theorem (Sion-Kakutani)). Let $\Lambda, X$ be two compact, convex sets in some topological vector spaces. Let $H(\lambda, x) : \Lambda \times X \to \mathbb{R}$ be a continuous function such that:
1. $H(\lambda, \cdot)$ is convex for any fixed $\lambda \in \Lambda$;
2. $H(\cdot, x)$ is concave for any fixed $x \in X$.
Then:
1. Strong duality: $\max_\lambda \min_x H(\lambda, x) = \min_x \max_\lambda H(\lambda, x)$;
2. Existence of a saddle point $(\lambda^*, x^*)$: $H(\lambda, x^*) \le H(\lambda^*, x^*) \le H(\lambda^*, x)$ for all $\lambda \in \Lambda$, $x \in X$.

The existence of a saddle point implies strong duality. We note that, other than strong duality, the following weak duality is always true without assumptions on $H$:
$$\sup_\lambda \inf_x H(\lambda, x) \le \inf_x \sup_\lambda H(\lambda, x). \quad (14)$$

We define the quantity $r_\Lambda \triangleq \inf_\delta r(\Lambda, \delta)$ as the Bayes risk under prior distribution $\Lambda$. We have the following line of argument using weak duality:
$$\inf_\delta \sup_\theta R(\theta, \delta) = \inf_\delta \sup_\Lambda r(\Lambda, \delta) \quad (15)$$
$$\ge \sup_\Lambda \inf_\delta r(\Lambda, \delta) \quad (16)$$
$$= \sup_\Lambda r_\Lambda \quad (17)$$
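To make the chain (15)-(17) concrete, here is a standard worked example (our addition, not in the scribed notes) in the Gaussian setting of Section 2:

```latex
% Worked example: the minimax risk for Gaussian mean estimation.
Let $X \sim N(\theta, 1)$, $L(\theta,\delta) = (\theta-\delta)^2$, and take the
prior $\Lambda = N(0, \tau^2)$. The Bayes estimator is the posterior mean, with
Bayes risk
\[
  r_\Lambda = \inf_\delta r(\Lambda, \delta) = \frac{\tau^2}{1 + \tau^2}.
\]
By (17), $\inf_\delta \sup_\theta R(\theta, \delta) \ge \tau^2/(1+\tau^2)$ for
every $\tau^2 > 0$; letting $\tau^2 \to \infty$ shows that the minimax risk is
at least $1$. Since $\delta(x) = x$ achieves $R(\theta, \delta) \equiv 1$, it
is minimax.
```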

Equation (17) gives us a strong tool for lower bounding the minimax risk: for any prior distribution $\Lambda$, the Bayes risk under $\Lambda$ is a lower bound on the corresponding minimax risk. When the condition of the minimax theorem is satisfied (which may be expected due to the bilinearity of $r(\Lambda, \delta)$ in the pair $(\Lambda, \delta)$), equation (16) achieves equality, which shows that there exists a sequence of priors such that the corresponding Bayes risk sequence converges to the minimax risk. In practice, it suffices to choose some appropriate prior distribution $\Lambda$ in order to obtain a (nearly) minimax estimator.

5 Decision-Theoretic Interpretation of Maximum Entropy

Definition 8 (Logarithmic Loss). For a distribution $P$ over $\mathcal{X}$ and $x \in \mathcal{X}$,
$$L(x, P) \triangleq \log \frac{1}{P(x)}. \quad (18)$$

Note that $P(x)$ is the discrete mass on $x$ in the discrete case, and is the density at $x$ in the continuous case. By definition, if $P(x)$ is small, then the loss is large.

Now we see how the maximum likelihood estimator reduces to so-called maximum entropy under the logarithmic loss function. Specifically, suppose we have functions $r_1(x), \ldots, r_k(x)$ and compute the following empirical averages:
$$\beta_i \triangleq \frac{1}{n} \sum_{j=1}^n r_i(x_j) = E_{P_n}[r_i(X)], \quad i = 1, \ldots, k \quad (19)$$
where $P_n$ denotes the empirical distribution based on $n$ samples $x_1, \ldots, x_n$. By the law of large numbers, it is expected that $\beta_i$ is close to the true expectation under $P$, so we may restrict the true distribution $P$ to the following set:
$$\mathcal{P} = \{P : E_P[r_i(X)] = \beta_i, \ 1 \le i \le k\}. \quad (20)$$
Also, denote by $\mathcal{Q}$ the set of all probability distributions which belong to the exponential family with central statistics $r_1(x), \ldots, r_k(x)$, i.e.,
$$\mathcal{Q} = \left\{Q : Q(x) = \frac{1}{Z(\lambda)} e^{\sum_{i=1}^k \lambda_i r_i(x)}, \ x \in \mathcal{X}\right\} \quad (21)$$
where $Z(\lambda)$ is the normalization factor.

Now we consider two estimators of $P$. The first one is the maximum likelihood estimator over $\mathcal{Q}$, i.e.,
$$\hat{P}_{ML} = \arg\max_{P \in \mathcal{Q}} \frac{1}{n} \sum_{j=1}^n \log P(x_j). \quad (22)$$
The second one is the maximum entropy estimator over $\mathcal{P}$, i.e.,
$$\hat{P}_{ME} = \arg\max_{P \in \mathcal{P}} H(P) \quad (23)$$
where $H(P)$ is the entropy of $P$, which is $H(P) = -\sum_{x \in \mathcal{X}} P(x) \log P(x)$ in the discrete case and $H(P) = -\int_{\mathcal{X}} P(x) \log P(x) \, dx$ in the continuous case.

The key result is as follows:^5

Theorem 9. In the above setting, we have $\hat{P}_{ML} = \hat{P}_{ME}$.

^5 Berger, Adam L., Vincent J. Della Pietra, and Stephen A. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics 22.1 (1996).

Proof. We begin with the following lemma, which is the key motivation for the use of the cross-entropy loss in machine learning.

Lemma 10. The true distribution minimizes the expected logarithmic loss:
$$P = \arg\min_Q E_P[L(x, Q)] \quad (24)$$
and the corresponding minimum $E_P[L(x, P)]$ is the entropy of $P$.

Proof. The assertion follows from the non-negativity of the KL divergence: $D(P \| Q) \ge 0$, where $D(P \| Q) = E_P\left[\ln \frac{dP}{dQ}\right]$.

Equipped with this lemma, and noting that $E_P\left[\log \frac{1}{Q(x)}\right]$ is linear in $P$ and convex in $Q$, by the minimax theorem we have
$$\max_{P \in \mathcal{P}} H(P) = \max_{P \in \mathcal{P}} \min_Q E_P\left[\log \frac{1}{Q(x)}\right] = \min_Q \max_{P \in \mathcal{P}} E_P\left[\log \frac{1}{Q(x)}\right]. \quad (25)$$
In particular, we know that $(\hat{P}_{ME}, \hat{P}_{ME})$ is a saddle point in $\mathcal{P} \times \mathcal{M}$, where $\mathcal{M}$ denotes the set of all probability measures on $\mathcal{X}$.

We first take a look at $\hat{P}_{ME}$, which maximizes the LHS. By variational methods,^6 it is easy to see that the entropy-maximizing distribution $\hat{P}_{ME}$ belongs to the exponential family $\mathcal{Q}$, i.e.,
$$\hat{P}_{ME}(x) = \frac{1}{Z(\lambda)} e^{\sum_{i=1}^k \lambda_i r_i(x)} \quad (26)$$
for some parameters $\lambda_1, \ldots, \lambda_k$. Now, by the definition of saddle points, for any $Q \in \mathcal{M}$ we have
$$E_{\hat{P}_{ME}}\left[\log \frac{1}{\hat{P}_{ME}(x)}\right] \le E_{\hat{P}_{ME}}\left[\log \frac{1}{Q(x)}\right] \quad (27)$$
while for any $Q \in \mathcal{Q}$ with possibly different coefficients $\lambda'$,
$$E_{\hat{P}_{ME}}\left[\log \frac{1}{Q(x)}\right] = E_{\hat{P}_{ME}}\left[\log Z(\lambda') - \sum_i \lambda'_i r_i(x)\right] \quad \text{[since } Q \in \mathcal{Q}\text{]} \quad (28)$$
$$= \log Z(\lambda') - \sum_i \lambda'_i \beta_i \quad \text{[since } \hat{P}_{ME} \in \mathcal{P}\text{]} \quad (29)$$
$$= E_{P_n}\left[\log Z(\lambda') - \sum_i \lambda'_i r_i(x)\right] \quad \text{[since } P_n \in \mathcal{P}\text{]} \quad (30)$$
$$= E_{P_n}\left[\log \frac{1}{Q(x)}\right] \quad (31)$$
$$= \frac{1}{n} \sum_{j=1}^n \log \frac{1}{Q(x_j)}. \quad (32)$$
As a result, the saddle point condition reduces to
$$\frac{1}{n} \sum_{j=1}^n \log \frac{1}{\hat{P}_{ME}(x_j)} \le \frac{1}{n} \sum_{j=1}^n \log \frac{1}{Q(x_j)}, \quad \forall Q \in \mathcal{Q} \quad (33)$$
establishing the fact that $\hat{P}_{ME} = \hat{P}_{ML}$.

^6 See Chapter 12 of Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons.
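The following small numerical sketch (our addition; the alphabet, statistic, and data are made up for illustration) exhibits Theorem 9 on a finite alphabet: the entropy-maximizing distribution subject to $E_P[r(X)] = \beta$ is the member of the exponential family $Q_\lambda(x) \propto e^{\lambda r(x)}$ whose moment matches $\beta$:

```python
# Sketch of Theorem 9 with one constraint function r(x) = x on {0,...,4}.
import numpy as np

alphabet = np.arange(5)
r = alphabet.astype(float)
samples = np.array([0, 1, 1, 2, 4, 4, 4, 3])   # hypothetical data
beta = r[samples].mean()                       # empirical moment E_{P_n}[r(X)]

def moment(lam):
    q = np.exp(lam * r)
    q /= q.sum()
    return q @ r

# moment(lam) is increasing in lam, so solve moment(lam) = beta by bisection.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if moment(mid) < beta else (lo, mid)
q = np.exp(0.5 * (lo + hi) * r)
q /= q.sum()
print("fitted moment:", q @ r, "| target beta:", beta)
print("max-entropy / ML distribution:", q)
```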

EE378A Statistical Signal Processing, Lecture 3 - 04/11/2017

Lecture 3: State Estimation in Hidden Markov Processes

Lecturer: Tsachy Weissman. Scribes: Alexandros Anemogiannis, Vighnesh Sachidananda

(Reading: Ephraim & Merhav 2002; lectures 9, 10 of the 2016 notes.)

In this lecture, we define Hidden Markov Processes (HMPs) as well as the forward-backward recursion method that efficiently estimates the posteriors in HMPs.

1 Markov Chain

We begin by introducing the Markov triplet, which is useful in the design and analysis of algorithms for state estimation in Hidden Markov Processes. Let $X, Y, Z$ denote discrete random variables. We say that $X$, $Y$, and $Z$ form a Markov triplet (which we denote as $X - Y - Z$) given the following relationships:
$$X - Y - Z \iff p(x, z \mid y) = p(x \mid y) \, p(z \mid y) \quad (1)$$
$$\iff p(z \mid x, y) = p(z \mid y) \quad (2)$$

If $X, Y, Z$ form a Markov triplet, their joint distribution enjoys the important property that it can be factored into the product of two functions $\phi_1$ and $\phi_2$.

Lemma 1. $X - Y - Z \iff \exists \, \phi_1, \phi_2$ s.t. $p(x, y, z) = \phi_1(x, y) \phi_2(y, z)$.

Proof. (Necessity) Assume that $X - Y - Z$. Then by the definition of the Markov chain, we have
$$p(x, y, z) = p(x \mid y, z) \, p(y, z) = p(x \mid y) \, p(y, z).$$
Taking $\phi_1(x, y) = p(x \mid y)$ and $\phi_2(y, z) = p(y, z)$ proves this direction.

(Sufficiency) Now suppose the functions $\phi_1, \phi_2$ exist as mentioned in the statement. Then we have:
$$p(x \mid y, z) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x, y, z)}{\sum_{\tilde{x}} p(\tilde{x}, y, z)} = \frac{\phi_1(x, y) \phi_2(y, z)}{\phi_2(y, z) \sum_{\tilde{x}} \phi_1(\tilde{x}, y)} = \frac{\phi_1(x, y)}{\sum_{\tilde{x}} \phi_1(\tilde{x}, y)}.$$
The final expression shows that $p(x \mid y, z)$ is not a function of $z$; hence $X - Y - Z$ follows by definition.

2 Hidden Markov Processes

Based on our understanding of the Markov triplet, we define the Hidden Markov Process (HMP) as follows:

Definition 2 (Hidden Markov Process). The process $\{(X_n, Y_n)\}_{n \ge 1}$ is a Hidden Markov Process if

1. $\{X_n\}_{n \ge 1}$ is a Markov process, i.e.,
$$\forall n: \quad p(x^n) = \prod_{t=1}^n p(x_t \mid x_{t-1}),$$
where by convention we take $X_0 \equiv 0$ and hence $p(x_1 \mid x_0) = p(x_1)$.

2. $\{Y_n\}_{n \ge 1}$ is determined by a memoryless channel, i.e.,
$$\forall n: \quad p(y^n \mid x^n) = \prod_{t=1}^n p(y_t \mid x_t).$$

The processes $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ are called the state process and the noisy observations, respectively.

In view of the above definition, we can derive the joint distribution of the state and the observation as follows:
$$p(x^n, y^n) = p(x^n) \, p(y^n \mid x^n) = \prod_{t=1}^n p(x_t \mid x_{t-1}) \, p(y_t \mid x_t).$$

The main problem in Hidden Markov Models is to compute the posterior probability of the state at any time, given all the observations up to that time, i.e., $p(x_t \mid y^t)$. The naive approach is to simply use the definition of conditional probability:
$$p(x_t \mid y^t) = \frac{p(x_t, y^t)}{p(y^t)} = \frac{\sum_{x^{t-1}} p(y^t \mid x^{t-1}, x_t) \, p(x^{t-1}, x_t)}{\sum_{\tilde{x}_t} \sum_{x^{t-1}} p(y^t \mid x^{t-1}, \tilde{x}_t) \, p(x^{t-1}, \tilde{x}_t)}.$$
The above approach needs exponentially many computations for both the numerator and the denominator. To avoid this problem, we develop an efficient method for computing the posterior probabilities using forward recursion. Before getting to the algorithm, we establish some conditional independence relations.

3 Conditional Independence Relations

We introduce some conditional independence relations which help us find more efficient ways of calculating the posterior.

(a) $(X^{t-1}, Y^t) - X_t - (X_{t+1}^n, Y_{t+1}^n)$:

Proof. We have
$$p(x^n, y^n) = \prod_{i=1}^n p(x_i \mid x_{i-1}) \, p(y_i \mid x_i) = \underbrace{\left[\prod_{i=1}^t p(x_i \mid x_{i-1}) \, p(y_i \mid x_i)\right]}_{\phi_1(x_t, \, (x^{t-1}, y^t))} \underbrace{\left[\prod_{i=t+1}^n p(x_i \mid x_{i-1}) \, p(y_i \mid x_i)\right]}_{\phi_2(x_t, \, (x_{t+1}^n, y_{t+1}^n))}$$
The statement then follows from Lemma 1.

(b) $(X^{t-1}, Y^{t-1}) - X_t - (X_{t+1}^n, Y_t^n)$:

Proof. Note that if $\{(X_i, Y_i)\}_{i=1}^n$ is a Hidden Markov Process, then the time-reversed process $\{(X_{n-i+1}, Y_{n-i+1})\}_{i=1}^n$ is also a Hidden Markov Process. Applying (a) to this reversed process proves (b).

(c) $X_t - (X_{t+1}, Y^t) - (X_{t+2}^n, Y_{t+1}^n)$:

Proof. Left as an exercise.

4 Causal Inference via Forward Recursion

We now derive the forward recursion algorithm as an efficient method to sequentially compute the causal posterior probabilities. Note that we have
$$p(x_t \mid y^t) = \frac{p(x_t, y^t)}{\sum_{\tilde{x}_t} p(\tilde{x}_t, y^t)} = \frac{p(x_t, y^{t-1}) \, p(y_t \mid x_t, y^{t-1})}{\sum_{\tilde{x}_t} p(\tilde{x}_t, y^{t-1}) \, p(y_t \mid \tilde{x}_t, y^{t-1})} = \frac{p(y^{t-1}) \, p(x_t \mid y^{t-1}) \, p(y_t \mid x_t)}{p(y^{t-1}) \sum_{\tilde{x}_t} p(\tilde{x}_t \mid y^{t-1}) \, p(y_t \mid \tilde{x}_t)} = \frac{p(x_t \mid y^{t-1}) \, p(y_t \mid x_t)}{\sum_{\tilde{x}_t} p(\tilde{x}_t \mid y^{t-1}) \, p(y_t \mid \tilde{x}_t)} \quad \text{(by (b))}$$

Define $\alpha_t(x_t) = p(x_t \mid y^t)$ and $\beta_t(x_t) = p(x_t \mid y^{t-1})$; then the above computation can be summarized as
$$\alpha_t(x_t) = \frac{\beta_t(x_t) \, p(y_t \mid x_t)}{\sum_{\tilde{x}_t} \beta_t(\tilde{x}_t) \, p(y_t \mid \tilde{x}_t)}. \quad (3)$$

Similarly, we can write $\beta$ as
$$\beta_{t+1}(x_{t+1}) = p(x_{t+1} \mid y^t) = \sum_{x_t} p(x_{t+1}, x_t \mid y^t) = \sum_{x_t} p(x_t \mid y^t) \, p(x_{t+1} \mid y^t, x_t) = \sum_{x_t} p(x_t \mid y^t) \, p(x_{t+1} \mid x_t) \quad \text{(by (a))}$$
which can be summarized as
$$\beta_{t+1}(x_{t+1}) = \sum_{x_t} \alpha_t(x_t) \, p(x_{t+1} \mid x_t). \quad (4)$$

Equations (3) and (4) indicate that $\alpha_t$ and $\beta_t$ can be sequentially computed from each other for $t = 1, \ldots, n$ with the initialization $\beta_1(x_1) = p(x_1)$. Hence, with this simple algorithm, causal inference can be done efficiently in terms of both computation and memory. This is called the forward recursion.

5 Non-causal Inference via Backward Recursion

Now suppose that we are interested in finding $p(x_t \mid y^n)$ for $t = 1, 2, \ldots, n$ in a non-causal manner. For this purpose, we can derive a backward recursion algorithm as follows:
$$p(x_t \mid y^n) = \sum_{x_{t+1}} p(x_t, x_{t+1} \mid y^n) = \sum_{x_{t+1}} p(x_{t+1} \mid y^n) \, p(x_t \mid x_{t+1}, y^t) \quad \text{(by (c))}$$
$$= \sum_{x_{t+1}} p(x_{t+1} \mid y^n) \, \frac{p(x_t \mid y^t) \, p(x_{t+1} \mid x_t)}{p(x_{t+1} \mid y^t)}. \quad \text{(by (a))}$$

Now let $\gamma_t(x_t) = p(x_t \mid y^n)$. Then the above equation can be summarized as
$$\gamma_t(x_t) = \sum_{x_{t+1}} \gamma_{t+1}(x_{t+1}) \, \frac{\alpha_t(x_t) \, p(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})}. \quad (5)$$

Equation (5) indicates that $\gamma_t$ can be recursively computed from the $\alpha_t$'s and $\beta_t$'s for $t = n-1, n-2, \ldots, 1$, with the initialization $\gamma_n(x_n) = \alpha_n(x_n)$. This is called the backward recursion.
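As a sanity check on the recursions (3)-(5), here is a self-contained numpy sketch (our addition; the toy chain and channel are made-up parameters) that runs the full forward-backward pass for a finite-alphabet HMP:

```python
# Forward-backward recursions (3)-(5) for a finite-alphabet HMP.
# trans[i, j] = p(x_{t+1} = j | x_t = i); emit[i, k] = p(y_t = k | x_t = i).
import numpy as np

def forward_backward(pi1, trans, emit, y):
    n, S = len(y), len(pi1)
    alpha = np.zeros((n, S))   # alpha[t] ~ p(x_{t+1} | y^{t+1}), 0-based t
    beta = np.zeros((n, S))    # beta[t]  ~ p(x_{t+1} | y^t)
    beta[0] = pi1                                  # beta_1(x_1) = p(x_1)
    for t in range(n):
        unnorm = beta[t] * emit[:, y[t]]           # measurement update (3)
        alpha[t] = unnorm / unnorm.sum()
        if t + 1 < n:
            beta[t + 1] = alpha[t] @ trans         # time update (4)
    gamma = np.zeros((n, S))
    gamma[n - 1] = alpha[n - 1]                    # gamma_n = alpha_n
    for t in range(n - 2, -1, -1):                 # backward recursion (5)
        gamma[t] = (trans * alpha[t][:, None] / beta[t + 1][None, :]) @ gamma[t + 1]
    return alpha, beta, gamma

# Toy binary state chain observed through a binary asymmetric channel:
pi1 = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
emit = np.array([[0.8, 0.2], [0.3, 0.7]])
alpha, beta, gamma = forward_backward(pi1, trans, emit, y=[0, 1, 1, 0])
print(gamma)   # row t is the posterior p(x_{t+1} = . | y^n)
```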

EE378A Statistical Signal Processing, Lecture 4 - 04/13/2017

Lecture 4: State Estimation in Hidden Markov Models (cont.)

Lecturer: Tsachy Weissman. Scribe: David Wugofski

In this lecture we build on our previous analysis of state estimation for Hidden Markov Models to create an algorithm for estimating the posterior distribution of the states given the observations.

1 Notation

A quick summary of the notation:

1. Random variables (objects), used more loosely: $X, Y, U, V$
2. Alphabets: $\mathcal{X}, \mathcal{Y}, \mathcal{U}, \mathcal{V}$
3. Specific realizations of random variables: $x, y, u, v$
4. Vector notation: for a vector $(x_1, \ldots, x_n)$, we denote the vector of values from the $i$-th component to the $j$-th component (inclusive) as $x_i^j$, with the convention $x^i \triangleq x_1^i$.
5. Markov chains: we use the standard Markov chain notation $X - Y - Z$ to denote that $X, Y, Z$ form a Markov triplet.

For a discrete random variable (object), we write $p(x)$ for the pmf $P(X = x)$, and similarly $p(x, y)$ for $p_{X,Y}(x, y)$ and $p(y \mid x)$ for $p_{Y|X}(y \mid x)$, etc.

2 Recap of HMP

For a set of unknown states $\{X_n\}_{n \ge 1}$ of a Markov process and noisy observations $\{Y_n\}_{n \ge 1}$ of those states taken through some memoryless channel, we can express the joint distribution of the states and outputs as
$$p(x^n, y^n) = \prod_{t=1}^n p(x_t \mid x_{t-1}) \prod_{t=1}^n p(y_t \mid x_t) \quad (1)$$
where $x_0$ is taken to be some fixed arbitrary value (e.g., 0). Additionally, we have the following principles for this setting:
$$(X^{t-1}, Y^t) - X_t - (X_{t+1}^n, Y_{t+1}^n) \quad \text{(a)}$$
$$(X^{t-1}, Y^{t-1}) - X_t - (X_{t+1}^n, Y_t^n) \quad \text{(b)}$$
$$X_t - (X_{t+1}, Y^t) - (X_{t+2}^n, Y_{t+1}^n) \quad \text{(c)}$$

In the last lecture, through the forward-backward recursion, we derived the following updates for computing various posteriors:
$$\alpha_t(x_t) = p_{X_t|Y^t}(x_t \mid y^t) = \frac{\beta_t(x_t) \, p_{Y_t|X_t}(y_t \mid x_t)}{\sum_{\tilde{x}_t} \beta_t(\tilde{x}_t) \, p_{Y_t|X_t}(y_t \mid \tilde{x}_t)} \quad \text{(Measurement Update)}$$
$$\beta_{t+1}(x_{t+1}) = p_{X_{t+1}|Y^t}(x_{t+1} \mid y^t) = \sum_{x_t} \alpha_t(x_t) \, p_{X_{t+1}|X_t}(x_{t+1} \mid x_t) \quad \text{(Time Update)}$$
$$\gamma_t(x_t) = p_{X_t|Y^n}(x_t \mid y^n) = \sum_{x_{t+1}} \gamma_{t+1}(x_{t+1}) \, \frac{\alpha_t(x_t) \, p_{X_{t+1}|X_t}(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})} \quad \text{(Backward Recursion)}$$

3 Factor Graphs

Let us derive a graphical representation of the joint distribution expressed in (1) as follows:

1. Create a node on the graph for each random variable: $X_1, \ldots, X_n, Y_1, \ldots, Y_n$.
2. For each pair of nodes, create an edge if there is some factor in (1) which contains both variables.

This form of graph is called a factor graph.

Example: Figure 1 shows the factor graph for the hidden Markov process in our setting.

Figure 1: The factor graph for our hidden Markov process

For any partitioning of the graph into three node sets $S_1$, $S_2$, and $S_3$ such that any path from $S_1$ to $S_3$ passes through some node in $S_2$, these sets form a Markov triplet $S_1 - S_2 - S_3$ (to see this, simply split the joint pmf into factors). Hence, for any $s_i \in S_i$, $i = 1, 2, 3$, we have
$$p(s_1, s_3 \mid s_2) = \phi_1(s_1, s_2) \, \phi_2(s_2, s_3) \quad (2)$$
for some functions $\phi_1, \phi_2$ which consist of the factors used to form the factor graph.

Using the factor graph, it is quite easy to establish principles (a)-(c) via graphs. For example, principles (a) and (b) can be proved via Figures 2 and 3 shown below.

Figure 2: The factor graph partition for (a)

Figure 3: The factor graph partition for (b)

We can also construct an additional principle from the factor graph partitioning presented in Figure 4:
$$X^{t-1} - (X_t, Y^n) - X_{t+1}^n \quad \text{(d)}$$

Figure 4: The graphical proof of (d)

4 General Forward-Backward Recursion: Module Functions

Let us consider a generic discrete random variable $U \sim p_U$ passed through a memoryless channel characterized by $p_{V|U}$ to get a discrete random variable $V$. By Bayes' rule we know
$$p_{U|V}(u \mid v) = \frac{p_U(u) \, p_{V|U}(v \mid u)}{\sum_{\tilde{u}} p_U(\tilde{u}) \, p_{V|U}(v \mid \tilde{u})}. \quad (3)$$

If we create the probability vector of $p_{U|V}$ for each value of $U$, we can define a vector function $F$ s.t.
$$\{p_{U|V}(u \mid v)\}_u = \left\{\frac{p_U(u) \, p_{V|U}(v \mid u)}{\sum_{\tilde{u}} p_U(\tilde{u}) \, p_{V|U}(v \mid \tilde{u})}\right\}_u \triangleq F(p_U, p_{V|U}, v). \quad (4)$$
Additionally, overloading the notation of $F$, we can define the matrix function $F$ as
$$F(p_U, p_{V|U}) = \{F(p_U, p_{V|U}, v)\}_v. \quad (5)$$
This function $F$ acts as a module function for calculating the posterior distribution of a random variable given a prior distribution and a channel distribution.

In addition to $F$, we can write a module function for calculating the marginal distribution of the output given a prior distribution on the input and a channel distribution. We label this vector function $G$ and define it as follows:
$$p_V = \{p_V(v)\}_v = \left\{\sum_{\tilde{u}} p_U(\tilde{u}) \, p_{V|U}(v \mid \tilde{u})\right\}_v \triangleq G(p_U, p_{V|U}). \quad (6)$$

With these module functions in mind, we can express our recursions as follows:
$$\alpha_t = F(\beta_t, p_{Y_t|X_t}, y_t) \quad \text{(Measurement Update)}$$
$$\beta_{t+1} = G(\alpha_t, p_{X_{t+1}|X_t}) \quad \text{(Time Update)}$$
$$\gamma_t = G(\gamma_{t+1}, F(\alpha_t, p_{X_{t+1}|X_t})) \quad \text{(Backward Recursion)}$$
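The module functions are only a few lines in code. A compact sketch (our addition; the row-stochastic matrix convention `ch[u, v] = p(v | u)` is our choice, not the notes'):

```python
# Module functions F and G of (4)-(6), with channels as row-stochastic matrices.
import numpy as np

def F(p_u, ch, v):
    """Posterior p(. | v) via Bayes' rule (3), for a single observed v."""
    unnorm = p_u * ch[:, v]
    return unnorm / unnorm.sum()

def F_matrix(p_u, ch):
    """Overloaded F of (5): column v holds the posterior given V = v."""
    return np.stack([F(p_u, ch, v) for v in range(ch.shape[1])], axis=1)

def G(p_u, ch):
    """Marginal of the channel output, as in (6)."""
    return p_u @ ch

# The recursions then read exactly as above; for one time step:
#   alpha_t    = F(beta_t, emit, y_t)                        (measurement update)
#   beta_{t+1} = G(alpha_t, trans)                           (time update)
#   gamma_t    = G(gamma_{t+1}, F_matrix(alpha_t, trans).T)  (backward recursion)
# since F_matrix(alpha_t, trans)[x_t, x_{t+1}] = p(x_t | x_{t+1}, y^t), whose
# transpose is the reverse channel from X_{t+1} back to X_t.
```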

5 State Estimation: the Viterbi Algorithm

Now we are ready to derive the joint posterior distribution $p(x^n \mid y^n)$:
$$p(x^n \mid y^n) = p(x_n \mid y^n) \prod_{t=1}^{n-1} p(x_t \mid x_{t+1}^n, y^n)$$
$$= p(x_n \mid y^n) \prod_{t=1}^{n-1} p(x_t \mid x_{t+1}, y^n) \quad \text{(by principle (d))}$$
$$= p(x_n \mid y^n) \prod_{t=1}^{n-1} p(x_t \mid x_{t+1}, y^t) \quad \text{(by principle (c))}$$
$$= p(x_n \mid y^n) \prod_{t=1}^{n-1} \frac{p(x_t \mid y^t) \, p(x_{t+1} \mid x_t, y^t)}{p(x_{t+1} \mid y^t)} \quad \text{(by principle (a))}$$
$$= \gamma_n(x_n) \prod_{t=1}^{n-1} \frac{\alpha_t(x_t) \, p(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})}$$

Writing
$$\ln p(x^n \mid y^n) = \sum_{t=1}^n g_t(x_t, x_{t+1}) \quad \text{with} \quad g_t(x_t, x_{t+1}) \triangleq \begin{cases} \ln \frac{\alpha_t(x_t) \, p(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})} & t = 1, \ldots, n-1 \\ \ln \gamma_n(x_n) & t = n \end{cases}$$
we can solve for the MAP estimator $\hat{x}^{MAP}$ with the help of the following definition:

Definition 1. For $1 \le k \le n$, let
$$M_k(x_k) := \max_{x_{k+1}^n} \sum_{t=k}^n g_t(x_t, x_{t+1}).$$

It is straightforward from the definition that $M_1(\hat{x}_1^{MAP}) = \ln p(\hat{x}^{MAP} \mid y^n) = \max_{x^n} \ln p(x^n \mid y^n)$ and
$$M_k(x_k) = \max_{x_{k+1}} \max_{x_{k+2}^n} \left( g_k(x_k, x_{k+1}) + \sum_{t=k+1}^n g_t(x_t, x_{t+1}) \right) = \max_{x_{k+1}} \left( g_k(x_k, x_{k+1}) + \max_{x_{k+2}^n} \sum_{t=k+1}^n g_t(x_t, x_{t+1}) \right) = \max_{x_{k+1}} \left( g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}) \right).$$

Since $M_n(x_n) = g_n(x_n, x_{n+1}) = \ln \gamma_n(x_n)$ depends only on the single term $x_n$, we may start from $k = n$ and use the recursive formula above to obtain $M_1(x_1)$. This is called the Viterbi algorithm.

The Bellman equations, referred to in the communication setting as the Viterbi algorithm, are an application of a technique called dynamic programming. Using the recursive equations derived above, we can express the Viterbi algorithm as follows.

 1: function Viterbi
 2:   $M_n(x_n) \leftarrow \ln \gamma_n(x_n)$  // initialization of the log-likelihood
 3:   $\hat{x}^{MAP}(x_n) \leftarrow (x_n)$  // initialization of the MAP estimator
 4:   for $k = n-1, \ldots, 1$ do
 5:     $M_k(x_k) \leftarrow \max_{x_{k+1}} (g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}))$  // maximum of log-likelihood
 6:     $\hat{x}_{k+1}(x_k) \leftarrow \arg\max_{x_{k+1}} (g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}))$
 7:     $\hat{x}^{MAP}(x_k) \leftarrow [\hat{x}_{k+1}(x_k), \hat{x}^{MAP}(\hat{x}_{k+1}(x_k))]$  // maximizing sequence with leading term $x_k$
 8:   end for
 9:   $M \leftarrow \max_{x_1} M_1(x_1)$  // maximum of the overall log-likelihood
10:   $\hat{x}_1 \leftarrow \arg\max_{x_1} M_1(x_1)$
11:   $\hat{x}^{MAP} \leftarrow [\hat{x}_1, \hat{x}^{MAP}(\hat{x}_1)]$  // overall maximizing sequence
12: end function
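For reference, here is a runnable counterpart (our addition, a sketch rather than the lecture's canonical implementation) that consumes $\alpha_t$, $\beta_t$, and $\gamma_n$ from a forward pass such as the one in Lecture 3:

```python
# Viterbi / Bellman recursion over g_t(x_t, x_{t+1}), entirely in log space.
# log_trans[i, j] = ln p(x_{t+1} = j | x_t = i); alpha, beta as in Lecture 3.
import numpy as np

def viterbi_map(alpha, beta, gamma_n, log_trans):
    n, S = alpha.shape
    # g_t for t = 1, ..., n-1 (0-based indices t = 0, ..., n-2)
    g = [np.log(alpha[t])[:, None] + log_trans - np.log(beta[t + 1])[None, :]
         for t in range(n - 1)]
    M = np.log(gamma_n)                    # M_n(x_n) = ln gamma_n(x_n)
    back = []                              # backpointers  hat{x}_{k+1}(x_k)
    for k in range(n - 2, -1, -1):         # k = n-1, ..., 1 in the notes
        scores = g[k] + M[None, :]
        back.append(np.argmax(scores, axis=1))
        M = np.max(scores, axis=1)         # M_k(x_k)
    x = [int(np.argmax(M))]                # hat{x}_1 = argmax of M_1
    for bp in reversed(back):              # unroll the maximizing sequence
        x.append(int(bp[x[-1]]))
    return x                               # MAP state sequence (0-based symbols)
```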

EE378A Statistical Signal Processing, Lecture 5 - 04/18/2017

Lecture 5: Particle Filtering and Introduction to Bayesian Inference

Lecturer: Tsachy Weissman. Scribe: Charles Hale

In this lecture, we introduce new methods for state estimation in Hidden Markov Processes beyond discrete alphabets. This first requires learning about estimating pdfs from samples drawn from a different distribution. At the end, we briefly introduce Bayesian inference.

1 Recap: Inference on Hidden Markov Processes (HMPs)

1.1 Setting

A quick recap of the setting of an HMP:

1. $\{X_n\}_{n \ge 1}$ is a Markov process, called the state process.
2. $\{Y_n\}_{n \ge 1}$ is the observation process, where $Y_i$ is the output of $X_i$ sent through a memoryless channel characterized by the distribution $P_{Y|X}$.

Alternatively, we showed in HW1 that the above is totally equivalent to the following:

1. $\{X_n\}_{n \ge 1}$ is the state process defined by $X_t = f_t(X_{t-1}, W_t)$, where $\{W_n\}_{n \ge 1}$ is an independent process, i.e., each $W_t$ is independent of all the others.
2. $\{Y_n\}_{n \ge 1}$ is the observation process related to $\{X_n\}_{n \ge 1}$ by $Y_t = g_t(X_t, N_t)$, where $\{N_n\}_{n \ge 1}$ is an independent process.
3. The processes $\{N_n\}_{n \ge 1}$ and $\{W_n\}_{n \ge 1}$ are independent of each other.

1.2 Goal

Since the state process $\{X_n\}_{n \ge 1}$ is hidden from us, we only get to observe $\{Y_n\}_{n \ge 1}$ and wish to find a way of estimating $\{X_n\}_{n \ge 1}$ based on $\{Y_n\}_{n \ge 1}$. To do this, we define the forward recursions for causal estimation:
$$\alpha_t(x_t) = F(\beta_t, P_{Y_t|X_t}, y_t) \quad (1)$$
$$\beta_{t+1}(x_{t+1}) = G(\alpha_t, P_{X_{t+1}|X_t}) \quad (2)$$
where $F$ and $G$ are the operators defined in Lecture 4.

1.3 Challenges

For general alphabets $\mathcal{X}, \mathcal{Y}$, computing the forward recursion is a very difficult problem. However, we know efficient algorithms that can be applied in the following situations:

1. All random variables are discrete with a finite alphabet. We can then solve these equations by brute force in time proportional to the alphabet sizes.

2. For all $t$, $f_t$ and $g_t$ are linear and $N_t$, $W_t$, $X_1$ are Gaussian. The solution is the Kalman filter algorithm that we derive in Homework 2.

Beyond these special situations, there are many heuristic algorithms for this computation:

1. We can compute the forward recursions approximately by quantizing the alphabets. This leads to a tradeoff between model accuracy and computation cost.
2. This lecture will focus on particle filtering, which is a way of estimating the forward recursions with an adaptive quantization.

2 Importance Sampling

Before we can learn about particle filtering, we first need to discuss importance sampling (see the Spring 2013 lecture notes for further reference).

2.1 Simple Setting

Let $X_1, \ldots, X_N$ be i.i.d. random variables with density $f$. We wish to approximate $f$ by a probability mass function $\hat{f}_N$ using our data. The idea: we can approximate $f$ by taking
$$\hat{f}_N = \frac{1}{N} \sum_{i=1}^N \delta(x - X_i) \quad (3)$$
where $\delta(x)$ is a Dirac delta distribution.

Claim: $\hat{f}_N$ is a good estimate of $f$. Here we define the goodness of our estimate by looking at $E_{X \sim \hat{f}_N}[g(X)]$ (4) and seeing how close it is to $E_{X \sim f}[g(X)]$ (5) for any function $g$ with $E|g(X)| < \infty$. We show that by this definition, (3) is a good estimate of $f$.

Proof:
$$E_{X \sim \hat{f}_N}[g(X)] = \int \hat{f}_N(x) g(x) \, dx \quad (6)$$
$$= \int \frac{1}{N} \sum_{i=1}^N \delta(x - X_i) g(x) \, dx \quad (7)$$
$$= \frac{1}{N} \sum_{i=1}^N \int \delta(x - X_i) g(x) \, dx \quad (8)$$
$$= \frac{1}{N} \sum_{i=1}^N g(X_i) \quad (9)$$

By the law of large numbers, as $N \to \infty$,
$$\frac{1}{N} \sum_{i=1}^N g(X_i) \xrightarrow{a.s.} E_{X \sim f}[g(X)]. \quad (10)$$
Hence,
$$E_{X \sim \hat{f}_N}[g(X)] \to E_{X \sim f}[g(X)] \quad (11)$$
so $\hat{f}_N$ is indeed a good estimate of $f$.

2.2 General Setting

Suppose now that $X_1, \ldots, X_N$ are i.i.d. with common density $q$, but we wish to estimate a different pdf $f$.

Theorem: The estimator
$$\hat{f}_N(x) = \sum_{i=1}^N w_i \delta(x - X_i) \quad (12)$$
with weights
$$w_i = \frac{f(X_i)/q(X_i)}{\sum_{j=1}^N f(X_j)/q(X_j)} \quad (13)$$
is a good estimator of $f$.

Proof: It is easy to see that
$$E_{X \sim \hat{f}_N}[g(X)] = \sum_{i=1}^N w_i g(X_i) = \frac{\frac{1}{N}\sum_{i=1}^N \frac{f(X_i)}{q(X_i)} g(X_i)}{\frac{1}{N}\sum_{j=1}^N \frac{f(X_j)}{q(X_j)}}. \quad (14)$$
As $N \to \infty$, the numerator and denominator converge almost surely to their expectations under $q$, so the above expression converges to
$$\frac{E_q\left[\frac{f(X)}{q(X)} g(X)\right]}{E_q\left[\frac{f(X)}{q(X)}\right]} = \frac{\int \frac{f(x)}{q(x)} g(x) \, q(x) \, dx}{\int \frac{f(x)}{q(x)} \, q(x) \, dx} = \frac{\int f(x) g(x) \, dx}{\int f(x) \, dx} = \int f(x) g(x) \, dx = E_{X \sim f}[g(X)]. \quad (21)$$

Hence,
$$E_{X \sim \hat{f}_N}[g(X)] \to E_{X \sim f}[g(X)] \quad (22)$$
as $N \to \infty$.

2.3 Implications

The above theorem is useful in situations where we cannot draw samples $X_i \sim f$ and may not even know $f$ exactly, but we know that $f(x) = c \cdot h(x)$ for some constant $c$ (in many practical cases it is computationally prohibitive to obtain the normalization constant $c = (\int h(x) \, dx)^{-1}$). Then
$$w_i = \frac{f(X_i)/q(X_i)}{\sum_{j=1}^N f(X_j)/q(X_j)} = \frac{c \, h(X_i)/q(X_i)}{\sum_{j=1}^N c \, h(X_j)/q(X_j)} = \frac{h(X_i)/q(X_i)}{\sum_{j=1}^N h(X_j)/q(X_j)}. \quad (25)$$
Using these $w_i$'s, we see that we can approximate $f$ using $h$ and samples drawn from a different distribution.

2.4 Application

Let $X \sim f_X$ and let $Y$ be the observation of $X$ through a memoryless channel characterized by $f_{Y|X}$. For MAP reconstruction of $X$, we must compute
$$f_{X|Y}(x \mid y) = \frac{f_X(x) \, f_{Y|X}(y \mid x)}{\int f_X(\tilde{x}) \, f_{Y|X}(y \mid \tilde{x}) \, d\tilde{x}}. \quad (26)$$
In general, this computation may be hard due to the integral appearing in the denominator. But the denominator is just a constant. Hence, suppose we have $X_i$ i.i.d. $\sim f_X$ for $1 \le i \le N$. By the previous theorem, we can take as an approximation
$$\hat{f}_{X|y} = \sum_{i=1}^N w_i(y) \, \delta(x - X_i) \quad (27)$$
where
$$w_i(y) = \frac{f_{X|y}(X_i)/f_X(X_i)}{\sum_{j=1}^N f_{X|y}(X_j)/f_X(X_j)} = \frac{f_X(X_i) f_{Y|X}(y \mid X_i)/f_X(X_i)}{\sum_{j=1}^N f_X(X_j) f_{Y|X}(y \mid X_j)/f_X(X_j)} = \frac{f_{Y|X}(y \mid X_i)}{\sum_{j=1}^N f_{Y|X}(y \mid X_j)}. \quad (30)$$
Hence, using the samples $X_1, X_2, \ldots, X_N$ and the above procedure, we can build a good approximation of $f_{X|Y}(\cdot \mid y)$ for fixed $y$.
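The following sketch (our addition; the standard-normal prior and additive-noise channel are assumed for illustration) carries out exactly this program, estimating a posterior mean from prior samples without computing the normalizing integral:

```python
# Importance-sampling approximation of f_{X|Y}(. | y) from prior samples,
# using the weights in (30). Prior: X ~ N(0, 1); channel: Y = X + N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
N, y = 100_000, 1.5
x_samples = rng.normal(size=N)              # X_i ~ f_X
log_like = -0.5 * (y - x_samples) ** 2      # log f_{Y|X}(y | X_i), up to a constant
w = np.exp(log_like - log_like.max())       # shift for numerical stability
w /= w.sum()                                # normalized weights (30)

# Posterior mean E[X | Y = y]; the exact answer is y/2 = 0.75 here.
print("importance-sampling estimate:", np.sum(w * x_samples))
```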

Let us denote this approximation as
$$\hat{f}_N(X_1, X_2, \ldots, X_N, f_X, f_{Y|X}, y) = \sum_{i=1}^N w_i \delta(x - X_i). \quad (31)$$

3 Particle Filtering

Particle filtering is a way to approximate the forward recursions of an HMP, using importance sampling to approximate the generated distributions rather than computing the true solutions.

Algorithm 1: Particle Filter

  particles <- generate $\{X_1^{(i)}\}_{i=1}^N$ i.i.d. $\sim f_{X_1}$;
  for $t \ge 1$ do
    $\hat{\alpha}_t = \hat{F}_N(\{X_t^{(i)}\}_{i=1}^N, \hat{\beta}_t, P_{Y_t|X_t}, y_t)$;  // estimate $\alpha_t$ using importance sampling as in (31)
    nextparticles <- generate $\{\tilde{X}_t^{(i)}\}_{i=1}^N$ i.i.d. $\sim \hat{\alpha}_t$;
    reset particles;
    for $1 \le i \le N$ do
      particles <- generate $X_{t+1}^{(i)} \sim f_{X_{t+1}|X_t}(\cdot \mid \tilde{X}_t^{(i)})$;
    end for
    $\hat{\beta}_{t+1} = \frac{1}{N} \sum_{i=1}^N \delta(x - X_{t+1}^{(i)})$;
  end for

This algorithm requires some art to use. It is not well understood how errors propagate through these successive approximations. In practice, the parameters need to be periodically reset, and choosing these hyperparameters is an application-specific task.

4 Inference under Logarithmic Loss

Recall Bayesian decision theory, and let us have the following:

- $x \in \mathcal{X}$ is something we want to infer, with $X \sim p_X$ (assume that $X$ is discrete);
- $\hat{x} \in \hat{\mathcal{X}}$ is something we wish to reconstruct;
- $\Lambda : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}$ is our loss function.

First we assume that we do not have any observation, and define the Bayes envelope as
$$U(p_X) = \min_{\hat{x}} E[\Lambda(X, \hat{x})] = \min_{\hat{x}} \sum_x p_X(x) \Lambda(x, \hat{x}) \quad (32)$$
and the Bayes response as
$$\hat{X}_{Bayes}(p_X) = \arg\min_{\hat{x}} E[\Lambda(X, \hat{x})]. \quad (33)$$

Example: Let $\mathcal{X} = \hat{\mathcal{X}} = \mathbb{R}$ and $\Lambda(x, \hat{x}) = (x - \hat{x})^2$. Then
$$U(p_X) = \min_{\hat{x}} E_{X \sim p_X}(X - \hat{x})^2 = \text{Var}(X) \quad (34)$$
$$\hat{X}_{Bayes}(p_X) = E[X]. \quad (35)$$

Consider now the case where we have some observation (side information) $Y$, where our reconstruction $\hat{X}$ can be a function of $Y$. Then
$$E[\Lambda(X, \hat{X}(Y))] = E[E[\Lambda(X, \hat{X}(y)) \mid Y = y]] \quad (36)$$
$$= \sum_y E[\Lambda(X, \hat{X}(y)) \mid Y = y] \, p_Y(y) \quad (37)$$
$$= \sum_y p_Y(y) \sum_x p_{X|Y=y}(x) \Lambda(x, \hat{X}(y)). \quad (38)$$
This implies
$$\min_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))] = \min_{\hat{X}(\cdot)} \sum_y p_Y(y) \sum_x p_{X|Y=y}(x) \Lambda(x, \hat{X}(y)) \quad (40)$$
$$= \sum_y p_Y(y) U(p_{X|Y=y}) \quad (41)$$
and thus
$$\hat{X}_{opt}(y) = \hat{X}_{Bayes}(p_{X|Y=y}). \quad (42)$$

EE378A Statistical Signal Processing, Lecture 6 - 04/20/2017

Lecture 6: Bayesian Decision Theory and Justification of Log Loss

Lecturer: Tsachy Weissman. Scribe: Okan Atalar

(Reading: Lecture 3 of the EE378A Spring 2016 notes.)

In this lecture, we introduce the topic of Bayesian decision theory and define the concept of a loss function for assessing the accuracy of an estimate. The log loss function is presented as an important special case.

1 Notation

A quick summary of the notation:

1. Random variables (objects), used more loosely: $X, Y$
2. PMF (PDF), used more loosely: $P_X, P_Y$
3. Alphabets: $\mathcal{X}, \hat{\mathcal{X}}$
4. Estimate of $X$: $\hat{x}$
5. Specific values: $x, y$
6. Loss function: $\Lambda$
7. Markov triplet: the notation $X - Y - Z$ denotes a Markov triplet
8. Logarithm: $\log$ denotes $\log_2$, unless otherwise stated

For a discrete random variable (object) $U$ with p.m.f. $P_U(u) = P(U = u)$, we will often just write $p(u)$; similarly, $p(x, y)$ for $P_{X,Y}(x, y)$ and $p(y \mid x)$ for $P_{Y|X}(y \mid x)$, etc.

2 Bayesian Decision Theory

Let $X \in \mathcal{X}$, $X \sim P_X$, $\hat{x} \in \hat{\mathcal{X}}$, and $\Lambda : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}$.

Definition 1 (Bayes Envelope).
$$U(P_X) \triangleq \min_{\hat{x}} E[\Lambda(X, \hat{x})] = \min_{\hat{x}} \sum_x P_X(x) \Lambda(x, \hat{x}) \quad (1)$$

Definition 2 (Bayes Response).
$$\hat{X}_{Bayes}(P_X) \triangleq \arg\min_{\hat{x}} E[\Lambda(X, \hat{x})] \quad (2)$$

Essentially, given a prior distribution on $X$, we are trying to achieve the best estimate by minimizing the expected value of some loss function that we have defined. If, in addition to the prior distribution of $X$, we are also given an observation $Y$, the associated Bayes envelope becomes
$$\min_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))] = \sum_y U(P_{X|Y=y}) P_Y(y) = E[U(P_{X|Y})]. \quad (3)$$
In this case, the Bayes response takes the form
$$\hat{X}(y) = \hat{X}_{Bayes}(P_{X|Y=y}). \quad (4)$$
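For a finite alphabet, Definitions 1 and 2 reduce to a matrix-vector product followed by a minimum; a minimal sketch (our addition, with a made-up Hamming-loss example):

```python
# Bayes envelope U(P_X) and Bayes response for a loss matrix Lam[x, xhat].
import numpy as np

def bayes_envelope_response(p_x, Lam):
    expected = p_x @ Lam                 # expected loss of each action xhat
    return expected.min(), int(expected.argmin())

Lam = 1.0 - np.eye(3)                    # Hamming loss on a ternary alphabet
p_x = np.array([0.2, 0.5, 0.3])
U, xhat = bayes_envelope_response(p_x, Lam)
print(U, xhat)                           # 0.5 and action 1: the posterior mode
```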

3 Bayesian Inference Under Logarithmic Loss

In this section, we will use the logarithm as the loss function.

Definition 3 (Logarithmic Loss Function).
$$\Lambda(x, Q) \triangleq \log \frac{1}{Q(x)}$$
where $x$ is the realization of $X$ and $Q$ is the estimate of the pmf of $X$.

Using Definition 3, the expected value of the log loss can be expressed as:
$$E[\Lambda(X, Q)] = \sum_x P_X(x) \Lambda(x, Q) = \sum_x P_X(x) \log \frac{1}{Q(x)} \quad (5)$$
$$= \sum_x P_X(x) \log \frac{1}{P_X(x)} + \sum_x P_X(x) \log \frac{P_X(x)}{Q(x)} \quad (6, 7)$$

Definition 4 (Entropy).
$$H(X) \triangleq \sum_x P_X(x) \log \frac{1}{P_X(x)} \quad (8)$$

Definition 5 (Relative Entropy, or the Kullback-Leibler Divergence).
$$D(P_X \| Q) \triangleq \sum_x P_X(x) \log \frac{P_X(x)}{Q(x)} \quad (9)$$

Using Equations (6, 7) and Definitions (4, 5):
$$E[\Lambda(X, Q)] = H(X) + D(P_X \| Q) \quad (10)$$

Claim: $D(P_X \| Q) \ge 0$, and $D(P_X \| Q) = 0 \iff Q = P_X$.

Before attempting the proof, we recall Jensen's inequality.

Jensen's Inequality: Let $Q$ denote a concave function and $X$ be any random variable. Then
$$E[Q(X)] \le Q(E[X]) \quad (11)$$
Further, if $Q$ is strictly concave, $E[Q(X)] = Q(E[X]) \iff X$ is a constant. In particular, $Q(x) = \log(x)$ is a concave function, thus for a random variable $X \ge 0$:
$$E[\log X] \le \log E[X]. \quad (12)$$

Proof of the Claim:
$$-D(P_X \| Q) = E\left[\log \frac{Q(x)}{P_X(x)}\right] \le \log E\left[\frac{Q(x)}{P_X(x)}\right] = \log \sum_x P_X(x) \frac{Q(x)}{P_X(x)} = \log \sum_x Q(x) = 0 \quad (13)$$
The last equality follows from $\sum_x Q(x) = 1$, since $Q(x)$ is a pmf and thus sums to one. Hence, it follows from the claim that
$$\min_Q E[\Lambda(X, Q)] = H(X) = U(P_X) \quad (14)$$

where the minimizer is $Q = P_X$. Therefore:
$$\hat{X}_{Bayes}(P_X) = P_X. \quad (15)$$

Introducing the notation $H(X) \equiv H(P_X)$ and making use of Equation (3):
$$\min_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))] = E[U(P_{X|Y})] = \sum_y H(P_{X|Y=y}) P_Y(y) \quad (16)$$

Definition 6 (Conditional Entropy).
$$H(X \mid Y) \triangleq \sum_y H(P_{X|Y=y}) P_Y(y) \quad (17)$$

Definition 7 (Mutual Information).
$$I(X; Y) \triangleq H(X) - H(X \mid Y) \quad (18)$$

Definition 8 (Conditional Relative Entropy).
$$D(P_{X|Y} \| Q_{X|Y} \mid P_Y) \triangleq \sum_y D(P_{X|Y=y} \| Q_{X|Y=y}) P_Y(y) \quad (19)$$

Definition 9 (Conditional Mutual Information).
$$I(X; Y \mid Z) \triangleq H(X \mid Z) - H(X \mid Y, Z) \quad (20)$$

Exercises:
(a) $H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)$
(b) $I(X; Y) = H(Y) - H(Y \mid X)$
(c) $I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1)$
(d) $D(P_{X_1,X_2} \| Q_{X_1,X_2}) = D(P_{X_1} \| Q_{X_1}) + D(P_{X_2|X_1} \| Q_{X_2|X_1} \mid P_{X_1})$

4 Justification of Log Loss as the Right Loss Function

For a general loss function $\Lambda$:

Definition 10 (Coherent Dependence Function).
$$C(\Lambda, P_{X,Y}) \triangleq \inf_{\hat{x}} E[\Lambda(X, \hat{x})] - \inf_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))] \quad (21)$$

The first term corresponds to the minimum expected error (depending on the choice of loss function) in estimating $X$ based only on the prior distribution of $X$. The second term corresponds to the minimum expected error in estimating $X$ by also taking the observation $Y$ into account. Therefore, the coherent dependence function measures how much the estimate improves given the observation $Y$, compared to estimating $X$ based only on the prior distribution of $X$. Two examples are provided below to make this concept more clear:

(1) $\Lambda(x, \hat{x}) = (x - \hat{x})^2 \Rightarrow C(\Lambda, P_{X,Y}) = \text{Var}(X) - E[\text{Var}(X \mid Y)]$
(2) $\Lambda(x, Q) = \log \frac{1}{Q(x)} \Rightarrow C(\Lambda, P_{X,Y}) = I(X; Y)$

Proving the expressions below is left as an exercise:
(a) $C(\Lambda, P_{X,Y}) \ge 0$, and $C(\Lambda, P_{X,Y}) = 0 \iff X$ and $Y$ are independent

(b) $C(\Lambda, P_{X,Y}) \ge C(\Lambda, P_{X,Z})$ when $X - Y - Z$

Natural Requirement: If $T = T(X)$ is a function of $X$, then $C(\Lambda, P_{T,Y}) \le C(\Lambda, P_{X,Y})$. This requirement basically states that a function of $X$ cannot carry more information about $X$ than $X$ itself.

Definition 11 (Sufficient Statistic). $T = T(X)$ is a sufficient statistic of $X$ for $Y$ if $X - T - Y$.

Modest Requirement: If $T$ is a sufficient statistic of $X$ for $Y$, then $C(\Lambda, P_{T,Y}) \ge C(\Lambda, P_{X,Y})$.

Theorem: Assume $|\mathcal{X}| \ge 3$. The only $C(\Lambda, P_{X,Y})$ satisfying the modest requirement is $C(\Lambda, P_{X,Y}) = I(X; Y)$ (up to a multiplicative factor).

Since $C(\Lambda, P_{X,Y}) = I(X; Y)$ if (and only if) the logarithmic loss function is used ($\Lambda(x, Q) = \log \frac{1}{Q(x)}$), it follows that the logarithmic loss function is the only loss function which satisfies the modest requirement.
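To close the lecture, a quick numeric check (our addition; the two distributions are arbitrary) of the decomposition (10) that underlies these results:

```python
# Verify E[Lambda(X, Q)] = H(X) + D(P_X || Q), all logs base 2 as in the notes.
import numpy as np

def H(p):
    return -np.sum(p * np.log2(p))

def D(p, q):
    return np.sum(p * np.log2(p / q))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.6, 0.2, 0.2])
expected_log_loss = -np.sum(p * np.log2(q))
print(expected_log_loss, H(p) + D(p, q))   # the two numbers coincide
```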

EE378A Statistical Signal Processing, Lecture 7 - 04/25/2017

Lecture 7: Prediction

Lecturer: Tsachy Weissman. Scribe: Cem Kalkanli

In this lecture we study prediction in the Bayesian setting under general loss and log loss. We also briefly start the subject of prediction of individual sequences.

1 Notation

A quick summary of the notation:

1. Random variables (objects), used more loosely: $X, Y, U, V$
2. Alphabets: $\mathcal{X}, \hat{\mathcal{X}}$
3. Sequence of random variables: $X_1, X_2, \ldots, X_t$
4. Loss function: $\Lambda(x, \hat{x})$
5. Estimate of $X$: $\hat{x}$
6. Set of distributions on the alphabet $\mathcal{X}$: $\mathcal{M}(\mathcal{X})$

For a discrete random variable (object) $U$ with p.m.f. $P_U(u) = P(U = u)$, we will often just write $p(u)$; similarly, $p(x, y)$ for $P_{X,Y}(x, y)$ and $p(y \mid x)$ for $P_{Y|X}(y \mid x)$, etc.

2 Prediction

Consider $X_1, X_2, \ldots, X_n \in \mathcal{X}$.

Definition 1 (Predictor). A predictor $F$ is a sequence of functions $F = (F_t)_{t \ge 1}$ with $F_t : \mathcal{X}^{t-1} \to \hat{\mathcal{X}}$.

Definition 2 (Prediction Loss). Given a loss function $\Lambda : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}_+$, the prediction loss incurred by the predictor $F$ is defined as
$$L_F(X^n) = \frac{1}{n} \sum_{t=1}^n \Lambda(X_t, F_t(X^{t-1})). \quad (1)$$

2.1 Prediction in the Bayesian Setting

In the Bayesian setting, we assume $X_1, X_2, \ldots$ are governed jointly by some underlying law $P$. Then we have the following prediction loss.

Definition 3 (Prediction Loss in the Bayesian Setting).
$$\min_F E_P[L_F(X^n)] = \frac{1}{n} \sum_{t=1}^n \min_{F_t} E_P[\Lambda(X_t, F_t(X^{t-1}))] = \frac{1}{n} \sum_{t=1}^n E_P[U(P_{X_t|X^{t-1}})] \quad (2)$$
where $U(\cdot)$ is the Bayes envelope we defined in the previous lecture.

As we can see from (2), our initial objective was to minimize the expected prediction loss with respect to the whole sequence of functions. However, since we can interchange summation and expectation, we can treat the overall predictor $F$ as the composition of the sequentially optimal predictors $F_t$. Finally, we note that the overall loss is just the sum of expectations of Bayes envelopes.

Theorem 4 (Mismatch Loss). The additional prediction loss incurred by doing prediction based on a law $Q$ instead of $P$ is upper bounded by some form of relative entropy:
$$E_P[L_{F^Q}(X^n) - L_{F^P}(X^n)] \le \Lambda_{max} \sqrt{\frac{2}{n} D(P_{X^n} \| Q_{X^n})}, \quad (3)$$
where $\Lambda_{max} = \sup_{x, \hat{x}} \Lambda(x, \hat{x})$.

We first introduce Lemma 5, which will be used to prove Theorem 4.

Lemma 5.
$$E_P[\Lambda(X, \hat{X}_{Bayes}(Q)) - \Lambda(X, \hat{X}_{Bayes}(P))] \le \Lambda_{max} \|P - Q\|_1 \le \Lambda_{max} \sqrt{2 D(P \| Q)} \quad (4)$$

Proof.
$$E_P[\Lambda(X, \hat{X}_{Bayes}(Q)) - \Lambda(X, \hat{X}_{Bayes}(P))]$$
$$\le E_P[\Lambda(X, \hat{X}_{Bayes}(Q)) - \Lambda(X, \hat{X}_{Bayes}(P))] + E_Q[-\Lambda(X, \hat{X}_{Bayes}(Q)) + \Lambda(X, \hat{X}_{Bayes}(P))] \quad (5)$$
$$= \sum_x (P(x) - Q(x))\left(\Lambda(x, \hat{X}_{Bayes}(Q)) - \Lambda(x, \hat{X}_{Bayes}(P))\right) \quad (6)$$
$$\le \sum_x |P(x) - Q(x)| \, \Lambda_{max} \quad (7)$$
$$= \Lambda_{max} \|P - Q\|_1 \quad (8)$$
$$\le \Lambda_{max} \sqrt{2 D(P \| Q)} \quad (9)$$
where (5) holds since the added term is non-negative by the definition of the Bayes response under $Q$, and in the last step we have used Pinsker's inequality $D(P \| Q) \ge \frac{1}{2} \|P - Q\|_1^2$.

Now we prove Theorem 4 as follows:
$$E_P[L_{F^Q}(X^n) - L_{F^P}(X^n)] \quad (10)$$
$$= E_P\left[\frac{1}{n} \sum_{t=1}^n \left(\Lambda(X_t, F_t^Q(X^{t-1})) - \Lambda(X_t, F_t^P(X^{t-1}))\right)\right] \quad (11)$$
$$= \frac{1}{n} \sum_{t=1}^n E_P\left[E_P\left[\Lambda(X_t, F_t^Q(X^{t-1})) - \Lambda(X_t, F_t^P(X^{t-1})) \,\middle|\, X^{t-1}\right]\right] \quad \text{(iterated expectation)} \quad (12)$$
$$\le \frac{1}{n} \sum_{t=1}^n E_P\left[\Lambda_{max} \sqrt{2 D(P_{X_t|X^{t-1}} \| Q_{X_t|X^{t-1}})}\right] \quad \text{(using Lemma 5)} \quad (13)$$
$$\le \Lambda_{max} \sqrt{\frac{2}{n} \sum_{t=1}^n E_P\left[D(P_{X_t|X^{t-1}} \| Q_{X_t|X^{t-1}})\right]} \quad \text{(Jensen's inequality)} \quad (14)$$
$$= \Lambda_{max} \sqrt{\frac{2}{n} D(P_{X^n} \| Q_{X^n})} \quad \text{(chain rule for relative entropy)} \quad (15)$$
as desired.

3 Prediction Under Logarithmic Loss

In this section, we specialize our setup to the logarithmic loss while changing our predictor set into a set of distributions. So here we have $\hat{\mathcal{X}} = \mathcal{M}(\mathcal{X})$ and $\Lambda(x, P) = \log \frac{1}{P(x)}$ as the logarithmic loss. Also note that $F_t(X^{t-1}) \in \mathcal{M}(\mathcal{X})$. So we can represent our predictor with three different setups as follows:

1. Obviously, we can carry over the setup we worked with previously and see our predictor as a set of functions $\{F_t\}_{t \ge 1}$.
2. Another approach is to see each predictor as a sequence of conditional distributions $\{Q_{X_t|X^{t-1}}\}_{t \ge 1}$, where $F_t(X^{t-1}) = Q_{X_t|X^{t-1}}$.
3. Finally, we can see our predictor as a model on the data. Then $Q$ would be the law we assign to the sequence we observe.

We can immediately adapt the previous results to the log loss case. The first one is $L_F(X^n)$:
$$L_F(X^n) = \frac{1}{n} \sum_{t=1}^n \log \frac{1}{F_t(X^{t-1})[x_t]} \quad (16)$$
$$= \frac{1}{n} \sum_{t=1}^n \log \frac{1}{Q_{X_t|X^{t-1}}(x_t)} \quad (17)$$
$$= \frac{1}{n} \log \frac{1}{Q(X^n)} \quad \text{(where now } Q \text{ is our model of the data)} \quad (18)$$

Now let's look at the logarithmic loss in the Bayesian setting, assuming the sequence $X_1, X_2, \ldots$ is governed by $P$:
$$E_P[L_F(X^n)] = E_P\left[\frac{1}{n} \log \frac{1}{Q(X^n)}\right] \quad (19)$$
$$= E_P\left[\frac{1}{n} \log \frac{P(X^n)}{Q(X^n) P(X^n)}\right] \quad (20)$$
$$= \frac{1}{n} \left[H(X^n) + D(P_{X^n} \| Q_{X^n})\right] \quad (21)$$
Then we can see that $\min_F E_P[L_F(X^n)] = \frac{1}{n} H(X^n)$, and it is achieved when $F_t(X^{t-1}) = P_{X_t|X^{t-1}}$.

4 Prediction of Individual Sequences

For the last part, we look at a different set of predictors. In this setup we are given a class of predictors $\mathcal{F}$, and we seek a predictor $F$ such that
$$\lim_{n \to \infty} \max_{X^n} \left[L_F(X^n) - \min_{G \in \mathcal{F}} L_G(X^n)\right] = 0. \quad (22)$$

4.1 Prediction of Individual Sequences under Log Loss

Now the question arises: given a class of laws $\mathcal{F}$, can we find $P$ such that (23) is small for any $X^n$?
$$\frac{1}{n} \log \frac{1}{P(X^n)} - \min_{Q \in \mathcal{F}} \frac{1}{n} \log \frac{1}{Q(X^n)} = \frac{1}{n} \log \frac{\max_{Q \in \mathcal{F}} Q(X^n)}{P(X^n)}. \quad (23)$$
This problem will be solved in the upcoming lecture.

EE378A Statistical Signal Processing, Lecture 8 - 04/27/2017

Lecture 8: Prediction of Individual Sequences

Lecturer: Tsachy Weissman. Scribe: Chuan-Zheng Lee

In the last lecture we studied the prediction problem in the Bayesian setting; in this lecture we study the prediction problem under logarithmic loss in the individual-sequence setting.

1 Problem setup

We first recall the problem setup: given a class of probability laws $\mathcal{F}$, we wish to find some law $P$ with a small worst-case regret
$$\text{WCR}_n(P, \mathcal{F}) \triangleq \max_{x^n} \left[\log \frac{1}{P(x^n)} - \min_{Q \in \mathcal{F}} \log \frac{1}{Q(x^n)}\right]. \quad (1)$$
A first natural question is as follows: can we find some $P$ such that
$$\text{WCR}_n(P, \mathcal{F}) \le n \epsilon_n \quad (2)$$
with $\epsilon_n \to 0$ as $n \to \infty$?

2 First answer when $|\mathcal{F}| < \infty$: uniform mixture

The answer to the previous question turns out to be affirmative when $|\mathcal{F}| < \infty$. Our construction of $P$ is as follows:
$$P_{\text{Unif}}(x^n) = \frac{1}{|\mathcal{F}|} \sum_{Q \in \mathcal{F}} Q(x^n). \quad (3)$$
The performance of $P_{\text{Unif}}$ is summarized in the following lemma:

Lemma 1. The worst-case regret of $P_{\text{Unif}}$ satisfies
$$\text{WCR}_n(P_{\text{Unif}}, \mathcal{F}) \le \log |\mathcal{F}|. \quad (4)$$

Proof. Just apply $Q(x^n) \ge 0$ for any $Q \in \mathcal{F}$ and $x^n$ to conclude that
$$\log \frac{1}{P(x^n)} - \min_{Q \in \mathcal{F}} \log \frac{1}{Q(x^n)} = \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P(x^n)} \quad (5)$$
$$= \log |\mathcal{F}| + \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{\sum_{Q \in \mathcal{F}} Q(x^n)} \quad (6)$$
$$\le \log |\mathcal{F}|. \quad (7)$$

By Lemma 1, we know that we can choose $\epsilon_n = \frac{\log |\mathcal{F}|}{n} \to 0$ as $n \to \infty$, which provides an affirmative answer to our question.

3 Second answer: normalized maximum likelihood

From the previous section we know that $\epsilon_n \to 0$ is possible when $|\mathcal{F}| < \infty$, but we are still looking for some law $P$ which gives the minimum $\epsilon_n$, or equivalently, the minimum $\text{WCR}_n(P, \mathcal{F})$. Complicated as the definition of $\text{WCR}_n(P, \mathcal{F})$ may appear, surprisingly its minimum admits a closed-form expression:

Theorem 2 (Normalized Maximum Likelihood). The minimum worst-case regret is given by
$$\min_P \text{WCR}_n(P, \mathcal{F}) = \log \sum_{x^n} \max_{Q \in \mathcal{F}} Q(x^n). \quad (8)$$
This bound is attained by the normalized maximum likelihood predictor
$$P_{\text{NML}}(x^n) = \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{\sum_{\tilde{x}^n} \max_{Q \in \mathcal{F}} Q(\tilde{x}^n)}. \quad (9)$$

Proof. Since $\frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P_{\text{NML}}(x^n)}$ is a constant, it is clear that $P_{\text{NML}}$ attains the desired equality. To prove optimality, suppose $P$ is any other predictor. Since both $P_{\text{NML}}$ and $P$ are probability distributions, there must exist some $x^n$ such that
$$P(x^n) \le P_{\text{NML}}(x^n). \quad (10)$$
Considering this sequence $x^n$ gives
$$\text{WCR}_n(P, \mathcal{F}) \ge \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P(x^n)} \quad (11)$$
$$\ge \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P_{\text{NML}}(x^n)} \quad (12)$$
$$= \log \sum_{\tilde{x}^n} \max_{Q \in \mathcal{F}} Q(\tilde{x}^n) \quad (13)$$
as desired.

4 Third answer: Dirichlet mixture

Although the normalized maximum likelihood predictor gives the optimal worst-case regret, it has a severe drawback: it is usually very hard to compute in practice. In contrast, the mixture idea used in the first answer can usually be computed efficiently (as will be shown below). This observation motivates us to come up with a mixture-based predictor $P$ which enjoys a worst-case regret close to that of $P_{\text{NML}}$.

The general mixture idea is as follows: suppose the model class $\mathcal{F} = (Q_\theta)_{\theta \in \Theta}$ can be parametrized by $\theta$. Consider some prior $w(\cdot)$ on $\Theta$ and choose
$$P_w(x^n) = \int_\Theta Q_\theta(x^n) \, w(d\theta) \quad (14)$$

as our predictor. Note that in this case,
$$P_w(x_t \mid x^{t-1}) = \frac{P_w(x^t)}{P_w(x^{t-1})} \quad (15)$$
$$= \frac{\int_\Theta Q_\theta(x^t) \, w(d\theta)}{\int_\Theta Q_\theta(x^{t-1}) \, w(d\theta)} \quad (16)$$
$$= \frac{\int_\Theta Q_\theta(x_t \mid x^{t-1}) \, Q_\theta(x^{t-1}) \, w(d\theta)}{\int_\Theta Q_\theta(x^{t-1}) \, w(d\theta)} \quad (17)$$
$$= \int_\Theta Q_\theta(x_t \mid x^{t-1}) \, \frac{Q_\theta(x^{t-1})}{\int_\Theta Q_{\theta'}(x^{t-1}) \, w(d\theta')} \, w(d\theta) \quad (18)$$
$$= \int_\Theta Q_\theta(x_t \mid x^{t-1}) \, w(d\theta \mid x^{t-1}) \quad (19)$$
where
$$w(d\theta \mid x^{t-1}) = \frac{Q_\theta(x^{t-1}) \, w(d\theta)}{\int_\Theta Q_{\theta'}(x^{t-1}) \, w(d\theta')}. \quad (20)$$

Then how should we choose the prior $w(\cdot)$? On one hand, the prior should be chosen such that it assigns roughly equal weights to the representative sources. On the other hand, it should be chosen in a way such that it is computationally tractable for the sequential probability assignments. We take a look at some concrete examples.

4.1 Bernoulli case

In the first example, we assume that $Q_\theta = \text{Bern}(\theta)^{\otimes n}$ and $\mathcal{F} = \{Q_\theta\}_{\theta \in [0,1]}$. Note that in this case the model class consists of all i.i.d. distributions. We assign the Dirichlet prior
$$w(d\theta) = \frac{1}{\pi} \frac{d\theta}{\sqrt{\theta(1-\theta)}} = \text{Beta}\left(\frac{1}{2}, \frac{1}{2}\right). \quad (21)$$
The properties of the resulting predictor are left as exercises:

Exercise 3. Denote by $n_0 = n_0(x^n)$ and $n_1 = n_1(x^n)$ the number of 0's and 1's in the sequence $x^n$, respectively. Prove the following:

(a) The conditional probability is given by
$$P_w(x_t \mid x^{t-1}) = \frac{n_{x_t}(x^{t-1}) + \frac{1}{2}}{t}. \quad (22)$$
This is called the Krichevsky-Trofimov sequential probability assignment.

(b) Show that $\text{WCR}_n(P_w, \mathcal{F}) = \frac{1}{2} \log n + O(1)$.

(c) Considering the normalized maximum likelihood predictor, also show that $\min_P \text{WCR}_n(P, \mathcal{F}) = \frac{1}{2} \log n + O(1)$. Hence, our Dirichlet mixture $P_w$ attains the optimal leading term of the worst-case regret.

In fact, there is a general theorem on the performance of the mixture-based predictor:

Theorem 4 (Rissanen, 1984). Let $\mathcal{F} = \{Q_\theta\}_{\theta \in \Theta}$ with $\Theta \subseteq \mathbb{R}^d$. Under benign regularity and smoothness conditions, we have
$$\min_P \text{WCR}_n(P, \mathcal{F}) = \frac{d}{2} \log n + o(\log n). \quad (23)$$

Moreover, there exists some Dirichlet prior $w(\cdot)$ such that
$$\text{WCR}_n(P_w, \mathcal{F}) = \frac{d}{2} \log n + o(\log n). \quad (24)$$

4.2 Markov sources

Given that we can compete with i.i.d. Bernoulli sequences, how can we compete under logarithmic loss with respect to Markov models? Assume $\mathcal{X} = \{0, 1\}$ and denote by $\mathcal{M}_k$ the set of all $k$-th order Markov models. Note that for $Q \in \mathcal{M}_k$,
$$Q(x^n \mid x_{-k+1}^0) = \prod_{i=1}^n Q(x_i \mid x_{i-k}^{i-1}) \quad (25)$$
$$= \prod_{\text{all possible } k\text{-tuples } u^k} \ \prod_{i : x_{i-k}^{i-1} = u^k} Q(x_i \mid u^k). \quad (26)$$
The product $\prod_{i : x_{i-k}^{i-1} = u^k} Q(x_i \mid u^k)$ gives us the probability of the subsequence $\{x_i\}_{x_{i-k}^{i-1} = u^k}$ under $\text{Bern}(Q(1 \mid u^k))$. It therefore makes sense to employ the KT estimator on each of these subsequences. Consider the following $P$ such that
$$P(x_t \mid x_{t-k}^{t-1}) = \frac{n_{x_{t-k}^{t-1}, x_t}(x^{t-1}) + 1/2}{n_{x_{t-k}^{t-1}}(x^{t-1}) + 1} \quad (27)$$
where $n_{u^k}(x^n) = \sum_{i=1}^n \mathbb{1}(x_{i-k}^{i-1} = u^k)$ counts the occurrences of the context $u^k$, and $n_{u^k, x}(x^n)$ counts the occurrences of the context $u^k$ followed by the symbol $x$. Then, for this $P$, we have
$$\text{WCR}_n(P, \mathcal{M}_k) \le \sum_{u^k} \frac{1}{2} \ln(n_{u^k}(x^n)) + o(\ln n) \quad (28)$$
$$\le \frac{2^k}{2} \ln n + o(\ln n), \quad (29)$$
which matches the Rissanen lower bound in the leading term.
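A short sketch (our addition; padding the first $k$ contexts with zeros is our convention, where the notes instead condition on $x_{-k+1}^0$) of the KT assignment (22) and its per-context extension (27):

```python
# Sequential Krichevsky-Trofimov probability assignment; k = 0 gives (22),
# k >= 1 applies KT separately within each length-k context as in (27).
import numpy as np

def kt_log2_prob(x, k=0):
    counts = {}                  # context tuple -> [n_0, n_1]
    log_p = 0.0
    x = [0] * k + list(x)        # assumed zero padding for the first k symbols
    for i in range(k, len(x)):
        ctx, sym = tuple(x[i - k:i]), x[i]
        n = counts.setdefault(ctx, [0, 0])
        log_p += np.log2((n[sym] + 0.5) / (n[0] + n[1] + 1.0))
        n[sym] += 1
    return log_p                 # log2 P(x^n) under the per-context KT mixture

x = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
for k in [0, 1, 2]:
    print(k, kt_log2_prob(x, k))
```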

EE378A Statistical Signal Processing, Lecture 9 - 05/02/2017

Lecture 9: Context Tree Weighting

Lecturer: Tsachy Weissman. Scribe: Chuan-Zheng Lee

In the last lecture we talked about the mixture idea for i.i.d. sources and Markov sources. Specifically, we introduced the Krichevsky-Trofimov probability assignment for a binary vector $x^n \in \{0, 1\}^n$, given by
$$P_{KT}(x^n) = \int_0^1 \omega(\theta) \, \theta^{n_1(x^n)} (1-\theta)^{n_0(x^n)} \, d\theta = \frac{\Gamma(n_0 + \frac{1}{2}) \, \Gamma(n_1 + \frac{1}{2})}{(\Gamma(\frac{1}{2}))^2 \, \Gamma(n + 1)}, \quad (1)$$
where $\omega(\theta) = \frac{1}{\pi \sqrt{\theta(1-\theta)}}$, $n_1 \triangleq n_1(x^n)$ is the number of 1's in $x^n$, $n_0 \triangleq n_0(x^n)$ is the number of 0's (so that $n = n_0 + n_1$), and $\Gamma(\cdot)$ is the gamma function. Note that $\omega(\theta)$ is a Dirichlet prior with parameters $(\frac{1}{2}, \frac{1}{2})$, and the integrand is just an (unnormalized) Dirichlet distribution with parameters $(n_0 + \frac{1}{2}, n_1 + \frac{1}{2})$.

The main result is that, for regular models with $d$ degrees of freedom, the Dirichlet mixing idea achieves the optimal worst-case regret $\frac{d}{2} \log n + o(\log n)$ for individual sequences of length $n$. In this lecture, we will extend this idea to general tree sources.

1 Tree Sources

We first begin with an informal definition of a tree source.

Definition 1 (Context of a symbol). A context of a symbol $x_t$ is a suffix of the sequence that precedes $x_t$, that is, of $(\ldots, x_{t-2}, x_{t-1})$.

Definition 2 (Tree source). A tree source is a source for which a set of suffixes is sufficient to determine the probability distribution of the next item in the sequence (regardless of the rest of the past).

A tree source is hence a generalization of an order-one Markov source, where only the previous element in the sequence determines the next: now, instead of just one element being sufficient, a suffix is sufficient. It is perhaps best elucidated by example.

Example 3. Let $S = \{00, 10, 1\}$ and $\Theta = \{\theta_s, s \in S\}$. Then
$$p(x_t = 1 \mid X_{t-2} = 0, X_{t-1} = 0) = \theta_{00}$$
$$p(x_t = 1 \mid X_{t-2} = 1, X_{t-1} = 0) = \theta_{10}$$
$$p(x_t = 1 \mid X_{t-1} = 1) = \theta_1$$
We can represent this structure in a binary tree whose leaves are labeled $\theta_1$, $\theta_{10}$, and $\theta_{00}$.

Remark: Note that the source in Example 3 is not first-order Markov, but can be cast as a special case of a second-order Markov source. We denote by $P_{S,\theta}(x^n)$ the associated probability mass function on $x^n \in \mathcal{X}^n$.
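To close, a tiny sketch (our addition; the $\theta_s$ values are made up) of how the tree source of Example 3 assigns next-symbol probabilities by suffix matching:

```python
# Tree source of Example 3: S = {'00', '10', '1'}; theta_s = p(X_t = 1 | s).
theta = {'00': 0.1, '10': 0.4, '1': 0.7}

def prob_one(past):
    """p(X_t = 1 | past), with `past` a string x_1 ... x_{t-1} such as '0110'."""
    for length in range(1, len(past) + 1):
        suffix = past[-length:]
        if suffix in theta:
            return theta[suffix]
    raise ValueError("no suffix of the past lies in S")

print(prob_one('0110'))   # suffix '10' matches, so the answer is theta_10 = 0.4
print(prob_one('0111'))   # suffix '1' matches, so the answer is theta_1 = 0.7
```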


More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources STA 732: Inference Notes 10. Parameter Estimation from a Decision Theoretic Angle Other resources 1 Statistical rules, loss and risk We saw that a major focus of classical statistics is comparing various

More information

Lecture 4 Noisy Channel Coding

Lecture 4 Noisy Channel Coding Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem

More information

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010 Hidden Lecture 4: Hidden : An Introduction to Dynamic Decision Making November 11, 2010 Special Meeting 1/26 Markov Model Hidden When a dynamical system is probabilistic it may be determined by the transition

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Lecture 22: Error exponents in hypothesis testing, GLRT

Lecture 22: Error exponents in hypothesis testing, GLRT 10-704: Information Processing and Learning Spring 2012 Lecture 22: Error exponents in hypothesis testing, GLRT Lecturer: Aarti Singh Scribe: Aarti Singh Disclaimer: These notes have not been subjected

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

Dynamic Approaches: The Hidden Markov Model

Dynamic Approaches: The Hidden Markov Model Dynamic Approaches: The Hidden Markov Model Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Inference as Message

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Variational Inference II: Mean Field Method and Variational Principle Junming Yin Lecture 15, March 7, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3

More information

CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory. Tim Campbell CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm 1. Feedback does not increase the capacity. Consider a channel with feedback. We assume that all the recieved outputs are sent back immediately

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Gaussian Estimation under Attack Uncertainty

Gaussian Estimation under Attack Uncertainty Gaussian Estimation under Attack Uncertainty Tara Javidi Yonatan Kaspi Himanshu Tyagi Abstract We consider the estimation of a standard Gaussian random variable under an observation attack where an adversary

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

x log x, which is strictly convex, and use Jensen s Inequality:

x log x, which is strictly convex, and use Jensen s Inequality: 2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

More information

Lecture 6: Gaussian Channels. Copyright G. Caire (Sample Lectures) 157

Lecture 6: Gaussian Channels. Copyright G. Caire (Sample Lectures) 157 Lecture 6: Gaussian Channels Copyright G. Caire (Sample Lectures) 157 Differential entropy (1) Definition 18. The (joint) differential entropy of a continuous random vector X n p X n(x) over R is: Z h(x

More information

Lecture notes on statistical decision theory Econ 2110, fall 2013

Lecture notes on statistical decision theory Econ 2110, fall 2013 Lecture notes on statistical decision theory Econ 2110, fall 2013 Maximilian Kasy March 10, 2014 These lecture notes are roughly based on Robert, C. (2007). The Bayesian choice: from decision-theoretic

More information

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Probabilities Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: AI systems need to reason about what they know, or not know. Uncertainty may have so many sources:

More information

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 Please submit the solutions on Gradescope. EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 1. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code

More information

Chapter I: Fundamental Information Theory

Chapter I: Fundamental Information Theory ECE-S622/T62 Notes Chapter I: Fundamental Information Theory Ruifeng Zhang Dept. of Electrical & Computer Eng. Drexel University. Information Source Information is the outcome of some physical processes.

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Lecture 5 - Information theory

Lecture 5 - Information theory Lecture 5 - Information theory Jan Bouda FI MU May 18, 2012 Jan Bouda (FI MU) Lecture 5 - Information theory May 18, 2012 1 / 42 Part I Uncertainty and entropy Jan Bouda (FI MU) Lecture 5 - Information

More information

Lecture 1: Introduction, Entropy and ML estimation

Lecture 1: Introduction, Entropy and ML estimation 0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual

More information

Lecture 5: GPs and Streaming regression

Lecture 5: GPs and Streaming regression Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015 10-704 Homework 1 Due: Thursday 2/5/2015 Instructions: Turn in your homework in class on Thursday 2/5/2015 1. Information Theory Basics and Inequalities C&T 2.47, 2.29 (a) A deck of n cards in order 1,

More information

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get 18:2 1/24/2 TOPIC. Inequalities; measures of spread. This lecture explores the implications of Jensen s inequality for g-means in general, and for harmonic, geometric, arithmetic, and related means in

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

LECTURE 2. Convexity and related notions. Last time: mutual information: definitions and properties. Lecture outline

LECTURE 2. Convexity and related notions. Last time: mutual information: definitions and properties. Lecture outline LECTURE 2 Convexity and related notions Last time: Goals and mechanics of the class notation entropy: definitions and properties mutual information: definitions and properties Lecture outline Convexity

More information

Series 7, May 22, 2018 (EM Convergence)

Series 7, May 22, 2018 (EM Convergence) Exercises Introduction to Machine Learning SS 2018 Series 7, May 22, 2018 (EM Convergence) Institute for Machine Learning Dept. of Computer Science, ETH Zürich Prof. Dr. Andreas Krause Web: https://las.inf.ethz.ch/teaching/introml-s18

More information

The binary entropy function

The binary entropy function ECE 7680 Lecture 2 Definitions and Basic Facts Objective: To learn a bunch of definitions about entropy and information measures that will be useful through the quarter, and to present some simple but

More information

9 Forward-backward algorithm, sum-product on factor graphs

9 Forward-backward algorithm, sum-product on factor graphs Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 9 Forward-backward algorithm, sum-product on factor graphs The previous

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

Information Theory in Intelligent Decision Making

Information Theory in Intelligent Decision Making Information Theory in Intelligent Decision Making Adaptive Systems and Algorithms Research Groups School of Computer Science University of Hertfordshire, United Kingdom June 7, 2015 Information Theory

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Machine Learning Lecture Notes

Machine Learning Lecture Notes Machine Learning Lecture Notes Predrag Radivojac January 25, 205 Basic Principles of Parameter Estimation In probabilistic modeling, we are typically presented with a set of observations and the objective

More information

Chapter 4. Data Transmission and Channel Capacity. Po-Ning Chen, Professor. Department of Communications Engineering. National Chiao Tung University

Chapter 4. Data Transmission and Channel Capacity. Po-Ning Chen, Professor. Department of Communications Engineering. National Chiao Tung University Chapter 4 Data Transmission and Channel Capacity Po-Ning Chen, Professor Department of Communications Engineering National Chiao Tung University Hsin Chu, Taiwan 30050, R.O.C. Principle of Data Transmission

More information

Lecture 2: Statistical Decision Theory (Part I)

Lecture 2: Statistical Decision Theory (Part I) Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15

More information

Introduction to Statistical Learning Theory

Introduction to Statistical Learning Theory Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

Bioinformatics: Biology X

Bioinformatics: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA Model Building/Checking, Reverse Engineering, Causality Outline 1 Bayesian Interpretation of Probabilities 2 Where (or of what)

More information

Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

INTRODUCTION TO BAYESIAN METHODS II

INTRODUCTION TO BAYESIAN METHODS II INTRODUCTION TO BAYESIAN METHODS II Abstract. We will revisit point estimation and hypothesis testing from the Bayesian perspective.. Bayes estimators Let X = (X,..., X n ) be a random sample from the

More information

Lecture 7 October 13

Lecture 7 October 13 STATS 300A: Theory of Statistics Fall 2015 Lecture 7 October 13 Lecturer: Lester Mackey Scribe: Jing Miao and Xiuyuan Lu 7.1 Recap So far, we have investigated various criteria for optimal inference. We

More information

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows. Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage

More information

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

U Logo Use Guidelines

U Logo Use Guidelines Information Theory Lecture 3: Applications to Machine Learning U Logo Use Guidelines Mark Reid logo is a contemporary n of our heritage. presents our name, d and our motto: arn the nature of things. authenticity

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

1 Random Variable: Topics

1 Random Variable: Topics Note: Handouts DO NOT replace the book. In most cases, they only provide a guideline on topics and an intuitive feel. 1 Random Variable: Topics Chap 2, 2.1-2.4 and Chap 3, 3.1-3.3 What is a random variable?

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Lecture 1: Bayesian Framework Basics

Lecture 1: Bayesian Framework Basics Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of

More information