Finding a Needle in a Haystack: Conditions for Reliable Detection in the Presence of Clutter
Finding a Needle in a Haystack: Conditions for Reliable Detection in the Presence of Clutter

Bruno Jedynak and Damianos Karakos

October 23, 2006

This research was partially supported by ARO DAAD19/ and general funds from the Center for Imaging Science at The Johns Hopkins University. Bruno Jedynak: USTL and Department of Applied Mathematics, Johns Hopkins University; mail should be addressed to Bruno Jedynak, Clark 302b, Johns Hopkins University, Baltimore, MD. Damianos Karakos: Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD.

Abstract

We study conditions for the detection of an N-length iid sequence with unknown pmf p_1 among M N-length iid sequences with unknown pmf p_0. We show how the quantity M 2^{-N D(p_1, p_0)} determines the asymptotic probability of error.

Keywords: reliable detection, probability of error, Sanov's theorem, Kullback-Leibler distance, phase transition.

1 Introduction

Our motivation for this paper has its origins in Geman et al. 1996, where an algorithm for tracking roads in satellite images was studied experimentally. Below a certain clutter level, the algorithm could track a road accurately; suddenly, with increased clutter level, tracking would become impossible. This phenomenon was studied theoretically in Yuille et al. 2000 and 2001. Using a simplified statistical model, the authors show that, in an
appropriate asymptotic setting, the number of false detections is subject to a phase transition. Our objective in this paper is to generalize these results. First, we demonstrate, in the same setting, that the phase transition phenomenon occurs for the error rate of the maximum likelihood estimator. Second, we consider the situation where the underlying statistical model is unknown; i.e., there is a special object among many others, but it is not known how it is special (it is an outlier, in some sense). We show that the same phase transition phenomenon occurs in this case as well. Moreover, we propose a target detector that has the same asymptotic performance as the maximum likelihood estimator, had the model been known. Simulations illustrate these results.

Let

    X = ( X_1^1 ... X_1^N ; X_2^1 ... X_2^N ; ... ; X_{M+1}^1 ... X_{M+1}^N )

be an (M+1) x N matrix made of independent random variables (rvs) taking values in a finite set. We denote by X_m = (X_m^1, ..., X_m^N) the rvs in line m and by X_{-m} the ones that are not in line m. There is a special line, the target, with index t. All the other lines will be called distractors. The rvs X_t are identically distributed with point mass function (pmf) p_1. The other ones, X_{-t}, are identically distributed with pmf p_0 ≠ p_1. The goal is to estimate t, the target, from a single realization of X. If p_0 and p_1 are close, the target does not differ much from the distractors, a situation akin to finding a needle in a haystack.
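As a concrete illustration of this setup, here is a small sketch that draws such a matrix for a finite alphabet {0, ..., K-1}. The helper names are ours, not the paper's.

```python
import random

def sample_row(N, pmf, rng):
    # iid draws from a pmf over the alphabet {0, ..., K-1}
    return rng.choices(range(len(pmf)), weights=pmf, k=N)

def sample_matrix(M, N, p0, p1, t, rng):
    # (M+1) x N matrix: line t (the target) is iid p1,
    # the other M lines (the distractors) are iid p0
    return [sample_row(N, p1 if m == t else p0, rng) for m in range(M + 1)]

rng = random.Random(0)
X = sample_matrix(M=5, N=200, p0=[0.9, 0.1], p1=[0.5, 0.5], t=2, rng=rng)
```

With p_0 and p_1 this far apart, the target line visibly stands out (about half of its symbols are 1, versus roughly one in ten elsewhere); as p_1 approaches p_0, the rows become indistinguishable by eye.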
2 Known distributions

Let x be a realization of X. Then the log-likelihood of x is

    l(x) = Σ_{n=1}^N log p_1(x_t^n) + Σ_{m=1, m≠t}^{M+1} Σ_{n=1}^N log p_0(x_m^n)   (2)
         = Σ_{n=1}^N log [ p_1(x_t^n) / p_0(x_t^n) ] + Σ_{m=1}^{M+1} Σ_{n=1}^N log p_0(x_m^n)   (3)

The maximum likelihood estimator (mle) for t is then

    t̂(x) = arg max_{1 ≤ m ≤ M+1} Σ_{n=1}^N log [ p_1(x_m^n) / p_0(x_m^n) ]   (4)

We call the reward of line m the quantity

    Σ_{n=1}^N log [ p_1(x_m^n) / p_0(x_m^n) ]   (5)

The mle entails choosing the line with the largest reward. The quantity of interest is the probability that the mle differs from the target:

    e(M, N) = IP( t̂(X) ≠ t )   (6)

which is the probability that a distractor gets a reward greater than the reward of the target. If M is fixed, letting N → ∞ and using the law of large numbers, we obtain

    (1/N) Σ_{n=1}^N log [ p_1(X_t^n) / p_0(X_t^n) ]  →  D(p_1, p_0),   and   (7)
    (1/N) Σ_{n=1}^N log [ p_1(X_m^n) / p_0(X_m^n) ]  →  -D(p_0, p_1)   for every m ≠ t,   (8)

(Logarithms are base 2 throughout the paper.)
almost surely, where

    D(p, q) = Σ_x p(x) log [ p(x) / q(x) ]   (9)

is the Kullback-Leibler distance between p and q. Hence, as long as p ≠ q, D(p, q) > 0, and the reward of the target converges to a positive value while the reward of each distractor converges to a negative value, which allows us to show that the error of the mle goes to zero. One can even bound e(M, N) from above, for any fixed M and N, as follows.

Theorem 1

    e(M, N) ≤ M ( Σ_x √( p_0(x) p_1(x) ) )^{2N}.   (10)

Note that

    Σ_x √( p_0(x) p_1(x) ) = 1 - Hellinger(p_0, p_1),   (11)

where Hellinger(p_0, p_1) is the Hellinger distance between p_0 and p_1. The proof, using classical large deviations techniques, is in Section 6.

Note that if the right-hand side of (10) goes to 0 as M → ∞ and N → ∞, the probability that the mle differs from the target goes to 0. This condition, however, is not necessary. As we show below, there is a maximum rate at which M can go to infinity in order for the probability of error to go to zero; if M increases faster, then the probability of error goes to one. A similar result, i.e., that the number of distractors for which the reward is larger than the reward of the target follows a phase transition, was also shown by Yuille et al. 2001. We present below the same analysis for the convergence of the mle. The phase transition, or in other words the dependence of the probability of error on the rate at which M goes to infinity, is expressed in the following theorem:

Theorem 2 If there exists ε > 0 such that lim_{M,N→∞} M 2^{-N(D(p_1,p_0)-ε)} = 0, then

    lim_{M,N→∞} e(M, N) = 0,   (12)
and if there exists ε > 0 such that lim_{M,N→∞} M 2^{-N(D(p_1,p_0)+ε)} = +∞, then

    lim_{M,N→∞} e(M, N) = 1.   (13)

The intuition is as follows. First, as N goes to infinity, if M remains fixed, the probability of error goes to zero exponentially fast, following a large deviation phenomenon, since the reward of the target line converges to a positive value while the reward of the distractors converges to a negative value, as was mentioned earlier. On the other hand, as the number M of distractors increases, when N remains fixed, the probability that there exists a distractor with a reward larger than the reward of the target increases as well. These are two competing phenomena, whose interaction gives rise to the critical rate D(p_1, p_0). The detailed proof appears in Section 6.

Note: In order for the limits of functions of (M, N) to be well-defined as M, N → ∞, we assume that M is, in general, a function of N. Hence, all limits lim_{M,N→∞} should be interpreted as lim_{N→∞}, with the proviso that M increases according to some function of N. We kept the notation lim_{M,N→∞} for simplicity.

3 Unknown Distributions

We now look at the case where p_0 and p_1 are unknown. It is clear that the error rate of any estimator in this context cannot be lower than the error rate of the mle with known p_0 and p_1. Hence, (13) holds even when e(M, N) is the error rate of any estimator. Can one build an estimator of the target for which the error rate will satisfy (12)? The answer is yes, as we shall see now. A simple way of building an estimator of the target when p_0 and p_1 are unknown is to plug in estimators of p_0 and p_1 into the mle estimator (4). Hence, let us define
    t̃(x) = arg max_{1 ≤ m ≤ M+1} Σ_{n=1}^N log [ p̂_m(x_m^n) / p̂_{-m}(x_m^n) ],   (14)

where p̂_m and p̂_{-m} are the empirical distributions of the rvs in line m and in all the other lines, respectively; i.e.,

    p̂_m(x) = (1/N) Σ_{n=1}^N 1{ X_m^n = x },   (15)

and

    p̂_{-m}(x) = (1/(MN)) Σ_{n=1}^N Σ_{j=1, j≠m}^{M+1} 1{ X_j^n = x }.   (16)

Note that

    t̃(x) = arg max_{1 ≤ m ≤ M+1} D( p̂_m, p̂_{-m} ).   (17)

Hence, t̃ is the line that differs the most, in the Kullback-Leibler sense, from the average distribution of the other lines. The reader may be more familiar with the variant

    ṫ(x) = arg max_{1 ≤ m ≤ M+1} D( p̂_m, p̂ ),   (18)

where p̂ is the empirical distribution over all rvs, including line m; both ṫ and t̃ are similar in the sense that they pick the sequence which differs the most from the rest. It turns out that the error rate of t̃, that is,

    ẽ(M, N) = IP( t̃(X) ≠ t ),   (19)

where, as before, t denotes the target, has the same asymptotic behavior as the mle (4) in the case of known distributions.

Theorem 3 If there exists ε > 0 such that lim_{M,N→∞} M 2^{-N(D(p_1,p_0)-ε)} = 0, then

    lim_{M,N→∞} ẽ(M, N) = 0,   (20)
and if there exists ε > 0 such that lim_{M,N→∞} M 2^{-N(D(p_1,p_0)+ε)} = +∞, then

    lim_{M,N→∞} ẽ(M, N) = 1.   (21)

The proof uses the same large deviations techniques as the proof of Theorem 2, but is slightly more complex due to the fact that the rewards are not independent anymore. The proof appears in Section 6.

Figure 1: Estimates of the probability of error for various N, for the case p_0 = (0.9, 0.1), p_1 = (0.8, 0.2), M = 1000. The two plots correspond to the cases of known and unknown distributions, respectively. The red line represents the upper bound as established by Theorem 1.

4 Simulations

We now provide simulations that show Theorems 1, 2, and 3 in action. We generated M = 1000 binary sequences with probabilities p_0 = (0.9, 0.1) and p_1 = (0.8, 0.2) for the background and the target, respectively. We varied the number N from 10 to 500, and we observed the probability of error decreasing to zero. We performed the random experiment 100 times for each value of N. The procedure was replicated 20 times in
order to compute error bars.

Figure 2: Estimates of the probability of error for various N, for the case p_0 = (0.9, 0.1), p_1 = (0.7, 0.3), M = 1000. The two plots correspond to the cases of known and unknown distributions, respectively. The red line represents the upper bound as established by Theorem 1.

The plots in Figure 1 show the estimated probability of error versus N, for the two maximum likelihood detectors (known and unknown distributions, respectively), along with standard deviation error bars. As expected, the error for the case of unknown distributions is somewhat higher, as there is an additional error due to the inaccuracy in estimating the two distributions. The KL divergence is D(p_1, p_0) ≈ 0.064. The dashed line shows the phase transition boundary, i.e., the value of N such that M = 2^{N D(p_1, p_0)}. For M = 1000, this value is N ≈ 155. For the known-distributions plot, the red line corresponds to the upper bound established by Theorem 1.

Similar plots for the case p_0 = (0.9, 0.1) and p_1 = (0.7, 0.3) are shown in Figure 2. As expected, the error curves of Figure 1 are higher than the ones in Figure 2, since the former detection case is harder than the latter. The phase transition boundary is depicted in Figure 2 with the dashed line at N ≈ 45. The upper bound of Theorem 1 is given by M ( Σ_x √( p_0(x) p_1(x) ) )^{2N}.
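A scaled-down version of this experiment (smaller M and far fewer trials than the paper's, so it runs quickly; the helper names are ours) estimates e(M, N) for the known-distribution mle and locates the phase-transition boundary N* = log2(M) / D(p_1, p_0):

```python
import math
import random

def kl_bits(p, q):
    # D(p, q) = sum_x p(x) log2( p(x) / q(x) ), in bits
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def mle_error_rate(M, N, p0, p1, trials, seed=0):
    # Monte Carlo estimate of e(M, N), eq. (6): the fraction of trials in
    # which some distractor's reward exceeds the target's reward
    rng = random.Random(seed)
    llr = [math.log2(b / a) for a, b in zip(p0, p1)]  # per-symbol log-ratio
    errors = 0
    for _ in range(trials):
        target = sum(llr[x] for x in rng.choices((0, 1), weights=p1, k=N))
        best = max(sum(llr[x] for x in rng.choices((0, 1), weights=p0, k=N))
                   for _ in range(M))
        errors += best > target
    return errors / trials

p0, p1, M = (0.9, 0.1), (0.7, 0.3), 50
N_star = math.log2(M) / kl_bits(p1, p0)   # phase-transition boundary
low = mle_error_rate(M, N=5, p0=p0, p1=p1, trials=40)     # N well below N*
high = mle_error_rate(M, N=150, p0=p0, p1=p1, trials=40)  # N well above N*
```

With these parameters N* is roughly 25, and the estimated error drops from well above one half at N = 5 to essentially zero at N = 150, mirroring the transition visible in the figures.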
5 Conclusions

We have considered a statistical model with M + 1 sequences of independent random variables, each of length N. All random variables have the same point mass function p_0, except for one sequence, the target, for which the common point mass function is p_1. The error of the maximum likelihood estimator for the target converges to 0 if there exists an ε > 0 such that M 2^{-N(D(p_1,p_0)-ε)} → 0, and it converges to 1 if there exists an ε > 0 such that M 2^{-N(D(p_1,p_0)+ε)} → +∞. Moreover, when p_0 and p_1 are unknown, we are able to build an estimator of the target with the same performance; this allows us to study the important practical problem of outlier detection. We conjecture that these results can be generalized to the case of ergodic Markov chains, and we plan to report the more general results in a subsequent publication.

6 Proofs

Without loss of generality, we assume that the target line is line number 1.

Proof of Theorem 1:

    e(M, N) = IP( max_{2 ≤ m ≤ M+1} Σ_{n=1}^N log [ p_1(X_m^n) / p_0(X_m^n) ] > Σ_{n=1}^N log [ p_1(X_1^n) / p_0(X_1^n) ] )   (22)
            ≤ M IP( Σ_{n=1}^N log [ p_1(X_2^n) / p_0(X_2^n) ] > Σ_{n=1}^N log [ p_1(X_1^n) / p_0(X_1^n) ] )   (23)
            = M IP( Π_{n=1}^N ( p_1(X_2^n) p_0(X_1^n) / ( p_0(X_2^n) p_1(X_1^n) ) )^s > 1 ),  for all s > 0   (24)
            ≤ M E[ Π_{n=1}^N ( p_1(X_2^n) p_0(X_1^n) / ( p_0(X_2^n) p_1(X_1^n) ) )^s ]   (25)
            = M ( E[ ( p_1(X_2) p_0(X_1) / ( p_0(X_2) p_1(X_1) ) )^s ] )^N,   (26)

where X_1 ∼ p_1 and X_2 ∼ p_0 denote generic symbols, and where (25) is due to the Markov inequality.
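Before formalizing the minimization over s, a quick numerical check (helper names ours) of the per-symbol expectation in (26): it is minimized at s = 1/2, where it equals the squared Bhattacharyya coefficient ( Σ_x √( p_0(x) p_1(x) ) )².

```python
import math

def f(s, p0, p1):
    # E[(p1(X2) p0(X1) / (p0(X2) p1(X1)))^s] with X1 ~ p1, X2 ~ p0;
    # by independence it factorizes into two one-dimensional expectations
    e2 = sum(q * (p / q) ** s for p, q in zip(p1, p0))   # over X2 ~ p0
    e1 = sum(p * (q / p) ** s for p, q in zip(p1, p0))   # over X1 ~ p1
    return e2 * e1

p0, p1 = (0.9, 0.1), (0.8, 0.2)
bc = sum(math.sqrt(a * b) for a, b in zip(p0, p1))  # Bhattacharyya coefficient
grid = [i / 100 for i in range(1, 100)]
s_min = min(grid, key=lambda s: f(s, p0, p1))       # minimizer on the grid
```

So the tightest bound of this family is M f(1/2)^N = M bc^{2N}; for p_0 = (0.9, 0.1) and p_1 = (0.8, 0.2), bc ≈ 0.99, so the bound decays slowly in N, matching the hard detection case of Figure 1.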
Let us define

    f(s) ≝ E[ ( p_1(X_2) p_0(X_1) / ( p_0(X_2) p_1(X_1) ) )^s ]   and   g(s) ≝ ln f(s).   (27)

One can check that f'(1/2) = g'(1/2) = 0. Moreover, using Hölder's inequality (Grimmett et al. 1992), it is easy to show that, for any s, t > 0 and 0 ≤ α ≤ 1,

    E[ ( p_1(X_2) p_0(X_1) / ( p_0(X_2) p_1(X_1) ) )^{αs + (1-α)t} ]
        ≤ E[ ( p_1(X_2) p_0(X_1) / ( p_0(X_2) p_1(X_1) ) )^s ]^α  E[ ( p_1(X_2) p_0(X_1) / ( p_0(X_2) p_1(X_1) ) )^t ]^{1-α}.   (28)

By taking the log on both sides, we deduce that g is a convex function of s. Since g is convex and g'(1/2) = 0, it achieves its minimum value at s = 1/2; therefore, f achieves its minimum value at s = 1/2. This leads to the tightest upper bound in (26), i.e.,

    e(M, N) ≤ M f(1/2)^N = M ( Σ_x √( p_0(x) p_1(x) ) )^{2N}.   (29)

In order to prove Theorems 2 and 3, we start with two technical lemmas that will be useful later on.

Lemma 1 Let U and R be two rvs, and y ∈ IR. Then, for any ε > 0,

    IP( U > y + ε ) - IP( R < -ε )  ≤  IP( U + R > y )  ≤  IP( U > y - ε ) + IP( R > ε ).   (30)

Proof of Lemma 1:

    IP( U + R > y ) = IP( U + R > y, R ≤ ε ) + IP( U + R > y, R > ε )   (31)
                    ≤ IP( U > y - ε ) + IP( R > ε ),   (32)
and

    IP( U + R ≤ y ) = IP( U + R ≤ y, R ≥ -ε ) + IP( U + R ≤ y, R < -ε )   (33)
                    ≤ IP( U ≤ y + ε ) + IP( R < -ε ),   (34)

which allows us to obtain the lower bound by considering the complementary event.

Lemma 2 Let V_1^N, ..., V_M^N be a sequence of M independent, identically distributed, discrete random variables. Moreover, assume that the following large deviation property holds for some z ∈ IR:

    IP( V_1^N > z ) ≐ 2^{-N I(z)},   (35)

where I(z) > 0, and a_N ≐ 2^{-N b} means lim_{N→+∞} -(1/N) log a_N = b. Then, if there exists ε > 0 s.t. lim_{M,N→+∞} M 2^{-N(I(z)-ε)} = 0, then

    lim_{M,N→+∞} IP( max_{1 ≤ m ≤ M} V_m^N > z ) = 0.   (36)

Also, if there exists ε > 0 s.t. lim_{M,N→+∞} M 2^{-N(I(z)+ε)} = +∞, then

    lim_{M,N→+∞} IP( max_{1 ≤ m ≤ M} V_m^N > z ) = 1.   (37)

Proof of Lemma 2: Let ε > 0 be arbitrarily small. Then, there exists N_0(ε) > 0 such that for all N > N_0,

    | -(1/N) log IP( V_1^N > z ) - I(z) | < ε.   (38)

To prove the first part, we argue by contradiction and assume that IP( max_{1 ≤ m ≤ M} V_m^N > z ) does not converge to zero; i.e., there exists ε' > 0 such that for every N_0 > 0 there is an N > N_0 with IP( max_{1 ≤ m ≤ M} V_m^N > z ) > ε'.
Then, using the union bound, we obtain: for every N_0 > 0 there is an N > N_0 such that

    Σ_{m=1}^M IP( V_m^N > z ) = M IP( V_1^N > z ) > ε'.   (39)

By picking N_0 > N_0(ε), (39) becomes: there is an N > N_0 such that M 2^{-N(I(z)-ε)} > ε'. Hence, M 2^{-N(I(z)-ε)} does not converge to zero, for any ε > 0, as required.

To prove the second part, we first assume that N > N_0(ε), as above. Then,

    IP( max_{1 ≤ m ≤ M} V_m^N > z ) = 1 - IP( max_{1 ≤ m ≤ M} V_m^N ≤ z )   (40)
        = 1 - ( 1 - IP( V_1^N > z ) )^M   (41)
        = 1 - 2^{M log( 1 - IP( V_1^N > z ) )}   (42)
        ≥ 1 - 2^{-M IP( V_1^N > z )}   (43)
        ≥ 1 - 2^{-M 2^{-N(I(z)+ε)}},   (44)

where the first inequality is a consequence of the inequality log(1 - x) ≤ -x, and the second inequality arises from (38). Note that (44) is true for any arbitrary ε > 0. Hence, if there exists ε > 0 such that M 2^{-N(I(z)+ε)} → +∞, then necessarily IP( max_{1 ≤ m ≤ M} V_m^N > z ) → 1. This concludes the proof of the second part, and the proof of the lemma.

We are now ready for the

Proof of Theorem 2:

    e(M, N) = IP( max_{2 ≤ m ≤ M+1} (1/N) Σ_{n=1}^N log [ p_1(X_m^n) / p_0(X_m^n) ] > (1/N) Σ_{n=1}^N log [ p_1(X_1^n) / p_0(X_1^n) ] )   (45)
            = IP( U_M^N + R^N > D(p_1, p_0) ),  where   (46)
    U_M^N = max_{2 ≤ m ≤ M+1} (1/N) Σ_{n=1}^N log [ p_1(X_m^n) / p_0(X_m^n) ]   and   (47)
    R^N = D(p_1, p_0) - (1/N) Σ_{n=1}^N log [ p_1(X_1^n) / p_0(X_1^n) ].   (48)

From the law of large numbers, R^N → 0 in probability. Hence, for all η > 0 and α > 0, and for sufficiently large N, using Lemma 1,

    IP( U_M^N > D(p_1, p_0) + η ) - α  ≤  e(M, N)  ≤  IP( U_M^N > D(p_1, p_0) - η ) + α.   (49)

Let us define

    V_m^N ≝ (1/N) Σ_{n=1}^N log [ p_1(X_m^n) / p_0(X_m^n) ].   (50)

Now, using Sanov's theorem (Dembo et al. 1998),

    IP( V_2^N ≥ D(p_1, p_0) ) ≐ 2^{-N D(p_1, p_0)}.   (51)

Indeed,

    IP( V_2^N ≥ D(p_1, p_0) ) ≐ 2^{-N D(p*, p_0)},   (52)

where D(p*, p_0) = inf_{p ∈ C} D(p, p_0), with

    C = { p : E_p log ( p_1 / p_0 ) ≥ D(p_1, p_0) },   (53)

and for p ∈ C,

    D(p, p_0) = E_p log ( p / p_0 ) = D(p, p_1) + E_p log ( p_1 / p_0 )   (54)
              ≥ D(p, p_1) + D(p_1, p_0) ≥ D(p_1, p_0).   (55)

Since p_1 ∈ C, the infimum is attained at p_1, and D(p*, p_0) = D(p_1, p_0), which gives (51). Now, by continuity of the rate function, there exists ε_1 > 0 such that

    IP( V_2^N > D(p_1, p_0) - η ) ≐ 2^{-N(D(p_1, p_0) - ε_1)},   (56)
and there exists ε_2 > 0 such that

    IP( V_2^N > D(p_1, p_0) + η ) ≐ 2^{-N(D(p_1, p_0) + ε_2)}.   (57)

Finally, since

    U_M^N = max_{2 ≤ m ≤ M+1} V_m^N   (58)

and the rvs V_2^N, ..., V_{M+1}^N are iid, we obtain the required result from Lemma 2.

Proof of Theorem 3: We proceed along the same lines as for the proof of Theorem 2.

    ẽ(M, N) = IP( max_{2 ≤ m ≤ M+1} (1/N) Σ_{n=1}^N log [ p̂_m(X_m^n) / p̂_{-m}(X_m^n) ] > (1/N) Σ_{n=1}^N log [ p̂_1(X_1^n) / p̂_{-1}(X_1^n) ] )   (59)
            ≤ IP( Ū_M^N + R̄_M^N > D(p_1, p_0) ),  with   (60)

    Ū_M^N = max_{2 ≤ m ≤ M+1} (1/N) Σ_{n=1}^N log [ p̂_m(X_m^n) / p_0(X_m^n) ]   (61)
    R̄_M^N = A_M^N + B^N + C^N   (62)
    A_M^N = max_{2 ≤ m ≤ M+1} (1/N) Σ_{n=1}^N log [ p_0(X_m^n) / p̂_{-m}(X_m^n) ]   (63)
    B^N = (1/N) Σ_{n=1}^N log [ p̂_{-1}(X_1^n) / p_0(X_1^n) ]   (64)
    C^N = D(p_1, p_0) - (1/N) Σ_{n=1}^N log [ p̂_1(X_1^n) / p_0(X_1^n) ].   (65)

For all η > 0, from Lemma 1,

    ẽ(M, N) ≤ IP( Ū_M^N > D(p_1, p_0) - η ) + IP( R̄_M^N > η ).   (66)

Let

    V_m^N = (1/N) Σ_{n=1}^N log [ p̂_m(X_m^n) / p_0(X_m^n) ],   2 ≤ m ≤ M+1.   (67)
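A useful observation about the statistic in (67): grouping its N summands by symbol value gives (1/N) Σ_n log [ p̂_m(X_m^n) / p_0(X_m^n) ] = D(p̂_m, p_0), the KL divergence between the empirical pmf of line m and p_0, which is what puts Sanov's theorem in play. The sketch below (helper names ours) checks this identity numerically and implements the resulting plug-in detector of (17):

```python
import math
from collections import Counter

def empirical_pmf(rows, K):
    # pooled empirical pmf over the alphabet {0, ..., K-1}
    counts = Counter(x for row in rows for x in row)
    n = sum(len(row) for row in rows)
    return [counts[x] / n for x in range(K)]

def kl_bits(p, q):
    # D(p, q) in bits; +infinity if q(x) = 0 while p(x) > 0
    if any(a > 0 and b == 0 for a, b in zip(p, q)):
        return math.inf
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

# check: (1/N) sum_n log( phat(x^n) / q(x^n) ) equals D(phat, q)
row, q = [0, 1, 0, 0, 1, 0, 0, 0], [0.9, 0.1]
phat = empirical_pmf([row], 2)
lhs = sum(math.log2(phat[x] / q[x]) for x in row) / len(row)
assert abs(lhs - kl_bits(phat, q)) < 1e-12

def plugin_target(X, K):
    # the detector of eq. (17): the line whose empirical pmf is farthest,
    # in KL divergence, from the pooled empirical pmf of the other lines
    def score(m):
        others = [r for j, r in enumerate(X) if j != m]
        return kl_bits(empirical_pmf([X[m]], K), empirical_pmf(others, K))
    return max(range(len(X)), key=score)

X = [[0] * 10, [0] * 5 + [1] * 5, [0] * 10, [0] * 9 + [1]]
```

On this toy matrix, line 1 (half ones against a background of mostly zeros) is the line selected by the detector.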
Using Sanov's theorem,

    IP( V_2^N ≥ D(p_1, p_0) ) ≐ 2^{-N D(p_1, p_0)}.   (68)

Indeed,

    IP( V_2^N ≥ D(p_1, p_0) ) ≐ 2^{-N D(p*, p_0)},   (69)

where D(p*, p_0) = inf_{p ∈ C} D(p, p_0), with

    C = { p : E_p log ( p / p_0 ) ≥ D(p_1, p_0) },   (70)

and for p ∈ C,

    D(p, p_0) = E_p log ( p / p_0 ) ≥ D(p_1, p_0).   (71)

Now, by continuity of the rate function, there exists ε_1 > 0 such that

    IP( V_2^N > D(p_1, p_0) - η ) ≐ 2^{-N(D(p_1, p_0) - ε_1)}.   (72)

To show that (66) approaches zero as M, N → ∞ with M 2^{-N(D(p_1,p_0)-ε)} → 0, it suffices to prove that R̄_M^N → 0, since the term IP( Ū_M^N > D(p_1, p_0) - η ) of (66) goes to zero by virtue of (72), Lemma 2, and the fact that

    Ū_M^N = max_{2 ≤ m ≤ M+1} V_m^N,   (73)

where the rvs V_2^N, ..., V_{M+1}^N are iid.

Using the law of large numbers, C^N → 0 in probability. Also,

    IP( A_M^N > η ) ≤ M IP( (1/N) Σ_{n=1}^N log [ p_0(X_2^n) / p̂_{-2}(X_2^n) ] > η )   (74)
        ≤ M IP( max_{1 ≤ n ≤ N} log [ p_0(X_2^n) / p̂_{-2}(X_2^n) ] > η )   (75)
        ≤ M N IP( log [ p_0(X_2^1) / p̂_{-2}(X_2^1) ] > η )   (76)
        ≤ M N max_x IP( log [ p_0(x) / p̂_{-2}(x) ] > η )   (77)
        = M N max_x IP( p̂_{-2}(x) < 2^{-η} p_0(x) )   (78)
        ≤ M N max_x 2^{-MN I(x, η)} = M N 2^{-MN J(η)},   (79)

where I(x, η) > 0 is a rate function and J(η) = min_x I(x, η). The last inequality comes from the fact that p̂_{-2}(x) → p_0(x) in probability. A similar argument shows that B^N → 0 in probability as well. Note that the result would still hold if we replaced p̂_{-m} with p̂, i.e., with the empirical distribution over the full data.

References

[1] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. Springer Verlag, 1998.

[2] D. Geman and B. Jedynak. An active testing model for tracking roads from satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1):1-14, January 1996.

[3] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford Science Publications, 1992.

[4] Alan L. Yuille and James M. Coughlan. Fundamental limits of Bayesian inference: Order parameters and phase transitions for road tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2):160-173, February 2000.

[5] Alan L. Yuille, James M. Coughlan, Yingnian Wu, and Song Chun Zhu. Order parameters for detecting target curves in images: When does high level knowledge help? International Journal of Computer Vision, 41(1/2):9-33, 2001.
DRAFT: Version 0.1, 28 October 2005. Do not distribute. Expectation Propagation in Factor Graphs: A Tutorial Charles Sutton October 28, 2005 Abstract Expectation propagation is an important variational
More information4F5: Advanced Communications and Coding Handout 2: The Typical Set, Compression, Mutual Information
4F5: Advanced Communications and Coding Handout 2: The Typical Set, Compression, Mutual Information Ramji Venkataramanan Signal Processing and Communications Lab Department of Engineering ramji.v@eng.cam.ac.uk
More informationHardness of the Covering Radius Problem on Lattices
Hardness of the Covering Radius Problem on Lattices Ishay Haviv Oded Regev June 6, 2006 Abstract We provide the first hardness result for the Covering Radius Problem on lattices (CRP). Namely, we show
More informationComputable priors sharpened into Occam s razors
Computable priors sharpened into Occam s razors David R. Bickel To cite this version: David R. Bickel. Computable priors sharpened into Occam s razors. 2016. HAL Id: hal-01423673 https://hal.archives-ouvertes.fr/hal-01423673v2
More informationHands-On Learning Theory Fall 2016, Lecture 3
Hands-On Learning Theory Fall 016, Lecture 3 Jean Honorio jhonorio@purdue.edu 1 Information Theory First, we provide some information theory background. Definition 3.1 (Entropy). The entropy of a discrete
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationCOMPSCI 650 Applied Information Theory Jan 21, Lecture 2
COMPSCI 650 Applied Information Theory Jan 21, 2016 Lecture 2 Instructor: Arya Mazumdar Scribe: Gayane Vardoyan, Jong-Chyi Su 1 Entropy Definition: Entropy is a measure of uncertainty of a random variable.
More informationComposite Hypotheses and Generalized Likelihood Ratio Tests
Composite Hypotheses and Generalized Likelihood Ratio Tests Rebecca Willett, 06 In many real world problems, it is difficult to precisely specify probability distributions. Our models for data may involve
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More informationClustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.
Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)
More informationFORMULATION OF THE LEARNING PROBLEM
FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we
More informationSTATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY
2nd International Symposium on Information Geometry and its Applications December 2-6, 2005, Tokyo Pages 000 000 STATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY JUN-ICHI TAKEUCHI, ANDREW R. BARRON, AND
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of
More informationcertain class of distributions, any SFQ can be expressed as a set of thresholds on the sufficient statistic. For distributions
Score-Function Quantization for Distributed Estimation Parvathinathan Venkitasubramaniam and Lang Tong School of Electrical and Computer Engineering Cornell University Ithaca, NY 4853 Email: {pv45, lt35}@cornell.edu
More informationA Note on the Central Limit Theorem for a Class of Linear Systems 1
A Note on the Central Limit Theorem for a Class of Linear Systems 1 Contents Yukio Nagahata Department of Mathematics, Graduate School of Engineering Science Osaka University, Toyonaka 560-8531, Japan.
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More information1 Inverse Transform Method and some alternative algorithms
Copyright c 2016 by Karl Sigman 1 Inverse Transform Method and some alternative algorithms Assuming our computer can hand us, upon demand, iid copies of rvs that are uniformly distributed on (0, 1), it
More information3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.
Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that
More informationLecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321
Lecture 11: Introduction to Markov Chains Copyright G. Caire (Sample Lectures) 321 Discrete-time random processes A sequence of RVs indexed by a variable n 2 {0, 1, 2,...} forms a discretetime random process
More informationIEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 43, NO. 3, MARCH
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 43, NO. 3, MARCH 1998 315 Asymptotic Buffer Overflow Probabilities in Multiclass Multiplexers: An Optimal Control Approach Dimitris Bertsimas, Ioannis Ch. Paschalidis,
More informationReputations. Larry Samuelson. Yale University. February 13, 2013
Reputations Larry Samuelson Yale University February 13, 2013 I. Introduction I.1 An Example: The Chain Store Game Consider the chain-store game: Out In Acquiesce 5, 0 2, 2 F ight 5,0 1, 1 If played once,
More informationMINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava
MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,
More informationExponential Tail Bounds
Exponential Tail Bounds Mathias Winther Madsen January 2, 205 Here s a warm-up problem to get you started: Problem You enter the casino with 00 chips and start playing a game in which you double your capital
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science Transmission of Information Spring 2006
MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.44 Transmission of Information Spring 2006 Homework 2 Solution name username April 4, 2006 Reading: Chapter
More information1 Sequences of events and their limits
O.H. Probability II (MATH 2647 M15 1 Sequences of events and their limits 1.1 Monotone sequences of events Sequences of events arise naturally when a probabilistic experiment is repeated many times. For
More informationUniversal Outlier Hypothesis Testing
Universal Outlier Hypothesis Testing Venu Veeravalli ECE Dept & CSL & ITI University of Illinois at Urbana-Champaign http://www.ifp.illinois.edu/~vvv (with Sirin Nitinawarat and Yun Li) NSF Workshop on
More informationBandit models: a tutorial
Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses
More informationIowa State University. Instructor: Alex Roitershtein Summer Exam #1. Solutions. x u = 2 x v
Math 501 Iowa State University Introduction to Real Analysis Department of Mathematics Instructor: Alex Roitershtein Summer 015 Exam #1 Solutions This is a take-home examination. The exam includes 8 questions.
More informationABC methods for phase-type distributions with applications in insurance risk problems
ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon
More informationApplication of Information Theory, Lecture 7. Relative Entropy. Handout Mode. Iftach Haitner. Tel Aviv University.
Application of Information Theory, Lecture 7 Relative Entropy Handout Mode Iftach Haitner Tel Aviv University. December 1, 2015 Iftach Haitner (TAU) Application of Information Theory, Lecture 7 December
More informationAsymptotic formulae for the number of repeating prime sequences less than N
otes on umber Theory and Discrete Mathematics Print ISS 30 532, Online ISS 2367 8275 Vol. 22, 206, o. 4, 29 40 Asymptotic formulae for the number of repeating prime sequences less than Christopher L. Garvie
More informationMachine Learning for Data Science (CS4786) Lecture 24
Machine Learning for Data Science (CS4786) Lecture 24 HMM Particle Filter Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2017fa/ Rejection Sampling Rejection Sampling Sample variables from joint
More informationStatistics: Learning models from data
DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial
More informationUnsupervised Learning with Permuted Data
Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University
More informationLecture 4: Proof of Shannon s theorem and an explicit code
CSE 533: Error-Correcting Codes (Autumn 006 Lecture 4: Proof of Shannon s theorem and an explicit code October 11, 006 Lecturer: Venkatesan Guruswami Scribe: Atri Rudra 1 Overview Last lecture we stated
More informationMath 229: Introduction to Analytic Number Theory The product formula for ξ(s) and ζ(s); vertical distribution of zeros
Math 9: Introduction to Analytic Number Theory The product formula for ξs) and ζs); vertical distribution of zeros Behavior on vertical lines. We next show that s s)ξs) is an entire function of order ;
More information10. Composite Hypothesis Testing. ECE 830, Spring 2014
10. Composite Hypothesis Testing ECE 830, Spring 2014 1 / 25 In many real world problems, it is difficult to precisely specify probability distributions. Our models for data may involve unknown parameters
More informationf-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models
IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,
More informationAn inverse of Sanov s theorem
An inverse of Sanov s theorem Ayalvadi Ganesh and Neil O Connell BRIMS, Hewlett-Packard Labs, Bristol Abstract Let X k be a sequence of iid random variables taking values in a finite set, and consider
More informationFirst Hitting Time Analysis of the Independence Metropolis Sampler
(To appear in the Journal for Theoretical Probability) First Hitting Time Analysis of the Independence Metropolis Sampler Romeo Maciuca and Song-Chun Zhu In this paper, we study a special case of the Metropolis
More informationP i [B k ] = lim. n=1 p(n) ii <. n=1. V i :=
2.7. Recurrence and transience Consider a Markov chain {X n : n N 0 } on state space E with transition matrix P. Definition 2.7.1. A state i E is called recurrent if P i [X n = i for infinitely many n]
More information