Computation of Csiszár's Mutual Information of Order α
Damianos Karakos, Sanjeev Khudanpur and Carey E. Priebe
Department of Electrical and Computer Engineering and Center for Language and Speech Processing,
Department of Applied Mathematics and Statistics,
Johns Hopkins University, Baltimore, MD 21218, USA
{damianos, khudanpur, cep}@jhu.edu

Abstract — Csiszár introduced the mutual information of order $\alpha$ in [1] as a parameterized version of Shannon's mutual information. It involves a minimization over the probability space, and it cannot be computed in closed form. An alternating minimization algorithm for its computation is presented in this paper, along with a proof of its convergence. Furthermore, it is proved that the algorithm is an instance of the Csiszár–Tusnády iterative minimization procedure [2].

I. INTRODUCTION

Csiszár [1] defined the mutual information of order $\alpha$ between two discrete random variables $X, Y$ (with joint distribution $P_{XY}$) as

$$ I_\alpha(X;Y) = \min_{Q} \sum_y P_Y(y)\, D_\alpha\big(P_{X|Y}(\cdot\,|\,y)\,\big\|\,Q(\cdot)\big), \qquad (1) $$

where $\alpha > 0$, $\alpha \neq 1$, and

$$ D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log \sum_x P(x)^{\alpha}\, Q(x)^{1-\alpha} \qquad (2) $$

is the Rényi divergence of order $\alpha$ [3] between distributions $P, Q$. The limit of (2), as $\alpha \to 1$, is the well-known Kullback–Leibler information divergence. Furthermore, we adopt the same convention as in [1]: if $\alpha > 1$ and $Q(x) = 0$ for some $x$, then $P(x)^{\alpha} Q(x)^{1-\alpha}$ is equal to $0$ or $+\infty$ when $P(x) = 0$ or $P(x) > 0$, respectively.

It can be proved [1] that $I_\alpha(X;Y)$ retains most of the properties of $I(X;Y)$ (i.e., it is a non-negative, continuous, and concave function of $P_X$, and, when $\alpha < 1$, it is convex with respect to $P_{X|Y}$). Also, as $\alpha \to 1$, $I_\alpha(X;Y)$ converges to $I(X;Y)$.

Observe that (2) is computed by exponentiating the probability measures. If the exponents are strictly less than unity (i.e., $0 < \alpha < 1$), this results in smoothed values, lessening possible extreme variations in $P(x), Q(x)$. Smoothing is important in cases where the distributions are computed from limited amounts of data; sparsity typically leads to underestimates of the least likely events, and to overestimates of the most likely events. Such phenomena are prevalent when the underlying spaces are very large compared to the sample size; see, for instance, [4]–[7] for various treatments of the sparsity problem in natural language processing. This fact partly explains the improvement in performance of the text categorization algorithms of [8], [9], when Shannon's mutual information was replaced with (1) as a criterion for clustering sparse empirical distributions.

There is no analytic form for the minimizer of the right-hand side of (1). The next section describes an alternating minimization algorithm which converges to the desired minimum.

Use of notation: Random variables (r.v.) are denoted by capital letters, while their realizations are denoted by the corresponding lowercase letters. All random variables are assumed to lie in discrete finite spaces, denoted by the corresponding calligraphic letter (e.g., $x \in \mathcal{X}$). The probability mass function (pmf) of a random quantity is denoted using the appropriate subscript, e.g., $P_X$ for r.v. $X$, or $P_{XY}$ for the joint pmf of $X, Y$. The conditional pmf of $X$ given $Y$ is denoted by $P_{X|Y}$; the letter $W$ is also used to denote a conditional pmf. Finally, for the rest of the paper it will be assumed that $0 < \alpha < 1$.
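To make definition (2) concrete, the following minimal Python sketch (not part of the original paper; the function name, the natural-log base, and the test distributions are our own choices) computes the Rényi divergence for $0 < \alpha < 1$ and checks numerically that it approaches the Kullback–Leibler divergence as $\alpha \to 1$:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(p || q) of equation (2), natural-log base.

    Valid for 0 < alpha < 1; terms with p(x) = 0 or q(x) = 0 then
    contribute zero to the sum, so they are simply masked out.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = (p > 0) & (q > 0)
    s = np.sum(p[mask] ** alpha * q[mask] ** (1.0 - alpha))
    return np.log(s) / (alpha - 1.0)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
kl = np.sum(p * np.log(p / q))            # Kullback-Leibler divergence
print(renyi_divergence(p, q, 0.999), kl)  # nearly equal as alpha -> 1
```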
II. THE ALGORITHM

The algorithm for the computation of (1) is described in Figure 1; it takes as input the joint distribution $P_{XY}$ and a threshold $\gamma > 0$ (the latter is typically a small number, and is used in the termination condition). The algorithm is based on the following identity (proved in Section III): for any distributions $P, Q$ and $0 < \alpha < 1$,

$$ D_\alpha(P\|Q) = \frac{\alpha}{1-\alpha}\, D(P^*\|P) + D(P^*\|Q), \qquad (3) $$

where

$$ P^*(x) = \frac{P(x)^{\alpha}\, Q(x)^{1-\alpha}}{\sum_{x'} P(x')^{\alpha}\, Q(x')^{1-\alpha}}. $$

Applying (3) to each term of (1), with $P^*(\cdot|y)$ denoting the normalization of $(P_{X|Y}(\cdot|y))^{\alpha}(Q(\cdot))^{1-\alpha}$ for each $y$, we obtain

$$ I_\alpha(X;Y) = \min_{Q} \sum_y P_Y(y) \Big( D\big(P^*(\cdot|y)\,\big\|\,Q\big) + \frac{\alpha}{1-\alpha}\, D\big(P^*(\cdot|y)\,\big\|\,P_{X|Y}(\cdot|y)\big) \Big), $$

which, by virtue of Lemma 1 of Section III, can be written as

$$ I_\alpha(X;Y) = \min_{Q} \min_{W} \sum_y P_Y(y) \Big( D\big(W(\cdot|y)\,\big\|\,Q\big) + \frac{\alpha}{1-\alpha}\, D\big(W(\cdot|y)\,\big\|\,P_{X|Y}(\cdot|y)\big) \Big). \qquad (4) $$

The double minimization of (4) is achieved through the alternating minimization of the algorithm of Figure 1, as is proved in Theorem 1 of Section III.

Algorithm for computing Csiszár's mutual information of order $\alpha$.
Input: joint distribution $P_{XY}$, $0 < \alpha < 1$, and threshold $\gamma > 0$.
Initialization: $t = 0$; $Q_0(x) = \sum_y P_Y(y)\, P_{X|Y}(x|y)$; $I_0 = \infty$.
Loop: Repeat
    $t = t + 1$
    $P_t^*(x|y) = C_t(y)\, (P_{X|Y}(x|y))^{\alpha}\, (Q_{t-1}(x))^{1-\alpha}$
    $Q_t(x) = \sum_y P_Y(y)\, P_t^*(x|y)$
    $I_t = \sum_y P_Y(y) \big( \tfrac{\alpha}{1-\alpha}\, D(P_t^*(\cdot|y)\,\|\,P_{X|Y}(\cdot|y)) + D(P_t^*(\cdot|y)\,\|\,Q_t) \big)$
Until $I_{t-1} - I_t < \gamma$
Output: $I_t$.

Fig. 1. Alternating minimization algorithm for computing $I_\alpha(X;Y)$. $C_t(y)$ is just a normalizing constant, computed for each conditioning $y$.
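The following NumPy sketch transcribes Figure 1 directly (our own code, not from the paper; it assumes a strictly positive joint pmf so that all logarithms are finite, and stores the joint distribution as an array with rows indexed by $y$ and columns by $x$):

```python
import numpy as np

def csiszar_mutual_information(P_XY, alpha, gamma=1e-12, max_iter=100000):
    """Alternating minimization of Figure 1 for I_alpha(X;Y), 0 < alpha < 1.

    P_XY: joint pmf, rows indexed by y, columns by x; assumed strictly positive.
    """
    P_XY = np.asarray(P_XY, float)
    P_Y = P_XY.sum(axis=1)                  # marginal P_Y(y)
    P_XgY = P_XY / P_Y[:, None]             # conditional P_{X|Y}(x|y)
    Q = P_Y @ P_XgY                         # initialization: Q_0 = P_X
    c = alpha / (1.0 - alpha)
    I_prev = np.inf
    for _ in range(max_iter):
        # P*_t(x|y) = C_t(y) P_{X|Y}(x|y)^alpha Q_{t-1}(x)^(1-alpha)
        Pstar = P_XgY ** alpha * Q ** (1.0 - alpha)
        Pstar /= Pstar.sum(axis=1, keepdims=True)
        Q = P_Y @ Pstar                     # Q_t(x) = sum_y P_Y(y) P*_t(x|y)
        D_P = np.sum(Pstar * np.log(Pstar / P_XgY), axis=1)
        D_Q = np.sum(Pstar * np.log(Pstar / Q), axis=1)
        I = float(P_Y @ (c * D_P + D_Q))    # I_t
        if I_prev - I < gamma:              # termination condition
            return I
        I_prev = I
    return I_prev

# Example: for alpha close to 1, I_alpha should approach Shannon's I(X;Y).
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])
print(csiszar_mutual_information(P, alpha=0.999))
P_X, P_Y = P.sum(axis=0), P.sum(axis=1)
print(np.sum(P * np.log(P / np.outer(P_Y, P_X))))  # Shannon I(X;Y) in nats
```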
III. CONVERGENCE OF THE ALGORITHM

First, we will prove identity (3). Given distributions $P, Q$ and a scalar $0 < \alpha < 1$, we define

$$ P^*(x) = C^{-1}\, P(x)^{\alpha}\, Q(x)^{1-\alpha}, $$

where $C = \sum_x P(x)^{\alpha} Q(x)^{1-\alpha}$ is just a normalizing constant. Then, we have the following chain of equalities:

$$
\begin{aligned}
D(P^*\|Q) &= \sum_x P^*(x) \log\Big( \frac{1}{C} \Big(\frac{P(x)}{Q(x)}\Big)^{\alpha} \Big) \\
&= -\log C + \alpha \sum_x P^*(x) \log \frac{P(x)}{Q(x)} \\
&= -\log C + \alpha \big( D(P^*\|Q) - D(P^*\|P) \big) \\
&= (1-\alpha)\, D_\alpha(P\|Q) + \alpha \big( D(P^*\|Q) - D(P^*\|P) \big). \qquad (5)
\end{aligned}
$$

By re-arranging terms in (5), we obtain (3).

Now, for fixed distributions $P_Y, P_{X|Y}$, let us define the functional

$$ J(W, Q) = \sum_y P_Y(y) \Big( D\big(W(\cdot|y)\,\big\|\,Q\big) + \frac{\alpha}{1-\alpha}\, D\big(W(\cdot|y)\,\big\|\,P_{X|Y}(\cdot|y)\big) \Big), \qquad (6) $$

which is a function of a distribution $Q$ and a conditional distribution $W$. The algorithm of Figure 1 performs an alternating update of $Q, W$. As is proved in the next two lemmas, this alternating update can only reduce $J(W, Q)$ (alternating minimization). Then, convergence to the global minimum is guaranteed by the fact that $Q, W$ lie in convex (probability) spaces, and that $J$ is a convex function of its arguments $Q, W$.

Lemma 1: For a fixed conditional distribution $W$, the functional $J(W, Q)$ is minimized by $Q^*(x) = \sum_y P_Y(y)\, W(x|y)$.

Proof: Only the first term in the right-hand side of (6) needs to be minimized. It can be easily checked that this term is equal to $D(P_Y W \,\|\, P_Y \times Q)$, and hence the proof follows immediately from Lemma 13.8.1 of [10].

Alternative proof: Let $Q$ be an arbitrary pmf on $\mathcal{X}$, and $Q^*$ as specified in the statement of the lemma. Then,

$$
\begin{aligned}
\sum_y P_Y(y)\, D(W(\cdot|y)\,\|\,Q) - \sum_y P_Y(y)\, D(W(\cdot|y)\,\|\,Q^*)
&= \sum_y P_Y(y) \sum_x \Big( W(x|y) \log\frac{W(x|y)}{Q(x)} - W(x|y) \log\frac{W(x|y)}{Q^*(x)} \Big) \\
&= \sum_x Q^*(x) \log\frac{Q^*(x)}{Q(x)} \;\ge\; 0.
\end{aligned}
$$

Hence, the first term in the right-hand side of (6) is minimized by $Q^*$, as required.

Lemma 2: For a fixed distribution $Q$, the functional $J(W, Q)$ is minimized by

$$ W^*(x|y) = C(y)\, (P_{X|Y}(x|y))^{\alpha}\, (Q(x))^{1-\alpha}, $$

where $C(y) = \big( \sum_x (P_{X|Y}(x|y))^{\alpha} (Q(x))^{1-\alpha} \big)^{-1}$ is a normalizing constant.

Proof: It suffices to find the minimizer of the expression

$$ K\big(W(\cdot|y), Q\big) = D\big(W(\cdot|y)\,\|\,Q\big) + \frac{\alpha}{1-\alpha}\, D\big(W(\cdot|y)\,\|\,P_{X|Y}(\cdot|y)\big) $$

for any given $y$. This will imply minimization of $J(W, Q)$ as well. Notice that $K(W(\cdot|y), Q)$ is a convex function of $W(\cdot|y)$, for all $y$, by virtue of the convexity of the KL-divergence [10]. Hence, it has a unique minimizer, which can be found using Lagrange multipliers, by solving the equations

$$ \frac{\partial}{\partial W(x|y)} \Big( K\big(W(\cdot|y), Q\big) + \lambda_y \big( \textstyle\sum_{x'} W(x'|y) - 1 \big) \Big) = 0 \qquad (7) $$

for all $x, y$, under the usual constraint $\sum_x W(x|y) = 1$. Note that there are $|\mathcal{X}|\,|\mathcal{Y}|$ equalities implied in Equation (7). The left-hand side of (7) is equal to

$$
\begin{aligned}
&1 + \log W(x|y) - \log Q(x) + \frac{\alpha}{1-\alpha} \big( 1 + \log W(x|y) - \log P_{X|Y}(x|y) \big) + \lambda_y \\
&= \frac{1}{1-\alpha} \Big( 1 + \log W(x|y) - (1-\alpha) \log Q(x) - \alpha \log P_{X|Y}(x|y) \Big) + \lambda_y \\
&= \frac{1}{1-\alpha} \Big( 1 + \log W(x|y) - \log\big( (P_{X|Y}(x|y))^{\alpha} (Q(x))^{1-\alpha} \big) \Big) + \lambda_y. \qquad (8)
\end{aligned}
$$

By setting (8) equal to zero and solving for $W(x|y)$, we obtain

$$ W(x|y) = \exp\big( -1 - (1-\alpha)\lambda_y \big)\, (P_{X|Y}(x|y))^{\alpha}\, (Q(x))^{1-\alpha}, $$

which, together with the constraint $\sum_x W(x|y) = 1$, gives us the required result.

Alternative proof: Minimizing $K(W, Q)$ is equivalent to minimizing

$$ (1-\alpha)\, K(W, Q) = (1-\alpha)\, D(W\|Q) + \alpha\, D(W\|P) \qquad (9) $$

with respect to $W$, for fixed $Q, P$ (here we write $W, P$ for $W(\cdot|y), P_{X|Y}(\cdot|y)$, suppressing the dependence on $y$). Then,

$$ (1-\alpha)\, D(W\|Q) + \alpha\, D(W\|P) = \sum_x W(x) \log \frac{C\, W(x)}{C\, Q(x)^{1-\alpha} P(x)^{\alpha}} = \sum_x W(x) \log \frac{W(x)}{W^*(x)} - \log C, \qquad (10) $$

where $C = \sum_x Q(x)^{1-\alpha} P(x)^{\alpha}$ and the first expression in (10) is the non-negative Kullback–Leibler distance between the distribution $W$ and the distribution $W^*(x) = C^{-1} Q(x)^{1-\alpha} P(x)^{\alpha}$. Hence (10) is minimized when $W = W^*$, as required.
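Identity (3) and the minimizer of Lemma 2 are easy to check numerically. The sketch below (our own illustration; the names and the random test distributions are arbitrary) draws random pmfs, verifies that $D_\alpha(P\|Q)$ equals $D(P^*\|Q) + \frac{\alpha}{1-\alpha} D(P^*\|P)$, and confirms that $P^*$ achieves a lower value of the objective $K$ of Lemma 2 than a random competitor:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

alpha = 0.3
P = rng.dirichlet(np.ones(5))
Q = rng.dirichlet(np.ones(5))

# P*(x) = C^{-1} P(x)^alpha Q(x)^(1-alpha), with C the normalizing sum
unnorm = P ** alpha * Q ** (1.0 - alpha)
C = unnorm.sum()
Pstar = unnorm / C

D_alpha = np.log(C) / (alpha - 1.0)                        # definition (2)
rhs = alpha / (1.0 - alpha) * kl(Pstar, P) + kl(Pstar, Q)  # identity (3)
print(np.isclose(D_alpha, rhs))                            # True

# Lemma 2: P* minimizes K(W) = D(W||Q) + alpha/(1-alpha) D(W||P)
def K(W):
    return kl(W, Q) + alpha / (1.0 - alpha) * kl(W, P)

W_rand = rng.dirichlet(np.ones(5))
print(K(Pstar) <= K(W_rand))                               # True
```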
Theorem 1: The alternating minimization algorithm of Figure 1 converges to $I_\alpha(X;Y)$, i.e., it computes the expression

$$ \min_{Q} \min_{W} J(W, Q), $$

where $J(W, Q)$ was defined in (6).

Proof: It is easy to establish that $J(W, Q)$ is a convex function of the pair $(Q, W)$, as it is a linear combination of KL-divergences having $Q, W$ as their arguments. Also, $Q$ and $W$ belong to convex probability sets. Thus, the sufficient condition of [11] is satisfied, which implies that the alternating minimization algorithm of Figure 1 converges to the (unique) global minimum of $J(W, Q)$.

IV. CONNECTION TO THE CSISZÁR–TUSNÁDY ALTERNATING MINIMIZATION ALGORITHM

In their paper [2], Csiszár and Tusnády show that an iterative algorithm that successively computes projections between two convex sets eventually converges, under certain conditions, to the minimum distance between the two sets. In this section, we prove that these conditions are satisfied for the functional $J$ of (6), thus establishing that the alternating minimization algorithm of Section II is an instance of the algorithm of [2]. This provides an alternative (albeit more complicated) proof of the convergence of the algorithm.

Let $\mathcal{P}, \mathcal{Q}$ represent two abstract convex and closed sets, and let $d : \mathcal{P} \times \mathcal{Q} \to \mathbb{R} \cup \{+\infty\}$ be an extended real-valued function. The projection of a member of one set onto the other set is defined as follows:

$$ P \to Q' \quad \text{iff} \quad d(P, Q') = \min_{Q \in \mathcal{Q}} d(P, Q), $$
$$ Q \to P' \quad \text{iff} \quad d(P', Q) = \min_{P \in \mathcal{P}} d(P, Q), $$

where $P, P' \in \mathcal{P}$ and $Q, Q' \in \mathcal{Q}$. The alternating minimization algorithm of [2] computes a sequence of members of $\mathcal{P}$ and $\mathcal{Q}$ satisfying $Q_0 \to P_1 \to Q_1 \to P_2 \to \cdots$. For a given function $\delta$, we have the following definitions (from [2]):

Definition 1: The three points property holds for $P \in \mathcal{P}$ if, whenever $Q_0 \to P_1$,

$$ d(P, Q_0) \ge \delta(P, P_1) + d(P_1, Q_0). $$

Definition 2: The four points property holds for $P \in \mathcal{P}$, $Q \in \mathcal{Q}$ if, whenever $P_1 \to Q_1$,

$$ d(P, Q) + \delta(P, P_1) \ge d(P, Q_1). $$

The following theorem of [2] establishes convergence of the iterative minimization procedure to the minimum distance between the two sets:

Theorem 2: If the three points property and the four points property hold for all $P \in \mathcal{P}$, $Q \in \mathcal{Q}$, then

$$ \lim_{n \to \infty} d(P_n, Q_n) = \min_{P \in \mathcal{P},\, Q \in \mathcal{Q}} d(P, Q). $$

We will now show that the sequence of updates $Q_0 \to W_1 \to Q_1 \to W_2 \to \cdots$, as expressed in the algorithm of Figure 1, leads to convergence to the minimum of $J(W, Q)$. Note that $W \in \mathcal{P}_{X|Y}$, the set of conditional pmfs of $X$ given $Y$, and $Q \in \mathcal{P}_X$, where $\mathcal{P}_X$ is the set of probability measures on $\mathcal{X}$. Furthermore, $\delta$ is the KL-divergence function.

Theorem 3: $J$ satisfies the four points property for all $W \in \mathcal{P}_{X|Y}$ and all $Q \in \mathcal{P}_X$.

Proof: We have

$$
\begin{aligned}
J(W, Q_1) - J(W, Q) &= \sum_y P_Y(y) \big( D(W(\cdot|y)\,\|\,Q_1) - D(W(\cdot|y)\,\|\,Q) \big) \\
&= \sum_x (P_Y W)(x) \log \frac{Q(x)}{Q_1(x)} \\
&= D(P_Y W\,\|\,Q_1) - D(P_Y W\,\|\,Q) \qquad (11) \\
&= D(P_Y W\,\|\,P_Y W_1) - D(P_Y W\,\|\,Q) \qquad (12) \\
&\le \delta(P_Y W, P_Y W_1), \qquad (13)
\end{aligned}
$$

where, in (11), $P_Y W$ is the $\mathcal{X}$-marginal of the joint distribution $W P_Y$, and, in (12), $Q_1$ was replaced by $P_Y W_1$ by virtue of Lemma 1.
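The inequality of Theorem 3 can be probed numerically: draw random $W$ and $Q$, compute $W_1$ from $Q$ via the Lemma 2 update and $Q_1 = P_Y W_1$ via Lemma 1, and confirm that $J(W, Q_1) \le J(W, Q) + D(P_Y W \,\|\, P_Y W_1)$. A sketch (our own; all names and test distributions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.4
c = alpha / (1.0 - alpha)
nx, ny = 4, 3

P_Y = rng.dirichlet(np.ones(ny))
P_XgY = rng.dirichlet(np.ones(nx), size=ny)   # row y holds P_{X|Y}(.|y)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def J(W, Q):                                  # functional (6)
    D_Q = np.sum(W * np.log(W / Q), axis=1)
    D_P = np.sum(W * np.log(W / P_XgY), axis=1)
    return float(P_Y @ (D_Q + c * D_P))

for _ in range(1000):
    W = rng.dirichlet(np.ones(nx), size=ny)
    Q = rng.dirichlet(np.ones(nx))
    W1 = P_XgY ** alpha * Q ** (1.0 - alpha)  # Lemma 2 update from Q
    W1 /= W1.sum(axis=1, keepdims=True)
    Q1 = P_Y @ W1                             # Lemma 1 update from W1
    assert J(W, Q1) <= J(W, Q) + kl(P_Y @ W, P_Y @ W1) + 1e-12

print("four points property held on 1000 random samples")
```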
Theorem 4: $J$ satisfies the three points property for all $W \in \mathcal{P}_{X|Y}$.

Proof: Let $\mu \in [0, 1]$ and $W^{(\mu)} = \mu W + (1-\mu) W_1$. It suffices to show that

$$ J(W^{(\mu)}, Q_0) - J(W_1, Q_0) \ge D(P_Y W^{(\mu)}\,\|\,P_Y W_1) \qquad (14) $$

for all $\mu \in [0, 1]$ (and, hence, for $\mu = 1$). First, note that, when $\mu = 0$, the left-hand side and the right-hand side of (14) are both equal to zero. Hence, it suffices to prove that

$$ \frac{\partial J(W^{(\mu)}, Q_0)}{\partial \mu} \ge \frac{\partial D(P_Y W^{(\mu)}\,\|\,P_Y W_1)}{\partial \mu}, \quad \forall \mu. $$

We have

$$
\begin{aligned}
\frac{\partial}{\partial \mu} D(P_Y W^{(\mu)}\,\|\,P_Y W_1)
&= \sum_x \big( (P_Y W)(x) - (P_Y W_1)(x) \big) \log \frac{(P_Y W^{(\mu)})(x)}{(P_Y W_1)(x)} \\
&= \frac{1}{\mu} \Big( D(P_Y W^{(\mu)}\,\|\,P_Y W_1) + D(P_Y W_1\,\|\,P_Y W^{(\mu)}) \Big), \qquad (15)
\end{aligned}
$$

where $\mu$ is assumed non-zero in (15). Note that $\partial D(P_Y W^{(\mu)}\,\|\,P_Y W_1)/\partial \mu = 0$ when $\mu = 0$ (the same is true for $\partial J(W^{(\mu)}, Q_0)/\partial \mu$, as can be seen below). Next, we have the following:

$$
\begin{aligned}
\frac{\partial J(W^{(\mu)}, Q_0)}{\partial \mu}
&= \sum_y P_Y(y) \sum_x \big( W(x|y) - W_1(x|y) \big) \log \frac{\big( W^{(\mu)}(x|y) \big)^{1/(1-\alpha)}}{Q_0(x)\, \big( P_{X|Y}(x|y) \big)^{\alpha/(1-\alpha)}} \\
&= \frac{1}{1-\alpha} \sum_y P_Y(y) \sum_x \big( W(x|y) - W_1(x|y) \big) \log \frac{W^{(\mu)}(x|y)}{W_1(x|y)} \qquad (16) \\
&= \frac{1}{\mu(1-\alpha)} \sum_y P_Y(y) \Big( D\big( W^{(\mu)}(\cdot|y)\,\|\,W_1(\cdot|y) \big) + D\big( W_1(\cdot|y)\,\|\,W^{(\mu)}(\cdot|y) \big) \Big) \\
&\ge \frac{1}{\mu} \Big( D(P_Y W^{(\mu)}\,\|\,P_Y W_1) + D(P_Y W_1\,\|\,P_Y W^{(\mu)}) \Big), \qquad (17)
\end{aligned}
$$

where the update equation of Lemma 2 was used in (16) and $\mu > 0$ was assumed in (17); the inequality in (17) follows from $1/(1-\alpha) \ge 1$ and the convexity of the KL-divergence, which gives $\sum_y P_Y(y)\, D(W(\cdot|y)\,\|\,W'(\cdot|y)) \ge D(P_Y W\,\|\,P_Y W')$. By combining (15) and (17), we obtain the required result.

Finally, Theorems 3 and 4 establish that the iterative algorithm of Figure 1 results in the minimization of $J$, as required.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their comments. The alternative proofs of Lemmas 1 and 2 were supplied by the first reviewer. This work was partially supported by National Science Foundation grant No. CCF.

REFERENCES

[1] I. Csiszár, "Generalized cutoff rates and Rényi's information measures," IEEE Trans. on Information Theory, vol. 41, no. 1, pp. 26–34, January 1995.
[2] I. Csiszár and G. Tusnády, "Information geometry and alternating minimization procedures," Statistics and Decisions, Supplement Issue 1, pp. 205–237, 1984.
[3] A. Rényi, "On measures of entropy and information," in Proc. 4th Berkeley Symposium on Math. Statist. Probability, vol. 1, 1961, pp. 547–561.
[4] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1997.
[5] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the ACL, 1996, pp. 310–318.
[6] D. Karakos and S. Khudanpur, "Language modeling with the maximum likelihood set: Complexity issues and the back-off formula," in Proc. 2006 IEEE Intern. Symposium on Information Theory (ISIT-06), Seattle, WA, July 2006.
[7] ——, "Error bounds and improved probability estimation using the maximum likelihood set," in Proc. 2007 IEEE Intern. Symposium on Information Theory (ISIT-07), Nice, France, July 2007.
[8] D. Karakos, J. Eisner, S. Khudanpur, and C. E. Priebe, "Cross-instance tuning of unsupervised document clustering algorithms," in Proc. 2007 Conference of the North American Chapter of the Assoc. for Computational Linguistics (NAACL-HLT 2007), April 2007.
[9] N. Slonim, N. Friedman, and N. Tishby, "Unsupervised document classification using sequential information maximization," in Proc. SIGIR'02, 25th ACM Int. Conf. on Research and Development of Inform. Retrieval, 2002.
[10] T. Cover and J. Thomas, Elements of Information Theory. John Wiley and Sons, 1991.
[11] R. W. Yeung and T. Berger, "Multi-way alternating minimization," in Proc. IEEE Int. Symposium on Information Theory (ISIT-1995), September 1995.