Machine Learning. Lecture 02.2: Basics of Information Theory. Nevin L. Zhang
1 Machine Learning, Lecture 02.2: Basics of Information Theory. Nevin L. Zhang, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology. Nevin L. Zhang (HKUST) Machine Learning 1 / 28
2 Jensen's Inequality. Outline: 1 Jensen's Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.
3 Jensen's Inequality: Concave functions. A function $f$ is concave on an interval $I$ if, for any $x, y \in I$ and any $\lambda \in [0, 1]$, $\lambda f(x) + (1-\lambda) f(y) \le f(\lambda x + (1-\lambda) y)$. That is, the weighted average of the function is upper bounded by the function of the weighted average. $f$ is strictly concave if, for $0 < \lambda < 1$, equality holds only when $x = y$.
4 Jensen's Inequality: Jensen's Inequality. Theorem (1.1): Suppose the function $f$ is concave on an interval $I$. Then, for any $p_i \in [0, 1]$ with $\sum_{i=1}^{n} p_i = 1$ and any $x_i \in I$, $\sum_{i=1}^{n} p_i f(x_i) \le f\left(\sum_{i=1}^{n} p_i x_i\right)$. The weighted average of the function is upper bounded by the function of the weighted average. If $f$ is strictly concave, equality holds iff $p_i p_j \ne 0$ implies $x_i = x_j$. Exercise: Prove this (using induction).
5 Jensen's Inequality: Logarithmic function. The logarithmic function is concave on the interval $(0, \infty)$. Hence, for $x_i > 0$, $\sum_{i=1}^{n} p_i \log(x_i) \le \log\left(\sum_{i=1}^{n} p_i x_i\right)$. In words, exchanging $\sum_i p_i$ with $\log$ increases the quantity (or leaves it unchanged).
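The inequality for the logarithm can be checked numerically. The following is a minimal sketch, not part of the slides: the function name `jensen_gap_log` and the sample weights and points are made up for illustration, and base-2 logarithms are assumed.

```python
import math

def jensen_gap_log(p, x):
    """Return log2(sum_i p_i x_i) - sum_i p_i log2(x_i).

    By Jensen's inequality for the concave function log, this gap is >= 0.
    """
    assert abs(sum(p) - 1.0) < 1e-12 and all(pi >= 0 for pi in p)
    assert all(xi > 0 for xi in x)
    average_of_logs = sum(pi * math.log2(xi) for pi, xi in zip(p, x))
    log_of_average = math.log2(sum(pi * xi for pi, xi in zip(p, x)))
    return log_of_average - average_of_logs

# Hypothetical weights and points, chosen only for illustration.
gap_spread = jensen_gap_log([0.2, 0.3, 0.5], [1.0, 4.0, 16.0])
gap_equal = jensen_gap_log([0.2, 0.3, 0.5], [3.0, 3.0, 3.0])
print(gap_spread, gap_equal)
```

The gap is strictly positive when the points differ and vanishes (up to rounding) when all points with nonzero weight coincide, which matches the equality condition for a strictly concave function.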
6 Entropy. Outline: 1 Jensen's Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.
7 Entropy: Entropy. The entropy of a random variable $X$: $H(X) = \sum_X P(X) \log \frac{1}{P(X)} = -E_P[\log P(X)]$, with the convention that $0 \log(1/0) = 0$. The base of the logarithm is 2 and the unit is the bit. $H(X)$ is sometimes also called the entropy of the distribution.
8 Entropy: Entropy. $H(X)$ measures the uncertainty about $X$. Example: $X$ binary. (Chart in the original slide: $H(X)$ as a function of $p = P(X=1)$.) The higher $H(X)$ is, the more uncertainty there is about the value of $X$.
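The curve of $H(X)$ against $p$ for a binary variable can be tabulated directly from the definition. This is a small sketch under the slide's conventions (base-2 logarithm, $0 \log(1/0) = 0$); the helper name `binary_entropy` and the grid of $p$ values are illustrative choices.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary variable with P(X = 1) = p."""
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log(1/0) = 0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Tabulate H(X) on a coarse grid of p values.
for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p = {p:.1f}  H(X) = {binary_entropy(p):.4f} bits")
```

The table is symmetric around $p = 0.5$, where the entropy reaches its maximum of 1 bit, and drops to 0 at $p = 0$ and $p = 1$.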
9 Entropy: Entropy. Another example: $X$ is the result of a coin toss, $Y$ is the result of a dice throw, and $Z$ is the result of randomly picking a card from a deck of 54. Which one has the highest uncertainty? Entropy: $H(X) = \frac{1}{2}\log 2 + \frac{1}{2}\log 2 = \log 2$; $H(Y) = \frac{1}{6}\log 6 + \cdots + \frac{1}{6}\log 6 = \log 6$; $H(Z) = \frac{1}{54}\log 54 + \cdots + \frac{1}{54}\log 54 = \log 54$. Indeed we have $H(X) < H(Y) < H(Z)$.
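The three entropies can be reproduced with a few lines of code. The sketch below assumes a fair coin, a fair die, and a uniformly drawn card; the function name `entropy` is an illustrative choice.

```python
import math

def entropy(dist):
    """Shannon entropy in bits, with the convention 0 * log(1/0) = 0."""
    return sum(p * math.log2(1.0 / p) for p in dist if p > 0)

# Uniform distributions over 2, 6 and 54 outcomes.
coin = [1 / 2] * 2
die = [1 / 6] * 6
deck = [1 / 54] * 54

for name, dist in [("coin", coin), ("die", die), ("deck", deck)]:
    print(f"H({name}) = {entropy(dist):.4f} bits, log2(n) = {math.log2(len(dist)):.4f}")
```

For each uniform distribution the entropy equals $\log_2 n$, so the ordering of the entropies follows the ordering of the number of outcomes.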
10 Entropy: Entropy. Proposition (1.2): $H(X) \ge 0$, with equality iff $P(X=x) = 1$ for some $x \in \Omega_X$, i.e. iff there is no uncertainty. $H(X) \le \log(|X|)$, with equality iff $P(X=x) = 1/|X|$. Uncertainty is the highest in the case of the uniform distribution. Proof: Because $\log$ is concave, by Jensen's inequality: $H(X) = \sum_X P(X) \log \frac{1}{P(X)} \le \log \sum_X P(X) \frac{1}{P(X)} = \log |X|$.
11 Entropy: Conditional entropy. The conditional entropy of $X$ given the event $Y=y$ is the entropy of the conditional distribution $P(X|Y=y)$, i.e. $H(X|Y=y) = \sum_X P(X|Y=y) \log \frac{1}{P(X|Y=y)}$. It is the uncertainty that remains about $X$ when $Y$ is known to be $y$. It is possible that $H(X|Y=y) > H(X)$. Intuitively, $Y=y$ might contradict our prior knowledge about $X$ and increase our uncertainty about $X$. Exercise: Give an example.
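One possible example, offered as a sketch rather than as the intended solution to the exercise: the joint table below is hypothetical. A priori $X$ is almost certainly 0, but observing $Y=1$ leaves $X$ completely undetermined.

```python
import math

def entropy(dist):
    """Shannon entropy in bits, with the convention 0 * log(1/0) = 0."""
    return sum(p * math.log2(1.0 / p) for p in dist if p > 0)

# Hypothetical joint table: joint[y][x] = P(X = x, Y = y).
joint = {0: [0.8, 0.0], 1: [0.1, 0.1]}

# Marginal of X: P(X = 0) = 0.9, P(X = 1) = 0.1.
p_x = [sum(joint[y][x] for y in joint) for x in range(2)]
h_x = entropy(p_x)

# Conditional distribution of X given Y = 1: (0.5, 0.5).
p_y1 = sum(joint[1])
p_x_given_y1 = [v / p_y1 for v in joint[1]]
h_x_given_y1 = entropy(p_x_given_y1)

print(f"H(X) = {h_x:.4f} bits, H(X | Y=1) = {h_x_given_y1:.4f} bits")
```

Here the surprising observation $Y=1$ raises the uncertainty about $X$ to a full bit, even though the prior entropy of $X$ is below half a bit.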
12 Entropy: Conditional entropy. The conditional entropy of $X$ given the variable $Y$: $H(X|Y) = \sum_{y \in \Omega_Y} P(Y=y) H(X|Y=y) = \sum_Y P(Y) \sum_X P(X|Y) \log \frac{1}{P(X|Y)} = \sum_{X,Y} P(X,Y) \log \frac{1}{P(X|Y)} = -E[\log P(X|Y)]$. It is the average uncertainty that remains about $X$ when $Y$ is known.
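The two expressions for $H(X|Y)$, the weighted average over $y$ and the single sum over the joint, can be compared numerically. This is a sketch on a made-up joint table; the function names are illustrative.

```python
import math

# Hypothetical joint table: joint[(x, y)] = P(X = x, Y = y).
joint = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.3}

def marginal_y(joint):
    """Marginal distribution P(Y) as a dict y -> probability."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return p_y

def cond_entropy_weighted(joint):
    """sum_y P(y) * H(X | Y = y)."""
    p_y = marginal_y(joint)
    total = 0.0
    for y, py in p_y.items():
        # Conditional distribution P(X | Y = y), skipping zero entries.
        cond = [p / py for (x, yy), p in joint.items() if yy == y and p > 0]
        total += py * sum(c * math.log2(1.0 / c) for c in cond)
    return total

def cond_entropy_joint(joint):
    """sum_{x,y} P(x, y) * log 1 / P(x | y), using P(x | y) = P(x, y) / P(y)."""
    p_y = marginal_y(joint)
    return sum(p * math.log2(p_y[y] / p) for (x, y), p in joint.items() if p > 0)

print(cond_entropy_weighted(joint), cond_entropy_joint(joint))
```

Both functions return the same value up to floating-point rounding, as the derivation on the slide predicts.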
13 Entropy: Joint entropy. The joint entropy of $X$ and $Y$: $H(X,Y) = \sum_{X,Y} P(X,Y) \log \frac{1}{P(X,Y)}$. Chain rule: $H(X,Y) = H(X) + H(Y|X)$ and $H(Y,X) = H(Y) + H(X|Y)$. Proof: $\sum_{X,Y} P(X,Y) \log \frac{1}{P(X,Y)} = \sum_{X,Y} P(X,Y) \log \frac{1}{P(X)P(Y|X)} = \sum_{X,Y} P(X,Y) \log \frac{1}{P(X)} + \sum_{X,Y} P(X,Y) \log \frac{1}{P(Y|X)} = \sum_X P(X) \log \frac{1}{P(X)} + H(Y|X) = H(X) + H(Y|X)$.
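The chain rule can be verified on a small example. The joint table and helper name below are hypothetical; the sketch only checks the identity $H(X,Y) = H(X) + H(Y|X)$ numerically.

```python
import math

# Hypothetical joint table: joint[(x, y)] = P(X = x, Y = y).
joint = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}

def H(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Marginal P(X).
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

h_xy = H(joint.values())
h_x = H(p_x.values())
# H(Y | X) = sum_{x,y} P(x, y) log 1 / P(y | x), with P(y | x) = P(x, y) / P(x).
h_y_given_x = sum(p * math.log2(p_x[x] / p) for (x, y), p in joint.items() if p > 0)

print(f"H(X,Y) = {h_xy:.6f}, H(X) + H(Y|X) = {h_x + h_y_given_x:.6f}")
```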
14 Divergence. Outline: 1 Jensen's Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.
15 Divergence: Kullback-Leibler divergence. Relative entropy, or Kullback-Leibler divergence, measures how much a distribution $Q(X)$ differs from a true probability distribution $P(X)$. The K-L divergence of $Q$ from $P$ is defined as follows: $KL(P \| Q) = \sum_X P(X) \log \frac{P(X)}{Q(X)} = E_P[\log P(X)] - E_P[\log Q(X)]$, with the conventions $0 \log \frac{0}{0} = 0$ and $p \log \frac{p}{0} = \infty$ if $p \ne 0$. It is not symmetric, so it is not a distance measure mathematically. The second term, $H(P, Q) = -E_P[\log Q(X)]$, is called the cross entropy, and $H(P, Q) = KL(P \| Q) + H(P)$.
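A short sketch of these definitions, including the two conventions for zero probabilities, may help. The function names and the distributions `P` and `Q` are illustrative assumptions; logarithms are base 2.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def kl(p, q):
    """KL(P || Q) in bits for two distributions over the same outcomes."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # convention: 0 log(0/q) = 0
        if qi == 0:
            return math.inf     # convention: p log(p/0) = infinity for p != 0
        total += pi * math.log2(pi / qi)
    return total

def cross_entropy(p, q):
    """H(P, Q) = -E_P[log Q(X)] in bits."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(1.0 / qi)
    return total

# Hypothetical distributions over three outcomes.
P = [0.5, 0.25, 0.25]
Q = [1 / 3, 1 / 3, 1 / 3]

print(kl(P, Q), kl(Q, P))                        # differ: KL is not symmetric
print(cross_entropy(P, Q), kl(P, Q) + entropy(P))  # agree up to rounding
```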
16 Divergence: Kullback-Leibler divergence. Theorem (1.2) (Gibbs inequality): $KL(P \| Q) \ge 0$, with equality iff $P$ is identical to $Q$. That is, the KL divergence between $P$ and $Q$ is larger than 0 unless $P$ and $Q$ are identical. Proof: $\sum_X P(X) \log \frac{P(X)}{Q(X)} = -\sum_X P(X) \log \frac{Q(X)}{P(X)} \ge -\log \sum_X P(X) \frac{Q(X)}{P(X)}$ (Jensen's inequality) $= -\log \sum_X Q(X) = 0$.
17 Divergence: A corollary. Corollary (1.1) (Gibbs inequality): $H(P, Q) \ge H(P)$, or $\sum_X P(X) \log Q(X) \le \sum_X P(X) \log P(X)$. In general, let $f(X)$ be a non-negative function. Then $\sum_X f(X) \log Q(X) \le \sum_X f(X) \log P^*(X)$, where $P^*(X) = f(X) / \sum_X f(X)$.
18 Divergence: Jensen-Shannon divergence. KL is not symmetric: $KL(P \| Q)$ is usually not equal to the reverse KL, $KL(Q \| P)$. The Jensen-Shannon divergence is one symmetrized version of KL: $JS(P \| Q) = \frac{1}{2} KL(P \| M) + \frac{1}{2} KL(Q \| M)$, where $M = \frac{P+Q}{2}$. Properties: $0 \le JS(P \| Q) \le \log 2$; $JS(P \| Q) = 0$ if $P = Q$; $JS(P \| Q) = \log 2$ if $P$ and $Q$ have disjoint support.
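The listed properties can be checked with a small sketch. The function names and test distributions are illustrative; with base-2 logarithms the upper bound $\log 2$ equals 1 bit.

```python
import math

def kl(p, q):
    """KL(P || Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence in bits."""
    # The mixture M is positive wherever P or Q is, so both KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical distributions.
P = [0.5, 0.25, 0.25]
Q = [0.1, 0.1, 0.8]

print(js(P, Q), js(Q, P))            # symmetric
print(js(P, P))                      # 0 for identical distributions
print(js([1.0, 0.0], [0.0, 1.0]))    # log2(2) = 1 for disjoint supports
```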
19 Mutual Information. Outline: 1 Jensen's Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.
20 Mutual Information: Mutual information. The mutual information of $X$ and $Y$: $I(X; Y) = H(X) - H(X|Y)$. It is the average reduction in uncertainty about $X$ from learning the value of $Y$, or the average amount of information $Y$ conveys about $X$.
21 Mutual Information: Mutual information and KL divergence. Note that: $I(X; Y) = \sum_X P(X) \log \frac{1}{P(X)} - \sum_{X,Y} P(X,Y) \log \frac{1}{P(X|Y)} = \sum_{X,Y} P(X,Y) \log \frac{1}{P(X)} - \sum_{X,Y} P(X,Y) \log \frac{1}{P(X|Y)} = \sum_{X,Y} P(X,Y) \log \frac{P(X|Y)}{P(X)} = \sum_{X,Y} P(X,Y) \log \frac{P(X,Y)}{P(X)P(Y)}$ (an equivalent definition) $= KL(P(X,Y) \| P(X)P(Y))$. Because the equivalent definition is symmetric in $X$ and $Y$: $I(X; Y) = H(X) - H(X|Y) = I(Y; X) = H(Y) - H(Y|X)$.
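The agreement between the entropy-based definition and the KL form can be checked numerically. This is a sketch on hypothetical joint tables; the function names are illustrative.

```python
import math

def marginals(joint):
    """Return (P(X), P(Y)) as dicts for joint[(x, y)] = P(X = x, Y = y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return p_x, p_y

def mi_as_kl(joint):
    """I(X;Y) = KL(P(X,Y) || P(X)P(Y)) in bits."""
    p_x, p_y = marginals(joint)
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

def mi_as_entropy_drop(joint):
    """I(X;Y) = H(X) - H(X|Y) in bits."""
    p_x, p_y = marginals(joint)
    h_x = sum(p * math.log2(1.0 / p) for p in p_x.values() if p > 0)
    h_x_given_y = sum(p * math.log2(p_y[y] / p)
                      for (x, y), p in joint.items() if p > 0)
    return h_x - h_x_given_y

# Hypothetical dependent joint, and an independent joint built as a product.
dependent = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}
independent = {(x, y): px * py
               for x, px in [(0, 0.6), (1, 0.4)]
               for y, py in [(0, 0.5), (1, 0.5)]}

print(mi_as_kl(dependent), mi_as_entropy_drop(dependent))
print(mi_as_kl(independent))
```

The two forms agree on the dependent table, and the product table gives a mutual information of zero up to rounding.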
22 Mutual Information: Property of mutual information. Theorem (1.3): $I(X; Y) \ge 0$, with equality iff $X \perp Y$. Interpretation: $X$ and $Y$ are independent iff $X$ contains no information about $Y$ and vice versa. Proof: Follows from the previous slide and Theorem 1.2.
23 Mutual Information: Conditional entropy revisited. Theorem (1.4): $H(X|Y) \le H(X)$, with equality iff $X \perp Y$. Observation reduces uncertainty on average, except in the case of independence. Proof: Follows from Theorem 1.3.
24 Mutual Information: Mutual information and entropy. From the definition of mutual information, $I(X; Y) = H(X) - H(X|Y)$, and the chain rule, $H(X,Y) = H(Y) + H(X|Y)$, we get $H(X) + H(Y) = H(X,Y) + I(X; Y)$, i.e. $I(X; Y) = H(X) + H(Y) - H(X,Y)$. Consequently $H(X,Y) \le H(X) + H(Y)$, with equality iff $X \perp Y$.
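Two extreme cases make the identity concrete: independent fair bits, and a pair where $Y$ is a copy of $X$. The sketch below uses these hypothetical joint tables and an illustrative helper `entropy_summary`.

```python
import math

def H(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def entropy_summary(joint):
    """Return (H(X), H(Y), H(X,Y), I(X;Y)) for joint[(x, y)] = P(X = x, Y = y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    h_x, h_y, h_xy = H(p_x.values()), H(p_y.values()), H(joint.values())
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return h_x, h_y, h_xy, h_x + h_y - h_xy

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}  # two fair bits
copied = {(0, 0): 0.5, (1, 1): 0.5}                           # Y is a copy of X

print(entropy_summary(independent))  # H(X,Y) = H(X) + H(Y), so I = 0
print(entropy_summary(copied))       # H(X,Y) = H(X), so I = H(X)
```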
25 Mutual Information: Mutual information and entropy. (Venn diagram in the original slide: relationships among joint entropy, conditional entropy, and mutual information.) $H(X) + H(Y) = H(X,Y) + I(X; Y)$; $I(X; Y) = H(X) - H(X|Y)$; $I(Y; X) = H(Y) - H(Y|X)$.
26 Mutual Information: Conditional mutual information. The conditional mutual information of $X$ and $Y$ given $Z$: $I(X; Y|Z) = H(X|Z) - H(X|Y,Z)$. It is the average amount of information $Y$ conveys about $X$ given $Z$.
27 Mutual Information: Conditional mutual information and KL divergence. Note: $I(X; Y|Z) = \sum_{X,Z} P(X,Z) \log \frac{1}{P(X|Z)} - \sum_{X,Y,Z} P(X,Y,Z) \log \frac{1}{P(X|Y,Z)} = \sum_{X,Y,Z} P(X,Y,Z) \log \frac{1}{P(X|Z)} - \sum_{X,Y,Z} P(X,Y,Z) \log \frac{1}{P(X|Y,Z)} = \sum_{X,Y,Z} P(X,Y,Z) \log \frac{P(X|Y,Z)}{P(X|Z)} = \sum_Z P(Z) \sum_{X,Y} P(X,Y|Z) \log \frac{P(X,Y|Z)}{P(X|Z)P(Y|Z)}$ (an equivalent definition) $= \sum_Z P(Z)\, KL(P(X,Y|Z) \| P(X|Z)P(Y|Z)) \ge 0$.
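Both forms of the conditional mutual information can be computed from a three-variable joint table. The sketch below is illustrative only: the helper names are made up, and the two joint tables (an XOR relation and a pair of copies of $Z$) are hypothetical examples that do not come from the slides.

```python
import math
from collections import defaultdict

def marginal(joint, keep):
    """Sum a joint table P(x, y, z) down to the coordinates listed in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out

def cmi_entropy_form(joint):
    """I(X;Y|Z) = H(X|Z) - H(X|Y,Z) in bits."""
    p_z = marginal(joint, [2])
    p_xz = marginal(joint, [0, 2])
    p_yz = marginal(joint, [1, 2])
    h_x_given_z = sum(p * math.log2(p_z[(z,)] / p)
                      for (x, z), p in p_xz.items() if p > 0)
    h_x_given_yz = sum(p * math.log2(p_yz[(y, z)] / p)
                       for (x, y, z), p in joint.items() if p > 0)
    return h_x_given_z - h_x_given_yz

def cmi_kl_form(joint):
    """sum_z P(z) KL(P(X,Y|z) || P(X|z)P(Y|z)), written as a single sum."""
    p_z = marginal(joint, [2])
    p_xz = marginal(joint, [0, 2])
    p_yz = marginal(joint, [1, 2])
    total = 0.0
    for (x, y, z), p in joint.items():
        if p > 0:
            # P(x,y|z) / (P(x|z) P(y|z)) = P(x,y,z) P(z) / (P(x,z) P(y,z))
            total += p * math.log2(p * p_z[(z,)] / (p_xz[(x, z)] * p_yz[(y, z)]))
    return total

# Hypothetical examples: Z = X xor Y for independent fair bits X, Y,
# and X = Y = Z for a fair bit Z.
xor = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

print(cmi_entropy_form(xor), cmi_kl_form(xor))
print(cmi_entropy_form(copies), cmi_kl_form(copies))
```

In the XOR table, knowing $Z$ makes $Y$ fully informative about $X$, so the conditional mutual information is 1 bit; in the copies table, $X$ is already determined by $Z$, so $Y$ adds nothing and the value is 0.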
28 Mutual Information: Property of conditional mutual information. Theorem (1.5): $I(X; Y|Z) \ge 0$, i.e. $H(X|Z) \ge H(X|Y,Z)$, with equality iff $X \perp Y \mid Z$. Interpretation: More observations reduce uncertainty on average, except in the case of conditional independence. $X$ and $Y$ are independent given $Z$ iff $X$ contains no information about $Y$ given $Z$ and vice versa: $X \perp Y \mid Z \iff I(X; Y|Z) = 0$. This is another characterization of conditional independence.
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we
More informationEntropies & Information Theory
Entropies & Information Theory LECTURE I Nilanjana Datta University of Cambridge,U.K. See lecture notes on: http://www.qi.damtp.cam.ac.uk/node/223 Quantum Information Theory Born out of Classical Information
More informationBayesian Machine Learning - Lecture 7
Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1
More informationInformation Theory, Statistics, and Decision Trees
Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 4/6/2010 Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics. Léon Bottou 2/31 COS 424 4/6/2010
More information3F1 Information Theory, Lecture 3
3F1 Information Theory, Lecture 3 Jossy Sayir Department of Engineering Michaelmas 2011, 28 November 2011 Memoryless Sources Arithmetic Coding Sources with Memory 2 / 19 Summary of last lecture Prefix-free
More informationINTRODUCTION TO INFORMATION THEORY
INTRODUCTION TO INFORMATION THEORY KRISTOFFER P. NIMARK These notes introduce the machinery of information theory which is a eld within applied mathematics. The material can be found in most textbooks
More informationChannel capacity. Outline : 1. Source entropy 2. Discrete memoryless channel 3. Mutual information 4. Channel capacity 5.
Channel capacity Outline : 1. Source entropy 2. Discrete memoryless channel 3. Mutual information 4. Channel capacity 5. Exercices Exercise session 11 : Channel capacity 1 1. Source entropy Given X a memoryless
More informationExercises with solutions (Set D)
Exercises with solutions Set D. A fair die is rolled at the same time as a fair coin is tossed. Let A be the number on the upper surface of the die and let B describe the outcome of the coin toss, where
More informationEE5585 Data Compression May 2, Lecture 27
EE5585 Data Compression May 2, 2013 Lecture 27 Instructor: Arya Mazumdar Scribe: Fangying Zhang Distributed Data Compression/Source Coding In the previous class we used a H-W table as a simple example,
More informationMachine Learning Recitation 8 Oct 21, Oznur Tastan
Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy
More informationLecture 18: Quantum Information Theory and Holevo s Bound
Quantum Computation (CMU 1-59BB, Fall 2015) Lecture 1: Quantum Information Theory and Holevo s Bound November 10, 2015 Lecturer: John Wright Scribe: Nicolas Resch 1 Question In today s lecture, we will
More information