Data Analysis Tools
Harjoat S. Bhamra
July 8, 2017

Contents

1 Introduction

Part I: Probability
2 Inequalities
  2.1 Jensen's inequality
    2.1.1 Arithmetic Mean-Geometric Mean Inequality
    2.1.2 Proof of Corollary 1
  2.2 Cauchy-Schwarz Inequality
    2.2.1 Multiple random variables
  2.3 Information entropy
  2.4 Exercises
3 Weak Law of Large Numbers
  3.1 Markov Inequality
  3.2 Chebyshev inequality
  3.3 Weak law of large numbers
  3.4 Exercises
4 Normal Distribution
  4.1 Calculations with the normal distribution
    4.1.1 Checking the normal density integrates to one
  4.2 Mode, median, sample mean and variance
    4.2.1 Bounds on tail probability of a normal distribution
  4.3 Multivariate normal
  4.4 Bivariate normal
  4.5 Brownian motion
  4.6 Exercises
5 Generating and Characteristic Functions
  5.1 Moment Generating Functions
  5.2 Characteristic Functions
  5.3 A Brownian interlude with mirrors, but no smoke
  5.4 Exercises
6 Central limit theorem
  6.1 Exercises

Part II: Statistics
7 Mathematical Statistics
  7.1 Introduction
    7.1.1 Mean squared error
8 Bayesian inference
9 Estimating stochastic volatility models

Part III: Applying Linear Algebra and Statistics to Data Analysis
10 Principal Components Analysis (PCA)
  10.1 Overview
  10.2 A simple example from high school physics
  10.3 Linear Algebra for Principal Components Analysis
    10.3.1 Vector spaces
    10.3.2 Linear independence, subspaces, spanning and bases
    10.3.3 Change of basis
  10.4 How do we choose the basis?
    10.4.1 Noisy data
    10.4.2 Redundancy
    10.4.3 Covariance matrix
    10.4.4 Covariance matrix under a new basis
  10.5 PCA via Projection
  10.6 Mathematics of Orthogonal Projections
    10.6.1 Orthogonal projection operators
  10.7 Projecting the data onto a 1-d subspace
  10.8 Spectral Theorem for Real Symmetric Matrices
  10.9 Projecting the data onto an m-dimensional subspace
    10.9.1 Scree Plots
  10.10 What about changing the basis?
  10.11 Orthogonal matrices
  10.12 Statistical Inference
  10.13 Exercises

Part IV: Exam Preparation
11 Topics covered in Final Exam 2016
  11.1 Linear Maps

Chapter 1
Introduction

There are of course many reasons for learning mathematics. Some take the view of Simeon Poisson, to whom the following saying is attributed: "Life is good for only two things: doing mathematics and teaching it." However, the type of person who takes Simeon Poisson's view of life probably does not need to read these notes. So what is their purpose?

Simeon Poisson (1781-1840) was a French mathematician. Poisson's name is attached to a wide variety of ideas in mathematics and physics, for example: Poisson's integral, Poisson's equation in potential theory, Poisson brackets in differential equations, Poisson's ratio in elasticity, and Poisson's constant in electricity.

Benjamin Disraeli (1804-1881) was a British politician. Disraeli trained as a lawyer but wanted to be a novelist. He became a politician in the 1830s and is generally acknowledged to be one of the key figures in the history of the Conservative Party. He was Prime Minister in 1868 and from 1874 to 1880. He famously acquired for Britain a large stake in the Suez Canal, and made Queen Victoria Empress of India. In 1876 he was raised to the peerage as the Earl of Beaconsfield. In these lecture notes, I hope to include sufficient mathematics for you to be able to analyze data, without delving too deeply into advanced statistics or econometrics, but while covering enough material to ensure that you and your work are a danger neither to yourself nor to others. As Benjamin Disraeli famously said: "There are three types of lies: lies, damn lies, and statistics." Hopefully, after this course, no one will say that about your work.

We shall cover some basic probability theory and linear algebra, before delving into some elementary statistics, culminating with a study of principal components analysis. Principal components analysis (PCA) is one of a family of techniques for taking a large amount of data and summarizing it via a smaller set of variables. You can think of it as the process of replacing a long book with a summary. More formally, PCA is the process of taking high-dimensional data and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form, without losing too much information. A nice example of PCA is in politics. Stephen A. Weis, now a software engineer at Facebook, analyzed the voting records of senators in the first session of the 108th US Congress. He looked at 458 separate votes taken by the 100 senators. Each vote was described by a 1 (yes), -1 (no) or 0 (absent). In total, this gave rise to a 100 (number of senators) × 458 (number of votes) matrix. Using PCA (computed via the singular value decomposition, SVD), Weis was able to reduce the dimensionality of the data to the extent that he could summarize it on the 2-dimensional plot depicted in Figure 1. If you were kind, you might describe these lecture notes as the result of PCA applied to the mathematics of data analysis.

Figure 1: Democratic and Republican senators have been colored blue and red, respectively. The values of the axes and the axes themselves are artifacts of the singular value decomposition. In other words, the axes don't mean anything: they are simply the two most significant dimensions of the data's optimal representation. Regardless, one can see that this map clearly clusters senators according to party. Note that this map was generated only from voting records, without any data on party affiliation. From just the voting data, there is clearly a partisan divide between the parties.

The above example illustrates the strengths and limitations of PCA. It allows you to summarize large data sets via a small set of variables in a way which makes it easy to visualize the data. But it does not tell you what the summary variables mean. Indeed, as was said by Henry Clay: "Statistics are no substitute for judgment."
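The 2-d reduction described above can be sketched with NumPy's SVD. The actual voting matrix is not reproduced here, so the sketch below plants a synthetic two-bloc structure in a 100 × 458 matrix of ±1 votes (all numbers and names are illustrative assumptions, not the Weis data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the Senate data: 100 "senators" casting 458
# votes coded 1 (yes) or -1 (no).  Two blocs of 50 vote with their bloc's
# position most of the time, so the matrix is approximately rank 2.
bloc = np.repeat([1, -1], 50)                 # bloc label per senator
position = rng.choice([1, -1], size=458)      # each bloc's line on each vote
votes = np.outer(bloc, position).astype(float)
flip = rng.random(votes.shape) < 0.1          # 10% defections
votes[flip] *= -1

# PCA via the singular value decomposition: keep the two most
# significant dimensions, as in the plot described above.
centered = votes - votes.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
coords = U[:, :2] * s[:2]                     # 100 x 2 summary of 100 x 458 data

print(coords.shape)  # (100, 2)
```

The sign of the first coordinate separates the two blocs, even though no bloc labels were used in the decomposition, which is exactly the phenomenon the senator plot displays.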

Henry Clay (1777-1852) was an American lawyer and planter, politician, and skilled orator who represented Kentucky in both the United States Senate and House of Representatives. He served three non-consecutive terms as Speaker of the House of Representatives and served as Secretary of State under President John Quincy Adams from 1825 to 1829. Clay ran for the presidency in 1824, 1832 and 1844, while also seeking the Whig Party nomination in 1840 and 1848.

Part I
Probability

Chapter 2
Inequalities

Contents
2.1 Jensen's inequality
  2.1.1 Arithmetic Mean-Geometric Mean Inequality
  2.1.2 Proof of Corollary 1
2.2 Cauchy-Schwarz Inequality
  2.2.1 Multiple random variables
2.3 Information entropy
2.4 Exercises

We often model data via random variables. For example, stock returns are often assumed to be normal. Inequalities give us well-defined facts about random variables.

Definition 1 A function f : (a, b) → R is concave if for all x, y ∈ (a, b) and λ ∈ [0, 1],

    λf(x) + (1 − λ)f(y) ≤ f(λx + (1 − λ)y).

It is strictly concave if strict inequality holds when x ≠ y and 0 < λ < 1.

Definition 2 A function f is convex (strictly convex) if −f is concave (strictly concave).

Fact If f is a twice differentiable function and f″(x) ≤ 0 for all x ∈ (a, b), then f is concave [a basic exercise in Analysis]. It is strictly concave if f″(x) < 0 for all x ∈ (a, b).

The chord lies below the function.

Johan Jensen (1859-1925) was a Danish mathematician and engineer. Although he studied mathematics among various subjects at college, and even published a research paper in mathematics, he learned advanced math topics later by himself and never held any academic position. Instead, he was a successful engineer for the Copenhagen Telephone Company between 1881 and 1924, and became head of the technical department in 1890. All his mathematics research was carried out in his spare time. Jensen is mostly renowned for his famous inequality, Jensen's inequality. In 1915, Jensen also proved Jensen's formula in complex analysis.
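The chord-below-the-graph picture is easy to check numerically. A minimal sketch, using f(x) = ln x, which is concave since f″(x) = −1/x² < 0:

```python
import math
import random

# Numerical check of Definition 1 for f(x) = ln x:
# lambda*f(x) + (1-lambda)*f(y) <= f(lambda*x + (1-lambda)*y),
# i.e. every chord of the graph lies below the function.
random.seed(0)
f = math.log
for _ in range(10_000):
    x = random.uniform(0.1, 10.0)
    y = random.uniform(0.1, 10.0)
    lam = random.random()
    chord = lam * f(x) + (1 - lam) * f(y)
    graph = f(lam * x + (1 - lam) * y)
    assert chord <= graph + 1e-12  # small tolerance for floating-point rounding

print("chord below graph in all trials")
```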

2.1 Jensen's inequality

Theorem 1 (Jensen's Inequality) Let f : (a, b) → R be a concave function. Then

    f( Σ_{n=1}^N p_n x_n ) ≥ Σ_{n=1}^N p_n f(x_n)    (2.1)

for all x_1, ..., x_N ∈ (a, b) and p_1, ..., p_N ∈ (0, 1) such that Σ_{n=1}^N p_n = 1. Furthermore, if f is strictly concave, then equality holds iff all the x_n are equal.

If X is a random variable that takes finitely many values, Jensen's Inequality can be written as

    f(E[X]) ≥ E[f(X)].    (2.2)

Proof of Theorem 1 We use proof by induction. Jensen's Inequality for N = 2 is just the definition of concavity. Suppose it holds for N − 1. Now consider

    f( Σ_{n=1}^N p_n x_n ) = f( p_1 x_1 + Σ_{n=2}^N p_n x_n ).    (2.3)

To apply the definition of concavity, we observe that

    f( p_1 x_1 + Σ_{n=2}^N p_n x_n ) = f( p_1 x_1 + (1 − p_1) z ),    (2.4)

where

    z = Σ_{n=2}^N [ p_n / Σ_{k=2}^N p_k ] x_n,    (2.5)

and Σ_{k=2}^N p_k = 1 − p_1. Applying the definition of concavity, we have

    f( p_1 x_1 + (1 − p_1) z ) ≥ p_1 f(x_1) + (1 − p_1) f(z)    (2.6)
        = p_1 f(x_1) + (1 − p_1) f( Σ_{n=2}^N [ p_n / Σ_{k=2}^N p_k ] x_n ).    (2.7)

Jensen's Inequality holds for N − 1, and so

    f( Σ_{n=2}^N [ p_n / Σ_{k=2}^N p_k ] x_n ) ≥ Σ_{n=2}^N [ p_n / Σ_{k=2}^N p_k ] f(x_n).    (2.8)

Therefore,

    f( Σ_{n=1}^N p_n x_n ) ≥ p_1 f(x_1) + (1 − p_1) Σ_{n=2}^N [ p_n / Σ_{k=2}^N p_k ] f(x_n)    (2.9)
        = Σ_{n=1}^N p_n f(x_n),    (2.10)

using 1 − p_1 = Σ_{k=2}^N p_k. Therefore, if Jensen's Inequality holds for N − 1, it also holds for N by virtue of concavity. Since Jensen's Inequality for N = 2 is just the definition of concavity, it follows by induction that Jensen's Inequality holds for all finite integers N greater than or equal to 2.

2.1.1 Arithmetic Mean-Geometric Mean Inequality

Corollary 1 (Arithmetic Mean-Geometric Mean Inequality) Given positive real numbers x_1, ..., x_N,

    ( Π_{n=1}^N x_n )^{1/N} ≤ (1/N) Σ_{n=1}^N x_n.    (2.11)

2.1.2 Proof of Corollary 1

Suppose X is a discrete random variable such that

    Pr(X = x_n) = 1/N,  x_n > 0,  n ∈ {1, ..., N}.    (2.12)

Observe that ln x is a concave function of x, and so from Jensen's Inequality

    E[ln X] ≤ ln E[X].    (2.13)

Therefore,

    (1/N) Σ_{n=1}^N ln x_n ≤ ln( (1/N) Σ_{n=1}^N x_n )    (2.14)
    ln Π_{n=1}^N x_n^{1/N} ≤ ln( (1/N) Σ_{n=1}^N x_n )    (2.15)
    ln ( Π_{n=1}^N x_n )^{1/N} ≤ ln( (1/N) Σ_{n=1}^N x_n ).    (2.16)

Now, because e^x is monotonically increasing, we have

    ( Π_{n=1}^N x_n )^{1/N} ≤ (1/N) Σ_{n=1}^N x_n.    (2.17)

2.2 Cauchy-Schwarz Inequality

Theorem 2 (Cauchy-Schwarz Inequality) For any random variables X and Y,

    E[XY]² ≤ E[X²] E[Y²].    (2.18)

Proof of Theorem 2 If Y = 0, then both sides are 0. Otherwise, E[Y²] > 0. Let

    w = X − Y E[XY] / E[Y²].    (2.19)

Then

    E[w²] = E[ X² − 2XY E[XY]/E[Y²] + Y² (E[XY])² / (E[Y²])² ]    (2.20)
        = E[X²] − 2 (E[XY])² / E[Y²] + (E[XY])² / E[Y²]    (2.21)
        = E[X²] − (E[XY])² / E[Y²].    (2.22)

Since E[w²] ≥ 0, the Cauchy-Schwarz inequality follows.

2.2.1 Multiple random variables

If we have two random variables, we can study the relationship between them.

Definition 3 (Covariance) Given two random variables X, Y, the covariance is

    Cov(X, Y) = E[(X − E[X])(Y − E[Y])].

Proposition 1
1. Cov(X, c) = 0 for constant c.

[Figure: Visualizing the Cauchy-Schwarz Inequality]

2. Cov(X + c, Y) = Cov(X, Y).
3. Cov(X, Y) = Cov(Y, X).
4. Cov(X, Y) = E[XY] − E[X]E[Y].
5. Cov(X, X) = Var(X).
6. Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).
7. If X, Y are independent, Cov(X, Y) = 0.

These are all trivial to prove. It is important to note that Cov(X, Y) = 0 does not imply X and Y are independent.

Example 1

Let (X, Y) = (2, 0), (−1, 1) or (−1, −1) with equal probabilities of 1/3. These are not independent, since Y = 0 implies X = 2. However,

    Cov(X, Y) = E[XY] − E[X]E[Y] = 0 − 0 · 0 = 0.

If we randomly pick a point on the unit circle and let the coordinates be (X, Y), then E[X] = E[Y] = E[XY] = 0 by symmetry. So Cov(X, Y) = 0, but X and Y are clearly not independent (they have to satisfy x² + y² = 1).

The covariance is not that useful in measuring how well two variables correlate. For one, the covariance can (potentially) have dimensions, which means that the numerical value of the covariance can depend on what units we are using. Also, the magnitude of the covariance depends largely on the variances of X and Y themselves. To solve these problems, we define

Definition 4 (Correlation coefficient) The correlation coefficient of X and Y is

    ρ(X, Y) = Cov(X, Y) / sqrt( Var(X) Var(Y) ).

Proposition 2 |ρ(X, Y)| ≤ 1.

Proof of Proposition 2 Apply Cauchy-Schwarz to X − E[X] and Y − E[Y].

Again, zero correlation does not necessarily imply independence.

2.3 Information entropy

Suppose we observe data about the economy up until 2010 and then look again in 2016. How much more information do we have? Information theory gives us ways of measuring information. We shall start (and end!) with the basic idea of information entropy, also known as Shannon's entropy. In the context of PCA, we want to reduce the dimensionality of a dataset without losing too much information. Entropy gives us a way of measuring this.
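Before moving on, the three-point covariance example above can be verified with exact arithmetic. A minimal sketch:

```python
from fractions import Fraction

# Example 1 above: (X, Y) is (2, 0), (-1, 1) or (-1, -1), each with
# probability 1/3.  Exact arithmetic shows Cov(X, Y) = 0, even though
# Y = 0 forces X = 2, so the variables are not independent.
points = [(2, 0), (-1, 1), (-1, -1)]
p = Fraction(1, 3)

E_X = sum(p * x for x, _ in points)
E_Y = sum(p * y for _, y in points)
E_XY = sum(p * x * y for x, y in points)
cov = E_XY - E_X * E_Y

print(cov)  # 0
# Independence fails: Pr(X = 2, Y = 1) = 0 but Pr(X = 2) Pr(Y = 1) = 1/9.
```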

Claude Shannon (1916-2001) introduced the notion that information could be quantified. In "A Mathematical Theory of Communication", his legendary paper from 1948, Shannon proposed that data should be measured in bits: discrete values of zero or one. Shannon developed information entropy as a measure of the uncertainty in a message, while essentially inventing the field of information theory.

Perhaps confusingly, in information theory, the term entropy refers to information we don't have (normally people define information as what they know!). The information we don't have about a system, its entropy, is related to its unpredictability: how much it can surprise us.

Suppose an event A occurs with probability P(A) = p. How surprising is it? If it is not very surprising, there cannot be much new information in the event. Let's try to invent a surprise function, say S(p). What properties should this have? Since a certain event is unsurprising, we would like S(1) = 0. We should also like S(p) to be decreasing and continuous in p. If A and B are independent events, then we should like S(P(A ∩ B)) = S(P(A)) + S(P(B)). It turns out that the only function with these properties is of the form

    S(p) = −c log_a p,    (2.23)

with c > 0. For simplicity, take c = 1. The log can be any base, but for the time being let us use base 2 (a = 2).
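The two defining properties of the surprise function, S(1) = 0 and additivity over independent events, can be checked directly for S(p) = −log₂ p:

```python
import math

def surprise(p):
    # S(p) = -log2 p: the surprise, in bits, of an event of probability p
    return -math.log2(p)

# A certain event carries no surprise ...
print(surprise(1.0) == 0)           # True
# ... and independent events add: S(P(A)P(B)) = S(P(A)) + S(P(B)).
pA, pB = 0.5, 0.25
print(surprise(pA * pB))            # 3.0
print(surprise(pA) + surprise(pB))  # 3.0
```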

If X is a random variable that takes values 1, ..., N with probabilities p_1, ..., p_N, then on average the surprise obtained on learning X is

    H(X) = E[S(p_X)] = − Σ_{n=1}^N p_n log₂ p_n.    (2.24)

This is the information entropy of X. It is an important quantity in information theory. The log can be taken to any base, but using base 2, nH(X) is roughly the expected number of binary bits required to report the result of n experiments in which X_1, ..., X_n are i.i.d. observations from the distribution (p_n, 1 ≤ n ≤ N) and we encode our reporting of the results in the most efficient way.

Let's use Jensen's Inequality to prove the entropy is maximized by p_1 = ... = p_N = 1/N. Consider f(x) = log x, which is a concave function. We may assume p_n > 0 for all n. Let X be a r.v. such that X = 1/p_n with probability p_n. Then E[X] = Σ_n p_n (1/p_n) = N, and

    Σ_{n=1}^N p_n log (1/p_n) = E[f(X)] ≤ f(E[X]) = f(N) = log N,    (2.25)

and log N = − Σ_{n=1}^N (1/N) log(1/N) is exactly the entropy of the uniform distribution.

To provide some more underpinnings for ideas from information theory, we shall make two definitions.

Definition 5 If X is a random variable that takes values x_1, ..., x_N with probabilities p_1, ..., p_N, then the Shannon information content of an outcome x_n is defined as

    h(x_n) = log₂ (1/p_n).    (2.26)

Information content is measured in bits. One bit is typically defined as the uncertainty of a binary random variable that is 0 or 1 with equal probability, or the information that is gained when the value of such a variable becomes known.

Definition 6 If X is a random variable that takes values x_1, ..., x_N with probabilities p_1, ..., p_N, then the information entropy of the random variable is given by the mean Shannon information content

    H(X) = − Σ_{n=1}^N p_n log₂ p_n.    (2.27)

Note that the entropy does not depend on the values that the random variable takes, but only on the probability distribution. We can also define the joint entropy of a family of random variables.
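Before turning to joint entropy, Definition 6 can be sketched in a few lines:

```python
import math

def entropy(probs):
    """Information entropy in bits: H(X) = -sum p_n log2 p_n (Definition 6)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Entropy depends only on the distribution, not on the values taken, and
# (by the Jensen argument above) the uniform distribution maximizes it.
print(entropy([0.5, 0.5]))                        # 1.0: one fair coin flip
print(entropy([0.25] * 4))                        # 2.0: log2 N for N = 4
print(entropy([0.9, 0.1]) < entropy([0.5, 0.5]))  # True
```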

Definition 7 Consider a family of discrete random variables X_1, ..., X_N, where X_i takes a finite set of values in some set A_i, which wlog is a subset of N. Their joint entropy is defined by

    H(X_1, ..., X_N) = − Σ_{x_1 ∈ A_1} ... Σ_{x_N ∈ A_N} Pr((X_1, ..., X_N) = (x_1, ..., x_N)) log₂ Pr((X_1, ..., X_N) = (x_1, ..., x_N)).    (2.28)

Example 2 Suppose X_1 and X_2 take the following values:

    Pr(X_1 = 1, X_2 = 1) = 1/4    (2.29)
    Pr(X_1 = 1, X_2 = −1) = 1/4    (2.30)
    Pr(X_1 = −1, X_2 = 1) = 1/4    (2.31)
    Pr(X_1 = −1, X_2 = −1) = 1/4    (2.32)

Clearly X_1 and X_2 are independent. The joint entropy of X_1 and X_2 is

    − (1/4) log₂ (1/4) − (1/4) log₂ (1/4) − (1/4) log₂ (1/4) − (1/4) log₂ (1/4)    (2.33)
        = log₂ 4 = 2.    (2.34)

We can deduce that

    Pr(X_1 = 1) = Pr(X_1 = −1) = 1/2 = Pr(X_2 = 1) = Pr(X_2 = −1).    (2.35)

Observe that

    H(X_1) = H(X_2) = (1/2) log₂ 2 + (1/2) log₂ 2 = 1,    (2.36)

and so we see that

    H(X_1, X_2) = H(X_1) + H(X_2) = 2.    (2.37)

Now suppose X_1 and X_2 are correlated and take the following values:

    Pr(X_1 = 1, X_2 = 1) = 1/6    (2.38)
    Pr(X_1 = 1, X_2 = −1) = 1/3    (2.39)
    Pr(X_1 = −1, X_2 = 1) = 1/3    (2.40)
    Pr(X_1 = −1, X_2 = −1) = 1/6    (2.41)

The joint entropy of X_1 and X_2 is

    − (1/6) log₂ (1/6) − (1/3) log₂ (1/3) − (1/3) log₂ (1/3) − (1/6) log₂ (1/6)    (2.42)
        = − (1/3) log₂ (1/6) − (2/3) log₂ (1/3)    (2.43)
        = (1/3) log₂ 6 + (2/3) log₂ 3    (2.44)
        = log₂ (6^{1/3} 3^{2/3}) ≈ 1.9183 < 2.    (2.45)

We can easily deduce that

    Pr(X_1 = 1) = Pr(X_1 = −1) = 1/2 = Pr(X_2 = 1) = Pr(X_2 = −1),    (2.46)

and so

    H(X_1) = H(X_2) = 1.    (2.47)

But now

    1.9183 ≈ H(X_1, X_2) < H(X_1) + H(X_2) = 2.    (2.48)

This result is intuitive. When the random variables are correlated, their joint information is less than the sum of their individual informations.

You may well have seen the following definition of independence for discrete random variables.

Definition 8 Consider two discrete random variables, X and Y, which can take values in the set {a_1, ..., a_N}. X and Y are independent if

    ∀ i, j ∈ {1, ..., N},  Pr({X = a_i, Y = a_j}) = Pr(X = a_i) Pr(Y = a_j).    (2.49)

Using the above definition, we can prove that the joint entropy of two independent random variables is just the sum of the individual entropies.

Proposition 3 For two discrete random variables X and Y,

    H(X, Y) = H(X) + H(Y)    (2.50)

if and only if X and Y are independent.
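The two worked examples above can be checked numerically, illustrating both the equality in the independent case and the strict inequality in the correlated one:

```python
import math

def joint_entropy(probs):
    # Definition 7, specialized to a finite list of joint probabilities.
    return -sum(p * math.log2(p) for p in probs if p > 0)

independent = [1/4, 1/4, 1/4, 1/4]   # first table in Example 2
correlated = [1/6, 1/3, 1/3, 1/6]    # second table in Example 2

h_ind = joint_entropy(independent)
h_cor = joint_entropy(correlated)

print(h_ind)  # 2.0, equal to H(X1) + H(X2)
print(h_cor)  # ~1.9183, strictly below H(X1) + H(X2) = 2
```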

Proof of Proposition 3 From Definition 7, we have

    H(X, Y) = Σ_{i=1}^N Σ_{j=1}^N Pr({X = a_i, Y = a_j}) log [ 1 / Pr({X = a_i, Y = a_j}) ].    (2.51)

Supposing X and Y are independent, we obtain

    H(X, Y) = Σ_{i=1}^N Σ_{j=1}^N Pr({X = a_i, Y = a_j}) log [ 1 / Pr(X = a_i) ] + Σ_{i=1}^N Σ_{j=1}^N Pr({X = a_i, Y = a_j}) log [ 1 / Pr(Y = a_j) ].    (2.52)

Hence

    H(X, Y) = Σ_{i=1}^N log [ 1 / Pr(X = a_i) ] Σ_{j=1}^N Pr({X = a_i, Y = a_j}) + Σ_{j=1}^N log [ 1 / Pr(Y = a_j) ] Σ_{i=1}^N Pr({X = a_i, Y = a_j}).    (2.53)

Observe that

    Σ_{j=1}^N Pr({X = a_i, Y = a_j}) = Pr(X = a_i)    (2.54)

and

    Σ_{i=1}^N Pr({X = a_i, Y = a_j}) = Pr(Y = a_j).    (2.55)

Hence

    H(X, Y) = Σ_{i=1}^N Pr(X = a_i) log [ 1 / Pr(X = a_i) ] + Σ_{j=1}^N Pr(Y = a_j) log [ 1 / Pr(Y = a_j) ]    (2.56)
        = H(X) + H(Y).    (2.57)

Out of laziness, I am leaving the "only if" part as an exercise. But what happens when X and Y are not independent random variables? We have the following inequality, which you can try to prove yourself.

Proposition 4 For two discrete random variables X and Y, we have

    H(X, Y) ≤ H(X) + H(Y).    (2.58)

We can also measure the difference between two sets of probabilities. In economics, we can use this to measure how far apart two sets of beliefs are.

Definition 9 Suppose we have a discrete random variable X, which can take the values x_1, ..., x_N. We can define two different sets of probabilities, P = {p_1, ..., p_N} and Q = {q_1, ..., q_N}. The relative entropy or Kullback-Leibler divergence between the two probability distributions is

    D_KL(P ‖ Q) = Σ_{n=1}^N p_n log₂ (p_n / q_n).    (2.59)

Proposition 5 (Gibbs' Inequality) The relative entropy satisfies Gibbs' inequality

    D_KL(P ‖ Q) ≥ 0,    (2.60)

with equality only if P and Q are identical.

2.4 Exercises

1. Consider the concave function u(x). Suppose X is a random variable. Using a second-order Taylor expansion, show that

    E[u(X)] ≈ u(E[X]) + (1/2) Var[X] u″(E[X]).    (2.61)

2. Let X_1, ..., X_N be independent random variables, all with uniform distribution on [0, 1]. What is the probability of the event {X_1 > X_2 > ... > X_{N−1} > X_N}?

3. Let X and Y be two non-constant random variables with finite variances. The correlation coefficient is denoted by ρ(X, Y).

(a) Using the Cauchy-Schwarz inequality or otherwise, prove that

    |ρ(X, Y)| ≤ 1.    (2.62)

(b) What can be said about the relationship between X and Y when either (i) ρ(X, Y) = 0 or (ii) |ρ(X, Y)| = 1? [Proofs are not required.]

(c) Take r ∈ [0, 1] and let X, X′ be independent random variables taking values ±1 with probabilities 1/2. Set

    Y = { X with probability r; X′ with probability 1 − r.    (2.63)

Find ρ(X, Y).

4. The 1-Trick and the Splitting Trick. Show that for each real sequence x_1, x_2, ..., x_N one has

    Σ_{n=1}^N x_n ≤ N^{1/2} ( Σ_{n=1}^N x_n² )^{1/2},    (2.64)

and show that one also has

    Σ_{n=1}^N |a_n| ≤ ( Σ_{n=1}^N |a_n|^{2/3} )^{1/2} ( Σ_{n=1}^N |a_n|^{4/3} )^{1/2}.    (2.65)

The two tricks illustrated by this simple exercise are very useful when proving inequalities.

5. If p(k; θ) ≥ 0 for all k ∈ D and θ ∈ Θ, and if

    Σ_{k ∈ D} p(k; θ) = 1,  ∀ θ ∈ Θ,    (2.66)

then for each θ ∈ Θ one can think of M_θ = {p(k; θ) : k ∈ D} as specifying a probability model, where p(k; θ) represents the probability that we observe k when the parameter θ is the true state of nature. If the function g : D → R satisfies

    Σ_{k ∈ D} g(k) p(k; θ) = θ,  ∀ θ ∈ Θ,    (2.67)

then g is called an unbiased estimator of the parameter θ. The variance of the unbiased estimator g is given by Σ_{k ∈ D} (g(k) − θ)² p(k; θ). Assuming that D is finite and p(k; θ) is a differentiable function of θ, show that one has the following lower bound for the variance of the unbiased estimator of θ:

    Σ_{k ∈ D} (g(k) − θ)² p(k; θ) ≥ 1 / I(θ),    (2.68)

where I : Θ → R is defined by the sum

    I(θ) = Σ_{k ∈ D} [ ∂p(k; θ)/∂θ ]² / p(k; θ).    (2.69)

The quantity I(θ) is known as the Fisher information at θ of the model M_θ. The inequality (2.68) is known as the Cramer-Rao lower bound, and it has extensive applications in mathematical statistics.

6. Show that if X is a discrete r.v. such that Pr(X = x_n) = p_n for n ∈ {1, ..., N}, and f : R → R and g : R → R are nondecreasing, then

    E[f(X)] E[g(X)] ≤ E[f(X) g(X)].    (2.70)

7. Given n random people, what is the probability that two or more of them have the same birthday? Under the natural (but approximate!) model where the birthdays are viewed as independent and uniformly distributed in the set {1, 2, ..., 365}, show that this probability is at least 1/2 if n ≥ 23.

8. A fair coin is flipped until the first head occurs. Let X denote the number of flips required. Find the entropy H(X) in bits.

9. Use Jensen's Inequality to prove Gibbs' Inequality.

10. It is well known that there are infinitely many prime numbers; a proof appears in Euclid's famous Elements. We will not only show that there are infinitely many prime numbers, but we will also give a lower bound on the rate of their growth using information theory. Let π(n) denote the number of primes no greater than n. Every positive integer n has a unique prime factorization of the form

    n = Π_{i=1}^{π(n)} p_i^{X_i(n)},    (2.71)

where p_1, p_2, p_3, ... are the primes, that is, p_1 = 2, p_2 = 3, p_3 = 5, etc., and X_i(n) is the non-negative integer representing the multiplicity of p_i in the prime factorization of n. Let N be uniformly distributed on {1, 2, 3, ..., n}.

(a) Show that X_i(N) is an integer-valued random variable satisfying

    0 ≤ X_i(N) ≤ log₂ n.    (2.72)

[Hint: Try finding a lower and an upper bound for p_i^{X_i(N)}.]

(b) Show that

    π(n) ≥ log₂ n / log₂(log₂ n + 1).    (2.73)

[Hint: Do X_1(N), X_2(N), ..., X_{π(n)}(N) determine N? What does that say about the respective entropies?]
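As a sanity check on the birthday exercise (a computation, not a proof), the probability of a shared birthday can be evaluated exactly under the uniform-and-independent model:

```python
# With n people and 365 equally likely birthdays,
# Pr(all distinct) = (365/365) * (364/365) * ... * ((365-n+1)/365),
# so Pr(some shared birthday) = 1 - that product.
def shared_birthday_prob(n, days=365):
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (days - k) / days
    return 1.0 - p_distinct

print(shared_birthday_prob(22))  # ~0.476: still below 1/2
print(shared_birthday_prob(23))  # ~0.507: first n at or above 1/2
```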