Extract: Data Analysis Tools

Harjoat S. Bhamra

July 8, 2017
Contents

1 Introduction

Part I: Probability

2 Inequalities
   2.1 Jensen's inequality
       Arithmetic Mean-Geometric Mean Inequality; Proof of Corollary
   2.2 Cauchy-Schwarz Inequality
       Multiple random variables
   2.3 Information entropy
   2.4 Exercises
3 Weak Law of Large Numbers
   Markov inequality; Chebyshev inequality; Weak law of large numbers; Exercises
4 Normal Distribution
   Calculations with the normal distribution; Checking the normal density integrates to one; Mode, median, sample mean and variance; Bounds on the tail probability of a normal distribution; Multivariate normal; Bivariate normal; Brownian motion; Exercises
5 Generating and Characteristic Functions
   Moment Generating Functions; Characteristic Functions; A Brownian interlude with mirrors, but no smoke; Exercises
6 Central limit theorem
   Exercises

Part II: Statistics

7 Mathematical Statistics
   Introduction; Mean squared error
8 Bayesian inference
9 Estimating stochastic volatility models

Part III: Applying Linear Algebra and Statistics to Data Analysis

10 Principal Components Analysis (PCA)
   Overview
   A simple example from high school physics
   Linear Algebra for Principal Components Analysis: vector spaces; linear independence, subspaces, spanning and bases; change of basis
   How do we choose the basis? Noisy data; redundancy
   Covariance matrix; covariance matrix under a new basis
   PCA via Projection: mathematics of orthogonal projections; orthogonal projection operators; projecting the data onto a 1-d subspace; Spectral Theorem for Real Symmetric Matrices; projecting the data onto an m-dimensional subspace
   Scree Plots
   What about changing the basis? Orthogonal matrices
   Statistical Inference
   Exercises

Part IV: Exam Preparation

Topics covered in Final Exam
   Linear Maps
Chapter 1

Introduction

There are, of course, many reasons for learning mathematics. Some take the view of Siméon Poisson, to whom the following saying is attributed: "Life is good for only two things: doing mathematics and teaching it." However, the type of person who takes Poisson's view of life probably does not need to read these notes. So what is their purpose?
Siméon Poisson (1781–1840) was a French mathematician. Poisson's name is attached to a wide variety of ideas in mathematics and physics, for example: Poisson's integral, Poisson's equation in potential theory, Poisson brackets in differential equations, Poisson's ratio in elasticity, and Poisson's constant in electricity.
Benjamin Disraeli (1804–1881) was a British politician. Disraeli trained as a lawyer but wanted to be a novelist. He became a politician in the 1830s and is generally acknowledged to be one of the key figures in the history of the Conservative Party. He was Prime Minister in 1868 and from 1874 to 1880. He famously acquired for Britain a large stake in the Suez Canal, and made Queen Victoria Empress of India. In 1876 he was raised to the peerage as the Earl of Beaconsfield.

In these lecture notes, I hope to include sufficient mathematics for you to be able to analyze data, without delving too deeply into advanced statistics or econometrics, but while covering enough material to ensure that you and your work are neither a danger to yourself nor to others. As Benjamin Disraeli famously said: "There are three types of lies: lies, damn lies, and statistics." Hopefully, after this course, no one will say that about your work.
We shall cover some basic probability theory and linear algebra, before delving into some elementary statistics, culminating with a study of principal components analysis. Principal components analysis (PCA) is one of a family of techniques for taking a large amount of data and summarizing it via a smaller, more manageable set of variables. You can think of it as the process of replacing a long book with a summary. More formally, PCA is the process of taking high-dimensional data and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form, without losing too much information.

A nice example of PCA is in politics. Stephen A. Weis, now a software engineer at Facebook, analyzed the voting records of senators in the first session of the 108th US Congress. He looked at 458 separate votes taken by the 100 senators. Each vote was described by a 1 (yes), -1 (no) or 0 (absent). In total, this gave rise to a 100 × 458 matrix (senators by votes). Using PCA (computed via the singular value decomposition, SVD), Weis was able to reduce the dimensionality of the data to the extent that he could summarize it on the 2-dimensional plot depicted in Figure 1. If you were kind, you might describe these lecture notes as the result of PCA applied to the mathematics of data analysis.
Figure 1: Democratic and Republican senators have been colored blue and red, respectively. The values of the axes, and the axes themselves, are artifacts of the singular value decomposition. In other words, the axes don't mean anything: they are simply the two most significant dimensions of the data's optimal representation. Regardless, one can see that this map clearly clusters senators according to party. Note that this map was generated only from voting records, without any data on party affiliation. From just the voting data, there is clearly a partisan divide between the parties.

The above example illustrates the strengths and limitations of PCA. It allows you to summarize large data sets via a small set of variables in a way which makes it easy to visualize the data. But it does not tell you what the summary variables mean. Indeed, as was said by Henry Clay: "Statistics are no substitute for judgment."
Henry Clay (1777–1852) was an American lawyer, planter, politician, and skilled orator who represented Kentucky in both the United States Senate and House of Representatives. He served three non-consecutive terms as Speaker of the House of Representatives and served as Secretary of State under President John Quincy Adams from 1825 to 1829. Clay ran for the presidency in 1824, 1832 and 1844, while also seeking the Whig Party nomination in 1840 and 1848.
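Before moving on, here is a minimal Python sketch of a Weis-style analysis. The vote matrix below is synthetic (Weis's actual 108th-Congress data is not reproduced here), and the party structure, defection rate and absence rate are invented purely for illustration; the point is only the mechanics of centring the data matrix and reading the two leading directions off the SVD.

```python
# Sketch of a Weis-style PCA on a synthetic Senate-vote matrix.
# Rows are "senators", columns are "votes"; entries are 1 (yes), -1 (no), 0 (absent).
import numpy as np

rng = np.random.default_rng(0)
n_senators, n_votes = 100, 458

# Two synthetic "parties" that mostly follow their party line.
party = np.repeat([1, -1], n_senators // 2)      # +1 = party A, -1 = party B
party_line = rng.choice([1, -1], size=n_votes)   # how party A votes on each bill
votes = np.outer(party, party_line)              # pure party-line voting
noise = rng.random((n_senators, n_votes))
votes[noise < 0.15] *= -1                        # 15% defections
votes[noise > 0.97] = 0                          # occasional absences

# PCA via the SVD of the centred data matrix.
X = votes - votes.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
coords = U[:, :2] * S[:2]                        # each senator's (PC1, PC2) position

# The two parties separate along the first principal component.
print("mean PC1, party A:", coords[party == 1, 0].mean())
print("mean PC1, party B:", coords[party == -1, 0].mean())
```

On data with strong block structure like this, the first principal component typically separates the two groups, which is exactly the partisan split visible in Figure 1.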
Part I

Probability
Chapter 2

Inequalities

Contents
2.1 Jensen's inequality (Arithmetic Mean-Geometric Mean Inequality; Proof of Corollary)
2.2 Cauchy-Schwarz Inequality (Multiple random variables)
2.3 Information entropy
2.4 Exercises

We often model data via random variables. For example, stock returns are often assumed to be normal. Inequalities give us well-defined facts about random variables.

Definition 1 A function $f : (a, b) \to \mathbb{R}$ is concave if for all $x, y \in (a, b)$ and $\lambda \in [0, 1]$,
$$\lambda f(x) + (1 - \lambda) f(y) \leq f(\lambda x + (1 - \lambda) y).$$
It is strictly concave if strict inequality holds when $x \neq y$ and $0 < \lambda < 1$.

Definition 2 A function $f$ is convex (strictly convex) if $-f$ is concave (strictly concave).

Fact If $f$ is a twice differentiable function and $f''(x) \leq 0$ for all $x \in (a, b)$, then $f$ is concave [a basic exercise in Analysis]. It is strictly concave if $f''(x) < 0$ for all $x \in (a, b)$.
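The definition and the second-derivative fact are easy to spot-check numerically. The following sketch (not part of the original notes) samples random chords of $f(x) = \ln x$, which is strictly concave since $f''(x) = -1/x^2 < 0$, and verifies the defining inequality on each sample.

```python
# Numerically spot-check the concavity definition for f(x) = ln(x) on (0, inf):
# lambda*f(x) + (1 - lambda)*f(y) <= f(lambda*x + (1 - lambda)*y).
import numpy as np

rng = np.random.default_rng(1)
f = np.log                             # f''(x) = -1/x**2 < 0, so ln is strictly concave
x, y = rng.uniform(0.1, 10.0, size=(2, 100_000))
lam = rng.uniform(0.0, 1.0, size=100_000)

lhs = lam * f(x) + (1 - lam) * f(y)    # value on the chord
rhs = f(lam * x + (1 - lam) * y)       # value of the function
assert np.all(lhs <= rhs + 1e-12)      # the chord lies below the function
print("concavity inequality holds on all", len(x), "samples")
```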
[Figure: a concave function. The chord lies below the function.]

Johan Jensen (1859–1925) was a Danish mathematician and engineer. Although he studied mathematics among various subjects at college, and even published a research paper in mathematics, he learned advanced mathematical topics later by himself and never held any academic position. Instead, he was a successful engineer for the Copenhagen Telephone Company between 1881 and 1924, and became head of its technical department. All his mathematics research was carried out in his spare time. Jensen is mostly renowned for his famous inequality, Jensen's inequality. In 1915, Jensen also proved Jensen's formula in complex analysis.
2.1 Jensen's inequality

Theorem 1 (Jensen's Inequality) Let $f : (a, b) \to \mathbb{R}$ be a concave function. Then
$$f\left( \sum_{n=1}^{N} p_n x_n \right) \geq \sum_{n=1}^{N} p_n f(x_n) \qquad (2.1)$$
for all $x_1, \ldots, x_N \in (a, b)$ and $p_1, \ldots, p_N \in (0, 1)$ such that $\sum_{n=1}^{N} p_n = 1$. Furthermore, if $f$ is strictly concave, then equality holds iff all the $x_n$ are equal.

If $X$ is a random variable that takes finitely many values, Jensen's Inequality can be written as
$$f(E[X]) \geq E[f(X)]. \qquad (2.2)$$

Proof of Theorem 1 We use proof by induction. Jensen's Inequality for $N = 2$ is just the definition of concavity. Suppose it holds for $N - 1$. Now consider
$$f\left( \sum_{n=1}^{N} p_n x_n \right) = f\left( p_1 x_1 + \sum_{n=2}^{N} p_n x_n \right). \qquad (2.3)$$
To apply the definition of concavity, we observe that
$$f\left( p_1 x_1 + \sum_{n=2}^{N} p_n x_n \right) = f\left( p_1 x_1 + (1 - p_1) z \right), \qquad (2.4)$$
where
$$z = \sum_{n=2}^{N} \frac{p_n}{\sum_{k=2}^{N} p_k} x_n. \qquad (2.5)$$
Applying the definition of concavity, we have
$$f\left( p_1 x_1 + (1 - p_1) z \right) \geq p_1 f(x_1) + (1 - p_1) f(z) \qquad (2.6)$$
$$= p_1 f(x_1) + (1 - p_1) f\left( \sum_{n=2}^{N} \frac{p_n}{\sum_{k=2}^{N} p_k} x_n \right). \qquad (2.7)$$
Jensen's Inequality holds for $N - 1$ and so
$$f\left( \sum_{n=2}^{N} \frac{p_n}{\sum_{k=2}^{N} p_k} x_n \right) \geq \sum_{n=2}^{N} \frac{p_n}{\sum_{k=2}^{N} p_k} f(x_n). \qquad (2.8)$$
Therefore
$$f\left( \sum_{n=1}^{N} p_n x_n \right) \geq p_1 f(x_1) + (1 - p_1) \sum_{n=2}^{N} \frac{p_n}{\sum_{k=2}^{N} p_k} f(x_n) \qquad (2.9)$$
$$= \sum_{n=1}^{N} p_n f(x_n), \qquad (2.10)$$
since $1 - p_1 = \sum_{k=2}^{N} p_k$. Therefore, if Jensen's Inequality holds for $N - 1$, it also holds for $N$ by virtue of concavity. Since Jensen's Inequality for $N = 2$ is just the definition of concavity, it follows by induction that Jensen's Inequality holds for all finite integers $N$ greater than or equal to 2.

Arithmetic Mean-Geometric Mean Inequality

Corollary 1 (Arithmetic Mean-Geometric Mean Inequality) Given positive real numbers $x_1, \ldots, x_N$,
$$\left( \prod_{n=1}^{N} x_n \right)^{1/N} \leq \frac{1}{N} \sum_{n=1}^{N} x_n. \qquad (2.11)$$

Proof of Corollary 1 Suppose $X$ is a discrete random variable such that
$$\Pr(X = x_n) = \frac{1}{N}, \quad x_n > 0, \quad n \in \{1, \ldots, N\}. \qquad (2.12)$$
Observe that $\ln x$ is a concave function of $x$ and so from Jensen's Inequality
$$E[\ln X] \leq \ln E[X]. \qquad (2.13)$$
Therefore
$$\frac{1}{N} \sum_{n=1}^{N} \ln x_n \leq \ln\left( \frac{1}{N} \sum_{n=1}^{N} x_n \right) \qquad (2.14)$$
$$\ln \prod_{n=1}^{N} x_n^{1/N} \leq \ln\left( \frac{1}{N} \sum_{n=1}^{N} x_n \right) \qquad (2.15)$$
$$\ln \left( \prod_{n=1}^{N} x_n \right)^{1/N} \leq \ln\left( \frac{1}{N} \sum_{n=1}^{N} x_n \right) \qquad (2.16)$$
Now, because $e^x$ is monotonically increasing, we have
$$\left( \prod_{n=1}^{N} x_n \right)^{1/N} \leq \frac{1}{N} \sum_{n=1}^{N} x_n. \qquad (2.17)$$

2.2 Cauchy-Schwarz Inequality

Theorem 2 (Cauchy-Schwarz Inequality) For any random variables $X$ and $Y$,
$$E[XY]^2 \leq E[X^2] E[Y^2]. \qquad (2.18)$$

Proof of Theorem 2 If $Y = 0$, then both sides are 0. Otherwise, $E[Y^2] > 0$. Let
$$w = X - Y \frac{E[XY]}{E[Y^2]}. \qquad (2.19)$$
Then
$$E[w^2] = E\left[ X^2 - 2XY \frac{E[XY]}{E[Y^2]} + Y^2 \frac{(E[XY])^2}{(E[Y^2])^2} \right] \qquad (2.20)$$
$$= E[X^2] - 2\frac{(E[XY])^2}{E[Y^2]} + \frac{(E[XY])^2}{E[Y^2]} \qquad (2.21)$$
$$= E[X^2] - \frac{(E[XY])^2}{E[Y^2]}. \qquad (2.22)$$
Since $E[w^2] \geq 0$, the Cauchy-Schwarz inequality follows.

Multiple random variables

If we have two random variables, we can study the relationship between them.

Definition 3 (Covariance) Given two random variables $X, Y$, the covariance is
$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])].$$

Proposition 1
1. $\mathrm{Cov}(X, c) = 0$ for constant $c$.
2. $\mathrm{Cov}(X + c, Y) = \mathrm{Cov}(X, Y)$.
3. $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$.
4. $\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$.
5. $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$.
6. $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$.
7. If $X, Y$ are independent, $\mathrm{Cov}(X, Y) = 0$.

These are all trivial to prove. It is important to note that $\mathrm{Cov}(X, Y) = 0$ does not imply that $X$ and $Y$ are independent.

[Figure: visualizing the Cauchy-Schwarz inequality.]
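A quick way to build intuition for these results is to verify them on simulated data. The sketch below (my own, not from the notes) checks Jensen's inequality (2.2), the AM-GM inequality (2.11), the Cauchy-Schwarz inequality (2.18) and property 4 of Proposition 1 with NumPy; the lognormal sample is an arbitrary choice, made only so that the $x_n$ are positive, as AM-GM requires.

```python
# Monte Carlo sanity checks of the inequalities in this chapter.
import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # positive random sample

# Jensen (2.2) with the concave f = ln: f(E[X]) >= E[f(X)].
assert np.log(x.mean()) >= np.log(x).mean()

# AM-GM (2.11): geometric mean <= arithmetic mean.
geo_mean = np.exp(np.log(x).mean())
assert geo_mean <= x.mean()

# Cauchy-Schwarz (2.18): E[XY]^2 <= E[X^2] E[Y^2].
y = rng.normal(size=x.size)
assert np.mean(x * y) ** 2 <= np.mean(x ** 2) * np.mean(y ** 2)

# Covariance identity (Proposition 1, item 4): Cov(X,Y) = E[XY] - E[X]E[Y].
cov = np.mean(x * y) - x.mean() * y.mean()
assert np.isclose(cov, np.cov(x, y, bias=True)[0, 1])
print("all inequalities verified on simulated data")
```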
Example 1 Let $(X, Y) = (2, 0)$, $(-1, 1)$ or $(-1, -1)$ with equal probabilities of 1/3. These are not independent, since $Y = 0 \Rightarrow X = 2$. However,
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = 0 - 0 \cdot 0 = 0.$$
If we randomly pick a point on the unit circle and let the coordinates be $(X, Y)$, then $E[X] = E[Y] = E[XY] = 0$ by symmetry. So $\mathrm{Cov}(X, Y) = 0$, but $X$ and $Y$ are clearly not independent (they have to satisfy $x^2 + y^2 = 1$).

The covariance is not that useful in measuring how well two variables correlate. For one, the covariance can (potentially) have dimensions, which means that the numerical value of the covariance can depend on what units we are using. Also, the magnitude of the covariance depends largely on the variances of $X$ and $Y$ themselves. To solve these problems, we define:

Definition 4 (Correlation coefficient) The correlation coefficient of $X$ and $Y$ is
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.$$

Proposition 2 $|\rho(X, Y)| \leq 1$.

Proof of Proposition 2 Apply Cauchy-Schwarz to $X - E[X]$ and $Y - E[Y]$.

Again, zero correlation does not necessarily imply independence.

2.3 Information entropy

Suppose we observe data about the economy up until 2010 and then look again later. How much more information do we have? Information theory gives us ways of measuring information. We shall start (and end!) with the basic idea of information entropy, also known as Shannon's entropy. In the context of PCA, we want to reduce the dimensionality of a dataset, but without losing too much information. Entropy gives us a way of measuring this.
Claude Shannon (1916–2001) introduced the notion that information could be quantified. In "A Mathematical Theory of Communication", his legendary paper from 1948, Shannon proposed that data should be measured in bits: discrete values of zero or one. Shannon developed information entropy as a measure of the uncertainty in a message, while essentially inventing the field of information theory.

Perhaps confusingly, in information theory the term entropy refers to information we don't have (normally people define information as what they know!). The information we don't have about a system, its entropy, is related to its unpredictability: how much it can surprise us.

Suppose an event $A$ occurs with probability $P(A) = p$. How surprising is it? If it is not very surprising, there cannot be much new information in the event. Let's try to invent a surprise function, say $S(p)$. What properties should this have? Since a certain event is unsurprising, we would like $S(1) = 0$. We should also like $S(p)$ to be decreasing and continuous in $p$. If $A$ and $B$ are independent events, then we should like $S(P(A \cap B)) = S(P(A)) + S(P(B))$. It turns out that the only function with these properties is one of the form
$$S(p) = -c \log_a p, \qquad (2.23)$$
with $c > 0$. For simplicity, take $c = 1$. The log can be any base, but for the time being let us use base 2 ($a = 2$).
If $X$ is a random variable that takes values $1, \ldots, N$ with probabilities $p_1, \ldots, p_N$, then on average the surprise obtained on learning $X$ is
$$H(X) = E[S(p_X)] = -\sum_{n=1}^{N} p_n \log_2 p_n. \qquad (2.24)$$
This is the information entropy of $X$. It is an important quantity in information theory. The log can be taken to any base, but using base 2, $nH(X)$ is roughly the expected number of binary bits required to report the result of $n$ experiments in which $X_1, \ldots, X_n$ are i.i.d. observations from the distribution $(p_n, 1 \leq n \leq N)$ and we encode our reporting of the results of the experiments in the most efficient way.

Let's use Jensen's inequality to prove that the entropy is maximized by $p_1 = \cdots = p_N = 1/N$. Consider $f(x) = \log x$, which is a concave function. We may assume $p_n > 0$ for all $n$. Let $X$ be a r.v. such that $X = 1/p_n$ with probability $p_n$. Then
$$-\sum_{n=1}^{N} p_n \log p_n = \sum_{n=1}^{N} p_n \log \frac{1}{p_n} = E[f(X)] \leq f(E[X]) = f(N) = \log N = -\sum_{n=1}^{N} \frac{1}{N} \log \frac{1}{N}, \qquad (2.25)$$
which is the entropy of the uniform distribution.

To provide some more underpinnings for ideas from information theory, we shall make two definitions.

Definition 5 If $X$ is a random variable that takes values $x_1, \ldots, x_N$ with probabilities $p_1, \ldots, p_N$, then the Shannon information content of an outcome $x_n$ is defined as
$$h(x_n) = \log_2 \frac{1}{p_n}. \qquad (2.26)$$
Information content is measured in bits. One bit is typically defined as the uncertainty of a binary random variable that is 0 or 1 with equal probability, or the information that is gained when the value of such a variable becomes known.

Definition 6 If $X$ is a random variable that takes values $x_1, \ldots, x_N$ with probabilities $p_1, \ldots, p_N$, then the information entropy of the random variable is given by the mean Shannon information content
$$H(X) = -\sum_{n=1}^{N} p_n \log_2 p_n. \qquad (2.27)$$
Note that the entropy does not depend on the values that the random variable takes, but only on the probability distribution.
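A direct implementation of (2.27) makes the behaviour of entropy easy to see: a fair coin carries one bit, a biased coin carries less, and among distributions on $N$ outcomes the uniform one attains the maximum $\log_2 N$, as just proved. The helper below is a sketch whose name is my own, not from the notes.

```python
# Information entropy (2.27) of a discrete distribution, in bits.
import numpy as np

def entropy_bits(p):
    """H(X) = -sum p_n log2 p_n, with the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # drop zero-probability outcomes
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))             # 1.0 bit: a fair coin
print(entropy_bits([0.9, 0.1]))             # ~0.469 bits: a biased, less surprising coin
print(entropy_bits([0.25] * 4))             # 2.0 bits: uniform on 4 outcomes, the maximum
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))   # < 2 bits: any non-uniform p on 4 outcomes
```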
We can also define the joint entropy of a family of random variables.

Definition 7 Consider a family of discrete random variables $X_1, \ldots, X_N$, where $X_i$ takes a finite set of values in some set $A_i$, which wlog is a subset of $\mathbb{N}$. Their joint entropy is defined by
$$H(X_1, \ldots, X_N) = -\sum_{x_1 \in A_1} \cdots \sum_{x_N \in A_N} \Pr((X_1, \ldots, X_N) = (x_1, \ldots, x_N)) \log_2 \Pr((X_1, \ldots, X_N) = (x_1, \ldots, x_N)). \qquad (2.28)$$

Example 2 Suppose $X_1$ and $X_2$ take the following values:
$$\Pr(X_1 = 1, X_2 = 1) = 1/4 \qquad (2.29)$$
$$\Pr(X_1 = 1, X_2 = -1) = 1/4 \qquad (2.30)$$
$$\Pr(X_1 = -1, X_2 = 1) = 1/4 \qquad (2.31)$$
$$\Pr(X_1 = -1, X_2 = -1) = 1/4 \qquad (2.32)$$
Clearly $X_1$ and $X_2$ are independent. The joint entropy of $X_1$ and $X_2$ is
$$\frac{1}{4} \log_2 4 + \frac{1}{4} \log_2 4 + \frac{1}{4} \log_2 4 + \frac{1}{4} \log_2 4 \qquad (2.33)$$
$$= \log_2 4 = 2. \qquad (2.34)$$
We can deduce that
$$\Pr(X_1 = 1) = \Pr(X_1 = -1) = 1/2 = \Pr(X_2 = 1) = \Pr(X_2 = -1). \qquad (2.35)$$
Observe that
$$H(X_1) = H(X_2) = \frac{1}{2} \log_2 2 + \frac{1}{2} \log_2 2 = 1 \qquad (2.36)$$
and so we see that
$$H(X_1, X_2) = H(X_1) + H(X_2) = 2. \qquad (2.37)$$
Now suppose $X_1$ and $X_2$ are correlated and take the following values:
$$\Pr(X_1 = 1, X_2 = 1) = 1/6 \qquad (2.38)$$
$$\Pr(X_1 = 1, X_2 = -1) = 1/3 \qquad (2.39)$$
$$\Pr(X_1 = -1, X_2 = 1) = 1/3 \qquad (2.40)$$
$$\Pr(X_1 = -1, X_2 = -1) = 1/6 \qquad (2.41)$$
The joint entropy of $X_1$ and $X_2$ is now
$$\frac{1}{6} \log_2 6 + \frac{1}{3} \log_2 3 + \frac{1}{3} \log_2 3 + \frac{1}{6} \log_2 6 \qquad (2.42)$$
$$= \frac{1}{3} \log_2 6 + \frac{2}{3} \log_2 3 \qquad (2.43)$$
$$= \log_2 \left( 6^{1/3}\, 3^{2/3} \right) \qquad (2.44)$$
$$\approx 1.918 < 2. \qquad (2.45)$$
We can easily deduce that
$$\Pr(X_1 = 1) = \Pr(X_1 = -1) = 1/2 = \Pr(X_2 = 1) = \Pr(X_2 = -1), \qquad (2.46)$$
and so
$$H(X_1) = H(X_2) = 1, \qquad (2.47)$$
but now
$$H(X_1, X_2) < H(X_1) + H(X_2) = 2. \qquad (2.48)$$
This result is intuitive. When the random variables are correlated, their joint information is less than the sum of their individual information.

You may well have seen the following definition of independence for discrete random variables.

Definition 8 Consider two discrete random variables, $X$ and $Y$, which can take values in the set $\{a_1, \ldots, a_N\}$. $X$ and $Y$ are independent if
$$\forall i, j \in \{1, \ldots, N\}, \quad \Pr(\{X = a_i, Y = a_j\}) = \Pr(X = a_i) \Pr(Y = a_j). \qquad (2.49)$$
Using the above definition, we can prove that the joint entropy of two independent random variables is just the sum of the individual entropies.
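Example 2 is easy to reproduce numerically. The snippet below (the entropy helper is repeated so it runs on its own) computes both joint entropies and confirms that correlation pushes the joint entropy below $H(X_1) + H(X_2) = 2$.

```python
# Reproducing Example 2: joint entropy of the independent and correlated pairs.
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p)))

independent = [1/4, 1/4, 1/4, 1/4]   # equations (2.29)-(2.32)
correlated = [1/6, 1/3, 1/3, 1/6]    # equations (2.38)-(2.41)

print(entropy_bits(independent))     # 2.0 = H(X1) + H(X2)
print(entropy_bits(correlated))      # ~1.918 < 2: correlation reduces joint information
```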
Proposition 3 For two discrete random variables, $X$ and $Y$,
$$H(X, Y) = H(X) + H(Y) \qquad (2.50)$$
if and only if $X$ and $Y$ are independent.

Proof of Proposition 3 From Definition 7, we have
$$H(X, Y) = \sum_{i=1}^{N} \sum_{j=1}^{N} \Pr(\{X = a_i, Y = a_j\}) \log \frac{1}{\Pr(\{X = a_i, Y = a_j\})}. \qquad (2.51)$$
Supposing $X$ and $Y$ are independent, we obtain
$$H(X, Y) = \sum_{i=1}^{N} \sum_{j=1}^{N} \Pr(\{X = a_i, Y = a_j\}) \log \frac{1}{\Pr(X = a_i)} + \sum_{i=1}^{N} \sum_{j=1}^{N} \Pr(\{X = a_i, Y = a_j\}) \log \frac{1}{\Pr(Y = a_j)}. \qquad (2.52)$$
Hence
$$H(X, Y) = \sum_{i=1}^{N} \log \frac{1}{\Pr(X = a_i)} \sum_{j=1}^{N} \Pr(\{X = a_i, Y = a_j\}) + \sum_{j=1}^{N} \log \frac{1}{\Pr(Y = a_j)} \sum_{i=1}^{N} \Pr(\{X = a_i, Y = a_j\}). \qquad (2.53)$$
Observe that
$$\sum_{j=1}^{N} \Pr(\{X = a_i, Y = a_j\}) = \Pr(X = a_i) \qquad (2.54)$$
and
$$\sum_{i=1}^{N} \Pr(\{X = a_i, Y = a_j\}) = \Pr(Y = a_j). \qquad (2.55)$$
Hence
$$H(X, Y) = \sum_{i=1}^{N} \Pr(X = a_i) \log \frac{1}{\Pr(X = a_i)} + \sum_{j=1}^{N} \Pr(Y = a_j) \log \frac{1}{\Pr(Y = a_j)} \qquad (2.56)$$
$$= H(X) + H(Y). \qquad (2.57)$$
Out of laziness, I am leaving the "only if" part as an exercise.

But what happens when $X$ and $Y$ are not independent random variables? We have the following inequality, which you can try to prove yourself.

Proposition 4 For two discrete random variables, $X$ and $Y$, we have
$$H(X, Y) \leq H(X) + H(Y). \qquad (2.58)$$
We can also measure the difference between two sets of probabilities. In economics, we can use this to measure how far apart two sets of beliefs are.

Definition 9 Suppose we have a discrete random variable $X$, which can take the values $x_1, \ldots, x_N$. We can define two different sets of probabilities, $P = \{p_1, \ldots, p_N\}$ and $Q = \{q_1, \ldots, q_N\}$. The relative entropy or Kullback-Leibler divergence between the two probabilities is
$$D_{KL}(P \| Q) = \sum_{n=1}^{N} p_n \log_2 \frac{p_n}{q_n}. \qquad (2.59)$$

Proposition 5 (Gibbs' Inequality) The relative entropy satisfies Gibbs' inequality
$$D_{KL}(P \| Q) \geq 0, \qquad (2.60)$$
with equality only if $P$ and $Q$ are identical.
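A short implementation of (2.59) illustrates Gibbs' inequality numerically. The function below is a sketch with an invented name; it assumes $q_n > 0$ wherever $p_n > 0$, since otherwise the divergence is infinite.

```python
# Relative entropy (2.59) and a numeric illustration of Gibbs' inequality (2.60).
import numpy as np

def kl_bits(p, q):
    """D_KL(P||Q) = sum p_n log2(p_n / q_n); assumes p_n > 0 implies q_n > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([1/3, 1/3, 1/3])
print(kl_bits(p, p))   # 0.0: identical beliefs
print(kl_bits(p, q))   # > 0, per Gibbs' inequality
print(kl_bits(q, p))   # also > 0, but a different value: D_KL is not symmetric
```

Note from the last two lines that $D_{KL}$ is not symmetric in its arguments, which is why it is called a divergence rather than a distance.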
2.4 Exercises

1. Consider a concave function $u(x)$ and suppose $X$ is a random variable. Show that, to second order in the deviation of $X$ from its mean,
$$E[u(X)] \approx u(E[X]) + \tfrac{1}{2} \mathrm{Var}[X]\, u''(E[X]). \qquad (2.61)$$

2. Let $X_1, \ldots, X_N$ be independent random variables, all with uniform distribution on $[0, 1]$. What is the probability of the event $\{X_1 > X_2 > \cdots > X_{N-1} > X_N\}$?

3. Let $X$ and $Y$ be two non-constant random variables with finite variances. The correlation coefficient is denoted by $\rho(X, Y)$.
(a) Using the Cauchy-Schwarz inequality or otherwise, prove that
$$|\rho(X, Y)| \leq 1. \qquad (2.62)$$
(b) What can be said about the relationship between $X$ and $Y$ when either (i) $\rho(X, Y) = 0$ or (ii) $|\rho(X, Y)| = 1$? [Proofs are not required.]
(c) Take $r \in [0, 1]$ and let $X, X'$ be independent random variables taking values $\pm 1$ with probabilities 1/2. Set
$$Y = \begin{cases} X & \text{with probability } r \\ X' & \text{with probability } 1 - r \end{cases} \qquad (2.63)$$
Find $\rho(X, Y)$.

4. The 1-Trick and the Splitting Trick. Show that for each real sequence $x_1, x_2, \ldots, x_N$ one has
$$\sum_{n=1}^{N} |x_n| \leq \sqrt{N} \left( \sum_{n=1}^{N} x_n^2 \right)^{1/2} \qquad (2.64)$$
and show that for nonnegative $a_1, \ldots, a_N$ one also has
$$\sum_{n=1}^{N} a_n \leq \left( \sum_{n=1}^{N} a_n^{2/3} \right)^{1/2} \left( \sum_{n=1}^{N} a_n^{4/3} \right)^{1/2}. \qquad (2.65)$$
The two tricks illustrated by this simple exercise are very useful when proving inequalities.

5. If $p(k; \theta) \geq 0$ for all $k \in D$ and $\theta \in \Theta$, and if
$$\sum_{k \in D} p(k; \theta) = 1, \quad \theta \in \Theta, \qquad (2.66)$$
then for each $\theta \in \Theta$ one can think of $M_\theta = \{p(k; \theta) : k \in D\}$ as specifying a probability model, where $p(k; \theta)$ represents the probability that we observe $k$ when the parameter $\theta$ is the true state of nature. If the function $g : D \to \mathbb{R}$ satisfies
$$\sum_{k \in D} g(k) p(k; \theta) = \theta, \quad \theta \in \Theta, \qquad (2.67)$$
then $g$ is called an unbiased estimator of the parameter $\theta$. The variance of the unbiased estimator $g$ is given by $\sum_{k \in D} (g(k) - \theta)^2 p(k; \theta)$. Assuming that $D$ is finite and $p(k; \theta)$ is a differentiable function of $\theta$, show that one has the following lower bound for the variance of the unbiased estimator of $\theta$:
$$\sum_{k \in D} (g(k) - \theta)^2 p(k; \theta) \geq \frac{1}{I(\theta)}, \qquad (2.68)$$
where $I : \Theta \to \mathbb{R}$ is defined by the sum
$$I(\theta) = \sum_{k \in D} \frac{\left\{ \partial p(k; \theta) / \partial \theta \right\}^2}{p(k; \theta)}. \qquad (2.69)$$
The quantity $I(\theta)$ is known as the Fisher information at $\theta$ of the model $M_\theta$. The inequality (2.68) is known as the Cramér-Rao lower bound, and it has extensive applications in mathematical statistics.
6. Show that if $X$ is a discrete r.v. such that $\Pr(X = x_n) = p_n$ for $n \in \{1, \ldots, N\}$, and $f : \mathbb{R} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$ are nondecreasing, then
$$E[f(X)] E[g(X)] \leq E[f(X) g(X)]. \qquad (2.70)$$

7. Given $n$ random people, what is the probability that two or more of them have the same birthday? Under the natural (but approximate!) model where the birthdays are viewed as independent and uniformly distributed in the set $\{1, 2, \ldots, 365\}$, show that this probability is at least 1/2 if $n \geq 23$.

8. A fair coin is flipped until the first head occurs. Let $X$ denote the number of flips required. Find the entropy $H(X)$ in bits.

9. Use Jensen's Inequality to prove Gibbs' Inequality.

10. It is well known that there are infinitely many prime numbers: a proof appears in Euclid's famous Elements. We will not only show that there are infinitely many prime numbers, but we will also give a lower bound on the rate of their growth using information theory. Let $\pi(n)$ denote the number of primes no greater than $n$. Every positive integer $n$ has a unique prime factorization of the form
$$n = \prod_{i=1}^{\pi(n)} p_i^{X_i(n)}, \qquad (2.71)$$
where $p_1, p_2, p_3, \ldots$ are the primes, that is, $p_1 = 2$, $p_2 = 3$, $p_3 = 5$, etc., and $X_i(n)$ is the non-negative integer representing the multiplicity of $p_i$ in the prime factorization of $n$. Let $N$ be uniformly distributed on $\{1, 2, 3, \ldots, n\}$.
(a) Show that $X_i(N)$ is an integer-valued random variable satisfying
$$0 \leq X_i(N) \leq \log_2 n. \qquad (2.72)$$
[Hint: Try finding a lower and an upper bound for $p_i^{X_i(N)}$.]
(b) Show that
$$\pi(n) \geq \frac{\log_2 n}{\log_2 (\log_2 n + 1)}. \qquad (2.73)$$
[Hint: Do $X_1(N), X_2(N), \ldots, X_{\pi(n)}(N)$ determine $N$? What does that say about the respective entropies?]
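For several of these exercises, a quick Monte Carlo simulation is a useful sanity check of a pencil-and-paper answer. As one illustration (a sketch of my own, not a solution), the snippet below estimates the birthday collision probability of exercise 7 for $n = 23$; the estimate should land just above 1/2.

```python
# Monte Carlo check for the birthday problem (exercise 7): with n = 23 people
# and uniform birthdays, the collision probability should exceed 1/2.
import numpy as np

rng = np.random.default_rng(3)
n, trials = 23, 100_000
birthdays = rng.integers(1, 366, size=(trials, n))   # days 1..365, one row per trial
has_collision = [len(set(row)) < n for row in birthdays]
print("estimated collision probability:", np.mean(has_collision))   # ~0.507
```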