Statistical Inference Writeup

January 19, 2015

This is a personal writeup of Statistical Inference (Casella and Berger, 2nd ed.). The purpose of this note is to keep a log of my impressions during the reading process, so I cannot guarantee the correctness of contents. :-)

Contents

1 Probability Theory
2 Transformations and Expectations
  2.1 Transformations of random variables
    2.1.1 Monotonic functions
    2.1.2 Piecewise monotonic functions
    2.1.3 Probability integral transformation
  2.2 Expected Values
  2.3 Moments and Moment Generating Functions
3 Common Families of Distribution
  3.1 Discrete Distributions
  3.2 Continuous Distributions
  3.3 Exponential Families of Distribution
  3.4 Scaling and Location Families
  3.5 Inequalities and Identities
4 Multiple Random Variables
  4.1 Joint and Marginal Distributions
  4.2 Conditional Distributions and Independence
  4.3 Bivariate Transformations
  4.4 Hierarchical Models and Mixture Distributions
  4.5 Covariance and Correlation
  4.6 Multivariate Distributions
  4.7 Inequalities and Identities

5 Properties of a Random Sample
  5.1 Basic Concepts of Random Samples
  5.2 Sums of Random Variables from a Random Sample
    5.2.1 Basic Statistics
    5.2.2 Bessel's Correction
    5.2.3 Sampling Distributions of Sample Mean and Sample Variance
    5.2.4 Using mgfs to Find Sampling Distributions
  5.3 Sampling From the Normal Distribution
    5.3.1 Student's t-distribution
    5.3.2 Fisher's F-distribution
  5.4 Order Statistics
  5.5 Convergence Concepts
    5.5.1 Definitions
    5.5.2 Law of Large Numbers
    5.5.3 Central Limit Theorem
    5.5.4 Delta Method
6 Principles of Data Reduction
  6.1 The Sufficiency Principle
    6.1.1 Sufficient Statistic
    6.1.2 Factorization Theorem
    6.1.3 Minimal Sufficient Statistics
    6.1.4 Ancillary Statistics
    6.1.5 Complete Statistics
  6.2 The Likelihood Principle
7 Point Estimation
  7.1 Methods of Finding Estimators
    7.1.1 Method of Moments for Finding Estimator
    7.1.2 Maximum Likelihood Estimation
    7.1.3 Bayes Estimators
    7.1.4 The EM Algorithm
  7.2 Methods of Evaluating Estimators
    7.2.1 Mean Squared Error
    7.2.2 Best Unbiased Estimator
    7.2.3 General Loss Functions
8 Hypothesis Testing
  8.1 Terminology
  8.2 Methods of Finding Tests
    8.2.1 Likelihood Ratio Tests
    8.2.2 Bayesian Tests

    8.2.3 Union-Intersection and Intersection-Union Tests
  8.3 Evaluating Tests
    Types of Errors and Power Function
    Most Powerful Tests: Uniformly Most Powerful Tests
    Size of UIT and IUT
    p-values
    Loss Function Optimality
9 Interval Estimation
  Introduction
    Coverage Probability
    Example: Uniform Scale Distribution
  Methods of Finding Interval Estimators
    Equivalence of Hypothesis Test and Interval Estimations: Inverting a Test Statistic
    Using Pivotal Quantities
    Pivoting CDFs Using Probability Integral Transformation
    Bayesian Inference
  Methods of Evaluating Interval Estimators
    Size and Coverage Probability
    Test-Related Optimality
    Bayesian Optimality
    Loss Function Optimality
10 Asymptotic Evaluations
  Point Estimation
    Criteria
    Comparing Consistent Estimators
    Asymptotic Behavior of Bootstrapping
  Robustness
    Robustness of Mean and Median
    M-estimators
  Hypothesis Testing
    Asymptotic Distribution of LRT
    Wald's Test and Score Test
  Interval Estimation
11 Analysis of Variance and Regression
  One-way ANOVA
    Different ANOVA Hypothesis
    Inference Regarding Linear Combination of Means
    The ANOVA F Test
    Simultaneous Estimation of Contrasts

    Partitioning Sum of Squares
  Simple Linear Regression
    General Model
    Least Square Solution
    Best Linear Unbiased Estimators: BLUE
    Normal Assumptions
12 Regression Models
  Errors in Variables (EIV) Models
    Functional And Structural Relationship
    Mathematical Solution: Orthogonal Least Squares
    Maximum Likelihood Estimation
    Confidence Sets
  Logistic Regression And GLM
    Generalized Linear Model
    Logistic Regression
  Robust Regression
    Huber Loss

1 Probability Theory

We will not discuss elementary set theory and basic probability here. Kolmogorov's axioms define a probability function for a given $\sigma$-algebra.

Bonferroni's inequality is useful for getting a gut estimate of a lower bound for events whose probabilities are hard or impossible to calculate. It is equivalent to Boole's inequality, and is a generalization of:

$$P(A \cap B) \geq P(A) + P(B) - 1$$

Proof: $P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq 1$. Rearranging, we get $P(A \cap B) = P(A) + P(B) - P(A \cup B) \geq P(A) + P(B) - 1$.

Random variables: functions mapping from the sample space to real numbers.

2 Transformations and Expectations

2.1 Transformations of random variables

This section covers functions of random variables and how to derive their cdf from the cdf of the original RV. Say we have a random variable $X$ with a pdf $f_X$ or a cdf $F_X$. What is the distribution of $Y = g(X)$?

2.1.1 Monotonic functions

We can only do this directly for monotonic functions, or at least piecewise monotonic functions. So what do we do when the function $Y = g(X)$ is monotonic? If it is monotonically increasing,

$$F_Y(y) = P(Y \leq y) = P(g(X) \leq y) = P(X \leq g^{-1}(y)) = F_X(g^{-1}(y))$$

If decreasing,

$$F_Y(y) = P(Y \leq y) = P(g(X) \leq y) = P(X \geq g^{-1}(y)) = 1 - F_X(g^{-1}(y))$$

What do you do when you have the pdf, not the cdf, of the original RV? You don't have to go through the above; differentiating it with the chain rule gives the formula below, which takes care of both increasing and decreasing functions:

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|$$

In fact, this is used far more often than the cdf case.

2.1.2 Piecewise monotonic functions

You can also do this when $g$ is piecewise monotonic. It means you should be able to partition $\mathcal{X}$, the domain of the original RV, into contiguous sets so the function is monotonic in each partition. For example, if $g(X) = X^2$, we should split the real line into $(-\infty, 0]$ and $(0, \infty)$. Let us partition $\mathcal{X}$ into multiple subsets $A_1, A_2, \dots$ and let $g_1^{-1}, g_2^{-1}, \dots$ be the inverses of $g$ in each of the intervals. Then, stated without proof,

$$f_Y(y) = \sum_i f_X(g_i^{-1}(y)) \left| \frac{d}{dy} g_i^{-1}(y) \right|$$

which basically sums up the previous formula for each interval.

2.1.3 Probability integral transformation

If $X$ has a cdf $F_X$, and we set $Y = F_X(X)$, then $Y$ is uniformly distributed on $(0, 1)$. This can be understood intuitively: let $F_X(x) = y$. Then $P(Y \leq y) = P(X \leq x) = F_X(x) = y$. This of course assumes monotonicity on $F_X$'s part, which is not always true, but this can be treated technically.
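As a quick numerical sanity check of the probability integral transformation (a sketch assuming NumPy and SciPy are available; the exponential distribution here is just an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Draw from an arbitrary continuous distribution (Exponential(1) here),
# push the draws through their own cdf, and check uniformity on (0, 1).
x = rng.exponential(scale=1.0, size=100_000)
u = stats.expon.cdf(x)                    # Y = F_X(X)

print(u.mean(), u.var())                  # should be near 1/2 and 1/12
print(stats.kstest(u, "uniform"))         # KS test against Uniform(0, 1)
```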

2.2 Expected Values

This section discusses the definition of expected values and their properties. The linearity of expectation arises from integration, which is how EVs are defined:

$$E(A + B) = EA + EB$$

regardless of whether $A$ and $B$ are independent or not.

When you want to take a transformation of a random variable and compute its expected value, you can calculate the EV directly from the definition. Otherwise, you can transform the pdf using the above strategy and go from there.

2.3 Moments and Moment Generating Functions

Definitions:

The $n$th moment of $X$, $\mu_n'$, is defined by $E(X^n)$.

The $n$th central moment of $X$, $\mu_n$, is defined by $E((X - \mu)^n)$.

The moment generating function of $X$ is defined as $M_X(t) = E e^{tX} = \int e^{tx} f_X(x)\, dx$.

Variance $\mathrm{Var}X$ is defined as the second central moment. Of course, we have the following equality as well: $\mathrm{Var}X = EX^2 - (EX)^2$.

A moment generating function can be used to generate moments: we have

$$\frac{d^n}{dt^n} M_X(t) \Big|_{t=0} = EX^n$$

Moment generating functions can be used to identify distributions; if two distributions have the same mgf and all moments exist, they are the same distribution. We discuss some theorems regarding convergence of mgfs to another known mgf to prove convergence of the distribution. It looks like it has more of a theoretical importance, rather than practical.

3 Common Families of Distribution

3.1 Discrete Distributions

Discrete Uniform Distribution
The simplest of sorts.

$$P(X = x \mid N) = \frac{1}{N}, \quad x \in \{1, 2, 3, \dots, N\}$$

$EX = \frac{N+1}{2}$, $\mathrm{Var}X = \frac{(N+1)(N-1)}{12}$

Hypergeometric Distribution
Say we have a population of size $N$, of which $M$ have a desired property. We take a sample of size $K$ (typically $K \leq M$). What is the probability that $x$ of those have this property? The pmf is derived from counting principles, the EV and Var in a manner similar to the binomial distribution: rewrite the sum as a sum of hypergeometric pmfs for a smaller parameter set, which equals 1.

$$P(X = x \mid N, M, K) = \binom{M}{x} \binom{N-M}{K-x} \Big/ \binom{N}{K}$$

$EX = \frac{KM}{N}$, $\mathrm{Var}X = \frac{KM}{N} \cdot \frac{(N-M)(N-K)}{N(N-1)}$

Binomial Distribution

$$P(X = x \mid N, p) = \binom{N}{x} p^x (1-p)^{N-x}$$

$EX = Np$, $\mathrm{Var}X = Np(1-p)$

Poisson Distribution

$$P(X = x \mid \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}$$

$EX = \mathrm{Var}X = \lambda$

When a binomial distribution's $p$ is very small, the distribution can be reasonably approximated by a Poisson distribution with $\lambda = Np$. (The proof uses mgfs.)

Negative Binomial Distribution
What if we want to know the number of Bernoulli trials required to get $r$ successes? Put another way, we are interested in the number of failures $Y$ before the $r$th success.

$$P(Y = y) = (-1)^y \binom{-r}{y} p^r (1-p)^y = \binom{r+y-1}{y} p^r (1-p)^y$$

$EY = \frac{r(1-p)}{p}$ (a simple proof: you can find the expected number of failures before each success and sum them, because linearity. Woohoo!)

$\mathrm{Var}Y = \frac{r(1-p)}{p^2} = \mu + \frac{1}{r}\mu^2$

The negative binomial family includes Poisson as a limiting case (Poisson is also related to the binomial, so this seems natural) but doesn't seem to have a large practical significance.

Geometric Distribution
A special case of the negative binomial distribution with $r = 1$.

$$P(X = x) = p(1-p)^{x-1}$$

$EX = \frac{1}{p}$, $\mathrm{Var}X = \frac{1-p}{p^2}$

Memoryless property: the history so far has no influence on what will happen from now on. So: $P(X > s \mid X > t) = P(X > s - t)$.
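To see the Poisson approximation to the binomial in action, here is a small numerical comparison (a sketch assuming SciPy; the particular n and p are arbitrary):

```python
from scipy import stats

n, p = 1000, 0.005           # large n, small p
lam = n * p

binom = stats.binom(n, p)
pois = stats.poisson(lam)

# Compare the two pmfs near the mean; the agreement should be close.
for x in range(0, 11):
    print(x, round(binom.pmf(x), 5), round(pois.pmf(x), 5))
```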

3.2 Continuous Distributions

Uniform

Gamma
A highly generic and versatile family of distributions. A primer on the gamma function $\Gamma$:

$$\Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t}\, dt$$

which also has a closed form when $\alpha \in \mathbb{N}$. Also, the gamma function serves as a generalization of the factorial, because $\Gamma(\alpha + 1) = \alpha \Gamma(\alpha)$ if $\alpha > 0$. (This makes the gamma function easy to evaluate when we know its values between 0 and 1.) We have $\Gamma(\alpha) = (\alpha - 1)!$ for integers.

The full gamma family has two parameters: $\alpha$, the shape parameter (defines peakedness), and $\beta$, the scale parameter (determines spread). The pdf is

$$f(x \mid \alpha, \beta) = \frac{1}{\Gamma(\alpha)\, \beta^\alpha} x^{\alpha-1} e^{-x/\beta}$$

$EX = \alpha\beta$, $\mathrm{Var}X = \alpha\beta^2$

Being a generic distribution, it is related to many other distributions. Specifically, many distributions are special cases of Gamma.

$P(X \leq x) = P(Y \geq \alpha)$ where $Y \sim \mathrm{Poisson}(x/\beta)$.

When $\alpha = p/2$ ($p$ an integer) and $\beta = 2$, it becomes the chi-squared pdf with $p$ degrees of freedom.

When we set $\alpha = 1$, it becomes the exponential distribution pdf with scale parameter $\beta$. (The exponential distribution is a continuous cousin of the geometric distribution.)

It is also related to the Weibull distribution, which is useful for analyzing failure time data and modeling hazard functions.

Applications: mostly related to lifetime testing, etc.

Normal
Without doubt, the most important distribution. The pdf is

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$$

Proving the pdf integrates to 1 is kind of clunky. $EX = \mu$, $\mathrm{Var}X = \sigma^2$. The 68%-95%-99.7% rule. Thanks to the CLT, the normal distribution pops up everywhere. One example would be using the normal to approximate binomials. If $p$ is not extreme and $n$ is large, the normal gives a good approximation.

Continuity correction: say $Y \sim \mathrm{Binomial}(n, p)$ and $X \sim \mathrm{n}(np, np(1-p))$; we can say $P(a \leq Y \leq b) \approx P(a \leq X \leq b)$. However, $P(a - \frac{1}{2} \leq X \leq b + \frac{1}{2})$ is a much better approximation, which is clear from a graphical representation.
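A quick numerical illustration of the continuity correction (a sketch assuming SciPy; the parameters are arbitrary):

```python
import numpy as np
from scipy import stats

n, p = 40, 0.3
a, b = 10, 15

binom = stats.binom(n, p)
exact = binom.cdf(b) - binom.cdf(a - 1)            # P(a <= Y <= b), Y ~ Binomial(n, p)

norm = stats.norm(loc=n * p, scale=np.sqrt(n * p * (1 - p)))
plain = norm.cdf(b) - norm.cdf(a)                  # no correction
corrected = norm.cdf(b + 0.5) - norm.cdf(a - 0.5)  # with continuity correction

print(exact, plain, corrected)   # corrected should be much closer to exact
```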

Beta
One of the rare distributions which have a domain of $[0, 1]$. Can be used to model proportions. Given that $B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\, dx$,

$$f(x \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha-1} (1-x)^{\beta-1}$$

Since the Beta function can be written in terms of the gamma function, the beta distribution is related to the gamma distribution as well:

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

It can have varying shapes depending on the parameters: unimodal, monotonic, u-shaped, uniform, etc.

$EX = \frac{\alpha}{\alpha+\beta}$, $\mathrm{Var}X = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$

Cauchy
A symmetric, bell-shaped curve with undefined expected value. Therefore, it is mostly used as an extreme case against which we test conjectures. However, it pops up in unexpected circumstances. For example, the ratio of two independent standard normal variables follows the Cauchy distribution.

$$f(x \mid \theta) = \frac{1}{\pi(1 + (x - \theta)^2)}$$

Lognormal
The log of which is the normal distribution. Looks similar to the gamma distribution.

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma x} e^{-(\log x - \mu)^2 / (2\sigma^2)}$$

$EX = e^{\mu + \sigma^2/2}$, $\mathrm{Var}X = e^{2(\mu + \sigma^2)} - e^{2\mu + \sigma^2}$

Double Exponential
Formed by reflecting the exponential distribution around its mean.

3.3 Exponential Families of Distribution

A family of pdfs or pmfs is an exponential family if it can be expressed as:

$$f(x \mid \theta) = h(x)\, c(\theta) \exp\left( \sum_i w_i(\theta)\, t_i(x) \right)$$

Many common families are exponential: normal, gamma, beta, binomial, Poisson, etc. Such a form has some algebraic advantages: there are equalities which provide a shortcut for calculating the first two moments (EV and Var) with differentiation instead of summation/integration! I will not discuss the actual formulas because they are kind of messy.

The section also discusses the natural parameter set of an exponential family, defined via the $w_i(\theta)$ terms in the above definition. Also, the definitions of full and curved exponential families are introduced; not sure about the practical significance.

3.4 Scaling and Location Families

If $f(x)$ is a pdf, then the following function is a pdf as well:

$$g(x \mid \mu, \sigma) = \frac{1}{\sigma} f\left( \frac{x - \mu}{\sigma} \right)$$

where $\mu$ is the location parameter and $\sigma$ is the scaling parameter. The expected value will get translated accordingly, and the variance will be scaled by $\sigma^2$.

3.5 Inequalities and Identities

Chebychev Inequality
If $g(x)$ is a nonnegative function,

$$P(g(X) \geq r) \leq \frac{E g(X)}{r}$$

which looks kind of arbitrary (the units do not match) but is useful sometimes. It sometimes provides useful bounds. For example, let $g(x) = \left(\frac{x - \mu}{\sigma}\right)^2$, which is the number of standard deviations squared (the square is there to make $g$ nonnegative). Letting $r = 2^2$ yields

$$P\left( \left(\frac{X - \mu}{\sigma}\right)^2 \geq 4 \right) \leq \frac{1}{4} \frac{E(X - \mu)^2}{\sigma^2} = \frac{1}{4}$$

which is an upper bound on the probability of being 2 or more standard deviations away from the mean! This did not make any assumption on the distribution of $X$.

Stein's Lemma
With $X \sim \mathrm{n}(\theta, \sigma^2)$ and $g$ that is differentiable and satisfies $E|g'(X)| < \infty$,

$$E[g(X)(X - \theta)] = \sigma^2 E g'(X)$$

Seems obscure, but useful for calculating higher order moments. Also, Wikipedia notes that it is useful in MPT.

4 Multiple Random Variables

4.1 Joint and Marginal Distributions

Mostly trivial stuff. We extend the notion of pmfs/pdfs by adding more variables, which gives us joint pdfs/pmfs. Marginal distributions (the distribution of a subset of variables without referencing the other variables) are introduced. Intuitively, they are compressed versions of joint pdfs/pmfs, obtained by integrating them along a subset of variables.

4.2 Conditional Distributions and Independence

Intuitively, conditional distributions are sliced versions of joint pdfs/pmfs. Some of the random variables are observed; what is the distribution of the remaining variables given the observation? The derivation of the conditional pmf is straightforward for discrete RVs: it is the ratio of the joint pmf and the marginal pmf. This relationship, somewhat surprisingly, holds true for continuous RVs as well.

Two variables $X$ and $Y$ are said to be independent when $f(x, y) = f_X(x) f_Y(y)$. Actually, the converse is true as well; if the joint pdf can be decomposed into a product of two functions, one of $x$ and one of $y$, they are independent.

Consequences of independence:

$E(g(X)h(Y)) = Eg(X)\, Eh(Y)$

Covariance is 0.

You can get the mgf of their sum by multiplying the individual mgfs. This can be used to derive the formula for adding two normal variables.

4.3 Bivariate Transformations

This section mainly discusses strategies for taking transformations (sum, product, division) of two random variables. This is analogous to section 2.1, where we discussed transformations of a single variable.

Problem: You have a random vector $(X, Y)$ and want to know about $U = f(X, Y)$.

Strategy: We have a recipe for transforming a bivariate vector into another. So we transform $(X, Y)$ into $(U, V)$ and take the marginal pdf to get the distribution of $U$. $V$ is chosen so it will be easy to back out $X$ and $Y$ from $U$ and $V$, which is essential in the below recipe.

Recipe: Basically, similar to the transformation recipe in 2.1. However, the derivative of the inverse function is replaced by the Jacobian of the transformation, which is defined as the determinant of the matrix of partial derivatives:

$$J = \frac{\partial x}{\partial u}\frac{\partial y}{\partial v} - \frac{\partial x}{\partial v}\frac{\partial y}{\partial u}$$

Given this, we have

$$f_{U,V}(u, v) = f_{X,Y}(h_1(u, v), h_2(u, v))\, |J|$$

where $g_1(x, y) = u$, $g_2(x, y) = v$, $h_1(u, v) = x$, $h_2(u, v) = y$. Similar to the formula in 2.1, this assumes the transformation is 1-to-1, and thus the inverse exists. When this assumption breaks, we can use the same trick as in 2.1 by breaking the domain into sets such that in each set the transformation is 1:1:

$$f_{U,V}(u, v) = \sum_{i=1}^{k} f_{X,Y}(h_{1i}(u, v), h_{2i}(u, v))\, |J_i|$$

which parallels the piecewise formula in 2.1.2.

4.4 Hierarchical Models and Mixture Distributions

Hierarchical models arise when we model a distribution where a parameter has its own distribution. This is sometimes useful in gaining a deeper understanding of how things work. As an example, say an insect lays a large number of eggs following a Poisson distribution, and each individual egg's survival is a Bernoulli trial. Then the number of surviving insects is modeled as $X \mid Y \sim \mathrm{binomial}(Y, p)$ and $Y \sim \mathrm{Poisson}(\lambda)$.
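A quick simulation of this egg example (a sketch assuming NumPy; the values of λ and p are arbitrary). The marginal mean of X comes out as pλ, which previews the iterated-expectation identity below:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, p, trials = 10.0, 0.3, 200_000

y = rng.poisson(lam, size=trials)     # number of eggs laid, Y ~ Poisson(lambda)
x = rng.binomial(y, p)                # survivors, X | Y ~ Binomial(Y, p)

print(x.mean(), p * lam)              # marginal mean of X is p * lambda
print(x.var(), p * lam)               # in fact X ~ Poisson(p * lambda), so Var matches too
```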

This section also introduces a trivial, but useful equality:

$$EX = E(E(X \mid Y))$$

This is very intuitive if you think about it, but realizing this makes calculations very easy sometimes. A noncentral chi-squared distribution is given as an example. A formula for the variance in a hierarchical model is also given:

$$\mathrm{Var}X = E(\mathrm{Var}(X \mid Y)) + \mathrm{Var}(E(X \mid Y))$$

4.5 Covariance and Correlation

This section introduces covariances and correlations. Important stuff, but it only gives a rather basic treatment. Definitions and identities:

$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = EXY - \mu_X \mu_Y$

$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$

$\mathrm{Var}(aX + bY) = a^2 \mathrm{Var}X + b^2 \mathrm{Var}Y + 2ab\, \mathrm{Cov}(X, Y)$

This section also introduces bivariate normal distributions.

4.6 Multivariate Distributions

A strategy for transforming a vector of random variables is introduced. Since the Jacobian is well defined for larger matrices, the recipe is more or less the same as in the bivariate case.

4.7 Inequalities and Identities

Analogous to chapter 2, we have a section devoted to inequalities and identities. Apparently, all of these inequalities have many forms, being applied in different contexts. Many of these have popped up in the CvxOpt course as well. (Looks like most primitive forms come from the field of mathematical analysis.)

Holder's Inequality
Let $p, q$ be real positive numbers s.t. $\frac{1}{p} + \frac{1}{q} = 1$ and $X, Y$ be RVs. Then we have

$$|EXY| \leq E|XY| \leq (E|X|^p)^{1/p} (E|Y|^q)^{1/q}$$

Cauchy-Schwarz Inequality
A special case of Holder's inequality when $p = q = 2$. Then,

$$|EXY| \leq E|XY| \leq \sqrt{E|X|^2\, E|Y|^2}$$

In vector terms, it means $|x \cdot y| \leq \|x\| \|y\|$. This is intuitive, as taking an inner product gives a chance of things canceling each other out, just like the triangle inequality.

Also notable is that this can be used to prove the range of the correlation; just take $E(X - \mu_X)(Y - \mu_Y)$ and apply the CS inequality. Squaring each side gives

$$(\mathrm{Cov}(X, Y))^2 \leq \sigma_X^2 \sigma_Y^2$$

Minkowski's Inequality
This feels like an additive version of Holder's Inequality.

$$[E|X + Y|^p]^{1/p} \leq [E|X|^p]^{1/p} + [E|Y|^p]^{1/p}$$

Jensen's Inequality
On convex functions. Given a convex function $g$,

$$Eg(X) \geq g(EX)$$

5 Properties of a Random Sample

This chapter deals with several things:

The definition of a random sample.

The distribution of functions of a random sample (statistics) and how they converge as we increase the sample size. And of course, the LLN and the CLT.

Generating random samples.

5.1 Basic Concepts of Random Samples

A random sample $X_1, \dots, X_n$ is a set of independent and identically distributed (iid) RVs. This means we are sampling with replacement from an infinite population. This assumption doesn't always hold, but is a good approximation in a lot of cases.

5.2 Sums of Random Variables from a Random Sample

Sums of random variables can be calculated, of course, using the transformation strategies from Chapter 4. However, since the random variables are iid, the calculation can be simplified greatly. Also, a definition: given a vector- or scalar-valued statistic $Y = T(X_1, X_2, \dots)$, the distribution of $Y$ is called the sampling distribution of $Y$.

5.2.1 Basic Statistics

The two most basic statistics are introduced. The sample mean $\bar{X}$ is defined as

$$\bar{X} = \frac{1}{n} \sum_i X_i$$

The sample variance $S^2$ is defined as

$$S^2 = \frac{1}{n-1} \sum_i (X_i - \bar{X})^2$$

The formula for the sample variance raises some questions; why $n - 1$, not $n$? This was chosen so that $ES^2 = \sigma^2$. If we were to use $n$, $S^2$ would be biased (it would underestimate $\sigma^2$).

5.2.2 Bessel's Correction

Using $n - 1$ instead of $n$ as the denominator in the sample variance is called Bessel's Correction. The legitimacy of this can be proven by taking $ES^2$ and seeing that it equals $\sigma^2$, but the Wikipedia page offers some intuition. By taking the sample mean, we are minimizing the squared error of the samples. So unless the sample mean is equal to the population mean, the sample variance must be smaller than the variance measured using the population mean. Put more intuitively, the sample mean skews towards whatever sample was observed, so the variance will be smaller than the real one.

Another subtle but astounding point mentioned is that, by Jensen's inequality, $S$ is a biased estimator of the standard deviation: it underestimates! Since the square root is a concave function, according to Jensen's inequality we have

$$ES = E\sqrt{S^2} \leq \sqrt{ES^2} = \sqrt{\sigma^2} = \sigma$$

Also, it is noted that there is no general formula for an unbiased estimate of the standard deviation!

5.2.3 Sampling Distributions of Sample Mean and Sample Variance

$E\bar{X} = \mu$: the expected value of the sample mean is the population mean. The LLN will state that it converges to the population mean almost surely as the sample size grows.

$\mathrm{Var}\bar{X} = \frac{\sigma^2}{n}$: this follows from the fact that the random variables are independent.

$$\mathrm{Var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2}\mathrm{Var}\left(\sum X_i\right) = \frac{n\,\mathrm{Var}X_1}{n^2} = \frac{\sigma^2}{n}$$

So the variance decreases inversely with the sample size.

$ES^2 = \sigma^2$: this can be shown algebraically.
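A small simulation backing up 5.2.1 through 5.2.3 (a sketch assuming NumPy; the normal population and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)         # Bessel-corrected sample variance
s2_biased = samples.var(axis=1, ddof=0)  # divide by n instead of n - 1

print(xbar.mean(), mu)                   # E[xbar] = mu
print(xbar.var(), sigma**2 / n)          # Var[xbar] = sigma^2 / n
print(s2.mean(), sigma**2)               # E[S^2] = sigma^2 (unbiased)
print(s2_biased.mean())                  # biased low, roughly (n-1)/n * sigma^2
print(np.sqrt(s2).mean(), sigma)         # E[S] < sigma, as Jensen predicts
```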

5.2.4 Using mgfs to Find Sampling Distributions

We can plug the statistic's definition into the mgf and simplify, praying the mgf will be of a recognizable form. For example, this is how we show the mean of iid normal variables $X_i \sim \mathrm{n}(\mu, \sigma^2)$ follows $\mathrm{n}(\mu, \frac{\sigma^2}{n})$.

5.3 Sampling From the Normal Distribution

More notable facts when we sample from the normal distribution:

$\bar{X}$ and $S^2$ are independent random variables.

$\bar{X}$ has a $\mathrm{n}(\mu, \sigma^2/n)$ distribution (proven by mgfs as noted above).

$(n-1)S^2/\sigma^2$ has a chi-squared distribution with $n - 1$ degrees of freedom.

5.3.1 Student's t-distribution

If $X_i \sim \mathrm{n}(\mu, \sigma^2)$, we know that

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathrm{n}(0, 1)$$

However, in most cases $\mu$ and $\sigma$ are unknown parameters. This makes it hard to make inferences about one of them. We can approximate $\sigma$ by $S$: this gives us another distribution, but it makes it easier to make inferences about $\mu$. So, the statistic

$$\frac{\bar{X} - \mu}{S/\sqrt{n}}$$

is known to follow Student's t-distribution with $n - 1$ degrees of freedom. The concrete distribution can be found using the independence between $\bar{X}$ and $S$, and their respective distributions (normal and chi-squared).

5.3.2 Fisher's F-distribution

Now we are comparing variances between two samples. One way to look at this is the variance ratio. There are two variance ratios: one for the sample, one for the population. Of course, we don't know the population variance ratio. However, the ratio between the two ratios (I smell recursion)

$$\frac{S_X^2 / S_Y^2}{\sigma_X^2 / \sigma_Y^2} = \frac{S_X^2 / \sigma_X^2}{S_Y^2 / \sigma_Y^2}$$

is known to follow the F-distribution. The distribution is found by noting that in the second representation above, both the numerator and the denominator follow (scaled) chi-squared distributions.
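As a sanity check of 5.3.1 (a sketch assuming NumPy/SciPy; the sample size is arbitrary), the studentized mean should match the t-distribution with n - 1 degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, n, reps = 0.0, 3.0, 8, 50_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)
t_stat = (xbar - mu) / (s / np.sqrt(n))

# Kolmogorov-Smirnov test against t with n - 1 degrees of freedom.
print(stats.kstest(t_stat, stats.t(df=n - 1).cdf))
```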

5.4 Order Statistics

The order statistics of a random sample are the sorted sample values. This seems obscure, but is indeed useful where point estimation depends on the minimum/maximum value observed. The pmf/pdf of an order statistic can be derived by noticing the following: say we want to find $P(X_{(j)} \leq x_i)$ ($X_{(j)}$ being the $j$th smallest value). Each random variable is now a Bernoulli trial, with probability $F_X(x_i)$ of being $\leq x_i$. So the cumulative functions can be derived from the binomial distribution of each variable (for discrete) or something similar (continuous). For the continuous case,

$$f_{X_{(j)}}(x) = \frac{n!}{(j-1)!\,(n-j)!} f_X(x) [F_X(x)]^{j-1} [1 - F_X(x)]^{n-j}$$

5.5 Convergence Concepts

The main section of the chapter. Deals with how sampling distributions of some statistics change when we send the sample size to infinity.

5.5.1 Definitions

Say we have a sequence of random variables $\{X_i\}$, $X_n$ being the statistic value when we have $n$ variables. We study the behavior of $\lim_{n \to \infty} X_n$. There are three types of convergence:

Convergence in Probability: when for any $\epsilon > 0$,

$$\lim_{n \to \infty} P(|X_n - X| \geq \epsilon) = 0$$

Almost Sure Convergence: when

$$P\left(\lim_{n \to \infty} |X_n - X| \geq \epsilon\right) = 0$$

which is a much stronger guarantee and implies convergence in probability.

Convergence in Distribution: when the cdfs of $X_n$ converge to the target cdf (convergence of the mgfs to the target mgf is a sufficient condition). This is equivalent to convergence in probability when the target distribution is a constant.

5.5.2 Law of Large Numbers

Given a random sample $\{X_i\}$ where each RV has finite mean, the sample mean converges almost surely to $\mu$, the population mean. There is a weak variant of the LLN, which states that it converges in probability.

5.5.3 Central Limit Theorem

The magical theorem! When we have a sequence of iid RVs $\{X_i\}$ with $EX_i = \mu$ and $\mathrm{Var}X_i = \sigma^2 > 0$, then

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to \mathrm{n}(0, 1) \text{ in distribution as } n \to \infty$$

which also means that, for large $n$, $\bar{X}_n$ approximately follows $\mathrm{n}(\mu, \sigma^2/n)$.

Another useful and intuitive theorem is mentioned:

Slutsky's Theorem: if $Y_n$ converges to a constant $a$ in probability, and $X_n$ converges to $X$ in distribution, then

$X_n Y_n \to aX$ in distribution

$X_n + Y_n \to a + X$ in distribution

Slutsky's theorem is used in proving that the normal approximation with estimated variance goes to the standard normal as well. We know

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to \mathrm{n}(0, 1)$$

and we can prove

$$\frac{\sigma}{S} \to 1 \text{ in probability}$$

Multiplying those yields

$$\frac{\bar{X}_n - \mu}{S/\sqrt{n}} \to \mathrm{n}(0, 1)$$

Marvelous!

5.5.4 Delta Method

The delta method is a generalized version of the CLT. Multiple versions of it are discussed; here I will only state the most obvious univariate case:

$$Y_n \approx \mathrm{n}\left(\mu, \frac{\sigma^2}{n}\right) \implies g(Y_n) \approx \mathrm{n}\left(g(\mu),\ g'(\mu)^2 \frac{\sigma^2}{n}\right) \text{ for large } n$$

When $Y_n$ converges to $\mu$ with a normal distribution as we increase $n$ with virtual certainty, a function of $Y_n$ converges to a normal distribution as well. In the limiting case, both are virtually a constant, so this doesn't surprise me much.
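A quick simulation of the univariate delta method (a sketch assuming NumPy; here g(x) = x squared and an exponential population are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 100_000
mu, sigma2 = 1.0, 1.0            # Exponential(1): mean 1, variance 1

ybar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
g = ybar ** 2                    # g(Y_n) with g(x) = x^2, so g'(x) = 2x

# Delta method prediction: Var[g(Y_n)] ~= g'(mu)^2 * sigma^2 / n
print(g.var(), (2 * mu) ** 2 * sigma2 / n)
print(g.mean(), mu ** 2)         # mean close to g(mu)
```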

6 Principles of Data Reduction

This chapter seems to be highly theoretical. Without much background, it is hard to say why this material is needed. However, it looks like this chapter is closely related with the next chapter on point estimation. That makes sense, because estimating point parameters is essentially data reduction.

6.1 The Sufficiency Principle

6.1.1 Sufficient Statistic

Colloquially, a sufficient statistic captures all information regarding a parameter $\theta$ in a sample $x$; there is no remaining information to be obtained by consulting the actual sample. Formally, a statistic $T(X)$ is a sufficient statistic if the conditional distribution $P_\theta(X \mid T(X))$ does not depend on $\theta$.

The practical upshot of this is that it justifies only reporting means and standard deviations of a given sample; if we assume the population is normal, they are sufficient statistics which contain all the information we can infer about the population. However, remember that these are model dependent; the population might be coming from a different family with different parameters, and those might not be entirely captured by the statistic.

6.1.2 Factorization Theorem

The definition could be used directly to verify whether a given statistic is sufficient or not, but practically the following theorem makes it easier to identify sufficient statistics. If the joint pdf $f(x \mid \theta)$ can be factored into a product of two parts,

$$f(x \mid \theta) = g(T(x) \mid \theta)\, h(x)$$

where $h(x)$ does not depend on $\theta$, then $T(X)$ is a sufficient statistic. This makes intuitive sense; if $T$ is sufficient, the probability of us seeing $x$ is related to only two things: a function of the statistic (given $\theta$) and a function unrelated to $\theta$. If there were a part with another input involved, we wouldn't be able to make inferences about $\theta$ with only $T$.

6.1.3 Minimal Sufficient Statistics

There are lots of sufficient statistics, some more compact than others. So what is a minimal sufficient statistic? A sufficient statistic $T(X)$ is minimal when for any other sufficient statistic $T'(X)$, $T(X)$ is a function of $T'(X)$. In other words, take any sufficient statistic, and we can use it to derive a minimal sufficient statistic. For example, the sample mean $\bar{x}$ is a sufficient statistic for the population mean $\mu$. On the other hand, the sample itself is a sufficient statistic but not minimal: you can derive $\bar{x}$ from the sample, but not vice versa.

6.1.4 Ancillary Statistics

An ancillary statistic contains no information about $\theta$. Paradoxically, an ancillary statistic, when used in conjunction with other statistics, does contain valuable information about $\theta$. The following is a good example; let $X$ have the following discrete distribution:

$$P_\theta(X = \theta) = P_\theta(X = \theta + 1) = P_\theta(X = \theta + 2) = \frac{1}{3}$$

Now, say we observe a sample and take the range of the sample, $R = X_{(n)} - X_{(1)}$. It is obvious that the range itself has no information regarding $\theta$. However, say we know the mid-range statistic as well, $M = (X_{(1)} + X_{(n)})/2$. The mid-range itself can be used to guess $\theta$, but if you combine it with the range, suddenly you can nail the exact $\theta$ if $R = 2$.

Of course, in the above case, the statistic $M$ was not sufficient. What happens when we have a minimal sufficient statistic? Intuition says they should be independent. However, that need not be the case. In fact, the pair $(X_{(1)}, X_{(n)})$ is minimal sufficient and $R$ is closely related to them! So they are not independent at all.

6.1.5 Complete Statistics

For many important situations, however, the minimal sufficient statistic is indeed independent of ancillary statistics. The notion of a complete statistic is introduced, but the definition is very far from intuitive. It goes like this: let $T(X) \sim f(t \mid \theta)$. If

$$\forall \theta,\ E_\theta\, g(T) = 0 \implies \forall \theta,\ P_\theta(g(T) = 0) = 1$$

then $T(X)$ is called a complete statistic. Colloquially, $T(X)$ has to be uncorrelated with all unbiased estimators of 0. So WTF does this mean at all? It looks like we will get more context in the next chapter. The most intuitive explanation I could get was from here. It goes like: intuitively, if a nontrivial function of $T$ has a mean value not dependent on $\theta$, that mean value is not informative about $\theta$, and we could get rid of it to obtain a simpler sufficient statistic. Hopefully reading chapter 7 will give me more background and intuition so I can revisit this.

6.2 The Likelihood Principle

The likelihood function is defined to be

$$L(\theta \mid x) = f(x \mid \theta)$$

We can use likelihood functions to compare the plausibility of various parameter values. Say we don't know the parameter and observed $x$. If for two parameters $\theta_1$ and $\theta_2$ we have $L(\theta_1 \mid x) > L(\theta_2 \mid x)$, we can say $\theta_1$ is more plausible than $\theta_2$. Note we used plausible rather than probable.

The likelihood principle is an important principle used in later chapters on inference: if $x$ and $y$ are two sample points such that $L(\theta \mid x)$ is proportional to $L(\theta \mid y)$, that is, there is a constant $C(x, y)$ such that $L(\theta \mid x) = C(x, y) L(\theta \mid y)$, then the conclusions drawn from $x$ and $y$ should be identical.

7 Point Estimation

This chapter deals with estimating parameters of a population by looking at random samples. The first section discusses strategies for finding estimators; the second section deals with ways to evaluate the estimators.

7.1 Methods of Finding Estimators

7.1.1 Method of Moments for Finding Estimator

Basically, you can equate expected values of arbitrary functions of the random sample with their realized values to estimate parameters. The method of moments uses the first $k$ sample moments to achieve this. Suppose we have $X_i \sim \mathrm{n}(\theta, \sigma^2)$. The first moment would be $\theta$; the second (uncentered) moment would be $\theta^2 + \sigma^2$. So we have a system of equations:

$$\begin{bmatrix} \bar{X} \\ \frac{1}{n}\sum X_i^2 \end{bmatrix} = \begin{bmatrix} \theta \\ \theta^2 + \sigma^2 \end{bmatrix}$$

On the left hand side there are concrete values from the random sample, so solving this is trivial. Also note that solving the above for $\sigma^2$ gives

$$\hat{\sigma}^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2 = \frac{1}{n}\sum (X_i - \bar{X})^2$$

which does not have Bessel's correction. This is an example where the method of moments comes up short; it is a highly general method you can fall back on, but in general it produces relatively inferior estimators. (Footnote: however, this is still a maximum-likelihood estimate.)
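A minimal sketch of these method-of-moments estimates for the normal model (assuming NumPy; the true parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, sigma = 2.0, 1.5
x = rng.normal(theta, sigma, size=10_000)

m1 = x.mean()               # first sample moment
m2 = (x ** 2).mean()        # second (uncentered) sample moment

theta_hat = m1
sigma2_hat = m2 - m1 ** 2   # no Bessel correction; equals the MLE of sigma^2

print(theta_hat, sigma2_hat)   # should be close to 2.0 and 1.5 ** 2 = 2.25
```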

Interesting Example: Estimating Both Parameters
Say we have $X_i \sim \mathrm{binomial}(k, p)$ where both $k$ and $p$ are unknown. This might be an unusual setting, but here's an application: if we are modeling crime rates, we don't know the actual number of crimes committed $k$, nor the reporting rate $p$. In this setting, it is hard to come up with any intuitive formulas, so we could use the method of moments as a baseline. Equating the first two moments, we get:

$$\begin{bmatrix} \bar{X} \\ \frac{1}{n}\sum X_i^2 \end{bmatrix} = \begin{bmatrix} kp \\ kp(1-p) + k^2p^2 \end{bmatrix}$$

Solving this gives us an approximation for $k$:

$$\hat{k} = \frac{\bar{X}^2}{\bar{X} - \frac{1}{n}\sum (X_i - \bar{X})^2}$$

which is a feat in itself!

7.1.2 Maximum Likelihood Estimation

MLE is, by far, the most popular technique for deriving estimators. There are multiple ways to find MLEs. Classic calculus differentiation is one, exploiting model properties is two, and using computer software to maximize the likelihood numerically is three. A couple of things worth noting: using the log-likelihood sometimes makes calculations much easier, so it's a good idea to try taking logs. Another thing is that the MLE can be unstable: if inputs change slightly, the MLE can change dramatically. (I wonder if there's any popular regularization scheme for MLEs. Maybe you can still use ridge-style regularization mixed with cross validation.)

There's a very useful property of maximum likelihood estimators called the invariance property, which says that if $\hat{\theta}$ is an MLE for $\theta$, then $r(\hat{\theta})$ is an MLE for $r(\theta)$. This is very useful to have, to say the least.

7.1.3 Bayes Estimators

The Bayesian approach to statistics is fundamentally different from the classical frequentist approach, because it treats the parameters as having distributions themselves. In this approach, we have a prior distribution of $\theta$, which is our belief before seeing the random sample. After we see some samples from the population, our belief is updated to the posterior distribution following this formula:

$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta}$$

which is just the Bayes formula.

Conjugate Families
In general, for any sampling distribution, there is a natural family of prior distributions, called the conjugate family. For example, the binomial family's conjugate family is the beta family. Let's say we have a $\mathrm{binomial}(n, p)$ distribution where only $p$ is unknown. If our prior distribution of $p$ is $\mathrm{beta}(\alpha, \beta)$, then after observing a sample $y$ our posterior distribution updates as $\mathrm{beta}(y + \alpha, n - y + \beta)$. So classy and elegant! However, it is debatable whether a conjugate family should always be used. Also, normals are their own conjugates.
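A small sketch of this beta-binomial update (assuming NumPy/SciPy; the prior and data are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

alpha, beta = 2.0, 2.0        # prior: p ~ Beta(2, 2)
n, p_true = 50, 0.3
y = rng.binomial(n, p_true)   # observed number of successes

posterior = stats.beta(y + alpha, n - y + beta)   # Beta(y + alpha, n - y + beta)
print(y, posterior.mean(), posterior.interval(0.95))
```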

7.1.4 The EM Algorithm

The EM algorithm is useful when the likelihood function cannot be directly maximized, due to missing observations or latent variables which are not observed. The setup goes like this: given a parameter set $\theta$, there are two random variables $x$ and $y$, of which only $y$ is observed. The marginal distribution for $y$, $g(y \mid \theta)$, is unknown; however, we do know $f(y, x \mid \theta)$. Since $x$ is not observed (or incomplete), it is hard to estimate $\theta$ from the given $y$. The EM algorithm solves this with an iterative approach. We start with a guess of the parameter $\theta^{(0)}$, and the following two steps are repeated:

Expectation step: Given the latest guess for the parameter $\theta^{(i)}$ and the observed $y$, find the expected $x$.

Maximization step: Given the $x$ and $y$, find the parameter $\theta^{(i+1)}$ that is most plausible for this pair of observations.

It can be proven that the likelihoods from successive iterations of parameters are nondecreasing, and will eventually converge.

EM Algorithm, Formally
The above two steps are just colloquial descriptions; the actual expected $x$ is not directly calculated. In the expectation step, the following function is created:

$$Q(\theta \mid \theta^{(t)}) = E_{X \mid Y, \theta^{(t)}}[\log L(\theta; X, Y)]$$

Let's see what this function is trying to do. Since we have a guess for the parameter $\theta^{(t)}$, we can now plug this into the joint pdf and find the distribution of the pair $(x, y)$. For each concrete pair of observations, we can calculate the log-likelihood of $\theta$ using this observation. $Q$ is simply a weighted average of these likelihoods. $Q$ is a function of $\theta$, and answers this question: let's say $(x, y) \sim f(x, y \mid \theta^{(t)})$. Now what is the EV of the log likelihood for a given $\theta$? The maximization step then puts this function through the argmax operator.
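As a concrete (if simplified) illustration of these two steps, here is a sketch of EM for a two-component Gaussian mixture with known, equal variances. This example is mine, not the book's, and it assumes NumPy; the latent x is the component label and the observed y are the data points:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data: the component labels are hidden from the estimator.
n = 5000
labels = rng.random(n) < 0.3
y = np.where(labels, rng.normal(-2.0, 1.0, n), rng.normal(3.0, 1.0, n))

# Parameters theta = (pi, mu1, mu2); component variances fixed at 1.
pi_, mu1, mu2 = 0.5, -1.0, 1.0

def normal_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

for step in range(100):
    # E step: posterior probability that each point came from component 1,
    # given the current parameter guess (the "expected x").
    w1 = pi_ * normal_pdf(y, mu1)
    w2 = (1 - pi_) * normal_pdf(y, mu2)
    r = w1 / (w1 + w2)

    # M step: maximize the expected complete-data log-likelihood.
    pi_ = r.mean()
    mu1 = (r * y).sum() / r.sum()
    mu2 = ((1 - r) * y).sum() / (1 - r).sum()

print(pi_, mu1, mu2)   # should approach 0.3, -2.0 and 3.0 (up to label order)
```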

7.2 Methods of Evaluating Estimators

This section is much more involved than the previous section; it discusses multiple strategies for evaluating estimators, and also discusses some strategies to improve existing estimators.

7.2.1 Mean Squared Error

A simple, intuitive way of measuring the quality of an estimator $W$ is taking the MSE:

$$E_\theta(W - \theta)^2$$

Note this is a function of the parameter $\theta$. Also, the MSE has a very intuitive decomposition:

$$E_\theta(W - \theta)^2 = \mathrm{Var}_\theta W + (E_\theta W - \theta)^2$$

The first term is the variance of the estimator, while the second term is the square of the bias. Variance represents the precision of the estimator; if the variance is high, even if the estimator works on average, it cannot be trusted. If the bias is high, it will point in a wrong direction. So this tradeoff is a topic that is discussed throughout this chapter.

Quantifying Bessel's Correction
We know $S^2 = \sum (X_i - \bar{X})^2 / (n-1)$ is an unbiased estimator of $\sigma^2$, thus the bias is 0. However, we also know $\frac{n-1}{n} S^2$ is the maximum likelihood estimator. Which of these is better in terms of MSE? Carrying out the calculations tells you the MLE will give you a smaller MSE; so by accepting some bias, we could reduce the variance and the overall MSE. Which estimator we should ultimately use is still debatable.

No Universal Winner
Since MSEs are functions of parameters, it often is the case that the MSEs of two estimators are incomparable, because one can outperform another on a certain parameter set, but underperform on another. In the text, this is shown with an example of estimating $p$ from $\mathrm{binomial}(n, p)$ where $n$ is known. Two approaches are showcased: one is just using the MLE $\sum X_i / n$, the other is taking a Bayesian estimate under a suitable beta prior. If you plot the MSE against different values of $p$, the MSE from the MLE is a quadratic curve, but the MSE from the Bayesian estimate is a horizontal line. Therefore, one doesn't dominate the other; plotting for different $n$ reveals how the comparison shifts as we increase $n$.

7.2.2 Best Unbiased Estimator

One way of solving the bias-variance tradeoff problem is to ignore some choices! We can restrict ourselves to unbiased estimators and try to minimize variance. In this setting, the best estimator is called a uniform minimum variance unbiased estimator (UMVUE). Finding a UMVUE is a hard problem; it would be hard to prove that a given estimator is the global minimum.

Finding Lower Bounds
The Cramer-Rao Theorem sometimes helps; it is an application of the Cauchy-Schwarz inequality which provides a lower bound for the variance of an unbiased estimator. So if this limit is reached, we can say for sure we have a UMVUE. This is indeed attainable for some distributions, such as the Poisson. The big honking problem of the CR theorem is that the bound is not always sharp: sometimes the bound is not attainable. Since CR is an application of the CS inequality, we can state the condition under which the equality will hold. This can either give us hints about the shape of the UMVUE, or let us prove it is unreachable.
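A small numerical check that the sample mean attains the Cramer-Rao bound in the Poisson case (a sketch assuming NumPy; for Poisson(λ), the bound for unbiased estimators of λ is λ/n):

```python
import numpy as np

rng = np.random.default_rng(8)
lam, n, reps = 4.0, 30, 200_000

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)

cr_bound = lam / n            # 1 / (n * I(lambda)), with Fisher information I = 1 / lambda
print(xbar.var(), cr_bound)   # the variance of this unbiased estimator hits the bound
```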

Sufficiency and Completeness for UMVUE
Somewhat surprising and arbitrary, but sufficient or complete statistics can be used to improve unbiased estimators, by conditioning the estimator on the sufficient statistic! The Rao-Blackwell theorem specifies that if $W$ is any unbiased estimator of an arbitrary function of $\theta$, $\tau(\theta)$, the following function is a uniformly better unbiased estimator:

$$\phi(T) = E(W \mid T)$$

This was a WTF moment for me. However, note that

$$\mathrm{Var}_\theta W = \mathrm{Var}_\theta[E(W \mid T)] + E[\mathrm{Var}(W \mid T)] \geq \mathrm{Var}_\theta[E(W \mid T)]$$

for any $T$: you can condition on any statistic, even a totally irrelevant one, and it can never actually hurt. Then why the requirement for sufficiency? If $T$ is not sufficient, the conditioned estimator can become a function of $\theta$, so it isn't exactly an estimator.

Can we go further than that? Can we create a UMVUE from sufficient statistics? We analyze the properties of a best estimator to find out. Suppose we have an estimator $W$ satisfying $E_\theta W = \tau(\theta)$. Also say we have an unbiased estimator of 0, $U$. Then the following estimator

$$\phi_a = W + aU$$

is still an unbiased estimator of $\tau(\theta)$. What is its variance?

$$\mathrm{Var}_\theta \phi_a = \mathrm{Var}_\theta(W + aU) = \mathrm{Var}_\theta W + a^2 \mathrm{Var}_\theta U + 2a\, \mathrm{Cov}_\theta(W, U)$$

Now, if we can concoct a $U$ such that $\mathrm{Cov}_\theta(W, U) < 0$, then for a small enough $a > 0$ we get $\mathrm{Var}_\theta \phi_a < \mathrm{Var}_\theta W$. Note that $U$ is essentially random noise for estimation, yet it can give us an improvement. This doesn't make sense at all. A theorem actually brings order, by stating that $W$ is the UMVUE iff it is uncorrelated with all unbiased estimators of 0.

Wait, that condition sounds familiar. Recall the definition of complete statistics? Now we have to guess they are somewhat related. They indeed are, by a further theorem, which states: let $T$ be a complete sufficient statistic for a parameter $\theta$, and let $\phi(T)$ be any estimator based only on $T$; then $\phi(T)$ is the unique best unbiased estimator of its expected value.
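A simulation sketch of Rao-Blackwellization (my example, not the book's; assuming NumPy). For a Poisson(λ) sample, the crude unbiased estimator of P(X = 0) = e^{-λ} is the indicator 1{X₁ = 0}; conditioning it on the sufficient statistic T = ΣXᵢ gives E(W | T) = ((n-1)/n)^T, which has much lower variance:

```python
import numpy as np

rng = np.random.default_rng(9)
lam, n, reps = 2.0, 10, 200_000

x = rng.poisson(lam, size=(reps, n))

w = (x[:, 0] == 0).astype(float)          # crude unbiased estimator 1{X1 = 0}
t = x.sum(axis=1)                         # sufficient statistic T = sum of X_i
rb = ((n - 1) / n) ** t                   # Rao-Blackwellized estimator E(W | T)

target = np.exp(-lam)                     # tau(lambda) = P(X = 0)
print(target, w.mean(), rb.mean())        # both are unbiased
print(w.var(), rb.var())                  # conditioned version has much smaller variance
```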

7.2.3 General Loss Functions

MSE is a special case of a loss function, so we can generalize from that. Absolute error loss and squared error loss are the most common loss functions, but you can come up with different styles, of course. Evaluating an estimator on a given loss function $L(\theta, a)$ is done via the risk function:

$$R(\theta, \delta) = E_\theta L(\theta, \delta(X))$$

which is still a function of the parameter. Also note that when we use the squared error loss, $R$ becomes the MSE.

Stein's Loss Function
For scale parameters such as $\sigma^2$, the values are bounded below at 0. Therefore, a symmetric loss function such as MSE effectively penalizes overestimation more heavily, because the penalty can grow infinitely only in that direction. An interesting loss function, which tries to overcome this problem, is introduced:

$$L(\sigma^2, a) = \frac{a}{\sigma^2} - 1 - \log \frac{a}{\sigma^2}$$

which goes to infinity as $a$ goes to either 0 or infinity. Say we want to estimate $\sigma^2$ using an estimator $\delta_b = bS^2$ for a constant $b$. Then

$$R(\sigma^2, \delta_b) = E\left( \frac{bS^2}{\sigma^2} - 1 - \log \frac{bS^2}{\sigma^2} \right) = b - \log b - 1 - E \log \frac{S^2}{\sigma^2}$$

which is minimized at $b = 1$.

Bayes Risk
Instead of treating the risk as a function of $\theta$, we can take a prior distribution on $\theta$ and compute the expected value. This takes us to the Bayes risk:

$$\int_\Theta R(\theta, \delta)\, \pi(\theta)\, d\theta$$

Finding the estimator $\delta$ which minimizes the Bayes risk seems daunting but is tractable. Note:

$$\int_\Theta R(\theta, \delta)\, \pi(\theta)\, d\theta = \int_\Theta \left( \int_\mathcal{X} L(\theta, \delta(x))\, f(x \mid \theta)\, dx \right) \pi(\theta)\, d\theta = \int_\mathcal{X} \left[ \int_\Theta L(\theta, \delta(x))\, \pi(\theta \mid x)\, d\theta \right] m(x)\, dx$$

The quantity in square brackets, which is the expected loss given an observation, is called the posterior expected loss. It is a function of $x$, not $\theta$. So we can minimize the posterior expected loss for each $x$ and thereby minimize the Bayes risk. A general recipe for doing this is not available, but the book contains some examples.

8 Hypothesis Testing

This chapter deals with ways of testing hypotheses, which are statements about the population parameter.

8.1 Terminology

In all testing, we use two hypotheses: the null hypothesis $H_0$ and the alternative hypothesis $H_1$, one of which we will accept and the other reject. We can always formulate the hypotheses as two sets: $H_0$ is $\theta \in \Theta_0$ and $H_1$ is $\theta \in \Theta_0^C$. We decide which hypothesis to accept on the basis of a random sample. The subset of the sample space which will make us reject $H_0$ is called the rejection region. Typically, a hypothesis test is performed by calculating a test statistic and rejecting/accepting $H_0$ if the statistic falls into a specific set.

8.2 Methods of Finding Tests

We discuss three methods of constructing tests.

8.2.1 Likelihood Ratio Tests

Likelihood ratio tests are very widely applicable, and are also optimal in some cases. We can prove the t-test is a special case of the LRT. The LRT calculates the ratio between the maximum likelihood given the null hypothesis and the overall maximum likelihood. The test statistic $\lambda$ is defined as:

$$\lambda(x) = \frac{\sup_{\Theta_0} L(\theta \mid x)}{\sup_{\Theta} L(\theta \mid x)} = \frac{\sup_{\Theta_0} L(\theta \mid x)}{L(\hat{\theta} \mid x)} \quad (\hat{\theta} \text{ is the MLE of } \theta)$$

When do we reject $H_0$? We want to reject the null hypothesis when the alternative hypothesis is much more plausible; so the smaller $\lambda$ is, the more likely we are to reject $H_0$. Therefore, the rejection region is $\{x \mid \lambda(x) \leq c\}$ where $0 \leq c \leq 1$. Also note that a $\lambda$ based on a sufficient statistic of $\theta$ gives a test equivalent to the regular $\lambda$.

Example: Normal LRT for testing θ = θ0, known variance
Setup: we are drawing a sample from a $\mathrm{n}(\theta, 1)$ population. $H_0: \theta = \theta_0$ and $H_1: \theta \neq \theta_0$. Since the MLE is $\bar{x}$, the LRT statistic is:

$$\lambda(x) = \frac{f(x \mid \theta_0)}{f(x \mid \bar{x})} = \text{(some simplification)} = \exp\left[ -n(\bar{x} - \theta_0)^2 / 2 \right]$$

Say we reject $H_0$ when $\lambda \leq c$; we can rewrite this condition as

$$|\bar{x} - \theta_0| \geq \sqrt{-2(\log c)/n}$$

Therefore, we are simply comparing the difference between the sample mean and the asserted mean against a positive threshold.
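A small sketch of this LRT for the known-variance normal example (assuming NumPy; θ0, the data, and the cutoff c are arbitrary). It also confirms that thresholding λ is the same decision as thresholding |x̄ - θ0|:

```python
import numpy as np

rng = np.random.default_rng(10)
theta0, n, c = 0.0, 25, 0.1

x = rng.normal(0.4, 1.0, size=n)        # data actually drawn with theta = 0.4
xbar = x.mean()

lam = np.exp(-n * (xbar - theta0) ** 2 / 2)       # LRT statistic lambda(x)
threshold = np.sqrt(-2 * np.log(c) / n)           # equivalent cutoff on |xbar - theta0|

print(lam, lam <= c)
print(abs(xbar - theta0), threshold, abs(xbar - theta0) >= threshold)
# The two decisions always agree.
```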

Example: Normal LRT for testing µ ≤ µ0, unknown variance
Setup: we are drawing from $\mathrm{n}(\mu, \sigma^2)$ where both parameters are unknown. $H_0: \mu \leq \mu_0$ and $H_1: \mu > \mu_0$. The unrelated parameter $\sigma^2$ is called a nuisance parameter. It can be shown (from exercise 8.37) that the test relying on the resulting LRT statistic is equivalent to Student's t test.

8.2.2 Bayesian Tests

Bayesian tests obtain the posterior distribution from the prior distribution and the sample, using the Bayesian estimation technique discussed in Chapter 7, and then use the posterior distribution to run the test. We might choose to reject $H_0$ only if $P(\theta \in \Theta_0^C \mid X)$ is greater than some large number.

8.2.3 Union-Intersection and Intersection-Union Tests

These are ways of structuring the rejection region when the null hypothesis is not simple; for example, when the null hypothesis region $\Theta_0$ is best expressed as an intersection or a union of multiple sets. If we have tests for each of the individual sets that constitute the union/intersection, how can we test the entire hypothesis?

When the null hypothesis is an intersection of sets (union-intersection test), rejecting any one of the component nulls results in the rejection of the overall null hypothesis. When the null hypothesis is a union of sets (intersection-union test), the overall null hypothesis is rejected only when every one of the component nulls is rejected.

8.3 Evaluating Tests

How do we assess the goodness of tests?

8.3.1 Types of Errors and Power Function

Since the null hypothesis could be true or not, and we can either accept or reject it, there are four possibilities in a hypothesis test, which can be summarized in the table below:

                 Accept H0        Reject H0
  H0 true        Correct          Type I Error
  H1 true        Type II Error    Correct

To reiterate:

Type I Error: incorrectly rejecting $H_0$ when $H_0$ is true. Since we usually want to prove the alternative hypothesis, this is a false positive error: we assert $H_1$ when it is not true. Often this is the error people want to avoid more.
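A simulation sketch of the two error types for the known-variance normal test from the earlier example (assuming NumPy/SciPy; the size is set to 0.05 and the alternative mean is an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
theta0, n, reps = 0.0, 25, 100_000
crit = stats.norm.ppf(0.975) / np.sqrt(n)   # reject when |xbar - theta0| exceeds this

# Type I error: data generated under H0 (theta = theta0).
xbar_h0 = rng.normal(theta0, 1.0, size=(reps, n)).mean(axis=1)
print((np.abs(xbar_h0 - theta0) > crit).mean())   # roughly 0.05

# Type II error: data generated under a specific alternative (theta = 0.3).
xbar_h1 = rng.normal(0.3, 1.0, size=(reps, n)).mean(axis=1)
print((np.abs(xbar_h1 - theta0) <= crit).mean())  # probability of failing to reject
```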

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

t x 1 e t dt, and simplify the answer when possible (for example, when r is a positive even number). In particular, confirm that EX 4 = 3.

t x 1 e t dt, and simplify the answer when possible (for example, when r is a positive even number). In particular, confirm that EX 4 = 3. Mathematical Statistics: Homewor problems General guideline. While woring outside the classroom, use any help you want, including people, computer algebra systems, Internet, and solution manuals, but mae

More information

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:

More information

Probability and Distributions

Probability and Distributions Probability and Distributions What is a statistical model? A statistical model is a set of assumptions by which the hypothetical population distribution of data is inferred. It is typically postulated

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

Lectures on Statistics. William G. Faris

Lectures on Statistics. William G. Faris Lectures on Statistics William G. Faris December 1, 2003 ii Contents 1 Expectation 1 1.1 Random variables and expectation................. 1 1.2 The sample mean........................... 3 1.3 The sample

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1 TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1 1.1 The Probability Model...1 1.2 Finite Discrete Models with Equally Likely Outcomes...5 1.2.1 Tree Diagrams...6 1.2.2 The Multiplication Principle...8

More information

Review of Probabilities and Basic Statistics

Review of Probabilities and Basic Statistics Alex Smola Barnabas Poczos TA: Ina Fiterau 4 th year PhD student MLD Review of Probabilities and Basic Statistics 10-701 Recitations 1/25/2013 Recitation 1: Statistics Intro 1 Overview Introduction to

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Review of Basic Probability The fundamentals, random variables, probability distributions Probability mass/density functions

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Chapter 5. Chapter 5 sections

Chapter 5. Chapter 5 sections 1 / 43 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions

More information

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416)

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) D. ARAPURA This is a summary of the essential material covered so far. The final will be cumulative. I ve also included some review problems

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources STA 732: Inference Notes 10. Parameter Estimation from a Decision Theoretic Angle Other resources 1 Statistical rules, loss and risk We saw that a major focus of classical statistics is comparing various

More information

This does not cover everything on the final. Look at the posted practice problems for other topics.

This does not cover everything on the final. Look at the posted practice problems for other topics. Class 7: Review Problems for Final Exam 8.5 Spring 7 This does not cover everything on the final. Look at the posted practice problems for other topics. To save time in class: set up, but do not carry

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

STATISTICS SYLLABUS UNIT I

STATISTICS SYLLABUS UNIT I STATISTICS SYLLABUS UNIT I (Probability Theory) Definition Classical and axiomatic approaches.laws of total and compound probability, conditional probability, Bayes Theorem. Random variable and its distribution

More information

1 Appendix A: Matrix Algebra

1 Appendix A: Matrix Algebra Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix

More information

Math Review Sheet, Fall 2008

Math Review Sheet, Fall 2008 1 Descriptive Statistics Math 3070-5 Review Sheet, Fall 2008 First we need to know about the relationship among Population Samples Objects The distribution of the population can be given in one of the

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

3 Continuous Random Variables

3 Continuous Random Variables Jinguo Lian Math437 Notes January 15, 016 3 Continuous Random Variables Remember that discrete random variables can take only a countable number of possible values. On the other hand, a continuous random

More information

Chapter 2. Discrete Distributions

Chapter 2. Discrete Distributions Chapter. Discrete Distributions Objectives ˆ Basic Concepts & Epectations ˆ Binomial, Poisson, Geometric, Negative Binomial, and Hypergeometric Distributions ˆ Introduction to the Maimum Likelihood Estimation

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics MAS 713 Chapter 8 Previous lecture: 1 Bayesian Inference 2 Decision theory 3 Bayesian Vs. Frequentist 4 Loss functions 5 Conjugate priors Any questions? Mathematical Statistics

More information

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Probability quantifies randomness and uncertainty How do I estimate the normalization and logarithmic slope of a X ray continuum, assuming

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the

More information

Statistics 100A Homework 5 Solutions

Statistics 100A Homework 5 Solutions Chapter 5 Statistics 1A Homework 5 Solutions Ryan Rosario 1. Let X be a random variable with probability density function a What is the value of c? fx { c1 x 1 < x < 1 otherwise We know that for fx to

More information

Statistical Distribution Assumptions of General Linear Models

Statistical Distribution Assumptions of General Linear Models Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Contents 1. Contents

Contents 1. Contents Contents 1 Contents 6 Distributions of Functions of Random Variables 2 6.1 Transformation of Discrete r.v.s............. 3 6.2 Method of Distribution Functions............. 6 6.3 Method of Transformations................

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Nonparametric hypothesis tests and permutation tests

Nonparametric hypothesis tests and permutation tests Nonparametric hypothesis tests and permutation tests 1.7 & 2.3. Probability Generating Functions 3.8.3. Wilcoxon Signed Rank Test 3.8.2. Mann-Whitney Test Prof. Tesler Math 283 Fall 2018 Prof. Tesler Wilcoxon

More information

Week 2: Review of probability and statistics

Week 2: Review of probability and statistics Week 2: Review of probability and statistics Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ALL RIGHTS RESERVED

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 2013 Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 27 Course arrangements Lectures M.2

More information

Probability. Table of contents

Probability. Table of contents Probability Table of contents 1. Important definitions 2. Distributions 3. Discrete distributions 4. Continuous distributions 5. The Normal distribution 6. Multivariate random variables 7. Other continuous

More information

Bias Variance Trade-off

Bias Variance Trade-off Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Interval Estimation. Chapter 9

Interval Estimation. Chapter 9 Chapter 9 Interval Estimation 9.1 Introduction Definition 9.1.1 An interval estimate of a real-values parameter θ is any pair of functions, L(x 1,..., x n ) and U(x 1,..., x n ), of a sample that satisfy

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get 18:2 1/24/2 TOPIC. Inequalities; measures of spread. This lecture explores the implications of Jensen s inequality for g-means in general, and for harmonic, geometric, arithmetic, and related means in

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

6.1 Moment Generating and Characteristic Functions

6.1 Moment Generating and Characteristic Functions Chapter 6 Limit Theorems The power statistics can mostly be seen when there is a large collection of data points and we are interested in understanding the macro state of the system, e.g., the average,

More information

Review for the previous lecture

Review for the previous lecture Lecture 1 and 13 on BST 631: Statistical Theory I Kui Zhang, 09/8/006 Review for the previous lecture Definition: Several discrete distributions, including discrete uniform, hypergeometric, Bernoulli,

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics October 17, 2017 CS 361: Probability & Statistics Inference Maximum likelihood: drawbacks A couple of things might trip up max likelihood estimation: 1) Finding the maximum of some functions can be quite

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Basic Concepts of Inference

Basic Concepts of Inference Basic Concepts of Inference Corresponds to Chapter 6 of Tamhane and Dunlop Slides prepared by Elizabeth Newton (MIT) with some slides by Jacqueline Telford (Johns Hopkins University) and Roy Welsch (MIT).

More information

Robustness and Distribution Assumptions

Robustness and Distribution Assumptions Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation. PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.. Beta Distribution We ll start by learning about the Beta distribution, since we end up using

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

Theory of Maximum Likelihood Estimation. Konstantin Kashin

Theory of Maximum Likelihood Estimation. Konstantin Kashin Gov 2001 Section 5: Theory of Maximum Likelihood Estimation Konstantin Kashin February 28, 2013 Outline Introduction Likelihood Examples of MLE Variance of MLE Asymptotic Properties What is Statistical

More information

Introduction to Bayesian Statistics

Introduction to Bayesian Statistics Bayesian Parameter Estimation Introduction to Bayesian Statistics Harvey Thornburg Center for Computer Research in Music and Acoustics (CCRMA) Department of Music, Stanford University Stanford, California

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Definitions Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Expectation. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Expectation. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Expectation DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Aim Describe random variables with a few numbers: mean, variance,

More information

This exam is closed book and closed notes. (You will have access to a copy of the Table of Common Distributions given in the back of the text.

This exam is closed book and closed notes. (You will have access to a copy of the Table of Common Distributions given in the back of the text. TEST #3 STA 5326 December 4, 214 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. (You will have access to

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Probability Distributions Columns (a) through (d)

Probability Distributions Columns (a) through (d) Discrete Probability Distributions Columns (a) through (d) Probability Mass Distribution Description Notes Notation or Density Function --------------------(PMF or PDF)-------------------- (a) (b) (c)

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics Chapter Three. Point Estimation 3.4 Uniformly Minimum Variance Unbiased Estimator(UMVUE) Criteria for Best Estimators MSE Criterion Let F = {p(x; θ) : θ Θ} be a parametric distribution

More information

Testing Hypothesis. Maura Mezzetti. Department of Economics and Finance Università Tor Vergata

Testing Hypothesis. Maura Mezzetti. Department of Economics and Finance Università Tor Vergata Maura Department of Economics and Finance Università Tor Vergata Hypothesis Testing Outline It is a mistake to confound strangeness with mystery Sherlock Holmes A Study in Scarlet Outline 1 The Power Function

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions

More information

Likelihood-Based Methods

Likelihood-Based Methods Likelihood-Based Methods Handbook of Spatial Statistics, Chapter 4 Susheela Singh September 22, 2016 OVERVIEW INTRODUCTION MAXIMUM LIKELIHOOD ESTIMATION (ML) RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION (REML)

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables Joint Probability Density Let X and Y be two random variables. Their joint distribution function is F ( XY x, y) P X x Y y. F XY ( ) 1, < x

More information

Continuous Random Variables

Continuous Random Variables Continuous Random Variables Recall: For discrete random variables, only a finite or countably infinite number of possible values with positive probability. Often, there is interest in random variables

More information

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006 Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)

More information

Probability Background

Probability Background Probability Background Namrata Vaswani, Iowa State University August 24, 2015 Probability recap 1: EE 322 notes Quick test of concepts: Given random variables X 1, X 2,... X n. Compute the PDF of the second

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007 MIT OpenCourseWare http://ocw.mit.edu HST.582J / 6.555J / 16.456J Biomedical Signal and Image Processing Spring 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

1. Fisher Information

1. Fisher Information 1. Fisher Information Let f(x θ) be a density function with the property that log f(x θ) is differentiable in θ throughout the open p-dimensional parameter set Θ R p ; then the score statistic (or score

More information

Probability. Lecture Notes. Adolfo J. Rumbos

Probability. Lecture Notes. Adolfo J. Rumbos Probability Lecture Notes Adolfo J. Rumbos October 20, 204 2 Contents Introduction 5. An example from statistical inference................ 5 2 Probability Spaces 9 2. Sample Spaces and σ fields.....................

More information

1 Exercises for lecture 1

1 Exercises for lecture 1 1 Exercises for lecture 1 Exercise 1 a) Show that if F is symmetric with respect to µ, and E( X )

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information