Reading Material for Students

Arnab Adhikari
Indian Institute of Management Calcutta, Joka, Kolkata 700104, India, arnaba1@email.iimcal.ac.in

Indranil Biswas
Indian Institute of Management Lucknow, Prabandh Nagar, Lucknow 226013, India, indranil@iiml.ac.in

Arnab Bisi
Johns Hopkins Carey Business School, 100 International Drive, Baltimore, Maryland 21202, abisi1@jhu.edu

Probability Distributions

Binomial distribution. The binomial distribution describes the probability of exactly $x$ successes in $N$ trials; the probability of a success in a single trial is $p$ and that of a failure is $1 - p$ (also designated by $q$). The probability mass function (pmf) of this distribution is

$$p(x; N, p) = \binom{N}{x} p^{x} (1 - p)^{N - x},$$

where the variable $x$ and the parameter $N$ are integers satisfying $0 \le x \le N$ and $N > 0$. The parameter $p$ is a real quantity with $p \in [0, 1]$. The expected value and the variance of a random variable $X$ having a binomial distribution are

$$E(X) = Np \quad \text{and} \quad \mathrm{Var}(X) = Np(1 - p).$$

Hypergeometric distribution. The hypergeometric distribution describes an experiment in which, out of a total of $N$ elements, $M$ possess a certain attribute [and the remaining $(N - M)$ do not]; if we then choose $n$ elements at random without replacement, $f(x; n, N, M)$ gives the probability that exactly $x$ of the selected $n$ elements come from the group of $M$ elements that possess the attribute. Let the number of selected elements with that attribute be denoted by $X$. The pmf of $X$ under the hypergeometric distribution is given by

$$f(x; n, N, M) = \frac{\binom{M}{x} \binom{N - M}{n - x}}{\binom{N}{n}},$$

where $x$ is discrete and its range is given by $x \in [\max(0, n - N + M), \min(n, M)]$. The parameters $n$, $N$ and $M$ are all integers and satisfy the following conditions: $1 \le n \le N$, $N \ge 1$,
and $M \ge 1$. Let the probability of success be represented by $p = M/N$. Then, the expected value and the variance of $X$ under the hypergeometric distribution can be expressed as follows:

$$E(X) = np \quad \text{and} \quad \mathrm{Var}(X) = np(1 - p)\,\frac{N - n}{N - 1}.$$

In real life, when a marketing group tries to understand its customer base by testing a set of known customers for over-representation of various demographic subgroups, it uses the hypergeometric test, which is based on the hypergeometric distribution.

Negative Binomial distribution. The negative binomial distribution (also known as the Pascal distribution) gives the probability that exactly $x$ trials are needed until the $k$-th success occurs. Let the number of trials needed to obtain the $k$-th success be denoted by $X$. Here $p$ and $q\,(= 1 - p)$ designate the probability of a success and of a failure in a single trial, respectively. The pmf of this distribution is given by

$$f(x; k, p) = \binom{x - 1}{k - 1} p^{k} (1 - p)^{x - k},$$

where the variable $x$ and the parameter $k$ are integers and satisfy the condition $x \ge k > 0$. The expected value and the variance of a random variable $X$ under the negative binomial distribution can be expressed as follows:

$$E(X) = \frac{k}{p} \quad \text{and} \quad \mathrm{Var}(X) = \frac{k(1 - p)}{p^{2}}.$$

The negative binomial distribution has applications in the insurance industry, where, for example, the rate at which people have accidents is affected by a random variable such as the weather condition.

Geometric distribution. The geometric distribution is a special case of the negative binomial distribution discussed above, with $k = 1$. It expresses the probability that exactly $x$ trials are needed for the first success to occur. Let the number of trials up to and including the first success be denoted by $X$. Then, the pmf of $X$ under this distribution is given by

$$f(x; p) = p(1 - p)^{x - 1},$$

where $p$ denotes the probability of success in each trial. The expected value and the variance of a random variable $X$ under the geometric distribution can be expressed as follows:
$$E(X) = \frac{1}{p} \quad \text{and} \quad \mathrm{Var}(X) = \frac{1 - p}{p^{2}}.$$

In real life, if an NGO studying the sex ratio in a human population wants to know the number of male births before one female birth, it can use this kind of distribution.

Poisson distribution. The Poisson distribution gives the probability of exactly $x$ events occurring in a given length of time when the events are independent and happen at a constant rate, given by $\lambda$. The pmf of this distribution is given by

$$f(x; \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!},$$

where the variable $x$ is a non-negative integer and the parameter $\lambda$ is a real positive quantity. The expected value and the variance of a random variable $X$ under the Poisson distribution can be expressed as follows:

$$E(X) = \lambda \quad \text{and} \quad \mathrm{Var}(X) = \lambda.$$

When $N$ is very large and $p$ is very small in the binomial distribution described before, it can be approximated by a Poisson distribution with expected value $\lambda = Np$. The Poisson distribution is applied to determine the probability of rare events such as birth defects, genetic mutations, car accidents, etc.

Uniform distribution. If a continuous random variable $X$ follows the uniform distribution, then its probability density function (pdf) is given by

$$f(x; a, b) = \frac{1}{b - a} \quad \text{for } a \le x \le b.$$

The expected value and the variance of a random variable $X$ under the uniform distribution can be expressed as follows:

$$E(X) = \frac{a + b}{2} \quad \text{and} \quad \mathrm{Var}(X) = \frac{(b - a)^{2}}{12}.$$

In oil exploration, the position of the oil-water contact in a potential prospect is often considered to be uniformly distributed.

Exponential distribution. If a continuous random variable $X$ follows the exponential distribution, then its pdf can be expressed as follows:
$$f(x; \theta) = \frac{1}{\theta}\, e^{-x/\theta}, \quad x \ge 0,$$

where $\theta$ represents the scale parameter. The expected value and the variance of a random variable $X$ under the exponential distribution are given by:

$$E(X) = \theta \quad \text{and} \quad \mathrm{Var}(X) = \theta^{2}.$$

In real life, the time to decay in radioactive or particle decay is considered to follow the exponential distribution.

Normal distribution. The normal distribution (also called the Gauss distribution) is one of the most important distributions in statistics. The pdf of the normal distribution is given by the following expression:

$$f(x; \mu, \sigma^{2}) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}},$$

where $\mu$ is the mean or expected value and $\sigma^{2}$ is the variance of the distribution. For $\mu = 0$ and $\sigma^{2} = 1$, the distribution is called the standard normal distribution. It has widespread applications in the natural and social sciences, financial models, etc.

Beta distribution. The beta distribution has been applied in a wide variety of disciplines to model the behavior of random variables limited to intervals of finite length. The pdf of the beta distribution is given by:

$$f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1},$$

where the shape parameters $\alpha$ and $\beta$ are positive real numbers, and the variable $x$ satisfies the condition $0 \le x \le 1$. $B(\alpha, \beta)$ designates the beta function and is given by the following expression:

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$

For $\alpha \in \mathbb{R}^{+}$, the gamma function $\Gamma(\alpha)$ is defined by the integral

$$\Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha - 1} e^{-t}\, dt.$$

When $\alpha = \beta = 1$, the beta distribution reduces to the uniform distribution between 0 and 1; when $\alpha = \beta = 2$, the density takes a parabolic shape; when $\alpha = 2$ and $\beta = 1$, or vice versa, the density takes a triangular shape. The expected value and the variance of a random variable $X$ under the beta distribution can be expressed as follows:

$$E(X) = \frac{\alpha}{\alpha + \beta} \quad \text{and} \quad \mathrm{Var}(X) = \frac{\alpha \beta}{(\alpha + \beta)^{2} (\alpha + \beta + 1)}.$$
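As a rough numerical check of the moment formulas stated so far, the following Python sketch evaluates the binomial and beta moments directly from their pmf/pdf and compares a binomial probability with its Poisson approximation. The function names and the parameter values (N = 10, p = 0.3; α = 2, β = 3; N = 1000, p = 0.002) are illustrative assumptions and do not come from the reading material; the continuous moments use a simple midpoint rule, so they are approximations only.

```python
import math

# Illustrative helpers; names and parameter values are assumptions,
# not part of the original reading material.

def binomial_pmf(x, n_trials, p):
    """p(x; N, p) = C(N, x) p^x (1 - p)^(N - x)."""
    return math.comb(n_trials, x) * p**x * (1 - p) ** (n_trials - x)

def poisson_pmf(x, lam):
    """f(x; lambda) = lambda^x e^(-lambda) / x!."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def beta_pdf(x, a, b):
    """f(x; alpha, beta) = x^(alpha-1) (1-x)^(beta-1) / B(alpha, beta)."""
    beta_fn = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn

def discrete_moments(pmf, support):
    """Mean and variance of a discrete distribution summed over its support."""
    mean = sum(x * pmf(x) for x in support)
    var = sum((x - mean) ** 2 * pmf(x) for x in support)
    return mean, var

def continuous_moments(pdf, lo, hi, steps=20000):
    """Mean and variance of a density on [lo, hi] via the midpoint rule."""
    h = (hi - lo) / steps
    xs = [lo + (i + 0.5) * h for i in range(steps)]
    mean = sum(x * pdf(x) for x in xs) * h
    var = sum((x - mean) ** 2 * pdf(x) for x in xs) * h
    return mean, var

if __name__ == "__main__":
    # Binomial with N = 10, p = 0.3: compare with Np and Np(1 - p).
    print(discrete_moments(lambda x: binomial_pmf(x, 10, 0.3), range(11)))
    # Beta with alpha = 2, beta = 3: compare with the closed-form moments.
    print(continuous_moments(lambda x: beta_pdf(x, 2, 3), 0.0, 1.0))
    # Large N, small p: binomial probability versus Poisson with lambda = Np.
    print(binomial_pmf(3, 1000, 0.002), poisson_pmf(3, 2.0))
```

With these values the binomial moments should come out close to 3 and 2.1, and the beta moments close to 0.4 and 0.04, matching the closed-form expressions.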
The beta distribution is usually applied to determine time allocation in project management/control systems, heterogeneity in the probability of HIV transmission, etc.

Gamma distribution. The gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution is a special case of the gamma distribution (with $k = 1$). The pdf of the gamma distribution can be represented by the following functional form:

$$f(x; k, \theta) = \frac{x^{k - 1} e^{-x/\theta}}{\theta^{k}\, \Gamma(k)},$$

where the shape parameter $k$ and the scale parameter $\theta$ are positive real numbers ($k \in \mathbb{R}^{+}$ and $\theta \in \mathbb{R}^{+}$) and the variable $x$ is also a positive real number ($x \in \mathbb{R}^{+}$). The expected value and the variance of a random variable $X$ under the gamma distribution are given by:

$$E(X) = k\theta \quad \text{and} \quad \mathrm{Var}(X) = k\theta^{2}.$$

Sampling Distribution and Confidence Interval. If we take repeated samples from the same population, the sample means $\bar{x}$ vary from sample to sample and form a sampling distribution of sample means. This distribution explains the random behavior of a sample mean. The variability of $\bar{x}$ around the population mean $\mu$ can be obtained by determining the variance of $\bar{x}$. The variance of the sample mean for a sample of size $n$ is given by:

$$\sigma_{\bar{x}}^{2} = \frac{\sigma^{2}}{n},$$

where $\sigma^{2}$ is the population variance.

Next, a confidence interval is a range of values that contains the true population parameter with a stated level of confidence. A confidence interval comprises the point estimate, i.e., the best estimate of the population parameter from the sample statistic, and the margin of error or maximum sampling error (the maximum accepted difference between the true population parameter and a sample estimate of that parameter). The confidence interval in which $\mu$ lies can be determined by the following expression:

$$\bar{x} - z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}},$$

where $z_{\alpha/2}$ is the standard normal value with upper-tail area $\alpha/2$. The confidence level is denoted by $100(1 - \alpha)\%$. The margin of error, denoted by $E$, is given by the following formula:

$$E = z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}}.$$
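The interval and margin-of-error formulas above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under the assumption of a known population standard deviation; the function names and the sample values (mean 50, standard deviation 10, n = 100, α = 0.05) are hypothetical and are not taken from the reading material.

```python
import math
from statistics import NormalDist

# Hypothetical helper names; sigma is assumed to be known.

def z_critical(alpha):
    """Standard normal value with upper-tail area alpha / 2."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def margin_of_error(sigma, n, alpha):
    """E = z_{alpha/2} * sigma / sqrt(n)."""
    return z_critical(alpha) * sigma / math.sqrt(n)

def confidence_interval(xbar, sigma, n, alpha):
    """Return (xbar - E, xbar + E) for the population mean."""
    e = margin_of_error(sigma, n, alpha)
    return xbar - e, xbar + e

if __name__ == "__main__":
    # Illustrative values only: a 95% interval around a sample mean of 50.
    print(margin_of_error(sigma=10.0, n=100, alpha=0.05))
    print(confidence_interval(xbar=50.0, sigma=10.0, n=100, alpha=0.05))
```

For these illustrative inputs the margin of error is roughly 1.96, so the interval is roughly 48.04 to 51.96.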
From the margin-of-error formula, the required minimum sample size can be easily obtained, and it is given by:

$$n = \left(\frac{z_{\alpha/2}\, \sigma}{E}\right)^{2}.$$

Hypothesis Testing. Hypothesis testing is a technique to check, with the help of sample data, whether a claim or hypothesis about a population parameter is true or not. In hypothesis testing, the stated conjecture, defined as the null hypothesis, can be rejected, but it cannot be proved. However, by rejecting the null hypothesis, one can conclude, at the chosen significance level, that the contrary is true. The contrary of the null hypothesis is termed the alternative hypothesis. The test statistic represents the value determined using the sample data. A test statistic for testing a hypothesis on the population mean is given by the following formula:

$$z = \frac{\bar{x} - \mu_{0}}{\sigma / \sqrt{n}},$$

where $\mu_{0}$ denotes the hypothesized value of the population mean. Following are the null ($H_{0}$) and alternative ($H_{A}$) hypotheses for three standard tests on the population mean:

The Two-Tailed Test

$$H_{0}: \mu = \mu_{0}, \qquad H_{A}: \mu \ne \mu_{0}, \qquad z = \frac{\bar{x} - \mu_{0}}{\sigma / \sqrt{n}};$$

reject $H_{0}$ if $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$.

The One-Tailed Test to the Right

$$H_{0}: \mu \le \mu_{0}, \qquad H_{A}: \mu > \mu_{0}, \qquad z = \frac{\bar{x} - \mu_{0}}{\sigma / \sqrt{n}};$$

reject $H_{0}$ if $z > z_{\alpha}$.
The One-Tailed Test to the Left

$$H_{0}: \mu \ge \mu_{0}, \qquad H_{A}: \mu < \mu_{0}, \qquad z = \frac{\bar{x} - \mu_{0}}{\sigma / \sqrt{n}};$$

reject $H_{0}$ if $z < -z_{\alpha}$.

Regression Models

Simple linear regression

Here we present a simple linear regression model to determine the relationship between the dependent variable $Y$ and the independent variable $X$, captured by the following equation:

$$E(Y \mid X) = \alpha + \beta X.$$

Then the regression model can be designated as:

$$Y = \alpha + \beta X + \varepsilon,$$

where $\varepsilon = Y - E(Y \mid X)$ is a random variable or an error term with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^{2}$. If $\hat{\alpha}$ and $\hat{\beta}$ denote the best estimates of the parameters $\alpha$ and $\beta$, respectively, then the estimated linear regression equation of $Y$ on $X$ is:

$$\hat{Y} = \hat{\alpha} + \hat{\beta} X.$$

Multiple linear regression

The effect of the independent variables $X_{1}$, $X_{2}$ and $X_{3}$ on the dependent variable $Y$ can be captured by the following equation:

$$E(Y \mid X_{1}, X_{2}, X_{3}) = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3},$$

so that $Y = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \varepsilon$, where $\varepsilon = Y - E(Y \mid X_{1}, X_{2}, X_{3})$ is a random variable or an error term with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^{2}$. If $\hat{\alpha}$, $\hat{\beta}_{1}$, $\hat{\beta}_{2}$ and $\hat{\beta}_{3}$ denote the best estimates of the parameters $\alpha$, $\beta_{1}$, $\beta_{2}$ and $\beta_{3}$, respectively, then the estimated multiple linear regression equation of $Y$ on $X_{1}$, $X_{2}$ and $X_{3}$ is given by:

$$\hat{Y} = \hat{\alpha} + \hat{\beta}_{1} X_{1} + \hat{\beta}_{2} X_{2} + \hat{\beta}_{3} X_{3}.$$

Multicollinearity check

A regression model is often affected by a linear relationship between independent variables, termed multicollinearity. The Variance Inflation Factor (VIF) is one of the conventional techniques employed to check whether any multicollinearity exists. The VIF between two independent variables $X_{1}$ and $X_{2}$ can be determined by the following expression:
$$\mathrm{VIF}_{X_{1}, X_{2}} = \frac{1}{1 - R^{2}_{X_{1}, X_{2}}},$$

where $R^{2}_{X_{1}, X_{2}}$ denotes the coefficient of determination between $X_{1}$ and $X_{2}$. If the value of VIF is greater than 5, then it indicates multicollinearity, and the overall regression model is affected by it.

Sources

Anderson, D., Sweeney, D., Williams, T., Camm, J., Cochran, J. 2011. Statistics for Business & Economics, 11th ed. Cengage Learning, Mason.

Berenson, M., Levine, D., Krehbiel, T. C. 2011. Basic Business Statistics: Concepts and Applications. Pearson Education, New Jersey.

Groebner, D. F., Shannon, P. W., Fry, P. C., Smith, K. D. 2013. Business Statistics: A Decision Making Approach, 9th ed. Pearson Education, New Jersey.

Hildebrand, D. K. and O. Lyman. 1998. Statistical Thinking for Managers, 4th ed. Duxbury Press, California.

Levin, R. I. and D. S. Rubin. 1997. Statistics for Management, 7th ed. Prentice Hall International, New Jersey.

http://wps.aw.com/wps/media/objects/15/1551/formulas.pdf

http://www.nzqa.govt.nz/assets/qualifications-and-standards/qualifications/ncea/ncasubject-resources/mathematics/l3-stats-formulae-13.pdf