Basic Probability Reference Sheet


February 27, 2001
17.846, 2001

This is intended to be used in addition to, not as a substitute for, a textbook.

X is a random variable. This means that X is a variable that takes on the value x with probability f(x). This is the density function, and it is written as:

$\mathrm{Prob}(X = x) = f(x)$ for all x in the domain

The cumulative probability distribution is the probability that X takes on a value less than or equal to x. This is written as:

$\mathrm{Prob}(X \le x) = F(x)$ for all x in the domain

A probability distribution is any function such that

$f(x) \ge 0$ and $\int f(x)\,dx = 1$

or, if X is discrete,

$\sum_{x \in \mathrm{domain}} p(x) = 1$

These say that the probability of any event is zero or positive, and that the sum of the probabilities of all events must equal one. (In other words, you can't have a negative probability of something happening, and something must happen.)
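To see these two conditions and the relation between f(x) and F(x) concretely, here is a minimal Python sketch; the fair six-sided die is just an illustrative example of mine, not something from the sheet itself.

```python
import numpy as np

# A discrete random variable: the roll of a fair six-sided die.
values = np.array([1, 2, 3, 4, 5, 6])
pmf = np.full(6, 1 / 6)          # f(x): each outcome has probability 1/6

# The two requirements for a probability distribution:
assert np.all(pmf >= 0)          # every probability is zero or positive
assert np.isclose(pmf.sum(), 1)  # the probabilities sum to one

# The cumulative distribution F(x) = Prob(X <= x) is the running sum of f(x).
cdf = np.cumsum(pmf)
print(dict(zip(values.tolist(), cdf.round(3).tolist())))  # F(3) = 0.5: half the rolls are 3 or less
```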

Two important probability distributions are the standard normal and the uniform on [0, N]:

$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ and $f(x) = \frac{1}{N}$

Most probability distributions have a mean and a variance:

Mean = $\mu$ and Variance = $\sigma^2$

The standard deviation (the average distance that the sample random variable is from the true mean) is equal to the square root of the variance. The mean is also equal to the first moment, or expected value, of X:

Mean = $E[X]$

The mean is the average value of X, weighted by the probability that X = x, for all values of x. For a continuous distribution, this is:

$\mu = \int x f(x)\,dx$

or, for discrete variables,

$\mu = \sum_{x \in \mathrm{domain}} x f(x)$
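A quick numerical check of these definitions, as a sketch in Python (the grid width and the die example are my own choices): the standard normal density integrates to one and has mean zero, and a discrete mean is just the probability-weighted sum of the values.

```python
import numpy as np

# Standard normal density f(x) = (1 / sqrt(2*pi)) * exp(-x^2 / 2), on a fine grid.
x = np.linspace(-8.0, 8.0, 200_001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

print(np.sum(f * dx))        # ~1.0: the density integrates to one
print(np.sum(x * f * dx))    # ~0.0: the mean, i.e. the integral of x * f(x) dx

# For a discrete variable the mean is the probability-weighted sum of the values,
# e.g. a fair six-sided die: mu = sum of x * p(x) = 3.5.
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)
print(np.sum(values * pmf))  # 3.5
```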

The expected value operator is linear. That means:

$E[aX + b] = aE[X] + b$

This is true because integration and summation, the methods we use to calculate the mean, are also linear operators.

The variance is also called the second, mean-deviated moment. It is formally:

$E[(X - E[X])^2]$

This can be reduced a bit:

$E[X^2 - 2XE[X] + E[X]^2] = E[X^2] - 2E[X]E[X] + E[X]^2 = E[X^2] - E[X]^2$

since the expected value of an expected value is just the expected value (E[X] is a constant, and the expected value of a constant is simply the constant). For a continuous distribution, the variance is:

$\int (x - \mu)^2 f(x)\,dx$

For a discrete distribution, it is:

$\sum_{x \in \mathrm{domain}} (x - \mu)^2 f(x)$

When working with variances, most of what you need can be derived from a few simple formulas:

$\mathrm{Cov}(aX, bY) = ab\,\mathrm{Cov}(X, Y)$

which is often used in the form $\mathrm{Cov}(-X, Y) = -\mathrm{Cov}(X, Y)$, and especially

$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$

In the case of independence, the covariance of X and Y is zero, and we get the rule that the variance of a sum equals the sum of the variances whenever the variables are independent of each other. This can be generalized to N variables:

$\mathrm{Var}\left(\sum_i X_i\right) = \sum_i \mathrm{Var}(X_i)$

since we know by independence that $\mathrm{Cov}(X_i, X_j) = 0$ for all $j \ne i$.

Here is the intuition: even though the variance is built from squared deviations, the variance of a sum of N independent variables increases only linearly in N. This is because when you take the sum, the deviations tend to cancel out.
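These covariance and variance rules are easy to sanity-check by simulation. Here is a minimal sketch; the particular distributions, constants, and sample size are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(2.0, 3.0, n)              # arbitrary distribution, chosen just for the check
Y = 0.5 * X + rng.normal(0.0, 1.0, n)    # deliberately correlated with X
a, b = 4.0, 7.0

cov_XY = np.cov(X, Y, ddof=0)[0, 1]

# Cov(aX, bY) = a * b * Cov(X, Y)
print(np.cov(a * X, b * Y, ddof=0)[0, 1], a * b * cov_XY)

# Var(X + Y) = Var(X) + Var(Y) + 2 * Cov(X, Y)
print(np.var(X + Y), np.var(X) + np.var(Y) + 2 * cov_XY)

# With independent variables the covariance term drops out, so the variances just add.
Z = rng.normal(0.0, 5.0, n)              # generated independently of X
print(np.var(X + Z), np.var(X) + np.var(Z))
```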

Thus, if the variance of a coin flip (heads = 1, tails = 0) is 0.25 (0.5 × 0.5), the variance of the sum of 100 coin flips is 25, not 0.25 × (100 squared). This is because the sum centers around 50; in other words, the chance of getting 100 heads is really small. Thus, if we were to flip 100 coins and take the sum, then repeat this exercise 1000 times, we would get a bell-shaped curve. If we flip two coins, take the sum, and repeat 1000 times, we do not get a bell-shaped curve. We get a histogram with 0 appearing 25% of the time, 1 appearing 50% of the time, and 2 appearing 25% of the time. This yields a variance of:

$0.25(2 - 1)^2 + 0.5(1 - 1)^2 + 0.25(0 - 1)^2 = 0.5$

If we flip just one coin, we get a 0 half the time and a 1 half the time, for a variance of 0.25.

This may not sound so important, but it becomes important when we combine it with the following rule:

$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$

This equation can be derived directly from the expectation formulas, and is highly intuitive. If we go back to our single coin flip, its variance is:

$0.5(1 - 0.5)^2 + 0.5(0 - 0.5)^2 = 0.25$

But the variance of 100 times a single coin flip is:

$0.5(100 - 50)^2 + 0.5(0 - 50)^2 = 2500 = 100^2 \times 0.25$

So you can see the difference between the variance of 100 times a single coin flip and the variance of the sum of 100 coin flips. Mathematically:

$\mathrm{Var}(kX) = E[(kX - E[kX])^2] = E[(k(X - E[X]))^2] = k^2 E[(X - E[X])^2] = k^2\,\mathrm{Var}(X)$

Now we can calculate the variance of a sample mean from the variance of the random variable itself:

$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{N}\sum X_i\right) = \left(\frac{1}{N}\right)^2 \mathrm{Var}\left(\sum X_i\right) = \frac{1}{N^2}\sum \mathrm{Var}(X_i) = \frac{1}{N}\,\mathrm{Var}(X_i)$

Or, using standard deviations:

$\mathrm{Stdev}(\bar{X}) = \sqrt{\mathrm{Var}(\bar{X})} = \sqrt{\frac{1}{N}\mathrm{Var}(X)} = \frac{1}{\sqrt{N}}\,\mathrm{Stdev}(X)$
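The coin-flip contrast above is easy to reproduce by simulation. Here is a short Python sketch (the number of repetitions is my own choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Each row is one experiment: 100 coin flips, heads = 1, tails = 0.
flips = rng.integers(0, 2, size=(100_000, 100))

# Variance of the SUM of 100 flips: about 100 * 0.25 = 25, not 0.25 * 100**2.
sums = flips.sum(axis=1)
print(sums.mean(), sums.var())        # the sums center around 50, with variance ~25

# Variance of 100 TIMES a single flip: about 100**2 * 0.25 = 2500.
single = rng.integers(0, 2, size=100_000)
print((100 * single).var())           # ~2500

# Variance of the sample mean of 100 flips: Var(X) / N = 0.25 / 100 = 0.0025.
print(flips.mean(axis=1).var())       # ~0.0025
```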

So, after estimating the mean and variance of X normally, you can estimate the mean and variance of the mean of X. If you combine this with the central limit theorem, this forms the entire basis of hypothesis testing and confidence interval estimation in regression analysis.

So far we have talked about the true mean and true variance. In real life, we don't know the true values. All we see is the data generated by some mysterious natural or human process. We assume this process can be described by a mathematical relationship (such as a probability distribution). If so, and if the probability distribution which determines how the world operates isn't too weird, we can estimate the parameters of that distribution. We do so by using the data we find in the world.

The problem is that the world is noisy; it is full of error. Or perhaps our measurements are noisy. Either way, we need some way of accounting for, or at least quantifying, this noise. This is why we estimate the variance of our observations.

So, if X follows some distribution, and we take 100 samples of X,

$\{X_1, X_2, X_3, \ldots, X_{99}, X_{100}\}$

we can use this sample to estimate the value of the mean of whatever distribution generated X. Our best guess at this true mean is in fact $\bar{X}$, which can be defined as:

$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$ (here N = 100)

It turns out that this is an unbiased estimate of the mean. In other words, the expected value of this sum is equal to the true mean. This is true because the expected value of any single $X_i$ is equal to the mean, so 1/N times the sum of N such expected values is also equal to the mean. The only advantage of using more observations is that the estimate of the mean becomes more precise. We measure this using the variance.

First, we estimate the variance of the $X_i$. We do this with the formula:

$\hat{\sigma}^2 = \frac{1}{N - 1}\sum_{i=1}^{N} (X_i - \bar{X})^2$
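Here is a minimal sketch of these estimators in Python; the underlying distribution is made up purely so there is something to sample from.

```python
import numpy as np

rng = np.random.default_rng(2)

# 100 draws from some process; here I just invented a normal with mu = 10, sigma = 3.
N = 100
sample = rng.normal(10.0, 3.0, N)

# Unbiased estimates of the mean and of the variance (note the N - 1 denominator).
x_bar = sample.sum() / N
sigma2_hat = ((sample - x_bar) ** 2).sum() / (N - 1)
print(x_bar, sigma2_hat)

# The same estimates via numpy; ddof=1 applies the N - 1 correction.
print(sample.mean(), sample.var(ddof=1))

# Estimated variance of the sample mean itself: sigma2_hat / N.
print(sigma2_hat / N)
```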

We have an N − 1 in the denominator to account for the fact that we used up a degree of freedom to estimate the sample mean. In other words, estimating the sample variance with the same data that we used to estimate the sample mean gives us some wiggle room, and we need to penalize ourselves for doing this. [Technically, we prove that dividing by N − 1 gives us an unbiased estimator by breaking the sample variance into two pieces and showing they are independent of each other, which is a neat little trick.]

The next useful result is the Central Limit Theorem, which tells us that no matter what the distribution of X (within certain limits), the distribution of $\bar{X}$ is normal as long as $\bar{X}$ uses enough observations (about 20-30, it turns out). Since we can calculate the mean and variance of $\bar{X}$, and we know the shape of the distribution of $\bar{X}$, we can use this knowledge to make some statements about $\bar{X}$.

For example, $\bar{X}$ will fall within 1.96 of its standard deviations of the true mean $\mu$ 95% of the time. If $\sigma$ is the standard deviation of X, then $\bar{X}$ will fall within $1.96\frac{\sigma}{\sqrt{N}}$ of $\mu$ 95% of the time. Or, in other words, $\mu$ will be within $1.96\frac{\sigma}{\sqrt{N}}$ of $\bar{X}$ 95% of the time. Thus, our 95% confidence interval is $\bar{X} \pm 1.96\frac{\sigma}{\sqrt{N}}$. Now, since we don't know the true $\sigma$ (just as we don't know the true mean), we substitute our estimated standard deviation $\hat{\sigma}$ for the true standard deviation. Thus we obtain our confidence intervals.

Hypothesis testing is conducted by a similar procedure. We ask: if it were true that $\mu = K$, what is the probability that a randomly generated $\bar{X}$ equals or is further away from the value of $\bar{X}$ that we observed? Well, if $\mu$ really does equal K, and $\bar{X}$ is normally distributed around $\mu$ with a standard deviation of $\frac{\sigma}{\sqrt{N}}$, then we reject the hypothesis that $K = \mu$ with 95% certainty if

$|\bar{X} - K| > 1.96\frac{\sigma}{\sqrt{N}}$

K is often zero, and the T statistic in your STATA regressions measures how many standard deviations your estimated mean is away from zero. This gives you an implicit hypothesis test against the null hypothesis that nothing is going on, or that the true mean actually is zero. For different certainty levels (other than 95%) we use values other than 1.96. These values can be read off of Z tables (for the normal), or T tables if you really want to be precise.

Regression Analysis and Statistics

The above notes deal entirely with statistics. Regression itself need not be thought of as a statistical operation. Instead, it can be considered a tool that uses linear algebra to calculate conditional means. We then use our knowledge of statistics to make probabilistic statements about the parameters we calculate using linear algebra.
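Here is a minimal Python sketch of the confidence-interval and hypothesis-test recipe just described; the sample data and the null value K = 0 are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# A sample of N observations from some process with an unknown mean.
N = 100
sample = rng.normal(0.4, 2.0, N)

x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(N)     # estimated standard deviation of the sample mean

# 95% confidence interval for the true mean: x_bar +/- 1.96 * se.
print(x_bar - 1.96 * se, x_bar + 1.96 * se)

# Test of the null hypothesis mu = K (here K = 0):
# reject at the 95% level if |x_bar - K| > 1.96 * se.
K = 0.0
t_stat = (x_bar - K) / se                # distance from K in standard errors
print(t_stat, abs(t_stat) > 1.96)
```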

What is a regression? Geometrically, a regression is the projection of a single vector (your dependent variable) of dimension N (equal to your number of observations) onto a K-dimensional subspace spanned by the K vectors which we call independent variables (also of dimension N, that is, with N observations each). The subspace spanned by your independent variable vectors defines a hyperplane: a K-dimensional space which can be reached by some combination of your independent variable vectors. Regression then finds the point in that hyperplane which is closest to the point reached by your dependent variable vector. In fact, it constructs a whole line of points in the independent variable hyperplane which are closest to the points in your dependent variable vector. It then tells you in what combination you should combine your independent variable vectors to reach this closest-points vector (or best-guess vector).

The best-guess vector is also called the projection of Y onto the subspace spanned by the X's. Your coefficients B1, B2, etc. tell us how to create the projection given our X1, X2, etc. vectors. The vector of error terms is perpendicular to the hyperplane spanned by your independent variables. By combining your X's in the proportions defined by your B's, and then adding the error term, you get back to your Y vector.

Back to the real(?) world. To calculate your B coefficients, you write down the projection equation:

$\hat{Y} = X(X'X)^{-1}X'Y = XB$, where $B = (X'X)^{-1}X'Y$

Notice that if X and Y have only one variable, this reduces to:

$\hat{B} = \frac{\sum XY}{\sum X^2}$

Actually, even simple regressions with one independent variable usually include an extra X vector of constants, or 1's, in order to allow for an intercept. If the 1's vector is the only vector in the matrix X, then B equals the mean of Y. If there is another vector in addition to the 1's vector, then:

$\hat{B} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}$

If you are actually regressing X on Y, the reverse regression yields:

$\hat{B} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)}$

The only difference is the scaling factor in the denominator.
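Here is a minimal sketch of the projection formula $B = (X'X)^{-1}X'Y$ with a column of 1's for the intercept, plus a check that the slope equals Cov(X, Y)/Var(X); the simulated data are my own.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up data: Y = 2 + 3*X + noise, with N observations.
N = 200
x = rng.normal(0.0, 1.0, N)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, N)

# Design matrix with a column of 1's so the regression has an intercept.
X = np.column_stack([np.ones(N), x])

# B = (X'X)^{-1} X'Y, the least-squares / projection coefficients.
B = np.linalg.solve(X.T @ X, X.T @ y)
print(B)                                        # roughly [2, 3]

# The fitted values are the projection of Y onto the span of the columns of X,
# and the residual vector is perpendicular to that subspace.
y_hat = X @ B
resid = y - y_hat
print(X.T @ resid)                              # both entries ~0

# With an intercept, the slope equals Cov(X, Y) / Var(X).
print(np.cov(x, y, ddof=0)[0, 1] / np.var(x))   # matches B[1]
```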

For a scaling factor that treats X and Y symmetrically, we can use the estimated correlation:

$\widehat{\mathrm{Corr}}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Stdev}(X)\,\mathrm{Stdev}(Y)}$

The correlation coefficient is always bounded between -1 and 1. To grasp the intuition behind this, imagine that X and Y are plotted and form a perfect line. Then the correlation is 1 or -1. When you introduce noise to that line, the noise terms are summed and then multiplied in the denominator, whereas they are multiplied and then summed in the numerator, so the numerator is smaller.

Imagine that we have mean-deviated the variables already, so that the sample means are equal to zero. Then the equation reduces to:

$\frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}}$, or, squaring it, $\frac{\left(\sum XY\right)^2}{\sum X^2 \sum Y^2}$

Since we have mean-deviated, X and Y measure the distance (positive or negative) away from 0. Thus, their product sort of measures how they move together. If you have two long X and Y vectors, and for each observation Y is always positive when X is positive, and Y is always negative when X is negative, then you will have positive correlation. If the reverse is true, you will have negative correlation. Moreover, the degree to which X and Y move the same amount in the same direction yields the magnitude of the correlation. If X = Y always, you get 1, as you can see.
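A short Python sketch of the correlation formula; the relationship between the two variables is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Two related variables; the relationship here is made up for illustration.
N = 500
x = rng.normal(0.0, 1.0, N)
y = 0.8 * x + rng.normal(0.0, 0.6, N)

# Mean-deviate, then apply the correlation formula directly.
xd, yd = x - x.mean(), y - y.mean()
corr = (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())
print(corr)                               # always between -1 and 1
print(np.corrcoef(x, y)[0, 1])            # the same number from numpy

# A perfect line gives correlation exactly 1 (or -1 if the slope is negative).
print(np.corrcoef(x, 5 * x + 2)[0, 1])    # 1.0
print(np.corrcoef(x, -x)[0, 1])           # -1.0
```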