APPENDIX 1: BASIC STATISTICS

The problem that we face in financial analysis today is not having too little information but too much. Making sense of large and often contradictory information is part of what we are called on to do when analyzing companies. Basic statistics can make this job easier. In this appendix, we consider the most fundamental tools available in data analysis.

Summarizing Data

Large amounts of data are often compressed into more easily assimilated summaries, which provide the user with a sense of the content without overwhelming him or her with too many numbers. There are a number of ways in which data can be presented. We will consider two here: one is to present the data in a distribution, and the other is to provide summary statistics that capture key aspects of the data.

Data Distributions

When presented with thousands of pieces of information, you can break the numbers down into individual values (or ranges of values) and indicate the number of individual data items that take on each value or range of values. This is called a frequency distribution. If the data can only take on specific values, as is the case when we record the number of goals scored in a soccer game, you get a discrete distribution. When the data can take on any value within a range, as is the case with income or market capitalization, it is called a continuous distribution.

The advantages of presenting the data in a distribution are twofold. For one thing, you can summarize even the largest data sets into a single distribution and get a measure of which values occur most frequently and of the range of high and low values. The second is that the distribution may resemble one of the many common distributions about which we know a great deal in statistics. Consider, for instance, the distribution that we tend to draw on the most in analysis: the normal distribution, illustrated in Figure A1.1.

[Figure A1.1: Normal Distribution]

A normal distribution is symmetric, has a peak centered around the middle of the distribution, and tails that are not fat and that stretch out to include infinitely large positive or negative values. Not all distributions are symmetric, though. Some are weighted towards extreme positive values and are called positively skewed, and some towards extreme negative values and are considered negatively skewed. Figure A1.2 illustrates positively and negatively skewed distributions.

[Figure A1.2: Skewed Distributions. Left panel: a positively skewed distribution; right panel: a negatively skewed distribution]
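The distinction between discrete and continuous data can be made concrete with a short sketch. The Python example below is not part of the original appendix; the data values and the bin width are illustrative assumptions. It tallies a discrete frequency distribution for goals scored and a binned distribution for incomes.

```python
# Illustrative sketch: frequency distributions for discrete and continuous data.
from collections import Counter

# Discrete data: goals scored per game -- count each distinct value.
goals = [0, 1, 1, 2, 3, 1, 0, 2, 4, 1]
discrete_dist = Counter(goals)           # {value: frequency}
print(sorted(discrete_dist.items()))     # e.g. [(0, 2), (1, 4), (2, 2), (3, 1), (4, 1)]

# Continuous data: incomes -- group into ranges (bins) instead of exact values.
incomes = [32_000, 45_500, 51_200, 38_750, 90_000, 61_300, 47_800, 55_000]
bin_width = 20_000                       # assumed bin width for the illustration
continuous_dist = Counter((x // bin_width) * bin_width for x in incomes)
for lower in sorted(continuous_dist):
    print(f"{lower:>7} - {lower + bin_width:>7}: {continuous_dist[lower]} observations")
```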

Summary Statistics

The simplest way to measure the key characteristics of a data set is to estimate the summary statistics for the data. For a data series X_1, X_2, X_3, ..., X_n, where n is the number of observations in the series, the most widely used summary statistics are as follows:

The mean (μ), which is the average of all of the observations in the data series:

Mean = μ_X = ( Σ_{j=1}^{n} X_j ) / n

The median, which is the midpoint of the series; half the data in the series is higher than the median and half is lower.

The variance, which is a measure of the spread in the distribution around the mean. It is calculated by first summing up the squared deviations from the mean and then dividing by either the number of observations (if the data represent the entire population) or by that number reduced by one (if the data represent a sample):

Variance = σ_X² = ( Σ_{j=1}^{n} (X_j − μ_X)² ) / (n − 1)   (for a sample)

The standard deviation is the square root of the variance.

The mean and the standard deviation are called the first two moments of any data distribution. A normal distribution can be entirely described by just these two moments; in other words, the mean and the standard deviation of a normal distribution suffice to characterize it completely. If a distribution is not symmetric, the skewness is the third moment, describing both the direction and the magnitude of the asymmetry, and the kurtosis (the fourth moment) measures the fatness of the tails of the distribution relative to a normal distribution.

Looking for Relationships in the Data

When there are two series of data, there are a number of statistical measures that can be used to capture how the two series move together over time.

Correlations and Covariances

The two most widely used measures of how two variables move together (or do not) are the correlation and the covariance. For two data series X (X_1, X_2, ...) and Y (Y_1, Y_2, ...), the covariance provides a measure of the degree to which they move together and is estimated by taking the product of the deviations from the mean for each variable in each period:

Covariance = σ_XY = ( Σ_{j=1}^{n} (X_j − μ_X)(Y_j − μ_Y) ) / (n − 1)

The sign of the covariance indicates the type of relationship the two variables have. A positive sign indicates that they move together and a negative sign that they move in opposite directions. Although the covariance increases with the strength of the relationship, it is still relatively difficult to judge the strength of the relationship between two variables by looking at the covariance, because it is not standardized.
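For readers who want to see these definitions in code, the short Python sketch below computes the mean, median, variance, standard deviation, and covariance exactly as defined above. It is not part of the original appendix; the two data series are made-up examples.

```python
# Illustrative sketch: the summary statistics defined above, in plain Python.
import math
import statistics

X = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
Y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 6.0, 8.0]
n = len(X)

mean_x = sum(X) / n
median_x = statistics.median(X)

# Population variance divides by n; the sample variance divides by n - 1.
var_pop = sum((x - mean_x) ** 2 for x in X) / n
var_sample = sum((x - mean_x) ** 2 for x in X) / (n - 1)
std_sample = math.sqrt(var_sample)

# Sample covariance: the average product of the deviations from each mean.
mean_y = sum(Y) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)

print(mean_x, median_x, var_sample, std_sample, cov_xy)
```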

The correlation is the standardized measure of the relationship between two variables. It can be computed from the covariance:

Correlation = ρ_XY = σ_XY / (σ_X σ_Y) = Σ_{j=1}^{n} (X_j − μ_X)(Y_j − μ_Y) / √( Σ_{j=1}^{n} (X_j − μ_X)² · Σ_{j=1}^{n} (Y_j − μ_Y)² )

The correlation can never be greater than one or less than negative one. A correlation close to zero indicates that the two variables are unrelated. A positive correlation indicates that the two variables move together, and the relationship is stronger the closer the correlation gets to one. A negative correlation indicates that the two variables move in opposite directions, and that relationship gets stronger the closer the correlation gets to negative one. Two variables that are perfectly positively correlated (ρ_XY = 1) essentially move in perfect proportion in the same direction, whereas two variables that are perfectly negatively correlated (ρ_XY = −1) move in perfect proportion in opposite directions.

Regressions

A simple regression is an extension of the correlation/covariance concept. It attempts to explain one variable, the dependent variable, using the other variable, the independent variable.

Scatter Plots and Regression Lines

Keeping with statistical tradition, let Y be the dependent variable and X be the independent variable. If the two variables are plotted against each other, with each pair of observations representing a point on the graph, you have a scatter plot, with Y on the vertical axis and X on the horizontal axis. Figure A1.3 illustrates a scatter plot.

[Figure A1.3: Scatter Plot of Y versus X]

In a regression, we attempt to fit a straight line through the points that best fits the data. In its simplest form, this is accomplished by finding a line that minimizes the sum of the squared deviations of the points from the line; consequently, it is called an ordinary least squares (OLS) regression. When such a line is fit, two parameters emerge: one is the point at which the line cuts through the Y-axis, called the intercept of the regression, and the other is the slope of the regression line:

Y = a + bX

The slope (b) of the regression measures both the direction and the magnitude of the relationship between the dependent variable (Y) and the independent variable (X). When the two variables are positively correlated, the slope will also be positive, whereas when the two variables are negatively correlated, the slope will be negative. The magnitude of the slope can be read as follows: for every unit increase in the independent variable (X), the dependent variable (Y) will change by b (the slope).
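Before walking through the estimation formulas, the following Python sketch illustrates the two ideas just introduced: the correlation of two series and the least-squares line Y = a + bX. It is not part of the original appendix; the data are made up and the numpy library is assumed to be available.

```python
# Illustrative sketch: correlation and an OLS line fit for two made-up series.
import numpy as np

X = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
Y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 6.0, 8.0])

# Correlation: the covariance rescaled by the two standard deviations,
# so it always falls between -1 and +1.
rho = np.corrcoef(X, Y)[0, 1]

# OLS line: np.polyfit with degree 1 returns the slope b and intercept a
# that minimize the sum of squared vertical deviations from the line.
b, a = np.polyfit(X, Y, 1)

print(f"correlation = {rho:.3f}")
print(f"fitted line: Y = {a:.3f} + {b:.3f} X")
```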

Estimating Regression Parameters

Although there are statistical packages that allow us to input data and get the regression parameters as output, it is worth looking at how they are estimated in the first place. The slope of the regression line is a logical extension of the covariance concept introduced in the last section. In fact, the slope is estimated using the covariance:

Slope of the regression = b = Covariance_YX / Variance of X = σ_YX / σ_X²

The intercept (a) of the regression can be read in a number of ways. One interpretation is that it is the value that Y will have when X is zero. Another is more straightforward and is based on how it is calculated: it is the difference between the average value of Y and the slope-adjusted average value of X.

Intercept of the regression = a = μ_Y − b · μ_X

Regression parameters are always estimated with some error or statistical noise, partly because the relationship between the variables is not perfect and partly because we estimate them from samples of data. This noise is captured in a couple of statistics. One is the R² of the regression, which measures the proportion of the variability in the dependent variable (Y) that is explained by the independent variable (X). It is also a direct function of the correlation between the variables:

R-squared of the regression = Correlation²_YX = ρ²_YX = b² σ_X² / σ_Y²

An R² value close to one indicates a strong relationship between the two variables, though the relationship may be either positive or negative. Another measure of noise in a regression is the standard error, which measures the spread around each of the two estimated parameters, the intercept and the slope. Each parameter has an associated standard error, which is calculated from the data:

Standard error of the slope = SE_b = √[ ( Σ_{j=1}^{n} (Y_j − a − bX_j)² / (n − 2) ) / Σ_{j=1}^{n} (X_j − μ_X)² ]

Standard error of the intercept = SE_a = SE_b · √( Σ_{j=1}^{n} X_j² / n )

If we make the additional assumption that the intercept and slope estimates are normally distributed, the parameter estimate and the standard error can be combined to get a t-statistic that measures whether the relationship is statistically significant.

t-statistic for the intercept = a / SE_a
t-statistic for the slope = b / SE_b

For samples with more than 120 observations, a t-statistic greater than 1.96 indicates that the parameter is significantly different from zero with 95% certainty, whereas a statistic greater than 2.33 indicates the same with 99% certainty. For smaller samples, the t-statistic has to be larger to have statistical significance. (The exact critical values can be found in a table for the t distribution, available in any standard statistics book or software package.)

Using Regressions

Although regressions mirror correlation coefficients and covariances in showing the strength of the relationship between two variables, they also serve another useful purpose. The regression equation described in the last section can be used to estimate predicted values for the dependent variable, based on assumed or actual values for the independent variable. In other words, for any given value of X, we can estimate what Y should be:

Y = a + b(X)

How good are these predictions? That will depend entirely on the strength of the relationship measured in the regression. When the independent variable explains a high proportion of the variation in the dependent variable (R² is high), the predictions will be precise. When the R² is low, the predictions will have a much wider range.
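The estimation formulas above can be applied directly. The Python sketch below is an illustration rather than part of the appendix; the data are made up. It computes the slope, intercept, R-squared, the standard error of the slope, the slope's t-statistic, and a predicted value, following the formulas in this section.

```python
# Illustrative sketch: estimating the simple regression Y = a + bX from the formulas above.
import math

X = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
Y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 6.0, 8.0]
n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n

cov_yx = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)
var_x = sum((x - mean_x) ** 2 for x in X) / (n - 1)
var_y = sum((y - mean_y) ** 2 for y in Y) / (n - 1)

b = cov_yx / var_x                  # slope = covariance / variance of X
a = mean_y - b * mean_x             # intercept = mean of Y minus slope-adjusted mean of X
r_squared = b ** 2 * var_x / var_y  # share of the variability in Y explained by X

# Standard error of the slope and the associated t-statistic.
residual_ss = sum((y - a - b * x) ** 2 for x, y in zip(X, Y))
se_b = math.sqrt((residual_ss / (n - 2)) / sum((x - mean_x) ** 2 for x in X))
t_slope = b / se_b

# Prediction: for any given X, the regression line gives the expected Y.
x_new = 6.0
y_hat = a + b * x_new
print(b, a, r_squared, t_slope, y_hat)
```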
From Simple to Multiple Regressions

The regression that measures the relationship between two variables becomes a multiple regression when it is extended to include more than one independent variable (X_1, X_2, X_3, X_4, ...) in trying to explain the dependent variable Y. Although the graphical presentation becomes more difficult, the multiple regression yields output that is an extension of the simple regression:

Y = a + bX_1 + cX_2 + dX_3 + eX_4

The R² still measures the strength of the relationship, but an additional statistic called the adjusted R² is computed to counter the bias that induces the R² to keep increasing as more independent variables are added to the regression. If there are k independent variables in the regression, the adjusted R² is computed as follows:

R² = 1 − [ Σ_{j=1}^{n} (Y_j − Ŷ_j)² / (n − 1) ] / [ Σ_{j=1}^{n} (Y_j − μ_Y)² / (n − 1) ]

Adjusted R² = 1 − [ Σ_{j=1}^{n} (Y_j − Ŷ_j)² / (n − k) ] / [ Σ_{j=1}^{n} (Y_j − μ_Y)² / (n − 1) ]

where Ŷ_j is the value of Y predicted by the regression for observation j.
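As an illustration (not part of the original appendix), the sketch below fits a multiple regression with two independent variables and computes the R-squared and the adjusted R-squared following the formulas above. The data are made up and the numpy library is assumed; note the comment on the slightly different divisor that many statistical packages use.

```python
# Illustrative sketch: a multiple regression of Y on two made-up explanatory variables.
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
Y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8])
n = len(Y)

# Design matrix with a column of ones for the intercept; least squares gives [a, b, c].
A = np.column_stack([np.ones(n), X1, X2])
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b, c = coeffs

Y_hat = A @ coeffs
sse = np.sum((Y - Y_hat) ** 2)        # unexplained (residual) variation
sst = np.sum((Y - Y.mean()) ** 2)     # total variation in Y

k = 2                                 # number of independent variables, as in the formula above
r_squared = 1 - sse / sst
adj_r_squared = 1 - (sse / (n - k)) / (sst / (n - 1))
# Note: many statistical packages instead divide the residual term by (n - k - 1),
# counting the intercept as an additional estimated parameter.

print(a, b, c, r_squared, adj_r_squared)
```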

Multiple regressions are powerful tools that allow us to examine the determinants of any variable.

Regression Assumptions and Constraints

Both the simple and multiple regressions described in this section assume linear relationships between the dependent and independent variables. If the relationship is not linear, we have two choices. One is to transform the variables, by taking the square, square root, or natural log (for example) of the values, and hope that the relationship between the transformed variables is more linear. The other is to run nonlinear regressions that attempt to fit a curve (rather than a straight line) through the data.

There are implicit statistical assumptions behind every multiple regression that we ignore at our own peril. For the coefficients on the individual independent variables to make sense, the independent variables need to be uncorrelated with each other, a condition that is often difficult to meet. When independent variables are correlated with each other, the statistical hazard that is created is called multicollinearity. In its presence, the coefficients on the independent variables can take on unexpected signs (positive instead of negative, for instance) and unpredictable values. There are simple diagnostic statistics that allow us to measure how far the data may be deviating from our ideal.

Conclusion

In the course of trying to make sense of large amounts of often contradictory data, there are useful statistical tools on which we can draw. Although we have looked at only the most basic ones in this appendix, there are far more sophisticated and powerful tools available.