Descriptive statistics:

Note: I'm assuming you know some basics. If you don't, please read chapter 1 on your own. It's pretty easy material, and it gives you a good background as to why we need statistics.

First, some definitions:

sample:
- a bunch of data that are collected from a population. For example, you want to get some information about blood types.
- You go and collect blood type information from 2300 people. These data on the 2300 people are your sample. The blood types for all people are your population. We'll say more about samples and populations a bit later.

statistic:
- a value calculated or derived from the data.
- examples: mean, median, standard deviation, or simply one of the data points.

noise/error:
- this is the problem. In one sense, the reason for statistics is to deal with noise or error! What is real and what isn't?
- data used for statistical analyses are generally variable. The question becomes: what is due to noise and what is due to a real difference?
- for example, three different people measure the same thing. Will everyone get the same result? Usually not, and the disagreement is due to noise.
- another example: height in people. It has been shown that a good diet while young will lead to more height. But two people given the same diet don't necessarily wind up at the same height. Why not? -> error (incidentally, what is the cause of this error?).
- basically, noise or error is something we can't control or account for.

notation:
Capital & lower case letters.
$Y$ -- represents a variable. For instance, birth weight. $Y$ says nothing about an actual value.
$y$ -- represents an actual value for $Y$. For instance, 9 pounds, 13 ounces.
$y_i$ -- represents the observation in the i-th place.

For example: You collect the following data:

14, 12, 16, 23, 18, 17

then we have:

$y_1 = 14$, $y_2 = 12$, $y_3 = 16$, $y_4 = 23$, $y_5 = 18$, $y_6 = 17$

Capital Sigma: Σ is used as a symbol which means "Sum". Your book isn't real good about explaining this correctly. Here's how it's used, using the above numbers as an example:

$$\sum_{i=1}^{6} y_i = y_1 + y_2 + y_3 + y_4 + y_5 + y_6$$

or in our case:

$$\sum_{i=1}^{6} y_i = 14 + 12 + 16 + 23 + 18 + 17 = 100$$

Our text omits the numbers/symbols above and below the sigma. Most often, this doesn't make much of a difference, but at times this does become important. Get in the habit of using these together with the Sigma. For example, suppose we only want to add the first three numbers:

$$\sum_{i=1}^{3} y_i = 14 + 12 + 16 = 42$$

Here are a couple of other ways this can be used:

[equation not recoverable from the source]

or:

$$\sum_{i=4}^{5} 3i = 27 = 3 \sum_{i=4}^{5} i$$

We'll get to see some more complicated examples of how this works when we do means and variances and sums of squares next class. If you want, you can try the following:

1) $\sum_{i=3}^{5} y_i$
2) [expression not recoverable from the source: a sum with upper limit 5 involving $y_i$ and 1]
3) $\sum_{i=1}^{5} 15 y_i$
4) $15 \sum_{i=1}^{6} y_i$

Now we have all the basics to introduce some basic descriptive statistics (you should be familiar with these):

Now suppose you have a sample. If you want to describe this sample to someone else, you're not going to give the other person a list of numbers. That'd be silly. You want to describe the sample using just one or two numbers. If we only use one number, what could we use? Some examples:

- minimum (is this useful?)
- maximum
- third largest number?
- the number in the middle?
- mode
- mean

The first three candidates are kind of silly, at least if you're trying to figure out how to describe this sample with just one number. Let's talk about the last three, beginning with the last:

I. Mean (see p. 32 & 33 [26 & 27] {41 & 42} {41 & 42})

- measures the center of our distribution. In the case of a sample, it's given by:

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

where n = sample size.

- this is nothing new.
- here is the example [2.15] from the book (everyone should know how to calculate an average!): weight gain in lambs over two weeks:

11, 13, 19, 2, 10, 1

thus we have 11 + 13 + 19 + 2 + 10 + 1 = 56 and we get 56/6 = 9.33 pounds.

- this is the SAMPLE mean. One can also talk about the population mean or the mean of a distribution. More on this later.

II. Mode (see p. 18 [15] {33} {33})

- The mode of a sample is simply that value which has the highest frequency (i.e., there are more observations for this value than for any other). We'll discuss the mode again when we look at distributions. Suffice it to say for now that it's not terribly useful in statistics (at least the kind we're learning here).

III. Median (p. 33 & 34 [28 & 29] {40 & 41} {40 & 41})

- the sample median is simply the value in the middle.
- if there is no middle number, then it's considered to be halfway between the two middle values. In other words:
  - if there are an odd number of observations, it's in the middle.
  - if there are an even number of observations, it's halfway between the two middle values.
- Example (exrc. 2.14 [2.16, p. 30] {2.3.3, p. 44} {2.3.3, p. 44}): arranging the values from smallest to largest:

5.9 5.9 6.3 6.9 7.0

here the median is 6.3 nmoles/gm (the middle value)
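The summation notation and the sample mean from the sections above can be checked numerically. This is only a sketch: the helper name `sigma` is made up for illustration, and it shifts the notes' 1-based indices onto Python's 0-based lists.

```python
# Data from the summation example. Python lists are 0-indexed,
# so y[0] plays the role of y_1 in the notes.
y = [14, 12, 16, 23, 18, 17]


def sigma(values, lower, upper):
    """Sum y_lower through y_upper, using the notes' 1-based indexing.

    Hypothetical helper, not from the course text.
    """
    return sum(values[lower - 1:upper])


total = sigma(y, 1, 6)        # 14 + 12 + 16 + 23 + 18 + 17
first_three = sigma(y, 1, 3)  # 14 + 12 + 16

# A constant factor can sit inside or outside the sum.
inside = sum(3 * i for i in range(4, 6))  # i = 4, 5
outside = 3 * sum(range(4, 6))

# Sample mean of the lamb weight gains: sum of the values divided by n.
lambs = [11, 13, 19, 2, 10, 1]
lamb_mean = sum(lambs) / len(lambs)

print(total, first_three, inside, outside, round(lamb_mean, 2))
# prints: 100 42 27 27 9.33
```

Writing the limits explicitly, as `sigma(y, 1, 3)` does, is the code equivalent of always putting the numbers above and below the Sigma.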

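The odd/even rule for the median is easy to get wrong by one position, so here is a small sketch of it. The function name `sample_median` and the four-value list are made up for illustration; `statistics.mode` from the Python standard library is used for the mode.

```python
import statistics


def sample_median(values):
    """Middle value of the sorted data; with an even count,
    halfway between the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [5.9, 5.9, 6.3, 6.9, 7.0]  # nmoles/gm, the five values above
even_sample = [1, 2, 3, 4]              # made-up values, only to show the even case

median_odd = sample_median(odd_sample)    # the third of five sorted values
median_even = sample_median(even_sample)  # halfway between 2 and 3
mode_odd = statistics.mode(odd_sample)    # the most frequent value

print(median_odd, median_even, mode_odd)
# prints: 6.3 2.5 5.9
```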
- Example (exrc. 2.15 [2.18, p. 30] {2.3.5, p. 44} {2.3.5, p. 44}): again, arranging the values from smallest to largest:

230 274 274 292 327 366

to calculate the median, take the average of the two middle numbers: 274 + 292 = 566, and then 566/2 = 283. So the median is 283 mg/dl.

Finally, which is better? Mean or median? (See also p. 36 [30] {43} {43})

Depends (don't you love a vague answer like that?)

For most things (particularly in this class) the mean is probably a better indication of the center. Why? Because it uses all of the data. The median uses only the middle or middle two numbers (though the other numbers do determine where the middle is). The mean is extensively used in statistics, particularly the kind we're going to learn.

So why bother with the median? It does better when the data are highly skewed, very spread out, or have lots of outliers. A common example is in income. Listing the average income is very misleading. Why? Consider Bill Gates. He pulls the average income WAY up. Also note that income usually doesn't drop below 0. The median does much better here, since Bill Gates only moves it up half a notch.

(Lots of research going on in statistics. Some years back there was a talk in the statistics department about the median).

So now we have an idea of how to measure the center of our distribution. What about the spread? We also want to know:

- are all the observations sort of the same?
- or are they all very different from each other?

Here we also have some candidates:

- range
- average absolute deviation
- variance

- standard deviation

Let's go through these:

I. Range (p. 48 [p.40] {59} {59}):

- maximum value - minimum value = range. (your book talks about interquartile ranges - ignore these references for now).
- sensitive to extremes (e.g. Bill Gates again).

II. So why not use something like average deviation?

- here's why, using the example from exrc. 2.15 [2.18] {2.3.5} {2.3.5} which we talked about (the mean of these six values is 1763/6 = 293.8333):

230 - 293.8333 = -63.8333
274 - 293.8333 = -19.8333
274 - 293.8333 = -19.8333
292 - 293.8333 = -1.8333
327 - 293.8333 = 33.1667
366 - 293.8333 = 72.1667

now we sum all the deviations:

(-63.8333) + (-19.8333) + (-19.8333) + (-1.8333) + (33.1667) + (72.1667) = 0 (oops!)

dividing 0 by 6 is pointless, so we can stop here. The sum of the deviations from the mean is always 0.

III. So what can we do instead? Average absolute deviations (this one's not in the book):

- Take the absolute value of each of our numbers above.
- So we get (remember |-63.8333| = 63.8333):

63.8333 + 19.8333 + 19.8333 + 1.8333 + 33.1667 + 72.1667 = 210.6666

- And now we have 210.6666/6 = 35.1111.
- This is used, but as it turns out, is not terribly useful for us. The mathematics needed to use this for doing anything useful can be difficult (the folks using this use a computer to deal with the details), though you might not believe this after seeing the next couple of formulas.
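The range, the zero-sum property of the deviations, and the average absolute deviation above can be reproduced in a few lines. This is a sketch with variable names chosen for illustration; note that in floating-point arithmetic the sum of the deviations comes out as zero only up to rounding error.

```python
values = [230, 274, 274, 292, 327, 366]  # mg/dl, the six values used above

n = len(values)
mean = sum(values) / n  # 1763 / 6

# Range: maximum value minus minimum value.
value_range = max(values) - min(values)

# Deviations from the mean; their sum is zero (up to rounding).
deviations = [v - mean for v in values]
sum_of_deviations = sum(deviations)

# Average absolute deviation: drop the signs, then average over n.
average_absolute_deviation = sum(abs(d) for d in deviations) / n

print(value_range)
print(abs(sum_of_deviations) < 1e-9)
print(round(average_absolute_deviation, 4))
```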

IV. Variance (& standard deviation) (p. 49-52 [p. 41-44] {60-63} {60-62}):

- The basic problem is that we need to make our deviations positive. So what else can we do? Square the deviations, which makes them positive, and then take an average (well, sort of).
- sample variance:
  - take all the deviations and square them.
  - sum these up (this, incidentally, gives you the SUM OF SQUARES, an important quantity)
  - divide by n-1.

We get:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

- Here's an example, using the same set as above:
  - Remember, we got -63.8333 by taking 230, one of our observations, and subtracting the average, 293.8333.
  - ALSO, the ^ symbol means "raised to the power", thus 2^2 would mean 2 squared, or 4.

In any case, we get:

(-63.8333)^2 + (-19.8333)^2 + (-19.8333)^2 + (-1.8333)^2 + 33.1667^2 + 72.1667^2 = 11172.8333 = Sum of Squares = SS

And then we get 11172.8333/5 = 2234.5667

- The units on this are (mg/dl)^2.
- The variance is used extensively in statistics.
- Often, statisticians don't even bother with standard deviations until they're ready to present results.
- The problem with variance is that the units are not directly comparable to the original. Thus we use the standard deviation, which is simply the square root of the variance.
- Here's an example of standard deviation, using exrc. 2.34 p. 58 [2.46, p. 49] {2.6.7, p. 67} {2.6.7, p. 66}:

mean:

6.8 + 5.3 + 6.0 + 5.9 + 6.8 + 7.4 + 6.2 = 44.4 and 44.4/7 = 6.343.

variance (the squares below use the unrounded mean, 44.4/7 = 6.342857...):

(6.8 - 6.343)^2 = 0.20898
(5.3 - 6.343)^2 = 1.08755
(6.0 - 6.343)^2 = 0.11755
(5.9 - 6.343)^2 = 0.19612
(6.8 - 6.343)^2 = 0.20898
(7.4 - 6.343)^2 = 1.11755
(6.2 - 6.343)^2 = 0.02041

Sum of Squares = 2.9571

so variance = 2.9571/6 = 0.49286 (remember, divide by n-1; 7-1 = 6)

standard deviation:

This is the square root of 0.49286, which is equal to 0.70204.

Some concluding remarks about all this.

- Here is the formula for the standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$$

- the usual abbreviation we use for the SAMPLE standard deviation is s. The SAMPLE variance is simply s^2.
- Why on earth do we use n-1 instead of n in the denominator? An intuitive explanation (ex. 2.31, p. 52 [p. 43-44] {62-63} {62}):
  - take a sample of size 1.
  - now, what is the variance?
  - using the formula, one winds up with:

$$s^2 = \frac{0}{0} = \text{undefined}$$

- this makes sense, because a sample of size one can't tell us anything about the variation of a population. There ISN'T any variation in a sample of size one.

Note that it can be shown that if you use n instead of n-1, your variance will be biased. Strangely enough, the standard deviation is always a bit biased regardless of whether you use n or n-1.

- is n ever appropriate? Yes, if you're really ONLY interested in the data you have, and NOT in making inferences about the population at large. This is not usually the case.

We will pick up with this theme next time.
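As a check on the two worked examples, here is a sketch of the sum of squares, sample variance, and standard deviation. The function names are made up for illustration. Raising an error for a sample of size one is one way to represent the undefined 0/0 case; the standard library's `statistics.variance` also refuses a single data point.

```python
import math


def sum_of_squares(values):
    """Sum of squared deviations from the sample mean (SS)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def sample_variance(values):
    """SS divided by n - 1; undefined for a sample of size one."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    return sum_of_squares(values) / (n - 1)


first = [230, 274, 274, 292, 327, 366]        # mg/dl
second = [6.8, 5.3, 6.0, 5.9, 6.8, 7.4, 6.2]

ss_first = sum_of_squares(first)       # about 11172.8333
var_first = sample_variance(first)     # about 2234.5667, in (mg/dl)^2
var_second = sample_variance(second)   # about 0.49286
sd_second = math.sqrt(var_second)      # about 0.70204

# Dividing by n instead of n - 1 always gives a smaller number.
var_second_n = sum_of_squares(second) / len(second)

print(round(ss_first, 4), round(var_first, 4))
print(round(var_second, 5), round(sd_second, 5), round(var_second_n, 5))
```

Dividing by n is the right choice only when the data in hand are the whole story; for inference about a population, the n-1 version is the one to use.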