Data Analysis and Statistical Methods Statistics 651

Similar documents
Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651

Conditional Probability Solutions STAT-UB.0103 Statistics for Business Control and Regression Models

ECON Semester 1 PASS Mock Mid-Semester Exam ANSWERS

Lecture 1: Description of Data. Readings: Sections 1.2,

DE CHAZAL DU MEE BUSINESS SCHOOL AUGUST 2003 MOCK EXAMINATIONS IOP 201-Q (INDUSTRIAL PSYCHOLOGICAL RESEARCH)

Sampling. Module II Chapter 3

Data Analysis and Statistical Methods Statistics 651

Sets and Set notation. Algebra 2 Unit 8 Notes

ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart

Statistical Theory 1

Two-sample Categorical data: Testing

Chapter 5 : Probability. Exercise Sheet. SHilal. 1 P a g e

Psych 230. Psychological Measurement and Statistics

3 PROBABILITY TOPICS

Chapter. Probability

Probability and Discrete Distributions

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression

Statistics 1L03 - Midterm #2 Review

Topic -2. Probability. Larson & Farber, Elementary Statistics: Picturing the World, 3e 1

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651

Review of Multiple Regression

Potential Outcomes Model (POM)

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651

Lesson 8: Graphs of Simple Non Linear Functions

2 Chapter 2: Conditional Probability

Ec1123 Section 7 Instrumental Variables

Lecture 14. Analysis of Variance * Correlation and Regression. The McGraw-Hill Companies, Inc., 2000

Lecture 14. Outline. Outline. Analysis of Variance * Correlation and Regression Analysis of Variance (ANOVA)

Probability (special topic)

Binomial and Poisson Probability Distributions

Chapter 15. General Probability Rules /42

( ) P A B : Probability of A given B. Probability that A happens

Section 5.1: Probability and area

FCE 3900 EDUCATIONAL RESEARCH LECTURE 8 P O P U L A T I O N A N D S A M P L I N G T E C H N I Q U E

Chapter 5 Random vectors, Joint distributions. Lectures 18-23

Announcements. Lecture 1 - Data and Data Summaries. Data. Numerical Data. all variables. continuous discrete. Homework 1 - Out 1/15, due 1/22

Math 140 Introductory Statistics

STAT Chapter 3: Probability

CHAPTER 10 Comparing Two Populations or Groups

Probability. Hosung Sohn

SESSION 5 Descriptive Statistics

Resistant Measure - A statistic that is not affected very much by extreme observations.

Data Collection: What Is Sampling?

Math 223 Lecture Notes 3/15/04 From The Basic Practice of Statistics, bymoore

Nicole Dalzell. July 2, 2014

Lesson 19: Understanding Variability When Estimating a Population Proportion

MATH 1150 Chapter 2 Notation and Terminology

Slide 1 Math 1520, Lecture 21

Objectives. 2.1 Scatterplots. Scatterplots Explanatory and response variables. Interpreting scatterplots Outliers

Probability: Why do we care? Lecture 2: Probability and Distributions. Classical Definition. What is Probability?

Statistics for Business and Economics

Lecture 9. Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests

Chapter 2 Solutions Page 15 of 28

6 THE NORMAL DISTRIBUTION

Data Analysis and Statistical Methods Statistics 651

Lecture 6. Probability events. Definition 1. The sample space, S, of a. probability experiment is the collection of all

Chapter 7 Wednesday, May 26th

Regression Analysis. Ordinary Least Squares. The Linear Model

Announcements. Lecture 5: Probability. Dangling threads from last week: Mean vs. median. Dangling threads from last week: Sampling bias

Topic 4 Probability. Terminology. Sample Space and Event

AP Final Review II Exploring Data (20% 30%)

Chapter 2. Mean and Standard Deviation

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Exercise 1. Exercise 2. Lesson 2 Theoretical Foundations Probabilities Solutions You ip a coin three times.

Math 140 Introductory Statistics

Homework (due Wed, Oct 27) Chapter 7: #17, 27, 28 Announcements: Midterm exams keys on web. (For a few hours the answer to MC#1 was incorrect on

Chapter 1 - Lecture 3 Measures of Location

Econ 113. Lecture Module 2

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Lecture 2. Conditional Probability

Lecture 2: Probability and Distributions

7.1: What is a Sampling Distribution?!?!

tossing a coin selecting a card from a deck measuring the commuting time on a particular morning

University of Jordan Fall 2009/2010 Department of Mathematics

STT 315 Problem Set #3

Conditional Probability 2 Solutions COR1-GB.1305 Statistics and Data Analysis

The enumeration of all possible outcomes of an experiment is called the sample space, denoted S. E.g.: S={head, tail}

Intro to Probability Day 3 (Compound events & their probabilities)

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14

Section 7.2 Homework Answers

MATH 10 INTRODUCTORY STATISTICS

Review of probability. Nuno Vasconcelos UCSD

Chapter 11. Correlation and Regression

Analysis of Variance. Contents. 1 Analysis of Variance. 1.1 Review. Anthony Tanbakuchi Department of Mathematics Pima Community College

Probability deals with modeling of random phenomena (phenomena or experiments whose outcomes may vary)

Bemidji Area Schools Outcomes in Mathematics Algebra 2 Applications. Based on Minnesota Academic Standards in Mathematics (2007) Page 1 of 7

From Bayes Theorem to Pattern Recognition via Bayes Rule

the yellow gene from each of the two parents he wrote Experiments in Plant

Lecture 4 Scatterplots, Association, and Correlation

green green green/green green green yellow green/yellow green yellow green yellow/green green yellow yellow yellow/yellow yellow

What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected

Psych Jan. 5, 2005

Transcription:

Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html Lecture 5 (MWF) Probabilities and the rules Suhasini Subba Rao

Review of previous lecture We looked at the interquartile range. The concept of a variance and standard deviation was introduced. The standard deviation is the square root of the variance. Like the mean, the standard deviation (which is the square root of the variance) is a measure of spread of the population, and is not random. On the other hand, the sample variance and standard deviation are random. In statistics we have: Populations, Random variables (the measurements we are interested in the population) and Probabilities which are allocated 1

to the random variables. In this lecture we will introduce the notion of a probability and the rules that go with it. 2

Probability Suppose there is a population of people and I select one person at random. What is the chance (probability) that the selected person s height lies in the interval [5.5, 5.75]feet? How do we calculate this chance? The probability that the randomly selected person s height lies in the interval [5.5, 5.75] is: Total number of people in population with height in the interval [5.5, 5.75]. Total number of people in population 3

Probability and random variables A probability is always positive and is greater than or equal to zero and less than or equal to one (lies in the interval [0, 1]). Random variables and probabilities Recall a random variable X denotes a measurement on a randomly selected individual (it can be the height of a randomly selected student or the gender of a random selected person (for the sake of simplicity we will assume either male or female)). A random variable can take any one of several different outcomes, to each outcome we allocate a probability. We now consider some examples to familiarize ourselves with this notation. 4

Examples of random variables and their probabilities For example if X is the gender of a randomly chosen in the world population. Then clearly X = {Male or Female}. We write P (X =Female) to denote the probability a random chosen person is female. Similar X can denote the height of a randomly chosen person. Then P (5 < X < 5.25), denotes the probability of that a randomly selected students height is in the interval [5, 5.25]. 5

Illustration:Data from a 651 class Summary of information from a 651 class. 6

Probabilities and random variables: Illustration 1 (i) The measurement of interest may be the height of the randomly chosen person. Therefore X = height of randomly selected person. The set of all possible outcomes is any number in the interval [5, 7]. Therefore, P (5 X 7) = 1. P (5.75 X < 6.25) = P (5.75 X < 6) + P (6 X < 6.25) = 0.22 + 0.14. It is impossible from this plot to evaluate P (5.9 X < 6.15). (ii) The measurement of interest may be the gender of the randomly chosen person. Therefore Y = gender of randomly chosen person. The set of all possible outcomes is {male, female}. To each outcome we assign a probability. Since {male, female} is the set of all outcomes (there are no other), we have that P (Y = male or female) =1 and P (Y = Male) = 0.4308. 7

Mutually exclusive events Two events (outcomes) are mutually exclusive, if the occurrence of one event excludes the occurrence of the other event. In the height example, the interval A = [5, 5.5) is an (this is the event that a randomly chosen person s height lies in the interval A = [5, 5.5)). Similarly B = [5.75, 6) is an event. A = [5, 5.5) and B = [5.75, 6) are mutually exclusive events. This means if the randomly chosen person s height lies in the interval A = [5, 5.5), then their height cannot lie in the interval B = [5.75, 6]. Similarly if we know that X (the height of a randomly chosen person) lies in B = [5.75, 6) then it cannot lie in A = [5, 5.5). On the other hand, the events A = [5, 5.5) and C = [5.25, 5.75) are not mutually exclusive. If X = 5.3, then it lies in both events A and C. 8

Suppose X is the time of the day I wake up. Let A=night time and B=day time, then A and B are mutually exclusive events. In anyone day, I cannot get up at both night and day time. 9

Mutually exclusive events and probabilities Benefits of mutually exclusive events If two events A and B are mutually exclusive, then P (A or B) = P (A) + P (B) (often P (A or B) = P (A B)). Height Example Suppose X is the height of a randomly chosen person. If A = [5, 5.5) and B = [5.75, 6), then P (X in A or X in B) = P (5 X < 5.5 or 5.75 X < 6). Since A and B are mutually exclusive then P (X in A or X in B) = P (5 X < 5.5 or 5.75 X < 6) = P (5 X < 5.5) + P (5.75 X < 6). Plot these two events on a density plot. 10

Example 2 Suppose that X is the height of a randomly chosen person. Suppose that the probability P (X t) is known for every single number t. That is I know P (X 4), P (X 4.5) P (X 6) etc. Using this information, how can I evaluate: (i) P (4 < X 7). (ii) P (X 8) (iii) P (4 < X 7 or 8 < X 9) (iv) P (4 < X 7 or 6 < X 8) Solutions in http://www.stat.tamu.edu/~suhasini/teaching651/ probabilities_lecture5.pdf 11

Reasons for the above exercise Soon you will need to understand how to calculate the chance of a certain event happening (such as the sample mean being less than a certain value). This will require you to use the normal tables and look up probabilites and the above will be useful. It will give you some idea how percentiles are calculated. Let us return to the doctor office example. Suppose your blood level lies between 13-15, then given the chances the distribution of blood levels (based on) P (X t), you can calculate the chance P (13 < X 15). 12

Motivation: Conditional probabilities Is there a relationship between binge drinking (excessive consumption of alcohol) and gender. 17096 students were surveyed (7180 males and 9916 females). The data is given below. Male Female Binge 1630 1684 Not Binge 5550 8232 In order to analyze this type of data we need to introduce the notion of a conditional probability. This is a probability calculated using not the entire population but the conditioned subpopulation. 13

Motivation: cont A conditional probability is the probability the probability of an event given that we already have some (possibly partial) information about. Male Female Binge 1630 1684 Not Binge 5550 8232 Subtotals 7180 9916 The probability of binge drinking given that you know person is female is 1684/9916 17%, whereas the probability of binge drinking given that you know person is male is 1630/7180 22.7%. We can write these conditional probabilities as P(binge drink Male) = 0.227 and P(binge drink Female) = 0.17. Based on the above information, does gender change the chance of 14

binge drinking? In other words does information about the gender of the person change the chance of bring drinking. This example alludes to the notion of independence, which we discuss at the end of this lecture. 15

Conditional probability Example Suppose X is the height of a randomly selected person and we know that X lies in the interval [5, 6], then what is the probability X lies in [5, 5.5]? This probability is Number of people whose height is in [5, 5.5] Number of people whose height is in [5, 6]. We write this probability as: the probability that X lies in [5, 5.5] given that X lies in [5, 6]. Using probability notation this is written as: P (5 X 5.5 5 X 6) = P (5 X 5.5 }{{} 5 X 6) denotes given 16

Compare this with the the probability a random person lies in the interval [5, 5.5], written as P (5 X 5.5). In this case P (5 X 5.5) will be smaller than P (5 X 5.5 5 X 6). What is P (5 X 5.5 X 6)? 17

Conditional probability: Heights and gender of 18 people Height 5.5 5.9 4.9 6.2 6 5.9 5.2 5.7 5.3 Gender F M F M M F F M F Height 5 6.3 5.6 5.9 5.8 5.9 6 5.6 5.5 Gender M M F M F M M F F Suppose we randomly select a person and let X denote their height and Y their gender. Calculate (i) P (X 5.5). (ii) P (X 5.5 Y = M). (iii) P (X 5.5 Y = F ). What do we observe? Is there a difference in the distribution of male and female heights. 18

Example - heights and gender (general) Suppose we randomly choose a person, and let X denote their height and Y their gender. Then, P (5 X 5.5 Y = female), is the probability that a randomly person s height lies in interval [5, 5.5] given that we know the person is female. Clearly P (5 X 5.5 Y = female) P (5 X 5.5) and P (5 X 5.5 Y = female) P (5 X 5.5 Y = male). Question In general, which do you think is the larger probability; P (5 X 5.5 Y = female) or P (5 X 5.5 Y = male). Answer Since females, in general, tends to be smaller than males, and [5, 5.5] feet is quite small. It is more likely that a female should have a height in the interval in [5, 5.5] than a male to have a height in [5, 5.5]. 19

This is saying that the probability a randomly chosen person s height lies in [5, 5.5] given that the person is female is greater than the probability a randomly chosen persons height lies in [5, 5.5] given that the person is male. In more succinct notation, this is saying that P (5 X 5.5 Y = female) > P (5 X 5.5 Y = male). Suppose that X denotes the height of a randomly chosen person in the 651 data and Y denotes their gender. How to evaluate: (i) P (5 X 5.5) (ii) P (5 X 5.5 Y = female) (iii) P (5 X 5.5 Y = male)? We will not be using conditional probabilities much in this course. 20

However, the idea is important to understand. This is because when we do statistical testing we will use the idea of probability of observing some event given that A is true. 21

Independence and conditional probabilities Definition Suppose that we have two events A and B. The events A and B are independent of each other if P (A B) = P (A). This means the event B has no influence what so ever on the chance of event A occurring. Let us return to the example of heights and gender. Do you think height and gender are independent? In other words, does information about the gender of a person play no role in the chance of their being a certain height. Intuitively this seems not be the case. We showed this in the above example. Where we showed that P (5 X 5.5 Y = female) > P (5 X 5.5 Y = male). Which means that P (5 X 5.5 Y = female) P (5 X 5.5) Hence gender and height are not independent. 22

Random variables X and Y are independent if for any outcome of X and and outcome of Y, P (X = event A Y = event B) = P (X = event A). Independent and mutually exclusive (they not the same) Example If A is the event a height is in the interval [5, 5.5] and B is the event a height is in the interval [6, 6.5], then P (X in [5, 5.5] X in [6, 6.5]) = 0 P (X in [5, 5.5]). A and B are not independent events. 23

Example: Binge drinking We recall that data set: Male Female Binge 1630 1684 Not Binge 5550 8232 Subtotals 7180 9916 We recall the probability of binge drinking given that you know the person is female = P(binge drink Female) = 0.17, whereas the probability of binge drinking given that you know the person is male = P(binge drink Male) = 0.227. As these conditional probabilities are different, there is a dependence/association between gender and binge drinking. 24

Examples: Independent events Define the random variables. X is the season in college station. Y are the number of hair dressers in college station. Z is the temperature in college station. It is unlikely that the number of hair dressers in college station have an influence on the temperature. Therefore P (Z [25, 28] Y ) = P (X [25, 28]). 25

On the other hand season does have an unfluence on temperature, therefore P (X [25, 28] X = summer ) P (X [25, 28]) or equivalently, P (X [25, 28] and Z = summer ) P (X [25, 28])P (Z = summer). We will understand the above statement in lecture 6. 26

Dependence does not imply causality A study was done in the early 90s to see if the mortality rates between left and right handed people were the same. To do this the psychologists collected the death records of 2000 people who died in May, 1990, in Southern California. They rang the families of all these people and asked whether the were left or right handed. They also categorized the people who died into those above 60 and below 60 years old. This is the data they collected. Out of the 2000, 400 were left handed. Out of the 2000, 150 were left handed and died below 60. Out of the 2000, 300 were right handed and died below 60. 27

(a) A person dies below 60? Lecture 5 (MWF) Random variables, Probability and conditional probabilities (b) A person dies below 60 given that they are left handed? (c) Does the data suggest there is an association/dependence between left handedess and early mortality? 28

Solution It instructive to summarize the data as a contingency table: Died before 60 Died after 60 totals Left 150 250 400 Right 300 1300 1600 Totals 450 1550 2000 (a) The chance a person dies below 60. Calculate the total proportion who died before 60 P (before 60) = 450 2000 (b) A person dies below 60 given that they are left handed. 29

Focus on people who are only left handed. There are 400 of these. Of these, 150 died before 60. Therefore the proportion of left handed people in the sample who died before 60 is P (before 60 left handed) = 150 400 = 3 8 = 37.5% Using the same argument the proportion of right handed people in the sample who died before 60 is P (before 60 right handed) = 300 1600 = 3 16 = 18.75%. (c) As these proportions are different, it suggests there is some sort of dependence/association between lefthandedness and early mortality. 30

Why the dependence? Can we conclude from this data that being left handed increases the chance of early mortality? This statement is a causal statement, it asks whether being left handed causes early mortality. This data set cannot answer this question, does not take into account other variables associated with being left handed which are causing the effect you are seeing. The most plausible explanation for the difference in proportions is that the incidence of left-handedness has increased over the past 50 years (because people are no longer forced to be right handed). The unobserved variable is 31

Example: Before 1930, no one was left handed, what would the numbers in the table and proportions be? 32