Biostatistics and Epidemiology, Midterm Review

New York Medical College
By: Jasmine Nirody

This review is meant to cover lectures from the first half of the Biostatistics course. The sections are not organised by lecture, but rather by topic. If you have any comments or corrections, please send them to me.

1 Introduction to Statistics

This section discusses the definition of biostatistics (the application of statistics to a wide range of topics in the biological and medical sciences) and how it can be used in our medical and research careers.

1.1 Types of Measurements

Data can be highly variable due to several factors, including genetics, age, sex, race, economic background, measurement techniques, and many others. For this reason, we need ways to classify measurements.

1.1.1 Categorical Data

Categorical data places observations into more or less arbitrary groups, meaning that the way the groups are ordered or presented doesn't affect anything in the presentation. This is a qualitative measure, and usually is not numerical (though sometimes numbers can be assigned; we will show an example).

Examples: Sex (Male/Female), Blood Type (A, B, AB, O), Disease Status (Y/N, 1/0). [Note in the last example that, though numbers (1, 0) may be used to denote categorical data, the number choices are arbitrary. That is, you can choose to denote positive or negative disease status by any symbol: 1, 0, 293, and so on.]

When constructing a categorical variable, note that categories should be exhaustive, that is, there should be a category for every possibility (this often means including an "Other" category), and mutually exclusive, which means that every observation fits into one, and ONLY one, category.

Often it is possible to convert ordinal or quantitative data into categorical data. For example, consider a situation where you have weight data, which is continuous and quantitative.
By assigning ranges to be considered Underweight, Normal, Overweight, and Obese, we now have data in categories (these particular categories still have a natural order, which we can either keep or ignore). As a general rule, more informative data can be converted into less informative data, but not vice versa.

1.1.2 Ordinal Data

Ordinal data assigns observations to categories which can be ranked, though only the order of the categories, and not the distance between them, is considered. A good general rule is that if the categories can be ranked but there is no true zero, the data is ordinal.

Examples: Opinion ranked on a scale of 1-5. Here, we could have used any (numeric or non-numeric) scale with 5 categories, say [Poor, Fair, Good, Very Good, Amazing], or 1-10 (using only even numbers), or the set [-4, 3/2, 9, 30, 3993]. The specific numbers used don't matter in ordinal data, only that they exist in some prespecified order.
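As a rough illustration of converting quantitative data into categories, the sketch below bins weights into the four groups mentioned under Categorical Data. The cut-off values and the name `weight_category` are made up for this example and are not clinical thresholds.

```python
from bisect import bisect_right

# Hypothetical cut-offs in kg separating the four groups (NOT clinical values).
CUTOFFS = [55.0, 75.0, 90.0]
LABELS = ["Underweight", "Normal", "Overweight", "Obese"]


def weight_category(weight_kg: float) -> str:
    """Map a continuous weight onto exactly one of the four categories."""
    # bisect_right counts how many cut-offs are <= the weight, which is
    # the index of the category the weight falls into.
    return LABELS[bisect_right(CUTOFFS, weight_kg)]


weights = [48.2, 61.0, 75.0, 88.9, 102.5]
categories = [weight_category(w) for w in weights]
print(categories)
```

Every weight lands in one and only one category, so the groups are exhaustive and mutually exclusive. Going the other way is not possible: the label "Normal" cannot be turned back into the original weight.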

1.1.3 Quantitative Data

Quantitative data is data you can put onto a numeric scale, where zero has a real meaning. There are two types of quantitative data: continuous (which can take on any value within a certain range) and discrete (which can only take certain values within a certain range). A good way to tell the difference is to pick any two possible values the data could have, and then pick any random value between those two numbers. Can the data take on that value? If no, it's discrete. If yes, it might be continuous, but you need to see if it can do that for any value in that range, or if you just picked luckily!

Examples: Blood pressure (continuous), Height/Weight (continuous), Age in whole years (discrete), Age with no restrictions (continuous: you can be any fractional number of years old).

Quantitative data is often presented in frequency tables. We show an example from the lecture in Figure 1.

Figure 1: An example of a frequency table.

Relative frequency (RF) is defined as the fraction (expressed as a fraction, percentage, or decimal) of times a certain answer occurs. For example, the RF of 4-year olds in the data shown in Figure 1 can be calculated as:

$$\mathrm{RF} = \frac{\text{number of four year olds}}{\text{total number of children}} \times 100\%$$

Cumulative relative frequency (Cum RF) is the sum of all the relative frequencies that occur when any value less than or equal to the answer is considered. For example, the Cum RF for 4-year olds in the same data is:

$$\mathrm{Cum\ RF} = \frac{\text{number of four year olds} + \text{number of three year olds}}{\text{total number of children}} \times 100\%$$

1.2 Types of Inaccuracies

Inaccuracies in collecting and presenting data can appear through imprecision in measurement (which results in poor reproducibility of data), or through inherent bias in the measurement. See Figure 2.

2 Descriptive Statistics

Often, working with raw data is difficult and cumbersome, especially since there is usually a lot of it, so we look for ways to visualise the data. This is accomplished by using frequency distributions.
2.1 Frequency Distributions

Continuous quantitative variables can be represented using continuous distributions. Discrete variables can be plotted using histograms and other methods, but we will not be particularly concerned with this. A fuller discussion is given in the lecture slides.
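The relative and cumulative relative frequencies defined earlier can be tabulated in a few lines of Python. The ages below are invented for illustration and are not the values from Figure 1.

```python
from collections import Counter

# Hypothetical ages in whole years (NOT the data from Figure 1).
ages = [3, 4, 4, 5, 3, 4, 5, 5, 4, 3]

counts = Counter(ages)
total = len(ages)

table = []        # rows of (age, count, RF, cumulative RF)
cumulative = 0.0
for age in sorted(counts):
    rf = counts[age] / total   # relative frequency of this age
    cumulative += rf           # running sum over all ages <= this one
    table.append((age, counts[age], rf, cumulative))

print("age  count  RF    Cum RF")
for age, count, rf, cum_rf in table:
    print(f"{age:<4} {count:<6} {rf:<5.0%} {cum_rf:.0%}")
```

The last cumulative relative frequency in any such table should come out to 100%, which is a handy sanity check.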

Figure 2: Repeated glucose measurements on a single sample.

The distribution we will be most concerned with in this course is the Normal distribution or Gaussian distribution, shown in Figure 3. The normal distribution has a higher density in the middle, and tapers off towards the edges. Even within the category of normal distributions, we can observe some unique shapes: the distribution might be flat and spread out (high variability in the data) or very high and thin (low variability in the data). Of course, distributions don't necessarily have to be normal, but we will discuss these in a later section.

Figure 3: A normal (Gaussian) distribution.

2.2 Measures of Central Tendency

Instinctively, when the grades for an exam come out, the first thing we wonder is "How did I do in relation to the rest of the class?" To properly answer this question, we involve measures of central tendency. In this section, we discuss not one, not two, but three ways to formally define the center of a distribution: the mean, median, and mode.

2.2.1 Mean

The mean, also called the average, is the sum of all observations in a certain group, divided by the number of observations in that group. Symbolically, in a group of n observations:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}.$$

When the number of observations in a data set is small, the mean is sensitive to extreme values (outliers). [Note: we define outliers as values which are three or more standard deviations away from the center. These terms will be further discussed in a later section.] As the number of observations in a group increases, the effect of outliers is diluted.

2.2.2 Median

The median is defined as the true midpoint of a set of data. The calculation of the median is easily done in two steps:

1. Arrange the data in order of magnitude.
2. If the number of observations is odd, choose the middle number. If even, choose the middle two numbers and calculate their mean.

The median is insensitive to outliers. Consider a set of numbers organised by order of magnitude. Whether the value in the final space has magnitude 40 or 9008, the median remains unchanged.

2.2.3 Mode

The mode is probably the simplest measure of central tendency to calculate. It is defined as the value which occurs most often in a data set. A set may have more than one mode (multimodal), though rarely more than two (a set with exactly two modes is bimodal).

2.3 Measures of Variability

While knowing where the center of a distribution is located is important, we also tend to wonder what the distribution actually looks like: are all the data points located right at the center, or are they spread out? To answer this question formally, we use measures of variability. We also discuss three of these: range, variance, and standard deviation.

2.3.1 Range

The range is defined as the difference between the highest and lowest values in a data set. Calculation is straightforward.

2.3.2 Variance

We define the deviation of a value as the difference between that value and the mean. Symbolically, $x_i - \bar{x}$.
We can then define the variance ($s^2$) as the sum of the squares of the deviations divided by one less than the number of observations:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$$

Note that the squared deviations, rather than the deviations themselves, are used. This is to account for values on opposite sides of the mean, which would have deviations of opposite sign, and would cancel each other out in a summation.
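As a quick check on these definitions, Python's standard `statistics` module can compute each measure directly. The sketch below uses the data set from this review's worked example; note that `statistics.variance` is the sample variance, dividing by n - 1 as in the formula above.

```python
import statistics

data = [3, 5, 6, 9, 0, -5, 3]

mean = statistics.mean(data)          # sum of observations / n
median = statistics.median(data)      # middle value of the sorted data
mode = statistics.mode(data)          # most frequent value
data_range = max(data) - min(data)    # highest value minus lowest value
variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # square root of the sample variance

print("mean:", mean, "median:", median, "mode:", mode, "range:", data_range)
print("variance:", round(variance, 2), "standard deviation:", round(std_dev, 2))
```

The module also offers `statistics.pvariance` and `statistics.pstdev`, which divide by n instead and are meant for a whole population rather than a sample.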

2.3.3 Standard Deviation

Because variance uses the squares of the deviations, the units of variance are also squared. That is, if the unit of the observations in a data set is inches, then the unit of the variance of this set will be inches squared. For this reason, it is usually preferable to use another measure of variability, the standard deviation. The standard deviation is simply the square root of the variance:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}.$$

Example: Consider the following data set: [3, 5, 6, 9, 0, -5, 3]. The mean of this data set is calculated as follows:

$$\bar{x} = \frac{3 + 5 + 6 + 9 + 0 + (-5) + 3}{7} = \frac{21}{7} = 3.$$

The median is calculated by ordering the data set by magnitude: [-5, 0, 3, 3, 5, 6, 9]. Since n is odd, we choose the midpoint: 3. The mode is easily seen to also be 3. The range of the data is the difference between the highest value (9) and the lowest value (-5), which is 14. The variance is calculated as follows:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(-5-3)^2 + (0-3)^2 + (3-3)^2 + (3-3)^2 + (5-3)^2 + (6-3)^2 + (9-3)^2}{7 - 1} = \frac{122}{6} \approx 20.33.$$

Calculation of the standard deviation is straightforward from here:

$$s = \sqrt{s^2} = \sqrt{20.33} \approx 4.51.$$

2.4 Quartiles

Quartiles divide a distribution into four equal groups, each representing one fourth of the data. Specifically, we define:

- first quartile: marks off the lowest 25% of the data
- second quartile: cuts the data set in half (this is the median)
- third quartile: marks off the highest 25% of the data (or, conversely, the lowest 75%).

There are many ways to compute quartiles, and they can give slightly different results. We will discuss Tukey's hinges.

2.4.1 Tukey's Hinges System

Tukey's Hinges system is used to determine the 25th and 75th percentiles of a data set, that is, the first and third quartiles. According to this system, the first quartile is defined as the median of the first half of the sample and the third quartile as the median of the second half of the sample. The calculations can be divided into the following simple steps:

1. Order the data from smallest to largest. This is the same first step as when simply finding the median of a set.
2. That being said...find the median. Remember, if n is even, the median is the mean of the middle two numbers.
3. Since the median is the midpoint of the set, split the data set into two groups: one with values higher than the median, and one with values lower than the median. [Note: When n is even, the median is not included in either of the two groups! When n is odd, the median is included in both of the two groups!]
4. Now you have two sets of data. Find the median of each. The median of the low group is Q1, the first quartile. The median of the high group is Q3, the third quartile. Again, remember that if n (where now, n is the number in each of the two groups) is even, you use the mean of the two middle values.

Finally, we discuss the interquartile range, which is defined simply as the difference between the third and first quartile: IQR = Q3 - Q1.

Example: Let's use the same data set as above: [3, 5, 6, 9, 0, -5, 3]. As before, we order it by magnitude to get: [-5, 0, 3, 3, 5, 6, 9]. From above, we know that the median of the set is 3, and that n = 7 is odd. So the median is included in both the high and low groups. We now form these groups: high group [3, 5, 6, 9], low group [-5, 0, 3, 3]. The median calculations are straightforward (remember n = 4 is now even), and we arrive at Q3 = 5.5 and Q1 = 1.5. The interquartile range, Q3 - Q1, is 4.

2.5 Coefficient of Variation

We quickly discuss one final term dealing with variability: the coefficient of variation. This is defined as the ratio of the standard deviation to the mean:

$$c_v = \frac{s}{\bar{x}}.$$

Note that this is only valid for data with a non-zero mean.

3 Basic Probability

Knowing the basic rules of probability is important to understand and deal with random variability in a data set. In this section, we will define some terms and explain some fundamental probability rules.

3.1 Probability

The probability of an event is the number of ways the event can occur divided by the total number of possible outcomes:

$$P(E) = \frac{n}{N} = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}.$$
Often there are so many possible outcomes that it is not possible to count all of them, and so we cannot directly determine the probability of an event by counting. In this case, we have to estimate the probability as a long term relative frequency, that is, repeat a process over and over until we are more or less sure we are close to the real probability. The Law of Large Numbers states that if the same experiment is performed a large number of times, the average of the results from that experiment will be close to the expected value. For example, in a coin toss, we expect to get the result heads with a probability 0.5. While we cannot observe this directly, if we were to perform a large number of coin tosses, the proportion of heads would be close to the expected value 0.5.

Since the probability of an event is the ratio of a part to the whole, its value is always between 0 and 1. If the probability of an event is 0, this means that the number of favorable outcomes is 0, and so the event is impossible. On the other hand, if the probability is 1, the number of favorable outcomes is equal to the number of possible outcomes, and the event is certain to occur.

3.2 Multiple Events

So far we have described the probability of single events, e.g. the result of a single coin toss. However, often we are concerned with multiple events, e.g. the rolling of two dice simultaneously or two consecutive coin tosses.

3.2.1 The Addition Rule

The addition rule is used to calculate the probability of event A or event B occurring. When A and B cannot both occur, this probability is calculated by:

$$P(A \text{ or } B) = P(A) + P(B).$$

If A and B can occur together, the overlap is counted twice and must be subtracted once: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$. This rule can be generalised to any number of events.

3.2.2 The Multiplication Rule

Before we continue, we define mutually exclusive events as those which cannot occur simultaneously; for example, one cannot have been vaccinated for the flu and not vaccinated for the flu at the same time. If events A and B are mutually exclusive, then the probability of both A and B occurring is always 0. For events that are not mutually exclusive and are independent of each other, however:

$$P(A \text{ and } B) = P(A)\,P(B).$$

This rule can also be generalised to any number of events.

3.2.3 Conditional Probability

Up until now, we have assumed that all events are independent of each other, that is, the outcome of one event doesn't influence the outcome of the following event. The probabilities in this case are called unconditional probabilities. However, when two events are not independent, we consider their conditional probabilities. An example of conditional probability is the likelihood of getting into a car accident with a BAC of 0.8 versus that of a sober driver. We calculate the conditional probability of event A given that event B has occurred as follows:

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}.$$

Note that, if A and B are independent events:

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A).$$

So, the conditional probability of event A occurring if event B has occurred is simply the probability of event A occurring, as expected.
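The rules in this section can be checked by brute-force counting on a small sample space. The sketch below enumerates all outcomes of rolling two dice; the event names and helper functions are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(E) = number of favorable outcomes / number of possible outcomes."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))


def both(event_a, event_b):
    """Event 'A and B'."""
    return lambda outcome: event_a(outcome) and event_b(outcome)


def conditional(event_a, event_b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(both(event_a, event_b)) / prob(event_b)


def first_is_six(outcome):
    return outcome[0] == 6


def second_is_six(outcome):
    return outcome[1] == 6


def total_at_least_ten(outcome):
    return sum(outcome) >= 10


# Independent events: conditioning on the second die leaves P(A) unchanged.
print(prob(first_is_six), conditional(first_is_six, second_is_six))
# Dependent events: knowing the first die is a six changes the probability.
print(prob(total_at_least_ten), conditional(total_at_least_ten, first_is_six))
```

Using `Fraction` keeps every probability exact, so the equality P(A | B) = P(A) for independent events can be compared without rounding worries.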
3.3 The Binary Classification Test

Finally, we will quickly go over some concepts necessary to analyse a binary classification test. A binary classification test is one that classifies the members of a set into two groups depending on the existence of one property, for example, diseased or not diseased.

Sensitivity is the number of positives which are correctly identified divided by the total number of actual positives (for example, the number of people correctly diagnosed with a disorder divided by the total number of people who actually have that disorder). Specificity is the number of correctly identified negatives divided by the total number of actual negatives (for example, the number of people who are correctly identified as healthy divided by the total number of healthy people).

The outcome of a binary classification test may take four results:

- True positive: sick people diagnosed as sick
- False positive or Type I error: healthy people diagnosed as sick
- True negative: healthy people identified as healthy
- False negative or Type II error: sick people left undiagnosed.

4 The Binomial Distribution

The binomial distribution is the probability distribution typically associated with the number of successes when performing n yes/no experiments. The classical experiment associated with the binomial distribution is the Bernoulli trial, an experiment with a random outcome with two possibilities: success or failure. (Fun fact: the binomial distribution for n = 1 is called the Bernoulli distribution.)

So, when would we use the binomial distribution? Consider a population, say all the adults on the planet Blorg. If you are considering a certain trait for which you know the prevalence in that population (say, purple skin), then we can use the binomial distribution to tell us what the chances are of finding a given number of people with that trait in a random sample from the population.

Example: So let us assume we have the Blorgian population we discussed above. Now, assume we know that the percentage of Blorgian adults with purple skin is 29% (p = 0.29). Now, if we pick 1000 random adults (N = 1000) from this population, we want to know what is the probability (P) that we will get exactly 230 (x = 230) purple-skinned ones.
For this, we use the binomial distribution and the following formula:

$$P(x) = \binom{N}{x}\, p^x (1 - p)^{N - x}.$$

Here, we calculate $\binom{N}{x}$ by:

$$\binom{N}{x} = \frac{N!}{x!\,(N - x)!}.$$

So, in our example:

$$P = \frac{1000!}{230!\,(1000 - 230)!}\,(0.29)^{230}\,(1 - 0.29)^{1000 - 230}.$$

The factorials here are very big numbers, and the expression is very difficult to calculate, even on a calculator. We'll see in a later section that we can use the normal distribution to approximate the binomial distribution in certain cases.

There are some other things we can calculate for the binomial distribution. The mean and variance are given as follows:

$$\mu = Np, \qquad \sigma^2 = Np(1 - p).$$
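One way around the huge factorials is to work with logarithms. The sketch below is one possible approach (not necessarily the one intended in the course): it evaluates the binomial formula through `math.lgamma`, using the identity ln(n!) = lgamma(n + 1). The function name `binomial_pmf` is an illustrative choice, and the function assumes 0 < p < 1.

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(exactly x successes in n trials), for 0 < p < 1.

    Works in log space so that the factorials never have to be
    evaluated as (enormous) ordinary numbers.
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    log_prob = log_choose + x * math.log(p) + (n - x) * math.log(1 - p)
    return math.exp(log_prob)


N, p, x = 1000, 0.29, 230

print("P(x = 230):", binomial_pmf(x, N, p))  # a very small probability
print("mean:", N * p)                        # mu = Np
print("variance:", N * p * (1 - p))          # sigma^2 = Np(1 - p)
```

The probability of hitting exactly 230 is tiny because 230 sits about four standard deviations below the mean of 290.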

5 The Normal Distribution

The normal distribution, which we briefly discussed before (see Figure 3), is considered the most basic probability distribution, and is determined by its mean µ and its standard deviation σ. A normal distribution has the following properties:

- the mean is at the center
- if you consider the area under the curve within one standard deviation in both directions away from the mean, you will cover approximately 68% of the area (more precisely, 68.3%)
- similarly, two standard deviations comprise about 95% and three, 99.7% (see Figure 4).

Figure 4: Standard deviations in a normal distribution.

Example: Let's consider the same Blorgian population as above, again, and this time we'll look at number of nipples. The mean number of nipples in this population is 4, with a standard deviation of 0.7, because Blorgians can have fractions of a nipple. We want to know the probability of finding a Blorgian with 3.3 nipples or less. This corresponds to a boundary that is one standard deviation to the left of the mean. Looking at the normal curve (Figure 4), we see that if we consider the area under the curve to the left of this boundary (which corresponds to less than or equal to 3.3 nipples), the probability of finding such a Blorgian is approximately 16%.

Since we often cannot sample an entire planet, we must settle for choosing a random sample of size N. If we do this, then we must find a relationship between the parameters of the population (µ, σ) and the statistics of the sample ($\bar{x}$, s). For a normal approximation, this is quite simple: the sample mean estimates the population mean, $\bar{x} \approx \mu$, and the standard deviation of the sample mean is

$$s_{\bar{x}} = \frac{\sigma}{\sqrt{N}}.$$

This term, $s_{\bar{x}}$, is called the standard error. Note that as N becomes large, the standard error becomes small, so that the sample mean converges onto the true mean of the population.

5.1 The Central Limit Theorem

Even though the normal distribution is nice to work with, it is often not a great approximation for the data themselves (for example, in skewed distributions).
However, if N is large enough, we can use the normal distribution to approximate the distribution of the sample mean, no matter the original distribution of the population. This is the Central Limit Theorem. More formally:

If the underlying distribution of the population is normal, $X \sim N(\mu, \sigma^2)$, then the distribution of the sample mean is also a normal distribution, with $\bar{X} \sim N(\mu, \sigma^2 / N)$.

However, if the underlying distribution of the population is not normal but rather some unknown distribution, $X \sim f(x \mid \mu, \sigma^2)$, then for large enough N, the distribution of the sample mean can be approximated by the normal distribution $\bar{X} \sim N(\mu, \sigma^2 / N)$.
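A small simulation can make the theorem more concrete. The sketch below assumes a skewed (exponential) population with mean 1 and standard deviation 1; the sample size, the number of samples, and the seed are arbitrary choices made for this illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so that repeated runs give the same numbers

N = 50               # size of each sample
NUM_SAMPLES = 2000   # how many samples we draw
POP_MEAN = 1.0       # mean of the exponential population
POP_SD = 1.0         # standard deviation of the exponential population

# Draw many samples from the skewed population and record each sample mean.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [random.expovariate(1.0 / POP_MEAN) for _ in range(N)]
    sample_means.append(statistics.mean(sample))

standard_error = POP_SD / math.sqrt(N)  # sigma / sqrt(N)
mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Fraction of sample means within 1.96 standard errors of the population mean.
within = sum(
    abs(m - POP_MEAN) <= 1.96 * standard_error for m in sample_means
) / NUM_SAMPLES

print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:  ", round(sd_of_means, 3))
print("sigma / sqrt(N):     ", round(standard_error, 3))
print("fraction within 1.96 standard errors:", round(within, 3))
```

Even though individual exponential values are strongly skewed, the sample means should cluster around µ with a spread close to σ/√N, and roughly 95% of them should fall within 1.96 standard errors.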

For our purposes, we assume that N is large enough for this approximation to hold (a common rule of thumb is N ≥ 30).

5.2 Standardized Normal Distribution and Z-Scores

To make calculations more convenient, sometimes we standardize the normal distribution, meaning that we convert the distribution to one that has µ = 0 and σ = 1. In order to do this, we first shift the distribution so that it is centered around zero (so, we subtract the mean) and then we divide by the standard deviation. This gives us the Z-score:

$$Z = \frac{X - \mu}{\sigma}.$$

Now, we can use this Z-score to tell us many things (quickly, and without any other calculations!) about where we sit on the distribution.

Example: Z < -1 means that we are more than one standard deviation to the left of the mean, and so, as we had calculated before: P(Z < -1) ≈ 0.16.

6 Hypothesis Testing

In doing research, we are often presented with a claim, which we then either prove or disprove by experiments. The same is true in statistics, and is called hypothesis testing. The original claim presented to you is called the null hypothesis, and the opposite of that claim, which you are trying to prove, is called the alternative hypothesis. The procedure for developing a hypothesis test is as follows:

1. Develop the null and alternative hypotheses. An important thing to consider is that the two hypotheses must encompass all possibilities and be mutually exclusive, that is, in every case one, and ONLY one, of the hypotheses is true.
2. Set an α-level. This determines your tolerance of Type 1 errors (false positives, discussed previously; in the case of hypothesis testing, a Type 1 error means rejecting the null hypothesis when it is true). A typical α-level is 0.05 (5%), but some stricter journals may require 0.01 (1%).
3. Once the α-level is established, you can calculate whether or not you can reject the null hypothesis at this level. With a high α, you run a higher risk of a Type 1 error: you may reject a null hypothesis that you would have failed to reject at a stricter α.
Example: Let's say we have our Blorgian population once again! We are presented with the claim that the proportion of Blorgians with blue toenails in the population is 29%. Set up your hypotheses as:

$$H_0: p = 0.29 \qquad H_1: p \neq 0.29$$

We set an α value of 0.05, which means we are willing to take a 5% chance of rejecting the null hypothesis when it is actually true. This corresponds to a critical z-score of 1.96 (We shouldn't try to memorise the big table of z-scores, it's probably impossible and definitely a huuuge waste of time. But this one is good to know!). Now we take a random sample with N = 100. In this sample, we find 33 sets of blue toenails (x = 33, $\hat{p}$ = 0.33). We now calculate the z-score:

$$z = \frac{\hat{p} - p}{\sqrt{\frac{p(1 - p)}{N}}} = \frac{0.33 - 0.29}{\sqrt{\frac{0.29(1 - 0.29)}{100}}} \approx 0.88.$$

Now, since the z we calculated is lower than our critical z-score, we fail to reject the null hypothesis!
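The z-score calculation above can be sketched in Python as follows. The function name `one_proportion_z` is an illustrative choice, and the critical value 1.96 is the two-sided cut-off for α = 0.05 quoted in the text.

```python
import math


def one_proportion_z(successes: int, n: int, p0: float) -> float:
    """z-score for an observed sample proportion against a claimed proportion p0."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)  # uses the claimed p, not p_hat
    return (p_hat - p0) / standard_error


Z_CRITICAL = 1.96  # two-sided test at alpha = 0.05

z = one_proportion_z(successes=33, n=100, p0=0.29)
decision = "reject H0" if abs(z) > Z_CRITICAL else "fail to reject H0"
print(f"z = {z:.2f}: {decision}")
```

Because the sample size only enters through the standard error, the same function can be reused to explore how the decision changes for other values of N.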

But what if we had taken a bigger sample? Let's consider N = 1000. In this sample, we find 330 sets of blue toenails (x = 330, $\hat{p}$ = 0.330). We now calculate the z-score:

$$z = \frac{\hat{p} - p}{\sqrt{\frac{p(1 - p)}{N}}} = \frac{0.330 - 0.29}{\sqrt{\frac{0.29(1 - 0.29)}{1000}}} \approx 2.79.$$

Here, we see that our z-score is now BIGGER than the critical value, and so this time we reject the null hypothesis. If the null hypothesis really is false, then taking a bigger sample saved us from a Type 2 error (failing to reject a false null hypothesis).

6.1 Confidence Intervals

So far, we have discussed point estimation, specifically, estimation of the mean. Another type of estimation is interval estimation, which attempts to provide a range of likely values, called a confidence interval. As with point estimation, we set an error level which we deem acceptable (generally, this is 5%, corresponding to a 95% confidence interval, meaning that we are 95% confident that the correct answer lies within the range we are suggesting). Note that the higher the acceptable error, the smaller the interval actually is! This may seem counter-intuitive at first, but consider that if you want to be 100% confident (thus have the smallest error, 0%) that you are in the right range, you would have to include all possible values (thus having the largest confidence interval possible).

Example: Let us consider again the same population as before, Blorgians with 29% blue toenails. Now we want to know the 95% confidence interval for a sample of size N = 100. Let's say the population standard deviation is σ = 3%. Then we can calculate the standard error by:

$$\sigma_E = \frac{\sigma}{\sqrt{N}} = \frac{3}{\sqrt{100}} = 0.3\%.$$

Now, for a 95% confidence interval, we know we will cover the range -1.96 ≤ z ≤ 1.96. This means we can go 1.96 standard errors in both directions from the mean. So, the 95% confidence interval is then given as:

29% ± (0.3)(1.96) ≈ 29% ± 0.6%.

But what if we didn't know the standard deviation of the population? In this case, we would use a Student's t-distribution instead of the normal, and a t-statistic instead of the z-score. For this course, the calculations will be exactly the same, only using different charts.
Note, however, that on the t-statistic chart there is an extra parameter called degrees of freedom, which is simply equal to N - 1.

6.2 Comparing Two Means (Two Sample t-test)

Sometimes we are given two samples and our task is to find out if there exists a statistically significant difference between them. The procedure is not so different from a one-sample t-test (which is not so different from a z-test) but we will work out an example anyways!

Example: Blorg's neighboring planet Glorf also has some subset of the population with multiple (and fractional) nipples. Everyone actually suspects that the Glorfites migrated over from Blorg and are the same species. Scientists determined that the only way to know for sure is to check if there is a statistically significant difference between the two populations in relation to nipple number. We pick two samples (N = 10 each), one Glorfite and one Blorgian. We observe that the Glorfite sample has on average 3.7 nipples, with a standard deviation of 0.3 (variance of 0.09), and the Blorgians have 3.9 nipples with a standard deviation of 0.3 (variance of 0.09). The closeness of the variances of the two samples is necessary for the two-sample t-test and is referred to as homogeneity of variance. We also must have that the two populations are normally, or close to normally, distributed. Now we begin the calculations!
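As a cross-check on the hand calculation for this example, here is a minimal Python sketch. It assumes the usual standard error for two independent samples, sqrt(s1²/n1 + s2²/n2), and takes the critical value 2.101 (two-sided, α = 0.05, 18 degrees of freedom) from a standard t table; the function name `two_sample_t` is illustrative.

```python
import math


def two_sample_t(mean1, var1, n1, mean2, var2, n2):
    """Return (t statistic, degrees of freedom) for H0: mu1 - mu2 = 0."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    t_stat = (mean1 - mean2 - 0.0) / standard_error  # observed - hypothesised
    degrees_of_freedom = n1 + n2 - 2
    return t_stat, degrees_of_freedom


# Blorgian sample first, Glorfite sample second (values from the example).
t_stat, df = two_sample_t(3.9, 0.09, 10, 3.7, 0.09, 10)

T_CRITICAL = 2.101  # two-sided, alpha = 0.05, df = 18 (standard t table)
decision = "reject H0" if abs(t_stat) > T_CRITICAL else "fail to reject H0"
print(f"t = {t_stat:.2f}, df = {df}: {decision}")
```

With equal sample sizes and equal variances, as here, the pooled and unpooled versions of the standard error coincide, so the choice between them does not affect the result.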

First we must calculate the difference between the two sample means:

$$\bar{x}_B - \bar{x}_G = 3.9 - 3.7 = 0.2.$$

The claim that has been made is that there is no difference between the two populations, so we state our null and alternative hypotheses as follows:

$$H_0: \mu_B - \mu_G = 0 \qquad H_1: \mu_B - \mu_G \neq 0.$$

The standard error of the difference of the means is given as:

$$SE_{M_1 - M_2} = \sqrt{\frac{s_1^2 + s_2^2}{N}} = \sqrt{\frac{0.09 + 0.09}{10}} \approx 0.134.$$

Next, we compute our t-statistic by:

$$t = \frac{\text{observed} - \text{hypothesised}}{\text{standard error}} = \frac{0.2 - 0}{0.134} \approx 1.49.$$

Using degrees of freedom 10 + 10 - 2 = 18, we see that a t-statistic of 1.49 corresponds to a p-value of approximately 0.15 for a two-sided t-test. This is higher than our cut-off of 0.05, and so we fail to reject the null. [Note, however, we used an extremely small sample size, and the result very well might not have been the same had we used more of the population.]

6.3 Comparing Multiple Means (ANOVA)

Sometimes, if we wish to compare multiple means (more than 2), we must consider a method other than the t-test. Technically, we could perform as many pairwise comparisons as needed to come to a conclusion, but this can be tiring and tedious. It also increases our chances of making a Type 1 error (because we have a chance to make one at every test), though it decreases our chance of making a Type 2 error (because with, say, four groups we have 6 chances, rather than 1, to reject the null hypothesis). We would like a single test which efficiently and easily performs a comparison between multiple means. Such a test is the ANOVA (or ANalysis Of VAriance). ANOVA can only determine whether at least one population mean is different from at least one other population mean, but not which mean is different. If we wish to find that out, we perform other (usually pairwise) tests called post-hoc tests after the ANOVA.

Example: The planet on the other side of Blorg, Flugle, is also suspected of being composed of migrated Blorgians. In addition to the samples above, we also pick 10 Fluglers, who have on average 3.2 nipples with a standard deviation of 0.3 (variance 0.09).
We state our hypotheses:

$$H_0: \mu_F = \mu_G = \mu_B \qquad H_1: \text{not all of the population means are equal.}$$

For ANOVA tests, we use a statistic called the F-statistic, which depends on several parameters including: number of groups r (here, 3), combined sample size N (here, 30), and α (here, 0.05 as usual). The critical value of F is denoted as $F_{(r-1,\,N-r,\,\alpha)} = F_{(2,\,27,\,0.05)}$. The first value, r - 1, is called the numerator degrees of freedom and the second, N - r, is called the denominator degrees of freedom. Our critical F-value is 3.35 (from the F-statistic table in the lecture slide appendix). The calculation of the F-statistic is somewhat complicated (and we won't work it out here) but we give the formula:

$$F = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\sum_i n_i (\bar{X}_i - \bar{X})^2 / (r - 1)}{\sum_{ij} (X_{ij} - \bar{X}_i)^2 / (N - r)}.$$

Here, r and N are as defined above, $n_i$ is the size of an individual group i, $\bar{X}_i$ is the mean of that group, $\bar{X}$ is the mean of the entire data set, and $X_{ij}$ is an individual observation (number j) in group i. Since this is a pretty tedious calculation, we won't do it out here, but let's assume that the F-value was less than the critical value of 3.35, and the Blorgians were correct in assuming that they are the sole source of intelligent life in their immediate surroundings.

7 Correlation and Regression

The final section (thank god!) has to do with correlation and regression, which are both methods to evaluate and quantify the relationship between two (quantitative) variables. One of the variables is called the dependent variable and the other the independent variable. The dependent variable is usually the factor we are measuring or interested in, such as disease prevalence or outcome of a treatment, while the independent variable is something we freely control, like dosage level or exposure to a carcinogen. Data points are usually graphically represented in a scatter plot, such as the one shown in Figure 5.

Figure 5: Scatter plot denoting cigarette use vs. kidney disease.

7.1 Correlation

In this section, we talk about Pearson's correlation (ρ in a population, r in a sample), which is defined as a measure of the strength of the linear relationship between two variables. If a relationship between two variables exists but is not linear, then this coefficient may not be adequate to describe the relation. This coefficient has a value between -1 and 1, with r = -1 denoting a perfect negative relationship between the two variables, r = 1 denoting a perfect positive relationship between the variables, and r = 0 denoting that there is no (linear) relationship.

7.2 Simple Linear Regression

Going one step farther than correlation, regression is used to describe a functional relationship between two variables by fitting a line to bivariate data points.
The equation denoting a relationship between variables x and y is given as:

$$y = a + bx,$$

where x is the independent and y is the dependent variable, b is the slope of the line, and a is the y-value at which the line crosses the y-axis (the intercept). Since there is almost no way that a single line will go perfectly through all points, there will be some distance between the points and the line. We call this the residual, and calculate it by:

residual = observed y - predicted y

The least squares line is the one which minimizes the sum of the squared residuals. To calculate the parameters a and b for this line, we use the following formulas:

$$b = r\left(\frac{s_y}{s_x}\right), \qquad a = \bar{y} - b\bar{x},$$

where $s_y$ and $s_x$ are the sample standard deviations of y and x, r is the correlation coefficient, and $\bar{x}$ and $\bar{y}$ are the sample means.

7.3 Multiple Linear Regression

Often, there are multiple factors that affect a certain outcome. In this case, we need to consider more than one independent variable, and so we perform multiple linear regression. In this course, we won't really be concerned much with multiple linear regression except to note how changing each independent variable affects the dependent variable.

Example: Attractiveness (A) on Blorg is a combination of three factors: number of nipples (n), how blue one is (which Blorgians rate on a continuous scale: 0 ≤ b ≤ 10), and intelligence (which Blorgians also rate on a continuous scale: 0 ≤ i ≤ 10). The relationship is given by the following equation:

$$A = 2.1n - 2.3b + 0.8i$$

(Intelligence is not that important to the Blorgians.)

From this equation, we can see how changes in any of these attributes affect attractiveness. For example, if one loses a nipple (somehow), one's attractiveness goes down by 2.1 units. Conversely, if one were to find that nipple someone else lost, then that person's attractiveness would increase by 2.1 units. In another example, if a Blorgian fell into a tub of permanent paint (which exists on Blorg, I guess) and became less blue by 3 units, his/her attractiveness would increase (because there is a negative sign before the blueness term) by 6.9 units! We can do this for any number of independent variables. Most likely anything more complicated would be done using software, which we have not learned in this term, so don't worry about more complicated problems!
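The effect of changing one independent variable at a time can be sketched in a few lines of Python. The coefficients come from the attractiveness equation above; the particular Blorgian (4 nipples, blueness 6, intelligence 5) is a hypothetical starting point chosen only for illustration.

```python
# Coefficients from the equation A = 2.1n - 2.3b + 0.8i.
COEFFICIENTS = {"nipples": 2.1, "blueness": -2.3, "intelligence": 0.8}


def attractiveness(nipples: float, blueness: float, intelligence: float) -> float:
    """Evaluate the multiple linear regression equation for one Blorgian."""
    return (
        COEFFICIENTS["nipples"] * nipples
        + COEFFICIENTS["blueness"] * blueness
        + COEFFICIENTS["intelligence"] * intelligence
    )


# Hypothetical Blorgian used as the baseline.
before = attractiveness(nipples=4, blueness=6, intelligence=5)

# Change one variable at a time, holding the others fixed.
change_from_paint = attractiveness(nipples=4, blueness=3, intelligence=5) - before
change_from_loss = attractiveness(nipples=3, blueness=6, intelligence=5) - before

print("3 units less blue:", round(change_from_paint, 1))
print("one nipple lost:  ", round(change_from_loss, 1))
```

Whatever baseline is chosen, the change depends only on the coefficient of the variable that moved, which is exactly how the coefficients are read in the example above.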
Thanks for reading! Again, please send me any corrections that you find!!


STA 101 Final Review STA 101 Final Review Statistics 101 Thomas Leininger June 24, 2013 Announcements All work (besides projects) should be returned to you and should be entered on Sakai. Office Hour: 2 3pm today (Old Chem

More information

STA1000F Summary. Mitch Myburgh MYBMIT001 May 28, Work Unit 1: Introducing Probability

STA1000F Summary. Mitch Myburgh MYBMIT001 May 28, Work Unit 1: Introducing Probability STA1000F Summary Mitch Myburgh MYBMIT001 May 28, 2015 1 Module 1: Probability 1.1 Work Unit 1: Introducing Probability 1.1.1 Definitions 1. Random Experiment: A procedure whose outcome (result) in a particular

More information

Chapter 6 The Standard Deviation as a Ruler and the Normal Model

Chapter 6 The Standard Deviation as a Ruler and the Normal Model Chapter 6 The Standard Deviation as a Ruler and the Normal Model Overview Key Concepts Understand how adding (subtracting) a constant or multiplying (dividing) by a constant changes the center and/or spread

More information

2011 Pearson Education, Inc

2011 Pearson Education, Inc Statistics for Business and Economics Chapter 2 Methods for Describing Sets of Data Summary of Central Tendency Measures Measure Formula Description Mean x i / n Balance Point Median ( n +1) Middle Value

More information