DATA ANALYSIS. Faculty of Civil Engineering


1 DATA ANALYSIS Faculty of Civil Engineering

2 DATA

3 DATA - Introduction Data is a collection of facts, such as numbers, words, measurements, observations or even just descriptions of things. Qualitative data is descriptive information (it describes something). Quantitative data is numerical information (numbers).

4 DATA - Introduction Quantitative data can also be discrete or continuous. Discrete data can only take certain values (like whole numbers). Continuous data can take any value (within a range).

5 Data Analysis - Introduction A process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different disciplines.

6 Data Analysis - Introduction Data analysis is about manipulating and presenting results. Data need to be organised, summarised and analysed in order to draw or infer conclusions.

7 Data Analysis - Processes Data requirements. Data collection. Data processing. Data cleaning. Exploratory data analysis. Modelling and algorithms. Results and report.

8 Sources of Data Lab experimentation; survey; census; theoretical analysis; numerical analysis; software; other researchers' data.

9 Example Analysis Results Estimation of parameter mean values. Estimation of parameter variability. Comparison of parameter mean values. Comparison of parameter variability. Modelling the dependence of a dependent variable on several quantitative and qualitative independent variables.

10 Data Processing Data initially obtained must be processed or organized for analysis. For instance, this may involve placing data into rows and columns in a table format for further analysis, such as within a spreadsheet or statistical software.

11 Data Cleaning The data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data is entered and stored. Data cleaning is the process of preventing and correcting these errors.

12 Data Checking Before doing data analysis and interpretation, watch for invalid data using a suitable data checking procedure. Bad data should be weeded out continuously throughout the data gathering process, because bad data can bias results and interpretation. Repeat the data gathering or experimentation if suspicious data exist.

13 Exploratory Data Analysis Once the data is cleaned, it can be analyzed. Analysts may apply a variety of techniques referred to as exploratory data analysis to begin understanding the messages contained in the data. The process of exploration may result in additional data cleaning or additional requests for data, so these activities may be iterative in nature.

14 Trial Test Do a simple trial test: to ensure that all parts of the testing setup function well; to determine the range of measurements to be taken; to anticipate the time taken for each step in the experiment; to see the error.

15 Error (Uncertainty) Writing a measurement result with ± e does not mean that we have made an error. It expresses the uncertainty due to the limits of the equipment and of the experimental technique.

16 Error (Uncertainty) Case 1: Theory predicts a deflection of 5 mm; in the experiment the deflection is 5.5 mm. Does this mean that the theory is wrong? First ask what the error limit is. If the error limit is ±0.75 mm, the difference of 0.5 mm lies within it, so the measurement is consistent with the theory.

17 Error (Uncertainty) Case 2: Two experimentalists measure the same time duration. The first researcher gives the result as 20.4 ± 0.4 sec, while the second researcher gives 19.8 ± 0.8 sec. Do their results contradict each other?

18 Error (Uncertainty) No, their results actually overlap: the first range is 20.0 to 20.8 sec and the second is 19.0 to 20.6 sec. However, we are more confident in the first result because its error is half that of the second, which suggests that the measurement was done more carefully.
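
The two cases above can be checked with a few lines of Python. This is only a sketch: the helper names `interval` and `overlaps` are made up for illustration, and the ±0.75 limit in Case 1 is assumed to be in mm.

```python
def interval(value, error):
    """Return the (low, high) range implied by value ± error."""
    return (value - error, value + error)


def overlaps(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Case 2: two time measurements with different uncertainties
first = interval(20.4, 0.4)    # roughly 20.0 to 20.8 sec
second = interval(19.8, 0.8)   # roughly 19.0 to 20.6 sec
consistent = overlaps(first, second)

# Case 1: theory 5 mm, measured 5.5 mm, error limit ±0.75 (assumed mm)
theory, measured, limit = 5.0, 5.5, 0.75
within_limit = abs(measured - theory) <= limit

print(consistent, within_limit)
```

Both checks come out true, which matches the conclusions drawn on the slides.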

19 Analysis & Interpretation Mathematical formulas or numerical models called algorithms may be applied to the data to identify relationships among the variables. Numerical models: using software Statistical analysis: using software

20 STATISTICAL ANALYSIS

21 What is Statistics The science of collecting and analyzing data. It is about the whole process of using the scientific method to answer questions and make decisions.

22 What is Statistics The process involves designing studies, collecting good data, describing the data with numbers and graphs, analyzing the data, and then drawing conclusions.

23 Statistical Analysis 1) Designing studies 2) Collecting & selecting data 3) Describing data 4) Analyzing data 5) Drawing conclusions

24 Designing Studies Once a research question is defined, the next step is designing a study in order to answer that question. Figure out what process would be used to get the data we need.

25 Designing Studies An observational study could be a survey. Surveys are questionnaires that are presented to individuals who have been selected from a population of interest. Another widely used type of observational study is based on nature, such as wildlife, geology, hydrology, meteorology, environment, etc.

26 Designing Studies Experiments take place in a controlled setting, and are designed to minimize biases that might occur. It is perhaps most important to note that no matter what the study, it has to be designed so that the original questions can be answered in a credible way.

27 Collecting & selecting data If you select your subjects in a way that is biased (that is, favoring certain individuals or groups of individuals), then the results will also be biased. Experiments and observational studies that use instrumentation are sometimes even more challenging when it comes to collecting data, because something may happen during the experiment to distract the subjects or the researchers.

28 Describing Data Once data are collected, the next step is to summarize it all to get a handle on the big picture. Statisticians describe data in two major ways: with pictures (that is, chart & graph) and with numbers, called descriptive statistics.

29 CHARTS AND GRAPHS

30 Charts and Graphs Line graphs for trend & behaviour Time charts for time series data Scatter graphs for relationships Pie charts & bar charts for categorical data Histogram & box plots for numerical data

31 Line Graphs A powerful tool to explain results in terms of cause and effect. The horizontal x-axis is normally used for the independent variable (the cause or controlled variable). The vertical y-axis is normally used for the dependent variable (the effect). Used to describe development or progression, and to show trend, response or behaviour in data.

32 Line Graph [Figure: example line graph]

33 Time Charts A time chart, which is a type of line graph, is used to examine trends over time. Typically a time chart has some unit of time on the horizontal axis (year, day, month, and so on) and a measured quantity on the vertical axis (income, birth rate, total sales, and so on).

34 Time Chart [Figure: time chart of total sales against time]

35 Time Chart [Figure: example time chart]

36 Scatter Graphs Useful to present many data values, to show correlations between two variables, and to draw conclusions about relationships in the data.

37 Scatter Graphs [Figure: scatter graph of Y against X]

38 Pie Charts Present data in segments and convey a simple, straightforward proportion for each category. A pie chart takes categorical data and shows the percentage of individuals that fall into each category. Each segment is presented in terms of percentage, and a pie chart can only be used with one data set.

39 Bar Charts An effective way of presenting frequencies. Common in reports of small-scale research. The bar height represents quantity or amount. The number of bars represents the categories. Often used to compare groups by showing them side by side. Visually striking and simple to read.

40 Bar Charts [Figure: example bar chart]

41 Histogram The statistician's graph of choice for numerical data, providing a quick way to get the big idea about a numerical data set. A histogram is a graphical display of tabulated frequencies, that is, a graphical version of a table that shows what proportion of cases fall into each of several or many specified categories.

42 Histogram A histogram is the most important graphical tool for exploring the shape of data distributions (Scott, 1992). The shape seen in the histogram puts the type of distribution into view. A histogram can be constructed by plotting the frequency of observations against the class midpoints of the data.

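43 Number of Class Intervals Rule of thumb (Sturges's rule) for choosing an appropriate class width: $k = 1 + 3.322 \log_{10}(n)$ and $a = (x_{\max} - x_{\min}) / k$, where k is the number of class intervals, a is the bin width (the width of each class interval), n is the number of observations (data), and $\log_{10}(n)$ is the base-10 logarithm of the number of observations. According to Sturges's rule, 1000 observations would be graphed with 11 class intervals, since $1 + 3.322 \times 3 \approx 10.97$.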
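
As a quick check of the rule of thumb, the sketch below computes the number of class intervals, assuming the common form of Sturges's rule, k = 1 + 3.322 log10(n), rounded to the nearest whole number. The function name `sturges_classes` is illustrative only.

```python
import math


def sturges_classes(n):
    """Number of class intervals by Sturges's rule (assumed form:
    k = 1 + 3.322 * log10(n), rounded to the nearest integer)."""
    return round(1 + 3.322 * math.log10(n))


k_1000 = sturges_classes(1000)
print(k_1000)  # 11, matching the slide's example
```

The class width then follows by dividing the data range by this number of classes.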

44 Histogram [Figure: example histogram]

45 Histogram - tips If there are too few classes, it is difficult to see how the data vary. If there are too many classes, then the table is less of a summary

46 Histogram tells three features How the data are distributed (symmetric, skewed right, skewed left, bell-shaped). The amount of variability in the data. Where the center of the data is (approximately).

47 Histogram tells three shapes Symmetric: the left-hand side of the histogram is a mirror image of the right-hand side. Skewed right: it looks like a lopsided mound with one long tail going off to the right. Skewed left: it looks like a lopsided mound with one long tail going off to the left.

48 Histogram tells variability If a histogram is quite flat with the bars close to the same height, it indicates high variability. A histogram with a big lump in the middle and tails on the sides indicates more data in the middle bars than in the outer bars, so the data are closer together and show less variability.

49 Histogram tells center A histogram can also give you a rough idea of where the center of the data lies. To visualize the mean, picture the histogram balanced on a fulcrum: the mean is the point where the fulcrum has to be in order to balance the weight on each side.

50 Boxplot A boxplot is a one-dimensional graph of numerical data based on the five-number summary, which includes the minimum value, the 25th percentile (known as Q1), the median, the 75th percentile (Q3), and the maximum value. In essence, these five descriptive statistics divide the data set into four parts, each containing about a quarter of the data.

51 Making a Boxplot 1) Find the five-number summary of the data set. 2) Create a horizontal number line whose scale includes the numbers in the five-number summary. 3) Label the number line using appropriate units of equal distance from each other.

52 Making a Boxplot Exam score example, five-number summary: 43: Minimum; 68: 25th percentile; 77: Median; 89: 75th percentile; 99: Maximum.

53 Making a Boxplot 4) Mark the location of each number in the five-number summary just above the number line. 5) Draw a box around the marks for the 25th percentile and the 75th percentile. 6) Draw a line in the box where the median is located. 7) Draw lines from the outside edges of the box out to the minimum and maximum values of the data set.

54 Making a Boxplot [Figure: boxplot of the exam scores, annotated with steps 5 to 7]

55 Interpreting a Boxplot A boxplot can show information about the distribution, variability, and center of a data set. Symmetric data shows a symmetric boxplot. Skewed data show a lopsided boxplot, where the median cuts the box into two unequal pieces.

56 Interpreting a Boxplot If the longer part of the box is to the right (or above) the median, the data is said to be skewed right. If the longer part is to the left (or below) the median, the data is skewed left.

57 Interpreting a Boxplot In the exam score boxplot, the upper part of the box is wider than the lower part. This means that the data between the median (77) and Q3 (89) are a little more spread out, or variable, than the data between the median (77) and Q1 (68).

58 Interpreting a Boxplot Variability in a data set is measured by the interquartile range (IQR). The IQR is equal to Q3 − Q1. A large distance from the 25th percentile to the 75th percentile indicates the data are more variable. The IQR ignores data below the 25th or above the 75th percentile, which may contain outliers.

59 Interpreting a Boxplot The median is part of the five-number summary, and is shown by the line that cuts through the box in the boxplot. The mean, however, is not part of the boxplot. A common misinterpretation of a boxplot is that the bigger the box, the more data it contains. In fact, a bigger part of the box means there is more variability (a wider range of values) in that part.

60 DESCRIPTIVE STATISTICS

61 Summarizing Data Descriptive statistics are numbers that summarize some characteristic of a set of data. Summarizing data by numerical measures makes a point clearly and concisely. Common measures: mean, median, mode, standard deviation, variance, coefficient of variation, skewness, kurtosis.

62 Sample Mean The sample mean is defined as the sum of the observed values of the variable x divided by the number of observed values: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

63 Sample Median The sample median of a variable x is defined as the middle value when the n sample observations of x are ranked in increasing order of magnitude.

64 Sample Median S = 1,6,3,8,2,4,9 We need to find the value x where half of the values are above x and half of the values are below x. Rearranged, S = 1,2,3,4,6,8,9. The median is 4.

65 Sample Mode The sample mode of a variable x is defined as the value with the highest frequency. The mode of a data set is the value that occurs most often or, in other words, has the highest probability of occurring.

66 Sample Mode Sometimes we can have two, three, or more values that have relatively large probability of occurrence. In such cases, we say that the distribution is bimodal, tri-modal or multimodal, respectively.

67 Sample Mode Consider the rolls of a ten-sided die: R = 2,8,1,9,5,2,7,2,7,9,4,7,1,5,2 The number that appears the most is the number 2. Therefore the mode of set R is the number 2

68 Sample Mode Consider the rolls of a ten-sided die: R = 2,8,1,9,5,2,7,2,7,9,4,7,1,5,2 Note that if the number 7 had appeared one more time, it would have been present four times as well. In this case, we would have had a bimodal distribution, with 2 and 7 as the modes.
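
The median and mode examples above can be reproduced with Python's standard `statistics` module. This is a small sketch using the slide data; the variable names are illustrative.

```python
import statistics

# Median example from the slides
S = [1, 6, 3, 8, 2, 4, 9]
median_S = statistics.median(S)          # 4

# Mode example: rolls of a ten-sided die
R = [2, 8, 1, 9, 5, 2, 7, 2, 7, 9, 4, 7, 1, 5, 2]
modes_R = statistics.multimode(R)        # [2]

# If 7 had appeared one more time, the data would be bimodal
modes_R_extra = statistics.multimode(R + [7])   # [2, 7]

print(median_S, modes_R, modes_R_extra)
```

`multimode` returns every value tied for the highest frequency, which makes the bimodal case visible directly.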

69 Mean Median Mode When to use mean, median & mode? Mean for normally distributed data (symmetrical distribution). Median & Mode for markedly skewed data.

70 Measures of Dispersion Consider the following data sets: S = 5,5,5,5,5,5 and R = 0,0,0,10,10,10. If we calculated the mean for both S and R, we would get the number 5. However, these are two vastly different types of data sets.

71 Measures of Dispersion Therefore, we need another descriptive statistic besides a measure of central tendency, which we shall call a measure of dispersion. We shall measure the dispersion or scatter of the values of our data set about the mean of the data set.

72 Measures of Dispersion If the values tend to be concentrated near the mean, then this measure shall be small, while if the values of the data tend to be distributed far from the mean, then the measure will be large. The two measures of dispersions that are usually used are called the variance and standard deviation.

73 Variance and Std Deviation A quantity of great importance in probability and statistics is the variance. The variance, denoted by σ², for a set of n numbers $x_1, x_2, \ldots, x_n$ with mean μ is given by $\sigma^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j - \mu)^2$.

74 Variance and Std Deviation The variance is a nonnegative number. The positive square root of the variance (σ²) is called the standard deviation (σ). Find the variance and std deviation for the following set of test scores: T = 75, 80, 82, 87, 96. From the data set, μ = 84.

75 Variance and Std Deviation T = 75, 80, 82, 87, 96; from the data set, μ = 84. Variance: $\sigma^2 = \frac{(-9)^2 + (-4)^2 + (-2)^2 + 3^2 + 12^2}{5} = \frac{254}{5} = 50.8$. Std deviation: $\sigma = \sqrt{50.8} \approx 7.13$.

76 Variance & Std Deviation It is also widely accepted to divide by (n−1) as opposed to n: $s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$, $s = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2}$.
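
The test-score example can be verified with the standard `statistics` module, which offers both the divide-by-n (population) and divide-by-(n−1) (sample) versions. A minimal sketch:

```python
import statistics

T = [75, 80, 82, 87, 96]

mu = statistics.mean(T)             # 84
pop_var = statistics.pvariance(T)   # divides by n: 50.8
pop_std = statistics.pstdev(T)      # square root of 50.8, about 7.13
sample_var = statistics.variance(T) # divides by n - 1: 63.5
sample_std = statistics.stdev(T)

print(mu, pop_var, round(pop_std, 2), sample_var, round(sample_std, 2))
```

For small data sets the two conventions differ noticeably (50.8 against 63.5 here); the gap shrinks as n grows.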

77 Percentiles It is often convenient to subdivide your ordered data set by use of ordinates so that the number of data points less than the ordinate is some percentage of the total number of observations. The values corresponding to such areas are called percentile values, or briefly, percentiles.

78 Percentiles For example, the proportion of scores that fall below the ordinate at $x_{\alpha}$ is α. For instance, the proportion of scores less than $x_{0.10}$ would be 0.10 or 10%, and $x_{0.10}$ would be called the 10th percentile.

79 Percentiles Another example is the median. Since half the data points fall below the median, it is the 50th percentile (or fifth decile), and can be denoted by $x_{0.50}$.

80 Percentiles The 25 th percentile is often thought of as the median of the scores below the median, and the 75 th percentile is often thought of as the median of the scores above the median.

81 Percentiles The 25 th percentile is called the first quartile, while the 75 th percentile is called the third quartile. The median is also known as the second quartile.

82 Interquartile Range Another measure of dispersion is the interquartile range (IQR). The interquartile range is defined to be the first quartile subtracted from the third quartile: $\text{IQR} = x_{0.75} - x_{0.25}$.

83 Interquartile Range Find the interquartile range of the following data set: S = (67, 69, 70, 71, 74, 77, 78, 82, 89). The median is 74. The first quartile, $x_{0.25}$, is the median of the scores below the fifth position, that is, the average of the second and third scores, which leads to $x_{0.25} = 69.5$.

84 Interquartile Range (IQR) The third quartile, $x_{0.75}$, is the median of the scores above the fifth position, that is, the average of the seventh and eighth scores, which leads to $x_{0.75} = 80$. The interquartile range is $x_{0.75} - x_{0.25} = 80 - 69.5 = 10.5$. The semi-interquartile range is $0.5(x_{0.75} - x_{0.25})$, which leads to 5.25.
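
The worked example uses the "median of each half" convention for quartiles. The sketch below implements that convention directly; the function name is made up, and other software may use a different quartile convention and give slightly different values.

```python
import statistics


def quartiles_median_of_halves(data):
    """Q1, median, Q3 using the median of the lower and upper halves.
    For an odd number of points the middle value is left out of both halves."""
    x = sorted(data)
    half = len(x) // 2
    lower = x[:half]
    upper = x[half + 1:] if len(x) % 2 else x[half:]
    return statistics.median(lower), statistics.median(x), statistics.median(upper)


S = [67, 69, 70, 71, 74, 77, 78, 82, 89]
q1, med, q3 = quartiles_median_of_halves(S)
iqr = q3 - q1
semi_iqr = 0.5 * iqr

print(q1, med, q3, iqr, semi_iqr)  # 69.5 74 80.0 10.5 5.25
```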

85 Skewness Skewness describes how the scores in a data set are distributed. We might have a symmetrical data set, or a data set that is evenly distributed, or a data set with more high values than low values.

86 Skewness Often a distribution is not symmetric about any value. If it has a few more higher values, so that its longer tail goes off to the right, it is said to be skewed to the right. If the data set has a few more lower values, so that its longer tail goes off to the left, it is said to be skewed to the left.

87 Skewed [Figure: distributions skewed to the left and skewed to the right]

88 Kurtosis Kurtosis (from the Greek word κυρτός, kyrtos or kurtos, meaning bulging) is a measure of the "peakedness" of the probability distribution of a real-valued random variable. Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. Higher kurtosis means more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly-sized deviations.

89 Kurtosis A distribution having a relatively high peak is called leptokurtic, while a curve which is flat-topped is called platykurtic. The normal distribution, which is neither very peaked nor very flat-topped, is called mesokurtic.

90 Kurtosis [Figure: (a) Leptokurtic (b) Platykurtic (c) Mesokurtic]

91 Kurtosis [Figure: kurtosis illustration]

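92 Moments If $X_1, X_2, \ldots, X_N$ are the N values assumed by the variable X, we define the quantity $\overline{X^r} = \frac{X_1^r + X_2^r + \cdots + X_N^r}{N} = \frac{\sum_{j=1}^{N} X_j^r}{N} = \frac{\sum X^r}{N}$, called the r-th moment. The first moment, with r = 1, is the arithmetic mean $\bar{X}$.

93 Moments The r-th moment about the mean $\bar{X}$ is defined as: $m_r = \frac{\sum_{j=1}^{N}(X_j - \bar{X})^r}{N} = \frac{\sum (X - \bar{X})^r}{N} = \overline{(X - \bar{X})^r}$. If r = 1, $m_1 = 0$. If r = 2, $m_2 = s^2$, the variance.

94 Moments The r-th moment about any origin A is defined as: $m'_r = \frac{\sum_{j=1}^{N}(X_j - A)^r}{N} = \frac{\sum (X - A)^r}{N} = \frac{\sum d^r}{N} = \overline{(X - A)^r}$, where d = X − A are the deviations of X from A.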
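
A single helper covers all three definitions, since the moment about zero and the moment about the mean are special cases of the moment about an origin A. The sketch below reuses the test scores T from the variance slides purely as illustrative data; `moment_about` is a hypothetical name.

```python
def moment_about(values, r, origin=0.0):
    """r-th moment of the values about the given origin A:
    sum((X - A)**r) / N. With origin 0 this is the plain r-th moment."""
    return sum((x - origin) ** r for x in values) / len(values)


T = [75, 80, 82, 87, 96]

mean = moment_about(T, 1)        # first moment about 0 is the mean: 84.0
m1 = moment_about(T, 1, mean)    # first moment about the mean: 0.0
m2 = moment_about(T, 2, mean)    # second moment about the mean: the variance

print(mean, m1, m2)
```

The second moment about the mean reproduces the divide-by-N variance of 50.8 found earlier.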

95 PROBABILITY DISTRIBUTION

96 Probability The classical definition of probability: suppose an event E can happen in h ways out of a total of n possible equally likely ways. Then the probability of occurrence of the event (called its success) is denoted by: $p = \Pr\{E\} = \frac{h}{n}$.

97 Probability The probability of non-occurrence of the event (called its failure) is denoted by: $q = \Pr\{\text{not } E\} = \frac{n - h}{n} = 1 - \frac{h}{n} = 1 - p = 1 - \Pr\{E\}$. Thus p + q = 1, or Pr{E} + Pr{not E} = 1.

98 Probability Distribution Discrete probability distribution: if a variable X can assume a discrete set of values $X_1, X_2, \ldots, X_K$ with respective probabilities $p_1, p_2, \ldots, p_K$, where $p_1 + p_2 + \cdots + p_K = 1$, we say that a discrete probability distribution for X has been defined. In the discrete case, by cumulating probabilities, we obtain cumulative probability distributions.

99 Probability Distribution The function p(X), which has the respective values $p_1, p_2, \ldots, p_K$ for $X = X_1, X_2, \ldots, X_K$, is called the probability function or frequency function of X. Because X can assume certain values with given probabilities, it is often called a discrete random variable. A random variable is also known as a chance variable or stochastic variable.

100 Probability Distribution Continuous probability distribution: if the variable X may assume a continuous set of values, the relative frequency polygon of a sample becomes, in the theoretical or limiting case of a population, a continuous curve such as the one shown in the figure.

101 Probability Distribution The equation of the curve is Y = p(X), and the total area under the curve bounded by the X axis is equal to one. [Figure: curve p(X) against X, with the area between X = a and X = b shaded]

102 Probability Distribution The area under the curve between the lines X = a and X = b (shaded in the figure) gives the probability that X lies between a and b, which can be denoted by Pr{a < X < b}. We call p(X) a probability density function. The variable X is called a continuous random variable.

103 Mathematical Expectation If p is the probability that a person will receive a sum of money S, the mathematical expectation, or simply the expectation, is defined by pS. If the probability that a man wins a RM100 prize is 1/5, his expectation is: $\frac{1}{5}(\text{RM}100) = \text{RM}20$.

104 Mathematical Expectation If X denotes a discrete random variable which can assume the values $X_1, X_2, \ldots, X_K$ with respective probabilities $p_1, p_2, \ldots, p_K$, where $p_1 + p_2 + \cdots + p_K = 1$, the mathematical expectation of X, or simply the expectation of X, denoted by E(X), is defined as: $E(X) = p_1 X_1 + p_2 X_2 + \cdots + p_K X_K = \sum_{j=1}^{K} p_j X_j = \sum pX$.
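
The definition of E(X) translates directly into a weighted sum. The sketch below applies it to the RM100 prize example, treating "no prize" as an outcome worth 0 with probability 4/5 (an assumption made here so the probabilities sum to one); the function name is illustrative.

```python
def expectation(values, probs):
    """E(X) = sum of p_j * X_j for a discrete random variable."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(values, probs))


# Prize example: RM100 with probability 1/5, otherwise nothing (assumed)
prize_expectation = expectation([100, 0], [1 / 5, 4 / 5])

# A fair six-sided die as a second, hypothetical example
die_expectation = expectation([1, 2, 3, 4, 5, 6], [1 / 6] * 6)

print(prize_expectation, die_expectation)
```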

105 Binomial Distribution Consider an experiment such as tossing a coin or die repeatedly, where each toss or selection is called a trial. In any single trial there will be a probability associated with a particular event, such as a head on the coin or a four on the die. Such trials are said to be independent and are often called Bernoulli trials. The binomial distribution is a discrete distribution.

106 Binomial Distribution Let p be the probability that an event will happen in any single Bernoulli trial (called the probability of success). Then q = 1 − p is the probability that the event will fail to happen in any single trial (called the probability of failure).

107 Binomial Distribution The probability that the event happens exactly x times in n trials is $f(x) = P(X = x) = \binom{n}{x} p^x q^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^x q^{n-x}$, for x = 0, 1, ..., n. Mean: $\mu = np$. Variance: $\sigma^2 = np(1-p)$. Standard deviation: $\sigma = \sqrt{np(1-p)}$.

108 Binomial Distribution Toss a fair coin 100 times, and count the number of heads that appear. Find the mean, variance, and standard deviation of this experiment. In 100 tosses of a fair coin, the expected or mean number of heads is μ = (100)(0.5) = 50. Variance: σ² = 100(0.5)(0.5) = 25. Std deviation: $\sigma = \sqrt{(100)(0.5)(0.5)} = 5$.
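
The coin-toss example can be checked numerically. This sketch computes the binomial probabilities with `math.comb` and confirms that the closed-form mean and variance agree with the values obtained by summing over the distribution; the function name `binomial_pmf` is illustrative.

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x) for n independent Bernoulli trials with success probability p."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 100, 0.5

mean = n * p                  # 50
variance = n * p * (1 - p)    # 25
std_dev = math.sqrt(variance) # 5

# Cross-check against the distribution itself
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
mean_from_pmf = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))

print(mean, variance, std_dev, round(total, 6), round(mean_from_pmf, 6))
```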

109 Poisson Distribution A discrete distribution. Let X be a discrete random variable that can take on the values 0, 1, 2, ... such that the probability function of X is given by $f(x) = P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$, x = 0, 1, 2, ..., where λ is a given positive constant.

110 Poisson Distribution A random variable having this distribution is said to be Poisson distributed. The values of the Poisson distribution can be obtained using a table (available in statistics textbooks), which gives values of $e^{-\lambda}$ for various values of λ.

111 Poisson Distribution For $f(x) = P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$, x = 0, 1, 2, ...: mean $\mu = \lambda$; variance $\sigma^2 = \lambda$; standard deviation $\sigma = \sqrt{\lambda}$.
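
Instead of a printed table, the Poisson probabilities can be computed directly. The sketch below uses a hypothetical rate λ = 3 (not from the slides) and truncates the infinite sum at x = 59, where the remaining terms are negligible, to check that the mean and variance both come out equal to λ.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) = lam**x * exp(-lam) / x! for x = 0, 1, 2, ..."""
    return lam**x * math.exp(-lam) / math.factorial(x)


lam = 3.0                      # hypothetical rate, chosen for illustration
xs = range(60)                 # truncation point; the tail beyond is negligible
probs = [poisson_pmf(x, lam) for x in xs]

total = sum(probs)
mean = sum(x * p for x, p in zip(xs, probs))
variance = sum((x - mean) ** 2 * p for x, p in zip(xs, probs))

print(round(total, 6), round(mean, 6), round(variance, 6))
```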

112 Normal Distribution One of the most important examples of a continuous probability distribution is the normal distribution, sometimes called the Gaussian distribution. It is very important and comes up quite often in practice.

113 Normal Distribution The density function for this distribution is given by: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$, $-\infty < x < \infty$, where μ = mean; σ = std deviation; π = 3.14159...; e = 2.71828...

114 Normal Distribution The total area bounded by the curve $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$ and the X axis is one. Hence the area under the curve between two ordinates X = a and X = b, where a < b, represents the probability that X lies between a and b, denoted by Pr{a < X < b}.

115 Normal Distribution The corresponding distribution function is given by: $F(x) = P(X \le x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-(v-\mu)^2/(2\sigma^2)}\, dv$. If X has the distribution function listed above, then we say that the random variable X is normally distributed with mean μ and variance σ².

116 Normal Distribution If we let Z be the random variable defined by $Z = \frac{X - \mu}{\sigma}$, then Z is called the standard variable corresponding to X. The mean or expected value of Z is 0 and the std deviation is 1.

117 Normal Distribution The density function for Z can be obtained from the definition of a normal distribution by setting μ = 0 and σ² = 1: $f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$. This is often referred to as the standard normal density function.

118 Normal Distribution The corresponding distribution function is given by: $F(z) = P(Z \le z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-u^2/2}\, du$. We sometimes call the value z of the standardized variable Z the standard score. A graph of the standard normal density function is sometimes called the standard normal curve.

119 Normal Distribution The standard normal curve indicates the areas within 1, 2, and 3 standard deviations of the mean, i.e. between z = −1 and +1, z = −2 and +2, and z = −3 and +3, as equal, respectively, to 68.27%, 95.45% and 99.73% of the total area, which is one. This means that: $P(-1 \le Z \le 1) = 0.6827$, $P(-2 \le Z \le 2) = 0.9545$, $P(-3 \le Z \le 3) = 0.9973$.
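
These three areas can be reproduced without a table, using the identity P(−k ≤ Z ≤ k) = erf(k/√2) for the standard normal variable. The sketch below also standardizes a raw value; the mean 84 and standard deviation 7.13 are borrowed from the test-score example purely for illustration.

```python
import math


def area_within(k):
    """P(-k <= Z <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))


def z_score(x, mu, sigma):
    """Standard score Z = (X - mu) / sigma."""
    return (x - mu) / sigma


areas = {k: round(area_within(k), 4) for k in (1, 2, 3)}
print(areas)  # {1: 0.6827, 2: 0.9545, 3: 0.9973}

# Illustrative standardization with assumed mu = 84, sigma = 7.13
z = z_score(96, 84, 7.13)
print(round(z, 2))
```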

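120 Standard Normal Curve [Figure: standard normal curve f(z) against z, showing the areas 68.27%, 95.45% and 99.73% within 1, 2 and 3 standard deviations of the mean]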

121 Normal Distribution A table giving the areas under the curve bounded by the ordinates at z = 0 and any positive value of z is available in all statistics textbooks. From this table the areas between any two ordinates can be found by using the symmetry of the curve about z = 0.

122 Normal Distribution Approximately 68% of the area under any normal distribution curve lies within one standard deviation of the mean. Approximately 95% of the area under any normal distribution curve lies within two standard deviations of the mean. Approximately 99.7% of the area under any normal distribution curve lies within three standard deviations of the mean.

123 Normal Distribution Total area under the curve = 1.0 or 100% The area under the curve : within 1 std. deviation = 0.68 or 68%; within 2 std deviation = 95% within 3 std deviation = 99.7%

124 Normal Distribution

125 Normal Distribution A standard normal distribution is a normal distribution with zero mean and one unit variance, given by the probability function and distribution function

126 POPULATION & SAMPLE

127 Population and Sample Often in practice we are interested in drawing valid conclusions about a large group of individuals or objects. Instead of examining the entire group, called the population, which may be difficult or impossible to do, we may examine only a small part of this population, which is called a sample. The process of obtaining samples is called sampling.

128 Population and Sample Statistical inference is drawing conclusions from sample data about the larger populations from which the samples are drawn. A population is the whole set of measurements or counts about which we want to draw a conclusion. A sample is a subset of the population, a set of some of the measurements or counts which comprise the population.

129 Sampling If we draw an object from an urn, we have the choice of replacing or not replacing the object into the urn before we draw again. In the first case a particular object can come up again and again, whereas in the second it can come up only once.

130 Sampling Sampling where each member of a population may be chosen more than once is called sampling with replacement. Sampling where each member cannot be chosen more than once is called sampling without replacement. For practical purposes, sampling from a finite population that is very large can be considered as sampling from an infinite population.
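
The two sampling schemes map onto two functions in Python's standard `random` module. The sketch below uses a hypothetical population of ten numbered objects and a fixed seed so the run is repeatable.

```python
import random

rng = random.Random(0)            # fixed seed, chosen only for repeatability
population = list(range(1, 11))   # hypothetical urn with objects 1..10

# With replacement: the same object may be drawn more than once
with_replacement = rng.choices(population, k=5)

# Without replacement: each object can appear at most once
without_replacement = rng.sample(population, k=5)

print(with_replacement, without_replacement)
```

Drawing more objects than the population holds is only possible with replacement; `sample` raises an error in that case.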

131 Random samples For finite populations, make sure that each member of the population has the same chance of being in the sample; such a sample is called a random sample. Random sampling can be accomplished for relatively small populations by drawing lots or, equivalently, by using a table of random numbers specially constructed for such purposes.

132 Random samples Because inference from sample to population cannot be certain, we must use the language of probability in any statement of conclusions.

133 Population parameters One important problem of statistical inference is the estimation of population parameters or briefly parameters (such as population mean, variance etc.) from the corresponding sample statistics or briefly statistics (i.e. sample mean, variance, etc). If the mean of the sampling distribution of a statistic equals the corresponding population parameter, the statistic is called an unbiased estimator of the parameter, otherwise it is called a biased estimator.

134 Population parameters If the sampling distributions of two statistics have the same mean (or expectation), the statistic with smaller variance is called an efficient estimator of the mean while the other statistic is called an inefficient estimator. If we consider all possible statistics whose sampling distributions have the same mean, the one with the smallest variance is sometimes called the most efficient or best estimator of this mean.

135 Population parameters An estimate of a population parameter given by a single number is called a point estimate of the parameter. An estimate of a population parameter given by two numbers between which the parameter may be considered to lie is called an interval estimate of the parameter. Interval estimates indicate the precision or accuracy of an estimate and are therefore preferable to point estimates.

136 Population parameters A population is considered to be known when we know the probability distribution f(x) of the associated random variable X. If X is normally distributed, we say the population is normally distributed or that we have a normal population. If X is binomially distributed, we say that the population is binomially distributed or that we have a binomial population.

137 Sample Statistics We can take random samples from the population and then use these samples to obtain values that serve to estimate and test hypotheses about the population parameters. For example, we may wish to draw conclusions about the height of adult students by examining only 100 students selected from the population. In this case, X can be a random variable whose values are the various heights.

138 Standard error The standard deviation of a sampling distribution of a statistic is often called its standard error. If the sample size N is large enough (N ≥ 30), the sampling distributions are normal or nearly normal. For this reason the methods are known as large sampling methods. When N < 30, samples are called small, and we use the theory of small samples, or exact sampling theory.

139 Confidence interval Confidence interval estimates of population parameters. Let $\mu_S$ and $\sigma_S$ be the mean and std deviation of the sampling distribution of a statistic S. If the sampling distribution of S is approximately normal (n ≥ 30), we can expect S to lie in the interval: $\mu_S - \sigma_S$ to $\mu_S + \sigma_S$ 68.27% of the time; $\mu_S - 2\sigma_S$ to $\mu_S + 2\sigma_S$ 95.45% of the time; $\mu_S - 3\sigma_S$ to $\mu_S + 3\sigma_S$ 99.73% of the time.

140 Confidence interval Equivalently, we can expect to find, or we can be confident of finding, $\mu_S$ in the interval: $S - \sigma_S$ to $S + \sigma_S$ (68.27% confidence interval); $S - 2\sigma_S$ to $S + 2\sigma_S$ (95.45% confidence interval); $S - 3\sigma_S$ to $S + 3\sigma_S$ (99.73% confidence interval). These intervals estimate the population parameter when S is an unbiased estimator.

141 Confidence interval The end numbers of these intervals are called the confidence limits: $S \pm \sigma_S$: 68.27% confidence limits; $S \pm 2\sigma_S$: 95.45% confidence limits; $S \pm 3\sigma_S$: 99.73% confidence limits.

142 Confidence level
Confidence level    $z_c$ (critical value)
99.73%    3.00
99%    2.58
98%    2.33
96%    2.05
95.45%    2.00
95%    1.96
90%    1.645
80%    1.28
68.27%    1.00
50%    0.6745
$S \pm 1.96\sigma_S$: 95% or 0.95 confidence level
$S \pm 2.58\sigma_S$: 99% or 0.99 confidence level
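
The critical values in the table can be reproduced from the standard normal distribution. The following is a minimal sketch using Python's standard library; the function names `critical_z` and `confidence_limits` are illustrative and do not come from the slides.

```python
from statistics import NormalDist


def critical_z(confidence_level):
    """Two-sided critical value z_c with P(-z_c < Z < z_c) = confidence_level."""
    tail_area = (1.0 - confidence_level) / 2.0
    return NormalDist().inv_cdf(1.0 - tail_area)


def confidence_limits(estimate, standard_error, confidence_level):
    """Limits S - z_c * sigma_S and S + z_c * sigma_S for an unbiased statistic S."""
    z_c = critical_z(confidence_level)
    return estimate - z_c * standard_error, estimate + z_c * standard_error


# Print z_c for the confidence levels listed in the table.
for level in (0.9973, 0.99, 0.98, 0.96, 0.9545, 0.95, 0.90, 0.80, 0.6827, 0.50):
    print(f"{level:.4f}: z_c = {critical_z(level):.4f}")
```

The printed values agree with the table after rounding; the limits function assumes the sampling distribution of S is approximately normal.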

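143 Confidence interval For small samples (n < 30), use the t distribution (table) to obtain the critical values. For example, if $-t_{0.975}$ and $t_{0.975}$ are the values of T for which 2.5% of the area lies in each tail of the t distribution, then the 95% confidence limits for the population mean are $\bar{X} \pm t_{0.975}\,\hat{S}/\sqrt{n}$. In general terms, the confidence limits are $\bar{X} \pm t_c\,\hat{S}/\sqrt{n}$, where $\hat{S}$ is the sample standard deviation computed with divisor n - 1.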
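
As a rough illustration of the small-sample formula, the sketch below computes 95% confidence limits for a mean. The critical values are copied from a standard t table for a few degrees of freedom only, and the shaft diameters are hypothetical data made up for this example.

```python
import math
import statistics

# t_{0.975} values from a standard t table, keyed by degrees of freedom.
# Only a few rows are included; extend from a full table as needed.
T_975 = {4: 2.776, 9: 2.262, 14: 2.145, 19: 2.093, 29: 2.045}


def t_confidence_limits_95(sample):
    """95% confidence limits X_bar +/- t_{0.975} * S_hat / sqrt(n) for a small sample."""
    n = len(sample)
    dof = n - 1
    if dof not in T_975:
        raise ValueError(f"no tabulated t value for {dof} degrees of freedom")
    mean = statistics.mean(sample)
    s_hat = statistics.stdev(sample)  # divisor n - 1
    half_width = T_975[dof] * s_hat / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical shaft diameters in mm (n = 10, so 9 degrees of freedom).
diameters = [25.02, 24.98, 25.01, 25.03, 24.97, 25.00, 25.04, 24.99, 25.01, 24.95]
low, high = t_confidence_limits_95(diameters)
print(f"95% confidence limits for the mean: {low:.3f} to {high:.3f}")
```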

144 The t-distribution The normal distribution is the well-known bell-shaped distribution with mean μ and standard deviation σ. The t-distribution also has a basic bell shape with an area of 1 under it, but it is shorter and flatter than the standard normal distribution. The standard deviation of the t-distribution is proportionally larger than that of the standard normal (Z) distribution.

145 The t-distribution Each t-distribution is distinguished by its degrees of freedom. If the sample size is n = 10, the corresponding t-distribution has n - 1 = 10 - 1 = 9 degrees of freedom and is written $t_9$. Smaller sample sizes have flatter t-distributions than larger sample sizes. As the sample size becomes larger, the t-distribution approaches the standard normal Z distribution.

146 Frequency distribution If a sample (or even a population) is large, it is difficult to observe the various characteristics or to compute statistics such as mean or standard deviation. For this reason it is useful to organize or group the raw data.

147 Frequency distribution Suppose that a sample consists of the heights of 100 male students at XYZ University. We arrange the data into classes or categories and determine the number of individuals belonging to each class, called the class frequency.

148 Frequency distribution Table columns: Height (inches), Number of students. Total: 100 students.
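
Grouping raw data into classes can be sketched as follows. The class boundaries (3-inch classes starting at 60 inches) and the heights are hypothetical values chosen for illustration, not the data from the slide.

```python
from collections import Counter


def frequency_distribution(values, lower, width, n_classes):
    """Count values in each class [lower + k*width, lower + (k+1)*width)."""
    counts = Counter()
    for value in values:
        k = int((value - lower) // width)  # index of the class containing value
        if 0 <= k < n_classes:
            counts[k] += 1
    table = []
    for k in range(n_classes):
        class_low = lower + k * width
        table.append((class_low, class_low + width, counts[k]))
    return table


# Hypothetical heights in inches.
heights = [61.5, 64.0, 66.2, 67.1, 67.9, 68.4, 69.5, 70.3, 71.0, 73.2]
for class_low, class_high, frequency in frequency_distribution(heights, 60, 3, 5):
    print(f"{class_low} to under {class_high}: {frequency}")
```

Values outside the stated range are ignored here; a real analysis would widen the classes or report them separately.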

149 HYPOTHESIS TESTS

150 Hypothesis testing Hypothesis testing is a statistician's way of trying to confirm or deny a claim about a population using data from a sample. A hypothesis is a conjecture about a population parameter. Hypothesis testing is a process of using sample data and statistical procedures to decide whether to reject or not reject a hypothesis (statement) about a population parameter value.

151 Hypothesis testing Because parameters tend to be unknown quantities, people make claims (conjectures) about what their values may be. Such a conjecture may or may not be true. The null hypothesis ($H_0$) states that the population parameter is equal to the claimed value. The alternative hypothesis ($H_a$ or $H_1$) states what is taken to hold if the null hypothesis is found not to be true.

152 Hypothesis testing
1. Decide on a null hypothesis, $H_0$.
2. Decide on an alternative hypothesis, $H_1$.
3. Decide on a significance level.
4. Calculate the appropriate test statistic, using the sample data.
5. Find from tables the appropriate tabulated test statistic.
6. Compare the calculated and tabulated test statistics, and decide whether to reject the null hypothesis, $H_0$.
7. State a conclusion, after checking to see whether the assumptions required for the test in question are valid.

153 Hypothesis testing The null hypothesis H 0, generally expresses the idea of no difference. The alternative hypothesis, which we denote by H 1, expresses the idea of some difference. Alternative hypothesis may be one-sided (greater or less than) or two-sided (not equal to).

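154 Critical values of Z
Level of significance, α    Critical values of z for one-tailed tests    Critical values of z for two-tailed tests
0.10    -1.28 or 1.28    -1.645 and 1.645
0.05    -1.645 or 1.645    -1.96 and 1.96
0.01    -2.33 or 2.33    -2.58 and 2.58
0.005    -2.58 or 2.58    -2.81 and 2.81
0.002    -2.88 or 2.88    -3.08 and 3.08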

155 Level of significance [Figure: distribution curve over the z axis for a two-tailed test, with a rejection region in each tail and the acceptance region between them.] The total shaded area is called the level of significance of the decision rule.

156 Hypothesis Example Situation A: A researcher is interested in finding out whether a new medicine will have any undesirable side effects on the pulse rate of the patient. Will the pulse rate increase, decrease or remain unchanged? Since the researcher knows that the pulse rate of the population under study is 82 beats per minute, the hypotheses will be $H_0: \mu = 82$ (remains unchanged) and $H_1: \mu \neq 82$ (will be different). This is a two-tailed test, since the possible effect could be to raise or lower the pulse rate.

157 Hypothesis Example Situation B: A chemist invents an additive to increase the life of an automobile battery. The mean lifetime of an ordinary battery is 36 months. The hypotheses will be $H_0: \mu \le 36$ and $H_1: \mu > 36$. The chemist is interested only in increasing the lifespan of the battery, so his alternative hypothesis is that the mean is larger than 36. The test is therefore called right-tailed: only an increase is of interest.

158 Hypothesis Example Situation C: A contractor wishes to lower heating bills by using a special type of insulation in houses. If the average monthly bill is RM 100, his hypotheses will be $H_0: \mu \ge$ RM 100 and $H_1: \mu <$ RM 100. This is a left-tailed test, since the contractor is only interested in reducing the bill.

159 Test of significance A z-test is used for testing the mean of a population versus a standard, or comparing the means of two populations, with large samples (n ≥ 30), whether or not the population standard deviation is known. It is also used for testing the proportion of some characteristic versus a standard proportion, or comparing the proportions of two populations. A significance level of 5% is the risk we take of rejecting the null hypothesis when it is in fact true.
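
The one-sample z-test for a mean, with the two-tailed, right-tailed and left-tailed cases of Situations A to C, might be sketched as below. The function name and the sample figures (mean 37.5 months, σ = 6, n = 64) are hypothetical and only loosely follow the battery example.

```python
import math
from statistics import NormalDist


def z_test_mean(sample_mean, mu0, sigma, n, alpha=0.05, tail="two"):
    """Return (z, critical value, reject H0?) for a one-sample z-test of the mean."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    std_normal = NormalDist()
    if tail == "two":        # H1: mu != mu0
        z_crit = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) > z_crit
    elif tail == "right":    # H1: mu > mu0
        z_crit = std_normal.inv_cdf(1 - alpha)
        reject = z > z_crit
    elif tail == "left":     # H1: mu < mu0
        z_crit = -std_normal.inv_cdf(1 - alpha)
        reject = z < z_crit
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z, z_crit, reject


# Hypothetical battery data: H0: mu <= 36 against H1: mu > 36.
z, z_crit, reject = z_test_mean(sample_mean=37.5, mu0=36, sigma=6, n=64, tail="right")
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

The critical value is computed here rather than read from a table; for α = 0.05 it matches the tabulated 1.645 (one-tailed) and 1.96 (two-tailed) after rounding.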

160 Test of significance A t-test is used for testing the mean of one population against a standard, or comparing the means of two populations, when the population standard deviation is not known and the sample is small (n < 30). If the population standard deviation is known, a z-test may be used. Example: Measuring the average diameter of shafts from a certain machine when you have a small sample.

161 Test of significance An F-test is used to compare the variances of two populations. The samples can be any size. It is the basis of ANOVA. Example: Comparing the variability of bolt diameters from two machines.

162 Chi-square goodness of fit test The chi-square statistic, denoted χ², provides a test of how well a hypothesised distribution fits the observed one. The observed data are grouped into class intervals, each with an observed frequency O. A distribution of any type can be hypothesised for a group of observations, based on the shape of the histogram.

163 Chi-square goodness of fit test For each class of the grouped data, the expected frequency E is estimated on the basis of the hypothesised distribution: the probability that the hypothesised distribution assigns to the class interval (the integral of its probability density function over the interval) is multiplied by the number of data points, n. The χ² contribution of each class is then $(O - E)^2 / E$.

164 Chi-square goodness of fit test The χ² values of all classes are summed. The hypothesis is tested by comparing this calculated χ² with the critical value of the χ² statistic from a chi-square table. If the critical value is less than the calculated value, the proposed distribution is rejected. The critical value is read from the table for the chosen level of significance and the appropriate degrees of freedom.

165 Estimated Chi-square Chi-square is a measure of the discrepancy between observed and expected frequencies. If χ² = 0, the observed and theoretical frequencies agree exactly. If χ² > 0, they do not agree exactly.
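
The goodness of fit calculation can be sketched in a few lines. The data are a hypothetical example (120 rolls of a die tested against a uniform distribution), and the critical value 11.07 is the tabulated χ² value for 5 degrees of freedom at the 0.05 level of significance.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of classes")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 120 die rolls, hypothesised uniform distribution (E = 20 per face).
observed = [25, 17, 15, 23, 24, 16]
expected = [20] * 6

chi2 = chi_square_statistic(observed, expected)
critical = 11.07  # chi-square table: 6 - 1 = 5 degrees of freedom, alpha = 0.05
print(f"chi-square = {chi2:.2f}, critical value = {critical}")
print("reject the proposed distribution" if chi2 > critical else "do not reject the proposed distribution")
```

Here no parameters are estimated from the data, so the degrees of freedom are the number of classes minus one; fitting parameters first would reduce them further.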

166 Test of normality Shapiro-Wilk test

167 Shapiro-Wilk test The Shapiro-Wilk test is a test of normality. The Shapiro-Wilk test utilizes the null hypothesis principle to check whether a sample $x_1, \ldots, x_n$ came from a normally distributed population. Empirical testing has found that Shapiro-Wilk has the best power for a given significance, followed closely by Anderson-Darling, when comparing the Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests.

168 Shapiro-Wilk test The null hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level, the null hypothesis is rejected and there is evidence that the data tested are not from a normally distributed population. In other words, the data are not normal.

169 Shapiro-Wilk test On the contrary, if the p-value is greater than the chosen alpha level, then the null hypothesis that the data came from a normally distributed population cannot be rejected. Example: for an alpha level of 0.05, a data set with a p-value of 0.02 rejects the null hypothesis that the data are from a normally distributed population. However, the test is sensitive to sample size: with large samples, even small departures from a normal distribution may be statistically significant. Thus a Q-Q plot is required for verification in addition to the test.

170 Q-Q plot In statistics, a Q-Q plot ("Q" stands for quantile) is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other. If the two distributions being compared are similar, the points in the Q-Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q-Q plot will approximately lie on a line, but not necessarily on the line y = x.
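
The points of a normal Q-Q plot can be computed without any plotting library. This is a minimal sketch: the plotting positions (i - 0.5)/n are one common convention among several, the theoretical distribution is a normal with the sample's own mean and standard deviation, and the data are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    points = []
    for i, x in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # plotting position of the i-th ordered value
        points.append((fitted.inv_cdf(p), x))
    return points


# Hypothetical measurements; points near the line y = x suggest normality.
data = [4.8, 5.1, 5.0, 4.6, 5.4, 5.2, 4.9, 5.3, 4.7, 5.0]
for theoretical, observed in normal_qq_points(data):
    print(f"{theoretical:6.3f}  {observed:6.3f}")
```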

171 Q-Q plot

172 Q-Q plot

173 CURVE FITTING

174 Curve fitting The general problem of finding equations of approximating curves which fit given sets of data is called curve fitting. Linear relationship: straight line. Non-linear relationship: curve.

175 Curve fitting
$Y = a_0 + a_1 X$: straight line
$Y = a_0 + a_1 X + a_2 X^2$: parabola (quadratic curve)
$Y = a_0 + a_1 X + a_2 X^2 + a_3 X^3$: cubic curve
$Y = a_0 + a_1 X + a_2 X^2 + a_3 X^3 + a_4 X^4$: quartic curve
$Y = a_0 + a_1 X + a_2 X^2 + \cdots + a_n X^n$: nth-degree curve
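
For the straight line, the coefficients can be estimated as sketched below. The slides do not name a fitting criterion; least squares is assumed here as one common choice, and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for the straight line Y = a0 + a1 * X."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (X, Y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a1 = s_xy / s_xx           # slope
    a0 = y_bar - a1 * x_bar    # intercept
    return a0, a1


# Hypothetical data scattered around a rising line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a0, a1 = fit_line(xs, ys)
print(f"Y = {a0:.3f} + {a1:.3f} X")
```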

176 Curve fitting
$Y = \dfrac{1}{a_0 + a_1 X}$ or $\dfrac{1}{Y} = a_0 + a_1 X$: hyperbola
$Y = a b^X$: exponential curve
$Y = a X^b$: geometric curve
$Y = \dfrac{1}{a b^X + g}$ or $\dfrac{1}{Y} = a b^X + g$: logistic curve
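
Several of these curves become straight lines after a transformation. For the exponential curve, taking logarithms gives log Y = log a + X log b, so a and b can be estimated with a straight-line fit. The sketch below assumes a least-squares fit on the transformed data (which is not the same as least squares on the original scale) and uses hypothetical data.

```python
import math


def fit_exponential(xs, ys):
    """Estimate (a, b) in Y = a * b**X from a least-squares line through (X, log Y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (X, Y) pairs of equal length")
    logs = [math.log(y) for y in ys]  # requires every Y > 0
    n = len(xs)
    x_bar = sum(xs) / n
    log_bar = sum(logs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xl = sum((x - x_bar) * (l - log_bar) for x, l in zip(xs, logs))
    slope = s_xl / s_xx                  # estimates log b
    intercept = log_bar - slope * x_bar  # estimates log a
    return math.exp(intercept), math.exp(slope)


# Hypothetical data that roughly doubles with each unit of X.
a, b = fit_exponential([0, 1, 2, 3, 4], [3.1, 5.9, 12.2, 23.8, 48.5])
print(f"Y = {a:.3f} * {b:.3f}**X")
```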

177 Raw data & fitted curve

178 Polynomial curve fit

179 Curve fitting & distribution fitting

180 Curve fitting & confidence interval

181 Multiple Regression Analysis Multiple regression is used to identify how changes in two or more factors (independent variables) contribute to change in a dependent variable. There are three types of multiple regression procedure: the backward solution, the forward solution and the stepwise solution. Stepwise has an advantage over the others: each time a predictor is added, the predictors entered earlier are tested again.

182 Backward Solution This procedure is also known as the full multiple regression model because every predictor variable is initially entered into the regression model. The variables which do not contribute significantly to the regression model will only be removed later.

183 Forward Solution Predictor variables are entered into the regression model one at a time, according to their contribution to the regression. The first variable selected to be entered into the model is the one with the highest correlation with the criterion variable. Further predictor variables are then selected until no remaining predictor variable contributes a significant change.

184 Stepwise Solution This is a variation of the forward solution. The procedure for selecting predictor variables is similar to the forward solution, except that after each predictor variable is selected, a second significance test is conducted to determine the contribution of each predictor variable entered before it.

185 Multiple Regression Analysis $\hat{Y} = b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_k X_k + a$, where $\hat{Y}$ is the predicted criterion variable, $X$ is the predictor variable, $b$ is the regression coefficient for each predictor variable, and $a$ is the regression constant.
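
The constant a and the coefficients b can be estimated for a full model (all predictors entered at once) as sketched below. Least squares via the normal equations is assumed, the helper names are illustrative, and no significance testing or variable selection is performed.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def fit_multiple_regression(rows, ys):
    """Return (a, [b1, ..., bk]) for Y_hat = b1*X1 + ... + bk*Xk + a."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 carries the constant a
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    coeffs = solve_linear_system(xtx, xty)
    return coeffs[0], coeffs[1:]


# Hypothetical data with two predictors.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [9.2, 7.9, 19.1, 17.8, 26.1]
a, b = fit_multiple_regression(rows, ys)
print(f"a = {a:.3f}, b = {[round(v, 3) for v in b]}")
```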

186 Correlation theory Correlation is the degree of relationship between variables. Correlation theory seeks to determine how well a linear or other equation describes or explains the relationship between variables. If all values of the variables satisfy an equation exactly, the variables are perfectly correlated. If there is no relationship, they are uncorrelated.

187 Correlation theory If only two variables are involved: simple correlation and simple regression. If more than two variables are involved: multiple correlation and multiple regression.

188 Correlation theory The correlation is called linear if all points in the scatter diagram seem to lie near a line. A linear equation is appropriate for purposes of regression or estimation. If Y tends to increase as X increases: the correlation is called positive or direct correlation.

189 Correlation theory If Y tends to decrease as X increases: the correlation is called negative or inverse correlation. If all points seem to lie near some curve, the correlation is called non-linear and a non-linear equation is appropriate for regression or estimation. The non-linear correlation can be sometimes positive or sometimes negative.

190 Explained & Unexplained variation The total variation of Y is given by: total variation = unexplained variation + explained variation, $\sum (Y - \bar{Y})^2 = \sum (Y - Y_{est})^2 + \sum (Y_{est} - \bar{Y})^2$

191 Coefficient of Correlation The ratio of the explained variation to the total variation is called the coefficient of determination. The quantity r, called the coefficient of correlation, is given by $r = \pm\sqrt{\dfrac{\text{explained variation}}{\text{total variation}}} = \pm\sqrt{\dfrac{\sum (Y_{est} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}}$
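
The split into explained and unexplained variation, and the resulting r, can be checked numerically. This sketch assumes that $Y_{est}$ comes from a least-squares straight line (the decomposition holds exactly in that case) and takes the sign of r from the slope; the data are hypothetical.

```python
import math


def variation_breakdown(xs, ys):
    """Return (total, unexplained, explained, r) for a least-squares line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a1 = s_xy / s_xx
    a0 = y_bar - a1 * x_bar
    y_est = [a0 + a1 * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    unexplained = sum((y - e) ** 2 for y, e in zip(ys, y_est))
    explained = sum((e - y_bar) ** 2 for e in y_est)
    r = math.copysign(math.sqrt(explained / total), a1)  # sign follows the slope
    return total, unexplained, explained, r


total, unexplained, explained, r = variation_breakdown([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"total = {total:.2f}, unexplained = {unexplained:.2f}, explained = {explained:.2f}")
print(f"coefficient of determination = {explained / total:.2f}, r = {r:.4f}")
```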

192 Rank Correlation Instead of using precise values of the variables, or when such precision is unavailable, the data may be ranked in order of size, importance, etc. using the numbers 1, 2, 3, ..., N.

193 Rank Correlation If two variables X and Y are ranked in this manner, the coefficient of rank correlation is given by Spearman's formula for rank correlation: $r_{rank} = 1 - \dfrac{6 \sum D^2}{N(N^2 - 1)}$, where D = differences between ranks of corresponding values of X and Y, and N = number of pairs of values (X, Y) in the data.
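
Spearman's formula can be applied directly once the data are ranked. The sketch below assumes there are no tied values (ties would need averaged ranks, which this simple version does not handle); the function names and scores are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) to N; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rank_correlation(xs, ys):
    """r_rank = 1 - 6 * sum(D^2) / (N * (N^2 - 1))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (X, Y) pairs of equal length")
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical scores for five subjects.
x_scores = [12, 15, 19, 23, 30]
y_scores = [48, 41, 55, 52, 60]
print(f"r_rank = {spearman_rank_correlation(x_scores, y_scores):.2f}")
```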

194 Correlation Tests Inferential research is conducted to describe the characteristics of the research subjects by identifying the relationship between the dependent and independent variables. The dependent variable is the effect; the independent variable is the factor which causes or effects a change in the dependent variable.

195 Correlation Tests There are 3 steps to determine the relationship between variables: 1. Identify the dependent and independent variables in the relationship. 2. Determine the measurement for variables in the relationship. 3. Conduct an analysis of the relationship between variables.

196 Correlation Tests The relationship between variables is known as correlation, and the strength of a correlation is represented by the correlation coefficient in the correlation test. There are various types of correlation tests, as shown in the tables of correlation test types (slides 201 to 203).

197 Correlation Tests The standard coefficient of the relationship between two variables is the Pearson product-moment correlation coefficient. The Spearman's rho test is a nonparametric test. It is used to analyse data which are not normally distributed. When two sets of data are not normally distributed, the relationship between them may not be linear, so a linear coefficient may not describe it well.

198 Correlation Tests The Spearman's rho test is conceptually similar to the Pearson r test. However, the Pearson r test is used to identify correlation between two sets of interval or ratio scale data, while the Spearman's rho test is used to analyse correlation between two sets of ordinal scale data.

199 Correlation Tests In some cases, the data collected from a sample is not ordinal, interval or ratio scale data; instead, it is nominal scale data. The two correlation tests (Pearson r and Spearman's rho) are not suitable for analysing nominal scale data.

200 Correlation Tests Correlation between two nominal scale variables can be analysed by using the Cramer's V test. It is calculated based on the chi-square value, χ².
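
The slide does not spell out the formula. The sketch below assumes the usual definition, $V = \sqrt{\chi^2 / (n \cdot \min(r - 1, c - 1))}$ for a contingency table with r rows, c columns and n observations in total; the counts are hypothetical.

```python
import math


def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # frequency expected under independence
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical counts: rows are material types, columns are failure modes.
counts = [[20, 5, 5], [6, 18, 6], [4, 7, 19]]
print(f"Cramer's V = {cramers_v(counts):.3f}")
```

V ranges from 0 (no association) to 1 (perfect association); every row and column total must be non-zero for the expected frequencies to be defined.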

201 Type of Correlation Tests
Correlation test: Type of measurement
Pearson product-moment coefficient: It states the relationship between variables using the interval and ratio scales.
Point-biserial coefficient: It states the relationship between an interval or ratio scale variable and a nominal scale variable.
Spearman's rho or eta coefficient: It states the relationship between variables when the distribution of data is not normal and where both variables are in ordinal scale, arranged according to rank.

202 Type of Correlation Tests
Correlation test: Type of measurement
Biserial coefficient: It is similar to the point-biserial coefficient, where one of the variables is measured in the interval or ratio scale whereas the other variable is in the ordinal scale.
Tetrachoric coefficient: It is similar to the Phi coefficient, which states the relationship of variables in the nominal scale. The difference is that this coefficient is used when the researcher estimates that both variable scales have ranking and the data distribution is normal.

203 Type of Correlation Tests
Correlation test: Type of measurement
Cramer, Phi and Lambda coefficient: Used when variables are in the nominal scale and each variable has more than two categories.
Rank-biserial coefficient: It is similar to the point-biserial coefficient, where one variable in the relationship is in the nominal scale and the other variable is in the ordinal scale.

204 The Strength of coefficient, r Table columns: Correlation coefficient (r), Correlation strength. Strength categories, from highest to lowest: Very strong, Strong, Average/medium, Weak, Very weak. r = 0.00: No correlation.

205 Homogeneity of Variance Certain tests (e.g. ANOVA) require that the variances of different populations are equal. This can be determined by the following approaches: 1. Comparison of graphs (esp. box plots) 2. Comparison of variance, standard deviation and IQR statistics 3. Statistical tests

206 Homogeneity of Variance The F test presented in Two Sample Hypothesis Testing of Variances can be used to determine whether the variances of two populations are equal. For three or more populations, the following statistical tests for homogeneity of variances are commonly used: 1. Levene's test 2. Fligner-Killeen test 3. Bartlett's test

207 Homogeneity of Variance Ways of dealing with models where the variances are not sufficiently homogeneous (that is, heterogeneous): 1. Non-parametric test: Kruskal-Wallis 2. Modified tests: Brown-Forsythe and Welch's ANOVA test 3. Transformations (square root, logarithmic)

208 Outliers The presence of outliers can be identified in the following ways: 1. Side by side plotting of the raw data (histograms and box plots). 2. Examination of residuals. Residuals for Levene's test: $e_{ij} = x_{ij} - \bar{x}_j$
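
The residuals $e_{ij} = x_{ij} - \bar{x}_j$ can be computed per group as sketched below. The group names and measurements are hypothetical, and the sketch only lists the residuals and the largest absolute one per group; deciding what counts as an outlier is left to the analyst.

```python
from statistics import mean


def group_residuals(groups):
    """Return residuals e_ij = x_ij - mean_j for every group j (name -> values)."""
    residuals = {}
    for name, values in groups.items():
        group_mean = mean(values)
        residuals[name] = [x - group_mean for x in values]
    return residuals


# Hypothetical strength measurements from two batches.
batches = {"batch A": [30.1, 29.8, 30.4, 30.0], "batch B": [29.5, 30.2, 36.9, 29.9]}
for name, values in group_residuals(batches).items():
    largest = max(values, key=abs)
    print(f"{name}: largest residual = {largest:+.2f}")
```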

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics DETAILED CONTENTS About the Author Preface to the Instructor To the Student How to Use SPSS With This Book PART I INTRODUCTION AND DESCRIPTIVE STATISTICS 1. Introduction to Statistics 1.1 Descriptive and

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES INTRODUCTION TO APPLIED STATISTICS NOTES PART - DATA CHAPTER LOOKING AT DATA - DISTRIBUTIONS Individuals objects described by a set of data (people, animals, things) - all the data for one individual make

More information

STAT 200 Chapter 1 Looking at Data - Distributions

STAT 200 Chapter 1 Looking at Data - Distributions STAT 200 Chapter 1 Looking at Data - Distributions What is Statistics? Statistics is a science that involves the design of studies, data collection, summarizing and analyzing the data, interpreting the

More information

Basic Statistical Analysis

Basic Statistical Analysis indexerrt.qxd 8/21/2002 9:47 AM Page 1 Corrected index pages for Sprinthall Basic Statistical Analysis Seventh Edition indexerrt.qxd 8/21/2002 9:47 AM Page 656 Index Abscissa, 24 AB-STAT, vii ADD-OR rule,

More information

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty. What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty. Statistics is a field of study concerned with the data collection,

More information

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics Last Lecture Distinguish Populations from Samples Importance of identifying a population and well chosen sample Knowing different Sampling Techniques Distinguish Parameters from Statistics Knowing different

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Probability Distributions

Probability Distributions CONDENSED LESSON 13.1 Probability Distributions In this lesson, you Sketch the graph of the probability distribution for a continuous random variable Find probabilities by finding or approximating areas

More information

Background to Statistics

Background to Statistics FACT SHEET Background to Statistics Introduction Statistics include a broad range of methods for manipulating, presenting and interpreting data. Professional scientists of all kinds need to be proficient

More information

2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS

2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS Spring 2015: Lembo GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS Descriptive statistics concise and easily understood summary of data set characteristics

More information

Exam details. Final Review Session. Things to Review

Exam details. Final Review Session. Things to Review Exam details Final Review Session Short answer, similar to book problems Formulae and tables will be given You CAN use a calculator Date and Time: Dec. 7, 006, 1-1:30 pm Location: Osborne Centre, Unit

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

MATH 10 INTRODUCTORY STATISTICS

MATH 10 INTRODUCTORY STATISTICS MATH 10 INTRODUCTORY STATISTICS Tommy Khoo Your friendly neighbourhood graduate student. Week 1 Chapter 1 Introduction What is Statistics? Why do you need to know Statistics? Technical lingo and concepts:

More information

AP Statistics Cumulative AP Exam Study Guide

AP Statistics Cumulative AP Exam Study Guide AP Statistics Cumulative AP Eam Study Guide Chapters & 3 - Graphs Statistics the science of collecting, analyzing, and drawing conclusions from data. Descriptive methods of organizing and summarizing statistics

More information

MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability

MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability STA301- Statistics and Probability Solved MCQS From Midterm Papers March 19,2012 MC100401285 Moaaz.pk@gmail.com Mc100401285@gmail.com PSMD01 MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability

More information

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics Mathematics Curriculum A. DESCRIPTION This is a full year courses designed to introduce students to the basic elements of statistics and probability. Emphasis is placed on understanding terminology and

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1 Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 Chapter 3 Statistics for Describing, Exploring, and Comparing Data 3-1 Overview 3-2 Measures

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Chapter 3. Data Description

Chapter 3. Data Description Chapter 3. Data Description Graphical Methods Pie chart It is used to display the percentage of the total number of measurements falling into each of the categories of the variable by partition a circle.

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Introduction to Statistics

Introduction to Statistics Introduction to Statistics By A.V. Vedpuriswar October 2, 2016 Introduction The word Statistics is derived from the Italian word stato, which means state. Statista refers to a person involved with the

More information

MATH 117 Statistical Methods for Management I Chapter Three

MATH 117 Statistical Methods for Management I Chapter Three Jubail University College MATH 117 Statistical Methods for Management I Chapter Three This chapter covers the following topics: I. Measures of Center Tendency. 1. Mean for Ungrouped Data (Raw Data) 2.

More information

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved.

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved. 1-1 Chapter 1 Sampling and Descriptive Statistics 1-2 Why Statistics? Deal with uncertainty in repeated scientific measurements Draw conclusions from data Design valid experiments and draw reliable conclusions

More information

Chapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation

Chapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation Chapter Four Numerical Descriptive Techniques 4.1 Numerical Descriptive Techniques Measures of Central Location Mean, Median, Mode Measures of Variability Range, Standard Deviation, Variance, Coefficient

More information

Lecture 1: Descriptive Statistics

Lecture 1: Descriptive Statistics Lecture 1: Descriptive Statistics MSU-STT-351-Sum 15 (P. Vellaisamy: MSU-STT-351-Sum 15) Probability & Statistics for Engineers 1 / 56 Contents 1 Introduction 2 Branches of Statistics Descriptive Statistics

More information

Descriptive Statistics-I. Dr Mahmoud Alhussami

Descriptive Statistics-I. Dr Mahmoud Alhussami Descriptive Statistics-I Dr Mahmoud Alhussami Biostatistics What is the biostatistics? A branch of applied math. that deals with collecting, organizing and interpreting data using well-defined procedures.

More information

Introduction to Basic Statistics Version 2

Introduction to Basic Statistics Version 2 Introduction to Basic Statistics Version 2 Pat Hammett, Ph.D. University of Michigan 2014 Instructor Comments: This document contains a brief overview of basic statistics and core terminology/concepts

More information

CIVL 7012/8012. Collection and Analysis of Information

CIVL 7012/8012. Collection and Analysis of Information CIVL 7012/8012 Collection and Analysis of Information Uncertainty in Engineering Statistics deals with the collection and analysis of data to solve real-world problems. Uncertainty is inherent in all real

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

Final Exam STAT On a Pareto chart, the frequency should be represented on the A) X-axis B) regression C) Y-axis D) none of the above
