DATA ANALYSIS. Faculty of Civil Engineering

Claude Hoover
1 DATA ANALYSIS Faculty of Civil Engineering

2 DATA

3 DATA - Introduction Data is a collection of facts, such as numbers, words, measurements, observations or even just descriptions of things. Qualitative data is descriptive information (it describes something). Quantitative data is numerical information (numbers).

4 DATA - Introduction Quantitative data can also be discrete or continuous. Discrete data can only take certain values (like whole numbers). Continuous data can take any value (within a range).

5 Data Analysis - Introduction Data analysis is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different disciplines.

6 Data Analysis - Introduction Data analysis is about manipulating data and presenting results. Data need to be organised, summarised and analysed in order to draw conclusions.

7 Data Analysis - Processes Data requirements. Data collection. Data processing. Data cleaning. Exploratory data analysis. Modelling and algorithm Results & Report.

8 Sources of Data Lab experimentation. Survey. Census. Theoretical analysis. Numerical analysis. Software. Other researchers' data.

9 Example Analysis Results Estimation of parameter mean values. Estimation of parameter variability. Comparison of parameter mean values. Comparison of parameter variability. Modelling the dependence of a dependent variable on several quantitative & qualitative independent variables.

10 Data Processing Data initially obtained must be processed or organized for analysis. For instance, this may involve placing data into rows and columns in a table format for further analysis, such as within a spreadsheet or statistical software.

11 Data Cleaning The data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data is entered and stored. Data cleaning is the process of preventing and correcting these errors.

12 Data Checking Before doing data analysis and interpretation, watch for invalid data using a suitable data checking procedure. Bad data should be weeded out continuously throughout the data gathering process. Bad data can bias results & interpretation. Repeat the data gathering or experimentation if suspicious data exist.

13 Exploratory Data Analysis Once the data is cleaned, it can be analyzed. Analysts may apply a variety of techniques referred to as exploratory data analysis to begin understanding the messages contained in the data. The process of exploration may result in additional data cleaning or additional requests for data, so these activities may be iterative in nature.

14 Trial Test Do a simple trial test: to ensure that all parts of the testing setup function well, to determine the range of measurements to be taken, to anticipate the time taken for each step in the experiment, and to see the error.

15 Error (Uncertainty) Writing a measurement result with ± e does not mean that we have made an error. It expresses the uncertainty due to the limits of the equipment and the experimental technique.

16 Error (Uncertainty) For example, Case 1: Theory says the deflection = 5 mm; in the experiment the deflection = 5.5 mm. Does this mean that the theory is wrong? Ask first what the error limit is. If the error limit is ±0.75 mm, the measurement is consistent with the theory.

17 Error (Uncertainty) For example, Case 2: Two experimentalists measure the same time interval. The first researcher gives the result as 20.4 ± 0.4 sec, while the second researcher gives 19.8 ± 0.8 sec. Do their results contradict each other?

18 Error (Uncertainty) No, their results actually overlap. However, we are more confident in the first one because its error is half that of the second, meaning that the measurement is more precise.
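
The two cases above can be checked with a few lines of arithmetic. The following is only a minimal sketch: the helper names `interval` and `overlaps` are made up for illustration and are not part of the original slides.

```python
def interval(value, error):
    """Return the (low, high) range implied by value ± error."""
    return (value - error, value + error)


def overlaps(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Case 1: theory predicts 5 mm, the experiment gives 5.5 mm with limit ±0.75 mm.
theory = 5.0
measured = interval(5.5, 0.75)
theory_consistent = measured[0] <= theory <= measured[1]

# Case 2: two researchers report 20.4 ± 0.4 s and 19.8 ± 0.8 s.
first = interval(20.4, 0.4)
second = interval(19.8, 0.8)
results_agree = overlaps(first, second)

print(theory_consistent, results_agree)
```

Both checks come out true: the theoretical value lies inside the measured range, and the two reported ranges share the region from about 20.0 s to 20.6 s.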

19 Analysis & Interpretation Mathematical formulas or numerical models called algorithms may be applied to the data to identify relationships among the variables. Numerical models: using software Statistical analysis: using software

20 STATISTICAL ANALYSIS

21 What is Statistics The science of collecting and analyzing data. It's about the whole process of using the scientific method to answer questions and make decisions.

22 What is Statistics The process involves designing studies, collecting good data, describing the data with numbers and graphs, analyzing the data, and then drawing conclusions.

23 Statistical Analysis 1) Designing studies 2) Collecting & selecting data 3) Describing data 4) Analyzing data 5) Making conclusion

24 Designing Studies Once a research question is defined, the next step is designing a study in order to answer that question. Figure out what process would be used to get the data we need.

25 Designing Studies An observational study could be a survey. Surveys are questionnaires that are presented to individuals who have been selected from a population of interest. Another widely used kind of observational study is based on nature, such as wildlife, geology, hydrology, meteorology, environment, etc.

26 Designing Studies Experiments take place in a controlled setting, and are designed to minimize biases that might occur. It is perhaps most important to note that no matter what the study, it has to be designed so that the original questions can be answered in a credible way.

27 Collecting & selecting data If you select your subjects in a way that is biased, that is, favoring certain individuals or groups of individuals, then the results will also be biased. Experiments and observational studies that use instrumentation are sometimes even more challenging when it comes to collecting data: something may happen during the experiment to distract the subjects or the researchers.

28 Describing Data Once data are collected, the next step is to summarize it all to get a handle on the big picture. Statisticians describe data in two major ways: with pictures (that is, charts & graphs) and with numbers, called descriptive statistics.

29 CHARTS AND GRAPHS

30 Charts and Graphs Line graphs for trend & behaviour Time charts for time series data Scatter graphs for relationships Pie charts & bar charts for categorical data Histogram & box plots for numerical data

31 Line Graphs A powerful tool to explain results in terms of cause and effect. The horizontal x-axis is normally used for the independent variable (the cause or controlled variable). The vertical y-axis is normally used for the dependent variable (the effect). Used to describe development or progression, and to show trend, response or behaviour in data.

32 Line Graph (figure: example line graph)

33 Time Charts Used to examine trends over time; a time chart is a form of line graph. Typically a time chart has some unit of time on the horizontal axis (year, day, month, and so on) and a measured quantity on the vertical axis (income, birth rate, total sales, and so on).

34 Time Chart (figure: total sales on the vertical axis against time on the horizontal axis)

35 Time Chart

36 Scatter Graphs Useful to present many data values. To show correlations between two variables. To draw conclusions about relationship in the data.

37 Scatter Graphs (figure: scatter plot of Y against X)

38 Pie Charts Present data in segments and convey a simple, straightforward proportion for each category. A pie chart takes categorical data and shows the percentage of individuals that fall into each category. Each segment is presented in terms of percentage, and a pie chart can only be used with one data set.

39 Bar Charts An effective way of presenting frequencies. Common in reports of small scale research. The bar height represents quantity or amount. The number of bars represents the categories. Often used to compare groups by breaking them down and showing them side by side. Visually striking and simple to read.

40 Bar Charts

41 Histogram The statistician's graph of choice for numerical data; it provides a quick way to get the big idea about a numerical data set. A histogram is a graphical display of tabulated frequencies, that is, a graphical version of a table that shows what proportion of cases fall into each of several or many specified categories.

42 Histogram A histogram is the most important graphical tool for exploring the shape of data distributions (Scott, 1992). The shape examined from the histogram puts the type of distribution into view. A histogram can be constructed by plotting the frequency of observations against the class midpoints of the data.

43 Number of Class Intervals Rule of thumb to choose an appropriate width (Sturges's rule): $a = \dfrac{x_{\max} - x_{\min}}{1 + 3.322 \log_{10} n}$, where a is the bin width (width of the class interval), n is the number of observations (data) and $\log_{10}(n)$ is the base-10 logarithm of the number of observations. According to Sturges's rule, 1000 observations would be graphed with 11 class intervals.
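
The figure of 11 class intervals for 1000 observations can be reproduced with Sturges's rule in its usual form, k = 1 + 3.322 log10(n), rounded up. This is a minimal sketch under that assumption; the function names are illustrative and do not come from the slides.

```python
import math


def sturges_classes(n):
    """Number of class intervals, k = 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))


def class_width(data):
    """Bin width: the data range divided by the Sturges number of classes."""
    return (max(data) - min(data)) / sturges_classes(len(data))


print(sturges_classes(1000))  # 11
```

Rounding up is one common convention; rounding to the nearest integer gives the same answer for n = 1000.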

44 Histogram

45 Histogram - tips If there are too few classes, it is difficult to see how the data vary. If there are too many classes, then the table is less of a summary.

46 Histogram tells three features How the data are distributed (symmetric, skewed right, skewed left, bell-shaped). The amount of variability in the data. Where the center of the data is (approximately).

47 Histogram tells three shapes Symmetric: the left-hand side of the histogram is a mirror image of the right-hand side. Skewed right: it looks like a lopsided mound with one long tail going off to the right. Skewed left: it looks like a lopsided mound with one long tail going off to the left.

48 Histogram tells variability If a histogram is quite flat with the bars close to the same height, it indicates high variability. A histogram with a big lump in the middle and tails on the sides indicates more data in the middle bars than the outer bars, so the data are actually closer together and show less variability.

49 Histogram tells center A histogram can also give you a rough idea of where the center of the data lies. To visualize the mean; the mean is the point where the fulcrum has to be in order to balance the weight on each side.

50 Boxplot A boxplot is a one-dimensional graph of numerical data based on the five-number summary, which includes the minimum value, the 25th percentile (known as Q1), the median, the 75th percentile (Q3), and the maximum value. In essence, these five descriptive statistics divide the data set into four equal parts.

51 Making a Boxplot 1) Find the five-number summary of the data set. 2) Create a horizontal number line whose scale includes the numbers in the five-number summary. 3) Label the number line using appropriate units of equal distance from each other.

52 Making a Boxplot Five-number summary of exam scores: 43: Minimum; 68: 25th percentile; 77: Median; 89: 75th percentile; 99: Maximum.

53 Making a Boxplot 4) Mark the location of each number in the five-number summary just above the number line. 5) Draw a box around the marks for the 25th percentile and the 75th percentile. 6) Draw a line in the box where the median is located. 7) Draw lines from the outside edges of the box out to the minimum & maximum values of the data set.

54 Making a Boxplot (figure: boxplot of the exam scores, annotated with steps 5 to 7)

55 Interpreting a Boxplot A boxplot can show information about the distribution, variability, and center of a data set. Symmetric data shows a symmetric boxplot. Skewed data show a lopsided boxplot, where the median cuts the box into two unequal pieces.

56 Interpreting a Boxplot If the longer part of the box is to the right (or above) the median, the data is said to be skewed right. If the longer part is to the left (or below) the median, the data is skewed left.

57 Interpreting a Boxplot The upper part (vertical line) of the box is wider than the lower part (vertical line). This means that the data between the median (77) and Q3 (89) are a little more spread out, or variable, than the data between the median (77) and Q1 (68).

58 Interpreting a Boxplot Variability in a data set is measured by the interquartile range (IQR). The IQR is equal to Q3 - Q1. A large distance from the 25th percentile to the 75th percentile indicates the data are more variable. The IQR ignores data below the 25th or above the 75th percentile, which may contain outliers.

59 Interpreting a Boxplot The median is part of the five-number summary, and is shown by the line that cuts through the box in the boxplot. The mean, however, is not part of the boxplot. A common misinterpretation of a boxplot is that the bigger the box, the more data it contains. In fact, a bigger part of the box means there is more variability (a wider range of values).

60 DESCRIPTIVE STATISTICS

61 Summarizing Data Descriptive statistics are numbers that summarize some characteristic of a set of data. Summarizing data by numerical measures makes a point clearly and concisely. Common measures: Mean, Median, Mode, Standard Deviation, Variance, Coefficient of Variation, Skewness, Kurtosis.

62 Sample Mean The sample mean is defined as the sum of the observed variable, x divided by the number of observed values.

63 Sample Median The sample median of a variable x is defined as the middle value when the n sample observations of x are ranked in increasing order of magnitude.

64 Sample Median S = 1,6,3,8,2,4,9 We need to find the value x, where half of the values are above x and half of the values are below x. Rearranged, S = 1,2,3,4,6,8,9. The median is 4.
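
The sample mean and sample median of the set S can be computed directly from their definitions. This is a small sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

S = [1, 6, 3, 8, 2, 4, 9]

# Sample mean: sum of the observed values divided by their number.
mean = sum(S) / len(S)

# Sample median: the middle value once the observations are ranked.
median = statistics.median(S)

print(sorted(S), round(mean, 3), median)
```

For this set the mean (33/7, about 4.71) and the median (4) differ slightly because the larger values 8 and 9 pull the mean upward.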

65 Sample Mode The sample mode of a variable x is defined as the value with the highest frequency. The mode of a data set is the value that occurs most often or, in other words, has the highest probability of occurring.

66 Sample Mode Sometimes we can have two, three, or more values that have relatively large probability of occurrence. In such cases, we say that the distribution is bimodal, tri-modal or multimodal, respectively.

67 Sample Mode Consider the rolls of a ten-sided die: R = 2,8,1,9,5,2,7,2,7,9,4,7,1,5,2 The number that appears the most is the number 2. Therefore the mode of set R is the number 2

68 Sample Mode Consider the rolls of a ten-sided die: R = 2,8,1,9,5,2,7,2,7,9,4,7,1,5,2 Note that if the number 7 had appeared one more time, it would have been present four times as well. In this case, we would have had a bimodal distribution, with 2 and 7 as the modes.
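
The die-roll example can be checked by counting frequencies. The sketch below uses a hypothetical helper `modes` that returns every value tied for the highest count, so it also covers the bimodal case.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


R = [2, 8, 1, 9, 5, 2, 7, 2, 7, 9, 4, 7, 1, 5, 2]

print(modes(R))        # [2]
print(modes(R + [7]))  # [2, 7]
```

In R the number 2 appears four times and 7 appears three times, so one extra 7 produces a tie and two modes.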

69 Mean Median Mode When to use mean, median & mode? Mean for normally distributed data (symmetrical distribution). Median & Mode for markedly skewed data.

70 Measures of Dispersion Consider the following data sets: S = 5,5,5,5,5,5 and R = 0,0,0,10,10,10. If we calculated the mean for both S and R, we would get the number 5. However, these are two vastly different data sets.

71 Measures of Dispersion Therefore, we need another descriptive statistic besides a measure of central tendency, which we shall call a measure of dispersion. We shall measure the dispersion or scatter of the values of our data set about the mean of the data set.

72 Measures of Dispersion If the values tend to be concentrated near the mean, then this measure shall be small, while if the values of the data tend to be distributed far from the mean, then the measure will be large. The two measures of dispersion that are usually used are called the variance and standard deviation.

73 Variance and Std Deviation A quantity of great importance in probability and statistics is the variance. The variance, denoted by $\sigma^2$, of a set of n numbers $x_1, x_2, \ldots, x_n$ with mean $\mu$ is given by $\sigma^2 = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$

74 Variance and Std Deviation The variance is a nonnegative number. The positive square root of the variance ($\sigma^2$) is called the standard deviation ($\sigma$). Find the variance and std deviation for the following set of test scores: T = 75, 80, 82, 87, 96. From the data set, $\mu = 84$.

75 Variance and Std Deviation T = 75, 80, 82, 87, 96; from the data set, $\mu = 84$. Variance $\sigma^2 = \dfrac{(75-84)^2 + (80-84)^2 + (82-84)^2 + (87-84)^2 + (96-84)^2}{5} = \dfrac{254}{5} = 50.8$. Std deviation, $\sigma = \sqrt{50.8} \approx 7.13$.

76 Variance & Std Deviation It is also widely accepted to divide by (n-1) as opposed to n: $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, $s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
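
The difference between dividing by n and dividing by (n-1) can be seen on the test scores T. This is a short sketch using the standard `statistics` module, where `pvariance` divides by n and `variance` divides by n-1.

```python
import statistics

T = [75, 80, 82, 87, 96]

pop_var = statistics.pvariance(T)   # divides by n
pop_sd = statistics.pstdev(T)       # square root of pop_var
samp_var = statistics.variance(T)   # divides by n - 1
samp_sd = statistics.stdev(T)       # square root of samp_var

print(pop_var, round(pop_sd, 2), samp_var, round(samp_sd, 2))
```

The sum of squared deviations is 254 in both cases; dividing by 5 gives 50.8 and dividing by 4 gives 63.5, so the (n-1) form is always slightly larger.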

77 Percentiles It is often convenient to subdivide your ordered data set by use of ordinates so that the number of data points less than the ordinate is some percentage of the total number of observations. The values corresponding to such areas are called percentile values, or briefly, percentiles.

78 Percentiles For example, the fraction of scores that fall below the ordinate at $x_{\alpha}$ is $\alpha$. For instance, the fraction of scores less than $x_{0.10}$ would be 0.10 or 10%, and $x_{0.10}$ would be called the 10th percentile.

79 Percentiles Another example is the median. Since half the data points fall below the median, it is the 50th percentile (or fifth decile), and can be denoted by $x_{0.50}$.

80 Percentiles The 25th percentile is often thought of as the median of the scores below the median, and the 75th percentile is often thought of as the median of the scores above the median.

81 Percentiles The 25th percentile is called the first quartile, while the 75th percentile is called the third quartile. The median is also known as the second quartile.

82 Interquartile Range Another measure of dispersion is the interquartile range (IQR). The interquartile range is defined to be the first quartile subtracted from the third quartile: $\text{IQR} = x_{0.75} - x_{0.25}$

83 Interquartile Range Find the interquartile range of the following data set: S = (67, 69, 70, 71, 74, 77, 78, 82, 89). The median is 74. The first quartile, $x_{0.25}$, is the median of the scores below the fifth position, that is, the average of the second and third scores, which leads to $x_{0.25} = 69.5$.

84 Interquartile Range (IQR) The third quartile, $x_{0.75}$, is the median of the scores above the fifth position, that is, the average of the seventh and eighth scores, which leads to $x_{0.75} = 80$. The interquartile range is $x_{0.75} - x_{0.25} = 80 - 69.5 = 10.5$. The semi-interquartile range is $0.5(x_{0.75} - x_{0.25})$, which leads to 5.25.
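
The quartile calculation above follows the "median of each half" convention. The sketch below implements exactly that convention with a hypothetical helper `quartiles_by_halves`; other conventions (for example the interpolation used by many software packages) can give slightly different quartiles for the same data.

```python
import statistics


def quartiles_by_halves(data):
    """Q1, median and Q3, taking Q1 and Q3 as the medians of the lower and
    upper halves (the middle value is excluded when n is odd)."""
    x = sorted(data)
    n = len(x)
    lower = x[: n // 2]
    upper = x[(n + 1) // 2:]
    return statistics.median(lower), statistics.median(x), statistics.median(upper)


S = [67, 69, 70, 71, 74, 77, 78, 82, 89]
q1, q2, q3 = quartiles_by_halves(S)
iqr = q3 - q1
semi_iqr = 0.5 * iqr

print(q1, q2, q3, iqr, semi_iqr)  # 69.5 74 80.0 10.5 5.25
```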

85 Skewness Skewness describes how the scores in a data set are distributed. We might have a symmetrical data set, or a data set that is evenly distributed, or a data set with more high values than low values.

86 Skewness Often a distribution is not symmetric about any value. If it has a long tail of a few much higher values, it is said to be skewed to the right. If it has a long tail of a few much lower values, it is said to be skewed to the left.

87 Skewness (figure: two curves, one skewed to the left and one skewed to the right)

88 Kurtosis Kurtosis (from the Greek word κυρτός, kyrtos or kurtos, meaning bulging) is a measure of the "peakedness" of the probability distribution of a real-valued random variable. Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. Higher kurtosis means more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly-sized deviations.

89 Kurtosis A distribution having a relatively high peak is called leptokurtic, while a curve which is flat-topped is called platykurtic. The normal distribution, which is neither very peaked nor very flat-topped, is called mesokurtic.

90 Kurtosis (figure panels: (a) Leptokurtic (b) Platykurtic (c) Mesokurtic)

91 Kurtosis

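92 Moments If $X_1, X_2, \ldots, X_N$ are the N values assumed by the variable X, we define the quantity $\overline{X^r} = \dfrac{X_1^r + X_2^r + \cdots + X_N^r}{N} = \dfrac{\sum_{j=1}^{N} X_j^r}{N} = \dfrac{\sum X^r}{N}$, called the r-th moment. The first moment, with r = 1, is the arithmetic mean $\bar{X}$.

93 Moments The r-th moment about the mean $\bar{X}$ is defined as: $m_r = \dfrac{\sum_{j=1}^{N}(X_j - \bar{X})^r}{N} = \dfrac{\sum (X - \bar{X})^r}{N} = \overline{(X - \bar{X})^r}$. If r = 1, $m_1 = 0$. If r = 2, $m_2 = s^2$, the variance.

94 Moments The r-th moment about any origin A is defined as: $m'_r = \dfrac{\sum_{j=1}^{N}(X_j - A)^r}{N} = \dfrac{\sum (X - A)^r}{N} = \dfrac{\sum d^r}{N} = \overline{(X - A)^r}$, where d = X - A are the deviations of X from A.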
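
All three moment definitions are the same average taken about a different origin, which a single function can express. This is a minimal sketch; `moment_about` is an illustrative name, and the test scores T from the variance example are reused as data.

```python
def moment_about(values, r, origin=0.0):
    """r-th moment of the values about the given origin A:
    the average of (X - A)**r over all N values."""
    n = len(values)
    return sum((x - origin) ** r for x in values) / n


T = [75, 80, 82, 87, 96]

mean = moment_about(T, 1)               # first moment about 0: the mean
m1 = moment_about(T, 1, origin=mean)    # first moment about the mean
m2 = moment_about(T, 2, origin=mean)    # second moment about the mean

print(mean, m1, m2)
```

With origin 0 and r = 1 this returns the arithmetic mean (84); about the mean, the first moment vanishes and the second equals the variance computed earlier with divisor N (50.8).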

95 PROBABILITY DISTRIBUTION

96 Probability The classical definition of probability: Suppose an event E can happen in h ways out of a total of n possible equally likely ways. Then the probability of occurrence of the event (called its success) is denoted by: $p = \Pr\{E\} = \dfrac{h}{n}$

97 Probability The probability of non-occurrence of the event (called its failure) is denoted by: $q = \Pr\{\text{not } E\} = \dfrac{n-h}{n} = 1 - \dfrac{h}{n} = 1 - p = 1 - \Pr\{E\}$. Thus p + q = 1, or Pr{E} + Pr{not E} = 1.

98 Probability Distribution Discrete probability distribution: If a variable X can assume a discrete set of values $X_1, X_2, \ldots, X_K$ with respective probabilities $p_1, p_2, \ldots, p_K$, where $p_1 + p_2 + \cdots + p_K = 1$, we say that a discrete probability distribution for X has been defined. In the discrete case, by cumulating probabilities, we obtain cumulative probability distributions.

99 Probability Distribution The function p(X), which has the respective values $p_1, p_2, \ldots, p_K$ for $X = X_1, X_2, \ldots, X_K$, is called the probability function or frequency function of X. Because X can assume certain values with given probabilities, it is often called a discrete random variable. Random variables are also called chance variables or stochastic variables.

100 Probability Distribution Continuous probability distribution: If the variable X may assume a continuous set of values, the relative frequency polygon of a sample becomes, in the theoretical or limiting case of a population, a continuous curve such as that shown in the figure.

101 Probability Distribution The equation of the curve is Y = p(X); the total area under the curve bounded by the X axis is equal to one. (figure: curve of p(X) against X, with the area between X = a and X = b shaded)

102 Probability Distribution The area under the curve between the lines X = a and X = b (shaded in the figure) gives the probability that X lies between a and b, which can be denoted by Pr{a < X < b}. We call p(X) a probability density function. The variable X is called a continuous random variable.

103 Mathematical Expectation If p is the probability that a person will receive a sum of money S, the mathematical expectation, or simply the expectation, is defined by pS. If the probability that a man wins a RM100 prize is 1/5, his expectation is: $\frac{1}{5}(\text{RM}100) = \text{RM}20$

104 Mathematical Expectation If X denotes a discrete random variable which can assume the values $X_1, X_2, \ldots, X_K$ with respective probabilities $p_1, p_2, \ldots, p_K$, where $p_1 + p_2 + \cdots + p_K = 1$, the mathematical expectation of X, or simply the expectation of X, denoted by E(X), is defined as: $E(X) = p_1 X_1 + p_2 X_2 + \cdots + p_K X_K = \sum_{j=1}^{K} p_j X_j = \sum pX$
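
The expectation formula is a probability-weighted sum, which the sketch below applies to the RM100 prize. It assumes, for illustration only, that the man receives nothing when he does not win (outcome 0 with probability 4/5); the function name `expectation` is also illustrative.

```python
from fractions import Fraction


def expectation(values, probs):
    """E(X) = sum of p_j * X_j; the probabilities must sum to 1."""
    if sum(probs) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for p, x in zip(probs, values))


# Outcomes: win RM100 with probability 1/5, otherwise win nothing (assumed).
prize = expectation([100, 0], [Fraction(1, 5), Fraction(4, 5)])

print(prize)  # 20
```

Exact fractions are used so that the check on the probabilities summing to one is not disturbed by floating-point rounding.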

105 Binomial Distribution Consider an experiment such as tossing a coin or die repeatedly; each toss or selection is called a trial. In any single trial there will be a probability associated with a particular event, such as a head on the coin or a four on the die. If this probability does not change from one trial to the next, the trials are said to be independent and are often called Bernoulli trials. The binomial is a discrete distribution.

106 Binomial Distribution Let p be the probability that an event will happen in any single Bernoulli trial (called the probability of success). Then q = 1 - p is the probability that the event will fail to happen in any single trial (called the probability of failure).

107 Binomial Distribution The probability that the event happens exactly x times in n trials is $f(x) = P(X = x) = \dbinom{n}{x} p^x q^{n-x} = \dfrac{n!}{x!\,(n-x)!}\, p^x q^{n-x}$, for x = 0, 1, ..., n. Mean: $\mu = np$. Variance: $\sigma^2 = np(1-p)$. Standard deviation: $\sigma = \sqrt{np(1-p)}$.

108 Binomial Distribution Toss a fair coin 100 times, and count the number of heads that appear. Find the mean, variance, and standard deviation of this experiment. In 100 tosses of a fair coin, the expected or mean number of heads is $\mu = (100)(0.5) = 50$. Variance $\sigma^2 = 100(0.5)(0.5) = 25$. Std deviation $\sigma = \sqrt{(100)(0.5)(0.5)} = 5$.
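
The coin-toss figures can be confirmed by summing over the binomial probability function itself rather than using the shortcut formulas. This is a sketch only; `binomial_pmf` is an illustrative name.

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 100, 0.5  # 100 tosses of a fair coin

total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
variance = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))
std_dev = math.sqrt(variance)

print(total, mean, variance, std_dev)
```

Up to floating-point rounding, the probabilities sum to 1 and the mean, variance and standard deviation agree with np = 50, np(1-p) = 25 and 5.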

109 Poisson Distribution Discrete distribution. Let X be a discrete random variable that can take on the values 0, 1, 2, ... such that the probability function of X is given by $f(x) = P(X = x) = \dfrac{\lambda^x e^{-\lambda}}{x!}$, x = 0, 1, 2, ..., where $\lambda$ is a given positive constant.

110 Poisson Distribution A random variable having this distribution is said to be Poisson distributed. The values of the Poisson distribution can be obtained using a table (available in statistics textbooks), which gives values of $e^{-\lambda}$ for various values of $\lambda$.

111 Poisson Distribution For $f(x) = P(X = x) = \dfrac{\lambda^x e^{-\lambda}}{x!}$, x = 0, 1, 2, ...: mean $\mu = \lambda$, variance $\sigma^2 = \lambda$, standard deviation $\sigma = \sqrt{\lambda}$.
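
Instead of a printed table, the Poisson probabilities can be evaluated directly. In the sketch below the value of lambda is hypothetical, chosen only for illustration, and the infinite sum is truncated at a point where the remaining tail is negligible.

```python
import math


def poisson_pmf(x, lam):
    """f(x) = lam**x * exp(-lam) / x!  for x = 0, 1, 2, ..."""
    return lam**x * math.exp(-lam) / math.factorial(x)


lam = 2.0        # hypothetical value of lambda, for illustration only
xs = range(60)   # the tail beyond x = 59 is negligible for this lambda

total = sum(poisson_pmf(x, lam) for x in xs)
mean = sum(x * poisson_pmf(x, lam) for x in xs)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)

print(total, mean, variance)
```

Up to rounding, the probabilities sum to 1 and both the mean and the variance come out equal to lambda, which is the distinguishing property of this distribution.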

112 Normal Distribution One of the most important examples of a continuous probability distribution is the normal distribution, sometimes called the Gaussian distribution. It is very important and comes up quite often in practice.

113 Normal Distribution The density function for this distribution is given by: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}$, $-\infty < x < \infty$, where $\mu$ = mean; $\sigma$ = std deviation; $\pi$ = 3.14159...; e = 2.71828...

114 Normal Distribution The total area bounded by the following curve and the X axis is one: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}$. Hence the area under the curve between two ordinates X = a and X = b, where a < b, represents the probability that X lies between a and b, denoted by Pr{a < X < b}.

115 Normal Distribution The corresponding distribution function is given by: $F(x) = P(X \le x) = \dfrac{1}{\sigma\sqrt{2\pi}} \displaystyle\int_{-\infty}^{x} e^{-(v-\mu)^2 / 2\sigma^2}\, dv$. If X has the distribution function listed above, then we say that the random variable X is normally distributed with mean $\mu$ and variance $\sigma^2$.

116 Normal Distribution If we let Z be the random variable corresponding to the following: $Z = \dfrac{X - \mu}{\sigma}$, then Z is called the standard variable corresponding to X. The mean or expected value of Z is 0 and the std deviation is 1.

117 Normal Distribution The density function for Z can be obtained from the definition of a normal distribution by setting $\mu = 0$ and $\sigma^2 = 1$: $f(z) = \dfrac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$. This is often referred to as the standard normal density function.

118 Normal Distribution The corresponding distribution function is given by: $F(z) = P(Z \le z) = \dfrac{1}{\sqrt{2\pi}} \displaystyle\int_{-\infty}^{z} e^{-u^2/2}\, du$. We sometimes call the value z of the standardized variable Z the standard score. A graph of the standard normal density function is sometimes called the standard normal curve.

119 Normal Distribution The standard normal curve indicates the areas within 1, 2, and 3 standard deviations of the mean, i.e. between z = -1 and +1, z = -2 and +2, z = -3 and +3, as equal, respectively, to 68.27%, 95.45% and 99.73% of the total area, which is one. This means that: $P(-1 \le Z \le 1) = 0.6827$, $P(-2 \le Z \le 2) = 0.9545$, $P(-3 \le Z \le 3) = 0.9973$

120 Standard Normal Curve (figure: f(z) against z, with the areas 68.27%, 95.45% and 99.73% marked)
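
These three areas can be recomputed without a table, because the area of the standard normal curve between -k and +k equals erf(k / sqrt(2)). The sketch below relies on that identity and on `math.erf` from the Python standard library; the function name `area_within` is illustrative.

```python
import math


def area_within(k):
    """P(-k <= Z <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```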

121 Normal Distribution A table giving the areas under the curve bounded by the ordinates at z = 0 and any positive value of z is available in most statistics textbooks. From this table the areas between any two ordinates can be found by using the symmetry of the curve about z = 0.

122 Normal Distribution Approximately 68% of the area under any normal distribution curve lies within one standard deviation of the mean. Approximately 95% of the area under any normal distribution curve lies within two standard deviations of the mean. Approximately 99.7% of the area under any normal distribution curve lies within three standard deviations of the mean.

123 Normal Distribution Total area under the curve = 1.0 or 100%. The area under the curve: within 1 std deviation = 0.68 or 68%; within 2 std deviations = 95%; within 3 std deviations = 99.7%.

124 Normal Distribution

125 Normal Distribution A standard normal distribution is a normal distribution with zero mean and one unit variance, given by the probability function and distribution function

126 POPULATION & SAMPLE

127 Population and Sample Often in practice we are interested in drawing valid conclusions about a large group of individuals or objects. Instead of examining the entire group, called the population, which may be difficult or impossible to do, we may examine only a small part of this population, which is called a sample. The process of obtaining samples is called sampling.

128 Population and Sample Statistical inference is drawing conclusions from sample data about the larger populations from which the samples are drawn. A population is the whole set of measurements or counts about which we want to draw a conclusion. A sample is a subset of the population, a set of some of the measurements or counts which comprise the population.

129 Sampling If we draw an object from an urn, we have the choice of replacing or not replacing the object into the urn before we draw again. In the first case a particular object can come up again and again, whereas in the second it can come up only once.

130 Sampling Sampling where each member of a population may be chosen more than once is called sampling with replacement. Sampling where each member cannot be chosen more than once is called sampling without replacement. For practical purposes, sampling from a finite population that is very large can be considered as sampling from an infinite population.
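
The two sampling schemes map directly onto two functions in Python's standard `random` module. This is a sketch with a hypothetical population of ten labelled members; the seed is fixed only so that the run is repeatable.

```python
import random

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 11))   # hypothetical population labelled 1..10

# With replacement: each draw is returned to the urn, so members may repeat.
with_replacement = rng.choices(population, k=5)

# Without replacement: a drawn member cannot come up again.
without_replacement = rng.sample(population, k=5)

print(with_replacement, without_replacement)
```

Both calls give every member the same chance of selection on each draw, which is the requirement for a random sample.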

131 Random samples For a finite population: make sure that each member of the population has the same chance of being in the sample, which is then called a random sample. Random sampling can be accomplished for relatively small populations by drawing lots, or equivalently, by using a table of random numbers specially constructed for such purposes.

132 Random samples Because inference from sample to population cannot be certain, we must use the language of probability in any statement of conclusions.

133 Population parameters One important problem of statistical inference is the estimation of population parameters or briefly parameters (such as population mean, variance etc.) from the corresponding sample statistics or briefly statistics (i.e. sample mean, variance, etc). If the mean of the sampling distribution of a statistic equals the corresponding population parameter, the statistic is called an unbiased estimator of the parameter, otherwise it is called a biased estimator.

134 Population parameters If the sampling distributions of two statistics have the same mean (or expectation), the statistic with smaller variance is called an efficient estimator of the mean while the other statistic is called an inefficient estimator. If we consider all possible statistics whose sampling distributions have the same mean, the one with the smallest variance is sometimes called the most efficient or best estimator of this mean.

135 Population parameters An estimate of a population parameter given by a single number is called a point estimate of the parameter. An estimate of a population parameter given by two numbers between which the parameter may be considered to lie is called an interval estimate of the parameter. Interval estimates indicate the precision or accuracy of an estimate and are therefore preferable to point estimates.

136 Population parameters A population is considered to be known when we know the probability distribution f(x) of the associated random variable X. If X is normally distributed, we say the population is normally distributed or that we have a normal population. If X is binomially distributed, we say that the population is binomially distributed or that we have a binomial population.

137 Sample Statistics We can take random samples from the population and then use these samples to obtain values that serve to estimate and test hypotheses about the population parameters. For example, we may wish to draw conclusions about the height of adult students by examining only 100 students selected from the population. In this case, X can be a random variable whose values are the various heights.

138 Standard error The standard deviation of a sampling distribution of a statistic is often called its standard error. If the sample size N is large enough, the sampling distributions are normal or nearly normal. For this reason the methods are known as large sampling methods. When N < 30, samples are called small and we use the theory of small samples, or exact sampling theory.

139 Confidence interval Confidence interval estimates of population parameters. Let $\mu_S$ and $\sigma_S$ be the mean and std deviation of the sampling distribution of a statistic S. If the sampling distribution of S is approximately normal (which holds for n ≥ 30), we can expect S to lie in the interval: $\mu_S - \sigma_S$ to $\mu_S + \sigma_S$: 68.27% of the time; $\mu_S - 2\sigma_S$ to $\mu_S + 2\sigma_S$: 95.45% of the time; $\mu_S - 3\sigma_S$ to $\mu_S + 3\sigma_S$: 99.73% of the time.

140 Confidence interval Equivalently we can expect to find, or we can be confident of finding, $\mu_S$ in the interval: $S - \sigma_S$ to $S + \sigma_S$: 68.27% confidence interval; $S - 2\sigma_S$ to $S + 2\sigma_S$: 95.45% confidence interval; $S - 3\sigma_S$ to $S + 3\sigma_S$: 99.73% confidence interval (i.e. for estimating the population parameter, in the case of an unbiased S).

141 Confidence interval The end points of these intervals are called confidence limits: $S \pm \sigma_S$: 68.27% confidence limits; $S \pm 2\sigma_S$: 95.45% confidence limits; $S \pm 3\sigma_S$: 99.73% confidence limits.

142 Confidence level $S \pm 1.96\sigma_S$: 95% or 0.95 confidence level. $S \pm 2.58\sigma_S$: 99% or 0.99 confidence level.
Confidence level | $z_c$ (critical value)
99.73% | 3.00
99% | 2.58
98% | 2.33
96% | 2.05
95.45% | 2.00
95% | 1.96
90% | 1.645
80% | 1.28
68.27% | 1.00
50% | 0.6745
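The critical values $z_c$ can be reproduced from the standard normal distribution. The following is a minimal sketch using Python's standard library; the function name `z_critical` is an illustrative choice, not something taken from the slides.

```python
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Two-sided critical value z_c with P(-z_c < Z < z_c) = confidence."""
    tail = (1.0 - confidence) / 2.0          # area left in each tail
    return NormalDist().inv_cdf(1.0 - tail)  # quantile of the standard normal


# Print z_c for a range of confidence levels.
for level in (0.50, 0.6827, 0.80, 0.90, 0.95, 0.9545, 0.96, 0.98, 0.99, 0.9973):
    print(f"{level:.2%} confidence level: z_c = {z_critical(level):.3f}")
```

Tabulated values are usually rounded, so 2.576 appears in tables as 2.58.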

143 Confidence interval For small samples (n < 30), use the t distribution (table) to obtain confidence levels. For example, if $-t_{0.975}$ and $t_{0.975}$ are the values of T for which 2.5% of the area lies in each tail of the t distribution, then the 95% confidence limits for the population mean are given by $\bar{X} \pm t_{0.975}\,\hat{S}/\sqrt{n}$, or in general terms $\bar{X} \pm t_c\,\hat{S}/\sqrt{n}$.
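A small-sample interval of the form $\bar{X} \pm t_c\,\hat{S}/\sqrt{n}$ might be computed as sketched below. The sample values are hypothetical, and the critical value 2.262 (the 0.975 point of the t-distribution with 9 degrees of freedom) is read from a t table, since the Python standard library has no t-distribution. The sketch assumes $\hat{S}$ is the sample standard deviation with the n - 1 divisor.

```python
import math
import statistics


def t_confidence_interval(sample, t_c):
    """Return (lower, upper) limits: mean +/- t_c * s_hat / sqrt(n)."""
    n = len(sample)
    mean = statistics.mean(sample)
    s_hat = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    half_width = t_c * s_hat / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical sample of n = 10 measurements, so 9 degrees of freedom.
diameters = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7, 10.1, 9.9]
T_975_DF9 = 2.262  # tabulated t value for a 95% interval with 9 degrees of freedom

lower, upper = t_confidence_interval(diameters, T_975_DF9)
print(f"95% confidence interval: {lower:.3f} to {upper:.3f}")
```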

144 The t-distribution The normal distribution is the well-known bell-shaped distribution whose mean is μ and whose standard deviation is σ. The t-distribution also has a basic bell shape with an area of 1 under it, but it is shorter and flatter than a normal distribution. The standard deviation of the t-distribution is proportionally larger than that of the standard normal Z-distribution.

145 The t-distribution Each t-distribution is distinguished by its degrees of freedom. If the sample size is n = 10, the corresponding t-distribution has n - 1 = 10 - 1 = 9 degrees of freedom and is denoted $t_9$. Smaller sample sizes have flatter t-distributions than larger sample sizes. As the sample size becomes larger, the t-distribution approaches the standard normal Z-distribution.

146 Frequency distribution If a sample (or even a population) is large, it is difficult to observe the various characteristics or to compute statistics such as mean or standard deviation. For this reason it is useful to organize or group the raw data.

147 Frequency distribution Suppose that a sample consists of the heights of 100 male students at XYZ University. We arrange the data into classes or categories and determine the number of individuals belonging to each class, called the class frequency.

148 Frequency distribution Height (inches) Number of students Total 100
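Grouping raw data into classes can be sketched as follows. The heights, the starting value of 60 inches and the class width of 3 inches are all hypothetical choices made for illustration; they are not the values from the slide's table.

```python
from collections import Counter


def frequency_distribution(heights, start=60, width=3):
    """Count how many whole-inch heights fall in each class (lower, upper)."""
    counts = Counter()
    for h in heights:
        k = (h - start) // width          # index of the class containing h
        lower = start + k * width
        counts[(lower, lower + width - 1)] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample of ten heights in inches.
heights = [61, 64, 67, 66, 70, 68, 63, 72, 67, 69]
table = frequency_distribution(heights)

for (lower, upper), frequency in table.items():
    print(f"{lower}-{upper} inches: {frequency}")
print("Total:", sum(table.values()))
```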

149 HYPOTHESIS TESTS

150 Hypothesis testing Hypothesis testing is a statistician's way of trying to confirm or deny a claim about a population using data from a sample. A hypothesis is a conjecture about a population parameter. Hypothesis testing is a process of using sample data and statistical procedures to decide whether or not to reject a hypothesis (statement) about a population parameter value.

151 Hypothesis testing Because parameters tend to be unknown quantities, everyone wants to make claims about what their values may be. Such a conjecture may or may not be true. The null hypothesis ($H_0$) always states that the population parameter is equal to the claimed value. If the null hypothesis is found not to be true, what holds instead is stated by the alternative hypothesis ($H_a$ or $H_1$).

152 Hypothesis testing 1. Decide on a null hypothesis, $H_0$. 2. Decide on an alternative hypothesis, $H_1$. 3. Decide on a significance level. 4. Calculate the appropriate test statistic, using the sample data. 5. Find from tables the appropriate tabulated test statistic. 6. Compare the calculated and tabulated test statistics, and decide whether to reject the null hypothesis, $H_0$. 7. State a conclusion, after checking whether the assumptions required for the test in question are valid.
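The steps above can be sketched for one common case, a two-tailed z-test of a population mean with known standard deviation. This is only an illustration under stated assumptions: the function name and all numbers (sample mean 84, claimed mean 82, σ = 6, n = 36) are hypothetical, and the tabulated critical value is computed here rather than looked up.

```python
import math
from statistics import NormalDist


def z_test_two_tailed(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z, z_crit, reject) for H0: mu = mu0 against H1: mu != mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))  # calculated test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # tabulated critical value
    return z, z_crit, abs(z) > z_crit                 # reject if in either tail


# Hypothetical data: H0: mu = 82, H1: mu != 82, significance level 5%.
z, z_crit, reject = z_test_two_tailed(sample_mean=84, mu0=82, sigma=6, n=36)
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```

Here the calculated statistic lies just beyond the critical value, so $H_0$ would be rejected at the 5% level for these made-up numbers.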

153 Hypothesis testing The null hypothesis, $H_0$, generally expresses the idea of no difference. The alternative hypothesis, which we denote by $H_1$, expresses the idea of some difference. The alternative hypothesis may be one-sided (greater than or less than) or two-sided (not equal to).

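154 Critical values of Z
Level of significance, α | Critical values of Z for one-tailed tests | Critical values of Z for two-tailed tests
0.10 | -1.28 or 1.28 | -1.645 and 1.645
0.05 | -1.645 or 1.645 | -1.96 and 1.96
0.01 | -2.33 or 2.33 | -2.58 and 2.58
0.005 | -2.58 or 2.58 | -2.81 and 2.81
0.002 | -2.88 or 2.88 | -3.08 and 3.08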

155 Level of significance [Figure: standard normal curve (z) for a two-tailed test, with a central acceptance region and a rejection region in each tail.] The total shaded area is called the level of significance of the decision rule.

156 Hypothesis Example Situation A: A researcher is interested in finding out whether a new medicine will have any undesirable side effects on the pulse rate of the patient: will the pulse rate increase, decrease or remain unchanged? Since the researcher knows that the pulse rate of the population under study is 82 beats per minute, the hypotheses will be $H_0$: μ = 82 (remains unchanged) and $H_1$: μ ≠ 82 (will be different). This is a two-tailed test since the possible effect could be to raise or lower the pulse rate.

157 Hypothesis Example Situation B: A chemist invents an additive to increase the life of an automobile battery. The mean lifetime of an ordinary battery is 36 months. The hypotheses will be $H_0$: μ ≤ 36 and $H_1$: μ > 36. The chemist is interested only in increasing the lifespan of the battery, so the alternative hypothesis is that the mean is larger than 36. The test is therefore called right-tailed: only an increase is of interest.

158 Hypothesis Example Situation C: A contractor wishes to lower heating bills by using a special type of insulation in houses. If the average monthly bill is RM 100, the hypotheses will be $H_0$: μ ≥ RM 100 and $H_1$: μ < RM 100. This is a left-tailed test since the contractor is only interested in reducing the bill.

159 Test of significance A z-test is used for testing the mean of a population against a standard, or for comparing the means of two populations, with large samples (n ≥ 30), whether or not the population standard deviation is known. It is also used for testing the proportion of some characteristic against a standard proportion, or for comparing the proportions of two populations. A significance level of 5% is the risk we accept of rejecting the null hypothesis when it is in fact true.

160 Test of significance A t-test is used for testing the mean of one population against a standard, or for comparing the means of two populations, when the population standard deviation is unknown and the sample is small (n < 30). If the population standard deviation is known, a z-test may be used. Example: measuring the average diameter of shafts from a certain machine when only a small sample is available.

161 Test of significance An F-test is used to compare the variances of two populations. The samples can be of any size. It is the basis of ANOVA. Example: comparing the variability of bolt diameters from two machines.

162 Chi-square goodness of fit test The chi-square statistic, denoted $\chi^2$, provides a good test of how well a hypothesized distribution fits the observed one. The observed data are grouped into class intervals, each with an observed frequency, O. For a given set of observations, a distribution of any type can be hypothesized on the basis of the shape of the histogram.

163 Chi-square goodness of fit test For each class of the grouped data, the expected frequency can be estimated on the basis of the hypothesized distribution. This is done by multiplying the probability that the hypothesized distribution assigns to each class interval (obtained from its density function) by the number of data points, n, to obtain the expected frequency, E. The $\chi^2$ contribution of each class can then be calculated as $(O - E)^2 / E$.

164 Chi-square goodness of fit test The individual $\chi^2$ values for each class are summed. The hypothesis is verified by comparing the calculated $\chi^2$ with the critical value of the $\chi^2$ statistic from the chi-square table. If the critical value is less than the calculated value, the proposed distribution is rejected. The critical $\chi^2$ value is read from the table according to the level of significance.

165 Estimated Chi-square The discrepancy between observed and expected frequencies is measured by chi-square: $\chi^2 = \sum_j (O_j - E_j)^2 / E_j$. If $\chi^2 = 0$, the observed and theoretical frequencies agree exactly. If $\chi^2 > 0$, they do not agree exactly.
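As an illustration, the statistic $\chi^2 = \sum (O - E)^2 / E$ can be computed directly from the class frequencies. The observed counts below are hypothetical, and the hypothesized distribution is assumed to be uniform across five classes, so every expected frequency is 20. The critical value 9.488 is the tabulated 5% value for 4 degrees of freedom, taking the degrees of freedom as the number of classes minus one.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical frequencies for five classes (n = 100).
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]  # assumed uniform hypothesized distribution

chi2 = chi_square(observed, expected)
CRITICAL_5PCT_DF4 = 9.488  # from a chi-square table: alpha = 0.05, 4 degrees of freedom

print(f"chi-square = {chi2:.2f}")
print("Reject hypothesized distribution:", chi2 > CRITICAL_5PCT_DF4)
```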

166 Test of normality Shapiro-Wilk test

167 Shapiro-Wilk test The Shapiro-Wilk test is a test of normality. It uses the null hypothesis principle to check whether a sample $x_1, \ldots, x_n$ came from a normally distributed population. Empirical comparison of the Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests has found that Shapiro-Wilk has the best power for a given significance level, followed closely by Anderson-Darling.

168 Shapiro-Wilk test The null hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level, the null hypothesis is rejected and there is evidence that the data tested are not from a normally distributed population. In other words, the data are not normal.

169 Shapiro-Wilk test Conversely, if the p-value is greater than the chosen alpha level, the null hypothesis that the data came from a normally distributed population cannot be rejected. Example: for an alpha level of 0.05, a data set with a p-value of 0.02 rejects the null hypothesis that the data are from a normally distributed population. However, because the test is sensitive to sample size, it may flag a statistically significant departure from normality in any large sample. A Q-Q plot is therefore required for verification in addition to the test.

170 Q-Q plot In statistics, a Q-Q plot ("Q" stands for quantile) is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other. If the two distributions being compared are similar, the points in the Q-Q plot will lie approximately on the line y = x. If the distributions are linearly related, the points in the Q-Q plot will lie approximately on a line.
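The points of a normal Q-Q plot can be sketched without any plotting library. The version below compares the ordered sample with quantiles of a normal distribution fitted to the sample mean and standard deviation. The plotting positions (i - 0.5)/n are one common convention among several, and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(sample)
    ordered = sorted(sample)
    fitted = NormalDist(mean(sample), stdev(sample))  # normal with matching mean and sd
    return [
        (fitted.inv_cdf((i - 0.5) / n), x)  # plotting position (i - 0.5) / n
        for i, x in enumerate(ordered, start=1)
    ]


# Hypothetical measurements.
sample = [4.9, 5.1, 5.0, 5.3, 4.7, 5.2, 4.8, 5.0]
points = normal_qq_points(sample)

for theoretical, observed_value in points:
    print(f"{theoretical:.3f}  {observed_value:.3f}")
```

If the data are close to normal, the printed pairs lie near the line y = x when plotted.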

171 Q-Q plot

172 Q-Q plot

173 CURVE FITTING

174 Curve fitting The general problem of finding equations of approximating curves which fit given sets of data is called curve fitting. Linear relationship: straight line. Non-linear relationship: curve.

175 Curve fitting $Y = a_0 + a_1 X$: straight line. $Y = a_0 + a_1 X + a_2 X^2$: parabola/quadratic. $Y = a_0 + a_1 X + a_2 X^2 + a_3 X^3$: cubic curve. $Y = a_0 + a_1 X + a_2 X^2 + a_3 X^3 + a_4 X^4$: quartic curve. $Y = a_0 + a_1 X + a_2 X^2 + \cdots + a_n X^n$: nth degree curve.
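For the simplest case, the straight line $Y = a_0 + a_1 X$, the coefficients can be estimated by the method of least squares. The slides do not name a fitting method at this point, so least squares is an assumption here, and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for the straight line Y = a0 + a1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a1 = s_xy / s_xx             # slope
    a0 = mean_y - a1 * mean_x    # intercept
    return a0, a1


# Hypothetical raw data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a0, a1 = fit_line(xs, ys)
print(f"Y = {a0:.2f} + {a1:.2f} X")
```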

176 Curve fitting Hyperbola: $Y = \dfrac{1}{a_0 + a_1 X}$ or $\dfrac{1}{Y} = a_0 + a_1 X$. Exponential curve: $Y = ab^X$. Geometric curve: $Y = aX^b$. Logistic curve: $Y = \dfrac{1}{ab^X + g}$ or $\dfrac{1}{Y} = ab^X + g$.

177 Raw data & fitted curve

178 Polynomial curve fit

179 Curve fitting & distribution fitting

180 Curve fitting & confidence interval

181 Multiple Regression Analysis Multiple regression is used to identify how changes in two or more factors (independent variables) contribute to change in a dependent variable. There are three types of multiple regression procedure: the backward solution, the forward solution and the stepwise solution. Stepwise has an advantage over the others.

182 Backward Solution This procedure is also known as the full multiple regression model because every predictor variable is initially entered into the regression model. Variables which do not contribute significantly to the regression model are removed afterwards.

183 Forward Solution Each predictor variable is entered into the regression model according to its contribution to the regression. The first variable selected has the highest correlation with the criterion variable. Further predictor variables are then selected until no remaining predictor variable contributes a significant change.

184 Stepwise Solution This is a variation of the forward solution. The procedure for selecting predictor variables is similar to the forward solution, except that after each predictor variable is selected, a second significance test is conducted to determine the contribution of each previously selected predictor variable.

185 Multiple Regression Analysis $\hat{Y} = b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_k X_k + a$, where $\hat{Y}$ is the predicted criterion variable, $X$ is the predictor variable, $b$ is the regression coefficient for each predictor variable, and $a$ is the regression constant.

186 Correlation theory Correlation is the degree of relationship between variables; correlation analysis seeks to determine how well a linear or other equation describes or explains the relationship between variables. If all values of the variables satisfy an equation exactly, the variables are perfectly correlated. If there is no relationship, they are uncorrelated.

187 Correlation theory If only two variables are involved: simple correlation and simple regression. If more than two variables are involved: multiple correlation and multiple regression.

188 Correlation theory The correlation is called linear if all points in the scatter diagram seem to lie near a line. A linear equation is appropriate for purposes of regression or estimation. If Y tends to increase as X increases: the correlation is called positive or direct correlation.

189 Correlation theory If Y tends to decrease as X increases: the correlation is called negative or inverse correlation. If all points seem to lie near some curve, the correlation is called non-linear and a non-linear equation is appropriate for regression or estimation. The non-linear correlation can be sometimes positive or sometimes negative.

190 Explained & Unexplained variation The total variation of Y is given by: total variation = unexplained variation + explained variation, that is, $\sum (Y - \bar{Y})^2 = \sum (Y - Y_{\text{est}})^2 + \sum (Y_{\text{est}} - \bar{Y})^2$.

191 Coefficient of Correlation The ratio of the explained variation to the total variation is called the coefficient of determination. The quantity r, called the coefficient of correlation, is given by $r = \pm\sqrt{\dfrac{\text{explained variation}}{\text{total variation}}} = \pm\sqrt{\dfrac{\sum (Y_{\text{est}} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}}$.
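The decomposition of the total variation and the resulting coefficient of correlation can be checked numerically. This sketch assumes that $Y_{\text{est}}$ comes from a least-squares straight line, for which the decomposition holds exactly, and it takes the sign of r from the slope. The data are hypothetical.

```python
import math


def variation_decomposition(xs, ys):
    """Return (total, unexplained, explained, r) for a least-squares line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a0 = mean_y - a1 * mean_x
    y_est = [a0 + a1 * x for x in xs]  # values predicted by the fitted line

    total = sum((y - mean_y) ** 2 for y in ys)
    unexplained = sum((y - ye) ** 2 for y, ye in zip(ys, y_est))
    explained = sum((ye - mean_y) ** 2 for ye in y_est)
    r = math.copysign(math.sqrt(explained / total), a1)  # sign follows the slope
    return total, unexplained, explained, r


# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

total, unexplained, explained, r = variation_decomposition(xs, ys)
print(f"total = {total:.3f}, unexplained = {unexplained:.3f}, explained = {explained:.3f}")
print(f"coefficient of determination = {explained / total:.4f}, r = {r:.4f}")
```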

192 Rank Correlation Instead of using precise values of the variables, or when such precision is unavailable, the data may be ranked in order of size, importance, etc., using the numbers 1, 2, 3, ..., N.

193 Rank Correlation If two variables X and Y are ranked in such a manner, the coefficient of rank correlation is given by Spearman's formula for rank correlation, $r_{\text{rank}} = 1 - \dfrac{6 \sum D^2}{N(N^2 - 1)}$, where D = differences between ranks of corresponding values of X and Y, and N = number of pairs of values (X, Y) in the data.
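Spearman's formula $r_{\text{rank}} = 1 - 6\sum D^2 / (N(N^2 - 1))$ translates directly into code. The sketch below assumes the data have already been converted to ranks with no ties; the ranks shown are hypothetical.

```python
def spearman_rank(rank_x, rank_y):
    """Spearman's rank correlation from two lists of ranks (no ties assumed)."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))  # sum of D^2
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical ranks of five pairs (X, Y).
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]

r_rank = spearman_rank(rank_x, rank_y)
print(f"r_rank = {r_rank:.2f}")
```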

194 Correlation Tests Inferential research is conducted to describe the characteristics of the research subjects by identifying the relationship between the dependent and independent variables. The dependent variable is the effect; the independent variable is the factor which causes or effects a change in the dependent variable.

195 Correlation Tests There are 3 steps to determine the relationship between variables: 1. Identify the dependent and independent variables in the relationship. 2. Determine the measurement for variables in the relationship. 3. Conduct an analysis of the relationship between variables.

196 Correlation Tests The relationship between variables is known as correlation, and the strength of a correlation is represented by the correlation coefficient in the correlation test. There are various types of correlation tests, as shown in the table.

197 Correlation Tests The standard coefficient of relationship between two variables is the Pearson product-moment correlation coefficient. Spearman's rho test is a nonparametric test. It is used to analyse data which are not normally distributed. Two sets of data that are not normally distributed need not be linearly related, so a rank-based measure is used instead.

198 Correlation Tests Spearman's rho test is conceptually similar to the Pearson r test. However, the Pearson r test is used to identify correlation between two sets of interval or ratio scale data, while Spearman's rho test is used to analyse correlation between two sets of ordinal scale data.

199 Correlation Tests In some cases, the data collected from a sample are not ordinal, interval or ratio scale data; instead, they are nominal scale data. The two correlation tests (Pearson r and Spearman's rho) are not suitable for analysing nominal scale data.

200 Correlation Tests Correlation between two nominal scale variables can be analysed by using the Cramér's V test. It is calculated from the chi-square value, $\chi^2$.
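The slide's own formula did not survive, so the sketch below uses the commonly quoted definition $V = \sqrt{\chi^2 / (n\,(k - 1))}$, where n is the total count and k is the smaller of the number of rows and columns of the contingency table. That definition and the contingency table values are assumptions made for illustration.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # expected count under independence
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals))
    return math.sqrt(chi2 / (n * (k - 1)))


# Hypothetical 2 x 2 table of counts for two nominal variables.
table = [[10, 20], [20, 10]]
v = cramers_v(table)
print(f"Cramér's V = {v:.3f}")
```

A value of 0 indicates no association and a value of 1 indicates perfect association between the two nominal variables.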

201 Type of Correlation Tests
Correlation test | Type of measurement
Pearson product-moment coefficient | States the relationship between variables measured on the interval and ratio scales.
Point-biserial coefficient | States the relationship between an interval or ratio scale variable and a nominal scale variable.
Spearman's rho or eta coefficient | States the relationship between variables when the distribution of data is not normal and both variables are on the ordinal scale, arranged according to rank.

202 Type of Correlation Tests
Correlation test | Type of measurement
Biserial coefficient | Similar to the point-biserial coefficient: one of the variables is measured on the interval or ratio scale, whereas the other variable is on the ordinal scale.
Tetrachoric coefficient | Similar to the Phi coefficient, which states the relationship of variables on the nominal scale. The difference is that this coefficient is used when the researcher estimates that both variable scales have ranking and the data distribution is normal.

203 Type of Correlation Tests
Correlation test | Type of measurement
Cramer, Phi and Lambda coefficients | Used when variables are on the nominal scale and each variable has more than two categories.
Rank-biserial coefficient | Similar to the point-biserial coefficient: one variable in the relationship is on the nominal scale and the other variable is on the ordinal scale.

204 The Strength of coefficient, r Correlation strength by correlation coefficient (r), listed from strongest to weakest: very strong, strong, average/medium, weak, very weak; r = 0.00 indicates no correlation.

205 Homogeneity of Variance Certain tests (e.g. ANOVA) require that the variances of different populations are equal. This can be determined by the following approaches: 1. Comparison of graphs (esp. box plots) 2. Comparison of variance, standard deviation and IQR statistics 3. Statistical tests

206 Homogeneity of Variance The F test presented in Two Sample Hypothesis Testing of Variances can be used to determine whether the variances of two populations are equal. For three or more variables the following statistical tests for homogeneity of variances are commonly used: 1. Levene's test 2. Fligner-Killeen test 3. Bartlett's test

207 Homogeneity of Variance Ways of dealing with models where the variances are not sufficiently homogeneous (that is, heterogeneous): 1. Non-parametric test: Kruskal-Wallis 2. Modified tests: Brown-Forsythe and Welch's ANOVA test 3. Transformations (square root, logarithmic)

208 Outliers The following are ways of identifying the presence of outliers: 1. Side by side plotting of the raw data (histograms and box plots). 2. Examination of residuals. Residuals for Levene's test: $e_{ij} = x_{ij} - \bar{x}_j$
