
Environmetrics II (Part 1: DOE)

Environmetrics is one of the newest so-called metrics sciences, e.g. psychometrics, econometrics, chemometrics, and (for some reason) biometry. These are methodological sciences for analysing measurement data and for designing experimental set-ups in different fields of science. Environmetrics thus deals with measurement data of all environmental sciences, and data-analytical problems related to monitoring our environment are typical material of environmetrical publications. In environmental engineering, however, the emphasis is different: designing new systems or processes, or monitoring existing ones, plays the key role. As a consequence, only a selection of environmetrical tools is treated in this course. Very many of them are similar to the ones in technometrics or chemometrics.

Useful books:
1. Grady Hanrahan, Environmental Chemometrics, CRC Press.
2. Statistics for Environmental Engineers, CRC Press LLC, 2002.

The course is divided into two parts. In the first part we shall study the problem of how to design experiments so that the number of experiments remains reasonable and, at the same time, we get reliable information about the system under study. In the second part, we shall learn the elements of multivariate data analysis, which is typically needed in monitoring or fault detection problems, or in any problem with a large number of variables. As an example of the latter, consider a spectral measurement of an environmental sample. The measured spectrum actually consists of typically thousands of variables, because each measured absorbance can be considered a single variable.

Note that the lab exercises and examples carried out using Excel, Matlab or R are an essential part of this course. It is also recommended to print out e.g. R history files or R command files.

Statistical design of experiments (DOE)

The role of experimentation

Experiments are needed if we want to find out something that cannot be found out otherwise. For example, building a new wastewater treatment unit doesn't necessarily require experiments. There is plenty of both theoretical and empirical knowledge of common wastewater treatment units, and using this knowledge it is possible to build a new one and to expect it to work. However, if we are going to make something completely new, the only alternatives are to use either theoretical knowledge or experimentation. Using theoretical knowledge in design, scale-up or optimization in engineering means simulation based on mathematically described models converted into computer programs. In contrast, one might expect that experimentation could be modelling-free. That is not the case, and hopefully this course will explain why.

The objectives of part 1 are
- to show why proper design of experiments is crucial in getting reliable and meaningful results from experiments
- to give understanding of the basic principles and concepts of DOE
- to introduce some of the most common types of designs
- to give understanding of the complexity of real applications

The most important of the objectives listed above is the first one. It is not realistic to expect anyone to become a specialist in the field after an introductory course. DOE is as much an art as a science, and its mastery requires a lot of experience and continuous study. However, it is possible to understand its importance and main principles by going through some typical simple examples. In this introduction, unnecessary mathematics is avoided, but basic knowledge of engineering mathematics is needed. Also basic concepts of probability and statistics, e.g. normality, repeatability or measurement uncertainty, covered in Environmetrics I, are essential in understanding the principles of DOE.

Many commercial software packages which facilitate making good designs and analysing the results are available. This introduction is independent of such packages, and all the calculations of the examples are carried out using Matlab, R or Excel. However, for an infrequent user of DOE tools, commercial software (e.g. JMP, MODDE, SPSS, ...) can be of great benefit.

Exercise 1
Suppose we are planning experiments in order to see how pressure and temperature affect the quality of the product. The experiments are very expensive and we want to do as few of them as possible. Below are three designs for a case where the design variables are pressure (p) and temperature (T):

Design 1        Design 2        Design 3
 p    T          p    T          p    T

Which is the worst and which is the best design of these three, and why?

1. The interplay between learning and experimentation

All scientific knowledge is based on successive steps of deductive and inductive reasoning. In the beginning, there is a hypothesis (a theory) about the problem to be studied. This hypothesis allows us to conclude the expected outcome of our experiments, which is deductive reasoning. After the experiments have been conducted, the results either confirm or contradict the hypothesis. If there is a contradiction, we have to reject the original hypothesis, and there is a need for a new hypothesis to explain the contradiction. The process of finding new possible hypotheses is inductive reasoning. The new hypothesis allows new deductions and the planning of new experiments to check its validity. This cycle can be repeated until the theory is satisfactory. It is easy to see the role of such cycles e.g. in the development from Newtonian mechanics to the theory of relativity. It is good to realize the importance of such cycles in all experimental research, and especially in DOE.

2. Experimental error

Whenever experiments are carried out, the role of experimental error has to be considered. Any real-life experiment involves several steps that influence the final result. In practice, none of these steps can be exactly repeated, due to inevitable random variation in the conditions related to the step. Therefore the results of repeated experiments will always vary, even if one tries to keep the conditions as constant as possible. Thus we shall always encounter the problem of distinguishing systematic variation from random variation. One of the key issues in DOE is to get an estimate of the magnitude of the random variation; reliable conclusions are impossible without this knowledge.

For meaningful conclusions, we need to know the basic laws of random variation. The most important such laws are the central limit theorem, the law of large numbers and the propagation of independent random errors. The first of these states that an error that is a sum of independent random errors obeys the normal (Gaussian) distribution. Its importance relies on the fact that the total error of any well conducted experiment is a sum of small independent errors from different sources, e.g. weighing, dilutions, control of different variables such as temperature or pressure, and many others of human origin or due to the equipment.

The normal distribution is characterized by two parameters: the expected value μ, which is the theoretical mean value, and the variance σ² (or its square root, the standard deviation σ). The nature of the normal distribution is best realized by recalling the following rules of thumb:

68.3% of normally distributed measurement results are in the interval μ ± σ
95.4% of normally distributed measurement results are in the interval μ ± 2σ
99.7% of normally distributed measurement results are in the interval μ ± 3σ

As a consequence, if a result deviates more than 3σ from the expected value, it is reasonable to suspect that the true expected value of this result is not μ. Rather, there may be an error, or a systematic change has occurred. This is the basic logic of e.g. statistical control charts and statistical tests.

We shall introduce only the most important rules of propagation of errors. First of all, it should be noted that these rules are valid only for statistically independent measurements. The first of them states that the standard deviation of a mean (x̄) of n results is smaller than the standard deviation of the individual results. To be more exact:

σ(x̄) = σ/√n.  (1)

The most important consequence of this fact is that we can increase the accuracy simply by increasing the number of replicate measurements. The other rule states that the standard deviation of the difference of two replicate measurements, say x and y, is larger than the standard deviation of the individual results. To be more exact:

σ(x − y) = √2 · σ.  (2a)

If x and y have different standard deviations, i.e. they are not replicates but are still independent, the equation is

σ(x − y) = √(σx² + σy²).  (2b)

This is a fact that we have to take into account whenever we compare two measurements that are assumed to have the same expected value, i.e. in statistical tests concerning differences between mean values. Rough conclusions about experimental results can be obtained using these simple rules. In real cases, however, we have to take into account the uncertainty of our estimate of σ, because in practice we have to use the sample standard deviation S instead of σ. We come to this point later.

Example 1

Now let us take a simple example. Six experiments, three with catalyst 1 and three with catalyst 2, have been made in random order and the purity of the product has been recorded for each run. The question is whether we can consider catalyst 1 to be better than catalyst 2 or not. The mean of the purities obtained with catalyst 1 is 8.8 units higher than the mean of the purities obtained with catalyst 2.

The logic of the comparison goes in the following way: If we knew the repeatability, i.e. the standard deviation σ, of the measurements, we could calculate the standard deviation of the difference of the means (8.8) using the rules given above. Assuming that there is no difference between the catalysts, the expected value of the difference would be zero. After this we could simply use the rules of thumb of the normal distribution for estimating the plausibility of the observed difference under this assumption. If the observed difference were highly improbable, we would conclude that there must be a difference between the catalysts.

The problem is that we don't know the standard deviation. However, we can estimate it; actually we can get two estimates from the two sets of three replicates. These are S1 = 2.4 and S2 = 2.6. Considering the small difference between these two estimates, we can assume that the type of the catalyst doesn't affect the repeatability. Therefore we can get a more reliable estimate by averaging them. One of the rules of propagation of errors states that standard deviations must be averaged quadratically, i.e. S = √((S1² + S2²)/2) ≈ 2.5.
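The quadratic averaging and the standard deviation of the difference of the two means can be checked with a couple of lines of R; a minimal sketch using only the figures quoted above (S1 = 2.4, S2 = 2.6, three runs per catalyst):

> s1 <- 2.4; s2 <- 2.6; n <- 3    # replicate standard deviations, 3 runs per catalyst
> s <- sqrt((s1^2 + s2^2)/2)      # quadratic (pooled) average of the two estimates, approx. 2.5
> s.diff <- s*sqrt(1/n + 1/n)     # standard deviation of the difference of the two means (Eqs. 1 and 2b), approx. 2.0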

Now the standard deviation of the difference of the two means is, using the rules given above, 2.5·√(1/3 + 1/3) ≈ 2.0. Using the rules of thumb, we can conclude that the observed difference is more than three times this standard deviation and, consequently, it is plausible that the catalysts differ and the first one gives better purities. The problem with this reasoning is that we did not take into account the uncertainty of using S instead of the true σ. To overcome this, we could make a so-called permutation test or a formal statistical test.

Two sample permutation test (applied to the example above)

If the catalysts performed equally, the only variation in the data would be random variation. Therefore labelling the experiments by 1 or 2 should have no effect, other than a random one, on the difference between what label 1 and label 2 mean. The six labels can be assigned to the 6 experiments in 6! = 720 ways. Many of these give the same sequence, and there are actually only 20 different permutations of 1,1,1,2,2,2 (the design was the randomized order in which the labels were actually assigned to the runs). Only one of these different permutations gives a difference of means that is greater than 8. If this were only due to chance, we were quite lucky, and maybe it is more reasonable to think that there was a cause for such a high difference, namely the different catalysts.

Carrying out permutation tests in practice requires programming skills in Matlab, Octave, R or any similar software. If the number of experiments is high, it is impossible to go through all permutations (e.g. 11! = 39 916 800). In such cases a good alternative is to make enough, say one million, randomly chosen permutations and gather the statistics from them. However, if we can assume the experimental errors to be statistically independent and approximately normally distributed, we can carry out formal statistical tests.

Two sample t-test (and some general concepts related to statistical tests)

In statistical tests, we have to make two alternative hypotheses: the null hypothesis and the alternative (research) hypothesis. The null hypothesis assumes that there is no difference between the two subjects to be compared. The alternative hypothesis is the opposite of this. However, the alternative can be one-sided or two-sided. The two-sided alternative states that there is a difference; the one-sided alternative states that one particular subject gives better results than the other. It should be noted that one should use a one-sided hypothesis only if there is a reason for it prior to the experiments. If we assume that no such prior knowledge exists, the hypotheses of Example 1 are H0: μ1 = μ2 and H1: μ1 ≠ μ2.

After this, we have to decide the maximum risk of an erroneous rejection of the null hypothesis. This is called the level of significance (α), and the erroneous rejection of the null hypothesis is called a type I error. Typically α is set to 0.05, 0.01 or 0.001. The more severe the consequences of a type I error would be, the smaller we should select α. Note the kind of contradiction in the terminology: a small α means a high level of significance. Let us choose α = 0.05.

Next we have to calculate the so-called test statistic. This is a quantity whose statistical distribution is known under the null hypothesis, letting us calculate the probability of a type I error. In this case the test statistic is calculated as

t = (x̄1 − x̄2) / (S·√(1/3 + 1/3)).

A more general formula for two samples X and Y would be

t = (x̄ − ȳ) / (Sp·√(1/nx + 1/ny)),  where  Sp² = ((nx − 1)Sx² + (ny − 1)Sy²) / (nx + ny − 2).

Now we have to know the statistical distribution of t under the null hypothesis. It is the so-called Student's t-distribution, which has only one parameter, the so-called degrees of freedom, which in this case are ν = n1 + n2 − 2 = 4. Now we can calculate the probability of a type I error assuming that the null hypothesis is true. This is called the p-value of the test. It can be calculated e.g. using MS Excel's TDIST function. After this, making the conclusion is easy: if the p-value is smaller than α, the null hypothesis is rejected. Actually, one doesn't have to carry out the calculations above in practice, because Excel contains a macro for this particular test, and we shall show its results. It is even simpler to carry out the test in R using the R function t.test. But before that, we have to consider a few more important facts about statistical tests.

Let us consider the case where the null hypothesis is not rejected. It is important to understand that this doesn't mean that we have proved the null hypothesis to be true. It simply means that there is not enough evidence to reject it. A good analogy of statistical testing is a trial: if a person is announced not guilty, it does not prove that he or she hasn't committed the crime. It simply means that there isn't enough evidence ("beyond any reasonable doubt") to show that he or she is guilty. In the case of statistical testing, the question is related to two important concepts: practical significance and the power of a test.

The power of a test is defined as the probability of rejecting the null hypothesis when the alternative is true. It is one minus the probability of the type II error, i.e. the error that is made when the null hypothesis is not rejected although the alternative is true. The figure below depicts the probabilities of type I and type II errors, i.e. the level of significance and the power. If H0 were true and the true mean were 8, then the shaded red area gives the level of significance. If H1 were true and the true mean were 12, the green shaded area gives the probability of the type II error (remember that the power is 1 − P(type II error)). In both cases we assume that 10 is the rejection limit. Now it is easy to see that moving the rejection limit towards the null hypothesis will increase the type I error probability and consequently lower the level of significance, but simultaneously it will increase the power. Therefore, deciding the rejection limit is always a compromise between the level of significance and the power.
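Both the t-test and the permutation test described above are easy to do in R. The purity values below are made up for illustration only; they merely mimic Example 1 (means differing by about 8.8, replicate standard deviations close to 2.4 and 2.6). A minimal sketch:

> cat1 <- c(93.0, 88.2, 91.2)            # hypothetical purities, catalyst 1
> cat2 <- c(84.5, 79.3, 82.2)            # hypothetical purities, catalyst 2
> t.test(cat1, cat2, var.equal = TRUE)   # two-sample t-test assuming equal variances
> # permutation test: compare the observed difference of means with the
> # differences obtained under all 20 possible relabellings of the runs
> y <- c(cat1, cat2)
> labels <- combn(6, 3)                  # all 20 ways to choose which 3 runs get label 1
> perm.diffs <- apply(labels, 2, function(i) mean(y[i]) - mean(y[-i]))
> mean(abs(perm.diffs) >= abs(mean(cat1) - mean(cat2)))   # two-sided permutation p-value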

In analysing designed process experiments, we tend to use lower levels of significance than e.g. in pharmaceutical testing. The reason is that we want good power for detecting possible improvements. However, common sense should be used, and if the rejection of the null hypothesis would introduce great financial or human risks, a small enough level of significance must be used.

(Figure: the distributions of the test statistic when H0 is true and when H1 is true, with the rejection limit at 10.)

In order to be able to calculate the power, we need to know the true difference between the subjects to be compared, denoted by δ. In practice we never know the true difference. Therefore, it is reasonable to calculate the power assuming the difference to be the smallest one that has any practical significance. Practical and statistical significance are two different concepts. The former means that the difference has e.g. financial or other practical relevance, whereas the latter means that we can be confident that such a difference exists in reality, however small it is. Consequently, if we detect a practically significant difference that is not statistically significant, i.e. the null hypothesis is not rejected, the reasonable course of action is to make more experiments so that the test has enough power to detect such a difference. An approximate formula (assuming α to be 0.05 and the power to be 0.95) for estimating the number of experiments needed to achieve the wanted power is

n = (4kσ/δ)²,

where k is the number of subjects to be compared (2 in our example). Accurate formulae exist for specific tests (found in most statistical software), but this approximation is good enough if n is large enough. Here n means the total number of experiments, i.e. n/2 with both catalysts.
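As a sketch, both the approximation and the corresponding exact power calculation can be written in R; power.t.test solves the per-group sample size of a two-sample t-test, and the values σ = 2.5 and δ = 2 are the ones used in the calculation that follows:

> wheeler.n <- function(k, s, delta) (4*k*s/delta)^2                  # approximate total number of experiments
> wheeler.n(k = 2, s = 2.5, delta = 2)                                # gives 100, i.e. 50 runs per catalyst
> power.t.test(delta = 2, sd = 2.5, sig.level = 0.05, power = 0.95)   # exact calculation: about 41 runs per group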

Let us use the formula in our example, assuming that the smallest practically significant difference is 2 %-units. The formula gives (4·2·2.5/2)² = 100, i.e. 50 experiments with both catalysts. If we assume that the true difference is the same as the observed difference (8.8), the formula gives (4·2·2.5/8.8)² ≈ 5.2, i.e. about 3 experiments with both catalysts, which is actually the number of experiments made. If the calculations are carried out exactly, e.g. in R, we get 41 and 4 experiments with both catalysts. Thus the approximate formula, called Wheeler's formula, over-estimates the number of experiments when it is high and under-estimates it when it is low. In spite of its approximate nature, the formula gives the correct order of magnitude of the number of experiments needed.

Now, let us perform the test using Excel. After the data has been typed in, you should click Tools/Data Analysis.../t-Test: Two-Sample Assuming Equal Variances. For the conclusion, we need from the resulting output only the p-value of the two-sided test, which is well below 0.05. Thus we reject the null hypothesis at the 0.05 level of significance.

Though we have so far considered only one statistical test, it is good to know that the basic principles behind all statistical tests are exactly the same. The only things that vary are the calculation of the test statistic and its p-value. Later we shall see other applications of statistical tests. It is also good to know that all tests are based on some general assumptions. Practically all tests require that the experimental errors are statistically independent and approximately normally distributed. The latter can normally be safely assumed, if there are no gross errors, due to the central limit theorem. The former, in turn, can usually be guaranteed by randomizing the order of the experiments. Randomization is crucial in good experimentation!

Exercise 2
Carry out the same test using R or Matlab.

Experimental error in DOE

Wheeler's formula clearly shows how the power of the test, with a fixed number of experiments, gets lower as the standard error of the replicates gets higher. Therefore, it is essential to know the degree of repeatability in order to make any meaningful (significant) conclusions. By experimental error we mean the total uncertainty related to the result of an experiment. The only objective way of getting information about the mean experimental error is to make replicate experiments.

Consider experiments where we have varied the pH of a reaction and recorded the yield at each pH. If we knew that the mean experimental error is negligible, we would expect that the optimal yield is obtained at a pH between 8 and 9. However, if we knew that the mean experimental error is ca. 3, it might be quite possible that the yield would be even higher at higher pH values, i.e. a linear trend would be quite plausible. In this case, fitting a quadratic model to the data would mean over-fitting. On the other hand, if the mean experimental error were, say, 1, a linear model (a straight line) would suffer from lack-of-fit. Consequently, without knowledge of the mean experimental error it is impossible to assess the reliability of a model describing the dependencies in the data. Therefore, if there is no prior knowledge of the mean experimental error, a good design has to contain replicated experiments. A good model is neither over-fitted, nor does it suffer from lack-of-fit. If the mean experimental error is known, the degree of lack-of-fit can be statistically tested.

In summary, the use of empirical models can lead to totally erroneous conclusions if we don't keep in mind the experimental error lurking behind every experimental result. We shall further elucidate this important fact with the following simulated example of a known model for the yield of a chemical process. Suppose that the standard deviation of the error is 5 and that the errors are statistically independent and normally distributed. The four figures below show four independent series of three measurements at pH's 6.0, 6.5 and 7.0.

(Figure: the four replicate series, Replicate 1 - Replicate 4, plotted separately.)

If one made conclusions based on non-replicated measurements, they might be completely wrong, especially if they rely on over-fitted models. A typical example of an over-fitted model would be fitting a parabola, i.e. a quadratic model, to the plots above. The fourth replicate series would give completely wrong predictions about the direction of improvement. The figure below shows the same data with added linear models (straight lines) and quadratic models (parabolas):

(Figure: the four replicate series, Replicate 1 - Replicate 4, each with a fitted straight line and a fitted parabola.)

Note that none of the linear models is completely wrong: they all show positive slopes, the smallest being about 7.7. The reason is that in fitting a straight line the residuals have one degree of freedom, whereas in fitting the parabola the residuals have zero degrees of freedom. In general, the more degrees of freedom are left for the residuals, the more reliable the model is in a statistical sense.

If the experiment had been designed to have four replicates, the four series could have been plotted in a single graph:

(Figure: all replicates in one graph; legend: 1st, 2nd, 3rd, 4th, mean values, "All replicates".)

Now, the straight line above, fitted to the mean values of the replicates, almost coincides with the true straight line, showing beautifully the statistical law of central tendency, i.e. the uncertainty of a mean is √n times smaller than the uncertainty of a single measurement (n is the number of independent replicate measurements). Note also that the best way to guarantee the statistical independence of the errors is to randomize the order of the experiments; in the case above, for example, the twelve runs would have been carried out in a random order. Next we shall discuss how to model dependencies.

Exercise 3
Study how to randomize the rows of a table (matrix) in Excel (use RAND and Sort), in R (use sample) and in Matlab (use randperm). A small R sketch is given below.
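As a starting point for Exercise 3, randomizing a run order in R can be done with sample; a minimal sketch (the run list below is simply the twelve pH levels of the example above):

> runs <- rep(c(6.0, 6.5, 7.0), times = 4)   # the twelve planned runs: 3 pH levels, 4 replicates each
> runs[sample(length(runs))]                 # the same runs in a random execution order

In Matlab the function randperm plays the same role.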

3. Empirical mathematical models

In this introductory course, we shall discuss only so-called empirical mathematical models. It is easier to understand the concept if we first consider their opposite, mechanistic models (also called theoretical or first-principles models). A mechanistic model is one whose functional form can be deduced from physical or chemical theories. A simple example would be the Arrhenius equation, whose basis is in thermodynamics. In principle, one should use a mechanistic model whenever it is possible. However, in many practical situations the theory is either too complicated, or not even known, to describe the phenomenon under study. The rational approach then is to approximate the underlying unknown functional relationship using some convenient function. The basic principle is that any function, not close to its optimum, is approximately linear in a limited region. Close to an optimum, a quadratic function is a good approximation. For these reasons, the following functional forms are the most common approximations used. Such models are called empirical models:

y = b0 + b1x1 + b2x2 + ... + bNxN
y = b0 + b1x1 + ... + bNxN + b12x1x2 + b13x1x3 + ... (all pair-wise products)
y = b0 + b1x1 + ... + bNxN + b12x1x2 + ... + b11x1² + ... + bNNxN²

These are called linear, linear+interactions and quadratic models. The term interaction needs some clarification. The product terms in the second and third model are called pair-wise interactions. The interpretation of an interaction between two variables, say x1 and x2, is that the effect of x1 on y, the response variable, is affected by x2. This means that the slope with respect to x1 depends on the value of x2. An interaction can be antagonistic or synergistic depending on whether the other variable decreases or increases the slope with respect to the first one. It should be noted that interactions are very common in chemistry (just think about the ideal gas law!).

It is easy to see that the number of unknown parameters increases quickly with the number of independent variables (also called explanatory or design variables), especially if pair-wise interactions are included in the model. Naturally, higher order interactions may exist as well, but luckily these are seldom significant. Interactions play an important role in DOE. Their existence is the reason why so-called one-variable-at-a-time (OVAT) designs fail in finding optima. This is best illustrated graphically. You can try it with the simulated yield surface of a chemical reaction on p. 27: just start at any point and maximize first in the time direction and then in the temperature direction (or vice versa), and see what happens. Actually, OVAT designs are inefficient for another reason as well, but we shall come to that point later.
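In R, these three model types correspond directly to model formulas; a small sketch (x1, x2 and y are placeholder names for two design variables and the response):

> f.lin  <- y ~ x1 + x2                     # linear (main effects only)
> f.int  <- y ~ x1 + x2 + x1:x2             # linear + pair-wise interaction (equivalently y ~ x1*x2)
> f.quad <- y ~ x1*x2 + I(x1^2) + I(x2^2)   # quadratic model
> # any of these can be fitted with lm(), e.g. lm(f.int, data = mydata)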

Variable types

Before going into experimental designs, we have to consider the different variable types. In typical experimental set-ups we have two types of variables: categorical (also called qualitative) and continuous. A categorical variable has discrete values assigning the object to a category. Typical examples would be the type of a catalyst, the type of an impeller, etc. All ordinary variables, such as temperature, pressure or concentration, are continuous variables. Quite often categorical variables are called factors, especially in connection with the so-called analysis of variance (ANOVA). The type of a variable dictates the type of model that we can apply. Naturally, e.g. powers or logarithms have no meaning for categorical variables. In general, if the model contains only categorical variables, the experiments are analysed by ANOVA, and if the model contains only continuous variables, the experiments are analysed by regression analysis. Actually, ANOVA models can be considered a special case of regression where the categorical variables are coded by so-called binary coding, a topic that is outside the scope of this course. In the sequel, we shall focus on models with continuous variables, but before that we'll introduce the subclasses of factors.

Factors are classified on two different bases: 1) fixed vs. random, 2) crossed vs. nested. Though designs with qualitative factors are beyond our scope, it is important to understand these concepts. One must know that different kinds of combinations of factor subclasses require different ANOVA tests. This is important, because a wrong test typically leads to wrong conclusions. Therefore, anybody who needs to analyse designs with such factors needs to study more about ANOVA, or consult a statistician. In addition, these terms appear quite often in environmental research, e.g. as in the following excerpt: "These factors are sludge type (fixed factor, qualitative, 3 terms) and seasonal evolution (fixed factor, qualitative, 4 terms). The geographical factor (random factor, qualitative, 4 terms) could not have been statistically tested because of the absence of repetition." (from Biotechnol. Agron. Soc. Environ (2))

Fixed vs. random

The levels (values) of a fixed factor are exactly the ones that we are interested in. The focus of interest is in the differences between the factor levels, not in the overall variance caused by variation in the levels. The levels of a random factor represent a random sample of a larger set of possible level values. The focus of interest is the overall variance caused by variation in the factor levels. Comparing different stirrers in some process development problem would be a typical case of a fixed factor (the type of the stirrer). In studying the effect of raw material variation in some process, the raw material batch would be a typical random factor.

Crossed vs. nested

If we have at least two factors, we have to consider their pair-wise relationships. A factor is said to be nested within another factor if the levels of this factor may have different interpretations depending on the values of the other factor. If the interpretation of the factor

levels is unequivocal, the factors are called crossed. This may need some further explanation. Consider a case where 3 analysts have each taken 3 samples of the same material. Now the samples are nested within the analysts, because for example sample #1 of analyst #1 is not physically the same sample as sample #1 of analyst #2. If the experiment had been organised so that 3 samples had first been taken, and then each of the 3 analysts had analysed each sample, the factors (Analyst and Sample) would have been crossed.

Modelling with factors

Models with (qualitative) factors look different, because for example 2*Analyst doesn't make any sense. There are different ways of writing models with factors, but we'll get acquainted only with the most common way, which uses indexed variables. Let us take Example 1, where we had two catalysts, i.e. the factor is Catalyst. We could model the yield by

y_ij = μ + τ_i + ε_ij,

where i refers to the catalyst (i = 1, 2) and j refers to the replicate (j = 1, 2, 3). μ refers to the mean yield and the τ_i's refer to the deviations from the mean caused by the i'th catalyst. The epsilon term refers to the experimental error of the j'th replicate experiment with the i'th catalyst. For example, y_12 = μ + τ_1 + ε_12. After the parameters have been estimated, usually using the least squares principle and analysis of variance (ANOVA), we can estimate the true values of the yield (the fitted values) by

ŷ_ij = μ̂ + τ̂_i,

where the hats denote estimated values. Naturally, if we have more factors, we need more indices. For example, for two factors the model would be

y_ijk = μ + τ_i + β_j + ε_ijk   or   y_ijk = μ + τ_i + β_j + (τβ)_ij + ε_ijk.

In the latter, the fourth term is called the interaction between the factors.

Analysing models with (qualitative) factors only (ANOVA)

Models that contain factors only are analysed by the so-called analysis of variance (ANOVA). We are not going to study the mathematics behind ANOVA, but in spite of that we can learn to apply ANOVA in analysing environmental data. For that we need some knowledge of 1) the models behind ANOVA, 2) the logic of statistical (hypothesis) testing, and 3) the use of R (or some other software). We shall go through a couple of examples in the labs.
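As a preview of the lab examples, a one-factor ANOVA of the catalyst model above can be run in R with aov; a minimal sketch (the purity values are the same hypothetical numbers used in the earlier t-test sketch, not the real data of Example 1):

> purity   <- c(93.0, 88.2, 91.2, 84.5, 79.3, 82.2)   # hypothetical purities, 3 + 3 runs
> catalyst <- factor(rep(c(1, 2), each = 3))          # the qualitative factor
> fit <- aov(purity ~ catalyst)                       # fits y_ij = mu + tau_i + eps_ij by least squares
> summary(fit)                                        # ANOVA table with the F-test for the catalyst effect

For a single factor with two levels this F-test is equivalent to the two-sample t-test (F = t²).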

Factorial designs

Consider a case where we have two independent variables (pH and T) and one response variable (yield). 18 experiments have been made at different levels of pH and T. In the figure below, the yield has been plotted against pH and T:

(Figure: yield plotted against pH, and yield plotted against T.)

Can we conclude that the yield increases when pH increases and T increases? At first glance the question may sound odd, and without proper thinking many of us would answer immediately in the affirmative. This example shows the kind of problems that arise when the experiments are not designed factorially or, more generally, orthogonally. A factorial design is one in which all possible combinations of all variables at fixed levels are included in the design. For example, consider 3 catalysts (A, B and C) and two concentration levels (1 and 2 ppm). A factorial design would be:

Catalyst  Concentration (ppm)
   A            1
   A            2
   B            1
   B            2
   C            1
   C            2

In order to guarantee the independence of the results, these experiments should be carried out in random order! Note that the catalyst is a qualitative (categorical) variable and the concentration is a quantitative (and also continuous) variable. Factorial designs are applicable to both categorical and continuous design variables (as above). Factorial designs allow the estimation of empirical models that contain linear terms (also called main effects) and interactions up to the order equal to the number of variables. The problem with factorial designs is that the number of experiments increases rapidly with the number of variables. For example, if we have 5 variables with

3, 4, 2, 3 and 5 levels, we need altogether 3·4·2·3·5 = 360 experiments. In R, it is easy to construct factorial designs using the function expand.grid. For example, if factor A has levels 6, 7 and 8, and factor B has levels low, medium and high, the design is obtained by giving the following R commands:

> A <- 6:8; B <- c('low','medium','high')
> design <- expand.grid(A,B)
> design
  Var1   Var2
1    6    low
2    7    low
3    8    low
4    6 medium
5    7 medium
6    8 medium
7    6   high
8    7   high
9    8   high
>

In Matlab you can use the function mton from the Data Analysis toolbox. In Matlab it is easiest to use numerical codes for all factors. In Excel you have to learn rather complicated expressions, or just use copying, to construct factorial tables.

Exercise 4
a) Suppose we are making a factorial design with the following variables: 1) stirring speed at levels 200 and 300, 2) temperature at levels 40, 50 and 60, and 3) pH at levels 6, 6.5 and 7. Construct a factorial design, both using Excel and using R or Matlab.
b) Write down a model assuming all variables are qualitative factors and each combination of variables is replicated twice.

A huge number of experiments is a typical problem with factorial designs. However, for continuous design variables there is a remedy. Namely, if we assume that we are far from a possible optimum, the dependencies can be assumed to be well described by linear and interaction effects only. For a continuous variable, a linear effect can be estimated using only two levels, as the determination of a slope requires only two data points. For this reason, the two-level factorial designs, i.e. 2^N-designs, are the basic designs for continuous variables, and using only two levels keeps the number of experiments reasonable for moderate numbers of variables.

Two level factorial designs (2^N-designs)

Of the different factorial designs, the two-level designs play the most important role. The reason is that these designs are very cost-effective. They can be used for both qualitative and quantitative variables, and the results can be analysed by regression in both cases. However, many extensions of 2^N-designs, e.g. adding centre points or axial points, are applicable only to quantitative variables (these topics will be discussed later in the text).

2^N-designs are planned and analysed using so-called coded units, i.e. the lower level of any variable is denoted by -1 and the upper level by +1. There are two good reasons for doing so: 1) the design can be tabulated without knowing the actual variables, and 2) a model based on coded variables allows meaningful calculation of the direction of maximal improvement (the gradient). Of course the table in coded units must be transformed into physical units before carrying out the experiments. The formulae for transforming from coded to physical units and vice versa are

x_i = x̄_i + X_i · (Δ_i/2)   (3a)
X_i = (x_i − x̄_i) / (Δ_i/2)   (3b)

where capital letters denote coded variable levels (±1's), Δ_i denotes the difference between the two actual levels of variable i, and the bar denotes the average of the two levels.

Now, let us take a simple example. Suppose we want to maximize the yield of a batch reaction with respect to the reaction time (t) and temperature (T), and that we have decided the variable levels to be 100 and 160 min for t and 60 and 80 °C for T. The design in coded units would be:

 x1   x2
 -1   -1
 -1   +1
 +1   -1
 +1   +1

In physical units the table would be:

 t (min)   T (°C)
   100       60
   100       80
   160       60
   160       80
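The coding formulae (3a)-(3b) are easy to implement as helper functions; a sketch in R using the time variable above (levels 100 and 160 min):

> to.coded    <- function(x, lo, hi) (x - (lo + hi)/2) / ((hi - lo)/2)   # Eq. (3b): physical -> coded
> to.physical <- function(X, lo, hi) (lo + hi)/2 + X*(hi - lo)/2         # Eq. (3a): coded -> physical
> to.coded(c(100, 130, 160), lo = 100, hi = 160)                         # gives -1, 0, +1
> to.physical(-2, lo = 100, hi = 160)                                    # coded -2 corresponds to 70 min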

Again, the experiments should be carried out in random order. Now, suppose that we have obtained the following yields:

 t (min)   T (°C)     y
   100       60      55.3
   100       80      83.4
   160       60      67.7
   160       80      83.2

Before estimating a linear+interactions model, let us look at the data by plotting the yield against time and temperature:

(Figure: yield plotted against temperature, separately for t = 100 and t = 160.)

The figure shows that the yield increases with temperature for both reaction times. However, there seems to be a clear interaction between the variables, as the slope is clearly smaller with the longer reaction time. We also see that the yield increases with reaction time at the lower temperature, but not at the higher temperature. An extrapolation would suggest better yields when the temperature is increased at the shorter reaction time. However, a better view of the dependencies is obtained when we estimate the model. This can be accomplished using ordinary linear regression analysis, which can be carried out also in Excel. But before that, we must remember that at this point we have no idea about the repeatability.

Exercise 5
Make a similar plot using R or Matlab.

So, before the regression, let us carry out some replicate experiments. We shall place them at the centre point of the design, i.e. t = 130 and T = 70 (0 and 0 in coded units). There are good reasons to place the replicates at the centre point, and we shall come to this point later. The results of these experiments are the following:

 t (min)   T (°C)     y
   130       70      76.0
   130       70      75.6
   130       70      76.2
   130       70      77.8
   130       70      75.1

The standard deviation of the replicate yields is ca. 1.0, from which we can conclude that the changes in yield within the original design are large compared with the experimental error. Before the regression analysis we shall add these runs to the design in coded units and also calculate the product column corresponding to the interaction term. Thus, the table for the regression analysis looks like:

 X1   X2   X1X2     y
 -1   -1    +1    55.3
 -1   +1    -1    83.4
 +1   -1    -1    67.7
 +1   +1    +1    83.2
  0    0     0    76.0
  0    0     0    75.6
  0    0     0    76.2
  0    0     0    77.8
  0    0     0    75.1

The regression is then carried out with Excel's regression macro.
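The same fit can also be reproduced in R with lm; a minimal sketch using the table above (the Excel macro reports the corresponding statistics):

> X1 <- c(-1, -1,  1,  1, 0, 0, 0, 0, 0)                   # coded reaction time
> X2 <- c(-1,  1, -1,  1, 0, 0, 0, 0, 0)                   # coded temperature
> y  <- c(55.3, 83.4, 67.7, 83.2, 76, 75.6, 76.2, 77.8, 75.1)
> fit <- lm(y ~ X1 + X2 + I(X1*X2))                        # linear + interaction model in coded units
> summary(fit)                                             # R^2, residual standard error, coefficients, p-values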

The important figures in this output are the following, and next we shall give them meaningful interpretations.

R² (coefficient of determination)
This is usually expressed as a percentage, i.e. 94%. It tells the proportion of the variance of the yield explained by the model.

Adjusted R² (adjusted coefficient of determination)
Also this is usually expressed as a percentage, i.e. 90%. It tells the proportion of the variance of the yield explained by the model, taking into account the degrees of freedom. In general, it is a more realistic estimate of the goodness of fit (cf. e.g. Wikipedia).

Standard error
The title used in Excel is not very good; a better one would be the standard error of the residuals. In a good model this is close to the repeatability standard error. A value clearly larger than the repeatability standard error is a symptom of lack-of-fit, and a clearly smaller value is a symptom of over-fitting. In this example 2.7 is higher than 1.0, showing some lack of fit. However, its significance should be tested using a lack-of-fit test.

Significance F
Again the title is not a good one. This is actually the p-value of a test whose null hypothesis is that the yields vary randomly around a mean value. In our case this hypothesis would be rejected even at the 0.01 level of significance. Thus we can conclude that the variables have a significant effect on the yield.

Coefficients
These are the least squares estimates of the model parameters. Thus our model is

ŷ = 74.5 + 3.05·X1 + 10.9·X2 − 3.15·X1·X2.

The model supports our graphical interpretations, as can be seen just by looking at the signs of the slopes. Note also that the interaction effect is larger than the slope of the (coded) time.

P-values
These are the p-values of tests whose null hypothesis is that the regression coefficient (slope) in question is zero. Only the intercept and the slope of the temperature are significant at the 0.05 level of significance. However, 0.07 is quite close to 0.05, and it is a matter of taste whether to keep these terms in the model or not.

Any model of two design variables can be represented as a 3-dimensional surface or as a contour plot. Let us see what kind of a surface our model is.

(Figures: the fitted surface plotted as a 3-D surface and as a contour plot of yield against X1 and X2.)

This kind of a surface is called a saddle surface. Note that the surface has been plotted with extrapolated values. The model suggests impossibly high yields, which is quite common with empirical models. But this is not important; the important thing is whether the model tells the right direction of improvement or not! Now, it is easy to see that the model suggests that the best results are achieved in the upper left corner, i.e. with a shorter reaction time and a higher temperature. The point (-2, 2) is (70 min, 90 °C) in physical units, so let us make a new experiment with these values. The result is 86.5%, which is slightly better than any of the previous yields, so the model was not totally wrong. Let us proceed in the same direction and make an experiment at (40 min, 100 °C). Now we get 86.0%, which is poorer, but not significantly so. So let us make one more experiment in the same direction: (10 min, 110 °C). Now we get 67.7%, which means a significantly lower yield. At this point, we have to plan a new design, taking into account that the surface is now known to be curved.

We could have estimated the degree of curvature already around the original design, thanks to the centre point replicates we added to the design. This is based on a mathematical fact: the height of a linear surface (a plane in 3-D) at the centre point of a set of points is equal to the mean value of the heights at these points. In our case the mean value of the yields at the corner points is 72.4 and the yield at the centre point (taken as the average of the replicates) is 76.1, and thus the difference is 3.7. Now let us use the rules for the propagation of errors. According to Eq. 1, the standard deviation of the corner-point mean is 1.0/√4 = 0.50 and the standard deviation of the centre-point mean is 1.0/√5 ≈ 0.45, and the standard deviation of the difference (Eq. 2b) is √(0.50² + 0.45²) ≈ 0.67.

Now the difference is over 5 times larger than its standard error. According to the rules of thumb of the normal distribution, we can be pretty sure that such a big difference cannot be explained by normal random variation. Of course we could have used a formal t-test as well, but in clear cases like this one it is not necessary.

Now let us introduce some designs that allow the estimation of the parameters of quadratic models. These designs are needed whenever we know that the underlying model is a (significantly) curved surface. It should be noted that we use the word surface rather liberally; with more than 2 design variables we should, in mathematical terms, talk about hyper-surfaces. However, in DOE these will be called response surfaces. We shall see, in one of our lab sessions, how well an OVAT design would have performed in our example.

In R, the regression analysis, using the DOE tools by VMT, is carried out in the following way:

> X = mton(2,2,5) # 2^2-design with 5 centre points
> y = c(55.3,83.4,67.7,83.2,76,75.6,76.2,77.8,75.1) # yields
> model = quad.model.fit(X,y,1.5)
> args(quad.model.fit)
function (X, y, opt = 1, terms = NULL, model = NULL, blockvar = NULL)
NULL
> summary(model)

Call:
y ~ X1 + X2 + I(X1 * X2)

Residuals:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 e-09 ***
X1
X2                                               ***
I(X1 * X2)
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05

Residual standard error: on 5 degrees of freedom
Multiple R-squared: 0.94, Adjusted R-squared:
F-statistic: on 3 and 5 DF, p-value:
>

In addition to what is shown by the summary function, the function quad.model.fit calculates some additional useful statistics:

> model[13:17]
$nonlin.test
      t       p

$LOF.test
   FLOF    plof

$CV
 CVpred  CVresid  CV.Stud.resid

$Q2
[1]

The first of these (the 13th field of the object "model") gives the test statistic and the p-value of the so-called non-linearity test, which is possible only if the design contains centre points. The p-value suggests that we have to reject the null hypothesis of linearity within the experimental region. The second gives the so-called lack-of-fit test statistic and the corresponding p-value. Again the null hypothesis is rejected, i.e. the model suffers from significant lack-of-fit.

The next two fields (= list elements) give cross-validation information from the so-called leave-one-out cross-validation. The idea of leave-one-out cross-validation is simple: each observation in turn is left out of the data, and a model is fitted using the rest. Then, using this model, the value of the left-out response is predicted, and the prediction is recorded. This is repeated for all observations. Such predictions resemble true predictions more closely, because the observation being predicted is not used in building the model. Therefore, cross-validated predictions usually give a more realistic picture of the reliability of the model. The results of this example clearly show that this model should not be used for predictions. The Q² value is the R² value calculated from the cross-validated predictions, whereas R² is calculated using the fitted values.

You can easily make contour plots using R as well. The function for this is quad.plot. See the help in DOE_functions_v4.pdf and try it out.

Second and higher order designs

Obviously, optimization is not possible with linear or linear+interactions models, because any optimal point means curvature in its neighbourhood. Linear models can detect directions towards possible optima, but locating an optimum requires higher order models, and consequently designs that contain more than two levels. We shall next study the most common of such designs.

A very natural approach to multi-level factorial designs would be to extend the 2^N-designs to 3^N-designs (or M^N-designs). The problem is that the number of experiments becomes unbearable very quickly; for example, for 5 variables a 3^5-design has 243 experiments. Yet another approach would be to use several superimposed 2^N-designs, e.g. one using coded values -2 and 2 and another using coded values -1 and 1. Unfortunately, it can be shown that such designs do not permit the estimation of quadratic models. For these reasons, other designs like central composite designs and Box-Behnken designs are used. Now we shall illustrate the use of second order response surfaces and the most popular second order design, the central composite design (CC). The structure of CC designs is very simple: the design consists of a 2^N-design plus a centre point (possibly replicated) and so-called axial points.

The axial points are points all of whose coordinates are at the centre, i.e. zero, except for one. Thus there are 2N axial points for N variables. The most common choice for the non-zero coordinate value is 2^(N/4). We shall use the same example as above, and simply complement the original 2^2-design with 4 axial points. In our example the non-zero coordinate value in coded units, i.e. the distance from the origin, is 2^(2/4) = √2 ≈ 1.414. Thus the table of axial points in coded units is

   x1        x2
 -1.414       0
 +1.414       0
    0      -1.414
    0      +1.414

and in physical units

 t (min)   T (°C)
   87.6      70
  172.4      70
  130        55.9
  130        84.1

If we add the axial points, together with the previously made replicated centre point, into our design and make the experiments, we obtain a table with columns t, T and y containing all thirteen runs and their yields.
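If the mton function is not available, the full central composite design in coded units can also be assembled directly in base R; a minimal sketch (the axial distance √2 follows from the 2^(N/4) rule above):

> corners <- expand.grid(X1 = c(-1, 1), X2 = c(-1, 1))        # the original 2^2-design
> a <- 2^(2/4)                                                # axial distance, approx. 1.414
> axial   <- data.frame(X1 = c(-a, a, 0, 0), X2 = c(0, 0, -a, a))
> centre  <- data.frame(X1 = 0, X2 = 0)
> ccd <- rbind(corners, axial, centre[rep(1, 5), ])           # 2^2 corners + axial points + 5 centre replicates
> ccd                                                         # the central composite design in coded units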

Now, let us estimate and analyse a quadratic model in coded units for the yield. In Excel, one has to add two new columns for the quadratic terms to the table for the regression analysis, after which the regression macro is run just as before.

Note that the R² value is very good, the overall significance is high and all coefficients are significant at the 0.05 level of significance, the quadratic term of the temperature being the weakest. The extrapolated response surface of this quadratic model is shown in the two figures below.

(Figures: the extrapolated response surface of the quadratic model, as a 3-D surface and as a contour plot of yield against time and temperature.)

The figure clearly shows that we should have raised the temperature more than we did on the basis of the 2^2-design. According to the figure, a good guess in coded units would have been (-2, 3), i.e. (70 min, 100 °C) in physical units. This, indeed, would have given a better yield, but we shall not give the results, in order not to spoil our hands-on exercise. The same analyses using R are made in the labs.

Response surface analysis

The response surface above shows a clear rising ridge. The direction of the ridge can be derived by so-called canonical analysis, but this needs some more advanced mathematical techniques, namely the use of so-called eigenvalues, which is beyond the scope of this introductory course. However, it is important to realize that graphical tools have severe limitations when the number of variables is higher than 2. In such cases we have to rely on computational tools. Another computational tool for analysing response surfaces is the calculation of the so-called stationary point. A stationary point of a quadratic surface, if it exists, can be either a minimum, a maximum or a saddle point. Mathematically it is the point where the gradient, i.e. the vector of partial derivatives of the response variable with respect to the design variables, is a vector of zeros.

Let us calculate the stationary point of the quadratic model above. The partial derivatives of the model with respect to the coded variables are linear functions of X1 and X2. Finding the stationary point simply means solving the pair of equations obtained by setting both derivatives to zero. This is easily done by hand, or in Excel using its matrix functions. The stationary point turns out to be physically impossible, because the reaction time cannot be negative. However, we have calculated the predicted value at the stationary point to verify that it is a maximum point. (A rigorous proof would require the use of eigenvalues.)

In cases like this, there is no sense in making experiments at the stationary point. Instead, we have two alternatives: 1) to design points along the so-called gradient path, or 2) to use an optimizer to find the best points outside the design area. Naturally, if the stationary point is a maximum and lies inside the design region, one should make an experiment at the stationary point in order to verify whether the model predicts the best point correctly. For some reason, usually only the second alternative is available in commercial DOE software. It would be possible in Excel as well, using the Solver tool. However, in Excel it is faster to get new points using the gradient path method. We shall illustrate the gradient path approach. First it must be noted that the gradient path can be curved. Therefore we must calculate the points densely enough, and then use only more sparsely selected points. Note that both methods work also for linear and linear+interactions models!
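For completeness, the same stationary-point calculation is a couple of lines in R. The sketch below writes the fitted quadratic as y = b0 + b'x + x'Bx, where B holds the quadratic coefficients on its diagonal and half of the interaction coefficient off-diagonal; the coefficient values used here are placeholders, not the fitted values of our example:

> b <- c(3, 11)                                 # placeholder linear coefficients (b1, b2)
> B <- matrix(c(-1, -0.5, -0.5, -1.5), 2, 2)    # placeholder quadratic part (diagonal) and half-interaction (off-diagonal)
> xs <- -0.5 * solve(B, b)                      # stationary point: solves the gradient equation b + 2*B*x = 0
> xs                                            # stationary point in coded units
> 75 + sum(b*xs) + drop(t(xs) %*% B %*% xs)     # predicted value there (75 is a placeholder intercept)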

It is easy to understand why we aim at processes that operate at optimal conditions. However, there is an aspect that is not always realized: around the optimum, the process is less sensitive to changes in the process variables. A process operating at optimal conditions does not only produce the best product, but also the least varying product.

The scaling factor in the above calculations means that the scaled gradient is 0.25 times the original gradient vector. This is then added to the previous point, and the calculations are repeated for each new point. The physical units are obtained using the coding formulae. If we plot the path, it looks like:

(Figure: the gradient path plotted in the time-temperature plane.)
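The same stepping scheme can be sketched in a few lines of R (again with the placeholder coefficients used above; only the idea of repeatedly adding 0.25 times the gradient is taken from the text):

> b <- c(3, 11); B <- matrix(c(-1, -0.5, -0.5, -1.5), 2, 2)   # placeholder model coefficients
> grad <- function(x) b + 2 * drop(B %*% x)                   # gradient of y = b0 + b'x + x'Bx
> path <- matrix(0, nrow = 21, ncol = 2)                      # row 1 is the design centre (0, 0)
> for (i in 2:21) path[i, ] <- path[i - 1, ] + 0.25 * grad(path[i - 1, ])
> plot(path, type = "b", xlab = "coded time", ylab = "coded temperature")   # the gradient path in coded units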


More information

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Lecture No. # 36 Sampling Distribution and Parameter Estimation

More information

Introducing Proof 1. hsn.uk.net. Contents

Introducing Proof 1. hsn.uk.net. Contents Contents 1 1 Introduction 1 What is proof? 1 Statements, Definitions and Euler Diagrams 1 Statements 1 Definitions Our first proof Euler diagrams 4 3 Logical Connectives 5 Negation 6 Conjunction 7 Disjunction

More information

Finite Mathematics : A Business Approach

Finite Mathematics : A Business Approach Finite Mathematics : A Business Approach Dr. Brian Travers and Prof. James Lampes Second Edition Cover Art by Stephanie Oxenford Additional Editing by John Gambino Contents What You Should Already Know

More information

1 A Non-technical Introduction to Regression

1 A Non-technical Introduction to Regression 1 A Non-technical Introduction to Regression Chapters 1 and Chapter 2 of the textbook are reviews of material you should know from your previous study (e.g. in your second year course). They cover, in

More information

Uncertainty, Error, and Precision in Quantitative Measurements an Introduction 4.4 cm Experimental error

Uncertainty, Error, and Precision in Quantitative Measurements an Introduction 4.4 cm Experimental error Uncertainty, Error, and Precision in Quantitative Measurements an Introduction Much of the work in any chemistry laboratory involves the measurement of numerical quantities. A quantitative measurement

More information

Using Microsoft Excel

Using Microsoft Excel Using Microsoft Excel Objective: Students will gain familiarity with using Excel to record data, display data properly, use built-in formulae to do calculations, and plot and fit data with linear functions.

More information

Just Enough Likelihood

Just Enough Likelihood Just Enough Likelihood Alan R. Rogers September 2, 2013 1. Introduction Statisticians have developed several methods for comparing hypotheses and for estimating parameters from data. Of these, the method

More information

Lecture 10: F -Tests, ANOVA and R 2

Lecture 10: F -Tests, ANOVA and R 2 Lecture 10: F -Tests, ANOVA and R 2 1 ANOVA We saw that we could test the null hypothesis that β 1 0 using the statistic ( β 1 0)/ŝe. (Although I also mentioned that confidence intervals are generally

More information

Business Statistics. Lecture 9: Simple Regression

Business Statistics. Lecture 9: Simple Regression Business Statistics Lecture 9: Simple Regression 1 On to Model Building! Up to now, class was about descriptive and inferential statistics Numerical and graphical summaries of data Confidence intervals

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Chapter 27 Summary Inferences for Regression

Chapter 27 Summary Inferences for Regression Chapter 7 Summary Inferences for Regression What have we learned? We have now applied inference to regression models. Like in all inference situations, there are conditions that we must check. We can test

More information

Lecture 2: Linear regression

Lecture 2: Linear regression Lecture 2: Linear regression Roger Grosse 1 Introduction Let s ump right in and look at our first machine learning algorithm, linear regression. In regression, we are interested in predicting a scalar-valued

More information

Treatment of Error in Experimental Measurements

Treatment of Error in Experimental Measurements in Experimental Measurements All measurements contain error. An experiment is truly incomplete without an evaluation of the amount of error in the results. In this course, you will learn to use some common

More information

A Scientific Model for Free Fall.

A Scientific Model for Free Fall. A Scientific Model for Free Fall. I. Overview. This lab explores the framework of the scientific method. The phenomenon studied is the free fall of an object released from rest at a height H from the ground.

More information

Chemometrics Unit 4 Response Surface Methodology

Chemometrics Unit 4 Response Surface Methodology Chemometrics Unit 4 Response Surface Methodology Chemometrics Unit 4. Response Surface Methodology In Unit 3 the first two phases of experimental design - definition and screening - were discussed. In

More information

1 Least Squares Estimation - multiple regression.

1 Least Squares Estimation - multiple regression. Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1

More information

Regression Analysis IV... More MLR and Model Building

Regression Analysis IV... More MLR and Model Building Regression Analysis IV... More MLR and Model Building This session finishes up presenting the formal methods of inference based on the MLR model and then begins discussion of "model building" (use of regression

More information

DESAIN EKSPERIMEN Analysis of Variances (ANOVA) Semester Genap 2017/2018 Jurusan Teknik Industri Universitas Brawijaya

DESAIN EKSPERIMEN Analysis of Variances (ANOVA) Semester Genap 2017/2018 Jurusan Teknik Industri Universitas Brawijaya DESAIN EKSPERIMEN Analysis of Variances (ANOVA) Semester Jurusan Teknik Industri Universitas Brawijaya Outline Introduction The Analysis of Variance Models for the Data Post-ANOVA Comparison of Means Sample

More information

A-Level Notes CORE 1

A-Level Notes CORE 1 A-Level Notes CORE 1 Basic algebra Glossary Coefficient For example, in the expression x³ 3x² x + 4, the coefficient of x³ is, the coefficient of x² is 3, and the coefficient of x is 1. (The final 4 is

More information

WISE Regression/Correlation Interactive Lab. Introduction to the WISE Correlation/Regression Applet

WISE Regression/Correlation Interactive Lab. Introduction to the WISE Correlation/Regression Applet WISE Regression/Correlation Interactive Lab Introduction to the WISE Correlation/Regression Applet This tutorial focuses on the logic of regression analysis with special attention given to variance components.

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

( )( b + c) = ab + ac, but it can also be ( )( a) = ba + ca. Let s use the distributive property on a couple of

( )( b + c) = ab + ac, but it can also be ( )( a) = ba + ca. Let s use the distributive property on a couple of Factoring Review for Algebra II The saddest thing about not doing well in Algebra II is that almost any math teacher can tell you going into it what s going to trip you up. One of the first things they

More information

Uncertainty. Michael Peters December 27, 2013

Uncertainty. Michael Peters December 27, 2013 Uncertainty Michael Peters December 27, 20 Lotteries In many problems in economics, people are forced to make decisions without knowing exactly what the consequences will be. For example, when you buy

More information

The First Derivative Test

The First Derivative Test The First Derivative Test We have already looked at this test in the last section even though we did not put a name to the process we were using. We use a y number line to test the sign of the first derivative

More information

t-test for b Copyright 2000 Tom Malloy. All rights reserved. Regression

t-test for b Copyright 2000 Tom Malloy. All rights reserved. Regression t-test for b Copyright 2000 Tom Malloy. All rights reserved. Regression Recall, back some time ago, we used a descriptive statistic which allowed us to draw the best fit line through a scatter plot. We

More information

Introduction to Design of Experiments

Introduction to Design of Experiments Introduction to Design of Experiments Jean-Marc Vincent and Arnaud Legrand Laboratory ID-IMAG MESCAL Project Universities of Grenoble {Jean-Marc.Vincent,Arnaud.Legrand}@imag.fr November 20, 2011 J.-M.

More information

DISTRIBUTIONS USED IN STATISTICAL WORK

DISTRIBUTIONS USED IN STATISTICAL WORK DISTRIBUTIONS USED IN STATISTICAL WORK In one of the classic introductory statistics books used in Education and Psychology (Glass and Stanley, 1970, Prentice-Hall) there was an excellent chapter on different

More information

Statistics Primer. A Brief Overview of Basic Statistical and Probability Principles. Essential Statistics for Data Analysts Using Excel

Statistics Primer. A Brief Overview of Basic Statistical and Probability Principles. Essential Statistics for Data Analysts Using Excel Statistics Primer A Brief Overview of Basic Statistical and Probability Principles Liberty J. Munson, PhD 9/19/16 Essential Statistics for Data Analysts Using Excel Table of Contents What is a Variable?...

More information

STAT 350 Final (new Material) Review Problems Key Spring 2016

STAT 350 Final (new Material) Review Problems Key Spring 2016 1. The editor of a statistics textbook would like to plan for the next edition. A key variable is the number of pages that will be in the final version. Text files are prepared by the authors using LaTeX,

More information

Chapter 10. Regression. Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania

Chapter 10. Regression. Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania Chapter 10 Regression Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania Scatter Diagrams A graph in which pairs of points, (x, y), are

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Using SPSS for One Way Analysis of Variance

Using SPSS for One Way Analysis of Variance Using SPSS for One Way Analysis of Variance This tutorial will show you how to use SPSS version 12 to perform a one-way, between- subjects analysis of variance and related post-hoc tests. This tutorial

More information

EC4051 Project and Introductory Econometrics

EC4051 Project and Introductory Econometrics EC4051 Project and Introductory Econometrics Dudley Cooke Trinity College Dublin Dudley Cooke (Trinity College Dublin) Intro to Econometrics 1 / 23 Project Guidelines Each student is required to undertake

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

1 Correlation and Inference from Regression

1 Correlation and Inference from Regression 1 Correlation and Inference from Regression Reading: Kennedy (1998) A Guide to Econometrics, Chapters 4 and 6 Maddala, G.S. (1992) Introduction to Econometrics p. 170-177 Moore and McCabe, chapter 12 is

More information

3 Non-linearities and Dummy Variables

3 Non-linearities and Dummy Variables 3 Non-linearities and Dummy Variables Reading: Kennedy (1998) A Guide to Econometrics, Chapters 3, 5 and 6 Aim: The aim of this section is to introduce students to ways of dealing with non-linearities

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

R 2 and F -Tests and ANOVA

R 2 and F -Tests and ANOVA R 2 and F -Tests and ANOVA December 6, 2018 1 Partition of Sums of Squares The distance from any point y i in a collection of data, to the mean of the data ȳ, is the deviation, written as y i ȳ. Definition.

More information

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds Chapter 6 Logistic Regression In logistic regression, there is a categorical response variables, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,

More information

Design of Engineering Experiments Part 5 The 2 k Factorial Design

Design of Engineering Experiments Part 5 The 2 k Factorial Design Design of Engineering Experiments Part 5 The 2 k Factorial Design Text reference, Special case of the general factorial design; k factors, all at two levels The two levels are usually called low and high

More information

Uncertainty and Graphical Analysis

Uncertainty and Graphical Analysis Uncertainty and Graphical Analysis Introduction Two measures of the quality of an experimental result are its accuracy and its precision. An accurate result is consistent with some ideal, true value, perhaps

More information

Introduction to Computer Tools and Uncertainties

Introduction to Computer Tools and Uncertainties Experiment 1 Introduction to Computer Tools and Uncertainties 1.1 Objectives To become familiar with the computer programs and utilities that will be used throughout the semester. To become familiar with

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE

More information

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels.

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. Contingency Tables Definition & Examples. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. (Using more than two factors gets complicated,

More information

An overview of applied econometrics

An overview of applied econometrics An overview of applied econometrics Jo Thori Lind September 4, 2011 1 Introduction This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical

More information

Lecture - 24 Radial Basis Function Networks: Cover s Theorem

Lecture - 24 Radial Basis Function Networks: Cover s Theorem Neural Network and Applications Prof. S. Sengupta Department of Electronic and Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 24 Radial Basis Function Networks:

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Hypothesis Testing. We normally talk about two types of hypothesis: the null hypothesis and the research or alternative hypothesis.

Hypothesis Testing. We normally talk about two types of hypothesis: the null hypothesis and the research or alternative hypothesis. Hypothesis Testing Today, we are going to begin talking about the idea of hypothesis testing how we can use statistics to show that our causal models are valid or invalid. We normally talk about two types

More information

Chapter 5: HYPOTHESIS TESTING

Chapter 5: HYPOTHESIS TESTING MATH411: Applied Statistics Dr. YU, Chi Wai Chapter 5: HYPOTHESIS TESTING 1 WHAT IS HYPOTHESIS TESTING? As its name indicates, it is about a test of hypothesis. To be more precise, we would first translate

More information

RESPONSE SURFACE MODELLING, RSM

RESPONSE SURFACE MODELLING, RSM CHEM-E3205 BIOPROCESS OPTIMIZATION AND SIMULATION LECTURE 3 RESPONSE SURFACE MODELLING, RSM Tool for process optimization HISTORY Statistical experimental design pioneering work R.A. Fisher in 1925: Statistical

More information

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Lecture - 04 Basic Statistics Part-1 (Refer Slide Time: 00:33)

More information

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 600.463 Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 25.1 Introduction Today we re going to talk about machine learning, but from an

More information

Chapter 26: Comparing Counts (Chi Square)

Chapter 26: Comparing Counts (Chi Square) Chapter 6: Comparing Counts (Chi Square) We ve seen that you can turn a qualitative variable into a quantitative one (by counting the number of successes and failures), but that s a compromise it forces

More information

Ordinary Least Squares Linear Regression

Ordinary Least Squares Linear Regression Ordinary Least Squares Linear Regression Ryan P. Adams COS 324 Elements of Machine Learning Princeton University Linear regression is one of the simplest and most fundamental modeling ideas in statistics

More information

Topic 12. The Split-plot Design and its Relatives (continued) Repeated Measures

Topic 12. The Split-plot Design and its Relatives (continued) Repeated Measures 12.1 Topic 12. The Split-plot Design and its Relatives (continued) Repeated Measures 12.9 Repeated measures analysis Sometimes researchers make multiple measurements on the same experimental unit. We have

More information

Quadratic Equations Part I

Quadratic Equations Part I Quadratic Equations Part I Before proceeding with this section we should note that the topic of solving quadratic equations will be covered in two sections. This is done for the benefit of those viewing

More information

Hypothesis Testing. ) the hypothesis that suggests no change from previous experience

Hypothesis Testing. ) the hypothesis that suggests no change from previous experience Hypothesis Testing Definitions Hypothesis a claim about something Null hypothesis ( H 0 ) the hypothesis that suggests no change from previous experience Alternative hypothesis ( H 1 ) the hypothesis that

More information

Orthogonal, Planned and Unplanned Comparisons

Orthogonal, Planned and Unplanned Comparisons This is a chapter excerpt from Guilford Publications. Data Analysis for Experimental Design, by Richard Gonzalez Copyright 2008. 8 Orthogonal, Planned and Unplanned Comparisons 8.1 Introduction In this

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

Econometrics. 4) Statistical inference

Econometrics. 4) Statistical inference 30C00200 Econometrics 4) Statistical inference Timo Kuosmanen Professor, Ph.D. http://nomepre.net/index.php/timokuosmanen Today s topics Confidence intervals of parameter estimates Student s t-distribution

More information

Design of Experiments SUTD - 21/4/2015 1

Design of Experiments SUTD - 21/4/2015 1 Design of Experiments SUTD - 21/4/2015 1 Outline 1. Introduction 2. 2 k Factorial Design Exercise 3. Choice of Sample Size Exercise 4. 2 k p Fractional Factorial Design Exercise 5. Follow-up experimentation

More information

In the previous chapter, we learned how to use the method of least-squares

In the previous chapter, we learned how to use the method of least-squares 03-Kahane-45364.qxd 11/9/2007 4:40 PM Page 37 3 Model Performance and Evaluation In the previous chapter, we learned how to use the method of least-squares to find a line that best fits a scatter of points.

More information

Expected Value II. 1 The Expected Number of Events that Happen

Expected Value II. 1 The Expected Number of Events that Happen 6.042/18.062J Mathematics for Computer Science December 5, 2006 Tom Leighton and Ronitt Rubinfeld Lecture Notes Expected Value II 1 The Expected Number of Events that Happen Last week we concluded by showing

More information

psyc3010 lecture 2 factorial between-ps ANOVA I: omnibus tests

psyc3010 lecture 2 factorial between-ps ANOVA I: omnibus tests psyc3010 lecture 2 factorial between-ps ANOVA I: omnibus tests last lecture: introduction to factorial designs next lecture: factorial between-ps ANOVA II: (effect sizes and follow-up tests) 1 general

More information

Introduction to Basic Proof Techniques Mathew A. Johnson

Introduction to Basic Proof Techniques Mathew A. Johnson Introduction to Basic Proof Techniques Mathew A. Johnson Throughout this class, you will be asked to rigorously prove various mathematical statements. Since there is no prerequisite of a formal proof class,

More information

This module focuses on the logic of ANOVA with special attention given to variance components and the relationship between ANOVA and regression.

This module focuses on the logic of ANOVA with special attention given to variance components and the relationship between ANOVA and regression. WISE ANOVA and Regression Lab Introduction to the WISE Correlation/Regression and ANOVA Applet This module focuses on the logic of ANOVA with special attention given to variance components and the relationship

More information

SCIENTIFIC INQUIRY AND CONNECTIONS. Recognize questions and hypotheses that can be investigated according to the criteria and methods of science

SCIENTIFIC INQUIRY AND CONNECTIONS. Recognize questions and hypotheses that can be investigated according to the criteria and methods of science SUBAREA I. COMPETENCY 1.0 SCIENTIFIC INQUIRY AND CONNECTIONS UNDERSTAND THE PRINCIPLES AND PROCESSES OF SCIENTIFIC INQUIRY AND CONDUCTING SCIENTIFIC INVESTIGATIONS SKILL 1.1 Recognize questions and hypotheses

More information

MBF1923 Econometrics Prepared by Dr Khairul Anuar

MBF1923 Econometrics Prepared by Dr Khairul Anuar MBF1923 Econometrics Prepared by Dr Khairul Anuar L4 Ordinary Least Squares www.notes638.wordpress.com Ordinary Least Squares The bread and butter of regression analysis is the estimation of the coefficient

More information

Math 016 Lessons Wimayra LUY

Math 016 Lessons Wimayra LUY Math 016 Lessons Wimayra LUY wluy@ccp.edu MATH 016 Lessons LESSON 1 Natural Numbers The set of natural numbers is given by N = {0, 1, 2, 3, 4...}. Natural numbers are used for two main reasons: 1. counting,

More information

16. . Proceeding similarly, we get a 2 = 52 1 = , a 3 = 53 1 = and a 4 = 54 1 = 125

16. . Proceeding similarly, we get a 2 = 52 1 = , a 3 = 53 1 = and a 4 = 54 1 = 125 . Sequences When we first introduced a function as a special type of relation in Section.3, we did not put any restrictions on the domain of the function. All we said was that the set of x-coordinates

More information

INTRODUCTION TO ANALYSIS OF VARIANCE

INTRODUCTION TO ANALYSIS OF VARIANCE CHAPTER 22 INTRODUCTION TO ANALYSIS OF VARIANCE Chapter 18 on inferences about population means illustrated two hypothesis testing situations: for one population mean and for the difference between two

More information

Quantitative Understanding in Biology Module II: Model Parameter Estimation Lecture I: Linear Correlation and Regression

Quantitative Understanding in Biology Module II: Model Parameter Estimation Lecture I: Linear Correlation and Regression Quantitative Understanding in Biology Module II: Model Parameter Estimation Lecture I: Linear Correlation and Regression Correlation Linear correlation and linear regression are often confused, mostly

More information

Basic methods to solve equations

Basic methods to solve equations Roberto s Notes on Prerequisites for Calculus Chapter 1: Algebra Section 1 Basic methods to solve equations What you need to know already: How to factor an algebraic epression. What you can learn here:

More information

Chapter 1 Review of Equations and Inequalities

Chapter 1 Review of Equations and Inequalities Chapter 1 Review of Equations and Inequalities Part I Review of Basic Equations Recall that an equation is an expression with an equal sign in the middle. Also recall that, if a question asks you to solve

More information

Advanced Experimental Design

Advanced Experimental Design Advanced Experimental Design Topic Four Hypothesis testing (z and t tests) & Power Agenda Hypothesis testing Sampling distributions/central limit theorem z test (σ known) One sample z & Confidence intervals

More information