Regression analysis
Parametric technique
A parametric technique assumes that the variables conform to some distribution (e.g. Gaussian); the properties of that distribution are assumed in the underlying statistical method. Bimodal distribution: the distribution has two maxima. Skewness: the distribution is not symmetrical. Kurtosis: the distribution is not bell-shaped.
Supervised techniques
Supervised techniques use information about the dependent variable to derive the model, with the goal of assigning the correct output to a given input.
Simple Linear Regression
Let's assume the relationship between x and y is linear. A linear relationship can be described by a straight line with parameters w0 and w1. Equation of the straight line: y(x) = w0 + w1·x, where w0 is the intercept and w1 is the slope. Usually the line does not fit the data exactly, but we can try to make it a reasonable approximation. Deviation for the pair (xi, yi): εi = yi − y(xi) = yi − (w0 + w1·xi). The total error is defined as the sum of squared deviations: RSS = Σi εi². The best-fitting line is defined by the w0 and w1 that minimize this total error.
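A minimal sketch of such a least-squares fit in Python (the x and y values are hypothetical, used only for illustration):

```python
# Sketch: fit y(x) = w0 + w1*x by minimizing the sum of squared deviations (RSS)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # hypothetical measurements
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

w1, w0 = np.polyfit(x, y, deg=1)              # slope and intercept of the best-fitting line
residuals = y - (w0 + w1 * x)                 # deviations eps_i
rss = np.sum(residuals ** 2)                  # sum of squared deviations
print(f"intercept w0 = {w0:.3f}, slope w1 = {w1:.3f}, RSS = {rss:.3f}")
```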
Standard deviation
sy = √[RSS/(n−2)], where n is the number of measured data pairs (xi, yi) and n−2 is the number of degrees of freedom. Why do we now divide by (n−2) rather than by n? Consider the limiting case n = 2, i.e. two measured data pairs. Since a straight line always passes through 2 points, two data pairs give no information about the reliability of the measurements. In other words, to compute the standard deviation of a linear regression over n data pairs we must first compute the intercept and the slope, so we have 2 fewer degrees of freedom than the initial n, and it is therefore appropriate to divide by (n−2). More generally, the degrees of freedom correspond to the number of quantities that can be assigned arbitrarily: the number of degrees of freedom equals the number of independent measurements (n, the number of observed data points) minus the number of parameters computed from those measurements (slope and intercept in this case), i.e. the constraints.
Squared correlation coefficient
r² = ESS/TSS = (TSS − RSS)/TSS, where RSS is the residual sum of squares (deviation of the points from the line), ESS the explained sum of squares (deviation of the line from the mean), and TSS the total sum of squares (deviation of the points from the mean). The quality of a simple linear regression equation may be quantified by the squared correlation coefficient r². r² indicates the fraction of the total variation in the dependent variable yi that is explained by the regression equation. Possible values of r² fall between 0 and 1: an r² of 0 means there is no relationship between the dependent variable y and the independent variable x, while an r² of 1 means perfect correlation. Disadvantage: higher r² values are obtained for larger data sets.
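A short sketch of these goodness-of-fit quantities (RSS, ESS, TSS, r² and sy) for a simple linear fit, again on hypothetical data:

```python
# Sketch: RSS, ESS, TSS, r^2 and s_y for a simple linear regression (illustrative data)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

w1, w0 = np.polyfit(x, y, deg=1)          # best-fitting line
y_hat = w0 + w1 * x

rss = np.sum((y - y_hat) ** 2)            # deviation of the points from the line
tss = np.sum((y - y.mean()) ** 2)         # deviation of the points from the mean
ess = tss - rss                           # deviation of the line from the mean
r2 = ess / tss                            # fraction of the total variation explained
s_y = np.sqrt(rss / (n - 2))              # standard deviation with n-2 degrees of freedom
print(f"r^2 = {r2:.3f}  s_y = {s_y:.3f}")
```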
r tables
The value of r can be checked against the appropriate tables of statistical data (calculated for Gaussian-type distributions) to determine the significance of the regression equation. The correlation between x and y will be significant at the given probability level if the value of r exceeds the tabulated r value. Note: you should ignore the sign (+ or −) of the r value when reading this table. n = number of data points, c = number of constraints, n − c = degrees of freedom.

n-c    95%     99%     99.9%
1      0.997   1       1
2      0.950   0.990   0.999
3      0.878   0.959   0.991
4      0.811   0.917   0.974
5      0.755   0.875   0.951
10     0.576   0.708   0.823
15     0.482   0.606   0.725
20     0.423   0.535   0.652
25     0.381   0.487   0.597
30     0.349   0.449   0.554
35     0.325   0.418   0.519
40     0.304   0.393   0.490
45     0.288   0.372   0.465
50     0.273   0.354   0.443
60     0.250   0.325   0.408
70     0.232   0.302   0.380
80     0.217   0.283   0.357
90     0.205   0.267   0.338
100    0.195   0.254   0.321
A diagram tells you more than a thousand equations
Visualization may not be as precise as statistics, but it provides a unique view of the data that can make it much easier to discover interesting structures than numerical methods. Visualization also provides the context necessary to make better choices and to be more careful when fitting models.
Anscombe's Quartet
Anscombe's quartet comprises four datasets (of 11 points each) that have nearly identical simple statistical properties, yet appear very different when graphed. In one dataset the plot shows a simple linear relationship, corresponding to two correlated variables that follow the assumption of normality. In another, the distribution is linear, but with a different regression line, which is offset by a single outlier that exerts enough influence to alter the regression line and lower the correlation coefficient from 1 to 0.816. In a third, an obvious relationship between the two variables can be observed, but it is not linear. In the last, one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.
Anscombe's Quartet

       I       II      III            IV
x      y       y       y       x      y
4      4.26    3.10    5.39    19     12.50
5      5.68    4.74    5.73    8      6.89
6      7.24    6.13    6.08    8      5.25
7      4.82    7.26    6.42    8      7.91
8      6.95    8.14    6.77    8      5.76
9      8.81    8.77    7.11    8      8.84
10     8.04    9.14    7.46    8      6.58
11     8.33    9.26    7.81    8      8.47
12     10.84   9.13    8.15    8      5.56
13     7.58    8.74    12.74   8      7.71
14     9.96    8.10    8.84    8      7.04

Property (in each case)          Value
Mean of x                        9
Variance of x                    11
Mean of y                        7.50
Variance of y                    4.122 or 4.127
Linear regression line           f(x) = 3.00 + 0.500·x
Correlation between x and y      0.816
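A short sketch, using datasets I and IV from the table above, showing that two very different datasets share nearly identical summary statistics:

```python
# Sketch: datasets I and IV of Anscombe's quartet give almost identical statistics
import numpy as np

x1 = np.array([4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype=float)
y1 = np.array([4.26, 5.68, 7.24, 4.82, 6.95, 8.81, 8.04, 8.33, 10.84, 7.58, 9.96])

x4 = np.array([19, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8], dtype=float)
y4 = np.array([12.50, 6.89, 5.25, 7.91, 5.76, 8.84, 6.58, 8.47, 5.56, 7.71, 7.04])

for name, x, y in [("I", x1, y1), ("IV", x4, y4)]:
    slope, intercept = np.polyfit(x, y, deg=1)      # regression line f(x) = intercept + slope*x
    r = np.corrcoef(x, y)[0, 1]                     # correlation between x and y
    print(f"dataset {name}: mean x = {x.mean():.2f}, mean y = {y.mean():.2f}, "
          f"f(x) = {intercept:.2f} + {slope:.3f}x, r = {r:.3f}")
```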
Chance correlation problem
Fisher's statistic
F = [(n−c)/(c−1)] · ESS/RSS = [(n−c)/(c−1)] · r²/(1−r²). Although the fit of the data to the regression line could be excellent, how can one decide whether this correlation is based purely on chance? The higher the value of r², the less likely it is that the relationship is due to chance. Given the assumption that the data have a Gaussian distribution, the F statistic assesses the statistical significance of the linear regression equation. Values of F are available in statistical tables at different levels of confidence. If the calculated value is greater than the tabulated value, then the equation is said to be significant at that particular level of confidence. The value of F depends upon the number of independent variables in the equation and the number of data points. As the number of data points increases and/or the number of independent variables falls, the value of F that corresponds to a particular confidence level also decreases. This is because we would like to be able to explain a large number of data points with an equation containing as few variables as necessary; such an equation would be expected to have greater predictive power.
For simple linear regression (c = 2), (n−c)/(c−1) = (n−2)/(2−1) = n−2.

Calculated F values:

r      n-2=5     n-2=10    n-2=30    n-2=100    n-2=1000
0      0         0         0         0          0
0.1    0.05      0.10      0.30      1.01       10.10
0.2    0.21      0.42      1.25      4.17       41.67
0.3    0.49      0.99      2.97      9.89       98.90
0.4    0.95      1.90      5.71      19.05      190.48
0.5    1.67      3.33      10.00     33.33      333.33
0.6    2.81      5.63      16.88     56.25      562.50
0.7    4.80      9.61      28.82     96.08      960.78
0.8    8.89      17.78     53.33     177.78     1777.78
0.9    21.32     42.63     127.89    426.32     4263.16
1      2.3E+16   4.5E+16   1.4E+17   4.5E+17    4.5E+18

Tabulated critical values of F for normal distributions:

Significance level   n-2=5   n-2=10   n-2=30   n-2=100   n-2=1000
95.0%                6.61    4.96     4.17     3.94      3.85
99.0%                16.26   10.04    7.56     6.90      6.66
99.9%                47.18   21.04    13.29    11.50     10.89

If we measured 12 pairs of data (n = 12, so n−2 = 10) and r = 0.8, then F = 17.78 and the probability that there is no relationship between dependent and independent variables is less than 1%: 10.04 < 17.78 < 21.04.
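A small sketch of the same significance check in Python, using SciPy's F distribution; the numbers mirror the example above:

```python
# Sketch: F statistic for simple linear regression (c = 2) vs. tabulated critical values
from scipy import stats

n, r = 12, 0.8                               # 12 data pairs, correlation coefficient 0.8
r2 = r ** 2
F = (n - 2) * r2 / (1 - r2)                  # F = (n-c)/(c-1) * r^2/(1-r^2) with c = 2

f_99 = stats.f.ppf(0.99, dfn=1, dfd=n - 2)   # critical value at the 99% level
f_999 = stats.f.ppf(0.999, dfn=1, dfd=n - 2) # critical value at the 99.9% level
print(f"F = {F:.2f}, F(99%) = {f_99:.2f}, F(99.9%) = {f_999:.2f}")
# F exceeds the 99% critical value but not the 99.9% one, as in the table above
```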
Y-Scrambling
A model MUST be validated on new independent data to avoid a chance correlation. In Y-scrambling, the values of the dependent variable are randomly permuted among the data points (e.g. X1 is paired with Y2, X2 with Y5, and so on), the model is rebuilt on the scrambled data, and the procedure is repeated for many random permutations. If the original model does not rest on a chance correlation, the R² values of the scrambled models should fall close to 0, far below the R² of the original model (on a scale from 0.0 to 1.0).
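A minimal Y-scrambling sketch on hypothetical data (the data-generating line and the number of permutations are illustrative choices):

```python
# Sketch: refit the model on permuted y values and compare R^2 with the original fit
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)     # hypothetical linear data

def r_squared(x, y):
    w1, w0 = np.polyfit(x, y, deg=1)
    y_hat = w0 + w1 * x
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"original R^2 = {r_squared(x, y):.3f}")
scrambled = [r_squared(x, rng.permutation(y)) for _ in range(100)]
print(f"mean R^2 after Y-scrambling = {np.mean(scrambled):.3f}")   # should be close to 0
```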
Preparation of training and test sets
The initial data set is split into a training set and a test set (typically 10-15% of the data is set aside as the test set). Models are built on the training set, the best models are selected according to statistical criteria, and prediction calculations are then carried out on the test set using the best models.
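A brief sketch of such a split using scikit-learn (the descriptor matrix and activities are hypothetical):

```python
# Sketch: split an initial data set into training and test sets, fit on one, evaluate on the other
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                             # hypothetical descriptors
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# hold out ~15% of the data as an independent test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

model = LinearRegression().fit(X_train, y_train)          # build the model on the training set
print(f"R^2 on training set: {model.score(X_train, y_train):.3f}")
print(f"R^2 on test set:     {model.score(X_test, y_test):.3f}")
```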
Cross validation
$$ r^2_{cv} = q^2 = 1 - \frac{\sum_{i=1}^{n} \bigl(f(x_i) - y_i\bigr)^2}{\sum_{i=1}^{n} \bigl(y_i - \bar{y}\bigr)^2} $$
Leave one out (LOO)
The most common form of cross validation is leave-one-out: 1) one data point is left out; 2) a model is derived using the remaining data; 3) a value is predicted for the data point left out; 4) this is repeated for every data point in the set.
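A minimal leave-one-out sketch computing q² for a simple linear model (the data values are hypothetical):

```python
# Sketch: leave-one-out cross validation and q^2 (= r^2_cv) for a simple linear model
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.9, 4.2, 5.8, 8.3, 9.7, 12.4, 13.8, 16.1])   # hypothetical observations

preds = np.empty_like(y)
for i in range(len(x)):
    mask = np.arange(len(x)) != i                   # 1) leave the i-th data point out
    w1, w0 = np.polyfit(x[mask], y[mask], deg=1)    # 2) derive a model on the remaining data
    preds[i] = w0 + w1 * x[i]                       # 3) predict the value left out
                                                    # 4) repeat for every data point
q2 = 1 - np.sum((preds - y) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"q^2 = {q2:.3f}")
```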
Model's applicability domain
Regression analysis is more effective for interpolation than for extrapolation: the region of experimental space covered by the regression analysis has been explained, but projecting into a new, unanalysed region can be problematic.
Model's applicability domain
The data set should span the representation space evenly.
Multiple linear regression: linear regression in higher dimensions
In order to analyse a relationship that may be influenced by several independent variables, it is useful to assess the contribution of each variable. Multiple linear regression is used to determine the relative importance of multiple independent variables to the overall fit of the data. For 2D inputs, linear regression fits a plane to the data: f(x1, x2) = w0 + w1·x1 + w2·x2. The best plane minimizes the sum of squared deviations.
Multiple linear regression
Similar intuition carries over to higher dimensions too: fitting a p-dimensional hyperplane to the data, although this is hard to visualize in pictures. Multiple linear regression attempts to maximize the fit of the data to a regression equation for the dependent variable (minimize the squared deviations from the regression equation, i.e. maximize the correlation coefficient) by adjusting each of the available parameters up or down. Regression programs often approach this task in a stepwise fashion: successive regression equations are derived in which parameters are either added or removed until the r² and s values are optimized. The magnitude of the coefficients derived in this manner indicates the relative contribution of the associated parameter to the dependent variable y.
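A short sketch of a two-descriptor fit, f(x1, x2) = w0 + w1·x1 + w2·x2, on hypothetical data:

```python
# Sketch: multiple linear regression with two descriptors fits the best plane to the data
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                                   # hypothetical descriptors x1, x2
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)                           # least-squares plane
print(f"w0 = {model.intercept_:.2f}, (w1, w2) = {model.coef_.round(2)}")
print(f"r^2 = {model.score(X, y):.3f}")
```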
Overfitting
Determining the most appropriate number of descriptors (and their nature) is generally a non-trivial task. The choice of too few descriptors makes the model too general (with little, if any, predictive value), while the choice of too many descriptors renders the model too specific for the training set (a process called over-fitting): given enough parameters, any data set can be fitted to a regression line. The consequence is that regression analysis generally requires significantly more compounds than parameters; a useful rule of thumb is three to six times the number of parameters under consideration. In the accompanying figure, model b performs well on the training examples, but poorly on new examples.
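An illustrative sketch of this effect, assuming purely random descriptors and activities: with roughly as many parameters as compounds the training set is fitted almost perfectly, yet the model has no predictive value on new examples.

```python
# Sketch: over-fitting with too many descriptors relative to the number of compounds
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train, n_test, p = 15, 15, 14                      # 14 descriptors + intercept for 15 compounds
X_train = rng.normal(size=(n_train, p))              # random, meaningless descriptors
X_test = rng.normal(size=(n_test, p))
y_train = rng.normal(size=n_train)                   # random "activities"
y_test = rng.normal(size=n_test)

model = LinearRegression().fit(X_train, y_train)
print(f"r^2 on training set: {model.score(X_train, y_train):.3f}")   # close to 1 (over-fit)
print(f"r^2 on new examples: {model.score(X_test, y_test):.3f}")     # typically near or below 0
```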
Simple Linear Regression
y1 = w0 + w1·x1
y2 = w0 + w1·x2
y3 = w0 + w1·x3
y4 = w0 + w1·x4
In matrix form, y(4×1) = w0(4×1) + w1·x(4×1):
$$ \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} w_0 \\ w_0 \\ w_0 \\ w_0 \end{bmatrix} + w_1 \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} $$
Single Linear Regression
y1 = w0·x10 + w1·x11
y2 = w0·x20 + w1·x21
y3 = w0·x30 + w1·x31
y4 = w0·x40 + w1·x41
with xi0 = 1. In matrix form, y(n×1) = X(n×2)·w(2×1):
$$ \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} x_{10} & x_{11} \\ x_{20} & x_{21} \\ x_{30} & x_{31} \\ x_{40} & x_{41} \end{bmatrix} \cdot \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} $$
Single Linear Regression
y1 = w0·x10 + w1·x11 + ε1
y2 = w0·x20 + w1·x21 + ε2
y3 = w0·x30 + w1·x31 + ε3
y4 = w0·x40 + w1·x41 + ε4
with xi0 = 1. In matrix form, y(4×1) = X(4×2)·w(2×1) + ε(4×1):
$$ \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} x_{10} & x_{11} \\ x_{20} & x_{21} \\ x_{30} & x_{31} \\ x_{40} & x_{41} \end{bmatrix} \cdot \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \end{bmatrix} $$
Double Linear Regression
y1 = w0·x10 + w1·x11 + w2·x12 + ε1
y2 = w0·x20 + w1·x21 + w2·x22 + ε2
y3 = w0·x30 + w1·x31 + w2·x32 + ε3
y4 = w0·x40 + w1·x41 + w2·x42 + ε4
with xi0 = 1. In matrix form, y(4×1) = X(4×3)·w(3×1) + ε(4×1):
$$ \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} x_{10} & x_{11} & x_{12} \\ x_{20} & x_{21} & x_{22} \\ x_{30} & x_{31} & x_{32} \\ x_{40} & x_{41} & x_{42} \end{bmatrix} \cdot \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \end{bmatrix} $$
Triple Linear Regression
y1 = w0·x10 + w1·x11 + w2·x12 + w3·x13 + ε1
y2 = w0·x20 + w1·x21 + w2·x22 + w3·x23 + ε2
y3 = w0·x30 + w1·x31 + w2·x32 + w3·x33 + ε3
y4 = w0·x40 + w1·x41 + w2·x42 + w3·x43 + ε4
with xi0 = 1. In matrix form, y(4×1) = X(4×4)·w(4×1) + ε(4×1):
$$ \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} x_{10} & x_{11} & x_{12} & x_{13} \\ x_{20} & x_{21} & x_{22} & x_{23} \\ x_{30} & x_{31} & x_{32} & x_{33} \\ x_{40} & x_{41} & x_{42} & x_{43} \end{bmatrix} \cdot \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \end{bmatrix} $$
Bivariate-Triple Linear Regression
y11 = w01·x10 + w11·x11 + w21·x12 + w31·x13 + ε11
y21 = w01·x20 + w11·x21 + w21·x22 + w31·x23 + ε21
y31 = w01·x30 + w11·x31 + w21·x32 + w31·x33 + ε31
y41 = w01·x40 + w11·x41 + w21·x42 + w31·x43 + ε41
y12 = w02·x10 + w12·x11 + w22·x12 + w32·x13 + ε12
y22 = w02·x20 + w12·x21 + w22·x22 + w32·x23 + ε22
y32 = w02·x30 + w12·x31 + w22·x32 + w32·x33 + ε32
y42 = w02·x40 + w12·x41 + w22·x42 + w32·x43 + ε42
with xi0 = 1. Multivariate problems: there is more than one dependent variable. In matrix form, Y(4×2) = X(4×4)·W(4×2) + E(4×2):
$$ \begin{bmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \\ y_{31} & y_{32} \\ y_{41} & y_{42} \end{bmatrix} = \begin{bmatrix} x_{10} & x_{11} & x_{12} & x_{13} \\ x_{20} & x_{21} & x_{22} & x_{23} \\ x_{30} & x_{31} & x_{32} & x_{33} \\ x_{40} & x_{41} & x_{42} & x_{43} \end{bmatrix} \cdot \begin{bmatrix} w_{01} & w_{02} \\ w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix} + \begin{bmatrix} \varepsilon_{11} & \varepsilon_{12} \\ \varepsilon_{21} & \varepsilon_{22} \\ \varepsilon_{31} & \varepsilon_{32} \\ \varepsilon_{41} & \varepsilon_{42} \end{bmatrix} $$
Multiple Linear Regression: y(n×1) = X(n×(p+1))·w((p+1)×1) + ε(n×1), with xi0 = 1
Multivariate Linear Regression: Y(n×k) = X(n×(p+1))·W((p+1)×k) + E(n×k), with xi0 = 1
Y = {y(1), ..., y(k)}: dependent variables (observations)
X = {x(1), ..., x(p+1)}: independent variables (parameters), with x(1) = 1
W = {w(1), ..., w(p+1)}: weights
E = {ε(1), ..., ε(k)}: error matrix
n = number of data points; p+1 = number of independent variables; k = number of dependent variables
Least squares method
Y = X·W + E. The least squares method determines the weight matrix W that minimizes the squared errors S = EᵀE = (Y − XW)ᵀ(Y − XW), assuming that the errors are random and independent. Taking the derivative of EᵀE with respect to W and setting it to zero gives:
W = (XᵀX)⁻¹XᵀY
The predictions are therefore:
Ŷ = X·W = X(XᵀX)⁻¹XᵀY = H·Y
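A minimal sketch of the normal-equation solution on hypothetical data (in practice a numerically safer solver such as np.linalg.lstsq would be preferred to the explicit inverse):

```python
# Sketch: least-squares weights via the normal equations, W = (X^T X)^-1 X^T Y
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])   # first column x(1) = 1 (intercept)
true_w = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ true_w + rng.normal(scale=0.1, size=n)              # hypothetical observations

W = np.linalg.inv(X.T @ X) @ X.T @ Y                        # normal equations
Y_hat = X @ W                                               # predictions, Y_hat = H Y
print("estimated weights:", W.round(2))
```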