Parametric technique


Regression analysis

Parametric technique. A parametric technique assumes that the variables conform to some distribution (e.g. Gaussian); the properties of that distribution are assumed in the underlying statistical method. Bimodal distribution: the distribution has two maxima. Skewness: the distribution is not symmetrical. Kurtosis: the distribution is not bell shaped (its tails are heavier or lighter than a Gaussian's).

Supervised techniques. Supervised techniques use information about the dependent variable to derive the model, with the goal of assigning the correct output to a given input.

Simple Linear Regression. Let's assume the relationship between x and y is linear. A linear relationship is defined by a straight line with parameters w0 and w1. Equation of the straight line: y(x) = w0 + w1 x. Usually the line does not fit the data exactly, but we can try to make it a reasonable approximation. Deviation for the pair (xi, yi): εi = yi − y(xi) = yi − (w0 + w1 xi). The total error is defined as the sum of squared deviations: RSS = Σi εi². The best-fitting line is defined by the w0 and w1 that minimize the total error. w0 = intercept, w1 = slope.
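As an illustration, a minimal sketch in Python/NumPy (data and variable names invented for the example) of finding w0 and w1 by least squares and evaluating RSS:

```python
import numpy as np

# Hypothetical measurements (x, y), invented for the example
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Closed-form least-squares estimates for the line y(x) = w0 + w1*x
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
w0 = y.mean() - w1 * x.mean()                                               # intercept

# Deviations and total error (sum of squared deviations)
eps = y - (w0 + w1 * x)
RSS = np.sum(eps ** 2)
print(w0, w1, RSS)
```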

Standard deviation. sy = √[RSS/(n−2)], where n is the number of measured data pairs (xi, yi) and n−2 is the number of degrees of freedom. Why do we now divide by (n−2) rather than by n? Consider the limiting case n = 2, i.e. the case in which we have only two measured data pairs: since a straight line always passes through 2 points, two data pairs give no information about the reliability of the measurements. In other words, to calculate the standard deviation of a linear regression over n data pairs we must first calculate the values of the intercept and the slope, so we have 2 degrees of freedom fewer than the initial n, and it is therefore appropriate to divide by (n−2). More generally, the degrees of freedom correspond to the number of quantities that can be assigned arbitrarily: the number of degrees of freedom equals the number of independent measurements (n, the number of observed data points) minus the number of parameters calculated from those measurements (slope and intercept in this case, i.e. the constraints).
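If useful, the same quantity as a small illustrative helper (Python/NumPy; the residuals are the εi of the fitted line, and the function name is mine):

```python
import numpy as np

def residual_std(residuals: np.ndarray, n_params: int = 2) -> float:
    """Residual standard deviation s_y = sqrt(RSS / (n - n_params))."""
    rss = np.sum(residuals ** 2)
    return float(np.sqrt(rss / (residuals.size - n_params)))
```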

Squared correlation coefficient. r² = ESS/TSS = (TSS − RSS)/TSS. RSS: residual sum of squares (deviation of the points from the line). ESS: explained sum of squares (deviation of the line from the mean). TSS: total sum of squares (deviation of the points from the mean). (Figure: scatter plot showing the fitted regression line and the mean value of y.) The quality of a simple linear regression equation may be quantified by the squared correlation coefficient r², which indicates the fraction of the total variation in the dependent variable yi that is explained by the regression equation. Possible values of r² fall between 0 and 1: an r² of 0 means that there is no relationship between the dependent variable y and the independent variable x, while an r² of 1 means perfect correlation. Disadvantage: higher r² values are obtained for larger data sets.
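An illustrative computation of RSS, ESS, TSS and r² (Python/NumPy; np.polyfit is used here only to obtain the least-squares line, and the data are invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

w1, w0 = np.polyfit(x, y, 1)           # least-squares slope and intercept
y_hat = w0 + w1 * x

RSS = np.sum((y - y_hat) ** 2)         # deviation of the points from the line
ESS = np.sum((y_hat - y.mean()) ** 2)  # deviation of the line from the mean
TSS = np.sum((y - y.mean()) ** 2)      # deviation of the points from the mean
r2 = ESS / TSS                         # equivalently (TSS - RSS) / TSS
print(r2)
```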

r tables. The value of r can be checked against the appropriate tables of statistical data (calculated for Gaussian-type distributions) to determine the significance of the regression equation. The correlation between x and y is significant at the given probability level if the value of r exceeds the tabulated value. Note: you should ignore the sign (+ or −) of r when reading this table. n = number of data points, c = number of constraints, n − c = degrees of freedom.

n−c    95%     99%     99.9%
1      0.997   1.000   1.000
2      0.950   0.990   0.999
3      0.878   0.959   0.991
4      0.811   0.917   0.974
5      0.755   0.875   0.951
10     0.576   0.708   0.823
15     0.482   0.606   0.725
20     0.423   0.535   0.652
25     0.381   0.487   0.597
30     0.349   0.449   0.554
35     0.325   0.418   0.519
40     0.304   0.393   0.490
45     0.288   0.372   0.465
50     0.273   0.354   0.443
60     0.250   0.325   0.408
70     0.232   0.302   0.380
80     0.217   0.283   0.357
90     0.205   0.267   0.338
100    0.195   0.254   0.321

A diagram tells you more than a thousand equations. Visualization may not be as precise as statistics, but it provides a unique view onto the data that can make it much easier to discover interesting structures than numerical methods do. Visualization also provides the context necessary to make better choices and to be more careful when fitting models.

Anscombe's Quartet. Anscombe's quartet comprises four datasets (of 11 points each) that have nearly identical simple statistical properties, yet look very different when graphed. Dataset I: the plot shows a simple linear relationship between two correlated variables, consistent with the assumption of normality. Dataset II: an obvious relationship between the two variables can be observed, but it is not linear. Dataset III: the relationship is linear, but one outlier exerts enough influence to alter the regression line and lower the correlation coefficient from 1 to 0.816. Dataset IV: one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.

Anscombe's Quartet

 I–III      I       II      III    |  IV      IV
   x        y        y       y     |   x       y
   4      4.26     3.10    5.39    |  19    12.50
   5      5.68     4.74    5.73    |   8     6.89
   6      7.24     6.13    6.08    |   8     5.25
   7      4.82     7.26    6.42    |   8     7.91
   8      6.95     8.14    6.77    |   8     5.76
   9      8.81     8.77    7.11    |   8     8.84
  10      8.04     9.14    7.46    |   8     6.58
  11      8.33     9.26    7.81    |   8     8.47
  12     10.84     9.13    8.15    |   8     5.56
  13      7.58     8.74   12.74    |   8     7.71
  14      9.96     8.10    8.84    |   8     7.04

Property (in each case)          Value
Mean of x                        9
Variance of x                    11
Mean of y                        7.50
Variance of y                    4.122 or 4.127
Linear regression line           f(x) = 3.00 + 0.500 x
Correlation between x and y      0.816
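As a quick illustrative check (Python/NumPy; the data are simply those of the table above), all four datasets give essentially the same mean, variance, regression line and correlation:

```python
import numpy as np

x123 = np.array([4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype=float)
y1 = np.array([4.26, 5.68, 7.24, 4.82, 6.95, 8.81, 8.04, 8.33, 10.84, 7.58, 9.96])
y2 = np.array([3.10, 4.74, 6.13, 7.26, 8.14, 8.77, 9.14, 9.26, 9.13, 8.74, 8.10])
y3 = np.array([5.39, 5.73, 6.08, 6.42, 6.77, 7.11, 7.46, 7.81, 8.15, 12.74, 8.84])
x4 = np.array([19, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8], dtype=float)
y4 = np.array([12.50, 6.89, 5.25, 7.91, 5.76, 8.84, 6.58, 8.47, 5.56, 7.71, 7.04])

for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    print(f"mean_y={y.mean():.2f}  var_y={y.var(ddof=1):.3f}  "
          f"line: f(x) = {intercept:.2f} + {slope:.3f} x  r = {r:.3f}")
```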

Chance correlation problem

Fisher's statistic. F = [(n−c)/(c−1)] · ESS/RSS = [(n−c)/(c−1)] · r²/(1−r²). Although the fit of the data to the regression line may be excellent, how can one decide whether the correlation is based purely on chance? The higher the value of r², the less likely it is that the relationship is due to chance. Under the assumption that the data follow a Gaussian distribution, the F statistic assesses the statistical significance of the linear regression equation. Values of F are available in statistical tables at different confidence levels; if the calculated value is greater than the tabulated value, the equation is said to be significant at that particular level of confidence. The value of F depends upon the number of independent variables in the equation and the number of data points: as the number of data points increases and/or the number of independent variables falls, the value of F that corresponds to a particular confidence level also decreases. This is because we would like to be able to explain a large number of data points with an equation containing as few variables as necessary; such an equation would be expected to have greater predictive power.

(n−c)/(c−1) = (n−2)/(2−1) = n−2

F values computed from F = (n−2) · r²/(1−r²), tabulated against the correlation coefficient r (rows) and n−2 (columns):

r      n−2=5    n−2=10   n−2=30   n−2=100   n−2=1000
0      0        0        0        0         0
0.1    0.05     0.10     0.30     1.01      10.10
0.2    0.21     0.42     1.25     4.17      41.67
0.3    0.49     0.99     2.97     9.89      98.90
0.4    0.95     1.90     5.71     19.05     190.48
0.5    1.67     3.33     10.00    33.33     333.33
0.6    2.81     5.63     16.88    56.25     562.50
0.7    4.80     9.61     28.82    96.08     960.78
0.8    8.89     17.78    53.33    177.78    1777.78
0.9    21.32    42.63    127.89   426.32    4263.16
1      ∞        ∞        ∞        ∞         ∞

Tabulated critical values of F for normal distributions:

Significance level   n−2=5   n−2=10   n−2=30   n−2=100   n−2=1000
95.0%                6.61    4.96     4.17     3.94      3.85
99.0%                16.26   10.04    7.56     6.90      6.66
99.9%                47.18   21.04    13.29    11.50     10.89

If we measured 12 pairs of data (n = 12, so n−2 = 10) and r = 0.8 (r² = 0.64), then F = 10 · 0.64/0.36 = 17.78; since 10.04 < 17.78 < 21.04, the probability that there is no relationship between the dependent and independent variables is less than 1% (but not less than 0.1%).
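As an illustrative check (Python with SciPy; the helper name is mine), the F statistic and its p-value for a simple linear regression can be computed directly from r² and n:

```python
from scipy import stats

def f_statistic(r2: float, n: int, c: int = 2):
    """F = (n - c)/(c - 1) * r2/(1 - r2), with p-value from the F(c-1, n-c) distribution."""
    F = (n - c) / (c - 1) * r2 / (1.0 - r2)
    p_value = stats.f.sf(F, c - 1, n - c)  # probability of an F this large by chance alone
    return F, p_value

# Worked example from the slide: n = 12 data pairs, r = 0.8 (r^2 = 0.64)
F, p = f_statistic(r2=0.8 ** 2, n=12)
print(F, p)  # F ~ 17.8, p < 0.01
```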

Y-Scrambling. A model MUST be validated on new independent data to avoid a chance correlation. (Diagram: the Y values are randomly re-assigned to the X values, e.g. X1–Y2, X2–Y5, X3–Y4, X4–Y6, X5–Y1, X6–Y7, X7–Y3, and R² is re-evaluated on a 0.0–1.0 scale.)

Y-Scrambling. (Diagram: a second random re-assignment, X1–Y4, X2–Y1, X3–Y5, X4–Y2, X5–Y6, X6–Y3, X7–Y7, with the corresponding R² on the 0.0–1.0 scale.)

Y-Scrambling. (Diagram: a third random re-assignment, X1–Y7, X2–Y6, X3–Y3, X4–Y5, X5–Y4, X6–Y1, X7–Y2, with the corresponding R².)
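The point of the procedure, sketched below in Python/NumPy (illustrative data, with a simple linear model assumed), is that a model refit on scrambled y vectors should give low R²; if the scrambled fits score almost as well as the original one, the original correlation is likely due to chance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([1.9, 3.1, 3.9, 5.2, 6.0, 6.8, 8.1])

def r_squared(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = intercept + slope * x
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print("original  R2:", r_squared(x, y))
for _ in range(3):
    y_scrambled = rng.permutation(y)            # break the X-Y pairing at random
    print("scrambled R2:", r_squared(x, y_scrambled))
```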

Preparation of training and test sets. (Workflow diagram: the initial data set is split into a training set and a test set of 10–15%; models are built on the training set; the best models are selected according to statistical criteria; prediction calculations are then performed with the best models.)
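A minimal random split along these lines (Python/NumPy sketch; the 15% test fraction follows the slide, while the array shapes and data are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))   # hypothetical descriptor matrix (100 compounds, 3 descriptors)
y = rng.normal(size=100)        # hypothetical measured activities

test_fraction = 0.15            # 10-15 % held out, as on the slide
idx = rng.permutation(len(y))
n_test = int(round(test_fraction * len(y)))
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, y_train = X[train_idx], y[train_idx]   # used to build the models
X_test, y_test = X[test_idx], y[test_idx]       # used only for prediction/validation
```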

Cross validation. r²cv = q² = 1 − Σ(i=1..n) (f(xi) − yi)² / Σ(i=1..n) (yi − ȳ)², where f(xi) is the value predicted for xi by a model built without using the pair (xi, yi).

Leave one out (LOO). The most common form of cross validation is leave-one-out: 1) one data point is left out; 2) a model is derived using the remaining data; 3) a value is predicted for the point left out; 4) this is repeated for every data point in the set.
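A minimal LOO sketch (Python/NumPy; a simple linear regression is assumed as the model and the data are invented), computing q² as defined above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 2.8, 4.1, 4.9, 6.2, 6.7, 8.1, 8.8])

press = 0.0                                  # predictive residual sum of squares
for i in range(len(x)):
    keep = np.arange(len(x)) != i                        # 1) leave one data point out
    slope, intercept = np.polyfit(x[keep], y[keep], 1)   # 2) refit on the remaining data
    y_pred = intercept + slope * x[i]                    # 3) predict the left-out point
    press += (y[i] - y_pred) ** 2                        # 4) repeat for every point

q2 = 1.0 - press / np.sum((y - y.mean()) ** 2)
print("q2 =", q2)
```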

Model's applicability domain. Regression analysis is more effective for interpolation than for extrapolation: the region of experimental space described by the regression analysis has been explained, but projecting to a new, unanalysed region can be problematic.

Model's applicability domain. The data set should span the representation space evenly.

Multiple linear regression: linear regression in higher dimensions. In order to analyse a relationship which is possibly influenced by several independent variables, it is useful to assess the contribution of each variable. Multiple linear regression is used to determine the relative importance of multiple independent variables to the overall fit of the data. For 2D inputs, linear regression fits a plane to the data: f(x1, x2) = w0 + w1 x1 + w2 x2. The best plane minimizes the sum of squared deviations.
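For instance, a plane fit for two descriptors might look like this (Python/NumPy sketch; the data and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, size=20)
x2 = rng.uniform(0, 10, size=20)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=20)  # noisy plane

# Design matrix with a leading column of ones for the intercept w0
X = np.column_stack([np.ones_like(x1), x1, x2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes the sum of squared deviations
w0, w1, w2 = w
print(w0, w1, w2)
```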

Multiple linear regression. Similar intuition carries over to higher dimensions too: fitting a p-dimensional hyperplane to the data, although it is hard to visualize in pictures... Multiple linear regression attempts to maximize the fit of the data to a regression equation (minimize the squared deviations from the regression equation, i.e. maximize the correlation coefficient) for the dependent variable by adjusting each of the available parameters up or down. Regression programs often approach this task in a stepwise fashion: successive regression equations are derived in which parameters are either added or removed until the r² and s values are optimized. The magnitude of the coefficients derived in this manner indicates the relative contribution of the associated parameter to the dependent variable y.

Overfitting. Determining the most appropriate number of descriptors (and their nature) is generally a non-trivial task. The choice of too few descriptors makes the model too general (with little, if any, predictive value); the choice of too many descriptors renders the model too specific to the training set (a process called over-fitting): given enough parameters, any data set can be fitted to a regression line. The consequence is that regression analysis generally requires significantly more compounds than parameters; a useful rule of thumb is three to six times the number of parameters under consideration. (Figure: model b performs well on the training examples, but poorly on new examples.)

Simple Linear Regression
y1 = w0 + w1 x1
y2 = w0 + w1 x2
y3 = w0 + w1 x3
y4 = w0 + w1 x4
In vector form, y(4×1) = w0(4×1) + w1 · x(4×1):

[y1]   [w0]        [x1]
[y2] = [w0] + w1 · [x2]
[y3]   [w0]        [x3]
[y4]   [w0]        [x4]

Single Linear Regression
y1 = w0 x10 + w1 x11
y2 = w0 x20 + w1 x21
y3 = w0 x30 + w1 x31
y4 = w0 x40 + w1 x41
with xi0 = 1. In matrix form, y(n×1) = X(n×2) w(2×1):

[y1]   [x10 x11]
[y2] = [x20 x21] · [w0]
[y3]   [x30 x31]   [w1]
[y4]   [x40 x41]

Single Linear Regression
y1 = w0 x10 + w1 x11 + ε1
y2 = w0 x20 + w1 x21 + ε2
y3 = w0 x30 + w1 x31 + ε3
y4 = w0 x40 + w1 x41 + ε4
with xi0 = 1. In matrix form, y(4×1) = X(4×2) w(2×1) + ε(4×1):

[y1]   [x10 x11]          [ε1]
[y2] = [x20 x21] · [w0] + [ε2]
[y3]   [x30 x31]   [w1]   [ε3]
[y4]   [x40 x41]          [ε4]

Double Linear Regression
y1 = w0 x10 + w1 x11 + w2 x12 + ε1
y2 = w0 x20 + w1 x21 + w2 x22 + ε2
y3 = w0 x30 + w1 x31 + w2 x32 + ε3
y4 = w0 x40 + w1 x41 + w2 x42 + ε4
with xi0 = 1. In matrix form, y(4×1) = X(4×3) w(3×1) + ε(4×1):

[y1]   [x10 x11 x12]   [w0]   [ε1]
[y2] = [x20 x21 x22] · [w1] + [ε2]
[y3]   [x30 x31 x32]   [w2]   [ε3]
[y4]   [x40 x41 x42]          [ε4]

Triple Linear Regression
y1 = w0 x10 + w1 x11 + w2 x12 + w3 x13 + ε1
y2 = w0 x20 + w1 x21 + w2 x22 + w3 x23 + ε2
y3 = w0 x30 + w1 x31 + w2 x32 + w3 x33 + ε3
y4 = w0 x40 + w1 x41 + w2 x42 + w3 x43 + ε4
with xi0 = 1. In matrix form, y(4×1) = X(4×4) w(4×1) + ε(4×1):

[y1]   [x10 x11 x12 x13]   [w0]   [ε1]
[y2] = [x20 x21 x22 x23] · [w1] + [ε2]
[y3]   [x30 x31 x32 x33]   [w2]   [ε3]
[y4]   [x40 x41 x42 x43]   [w3]   [ε4]

Bivariate-Triple Linear Regression
y11 = w01 x10 + w11 x11 + w21 x12 + w31 x13 + ε11
y21 = w01 x20 + w11 x21 + w21 x22 + w31 x23 + ε21
y31 = w01 x30 + w11 x31 + w21 x32 + w31 x33 + ε31
y41 = w01 x40 + w11 x41 + w21 x42 + w31 x43 + ε41
y12 = w02 x10 + w12 x11 + w22 x12 + w32 x13 + ε12
y22 = w02 x20 + w12 x21 + w22 x22 + w32 x23 + ε22
y32 = w02 x30 + w12 x31 + w22 x32 + w32 x33 + ε32
y42 = w02 x40 + w12 x41 + w22 x42 + w32 x43 + ε42
with xi0 = 1. Multivariate problems: there is more than one dependent variable. In matrix form, Y(4×2) = X(4×4) W(4×2) + E(4×2):

[y11 y12]   [x10 x11 x12 x13]   [w01 w02]   [ε11 ε12]
[y21 y22] = [x20 x21 x22 x23] · [w11 w12] + [ε21 ε22]
[y31 y32]   [x30 x31 x32 x33]   [w21 w22]   [ε31 ε32]
[y41 y42]   [x40 x41 x42 x43]   [w31 w32]   [ε41 ε42]

Multiple Linear Regression: y(n×1) = X(n×(p+1)) w((p+1)×1) + ε(n×1), with xi0 = 1.
Multivariate Linear Regression: Y(n×k) = X(n×(p+1)) W((p+1)×k) + E(n×k), with xi0 = 1.
Y = {y(1), ..., y(k)}: dependent variables (observations)
X = {x(1), ..., x(p+1)}: independent variables (parameters), x(1) = 1
W = {w(1), ..., w(p+1)}: weights
E = {ε(1), ..., ε(k)}: error matrix
n = number of data points
p+1 = number of independent variables
k = number of dependent variables

Least squares method. Y = X W + E. With the least squares method we determine the weight matrix W that minimizes the squared errors S = EᵀE = (Y − X W)ᵀ(Y − X W), assuming that the errors are random and independent. Taking the derivative of EᵀE with respect to W and setting it to zero gives: W = (XᵀX)⁻¹ Xᵀ Y. The predictions are therefore given by: Ŷ = X W = X (XᵀX)⁻¹ Xᵀ Y = H Y, where H = X (XᵀX)⁻¹ Xᵀ is the hat matrix.
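A closing sketch (Python/NumPy; example data invented) of the normal-equations solution W = (XᵀX)⁻¹XᵀY and of the hat matrix H:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, k = 12, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column: xi0 = 1
Y = rng.normal(size=(n, k))                                 # k dependent variables

W = np.linalg.inv(X.T @ X) @ X.T @ Y   # W = (X'X)^-1 X'Y (np.linalg.solve is preferable numerically)
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: Y_hat = H Y
Y_hat = X @ W
assert np.allclose(Y_hat, H @ Y)
print(W)
```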