Nonlinear Regression 28.11.2012

Goals of Today's Lecture
- Understand the difference between linear and nonlinear regression models.
- See that not all functions are linearizable.
- Get an understanding of the fitting algorithm in a statistical sense (i.e., fitting many linear regressions).
- Know that tests etc. are based on approximations, and be able to interpret computer output, profile t-plots and profile traces.
1 / 35

Nonlinear Regression Model
The nonlinear regression model is
Y_i = h(x_i^(1), x_i^(2), ..., x_i^(m); θ_1, θ_2, ..., θ_p) + E_i = h(x_i; θ) + E_i,
where
- E_i are the error terms, E_i ~ N(0, σ²), independent,
- x^(1), ..., x^(m) are the predictors,
- θ_1, ..., θ_p are the parameters,
- h is the regression function ("any" function); h is a function of the predictors and the parameters.
2 / 35

Comparison with the linear regression model
In contrast to the linear regression model, we now have a general function h. In the linear regression model we had h(x_i; θ) = x_i^T θ (there we denoted the parameters by β). Note that in linear regression we required that the parameters appear in linear form. In nonlinear regression we don't have that restriction anymore.
3 / 35

Example: Puromycin
The speed of an enzymatic reaction depends on the concentration of a substrate. The initial speed is the response variable (Y); the concentration of the substrate is used as the predictor (x). Observations come from different runs. We model this with the Michaelis-Menten function
h(x; θ) = θ_1 x / (θ_2 + x).
Here we have one predictor x (the concentration) and two parameters, θ_1 and θ_2. Moreover, we observe two groups: one where the enzyme is treated with Puromycin and one without treatment (control group).
4 / 35
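
A version of this data set ships with R as `Puromycin` (columns conc, rate and state). The following minimal sketch, not part of the original slides, takes a first look at the data and plots the typical shape of the Michaelis-Menten function for rough, illustrative parameter values:

## A first look at the data and the regression function h(x; theta).
head(Puromycin)                               # conc, rate, state (treated/untreated)
h <- function(x, theta) theta[1] * x / (theta[2] + x)
curve(h(x, c(200, 0.1)), from = 0, to = 1,    # rough, illustrative parameter values
      xlab = "Concentration", ylab = "Velocity")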

Illustration: Puromycin (two groups)
[Figure: velocity versus concentration. Left: data (treated and untreated enzyme). Right: typical shape of the regression function.]
5 / 35

Example: Biochemical Oxygen Demand (BOD)
We model the biochemical oxygen demand (Y) as a function of the incubation time (x):
h(x; θ) = θ_1 (1 − e^(−θ_2 x)).
[Figure: oxygen demand versus days.]
6 / 35
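
As a sketch (not from the original slides): R's built-in `BOD` data frame (columns Time and demand) contains data of this type, and the model can be fitted with nls() using eyeballed starting values.

## Saturating-exponential fit for the BOD data; starting values are rough guesses.
fit_bod <- nls(demand ~ theta1 * (1 - exp(-theta2 * Time)),
               data  = BOD,
               start = list(theta1 = 20, theta2 = 0.5))
coef(fit_bod)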

Linearizable Functions
Sometimes (but not always) the function h is linearizable.
Example: Let's forget about the error term E for a moment. Assume we have
y = h(x; θ) = θ_1 exp{θ_2 / x}.
We can rewrite this as
log(y) = log(θ_1) + θ_2 · (1/x),
i.e. ỹ = θ̃_1 + θ̃_2 x̃,
where ỹ = log(y), θ̃_1 = log(θ_1), θ̃_2 = θ_2 and x̃ = 1/x. If we use this linear model, we assume additive errors Ẽ_i:
Ỹ_i = θ̃_1 + θ̃_2 x̃_i + Ẽ_i.
7 / 35

This means that we have multiplicative errors on the original scale:
Y_i = θ_1 exp{θ_2 / x_i} · exp{Ẽ_i}.
This is not the same as using a nonlinear model on the original scale (which would have additive errors!). Hence, transformations of Y modify the model with respect to the error term.
In the Puromycin example: do not linearize, because the error term would then fit worse (see next slide). Hence, for those cases where h is linearizable, it depends on the data whether it is advisable to do so or to perform a nonlinear regression.
8 / 35
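
A small sketch of the mechanics (not from the original slides), on simulated data with multiplicative errors; all values are purely illustrative:

## Linearizing y = theta1 * exp(theta2 / x) by a log transform.
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 5 * exp(2 / x) * exp(rnorm(50, sd = 0.05))   # multiplicative errors
lin_fit <- lm(log(y) ~ I(1/x))                    # linear model on the transformed scale
c(theta1 = exp(unname(coef(lin_fit)[1])),         # back-transform the intercept
  theta2 = unname(coef(lin_fit)[2]))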

Puromycin: Treated enzyme
[Figure: velocity versus concentration for the treated enzyme.]
9 / 35

Parameter Estimation
Let's now assume that we really want to fit a nonlinear model. Again, we use least squares: minimize
S(θ) := Σ_{i=1}^n (Y_i − η_i(θ))²,
where η_i(θ) := h(x_i; θ) is the fitted value for the i-th observation (x_i is fixed; we only vary the parameter vector θ).
10 / 35
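
In R this criterion is minimized by nls(). A minimal sketch for the treated Puromycin group (not from the original slides; the starting values are rough guesses):

## The objective S(theta) and the nonlinear least-squares fit.
treated <- subset(Puromycin, state == "treated")
S <- function(theta) {
  eta <- theta[1] * treated$conc / (theta[2] + treated$conc)  # eta_i(theta)
  sum((treated$rate - eta)^2)                                 # S(theta)
}
S(c(200, 0.1))                 # objective at a rough starting value
fit <- nls(rate ~ theta1 * conc / (theta2 + conc),
           data  = treated,
           start = list(theta1 = 200, theta2 = 0.1))
deviance(fit)                  # = S(theta-hat), the minimized sum of squares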

Geometrical Interpretation
First we recall the situation for linear regression. By applying least squares we are looking for the parameter vector θ such that
‖Y − Xθ‖₂² = Σ_{i=1}^n (Y_i − x_i^T θ)²
is minimized. In other words: we are looking for the point on the plane spanned by the columns of X that is closest to Y ∈ R^n. This is nothing else than projecting Y onto that plane.
11 / 35

Linear Regression: Illustration of Projection
[Figure: two 3-dimensional examples of projecting y onto the plane spanned by the columns of X.]
12 / 35

Situation for nonlinear regression
Conceptually, the same holds true for nonlinear regression. The difference: all possible points no longer lie on a plane, but on a curved surface, the so-called response or model surface, defined by η(θ) ∈ R^n when varying the parameter vector θ. This is a p-dimensional surface, because we parameterize it with p parameters.
13 / 35

Nonlinear Regression: Projection on a Curved Surface
[Figure: 3-dimensional illustration of projecting y onto the curved model surface, with grid lines of constant θ_1 and θ_2.]
14 / 35

Computation
Unfortunately, we cannot derive a closed-form solution for the parameter estimate θ̂. Iterative procedures are therefore needed. We use a Gauss-Newton approach: starting from an initial value θ̂^(0), the idea is to approximate the model surface by a plane, to perform a projection onto that plane, and to iterate many times.
Remember η: R^p → R^n. Define the n × p matrix
A_i^(j)(θ) = ∂η_i(θ) / ∂θ_j.
This is the Jacobian matrix containing all partial derivatives.
15 / 35

Gauss-Newton Algorithm
More formally, the Gauss-Newton algorithm is as follows:
- Start with an initial value θ̂^(0).
- For l = 1, 2, ...:
  - Calculate the tangent plane of η(θ) at θ̂^(l−1):
    η(θ) ≈ η(θ̂^(l−1)) + A(θ̂^(l−1)) · (θ − θ̂^(l−1)).
  - Project Y onto the tangent plane to obtain θ̂^(l). The projection is a linear regression problem (see blackboard).
- Iterate until convergence.
16 / 35
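
For illustration only (not from the original slides), here is a bare-bones version of this iteration in R, with a finite-difference Jacobian and no step-size control; it assumes a reasonable starting value and uses the treated Puromycin data from before. In practice nls() performs these steps with safeguards.

## Plain Gauss-Newton iteration for the Michaelis-Menten model.
treated <- subset(Puromycin, state == "treated")
gauss_newton <- function(theta, x, y, n_iter = 10, eps = 1e-6) {
  h <- function(th) th[1] * x / (th[2] + x)            # eta(theta)
  for (l in seq_len(n_iter)) {
    eta <- h(theta)
    A <- sapply(seq_along(theta), function(j) {        # numerical Jacobian, n x p
      th <- theta; th[j] <- th[j] + eps
      (h(th) - eta) / eps
    })
    theta <- theta + qr.solve(A, y - eta)              # projection = linear least-squares step
  }
  theta
}
gauss_newton(c(200, 0.1), treated$conc, treated$rate)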

Initial Values
How can we get initial values?
- Available knowledge
- Linearized version (see Puromycin)
- Interpretation of parameters (asymptotes, half-life, ...), fitting by eye
- Combination of these ideas (e.g., conditionally linearizable functions)
17 / 35

Example: Puromycin (only treated enzyme)
[Figure: left, 1/velocity versus 1/concentration; right, velocity versus concentration. Dashed line: solution of the linearized problem. Solid line: solution of the nonlinear least-squares problem.]
18 / 35
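
A sketch of how such starting values can be obtained from the linearized (Lineweaver-Burk) form 1/y = 1/θ_1 + (θ_2/θ_1) · (1/x); not from the original slides:

## Starting values for nls() from a linear fit on the transformed scale.
treated <- subset(Puromycin, state == "treated")
lb <- lm(I(1/rate) ~ I(1/conc), data = treated)
theta1_0 <- 1 / unname(coef(lb)[1])        # intercept = 1/theta1
theta2_0 <- unname(coef(lb)[2]) * theta1_0 # slope = theta2/theta1
c(theta1_0, theta2_0)                      # use as `start` in nls()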

Approximate Tests and Confidence Intervals
The algorithm only gives us θ̂. How accurate is this estimate in a statistical sense? In linear regression we knew the (exact) distribution of the estimated parameters (remember the animation!). In nonlinear regression the situation is more complex, in the sense that we only have approximate results. It can be shown that, approximately,
θ̂_j ~ N(θ_j, V_jj)
for some matrix V (V_jj is the j-th diagonal element).
19 / 35

Tests and confidence intervals are then constructed as in the linear regression situation, i.e.
(θ̂_j − θ_j) / sqrt(V_jj) ~ t_{n−p} (approximately).
The reason why we basically have the same result as in the linear regression case is that the algorithm is based on (many) linear regression problems. Once converged, the solution is not only the solution of the nonlinear regression problem but also of the linear one from the last iteration (otherwise we wouldn't have converged). In fact,
V̂ = σ̂² (Â^T Â)^{−1}.
20 / 35
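
In R, the estimated covariance matrix V̂ of an nls fit is returned by vcov(). A sketch (not from the original slides), continuing the treated-enzyme fit:

## Approximate standard errors and 95% t-intervals from the linear approximation.
treated <- subset(Puromycin, state == "treated")
fit <- nls(rate ~ theta1 * conc / (theta2 + conc),
           data = treated, start = list(theta1 = 200, theta2 = 0.1))
se <- sqrt(diag(vcov(fit)))                 # square roots of the diagonal of V-hat
cbind(estimate = coef(fit),
      lower    = coef(fit) - qt(0.975, df.residual(fit)) * se,
      upper    = coef(fit) + qt(0.975, df.residual(fit)) * se)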

Example: Puromycin (two groups)
Remember, we originally had two groups (treatment and control).
[Figure: velocity versus concentration for both groups.]
Question: Do the two groups need different regression parameters?
21 / 35

To answer this question we set up a model of the form
Y_i = (θ_1 + θ_3 z_i) x_i / (θ_2 + θ_4 z_i + x_i) + E_i,
where z is the indicator variable for the treatment (z_i = 1 if treated, z_i = 0 otherwise). For example, if θ_3 is nonzero we have a different asymptote for the treatment group (θ_1 + θ_3 vs. only θ_1 in the control group); similarly for θ_2 and θ_4. Let's fit this model to the data.
22 / 35

Computer Output
Formula: velocity ~ (T1 + T3 * (treated == T)) * conc / (T2 + T4 * (treated == T) + conc)

Parameters:
     Estimate  Std. Error  t value  Pr(>|t|)
T1   160.280    6.896      23.242   2.04e-15
T2     0.048    0.008       5.761   1.50e-05
T3    52.404    9.551       5.487   2.71e-05
T4     0.016    0.011       1.436   0.167

We only get a significant test result for θ_3 (different asymptotes), not for θ_4. A 95% confidence interval for θ_3 (= the difference between the asymptotes) is
52.404 ± q_{0.975}^{t_19} · 9.551 = [32.4, 72.4],
where q_{0.975}^{t_19} ≈ 2.09.
23 / 35
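
This output can be reproduced, up to naming, with R's `Puromycin` data (a sketch, not the original code; the indicator here is `state == "treated"` and the starting values are taken near the reported estimates):

## Two-group Michaelis-Menten model with indicator terms T3, T4.
fit2 <- nls(rate ~ (T1 + T3 * (state == "treated")) * conc /
                   (T2 + T4 * (state == "treated") + conc),
            data  = Puromycin,
            start = list(T1 = 160, T2 = 0.05, T3 = 50, T4 = 0.01))
summary(fit2)
## 95% confidence interval for T3 from the linear approximation:
coef(fit2)["T3"] + c(-1, 1) * qt(0.975, df.residual(fit2)) *
  sqrt(vcov(fit2)["T3", "T3"])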

More Precise Tests and Confidence Intervals
The tests etc. that we have seen so far are only usable if the linear approximation of the problem around the solution θ̂ is good. We can use another approach that is better (but also more complicated). In linear regression we had a quick look at the F-test for testing simultaneous null hypotheses. This is also possible here. Say we have the null hypothesis H_0: θ = θ* (the whole vector). Fact: under H_0 it holds that, approximately,
T = ((n − p) / p) · (S(θ*) − S(θ̂)) / S(θ̂) ~ F_{p, n−p}.
24 / 35
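
This statistic is easy to compute directly from the sums of squares. A sketch for the treated-enzyme model (not from the original slides; the hypothesised value θ* below is arbitrary and purely illustrative):

## F-statistic T for H0: theta = theta* in the Michaelis-Menten model.
treated <- subset(Puromycin, state == "treated")
fit <- nls(rate ~ theta1 * conc / (theta2 + conc),
           data = treated, start = list(theta1 = 200, theta2 = 0.1))
S <- function(theta) sum((treated$rate -
       theta[1] * treated$conc / (theta[2] + treated$conc))^2)
n <- nrow(treated); p <- 2
theta_star <- c(200, 0.08)                                 # hypothesised theta*
T_stat <- ((n - p) / p) * (S(theta_star) - deviance(fit)) / deviance(fit)
T_stat > qf(0.95, p, n - p)                                # TRUE => reject H0 at the 5% level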

We still have only an approximate result, but this approximation is (much) more accurate than the one based on the linear approximation. It can now be used to construct confidence regions by searching for all vectors θ* that are not rejected by this test (as earlier). If we only have two parameters, it is easy to illustrate these confidence regions.
Using linear regression it is also possible to derive confidence regions (for several parameters); we have not seen this in detail. That approach can also be used here (because we use a linear approximation in the algorithm; see also later).
25 / 35

Confidence Regions: Examples
[Figure: confidence regions for (θ_1, θ_2) in the Puromycin and Biochemical Oxygen Demand examples. Dashed: confidence regions (80% and 95%) based on the linear approximation. Solid: approach with the F-test from above (more accurate). + marks the parameter estimate.]
26 / 35

What if we only want to test a single component θ_k?
Assume we want to test H_0: θ_k = θ*_k. Fix θ_k = θ*_k and minimize S(θ) with respect to θ_j, j ≠ k. Denote the minimum by S_k(θ*_k). Fact: under H_0 it holds that, approximately,
T_k(θ*_k) = (n − p) · (S_k(θ*_k) − S(θ̂)) / S(θ̂) ~ F_{1, n−p},
or similarly
T_k(θ*_k) = sign(θ̂_k − θ*_k) · sqrt(S_k(θ*_k) − S(θ̂)) / σ̂ ~ t_{n−p}.
27 / 35

Our first approximation was based on the linear approximation and gave a test of the form
δ_k(θ*_k) = (θ̂_k − θ*_k) / s.e.(θ̂_k) ~ t_{n−p} (approximately),
where s.e.(θ̂_k) = sqrt(V̂_kk). This is what we saw in the computer output. The new approach with T_k(θ*_k) answers the same question (i.e., we do a test for a single component), but its approximation is (typically) much more accurate. We can compare the different approaches using plots.
28 / 35
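
In R, confint() applied to an nls fit implements essentially this profiling idea (it profiles S_k and inverts the t-statistic T_k). A sketch, continuing the treated-enzyme fit:

## Profile-based confidence intervals, to compare with the vcov()-based ones above.
treated <- subset(Puromycin, state == "treated")
fit <- nls(rate ~ theta1 * conc / (theta2 + conc),
           data = treated, start = list(theta1 = 200, theta2 = 0.1))
confint(fit)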

Profile t-plots and Profile Traces
The profile t-plot is defined as the plot of T_k(θ_k) against δ_k(θ_k) (for varying θ_k). Remember: the two tests (T_k and δ_k) test the same thing. If they behave similarly, we would expect the same answers, hence the plot should show a diagonal (intercept 0, slope 1). Strong deviations from the diagonal indicate that the linear approximation at the solution is not suitable and that the problem is strongly nonlinear in a neighborhood of θ̂_k.
29 / 35

Profile t-plots: Examples
[Figure: profile t-plots, T_1(θ_1) against δ(θ_1), for the Puromycin and Biochemical Oxygen Demand examples, with 80% and 99% levels marked.]
30 / 35
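
Such plots can be produced in R from the profile of an nls fit; loading MASS ensures the plotting methods for profile objects are available (a sketch, not the original slides' code):

## Profile t-plots for the treated-enzyme fit.
library(MASS)
treated <- subset(Puromycin, state == "treated")
fit <- nls(rate ~ theta1 * conc / (theta2 + conc),
           data = treated, start = list(theta1 = 200, theta2 = 0.1))
plot(profile(fit))   # profile t-statistic for each parameter, with confidence levels marked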

Profile Traces
Select a pair of parameters θ_j, θ_k with j ≠ k. Keep θ_k fixed and estimate the remaining parameters; this gives θ̂_j(θ_k). That means: when varying θ_k we can plot the estimated θ̂_j (and vice versa). Illustrate these two curves in a single plot.
What can we learn from this?
- The angle between the two curves is a measure of the correlation between the estimated parameters: the smaller the angle, the higher the correlation.
- In the linear case we would see straight lines; deviations are an indication of nonlinearity.
- Correlated parameter estimates influence each other strongly and make estimation difficult.
31 / 35

Profile Traces: Examples
[Figure: profile traces for (θ_1, θ_2) in the Puromycin and Biochemical Oxygen Demand examples. Grey lines indicate confidence regions (80% and 95%).]
32 / 35
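
Profile traces can be obtained in R with pairs() on the profile object; the relevant pairs method ships with the MASS package (a sketch, not the original slides' code):

## Profile traces for the treated-enzyme fit.
library(MASS)                     # provides the pairs() method for profile objects
treated <- subset(Puromycin, state == "treated")
fit <- nls(rate ~ theta1 * conc / (theta2 + conc),
           data = treated, start = list(theta1 = 200, theta2 = 0.1))
pairs(profile(fit))               # the angle between the traces reflects the correlation of the estimates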

Parameter Transformations
In order to improve the linear approximation (and therefore improve convergence behaviour), it can be useful to transform the parameters. Transformations of parameters do not change the model, but they do change
- the quality of the linear approximation, influencing the difficulty of computation and the validity of approximate confidence regions, and
- the interpretation of the parameters.
Finding good transformations is hard. Results can be transformed back to the original parameters; the transformation is then just a technical step to solve the problem.
33 / 35

Use parameter transformations to avoid side constraints, e.g.
- θ_j > 0: use θ_j = exp{φ_j}, φ_j ∈ R.
- θ_j ∈ (a, b): use θ_j = a + (b − a) / (1 + exp{−φ_j}), φ_j ∈ R.
34 / 35
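
For example (a sketch, not from the original slides), the constraint θ_2 > 0 in the Michaelis-Menten model can be enforced by fitting on the scale φ_2 = log(θ_2) and transforming back:

## Fit with theta2 = exp(phi2) to keep theta2 positive.
treated <- subset(Puromycin, state == "treated")
fit_log <- nls(rate ~ theta1 * conc / (exp(phi2) + conc),
               data  = treated,
               start = list(theta1 = 200, phi2 = log(0.1)))
c(theta1 = unname(coef(fit_log)["theta1"]),
  theta2 = exp(unname(coef(fit_log)["phi2"])))   # back on the original scale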

Summary
- Nonlinear regression models are widespread in chemistry.
- Computation needs an iterative procedure.
- The simplest tests and confidence intervals are based on the linear approximation around the solution θ̂.
- If the linear approximation is not very accurate, problems can occur; graphical tools for checking linearity are profile t-plots and profile traces.
- Tests and confidence intervals based on the F-test are more accurate.
- Parameter transformations can help reduce these problems.
35 / 35