Statistics and Data Analysis in MATLAB
Kendrick Kay, kendrick.kay@wustl.edu
February 28, 2014
Lecture 4: Model fitting

1. The basics
- Suppose that we have a set of data and suppose that we have selected the type of model to apply to the data (see Lecture 3). Our task now is to fit the model to the data, that is, adjust the free parameters of the model such that the model describes the data as well as possible.
- Before proceeding, we must first establish a metric for the goodness-of-fit of the model. A common metric is squared error, that is, the sum of the squares of the differences between the data and the model fit:
  squared error = Σ_{i=1}^{n} (d_i - m_i)^2
where n is the number of data points, d_i is the ith data point, and m_i is the model fit for the ith data point. We will see the motivation for squared error later in this lecture.
- Given the metric of squared error, our job is to determine the specific set of parameter values that minimize squared error. The solution to this problem depends on the type of model that we are trying to fit.

2. The case of linear models
- Model fitting in the case of linear (and linearized) models can be given a nice geometric interpretation. We can view the data as a single point in n-dimensional space, and we can view the regressors as vectors in this space emanating from the origin. Potential model fits are given by points in the subspace spanned by the vectors. (The subspace consists of all points that can be expressed as a weighted sum of the regressors.) The model fit that minimizes squared error is the point that lies closest in a Euclidean sense to the data. (This is because the Euclidean distance between two points is a simple monotonic transformation, the square root, of the sum of the squares of the differences in the two points' coordinates.) The residuals of the model fit can be viewed as a vector that starts at the model fit and ends at the data.
- With these geometric insights in mind, we can now derive the solution to our fitting problem. Recall that linear models can be expressed as
  y = Xw + ε
where y is a set of data points (n × 1), X is a set of regressors (n × p), w is a set of weights (p × 1), and ε is a set of residuals (n × 1). Let ŵ_OLS denote the set of weights that provide the optimal model fit; these weights are called the ordinary least-squares (OLS) estimate. At the optimal model fit, the residuals must be orthogonal to each of the regressors. (If the residuals were correlated with a given regressor, then a better model fit could be obtained by moving in the direction of that regressor.) This orthogonality condition implies that the dot product between each regressor and the residuals must equal zero:
  X^T (y - X ŵ_OLS) = 0
where 0 is a vector of zeros (p × 1). Expanding and solving, we obtain
  X^T y - X^T X ŵ_OLS = 0
and
  ŵ_OLS = (X^T X)^(-1) X^T y
where A^(-1) indicates the matrix inverse of A. Thus, we see that the set of weights that minimize squared error can be computed using an analytic expression involving simple matrix operations.
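
To make this concrete, here is a minimal MATLAB sketch of the analytic OLS expression on synthetic data. The data, noise level, and variable names are illustrative choices of mine, not part of the lecture; note that MATLAB's backslash operator computes the same least-squares solution in a more numerically stable way.

  rng(0);
  n = 100; p = 3;
  X = randn(n,p);                  % regressors (n x p)
  wtrue = [2; -1; 0.5];            % weights used to generate the synthetic data
  y = X*wtrue + 0.3*randn(n,1);    % data = model plus Gaussian noise

  wOLS = inv(X'*X) * X'*y;         % the analytic expression derived above
  wBS  = X \ y;                    % equivalent least-squares solution via backslash

  disp([wtrue wOLS wBS])           % both estimates should be close to the true weights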

- Note that if there are more regressors than data points (p > n), then the inversion of the correlation matrix X^T X is ill-defined, and there is no unique OLS solution. Intuitively, the idea is that if there are more regressors than data points, then there are infinitely many solutions, all of which achieve zero error.

3. The case of nonlinear models
- Given that nonlinear models encompass a broad diversity of models, there is little hope of writing down a single expression that will solve the fitting problem for an arbitrary nonlinear model.
- To fit nonlinear models, we cast the problem as a search problem in which we have a parameter space and a cost function, and our job is to search through the space to find the point that minimizes the cost function. Since we are using the cost function of squared error, we can think of our job as trying to find the minimum point on an error surface. If there is only one parameter, the error surface is a function defined on one dimension (i.e. a curvy line); if there are two parameters, the error surface is a function defined on two dimensions (i.e. a bumpy sheet); etc.
- To search through the parameter space, the usual approach is to use local, iterative optimization algorithms that start at some point in the space (the initial seed), look at the error surface in a small neighborhood around that point, move in some direction in an attempt to reduce the error, and then repeat this process until improvements are sufficiently small (e.g. until the improvement is less than some small number). There are a variety of optimization algorithms, and they basically vary with respect to how exactly they make use of first-order derivative information (gradients) and second-order derivative information (curvature). The Levenberg-Marquardt algorithm is a popular and effective algorithm and is implemented in the MATLAB Optimization Toolbox.
- As a simple example of an optimization method, let us consider how to perform gradient descent for a linear model (reusing the earlier example y = Xw + ε). Assuming the error metric is squared error, the derivative of the error surface with respect to the jth weight is
  d(error)/d(w_j) = d/d(w_j) [ (y - Xw)^T (y - Xw) ] = d/d(w_j) Σ_{i=1}^{n} (y_i - X_i w)^2 = Σ_{i=1}^{n} 2 (y_i - X_i w) (-X_{i,j})
where w_j is the jth element of w, y_i is the ith element of y, X_i is the ith row of X, and X_{i,j} is the (i,j)th element of X. Collecting the derivatives for different weights into a vector, we obtain the gradient of the error surface (p × 1):
  ∇error = -2 X^T (y - Xw)
So, what this tells us is that given a set of weights w, we know how to compute the amount by which the error will change if we were to tweak any of the weights (e.g., if we were to increment the first weight by 0.01, then the error will change by approximately 0.01 times the first element in the gradient vector). This suggests a simple algorithm: first, set all the weights to some initial value (e.g. all zeros); then, compute the gradient and update the weights by subtracting some small fraction of the gradient; and repeat the weight-updating process until the error stops decreasing. We will see that this algorithm can be implemented in MATLAB quite easily.
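
As a preview, here is a rough MATLAB sketch of that gradient-descent algorithm. The synthetic data, the fixed step size, and the stopping threshold are choices I have made for illustration; they are not specified in the lecture, and a fixed step size is the simplest possible scheme rather than what production optimizers do.

  rng(0);
  n = 100; p = 3;
  X = randn(n,p);
  y = X*[2; -1; 0.5] + 0.3*randn(n,1);   % synthetic data = linear model plus noise

  w = zeros(p,1);                        % initial seed: all weights set to zero
  stepsize = 0.001;                      % small fraction of the gradient to subtract
  olderr = Inf;
  for iter = 1:10000
    grad = -2 * X' * (y - X*w);          % gradient of squared error with respect to w
    w = w - stepsize * grad;             % move downhill
    err = sum((y - X*w).^2);             % current squared error
    if olderr - err < 1e-10              % stop when the improvement is sufficiently small
      break;
    end
    olderr = err;
  end
  disp([w X\y])                          % compare with the analytic OLS solution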

- A potential problem with local, iterative optimization is local minima, that is, locations on the error surface that are the minimum within some local range but which are not the absolute minimum that can be achieved (which is known as the global minimum).
- For linear models, the error surface is shaped like a bowl and there are no local minima. The lack of local minima makes parameter search easy: as long as an algorithm can adjust parameters to reduce the error, we will eventually get to the optimal solution. Geometrically, there is a nice intuition for why linear models have no local minima: assuming there are at least as many data points as regressors, there is exactly one point in the subspace spanned by the regressors that is closest to the data, and as parameter values deviate from this point, the distance from the data grows monotonically.
- For nonlinear models, the error surface may be "bumpy" with local minima. Because of local minima, the solution found by an algorithm may not be the best possible solution.
- The severity of the problem of local minima depends on the nature of the data, the nature of the model, and the specific optimization algorithm used, so it is difficult to make any general statements. Strategies for dealing with local minima include (1) starting with different initial seeds and selecting the best resulting model, (2) exhaustively sampling the parameter space, and (3) using alternative optimization techniques such as genetic algorithms.

4. The motivation for squared error
- Given a probability distribution, we can quantify the probability, or likelihood, of a set of data. Moreover, for different probability distributions, we can ask which probability distribution maximizes the likelihood of the data. This procedure is known as maximum likelihood estimation and provides a means for choosing from amongst different models. For example, given a set of data, out of all of the possible Gaussian distributions, the one that maximizes the likelihood of the data has a mean and standard deviation equal to the mean and standard deviation of the data.
- Let us apply maximum likelihood estimation to the case of regression models. Suppose that the true underlying probability distribution for each data point is a Gaussian whose mean is equal to the model prediction and whose standard deviation is some fixed value. In other words, suppose that the data are generated by the model plus independent, identically distributed (i.i.d.) Gaussian noise. Then, the likelihood of a given set of data can be written as follows:
  likelihood(d | m) = Π_{i=1}^{n} p(d_i) = Π_{i=1}^{n} [1 / (σ√(2π))] e^(-(d_i - m_i)^2 / (2σ^2))
where d represents the data, m represents the model, n is the number of data points, d_i is the ith data point, σ is the standard deviation of the Gaussian noise, and m_i is the model prediction for the ith data point. The model estimate, m̂, that we want is the one that maximizes the likelihood of the data:
  arg max_m ( likelihood(d | m) )
Because the logarithm is a monotonic function, the desired model estimate is also given by
  arg max_m ( log-likelihood(d | m) )
which in turn is equivalent to
  arg min_m ( negative-log-likelihood(d | m) )
Now, let's substitute in the likelihood expression:
  arg min_m ( -Σ_{i=1}^{n} log( [1 / (σ√(2π))] e^(-(d_i - m_i)^2 / (2σ^2)) ) )
Simplifying, we obtain
  arg min_m ( Σ_{i=1}^{n} [ -log(1 / (σ√(2π))) + (d_i - m_i)^2 / (2σ^2) ] )
We can drop the first term since it has no dependence on m:
  arg min_m ( Σ_{i=1}^{n} (d_i - m_i)^2 / (2σ^2) )
We can drop the denominator since it has no dependence on m:
  arg min_m ( Σ_{i=1}^{n} (d_i - m_i)^2 )
Finally, we can rewrite this more simply:
  arg min_m ( squared error )
Thus, we see that to maximize the likelihood of the data, we should choose the model that minimizes squared error. This shows that there is good motivation to use squared error, namely, that assuming i.i.d. Gaussian noise, the maximum likelihood estimate of the parameters of a model is the set of parameters that minimizes squared error.
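
Here is a small MATLAB check of this equivalence. The synthetic data and the use of fminsearch to minimize the negative log-likelihood numerically are my own illustrative choices; the point is simply that the resulting weights match the least-squares solution.

  rng(0);
  n = 50;
  X = [ones(n,1) randn(n,1)];                % a simple two-regressor linear model
  sigma = 0.5;
  y = X*[1; 2] + sigma*randn(n,1);           % data = model plus i.i.d. Gaussian noise

  % negative log-likelihood of the data as a function of the weights (sigma held fixed)
  negloglik = @(w) sum( log(sigma*sqrt(2*pi)) + (y - X*w).^2 / (2*sigma^2) );

  wML  = fminsearch(negloglik, zeros(2,1));  % numerical maximum likelihood estimate
  wOLS = X \ y;                              % analytic least-squares estimate
  disp([wML wOLS])                           % the two columns should agree closely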

5. An alternative error metric: absolute error
- The assumption that the noise is Gaussian may be inaccurate for two reasons. One, the measurement noise may be non-Gaussian. Two, the model being applied may not be the correct model (e.g., fitting a linear model when the true effect is quadratic). This has the consequence that unmodeled effects may be subsumed in the noise and may cause the noise to be non-Gaussian. Thus, although squared error has a nice motivation, we might desire to use a different error metric.
- One alternative metric is the sum of the absolute values of the differences between the data and the model fit:
  absolute error = Σ_{i=1}^{n} |d_i - m_i|
where n is the number of data points, d_i is the ith data point, and m_i is the model fit for the ith data point. This choice of error metric is useful as it reduces the impact of outliers (which can be roughly defined as unreasonably extreme data points). For a theoretical motivation, it can be shown that the parameter estimate that minimizes absolute error is the maximum likelihood estimate under the assumption of Laplacian noise. (The Laplace distribution is just like the Gaussian distribution except that an exponential is taken of the absolute difference from the mean instead of the squared difference from the mean.)
- The difference between squared error and absolute error can be framed in terms of the difference between the mean and the median: the mean of a set of data points is the number that minimizes squared error (that is, the sum of the squares of the differences between the number and each of the data points), whereas the median of a set of data points is the number that minimizes absolute error (that is, the sum of the absolute differences between the number and each of the data points).
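
A quick MATLAB check of that mean/median claim (the specific numbers are just a toy example of mine): evaluate both error metrics over a grid of candidate values and compare the minimizers to the mean and the median.

  d = [1 2 3 4 100];                         % small data set containing one extreme point
  candidates = linspace(0, 100, 100001);     % candidate values to evaluate

  sqerr  = arrayfun(@(c) sum((d - c).^2), candidates);   % squared error for each candidate
  abserr = arrayfun(@(c) sum(abs(d - c)), candidates);   % absolute error for each candidate

  [~,i1] = min(sqerr);
  [~,i2] = min(abserr);
  disp([candidates(i1) mean(d)])             % squared-error minimizer matches the mean (22)
  disp([candidates(i2) median(d)])           % absolute-error minimizer matches the median (3)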

- Thus, we can view absolute error as an error metric that is potentially more robust than squared error. Note, however, that there are some disadvantages of absolute error: first, there is no analytic solution for linear models, and second, error surfaces quantifying absolute error may be less well-behaved (e.g. more local minima) than error surfaces quantifying squared error.
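
Since there is no analytic solution for minimizing absolute error, one option is to search for the weights numerically. Below is a MATLAB sketch of this idea using fminsearch; the synthetic data, the injected outlier, and the choice of seeding the search at the OLS solution are my own illustrative choices rather than prescriptions from the lecture.

  rng(0);
  n = 50;
  X = [ones(n,1) (1:n)'/n];                        % intercept and a linear trend regressor
  y = X*[1; 2] + 0.2*randn(n,1);                   % data = linear model plus Gaussian noise
  y(end) = y(end) + 20;                            % inject one large outlier

  wL2 = X \ y;                                     % squared-error (OLS) fit, analytic
  wL1 = fminsearch(@(w) sum(abs(y - X*w)), wL2);   % absolute-error fit, found numerically

  disp([wL2 wL1])                                  % the absolute-error fit is pulled less by the outlier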