Lecture Notes on Linear Regression

Feng Li
fli@sdu.edu.cn
Shandong University, China

1 Linear Regression Problem

In a regression problem, we aim at predicting a continuous target value given an input feature vector. Assume an n-dimensional feature vector is denoted by x ∈ R^n, while y ∈ R is the output variable. In linear regression models, the hypothesis function is defined by

    h_θ(x) = θ^T x    (1)

where θ ∈ R^(n+1) is a parameter vector (by convention, x is augmented with a constant entry 1 so that the intercept term θ_0 is absorbed into the inner product). Taking 2D linear regression as an example, we have x ∈ R and θ = [θ_1, θ_0]^T. Therefore, the hypothesis function is h_θ(x) = θ_1 x + θ_0, which is a straight line defined on a 2D plane.

It is apparent that the hypothesis function is parameterized by θ. Since our goal is to make predictions according to the hypothesis function given new test data, we need to find the optimal value of θ such that the resulting predictions are as accurate as possible. Such a procedure is called training. The training procedure is performed based on a given set of m training data {x^(i), y^(i)}_{i=1,...,m}. In particular, we are supposed to find a hypothesis function (parameterized by θ) which fits the training data as closely as possible. To measure the error between h_θ and the training data, we define a cost function (also called error function) J(θ): R^(n+1) → R as follows

    J(θ) = (1/2) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))^2

Our linear regression problem can then be formulated as

    min_θ J(θ) = (1/2) Σ_{i=1}^m (θ^T x^(i) − y^(i))^2

We give an illustration in Fig. 1 to explain linear regression in 3D space (i.e., n = 2). In the 3D space, the hypothesis function is represented by a hyperplane. The red points denote the training data, and it is shown that J(θ) is the sum of the squared differences between the target values of the training data and the hyperplane.
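To make the notation concrete, here is a minimal sketch in Python with NumPy (the variable names and toy data are our own, not from the notes) of the hypothesis h_θ(x) = θ^T x and the cost J(θ), using the convention that each feature vector carries a leading 1 for the intercept:

```python
import numpy as np

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x; x is augmented with a leading 1
    so that theta[0] acts as the intercept term theta_0."""
    return theta @ x

def cost(theta, X, Y):
    """Cost J(theta) = (1/2) * sum_i (theta^T x^(i) - y^(i))^2,
    where the rows of X are the augmented feature vectors x^(i)."""
    r = X @ theta - Y
    return 0.5 * (r @ r)

# 2D example: points on the line y = 2x + 1, i.e. theta = [theta_0, theta_1] = [1, 2]
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # rows are [1, x]
Y = np.array([1.0, 3.0, 5.0])
theta = np.array([1.0, 2.0])
print(cost(theta, X, Y))  # 0.0, since the line fits these points exactly
```

A perfect fit gives J(θ) = 0; any other θ gives a strictly positive cost.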
Figure 1: 3D linear regression

2 Gradient Descent

The Gradient Descent (GD) method is a first-order iterative optimization algorithm for finding the minimum of a function. If the multi-variable function J(θ) is differentiable in a neighborhood of a point θ, then J(θ) decreases fastest if one goes from θ in the direction of the negative gradient of J at θ. Let

    ∇J(θ) = [∂J/∂θ_0, ∂J/∂θ_1, ..., ∂J/∂θ_n]^T    (2)

denote the gradient of J(θ). In each iteration, we update θ according to the following rule

    θ ← θ − α∇J(θ)    (3)

where α is a step size. In more detail, each component is updated as

    θ_j ← θ_j − α ∂J(θ)/∂θ_j    (4)

The update is terminated when convergence is achieved. In our linear regression model, the partial derivatives can be calculated as

    ∂J(θ)/∂θ_j = ∂/∂θ_j [(1/2) Σ_{i=1}^m (θ^T x^(i) − y^(i))^2] = Σ_{i=1}^m (θ^T x^(i) − y^(i)) x_j^(i)    (5)

We summarize the GD method in Algorithm 1. The algorithm usually starts with a randomly initialized θ. In each iteration, we update θ such that the objective function decreases monotonically. The algorithm is said to have converged when the difference of J in successive iterations is less than (or equal to) a predefined threshold (say ε). Assuming θ^(t) and θ^(t+1) are the values of θ in the t-th and the (t+1)-th iterations, respectively, the algorithm has converged when

    |J(θ^(t+1)) − J(θ^(t))| ≤ ε    (6)
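The gradient of Eq. (5) and the update rule of Eq. (3) can be sketched in vectorized NumPy form as follows (the toy data and step size are illustrative choices):

```python
import numpy as np

def gradient(theta, X, Y):
    """Gradient from Eq. (5): grad_j = sum_i (theta^T x^(i) - y^(i)) x_j^(i),
    written in vectorized form as X^T (X theta - Y)."""
    return X.T @ (X @ theta - Y)

def gd_step(theta, X, Y, alpha):
    """One GD update, Eq. (3): theta <- theta - alpha * grad J(theta)."""
    return theta - alpha * gradient(theta, X, Y)

# For a sufficiently small alpha, a single step decreases the cost
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([1.0, 3.0, 5.0])
cost = lambda t: 0.5 * np.sum((X @ t - Y) ** 2)
theta0 = np.zeros(2)
theta1 = gd_step(theta0, X, Y, alpha=0.1)
print(cost(theta1) < cost(theta0))  # True
```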
Algorithm 1: Gradient Descent
1: Given a starting point θ ∈ dom J
2: repeat
3:     Calculate the gradient ∇J(θ);
4:     Update θ ← θ − α∇J(θ)
5: until the convergence criterion is satisfied

Figure 2: The convergence of the GD algorithm

Another convergence criterion is to set a fixed value for the maximum number of iterations, such that the algorithm is terminated once the number of iterations exceeds this threshold. We illustrate how the algorithm converges iteratively in Fig. 2. The colored contours represent the objective function, and the GD algorithm approaches the minimum step by step.

The choice of the step size α has a very important influence on the convergence of the GD algorithm. We illustrate the convergence processes under different step sizes in Fig. 3.
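The loop of Algorithm 1, combined with the convergence criterion of Eq. (6) and a cap on the number of iterations, might look like the sketch below (the step size, threshold, and toy data are arbitrary illustrative choices; we start from θ = 0 rather than a random point for reproducibility):

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.01, eps=1e-10, max_iters=100000):
    """Run GD until |J(theta^(t+1)) - J(theta^(t))| <= eps (Eq. (6)),
    with a maximum-iteration cap as a second stopping criterion."""
    cost = lambda t: 0.5 * np.sum((X @ t - Y) ** 2)
    theta = np.zeros(X.shape[1])
    prev = cost(theta)
    for _ in range(max_iters):
        theta = theta - alpha * (X.T @ (X @ theta - Y))  # Eq. (3) with Eq. (5)
        cur = cost(theta)
        if abs(prev - cur) <= eps:  # convergence criterion, Eq. (6)
            break
        prev = cur
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([1.0, 3.0, 5.0])
theta = gradient_descent(X, Y)
print(np.round(theta, 4))  # close to [1, 2]
```

If α is chosen too large the iterates oscillate or diverge, which is exactly the effect Fig. 3 illustrates.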
Figure 3: The convergence of the GD algorithm under different step sizes (objective function J versus number of iterations)

3 Stochastic Gradient Descent

According to Eq. (5), it is observed that we have to visit all training data in each iteration. The induced cost is therefore considerable, especially when the training set is large. Stochastic Gradient Descent (SGD), also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization method. In each iteration, the parameters are updated according to the gradient of the error (i.e., the cost function) with respect to one training sample only. Hence, each iteration entails very limited cost.

We summarize the SGD method in Algorithm 2. In each pass, we first randomly shuffle the training data, and then choose one training example at a time to calculate the gradient (i.e., ∇J(θ; x^(i), y^(i))) and update θ. In our linear regression model, ∇J(θ; x^(i), y^(i)) is defined as

    ∇J(θ; x^(i), y^(i)) = (θ^T x^(i) − y^(i)) x^(i)    (7)

and the update rule is

    θ_j ← θ_j − α (θ^T x^(i) − y^(i)) x_j^(i)    (8)

Algorithm 2: Stochastic Gradient Descent for Linear Regression
1: Given a starting point θ ∈ dom J
2: repeat
3:     Randomly shuffle the training data;
4:     for i = 1, 2, ..., m do
5:         θ ← θ − α∇J(θ; x^(i), y^(i))
6:     end for
7: until the convergence criterion is satisfied

Compared with GD, where the objective function decreases in every step, SGD does not have such a guarantee. In fact, SGD entails more steps to converge, but each step is cheaper. One variant of SGD is so-called minibatch SGD, where we pick a small group of training data and average their gradients to accelerate and smooth the convergence. For example, by randomly choosing k training data, we can calculate the average gradient

    (1/k) Σ_{i=1}^k ∇J(θ; x^(i), y^(i))    (9)
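Algorithm 2 can be sketched as follows (the learning rate, epoch count, random seed, and toy data are illustrative choices, not values from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(X, Y, alpha=0.05, epochs=200):
    """Algorithm 2: in each pass, shuffle the data, then update theta one
    sample at a time using the single-sample gradient of Eq. (7)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):  # random shuffle of the training data
            theta -= alpha * (X[i] @ theta - Y[i]) * X[i]  # Eq. (8)
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([1.0, 3.0, 5.0])
theta_hat = sgd(X, Y)
print(np.round(theta_hat, 3))  # converges toward [1, 2], since the data lie exactly on that line
```

A minibatch variant would replace the inner single-sample update with the averaged gradient of Eq. (9) over k randomly chosen samples, trading a slightly more expensive step for a smoother trajectory.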
4 A Closed-Form Solution to Linear Regression

We first look at the matrix form of the linear regression model. Let X ∈ R^(m×(n+1)) be the matrix whose i-th row is (x^(i))^T, and let Y = [y^(1), y^(2), ..., y^(m)]^T be the vector of target values.    (10)

Therefore, we have

    Xθ − Y = [(x^(1))^T θ − y^(1), ..., (x^(m))^T θ − y^(m)]^T = [h_θ(x^(1)) − y^(1), ..., h_θ(x^(m)) − y^(m)]^T

Then, the cost function J(θ) can be rewritten as

    J(θ) = (1/2) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))^2 = (1/2) (Xθ − Y)^T (Xθ − Y)    (11)

To minimize the cost function, we calculate its gradient and set it to zero:

    ∇_θ J(θ) = ∇_θ (1/2)(Y − Xθ)^T (Y − Xθ)
             = (1/2) ∇_θ (Y^T − θ^T X^T)(Y − Xθ)
             = (1/2) ∇_θ tr(Y^T Y − Y^T Xθ − θ^T X^T Y + θ^T X^T Xθ)
             = (1/2) ∇_θ tr(θ^T X^T Xθ) − X^T Y
             = (1/2)(X^T Xθ + X^T Xθ) − X^T Y
             = X^T Xθ − X^T Y

Setting X^T Xθ − X^T Y = 0, we have

    θ = (X^T X)^(−1) X^T Y

Note that the inverse of X^T X does not always exist. In fact, the matrix X^T X is invertible if and only if the columns of X are linearly independent.

5 A Probabilistic Interpretation

An interesting question is why the least squares form of the linear regression model is reasonable. We hereby give a probabilistic interpretation. Suppose each target value y^(i) is sampled from the line θ^T x^(i) with certain noise, i.e.,

    y^(i) = θ^T x^(i) + ε^(i)

where ε^(i) denotes the noise and is independently and identically distributed (i.i.d.) according to a Gaussian distribution N(0, σ^2). The density of ε^(i) is given by

    f(ε^(i)) = (1/(√(2π) σ)) exp(−(ε^(i))^2 / (2σ^2))
and we equivalently have the following conditional probability

    Pr(y^(i) | x^(i); θ) = (1/(√(2π) σ)) exp(−(y^(i) − θ^T x^(i))^2 / (2σ^2))

Therefore, it is shown that the distribution of y^(i) given x^(i) is parameterized by θ, i.e., y^(i) | x^(i); θ ∼ N(θ^T x^(i), σ^2). Recall that Y and X are the vector of target values and the matrix of features, respectively (see Eq. (10)); in matrix form, Y = Xθ + ε, where ε = [ε^(1), ..., ε^(m)]^T. Considering that the ε^(i)'s are i.i.d., we define the following likelihood function

    L(θ) = Π_{i=1}^m Pr(y^(i) | x^(i); θ) = Π_{i=1}^m (1/(√(2π) σ)) exp(−(y^(i) − θ^T x^(i))^2 / (2σ^2))

which denotes the probability of the given target values in the training data. To calculate the optimal θ such that the resulting linear regression model fits the given training data best, we need to maximize the likelihood L(θ). To simplify the computation, we use the log-likelihood function instead, i.e.,

    ℓ(θ) = log L(θ)
         = log Π_{i=1}^m (1/(√(2π) σ)) exp(−(y^(i) − θ^T x^(i))^2 / (2σ^2))
         = Σ_{i=1}^m log [(1/(√(2π) σ)) exp(−(y^(i) − θ^T x^(i))^2 / (2σ^2))]
         = m log (1/(√(2π) σ)) − (1/σ^2) · (1/2) Σ_{i=1}^m (y^(i) − θ^T x^(i))^2

Apparently, maximizing ℓ(θ) (and thus L(θ)) is equivalent to minimizing

    (1/2) Σ_{i=1}^m (y^(i) − θ^T x^(i))^2

which is exactly the least squares cost function.
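As a numerical illustration of Sections 4 and 5, the sketch below (assuming σ = 1 and a small synthetic data set of our own) computes the closed-form solution θ = (X^T X)^(−1) X^T Y and checks that it also maximizes the log-likelihood ℓ(θ): perturbing θ away from the least-squares solution lowers ℓ(θ).

```python
import numpy as np

def log_likelihood(theta, X, Y, sigma=1.0):
    """l(theta) = m*log(1/(sqrt(2*pi)*sigma)) - (1/(2*sigma^2)) * sum_i (y^(i) - theta^T x^(i))^2."""
    m = len(Y)
    rss = np.sum((Y - X @ theta) ** 2)
    return m * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma)) - rss / (2.0 * sigma ** 2)

# Synthetic noisy samples around the (hypothetical) line y = 2x + 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # rows are [1, x]
Y = np.array([1.1, 2.9, 5.2, 6.8])

# Closed-form solution; np.linalg.solve avoids explicitly inverting X^T X
theta_ls = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_ls)  # close to [1, 2]

# Moving away from the least-squares solution lowers the likelihood:
# maximizing l(theta) is the same as minimizing the sum of squared residuals
print(log_likelihood(theta_ls, X, Y) > log_likelihood(theta_ls + 0.1, X, Y))  # True
```

Solving the normal equations with np.linalg.solve requires X^T X to be invertible, i.e., the columns of X to be linearly independent, matching the remark at the end of Section 4.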