2 Least-Squares

2.1 Simple least-squares

Consider the following dataset. We have a bunch of inputs x_i and corresponding outputs y_i. The particular values in this dataset are

    x     y
    0.23  0.19
    0.88  0.96
    0.21  0.33
    0.92  0.80
    0.49  0.46
    0.62  0.45
    0.77  0.67
    0.52  0.32
    0.30  0.38
    0.19  0.37

Now, suppose we'd like to fit a line to this data, of the form y = ax. A standard way to fit such a line is by least-squares, where we measure the sum of the squared differences of each data point to the line with

    R = \sum_i (a x_i - y_i)^2.
Exercise 8. If we use a = 0.7, what is the error R?

This can be visualized as:

[Figure: the data points with the line y = 0.7x, and a vertical error line from each point to the line.]

Notice that we want to minimize the sum of the squares of the lengths of the error lines. There are various justifications for this, but it is worth noting that one could alternatively minimize the lengths of the lines themselves, the cubes of them, etc. The square is by far the most common, however; it leads to simpler algorithms and will be our focus here.

Now, how can we find the best a?

Theorem 9. For R = \sum_i (a x_i - y_i)^2, the minimum value of R is obtained for

    a = \frac{\sum_i x_i y_i}{\sum_i x_i^2}.    (2.1)

Proof. We can simply calculate the derivative of R and set it to zero:

    \frac{dR}{da} = 2 \sum_i x_i (a x_i - y_i) = 0
    a \sum_i x_i^2 = \sum_i x_i y_i
    a = \frac{\sum_i x_i y_i}{\sum_i x_i^2}.
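To make this concrete, here is a minimal sketch in Python (NumPy is our choice of tool here, and the names error and best_a are ours, not part of the notes) computing R and the closed-form a of Eq. 2.1 on the dataset above:

```python
import numpy as np

# The example dataset from the start of this section.
x = np.array([0.23, 0.88, 0.21, 0.92, 0.49, 0.62, 0.77, 0.52, 0.30, 0.19])
y = np.array([0.19, 0.96, 0.33, 0.80, 0.46, 0.45, 0.67, 0.32, 0.38, 0.37])

def error(a, x, y):
    """R = sum_i (a*x_i - y_i)^2, the sum of squared residuals."""
    return np.sum((a * x - y) ** 2)

def best_a(x, y):
    """Closed-form minimizer from Eq. 2.1: a = sum_i x_i*y_i / sum_i x_i^2."""
    return np.sum(x * y) / np.sum(x ** 2)

print(error(0.7, x, y))  # the error R for a = 0.7, as in Exercise 8
print(best_a(x, y))      # roughly 0.920
```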
Exercise 10. Write a program that takes a bunch of values (x_i, y_i), along with a, as input, and returns R. Next, write a program to compute a from Eq. 2.1. Test that, on the above dataset, this program correctly returns a = 0.920. Compute the values of R for a range of a and see that this is the right value.

We can plot R for a range of a and see that this is the true minimum.

[Figure: R plotted against a for a between 0 and 2 (R axis from 0 to 5); the curve is a parabola with its minimum near a = 0.92.]

We can plot the data with the curve for this optimal a as:

[Figure: the data points with the fitted line y = 0.920x.]
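A quick numerical version of this check might look as follows; it reuses x, y, and error from the sketch above, and the grid bounds simply follow the plot's range:

```python
# Evaluate R on a grid of slopes and confirm the minimum sits at the
# closed-form answer from Eq. 2.1 (roughly a = 0.920).
a_grid = np.linspace(0.0, 2.0, 201)
R_grid = np.array([error(a, x, y) for a in a_grid])
print(a_grid[np.argmin(R_grid)])  # roughly 0.92
```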
2.2 Least-squares with a bias term

It is common to add a bias term and, rather than fitting the linear function y = ax, fit the affine function y = ax + b. As before, we can use the basic calculus technique of setting derivatives to zero in order to find the best a and b.

Theorem 11. For R = \sum_i (a x_i + b - y_i)^2, the minimum value of R is obtained for

    b = \frac{\bar{y} - \bar{x}\,\overline{xy}/\overline{x^2}}{1 - \bar{x}^2/\overline{x^2}}
    a = \frac{1}{\bar{x}} (\bar{y} - b),

where

    \bar{y} = \frac{1}{N} \sum_i y_i
    \bar{x} = \frac{1}{N} \sum_i x_i
    \overline{x^2} = \frac{1}{N} \sum_i x_i^2
    \overline{xy} = \frac{1}{N} \sum_i x_i y_i.

Proof. Start with the derivatives

    \frac{dR}{da} = 2 \sum_i x_i (a x_i + b - y_i)
    \frac{dR}{db} = 2 \sum_i (a x_i + b - y_i).

Setting both to zero gives

    a \sum_i x_i^2 + b \sum_i x_i = \sum_i x_i y_i
    a \sum_i x_i + b \sum_i 1 = \sum_i y_i.
Now, making use of the mean notation (dividing both equations by N), we can write the linear system

    a \overline{x^2} + b \bar{x} = \overline{xy}
    a \bar{x} + b = \bar{y}.

To eliminate a, we can subtract \bar{x}/\overline{x^2} times the first equation from the second, yielding

    b - b \frac{\bar{x}^2}{\overline{x^2}} = \bar{y} - \frac{\bar{x}}{\overline{x^2}} \overline{xy},

which we can solve for

    b = \frac{\bar{y} - \bar{x}\,\overline{xy}/\overline{x^2}}{1 - \bar{x}^2/\overline{x^2}}.

Once we have computed this, we can recover a from

    a = \frac{1}{\bar{x}} (\bar{y} - b).

Not the prettiest theorem or proof in the world, but it works. Adding a bias term slightly improves the quality of the fit.

Exercise 12. Write a program that takes a bunch of values (x_i, y_i) and finds the optimal a and b for an affine fit y = ax + b. Verify that on the example dataset, you correctly find a = 0.765 and b = 0.100.
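One possible implementation of Theorem 11's formulas, again as a sketch reusing the x, y arrays and NumPy import from before (affine_fit is our own name):

```python
def affine_fit(x, y):
    """Fit y = a*x + b via the mean-notation formulas of Theorem 11."""
    x_bar, y_bar = np.mean(x), np.mean(y)
    x2_bar = np.mean(x ** 2)  # the mean of x_i^2
    xy_bar = np.mean(x * y)   # the mean of x_i * y_i
    b = (y_bar - x_bar * xy_bar / x2_bar) / (1.0 - x_bar ** 2 / x2_bar)
    a = (y_bar - b) / x_bar
    return a, b

print(affine_fit(x, y))  # roughly (0.765, 0.100)
```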
2.3 More powerful least squares

Now, suppose we want to fit a quadratic y = dx^2 + ax + b, or something crazy like y = a \sin(x) + b e^x + d \sqrt{x}. It would be an option to derive an algorithm along the lines of the previous theorem, but one tends to shudder at the thought, given how messy it was just to fit an affine function. Instead, we will define an abstraction. Suppose that the input z is a vector, and we would like to make predictions by

    y = w^T z = \sum_j w(j) z(j).

Theorem 13. For R = \sum_i (w^T z_i - y_i)^2, the minimum value of R is obtained for

    w = \Big( \sum_i z_i z_i^T \Big)^{-1} \sum_i z_i y_i.

Proof.

    \frac{dR}{dw} = 0 = 2 \sum_i z_i (w^T z_i - y_i)
    \sum_i z_i (w^T z_i) = \sum_i z_i y_i
    \sum_i z_i (z_i^T w) = \sum_i z_i y_i
    \Big( \sum_i z_i z_i^T \Big) w = \sum_i z_i y_i.

To actually implement this, we would use something like the following:

Algorithm (Least-Squares)
    Input {z_i}, {y_i}
    s ← \sum_i z_i y_i
    M ← \sum_i z_i z_i^T
    Solve M w = s for w using Gaussian elimination
    Output w.
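A sketch of how this algorithm might be implemented is below. We lean on np.linalg.solve where the notes would use your own Gaussian-elimination routine, and least_squares is our own name; Exercise 14 below gives a dataset to test it on.

```python
def least_squares(Z_list, y_list):
    """Sketch of the Least-Squares algorithm above.

    Z_list is a list of NumPy vectors z_i; y_list the matching outputs y_i.
    Builds M = sum_i z_i z_i^T and s = sum_i z_i y_i, then solves M w = s.
    """
    M = sum(np.outer(z, z) for z in Z_list)
    s = sum(z * yi for z, yi in zip(Z_list, y_list))
    # np.linalg.solve stands in for the Gaussian-elimination routine
    # developed earlier in these notes.
    w = np.linalg.solve(M, s)
    # Also return the residual error R, as Exercise 14 below asks for.
    R = sum((w @ z - yi) ** 2 for z, yi in zip(Z_list, y_list))
    return w, R
```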
Exercise 14. Implement the above algorithm for solving a least-squares system. Use your previous algorithm for solving linear systems. Check that, on the dataset

    z(1)  0.1  0.2  0.3  0.4
    z(2)  0.5  0.6  0.7  0.7
    y     0.9  1.0  1.1  1.2

it correctly returns w = (-0.159, 1.739). Have your function also return the residual error R. Check that this is R = 0.00927. For testing, you might find it useful to check that

    M = \begin{pmatrix} 0.3 & 0.66 \\ 0.66 & 1.59 \end{pmatrix},
    s = \begin{pmatrix} 1.1 \\ 2.66 \end{pmatrix}.

Note that this dataset means z_1 = (0.1, 0.5), z_2 = (0.2, 0.6), etc.

2.4 Basis expansion

Suppose that we can fit vector-valued least-squares systems of the form y = w^T z. Now the question is: what is the connection between all the measured data x and the input vector z?

First, take our old scalar dataset. Suppose we would like to fit an equation of the form

    y = ax + b.

How could we do this? An idea would be to use what is called a basis expansion. Rather than fitting to the dataset {(x_i, y_i)}, fit to {(z_i, y_i)}, where

    z_i = \begin{pmatrix} x_i \\ 1 \end{pmatrix}.    (2.2)

That is, we simply take the input dataset and replace each input scalar by an input vector of length two, consisting of the original scalar plus a constant of one. Then, we would have that

    y = w^T z = w_1 x + w_2,
which is of exactly the same form as y = ax + b. For example, if we started with the dataset

    x  1  2  5
    y  2  7  9

then, after basis expansion, we would have the dataset

    z(1)  1  2  5
    z(2)  1  1  1
    y     2  7  9

Exercise 15. Make a basis expansion for the original dataset of the form in Eq. 2.2. Plug it into your least-squares solver from the previous exercise. Check that you correctly find w = (0.765, 0.100) and the residual error R = 0.1128.

Now, we have room for personal creativity. There is nothing that prevents us from fitting more advanced functions. For example, suppose we want to fit a function of the form

    y = ax + b + cx^2.

What basis expansion should we use? The answer is

    z = \begin{pmatrix} x \\ 1 \\ x^2 \end{pmatrix}.

This is called a quadratic basis expansion.

Exercise 16. Make a quadratic basis expansion for the original dataset. Plug it into your least-squares solver from the previous exercise. Check that you correctly find w = (-0.721, 0.406, 1.37) and the residual error R = 0.0632.

If we are in a strange mood, we could even fit a function of the form

    y = a \sin(2x) + b \log(x) + c \sqrt{x}

by simply using the basis expansion

    z_i = \begin{pmatrix} \sin(2 x_i) \\ \log(x_i) \\ \sqrt{x_i} \end{pmatrix}.

Exercise 17. Implement this last basis expansion. Check that you correctly find w = (-1.44, 0.134, 2.41) and the residual error R = 0.0672.

The output of the above two exercises will look like:

[Figure: two panels, labeled "Exercise 16" and "Exercise 17", showing the data together with the fitted quadratic and sin/log/sqrt curves.]
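A sketch of how these expansions might be wired up to the earlier solver; expand is a hypothetical helper of our own, and the code reuses x, y, and least_squares from the previous sketches:

```python
def expand(x, features):
    """Basis expansion: map each scalar x_i to z_i = (f_1(x_i), ..., f_s(x_i))."""
    return [np.array([f(xi) for f in features]) for xi in x]

# Eq. 2.2, the affine basis (Exercise 15).
Z_affine = expand(x, [lambda t: t, lambda t: 1.0])
# The quadratic basis (Exercise 16).
Z_quad = expand(x, [lambda t: t, lambda t: 1.0, lambda t: t ** 2])
# The "strange mood" basis (Exercise 17).
Z_crazy = expand(x, [lambda t: np.sin(2 * t), np.log, np.sqrt])

w, R = least_squares(Z_affine, y)
print(w, R)  # roughly (0.765, 0.100) with R = 0.1128, matching Exercise 15
```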
2.4.1 Matrix notation

Often, when working with a dataset of a bunch of vectors z_1, z_2, ..., z_M, it is convenient to put them together into a single matrix Z by simply placing the vectors next to each other:

    Z = ( z_1  z_2  \cdots  z_M ).    (2.3)

It is not hard to see that

Theorem 18. As defined in Eq. 2.3,

    Z y = \sum_i z_i y_i.

A similar, but somewhat less obvious, result is that

Theorem 19. As defined in Eq. 2.3,

    Z Z^T = \sum_i z_i z_i^T.

All this means that we can write w in the more compact notation

    Z Z^T w = Z y.

Exercise 20. Write a program that takes inputs Z and y and finds the least-squares fit. (Use your previous routine to solve the linear system.) Check that, on an input of

    Z = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & 3 & 4 & 5 \end{pmatrix},  y = (1, 2, 4, 5),

it correctly returns w = (-1.9, 1.4).
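A minimal sketch of this matrix form (again substituting np.linalg.solve for your own linear-system routine; the function name is ours), checked against Exercise 20's data:

```python
def least_squares_matrix(Z, y):
    """Matrix form: solve (Z Z^T) w = Z y, with one data vector per column of Z."""
    return np.linalg.solve(Z @ Z.T, Z @ y)

Z20 = np.array([[1.0, 1.0, 1.0, 1.0],
                [2.0, 3.0, 4.0, 5.0]])
y20 = np.array([1.0, 2.0, 4.0, 5.0])
print(least_squares_matrix(Z20, y20))  # roughly (-1.9, 1.4)
```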
2.4.2 Linear or nonlinear?

Notice that, with a few of the above examples, we are actually fitting the output y using (sometimes highly nonlinear) functions of the input x. However, once the data are given, the function is linear with respect to the unknown coefficients/weights w, and the optimal value of w is obtained with a simple least-squares. This type of technique is known as linear regression. More formally, "linear" regression refers to the regression function's dependence on the regression coefficients w, not on the input data x.

The crucial point here is that it is easy to do nonlinear stuff using linear regression with

    y = w^T z,  where  z = \begin{pmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_s(x) \end{pmatrix}

and the f_j(x) can be of various forms, as shown in the previous examples, and can be thought of as features extracted from the given data x.

In summary, clever hand-engineered features plus linear regression (or linear classification) capture a huge percentage of how machine learning is done in the real world. The advantages of linear methods like this are:

- Simplicity
- Reliability
- Interpretability
- Speed

There are also some disadvantages:

- Limited modeling power. Imposing a simple linear form y = w^T z is a huge assumption, and some datasets simply don't have such a relationship between inputs and outputs.
- You need to find good features. The previous issue can be reduced by coming up with good features, but this process is not automated.
Fancier machine learning methods try to adapt to the structure of the function without the user specifying it. Some methods, such as neural networks, can be understood as finding the features at the same time they fit w. There is recent work on discovering such features in an unsupervised way.