Lecture 4. Hw 1 and 2 will be reopened after class for everybody. New deadline 4/20. Hw 3 and 4 online (Nima is lead)

Kevin Payne
5 years ago

1 Lecture 4. Homework: Hw 1 and 2 will be reopened after class for everybody. New deadline 4/20. Hw 3 and 4 online (Nima is lead). Podcast lecture online. Final projects: Nima will register groups next week. /tell Nima. Give proposal in week 5. See last year's topics on the webpage. Choose your own. Linear regression. SVD. Next lectures: I posted a rough plan. It is flexible though, so please come with suggestions.

2 Projects. 3-4 person groups. Deliverables: Poster & Report & main code (plus proposal, midterm slide). Topics: your own or choose from suggested topics. Week 3: groups due to TA Nima (if you don't have a group, ask in week 2 and we can help). Week 5: proposal due. TAs and Peter can approve. Proposal: one page: title, a large paragraph, data, weblinks, references. Something physical. Week ~7: Midterm slides? Likely presented to a subgroup of class. Week 10/11 (likely 5pm 6 June, Jacobs Hall lobby): final poster session? Report due Saturday 16 June.

3 Mark's Probability and Data homework

4 Mark's Probability and Data homework

5 Linear regression: Linear Basis Function Models (1). Generally $y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)$, where $\phi_j(x)$ are known as basis functions. Typically, $\phi_0(x) = 1$, so that $w_0$ acts as a bias. Simplest case is linear basis functions: $\phi_d(x) = x_d$.

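6 Maximum Likelihood and Least Squares (3). Computing the gradient and setting it to zero yields $\sum_{n=1}^{N} \{t_n - w^T \phi(x_n)\} \phi(x_n)^T = 0$. Solving for $w$, $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t = \Phi^{\dagger} t$, where $\Phi$ is the design matrix with elements $\Phi_{nj} = \phi_j(x_n)$ and $\Phi^{\dagger} = (\Phi^T \Phi)^{-1} \Phi^T$ is the Moore-Penrose pseudo-inverse.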

7 Least mean squares: An alternative approach for big datasets. $w^{\tau+1} = w^{\tau} - \eta \nabla E_n(\tau)$, where $w^{\tau+1}$ is the weights after seeing training case $\tau+1$, $\eta$ is the learning rate, and $\nabla E_n(\tau)$ is the vector of squared error derivatives w.r.t. the weights on the training case at time $\tau$. This is on-line learning. It is efficient if the dataset is redundant and simple to implement. It is called stochastic gradient descent if the training cases are picked randomly. Care must be taken with the learning rate to prevent divergent oscillations. The rate must decrease with $\tau$ to get a good fit.
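The update rule on this slide can be sketched in a few lines of NumPy. This is a minimal illustration rather than the course's reference code: the function name `lms_fit`, the initial rate `eta0`, the number of epochs, and the particular decay schedule are all assumptions chosen for the example.

```python
import numpy as np

def lms_fit(X, t, eta0=0.1, epochs=50, seed=0):
    """Least-mean-squares fit of a linear model, one training case at a time.

    X is the (N, M) design matrix and t the (N,) target vector.
    eta0, epochs and the decay schedule are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape
    w = np.zeros(M)
    tau = 0
    for _ in range(epochs):
        for n in rng.permutation(N):          # random order: stochastic gradient descent
            eta = eta0 / (1.0 + 0.01 * tau)   # learning rate decreases with tau
            grad = (X[n] @ w - t[n]) * X[n]   # gradient of 0.5 * (y_n - t_n)^2 w.r.t. w
            w = w - eta * grad
            tau += 1
    return w

# Synthetic data: t = -0.3 + 0.5 x + small Gaussian noise
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])     # phi_0(x) = 1 supplies the bias w_0
t = -0.3 + 0.5 * x + 0.05 * rng.standard_normal(200)

w_lms = lms_fit(X, t)
w_ls = np.linalg.lstsq(X, t, rcond=None)[0]   # batch least-squares reference
```

With a decaying rate the on-line estimate `w_lms` should land close to the batch solution `w_ls`; with a rate that is too large the iterates oscillate or diverge, which is the caution the slide raises.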

8 Regularized least squares. $\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{\lambda}{2} \|w\|^2$. The squared weights penalty is mathematically compatible with the squared error function, giving a closed form for the optimal weights: $w^* = (\lambda I + X^T X)^{-1} X^T t$, where $I$ is the identity matrix.
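The closed form $w^* = (\lambda I + X^T X)^{-1} X^T t$ translates directly into code. The sketch below uses synthetic data and a hypothetical helper name `ridge_fit`; it solves the linear system rather than forming the inverse, which is a common numerical choice and not something the slide prescribes.

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Regularized least squares: solve (lam * I + X^T X) w = X^T t."""
    M = X.shape[1]
    return np.linalg.solve(lam * np.eye(M) + X.T @ X, X.T @ t)

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
t = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(30)

w_ls = ridge_fit(X, t, lam=0.0)       # lam = 0 recovers ordinary least squares
w_ridge = ridge_fit(X, t, lam=10.0)   # larger lam shrinks the weights

# At the optimum the gradient of the regularized cost vanishes
grad_at_opt = X.T @ (X @ w_ridge - t) + 10.0 * w_ridge
```

Comparing `np.linalg.norm(w_ridge)` with `np.linalg.norm(w_ls)` shows the shrinkage effect of the penalty.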

9 A picture of the effect of the regularizer. The overall cost function is the sum of two parabolic bowls. The sum is also a parabolic bowl. The combined minimum lies on the line between the minimum of the squared error and the origin. The L2 regularizer just shrinks the weights.

10 Other regularizers. We do not need to use the squared error, provided we are willing to do more computation. Other powers of the weights can be used.

11 The lasso: penalizing the absolute values of the weights. $\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \lambda \sum_i |w_i|$. Finding the minimum requires quadratic programming, but it is still unique because the cost function is convex (a bowl plus an inverted pyramid). As lambda increases, many weights go to exactly zero. This is great for interpretation, and it also prevents overfitting.
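To see weights reach exactly zero, the lasso cost can be minimized with a simple proximal-gradient iteration (ISTA). Note that this is a swapped-in solver, not the quadratic-programming route the slide mentions; the function names, iteration count and synthetic data are assumptions made for the sketch.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Shrink each entry towards zero by thresh, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lasso_ista(X, t, lam, n_iter=2000):
    """Minimize 0.5 * ||X w - t||^2 + lam * sum_i |w_i| by proximal gradient."""
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - t)             # gradient of the squared-error term
        w = soft_threshold(w - grad / L, lam / L)
    return w

# Synthetic data with a sparse true weight vector (illustrative only)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
t = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.standard_normal(50)

lam_max = np.max(np.abs(X.T @ t))            # above this value every weight is zero
w_small = lasso_ista(X, t, lam=1e-6)         # essentially least squares
w_mid = lasso_ista(X, t, lam=20.0)           # typically zeroes the irrelevant weights
w_zero = lasso_ista(X, t, lam=1.1 * lam_max)
```

Printing `w_mid` for a few values of `lam` shows weights dropping to exactly zero one after another as the penalty grows, which is the behavior the slide describes.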

12 Geometrical view of the lasso compared with a penalty on the squared weights. Notice $w_1 = 0$ at the optimum.

13 Minimizing the absolute error. $\min_w \sum_n |t_n - w^T x_n|$. This minimization involves solving a linear programming problem. It corresponds to maximum likelihood estimation if the output noise is modeled by a Laplacian instead of a Gaussian: $p(t_n \mid y_n) = \frac{a}{2} e^{-a |t_n - y_n|}$, so $-\log p(t_n \mid y_n) = a |t_n - y_n| + \text{const}$.

14 The bias-variance trade-off (a figment of the frequentists' lack of imagination?). Imagine a training set drawn at random from a whole set of training sets. The squared loss can be decomposed into a bias term and a variance term. Bias = systematic error in the model's estimates. Variance = noise in the estimates caused by sampling noise in the training set. There is also additional loss due to noisy target values. We eliminate this extra, irreducible loss from the math by using the average target values (i.e. the unknown, noise-free values).

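15 The bias-variance decomposition. $\langle \{y(x_n; D) - t_n\}^2 \rangle_D = \{\langle y(x_n; D) \rangle_D - t_n\}^2 + \langle \{y(x_n; D) - \langle y(x_n; D) \rangle_D\}^2 \rangle_D$, where $y(x_n; D)$ is the model estimate for test case $n$ trained on dataset $D$, $t_n$ is the average target value for test case $n$, and $\langle \cdot \rangle_D$ means expectation over $D$. Bias term (first term on the right): the squared error of the average, over training datasets $D$, of the estimates, i.e. the average difference between prediction and desired value. Variance term (second term on the right): the variance, over training datasets $D$, of the model estimate.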
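The decomposition can be checked numerically by training the same model on many independently drawn datasets. The sketch below assumes a noisy sinusoid as the data source, a degree-5 polynomial model fitted with the regularized least-squares closed form, and two arbitrary values of $\lambda$; none of these choices come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, degree=5):
    """Polynomial basis: columns 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Phi, t, lam):
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

x_test = np.linspace(0.0, 1.0, 50)
t_clean = np.sin(2 * np.pi * x_test)          # noise-free (average) targets t_n

def decompose(lam, n_datasets=200, n_points=25, noise=0.3):
    """Return (bias^2, variance, total squared loss), averaged over test cases."""
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):               # one row of preds per training set D
        x = rng.uniform(0.0, 1.0, n_points)
        t = np.sin(2 * np.pi * x) + noise * rng.standard_normal(n_points)
        preds[d] = features(x_test) @ ridge_fit(features(x), t, lam)
    mean_pred = preds.mean(axis=0)            # <y(x_n; D)>_D
    bias2 = np.mean((mean_pred - t_clean) ** 2)
    variance = np.mean((preds - mean_pred) ** 2)
    total = np.mean((preds - t_clean) ** 2)   # <{y(x_n; D) - t_n}^2>_D
    return bias2, variance, total

results = {lam: decompose(lam) for lam in (1e-3, 10.0)}
for lam, (bias2, variance, total) in results.items():
    print(f"lambda={lam:g}: bias^2={bias2:.4f} variance={variance:.4f} total={total:.4f}")
```

For each $\lambda$ the printed total equals bias² plus variance up to rounding, and the heavily regularized model trades lower variance for higher bias.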

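16 Regularization parameter affects the bias and variance terms. [Figure: 20 realizations and their average compared with the true model, for $\lambda = e^{2.6}$, $\lambda = e^{-0.31}$ and $\lambda = e^{-2.4}$; the panels range from low variance and high bias to high variance and low bias.]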

17 An example of the bias-variance trade-off

18 Beating the bias-variance trade-off. Reduce the variance term by averaging lots of models trained on different datasets. This seems silly: if we have lots of different datasets, it is better to combine them into one big training set. With more training data there will be much less variance. Weird idea: we can create different datasets by bootstrap sampling of our single training dataset. This is called bagging and it works surprisingly well. If we have enough computation, it is better to do it the Bayesian way: combine the predictions of many models using the posterior probability of each parameter vector as the combination weight.
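Bagging as described here fits in a short function: resample the single training set with replacement, fit a model to each resample, and average the predictions. The polynomial model, its degree, the number of bootstrap models and the synthetic data below are assumptions made for this sketch.

```python
import numpy as np

def fit_poly(x, t, degree):
    """Least-squares polynomial fit; returns the weight vector."""
    Phi = np.vander(x, degree + 1, increasing=True)
    return np.linalg.lstsq(Phi, t, rcond=None)[0]

def predict_poly(w, x):
    return np.vander(x, w.size, increasing=True) @ w

def bagged_predict(x, t, x_new, degree=5, n_models=50, seed=0):
    """Average the predictions of models trained on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, x.size, size=x.size)   # N cases drawn with replacement
        w = fit_poly(x[idx], t[idx], degree)
        preds.append(predict_poly(w, x_new))
    return np.mean(preds, axis=0)

# Synthetic noisy sinusoid (illustrative only)
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(30)
x_new = np.linspace(0.1, 0.9, 20)

y_single = predict_poly(fit_poly(x, t, 5), x_new)    # one model on the full set
y_bagged = bagged_predict(x, t, x_new)               # average over bootstrap models
```

Plotting `y_single` and `y_bagged` against `np.sin(2 * np.pi * x_new)` is a quick way to see how much the averaging smooths the fit for a given dataset.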

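19 Bayesian Linear Regression (1). Define a conjugate prior over $w$: $p(w) = \mathcal{N}(w \mid m_0, S_0)$. Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior $p(w \mid \mathbf{t}) = \mathcal{N}(w \mid m_N, S_N)$, where $m_N = S_N (S_0^{-1} m_0 + \beta \Phi^T \mathbf{t})$ and $S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi$. A common simpler prior is $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$, which gives $m_N = \beta S_N \Phi^T \mathbf{t}$ and $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$.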

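20 From lecture 3: Bayes for linear model. $y = Ax + n$, $n \sim \mathcal{N}(0, \Sigma_n)$, $y \sim \mathcal{N}(Ax, \Sigma_n)$; prior: $x \sim \mathcal{N}(0, \Sigma_x)$. $p(x \mid y) \sim p(y \mid x)\, p(x) \sim \mathcal{N}(x_p, \Sigma_p)$, with mean $x_p = \Sigma_p A^T \Sigma_n^{-1} y$ and $\Sigma_p^{-1} = A^T \Sigma_n^{-1} A + \Sigma_x^{-1}$.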

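21 Interpretation of solution. Draw it. Sequential, conjugate prior. $p(x \mid y) \sim p(y \mid x)\, p(x) \sim \mathcal{N}(Ax, \Sigma_n)\, \mathcal{N}(0, \Sigma_x) \sim \mathcal{N}(x_p, \Sigma_p)$. Covariance: $\Sigma_p^{-1} = A^T \Sigma_n^{-1} A + \Sigma_x^{-1}$.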
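For a linear model $y = Ax + n$ with Gaussian noise covariance $\Sigma_n$ and a zero-mean Gaussian prior with covariance $\Sigma_x$, the posterior mean and covariance are $x_p = \Sigma_p A^T \Sigma_n^{-1} y$ and $\Sigma_p^{-1} = A^T \Sigma_n^{-1} A + \Sigma_x^{-1}$. The sketch below computes both on synthetic data; the function name `gaussian_posterior` and the isotropic covariances are assumptions for illustration.

```python
import numpy as np

def gaussian_posterior(A, y, Sigma_n, Sigma_x):
    """Posterior N(x_p, Sigma_p) for y = A x + n with a zero-mean Gaussian prior."""
    Sn_inv = np.linalg.inv(Sigma_n)
    Sigma_p = np.linalg.inv(A.T @ Sn_inv @ A + np.linalg.inv(Sigma_x))
    x_p = Sigma_p @ A.T @ Sn_inv @ y
    return x_p, Sigma_p

# Synthetic problem (illustrative values)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
x_true = np.array([1.0, -0.5, 2.0])
sigma, tau = 0.1, 1.0                        # noise and prior standard deviations
y = A @ x_true + sigma * rng.standard_normal(20)

x_p, Sigma_p = gaussian_posterior(A, y, sigma**2 * np.eye(20), tau**2 * np.eye(3))

# With isotropic covariances the posterior mean is the regularized
# least-squares solution with lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
x_ridge = np.linalg.solve(lam * np.eye(3) + A.T @ A, A.T @ y)
```

The agreement between `x_p` and `x_ridge` is one way to read the solution: the Gaussian prior plays the role of the squared-weights penalty.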

22 Likelihood, prior/posterior. Bishop Fig 3.7. $y = w_0 + w_1 x + \mathcal{N}(0, 0.2)$. Data generated with $w_0 = -0.3$, $w_1 = 0.5$. With no data we sample lines from the prior. With 20 data points, the prior has little effect.

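23 Predictive Distribution. Predict $t$ for new values of $x$ by integrating over $w$ (giving the marginal distribution of $t$): $p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid w, \beta)\, p(w \mid \mathbf{t}, \alpha, \beta)\, dw = \mathcal{N}(t \mid m_N^T \phi(x), \sigma_N^2(x))$, where $\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)$. Here $\mathbf{t}$ is the training data, $\alpha$ is the precision of the prior, and $\beta$ is the precision of the output noise.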
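A minimal sketch of the predictive distribution under the simple zero-mean isotropic prior, using the standard results $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$, $m_N = \beta S_N \Phi^T \mathbf{t}$ and $\sigma_N^2(x) = 1/\beta + \phi(x)^T S_N \phi(x)$. The Gaussian basis functions, their width, the values of $\alpha$ and $\beta$, and the synthetic sinusoidal data are assumptions chosen for illustration.

```python
import numpy as np

def rbf_features(x, centers, width):
    """Bias column plus Gaussian (radial) basis functions."""
    g = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.column_stack([np.ones_like(x), g])

def posterior(Phi, t, alpha, beta):
    """Posterior mean m_N and covariance S_N of the weights."""
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(Phi_new, m_N, S_N, beta):
    """Predictive mean and variance at each row of Phi_new."""
    mean = Phi_new @ m_N
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, S_N, Phi_new)
    return mean, var

# Noisy sinusoidal data (illustrative values)
rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                       # prior and noise precisions
x_train = rng.uniform(0.0, 1.0, 30)
t_train = np.sin(2 * np.pi * x_train) + rng.standard_normal(30) / np.sqrt(beta)

centers = np.linspace(0.0, 1.0, 9)            # 9 radial basis functions
m_N, S_N = posterior(rbf_features(x_train, centers, 0.15), t_train, alpha, beta)

x_new = np.linspace(0.0, 1.0, 11)
mean, var = predictive(rbf_features(x_new, centers, 0.15), m_N, S_N, beta)
```

The predictive variance never falls below the noise level $1/\beta$; the second term reflects the remaining uncertainty in $w$ and is largest where training points are sparse.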

24 Just use ML solution. Prior predictive.

25 Predictive distribution for noisy sinusoidal data modeled by linearly combining 9 radial basis functions.

26 A way to see the covariance of predictions for different values of x. We sample models at random from the posterior and show the mean of each model's predictions.

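27 Equivalent Kernel (Bishop). The predictive mean can be written $y(x, m_N) = m_N^T \phi(x) = \beta \phi(x)^T S_N \Phi^T \mathbf{t} = \sum_{n=1}^{N} k(x, x_n)\, t_n$, with $k(x, x') = \beta \phi(x)^T S_N \phi(x')$. This is a weighted sum of the training data target values, $t_n$. $k$ is the equivalent kernel or smoother matrix.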

28 Equivalent Kernel (2). The weight of $t_n$ depends on the distance between $x$ and $x_n$; nearby $x_n$ carry more weight.
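The equivalence between the weight-space prediction $m_N^T \phi(x)$ and the kernel-weighted sum $\sum_n k(x, x_n) t_n$, with $k(x, x') = \beta \phi(x)^T S_N \phi(x')$, can be verified numerically. The basis functions, hyperparameters and data in this sketch are illustrative assumptions.

```python
import numpy as np

def rbf_features(x, centers, width):
    """Bias column plus Gaussian (radial) basis functions."""
    g = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.column_stack([np.ones_like(x), g])

# Synthetic training data (illustrative values)
rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
x_train = np.sort(rng.uniform(0.0, 1.0, 30))
t_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(30)

centers = np.linspace(0.0, 1.0, 9)
Phi = rbf_features(x_train, centers, 0.15)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

x_new = np.linspace(0.0, 1.0, 5)
Phi_new = rbf_features(x_new, centers, 0.15)

K = beta * Phi_new @ S_N @ Phi.T             # K[i, n] = k(x_new[i], x_train[n])
mean_from_kernel = K @ t_train               # sum_n k(x, x_n) t_n
mean_from_weights = Phi_new @ m_N            # m_N^T phi(x)

# Training point that receives the largest weight for each new input
nearest = x_train[np.argmax(K, axis=1)]
```

Each row of `K` lists the weights given to the training targets for one new input; inspecting `nearest` against `x_new` is a quick way to check that the largest weights tend to sit on nearby training points.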

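29 Equivalent Kernel (4). The kernel as a covariance function: consider $\mathrm{cov}[y(x), y(x')] = \mathrm{cov}[\phi(x)^T w, w^T \phi(x')] = \phi(x)^T S_N \phi(x') = \beta^{-1} k(x, x')$. We can avoid the use of basis functions and define the kernel function directly, leading to Gaussian Processes (Chapter 6). No need to determine weights. Like all kernel functions, the equivalent kernel can be expressed as an inner product: $k(x, z) = \psi(x)^T \psi(z)$, where $\psi(x) = \beta^{1/2} S_N^{1/2} \phi(x)$.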

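30 $y = Ax$. SVD: $A = U S V^T = [u_1, \ldots, u_n]\, S\, [v_1, \ldots, v_m]^T$. $A A^T = U S S^T U^T$. $A^T A = V S^T S V^T$. $x_{LS} = (A^T A)^{-1} A^T y = V (S^T S)^{-1} S^T U^T y$. Regularized: $x = (\lambda I + A^T A)^{-1} A^T y = V (\lambda I + S^T S)^{-1} S^T U^T y$.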
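The SVD expressions for the least-squares and regularized solutions of $y = Ax$ can be checked against the normal equations. This sketch uses NumPy's thin SVD, so `s` holds the singular values as a vector; the matrix sizes, the random data and the value of `lam` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))              # tall matrix with full column rank
y = rng.standard_normal(8)
lam = 0.5

# Thin SVD: A = U @ diag(s) @ Vt, with U of shape (8, 3) and Vt of shape (3, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Least squares: x = V diag(1 / s) U^T y
x_ls_svd = Vt.T @ ((U.T @ y) / s)
x_ls_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Regularized: x = V diag(s / (s^2 + lam)) U^T y
x_reg_svd = Vt.T @ (s / (s**2 + lam) * (U.T @ y))
x_reg_normal = np.linalg.solve(lam * np.eye(3) + A.T @ A, A.T @ y)

# A^T A has eigenvalues s^2 and eigenvectors given by the columns of V
AtA_from_svd = Vt.T @ np.diag(s**2) @ Vt
```

The factor `s / (s**2 + lam)` shows what the regularizer does in the singular-value basis: directions with small singular values are damped instead of being amplified by `1 / s`.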
