
REGULARIZATION
David Kauchak, CS 158, Fall 2016

Admin
- Assignment 5! Starter code is available.

Schedule
- Midterm next week, due Friday (more on this in 1 min)
- Assignment 6 due Friday before fall break

Midterm
- Download from the course web page when you're ready to take it (available by end of day Monday)
- 2 hours to complete
- Must hand in (or email in) by 11:59pm Friday, Oct. 7
- Can use: class notes, your notes, the book, your assignments, and Wikipedia
- You may not use: your neighbor, anything else on the web, etc.

What can be covered
- Anything we've talked about in class
- Anything in the reading (these are not necessarily the same things)
- Anything we've covered in the assignments

Midterm topics
- Machine learning basics: different types of learning problems; feature-based machine learning; data assumptions/data generating distribution
- Classification problem setup
- Proper experimentation: train/dev/test; evaluation/accuracy/training error; optimizing hyperparameters

Midterm topics
- Learning algorithms: decision trees, k-NN, perceptron, gradient descent
- Algorithm properties: training/learning; rationale/why it works; classifying; hyperparameters; avoiding overfitting; algorithm variants/improvements

Midterm topics
- Geometric view of data: distances between examples; decision boundaries
- Features: example features; removing erroneous features/picking good features; challenges with high-dimensional data; feature normalization
- Other pre-processing: outlier detection

Midterm topics
- Comparing algorithms: n-fold cross validation; leave-one-out validation; bootstrap resampling; t-test
- Imbalanced data: evaluation; precision/recall, F1, AUC; subsampling; oversampling; weighted binary classifiers

Midterm topics
- Multiclass classification: modifying existing approaches; using binary classifiers (OVA, AVA, tree-based); micro- vs. macro-averaging
- Ranking: using a binary classifier; using a weighted binary classifier; evaluation

Midterm topics
- Gradient descent: 0/1 loss; surrogate loss functions; convexity; minimization algorithm; regularization; different regularizers; p-norms
- Misc: good coding habits; JavaDoc

Midterm general advice
- 2 hours goes by fast!
- Don't plan on looking everything up; look up equations, algorithms, random details
- Make sure you understand the key concepts
- Don't spend too much time on any one question; skip questions you're stuck on and come back to them
- Watch the time as you go
- Be careful on the T/F questions
- For written questions: think before you write; make your argument/analysis clear and concise

How many have you heard of?
- (Ordinary) least squares
- Ridge regression
- Lasso regression
- Elastic regression
- Logistic regression

Model-based machine learning
1. pick a model: $0 = b + \sum_{j=1}^m w_j f_j$
2. pick a criterion to optimize (aka objective function): $\sum_{i=1}^n 1[y_i(w \cdot x_i + b) \le 0]$
3. develop a learning algorithm: $\operatorname{argmin}_{w,b} \sum_{i=1}^n 1[y_i(w \cdot x_i + b) \le 0]$
Find w and b that minimize the 0/1 loss.

Model-based machine learning
Since the 0/1 loss is hard to minimize directly, use a convex surrogate loss function:
$\operatorname{argmin}_{w,b} \sum_{i=1}^n \text{loss}(y_i, y_i')$
Find w and b that minimize the surrogate loss.

Surrogate loss functions
- 0/1 loss: $\ell(y, y') = 1[yy' \le 0]$
- Hinge: $\ell(y, y') = \max(0, 1 - yy')$
- Exponential: $\ell(y, y') = \exp(-yy')$
- Squared loss: $\ell(y, y') = (y - y')^2$
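
To make the surrogate losses concrete, here is a minimal sketch (not from the slides; the function names are ours) that evaluates all four at a few raw predictions $y' = w \cdot x + b$ for a positive example:

```python
import numpy as np

def zero_one(y, yp):
    return float(y * yp <= 0)          # 1[yy' <= 0]

def hinge(y, yp):
    return max(0.0, 1.0 - y * yp)      # max(0, 1 - yy')

def exponential(y, yp):
    return float(np.exp(-y * yp))      # exp(-yy')

def squared(y, yp):
    return (y - yp) ** 2               # (y - y')^2

# y is the true label in {-1, +1}; yp is the raw prediction w.x + b
for yp in [-2.0, -0.5, 0.5, 2.0]:
    print(yp, zero_one(1, yp), hinge(1, yp), exponential(1, yp), squared(1, yp))
```

Note how the 0/1 loss jumps from 1 to 0 at the decision boundary while the other three decrease smoothly, which is exactly what makes them optimizable.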

Finding the minimum
Gradient descent:
- pick a starting point (w)
- repeat until the loss doesn't decrease in any dimension:
  - pick a dimension
  - move a small amount in that dimension towards decreasing loss (using the derivative): $w_j = w_j - \eta \frac{d}{dw_j}\text{loss}(w)$

Intuition: you're blindfolded, but you can see out of the bottom of the blindfold to the ground right by your feet. I drop you off somewhere and tell you that you're in a convex-shaped valley and escape is at the bottom/minimum. How do you get out?

Perceptron learning algorithm
- repeat until convergence (or for some # of iterations):
  - for each training example (f_1, f_2, ..., f_m, label):
    - prediction = $b + \sum_{j=1}^m w_j f_j$
    - if prediction * label <= 0:  // they don't agree
      - for each w_j: w_j = w_j + f_j * label
      - b = b + label

Note: for gradient descent (with the exponential loss), we always update:
$w_j = w_j + \eta y_i x_{ij} \exp(-y_i(w \cdot x_i + b))$, i.e. $w_j = w_j + x_{ij} y_i c$ where $c = \eta \exp(-y_i(w \cdot x_i + b))$
The constant c combines the learning rate $\eta$ with $\exp(-\text{label} \cdot \text{prediction})$, a measure of how far from wrong we are. When is this large/small?
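
A runnable sketch of the perceptron update rule exactly as written above (the toy data set is invented for illustration):

```python
import numpy as np

def perceptron(X, y, iterations=100):
    """Perceptron learning algorithm from the slide: update w and b
    only on examples where prediction * label <= 0."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(iterations):
        for i in range(n):
            prediction = b + X[i] @ w
            if prediction * y[i] <= 0:      # they don't agree
                w += y[i] * X[i]            # w_j = w_j + f_j * label
                b += y[i]                   # b = b + label
    return w, b

# toy linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```

The gradient-descent variant differs only in the if-statement: it applies the (scaled) update on every example, not just the misclassified ones.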

The constant
$c = \eta \exp(-\text{label} \cdot \text{prediction})$
- If the label and the prediction have the same sign, then as the prediction gets larger the update gets smaller
- If they're different, the more different they are, the bigger the update

One concern
$\operatorname{argmin}_{w,b} \sum_{i=1}^n \exp(-y_i(w \cdot x_i + b))$
We're calculating this on the training set. We still need to be careful about overfitting! The minimum over w, b on the training set is generally NOT the minimum for the test set. How did we deal with this for the perceptron algorithm?

Overfitting revisited: regularization
- A regularizer is an additional criterion added to the loss function to make sure that we don't overfit
- It's called a regularizer since it tries to keep the parameters more normal/regular
- It is a bias on the model that forces the learning to prefer certain types of weights over others
$\operatorname{argmin}_{w,b} \sum \text{loss}(yy') + \lambda\, \text{regularizer}(w, b)$

Regularizers
$0 = b + \sum_{j=1}^m w_j f_j$
Should we allow all possible weights? Any preferences? What makes for a simpler model, for a linear model?
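
One way to read the regularized argmin above: the quantity being minimized is just a function of (w, b) that we can write down and evaluate. A sketch (our own scaffolding, with the hinge loss and a sum-of-squared-weights regularizer plugged in purely as examples):

```python
import numpy as np

def objective(w, b, X, y, lam):
    """Sum of per-example surrogate losses + lambda * regularizer(w).
    Hinge loss and squared weights chosen here for illustration."""
    preds = X @ w + b
    loss = np.sum(np.maximum(0.0, 1.0 - y * preds))
    regularizer = np.sum(w ** 2)
    return loss + lam * regularizer

X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, -1])
print(objective(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))
```

Lambda trades off the two terms: large lambda prefers small weights even at the cost of some training loss.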

Regularizers
Generally, we don't want huge weights:
- If weights are large, a small change in a feature can result in a large change in the prediction
- Large weights also give too much influence to any one feature
- We might also prefer weights of 0 for features that aren't useful
How do we encourage small weights, or penalize large weights?
$\operatorname{argmin}_{w,b} \sum \text{loss}(yy') + \lambda\, \text{regularizer}(w, b)$

Common regularizers
- sum of the weights: $r(w, b) = \sum_j |w_j|$
- sum of the squared weights: $r(w, b) = \sqrt{\sum_j |w_j|^2}$
What's the difference between these? Squared weights penalize large values more; the sum of the weights penalizes small values more.
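
A quick numeric illustration of that difference (our own example): the same total weight, spread over many small weights versus concentrated in one large weight.

```python
import numpy as np

def r_sum(w):      # sum of the weights (absolute values)
    return np.sum(np.abs(w))

def r_sq(w):       # sum of the squared weights (then square root)
    return np.sqrt(np.sum(w ** 2))

spread = np.array([0.5] * 8)   # eight small weights
single = np.array([4.0])       # one large weight, same total mass
print(r_sum(spread), r_sq(spread))   # 4.0  ~1.414
print(r_sum(single), r_sq(single))   # 4.0  4.0
```

The sum-of-weights regularizer can't tell the two apart, while the squared version charges the single large weight almost three times as much as the spread-out weights.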

p-norm
- sum of the weights (1-norm): $r(w, b) = \sum_j |w_j|$
- sum of the squared weights (2-norm): $r(w, b) = \sqrt{\sum_j |w_j|^2}$
- p-norm: $r(w, b) = \left(\sum_j |w_j|^p\right)^{1/p} = \|w\|_p$
Smaller values of p (p < 2) encourage sparser vectors; larger values of p discourage large weights more.

p-norms visualized
[Figure: contours in the (w_1, w_2) plane where the penalty $\|w\|_p = 1$, for several values of p, with a table of the largest allowed $w_2$ when $w_1 = 0.5$.]
- all p-norms penalize larger weights
- p < 2 tends to create sparse solutions (i.e. lots of 0 weights)
- p > 2 tends to prefer similar weights

Model-based machine learning
1. pick a model: $0 = b + \sum_{j=1}^m w_j f_j$
2. pick a criterion to optimize (aka objective function): $\sum \text{loss}(yy') + \lambda\, \text{regularizer}(w)$
3. develop a learning algorithm: $\operatorname{argmin}_{w,b} \sum \text{loss}(yy') + \lambda\, \text{regularizer}(w)$
Find w and b that minimize the regularized loss.
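
The penalty-equals-1 contours can be reproduced numerically. A sketch (values computed here, not copied from the slide's table): fixing $\|w\|_p = 1$ and $w_1 = 0.5$, solve for the largest allowed $w_2$.

```python
def max_w2(p, w1=0.5, penalty=1.0):
    # solve |w1|^p + |w2|^p = penalty^p for w2
    return (penalty ** p - abs(w1) ** p) ** (1.0 / p)

for p in [1, 1.5, 2, 3, 10]:
    print(p, round(max_w2(p), 3))
# p = 1: 0.5, p = 2: 0.866, p = 10: ~0.9999
# under the 1-norm, w1 and w2 compete for the same budget, pushing one toward 0;
# under large p, w2 can stay near 1 even with w1 nonzero (similar weights are cheap)
```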

Minimizing with a regularizer
We know how to solve convex minimization problems using gradient descent:
$\operatorname{argmin}_{w,b} \sum \text{loss}(yy')$
If we can ensure that the loss plus the regularizer is convex, then we can still use gradient descent:
$\operatorname{argmin}_{w,b} \sum \text{loss}(yy') + \lambda\, \text{regularizer}(w)$

Convexity revisited
One definition: the line segment between any two points on the function lies above the function. Mathematically, f is convex if for all $x_1, x_2$ and $0 < t < 1$:
$f(tx_1 + (1-t)x_2) \le t f(x_1) + (1-t) f(x_2)$
The left side is the value of the function at some point between $x_1$ and $x_2$; the right side is the value at the corresponding point on the line segment between $x_1$ and $x_2$.

Adding convex functions
Claim: if f and g are convex functions, then so is z = f + g.
To prove: $z(tx_1 + (1-t)x_2) \le t z(x_1) + (1-t) z(x_2)$ for $0 < t < 1$.

By definition of the sum of two functions:
$z(tx_1 + (1-t)x_2) = f(tx_1 + (1-t)x_2) + g(tx_1 + (1-t)x_2)$
$t z(x_1) + (1-t) z(x_2) = t f(x_1) + (1-t) f(x_2) + t g(x_1) + (1-t) g(x_2)$
Then, given that
$f(tx_1 + (1-t)x_2) \le t f(x_1) + (1-t) f(x_2)$
we know
$g(tx_1 + (1-t)x_2) \le t g(x_1) + (1-t) g(x_2)$ as well, and adding the two inequalities gives
$f(tx_1 + (1-t)x_2) + g(tx_1 + (1-t)x_2) \le t f(x_1) + (1-t) f(x_2) + t g(x_1) + (1-t) g(x_2)$
So: $z(tx_1 + (1-t)x_2) \le t z(x_1) + (1-t) z(x_2)$.
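
The proof above is the real argument, but the claim is also easy to sanity-check numerically. A small sketch with f(x) = x² and g(x) = |x| (both convex; the check and its constants are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2
g = lambda x: abs(x)
z = lambda x: f(x) + g(x)   # claim: also convex

# spot-check the definition on random points and mixing weights t
for _ in range(10000):
    x1, x2 = rng.uniform(-10, 10, size=2)
    t = rng.uniform(0.0, 1.0)
    assert z(t * x1 + (1 - t) * x2) <= t * z(x1) + (1 - t) * z(x2) + 1e-9
print("convexity inequality held on all sampled points")
```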

Minimizing with a regularizer
We know how to solve convex minimization problems using gradient descent, and
$\operatorname{argmin}_{w,b} \sum \text{loss}(yy') + \lambda\, \text{regularizer}(w)$
stays convex as long as both the loss and the regularizer are convex.

p-norms are convex
$r(w, b) = \left(\sum_j |w_j|^p\right)^{1/p} = \|w\|_p$
p-norms are convex for p >= 1.

Model-based machine learning
1. pick a model: $0 = b + \sum_{j=1}^m w_j f_j$
2. pick a criterion to optimize (aka objective function): $\sum_{i=1}^n \exp(-y_i(w \cdot x_i + b)) + \frac{\lambda}{2}\|w\|^2$
3. develop a learning algorithm: $\operatorname{argmin}_{w,b} \sum_{i=1}^n \exp(-y_i(w \cdot x_i + b)) + \frac{\lambda}{2}\|w\|^2$
Find w and b that minimize the regularized loss.

Our optimization criterion
$\operatorname{argmin}_{w,b} \sum_{i=1}^n \exp(-y_i(w \cdot x_i + b)) + \frac{\lambda}{2}\|w\|^2$
- Loss function: penalizes examples where the prediction is different than the label
- Regularizer: penalizes large weights
- Key: this function is convex, allowing us to use gradient descent

Gradient descent
- pick a starting point (w)
- repeat until the loss doesn't decrease in any dimension:
  - pick a dimension
  - move a small amount in that dimension towards decreasing loss (using the derivative): $w_j = w_j - \eta \frac{d}{dw_j}(\text{loss}(w) + \text{regularizer}(w, b))$

Some more maths
$\operatorname{argmin}_{w,b} \sum_{i=1}^n \exp(-y_i(w \cdot x_i + b)) + \frac{\lambda}{2}\|w\|^2$
$\frac{d}{dw_j}\text{objective} = \frac{d}{dw_j}\sum_{i=1}^n \exp(-y_i(w \cdot x_i + b)) + \frac{d}{dw_j}\frac{\lambda}{2}\|w\|^2$
(some math happens)
$= -\sum_{i=1}^n y_i x_{ij} \exp(-y_i(w \cdot x_i + b)) + \lambda w_j$

The update
$w_j = w_j + \eta y_i x_{ij} \exp(-y_i(w \cdot x_i + b)) - \eta \lambda w_j$
- $\eta$: learning rate
- $y_i x_{ij}$: direction to update
- $\exp(-y_i(w \cdot x_i + b))$: constant, how far from wrong
- $\eta \lambda w_j$: regularization
What effect does the regularizer have?
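
Putting the derived update into a runnable stochastic-gradient sketch (the loop scaffolding and toy data are ours; the bias follows the same derivative but without a regularization term, since the objective above regularizes only w):

```python
import numpy as np

def train(X, y, eta=0.1, lam=0.01, epochs=100):
    """SGD on exponential loss with the L2 regularizer, per-example update:
       w_j += eta * y_i * x_ij * exp(-y_i (w.x_i + b)) - eta * lam * w_j"""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in range(n):
            c = np.exp(-y[i] * (X[i] @ w + b))     # how far from wrong
            w += eta * y[i] * X[i] * c - eta * lam * w
            b += eta * y[i] * c                    # bias: no regularization term
    return w, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(train(X, y))
```

(The exponential can overflow on badly scaled data; for this toy set it is well behaved.)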

The update
$w_j = w_j + \eta y_i x_{ij} \exp(-y_i(w \cdot x_i + b)) - \eta \lambda w_j$
The regularization term:
- if $w_j$ is positive, reduces $w_j$
- if $w_j$ is negative, increases $w_j$
- moves $w_j$ towards 0

L1 regularization
$\operatorname{argmin}_{w,b} \sum_{i=1}^n \exp(-y_i(w \cdot x_i + b)) + \lambda \sum_j |w_j|$
$\frac{d}{dw_j}\text{objective} = -\sum_{i=1}^n y_i x_{ij} \exp(-y_i(w \cdot x_i + b)) + \lambda\, \text{sign}(w_j)$

L1 regularization
$w_j = w_j + \eta y_i x_{ij} \exp(-y_i(w \cdot x_i + b)) - \eta \lambda\, \text{sign}(w_j)$
(learning rate; direction to update; constant: how far from wrong; regularization)
What effect does the regularizer have?
- if $w_j$ is positive, reduces it by a constant
- if $w_j$ is negative, increases it by a constant
- moves $w_j$ towards 0 regardless of magnitude
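
The difference between the two regularization terms is easiest to see in isolation. A sketch applying only the regularizer part of each update to a fixed weight vector (the numbers are invented):

```python
import numpy as np

eta, lam = 0.1, 0.5
w = np.array([3.0, 0.05, -2.0, -0.05])

l2_part = w - eta * lam * w            # shrink proportional to magnitude
l1_part = w - eta * lam * np.sign(w)   # constant-size step toward 0
print(l2_part)   # [ 2.85    0.0475 -1.9    -0.0475]
print(l1_part)   # [ 2.95    0.     -1.95    0.    ]
```

The L1 step drives the small weights all the way to zero (in practice solvers clip at zero rather than overshooting), which is where L1's sparsity comes from; the L2 step shrinks small weights by almost nothing.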

Regularization with p-norms
- L1: $w_j = w_j + \eta(\text{loss\_correction} - \lambda\, \text{sign}(w_j))$
- L2: $w_j = w_j + \eta(\text{loss\_correction} - \lambda w_j)$
- Lp: $w_j = w_j + \eta(\text{loss\_correction} - \lambda c w_j^{p-1})$
How do higher-order norms affect the weights?

Model-based machine learning
3. develop a learning algorithm: $\operatorname{argmin}_{w,b} \sum_{i=1}^n \exp(-y_i(w \cdot x_i + b)) + \frac{\lambda}{2}\|w\|^2$
Is gradient descent the only way to find w and b? No! There are many other ways to find the minimum; some don't even require iteration. There is a whole field called convex optimization.

Regularizers summarized
- L1 is popular because it tends to result in sparse solutions (i.e. lots of zero weights). However, it is not differentiable, so it only works with gradient-descent-style solvers.
- L2 is also popular because for some loss functions it can be solved directly (no gradient descent required, though iterative solvers are still often used).
- Lp norms are less popular since they don't tend to shrink the weights enough.

The other loss functions
Without regularization, the generic update is:
$w_j = w_j + \eta y_i x_{ij} c$
where $c = \exp(-y_i(w \cdot x_i + b))$ for the exponential loss and $c = 1[yy' < 1]$ for the hinge loss. For squared error:
$w_j = w_j + \eta (y_i - (w \cdot x_i + b)) x_{ij}$
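
Those per-loss constants slot into the one generic update. A sketch (the function and its names are ours):

```python
import numpy as np

def c_for(loss, y_i, pred):
    """The constant c in the generic update w_j = w_j + eta * y_i * x_ij * c."""
    if loss == "exponential":
        return np.exp(-y_i * pred)     # c = exp(-y(w.x + b))
    if loss == "hinge":
        return float(y_i * pred < 1)   # c = 1[yy' < 1]
    raise ValueError(loss)

# squared error doesn't fit the y_i * c form; its update is
#   w_j = w_j + eta * (y_i - pred) * x_ij
print(c_for("exponential", 1, 0.5), c_for("hinge", 1, 0.5))
```

Note the hinge constant is 0 or 1: examples inside the margin get a full perceptron-style update, everything else gets none.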

Many tools support these different combinations
Look at the scikit-learn package.

Common names
- (Ordinary) least squares: squared loss
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic regression: squared loss with L1 AND L2 regularization
- Logistic regression: logistic loss
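
A sketch of how those names map onto scikit-learn estimators (the synthetic data is ours; scikit-learn's alpha plays the role of lambda):

```python
import numpy as np
from sklearn.linear_model import (LinearRegression, Ridge, Lasso,
                                  ElasticNet, LogisticRegression)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

models = {
    "least squares": LinearRegression(),                  # squared loss
    "ridge":         Ridge(alpha=1.0),                    # + L2
    "lasso":         Lasso(alpha=0.1),                    # + L1
    "elastic net":   ElasticNet(alpha=0.1, l1_ratio=0.5), # + L1 and L2
}
for name, model in models.items():
    print(name, np.round(model.fit(X, y).coef_, 2))

# logistic regression: logistic loss (L2-regularized by default in scikit-learn)
clf = LogisticRegression().fit(X, (y > 0).astype(int))
print("logistic", np.round(clf.coef_, 2))
```

On this data the lasso coefficients for the two zero-weight features typically come out exactly 0, matching the sparsity discussion above.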
