Regression and generalization


1 Regression and generalization. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2016.

2 Curve fitting: probabilistic perspective. Describing uncertainty over the value of the target variable as a probability distribution. Example (figure): the curve f(x; w), its value f(x_0; w) at a point x_0, and the conditional distribution p(y | x_0, w, σ) around it.

3 The learning diagram including a noisy target. A hypothesis h: X → Y is selected from training examples (x^(1), y^(1)), ..., (x^(N), y^(N)) so that h(x) approximates the unknown target f: X → Y. The examples are generated from P(x, y) = P(x) P(y | x): the distribution on features and the target distribution. [Y.S. Abu-Mostafa, et al.]

4 Curve fitting: probabilistic perspective (example). Special case: observed output = function + noise, y = f(x; w) + ε, e.g., ε ~ N(0, σ²). Noise: whatever we cannot capture with our chosen family of functions.
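
As a concrete illustration of this noise model, the short NumPy sketch below generates noisy observations around a chosen f(x; w). The quadratic target, the noise level σ = 0.3, and all names are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # Deterministic part of the model: w[0] + w[1]*x + w[2]*x**2 (a small polynomial).
    return w[0] + w[1] * x + w[2] * x**2

w_true = np.array([0.5, -1.0, 2.0])   # assumed "true" parameters
sigma = 0.3                            # assumed noise standard deviation

x = rng.uniform(-1.0, 1.0, size=30)
# Observed output = function + noise:  y = f(x; w) + eps,  eps ~ N(0, sigma^2)
y = f(x, w_true) + rng.normal(0.0, sigma, size=x.shape)
```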

5 Curve fitting: probabilistic perspective (example). The best regression f(x; w) is trying to capture the mean of the observations y given the input x: with ε ~ N(0, σ²), E[y | x] = E[f(x; w) + ε] = f(x; w). Here E[y | x] is the conditional expectation of y given x evaluated according to the model (not according to the underlying distribution P).

6 Curve fitting using probabilistic estimation: Maximum Likelihood (ML) estimation, Maximum A Posteriori (MAP) estimation, and the Bayesian approach.

7 Maximum likelihood estimation. Given observations D = {(x^(i), y^(i))}_{i=1}^n, find the parameters that maximize the (conditional) likelihood of the outputs: L(D; θ) = p(y | X, θ) = ∏_{i=1}^n p(y^(i) | x^(i), θ), where y = [y^(1), ..., y^(n)]^T and X is the design matrix whose i-th row is [1, x_1^(i), ..., x_d^(i)].

8 Maximum likelihood estimation (cont'd). With y = f(x; w) + ε, ε ~ N(0, σ²), y given x is normally distributed with mean f(x; w) and variance σ²: we model the uncertainty in the predictions, not just the mean. p(y | x, w, σ²) = (1 / (√(2π) σ)) exp{ −(1/(2σ²)) (y − f(x; w))² }.
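
For concreteness, this conditional density can be evaluated directly. The sketch below is my own illustration, assuming a univariate linear f(x; w) = w_0 + w_1 x; it computes the log of p(y | x, w, σ²) summed over a data set.

```python
import numpy as np

def gaussian_log_likelihood(x, y, w, sigma2):
    """Sum of ln N(y_i | f(x_i; w), sigma2) for the linear model f(x; w) = w0 + w1*x."""
    mean = w[0] + w[1] * x               # model mean f(x; w)
    resid = y - mean
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(resid**2) / (2 * sigma2)
```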

9 Maximum likelihood estimation (cont'd). Example: univariate linear function, p(y | x, w, σ²) = (1 / (√(2π) σ)) exp{ −(1/(2σ²)) (y − w_0 − w_1 x)² }. Why is this line a bad fit according to the likelihood criterion? Because p(y | x, w, σ²) will be near zero for most of the points (as they are far from this line).

10 Maximum likelihood estimation (cont'd). Maximize the likelihood of the outputs (i.i.d.): L(D; w, σ²) = ∏_{i=1}^n p(y^(i) | x^(i), w, σ²), so ŵ = argmax_w L(D; w, σ²) = argmax_w ∏_{i=1}^n p(y^(i) | x^(i), w, σ²).

11 Maximum likelihood estimation (cont'd). It is often easier (but equivalent) to maximize the log-likelihood: ŵ = argmax_w ln ∏_{i=1}^n p(y^(i) | x^(i), w, σ²). Expanding, ln p(y | X, w, σ²) = Σ_{i=1}^n ln N(y^(i) | x^(i), w, σ²) = −(n/2) ln σ² − (n/2) ln 2π − (1/(2σ²)) Σ_{i=1}^n (y^(i) − f(x^(i); w))², where the last term is the sum-of-squares error.

12 Maximum likelihood estimation (cont'd). Maximizing the log-likelihood (when we assume y = f(x; w) + ε, ε ~ N(0, σ²)) is equivalent to minimizing the SSE. Let ŵ be the maximum likelihood (here least squares) setting of the parameters. What is the maximum likelihood estimate of σ²? Setting ∂ log L(D; ŵ, σ²) / ∂σ² = 0 gives σ̂² = (1/n) Σ_{i=1}^n (y^(i) − f(x^(i); ŵ))², the mean squared prediction error.
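
A minimal sketch of both maximum-likelihood steps for a linear-in-parameters model: an ordinary least-squares solve for ŵ, followed by the mean squared residual as σ̂². The bias-column construction and the function name are assumptions for illustration, not from the slides.

```python
import numpy as np

def fit_ml_linear(X, y):
    """ML estimates for y = Phi @ w + eps, eps ~ N(0, sigma2)."""
    Phi = np.column_stack([np.ones(len(y)), X])         # prepend a bias column
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # least squares <=> ML for w
    resid = y - Phi @ w_hat
    sigma2_hat = np.mean(resid**2)                       # ML noise variance: mean squared prediction error
    return w_hat, sigma2_hat
```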

13 Maximum likelihood estimation (cont'd). Generally, maximizing the log-likelihood is equivalent to minimizing the empirical loss when the loss is defined as Loss(y^(i), f(x^(i); w)) = −ln p(y^(i) | x^(i), w, θ), i.e., the loss is the negative log-probability. More general distributions for p(y | x) can be considered (see the sketch below).
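
As one such alternative (my own example, not from the slides): if the noise is taken to be Laplace rather than Gaussian, the negative log-probability loss becomes, up to constants, an absolute-error loss.

```python
import numpy as np

def neg_log_lik_laplace(y, y_pred, b=1.0):
    """Negative log-likelihood under y | x ~ Laplace(f(x; w), b).
    Minimizing it over w is equivalent to minimizing the sum of absolute errors."""
    return len(y) * np.log(2 * b) + np.sum(np.abs(y - y_pred)) / b
```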

14 Maximum A Posteriori (MAP) estimation. MAP: given observations D, find the parameters that maximize the probability of the parameters after observing the data (the posterior): θ_MAP = argmax_θ p(θ | D). Since p(θ | D) ∝ p(D | θ) p(θ), this gives θ_MAP = argmax_θ p(D | θ) p(θ).

15 Maximum A Posteriori (MAP) estimation. Given observations D = {(x^(i), y^(i))}_{i=1}^n, maximize over w: p(w | X, y) ∝ p(y | X, w) p(w), with prior p(w) = N(0, α²I) = (1 / (2πα²))^((d+1)/2) exp{ −(1/(2α²)) w^T w }.

16 Maximum A Posteriori (MAP) estimation. Given observations D = {(x^(i), y^(i))}_{i=1}^n: max_w ln[ p(y | X, w, σ²) p(w) ] is equivalent to min_w (1/σ²) Σ_{i=1}^n (y^(i) − f(x^(i); w))² + (1/α²) w^T w, i.e., to regularized SSE with λ = σ²/α² (as sketched below).
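
A minimal MAP sketch for a linear-in-parameters model with the Gaussian prior N(0, α²I): the closed-form solution is exactly ridge regression with λ = σ²/α². The default hyperparameter values below are assumptions for illustration.

```python
import numpy as np

def fit_map_linear(Phi, y, sigma2=0.25, alpha2=1.0):
    """MAP estimate of w under y ~ N(Phi @ w, sigma2 * I), w ~ N(0, alpha2 * I).
    Equivalent to ridge regression with lam = sigma2 / alpha2."""
    lam = sigma2 / alpha2
    d = Phi.shape[1]
    w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
    return w_map
```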

17 Bayesian approach. Given observations D = {(x^(i), y^(i))}_{i=1}^n, rather than selecting a single parameter setting, average predictions over the posterior of the parameters: p(y | x, D) = ∫ p(y | w, x) p(w | D) dw. Example of a prior distribution: p(w) = N(0, α²I).

18 Bayesian approach. Given observations D = {(x^(i), y^(i))}_{i=1}^N, the likelihood is p(D | w) = L(D; w, θ) = ∏_{i=1}^N p(y^(i) | w^T x^(i), θ), with p(y^(i) | f(x^(i); w), θ) = N(y^(i) | w^T x^(i), σ²), and the prior is p(w) = N(0, α²I). The posterior is p(w | D) ∝ p(D | w) p(w). Predictive distribution: p(y | x, D) = ∫ p(y | w, x) p(w | D) dw = N(m_N^T x, σ_N²(x)).
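
For this Gaussian likelihood and Gaussian prior, both the posterior p(w | D) and the predictive distribution are available in closed form (the standard Bayesian linear regression formulas, as in Bishop). The sketch below follows those formulas; the argument names and feature handling are my own assumptions. Sampling several w from N(m_N, S_N) and plotting the corresponding functions gives the kind of posterior-sample plot described on the last slide.

```python
import numpy as np

def bayes_linear_posterior(Phi, y, sigma2, alpha2):
    """Posterior N(m_N, S_N) over w for y ~ N(Phi @ w, sigma2 * I), w ~ N(0, alpha2 * I)."""
    d = Phi.shape[1]
    S_N_inv = np.eye(d) / alpha2 + Phi.T @ Phi / sigma2
    S_N = np.linalg.inv(S_N_inv)
    m_N = S_N @ Phi.T @ y / sigma2
    return m_N, S_N

def predictive(phi_x, m_N, S_N, sigma2):
    """Predictive mean and variance at a single feature vector phi_x:
    p(y | x, D) = N(m_N^T phi_x, sigma2 + phi_x^T S_N phi_x)."""
    mean = phi_x @ m_N
    var = sigma2 + phi_x @ S_N @ phi_x   # observation noise + parameter uncertainty
    return mean, var
```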

19 Predictive distribution: example. Sinusoidal data, 9 Gaussian basis functions. The red curve shows the mean of the predictive distribution; the pink region spans one standard deviation on either side of the mean. [Bishop]

20 Predictive distribution: example. Functions whose parameters are sampled from p(w | D). [Bishop]
