Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP

Elfreda Mills
6 years ago

1 Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP

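2 Predictive Distribution (1) Predict $t$ for new values of $\mathbf{x}$ by integrating over $\mathbf{w}$: $p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} = \mathcal{N}\!\left(t \mid \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}),\, \sigma_N^2(\mathbf{x})\right)$, where $\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$.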
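A minimal NumPy sketch of this predictive distribution, assuming the zero-mean isotropic Gaussian prior $\mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$ and noise precision $\beta$. The function names, the cubic polynomial basis, and the toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) under the prior N(w | 0, alpha^-1 I)."""
    M = Phi.shape[1]
    A = alpha * np.eye(M) + beta * Phi.T @ Phi   # A = S_N^{-1}
    S_N = np.linalg.inv(A)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    """Mean and variance of the predictive Gaussian for one feature vector."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x       # noise + parameter uncertainty
    return mean, var

# Hypothetical toy data: noisy sinusoid, cubic polynomial basis.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=20)
Phi = np.vander(x, 4, increasing=True)           # columns 1, x, x^2, x^3

alpha, beta = 5e-3, 11.1                         # assumed hyperparameter values
m_N, S_N = posterior(Phi, t, alpha, beta)
phi_new = np.vander([0.5], 4, increasing=True)[0]
mean, var = predictive(phi_new, m_N, S_N, beta)
```

The predictive variance is never smaller than the noise variance $1/\beta$; the second term shrinks as more data are observed.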

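3 The Evidence Approximation (1) The fully Bayesian predictive distribution is given by $p(t \mid \mathbf{t}) = \iiint p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta$, but this integral is intractable. Approximate with $p(t \mid \mathbf{t}) \simeq p(t \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t \mid \mathbf{w}, \hat{\beta})\, p(\mathbf{w} \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\, d\mathbf{w}$, where $(\hat{\alpha}, \hat{\beta})$ is the mode of $p(\alpha, \beta \mid \mathbf{t})$, which is assumed to be sharply peaked; a.k.a. empirical Bayes, type II or generalized maximum likelihood, or evidence approximation.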

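4 The Evidence Approximation (2) From Bayes' theorem we have $p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\, p(\alpha, \beta)$, and if we assume $p(\alpha, \beta)$ to be flat we see that $p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w}$.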

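5 The Evidence Approximation (3) Cont.: Evidence function: $p(\mathbf{t} \mid \alpha, \beta) = \left(\frac{\beta}{2\pi}\right)^{N/2} \left(\frac{\alpha}{2\pi}\right)^{M/2} \int \exp\{-E(\mathbf{w})\}\, d\mathbf{w}$, with $E(\mathbf{w}) = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w}) = \frac{\beta}{2}\|\mathbf{t} - \Phi\mathbf{w}\|^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$.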

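6 The Evidence Approximation (4) Cont.: $E(\mathbf{w}) = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w}) = \frac{\beta}{2}\|\mathbf{t} - \Phi\mathbf{w}\|^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$. Completing the square over $\mathbf{w}$: $E(\mathbf{w}) = E(\mathbf{m}_N) + \frac{1}{2}(\mathbf{w} - \mathbf{m}_N)^T \mathbf{A} (\mathbf{w} - \mathbf{m}_N)$, with $E(\mathbf{m}_N) = \frac{\beta}{2}\|\mathbf{t} - \Phi\mathbf{m}_N\|^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N$, $\mathbf{A} = \alpha\mathbf{I} + \beta\Phi^T\Phi$ (so $\mathbf{A} = \mathbf{S}_N^{-1}$), and $\mathbf{m}_N = \beta\mathbf{A}^{-1}\Phi^T\mathbf{t}$.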

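7 The Evidence Approximation (5) Evaluate the integral over $\mathbf{w}$: $\int \exp\{-E(\mathbf{w})\}\, d\mathbf{w} = \exp\{-E(\mathbf{m}_N)\} \int \exp\left\{-\frac{1}{2}(\mathbf{w} - \mathbf{m}_N)^T \mathbf{A} (\mathbf{w} - \mathbf{m}_N)\right\} d\mathbf{w} = \exp\{-E(\mathbf{m}_N)\}\,(2\pi)^{M/2}\,|\mathbf{A}|^{-1/2}$. Thus, the log of the marginal likelihood (evidence function) is $\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi)$.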
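The log evidence can be evaluated directly from $\mathbf{A}$ and $\mathbf{m}_N$. The following is a sketch under the same Gaussian prior and noise assumptions; the function name `log_evidence`, the polynomial basis, and the toy data are assumptions made for illustration.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for a linear-Gaussian model with design matrix Phi."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    # E(m_N) = beta/2 ||t - Phi m_N||^2 + alpha/2 m_N^T m_N
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * (m_N @ m_N)
    _, logdet_A = np.linalg.slogdet(A)           # numerically stable ln|A|
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2.0 * np.pi))

# Hypothetical toy data: noisy sinusoid; compare polynomial degrees 0..8.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=20)
alpha, beta = 5e-3, 11.1                         # assumed hyperparameter values

evidences = {
    degree: log_evidence(np.vander(x, degree + 1, increasing=True), t, alpha, beta)
    for degree in range(9)
}
best_degree = max(evidences, key=evidences.get)
```

Using `slogdet` rather than `log(det(A))` avoids overflow or underflow when $M$ is large.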

8 The Evidence Approximation (6) Example: sinusoidal data, $M$th-degree polynomial,

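9 Maximizing the Evidence Function (1) To maximise w.r.t. $\alpha$ and $\beta$, we define the eigenvector equation $\left(\beta\Phi^T\Phi\right)\mathbf{u}_i = \lambda_i \mathbf{u}_i$. Thus $\mathbf{A} = \alpha\mathbf{I} + \beta\Phi^T\Phi$ has eigenvalues $\lambda_i + \alpha$.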

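10 Maximizing the Evidence Function (2) Derivative of $\ln|\mathbf{A}|$ with respect to $\alpha$: $\frac{d}{d\alpha}\ln|\mathbf{A}| = \frac{d}{d\alpha}\ln\prod_i(\lambda_i + \alpha) = \frac{d}{d\alpha}\sum_i \ln(\lambda_i + \alpha) = \sum_i \frac{1}{\lambda_i + \alpha}$. Stationary points of the log marginal likelihood: $0 = \frac{M}{2\alpha} - \frac{1}{2}\mathbf{m}_N^T\mathbf{m}_N - \frac{1}{2}\sum_i \frac{1}{\lambda_i + \alpha}$. Thus $\alpha\,\mathbf{m}_N^T\mathbf{m}_N = M - \alpha\sum_i \frac{1}{\lambda_i + \alpha} = \gamma$, and therefore $\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}$.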

11 Maximizing the Evidence Function (3) Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.

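12 Maximizing the Evidence Function (4) Thus, differentiating w.r.t. $\alpha$ and $\beta$ and setting the results to zero, we get $\alpha = \frac{\gamma}{\mathbf{m}_N^T\mathbf{m}_N}$ and $\frac{1}{\beta} = \frac{1}{N - \gamma}\sum_{n=1}^{N}\left\{t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2$, where $\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}$. Note that $\gamma$ depends on both $\alpha$ and $\beta$, so these are implicit equations that are solved iteratively.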
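Because $\gamma$ and $\mathbf{m}_N$ depend on $\alpha$ and $\beta$, the two update equations are typically applied as a fixed-point iteration. The sketch below assumes that scheme; the function name, starting values, stopping rule, basis width, and toy data are illustrative assumptions.

```python
import numpy as np

def reestimate_alpha_beta(Phi, t, alpha=1.0, beta=1.0, n_iter=200, tol=1e-8):
    """Fixed-point iteration for the evidence-maximizing alpha and beta."""
    N, M = Phi.shape
    PtP = Phi.T @ Phi
    # Eigenvalues of Phi^T Phi; clipped because round-off can make them slightly negative.
    eig = np.clip(np.linalg.eigvalsh(PtP), 0.0, None)
    for _ in range(n_iter):
        lam = beta * eig                              # eigenvalues of beta * Phi^T Phi
        A = alpha * np.eye(M) + beta * PtP
        m_N = beta * np.linalg.solve(A, Phi.T @ t)    # posterior mean for current alpha, beta
        gamma = np.sum(lam / (alpha + lam))           # effective number of parameters
        alpha_new = gamma / (m_N @ m_N)
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
        done = (abs(alpha_new - alpha) <= tol * alpha
                and abs(beta_new - beta) <= tol * beta)
        alpha, beta = alpha_new, beta_new
        if done:
            break
    # gamma and m_N are from the last iteration performed.
    return alpha, beta, gamma, m_N

# Hypothetical toy data: noisy sinusoid, 9 Gaussian basis functions (width assumed).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=25)
centres = np.linspace(0.0, 1.0, 9)
Phi = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / 0.1) ** 2)

alpha_hat, beta_hat, gamma, m_N = reestimate_alpha_beta(Phi, t)
```

The eigenvalues of $\Phi^T\Phi$ are computed once and rescaled by $\beta$ in each pass, which avoids repeating the eigendecomposition.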

13 Effective Number of Parameters (1) $w_1$ is not well determined by the likelihood; $w_2$ is well determined by the likelihood. $\gamma$ is the number of well-determined parameters. (Figure: contours of the likelihood and the prior.)

14 Effective Number of Parameters (3) Example: sinusoidal data, 9 Gaussian basis functions, β = Test set error

15 Effective Number of Parameters (4) Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.

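16 Effective Number of Parameters (5) In the limit $N \gg M$, $\gamma = M$ and we can consider using the easy-to-compute approximations $\alpha = \frac{M}{2E_W(\mathbf{m}_N)}$ and $\beta = \frac{N}{2E_D(\mathbf{m}_N)}$.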

17 Limitations of Fixed Basis Functions The class of nonlinearities may be insufficient. Using $M$ basis functions along each dimension of a $D$-dimensional input space requires $M^D$ basis functions: the curse of dimensionality. Alternative: choose the basis functions using the training data.
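The growth of $M^D$ is easy to see numerically. A small sketch, assuming 10 basis functions per dimension (an arbitrary illustrative choice):

```python
def n_basis_functions(m_per_dim: int, d: int) -> int:
    """Number of basis functions on a full grid: M per dimension, D dimensions."""
    return m_per_dim ** d

# Grid sizes for a few input dimensionalities, with M = 10 per dimension.
for d in (1, 2, 3, 10):
    print(f"D = {d:2d}: {n_basis_functions(10, d)} basis functions")
```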

18 Classification

19 Linear models for classification Assign input vector $\mathbf{x}$ to one of $K$ discrete classes $C_k$, $k = 1, \dots, K$. $D$-dimensional input space. Decision boundary/surface: $(D-1)$-dimensional hyperplane.

20 Regression vs. Classification Regression: $x \in [-1, 1]$, $t \in [-1, 1]$. Classification (two classes): $x \in [-1, 1]$, $t \in \{0, 1\}$.

21 Regression vs. Classification Linear regression model prediction ($y$ real): $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$. Classification: $y$ in range $(0, 1)$ (posterior probabilities): $y(\mathbf{x}) = f\left(\mathbf{w}^T\mathbf{x} + w_0\right)$ (generalized linear models), where $f$ is the activation function (nonlinear). Decision surface: $\mathbf{w}^T\mathbf{x} + w_0 = \text{constant}$, which is linear in $\mathbf{x}$ even if $f$ is nonlinear.
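A minimal sketch of such a generalized linear model. The slide does not name a specific activation function; the logistic sigmoid is assumed here because it maps to $(0, 1)$, and the weights are hypothetical values.

```python
import math

def sigmoid(a: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def glm_predict(w, w0, x):
    """y(x) = f(w^T x + w0) with f assumed to be the logistic sigmoid."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return sigmoid(a)

# Hypothetical weights for a 2-D input.
w, w0 = [2.0, -1.0], 0.5

y_boundary = glm_predict(w, w0, [0.0, 0.5])   # w^T x + w0 = 0, on the decision surface
y_positive = glm_predict(w, w0, [1.0, 0.0])   # w^T x + w0 = 2.5
```

Points with $\mathbf{w}^T\mathbf{x} + w_0 = 0$ get $y = 0.5$, so thresholding $y$ at 0.5 gives a hyperplane as the decision surface.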

22 Binary Variables (1) Coin flipping: heads = 1, tails = 0. Bernoulli Distribution

23 Binary Variables (2) N coin flips: Binomial Distribution

24 Binomial Distribution
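The formulas on these slides did not survive extraction; the sketch below uses the standard Bernoulli and binomial probability mass functions, with $\mu$ denoting the probability of heads. The function names are illustrative.

```python
from math import comb

def bernoulli_pmf(x: int, mu: float) -> float:
    """P(x | mu) for a single coin flip, x in {0, 1}."""
    return mu ** x * (1.0 - mu) ** (1 - x)

def binomial_pmf(m: int, n: int, mu: float) -> float:
    """P(m heads in n independent flips | mu)."""
    return comb(n, m) * mu ** m * (1.0 - mu) ** (n - m)

# Distribution over the number of heads in 10 flips of a coin with mu = 0.25.
pmf = [binomial_pmf(m, 10, 0.25) for m in range(11)]
```

With `n = 1` the binomial reduces to the Bernoulli distribution.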

25 Parameter Estimation (1) ML for Bernoulli Given:

26 Parameter Estimation (2) Example: Prediction: all future tosses will land heads up. Overfitting to D
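The overfitting issue can be reproduced with the standard maximum likelihood estimate for a Bernoulli parameter, which is the fraction of heads in the data. The data set below is a hypothetical example of three heads in a row, not necessarily the one on the slide.

```python
def bernoulli_ml(data):
    """Maximum likelihood estimate of mu: the sample mean of 0/1 outcomes."""
    return sum(data) / len(data)

# Hypothetical small data set: three tosses, all heads.
tosses = [1, 1, 1]
mu_ml = bernoulli_ml(tosses)

# Under mu_ml the predicted probability of tails is zero.
p_tails = 1.0 - mu_ml
```

With so few observations the estimate rules out tails entirely, which is the overfitting the slide refers to.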

27 Decision Theory Inference step: determine either $p(t \mid \mathbf{x})$ or $p(\mathbf{x}, t)$. Decision step: for given $\mathbf{x}$, determine optimal $t$.

28 Minimum Misclassification Rate We are free to choose the decision rule that assigns each point $\mathbf{x}$ to one of the two classes. This defines the decision regions $R_k$. To minimize the integrand, note that $p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x})$. Since $p(\mathbf{x})$ is common to both classes, assign $\mathbf{x}$ to the class for which the posterior $p(C_k \mid \mathbf{x})$ is larger.
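This decision rule can be sketched in a few lines, assuming the class-conditional likelihoods $p(\mathbf{x} \mid C_k)$ and priors $p(C_k)$ at a given point are already known. The function names and numbers are hypothetical.

```python
def posteriors(likelihoods, priors):
    """p(C_k | x) from p(x | C_k) and p(C_k) via Bayes' theorem."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)                      # p(x)
    return [j / evidence for j in joint]

def decide(likelihoods, priors):
    """Index of the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical likelihood values p(x | C_0), p(x | C_1) at one point x.
lik = [0.2, 0.6]
choice_equal_priors = decide(lik, [0.5, 0.5])    # posterior favours class 1
choice_skewed_priors = decide(lik, [0.9, 0.1])   # prior outweighs the likelihood
```

Dividing by $p(\mathbf{x})$ does not change which class wins, so comparing the joint probabilities $p(\mathbf{x}, C_k)$ gives the same decision.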
