TDA231. Logistic regression
1 TDA231. Devdatt Dubhashi, Dept. of Computer Science and Engineering, Chalmers University. February 19, 2016.
2 Some data. [Figure: scatter plot of the training data in the $(x_1, x_2)$ plane.]
3-6 In the Bayes classifier, we built a model of each class and then used Bayes' rule: $P(T_{new} = k \mid x_{new}, X, t) = \frac{p(x_{new} \mid T_{new} = k, X, t)\, p(T_{new} = k)}{\sum_j p(x_{new} \mid T_{new} = j, X, t)\, p(T_{new} = j)}$. An alternative is to directly model $P(T_{new} = k \mid x_{new}, X, t) = f(x_{new}; w)$ with some parameters $w$. We've seen $f(x_{new}; w) = w^T x_{new}$ before; can we use it here? No: its output is unbounded and so can't be a probability. But we can use $P(T_{new} = k \mid x_{new}, w) = h(f(x_{new}; w))$, where $h(\cdot)$ squashes $f(x_{new}; w)$ to lie between 0 and 1, i.e. a probability.
7 $h(\cdot)$: for (binary) logistic regression, we use the sigmoid function: $P(T_{new} = 1 \mid x_{new}, w) = h(w^T x_{new}) = \frac{1}{1 + \exp(-w^T x_{new})}$. [Figure: the sigmoid $1/(1 + \exp(-w^T x))$ plotted against $w^T x$.]
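As a minimal sketch (not from the slides; it assumes NumPy vectors `w` and `x_new` of equal length), the squashing function and the resulting class probability look like this:

```python
import numpy as np

def sigmoid(a):
    """Squash a real number (or array) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def p_class1(x_new, w):
    """P(T_new = 1 | x_new, w) = h(w^T x_new) for binary logistic regression."""
    return sigmoid(w @ x_new)
```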
8-9 Bayesian logistic regression. Recall Bayesian ideas... In theory, if we place a prior on $w$ and define a likelihood, we can obtain a posterior: $p(w \mid X, t) = \frac{p(t \mid X, w)\, p(w)}{p(t \mid X)}$. And we can make predictions by taking expectations (averaging over $w$): $P(T_{new} = 1 \mid x_{new}, X, t) = \mathbb{E}_{p(w \mid X, t)}\{P(T_{new} = 1 \mid x_{new}, w)\}$. Sounds good so far...
10 Defining a prior. Choose a Gaussian prior: $p(w) = \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, \sigma^2)$. Prior choice is always important from a data-analysis point of view. Previously, it was also important for the maths. This isn't the case today: we could choose any prior, since no prior makes the maths easier!
11-12 Logistic regression: likelihood. First assume independence: $p(t \mid X, w) = \prod_{n=1}^{N} p(t_n \mid x_n, w)$. We have already defined this: it's our squashing function! If $t_n = 1$: $P(t_n = 1 \mid x_n, w) = \frac{1}{1 + \exp(-w^T x_n)}$, and if $t_n = 0$: $P(t_n = 0 \mid x_n, w) = 1 - P(t_n = 1 \mid x_n, w)$.
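A sketch of the corresponding likelihood, assuming a hypothetical N x D data matrix `X` and binary labels `t` in {0, 1}; working in log space avoids numerical underflow when N is large:

```python
import numpy as np

def log_likelihood(w, X, t):
    """log p(t | X, w) = sum_n [ t_n log h(w^T x_n) + (1 - t_n) log(1 - h(w^T x_n)) ]."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # h(w^T x_n) for every row x_n of X
    eps = 1e-12                        # guard against log(0)
    return np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
```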
13 Defining a prior. Choose a Gaussian prior: $p(w) = \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, \sigma^2)$. Prior choice is always important from a data-analysis point of view. Previously, it was also important for the maths. This isn't the case today: we could choose any prior, since no prior makes the maths easier!
14 Posterior. $p(w \mid X, t, \sigma^2) = \frac{p(t \mid X, w, \sigma^2)\, p(w \mid \sigma^2)}{p(t \mid X, \sigma^2)}$. Now things start going wrong. We can't compute $p(w \mid X, t)$ analytically: the prior is not conjugate to the likelihood, and no prior is! This means we don't know the form of $p(w \mid X, t, \sigma^2)$, and we can't compute the marginal likelihood: $p(t \mid X, \sigma^2) = \int p(t \mid X, w, \sigma^2)\, p(w \mid \sigma^2)\, dw$
15-19 What can we compute? $p(w \mid X, t, \sigma^2) = \frac{p(t \mid X, w, \sigma^2)\, p(w \mid \sigma^2)}{p(t \mid X, \sigma^2)}$. We can compute the numerator $p(t \mid X, w, \sigma^2)\, p(w \mid \sigma^2)$, so define $g(w; X, t, \sigma^2) = p(t \mid X, w)\, p(w \mid \sigma^2)$. Armed with this, we have three options:
- Find the most likely value of $w$ (a point estimate).
- Approximate $p(w \mid X, t, \sigma^2)$ with something easier.
- Sample from $p(w \mid X, t, \sigma^2)$.
These aren't the only ways of approximating/sampling, and they are general techniques, not unique to logistic regression.
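Although the posterior itself is unavailable, $g$ is easy to evaluate. A sketch in log space, reusing the hypothetical `log_likelihood` above and taking `sigma2` as the prior variance:

```python
import numpy as np

def log_prior(w, sigma2):
    """log N(w; 0, sigma2 * I) for the isotropic Gaussian prior."""
    D = len(w)
    return -0.5 * D * np.log(2 * np.pi * sigma2) - 0.5 * np.dot(w, w) / sigma2

def log_g(w, X, t, sigma2):
    """log g(w; X, t, sigma2) = log p(t | X, w) + log p(w | sigma2)."""
    return log_likelihood(w, X, t) + log_prior(w, sigma2)
```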
20-21 MAP estimate. Our first method is to find the value of $w$ that maximises $p(w \mid X, t, \sigma^2)$ (call it $\hat{w}$). Since $g(w; X, t, \sigma^2) \propto p(w \mid X, t, \sigma^2)$, $\hat{w}$ therefore also maximises $g(w; X, t, \sigma^2)$. Very similar to maximum likelihood, but with the additional effect of the prior. Known as the MAP (maximum a posteriori) solution. Once we have $\hat{w}$, we make predictions with: $P(t_{new} = 1 \mid x_{new}, \hat{w}) = \frac{1}{1 + \exp(-\hat{w}^T x_{new})}$
22-24 MAP. When we met maximum likelihood, we could find $\hat{w}$ exactly with some algebra. We can't do that here (we can't solve $\frac{\partial g(w; X, t, \sigma^2)}{\partial w} = 0$ analytically). So we resort to numerical optimisation: 1. Guess $\hat{w}$. 2. Change it a bit in a way that increases $g(w; X, t, \sigma^2)$. 3. Repeat until no further increase is possible. Many algorithms exist that differ in how they do step 2, e.g. gradient descent. Not covered in this course; you just need to know that sometimes we can't do things analytically and there are methods to help us! Ask John!
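As one hedged illustration of step 2 (a plain gradient method; the step size and iteration budget below are arbitrary choices, not values from the lecture), we can ascend the gradient of $\log g$, which for this model is $X^T(t - h(Xw)) - w/\sigma^2$:

```python
import numpy as np

def map_estimate(X, t, sigma2, step_size=0.1, n_iters=5000):
    """Numerically maximise log g(w; X, t, sigma2) by gradient ascent."""
    w = np.zeros(X.shape[1])                 # 1. guess w_hat
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))     # h(w^T x_n) for every data point
        grad = X.T @ (t - p) - w / sigma2    # gradient of log g at the current w
        w = w + step_size * grad             # 2. move a bit in the uphill direction
    return w                                 # 3. stop after a fixed budget of steps
```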
25 MAP numerical optimisation for our data. [Figure: left, the data; right, the evolution of the components of $\hat{w}$ against iteration number.]
26-27 Decision boundary. Once we have $\hat{w}$, we can classify new examples. The decision boundary is a useful visualisation: it is the line corresponding to $P(T_{new} = 1 \mid x_{new}, \hat{w}) = 0.5$. [Figure: the data with the decision boundary overlaid.] Setting $\frac{1}{1 + \exp(-\hat{w}^T x_{new})} = \frac{1}{2}$ gives $\exp(-\hat{w}^T x_{new}) = 1$, i.e. $\hat{w}^T x_{new} = 0$.
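A tiny worked example (the numbers for $\hat{w}$ are made up, and we assume a 2-D input with no bias term, as in the plots): classification just checks the sign of $\hat{w}^T x_{new}$, and the boundary is the line on which that score is zero.

```python
import numpy as np

w_hat = np.array([1.5, -2.0])       # hypothetical MAP estimate for 2-D data
x_new = np.array([0.4, 0.3])

score = w_hat @ x_new               # w_hat^T x_new = 1.5*0.4 - 2.0*0.3 = 0.0
label = 1 if score > 0 else 0       # exactly on the boundary here, so P(T_new = 1) = 0.5
slope = -w_hat[0] / w_hat[1]        # boundary line x2 = slope * x1 (when w_hat[1] != 0)
```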
28 Predictive probabilities. [Figure: contours of $P(T_{new} = 1 \mid x_{new}, \hat{w})$ over the data.] Do they look sensible?
29 Sampling from the posterior. Suppose we can produce samples $w_1, w_2, \ldots, w_s, \ldots$ from $p(w \mid X, t, \sigma^2)$. Then we can average the predictions to approximate the expectation under $p(w \mid X, t, \sigma^2)$: $P(t_{new} = 1 \mid x_{new}, X, t, \sigma^2) = \mathbb{E}_{p(w \mid X, t, \sigma^2)}\{P(t_{new} = 1 \mid x_{new}, w)\} \approx \frac{1}{S}\sum_{s=1}^{S}\frac{1}{1 + \exp(-w_s^T x_{new})}$
30 Magic! We can sample directly from $p(w \mid X, t, \sigma^2)$ even though we can't compute it! Various algorithms exist; we'll use Metropolis-Hastings.
31-34 Back to the script: Metropolis-Hastings. MH produces a sequence of samples $w_1, w_2, \ldots, w_s, \ldots$. Imagine we've just produced $w_{s-1}$. MH first proposes a possible $w_s$ (call it $\tilde{w}_s$) based on $w_{s-1}$. MH then decides whether or not to accept $\tilde{w}_s$: if accepted, $w_s = \tilde{w}_s$; if not, $w_s = w_{s-1}$. Two distinct steps: proposal and acceptance.
35-37 MH proposal. Treat $\tilde{w}_s$ as a random variable conditioned on $w_{s-1}$, i.e. we need to define $p(\tilde{w}_s \mid w_{s-1})$. Note that this does not necessarily have to be similar to the posterior we're trying to sample from; we can choose whatever we like! E.g. use a Gaussian centred on $w_{s-1}$ with some covariance: $p(\tilde{w}_s \mid w_{s-1}, \Sigma_p) = \mathcal{N}(w_{s-1}, \Sigma_p)$. [Figure: example proposal densities for two different settings of $\Sigma_p$.]
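A sketch of this Gaussian random-walk proposal, with `Sigma_p` a hypothetical proposal covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

def propose(w_prev, Sigma_p):
    """Draw w_tilde ~ N(w_prev, Sigma_p)."""
    return rng.multivariate_normal(mean=w_prev, cov=Sigma_p)
```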
38-41 MH acceptance. The choice of acceptance is based on the following ratio: $r = \frac{p(\tilde{w}_s \mid X, t, \sigma^2)}{p(w_{s-1} \mid X, t, \sigma^2)} \cdot \frac{p(w_{s-1} \mid \tilde{w}_s, \Sigma_p)}{p(\tilde{w}_s \mid w_{s-1}, \Sigma_p)}$, which simplifies to (all of which we can compute): $r = \frac{g(\tilde{w}_s; X, t, \sigma^2)}{g(w_{s-1}; X, t, \sigma^2)} \cdot \frac{p(w_{s-1} \mid \tilde{w}_s, \Sigma_p)}{p(\tilde{w}_s \mid w_{s-1}, \Sigma_p)}$. We now use the following rules: if $r \geq 1$, accept: $w_s = \tilde{w}_s$; if $r < 1$, accept with probability $r$. If we do this enough, we'll eventually be sampling from $p(w \mid X, t)$, no matter where we started, i.e. for any $w_1$.
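Putting proposal and acceptance together, here is a minimal Metropolis-Hastings sketch for this posterior. It reuses the hypothetical `log_g` from earlier; because the Gaussian random-walk proposal is symmetric, the proposal terms cancel in $r$, so only the ratio of $g$ values is needed, computed in log space:

```python
import numpy as np

def metropolis_hastings(X, t, sigma2, Sigma_p, n_samples=5000, w_init=None):
    """Sample from p(w | X, t, sigma2) using a Gaussian random-walk proposal."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1]) if w_init is None else np.asarray(w_init, dtype=float)
    log_g_curr = log_g(w, X, t, sigma2)
    samples = []
    for _ in range(n_samples):
        w_tilde = rng.multivariate_normal(w, Sigma_p)   # proposal step
        log_g_prop = log_g(w_tilde, X, t, sigma2)
        log_r = log_g_prop - log_g_curr                 # log acceptance ratio (symmetric proposal)
        if np.log(rng.uniform()) < log_r:               # accept if r >= 1, or with probability r
            w, log_g_curr = w_tilde, log_g_prop
        samples.append(w.copy())                        # if rejected, repeat the previous sample
    return np.array(samples)
```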
42 Where to Start? Convergence theorem for Markov chains: no matter where the chain is started, the MH process will always converge (under some technical conditions) to its target distribution!
43-46 When to Stop? How do we know the Markov chain has converged? Two options: start chains from different starting points and run until they look the same, or apply statistical hypothesis testing to the empirical distributions.
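As one concrete (and hedged) version of the "compare several chains" idea, the Gelman-Rubin statistic, which is not part of the lecture itself, compares between-chain and within-chain variance; values close to 1 suggest the chains have mixed:

```python
import numpy as np

def gelman_rubin(chains):
    """chains: array of shape (M, N) -- M chains, each with N draws of one scalar parameter."""
    M, N = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()     # average within-chain variance
    B = N * chain_means.var(ddof=1)           # between-chain variance
    V_hat = (N - 1) / N * W + B / N           # pooled variance estimate
    return np.sqrt(V_hat / W)                 # R-hat: close to 1 suggests convergence
```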
47 MH flowchart: set $s = 1$ and choose $w_s$; then repeatedly set $s = s + 1$, generate $\tilde{w}_s$ from $p(\tilde{w}_s \mid w_{s-1})$ and compute the acceptance ratio $r$; if $r \geq 1$, set $w_s = \tilde{w}_s$; otherwise generate $u$ from $U(0, 1)$ and set $w_s = \tilde{w}_s$ if $u \leq r$, else $w_s = w_{s-1}$.
48-49 MH walkthrough. [Figures: a sequence of panels showing proposals and accepted samples of $(w_1, w_2)$ as the sampler progresses.]
50 What do the samples look like? [Figure: samples of $(w_1, w_2)$ drawn from the posterior using MH.]
51-52 Predictions with MH. MH provides us with a set of samples $w_1, \ldots, w_S$. These can be used to approximate the posterior predictive probability: $P(t_{new} = 1 \mid x_{new}, X, t, \sigma^2) = \mathbb{E}_{p(w \mid X, t, \sigma^2)}\{P(t_{new} = 1 \mid x_{new}, w)\} \approx \frac{1}{S}\sum_{s=1}^{S}\frac{1}{1 + \exp(-w_s^T x_{new})}$. [Figure: contours of $P(t_{new} = 1 \mid x_{new}, X, t, \sigma^2)$ over the data.]
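A sketch of this Monte-Carlo average, assuming `samples` is the S x D array returned by the hypothetical sampler above:

```python
import numpy as np

def predictive_prob(x_new, samples):
    """P(t_new = 1 | x_new, X, t, sigma2) ~ (1/S) sum_s 1 / (1 + exp(-w_s^T x_new))."""
    scores = samples @ x_new                        # w_s^T x_new for every sample
    return np.mean(1.0 / (1.0 + np.exp(-scores)))
```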
53 Summary. Introduced logistic regression, a probabilistic binary classifier. Saw that we couldn't compute the posterior. Introduced examples of two alternatives: the MAP solution, and sampling via Metropolis-Hastings. The second is better than the first (in terms of predictions)... but it also has greater complexity! To think about: what if the posterior is multi-modal?