DATA11002 Introduction to Machine Learning

Lecturer: Teemu Roos
TAs: Ville Hyvönen and Janne Leppä-aho
Department of Computer Science, University of Helsinki
(based in part on material by Patrik Hoyer and Jyrki Kivinen)

November 2nd to December 15th, 2017
Classification: Probabilistic Methods
Logistic regression

Logistic regression models are linear models for probabilistic binary classification (so, not really regression, where the response is continuous).

Given an input (vector) $x$, the output is the probability that $Y = 1$; let us denote it by $\hat{p}(Y = 1 \mid x)$.

However, instead of using a linear model directly, as in $\hat{p}(Y = 1 \mid x) = \beta^T x$, we let

$$\log \frac{\hat{p}(Y = 1 \mid x)}{\hat{p}(Y = 0 \mid x)} = \beta^T x$$

This amounts to the same as

$$\hat{p}(Y = 1 \mid x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)} = \frac{1}{\exp(-\beta^T x) + 1}$$
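As a minimal illustration (not part of the original slides), the model's output probability can be computed as follows; the function name predict_proba and the convention of folding the intercept into $\beta$ as a constant feature are assumptions made for this sketch.

```python
import numpy as np

def predict_proba(beta, x):
    """Probability that Y = 1 given input x under the logistic model."""
    # 1 / (1 + exp(-beta^T x)), the logistic (sigmoid) function.
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

# Hypothetical example: intercept folded into beta via a constant feature.
beta = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 1.5])      # first component 1.0 acts as the bias feature
print(predict_proba(beta, x))      # a value in (0, 1)
```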
Logistic regression (2)

For convenience, we use here class labels 0 and 1.

Given a probabilistic prediction $\hat{p}(y \mid x)$, and assuming instance $x_i$ has already been observed, the conditional likelihood for a sample point $(x_i, y_i)$ is

$$\begin{cases} \hat{p}(Y = 1 \mid x_i) & \text{if } y_i = 1 \\ 1 - \hat{p}(Y = 1 \mid x_i) & \text{if } y_i = 0 \end{cases}$$

which we write as

$$\hat{p}(Y = 1 \mid x_i)^{y_i} \, \left(1 - \hat{p}(Y = 1 \mid x_i)\right)^{1 - y_i}$$
Logistic regression (3)

The conditional likelihood of a sequence of independent samples $(x_i, y_i)$, $i = 1, \dots, n$, is then

$$\prod_{i=1}^{n} \hat{p}(Y = 1 \mid x_i)^{y_i} \, \left(1 - \hat{p}(Y = 1 \mid x_i)\right)^{1 - y_i}$$

We say conditional to emphasise that we take $x_i$ as given and only model the probability of the labels $y_i$.

To maximise the conditional likelihood, we can equivalently maximise the conditional log-likelihood

$$LCL(\beta) = \sum_{i=1}^{n} \left( y_i \ln \hat{p}(Y = 1 \mid x_i) + (1 - y_i) \ln\left(1 - \hat{p}(Y = 1 \mid x_i)\right) \right)$$

Note that this is the same as the log-loss, up to sign!
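A small sketch of the conditional log-likelihood (equivalently, the negative log-loss); here X is assumed to be an n-by-p matrix of inputs and y a vector of 0/1 labels, with the logistic model from the earlier sketch.

```python
import numpy as np

def conditional_log_likelihood(beta, X, y):
    """LCL(beta) = sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ]."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # p_i = p-hat(Y=1 | x_i)
    eps = 1e-12                           # guard against ln(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```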
Logistic regression (4)

Maximizing the likelihood (or minimizing the log-loss) isn't as straightforward as in the case of linear regression: there is no closed-form solution.

Nevertheless, the problem is convex, which means that gradient-based techniques will find the global optimum.

Standard implementations exist in R, Python, Matlab, ...

Often used with regularisation, as in linear regression:

ridge: $\arg\max_\beta \left( LCL(\beta) - \lambda \lVert \beta \rVert_2^2 \right)$
lasso: $\arg\max_\beta \left( LCL(\beta) - \lambda \lVert \beta \rVert_1 \right)$

In particular, if the data are linearly separable, the non-regularised solution tends to infinity.
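As a rough sketch of how the convex objective can be optimised, plain gradient ascent on the ridge-penalised LCL might look as follows; the learning rate, iteration count, and function name are illustrative choices rather than the course's reference implementation.

```python
import numpy as np

def fit_logistic_ridge(X, y, lam=0.1, lr=0.01, n_iter=5000):
    """Maximise LCL(beta) - lam * ||beta||_2^2 by gradient ascent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # predicted p-hat(Y=1 | x_i)
        grad = X.T @ (y - p) - 2 * lam * beta  # gradient of the penalised objective
        beta += lr * grad
    return beta
```

The gradient of $LCL(\beta)$ is $\sum_i (y_i - p_i) x_i$; the penalty term keeps $\beta$ finite even when the data are linearly separable.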
Generative vs discriminative learning

Logistic regression was an example of a discriminative and probabilistic classifier: it directly models the class distribution $P(y \mid x)$.

Another probabilistic way to approach the problem is generative learning, which builds a model for the whole joint distribution $P(x, y)$, often using the decomposition $P(y) P(x \mid y)$.

Both approaches have their pros and cons:

- Discriminative learning: you only solve the task you actually need to solve; may provide better accuracy since it focuses on the specific learning task; optimization tends to be harder.
- Generative learning: often more natural to build models for $P(x \mid y)$ than for $P(y \mid x)$; handles missing data more naturally; optimization often easier.
Generative vs discriminative learning (2)

Estimating the class prior $P(y)$ is usually simple. For example, in binary classification (this time with $Y \in \{-1, +1\}$) we can usually just count the number of positive examples Pos and negative examples Neg and set

$$P(Y = +1) = \frac{\mathrm{Pos}}{\mathrm{Pos} + \mathrm{Neg}} \qquad \text{and} \qquad P(Y = -1) = \frac{\mathrm{Neg}}{\mathrm{Pos} + \mathrm{Neg}}$$

Since $P(x, y) = P(x \mid y) P(y)$, what remains is estimating $P(x \mid y)$. In binary classification, we

- use the positive examples to build a model for $P(x \mid Y = +1)$
- use the negative examples to build a model for $P(x \mid Y = -1)$

To classify a new data point $x$, we use the Bayes formula

$$P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)} = \frac{P(x \mid y) P(y)}{\sum_{y'} P(x \mid y') P(y')}$$
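A generic sketch of this recipe: estimate the priors by counting and combine them with class-conditional densities via the Bayes formula. The callables density_pos and density_neg are placeholders for whatever models of $P(x \mid y)$ one chooses; the function names are assumptions made for this sketch.

```python
import numpy as np

def class_priors(y):
    """Estimate P(Y=+1) and P(Y=-1) by counting labels in {-1, +1}."""
    pos, neg = np.sum(y == +1), np.sum(y == -1)
    return pos / (pos + neg), neg / (pos + neg)

def posterior_positive(x, prior_pos, prior_neg, density_pos, density_neg):
    """Bayes formula: P(Y=+1 | x) from priors and class-conditional densities."""
    joint_pos = density_pos(x) * prior_pos   # P(x | Y=+1) P(Y=+1)
    joint_neg = density_neg(x) * prior_neg   # P(x | Y=-1) P(Y=-1)
    return joint_pos / (joint_pos + joint_neg)
```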
Generative vs discriminative learning (3)

Examples of discriminative classifiers:

- logistic regression
- k-NN
- decision trees
- SVM
- multilayer perceptron (MLP)

Examples of generative classifiers:

- naive Bayes (NB)
- linear discriminant analysis (LDA)
- quadratic discriminant analysis (QDA)

We will study all of the above except MLP.
Normal distribution

For probabilistic models of real-valued features $x_j \in \mathbb{R}$, one basic ingredient is the normal or Gaussian distribution.

Recall that for a single real-valued random variable, the normal distribution has two parameters $\mu$ and $\sigma^2$, and density

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

If $X$ has this distribution, then $E[X] = \mu$ and $\mathrm{Var}[X] = \sigma^2$.

For the multivariate case $x \in \mathbb{R}^p$, we shall first consider the case where each individual component $x_j$ has a normal distribution with parameters $\mu_j$ and $\sigma_j^2$ and the components are independent:

$$p(x) = N(x_1 \mid \mu_1, \sigma_1^2) \cdots N(x_p \mid \mu_p, \sigma_p^2)$$
Normal distribution (2)

We get

$$\begin{aligned}
p(x) &= N(x_1 \mid \mu_1, \sigma_1^2) \cdots N(x_p \mid \mu_p, \sigma_p^2) \\
&= \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right) \\
&= \frac{1}{(2\pi)^{p/2} \, \sigma_1 \cdots \sigma_p} \exp\left( -\frac{1}{2} \sum_{j=1}^{p} \frac{(x_j - \mu_j)^2}{\sigma_j^2} \right) \\
&= \frac{1}{(2\pi)^{p/2} \, |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
\end{aligned}$$

where $\mu = (\mu_1, \dots, \mu_p) \in \mathbb{R}^p$, $\Sigma \in \mathbb{R}^{p \times p}$ is a diagonal matrix with $\sigma_1^2, \dots, \sigma_p^2$ on the diagonal, and $|\Sigma|$ is the determinant of $\Sigma$.
Normal distribution (3)

More generally, let $\mu \in \mathbb{R}^p$, and let $\Sigma \in \mathbb{R}^{p \times p}$ be

- symmetric: $\Sigma^T = \Sigma$
- positive definite: $x^T \Sigma x > 0$ for all $x \in \mathbb{R}^p \setminus \{0\}$

We then define the $p$-dimensional Gaussian density with parameters $\mu$ and $\Sigma$ as

$$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} \, |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

If $\Sigma$ is diagonal, we get the special case where the components $x_j$ are independent.
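A small sketch evaluating the multivariate Gaussian density directly from the formula above; in practice one would typically use scipy.stats.multivariate_normal instead.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """N(x | mu, Sigma) for a p-dimensional Gaussian."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)               # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm
```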
Normal distribution (4)

To understand the multivariate normal distribution, consider a surface of constant density: for some $a$,

$$S = \{ x \in \mathbb{R}^p \mid N(x \mid \mu, \Sigma) = a \}$$

By the definition of $N$, this can be written, for some $b$, as

$$S = \{ x \in \mathbb{R}^p \mid (x - \mu)^T \Sigma^{-1} (x - \mu) = b \}$$

Because $\Sigma$ is symmetric and positive definite, so is $\Sigma^{-1}$, and this set is an ellipsoid with centre $\mu$.
Normal distribution (5)

More specifically, since $\Sigma$ is symmetric and positive definite, it has an eigenvalue decomposition

$$\Sigma = U \Lambda U^T$$

where $\Lambda \in \mathbb{R}^{p \times p}$ is diagonal and $U \in \mathbb{R}^{p \times p}$ is orthogonal ($U^T = U^{-1}$), and further

$$\Sigma^{-1} = U \Lambda^{-1} U^T$$

We then know from analytic geometry that for the ellipsoid

$$S = \{ x \in \mathbb{R}^p \mid (x - \mu)^T \Sigma^{-1} (x - \mu) = b \}$$

- the directions of the axes are given by the column vectors of $U$ (eigenvectors of $\Sigma$)
- the squared lengths of the axes are given by the elements of $\Lambda$ (eigenvalues of $\Sigma$)
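To make the geometry concrete, the axis directions and eigenvalues can be read off a numerical eigendecomposition; the covariance matrix below is made up purely for illustration.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])              # symmetric, positive definite (illustrative)

eigvals, U = np.linalg.eigh(Sigma)          # Sigma = U diag(eigvals) U^T
# Columns of U give the directions of the ellipsoid axes (eigenvectors of Sigma),
# eigvals give the corresponding eigenvalues.
print(U)
print(eigvals)
```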
Normal distribution (6)

Let $X = (X_1, \dots, X_p)$ have a normal distribution with parameters $\mu$ and $\Sigma$.

Then $E[X] = \mu$ and $E[(X_r - \mu_r)(X_s - \mu_s)] = \Sigma_{rs}$.

Hence, we call the parameter $\mu$ the mean and $\Sigma$ the covariance matrix.
Normal distribution (7)

Let $x_1, \dots, x_n$, where $x_i = (x_{i,1}, \dots, x_{i,p})$, be $n$ independent samples from a $p$-dimensional normal distribution with unknown mean $\mu$ and covariance $\Sigma$.

The maximum likelihood estimates are given by

$$(\hat{\mu}, \hat{\Sigma}) = \arg\max_{\mu, \Sigma} \prod_{i=1}^{n} N(x_i \mid \mu, \Sigma)$$

$$\hat{\mu}_r = \frac{1}{n} \sum_{i=1}^{n} x_{i,r} \qquad \hat{\Sigma}_{rs} = \frac{1}{n} \sum_{i=1}^{n} (x_{i,r} - \hat{\mu}_r)(x_{i,s} - \hat{\mu}_s)$$
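These estimates are simple to compute from an n-by-p data matrix, as in the sketch below; note the division by n rather than n - 1 (np.cov would use n - 1 by default).

```python
import numpy as np

def ml_estimates(X):
    """ML estimates of the mean vector and covariance matrix from an n x p sample."""
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / X.shape[0]   # divide by n, not n - 1
    return mu_hat, Sigma_hat
```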
Gaussians in classification

LDA and QDA are obtained by modeling the positive and negative examples each with their own Gaussian:

$$p(x \mid Y = +1) = N(x \mid \mu_+, \Sigma_+) \qquad p(x \mid Y = -1) = N(x \mid \mu_-, \Sigma_-)$$

where $\mu_\pm$ and $\Sigma_\pm$ are obtained, for example, as maximum likelihood estimates.

The decision boundary is given by

$$N(x \mid \mu_+, \Sigma_+) = N(x \mid \mu_-, \Sigma_-)$$

or equivalently

$$\ln N(x \mid \mu_+, \Sigma_+) = \ln N(x \mid \mu_-, \Sigma_-)$$
Gaussians in classification (2)

By substituting the formula for $N$ into $\ln N(x \mid \mu_+, \Sigma_+) = \ln N(x \mid \mu_-, \Sigma_-)$ and simplifying, we get

$$(x - \mu_+)^T \Sigma_+^{-1} (x - \mu_+) - (x - \mu_-)^T \Sigma_-^{-1} (x - \mu_-) + \ln \frac{|\Sigma_+|}{|\Sigma_-|} = 0$$

If $\Sigma_+ = \Sigma_-$, this is a linear equation, so the decision boundary is a hyperplane: LDA.

In the general case this is a quadratic surface: QDA.

In QDA, decision regions may be non-connected.
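A sketch of the resulting classifier: fit one Gaussian per class (for example with the ml_estimates sketch above) and predict the class whose log density is higher at x, which matches the decision boundary above when the class priors are equal. The function name and the use of scipy.stats.multivariate_normal are choices made for this sketch.

```python
import numpy as np
from scipy.stats import multivariate_normal

def qda_predict(x, mu_pos, Sigma_pos, mu_neg, Sigma_neg):
    """QDA decision rule (reduces to LDA when the covariances are equal), equal priors assumed."""
    log_pos = multivariate_normal.logpdf(x, mean=mu_pos, cov=Sigma_pos)
    log_neg = multivariate_normal.logpdf(x, mean=mu_neg, cov=Sigma_neg)
    return +1 if log_pos > log_neg else -1
```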