Introduction to Machine Learning

1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer, Jyrki Kivinen, and Teemu Roos) November 1st December 14th 2018

Classification: Probabilistic Methods 2,

3, Statistical learning model (recap) Recall that X are the input variables (features, predictors, etc.), and Y denotes the output variable. In classification, we assume Y to be a set of class labels,, e.g. 0 or 1, or +1 or -1. Assume that there is a fixed but unknown probability distribution P over X Y such that pairs are (x i, y i ) are i.i.d. samples from it. We wish to minimise the generalisation error (also called risk) of ˆf, which is the expected loss E (x,y) P [L(ˆf (x), y)] where E (x,y) P [ ] denotes expectation when a single data point (x, y) is drawn from P

4, Some definitions/notation P(X = x i, Y = y i ) denotes the probability of pair (x i, y i ). (Sometimes we write P(x i, y i ) for short.) P(Y = y i X = x i ) denotes the conditional probability of observing y i given x i. (We can also write P(Y = y i x i ) or P(y i x i ) for short.) If we knew P (which we in practice never do!), we could implement an optimal classifier by assigning x i to the class y i that maximises P(y i x i ). In the following, we are particularly interested in models for P(Y = y i x i ). These models are almost always wrong, but perhaps they give good predictions anyway!

5, Logistic regression Logistic regression models are linear models for probabilistic binary classification (so, not really regression where response is continuous) Given input (vector) x, the output is a probability that Y = 1. Let s denote it by ˆp(Y = 1 x). However, instead of using a linear model directly as in we let log ˆp(Y = 1 x) = β x ˆp(Y = 1 x) ˆp(Y = 0 x) = β x This amounts to the same as exp(β x) ˆp(Y = 1 x) = 1 + exp(β x) = 1 exp( β x) + 1

6, Logistic regression (2) For convenience, we use here class labels 0 and 1 Given probabilistic prediction ˆp(y x), and assuming instance x i has already been observed, the conditional likelihood for a sample point (x i, y i ) is ˆp(Y = 1 x i ) if y i = 1 1 ˆp(Y = 1 x i ) if y i = 0 which we write as ˆp(Y = 1 x i ) y i (1 ˆp(Y = 1 x i )) 1 y i

7, Logistic regression (3) Conditional likelihood of sequence of independent samples (x i, y i ), i = 1,..., n is then n ˆp(Y = 1 x i ) y i (1 ˆp(Y = 1 x i )) 1 y i i=1 we say conditional to emphasise that we take x i as given and only model probability of labels y i To maximise conditional likelihood, we can equivalently maximise conditional log-likelihood n LCL(β) = (y i ln ˆp(Y = 1 x i ) + (1 y i ) ln(1 ˆp(Y = 1 x i ))) i=1 This is the same as log-loss (except that the sign is flipped, i.e., without the minus)!

8, Logistic regression (4) Maximizing the likelihood (or minimizing log-loss) isn t as straightforward as in the case of linear regression Nevertheless, the problem is convex which means that gradient-based techniques exist to find the optimum Standard techniques in R, Python, Matlab,... Often used with regularisation, as in linear regression ridge : arg max(lcl(β) λ β 2 2 ) lasso : arg max(lcl(β) λ β 1 ) In particular, if data is linearly separable, non-regularised solution tends to infinity

9, Generative vs discriminative learning Logistic regression was an example of a discriminative and probabilistic classifier that directly models the class distribution P(y x) Another probabilistic way to approach the problem is to use generative learning that builds a model for the whole joint distribution P(x, y) often using the decomposition P(y)P(x y) Both approaches have their pros and cons: Discriminative learning: only solve the task that you need to solve; may provide better accuracy since focuses on the specific learning task; optimization tends to be harder Generative learning: often more natural to build models for P(x y) than for P(y x); handles missing data more naturally; optimization often easier

10, Generative vs discriminative learning (2) Estimating the class prior P(y) is usually simple For example, in binary classification this time with Y { 1, +1} we can usually just count the number of positive examples Pos and negative examples Neg and set P(Y = +1) = Pos Pos + Neg and P(Y = 1) = Neg Pos + Neg Since P(x, y) = P(x y)p(y), what remains is estimating P(x y). In binary classification, we could now e.g. use the positive examples to build a model for P(x Y = +1) use the negative examples to build a model for P(x Y = 1) To classify a new data point x, we use the Bayes formula P(y x) = P(x y)p(y) P(x) = P(x y)p(y) y P(x y )P(y )

11, Generative vs discriminative learning (3) Examples of discriminative classifiers: logistic regression k-nn decision trees SVM multilayer perceptron (MLP) Examples of generative classifiers: naive Bayes (NB) linear discriminant analysis (LDA) quadratic discriminant analysis (QDA) We will study all of the above except MLP.

Normal distribution For probabilistic models for real-valued features x i R, one basic ingredient is the normal or Gaussian distribution Recall that for a single real-valued random variable, the normal distribution has two parameters µ and σ 2, and density ( ) N (x µ, σ 2 1 ) = exp (x µ)2 2πσ 2 2σ 2 If X has this distribution, then E[X ] = µ and Var[X ] = σ 2 For multivariate case x R p, we shall first consider the case where individual component x i has normal distribution with parameters µ i and σi 2 and the components are independent: p(x) = N (x 1 µ 1, σ 2 1)... N (x p µ p, σ 2 d ) 12,

13, Normal distribution (2) We get p(x) = N (x 1 µ 1, σ1) 2... N (x p µ p, σp) 2 ( ) p 1 = exp (x j µ j ) 2 j=1 2πσj 2 2σj 2 1 = (2π) p/2 exp 1 p (x j µ j ) 2 σ 1... σ p 2 σ 2 j=1 j ( 1 = exp 1 ) (2π) p/2 1/2 Σ 2 (x µ)t Σ 1 (x µ) where µ = (µ 1,..., µ p ) R p and Σ R p p is a diagonal matrix with σ 2 1,..., σ2 p on the diagonal and Σ is determinant of Σ

14, Normal distribution (3) More generally, let µ R p, and let Σ R p p be symmetric: Σ T = Σ positive definite: x T Σx > 0 for all x R { 0 } We then define p-dimensional Gaussian density with parameter µ and Σ as ( 1 N (x µ, Σ) = exp 1 ) (2π) p/2 1/2 Σ 2 (x µ)t Σ 1 (x µ) If Σ is diagonal, we get the special case where x j are independent

15, Normal distribution (4) To understand the multivariate normal distribution, consider a surface of constant density: for some a S = { x R p N (x µ, Σ) = a } By definition of N, this can be written as for some b S = { x R p (x µ) T Σ 1 (x µ) = b } Because Σ is symmetric and positive definite, so is Σ 1, and this set is an ellipsoid with centre µ

16, Normal distribution (5) More specifically, since Σ is symmetric and positive definite, it has an Eigenvalue decomposition Σ = UΛU T where Λ R p p is diagonal and U R T is orthogonal (U T = U 1 ), and further Σ 1 = UΛ 1 U T We then know from analytic geometry that for the ellipsoid S = { x R p (x µ) T Σ 1 (x µ) = b } the directions of the axes are given by the column vectors of U (Eigenvectors of Σ) the squared lengths of the axes are given by the elements of Λ (Eigenvalues of Σ)

17, Normal distribution (6) Let X = (X 1,..., X p ) have normal distribution with parameters µ and Σ Then E[X] = µ and E[(X r µ r )(X s µ s )] = Σ rs Hence, we call the parameter µ the mean and Σ the covariance matrix

18, Normal distribution (7) Let x 1,..., x n, where x i = (x i,1,..., x i,p ), be n independent samples from a p-dimensional normal distribution with unknown mean µ and covariance Σ The maximum likelihood estimates are given by ( ˆµ, ˆΣ) = arg max µ,σ n N (x i µ, Σ) i=1 ˆµ r = 1 n ˆΣ rs = 1 n n i=1 x i,r n (x i,r ˆµ r )(x i,s ˆµ s ) i=1

19, Gaussians in classification LDA and QDA are obtained by modeling positive and negative examples both with their own Gaussian: p(x Y = +1) = N (x µ +, Σ + ) p(x Y = 1) = N (x µ, Σ )) where µ ± and Σ ± are obtained for example as maximum likelihood estimates Decision boundary is given by or equivalently N (x µ +, Σ + ) = N (x µ, Σ ) ln N (x µ +, Σ + ) = ln N (x µ, Σ )

20, Gaussians in classification (2) By substituting the formula for N into and simplifying we get ln N (x µ +, Σ + ) = ln N (x µ, Σ ) (x µ + ) T Σ 1 + (x µ +) (x µ ) T Σ 1 (x µ )+ln Σ Σ + = 0 If Σ + = Σ this is a linear equation, so the decision boundary is a hyperplane: LDA In general case this is a quadratic surface: QDA In QDA, decision regions may be non-connected