Introduction to Machine Learning


1 DATA11002 Introduction to Machine Learning

Lecturer: Antti Ukkonen
TAs: Saska Dönges and Janne Leppä-aho
Department of Computer Science, University of Helsinki
(based in part on material by Patrik Hoyer, Jyrki Kivinen, and Teemu Roos)
November 1st to December 14th, 2018

2 Classification: Probabilistic Methods

3 Statistical learning model (recap)

Recall that $X$ denotes the input variables (features, predictors, etc.) and $Y$ denotes the output variable. In classification, the output space $\mathcal{Y}$ is a set of class labels, e.g. $\{0, 1\}$ or $\{+1, -1\}$.

Assume that there is a fixed but unknown probability distribution $P$ over $\mathcal{X} \times \mathcal{Y}$ such that the pairs $(x_i, y_i)$ are i.i.d. samples from it.

We wish to minimise the generalisation error (also called risk) of $\hat{f}$, which is the expected loss

$E_{(x,y) \sim P}[L(\hat{f}(x), y)]$

where $E_{(x,y) \sim P}[\cdot]$ denotes expectation when a single data point $(x, y)$ is drawn from $P$.
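As a minimal sketch of the risk defined above, the following Python snippet estimates the expected 0-1 loss of a fixed classifier by averaging the loss over many i.i.d. samples. The distribution `sample_from_P`, the classifier `f_hat` and the sample size are hypothetical choices made only for illustration; they do not come from the lecture.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 for a misclassification, 0 otherwise."""
    return (y_pred != y_true).astype(float)

def f_hat(x):
    """Hypothetical fixed classifier: predict 1 when x > 0.5."""
    return (x > 0.5).astype(int)

def sample_from_P(n, rng):
    """Hypothetical P: x ~ Uniform(0, 1); y = 1[x > 0.5], flipped with probability 0.1."""
    x = rng.uniform(0.0, 1.0, size=n)
    clean = (x > 0.5).astype(int)
    flip = rng.uniform(size=n) < 0.1
    y = np.where(flip, 1 - clean, clean)
    return x, y

rng = np.random.default_rng(0)
x, y = sample_from_P(100_000, rng)

# Monte Carlo estimate of E_{(x,y)~P}[L(f_hat(x), y)].
estimated_risk = zero_one_loss(f_hat(x), y).mean()
print(f"estimated risk: {estimated_risk:.3f}")  # the true risk is 0.1 by construction
```

In practice $P$ is unknown, so the same average is computed on held-out data rather than on freshly simulated samples.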

4 Some definitions/notation

$P(X = x_i, Y = y_i)$ denotes the probability of the pair $(x_i, y_i)$. (Sometimes we write $P(x_i, y_i)$ for short.)

$P(Y = y_i \mid X = x_i)$ denotes the conditional probability of observing $y_i$ given $x_i$. (We can also write $P(Y = y_i \mid x_i)$ or $P(y_i \mid x_i)$ for short.)

If we knew $P$ (which in practice we never do!), we could implement an optimal classifier by assigning $x_i$ to the class $y_i$ that maximises $P(y_i \mid x_i)$.

In the following, we are particularly interested in models for $P(Y = y_i \mid x_i)$. These models are almost always wrong, but perhaps they give good predictions anyway!
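To make the optimal classifier concrete, here is a small sketch for a discrete case in which $P$ is assumed to be known. The joint table `joint` is a made-up example, not data from the course.

```python
import numpy as np

# Hypothetical joint distribution P(X = x, Y = y).
# Rows index x in {0, 1, 2}; columns index y in {0, 1}. Entries sum to 1.
joint = np.array([
    [0.30, 0.05],
    [0.10, 0.15],
    [0.05, 0.35],
])

# P(y | x) = P(x, y) / P(x), where P(x) is the row sum.
p_x = joint.sum(axis=1, keepdims=True)
conditional = joint / p_x

# The optimal classifier assigns each x to the class maximising P(y | x).
bayes_classifier = conditional.argmax(axis=1)

# Under 0-1 loss its risk is the total probability mass of the classes not chosen.
bayes_risk = 1.0 - joint.max(axis=1).sum()

print(conditional)
print(bayes_classifier)
print(bayes_risk)
```

No classifier can achieve a lower 0-1 risk on this distribution than `bayes_risk`, which is why models of $P(y \mid x)$ are the object of interest.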

5 Logistic regression

Logistic regression models are linear models for probabilistic binary classification (so not really regression, where the response is continuous).

Given an input vector $x$, the output is the probability that $Y = 1$. Let's denote it by $\hat{p}(Y = 1 \mid x)$.

However, instead of using a linear model directly, as in

$\hat{p}(Y = 1 \mid x) = \beta \cdot x$,

we let

$\log \frac{\hat{p}(Y = 1 \mid x)}{\hat{p}(Y = 0 \mid x)} = \beta \cdot x$.

This amounts to the same as

$\hat{p}(Y = 1 \mid x) = \frac{\exp(\beta \cdot x)}{1 + \exp(\beta \cdot x)} = \frac{1}{\exp(-\beta \cdot x) + 1}$.
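The equivalence between the log-odds form and the logistic form can be checked numerically. The sketch below uses arbitrary, hypothetical values for $\beta$ and $x$; the function name `predict_proba` is chosen here for illustration.

```python
import numpy as np

def predict_proba(beta, x):
    """Return p_hat(Y = 1 | x) = 1 / (exp(-beta . x) + 1)."""
    return 1.0 / (np.exp(-np.dot(beta, x)) + 1.0)

beta = np.array([0.5, -1.0, 2.0])   # hypothetical coefficients
x = np.array([1.0, 0.3, 0.8])       # hypothetical input; the leading 1 acts as an intercept term

p1 = predict_proba(beta, x)
log_odds = np.log(p1 / (1.0 - p1))  # log of p_hat(Y=1 | x) / p_hat(Y=0 | x)

print(p1)
print(log_odds, np.dot(beta, x))    # the two values agree up to rounding
```

Unlike the direct linear model, the output always lies strictly between 0 and 1, whatever the value of $\beta \cdot x$.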

6 Logistic regression (2)

For convenience, we use here the class labels 0 and 1.

Given a probabilistic prediction $\hat{p}(y \mid x)$, and assuming instance $x_i$ has already been observed, the conditional likelihood for a sample point $(x_i, y_i)$ is

$\hat{p}(Y = 1 \mid x_i)$ if $y_i = 1$
$1 - \hat{p}(Y = 1 \mid x_i)$ if $y_i = 0$

which we write as

$\hat{p}(Y = 1 \mid x_i)^{y_i} \, (1 - \hat{p}(Y = 1 \mid x_i))^{1 - y_i}$

7 Logistic regression (3)

The conditional likelihood of a sequence of independent samples $(x_i, y_i)$, $i = 1, \dots, n$, is then

$\prod_{i=1}^{n} \hat{p}(Y = 1 \mid x_i)^{y_i} \, (1 - \hat{p}(Y = 1 \mid x_i))^{1 - y_i}$

We say "conditional" to emphasise that we take the $x_i$ as given and only model the probability of the labels $y_i$.

To maximise the conditional likelihood, we can equivalently maximise the conditional log-likelihood

$\mathrm{LCL}(\beta) = \sum_{i=1}^{n} \left( y_i \ln \hat{p}(Y = 1 \mid x_i) + (1 - y_i) \ln(1 - \hat{p}(Y = 1 \mid x_i)) \right)$

This is the same as log-loss, except that the sign is flipped (i.e., without the minus)!
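The relationship between the conditional likelihood, the conditional log-likelihood and log-loss can be illustrated with a few numbers. The predicted probabilities and labels below are hypothetical.

```python
import numpy as np

def conditional_log_likelihood(p1, y):
    """LCL = sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ], where p_i = p_hat(Y = 1 | x_i)."""
    p1 = np.asarray(p1, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(y * np.log(p1) + (1.0 - y) * np.log(1.0 - p1))

p1 = np.array([0.9, 0.2, 0.7, 0.4])   # hypothetical predictions p_hat(Y = 1 | x_i)
y = np.array([1, 0, 1, 1])            # hypothetical labels in {0, 1}

# Per-point likelihood: p_i if y_i = 1, and 1 - p_i if y_i = 0.
per_point = np.where(y == 1, p1, 1.0 - p1)
likelihood = np.prod(per_point)

lcl = conditional_log_likelihood(p1, y)
log_loss = -lcl                       # log-loss is LCL with the sign flipped

print(likelihood, np.exp(lcl))        # identical up to rounding
print(log_loss)
```

Working with the sum of logarithms rather than the product avoids numerical underflow when $n$ is large.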

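8 Logistic regression (4)

Maximising the likelihood (or minimising log-loss) isn't as straightforward as in the case of linear regression, since there is no closed-form solution.

Nevertheless, minimising log-loss is a convex problem, so gradient-based techniques can find a global optimum. Standard implementations are available in R, Python, Matlab, ...

Logistic regression is often used with regularisation, as in linear regression:

ridge: $\arg\max_\beta \left( \mathrm{LCL}(\beta) - \lambda \|\beta\|_2^2 \right)$
lasso: $\arg\max_\beta \left( \mathrm{LCL}(\beta) - \lambda \|\beta\|_1 \right)$

In particular, if the data is linearly separable, the non-regularised solution does not exist as a finite vector: the norm of $\beta$ tends to infinity.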
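A minimal sketch of one gradient-based technique is plain gradient ascent on the ridge-penalised LCL. The step size, iteration count and toy data are assumptions chosen for illustration; library implementations typically use more sophisticated optimisers. The gradient of LCL with respect to $\beta$ is $\sum_i (y_i - \hat{p}(Y = 1 \mid x_i)) \, x_i$, and the ridge penalty contributes $-2\lambda\beta$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalised_lcl(beta, X, y, lam):
    """LCL(beta) - lam * ||beta||_2^2 for labels y in {0, 1}."""
    p = sigmoid(X @ beta)
    eps = 1e-12  # guards against log(0)
    lcl = np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return lcl - lam * np.sum(beta ** 2)

def fit_ridge_logistic(X, y, lam=1.0, step=0.01, n_iter=2000):
    """Gradient ascent with a fixed step size (a simple, not a tuned, optimiser)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p) - 2.0 * lam * beta
        beta += step * grad
    return beta

# Hypothetical linearly separable toy data; the first column is a constant intercept feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

beta_reg = fit_ridge_logistic(X, y, lam=1.0)
beta_weak = fit_ridge_logistic(X, y, lam=0.001)

print(beta_reg, np.linalg.norm(beta_reg))
print(beta_weak, np.linalg.norm(beta_weak))
```

On separable data such as this, weakening the penalty lets the coefficients keep growing with further iterations, which is the divergence described above.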

9 Generative vs discriminative learning

Logistic regression was an example of a discriminative and probabilistic classifier that directly models the class distribution $P(y \mid x)$.

Another probabilistic way to approach the problem is to use generative learning, which builds a model for the whole joint distribution $P(x, y)$, often using the decomposition $P(y)P(x \mid y)$.

Both approaches have their pros and cons:

Discriminative learning: you only solve the task that you need to solve; may provide better accuracy since it focuses on the specific learning task; optimisation tends to be harder.

Generative learning: it is often more natural to build models for $P(x \mid y)$ than for $P(y \mid x)$; handles missing data more naturally; optimisation is often easier.

10 Generative vs discriminative learning (2)

Estimating the class prior $P(y)$ is usually simple. For example, in binary classification, this time with $Y \in \{-1, +1\}$, we can usually just count the number of positive examples Pos and negative examples Neg and set

$P(Y = +1) = \frac{\mathrm{Pos}}{\mathrm{Pos} + \mathrm{Neg}}$ and $P(Y = -1) = \frac{\mathrm{Neg}}{\mathrm{Pos} + \mathrm{Neg}}$

Since $P(x, y) = P(x \mid y)P(y)$, what remains is estimating $P(x \mid y)$. In binary classification, we could now e.g.
use the positive examples to build a model for $P(x \mid Y = +1)$
use the negative examples to build a model for $P(x \mid Y = -1)$

To classify a new data point $x$, we use the Bayes formula

$P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)} = \frac{P(x \mid y)P(y)}{\sum_{y'} P(x \mid y')P(y')}$
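The two steps, counting to get the class prior and applying the Bayes formula, can be sketched as follows. The training labels and the class-conditional values $P(x \mid y)$ are hypothetical numbers; in a real generative classifier the latter would come from a fitted model.

```python
import numpy as np

def class_priors(labels):
    """Estimate P(Y = +1) and P(Y = -1) by counting labels in {-1, +1}."""
    labels = np.asarray(labels)
    pos = np.sum(labels == 1)
    neg = np.sum(labels == -1)
    return {+1: pos / (pos + neg), -1: neg / (pos + neg)}

def posterior(likelihoods, priors):
    """Bayes formula: P(y | x) = P(x | y) P(y) / sum_{y'} P(x | y') P(y')."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

labels = [+1, -1, -1, +1, -1, -1, -1, +1]   # hypothetical training labels
priors = class_priors(labels)                # P(+1) = 3/8, P(-1) = 5/8

# Hypothetical class-conditional values P(x | y) for one new point x.
likelihoods = {+1: 0.20, -1: 0.04}
post = posterior(likelihoods, priors)

print(priors)
print(post)
```

Here the point is assigned to the positive class even though the prior favours the negative class, because $P(x \mid Y = +1)$ is five times larger than $P(x \mid Y = -1)$.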

11 Generative vs discriminative learning (3)

Examples of discriminative classifiers:
logistic regression
k-NN
decision trees
SVM
multilayer perceptron (MLP)

Examples of generative classifiers:
naive Bayes (NB)
linear discriminant analysis (LDA)
quadratic discriminant analysis (QDA)

We will study all of the above except MLP.

12 Normal distribution

For probabilistic models for real-valued features $x_i \in \mathbb{R}$, one basic ingredient is the normal or Gaussian distribution.

Recall that for a single real-valued random variable, the normal distribution has two parameters $\mu$ and $\sigma^2$, and density

$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$

If $X$ has this distribution, then $E[X] = \mu$ and $\mathrm{Var}[X] = \sigma^2$.

For the multivariate case $x \in \mathbb{R}^p$, we shall first consider the case where each individual component $x_j$ has a normal distribution with parameters $\mu_j$ and $\sigma_j^2$ and the components are independent:

$p(x) = N(x_1 \mid \mu_1, \sigma_1^2) \cdots N(x_p \mid \mu_p, \sigma_p^2)$

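13 Normal distribution (2)

We get

$p(x) = N(x_1 \mid \mu_1, \sigma_1^2) \cdots N(x_p \mid \mu_p, \sigma_p^2)$

$= \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right)$

$= \frac{1}{(2\pi)^{p/2} \sigma_1 \cdots \sigma_p} \exp\left( -\frac{1}{2} \sum_{j=1}^{p} \frac{(x_j - \mu_j)^2}{\sigma_j^2} \right)$

$= \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$

where $\mu = (\mu_1, \dots, \mu_p) \in \mathbb{R}^p$, $\Sigma \in \mathbb{R}^{p \times p}$ is a diagonal matrix with $\sigma_1^2, \dots, \sigma_p^2$ on the diagonal, and $|\Sigma|$ is the determinant of $\Sigma$.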
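The identity between the product of univariate densities and the matrix form with a diagonal $\Sigma$ can be verified numerically. The parameter values and the function names `normal_pdf` and `mvn_pdf` are illustrative assumptions.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Univariate density N(x | mu, sigma^2); works elementwise on arrays."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def mvn_pdf(x, mu, Sigma):
    """p-dimensional density N(x | mu, Sigma) for symmetric positive definite Sigma."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2.0 * np.pi) ** (p / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0, -2.0])       # hypothetical means
sigma2 = np.array([1.0, 0.25, 4.0])   # hypothetical variances
x = np.array([0.5, 1.2, -1.0])

product_form = np.prod(normal_pdf(x, mu, sigma2))
matrix_form = mvn_pdf(x, mu, np.diag(sigma2))

print(product_form, matrix_form)      # equal up to rounding
```

The function `mvn_pdf` does not rely on $\Sigma$ being diagonal, so it also evaluates the matrix form of the density for any symmetric positive definite $\Sigma$.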

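14 Normal distribution (3)

More generally, let $\mu \in \mathbb{R}^p$, and let $\Sigma \in \mathbb{R}^{p \times p}$ be
symmetric: $\Sigma^T = \Sigma$
positive definite: $x^T \Sigma x > 0$ for all $x \in \mathbb{R}^p \setminus \{0\}$

We then define the $p$-dimensional Gaussian density with parameters $\mu$ and $\Sigma$ as

$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$

If $\Sigma$ is diagonal, we get the special case where the $x_j$ are independent.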

15 Normal distribution (4)

To understand the multivariate normal distribution, consider a surface of constant density:

$S = \{ x \in \mathbb{R}^p \mid N(x \mid \mu, \Sigma) = a \}$ for some $a$

By the definition of $N$, this can be written as

$S = \{ x \in \mathbb{R}^p \mid (x - \mu)^T \Sigma^{-1} (x - \mu) = b \}$ for some $b$

Because $\Sigma$ is symmetric and positive definite, so is $\Sigma^{-1}$, and this set is an ellipsoid with centre $\mu$.

16 Normal distribution (5)

More specifically, since $\Sigma$ is symmetric and positive definite, it has an eigenvalue decomposition

$\Sigma = U \Lambda U^T$

where $\Lambda \in \mathbb{R}^{p \times p}$ is diagonal and $U \in \mathbb{R}^{p \times p}$ is orthogonal ($U^T = U^{-1}$), and further

$\Sigma^{-1} = U \Lambda^{-1} U^T$

We then know from analytic geometry that for the ellipsoid

$S = \{ x \in \mathbb{R}^p \mid (x - \mu)^T \Sigma^{-1} (x - \mu) = b \}$

the directions of the axes are given by the column vectors of $U$ (eigenvectors of $\Sigma$)
the squared lengths of the semi-axes are given, up to the common factor $b$, by the diagonal elements of $\Lambda$ (eigenvalues of $\Sigma$)
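The geometric statement can be checked with NumPy's `numpy.linalg.eigh`, which computes the eigendecomposition of a symmetric matrix. The matrix `Sigma`, the centre `mu` and the level `b` below are hypothetical. The sketch verifies that the points $\mu + \sqrt{b \lambda_j} \, u_j$ lie on the ellipsoid, where $u_j$ is the $j$-th column of $U$.

```python
import numpy as np

Sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])     # hypothetical symmetric positive definite matrix
mu = np.array([1.0, -1.0])
b = 2.0

# eigh returns eigenvalues in ascending order and orthonormal eigenvectors as columns of U.
eigvals, U = np.linalg.eigh(Sigma)

reconstructed = U @ np.diag(eigvals) @ U.T            # U Lambda U^T
inverse_from_eig = U @ np.diag(1.0 / eigvals) @ U.T   # U Lambda^{-1} U^T

def quad_form(x):
    """(x - mu)^T Sigma^{-1} (x - mu)"""
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

# End points of the semi-axes: centre plus sqrt(b * lambda_j) along eigenvector u_j.
axis_points = [mu + np.sqrt(b * lam) * U[:, j] for j, lam in enumerate(eigvals)]
axis_values = [quad_form(pt) for pt in axis_points]

print(eigvals)
print(axis_values)   # each value equals b up to rounding
```

The direction with the largest eigenvalue is therefore the direction in which the density falls off most slowly.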

17 Normal distribution (6)

Let $X = (X_1, \dots, X_p)$ have a normal distribution with parameters $\mu$ and $\Sigma$. Then

$E[X] = \mu$ and $E[(X_r - \mu_r)(X_s - \mu_s)] = \Sigma_{rs}$

Hence, we call the parameter $\mu$ the mean and $\Sigma$ the covariance matrix.

18 Normal distribution (7)

Let $x_1, \dots, x_n$, where $x_i = (x_{i,1}, \dots, x_{i,p})$, be $n$ independent samples from a $p$-dimensional normal distribution with unknown mean $\mu$ and covariance $\Sigma$.

The maximum likelihood estimates

$(\hat{\mu}, \hat{\Sigma}) = \arg\max_{\mu, \Sigma} \prod_{i=1}^{n} N(x_i \mid \mu, \Sigma)$

are given by

$\hat{\mu}_r = \frac{1}{n} \sum_{i=1}^{n} x_{i,r}$

$\hat{\Sigma}_{rs} = \frac{1}{n} \sum_{i=1}^{n} (x_{i,r} - \hat{\mu}_r)(x_{i,s} - \hat{\mu}_s)$
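A short sketch of these estimates in NumPy follows. The true parameters and the sample size are hypothetical; the function name `gaussian_mle` is chosen here for illustration. Note the divisor $n$ rather than $n - 1$, which corresponds to `bias=True` in `numpy.cov`.

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood estimates of mean and covariance; rows of X are samples."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    centred = X - mu_hat
    Sigma_hat = centred.T @ centred / n   # divisor n, not n - 1
    return mu_hat, Sigma_hat

rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])                 # hypothetical true mean
true_Sigma = np.array([[2.0, 0.6],
                       [0.6, 1.0]])             # hypothetical true covariance
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)

mu_hat, Sigma_hat = gaussian_mle(X)

print(mu_hat)
print(Sigma_hat)
print(np.cov(X, rowvar=False, bias=True))       # matches Sigma_hat
```

With 5000 samples the estimates are close to, but not exactly equal to, the parameters used to generate the data.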

19 Gaussians in classification

LDA and QDA are obtained by modelling the positive and the negative examples each with their own Gaussian:

$p(x \mid Y = +1) = N(x \mid \mu_+, \Sigma_+)$
$p(x \mid Y = -1) = N(x \mid \mu_-, \Sigma_-)$

where $\mu_\pm$ and $\Sigma_\pm$ are obtained, for example, as maximum likelihood estimates.

The decision boundary (taking the class priors to be equal) is given by

$N(x \mid \mu_+, \Sigma_+) = N(x \mid \mu_-, \Sigma_-)$

or equivalently

$\ln N(x \mid \mu_+, \Sigma_+) = \ln N(x \mid \mu_-, \Sigma_-)$

20 Gaussians in classification (2)

By substituting the formula for $N$ into

$\ln N(x \mid \mu_+, \Sigma_+) = \ln N(x \mid \mu_-, \Sigma_-)$

and simplifying we get

$(x - \mu_+)^T \Sigma_+^{-1} (x - \mu_+) - (x - \mu_-)^T \Sigma_-^{-1} (x - \mu_-) + \ln \frac{|\Sigma_+|}{|\Sigma_-|} = 0$

If $\Sigma_+ = \Sigma_-$, the quadratic terms cancel and this is a linear equation, so the decision boundary is a hyperplane: LDA.

In the general case this is a quadratic surface: QDA.

In QDA, the decision regions may be disconnected.
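The boundary equation can be turned into a small classifier sketch. All parameter values below are hypothetical, and, as in the equation above, the class priors are not included. The function `boundary_function` evaluates the left-hand side of the equation; it equals $-2$ times the difference of the two log-densities, so a negative value means the positive-class density is larger.

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """ln N(x | mu, Sigma)"""
    p = len(mu)
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

def boundary_function(x, mu_pos, Sigma_pos, mu_neg, Sigma_neg):
    """Left-hand side of the boundary equation; zero exactly on the decision boundary."""
    dp = x - mu_pos
    dn = x - mu_neg
    _, logdet_pos = np.linalg.slogdet(Sigma_pos)
    _, logdet_neg = np.linalg.slogdet(Sigma_neg)
    return (dp @ np.linalg.solve(Sigma_pos, dp)
            - dn @ np.linalg.solve(Sigma_neg, dn)
            + logdet_pos - logdet_neg)

def classify(x, mu_pos, Sigma_pos, mu_neg, Sigma_neg):
    """Predict +1 where the positive-class density is larger, else -1."""
    return +1 if boundary_function(x, mu_pos, Sigma_pos, mu_neg, Sigma_neg) < 0 else -1

mu_pos = np.array([1.0, 1.0])
mu_neg = np.array([-1.0, -1.0])
Sigma_shared = np.array([[1.0, 0.3], [0.3, 1.0]])   # LDA: same covariance for both classes
Sigma_neg_qda = np.array([[2.0, 0.0], [0.0, 0.5]])  # QDA: a different covariance for the negatives

a = np.array([0.2, -1.0])
b = np.array([1.5, 0.7])

# With a shared covariance the function is affine in x: its value at a midpoint is the mean of the end values.
lda_mid = boundary_function((a + b) / 2, mu_pos, Sigma_shared, mu_neg, Sigma_shared)
lda_avg = 0.5 * (boundary_function(a, mu_pos, Sigma_shared, mu_neg, Sigma_shared)
                 + boundary_function(b, mu_pos, Sigma_shared, mu_neg, Sigma_shared))

qda_value = boundary_function(a, mu_pos, Sigma_shared, mu_neg, Sigma_neg_qda)
qda_check = -2.0 * (log_gaussian(a, mu_pos, Sigma_shared) - log_gaussian(a, mu_neg, Sigma_neg_qda))

print(lda_mid, lda_avg)        # equal up to rounding
print(qda_value, qda_check)    # equal up to rounding
print(classify(b, mu_pos, Sigma_shared, mu_neg, Sigma_shared))
```

In the shared-covariance case the midpoint between the two means lies on the boundary, which is one quick way to sanity-check an LDA implementation when the priors are equal.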
