An Introduction to Statistical and Probabilistic Linear Models
Maximilian Mozes, Proseminar Data Mining, Fakultät für Informatik, Technische Universität München, June 07, 2017
Introduction
In statistical learning theory, linear models are used for regression and classification tasks.
What is regression?
What is classification?
How can we model such concepts in a mathematical context?
Linear regression
Linear regression - basics
What is regression?
Approximation of data using a (closed) mathematical expression
Achieved by estimating the model parameters that best approximate the data
Linear regression - example
A company changes the price of their products for the nth time.
They know how the price changes affected the consumer behavior n - 1 times before.
Using linear regression, they can predict the consumer behavior for the nth price change.
Linear regression - basics
Let D denote a set of n-dimensional data vectors
Let x be an n-dimensional observation
How can we approximate x?
Linear regression - basics
Create a linear function $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_n x_n$ that approximates $\mathbf{x}$ with weights $\mathbf{w}$.
Linear regression - basics
Example for n = 2: $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2$
Linear regression - basics
Problem: the model is just a linear combination of the raw input values $x_i$, a significant limitation.
Idea: use weighted non-linear functions $\phi_j$ instead:
$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{n} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}),$$
where $\boldsymbol{\phi} = (\phi_0, \dots, \phi_n)$.
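As a minimal sketch of this idea (the Gaussian basis functions, their centres, and the weights below are assumptions made only for illustration):

```python
import numpy as np

def phi(x, centres, width=0.2):
    """Basis vector phi(x): a constant phi_0(x) = 1 followed by
    Gaussian bumps around the given centres (an assumed choice)."""
    return np.concatenate(([1.0], np.exp(-(x - centres) ** 2 / (2.0 * width ** 2))))

def y(x, w, centres):
    """Linear model y(x, w) = w^T phi(x): linear in w, non-linear in x."""
    return w @ phi(x, centres)

centres = np.linspace(0.0, 1.0, 5)                 # hypothetical basis centres
w = np.array([0.5, 1.0, -0.3, 0.2, 0.8, -1.1])     # one weight per basis function (incl. w_0)
print(y(0.3, w, centres))
```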
Polynomial regression
Example: let $\phi_j(x) = x^j$. Then
$$y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{n} w_j x^j = w_0 + w_1 x + w_2 x^2 + \dots + w_n x^n$$
Polynomial regression
Approximation with a 2nd-order polynomial.
Approximation with a 6th-order polynomial.
Approximation with an 8th-order polynomial.
Polynomial regression
Problem: higher polynomial degrees lead to overfitting.
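A small illustration of this effect (the synthetic data, the chosen degrees, and the train/test split below are all assumptions made for the sake of the example): as the degree grows, the training error keeps shrinking while the error on held-out points grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 20))
z = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy targets
x_train, z_train = x[::2], z[::2]     # every other point for fitting
x_test, z_test = x[1::2], z[1::2]     # the remaining points held out

for degree in (1, 3, 9):
    w = np.polyfit(x_train, z_train, degree)                        # least-squares polynomial fit
    rss_train = np.sum((z_train - np.polyval(w, x_train)) ** 2)
    rss_test = np.sum((z_test - np.polyval(w, x_test)) ** 2)
    print(f"degree {degree}: train RSS {rss_train:.3f}, test RSS {rss_test:.3f}")
```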
Linear classification
Linear classification - basics
What is classification?
Aims to partition the data into predefined classes
A class contains observations with similar characteristics
Linear classification - example
We have n different cucumbers and courgettes.
Each record contains the weight and the texture (smooth, rough).
We want to predict the correct label without knowing the real label.
Linear classification - basics
Assume a two-class classification problem
How can we categorise the data into predefined classes?
Linear classification - basics
Cucumbers vs. courgettes
Linear classification - basics
Discriminant functions
Given a dataset $D$, we aim to categorise $\mathbf{x}$ into either class $C_1$ or $C_2$.
Use $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + w_0$ such that $\mathbf{x} \in C_1$ if $y(\mathbf{x}) \geq 0$ and $\mathbf{x} \in C_2$ otherwise.
Linear classification - basics
Decision boundary $H$: $H := \{\mathbf{x} \in D : y(\mathbf{x}) = 0\}$
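A minimal sketch of such a two-class discriminant (the weight vector, bias, and example point are made-up values):

```python
import numpy as np

def classify(x, w, w0):
    """Assign x to C1 if y(x) = w^T x + w0 >= 0, otherwise to C2."""
    return "C1" if w @ x + w0 >= 0 else "C2"

w = np.array([1.5, -0.7])    # hypothetical weight vector
w0 = -0.2                    # hypothetical bias
x = np.array([0.4, 0.3])     # e.g. (weight, texture score) of one vegetable
print(classify(x, w, w0))    # points with w @ x + w0 == 0 lie on the decision boundary H
```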
Non-linearity
However, it often occurs that the data are not linearly separable.
Non-linearity
Solution: use non-linear functions instead!
$y(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$ with $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0, \phi_1, \dots, \phi_{M-1})$
Common classification algorithms
Other commonly-used classification algorithms:
Naive Bayes classifier
Logistic regression (Cox, 1958)
Support vector machines (Vapnik and Lerner, 1963)
Summary
Linear models: linear combinations of weighted (non-)linear functions
Regression: approximation of a given set of data using a closed mathematical representation
Classification: categorisation of data according to individual characteristics and common patterns
Further readings
J. Aldrich, 1997. R. A. Fisher and the Making of Maximum Likelihood 1912-1922. Statistical Science, Vol. 12, No. 3, 162-176.
D. Barber, 2012. Bayesian Reasoning and Machine Learning. Cambridge University Press.
C. M. Bishop, 2006. Pattern Recognition and Machine Learning. Springer.
T. Hastie, R. Tibshirani and J. Friedman, 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer.
K. Murphy, 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts.
A. Y. Ng and M. I. Jordan, 2002. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. Advances in Neural Information Processing Systems.
References
J. Aldrich, 1997. R. A. Fisher and the Making of Maximum Likelihood 1912-1922. Statistical Science, Vol. 12, No. 3, 162-176.
D. Barber, 2012. Bayesian Reasoning and Machine Learning. Cambridge University Press.
C. M. Bishop, 2006. Pattern Recognition and Machine Learning. Springer.
D. R. Cox, 1958. The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society, Vol. XX, No. 2.
T. Hastie, R. Tibshirani and J. Friedman, 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer.
K. Murphy, 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts.
F. Rosenblatt, 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, Vol. 65, No. 6.
Thank you for your attention. Questions?
Backup slides
Sum of least squares
Parameter estimation
Estimating the weight parameters: how do we choose the $w_i$?
Find the set of parameters that maximizes $p(D \mid \mathbf{w})$
Sum of least squares
Method to optimize the weights $\mathbf{w}$
Aims to minimize the residual sum of squares (RSS):
$$\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{N} \big(z_i - y(\mathbf{x}_i, \mathbf{w})\big)^2 = \sum_{i=1}^{N} \Big(z_i - w_0 - \sum_{j=1}^{M-1} x_{ij} w_j\Big)^2$$
Sum of least squares
RSS can be simplified by using an $N \times M$ matrix $X$ with the $\mathbf{x}_i$ as rows
Then $\mathrm{RSS}(\mathbf{w}) = (\mathbf{z} - X\mathbf{w})^\top(\mathbf{z} - X\mathbf{w})$, where $\mathbf{z}$ is the vector of target values
Sum of least squares
Taking the derivatives leads to
$$\frac{\partial \mathrm{RSS}(\mathbf{w})}{\partial \mathbf{w}} = -2 X^\top (\mathbf{z} - X\mathbf{w}), \qquad \frac{\partial^2 \mathrm{RSS}(\mathbf{w})}{\partial \mathbf{w}\,\partial \mathbf{w}^\top} = 2 X^\top X$$
Setting the first derivative to zero and solving for $\mathbf{w}$ results in
$$\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{z}$$
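A sketch of this closed-form solution on randomly generated data (in practice one would typically use np.linalg.lstsq or solve rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])   # first column carries w_0
w_true = np.array([2.0, -1.0, 0.5])
z = X @ w_true + rng.normal(scale=0.1, size=N)                    # noisy targets

# Normal equations: w_hat = (X^T X)^{-1} X^T z, solved without forming the inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ z)
print(w_hat)      # should be close to w_true
```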
Sum of least squares
Note: the approach assumes $X^\top X$ to be positive definite
Therefore, $X$ is assumed to have full column rank
Sum of least squares
The approximation function can be described as $z_i = y(\mathbf{x}_i, \mathbf{w}) + \epsilon$, where $\epsilon$ represents the data noise
RSS provides a measure of the prediction error $E_D$, defined as
$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(z_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2,$$
where $\boldsymbol{\phi} = (\phi_0, \dots, \phi_{M-1})$
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE)
Introduced by Fisher (1922)
Commonly-used method to optimize the model parameters
Maximum Likelihood Estimation
Goal: find the optimal parameters $\mathbf{w}$ such that $p(D \mid \mathbf{w})$ is maximized, i.e.
$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}}\; p(D \mid \mathbf{w})$$
For target values $z_i$, the $p(z_i \mid \mathbf{x}_i, \mathbf{w})$ are assumed to be independent, such that
$$p(D \mid \mathbf{w}) = \prod_{i=1}^{N} p(z_i \mid \mathbf{x}_i, \mathbf{w})$$
holds for all $z_i \in D$.
Maximum Likelihood Estimation
The product makes the equation unwieldy, so we simplify it using the log:
$$\log p(D \mid \mathbf{w}) = \sum_{i=1}^{N} \log p(z_i \mid \mathbf{x}_i, \mathbf{w})$$
We can now set $\nabla_{\mathbf{w}} \log p(D \mid \mathbf{w}) = 0$ and solve for $\mathbf{w}$; the resulting $\hat{\mathbf{w}}$ are the optimal weight parameters.
Maximum Likelihood Estimation
Assumption: the data noise $\epsilon$ follows a Gaussian distribution
Then $p(D \mid \mathbf{w})$ can be described as
$$p(z \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(z \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}),$$
where $\beta$ is the model's inverse variance
Maximum Likelihood Estimation
For $\mathbf{z}$ we get
$$\log p(\mathbf{z} \mid X, \mathbf{w}, \beta) = \sum_{i=1}^{N} \log \mathcal{N}(z_i \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i), \beta^{-1})$$
Maximum Likelihood Estimation
It follows that
$$\nabla_{\mathbf{w}} \log p(\mathbf{z} \mid X, \mathbf{w}, \beta) = \beta \sum_{i=1}^{N} \big(z_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)\, \boldsymbol{\phi}(\mathbf{x}_i)^\top$$
Setting the gradient to zero and solving for $\mathbf{w}$ leads to
$$\mathbf{w}_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{z}$$
Maximum Likelihood Estimation
$\Phi$ is called the design matrix and is given by
$$\Phi = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$$
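As a rough sketch, the same closed form can be evaluated with a concrete design matrix; the polynomial basis phi_j(x) = x^j and the toy data used below are assumed choices, not mandated by the slides:

```python
import numpy as np

def design_matrix(x, M):
    """N x M design matrix Phi with Phi[i, j] = phi_j(x_i) = x_i ** j."""
    return np.column_stack([x ** j for j in range(M)])

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 30)
z = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.size)   # assumed toy data

Phi = design_matrix(x, M=4)
# Maximum likelihood weights under Gaussian noise: w_ML = (Phi^T Phi)^{-1} Phi^T z
w_ml = np.linalg.pinv(Phi) @ z      # the pseudo-inverse evaluates the same closed form
print(w_ml)
```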
Regularization and ridge regression
Ridge regression
Method to prevent overfitting
Adds a regularization term that penalizes large weights
Ridge regression
Regularization term: $\lambda \lVert \mathbf{w} \rVert_2^2$
The error function $E(\mathbf{w})$ becomes
$$E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(z_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2 + \lambda \lVert \mathbf{w} \rVert_2^2$$
Minimizing $E(\mathbf{w})$ and solving for $\mathbf{w}$ results in
$$\hat{\mathbf{w}}_{\mathrm{ridge}} = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top \mathbf{z}$$
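A sketch of the ridge solution with an assumed polynomial basis (the values of lambda are arbitrary here; larger values shrink the weights more strongly):

```python
import numpy as np

def ridge_weights(Phi, z, lam):
    """Ridge regression: w = (lambda * I + Phi^T Phi)^{-1} Phi^T z."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ z)

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 30)
z = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.size)
Phi = np.column_stack([x ** j for j in range(10)])    # deliberately flexible model

for lam in (0.0, 1e-3, 1.0):
    print(lam, np.round(ridge_weights(Phi, z, lam), 2))
```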
Non-linear classification
Non-linearity
In our example, an appropriate feature transformation is $\phi : (x_1, x_2) \mapsto (r\cos(x_1), r\sin(x_2))$, $r \in \mathbb{R}$.
Multi-class classification
Multi-class classification
The discriminant function for two-class classification can be extended to a k-class problem
Use the k-class discriminant $y_k(\mathbf{x}) = \mathbf{w}_k^\top \mathbf{x} + w_{k0}$
$\mathbf{x}$ is assigned to class $C_j$ if $y_j(\mathbf{x}) > y_i(\mathbf{x})$ for all $i \neq j$
Multi-class classification
The decision boundary is then $y_j(\mathbf{x}) = y_i(\mathbf{x})$, which can be transformed to
$$(\mathbf{w}_j - \mathbf{w}_i)^\top \mathbf{x} + w_{j0} - w_{i0} = 0$$
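A minimal sketch of this k-class decision rule (the weight matrix and biases are made-up numbers):

```python
import numpy as np

def predict_class(x, W, w0):
    """Evaluate y_k(x) = w_k^T x + w_k0 for every class and pick the largest."""
    scores = W @ x + w0
    return int(np.argmax(scores))    # index j with y_j(x) > y_i(x) for all i != j

W = np.array([[1.0, -0.5],           # w_1
              [0.2,  0.8],           # w_2
              [-1.0, 0.1]])          # w_3, i.e. k = 3 classes
w0 = np.array([0.0, -0.3, 0.5])      # biases w_k0
x = np.array([0.6, 0.4])
print(predict_class(x, W, w0))
```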
Perceptron algorithm
Perceptron
Given an input vector $\mathbf{x}$ and a fixed non-linear function $\boldsymbol{\phi}(\mathbf{x})$, the class of $\mathbf{x}$ is estimated by $y(\mathbf{x}) = f(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}))$, where $f(t)$ is called the non-linear activation function:
$$f(t) = \begin{cases} +1 & \text{if } t \geq 0, \\ -1 & \text{otherwise.} \end{cases}$$
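A sketch of the perceptron's prediction rule; the feature map below (a constant component followed by the raw inputs) and the weights are assumptions, since the slides leave phi unspecified:

```python
import numpy as np

def activation(t):
    """Step activation f(t): +1 if t >= 0, -1 otherwise."""
    return 1 if t >= 0 else -1

def phi(x):
    """Assumed feature map: a bias component followed by the raw inputs."""
    return np.concatenate(([1.0], x))

def perceptron_predict(x, w):
    """y(x) = f(w^T phi(x))."""
    return activation(w @ phi(x))

w = np.array([0.1, 1.0, -2.0])       # hypothetical weights
print(perceptron_predict(np.array([0.5, 0.2]), w))
```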
Probabilistic generative models
Probabilistic generative models
Compute the probability $p(\mathbf{x}, z)$ directly instead of optimizing the weight parameters
Apply Bayes' theorem to obtain $p(z \mid \mathbf{x})$
Bayes' theorem
For a set of disjoint samples $A_1, \dots, A_n$ and a given sample $B$, the probability $p(A_i \mid B)$, $i \in \{1, \dots, n\}$, can be computed as follows:
$$p(A_i \mid B) = \frac{p(B \mid A_i)\, p(A_i)}{\sum_{j=1}^{n} p(B \mid A_j)\, p(A_j)}$$
Probabilistic generative models
Consider a two-class classification problem for $C_1$ and $C_2$. Then
$$p(C_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_1)\, p(C_1) + p(\mathbf{x} \mid C_2)\, p(C_2)} = \frac{1}{1 + e^{-a}} = \sigma(a),$$
where
$$a = \log \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)} \quad \text{and} \quad \sigma(a) = \frac{1}{1 + e^{-a}}$$
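A sketch of this posterior computation with assumed one-dimensional Gaussian class-conditional densities and priors (the slides do not fix either; any densities would work the same way):

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities p(x | C1), p(x | C2) and priors.
p_x_given_c1 = norm(loc=0.0, scale=1.0).pdf
p_x_given_c2 = norm(loc=2.0, scale=1.0).pdf
prior_c1, prior_c2 = 0.6, 0.4

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def posterior_c1(x):
    """p(C1 | x) = sigma(a) with a = log[p(x|C1) p(C1) / (p(x|C2) p(C2))]."""
    a = np.log(p_x_given_c1(x) * prior_c1) - np.log(p_x_given_c2(x) * prior_c2)
    return sigmoid(a)

print(posterior_c1(0.5))    # high probability for C1 when x lies near the C1 mode
```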
Probabilistic discriminative models
Probabilistic discriminative models
Predict the correct class by directly computing the posterior probability $p(z \mid \mathbf{x})$
This makes the computation of $p(\mathbf{x} \mid z)$ (Bayes' theorem) redundant
Probabilistic discriminative models
Computing the posterior probability $p(C_k \mid \mathbf{x}, \theta^{\mathrm{opt}}_{C_k \mid \mathbf{x}})$ is then achieved by using MLE
Disadvantage: only little knowledge about the given data is required ('black box')
Logistic regression
Logistic regression
Commonly-used classification algorithm for binary classification problems
Assumption: the data noise $\epsilon$ follows a Bernoulli distribution $\mathrm{Ber}(n) = p^n (1-p)^{1-n}$, $n \in \{0, 1\}$, as this distribution is more appropriate for a binary classification problem
Logistic regression
Predict the correct class label based on the probability
$$p(C_k \mid \mathbf{x}, \mathbf{w}) = \mathrm{Ber}(C_k \mid \sigma(\mathbf{w}^\top \mathbf{x})),$$
where $\sigma$ is a squashing function (e.g. the sigmoid)
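A sketch of the prediction step (the weights are assumed to have been fitted already, e.g. by maximum likelihood; the fitting itself is not shown):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(x, w):
    """p(C1 | x, w) = sigma(w^T x); p(C2 | x, w) = 1 - sigma(w^T x)."""
    return sigmoid(w @ x)

def predict_label(x, w, threshold=0.5):
    """Bernoulli model: assign C1 if the predicted probability exceeds the threshold."""
    return "C1" if predict_proba(x, w) >= threshold else "C2"

w = np.array([0.4, -1.2, 2.0])       # hypothetical fitted weights (w[0] acts as the bias)
x = np.array([1.0, 0.3, 0.7])        # input with a leading 1 for the bias term
print(predict_proba(x, w), predict_label(x, w))
```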