An Introduction to Statistical and Probabilistic Linear Models
Maximilian Mozes, Proseminar Data Mining, Fakultät für Informatik, Technische Universität München, June 07, 2017
Introduction
In statistical learning theory, linear models are used for regression and classification tasks.
What is regression?
What is classification?
How can we model such concepts in a mathematical context?
Linear regression
Linear regression - basics
What is regression?
Approximation of data using a (closed) mathematical expression
Achieved by estimating the model parameters that best approximate the data
Linear regression - example
A company changes the price of their products for the nth time.
They know how the price changes affected the consumer behavior n - 1 times before.
Using linear regression, they can predict the consumer behavior for the nth price change.
Linear regression - basics
Let D denote a set of n-dimensional data vectors
Let x be an n-dimensional observation
How can we approximate x?
Linear regression - basics
Create a linear function $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_n x_n$ that approximates $\mathbf{x}$ with weights $\mathbf{w}$.
Linear regression - basics
Example for n = 2: $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2$
Linear regression - basics
Problem: the model is just a linear combination of the raw input values $x_i$, a significant limitation.
Idea: use weighted non-linear functions $\phi_j$ instead:
$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{n} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}),$$
where $\boldsymbol{\phi} = (\phi_0, \dots, \phi_n)$.
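As a minimal sketch of this idea (the Gaussian basis functions, their centres, and the weights below are assumptions made only for illustration):

```python
import numpy as np

def phi(x, centres, width=0.2):
    """Basis vector phi(x): a constant phi_0(x) = 1 followed by
    Gaussian bumps around the given centres (an assumed choice)."""
    return np.concatenate(([1.0], np.exp(-(x - centres) ** 2 / (2.0 * width ** 2))))

def y(x, w, centres):
    """Linear model y(x, w) = w^T phi(x): linear in w, non-linear in x."""
    return w @ phi(x, centres)

centres = np.linspace(0.0, 1.0, 5)                 # hypothetical basis centres
w = np.array([0.5, 1.0, -0.3, 0.2, 0.8, -1.1])     # one weight per basis function (incl. w_0)
print(y(0.3, w, centres))
```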
Polynomial regression
Example: let $\phi_j(x) = x^j$. Then
$$y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{n} w_j x^j = w_0 + w_1 x + w_2 x^2 + \dots + w_n x^n$$
Polynomial regression
Approximation with a 2nd-order polynomial.
Approximation with a 6th-order polynomial.
Approximation with an 8th-order polynomial.
Polynomial regression
Problem: higher polynomial degrees lead to overfitting.
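A small illustration of this effect (the synthetic data, the chosen degrees, and the train/test split below are all assumptions made for the sake of the example): as the degree grows, the training error keeps shrinking while the error on held-out points grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 20))
z = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy targets
x_train, z_train = x[::2], z[::2]     # every other point for fitting
x_test, z_test = x[1::2], z[1::2]     # the remaining points held out

for degree in (1, 3, 9):
    w = np.polyfit(x_train, z_train, degree)                        # least-squares polynomial fit
    rss_train = np.sum((z_train - np.polyval(w, x_train)) ** 2)
    rss_test = np.sum((z_test - np.polyval(w, x_test)) ** 2)
    print(f"degree {degree}: train RSS {rss_train:.3f}, test RSS {rss_test:.3f}")
```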
Linear classification
Linear classification - basics
What is classification?
Aims to partition the data into predefined classes
A class contains observations with similar characteristics
Linear classification - example
We have n different cucumbers and courgettes.
Each record contains the weight and the texture (smooth, rough).
We want to predict the correct label without knowing the real label.
Linear classification - basics
Assume a two-class classification problem
How can we categorise the data into predefined classes?
Linear classification - basics
Cucumbers vs. courgettes
Linear classification - basics
Discriminant functions
Given a dataset $D$, we aim to categorise $\mathbf{x}$ into either class $C_1$ or $C_2$.
Use $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + w_0$ such that $\mathbf{x} \in C_1$ if $y(\mathbf{x}) \geq 0$ and $\mathbf{x} \in C_2$ otherwise.
Linear classification - basics
Decision boundary $H$: $H := \{\mathbf{x} \in D : y(\mathbf{x}) = 0\}$
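A minimal sketch of such a two-class discriminant (the weight vector, bias, and example point are made-up values):

```python
import numpy as np

def classify(x, w, w0):
    """Assign x to C1 if y(x) = w^T x + w0 >= 0, otherwise to C2."""
    return "C1" if w @ x + w0 >= 0 else "C2"

w = np.array([1.5, -0.7])    # hypothetical weight vector
w0 = -0.2                    # hypothetical bias
x = np.array([0.4, 0.3])     # e.g. (weight, texture score) of one vegetable
print(classify(x, w, w0))    # points with w @ x + w0 == 0 lie on the decision boundary H
```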
Non-linearity
However, it often occurs that the data are not linearly separable.
Non-linearity
Solution: use non-linear functions instead!
$y(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$ with $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0, \phi_1, \dots, \phi_{M-1})$
Common classification algorithms
Other commonly-used classification algorithms:
Naive Bayes classifier
Logistic regression (Cox, 1958)
Support vector machines (Vapnik and Lerner, 1963)
Summary
Linear models: linear combinations of weighted (non-)linear functions
Regression: approximation of a given set of data using a closed mathematical representation
Classification: categorisation of data according to individual characteristics and common patterns
Further readings
J. Aldrich, 1997. R. A. Fisher and the Making of Maximum Likelihood 1912-1922. Statistical Science, Vol. 12, No. 3, 162-176.
D. Barber, 2012. Bayesian Reasoning and Machine Learning. Cambridge University Press.
C. M. Bishop, 2006. Pattern Recognition and Machine Learning. Springer.
T. Hastie, R. Tibshirani and J. Friedman, 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer.
K. Murphy, 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts.
A. Y. Ng and M. I. Jordan, 2002. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. Advances in Neural Information Processing Systems.
References
J. Aldrich, 1997. R. A. Fisher and the Making of Maximum Likelihood 1912-1922. Statistical Science, Vol. 12, No. 3, 162-176.
D. Barber, 2012. Bayesian Reasoning and Machine Learning. Cambridge University Press.
C. M. Bishop, 2006. Pattern Recognition and Machine Learning. Springer.
D. R. Cox, 1958. The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society, Vol. XX, No. 2.
T. Hastie, R. Tibshirani and J. Friedman, 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer.
K. Murphy, 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts.
F. Rosenblatt, 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, Vol. 65, No. 6.
Thank you for your attention. Questions?
Backup slides
Sum of least squares
Parameter estimation
Estimating the weight parameters: how do we choose the $w_i$?
Find the set of parameters that maximizes $p(D \mid \mathbf{w})$
Sum of least squares
Method to optimize the weights $\mathbf{w}$
Aims to minimize the residual sum of squares (RSS):
$$\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{N} \big(z_i - y(\mathbf{x}_i, \mathbf{w})\big)^2 = \sum_{i=1}^{N} \Big(z_i - w_0 - \sum_{j=1}^{M-1} x_{ij} w_j\Big)^2$$
Sum of least squares
RSS can be simplified by using an $N \times M$ matrix $X$ with the $\mathbf{x}_i$ as rows
Then $\mathrm{RSS}(\mathbf{w}) = (\mathbf{z} - X\mathbf{w})^\top(\mathbf{z} - X\mathbf{w})$, where $\mathbf{z}$ is the vector of target values
Sum of least squares
Taking the derivatives leads to
$$\frac{\partial \mathrm{RSS}(\mathbf{w})}{\partial \mathbf{w}} = -2 X^\top (\mathbf{z} - X\mathbf{w}), \qquad \frac{\partial^2 \mathrm{RSS}(\mathbf{w})}{\partial \mathbf{w}\,\partial \mathbf{w}^\top} = 2 X^\top X$$
Setting the first derivative to zero and solving for $\mathbf{w}$ results in
$$\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{z}$$
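A sketch of this closed-form solution on randomly generated data (in practice one would typically use np.linalg.lstsq or solve rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])   # first column carries w_0
w_true = np.array([2.0, -1.0, 0.5])
z = X @ w_true + rng.normal(scale=0.1, size=N)                    # noisy targets

# Normal equations: w_hat = (X^T X)^{-1} X^T z, solved without forming the inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ z)
print(w_hat)      # should be close to w_true
```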
Sum of least squares
Note: the approach assumes $X^\top X$ to be positive definite
Therefore, $X$ is assumed to have full column rank
Sum of least squares
The approximation function can be described as $z_i = y(\mathbf{x}_i, \mathbf{w}) + \epsilon$, where $\epsilon$ represents the data noise
RSS provides a measure of the prediction error $E_D$, defined as
$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(z_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2,$$
where $\boldsymbol{\phi} = (\phi_0, \dots, \phi_{M-1})$
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE)
Introduced by Fisher (1922)
Commonly-used method to optimize the model parameters
Maximum Likelihood Estimation
Goal: find the optimal parameters $\mathbf{w}$ such that $p(D \mid \mathbf{w})$ is maximized, i.e.
$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}}\; p(D \mid \mathbf{w})$$
For target values $z_i$, the $p(z_i \mid \mathbf{x}_i, \mathbf{w})$ are assumed to be independent, such that
$$p(D \mid \mathbf{w}) = \prod_{i=1}^{N} p(z_i \mid \mathbf{x}_i, \mathbf{w})$$
holds for all $z_i \in D$.
Maximum Likelihood Estimation
The product makes the equation unwieldy, so we simplify it using the log:
$$\log p(D \mid \mathbf{w}) = \sum_{i=1}^{N} \log p(z_i \mid \mathbf{x}_i, \mathbf{w})$$
We can now set $\nabla_{\mathbf{w}} \log p(D \mid \mathbf{w}) = 0$ and solve for $\mathbf{w}$; the resulting $\hat{\mathbf{w}}$ are the optimal weight parameters.
Maximum Likelihood Estimation
Assumption: the data noise $\epsilon$ follows a Gaussian distribution
Then $p(D \mid \mathbf{w})$ can be described as
$$p(z \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(z \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}),$$
where $\beta$ is the model's inverse variance
Maximum Likelihood Estimation
For $\mathbf{z}$ we get
$$\log p(\mathbf{z} \mid X, \mathbf{w}, \beta) = \sum_{i=1}^{N} \log \mathcal{N}(z_i \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i), \beta^{-1})$$
Maximum Likelihood Estimation
It follows that
$$\nabla_{\mathbf{w}} \log p(\mathbf{z} \mid X, \mathbf{w}, \beta) = \beta \sum_{i=1}^{N} \big(z_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)\, \boldsymbol{\phi}(\mathbf{x}_i)^\top$$
Setting the gradient to zero and solving for $\mathbf{w}$ leads to
$$\mathbf{w}_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{z}$$
Maximum Likelihood Estimation
$\Phi$ is called the design matrix and is given by
$$\Phi = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$$
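As a rough sketch, the same closed form can be evaluated with a concrete design matrix; the polynomial basis phi_j(x) = x^j and the toy data used below are assumed choices, not mandated by the slides:

```python
import numpy as np

def design_matrix(x, M):
    """N x M design matrix Phi with Phi[i, j] = phi_j(x_i) = x_i ** j."""
    return np.column_stack([x ** j for j in range(M)])

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 30)
z = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.size)   # assumed toy data

Phi = design_matrix(x, M=4)
# Maximum likelihood weights under Gaussian noise: w_ML = (Phi^T Phi)^{-1} Phi^T z
w_ml = np.linalg.pinv(Phi) @ z      # the pseudo-inverse evaluates the same closed form
print(w_ml)
```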
Regularization and ridge regression
Ridge regression
Method to prevent overfitting
Adds a regularization term that penalizes large weights
Ridge regression
Regularization term: $\lambda \lVert \mathbf{w} \rVert_2^2$
The error function $E(\mathbf{w})$ becomes
$$E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(z_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2 + \lambda \lVert \mathbf{w} \rVert_2^2$$
Minimizing $E(\mathbf{w})$ and solving for $\mathbf{w}$ results in
$$\hat{\mathbf{w}}_{\mathrm{ridge}} = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top \mathbf{z}$$
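A sketch of the ridge solution with an assumed polynomial basis (the values of lambda are arbitrary here; larger values shrink the weights more strongly):

```python
import numpy as np

def ridge_weights(Phi, z, lam):
    """Ridge regression: w = (lambda * I + Phi^T Phi)^{-1} Phi^T z."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ z)

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 30)
z = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.size)
Phi = np.column_stack([x ** j for j in range(10)])    # deliberately flexible model

for lam in (0.0, 1e-3, 1.0):
    print(lam, np.round(ridge_weights(Phi, z, lam), 2))
```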
Non-linear classification
Non-linearity
In our example, an appropriate feature transformation is $\phi : (x_1, x_2) \mapsto (r\cos(x_1), r\sin(x_2))$, $r \in \mathbb{R}$.
Multi-class classification
Multi-class classification
The discriminant function for two-class classification can be extended to a k-class problem
Use the k-class discriminant $y_k(\mathbf{x}) = \mathbf{w}_k^\top \mathbf{x} + w_{k0}$
$\mathbf{x}$ is assigned to class $C_j$ if $y_j(\mathbf{x}) > y_i(\mathbf{x})$ for all $i \neq j$
Multi-class classification
The decision boundary is then $y_j(\mathbf{x}) = y_i(\mathbf{x})$, which can be transformed to
$$(\mathbf{w}_j - \mathbf{w}_i)^\top \mathbf{x} + w_{j0} - w_{i0} = 0$$
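A minimal sketch of this k-class decision rule (the weight matrix and biases are made-up numbers):

```python
import numpy as np

def predict_class(x, W, w0):
    """Evaluate y_k(x) = w_k^T x + w_k0 for every class and pick the largest."""
    scores = W @ x + w0
    return int(np.argmax(scores))    # index j with y_j(x) > y_i(x) for all i != j

W = np.array([[1.0, -0.5],           # w_1
              [0.2,  0.8],           # w_2
              [-1.0, 0.1]])          # w_3, i.e. k = 3 classes
w0 = np.array([0.0, -0.3, 0.5])      # biases w_k0
x = np.array([0.6, 0.4])
print(predict_class(x, W, w0))
```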
Perceptron algorithm
Perceptron
Given an input vector $\mathbf{x}$ and a fixed non-linear function $\boldsymbol{\phi}(\mathbf{x})$, the class of $\mathbf{x}$ is estimated by $y(\mathbf{x}) = f(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}))$, where $f(t)$ is called the non-linear activation function:
$$f(t) = \begin{cases} +1 & \text{if } t \geq 0, \\ -1 & \text{otherwise.} \end{cases}$$
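A sketch of the perceptron's prediction rule; the feature map below (a constant component followed by the raw inputs) and the weights are assumptions, since the slides leave phi unspecified:

```python
import numpy as np

def activation(t):
    """Step activation f(t): +1 if t >= 0, -1 otherwise."""
    return 1 if t >= 0 else -1

def phi(x):
    """Assumed feature map: a bias component followed by the raw inputs."""
    return np.concatenate(([1.0], x))

def perceptron_predict(x, w):
    """y(x) = f(w^T phi(x))."""
    return activation(w @ phi(x))

w = np.array([0.1, 1.0, -2.0])       # hypothetical weights
print(perceptron_predict(np.array([0.5, 0.2]), w))
```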
Probabilistic generative models
Probabilistic generative models
Compute the probability $p(\mathbf{x}, z)$ directly instead of optimizing the weight parameters
Apply Bayes' theorem to obtain $p(z \mid \mathbf{x})$
Bayes' theorem
For a set of disjoint samples $A_1, \dots, A_n$ and a given sample $B$, the probability $p(A_i \mid B)$, $i \in \{1, \dots, n\}$, can be computed as follows:
$$p(A_i \mid B) = \frac{p(B \mid A_i)\, p(A_i)}{\sum_{j=1}^{n} p(B \mid A_j)\, p(A_j)}$$
Probabilistic generative models
Consider a two-class classification problem for $C_1$ and $C_2$. Then
$$p(C_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_1)\, p(C_1) + p(\mathbf{x} \mid C_2)\, p(C_2)} = \frac{1}{1 + e^{-a}} = \sigma(a),$$
where
$$a = \log \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)} \quad \text{and} \quad \sigma(a) = \frac{1}{1 + e^{-a}}$$
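A sketch of this posterior computation with assumed one-dimensional Gaussian class-conditional densities and priors (the slides do not fix either; any densities would work the same way):

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities p(x | C1), p(x | C2) and priors.
p_x_given_c1 = norm(loc=0.0, scale=1.0).pdf
p_x_given_c2 = norm(loc=2.0, scale=1.0).pdf
prior_c1, prior_c2 = 0.6, 0.4

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def posterior_c1(x):
    """p(C1 | x) = sigma(a) with a = log[p(x|C1) p(C1) / (p(x|C2) p(C2))]."""
    a = np.log(p_x_given_c1(x) * prior_c1) - np.log(p_x_given_c2(x) * prior_c2)
    return sigmoid(a)

print(posterior_c1(0.5))    # high probability for C1 when x lies near the C1 mode
```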
Probabilistic discriminative models
Probabilistic discriminative models
Predict the correct class by directly computing the posterior probability $p(z \mid \mathbf{x})$
This makes the computation of $p(\mathbf{x} \mid z)$ (Bayes' theorem) redundant
Probabilistic discriminative models
Computing the posterior probability $p(C_k \mid \mathbf{x}, \theta^{\mathrm{opt}}_{C_k \mid \mathbf{x}})$ is then achieved by using MLE
Disadvantage: only little knowledge about the given data is required ('black box')
Logistic regression
Logistic regression
Commonly-used classification algorithm for binary classification problems
Assumption: the data noise $\epsilon$ follows a Bernoulli distribution $\mathrm{Ber}(n) = p^n (1-p)^{1-n}$, $n \in \{0, 1\}$, as this distribution is more appropriate for a binary classification problem
Logistic regression
Predict the correct class label based on the probability
$$p(C_k \mid \mathbf{x}, \mathbf{w}) = \mathrm{Ber}(C_k \mid \sigma(\mathbf{w}^\top \mathbf{x})),$$
where $\sigma$ is a squashing function (e.g. the sigmoid)
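A sketch of the prediction step (the weights are assumed to have been fitted already, e.g. by maximum likelihood; the fitting itself is not shown):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(x, w):
    """p(C1 | x, w) = sigma(w^T x); p(C2 | x, w) = 1 - sigma(w^T x)."""
    return sigmoid(w @ x)

def predict_label(x, w, threshold=0.5):
    """Bernoulli model: assign C1 if the predicted probability exceeds the threshold."""
    return "C1" if predict_proba(x, w) >= threshold else "C2"

w = np.array([0.4, -1.2, 2.0])       # hypothetical fitted weights (w[0] acts as the bias)
x = np.array([1.0, 0.3, 0.7])        # input with a leading 1 for the bias term
print(predict_proba(x, w), predict_label(x, w))
```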