An Introduction to Statistical and Probabilistic Linear Models

An Introduction to Statistical and Probabilistic Linear Models. Maximilian Mozes, Proseminar Data Mining, Fakultät für Informatik, Technische Universität München. June 07, 2017.

Introduction: In statistical learning theory, linear models are used for regression and classification tasks. What is regression? What is classification? How can we model such concepts in a mathematical context?

Linear regression

Linear regression - basics: What is regression? The approximation of data using a (closed-form) mathematical expression, achieved by estimating the model parameters that yield the best approximation.

Linear regression - example: A company changes the price of its products for the nth time. It knows how the previous n − 1 price changes affected consumer behaviour. Using linear regression, it can predict the consumer behaviour for the nth price change.

Linear regression - basics: Let D denote a set of n-dimensional data vectors, and let x be an n-dimensional observation. How can we approximate x?

Linear regression - basics: Create a linear function $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_n x_n$ that approximates $\mathbf{x}$ with the weights $\mathbf{w}$.

Linear regression - basics: Example for n = 2: $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2$.

Linear regression - basics: Problem: the weights $w_i$ multiply the raw input variables directly, so the model is linear in $\mathbf{x}$, which is a significant limitation. Idea: use weighted non-linear basis functions $\phi_j$ instead: $y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{n} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$, where $\boldsymbol{\phi} = (\phi_0, \dots, \phi_n)^\top$ with $\phi_0(\mathbf{x}) = 1$.

Polynomial regression - example: Let $\phi_j(x) = x^j$. Then $y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{n} w_j x^j = w_0 + w_1 x + w_2 x^2 + \dots + w_n x^n$.

Polynomial regression: approximation with a 2nd-order, a 6th-order, and an 8th-order polynomial (figures).

Polynomial regression - problem: higher polynomial degrees lead to overfitting (see the sketch below).

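A minimal numpy sketch (synthetic data, not from the slides) that illustrates this effect: the training error keeps shrinking as the polynomial degree grows, while the fitted curve starts to chase the noise.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    z = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy targets

    for degree in (2, 6, 8):
        # np.polyfit minimizes the residual sum of squares over the coefficients w
        w = np.polyfit(x, z, deg=degree)
        train_error = np.mean((np.polyval(w, x) - z) ** 2)
        print(f"degree {degree}: training MSE = {train_error:.4f}")
    # The training MSE decreases with the degree even though the high-degree
    # fits generalize worse, which is the overfitting shown in the figures.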

Linear classification

Linear classification - basics: What is classification? It aims to partition the data into predefined classes, where a class contains observations with similar characteristics.

Linear classification - example: We have n different cucumbers and courgettes. Each record contains the weight and the texture (smooth or rough). We want to predict the correct label for a new vegetable without knowing its real label.

Linear classification - basics: Assume a two-class classification problem. How can we categorise the data into the predefined classes? (figure)

Linear classification - basics: Cucumbers vs. courgettes (figure).

Linear classification - basics: Discriminant functions. Given a dataset D, we aim to categorise $\mathbf{x}$ into either class $C_1$ or $C_2$. Use a function $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + w_0$ such that $\mathbf{x} \in C_1$ if $y(\mathbf{x}) \geq 0$ and $\mathbf{x} \in C_2$ otherwise.

Linear classification - basics: Decision boundary $H := \{\mathbf{x} \in D : y(\mathbf{x}) = 0\}$ (figure).
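A minimal sketch (with assumed, illustrative weights that are not from the slides) of the two-class decision rule $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + w_0$:

    import numpy as np

    w = np.array([1.0, -2.0])   # assumed weight vector
    w0 = 0.5                    # assumed bias
    X = np.array([[3.0, 1.0],
                  [0.0, 2.0]])  # two observations, one per row

    y = X @ w + w0              # y(x) = w^T x + w0 for each row
    labels = np.where(y >= 0, "C1", "C2")
    print(labels)               # ['C1' 'C2']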

Non-linearity: However, it often occurs that the data are not linearly separable (figure).

Non-linearity - solution: Use non-linear basis functions instead: $y(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$ with $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x}))^\top$.
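A minimal sketch (the feature map and weights below are assumptions for illustration, not the slides' example): a linear discriminant applied to non-linear features can separate data that is not linearly separable in the original space, e.g. points inside vs. outside a circle via the squared radius.

    import numpy as np

    def phi(x):
        # phi_0 = 1 (bias), phi_1 = ||x||^2  -- a hand-picked, assumed basis
        return np.array([1.0, x[0] ** 2 + x[1] ** 2])

    w = np.array([-1.0, 1.0])   # y(x) = ||x||^2 - 1: the boundary is the unit circle
    for x in ([0.2, 0.3], [1.5, -0.5]):
        y = w @ phi(np.array(x))
        print(x, "-> C1" if y >= 0 else "-> C2")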

Common classification algorithms: Other commonly used classification algorithms include the naive Bayes classifier, logistic regression (Cox, 1958), and support vector machines (Vapnik and Lerner, 1963).

Summary: Linear models are linear combinations of weighted (non-)linear basis functions. Regression is the approximation of a given set of data using a closed mathematical representation. Classification is the categorisation of data according to individual characteristics and common patterns.

Further readings:
J. Aldrich, 1997. R. A. Fisher and the Making of Maximum Likelihood 1912-1922. Statistical Science, Vol. 12, No. 3, 162-176.
D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer, 2009.
K. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
A. Y. Ng and M. I. Jordan, 2002. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. Advances in Neural Information Processing Systems.

References:
J. Aldrich, 1997. R. A. Fisher and the Making of Maximum Likelihood 1912-1922. Statistical Science, Vol. 12, No. 3, 162-176.
D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
D. R. Cox, 1958. The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society, Vol. XX, No. 2.

References (continued):
T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer, 2009.
K. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
F. Rosenblatt, 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, Vol. 65, No. 6.

Thank you for your attention. Questions?

Backup slides

Sum of least squares

Parameter estimation: How do we choose the weights $w_i$? Find the set of parameters that maximizes $p(D \mid \mathbf{w})$.

Sum of least squares: a method to optimize the weights $\mathbf{w}$ by minimizing the residual sum of squares (RSS): $\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{N} \big(z_i - y(\mathbf{x}_i, \mathbf{w})\big)^2 = \sum_{i=1}^{N} \Big(z_i - w_0 - \sum_{j=1}^{M-1} x_{ij} w_j\Big)^2$.

Sum of least squares: the RSS can be written compactly using the $N \times M$ matrix $X$ whose rows are the $\mathbf{x}_i$. Then $\mathrm{RSS}(\mathbf{w}) = (\mathbf{z} - X\mathbf{w})^\top (\mathbf{z} - X\mathbf{w})$, where $\mathbf{z}$ is the vector of target values.

Sum of least squares: taking derivatives gives $\frac{\partial \mathrm{RSS}}{\partial \mathbf{w}} = -2 X^\top (\mathbf{z} - X\mathbf{w})$ and $\frac{\partial^2 \mathrm{RSS}}{\partial \mathbf{w}\, \partial \mathbf{w}^\top} = 2 X^\top X$. Setting the first derivative to zero and solving for $\mathbf{w}$ results in $\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{z}$.

Sum of least squares: note that this approach assumes $X^\top X$ to be positive definite, i.e. $X$ is assumed to have full column rank.
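A minimal sketch (synthetic data, not from the slides) of the closed-form solution $\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{z}$; in practice one solves the normal equations rather than forming an explicit inverse, and both require $X$ to have full column rank.

    import numpy as np

    rng = np.random.default_rng(1)
    N, M = 50, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])  # first column: bias
    true_w = np.array([0.5, 2.0, -1.0])
    z = X @ true_w + rng.normal(scale=0.1, size=N)                  # noisy targets

    w_hat = np.linalg.solve(X.T @ X, X.T @ z)   # solves (X^T X) w = X^T z
    print(w_hat)                                # close to [0.5, 2.0, -1.0]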

Sum of least squares: the observed targets can be described as $z_i = y(\mathbf{x}_i, \mathbf{w}) + \epsilon$, where $\epsilon$ represents the data noise. The RSS provides a measure of the prediction error $E_D$, defined as $E_D(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(z_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2$, where $\boldsymbol{\phi} = (\phi_0, \dots, \phi_{M-1})^\top$.

Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE): introduced by Fisher (1922); a commonly used method for optimizing the model parameters.

Maximum Likelihood Estimation - goal: find the optimal parameters $\mathbf{w}$ such that $p(D \mid \mathbf{w})$ is maximized, i.e. $\hat{\mathbf{w}} \in \arg\max_{\mathbf{w}} p(D \mid \mathbf{w})$. For the target values $z_i$, the $p(z_i \mid \mathbf{x}_i, \mathbf{w})$ are assumed to be independent, so that $p(D \mid \mathbf{w}) = \prod_{i=1}^{N} p(z_i \mid \mathbf{x}_i, \mathbf{w})$ holds.

Maximum Likelihood Estimation: the product makes the expression unwieldy, so we simplify it using the logarithm: $\log p(D \mid \mathbf{w}) = \sum_{i=1}^{N} \log p(z_i \mid \mathbf{x}_i, \mathbf{w})$. We can now set the gradient $\nabla_{\mathbf{w}} \log p(D \mid \mathbf{w}) = 0$ and solve for $\mathbf{w}$; the solution $\hat{\mathbf{w}}$ gives the optimal weight parameters.

Maximum Likelihood Estimation: assume the data noise $\epsilon$ follows a Gaussian distribution. Then the likelihood can be written as $p(z \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(z \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$, where $\beta$ is the noise precision (inverse variance).

Maximum Likelihood Estimation: for the target vector $\mathbf{z}$ we get $\log p(\mathbf{z} \mid X, \mathbf{w}, \beta) = \sum_{i=1}^{N} \log \mathcal{N}(z_i \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i), \beta^{-1})$.

Maximum Likelihood Estimation: it follows that the gradient is $\nabla_{\mathbf{w}} \log p(\mathbf{z} \mid X, \mathbf{w}, \beta) = \beta \sum_{i=1}^{N} \big(z_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)\, \boldsymbol{\phi}(\mathbf{x}_i)^\top$. Setting the gradient to zero and solving for $\mathbf{w}$ leads to $\mathbf{w}_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{z}$.

Maximum Likelihood Estimation: $\Phi$ is called the design matrix, with entries $\Phi_{ij} = \phi_j(\mathbf{x}_i)$:
$$\Phi = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$$
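A minimal sketch (assuming a polynomial basis $\phi_j(x) = x^j$ and synthetic data, neither of which is prescribed by the slides) of building the design matrix $\Phi$ and computing $\mathbf{w}_{\mathrm{ML}}$ via the normal equations:

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 1, 30)
    z = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

    M = 4  # number of basis functions phi_0, ..., phi_{M-1}
    Phi = np.column_stack([x ** j for j in range(M)])   # Phi[i, j] = phi_j(x_i)
    w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ z)      # normal equations
    print(w_ml)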

Regularization and ridge regression

Ridge regression: a method to prevent overfitting by adding a regularization term that penalizes large weights.

Ridge regression: with the regularization term $\lambda \lVert \mathbf{w} \rVert_2^2$, the error function becomes $E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(z_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2 + \lambda \lVert \mathbf{w} \rVert_2^2$. Minimizing $E(\mathbf{w})$ and solving for $\mathbf{w}$ results in $\hat{\mathbf{w}}_{\mathrm{ridge}} = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top \mathbf{z}$.
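A minimal sketch (the polynomial basis and the value of $\lambda$ are assumptions for illustration) of the ridge solution $\hat{\mathbf{w}}_{\mathrm{ridge}} = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top \mathbf{z}$:

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0, 1, 15)
    z = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

    M, lam = 9, 1e-3                                   # high-degree basis, small penalty
    Phi = np.column_stack([x ** j for j in range(M)])
    w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ z)
    print(np.round(w_ridge, 2))  # weights stay much smaller than an unregularized fit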

Non-linear classification

Non-linearity: in our example, an appropriate non-linear feature map is $\phi : (x_1, x_2) \mapsto (r\cos(x_1), r\sin(x_2))$, $r \in \mathbb{R}$.

Multi-class classification

Multi-class classification: the discriminant function for two-class classification can be extended to a k-class problem. Use the per-class discriminants $y_k(\mathbf{x}) = \mathbf{w}_k^\top \mathbf{x} + w_{k0}$; $\mathbf{x}$ is assigned to class $C_j$ if $y_j(\mathbf{x}) > y_i(\mathbf{x})$ for all $i \neq j$.

Multi-class classification: the decision boundary between classes $C_i$ and $C_j$ is then $y_j(\mathbf{x}) = y_i(\mathbf{x})$, which can be rewritten as $(\mathbf{w}_j - \mathbf{w}_i)^\top \mathbf{x} + w_{j0} - w_{i0} = 0$.
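A minimal sketch (the weight vectors and biases below are illustrative assumptions) of the k-class rule: evaluate all discriminants $y_k(\mathbf{x})$ and assign the class with the largest value.

    import numpy as np

    W = np.array([[ 1.0,  0.0],    # w_1
                  [ 0.0,  1.0],    # w_2
                  [-1.0, -1.0]])   # w_3
    w0 = np.array([0.0, 0.2, 1.0])

    x = np.array([0.3, 0.6])
    scores = W @ x + w0            # y_k(x) for k = 1..3
    print("assigned class:", np.argmax(scores) + 1)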

Perceptron algorithm

Perceptron: given an input vector $\mathbf{x}$ and a fixed non-linear feature map $\boldsymbol{\phi}(\mathbf{x})$, the class of $\mathbf{x}$ is estimated by $y(\mathbf{x}) = f(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}))$, where $f$ is a non-linear activation function with $f(t) = +1$ if $t \geq 0$ and $f(t) = -1$ otherwise.
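A minimal sketch of the classical perceptron learning rule (Rosenblatt, 1958), which the slides do not spell out: whenever an example is misclassified, add $z_i \boldsymbol{\phi}(\mathbf{x}_i)$ to the weights. Identity features with a bias column are assumed here.

    import numpy as np

    def perceptron_train(X, z, epochs=100):
        """X: N x d inputs (bias column included), z: labels in {-1, +1}."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            updated = False
            for x_i, z_i in zip(X, z):
                if z_i * (w @ x_i) <= 0:    # misclassified (or on the boundary)
                    w += z_i * x_i          # perceptron update
                    updated = True
            if not updated:                 # converged on linearly separable data
                break
        return w

    X = np.array([[1, 0.0, 0.5], [1, 1.0, 1.5], [1, -0.5, -1.0], [1, -1.5, -0.2]])
    z = np.array([+1, +1, -1, -1])
    print(perceptron_train(X, z))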

Probabilistic generative models

Probabilistic generative models: model the joint probability $p(\mathbf{x}, z)$ directly instead of optimizing weight parameters, and apply Bayes' theorem to obtain $p(z \mid \mathbf{x})$.

Bayes' theorem: for a set of disjoint events $A_1, \dots, A_n$ and an observed event $B$, the probability $p(A_i \mid B)$, $i \in \{1, \dots, n\}$, can be computed as $p(A_i \mid B) = \frac{p(B \mid A_i)\, p(A_i)}{\sum_{j=1}^{n} p(B \mid A_j)\, p(A_j)}$.

Probabilistic generative models: consider a two-class classification problem for $C_1$ and $C_2$. Then $p(C_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_1)\, p(C_1) + p(\mathbf{x} \mid C_2)\, p(C_2)} = \frac{1}{1 + e^{-a}} = \sigma(a)$, where $a = \log\!\left(\frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)}\right)$ and $\sigma(a) = \frac{1}{1 + e^{-a}}$ is the logistic sigmoid.
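A minimal sketch (the Gaussian class-conditional densities, the prior, and the test point are assumptions for illustration; the slides do not fix them) of computing $p(C_1 \mid x) = \sigma(a)$ from class-conditionals and priors:

    import numpy as np
    from scipy.stats import norm

    def posterior_c1(x, prior_c1=0.5):
        # assumed 1-D class-conditionals: C1 ~ N(1, 1), C2 ~ N(-1, 1)
        a = (np.log(norm.pdf(x, loc=1.0, scale=1.0) * prior_c1)
             - np.log(norm.pdf(x, loc=-1.0, scale=1.0) * (1 - prior_c1)))
        return 1.0 / (1.0 + np.exp(-a))   # sigma(a)

    print(posterior_c1(0.8))   # well above 0.5: x = 0.8 is more plausible under C1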

Probabilistic discriminative models

Probabilistic discriminative models: predict the correct class by modelling the posterior probability $p(z \mid \mathbf{x})$ directly. This makes computing the class-conditional likelihood $p(\mathbf{x} \mid z)$ via Bayes' theorem unnecessary.

Probabilistic discriminative models: the posterior probability $p(C_k \mid \mathbf{x}, \theta)$ is computed by estimating the optimal parameters $\theta$ with MLE. Disadvantage: the model uses only little knowledge about the structure of the given data ("black box").

Logistic regression

Logistic regression: a commonly used algorithm for binary classification problems. Assumption: the binary target follows a Bernoulli distribution, $\mathrm{Ber}(n) = p^n (1 - p)^{1-n}$, $n \in \{0, 1\}$, which is more appropriate for a binary classification problem than the Gaussian noise model used in regression.

Logistic regression: predict the class label from the probability $p(C_k \mid \mathbf{x}, \mathbf{w}) = \mathrm{Ber}(C_k \mid \sigma(\mathbf{w}^\top \mathbf{x}))$, where $\sigma$ is a squashing function (e.g. the logistic sigmoid).
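A minimal sketch (synthetic data and plain gradient ascent on the Bernoulli log-likelihood; this is one common way to fit the model, not the slides' own derivation) of logistic regression with $p(C_1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^\top \mathbf{x})$:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(4)
    N = 200
    X = np.column_stack([np.ones(N), rng.normal(size=N)])     # bias + one feature
    true_w = np.array([-0.5, 2.0])
    z = (rng.random(N) < sigmoid(X @ true_w)).astype(float)   # Bernoulli targets

    w = np.zeros(2)
    lr = 0.5
    for _ in range(2000):
        grad = X.T @ (z - sigmoid(X @ w))   # gradient of the log-likelihood
        w += lr * grad / N                  # gradient ascent step
    print(w)                                # should move close to [-0.5, 2.0]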