
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro

SUPERVISED LEARNING [Diagram: labeled data, hypothesis space, hypothesis function, errors, performance criteria.] Given a collection of input features and outputs $(x^{(i)}, y^{(i)})$, $i = 1, \dots, m$, and a hypothesis function $h_\theta$, find the parameter values $\theta$ that minimize the average empirical error:

$\min_\theta \; \frac{1}{m} \sum_{i=1}^{m} \ell\big(h_\theta(x^{(i)}),\, y^{(i)}\big)$

We need to specify:
1. The hypothesis class $H$, with $h_\theta \in H$
2. The loss function $\ell$
3. The algorithm for solving the optimization problem (often approximately)
4. A complete ML design: from data processing to learning to validation and testing
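A minimal sketch of this recipe in code (assuming NumPy; the toy dataset, the linear hypothesis class, the squared loss, and the gradient-descent settings are illustrative choices, not taken from the lecture). The three ingredients of the slide map onto h, empirical_error, and the update loop:

```python
import numpy as np

# Toy labeled data: m examples of (x, y), invented for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
y = 2.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, size=20)

# 1. Hypothesis class: linear functions h_theta(x) = theta0 + theta1 * x
def h(theta, X):
    return theta[0] + X @ theta[1:]

# 2. Loss: squared error, averaged over the dataset (empirical error)
def empirical_error(theta, X, y):
    return np.mean((h(theta, X) - y) ** 2)

# 3. Optimization algorithm: plain gradient descent
theta = np.zeros(2)
lr = 0.5
for _ in range(2000):
    resid = h(theta, X) - y                       # h_theta(x_i) - y_i
    grad = np.array([2 * resid.mean(),
                     2 * (resid @ X[:, 0]) / len(y)])
    theta -= lr * grad

print("learned theta:", theta, "empirical error:", empirical_error(theta, X, y))
```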

CLASSIFICATION AND REGRESSION Features: Width, Lightness. Classification: (Width, Lightness) → {Salmon, Sea bass} (discrete), $h_\theta: X \subseteq \mathbb{R}^2 \to Y = \{0, 1\}$. Regression: (Width, Lightness) → Weight (continuous), $h_\theta: X \subseteq \mathbb{R}^2 \to Y \subseteq \mathbb{R}$. Which hypothesis class $H$? One able to capture complex boundaries and relations.

PROBABILISTIC MODELS: DISCRIMINATIVE VS. GENERATIVE Regression and classification problems can be stated in probabilistic terms (more on this later). The mapping $y = h_\theta(x)$ that we are learning can be naturally interpreted as the probability of the output being $y$ given the input data $x$ (under the selected hypothesis $h$ and the learned parameter vector $\theta$).

Discriminative models: directly learn $p(y \mid x)$ through a parametric hypothesis; they allow us to discriminate between classes / predicted outputs.

Generative models / probability distributions: learn the joint $p(x, y)$, the probabilistic model that describes the data, and then apply Bayes' rule:

$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x, y)}{p(x)}$

They allow us to generate any relevant data. [Figure: the two classes (sea bass, salmon) plotted in the $(x_1, x_2)$ feature plane, for the discriminative and the generative view.]

GENERATIVE MODELS A discriminative model, which learns $p(y \mid x; \theta)$, can be used to label the data, i.e., to discriminate between classes, but not to generate the data. E.g., a discriminative approach tries to find the (linear, in this case) decision boundary that allows for the best classification of the training data, and takes decisions accordingly: it learns the mapping from X to Y directly.

A generative approach would instead proceed as follows:
1. By looking at the feature data about salmon, build a model of a salmon.
2. By looking at the feature data about sea bass, build a model of a sea bass.
3. To classify a new fish from its features $x$, match it against the salmon and the sea bass models, to see whether it looks more like the salmon or more like the sea bass seen in the training set.

Steps 1-3 are equivalent to modeling $p(x \mid y)$, with $y \in \{\omega_1, \omega_2\}$: the conditional probability of observing features $x$ given that the fish is a salmon or a sea bass.

GENERATIVE MODELS $p(x \mid y = \omega_1)$ models the distribution of salmon features; $p(x \mid y = \omega_2)$ models the distribution of sea bass features. The prior $p(y)$ can be derived from the dataset or from other sources, e.g., $p(\omega_1)$ = fraction of salmon in the dataset, $p(\omega_2)$ = fraction of sea bass.

Bayes' rule: $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x, y)}{p(x)}$, i.e., posterior = (likelihood × prior) / evidence, with $p(x) = p(x \mid y = \omega_1)\, p(y = \omega_1) + p(x \mid y = \omega_2)\, p(y = \omega_2)$.

To make a prediction: $\arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_y p(x \mid y)\, p(y)$.

Equivalently: decide $\omega_1$ if $p(\omega_1 \mid x) > p(\omega_2 \mid x)$, otherwise decide $\omega_2$.
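A hedged sketch of this generative recipe for the fish example (assuming 1-D Gaussian class-conditional densities fitted by maximum likelihood; the feature values are invented for illustration):

```python
import numpy as np

def gauss_pdf(x, mu, sd):
    # Gaussian density, used as the class-conditional model p(x | y)
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Toy training data: one feature (e.g., lightness) per fish
salmon   = np.array([2.1, 2.4, 1.9, 2.2, 2.0])   # class omega_1
sea_bass = np.array([3.0, 3.3, 2.9, 3.1, 3.4])   # class omega_2

# Fit a Gaussian to each class: p(x | omega_1), p(x | omega_2)
mu1, sd1 = salmon.mean(), salmon.std(ddof=1)
mu2, sd2 = sea_bass.mean(), sea_bass.std(ddof=1)

# Priors p(y) from class frequencies in the dataset
p1 = len(salmon) / (len(salmon) + len(sea_bass))
p2 = 1 - p1

def classify(x):
    # Bayes decision rule: argmax_y p(x | y) p(y); the evidence p(x) cancels
    post1 = gauss_pdf(x, mu1, sd1) * p1
    post2 = gauss_pdf(x, mu2, sd2) * p2
    return "salmon" if post1 > post2 else "sea bass"

print(classify(2.3), classify(3.2))
```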

GENERATIVE MODELS AND BAYES DECISION RULE Decide $\omega_1$ if $p(x \mid \omega_1)\, p(\omega_1) > p(x \mid \omega_2)\, p(\omega_2)$, otherwise decide $\omega_2$. Equivalently, in terms of the likelihood ratio: decide $\omega_1$ if $\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{p(\omega_2)}{p(\omega_1)}$, otherwise decide $\omega_2$. [Figure: class-conditional densities and the resulting decision regions; class $\omega_2$ gets two disconnected regions.]

GENERATIVE MODELS Given the joint distribution we can derive any conditional or marginal probability. We can also sample from $p(x, y)$ to obtain labeled data points: given the prior $p(y)$, sample a class (or a predictor value); then, given the class $y$, sample instance data from $p(x \mid y)$ for that class, or, given a predictor variable, sample an expected output. Downside: higher complexity, more parameters to learn. Learning $p(x \mid y)$ is a density estimation problem, which can be parametric (e.g., Gaussian densities) or non-parametric (full density estimation).
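A minimal sketch of using such a fitted model to generate labeled data points (the priors and per-class Gaussian parameters below are placeholder values, not estimates from real data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed fitted generative model: priors p(y) and per-class Gaussians p(x | y)
priors = {"salmon": 0.5, "sea bass": 0.5}
params = {"salmon": (2.1, 0.2), "sea bass": (3.1, 0.2)}   # (mean, std), illustrative

def sample_labeled_points(n):
    labels = rng.choice(list(priors), size=n, p=list(priors.values()))  # y ~ p(y)
    xs = np.array([rng.normal(*params[y]) for y in labels])             # x ~ p(x | y)
    return xs, labels

x_new, y_new = sample_labeled_points(5)
print(list(zip(np.round(x_new, 2), y_new)))
```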

LET'S GO BACK TO LINEAR REGRESSION Linear model as hypothesis: $y = h(x; w) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = w^T x$, with $x = (1, x_1, x_2, \dots, x_d)$. Find $w$ that minimizes the deviation from the desired answers, $y^{(i)} - h(x^{(i)})$, for each $i$ in the dataset. Loss function: mean squared error (MSE), $\ell = \frac{1}{m} \sum_{i=1}^{m} \big(y^{(i)} - h(x^{(i)})\big)^2$. This model does not try to explain the variation in the observed $y$'s for the data.

STATISTICAL MODEL FOR LINEAR REGRESSION A statistical model of linear regression: $y = w^T x + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$, so that $y \sim N(w^T x, \sigma^2)$. This model does explain the variation in the observed $y$'s in terms of additive white Gaussian noise. The conditional distribution of $y$ given $x$ is

$p(y \mid x; w, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma^2}\, (y - w^T x)^2\right)$

the probability (density) of the output being $y$ given the predictor $x$, with $E[y \mid x] = w^T x$.

STATISTICAL MODEL FOR LINEAR REGRESSION Let's consider the entire data set D and assume that all samples are independent and identically distributed (i.i.d.) random variables. What is the joint probability of all training data, i.e., the probability of observing all the outputs $y$ in D given $w$ and $\sigma$? By the i.i.d. assumption:

$p\big(y^{(1)}, y^{(2)}, \dots, y^{(m)} \mid x^{(1)}, x^{(2)}, \dots, x^{(m)}; w, \sigma\big) = \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; w, \sigma\big) = L(D, w, \sigma)$

$L(D, w, \sigma)$ is the likelihood function of the predictions: the probability of observing the outputs $y$ in D given $w$ and $\sigma$. The maximum likelihood estimate of the parameters $w$ is the value maximizing the likelihood of the predictions, i.e., the parameter values for which the probability of observing the data in D is maximized:

$w^{*} = \arg\max_w L(D, w, \sigma)$

STATISTICAL MODEL FOR LINEAR REGRESSION Log-likelihood:

$l(D, w, \sigma) = \log L(D, w, \sigma) = \log \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; w, \sigma\big) = \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w, \sigma\big)$

Using the conditional density $p(y \mid x; w, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma^2}(y - w^T x)^2\right)$:

$l(D, w, \sigma) = \sum_{i=1}^{m} \log\!\left[\frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma^2}\big(y^{(i)} - w^T x^{(i)}\big)^2\right)\right] = -\frac{1}{2\sigma^2} \sum_{i=1}^{m} \big(y^{(i)} - w^T x^{(i)}\big)^2 + c(\sigma)$

where $c(\sigma)$ collects the terms that do not depend on $w$. Maximizing the predictive log-likelihood with respect to $w$ is therefore equivalent to minimizing the MSE loss function. Does it look familiar?

$\max_w l(D, w, \sigma) \;\sim\; \min_w \text{MSE}$

More generally, the least-squares linear fit under Gaussian noise corresponds to the maximum likelihood estimator of the data.
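A hedged numerical check of this equivalence on synthetic data (the weights and noise level are invented): the least-squares estimate obtained with np.linalg.lstsq maximizes the Gaussian log-likelihood, and perturbing it lowers the log-likelihood while raising the MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 50, 0.3
X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])   # design matrix with bias column
w_true = np.array([1.0, 2.0])
y = X @ w_true + rng.normal(0, sigma, m)                    # y = w^T x + eps

# Least-squares estimate of w (same minimizer as the MLE)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def log_likelihood(w):
    resid = y - X @ w
    return -0.5 * m * np.log(2 * np.pi * sigma**2) - (resid @ resid) / (2 * sigma**2)

def mse(w):
    return np.mean((y - X @ w) ** 2)

# Perturbing w away from w_hat lowers the log-likelihood and raises the MSE together
for w in (w_hat, w_hat + np.array([0.1, -0.1])):
    print(np.round(w, 3), "logL =", round(log_likelihood(w), 2), "MSE =", round(mse(w), 4))
```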

NON-LINEAR, ADDITIVE REGRESSION MODELS

NON-LINEAR PROBLEMS? Two options: design a non-linear regressor / classifier, or modify (transform) the input data to make the problem linear.

MAP DATA INTO HIGHER-DIMENSIONAL FEATURE SPACES

MAP DATA INTO HIGHER-DIMENSIONAL FEATURE SPACES The separating hyperplane is found in z-space and then projected back into x-space, where it becomes an ellipse. The form of the SVM solution (expressed in terms of dot products between feature vectors) makes it easy to define a kernel function that performs the desired transformation implicitly, so that we can keep using linear classifiers.
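A small illustration of the kernel idea (this is the standard homogeneous degree-2 polynomial kernel, used here as a generic example rather than anything specific from the lecture): evaluating the kernel on the original 2-D inputs gives the same value as an explicit dot product in the mapped feature space.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous degree-2 polynomial kernel
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def k(x, z):
    # Kernel computed directly in the original 2-D space
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

print(phi(x) @ phi(z))   # dot product in the mapped 3-D space
print(k(x, z))           # same value, without ever forming phi(x)
```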

NON-LINEAR, ADDITIVE REGRESSION MODELS Main idea to model nonlinearities: replace the inputs to the linear units with $b$ feature (basis) functions $\varphi_j(x)$, $j = 1, \dots, b$, where each $\varphi_j(x)$ is an arbitrary function of $x$:

$y = h(x; w) = w_0 + w_1 \varphi_1(x) + w_2 \varphi_2(x) + \dots + w_b \varphi_b(x) = w^T \varphi(x)$

The original feature input becomes a new input through the basis functions, while the model stays linear in the parameters.

EXAMPLES OF FEATURE FUNCTIONS Higher-order polynomial with one-dimensional input, $x = (x)$: $\varphi_1(x) = x$, $\varphi_2(x) = x^2$, $\varphi_3(x) = x^3, \dots$ Quadratic polynomial with two-dimensional input, $x = (x_1, x_2)$: $\varphi_1(x) = x_1$, $\varphi_2(x) = x_1^2$, $\varphi_3(x) = x_2$, $\varphi_4(x) = x_2^2$, $\varphi_5(x) = x_1 x_2$. Transcendental functions: $\varphi_1(x) = \sin(x)$, $\varphi_2(x) = \cos(x)$.
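A brief sketch of building such basis-function inputs (assuming NumPy; the helper names are made up, and the feature sets mirror the examples above):

```python
import numpy as np

def poly_features_1d(x, degree):
    # phi(x) = (1, x, x^2, ..., x^degree) for a scalar input x
    return np.array([x ** j for j in range(degree + 1)])

def quadratic_features_2d(x1, x2):
    # phi(x) = (1, x1, x1^2, x2, x2^2, x1*x2) for a 2-D input
    return np.array([1.0, x1, x1 ** 2, x2, x2 ** 2, x1 * x2])

def trig_features(x):
    # phi(x) = (1, sin x, cos x)
    return np.array([1.0, np.sin(x), np.cos(x)])

print(poly_features_1d(2.0, 3))        # [1. 2. 4. 8.]
print(quadratic_features_2d(1.0, 3.0)) # [1. 1. 1. 3. 9. 3.]
```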

SOLUTION USING FEATURE FUNCTIONS We can apply the same techniques (analytical gradient + system of equations, or gradient descent) used for the plain linear case with MSE as the loss function:

$h(x^{(i)}; w) = w_0 + w_1 \varphi_1(x^{(i)}) + w_2 \varphi_2(x^{(i)}) + \dots + w_b \varphi_b(x^{(i)}) = w^T \varphi(x^{(i)})$, with $\varphi(x^{(i)}) = \big(1, \varphi_1(x^{(i)}), \varphi_2(x^{(i)}), \dots, \varphi_b(x^{(i)})\big)$

$\ell = \frac{1}{m} \sum_{i=1}^{m} \big(y^{(i)} - h(x^{(i)})\big)^2$

To find $\min_w \ell$ we look for where $\nabla_w \ell = 0$:

$\nabla_w \ell = -\frac{2}{m} \sum_{i=1}^{m} \big(y^{(i)} - h(x^{(i)})\big)\, \varphi(x^{(i)}) = 0$

This results in a system of $b$ linear equations (plus the one for $w_0$), the normal equations:

$w_0 \sum_{i=1}^{m} \varphi_j(x^{(i)}) + w_1 \sum_{i=1}^{m} \varphi_1(x^{(i)})\, \varphi_j(x^{(i)}) + \dots + w_b \sum_{i=1}^{m} \varphi_b(x^{(i)})\, \varphi_j(x^{(i)}) = \sum_{i=1}^{m} y^{(i)} \varphi_j(x^{(i)}), \quad j = 1, \dots, b$
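A hedged sketch of solving this system by stacking the $\varphi(x^{(i)})$ into a design matrix $\Phi$, so the normal equations read $\Phi^T \Phi\, w = \Phi^T y$ (synthetic data and a cubic polynomial basis; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(2 * x) + rng.normal(0, 0.1, 30)       # a nonlinear target, illustrative

degree = 3
Phi = np.vander(x, degree + 1, increasing=True)  # rows phi(x_i) = (1, x_i, x_i^2, x_i^3)

# Normal equations: (Phi^T Phi) w = Phi^T y
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def h(x_new):
    # Prediction with the fitted basis-function model
    return np.vander(np.atleast_1d(x_new), degree + 1, increasing=True) @ w

print("weights:", np.round(w, 3))
print("training MSE:", round(float(np.mean((y - Phi @ w) ** 2)), 4))
print("h(0.5) =", round(float(h(0.5)[0]), 3))
```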

EXAMPLE OF SGD WITH FEATURE FUNCTIONS One-dimensional feature vectors and a high-order polynomial: $x = (x)$, $\varphi_i(x) = x^i$,

$h(x; w) = w_0 + w_1 \varphi_1(x) + w_2 \varphi_2(x) + \dots + w_b \varphi_b(x) = w_0 + \sum_{i=1}^{b} w_i x^i$

On-line, single-sample $(x^{(i)}, y^{(i)})$ gradient update, for $j = 1, \dots, b$:

$w_j \leftarrow w_j - \alpha\, \nabla_{w_j} \ell\big(h(x^{(i)}; w), y^{(i)}\big) = w_j + \alpha\, \big(y^{(i)} - h(x^{(i)})\big)\, \varphi_j(x^{(i)})$

This has the same form as in the linear regression model, with $x_j^{(i)}$ replaced by $\varphi_j(x^{(i)})$.
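A minimal sketch of this online update (assuming NumPy; the target function, degree, learning rate, and number of passes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = 1.0 + 2.0 * x - 1.5 * x ** 3 + rng.normal(0, 0.05, 40)   # illustrative cubic target

b, alpha = 3, 0.1
w = np.zeros(b + 1)

def phi(xi):
    return xi ** np.arange(b + 1)    # (1, x, x^2, x^3)

for _ in range(200):                 # passes over the data
    for xi, yi in zip(x, y):
        pred = w @ phi(xi)
        w += alpha * (yi - pred) * phi(xi)   # w_j <- w_j + alpha (y - h(x)) phi_j(x)

print("learned w:", np.round(w, 2))
```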

ELECTRICITY EXAMPLE New data: it doesn't look linear anymore.

NEW HYPOTHESIS The complexity of the model grows: one parameter for each feature transformed according to a polynomial of order 2 (at least 3 parameters vs. the 2 of the original hypothesis).

NEW HYPOTHESIS At least 5 parameters (if we had multiple predicting features, all their order-d products would have to be considered, resulting in a number of additional parameters).

NEW HYPOTHESIS The number of parameters is now larger than the number of data points, so the polynomial can fit the data almost exactly: overfitting.

SELECTING MODEL COMPLEXITY Dataset with 10 points, 1-D features: which hypothesis class should we use? Linear regression: $y = h(x; w) = w_0 + w_1 x$. Polynomial regression, cubic: $y = h(x; w) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$. MSE as the loss function in both cases. Which model would give the smaller error in terms of MSE / least-squares fit?

SELECTING MODEL COMPLEXITY Cubic regression provides a better fit to the data and a smaller MSE. Should we stick with the hypothesis $h(x; w) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$? Since a higher-order polynomial seems to provide a better fit, why don't we use a polynomial of order higher than 3? What is the highest order that makes sense for the given problem?

SELECTING MODEL COMPLEXITY For 10 data points, a degree-9 polynomial gives a perfect fit (Lagrange interpolation): the training error is zero. Is it always good to minimize (or even reduce to zero) the training error? Related (and more important) question: how will we perform on new, unseen data?
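A hedged illustration on a synthetic 10-point dataset (the underlying function and noise level are invented): the training MSE shrinks as the degree grows and is numerically zero at degree 9, while the error on fresh data from the same source is typically much larger.

```python
import numpy as np

rng = np.random.default_rng(3)
def target(x):
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 10)
y_train = target(x_train) + rng.normal(0, 0.1, 10)
x_new = rng.uniform(0, 1, 100)                      # fresh data from the same source
y_new = target(x_new) + rng.normal(0, 0.1, 100)

for degree in (1, 3, 9):
    Phi_tr = np.vander(x_train, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)   # least-squares polynomial fit
    Phi_new = np.vander(x_new, degree + 1, increasing=True)
    train_mse = np.mean((Phi_tr @ w - y_train) ** 2)
    new_mse = np.mean((Phi_new @ w - y_new) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, unseen-data MSE = {new_mse:.2f}")
```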

OVERFITTING The degree-9 polynomial model completely fails to predict the new point! Overfitting: the situation in which the training error is low and the generalization error is high. Causes of the phenomenon: a highly complex hypothesis model, with a large number of parameters (degrees of freedom); a small dataset (compared to the complexity of the model). The learned function then has enough degrees of freedom to (over)fit all the data perfectly.

OVERFITTING Empirical loss vs. generalization loss

TRAINING AND VALIDATION LOSS

SPLITTING THE DATASET IN TWO

PERFORMANCE ON VALIDATION SET

PERFORMANCE ON VALIDATION SET

INCREASING MODEL COMPLEXITY In this case, the small size of the dataset makes it easy to overfit by increasing the degree of the polynomial (i.e., the hypothesis complexity). For a large multi-dimensional dataset this effect is less pronounced.

TRAINING VS. VALIDATION LOSS

MODEL SELECTION AND EVALUATION PROCESS
1. Break all available data into training and testing sets (e.g., 70% / 30%).
2. Break the training set into an internal training set and a validation set (e.g., 70% / 30%).
3. Loop:
   i. Set a hyperparameter value (e.g., the degree of the polynomial, i.e., the model complexity).
   ii. Train the model on the internal training set.
   iii. Validate the model on the validation set.
   iv. Exit the loop if validation errors keep growing while training errors go to zero.
4. Choose the hyperparameters using the validation-set results: the hyperparameter values corresponding to the lowest validation errors.
5. (Optional) With the selected hyperparameters, retrain the model using all the training data.
6. Evaluate the (generalization) performance on the testing set (more on this next time).
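A hedged end-to-end sketch of this procedure (NumPy only; the synthetic dataset, the split ratios, and the candidate degrees are illustrative, and the loop simply tries every candidate degree instead of using the early-exit test in step 3.iv):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.15, 60)

# 1. Split all data into training (70%) and testing (30%) sets
idx = rng.permutation(60)
train_idx, test_idx = idx[:42], idx[42:]
# 2. Split the training set into internal training (70%) and validation (30%) sets
inner_idx, val_idx = train_idx[:29], train_idx[29:]

def fit(deg, xs, ys):
    Phi = np.vander(xs, deg + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)
    return w

def mse(w, deg, xs, ys):
    return np.mean((np.vander(xs, deg + 1, increasing=True) @ w - ys) ** 2)

# 3.-4. Loop over hyperparameter values (polynomial degree), pick the best on the validation set
val_scores = {}
for deg in range(1, 10):
    w = fit(deg, x[inner_idx], y[inner_idx])
    val_scores[deg] = mse(w, deg, x[val_idx], y[val_idx])
best_deg = min(val_scores, key=val_scores.get)

# 5. Retrain with the chosen degree on all training data
w_final = fit(best_deg, x[train_idx], y[train_idx])
# 6. Estimate generalization performance on the held-out test set
print("best degree:", best_deg,
      "test MSE:", round(mse(w_final, best_deg, x[test_idx], y[test_idx]), 4))
```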

MODEL SELECTION AND EVALUATION PROCESS [Diagram: the dataset is split into a training set and a testing set; the training set is further split into an internal training set and a validation set; each candidate model 1, ..., n is learned on the internal training set and validated on the validation set; the best model is selected, learned on the full training set, and finally evaluated on the testing set.]