COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training- & Validation- & Testset)

CLASSIFICATION WITH LOGISTIC REGRESSION

Logistic Regression Classification and not regression Classification = recognition Mogees and vibration classification: https://www.youtube.com/watch?v=xv4hil-_h Action recognition: https://www.youtube.com/watch?v=ajswswvwqvy

Logistic Regression The default classification model Binary classification Extensions to multi-class later in the course Simple classification algorithm Convex cost - unique local optimum Gradient descent No more parameter than with linear regre Interpretability of parameters Fast evaluation of hypothesis for making predictions

LOGISTIC REGRESSION Hypothesis

Example (step function hypothesis) i Tumor size (mm) x Malignant? 2.3 (N) 2 5. (Y) 3.4 (N) 4 6.3 (Y) 5 5.3 (Y) labeled data y benign malignant decision boundary Tumor size (x) labels

Example (logistic function hypothesis) i Tumor size (mm) x labeled data Malignant? 2.3 (N) 2 5. (Y) y benign malignant? decision boundary Tumor size (x) 3.4 (N) 4 6.3 (Y) 5 5.3 (Y) labels p=.2, class? Hypothesis: Tumor is malignant with probability p Classification: if p <.5: if p.5:.5

Example (logistic function hypothesis) i Tumor size (mm) x labeled data Malignant? 2.3 (N) 2 5. (Y) y benign malignant? decision boundary Tumor size (x) 3.4 (N) 4 6.3 (Y) 5 5.3 (Y) labels p=., class? Hypothesis: Tumor is malignant with probability p Classification: if p <.5: if p.5:.5

Logistic (Sigmoid) function Advantages over step function for classification: Differentiable (gradient descent) Contains additional information (how certain is the prediction?)

Logistic regression hypothesis (one input) 5-5 - -4-2 2 4 5-5 - -4-2 2 4 5-5 - -4-2 2 4.5.5.5-4 -2 2 4-4 -2 2 4-4 -2 2 4

Age (x2) Classification with multiple inputs i Tumor size (mm) Age Maligna nt? x x2 y 2.3 25 (N) 2 5. 62 (Y) 3.4 47 (N) 4 6.3 39 (Y) Tumor size (x) 5 5.3 72 (Y)

Age (x2) Multiple inputs and logistic hypothesis i Tumor size (mm) Age Maligna nt?. Reduce point in high-dimensional space to a scalar z 2. Apply logistic function p=.8, class x x2 y 2.3 25 (N) 2 5. 62 (Y)? 3.4 47 (N) 4 6.3 39 (Y) decision boundary Tumor size (x)? 5 5.3 72 (Y)

Age (x2) Classification with multiple inputs i Tumor size (mm) Age Maligna nt?. Reduce point in high-dimensional space to a scalar z 2. Apply logistic function p=.999, class x x2 y 2.3 25 (N) 2 5. 62 (Y) 3.4 47 (N) 4 6.3 39 (Y) 5 5.3 72 (Y) Tumor size (x)??

Logistic regression hypothesis. Reduce high-dimensional input to a scalar 2. Apply logistic function 3. Interpret output as probability and predict class: Class =

LOGISTIC REGRESSION Cost function

Age (x2) Logistic regression cost function How well does the hypothesis fit the data? Tumor size (x) Prediction:.98 Actual value y: Prediction:.6 Actual value y: Prediction:. Actual value y:

Age (x2) Logistic regression cost function Probabilistic model: y is with probability: Tumor size (x)

Logistic regression cost function Probabilistic model: y is with probability p(x,y=) = The parameters should maximize the likelihood of the data max θ log p X = x x n, y = y, y n θ) If data points are independants max θ i log p x i, y i θ) p x i, y i, x j, y j θ) = p x j, y j θ). p x i, y i θ) Separating positive and negative examples max θ y i = log p x i, θ) + y i = log p x i, θ) σ(x T θ) σ(x T θ)

Logistic regression cost function How well does the hypothesis Cost for predicting probability p when the real value is y: fit the data? 5 4 3 2.5 5 4 3 2.5 Mean over all training examples:

Multiple inputs and logistic hypothesis How well does the hypothesis fit the data? 5 4 3 2.5 5 4 3 2.5 Prediction:. Actual value y: Prediction:.6 Actual value y: Prediction:.98 Actual value y:

Comparison cost functions Linear regression Logistic regression 4 3 2-2 2 5 4 3 2.5 5 4 3 2.5 Mean over all training examples:

Why not mean squared error (MSE) again? MSE with logistic hypothesis is non-convex (many local minima) Logistic regression is convex (unique minimum) Cost function can be derived from statistical principles ( maximum likelihood ) non-convex function convex function

LOGISTIC REGRESSION Learning from data

Minimizing the cost via gradient descent Gradient descent (simultaneous update for j= n) Gradient of logistic regression cost: error input (for j=: )

Age (x2) Linear to non-linear features linear decision boundary Tumor size (x)

Age (x2) Linear to non-linear features non-linear decision boundary Tumor size (x)

Age (x2) Age (x2) Decision boundaries linear decision boundary non-linear decision boundary Tumor size (x) Tumor size (x) Decision boundary is a property of hypothesis, not of data!

Linear vs. Logistic Regression Linear Regression Regression Hypothesis Cost for one training example: Logistic Regression Binary classification (!) Hypothesis Cost for one training example: Gradient Gradient error input error input Analytical: No analytical solution!

GRADIENT DESCENT TRICKS, AND MORE ADVANCED OPTIMIZATION TECHNIQUES For linear regression, logistic regression,.

GD trick #: feature scaling Feature scaling and mean normalization Bring all features into a similar range E.g.: shift and scale each feature to have mean and variance Mean of unscaled feature Standard deviation of unscaled feature (when using non-linear features) Do not apply to constant feature! Typically leads to much faster convergence

GD trick #2: monitoring convergence Diagnose typical issues with Gradient Descent:.75 6 cost J.7.65.6 at iteration 3 cost J 8 6 4 2 cost J 4 2.55 5 iteration 5 iteration 5 5 2 iteration slow convergence (increase learning rate?) oscillations (decrease learning rate) divergence (decrease learning rate)

GD trick #3: adaptive learning rate At each iteration Compare cost function value after Gradient Descent update before and 6 If cost increased: Reject update (go back to previous parameters) Multiply learning rate by.7 (for example) cost J 4 2 5 5 2 iteration If cost decreased: Multiply learning rate by.2 (for example) cost J.75.7.65.6.55 5 iteration Often eliminates slow convergence and divergence issues

A black box view of gradient descent Write code for computing the cost and its gradient Code to compute Gradient descent Code to compute Local/global minimum? Learning rate. Stop when J < 4

More advanced optimization methods Gradient methods = order Newton methods = order 2 Need Hessian matrix or approximations Avoid choosing a learning rate Conjugate gradient, BFGS, L-BFGS, Tricky to implement (numerical stability, etc.) Use available toolbox / library implementations! scipy.optimize.minimize Only use when fighting for performance

EVALUATION OF HYPOTHESIS Training and test set

Training and Test set Training set: used by learning algorithm to fit parameters and find a hypothesis. Test set: independent data set, used after learning to estimate the performance of the hypothesis on new (unseen) test examples. 2.5 2 training test Regression example y.5.5-2 - 2 x E.g. 7% randomly chosen examples from dataset are training examples, the remaining 3% are test examples. Must be disjoint subsets!

Training and Test set workflow Training set Learning algorithm Training error (cost) Hypothesis h Test set Testing Test error (cost)

Linear regression training vs. test error MSEtrain=. MSEtest=2.2 2 training hypothesis 2 test hypothesis polynomial degree 4 y y - -2-2 x - -2-2 x

Classification Training / Test set Training set Test set.9.8.8.6.4.2.2.4.6.8.7.6.5.4.3.2..2.4.6.8

Logistic regression training vs. test error Training Error:.5 Test Error: 2.7.8.8.6 polynomial features up to power.6.4 decision boundary.4.2.2.5.5

UNDERFITTING AND OVERFITTING

Polynomial regression under-/overfitting 2.5 2.5 MSEtrain=.22, MSEtest=.29 training test polynomial fit degree y.5-2 - 2 x

Polynomial regression under-/overfitting 2.5 2.5 MSEtrain=., MSEtest=. training test polynomial fit degree 5 y.5-2 - 2 x

Polynomial regression under-/overfitting 2.5 2.5 MSEtrain=., MSEtest=3.7 training test polynomial fit degree 5 y.5-2 - 2 x

-2-2 Polynomial regression under-/overfitting 2.5 2.5 MSEtrain=.22, MSEtest=.29 training test polynomial fit degree 2.5 2.5 MSEtrain=., MSEtest=3.7 training test polynomial fit degree 5 y y.5.5-2 - 2 x underfitting 2.5 2.5 MSEtrain=., MSEtest=. training test polynomial fit degree 5-2 - 2 x overfitting y.5 just right

Logistic regression with polynomial terms Training set Test set.5.5.5.5

Logistic regression with polynomial terms Terms up to power Training Error:.33 Test Error:.32 decision boundary.5.5.5.5 Training Error:.9 Test Error:.9.8.8 Terms up to power 2.6.4.2.5.6.4.2.5

Logistic regression with polynomial terms Training Error:.7 Test Error:.44 Terms up to power 2.5.5.5.5

Training Error:.33 Test Error:.32 Training Error:.7 Test Error:.44.5.5.5.5.5 underfitting.5.5 overfitting.5 Training Error:.9 Test Error:.9.8.8.6.6.4.4.2.2.5 just right.5

Under-/ and Overfitting Training error (cost) underfitting Test error (cost) overfitting just right Model complexity (e.g. degree of polynomial terms)

Under- and Overfitting Underfitting: Model is too simple (often: too few parameters) High training error, high test error Training error (cost) Test error (cost) Overfitting Model is too complex (often: too many parameters relative to number of training examples) Low training error, high test error In between: Model has right complexity Moderate training error Lowest test error Model complexity

How to deal with overfitting Use model selection to automatically select the right model complexity Use regularization to keep parameters small (other lecture ) Collect more data (often not possible or inefficient) Manually throw out features which are unlikely to contribute (often hard to guess which ones, potentially throwing out the wrong ones) Change features vectors, use pre-processing (often not possible or inefficient, time consuming)

MODEL SELECTION Training, Validation and Test sets

Model selection Selection of learning algorithm and hyperparameters (model complexity) that are most suitable for a given learning problem Training error (cost) Test error (cost) Model complexity

Idea Try out different learning algorithms/variants Vary degree of polynomial Try different sets of features Training error (cost) Prediction error (cost) Select variant with best predictive performance Model complexity

Training, Validation, Test set Training set: used by learning algorithm to fit parameters and find a hypothesis for each learning algorithm/variant. Validation set: used to estimate predictive performance of each learning algorithm/variant. The hypothesis with lowest validation error (cost) is selected. Test set: independent data set, used after learning and model selection to estimate the performance of the final (selected) hypothesis on new (unseen) test examples. E.g. 6/2/2 % randomly chosen examples from dataset. Must be disjoint subsets!

Training/Validation/Test set workflow Training set LA with hyperparameter setting A LA with hyperparameter setting B LA with hyperparameter setting C For example: degree of polynome=, 5, and 5 Hypothesis h A Hypothesis h B Hypothesis h C Validation set Model selection Selected Hypothesis h Test set Testing Test error/cost

Training/Validation/Test set workflow Training set Learning algorithm A Learning algorithm B Learning algorithm C For example: Linear regression, Polynomial regression, Artificial Neural Network Hypothesis h A Hypothesis h B Hypothesis h C Validation set Model selection Selected Hypothesis h Test set Testing Test error/cost

Some questions Logistic regression is a method for regression/classification? Logistic regression hypothesis? What s the cost function used for logistic regression? Is it convex or non-convex? What does adaptive learning rate mean in the context of gradient descent? How to evaluate a hypothesis? What is under-/overfitting? What is model selection? What are training, validation and test sets? How does model selection work (procedure)?

What is next? Neural Networks: Perceptron Feedforward Neural Network Backpropagation