COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

Lecture 3:
- Classification with Logistic Regression
- Advanced optimization techniques
- Underfitting & Overfitting
- Model selection (training, validation and test set)

CLASSIFICATION WITH LOGISTIC REGRESSION

Logistic Regression
- Classification and not regression
- Classification = recognition
- Mogees and vibration classification: https://www.youtube.com/watch?v=xv4hil-_h
- Action recognition: https://www.youtube.com/watch?v=ajswswvwqvy

Logistic Regression
- The default classification model
- Binary classification (extensions to multi-class later in the course)
- Simple classification algorithm
- Convex cost: unique local optimum, found with gradient descent
- No more parameters than with linear regression
- Interpretability of parameters
- Fast evaluation of hypothesis for making predictions

LOGISTIC REGRESSION Hypothesis

Example (step function hypothesis)
Labeled data (columns: i, tumor size x in mm, malignant?):
  2.3 (N)
  2 5. (Y)
  3.4 (N)
  4 6.3 (Y)
  5 5.3 (Y)
[Figure: labels y (benign / malignant) plotted against tumor size (x), with a step function and its decision boundary]

Example (logistic function hypothesis)
Same labeled data as in the step function example.
Hypothesis: tumor is malignant with probability p
Classification:
- if p < 0.5: class 0 (benign)
- if p ≥ 0.5: class 1 (malignant)
[Figure: labels y (benign / malignant) plotted against tumor size (x), with a logistic curve and the decision boundary at p = 0.5; query points are annotated with their predicted probability p and the question "class?"]

Logistic (Sigmoid) function
$\sigma(z) = \frac{1}{1 + e^{-z}}$
Advantages over step function for classification:
- Differentiable (gradient descent)
- Contains additional information (how certain is the prediction?)
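The two advantages named above can be checked numerically. The following is a minimal sketch, not taken from the slides; the function names `sigmoid` and `sigmoid_derivative` are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Outputs lie strictly between 0 and 1 and grow with z, so they can be
# read as "how certain is the prediction?".
values = sigmoid([-4.0, 0.0, 4.0])
```

Unlike a step function, the derivative is defined everywhere, which is what gradient descent needs.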

Logistic regression hypothesis (one input)
[Figure: three example hypotheses with one input, each shown as a pair of plots over the input range -4 to 4]

Classification with multiple inputs
Labeled data (columns: i, tumor size x1 in mm, age x2, malignant? y):
  2.3 25 (N)
  2 5. 62 (Y)
  3.4 47 (N)
  4 6.3 39 (Y)
  5 5.3 72 (Y)
[Figure: the examples plotted in the plane of tumor size (x1) and age (x2)]

Multiple inputs and logistic hypothesis
Same labeled data as in the table for classification with multiple inputs.
1. Reduce point in high-dimensional space to a scalar z
2. Apply logistic function
[Figure: the examples plotted in the plane of tumor size (x1) and age (x2), with a decision boundary; query points are annotated with their predicted probability p and class]

Logistic regression hypothesis
1. Reduce high-dimensional input to a scalar: $z = x^T \theta$
2. Apply logistic function: $p = \sigma(z) = \frac{1}{1 + e^{-z}}$
3. Interpret output as probability and predict class: class = 0 if p < 0.5, class = 1 if p ≥ 0.5
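The three steps can be sketched in a few lines. This is an illustrative example only: the inputs and the parameter vector `theta` are made up (not fitted and not taken from the slides), and the function names are hypothetical.

```python
import numpy as np

def hypothesis(X, theta):
    """Steps 1 and 2: reduce each row to z = x^T theta, then apply the logistic function."""
    z = X @ theta
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, theta, threshold=0.5):
    """Step 3: class 1 if the probability reaches the threshold, otherwise class 0."""
    return (hypothesis(X, theta) >= threshold).astype(int)

# Made-up inputs: columns are [constant 1, tumor size in mm, age in years].
X = np.array([[1.0, 2.0, 30.0],
              [1.0, 6.0, 60.0]])
theta = np.array([-6.0, 1.0, 0.04])  # illustrative parameters, not fitted

p = hypothesis(X, theta)       # probabilities of class 1
classes = predict(X, theta)    # predicted classes (0 or 1)
```

The leading column of ones plays the role of the constant feature, so the first entry of `theta` acts as an offset.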

LOGISTIC REGRESSION Cost function

Logistic regression cost function
How well does the hypothesis fit the data?
[Figure: the examples plotted in the plane of tumor size (x1) and age (x2); three points are annotated with the predicted probability and the actual value y]

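Logistic regression cost function
Probabilistic model: y is 1 with probability $p(x, y = 1) = \sigma(x^T \theta)$, and 0 with probability $1 - \sigma(x^T \theta)$.

The parameters should maximize the likelihood of the data:
$$\max_\theta \; \log p(X = x_1, \dots, x_n,\; Y = y_1, \dots, y_n \mid \theta)$$

If the data points are independent, the joint probability factorizes, $p(x_i, y_i, x_j, y_j \mid \theta) = p(x_i, y_i \mid \theta) \, p(x_j, y_j \mid \theta)$, so the log-likelihood becomes a sum:
$$\max_\theta \sum_i \log p(x_i, y_i \mid \theta)$$

Separating positive and negative examples:
$$\max_\theta \sum_{i:\, y_i = 1} \log p(x_i, 1 \mid \theta) + \sum_{i:\, y_i = 0} \log p(x_i, 0 \mid \theta)$$
where $p(x_i, 1 \mid \theta) = \sigma(x_i^T \theta)$ and $p(x_i, 0 \mid \theta) = 1 - \sigma(x_i^T \theta)$.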

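Logistic regression cost function
How well does the hypothesis fit the data?
Cost for predicting probability p when the real value is y:
$$\mathrm{cost}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 0 \end{cases}$$
[Figure: the two cost curves as functions of p, one for y = 1 and one for y = 0]
Mean over all training examples:
$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{cost}\bigl(\sigma(x_i^T \theta),\, y_i\bigr)$$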
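A small sketch of this cost, assuming the standard negative log-likelihood form (minus the log of the probability assigned to the true class). The function names and the clipping constant `eps` are choices made here, not part of the slides.

```python
import numpy as np

def example_cost(p, y):
    """Cost for predicting probability p when the real value is y (0 or 1)."""
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

def mean_cost(p, y, eps=1e-12):
    """Mean cost over all examples; p is clipped so that log(0) cannot occur."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return float(np.mean(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p)))

# For a positive example (y = 1), a confident correct prediction is cheap
# and a confident wrong prediction is expensive.
cheap = example_cost(0.98, 1)
expensive = example_cost(0.1, 1)
```

The single expression in `mean_cost` covers both cases at once, because one of the two terms vanishes for each example.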

Multiple inputs and logistic hypothesis
How well does the hypothesis fit the data?
[Figure: the two cost curves as functions of p; three example predictions are marked together with their actual value y]

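Comparison cost functions
- Linear regression: squared error between prediction and actual value
- Logistic regression: -log(p) if y = 1, -log(1 - p) if y = 0
- In both cases: mean over all training examples
[Figure: the cost curve for linear regression next to the two cost curves for logistic regression]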

Why not mean squared error (MSE) again?
- MSE with logistic hypothesis is non-convex (many local minima)
- The logistic regression cost is convex (unique minimum)
- Cost function can be derived from statistical principles ("maximum likelihood")
[Figure: a non-convex function next to a convex function]

LOGISTIC REGRESSION Learning from data

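Minimizing the cost via gradient descent
Gradient descent (simultaneous update for all parameters $\theta_j$), with learning rate $\alpha$:
$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$
Gradient of logistic regression cost:
$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\bigl(\sigma(x_i^T \theta) - y_i\bigr)}_{\text{error}} \; \underbrace{x_{i,j}}_{\text{input}}$$
(for j = 0: $x_{i,0} = 1$)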
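The update rule can be sketched as follows. The data, the learning rate of 0.1 and the iteration count are assumptions chosen for illustration; the gradient is the "error times input" form, averaged over the examples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y):
    """Mean logistic cost and its gradient (mean of error * input)."""
    p = sigmoid(X @ theta)
    eps = 1e-12  # guards against log(0)
    cost = np.mean(-y * np.log(p + eps) - (1.0 - y) * np.log(1.0 - p + eps))
    grad = X.T @ (p - y) / len(y)
    return cost, grad

def gradient_descent(X, y, learning_rate=0.1, iterations=2000):
    """Plain gradient descent; all parameters are updated simultaneously."""
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(iterations):
        cost, grad = cost_and_gradient(theta, X, y)
        costs.append(cost)
        theta = theta - learning_rate * grad
    return theta, costs

# Made-up one-dimensional data: tumor size (mm) and label (1 = malignant).
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(sizes), sizes])  # constant feature first

theta, costs = gradient_descent(X, y)
```

Recording the cost at every iteration makes it possible to inspect convergence afterwards.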

Linear to non-linear features
[Figure: the examples plotted in the plane of tumor size (x1) and age (x2), first with a linear decision boundary, then with a non-linear decision boundary]

Decision boundaries
[Figure: two plots of tumor size (x1) against age (x2), one with a linear decision boundary and one with a non-linear decision boundary]
Decision boundary is a property of the hypothesis, not of the data!

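Linear vs. Logistic Regression

Linear Regression
- Task: regression
- Hypothesis: $x^T \theta$
- Cost for one training example: squared error
- Gradient: error × input
- Analytical solution exists

Logistic Regression
- Task: binary classification (!)
- Hypothesis: $\sigma(x^T \theta)$
- Cost for one training example: $-\log(p)$ if y = 1, $-\log(1 - p)$ if y = 0
- Gradient: error × input
- No analytical solution!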

GRADIENT DESCENT TRICKS, AND MORE ADVANCED OPTIMIZATION TECHNIQUES
For linear regression, logistic regression, ...

GD trick #1: feature scaling
Feature scaling and mean normalization
- Bring all features into a similar range
- E.g.: shift and scale each feature to have mean 0 and variance 1: $x_j \leftarrow (x_j - \mu_j) / s_j$
  - $\mu_j$: mean of unscaled feature
  - $s_j$: standard deviation of unscaled feature (when using non-linear features)
- Do not apply to constant feature!
- Typically leads to much faster convergence
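A minimal sketch of this trick with NumPy. The helper names `fit_scaler` and `apply_scaler` and the raw feature values are made up for illustration; the constant feature is appended only after scaling, so it stays untouched.

```python
import numpy as np

def fit_scaler(X):
    """Mean and standard deviation of each unscaled feature (column)."""
    return X.mean(axis=0), X.std(axis=0)

def apply_scaler(X, mean, std):
    """Shift and scale each feature to mean 0 and variance 1."""
    return (X - mean) / std

# Made-up raw features: tumor size (mm) and age (years), no constant column yet.
X_raw = np.array([[2.0, 25.0],
                  [5.0, 62.0],
                  [3.0, 47.0],
                  [6.0, 39.0]])

mean, std = fit_scaler(X_raw)
X_scaled = apply_scaler(X_raw, mean, std)

# The constant feature is added after scaling and is never scaled itself.
X_design = np.column_stack([np.ones(len(X_scaled)), X_scaled])
```

New examples would be scaled with the same `mean` and `std` that were computed on the training data.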

GD trick #2: monitoring convergence
Diagnose typical issues with gradient descent by plotting the cost J against the iteration:
- slow convergence (increase learning rate?)
- oscillations (decrease learning rate)
- divergence (decrease learning rate)
[Figure: three plots of cost J against iteration, one for each of the three cases]

GD trick #3: adaptive learning rate
At each iteration, compare the cost function value before and after the gradient descent update.
- If cost increased:
  - Reject update (go back to previous parameters)
  - Multiply learning rate by 0.7 (for example)
- If cost decreased:
  - Multiply learning rate by a factor slightly greater than 1 (for example)
Often eliminates slow convergence and divergence issues.
[Figure: two plots of cost J against iteration with an adaptive learning rate]
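The accept/reject rule can be sketched as below. The shrink factor 0.7 follows the slide; the growth factor 1.05, the starting learning rate and the toy quadratic cost are assumptions made for this example.

```python
import numpy as np

def adaptive_gradient_descent(cost_grad, theta, learning_rate=1.0,
                              shrink=0.7, grow=1.05, iterations=200):
    """Gradient descent that rejects steps which increase the cost."""
    cost, grad = cost_grad(theta)
    history = [cost]
    for _ in range(iterations):
        candidate = theta - learning_rate * grad
        new_cost, new_grad = cost_grad(candidate)
        if new_cost > cost:
            # Cost increased: keep the previous parameters, shrink the rate.
            learning_rate *= shrink
        else:
            # Cost did not increase: accept the update, grow the rate a little.
            theta, cost, grad = candidate, new_cost, new_grad
            learning_rate *= grow
        history.append(cost)
    return theta, history, learning_rate

def quadratic(theta):
    """Toy convex cost with two badly scaled directions (illustrative only)."""
    scales = np.array([1.0, 50.0])
    return 0.5 * float(np.sum(scales * theta ** 2)), scales * theta

theta, history, final_rate = adaptive_gradient_descent(
    quadratic, np.array([3.0, -2.0]))
```

The starting rate of 1.0 is deliberately too large for this cost, so the first few updates are rejected until the rate has shrunk far enough.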

A black box view of gradient descent
Write code for computing the cost and its gradient:
- Code to compute the cost J
- Code to compute the gradient of J
Gradient descent takes these two pieces of code and a learning rate, and returns a local/global minimum?
Stop when J < 4

More advanced optimization methods
- Gradient methods = order 1
- Newton methods = order 2
  - Need Hessian matrix or approximations
  - Avoid choosing a learning rate
- Conjugate gradient, BFGS, L-BFGS, ...
- Tricky to implement (numerical stability, etc.)
- Use available toolbox / library implementations! E.g. scipy.optimize.minimize
- Only use when fighting for performance
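As a sketch of the library route, the function below computes the cost and its gradient and is handed to `scipy.optimize.minimize`. Passing `jac=True` tells SciPy that the function returns both values. The data are made up, and the choice of the "L-BFGS-B" method is an assumption for this example.

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_gradient(theta, X, y):
    """Mean logistic cost and its gradient, returned together."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    eps = 1e-12  # guards against log(0)
    cost = np.mean(-y * np.log(p + eps) - (1.0 - y) * np.log(1.0 - p + eps))
    grad = X.T @ (p - y) / len(y)
    return cost, grad

# Made-up data in which the two classes overlap, so the optimum is finite.
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(sizes), sizes])

# No learning rate has to be chosen; the optimizer picks its own step sizes.
result = minimize(cost_and_gradient, x0=np.zeros(2), args=(X, y),
                  jac=True, method="L-BFGS-B")
theta_opt = result.x
```

Only the cost/gradient code is problem-specific here, which matches the black box view: the optimizer itself is interchangeable.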

EVALUATION OF HYPOTHESIS Training and test set

Training and Test set
- Training set: used by the learning algorithm to fit parameters and find a hypothesis.
- Test set: independent data set, used after learning to estimate the performance of the hypothesis on new (unseen) test examples.
E.g. 70% randomly chosen examples from the dataset are training examples, the remaining 30% are test examples. Must be disjoint subsets!
[Figure: regression example, y plotted against x, with training and test examples marked]

Training and Test set workflow
Training set -> Learning algorithm -> Hypothesis h, training error (cost)
Test set + Hypothesis h -> Testing -> Test error (cost)
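The workflow can be sketched with a random 70/30 split and a straight-line fit. The data generator, the helper names and the use of `np.polyfit` as the learning algorithm are assumptions made for this example.

```python
import numpy as np

def train_test_split(X, y, train_fraction=0.7, seed=0):
    """Randomly split the examples into two disjoint subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up regression data: a noisy straight line.
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=50)
y = 1.0 + 0.5 * x + rng.normal(scale=0.2, size=50)

x_train, y_train, x_test, y_test = train_test_split(x, y)

# Learning uses the training set only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Training error on the training set, test error on the unseen test set.
mse_train = mse(y_train, slope * x_train + intercept)
mse_test = mse(y_test, slope * x_test + intercept)
```

Because the split is a permutation cut in two, no example can end up in both subsets.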

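Linear regression training vs. test error
[Figure: two panels, "training" and "test", each showing the data (y against x) together with the same polynomial hypothesis; the MSE is reported separately for the training set and the test set]

Classification Training / Test set
[Figure: two panels showing the training set and the test set of a two-class problem]

Logistic regression training vs. test error
[Figure: training set and test set with the decision boundary of a logistic regression hypothesis using polynomial features; the training error and the test error are reported separately]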

UNDERFITTING AND OVERFITTING

Polynomial regression under-/overfitting
[Figure: three polynomial fits to the same training and test data (y against x), each with its training and test MSE]
- underfitting: polynomial fit of low degree, MSEtrain=.22, MSEtest=.29
- just right: polynomial fit of degree 5, low training and test MSE
- overfitting: polynomial fit of high degree, low training MSE, MSEtest=3.7

Logistic regression with polynomial terms
[Figure: training set and test set of a two-class problem, with the decision boundaries of logistic regression hypotheses that use polynomial terms up to increasing powers]
- underfitting: terms up to a low power, Training Error: .33, Test Error: .32
- just right: terms up to a moderate power, training and test error both low and similar
- overfitting: terms up to a high power, training error low, Test Error: .44

Under- and Overfitting
[Figure: training error (cost) and test error (cost) plotted against model complexity (e.g. degree of polynomial terms), with the regions underfitting, just right and overfitting marked]

Under- and Overfitting
- Underfitting: model is too simple (often: too few parameters)
  - High training error, high test error
- Overfitting: model is too complex (often: too many parameters relative to number of training examples)
  - Low training error, high test error
- In between: model has right complexity
  - Moderate training error
  - Lowest test error
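The pattern can be explored with a small experiment. Everything here is an assumption made for illustration: the sine-shaped target, the noise level, the sample sizes and the degrees 1, 4 and 12. The fit uses NumPy's `Polynomial.fit`.

```python
import numpy as np
from numpy.polynomial import Polynomial

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

rng = np.random.default_rng(0)

def sample(n):
    """Made-up data: a smooth curve plus noise."""
    x = rng.uniform(-2.0, 2.0, size=n)
    y = np.sin(1.5 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = sample(20)    # small training set
x_test, y_test = sample(200)     # independent test set

# Fit polynomials of increasing complexity on the training set only.
results = {}
for degree in (1, 4, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(y_train, model(x_train)),
                       mse(y_test, model(x_test)))
```

Each entry of `results` holds the pair (training MSE, test MSE) for one degree. The training error cannot go up as the degree grows, so only the test error reveals whether the extra complexity helps.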

How to deal with overfitting
- Use model selection to automatically select the right model complexity
- Use regularization to keep parameters small (covered in another lecture)
- Collect more data (often not possible or inefficient)
- Manually throw out features which are unlikely to contribute (often hard to guess which ones, potentially throwing out the wrong ones)
- Change feature vectors, use pre-processing (often not possible or inefficient, time consuming)

MODEL SELECTION Training, Validation and Test sets

Model selection
Selection of the learning algorithm and hyperparameters (model complexity) that are most suitable for a given learning problem.
[Figure: training error (cost) and test error (cost) plotted against model complexity]

Idea
- Try out different learning algorithms/variants
  - Vary degree of polynomial
  - Try different sets of features
- Select variant with best predictive performance
[Figure: training error (cost) and prediction error (cost) plotted against model complexity]

Training, Validation, Test set
- Training set: used by the learning algorithm to fit parameters and find a hypothesis for each learning algorithm/variant.
- Validation set: used to estimate predictive performance of each learning algorithm/variant. The hypothesis with the lowest validation error (cost) is selected.
- Test set: independent data set, used after learning and model selection to estimate the performance of the final (selected) hypothesis on new (unseen) test examples.
E.g. 60/20/20 % randomly chosen examples from the dataset. Must be disjoint subsets!

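Training/Validation/Test set workflow (hyperparameters)
Training set -> LA with hyperparameter setting A -> Hypothesis h_A
Training set -> LA with hyperparameter setting B -> Hypothesis h_B
Training set -> LA with hyperparameter setting C -> Hypothesis h_C
(LA = learning algorithm; for example: three different degrees of the polynomial)
Validation set + Hypotheses h_A, h_B, h_C -> Model selection -> Selected hypothesis h
Test set + Selected hypothesis h -> Testing -> Test error/cost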

Training/Validation/Test set workflow (learning algorithms)
Training set -> Learning algorithm A -> Hypothesis h_A
Training set -> Learning algorithm B -> Hypothesis h_B
Training set -> Learning algorithm C -> Hypothesis h_C
(For example: Linear regression, Polynomial regression, Artificial Neural Network)
Validation set + Hypotheses h_A, h_B, h_C -> Model selection -> Selected hypothesis h
Test set + Selected hypothesis h -> Testing -> Test error/cost
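A compact sketch of this workflow, with the polynomial degree as the hyperparameter. The data, the 60/20/20 split sizes and the candidate degrees are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up regression data: a smooth curve plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=100)
y = np.sin(1.5 * x) + rng.normal(scale=0.2, size=100)

# Disjoint random split: 60 training, 20 validation, 20 test examples.
order = rng.permutation(100)
train, val, test = order[:60], order[60:80], order[80:]

# One hypothesis per hyperparameter setting, fitted on the training set only.
candidates = (1, 3, 5, 9)
models = {}
val_errors = {}
for degree in candidates:
    models[degree] = Polynomial.fit(x[train], y[train], degree)
    val_errors[degree] = mse(y[val], models[degree](x[val]))

# Model selection: the hypothesis with the lowest validation error wins.
best_degree = min(val_errors, key=val_errors.get)

# The test set is touched once, for the selected hypothesis only.
test_error = mse(y[test], models[best_degree](x[test]))
```

The test error stays an honest estimate because the test examples influenced neither the fitting nor the selection.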

Some questions
- Logistic regression is a method for regression/classification?
- Logistic regression hypothesis?
- What's the cost function used for logistic regression? Is it convex or non-convex?
- What does adaptive learning rate mean in the context of gradient descent?
- How to evaluate a hypothesis?
- What is under-/overfitting?
- What is model selection?
- What are training, validation and test sets?
- How does model selection work (procedure)?

What is next?
Neural Networks:
- Perceptron
- Feedforward Neural Network
- Backpropagation