Linear and Logistic Regression. Dr. Xiaowei Huang

Linear and Logistic Regression Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/

Up to now Two classical machine learning algorithms: decision tree learning, k-nearest neighbour. Model evaluation: metrics, learning curves, training/validation/test datasets, confusion matrices (accuracy, error, ROC curve, PR curve).

Confidence for decision trees (example) Random forest: multiple decision trees are trained on different resamples of the data. Class probabilities can be calculated as the proportion of decision trees that vote for each class. For example, if 8 out of 10 decision trees vote to classify an instance as positive, we say that the confidence is 8/10. Here, the confidences of all classes add up to 1.
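As a minimal sketch of this vote-counting idea (my own illustration, not from the slides; the tree predictions are assumed to be already available):

```python
# Minimal sketch: class confidence as the proportion of tree votes in a random forest.
from collections import Counter

def forest_confidence(tree_votes):
    """tree_votes: list of class labels predicted by the individual trees."""
    counts = Counter(tree_votes)
    total = len(tree_votes)
    return {label: count / total for label, count in counts.items()}

# 8 of 10 trees vote "pos" -> confidence 0.8 for "pos", 0.2 for "neg"; values sum to 1.
print(forest_confidence(["pos"] * 8 + ["neg"] * 2))
```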

Confidence for k-NN classification (example) The classification steps are the same as before. Given a class, we compute the accumulated distance to its supporting instances among the k neighbours, then apply the sigmoid function to the reciprocal of that accumulated distance. Here, the confidences of all classes may not add up to 1. Softmax?
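A minimal sketch of this recipe (illustrative only, not from the slides; it assumes the k nearest neighbours and their distances have already been found, grouped by class):

```python
# Minimal sketch: confidence of a class as sigmoid(1 / accumulated distance).
import math

def knn_confidence(distances_per_class):
    """distances_per_class: dict mapping class label -> list of distances to the
    neighbours of that class among the k nearest neighbours."""
    conf = {}
    for label, dists in distances_per_class.items():
        acc = sum(dists)                          # accumulated distance to supporting instances
        conf[label] = 1 / (1 + math.exp(-1 / acc)) if acc > 0 else 1.0
    return conf                                    # note: values need not sum to 1

print(knn_confidence({"pos": [0.5, 1.2], "neg": [2.0]}))
```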

Today's topics: linear regression, linear classification, logistic regression, multiclass logistic regression (left as extended materials).

Recap: dot product in linear algebra Geometric meaning: can be used to understand the angle between two vectors
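The formula behind this geometric meaning (standard linear algebra, stated here for reference):

```latex
% Dot product of two vectors u, v in R^d and the angle theta between them
\[
  u \cdot v \;=\; \sum_{i=1}^{d} u_i v_i \;=\; \|u\|\,\|v\|\cos\theta,
  \qquad
  \cos\theta \;=\; \frac{u \cdot v}{\|u\|\,\|v\|}.
\]
```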

Linear regression

Linear regression Given training data (x_1, y_1), ..., (x_n, y_n) i.i.d. from distribution D, learn a linear function f_w(x) = w^T x that predicts y from x.

Recap: Consider the inductive bias of DT and k-nn learners

Linear regression Hypothesis class H: linear functions f_w(x) = w^T x. Given training data (x_1, y_1), ..., (x_n, y_n) i.i.d. from distribution D, find f_w in H that minimises the L2 loss, or mean square error, L(w) = (1/n) * sum_i (f_w(x_i) - y_i)^2. Here f_w(x_i) - y_i represents the error on instance i, (f_w(x_i) - y_i)^2 represents its square error, sum_i (f_w(x_i) - y_i)^2 represents the square error over all training instances, and so L(w) represents the mean square error over all training instances.

Linear regression Given training data (x_1, y_1), ..., (x_n, y_n) i.i.d. from distribution D, find f_w that minimises (1/n) * sum_i (f_w(x_i) - y_i)^2.

Organise feature data into a matrix Let X be the matrix whose i-th row is x_i^T. Football player example, with features (height, weight, running speed):

v1    v2    v3     y
182   87    11.3   No
189   92    12.3   Yes
178   79    10.6   Yes
183   90    12.7   No

Transform the input matrix with a weight vector Assume a linear function f_w(x) = w^T x with weight vector w. Intuitively, a weight of 20 on running speed says that running speed is more important than the other two features, and a weight of -1 on body weight says that body weight is negatively correlated with y. This weight vector w is the parameter vector we want to learn.

Organise the output into a vector Let y be the vector of target values, y = (325, 344, 350, 320)^T:

v1    v2    v3     y
182   87    11.3   325
189   92    12.3   344
178   79    10.6   350
183   90    12.7   320

Error representation The square error over all instances can be written in matrix form: ||Xw - y||^2 = sum_i (w^T x_i - y_i)^2.

Linear regression: optimization Given training data (x_1, y_1), ..., (x_n, y_n) i.i.d. from distribution D, let X be the matrix whose i-th row is x_i^T and y the vector of targets; find w that minimises (1/n) * ||Xw - y||^2. Now we know where this objective comes from! How to solve this optimization problem will be introduced in later lectures.
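As a preview of what such a solver does (my own illustration, not part of the lecture, which defers optimisation methods to later lectures), here is a minimal NumPy sketch that fits w on the football-player regression table above, using np.linalg.lstsq purely as a black-box least-squares solver:

```python
# Minimal sketch: least-squares fit of w for f_w(x) = w^T x on the example data.
import numpy as np

X = np.array([[182, 87, 11.3],
              [189, 92, 12.3],
              [178, 79, 10.6],
              [183, 90, 12.7]])
y = np.array([325, 344, 350, 320])

w, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimises ||Xw - y||^2
mse = np.mean((X @ w - y) ** 2)             # mean square error on the training data
print(w, mse)
```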

Linear regression with bias Given training data (x_1, y_1), ..., (x_n, y_n) i.i.d. from distribution D, find f_{w,b}(x) = w^T x + b that minimises the loss, where b is the bias term. Reduce to the case without bias: let x' = (x, 1) and w' = (w, b); then w'^T x' = w^T x + b. Intuitively, every instance is extended with one more feature whose value is always 1, and the weight attached to this extra feature is exactly the bias b.

Linear regression with bias Think about bias for the football player example
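One way to make this concrete (a sketch under the same assumptions as the previous example, not from the slides): extend each football-player row with a constant 1 feature so that the solver also learns the bias b.

```python
# Minimal sketch: add a constant-1 column so the last weight plays the role of the bias b.
import numpy as np

X = np.array([[182, 87, 11.3],
              [189, 92, 12.3],
              [178, 79, 10.6],
              [183, 90, 12.7]])
y = np.array([325, 344, 350, 320])

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])    # each x becomes (x, 1)
w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)   # w_aug = (w, b)
w, b = w_aug[:-1], w_aug[-1]
print(w, b)
```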

Linear regression with lasso penalty Given training data (x_1, y_1), ..., (x_n, y_n) i.i.d. from distribution D, find w that minimises the loss (1/n) * ||Xw - y||^2 + lambda * ||w||_1. The lasso penalty lambda * ||w||_1 is the L1 norm of the parameter vector; it encourages sparsity.
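For readers who want to try this in practice, a minimal scikit-learn sketch on the same football-player data (my own illustration; the alpha value is arbitrary and scikit-learn uses its own scaling of the objective):

```python
# Minimal sketch: lasso-penalised linear regression with scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[182, 87, 11.3],
              [189, 92, 12.3],
              [178, 79, 10.6],
              [183, 90, 12.7]])
y = np.array([325, 344, 350, 320])

model = Lasso(alpha=1.0, max_iter=10000)   # alpha controls the strength of the L1 penalty
model.fit(X, y)
print(model.coef_, model.intercept_)       # weights (encouraged to be sparse) and the bias term
```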

Evaluation metrics Root mean squared error (RMSE); mean absolute error (MAE), i.e., average L1 error; R-square (R²). Historically all were computed on the training data and possibly adjusted afterwards, but really they should be estimated by cross-validation.
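The standard definitions of these metrics, written out for predictions ŷ_i, targets y_i and target mean ȳ (textbook formulas, stated here for reference):

```latex
% Standard regression metrics for predictions \hat{y}_i and targets y_i, i = 1..n
\[
  \mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2},
  \qquad
  \mathrm{MAE} = \tfrac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i - y_i\rvert,
  \qquad
  R^2 = 1 - \frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}.
\]
```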

Linear classification

Linear classification

Linear classification: natural attempt Given training data (x_1, y_1), ..., (x_n, y_n) i.i.d. from distribution D. Hypothesis class H of piecewise linear models: y = 1 if w^T x > 0, y = 0 if w^T x < 0. Prediction: f_w(x) = step(w^T x), where step(m) = 1 if m > 0 and step(m) = 0 otherwise. Still, w is the vector of parameters to be trained. But what is the optimisation objective?

Linear classification: simple approach Drawback: not robust to outliers

Linear classification: natural attempt Given training data (x_1, y_1), ..., (x_n, y_n) i.i.d. from distribution D, find w that minimises the average 0-1 loss, (1/n) * sum_i loss(f_w(x_i), y_i), where loss = 0 (i.e., no loss) when the classification is the same as the label, and loss = 1 otherwise. Drawback: difficult to optimize, NP-hard in the worst case.

Logistic regression

Why logistic regression? It is tempting to use the output of linear regression as a probability, but this is a mistake: the output can be negative or greater than 1, whereas a probability cannot. Because linear regression can produce such values below 0 or above 1, logistic regression was introduced. Logistic regression always outputs a value between 0 and 1.

Compare the two Linear regression Linear classification

Between the two We want a prediction that is bounded in [0,1] and smooth. Sigmoid: sigmoid(a) = 1 / (1 + exp(-a)).

Linear regression: sigmoid prediction Squash the output of the linear function: sigmoid(w^T x). Find w that minimises (1/n) * sum_i (sigmoid(w^T x_i) - y_i)^2. Question: do we need to squash y?

Linear classification: logistic regression Squash the output of the linear function: sigmoid(w^T x). A better approach: interpret sigmoid(w^T x) as a probability, P(y=1 | x) = sigmoid(w^T x) and P(y=0 | x) = 1 - sigmoid(w^T x). Here we assume that y = 0 or y = 1.

Linear classification: logistic regression Find w that minimises the negative log-likelihood, - (1/n) * sum_i [ y_i * log sigmoid(w^T x_i) + (1 - y_i) * log(1 - sigmoid(w^T x_i)) ]. Why is the log function used? To avoid numerical instability. Logistic regression: maximum likelihood estimation (MLE) with the sigmoid.

Linear classification: logistic regression Given training data (x_1, y_1), ..., (x_n, y_n) i.i.d. from distribution D, find w that minimises the loss above. There is no closed-form solution; we need to use gradient descent.
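A minimal NumPy sketch of gradient descent on this loss (illustrative only; the learning rate and iteration count are arbitrary choices, not from the lecture):

```python
# Minimal sketch: batch gradient descent for logistic regression with labels y in {0, 1}.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.01, n_iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)                 # predicted probabilities
        grad = X.T @ (p - y) / len(y)      # gradient of the average negative log-likelihood
        w -= lr * grad
    return w

# Toy usage: two 1-d points, one per class.
X_toy = np.array([[1.0], [-1.0]])
w = fit_logistic(X_toy, np.array([1, 0]))
print(w, sigmoid(X_toy @ w))
```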

Properties of sigmoid function
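The standard properties usually listed for the sigmoid (textbook facts, stated here for completeness):

```latex
% Standard properties of the sigmoid \sigma(a) = 1/(1 + e^{-a})
\[
  0 < \sigma(a) < 1, \qquad
  \sigma(-a) = 1 - \sigma(a), \qquad
  \frac{d\sigma}{da} = \sigma(a)\,\bigl(1 - \sigma(a)\bigr).
\]
```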

Exercises Given the dataset below, and considering the root mean square error, suppose we have the following two linear functions:

f_w(x) = 2x_1 + 1x_2 + 20x_3 - 330
f_w(x) = 1x_1 - 2x_2 + 23x_3 - 332

x1    x2    x3     y
182   87    11.3   325
189   92    12.3   344
178   79    10.6   350
183   90    12.7   320

Please answer the following questions: (1) Which model is better for linear regression? (2) Which model is better for linear classification, considering the 0-1 loss for y^T = (No, Yes, Yes, No)? (3) Which model is better for logistic regression? (4) According to logistic regression with the first model, what is the prediction on a new input (181, 92, 12.4)?
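A small script to check one's answers (my own sketch, not part of the slides; it assumes the second function's constant is -332 as written above and encodes Yes = 1, No = 0 for classification):

```python
# Minimal sketch: evaluate the two candidate linear functions on the exercise data.
import numpy as np

X = np.array([[182, 87, 11.3],
              [189, 92, 12.3],
              [178, 79, 10.6],
              [183, 90, 12.7]])
y_reg = np.array([325, 344, 350, 320])        # regression targets
y_cls = np.array([0, 1, 1, 0])                # No, Yes, Yes, No

w1, b1 = np.array([2.0, 1.0, 20.0]), -330.0   # first model
w2, b2 = np.array([1.0, -2.0, 23.0]), -332.0  # second model (constant assumed to be -332)

for name, w, b in [("model 1", w1, b1), ("model 2", w2, b2)]:
    out = X @ w + b
    rmse = np.sqrt(np.mean((out - y_reg) ** 2))          # (1) linear regression error
    zero_one = np.mean((out > 0).astype(int) != y_cls)   # (2) 0-1 loss of step(w^T x + b)
    # (3) logistic-regression loss, written with logaddexp for numerical stability:
    # -log sigmoid(out) = logaddexp(0, -out), -log(1 - sigmoid(out)) = logaddexp(0, out)
    nll = np.mean(y_cls * np.logaddexp(0, -out) + (1 - y_cls) * np.logaddexp(0, out))
    print(name, rmse, zero_one, nll)

# (4) logistic-regression prediction of the first model on the new input
x_new = np.array([181, 92, 12.4])
print("P(y=1 | x_new) =", 1.0 / (1.0 + np.exp(-(x_new @ w1 + b1))))
```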

Extended Materials

Review: binary logistic regression

Multiclass logistic regression

Multiclass logistic regression: conclusion

Softmax
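For reference, the standard definition of the softmax over K scores z_1, ..., z_K (a textbook formula, e.g. with z_k = w_k^T x for class k; not reproduced from the slide):

```latex
% Softmax over K scores z_1, ..., z_K
\[
  \mathrm{softmax}(z)_k \;=\; \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}},
  \qquad k = 1, \dots, K,
\]
```

The K outputs are positive and sum to 1, so they can be read as class probabilities.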

Cross entropy for conditional distribution

Cross entropy for full distribution
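For reference, the standard cross-entropy between a target distribution p and a model distribution q (a textbook formula, not reproduced from the slides):

```latex
% Cross entropy between distributions p (target) and q (model)
\[
  H(p, q) \;=\; -\sum_{k} p(k)\,\log q(k).
\]
```

For a one-hot target p and softmax outputs q, this reduces to the negative log-probability of the correct class; averaging it over the training inputs gives the multiclass logistic-regression loss.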