CPSC 340: Machine Learning and Data Mining. Linear Classifiers: Predictions. Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.

Admin. Assignment 4: due Friday of next week. Midterm: well done! You have until Thursday to discuss grading concerns (in person); see Piazza for the post regarding Q4a. You can collect exam papers after class or in office hours.

Part 3 Key Ideas: Linear Models, Least Squares. The focus of Part 3 is linear models: supervised learning where the prediction is a linear combination of the features, ŷ_i = w^T x_i. Regression: the target y_i is numerical, so testing whether ŷ_i == y_i doesn't make sense. Squared error f(w) = (1/2)||Xw - y||^2: we can find the optimal w by solving the normal equations X^T X w = X^T y.
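
As a rough sketch of what this looks like in code (a minimal example with made-up data; the names X, y, and w follow the notation above, everything else is my own choice):

import numpy as np

# Toy data: n = 4 examples, d = 2 features (made-up numbers for illustration).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])
y = np.array([1.5, 2.0, 3.5, 5.0])

# Least squares: minimize (1/2)||Xw - y||^2 by solving the normal equations
# X^T X w = X^T y (solved directly rather than forming an explicit inverse).
w = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ w          # predictions are a linear combination of the features
print(w, y_hat)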

Part 3 Key Ideas: Gradient Descent, Error Functions. For large d we often use gradient descent: iterations only cost O(nd), and it converges to a critical point of a smooth function; for convex functions, it finds a global optimum. L1-norm and L∞-norm errors: more/less robust to outliers. We can apply gradient descent after smoothing with the Huber or log-sum-exp approximations.
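
A minimal sketch of gradient descent on the squared error, assuming a fixed step size alpha and a simple gradient-norm stopping rule (both are my choices, not something specified on the slides):

import numpy as np

def gradient_descent_least_squares(X, y, alpha=0.05, max_iter=10000):
    """Minimize f(w) = (1/2)||Xw - y||^2 by gradient descent.

    Each iteration costs O(nd): one matrix-vector product for Xw
    and one for X^T(Xw - y)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        grad = X.T @ (X @ w - y)          # gradient of the smooth, convex objective
        if np.linalg.norm(grad) < 1e-8:   # (approximately) at a critical point: stop
            break
        w = w - alpha * grad
    return w

# Tiny usage example with made-up data (second column acts as a bias feature).
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([2.0, 3.0, 4.0])
print(gradient_descent_least_squares(X, y))   # approximately [1, 1]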

Part 3 Key Ideas: Change of Basis, Complexity Scores. Change of basis: replace the features x_i with non-linear transforms z_i: add a bias variable (a feature that is always one), a polynomial basis, or radial basis functions (a non-parametric basis). We discussed scores for choosing the true model complexity: validation score vs. AIC/BIC. Search and score for feature selection: define a score like BIC, and do a search like forward selection.
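
A small sketch of a change of basis, assuming a hypothetical helper polynomial_basis that builds Z = [1, x, ..., x^p] from a single feature and then fits least squares in the transformed space:

import numpy as np

def polynomial_basis(x, p):
    """Map a 1D feature x to Z = [1, x, x^2, ..., x^p].

    The column of ones is the bias variable; the higher powers give the polynomial basis."""
    return np.column_stack([x ** j for j in range(p + 1)])

# Fit least squares in the transformed space: the model is linear in w,
# but non-linear in the original feature x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 5.0, 10.0, 17.0])   # exactly 1 + x^2
Z = polynomial_basis(x, p=2)
w = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(w)   # close to [1, 0, 1]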

Part 3 Key Ideas: Regularization. L0-regularization (AIC, BIC): adds a penalty on the number of non-zeros, λ||w||_0, to select features. L2-regularization (ridge regression): adds a penalty on the L2-norm of w, f(w) = (1/2)||Xw - y||^2 + (λ/2)||w||^2, to decrease overfitting. L1-regularization (LASSO): adds a penalty on the L1-norm, f(w) = (1/2)||Xw - y||^2 + λ||w||_1, which decreases overfitting and selects features.
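
For instance, a minimal sketch of L2-regularized least squares (ridge regression) solved in closed form; the function name fit_ridge and the variable lam (for λ) are my own:

import numpy as np

def fit_ridge(X, y, lam):
    """Minimize f(w) = (1/2)||Xw - y||^2 + (lam/2)||w||^2.

    Setting the gradient to zero gives (X^T X + lam * I) w = X^T y,
    which has a unique solution whenever lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Made-up data with nearly collinear columns, where regularization helps.
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
print(fit_ridge(X, y, lam=1.0))   # regularization shrinks w toward zero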

Key Idea in Rest of Part 3. The next few lectures will focus on using linear models for classification. It may seem like we're spending a lot of time on linear models, but linear models are used a lot and are understandable; ICBC only uses linear models for insurance estimates. Linear models are also the building blocks for more-advanced methods: latent-factor models in Part 4 and deep learning in Part 5.

Motivation: Identifying Important E-mails. How can we automatically identify important e-mails? This is a binary classification problem ("important" vs. "not important"). Labels are approximated by whether you took an action based on the e-mail. It uses a high-dimensional feature set (that we'll discuss later). Gmail uses a linear classifier for this problem.

Binary Classification Using Regression? Can we apply linear models for binary classification? Set y_i = +1 for one class ("important") and y_i = -1 for the other class ("not important"). At training time, fit a linear regression model with the squared error f(w) = (1/2)Σ_i (w^T x_i - y_i)^2: the model will try to make w^T x_i = +1 for important e-mails and w^T x_i = -1 for not important e-mails.

Binary Classification Using Regression? Can we apply linear models for binary classification? Set y_i = +1 for one class ("important") and y_i = -1 for the other class ("not important"). The linear model gives real numbers like 0.9, -1.1, and so on, so to predict we look at the sign of w^T x_i: If w^T x_i = 0.9, predict ŷ_i = +1. If w^T x_i = -1.1, predict ŷ_i = -1. If w^T x_i = 0.1, predict ŷ_i = +1. If w^T x_i = -100, predict ŷ_i = -1.
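
A minimal sketch of these sign-based predictions, assuming w has already been fit (the weights and test examples below are made up for illustration):

import numpy as np

def predict_binary(X, w):
    """Predict +1 when w^T x_i >= 0 and -1 otherwise."""
    return np.where(X @ w >= 0, 1, -1)

# Made-up fitted weights and test examples.
w = np.array([0.8, -0.5])
X_test = np.array([[1.0, 0.2],    # w^T x =  0.70 -> predict +1
                   [0.5, 2.0],    # w^T x = -0.60 -> predict -1
                   [0.1, 0.1]])   # w^T x =  0.03 -> predict +1
print(predict_binary(X_test, w))  # [ 1 -1  1 ]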

Decision Boundary in 1D (figure).

Decision Boundary in 1D. We can interpret w as a hyperplane separating x into two half-spaces: the half-space where w^T x_i > 0 and the half-space where w^T x_i < 0.

Decision Boundary in 2D. (Figure: decision boundaries of a decision tree, KNN, and a linear classifier.) A linear classifier would be the linear function ŷ_i = β + w_1 x_i1 + w_2 x_i2 coming out of the page (the boundary is at ŷ_i = 0). Or recall from multivariable calculus that a plane in d dimensions is defined by its normal vector in d dimensions, plus an intercept/offset.
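
To make the 2D picture concrete, here is a small numeric check (with a made-up β and w) that the boundary is the set of points where the linear function equals zero and that w is normal to it:

import numpy as np

beta, w = 1.0, np.array([2.0, -1.0])   # made-up intercept and weights

def f(x):
    return beta + w @ x   # linear function; the decision boundary is f(x) = 0

# Two points on the boundary line 2*x1 - x2 + 1 = 0:
a = np.array([0.0, 1.0])
b = np.array([1.0, 3.0])
print(f(a), f(b))               # both 0: on the decision boundary
print(w @ (b - a))              # 0: w is orthogonal (normal) to the boundary direction
print(f(np.array([1.0, 0.0])))  # 3 > 0: this point is classified +1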

Perceptron Algorithm. One of the first learning algorithms was the perceptron (1957). It searches for a w such that sign(w^T x_i) = y_i for all i. Perceptron algorithm: Start with w^0 = 0. Go through the examples in any order until you make a mistake predicting some y_i; then set w^{t+1} = w^t + y_i x_i. Keep going through the examples until you make no errors on the training data. Intuition for the step: if y_i = +1, adding x_i to w makes w^T x_i larger. If a perfect classifier exists, this algorithm finds one in a finite number of steps; in this case we say the training data is linearly separable.
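
A straightforward sketch of the perceptron algorithm as described above; the function name, the max_pass safeguard, and the toy data are my own choices, and it assumes the data is linearly separable so the loop can stop early:

import numpy as np

def perceptron(X, y, max_pass=1000):
    """Perceptron algorithm: find w with sign(w^T x_i) = y_i for all i.

    Start with w = 0; whenever example i is misclassified, set w = w + y_i * x_i.
    Stop after a full pass with no mistakes (or after max_pass passes)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_pass):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:   # mistake (w^T x_i = 0 also counts as a mistake)
                w = w + y[i] * X[i]
                mistakes += 1
        if mistakes == 0:                # perfect on the training data
            break
    return w

# Tiny linearly separable example (bias folded in as a constant feature of 1).
X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0], [-1.0, -1.5, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w) == y)   # all True once a separating w is found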

Lecture continues in Jupyter notebook.

Summary. Binary classification using regression: encode the labels as y_i in {-1, +1} and use sign(w^T x_i) as the prediction. This gives a linear classifier (a hyperplane splitting the space in half). Perceptron algorithm: finds a perfect classifier (if one exists). Least squares is a weird error for classification.

https://en.wikipedia.org/wiki/perceptron

Online Classification with Perceptron. Perceptron for online linear binary classification [Rosenblatt, 1957]: Start with w_0 = 0. At time t we receive features x_t and predict ŷ_t = sign(w_t^T x_t). If ŷ_t ≠ y_t, then set w_{t+1} = w_t + y_t x_t; otherwise, set w_{t+1} = w_t. (These slides are old, so here I'm using subscripts of t instead of superscripts.) Perceptron mistake bound [Novikoff, 1962]: assume the data is linearly separable with a margin γ, i.e., there exists a w* with ||w*|| = 1 such that sign(x_t^T w*) = sign(y_t) for all t and |x_t^T w*| ≥ γ. Then the total number of mistakes is bounded. There is no requirement that the data is IID.
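
A small sketch of the online version, counting mistakes on a synthetic stream that is linearly separable with margin at least 0.1 (the data construction and the γ estimate are my own), and comparing the count against the 1/γ² bound:

import numpy as np

rng = np.random.default_rng(0)

# Build a stream that is linearly separable with a margin:
# a fixed unit-norm w_star labels random unit-norm feature vectors.
w_star = np.array([0.6, 0.8])                      # ||w_star|| = 1
X = rng.standard_normal((500, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # normalize so ||x_t|| = 1
margins = X @ w_star
keep = np.abs(margins) >= 0.1                      # enforce a margin of at least 0.1
X, y = X[keep], np.sign(margins[keep])
gamma = np.min(np.abs(X @ w_star))                 # actual margin of this stream

# Online perceptron: predict, then update only on mistakes.
w = np.zeros(2)
mistakes = 0
for x_t, y_t in zip(X, y):
    if np.sign(w @ x_t) != y_t:        # prediction y_hat_t = sign(w_t^T x_t) is wrong
        w = w + y_t * x_t
        mistakes += 1

print(mistakes, 1.0 / gamma**2)        # mistakes should not exceed 1/gamma^2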

Perceptron Mistake Bound. Let's normalize each x_t so that ||x_t|| = 1; changing the length doesn't change the label. Whenever we make a mistake, we have sign(y_t) ≠ sign(w_t^T x_t), so y_t w_t^T x_t ≤ 0 and ||w_{t+1}||^2 = ||w_t + y_t x_t||^2 = ||w_t||^2 + 2 y_t w_t^T x_t + y_t^2 ||x_t||^2 ≤ ||w_t||^2 + 1. So after k errors we have ||w_t||^2 ≤ k.

Perceptron Mistake Bound. Let's consider a solution w*, so sign(y_t) = sign(x_t^T w*). Whenever we make a mistake, we have w_{t+1}^T w* = (w_t + y_t x_t)^T w* = w_t^T w* + y_t x_t^T w* ≥ w_t^T w* + γ. So after k mistakes we have w_t^T w* ≥ γk, and since ||w*|| = 1 this gives ||w_t|| ≥ γk.

Perceptron Mistake Bound. So our two bounds are ||w_t|| ≤ sqrt(k) and ||w_t|| ≥ γk. This gives γk ≤ sqrt(k), or a maximum of 1/γ^2 mistakes. Note that γ > 0 by assumption, and it is upper-bounded by one because ||x_t|| = 1. After this many mistakes, under our assumptions we're guaranteed to have a perfect classifier.