Machine Learning (CSE 446): Multi-Class Classification; Kernel Methods


Machine Learning (CSE 446): Multi-Class Classification; Kernel Methods
Sham M. Kakade, © 2018 University of Washington
cse446-staff@cs.washington.edu

Announcements

HW3 is due as posted. Make sure to use the updated HW PDF file (updated today) for clarifications: always use average squared error; always report your lowest test losses; show all curves (train/dev/test) together. Check Canvas for updates and announcements.

Extra credit will be due Monday, Feb 26th. You must do all of HW3 if you seek any extra credit.

Office hours: Tue the 20th, 2-3pm.

Today: multi-class classification; non-linearities; kernel methods.

Review/Comments

Some questions on Probabilistic Estimation

Remember: think about taking derivatives from scratch (pretend no one told you about vector/matrix derivatives).

Can you rework the HTHHH example by using the log loss to estimate the bias π? Do you see why it gives the same result?

Suppose you see 51 heads and 38 tails. Do you see why it is helpful to maximize the log probability rather than directly trying to maximize the probability? Try working this example out with both methods!

Can you formulate this problem of estimating π as a logistic regression problem, i.e. without x? We would just use a bias term, where Pr(y = 1) = 1 / (1 + exp(−b)). You learned this before; do you remember what you did?
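A minimal numerical sketch of that last question (my addition, not part of the original slides): estimate the coin bias π from HTHHH by the closed-form maximum likelihood estimate, and again as a bias-only logistic regression trained on the log loss; both should land on the same answer. The variable names and step size are just illustrative.

```python
import numpy as np

# Data: HTHHH encoded as y = 1 for heads, y = 0 for tails.
y = np.array([1, 0, 1, 1, 1], dtype=float)

# Direct maximum likelihood: the closed-form estimate is the sample mean.
pi_mle = y.mean()  # 4/5 = 0.8

# Bias-only logistic regression: Pr(y = 1) = 1 / (1 + exp(-b)).
# Minimize the average log loss over b with plain gradient descent.
b = 0.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-b))   # current predicted probability of heads
    grad = np.mean(p - y)          # derivative of the average negative log likelihood
    b -= 1.0 * grad                # a step size of 1.0 is fine for this 1-d problem

pi_logistic = 1.0 / (1.0 + np.exp(-b))
print(pi_mle, pi_logistic)         # both approach 0.8
```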

Some questions on Training and Optimization

Do you understand that our optimization is not directly minimizing our number of mistakes, and how what we are doing differs from the perceptron?

You will be making two sets of plots in all these HWs. Do you see what we are minimizing? Try to gain intuition for how the parameters adapt based on the underlying error. Do you see whether you minimize the 0/1 loss for the example HTHHH?

Do you understand what a (trivial) lower bound on the square loss is? On the log loss? Do you understand the general conditions under which this is achievable? Problem 2.3 forces you to think about these issues. Do you see what loss you will converge to in problem 2.3?
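A small sketch (my addition, not from the slides) contrasting the 0/1 loss with the log loss on the HTHHH example, which is one way to see why we do not optimize the number of mistakes directly:

```python
import numpy as np

# HTHHH again: y = 1 for heads, 0 for tails.
y = np.array([1, 0, 1, 1, 1], dtype=float)

def avg_log_loss(pi):
    # Average negative log likelihood of the data under bias pi.
    return -np.mean(y * np.log(pi) + (1 - y) * np.log(1 - pi))

def avg_zero_one_loss(pi):
    # Predict heads when pi > 0.5; count the fraction of mistakes.
    pred = 1.0 if pi > 0.5 else 0.0
    return np.mean(pred != y)

for pi in [0.55, 0.8, 0.95]:
    print(pi, avg_zero_one_loss(pi), avg_log_loss(pi))

# The 0/1 loss is flat (0.2) for every pi > 0.5, so it gives no signal about
# how to move pi; the log loss is minimized only at pi = 0.8.
```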

MNIST: comments

Understanding the MNIST table of results. Tricks to get lower error: make your dataset "bigger" via pixel jitter, distortions, and deskewing; these lower the error for pretty much any algorithm. Convolutional methods. There is no dev set for Q5?? The two-class problem is clearly easier than the 10-class problem. All of the results are on the same dataset!
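A hedged sketch (my own, not from the slides) of the "make your dataset bigger" trick: jittering each image by a pixel to generate extra training examples. It assumes 28x28 MNIST-style arrays; the slides do not specify an implementation.

```python
import numpy as np

def jitter(image, dx, dy):
    """Shift a 28x28 image by (dx, dy) pixels, padding the exposed border with zeros."""
    shifted = np.zeros_like(image)
    src = image[max(0, -dy): 28 - max(0, dy), max(0, -dx): 28 - max(0, dx)]
    shifted[max(0, dy): 28 - max(0, -dy), max(0, dx): 28 - max(0, -dx)] = src
    return shifted

# Example: turn one image into five training examples (original + 4 one-pixel shifts).
image = np.random.rand(28, 28)  # stand-in for a real MNIST digit
augmented = [image] + [jitter(image, dx, dy)
                       for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
```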

Today

Multi-class classification

Suppose y ∈ {1, 2, ..., k}. For MNIST, we have k = 10 classes. How do we learn?

Misclassification error: the fraction of times (often measured in %) in which our prediction of the label does not agree with the true label. As in binary classification, we do not optimize this directly; it is often computationally difficult.
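For concreteness, a tiny sketch (my addition) of the misclassification error as a fraction of disagreements; the labels are made up:

```python
import numpy as np

y_true = np.array([3, 1, 4, 1, 5, 9, 2, 6])
y_pred = np.array([3, 1, 4, 0, 5, 9, 2, 5])

misclassification_error = np.mean(y_pred != y_true)  # fraction of disagreements
print(f"{100 * misclassification_error:.1f}%")        # 25.0%
```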

Misclassification error: one perspective...

Misclassification error is a terrible objective function to use anyway: it only gives feedback of "correct or not." Even if you don't predict the true label (i.e., you make a mistake), there is a major difference between your model still thinking the true label is likely vs. thinking the true label is very unlikely. How do we give our model better feedback? Our model must provide probabilities of all outcomes; then we reward/penalize it based on its confidence in the correct answer...

Multi-class classification: one vs. all

Simplest method: consider each class separately. Make k = 10 binary prediction problems: build a separate model of Pr(y^(class) = 1 | x, w^(class)), where y^(class) indicates whether the label equals that class. Example (just like in HW Q1): build k = 10 separate regression models.
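A minimal one-vs-all sketch (my own illustration, under the assumption that each per-class model is a logistic regression with an appended bias feature, trained by plain gradient descent on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 3, 2, 300

# Synthetic data: three Gaussian blobs, one per class.
centers = np.array([[0.0, 3.0], [3.0, 0.0], [-3.0, -3.0]])
y = rng.integers(0, k, size=n)
X = centers[y] + rng.normal(size=(n, d))
X = np.hstack([X, np.ones((n, 1))])       # append a bias feature

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One vs. all: train k separate binary logistic regressions.
W = np.zeros((k, d + 1))
for c in range(k):
    yc = (y == c).astype(float)           # 1 if the label is class c, else 0
    w = np.zeros(d + 1)
    for _ in range(500):
        p = sigmoid(X @ w)
        w -= 0.1 * X.T @ (p - yc) / n     # gradient step on the average log loss
    W[c] = w

# Predict with the class whose binary model is most confident.
pred = np.argmax(sigmoid(X @ W.T), axis=1)
print("train misclassification error:", np.mean(pred != y))
```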

A better probabilistic model: the softmax

y ∈ {1, ..., k}: let's turn the probabilistic crank... The model: we have k weight vectors, w^(1), w^(2), ..., w^(k). For l ∈ {1, ..., k},

p(y = l | x, w^(1), ..., w^(k)) = exp(w^(l) · x) / Σ_{i=1}^{k} exp(w^(i) · x).

It is "over-parameterized": p_W(y = k | x) = 1 − Σ_{i=1}^{k−1} p_W(y = i | x). Maximum likelihood estimation is still a convex problem!
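A short sketch (my own, not from the slides) of computing the softmax probabilities for k weight vectors, with the usual shift by the maximum score for numerical stability:

```python
import numpy as np

def softmax_probs(W, x):
    """Class probabilities p(y = l | x, W), with the rows of W as the k weight vectors."""
    scores = W @ x            # w^(l) . x for each class l
    scores -= scores.max()    # shift for numerical stability; the probabilities are unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

W = np.array([[1.0, -1.0],    # w^(1)
              [0.5,  0.5],    # w^(2)
              [-1.0, 1.0]])   # w^(3)
x = np.array([2.0, 1.0])
p = softmax_probs(W, x)
print(p, p.sum())             # probabilities over k = 3 classes, summing to 1
```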

Aside: why might the square loss be OK for binary classification?

Using the square loss for y ∈ {0, 1}? It doesn't look like a great surrogate loss. Also, it doesn't look like a faithful probabilistic model. What is the Bayes optimal predictor for the square loss with y ∈ {0, 1}? Can we utilize something more non-linear in our regression?
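A short worked derivation (my addition, answering the slide's question) of the Bayes optimal predictor under the square loss:

```latex
% For a fixed x, choose the prediction f to minimize the conditional expected square loss:
%   E[(y - f)^2 | x] = E[y^2 | x] - 2 f E[y | x] + f^2.
% Setting the derivative with respect to f to zero gives f* = E[y | x].
% Since y takes values in {0, 1}, E[y | x] = Pr(y = 1 | x), so the Bayes optimal
% predictor under the square loss is exactly the class-1 probability:
\[
  f^*(x) \;=\; \arg\min_{f}\; \mathbb{E}\big[(y - f)^2 \,\big|\, x\big]
  \;=\; \mathbb{E}[\,y \mid x\,]
  \;=\; \Pr(y = 1 \mid x).
\]
```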

Can We Have Nonlinearity and Convexity?

                       expressiveness    convexity
Linear classifiers     no                yes
Neural networks        yes               no

Kernel methods: a family of approaches that give us nonlinear decision boundaries without giving up convexity.

Let's try to build feature mappings

Let φ(x) be a mapping from the d-dimensional input x to a d̃-dimensional feature vector. 2-dimensional example: quadratic interactions. What do we call these quadratic terms for binary inputs?

Another example

2-dimensional example: bias + linear + quadratic interactions. What do we call these quadratic terms for binary inputs?
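A small sketch (my own; the slides' exact choice of terms isn't reproduced in this transcript) of the two 2-dimensional feature maps just described: quadratic interactions only, and bias + linear + quadratic interactions.

```python
import numpy as np

def phi_quadratic(x):
    """Quadratic interaction terms for 2-d x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, x1 * x2])

def phi_bias_linear_quadratic(x):
    """Bias + linear + quadratic interaction terms for 2-d x."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2])

x = np.array([2.0, 3.0])
print(phi_quadratic(x))               # [4. 9. 6.]
print(phi_bias_linear_quadratic(x))   # [1. 2. 3. 4. 9. 6.]
```

For binary inputs, a product such as x1 * x2 is just the logical AND (conjunction) of the two features, which is presumably what the slide's question is pointing at.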

The Kernel Trick

Some learning algorithms, like (linear or logistic) regression, only need you to specify a way to take inner products between your feature vectors. A kernel function (implicitly) computes this inner product: K(x, v) = φ(x) · φ(v) for some φ. Typically it is cheap to compute K(·, ·), and we never explicitly represent φ(v) for any vector v. Let's see!
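As a concrete instance (my addition; the degree-2 polynomial kernel is an assumed example, not necessarily the one shown in lecture), K(x, v) = (1 + x · v)^2 equals φ(x) · φ(v) for a particular bias + linear + quadratic feature map, which we can check numerically:

```python
import numpy as np

def poly2_kernel(x, v):
    """Degree-2 polynomial kernel: computed without ever forming phi."""
    return (1.0 + x @ v) ** 2

def phi(x):
    """Explicit feature map matching the kernel above (for 2-d x)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2])

x = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
print(poly2_kernel(x, v))   # 4.0, using only the 2-d inner product x . v
print(phi(x) @ phi(v))      # also 4.0: the same inner product in the 6-d feature space
```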

Examples...