
Warm up
Regrade requests are submitted directly in Gradescope; do not email instructors.
1 float in NumPy = 8 bytes. $10^6 \approx 2^{20}$ bytes = 1 MB, $10^9 \approx 2^{30}$ bytes = 1 GB.
For each block, compute the memory required in terms of n, p, d. If d << p << n, which is the most memory-efficient program (blue, green, red)? If you have unlimited memory, which program do you think is the fastest?
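
As a quick sanity check of the byte arithmetic above, here is a small Python sketch. The n, p, d values are made up for illustration (the slide's blue/green/red programs are not reproduced here); only the 8-bytes-per-float bookkeeping matters.

```python
import numpy as np

# Illustrative sizes only (not the slide's actual programs), chosen so that d << p << n.
n, p, d = 1_000_000, 10_000, 100

def gb(num_floats):
    """Memory in GB for num_floats float64 values: 8 bytes each, 1 GB ~ 2**30 bytes."""
    return num_floats * 8 / 2**30

print(f"X     (n x p): {gb(n * p):9.2f} GB")   # the data matrix dominates
print(f"X^T X (p x p): {gb(p * p):9.2f} GB")
print(f"W     (p x d): {gb(p * d):9.4f} GB")
print(np.zeros((p, d)).nbytes / 2**30)          # NumPy agrees: 8 * p * d bytes
```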

Gradient Descent
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 18, 2016

Machine Learning Problems
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Machine Learning Problems
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
$g$ is a subgradient of $f$ at $x$ if $f(y) \geq f(x) + g^T(y - x)$ for all $y$.
$f$ convex: $f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$ for all $x, y$ and $\lambda \in [0,1]$, and $f(y) \geq f(x) + \nabla f(x)^T(y - x)$ for all $x, y$.

Machine Learning Problems
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

Least squares
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$
How does software solve $\frac{1}{2}\|Xw - y\|_2^2$?

Least squares
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$
How does software solve $\frac{1}{2}\|Xw - y\|_2^2$? It's complicated (LAPACK, BLAS, MKL, ...):
- Do you need high precision?
- Is X column/row sparse?
- Is $\hat{w}_{LS}$ sparse?
- Is $X^T X$ well-conditioned?
- Can $X^T X$ fit in cache/memory?
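
As an illustration of the "it's complicated" point, here is a minimal sketch of two common routes in NumPy: an SVD-based LAPACK least-squares routine and the normal equations. The data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Route 1: SVD-based solver (LAPACK under the hood); robust to poor conditioning.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Route 2: the normal equations; cheap when d is small, but squares the condition number.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w_lstsq, w_normal, atol=1e-6))
```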

Taylor Series Approximation
Taylor series in one dimension: $f(x + \delta) = f(x) + f'(x)\delta + \frac{1}{2} f''(x)\delta^2 + \ldots$
Gradient descent:

Taylor Series Approximation
Taylor series in d dimensions: $f(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v + \ldots$
Gradient descent:

Gradient Descent
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$
$w_{t+1} = w_t - \eta \nabla f(w_t)$
$\nabla f(w) = $ ?

Gradient Descent
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$, $w_{t+1} = w_t - \eta \nabla f(w_t)$, $\nabla f(w) = X^T(Xw - y)$
$w^\star = \arg\min_w f(w) \implies \nabla f(w^\star) = 0$
$w_{t+1} - w^\star = w_t - w^\star - \eta \nabla f(w_t)$
$\quad = w_t - w^\star - \eta (\nabla f(w_t) - \nabla f(w^\star))$
$\quad = w_t - w^\star - \eta X^T X (w_t - w^\star)$
$\quad = (I - \eta X^T X)(w_t - w^\star)$
$\quad = (I - \eta X^T X)^{t+1} (w_0 - w^\star)$

Gradient Descent
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$, $w_{t+1} = w_t - \eta \nabla f(w_t)$
$(w_{t+1} - w^\star) = (I - \eta X^T X)(w_t - w^\star) = (I - \eta X^T X)^{t+1}(w_0 - w^\star)$
Example: $X = \begin{bmatrix} 10^3 & 0 \\ 0 & 1 \end{bmatrix}$, $y = \begin{bmatrix} 10^3 \\ 1 \end{bmatrix}$, $w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $w^\star = $ ?

Gradient Descent
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$, $w_{t+1} = w_t - \eta \nabla f(w_t)$
$(w_{t+1} - w^\star) = (I - \eta X^T X)(w_t - w^\star) = (I - \eta X^T X)^{t+1}(w_0 - w^\star)$
Example: $X = \begin{bmatrix} 10^3 & 0 \\ 0 & 1 \end{bmatrix}$, $y = \begin{bmatrix} 10^3 \\ 1 \end{bmatrix}$, $w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $w^\star = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $X^T X = \begin{bmatrix} 10^6 & 0 \\ 0 & 1 \end{bmatrix}$
Pick $\eta$ such that $\max\{|1 - \eta \cdot 10^6|, |1 - \eta|\} < 1$. Then
$|w_{t+1,1} - w^\star_1| = |1 - \eta \cdot 10^6|^{t+1} |w_{0,1} - w^\star_1| = |1 - \eta \cdot 10^6|^{t+1}$
$|w_{t+1,2} - w^\star_2| = |1 - \eta|^{t+1} |w_{0,2} - w^\star_2| = |1 - \eta|^{t+1}$
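
A short sketch reproducing this example numerically: with a step size chosen for the large eigenvalue, the well-scaled coordinate converges quickly while the poorly scaled one barely moves. The value 1.9/10^6 is my choice, picked only to satisfy the condition above.

```python
import numpy as np

# The 2x2 example from the slide: X = diag(10**3, 1), y = [10**3, 1], so w* = [1, 1].
X = np.diag([1e3, 1.0])
y = np.array([1e3, 1.0])
w_star = np.array([1.0, 1.0])

eta = 1.9 / 1e6          # satisfies max(|1 - eta*1e6|, |1 - eta|) < 1
w = np.zeros(2)
for t in range(1000):
    w = w - eta * X.T @ (X @ w - y)

print(w)                 # first coordinate is accurate, second has barely moved
print(np.abs(w - w_star))
```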

Taylor Series Approximation
Taylor series in one dimension: $f(x + \delta) = f(x) + f'(x)\delta + \frac{1}{2} f''(x)\delta^2 + \ldots$
Newton's method:

Taylor Series Approximation
Taylor series in d dimensions: $f(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v + \ldots$
Newton's method:

Newton's Method
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$, $\nabla f(w) = $ ?, $\nabla^2 f(w) = $ ?
$v_t$ is the solution to $\nabla^2 f(w_t) v_t = -\nabla f(w_t)$
$w_{t+1} = w_t + \gamma v_t$

Newton's Method
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$, $\nabla f(w) = X^T(Xw - y)$, $\nabla^2 f(w) = X^T X$
$v_t$ is the solution to $\nabla^2 f(w_t) v_t = -\nabla f(w_t)$, and $w_{t+1} = w_t + \gamma v_t$.
For quadratics, Newton's method can converge in one step! (No surprise, why?)
$w_1 = w_0 - \gamma (X^T X)^{-1} X^T (X w_0 - y) = (1 - \gamma) w_0 + \gamma (X^T X)^{-1} X^T y = (1 - \gamma) w_0 + \gamma w^\star$
In general, for $w_t$ close enough to $w^\star$ one should use $\gamma = 1$.
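
A minimal sketch of the one-step claim on a random quadratic (random data, full step $\gamma = 1$); the Newton direction comes from solving the linear system above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y)          # gradient of (1/2)||Xw - y||^2 at w0
hess = X.T @ X                     # Hessian is constant for a quadratic
v = np.linalg.solve(hess, -grad)   # Newton direction
w1 = w0 + 1.0 * v                  # full Newton step (gamma = 1)

w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w1, w_star))     # True: one step reaches the minimizer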

General case
In general, for Newton's method to achieve $f(w_t) - f(w^\star) \leq \epsilon$:
So why are ML problems overwhelmingly solved by gradient methods?
Hint: $v_t$ is the solution to $\nabla^2 f(w_t) v_t = -\nabla f(w_t)$.

General Convex case
Iterations needed to achieve $f(w_t) - f(w^\star) \leq \epsilon$ (clean convergence, nice proofs: Bubeck):
Newton's method: $\approx \log(\log(1/\epsilon))$
Gradient descent:
- $f$ is smooth and strongly convex: $aI \preceq \nabla^2 f(w) \preceq bI$
- $f$ is smooth: $\nabla^2 f(w) \preceq bI$
- $f$ is potentially non-differentiable: $\|\nabla f(w)\|_2 \leq c$
(Nocedal & Wright, Bubeck)
Other methods: BFGS, Heavy-ball, BCD, SVRG, ADAM, Adagrad, ...

Revisiting Logistic Regression
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 18, 2016

Loss function: Conditional Likelihood
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$.
$P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$
$\hat{w}_{MLE} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w)) = \arg\min_w f(w)$
$\nabla f(w) = $ ?
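
The slide leaves $\nabla f(w)$ to be derived in class. One standard form is $\nabla f(w) = -\sum_{i=1}^n \frac{y_i x_i}{1 + \exp(y_i x_i^T w)}$; the sketch below checks this against a finite-difference approximation on random data (illustrative only).

```python
import numpy as np

def f(w, X, y):
    """Negative log-likelihood: sum_i log(1 + exp(-y_i x_i^T w))."""
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def grad_f(w, X, y):
    """Gradient: sum_i -y_i x_i / (1 + exp(y_i x_i^T w))."""
    return X.T @ (-y / (1.0 + np.exp(y * (X @ w))))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = rng.choice([-1.0, 1.0], size=50)
w = rng.standard_normal(4)

# Finite-difference check of the analytic gradient.
eps = 1e-6
num = np.array([(f(w + eps * np.eye(4)[j], X, y) - f(w, X, y)) / eps for j in range(4)])
print(np.allclose(num, grad_f(w, X, y), atol=1e-4))
```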

Stochastic Gradient Descent
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 18, 2016

Stochastic Gradient Descent
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Stochastic Gradient Descent
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Gradient descent: $w_{t+1} = w_t - \eta \left( \frac{1}{n} \sum_{i=1}^n \nabla_w \ell_i(w) \big|_{w=w_t} \right)$

Stochastic Gradient Descent
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Gradient descent: $w_{t+1} = w_t - \eta \left( \frac{1}{n} \sum_{i=1}^n \nabla_w \ell_i(w) \big|_{w=w_t} \right)$
Stochastic gradient descent: $w_{t+1} = w_t - \eta \nabla_w \ell_{I_t}(w) \big|_{w=w_t}$, with $I_t$ drawn uniformly at random from $\{1, \ldots, n\}$, so that $\mathbb{E}[\nabla \ell_{I_t}(w)] = \frac{1}{n}\sum_{i=1}^n \nabla \ell_i(w)$.

Stochastic Gradient Descent
Theorem: Let $w_{t+1} = w_t - \eta \nabla_w \ell_{I_t}(w) \big|_{w=w_t}$ with $I_t$ drawn uniformly at random from $\{1, \ldots, n\}$, so that $\mathbb{E}[\nabla \ell_{I_t}(w)] = \frac{1}{n}\sum_{i=1}^n \nabla \ell_i(w) =: \nabla \ell(w)$.
If $\sup_w \max_i \|\nabla \ell_i(w)\|_2^2 \leq G$ and $\|w_1 - w^\star\|_2^2 \leq R$, then for $\eta = \sqrt{\frac{R}{G T}}$ and $\bar{w} = \frac{1}{T}\sum_{t=1}^T w_t$,
$\mathbb{E}[\ell(\bar{w}) - \ell(w^\star)] \leq \frac{R}{2 T \eta} + \frac{\eta G}{2} \leq \sqrt{\frac{R G}{T}}$
(In practice, use the last iterate.)
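
A minimal sketch of the scheme in the theorem, applied to squared-error losses; the bounds R and G are illustrative guesses rather than computed constants, and the data are random.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 1000, 10, 5000
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

R, G = 10.0, 100.0                 # assumed bounds on ||w_1 - w*||^2 and ||grad l_i||^2
eta = np.sqrt(R / (G * T))         # constant step size from the theorem

w = np.zeros(d)
w_avg = np.zeros(d)
for t in range(T):
    i = rng.integers(n)                      # I_t uniform on {1, ..., n}
    grad_i = 2 * (X[i] @ w - y[i]) * X[i]    # gradient of a single squared-error loss
    w = w - eta * grad_i
    w_avg += w / T                           # running average of the iterates

print(np.mean((X @ w_avg - y) ** 2))         # average loss of the averaged iterate
```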

Stochastic Gradient Descent Proof
$\mathbb{E}[\|w_{t+1} - w^\star\|_2^2] = \mathbb{E}[\|w_t - \eta \nabla \ell_{I_t}(w_t) - w^\star\|_2^2]$

Stochastic Gradient Descent Proof
$\mathbb{E}[\|w_{t+1} - w^\star\|_2^2] = \mathbb{E}[\|w_t - \eta \nabla \ell_{I_t}(w_t) - w^\star\|_2^2]$
$= \mathbb{E}[\|w_t - w^\star\|_2^2] - 2\eta\, \mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^\star)] + \eta^2\, \mathbb{E}[\|\nabla \ell_{I_t}(w_t)\|_2^2]$
$\leq \mathbb{E}[\|w_t - w^\star\|_2^2] - 2\eta\, \mathbb{E}[\ell(w_t) - \ell(w^\star)] + \eta^2 G$
since $\mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^\star)] = \mathbb{E}\big[\mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^\star) \mid I_1, w_1, \ldots, I_{t-1}, w_{t-1}]\big] = \mathbb{E}[\nabla \ell(w_t)^T (w_t - w^\star)] \geq \mathbb{E}[\ell(w_t) - \ell(w^\star)]$.
Rearranging and summing over $t$:
$\sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^\star)] \leq \frac{1}{2\eta}\big(\mathbb{E}[\|w_1 - w^\star\|_2^2] - \mathbb{E}[\|w_{T+1} - w^\star\|_2^2]\big) + \frac{T \eta G}{2} \leq \frac{R}{2\eta} + \frac{T \eta G}{2}$

Stochastic Gradient Descent Proof
Jensen's inequality: for any random $Z \in \mathbb{R}^d$ and convex function $\phi: \mathbb{R}^d \to \mathbb{R}$, $\phi(\mathbb{E}[Z]) \leq \mathbb{E}[\phi(Z)]$.
With $\bar{w} = \frac{1}{T}\sum_{t=1}^T w_t$: $\mathbb{E}[\ell(\bar{w}) - \ell(w^\star)] \leq \frac{1}{T}\sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^\star)]$

Stochastic Gradient Descent Proof
Jensen's inequality: for any random $Z \in \mathbb{R}^d$ and convex function $\phi: \mathbb{R}^d \to \mathbb{R}$, $\phi(\mathbb{E}[Z]) \leq \mathbb{E}[\phi(Z)]$.
With $\bar{w} = \frac{1}{T}\sum_{t=1}^T w_t$: $\mathbb{E}[\ell(\bar{w}) - \ell(w^\star)] \leq \frac{1}{T}\sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^\star)]$
Therefore $\mathbb{E}[\ell(\bar{w}) - \ell(w^\star)] \leq \frac{R}{2 T \eta} + \frac{\eta G}{2} \leq \sqrt{\frac{R G}{T}}$ for $\eta = \sqrt{\frac{R}{G T}}$.

Stochastic Gradient Descent: A Learning Perspective
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 18, 2016

Learning Problems as Expectations
Minimizing loss on training data:
- Given a dataset sampled iid from some distribution $p(x)$ on features
- Loss function, e.g., hinge loss, logistic loss, ...
- We often minimize the loss on the training data
However, we should really minimize the expected loss on all data. So we are approximating the integral (the expectation) by the average over the training data.

Gradient descent in Terms of Expectations
True objective function:
Taking the gradient:
True gradient descent rule:
How do we estimate the expected gradient?

SGD: Stochastic Gradient Descent
True gradient:
Sample-based approximation:
What if we estimate the gradient with just one sample?
- Unbiased estimate of the gradient
- Very noisy!
- Also called stochastic gradient descent (among many other names)
- VERY useful in practice!

Perceptron
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 18, 2018

Online learning
Click prediction for ads is a streaming data task:
- User enters a query, and an ad must be selected: observe $x_j$ and must predict $y_j$
- User either clicks or doesn't click on the ad: label $y_j$ is revealed afterwards
- Google gets a reward if the user clicks on the ad
- Update the model for next time

Online classification
A new point arrives at time k.

The Perceptron Algorithm [Rosenblatt '58, '62]
Classification setting: $y \in \{-1, +1\}$; linear model.
Prediction:
Training:
- Initialize weight vector:
- At each time step:
  - Observe features:
  - Make prediction:
  - Observe true class:
  - Update model: if the prediction is not equal to the truth,

The Perceptron Algorithm [Rosenblatt '58, '62]
Classification setting: $y \in \{-1, +1\}$; linear model.
Prediction: $\mathrm{sign}(w^T x_i + b)$
Training:
- Initialize weight vector: $w_0 = 0$, $b_0 = 0$
- At each time step:
  - Observe features: $x_k$
  - Make prediction: $\mathrm{sign}(x_k^T w_k + b_k)$
  - Observe true class: $y_k$
  - Update model: if the prediction is not equal to the truth,
    $\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix}$
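
A runnable sketch of one streaming pass of this update rule on toy linearly separable data (the data and the single-pass setup are illustrative choices, not part of the slide).

```python
import numpy as np

def perceptron_pass(X, y, w=None, b=0.0):
    """One streaming pass of the perceptron update from the slide."""
    if w is None:
        w = np.zeros(X.shape[1])
    mistakes = 0
    for x_k, y_k in zip(X, y):
        if np.sign(x_k @ w + b) != y_k:      # prediction disagrees with the label
            w = w + y_k * x_k                # [w; b] <- [w; b] + y_k [x_k; 1]
            b = b + y_k
            mistakes += 1
    return w, b, mistakes

# Toy linearly separable data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.5)
w, b, m = perceptron_pass(X, y)
print(m, np.mean(np.sign(X @ w + b) == y))
```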

Rosenblatt 1957 "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." The New York Times, 1958 Kevin Jamieson 2016 40

Linear Separability
The perceptron is guaranteed to converge if the data are linearly separable:

Perceptron Analysis: Linearly Separable Case
Theorem [Block, Novikoff]:
- Given a sequence of labeled examples:
- Each feature vector has bounded norm:
- If the dataset is linearly separable:
Then the number of mistakes made by the online perceptron on any such sequence is bounded by:

Beyond Linearly Separable Case
The perceptron algorithm is super cool!
- No assumption about the data distribution! Could be generated by an oblivious adversary; no need to be iid.
- Makes a fixed number of mistakes, and then it's done forever! Even if you see infinite data.

Beyond Linearly Separable Case
The perceptron algorithm is super cool!
- No assumption about the data distribution! Could be generated by an oblivious adversary; no need to be iid.
- Makes a fixed number of mistakes, and then it's done forever! Even if you see infinite data.
The perceptron is useless in practice!
- The real world is not linearly separable.
- If the data are not separable, it cycles forever, and this is hard to detect.
- Even if separable, it may not give good generalization accuracy (small margin).

What is the Perceptron Doing???
When we discussed logistic regression, we started from maximizing the conditional log-likelihood.
When we discussed the perceptron, we started from a description of an algorithm.
What is the perceptron optimizing?

Support Vector Machines
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 18, 2018

Linear classifiers
Which line is better?

Pick the one with the largest margin!
Hyperplane: $x^T w + b = 0$; margin $2\gamma$.

Pick the one with the largest margin!
Hyperplane: $x^T w + b = 0$. What is the distance from a point $x_0$ to the hyperplane defined by $x^T w + b = 0$?

Pick the one with the largest margin!
Hyperplane: $x^T w + b = 0$. What is the distance from a point $x_0$ to the hyperplane defined by $x^T w + b = 0$?
If $\tilde{x}_0$ is the projection of $x_0$ onto the hyperplane, then
$\|x_0 - \tilde{x}_0\|_2 = \left|(x_0 - \tilde{x}_0)^T \frac{w}{\|w\|_2}\right| = \frac{1}{\|w\|_2}\left|x_0^T w - \tilde{x}_0^T w\right| = \frac{1}{\|w\|_2}\left|x_0^T w + b\right|$
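
A small numerical check of this distance formula against an explicit projection onto the hyperplane; the values of w, b, and x_0 below are arbitrary.

```python
import numpy as np

def distance_to_hyperplane(x0, w, b):
    """Unsigned version of the slide's formula: |x0^T w + b| / ||w||_2."""
    return abs(x0 @ w + b) / np.linalg.norm(w)

# Check against an explicit projection (illustrative values).
w = np.array([3.0, 4.0])
b = -5.0
x0 = np.array([2.0, 1.0])
x0_proj = x0 - ((x0 @ w + b) / (w @ w)) * w    # projection of x0 onto the hyperplane
print(distance_to_hyperplane(x0, w, b), np.linalg.norm(x0 - x0_proj))   # both equal 1.0
```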

Pick the one with the largest margin!
Distance of $x_0$ from the hyperplane $x^T w + b = 0$: $\frac{1}{\|w\|_2}(x_0^T w + b)$
Optimal hyperplane (margin $2\gamma$):
$\max_{w,b} \ \gamma$ subject to $\frac{1}{\|w\|_2} y_i (x_i^T w + b) \geq \gamma \quad \forall i$

Pick the one with the largest margin!
Distance of $x_0$ from the hyperplane $x^T w + b = 0$: $\frac{1}{\|w\|_2}(x_0^T w + b)$
Optimal hyperplane (margin $2\gamma$):
$\max_{w,b} \ \gamma$ subject to $\frac{1}{\|w\|_2} y_i (x_i^T w + b) \geq \gamma \quad \forall i$
Optimal hyperplane (reparameterized):
$\min_{w,b} \ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \geq 1 \quad \forall i$

Pick the one with the largest margin!
Optimal hyperplane (reparameterized):
$\min_{w,b} \ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \geq 1 \quad \forall i$
Solve efficiently by many methods, e.g.:
- quadratic programming (QP): well-studied solution algorithms
- stochastic gradient descent
- coordinate descent (in the dual)

What if the data is still not linearly separable?
If the data are linearly separable (margin width $\frac{1}{\|w\|_2}$ on each side of $x^T w + b = 0$):
$\min_{w,b} \ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \geq 1 \quad \forall i$

What if the data is still not linearly separable?
If the data are linearly separable:
$\min_{w,b} \ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \geq 1 \quad \forall i$
If the data are not linearly separable, some points don't satisfy the margin constraint; introduce slack variables $\xi_i$:
$\min_{w,b,\xi} \ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \geq 1 - \xi_i \quad \forall i$, $\quad \xi_i \geq 0$, $\quad \sum_{j=1}^n \xi_j \leq \nu$

What if the data is still not linearly separable?
If the data are linearly separable:
$\min_{w,b} \ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \geq 1 \quad \forall i$
If the data are not linearly separable, some points don't satisfy the margin constraint; introduce slack variables $\xi_i$:
$\min_{w,b,\xi} \ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \geq 1 - \xi_i \quad \forall i$, $\quad \xi_i \geq 0$, $\quad \sum_{j=1}^n \xi_j \leq \nu$
What are support vectors?

SVM as penalization method
Original quadratic program with linear constraints:
$\min_{w,b,\xi} \ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \geq 1 - \xi_i \quad \forall i$, $\quad \xi_i \geq 0$, $\quad \sum_{j=1}^n \xi_j \leq \nu$

SVM as penalization method
Original quadratic program with linear constraints:
$\min_{w,b,\xi} \ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \geq 1 - \xi_i \quad \forall i$, $\quad \xi_i \geq 0$, $\quad \sum_{j=1}^n \xi_j \leq \nu$
Using the same constrained convex optimization trick as for the lasso: for any $\nu \geq 0$ there exists a $\lambda \geq 0$ such that the solution to the following problem is equivalent:
$\min_{w,b} \ \sum_{i=1}^n \max\{0, 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2$
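
A minimal sketch evaluating this penalized objective on random data; svm_objective and the data are illustrative, with lam standing in for the penalty weight λ.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Penalized form from the slide: sum_i max(0, 1 - y_i(b + x_i^T w)) + lam * ||w||_2^2."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return hinge.sum() + lam * (w @ w)

# Illustrative data and parameters.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100))
print(svm_objective(np.zeros(3), 0.0, X, y, lam=1.0))   # all-zeros model: objective = n
```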

Machine Learning Problems
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Hinge loss: $\ell_i(w) = \max\{0, 1 - y_i x_i^T w\}$
Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$
How do we solve for w? The last two lectures!

Perceptron is optimizing what?
Perceptron update rule:
$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k (b_k + x_k^T w_k) < 0\}$
SVM objective:
$\sum_{i=1}^n \max\{0, 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$
$\nabla_w \ell_i(w, b) = \begin{cases} -x_i y_i + \frac{2\lambda}{n} w & \text{if } y_i (b + x_i^T w) < 1 \\ 0 & \text{otherwise} \end{cases}$
$\nabla_b \ell_i(w, b) = \begin{cases} -y_i & \text{if } y_i (b + x_i^T w) < 1 \\ 0 & \text{otherwise} \end{cases}$
Perceptron is just SGD on the SVM objective with $\lambda = 0$, $\eta = 1$!
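
A sketch of one SGD step on this objective, assuming the case structure written above; with lam = 0 and eta = 1 the update matches the perceptron step whenever the condition fires (the perceptron itself checks the sign, $y_i(b + x_i^T w) < 0$, rather than the margin, $< 1$).

```python
import numpy as np

def sgd_svm_step(w, b, x_i, y_i, lam, eta, n):
    """One SGD step on the SVM objective as written on the slide."""
    if y_i * (b + x_i @ w) < 1:                  # margin violated: hinge term is active
        w = w - eta * (-y_i * x_i + (2 * lam / n) * w)
        b = b - eta * (-y_i)
    return w, b

# With lam = 0 and eta = 1 the update is w += y_i * x_i, b += y_i on a violation,
# which is the perceptron step from the earlier slide.
w, b = np.zeros(2), 0.0
w, b = sgd_svm_step(w, b, np.array([1.0, -1.0]), 1.0, lam=0.0, eta=1.0, n=1)
print(w, b)   # [ 1. -1.] 1.0
```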

SVMs vs logistic regression
We often want probabilities/confidences; logistic wins here?

SVMs vs logistic regression
We often want probabilities/confidences; logistic wins here?
No! Perform isotonic regression or a non-parametric bootstrap for probability calibration. The predictor gives some score; how do we transform that score into a probability?
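
As one hedged illustration of score calibration, the sketch below fits scikit-learn's IsotonicRegression to held-out (score, label) pairs and maps new scores to monotone probability estimates; the scores and labels are synthetic.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical decision scores from some classifier on a held-out calibration set,
# with 0/1 labels; isotonic regression fits a monotone map from score to probability.
rng = np.random.default_rng(0)
scores = rng.normal(size=500)
labels = (rng.random(500) < 1 / (1 + np.exp(-2 * scores))).astype(float)

calibrator = IsotonicRegression(out_of_bounds="clip", y_min=0.0, y_max=1.0)
calibrator.fit(scores, labels)

new_scores = np.array([-2.0, 0.0, 2.0])
print(calibrator.predict(new_scores))   # monotone estimates of P(y = 1 | score)
```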

SVMs vs logistic regression
We often want probabilities/confidences; logistic wins here?
No! Perform isotonic regression or a non-parametric bootstrap for probability calibration. The predictor gives some score; how do we transform that score into a probability?
For classification loss, logistic and SVM are comparable.
Multiclass setting: softmax naturally generalizes logistic regression; SVMs have ...
What about good old least squares?

What about multiple classes?