CS489/698: Intro to ML

CS489/698: Intro to ML. Lecture 04: Logistic Regression.

Outline
- Announcements
- Baseline Learning
- Machine Learning Pyramid
- Regression or Classification (that's it!)
- History of Classification
- History of Solvers (analytical, to convex, to non-convex but smooth)
- Convexity
- SGD
- Perceptron Review
- Bernoulli model / Logistic Regression
- Tensorflow Playground / demo code
- Multiclass

Announcements. Assignment 1 is due next Tuesday.

Baseline. Assuming linear algebra basics. (9/21/17, Francois Chaubard and Agastya Kalra, Focal Systems)

ML Pyramid (top to bottom):
- Deep Learning
- Machine Learning (anything fancier than simple sklearn fit/predict calls)
- Software Engineering
- Convex Optimization
- Linear Algebra
- Information Theory
- Probability and Statistics

Regression or Classification?

Classification. Higher-complexity datasets need higher-capacity models (formally, higher VC dimension): Perceptron → Logistic Regression → MLP → feature engineering → etc.

History of Solvers
- Closed form (<1950s): least squares, etc.
- Iterative methods + convex (1950-2012): interior point methods, log barrier, etc.
- Iterative methods + smooth (2012+): deep learning, etc.

Convexity. Jensen's inequality.
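
For reference, convexity and Jensen's inequality can be stated as follows (standard definitions; the slide itself does not spell them out):

```latex
% f : \mathbb{R}^d \to \mathbb{R} is convex iff for all x, y and \lambda \in [0, 1]:
f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)

% Jensen's inequality extends this from two points to expectations: for convex f,
f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]
```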

SGD.
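
The slide gives no formulas, but a minimal SGD loop for logistic regression might look like this (function names, hyperparameters, and the toy data are all illustrative):

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, epochs=100, seed=0):
    """Plain SGD on the logistic negative log-likelihood, one example per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):                  # shuffle each epoch
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))       # predicted P(Y=1|x_i)
            w -= lr * (p - y[i]) * X[i]               # per-sample gradient (p - y) x
    return w

# Toy separable data: label is 1 iff x1 + x2 > 0
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = sgd_logistic(X, y)
preds = (X @ w > 0).astype(float)
```

With a decaying step size the same loop handles general convex losses; a constant rate is enough for this toy data.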

Perceptron. Issues:
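
For comparison with what follows, the classic mistake-driven perceptron update can be sketched as below (the data is made up):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Perceptron: update only on mistakes. Labels y must be +/-1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:   # mistake (or exactly on the boundary)
                w += yi * xi         # move the boundary toward the correct side
    return w

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron(X, y)
```

Commonly cited issues with the perceptron: it never converges on non-separable data, the solution depends on the presentation order, and it attaches no confidence to its predictions, which is what motivates the probabilistic model next.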

Bernoulli model. Let P(Y=1|X=x) = p(x; w), parameterized by w. The conditional likelihood on {(x_1, y_1), ..., (x_n, y_n)} is P(Y_1 = y_1, ..., Y_n = y_n | X_1 = x_1, ..., X_n = x_n), which simplifies if independence holds:

∏_{i=1}^n P(Y_i = y_i | X_i = x_i) = ∏_{i=1}^n p(x_i; w)^{y_i} (1 − p(x_i; w))^{1−y_i},

assuming y_i is {0,1}-valued.
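
Assuming independent examples, the product above can be evaluated directly; the probabilities and labels below are illustrative:

```python
import numpy as np

def bernoulli_likelihood(p, y):
    """Product of p_i^{y_i} (1 - p_i)^{1 - y_i} over {0,1}-valued labels."""
    p, y = np.asarray(p), np.asarray(y)
    return np.prod(p**y * (1.0 - p)**(1 - y))

p = np.array([0.9, 0.8, 0.3])   # model's P(Y=1|x_i) for three examples
y = np.array([1, 1, 0])
L = bernoulli_likelihood(p, y)  # 0.9 * 0.8 * 0.7
```

In practice one works with the log of this product, as the later maximum-likelihood slide does, since the raw product underflows quickly.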

Naïve solution. Find w to maximize the conditional likelihood ∏_{i=1}^n p(x_i; w)^{y_i} (1 − p(x_i; w))^{1−y_i}. What is the solution if p(x; w) does not depend on x? What is the solution if p(x; w) does not depend on w?

Generalized linear models (GLM).
- y ~ Bernoulli(p), with p = p(x; w): logistic regression.
- y ~ Normal(μ, σ²), with μ = μ(x; w): (weighted) least-squares regression.
- In general, GLM: p(y; θ) = exp( θᵀφ(y) − A(θ) ), where θ is the natural parameter, φ(y) the sufficient statistics, and A(θ) the log-partition function.

Logit transform. Try p(x; w) = wᵀx? No: p ≥ 0 is not guaranteed. Try log p(x; w) = wᵀx? Better, but still mismatched: the LHS is negative while the RHS is real-valued. The logit transform equates the log odds ratio to a linear function:

log[ p(x; w) / (1 − p(x; w)) ] = wᵀx,

or equivalently

p(x; w) = 1 / (1 + exp(−wᵀx)).
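
The logit and its inverse (the sigmoid) can be checked round-trip numerically; the function names are mine:

```python
import numpy as np

def sigmoid(t):
    """p = 1 / (1 + exp(-t)): maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """Log odds log(p / (1 - p)): maps (0, 1) back to R."""
    return np.log(p / (1.0 - p))

t = np.linspace(-5, 5, 11)
p = sigmoid(t)   # always strictly between 0 and 1
```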

Prediction with confidence. With p(x; w) = 1 / (1 + exp(−wᵀx)), predict ŷ = 1 iff p = P(Y=1|X=x) > ½, i.e., iff wᵀx > 0. The decision boundary is wᵀx = 0, so ŷ = sign(wᵀx) as before, but now with confidence p(x; w).

Not just a classification algorithm. Logistic regression does more than classification: it estimates conditional probabilities, under the logit transform assumption. Having confidence in a prediction is nice; the price is an assumption that may or may not hold. If classification is the sole goal, then we are doing extra work; as we shall see, SVM only estimates the decision boundary.

More than logistic regression. F(p) transforms p from [0,1] to ℝ; we then equate F(p) to a linear function wᵀx. But there are many other choices for F: precisely, the inverse of any distribution function!

Logistic distribution. Cumulative distribution function:

F(x; μ, s) = 1 / (1 + exp(−(x − μ)/s)),

with mean μ and variance s²π²/3.

Playground.

Tensorflow coding example.

More than 2 classes. Softmax:

P(Y = c | x, w) = exp(w_cᵀx) / Σ_{q=1}^k exp(w_qᵀx).

Again, the outputs are nonnegative and sum to 1. Negative log-likelihood (with y one-hot):

−log ∏_{i=1}^n ∏_{c=1}^k p_ic^{y_ic} = −Σ_{i=1}^n Σ_{c=1}^k y_ic log p_ic.
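
A direct implementation of the softmax and the one-hot negative log-likelihood; the shift by the row maximum is a standard numerical-stability tweak, not something the slide shows:

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: exp(w_c^T x) / sum_q exp(w_q^T x)."""
    z = scores - scores.max(axis=1, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(P, Y):
    """- sum_i sum_c y_ic log p_ic for one-hot label matrix Y."""
    return -(Y * np.log(P)).sum()

scores = np.array([[2.0, 1.0, 0.1],    # class scores w_q^T x for 2 examples
                   [0.5, 2.5, 0.5]])
Y = np.array([[1.0, 0.0, 0.0],         # one-hot labels
              [0.0, 1.0, 0.0]])
P = softmax(scores)
loss = nll(P, Y)
```

The shift leaves the output unchanged because softmax is invariant to adding a constant to every score.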

Questions?

Backup

Classification revisited. ŷ = sign(xᵀw + b). How confident are we about ŷ? xᵀw + b seems a good indicator, but it is real-valued and hard to interpret; there are ways to transform it into [0,1]. A better(?) idea: learn the confidence directly.

Conditional probability. P(Y=1|X=x): conditional on seeing x, what is the chance of this instance being positive, i.e., Y=1? Obviously, a value in [0,1]. P(Y=0|X=x) = 1 − P(Y=1|X=x) if there are two classes; more generally, the probabilities sum to 1. Notation (simplex): Δ^{k−1} := { p ∈ ℝ^k : p ≥ 0, Σ_i p_i = 1 }.
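
The simplex constraint is easy to check programmatically; the helper name is mine:

```python
import numpy as np

def in_simplex(p, tol=1e-9):
    """True iff p >= 0 entrywise and its entries sum to 1 (within tol)."""
    p = np.asarray(p, dtype=float)
    return bool((p >= -tol).all() and abs(p.sum() - 1.0) <= tol)
```

Any vector of class probabilities, such as a softmax output, must pass this check; a vector with a negative entry fails even if its entries sum to 1.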

Reduction to a harder problem. P(Y=1|X=x) = E(1_{Y=1} | X=x), where the indicator 1_A equals 1 if A is true and 0 if A is false. Let Z = 1_{Y=1}; then P(Y=1|X=x) is the regression function for (X, Z). Use linear regression for binary Z? No: exploit structure! Conditional probabilities lie in a simplex. Never reduce to an unnecessarily harder problem.

Maximum likelihood. Maximize ∏_{i=1}^n p(x_i; w)^{y_i} (1 − p(x_i; w))^{1−y_i} with p(x; w) = 1 / (1 + exp(−wᵀx)); equivalently, minimize the negative log-likelihood

Σ_i log( e^{(1−y_i)wᵀx_i} + e^{−y_i wᵀx_i} ) = Σ_i log(1 + e^{−ỹ_i wᵀx_i}),

where ỹ_i = 2y_i − 1 ∈ {−1, +1}.
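
A quick numerical check that the {0,1}-label form of the negative log-likelihood agrees with the margin form Σ_i log(1 + e^{−ỹ_i wᵀx_i}) for ỹ = 2y − 1; the random toy data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10)            # labels in {0, 1}
w = rng.normal(size=3)

# Form 1: NLL of prod p^y (1-p)^(1-y), with p = sigmoid(w^T x)
p = 1.0 / (1.0 + np.exp(-X @ w))
nll1 = -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

# Form 2: sum_i log(1 + exp(-ytilde_i w^T x_i)), with ytilde in {-1, +1}
yt = 2 * y - 1
nll2 = np.log1p(np.exp(-yt * (X @ w))).sum()
```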

Newton's algorithm.

w_{t+1} ← w_t − η_t [∇²f(w_t)]^{−1} ∇f(w_t),
∇f(w_t) = Xᵀ(p − y),
∇²f(w_t) = Σ_i p_i (1 − p_i) x_i x_iᵀ (PSD),
p_i = 1 / (1 + e^{−w_tᵀx_i}).

With η = 1 this is iteratively reweighted least squares; uncertain predictions (p_i near ½) get bigger weight.
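
A sketch of one Newton step using the gradient Xᵀ(p − y) and Hessian Σ_i p_i(1 − p_i) x_i x_iᵀ; the toy data and iteration count are illustrative:

```python
import numpy as np

def newton_step(w, X, y, eta=1.0):
    """One Newton step on the logistic negative log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-X @ w))          # current P(Y=1|x_i)
    grad = X.T @ (p - y)                      # gradient X^T (p - y)
    H = (X * (p * (1 - p))[:, None]).T @ X    # Hessian sum_i p_i(1-p_i) x_i x_i^T
    return w - eta * np.linalg.solve(H, grad)

def nll(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

# Tiny non-separable 1D problem with an intercept column
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
start = nll(w, X, y)
for _ in range(10):
    w = newton_step(w, X, y)
```

Note that on linearly separable data the unregularized weights diverge and the Hessian becomes ill-conditioned, which is why the toy data above is deliberately non-separable.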

A word about implementation. Numerically computing exponentials can be tricky: they easily underflow or overflow. The usual trick: estimate the range of the exponents and shift the mean of the exponents to 0.
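
This trick is commonly implemented by shifting the largest exponent to 0 (shifting by the mean works the same way); a sketch:

```python
import numpy as np

def logsumexp(a):
    """log(sum(exp(a))) without overflow: subtract the max exponent first."""
    m = a.max()
    return m + np.log(np.exp(a - m).sum())   # exponents now <= 0, no overflow

big = np.array([1000.0, 1001.0])     # naive np.exp overflows to inf here
small = np.array([-1000.0, -1001.0]) # naive np.exp underflows to 0 here
```

After the shift, every exponent is at most 0, and at least one term in the sum equals 1, so neither overflow nor total underflow can occur.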

Robustness. With ℓ(t) = log(1 + e^t), the logistic loss is L(ŷ, y) = ℓ(−ŷy), ŷ = wᵀx. Bounded derivative:

ℓ′(t) = e^t / (1 + e^t) = 1 / (1 + e^{−t}).

Variational form:

log(1 + e^t) = min_{0≤η≤1} η(1 + e^t) − log η − 1,

so examples with larger exponential loss get smaller weights η.
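
The variational identity log(1 + e^t) = min_{0≤η≤1} η(1 + e^t) − log η − 1 has minimizer η* = 1/(1 + e^t), which is exactly why larger losses receive smaller weights. A numerical check (t and the grid are arbitrary):

```python
import numpy as np

def variational_rhs(t, eta):
    """RHS of the variational bound: eta*(1 + e^t) - log(eta) - 1."""
    return eta * (1.0 + np.exp(t)) - np.log(eta) - 1.0

t = 1.5
etas = np.linspace(1e-4, 1.0, 100_000)   # grid over (0, 1]
vals = variational_rhs(t, etas)
eta_star = 1.0 / (1.0 + np.exp(t))       # closed-form minimizer
```

Setting the η-derivative (1 + e^t) − 1/η to zero gives η* = 1/(1 + e^t), and plugging it back in recovers log(1 + e^t) exactly.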