Introduction to Machine Learning


Logistic Regression

Varun Chandola

April 9, 2017

Contents

1 Generative vs. Discriminative Classifiers
2 Logistic Regression
3 Logistic Regression - Training
  3.1 Using Gradient Descent for Learning Weights
  3.2 Using Newton's Method
  3.3 Regularization with Logistic Regression
  3.4 Handling Multiple Classes
  3.5 Bayesian Logistic Regression
  3.6 Laplace Approximation
  3.7 Posterior of w for Logistic Regression
  3.8 Approximating the Posterior
  3.9 Getting Predictions on Unseen Examples

1 Generative vs. Discriminative Classifiers

How do you estimate p(y|x)? Consider a probabilistic classification task: estimate p(y = benign | X = x) and p(y = malicious | X = x), using

    p(y|x) = p(y, x) / p(x) = p(x|y) p(y) / p(x)

- Two-step approach: estimate a generative model and then the posterior for y (e.g., Naïve Bayes). This solves a more general problem than is needed [2, 1].
- Why not directly model p(y|x)? This is the discriminative approach.
- The number of training examples needed for PAC learning a classifier is proportional to the VC-dimension of the hypothesis space.
- The VC-dimension of a probabilistic classifier is proportional to the number of parameters [2] (or a small polynomial in the number of parameters).
- The number of parameters for p(y, x) is larger than the number of parameters for p(y|x).
- Hence discriminative classifiers need fewer training examples for PAC learning than generative classifiers.

2 Logistic Regression

- y|x is a Bernoulli distribution with parameter θ = σ(w^T x), where σ is the sigmoid function.
- When a new input x* arrives, we toss a coin which has σ(w^T x*) as the probability of heads.
- If the outcome is heads, the predicted class is 1, else 0.
- Logistic regression learns a linear decision boundary.

Learning task for logistic regression: given training examples ⟨x_i, y_i⟩ ∈ D, learn w.

Bayesian interpretation: directly model p(y|x) for y ∈ {0, 1},

    p(y|x) ~ Bernoulli(θ = σ(w^T x))

Geometric interpretation: use regression to predict discrete values, squashing the output to [0, 1] with the sigmoid function. An output below 0.5 is assigned to one class and an output above 0.5 to the other.
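The prediction rule just described can be written as a minimal Python sketch (the helper names sigmoid, predict_proba, and predict are mine, not from the notes):

    import numpy as np

    def sigmoid(a):
        # sigma(a) = 1 / (1 + exp(-a))
        return 1.0 / (1.0 + np.exp(-a))

    def predict_proba(w, X):
        # theta_i = sigma(w^T x_i) for each row x_i of X
        return sigmoid(X @ w)

    def predict(w, X):
        # Heads (class 1) whenever theta_i > 0.5, i.e. whenever w^T x_i > 0,
        # which is why the learned decision boundary is linear.
        return (predict_proba(w, X) > 0.5).astype(int)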

3 Logistic Regression - Training

MLE approach. Assume that y ∈ {0, 1}. What is the likelihood of a Bernoulli sample?

- If y_i = 1:  p(y_i) = θ_i = 1 / (1 + exp(-w^T x_i))
- If y_i = 0:  p(y_i) = 1 - θ_i = 1 / (1 + exp(w^T x_i))
- In general:  p(y_i) = θ_i^{y_i} (1 - θ_i)^{1 - y_i}

The log-likelihood is

    LL(w) = Σ_{i=1}^N [ y_i log θ_i + (1 - y_i) log(1 - θ_i) ]

There is no closed-form solution for maximizing the log-likelihood. To see why, differentiate LL(w) with respect to w, using the following useful property of the sigmoid:

    ∂θ_i/∂w = θ_i (1 - θ_i) x_i

Using this result we obtain

    ∇_w LL(w) = Σ_{i=1}^N [ (y_i / θ_i) θ_i (1 - θ_i) x_i - ((1 - y_i) / (1 - θ_i)) θ_i (1 - θ_i) x_i ]
              = Σ_{i=1}^N ( y_i (1 - θ_i) - (1 - y_i) θ_i ) x_i
              = Σ_{i=1}^N ( y_i - θ_i ) x_i

Since θ_i is a non-linear function of w, setting this gradient to zero admits no closed-form solution.

3.1 Using Gradient Descent for Learning Weights

- Compute the gradient of LL(w) with respect to w:

    ∇_w LL(w) = Σ_{i=1}^N ( y_i - θ_i ) x_i

- LL(w) is a concave function of w with a unique global maximum (equivalently, the negative log-likelihood is convex).
- Update rule (gradient ascent):

    w^{k+1} = w^k + η ∇_w LL(w^k)
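A minimal sketch of this gradient-ascent loop, reusing numpy and the sigmoid helper from the earlier sketch (fit_gradient_ascent, lr, and n_iters are my own illustrative names and defaults):

    def fit_gradient_ascent(X, y, lr=0.1, n_iters=1000):
        # Maximize LL(w) = sum_i [ y_i log(theta_i) + (1 - y_i) log(1 - theta_i) ].
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iters):
            theta = sigmoid(X @ w)      # theta_i = sigma(w^T x_i)
            grad = X.T @ (y - theta)    # grad LL(w) = sum_i (y_i - theta_i) x_i
            w = w + lr * grad           # ascent step: w^{k+1} = w^k + eta * grad
        return w

Because LL(w) is concave, this converges to the global maximum for a small enough step size; in practice the choice of η matters, which motivates the next section.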

3.2 Using Newton's Method

- Setting η is sometimes tricky: too large gives incorrect results, too small gives slow convergence.
- Another way to speed up convergence is Newton's method:

    w^{k+1} = w^k + η H^{-1} ∇_w LL(w^k)

- The Hessian H is the second-order derivative of the objective function; Newton's method belongs to the family of second-order optimization algorithms.
- For logistic regression the Hessian is

    H = -Σ_i θ_i (1 - θ_i) x_i x_i^T

3.3 Regularization with Logistic Regression

- Overfitting is an issue, especially with a large number of features.
- Add a Gaussian prior N(0, τ²) on w (or, equivalently, a regularization penalty).
- This is easy to incorporate into the gradient-descent-based approach:

    LL'(w) = LL(w) - (1/2) λ w^T w
    ∇_w LL'(w) = ∇_w LL(w) - λ w
    H' = H - λ I

  where I is the identity matrix.
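The regularized Newton update of Sections 3.2-3.3 might be sketched as follows (again reusing numpy and sigmoid from above; fit_newton is my own name, the step size η is fixed to 1, and λ is passed in directly):

    def fit_newton(X, y, lam=1.0, n_iters=25):
        # Newton ascent on the penalized log-likelihood LL'(w) = LL(w) - (lam/2) w^T w:
        #   grad LL'(w) = sum_i (y_i - theta_i) x_i - lam * w
        #   -H'         = sum_i theta_i (1 - theta_i) x_i x_i^T + lam * I
        #   w^{k+1}     = w^k + (-H')^{-1} grad LL'(w^k)
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iters):
            theta = sigmoid(X @ w)
            grad = X.T @ (y - theta) - lam * w
            R = theta * (1.0 - theta)                          # weights theta_i (1 - theta_i)
            H_neg = X.T @ (X * R[:, None]) + lam * np.eye(d)   # -H', positive definite
            w = w + np.linalg.solve(H_neg, grad)
        return w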

3.4 Handling Multiple Classes

- One vs. Rest and One vs. Other.
- Alternatively, model p(y|x) ~ Multinoulli(θ), where the Multinoulli parameter vector θ is defined as

    θ_j = exp(w_j^T x) / Σ_{k=1}^C exp(w_k^T x)

- Multiclass logistic regression has C weight vectors to learn.

3.5 Bayesian Logistic Regression

- How do we get the posterior for w? Not easy - why?
- We use a Gaussian prior for w, which is not a conjugate prior for the Bernoulli distribution used in logistic regression. In fact, there is no convenient conjugate prior for logistic regression.
- Laplace approximation: we do not know what the true posterior distribution of w is. Is there a close-enough (approximate) Gaussian distribution?

3.6 Laplace Approximation

Problem statement: how do we approximate a posterior with a Gaussian distribution? When is this needed? When direct computation of the posterior is not possible, i.e., when there is no conjugate prior.

Assume that the posterior is

    p(w|D) = (1/Z) e^{-E(w)}

where E(w) is the energy function, the negative log of the unnormalized posterior. Let w_MAP be the mode of the posterior distribution of w. A Taylor series expansion of E(w) around the mode gives

    E(w) = E(w_MAP) + (w - w_MAP)^T E'(w_MAP) + (1/2)(w - w_MAP)^T E''(w_MAP)(w - w_MAP) + ...
         ≈ E(w_MAP) + (w - w_MAP)^T E'(w_MAP) + (1/2)(w - w_MAP)^T E''(w_MAP)(w - w_MAP)

where E'(w) is the first derivative (gradient) of E(w) and E''(w) = H is the second derivative (Hessian). Since w_MAP is the mode, the gradient there is zero, so

    E(w) ≈ E(w_MAP) + (1/2)(w - w_MAP)^T H (w - w_MAP)

The posterior p(w|D) may then be written as

    p(w|D) ≈ (1/Z) e^{-E(w_MAP)} exp[ -(1/2)(w - w_MAP)^T H (w - w_MAP) ] = N(w_MAP, H^{-1})

where w_MAP is the mode, obtained by maximizing the posterior using gradient ascent.

3.7 Posterior of w for Logistic Regression

Prior:

    p(w) = N(0, τ² I)

Likelihood of the data, with θ_i = 1 / (1 + e^{-w^T x_i}):

    p(D|w) = Π_{i=1}^N θ_i^{y_i} (1 - θ_i)^{1 - y_i}

Posterior:

    p(w|D) = N(0, τ² I) Π_{i=1}^N θ_i^{y_i} (1 - θ_i)^{1 - y_i} / ∫ p(D|w) p(w) dw

3.8 Approximating the Posterior

- Approximate the posterior distribution by p(w|D) ≈ N(w_MAP, H^{-1}).
- H is the Hessian of the negative log-posterior with respect to w. The Hessian is the matrix obtained by double differentiation of a scalar function of a vector; here the scalar function is the negative log-posterior of w:

    H = ∇²( -log p(D|w) - log p(w) )

3.9 Getting Predictions on Unseen Examples

    p(y*|x*) = ∫ p(y*|x*, w) p(w|D) dw

1. Use a point estimate of w (MLE or MAP); the integral then disappears from the Bayesian averaging equation.
2. Analytical result.
3. Monte Carlo approximation (numerical integration): sample a finite set of w's from p(w|D) ≈ N(w_MAP, H^{-1}), compute p(y*|x*, w) for each sample, and average.
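Putting Sections 3.7-3.8 together, here is a sketch of fitting the Laplace approximation (it assumes the fit_newton and sigmoid helpers sketched above; laplace_posterior and tau2 are my own names, and the penalty is taken as λ = 1/τ², the standard correspondence for a N(0, τ² I) prior):

    def laplace_posterior(X, y, tau2=1.0):
        lam = 1.0 / tau2                      # Gaussian prior N(0, tau^2 I) <-> L2 penalty lam = 1/tau^2
        w_map = fit_newton(X, y, lam=lam)     # mode of the posterior (MAP estimate)
        theta = sigmoid(X @ w_map)
        R = theta * (1.0 - theta)
        # Hessian of the negative log-posterior at w_MAP: H = X^T R X + (1/tau^2) I
        H = X.T @ (X * R[:, None]) + lam * np.eye(X.shape[1])
        # Laplace approximation: p(w | D) ~= N(w_MAP, H^{-1})
        return w_map, np.linalg.inv(H)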

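And a sketch of the Monte Carlo approximation from Section 3.9, averaging σ(w_s^T x*) over samples w_s drawn from the Laplace posterior (n_samples is an arbitrary illustrative choice):

    def predict_bayes_mc(x_new, w_map, Sigma, n_samples=1000, rng=None):
        # p(y* = 1 | x*) ~= (1/S) sum_s sigma(w_s^T x*),   w_s ~ N(w_MAP, Sigma)
        rng = np.random.default_rng() if rng is None else rng
        W = rng.multivariate_normal(w_map, Sigma, size=n_samples)   # (S, d) samples of w
        return sigmoid(W @ x_new).mean()

Using a point estimate instead (option 1 above) amounts to returning sigmoid(w_map @ x_new) directly, with no sampling.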
References

[1] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS, pages 841-848. MIT Press, 2001.

[2] V. Vapnik. Statistical Learning Theory. Wiley, 1998.