Naïve Bayes


Statistical Data Mining and Machine Learning, Hilary Term 2016
Dino Sejdinovic, Department of Statistics, Oxford
Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml

Naïve Bayes

Naïve Bayes: another plug-in classifier with a simple generative model - it assumes all measured variables/features are independent given the label. Often used in text document classification, e.g. of scientific articles or emails.

A basic standard model for text classification consists of considering a pre-specified dictionary of $p$ words and summarizing each document $i$ by a binary vector $x_i$, where
$$x_i^{(j)} = \begin{cases} 1 & \text{if word } j \text{ is present in document } i,\\ 0 & \text{otherwise.} \end{cases}$$
Presence of the word $j$ is the $j$-th feature/dimension.

To implement a plug-in classifier, we need a model for the conditional probability mass function $g_k(x) = P(X = x \mid Y = k)$ for each class $k = 1, \dots, K$. Naïve Bayes is a plug-in classifier which ignores feature correlations and assumes that, given the class, each word appears in a document independently of all others:
$$g_k(x_i) = P(X = x_i \mid Y = k) = \prod_{j=1}^p P\big(X^{(j)} = x_i^{(j)} \mid Y = k\big) = \prod_{j=1}^p \phi_{kj}^{x_i^{(j)}} (1 - \phi_{kj})^{1 - x_i^{(j)}},$$
where we denoted the parametrized conditional PMF with $\phi_{kj} = P(X^{(j)} = 1 \mid Y = k)$ (probability that the $j$-th word appears in a class-$k$ document).

MLE: given the dataset, the MLE of the parameters is
$$\hat\pi_k = \frac{n_k}{n}, \qquad \hat\phi_{kj} = \frac{\sum_{i:\, y_i = k} x_i^{(j)}}{n_k},$$
where $n_k$ is the number of training documents labelled as class $k$.

A problem with MLE: if the $l$-th word did not appear in documents labelled as class $k$ then $\hat\phi_{kl} = 0$ and
$$P\big(Y = k \mid X = x \text{ with $l$-th entry equal to } 1\big) \propto \hat\pi_k \prod_{j=1}^p \hat\phi_{kj}^{x^{(j)}} (1 - \hat\phi_{kj})^{1 - x^{(j)}} = 0,$$
i.e. we will never attribute a new document containing word $l$ to class $k$ (regardless of other words in it). This is an example of overfitting.
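
As a concrete illustration, here is a minimal Python/numpy sketch of the plug-in classifier above. The function names and the alpha pseudocount are illustrative additions, not from the slides: alpha = 0 recovers the plain MLE (which suffers from the zero-probability problem just described), while alpha > 0 is a common smoothing remedy that also keeps the logarithms finite.

import numpy as np

# Fit hat pi_k = n_k / n and hat phi_kj = (sum_{i: y_i = k} x_i^(j) + alpha) / (n_k + 2 * alpha).
def fit_naive_bayes(X, y, alpha=1.0):
    # X: (n, p) binary document-word matrix, y: (n,) labels in {0, ..., K-1}
    K = int(y.max()) + 1
    pi = np.array([(y == k).mean() for k in range(K)])
    phi = np.vstack([(X[y == k].sum(axis=0) + alpha) / ((y == k).sum() + 2 * alpha)
                     for k in range(K)])
    return pi, phi

# Predict argmax_k of log pi_k + sum_j [ x^(j) log phi_kj + (1 - x^(j)) log(1 - phi_kj) ].
def predict_naive_bayes(X, pi, phi):
    scores = np.log(pi) + X @ np.log(phi).T + (1 - X) @ np.log(1 - phi).T
    return scores.argmax(axis=1)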

Generative vs Discriminative Learning

Classifiers we have seen so far are generative: we work with a joint distribution $p_{X,Y}(x, y)$ over data vectors and labels. A learning algorithm constructs $f : \mathcal{X} \to \mathcal{Y}$ which predicts the label of $X$. Given a loss function $L$, the risk $R$ of $f(X)$ is
$$R(f) = \mathbb{E}_{p_{X,Y}}[L(Y, f(X))].$$
For 0/1 loss in classification, the Bayes classifier
$$f_{\text{Bayes}}(x) = \operatorname*{argmax}_k p(Y = k \mid x) = \operatorname*{argmax}_k p_{X,Y}(x, k)$$
has the minimum risk (Bayes risk), but is unknown since $p_{X,Y}$ is unknown. Assume a parametric model for the joint: $p_{X,Y}(x, y) = p_{X,Y}(x, y \mid \theta)$. Fit $\hat\theta = \operatorname*{argmax}_\theta \sum_{i=1}^n \log p(x_i, y_i \mid \theta)$ and plug it back into the Bayes classifier:
$$\hat f(x) = \operatorname*{argmax}_k p(Y = k \mid x, \hat\theta) = \operatorname*{argmax}_k p_{X,Y}(x, k \mid \hat\theta).$$

Generative learning: find parameters which explain all the data available,
$$\hat\theta = \operatorname*{argmax}_\theta \sum_{i=1}^n \log p(x_i, y_i \mid \theta).$$
Examples: LDA, QDA, naïve Bayes. Makes use of all the data available. Flexible modelling framework, so can incorporate missing features or unlabeled examples. Stronger modelling assumptions, which may not be realistic (Gaussianity, independence of features).

Discriminative learning: find parameters that aid in prediction,
$$\hat\theta = \operatorname*{argmin}_\theta \frac{1}{n}\sum_{i=1}^n L(y_i, f_\theta(x_i)) \quad\text{or}\quad \hat\theta = \operatorname*{argmax}_\theta \sum_{i=1}^n \log p(y_i \mid x_i, \theta).$$
Examples: logistic regression, neural nets, support vector machines. Typically performs better on a given task. Weaker modelling assumptions: essentially no model on $X$, only on $Y \mid X$. Can overfit more easily.

Logistic Regression

A discriminative classifier. Consider binary classification with $\mathcal{Y} = \{-1, +1\}$. Logistic regression uses a parametric model on the conditional $Y \mid X$, not the joint distribution of $(X, Y)$:
$$p(Y = y \mid X = x;\, a, b) = \frac{1}{1 + \exp(-y(a + b^\top x))}.$$
$a, b$ are fitted by minimizing the empirical risk with respect to the log loss.

Hard vs Soft classification rules

Consider using LDA for binary classification with $\mathcal{Y} = \{-1, +1\}$. Predictions are based on a linear decision boundary:
$$\hat y_{\text{LDA}}(x) = \operatorname{sign}\big\{ \log \hat\pi_{+1}\, g(x \mid \hat\mu_{+1}, \hat\Sigma) - \log \hat\pi_{-1}\, g(x \mid \hat\mu_{-1}, \hat\Sigma) \big\} = \operatorname{sign}\{ a + b^\top x \}$$
for $a$ and $b$ depending on the fitted parameters $\hat\theta = (\hat\pi_{-1}, \hat\pi_{+1}, \hat\mu_{-1}, \hat\mu_{+1}, \hat\Sigma)$.

The quantity $a + b^\top x$ can be viewed as a soft classification rule. Indeed, it is modelling the difference between the log-discriminant functions, or equivalently, the log-odds ratio:
$$a + b^\top x = \log \frac{p(Y = +1 \mid X = x;\, \hat\theta)}{p(Y = -1 \mid X = x;\, \hat\theta)}.$$
$f(x) = a + b^\top x$ corresponds to the confidence of predictions and loss can be measured as a function of this confidence:
exponential loss: $L(y, f(x)) = e^{-y f(x)}$,
log-loss: $L(y, f(x)) = \log(1 + e^{-y f(x)})$,
hinge loss: $L(y, f(x)) = \max\{1 - y f(x), 0\}$.
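
The three surrogate losses above can be compared directly as functions of the margin $y f(x)$. The short Python sketch below (function names are illustrative, not from the slides) evaluates them on a grid of margins.

import numpy as np

def exponential_loss(margin):
    return np.exp(-margin)              # L(y, f(x)) = exp(-y f(x))

def log_loss(margin):
    return np.log1p(np.exp(-margin))    # L(y, f(x)) = log(1 + exp(-y f(x)))

def hinge_loss(margin):
    return np.maximum(1 - margin, 0)    # L(y, f(x)) = max{1 - y f(x), 0}

# margin y * f(x): large positive means a confident, correct prediction
margins = np.linspace(-2.0, 2.0, 5)
for name, loss in [("exponential", exponential_loss), ("log", log_loss), ("hinge", hinge_loss)]:
    print(name, np.round(loss(margins), 3))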

Linearity of log-odds and the logistic function

We can treat $a$ and $b$ as parameters in their own right in the model of the conditional $Y \mid X$:
$$\log \frac{p(Y = +1 \mid X = x;\, a, b)}{p(Y = -1 \mid X = x;\, a, b)} = a + b^\top x.$$
Solve explicitly for the conditional class probabilities:
$$p(Y = +1 \mid X = x;\, a, b) = \frac{1}{1 + \exp(-(a + b^\top x))} =: s(a + b^\top x),$$
$$p(Y = -1 \mid X = x;\, a, b) = \frac{1}{1 + \exp(+(a + b^\top x))} = s(-a - b^\top x),$$
where $s(z) = 1/(1 + \exp(-z))$ is the logistic function.

[Figure: the logistic function $s(z)$, increasing from 0 to 1 and equal to 0.5 at $z = 0$.]

Fitting the parameters of the hyperplane

Consider maximizing the conditional log likelihood:
$$\ell(a, b) = \sum_{i=1}^n \log p(Y = y_i \mid X = x_i) = \sum_{i=1}^n \log s\big(y_i(a + b^\top x_i)\big).$$
Equivalent to minimizing the empirical risk associated with the log loss:
$$\hat R_{\log}(f_{a,b}) = \frac{1}{n}\sum_{i=1}^n -\log s\big(y_i(a + b^\top x_i)\big) = \frac{1}{n}\sum_{i=1}^n \log\big(1 + \exp(-y_i(a + b^\top x_i))\big)$$
over all linear soft classification rules $f_{a,b}(x) = a + b^\top x$.

Not possible to find the optimal $a, b$ analytically. For simplicity, absorb $a$ as an entry in $b$ by appending 1 into the $x$ vector. Objective function:
$$\hat R_{\log} = \frac{1}{n}\sum_{i=1}^n -\log s(y_i x_i^\top b).$$
Differentiate with respect to $b$, using the properties of the logistic function
$$s(-z) = 1 - s(z), \quad \partial_z s(z) = s(z)s(-z), \quad \partial_z \log s(z) = s(-z), \quad \partial_z^2 \log s(z) = -s(z)s(-z),$$
to obtain
$$\nabla_b \hat R_{\log} = -\frac{1}{n}\sum_{i=1}^n s(-y_i x_i^\top b)\, y_i x_i, \qquad \nabla_b^2 \hat R_{\log} = \frac{1}{n}\sum_{i=1}^n s(y_i x_i^\top b)\, s(-y_i x_i^\top b)\, x_i x_i^\top \succeq 0.$$
The second derivative is positive-definite: the objective function is convex and there is a single unique global minimum.

Many different algorithms can find the optimal $b$, e.g.:
Gradient descent: $b^{\text{new}} = b + \epsilon \frac{1}{n}\sum_{i=1}^n s(-y_i x_i^\top b)\, y_i x_i$.
Stochastic gradient descent: $b^{\text{new}} = b + \epsilon_t \frac{1}{|I(t)|}\sum_{i \in I(t)} s(-y_i x_i^\top b)\, y_i x_i$, where $I(t)$ is a subset of the data at iteration $t$, and $\epsilon_t \to 0$ slowly ($\sum_t \epsilon_t = \infty$, $\sum_t \epsilon_t^2 < \infty$).
Newton-Raphson: $b^{\text{new}} = b - \big(\nabla_b^2 \hat R_{\log}\big)^{-1} \nabla_b \hat R_{\log}$. This is also called iterative reweighted least squares.
Conjugate gradient, LBFGS and other methods from numerical analysis.
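
As an illustration of the gradient descent update above, here is a minimal Python/numpy sketch. The function names, the step size eps and the fixed iteration count are illustrative choices, not from the slides; the intercept $a$ is absorbed into $b$ by appending a constant 1 feature, as described above.

import numpy as np

def s(z):
    return 1.0 / (1.0 + np.exp(-z))      # logistic function s(z)

def fit_logistic(X, y, eps=0.1, n_steps=1000):
    # X: (n, p) features, y: (n,) labels in {-1, +1}; returns b of length p + 1
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 to absorb the intercept a into b
    b = np.zeros(Xa.shape[1])
    for _ in range(n_steps):
        margins = y * (Xa @ b)                       # y_i x_i^T b
        # gradient descent step: b_new = b + eps * (1/n) * sum_i s(-y_i x_i^T b) y_i x_i
        b = b + eps * (s(-margins) * y) @ Xa / Xa.shape[0]
    return b

def predict_logistic(X, b):
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(Xa @ b)                           # hard classification rule sign(a + b^T x)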

Logistic Regression vs. LDA

Both have linear decision boundaries and model the log-posterior odds as
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.$$
LDA models the marginal density of $x$ as a Gaussian mixture with shared covariance,
$$g(x) = \pi_{-1}\, \mathcal{N}(x; \mu_{-1}, \Sigma) + \pi_{+1}\, \mathcal{N}(x; \mu_{+1}, \Sigma),$$
and fits the parameters $\theta = (\mu_{-1}, \mu_{+1}, \pi_{-1}, \pi_{+1}, \Sigma)$ by maximizing the joint likelihood $\prod_{i=1}^n p(x_i, y_i \mid \theta)$. $a$ and $b$ are then determined from $\hat\theta$.
Logistic regression leaves the marginal density $g(x)$ as an arbitrary density function, and fits the parameters $a, b$ by maximizing the conditional likelihood $\prod_{i=1}^n p(y_i \mid x_i;\, a, b)$.

Linearly separable data

Assume that the data is linearly separable, i.e. there is a scalar $\alpha$ and a vector $\beta$ such that $y_i(\alpha + \beta^\top x_i) > 0$, $i = 1, \dots, n$. Let $c > 0$. The empirical risk for $a = c\alpha$, $b = c\beta$ is
$$\hat R_{\log}(f_{a,b}) = \frac{1}{n}\sum_{i=1}^n \log\big(1 + \exp(-c\, y_i(\alpha + \beta^\top x_i))\big),$$
which can be made arbitrarily close to zero as $c \to \infty$, i.e. the soft classification rule becomes $\pm\infty$ (overconfidence).

Multi-class logistic regression

Multi-class/multinomial logistic regression uses the softmax function to model the conditional class probabilities $p(Y = k \mid X = x;\, \theta)$, for $K$ classes $k = 1, \dots, K$, i.e.,
$$p(Y = k \mid X = x;\, \theta) = \frac{\exp(w_k^\top x + b_k)}{\sum_{l=1}^K \exp(w_l^\top x + b_l)}.$$
Parameters are $\theta = (b, W)$, where $W = (w_{kj})$ is a $K \times p$ matrix of weights and $b \in \mathbb{R}^K$ is a vector of bias terms.

Logistic Regression: Summary

Makes fewer modelling assumptions than generative classifiers, often resulting in better prediction accuracy.
Diverging optimal parameters for linearly separable data: need to regularise / pull them towards zero.
A simple example of a generalised linear model (GLM), for which there is a well-established statistical theory:
assessment of fit via deviance and plots,
well-founded approaches to removing insignificant features (drop-in deviance test, Wald test).
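
A minimal Python/numpy sketch of the softmax conditional class probabilities above; the function name and the row-max subtraction for numerical stability are illustrative additions, not from the slides.

import numpy as np

def softmax_probs(X, W, b):
    # X: (n, p) inputs, W: (K, p) weight matrix, b: (K,) bias vector; returns (n, K) probabilities
    scores = X @ W.T + b                                  # w_k^T x + b_k for each class k
    scores = scores - scores.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)         # exp(w_k^T x + b_k) / sum_l exp(w_l^T x + b_l)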

Regularization

Flexible models for high-dimensional problems require many parameters. With many parameters, learners can easily overfit. Regularization: limit the flexibility of the model to prevent overfitting, by adding a term penalizing large values of the parameters:
$$\min_\theta\; \hat R(f_\theta) + \lambda \|\theta\|_\rho^\rho = \min_\theta\; \frac{1}{n}\sum_{i=1}^n L(y_i, f_\theta(x_i)) + \lambda \|\theta\|_\rho^\rho,$$
where $\rho \geq 1$, and $\|\theta\|_\rho = \big(\sum_{j=1}^p |\theta_j|^\rho\big)^{1/\rho}$ is the $L_\rho$ norm of $\theta$ (also of interest when $\rho \in [0, 1)$, but then it is no longer a norm). Also known as shrinkage methods: parameters are shrunk towards 0. $\lambda$ is a tuning parameter (or hyperparameter) and controls the amount of regularization, and the resulting complexity of the model.

[Figure: $L_\rho$ regularization profile for different values of $\rho$.]

Types of Regularization

Ridge regression / Tikhonov regularization: $\rho = 2$ (Euclidean norm).
LASSO: $\rho = 1$ (Manhattan norm).
Sparsity-inducing regularization: $\rho \leq 1$ (nonconvex for $\rho < 1$).
Elastic net regularization, a mixed $L_1/L_2$ penalty:
$$\min_\theta\; \frac{1}{n}\sum_{i=1}^n L(y_i, f_\theta(x_i)) + \lambda \Big[ \frac{1 - \alpha}{2} \|\theta\|_2^2 + \alpha \|\theta\|_1 \Big].$$

$L_1$ promotes sparsity

Figure 1: The intersection between the $L_1$ (left) and the $L_2$ (right) ball with a hyperplane (figure from M. Elad, Sparse and Redundant Representations, 2010). $L_1$ regularization often leads to optimal solutions with many zeros, i.e., the regression function depends only on the (small) number of features with non-zero parameters.
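
To make the penalized objective concrete, here is a minimal Python/numpy sketch using squared loss (an illustrative choice; the slides keep the loss $L$ general). Ridge ($\rho = 2$) admits a closed-form solution, while the LASSO ($\rho = 1$) is solved here with proximal gradient descent (soft-thresholding). The function names, step size and iteration count are assumptions for the sketch, not from the slides.

import numpy as np

def ridge(X, y, lam):
    # argmin_beta (1/n) ||y - X beta||_2^2 + lam ||beta||_2^2  (closed form)
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def lasso_ista(X, y, lam, n_steps=2000):
    # argmin_beta (1/n) ||y - X beta||_2^2 + lam ||beta||_1  via proximal gradient (ISTA)
    n, p = X.shape
    step = n / (2 * np.linalg.norm(X, 2) ** 2)       # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(p)
    for _ in range(n_steps):
        grad = -2.0 / n * X.T @ (y - X @ beta)       # gradient of the squared-loss term
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding (L1 prox)
    return beta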