Statistical Data Mining and Machine Learning
Hilary Term 2016
Dino Sejdinovic, Department of Statistics, Oxford
Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml

Naïve Bayes

Naïve Bayes: another plug-in classifier with a simple generative model. It assumes all measured variables/features are independent given the label, i.e., given the class, each word appears in a document independently of all others. Often used in text document classification, e.g. of scientific articles or emails.

A basic standard model for text classification considers a pre-specified dictionary of p words and summarizes each document i by a binary vector x_i, where

    x_i^{(j)} = \begin{cases} 1 & \text{if word } j \text{ is present in document } i, \\ 0 & \text{otherwise.} \end{cases}

Presence of word j is the j-th feature/dimension.

To implement a plug-in classifier, we need a model for the conditional probability mass function g_k(x) = P(X = x \mid Y = k) for each class k = 1, \dots, K. Naïve Bayes is a plug-in classifier which ignores feature correlations and assumes:

    g_k(x_i) = P(X = x_i \mid Y = k) = \prod_{j=1}^{p} P(X^{(j)} = x_i^{(j)} \mid Y = k) = \prod_{j=1}^{p} (\phi_{kj})^{x_i^{(j)}} (1 - \phi_{kj})^{1 - x_i^{(j)}},

where we denoted the parametrized conditional PMF with \phi_{kj} = P(X^{(j)} = 1 \mid Y = k), the probability that the j-th word appears in a class-k document.

Given a dataset, the MLE of the parameters is:

    \hat{\pi}_k = \frac{n_k}{n}, \qquad \hat{\phi}_{kj} = \frac{\sum_{i : y_i = k} x_i^{(j)}}{n_k}.

A problem with the MLE: if the l-th word did not appear in any document labelled as class k, then \hat{\phi}_{kl} = 0 and

    P(Y = k \mid X = x \text{ with } l\text{-th entry equal to } 1) \propto \hat{\pi}_k \prod_{j=1}^{p} (\hat{\phi}_{kj})^{x^{(j)}} (1 - \hat{\phi}_{kj})^{1 - x^{(j)}} = 0,

i.e. we will never attribute a new document containing word l to class k, regardless of the other words in it. This is an example of overfitting.
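A minimal NumPy sketch of this Bernoulli naïve Bayes model (function names are my own, not from the course materials). Note how a fitted \hat{\phi}_{kl} of exactly zero sends the log-posterior score for class k to -inf when word l is present, which is exactly the overfitting problem described above.

```python
import numpy as np

def fit_bernoulli_nb(X, y, K):
    """MLE for the Bernoulli naive Bayes model above.
    X: (n, p) binary document-word matrix; y: (n,) labels in {0, ..., K-1}."""
    n, p = X.shape
    pi = np.zeros(K)         # class priors: pi_k = n_k / n
    phi = np.zeros((K, p))   # phi_kj = P(word j present | class k)
    for k in range(K):
        Xk = X[y == k]
        pi[k] = Xk.shape[0] / n
        phi[k] = Xk.mean(axis=0)   # sum_{i: y_i = k} x_i^(j) / n_k
    return pi, phi

def log_posterior(x, pi, phi):
    """Unnormalized log P(Y = k | X = x) for one binary vector x.
    If phi_kl = 0 (or 1) disagrees with x_l, the score for class k
    becomes -inf: the zero-count problem described above."""
    with np.errstate(divide="ignore"):
        # pick log(phi_kj) where x_j = 1 and log(1 - phi_kj) where x_j = 0
        terms = np.where(x == 1, np.log(phi), np.log1p(-phi))
        return np.log(pi) + terms.sum(axis=1)
```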
Generative vs Discriminative Learning

Classifiers we have seen so far are generative: we work with a joint distribution p_{X,Y}(x, y) over data vectors and labels.

A learning algorithm: construct f : X \to Y which predicts the label of X. Given a loss function L, the risk R of f is

    R(f) = E_{p_{X,Y}}[L(Y, f(X))].

For 0/1 loss in classification, the Bayes classifier

    f_{Bayes}(x) = \arg\max_{k} p(Y = k \mid x) = \arg\max_{k} p_{X,Y}(x, k)

has the minimum risk (the Bayes risk), but is unknown since p_{X,Y} is unknown. Assume a parametric model for the joint: p_{X,Y}(x, y) = p_{X,Y}(x, y \mid \theta). Fit \hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i, y_i \mid \theta) and plug back into the Bayes classifier:

    \hat{f}(x) = \arg\max_{k} p(Y = k \mid x, \hat{\theta}) = \arg\max_{k} p_{X,Y}(x, k \mid \hat{\theta}).

Generative learning: find parameters which explain all the data available.
    \hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i, y_i \mid \theta)
- Examples: LDA, QDA, naïve Bayes.
- Makes use of all the data available.
- Flexible modelling framework, so can incorporate missing features or unlabelled examples.
- Stronger modelling assumptions, which may not be realistic (Gaussianity, independence of features).

Discriminative learning: find parameters that aid in prediction.
    \hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(y_i, f_\theta(x_i))  or  \hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta)
- Examples: logistic regression, neural nets, support vector machines.
- Typically performs better on a given task.
- Weaker modelling assumptions: essentially no model on X, only on Y | X.
- Can overfit more easily.

Logistic Regression

A discriminative classifier. Consider binary classification with Y = \{-1, +1\}. Logistic regression uses a parametric model on the conditional Y | X, not on the joint distribution of (X, Y):

    p(Y = y \mid X = x; a, b) = \frac{1}{1 + \exp(-y(a + b^\top x))}.

a, b are fitted by minimizing the empirical risk with respect to the log loss.

Hard vs Soft Classification Rules

Consider using LDA for binary classification with Y = \{-1, +1\}. Predictions are based on a linear decision boundary:

    \hat{y}_{LDA}(x) = \mathrm{sign}\left\{ \log \hat{\pi}_+ g_+(x \mid \hat{\mu}_+, \hat{\Sigma}) - \log \hat{\pi}_- g_-(x \mid \hat{\mu}_-, \hat{\Sigma}) \right\} = \mathrm{sign}\{ a + b^\top x \}

for a and b depending on the fitted parameters \hat{\theta} = (\hat{\pi}_-, \hat{\pi}_+, \hat{\mu}_-, \hat{\mu}_+, \hat{\Sigma}).

The quantity a + b^\top x can be viewed as a soft classification rule. Indeed, it is modelling the difference between the log-discriminant functions, or equivalently, the log-odds ratio:

    a + b^\top x = \log \frac{p(Y = +1 \mid X = x; \hat{\theta})}{p(Y = -1 \mid X = x; \hat{\theta})}.

f(x) = a + b^\top x corresponds to the confidence of predictions, and loss can be measured as a function of this confidence (see the sketch below):
- exponential loss: L(y, f(x)) = e^{-y f(x)},
- log loss: L(y, f(x)) = \log(1 + e^{-y f(x)}),
- hinge loss: L(y, f(x)) = \max\{1 - y f(x), 0\}.
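A brief sketch (my own, not from the slides) evaluating these three convex surrogate losses as functions of the margin m = y f(x); m > 0 means the hard rule sign(f(x)) classifies correctly.

```python
import numpy as np

def exp_loss(m):    # exponential loss: exp(-y f(x))
    return np.exp(-m)

def log_loss(m):    # log loss: log(1 + exp(-y f(x)))
    return np.log1p(np.exp(-m))

def hinge_loss(m):  # hinge loss: max(1 - y f(x), 0)
    return np.maximum(1.0 - m, 0.0)

# All three decrease with the margin, penalizing low-confidence and
# wrong predictions more heavily than confident correct ones.
for m in (-1.0, 0.0, 1.0, 2.0):
    print(f"m={m:+.1f}  exp={exp_loss(m):.3f}  "
          f"log={log_loss(m):.3f}  hinge={hinge_loss(m):.3f}")
```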
Logistic Regression: Linearity of Log-Odds and the Logistic Function

We can treat a and b as parameters in their own right in a model of the conditional Y | X:

    \log \frac{p(Y = +1 \mid X = x; a, b)}{p(Y = -1 \mid X = x; a, b)} = a + b^\top x.

Solve explicitly for the conditional class probabilities:

    p(Y = +1 \mid X = x; a, b) = \frac{1}{1 + \exp(-(a + b^\top x))} =: s(a + b^\top x),
    p(Y = -1 \mid X = x; a, b) = \frac{1}{1 + \exp(+(a + b^\top x))} = s(-a - b^\top x),

where s(z) = 1/(1 + \exp(-z)) is the logistic function.

[Figure: the logistic function s(z), plotted for z from -8 to 8.]

Logistic Regression: Fitting the Parameters of the Hyperplane

Consider maximizing the conditional log likelihood:

    \ell(a, b) = \sum_{i=1}^{n} \log p(Y = y_i \mid X = x_i) = \sum_{i=1}^{n} \log s(y_i(a + b^\top x_i)).

Equivalent to minimizing the empirical risk associated with the log loss:

    \hat{R}_{\log}(f_{a,b}) = -\frac{1}{n} \sum_{i=1}^{n} \log s(y_i(a + b^\top x_i)) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i(a + b^\top x_i)))

over all linear soft classification rules f_{a,b}(x) = a + b^\top x.

Logistic Regression: Optimization

Not possible to find the optimal a, b analytically. For simplicity, absorb a as an entry in b by appending 1 to the x vector. Objective function:

    \hat{R}_{\log} = -\frac{1}{n} \sum_{i=1}^{n} \log s(y_i x_i^\top b).

Differentiate with respect to b:

    \nabla_b \hat{R}_{\log} = -\frac{1}{n} \sum_{i=1}^{n} s(-y_i x_i^\top b) y_i x_i,
    \nabla_b^2 \hat{R}_{\log} = \frac{1}{n} \sum_{i=1}^{n} s(y_i x_i^\top b) s(-y_i x_i^\top b) x_i x_i^\top \succ 0,

using the identities for the logistic function:

    s(-z) = 1 - s(z),            \partial_z s(z) = s(z) s(-z),
    \partial_z \log s(z) = s(-z),  \partial_z^2 \log s(z) = -s(z) s(-z).

The second derivative is positive-definite: the objective function is convex and there is a single unique global minimum.

Many different algorithms can find the optimal b (a gradient-descent sketch follows this list), e.g.:
- Gradient descent:
    b^{new} = b + \epsilon \frac{1}{n} \sum_{i=1}^{n} s(-y_i x_i^\top b) y_i x_i
- Stochastic gradient descent:
    b^{new} = b + \epsilon_t \frac{1}{|I(t)|} \sum_{i \in I(t)} s(-y_i x_i^\top b) y_i x_i
  where I(t) is a subset of the data at iteration t, and \epsilon_t \to 0 slowly (\sum_t \epsilon_t = \infty, \sum_t \epsilon_t^2 < \infty).
- Newton-Raphson:
    b^{new} = b - (\nabla_b^2 \hat{R}_{\log})^{-1} \nabla_b \hat{R}_{\log}
  This is also called iteratively reweighted least squares.
- Conjugate gradient, L-BFGS, and other methods from numerical analysis.
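A minimal NumPy sketch of the gradient-descent update above; the function names, the fixed step size, and the iteration count are my own choices, not from the slides.

```python
import numpy as np

def s(z):
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, eps=0.5, n_iters=2000):
    """Gradient descent on R_log for labels y in {-1, +1}.
    The intercept a is absorbed into b by appending a constant 1 feature."""
    n, p = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])
    b = np.zeros(p + 1)
    for _ in range(n_iters):
        # b_new = b + eps * (1/n) * sum_i s(-y_i x_i^T b) y_i x_i
        b = b + eps * Xa.T @ (s(-y * (Xa @ b)) * y) / n
    return b

def predict_proba(X, b):
    """p(Y = +1 | X = x) = s(a + b^T x) for each row of X."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    return s(Xa @ b)
```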
Logistic Regression vs. LDA

Both have linear decision boundaries and model the log-posterior odds as

    \log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.

LDA models the marginal density of x as a Gaussian mixture with shared covariance

    g(x) = \pi_- \mathcal{N}(x; \mu_-, \Sigma) + \pi_+ \mathcal{N}(x; \mu_+, \Sigma)

and fits the parameters \theta = (\mu_-, \mu_+, \pi_-, \pi_+, \Sigma) by maximizing the joint likelihood \prod_{i=1}^{n} p(x_i, y_i \mid \theta). a and b are then determined from \hat{\theta}.

Logistic regression leaves the marginal density g(x) as an arbitrary density function, and fits the parameters a, b by maximizing the conditional likelihood \prod_{i=1}^{n} p(y_i \mid x_i; a, b).

Logistic Regression: Linearly Separable Data

Assume that the data is linearly separable, i.e. there is a scalar \alpha and a vector \beta such that

    y_i(\alpha + \beta^\top x_i) > 0, \quad i = 1, \dots, n.

Let c > 0. The empirical risk for a = c\alpha, b = c\beta is

    \hat{R}_{\log}(f_{a,b}) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-c\, y_i(\alpha + \beta^\top x_i))),

which can be made arbitrarily close to zero as c \to \infty, i.e. the soft classification rule diverges to \pm\infty (overconfidence).

Multi-Class Logistic Regression

Multi-class/multinomial logistic regression uses the softmax function to model the conditional class probabilities p(Y = k \mid X = x; \theta) for K classes k = 1, \dots, K, i.e.,

    p(Y = k \mid X = x; \theta) = \frac{\exp(w_k^\top x + b_k)}{\sum_{l=1}^{K} \exp(w_l^\top x + b_l)}.

Parameters are \theta = (b, W), where W = (w_{kj}) is a K \times p matrix of weights and b \in R^K is a vector of bias terms. (A minimal softmax sketch is given after the summary below.)

Logistic Regression: Summary

- Makes fewer modelling assumptions than generative classifiers, often resulting in better prediction accuracy.
- Diverging optimal parameters for linearly separable data: need to regularize / pull them towards zero.
- A simple example of a generalized linear model (GLM), for which there is a well-established statistical theory:
  - assessment of fit via deviance and plots,
  - well-founded approaches to removing insignificant features (drop-in deviance test, Wald test).
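A small sketch of the softmax probabilities above; subtracting the maximum score is a standard numerical-stability trick of my own choosing, not something the slide specifies (shifting all scores by a constant leaves the ratio unchanged).

```python
import numpy as np

def softmax_probs(x, W, b):
    """p(Y = k | X = x; theta) = exp(w_k^T x + b_k) / sum_l exp(w_l^T x + b_l).
    W: (K, p) weight matrix, b: (K,) bias vector, x: (p,) feature vector."""
    scores = W @ x + b               # w_k^T x + b_k for each class k
    scores = scores - scores.max()   # constant shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()               # normalize over l = 1, ..., K
```

Since only differences of scores matter, with K = 2 this reduces to the binary logistic model with a + b^\top x playing the role of the score difference.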
Regularization

Flexible models for high-dimensional problems require many parameters. With many parameters, learners can easily overfit.

Regularization: limit the flexibility of the model to prevent overfitting. Add a term penalizing large values of the parameters:

    \min_\theta \hat{R}(f_\theta) + \lambda \|\theta\|_\rho^\rho = \min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_i, f_\theta(x_i)) + \lambda \|\theta\|_\rho^\rho,

where \rho \geq 1 and \|\theta\|_\rho = (\sum_{j=1}^{p} |\theta_j|^\rho)^{1/\rho} is the L_\rho norm of \theta (also of interest when \rho \in [0, 1), but then no longer a norm).

- Also known as shrinkage methods: parameters are shrunk towards 0.
- \lambda is a tuning parameter (or hyperparameter) and controls the amount of regularization, and the resulting complexity of the model.

[Figure: the L_\rho regularization profile |\theta|^\rho for different values of \rho, e.g. \rho = 0.5, 1.0, 1.5, 2.0.]

Types of Regularization

- Ridge regression / Tikhonov regularization: \rho = 2 (Euclidean norm).
- LASSO: \rho = 1 (Manhattan norm).
- Sparsity-inducing regularization: \rho \leq 1 (nonconvex for \rho < 1).
- Elastic net regularization, a mixed L_1/L_2 penalty (see the sketch at the end of this section):

    \min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_i, f_\theta(x_i)) + \lambda \left[ (1 - \alpha) \frac{1}{2} \|\theta\|_2^2 + \alpha \|\theta\|_1 \right].

L_1 Promotes Sparsity

Figure 1: The intersection between the L_1 (left) and the L_2 (right) ball with a hyperplane (figure from M. Elad, Sparse and Redundant Representations, 2010).

L_1 regularization often leads to optimal solutions with many zeros, i.e., the regression function depends only on the (small) number of features with non-zero parameters.
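To make the penalty terms above concrete, here is a small sketch (function names are my own) evaluating the L_\rho and elastic net penalties for a given parameter vector:

```python
import numpy as np

def lrho_penalty(theta, rho):
    """L_rho penalty ||theta||_rho^rho = sum_j |theta_j|^rho.
    rho = 2 gives the ridge penalty, rho = 1 the LASSO penalty."""
    return np.sum(np.abs(theta) ** rho)

def elastic_net_penalty(theta, lam, alpha):
    """Elastic net penalty from the slide:
    lam * [(1 - alpha) * 0.5 * ||theta||_2^2 + alpha * ||theta||_1]."""
    l2 = 0.5 * np.sum(theta ** 2)
    l1 = np.sum(np.abs(theta))
    return lam * ((1 - alpha) * l2 + alpha * l1)

theta = np.array([0.5, -1.0, 2.0])
print(lrho_penalty(theta, 2), lrho_penalty(theta, 1))   # 5.25, 3.5
print(elastic_net_penalty(theta, lam=0.1, alpha=0.5))   # 0.30625
```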