Logistics

CS 6140: Machine Learning, Spring 2016
Instructor: Lu Wang
College of Computer and Information Science, Northeastern University
Webpage: www.ccs.neu.edu/home/luwang
Email: luwang@ccs.neu.edu
Sign up at Piazza: http://piazza.com/northeastern/spring2016/cs6140 (all course-relevant questions go here!)

Textbook

Assignment 1
- Analytical questions
- Simple programming

Exam
- Open book
- Computer allowed, but no internet

Project
- Proposal: problem definition, related work, potential model and algorithms, datasets, evaluation

What We Learned Last Week

Basic concepts
- Supervised learning vs. unsupervised learning
- Parametric vs. non-parametric
- Classification vs. regression
- Training set, test set, development set
- Overfitting vs. underfitting

K-Nearest Neighbors (KNN)

Linear Regression
- Assumption: the response is a linear function of the inputs
- The prediction is the inner product between input sample x and weight vector w
- Residual error: the difference between the prediction and the true label
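As a quick refresher (my own sketch, not from the slides), ridge regression has the closed-form solution w = (X^T X + lambda*I)^{-1} X^T y; the function and variable names below are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy usage: recover a noisy linear relation y = 2*x0 - x1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=0.1))  # close to [2, -1]
```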
- We want to minimize the residual sum of squares

Ridge Regression
- Add an L2 penalty on the weights to the least-squares objective

Today's Outline
- Generative model and discriminative model
- Logistic regression
- Generative models
- Decision tree

Generative Models vs. Discriminative Models

Generative model
- Learn P(X, Y) from training samples
- P(X, Y) = P(Y) P(X|Y)
- Specifies how to generate the observed features x for y

Discriminative model
- Learn P(Y|X) from training samples
- Directly models the mapping from features x to y

Logistic Regression
- A discriminative model
- y is 0 or 1: classification, not really regression!
- p(y | x, w) = Ber(y | sigm(w^T x)), where Ber is a Bernoulli distribution
- Remember that in linear regression we had p(y | x, w) = N(y | w^T x, sigma^2)
Logistic Regression
- A discriminative model
- p(y = 1 | x, w) = sigm(w^T x), where sigm is the sigmoid function

Sigmoid Function
- Definition: sigm(a) = 1 / (1 + exp(-a))
- Squashes any real input into (0, 1), so the output can be read as a probability
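A minimal sketch of the sigmoid and the resulting class probability (my own illustration; `predict_proba` is an assumed name, not from the slides):

```python
import numpy as np

def sigmoid(a):
    """sigm(a) = 1 / (1 + exp(-a)), mapping the reals into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(X, w):
    """p(y=1 | x, w) = sigm(w^T x) for each row of X."""
    return sigmoid(X @ w)
```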
Parameter Estimation
- How do we get w?
- Minimize the negative log-likelihood (NLL): NLL(w) = -sum_i [y_i log mu_i + (1 - y_i) log(1 - mu_i)], with mu_i = sigm(w^T x_i)
- Unlike linear regression, the MLE has no closed-form solution here
- Gradient and Hessian: g = X^T (mu - y), H = X^T S X with S = diag(mu_i (1 - mu_i))
- Our objective function is convex -> unique global minimum

Parameter Estimation: Gradient Descent
- Example: repeatedly step in the direction of the negative gradient
- w_{k+1} = w_k - eta * g_k, where eta is the step size
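A minimal gradient-descent sketch for the logistic regression NLL, using the gradient g = X^T (mu - y) given above; the step size and iteration count are illustrative choices of mine:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_gd(X, y, eta=0.1, iters=1000):
    """Fit logistic regression by batch gradient descent on the NLL.

    Gradient of the NLL: g = X^T (mu - y), with mu_i = sigm(w^T x_i).
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = sigmoid(X @ w)
        w -= eta * X.T @ (mu - y) / len(y)  # average gradient step
    return w
```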
Changing Step Size
- With an appropriately decaying step size, gradient descent is guaranteed to converge to a local optimum (here the global optimum, since the objective is convex)

Gradient Descent Direction
- Remember that we want each step to decrease the objective
- Line search: find the step size by minimizing the objective along the descent direction, eta_k = argmin_eta f(w_k + eta * d_k)

Parameter Estimation: Newton's Method
- In gradient descent we use only first-order (gradient) information
- Newton's method: second-order optimization, faster convergence
- Consider a second-order Taylor series approximation of the objective function at step k:
  f(w) ~= f(w_k) + g_k^T (w - w_k) + (1/2) (w - w_k)^T H_k (w - w_k)
- Minimizing this quadratic over w and rewriting gives the update:
  w_{k+1} = w_k - H_k^{-1} g_k
  where g_k and H_k are the gradient and Hessian at step k
Parameter Estimation: Newton's Method
- Now apply Newton's method to our problem: with g = X^T (mu - y) and H = X^T S X, each Newton step solves a weighted least-squares problem

Adding L2 Regularization
- To avoid overfitting, add a penalty (lambda / 2) ||w||^2 to the NLL
- The gradient gains a lambda * w term and the Hessian a lambda * I term

Recap: Generative model
- Learn P(X, Y) from training samples; P(X, Y) = P(Y) P(X|Y)
- Specifies how to generate the observed features x for y

Recap: Discriminative model
- Learn P(Y|X) from training samples
- Directly models the mapping from features x to y
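A hedged sketch of the regularized Newton update (this is the standard IRLS-style iteration; regularizing every weight, including any bias, and the hyperparameter values are my own simplifications):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_newton(X, y, lam=1e-3, iters=20):
    """L2-regularized logistic regression via Newton's method.

    Each step: w <- w - H^{-1} g, with
      g = X^T (mu - y) + lam * w
      H = X^T S X + lam * I,  S = diag(mu * (1 - mu))
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        mu = sigmoid(X @ w)
        g = X.T @ (mu - y) + lam * w
        H = (X.T * (mu * (1 - mu))) @ X + lam * np.eye(d)
        w -= np.linalg.solve(H, g)
    return w
```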
Bayesian Concept Learning

Bayesian Inference
- How do human beings learn from everyday life?
  - Meanings of words
  - Causes of a person's action
  - Future outcomes of a dynamic process
[Some of the slides are borrowed from Kevin Murphy's lectures]

Number Game
- Observe one or more example numbers, then judge whether other numbers are "yes" or "no"
- Hypothesis space H
- Prior p(h)
- Likelihood p(D|h)
- Computing the posterior p(h|D)
- Generalization from positive samples
Bayesian Model
- H: hypothesis space of possible concepts
- X: n examples of a concept C
- Evaluate hypotheses given data using Bayes' rule: p(h|X) = p(X|h) p(h) / p(X)

Hypothesis Space
- Mathematical properties (~50): odd, even, square, cube, prime, multiples of small integers, powers of small integers, same first (or last) digit
- Magnitude intervals (~5000): all intervals of integers with endpoints between 1 and 100

Likelihood
- Size principle: smaller hypotheses receive greater likelihood, and exponentially more so as n increases: p(D|h) = (1/|h|)^n when D is consistent with h
- Occam's razor: the model favors the simplest or smallest hypothesis consistent with the data
- Example: D = {16}
  - h1: powers of two under 100, P(D|h1) = 1/6
  - h2: even numbers under 100, P(D|h2) = 1/50

Prior
- X = {60, 80, 10, 30}
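A toy sketch (not from the slides) that reproduces the D = {16} example under the size principle, assuming a uniform prior over just these two hypotheses:

```python
# Size principle in the number game: p(D|h) = (1/|h|)^n for consistent D.
h1 = {2 ** k for k in range(1, 7)}  # powers of two under 100 (6 numbers)
h2 = set(range(2, 101, 2))          # even numbers under 100 (50 numbers)
hypotheses = {"powers of two": h1, "even numbers": h2}

D = {16}

def likelihood(D, h):
    """(1/|h|)^n if every example fits h, else 0."""
    return (1.0 / len(h)) ** len(D) if D <= h else 0.0

# Uniform prior over the two hypotheses; the posterior strongly
# favors the smaller hypothesis "powers of two".
lik = {name: likelihood(D, h) for name, h in hypotheses.items()}
Z = sum(lik.values())
post = {name: l / Z for name, l in lik.items()}
print(post)
```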
Posterior
- X = {60, 80, 10, 30}
- Why prefer "multiples of 10" over "even numbers"? The size-principle likelihood favors the smaller hypothesis.
- Why prefer "multiples of 10" over "multiples of 10 except 50 and 20"? The prior assigns low weight to such unnatural concepts.
- We cannot learn efficiently with a uniform prior over all 2^100 logically possible hypotheses

Posterior Predictive Distribution
- Bayesian model averaging: p(x in C | X) = sum_h p(x in C | h) p(h | X)
- Maximum a posteriori (MAP), or plug-in approximation: keep only the most probable hypothesis, p(x in C | X) ~= p(x in C | h_MAP)
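Continuing the toy sketch above, the two posterior predictive options look like this (illustrative only; `predictive_bma` and `predictive_plugin` are assumed names, and `hypotheses`/`post` come from the previous snippet):

```python
def predictive_bma(x, hypotheses, post):
    """Bayesian model averaging: p(x in C | D) = sum_h p(x in C | h) p(h | D)."""
    return sum((x in h) * post[name] for name, h in hypotheses.items())

def predictive_plugin(x, hypotheses, post):
    """Plug-in approximation: use only the MAP hypothesis."""
    map_name = max(post, key=post.get)
    return float(x in hypotheses[map_name])

print(predictive_bma(32, hypotheses, post))     # averages over hypotheses
print(predictive_plugin(32, hypotheses, post))  # commits to the MAP one
```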
Naïve Bayes

Document Classification Example
- Y in {1, ..., C}, x in {0, 1}^d
- e.g., Y in {spam, urgent, normal}
- x_i = 1 iff word i is present in the message

Bayes' Rule
- p(y = c | x) is proportional to p(y = c) p(x | y = c)

Class-Conditional Density p(x|y=c)
- Assumption: features are conditionally independent given the class (e.g., the words "assignment" and "released" are modeled independently)
- Formally, p(x | y = c) = prod_j p(x_j | y = c)
- Possible choices for p(x_j | y = c): multivariate Poisson, multinomial, ...
Class-Conditional Density p(x|y=c)
- Binary features: multivariate Bernoulli (commonly used)
- p(x | y = c) = prod_j theta_jc^{x_j} (1 - theta_jc)^{1 - x_j}

Bayes' Rule: Class Prior
- Let (Y_1, ..., Y_C) ~ Mult(pi, 1) be the class prior
- 1-of-C encoding: only one bit can be on
- e.g., p(spam) = 0.7, p(urgent) = 0.1, p(normal) = 0.2

Bayes' Rule: Class Posterior
- p(y = c | x) = p(y = c) p(x | y = c) / sum_{c'} p(y = c') p(x | y = c')
- Fill in with the class-conditional probability and the prior
Class Posterior: Log-Sum-Exp Trick
- The numerator and denominator are products of very small numbers; use logs to avoid underflow
- How to compute the normalization constant? log sum_c exp(b_c) = B + log sum_c exp(b_c - B), with B = max_c b_c

Parameter Estimation
- So far we have assumed that the parameters of p(x|y=c) and p(y=c) are known
- To estimate p(y=c), we can use MLE, MAP, or fully Bayesian estimation of a multinomial
- To estimate p(x|y=c): MLE for Bernoulli features. For each feature j and class c, count how many times word j occurred in documents of class c and divide by the number of documents of class c: theta_jc = N_jc / N_c

Plug-in Approximation
- We can compute MLEs for each feature j and class c separately
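A sketch of both ideas, assuming a binary feature matrix X and integer labels y; the slides use the plain MLE, so the smoothing remark in the comments is my addition:

```python
import numpy as np

def log_sum_exp(b):
    """log(sum_c exp(b_c)) computed stably: shift by the max first."""
    B = np.max(b)
    return B + np.log(np.sum(np.exp(b - B)))

def fit_bernoulli_nb(X, y, C):
    """MLE for Bernoulli naive Bayes: theta[j, c] = N_jc / N_c.

    In practice one adds Laplace smoothing, e.g. (N_jc + 1) / (N_c + 2),
    to avoid log(0) for words never seen in a class.
    """
    n, d = X.shape
    prior = np.zeros(C)
    theta = np.zeros((d, C))
    for c in range(C):
        Xc = X[y == c]
        prior[c] = len(Xc) / n
        theta[:, c] = Xc.mean(axis=0)
    return prior, theta

def predict_log_proba(x, prior, theta):
    """Class posterior in log space, normalized with log-sum-exp."""
    log_joint = np.log(prior) + x @ np.log(theta) + (1 - x) @ np.log(1 - theta)
    return log_joint - log_sum_exp(log_joint)
```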
Plug-in Approximation
- Then we have p(y = c | x) proportional to p(y = c) prod_j p(x_j | y = c), with the MLEs plugged in

Generative vs. Discriminative Models
- Generative model: learn P(X, Y) = P(Y) P(X|Y) from training samples; specifies how to generate the observed features x for y
- Discriminative model: learn P(Y|X) from training samples; directly models the mapping from features x to y

Which type of model wins on each property?
- Easy to fit the model: generative model!
- Fit classes separately: generative model!
- Handle missing features easily: generative model!
- Handle unlabeled training data: easier for generative model!
- Symmetric in inputs and outputs (define p(x, y)): generative model!
- Handle feature preprocessing: discriminative model!
- Well-calibrated probabilities: discriminative model!

Decision Tree
[Some of the slides are borrowed from Tom Mitchell's lecture]

Play tennis?
- Each internal node: tests one attribute X_i
- Each branch from a node: selects one value for X_i
- Each leaf node: predicts Y (or P(Y | X in leaf))
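A minimal illustration of this representation on the classic play-tennis tree (the tree below is the standard textbook example, hand-built rather than learned from data):

```python
# Internal nodes test one attribute; branches select a value; leaves predict Y.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def predict(tree, x):
    """Walk from the root to a leaf, following the branch for x's value."""
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attribute"]]]
    return tree

print(predict(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes
```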
Top-Down Induction of Decision Trees
- Which attribute is best to split on?

Entropy
- Entropy of a random variable X: H(X) = - sum_x p(x) log2 p(x)
- H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)

Information Gain
- Gain(S, A) = expected reduction in entropy due to sorting on A:
  Gain(S, A) = H(S) - sum_{v in Values(A)} (|S_v| / |S|) H(S_v)
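A short sketch of both quantities, assuming examples are dicts of attribute values with a parallel list of labels (the function and variable names are my own):

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_v p(v) log2 p(v) over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in {r[attribute] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attribute] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder
```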
Information Gain

Overfitting

Avoid Overfitting
- Stop growing when the data split is not statistically significant
- Or grow a full tree, then prune

Reduced-Error Pruning
- Greedily remove the node whose removal most improves accuracy on a held-out validation set; stop when further pruning hurts
Rule Post-Pruning
- Convert the learned tree to rules (one per root-to-leaf path), then prune each rule's preconditions independently

What We Learned Today
- Generative model and discriminative model
- Logistic regression
- Generative models
- Decision tree

Homework
- Reading: Murphy Ch. 3, 8.1-8.3, 8.6, 16.2
- First assignment is out!