Linear Classifiers
CS 188 Artificial Intelligence: Perceptrons and Logistic Regression
Pieter Abbeel & Dan Klein, University of California, Berkeley

Feature Vectors
- An input is mapped to a vector of feature values, then classified. Example: an email ("Hello, do you want free printer cartridges? Why pay more when you can get them ABSOLUTELY FREE! Just ...") becomes word-count features and is labeled SPAM or +.
- Example: an image becomes pixel features (PIXEL-7,3; NUM_LOOPS; ...) and is labeled with a digit.

Some (Simplified) Biology
- Very loose inspiration: human neurons.

Linear Classifiers
- Binary case: compare the features to a weight vector.
- Learning: figure out the weight vector from examples.
- Inputs are feature values; each feature has a weight; the sum is the activation:
  activation_w(x) = sum_i w_i * f_i(x) = w . f(x)
- If the activation is positive, output +; if negative, output -.
- A positive dot product means the positive class.
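The decision rule above can be sketched in a few lines of Python (a minimal sketch; the helper names and the dict representation of sparse feature vectors are illustrative, not from the slides):

```python
def activation(weights, features):
    """Dot product w . f(x), with weights and features as dicts keyed by feature name."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def classify(weights, features):
    """Output '+' if the activation is positive, '-' otherwise."""
    return "+" if activation(weights, features) > 0 else "-"
```

A missing feature contributes zero, which is why `dict.get(f, 0.0)` suffices for sparse vectors.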
Decision Rules: Binary Decision Rule
- In the space of feature vectors, examples are points.
- Any weight vector defines a hyperplane: one side corresponds to Y = +, the other to Y = -.
- Example: a spam filter with features such as "free" and "money"; the + side of the hyperplane is SPAM, the - side is HAM.

Learning: Binary Perceptron
- Start with weights = 0.
- For each training instance:
  - Classify with the current weights: y = + if w . f(x) > 0, else -.
  - If correct (i.e., y = y*): no change!
  - If wrong: adjust the weight vector by adding or subtracting the feature vector (subtract if y* is -):
    w = w + y* * f(x)

Examples: Perceptron, Separable Case
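The training loop above can be written out directly (a sketch under the slides' conventions: labels are +1/-1, features are dicts, and the number of passes over the data is a hypothetical parameter):

```python
def train_binary_perceptron(data, n_passes=10):
    """data: list of (feature_dict, label) pairs with label in {+1, -1}."""
    w = {}  # start with weights = 0 (an empty dict behaves as all zeros)
    for _ in range(n_passes):
        for features, y_star in data:
            # Classify with the current weights.
            score = sum(w.get(f, 0.0) * v for f, v in features.items())
            y = 1 if score > 0 else -1
            if y != y_star:
                # Wrong: add the feature vector if y* is +, subtract if y* is -.
                for f, v in features.items():
                    w[f] = w.get(f, 0.0) + y_star * v
    return w
```

Note that a correct prediction leaves the weights untouched, which is exactly the "if correct, no change" rule.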
Multiclass Decision Rule
- If we have multiple classes: a weight vector w_y for each class.
- Score (activation) of a class y: w_y . f(x)
- Prediction: the class with the highest score: y = argmax_y w_y . f(x)
- Binary = multiclass where the negative class has weight zero.

Learning: Multiclass Perceptron
- Start with all weights = 0.
- Pick up training examples one by one and predict with the current weights.
- If correct: no change!
- If wrong: lower the score of the wrong answer, raise the score of the right answer:
  w_y = w_y - f(x)  (wrong answer)
  w_y* = w_y* + f(x)  (right answer)

Example: Multiclass Perceptron

Properties of Perceptrons
- Separability: true if some parameters get the training set perfectly correct.
- Convergence: if the training data is separable, the perceptron will eventually converge (binary case).
- Mistake bound: the maximum number of mistakes (binary case) is related to the margin, i.e., the degree of separability.

Problems with the Perceptron
- Noise: if the data isn't separable, the weights might thrash. Averaging weight vectors over time can help (averaged perceptron).
- Mediocre generalization: finds a barely separating solution.
- Overtraining: test / held-out accuracy usually rises, then falls. Overtraining is a kind of overfitting.

Improving the Perceptron
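The multiclass update rule can be sketched the same way (illustrative names; one weight dict per class, with the argmax prediction breaking ties by class order):

```python
from collections import defaultdict

def train_multiclass_perceptron(data, classes, n_passes=10):
    """data: list of (feature_dict, true_class) pairs."""
    w = {y: defaultdict(float) for y in classes}  # one weight vector per class, all zeros
    for _ in range(n_passes):
        for features, y_star in data:
            # Predict: the class with the highest activation w_y . f(x).
            y_hat = max(classes,
                        key=lambda y: sum(w[y][f] * v for f, v in features.items()))
            if y_hat != y_star:
                # Lower the wrong answer's score, raise the right answer's.
                for f, v in features.items():
                    w[y_hat][f] -= v
                    w[y_star][f] += v
    return w
```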
Non-Separable Case: Deterministic Decision
- Even the best linear boundary makes at least one mistake.

Non-Separable Case: Probabilistic Decision
- Instead of a hard label, assign each point a probability of being +: close to 0.9 far on the positive side, near 0.5 at the boundary, close to 0.1 far on the negative side.

How to Get Probabilistic Decisions?
- Perceptron scoring: z = w . f(x)
- If z is very positive, we want the probability of + to go to 1; if z is very negative, we want it to go to 0.
- Sigmoid function: phi(z) = 1 / (1 + e^(-z)), so
  P(y = +1 | x; w) = 1 / (1 + e^(-w . f(x)))
  P(y = -1 | x; w) = 1 - 1 / (1 + e^(-w . f(x)))
- Best w? Maximum likelihood estimation:
  max_w sum_i log P(y_i | x_i; w)
- This is Logistic Regression.

Separable Case, Deterministic Decision: many options (any separating boundary is equally good).
Separable Case, Probabilistic Decision: a clear preference (the boundary that assigns the training labels the highest probability).
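A minimal sketch of the sigmoid-based probabilistic decision (the helper names are illustrative; features and weights are dicts as before):

```python
import math

def sigmoid(z):
    """phi(z) = 1 / (1 + e^(-z)): squashes a score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_positive(w, features):
    """P(y = +1 | x; w) under the logistic model."""
    z = sum(w.get(f, 0.0) * v for f, v in features.items())
    return sigmoid(z)
```

A score of 0 (a point on the boundary) maps to probability 0.5, and the probability approaches 1 or 0 as the score grows very positive or very negative, matching the desiderata above.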
Multiclass Logistic Regression
Recall the multiclass perceptron:
- A weight vector w_y for each class.
- Score (activation) of a class y: w_y . f(x)
- Prediction: the class with the highest score.

How to make scores into probabilities?
- Softmax: map the original activations z_1, z_2, z_3 to
  e^(z_1) / (e^(z_1) + e^(z_2) + e^(z_3)), and likewise for z_2 and z_3.
- Best w? Maximum likelihood estimation:
  max_w sum_i log P(y_i | x_i; w), where P(y_i | x_i; w) = e^(w_{y_i} . f(x_i)) / sum_y e^(w_y . f(x_i))
- This is Multi-Class Logistic Regression.

Next Lecture
- Optimization, i.e., how do we solve these maximization problems?
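The softmax step can be sketched as follows (illustrative helper; the max-subtraction trick for numerical stability is an implementation detail, not from the slides, and does not change the result):

```python
import math

def softmax(activations):
    """Map per-class scores {class: z} to probabilities {class: p} summing to 1."""
    m = max(activations.values())  # subtract the max before exponentiating for stability
    exps = {y: math.exp(z - m) for y, z in activations.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}
```

Because e^z is monotone, the class with the highest activation also gets the highest probability, so the argmax prediction is unchanged from the perceptron.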