Loss Functions, Decision Theory, and Linear Models

1 Loss Functions, Decision Theory, and Linear Models. CMSC 678, UMBC, January 31st, 2018. Some slides adapted from Hamed Pirsiavash

2 Logistics Recap Piazza (ask & answer questions): Course site: 678/spring18/ Evaluation submission site: 678/spring18/submit

3 Course Announcement: Assignment 1 Due Wednesday, 2/7 (~7 days) Math & programming review Discuss with others, but write, implement and complete on your own

4 Recap from last time

5 What does it mean to learn? Generalization

6 Machine Learning Framework: Learning. [Diagram: instances 1 to 4 (typically examined independently) and extra knowledge go into the Machine Learning Predictor, a scoring model $\mathrm{score}_\theta(X)$; the Evaluator compares its output with the gold/correct labels and produces a score, the objective $F(\theta)$, which gives feedback to the predictor.]

7 Gradient Ascent

8 Model, parameters and hyperparameters Model: mathematical formulation of system (e.g., classifier) Parameters: primary knobs of the model that are set by a learning algorithm Hyperparameter: secondary knobs

9 A Terminology Buffet. The task (what kind of problem are you solving?): Classification, Regression, Clustering. The data (amount of human input/number of labeled examples): Fully-supervised, Semi-supervised, Un-supervised. The approach (how any data are being used): Probabilistic, Generative, Conditional, Spectral, Neural, Memory-based, Exemplar

10 Classification: Supervised Machine Learning. Examples: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis. Input: an instance $d$; a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$; a training set of $m$ hand-labeled instances $(d_1, c_1), \ldots, (d_m, c_m)$. Output: a learned classifier $\gamma$ that maps instances to classes. $\gamma$ learns to associate certain features of instances with their labels

11 Classification Example: Face Recognition What is a good representation for images? Courtesy from Hamed Pirsiavash Pixel values? Edges?

12 Ingredients for classification Inject your knowledge into a learning system Feature representation Training data: labeled examples Model Courtesy Hamed Pirsiavash

13 Ingredients for classification. Inject your knowledge into a learning system. Feature representation: problem specific; difficult to learn from bad ones. Training data: labeled examples. Model. Courtesy Hamed Pirsiavash

14 Ingredients for classification. Inject your knowledge into a learning system. Feature representation: problem specific; difficult to learn from bad ones. Training data (labeled examples): labeling data == $$$; sometimes data is available for free. Model. Courtesy Hamed Pirsiavash

15 Ingredients for classification. Inject your knowledge into a learning system. Feature representation: problem specific; difficult to learn from bad ones. Training data (labeled examples): labeling data == $$$; sometimes data is available for free. Model: no single learning algorithm is always good ("no free lunch"); different learning algorithms work differently. Courtesy Hamed Pirsiavash

16 Sequence & Structured Prediction Courtesy Hamed Pirsiavash

17 Regression Like classification, but real-valued

18 Regression Example: Stock Market Prediction Courtesy Hamed Pirsiavash

19 Unsupervised learning: Clustering Courtesy Hamed Pirsiavash

20 Inductive Bias What do we know before we see the data? Courtesy Hamed Pirsiavash

21 Inductive Bias What do we know before we see the data? A B C D Partition these into two groups Courtesy Hamed Pirsiavash

22 Machine Learning Framework. [Diagram: instances 1 to 4 (typically examined independently) and extra knowledge go into the Machine Learning Predictor; the Evaluator compares its output with the gold/correct labels and produces a score.]

23 Machine Learning Framework. [Same diagram, with an additional box: other ML task (or consumer-facing product).]

24 Today's Goal: Optimize Empirical Risk of Surrogate Loss. Formulate a linear prediction model: $y_i = w^T x_i + b$. Learn about empirical risk minimization: $\arg\min_h \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$, where the sum is the objective $F$, with gradient $\nabla_\theta F = \sum_i \frac{\partial \ell(y_i, \hat{y} = h_\theta(x_i))}{\partial \hat{y}} \nabla_\theta h_\theta(x_i)$. Approximate 0-1 classification loss in a computable way

25 (Most) Probability Axioms. $p(\text{everything}) = 1$. $p(\emptyset) = 0$. $p(A) \le p(B)$ when $A \subseteq B$. $p(A \cup B) = p(A) + p(B)$ when $A \cap B = \emptyset$. In general, $p(A \cup B) \le p(A) + p(B)$, and $p(A \cup B) = p(A) + p(B) - p(A \cap B)$
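The axioms above can be checked numerically on a small finite sample space. The following is a minimal sketch, assuming a fair six-sided die with uniform outcomes; the sample space `OMEGA`, the events `A` and `B`, and the helper `prob` are illustrative names that do not come from the slides.

```python
from fractions import Fraction

# Hypothetical sample space: a fair six-sided die (an assumption for illustration).
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a subset of OMEGA) under the uniform distribution."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({1, 2, 3})   # "roll is at most 3"
B = frozenset({2, 4, 6})   # "roll is even"

p_everything = prob(OMEGA)        # p(everything)
p_empty = prob(frozenset())       # p(empty set)
p_union = prob(A | B)             # p(A or B)
inclusion_exclusion = prob(A) + prob(B) - prob(A & B)

print(p_everything, p_empty, p_union, inclusion_exclusion)
```

Exact fractions are used so that the equalities can be compared without floating-point rounding.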

26 Conditional Probability. $p(X \mid Y) = \frac{p(X \,\&\, Y)}{p(Y)}$. Conditional probabilities are probabilities

27 Conditional Probability. $p(X \mid Y) = \frac{p(X \,\&\, Y)}{p(Y)}$. $p(Y) = {?}$

28 Conditional Probability. $p(X \mid Y) = \frac{p(X \,\&\, Y)}{p(Y)}$. $p(Y) = \int p(X \,\&\, Y)\, dX$

29 Marginal(ized) Probability: The Discrete Case. [Diagram: the event $y$ split into the pieces $x_1 \,\&\, y$, $x_2 \,\&\, y$, $x_3 \,\&\, y$, $x_4 \,\&\, y$.] $p(y) = \sum_x p(x, y) = \sum_x p(x)\, p(y \mid x)$
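As a rough illustration of the discrete marginalization on this slide, the sketch below sums a small joint table over $x$ and checks that both forms of the sum agree. The joint probabilities are made-up numbers, and the helper names (`marginal_y`, `marginal_x`, `conditional_y_given_x`) are assumptions for this example only.

```python
from fractions import Fraction

# Hypothetical joint distribution p(x, y) over x in {x1..x4} and y in {0, 1};
# the numbers are made up for illustration and sum to 1.
joint = {
    ("x1", 0): Fraction(1, 10), ("x1", 1): Fraction(1, 10),
    ("x2", 0): Fraction(2, 10), ("x2", 1): Fraction(1, 10),
    ("x3", 0): Fraction(1, 10), ("x3", 1): Fraction(2, 10),
    ("x4", 0): Fraction(1, 10), ("x4", 1): Fraction(1, 10),
}
X_VALUES = ["x1", "x2", "x3", "x4"]

def marginal_y(y):
    """p(y) = sum over x of p(x, y)."""
    return sum(joint[(x, y)] for x in X_VALUES)

def marginal_x(x):
    """p(x) = sum over y of p(x, y)."""
    return sum(joint[(x, y)] for y in (0, 1))

def conditional_y_given_x(y, x):
    """p(y | x) = p(x, y) / p(x)."""
    return joint[(x, y)] / marginal_x(x)

p_y1_direct = marginal_y(1)
p_y1_factored = sum(marginal_x(x) * conditional_y_given_x(1, x) for x in X_VALUES)
```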

30 Bayes Rule. $p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$, where $p(X \mid Y)$ is the posterior probability, $p(Y \mid X)$ the likelihood, $p(X)$ the prior probability, and $p(Y)$ the marginal likelihood (probability)
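A small numeric sketch of Bayes rule may help here. It assumes two labels with made-up prior and likelihood values; the function name `posterior` and the labels `"spam"` and `"ham"` are illustrative choices, not something the slides define.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Apply Bayes rule, p(X | Y) = p(Y | X) p(X) / p(Y), for one fixed observation Y.

    prior: dict mapping each value of X to p(X)
    likelihood: dict mapping each value of X to p(Y | X) for the observed Y
    """
    # Marginal likelihood: p(Y) = sum over X of p(Y | X) p(X)
    p_y = sum(likelihood[x] * prior[x] for x in prior)
    return {x: likelihood[x] * prior[x] / p_y for x in prior}

# Made-up numbers for illustration only.
prior = {"spam": Fraction(1, 5), "ham": Fraction(4, 5)}
likelihood = {"spam": Fraction(1, 2), "ham": Fraction(1, 10)}
post = posterior(prior, likelihood)
```

With these numbers the marginal likelihood is 9/50, so the posterior is 5/9 for `"spam"` and 4/9 for `"ham"`: the likelihood outweighs the smaller prior.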

31 Maximum A Posteriori Classification. Examples: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis. $p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$, where $X$ is the class, $Y$ the observed data, $p(Y \mid X)$ the class-based likelihood (language model), $p(X)$ the prior probability of the class, and $p(Y)$ the observation likelihood (averaged over all classes)

32 Classify with Bayes Rule. $\arg\max_X p(X \mid Y)$

33 Classify with Bayes Rule. $\arg\max_X \frac{p(Y \mid X)\, p(X)}{p(Y)}$

34 Classify with Bayes Rule. $\arg\max_X \frac{p(Y \mid X)\, p(X)}{p(Y)}$, where $p(Y)$ is constant with respect to $X$

35 Classify with Bayes Rule. $\arg\max_X p(Y \mid X)\, p(X)$

36 Classify with Bayes Rule. $\arg\max_X \log p(Y \mid X) + \log p(X)$

38 Classify with Bayes Rule. $\arg\max_X \log p(Y \mid X) + \log p(X)$, where $\log p(Y \mid X)$ answers "how well does text Y represent label X?" and $\log p(X)$ answers "how likely is label X overall?"

39 Classify with Bayes Rule. $\arg\max_X \log p(Y \mid X) + \log p(X)$, where $\log p(Y \mid X)$ answers "how well does text Y represent label X?" and $\log p(X)$ answers "how likely is label X overall?" For simple or flat labels:
* iterate through labels
* evaluate score for each label, keeping only the best (n best)
* return the best (or n best) label and score
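The three steps for flat labels can be sketched as follows. This is a minimal illustration, assuming the log prior and log likelihood for the observed text are already available as dictionaries; the function name `map_classify`, its parameters, and the numbers are hypothetical.

```python
import heapq
import math

def map_classify(log_prior, log_likelihood, n_best=1):
    """Score each label X by log p(Y | X) + log p(X) and keep the n best.

    log_prior: dict label -> log p(X)
    log_likelihood: dict label -> log p(Y | X) for the observed Y
    Returns a list of (label, score) pairs, best first.
    """
    # Iterate through labels and evaluate the score for each one.
    scored = ((label, log_likelihood[label] + log_prior[label]) for label in log_prior)
    # Keep only the n best (label, score) pairs, ordered from best to worst.
    return heapq.nlargest(n_best, scored, key=lambda pair: pair[1])

# Made-up probabilities for illustration only.
log_prior = {"spam": math.log(0.2), "ham": math.log(0.8)}
log_likelihood = {"spam": math.log(0.5), "ham": math.log(0.1)}

best = map_classify(log_prior, log_likelihood)
top_two = map_classify(log_prior, log_likelihood, n_best=2)
```

Working in log space turns the product $p(Y \mid X)\, p(X)$ into a sum, which avoids numerical underflow when the likelihood is itself a product of many small terms.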

40 Conditional Probabilities: Changing the Left. What happens as we add conjuncts to the left? [Scale from 1 (top) to 0 (bottom): $p(A)$]

41 Conditional Probabilities: Changing the Left. What happens as we add conjuncts to the left? [Scale from 1 (top) to 0 (bottom): $p(A)$, $p(A, B)$]

42 Conditional Probabilities: Changing the Left. What happens as we add conjuncts to the left? [Scale from 1 (top) to 0 (bottom): $p(A)$, $p(A, B)$, $p(A, B, C)$]

43 Conditional Probabilities: Changing the Left. What happens as we add conjuncts to the left? [Scale from 1 (top) to 0 (bottom): $p(A)$, $p(A, B)$, $p(A, B, C)$, $p(A, B, C, D)$]

44 Conditional Probabilities: Changing the Left. What happens as we add conjuncts to the left? [Scale from 1 (top) to 0 (bottom): $p(A)$, $p(A, B)$, $p(A, B, C)$, $p(A, B, C, D)$, $p(A, B, C, D, E)$]

45 Conditional Probabilities: Changing the Right. What happens as we add conjuncts to the right? [Scale from 1 (top) to 0 (bottom): $p(A)$]

46 Conditional Probabilities: Changing the Right. What happens as we add conjuncts to the right? [Scale from 1 (top) to 0 (bottom): $p(A)$, $p(A \mid B)$]

47 Conditional Probabilities: Changing the Right. What happens as we add conjuncts to the right? [Scale from 1 (top) to 0 (bottom): $p(A \mid B)$, $p(A)$]

49 Conditional Probabilities Bias vs. Variance Lower bias: More specific to what we care about Higher variance: For fixed observations, estimates become less reliable

50 Probability Chain Rule. $p(x_1, x_2, \ldots, x_S) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_S \mid x_1, \ldots, x_{S-1}) = \prod_{i=1}^{S} p(x_i \mid x_1, \ldots, x_{i-1})$. An extension of Bayes rule
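The chain rule can be verified on a toy joint distribution. The sketch below assumes three binary variables with arbitrary made-up weights; `prefix_prob` and `chain_rule_prob` are hypothetical helper names used only for this check.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint over three binary variables; the weights are arbitrary.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = list(product([0, 1], repeat=3))
total = sum(weights)
joint = {o: Fraction(w, total) for o, w in zip(outcomes, weights)}

def prefix_prob(prefix):
    """p(x_1, ..., x_k): sum the joint over all outcomes that start with prefix."""
    k = len(prefix)
    return sum(p for o, p in joint.items() if o[:k] == prefix)

def chain_rule_prob(outcome):
    """Product over i of p(x_i | x_1, ..., x_{i-1})."""
    result = Fraction(1)
    for i in range(1, len(outcome) + 1):
        # p(x_i | x_1..x_{i-1}) = p(x_1..x_i) / p(x_1..x_{i-1})
        result *= prefix_prob(outcome[:i]) / prefix_prob(outcome[:i - 1])
    return result

matches = all(chain_rule_prob(o) == joint[o] for o in outcomes)
```

The product telescopes: each denominator cancels the previous numerator, leaving the full joint probability.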

51 Expected Value. For a random variable $X \sim p(\cdot)$: $\mathbb{E}[f(X)] = \sum_x f(x)\, p(x)$ (the expected value; the distribution $p$ is implicit)

52 Expected Value: Example. Uniform distribution of number of cats I have: 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5

53 Expected Value: Example. Non-uniform distribution of number of cats I have: 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
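Both cat examples can be reproduced with a few lines of code. This is a minimal sketch, assuming each distribution is stored as a dictionary from value to probability; the function name `expected_value` and its optional argument `f` are illustrative.

```python
from fractions import Fraction

def expected_value(dist, f=lambda x: x):
    """E[f(X)] = sum over x of f(x) * p(x), for a dict dist mapping x -> p(x)."""
    return sum(f(x) * p for x, p in dist.items())

# Uniform: each count from 1 to 6 has probability 1/6.
uniform = {k: Fraction(1, 6) for k in range(1, 7)}
# Non-uniform: 1 has probability 1/2, each of 2..6 has probability 1/10.
non_uniform = {1: Fraction(1, 2), **{k: Fraction(1, 10) for k in range(2, 7)}}

print(float(expected_value(uniform)))      # 3.5
print(float(expected_value(non_uniform)))  # 2.5
```

Passing a different `f`, for example `lambda x: x ** 2`, gives the expected value of a function of the random variable, as in the general definition.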

54 Probability Prerequisites Basic probability axioms and definitions Probabilistic Independence Definition of joint probability Bayes rule Probability chain rule Expected Value (of a function) of a Random Variable Definition of conditional probability

55 Loss Functions and Decision Theory FORMALIZING LEARNING

56 Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch 36). Input: $x$ ("state of the world"). Output: a decision $\hat{y}$

57 Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch 36). Input: $x$ ("state of the world"). Output: a decision $\hat{y}$. Requirement 1: a decision (hypothesis) function $h(x)$ to produce $\hat{y}$

58 Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch 36). Input: $x$ ("state of the world"). Output: a decision $\hat{y}$. Requirement 1: a decision (hypothesis) function $h(x)$ to produce $\hat{y}$. Requirement 2: a function $\ell(y, \hat{y})$ telling us how wrong we are

59 Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch 36). Input: $x$ ("state of the world"). Output: a decision $\hat{y}$. Requirement 1: a decision (hypothesis) function $h(x)$ to produce $\hat{y}$. Requirement 2: a loss function $\ell(y, \hat{y})$ telling us how wrong we are. Goal: minimize our expected loss across any possible input

60 Requirement 1: Decision Function. [Diagram: instances 1 to 4 and extra knowledge go into the Machine Learning Predictor, labeled $h(x)$; the Evaluator compares its output with the gold/correct labels and produces a score.] $h(x)$ is our predictor (classifier, regression model, clustering model, etc.)

61 Requirement 2: Loss Function. $\ell(y, \hat{y}) \ge 0$, where $\ell$ is "ell" (fancy l character), $y$ is the correct label/result, and $\hat{y}$ is the predicted label/result. Optimize $\ell$? Minimize or maximize? Loss: a function that tells you how much to penalize a prediction $\hat{y}$ from the correct answer $y$

62 Requirement 2: Loss Function. $\ell(y, \hat{y}) \ge 0$, where $\ell$ is "ell" (fancy l character), $y$ is the correct label/result, and $\hat{y}$ is the predicted label/result. $-\ell$ is called a utility or reward function. Loss: a function that tells you how much to penalize a prediction $\hat{y}$ from the correct answer $y$

63 Decision Theory. Minimize expected loss across any possible input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})]$

64 Risk Minimization. Minimize expected loss across any possible input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_h \mathbb{E}[\ell(y, h(x))]$. This is for a particular, unspecified input pair $(x, y)$, but we want any possible pair

65 Decision Theory. Minimize expected loss across any possible input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_h \mathbb{E}[\ell(y, h(x))] = \arg\min_h \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))]$. Assumption: there exists some true (but likely unknown) distribution $P$ over inputs $x$ and outputs $y$

66 Risk Minimization. Minimize expected loss across any possible input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_h \mathbb{E}[\ell(y, h(x))] = \arg\min_h \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))] = \arg\min_h \int \ell(y, h(x))\, P(x, y)\, d(x, y)$

67 Risk Minimization. Minimize expected loss across any possible input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_h \mathbb{E}[\ell(y, h(x))] = \arg\min_h \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))] = \arg\min_h \int \ell(y, h(x))\, P(x, y)\, d(x, y)$. We don't know this distribution $P$*! *We could try to approximate it analytically

68 Empirical Risk Minimization. Minimize expected loss across our observed input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_h \mathbb{E}[\ell(y, h(x))] = \arg\min_h \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))] \approx \arg\min_h \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, h(x_i))$

69 Empirical Risk Minimization. Minimize expected loss across our observed input: $\arg\min_h \sum_{i=1}^{N} \ell(y_i, h(x_i))$, where $h$ is our classifier/predictor, controlled by our parameters $\theta$. Changing $\theta$ changes the behavior of the classifier
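To make the empirical risk concrete, the sketch below averages a loss over a handful of observed pairs and compares two parameter settings. It assumes a scalar linear predictor and uses squared error as one possible choice of $\ell$; the names `empirical_risk`, `squared_loss`, and `make_linear`, and the data points, are all made up for illustration.

```python
def empirical_risk(loss, h, data):
    """Average loss of predictor h over observed (x, y) pairs."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

def squared_loss(y, y_hat):
    """One possible loss: squared difference between correct y and prediction y_hat."""
    return (y - y_hat) ** 2

def make_linear(w, b):
    """Return h_theta(x) = w * x + b for scalar x, with theta = (w, b)."""
    return lambda x: w * x + b

# Made-up observations that lie exactly on y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

risk_good = empirical_risk(squared_loss, make_linear(2.0, 1.0), data)
risk_bad = empirical_risk(squared_loss, make_linear(0.0, 0.0), data)
```

Changing the parameters from `(0.0, 0.0)` to `(2.0, 1.0)` changes the predictor's behavior and drives the empirical risk on these points from 35/3 down to zero.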

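70 Best Case: Optimize Empirical Risk with Gradients. $\arg\min_h \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$, where the sum is the objective $F$. $\nabla_\theta F = \sum_i \frac{\partial \ell(y_i, \hat{y} = h_\theta(x_i))}{\partial \hat{y}} \nabla_\theta h_\theta(x_i)$. Differentiating might not always work: "apart from the computational details"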
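The chain-rule form of the gradient can be sketched for one specific case. The example below assumes a scalar linear model $h_\theta(x) = wx + b$ and squared error as the loss, then minimizes the objective with plain gradient descent; the function names, learning rate, step count, and data are assumptions chosen for illustration, not values from the lecture.

```python
def grad_objective(w, b, data):
    """Gradient of F(w, b) = sum_i (y_i - (w * x_i + b)) ** 2 via the chain rule."""
    grad_w, grad_b = 0.0, 0.0
    for x, y in data:
        y_hat = w * x + b
        dloss_dyhat = 2.0 * (y_hat - y)   # derivative of squared loss w.r.t. y_hat
        grad_w += dloss_dyhat * x         # d h / d w = x
        grad_b += dloss_dyhat * 1.0       # d h / d b = 1
    return grad_w, grad_b

def gradient_descent(data, lr=0.05, steps=2000):
    """Repeatedly step against the gradient, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = grad_objective(w, b, data)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up observations that lie exactly on y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = gradient_descent(data)
```

Each term in the loop is the product of the two factors in the slide's formula: the derivative of the loss with respect to the prediction, times the gradient of the prediction with respect to the parameters.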

71 Loss Function Example: 0-1 Loss. $\ell(y, \hat{y}) = 0$ if $\hat{y} = y$; $1$ if $\hat{y} \ne y$

72 Loss Function Example: 0-1 Loss. $\ell(y, \hat{y}) = 0$ if $\hat{y} = y$; $1$ if $\hat{y} \ne y$. Problem: not differentiable wrt $\hat{y}$ (or $\theta$). Solution 1: is the data linearly separable? Perceptron (next class) can work. Solution 2: is $h(x)$ a conditional distribution $p(y \mid x)$? Use MAP
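One way to see the differentiability problem is to evaluate the 0-1 empirical risk at several nearby parameter values. The sketch below assumes a one-dimensional threshold classifier, a hypothetical stand-in for $h_\theta$, on made-up data; the risk stays flat as $\theta$ moves and then jumps, so its gradient is either zero or undefined.

```python
def zero_one_loss(y, y_hat):
    """0 if the prediction matches the correct label, 1 otherwise."""
    return 0 if y_hat == y else 1

def threshold_classifier(theta):
    """Hypothetical h_theta(x): predict +1 if x > theta, else -1."""
    return lambda x: 1 if x > theta else -1

# Made-up, linearly separable data: (x, y) pairs with labels in {-1, +1}.
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]

def zero_one_risk(theta):
    """Empirical 0-1 risk of the threshold classifier on the data."""
    h = threshold_classifier(theta)
    return sum(zero_one_loss(y, h(x)) for x, y in data) / len(data)

flat_region = [zero_one_risk(t) for t in (-0.5, 0.0, 0.5)]  # theta between the classes
risk_shifted = zero_one_risk(1.5)                            # theta past the point x = 1.0
```

Because small changes in $\theta$ leave the risk unchanged almost everywhere, gradient-based updates get no signal, which motivates the alternatives listed on the slide.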

73 Loss Function Examples. Squared loss: $\ell(y, \hat{y}) = (y - \hat{y})^2$. Absolute loss: $\ell(y, \hat{y}) = |y - \hat{y}|$
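The two losses can be compared directly on a small and a large error. This is a minimal sketch with made-up values; the function names are illustrative.

```python
def squared_loss(y, y_hat):
    """Squared loss: (y - y_hat) ** 2."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """Absolute loss: |y - y_hat|."""
    return abs(y - y_hat)

# (correct y, predicted y_hat) pairs: one small error and one large error.
pairs = [(1.0, 1.5), (1.0, 4.0)]
sq = [squared_loss(y, y_hat) for y, y_hat in pairs]
ab = [absolute_loss(y, y_hat) for y, y_hat in pairs]
```

Here the errors of 0.5 and 3.0 give absolute losses of 0.5 and 3.0 but squared losses of 0.25 and 9.0, so squared loss penalizes the large error far more heavily relative to the small one.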
