Introduction to Machine Learning

1 Introduction to Machine Learning Bishop PRML Ch. 1 Alireza Ghane

2 Outline Course Info.: People, References, Resources Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection Decision Theory: ML, Loss Function, MAP Probability Theory: (e.g.) Probabilities and Parameter Estimation Conclusion Intro. to Machine Learning Alireza Ghane / Torsten Möller 1

4 Course Info. Dr. Torsten Möller Home: Discussions: Intro. to Machine Learning Alireza Ghane / Torsten Möller 3

5 Registration Course max participants: 25 Course Registered: 61 Number of Seats: 48 Excess: 13 Sign your name on the sheet If you miss the first two sessions, you will be automatically SIGNED OFF the course! Intro. to Machine Learning Alireza Ghane / Torsten Möller 4

7 References Main Textbook: Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer Other Useful Resources: The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Machine Learning, Tom Mitchell. Pattern Classification (2nd ed.), Richard O. Duda, Peter E. Hart, and David G. Stork. Machine Learning, An Algorithmic Perspective, Stephen Marsland. The Top Ten Algorithms in Data Mining, X. Wu, V. Kumar. Learning from Data, Cherkassky-Mulier. Online Courses: Andrew Ng: Intro. to Machine Learning Alireza Ghane / Torsten Möller 6

10 Grading: Assignments / Labs (50%): 5 assignments, 10% each. Final Exam (40%). Class Feedback (10%). Assignment late policy: 5 grace days for all assignments together; after the grace days, 25% penalty for each day. Programming with: MATLAB (licensed) or Octave (free). Intro. to Machine Learning Alireza Ghane / Torsten Möller 9

13 Course Topics We will cover techniques in the standard ML toolkit: maximum likelihood, regularization, support vector machines (SVM), Fisher's linear discriminant (LDA), boosting, principal components analysis (PCA), Markov random fields (MRF), neural networks, graphical models, belief propagation, expectation-maximization (EM), mixture models, mixtures of experts (MoE), hidden Markov models (HMM), particle filters, Markov Chain Monte Carlo (MCMC), Gibbs sampling,... Intro. to Machine Learning Alireza Ghane / Torsten Möller 12

14 Background Calculus: $E = mc^2 \Rightarrow \frac{\partial E}{\partial c} = 2mc$; $\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{a}) = \mathbf{a}$. Linear algebra (PRML Appendix C): $A\mathbf{u}_i = \lambda_i \mathbf{u}_i$. Probability (PRML Ch. 1.2): $p(X) = \sum_Y p(X, Y)$; $p(x) = \int p(x, y)\,dy$; $E_x[f] = \int p(x) f(x)\,dx$. It will be possible to refresh, but if you've never seen these before this course will be very difficult. Intro. to Machine Learning Alireza Ghane / Torsten Möller 13

15 Outline Course Info.: People, References, Resources Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection Decision Theory: ML, Loss Function, MAP Probability Theory: (e.g.) Probabilities and Parameter Estimation Conclusion Intro. to Machine Learning Alireza Ghane / Greg Mori 14

16 What is Machine Learning (ML)? Algorithms that automatically improve performance through experience Often this means define a model by hand, and use data to fit its parameters Intro. to Machine Learning Alireza Ghane / Greg Mori 15

17 Why ML? The real world is complex: it is difficult to hand-craft solutions. ML is the preferred framework for applications in many fields: Computer Vision Natural Language Processing, Speech Recognition Robotics... Intro. to Machine Learning Alireza Ghane / Greg Mori 16

18 Hand-written Digit Recognition Belongie et al., PAMI 2002. (Figure: all of the misclassified MNIST test digits using their method (63 out of 10,000); the text above each digit indicates the example number followed by the true label and the assigned label.) Difficult to hand-craft rules about digits Intro. to Machine Learning Alireza Ghane / Greg Mori 17

19 Hand-written Digit Recognition Represent an input image as a vector $x_i \in \mathbb{R}^{784}$. Suppose we have a target vector $t_i$, e.g. $t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)$. This is supervised learning. Discrete, finite label set: perhaps $t_i \in \{0, 1\}^{10}$, a classification problem. Given a training set $\{(x_1, t_1), \ldots, (x_N, t_N)\}$, the learning problem is to construct a good function $y(x)$ from these, $y : \mathbb{R}^{784} \to \mathbb{R}^{10}$. Intro. to Machine Learning Alireza Ghane / Greg Mori 18
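To make this representation concrete, here is a small illustrative sketch (mine, not from the slides; the image below is a random stand-in, not real MNIST data) of turning an image into the vector x_i and its label into a 1-of-10 target t_i:

import numpy as np

# Hypothetical stand-in for a 28x28 grayscale MNIST image of the digit "3".
image = np.random.rand(28, 28)
label = 3

x = image.reshape(784)     # input vector x_i in R^784
t = np.zeros(10)           # 1-of-10 ("one-hot") target vector t_i
t[label] = 1.0

print(x.shape, t)          # (784,) [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]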

20 Face Detection Classification problem Schneiderman and Kanade, IJCV 2002 $t_i \in \{0, 1, 2\}$: non-face, frontal face, profile face. Intro. to Machine Learning Alireza Ghane / Greg Mori 19

21 Spam Detection Classification problem $t_i \in \{0, 1\}$: non-spam, spam $x_i$: counts of words, e.g. Viagra, stock, outperform, multi-bagger Intro. to Machine Learning Alireza Ghane / Greg Mori 20

22 Caveat - Horses (source?) Once upon a time there were two neighboring farmers, Jed and Ned. Each owned a horse, and the horses both liked to jump the fence between the two farms. Clearly the farmers needed some means to tell whose horse was whose. So Jed and Ned got together and agreed on a scheme for discriminating between horses. Jed would cut a small notch in one ear of his horse. Not a big, painful notch, but just big enough to be seen. Well, wouldn't you know it, the day after Jed cut the notch in his horse's ear, Ned's horse caught on the barbed wire fence and tore his ear the exact same way! Something else had to be devised, so Jed tied a big blue bow on the tail of his horse. But the next day, Jed's horse jumped the fence, ran into the field where Ned's horse was grazing, and chewed the bow right off the other horse's tail. Ate the whole bow! Intro. to Machine Learning Alireza Ghane / Greg Mori 21

25 Caveat - Horses (source?) Finally, Jed suggested, and Ned concurred, that they should pick a feature that was less apt to change. Height seemed like a good feature to use. But were the heights different? Well, each farmer went and measured his horse, and do you know what? The brown horse was a full inch taller than the white one! Moral of the story: ML provides theory and tools for setting parameters. Make sure you have the right model and features. Think about your feature vector x. Intro. to Machine Learning Alireza Ghane / Greg Mori 24

27 Stock Price Prediction Problems in which $t_i$ is continuous are called regression E.g. $t_i$ is stock price, $x_i$ contains company profit, debt, cash flow, gross sales, number of spam e-mails sent,... Intro. to Machine Learning Alireza Ghane / Greg Mori 26

28 Clustering Images Wang et al., CVPR 2006 Only $x_i$ is defined: unsupervised learning E.g. $x_i$ describes an image; find groups of similar images Intro. to Machine Learning Alireza Ghane / Greg Mori 27

29 Types of Learning Problems Supervised Learning Classification Regression Unsupervised Learning Density estimation Clustering: k-means, mixture models, hierarchical clustering Hidden Markov models Reinforcement Learning Intro. to Machine Learning Alireza Ghane / Greg Mori 28

30 Outline Course Info.: People, References, Resources Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection Decision Theory: ML, Loss Function, MAP Probability Theory: (e.g.) Probabilities and Parameter Estimation Conclusion Intro. to Machine Learning Alireza Ghane / Greg Mori 29

31 An Example - Polynomial Curve Fitting Suppose we are given a training set of N observations $(x_1, \ldots, x_N)$ and $(t_1, \ldots, t_N)$, $x_i, t_i \in \mathbb{R}$ Regression problem: estimate $y(x)$ from these data Intro. to Machine Learning Alireza Ghane / Greg Mori 30

32 Polynomial Curve Fitting What form is $y(x)$? Let's try polynomials of degree M: $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M$. This is the hypothesis space. How do we measure success? Sum of squared errors: $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2$ (Figure: each error bar is the displacement $t_n - y(x_n, \mathbf{w})$ at the training point $x_n$.) Among functions in the class, choose that which minimizes this error Intro. to Machine Learning Alireza Ghane / Greg Mori 31

35 Polynomial Curve Fitting Error function: $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2$ Best coefficients: $\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})$ Found using pseudo-inverse (more later) Intro. to Machine Learning Alireza Ghane / Greg Mori 34
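As a hedged illustration of how such a fit could be computed (my own NumPy sketch; the course labs use MATLAB/Octave and this is not the course code): build the design matrix whose columns are 1, x, ..., x^M and solve the least-squares problem via the pseudo-inverse.

import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of a degree-M polynomial y(x, w) = w0 + w1 x + ... + wM x^M."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix, column j is x**j
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimizes ||Phi w - t||^2 (pseudo-inverse)
    return w

def predict(x, w):
    return np.vander(x, len(w), increasing=True) @ w

def sse(x, t, w):
    """Sum-of-squares error E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    return 0.5 * np.sum((predict(x, w) - t) ** 2)

# Toy data: noisy samples of sin(2*pi*x), in the spirit of PRML Ch. 1.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
w_star = fit_polynomial(x_train, t_train, M=3)
print(w_star, sse(x_train, t_train, w_star))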

36 Which Degree of Polynomial? A model selection problem M = 9 gives $E(\mathbf{w}^*) = 0$: This is over-fitting Intro. to Machine Learning Alireza Ghane / Greg Mori 35

37 Generalization (Plot: training and test $E_{RMS}$ versus M.) Generalization is the holy grail of ML Want good performance for new data Measure generalization using a separate test set Use root-mean-squared (RMS) error: $E_{RMS} = \sqrt{2E(\mathbf{w}^*)/N}$ Intro. to Machine Learning Alireza Ghane / Greg Mori 36
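A tiny sketch (mine) of how this could be computed for a fitted coefficient vector w; comparing the training and test values of E_RMS as M grows is what reveals over-fitting.

import numpy as np

def e_rms(x, t, w):
    """Root-mean-squared error E_RMS = sqrt(2 E(w) / N) for a polynomial with coefficients w."""
    Phi = np.vander(x, len(w), increasing=True)
    E = 0.5 * np.sum((Phi @ w - t) ** 2)
    return np.sqrt(2 * E / len(x))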

38 Controlling Over-fitting: Regularization As the order of the polynomial M increases, so do the coefficient magnitudes Penalize large coefficients in the error function: $\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$ Intro. to Machine Learning Alireza Ghane / Greg Mori 37

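For illustration (my own sketch, not the course code): with the polynomial design matrix Phi, the minimizer of the regularized error above has the closed form w = (lambda I + Phi^T Phi)^{-1} Phi^T t.

import numpy as np

def fit_polynomial_regularized(x, t, M, lam):
    """Minimize 1/2 * sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)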

40 Controlling Over-fitting: Regularization Intro. to Machine Learning Alireza Ghane / Greg Mori 39

41 Controlling Over-fitting: Regularization (Plot: training and test $E_{RMS}$.) Note the $E_{RMS}$ for the training set: a perfect match between the model and the training set is a result of over-fitting Training and test error show similar trends Intro. to Machine Learning Alireza Ghane / Greg Mori 40

42 Over-fitting: Dataset size With more data, a more complex model (M = 9) can be fit Rule of thumb: 10 data points for each parameter Intro. to Machine Learning Alireza Ghane / Greg Mori 41

43 Validation Set Split training data into training set and validation set Train different models (e.g. diff. order polynomials) on training set Choose model (e.g. order of polynomial) with minimum error on validation set Intro. to Machine Learning Alireza Ghane / Greg Mori 42

44 Cross-validation (Figure: the data split into S = 4 groups; each run holds out a different group for validation.) Data are often limited Cross-validation creates S groups of data; use S - 1 to train, the other to validate Extreme case, leave-one-out cross-validation (LOO-CV): S is the number of training data points Cross-validation is an effective method for model selection, but can be slow Models with multiple complexity parameters: exponential number of runs Intro. to Machine Learning Alireza Ghane / Greg Mori 43
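An illustrative S-fold cross-validation loop for choosing the polynomial degree M (my own sketch; the fold construction and the plain least-squares fit are assumptions, not the course's code):

import numpy as np

def cross_validate_degree(x, t, degrees, S=4):
    """Pick the degree M with the lowest average validation SSE over S folds."""
    folds = np.array_split(np.random.permutation(len(x)), S)
    scores = {}
    for M in degrees:
        errs = []
        for i in range(S):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(S) if j != i])
            Phi_tr = np.vander(x[train], M + 1, increasing=True)
            w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)
            Phi_val = np.vander(x[val], M + 1, increasing=True)
            errs.append(0.5 * np.sum((Phi_val @ w - t[val]) ** 2))
        scores[M] = np.mean(errs)
    return min(scores, key=scores.get), scores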

45 Summary Want models that generalize to new data Train model on training set Measure performance on held-out test set Performance on test set is good estimate of performance on new data Intro. to Machine Learning Alireza Ghane / Greg Mori 44

46 Summary - Model Selection Which model to use? E.g. which degree polynomial? Training set error is lower with a more complex model Can't just choose the model with lowest training error Peeking at test error is unfair. E.g. picking the polynomial with lowest test error Performance on the test set is then no longer a good estimate of performance on new data Intro. to Machine Learning Alireza Ghane / Greg Mori 45

48 Summary - Solutions I Use a validation set Train models on training set. E.g. different degree polynomials Measure performance on held-out validation set Measure performance of that model on held-out test set Can use cross-validation on training set instead of a separate validation set if little data and lots of time Choose model with lowest error over all cross-validation folds (e.g. polynomial degree) Retrain that model using all training data (e.g. polynomial coefficients) Intro. to Machine Learning Alireza Ghane / Greg Mori 47

50 Summary - Solutions II Use regularization Train a complex model (e.g. high-order polynomial) but penalize being too complex (e.g. large weight magnitudes) Need to balance error vs. regularization (λ) Choose λ using cross-validation Get more data Intro. to Machine Learning Alireza Ghane / Greg Mori 49

52 Outline Course Info.: People, References, Resources Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection Decision Theory: ML, Loss Function, MAP Probability Theory: (e.g.) Probabilities and Parameter Estimation Conclusion

53 Decision Theory For a sample x, decide which class ($\mathcal{C}_k$) it is from. Ideas: Maximum Likelihood Minimum Loss/Cost (e.g. misclassification rate) Maximum A Posteriori (MAP) Intro. to Machine Learning Alireza Ghane 52

54 Decision: Maximum Likelihood Inference step: Determine statistics from training data: $p(\mathbf{x}, t)$ OR $p(\mathbf{x} \mid \mathcal{C}_k)$. Decision step: Determine the optimal $t$ for test input $\mathbf{x}$: $t = \arg\max_k \underbrace{p(\mathbf{x} \mid \mathcal{C}_k)}_{\text{likelihood}}$ Intro. to Machine Learning Alireza Ghane 53

57 Decision: Minimum Misclassification Rate $p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\,d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\,d\mathbf{x}$; in general, $p(\text{mistake}) = \sum_k \sum_{j \neq k} \int_{\mathcal{R}_j} p(\mathbf{x}, \mathcal{C}_k)\,d\mathbf{x}$ (Figure: joint densities $p(x, \mathcal{C}_1)$ and $p(x, \mathcal{C}_2)$; $\hat{x}$: decision boundary, $x_0$: optimal decision boundary, $x_0 = \arg\min p(\text{mistake})$.) Intro. to Machine Learning Alireza Ghane 56

59 Decision: Minimum Loss/Cost Misclassification rate: $\mathcal{R}^* = \arg\min_{\{\mathcal{R}_i\}_{i=1}^{K}} \sum_k \sum_j L(\mathcal{R}_j, \mathcal{C}_k)$ Weighted loss/cost function: $\mathcal{R}^* = \arg\min_{\{\mathcal{R}_i\}_{i=1}^{K}} \sum_k \sum_j W_{j,k}\, L(\mathcal{R}_j, \mathcal{C}_k)$ Is useful when: The class populations are different The failure cost is non-symmetric Intro. to Machine Learning Alireza Ghane 58
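A small sketch (mine) of the weighted-loss decision rule: with posteriors p(C_k | x) and a loss matrix W, where W[j, k] is the cost of deciding class j when the truth is class k, pick the class that minimizes the expected loss. The loss values below are made up for illustration.

import numpy as np

def decide_min_expected_loss(posteriors, W):
    """posteriors: shape (K,), p(C_k | x).  W[j, k]: loss for deciding j when truth is k."""
    expected_loss = W @ posteriors          # entry j = sum_k W[j, k] * p(C_k | x)
    return int(np.argmin(expected_loss))

# Example: missing class 1 (truth = 1, decide 0) is 100x worse than a false alarm.
W = np.array([[0.0, 100.0],
              [1.0, 0.0]])
print(decide_min_expected_loss(np.array([0.7, 0.3]), W))   # decides class 1 despite its lower posterior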

60 Decision: Maximum A Posteriori (MAP) Bayes' Theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, so $\underbrace{p(\mathcal{C}_k \mid \mathbf{x})}_{\text{posterior}} \propto \underbrace{p(\mathbf{x} \mid \mathcal{C}_k)}_{\text{likelihood}}\,\underbrace{p(\mathcal{C}_k)}_{\text{prior}}$ Provides an a posteriori belief for the estimation, rather than a single point estimate. Can utilize a priori information in the decision. Intro. to Machine Learning Alireza Ghane 59

61 Outline Course Info.: People, References, Resources Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection Decision Theory: ML, Loss Function, MAP Probability Theory: (e.g.) Probabilities and Parameter Estimation Conclusion

62 Coin Tossing Let's say you're given a coin, and you want to find out P(heads), the probability that if you flip it it lands as heads. Flip it a few times: H H T P(heads) = 2/3 Hmm... is this rigorous? Does this make sense? Intro. to Machine Learning Alireza Ghane / Greg Mori 61

65 Coin Tossing - Model Bernoulli distribution: P(heads) = µ, P(tails) = 1 - µ Assume coin flips are independent and identically distributed (i.i.d.), i.e. all are separate samples from the Bernoulli distribution Given data $\mathcal{D} = \{x_1, \ldots, x_N\}$, heads: $x_i = 1$, tails: $x_i = 0$, the likelihood of the data is: $p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n}(1 - \mu)^{1 - x_n}$ Intro. to Machine Learning Alireza Ghane / Greg Mori 64

66 Maximum Likelihood Estimation Given $\mathcal{D}$ with h heads and t tails What should µ be? Maximum Likelihood Estimation (MLE): choose the µ which maximizes the likelihood of the data: $\mu_{ML} = \arg\max_{\mu} p(\mathcal{D} \mid \mu)$ Since $\ln(\cdot)$ is monotone increasing: $\mu_{ML} = \arg\max_{\mu} \ln p(\mathcal{D} \mid \mu)$ Intro. to Machine Learning Alireza Ghane / Greg Mori 65

67 Maximum Likelihood Estimation Log-likelihood: $\ln p(\mathcal{D} \mid \mu) = \ln \prod_{n=1}^{N} \mu^{x_n}(1 - \mu)^{1 - x_n} = \sum_{n=1}^{N} \{x_n \ln \mu + (1 - x_n)\ln(1 - \mu)\}$ Take the derivative, set to 0: $\frac{d}{d\mu} \ln p(\mathcal{D} \mid \mu) = \sum_{n=1}^{N} \left\{ \frac{x_n}{\mu} - \frac{1 - x_n}{1 - \mu} \right\} = \frac{1}{\mu} h - \frac{1}{1 - \mu} t = 0 \;\Rightarrow\; \mu = \frac{h}{t + h}$ Intro. to Machine Learning Alireza Ghane / Greg Mori 66

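A tiny sketch (mine) of this estimator on the H H T example from the earlier slide: with h heads and t tails the MLE is just the empirical frequency h / (h + t).

import numpy as np

flips = np.array([1, 1, 0])          # H H T encoded as 1/0
h, t = flips.sum(), len(flips) - flips.sum()
mu_ml = h / (h + t)                  # maximizes the Bernoulli likelihood
print(mu_ml)                         # 0.666..., as on the earlier slide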

72 Bayesian Learning Wait, does this make sense? What if I flip 1 time, heads? Do I believe µ = 1? Learn µ the Bayesian way: $P(\mu \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mu)\,P(\mu)}{P(\mathcal{D})}$, i.e. $\underbrace{P(\mu \mid \mathcal{D})}_{\text{posterior}} \propto \underbrace{P(\mathcal{D} \mid \mu)}_{\text{likelihood}}\,\underbrace{P(\mu)}_{\text{prior}}$ Prior encodes knowledge that most coins are close to 50-50 Conjugate prior makes the math simpler, easy interpretation For Bernoulli, the beta distribution is its conjugate Intro. to Machine Learning Alireza Ghane / Greg Mori 71

75 Beta Distribution We will use the Beta distribution to express our prior knowledge about coins: $\mathrm{Beta}(\mu \mid a, b) = \underbrace{\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}}_{\text{normalization}} \mu^{a-1}(1 - \mu)^{b-1}$ Parameters a and b control the shape of this distribution Intro. to Machine Learning Alireza Ghane / Greg Mori 74
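A quick sketch (mine) of evaluating this prior using the standard library's gamma function:

from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) with the Gamma-function normalization from the slide."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * mu ** (a - 1) * (1 - mu) ** (b - 1)

# a = b = 2 expresses a mild preference for fair coins (density peaks at mu = 0.5).
print(beta_pdf(0.5, 2, 2))           # 1.5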

76 Posterior $P(\mu \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mu)\,P(\mu) \propto \underbrace{\prod_{n=1}^{N} \mu^{x_n}(1 - \mu)^{1 - x_n}}_{\text{likelihood}}\,\underbrace{\mu^{a-1}(1 - \mu)^{b-1}}_{\text{prior}} = \mu^{h}(1 - \mu)^{t}\,\mu^{a-1}(1 - \mu)^{b-1} = \mu^{h+a-1}(1 - \mu)^{t+b-1}$ Simple form for the posterior is due to the use of a conjugate prior Parameters a and b act as extra observations Note that as N = h + t → ∞, the prior is ignored Intro. to Machine Learning Alireza Ghane / Greg Mori 75

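Since the prior is conjugate, the posterior is again a Beta distribution: observing h heads and t tails turns Beta(µ | a, b) into Beta(µ | a + h, b + t). A minimal sketch (mine):

def beta_posterior(a, b, h, t):
    """Conjugate update: prior Beta(a, b) + data (h heads, t tails) -> Beta(a + h, b + t)."""
    return a + h, b + t

a_post, b_post = beta_posterior(a=2, b=2, h=1, t=0)   # one flip, heads
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)                 # 3 2 0.6  (not 1.0, as the MLE would say)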

79 Maximum A Posteriori Given the posterior $P(\mu \mid \mathcal{D})$ we could compute a single value, known as the Maximum a Posteriori (MAP) estimate for µ: $\mu_{MAP} = \arg\max_{\mu} P(\mu \mid \mathcal{D})$ Known as point estimation However, the correct Bayesian thing to do is to use the full distribution over µ, i.e. compute $E_{\mu}[f] = \int p(\mu \mid \mathcal{D})\, f(\mu)\, d\mu$ This integral is usually hard to compute Intro. to Machine Learning Alireza Ghane / Greg Mori 78

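For completeness, a sketch (mine) of the MAP point estimate: the mode of the Beta(a + h, b + t) posterior is (h + a - 1) / (h + t + a + b - 2), provided both parameters exceed 1.

def mu_map(a, b, h, t):
    """Mode of the Beta(a + h, b + t) posterior (assumes a + h > 1 and b + t > 1)."""
    return (h + a - 1) / (h + t + a + b - 2)

print(mu_map(a=2, b=2, h=1, t=0))    # 0.666..., pulled toward the prior rather than the MLE's 1.0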

82 Polynomial Curve Fitting: What We Did What form is $y(x)$? Let's try polynomials of degree M: $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M$. This is the hypothesis space. How do we measure success? Sum of squared errors: $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2$ Among functions in the class, choose that which minimizes this error Intro. to Machine Learning Alireza Ghane 81

83 Curve Fitting: Probabilistic Approach (Figure: the Gaussian conditional $p(t \mid x_0, \mathbf{w}, \beta)$ centred at $y(x_0, \mathbf{w})$ with width $2\sigma$.) $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \beta^{-1})$ $\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\underbrace{\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2}_{\beta E(\mathbf{w})} + \underbrace{\frac{N}{2}\ln\beta}_{\text{const. in } \mathbf{w}} - \underbrace{\frac{N}{2}\ln(2\pi)}_{\text{const.}}$ Maximize log-likelihood ⇔ Minimize $E(\mathbf{w})$. Can optimize for β as well. Intro. to Machine Learning Alireza Ghane 82
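A sketch (mine) of the additional step of optimizing β: setting the β-derivative of the log-likelihood to zero gives 1/β_ML = (1/N) Σ_n {y(x_n, w_ML) - t_n}², i.e. the mean squared residual of the maximum-likelihood (least-squares) fit.

import numpy as np

def beta_ml(x, t, w):
    """ML estimate of the noise precision beta for a polynomial y(x, w)."""
    Phi = np.vander(x, len(w), increasing=True)
    residuals = Phi @ w - t
    return len(x) / np.sum(residuals ** 2)     # 1/beta = mean squared residual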

86 Curve Fitting: Bayesian Approach $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \beta^{-1})$ Posterior distribution: $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$ Minimize: $\underbrace{\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2}_{\beta E(\mathbf{w})} + \underbrace{\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}}_{\text{regularization}}$ Intro. to Machine Learning Alireza Ghane 85

89 Curve Fitting: Bayesian $p(t \mid x_0, \mathbf{w}, \beta, \alpha) = \mathcal{N}(t \mid P_{\mathbf{w}}(x_0), Q_{\mathbf{w},\beta,\alpha}(x_0))$; marginalizing over $\mathbf{w}$ gives the predictive distribution $p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}(t \mid m(x), s^2(x))$ with $m(x) = \beta\,\phi(x)^T \mathbf{S} \sum_{n=1}^{N} \phi(x_n) t_n$, $s^2(x) = \beta^{-1} + \phi(x)^T \mathbf{S}\,\phi(x)$, $\mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N} \phi(x_n)\phi(x_n)^T$ Intro. to Machine Learning Alireza Ghane 88
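A sketch (mine) of these predictive equations for the polynomial basis φ(x) = (1, x, ..., x^M)^T, directly following the formulas above:

import numpy as np

def bayesian_predict(x_train, t_train, x_new, M, alpha, beta):
    """Predictive mean m(x) and variance s^2(x) for a Bayesian polynomial fit."""
    Phi = np.vander(x_train, M + 1, increasing=True)        # rows are phi(x_n)^T
    S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)
    phi = np.vander(np.atleast_1d(x_new), M + 1, increasing=True)
    m = beta * phi @ S @ Phi.T @ t_train                     # m(x) = beta * phi(x)^T S sum_n phi(x_n) t_n
    s2 = 1.0 / beta + np.sum(phi @ S * phi, axis=1)          # s^2(x) = 1/beta + phi(x)^T S phi(x)
    return m, s2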

92 Conclusion Readings: Chapter 1.1, 1.3, 1.5, 2.1 Types of learning problems Supervised: regression, classification Unsupervised Learning as optimization Squared error loss function Maximum likelihood (ML) Maximum a posteriori (MAP) Want generalization, avoid over-fitting Cross-validation Regularization Bayesian prior on model parameters Intro. to Machine Learning Alireza Ghane 91

Non-parametric Methods

Non-parametric Methods Non-parametric Methods Machine Learning Alireza Ghane Non-Parametric Methods Alireza Ghane / Torsten Möller 1 Outline Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection

More information

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation Machine Learning CMPT 726 Simon Fraser University Binomial Parameter Estimation Outline Maximum Likelihood Estimation Smoothed Frequencies, Laplace Correction. Bayesian Approach. Conjugate Prior. Uniform

More information

Linear Models for Regression

Linear Models for Regression Linear Models for Regression Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 3 of Pattern Recognition and Machine Learning by Bishop Chapter 3+5+6+7 of The Elements of Statistical Learning

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Non-parametric Methods

Non-parametric Methods Non-parametric Methods Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 2 of Pattern Recognition and Machine Learning by Bishop (with an emphasis on section 2.5) Möller/Mori 2 Outline Last

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Nicholas Ruozzi University of Texas at Dallas Slides adapted from David Sontag and Vibhav Gogate Course Info. Instructor: Nicholas Ruozzi Office: ECSS 3.409 Office hours: Tues.

More information

Machine Learning CSE546

Machine Learning CSE546 http://www.cs.washington.edu/education/courses/cse546/17au/ Machine Learning CSE546 Kevin Jamieson University of Washington September 28, 2017 1 You may also like ML uses past data to make personalized

More information

CS 540: Machine Learning Lecture 1: Introduction

CS 540: Machine Learning Lecture 1: Introduction CS 540: Machine Learning Lecture 1: Introduction AD January 2008 AD () January 2008 1 / 41 Acknowledgments Thanks to Nando de Freitas Kevin Murphy AD () January 2008 2 / 41 Administrivia & Announcement

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Bayesian Networks Inference with Probabilistic Graphical Models

Bayesian Networks Inference with Probabilistic Graphical Models 4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

Introduction. Le Song. Machine Learning I CSE 6740, Fall 2013

Introduction. Le Song. Machine Learning I CSE 6740, Fall 2013 Introduction Le Song Machine Learning I CSE 6740, Fall 2013 What is Machine Learning (ML) Study of algorithms that improve their performance at some task with experience 2 Common to Industrial scale problems

More information

An Introduction to Statistical and Probabilistic Linear Models

An Introduction to Statistical and Probabilistic Linear Models An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Point Estimation. Vibhav Gogate The University of Texas at Dallas

Point Estimation. Vibhav Gogate The University of Texas at Dallas Point Estimation Vibhav Gogate The University of Texas at Dallas Some slides courtesy of Carlos Guestrin, Chris Bishop, Dan Weld and Luke Zettlemoyer. Basics: Expectation and Variance Binary Variables

More information

ECE-271B. Nuno Vasconcelos ECE Department, UCSD

ECE-271B. Nuno Vasconcelos ECE Department, UCSD ECE-271B Statistical ti ti Learning II Nuno Vasconcelos ECE Department, UCSD The course the course is a graduate level course in statistical learning in SLI we covered the foundations of Bayesian or generative

More information

Machine Learning and Deep Learning! Vincent Lepetit!

Machine Learning and Deep Learning! Vincent Lepetit! Machine Learning and Deep Learning!! Vincent Lepetit! 1! What is Machine Learning?! 2! Hand-Written Digit Recognition! 2 9 3! Hand-Written Digit Recognition! Formalization! 0 1 x = @ A Images are 28x28

More information

Probabilistic Graphical Models for Image Analysis - Lecture 1

Probabilistic Graphical Models for Image Analysis - Lecture 1 Probabilistic Graphical Models for Image Analysis - Lecture 1 Alexey Gronskiy, Stefan Bauer 21 September 2018 Max Planck ETH Center for Learning Systems Overview 1. Motivation - Why Graphical Models 2.

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Final Exam, Spring 2006

Final Exam, Spring 2006 070 Final Exam, Spring 2006. Write your name and your email address below. Name: Andrew account: 2. There should be 22 numbered pages in this exam (including this cover sheet). 3. You may use any and all

More information

CS Machine Learning Qualifying Exam

CS Machine Learning Qualifying Exam CS Machine Learning Qualifying Exam Georgia Institute of Technology March 30, 2017 The exam is divided into four areas: Core, Statistical Methods and Models, Learning Theory, and Decision Processes. There

More information

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use?

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use? Today Statistical Learning Parameter Estimation: Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Continuous case Learning Parameters for a Bayesian Network Naive Bayes Maximum Likelihood estimates

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

MLE/MAP + Naïve Bayes

MLE/MAP + Naïve Bayes 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of

More information

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013 Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Machine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables?

Machine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables? Linear Regression Machine Learning CSE546 Sham Kakade University of Washington Oct 4, 2016 1 What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables?

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables? Linear Regression Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2014 1 What about continuous variables? n Billionaire says: If I am measuring a continuous variable, what

More information

COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference

COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 143 Part IV

More information

Introduction to Bayesian Learning. Machine Learning Fall 2018

Introduction to Bayesian Learning. Machine Learning Fall 2018 Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann Machine Learning! in just a few minutes Jan Peters Gerhard Neumann 1 Purpose of this Lecture Foundations of machine learning tools for robotics We focus on regression methods and general principles Often

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics October 17, 2017 CS 361: Probability & Statistics Inference Maximum likelihood: drawbacks A couple of things might trip up max likelihood estimation: 1) Finding the maximum of some functions can be quite

More information

Statistical Models. David M. Blei Columbia University. October 14, 2014

Statistical Models. David M. Blei Columbia University. October 14, 2014 Statistical Models David M. Blei Columbia University October 14, 2014 We have discussed graphical models. Graphical models are a formalism for representing families of probability distributions. They are

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released

More information

Machine Learning. Instructor: Pranjal Awasthi

Machine Learning. Instructor: Pranjal Awasthi Machine Learning Instructor: Pranjal Awasthi Course Info Requested an SPN and emailed me Wait for Carol Difrancesco to give them out. Not registered and need SPN Email me after class No promises It s a

More information

Machine Learning, Midterm Exam

Machine Learning, Midterm Exam 10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

More information

Overview c 1 What is? 2 Definition Outlines 3 Examples of 4 Related Fields Overview Linear Regression Linear Classification Neural Networks Kernel Met

Overview c 1 What is? 2 Definition Outlines 3 Examples of 4 Related Fields Overview Linear Regression Linear Classification Neural Networks Kernel Met c Outlines Statistical Group and College of Engineering and Computer Science Overview Linear Regression Linear Classification Neural Networks Kernel Methods and SVM Mixture Models and EM Resources More

More information

Machine Learning Lecture 1

Machine Learning Lecture 1 Many slides adapted from B. Schiele Machine Learning Lecture 1 Introduction 18.04.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de Organization Lecturer Prof.

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

More information

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming

More information

Machine Learning using Bayesian Approaches

Machine Learning using Bayesian Approaches Machine Learning using Bayesian Approaches Sargur N. Srihari University at Buffalo, State University of New York 1 Outline 1. Progress in ML and PR 2. Fully Bayesian Approach 1. Probability theory Bayes

More information
