Introduction to Machine Learning

1 Introduction to Machine Learning Bishop PRML Ch. 1 Alireza Ghane

2 Outline Course Info.: People, References, Resources Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection Decision Theory: ML, Loss Function, MAP Probability Theory: (e.g.) Probabilities and Parameter Estimation Conclusion Intro. to Machine Learning Alireza Ghane / Torsten Möller 1

4 Course Info. Dr. Torsten Möller Home: Discussions: Intro. to Machine Learning Alireza Ghane / Torsten Möller 3

5 Registration Course max participants: 25 Course Registered: 61 Number of Seats: 48 Excess: 13 Sign your name on the sheet If you miss the first two sessions, you will be automatically SIGNED OFF the course! Intro. to Machine Learning Alireza Ghane / Torsten Möller 4

7 References Main Textbook: Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer Other Useful Resources: The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Machine Learning, Tom Mitchell. Pattern Classification (2nd ed.), Richard O. Duda, Peter E. Hart, and David G. Stork. Machine Learning, An Algorithmic Perspective, Stephen Marsland. The Top Ten Algorithms in Data Mining, X. Wu, V. Kumar. Learning from Data, Cherkassky-Mulier. Online Courses: Andrew Ng: Intro. to Machine Learning Alireza Ghane / Torsten Möller 6

10 Grading: Assignments / Labs (50%): 5 assignments, 10% each. Final Exam (40%). Class Feedback (10%). Assignment late policy: 5 grace days for all assignments together; after the grace days, 25% penalty for each day. Programming with: MATLAB (licensed) or Octave (free). Intro. to Machine Learning Alireza Ghane / Torsten Möller 9

13 Course Topics We will cover techniques in the standard ML toolkit: maximum likelihood, regularization, support vector machines (SVM), Fisher's linear discriminant (LDA), boosting, principal components analysis (PCA), Markov random fields (MRF), neural networks, graphical models, belief propagation, expectation-maximization (EM), mixture models, mixtures of experts (MoE), hidden Markov models (HMM), particle filters, Markov Chain Monte Carlo (MCMC), Gibbs sampling,... Intro. to Machine Learning Alireza Ghane / Torsten Möller 12

14 Background Calculus: $E = mc^2 \Rightarrow \frac{\partial E}{\partial c} = 2mc$; $\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{a}) = \mathbf{a}$. Linear algebra (PRML Appendix C): $A\mathbf{u}_i = \lambda_i \mathbf{u}_i$. Probability (PRML Ch. 1.2): $p(X) = \sum_Y p(X, Y)$; $p(x) = \int p(x, y)\,dy$; $E_x[f] = \int p(x) f(x)\,dx$. It will be possible to refresh, but if you've never seen these before this course will be very difficult. Intro. to Machine Learning Alireza Ghane / Torsten Möller 13

15 Outline Course Info.: People, References, Resources Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection Decision Theory: ML, Loss Function, MAP Probability Theory: (e.g.) Probabilities and Parameter Estimation Conclusion Intro. to Machine Learning Alireza Ghane / Greg Mori 14

16 What is Machine Learning (ML)? Algorithms that automatically improve performance through experience Often this means define a model by hand, and use data to fit its parameters Intro. to Machine Learning Alireza Ghane / Greg Mori 15

17 Why ML? The real world is complex: it is difficult to hand-craft solutions. ML is the preferred framework for applications in many fields: Computer Vision Natural Language Processing, Speech Recognition Robotics... Intro. to Machine Learning Alireza Ghane / Greg Mori 16

18 Hand-written Digit Recognition Belongie et al., PAMI 2002. (Figure: all of the misclassified MNIST test digits using their method (63 out of 10,000); the text above each digit indicates the example number followed by the true label and the assigned label.) Difficult to hand-craft rules about digits Intro. to Machine Learning Alireza Ghane / Greg Mori 17

19 Hand-written Digit Recognition Represent an input image as a vector $x_i \in \mathbb{R}^{784}$. Suppose we have a target vector $t_i$, e.g. $t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)$. This is supervised learning. Discrete, finite label set: perhaps $t_i \in \{0, 1\}^{10}$, a classification problem. Given a training set $\{(x_1, t_1), \ldots, (x_N, t_N)\}$, the learning problem is to construct a good function $y(x)$ from these, $y : \mathbb{R}^{784} \to \mathbb{R}^{10}$. Intro. to Machine Learning Alireza Ghane / Greg Mori 18
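To make this representation concrete, here is a small illustrative sketch (mine, not from the slides; the image below is a random stand-in, not real MNIST data) of turning an image into the vector x_i and its label into a 1-of-10 target t_i:

import numpy as np

# Hypothetical stand-in for a 28x28 grayscale MNIST image of the digit "3".
image = np.random.rand(28, 28)
label = 3

x = image.reshape(784)     # input vector x_i in R^784
t = np.zeros(10)           # 1-of-10 ("one-hot") target vector t_i
t[label] = 1.0

print(x.shape, t)          # (784,) [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]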

20 Face Detection Classification problem Schneiderman and Kanade, IJCV 2002 $t_i \in \{0, 1, 2\}$: non-face, frontal face, profile face. Intro. to Machine Learning Alireza Ghane / Greg Mori 19

21 Spam Detection Classification problem $t_i \in \{0, 1\}$: non-spam, spam $x_i$: counts of words, e.g. Viagra, stock, outperform, multi-bagger Intro. to Machine Learning Alireza Ghane / Greg Mori 20

22 Caveat - Horses (source?) Once upon a time there were two neighboring farmers, Jed and Ned. Each owned a horse, and the horses both liked to jump the fence between the two farms. Clearly the farmers needed some means to tell whose horse was whose. So Jed and Ned got together and agreed on a scheme for discriminating between horses. Jed would cut a small notch in one ear of his horse. Not a big, painful notch, but just big enough to be seen. Well, wouldn't you know it, the day after Jed cut the notch in his horse's ear, Ned's horse caught on the barbed wire fence and tore his ear the exact same way! Something else had to be devised, so Jed tied a big blue bow on the tail of his horse. But the next day, Jed's horse jumped the fence, ran into the field where Ned's horse was grazing, and chewed the bow right off the other horse's tail. Ate the whole bow! Intro. to Machine Learning Alireza Ghane / Greg Mori 21

25 Caveat - Horses (source?) Finally, Jed suggested, and Ned concurred, that they should pick a feature that was less apt to change. Height seemed like a good feature to use. But were the heights different? Well, each farmer went and measured his horse, and do you know what? The brown horse was a full inch taller than the white one! Moral of the story: ML provides theory and tools for setting parameters. Make sure you have the right model and features. Think about your feature vector x. Intro. to Machine Learning Alireza Ghane / Greg Mori 24

27 Stock Price Prediction Problems in which $t_i$ is continuous are called regression E.g. $t_i$ is stock price, $x_i$ contains company profit, debt, cash flow, gross sales, number of spam e-mails sent,... Intro. to Machine Learning Alireza Ghane / Greg Mori 26

28 Clustering Images Wang et al., CVPR 2006 Only $x_i$ is defined: unsupervised learning E.g. $x_i$ describes an image; find groups of similar images Intro. to Machine Learning Alireza Ghane / Greg Mori 27

29 Types of Learning Problems Supervised Learning Classification Regression Unsupervised Learning Density estimation Clustering: k-means, mixture models, hierarchical clustering Hidden Markov models Reinforcement Learning Intro. to Machine Learning Alireza Ghane / Greg Mori 28

30 Outline Course Info.: People, References, Resources Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection Decision Theory: ML, Loss Function, MAP Probability Theory: (e.g.) Probabilities and Parameter Estimation Conclusion Intro. to Machine Learning Alireza Ghane / Greg Mori 29

31 An Example - Polynomial Curve Fitting Suppose we are given a training set of N observations $(x_1, \ldots, x_N)$ and $(t_1, \ldots, t_N)$, $x_i, t_i \in \mathbb{R}$ Regression problem: estimate $y(x)$ from these data Intro. to Machine Learning Alireza Ghane / Greg Mori 30

32 Polynomial Curve Fitting What form is $y(x)$? Let's try polynomials of degree M: $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M$. This is the hypothesis space. How do we measure success? Sum of squared errors: $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2$ (Figure: each error bar is the displacement $t_n - y(x_n, \mathbf{w})$ at the training point $x_n$.) Among functions in the class, choose that which minimizes this error Intro. to Machine Learning Alireza Ghane / Greg Mori 31

35 Polynomial Curve Fitting Error function: $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2$ Best coefficients: $\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})$ Found using pseudo-inverse (more later) Intro. to Machine Learning Alireza Ghane / Greg Mori 34
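As a hedged illustration of how such a fit could be computed (my own NumPy sketch; the course labs use MATLAB/Octave and this is not the course code): build the design matrix whose columns are 1, x, ..., x^M and solve the least-squares problem via the pseudo-inverse.

import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of a degree-M polynomial y(x, w) = w0 + w1 x + ... + wM x^M."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix, column j is x**j
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimizes ||Phi w - t||^2 (pseudo-inverse)
    return w

def predict(x, w):
    return np.vander(x, len(w), increasing=True) @ w

def sse(x, t, w):
    """Sum-of-squares error E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    return 0.5 * np.sum((predict(x, w) - t) ** 2)

# Toy data: noisy samples of sin(2*pi*x), in the spirit of PRML Ch. 1.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
w_star = fit_polynomial(x_train, t_train, M=3)
print(w_star, sse(x_train, t_train, w_star))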

36 Which Degree of Polynomial? A model selection problem M = 9 gives $E(\mathbf{w}^*) = 0$: This is over-fitting Intro. to Machine Learning Alireza Ghane / Greg Mori 35

37 Generalization (Plot: training and test $E_{RMS}$ versus M.) Generalization is the holy grail of ML Want good performance for new data Measure generalization using a separate test set Use root-mean-squared (RMS) error: $E_{RMS} = \sqrt{2E(\mathbf{w}^*)/N}$ Intro. to Machine Learning Alireza Ghane / Greg Mori 36
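A tiny sketch (mine) of how this could be computed for a fitted coefficient vector w; comparing the training and test values of E_RMS as M grows is what reveals over-fitting.

import numpy as np

def e_rms(x, t, w):
    """Root-mean-squared error E_RMS = sqrt(2 E(w) / N) for a polynomial with coefficients w."""
    Phi = np.vander(x, len(w), increasing=True)
    E = 0.5 * np.sum((Phi @ w - t) ** 2)
    return np.sqrt(2 * E / len(x))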

38 Controlling Over-fitting: Regularization As the order of the polynomial M increases, so do the coefficient magnitudes Penalize large coefficients in the error function: $\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$ Intro. to Machine Learning Alireza Ghane / Greg Mori 37

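For illustration (my own sketch, not the course code): with the polynomial design matrix Phi, the minimizer of the regularized error above has the closed form w = (lambda I + Phi^T Phi)^{-1} Phi^T t.

import numpy as np

def fit_polynomial_regularized(x, t, M, lam):
    """Minimize 1/2 * sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)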

40 Controlling Over-fitting: Regularization Intro. to Machine Learning Alireza Ghane / Greg Mori 39

41 Controlling Over-fitting: Regularization (Plot: training and test $E_{RMS}$.) Note the $E_{RMS}$ for the training set: a perfect match between the model and the training set is a result of over-fitting Training and test error show similar trends Intro. to Machine Learning Alireza Ghane / Greg Mori 40

42 Over-fitting: Dataset size With more data, a more complex model (M = 9) can be fit Rule of thumb: 10 data points for each parameter Intro. to Machine Learning Alireza Ghane / Greg Mori 41

43 Validation Set Split training data into training set and validation set Train different models (e.g. diff. order polynomials) on training set Choose model (e.g. order of polynomial) with minimum error on validation set Intro. to Machine Learning Alireza Ghane / Greg Mori 42

44 Cross-validation (Figure: the data split into S = 4 groups; each run holds out a different group for validation.) Data are often limited Cross-validation creates S groups of data; use S - 1 to train, the other to validate Extreme case, leave-one-out cross-validation (LOO-CV): S is the number of training data points Cross-validation is an effective method for model selection, but can be slow Models with multiple complexity parameters: exponential number of runs Intro. to Machine Learning Alireza Ghane / Greg Mori 43
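An illustrative S-fold cross-validation loop for choosing the polynomial degree M (my own sketch; the fold construction and the plain least-squares fit are assumptions, not the course's code):

import numpy as np

def cross_validate_degree(x, t, degrees, S=4):
    """Pick the degree M with the lowest average validation SSE over S folds."""
    folds = np.array_split(np.random.permutation(len(x)), S)
    scores = {}
    for M in degrees:
        errs = []
        for i in range(S):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(S) if j != i])
            Phi_tr = np.vander(x[train], M + 1, increasing=True)
            w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)
            Phi_val = np.vander(x[val], M + 1, increasing=True)
            errs.append(0.5 * np.sum((Phi_val @ w - t[val]) ** 2))
        scores[M] = np.mean(errs)
    return min(scores, key=scores.get), scores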

45 Summary Want models that generalize to new data Train model on training set Measure performance on held-out test set Performance on test set is good estimate of performance on new data Intro. to Machine Learning Alireza Ghane / Greg Mori 44

46 Summary - Model Selection Which model to use? E.g. which degree polynomial? Training set error is lower with a more complex model Can't just choose the model with lowest training error Peeking at test error is unfair. E.g. picking the polynomial with lowest test error Performance on the test set is then no longer a good estimate of performance on new data Intro. to Machine Learning Alireza Ghane / Greg Mori 45

48 Summary - Solutions I Use a validation set Train models on training set. E.g. different degree polynomials Measure performance on held-out validation set Measure performance of that model on held-out test set Can use cross-validation on training set instead of a separate validation set if little data and lots of time Choose model with lowest error over all cross-validation folds (e.g. polynomial degree) Retrain that model using all training data (e.g. polynomial coefficients) Intro. to Machine Learning Alireza Ghane / Greg Mori 47

50 Summary - Solutions II Use regularization Train a complex model (e.g. high-order polynomial) but penalize being too complex (e.g. large weight magnitudes) Need to balance error vs. regularization (λ) Choose λ using cross-validation Get more data Intro. to Machine Learning Alireza Ghane / Greg Mori 49

52 Outline Course Info.: People, References, Resources Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection Decision Theory: ML, Loss Function, MAP Probability Theory: (e.g.) Probabilities and Parameter Estimation Conclusion

53 Decision Theory For a sample x, decide which class ($\mathcal{C}_k$) it is from. Ideas: Maximum Likelihood Minimum Loss/Cost (e.g. misclassification rate) Maximum A Posteriori (MAP) Intro. to Machine Learning Alireza Ghane 52

54 Decision: Maximum Likelihood Inference step: Determine statistics from training data: $p(\mathbf{x}, t)$ OR $p(\mathbf{x} \mid \mathcal{C}_k)$. Decision step: Determine the optimal $t$ for test input $\mathbf{x}$: $t = \arg\max_k \underbrace{p(\mathbf{x} \mid \mathcal{C}_k)}_{\text{likelihood}}$ Intro. to Machine Learning Alireza Ghane 53

57 Decision: Minimum Misclassification Rate $p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\,d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\,d\mathbf{x}$; in general, $p(\text{mistake}) = \sum_k \sum_{j \neq k} \int_{\mathcal{R}_j} p(\mathbf{x}, \mathcal{C}_k)\,d\mathbf{x}$ (Figure: joint densities $p(x, \mathcal{C}_1)$ and $p(x, \mathcal{C}_2)$; $\hat{x}$: decision boundary, $x_0$: optimal decision boundary, $x_0 = \arg\min p(\text{mistake})$.) Intro. to Machine Learning Alireza Ghane 56

59 Decision: Minimum Loss/Cost Misclassification rate: $\mathcal{R}^* = \arg\min_{\{\mathcal{R}_i\}_{i=1}^{K}} \sum_k \sum_j L(\mathcal{R}_j, \mathcal{C}_k)$ Weighted loss/cost function: $\mathcal{R}^* = \arg\min_{\{\mathcal{R}_i\}_{i=1}^{K}} \sum_k \sum_j W_{j,k}\, L(\mathcal{R}_j, \mathcal{C}_k)$ Is useful when: The class populations are different The failure cost is non-symmetric Intro. to Machine Learning Alireza Ghane 58
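A small sketch (mine) of the weighted-loss decision rule: with posteriors p(C_k | x) and a loss matrix W, where W[j, k] is the cost of deciding class j when the truth is class k, pick the class that minimizes the expected loss. The loss values below are made up for illustration.

import numpy as np

def decide_min_expected_loss(posteriors, W):
    """posteriors: shape (K,), p(C_k | x).  W[j, k]: loss for deciding j when truth is k."""
    expected_loss = W @ posteriors          # entry j = sum_k W[j, k] * p(C_k | x)
    return int(np.argmin(expected_loss))

# Example: missing class 1 (truth = 1, decide 0) is 100x worse than a false alarm.
W = np.array([[0.0, 100.0],
              [1.0, 0.0]])
print(decide_min_expected_loss(np.array([0.7, 0.3]), W))   # decides class 1 despite its lower posterior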

60 Decision: Maximum A Posteriori (MAP) Bayes' Theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, so $\underbrace{p(\mathcal{C}_k \mid \mathbf{x})}_{\text{posterior}} \propto \underbrace{p(\mathbf{x} \mid \mathcal{C}_k)}_{\text{likelihood}}\,\underbrace{p(\mathcal{C}_k)}_{\text{prior}}$ Provides an a posteriori belief for the estimation, rather than a single point estimate. Can utilize a priori information in the decision. Intro. to Machine Learning Alireza Ghane 59

61 Outline Course Info.: People, References, Resources Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection Decision Theory: ML, Loss Function, MAP Probability Theory: (e.g.) Probabilities and Parameter Estimation Conclusion

62 Coin Tossing Let's say you're given a coin, and you want to find out P(heads), the probability that if you flip it it lands as heads. Flip it a few times: H H T P(heads) = 2/3 Hmm... is this rigorous? Does this make sense? Intro. to Machine Learning Alireza Ghane / Greg Mori 61

65 Coin Tossing - Model Bernoulli distribution: P(heads) = µ, P(tails) = 1 - µ Assume coin flips are independent and identically distributed (i.i.d.), i.e. all are separate samples from the Bernoulli distribution Given data $\mathcal{D} = \{x_1, \ldots, x_N\}$, heads: $x_i = 1$, tails: $x_i = 0$, the likelihood of the data is: $p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n}(1 - \mu)^{1 - x_n}$ Intro. to Machine Learning Alireza Ghane / Greg Mori 64

66 Maximum Likelihood Estimation Given $\mathcal{D}$ with h heads and t tails What should µ be? Maximum Likelihood Estimation (MLE): choose the µ which maximizes the likelihood of the data: $\mu_{ML} = \arg\max_{\mu} p(\mathcal{D} \mid \mu)$ Since $\ln(\cdot)$ is monotone increasing: $\mu_{ML} = \arg\max_{\mu} \ln p(\mathcal{D} \mid \mu)$ Intro. to Machine Learning Alireza Ghane / Greg Mori 65

67 Maximum Likelihood Estimation Log-likelihood: $\ln p(\mathcal{D} \mid \mu) = \ln \prod_{n=1}^{N} \mu^{x_n}(1 - \mu)^{1 - x_n} = \sum_{n=1}^{N} \{x_n \ln \mu + (1 - x_n)\ln(1 - \mu)\}$ Take the derivative, set to 0: $\frac{d}{d\mu} \ln p(\mathcal{D} \mid \mu) = \sum_{n=1}^{N} \left\{ \frac{x_n}{\mu} - \frac{1 - x_n}{1 - \mu} \right\} = \frac{1}{\mu} h - \frac{1}{1 - \mu} t = 0 \;\Rightarrow\; \mu = \frac{h}{t + h}$ Intro. to Machine Learning Alireza Ghane / Greg Mori 66

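A tiny sketch (mine) of this estimator on the H H T example from the earlier slide: with h heads and t tails the MLE is just the empirical frequency h / (h + t).

import numpy as np

flips = np.array([1, 1, 0])          # H H T encoded as 1/0
h, t = flips.sum(), len(flips) - flips.sum()
mu_ml = h / (h + t)                  # maximizes the Bernoulli likelihood
print(mu_ml)                         # 0.666..., as on the earlier slide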

72 Bayesian Learning Wait, does this make sense? What if I flip 1 time, heads? Do I believe µ = 1? Learn µ the Bayesian way: $P(\mu \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mu)\,P(\mu)}{P(\mathcal{D})}$, i.e. $\underbrace{P(\mu \mid \mathcal{D})}_{\text{posterior}} \propto \underbrace{P(\mathcal{D} \mid \mu)}_{\text{likelihood}}\,\underbrace{P(\mu)}_{\text{prior}}$ Prior encodes knowledge that most coins are close to 50-50 Conjugate prior makes the math simpler, easy interpretation For Bernoulli, the beta distribution is its conjugate Intro. to Machine Learning Alireza Ghane / Greg Mori 71

75 Beta Distribution We will use the Beta distribution to express our prior knowledge about coins: $\mathrm{Beta}(\mu \mid a, b) = \underbrace{\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}}_{\text{normalization}} \mu^{a-1}(1 - \mu)^{b-1}$ Parameters a and b control the shape of this distribution Intro. to Machine Learning Alireza Ghane / Greg Mori 74
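A quick sketch (mine) of evaluating this prior using the standard library's gamma function:

from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) with the Gamma-function normalization from the slide."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * mu ** (a - 1) * (1 - mu) ** (b - 1)

# a = b = 2 expresses a mild preference for fair coins (density peaks at mu = 0.5).
print(beta_pdf(0.5, 2, 2))           # 1.5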

76 Posterior $P(\mu \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mu)\,P(\mu) \propto \underbrace{\prod_{n=1}^{N} \mu^{x_n}(1 - \mu)^{1 - x_n}}_{\text{likelihood}}\,\underbrace{\mu^{a-1}(1 - \mu)^{b-1}}_{\text{prior}} = \mu^{h}(1 - \mu)^{t}\,\mu^{a-1}(1 - \mu)^{b-1} = \mu^{h+a-1}(1 - \mu)^{t+b-1}$ Simple form for the posterior is due to the use of a conjugate prior Parameters a and b act as extra observations Note that as N = h + t → ∞, the prior is ignored Intro. to Machine Learning Alireza Ghane / Greg Mori 75

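Since the prior is conjugate, the posterior is again a Beta distribution: observing h heads and t tails turns Beta(µ | a, b) into Beta(µ | a + h, b + t). A minimal sketch (mine):

def beta_posterior(a, b, h, t):
    """Conjugate update: prior Beta(a, b) + data (h heads, t tails) -> Beta(a + h, b + t)."""
    return a + h, b + t

a_post, b_post = beta_posterior(a=2, b=2, h=1, t=0)   # one flip, heads
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)                 # 3 2 0.6  (not 1.0, as the MLE would say)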

79 Maximum A Posteriori Given the posterior $P(\mu \mid \mathcal{D})$ we could compute a single value, known as the Maximum a Posteriori (MAP) estimate for µ: $\mu_{MAP} = \arg\max_{\mu} P(\mu \mid \mathcal{D})$ Known as point estimation However, the correct Bayesian thing to do is to use the full distribution over µ, i.e. compute $E_{\mu}[f] = \int p(\mu \mid \mathcal{D})\, f(\mu)\, d\mu$ This integral is usually hard to compute Intro. to Machine Learning Alireza Ghane / Greg Mori 78

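For completeness, a sketch (mine) of the MAP point estimate: the mode of the Beta(a + h, b + t) posterior is (h + a - 1) / (h + t + a + b - 2), provided both parameters exceed 1.

def mu_map(a, b, h, t):
    """Mode of the Beta(a + h, b + t) posterior (assumes a + h > 1 and b + t > 1)."""
    return (h + a - 1) / (h + t + a + b - 2)

print(mu_map(a=2, b=2, h=1, t=0))    # 0.666..., pulled toward the prior rather than the MLE's 1.0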

82 Polynomial Curve Fitting: What We Did What form is $y(x)$? Let's try polynomials of degree M: $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M$. This is the hypothesis space. How do we measure success? Sum of squared errors: $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2$ Among functions in the class, choose that which minimizes this error Intro. to Machine Learning Alireza Ghane 81

83 Curve Fitting: Probabilistic Approach (Figure: the Gaussian conditional $p(t \mid x_0, \mathbf{w}, \beta)$ centred at $y(x_0, \mathbf{w})$ with width $2\sigma$.) $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \beta^{-1})$ $\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\underbrace{\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2}_{\beta E(\mathbf{w})} + \underbrace{\frac{N}{2}\ln\beta}_{\text{const. in } \mathbf{w}} - \underbrace{\frac{N}{2}\ln(2\pi)}_{\text{const.}}$ Maximize log-likelihood ⇔ Minimize $E(\mathbf{w})$. Can optimize for β as well. Intro. to Machine Learning Alireza Ghane 82
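A sketch (mine) of the additional step of optimizing β: setting the β-derivative of the log-likelihood to zero gives 1/β_ML = (1/N) Σ_n {y(x_n, w_ML) - t_n}², i.e. the mean squared residual of the maximum-likelihood (least-squares) fit.

import numpy as np

def beta_ml(x, t, w):
    """ML estimate of the noise precision beta for a polynomial y(x, w)."""
    Phi = np.vander(x, len(w), increasing=True)
    residuals = Phi @ w - t
    return len(x) / np.sum(residuals ** 2)     # 1/beta = mean squared residual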

86 Curve Fitting: Bayesian Approach $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \beta^{-1})$ Posterior distribution: $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$ Minimize: $\underbrace{\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2}_{\beta E(\mathbf{w})} + \underbrace{\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}}_{\text{regularization}}$ Intro. to Machine Learning Alireza Ghane 85

89 Curve Fitting: Bayesian $p(t \mid x_0, \mathbf{w}, \beta, \alpha) = \mathcal{N}(t \mid P_{\mathbf{w}}(x_0), Q_{\mathbf{w},\beta,\alpha}(x_0))$; marginalizing over $\mathbf{w}$ gives the predictive distribution $p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}(t \mid m(x), s^2(x))$ with $m(x) = \beta\,\phi(x)^T \mathbf{S} \sum_{n=1}^{N} \phi(x_n) t_n$, $s^2(x) = \beta^{-1} + \phi(x)^T \mathbf{S}\,\phi(x)$, $\mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N} \phi(x_n)\phi(x_n)^T$ Intro. to Machine Learning Alireza Ghane 88
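A sketch (mine) of these predictive equations for the polynomial basis φ(x) = (1, x, ..., x^M)^T, directly following the formulas above:

import numpy as np

def bayesian_predict(x_train, t_train, x_new, M, alpha, beta):
    """Predictive mean m(x) and variance s^2(x) for a Bayesian polynomial fit."""
    Phi = np.vander(x_train, M + 1, increasing=True)        # rows are phi(x_n)^T
    S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)
    phi = np.vander(np.atleast_1d(x_new), M + 1, increasing=True)
    m = beta * phi @ S @ Phi.T @ t_train                     # m(x) = beta * phi(x)^T S sum_n phi(x_n) t_n
    s2 = 1.0 / beta + np.sum(phi @ S * phi, axis=1)          # s^2(x) = 1/beta + phi(x)^T S phi(x)
    return m, s2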

92 Conclusion Readings: Chapter 1.1, 1.3, 1.5, 2.1 Types of learning problems Supervised: regression, classification Unsupervised Learning as optimization Squared error loss function Maximum likelihood (ML) Maximum a posteriori (MAP) Want generalization, avoid over-fitting Cross-validation Regularization Bayesian prior on model parameters Intro. to Machine Learning Alireza Ghane 91

Non-parametric Methods

Non-parametric Methods Non-parametric Methods Machine Learning Alireza Ghane Non-Parametric Methods Alireza Ghane / Torsten Möller 1 Outline Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection

More information

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation Machine Learning CMPT 726 Simon Fraser University Binomial Parameter Estimation Outline Maximum Likelihood Estimation Smoothed Frequencies, Laplace Correction. Bayesian Approach. Conjugate Prior. Uniform

More information

Linear Models for Regression

Linear Models for Regression Linear Models for Regression Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 3 of Pattern Recognition and Machine Learning by Bishop Chapter 3+5+6+7 of The Elements of Statistical Learning

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Non-parametric Methods

Non-parametric Methods Non-parametric Methods Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 2 of Pattern Recognition and Machine Learning by Bishop (with an emphasis on section 2.5) Möller/Mori 2 Outline Last

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Nicholas Ruozzi University of Texas at Dallas Slides adapted from David Sontag and Vibhav Gogate Course Info. Instructor: Nicholas Ruozzi Office: ECSS 3.409 Office hours: Tues.

More information

Machine Learning CSE546

Machine Learning CSE546 http://www.cs.washington.edu/education/courses/cse546/17au/ Machine Learning CSE546 Kevin Jamieson University of Washington September 28, 2017 1 You may also like ML uses past data to make personalized

More information

CS 540: Machine Learning Lecture 1: Introduction

CS 540: Machine Learning Lecture 1: Introduction CS 540: Machine Learning Lecture 1: Introduction AD January 2008 AD () January 2008 1 / 41 Acknowledgments Thanks to Nando de Freitas Kevin Murphy AD () January 2008 2 / 41 Administrivia & Announcement

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Bayesian Networks Inference with Probabilistic Graphical Models

Bayesian Networks Inference with Probabilistic Graphical Models 4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

Introduction. Le Song. Machine Learning I CSE 6740, Fall 2013

Introduction. Le Song. Machine Learning I CSE 6740, Fall 2013 Introduction Le Song Machine Learning I CSE 6740, Fall 2013 What is Machine Learning (ML) Study of algorithms that improve their performance at some task with experience 2 Common to Industrial scale problems

More information

An Introduction to Statistical and Probabilistic Linear Models

An Introduction to Statistical and Probabilistic Linear Models An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Point Estimation. Vibhav Gogate The University of Texas at Dallas

Point Estimation. Vibhav Gogate The University of Texas at Dallas Point Estimation Vibhav Gogate The University of Texas at Dallas Some slides courtesy of Carlos Guestrin, Chris Bishop, Dan Weld and Luke Zettlemoyer. Basics: Expectation and Variance Binary Variables

More information

ECE-271B. Nuno Vasconcelos ECE Department, UCSD

ECE-271B. Nuno Vasconcelos ECE Department, UCSD ECE-271B Statistical ti ti Learning II Nuno Vasconcelos ECE Department, UCSD The course the course is a graduate level course in statistical learning in SLI we covered the foundations of Bayesian or generative

More information

Machine Learning and Deep Learning! Vincent Lepetit!

Machine Learning and Deep Learning! Vincent Lepetit! Machine Learning and Deep Learning!! Vincent Lepetit! 1! What is Machine Learning?! 2! Hand-Written Digit Recognition! 2 9 3! Hand-Written Digit Recognition! Formalization! 0 1 x = @ A Images are 28x28

More information

Probabilistic Graphical Models for Image Analysis - Lecture 1

Probabilistic Graphical Models for Image Analysis - Lecture 1 Probabilistic Graphical Models for Image Analysis - Lecture 1 Alexey Gronskiy, Stefan Bauer 21 September 2018 Max Planck ETH Center for Learning Systems Overview 1. Motivation - Why Graphical Models 2.

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Final Exam, Spring 2006

Final Exam, Spring 2006 070 Final Exam, Spring 2006. Write your name and your email address below. Name: Andrew account: 2. There should be 22 numbered pages in this exam (including this cover sheet). 3. You may use any and all

More information

CS Machine Learning Qualifying Exam

CS Machine Learning Qualifying Exam CS Machine Learning Qualifying Exam Georgia Institute of Technology March 30, 2017 The exam is divided into four areas: Core, Statistical Methods and Models, Learning Theory, and Decision Processes. There

More information

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use?

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use? Today Statistical Learning Parameter Estimation: Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Continuous case Learning Parameters for a Bayesian Network Naive Bayes Maximum Likelihood estimates

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

MLE/MAP + Naïve Bayes

MLE/MAP + Naïve Bayes 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of

More information

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013 Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Machine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables?

Machine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables? Linear Regression Machine Learning CSE546 Sham Kakade University of Washington Oct 4, 2016 1 What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables?

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables? Linear Regression Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2014 1 What about continuous variables? n Billionaire says: If I am measuring a continuous variable, what

More information

COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference

COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 143 Part IV

More information

Introduction to Bayesian Learning. Machine Learning Fall 2018

Introduction to Bayesian Learning. Machine Learning Fall 2018 Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann Machine Learning! in just a few minutes Jan Peters Gerhard Neumann 1 Purpose of this Lecture Foundations of machine learning tools for robotics We focus on regression methods and general principles Often

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics October 17, 2017 CS 361: Probability & Statistics Inference Maximum likelihood: drawbacks A couple of things might trip up max likelihood estimation: 1) Finding the maximum of some functions can be quite

More information

Statistical Models. David M. Blei Columbia University. October 14, 2014

Statistical Models. David M. Blei Columbia University. October 14, 2014 Statistical Models David M. Blei Columbia University October 14, 2014 We have discussed graphical models. Graphical models are a formalism for representing families of probability distributions. They are

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released

More information

Machine Learning. Instructor: Pranjal Awasthi

Machine Learning. Instructor: Pranjal Awasthi Machine Learning Instructor: Pranjal Awasthi Course Info Requested an SPN and emailed me Wait for Carol Difrancesco to give them out. Not registered and need SPN Email me after class No promises It s a

More information

Machine Learning, Midterm Exam

Machine Learning, Midterm Exam 10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

More information

Overview c 1 What is? 2 Definition Outlines 3 Examples of 4 Related Fields Overview Linear Regression Linear Classification Neural Networks Kernel Met

Overview c 1 What is? 2 Definition Outlines 3 Examples of 4 Related Fields Overview Linear Regression Linear Classification Neural Networks Kernel Met c Outlines Statistical Group and College of Engineering and Computer Science Overview Linear Regression Linear Classification Neural Networks Kernel Methods and SVM Mixture Models and EM Resources More

More information

Machine Learning Lecture 1

Machine Learning Lecture 1 Many slides adapted from B. Schiele Machine Learning Lecture 1 Introduction 18.04.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de Organization Lecturer Prof.

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

More information

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming

More information

Machine Learning using Bayesian Approaches

Machine Learning using Bayesian Approaches Machine Learning using Bayesian Approaches Sargur N. Srihari University at Buffalo, State University of New York 1 Outline 1. Progress in ML and PR 2. Fully Bayesian Approach 1. Probability theory Bayes

More information
