Introduction to Machine Learning
Bishop PRML Ch. 1. Alireza Ghane.
Outline
- Course Info.: People, References, Resources
- Machine Learning: What, Why, and How?
- Curve Fitting: (e.g.) Regression and Model Selection
- Decision Theory: ML, Loss Function, MAP
- Probability Theory: (e.g.) Probabilities and Parameter Estimation
- Conclusion
Intro. to Machine Learning, Alireza Ghane / Torsten Möller
Course Info.
Dr. Torsten Möller
Home:
Discussions:
Registration
- Course max participants: 25
- Course Registered: 61
- Number of Seats: 48
- Excess: 13
Sign your name on the sheet. If you miss the first two sessions, you will be automatically SIGNED OFF the course!
References
Main Textbook:
- Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer.
Other Useful Resources:
- The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
- Machine Learning, Tom Mitchell.
- Pattern Classification (2nd ed.), Richard O. Duda, Peter E. Hart, and David G. Stork.
- Machine Learning, An Algorithmic Perspective, Stephen Marsland.
- The Top Ten Algorithms in Data Mining, X. Wu and V. Kumar.
- Learning from Data, Cherkassky and Mulier.
Online Courses:
- Andrew Ng:
Grading
- Assignments / Labs (50%): 5 assignments, 10% each
- Final Exam (40%)
- Class Feedback (10%)
Assignment late policy: 5 grace days for all assignments together; after the grace days, 25% penalty for each day.
Programming with:
- MATLAB (licensed)
- Octave (free)
Course Topics
We will cover techniques in the standard ML toolkit: maximum likelihood, regularization, support vector machines (SVM), Fisher's linear discriminant (LDA), boosting, principal components analysis (PCA), Markov random fields (MRF), neural networks, graphical models, belief propagation, expectation-maximization (EM), mixture models, mixtures of experts (MoE), hidden Markov models (HMM), particle filters, Markov chain Monte Carlo (MCMC), Gibbs sampling, ...
Background
Calculus: e.g. E = mc², ∂E/∂c = 2mc
Linear algebra (PRML Appendix C): A u_i = λ_i u_i; ∇_x (xᵀ a) = a
Probability (PRML Ch. 1.2): p(X) = Σ_Y p(X, Y); p(x) = ∫ p(x, y) dy; E_x[f] = ∫ p(x) f(x) dx
It will be possible to refresh, but if you've never seen these before this course will be very difficult.
What is Machine Learning (ML)?
Algorithms that automatically improve performance through experience.
Often this means: define a model by hand, and use data to fit its parameters.
Intro. to Machine Learning, Alireza Ghane / Greg Mori
Why ML?
The real world is complex: it is difficult to hand-craft solutions.
ML is the preferred framework for applications in many fields:
- Computer Vision
- Natural Language Processing, Speech Recognition
- Robotics
- ...
Hand-written Digit Recognition
(Figure: all of the misclassified MNIST test digits using the method of Belongie et al., 63 out of 10,000; the text above each digit indicates the example number followed by the true label and the assigned label. Belongie et al., PAMI 2002.)
Difficult to hand-craft rules about digits.
Hand-written Digit Recognition
Represent an input image as a vector x_i ∈ R^784.
Suppose we have a target vector t_i, e.g. t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0). This is supervised learning.
Discrete, finite label set: perhaps t_i ∈ {0, 1}^10, a classification problem.
Given a training set {(x_1, t_1), ..., (x_N, t_N)}, the learning problem is to construct a good function y(x) from these, y : R^784 → R^10.
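For concreteness, the representation above can be sketched in a few lines (an illustrative sketch, not code from the course; the helper name is my own): the image becomes a flattened pixel vector and the label a 1-of-10 target vector.

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Encode an integer class label as a 1-of-K target vector."""
    t = np.zeros(num_classes)
    t[label] = 1.0
    return t

# A 28x28 MNIST-style image flattens to a vector in R^784.
image = np.zeros((28, 28))
x = image.reshape(-1)   # x in R^784
t = one_hot(3)          # target vector for the digit "3"

print(x.shape)   # (784,)
print(t)         # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```

A classifier y(x) then maps the 784-vector to a 10-vector scored against this target.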
Face Detection
Classification problem: t_i ∈ {0, 1, 2} for non-face, frontal face, profile face. (Schneiderman and Kanade, IJCV 2002)
Spam Detection
Classification problem: t_i ∈ {0, 1} for non-spam, spam.
x_i contains counts of words, e.g. Viagra, stock, outperform, multi-bagger.
Caveat - Horses (source?)
Once upon a time there were two neighboring farmers, Jed and Ned. Each owned a horse, and the horses both liked to jump the fence between the two farms. Clearly the farmers needed some means to tell whose horse was whose. So Jed and Ned got together and agreed on a scheme for discriminating between horses. Jed would cut a small notch in one ear of his horse. Not a big, painful notch, but just big enough to be seen. Well, wouldn't you know it, the day after Jed cut the notch in his horse's ear, Ned's horse caught on the barbed wire fence and tore his ear the exact same way! Something else had to be devised, so Jed tied a big blue bow on the tail of his horse. But the next day, Jed's horse jumped the fence, ran into the field where Ned's horse was grazing, and chewed the bow right off the other horse's tail. Ate the whole bow!
Caveat - Horses (source?)
Finally, Jed suggested, and Ned concurred, that they should pick a feature that was less apt to change. Height seemed like a good feature to use. But were the heights different? Well, each farmer went and measured his horse, and do you know what? The brown horse was a full inch taller than the white one!
Moral of the story: ML provides theory and tools for setting parameters; make sure you have the right model and features. Think about your feature vector x.
Stock Price Prediction
Problems in which t_i is continuous are called regression.
E.g. t_i is stock price; x_i contains company profit, debt, cash flow, gross sales, number of spam e-mails sent, ...
Clustering Images
Only x_i is defined: unsupervised learning. E.g. x_i describes an image; find groups of similar images. (Wang et al., CVPR 2006)
Types of Learning Problems
- Supervised Learning: classification, regression
- Unsupervised Learning: density estimation; clustering (k-means, mixture models, hierarchical clustering); hidden Markov models
- Reinforcement Learning
An Example - Polynomial Curve Fitting
Suppose we are given a training set of N observations (x_1, ..., x_N) and (t_1, ..., t_N), with x_i, t_i ∈ R.
Regression problem: estimate y(x) from these data.
Polynomial Curve Fitting
What form is y(x)? Let's try polynomials of degree M:
y(x, w) = w_0 + w_1 x + w_2 x² + ... + w_M x^M
This is the hypothesis space.
How do we measure success? Sum of squared errors:
E(w) = (1/2) Σ_{n=1}^N {y(x_n, w) − t_n}²
(Figure: the error bars show the residuals t_n − y(x_n, w) at each x_n.)
Among the functions in the class, choose the one that minimizes this error.
Polynomial Curve Fitting
Error function: E(w) = (1/2) Σ_{n=1}^N {y(x_n, w) − t_n}²
Best coefficients: w* = arg min_w E(w)
Found using the pseudo-inverse (more later).
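The pseudo-inverse solution can be sketched directly (an illustrative sketch; function names are my own): build the design matrix of powers of x, then w* = Φ⁺t minimizes E(w).

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of a degree-M polynomial via the pseudo-inverse."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix: Phi[n, j] = x_n^j
    return np.linalg.pinv(Phi) @ t               # w* = Phi^+ t minimizes E(w)

def predict(x, w):
    return np.vander(x, len(w), increasing=True) @ w

# Noise-free quadratic data: the fit should recover the generating coefficients.
x = np.linspace(0.0, 1.0, 20)
t = 1.0 + 2.0 * x - 3.0 * x ** 2
w = fit_polynomial(x, t, M=2)
print(np.round(w, 6))   # approximately [ 1.  2. -3.]
```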
Which Degree of Polynomial?
A model selection problem. With M = 9, E(w*) = 0: this is over-fitting.
Generalization
(Figure: training and test error curves.)
Generalization is the holy grail of ML: we want good performance for new data.
Measure generalization using a separate test set.
Use the root-mean-squared (RMS) error: E_RMS = √(2E(w*)/N)
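The two error measures follow directly from the formulas above (a small illustrative sketch; the function names are my own):

```python
import numpy as np

def sum_squared_error(y, t):
    """E(w) = 1/2 * sum_n (y_n - t_n)^2, where y_n = y(x_n, w)."""
    return 0.5 * np.sum((y - t) ** 2)

def rms_error(y, t):
    """E_RMS = sqrt(2 E / N): comparable across data sets of different size."""
    return np.sqrt(2.0 * sum_squared_error(y, t) / len(t))

y = np.array([1.0, 2.0, 3.0])   # predictions
t = np.array([1.0, 2.0, 5.0])   # targets
print(rms_error(y, t))          # about 1.155
```

Dividing by N (and taking the root) is what makes E_RMS a fair comparison between training and test sets of different sizes.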
Controlling Over-fitting: Regularization
As the order of polynomial M increases, so do the coefficient magnitudes.
Penalize large coefficients in the error function:
Ẽ(w) = (1/2) Σ_{n=1}^N {y(x_n, w) − t_n}² + (λ/2)‖w‖²
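Minimizing Ẽ(w) also has a closed form, obtained by adding λI to the normal equations. A sketch under toy data (illustrative; the function name and data are my own):

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Minimize 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2.
    Closed form: w = (lam I + Phi^T Phi)^(-1) Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

w_small = fit_ridge(x, t, M=9, lam=1e-6)   # nearly unregularized: large weights
w_large = fit_ridge(x, t, M=9, lam=10.0)   # heavily regularized: small weights
print(np.abs(w_small).max(), np.abs(w_large).max())
```

Increasing λ shrinks the coefficient magnitudes, which is exactly the over-fitting symptom the slide describes.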
Controlling Over-fitting: Regularization
(Figure: training and test error curves.)
Note the E_RMS for the training set: a perfect match of the model to the training set is a result of over-fitting.
Training and test error show a similar trend.
Over-fitting: Dataset Size
With more data, a more complex model (M = 9) can be fit.
Rule of thumb: 10 datapoints for each parameter.
Validation Set
- Split the training data into a training set and a validation set.
- Train different models (e.g. different-order polynomials) on the training set.
- Choose the model (e.g. order of polynomial) with minimum error on the validation set.
Cross-validation
(Figure: S = 4 folds; runs 1-4 each hold out a different group for validation.)
Data are often limited. Cross-validation creates S groups of the data; use S − 1 to train, the other to validate.
Extreme case, leave-one-out cross-validation (LOO-CV): S is the number of training data points.
Cross-validation is an effective method for model selection, but can be slow: models with multiple complexity parameters require an exponential number of runs.
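The S-fold procedure can be sketched for the polynomial-degree selection problem (an illustrative sketch, assuming the least-squares fit from the curve-fitting slides; names and toy data are my own):

```python
import numpy as np

def cross_val_rms(x, t, M, S=4):
    """Average validation RMS error over S folds for a degree-M polynomial."""
    folds = np.array_split(np.arange(len(x)), S)
    errs = []
    for s in range(S):
        val = folds[s]
        train = np.concatenate([folds[j] for j in range(S) if j != s])
        Phi = np.vander(x[train], M + 1, increasing=True)
        w = np.linalg.pinv(Phi) @ t[train]                    # fit on S-1 folds
        pred = np.vander(x[val], M + 1, increasing=True) @ w  # validate on the rest
        errs.append(np.sqrt(np.mean((pred - t[val]) ** 2)))
    return float(np.mean(errs))

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 24)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(24)

# Model selection: pick the degree with the lowest cross-validation error.
scores = {M: cross_val_rms(x, t, M) for M in range(10)}
best_M = min(scores, key=scores.get)
print(best_M)
```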
Summary
- Want models that generalize to new data.
- Train the model on the training set.
- Measure performance on a held-out test set.
- Performance on the test set is a good estimate of performance on new data.
Summary - Model Selection
Which model to use? E.g. which degree polynomial?
Training set error is lower with a more complex model, so we can't just choose the model with the lowest training error.
Peeking at the test error is unfair, e.g. picking the polynomial with the lowest test error: performance on the test set is then no longer a good estimate of performance on new data.
Summary - Solutions I
Use a validation set:
- Train models on the training set, e.g. different degree polynomials.
- Measure performance on the held-out validation set.
- Measure performance of the chosen model on the held-out test set.
Can use cross-validation on the training set instead of a separate validation set if there is little data and lots of time:
- Choose the model with the lowest error over all cross-validation folds (e.g. polynomial degree).
- Retrain that model using all the training data (e.g. polynomial coefficients).
Summary - Solutions II
Use regularization:
- Train a complex model (e.g. a high-order polynomial) but penalize it for being too complex (e.g. large weight magnitudes).
- Need to balance error vs. regularization (λ): choose λ using cross-validation.
Get more data.
Decision Theory
For a sample x, decide which class (C_k) it is from. Ideas:
- Maximum Likelihood
- Minimum Loss/Cost (e.g. misclassification rate)
- Maximum A Posteriori (MAP)
Decision: Maximum Likelihood
Inference step: determine statistics from the training data: p(x, t) or p(x | C_k).
Decision step: determine the optimal t for a test input x by maximizing the likelihood:
t* = arg max_k p(x | C_k)
Decision: Minimum Misclassification Rate
p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx
In general: p(mistake) = Σ_k Σ_{j ≠ k} ∫_{R_j} p(x, C_k) dx
(Figure: joint densities p(x, C_1) and p(x, C_2); x̂ is the decision boundary, x_0 the optimal decision boundary.)
x_0 : arg min p(mistake) over the choice of regions R_1, R_2.
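For a concrete 1-D case (a hypothetical example, not from the slides: two classes with equal priors and unit-variance Gaussian class-conditionals), assigning each x to the class with the larger joint density minimizes p(mistake), and with equal variances the optimal boundary x_0 lies midway between the means:

```python
import numpy as np

def gauss(x, mu, sigma):
    """Normal density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical two-class problem: equal priors p(C1) = p(C2) = 0.5,
# class-conditionals N(0, 1) and N(4, 1).
xs = np.linspace(-5.0, 10.0, 20001)
joint1 = 0.5 * gauss(xs, 0.0, 1.0)   # p(x, C1)
joint2 = 0.5 * gauss(xs, 4.0, 1.0)   # p(x, C2)

# Choose the class with the larger joint density at each x; the boundary
# is the first point where class 2 overtakes class 1.
x0 = xs[np.argmax(joint1 < joint2)]
print(round(x0, 2))   # 2.0
```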
Decision: Minimum Loss/Cost
Misclassification rate: R* : arg min_{R_i, i ∈ {1,...,K}} Σ_j Σ_k L(R_j, C_k)
Weighted loss/cost function: R* : arg min_{R_i, i ∈ {1,...,K}} Σ_j Σ_k W_{j,k} L(R_j, C_k)
This is useful when:
- the populations of the classes are different
- the failure cost is non-symmetric
Decision: Maximum A Posteriori (MAP)
Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p(C_k | x) ∝ p(x | C_k) · p(C_k), i.e. posterior ∝ likelihood × prior.
Provides an a posteriori belief for the estimation, rather than a single point estimate, and can utilize a priori information in the decision.
Coin Tossing
Let's say you're given a coin, and you want to find out P(heads), the probability that if you flip it, it lands as heads.
Flip it a few times: H H T. So P(heads) = 2/3.
Hmm... is this rigorous? Does this make sense?
Coin Tossing - Model
Bernoulli distribution: P(heads) = µ, P(tails) = 1 − µ.
Assume coin flips are independent and identically distributed (i.i.d.), i.e. all are separate samples from the Bernoulli distribution.
Given data D = {x_1, ..., x_N} with heads: x_i = 1 and tails: x_i = 0, the likelihood of the data is:
p(D | µ) = Π_{n=1}^N p(x_n | µ) = Π_{n=1}^N µ^{x_n} (1 − µ)^{1 − x_n}
Maximum Likelihood Estimation
Given D with h heads and t tails, what should µ be?
Maximum Likelihood Estimation (MLE): choose the µ which maximizes the likelihood of the data:
µ_ML = arg max_µ p(D | µ)
Since ln(·) is monotone increasing:
µ_ML = arg max_µ ln p(D | µ)
Maximum Likelihood Estimation
Log-likelihood:
ln p(D | µ) = Σ_{n=1}^N [ x_n ln µ + (1 − x_n) ln(1 − µ) ]
Take the derivative and set it to 0:
(d/dµ) ln p(D | µ) = Σ_{n=1}^N [ x_n/µ − (1 − x_n)/(1 − µ) ] = (1/µ) h − (1/(1 − µ)) t = 0
⇒ µ_ML = h / (t + h)
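The derivation above can be checked numerically on the H H T example (an illustrative sketch; the grid search is mine, not from the slides): the closed form h/(h+t) and a brute-force maximization of the log-likelihood agree.

```python
import numpy as np

# Coin-flip data: 1 = heads, 0 = tails (the H H T example gives h=2, t=1).
x = np.array([1, 1, 0])
h, t = x.sum(), len(x) - x.sum()

# Closed form: mu_ML = h / (h + t)
mu_ml = h / (h + t)

# Numerical check: the log-likelihood peaks at the same value.
mus = np.linspace(0.001, 0.999, 999)
loglik = h * np.log(mus) + t * np.log(1 - mus)
mu_grid = mus[np.argmax(loglik)]

print(mu_ml, round(mu_grid, 3))
```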
Bayesian Learning
Wait, does this make sense? What if I flip 1 time and get heads? Do I believe µ = 1?
Learn µ the Bayesian way:
P(µ | D) = P(D | µ) P(µ) / P(D), i.e. posterior ∝ likelihood × prior.
The prior encodes knowledge about what most coins are like. A conjugate prior makes the math simpler and has an easy interpretation: for the Bernoulli, the beta distribution is its conjugate.
Beta Distribution
We will use the beta distribution to express our prior knowledge about coins:
Beta(µ | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] µ^{a−1} (1 − µ)^{b−1}
where the ratio of gamma functions is the normalization. Parameters a and b control the shape of this distribution.
Posterior
P(µ | D) ∝ P(D | µ) P(µ)
∝ [ Π_{n=1}^N µ^{x_n} (1 − µ)^{1 − x_n} ] · µ^{a−1} (1 − µ)^{b−1}
∝ µ^h (1 − µ)^t · µ^{a−1} (1 − µ)^{b−1}
= µ^{h+a−1} (1 − µ)^{t+b−1}
The simple form of the posterior is due to the use of a conjugate prior; parameters a and b act as extra observations. Note that as N = h + t grows, the prior is ignored.
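The conjugate update is just arithmetic on the exponents: a Beta(a, b) prior plus h heads and t tails gives a Beta(a+h, b+t) posterior. A sketch (illustrative; the density helper and toy numbers are my own):

```python
import numpy as np
from math import lgamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) density, normalized via log-gamma for stability."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(mu) + (b - 1) * np.log(1 - mu))

# Prior Beta(a, b); observing h heads and t tails gives posterior Beta(a+h, b+t).
a, b = 2.0, 2.0
h, t = 1, 0   # a single flip, heads

mus = np.linspace(0.001, 0.999, 999)
posterior = beta_pdf(mus, a + h, b + t)

# Unlike the MLE (which would say mu = 1 after one heads), the prior pulls
# the posterior mode back toward 1/2.
mu_mode = mus[np.argmax(posterior)]
print(round(mu_mode, 3))
```

Here the single heads moves the mode only from 1/2 to 2/3, illustrating how a and b act as extra observations.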
Maximum A Posteriori
Given the posterior P(µ | D) we could compute a single value, known as the maximum a posteriori (MAP) estimate, for µ:
µ_MAP = arg max_µ P(µ | D)
This is known as point estimation. However, the correct Bayesian thing to do is to use the full distribution over µ, i.e. compute
E_µ[f] = ∫ p(µ | D) f(µ) dµ
This integral is usually hard to compute.
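For the Beta posterior of the previous slides the MAP point estimate has a closed form (the mode of Beta(h+a, t+b)), and it differs from the posterior mean; a sketch (illustrative, assuming the conjugate Beta prior with a, b > 1; function names are my own):

```python
def mu_map(h, t, a, b):
    """MAP estimate: mode of the Beta(h+a, t+b) posterior (valid for a, b > 1)."""
    return (h + a - 1.0) / (h + t + a + b - 2.0)

def mu_posterior_mean(h, t, a, b):
    """Posterior mean of Beta(h+a, t+b): a simple summary of the full distribution."""
    return (h + a) / (h + t + a + b)

h, t, a, b = 2, 1, 2.0, 2.0
print(mu_map(h, t, a, b))             # 0.6
print(mu_posterior_mean(h, t, a, b))  # ~0.571
```

The gap between the two estimates shrinks as N = h + t grows, when the likelihood dominates the prior.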
Polynomial Curve Fitting: What We Did
We chose polynomials of degree M as the hypothesis space, y(x, w) = w_0 + w_1 x + w_2 x² + ... + w_M x^M, measured success by the sum of squared errors E(w) = (1/2) Σ_{n=1}^N {y(x_n, w) − t_n}², and among the functions in the class chose the one that minimizes this error.
Curve Fitting: Probabilistic Approach
(Figure: at each x_0 the target t is modeled by a Gaussian p(t | x_0, w, β) centered at y(x_0, w) with width 2σ.)
p(t | x, w, β) = Π_{n=1}^N N(t_n | y(x_n, w), β^{−1})
ln p(t | x, w, β) = −(β/2) Σ_{n=1}^N {y(x_n, w) − t_n}² + (N/2) ln β − (N/2) ln(2π)
The first term is −βE(w); the remaining terms are constant in w.
So maximizing the log-likelihood is equivalent to minimizing E(w). Can optimize for β as well.
Curve Fitting: Bayesian Approach
p(t | x, w, β) = Π_{n=1}^N N(t_n | y(x_n, w), β^{−1})
Posterior distribution: p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α)
Maximizing the posterior amounts to minimizing
(β/2) Σ_{n=1}^N {y(x_n, w) − t_n}² + (α/2) wᵀw
i.e. βE(w) plus a regularization term.
Curve Fitting: Bayesian
p(t | x_0, w, β, α) = N(t | P_w(x_0), Q_{w,β,α}(x_0))
Predictive distribution: p(t | x, x, t) = N(t | m(x), s²(x)), with
m(x) = β φ(x)ᵀ S Σ_{n=1}^N φ(x_n) t_n
s²(x) = β^{−1} + φ(x)ᵀ S φ(x)
S^{−1} = α I + β Σ_{n=1}^N φ(x_n) φ(x_n)ᵀ
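The predictive mean and variance above translate almost line for line into code. A sketch (illustrative; function names, toy data, and hyperparameter values α, β are my own assumptions, with polynomial basis φ(x) = (1, x, ..., x^M)):

```python
import numpy as np

def bayes_predict(x_train, t_train, x_test, M, alpha, beta):
    """Predictive mean m(x) and variance s^2(x) for Bayesian polynomial
    regression, following the formulas on the slide."""
    def phi(x):
        return np.vander(np.atleast_1d(x), M + 1, increasing=True)

    Phi = phi(x_train)                                   # N x (M+1)
    S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi   # S^{-1}
    S = np.linalg.inv(S_inv)
    Pt = phi(x_test)
    m = beta * Pt @ S @ Phi.T @ t_train                  # m(x)
    s2 = 1.0 / beta + np.sum(Pt @ S * Pt, axis=1)        # s^2(x), per test point
    return m, s2

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 15)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(15)

m, s2 = bayes_predict(x, t, np.array([0.25, 0.5, 0.75]), M=5, alpha=5e-3, beta=25.0)
print(m)
print(np.sqrt(s2))
```

Note that s²(x) is always at least the noise variance β⁻¹; the extra φ(x)ᵀSφ(x) term reflects remaining uncertainty about w.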
Conclusion
Readings: Chapter 1.1, 1.3, 1.5, 2.1.
- Types of learning problems: supervised (regression, classification), unsupervised.
- Learning as optimization: squared error loss function, maximum likelihood (ML), maximum a posteriori (MAP).
- Want generalization, avoid over-fitting: cross-validation, regularization, Bayesian prior on model parameters.