Machine Learning Lecture 2


 Charity Richard
 1 years ago
 Views:
Transcription
1 Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation Bastian Leibe RWTH Aachen
2 Announcements Course webpage Slides will be made available on the webpage L2P electronic repository Exercises and supplementary materials will be posted on the L2P Please subscribe to the lecture on the Campus system! Important to get announcements and L2P access! 2
3 Announcements Exercise sheet 1 is now available on L2P Bayes decision theory Maximum Likelihood Kernel density estimation / knn Submit your results to Ishrat/Michael until evening of Work in teams (of up to 3 people) is encouraged Who is not part of an exercise team yet? 3
4 Course Outline Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Discriminative Approaches (5 weeks) Linear Discriminant Functions Support Vector Machines Ensemble Methods & Boosting Randomized Trees, Forests & Ferns Generative Models (4 weeks) Bayesian Networks Markov Random Fields 4
5 Topics of This Lecture Bayes Decision Theory Basic concepts Minimizing the misclassification rate Minimizing the expected loss Discriminant functions Probability Density Estimation General concepts Gaussian distribution Parametric Methods Maximum Likelihood approach Bayesian vs. Frequentist views on probability Bayesian Learning 5
6 Recap: Bayes Decision Theory Concepts Concept 1: Priors (a priori probabilities) pc k What we can tell about the probability before seeing the data. Example: C C 1 2 a b pc pc In general: Slide credit: Bernt Schiele pck k 1 6
7 Recap: Bayes Decision Theory Concepts Concept 2: Conditional probabilities Let x be a feature vector. p x Ck x measures/describes certain properties of the input. E.g. number of black pixels, aspect ratio, p(x C k ) describes its likelihood for class C k. p x a p x b x Slide credit: Bernt Schiele x 7
8 Bayes Decision Theory Concepts Concept 3: Posterior probabilities p Ck We are typically interested in the a posteriori probability, i.e. the probability of class C k given the measurement vector x. x Bayes Theorem: p C Interpretation k x p x C k p Ck p x Ck p Ck p x p x Ci p C Likelihood Prior Posterior Normalization Factor i i Slide credit: Bernt Schiele 8
9 Bayes Decision Theory p x a p x b Likelihood p x a p( a) x p x b p( b) x Decision boundary Likelihood Prior p a x p b x Slide credit: Bernt Schiele x Posterior = Likelihood Prior NormalizationFactor 9
10 Bayesian Decision Theory Goal: Minimize the probability of a misclassification The green and blue regions stay constant. Only the size of the red region varies! = Z R 1 p(c 2 jx)p(x)dx + Z p(c 1 jx)p(x)dx R 2 12 Image source: C.M. Bishop, 2006
11 Bayes Decision Theory Optimal decision rule Decide for C 1 if p(c 1 jx) > p(c 2 jx) This is equivalent to p(xjc 1 )p(c 1 ) > p(xjc 2 )p(c 2 ) Which is again equivalent to (LikelihoodRatio test) p(xjc 1 ) p(xjc 2 ) > p(c 2) p(c 1 ) Slide credit: Bernt Schiele Decision threshold 13
12 Generalization to More Than 2 Classes Decide for class k whenever it has the greatest posterior probability of all classes: p(c k jx) > p(c j jx) 8j 6= k p(xjc k )p(c k ) > p(xjc j )p(c j ) 8j 6= k Likelihoodratio test p(xjc k ) p(xjc j ) > p(c j) p(c k ) 8j 6= k Slide credit: Bernt Schiele 14
13 Bayes Decision Theory Decision regions: R 1, R 2, R 3, Slide credit: Bernt Schiele 15
14 Classifying with Loss Functions Generalization to decisions with a loss function Differentiate between the possible decisions and the possible true classes. Example: medical diagnosis Decisions: diagnosis is sick or healthy (or: further examination necessary) Classes: patient is sick or healthy The cost may be asymmetric: loss(decision = healthyjpatient = sick) >> loss(decision = sickjpatient = healthy) Slide credit: Bernt Schiele 16
15 Truth Classifying with Loss Functions In general, we can formalize this by introducing a loss matrix L kj L kj = loss for decision C j if truth is C k : Example: cancer diagnosis Decision L cancer diagnosis = 17
16 Classifying with Loss Functions Loss functions may be different for different actors. Example: L stocktrader (subprime) = invest don t invest µ 1 2 c gain L bank (subprime) = µ 1 2 c gain 0 0 Different loss functions may lead to different Bayes optimal strategies. 18
17 Minimizing the Expected Loss Optimal solution is the one that minimizes the loss. But: loss function depends on the true class, which is unknown. Solution: Minimize the expected loss This can be done by choosing the regions R j such that which is easy to do once we know the posterior class probabilities p(c k jx). 19
18 Minimizing the Expected Loss Example: 2 Classes: C 1, C 2 2 Decision: 1, 2 Loss function: L( j jc k ) = L kj Expected loss (= risk R) for the two decisions: Goal: Decide such that expected loss is minimized I.e. decide 1 if Slide credit: Bernt Schiele 20
19 Minimizing the Expected Loss R( 2 jx) > R( 1 jx) L 12 p(c 1 jx) + L 22 p(c 2 jx) > L 11 p(c 1 jx) + L 21 p(c 2 jx) (L 12 L 11 )p(c 1 jx) > (L 21 L 22 )p(c 2 jx) (L 12 L 11 ) (L 21 L 22 ) > p(c 2jx) p(c 1 jx) = p(xjc 2)p(C 2 ) p(xjc 1 )p(c 1 ) p(xjc 1 ) p(xjc 2 ) > (L 21 L 22 ) (L 12 L 11 ) p(c 2 ) p(c 1 ) Adapted decision rule taking into account the loss. Slide credit: Bernt Schiele 21
20 The Reject Option Classification errors arise from regions where the largest posterior probability p(c k jx) is significantly less than 1. These are the regions where we are relatively uncertain about class membership. For some applications, it may be better to reject the automatic decision entirely in such a case and e.g. consult a human expert. 22 Image source: C.M. Bishop, 2006
21 Discriminant Functions Formulate classification in terms of comparisons Discriminant functions y 1 (x); : : : ; y K (x) Classify x as class C k if y k (x) > y j (x) 8j 6= k Examples (Bayes Decision Theory) y k (x) = p(c k jx) y k (x) = p(xjc k )p(c k ) y k (x) = log p(xjc k ) + log p(c k ) Slide credit: Bernt Schiele 23
22 Different Views on the Decision Problem y k (x) / p(xjc k )p(c k ) First determine the classconditional densities for each class individually and separately infer the prior class probabilities. Then use Bayes theorem to determine class membership. Generative methods y k (x) = p(c k jx) First solve the inference problem of determining the posterior class probabilities. Then use decision theory to assign each new x to its class. Discriminative methods Alternative Directly find a discriminant function which maps each input x directly onto a class label. y k (x) 24
23 Topics of This Lecture Bayes Decision Theory Basic concepts Minimizing the misclassification rate Minimizing the expected loss Discriminant functions Probability Density Estimation General concepts Gaussian distribution Parametric Methods Maximum Likelihood approach Bayesian vs. Frequentist views on probability Bayesian Learning 25
24 Probability Density Estimation Up to now Bayes optimal classification Based on the probabilities p(xjc k )p(c k ) How can we estimate (=learn) those probability densities? Supervised training case: data and class labels are known. Estimate the probability density for each class separately: p(xjc k ) (For simplicity of notation, we will drop the class label in the following.) C k C k Slide credit: Bernt Schiele 26
25 Probability Density Estimation Data: x 1, x 2, x 3, x 4, x Estimate: p(x) x Methods Parametric representations Nonparametric representations Mixture models (next lecture) Slide credit: Bernt Schiele 27
26 The Gaussian (or Normal) Distribution Onedimensional case Mean ¹ Variance ¾ 2 N(xj¹; ¾ 2 ) = p 1 ¾ (x ¹)2 exp ½ 2¼¾ 2¾ 2 Multidimensional case Mean ¹ Covariance N(xj¹; ) = ½ 1 exp 1 ¾ (2¼) D=2 j j1=2 2 (x ¹)T 1 (x ¹) 28 Image source: C.M. Bishop, 2006
27 Gaussian Distribution Properties Central Limit Theorem The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. In practice, the convergence to a Gaussian can be very rapid. This makes the Gaussian interesting for many applications. Example: N uniform [0,1] random variables. 29 Image source: C.M. Bishop, 2006
28 Gaussian Distribution Properties Quadratic Form N depends on x through the exponent Here, M is often called the Mahalanobis distance from ¹ to x. Shape of the Gaussian is a real, symmetric matrix. We can therefore decompose it into its eigenvectors DX = iu i u T i i=1 and thus obtain with. Constant density on ellipsoids with main directions along the eigenvectors u i and scaling factors p i. 30 Image source: C.M. Bishop, 2006
29 Gaussian Distribution Properties Special cases Full covariance matrix = [¾ ij ] General ellipsoid shape Diagonal covariance matrix = diagf¾ i g Axisaligned ellipsoid Uniform variance = ¾ 2 I Hypersphere 31 Image source: C.M. Bishop, 2006
30 Gaussian Distribution Properties The marginals of a Gaussian are again Gaussians: 32 Image source: C.M. Bishop, 2006
31 Topics of This Lecture Bayes Decision Theory Basic concepts Minimizing the misclassification rate Minimizing the expected loss Discriminant functions Probability Density Estimation General concepts Gaussian distribution Parametric Methods Maximum Likelihood approach Bayesian vs. Frequentist views on probability Bayesian Learning 33
32 Parametric Methods Given Data Parametric form of the distribution with parameters µ E.g. for Gaussian distrib.: Learning X = fx 1 ; x 2 ; : : : ; x N g Estimation of the parameters µ µ = (¹; ¾) x x Likelihood of µ Probability that the data X have indeed been generated from a probability density with parameters µ L(µ) = p(xjµ) Slide adapted from Bernt Schiele 34
33 Maximum Likelihood Approach Computation of the likelihood p(x n jµ) Single data point: Assumption: all data points are independent Loglikelihood L(µ) = p(xjµ) = E(µ) = ln L(µ) = NY n=1 p(x n jµ) NX ln p(x n jµ) n=1 Estimation of the parameters µ (Learning) Maximize the likelihood Minimize the negative loglikelihood Slide credit: Bernt Schiele 35
34 Maximum Likelihood Approach Likelihood: L(µ) = p(xjµ) = NY n=1 p(x n jµ) We want to obtain ^µ such that L(^µ) is maximized. p(xjµ) ^µ µ Slide credit: Bernt Schiele 36
35 Maximum Likelihood Approach Minimizing the loglikelihood How do we minimize a function? Take the derivative and set it NX ln p(x n jµ) = n=1 p(x njµ) p(x n jµ)! = 0 Loglikelihood for Normal distribution (1D case) NX E(µ) = ln p(x n j¹; ¾) = n=1 NX n=1 ln µ 1 p exp ½ jjx n ¹jj 2 ¾ 2¼¾ 2¾ 2 37
36 Maximum Likelihood Approach Minimizing E(¹; ¾) = p(x nj¹; ¾) p(x n j¹; ¾) = n=1 NX 2(x n ¹) 2¾ 2 n=1 X N = 1 NX 2 (x (x n n ¹) ¹) ¾ n=1 Ã = 1 N! X 2 x n n N¹ E(¹; ¾) =! 0, ^¹ = 1 N NX n=1 x n p(x n j¹; ¾) = 1 p e jjx n ¹jj 2 2¾ 2 2¼¾ 38
37 Maximum Likelihood Approach We thus obtain ^¹ = 1 NX N n=1 x n In a similar fashion, we get ^¾ 2 = 1 NX (x n ^¹) 2 N n=1 sample mean sample variance ^µ = (^¹; ^¾) is the Maximum Likelihood estimate for the parameters of a Gaussian distribution. This is a very important result. Unfortunately, it is wrong 39
38 Maximum Likelihood Approach Or not wrong, but rather biased Assume the samples x 1, x 2,, x N come from a true Gaussian distribution with mean ¹ and variance ¾ 2 We can now compute the expectations of the ML estimates with respect to the data set values. It can be shown that E(¹ ML ) = ¹ µ N 1 E(¾ML) 2 = N ¾ 2 The ML estimate will underestimate the true variance. Corrected estimate: ~¾ 2 = N N 1 ¾2 ML = 1 N 1 NX (x n ^¹) 2 n=1 40
39 Maximum Likelihood Limitations Maximum Likelihood has several significant limitations It systematically underestimates the variance of the distribution! E.g. consider the case N = 1; X = fx 1 g x Maximumlikelihood estimate: ^¾ = 0! We say ML overfits to the observed data. We will still often use ML, but it is important to know about this effect. ^¹ x Slide adapted from Bernt Schiele 41
40 Deeper Reason Maximum Likelihood is a Frequentist concept In the Frequentist view, probabilities are the frequencies of random, repeatable events. These frequencies are fixed, but can be estimated more precisely when more data is available. This is in contrast to the Bayesian interpretation In the Bayesian view, probabilities quantify the uncertainty about certain states or events. This uncertainty can be revised in the light of new evidence. Bayesians and Frequentists do not like each other too well 42
41 Bayesian vs. Frequentist View To see the difference Suppose we want to estimate the uncertainty whether the Arctic ice cap will have disappeared by the end of the century. This question makes no sense in a Frequentist view, since the event cannot be repeated numerous times. In the Bayesian view, we generally have a prior, e.g. from calculations how fast the polar ice is melting. If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior. Posterior / Likelihood Prior This generally allows to get better uncertainty estimates for many situations. Main Frequentist criticism The prior has to come from somewhere and if it is wrong, the result will be worse. 43
42 Bayesian Approach to Parameter Learning Conceptual shift Maximum Likelihood views the true parameter vector µ to be unknown, but fixed. In Bayesian learning, we consider µ to be a random variable. This allows us to use knowledge about the parameters µ i.e. to use a prior for µ Training data then converts this prior distribution on µ into a posterior probability density. The prior thus encodes knowledge we have about the type of distribution we expect to see for µ. Slide adapted from Bernt Schiele 44
43 Bayesian Learning Approach Bayesian view: Consider the parameter vector µ as a random variable. When estimating the parameters, what we compute is p(xjx) = p(xjx) = Z Z p(x; µjx)dµ p(x; µjx) = p(xjµ; X)p(µjX) p(xjµ)p(µjx)dµ Assumption: given µ, this doesn t depend on X anymore This is entirely determined by the parameter µ (i.e. by the parametric form of the pdf). Slide adapted from Bernt Schiele 45
44 Bayesian Learning Approach Z p(xjx) = p(xjµ)p(µjx)dµ p(µjx) = p(xjµ)p(µ) p(x) p(x) = Z p(xjµ)p(µ)dµ = = p(µ) p(x) L(µ) Z L(µ)p(µ)dµ Inserting this above, we obtain Z p(xjµ)l(µ)p(µ) p(xjx) = dµ = p(x) Z p(xjµ)l(µ)p(µ) R L(µ)p(µ)dµ dµ Slide credit: Bernt Schiele 46
45 Bayesian Learning Approach Discussion Likelihood of the parametric form µ given the data set X. Estimate for x based on parametric form µ Prior for the parameters µ p(xjx) = Z p(xjµ)l(µ)p(µ) R L(µ)p(µ)dµ dµ Normalization: integrate over all possible values of µ If we now plug in a (suitable) prior p(µ), we can estimate from the data set X. p(xjx) 47
46 Bayesian Density Estimation Discussion p(xjx) = Z p(xjµ)p(µjx)dµ = Z p(xjµ)l(µ)p(µ) R L(µ)p(µ)dµ dµ p(µjx) The probability makes the dependency of the estimate on the data explicit. p(µjx) ^µ If is very small everywhere, but is large for one, then p(xjx) ¼ p(xj^µ) The more uncertain we are about µ, the more we average over all parameter values. Slide credit: Bernt Schiele 48
47 Bayesian Density Estimation Problem In the general case, the integration over µ is not possible (or only possible stochastically). Example where an analytical solution is possible Normal distribution for the data, ¾ 2 assumed known and fixed. Estimate the distribution of the mean: p(¹jx) = p(xj¹)p(¹) p(x) Prior: We assume a Gaussian prior over ¹, Slide credit: Bernt Schiele 49
48 Bayesian Learning Approach Sample mean: Bayes estimate: Note: ¹ N = ¾2 ¹ 0 + N¾ 2 0 ¹x ¾ 2 + N¾ ¾ 2 N = 1 ¾ N ¾ 2 ¹x = 1 N NX n=1 x n p(¹jx) ¹ 0 = 0 50 Slide adapted from Bernt Schiele Image source: C.M. Bishop, 2006
49 Summary: ML vs. Bayesian Learning Maximum Likelihood Simple approach, often analytically possible. Problem: estimation is biased, tends to overfit to the data. But: Often needs some correction or regularization. Approximation gets accurate for N! 1. Bayesian Learning General approach, avoids the estimation bias through a prior. Problems: But: Need to choose a suitable prior (not always obvious). Integral over µ often not analytically feasible anymore. Efficient stochastic sampling techniques available (see Lecture 15). (In this lecture, we ll use both concepts wherever appropriate) 51
50 References and Further Reading More information in Bishop s book Gaussian distribution and ML: Ch and Bayesian Learning: Ch and Nonparametric methods: Ch Additional information can be found in Duda & Hart ML estimation: Ch. 3.2 Bayesian Learning: Ch Nonparametric methods: Ch Christopher M. Bishop Pattern Recognition and Machine Learning Springer, 2006 R.O. Duda, P.E. Hart, D.G. Stork Pattern Classification 2 nd Ed., WileyInterscience,
Machine Learning Lecture 2
Announcements Machine Learning Lecture 2 Eceptional number of lecture participants this year Current count: 449 participants This is very nice, but it stretches our resources to their limits Probability
More informationMachine Learning Lecture 1
Many slides adapted from B. Schiele Machine Learning Lecture 1 Introduction 18.04.2016 Bastian Leibe RWTH Aachen http://www.vision.rwthaachen.de/ leibe@vision.rwthaachen.de Organization Lecturer Prof.
More informationLecture Machine Learning (past summer semester) If at least one Englishspeaking student is present.
Advanced Machine Learning Lecture 1 Organization Lecturer Prof. Bastian Leibe (leibe@vision.rwthaachen.de) Introduction 20.10.2015 Bastian Leibe Visual Computing Institute RWTH Aachen University http://www.vision.rwthaachen.de/
More informationMachine Learning Lecture 12
Machine Learning Lecture 12 Neural Networks 30.11.2017 Bastian Leibe RWTH Aachen http://www.vision.rwthaachen.de leibe@vision.rwthaachen.de Course Outline Fundamentals Bayes Decision Theory Probability
More informationBayesian Decision Theory
Bayesian Decision Theory Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Bayesian Decision Theory Bayesian classification for normal distributions Error Probabilities
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationProblem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30
Problem Set MAS 6J/1.16J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain
More informationMachine Learning, Fall 2012 Homework 2
060 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv BarJoseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0
More informationThe Bayes classifier
The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongamro, Namgu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More information10701/ Machine Learning, Fall
070/578 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sumofsquares error is the most common training
More informationPhysics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester
Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability
More informationMachine Learning (CS 567) Lecture 2
Machine Learning (CS 567) Lecture 2 Time: TTh 5:00pm  6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationIntroduction to Graphical Models
Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) YungKyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic
More informationBayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington
Bayesian Classifiers and Probability Estimation Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington 1 Data Space Suppose that we have a classification problem The
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multiclass classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationGI01/M055: Supervised Learning
GI01/M055: Supervised Learning 1. Introduction to Supervised Learning October 5, 2009 John ShaweTaylor 1 Course information 1. When: Mondays, 14:00 17:00 Where: Room 1.20, Engineering Building, Malet
More informationMachine Learning, Midterm Exam
10601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv BarJoseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have
More informationBAYESIAN DECISION THEORY. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
BAYESIAN DECISION THEORY Credits 2 Some of these slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Simon Prince, University College London Sergios Theodoridis, University of Athens
More informationStatistics for the LHC Lecture 1: Introduction
Statistics for the LHC Lecture 1: Introduction Academic Training Lectures CERN, 14 17 June, 2010 indico.cern.ch/conferencedisplay.py?confid=77830 Glen Cowan Physics Department Royal Holloway, University
More informationBayes Rule. CS789: Machine Learning and Neural Network Bayesian learning. A Side Note on Probability. What will we learn in this lecture?
Bayes Rule CS789: Machine Learning and Neural Network Bayesian learning P (Y X) = P (X Y )P (Y ) P (X) Jakramate Bootkrajang Department of Computer Science Chiang Mai University P (Y ): prior belief, prior
More informationClassification 1: Linear regression of indicators, linear discriminant analysis
Classification 1: Linear regression of indicators, linear discriminant analysis Ryan Tibshirani Data Mining: 36462/36662 April 2 2013 Optional reading: ISL 4.1, 4.2, 4.4, ESL 4.1 4.3 1 Classification
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two onepage, twosided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationCOMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma
COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and NonParametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationMachine Learning and Data Mining. Bayes Classifiers. Prof. Alexander Ihler
+ Machine Learning and Data Mining Bayes Classifiers Prof. Alexander Ihler A basic classifier Training data D={x (i),y (i) }, Classifier f(x ; D) Discrete feature vector x f(x ; D) is a con@ngency table
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte  CMPT 726 Bishop PRML Ch. 4 Classification: Handwritten Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationMachine Learning Lecture 14
Machine Learning Lecture 14 Tricks of the Trade 07.12.2017 Bastian Leibe RWTH Aachen http://www.vision.rwthaachen.de leibe@vision.rwthaachen.de Course Outline Fundamentals Bayes Decision Theory Probability
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of Kmeans Hard assignments of data points to clusters small shift of a
More informationMidterm: CS 6375 Spring 2015 Solutions
Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a onepage cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an
More informationA Brief Review of Probability, Bayesian Statistics, and Information Theory
A Brief Review of Probability, Bayesian Statistics, and Information Theory Brendan Frey Electrical and Computer Engineering University of Toronto frey@psi.toronto.edu http://www.psi.toronto.edu A system
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Nonparametric Gaussian Process (GP) GP Regression
More informationIntroduction to Bayesian Learning
Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti  A.A. 2016/2017 Outline
More informationHST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007
MIT OpenCourseWare http://ocw.mit.edu HST.582J / 6.555J / 16.456J Biomedical Signal and Image Processing Spring 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationLecture 2: From Linear Regression to Kalman Filter and Beyond
Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation
More informationCS 188: Artificial Intelligence. Machine Learning
CS 188: Artificial Intelligence Review of Machine Learning (ML) DISCLAIMER: It is insufficient to simply study these slides, they are merely meant as a quick refresher of the highlevel ideas covered.
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationLinear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging
More informationIntroduction to Machine Learning Midterm Exam Solutions
10701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv BarJoseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationThe classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know
The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is
More informationStatistical Rock Physics
Statistical  Introduction Book review 3.13.3 Min Sun March. 13, 2009 Outline. What is Statistical. Why we need Statistical. How Statistical works Statistical Rock physics Information theory Statistics
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
University of Cambridge Engineering Part IIB Module 4F: Statistical Pattern Processing Handout 2: Multivariate Gaussians.2.5..5 8 6 4 2 2 4 6 8 Mark Gales mjfg@eng.cam.ac.uk Michaelmas 2 2 Engineering
More informationData Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis
Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble RhoneAlpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture
More informationPart I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis
Week 5 Based in part on slides from textbook, slides of Susan Holmes Part I Linear Discriminant Analysis October 29, 2012 1 / 1 2 / 1 Nearest centroid rule Suppose we break down our data matrix as by the
More informationLecture 2. G. Cowan Lectures on Statistical Data Analysis Lecture 2 page 1
Lecture 2 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,
More information12 Discriminant Analysis
12 Discriminant Analysis Discriminant analysis is used in situations where the clusters are known a priori. The aim of discriminant analysis is to classify an observation, or several observations, into
More informationBayesian Networks Structure Learning (cont.)
Koller & Friedman Chapters (handed out): Chapter 11 (short) Chapter 1: 1.1, 1., 1.3 (covered in the beginning of semester) 1.4 (Learning parameters for BNs) Chapter 13: 13.1, 13.3.1, 13.4.1, 13.4.3 (basic
More informationMachine Learning (CS 567) Lecture 5
Machine Learning (CS 567) Lecture 5 Time: TTh 5:00pm  6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September
More informationB4 Estimation and Inference
B4 Estimation and Inference 6 Lectures Hilary Term 27 2 Tutorial Sheets A. Zisserman Overview Lectures 1 & 2: Introduction sensors, and basics of probability density functions for representing sensor error
More informationLast updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship
More informationSlides modified from: PATTERN RECOGNITION CHRISTOPHER M. BISHOP. and: Computer vision: models, learning and inference Simon J.D.
Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP and: Computer vision: models, learning and inference. 2011 Simon J.D. Prince ClassificaLon Example: Gender ClassiﬁcaLon
More informationRelevance Vector Machines
LUT February 21, 2011 Support Vector Machines Model / Regression Marginal Likelihood Regression Relevance vector machines Exercise Support Vector Machines The relevance vector machine (RVM) is a bayesian
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Nonparametric
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More information7 Gaussian Discriminant Analysis (including QDA and LDA)
36 Jonathan Richard Shewchuk 7 Gaussian Discriminant Analysis (including QDA and LDA) GAUSSIAN DISCRIMINANT ANALYSIS Fundamental assumption: each class comes from normal distribution (Gaussian). X N(µ,
More informationCourse 495: Advanced Statistical Machine Learning/Pattern Recognition
Course 495: Advanced Statistical Machine Learning/Pattern Recognition Goal (Lecture): To present Probabilistic Principal Component Analysis (PPCA) using both Maximum Likelihood (ML) and Expectation Maximization
More informationLecture 2: From Linear Regression to Kalman Filter and Beyond
Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing
More informationPILCO: A ModelBased and DataEfficient Approach to Policy Search
PILCO: A ModelBased and DataEfficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol
More informationMachine Learning
Machine Learning 10601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 2, 2015 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationLecture 3: Machine learning, classification, and generative models
EE E6820: Speech & Audio Processing & Recognition Lecture 3: Machine learning, classification, and generative models 1 Classification 2 Generative models 3 Gaussian models Michael Mandel
More informationModel Selection for Gaussian Processes
Institute for Adaptive and Neural Computation School of Informatics,, UK December 26 Outline GP basics Model selection: covariance functions and parameterizations Criteria for model selection Marginal
More informationLecture 2: Simple Classifiers
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 2: Simple Classifiers Slides based on Rich Zemel s All lecture slides will be available on the course website: www.cs.toronto.edu/~jessebett/csc412
More informationSTATS 306B: Unsupervised Learning Spring Lecture 2 April 2
STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised
More informationEstimation of Operational Risk Capital Charge under Parameter Uncertainty
Estimation of Operational Risk Capital Charge under Parameter Uncertainty Pavel V. Shevchenko Principal Research Scientist, CSIRO Mathematical and Information Sciences, Sydney, Locked Bag 17, North Ryde,
More informationSeminar über Statistik FS2008: Model Selection
Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can
More informationFrom inductive inference to machine learning
From inductive inference to machine learning ADAPTED FROM AIMA SLIDES Russel&Norvig:Artificial Intelligence: a modern approach AIMA: Inductive inference AIMA: Inductive inference 1 Outline Bayesian inferences
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationCourse in Data Science
Course in Data Science About the Course: In this course you will get an introduction to the main tools and ideas which are required for Data Scientist/Business Analyst/Data Analyst. The course gives an
More informationProbability Based Learning
Probability Based Learning Lecture 7, DD2431 Machine Learning J. Sullivan, A. Maki September 2013 Advantages of Probability Based Methods Work with sparse training data. More powerful than deterministic
More informationFeature selection and extraction Spectral domain quality estimation Alternatives
Feature selection and extraction Error estimation Maa57.3210 Data Classification and Modelling in Remote Sensing Markus Törmä markus.torma@tkk.fi Measurements Preprocessing: Remove random and systematic
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationLinear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes. November 9, Statistics 202: Data Mining
Linear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes November 9, 2012 1 / 1 Nearest centroid rule Suppose we break down our data matrix as by the labels yielding (X
More informationPattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes
Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3  MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The
More informationIntroduction to emulators  the what, the when, the why
School of Earth and Environment INSTITUTE FOR CLIMATE & ATMOSPHERIC SCIENCE Introduction to emulators  the what, the when, the why Dr Lindsay Lee 1 What is a simulator? A simulator is a computer code
More informationA Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007
Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized
More informationProbabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More informationProbability Theory for Machine Learning. Chris Cremer September 2015
Probability Theory for Machine Learning Chris Cremer September 2015 Outline Motivation Probability Definitions and Rules Probability Distributions MLE for Gaussian Parameter Estimation MLE and Least Squares
More informationIntroduction to Neural Networks
Introduction to Neural Networks Steve Renals Automatic Speech Recognition ASR Lecture 10 24 February 2014 ASR Lecture 10 Introduction to Neural Networks 1 Neural networks for speech recognition Introduction
More informationToday. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion
Today Probability and Statistics Naïve Bayes Classification Linear Algebra Matrix Multiplication Matrix Inversion Calculus Vector Calculus Optimization Lagrange Multipliers 1 Classical Artificial Intelligence
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2
Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate
More informationSession 1: Pattern Recognition
Proc. Digital del Continguts Musicals Session 1: Pattern Recognition 1 2 3 4 5 Music Content Analysis Pattern Classification The Statistical Approach Distribution Models Singing Detection Dan Ellis
More informationMachine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012
Machine Learning 10601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationPAClearning, VC Dimension and Marginbased Bounds
More details: General: http://www.learningwithkernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAClearning, VC Dimension and Marginbased
More informationLast Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression
CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 22 Dan Weld Learning Gaussians Naïve Bayes Last Time Gaussians Naïve Bayes Logistic Regression Today Some slides from Carlos Guestrin, Luke Zettlemoyer
More informationIntroduction to Graphical Models. Srikumar Ramalingam School of Computing University of Utah
Introduction to Graphical Models Srikumar Ramalingam School of Computing University of Utah Reference Christopher M. Bishop, Pattern Recognition and Machine Learning, Jonathan S. Yedidia, William T. Freeman,
More informationAn AIish view of Probability, Conditional Probability & Bayes Theorem
An AIish view of Probability, Conditional Probability & Bayes Theorem Review: Uncertainty and Truth Values: a mismatch Let action A t = leave for airport t minutes before flight. Will A 15 get me there
More informationProbability models for machine learning. Advanced topics ML4bio 2016 Alan Moses
Probability models for machine learning Advanced topics ML4bio 2016 Alan Moses What did we cover in this course so far? 4 major areas of machine learning: Clustering Dimensionality reduction Classification
More informationMachine Learning. Probability Basics. Marc Toussaint University of Stuttgart Summer 2014
Machine Learning Probability Basics Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet,
More information