Lecture 2: Parameter Learning in Fully Observed Graphical Models. Sam Roweis. Monday July 24, 2006 Machine Learning Summer School, Taiwan
2 Likelihood Functions So far we have focused on the (log) probability function p(x | θ), which assigns a probability (density) to any joint configuration of variables x given fixed parameters θ. But in learning we turn this on its head: we have some fixed data and we want to find parameters. Think of p(x | θ) as a function of θ for fixed x: L(θ; x) = p(x | θ), ℓ(θ; x) = log p(x | θ). This function is called the (log) likelihood. Choose θ to maximize some objective function L(θ) which includes ℓ(θ): L(θ) = ℓ(θ; D) [maximum likelihood (ML)]; L(θ) = ℓ(θ; D) + log p(θ) [maximum a posteriori (MAP)/penalized ML] (also cross-validation, Bayesian estimators, BIC, AIC, ...)
3 Maximum Likelihood For IID data: p(D | θ) = ∏_m p(x_m | θ), so ℓ(θ; D) = ∑_m log p(x_m | θ). Idea of maximum likelihood estimation (MLE): pick the setting of parameters most likely to have generated the data we saw: θ_ML = argmax_θ ℓ(θ; D). Commonly used as a baseline model in statistics. Often leads to intuitive, appealing, or natural estimators. For a start, the IID assumption makes the log likelihood into a sum, so its derivative can be easily taken term by term.
4 Example: Bernoulli Trials We observe M iid coin flips: D = H, H, T, H, ... Model: p(H) = θ, p(T) = 1 − θ. Likelihood: ℓ(θ; D) = log p(D | θ) = log ∏_m θ^{x_m}(1 − θ)^{1−x_m} = log θ ∑_m x_m + log(1 − θ) ∑_m (1 − x_m) = N_H log θ + N_T log(1 − θ). Take derivatives and set to zero: ∂ℓ/∂θ = N_H/θ − N_T/(1 − θ) = 0 ⇒ θ_ML = N_H/(N_H + N_T).
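To make the coin-flip estimate concrete, here is a minimal Python/NumPy sketch of the closed-form Bernoulli MLE; the data array is illustrative, not from the lecture:

```python
import numpy as np

# Coin flips encoded as 1 = heads, 0 = tails (made-up sample).
x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

N_H = x.sum()          # number of heads
N_T = len(x) - N_H     # number of tails

# Setting dl/dtheta = 0 gives the closed form from the slide.
theta_ml = N_H / (N_H + N_T)
print(theta_ml)        # 0.7 for this sample
```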
5 Example: Univariate Normal We observe M iid real samples: D = 1.18, −.25, .78, ... Model: p(x) = (2πσ²)^{−1/2} exp{−(x − µ)²/2σ²}. Likelihood (using probability density): ℓ(θ; D) = log p(D | θ) = −(M/2) log(2πσ²) − (1/2σ²) ∑_m (x_m − µ)². Take derivatives and set to zero: ∂ℓ/∂µ = (1/σ²) ∑_m (x_m − µ); ∂ℓ/∂σ² = −M/2σ² + (1/2σ⁴) ∑_m (x_m − µ)² ⇒ µ_ML = (1/M) ∑_m x_m, σ²_ML = (1/M) ∑_m x_m² − µ²_ML.
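A corresponding sketch for the univariate Gaussian, again with made-up data; note it computes the biased (1/M) variance estimator exactly as derived on the slide:

```python
import numpy as np

x = np.array([1.18, -0.25, 0.78, 0.31, -0.14])  # illustrative samples
M = len(x)

mu_ml = x.mean()                       # (1/M) sum_m x_m
sigma2_ml = np.mean(x**2) - mu_ml**2   # (1/M) sum_m x_m^2 - mu_ML^2
# Equivalently: np.mean((x - mu_ml)**2) -- the 1/M (not 1/(M-1)) estimator.
```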
6 Example: Linear Regression [figure: training pairs of inputs X and outputs Y with a fitted regression line]
7 Example: Linear Regression At a linear regression node, some parents (covariates/inputs x) and all children (responses/outputs y) are continuous valued variables. For each child and setting of discrete parents we use the model: p(y | x, θ) = gauss(y | θᵀx, σ²). The likelihood is the familiar squared error cost: ℓ(θ; D) = −(1/2σ²) ∑_m (y_m − θᵀx_m)². The ML parameters can be solved for using linear least-squares: ∂ℓ/∂θ = −∑_m (y_m − θᵀx_m) x_m = 0 ⇒ θ_ML = (XᵀX)^{−1} XᵀY. Sufficient statistics are the input correlation matrix and the input-output cross-correlation vector.
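The normal equations take a couple of lines in code; this sketch generates synthetic data (names and noise level are illustrative) and recovers θ_ML from the two sufficient statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # inputs, one row per case
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

XtX = X.T @ X    # input correlation matrix (sufficient statistic)
XtY = X.T @ y    # input-output cross-correlation vector

# theta_ML = (X'X)^{-1} X'Y; solve() avoids forming the explicit inverse.
theta_ml = np.linalg.solve(XtX, XtY)
```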
8 MLE for Directed GMs For a directed GM, the likelihood function has a nice form: log p(D | θ) = log ∏_m ∏_i p(x_i^m | x_{π_i}^m, θ_i) = ∑_m ∑_i log p(x_i^m | x_{π_i}^m, θ_i). The parameters decouple, so we can maximize likelihood independently for each node's function by setting θ_i. Only need the values of x_i and its parents in order to estimate θ_i. In general, for fully observed data, if we know how to estimate params at a single node we can do it for the whole network. [figure: example directed graphs (a) and (b) over X1, ..., X4]
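Because the likelihood decouples over nodes, fitting a fully observed discrete DAG reduces to counting (value, parent-configuration) pairs per node. A sketch under an assumed toy structure and made-up data:

```python
from collections import Counter

# Toy binary network X1 -> X2, X1 -> X3 (assumed structure).
parents = {"X1": [], "X2": ["X1"], "X3": ["X1"]}
data = [
    {"X1": 0, "X2": 0, "X3": 1},
    {"X1": 1, "X2": 1, "X3": 1},
    {"X1": 1, "X2": 1, "X3": 0},
    {"X1": 0, "X2": 1, "X3": 0},
]

# Each node's CPT is estimated independently from its own family counts.
cpts = {}
for node, pa in parents.items():
    joint, marg = Counter(), Counter()
    for case in data:
        cfg = tuple(case[p] for p in pa)
        joint[(case[node], cfg)] += 1
        marg[cfg] += 1
    cpts[node] = {k: cnt / marg[k[1]] for k, cnt in joint.items()}
```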
9 Reminder: Classification Given examples of a discrete class label y and some features x. Goal: compute label (y) for new inputs x. Two approaches: Generative: model p(x, y) = p(y)p(x | y); use Bayes rule to infer the conditional p(y | x). Discriminative: model discriminants f(y | x) directly and take max. The generative approach is related to conditional density estimation while the discriminative approach is closer to regression.
10 Probabilistic Classification: Bayes Classifiers Generative model: p(x, y) = p(y)p(x | y). p(y) are called class priors. p(x | y) are called class conditional feature distributions. For the prior we use a Bernoulli or multinomial: p(y = k | π) = π_k with ∑_k π_k = 1. Classification rules: ML: argmax_y p(x | y) (can behave badly if skewed priors); MAP: argmax_y p(y | x) = argmax_y log p(x | y) + log p(y) (safer). Fitting: maximize ∑_n log p(x_n, y_n) = ∑_n log p(x_n | y_n) + log p(y_n). 1) Sort data into batches by class label. 2) Estimate p(y) by counting size of batches (plus regularization). 3) Estimate p(x | y) separately within each batch using ML (also with regularization).
11 Three Key Regularization Ideas To avoid overfitting, we can put priors on the parameters of the class and class conditional feature distributions. We can also tie some parameters together so that fewer of them are estimated using more data. Finally, we can make factorization or independence assumptions about the distributions. In particular, for the class conditional distributions we can assume the features are fully dependent, partly dependent, or independent (!). [figure: class variable Y with features X1, ..., Xm that are (a) fully dependent, (b) partly dependent, (c) independent]
12 Gaussian Class-Conditional Distributions If all features are continuous, a popular choice is a Gaussian class-conditional: p(x | y = k, θ) = |2πΣ|^{−1/2} exp{−(1/2)(x − µ_k)ᵀΣ^{−1}(x − µ_k)}. Fitting: use the following amazing and useful fact. The maximum likelihood fit of a Gaussian to some data is the Gaussian whose mean is equal to the data mean and whose covariance is equal to the sample covariance. [Try to prove this as an exercise in understanding likelihood, algebra, and calculus all at once!] Seems easy. And works amazingly well. But we can do even better with some simple regularization...
13 Regularized Gaussians Idea 1: assume all the covariances are the same (tie parameters). This is exactly Fisher's linear discriminant analysis. Idea 2: Make independence assumptions to get diagonal or identity-multiple covariances. (Or sparse inverse covariances.) More on this in a few minutes... Idea 3: add a bit of the identity matrix to each sample covariance. This fattens it up in directions where there wasn't enough data. Equivalent to using a Wishart prior on the covariance matrix.
14 Gaussian Bayes Classifier Maximum likelihood estimates for parameters: priors π_k: use observed frequencies of classes (plus smoothing); means µ_k: use class means; covariance Σ: use data from a single class or pooled data (x_m − µ_{y_m}) to estimate full/diagonal covariances. Compute the posterior via Bayes rule: p(y = k | x, θ) = p(x | y = k, θ)p(y = k | π) / ∑_j p(x | y = j, θ)p(y = j | π) = exp{µ_kᵀΣ^{−1}x − µ_kᵀΣ^{−1}µ_k/2 + log π_k} / ∑_j exp{µ_jᵀΣ^{−1}x − µ_jᵀΣ^{−1}µ_j/2 + log π_j} = e^{β_kᵀx} / ∑_j e^{β_jᵀx} = exp{β_kᵀx}/Z, where β_k = [Σ^{−1}µ_k ; −µ_kᵀΣ^{−1}µ_k/2 + log π_k] and we have augmented x with a constant component always equal to 1 (bias term).
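A sketch of the pooled-covariance Gaussian Bayes classifier described above; function and variable names are illustrative:

```python
import numpy as np

def fit_gaussian_bayes(X, y, n_classes):
    """ML fit with a single covariance pooled across classes."""
    M, d = X.shape
    priors = np.array([(y == k).mean() for k in range(n_classes)])
    means = np.array([X[y == k].mean(axis=0) for k in range(n_classes)])
    centered = X - means[y]           # subtract each point's class mean
    cov = centered.T @ centered / M   # pooled ML covariance
    return priors, means, cov

def log_posterior(X, priors, means, cov):
    P = np.linalg.inv(cov)
    W = means @ P                     # rows are Sigma^{-1} mu_k
    b = -0.5 * np.sum((means @ P) * means, axis=1) + np.log(priors)
    scores = X @ W.T + b              # beta_k' [x; 1] for each class k
    return scores - np.logaddexp.reduce(scores, axis=1, keepdims=True)
```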
15 Softmax/Logit The squashing function is known as the softmax or logit: g(η) = 1/(1 + e^{−η}); φ_k(z) = e^{z_k} / ∑_j e^{z_j}. It is invertible (up to a constant): η = log(g/(1 − g)); z_k = log φ_k + c. The derivative is easy: dg/dη = g(1 − g); ∂φ_k/∂z_j = φ_k(δ_kj − φ_j).
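In code, both squashing functions are usually written in a numerically stabilized form (subtracting the max, which leaves the softmax unchanged); a small sketch:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)        # shift-invariance of softmax; prevents overflow
    e = np.exp(z)
    return e / e.sum()

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))
```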
16 Log Linear Geometry Taking the ratio of any two posteriors (the "odds") shows that the contours of equal pairwise probability are linear surfaces in the feature space: p(y = k | x, θ) / p(y = j | x, θ) = exp{(β_k − β_j)ᵀx}. The pairwise discrimination contours p(y_k | x) = p(y_j | x) are orthogonal to the differences of the means in feature space when Σ = σI. For a general Σ shared between all classes the same is true in the transformed feature space w = Σ^{−1}x. The priors do not change the geometry, they only shift the operating point on the logit by the log-odds log(π_k/π_j). Thus, for equal class-covariances, we obtain a linear classifier. If we use different covariances, the decision surfaces are conic sections and we have a quadratic classifier.
17 Discrete Bayesian Classifier If the inputs are discrete (categorical), what should we do? The simplest class conditional model is a joint multinomial (table): p(x1 = a, x2 = b, ... | y = c) = η^c_{ab...}. This is conceptually correct, but there's a big practical problem. Fitting: ML params are observed counts: η^c_{ab...} = ∑_n [y_n = c][x1 = a][x2 = b][...] / ∑_n [y_n = c]. Consider 16×16 digits at 256 gray levels. How many entries in the table? How many will be zero? What happens at test time? Doh! We obviously need some regularization. Smoothing will not help much here. Unless we know about the relationships between inputs beforehand, sharing parameters is hard also. But what about independence?
18 Naive (Idiot's) Bayes Classifier Assumption: conditioned on class, attributes are independent: p(x | y) = ∏_i p(x_i | y). Sounds crazy right? Right! But it works. [figure: naive Bayes model, class Y with conditionally independent features X1, ..., Xm] Algorithm: sort data cases into bins according to y_n. Compute marginal probabilities p(y = c) using frequencies. For each class, estimate the distribution of the i-th variable: p(x_i | y = c). At test time, compute argmax_c p(c | x) using c(x) = argmax_c p(c | x) = argmax_c [log p(x | c) + log p(c)] = argmax_c [log p(c) + ∑_i log p(x_i | c)].
19 Discrete (Multinomial) Naive Bayes Discrete features x_i, assumed independent given the class label y, with p(x_i = j | y = k) = η_{ijk}. Classification rule: p(y = k | x, η) = π_k ∏_i ∏_j η_{ijk}^{[x_i = j]} / ∑_q π_q ∏_i ∏_j η_{ijq}^{[x_i = j]} = e^{β_kᵀx} / ∑_q e^{β_qᵀx}. ML parameters are class-conditional frequency counts: η*_{ijk} = ∑_m [x_i^m = j][y^m = k] / ∑_m [y^m = k], where β_k = [log η_{11k}; ...; log η_{1jk}; ...; log η_{ijk}; ...; log π_k] and x = [x1 = 1; x1 = 2; ...; x_i = j; ...; 1]. Log-Linear!
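A sketch of the frequency-count fit with optional add-α smoothing; the smoothing knob is an addition here (α = 0 recovers the raw ML counts), anticipating the regularization discussed earlier:

```python
import numpy as np

def fit_discrete_nb(X, y, n_vals, n_classes, alpha=1.0):
    """eta[i, j, k] = p(x_i = j | y = k); X holds integer feature values."""
    M, d = X.shape
    prior = np.array([(y == k).mean() for k in range(n_classes)])
    eta = np.zeros((d, n_vals, n_classes))
    for k in range(n_classes):
        Xk = X[y == k]
        for i in range(d):
            counts = np.bincount(Xk[:, i], minlength=n_vals)
            eta[i, :, k] = (counts + alpha) / (len(Xk) + alpha * n_vals)
    return prior, eta

def predict_nb(x, prior, eta):
    # argmax_c [ log p(c) + sum_i log p(x_i | c) ]
    log_post = np.log(prior) + sum(np.log(eta[i, x[i], :]) for i in range(len(x)))
    return int(np.argmax(log_post))
```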
20 Gaussian Naive Bayes This is just a Gaussian Bayes Classifier with a separate diagonal covariance matrix for each class. Equivalent to fitting a one-dimensional Gaussian to each input for each possible class. Decision surfaces are quadratics, not linear...
21 Discriminative Models Parametrize p(y | x) directly; forget p(x, y) and Bayes rule. As long as p(y | x) or discriminants f(y | x) are linear functions of x (or monotone transforms), decision surfaces will be piecewise linear. Don't need to model the density of the features. Some density models have lots of parameters. Many densities give the same linear classifier. But we cannot generate new labeled data. Optimize a cost function closer to the one we use at test time.
22 Logistic/Softmax Regression Model: y is a multinomial random variable whose posterior is the softmax of linear functions of any feature vector x: p(y = k | x, θ) = e^{θ_kᵀx} / ∑_j e^{θ_jᵀx}. Fitting: now we optimize the conditional likelihood: ℓ(θ; D) = ∑_{m,k} [y^m = k] log p(y = k | x^m, θ) = ∑_{m,k} y_k^m log p_k^m. ∂ℓ/∂θ_i = ∑_{m,k} (∂ℓ_k^m/∂p_k^m)(∂p_k^m/∂z_i^m)(∂z_i^m/∂θ_i) = ∑_{m,k} (y_k^m/p_k^m) p_k^m (δ_ik − p_i^m) x^m = ∑_m (y_i^m − p_i^m) x^m.
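The gradient has no closed-form zero, so fitting is iterative; a minimal batch gradient-ascent sketch (learning rate and iteration count are illustrative choices):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_logistic(X, Y, n_iters=500, lr=0.1):
    """X is (M, d) with a trailing column of ones; Y is (M, K) one-hot."""
    Theta = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(n_iters):
        P = softmax_rows(X @ Theta.T)              # p_k^m per case and class
        Theta += lr * ((Y - P).T @ X) / len(X)     # sum_m (y^m - p^m) x^m
    return Theta
```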
23 Chains: Markov Models If variables have some temporal/spatial order, we can model their joint distribution as a dynamical/diffusion system. Simple idea: the next output depends only on k previous outputs: y_t = f[y_{t−1}, y_{t−2}, ..., y_{t−k}]; k is called the order of the Markov model. [figure: chain graphical model over y_0, y_1, ..., y_T] Add noise to make the system probabilistic: p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}).
24 Learning Markov Models The ML parameter estimates for a simple Markov model are easy: p(y_1, y_2, ..., y_T) = p(y_1 ... y_k) ∏_{t=k+1}^T p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}), so log p({y}) = log p(y_1 ... y_k) + ∑_{t=k+1}^T log p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}). Each window of k + 1 outputs is a training case for the model p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}). Example: for discrete outputs (symbols) and a 2nd-order Markov model we can use the multinomial model: p(y_t = m | y_{t−1} = a, y_{t−2} = b) = α_{mab}. The maximum likelihood values for α are: α*_{mab} = num[t s.t. y_t = m, y_{t−1} = a, y_{t−2} = b] / num[t s.t. y_{t−1} = a, y_{t−2} = b].
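The counting estimate translates directly into code; a sketch over a toy symbol sequence:

```python
from collections import Counter

def fit_second_order(seq):
    """alpha[(m, a, b)] = num[y_t=m, y_{t-1}=a, y_{t-2}=b] / num[y_{t-1}=a, y_{t-2}=b]."""
    joint, context = Counter(), Counter()
    for t in range(2, len(seq)):
        m, a, b = seq[t], seq[t - 1], seq[t - 2]
        joint[(m, a, b)] += 1
        context[(a, b)] += 1
    return {k: cnt / context[k[1:]] for k, cnt in joint.items()}

alpha = fit_second_order("the cat sat on the mat and the cat ran".split())
```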
25 Maximum Entropy Markov Models We can extend this idea to a "logistic regression through time" type of conditional model called a maximum entropy Markov model. [figure: chain of states s_1, ..., s_T with per-step features f_1, ..., f_T] The joint distribution is now a conditional model: p(s_1^T | x_1^T) = ∏_t p(s_t | s_{t−1}, f_t(x_1^T)). The features f_t can be very nonlocal functions of the underlying input sequence; for example, they can consult things in the past and in the future.
26 Directed Tree Graphical Models Directed trees are DAGMs in which each variable x_i has exactly one other variable as its parent x_{π_i}, except the root x_root which has no parents. Thus, the probability of a variable taking on a certain value depends only on the value of its parent: p(x) = p(x_root) ∏_{i ≠ root} p(x_i | x_{π_i}). Trees are the next step up from assuming independence. Instead of considering variables in isolation, consider them in pairs. NB: each node (except the root) has exactly one parent, but nodes may have more than one child. [figure: an example directed tree over nodes x_{1,n}, ..., x_{6,n}]
27 Likelihood function Notation: y_i ≡ a node x_i and its single parent x_{π_i}; V_i ≡ set of joint configurations of node i and its parent π_i (y_root ≡ x_root and V_root ≡ v_root). Directed model likelihood: ℓ(θ; D) = ∑_n log p(x^n) = ∑_n log p_r(x_r^n) + ∑_n ∑_{i ≠ r} log p(x_i^n | x_{π_i}^n) = ∑_i ∑_{v ∈ V_i} ∑_n [y_i^n = v] log p_i(v) (indicator trick) = ∑_i ∑_{v ∈ V_i} N_i(v) log p_i(v), where N_i(v) = ∑_n [y_i^n = v] and p_i(v_i) = p(x_i | x_{π_i}). Trees are in the exponential family with the y_i as sufficient statistics.
28 Maximum Likelihood Parameters Given Structure Trees are just a special case of fully observed graphical models. For discrete data x_i with values v_i, each node stores a conditional probability table (CPT) over its values given its parent's value. The ML parameter estimates are just the empirical histograms of each node's values given its parent: p*(x_i = v_i | x_{π_i} = v_j) = N(x_i = v_i, x_{π_i} = v_j) / ∑_{v_i} N(x_i = v_i, x_{π_i} = v_j) = N_i(y_i) / N_{π_i}(v_j), except for the root, which uses marginal counts N_r(v_r)/N. For continuous data, the most common model is a two-dimensional Gaussian at each node. The ML parameters are just to set the mean of p_i(y_i) to be the sample mean of [x_i; x_{π_i}] and the covariance matrix to the sample covariance. In practice we should use some kind of smoothing/regularization.
29 [figure: a directed tree learned over newsgroup vocabulary (nodes include "health", "medicine", "cancer", "god", "bible", "space", "nasa", "windows", "drive", "hockey", "baseball", "car", ...), illustrating the pairwise dependencies a tree model can capture]
30 Unobserved Variables We have been assuming that we observe all the random variables in our model at training time, and all the inputs at test time. But certain variables Q in our models may be unobserved, either some of the time or always, either at training time or at test time. [figure: a chain of hidden variables Q1, ..., Q6 with observed children X1, ..., X6] (Graphically, we will use shading to indicate observation.)
31 Partially Unobserved (Missing) Variables If variables are occasionally unobserved they are missing data, e.g. undefined inputs, missing class labels, erroneous target values. In this case, we can still model the joint distribution, but we define a new cost function in which we sum out or marginalize the missing values at training or test time: ℓ(θ; D) = ∑_complete log p(x_c, y_c | θ) + ∑_missing log p(x_m | θ) = ∑_complete log p(x_c, y_c | θ) + ∑_missing log ∑_y p(x_m, y | θ). [Recall that p(x) = ∑_q p(x, q).]
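The inner sum over the missing variable is usually computed in log space for stability; a sketch assuming the model supplies log p(x, y = k | θ) for each class k:

```python
import numpy as np

def log_marginal(log_joint_per_class):
    """Stable log sum_y p(x, y | theta) from an array of per-class log joints."""
    a = np.max(log_joint_per_class)
    return a + np.log(np.sum(np.exp(log_joint_per_class - a)))

# Complete cases contribute log p(x_c, y_c | theta) directly; cases with
# a missing label contribute log_marginal of their per-class log joints.
```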