Lecture 2: Parameter Learning in Fully Observed Graphical Models. Sam Roweis. Monday July 24, 2006 Machine Learning Summer School, Taiwan


2 Likelihood Functions So far we have focused on the (log) probability function p(x | θ), which assigns a probability (density) to any joint configuration of variables x given fixed parameters θ. But in learning we turn this on its head: we have some fixed data and we want to find parameters. Think of p(x | θ) as a function of θ for fixed x: L(θ; x) = p(x | θ), l(θ; x) = log p(x | θ). This function is called the (log) likelihood. Choose θ to maximize some objective function C(θ) which includes l(θ): C(θ) = l(θ; D) gives maximum likelihood (ML); C(θ) = l(θ; D) + log p(θ) gives maximum a posteriori (MAP) / penalized ML. (Other choices include cross-validation, Bayesian estimators, BIC, AIC, ...)

3 Maximum Likelihood For IID data: p(D | θ) = ∏_m p(x^(m) | θ), so l(θ; D) = Σ_m log p(x^(m) | θ). Idea of maximum likelihood estimation (MLE): pick the setting of parameters most likely to have generated the data we saw: θ_ML = argmax_θ l(θ; D). Commonly used as a baseline model in statistics. Often leads to intuitive, appealing, or natural estimators. For a start, the IID assumption makes the log likelihood into a sum, so its derivative can be easily taken term by term.

4 Example: Bernoulli Trials We observe M iid coin flips: D = H, H, T, H, ... Model: p(H) = θ, p(T) = 1 − θ. Likelihood: l(θ; D) = log p(D | θ) = log ∏_m θ^(x^(m)) (1 − θ)^(1 − x^(m)) = log θ Σ_m x^(m) + log(1 − θ) Σ_m (1 − x^(m)) = N_H log θ + N_T log(1 − θ). Take derivatives and set to zero: ∂l/∂θ = N_H/θ − N_T/(1 − θ) = 0, giving θ_ML = N_H / (N_H + N_T).
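As a concrete check of the closed-form result, here is a minimal numerical sketch (not from the original slides) that computes the Bernoulli MLE by counting heads and verifies it against a brute-force grid search over the log likelihood; the data and variable names are made up for illustration.

```python
import numpy as np

# Coin flips encoded as 1 = heads, 0 = tails.
flips = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# Closed-form MLE: theta = N_H / (N_H + N_T).
n_heads = flips.sum()
n_tails = len(flips) - n_heads
theta_closed_form = n_heads / (n_heads + n_tails)

# Brute-force check: evaluate l(theta; D) on a grid and take the argmax.
thetas = np.linspace(0.001, 0.999, 999)
log_lik = n_heads * np.log(thetas) + n_tails * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]

print(theta_closed_form, theta_grid)  # both close to 0.7 for this toy data
```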

5 Example: Univariate Normal We observe M iid real samples: D = 1.18, −0.25, 0.78, ... Model: p(x) = (2πσ²)^(−1/2) exp{−(x − µ)² / 2σ²}. Likelihood (using the probability density): l(θ; D) = log p(D | θ) = −(M/2) log(2πσ²) − (1/2σ²) Σ_m (x^(m) − µ)². Take derivatives and set to zero: ∂l/∂µ = (1/σ²) Σ_m (x^(m) − µ); ∂l/∂σ² = −M/(2σ²) + (1/2σ⁴) Σ_m (x^(m) − µ)². Setting these to zero gives µ_ML = (1/M) Σ_m x^(m) and σ²_ML = (1/M) Σ_m (x^(m))² − µ²_ML.
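The same exercise for the univariate Gaussian, as a hedged sketch: compute µ and σ² from the closed-form ML formulas and compare against numpy's built-in (biased, 1/M) moment estimates. The data values here are invented.

```python
import numpy as np

x = np.array([1.18, -0.25, 0.78, 0.34, -1.02, 0.65])
M = len(x)

# ML estimates obtained by setting the gradients to zero.
mu_ml = x.sum() / M
sigma2_ml = (x ** 2).sum() / M - mu_ml ** 2   # equivalently ((x - mu_ml)**2).mean()

# numpy's mean and (biased) variance agree with the ML formulas.
assert np.isclose(mu_ml, x.mean())
assert np.isclose(sigma2_ml, x.var())         # note: ddof=0, not the unbiased 1/(M-1) version
print(mu_ml, sigma2_ml)
```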

6 Example: Linear Regression [figure: scatter plot of inputs X against outputs Y with a fitted regression line]

7 Example: Linear Regression At a linear regression node, some parents (covariates/inputs x) and all children (responses/outputs y) are continuous valued variables. For each child and each setting of the discrete parents we use the model: p(y | x, θ) = gauss(y | θᵀx, σ²). The log likelihood is the familiar squared error cost: l(θ; D) = −(1/2σ²) Σ_m (y^(m) − θᵀx^(m))². The ML parameters can be solved for using linear least-squares: ∂l/∂θ = Σ_m (y^(m) − θᵀx^(m)) x^(m) = 0, giving θ_ML = (XᵀX)^(−1) XᵀY. The sufficient statistics are the input correlation matrix and the input-output cross-correlation vector.
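A small sketch (synthetic setup, not from the slides) of solving the normal equations θ_ML = (XᵀX)^(−1) XᵀY; in practice one solves the linear system rather than forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 200, 3
X = rng.normal(size=(M, d))
X = np.hstack([X, np.ones((M, 1))])          # append a constant column for the bias
true_theta = np.array([2.0, -1.0, 0.5, 0.3])
y = X @ true_theta + 0.1 * rng.normal(size=M)

# Sufficient statistics: input correlation matrix and input-output cross-correlation.
XtX = X.T @ X
Xty = X.T @ y

theta_ml = np.linalg.solve(XtX, Xty)          # safer than computing (X^T X)^{-1} explicitly
print(theta_ml)                               # close to true_theta
```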

8 MLE for Directed GMs For a directed GM, the likelihood function has a nice form: log p(D | θ) = Σ_m log ∏_i p(x_i^(m) | x_(π_i)^(m), θ_i) = Σ_m Σ_i log p(x_i^(m) | x_(π_i)^(m), θ_i). The parameters decouple, so we can maximize the likelihood independently for each node's function by setting θ_i. We only need the values of x_i and its parents in order to estimate θ_i. In general, for fully observed data, if we know how to estimate the parameters at a single node we can do it for the whole network. [figure: (a) a small DAG over X1, ..., X4; (b) the same network split into its per-node families]
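To illustrate the decoupling, here is a hedged sketch that fits the CPTs of a tiny, hypothetical fully observed DAG (X1 → X2, X1 → X3) purely by counting, one node at a time; both the structure and the data are invented for illustration.

```python
from collections import Counter

# Each row is a fully observed joint sample (x1, x2, x3); all variables are binary.
data = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 0, 0)]

# Node X1 has no parents: estimate p(x1) from marginal counts.
c1 = Counter(x1 for x1, _, _ in data)
p_x1 = {v: c1[v] / len(data) for v in (0, 1)}

# Each child's CPT is estimated from (parent, child) counts, independently of the other nodes.
def cpt(child_idx, parent_idx):
    joint = Counter((row[parent_idx], row[child_idx]) for row in data)
    parent = Counter(row[parent_idx] for row in data)
    return {(pa, ch): joint[(pa, ch)] / parent[pa] for pa in (0, 1) for ch in (0, 1)}

p_x2_given_x1 = cpt(1, 0)
p_x3_given_x1 = cpt(2, 0)
print(p_x1, p_x2_given_x1, p_x3_given_x1)
```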

9 Reminder: Classification Given examples of a discrete class label y and some features x. Goal: compute the label y for new inputs x. Two approaches: Generative: model p(x, y) = p(y)p(x | y); use Bayes rule to infer the conditional p(y | x). Discriminative: model discriminants f(y | x) directly and take the max. The generative approach is related to conditional density estimation, while the discriminative approach is closer to regression.

10 Probabilistic Classification: Bayes Classifiers Generative model: p(x, y) = p(y)p(x | y). p(y) are called class priors. p(x | y) are called class conditional feature distributions. For the prior we use a Bernoulli or multinomial: p(y = k | π) = π_k with Σ_k π_k = 1. Classification rules: ML: argmax_y p(x | y) (can behave badly with skewed priors); MAP: argmax_y p(y | x) = argmax_y [log p(x | y) + log p(y)] (safer). Fitting: maximize Σ_n log p(x^(n), y^(n)) = Σ_n [log p(x^(n) | y^(n)) + log p(y^(n))]. 1) Sort data into batches by class label. 2) Estimate p(y) by counting the size of the batches (plus regularization). 3) Estimate p(x | y) separately within each batch using ML (also with regularization).
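A compact sketch of that three-step fitting recipe for a generative classifier, here assuming Gaussian class-conditionals with a per-class diagonal covariance; the data shapes, smoothing constant, and variance floor are assumptions for illustration, not part of the slides.

```python
import numpy as np

def fit_generative(X, y, n_classes, alpha=1.0, var_floor=1e-6):
    """Step 1: sort by class; step 2: count class sizes; step 3: fit p(x|y) per class."""
    priors, means, variances = [], [], []
    for k in range(n_classes):
        Xk = X[y == k]
        priors.append((len(Xk) + alpha) / (len(X) + alpha * n_classes))  # smoothed prior
        means.append(Xk.mean(axis=0))
        variances.append(Xk.var(axis=0) + var_floor)                     # diagonal covariance
    return np.array(priors), np.array(means), np.array(variances)

def map_predict(x, priors, means, variances):
    # argmax_y [log p(x|y) + log p(y)] -- the MAP rule from the slide.
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    return np.argmax(log_post)
```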

11 Three Key Regularization Ideas To avoid overfitting, we can put priors on the parameters of the class prior and class conditional feature distributions. We can also tie some parameters together so that fewer of them are estimated using more data. Finally, we can make factorization or independence assumptions about the distributions. In particular, for the class conditional distributions we can assume the features are fully dependent, partly dependent, or independent (!). [figure: panels (a), (b), (c) showing three class-conditional structures for features X1, ..., Xm given Y, ranging from fully dependent to fully independent]

12 Gaussian Class-Conditional Distributions If all features are continuous, a popular choice is a Gaussian class-conditional: p(x | y = k, θ) = |2πΣ|^(−1/2) exp{−(1/2)(x − µ_k)ᵀ Σ^(−1) (x − µ_k)}. Fitting: use the following amazing and useful fact. The maximum likelihood fit of a Gaussian to some data is the Gaussian whose mean is equal to the data mean and whose covariance is equal to the sample covariance. [Try to prove this as an exercise in understanding likelihood, algebra, and calculus all at once!] Seems easy. And it works amazingly well. But we can do even better with some simple regularization...

13 Regularized Gaussians Idea 1: assume all the class covariances are the same (tie parameters). This is exactly Fisher's linear discriminant analysis. [figure: panels (a) and (b) comparing class-conditional Gaussians with separate vs. tied covariances] Idea 2: make independence assumptions to get diagonal or identity-multiple covariances. (Or sparse inverse covariances.) More on this in a few minutes... Idea 3: add a bit of the identity matrix to each sample covariance. This fattens it up in directions where there wasn't enough data. Equivalent to using a Wishart prior on the covariance matrix.
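Idea 1 and Idea 3 together in a short sketch: pool the per-class scatter into one tied covariance and then shrink it toward a multiple of the identity; the shrinkage weight lam is a made-up hyperparameter, not something specified in the slides.

```python
import numpy as np

def tied_shrunk_covariance(X, y, n_classes, lam=0.1):
    """Shared (tied) covariance across classes, regularized toward a multiple of I."""
    d = X.shape[1]
    scatter = np.zeros((d, d))
    for k in range(n_classes):
        Xk = X[y == k]
        diff = Xk - Xk.mean(axis=0)          # center each class around its own mean
        scatter += diff.T @ diff             # pooled within-class scatter
    sigma = scatter / len(X)
    # Idea 3: fatten thin directions by mixing in a bit of the (scaled) identity.
    return (1 - lam) * sigma + lam * np.trace(sigma) / d * np.eye(d)
```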

14 Gaussian Bayes Classifier Maximum likelihood estimates for the parameters: priors π_k: use the observed frequencies of the classes (plus smoothing); means µ_k: use the class means; covariance Σ: use data from a single class, or pooled data (x^(m) − µ_{y^(m)}), to estimate full/diagonal covariances. Compute the posterior via Bayes rule: p(y = k | x, θ) = p(x | y = k, θ) p(y = k | π) / Σ_j p(x | y = j, θ) p(y = j | π) = exp{µ_kᵀ Σ^(−1) x − µ_kᵀ Σ^(−1) µ_k / 2 + log π_k} / Σ_j exp{µ_jᵀ Σ^(−1) x − µ_jᵀ Σ^(−1) µ_j / 2 + log π_j} = e^(β_kᵀ x) / Σ_j e^(β_jᵀ x) = exp{β_kᵀ x} / Z, where β_k = [Σ^(−1) µ_k ; −µ_kᵀ Σ^(−1) µ_k / 2 + log π_k] and we have augmented x with a constant component always equal to 1 (a bias term).
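A sketch of the log-linear form above: build β_k from Σ^(−1)µ_k plus the bias term, then classify with a softmax over β_kᵀ[x; 1]. The fitted means, shared covariance, and priors are assumed to come from a procedure like the earlier fitting sketches.

```python
import numpy as np

def gaussian_bayes_betas(means, cov, priors):
    """means: (K, d), cov: (d, d) shared, priors: (K,). Returns beta of shape (K, d+1)."""
    prec = np.linalg.inv(cov)
    w = means @ prec                                        # row k is (Sigma^{-1} mu_k)^T
    b = -0.5 * np.sum(w * means, axis=1) + np.log(priors)   # -mu_k^T Sigma^{-1} mu_k / 2 + log pi_k
    return np.hstack([w, b[:, None]])

def posterior(x, beta):
    scores = beta @ np.append(x, 1.0)                       # beta_k^T [x; 1]
    scores -= scores.max()                                  # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```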

15 Softmax/Logit The squashing functions are known as the softmax and the logit (sigmoid): φ_k(z) = e^(z_k) / Σ_j e^(z_j); g(η) = 1 / (1 + e^(−η)). They are invertible (up to a constant): z_k = log φ_k + c; η = log(g / (1 − g)). Derivatives are easy: ∂φ_k/∂z_j = φ_k(δ_kj − φ_j); dg/dη = g(1 − g). [figure: plot of the logistic squashing function φ(z) versus z]
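The two squashing functions and their derivative identities as a quick sketch with finite-difference checks; nothing here goes beyond the formulas on the slide.

```python
import numpy as np

def logistic(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

# dg/d eta = g (1 - g): check by central finite differences.
eta, h = 0.3, 1e-6
g = logistic(eta)
assert np.isclose((logistic(eta + h) - logistic(eta - h)) / (2 * h), g * (1 - g))

# d phi_k / d z_j = phi_k (delta_kj - phi_j): full Jacobian check.
z = np.array([0.2, -0.5, 1.1])
phi = softmax(z)
jac = np.diag(phi) - np.outer(phi, phi)
num = np.array([(softmax(z + h * np.eye(3)[j]) - softmax(z - h * np.eye(3)[j])) / (2 * h)
                for j in range(3)]).T
assert np.allclose(jac, num, atol=1e-5)
```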

16 Log Linear Geometry Taking the ratio of any two posteriors (the odds) shows that the contours of equal pairwise probability are linear surfaces in the feature space: p(y = k | x, θ) / p(y = j | x, θ) = exp{(β_k − β_j)ᵀ x}. The pairwise discrimination contours p(y_k | x) = p(y_j | x) are orthogonal to the differences of the means in feature space when Σ = σI. For a general Σ shared between all classes, the same is true in the transformed feature space w = Σ^(−1) x. The priors do not change the geometry; they only shift the operating point on the logit by the log-odds log(π_k / π_j). Thus, for equal class-covariances, we obtain a linear classifier. If we use different covariances, the decision surfaces are conic sections and we have a quadratic classifier.

17 Discrete Bayesian Classifier If the inputs are discrete (categorical), what should we do? The simplest class conditional model is a joint multinomial (table): p(x_1 = a, x_2 = b, ... | y = c) = η^c_ab... This is conceptually correct, but there's a big practical problem. Fitting: the ML parameters are observed counts: η^c_ab... = Σ_n [y^(n) = c][x_1^(n) = a][x_2^(n) = b]... / Σ_n [y^(n) = c]. Consider 16×16 digit images at 256 gray levels. How many entries in the table? (256^256 per class.) How many will be zero? What happens at test time? Doh! We obviously need some regularization. Smoothing will not help much here. Unless we know about the relationships between inputs beforehand, sharing parameters is hard also. But what about independence?

18 Naive (Idiot's) Bayes Classifier Assumption: conditioned on the class, the attributes are independent: p(x | y) = ∏_i p(x_i | y). [figure: class node Y with independent children X1, X2, ..., Xm] Sounds crazy, right? Right! But it works. Algorithm: sort data cases into bins according to y^(n). Compute the marginal probabilities p(y = c) using frequencies. For each class, estimate the distribution of the i-th variable: p(x_i | y = c). At test time, compute argmax_c p(c | x) using c(x) = argmax_c p(c | x) = argmax_c [log p(x | c) + log p(c)] = argmax_c [log p(c) + Σ_i log p(x_i | c)].

19 Discrete (Multinomial) Naive Bayes Discrete features x_i, assumed independent given the class label y. Model: p(x_i = j | y = k) = η_ijk, so p(x | y = k, η) = ∏_i ∏_j η_ijk^[x_i = j]. Classification rule: p(y = k | x, η) = π_k ∏_i ∏_j η_ijk^[x_i = j] / Σ_q π_q ∏_i ∏_j η_ijq^[x_i = j] = e^(β_kᵀ x̃) / Σ_q e^(β_qᵀ x̃). The ML parameters are class-conditional frequency counts: η*_ijk = Σ_m [x_i^(m) = j][y^(m) = k] / Σ_m [y^(m) = k], with β_k = [log η_11k; ...; log η_1jk; ...; log η_ijk; ...; log π_k] and x̃ = [[x_1 = 1]; [x_1 = 2]; ...; [x_i = j]; ...; 1] (one-hot indicators plus a bias). Log-linear!
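A minimal sketch of the counting estimator η_ijk and the resulting classification rule; the data layout, number of feature values, and the Laplace smoothing constant are assumptions for illustration.

```python
import numpy as np

def fit_discrete_naive_bayes(X, y, n_values, n_classes, alpha=1.0):
    """X: (M, D) integer feature matrix, y: (M,) labels. Returns log pi and log eta."""
    M, D = X.shape
    log_pi = np.log(np.bincount(y, minlength=n_classes) + alpha) - np.log(M + alpha * n_classes)
    log_eta = np.zeros((D, n_values, n_classes))
    for k in range(n_classes):
        Xk = X[y == k]
        for i in range(D):
            counts = np.bincount(Xk[:, i], minlength=n_values) + alpha   # smoothed eta*_ijk counts
            log_eta[i, :, k] = np.log(counts) - np.log(counts.sum())
    return log_pi, log_eta

def predict(x, log_pi, log_eta):
    # argmax_k [log pi_k + sum_i log eta_{i, x_i, k}] -- the log-linear rule.
    scores = log_pi + np.array([log_eta[np.arange(len(x)), x, k].sum()
                                for k in range(len(log_pi))])
    return np.argmax(scores)
```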

20 Gaussian Naive Bayes This is just a Gaussian Bayes classifier with a separate diagonal covariance matrix for each class. Equivalent to fitting a one-dimensional Gaussian to each input dimension for each possible class. Decision surfaces are quadratics, not linear...

21 Discriminative Models Parametrize p(y | x) directly; forget p(x, y) and Bayes rule. As long as p(y | x) or the discriminants f(y | x) are linear functions of x (or monotone transforms thereof), the decision surfaces will be piecewise linear. We don't need to model the density of the features: some density models have lots of parameters, and many densities give the same linear classifier. But we cannot generate new labeled data. We optimize a cost function closer to the one we use at test time.

22 Logistic/Softmax Regression Model: y is a multinomial random variable whose posterior is the softmax of linear functions of any feature vector x: p(y = k | x, θ) = e^(θ_kᵀ x) / Σ_j e^(θ_jᵀ x). Fitting: now we optimize the conditional likelihood: l(θ; D) = Σ_{m,k} [y^(m) = k] log p(y = k | x^(m), θ) = Σ_{m,k} y_k^(m) log p_k^(m). By the chain rule, ∂l/∂θ_k = Σ_{m,j} (∂l_j^(m)/∂p_j^(m)) (∂p_j^(m)/∂z_k^(m)) (∂z_k^(m)/∂θ_k) = Σ_{m,j} (y_j^(m)/p_j^(m)) p_j^(m) (δ_jk − p_k^(m)) x^(m) = Σ_m (y_k^(m) − p_k^(m)) x^(m).
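The gradient Σ_m (y_k^(m) − p_k^(m)) x^(m) translates directly into a few lines of batch gradient ascent; this is a sketch with invented data shapes and learning rate, not an optimized implementation.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, Y, lr=0.1, n_steps=500):
    """X: (M, d) features, Y: (M, K) one-hot labels. Maximizes the conditional likelihood."""
    M, d = X.shape
    K = Y.shape[1]
    theta = np.zeros((K, d))
    for _ in range(n_steps):
        P = softmax_rows(X @ theta.T)        # P[m, k] = p(y = k | x^(m), theta)
        grad = (Y - P).T @ X                 # grad[k] = sum_m (y_k^(m) - p_k^(m)) x^(m)
        theta += lr * grad / M               # gradient *ascent* on the log likelihood
    return theta
```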

23 Chains: Markov Models If variables have some temporal/spatial order, we can model their joint distribution as a dynamical/diffusion system. Simple idea: the next output depends only on the k previous outputs: y_t = f[y_{t−1}, y_{t−2}, ..., y_{t−k}]; k is called the order of the Markov model. [figure: a chain y_1, y_2, ..., y_T in which each node's parents are its k predecessors] Add noise to make the system probabilistic: p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}).

24 Learning Markov Models The ML parameter estimates for a simple Markov model are easy: p(y_1, y_2, ..., y_T) = p(y_1, ..., y_k) ∏_{t=k+1}^T p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}), so log p({y}) = log p(y_1, ..., y_k) + Σ_{t=k+1}^T log p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}). Each window of k + 1 outputs is a training case for the model p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}). Example: for discrete outputs (symbols) and a 2nd-order Markov model we can use the multinomial model: p(y_t = m | y_{t−1} = a, y_{t−2} = b) = α_mab. The maximum likelihood values for α are: α*_mab = num[t s.t. y_t = m, y_{t−1} = a, y_{t−2} = b] / num[t s.t. y_{t−1} = a, y_{t−2} = b].
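A hedged sketch of the counting estimate α_mab for a 2nd-order chain over a small symbol alphabet; the sequence is made up and no smoothing is applied.

```python
import numpy as np
from collections import Counter

seq = list("abbabababbbaabab")          # toy symbol sequence
symbols = sorted(set(seq))
idx = {s: i for i, s in enumerate(symbols)}
S = len(symbols)

# Count windows (y_{t-2}, y_{t-1}, y_t) and their contexts (y_{t-2}, y_{t-1}).
triples = Counter((seq[t - 2], seq[t - 1], seq[t]) for t in range(2, len(seq)))
pairs = Counter((seq[t - 2], seq[t - 1]) for t in range(2, len(seq)))

# alpha[m, a, b] = p(y_t = m | y_{t-1} = a, y_{t-2} = b); zero where a context never occurs.
alpha = np.zeros((S, S, S))
for (b, a, m), c in triples.items():
    alpha[idx[m], idx[a], idx[b]] = c / pairs[(b, a)]
print(alpha[:, idx['a'], idx['b']])     # next-symbol distribution given y_{t-1}='a', y_{t-2}='b'
```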

25 Maximum Entropy Markov Models We can extend this idea to a "logistic regression through time" type of conditional model called a maximum entropy Markov model. [figure: a chain of states s_1, ..., s_T, each conditioned on the previous state and on features f_t of the input sequence] The joint distribution is now a conditional model: p(s_1^T | x_1^T) = ∏_t p(s_t | s_{t−1}, f_t(x_1^T)). The features f_t can be very nonlocal functions of the underlying input sequence; for example, they can consult things in the past and in the future.

26 Directed Tree Graphical Models Directed trees are DAGMs in which each variable x_i has exactly one other variable as its parent x_{π_i}, except the root x_root, which has no parents. Thus, the probability of a variable taking on a certain value depends only on the value of its parent: p(x) = p(x_root) ∏_{i ≠ root} p(x_i | x_{π_i}). Trees are the next step up from assuming independence: instead of considering variables in isolation, we consider them in pairs. NB: each node (except the root) has exactly one parent, but nodes may have more than one child. [figure: a small directed tree over nodes x_{1,n}, ..., x_{6,n}]

27 Likelihood function Notation: y_i denotes the pair of a node x_i and its single parent x_{π_i}; V_i is the set of joint configurations of node i and its parent π_i (for the root, y_root = x_root and V_root = v_root). Directed model likelihood: l(θ; D) = Σ_n log p(x^(n)) = Σ_n log p_r(x_r^(n)) + Σ_n Σ_{i ≠ r} log p(x_i^(n) | x_{π_i}^(n)) = Σ_i Σ_{v ∈ V_i} Σ_n [y_i^(n) = v] log p_i(v)  (indicator trick) = Σ_i Σ_{v ∈ V_i} N_i(v) log p_i(v), where N_i(v) = Σ_n [y_i^(n) = v] and p_i(v_i) = p(x_i | x_{π_i}). Trees are in the exponential family with the y_i as sufficient statistics.

28 Maximum Likelihood Parameters Given Structure Trees are just a special case of fully observed graphical models. For discrete data x_i with values v_i, each node stores a conditional probability table (CPT) over its values given its parent's value. The ML parameter estimates are just the empirical histograms of each node's values given its parent: p*(x_i = v_i | x_{π_i} = v_j) = N(x_i = v_i, x_{π_i} = v_j) / Σ_{v_i} N(x_i = v_i, x_{π_i} = v_j) = N_i(y_i) / N_{π_i}(v_j), except for the root, which uses the marginal counts N_r(v_r)/N. For continuous data, the most common model is a two-dimensional Gaussian at each node. The ML parameters just set the mean of p_i(y_i) to the sample mean of [x_i; x_{π_i}] and the covariance matrix to the sample covariance. In practice we should use some kind of smoothing/regularization.
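For the continuous case mentioned above, a minimal sketch of fitting the per-edge two-dimensional Gaussian: stack the child and parent values and take the sample mean and covariance (the ML fit, per the earlier Gaussian fact). The data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
parent = rng.normal(size=500)
child = 0.8 * parent + 0.3 * rng.normal(size=500)   # child correlated with its parent

# Per-edge sufficient statistics: mean and covariance of [x_i; x_parent].
pair = np.stack([child, parent], axis=1)            # shape (N, 2)
mean_ml = pair.mean(axis=0)
cov_ml = (pair - mean_ml).T @ (pair - mean_ml) / len(pair)   # ML (1/N) covariance

print(mean_ml)   # near [0, 0]
print(cov_ml)    # the off-diagonal term captures the child-parent dependence
```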

29 [figure: an example tree-structured model over word-occurrence variables from newsgroup data, with words such as "nasa", "space", "hockey", "windows", "car", "god", "israel", ... grouped along the branches]

30 Unobserved Variables We have been assuming that we observe all the random variables in our model at training time, and all the inputs at test time. But certain variables Q in our models may be unobserved, either some of the time or always, and either at training time or at test time. [figure: latent nodes Q1, ..., Q6 with observed children X1, ..., X6] (Graphically, we will use shading to indicate observation.)

31 Partially Unobserved (Missing) Variables If variables are occasionally unobserved they are missing data, e.g. undefined inputs, missing class labels, erroneous target values. In this case, we can still model the joint distribution, but we define a new cost function in which we sum out or marginalize the missing values at training or test time: l(θ; D) = Σ_complete log p(x^c, y^c | θ) + Σ_missing log p(x^m | θ) = Σ_complete log p(x^c, y^c | θ) + Σ_missing log Σ_y p(x^m, y | θ). [Recall that p(x) = Σ_q p(x, q).]
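A small sketch of the marginalized term log Σ_y p(x, y | θ) for a case with a missing class label, computed with a log-sum-exp for numerical stability; the model pieces (log priors and per-class log densities) are assumed, here a toy 1-D Gaussian class-conditional model with made-up parameters.

```python
import numpy as np

def marginal_loglik(x, log_priors, class_loglik):
    """log sum_y p(x, y | theta), where class_loglik(x, k) = log p(x | y = k, theta)."""
    terms = np.array([log_priors[k] + class_loglik(x, k) for k in range(len(log_priors))])
    m = terms.max()
    return m + np.log(np.exp(terms - m).sum())   # stable log-sum-exp

# Toy 1-D Gaussian class-conditional model (means, variances, priors are invented).
means, variances, log_priors = np.array([-1.0, 2.0]), np.array([1.0, 0.5]), np.log([0.4, 0.6])

def class_loglik(x, k):
    return -0.5 * np.log(2 * np.pi * variances[k]) - 0.5 * (x - means[k]) ** 2 / variances[k]

print(marginal_loglik(0.3, log_priors, class_loglik))
```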
