Probabilistic Learning


Artificial Intelligence: Representation and Problem Solving
April 17, 2007
Probabilistic Learning

Reminder: no class on Thursday (spring carnival).

Recall the basic algorithm for learning decision trees:
1. start with the whole training data
2. select the attribute or value along a dimension that gives the best split, using information gain or another criterion
3. create child nodes based on the split
4. recurse on each child using that child's data, until a stopping criterion is reached: all examples have the same class, the amount of data is too small, or the tree has grown too large
Does this capture probabilistic relationships?

Quantifying the certainty of decisions. Suppose that instead of a yes or no answer we want some estimate of how strongly we believe a loan applicant is a credit risk. This might be useful if we want some flexibility in adjusting our decision criteria.
- E.g., suppose we're willing to take more risk if times are good.
- Or, we may want to examine cases we believe are higher risk more carefully.
(Figure: the "Predicting credit risk" table, with columns "<2 years at current job?", "missed payments?", and "defaulted?", each entry Y or N.)
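As a concrete reference for step 2 of the decision-tree recap above, here is a minimal sketch of split selection by information gain. It is a generic illustration under simplified assumptions, not the lecture's code; the attribute names and rows are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy from splitting on one discrete attribute."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

# Hypothetical mini-dataset: pick the attribute with the largest gain.
rows = [{"job<2yrs": "Y", "missed": "Y"}, {"job<2yrs": "N", "missed": "Y"},
        {"job<2yrs": "Y", "missed": "N"}, {"job<2yrs": "N", "missed": "N"}]
labels = ["default", "default", "no-default", "no-default"]
best = max(["job<2yrs", "missed"], key=lambda a: information_gain(rows, labels, a))
print(best)   # "missed" perfectly separates the two classes here
```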

The mushroom data. Or suppose we wanted to know how likely a mushroom is to be safe to eat. Do decision trees give us that information?

Mushroom data:
     EDIBLE?    CAP-SHAPE  CAP-SURFACE
 1   edible     flat       fibrous
 2   poisonous  convex     smooth
 3   edible     flat       fibrous
 4   edible     convex     scaly
 5   poisonous  convex     smooth
 6   edible     convex     fibrous
 7   poisonous  flat       scaly
 8   poisonous  flat       scaly
 9   poisonous  convex     fibrous
10   poisonous  convex     fibrous
11   poisonous  flat       smooth
12   edible     convex     smooth
13   poisonous  knobbed    scaly
14   poisonous  flat       smooth
15   poisonous  flat       fibrous

(Figure: Fisher's iris data, a scatter plot of petal width (cm) versus petal length (cm) for Iris setosa, Iris versicolor, and Iris virginica.)

In which example would you be more confident about the class? Decision trees provide a classification, but not uncertainty.

The general classification problem.
- Desired output: y = {y_1, ..., y_K}, a binary classification vector with y_i = 1 if x ∈ C_i (class i) and y_i = 0 otherwise.
- Model: θ = {θ_1, ..., θ_M}; the model (e.g., a decision tree) is defined by M parameters.
- Data: D = {x_1, ..., x_T}, with x_i = {x_1, ..., x_N}_i; the input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous).
Given data, we want to learn a model that can correctly classify novel observations. How do we approach this probabilistically?

The answer to all questions of uncertainty. Let's apply Bayes rule to infer the most probable class given the observation:

  p(C_k | x) = p(x | C_k) p(C_k) / p(x) = p(x | C_k) p(C_k) / Σ_k p(x | C_k) p(C_k)

This is the answer, but what does it mean? How do we specify the terms?
- p(C_k) is the prior probability of the different classes.
- p(x | C_k) is the data likelihood, i.e., the probability of x given class C_k. How should we define this?
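To make the Bayes-rule computation concrete, here is a minimal sketch that turns class priors and class-conditional likelihoods into posterior class probabilities. The numbers are made up for illustration; they are not taken from the lecture's data.

```python
def posterior(priors, likelihoods):
    """p(C_k | x) from the priors p(C_k) and the likelihoods p(x | C_k)."""
    unnormalized = [p * lik for p, lik in zip(priors, likelihoods)]
    z = sum(unnormalized)                  # p(x) = sum_k p(x | C_k) p(C_k)
    return [u / z for u in unnormalized]

priors = [0.3, 0.7]                        # e.g. p(defaulted), p(did not default)
likelihoods = [0.40, 0.10]                 # hypothetical p(x | C_1), p(x | C_2)
print(posterior(priors, likelihoods))      # [0.631..., 0.368...]
```

Note that the normalizing constant p(x) drops out if we only need to identify the most probable class.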

What classifier would give optimal performance? Consider the iris data again. How would we minimize the number of future misclassifications? We would need to know the true distribution of the classes. Assume the classes follow Gaussian distributions. The number of samples in each class is the same (50), so assume p(C_k) is equal for all classes. Because p(x) is the same for all classes, we have

  p(C_k | x) = p(x | C_k) p(C_k) / p(x) ∝ p(x | C_k) p(C_k)

(Figure: the class-conditional densities p(petal length | C_2) and p(petal length | C_3) plotted against petal length.)

Where do we put the boundary?

Where do we put the boundary? The decision boundary splits the axis into two regions: in R_32, C_3 is misclassified as C_2; in R_23, C_2 is misclassified as C_3. Shifting the boundary trades off the two errors.
(Figure: the densities p(petal length | C_2) and p(petal length | C_3), with the decision boundary and the regions R_32 and R_23 marked.)

Where do we put the boundary? The misclassification error is defined by

  p(error) = ∫_{R_32} p(C_3 | x) dx + ∫_{R_23} p(C_2 | x) dx

which in our case is proportional to the data likelihood. Note that if the boundary is shifted too far, part of the region we are still classifying as C_2 would actually yield p(C_3 | x) > p(C_2 | x).
(Figure: the densities p(petal length | C_2) and p(petal length | C_3), with R_32 (C_3 misclassified as C_2) and R_23 (C_2 misclassified as C_3) shaded for two different boundary placements.)

The optimal decision boundary. The misclassification error is minimal at the point where p(C_3 | x) = p(C_2 | x). At the optimal decision boundary,

  p(x | C_3) p(C_3) / p(x) = p(x | C_2) p(C_2) / p(x)

and therefore p(x | C_3) = p(x | C_2), since the priors are equal. Note: this assumes we have only two classes.
(Figure: the posteriors p(C_2 | petal length) and p(C_3 | petal length) together with the class-conditional densities; the optimal boundary sits at their crossing point.)

Bayesian classification for more complex models. Recall the posterior class probability:

  p(C_k | x) = p(x | C_k) p(C_k) / p(x) = p(x | C_k) p(C_k) / Σ_k p(x | C_k) p(C_k)

How do we define the data likelihood p(x | C_k), i.e., the probability of x given class C_k?
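As a numerical companion to the boundary argument above, here is a minimal sketch that locates the crossing point of two Gaussian class-conditional densities under equal priors. The means and standard deviations are assumed values for illustration, not parameters fitted to the actual iris measurements.

```python
import numpy as np
from scipy.stats import norm

mu2, sigma2 = 4.3, 0.5        # assumed petal-length Gaussian for C_2
mu3, sigma3 = 5.6, 0.6        # assumed petal-length Gaussian for C_3

x = np.linspace(3.0, 7.0, 4001)
p2 = norm.pdf(x, mu2, sigma2)
p3 = norm.pdf(x, mu3, sigma3)

# With equal priors, the optimal boundary is where the densities cross.
between = (x > mu2) & (x < mu3)
boundary = x[between][np.argmin(np.abs(p2[between] - p3[between]))]
print(f"optimal boundary near petal length = {boundary:.2f} cm")
```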

Defining a probabilistic classification model. How would we define the credit risk problem?
- Class: C_1 = defaulted, C_2 = didn't default
- Data: x = { <2 years at current job, missed payments }
- Prior (from the data): p(C_1) = 3/10, p(C_2) = 7/10
- Likelihood: p(x_1, x_2 | C_1) = ?, p(x_1, x_2 | C_2) = ?
- How would we determine these?
(Figure: the "Predicting credit risk" table with columns "<2 years at current job?", "missed payments?", and "defaulted?".)

Defining a probabilistic model by counting. The prior is obtained by counting the number of examples of each class in the data:

  p(C_k = k) = Count(C_k = k) / #records

The likelihood is obtained the same way:

  p(x = v | C_k = k) = Count(x = v, C_k = k) / Count(C_k = k)
  p(x_1 = v_1, ..., x_N = v_N | C_k = k) = Count(x_1 = v_1, ..., x_N = v_N, C_k = k) / Count(C_k = k)

This is the maximum likelihood estimate (MLE) of the probabilities.
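Here is a minimal sketch of the counting (MLE) estimates above, run on a small hypothetical dataset with the same two binary attributes; it is not the exact table from the slides.

```python
from collections import Counter

# (x1 = "<2 years at current job?", x2 = "missed payments?", class)
records = [("N", "N", "C1"), ("N", "Y", "C1"), ("Y", "N", "C1"),
           ("N", "Y", "C2"), ("Y", "N", "C2"), ("Y", "N", "C2"), ("Y", "N", "C2")]

class_counts = Counter(c for _, _, c in records)
joint_counts = Counter(((x1, x2), c) for x1, x2, c in records)

prior = {c: n / len(records) for c, n in class_counts.items()}
likelihood = {((x1, x2), c): joint_counts[((x1, x2), c)] / class_counts[c]
              for x1 in "NY" for x2 in "NY" for c in class_counts}

print(prior)                               # p(C_k) = Count(C_k) / #records
print(likelihood[(("N", "N"), "C1")])      # p(x1=N, x2=N | C1) = 1/3
print(likelihood[(("Y", "Y"), "C1")])      # 0.0: a combination never seen with C1
```

The last line illustrates the zero-count problem that the next slides run into.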

Defining a probabilistic classification model. Determining the likelihoods p(x_1, x_2 | C_1) and p(x_1, x_2 | C_2): a simple approach is to look at the counts in the data, filling in the table one (x_1, x_2) combination at a time:

  x_1 (<2 years at current job?)   x_2 (missed payments?)   C_1 (did default)   C_2 (did not default)
  N                                N                        3/3                 0/0
  N                                Y                        2/3                 1/3
  Y                                N                        1/4                 3/4
  Y                                Y                        0/0                 0/0

Some combinations never occur in the data, leaving 0/0 entries. What do we do about these?

Being (proper) Bayesians: recall our coin-flipping example. In Bernoulli trials, each sample is either 1 (e.g., heads) with probability θ, or 0 (tails) with probability 1 − θ. The binomial distribution specifies the probability of the total number of heads, y, out of n trials:

  p(y | θ, n) = C(n, y) θ^y (1 − θ)^(n−y)

(Figure: the distribution p(y | θ = 0.25, n = 10) plotted against y.)

Applying Bayes rule. Given n trials with y heads, what do we know about θ? We can apply Bayes rule to see how our knowledge changes as we acquire new observations:

  p(θ | y, n) = p(y | θ, n) p(θ | n) / p(y | n)    (posterior = likelihood × prior / normalizing constant)

where the normalizing constant is p(y | n) = ∫ p(y | θ, n) p(θ | n) dθ.
- We know the likelihood; what about the prior? Uniform on [0, 1] is a reasonable assumption, i.e., we don't know anything.
- What is the form of the posterior? In this case, the posterior is just proportional to the likelihood:

  p(θ | y, n) ∝ C(n, y) θ^y (1 − θ)^(n−y)
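A minimal numerical sketch of the argument above: with a uniform prior on θ, the posterior over a grid of θ values is just the (grid-normalized) binomial likelihood. The particular counts y and n are arbitrary.

```python
import numpy as np
from scipy.stats import binom

y, n = 1, 3                                  # e.g. 1 head in 3 trials
theta = np.linspace(0.0, 1.0, 501)
likelihood = binom.pmf(y, n, theta)          # C(n, y) theta^y (1 - theta)^(n - y)

posterior = likelihood / np.sum(likelihood)  # normalize over the grid points
print(theta[np.argmax(posterior)])           # grid MAP estimate, close to y/n = 1/3
```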

Evaluating the posterior. What do we know initially, before observing any trials?
(Figure: the flat posterior p(θ | y = 0, n = 0) plotted against θ.)

Coin tossing. What is our belief about θ after observing one tail?
(Figure: the posterior p(θ | y = 0, n = 1) plotted against θ.)

Coin tossing. Now after two trials we have observed 1 head and 1 tail.
(Figure: the posterior p(θ | y = 1, n = 2) plotted against θ.)

Coin tossing. 3 trials: 1 head and 2 tails.
(Figure: the posterior p(θ | y = 1, n = 3) plotted against θ.)

Coin tossing. 4 trials: 1 head and 3 tails.
(Figure: the posterior p(θ | y = 1, n = 4) plotted against θ.)

Coin tossing. 5 trials: 1 head and 4 tails.
(Figure: the posterior p(θ | y = 1, n = 5) plotted against θ.)

Evaluating the normalizing constant. To get proper probability density functions, we need to evaluate p(y | n):

  p(θ | y, n) = p(y | θ, n) p(θ | n) / p(y | n)

Bayes, in his original paper in 1763, showed that

  p(y | n) = ∫_0^1 p(y | θ, n) p(θ | n) dθ = 1 / (n + 1)

so that

  p(θ | y, n) = (n + 1) C(n, y) θ^y (1 − θ)^(n−y)

The ratio estimate. What about after just one trial: 0 heads and 1 tail? The MAP estimate and the ratio estimate y/n would both say θ = 0. Does this make sense? What would a better estimate be?
(Figure: the posterior p(θ | y = 0, n = 1), with the MAP estimate and the ratio estimate y/n = 0 marked.)
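A minimal check of the closed-form result above: with a uniform prior, p(θ | y, n) = (n + 1) C(n, y) θ^y (1 − θ)^(n−y) integrates to one, and for y = 0, n = 1 both the MAP estimate and the ratio estimate y/n sit at zero.

```python
from math import comb
from scipy.integrate import quad

def posterior_pdf(theta, y, n):
    """Exact posterior density under a uniform prior on theta."""
    return (n + 1) * comb(n, y) * theta**y * (1 - theta)**(n - y)

y, n = 0, 1
total, _ = quad(posterior_pdf, 0.0, 1.0, args=(y, n))
print(total)      # ~1.0: a proper density
print(y / n)      # ratio estimate = 0, which is also where this posterior peaks
```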

The expected value estimate. The expected value of the posterior is

  E(θ | y, n) = ∫_0^1 θ p(θ | y, n) dθ = (y + 1) / (n + 2)

This is called smoothing or regularization. For example, E(θ | y = 0, n = 1) = 1/3. What happens for zero trials?
(Figure: the posterior p(θ | y = 0, n = 1), with the expected value estimate marked.)

On to the mushrooms!

     EDIBLE?    CAP-SHAPE  CAP-SURFACE  CAP-COLOR  ODOR   STALK-SHAPE  POPULATION  HABITAT
 1   edible     flat       fibrous      red        none   tapering     several     woods
 2   poisonous  convex     smooth       red        foul   tapering     several     paths
 3   edible     flat       fibrous      brown      none   tapering     abundant    grasses
 4   edible     convex     scaly        gray       none   tapering     several     woods
 5   poisonous  convex     smooth       red        foul   tapering     several     woods
 6   edible     convex     fibrous      gray       none   tapering     several     woods
 7   poisonous  flat       scaly        brown      fishy  tapering     several     leaves
 8   poisonous  flat       scaly        brown      spicy  tapering     several     leaves
 9   poisonous  convex     fibrous      yellow     foul   enlarging    several     paths
10   poisonous  convex     fibrous      yellow     foul   enlarging    several     woods
11   poisonous  flat       smooth       brown      spicy  tapering     several     woods
12   edible     convex     smooth       yellow     anise  tapering     several     woods
13   poisonous  knobbed    scaly        red        foul   tapering     several     leaves
14   poisonous  flat       smooth       brown      foul   tapering     several     leaves
15   poisonous  flat       fibrous      gray       foul   enlarging    several     woods
16   edible     sunken     fibrous      brown      none   enlarging    solitary    urban
17   poisonous  flat       smooth       brown      foul   tapering     several     woods
18   poisonous  convex     smooth       white      foul   tapering     scattered   urban
19   poisonous  flat       scaly        yellow     foul   enlarging    solitary    paths
20   edible     convex     fibrous      gray       none   tapering     several     woods

The scaling problem. Recall the counting estimates:

  p(x = v | C_k = k) = Count(x = v, C_k = k) / Count(C_k = k)
  p(x_1 = v_1, ..., x_N = v_N | C_k = k) = Count(x_1 = v_1, ..., x_N = v_N, C_k = k) / Count(C_k = k)

The prior is easy enough. But for the likelihood, the table is huge!

Mushroom attributes and their values:
EDIBLE: edible, poisonous
CAP-SHAPE: bell, conical, convex, flat, knobbed, sunken
CAP-SURFACE: fibrous, grooves, scaly, smooth
CAP-COLOR: brown, buff, cinnamon, gray, green, pink, purple, red, white, yellow
BRUISES: bruises, no
ODOR: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy
GILL-ATTACHMENT: attached, free
GILL-SPACING: close, crowded
GILL-SIZE: broad, narrow
GILL-COLOR: black, brown, buff, chocolate, gray, green, orange, pink, purple, red, white, yellow
STALK-SHAPE: enlarging, tapering
STALK-ROOT: bulbous, club, equal, rooted
STALK-SURFACE-ABOVE-RING: fibrous, scaly, silky, smooth
STALK-SURFACE-BELOW-RING: fibrous, scaly, silky, smooth
STALK-COLOR-ABOVE-RING: brown, buff, cinnamon, gray, orange, pink, red, white, yellow
STALK-COLOR-BELOW-RING: brown, buff, cinnamon, gray, orange, pink, red, white, yellow
VEIL-TYPE: partial, universal
VEIL-COLOR: brown, orange, white, yellow
RING-NUMBER: none, one, two
RING-TYPE: evanescent, flaring, large, none, pendant
SPORE-PRINT-COLOR: black, brown, buff, chocolate, green, orange, purple, white, yellow
POPULATION: abundant, clustered, numerous, scattered, several, solitary
HABITAT: grasses, leaves, meadows, paths, urban, waste, woods

22 attributes with an average of 5 values each!

Simplifying with Naïve Bayes. What if we assume the features are conditionally independent given the class?

  p(x | C_k) = p(x_1, ..., x_N | C_k) = ∏_{n=1}^{N} p(x_n | C_k)

We know that's not precisely true, but it might make a good approximation. Now we only need to specify N different likelihoods per class:

  p(x_i = v_i | C_k = k) = Count(x_i = v_i, C_k = k) / Count(C_k = k)

a huge savings in the number of parameters.

Inference with Naïve Bayes. Inference is just like before, but with the independence approximation:

  p(C_k | x) = p(x | C_k) p(C_k) / p(x)
             = p(C_k) ∏_n p(x_n | C_k) / p(x)
             = p(C_k) ∏_n p(x_n | C_k) / Σ_k p(C_k) ∏_n p(x_n | C_k)

Classification performance is often surprisingly good, and it is easy to implement.
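Below is a minimal Naïve Bayes sketch over discrete attributes, using a tiny hypothetical subset of mushroom-like attributes rather than the full dataset. The add-one smoothing in the likelihood estimate is an assumption in the spirit of the (y + 1)/(n + 2) estimate discussed earlier, not necessarily the estimator used in the lecture.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    classes = Counter(labels)
    counts = defaultdict(Counter)             # counts[(attribute, class)][value]
    for row, c in zip(rows, labels):
        for attr, value in row.items():
            counts[(attr, c)][value] += 1
    return classes, counts

def predict(row, classes, counts):
    n = sum(classes.values())
    best, best_score = None, -1.0
    for c, nc in classes.items():
        score = nc / n                         # prior p(C_k)
        for attr, value in row.items():
            num_values = len(counts[(attr, c)]) + 1      # crude add-one smoothing
            score *= (counts[(attr, c)][value] + 1) / (nc + num_values)
        if score > best_score:
            best, best_score = c, score
    return best

rows = [{"odor": "none", "cap": "flat"}, {"odor": "foul", "cap": "convex"},
        {"odor": "none", "cap": "convex"}, {"odor": "foul", "cap": "flat"}]
labels = ["edible", "poisonous", "edible", "poisonous"]
classes, counts = train(rows, labels)
print(predict({"odor": "none", "cap": "flat"}, classes, counts))   # edible
```

With 22 attributes, the product of per-attribute probabilities can underflow, which is exactly the implementation issue addressed next.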

Implementation issues. If you implement Naïve Bayes naïvely, you'll run into trouble. Why?

  p(C_k | x) = p(C_k) ∏_n p(x_n | C_k) / Σ_k p(C_k) ∏_n p(x_n | C_k)

It's never good to compute products of a long list of numbers: they'll quickly go to zero with machine precision, even using doubles (64-bit). Strategy: compute log probabilities:

  log p(C_k | x) = log p(C_k) + Σ_n log p(x_n | C_k) − log Σ_k p(C_k) ∏_n p(x_n | C_k)
                 = log p(C_k) + Σ_n log p(x_n | C_k) − constant

What about that constant? It still contains a product.

Converting back to probabilities. The only requirement on the denominator is that it normalize the numerator to yield a valid probability distribution. We used a log transformation, g_i = log p_i + constant, and the form of the probability is the same for any constant c:

  p_i / Σ_i p_i = e^{g_i} / Σ_i e^{g_i} = e^c e^{g_i} / Σ_i e^c e^{g_i} = e^{g_i + c} / Σ_i e^{g_i + c}

A common choice: choose c so that the log probabilities are shifted to zero at their maximum, i.e., c = −max_i g_i.
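A minimal sketch of the log-space normalization described above: shift the log scores by their maximum before exponentiating, so the largest term becomes exp(0) = 1 and nothing underflows.

```python
import math

def normalize_log_scores(g):
    """Convert unnormalized log probabilities g_i into probabilities p_i."""
    c = -max(g)                                  # shift so the largest is zero
    shifted = [math.exp(gi + c) for gi in g]
    z = sum(shifted)
    return [s / z for s in shifted]

# Scores this negative would underflow to 0.0 if exponentiated directly.
g = [-1000.0, -1001.0, -1003.0]
print(normalize_log_scores(g))   # [0.705..., 0.259..., 0.035...]
```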

Recall another simplifying assumption: noisy-OR. We assume each cause C_j can produce effect E_i with probability f_ij. The noisy-OR model assumes the parent causes of effect E_i contribute independently. The probability that none of them caused effect E_i is simply the product of the probabilities that each one did not cause E_i. The probability that any of them caused E_i is one minus the above, i.e.,

  P(E_i | par(E_i)) = P(E_i | C_1, ..., C_n) = 1 − ∏_j (1 − P(E_i | C_j)) = 1 − ∏_j (1 − f_ij)

(Figure: a small belief network for catching a cold (C), with parents "hit by viral droplet" (D), "touch contaminated object" (O), and "eat contaminated food" (F), and link probabilities f_CD, f_CO, f_CF.)

  P(C | D, O, F) = 1 − (1 − f_CD)(1 − f_CO)(1 − f_CF)

A general one-layer causal network. Such a network can either model causes and effects or, equivalently, stochastic binary features. Each input x_i encodes the probability that the ith binary input feature is present. The set of features represented by the jth cause is defined by weights f_ij, which encode the probability that feature i is an instance of cause j.
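A minimal sketch of the noisy-OR combination for the cold example above; the link probabilities f_CD, f_CO, f_CF are hypothetical numbers.

```python
def noisy_or(link_probs, parent_states):
    """P(effect = 1 | parents): active parents contribute independently."""
    p_none = 1.0
    for f, active in zip(link_probs, parent_states):
        if active:
            p_none *= (1.0 - f)   # probability this parent did NOT cause the effect
    return 1.0 - p_none

f_CD, f_CO, f_CF = 0.5, 0.2, 0.1
print(noisy_or([f_CD, f_CO, f_CF], [1, 1, 1]))   # 1 - 0.5*0.8*0.9 = 0.64
print(noisy_or([f_CD, f_CO, f_CF], [0, 1, 0]))   # only O active: 0.2
```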

The data: a set of stochastic binary patterns.
(Figure, panel a: the data; each column is a distinct eight-dimensional binary pattern. There are five underlying causal feature patterns. What are they? Panel b: the true hidden causes of the data. Panel c: the inferred causes of the data.)

Hierarchical statistical models. In a Bayesian belief network, the joint probability of the binary states is

  P(S | W) = ∏_i P(S_i | pa(S_i), W)

The probability of S_i depends only on its parents pa(S_i):

  P(S_i | pa(S_i), W) = h(Σ_j S_j w_ji)      if S_i = 1
                      = 1 − h(Σ_j S_j w_ji)  if S_i = 0

where the function h specifies how causes are combined: h(u) = 1 − exp(−u), u > 0.
Main points:
- the hierarchical structure allows the model to form higher-order representations
- upper states are priors for lower states
- weights encode higher-order features

Gibbs sampling (back to the example from last lecture). The model represents stochastic binary features. Each input x_i encodes the probability that the ith binary input feature is present. The set of features represented by the jth cause is defined by weights f_ij, which encode the probability that feature i is an instance of cause j. Trick: it's easier to adapt the weights in an unbounded space, so use the transformation f_ij = 1 − exp(−w_ij) and optimize in w-space.


Learning objective. Adapt W to find the most probable explanation of the input patterns. The probability of the data is

  P(D_{1:N} | W) = ∏_n P(D_n | W)

where P(D_n | W) is computed by marginalizing over the hidden states:

  P(D_n | W) = Σ_k P(D_n | S^k, W) P(S^k | W)

Computing this sum exactly is intractable, but we can still make accurate approximations.

Approximating P(D_n | W). Good representations should have just one or a few possible explanations for most patterns.
- In this case, most P(D_n | S^k, W) will be zero.
- The following approximation will be very accurate once the weights have adapted:

  P(D_n | W) ≈ P(D_n | Ŝ, W) P(Ŝ | W)

- The approximation becomes increasingly accurate as learning proceeds.

Using EM to adapt the weights (the network parameters). The complexity of the model is controlled by placing a prior on the weights:
- assume the prior to be a product of gamma distributions
- the objective function becomes L = P(D_{1:N} | W) P(W | α, β)
A simple and efficient EM formula for adapting the weights can be derived using the transformations f_ij = 1 − exp(−w_ij) and g_i = 1 − exp(−u_i):

  f_ij = [α − 1 + 2 f_ij + Σ_n S_i^(n) S_j^(n) f_ij / g_j^(n)] / [α + β + Σ_n S_i^(n)]

- f_ij can be interpreted as the frequency of state S_j given cause S_i
- f_ij is a weighted average of the number of times S_j was active given S_i
- the ratio f_ij / g_j inversely weights each term by the number of causes for S_j

Inferring the best representation of the observed variables. Given the input D, there is no simple way to determine which states are the input's most likely causes.
- Computing the most probable network state is an inference process: we want to find the explanation of the data with the highest probability.
- This can be done efficiently with Gibbs sampling.
Gibbs sampling is another example of an MCMC method. Key idea: the samples are guaranteed to converge to the true posterior probability distribution.

Gibbs sampling. Gibbs sampling is a way to select an ensemble of states that is representative of the posterior distribution P(S | D, W). Each state of the network is updated iteratively according to the probability of S_i given the remaining states. This conditional probability can be computed using (Neal, 1992):

  P(S_i = a | S_j : j ≠ i, W) ∝ P(S_i = a | pa(S_i), W) ∏_{j ∈ ch(S_i)} P(S_j | pa(S_j), S_i = a, W)

- The limiting ensemble of states will be typical samples from P(S | D, W).
- This also works if any subset of states is fixed and the rest are sampled.

The network Gibbs sampling equations (derivation omitted). The probability of S_i changing state given the remaining states is

  P(S_i → 1 − S_i | S_j : j ≠ i, W) = 1 / (1 + exp(−x_i))

where x_i indicates how much changing the state S_i changes the probability of the whole network state:

  x_i = log h(u_i; 1 − S_i) − log h(u_i; S_i) + Σ_{j ∈ ch(S_i)} [log h(u_j + δ_ij; S_j) − log h(u_j; S_j)]

with h(u; 1) = h(u) and h(u; 0) = 1 − h(u).
- u_i is the causal input to S_i: u_i = Σ_k S_k w_ki
- δ_ij specifies the change in u_j for a change in S_i: δ_ij = +w_ij if S_i = 0, or −w_ij if S_i = 1
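To make the sampling procedure concrete, here is a minimal Gibbs-sampling sketch for a tiny two-layer network: two hidden binary causes with noisy-OR links to three observed effects. It is a generic illustration of resampling each S_i from P(S_i | the rest), not the lecture's exact network; the priors, weights, and observations are all hypothetical, and for this tiny model the conditional is computed from the full unnormalized joint, which is proportional to the local parent/children formula above.

```python
import random

prior = [0.3, 0.3]                  # P(S_j = 1) for the two hidden causes
f = [[0.9, 0.8, 0.0],               # f[j][i]: P(effect i on | cause j alone)
     [0.0, 0.7, 0.9]]
observed = [1, 1, 0]                # clamped states of the three effects

def p_effect(i, s):
    """Noisy-OR probability that effect i is on, given hidden states s."""
    p_off = 1.0
    for j, sj in enumerate(s):
        if sj:
            p_off *= (1.0 - f[j][i])
    return 1.0 - p_off

def unnormalized(s):
    """P(S = s) * P(E = observed | S = s) for this tiny network."""
    p = 1.0
    for j, sj in enumerate(s):
        p *= prior[j] if sj else (1.0 - prior[j])
    for i, e in enumerate(observed):
        pe = p_effect(i, s)
        p *= pe if e else (1.0 - pe)
    return p

def gibbs(sweeps=5000, seed=0):
    random.seed(seed)
    s = [1, 0]                       # start from a hidden state with nonzero probability
    samples = []
    for _ in range(sweeps):
        for j in range(len(s)):      # resample each hidden S_j given the rest
            s_on, s_off = s[:], s[:]
            s_on[j], s_off[j] = 1, 0
            w_on, w_off = unnormalized(s_on), unnormalized(s_off)
            s[j] = 1 if random.random() < w_on / (w_on + w_off) else 0
        samples.append(tuple(s))
    return samples

samples = gibbs()
print(sum(s == (1, 0) for s in samples) / len(samples))   # ~0.95 posterior mass on S = (1, 0)
```

Clamping the observed effects and resampling only the hidden causes corresponds to the note above that Gibbs sampling also works when a subset of states is fixed.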

Interpretation of the Gibbs sampling equation. The Gibbs equation can be interpreted as feedback + feedforward:
- feedback: how consistent is S_i with the current causes?
- feedforward: how likely is S_i a cause of its children?
Feedback allows the lower-level units to use information only computable at higher levels; it determines (disambiguates) the state when the feedforward input is ambiguous.

The higher-order lines problem.
(Figure: the true generative model and patterns sampled from the model.)
Can we infer the structure of the network given only the patterns?

Weights in a belief network after learning. The first layer of weights learns that patterns are combinations of lines; the second layer learns combinations of the first-layer features.

The shifter problem.
(Figure: shift patterns A and B, and the weights of a network after learning.)

Gibbs sampling: feedback disambiguates lower-level states.
(Figure: an example of the network state during Gibbs updating.)
Once the structure is learned, the Gibbs updating converges in two sweeps.

Next time (which is next Tuesday): classifying with other models. Don't forget: Spring Carnival this week, so no class on Thursday.
