Machine Learning. Bayesian Learning. Michael M. Richter
1 Machine Learning Bayesian Learning
2 Topic This is concept learning the probabilistic way: everything that is stated is stated exactly, but it is not always true. The learned concept is therefore equipped with a probability of being correct.
3 History Bayesian Decision Theory came long before Version Spaces, Decision Tree Learning and Neural Networks. It was studied in the field of Statistical Theory and, more specifically, in the field of Pattern Recognition. Bayesian Decision Theory underlies important learning schemes such as the Naïve Bayes Classifier, Learning Bayesian Belief Networks and the EM Algorithm. Bayesian Decision Theory is also useful as it provides a framework within which many non-Bayesian classifiers can be studied (see [Mitchell, Sections 6.3-6.6]).
4 Why Bayesian Classification? Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems. Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data. Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities. Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.
5 Maximum Likelihood Suppose a number of hypotheses have been generated and that for each one the probability of being the right one is calculated. The maximum likelihood principle says one should choose the hypothesis h with the highest probability. P(h|D) is the a posteriori probability of h (after seeing the data D), P(h) is the a priori probability of h, and P(D|h) is the likelihood of D under h.
6 Part 1 The Naïve Bayesian Approach
7 Basic Formulas for Probabilities Product Rule: probability P(A,B) of a conjunction of two events A and B: P(A,B) = P(A|B) P(B) = P(B|A) P(A). Sum Rule: probability of a disjunction of two events A and B: P(A ∨ B) = P(A) + P(B) − P(A,B). Theorem of Total Probability: if events A1, ..., An are mutually exclusive with Σi P(Ai) = 1, then P(B) = Σi P(B|Ai) P(Ai).
8 A Basic Learning Scenario (1) Event Y = y: the observed example event; event Z = z: correctness of hypothesis z; D: data. Bayes' rule: P(Z = z | Y = y) = P(Z = z) P(Y = y | Z = z) / P(Y = y). In learning terms: P(h|D) = P(h) P(D|h) / P(D), where P(h|D) is the probability that h is a correct hypothesis for data D, P(h) is the probability that h is a correct hypothesis, P(D|h) is the probability that D is observed if h is correct, and P(D) is the probability that D is observed.
9 A Basic Learning Scenario (2) Notation: P(h) is the a priori probability of h, P(D|h) is the likelihood of D under h, and P(h|D) is the a posteriori probability of h given D. The basic theorem (Bayes' rule): P(h|D) = P(h) P(D|h) / P(D). This theorem makes applications possible because it reduces the unknown conditional probability to quantities that are known a priori.
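A minimal sketch of how Bayes' rule turns priors and likelihoods into posteriors; the hypothesis names, priors and likelihoods below are invented purely for illustration.

```python
# Minimal sketch of Bayes' rule: P(h|D) = P(h) P(D|h) / P(D).
# The hypotheses, priors, and likelihoods are invented illustration values.

def posteriors(prior, likelihood):
    """Return P(h | D) for every hypothesis h, given P(h) and P(D | h)."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)   # P(D)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

prior = {"h1": 0.3, "h2": 0.7}        # P(h), before seeing the data
likelihood = {"h1": 0.8, "h2": 0.1}   # P(D | h), probability of the data under h

print(posteriors(prior, likelihood))  # {'h1': 0.774..., 'h2': 0.225...}
```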
10 A Basic Learning Scenario (3) The learner has hypotheses h1, ..., hk and uses observed data D. Wanted: some h ∈ {h1, ..., hk} for which P(h|D) is maximal (the maximum a posteriori hypothesis). A posteriori means: after seeing the data. Background knowledge: the a priori probability P(h) of h. A priori means: before seeing the data.
11 Bayesian Classification and Decision (1) The Bayes decision rule selects the class with minimum conditional risk. In the case of minimum-error-rate classification, the rule selects the class with the maximum posterior probability. Suppose there are k classes c1, c2, ..., ck. Given a feature vector x, the minimum-error-rate rule assigns it to the class cj if P(cj|x) > P(ci|x) for all i ≠ j.
12 Bayesian Classification and Decision (2) An equivalent but more useful criterion for minimum-error-rate classification is: choose class cj so that P(x|cj)P(cj) > P(x|ci)P(ci) for all i ≠ j. This relies on Bayes' theorem. Note: no method can exist that finds the correct hypothesis with higher probability. But: that can change if one has additional knowledge.
13 Example Assume: (1) A lab test D for a form of cancer has a 98% chance of giving a positive result if the cancer is present, and a 97% chance of giving a negative result if the cancer is absent. (2) 0.8% of the population has this cancer: P(cancer) = 0.008 and P(~cancer) = 0.992. What is the probability that the cancer is present given a positive result? P(cancer|D) = P(D|cancer)P(cancer) / P(D) = 0.98*0.008 / (0.98*0.008 + 0.03*0.992) = 0.21
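The same computation as a short script (the numbers are taken from the slide; the variable names are just for illustration):

```python
# Cancer example: P(cancer | positive test) via Bayes' rule.
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer = 0.98       # sensitivity: positive result if cancer present
p_pos_given_no_cancer = 0.03    # false positive rate: 1 - 0.97

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * p_no_cancer
print(round(p_pos_given_cancer * p_cancer / p_pos, 2))   # 0.21
```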
14 MAP and ML Given some data D and a hypothesis space H, what is the most probable hypothesis h ∈ H, i.e., for which P(h|D) is maximal? This hypothesis is called the maximum a posteriori hypothesis h_MAP: h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h)P(h). Again: h_MAP is optimal in the sense that no method can exist that finds the correct hypothesis with higher probability. If P(h) = P(h') for all h, h' ∈ H, then this reduces to the maximum likelihood principle: h_ML = argmax_{h ∈ H} P(D|h).
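A small sketch of the difference between h_MAP and h_ML; the three hypotheses and their probabilities are made up so that the two choices differ.

```python
# MAP vs. ML hypothesis selection on invented values.
prior      = {"h1": 0.6, "h2": 0.3, "h3": 0.1}   # P(h)
likelihood = {"h1": 0.1, "h2": 0.5, "h3": 0.9}   # P(D | h)

h_map = max(prior, key=lambda h: likelihood[h] * prior[h])   # argmax P(D|h) P(h)
h_ml  = max(prior, key=lambda h: likelihood[h])              # argmax P(D|h)
print(h_map, h_ml)   # h2 h3 -- they differ because the prior is not uniform
```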
15 The Gibbs Classifier The Bayes optimal classifier is optimal but expensive; it uses all hypotheses in H. Non-optimal but much more efficient is the Gibbs classifier. Algorithm: Given: a sample S = {x1, ..., xm} of data D, a hypothesis space H with a probability distribution P, and some x to be classified. Method: 1. Select h ∈ H randomly according to P (this is similar to GA!). 2. Output: h(x). Surprisingly: E(errorGibbs) ≤ 2 E(errorBayesOptimal).
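A sketch of the Gibbs classifier under the assumption that the posterior P(h|D) is already available; the hypotheses are toy functions invented for illustration.

```python
import random

# Gibbs classifier sketch: draw a single hypothesis according to P(h | D)
# and classify with it, instead of averaging over all of H as Bayes optimal does.
hypotheses = {
    "h1": lambda x: +1,                   # always predicts +1
    "h2": lambda x: -1,                   # always predicts -1
    "h3": lambda x: +1 if x > 0 else -1,  # sign of x
}
posterior = {"h1": 0.2, "h2": 0.1, "h3": 0.7}   # P(h | D), assumed given

def gibbs_classify(x):
    names = list(hypotheses)
    h = random.choices(names, weights=[posterior[n] for n in names])[0]
    return hypotheses[h](x)

print(gibbs_classify(2.5))   # usually +1, occasionally -1
```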
16 The Naïve Bayesian Algorithm (1) Learning scenario: examples x1, ..., xm with xi = (ai1, ..., ain) for attributes A1, ..., An; hypotheses H = {h1, ..., hk} for classes; the class of x is C(x). Two ways to proceed: 1) use Bayes optimal classification; 2) do not access H for classification. Method 2) avoids surveying all hypotheses in H, which is often very difficult and impractical.
17 Estimation of Probabilities from Samples Attributes X1, X2, ..., XN and class C. Two classes: -1, +1; N boolean attributes. How do we estimate P(C)? E.g. by simple binomial estimation: count the number of instances with C = -1 and with C = +1. How do we estimate P(X1, ..., XN | C)? Count to estimate P(X1, ..., XN | C = +1) and P(X1, ..., XN | C = -1). A very complex task!
18 Conditional Independence Conditional independence is supposed to simplify the estimation task. Def.: (i) Y is independent of Z if for all y ∈ Y, z ∈ Z: P(Y = y, Z = z) = P(Y = y) P(Z = z). (ii) X is conditionally independent of Y given Z if P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z). Another formulation: P(X = x | Y = y, Z = z) = P(X = x | Z = z). This reduces the complexity for n variables from O(2^n) in the product space to O(n)!
19 The Naïve Bayesian Algorithm (2) Given x = (a1, ..., an), the (conditional) independence assumption says: P(a1, ..., an | h) = P(a1|h) P(a2|h) ... P(an|h). The assumption is called naive. This reduces the parameter estimation from the product space (which is O(2^n)) to the sum of the attribute spaces (which is O(n)). However, it is not always satisfied (e.g. thunder is not independent of rain). The goal is now to avoid needing knowledge of P(h) for all h ∈ H.
20 The Naïve Bayesian Algorithm (3) Therefore we proceed: h_MAP = the h ∈ {h1, ..., hk} for which P(C(x) = h | x = (a1, ..., an)) is maximal. Equivalent: the h for which P(x = (a1, ..., an) | C(x) = h) P(C(x) = h) is maximal. The probabilities on the right-hand side are estimated from a given set S of examples. Without the independence assumption this would be impracticable because S would need to be too large.
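A minimal naive Bayes sketch on an invented toy sample: P(h) and each P(aj|h) are estimated by counting, and classification picks the class maximizing P(h) times the product of the P(aj|h). (No smoothing of zero counts, to keep it short.)

```python
from collections import Counter, defaultdict

# Toy training sample S: (attribute tuple, class); values invented for illustration.
S = [
    (("sunny", "hot"),  "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

class_counts = Counter(c for _, c in S)
attr_counts = defaultdict(Counter)          # attr_counts[(position, class)][value]
for attrs, c in S:
    for j, a in enumerate(attrs):
        attr_counts[(j, c)][a] += 1

def classify(x):
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(S)                                 # estimate of P(h)
        for j, a in enumerate(x):
            score *= attr_counts[(j, c)][a] / n_c            # estimate of P(a_j | h)
        if score > best_score:
            best, best_score = c, score
    return best

print(classify(("sunny", "mild")))   # 'no' on this toy sample
```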
21 Example A naive Bayes classifier adopts the assumption of conditional independence. Given: P(pneumonia) = 0.01, P(flu) = 0.05, P(cough|pneumonia) = 0.9, P(fever|pneumonia) = 0.9, P(chest-pain|pneumonia) = 0.8, P(cough|flu) = 0.5, P(fever|flu) = 0.9, P(chest-pain|flu) = 0.1. Suppose a patient had cough, fever, but no chest pain. What is the probability ratio between pneumonia and flu? What is the best diagnosis? Solution: Probability ratio = (0.01 * 0.9 * 0.9 * (1-0.8)) / (0.05 * 0.5 * 0.9 * (1-0.1)) = 0.08. So flu is at least ten times more likely than pneumonia.
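The same ratio as a short computation, using only the values given on the slide:

```python
# Probability ratio pneumonia : flu under the naive independence assumption;
# "no chest pain" contributes (1 - P(chest-pain | class)).
p_pneumonia = 0.01 * 0.9 * 0.9 * (1 - 0.8)
p_flu       = 0.05 * 0.5 * 0.9 * (1 - 0.1)
print(round(p_pneumonia / p_flu, 2))   # 0.08 -> flu is roughly 12 times more likely
```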
22 Discussion (1) Advantages: Tends to work well despite the strong assumption of conditional independence. Experiments show it to be quite competitive with other classification methods on standard UCI datasets. Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases. Able to learn conjunctive concepts in any case.
23 Discussion (2) Disadvantages: Does not perform any search of the hypothesis space; it directly constructs a hypothesis from parameter estimates that are easily calculated from the training data. Strong bias. Does not guarantee consistency with the training data. Typically handles noise well, since it does not even focus on completely fitting the training data.
24 Part 2 Belief Networks
25 Bayesian Belief Networks (1) Discussing the independence assumption: Positive: it makes computation feasible. Negative: it is often not satisfied. Reason: there are causal or influential relations between the attributes. Such relations are background knowledge. Idea: make them visible in a graph. Conditional independence is then assumed only between subsets of variables. Belief networks combine both.
26 Bayesian Belief Networks (2) A Bayesian belief net (BBN) is a directed acyclic graph, together with an associated set of probability tables. The nodes represent variables, which can be discrete or continuous. The edges represent causal/influential relationships between variables. Nodes not connected by edges are independent.
27 Causality (1) Although Bayesian networks are often used to represent causal relationships, this need not be the case: a directed edge from u to v does not require that X_v be causally dependent on X_u. Example: the graphs A → B → C and C → B → A are equivalent, that is, they impose exactly the same conditional independence requirements.
28 Causality (2) A causal network is a Bayesian network with an explicit requirement that the relationships be causal. The additional semantics of causal networks specify that if a node X is actively caused to be in a given state x (an action written as do(X=x)), then the probability density function changes to that of the network obtained by cutting the links from X's parents to X and setting X to the caused value x. Using these semantics, one can predict the impact of external interventions from data obtained prior to intervention.
29 Influence Diagrams The network can represent influence diagrams. Such diagrams are used to represent decision models. Therefore they are a method to support decision making.
30 Example (1) [Network with nodes Temperature, Winds, Cloudiness, Rain, Umbrella.] Temperature: cold, mild, hot. Cloudiness: none, partial, covered. Winds: no, mild, strong. Each node has a conditional probability table.
31 Example (2) [Network with nodes Storm, Lightning, Thunder, BusTourGroup, Campfire, ForestFire.] Associated with each node is a conditional probability table, which specifies the conditional distribution for the variable given its immediate parents in the graph. Each node is asserted to be conditionally independent of its non-descendants, given its immediate parents.
32 Inference in Bayesian Networks (1) In general: calculate conditional probabilities along the directed edges. This can be done in a forward or backward mode. Example, forward mode: suppose we have the edge A → B; then we get P(B) = P(B|A)P(A) + P(B|not A)P(not A) and P(not B) = P(not B|A)P(A) + P(not B|not A)P(not A).
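A tiny numeric check of the forward step, with made-up values for P(A) and P(B|A):

```python
# Forward inference along the edge A -> B (illustrative numbers).
p_a = 0.3
p_b_given_a, p_b_given_not_a = 0.9, 0.2

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_not_b = (1 - p_b_given_a) * p_a + (1 - p_b_given_not_a) * (1 - p_a)
print(round(p_b, 2), round(p_not_b, 2))   # 0.41 0.59 -- they sum to 1
```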
33 Inference in Bayesian Networks (2) Suppose we want to calculate P(AB|E). Using P(A,B) = P(A|B) P(B) we get: P(AB|E) = P(A|E) * P(B|AE) and P(AB|E) = P(B|E) * P(A|BE). Therefore: P(A|BE) = ( P(A|E) * P(B|AE) ) / P(B|E) (another version of Bayes' theorem).
34 Example (1) [Network with nodes Age, Income, House Owner, Living Location, Newspaper Preference, Voting Pattern.] How likely are elderly rich people to buy the Sun? P(paper = Sun | Age > 60, Income > 60k)
35 Example (2) [Same network: Age, Income, House Owner, Living Location, Newspaper Preference, Voting Pattern.] How likely are elderly rich people who voted liberal to buy the Herald? P(paper = Herald | Age > 60, Income > 60k, Voting = liberal)
36 Unobserved Variables Bayesian networks can be used to answer probabilistic queries about unobserved variables. They can be used to find out updated knowledge of the state of a subset of variables when other variables (the evidence variables) are observed. This process of computing the posterior distribution of variables given evidence is called probabilistic inference. A Bayesian network can thus be considered a mechanism for automatically applying Bayes' theorem to complex problems.
37 Inference in Bayesian Networks (3) In the network we can chain over several edges. Find the probability of H given that A1, A2, A3 and E have happened: P(H|A1A2A3E) = ( P(H|E) * P(A1A2A3|HE) ) / P(A1A2A3|E), because P(A1A2A3|E) = P(A1|A2A3E) * P(A2A3|E) = P(A1|A2A3E) * P(A2|A3E) * P(A3|E). With independence this simplifies; e.g. we get: P(H|A1A2E) = ( P(H|E) * P(A1|HE) * P(A2|HE) ) / ( P(A1|E) * P(A2|E) ).
38 Recalculation (1) Consider the net with nodes A and B as parents of C. Given probabilities: P(A) = 0.1, P(~A) = 0.9; P(B) = 0.4, P(~B) = 0.6. Conditional probability table for C: P(C|AB) = 0.8, P(C|A~B) = 0.6, P(C|~AB) = 0.5, P(C|~A~B) = 0.5; correspondingly P(~C|AB) = 0.2, P(~C|A~B) = 0.4, P(~C|~AB) = 0.5, P(~C|~A~B) = 0.5.
39 Recalculation (2) Calculation of the probability of C: P(C) = P(C,A,B) + P(C,~A,B) + P(C,A,~B) + P(C,~A,~B) = P(C|AB) P(A,B) + P(C|~AB) P(~A,B) + P(C|A~B) P(A,~B) + P(C|~A~B) P(~A,~B) = P(C|AB) P(A) P(B) + P(C|~AB) P(~A) P(B) + P(C|A~B) P(A) P(~B) + P(C|~A~B) P(~A) P(~B) = 0.518. Recalculation of P(A) and P(B) if we know that C is true, using Bayes' rule: P(B|C) = ( P(C|B) * P(B) ) / P(C) = ( ( P(C|AB) * P(A) + P(C|~AB) * P(~A) ) * P(B) ) / P(C) = ( (0.8 * 0.1 + 0.5 * 0.9) * 0.4 ) / 0.518 = 0.409. P(A|C) = ( P(C|A) * P(A) ) / P(C) = ( ( P(C|AB) * P(B) + P(C|A~B) * P(~B) ) * P(A) ) / P(C) = ( (0.8 * 0.4 + 0.6 * 0.6) * 0.1 ) / 0.518 = 0.131.
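The recalculation reproduced as a short script, using the same numbers as above:

```python
# Recalculation example: A and B are independent parents of C.
p_a, p_b = 0.1, 0.4
p_c = {  # P(C | A, B) from the table
    (True, True): 0.8, (True, False): 0.6,
    (False, True): 0.5, (False, False): 0.5,
}

# Marginal P(C), summing over the parents
p_C = sum(p_c[(a, b)]
          * (p_a if a else 1 - p_a)
          * (p_b if b else 1 - p_b)
          for a in (True, False) for b in (True, False))
print(round(p_C, 3))   # 0.518

# Posteriors of the parents once C is observed to be true (Bayes' rule)
p_c_given_b = p_c[(True, True)] * p_a + p_c[(False, True)] * (1 - p_a)   # P(C | B)
p_c_given_a = p_c[(True, True)] * p_b + p_c[(True, False)] * (1 - p_b)   # P(C | A)
print(round(p_c_given_b * p_b / p_C, 3))   # P(B | C) = 0.409
print(round(p_c_given_a * p_a / p_C, 3))   # P(A | C) = 0.131
```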
40 Complete and Incomplete Information 1. The network structure is given in advance and all the variables are fully observable in the training examples. ==> Trivial Case: just estimate the conditional probabilities. 2. The network structure is given in advance but only some of the variables are observable in the training data. ==> Similar to learning the weights for the hidden units of a Neural Net: Gradient Ascent Procedure 3. The network structure is not known in advance. ==> Use a heuristic search or constraint-based technique to search through potential structures.
41 Parameter Learning In order to fully specify the Bayesian network and thus fully represent the joint probability distribution, it is necessary to specify for each node X the probability distribution for X conditional upon X's parents. The distribution of X conditional upon its parents may have any form.
42 Expectation Maximization: Unobservable Relevant Variables Example: assume that data points have been generated uniformly from k distinct Gaussians with the same known variance. Problem: find a hypothesis h = <μ1, μ2, ..., μk> that describes the means of the k distributions. In particular, we are looking for a maximum likelihood hypothesis for these means. We extend the problem description as follows: for each point xi there are k hidden variables zi1, ..., zik such that zil = 1 if xi was generated by the l-th normal distribution and ziq = 0 for all q ≠ l.
43 EM Algorithm Initially: an arbitrary initial hypothesis h = <μ1, μ2, ..., μk> is chosen. The EM algorithm iterates two steps: Step 1 (Estimation, E): calculate the expected value E[zij] of each hidden variable zij, assuming that the current hypothesis h = <μ1, μ2, ..., μk> holds. Step 2 (Maximization, M): calculate a new maximum likelihood hypothesis h' = <μ'1, μ'2, ..., μ'k>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in step 1. Then replace the hypothesis h by the new hypothesis h' and iterate.
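A sketch of this EM loop for k Gaussians with equal known variance and (implicitly) equal mixing weights; the data, the number of iterations and the initialization are invented for illustration.

```python
import math
import random

def em_means(xs, k, sigma=1.0, iters=50):
    """Estimate the k Gaussian means by EM, assuming a common known variance."""
    mus = random.sample(xs, k)                  # arbitrary initial hypothesis
    for _ in range(iters):
        # E-step: expected value E[z_ij] of each hidden variable given current means
        resp = []
        for x in xs:
            w = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: new maximum likelihood means, weighting each x_i by E[z_ij]
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
               for j in range(k)]
    return mus

data = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]
print(sorted(em_means(data, 2)))   # roughly [0, 5]
```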
44 Problems and Limitations (1) A computational problem is exploring a previously unknown network. To calculate the probability of any branch of the network, all branches must be calculated. This process of network discovery is an NP-hard task which might either be too costly to perform, or impossible given the number and combination of variables.
45 Problems and Limitations (2) The network relies on the quality and coverage of the prior beliefs (which are knowledge!) used in the inference processing. The network is only as useful as this background knowledge is reliable. An either too optimistic or too pessimistic expectation of the quality of these prior beliefs will invalidate the results. Related to this is the selection of the statistical distribution used in modeling the data. Selecting the proper distribution model to describe the data has a notable effect on the quality of the resulting network.
46 Dependency Networks These are a generalization of and an alternative to Bayesian networks. A dependency network also has a graph component and a probability component, but the graph can be cyclic. The probability component is as in a Bayesian network.
47 Loops If belief propagation (BP) is used on graphs with loops, messages may circulate indefinitely. Empirically, a good approximation is still achievable: stop after a fixed number of iterations, or stop when there is no significant change in beliefs. If the solution is not oscillatory but converges, it usually is a good approximation.
48 Applications Bayesian learning is a standard method in many application areas such as medicine (classification, prediction), image retrieval and pattern recognition, and quality control for materials. Some competitors are, e.g., support vector machines and clustering methods.
49 Tools Hugin: implements the propagation algorithm of Lauritzen and Spiegelhalter. A more modern and powerful BBN tool is AgenaRisk; with it, fast propagation is possible in large BBNs (with hundreds of nodes and millions of state combinations). Other tools: GeNIe, the WinMine Toolkit, Weka, Matlab.
50 Summary Bayes' theorem; Bayesian decision; maximum a posteriori and maximum likelihood; the naïve Bayesian method and conditional independence; the Gibbs classifier; belief nets, inference in nets, and belief revision; estimating unknown parameters: the EM algorithm; limitations.
51 Some References (1) Bernardo, J. M. and Smith, A. F. M. (1994): Bayesian Theory. New York: John Wiley. Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995): Bayesian Data Analysis. London: Chapman & Hall. Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann. David W. Aha: Machine Learning tools. home.earthlink.net/~dwaha/research/machinelearning.html
52 Some References (2) Heckerman, David: Tutorial on Learning with Bayesian Networks. In: Jordan, Michael Irwin (ed.), Learning in Graphical Models, Adaptive Computation and Machine Learning, MIT Press, 1998. Borgelt, Christian; Kruse, Rudolf (March 2002): Graphical Models for Data Analysis and Mining. Chichester. D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, C. Kadie: Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Journal of Machine Learning Research, Vol. 1, 2000.