CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine


Bayesian Belief Networks. The conditional independence assumption of the Naïve Bayes classifier is too strong; a Bayesian belief network lets us specify which pairs of attributes are conditionally independent. It is a simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions. Syntax: a set of nodes, one per variable; a directed, acyclic graph (a link means "directly influences"); a conditional distribution for each node given its parents, P(X_i | Parents(X_i)). In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values.

Background: Law of Total Probability. The law of total probability (aka summing out, or marginalization): P(A) = Σ_i P(A, B_i) = Σ_i P(A | B_i) P(B_i). Why is this useful? Given a joint distribution, e.g., P(A, B, C, D), we can obtain any marginal probability, e.g., P(B), by summing out the other variables: P(B) = Σ_i Σ_j Σ_k P(A_i, B, C_j, D_k). Less obvious: we can also compute any conditional probability of interest from a joint distribution, e.g., P(C | B) = Σ_i Σ_j P(A_i, C, D_j | B) = (1 / P(B)) Σ_i Σ_j P(A_i, C, D_j, B), where 1 / P(B) is just a normalization constant. Thus, the joint distribution contains the information we need to compute any probability of interest.
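
A minimal sketch of summing out and conditioning from a joint table (binary variables and a randomly generated joint are assumptions made purely for illustration):

```python
# Recover P(B) and P(C | B=1) from a full joint P(A, B, C, D) by summing out.
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2, 2))      # axes: A, B, C, D (made-up numbers)
joint /= joint.sum()                  # normalize so it is a valid distribution

# P(B): sum out A, C, D
p_B = joint.sum(axis=(0, 2, 3))

# P(C | B=1): sum out A and D, then divide by the normalizer P(B=1)
p_C_and_B1 = joint[:, 1, :, :].sum(axis=(0, 2))
p_C_given_B1 = p_C_and_B1 / p_B[1]

print(p_B, p_C_given_B1.sum())        # the conditional sums to 1
```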

Background: The Chain Rule (Factoring). We can always write P(A, B, C, ..., Z) = P(A | B, C, ..., Z) P(B, C, ..., Z) (by the definition of joint probability). Repeatedly applying this idea, we can write P(A, B, C, ..., Z) = P(A | B, C, ..., Z) P(B | C, ..., Z) P(C | ..., Z) ... P(Z). This factorization holds for any ordering of the variables. This is the chain rule for probabilities.
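
A short numerical check of the chain rule on a randomly generated 3-variable joint (the variables and numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2)); p /= p.sum()          # joint P(A, B, C)

p_BC = p.sum(axis=0)                             # P(B, C)
p_C = p.sum(axis=(0, 1))                         # P(C)
p_A_given_BC = p / p_BC                          # P(A | B, C)
p_B_given_C = p_BC / p_C                         # P(B | C)

reconstructed = p_A_given_BC * p_B_given_C * p_C # chain rule product
print(np.allclose(reconstructed, p))             # True
```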

Conditional Independence. The Markov condition: given its parents (P_1, P_2), a node X is conditionally independent of its non-descendants (ND_1, ND_2). [Figure: node X with parents P_1, P_2, non-descendants ND_1, ND_2, and children C_1, C_2.]

Example. The topology of the network encodes conditional independence assertions: Weather is independent of the other variables; Toothache and Catch are conditionally independent given Cavity.

Conditional Independence. Two random variables A and B are conditionally independent given C iff P(A, B | C) = P(A | C) P(B | C). A more intuitive (equivalent) conditional formulation: A and B are conditionally independent given C iff P(A | B, C) = P(A | C). Intuitive interpretation: P(A | B, C) = P(A | C) tells us that learning about B, given that we already know C, provides no change in our probability for A; i.e., B contains no information about A beyond what C provides. This generalizes to more than 2 random variables, e.g., K different symptom variables X_1, X_2, ..., X_K and a disease C: P(X_1, X_2, ..., X_K | C) = Π_i P(X_i | C). This is also known as the naïve Bayes assumption.

Bayesian Networks. A Bayesian network specifies a joint distribution in a structured form. It represents dependence/independence via a directed graph: nodes = random variables, edges = direct dependence. The structure of the graph corresponds to the conditional independence relations: P(X_1, X_2, ..., X_N) = Π_i P(X_i | parents(X_i)), where the left side is the full joint distribution and the right side is the graph-structured approximation. The graph is required to be acyclic (no directed cycles). A Bayesian network has 2 components: the graph structure (the conditional independence assumptions) and the numerical probabilities (for each variable given its parents).

Example of a simple Bayesian network: nodes A and B are the parents of C, so P(A, B, C) = P(C | A, B) P(A) P(B). The probability model has a simple factored form: directed edges => direct dependence; absence of an edge => conditional independence. Bayesian networks are also known as belief networks, graphical models, or causal networks; other formulations exist, e.g., undirected graphical models.

Examples of 3-way Bayesian Networks: three disconnected nodes A, B, C. Marginal independence: P(A, B, C) = P(A) P(B) P(C).

Examples of 3-way Bayesian Networks: A is the parent of both B and C. Conditionally independent effects: P(A, B, C) = P(B | A) P(C | A) P(A). B and C are conditionally independent given A; e.g., A is a disease, and we model B and C as conditionally independent symptoms given A.

Examples of 3-way Bayesian Networks: A and B are both parents of C. Independent causes: P(A, B, C) = P(C | A, B) P(A) P(B). Explaining-away effect: given C, observing A makes B less likely (e.g., the earthquake/burglary/alarm example). A and B are (marginally) independent but become dependent once C is known.
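
A small numerical illustration of explaining away (all probability values below are made up): in the structure A -> C <- B, conditioning on C makes A and B dependent, and additionally observing A lowers the probability of B.

```python
import numpy as np

p_A = np.array([0.9, 0.1])                       # P(A)
p_B = np.array([0.9, 0.1])                       # P(B)
p_C_given_AB = np.array([[[0.999, 0.001],        # P(C | A=0, B=0)
                          [0.1,   0.9  ]],       # P(C | A=0, B=1)
                         [[0.1,   0.9  ],        # P(C | A=1, B=0)
                          [0.05,  0.95 ]]])      # P(C | A=1, B=1)

# Full joint P(A, B, C) = P(C | A, B) P(A) P(B)
joint = p_C_given_AB * p_A[:, None, None] * p_B[None, :, None]

p_B1_given_C1 = joint[:, 1, 1].sum() / joint[:, :, 1].sum()
p_B1_given_C1_A1 = joint[1, 1, 1] / joint[1, :, 1].sum()
print(p_B1_given_C1, p_B1_given_C1_A1)   # ~0.53 vs ~0.10: A "explains away" C
```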

Examples of 3-way Bayesian Networks: a chain A -> B -> C. Markov dependence: P(A, B, C) = P(C | B) P(B | A) P(A).

Example. I'm at work; neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar? Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M). What is P(B | M, J), for example?

Example. We can use the full joint distribution to answer this question, but it requires 2^5 = 32 probabilities. Can we use prior domain knowledge to come up with a Bayesian network that requires fewer probabilities? The network topology reflects "causal" knowledge: a burglar can set the alarm off; an earthquake can set the alarm off; the alarm can cause Mary to call; the alarm can cause John to call.

Constructing a Bayesian Network: Step 1. Order the variables in terms of causality (this may be a partial order), e.g., {E, B} -> {A} -> {J, M}. Then
P(J, M, A, E, B) = P(J, M | A, E, B) P(A | E, B) P(E, B)
= P(J, M | A) P(A | E, B) P(E) P(B)
= P(J | A) P(M | A) P(A | E, B) P(E) P(B),
where the last two steps use the conditional independence assumptions.

The Resulting Bayesian Network. [Figure: B and E are parents of A; A is the parent of J and M, with the associated CPTs.]

Constructing this Bayesian Network: Step 2. P(J, M, A, E, B) = P(J | A) P(M | A) P(A | E, B) P(E) P(B). There are 3 conditional probability tables (CPTs) to be determined, P(J | A), P(M | A), and P(A | E, B), requiring 2 + 2 + 4 = 8 probabilities, plus 2 marginal probabilities P(E) and P(B), giving 2 more. Where do these probabilities come from? Expert knowledge, or data (relative frequency estimates).

Inference (Reasoning) in Bayesian Networks. Consider answering a query in a Bayesian network, where Q = the set of query variables and e = the evidence (a set of instantiated variable-value pairs). Inference = computation of the conditional distribution P(Q | e). Examples: P(Burglary | Alarm), P(Earthquake | JCalls, MCalls), P(JCalls, MCalls | Burglary, Earthquake). Can we use the structure of the Bayesian network to answer such queries efficiently? Yes; generally speaking, the complexity is inversely related to the sparsity of the graph.
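
Below is a minimal enumeration sketch for such a query. The CPT values are not given in the slides above; the numbers used here are the standard AIMA burglary-network values, included purely for illustration.

```python
# Compute P(B | J=1, M=1) by building the full joint from the factored form
# and summing out A and E.
import itertools

P_B = {1: 0.001, 0: 0.999}
P_E = {1: 0.002, 0: 0.998}
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}                                          # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}                                          # P(M=1 | A)

def joint(b, e, a, j, m):
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# P(B=1 | J=1, M=1) = P(B=1, J=1, M=1) / P(J=1, M=1)
num = sum(joint(1, e, a, 1, 1) for e, a in itertools.product((0, 1), repeat=2))
den = sum(joint(b, e, a, 1, 1) for b, e, a in itertools.product((0, 1), repeat=3))
print(num / den)    # about 0.28
```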

Why Bayesian Classifiers? They capture prior knowledge of a particular domain, encode causality, work well with incomplete data, are robust to model overfitting, allow new variables to be added easily, and produce probabilistic outputs. But a lot of time and effort is spent constructing the network.

Support Vector Machines

Support Vector Machines. Find a linear hyperplane (decision boundary) that will separate the data.

Support Vector Machines. [Figure: hyperplane B_1.] One possible solution.

Support Vector Machines. [Figure: hyperplane B_2.] Another possible solution.

Support Vector Machines. [Figure] Other possible solutions.

Support Vector Machines. Which hyperplane is better, B_1 or B_2? How do you define "better"?

Support Vector Machines. [Figure: hyperplanes B_1 and B_2 with their margin boundaries b_11, b_12 and b_21, b_22.] Find the hyperplane that maximizes the margin => B_1 is better than B_2.

Support Vector Machines. [Figure: hyperplane B_1 defined by w·x + b = 0, with margin boundaries w·x + b = +1 (b_11) and w·x + b = -1 (b_12).] Classify a point as: f(x) = 1 if w·x + b >= 1, and f(x) = -1 if w·x + b <= -1. Margin = 2 / ||w||.

Support Vector Machines. We want to maximize the margin, 2 / ||w||, which is equivalent to minimizing L(w) = ||w||^2 / 2, subject to the constraints: f(x_i) = 1 if w·x_i + b >= 1, and f(x_i) = -1 if w·x_i + b <= -1. This is a constrained optimization problem; numerical approaches (e.g., quadratic programming) are used to solve it.
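
As a hedged illustration (not the quadratic program itself), the sketch below uses scikit-learn's SVC with a linear kernel and a very large C to approximate the hard-margin solution on a synthetic dataset, then reads off w, b, and the margin 2 / ||w||. The dataset and parameter values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)
clf = SVC(kernel="linear", C=1e6)     # very large C approximates the hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("margin =", 2 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_vectors_))
```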

Support Vector Machines. What if the problem is not linearly separable?

Support Vector Machines. What if the problem is not linearly separable? Introduce slack variables ξ_i. Need to minimize: L(w) = ||w||^2 / 2 + C (Σ_{i=1}^{N} ξ_i)^k, subject to: f(x_i) = 1 if w·x_i + b >= 1 - ξ_i, and f(x_i) = -1 if w·x_i + b <= -1 + ξ_i.
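
The following sketch illustrates the slack-variable trade-off using scikit-learn's C parameter, which plays the role of the penalty C above with k = 1; the dataset and C values are arbitrary. Smaller C tolerates more margin violations and yields a wider margin; larger C penalizes violations heavily and narrows the margin.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=1)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:7.2f}  margin={2 / np.linalg.norm(w):.3f}  "
          f"support vectors={len(clf.support_vectors_)}")
```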

Nonlinear Support Vector Machines. What if the decision boundary is not linear?

Nonlinear Support Vector Machines. Transform the data into a higher-dimensional space.
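
A hedged sketch of this idea: concentric circles are not linearly separable in the original space, but an RBF kernel, which implicitly maps the data into a higher-dimensional feature space, separates them easily. The dataset and gamma value are arbitrary choices; scikit-learn is assumed.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)
print("linear kernel accuracy:", linear.score(X, y))   # near chance
print("RBF kernel accuracy:   ", rbf.score(X, y))      # near 1.0
```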

Why SVMs? The optimization problem is convex, so there is no trapping in local minima. SVMs work for categorical and continuous data. Model complexity can be controlled through the cost function and the margin parameters. The kernel trick (not discussed in detail) extends SVMs to non-linear decision boundaries.

Loss, bias, variance, and noise (cannon-and-target analogy). Bias: depends on the angle of the cannon (the average shot is offset from the target). Variance: depends on the force and the size of the cannonball (shots scatter around the average shot). Noise: depends on the target position. [Figure: shots scattered around an average shot, offset from the target.]

[Figure: the same analogy with a bird instead of a cannonball.] Bias: depends on the angle. Variance: depends on the force and the size of the bird. Noise: depends on the target position.

Example: Bias.

Bias-Variance (Generalization).

For a better generalizable model, minimize both bias and variance. However, neglecting the input data and predicting a constant output gives zero variance but high bias. On the other hand, perfectly interpolating the given data to produce f = f* implies zero bias but high variance.

Model Complexity. Simple models of low complexity: high bias, small variance; potentially rubbish, but stable predictions. Flexible models of high complexity: small bias, high variance; over-complex models can always be massaged to exactly explain the observed training data. What is the right level of model complexity? This is the problem of model selection.

Complexity of the model. [Figure: error E = bias + variance plotted against complexity.] Usually, the bias is a decreasing function of the complexity, while the variance is an increasing function of the complexity.
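
A hedged simulation of this picture (all constants below are arbitrary choices): polynomials of increasing degree are fit to many noisy samples of a known target, and the bias^2 and variance of the predictions are estimated. Low-degree fits show high bias and low variance; high-degree fits show the reverse.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    preds = []
    for _ in range(200):                              # 200 independent training sets
        x = rng.uniform(0, 1, 30)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)
    var = preds.var(axis=0).mean()
    print(f"degree={degree}  bias^2={bias2:.3f}  variance={var:.3f}")
```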

Ensemble Methods. Construct a set of classifiers from the training data, then predict the class label of previously unseen records by aggregating the predictions made by the multiple classifiers.

General Idea. Step 1: from the original training data D, create multiple data sets D_1, D_2, ..., D_{t-1}, D_t. Step 2: build multiple classifiers C_1, C_2, ..., C_{t-1}, C_t. Step 3: combine the classifiers into C*.

Why does it work? Suppose there are 25 base classifiers, each with error rate ε = 0.35, and assume the classifiers are independent. The probability that the ensemble (majority-vote) classifier makes a wrong prediction is Σ_{i=13}^{25} C(25, i) ε^i (1 - ε)^(25 - i) ≈ 0.06.
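
A quick check of this arithmetic (Python 3.8+ is assumed for math.comb):

```python
# Probability that a majority of 25 independent base classifiers
# (each with error 0.35) are wrong.
from math import comb

eps, n = 0.35, 25
p_ensemble_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                       for i in range(13, n + 1))
print(round(p_ensemble_error, 3))   # ~0.06
```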

Examples of Ensemble Methods. How to generate an ensemble of classifiers? Bagging and boosting.

Bagging (Bootstrap Aggregation). Create classifiers by drawing bootstrap samples of size equal to the original dataset (approximately 63% of the data points will be chosen in each sample), learn a classifier from each sample, and have the classifiers vote. Why does this help? If there is high variance, i.e., the classifier is unstable, bagging helps reduce the errors due to fluctuations in the training data. If the classifier is stable, i.e., the error of the ensemble comes primarily from bias in the base classifier, bagging may degrade performance.
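
A hedged bagging sketch (scikit-learn decision trees as the unstable base classifier; the dataset and ensemble size are arbitrary choices):

```python
# Bootstrap samples (about 63% unique points each), one tree per sample,
# majority vote at prediction time.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, random_state=0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))          # bootstrap sample, with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])    # each row: one tree's predictions
majority = (votes.mean(axis=0) >= 0.5).astype(int) # majority vote (binary labels)
print("training accuracy of the bagged vote:", (majority == y).mean())
```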

Boosting. An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records. Initially, all N records are assigned equal weights; unlike bagging, the weights may change at the end of each boosting round.

AdaBoost (Freund et al., 1997). Given a set T of n class-labeled tuples (x_1, y_1), ..., (x_n, y_n), initially all tuple weights are set to the same value (1/n). Generate k classifiers in k rounds. At the i-th round: tuples are sampled from T to form training set T_i, with each tuple's chance of selection depending on its weight; learn a model M_i from T_i; compute the error rate of M_i using T_i; if a tuple is misclassified, its weight is increased. During prediction, use the error of each classifier as a weight (vote) on each of the models.
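
A hedged sketch of AdaBoost in its reweighting form with decision stumps (the slide describes the sampling variant, which is analogous); the dataset, number of rounds, and base learner are arbitrary choices:

```python
# Misclassified points get more weight each round; each stump's vote is
# weighted by its accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
y_pm = np.where(y == 1, 1, -1)                     # labels in {-1, +1}

n = len(X)
w = np.full(n, 1 / n)                              # start with uniform weights
stumps, alphas = [], []
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y_pm, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y_pm].sum()
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10)) # classifier's vote weight
    w *= np.exp(-alpha * y_pm * pred)               # up-weight mistakes
    w /= w.sum()
    stumps.append(stump); alphas.append(alpha)

score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("boosted training accuracy:", (np.sign(score) == y_pm).mean())
```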

Why boosting/bagging? They reduce the variance of unstable classifiers (e.g., neural nets, decision trees); stable classifiers (e.g., k-NN) benefit less. They may, however, lead to results that are not easily interpretable.