Info 2950, Lecture 4


Info 2950, Lecture 4, 7 Feb 2017. More Programming and Statistics Boot Camps? This week (only): PG Wed Office Hour, 8 Feb at 3pm. Prob Set 1: due Mon night 13 Feb (no extensions). Note: added a part to problem 2 (later today [Tues]).

[Figure: two plots of birthday coincidence probabilities as a function of group size n.] Red (upper): the probability $1 - \frac{365!}{(365-n)!\,365^n}$ that at least two birthdays coincide within a group of n people, as a function of n. Green (lower): the probability $1 - (364/365)^{n-1}$ of a birthday coinciding with yours within a group of n people including you.
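As a quick sanity check of the two formulas in the caption, here is a minimal Python sketch (function names are my own) evaluating them for a couple of group sizes:

from math import prod

def p_any_shared(n):
    # P(at least two of n people share a birthday) = 1 - 365!/((365-n)! 365^n)
    return 1 - prod((365 - k) / 365 for k in range(n))

def p_shares_yours(n):
    # P(someone else shares your birthday in a group of n people including you)
    return 1 - (364 / 365) ** (n - 1)

print(round(p_any_shared(23), 3))     # ~0.507: 23 people already give a >50% chance
print(round(p_shares_yours(254), 3))  # ~0.500: about 254 people needed to match a specific birthday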

Shared birthdays in this class (date and count): 01/19: 2, 01/27: 3, 02/20: 3, 03/30: 2, 04/01: 2, 04/04: 3, 05/08: 2, 05/16: 2, 05/31: 2, 06/02: 2, 06/11: 2, 07/01: 2, 07/24: 2, 08/13: 2, 08/17: 2, 10/05: 3, 10/17: 2, 11/07: 2, 11/27: 2, 12/10: 2, 12/27: 2, 12/28: 2, 06/15: 2, 06/25: 2. (See PS1 #4.)

Bayes Rule. A simple formula follows from the above definitions and the symmetry of the joint probability: $p(A|B)\,p(B) = p(A,B) = p(B,A) = p(B|A)\,p(A)$, so

$p(A|B) = \dfrac{p(B|A)\,p(A)}{p(B)}$.

Called Bayes' theorem or Bayes' rule, it connects inductive and deductive inference (Rev. Thomas Bayes (1763), Pierre-Simon Laplace (1812), Sir Harold Jeffreys (1939)). For mutually disjoint sets $A_i$ with $\bigcup_{i=1}^n A_i = S$, Bayes' rule takes the form

$p(A_i|B) = \dfrac{p(B|A_i)\,p(A_i)}{p(B|A_1)\,p(A_1) + \cdots + p(B|A_n)\,p(A_n)}$.
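As an illustration (not from the slides), the disjoint-hypothesis form of Bayes' rule is a few lines of Python; the function name and numbers below are just for illustration:

def bayes_posterior(priors, likelihoods):
    # posteriors p(A_i | B) from priors p(A_i) and likelihoods p(B | A_i)
    joint = [p * l for p, l in zip(priors, likelihoods)]  # p(B | A_i) p(A_i)
    evidence = sum(joint)                                  # p(B), by total probability
    return [j / evidence for j in joint]

print(bayes_posterior([0.5, 0.5], [0.9, 0.3]))  # [0.75, 0.25]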

Example 1: Consider a casino with loaded and unloaded dice. For a loaded die ($L$), the probability of rolling a 6 is 50%: $p(6|L) = 1/2$, and $p(i|L) = 1/10$ for $i = 1,\ldots,5$. For a fair die ($\bar L$), the probabilities are $p(i|\bar L) = 1/6$ for $i = 1,\ldots,6$. Suppose there's a 1% probability of choosing a loaded die: $p(L) = 1/100$. If we select a die at random and roll three consecutive 6's with it, what is the posterior probability, $p(L|6,6,6)$, that it was loaded?

The probability of the die being loaded, given 3 consecutive 6's, is

$p(L|6,6,6) = \dfrac{p(6,6,6|L)\,p(L)}{p(6,6,6)} = \dfrac{p(6|L)^3\,p(L)}{p(6|L)^3\,p(L) + p(6|\bar L)^3\,p(\bar L)} = \dfrac{(1/2)^3\,(1/100)}{(1/2)^3\,(1/100) + (1/6)^3\,(99/100)} = \dfrac{1}{1 + (1/3)^3\cdot 99} = \dfrac{1}{1 + 11/3} = \dfrac{3}{14} \approx .21$,

so there is only roughly a 21% chance that it was loaded. (Note that the Bayesian prior in the above is $p(L) = 1/100$, giving the expected probability before collecting the data from actual rolls, and it significantly affects the inferred posterior probability.)
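A quick numerical check of this example (variable names are my own):

p_L = 1 / 100        # prior probability the die is loaded
p6_loaded = 1 / 2    # p(6 | L)
p6_fair = 1 / 6      # p(6 | fair)

joint_loaded = p6_loaded ** 3 * p_L       # p(6,6,6 | L) p(L)
joint_fair = p6_fair ** 3 * (1 - p_L)     # p(6,6,6 | fair) p(fair)
print(joint_loaded / (joint_loaded + joint_fair))  # 0.2142... = 3/14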

Example 2: Duchenne Muscular Dystrophy (DMD). Model this as a simple recessive sex-linked disease caused by a mutated X chromosome ($\tilde X$): an $\tilde X Y$ male expresses the disease, while an $\tilde X X$ female is a carrier but does not express the disease. Neither of a woman's parents expresses the disease, but her brother does.

Since neither of the woman's parents expresses the disease but her brother does, the woman's mother must be a carrier, and the woman herself therefore has an a priori 50/50 chance of being a carrier: $p(C) = 1/2$. She gives birth to a healthy son (h.s.). What now is her probability of being a carrier?

Her probability of being a carrier, given a healthy son, is

$p(C|\text{h.s.}) = \dfrac{p(\text{h.s.}|C)\,p(C)}{p(\text{h.s.})} = \dfrac{p(\text{h.s.}|C)\,p(C)}{p(\text{h.s.}|C)\,p(C) + p(\text{h.s.}|\bar C)\,p(\bar C)} = \dfrac{(1/2)\,(1/2)}{(1/2)\,(1/2) + 1\cdot(1/2)} = \dfrac{1}{3}$

(where $\bar C$ means "not carrier").

Intuitively: if she's not a carrier, then there are two ways she could have a healthy son (from either of her good X's), whereas if she's a carrier there's only one way. So there are a total of three ways, and the probability that she's a carrier is 1/3, given the knowledge that she's had exactly one healthy son.

Also worth noting: the woman has a hidden state, $C$ or $\bar C$, determined once and for all; she doesn't make an independent coin flip each time she has a child as to whether or not she's a carrier. Prior to generating data about her son or sons, she has a Bayesian prior of 1/2 to be a carrier. Subsequent data permits a principled reassessment of that probability: continuously decreasing with each successive healthy son, or jumping to 1 if she has a single diseased son.
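To make the sequential updating concrete, here is a small sketch (under the assumptions of this example, with a function name of my own) of her posterior probability of being a carrier after k healthy sons; with the 1/2 prior it works out to 1/(1 + 2^k):

def p_carrier(k, prior=0.5):
    # posterior that she is a carrier, given k healthy sons and no diseased sons
    joint_carrier = prior * 0.5 ** k        # p(k healthy sons | C) p(C)
    joint_not_carrier = (1 - prior) * 1.0   # p(k healthy sons | not C) p(not C)
    return joint_carrier / (joint_carrier + joint_not_carrier)

for k in range(4):
    print(k, p_carrier(k))  # 0.5, 1/3, 1/5, 1/9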

Example 3: Suppose there's a rare genetic disease that affects 1 out of a million people: $p(D) = 10^{-6}$. Suppose a screening test for this disease is 100% sensitive (i.e., is always correct if one has the disease) and 99.99% specific (i.e., has a .01% false positive rate). Is it worthwhile to be screened for this disease?

Is it worthwhile to be screened for this disease? What is $p(D|+)$? The above sensitivity and specificity imply $p(+|D) = 1$ and $p(+|\bar D) = 10^{-4}$, so the probability of having the disease, given a positive test (+), is

$p(D|+) = \dfrac{p(+|D)\,p(D)}{p(+)} = \dfrac{p(+|D)\,p(D)}{p(+|D)\,p(D) + p(+|\bar D)\,p(\bar D)} = \dfrac{1\cdot 10^{-6}}{1\cdot 10^{-6} + 10^{-4}\,(1 - 10^{-6})} \approx 10^{-2}$,

and there's little point to being screened (only once).
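A numerical check of this example (variable names are my own):

p_D = 1e-6          # prior: disease prevalence
p_pos_D = 1.0       # sensitivity, p(+ | D)
p_pos_notD = 1e-4   # false positive rate, p(+ | not D)

numerator = p_pos_D * p_D
evidence = numerator + p_pos_notD * (1 - p_D)   # p(+)
print(numerator / evidence)  # ~0.0099: roughly a 1% chance of disease given a positive test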

Intuitively: if one million people were screened, we would expect roughly one to have the disease, but the test will give roughly 100 false positives. So a positive result would imply only roughly a 1 out of 100 chance of actually having the disease. In this case the result is dominated by the small [one in a million] Bayesian prior $p(D)$.

More Joint Probabilities:

$p(A, B | C) = p(A, B, C)/p(C) = p(A \cap B \cap C)/p(C)$

$p(A | B, C) = p(A, B, C)/p(B, C) = p(A \cap B \cap C)/p(B \cap C)$

Chain Rule:

$p(A, B, C) = p(A | B, C)\,p(B, C) = p(A | B, C)\,p(B | C)\,p(C)$

$p(A_n, A_{n-1}, \ldots, A_1) = p(A_n | A_{n-1}, \ldots, A_1)\,p(A_{n-1}, \ldots, A_1) = \ldots = p(A_n | A_{n-1}, \ldots, A_1)\,p(A_{n-1} | A_{n-2}, \ldots, A_1)\cdots p(A_3 | A_2, A_1)\,p(A_2 | A_1)\,p(A_1)$
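If helpful, the chain rule can be verified numerically on a toy joint distribution; this sketch (with made-up numbers) checks $p(A,B,C) = p(A|B,C)\,p(B|C)\,p(C)$ for three binary variables:

import itertools, random

random.seed(0)
# toy joint distribution: joint[(a, b, c)] = p(A=a, B=b, C=c)
weights = {abc: random.random() for abc in itertools.product([0, 1], repeat=3)}
total = sum(weights.values())
joint = {abc: w / total for abc, w in weights.items()}

def marginal(fixed):
    # sum the joint over all outcomes consistent with the fixed coordinates
    return sum(p for abc, p in joint.items()
               if all(abc[i] == v for i, v in fixed.items()))

a, b, c = 1, 0, 1
lhs = joint[(a, b, c)]
rhs = (marginal({0: a, 1: b, 2: c}) / marginal({1: b, 2: c})   # p(A=a | B=b, C=c)
       * marginal({1: b, 2: c}) / marginal({2: c})             # p(B=b | C=c)
       * marginal({2: c}))                                     # p(C=c)
print(abs(lhs - rhs) < 1e-12)  # True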

Binary Classifiers: Use a set of features to determine whether objects have binary (yes or no) properties. Examples: whether or not a text is classified as being about medicine, or whether an email is classified as spam. In those cases, the features of interest might be the words the text or email contains.

Naive Bayes methodology: a statistical method (making use of the word probability distribution), as contrasted with a rule-based method (where a set of heuristic rules is constructed and then has to be maintained over time). Advantages of the statistical method: features are automatically selected and weighted properly, with no additional ad hoc methodology; and it is easy to retrain as the training set evolves over time, using the same straightforward framework.

Spam Filters. A spam filter is a binary classifier where the property is whether a message is spam ($S$) or non-spam ($\bar S$). The features are the words of the message. Assume we have a training set of messages tagged as spam or non-spam, and use the document frequency of words in the two partitions as evidence regarding whether new messages are spam.

Example 1 (Rosen p. 422): Suppose the word "Rolex" appears in 250 messages of a set of 2000 spam messages, and in 5 of 1000 non-spam messages. Then we estimate $p(\text{Rolex}|S) = 250/2000 = .125$ and $p(\text{Rolex}|\bar S) = 5/1000 = .005$. Assuming a flat prior ($p(S) = p(\bar S) = 1/2$) in Bayes' law gives

$p(S|\text{Rolex}) = \dfrac{p(\text{Rolex}|S)\,p(S)}{p(\text{Rolex}|S)\,p(S) + p(\text{Rolex}|\bar S)\,p(\bar S)} = \dfrac{.125}{.125 + .005} = \dfrac{.125}{.130} \approx .962$.

With a rejection threshold of .9, this message would be rejected.
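The same arithmetic in a few lines of Python (counts taken from the example):

n_spam, n_ham = 2000, 1000
rolex_in_spam, rolex_in_ham = 250, 5

p_w_spam = rolex_in_spam / n_spam   # p(Rolex | S) = .125
p_w_ham = rolex_in_ham / n_ham      # p(Rolex | not S) = .005

# with a flat prior p(S) = p(not S) = 1/2, the prior cancels out of the ratio
print(round(p_w_spam / (p_w_spam + p_w_ham), 3))  # 0.962, rejected at a .9 threshold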

Example 2 (two words, "stock" and "undervalued"): Now suppose in a training set of 2000 spam messages and 1000 non-spam messages, the word "stock" appears in 400 spam messages and 60 non-spam, and the word "undervalued" appears in 200 spam and 25 non-spam messages. Then we estimate

$p(\text{stock}|S) = 400/2000 = .2$, $p(\text{stock}|\bar S) = 60/1000 = .06$,
$p(\text{undervalued}|S) = 200/2000 = .1$, $p(\text{undervalued}|\bar S) = 25/1000 = .025$.

Accurate estimates of the joint probability distributions $p(w_1, w_2|S)$ and $p(w_1, w_2|\bar S)$ would require too much data. Key assumption: assume statistical independence, to estimate them as

$p(w_1, w_2|S) = p(w_1|S)\,p(w_2|S)$ and $p(w_1, w_2|\bar S) = p(w_1|\bar S)\,p(w_2|\bar S)$.

(This assumption is not true in practice: words are not statistically independent. But we're only interested in determining whether the result is above or below some threshold, not trying to calculate an accurate $p(S|\{w_1, w_2, \ldots, w_n\})$.)

Write $w_1$ = "stock" and $w_2$ = "undervalued", and recall: $p(w_1|S) = 400/2000 = .2$, $p(w_1|\bar S) = 60/1000 = .06$, $p(w_2|S) = 200/2000 = .1$, $p(w_2|\bar S) = 25/1000 = .025$. So assuming a flat prior ($p(S) = p(\bar S) = 1/2$) and independence of the features gives

$p(S|w_1, w_2) = \dfrac{p(w_1, w_2|S)\,p(S)}{p(w_1, w_2|S)\,p(S) + p(w_1, w_2|\bar S)\,p(\bar S)} = \dfrac{p(w_1|S)\,p(w_2|S)\,p(S)}{p(w_1|S)\,p(w_2|S)\,p(S) + p(w_1|\bar S)\,p(w_2|\bar S)\,p(\bar S)} = \dfrac{.2 \cdot .1}{.2 \cdot .1 + .06 \cdot .025} \approx .930$;

at a .9 probability threshold, a message containing those two words would be rejected as spam.
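A quick check of the two-word estimate (counts from Example 2; variable names are my own):

p_w1_spam, p_w1_ham = 400 / 2000, 60 / 1000   # p(stock | S), p(stock | not S)
p_w2_spam, p_w2_ham = 200 / 2000, 25 / 1000   # p(undervalued | S), p(undervalued | not S)

# flat prior and naive independence assumption
spam_term = p_w1_spam * p_w2_spam
ham_term = p_w1_ham * p_w2_ham
print(round(spam_term / (spam_term + ham_term), 3))  # 0.930, rejected at a .9 threshold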

More generally, for $n$ features (words),

$p(S|\{w_1, w_2, \ldots, w_n\}) = \dfrac{p(\{w_1, w_2, \ldots, w_n\}|S)\,p(S)}{p(\{w_1, w_2, \ldots, w_n\})} = \dfrac{p(\{w_1, w_2, \ldots, w_n\}|S)\,p(S)}{p(\{w_1, w_2, \ldots, w_n\}|S)\,p(S) + p(\{w_1, w_2, \ldots, w_n\}|\bar S)\,p(\bar S)}$

$\overset{\text{naive}}{=} \dfrac{p(w_1|S)\,p(w_2|S)\cdots p(w_n|S)\,p(S)}{p(w_1|S)\,p(w_2|S)\cdots p(w_n|S)\,p(S) + p(w_1|\bar S)\,p(w_2|\bar S)\cdots p(w_n|\bar S)\,p(\bar S)} = \dfrac{p(S)\prod_{i=1}^n p(w_i|S)}{p(S)\prod_{i=1}^n p(w_i|S) + p(\bar S)\prod_{i=1}^n p(w_i|\bar S)}$
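A minimal sketch of this n-word rule, assuming the per-word probabilities have already been estimated from a tagged training set (the dictionaries, threshold, and function name below are illustrative, not real estimates):

from math import prod

p_word_spam = {"stock": 0.2, "undervalued": 0.1, "rolex": 0.125}
p_word_ham = {"stock": 0.06, "undervalued": 0.025, "rolex": 0.005}

def spam_probability(words, p_spam=0.5):
    # p(S | w_1,...,w_n) under the naive independence assumption;
    # words not seen in the training estimates are simply ignored here
    known = [w for w in words if w in p_word_spam and w in p_word_ham]
    spam_term = p_spam * prod(p_word_spam[w] for w in known)
    ham_term = (1 - p_spam) * prod(p_word_ham[w] for w in known)
    return spam_term / (spam_term + ham_term)

print(spam_probability(["stock", "undervalued"]))        # ~0.930
print(spam_probability(["stock", "undervalued"]) > 0.9)  # True: rejected as spam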