Announcements. Proposals graded


1 Announcements Proposals graded Kevin Jamieson

2 Hypothesis testing. Machine Learning CSE546, Kevin Jamieson, University of Washington, October 30

3 Anomaly detection. You are Amazon and wish to detect transactions made with stolen credit cards. For each transaction we observe a feature vector $X$: { -address, age of account, anonymous PO box, price of items, copies of purchased item, etc. }, and the transaction is either real ($Y=0$) or fraudulent ($Y=1$). Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Your job is to build a (possibly randomized) decision function $\delta(x) \in \{0, 1\}$.

4 Anomaly detection. Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Your job is to build a (possibly randomized) decision function $\delta(x) \in \{0, 1\}$. Bayesian Hypothesis Testing: assume $P(Y = 1) = \pi$, so that $P(X = x) = \pi P_1(x) + (1-\pi) P_0(x)$, and choose $\arg\min_{\delta} P_{XY}(Y \neq \delta(X))$.
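The Bayesian test above minimizes the error probability by predicting whichever class has the larger posterior mass at $x$. A minimal sketch follows, assuming (purely for illustration) a scalar feature with Gaussian class-conditionals $P_0 = N(0,1)$ and $P_1 = N(2,1)$; the names `gaussian_pdf`, `bayes_decision`, and `prior_fraud` are hypothetical and not from the slides.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_decision(x, prior_fraud, p0, p1):
    """Return 1 (fraud) when prior * P1(x) exceeds (1 - prior) * P0(x), else 0."""
    return 1 if prior_fraud * p1(x) > (1 - prior_fraud) * p0(x) else 0

# Assumed toy densities: real purchases ~ N(0, 1), fraudulent ~ N(2, 1).
p0 = lambda x: gaussian_pdf(x, 0.0, 1.0)
p1 = lambda x: gaussian_pdf(x, 2.0, 1.0)

for prior in (0.5, 0.01):
    print(prior, [bayes_decision(x, prior, p0, p1) for x in (0.5, 1.5, 3.0, 3.5)])
```

With equal priors the decision boundary sits halfway between the two means; a small fraud prior pushes the boundary toward larger $x$, so fewer transactions are flagged.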

5 Anomaly detection. Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Your job is to build a (possibly randomized) decision function $\delta(x) \in \{0, 1\}$. Minimax Hypothesis Testing: $\arg\min_{\delta} \max\{P(\delta(X) = 0 \mid Y = 1),\ P(\delta(X) = 1 \mid Y = 0)\}$.

6 Anomaly detection. Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Your job is to build a (possibly randomized) decision function $\delta(x) \in \{0, 1\}$. Neyman-Pearson Hypothesis Testing: $\arg\max_{\delta} P(\delta(X) = 1 \mid Y = 1)$ subject to $P(\delta(X) = 1 \mid Y = 0) \le \alpha$.

7 Neyman-Pearson Testing. Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Neyman-Pearson Hypothesis Testing: $\arg\max_{\delta} P(\delta(X) = 1 \mid Y = 1)$ subject to $P(\delta(X) = 1 \mid Y = 0) \le \alpha$. Theorem: the optimal test has the form
$$P(\delta(x) = 1) = \begin{cases} 1 & \text{if } P_1(x)/P_0(x) > \eta \\ \gamma & \text{if } P_1(x)/P_0(x) = \eta \\ 0 & \text{if } P_1(x)/P_0(x) < \eta \end{cases}$$
and satisfies $P(\delta(X) = 1 \mid Y = 0) = \alpha$.
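To make the likelihood-ratio form concrete, here is a sketch for an assumed Gaussian shift problem, $P_0 = N(\mu_0, \sigma^2)$ versus $P_1 = N(\mu_1, \sigma^2)$ with $\mu_1 > \mu_0$. In that case the likelihood ratio is increasing in $x$, so thresholding the ratio is the same as thresholding $x$, and the tie case has probability zero, so no randomization is needed. The function names are hypothetical.

```python
from statistics import NormalDist

def np_threshold(alpha, mu0=0.0, sigma=1.0):
    """Threshold t on x such that P0(X > t) = alpha."""
    return NormalDist(mu0, sigma).inv_cdf(1 - alpha)

def np_test(x, threshold):
    """Reject H0 (return 1) when the observation exceeds the threshold."""
    return 1 if x > threshold else 0

def detection_probability(alpha, mu0=0.0, mu1=2.0, sigma=1.0):
    """P(delta(X) = 1 | Y = 1) for the level-alpha test."""
    t = np_threshold(alpha, mu0, sigma)
    return 1 - NormalDist(mu1, sigma).cdf(t)

t = np_threshold(0.05)
print("threshold:", t)
print("false alarm:", 1 - NormalDist(0.0, 1.0).cdf(t))
print("detection:", detection_probability(0.05))
```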

8 Neyman-Pearson Testing. Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Neyman-Pearson Hypothesis Testing: $\arg\max_{\delta} P(\delta(X) = 1 \mid Y = 1)$ subject to $P(\delta(X) = 1 \mid Y = 0) \le \alpha$. Example: [figure]

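9 ROC Curve. Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. [Figure: ROC curve. Vertical axis: probability of detection $P(\delta(X) = 1 \mid Y = 1)$, from 0 to 1. Horizontal axis: probability of false alarm $P(\delta(X) = 1 \mid Y = 0)$, from 0 to 1.]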
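An ROC curve can be traced by sweeping the decision threshold and recording the resulting (false alarm, detection) pair. The sketch below assumes the same illustrative Gaussian shift model, $P_0 = N(0, \sigma^2)$ and $P_1 = N(\mu_1, \sigma^2)$ with $\mu_1 > 0$, which is not part of the slides; `roc_points` is a hypothetical helper.

```python
from statistics import NormalDist

def roc_points(mu1=2.0, sigma=1.0, num=50):
    """Return (false_alarm, detection) pairs for tests of the form x > t."""
    null = NormalDist(0.0, sigma)
    alt = NormalDist(mu1, sigma)
    lo, hi = -4 * sigma, mu1 + 4 * sigma
    points = []
    for k in range(num):
        # Sweep the threshold from high to low so false alarm increases.
        t = hi - (hi - lo) * k / (num - 1)
        points.append((1 - null.cdf(t), 1 - alt.cdf(t)))
    return points

for false_alarm, detection in roc_points(num=5):
    print(round(false_alarm, 4), round(detection, 4))
```

Every point lies on or above the diagonal, since thresholding $x$ detects fraud at least as often as it raises a false alarm when $\mu_1 > 0$.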

10 p-values

11 Anomaly detection. You are Amazon and wish to detect transactions made with stolen credit cards. For each transaction we observe a feature vector $X$: { -address, age of account, anonymous PO box, price of items, copies of purchased item, etc. }, and the transaction is either real ($Y=0$) or fraudulent ($Y=1$). Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Your job is to build a (possibly randomized) decision function $\delta(x) \in \{0, 1\}$. It is natural to have a model for $P_0$ (regular purchases). But what if we have no model for $P_1$, since people are strategic?

12 p-value. Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Your job is to build a (possibly randomized) decision function $\delta(x) \in \{0, 1\}$. Definition (p-value): the probability of finding the observed, or more extreme, results when the null hypothesis $H_0$ is true (e.g., $X \sim P_0$). Definition (p-value): a uniformly distributed random variable under the null hypothesis (e.g., $X \sim P_0$). WARNING: a small p-value is NOT evidence that $H_1$ is true.

13 p-value. Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Definition (p-value): a uniformly distributed random variable under the null hypothesis (e.g., $X \sim P_0$). Let $P_0(x) = N(x; \mu_0, \sigma^2)$ and observe $x_i \in \mathbb{R}$. The p-value is
$$p_i = P_0(X \ge x_i) = \int_{x = x_i}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu_0)^2 / 2\sigma^2}\, dx.$$
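The Gaussian tail integral above has no elementary closed form, but it can be written with the complementary error function: $P_0(X \ge x_i) = \tfrac{1}{2}\,\mathrm{erfc}\big((x_i - \mu_0)/(\sigma\sqrt{2})\big)$. A small sketch, with `gaussian_p_value` as a hypothetical helper name:

```python
import math

def gaussian_p_value(x_obs, mu0=0.0, sigma=1.0):
    """Upper-tail p-value P0(X >= x_obs) under the null N(mu0, sigma^2)."""
    z = (x_obs - mu0) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for x in (0.0, 1.0, 1.96, 3.0):
    print(x, gaussian_p_value(x))
```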

14 p-value: used the right way. Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Definition (p-value): a uniformly distributed random variable under the null hypothesis (e.g., $X \sim P_0$). Set: $\alpha = .05$. Observe: $x_i \in \mathbb{R}$. p-value: $p_i = P_0(X \ge x_i)$. Test: if $p_i \le \alpha$ then reject the null hypothesis $H_0$.

15 p-value: used the wrong way. Hypothesis testing: $H_0: X \sim P_0$ versus $H_1: X \sim P_1$, where $P_k(x) = P(X = x \mid Y = k)$. Definition (p-value): a uniformly distributed random variable under the null hypothesis (e.g., $X \sim P_0$). BAD: Set: $\alpha = .05$. Observe: $x_i \in \mathbb{R}$. p-value: $p_i = P_0(X \ge x_i)$. Test: if $p_i \le \alpha$ then reject the null hypothesis $H_0$. If $p_i > \alpha$, repeat the experiment with a new $x_i$ until $p_i \le \alpha$.

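16 p-value: used the wrong way. Each day $i = 1, 2, \dots$ you measure an iid $x_i \sim N(\mu, 1)$, with $H_0: \mu = 0$. Under $H_0$ the statistic $Z_i = \frac{1}{\sqrt{i}} \sum_{j=1}^{i} x_j \sim N(0, 1)$, and the p-value on day $i$ is
$$p_i = \frac{1}{\sqrt{2\pi}} \int_{z = Z_i}^{\infty} e^{-z^2/2}\, dz.$$
[Figure: $p_i$ plotted against days $i$, with the level .05 marked on the vertical axis.]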
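The danger of checking $p_i$ every day and stopping at the first $p_i \le \alpha$ can be seen in simulation. The sketch below draws data under $H_0$ ($\mu = 0$) and estimates how often the null is falsely rejected at some point within a fixed horizon. The horizon of 100 days, the trial count, and the seed are arbitrary choices for illustration.

```python
import math
import random

def upper_tail_p_value(z):
    """P(N(0,1) >= z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def false_rejection_rate(max_days, alpha=0.05, trials=2000, seed=0):
    """Fraction of null experiments that reject on some day <= max_days."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        running_sum = 0.0
        for day in range(1, max_days + 1):
            running_sum += rng.gauss(0.0, 1.0)    # x_day ~ N(0, 1) under H0
            z = running_sum / math.sqrt(day)       # Z_day ~ N(0, 1) under H0
            if upper_tail_p_value(z) <= alpha:     # peek and stop at first "success"
                rejections += 1
                break
    return rejections / trials

one_look = false_rejection_rate(max_days=1)
many_looks = false_rejection_rate(max_days=100)
print("single test:", one_look)
print("peeking for 100 days:", many_looks)
```

A single test rejects a true null about $\alpha$ of the time, while repeated peeking rejects far more often, even though each individual $p_i$ is uniform under $H_0$.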

17 Multiple testing

18 Case study in adaptive sampling tradeoffs. "Drosophila RNAi screen identifies host genes important for influenza virus replication", Nature 2008. Wild type strain with 13,071 genes. Inhibit a single gene, infect with fluorescing virus (indicating the gene's influence), 72 h, microwell array. For each gene $i = 1, 2, \dots, n$ you measure $x_i \sim N(\mu_i, 1)$ and test $H_0(i): \mu_i = 0$. Consider this procedure for individual hypothesis testing: Set: $\alpha = .05$. Observe: $x_i \in \mathbb{R}$. p-value: $p_i = P_0(X \ge x_i)$. Test: if $p_i \le \alpha$ then reject the null hypothesis $H_0(i)$. Under $H_0$, for how many genes do we expect to reject the null hypothesis?

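19 Multiple Testing. Suppose we test $n$ hypotheses individually at level $\alpha$, and let $I_0 = \{i : H_0(i) \text{ is true}\}$. The expected number of false rejections is
$$E\Big[\sum_{i \in I_0} \mathbf{1}\{p_i \le \alpha\}\Big] = \sum_{i \in I_0} P(p_i \le \alpha) = \alpha |I_0|.$$
That's a lot of false alarms!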
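Plugging in the numbers from the case study gives a feel for the scale. The sketch below assumes the extreme case where every one of the 13,071 nulls is true, so $|I_0| = n$, and also checks the formula against uniform p-values drawn at random. The helper names and the seed are illustrative.

```python
import random

def expected_false_alarms(num_true_nulls, alpha):
    """alpha * |I_0|, the expected number of true nulls rejected."""
    return alpha * num_true_nulls

def simulate_false_alarms(num_true_nulls, alpha, seed=0):
    """Count rejections when every p-value is Uniform(0, 1), i.e. all nulls true."""
    rng = random.Random(seed)
    return sum(rng.random() <= alpha for _ in range(num_true_nulls))

print(expected_false_alarms(13071, 0.05))   # about 650 genes flagged by chance alone
print(simulate_false_alarms(13071, 0.05))
```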

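20 Multiple Testing - FWER. Family-wise error rate: $\text{FWER} = P(\text{reject any true null})$, with $I_0 = \{i : H_0(i) \text{ is true}\}$. Bonferroni rule: reject $i$ if $p_i \le \alpha/n$. Then, by the union bound,
$$\text{FWER} = P\Big(\bigcup_{i \in I_0} \{p_i \le \alpha/n\}\Big) \le \sum_{i \in I_0} P(p_i \le \alpha/n) = \frac{\alpha |I_0|}{n} \le \alpha.$$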
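The Bonferroni rule is a one-line filter on the p-values. A minimal sketch, with `bonferroni_reject` as a hypothetical name and made-up p-values:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return indices i with p_i <= alpha / n, where n = len(p_values)."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / n]

p_values = [0.001, 0.02, 0.04, 0.3]      # illustrative values only
print(bonferroni_reject(p_values))        # per-test level is 0.05 / 4 = 0.0125
```

The price of controlling FWER this way is power: with thousands of tests the per-test level $\alpha/n$ becomes very small.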

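21 Multiple Testing - FDR. False discovery rate: $\text{FDR} = E\Big[\frac{|I_0 \cap R|}{|R|}\Big]$, where $I_0 = \{i : H_0(i) \text{ is true}\}$ and $R$ is the set of rejected hypotheses. Benjamini-Hochberg procedure: sort the p-values such that $p_{(1)} \le p_{(2)} \le \dots \le p_{(n)}$, set $i_{\max} = \max\{i : p_{(i)} \le \alpha \frac{i}{n}\}$, and reject $R = \{i : i \le i_{\max}\}$ (indices in sorted order). Theorem: BH($\alpha$) satisfies $\text{FDR} \le \alpha$.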
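The Benjamini-Hochberg steps above can be sketched as follows. The function returns the original (unsorted) indices of the rejected hypotheses; the name `benjamini_hochberg` and the example p-values are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the original indices rejected by the BH(alpha) step-up procedure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])   # indices by increasing p
    i_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            i_max = rank                                   # keep the largest passing rank
    return sorted(order[:i_max])

print(benjamini_hochberg([0.001, 0.02, 0.04, 0.3]))
print(benjamini_hochberg([0.03, 0.04, 0.045, 0.049]))
```

Note the step-up behavior in the second call: no p-value passes at its own rank except the largest, yet that is enough to reject all four, whereas Bonferroni at the same level would reject none.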

22 Bayesian Methods

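23 MLE Recap - coin flips. Data: sequence $D = (HHTHT\dots)$, $k$ heads out of $n$ flips. Hypothesis: $P(\text{Heads}) = \theta$, $P(\text{Tails}) = 1 - \theta$, so $P(D \mid \theta) = \theta^{k} (1 - \theta)^{n - k}$. Maximum likelihood estimation (MLE): choose the $\theta$ that maximizes the probability of the observed data: $\hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \log P(D \mid \theta)$, which gives $\hat{\theta}_{MLE} = \frac{k}{n}$. 2017 Kevin Jamieson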
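The closed form $k/n$ comes from setting the derivative of the log-likelihood to zero: $\frac{k}{\theta} - \frac{n-k}{1-\theta} = 0$. As a sanity check, the sketch below compares it with a brute-force search over a grid of $\theta$ values; the grid resolution and the function names are arbitrary choices for illustration.

```python
import math

def log_likelihood(theta, k, n):
    """log P(D | theta) for k heads in n flips, with 0 < theta < 1."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def mle_closed_form(k, n):
    return k / n

def mle_grid(k, n, steps=10000):
    """Maximize the log-likelihood over interior grid points of (0, 1)."""
    grid = [(j + 0.5) / steps for j in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, k, n))

print(mle_closed_form(3, 5), mle_grid(3, 5))
```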

24 What about prior. Billionaire: "Wait, I know that the coin is close to [...]. What can you do for me now?" You say: "I can learn it the Bayesian way."

25 Bayesian Learning. Use Bayes rule: $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$. Or equivalently: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$.

26 Bayesian Learning for Coins. The likelihood function is simply Binomial: with $\alpha_H$ heads and $\alpha_T$ tails, $P(D \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$. What about the prior? It represents expert knowledge. Conjugate priors give a closed-form representation of the posterior. For the Binomial, the conjugate prior is the Beta distribution.

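27 Beta prior distribution $P(\theta)$. $P(\theta) = \frac{\theta^{\beta_H - 1} (1 - \theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \sim \text{Beta}(\beta_H, \beta_T)$. Mean: $\frac{\beta_H}{\beta_H + \beta_T}$. Mode: $\frac{\beta_H - 1}{\beta_H + \beta_T - 2}$. [Figure: densities of Beta(2,3) and Beta(20,30).] Likelihood function: $P(D \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$. Posterior: $P(\theta \mid D) \propto \theta^{\beta_H + \alpha_H - 1} (1 - \theta)^{\beta_T + \alpha_T - 1}$.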

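28 Posterior distribution. Prior: $\text{Beta}(\beta_H, \beta_T)$. Data: $\alpha_H$ heads and $\alpha_T$ tails. Posterior distribution: $P(\theta \mid D) \sim \text{Beta}(\beta_H + \alpha_H, \beta_T + \alpha_T)$. [Figure: densities of Beta(2,3) and Beta(20,30).]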
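Because the Beta prior is conjugate to the coin-flip likelihood, the posterior update amounts to adding the observed counts to the prior parameters. A small sketch using the two priors pictured on the slide, Beta(2,3) and Beta(20,30); the data (3 heads, 2 tails) and the function names are made up for illustration.

```python
def beta_posterior(beta_h, beta_t, heads, tails):
    """Parameters of the Beta posterior after observing the given counts."""
    return beta_h + heads, beta_t + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

for prior in ((2, 3), (20, 30)):
    a, b = beta_posterior(*prior, heads=3, tails=2)
    print(prior, "->", (a, b), "posterior mean", beta_mean(a, b))
```

Both priors have mean 0.4, but the Beta(20,30) prior acts like 50 earlier flips and barely moves, while Beta(2,3) shifts noticeably toward the data.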

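29 Using Bayesian posterior. Posterior distribution: $P(\theta \mid D) \sim \text{Beta}(\beta_H + \alpha_H, \beta_T + \alpha_T)$. Bayesian inference: estimate the mean, $E[\theta] = \int_0^1 \theta\, P(\theta \mid D)\, d\theta$, or an arbitrary function $f$, $E[f(\theta)] = \int_0^1 f(\theta)\, P(\theta \mid D)\, d\theta$. For arbitrary $f$ the integral is often hard to compute.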
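When the integral has no convenient closed form, one common workaround (not stated on the slide) is Monte Carlo: draw samples from the posterior and average $f$ over them. The sketch below uses the standard library's Beta sampler; the sample size, seed, and the name `posterior_expectation` are illustrative choices.

```python
import random

def posterior_expectation(f, a, b, num_samples=20000, seed=0):
    """Monte Carlo estimate of E[f(theta)] for theta ~ Beta(a, b)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += f(rng.betavariate(a, b))
    return total / num_samples

# Posterior Beta(5, 5), e.g. a Beta(2, 3) prior updated with 3 heads and 2 tails.
print(posterior_expectation(lambda t: t, 5, 5))        # estimate of E[theta]
print(posterior_expectation(lambda t: t ** 2, 5, 5))   # estimate of E[theta^2]
```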

30 MAP: Maximum a posteriori approximation. As more data is observed, the Beta posterior becomes more certain. MAP: use the most likely parameter, $\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D)$.


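32 MAP for Beta distribution. MAP: use the most likely parameter: $\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \frac{\beta_H + \alpha_H - 1}{\beta_H + \beta_T + \alpha_H + \alpha_T - 2}$. A Beta prior is equivalent to extra coin flips. As $N \to \infty$, the prior is forgotten. But for small sample sizes, the prior is important!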
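The effect of sample size on the MAP estimate is easy to check numerically. The sketch below compares MAP with the MLE for a Beta(2,3) prior at two made-up sample sizes with the same 60% heads rate; it assumes the posterior parameters both exceed 1, so the mode formula applies.

```python
def map_estimate(beta_h, beta_t, heads, tails):
    """Mode of the Beta(beta_h + heads, beta_t + tails) posterior."""
    return (beta_h + heads - 1) / (beta_h + beta_t + heads + tails - 2)

def mle_estimate(heads, tails):
    return heads / (heads + tails)

for heads, tails in ((3, 2), (3000, 2000)):
    print(heads + tails, "flips:",
          "MAP", map_estimate(2, 3, heads, tails),
          "MLE", mle_estimate(heads, tails))
```

With 5 flips the prior pulls the estimate well away from the MLE; with 5000 flips the two nearly coincide.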

33 Bayesian vs Frequentist. Data: $D$. Estimator: $\hat{\theta} = t(D)$. Loss: $\ell(t(D), \theta)$. Frequentists treat the unknown $\theta$ as fixed and the data $D$ as random. Bayesians treat the data $D$ as fixed and the unknown $\theta$ as random.

34 Recap for Bayesian learning. Bayesians are optimists: "If we model it correctly, we output the most likely answer." This assumes one can accurately model the observations and their link to the unknown parameter $\theta$, $p(x \mid \theta)$, and the distribution and structure of the unknown $\theta$, $p(\theta)$. Frequentists are pessimists: "All models are wrong, prove to me your estimate is good." This makes very few assumptions, e.g. $E[X^2] < \infty$, and constructs an estimator (e.g., median of means of disjoint subsets of data). Must analyze each estimate.
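The median-of-means estimator mentioned above splits the data into disjoint blocks, averages each block, and reports the median of those averages. A minimal sketch, assuming the data are iid so that consecutive blocks are as good as random ones; the function name and the toy data are illustrative.

```python
import statistics

def median_of_means(data, num_blocks):
    """Median of the means of num_blocks disjoint, equal-size blocks of data."""
    block_size = len(data) // num_blocks
    if block_size == 0:
        raise ValueError("need at least one sample per block")
    block_means = [
        statistics.fmean(data[b * block_size:(b + 1) * block_size])
        for b in range(num_blocks)
    ]
    return statistics.median(block_means)

data = [1.0] * 9 + [1000.0]          # one wild outlier among ten samples
print("mean:", statistics.fmean(data))
print("median of means:", median_of_means(data, num_blocks=5))
```

A single heavy-tailed sample can ruin the plain mean but corrupts only one block mean, which the median then ignores.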
