The Lady Tasting Tea: More Predictive Modeling

The Lady Tasting Tea

R. A. Fisher & the Lady
- B. Muriel Bristol claimed she preferred tea added to the milk rather than milk added to the tea.
- Fisher was skeptical that she could distinguish the two.
- Possible resolutions:
  - Reason about the chemistry of tea and milk: milk first, a little tea interacts with a lot of milk; tea first, vice versa.
  - Perform a clinical trial: ask her to determine the order for a series of test cups, and calculate the probability that her answers could have occurred by chance guessing. If that probability is small, she wins. This is Fisher's exact test (see the numerical sketch after these slides).

Significance testing
- Reject the null hypothesis (that the result happened by chance) if its probability is below some threshold: 0.1? 0.05? 0.01? 0.001? ... 0.000001? Which?

How to deal with multiple testing
- Suppose Ms. Bristol had tried this test 100 times, and passed once. Would you be convinced of her ability to distinguish?
- Bonferroni correction: for n trials, insist on a p-value that is 1/n of what you would demand for a single trial.
- Random permutations of the data yield a distribution of possible results; check whether the actual result is an outlier in that distribution. If so, it is unlikely to be due to random chance.

Need to explore many models
- Remember: training set => model; model + test set => measure of performance.
- But: how do we choose the best family of models? How do we choose the important features?
- Models may have structural parameters: the number of hidden units in an ANN, the maximum number of parents in a Bayes net.
- There are parameters (like the betas in logistic regression), and there are meta-parameters.
- It is not legitimate to try them all and report the best!
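A minimal numerical sketch (not from the slides) of the arithmetic behind Fisher's exact test and the Bonferroni rule, assuming the standard eight-cup design with four cups of each kind:

```python
from math import comb

# Fisher's exact test for the classic eight-cup design: 4 cups are milk-first,
# 4 are tea-first, and the taster must say which are which. Under the null
# hypothesis (pure guessing), the number of milk-first cups she labels
# correctly follows a hypergeometric distribution.
def p_value_at_least(correct, cups=8, milk_first=4):
    """P(chance guessing gets >= `correct` of the milk-first cups right)."""
    total = comb(cups, milk_first)
    return sum(comb(milk_first, k) * comb(cups - milk_first, milk_first - k)
               for k in range(correct, milk_first + 1)) / total

print(p_value_at_least(4))   # all four right: 1/70, about 0.014
print(p_value_at_least(3))   # three or more right: 17/70, about 0.24

# Bonferroni correction: if the test were repeated over 100 independent trials,
# demand a per-trial p-value of alpha / 100 before declaring success.
alpha, n_trials = 0.05, 100
print(alpha / n_trials)      # 0.0005 -- a single 1/70 result no longer qualifies
```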

Aliferis' lessons (part)
- Overfitting: bias, variance, noise.
  - O = optimal possible model over all possible learners
  - L = best model learnable by this learner
  - A = actual model learned
  - Bias = O - L (limitation of the learning method or target model family)
  - Variance = L - A (error due to sampling of training cases)
- Compare against learning from randomly permuted data.
- Curse of dimensionality: feature selection, dimensionality reduction.

Google's lessons
- Much of human knowledge is not like physics!
- "invariably, simple models and a lot of data trump more elaborate models based on less data"
- "simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules"
- "all the experimental evidence from the last decade suggests that throwing away rare events is almost always a bad idea, because much Web data consists of individually rare but collectively frequent events"

The curse of dimensionality
- Brian Hayes, http://www.americanscientist.org/issues/pub/an-adventure-in-the-nth-dimension
- [Table: volume of the unit n-dimensional ball for n = 1 to 50; it rises from 2.0 at n = 1 to a peak of about 5.26 at n = 5, then falls rapidly, down to about 1.7e-13 by n = 50. A short script reproducing these values follows these slides.]

Cross-validation
- Split the data into training data and test data; within the training data, hold out a validation set, leaving the "real" training data.
- Any number of times: train on some subset of the training data, test on the remainder (the validation set), and choose the best meta-parameters.
- Then train, with those meta-parameters, on all the training data.
- Test on the test data, once!

Can we deal with publication bias?
- Extrapolate from published studies to (perhaps) unpublished ones.
- Estimate the population of studies being performed: federal grant registers, ClinicalTrials.gov required registration.
- Public availability of study data allows alternative analyses.
- Journal of Negative Results.
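The table of n-ball volumes can be reproduced with the standard formula $V_n = \pi^{n/2} / \Gamma(n/2 + 1)$; this short script is my reconstruction, not part of the slides:

```python
from math import pi, gamma

def unit_ball_volume(n):
    """Volume of the unit ball in n dimensions: pi^(n/2) / Gamma(n/2 + 1)."""
    return pi ** (n / 2) / gamma(n / 2 + 1)

for n in range(1, 51):
    print(n, unit_ball_volume(n))

# The volume peaks near n = 5 (about 5.26) and then collapses toward zero:
# in high dimensions almost none of the enclosing cube's volume lies inside
# the ball, one face of the curse of dimensionality.
```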

Potential goals of a study
- Decision support in a clinical case: maximize the expected outcome for this patient.
- Policy to establish standards of care: FDA regulation of drugs, devices, ...; diagnostic and treatment recommendations, e.g., hormone replacement therapy, mammograms for breast cancer detection, prostate-specific antigen to detect prostate cancer, D.A.R.E.
- Scientific discovery.

What is the space of models to learn?
- Classification vs. regression:
  - Classification chooses one of a discrete set of answers, or a probability distribution over such a set (e.g., diagnosis).
  - Regression predicts some dependent variable, typically continuous (e.g., predict a lab value, or the time to some event).
- Probabilistic inference vs. decision analysis: i.e., are decisions formally modeled?
- Hidden states vs. all explicit:
  - Hidden states: HMM, BN, MDP, etc.
  - Explicit: autoregressive models, covariance, interpolation, logistic regression, etc.

Framework: models with no hidden (underlying) state
- True relationship: $y = f(\vec{x})$.
- Learned relationship: $\hat{y} = \hat{f}(\vec{x})$ predicts an estimated y.
- Minimize a loss $L(y, \hat{y})$; a least-squares fit minimizes $\sum_i (y_i - \hat{y}_i)^2$ over all cases i (see the sketch after these slides).
- The choice of the family of f determines the kinds of models we can build and the learning method.
- $\hat{f}$ can be learned by any function approximation method: regression, support vector machine, artificial neural network, Bayesian network, ...
- Extrapolation: hold constant for unobserved lab values, or use linear or spline interpolation/extrapolation.

Regression models
- Linear regression: $\hat{y} = \beta_0 + \sum_i \beta_i x_i$.
- Logistic regression: $\sigma(t) = \frac{e^t}{e^t + 1} = \frac{1}{1 + e^{-t}}$, so $\hat{y} = \sigma\!\left(\beta_0 + \sum_i \beta_i x_i\right) = \frac{1}{1 + e^{-(\beta_0 + \sum_i \beta_i x_i)}}$.
- [Figure: Anscombe's quartet, four data sets with the same linear fit. "Anscombe's quartet 3", via Wikimedia Commons, CC BY-SA 3.0, https://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg]
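As a concrete illustration of the no-hidden-state framework, here is a minimal least-squares sketch (my own example; the data and coefficients are invented) that learns $\hat{f}$ by minimizing $\sum_i (y_i - \hat{y}_i)^2$ with a linear model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # 100 cases, 3 features
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + 0.3 + rng.normal(scale=0.1, size=100)   # y = f(x) + noise

# Add an intercept column and minimize the sum of squared residuals.
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)                                   # roughly [0.3, 1.5, -2.0, 0.5]

def f_hat(x):
    """Learned relationship: y_hat = beta_0 + sum_i beta_i * x_i."""
    return beta_hat[0] + x @ beta_hat[1:]

print(f_hat(np.array([1.0, 0.0, -1.0])))
```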

Logistic regression
- Simple, fast, unsophisticated, but often works well.
- Given a number of cases 1, ..., k; for each case we have an outcome y and a vector of features $x = \{x_1, \ldots, x_n\}$.
- $\hat{y} = \mathrm{logit}^{-1}\!\left(\beta_0 + \sum_i \beta_i x_i\right) = \frac{1}{1 + \exp\!\left(-(\beta_0 + \sum_i \beta_i x_i)\right)}$
- Estimate the βs by a least-squares fit: minimize $\sum_{j=1}^{k} (y_j - \hat{y}_j)^2 + \lambda \lVert \beta \rVert$.
- The second (regularization) term penalizes model complexity:
  - The L1 norm minimizes the number of non-zero βs (LASSO).
  - The L2 norm shrinks the βs without zeroing them (ridge regression).
  - (The two penalties are contrasted in the sketch after these slides.)

Autoregressive models
- Autoregressive model of order p: $x_t = c + \sum_{i=1}^{p} \beta_i x_{t-i} + \epsilon_t$, where c is a constant, the βs are parameters, and ε is a (time-varying) noise term. (An AR(1) model with $\beta_1 = 1$ is a random walk.)
- Moving average model of order q: $x_t = \mu + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$, a finite impulse response to the noise.
- Autoregressive moving average (ARMA): $x_t = c + \epsilon_t + \sum_{i=1}^{p} \beta_i x_{t-i} + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$.
- Autoregressive integrated moving average (ARIMA): an autoregressive moving average model applied to discrete derivatives (differences) of the series; see, e.g., http://people.duke.edu/~rnau/411arim.htm

Inferential models: naïve Bayes
- y (unobserved) is the diagnosis; the $x_i$ (observed) are the symptoms.

Bipartite graph models
- The $y_i$ are diseases, unobserved (no longer exhaustive and mutually exclusive); the $x_i$ are symptoms.
- Each x depends on all m diseases, hence $2^m$ conditional probabilities per symptom; further assumptions (e.g., noisy-OR) reduce this complexity.
- [Diagram: diseases $y_1, \ldots, y_m$ in one layer, symptoms $x_1, \ldots, x_n$ in the other.]
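To make the LASSO vs. ridge contrast concrete, here is a small sketch (assuming scikit-learn is available; the synthetic data and the penalty strength C = 0.1 are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Only the first 3 of 20 features actually influence the outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
logit = 2 * X[:, 0] - 3 * X[:, 1] + 1.5 * X[:, 2]
y = (rng.random(500) < 1 / (1 + np.exp(-logit))).astype(int)

# In scikit-learn, C is the inverse of the regularization strength lambda.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
ridge = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("non-zero betas, L1 penalty:", int(np.sum(lasso.coef_ != 0)))  # few survive
print("non-zero betas, L2 penalty:", int(np.sum(ridge.coef_ != 0)))  # all shrunk, none zeroed
```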

Markov model
- A transition model over a sequence of time steps.
- Observed: the y's are observable. Hidden: the y's are unobservable (a hidden Markov model).
- [Diagram: a chain $y_1 \rightarrow y_2 \rightarrow \cdots \rightarrow y_m$ unfolding over time, with observations $x_1, \ldots, x_n$.]

Bayes network
- Specified by conditional probabilities of each node given its parents; the absence of arcs implies independence (a tiny numerical example follows these slides).
- [Diagram: a small directed network over nodes A, B, C, D, E.]

The ALARM network
- A large Bayesian network for monitoring mechanical ventilation.
- I. A. Beinlich, et al. The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks. In Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, pages 247-256. Springer-Verlag, 1989.

Large Bayesian networks
- David Heckerman, Pathfinder/Intellipath, around 1990; 109 nodes.
- D. Heckerman, E. Horvitz, and B. Nathwani. Towards Normative Expert Systems: Part I. The Pathfinder Project. Methods of Information in Medicine, 31:90-105, 1992.
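A tiny, purely illustrative example of how conditional probabilities and the absence of arcs define a joint distribution (the numbers are invented; this is neither ALARM nor Pathfinder). Nodes A and B have no parents and C depends on both, so the joint factorizes as P(A, B, C) = P(A) P(B) P(C | A, B):

```python
from itertools import product

P_A = {True: 0.2, False: 0.8}
P_B = {True: 0.5, False: 0.5}
P_C_given_AB = {(True, True): 0.9, (True, False): 0.6,
                (False, True): 0.4, (False, False): 0.1}

def joint(a, b, c):
    """P(A=a, B=b, C=c) from the factored form P(A) * P(B) * P(C | A, B)."""
    p_c = P_C_given_AB[(a, b)]
    return P_A[a] * P_B[b] * (p_c if c else 1 - p_c)

# Inference by enumeration: P(A = true | C = true), summing out B.
num = sum(joint(True, b, True) for b in (True, False))
den = sum(joint(a, b, True) for a, b in product((True, False), repeat=2))
print(num / den)   # about 0.43: observing C raises belief in A from the prior 0.2
```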

(Deep) neural networks
- Every node computes a logistic regression function of its inputs.
- The number of nodes in each layer may vary; the number of layers is another hyperparameter.
- Training is by back-propagation: change the weights in proportion to the error signal (see the sketch after these slides).
- A 1-layer network of this kind is used in word2vec.
- [Diagram: layers of units $y_1, \ldots, y_n$ stacked above inputs $x_1, \ldots, x_n$.]

Influence diagrams
- Model not only hidden and observable variables, but also tests and interventions, and the utilities of various states.
- A compact representation of a complex decision tree.
- Issues: complexity of fitting and inference; discounting of utilities; for recurring problems (e.g., chronic treatment), a policy vs. a one-time optimal choice.
- [Diagram: states $s_1, s_2$, an action $a_1$, observations $x_3, x_4$, and utility nodes $u$; from http://www.structureddecisionmaking.org/tools/toolsinfluencediagram/]
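A minimal back-propagation sketch (my own illustration, not from the slides): each hidden unit computes a logistic regression of its inputs, and every weight is adjusted in proportion to the error signal propagated back to it (squared-error loss, plain gradient descent):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(float)        # toy binary target

W1 = rng.normal(scale=0.5, size=(4, 8))          # input -> hidden layer (8 units)
W2 = rng.normal(scale=0.5, size=(8, 1))          # hidden -> output
lr = 0.5

for _ in range(2000):
    h = sigmoid(X @ W1)                          # each hidden unit: a logistic unit
    y_hat = sigmoid(h @ W2)[:, 0]
    err = y_hat - y                              # error signal at the output
    # Back-propagation: the chain rule carries the error back through each layer.
    grad_out = err * y_hat * (1 - y_hat)         # gradient at the output pre-activation
    grad_W2 = h.T @ grad_out[:, None] / len(X)
    grad_h = grad_out[:, None] * W2.T * h * (1 - h)
    grad_W1 = X.T @ grad_h / len(X)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

accuracy = np.mean((sigmoid(sigmoid(X @ W1) @ W2)[:, 0] > 0.5) == (y > 0.5))
print(accuracy)                                  # should approach 1.0 on this toy data
```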