The Lady Tasting Tea: More Predictive Modeling

R. A. Fisher & the Lady
- B. Muriel Bristol claimed she could tell whether the tea was added to the milk or the milk added to the tea
- Fisher was skeptical that she could distinguish
- Possible resolutions:
  - Reason about the chemistry of tea and milk
    - Milk first: a little tea interacts with a lot of milk
    - Tea first: vice versa
  - Perform a clinical trial
    - Ask her to determine the order for a series of test cups
    - Calculate the probability that her answers could have occurred by chance guessing; if it is small, she wins

Fisher's Exact Test
- Significance testing
- Reject the null hypothesis (that the result happened by chance) if its probability is below some threshold: 0.1, 0.05, 0.01, 0.001, ... 0.000001? Which?

How to Deal with Multiple Testing
- We need to explore many models
- Suppose Ms. Bristol had tried this test 100 times, and passed once. Would you be convinced of her ability to distinguish?
- Bonferroni correction: for n trials, insist on a p-value that is 1/n of what you would demand for a single trial
- Permutation testing: random permutations of the data yield a distribution of possible results; check whether the actual result is an outlier in this distribution; if so, it is unlikely to be due to random chance
- Remember:
  - training set => model
  - model + test set => measure of performance
- But:
  - How do we choose the best family of models?
  - How do we choose the important features?
  - Models may have structural parameters
    - Number of hidden units in an ANN
    - Max number of parents in a Bayes net
  - Parameters (like the betas in LR), and meta-parameters
- It is not legitimate to try them all and report the best!
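The tea-tasting experiment can be worked out directly. In Fisher's classic design there are 8 cups, 4 of each kind, and the lady must pick which 4 were milk-first; the number she gets right under chance guessing follows the hypergeometric distribution. A minimal sketch (pure Python; the function name is ours):

```python
from math import comb

def tea_test_p_value(correct_needed: int = 4, cups: int = 8, milk_first: int = 4) -> float:
    """One-sided p-value for the lady tasting tea.

    She picks which `milk_first` of `cups` cups were milk-first; under
    chance guessing, the number of truly milk-first cups among her picks
    is hypergeometric.  Returns P(at least `correct_needed` correct).
    """
    total = comb(cups, milk_first)  # number of equally likely ways to pick
    tail = sum(
        comb(milk_first, k) * comb(cups - milk_first, milk_first - k)
        for k in range(correct_needed, milk_first + 1)
    )
    return tail / total

# All 4 correct out of 8 cups: p = 1/70, about 0.014, below the usual 0.05
p_all_correct = tea_test_p_value()
```

Getting only 3 of 4 right, by contrast, has a one-sided p-value of 17/70, about 0.24, which would not reject the null hypothesis.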
Aliferis Lessons (part)

Overfitting: Bias, Variance, Noise
- O = optimal possible model over all possible learners
- L = best model learnable by this learner
- A = actual model learned
- Bias = O - L (limitation of the learning method or target model)
- Variance = L - A (error due to sampling of training cases)
- Compare against learning from randomly permuted data
- Curse of dimensionality
  - Feature selection
  - Dimensionality reduction

Google's Lessons
- Much of human knowledge is not like physics!
- "invariably, simple models and a lot of data trump more elaborate models based on less data"
- "simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules"
- "all the experimental evidence from the last decade suggests that throwing away rare events is almost always a bad idea, because much Web data consists of individually rare but collectively frequent events"

Curse of Dimensionality
- Brian Hayes, http://www.americanscientist.org/issues/pub/an-adventure-in-the-nth-dimension
- [Table: volume of the unit n-ball for n = 1..50; the volume peaks at about 5.26 for n = 5, then falls toward zero (about 1.7e-13 by n = 50)]

Cross-Validation
- [Diagram: Training Data split into Real Training Data and Validation Data]

Can We Deal with Publication Bias?
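The curse-of-dimensionality table from Hayes's article is easy to reproduce: the volume of the unit n-ball is pi^(n/2) / Gamma(n/2 + 1). A minimal sketch:

```python
from math import pi, gamma

def unit_ball_volume(n: int) -> float:
    """Volume of the unit ball in n dimensions: pi^(n/2) / Gamma(n/2 + 1)."""
    return pi ** (n / 2) / gamma(n / 2 + 1)

# The volume rises to a peak at n = 5 (about 5.26), then shrinks toward zero
volumes = {n: unit_ball_volume(n) for n in range(1, 51)}
peak = max(volumes, key=volumes.get)
```

Almost all of a high-dimensional cube's volume lies outside its inscribed ball, which is one way to see why nearest-neighbor distances and density estimates behave badly as dimension grows.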
- [Diagram: Test Data and Validation Data held out from the training data]
- Extrapolate from published studies to (perhaps) unpublished ones
- Estimate the population of studies being performed
  - Federal grant register
  - ClinicalTrials.gov required registration
- Public availability of study data allows alternative analyses
- Journal of Negative Results

Cross-Validation
- Any number of times:
  - Train on some subset of the training data
  - Test on the remainder, called the validation set
- Choose the best meta-parameters
- Train, with those meta-parameters, on all the training data
- Test on the test data, once!
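The cross-validation loop above depends on splitting the training cases into folds so that each case serves as validation data exactly once; a minimal index-splitting sketch (the function name is ours, not from any particular library):

```python
def k_fold_indices(n_cases: int, k: int):
    """Yield (train_idx, validation_idx) pairs for k-fold cross-validation.

    Each case appears in exactly one validation fold; the remaining
    cases form that fold's training set.
    """
    fold_size, remainder = divmod(n_cases, k)
    indices = list(range(n_cases))
    start = 0
    for fold in range(k):
        # spread any leftover cases across the first `remainder` folds
        size = fold_size + (1 if fold < remainder else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
```

After picking the meta-parameters that do best on the validation folds, retrain on all the training data and touch the held-out test set exactly once.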
Potential Goals of a Study
- Decision support in a clinical case
  - Maximize the expected outcome for this patient
- Policy to establish standards of care
  - FDA regulation of drugs, devices, ...
  - Diagnostic and treatment recommendations, e.g., hormone replacement therapy, mammograms for breast cancer detection, prostate-specific antigen to detect prostate cancer, D.A.R.E.
- Scientific discovery

What is the Space of Models to Learn?
- Classification vs. regression
  - Classification chooses one of a discrete set of answers, or a probability distribution over such a set; e.g., diagnosis
  - Regression predicts some dependent variable, typically continuous; e.g., predict a lab value, or time to some event
- Probabilistic inference vs. decision analysis, i.e., are decisions formally modeled?
- Hidden states vs. all explicit
  - Hidden states: HMM, BN, MDP, etc.
  - Explicit: autoregressive models, covariance, interpolation, logistic regression, etc.

Framework: Models with No Hidden (Underlying) State
- y = f(x): the true relationship
- ŷ = f̂(x): the learned relationship, which predicts an estimated y
- Minimize the error between y and ŷ
  - A least squares fit minimizes Σi (yi - ŷi)² over all i cases
- The choice of the family of f determines the kinds of models we can build and the learning method
- f̂ can be learned by any function approximation method: regression, support vector machine, artificial neural network, Bayesian network, ...
- Extrapolation
  - Hold constant for unobserved lab values
  - Linear or spline interpolation/extrapolation

Regression Models
- Linear regression: ŷ = β0 + Σi βi xi
- Logistic regression: σ(t) = e^t / (e^t + 1) = 1 / (1 + e^-t)
  - ŷ = σ(β0 + Σi βi xi) = 1 / (1 + exp(-(β0 + Σi βi xi)))
- [Figure: "Anscombe's quartet 3" by Schutz, derivative work by Avenue; licensed under CC BY-SA 3.0 via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg]
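For the one-feature linear case, the least squares fit that minimizes Σi (yi - ŷi)² has a simple closed form; a minimal sketch with illustrative data:

```python
def least_squares_line(xs, ys):
    """Fit y ≈ b0 + b1*x by minimizing the sum of squared residuals.

    Closed form: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Noise-free data drawn from y = 1 + 2x recovers the coefficients exactly
b0, b1 = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Anscombe's quartet is the standard caution here: four data sets with identical fitted lines and summary statistics but wildly different shapes, so always plot the data before trusting the fit.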
Logistic Regression
- Simple, fast, unsophisticated, but often works well
- Given a number of cases 1, ..., k
- For each case, we have an outcome y and a vector of features x = {x1, ..., xn}
- ŷ = logit⁻¹(β0 + Σi βi xi) = 1 / (1 + exp(-(β0 + Σi βi xi)))
- Estimate the βs by fitting the data (shown here as penalized least squares; in practice, usually penalized maximum likelihood)
  - Minimize Σk (yk - ŷk)² + λ Σi |βi| (or λ Σi βi² )
  - The second (regularization) term penalizes model complexity
  - The L1 norm minimizes the number of non-zero βs: LASSO
  - The L2 norm penalizes large βs: ridge regression

Autoregressive Models
- Autoregressive model of order p: x_t = c + Σ_{i=1..p} φi x_{t-i} + ε_t, where c is a constant, the φ are parameters, and ε is a (time-varying) noise term
- Moving average model of order q: x_t = μ + ε_t + Σ_{i=1..q} θi ε_{t-i}
  - A finite impulse response to noise
- A random walk is x_t = x_{t-1} + ε_t, i.e., an AR(1) model with φ1 = 1
- Autoregressive moving average (ARMA): x_t = c + ε_t + Σ_{i=1..p} φi x_{t-i} + Σ_{i=1..q} θi ε_{t-i}
- Autoregressive integrated moving average (ARIMA): an ARMA model applied to discrete differences of the series
- See, e.g., http://people.duke.edu/~rnau/411arim.htm

Inferential Models: Naïve Bayes
- y (unobserved) is the diagnosis
- The xi (observed) are the symptoms

Bipartite Graph Models
- The yi are diseases, unobserved (no longer exhaustive & mutually exclusive)
- The xi are symptoms
- Each x depends on all m diseases, hence 2^m conditional probabilities
- Further assumptions reduce this complexity (e.g., noisy-OR)
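The AR(p) recurrence above is straightforward to simulate; a minimal sketch with illustrative coefficients (set the noise to zero to check the recurrence by hand):

```python
def simulate_ar(coeffs, c, noise, n_steps, init=None):
    """Simulate x_t = c + sum_i(phi_i * x_{t-i}) + eps_t for an AR(p) model.

    coeffs: [phi_1, ..., phi_p]; noise: one eps_t per step (e.g. Gaussian
    draws, or zeros for a deterministic check); init: p starting values.
    """
    p = len(coeffs)
    xs = list(init) if init is not None else [0.0] * p
    for t in range(n_steps):
        # phi_1 multiplies x_{t-1}, phi_2 multiplies x_{t-2}, ...
        x_t = c + sum(phi * xs[-(i + 1)] for i, phi in enumerate(coeffs)) + noise[t]
        xs.append(x_t)
    return xs

# AR(2), no noise: x_2 = 1, x_3 = 1 + 0.5*1 = 1.5, x_4 = 1 + 0.5*1.5 + 0.25*1 = 2.0
series = simulate_ar([0.5, 0.25], c=1.0, noise=[0.0] * 3, n_steps=3)
```

An MA or ARMA simulation differs only in also keeping the last q noise terms and adding their θ-weighted sum at each step.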
Markov Model
- Transition model among time steps
- Observed: the y's are observable
- Hidden: the y's are unobservable (a hidden Markov model)
- [Diagram: y1 → y2 → ... → ym over time, each emitting observations x1, x2, ..., xn]

Bayes Network
- Conditional probabilities
- Absence of arcs implies independence
- [Diagram: nodes A, B, C, D, E with directed arcs]

The ALARM Network
- A large Bayesian network for monitoring mechanical ventilation
- I. A. Beinlich, et al. The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks. In Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, pages 247-256. Springer-Verlag, 1989.

Pathfinder
- David Heckerman, Pathfinder/Intellipath, around 1990
- 109 nodes
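In the smallest Bayes network, a single disease node with one symptom child, inference is just Bayes' rule over the conditional probability tables; a minimal sketch (all probability values are illustrative):

```python
def posterior(prior_d, p_s_given_d, p_s_given_not_d):
    """P(disease | symptom observed) by Bayes' rule in a two-node network.

    prior_d:         P(D)        -- prior probability of the disease
    p_s_given_d:     P(S | D)    -- symptom CPT entry when disease present
    p_s_given_not_d: P(S | ~D)   -- symptom CPT entry when disease absent
    """
    joint_d = prior_d * p_s_given_d              # P(D, S)
    joint_not_d = (1 - prior_d) * p_s_given_not_d  # P(~D, S)
    return joint_d / (joint_d + joint_not_d)       # normalize over P(S)

# Illustrative CPT values: P(D)=0.3, P(S|D)=0.9, P(S|~D)=0.2
p = posterior(0.3, 0.9, 0.2)  # = 0.27 / 0.41
```

Larger networks such as ALARM generalize this: exact inference sums the joint distribution over all unobserved nodes, which is why the absence of arcs (independence) matters so much for tractability.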
(Deep) Neural Networks
- Every node computes a logistic regression function of its inputs
- The number of nodes in each layer may vary
- The number of layers is another hyperparameter
- Training by back-propagation: change weights in proportion to the error signal
- A 1-layer network is used in word2vec
- [Diagram: inputs x1, ..., xn feeding layers of units y1, y2, y3, ..., yn]

Pathfinder reference:
D. Heckerman, E. Horvitz, and B. Nathwani. Toward Normative Expert Systems: Part I. The Pathfinder Project. Methods of Information in Medicine, 31:90-105, 1992

Influence Diagram
- http://www.structureddecisionmaking.org/tools/toolsinfluencediagram/
- Models not only hidden and observable variables but also:
  - Tests and interventions
  - Utilities of various states
- A compact representation of a complex decision tree
- Issues:
  - Complexity of fitting and inference
  - Discounting utilities
  - For recurring problems, policy vs. optimal choice (e.g., chronic treatment)
- [Diagram: states s1, s2; action a1; observations x3, x4; utilities u1, ...]
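Since every unit applies a logistic function to a weighted sum of its inputs, a forward pass through one hidden layer is just nested logistic regressions; a minimal sketch with illustrative weights (training by back-propagation would adjust these weights against the error signal):

```python
from math import exp

def sigmoid(t: float) -> float:
    """Logistic activation, 1 / (1 + e^-t)."""
    return 1.0 / (1.0 + exp(-t))

def layer(inputs, weights, biases):
    """One layer: each unit is a logistic regression over the layer's inputs."""
    return [
        sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
        for ws, b in zip(weights, biases)
    ]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: input -> one hidden layer -> output layer."""
    return layer(layer(x, hidden_w, hidden_b), out_w, out_b)

# With all-zero weights and biases, every unit outputs sigmoid(0) = 0.5
y = forward([1.0, 2.0], [[0, 0], [0, 0]], [0, 0], [[0, 0]], [0])
```

Stacking more calls to `layer` gives a deeper network; the list lengths encode the per-layer unit counts, the two structural hyperparameters the slide mentions.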