March 3, 2011. Prepared for a Purdue Machine Learning Seminar.
Acknowledgements
Prof. A. P. Dempster, for intensive collaborations on the Dempster-Shafer theory.
Jianchun Zhang, Ryan Martin, Duncan Ermini Leaf, Zouyi Zhang, Huiping Xu, Jing-Shiang Hwang, Jun Xie, and Hyokun Yun, for collaborations on a variety of IM research projects.
NSF, for support of a joint project with Jun Xie on large-scale multinomial inference and its applications in genome-wide association studies.
References
Martin, R. and Liu, C. (2011) and the references therein.
A possible textbook (Liu and Martin, 2012+; Reasoning with Uncertainty) with the following features:
A prior-free and valid probabilistic inference system, which is promising for serious applications of statistics.
Fully developed valid probabilistic inferential methods for textbook problems.
A large collection of applications to modern, challenging, and large-scale statistical problems.
Deeper understanding of existing schools of thought and their strengths and weaknesses.
Satisfactory solutions to well-known benchmark problems, including Stein's paradox and the Behrens-Fisher problem.
A direct attack on the source of uncertainty, which makes learning and teaching easier and more enjoyable.
Abstract
It is difficult, perhaps, to believe that artificial intelligence can be made intelligent enough without a valid probabilistic inferential system as a critical module. After a brief review of existing schools of thought on uncertain inference, we introduce a valid probabilistic inferential framework termed inferential models (IMs). With several simple and benchmark examples, we discuss potential applications of IMs in artificial intelligence in general and machine learning in particular.
Artificial intelligence, machine learning, learning from data
What is it? An answer from the web: "Artificial Intelligence (AI) is the area of computer science focusing on creating machines that can engage in behaviors that humans consider intelligent. The ability to create intelligent machines has intrigued humans since ancient times, and today, with the advent of the computer and 50 years of research into AI programming techniques, the dream of smart machines is becoming a reality. Researchers are creating systems that can mimic human thought, understand speech, beat the best human chess player, and perform countless other feats never before possible."
Is the answer precise? If not, blame Google's machine learning algorithms.
What is it? An answer from the web: "Machine learning has been central to AI research from the beginning. Unsupervised learning is the ability to find patterns in a stream of input. Supervised learning includes both classification and numerical regression. Classification is used to determine what category something belongs in, after seeing a number of examples of things from several categories. Regression takes a set of numerical input/output examples and attempts to discover a continuous function that would generate the outputs from the inputs. In reinforcement learning the agent is rewarded for good responses and punished for bad ones. These can be analyzed in terms of decision theory, using concepts like utility. The mathematical analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory."
The inference problem
Input:
1. Data: x, the observed values of observable quantities X ∈ X.
2. Assertion: A, a statement about θ ∈ Θ, the unknown quantities.
3. Association between X and θ; for example, x is a sample from the population characterized by the CDF F_θ(·).
Output:
1. Probabilistic uncertainty assessments on the truth or the falsity of A given X = x.
2. Plausible regions for θ and its functions.
Uncertain inference is critical to AI. No?
One (simple) kind of uncertain inference: probability models
A probability model has a meaningful/valid probability distribution assumed to be adequate for everything. In particular, θ has a valid marginal distribution that can be operated on via the usual probability calculus to derive valid, e.g., marginal and conditional posterior distributions.
Subjective Bayesian: philosophically, every Bayesian is subjective. Bayes was not Bayesian.
What's wrong? Nothing is wrong: you make the decision, and (you or your clients) should take the consequences.
Statistical models
In what follows, we consider the cases where you don't have valid distributions for everything, which we refer to as statistical models. Here θ is taken to be unknown.
Objective Bayesian: a personal view
The idea can be viewed as using magic priors to approximate (ideal) frequentist results. Remarks:
Assertion-specific priors: certain priors can work only for certain assertions on θ.
Large-sample theory: it really addresses the case when uncertainty goes away; think about both normality and vanishing variances in very-high-dimensional problems.
Robust Bayesian: the worst-case-scenario thinking ultimately leads the Bayesian to a non-Bayesian school.
Existing schools of thought
Bayes: for it to work, it really requires valid priors.
Fiducial: it is very interesting. It is wrong (but better than Bayes[?]).
Dempster-Shafer: as an extension of both Bayes and fiducial, it requires valid independent individual components that are probabilistically meaningful. For example, individual components are specified with fiducial probabilities.
Frequentist: starting with specified rules and criteria, it invites the guess-and-check approach to uncertain inference. If so, is it very appealing? For example, 24+ methods for 2×2 tables and penalty-based methods.
Remarks
These existing methods are useful. Yet all these schools of thought fail for many benchmark examples, such as the many-normal-means, Behrens-Fisher, and constrained-parameter problems. Thinking outside the box may be necessary for new generations.
A valid probabilistic inference framework
The likelihood insufficiency principle
Likelihood alone is not sufficient for probabilistic inference. An unobserved but predictable quantity, called the auxiliary (a-)variable, must be introduced for predictive/probabilistic inference.
Remark: Bayes makes θ predictable. Is it credible/valid?
The "No Validity, No Probability" principle
Notation: denote by P_x(A) the probability for the truth of A given the observed data x.
Definition (validity). An inferential framework is said to be valid if, for all A ⊆ Θ, P_X(A), as a function of X, is stochastically no larger than Unif(0,1) under the falsity of A, i.e., under the truth of A^c, the negation of A.
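A minimal Monte Carlo sketch of this definition, assuming the X ∼ N(θ, 1) example used later in the talk and the plausibility function it induces (the closed form 2(1 − Φ(|x − θ0|)) is an assumption of this sketch, not part of this slide):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed model: X ~ N(theta, 1). Under the default predictive random set
# used later in the talk, the plausibility of the point assertion
# A = {theta0} is pl_x(theta0) = 2 * (1 - Phi(|x - theta0|)).
def plausibility(x, theta0):
    return 2 * (1 - norm.cdf(np.abs(x - theta0)))

theta_true, theta0 = 0.0, 1.5            # so the assertion A = {1.5} is false
X = rng.normal(theta_true, 1.0, size=100_000)
pl = plausibility(X, theta0)

# Validity: under the falsity of A, pl_X(A) is stochastically no larger than
# Unif(0,1), i.e. P(pl_X(A) <= t) >= t for every t in (0, 1).
for t in (0.05, 0.25, 0.50):
    assert (pl <= t).mean() >= t
```

The assertions pass: a false assertion is assigned small plausibility at least as often as a Unif(0,1) draw would be, which is exactly what the definition demands.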
The Inferential Model (IM) framework
An IM is valid and consists of three steps:
Association-step: associate X and θ with an a-variable z ∼ π_z to obtain the mapping Θ_x(z) ⊆ Θ consisting of the candidate values of θ given X = x and z.
Prediction-step: predict z with a credible predictive random set (PRS) S, i.e., P(S ∋ z) ∼ Unif(0,1) for z ∼ π_z.
Combination-step: combine x and S to obtain Θ_x(S) = ∪_{z ∈ S} Θ_x(z) and compute the evidence e_x(A) = P(Θ_x(S) ⊆ A) and e_x(A^c) = P(Θ_x(S) ⊆ A^c), with ē_x(A) = 1 − e_x(A^c) called the plausibility.
Example: X ∼ N(θ, 1)
A-step: X = θ + z, where z ∼ N(0,1).
P-step: S = [−|Z|, |Z|], where Z ∼ N(0,1).
C-step: e_x(A) and e_x(A^c) with Θ_x(S) = [x − |Z|, x + |Z|].
Figure: plausibility of the assertion A = {θ : θ = θ_0}, indexed by θ_0, given x = 1.96. Note e_x(θ_0) = 0.
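The C-step of this example can be sketched by direct simulation (a hedged illustration, not the talk's own code; the grid value θ_0 = 0 is chosen because the figure's x = 1.96 makes it land near the usual 5% level):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo version of the C-step: Theta_x(S) = [x - |Z|, x + |Z|] with
# Z ~ N(0,1), so pl_x(theta0) = P(theta0 in Theta_x(S)) = P(|Z| >= |x - theta0|),
# while e_x({theta0}) = 0 because a random interval never equals a singleton.
x = 1.96
Z = np.abs(rng.normal(size=200_000))

def pl(theta0):
    return float(np.mean(np.abs(x - theta0) <= Z))

assert pl(x) == 1.0                    # plausibility peaks at theta0 = x
assert abs(pl(0.0) - 0.05) < 0.01      # ~0.05 at theta0 = 0, as in the figure
```

The curve traced by pl(θ_0) over a grid reproduces the figure: it equals 1 at θ_0 = x = 1.96 and decays symmetrically, crossing 0.05 at θ_0 = 0.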
Example: X ∼ Binomial(n, θ)
This is a homework problem for Stat 598D.
Efficiency
See the Stat 598D lecture notes on Statistical Inference. Let b(z) be a continuous function and define S = {z : b(z) ≤ b(Z)}, where Z ∼ π_z. Then P(S ∋ z) ∼ Unif(0,1) for z ∼ π_z. We can use this result to construct credible PRSs.
Combining information: conditional IMs
Example (a textbook example). Consider the association model X_i = θ + z_i (z_i iid N(0,1), i = 1, ..., n). Write X̄ = θ + z̄ and X_i − X̄ = z_i − z̄ (i = 1, ..., n). Predict z̄ conditional on the observed a-quantities {z_i − z̄}_1^n. This leads to the simplified conditional IM:
A-step: X̄ = θ + n^{−1/2} u, where u ∼ N(0,1).
P-step: S = [−|U|, |U|], where U ∼ N(0,1).
C-step: Θ_x(S) = [X̄ − |U|/√n, X̄ + |U|/√n].
Efficient inference: marginal IMs
Example (another textbook example). Consider the association model X_i = η + σ z_i (z_i iid N(0,1), i = 1, ..., n). Let θ = (η, σ²) ∈ Θ = R × R⁺ and write X̄ = η + σ z̄, s_x² = σ² s_z², and (X − X̄1)/s_x = (z − z̄1)/s_z. Predict z̄ and s_z² conditional on the observed a-quantities (z − z̄1)/s_z. This leads to the simplified marginal IM:
A-step: X̄ = η + (s_x/√n) u and s_x² = σ² s_z², where u ∼ t_{n−1} and s_z² ∼ χ²_{n−1}.
P-step: S = [−|U|, |U|] × [0, ∞), where U ∼ t_{n−1}.
C-step: Θ_x(S) = [X̄ − |U| s_x/√n, X̄ + |U| s_x/√n] × [0, ∞).
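A minimal sketch of the C-step at the 95% level, on simulated data (the sample, its size, and the true η = 5 are illustrative assumptions, not from the talk):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(2)

# Marginal-IM 95% plausibility interval for eta: with
# Xbar = eta + s_x * u / sqrt(n), u ~ t_{n-1}, and PRS S = [-|U|, |U|],
# the level-0.95 region is Xbar +/- t_{n-1, 0.975} * s_x / sqrt(n),
# numerically the classical t-interval but derived via plausibility.
x = rng.normal(loc=5.0, scale=2.0, size=20)   # simulated sample, true eta = 5
n, xbar, sx = len(x), x.mean(), x.std(ddof=1)
q = t_dist.ppf(0.975, df=n - 1)
lo, hi = xbar - q * sx / np.sqrt(n), xbar + q * sx / np.sqrt(n)
print(f"95% plausibility interval for eta: ({lo:.2f}, {hi:.2f})")
```

Numerically this coincides with the frequentist t-interval; the IM derivation adds a validity guarantee for the plausibility function itself, not just for the interval's coverage.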
Model selection via AI (or by an AS, an Artificial Statistician)?
Consider choosing a model from a collection of models, including, e.g., normal for simplicity (and efficiency) and non-parametric for robustness. See Jianchun Zhang's PhD thesis for an IM-based method.
2×2 tables
Example (kidney stone treatment, Steven et al (1994))

Table 1. Small stones            Table 2. Large stones
Treatment  Success  Failure      Treatment  Success  Failure
A          81       6            A          192      71
B          234      26           B          55       25

For making an intelligent decision, there are (at least) two things to consider.
Prediction: condition on the stone type.
Estimation: combine data if possible. Thus, check the homogeneity of each of the following two tables.

Table 3. Treatment A             Table 4. Treatment B
Stone type  Success  Failure     Stone type  Success  Failure
Small       81       6           Small       234      26
Large       192      71          Large       55       25
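The paradox hiding in these counts can be verified directly (plain Python, counts taken from the tables above):

```python
# Treatment A has the higher success rate within each stone type, yet the
# lower rate when the two stone-type tables are naively pooled: Simpson's
# paradox, an artifact of ignoring the conditioning variable (stone type).
rate = lambda s, f: s / (s + f)

a_small, a_large = rate(81, 6), rate(192, 71)     # Treatment A by stone type
b_small, b_large = rate(234, 26), rate(55, 25)    # Treatment B by stone type
a_all = rate(81 + 192, 6 + 71)                    # A pooled over stone types
b_all = rate(234 + 55, 26 + 25)                   # B pooled over stone types

assert a_small > b_small and a_large > b_large    # A wins within each type
assert a_all < b_all                              # but B wins after pooling
```

The reversal occurs because treatment A was given mostly to the harder (large-stone) cases, so pooling confounds treatment effect with case severity.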
Evidence for and against homogeneity of treatments
For each of Table 3 and Table 4, compute
1. the evidence e(homogeneous),
2. the plausibility ē(homogeneous), and
3. the 95% plausibility interval for the odds ratio.
Remarks.
1. Simpson's paradox is related more to wrong statistical analysis, i.e., modeling, than to the inferential method(?). How can this be done in AI?
2. Some relevant statistical thoughts: increase the precision of prediction via conditioning, and increase the precision of estimation via pooling. Can some basics like these be integrated into AI?
Numerical results
Figure: plausibilities for the log odds ratios of Tables 3 and 4, which show that pooling makes no sense in this example.
Comparing two normal means with unknown variances
This is a common textbook, controversial, and practically useful example (Bayes and fiducial do not work well); see Martin, Hwang, and Liu (2010b).
Many-normal-means
The association model: X_i = µ_i + z_i (z_i iid N(0,1), i = 1, ..., n). The problem of interest is to infer µ. This is a very important example for understanding inference (Bayes and fiducial do not work); see Martin, Hwang, and Liu (2010b).
Many-normal-means
The usual model for the observables X_1, ..., X_n:
µ_i iid N(θ, σ²) (i = 1, ..., n) and X_i | µ ind N(µ_i, s_i²) (i = 1, ..., n),
with known positive s_1², ..., s_n², where µ = (µ_1, ..., µ_n) and (θ, σ²) ∈ R × R⁺ are unknown. Here, we are interested in inference about σ². Since there is rarely meaningful prior knowledge in practice, there has been tremendous interest in choosing Bayesian priors.
Many-normal-means
The sampling model for the observable quantities is X_i ind N(θ, σ² + s_i²) (i = 1, ..., n). For simplicity, to motivate ideas, consider the case with known θ = 0, that is, X_i ind N(0, σ² + s_i²) (i = 1, ..., n). An association model is given by

Σ_{i=1}^n X_i² / (σ² + s_i²) = V

and

[Σ_{i=1}^n X_i² / (σ² + s_i²)]^{−1/2} (X_1/√(σ² + s_1²), ..., X_n/√(σ² + s_n²)) = U,

where V ∼ χ²_n independently of U ∼ Unif(O_n).
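A Monte Carlo sanity check of the first association equation (the particular n, σ², and s_i² values below are illustrative assumptions, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(3)

# If X_i ~ N(0, sigma^2 + s_i^2) independently, then
# V = sum_i X_i^2 / (sigma^2 + s_i^2) is a sum of n squared standard
# normals, hence V ~ chi^2_n, as the association model asserts.
n, sigma2 = 5, 2.0
s2 = np.array([0.5, 1.0, 1.5, 2.0, 2.5])          # known s_i^2 (illustrative)
X = rng.normal(0.0, np.sqrt(sigma2 + s2), size=(100_000, n))
V = (X**2 / (sigma2 + s2)).sum(axis=1)

# chi^2_n has mean n and variance 2n
assert abs(V.mean() - n) < 0.1
assert abs(V.var() - 2 * n) < 0.5
```

Note V depends on the data only through σ², so predicting V with a credible PRS is what drives marginal inference about σ² here.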
Many-normal-means
Specify the predictive random set, which predicts v alone:
S = {(v, u) : |F_n(v) − 0.5| ≤ |F_n(V) − 0.5|},
where F_n is the CDF of χ²_n. This is a constrained-parameter inference problem.
Remark: validity is not a problem, but efficient inference is not straightforward. It requires considering Generalized Conditional IMs, a challenging topic under investigation!