Ockham Efficiency Theorem for Randomized Scientific Methods

Ockham Efficiency Theorem for Randomized Scientific Methods
Conor Mayo-Wilson and Kevin T. Kelly
Department of Philosophy, Carnegie Mellon University
Formal Epistemology Workshop (FEW), June 19, 2009

Point of the Talk

Ockham efficiency theorem: Ockham's razor has been explained in terms of minimizing retractions en route to the truth, relative to all deterministic scientific strategies (Kevin T. Kelly and Oliver Schulte).

Extension: Ockham's deterministic razor minimizes retractions en route to the truth, relative to a broad class of random scientific strategies.

Further significance: extending the argument to expected retractions is a necessary step for lifting the idea to a theory of statistical theory choice.

Outline

1. Puzzle of Simplicity
2. Standard Explanations
3. Simplicity
4. Examples
5. Mixed Strategies and Ockham's Razor

The Puzzle of Simplicity

Ockham's Razor is indispensable in scientific inference, and inference should be truth-conducive. But how could a fixed bias toward simplicity be said to help one find possibly complex truths?

Standard Explanations

Methodological virtues: simpler theories are more testable or explanatory, or are otherwise more virtuous.

Response: wishful thinking. Desiring that the truth be virtuous doesn't make it so.

Standard Explanations

Confirmation: simple theories are better confirmed by simple data:

    p(T_S | E) / p(T_C | E) > p(T_S) / p(T_C),

where T_S is the simple theory, T_C the complex one, and E the simple evidence.

Responses:
- Simple data are compatible with complex hypotheses.
- No clear connection with finding the truth in the short run.

Standard Explanations

Over-fitting: even if the truth is complex, simple theories improve overall predictive accuracy by trading variance for bias.

Responses:
- The underlying decision theory is unclear; the worst-case solution is the most complex theory.
- The over-fitting account ties Ockham's razor to choices among stochastic theories.
- Prediction must be understood so as to rule out counterfactual or causal predictions.

Standard Explanations

Convergence: even if the truth is complex, complexities in the data will eventually overturn over-simplified theories.

Response: any theory choice in the short run is compatible with finding the true theory in the long run.

A New Approach

Deduction: sound; monotone.

Induction:
- Approximate soundness: converge to the truth with minimal errors.
- Approximate monotonicity: minimize retractions.

Monotonicity for Bayesians

[Figure slide; not preserved in the transcription.]

Monotonicity in Belief Revision

Total number of times B_{n+1} ⊉ B_n = total number of non-expansive belief revisions.
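As a minimal sketch of this bookkeeping (representing belief states as sets of propositions is our assumption, not the talk's), non-expansive revisions can be counted directly:

```python
def retraction_count(belief_states):
    """Count the non-expansive revisions in a sequence of belief states:
    stages at which the new state B_{n+1} fails to include all of B_n."""
    return sum(not prev <= cur  # prev is not a subset of cur: something was given up
               for prev, cur in zip(belief_states, belief_states[1:]))

# Example: the agent adds A, then B, then retracts A.
states = [set(), {"A"}, {"A", "B"}, {"B"}]
print(retraction_count(states))  # 1
```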

Main Idea

Ockham's Razor = closest inductive approximation to deduction.

- Ockham's razor converges to the truth with minimal retractions and errors, and minimal elapsed time to retractions and errors.
- No other convergent method does.
- No circular appeal to prior simplicity biases.
- No awkward trade-offs between costs are required.

Empirical Problems and Simplicity

Empirical effects:
- Recognizable eventually, but arbitrarily subtle, so they may take arbitrarily long to be noticed.
- Each theory predicts finitely many.
- First-order effect = nonconstancy; second-order effect = nonlinearity.

Empirical Problems and Simplicity

Empirical problems: background knowledge K picks out a set of possible, finite effect sets. The theory T_S says that exactly the effects in S will be seen in the unbounded future. A world of experience w is an infinite sequence that presents some finite set of effects at each stage and presents some set S ∈ K in the limit. The empirical problem corresponding to K is to determine which theory in {T_S : S ∈ K} is true of the actual world of experience w.

Empirical Problems and Simplicity

What simplicity is: the empirical complexity of a world of experience w is the length of the longest effect path in K to the effect set S_w presented by w.

[Figure: a ladder of effect sets of complexities 0 through 3, labeled non-constant, non-linear, non-quadratic, ...]
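The definition can be rendered in a few lines of Python, assuming background knowledge K is encoded as a finite collection of effect sets partially ordered by strict inclusion (the encoding is ours):

```python
def empirical_complexity(K, S_w):
    """Length of the longest effect path in K ending at the effect set S_w,
    i.e., the longest chain S_0 < S_1 < ... < S_n = S_w of sets in K
    under strict inclusion."""
    def longest_path_to(S):
        predecessors = [U for U in K if U < S]  # strict subsets of S in K
        if not predecessors:
            return 0
        return 1 + max(longest_path_to(U) for U in predecessors)
    return longest_path_to(S_w)

# The curve-fitting ladder: constant < linear < quadratic.
K = [frozenset(), frozenset({"nonconstant"}),
     frozenset({"nonconstant", "nonlinear"})]
print(empirical_complexity(K, frozenset({"nonconstant", "nonlinear"})))  # 2
```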

Empirical Problems and Simplicity

What simplicity isn't:
- notational or computational brevity (MDL);
- a question-begging rescaling of prior probability using ln(x) (MML);
- free parameters or dimensionality (AIC).

Three Paradigmatic Examples

- Linearly ordered complexity with refutation: curve fitting (Kelly [2004]).
- Partially ordered complexity with refutation: causal inference (Schulte, Luo and Greiner [2007]; Kelly and Mayo-Wilson [2008]).
- Partially ordered complexity without refutation: orientation of a causal edge (Kelly and Mayo-Wilson [2008]).

Linearly Ordered Simplicity Structure

Curve fitting is an instance of a more general type of problem.

Problem: choosing amongst theories that are linearly ordered in terms of complexity.

Evidence: suppose that any false, simple theory is refuted in some finite amount of time.

Linearly Ordered Simplicity Structure

[Figure slides; not preserved in the transcription.]

Linearly Ordered Simplicity Structure

[Figure: retraction counts along the theory ladder T_0, T_1, T_2, ... for a consistent, stalwart, Ockham method versus an Ockham violator; after its violation, the violator incurs one extra retraction in each complexity class (e.g., 3 vs. 2, 4 vs. 3, 5 vs. 4).]
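The moral of the figure can be replayed in code. A sketch for the linear structure, where the Ockham conjecture after seeing n effects is T_n; the violator below is one hypothetical example of a violation, not the only kind:

```python
def retractions(conjectures):
    """Count answer changes along a sequence of conjectures."""
    return sum(a != b for a, b in zip(conjectures, conjectures[1:]))

def ockham(seen, stage):
    """Consistent, stalwart, Ockham method: conjecture the simplest
    theory compatible with the effects seen so far."""
    return f"T_{len(seen)}"

def violator(seen, stage):
    """Leaps to T_1 before any effect appears, then falls back to the
    Ockham conjecture from stage 2 onward."""
    return "T_1" if stage < 2 and not seen else f"T_{len(seen)}"

# Nature withholds the effect until the violator has retreated to T_0,
# then presents it, extracting an extra retraction from the violator.
presentation = [set(), set(), set(), {"e1"}, {"e1"}]
seen, ock_run, vio_run = set(), [], []
for stage, datum in enumerate(presentation):
    seen |= datum
    ock_run.append(ockham(seen, stage))
    vio_run.append(violator(seen, stage))
print(ock_run, retractions(ock_run))  # ['T_0','T_0','T_0','T_1','T_1'] 1
print(vio_run, retractions(vio_run))  # ['T_1','T_1','T_0','T_1','T_1'] 2
```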

Branching Simplicity Structure

[Figure: the same comparison of retraction counts for a consistent, stalwart, Ockham method versus an Ockham violator, on a branching simplicity structure.]

Branching Simplicity Structure

[Figure: candidate causal graphs over the variables X, Y, Z.]

Discovering Causal Networks

Problem: choose a causal network describing the causal relationships amongst a set of variables.

Evidence (effects): probabilistic dependencies amongst the variables, discovered over time.

Complexity = greater number of edges.

Discovering Causal Networks

[Figures: sequences of candidate causal networks over X, Y, Z compatible with the dependencies observed so far; where the data leave the structure unresolved, the method answers "?".]
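A schematic sketch of the selection rule these slides illustrate, assuming networks are encoded as edge sets and given a user-supplied compatibility test (both representations are hypothetical simplifications; real causal discovery works from conditional-independence constraints):

```python
def simplest_networks(candidates, observed_deps, compatible):
    """Among the candidate networks compatible with the dependencies
    observed so far, return those with the fewest edges."""
    live = [g for g in candidates if compatible(g, observed_deps)]
    if not live:
        return []
    fewest = min(len(g) for g in live)
    return [g for g in live if len(g) == fewest]

# Toy run: graphs are frozensets of directed edges; the compatibility
# test simply asks that every observed dependency lie on some edge.
candidates = [frozenset(),
              frozenset({("X", "Y")}),
              frozenset({("X", "Y"), ("Y", "Z")})]
covers = lambda g, deps: all(any({a, b} == d for (a, b) in g) for d in deps)
print(simplest_networks(candidates, [{"X", "Y"}], covers))
# -> [frozenset({('X', 'Y')})]
```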

Deterministic Theory Choice Methods

Given an arbitrary, finite initial segment e of a world of experience, a deterministic theory choice method produces a unique theory T_S or "?", indicating refusal to choose.
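In code, such a method is simply a function from finite evidence segments to a theory or a refusal; a minimal typing sketch (conventions ours):

```python
from typing import Callable, FrozenSet, Optional, Sequence

Datum = FrozenSet[str]     # the effects presented at a single stage
Theory = str               # e.g., "T_2"

# A deterministic theory choice method maps a finite evidence segment to
# a unique theory, or to None, playing the role of "?" (refusal to choose).
Method = Callable[[Sequence[Datum]], Optional[Theory]]
```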

Methodological Properties

Say a method M is:
- convergent if, for any world w, M eventually produces the true theory in w;
- Ockham if it never says any non-simple theory (relative to the evidence);
- stalwart if, whenever it endorses the simplest theory, it continues to do so until it is no longer simplest;
- eventually informative if, in any world w, there is some point of inquiry n after which M never says "?".

Ockham's Razor

Say a method M is normally Ockham if it is Ockham, stalwart, and eventually informative.

Ockham's Razor

Modulo some minor assumptions on the simplicity structure, which all of the above examples satisfy:

Theorem (Efficiency Theorem). Let M be a normal Ockham method, and let M′ be any convergent method. Suppose M and M′ agree along some finite initial segment of experience e, and that M′ violates Ockham's razor at e. Then M′ commits strictly more retractions (in the worst case) than M in every complexity class with respect to e.

Mixed Strategies in Decision and Game Theory

Randomization in decision theory and games: e.g., rock-paper-scissors, matching pennies.

Can randomization in scientific inquiry improve expected errors and retractions?

Randomized Strategies

Randomized methods: machines (formally, discrete-state stochastic processes) for selecting theories from data.

- The machine's outputs are a function of (i) its current state and (ii) its total input.
- The machine's states evolve according to a random process; future and past states may be correlated to any degree. Independence is not assumed, and no assumption is made that the process is Markov, etc.

Costs and Randomized Strategies

Randomized methods ought to be as deductive as possible:
- Approximate soundness: convergence in probability; minimization of expected errors.
- Approximate monotonicity: minimization of expected retractions.

Retractions in Chance and Expected Retractions

[Figure slides; not preserved in the transcription.]
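To make expected retractions concrete, here is a Monte Carlo sketch (the coin-flipping method is our own toy example, not from the talk). A mixed strategy that randomizes between T_0 and T_1 on the empty record already pays half a retraction in expectation in the simplest world, where a normal Ockham method pays none:

```python
import random

def retractions(conjectures):
    return sum(a != b for a, b in zip(conjectures, conjectures[1:]))

def run_coin_flip_method(world):
    """Toy mixed strategy: flip a fair coin between T_0 and T_1 on the
    empty record, then conjecture the simplest compatible theory."""
    seen, out = set(), []
    for stage, datum in enumerate(world):
        seen |= datum
        if stage == 0 and not seen:
            out.append(random.choice(["T_0", "T_1"]))
        else:
            out.append(f"T_{len(seen)}")
    return out

def expected_retractions(run, world, trials=10_000):
    """Monte Carlo estimate of expected retractions along a fixed world."""
    return sum(retractions(run(world)) for _ in range(trials)) / trials

world = [set(), set(), set()]  # no effects ever appear: T_0 is true
print(expected_retractions(run_coin_flip_method, world))  # ~0.5, vs. 0 for Ockham
```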

Methodological Properties

Say a randomized method M is:
- Ockham if it never says any non-simple theory with probability greater than zero;
- stalwart if, whenever it endorses the simplest theory with any positive probability, it continues to do so with unit probability until it is no longer Ockham;
- convergent in probability if, for any world w, the probability that M produces the theory true in w approaches 1 as time elapses.

Ockham's Razor in Probability

Say a method M is normally Ockham if it is Ockham, stalwart, and convergent in probability.

Generalized Efficiency Theorem

Suppose the simplicity structure has no short paths, and let M be a randomized or deterministic method such that:
1. M first violates Ockham's razor after some initial segment of evidence e;
2. M is convergent in probability.
Then, in comparison to any normal Ockham method (deterministic or not!), M accrues a strictly greater number of retractions in every complexity class with respect to e.

References

E. Mark Gold. Language identification in the limit. Information and Control 10: 447-474, 1967.

Kevin Kelly. The Logic of Reliable Inquiry. Oxford University Press, 1996.

Oliver Schulte. Inferring Conservation Laws in Particle Physics: A Case Study in the Problem of Induction. The British Journal for the Philosophy of Science, 2001.

Oliver Schulte, W. Luo, and R. Greiner. Mind Change Optimal Learning of Bayes Net Structure. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), San Diego, CA, June 12-15, 2007.

Branching Simplicity Structure

[Figure: when the simplicity structure contains a short path (a branch T_5), the Ockham violator is as bad as the consistent, stalwart, Ockham method in several complexity classes and strictly worse in only one; this is why the theorem assumes no short paths.]

Stochastic Processes

Definition. Let T, Ω, Σ be arbitrary sets. A stochastic process is a quadruple Q = (T, (Ω, D, p), (Σ, S), X), where:
1. T is a set called the index set of the process;
2. (Ω, D, p) is a (countably additive) probability space;
3. (Σ, S) is a measurable space of possible values of the process;
4. X : T × Ω → Σ is such that for each fixed t ∈ T, the function X_t is D/S-measurable.

Stochastic Processes

Definition. Let F be the set of finite initial segments of data streams, and let Ans contain the set of all theories together with "?", representing "I don't know." A stochastic empirical method is a triple M = (Q, α, σ_0), where:
1. Q = (F, (Ω, D, p), (Σ, S), X) is a stochastic process indexed by the set F of all finite initial segments of data streams;
2. σ_0 ∈ Σ is the initial state of the method (i.e., it satisfies X_∅^{-1}(σ_0) = Ω);
3. α : F × Σ → Ans is such that for each e ∈ F, the map α_e is S/2^Ans-measurable.
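An operational Python sketch of this definition (the names and the stagewise-transition rendering are ours; the definition above allows an arbitrary stochastic process, not only one generated by stepwise updates):

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

@dataclass
class StochasticMethod:
    sigma0: Hashable                                 # initial state sigma_0
    step: Callable[[Sequence, Hashable], Hashable]   # (possibly random) state update
    alpha: Callable[[Sequence, Hashable], str]       # answer map alpha: F x Sigma -> Ans

    def run(self, world):
        """Feed successive finite initial segments of a world to the
        method, returning its sequence of answers."""
        state, answers = self.sigma0, []
        for n in range(1, len(world) + 1):
            e = tuple(world[:n])        # a finite initial segment in F
            state = self.step(e, state)
            answers.append(self.alpha(e, state))
        return answers

# Degenerate (deterministic) instance: the state is ignored and the
# answer is the simplest theory compatible with the effects seen so far.
toy = StochasticMethod(
    sigma0=0,
    step=lambda e, s: s,
    alpha=lambda e, s: f"T_{len(set().union(*e))}",
)
print(toy.run([frozenset(), frozenset({"e1"})]))  # ['T_0', 'T_1']
```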