Model Evaluation and Selection in Cognitive Computational Modeling


1 Model Evaluation and Selection in Cognitive Computational Modeling Jay Myung Department of Psychology Ohio State University, Columbus OH, USA In collaboration with Mark Pitt Mathematical Psychology Workshop (Jun 2013: Wuhan, China)


4 What we hope to achieve today This is intended to be a first introduction to model evaluation and selection/comparison for cognitive scientists. The aim is to provide a good conceptual overview of the topic and to make you aware of some of the fundamental issues and methods in model evaluation and selection. It will not be an in-depth, step-by-step workshop on how to apply model evaluation and comparison methods to extant models using computing or statistical software tools. We assume no more than a year-long course in graduate-level statistics.

5 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 5

6 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 6

7 Model Selection in Cognitive Computational Modeling How should one decide between competing explanations (models) of data? SIMPLE model of memory (Brown et al., 2007, Psych. Rev.) CMR model of memory (Polyn et al., 2009, Psych. Rev.)

8 Two Whales' Views of Model Selection

9 Which one of the two below should we choose? 9

10 Preliminaries Models are quantitative stand-ins for theories. Models are tools with which to study behavior: they increase the precision of prediction, generate novel predictions, and provide insight into complex behavior. Model comparison is a statistical inference problem: quantitative methods are developed to aid in deciding between models.

11 Preliminaries The diversity of types of models in cognitive science makes model comparison challenging; a variety of statistical methods are required. The discipline would benefit from sharing models and data sets: the Cognitive Modeling Repository (CMR).

12 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 12

13 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 13


15 Parameter Estimation A mathematical model specifies the range of data patterns it can describe by varying the values of its parameter vector $w = (w_1, w_2)$. For example, the power model: $p = w_1 (t+1)^{-w_2}$

16 Parameter Estimation Finding the parameter value that best fits observed data. Best-fit parameter: $(\hat w_1, \hat w_2) = (0.985, 0.424)$

17 MLE vs LSE Maximum likelihood estimation (MLE): find the parameter value that is most likely to have generated the data. Least squares estimation (LSE): find the parameter value that most accurately describes the data. MLE: $\hat w = \operatorname{argmax}_w\, p(\text{data} \mid w)$ LSE: $\hat w = \operatorname{argmin}_w\, \sum_i (\text{obs}_i - \text{prd}_i(w))^2$, the sum of squares error (SSE)

18 MLE in a Nutshell All parametric inference proceeds by building a probability model for the observed data, not by directly comparing data observations with model predictions. [Diagram: LSE links observations to predictions; MLE links observations to a probability model]

19 Why MLE? ***Opt. In statistics, MLE is the standard method of parameter estimation and forms a basis for many inferential statistical methods such as hypothesis testing and model comparison (e.g., LRT, AIC, BIC, MDL). Optimal properties of MLE: - Consistency (the true parameter value is recovered asymptotically) - Sufficiency (the estimate contains complete information about the parameter) - Efficiency (the estimate achieves minimum variance) In contrast, LSE does not generally satisfy these properties. LSE is primarily a descriptive tool, often associated with linear models with normal error, such as linear regression and analysis of variance.

20 Terminology ***Opt. Data: Statistically speaking, data are a random sample from an unknown population. Probability distribution: Each population is identified by a probability distribution that describes the probability of observing a data sample. Model parameter: Associated with each probability distribution is a unique value of the model parameter. As the parameter value changes, different probability distributions are specified.

21 Coin tossing experiment: Probability Distribution Parameter: $w$ $(0 \le w \le 1$; probability of a head on each toss$)$ Data: $y \in \{0, 1, ..., n\}$ (number of heads out of $n$ tosses) Binomial process: $y \sim \mathrm{Bin}(n, w)$ Probability distribution given $w$: $f(y \mid w) = \frac{n!}{(n-y)!\,y!}\, w^y (1-w)^{n-y}$

22 For example, suppose a binomial process with n = 10 and w = 0.2: $f(y \mid w=0.2) = \frac{10!}{(10-y)!\,y!}\,(0.2)^y (1-0.2)^{10-y}$, which gives the values $f(y=0 \mid w=0.2), f(y=1 \mid w=0.2), \ldots, f(y=10 \mid w=0.2)$, such that $\sum_{y=0}^{10} f(y \mid w=0.2) = 1$.
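
To make this concrete, here is a minimal Matlab sketch (mine, not from the slides) that evaluates the binomial PDF over its full support and checks that the probabilities sum to one:

    % Binomial PDF f(y | w) for n = 10 trials and w = 0.2,
    % evaluated over the full support y = 0, 1, ..., 10.
    n = 10; w = 0.2;
    y = 0:n;
    f = arrayfun(@(k) nchoosek(n, k), y) .* w.^y .* (1 - w).^(n - y);

    % The probabilities should sum to 1 (up to floating-point error).
    fprintf('f(y=2 | w=0.2) = %.4f\n', f(y == 2));
    fprintf('sum over y     = %.10f\n', sum(f));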

23 Binomial Probability Distributions 23

24 Normal Probability Distribution For a continuous random variable $y$ with normal error, the probability distribution takes the form $f(y \mid w=(\mu, \sigma)) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}$ $(-\infty < y < \infty)$

25 Probability Density Function ***Opt. When $f(y \mid w)$ is viewed as a function of the data $y$ given a fixed value of the parameter $w$, it is called the probability density function (PDF). Normalized PDF: $\sum_y f(y \mid w) = 1$ (discrete $y$); $\int f(y \mid w)\, dy = 1$ (continuous $y$), for each value of the parameter $w$.

26 Formal Definition of Model Formally speaking, a model is defined as the collection of PDFs the model generates by varying its parameter values: Model $M = \{\, f(y \mid w) : w \in W \,\}$

27 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 27

28 To recap, a PDF $f(y \mid w)$ associated with a specific parameter value $w$ describes the probability of observing an outcome $y$. For instance, according to the binomial PDF with $n=10$ and $w=0.2$, $y = 2$ is more likely to be observed than $y = 4$: $f(y=2 \mid w=0.2) > f(y=4 \mid w=0.2)$

29 The Inverse Problem In reality, however, we have observed the data $y$, but the parameter $w$ is unknown. Accordingly, we are faced with an inverse inference problem. Parameter estimation as an inverse problem: Given the observed data $y_{obs}$ and a model of interest, we wish to identify the one PDF, among all PDFs the model can generate, that is most likely to have produced $y_{obs}$.

30 Likelihood Function To solve the inverse problem, we define the likelihood function (LF) by switching the roles of $y$ and $w$ in the PDF: Likelihood function: $L(w \mid y_{obs}) = f(y_{obs} \mid w)$, a function of the parameter $w$ given the observed data $y_{obs}$.

31 An Example of Likelihood Function Given $y_{obs} = 7$ with $y \sim \mathrm{Bin}(n=10, w)$: $L(w \mid y_{obs}=7) = f(y_{obs}=7 \mid w) = \frac{10!}{(10-7)!\,7!}\, w^7 (1-w)^3$ $(0 \le w \le 1)$

32 PDF vs Likelihood Function Both the PDF $f(y \mid w)$ and the likelihood function (LF) $L(w \mid y)$ are derived from the same functional form, but are defined on different axes, representing two sides of the same coin, in a sense: PDF: $f(y \mid w_{fixed})$; LF: $f(y_{fixed} \mid w)$

33 PDF vs Likelihood Function - PDF: A function of $y$ given a specific value of $w$ (defined on the data scale) - LF: A function of $w$ given a specific value of $y$ (defined on the parameter scale) Both are non-negative measures: - PDF: Actual probability - Likelihood: An unnormalized probability $\int f(y \mid w_{fixed})\, dy = 1$ but $\int L(w \mid y_{obs})\, dw \ne 1$

34 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 34

35 Maximum Likelihood Estimate In MLE, we are interested in finding the probability distribution that makes the observed data $y_{obs}$ most likely, or equivalently, finding the parameter value that maximizes the likelihood function $L(w \mid y_{obs})$. The resulting parameter value is called the maximum likelihood estimate, denoted by $\hat w_{MLE}$: $\hat w_{MLE} = 0.7$ $(y_{obs} = 7,\ n = 10)$
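
As a quick illustration (my own sketch, not the slides'), one can evaluate $L(w \mid y_{obs}=7)$ on a grid in Matlab and confirm that the maximizer agrees with the analytic result $\hat w_{MLE} = y_{obs}/n = 0.7$:

    % Likelihood L(w | y_obs = 7) for y ~ Bin(n = 10, w),
    % evaluated on a fine grid of candidate w values.
    n = 10; yobs = 7;
    w = linspace(0.001, 0.999, 9999);
    L = nchoosek(n, yobs) .* w.^yobs .* (1 - w).^(n - yobs);

    % The grid maximizer should match the analytic MLE y/n = 0.7.
    [~, idx] = max(L);
    fprintf('grid MLE = %.3f (analytic: %.3f)\n', w(idx), yobs / n);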

36 Consequently, among all probability distributions (PDFs) the model can generate, the one that is most likely to have generated $y_{obs}=7$ is $f(y \mid w=0.7)$, shown below. ***Opt. [Figure: the MLE distribution]

37 Likelihood Equation ***Opt. To reiterate, MLE requires us to solve the problem of maximizing the LF, or equivalently the logarithmic LF: $\hat w = \operatorname{argmax}_w L(w \mid y_{obs}) = \operatorname{argmax}_w \ln L(w \mid y_{obs})$ Consider the general setting of a multi-dimensional parameter vector $w = (w_1, w_2, ..., w_k)$. A necessary condition for the MLE is that the likelihood function satisfy the following partial differential equation, called the likelihood equation (LE): $\frac{\partial \ln L(w \mid y_{obs})}{\partial w_i} = 0$ at $\hat w_{MLE}$, for all $i = 1, 2, ..., k$.

38 Why the likelihood equation? Because at an (interior) maximum, the first derivative of the LF with respect to each $w_i$ vanishes at $\hat w_{MLE}$: $\frac{\partial \ln L(w \mid y_{obs})}{\partial w} = 0$ ***Opt.

39 To illustrate, consider the previous binomial example with $y \sim \mathrm{Bin}(n=10, w)$, and suppose that we observed $y_{obs}=7$. ***Opt. The likelihood equation in this case is given by $\frac{d \ln L(w \mid y_{obs}=7)}{dw} = \frac{d}{dw}\ln\!\left[\frac{10!}{7!\,3!}\, w^7 (1-w)^3\right] = \frac{d}{dw}\left[\ln\frac{10!}{7!\,3!} + 7\ln w + 3\ln(1-w)\right] = \frac{7}{w} - \frac{3}{1-w} = \frac{7 - 10w}{w(1-w)} = 0$, from which we obtain $\hat w_{MLE} = 0.7$.

40 Solving MLE Note that finding the MLE estimate requires solving a system of $k$ likelihood equations simultaneously. For models with many parameters and non-linear model equations, it is generally not possible to find analytic-form solutions. Instead, one must solve the MLE problem numerically on a computer using an optimization algorithm (e.g., hill-climbing, Gauss-Newton, or Levenberg-Marquardt). This generally involves implementing an intelligent search procedure in which we seek the MLE by trial and error over a series of iterative steps.

41 Hill-climbing Optimization [Figure: log-likelihood $\ln L(w)$ plotted against the parameter $w$, with points A, B, C at parameter values $a_1, a_2, a_3$; the algorithm climbs toward $\hat w_{MLE}$ at B]

42 Local Maxima Problem No optimization algorithm can guarantee that the MLE will always be found. It is, therefore, important to run an optimization algorithm several times, each run with a different starting value of the parameter $w$, to see whether the same solution is obtained. Further, and importantly, the problem is exacerbated by what is known as the local maxima problem.

43 [Figure: a log-likelihood surface $\ln L(w)$ over the parameter $w$ with local maxima at A ($a_1$) and C ($a_3$) and the global maximum at B ($a_2$)]
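
A minimal sketch of the multiple-starting-value strategy, using nothing beyond base Matlab's fminsearch (the logit reparameterization, which keeps $w$ inside $(0,1)$, is my own convenience, not the slides'):

    % Multistart MLE for the binomial example: minimize the negative
    % log-likelihood from several random starts and keep the best fit.
    % Optimizing on the logit scale keeps w = 1/(1+exp(-z)) in (0, 1).
    n = 10; yobs = 7;
    wfun = @(z) 1 ./ (1 + exp(-z));
    negloglik = @(z) -(yobs * log(wfun(z)) + (n - yobs) * log(1 - wfun(z)));

    best_z = NaN; best_nll = Inf;
    for run = 1:10
        z0 = 2 * randn;                       % a different random start each run
        [z_hat, nll] = fminsearch(negloglik, z0);
        if nll < best_nll
            best_z = z_hat; best_nll = nll;   % keep the best solution so far
        end
    end
    fprintf('MLE over 10 restarts: w = %.3f\n', wfun(best_z));

This toy likelihood is unimodal, so all restarts should agree; the payoff of the multistart strategy comes with multimodal surfaces like the one pictured above.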

44 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 44

45 Retention Models and Data Suppose we are interested in testing two models of retention: POW: $p = w_1 (t+1)^{-w_2}$ EXP: $p = w_1 e^{-w_2 t}$ A memory experiment with n = 50 Bernoulli trials and eight time intervals of t = (0.5, 1, 2, 4, 8, 12, 16, 18) yielded the following data (i.e., number of correct responses out of 50 trials at each time interval): y_obs = (44, 34, 27, 26, 19, 17, 20, 11), or, equivalently, the observed proportions correct p_obs = (0.88, 0.68, 0.54, 0.52, 0.38, 0.34, 0.40, 0.22).

46 The Data 46

47 Likelihood Functions The first step of MLE is to write down the PDF for each model: POW: $f(y \mid w) = \prod_{i=1}^{8} \frac{n!}{(n-y_i)!\,y_i!}\left[w_1 (t_i+1)^{-w_2}\right]^{y_i}\left[1 - w_1 (t_i+1)^{-w_2}\right]^{n-y_i}$ EXP: $f(y \mid w) = \prod_{i=1}^{8} \frac{n!}{(n-y_i)!\,y_i!}\left[w_1 e^{-w_2 t_i}\right]^{y_i}\left[1 - w_1 e^{-w_2 t_i}\right]^{n-y_i}$ From this, we obtain the log LF: POW: $\ln L(w \mid y_{obs}) = \sum_{i=1}^{8}\left[ y_{obs,i}\ln\!\left(w_1 (t_i+1)^{-w_2}\right) + (n - y_{obs,i})\ln\!\left(1 - w_1 (t_i+1)^{-w_2}\right)\right]$ EXP: $\ln L(w \mid y_{obs}) = \sum_{i=1}^{8}\left[ y_{obs,i}\ln\!\left(w_1 e^{-w_2 t_i}\right) + (n - y_{obs,i})\ln\!\left(1 - w_1 e^{-w_2 t_i}\right)\right]$ where the parameter-independent terms are omitted.

48 Matlab Code 48
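
The slide's actual code is not preserved in this transcription. Below is my own minimal Matlab sketch of how such a fit might look, maximizing the binomial log-likelihoods above with fminsearch (predicted probabilities are clamped to (0, 1) as a crude safeguard):

    % MLE for the POW and EXP retention models via fminsearch.
    % Data: number correct out of n = 50 trials at eight intervals.
    t = [0.5 1 2 4 8 12 16 18];
    y = [44 34 27 26 19 17 20 11];
    n = 50;

    % Model predictions p(t; w) for w = (w1, w2).
    pow_p = @(w, t) w(1) * (t + 1).^(-w(2));
    exp_p = @(w, t) w(1) * exp(-w(2) * t);

    % Binomial negative log-likelihood (constant terms omitted).
    nll = @(p) -sum(y .* log(p) + (n - y) .* log(1 - p));
    clamp = @(p) min(max(p, 1e-6), 1 - 1e-6);
    fitmle = @(model, w0) fminsearch(@(w) nll(clamp(model(w, t))), w0);

    [w_pow, nll_pow] = fitmle(pow_p, [0.9 0.4]);
    [w_exp, nll_exp] = fitmle(exp_p, [0.9 0.1]);
    fprintf('POW: w = (%.3f, %.3f), -lnL = %.2f\n', w_pow, nll_pow);
    fprintf('EXP: w = (%.3f, %.3f), -lnL = %.2f\n', w_exp, nll_exp);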

49 MLE Best-fit Curves 49

50 Summary of Model Fits

                 MLE                 LSE
                 POW      EXP        POW      EXP
    (w1, w2)     ...      ...        ...      ...
    PVAF         91.2%    79.0%      91.4%    79.1%

(The parameter estimates were not preserved in this transcription.)

51 Why the Differences between MLE and LSE? Binomial likelihood data: $y \sim \mathrm{Bin}(n, p)$ (e.g., $p = w_1 e^{-w_2 t}$), with non-constant variance $\mathrm{var}(y) = np(1-p)$. ***Opt. LSE: minimize $\sum_i (\mathrm{obs}_i - \mathrm{prd}_i(w))^2$, which weights all observations equally.

52 R Code 52


54 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 54

55 3. Model Evaluation and Selection Goodness-of-fit vs. Generalizability Model Selection Methods Illustrative Examples


57 Goal of Modeling and Approximation to Truth The ultimate, ideal goal of modeling is to identify the model that actually generated the observed data. This is not possible because 1) there are never enough observations to pin down the truth exactly, and 2) the truth may be quite complex, beyond the modeler's imagination. A more realistic goal is to choose, among a set of candidate models, the one model that provides the closest approximation to the truth, in some defined sense.

58 Model Evaluation Criteria Qualitative criteria Falsifiability: Do potential observations exist that would be incompatible with the model? Plausibility: Does the theoretical account of the model make sense of established findings? Interpretability: Are the components of the model understandable and linked to known processes? Quantitative criteria Goodness of fit: Does the model fit the observed data sufficiently well? Complexity/simplicity: Is the model's description of the data achieved in the simplest possible manner? Generalizability: How well does the model predict future observations?

59 Goodness-of-fit (GOF) Measures - Root mean squared error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \big(y_{i,obs} - y_{i,prd}(\hat w)\big)^2}$ - Percent variance accounted for (PVAF), or $r^2$: $\mathrm{PVAF} = 100\left(1 - \frac{\sum_{i=1}^{n} \big(y_{i,obs} - y_{i,prd}(\hat w)\big)^2}{\sum_{i=1}^{n} \big(y_{i,obs} - \bar y_{obs}\big)^2}\right)$
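
Both measures take only a few lines of Matlab; a small sketch (the predicted values below are hypothetical, purely for illustration):

    % Goodness-of-fit measures for observed values y_obs and model
    % predictions y_prd evaluated at the best-fitting parameters.
    y_obs = [0.88 0.68 0.54 0.52 0.38 0.34 0.40 0.22];
    y_prd = [0.85 0.71 0.58 0.47 0.37 0.33 0.30 0.29];  % hypothetical

    rmse = sqrt(mean((y_obs - y_prd).^2));
    pvaf = 100 * (1 - sum((y_obs - y_prd).^2) / sum((y_obs - mean(y_obs)).^2));
    fprintf('RMSE = %.4f, PVAF = %.1f%%\n', rmse, pvaf);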

60 Noisy Data Behavioral data include random noise from a number of sources, such as measurement error, sampling error, and individual differences Data = Regularity (Cognitive process) + Noise (Idiosyncrasies) 60

61 Problem with GOF as Model Evaluation Criterion Data = Regularity (Cognitive process) + Noise (Idiosyncrasies) GOF = Fit to regularity + Fit to noise Properties of the model that have nothing to do with its ability to capture the underlying regularities can improve GOF

62 Over-fitting Problem Model 1: $Y = ae^{bX} + c$ Model 2: $Y = ae^{bX} + c\sin(dX + f) + g$

63 Model Complexity Complexity refers to a model's inherent flexibility that enables it to fit a wide range of data patterns. [Figure: from simple model to complex model as the number of model parameters increases]

64 Complexity: More than Number of Parameters? Power: $p = a(t+1)^{-b}$ Exponential: $p = ae^{-bt}$ Hyperbolic: $p = 1/(a + bt)$ Are these all equally complex? Maybe not.

65 Complexity of Multinomial Processing Tree Models Wu, Myung & Batchelder (2010, J. Math. Psych.) 65

66 Generalizability: The Yardstick of Model Selection Generalizability refers to a model's ability to fit all future data samples from the same underlying process, not just the currently observed data sample; it can thus be viewed as a measure of predictive accuracy, or of proximity to the underlying regularity.

67
                               Model A    Model B
    Goodness of fit (PVAF):    80%        99%
    Generalizability (PVAF):   70%        50%

68 Relationship among Goodness of Fit, Model Complexity and Generalizability [Figure: as model complexity increases, goodness of fit keeps rising while generalizability peaks and then declines; the region beyond the peak is overfitting]

69 Wanted: A method of model selection that estimates a model's generalizability by taking into account the effects of its complexity Selection Criterion: Choose the model, among a set of candidate models, that generalizes best

70 3. Model Evaluation and Selection Goodness-of-fit vs. Generalizability Model Selection Methods Illustrative Examples

71 Model Selection Methods Occam's Razor Likelihood Function Akaike Information Criterion (AIC) Bayesian Information Criterion (BIC) Minimum Description Length (MDL) Bayes Factor (BF) Cross-validation (CV) Likelihood Ratio Test (LRT)

72 Occam's Razor: The Economy of Explanation "Entities should not be multiplied beyond necessity." - William of Occam (c. 1287-1347)

73 Likelihood Function: Revisited Formally speaking, a mathematical model is defined in terms of the likelihood function (LF) that specifies the likelihood of the observed data as a function of the model parameter: Likelihood function (LF): $f(y \mid w)$ E.g., power model: $p = w_1 (t+1)^{-w_2}$ $(0 \le p \le 1)$ Data: $y \sim \mathrm{Binomial}(n, p(w))$ $(y = 0, 1, ..., n)$ LF: $f(y \mid w) = \frac{n!}{(n-y)!\,y!}\, p(w)^y (1-p(w))^{n-y}$

74 Maximum Likelihood: Revisited In model fitting, we are interested in finding the parameter value that is most likely to have generated the observed data, i.e., the one that maximizes the likelihood function: Maximum likelihood (ML): $f(y \mid \hat w)$

75 Penalized Likelihood Methods Formalization of Occam's razor Estimate a model's generalizability by penalizing it for excess complexity (i.e., more complexity than is needed to fit the regularity in the data) Puts models on equal footing

76 The Basic Idea (Generalizability) = (Goodness of fit) - (Model complexity) 76

77 The Basic Idea ***Opt. (Lack of generalizability) = (Lack of goodness of fit) + (Model complexity), i.e., the same relation as above with generalizability and goodness of fit expressed as losses (negated): -Generalizability = -Goodness of fit + Model complexity

78 Akaike Information Criterion (AIC) Akaike (1973): $\mathrm{AIC} = -2\ln f(y \mid \hat w) + 2k$, where $k$ is the number of parameters; the first term measures goodness of fit (ML) and the second penalizes model complexity. The smaller a model's AIC value, the greater its estimated generalizability. The model with the smallest AIC, among a set of competing models, is the best and thus should be preferred.

79 Bayesian Information Criterion (BIC) Schwarz (1978): $\mathrm{BIC} = -2\ln f(y \mid \hat w) + k\ln n$, where $n$ is the sample size; again, goodness of fit (ML) plus a model complexity penalty. The model that minimizes BIC should be preferred.
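
Given each model's maximized log-likelihood, both criteria are one-liners. A hedged Matlab sketch (the log-likelihood values are placeholders, not numbers from the talk):

    % AIC and BIC from maximized log-likelihoods of two models.
    loglik = [-24.5 -31.2];   % ln f(y | w_hat), hypothetical values
    k      = [2 2];           % numbers of free parameters
    n      = 50 * 8;          % sample size (here, total Bernoulli trials;
                              % what counts as n for BIC can be a substantive choice)

    aic = -2 * loglik + 2 * k;
    bic = -2 * loglik + k * log(n);
    fprintf('AIC: %.1f  %.1f\n', aic);
    fprintf('BIC: %.1f  %.1f\n', bic);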


81 Minimum Description Length (MDL) Rissanen (1996): $\mathrm{MDL} = -\ln f(y \mid \hat w) + \frac{k}{2}\ln\frac{n}{2\pi} + \ln\!\int \sqrt{\det I(w)}\, dw$, where $I(w)$ is the Fisher information matrix; the last term is sensitive to the model's functional form, not just its number of parameters $k$. Goodness of fit (ML) plus model complexity. The model that minimizes MDL should be preferred.

82 Bayes Factor (BF) ***Opt. (Kass & Raftery, 1995) In Bayesian model selection, each model is evaluated based on its marginal likelihood, defined as $p(y \mid M) = \int f(y \mid w, M)\, \pi(w \mid M)\, dw$, or equivalently an average likelihood (i.e., how well the model fits the data on average, across the range of its parameters). The Bayes factor (BF) between two models is defined as the ratio of their marginal likelihoods: $\mathrm{BF}_{ij} = \frac{p(y \mid M_i)}{p(y \mid M_j)}$

83 Under the assumption of equal model priors, the BF reduces to the posterior odds (from Bayes' rule): ***Opt. $\mathrm{BF}_{ij} = \frac{p(M_i \mid y)}{p(M_j \mid y)}$ Therefore, the model that maximizes the marginal likelihood is the one with the highest probability of being true given the observed data.

84 Features of Bayes Factor ***Opt. Pros No optimization (i.e., no maximum likelihood) No explicit measure of model complexity No overfitting, by averaging likelihood function across parameters Cons Issue of choosing parameter priors (virtue or vice?) Non-trivial computations requiring numerical integration 84
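
For a low-dimensional parameter space, the "non-trivial" integral can be approximated on a grid. A minimal sketch, entirely my own construction: a uniform prior on $w$ for model $M_1$, against a point model $M_2$ fixing $w = 0.5$, for the earlier binomial example:

    % Marginal likelihood p(y | M) = integral of f(y | w) pi(w) dw,
    % approximated on a grid, for y_obs = 7 successes in n = 10 trials.
    n = 10; yobs = 7;
    lik = @(w) nchoosek(n, yobs) .* w.^yobs .* (1 - w).^(n - yobs);

    w  = linspace(0.0005, 0.9995, 1000);  % grid over (0, 1)
    dw = w(2) - w(1);
    pyM1 = sum(lik(w)) * dw;              % M1: uniform prior, grid integral
    pyM2 = lik(0.5);                      % M2: w fixed at 0.5
    fprintf('BF_12 = %.2f\n', pyM1 / pyM2);

Note that no maximization is involved: the likelihood is averaged, not maximized, over the parameter.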

85 BIC as an approximation of BF ***Opt. A large-sample approximation of the marginal likelihood yields the easily computable Bayesian Information Criterion (BIC): $-2\ln p(y \mid M) \approx \mathrm{BIC}$

86 Cross-validation (CV) (Stone, 1974; Geisser, 1975) A sampling-based method of estimating generalizability: split the data into a calibration sample $y_{Cal}$ and a validation sample $y_{Val}$, fit the model to $y_{Cal}$, and score its prediction error on $y_{Val}$: $\mathrm{CV} = \mathrm{SSE}\big(y_{Val} \mid \hat w(y_{Cal})\big)$


88 Features of CV Pros Easy to use Sensitive to functional form as well as number of parameters Asymptotically equivalent to AIC Cons Sensitive to the partitioning used - Averaging over multiple partitions - Leave-one-out CV (LOOCV), instead of split-half CV Instability of the estimate due to loss of data 88
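
A hedged Matlab sketch of split-half CV for the POW retention model from the earlier example (the even/odd split and the least-squares calibration fit are my simplifications):

    % Split-half cross-validation for the POW retention model:
    % fit on the calibration half, score SSE on the validation half.
    t = [0.5 1 2 4 8 12 16 18];
    p = [0.88 0.68 0.54 0.52 0.38 0.34 0.40 0.22];   % observed proportions

    cal = 1:2:8;  val = 2:2:8;                       % one arbitrary partition
    pow_p = @(w, t) w(1) * (t + 1).^(-w(2));
    sse = @(w, idx) sum((p(idx) - pow_p(w, t(idx))).^2);

    w_cal = fminsearch(@(w) sse(w, cal), [0.9 0.4]); % calibrate
    cv = sse(w_cal, val);                            % CV = SSE(y_Val | w_hat(y_Cal))
    fprintf('split-half CV error = %.4f\n', cv);

Because the score depends on the particular split, in practice one would average over many partitions or use leave-one-out CV, as noted above.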


90 Why not the Likelihood Ratio Test (LRT)? ***Opt. The LRT is often used to test the adequacy of a model of interest ($M_R$) in the context of another, fuller model ($M_F$): $\mathrm{LRT} = -2\ln\frac{f(y \mid \hat w_R)}{f(y \mid \hat w_F)}$ The acceptance/rejection decision on $M_R$ is then made based on a p-value obtained from the chi-square distribution with $df = k_F - k_R$. Acceptance: model $M_R$ provides a sufficiently good fit to the observed data, and therefore the extra parameters of model $M_F$ are judged to be unnecessary.
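
For completeness, a sketch of the mechanics (assuming the Statistics Toolbox's chi2cdf; the log-likelihood values are placeholders, not numbers from the talk):

    % Likelihood ratio test of a restricted model M_R (k_R parameters)
    % nested within a fuller model M_F (k_F parameters).
    lnL_R = -26.1;  k_R = 2;    % hypothetical maximized log-likelihoods
    lnL_F = -24.9;  k_F = 3;

    lrt = -2 * (lnL_R - lnL_F);         % LRT statistic
    df  = k_F - k_R;
    p   = 1 - chi2cdf(lrt, df);         % requires Statistics Toolbox
    fprintf('LRT = %.2f, df = %d, p = %.3f\n', lrt, df, p);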

91 LRT is NOT a model selection method ***Opt. LRT does not assess a model's generalizability, which is the hallmark of model selection. LRT is a null hypothesis significance testing (NHST) method that simply judges the descriptive adequacy of a given model (cf. contemporary criticisms of NHST). LRT requires a pair-wise, nested comparison (i.e., it can compare only two models at a time). LRT was developed in the context of linear models with normal error; sampling distributions of the LRT statistic under nonlinear and nonnormal error conditions are generally unknown. At best, LRT is a poor substitute for current model selection methods, such as AIC and BIC.

92 3. Model Evaluation and Selection Goodness-of-fit vs. Generalizability Model Selection Methods Illustrative Examples

93 Example #1: Retention Models (Remember?) MLE results: POW (black): $p = a(t+1)^{-b}$ EXP (red): $p = ae^{-bt}$ POW2 (blue): $p = a(t+1)^{-b} + c$ [Figure: proportion correct (pc) vs. time interval (t), with the three fitted curves]

94 Model Selection Results [Table comparing POW, EXP, and POW2 on number of parameters, PVAF, log-likelihood, AIC, BIC, and LOOCV; the numeric entries were not preserved in this transcription]

95 Example #2 [slide content not preserved in this transcription]

96 Matlab Code 96


98 Guidelines for Model Evaluation and Selection 1. Evaluate a model's fit to data (descriptive adequacy, or goodness-of-fit) 2. Must take into account a model's fit to other possible data (complexity/flexibility) 3. Select among models using generalizability measures (AIC, BIC, MDL, Bayes Factor, etc.)

99 Summary of Section 3 Models should be evaluated based on generalizability, not on goodness of fit: "Thou shalt not select the best-fitting model but shalt select the best-predicting model." Other non-statistical but very important selection criteria: Plausibility Interpretability Explanatory adequacy Falsifiability

100 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 100


102 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization (ADO) 5. Conclusion 102

103 4. A New Experimental Tool for Model Selection: ADO Introduction Discriminating Models of Retention with ADO Illustrative Example 103


105 Introduction Experiments are fundamental to the advancement of cognitive science 105

106 Introduction Data obtained from experiments are used to fit formal models. Multinomial Processing Trees Structural Equations (SEM) Neural Network 106

107 Introduction Often, there are many competing models to describe the same cognitive/perceptual process. Example Short-term memory literature Over 100 different 2-parameter models have been proposed to describe the rate of forgetting over time (Rubin & Wenzel, 1996) 107

108 Model selection problem It is often very difficult to reject models through experimentation because they can mimic one another so closely. Example: Some models of memory retention (forgetting) Hyperbolic Exponential Power 108

109 Experiments to discriminate between models Flow chart of a typical investigation: Formulate models → Collect data → Analyze data → Statistical model selection methods (e.g., AIC & BIC). Model selection criteria are limited by the data they have to work with.

110 Experiments to discriminate between models Flow chart of a typical investigation: Formulate models → Collect data → Analyze data. Typically, collect lots of data to decrease noise.

111 Experiments to discriminate between models Flow chart of a typical investigation: Formulate models → Collect data → Analyze data. Instead, collect smarter data to highlight differences between models.

112 Collecting smarter data The design of an experiment determines the quality of the data that are collected. Design variables include: number of observations, number of treatment groups, stimulus levels.


114 Hard-to-recruit participants 114

115 Adaptive design optimization (ADO) Adaptively designed experiments: Conduct the full experiment as a sequence of mini-experiments. Improve the design of the next mini-experiment using knowledge gained from the previous mini-experiments. Myung & Pitt (2009), Psychological Review; Cavagnaro, Myung, Pitt & Kujala (2010), Neural Computation.

116 Adaptive design optimization (ADO) Sequential design framework [Diagram: designs $d_1, d_2, d_3, ..., d_s$ yield experimental observations $Y_{obs}(d_1), Y_{obs}(d_2), ..., Y_{obs}(d_s)$] Adapt the design of the next experiment based on the results of preceding experiments.

117 4. A New Experimental Tool for Model Selection: ADO Introduction Discriminating Models of Retention with ADO Illustrative Example 117

118 Designing a retention experiment: What Time Intervals Should be Employed? Retention: the rate of retrieval failure over time 118

119 Designing a retention Experiment: What Time Intervals Should be Employed? Two models of retention 119

120 Designing a retention experiment: What Time Intervals Should be Employed? Model mimicry between POW and EXP [Figure: the two models' retention curves nearly overlap]

121 Designing a retention experiment: What Time Intervals Should be Employed? Model predictions for a narrow range of parameters (100 Bernoulli trials) [Figure: recall rate vs. time (seconds) for POW and EXP; the parameter ranges for a and b, e.g. 0.58 < a < ..., were not fully preserved in this transcription]

122 Designing a retention experiment: What Time Intervals Should be Employed? A good design choice can aid in discrimination. ... seconds: Bad designs! [Figure: recall rate vs. time (seconds) for POW and EXP]

123 Designing a retention experiment: What Time Intervals Should be Employed? A good design choice can aid in discrimination. 2-4 seconds: Good designs! [Figure: recall rate vs. time (seconds) for POW and EXP]

124 Designing a retention experiment: What Time Intervals Should be Employed? More realistic situations require a more principled approach to finding optimal designs. [Figure: recall rate vs. time (seconds) for POW with a ~ Beta(2,1), b ~ Beta(1,4), and for EXP with a ~ Beta(2,1), b ~ Beta(1,80)]

125 4. A New Experimental Tool for Model Selection: ADO Introduction Discriminating Models of Retention with ADO Illustrative Example 125

126 Simulated experiments Generate data at each stage from some true model (e.g., EXP with a = 0.71 and b = 0.08) at the optimal retention interval identified by the ADO algorithm.

127 Simulated experiments The idea is to see how many stages it takes to accumulate enough data to decisively identify the data-generating model according to some model selection criterion (e.g., the Bayes factor).
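
The design-optimization step itself (choosing each interval by maximizing an information-theoretic utility) is beyond a short sketch, but the Bayesian bookkeeping that drives these simulations can be shown compactly. Below is my own grid-based Matlab illustration of updating the model and parameter probabilities after each mini-experiment; the uniform parameter priors and grids are assumptions (the talk used Beta priors), while the intervals and correct-response counts follow the stages reported on the next slides:

    % Sequential Bayesian updating of model probabilities for POW vs.
    % EXP after each mini-experiment of 30 Bernoulli trials (design
    % optimization itself omitted; intervals taken from the slides).
    [a, b] = meshgrid(linspace(0.05, 0.95, 60), linspace(0.005, 0.5, 60));
    a = a(:); b = b(:);
    postPOW = ones(size(a)) / numel(a);   % uniform parameter priors
    postEXP = postPOW;
    pM = [0.5 0.5];                       % P0(POW), P0(EXP)

    n = 30;
    stages = [16.0 7; 96.4 0; 96.8 0];    % [interval in s, # correct]
    for s = 1:size(stages, 1)
        t = stages(s, 1);  y = stages(s, 2);
        p1 = min(max(a .* (t + 1).^(-b), 1e-9), 1 - 1e-9);  % POW
        p2 = min(max(a .* exp(-b * t),   1e-9), 1 - 1e-9);  % EXP
        lik1 = p1.^y .* (1 - p1).^(n - y);  % binomial coef. cancels
        lik2 = p2.^y .* (1 - p2).^(n - y);
        m1 = sum(lik1 .* postPOW);          % marginal likelihoods
        m2 = sum(lik2 .* postEXP);
        pM = pM .* [m1 m2];  pM = pM / sum(pM);
        postPOW = lik1 .* postPOW / m1;     % parameter posteriors
        postEXP = lik2 .* postEXP / m2;
        fprintf('Stage %d: P(POW) = %.2f, P(EXP) = %.2f\n', s, pM);
    end

With different priors the exact numbers will differ from the probabilities reported on the slides, but the qualitative pattern, with EXP pulling ahead as zero-recall data arrive at long intervals, should reproduce.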

128 Simulation results Data generated from EXP with a = 0.71 and b = 0.08

129 Simulation results Stage 1: compute the optimal design and generate data (30 Bernoulli trials). Prior model probabilities: $P_0(\mathrm{POW}) = 0.5$, $P_0(\mathrm{EXP}) = 0.5$. Outcome: 7 correct responses at the chosen interval of 16 seconds.

130 Simulation results Stage 1: update the model and parameter probabilities given 7 correct responses at 16 seconds: $P_1(\mathrm{POW}) = 0.65$, $P_1(\mathrm{EXP}) = 0.35$.

131 Simulation results Stage 2: compute the optimal design and generate data (30 Bernoulli trials). Current model probabilities: $P_1(\mathrm{POW}) = 0.65$, $P_1(\mathrm{EXP}) = 0.35$. Outcome: 0 correct responses at the chosen interval of 96.4 seconds.

132 Simulation results Stage 2: update the model and parameter probabilities given 0 correct responses at 96.4 seconds: $P_2(\mathrm{POW}) = 0.30$, $P_2(\mathrm{EXP}) = 0.70$.

133 Simulation results Stage 3: compute the optimal design and generate data (30 Bernoulli trials). Current model probabilities: $P_2(\mathrm{POW}) = 0.30$, $P_2(\mathrm{EXP}) = 0.70$. Outcome: 0 correct responses at the chosen interval of 96.8 seconds.

134 Simulation results Stage 3: update the model and parameter probabilities given 0 correct responses at 96.8 seconds: $P_3(\mathrm{POW}) = 0.02$, $P_3(\mathrm{EXP}) = 0.98$.

135 Simulation results 135

136 Simulation results Bayes Factor = 20 (very strong evidence for EXP) 136


138 Human experiment results 138

139 ADO for Visual Psychophysics Experiment: Adaptive Estimation of Contrast Sensitivity Function (CSF) Lesmes, Jeon, Lu & Dosher (2006); Lesmes, Lu, Baek & Albright (2010) 139


141 ADO for Neurophysiology Experiment Lewi, Butera & Paninski (2009) 141

142 Much ADO about Nothing - William Shakespeare 142

143 Summary of Section 4 Competing models can be difficult to discriminate because good experimental designs are elusive. ADO finds and exploits differences between models to maximize the likelihood of discrimination. Qualifications: ADO is not applicable to all types of models; not all variables can be optimized; computational costs.

144 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 144

145 Conclusion Model evaluation: A model's good fit to a data set is only a necessary first step in model evaluation, not a sufficient, final, confirmatory step. Model selection: The best model, among a set of candidate models, is the one that generalizes best and can be identified using model selection methods, such as AIC, BIC, and MDL. Experimental design: Adaptive design optimization (ADO) can further improve model evaluation and selection.

146 Further readings on model evaluation, model selection, and adaptive design optimization Special issues Gluck, K., Bello, P., & Busemeyer, J. (2008). Special issue on model comparison. Cognitive Science, 32, Myung, I. J., Forster, M., & Browne, M. W. (2000). Special issue on model selection. Journal of Mathematical Psychology, 44, Wagenmakers, E-J., & Waldorp, L. (2006). Special issue on model selection. Journal of Mathematical Psychology, 50, Articles Shiffrin, R. M., Lee, M. D., Kim, W., & Wagenmakers, E-J. (2008). A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32, Wu, H., Myung, J. I., & Batchelder, W. H. (2010). On the minimum description length complexity of multinomial processing tree models. Journal of Mathematical Psychology, 54, Cavagnaro, D. R., Gonzalez, R., Myung, J. I., & Pitt, M. A. (2013). Optimal decision stimuli for risky choice experiments: An adaptive approach. Management Science, 59(2),

147 All the mathematics you missed but need to know for graduate school (in mathematical psychology and other fields): Undergraduate Calculus, Linear & Matrix Algebra, Real Analysis Probability Theory and Statistics Computer Programming in Matlab, R, and C++ Graduate Mathematical Statistics Bayesian Statistics Numerical Computation and Analysis Machine Learning Methods 147

148 The End 148


150 Supplemental Slides 150

151 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 3.5 Evaluating Other Types of Models 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 151


153 Restricted Scope of the Methods Models must generate likelihood functions (distribution of likelihoods across parameters) Not all models can do this (without simplification) Connectionist Simulation-based (CMR, Diffusion) Cognitive architectures Diversity of types of models in cognitive science makes model comparison challenging 153

154 Broader Framework Local Model Analysis Can the model simulate (or fit) the particular data pattern observed in an experimental setting? Evaluate the success of the model across diverse experiments. It is difficult to synthesize results to obtain a picture of why the model consistently succeeds in simulating human performance: is it the correct model, or an overly flexible model? To interpret a model's local performance, the global behavior of the model must also be understood (i.e., descriptive adequacy must be balanced by model complexity).

155 Broader Framework Global Model Analysis What other data patterns does the model produce besides the empirical one? Learn what a model is and is not capable of doing in a particular experimental setting (i.e., experimental design) with the goal being to obtain a more comprehensive picture of its behavior 155

156 Parameter Space Partitioning (PSP) Global Model Analysis method Partition a model's parameter space (hidden weight space) into distinct regions corresponding to qualitatively different data patterns the model could generate in an experiment (Pitt, Kim, Navarro & Myung, 2006) PSP interfaces between the continuous behavior of models and the often discrete, qualitative predictions across conditions in an experiment

157 Parameter Space Partitioning [Figure: a parameter space partitioned into regions for Pattern 1, Pattern 2, and an invalid pattern (e.g., activations never reached threshold)] How do you find all the data patterns that a model can generate? This is a hard problem: there is a potentially huge search space, a model simulation is required for each set of parameter values chosen, and each simulation result must be classified (does the current pattern match others already found, or is it new?)

158 What Can be Learned from PSP? Empirical data pattern across three conditions: B > C > A. [Figure: the parameter spaces of models M1 and M2 partitioned into regions for patterns such as B>C>A, A>B>C, A=B=C, C>A>B, ...] Questions PSP can answer: How many of the 13 different data patterns can a model simulate? How much of the space is occupied by the empirical pattern (central/peripheral)? What is the relationship between these other patterns and the empirical pattern (in terms of volume and similarity)?

159 Is it good or bad if a model's predictions are central or peripheral? [Figure: the same partitioned parameter spaces of M1 and M2] Simulation success takes on additional meaning with knowledge of other model behaviors. Model comparison methods do not make decisions for you; they provide you with data that are intended to help you arrive at an informed decision.

160 How does memory for a spoken word influence perception of the word's phonemes? [Figure: architectures of TRACE (McClelland & Elman, 1986) and Merge (Norris, McQueen, & Cutler, 2001). In TRACE, word nodes (e.g., job, jog) connect to a single phoneme level (J, O, B, G, D, V) fed by speech input, with excitatory and inhibitory connections. In Merge, word nodes connect to separate phoneme-input and phoneme-decision levels.]

161 What are the consequences of splitting the phoneme level in two? [Figure: the same TRACE and Merge architecture diagrams as above]

162 PSP Analysis Compared model performance in a 12-condition experiment. Empirical data pattern (subcategorical mismatch; Norris et al., 2000). How many of the possible ~3 million data patterns can each model generate?

163 PSP Analysis Phoneme and word recognition are defined as node activation reaching threshold. [Figure: recognition of the phoneme /b/: activation of the /b/, /g/, /d/, and /v/ nodes over cycle number (time), with the /b/ node crossing the threshold]

164 Partitioning of Parameter Space in terms of Phoneme Activation [Figure: a hypothetical example of a parameter space partitioned into regions where the /b/, /g/, or /v/ node reaches threshold, plus invalid regions]

165 Results of PSP Analysis: Number of Data Patterns By splitting the phoneme level in two, Merge is able to generate more data patterns (more flexible) 165

166 Volume Analysis (weak threshold only) Each fill pattern represents a different data pattern; only patterns that occupy more than 1% of the volume are shown. [Figure: proportion of the volume of the parameter space occupied by patterns common to and unique to TRACE and Merge, with the human pattern marked]

167 Correlation of Volumes of Common Regions [Figure: rank order of volume size in Merge plotted against rank order of volume size in TRACE, from smallest to largest]

168 Summary of Section 3.5 Parameter Space Partitioning (PSP) provides a global perspective on model behavior Broader context for interpreting simulation/fit success Complements local model analysis Assess similarity of models Applicable to a wide range of models PSP can be used to evaluate the effectiveness of experiments in distinguishing between models prior to testing (Does only one model generate the predicted pattern?) 168


More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

Introduction to Signal Detection and Classification. Phani Chavali

Introduction to Signal Detection and Classification. Phani Chavali Introduction to Signal Detection and Classification Phani Chavali Outline Detection Problem Performance Measures Receiver Operating Characteristics (ROC) F-Test - Test Linear Discriminant Analysis (LDA)

More information

Likelihood-based inference with missing data under missing-at-random

Likelihood-based inference with missing data under missing-at-random Likelihood-based inference with missing data under missing-at-random Jae-kwang Kim Joint work with Shu Yang Department of Statistics, Iowa State University May 4, 014 Outline 1. Introduction. Parametric

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Bayesian learning Probably Approximately Correct Learning

Bayesian learning Probably Approximately Correct Learning Bayesian learning Probably Approximately Correct Learning Peter Antal antal@mit.bme.hu A.I. December 1, 2017 1 Learning paradigms Bayesian learning Falsification hypothesis testing approach Probably Approximately

More information

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren 1 / 34 Metamodeling ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 1, 2015 2 / 34 1. preliminaries 1.1 motivation 1.2 ordinary least square 1.3 information

More information

Introduction to Bayesian Learning. Machine Learning Fall 2018

Introduction to Bayesian Learning. Machine Learning Fall 2018 Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability

More information

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

More information

Classification: The rest of the story

Classification: The rest of the story U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

More information

Lecture 2: Simple Classifiers

Lecture 2: Simple Classifiers CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 2: Simple Classifiers Slides based on Rich Zemel s All lecture slides will be available on the course website: www.cs.toronto.edu/~jessebett/csc412

More information

From inductive inference to machine learning

From inductive inference to machine learning From inductive inference to machine learning ADAPTED FROM AIMA SLIDES Russel&Norvig:Artificial Intelligence: a modern approach AIMA: Inductive inference AIMA: Inductive inference 1 Outline Bayesian inferences

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships

More information

Illustrating the Implicit BIC Prior. Richard Startz * revised June Abstract

Illustrating the Implicit BIC Prior. Richard Startz * revised June Abstract Illustrating the Implicit BIC Prior Richard Startz * revised June 2013 Abstract I show how to find the uniform prior implicit in using the Bayesian Information Criterion to consider a hypothesis about

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Probability Theory for Machine Learning. Chris Cremer September 2015

Probability Theory for Machine Learning. Chris Cremer September 2015 Probability Theory for Machine Learning Chris Cremer September 2015 Outline Motivation Probability Definitions and Rules Probability Distributions MLE for Gaussian Parameter Estimation MLE and Least Squares

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Bayesian inference. Justin Chumbley ETH and UZH. (Thanks to Jean Denizeau for slides)

Bayesian inference. Justin Chumbley ETH and UZH. (Thanks to Jean Denizeau for slides) Bayesian inference Justin Chumbley ETH and UZH (Thanks to Jean Denizeau for slides) Overview of the talk Introduction: Bayesian inference Bayesian model comparison Group-level Bayesian model selection

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information