Model Evaluation and Selection in Cognitive Computational Modeling


1 Model Evaluation and Selection in Cognitive Computational Modeling Jay Myung Department of Psychology Ohio State University, Columbus OH, USA In collaboration with Mark Pitt Mathematical Psychology Workshop (Jun 2013: Wuhan, China)


4 What we hope to achieve today This is intended to be a first introduction to model evaluation and selection/comparison for cognitive scientists. The aim is to provide a good conceptual overview of the topic and to make you aware of some of the fundamental issues and methods in model evaluation and selection. It will not be an in-depth, step-by-step workshop on how to apply model evaluation and comparison methods to extant models using computing or statistical software tools. We assume no more than a year-long course in graduate-level statistics.

5 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 5

6 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 6

7 Model Selection in Cognitive Computational Modeling How should one decide between competing explanations (models) of data? SIMPLE model of memory (Brown et al., 2007, Psych. Rev.) CMR model of memory (Polyn et al., 2009, Psych. Rev.)

8 Two Whales' Views of Model Selection

9 Which one of the two below should we choose? 9

10 Preliminaries Models are quantitative stand-ins for theories. Models are tools with which to study behavior: they increase the precision of prediction, generate novel predictions, and provide insight into complex behavior. Model comparison is a statistical inference problem: quantitative methods are developed to aid in deciding between models.

11 Preliminaries The diversity of types of models in cognitive science makes model comparison challenging; a variety of statistical methods are required. The discipline would benefit from sharing models and data sets: the Cognitive Modeling Repository (CMR).

12 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 12

13 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 13


15 Parameter Estimation A mathematical model specifies the range of data patterns it can describe by varying the values of its parameter vector $w = (w_1, w_2)$. For example, the power model: $p = w_1 (t+1)^{-w_2}$

16 Parameter Estimation Finding the parameter value that best fits observed data. Best-fit parameter: $(\hat w_1, \hat w_2) = (0.985, 0.424)$

17 MLE vs LSE Maximum likelihood estimation (MLE): find the parameter value that is most likely to have generated the data. Least squares estimation (LSE): find the parameter value that most accurately describes the data. MLE: $\hat w = \operatorname{argmax}_w\, p(\text{data} \mid w)$ LSE: $\hat w = \operatorname{argmin}_w\, \sum_i (\text{obs}_i - \text{prd}_i(w))^2$, the sum of squares error (SSE)

18 MLE in a Nutshell All parametric inference proceeds by building a probability model for the observed data, not by directly comparing data observations with model predictions. [Diagram: LSE links observations to predictions; MLE links observations to a probability model]

19 Why MLE? ***Opt. In statistics, MLE is the standard method of parameter estimation and forms a basis for many inferential statistical methods such as hypothesis testing and model comparison (e.g., LRT, AIC, BIC, MDL). Optimal properties of MLE: - Consistency (the true parameter value is recovered asymptotically) - Sufficiency (the estimate contains complete information about the parameter) - Efficiency (the estimate achieves minimum variance) In contrast, LSE does not generally satisfy these properties. LSE is primarily a descriptive tool, often associated with linear models with normal error, such as linear regression and analysis of variance.

20 Terminology ***Opt. Data: Statistically speaking, data are a random sample from an unknown population. Probability distribution: Each population is identified by a probability distribution that describes the probability of observing a data sample. Model parameter: Associated with each probability distribution is a unique value of the model parameter. As the parameter value changes, different probability distributions are specified.

21 Coin tossing experiment: Probability Distribution Parameter: $w$ $(0 \le w \le 1$; probability of a head on each toss$)$ Data: $y \in \{0, 1, ..., n\}$ (number of heads out of $n$ tosses) Binomial process: $y \sim \mathrm{Bin}(n, w)$ Probability distribution given $w$: $f(y \mid w) = \frac{n!}{(n-y)!\,y!}\, w^y (1-w)^{n-y}$

22 For example, suppose a binomial process with n = 10 and w = 0.2: $f(y \mid w=0.2) = \frac{10!}{(10-y)!\,y!}\,(0.2)^y (1-0.2)^{10-y}$, which gives the values $f(y=0 \mid w=0.2), f(y=1 \mid w=0.2), \ldots, f(y=10 \mid w=0.2)$, such that $\sum_{y=0}^{10} f(y \mid w=0.2) = 1$.
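
To make this concrete, here is a minimal Matlab sketch (mine, not from the slides) that evaluates the binomial PDF over its full support and checks that the probabilities sum to one:

    % Binomial PDF f(y | w) for n = 10 trials and w = 0.2,
    % evaluated over the full support y = 0, 1, ..., 10.
    n = 10; w = 0.2;
    y = 0:n;
    f = arrayfun(@(k) nchoosek(n, k), y) .* w.^y .* (1 - w).^(n - y);

    % The probabilities should sum to 1 (up to floating-point error).
    fprintf('f(y=2 | w=0.2) = %.4f\n', f(y == 2));
    fprintf('sum over y     = %.10f\n', sum(f));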

23 Binomial Probability Distributions 23

24 Normal Probability Distribution For a continuous random variable $y$ with normal error, the probability distribution takes the form $f(y \mid w=(\mu, \sigma)) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}$ $(-\infty < y < \infty)$

25 Probability Density Function ***Opt. When $f(y \mid w)$ is viewed as a function of the data $y$ given a fixed value of the parameter $w$, it is called the probability density function (PDF). Normalized PDF: $\sum_y f(y \mid w) = 1$ (discrete $y$); $\int f(y \mid w)\, dy = 1$ (continuous $y$), for each value of the parameter $w$.

26 Formal Definition of Model Formally speaking, a model is defined as the collection of PDFs the model generates by varying its parameter values: Model $M = \{\, f(y \mid w) : w \in W \,\}$

27 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 27

28 To recap, a PDF $f(y \mid w)$ associated with a specific parameter value $w$ describes the probability of observing an outcome $y$. For instance, according to the binomial PDF with $n=10$ and $w=0.2$, $y = 2$ is more likely to be observed than $y = 4$: $f(y=2 \mid w=0.2) > f(y=4 \mid w=0.2)$

29 The Inverse Problem In reality, however, we have observed the data $y$, but the parameter $w$ is unknown. Accordingly, we are faced with an inverse inference problem. Parameter estimation as an inverse problem: Given the observed data $y_{obs}$ and a model of interest, we wish to identify the one PDF, among all PDFs the model can generate, that is most likely to have produced $y_{obs}$.

30 Likelihood Function To solve the inverse problem, we define the likelihood function (LF) by switching the roles of $y$ and $w$ in the PDF: Likelihood function: $L(w \mid y_{obs}) = f(y_{obs} \mid w)$, a function of the parameter $w$ given the observed data $y_{obs}$.

31 An Example of Likelihood Function Given $y_{obs} = 7$ with $y \sim \mathrm{Bin}(n=10, w)$: $L(w \mid y_{obs}=7) = f(y_{obs}=7 \mid w) = \frac{10!}{(10-7)!\,7!}\, w^7 (1-w)^3$ $(0 \le w \le 1)$

32 PDF vs Likelihood Function Both the PDF $f(y \mid w)$ and the likelihood function (LF) $L(w \mid y)$ are derived from the same functional form, but are defined on different axes, representing two sides of the same coin, in a sense: PDF: $f(y \mid w_{fixed})$; LF: $f(y_{fixed} \mid w)$

33 PDF vs Likelihood Function - PDF: A function of $y$ given a specific value of $w$ (defined on the data scale) - LF: A function of $w$ given a specific value of $y$ (defined on the parameter scale) Both are non-negative measures: - PDF: Actual probability - Likelihood: An unnormalized probability $\int f(y \mid w_{fixed})\, dy = 1$ but $\int L(w \mid y_{obs})\, dw \ne 1$

34 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 34

35 Maximum Likelihood Estimate In MLE, we are interested in finding the probability distribution that makes the observed data $y_{obs}$ most likely, or equivalently, finding the parameter value that maximizes the likelihood function $L(w \mid y_{obs})$. The resulting parameter value is called the maximum likelihood estimate, denoted by $\hat w_{MLE}$: $\hat w_{MLE} = 0.7$ $(y_{obs} = 7,\ n = 10)$
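
As a quick illustration (my own sketch, not the slides'), one can evaluate $L(w \mid y_{obs}=7)$ on a grid in Matlab and confirm that the maximizer agrees with the analytic result $\hat w_{MLE} = y_{obs}/n = 0.7$:

    % Likelihood L(w | y_obs = 7) for y ~ Bin(n = 10, w),
    % evaluated on a fine grid of candidate w values.
    n = 10; yobs = 7;
    w = linspace(0.001, 0.999, 9999);
    L = nchoosek(n, yobs) .* w.^yobs .* (1 - w).^(n - yobs);

    % The grid maximizer should match the analytic MLE y/n = 0.7.
    [~, idx] = max(L);
    fprintf('grid MLE = %.3f (analytic: %.3f)\n', w(idx), yobs / n);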

36 Consequently, among all probability distributions (PDFs) the model can generate, the one that is most likely to have generated $y_{obs}=7$ is $f(y \mid w=0.7)$, shown below. ***Opt. [Figure: the MLE distribution]

37 Likelihood Equation ***Opt. To reiterate, MLE requires us to solve the problem of maximizing the LF, or equivalently the logarithmic LF: $\hat w = \operatorname{argmax}_w L(w \mid y_{obs}) = \operatorname{argmax}_w \ln L(w \mid y_{obs})$ Consider the general setting of a multi-dimensional parameter vector $w = (w_1, w_2, ..., w_k)$. A necessary condition for the MLE is that the likelihood function satisfy the following partial differential equation, called the likelihood equation (LE): $\frac{\partial \ln L(w \mid y_{obs})}{\partial w_i} = 0$ at $\hat w_{MLE}$, for all $i = 1, 2, ..., k$.

38 Why the likelihood equation? Because at an (interior) maximum, the first derivative of the LF with respect to each $w_i$ vanishes at $\hat w_{MLE}$: $\frac{\partial \ln L(w \mid y_{obs})}{\partial w} = 0$ ***Opt.

39 To illustrate, consider the previous binomial example with $y \sim \mathrm{Bin}(n=10, w)$, and suppose that we observed $y_{obs}=7$. ***Opt. The likelihood equation in this case is given by $\frac{d \ln L(w \mid y_{obs}=7)}{dw} = \frac{d}{dw}\ln\!\left[\frac{10!}{7!\,3!}\, w^7 (1-w)^3\right] = \frac{d}{dw}\left[\ln\frac{10!}{7!\,3!} + 7\ln w + 3\ln(1-w)\right] = \frac{7}{w} - \frac{3}{1-w} = \frac{7 - 10w}{w(1-w)} = 0$, from which we obtain $\hat w_{MLE} = 0.7$.

40 Solving MLE Note that finding the MLE estimate requires solving a system of $k$ likelihood equations simultaneously. For models with many parameters and non-linear model equations, it is generally not possible to find analytic-form solutions. Instead, one must solve the MLE problem numerically on a computer using an optimization algorithm (e.g., hill-climbing, Gauss-Newton, or Levenberg-Marquardt). This generally involves implementing an intelligent search procedure in which we seek the MLE by trial and error over a series of iterative steps.

41 Hill-climbing Optimization [Figure: log-likelihood $\ln L(w)$ plotted against the parameter $w$, with points A, B, C at parameter values $a_1, a_2, a_3$; the algorithm climbs toward $\hat w_{MLE}$ at B]

42 Local Maxima Problem No optimization algorithm can guarantee that the MLE will always be found. It is, therefore, important to run an optimization algorithm several times, each run with a different starting value of the parameter $w$, to see whether the same solution is obtained. Further, and importantly, the problem is exacerbated by what is known as the local maxima problem.

43 [Figure: a log-likelihood surface $\ln L(w)$ over the parameter $w$ with local maxima at A ($a_1$) and C ($a_3$) and the global maximum at B ($a_2$)]
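
A minimal sketch of the multiple-starting-value strategy, using nothing beyond base Matlab's fminsearch (the logit reparameterization, which keeps $w$ inside $(0,1)$, is my own convenience, not the slides'):

    % Multistart MLE for the binomial example: minimize the negative
    % log-likelihood from several random starts and keep the best fit.
    % Optimizing on the logit scale keeps w = 1/(1+exp(-z)) in (0, 1).
    n = 10; yobs = 7;
    wfun = @(z) 1 ./ (1 + exp(-z));
    negloglik = @(z) -(yobs * log(wfun(z)) + (n - yobs) * log(1 - wfun(z)));

    best_z = NaN; best_nll = Inf;
    for run = 1:10
        z0 = 2 * randn;                       % a different random start each run
        [z_hat, nll] = fminsearch(negloglik, z0);
        if nll < best_nll
            best_z = z_hat; best_nll = nll;   % keep the best solution so far
        end
    end
    fprintf('MLE over 10 restarts: w = %.3f\n', wfun(best_z));

This toy likelihood is unimodal, so all restarts should agree; the payoff of the multistart strategy comes with multimodal surfaces like the one pictured above.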

44 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 44

45 Retention Models and Data Suppose we are interested in testing two models of retention: POW: $p = w_1 (t+1)^{-w_2}$ EXP: $p = w_1 e^{-w_2 t}$ A memory experiment with n = 50 Bernoulli trials and eight time intervals of t = (0.5, 1, 2, 4, 8, 12, 16, 18) yielded the following data (i.e., number of correct responses out of 50 trials at each time interval): y_obs = (44, 34, 27, 26, 19, 17, 20, 11), or, equivalently, the observed proportions correct p_obs = (0.88, 0.68, 0.54, 0.52, 0.38, 0.34, 0.40, 0.22).

46 The Data 46

47 Likelihood Functions The first step of MLE is to write down the PDF for each model: POW: $f(y \mid w) = \prod_{i=1}^{8} \frac{n!}{(n-y_i)!\,y_i!}\left[w_1 (t_i+1)^{-w_2}\right]^{y_i}\left[1 - w_1 (t_i+1)^{-w_2}\right]^{n-y_i}$ EXP: $f(y \mid w) = \prod_{i=1}^{8} \frac{n!}{(n-y_i)!\,y_i!}\left[w_1 e^{-w_2 t_i}\right]^{y_i}\left[1 - w_1 e^{-w_2 t_i}\right]^{n-y_i}$ From this, we obtain the log LF: POW: $\ln L(w \mid y_{obs}) = \sum_{i=1}^{8}\left[ y_{obs,i}\ln\!\left(w_1 (t_i+1)^{-w_2}\right) + (n - y_{obs,i})\ln\!\left(1 - w_1 (t_i+1)^{-w_2}\right)\right]$ EXP: $\ln L(w \mid y_{obs}) = \sum_{i=1}^{8}\left[ y_{obs,i}\ln\!\left(w_1 e^{-w_2 t_i}\right) + (n - y_{obs,i})\ln\!\left(1 - w_1 e^{-w_2 t_i}\right)\right]$ where the parameter-independent terms are omitted.

48 Matlab Code 48
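
The slide's actual code is not preserved in this transcription. Below is my own minimal Matlab sketch of how such a fit might look, maximizing the binomial log-likelihoods above with fminsearch (predicted probabilities are clamped to (0, 1) as a crude safeguard):

    % MLE for the POW and EXP retention models via fminsearch.
    % Data: number correct out of n = 50 trials at eight intervals.
    t = [0.5 1 2 4 8 12 16 18];
    y = [44 34 27 26 19 17 20 11];
    n = 50;

    % Model predictions p(t; w) for w = (w1, w2).
    pow_p = @(w, t) w(1) * (t + 1).^(-w(2));
    exp_p = @(w, t) w(1) * exp(-w(2) * t);

    % Binomial negative log-likelihood (constant terms omitted).
    nll = @(p) -sum(y .* log(p) + (n - y) .* log(1 - p));
    clamp = @(p) min(max(p, 1e-6), 1 - 1e-6);
    fitmle = @(model, w0) fminsearch(@(w) nll(clamp(model(w, t))), w0);

    [w_pow, nll_pow] = fitmle(pow_p, [0.9 0.4]);
    [w_exp, nll_exp] = fitmle(exp_p, [0.9 0.1]);
    fprintf('POW: w = (%.3f, %.3f), -lnL = %.2f\n', w_pow, nll_pow);
    fprintf('EXP: w = (%.3f, %.3f), -lnL = %.2f\n', w_exp, nll_exp);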

49 MLE Best-fit Curves 49

50 Summary of Model Fits

                 MLE                 LSE
                 POW      EXP        POW      EXP
    (w1, w2)     ...      ...        ...      ...
    PVAF         91.2%    79.0%      91.4%    79.1%

(The parameter estimates were not preserved in this transcription.)

51 Why the Differences between MLE and LSE? Binomial likelihood data: $y \sim \mathrm{Bin}(n, p)$ (e.g., $p = w_1 e^{-w_2 t}$), with non-constant variance $\mathrm{var}(y) = np(1-p)$. ***Opt. LSE: minimize $\sum_i (\mathrm{obs}_i - \mathrm{prd}_i(w))^2$, which weights all observations equally.

52 R Code 52


54 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 54

55 3. Model Evaluation and Selection Goodness-of-fit vs. Generalizability Model Selection Methods Illustrative Examples


57 Goal of Modeling and Approximation to Truth The ultimate, ideal goal of modeling is to identify the model that actually generated the observed data. This is not possible because 1) there are never enough observations to pin down the truth exactly, and 2) the truth may be quite complex, beyond the modeler's imagination. A more realistic goal is to choose, among a set of candidate models, the one model that provides the closest approximation to the truth, in some defined sense.

58 Model Evaluation Criteria Qualitative criteria Falsifiability: Do potential observations exist that would be incompatible with the model? Plausibility: Does the theoretical account of the model make sense of established findings? Interpretability: Are the components of the model understandable and linked to known processes? Quantitative criteria Goodness of fit: Does the model fit the observed data sufficiently well? Complexity/simplicity: Is the model's description of the data achieved in the simplest possible manner? Generalizability: How well does the model predict future observations?

59 Goodness-of-fit (GOF) Measures - Root mean squared error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \big(y_{i,obs} - y_{i,prd}(\hat w)\big)^2}$ - Percent variance accounted for (PVAF), or $r^2$: $\mathrm{PVAF} = 100\left(1 - \frac{\sum_{i=1}^{n} \big(y_{i,obs} - y_{i,prd}(\hat w)\big)^2}{\sum_{i=1}^{n} \big(y_{i,obs} - \bar y_{obs}\big)^2}\right)$
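
Both measures take only a few lines of Matlab; a small sketch (the predicted values below are hypothetical, purely for illustration):

    % Goodness-of-fit measures for observed values y_obs and model
    % predictions y_prd evaluated at the best-fitting parameters.
    y_obs = [0.88 0.68 0.54 0.52 0.38 0.34 0.40 0.22];
    y_prd = [0.85 0.71 0.58 0.47 0.37 0.33 0.30 0.29];  % hypothetical

    rmse = sqrt(mean((y_obs - y_prd).^2));
    pvaf = 100 * (1 - sum((y_obs - y_prd).^2) / sum((y_obs - mean(y_obs)).^2));
    fprintf('RMSE = %.4f, PVAF = %.1f%%\n', rmse, pvaf);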

60 Noisy Data Behavioral data include random noise from a number of sources, such as measurement error, sampling error, and individual differences Data = Regularity (Cognitive process) + Noise (Idiosyncrasies) 60

61 Problem with GOF as Model Evaluation Criterion Data = Regularity (Cognitive process) + Noise (Idiosyncrasies) GOF = Fit to regularity + Fit to noise Properties of the model that have nothing to do with its ability to capture the underlying regularities can improve GOF

62 Over-fitting Problem Model 1: $Y = ae^{bX} + c$ Model 2: $Y = ae^{bX} + c\sin(dX + f) + g$

63 Model Complexity Complexity refers to a model's inherent flexibility that enables it to fit a wide range of data patterns. [Figure: from simple model to complex model as the number of model parameters increases]

64 Complexity: More than Number of Parameters? Power: $p = a(t+1)^{-b}$ Exponential: $p = ae^{-bt}$ Hyperbolic: $p = 1/(a + bt)$ Are these all equally complex? Maybe not.

65 Complexity of Multinomial Processing Tree Models Wu, Myung & Batchelder (2010, J. Math. Psych.) 65

66 Generalizability: The Yardstick of Model Selection Generalizability refers to a model's ability to fit all future data samples from the same underlying process, not just the currently observed data sample; it can thus be viewed as a measure of predictive accuracy, or of proximity to the underlying regularity.

67
                               Model A    Model B
    Goodness of fit (PVAF):    80%        99%
    Generalizability (PVAF):   70%        50%

68 Relationship among Goodness of Fit, Model Complexity and Generalizability [Figure: as model complexity increases, goodness of fit keeps rising while generalizability peaks and then declines; the region beyond the peak is overfitting]

69 Wanted: A method of model selection that estimates a model's generalizability by taking into account the effects of its complexity Selection Criterion: Choose the model, among a set of candidate models, that generalizes best

70 3. Model Evaluation and Selection Goodness-of-fit vs. Generalizability Model Selection Methods Illustrative Examples

71 Model Selection Methods Occam's Razor Likelihood Function Akaike Information Criterion (AIC) Bayesian Information Criterion (BIC) Minimum Description Length (MDL) Bayes Factor (BF) Cross-validation (CV) Likelihood Ratio Test (LRT)

72 Occam's Razor: The Economy of Explanation "Entities should not be multiplied beyond necessity." - William of Occam (c. 1287-1347)

73 Likelihood Function: Revisited Formally speaking, a mathematical model is defined in terms of the likelihood function (LF) that specifies the likelihood of the observed data as a function of the model parameter: Likelihood function (LF): $f(y \mid w)$ E.g., power model: $p = w_1 (t+1)^{-w_2}$ $(0 \le p \le 1)$ Data: $y \sim \mathrm{Binomial}(n, p(w))$ $(y = 0, 1, ..., n)$ LF: $f(y \mid w) = \frac{n!}{(n-y)!\,y!}\, p(w)^y (1-p(w))^{n-y}$

74 Maximum Likelihood: Revisited In model fitting, we are interested in finding the parameter value that is most likely to have generated the observed data, i.e., the one that maximizes the likelihood function: Maximum likelihood (ML): $f(y \mid \hat w)$

75 Penalized Likelihood Methods Formalization of Occam's razor Estimate a model's generalizability by penalizing it for excess complexity (i.e., more complexity than is needed to fit the regularity in the data) Puts models on equal footing

76 The Basic Idea (Generalizability) = (Goodness of fit) - (Model complexity) 76

77 The Basic Idea ***Opt. (Lack of generalizability) = (Lack of goodness of fit) + (Model complexity), i.e., the same relation as above with generalizability and goodness of fit expressed as losses (negated): -Generalizability = -Goodness of fit + Model complexity

78 Akaike Information Criterion (AIC) Akaike (1973): $\mathrm{AIC} = -2\ln f(y \mid \hat w) + 2k$, where $k$ is the number of parameters; the first term measures goodness of fit (ML) and the second penalizes model complexity. The smaller a model's AIC value, the greater its estimated generalizability. The model with the smallest AIC, among a set of competing models, is the best and thus should be preferred.

79 Bayesian Information Criterion (BIC) Schwarz (1978): $\mathrm{BIC} = -2\ln f(y \mid \hat w) + k\ln n$, where $n$ is the sample size; again, goodness of fit (ML) plus a model complexity penalty. The model that minimizes BIC should be preferred.
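
Given each model's maximized log-likelihood, both criteria are one-liners. A hedged Matlab sketch (the log-likelihood values are placeholders, not numbers from the talk):

    % AIC and BIC from maximized log-likelihoods of two models.
    loglik = [-24.5 -31.2];   % ln f(y | w_hat), hypothetical values
    k      = [2 2];           % numbers of free parameters
    n      = 50 * 8;          % sample size (here, total Bernoulli trials;
                              % what counts as n for BIC can be a substantive choice)

    aic = -2 * loglik + 2 * k;
    bic = -2 * loglik + k * log(n);
    fprintf('AIC: %.1f  %.1f\n', aic);
    fprintf('BIC: %.1f  %.1f\n', bic);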


81 Minimum Description Length (MDL) Rissanen (1996): $\mathrm{MDL} = -\ln f(y \mid \hat w) + \frac{k}{2}\ln\frac{n}{2\pi} + \ln\!\int \sqrt{\det I(w)}\, dw$, where $I(w)$ is the Fisher information matrix; the last term is sensitive to the model's functional form, not just its number of parameters $k$. Goodness of fit (ML) plus model complexity. The model that minimizes MDL should be preferred.

82 Bayes Factor (BF) ***Opt. (Kass & Raftery, 1995) In Bayesian model selection, each model is evaluated based on its marginal likelihood, defined as $p(y \mid M) = \int f(y \mid w, M)\, \pi(w \mid M)\, dw$, or equivalently an average likelihood (i.e., how well the model fits the data on average, across the range of its parameters). The Bayes factor (BF) between two models is defined as the ratio of their marginal likelihoods: $\mathrm{BF}_{ij} = \frac{p(y \mid M_i)}{p(y \mid M_j)}$

83 Under the assumption of equal model priors, the BF reduces to the posterior odds (from Bayes' rule): ***Opt. $\mathrm{BF}_{ij} = \frac{p(M_i \mid y)}{p(M_j \mid y)}$ Therefore, the model that maximizes the marginal likelihood is the one with the highest probability of being true given the observed data.

84 Features of Bayes Factor ***Opt. Pros No optimization (i.e., no maximum likelihood) No explicit measure of model complexity No overfitting, by averaging likelihood function across parameters Cons Issue of choosing parameter priors (virtue or vice?) Non-trivial computations requiring numerical integration 84
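
For a low-dimensional parameter space, the "non-trivial" integral can be approximated on a grid. A minimal sketch, entirely my own construction: a uniform prior on $w$ for model $M_1$, against a point model $M_2$ fixing $w = 0.5$, for the earlier binomial example:

    % Marginal likelihood p(y | M) = integral of f(y | w) pi(w) dw,
    % approximated on a grid, for y_obs = 7 successes in n = 10 trials.
    n = 10; yobs = 7;
    lik = @(w) nchoosek(n, yobs) .* w.^yobs .* (1 - w).^(n - yobs);

    w  = linspace(0.0005, 0.9995, 1000);  % grid over (0, 1)
    dw = w(2) - w(1);
    pyM1 = sum(lik(w)) * dw;              % M1: uniform prior, grid integral
    pyM2 = lik(0.5);                      % M2: w fixed at 0.5
    fprintf('BF_12 = %.2f\n', pyM1 / pyM2);

Note that no maximization is involved: the likelihood is averaged, not maximized, over the parameter.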

85 BIC as an approximation of BF ***Opt. A large-sample approximation of the marginal likelihood yields the easily computable Bayesian Information Criterion (BIC): $-2\ln p(y \mid M) \approx \mathrm{BIC}$

86 Cross-validation (CV) (Stone, 1974; Geisser, 1975) A sampling-based method of estimating generalizability: split the data into a calibration sample $y_{Cal}$ and a validation sample $y_{Val}$, fit the model to $y_{Cal}$, and score its prediction error on $y_{Val}$: $\mathrm{CV} = \mathrm{SSE}\big(y_{Val} \mid \hat w(y_{Cal})\big)$


88 Features of CV Pros Easy to use Sensitive to functional form as well as number of parameters Asymptotically equivalent to AIC Cons Sensitive to the partitioning used - Averaging over multiple partitions - Leave-one-out CV (LOOCV), instead of split-half CV Instability of the estimate due to loss of data 88
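
A hedged Matlab sketch of split-half CV for the POW retention model from the earlier example (the even/odd split and the least-squares calibration fit are my simplifications):

    % Split-half cross-validation for the POW retention model:
    % fit on the calibration half, score SSE on the validation half.
    t = [0.5 1 2 4 8 12 16 18];
    p = [0.88 0.68 0.54 0.52 0.38 0.34 0.40 0.22];   % observed proportions

    cal = 1:2:8;  val = 2:2:8;                       % one arbitrary partition
    pow_p = @(w, t) w(1) * (t + 1).^(-w(2));
    sse = @(w, idx) sum((p(idx) - pow_p(w, t(idx))).^2);

    w_cal = fminsearch(@(w) sse(w, cal), [0.9 0.4]); % calibrate
    cv = sse(w_cal, val);                            % CV = SSE(y_Val | w_hat(y_Cal))
    fprintf('split-half CV error = %.4f\n', cv);

Because the score depends on the particular split, in practice one would average over many partitions or use leave-one-out CV, as noted above.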


90 Why not the Likelihood Ratio Test (LRT)? ***Opt. The LRT is often used to test the adequacy of a model of interest ($M_R$) in the context of another, fuller model ($M_F$): $\mathrm{LRT} = -2\ln\frac{f(y \mid \hat w_R)}{f(y \mid \hat w_F)}$ The acceptance/rejection decision on $M_R$ is then made based on a p-value obtained from the chi-square distribution with $df = k_F - k_R$. Acceptance: model $M_R$ provides a sufficiently good fit to the observed data, and therefore the extra parameters of model $M_F$ are judged to be unnecessary.
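
For completeness, a sketch of the mechanics (assuming the Statistics Toolbox's chi2cdf; the log-likelihood values are placeholders, not numbers from the talk):

    % Likelihood ratio test of a restricted model M_R (k_R parameters)
    % nested within a fuller model M_F (k_F parameters).
    lnL_R = -26.1;  k_R = 2;    % hypothetical maximized log-likelihoods
    lnL_F = -24.9;  k_F = 3;

    lrt = -2 * (lnL_R - lnL_F);         % LRT statistic
    df  = k_F - k_R;
    p   = 1 - chi2cdf(lrt, df);         % requires Statistics Toolbox
    fprintf('LRT = %.2f, df = %d, p = %.3f\n', lrt, df, p);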

91 LRT is NOT a model selection method ***Opt. LRT does not assess a model's generalizability, which is the hallmark of model selection. LRT is a null hypothesis significance testing (NHST) method that simply judges the descriptive adequacy of a given model (cf. contemporary criticisms of NHST). LRT requires a pair-wise, nested comparison (i.e., it can compare only two models at a time). LRT was developed in the context of linear models with normal error; sampling distributions of the LRT statistic under nonlinear and nonnormal error conditions are generally unknown. At best, LRT is a poor substitute for current model selection methods, such as AIC and BIC.

92 3. Model Evaluation and Selection Goodness-of-fit vs. Generalizability Model Selection Methods Illustrative Examples

93 Example #1: Retention Models (Remember?) MLE results: POW (black): $p = a(t+1)^{-b}$ EXP (red): $p = ae^{-bt}$ POW2 (blue): $p = a(t+1)^{-b} + c$ [Figure: proportion correct (pc) vs. time interval (t), with the three fitted curves]

94 Model Selection Results [Table comparing POW, EXP, and POW2 on number of parameters, PVAF, log-likelihood, AIC, BIC, and LOOCV; the numeric entries were not preserved in this transcription]

95 Example #2 [slide content not preserved in this transcription]

96 Matlab Code 96


98 Guidelines for Model Evaluation and Selection 1. Evaluate a model's fit to data (descriptive adequacy, or goodness-of-fit) 2. Must take into account a model's fit to other possible data (complexity/flexibility) 3. Select among models using generalizability measures (AIC, BIC, MDL, Bayes Factor, etc.)

99 Summary of Section 3 Models should be evaluated based on generalizability, not on goodness of fit: "Thou shalt not select the best-fitting model but shalt select the best-predicting model." Other non-statistical but very important selection criteria: Plausibility Interpretability Explanatory adequacy Falsifiability

100 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 100


102 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization (ADO) 5. Conclusion 102

103 4. A New Experimental Tool for Model Selection: ADO Introduction Discriminating Models of Retention with ADO Illustrative Example 103


105 Introduction Experiments are fundamental to the advancement of cognitive science 105

106 Introduction Data obtained from experiments are used to fit formal models. Multinomial Processing Trees Structural Equations (SEM) Neural Network 106

107 Introduction Often, there are many competing models to describe the same cognitive/perceptual process. Example Short-term memory literature Over 100 different 2-parameter models have been proposed to describe the rate of forgetting over time (Rubin & Wenzel, 1996) 107

108 Model selection problem It is often very difficult to reject models through experimentation because they can mimic one another so closely. Example: Some models of memory retention (forgetting) Hyperbolic Exponential Power 108

109 Experiments to discriminate between models Flow chart of a typical investigation: Formulate models → Collect data → Analyze data → Statistical model selection methods (e.g., AIC & BIC). Model selection criteria are limited by the data they have to work with.

110 Experiments to discriminate between models Flow chart of a typical investigation: Formulate models → Collect data → Analyze data. Typically, collect lots of data to decrease noise.

111 Experiments to discriminate between models Flow chart of a typical investigation: Formulate models → Collect data → Analyze data. Instead, collect smarter data to highlight differences between models.

112 Collecting smarter data The design of an experiment determines the quality of the data that are collected. Design variables include: number of observations, number of treatment groups, stimulus levels.


114 Hard-to-recruit participants 114

115 Adaptive design optimization (ADO) Adaptively designed experiments: Conduct the full experiment as a sequence of mini-experiments. Improve the design of the next mini-experiment using knowledge gained from the previous mini-experiments. Myung & Pitt (2009), Psychological Review; Cavagnaro, Myung, Pitt & Kujala (2010), Neural Computation.

116 Adaptive design optimization (ADO) Sequential design framework [Diagram: designs $d_1, d_2, d_3, ..., d_s$ yield experimental observations $Y_{obs}(d_1), Y_{obs}(d_2), ..., Y_{obs}(d_s)$] Adapt the design of the next experiment based on the results of preceding experiments.

117 4. A New Experimental Tool for Model Selection: ADO Introduction Discriminating Models of Retention with ADO Illustrative Example 117

118 Designing a retention experiment: What Time Intervals Should be Employed? Retention: the rate of retrieval failure over time 118

119 Designing a retention Experiment: What Time Intervals Should be Employed? Two models of retention 119

120 Designing a retention experiment: What Time Intervals Should be Employed? Model mimicry between POW and EXP [Figure: the two models' retention curves nearly overlap]

121 Designing a retention experiment: What Time Intervals Should be Employed? Model predictions for a narrow range of parameters (100 Bernoulli trials) [Figure: recall rate vs. time (seconds) for POW and EXP; the parameter ranges for a and b, e.g. 0.58 < a < ..., were not fully preserved in this transcription]

122 Designing a retention experiment: What Time Intervals Should be Employed? A good design choice can aid in discrimination. ... seconds: Bad designs! [Figure: recall rate vs. time (seconds) for POW and EXP]

123 Designing a retention experiment: What Time Intervals Should be Employed? A good design choice can aid in discrimination. 2-4 seconds: Good designs! [Figure: recall rate vs. time (seconds) for POW and EXP]

124 Designing a retention experiment: What Time Intervals Should be Employed? More realistic situations require a more principled approach to finding optimal designs. [Figure: recall rate vs. time (seconds) for POW with a ~ Beta(2,1), b ~ Beta(1,4), and for EXP with a ~ Beta(2,1), b ~ Beta(1,80)]

125 4. A New Experimental Tool for Model Selection: ADO Introduction Discriminating Models of Retention with ADO Illustrative Example 125

126 Simulated experiments Generate data at each stage from some true model (e.g., EXP with a = 0.71 and b = 0.08) at the optimal retention interval identified by the ADO algorithm.

127 Simulated experiments The idea is to see how many stages it takes to accumulate enough data to decisively identify the data-generating model according to some model selection criterion (e.g., the Bayes factor).
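
The design-optimization step itself (choosing each interval by maximizing an information-theoretic utility) is beyond a short sketch, but the Bayesian bookkeeping that drives these simulations can be shown compactly. Below is my own grid-based Matlab illustration of updating the model and parameter probabilities after each mini-experiment; the uniform parameter priors and grids are assumptions (the talk used Beta priors), while the intervals and correct-response counts follow the stages reported on the next slides:

    % Sequential Bayesian updating of model probabilities for POW vs.
    % EXP after each mini-experiment of 30 Bernoulli trials (design
    % optimization itself omitted; intervals taken from the slides).
    [a, b] = meshgrid(linspace(0.05, 0.95, 60), linspace(0.005, 0.5, 60));
    a = a(:); b = b(:);
    postPOW = ones(size(a)) / numel(a);   % uniform parameter priors
    postEXP = postPOW;
    pM = [0.5 0.5];                       % P0(POW), P0(EXP)

    n = 30;
    stages = [16.0 7; 96.4 0; 96.8 0];    % [interval in s, # correct]
    for s = 1:size(stages, 1)
        t = stages(s, 1);  y = stages(s, 2);
        p1 = min(max(a .* (t + 1).^(-b), 1e-9), 1 - 1e-9);  % POW
        p2 = min(max(a .* exp(-b * t),   1e-9), 1 - 1e-9);  % EXP
        lik1 = p1.^y .* (1 - p1).^(n - y);  % binomial coef. cancels
        lik2 = p2.^y .* (1 - p2).^(n - y);
        m1 = sum(lik1 .* postPOW);          % marginal likelihoods
        m2 = sum(lik2 .* postEXP);
        pM = pM .* [m1 m2];  pM = pM / sum(pM);
        postPOW = lik1 .* postPOW / m1;     % parameter posteriors
        postEXP = lik2 .* postEXP / m2;
        fprintf('Stage %d: P(POW) = %.2f, P(EXP) = %.2f\n', s, pM);
    end

With different priors the exact numbers will differ from the probabilities reported on the slides, but the qualitative pattern, with EXP pulling ahead as zero-recall data arrive at long intervals, should reproduce.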

128 Simulation results Data generated from EXP with a = 0.71 and b = 0.08

129 Simulation results Stage 1: compute the optimal design and generate data (30 Bernoulli trials). Prior model probabilities: $P_0(\mathrm{POW}) = 0.5$, $P_0(\mathrm{EXP}) = 0.5$. Outcome: 7 correct responses at the chosen interval of 16 seconds.

130 Simulation results Stage 1: update the model and parameter probabilities given 7 correct responses at 16 seconds: $P_1(\mathrm{POW}) = 0.65$, $P_1(\mathrm{EXP}) = 0.35$.

131 Simulation results Stage 2: compute the optimal design and generate data (30 Bernoulli trials). Current model probabilities: $P_1(\mathrm{POW}) = 0.65$, $P_1(\mathrm{EXP}) = 0.35$. Outcome: 0 correct responses at the chosen interval of 96.4 seconds.

132 Simulation results Stage 2: update the model and parameter probabilities given 0 correct responses at 96.4 seconds: $P_2(\mathrm{POW}) = 0.30$, $P_2(\mathrm{EXP}) = 0.70$.

133 Simulation results Stage 3: compute the optimal design and generate data (30 Bernoulli trials). Current model probabilities: $P_2(\mathrm{POW}) = 0.30$, $P_2(\mathrm{EXP}) = 0.70$. Outcome: 0 correct responses at the chosen interval of 96.8 seconds.

134 Simulation results Stage 3: update the model and parameter probabilities given 0 correct responses at 96.8 seconds: $P_3(\mathrm{POW}) = 0.02$, $P_3(\mathrm{EXP}) = 0.98$.

135 Simulation results 135

136 Simulation results Bayes Factor = 20 (very strong evidence for EXP) 136


138 Human experiment results 138

139 ADO for Visual Psychophysics Experiment: Adaptive Estimation of Contrast Sensitivity Function (CSF) Lesmes, Jeon, Lu & Dosher (2006); Lesmes, Lu, Baek & Albright (2010) 139


141 ADO for Neurophysiology Experiment Lewi, Butera & Paninski (2009) 141

142 Much ADO about Nothing - William Shakespeare 142

143 Summary of Section 4 Competing models can be difficult to discriminate because good experimental designs are elusive. ADO finds and exploits differences between models to maximize the likelihood of discrimination. Qualifications: ADO is not applicable to all types of models; not all variables can be optimized; computational costs.

144 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 144

145 Conclusion Model evaluation: A model's good fit to a data set is only a necessary first step in model evaluation, not a sufficient, final, confirmatory step. Model selection: The best model, among a set of candidate models, is the one that generalizes best and can be identified using model selection methods, such as AIC, BIC, and MDL. Experimental design: Adaptive design optimization (ADO) can further improve model evaluation and selection.

146 Further readings on model evaluation, model selection, and adaptive design optimization Special issues Gluck, K., Bello, P., & Busemeyer, J. (2008). Special issue on model comparison. Cognitive Science, 32, Myung, I. J., Forster, M., & Browne, M. W. (2000). Special issue on model selection. Journal of Mathematical Psychology, 44, Wagenmakers, E-J., & Waldorp, L. (2006). Special issue on model selection. Journal of Mathematical Psychology, 50, Articles Shiffrin, R. M., Lee, M. D., Kim, W., & Wagenmakers, E-J. (2008). A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32, Wu, H., Myung, J. I., & Batchelder, W. H. (2010). On the minimum description length complexity of multinomial processing tree models. Journal of Mathematical Psychology, 54, Cavagnaro, D. R., Gonzalez, R., Myung, J. I., & Pitt, M. A. (2013). Optimal decision stimuli for risky choice experiments: An adaptive approach. Management Science, 59(2),

147 All the mathematics you missed but need to know for graduate school (in mathematical psychology and other fields): Undergraduate Calculus, Linear & Matrix Algebra, Real Analysis Probability Theory and Statistics Computer Programming in Matlab, R, and C++ Graduate Mathematical Statistics Bayesian Statistics Numerical Computation and Analysis Machine Learning Methods 147

148 The End 148


150 Supplemental Slides 150

151 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 3.5 Evaluating Other Types of Models 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 151


153 Restricted Scope of the Methods Models must generate likelihood functions (distribution of likelihoods across parameters) Not all models can do this (without simplification) Connectionist Simulation-based (CMR, Diffusion) Cognitive architectures Diversity of types of models in cognitive science makes model comparison challenging 153

154 Broader Framework Local Model Analysis Can the model simulate (or fit) the particular data pattern observed in an experimental setting? Evaluate the success of the model across diverse experiments. It is difficult to synthesize results to obtain a picture of why the model consistently succeeds in simulating human performance: is it the correct model, or an overly flexible model? To interpret a model's local performance, the global behavior of the model must also be understood (i.e., descriptive adequacy must be balanced by model complexity).

155 Broader Framework Global Model Analysis What other data patterns does the model produce besides the empirical one? Learn what a model is and is not capable of doing in a particular experimental setting (i.e., experimental design) with the goal being to obtain a more comprehensive picture of its behavior 155

156 Parameter Space Partitioning (PSP) Global Model Analysis method Partition a model's parameter space (hidden weight space) into distinct regions corresponding to qualitatively different data patterns the model could generate in an experiment (Pitt, Kim, Navarro & Myung, 2006) PSP interfaces between the continuous behavior of models and the often discrete, qualitative predictions across conditions in an experiment

157 Parameter Space Partitioning [Figure: a parameter space partitioned into regions for Pattern 1, Pattern 2, and an invalid pattern (e.g., activations never reached threshold)] How do you find all the data patterns that a model can generate? This is a hard problem: there is a potentially huge search space, a model simulation is required for each set of parameter values chosen, and each simulation result must be classified (does the current pattern match others already found, or is it new?)

158 What Can be Learned from PSP? Empirical data pattern across three conditions: B > C > A. [Figure: the parameter spaces of models M1 and M2 partitioned into regions for patterns such as B>C>A, A>B>C, A=B=C, C>A>B, ...] Questions PSP can answer: How many of the 13 different data patterns can a model simulate? How much of the space is occupied by the empirical pattern (central/peripheral)? What is the relationship between these other patterns and the empirical pattern (in terms of volume and similarity)?

159 Is it good or bad if a model's predictions are central or peripheral? [Figure: the same partitioned parameter spaces of M1 and M2] Simulation success takes on additional meaning with knowledge of other model behaviors. Model comparison methods do not make decisions for you; they provide you with data that are intended to help you arrive at an informed decision.

160 How does memory for a spoken word influence perception of the word's phonemes? [Figure: architectures of TRACE (McClelland & Elman, 1986) and Merge (Norris, McQueen, & Cutler, 2001). In TRACE, word nodes (e.g., job, jog) connect to a single phoneme level (J, O, B, G, D, V) fed by speech input, with excitatory and inhibitory connections. In Merge, word nodes connect to separate phoneme-input and phoneme-decision levels.]

161 What are the consequences of splitting the phoneme level in two? [Figure: the same TRACE and Merge architecture diagrams as above]

162 PSP Analysis Compared model performance in a 12-condition experiment. Empirical data pattern (subcategorical mismatch; Norris et al., 2000). How many of the possible ~3 million data patterns can each model generate?

163 PSP Analysis Phoneme and word recognition are defined as node activation reaching threshold. [Figure: recognition of the phoneme /b/: activation of the /b/, /g/, /d/, and /v/ nodes over cycle number (time), with the /b/ node crossing the threshold]

164 Partitioning of Parameter Space in terms of Phoneme Activation [Figure: a hypothetical example of a parameter space partitioned into regions where the /b/, /g/, or /v/ node reaches threshold, plus invalid regions]

165 Results of PSP Analysis: Number of Data Patterns By splitting the phoneme level in two, Merge is able to generate more data patterns (more flexible) 165

166 Volume Analysis (weak threshold only) Each fill pattern represents a different data pattern; only patterns that occupy more than 1% of the volume are shown. [Figure: proportion of the volume of the parameter space occupied by patterns common to and unique to TRACE and Merge, with the human pattern marked]

167 Correlation of Volumes of Common Regions [Figure: rank order of volume size in Merge plotted against rank order of volume size in TRACE, from smallest to largest]

168 Summary of Section 3.5 Parameter Space Partitioning (PSP) provides a global perspective on model behavior Broader context for interpreting simulation/fit success Complements local model analysis Assess similarity of models Applicable to a wide range of models PSP can be used to evaluate the effectiveness of experiments in distinguishing between models prior to testing (Does only one model generate the predicted pattern?) 168


More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

Introduction to Signal Detection and Classification. Phani Chavali

Introduction to Signal Detection and Classification. Phani Chavali Introduction to Signal Detection and Classification Phani Chavali Outline Detection Problem Performance Measures Receiver Operating Characteristics (ROC) F-Test - Test Linear Discriminant Analysis (LDA)

More information

Likelihood-based inference with missing data under missing-at-random

Likelihood-based inference with missing data under missing-at-random Likelihood-based inference with missing data under missing-at-random Jae-kwang Kim Joint work with Shu Yang Department of Statistics, Iowa State University May 4, 014 Outline 1. Introduction. Parametric

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Bayesian learning Probably Approximately Correct Learning

Bayesian learning Probably Approximately Correct Learning Bayesian learning Probably Approximately Correct Learning Peter Antal antal@mit.bme.hu A.I. December 1, 2017 1 Learning paradigms Bayesian learning Falsification hypothesis testing approach Probably Approximately

More information

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren 1 / 34 Metamodeling ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 1, 2015 2 / 34 1. preliminaries 1.1 motivation 1.2 ordinary least square 1.3 information

More information

Introduction to Bayesian Learning. Machine Learning Fall 2018

Introduction to Bayesian Learning. Machine Learning Fall 2018 Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability

More information

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

More information

Classification: The rest of the story

Classification: The rest of the story U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

More information

Lecture 2: Simple Classifiers

Lecture 2: Simple Classifiers CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 2: Simple Classifiers Slides based on Rich Zemel s All lecture slides will be available on the course website: www.cs.toronto.edu/~jessebett/csc412

More information

From inductive inference to machine learning

From inductive inference to machine learning From inductive inference to machine learning ADAPTED FROM AIMA SLIDES Russel&Norvig:Artificial Intelligence: a modern approach AIMA: Inductive inference AIMA: Inductive inference 1 Outline Bayesian inferences

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships

More information

Illustrating the Implicit BIC Prior. Richard Startz * revised June Abstract

Illustrating the Implicit BIC Prior. Richard Startz * revised June Abstract Illustrating the Implicit BIC Prior Richard Startz * revised June 2013 Abstract I show how to find the uniform prior implicit in using the Bayesian Information Criterion to consider a hypothesis about

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Probability Theory for Machine Learning. Chris Cremer September 2015

Probability Theory for Machine Learning. Chris Cremer September 2015 Probability Theory for Machine Learning Chris Cremer September 2015 Outline Motivation Probability Definitions and Rules Probability Distributions MLE for Gaussian Parameter Estimation MLE and Least Squares

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Bayesian inference. Justin Chumbley ETH and UZH. (Thanks to Jean Denizeau for slides)

Bayesian inference. Justin Chumbley ETH and UZH. (Thanks to Jean Denizeau for slides) Bayesian inference Justin Chumbley ETH and UZH (Thanks to Jean Denizeau for slides) Overview of the talk Introduction: Bayesian inference Bayesian model comparison Group-level Bayesian model selection

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information