Model Evaluation and Selection in Cognitive Computational Modeling
1 Model Evaluation and Selection in Cognitive Computational Modeling Jay Myung Department of Psychology Ohio State University, Columbus OH, USA In collaboration with Mark Pitt Mathematical Psychology Workshop (Jun 2013: Wuhan, China)
4 What we hope to achieve today This is intended to be a first introduction to model evaluation and selection/comparison for cognitive scientists The aim is to provide a good conceptual overview of the topic and make you aware of some of the fundamental issues and methods in model evaluation and selection Will not be an in-depth, step-by-step workshop on how to apply model evaluation and comparison methods to extant models using computing or statistical software tools Assume no more than a year-long course in graduate level statistics 4
5 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 5
6 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 6
7 Model Selection in Cognitive Computational Modeling How should one decide between competing explanations (models) of data? SIMPLE model of memory (Brown et al., 2007, Psy. Rev.) CMR model of memory (Polyn et al., 2009, Psy. Rev.) 7
8 Two Whales' Views of Model Selection 8
9 Which one of the two below should we choose? 9
10 Preliminaries Models are quantitative stand-ins for theories Models are tools with which to study behavior Increase the precision of prediction Generate novel predictions Provide insight into complex behavior Model comparison is a statistical inference problem. Quantitative methods are developed to aid in deciding between models 10
11 Preliminaries Diversity of types of models in cognitive science makes model comparison challenging A variety of statistical methods are required The discipline would benefit from sharing models and data sets Cognitive Modeling Repository (CMR) 11
12 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 12
13 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 13
15 Parameter Estimation A mathematical model specifies the range of data patterns it can describe by varying the values of its parameter (w), for example, Power model: p = w1 (t + 1)^(-w2) 15
16 Parameter Estimation Finding the parameter value that best fits observed data Best-fit parameter: (ŵ1, ŵ2) = (0.985, 0.424)
17 MLE vs LSE Maximum likelihood estimation (MLE): Find the parameter value that is most likely to have generated the data. Least squares estimation (LSE): Find the parameter value that most accurately describes the data. MLE: ŵ = argmax_w p(data | w) LSE: ŵ = argmin_w Σ_i (obs_i - prd_i(w))^2, the sum of squares error (SSE) 17
18 MLE in a Nutshell All parametric inference proceeds by building a probability model for the observed data, not by directly comparing data observations with model predictions. (Diagram: observations are linked to predictions by LSE, and to a probability model by MLE.) 18
19 Why MLE? ***Opt. In statistics, MLE is the standard method of parameter estimation and forms a basis for many inferential statistical methods such as hypothesis testing and model comparison (e.g., LRT, AIC, BIC, MDL). Optimal properties of MLE: - Consistency (true parameter value is recovered asymptotically) - Sufficiency (contains complete info about parameter) - Efficiency (achieves minimum variance of parameter estimate) In contrast, LSE does not generally satisfy these properties. LSE is primarily a descriptive tool, often associated with linear models with normal error, such as linear regression and analysis of variance. 19
20 Terminology ***Opt. Data: Statistically speaking, data are a random sample from an unknown population. Probability distribution: Each population is identified by a probability distribution that describes the probability of observing a data sample. Model parameter: Associated with each probability distribution is a unique value of the model parameter. As the parameter value changes, different probability distributions are specified. 20
21 Coin tossing experiment: Probability Distribution Parameter: w (0 ≤ w ≤ 1; probability of a head on each toss) Data: y = 0, 1, ..., n (number of heads out of n tosses) Binomial process: y ~ Bin(n, w) Probability distribution given w: f(y | w) = n! / ((n - y)! y!) * w^y (1 - w)^(n - y) 21
22 For example, suppose a binomial process given n = 10 and w = 0.2: f(y | w=0.2) = 10! / ((10 - y)! y!) * (0.2)^y (1 - 0.2)^(10 - y) This defines the probabilities f(y=0 | w=0.2), f(y=1 | w=0.2), ..., f(y=10 | w=0.2), such that Σ_{y=0}^{10} f(y | w=0.2) = 1 22
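These binomial probabilities are easy to verify numerically. A minimal Python sketch (our own illustration; the helper name is hypothetical):

```python
from math import comb

def binom_pmf(y, n, w):
    # Binomial probability of y heads in n tosses, head probability w
    return comb(n, y) * w**y * (1 - w)**(n - y)

# The full distribution for n = 10, w = 0.2 sums to 1, as on the slide
probs = [binom_pmf(y, 10, 0.2) for y in range(11)]
print(round(sum(probs), 6))  # 1.0
```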
23 Binomial Probability Distributions 23
24 Normal Probability Distribution For a continuous random variable y with normal error, the probability distribution takes the form f(y | w=(μ, σ)) = 1/(σ√(2π)) * exp(-(y - μ)^2 / (2σ^2)) (-∞ < y < ∞) 24
25 Probability Density Function ***Opt. When f(y|w) is viewed as a function of data y given a fixed value of parameter w, it is called the probability density function (PDF). Normalized PDF: Σ_y f(y | w) = 1 (discrete y); ∫ f(y | w) dy = 1 (continuous y), for each value of parameter w. 25
26 Formal Definition of Model Formally speaking, a model is defined as the collection of PDFs the model generates by varying its parameter values: Model M = { f(y | w) : w ∈ W } 26
27 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 27
28 To recap, a PDF, f(y|w), associated with a specific parameter value w describes the probability of observing an outcome y. For instance, according to the binomial PDF with n=10 and w=0.2, y = 2 is more probable than y = 4: f(y=2 | w=0.2) > f(y=4 | w=0.2)
29 The Inverse Problem In reality, however, we have observed the data y, but the parameter w is unknown. Accordingly, we are faced with an inverse inference problem: Parameter estimation as an inverse problem: Given the observed data y obs and a model of interest, we wish to identify the one PDF, among all PDFs the model can generate, that is most likely to have produced y obs. 29
30 Likelihood Function To solve the inverse problem, we define the likelihood function (LF) by switching the roles of y and w in the PDF: Likelihood function: L(w | y_obs) = f(y_obs | w), viewed as a function of parameter w given observed data y_obs. 30
31 An Example of Likelihood Function Given y_obs = 7 with y ~ Bin(n=10, w): L(w | y_obs=7) = f(y_obs=7 | w) = 10! / ((10 - 7)! 7!) * w^7 (1 - w)^3 (0 ≤ w ≤ 1) 31
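A quick numerical illustration (our own sketch, not from the slides): evaluating L(w | y_obs=7) on a grid of w values shows the peak at w = 0.7.

```python
from math import comb

def likelihood(w, y_obs=7, n=10):
    # L(w | y_obs): the binomial PMF read as a function of w
    return comb(n, y_obs) * w**y_obs * (1 - w)**(n - y_obs)

# A coarse grid search over w in (0, 1)
grid = [i / 100 for i in range(1, 100)]
w_best = max(grid, key=likelihood)
print(w_best)  # 0.7
```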
32 PDF vs Likelihood Function Both the PDF f(y|w) and the likelihood function (LF) L(w|y) are derived from the same functional form, but are defined on different axes, representing two sides of the same coin: PDF: f(y | w_fixed) LF: f(y_fixed | w) 32
33 PDF vs Likelihood Function - PDF: A function of y given a specific value of w (defined on the data scale) - LF: A function of w given a specific value of y (defined on the parameter scale) Both are non-negative measures: - PDF: Actual probability - Likelihood: An unnormalized probability ∫ f(y | w_fixed) dy = 1, but ∫ L(w | y_obs) dw ≠ 1 33
34 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 34
35 Maximum Likelihood Estimate In MLE, we are interested in finding the probability distribution that makes the observed data y_obs most likely, or equivalently, the parameter value that maximizes the likelihood function L(w | y_obs). The resulting parameter value is called the maximum likelihood estimate, denoted by ŵ_MLE. Here, ŵ_MLE = 0.7 (y_obs = 7, n = 10) 35
36 Consequently, among all probability distributions or PDFs the model can generate, the one that is most likely to have generated y obs =7 is considered to be f(y w=0.7), shown below. ***Opt. MLE Distribution 36
37 Likelihood Equation ***Opt. To reiterate, MLE requires us to solve the problem of maximizing the LF or the logarithmic LF: ŵ = argmax_w L(w | y_obs) = argmax_w ln L(w | y_obs) Consider the general setting of a multi-dimensional parameter vector w = (w1, w2, ..., wk). A necessary condition for the MLE is that the likelihood function satisfy the following partial differential equation, called the likelihood equation (LE): ∂ ln L(w | y_obs) / ∂w_i = 0 at ŵ_MLE, for all i = 1, 2, ..., k. 37
38 Why the likelihood equation?: Because by definition, the first derivative of the LF with respect to each w_i vanishes at ŵ_MLE. ***Opt. ∂ ln L(w | y_obs) / ∂w = 0 38
39 To illustrate, consider the previous binomial example with y ~ Bin(n=10, w), and suppose that we observed y_obs = 7. ***Opt. The likelihood equation in this case is given by d ln L(w | y_obs=7) / dw = d/dw [ ln(10!/(7!3!)) + 7 ln w + 3 ln(1 - w) ] = 7/w - 3/(1 - w) = (7 - 10w) / (w(1 - w)) = 0 from which we obtain ŵ_MLE = 0.7. 39
40 Solving MLE Note that finding the MLE estimate requires solving a system of k likelihood equations simultaneously. For models with many parameters and non-linear model equations, it is generally not possible to find analytic-form solutions. Instead, one must solve the MLE problem numerically on a computer using an optimization algorithm (e.g., hill-climbing, Gauss-Newton, or Levenberg-Marquardt). This generally involves an intelligent search procedure, in which we seek the MLE by trial and error over a series of iterative steps. 40
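As an illustration of the trial-and-error idea, here is a toy hill-climbing search for the binomial example (a sketch we added; real applications would use a library optimizer implementing methods such as those named above):

```python
import math

def neg_log_lik(w, y=7, n=10):
    # Negative log-likelihood for the binomial example (constants dropped)
    return -(y * math.log(w) + (n - y) * math.log(1 - w))

# Naive hill-climbing: move to a better neighbor; otherwise shrink the step
w, step = 0.5, 0.1
while step > 1e-8:
    for cand in (w - step, w + step):
        if 0 < cand < 1 and neg_log_lik(cand) < neg_log_lik(w):
            w = cand
            break
    else:
        step /= 2
print(round(w, 3))  # converges to the analytic MLE y/n = 0.7
```

Running the search several times from different starting values, as the next slides recommend, guards against stopping at a local maximum.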
41 Hill-climbing Optimization (Figure: log-likelihood ln L(w) plotted against parameter w, with candidate points A, B, C; the peak B is reached at ŵ_MLE.) 41
42 Local Maxima Problem No optimization algorithm can guarantee that the MLE will always be found. Maybe or maybe not. It is, therefore, important to apply an optimization algorithm in several repetitions, each with a different starting value of parameter w, to see if the same solution is obtained. Further, and importantly, the problem is exacerbated by what is known as the local maxima problem. 42
43 (Figure: log-likelihood ln L(w) vs. parameter w, with a global maximum at B and local maxima at A and C.) 43
44 2. Model Estimation Parameter Estimation Likelihood Function Maximum Likelihood Estimation (MLE) Illustrative Example 44
45 Retention Models and Data Suppose we are interested in testing two models of retention: POW: p = w1 (t + 1)^(-w2) EXP: p = w1 e^(-w2 t) A memory experiment with n = 50 Bernoulli trials and eight time intervals of t = (0.5, 1, 2, 4, 8, 12, 16, 18) yielded the following data (i.e., number of correct responses out of 50 trials at each time interval) of y_obs = (44, 34, 27, 26, 19, 17, 20, 11) Or, equivalently, the observed proportion correct of p_obs = (0.88, 0.68, 0.54, 0.52, 0.38, 0.34, 0.40, 0.22) 45
46 The Data 46
47 Likelihood Functions The first step of MLE is to write down the PDF for each model: POW: f(y | w) = Π_{i=1}^{8} n! / ((n - y_i)! y_i!) * [w1 (t_i + 1)^(-w2)]^{y_i} [1 - w1 (t_i + 1)^(-w2)]^{n - y_i} EXP: f(y | w) = Π_{i=1}^{8} n! / ((n - y_i)! y_i!) * [w1 e^(-w2 t_i)]^{y_i} [1 - w1 e^(-w2 t_i)]^{n - y_i} From this, we obtain the log LF: POW: ln L(w | y_obs) = Σ_{i=1}^{8} [ y_obs,i ln(w1 (t_i + 1)^(-w2)) + (n - y_obs,i) ln(1 - w1 (t_i + 1)^(-w2)) ] EXP: ln L(w | y_obs) = Σ_{i=1}^{8} [ y_obs,i ln(w1 e^(-w2 t_i)) + (n - y_obs,i) ln(1 - w1 e^(-w2 t_i)) ] where the parameter-independent terms are omitted. 47
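The Matlab code on the next slide was not captured in this transcription; the Python sketch below shows the same computation, with a crude grid search standing in for a proper optimizer (our own simplification, not the original code):

```python
import math

t = [0.5, 1, 2, 4, 8, 12, 16, 18]
y = [44, 34, 27, 26, 19, 17, 20, 11]
n = 50

def log_lik(p_fun, w1, w2):
    # Binomial log-likelihood, parameter-independent terms omitted
    total = 0.0
    for ti, yi in zip(t, y):
        p = min(max(p_fun(ti, w1, w2), 1e-10), 1 - 1e-10)
        total += yi * math.log(p) + (n - yi) * math.log(1 - p)
    return total

pow_p = lambda ti, w1, w2: w1 * (ti + 1) ** (-w2)   # POW: p = w1 (t+1)^(-w2)
exp_p = lambda ti, w1, w2: w1 * math.exp(-w2 * ti)  # EXP: p = w1 e^(-w2 t)

def grid_fit(p_fun):
    # Crude grid search over (w1, w2) in (0, 1]^2; a stand-in for MLE
    grid = [(a / 100, b / 100) for a in range(1, 101) for b in range(1, 101)]
    best = max(grid, key=lambda w: log_lik(p_fun, *w))
    return best, log_lik(p_fun, *best)

(w_pow, ll_pow), (w_exp, ll_exp) = grid_fit(pow_p), grid_fit(exp_p)
print("POW", w_pow, round(ll_pow, 2))
print("EXP", w_exp, round(ll_exp, 2))  # POW fits these data better
```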
48 Matlab Code 48
49 MLE Best-fit Curves 49
50 Summary of Model Fits MLE: PVAF = 91.2% (POW), 79.0% (EXP). LSE: PVAF = 91.4% (POW), 79.1% (EXP). (The table also lists the estimates ŵ_MLE and ŵ_LSE for each model.) 50
51 Why the Differences between MLE and LSE? Binomial likelihood data: y ~ Bin(n, p) (e.g., p = w1 e^(-w2 t)), with var(y) = np(1 - p) ***Opt. LSE: minimize Σ_i (obs_i - prd_i(w))^2 51
52 R Code 52
54 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 54
55 3. Model Evaluation and Selection Goodness-of-fit vs. Generalizability Model Selection Methods Illustrative Examples 55
57 Goal of Modeling and Approximation to Truth The ultimate, ideal goal of modeling is to identify the model that actually generated the observed data This is not possible because 1) Never enough observations to pin down the truth exactly 2) The truth may be quite complex, beyond modeler s imagination A more realistic goal is to choose among a set of candidate models the one model that provides the closest approximation to the truth, in some defined sense 57
58 Model Evaluation Criteria Qualitative criteria Falsifiability: Do potential observations exist that would be incompatible with the model? Plausibility: Does the theoretical account of the model make sense of established findings? Interpretability: Are the components of the model understandable and linked to known processes? Quantitative criteria Goodness of fit: Does the model fit the observed data sufficiently well? Complexity/simplicity: Is the model's description of the data achieved in the simplest possible manner? Generalizability: How well does the model predict future observations? 58
59 Goodness-of-fit (GOF) Measures - Root mean squared error (RMSE): RMSE = √[ Σ_{i=1}^{n} (y_i,obs - y_i,prd(ŵ))^2 / n ] - Percent variance accounted for (PVAF), or r^2: PVAF = 100 × [ 1 - Σ_{i=1}^{n} (y_i,obs - y_i,prd(ŵ))^2 / Σ_{i=1}^{n} (y_i,obs - ȳ_obs)^2 ]
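A small Python sketch of both measures (the helper names are ours, and the predicted values below are hypothetical, chosen only to exercise the formulas):

```python
import math

def rmse(obs, prd):
    # Root mean squared error between observed and predicted values
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, prd)) / len(obs))

def pvaf(obs, prd):
    # Percent variance accounted for: 100 * (1 - SSE / SST)
    mean = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, prd))
    sst = sum((o - mean) ** 2 for o in obs)
    return 100 * (1 - sse / sst)

obs = [0.88, 0.68, 0.54, 0.52, 0.38, 0.34, 0.40, 0.22]
prd = [0.83, 0.73, 0.62, 0.50, 0.38, 0.32, 0.28, 0.26]  # hypothetical fits
print(round(rmse(obs, prd), 3), round(pvaf(obs, prd), 1))
```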
60 Noisy Data Behavioral data include random noise from a number of sources, such as measurement error, sampling error, and individual differences Data = Regularity (Cognitive process) + Noise (Idiosyncrasies) 60
61 Problem with GOF as Model Evaluation Criterion Data = Regularity (Cognitive process) + Noise (Idiosyncrasies) GOF = Fit to regularity + Fit to noise Properties of the model that have nothing to do with its ability to capture the underlying regularities can improve GOF 61
62 Over-fitting Problem (Over-fitting) Model 1: Y = a e^(-bx) + c Model 2: Y = a e^(-bx) + c sin(dX + f) + g 62
63 Model Complexity Complexity: Refers to a model s inherent flexibility that enables it to fit a wide range of data patterns Simple Model Complex Model number of model parameters 63
64 Complexity: More than Number of Parameters? Power: p = a(t + 1)^(-b) Exponential: p = a e^(-bt) Hyperbolic: p = 1/(a + bt) Are these all equally complex? Maybe not 64
65 Complexity of Multinomial Processing Tree Models Wu, Myung & Batchelder (2010, J. Math. Psych.) 65
66 Generalizability: The Yardstick of Model Selection Generalizability refers to a model s ability to fit all future data samples from the same underlying process, not just the currently observed data sample, and thus can be viewed as a measure of predictive accuracy or proximity to the underlying regularity. 66
67 Model A: goodness of fit (PVAF) 80%, generalizability (PVAF) 70%. Model B: goodness of fit (PVAF) 99%, generalizability (PVAF) 50%. 67
68 Relationship among Goodness of Fit, Model Complexity and Generalizability (Figure: goodness of fit keeps rising with model complexity, while generalizability peaks and then declines; the region beyond the peak is overfitting.) 68
69 Wanted: A method of model selection that estimates a model s generalizability by taking into account effects of its complexity Selection Criterion: Choose the model, among a set of candidate models, that generalizes best 69
70 3. Model Evaluation and Selection Goodness-of-fit vs. Generalizability Model Selection Methods Illustrative Examples 70
71 Model Selection Methods Occam s Razor Likelihood Function Akaike Information Criterion (AIC) Bayesian Information Criterion (BIC) Minimum Description Length (MDL) Bayes Factor (BF) Cross-validation (CV) Likelihood Ratio Test (LRT) 71
72 Occam's Razor: The Economy of Explanation Entities should not be multiplied beyond necessity. - William of Occam 72
73 Likelihood Function: Revisited Formally speaking, a mathematical model is defined in terms of the likelihood function (LF) that specifies the likelihood of observed data as a function of the model parameter: Likelihood function (LF): f(y | w) (e.g.) Power model: p = w1 (t + 1)^(-w2) (0 ≤ p ≤ 1) Data: y ~ Binomial(n, p(w)) (y = 0, 1, ..., n) LF: f(y | w) = n! / ((n - y)! y!) * p(w)^y (1 - p(w))^(n - y) 73
74 Maximum Likelihood: Revisited In model fitting, we are interested in finding the parameter value that is most likely to have generated the observed data -- the one that maximizes the likelihood function: Maximum likelihood (ML): f(y | ŵ) 74
75 Penalized Likelihood Methods Formalization of Occam s razor Estimate a model s generalizability by penalizing it for excess complexity (i.e., more complexity than is needed to fit the regularity in the data) Puts models on equal footing 75
76 The Basic Idea (Generalizability) = (Goodness of fit) - (Model complexity) 76
77 The Basic Idea ***Opt. (Generalizability) = (Goodness of fit) + (Model complexity) (here both generalizability and goodness of fit enter with their signs flipped relative to the previous slide, i.e., as lack of generalizability and lack of fit, so that smaller values are better) 77
78 Akaike Information Criterion (AIC) Akaike (1973): AIC = -2 ln f(y | ŵ) + 2k, where k = number of parameters Goodness of fit (ML) + Model Complexity The smaller the AIC value, the greater the estimated generalizability of the model The model with the smallest AIC is the best among a set of competing models and thus should be preferred 78
79 Bayesian Information Criterion (BIC) Schwarz (1978): BIC = -2 ln f(y | ŵ) + k ln(n), where n = sample size Goodness of fit (ML) + Model Complexity The model that minimizes BIC should be preferred 79
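Both criteria are one-liners once the maximized log-likelihood is in hand. A sketch (the log-likelihood values below are made up purely for illustration):

```python
import math

def aic(log_lik, k):
    # AIC = -2 ln f(y | w_hat) + 2k
    return -2 * log_lik + 2 * k

def bic(log_lik, k, n):
    # BIC = -2 ln f(y | w_hat) + k ln(n)
    return -2 * log_lik + k * math.log(n)

# Two hypothetical 2-parameter models fit to n = 400 observations
for name, ll in [("Model 1", -240.0), ("Model 2", -255.0)]:
    print(name, aic(ll, 2), round(bic(ll, 2, 400), 2))
# The model with the smaller AIC (or BIC) is preferred
```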
81 Minimum Description Length (MDL) Rissanen (1996): MDL = -ln f(y | ŵ) + (k/2) ln(n / 2π) + ln ∫ √(det I(w)) dw, where I(w) is the Fisher information matrix; the integral term is sensitive to functional form Goodness of fit (ML) + Model Complexity The model that minimizes MDL should be preferred 81
82 Bayes Factor (BF) ***Opt. (Kass & Raftery, 1995) In Bayesian model selection, each model is evaluated based on its marginal likelihood, defined as p(y | M) = ∫ f(y | w, M) π(w | M) dw or equivalently, an average likelihood (i.e., how well the model fits the data on average, across the range of its parameters) The Bayes factor (BF) between two models is defined as the ratio of their marginal likelihoods: BF_ij = p(y | M_i) / p(y | M_j) 82
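For the earlier binomial example (y_obs = 7, n = 10), the marginal likelihood can be approximated by brute-force numerical integration; under a uniform prior on w it equals 1/(n+1). The sketch below compares the free-w model against a hypothetical competitor that fixes w = 0.5 (our choice, for illustration only):

```python
from math import comb

def lik(w, y=7, n=10):
    # Binomial likelihood of y heads in n tosses
    return comb(n, y) * w**y * (1 - w)**(n - y)

# Marginal likelihood of the free-w model under a uniform prior on w,
# via a midpoint Riemann sum (analytically 1/(n+1) = 1/11 here)
m = 10000
marginal = sum(lik((i + 0.5) / m) for i in range(m)) / m

fixed = lik(0.5)  # marginal likelihood of the point model w = 0.5

print(round(marginal, 4), round(fixed, 4), round(marginal / fixed, 3))
```

Note there is no optimization step anywhere: the likelihood is averaged, not maximized, which is the "no overfitting" feature listed two slides below.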
83 Under the assumption of equal model priors, the BF reduces to the posterior odds: ***Opt. BF_ij = p(M_i | y) / p(M_j | y) (from Bayes' rule) Therefore, the model that maximizes the marginal likelihood is the one with the highest probability of being true given the observed data 83
84 Features of Bayes Factor ***Opt. Pros No optimization (i.e., no maximum likelihood) No explicit measure of model complexity No overfitting, by averaging likelihood function across parameters Cons Issue of choosing parameter priors (virtue or vice?) Non-trivial computations requiring numerical integration 84
85 BIC as an approximation of BF ***Opt. A large-sample approximation of the marginal likelihood yields the easily-computable Bayesian Information Criterion (BIC): -2 ln p(y | M) ≈ BIC 85
86 Cross-validation (CV) (Stone, 1974; Geisser, 1975) A sampling-based method of estimating generalizability: split the data into a calibration sample y_Cal and a validation sample y_Val, fit the model to y_Cal, and score its predictions on y_Val: CV = SSE(y_Val | ŵ(y_Cal)) 86
88 Features of CV Pros Easy to use Sensitive to functional form as well as number of parameters Asymptotically equivalent to AIC Cons Sensitive to the partitioning used - Averaging over multiple partitions - Leave-one-out CV (LOOCV), instead of split-half CV Instability of the estimate due to loss of data 88
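A toy leave-one-out CV sketch (entirely our own example, not from the slides): a constant model versus a straight line on made-up data, each scored by its prediction error on held-out points.

```python
# Toy leave-one-out CV: constant model vs. straight line on made-up data
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0]

def fit_line(xs, ys):
    # Closed-form least-squares intercept and slope
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return my - slope * mx, slope

def loocv_sse(predict):
    # Leave each point out, fit on the rest, score on the held-out point
    sse = 0.0
    for i in range(len(x)):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        sse += (y[i] - predict(xs, ys, x[i])) ** 2
    return sse

const_model = lambda xs, ys, x0: sum(ys) / len(ys)
line_model = lambda xs, ys, x0: (lambda a, b: a + b * x0)(*fit_line(xs, ys))

print(round(loocv_sse(const_model), 3), round(loocv_sse(line_model), 3))
```

Because the data here are strongly linear, the line model wins on held-out prediction; an overly flexible model would lose the same contest.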
90 Why not Likelihood Ratio Test (LRT)? ***Opt. The LRT is often used to test the adequacy of a model of interest (M_R) in the context of another, fuller model (M_F): LRT = -2 ln [ f(y | ŵ_R) / f(y | ŵ_F) ] The acceptance/rejection decision on M_R is then made based on a p-value obtained from the chi-square distribution with df = k_F - k_R. Acceptance: Model M_R provides a sufficiently good fit to observed data, and therefore the extra parameters of model M_F are judged to be unnecessary 90
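For concreteness, here is the LRT statistic for the earlier binomial example, with a restricted model fixing w = 0.5 nested inside the free-w model (our own worked example):

```python
import math

def log_lik(w, y=7, n=10):
    # Binomial log-likelihood, constants dropped (they cancel in the ratio)
    return y * math.log(w) + (n - y) * math.log(1 - w)

# Restricted model: w fixed at 0.5; full model: w estimated, w_hat = 0.7
lrt = -2 * (log_lik(0.5) - log_lik(0.7))
print(round(lrt, 3))  # 1.646 < 3.84 (chi-square, df=1): retain restricted model
```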
91 LRT is NOT a model selection method ***Opt. LRT does not assess a model's generalizability, which is the hallmark of model selection. LRT is a null hypothesis significance testing (NHST) method that simply judges the descriptive adequacy of a given model (cf. contemporary criticisms of NHST). LRT requires a pair-wise, nested comparison (i.e., it can compare only two models at a time). LRT was developed in the context of linear models with normal error; sampling distributions of the LRT statistic under non-linear and non-normal error conditions are generally unknown. At best, LRT is a poor substitute for current model selection methods, such as AIC and BIC. 91
92 3. Model Evaluation and Selection Goodness-of-fit vs. Generalizability Model Selection Methods Illustrative Examples 92
93 Example #1: Retention Models (Remember?) MLE results (Figure: proportion correct (pc) vs. time interval (t)) POW (black): p = a(t + 1)^(-b) EXP (red): p = a e^(-bt) POW2 (blue): p = a(t + 1)^(-b) + c 93
94 Model Selection Results (Table comparing POW, EXP, and POW2 on number of parameters, PVAF, AIC, BIC, LOOCV, and log-likelihood.)
95 Example #2 95
96 Matlab Code 96
98 Guidelines for Model Evaluation and Selection 1. Evaluate a model s fit to data (descriptive adequacy, or goodness-of-fit) 2. Must take into account a model s fit to other possible data (complexity/flexibility) 3. Select among models using generalizability measures (AIC, BIC, MDL, Bayes Factor, etc.) 98
99 Summary of Section 3 Models should be evaluated based on generalizability, not on goodness of fit Thou shall not select the best-fitting model but shall select the best-predicting model. Other non-statistical but very important selection criteria Plausibility Interpretability Explanatory adequacy Falsifiability 99
100 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 100
102 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization (ADO) 5. Conclusion 102
103 4. A New Experimental Tool for Model Selection: ADO Introduction Discriminating Models of Retention with ADO Illustrative Example 103
105 Introduction Experiments are fundamental to the advancement of cognitive science 105
106 Introduction Data obtained from experiments are used to fit formal models. Multinomial Processing Trees Structural Equations (SEM) Neural Network 106
107 Introduction Often, there are many competing models to describe the same cognitive/perceptual process. Example Short-term memory literature Over 100 different 2-parameter models have been proposed to describe the rate of forgetting over time (Rubin & Wenzel, 1996) 107
108 Model selection problem It is often very difficult to reject models through experimentation because they can mimic one another so closely. Example: Some models of memory retention (forgetting) Hyperbolic Exponential Power 108
109 Experiments to discriminate between models Flow chart of typical investigation Formulate models Collect data Analyze data Statistical model selection methods (e.g., AIC & BIC) 109 Model selection criteria are limited by the data they have to work with.
110 Experiments to discriminate between models Flow chart of typical investigation Formulate models Collect data Analyze data Typically, collect lots of data to decrease noise 110
111 Experiments to discriminate between models Flow chart of typical investigation Formulate models Collect data Analyze data Collect smarter data to highlight differences between models 111
112 Collecting smarter data The design of an experiment determines the quality of data that are collected. Design variables include: number of observations number of treatment groups stimulus levels 112
113 Collecting smarter data The design of an experiment determines the quality of data that are collected. Design variables include: number of observations number of treatment groups stimulus levels 113
114 Hard-to-recruit participants 114
115 Adaptive design optimization (ADO) Adaptively designed experiments Conduct the full experiment as a sequence of mini-experiments Improve the design of the next mini-experiment using knowledge gained from the previous mini-experiments Myung & Pitt (2009) Psychological Review Cavagnaro, Myung, Pitt & Kujala (2010) Neural Computation. 115
116 Adaptive design optimization (ADO) Sequential design framework (Diagram: a sequence of designs d_1, d_2, d_3, ..., d_s, each experiment yielding observations Y_obs(d_1), Y_obs(d_2), ..., Y_obs(d_s).) Adapt the design of the next experiment based on the results of preceding experiments 116
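A stripped-down sketch of the design-optimization step in this loop (heavily simplified: one hypothetical point-parameter value per model, whereas real ADO integrates over full parameter priors; the design utility here is the mutual information between the model indicator and the outcome):

```python
import math
from math import comb

def binom(y, n, p):
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Point-parameter stand-ins for each retention model (hypothetical values)
models = {
    "POW": lambda t: 0.9 * (t + 1) ** -0.4,
    "EXP": lambda t: 0.71 * math.exp(-0.08 * t),
}

def utility(t, n=30):
    # Mutual information between the model indicator and the outcome y of
    # n Bernoulli trials at lag t, with equal prior model probabilities
    ps = [f(t) for f in models.values()]
    u = 0.0
    for y in range(n + 1):
        mix = 0.5 * sum(binom(y, n, p) for p in ps)
        for p in ps:
            like = binom(y, n, p)
            if like > 0:
                u += 0.5 * like * math.log(like / mix)
    return u

# Pick the most informative retention interval from a candidate set
best_t = max([1, 2, 4, 8, 16, 32, 64, 96], key=utility)
print(best_t)
```

With these particular parameter values, long lags separate the two models far better than short ones, echoing the bad-design/good-design contrast on the later slides.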
117 4. A New Experimental Tool for Model Selection: ADO Introduction Discriminating Models of Retention with ADO Illustrative Example 117
118 Designing a retention experiment: What Time Intervals Should be Employed? Retention: the rate of retrieval failure over time 118
119 Designing a retention Experiment: What Time Intervals Should be Employed? Two models of retention 119
120 Designing a retention experiment: What Time Intervals Should be Employed? Model mimicry between POW and EXP POW EXP 120
121 Designing a retention experiment: What Time Intervals Should be Employed? Model predictions for a narrow range of parameters (100 Bernoulli trials) (Figure: recall rate vs. time (seconds) for POW and EXP, each drawn over narrow ranges of a and b.) 121
122 Designing a retention experiment: What Time Intervals Should be Employed? A good design choice can aid in discrimination. Bad designs! (Figure: recall rate vs. time (seconds) for POW and EXP over a poorly discriminating range of intervals.) 122
123 Designing a retention experiment: What Time Intervals Should be Employed? A good design choice can aid in discrimination. 2-4 seconds: Good designs! (Figure: recall rate vs. time (seconds) for POW and EXP.) 123
124 Designing a retention experiment: What Time Intervals Should be Employed? More realistic situations require a more principled approach to finding optimal designs. (Figure: recall rate vs. time (seconds) for POW with a~Beta(2,1), b~Beta(1,4), and EXP with a~Beta(2,1), b~Beta(1,80).) 124
125 4. A New Experimental Tool for Model Selection: ADO Introduction Discriminating Models of Retention with ADO Illustrative Example 125
126 Simulated experiments Generate data at each stage from some true model (e.g. EXP with a=.71 and b=.08) at the optimal retention interval identified by the ADO algorithm. 126
127 Simulated experiments Generate data at each stage from some true model (e.g. EXP with a=.71 and b=.08) at the optimal retention interval identified by the ADO algorithm. The idea is to see how many stages it takes to accumulate enough data to decisively identify the data-generating model according to some model selection criterion (e.g. Bayes factor) 127
128 Simulation results Data generated from EXP with a=0.71 and b=0.08 128
129 Simulation results Stage 1: Compute optimal design; generate data (30 Bernoulli trials). P_0(POW) = 0.5, P_0(EXP) = 0.5. Outcome: 7 correct responses at 16 seconds. 129
130 Simulation results Stage 1: Update model probabilities and parameter probabilities. P_1(POW) = 0.65, P_1(EXP) = 0.35. (Based on 7 correct responses at 16 seconds.) 130
131 Simulation results Stage 2: Compute optimal design; generate data (30 Bernoulli trials). P_1(POW) = 0.65, P_1(EXP) = 0.35. Outcome: 0 correct responses at 96.4 seconds. 131
132 Simulation results Stage 2: Update model probabilities and parameter probabilities. P_2(POW) = 0.30, P_2(EXP) = 0.70. (Based on 0 correct responses at 96.4 seconds.) 132
133 Simulation results Stage 3: Compute optimal design; generate data (30 Bernoulli trials). P_2(POW) = 0.30, P_2(EXP) = 0.70. Outcome: 0 correct responses at 96.8 seconds. 133
134 Simulation results Stage 3: Update model probabilities and parameter probabilities. P_3(POW) = 0.02, P_3(EXP) = 0.98. (Based on 0 correct responses at 96.8 seconds.) 134
135 Simulation results 135
136 Simulation results Bayes Factor = 20 (very strong evidence for EXP) 136
137 Simulation results Bayes Factor = 20 (very strong evidence for EXP) 137
138 Human experiment results 138
139 ADO for Visual Psychophysics Experiment: Adaptive Estimation of Contrast Sensitivity Function (CSF) Lesmes, Jeon, Lu & Dosher (2006); Lesmes, Lu, Baek & Albright (2010) 139
141 ADO for Neurophysiology Experiment Lewi, Butera & Paninski (2009) 141
142 Much ADO about Nothing - William Shakespeare 142
143 Summary of Section 4 Competing models can be difficult to discriminate because good experimental designs are elusive. ADO finds and exploits differences between models to maximize the likelihood of discrimination. Qualifications ADO is not applicable to all types of models Not all variables can be optimized Computational costs 143
144 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 144
145 Conclusion Model evaluation: A model s good fit to a data set is only a necessary first step in model evaluation, but not a sufficient, final, confirmatory step. Model selection: The best model, among a set of candidate models, is the one that generalizes best and can be identified using model selection methods, such as AIC, BIC, and MDL. Experimental design: Adaptive design optimization (ADO) can further improve model evaluation and selection. 145
146 Further readings on model evaluation, model selection, and adaptive design optimization Special issues Gluck, K., Bello, P., & Busemeyer, J. (2008). Special issue on model comparison. Cognitive Science, 32, Myung, I. J., Forster, M., & Browne, M. W. (2000). Special issue on model selection. Journal of Mathematical Psychology, 44, Wagenmakers, E-J., & Waldorp, L. (2006). Special issue on model selection. Journal of Mathematical Psychology, 50, Articles Shiffrin, R. M., Lee, M. D., Kim, W., & Wagenmakers, E-J. (2008). A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32, Wu, H., Myung, J. I., & Batchelder, W. H. (2010). On the minimum description length complexity of multinomial processing tree models. Journal of Mathematical Psychology, 54, Cavagnaro, D. R., Gonzalez, R., Myung, J. I., & Pitt, M. A. (2013). Optimal decision stimuli for risky choice experiments: An adaptive approach. Management Science, 59(2),
147 All the mathematics you missed but need to know for graduate school (in mathematical psychology and other fields): Undergraduate Calculus, Linear & Matrix Algebra, Real Analysis Probability Theory and Statistics Computer Programming in Matlab, R, and C++ Graduate Mathematical Statistics Bayesian Statistics Numerical Computation and Analysis Machine Learning Methods 147
148 The End 148
150 Supplemental Slides 150
151 Outline 1. Introduction 2. Model Estimation 3. Model Evaluation and Selection 3.5 Evaluating Other Types of Models 4. A New Experimental Tool for Model Selection: Adaptive Design Optimization 5. Conclusion 151
153 Restricted Scope of the Methods Models must generate likelihood functions (distribution of likelihoods across parameters) Not all models can do this (without simplification) Connectionist Simulation-based (CMR, Diffusion) Cognitive architectures Diversity of types of models in cognitive science makes model comparison challenging 153
154 Broader Framework Local Model Analysis Can the model simulate (or fit) the particular data pattern observed in an experimental setting? Evaluate success of model across diverse experiments Difficult to synthesize results to obtain a picture of why the model consistently succeeds in simulating human performance Is it the correct model or an overly flexible model? To interpret the behavior of a model's local performance, the global behavior of the model must also be understood (i.e., descriptive adequacy must be balanced by model complexity) 154
155 Broader Framework Global Model Analysis What other data patterns does the model produce besides the empirical one? Learn what a model is and is not capable of doing in a particular experimental setting (i.e., experimental design) with the goal being to obtain a more comprehensive picture of its behavior 155
156 Parameter Space Partitioning (PSP) Global Model Analysis method Partition a model's parameter space (hidden weight space) into distinct regions corresponding to qualitatively different data patterns the model could generate in an experiment (Pitt, Kim, Navarro & Myung, 2006) PSP interfaces between the continuous behavior of models and the often discrete, qualitative predictions across conditions in an experiment 156
157 Parameter Space Partitioning [Figure: a hypothetical parameter space partitioned into regions for Pattern 1, Pattern 2, and a region that is not a valid pattern (e.g., activations never reached threshold)] How do you find all data patterns that a model can generate? A hard problem: Potentially huge search space Model simulation required for each set of parameter values chosen Classify each simulation result (Does the current pattern match others already found or is it new?) 157
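The search-and-classify loop just described can be sketched in a few lines. Everything below is a hypothetical stand-in (a toy two-parameter model and a naive uniform Monte Carlo search), not the actual PSP algorithm, which uses a more efficient MCMC-based search (Pitt et al., 2006):

```python
import random

random.seed(1)

# Toy stand-in model (hypothetical, for illustration only): predicted
# scores in conditions A, B, C as simple functions of two parameters.
def simulate(theta1, theta2):
    return {"A": theta1, "B": theta2, "C": 0.5 * (theta1 + theta2) + 0.1}

def classify(pred, tol=0.05):
    """Map continuous predictions to a qualitative ordinal pattern,
    treating adjacent differences smaller than tol as ties (e.g. 'B>C=A')."""
    items = sorted(pred.items(), key=lambda kv: -kv[1])
    pattern = items[0][0]
    for (_, v_prev), (name, v) in zip(items, items[1:]):
        pattern += ("=" if v_prev - v < tol else ">") + name
    return pattern

# Crude search by uniform sampling: the fraction of sampled points
# producing each pattern estimates that region's volume.
counts = {}
n_samples = 20000
for _ in range(n_samples):
    pattern = classify(simulate(random.random(), random.random()))
    counts[pattern] = counts.get(pattern, 0) + 1
volumes = {p: c / n_samples for p, c in counts.items()}
```

The keys of `volumes` are the distinct data patterns the toy model can generate, and the values estimate the proportion of the parameter space each pattern occupies.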
158 What Can be Learned from PSP? Empirical data pattern across three conditions: B > C > A [Figure: the parameter spaces of two models, M1 and M2, each partitioned into regions for the 13 possible orderings of A, B, and C (e.g., B>C>A, B>C=A, A=B=C)] Questions PSP can answer: How many of the 13 different data patterns can a model simulate? How much of the space is occupied by the empirical pattern? (central/peripheral) What is the relationship between these other patterns and the empirical pattern (in terms of volume and similarity)? 158
159 Is it good or bad if a model's predictions are central or peripheral? [Figure: the same partitioned parameter spaces of M1 and M2 as on the previous slide] Simulation success takes on additional meaning with knowledge of other model behaviors Model comparison methods do not make decisions for you. They provide you with data that are intended to help you arrive at an informed decision 159
160 How does memory for a spoken word influence perception of the word's phonemes? [Figure: architectures of TRACE (McClelland & Elman, 1986) and Merge (Norris, McQueen, & Cutler, 2001): speech input feeds phoneme nodes (B, G, D, V, J, O), which feed word nodes (job, jog), with excitatory and inhibitory connections; in Merge the phoneme level is split into separate input and decision nodes] 160
161 What are the consequences of splitting the phoneme level in two? [Figure: the same TRACE (McClelland & Elman, 1986) and Merge (Norris, McQueen, & Cutler, 2001) architectures as on the previous slide] 161
162 PSP Analysis Compared model performance in a 12-condition experiment Empirical data pattern (subcategorical mismatch; Norris et al., 2000) How many of the possible ~3 million data patterns can each model generate? 162
163 PSP Analysis Phoneme and word recognition defined as node activation reaching threshold [Figure: recognition of the phoneme /b/: activation of the /b/, /g/, /d/, and /v/ nodes plotted over cycle number (time), with the /b/ node reaching the recognition threshold] 163
164 Partitioning of Parameter Space in terms of Phoneme Activation [Figure: hypothetical example of a parameter space partitioned by which node reaches threshold, with regions labeled /b/, /g/, /v/, and Invalid] 164
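The labeling behind such a partition can be reproduced with a toy model. The dynamics below are a hypothetical leaky accumulator, not TRACE or Merge; the point is only how a parameter setting gets labeled by the first node to reach threshold, or as invalid when no node ever does:

```python
# Hypothetical toy dynamics (for illustration only): each phoneme node
# accumulates activation at a parameter-dependent input rate with decay.
# A parameter setting is labeled by the first node to reach threshold,
# or "invalid" if no node reaches it within the cycle limit.
def classify_recognition(rates, threshold=1.0, decay=0.05, max_cycles=100):
    activation = {node: 0.0 for node in rates}
    for _ in range(max_cycles):
        for node, rate in rates.items():
            activation[node] += rate - decay * activation[node]
            if activation[node] >= threshold:
                return node
    return "invalid"

# A setting where /b/ gets the strongest input is labeled /b/ ...
label_b = classify_recognition({"/b/": 0.30, "/g/": 0.10, "/d/": 0.05, "/v/": 0.05})
# ... while uniformly weak input never crosses threshold (invalid region).
label_none = classify_recognition({"/b/": 0.01, "/g/": 0.01, "/d/": 0.01, "/v/": 0.01})
```

Sweeping such a classifier over a grid or sample of parameter settings yields exactly the kind of labeled regions (/b/, /g/, /v/, invalid) depicted in the hypothetical parameter space.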
165 Results of PSP Analysis: Number of Data Patterns By splitting the phoneme level in two, Merge is able to generate more data patterns (more flexible) 165
166 Volume Analysis (weak threshold only) Each fill pattern represents a different data pattern Only patterns that occupy more than 1% of the volume are shown [Figure: proportion of the volume of the parameter space occupied by data patterns common to and unique to TRACE and Merge, with the human pattern marked] 166
167 Correlation of Volumes of Common Regions [Figure: rank order of volume size in Merge plotted against rank order of volume size in TRACE, from smallest to largest] 167
168 Summary of Section 3.5 Parameter Space Partitioning (PSP) provides a global perspective on model behavior Broader context for interpreting simulation/fit success Complements local model analysis Assess similarity of models Applicable to a wide range of models PSP can be used to evaluate the effectiveness of experiments in distinguishing between models prior to testing (Does only one model generate the predicted pattern?) 168
More information