Bayesian Networks in Educational Assessment
Tutorial Session V: Refining Bayes Nets with Data
Russell Almond, Bob Mislevy, David Williamson and Duanli Yan
Unpublished work 2002-2014 ETS 1
Agenda
Session 1: Evidence Centered Design (David Williamson)
Session 2: Bayesian Networks (Russell Almond)
Session 3: Bayes Net Tools & Applications (Duanli Yan)
Session 4: ACED: ECD in Action (Russell Almond & Duanli Yan)
Session 5: Refining Bayes Nets with Data (Russell Almond) 2
Outline
- Variables and Parameters
- The Hyper-Dirichlet Model
- The EM Algorithm
- Reduced Parameter Models
- Evaluating Model Fit
- Model Search and Causality 3
Variables and Parameters
- Bayesian statistics does not distinguish between variables and parameters, just known and unknown quantities
- Define:
  - Variable: person specific
  - Parameter: constant across people
- Visualize as a two-layer network 4
First Layer
[Figure: a simple model with two skills (Skill1, Skill2) and three observables (Task1-Obs, Task2-Obs, Task3-Obs)] 5
Distributions and Variables
[Figure: the same two-skill, three-observable network]
- Variables (values are person specific)
- Distributions provide probabilities for variables 6
Different People, Same Distributions
[Figure: three copies of the network, one each for Student 1, Student 2 and Student 3; each student has person-specific values for Skill1, Skill2 and the three Task observables, but all share the same distributions] 7
Second Layer
- Distributions have Parameters
- Parameters are the same across all people
- Parameters drop down into the first layer to do person-specific computations (e.g., scoring)
- Probability distributions of parameters are called Laws 9
Second Layer (2)
[Figure: the two-layer network, with parameter nodes in the second layer feeding the first-layer distributions] 10
Hyper-Markov Properties
Spiegelhalter and Lauritzen (1990) make two assumptions of convenience:
- Global Meta Independence: parameters from different distributions are independent (e.g., p1 and p2 are independent)
- Local Meta Independence: parameters from the same distribution are independent (e.g., l_{1,2}, l_{1,-2}, l_{-1,2}, and l_{-1,-2} are independent) 11
Hyper-Dirichlet Law
- Bayes net distributions are conditional multinomial distributions
- The Dirichlet law is the natural conjugate of the multinomial distribution (as the beta is of the binomial)
- Can be thought of as counts of pseudo-observations in each category

                          Category 1   Category 2   Category 3
Prior (weight of 6 obs)         3.50         2.33         1.17
Observed counts (30)              14           10            6
Posterior (weight of 36)       17.47        12.33         7.20

- Each row of each table is an independent Dirichlet (Global and Local Independence) 12
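The conjugate update in the table above is just elementwise addition of pseudo-counts; a minimal sketch (function names are illustrative, numbers follow the table):

```python
# Conjugate update for one Dirichlet row: posterior pseudo-counts are
# prior pseudo-counts plus observed counts.
def dirichlet_update(prior, counts):
    return [a + n for a, n in zip(prior, counts)]

def dirichlet_mean(alpha):
    """Posterior-mean probabilities for the row."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [3.50, 2.33, 1.17]   # pseudo-observations per category
counts = [14, 10, 6]         # observed data
posterior = dirichlet_update(prior, counts)
print(posterior)             # values near [17.5, 12.33, 7.17]
print(dirichlet_mean(posterior))
```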
An example in pictures
[Figure: three posterior density plots from a BUGS run]
- Prior: hypothetical experiment, 3 with Skill, 3 without
- Likelihood: actual observation, 7 with Skill, 3 without
- Posterior: combined information, 10 with Skill, 6 without 13
Hyper-Dirichlet Law
Advantages:
- Natural conjugate
- Elicit in terms of effective data
- Very flexible
- Netica can do it via EM algorithm
Disadvantages:
- Many parameters (exponential in number of parents)
- May be hard to find data for all conditions (e.g., Skill 1 very high, Skill 2 very low) 14
Fully Observed Case
- If all variables in the Bayes net are observed, learning is easy
- The hyper-Dirichlet law is the natural conjugate of the conditional probability tables
- Add the observed cross-tab to the prior to get the posterior 15
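A sketch of the fully observed update for one CPT: each row (one parent configuration) is an independent Dirichlet, so the observed cross-tabulation is added to the prior pseudo-counts row by row. The state names and numbers are illustrative.

```python
# Fully observed learning for one CPT under a hyper-Dirichlet prior.
prior = {          # pseudo-counts: parent state -> counts over child states
    "Skill=High": [4.0, 1.0],   # [right, wrong]
    "Skill=Low":  [1.0, 4.0],
}
observed = {       # cross-tab from fully observed data
    "Skill=High": [18, 2],
    "Skill=Low":  [5, 15],
}

# Posterior pseudo-counts: prior plus observed, row by row.
posterior = {row: [a + n for a, n in zip(prior[row], observed[row])]
             for row in prior}

def cpt_row(alpha):
    """Posterior-mean conditional probabilities for one row."""
    total = sum(alpha)
    return [a / total for a in alpha]

for row, alpha in posterior.items():
    print(row, cpt_row(alpha))
```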
Netica example fully observed 16
Partially Observed Case
When some variables are unobserved, use the Bayes net to compute expected counts; these are the basis of the E-Step in EM (and also of sampling in MCMC). 17
Netica example partially observed 18
Four Phase Algorithm
For each cycle:
1. Select new Proficiency Parameters
2. Select new Evidence/Link Model Parameters
3. Impute values for proficiency variables
4. Impute values for unobserved evidence/link model variables (e.g., missing observations, context effects)
Can exploit the basic Bayes net operations for Phases 3 and 4 20
EM Algorithm
- Variables (E-Step): impute expected values; usually use expected counts in tables corresponding to the CPTs of the Bayes net (sufficient statistics)
- Parameters (M-Step): maximize the posterior (likelihood) given the imputed counts 21
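A toy EM sketch for a one-skill, two-item latent class model (illustrative only, not the Netica implementation; the data and starting values are made up):

```python
# EM for a latent binary Skill with two binary observables.  Each item is
# correct with probability p_hi when the skill is present, p_lo otherwise.
import math

data = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]  # toy responses
pi, p_hi, p_lo = 0.5, 0.7, 0.3                           # starting values

for _ in range(50):
    # E-Step: posterior probability of the skill for each person; the
    # weighted counts below are the sufficient statistics.
    post = []
    for x in data:
        lik_hi = pi * math.prod(p_hi if xi else 1 - p_hi for xi in x)
        lik_lo = (1 - pi) * math.prod(p_lo if xi else 1 - p_lo for xi in x)
        post.append(lik_hi / (lik_hi + lik_lo))
    # M-Step: maximize given the expected counts (2 items per person).
    n = len(data)
    pi = sum(post) / n
    p_hi = sum(w * sum(x) for w, x in zip(post, data)) / (2 * sum(post))
    p_lo = sum((1 - w) * sum(x) for w, x in zip(post, data)) / (2 * (n - sum(post)))

print(round(pi, 3), round(p_hi, 3), round(p_lo, 3))
```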
MCMC Algorithm
- For both the parameter and variable phases, sample from the posterior distribution given all other parameters/variables
- Can use the Bayes net sampling algorithm for the variable phase:
  - Pick a node in the junction tree
  - Sample values for variables using the posterior for that node
  - Propagate sampled values to neighbors, and sample remaining variables
  - Repeat until all variables are sampled
- For hyper-Dirichlet laws, can use the Gibbs sampler
- Reduced parameter models may require the Metropolis algorithm 22
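A Gibbs sampler sketch for the same kind of latent class model: alternate between sampling the latent skills given the parameters and sampling the parameters given the skills (conjugate beta draws, the two-state analogue of the hyper-Dirichlet law). Data, priors and iteration counts are illustrative.

```python
# Gibbs sampler for a latent binary Skill with two binary observables.
import random
random.seed(1)

data = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
skills = [random.random() < 0.5 for _ in data]
pi, p_hi, p_lo = 0.5, 0.7, 0.3
draws = []

for it in range(2000):
    # Variable phase: sample each latent skill from its full conditional.
    for i, x in enumerate(data):
        lik_hi = pi * (p_hi if x[0] else 1 - p_hi) * (p_hi if x[1] else 1 - p_hi)
        lik_lo = (1 - pi) * (p_lo if x[0] else 1 - p_lo) * (p_lo if x[1] else 1 - p_lo)
        skills[i] = random.random() < lik_hi / (lik_hi + lik_lo)
    # Parameter phase: conjugate beta draws (Beta(1,1) priors).
    n_hi = sum(skills)
    r_hi = sum(sum(x) for x, s in zip(data, skills) if s)
    r_lo = sum(sum(x) for x, s in zip(data, skills) if not s)
    pi = random.betavariate(1 + n_hi, 1 + len(data) - n_hi)
    p_hi = random.betavariate(1 + r_hi, 1 + 2 * n_hi - r_hi)
    p_lo = random.betavariate(1 + r_lo, 1 + 2 * (len(data) - n_hi) - r_lo)
    if it >= 500:                 # discard burn-in
        draws.append(pi)

print(sum(draws) / len(draws))    # posterior mean of pi
```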
Identifiability
- Technically not a problem, as the prior identifies the model; but if prior = posterior, we want to know
- State label swapping: exchange the meaning of the High and Low states of a proficiency variable; can appear as swapped rows in CPTs; usually need a more constrained model to get rid of the problem
- In the upcoming DiBello-Samejima model, location and scale of latent variables, as in IRT:
  - Fix the difficulty/discrimination of certain categories, or
  - Use a scale anchor (a set of parameters whose average difficulty/discrimination is constrained) 23
Reduced Parameter Models
- Noisy-And and Noisy-Or models: NIDA, DINA and the Fusion model (Junker & Sijtsma)
- DiBello-Samejima models: based on effective theta and the graded response model; compensatory, conjunctive, disjunctive and inhibitor relationships
- For both of these model types, the number of parameters grows linearly with the number of parents 24
Noisy-And
- All input skills needed to solve the problem
- Bypass parameter for Skill j: q_j
- Slip probability (overall): q_0
- Probability of a correct outcome: P(correct | skills) = (1 - q_0) * product of q_j over the skills j that are absent
- NIDA/DINA cognitive diagnosis models 25
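The probability above can be sketched directly; this is one common parameterization of the noisy-and, with illustrative parameter values:

```python
# Noisy-And: success requires every input skill, but a missing skill j can
# be "bypassed" with probability q[j], and even a fully skilled examinee
# slips with probability q0.
def noisy_and(skills, q, q0):
    """skills: 0/1 indicators; q: bypass probability per skill; q0: slip."""
    p = 1.0 - q0
    for has_skill, bypass in zip(skills, q):
        if not has_skill:
            p *= bypass
    return p

q = [0.2, 0.3]   # bypass probabilities for two skills
print(noisy_and([1, 1], q, q0=0.1))   # all skills present: 1 - q0 = 0.9
print(noisy_and([1, 0], q, q0=0.1))   # missing skill 2: 0.9 * 0.3
print(noisy_and([0, 0], q, q0=0.1))   # missing both: 0.9 * 0.2 * 0.3
```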
Noisy Min (Max)
- If skills have more than two levels:
  - Use a cut point to make the skill binary (e.g., reading skill must be greater than X), or
  - Use a Noisy-Min model: the probability of success is determined by the weakest skill
- Noisy-And/Min are common in educational measurement; Noisy-Or/Max are common in medical diagnosis
- The number of parameters is linear in the number of parents/states 26
DiBello-Samejima Models
- Useful when there are multiple ordered values for both the parent(s) and an observable variable
- Single parent version:
  - Map each level of the parent state to an effective theta on an IRT (N(0,1)) scale
  - Plug into Samejima's graded response model to get the probability of each outcome
- Uses standard IRT parameters, difficulty and discrimination 27
The Effective θ Method (1): Samejima's Model
[Figure: category response curves for X=1, X=2, X=3 against theta, with a_j = 1, b_j1 = -1, b_j2 = +1]
Samejima's (1969) psychometric model for graded responses 28
The Effective θ Method (2): Conditional Probabilities for Three θs
[Figure: the same category response curves, with the three effective thetas marked]

θ             X=1 (Poor)   X=2 (Okay)   X=3 (Good)
Low  = -1.8          .70          .25          .05
Med  = -0.4          .35          .40          .25
High =  1.0          .10          .40          .50
29
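A sketch of the effective-theta computation using the logistic form of the graded response model; omitting the 1.7 scaling constant is an assumption that appears roughly consistent with the rounded table values:

```python
# Graded response model: convert an effective theta into conditional
# probabilities for an ordered observable.
import math

def grm_probs(theta, a, bs):
    """P(X = k) for k = 1..len(bs)+1, discrimination a, thresholds bs."""
    # Cumulative P(X >= k) for each threshold, bracketed by 1 and 0.
    cum = [1.0] + [1 / (1 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

a, bs = 1.0, [-1.0, 1.0]
for label, theta in [("Low", -1.8), ("Med", -0.4), ("High", 1.0)]:
    print(label, [round(p, 2) for p in grm_probs(theta, a, bs)])
```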
Various Structure Functions
- For multiple parents, assign each parent j an effective theta θ~_{j,kj} at each level k_j
- Combine using a structure function s(θ~_{1,k1}, ..., θ~_{J,kJ})
- Possible structure functions:
  - Compensatory = weighted average
  - Conjunctive = min
  - Disjunctive = max
  - Inhibitor: e.g., if Parent 1 is below a threshold level k*, the effective theta is some low value θ~_0; otherwise it is determined by the remaining parents 30
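The structure functions above can be sketched as follows; the weights, threshold and low value are illustrative:

```python
# Structure functions: combine per-parent effective thetas into a single
# effective theta for the observable.
def compensatory(thetas, weights):
    return sum(w * t for w, t in zip(weights, thetas)) / sum(weights)

def conjunctive(thetas):
    return min(thetas)   # the weakest skill drives performance

def disjunctive(thetas):
    return max(thetas)   # the strongest skill drives performance

def inhibitor(parent1_level, other_theta, k_star=1, theta0=-2.0):
    # Below the threshold level k*, performance is pinned at a low value.
    return theta0 if parent1_level < k_star else other_theta

thetas = [-0.4, 1.0]
print(compensatory(thetas, [1, 1]))  # average of the two thetas
print(conjunctive(thetas))           # -0.4
print(disjunctive(thetas))           # 1.0
```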
Q-Matrix and Bayes Nets
- Many tasks have a single observable (item); efficient, and useful for disentangling failures
- A Q-matrix is a matrix view of these Bayes nets: nonzero entries correspond to skill-to-task edges
- Used by many diagnostic testing applications (Rule Space, Tatsuoka; Fusion model; General Diagnostic Model, von Davier; NIDA/DINA)
- Gives an overview of the assessment
- The EM fragment for an observable is identified by selecting the parents of the observable and a parametric form for the distribution 31
Q-Matrix Example

Evidence Model    S1  S2  S3  S4
EM8Word            1   0   0   0
EM2ConnectInfo     0   0   1   0
EM8Word            1   0   0   0
EM4SpecInfo        0   1   0   0
EM3ConnectSynth    0   0   1   1
EM8Word            1   0   0   0
EM4SpecInfo        0   1   0   0

- Row for each observable: which proficiencies are relevant?
- Column for each proficiency variable: is the proficiency relevant for the observable indicated by the row? 1 = yes, 0 = no 32
Augmented Q-Matrix

Evidence Model    CPT Type      Difficulty  S1  S2  S3  S4
EM8Word           Compensatory           0   2   0   0   0
EM2ConnectInfo    Compensatory           0   0   0   2   0
EM8Word           Compensatory           0   2   0   0   0
EM4SpecInfo       Compensatory           0   0   2   0   0
EM3ConnectSynth   Compensatory           0   0   0   3   2
EM8Word           Compensatory           0   2   0   0   0
EM4SpecInfo       Compensatory           0   0   2   0   0

- Change the 0-1 coding to 0-3 to indicate strength of relationship
- Add a column for distribution type
- Add a column for difficulty 33
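An augmented Q-matrix row identifies the parents of one observable plus the parametric form of its distribution; a sketch of that mapping (the table layout as tuples is illustrative, entries follow the slide):

```python
# Augmented Q-matrix as a list of (observable, CPT type, difficulty,
# strengths) rows; nonzero strengths mark skill-to-task edges.
skills = ["S1", "S2", "S3", "S4"]
q_matrix = [
    ("EM8Word",         "Compensatory", 0, [2, 0, 0, 0]),
    ("EM2ConnectInfo",  "Compensatory", 0, [0, 0, 2, 0]),
    ("EM4SpecInfo",     "Compensatory", 0, [0, 2, 0, 0]),
    ("EM3ConnectSynth", "Compensatory", 0, [0, 0, 3, 2]),
]

def parents(row):
    """Skills with nonzero entries are the parents of the observable."""
    _, _, _, strengths = row
    return [s for s, w in zip(skills, strengths) if w > 0]

for row in q_matrix:
    print(row[0], row[1], parents(row))
```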
Eliciting Priors
1. Elicit structure (i.e., what are the parents of each node)
2. Elicit distributional form (e.g., conjunctive, compensatory, inhibitor)
3. Elicit strength of relationship
4. Elicit measure of certainty (e.g., effective sample size, variance)
Often use linguistic priors for 3 and 4 (e.g., map "Hard" and "Easy" onto normals with analyst-selected mean and variance) 34
Targets of Model Criticism Indices (Cowell et al., 1999)
- Parent-child relationship: adequacy of the conditional probability distribution given observed parents (Box, 1980; 1983). Note: parent data are not usually observed
- Unconditional node distribution: getting marginal distributions for nodes is usually pretty easy
- Conditional node distribution: leave-one-out prediction; captures relationships among nodes
- Two-observable table: tests for local dependence
- Global monitor: overall adequacy of the model with respect to observed data 35
Common Model Criticism Indices
- Compare predictions to subsequent observations
- A surprise index is an empirical measure of how unexpected an observation is; weather forecasting pedigree (Murphy & Winkler, 1984)
- Typically designed as penalty indices: a penalty is incurred when a low probability of occurrence is assigned to an event which subsequently occurs
- Common indices:
  - Logarithmic score
  - Weaver's surprise index
  - Quadratic (Brier) score
  - Good's logarithmic surprise index
  - Ranked probability score 36
Logarithmic Score (Spiegelhalter et al., 1993)
S_log = -log p
- Evaluated for each node, as the log probability of the event that actually occurred; p is the prior probability of the observed state
- Greater than or equal to zero; zero if a probability of 100% had been assigned to the observed outcome, higher if the observed value was less expected 37
Weaver's Surprise Index (Weaver, 1948)
S.I._i = E[p] / p_i = (p_1^2 + p_2^2 + ... + p_n^2) / p_i
where n is the number of possible outcomes
- Distinction between rare and surprising events: rare = small probability; surprising = small relative probability
- Values indicate surprising observations as they move away from unity; Weaver suggests: a value of 3-5 is not large, values of 10 begin to be surprising, values above 1,000 are definitely surprising 38
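Both scoring rules above are one-liners; a sketch applying them to one node's predictive distribution (the probabilities are made up):

```python
# Logarithmic score and Weaver's surprise index for one observed state.
import math

def log_score(p_observed):
    """-log p of the state that actually occurred; 0 when p = 1."""
    return -math.log(p_observed)

def weaver_surprise(probs, observed_index):
    """E[p] / p_i = (sum of squared probabilities) / p of observed state."""
    return sum(p * p for p in probs) / probs[observed_index]

probs = [0.70, 0.25, 0.05]
print(log_score(probs[0]))          # expected outcome: small penalty
print(weaver_surprise(probs, 0))    # below 1: not surprising
print(weaver_surprise(probs, 2))    # rare state observed: surprising
```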
Williamson Prediction Error Technique
For each person i, for each observable j:
- Predict X_ij from X_i,-j (all of person i's other observables)
- Score S_ij using one of the scoring rules previously described
Then:
- Sum over items to gauge person fit
- Sum over people to gauge item fit
- Sum over items & people to gauge model fit 39
Reference Distribution
- The distribution of scores under the null hypothesis is unknown
- Simulate data from the model and calculate S_ij for each simulee/observable pair
- Take a bootstrap sample from the S_ij to get a reference distribution (sample simulees) 40
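The bootstrap step can be sketched as resampling simulees (rows of the simulated score matrix) and recomputing the fit statistic; the score matrix here is random stand-in data, not model output:

```python
# Bootstrap reference distribution for a total fit statistic.
import random
random.seed(0)

n_simulees, n_items = 200, 10
scores = [[random.expovariate(1.0) for _ in range(n_items)]
          for _ in range(n_simulees)]          # stand-in S_ij values

def total_score(matrix):
    return sum(sum(row) for row in matrix)

boot = []
for _ in range(1000):
    # Resample whole simulees (rows) with replacement.
    resampled = [random.choice(scores) for _ in range(n_simulees)]
    boot.append(total_score(resampled))

boot.sort()
lo, hi = boot[25], boot[-26]   # middle 95% as the reference band
print(lo, hi)
```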
Posterior Predictive Model Checks (Guttman, 1967; Rubin, 1984; Sinharay, 2004)
Method:
- θ: parameters in the model; y: data; y_rep: replicated data using the same parameters ("shadow data")
- Pick a statistic D(y, θ)
- Compare D(y, θ) and D(y_rep, θ); often look at Pr(D(y_rep, θ) > D(y, θ))
- Sometimes D does not depend on θ: D(y) and D(y_rep) 41
PPMC in BUGS
First, create shadow data by copying the data line in the model:
  Y[i] ~ dxxx(omega)
  Yrep[i] ~ dxxx(omega)
Next, have BUGS calculate D(y, θ) and D(y_rep, θ):
  stat <- D(Y, omega)
  statrep <- D(Yrep, omega)
  pstat <- (stat < statrep)
The mean of pstat is the PP p-value 42
Expected Value vs Actual
[Figure: number-correct residuals (about -6 to 4) plotted against the posterior mean of the expected number-correct score (about 2 to 12)]
Sinharay and Almond (2007), based on data from Tatsuoka 43
Observable Characteristic Plots
- Data from Tatsuoka's mixed number subtraction test; Sinharay, Almond and Yan (2004)
- X-axis groups are equivalence classes of proficiency profiles; group membership estimated through MCMC (one cycle)
- Horizontal lines indicate success probabilities for people who do/do not have the necessary skills
- A glyph at the center of the line shows whether or not the group is expected to succeed
- Bars give credible intervals for the group success rate 44
Learning Models
- Make modifications to the model to improve model fit
- Model search: maximum score; model search MCMC: best set of models
- Heckerman (1995; reprinted in Jordan, 1998) and Buntine (1996) provide good tutorials; Cowell et al. (1999) also has several chapters on this topic; Neapolitan (2004) devotes much of the book to this topic 45
Limitations of Learning (1)
- Certain models are mathematically identical and can't be distinguished by a fit score; we can only distinguish models which differ in their independence conditions
[Figure: A -> B -> C, A <- B <- C, and A <- B -> C are the same (except for the order of parameters); A -> B <- C has different independence conditions] 46
Limitations of Learning (2)
- Latent variables add other possible models; latent variables can be hidden causes
- Cannot distinguish models when the latent variables are not observed
[Figure: four models relating A and C through a latent variable H: No Effect, Intermediate Step, Common Cause, and Contributing Factor; all four models have identical scores] 47
Causality and Learning
- Many authors (especially Pearl) use learning to learn causality
- Can distinguish patterns where arrows point inwards (A -> B <- C)
- The technical definition of causality is at odds with the lay definition; it is always relative to the observed variables 48
Causality Example
- Which variables are included in the model search affects the conclusions
- Many unmodeled intermediate steps in both pictures
- Be cautious with the use of the word "causal" in a technical sense
[Figure: two models of Proficiency with Item1, Item2 and Item3 as children; Model A includes Gender and Race directly; Model B adds Parent's Education] 49