
Bayesian Networks in Educational Assessment Tutorial
Session V: Refining Bayes Nets with Data
Russell Almond, Bob Mislevy, David Williamson and Duanli Yan
Unpublished work, 2002-2014, ETS

Agenda

  Session 1: Evidence Centered Design         David Williamson
  Session 2: Bayesian Networks                Russell Almond
  Session 3: Bayes Net Tools & Applications   Duanli Yan
  Session 4: ACED: ECD in Action              Russell Almond & Duanli Yan
  Session 5: Refining Bayes Nets with Data    Russell Almond

Outline

- Variables and Parameters
- The Hyper-Dirichlet Model
- The EM Algorithm
- Reduced Parameter Models
- Evaluating Model Fit
- Model Search and Causality

Variables and Parameters

Bayesian statistics does not distinguish between variables and parameters, only between known and unknown quantities. For this session, define:

- Variable: person specific
- Parameter: constant across people

Visualize the result as a two-layer network.

First Layer

[Figure: a simple model with two skills (Skill1, Skill2) and three observables (Task1-Obs, Task2-Obs, Task3-Obs)]

Distributions and Variables

[Figure: the same two-skill, three-observable network. The variables take person-specific values; the distributions provide probabilities for the variables.]

Different People, Same Distributions

[Figure: three copies of the network, one each for Students 1-3. Each student has their own values of Skill1, Skill2, and the three observables, but all students share the same distributions.]


Second Layer

- Distributions have parameters
- Parameters are the same across all people
- Parameters drop down into the first layer to do person-specific computations (e.g., scoring)
- Probability distributions over parameters are called laws

Second Layer (2)

[Figure: the network with its second layer of parameter nodes added above the person-specific variables]

Hyper-Markov Properties

Spiegelhalter and Lauritzen (1990) make two assumptions of convenience:

- Global meta independence: parameters from different distributions are independent (p_1 and p_2 are independent)
- Local meta independence: parameters from the same distribution are independent (λ_{1,2}, λ_{1,-2}, λ_{-1,2}, and λ_{-1,-2} are independent)

Hyper-Dirichlet Law

- Bayes net distributions are conditional multinomial distributions
- The Dirichlet law is the natural conjugate of the multinomial distribution (as the beta is of the binomial)
- It can be thought of as counts of pseudo-observations in each category:

                              Category 1   Category 2   Category 3
  Prior (weight of 6 obs)         3.50         2.33         1.17
  Observed counts (30)           14           10            6
  Posterior (weight of 36)       17.47        12.33         7.20

- Each row of each table is an independent Dirichlet (global and local independence)
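As a minimal sketch of the conjugate arithmetic in the table (the numbers are the slide's; the prior row appears to be rounded, so the last digits of the computed posterior differ slightly from the slide's):

```python
import numpy as np

# Dirichlet-multinomial conjugate update: posterior pseudo-counts are
# prior pseudo-counts plus observed counts.
prior = np.array([3.50, 2.33, 1.17])     # prior pseudo-counts (from the slide)
counts = np.array([14, 10, 6])           # observed counts (30 observations)

posterior = prior + counts               # about [17.5, 12.33, 7.17]
estimate = posterior / posterior.sum()   # posterior mean of the cell probabilities
print(posterior, estimate)
```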

An Example in Pictures

[Figure: three BUGS posterior density plots]

- Prior (b11[1], 50,000 samples): hypothetical experiment, 3 with Skill, 3 without
- Likelihood (likelihood[1], 100,000 samples): actual observation, 7 with Skill, 3 without
- Posterior (p, 50,000 samples): combined information, 10 with Skill, 6 without

Hyper-Dirichlet Law

Advantages:
- Natural conjugate
- Elicited in terms of effective data
- Very flexible
- Netica can do it via the EM algorithm

Disadvantages:
- Many parameters (exponential in the number of parents)
- May be hard to find data for all conditions (e.g., Skill 1 very high, Skill 2 very low)

Fully Observed Case

- If all variables in the Bayes net are observed, learning is easy
- The hyper-Dirichlet law is the natural conjugate of the conditional probability tables
- Add the observed cross-tab to the prior to get the posterior (a sketch follows)
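A minimal sketch of that update, with hypothetical numbers, for a CPT whose rows correspond to configurations of the parents; under global and local independence each row is its own Dirichlet, so the whole update is elementwise addition:

```python
import numpy as np

# Hypothetical CPT for a binary observable with one binary parent (Skill).
prior = np.array([[8.0, 2.0],    # Skill = High: pseudo-counts for (Right, Wrong)
                  [3.0, 7.0]])   # Skill = Low
crosstab = np.array([[45, 5],    # observed cross-tab of Skill by Outcome
                     [12, 38]])

posterior = prior + crosstab                             # row-wise conjugate update
cpt = posterior / posterior.sum(axis=1, keepdims=True)   # posterior-mean CPT
print(cpt)
```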

Netica example: fully observed

Partially Observed Case

The posterior distributions that the Bayes net computes for the unobserved variables are the basis of the E-step in EM (and also of the sampling in MCMC).

Netica example: partially observed


Four Phase Algorithm

For each cycle:
1. Select new proficiency parameters
2. Select new evidence/link model parameters
3. Impute values for proficiency variables
4. Impute values for unobserved evidence/link model variables (e.g., missing observations, context effects)

Phases 3 and 4 can exploit the basic Bayes net operations.

EM Algorithm

- Variables (E-step): impute expected values; usually use the expected counts in tables corresponding to the CPTs of the Bayes net (the sufficient statistics)
- Parameters (M-step): maximize the posterior (or likelihood) given the imputed counts

A toy sketch of the two steps follows.
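To make the two steps concrete, here is a minimal sketch for a toy model: one binary latent skill with prevalence pi and three binary observables that are conditionally independent given the skill. This is an illustration under those assumptions, not the authors' code or Netica's implementation; X is stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))   # stand-in responses (persons x items)

pi = 0.5                                # P(Skill = 1)
p = np.array([[0.4, 0.4, 0.4],          # P(X_j = 1 | Skill = 0)
              [0.6, 0.6, 0.6]])         # P(X_j = 1 | Skill = 1)

for _ in range(100):
    # E-step: posterior probability that each person has the skill,
    # which yields the expected counts (sufficient statistics).
    lik1 = pi * np.prod(p[1]**X * (1 - p[1])**(1 - X), axis=1)
    lik0 = (1 - pi) * np.prod(p[0]**X * (1 - p[0])**(1 - X), axis=1)
    w = lik1 / (lik1 + lik0)            # P(Skill = 1 | person's responses)
    # M-step: maximize the likelihood given the expected counts.
    pi = w.mean()
    p[1] = (w @ X) / w.sum()
    p[0] = ((1 - w) @ X) / (1 - w).sum()

print(pi, p)
```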

MCMC Algorithm

In both the parameter and variable phases, sample from the posterior distribution given all other parameters/variables.

The variable phase can use the Bayes net sampling algorithm:
- Pick a node in the junction tree
- Sample values for its variables using the posterior for that node
- Propagate the sampled values to its neighbors, and sample the remaining variables
- Repeat until all variables are sampled

For hyper-Dirichlet laws, a Gibbs sampler can be used for the parameter phase; reduced parameter models may require the Metropolis algorithm. A toy Gibbs sketch follows.
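A matching Gibbs sketch for the same toy model as the EM example, alternating a parameter phase and a variable phase; Beta priors make each full conditional conjugate (the hyper-Dirichlet case generalizes the Beta draws to Dirichlet draws). This is an illustration, not the authors' sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))    # stand-in response data
a, b = 1.0, 1.0                          # Beta(1, 1) prior pseudo-counts

skill = rng.integers(0, 2, size=200)     # initial imputation
for _ in range(1000):
    # Parameter phase: conjugate draws given the imputed skills.
    pi = rng.beta(a + skill.sum(), b + (skill == 0).sum())
    p = np.empty((2, 3))
    for s in (0, 1):
        rows = X[skill == s]
        p[s] = rng.beta(a + rows.sum(axis=0), b + len(rows) - rows.sum(axis=0))
    # Variable phase: sample each person's skill from its full conditional.
    lik1 = pi * np.prod(p[1]**X * (1 - p[1])**(1 - X), axis=1)
    lik0 = (1 - pi) * np.prod(p[0]**X * (1 - p[0])**(1 - X), axis=1)
    skill = (rng.random(200) < lik1 / (lik1 + lik0)).astype(int)
```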

Identifiability

- Technically not a problem, as the prior identifies the model; but if the prior equals the posterior, we want to know
- State label swapping: exchanging the meaning of the High and Low states of a proficiency variable can appear as swapped rows in the CPTs; usually a more constrained model is needed to get rid of the problem
- In the upcoming DiBello-Samejima model, the location and scale of the latent variables must be fixed, as in IRT, by either:
  - fixing the difficulty/discrimination of certain categories, or
  - using a scale anchor (a set of parameters whose average difficulty/discrimination is constrained)

Reduced Parameter Models

- Noisy-And and Noisy-Or models: NIDA, DINA, and the Fusion model (Junker & Sijtsma)
- DiBello-Samejima models: based on effective thetas and the graded response model; support compensatory, conjunctive, disjunctive, and inhibitor relationships

For both of these model types, the number of parameters grows linearly with the number of parents.

Noisy-And

- All input skills are needed to solve the problem
- Bypass parameter for Skill j: q_j
- Overall slip probability: q_0
- These combine into the probability of a correct outcome (a standard form is given below)
- This is the family underlying the NIDA/DINA cognitive diagnosis models
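A standard noisy-and parameterization consistent with these parameters (a reconstruction under the assumption that s_j ∈ {0, 1} indicates possession of Skill j, not the deck's verbatim formula) is

$$P(X = \text{right} \mid s_1, \ldots, s_J) = (1 - q_0) \prod_{j=1}^{J} q_j^{\,1 - s_j}.$$

With all skills present the success probability is $1 - q_0$; each missing skill must additionally be bypassed, which happens with probability $q_j$.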

Noisy Min (Max)

If skills have more than two levels:
- Use a cut point to make the skill binary (e.g., reading skill must be greater than X), or
- Use a noisy-min model: the probability of success is determined by the weakest skill

Noisy-And/Min is common in educational measurement; Noisy-Or/Max is common in medical diagnosis. The number of parameters is linear in the number of parents/states.

DiBello-Samejima Models

Useful when there are multiple ordered values for both the parent(s) and an observable variable. Single-parent version:

- Map each level of the parent state to an effective theta on the IRT (N(0,1)) scale
- Plug that theta into Samejima's graded response model to get the probability of each outcome
- Uses the standard IRT parameters, difficulty and discrimination

The Effective Theta Method (1): Samejima's Model

[Figure: category response curves for X=1, X=2, X=3 plotted against theta, with a_j = 1, b_j1 = -1, b_j2 = +1]

Samejima's (1969) psychometric model for graded responses is given below in its standard form.
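In standard form (consistent with the figure's a_j = 1, b_j1 = -1, b_j2 = +1; some presentations add a 1.7 scaling constant, but the table on the next slide is approximately consistent with the unscaled logistic):

$$P(X_j \ge x + 1 \mid \theta) = \frac{1}{1 + \exp\{-a_j(\theta - b_{jx})\}}, \qquad x = 1, \ldots, m_j - 1,$$

$$P(X_j = x \mid \theta) = P(X_j \ge x \mid \theta) - P(X_j \ge x + 1 \mid \theta),$$

with $P(X_j \ge 1 \mid \theta) = 1$ and $P(X_j \ge m_j + 1 \mid \theta) = 0$.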

The Effective Theta Method (2): Conditional Probabilities for Three Thetas

[Figure: the same category response curves for X=1, X=2, X=3 against theta]

  theta           X=1 (Poor)   X=2 (Okay)   X=3 (Good)
  Low  = -1.8        .70          .25          .05
  Med  = -0.4        .35          .40          .25
  High =  1.0        .10          .40          .50
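A small script computes the table from the figure's parameters; this is my illustration of the calculation, and the Med row comes out slightly different from the slide, which may reflect rounding in the original:

```python
import numpy as np

# Samejima graded response probabilities for one item with a = 1 and
# category boundaries b = (-1, +1), evaluated at the effective thetas.
def graded_response(theta, a=1.0, b=(-1.0, 1.0)):
    star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b))))  # P(X >= x+1)
    cum = np.concatenate(([1.0], star, [0.0]))  # P(X >= 1) = 1, P(X >= m+1) = 0
    return cum[:-1] - cum[1:]                   # category probabilities

for level, theta in [("Low", -1.8), ("Med", -0.4), ("High", 1.0)]:
    print(level, np.round(graded_response(theta), 2))
```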

Various Structure Functions

For multiple parents, assign each parent j an effective theta at each level k, written $\tilde{\theta}_{j,k}$, and combine these using a structure function. Possible structure functions:

- Compensatory: weighted average
- Conjunctive: min
- Disjunctive: max
- Inhibitor, e.g., gated by level k* of parent 1:

  $$s(\tilde{\theta}_{1,k_1}, \ldots, \tilde{\theta}_{J,k_J}) = \begin{cases} \tilde{\theta}_0 & \text{if } k_1 < k^* \\ \tilde{\theta}_{J,k_J} & \text{if } k_1 \ge k^* \end{cases}$$

  where $\tilde{\theta}_0$ is some low value.

A small sketch of these combination rules follows.
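An illustrative rendering of the combination rules; all names and numbers here are made up for the sketch:

```python
import numpy as np

thetas = np.array([0.5, -1.2])     # effective thetas of two parents
weights = np.array([0.7, 0.3])     # compensatory weights (sum to 1)

compensatory = weights @ thetas    # weighted average
conjunctive = thetas.min()         # governed by the weakest skill
disjunctive = thetas.max()         # governed by the strongest skill

def inhibitor(k1, k_star, theta_rest, theta_low=-2.0):
    # Parent 1 gates the task: below level k* the effective theta is
    # pinned at a low value; otherwise the remaining parents govern.
    return theta_low if k1 < k_star else theta_rest
```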

Q-Matrix and Bayes Nets

- Many tasks have a single observable (item): efficient, and useful for disentangling failures
- The Q-matrix is a matrix view of these Bayes nets: nonzero entries correspond to skill-to-task edges
- Used by many diagnostic testing applications (Rule Space, Tatsuoka; Fusion model; General Diagnostic Model, von Davier; NIDA/DINA)
- Gives an overview of the assessment
- The evidence model (EM) fragment for an observable is identified by selecting the parents of the observable and a parametric form for its distribution

Q-Matrix Example

  EvidenceModel     S1   S2   S3   S4
  EM8Word            1    0    0    0
  EM2ConnectInfo     0    0    1    0
  EM8Word            1    0    0    0
  EM4SpecInfo        0    1    0    0
  EM3ConnectSynth    0    0    1    1
  EM8Word            1    0    0    0
  EM4SpecInfo        0    1    0    0

- A row for each observable: which proficiencies are relevant?
- A column for each proficiency variable: is the proficiency relevant for the observable indicated by the row? 1 = yes, 0 = no.

Augmented Q-Matrix

  EvidenceModel     CPTType        Difficulty   S1   S2   S3   S4
  EM8Word           Compensatory        0        2    0    0    0
  EM2ConnectInfo    Compensatory        0        0    0    2    0
  EM8Word           Compensatory        0        2    0    0    0
  EM4SpecInfo       Compensatory        0        0    2    0    0
  EM3ConnectSynth   Compensatory        0        0    0    3    2
  EM8Word           Compensatory        0        2    0    0    0
  EM4SpecInfo       Compensatory        0        0    2    0    0

- Change the 0-1 coding to 0-3 to indicate the strength of the relationship
- Add a column for the distribution type
- Add a column for difficulty

Eliciting Priors

1. Elicit structure (i.e., what are the parents of each node)
2. Elicit distributional form (e.g., conjunctive, compensatory, inhibitor)
3. Elicit strength of relationship
4. Elicit a measure of certainty (e.g., effective sample size, variance)

Often use linguistic priors for steps 3 and 4 (e.g., map "Hard" and "Easy" onto normals with analyst-selected means and variances).

Targets of Model Criticism Indices (Cowell et al., 1999)

- Parent-child relationship: adequacy of the conditional probability distribution given observed parents (Box, 1980, 1983). Note: parent data are not usually observed.
- Unconditional node distribution: getting marginal distributions for nodes is usually pretty easy.
- Conditional node distribution: leave-one-out prediction; captures the relationships among nodes.
- Two-observable table: tests for local dependence.
- Global monitor: overall adequacy of the model with respect to the observed data.

Common Model Criticism Indices

Compare predictions to subsequent observations. A surprise index is an empirical measure of how unexpected an observation is; these indices have a weather-forecasting pedigree (Murphy & Winkler, 1984). They are typically designed as penalty indices: a penalty is incurred when a low probability of occurrence is assigned to an event that subsequently occurs.

Common indices:
- Logarithmic score
- Weaver's surprise index
- Quadratic (Brier) score
- Good's logarithmic surprise index
- Ranked probability score

Logarithmic Score (Spiegelhalter et al., 1993)

S_log = -log p

- Evaluated for each node, as the log probability of the event that actually occurred; p is the prior probability of the observed state
- Greater than or equal to zero: zero if a probability of 100% had been assigned to the observed outcome, higher if the observed value was less expected

Weaver's Surprise Index (Weaver, 1948)

S.I._i = E[p] / p_i = (p_1^2 + p_2^2 + ... + p_n^2) / p_i

where n is the number of possible outcomes.

- Distinguishes rare events from surprising events: rare means small probability; surprising means small relative probability
- Values indicate surprising observations as they move away from unity. Weaver suggests:
  - a value of 3-5 is not large
  - values of 10 begin to be surprising
  - values above 1,000 are definitely surprising
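A minimal sketch of these two indices for a single node, where p is the predicted distribution over the node's states and i is the observed state (the example numbers are illustrative):

```python
import numpy as np

def log_score(p, i):
    # 0 if the observed state had probability 1; larger when less expected
    return -np.log(p[i])

def weaver_surprise(p, i):
    # expected probability divided by the probability of the observed state
    p = np.asarray(p)
    return np.sum(p**2) / p[i]

p = np.array([0.7, 0.2, 0.1])    # illustrative prediction
print(log_score(p, 2))           # about 2.30
print(weaver_surprise(p, 2))     # 5.4: mildly surprising on Weaver's scale
```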

Williamson Prediction Error Technique

For each person i and each observable j:
- Predict X_ij from X_i,-j (all of person i's other observables)
- Score S_ij using one of the scoring rules previously described

Then:
- Sum over items to gauge person fit
- Sum over people to gauge item fit
- Sum over items and people to gauge model fit

A sketch of the loop follows.
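A sketch of the bookkeeping; `predict` is a hypothetical stand-in for the Bayes net inference call that returns P(X_ij | X_i,-j), not part of any real API named in the deck:

```python
import numpy as np

def fit_scores(X, predict, score):
    # X: persons x observables response matrix
    # predict(x_i, j): predictive distribution for observable j given
    #                  person i's other responses (hypothetical helper)
    # score(p, x): a scoring rule, e.g. log_score above
    S = np.zeros(X.shape, dtype=float)
    n, J = X.shape
    for i in range(n):
        for j in range(J):
            p = predict(X[i], j)
            S[i, j] = score(p, X[i, j])
    return S.sum(axis=1), S.sum(axis=0), S.sum()   # person, item, model fit
```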

Reference Distribution

- The distribution of the scores under the null hypothesis is unknown
- Simulate data from the model and calculate S_ij for each simulee/observable pair
- Take a bootstrap sample from the S_ij (sampling simulees) to get a reference distribution

Posterior Predictive Model Checks

Guttman (1967), Rubin (1984), Sinharay (2004). Method:

- theta: the parameters in the model; y: the data; y_rep: replicated data generated using the same parameters ("shadow data")
- Pick a statistic D(y, theta)
- Compare D(y, theta) and D(y_rep, theta); often look at Pr(D(y_rep, theta) > D(y, theta))
- Sometimes D does not depend on theta, so compare D(y) and D(y_rep)

PPMC in BUGS

First, create shadow data by copying the data line in the model (dxxx stands in for whatever likelihood the model uses):

  Y[i] ~ dxxx(omega)      # observed data
  Yrep[i] ~ dxxx(omega)   # replicated ("shadow") data, same parameters

Next, have BUGS calculate D(y, omega) and D(y_rep, omega), where D stands in for the chosen discrepancy:

  stat <- D(Y, omega)
  statrep <- D(Yrep, omega)
  pstat <- step(statrep - stat)   # 1 when the replicated discrepancy is larger

The posterior mean of pstat is the posterior predictive p-value.

Expected Value vs. Actual

[Figure: scatterplot of number-correct residuals (-6 to 4) against the posterior mean of the expected number-correct score (2 to 12)]

Sinharay and Almond (2007), based on data from Tatsuoka.

Observable Characteristic Plots

- Data from Tatsuoka's mixed number subtraction test; Sinharay, Almond and Yan (2004)
- X-axis groups are equivalence classes of proficiency profiles; group membership is estimated through MCMC (one cycle)
- Horizontal lines indicate success probabilities for people who do/do not have the necessary skills
- A glyph at the center of each line shows whether or not the group is expected to succeed
- Bars give credible intervals for the group success rate

Learning Models

Make modifications to the model to improve model fit. Model search methods (e.g., maximum-score search, MCMC over the space of models) produce a best model or set of models.

Heckerman (1995; reprinted in Jordan, 1998) and Buntine (1996) provide good tutorials. Cowell et al. (1999) also has several chapters on this topic, and Neapolitan (2004) devotes much of that book to it.

Limitations of Learning (1)

Certain models are mathematically identical and cannot be distinguished by a fit score; learning can only distinguish models that differ in their independence conditions.

[Figure: A -> B -> C, A <- B <- C, and A <- B -> C are the same (except for the order of the parameters); A -> B <- C has different independence conditions]

Limitations of Learning (2)

Latent variables add other possible models: a latent variable can be a hidden cause, and such models cannot be distinguished when the latent variables are not observed.

[Figure: four models relating A and C through a latent variable H, labeled No Effect, Intermediate Step, Common Cause, and Contributing Factor. All four models have identical scores.]

Causality and Learning

- Many authors (especially Pearl) use structure learning to learn causality
- Learning can distinguish patterns where arrows point inwards (e.g., A -> B <- C)
- The technical definition of causality is at odds with the lay definition
- Causal claims are always relative to the set of observed variables

Causality Example

- Which variables are included in the model search affects the conclusions
- There are many unmodeled intermediate steps in both pictures
- Be cautious with the use of the word "causal" in a technical sense

[Figure: Model A has Gender and Race pointing directly to Proficiency, which points to Item1, Item2, and Item3; Model B interposes Parent's Education between Gender/Race and Proficiency]