Knowledge Tracing Machines: Families of models for predicting student performance


1 Knowledge Tracing Machines: Families of models for predicting student performance

Jill-Jênn Vie, RIKEN Center for Advanced Intelligence Project, Tokyo
Optimizing Human Learning, June 12, 2018
Polytechnique Montréal, June 18, 2018

2 Predicting student performance

Data
- A population of students answering questions
- Events: "Student i answered question j correctly/incorrectly"

Goal
- Learn the difficulty of questions automatically from data
- Measure the knowledge of students
- Potentially optimize their learning

Assumption
- Good model for prediction => good adaptive policy for teaching

3 Learning outcomes of this tutorial

- Logistic regression is amazing: unidimensional, takes IRT and PFA as special cases
- Factorization machines are even more amazing: multidimensional, take MIRT as a special case
- It makes sense to consider deep neural networks: what does deep knowledge tracing model, exactly?

4 Families of models

- Factorization Machines (Rendle 2012), which subsume Multidimensional Item Response Theory
- Logistic Regression, which subsumes Item Response Theory and Performance Factor Analysis
- Recurrent Neural Networks, e.g. Deep Knowledge Tracing (Piech et al. 2015)

Steffen Rendle (2012). "Factorization Machines with libFM". In: ACM Transactions on Intelligent Systems and Technology (TIST) 3.3, 57:1-57:22.
Chris Piech et al. (2015). "Deep knowledge tracing". In: Advances in Neural Information Processing Systems (NIPS).

5 Problems

- Weak generalization (filling the blanks): some students did not attempt all questions
- Strong generalization (cold-start): some new students are not in the train set

6 Dummy dataset

User 1 answered Item 1 correctly
User 1 answered Item 2 incorrectly
User 2 answered Item 1 incorrectly
User 2 answered Item 1 correctly
User 2 answered Item 2: ???

As a table (dummy.csv, columns user, item, correct):

user item correct
1    1    1
1    2    0
2    1    0
2    1    1
2    2    ?
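For concreteness, here is a minimal sketch (assuming pandas, with 1 = correct and 0 = incorrect as in the table above) that materializes this dummy dataset:

    import pandas as pd

    # The five interactions from the slide, in chronological order;
    # the last outcome (None) is the one we want to predict.
    df = pd.DataFrame({
        'user':    [1, 1, 2, 2, 2],
        'item':    [1, 2, 1, 1, 2],
        'correct': [1, 0, 0, 1, None],
    })
    df.to_csv('dummy.csv', index=False)
    print(df)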

7 Task 1: Item Response Theory

Learn an ability θ_i for each user i and an easiness e_j for each item j such that:

Pr(User i Item j OK) = σ(θ_i + e_j)
logit Pr(User i Item j OK) = θ_i + e_j

Logistic regression: learn w such that logit Pr(x) = ⟨w, x⟩,
usually with L2 regularization: a ‖w‖² penalty, i.e. a Gaussian prior on w.

8 Graphically: IRT as logistic regression

Encoding of "User i answered Item j": the sparse vector x has a 1 in the user column U_i and a 1 in the item column I_j, and the weight vector w stores θ_i over the Users block and e_j over the Items block, so that

logit Pr(User i Item j OK) = ⟨w, x⟩ = θ_i + e_j

9 Encoding

python encode.py --users --items

This builds a sparse matrix with one-hot user columns (U0, U1, U2) and item columns (I0, I1, I2), saved as data/dummy/x-ui.npz. Then logistic regression can be run on the sparse features:

python lr.py data/dummy/x-ui.npz
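The real encode.py and lr.py are in the author's repo (github.com/jilljenn/ktm); as a rough idea of what this pipeline does, here is a minimal scikit-learn sketch (variable names and the default regularization are illustrative, not the repo's):

    import numpy as np
    from scipy.sparse import hstack, save_npz
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder

    # (user, item, correct) triplets from the dummy dataset.
    triplets = np.array([[1, 1, 1], [1, 2, 0], [2, 1, 0], [2, 1, 1]])
    users, items, y = triplets[:, :1], triplets[:, 1:2], triplets[:, 2]

    # One-hot encode users and items side by side: x = (U_i, I_j).
    X = hstack([OneHotEncoder().fit_transform(users),
                OneHotEncoder().fit_transform(items)])
    save_npz('x-ui.npz', X.tocsr())

    # L2-regularized logistic regression: w holds the thetas and the e's.
    clf = LogisticRegression().fit(X, y)
    print(clf.coef_, clf.predict_proba(X)[:, 1])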

10 Oh, there's a problem

python encode.py --users --items
python lr.py data/dummy/x-ui.npz

User 1 Item 1 OK
User 1 Item 2 NOK
User 2 Item 1 NOK
User 2 Item 1 OK
User 2 Item 2 NOK

The two attempts of User 2 on Item 1 get the same encoding, hence the same y_pred: we predict the same thing when there are several attempts.

11 Count successes and failures

Keep track of what the student has done before: each row of data/dummy/data.csv now carries the columns user, item, skill, correct, wins, fails.

12 Task 2: Performance Factor Analysis

W_ik: how many successes user i has had on skill k (F_ik: how many failures).
Learn β_k, γ_k, δ_k for each skill k such that:

logit Pr(User i Item j OK) = Σ_{skill k of Item j} (β_k + W_ik γ_k + F_ik δ_k)

python encode.py --skills --wins --fails

This produces three blocks of skill columns (S0, S1, S2 each): skill indicators, win counts and fail counts, saved as data/dummy/x-swf.npz.
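These counters can be computed with a running count per (user, skill) pair; a minimal pandas sketch (assuming one skill per item and chronologically sorted rows, with the column names above):

    import pandas as pd

    # Chronologically ordered interactions (one skill per item here).
    df = pd.DataFrame({
        'user':    [1, 1, 2, 2],
        'skill':   [1, 2, 1, 1],
        'correct': [1, 0, 0, 1],
    })

    # Prior successes/failures of this user on this skill, excluding
    # the current row so its own outcome does not leak into the features.
    grouped = df.groupby(['user', 'skill'])['correct']
    df['wins'] = grouped.cumsum() - df['correct']
    df['fails'] = grouped.cumcount() - df['wins']
    print(df)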

13 Better!

python encode.py --skills --wins --fails
python lr.py data/dummy/x-swf.npz

User 1 Item 1 OK
User 1 Item 2 NOK
User 2 Item 1 NOK
User 2 Item 1 OK
User 2 Item 2 NOK

With skills, wins and fails in the features, the two attempts of User 2 on Item 1 now get different predictions.

14 Task 3: a new model (but still logistic regression)

python encode.py --items --skills --wins --fails
python lr.py data/dummy/x-iswf.npz

15 Here comes a new challenger

How to model side information in, say, recommender systems?

- Logistic Regression: learn a bias for each feature (each user, item, etc.)
- Factorization Machines: learn a bias and an embedding for each feature

16 What can be done with embeddings?

17 Interpreting the components

18 Interpreting the components

19 Graphically: logistic regression

[Figure: the sparse vector x one-hot encodes the Users and Items blocks; the weight vector w holds one bias per feature, θ_i at column U_i and e_j at column I_j.]

20 Graphically: factorization machines

[Figure: on top of the bias vector w (θ_i at U_i, e_j at I_j, β_k at S_k), each feature now also gets an embedding, stored as a row of V: u_i for user U_i, v_j for item I_j, s_k for skill S_k, across the Users, Items and Skills blocks.]

21 Formally: factorization machines

Learn a bias w_k and an embedding v_k for each feature k such that:

logit p(x) = μ + Σ_{k=1}^N w_k x_k + Σ_{1 ≤ k < l ≤ N} x_k x_l ⟨v_k, v_l⟩

where the first two terms are the logistic regression part and the last term captures pairwise interactions.

Particular cases:
- Multidimensional item response theory: logit p = ⟨u_i, v_j⟩ + e_j
- SPARFA: v_j ≥ 0 and v_j sparse
- GenMA: v_j is constrained by the zeroes of a q-matrix (q_ij)

Andrew S. Lan, Andrew E. Waters, Christoph Studer, and Richard G. Baraniuk (2014). "Sparse factor analysis for learning and content analytics". In: The Journal of Machine Learning Research 15.1.
Jill-Jênn Vie, Fabrice Popineau, Yolaine Bourda, and Éric Bruillard (2016). "Adaptive Testing Using a General Diagnostic Model". In: European Conference on Technology Enhanced Learning. Springer.
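The pairwise term looks quadratic in the number of features, but Rendle's identity, Σ_{k<l} x_k x_l ⟨v_k, v_l⟩ = ½ (‖Vᵀx‖² - Σ_k x_k² ‖v_k‖²), makes it linear; a minimal numpy sketch (all names are illustrative):

    import numpy as np

    def fm_logit(x, mu, w, V):
        """logit p(x) = mu + <w, x> + sum_{k<l} x_k x_l <v_k, v_l>.

        The pairwise sum uses Rendle's identity,
          sum_{k<l} x_k x_l <v_k, v_l>
            = 0.5 * (||V.T @ x||^2 - sum_k x_k^2 ||v_k||^2),
        which costs O(N d) instead of O(N^2 d).
        """
        vx = V.T @ x
        pairwise = 0.5 * (vx @ vx - (x ** 2) @ (V ** 2).sum(axis=1))
        return mu + w @ x + pairwise

    rng = np.random.default_rng(0)
    N, d = 6, 3                                  # features, embedding size
    x = rng.integers(0, 2, N).astype(float)      # sparse 0/1 feature vector
    mu, w, V = 0.1, rng.normal(size=N), rng.normal(size=(N, d))

    # Sanity check against the naive O(N^2 d) double sum.
    naive = mu + w @ x + sum(x[k] * x[l] * (V[k] @ V[l])
                             for k in range(N) for l in range(k + 1, N))
    assert np.isclose(fm_logit(x, mu, w, V), naive)
    print(fm_logit(x, mu, w, V))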

22 Tradeoff expressiveness/interpretability

Model comparison after 4, 7 and 10 questions asked:

Model | logit p          | 4 q | 7 q | 10 q
Rasch | θ_i + e_j        | 79% | 79% | 79%
DINA  | 1 - s_j or g_j   | 80% | 82% | 82%
MIRT  | ⟨u_i, v_j⟩ + e_j | 83% | 86% | 86%
GenMA | ⟨u_i, q_j⟩ + e_j | 79% | 85% | 88%

[Figure: comparison of adaptive testing models on the Fraction dataset (Rasch, MIRT K = 2, GenMA K = 8, DINA) and on the TIMSS dataset (Rasch, MIRT K = 2, GenMA K = 13, DINA): log loss as a function of the number of questions asked.]

23 Assistments 2009 dataset

Attempts of 4,163 students over items tagged with 124 skills.
Download the dataset, put it in data/assistments09, then run:

python fm.py data/assistments09/x-ui.npz

etc., or make big.

[Table: AUC and training time for LR and FM (d = 20) under three feature sets: users + items (IRT), skills + wins + fails (PFA), and items + skills + wins + fails. LR trains in seconds (2 s on users + items, 9 s on skills + wins + fails); FM takes minutes.]

24 Deep Factorization Machines

Learn layers W^(l) and b^(l) such that:

a^(0)(x) = (v_user, v_item, v_skill, ...)
a^(l+1)(x) = ReLU(W^(l) a^(l)(x) + b^(l)), l = 0, ..., L-1
y_DNN(x) = ReLU(W^(L) a^(L)(x) + b^(L))
logit p(x) = y_FM(x) + y_DNN(x)

Jill-Jênn Vie (2018). "Deep Factorization Machines for Knowledge Tracing". In: The 13th Workshop on Innovative Use of NLP for Building Educational Applications.
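A minimal numpy sketch of this forward pass (assuming a single hidden layer of width 32 and a precomputed FM output y_FM; all names and sizes here are illustrative):

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def deepfm_logit(embeddings, Ws, bs, y_fm):
        """logit p(x) = y_FM(x) + y_DNN(x); the DNN runs on the
        concatenated embeddings a_0(x) = (v_user, v_item, v_skill)."""
        a = np.concatenate(embeddings)        # a_0(x)
        for W, b in zip(Ws, bs):              # a_{l+1} = ReLU(W_l a_l + b_l)
            a = relu(W @ a + b)
        return y_fm + a.item()                # last layer outputs a scalar

    rng = np.random.default_rng(1)
    d = 20                                    # embedding size, as in the talk
    v_user, v_item, v_skill = rng.normal(size=(3, d))
    Ws = [0.1 * rng.normal(size=(32, 3 * d)), 0.1 * rng.normal(size=(1, 32))]
    bs = [np.zeros(32), np.zeros(1)]
    print(deepfm_logit([v_user, v_item, v_skill], Ws, bs, y_fm=0.3))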

25 Comparison

- FM: y_FM, a factorization machine with λ = 0.01
- Deep: y_DNN, a multilayer perceptron
- DeepFM: y_DNN + y_FM with a shared embedding
- Bayesian FM: w_k, v_kf ~ N(μ_f, 1/λ_f) with μ_f ~ N(0, 1), λ_f ~ Γ(1, 1), trained using Gibbs sampling

Various types of side information:
- first: <discrete> (user, token, countries, etc.)
- last: <discrete> + <continuous> (time + days)
- pfa: <discrete> + wins + fails

26 Duolingo dataset

27 Results

[Table: test AUC on the Duolingo dataset, with columns model, d, epoch, train time, and AUC under the first, last and pfa feature sets. FMs use embeddings of size d = 20; LR uses d = 0. Rows are listed with the Bayesian FM variants first and plain LR last.]

28 Duolingo ranking

Teams in ranked order:

Rank | Team      | Algo
1    | SanaLabs  | RNN + GBDT
     | singsound | RNN
     | NYU       | GBDT
     | CECL      | LR + L1 (13M feat.)
     | TMU       | RNN (off)
     | JJV       | Bayesian FM (off)
     | JJV       | DeepFM
     | JJV       | DeepFM
     | Duolingo  | LR (AUC .771)

Burr Settles, Chris Brust, Erin Gustafson, Masato Hagiwara, and Nitin Madnani (2018). "Second language acquisition modeling". In: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications.

29 What about recurrent neural networks?

Deep Knowledge Tracing: model the problem as sequence prediction.
- Each student works on skill q_t with performance a_t
- How to predict outcomes y on every skill k?
- Spoiler: by measuring the evolution of a latent state h_t

30 Graphically: deep knowledge tracing

[Figure: an unrolled RNN. Inputs (q_0, a_0), (q_1, a_1), (q_2, a_2) drive the hidden states h_0 → h_1 → h_2 → h_3; at each step the network outputs a vector y = (y_0, ..., y_{M-1}) of predicted outcomes over all M skills, from which the entry for the next question, e.g. y_{q_1}, is read off.]

31 Graphically: there is a MIRT in my DKT

[Figure: the same unrolled RNN, with the output for skill q_t computed as y_{q_t} = σ(⟨h_t, v_{q_t}⟩): the hidden state h_t acts as a multidimensional ability and v_{q_t} as a skill embedding, i.e. a MIRT readout on top of the RNN.]
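To make the recurrence concrete, here is a simplified sketch of one DKT step (the original paper uses an LSTM; this vanilla tanh cell and all names here are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def dkt_step(h, q, a, params, n_skills):
        """One step of (simplified) deep knowledge tracing.

        x_t one-hot encodes the pair (skill q_t, outcome a_t) in 2M dims;
        h_{t+1} = tanh(W x_t + U h_t + b), and y = sigmoid(V h_{t+1})
        gives one predicted success probability per skill.
        """
        W, U, b, V = params
        x = np.zeros(2 * n_skills)
        x[q + n_skills * a] = 1.0             # one-hot of (q_t, a_t)
        h = np.tanh(W @ x + U @ h + b)
        y = sigmoid(V @ h)                    # Pr(correct) for every skill
        return h, y

    rng = np.random.default_rng(2)
    M, H = 5, 8                               # skills, hidden size
    params = (0.1 * rng.normal(size=(H, 2 * M)),
              0.1 * rng.normal(size=(H, H)),
              np.zeros(H),
              0.1 * rng.normal(size=(M, H)))
    h = np.zeros(H)
    for q, a in [(0, 1), (2, 0), (0, 1)]:     # a short student history
        h, y = dkt_step(h, q, a, params, M)
    print(y)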

32 Drawback of Deep Knowledge Tracing

DKT does not model individual differences. Actually, Wilson even managed to beat DKT with (1-dimensional!) IRT. By estimating the student's learning ability on the fly, we managed to get a better model (DKT-DSC).

[Table: AUC of BKT, IRT, PFA, DKT and DKT-DSC on Cognitive Tutor and three Assistments datasets.]

Sein Minn, Yi Yu, Michel Desmarais, Feida Zhu, and Jill-Jênn Vie (2018). "Deep Knowledge Tracing and Dynamic Student Classification for Knowledge Tracing". Submitted at IEEE International Conference on Data Mining.

33 Take-home message

- Factorization machines are a strong baseline that takes many models as special cases
- Recurrent neural networks are powerful because they track the evolution of the latent state (try simpler dynamic models?)
- Deep factorization machines may require more data/tuning, but neural collaborative filtering offers promising directions

34 Any suggestions are welcome!

Feel free to chat.
All code: github.com/jilljenn/ktm

Do you have any questions?

35 Lan, Andrew S., Andrew E. Waters, Christoph Studer, and Richard G. Baraniuk (2014). "Sparse factor analysis for learning and content analytics". In: The Journal of Machine Learning Research 15.1.
Minn, Sein, Yi Yu, Michel Desmarais, Feida Zhu, and Jill-Jênn Vie (2018). "Deep Knowledge Tracing and Dynamic Student Classification for Knowledge Tracing". Submitted at IEEE International Conference on Data Mining.
Piech, Chris, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J. Guibas, and Jascha Sohl-Dickstein (2015). "Deep knowledge tracing". In: Advances in Neural Information Processing Systems (NIPS).
Rendle, Steffen (2012). "Factorization Machines with libFM". In: ACM Transactions on Intelligent Systems and Technology (TIST) 3.3, 57:1-57:22.

36 Settles, Burr, Chris Brust, Erin Gustafson, Masato Hagiwara, and Nitin Madnani (2018). "Second language acquisition modeling". In: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications.
Vie, Jill-Jênn (2018). "Deep Factorization Machines for Knowledge Tracing". In: The 13th Workshop on Innovative Use of NLP for Building Educational Applications.
Vie, Jill-Jênn, Fabrice Popineau, Yolaine Bourda, and Éric Bruillard (2016). "Adaptive Testing Using a General Diagnostic Model". In: European Conference on Technology Enhanced Learning. Springer.
