Machine Teaching for Personalized Education, Security, Interactive Machine Learning. Jerry Zhu


1 Machine Teaching for Personalized Education, Security, Interactive Machine Learning
Jerry Zhu
NIPS 2015 Workshop on Machine Learning from and for Adaptive User Technologies

2 Supervised Learning Review
D: training set $(x_1, y_1), \ldots, (x_n, y_n)$
Learning algorithm A: $A(D) = \{\theta\}$
Example: regularized empirical risk minimization
$A(D) = \operatorname{argmin}_{\theta} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda \Omega(\theta)$
Example: version space learner
$A(D) = \{\theta \text{ consistent with } D\}$
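The two example learners can be sketched as follows. This is only an illustration: the function names, the squared loss, the 1D model y ≈ θx, and the grid of candidate thresholds are assumptions made for the example, not part of the slides.

```python
def ridge_1d(data, lam):
    """Regularized ERM for the 1D model y ~ theta * x with squared loss:
    argmin_theta sum_i (theta*x_i - y_i)^2 + lam * theta^2 (closed form)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)


def version_space(data, hypotheses):
    """Version space learner over threshold classifiers:
    theta labels x as +1 if x >= theta, else -1."""
    def predict(theta, x):
        return 1 if x >= theta else -1
    return [th for th in hypotheses
            if all(predict(th, x) == y for x, y in data)]


D_reg = [(1.0, 2.0), (2.0, 4.0)]        # toy data lying exactly on y = 2x
theta_hat = ridge_1d(D_reg, lam=0.0)

D_cls = [(0.3, -1), (0.7, 1)]
grid = [i / 10 for i in range(11)]      # candidate thresholds 0.0, 0.1, ..., 1.0
consistent = version_space(D_cls, grid)
```

Note that the version space learner returns a set of models, which is why the slide writes A(D) as a set {θ}.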

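3 Machine Teaching = Inverse Machine Learning
Given: learning algorithm A, target model θ*
Find: the smallest D such that $A(D) = \{\theta^*\}$
[Figure: A maps a training set D to a model A(D) in Θ; the inverse image $A^{-1}(\theta^*)$ is the set of training sets that A maps to θ*.]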

4 Machine Teaching Example
Teach a 1D threshold classifier
Given: A = SVM, θ* = 0.6
What is the smallest D?
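One plausible answer is two oppositely labeled items placed symmetrically around the target. The sketch below assumes a hard-margin SVM, for which the 1D decision threshold is the midpoint of the closest opposite-label pair; the function names and the spacing `eps` are illustrative choices, not taken from the slides.

```python
def hard_margin_threshold_1d(data):
    """Threshold of a 1D hard-margin SVM on separable data where all
    negatives lie to the left of all positives: the max-margin boundary
    is the midpoint between the rightmost negative and leftmost positive."""
    neg = max(x for x, y in data if y == -1)
    pos = min(x for x, y in data if y == 1)
    if neg >= pos:
        raise ValueError("data not separable by a threshold")
    return (neg + pos) / 2


def teaching_set(theta_star, eps=0.1):
    """Two items placed symmetrically around the target threshold."""
    return [(theta_star - eps, -1), (theta_star + eps, 1)]


D = teaching_set(0.6)
learned = hard_margin_threshold_1d(D)   # should land on the target, 0.6
```

Any `eps > 0` works here, because only the midpoint of the pair matters to this learner.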


6 Machine Teaching as Communication
sender = teacher
message = θ*
decoder = A
codeword = D

7 Machine Teaching Stronger Than Active Learning
Sample complexity to achieve ε error:
passive learning: $n = O(1/\varepsilon)$
active learning: $n = O(\log(1/\varepsilon))$, needs binary search
machine teaching: n = 2, the teaching dimension [Goldman + Kearns 1995]; the teacher knows θ*
[Figure: uncertainty about θ* after n items is $O(1/n)$ for passive learning and $O(1/2^n)$ for active learning. Passive learning "waits", active learning "explores", teaching "guides".]
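The gap between active learning and teaching can be illustrated on a 1D threshold in [0, 1]. This is a sketch under simplifying assumptions (noise-free labels, a learner that only needs to localize θ* to an interval of width ε); the function names are hypothetical.

```python
import math


def active_learning_queries(theta_star, eps):
    """Binary search for a threshold in [0, 1]: each label query halves
    the interval known to contain theta_star. Returns the number of
    queries needed to shrink that interval to width <= eps."""
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if mid >= theta_star:   # oracle labels mid as +1
            hi = mid
        else:                   # oracle labels mid as -1
            lo = mid
    return queries


def teaching_items(theta_star, eps):
    """A teacher who knows theta_star brackets it with two labeled items."""
    return [(theta_star - eps / 2, -1), (theta_star + eps / 2, 1)]


for eps in (0.1, 0.01, 0.001):
    n_active = active_learning_queries(0.6, eps)
    n_teach = len(teaching_items(0.6, eps))
    print(eps, n_active, math.ceil(math.log2(1 / eps)), n_teach)
```

The query count grows with log(1/ε), while the teacher always uses two items, whatever ε is.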

8 Why Machine Teaching?
Why bother if we already know θ*?
1. education
2. computer security
3. interactive machine learning

9 Education
Teacher answers three questions:
1. My student's cognitive model A is (a) an SVM (b) logistic regression (c) a neural net
2. Student success is defined by (a) learning a target model θ* (b) excelling on a test set
3. My teaching effort is defined by (a) training set size (b) training item cost ...
Machine teaching finds the optimal, personalized lesson D for the student.

10 Education Example [Patil et al. 2014]
Human categorization task: 1D threshold θ* = 0.5
A: kernel density estimator
Optimal D: [Figure: training items on the interval [0, 1], labeled y = -1 and y = 1, around θ* = 0.5]
Teaching humans with D:

human trained on | human test accuracy
optimal D        | 72.5%
random items     | 69.8%

(statistically significant)

11 Security: Data Poisoning Attack [Alfeld et al. 2016], [Mei + Zhu 2015]
Given: learner A, attack target θ*, clean training data $D_0$
Find: the minimum poison δ such that $A(D_0 + \delta) = \{\theta^*\}$


12 Security: Lake Mendota Data
[Figure: Lake Mendota, Wisconsin ice days; axes: YEAR vs. ICE DAYS]

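13 Security: Optimal Attacks on Ordinary Least Squares
$\min_{\delta, \hat\beta} \|\delta\|_p$ s.t. $\hat\beta_1 \ge 0$, $\hat\beta = \operatorname{argmin}_{\beta} \|(y + \delta) - X\beta\|^2$
[Figure: two panels of ICE DAYS vs. YEAR, each comparing the original fit with the attacked fit; one panel minimizes $\|\delta\|_2^2$ (2-norm attack), the other minimizes $\|\delta\|_1$ (1-norm attack).]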
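For the 2-norm case the attack has a simple closed form when the regression is a line with an intercept. The sketch below assumes the attacker's constraint is a non-negative slope and that only the responses y are perturbed; the toy data are hypothetical and are not the Lake Mendota series.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line y ~ b0 + b1 * x."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sum((x - xbar) * y for x, y in zip(xs, ys)) / sxx


def min_l2_poison(xs, ys):
    """Smallest-2-norm delta such that the OLS slope on (xs, ys + delta)
    is >= 0. The slope is linear in y: b1 = sum_i a_i * y_i with
    a_i = (x_i - xbar) / Sxx, so the solution is a projection onto the
    hyperplane {delta : a . (y + delta) = 0}."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    a = [(x - xbar) / sxx for x in xs]
    slope = sum(ai * y for ai, y in zip(a, ys))
    if slope >= 0:
        return [0.0] * n          # constraint already satisfied
    scale = -slope / sum(ai * ai for ai in a)
    return [scale * ai for ai in a]


# Hypothetical toy data with a downward trend (not the real ice-day record).
years = [0.0, 1.0, 2.0, 3.0, 4.0]
ice_days = [120.0, 118.0, 119.0, 115.0, 114.0]

delta = min_l2_poison(years, ice_days)
poisoned = [y + d for y, d in zip(ice_days, delta)]
```

The 2-norm solution spreads small changes over many items, in proportion to their distance from the mean year. For p = 1 the same linear constraint is met most cheaply by changing only the item with the largest |a_i|, which gives a sparse attack.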

14 Security: Optimal Attack on Latent Dirichlet Allocation [Mei + Zhu 2015]

15 Optimization for Machine Teaching
Bilevel optimization:
$\min_{D \in \mathbb{D}} |D|$ s.t. $\theta^* = \operatorname{argmin}_{\theta \in \Theta} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} \ell(x_i, y_i, \theta) + \Omega(\theta)$
In general:
$\min_{D \in \mathbb{D}} \text{TeachingLoss}(A(D), \theta^*) + \text{TeachingEffort}(D)$
Sometimes closed-form solution [Alfeld et al. 2016] [Zhu 2013]
Often nonconvex optimization [Bo Zhang, this workshop]


17 Theoretical Question: Teaching Dimension
Given: A, θ*
Find: the smallest D such that $A(D) = \{\theta^*\}$
Teaching dimension: TD = |D|
A = version space learner: TD = ? [Goldman + Kearns 1995]
A = SVM: TD = 2
Learner-specific teaching dimension
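For a version space learner over a small finite hypothesis class, the smallest teaching set can be found by exhaustive search. The sketch below is a brute-force illustration only; the threshold class, the grids, and the function names are assumptions made for the example.

```python
from itertools import combinations


def predict(theta, x):
    """Threshold classifier: +1 at or above theta, -1 below."""
    return 1 if x >= theta else -1


def smallest_teaching_set(target, hypotheses, instances):
    """Brute-force search for the smallest D, labeled by the target, such
    that the version space {theta consistent with D} equals {target}."""
    for size in range(len(instances) + 1):
        for xs in combinations(instances, size):
            D = [(x, predict(target, x)) for x in xs]
            survivors = [h for h in hypotheses
                         if all(predict(h, x) == y for x, y in D)]
            if survivors == [target]:
                return D
    return None  # the target cannot be singled out with these instances


# Hypothetical finite setting: thresholds and instances on small grids.
hypotheses = [0.2, 0.4, 0.6, 0.8]
instances = [0.1, 0.3, 0.5, 0.7, 0.9]

D = smallest_teaching_set(0.6, hypotheses, instances)
```

In this toy setting an interior target needs one item on each side of the threshold, while the leftmost target can be taught with a single item, so the size of the smallest teaching set depends on the target as well as on the learner.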

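18 Teaching Dimension Example: Teaching an SVM Learner
$A(D) = \operatorname{argmin}_{\theta \in \mathbb{R}^2} \sum_{i=1}^{n} \max(1 - y_i x_i^\top \theta, 0) + \frac{1}{2} \|\theta\|^2$
Target model: $\theta^* = (\frac{1}{2}, \frac{\sqrt{3}}{2})$
One training item is necessary and sufficient: $TD(\theta^*, A) = 1$
$x_1 = (\frac{1}{2}, \frac{\sqrt{3}}{2})$, $y_1 = 1$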
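The one-item claim can be checked numerically. The sketch below assumes the target is the unit vector (1/2, √3/2) and that the regularizer is (1/2)‖θ‖²; both are readings of the slide rather than confirmed values. It compares the objective at θ* with every point of a coarse grid instead of running a real SVM solver.

```python
import math


def svm_objective(theta, data):
    """Hinge loss plus (1/2)*||theta||^2 for a homogeneous linear SVM."""
    hinge = sum(max(1.0 - y * (x[0] * theta[0] + x[1] * theta[1]), 0.0)
                for x, y in data)
    return hinge + 0.5 * (theta[0] ** 2 + theta[1] ** 2)


# Assumed reading of the slide: theta* = (1/2, sqrt(3)/2), taught by the
# single item x1 = theta*, y1 = 1.
theta_star = (0.5, math.sqrt(3) / 2)
D = [(theta_star, 1)]

f_star = svm_objective(theta_star, D)

# Compare against every point of a coarse grid over [-2, 2]^2.
grid = [i / 20 for i in range(-40, 41)]
best_on_grid = min(svm_objective((a, b), D) for a in grid for b in grid)
```

No grid point does better than θ*, which matches the optimality condition: at θ* the hinge term sits exactly at its kink, and the subgradient −x₁ of the hinge cancels the gradient θ* of the regularizer.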


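20 The Teaching Dimension of Linear Learners [Liu & Zhu 2015]
$A(D) = \operatorname{argmin}_{\theta \in \mathbb{R}^d} \sum_{i=1}^{n} \ell(\theta^\top x_i, y_i) + \frac{\lambda}{2} \|\theta\|_2^2$

Homogeneous learners:
goal      | ridge | SVM | logistic
parameter | 1 | $\lceil \lambda \|\theta^*\|^2 \rceil$ | $\lceil \lambda \|\theta^*\|^2 / \tau_{\max} \rceil$
boundary  |   |   |

Inhomogeneous learners, θ* = [w*; b*]:
goal      | ridge | SVM | logistic
parameter | 2 | $2 \lceil \lambda \|w^*\|^2 / 2 \rceil$ | $2 \lceil \lambda \|w^*\|^2 / (2 \tau_{\max}) \rceil$
boundary  |   |   |

Teaching Dimension (independent of d) is distinct from VC-dimension (d + 1)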

21 Interactive Machine Learning
You train a classifier
Only use (item, label) pairs $(x_1, y_1), \ldots, (x_n, y_n)$
Cost = n to reach small ε error

22 Speed of Light
[Goldman+Kearns 1995], [Angluin 2004], [Cakmak+Thomaz 2011], [Zhu et al. in preparation]
Fundamental Cost of Interactive Machine Learning: n ≥ Teaching Dimension
Optimal teacher achieves n = Teaching Dimension
Can be much faster than active learning (recall 2 vs. $\log(1/\varepsilon)$)
Must allow teacher-initiated items (unlike active learning)
But some humans are bad teachers...

23 Some Humans are Suboptimal Teachers [Khan, Zhu, Mutlu 2011]
Challenge for the computer: make humans behave more like the optimal teacher

24 Summary
Machine teaching
Theory: teaching dimension
Algorithm: bilevel optimization
Applications: education, security, interactive machine learning
