arxiv: v1 [stat.ml] 25 Feb 2018

Size: px

Start display at page:

Download "arxiv: v1 [stat.ml] 25 Feb 2018"

Ruby Francis
5 years ago
Views:

1 Yuzhe Ma Robert Nowa Philippe Rigollet Xuezhou Zhang Xiaojin Zhu University of Wisconsin-Madison Massachusetts Institute of Technology arxiv:8.8946v stat.ml 5 Feb 8 Abstract We call a learner super-teachable if a teacher can trim down an iid training set while maing the learner learn even better. We provide sharp superteaching guarantees on two learners: the maximum lielihood estimator for the mean of a Gaussian, and the large margin classifier in D. For general learners, we provide a mixed-integer nonlinear programming-based algorithm to find a super teaching set. Empirical experiments show that our algorithm is able to find good super-teaching sets for both regression and classification problems. Introduction Consider the following question: a learner receives an iid training set S drawn from a distribution parametrized by θ. There is a teacher who nows θ. Can the teacher select a subset from S so the learner estimates θ better from the subset than from S? This question is distinct from training set reduction (see e.g. 9, 43, 4) in that the teacher can use the nowledge of θ to carefully design the subset. It is, in fact, a coding problem: Can the teacher approximately encode θ using items in S for a nown decoder, which is the learner? As such, the question is not a machine learning tas but rather a machine teaching one 47,, 45. This question is relevant for several nascent applications. One application is in understanding blacbox models such as deep nets. Often observation to a blacbox model is limited to its predicted label y = θ (x) given input x. One way to interpret a blacbox model is to locally train an interpretable model with data points S labeled by the blacbox model around the region of interest 35. We, however, as for more: to reduce the size of the training set S for Proceedings of the st International Conference on Artificial Intelligence and Statistics (AISTATS) 8, Lanzarote, Spain. PMLR: Volume 84. Copyright 8 by the author(s). the local learner while maing the learner approximate the blacbox better. The reduced training set itself also serves as representative examples of local model behavior. Another application is in education. Imagine a teacher who has a teaching goal θ. This is a reasonable assumption in practice: e.g. a geology teacher has the nowledge of the actual decision boundaries between roc categories. However, the teacher is constrained to teach with a given textboo (or a set of courseware) S. To the extent that the student is quantified mathematically, the teacher wants to select pages in the textboo with the guarantee that the student learns better from those pages than from gulping the whole boo. But is the question possible? The following example says yes. Consider learning a threshold classifier on the interval,, with true threshold at θ =. Let S have n items drawn uniformly from the interval and labeled according to θ. Let the learner be a hard margin SVM, which places the estimated threshold in the middle of the inner-most pair in S with different labels: ˆθ S = (x + x + )/ where x is the largest negative training item and x + the smallest positive training item in S. It is well nown that ˆθ S θ converges at a rate of /n: the intuition being that the average space between adjacent items is O(/n). - Figure : The original training set S with n = 6 items (circles and stars; green=negative, purple=positive), and the most-symmetric training set (stars) the teacher selects. The teacher nows everything but cannot tell θ directly to the learner. Instead, it can select the most-symmetric pair in S about θ and give them to the learner as a two-item training set. We will prove later that the ris on the most symmetric pair is O(/n ), that is, learning from the selected subset surpasses learning from S. Thus we observe something interesting: the teacher can turn a larger training set S into a smaller and better subset for the midpoint classifier. We call this phenomenon super-teaching.

2 Formal Definition of Super Teaching Let Z be the data space: for unsupervised learning Z = X, while for supervised learning Z = X Y. Let p Z be the underlying distribution over Z. We tae a function view of the learner: a learner A is a function A : n=z n Θ, where Θ is the learner s hypothesis space. The notation n=z n defines the set of (potentially non-iid) training sets, namely multisets of any size whose elements are in Z. Given any training set T n=z n, we assume A returns a unique hypothesis A(T ) ˆθ T Θ. The learner s ris R(θ) for θ Θ is defined as: R(θ) = E pz l(θ(x), y), or R(θ) = θ θ. () The former is for prediction tass where l() is a loss function and θ(x) denotes the prediction on x made by model θ; the latter is for parameter estimation where we assume a realizable model p Z = p θ for some θ Θ. We now introduce a clairvoyant teacher B who has full nowledge of p Z, A, R. The teacher is also given an iid training set S = {z,..., z n } p Z. If the teacher teaches S to A, the learner will incur a ris R(A(S)) R(ˆθ S ). The teacher s goal is to judiciously select a subset B(S) S to act as a super teaching set for the learner so that R(ˆθ B(S) ) < R(ˆθ S ). Of course, to do so the teacher must utilize her nowledge of the learning tas, thus the subset is actually a function B(S, p Z, A, R). In particular, the teacher nows p Z already, and this sets our problem apart from machine learning. For readability we suppress these extra parameters in the rest of the paper. We formally define super teaching as follows. Definition (Super Teaching). B is a super teacher for learner A if δ >, N such that n N P S R(ˆθ B(S) ) c n R(ˆθ S ) > δ, () where S iid p n Z, B(S) S, and c n is a sequence we call super teaching ratio. Obviously, c n = can be trivially achieved by letting B(S) = S so we are interested in small c n. There are two fundamental questions: () Do super teachers provably exist? () How to compute a super teaching set B(S) in practice? We answer the first question positively by exhibiting super teaching on two learners: maximum lielihood estimator for the mean of a Gaussian in section 3, and D large margin classifier in section 4. Guarantees on super teaching for general learners remain future wor. Nonetheless, empirically we can find a super teaching set for many general learners: We formulate the second question as mixed-integer nonlinear programming in section 5. Empirical experiments in section 6 demonstrates that one can find a good B(S) effectively. 3 Analysis on Super Teaching for the MLE of Gaussian mean In this section, we present our first theoretical result on super teaching, when the learner A MLE is the maximum lielihood estimator (MLE) for the mean of a Gaussian. Let Z = X = R, Θ = R, p Z (x) = N (θ, ). Given a sample S of size n drawn from p Z, the learner computes the MLE for the mean: ˆθ S = A MLE (S) = n n i= x i. We define the ris as R(ˆθ S ) = ˆθ S θ. The teacher we consider is the optimal -subset teacher B, which uses the best subset of size to teach: B (S) argmin T S, T = R(ˆθ T ). (3) To build intuition, it is well-nown that the ris of A MLE under S is O(/ n) because the variance under n items shrins lie /n. Now consider =. Since the teacher B nows θ, under our setting the best teaching strategy is for her to select the item in S closest to θ, which forms the singleton teaching set B (S). One can show that with large probability this closest item is O(/n) away from θ (the central part of a Gaussian density is essentially uniform). Therefore, we already see a super teaching ratio of c n = n. More generally, our main result below shows that B achieves a super teaching ratio c n = O(n + ): Theorem. Let B be the optimal -subset teacher. ɛ (, 4 ), δ (, ), N(, ɛ, δ) such that n N(, ɛ, δ), P R(ˆθ B (S)) c n R(ˆθ S ) > δ, where c n = ɛ n + +ɛ. Toward proving the theorem, we first recall the standard rate R(ˆθ S ) n if A MLE learns from the whole training set S: Proposition. Let S be an n-item iid sample drawn from N (θ, ). ɛ >, δ (, ), N (ɛ, δ) such that n N, P n ɛ < R(ˆθ S ) < n +ɛ > δ. (4) Proof. R(ˆθ S ) = ˆθ S θ and ˆθ S θ N (, n ) = n nx π e. Let α = n ɛ and β = n +ɛ. We have P R(ˆθ S ) α = < α α n dx = α π n π n π = nx e dx π n ɛ, Remar: we introduced an auxiliary variable ɛ which controls the implicit tradeoff between c n, how much super teaching helps, and N, how soon super teaching taes effect. When ɛ the teaching ratio c n approaches O(n + ), but as we will see N(, ɛ, δ). Similarly, also affects the tradeoff: the teaching ratio is smaller as we enlarge, but N(, ɛ, δ) increases. (5)

3 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu P R(ˆθ S ) β = < = β β x n β π β nβ nπ e < β n π nx e dx = nπ = nx e dx β β n ny π e dy π n ɛ. Thus P α < R(ˆθ S ) < β = P R(ˆθ S ) α P R(ˆθ S ) β > π n ɛ. Let N (ɛ, δ) = ( δ 8 π ) ɛ, then n N, P α < R(ˆθ S ) < β > δ. We now wor out the ris of A MLE if it learns from the optimal -subset teacher B. Theorem 4 says that this ris is very small and sharply concentrated around R(ˆθ B (S)) n. To prove Theorem 4, we first give the following lemma. ( ) n Lemma 3. Denote C n =. Let the index set I = {,,...n} where n 4. Consider all subsets of size, then there are at most 4 C Cn ordered pairs of subsets that are overlapping but not identical. Proof. Let I and I be two subsets of size and they overlap on t indexes. Then the total number of distinct indexes that appear in I I is t. There are C t n ways of choosing such t indexes. Next we determine which t indexes are overlapping ones. We have Ct t ways of choosing such t indexes. Finally we have C t t ways of selecting half of the non-overlapping indexes and attribute them to I. Thus in total we have O t = C t n C t t C t t ordered pairs of subsets that overlap on t indexes. By our assumption n 4 we have C t n Cn. Also note that C t t < Ct and C t t < C, thus O t < C n C t C. Therefore the total number of ordered pairs of subsets that are overlapping but not identical is O = O t < C C n t C < t= t= t= C C n t C = 4 C C. n (6) Now we prove the ris of the optimal -subset teacher. Theorem 4. Let B be the optimal -subset teacher. Let S be an n-item iid sample drawn from N (θ, ). ɛ (7) (, ), δ (, ), N (, ɛ, δ) such that n N, P ( n )+ɛ < R(ˆθ B (S)) < ( n ) ɛ > δ. Proof. Let I {,,..., n} and I =, define γ I = i I (x i θ ). Let S I denote the subset indexed by I. Note that ˆθ SI = i I x i and R(ˆθ SI ) = ˆθ SI θ = i I x i θ = γ I. Also note that R(ˆθ B (S)) = inf I R(ˆθ SI ) = inf I γ I. Thus to prove Theorem 4 it suffices to prove P ( n )+ɛ < inf γ I < ( I n ) ɛ. (9) Let α = ( n )+ɛ and β = ( n ) ɛ. We first prove the lower bound. Note that γ I has the same distribution for all I. Thus by the union bound, P inf I (8) γ I α = P I : γ I α C n P γ I α, () where I = {,,..., }. Since γ I N (, ), we have P γ I α = α α Note that C n ( en ). Thus, P inf I e x dx < α. () π π γ I α < ( en ) π α = π e ( n )ɛ. () Thus N (, ɛ, δ) such that n N, P inf γ I α < δ I. (3) To show the upper bound, we define t I = γ I < β, where is the indicator function. Let T = I t I. Then it suffices to show lim n P T = =. Note that P T = = P T E T = E T P (T E T ) (E T ) V T (E T ), (4) where the last inequality follows from the Marov inequality. Now we lower bound E T. E T = E t I = E t I = C n E t I I I (5) β = C n P γ I < β = C n e x dx. π Note that ɛ <, thus β <. For x ( β, β), π e x > β π e = πe. Also note that C n > ( n ), thus E T > ( n ) πe β = πe (n )ɛ. (6)

4 Now we upper bound V T. V T = I,I Cov t I, t I = I,I, I I Cov t I, t I. (7) Note that for Bernoulli random variable t I, V t I E t I. Thus if I = I, then Cov t I, t I = V t I E t I = P γ I < β β = e x dx < β = π π π ( n ) ɛ. β (8) Otherwise I I, that is, I and I overlap but not identical, then Cov t I, t I = E t I t I E t I E t I E t I t I = P γ I < β, γ I < β. (9) Note that γ I and γ I are jointly Gaussian with covariance Cov γ I, γ I = Cov x i θ, x i θ i I,i I = I I = ρ, i I,i I,i=i () where ρ. The joint PDF of two standard normal distributions x, y with covariance ρ is f(x, y) = Note that f(x, y) π ρ, thus P γ I < β, γ I < β = π x ρxy+y ρ e ( ρ ). () x <β, y <β π ρ (β) = π ρ β. Since π ρ π ρ dxdy π ( = ) π π P γ I < β, γ I < β β π (), thus = π ( n ) ɛ. According to Lemma 3, there are at most 4 C of I and I such that I I. Thus, V T = V t I + Cov t I, t I I I I, I I (3) Cn pairs C n π ( n ) ɛ + 4 C C n π ( n ) ɛ π (en ) ( n ) ɛ + 4 C en ( ) π ( n ) ɛ = π e ( n )ɛ + 4 π C ( e ) ( n )ɛ. (4) Now plug (4) and (6) into (4), we have P T = a ()( n ) ɛ + a ()( n ), (5) where a = π e+ and a () = ec ( e ). Thus N (, ɛ, δ) such that n N P inf I, γ I β < δ. (6) Let N (, ɛ, δ) = max{n (, ɛ, δ), N (, ɛ, δ)}, combining (3) and (6) concludes the proof. Now we can conclude super-teaching by comparing Theorem 4 and Proposition : Proof of Theorem. Let α = ( n ) ɛ and β = n ɛ. By Proposition, ɛ (, 4 ), δ (, ), N (ɛ, δ ) such that n N, P R(ˆθ S ) > β > δ. By Theorem 4, N (, ɛ, δ ) such that n N, P R(ˆθ B (S)) < α > δ. Let c n = ɛ n + +ɛ. Since ɛ < 4, c n is a decreasing sequence in n with lim n c n =. Let N 3 (, ɛ) be the first integer such that c N3. Let N(, ɛ, δ) = max{n (ɛ, δ ), N (, ɛ, δ ), N 3(, ɛ)}. By a union bound n N(, ɛ, δ), P Since α β = c n, we have P δ, where c n c N3. R(ˆθ B (S)) < α, R(ˆθ S ) β R(ˆθ B (S)) c n R(ˆθ S ) > δ. > 4 Analysis on Super Teaching for D Large Margin Classifier We present our second theoretical result, this time on teaching a D large margin classifier. Let X =,, Y = {, }, Θ =,, θ =, p Z (x, y) = p Z (x)p Z (y x) where p Z (x) = U(X) and p Z (y = x) = x θ. Let x max i:yi= x i and x + min i:yi=+ x i be the inner-most pair of opposite labels in S if they exist. We formally define the large margin classifier A lm (S) as (x + x + )/ if x, x + exist ˆθ S = A lm (S) = if S all positive if S all negative. (7) The ris is defined as R(ˆθ S ) = ˆθ S θ = ˆθ S. The teacher we consider is the most symmetric teacher, who selects the most symmetric pair about θ in S and gives it to the learner. We define the most-symmetric teacher B ms : { {(s, ), (s B ms (S) = +, )} if s, s + exist, {(x, y )} otherwise. (8)

5 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu where (s, s + ) argmin (x, ),(x,) S x+x θ. Our main result shows that learning from the whole set S achieves the well-nown O(/n) ris, but surprisingly B ms achieves O(/n ) ris, therefore it is an approximately c n = O(n ) super teaching ratio. Theorem 5. Let S be an n-item iid sample drawn from p Z. Then δ (, ), N(δ) such that n N, P R(ˆθ Bms(S)) c n R(ˆθ S ) > δ, where c n = 3 nδ ln 6 δ. Before proving Theorem 5, we first show that B ms is an optimal teacher for the large margin classifier. Proposition 6. B ms is an optimal teacher for the large margin classifier ˆθ S. Proof. We show R(ˆθ Bms(S)) R(ˆθ B(S) ) for any B and any S. If B ms (S) =, then S is either all positive or all negative. In both cases R(ˆθ B(S) ) = for any B by definition. Thus R(ˆθ Bms(S)) R(ˆθ B(S) ). Otherwise B ms (S) =, then if B(S) is all positive or all negative, we have R(ˆθ B(S) ) = and thus R(ˆθ Bms(S)) R(ˆθ B(S) ). Otherwise let x B, x B + be the inner most pair of B(S). Since x B, x B + S, then by definition of B ms, R(ˆθ Bms(S)) = s +s+ θ xb +xb + θ = R(ˆθ B(S) ). Now we wor out the ris of the most symmetric teacher B ms. To bound the ris of B ms we need the following ey lemma, which shows that the sample complexity with the teacher is O(ɛ / ). Lemma 9. Let n = 4m, where m is an integer. Let S be an n-item iid sample drawn from p Z. ɛ >, δ (, ), M(ɛ, δ) = max{ 3e ln 4 ln 3 δ, ( ɛ ln 3 δ ) } such that m M(ɛ, δ), P R(ˆθ Bms(S)) ɛ > δ. Proof. We give a proof setch and the details are in the appendix. Let S = {x (x, ) S} and S = {x (x, ) S} respectively. Then we have S + S = 4m. Define event E : { S m S m}. Given that m 3e ln 4 ln 3 δ, one can show P (E ) > δ 3. Since S + S = 4m, either S m or S m. Without loss of generality we assume S m. We then divide the interval, equally into N = m (ln 3 δ ) segments. The length of each segment is N = O( m ) as Figure shows. 3 N - Figure : segments Now we show that learning on the whole S incurs O(n ) ris. First, we give the following lemma for the exact tail probability of R(ˆθ S ). Lemma 7. For the large margin classifier ˆθ S, we have ( ɛ) n + (ɛ) n < ɛ P R(ˆθ S ) > ɛ = ( )n < ɛ < (9) ɛ =. The proof for Lemma 7 is in the appendix. Now we show that R(ˆθ S ) is O(n ). Theorem 8. Let S be an n-item iid sample drawn from p Z. Then δ (, ) and n, P R(ˆθ S ) > δ > δ. (3) n Proof. According to Lemma 7, for ɛ, we have P R(ˆθ S ) > ɛ > ( ɛ) n > nɛ. (3) Note that n, thus δ n. Let ɛ = δ n in (3) we have P R(ˆθ S ) > δ > n δ = δ. (3) n n Let N o be the number of segments that are occupied by the points in S. Note that N o is a random variable. Let E be the event that N o m. Then one can show P (E ) > δ 3. By union bound, we have P (E, E ) > δ 3. Let E 3 be the following event: there exist a point x in S such that x, the flipped point, lies in the same segment as some point x in S. One can show that P (E 3 E, E ) > δ 3. Thus P (E 3) P (E, E, E 3 ) = P (E 3 E, E )P (E, E ) ( δ δ 3 )( 3 ) > δ. If E 3 happens, then x + x = x ( x ). Note that m ( ɛ ln 3 δ ) and N = m (ln 3 δ ) m (ln 3 δ ), thus N m ln 3 δ ɛ. Therefore R(ˆθ Bms(S)) = s +s+ x+x ɛ. Rewriting ɛ in Lemma 9 as a function of n, we have the following theorem. Theorem. Let S be an n-item iid sample dawn from p Z, then N (δ) = e ln 4 ln 3 δ such that n N, P R(ˆθ Bms(S)) 6 n ln 3 > δ. (33) δ Proof. Note that if n N (δ) = ln 4 ln 3 δ, then m = n 4 3e ln 4 ln 3 δ, thus the minimum ɛ that satisfies m M(ɛ, δ) is m ln 3 δ = 6 n ln 3 δ. e N

6 Now we can conclude super teaching: Proof of Theorem 5. According to Theorem, N ( δ ) such that n N, P R(ˆθ Bms(S)) 6 n ln 6 δ > δ. Note that N, thus according to Theorem 8, n N, P R(ˆθ S ) > δ n > δ. Let c n = 3 nδ ln 6 δ and N (δ) = 3 δ ln 6 δ so that c N =. Let N(δ) = max{n (δ), N (δ)}. By union bound, n N, with probability at least δ, we have both R(ˆθ S ) > δ n and R(ˆθ Bms(S)) 6 n ln 6 δ, which gives P R(ˆθ Bms(S)) c n R(ˆθ S ) > δ, where c n c N =. 5 An MINLP Algorithm for Super Teaching Although the problem of proving super teaching ratios for a specific learner is interesting, we now focus on an algorithm to find a super teaching set for general learners given a training set S. That is, we find a subset B(S) S so that R(ˆθ B(S) ) < R(ˆθ S ). We start by formulating super teaching as a subset selection problem. To this end, we introduce binary indicator variables b,..., b n where b i = means z i S is included in the subset. We consider learners A that can be defined via convex empirical ris minimization: A(S) argmin θ Θ n i= l(θ, z i ) + λ θ. (34) For simplicity we assume there is a unique global minimum which is returned by argmin. Note that we use l in (34) to denote the (surrogate) convex loss used by A in performing empirical ris minimization. For example, l may be the negative log lielihood for logistic regression. l is potentially different from l (e.g. the - loss) used by the teacher to define the teaching ris R in (). We formulate super teaching as the following bilevel combinatorial optimization problem: min b {,} n,ˆθ Θ R(ˆθ) (35) s.t. ˆθ n = argminθ Θ b i l(θ, zi ) + λ θ. (36) i= Under mild conditions, we may replace the lower level optimization problem (i.e. the machine learning problem (36)) by its first order optimality (KKT) conditions: min b {,} n,ˆθ Θ s.t. R(ˆθ) (37) n b i θ l(ˆθ, zi ) + λˆθ =. i= This reduces the bilevel problem but the constraint is nonlinear in general, leading to a mixed-integer nonlinear program (MINLP), for which effective solvers exist. We use the MINLP solver in NEOS 5. 6 Simulations We now apply the framewor in section 5 to logistic regression and ridge regression, and show that the solver indeed selects a super-teaching subset that is far better than the original training set S. 6. Teaching Logistic Regression A lr Let X = R d, Θ = R d, θ = ( d,..., d ), p Z (x) = N (, I). Let p Z (y x) = x θ >, which is deterministic given x. Logistic regression estimates ˆθ S = A lr (S) with (34), where λ =. and l(z i ) = log( + exp( y i x i θ)). In contrast, The teacher s ris is defined to be the expected - loss: R(ˆθ) = E pz ˆθ(x) y, where ˆθ(x) is the label of x predicted by ˆθ. Since p Z is symmetric about the origin, the ris can be rewritten in terms of the angle between ˆθ and θ ˆθ : R(ˆθ) = arccos( θ )/π. ˆθ θ Instantiating (37) we have min b {,} n,ˆθ R d s.t. ˆθ θ arccos( )/π (38) ˆθ θ n b i y i x i λˆθ =. + exp(y i x i ˆθ) i= We run experiments to study the effectiveness and scalability of the NEOS MINLP solver on (38), specifically with respect to the training set size n = S and dimension d. In the first set of experiments we fix d = and vary n = 6, 64, 56 and 4. For each n we run trials. In each trial we draw an n-item iid sample S p Z and call the solver on (38). The solver s solution to b... b n indicates the super teaching set B(S). We then compute an empirical version of the super teaching ratio: ĉ n = R(ˆθ B(S) )/R(ˆθ S ). Logistic Regression Ridge Regression n = S ĉ n B(S) /n time (s) ĉ n B(S) /n time (s) 6 8.5e e- 7.8e e- 64.3e e+ 7.5e e e e+ 5.6e e+ 4.3e-.86.4e+3 4.e e+3 Table : Super teaching as n changes. In the left half of Table we report the median of the following quantities over trials: ĉ n, the fraction of the training items selected for super teaching B(S) /n, and the NEOS server running time. The main result is that ĉ n for all n, which means the solver indeed selects a super-teaching set B(S) that is far

7 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu Logistic Regression Ridge Regression d ĉ n B(S) /n time (s) ĉ n B(S) /n time (s) 3.e e- 3.3e e+ 4.4e e+ 7.e e+ 8.8e e+.5e e e-.4 5.e+ 4.3e e+ 3 8.e-.58.e+ 6.4e e+ Table : Super teaching as d changes better than the original iid training set S. Therefore, MINLP is a valid algorithm for finding a super teaching set. Second, we note that the solver tends to select a large subset since the median B(S) /n /. This is interesting as it is nown that when S is dense, one can select extremely sparse super teaching sets, as small as a few items, to teach effectively 8. Understanding the different regimes remains future wor. Finally, the running time grows fast with n. For example, when n = 4 it taes around half an hour to solve (38). Future wor needs to address this bottlenec in applying MINLP to large problems. In the second set of experiments we fix n = 3 and vary d =, 4, 8, 6, 3. The left half of Table shows the results. The empirical teaching ratio ĉ n is still below in all cases, showing super teaching. But as the dimension of the problem increases ĉ n deteriorates toward. Nonetheless, even when d = n we still see a median super teaching ratio of.8; the corresponding super teaching set B(S) has only 58% training items than the dimension. It is interesting that the MINLP algorithm intentionally created a high dimensional learning problem (as in higher dimension d than selected training items B(S) ) to achieve better teaching, nowing that the learner A lr is regularized. The running time does not change dramatically. 6. Teaching Ridge Regression A rr Let X = R d, Θ = R d, θ = ( d,..., d ), p Z (x) = N (, I), p Z (y x) = N (y; x θ,.). Let the teaching ris be the parameter difference: R(ˆθ) = ˆθ θ. Given a sample S with n iid items drawn from p Z, ridge regression estimates ˆθ S = A rr (S) with λ =. and l(z i ) = (x ˆθ i y i ). The corresponding MINLP is: min ˆθ θ (39) b {,} n,ˆθ R d s.t. λˆθ + n b i (x ˆθ i y i )x i =. i= We run the same set of experiments. Tables and show the results, which are qualitatively similar to teaching logistic regression. Again, we see the empirical super teaching ratio ĉ n, indicating the presence of super teaching. (a) logistic regression (b) ridge regression Figure 3: Typical trials from the MINLP algorithm Finally, Figure 3 visualizes one typical trial each for teaching logistic regression and ridge regression. S consists of both dar and light points, while the dar ones representing B(S) optimized by MINLP. The dashed line shows ˆθ S, while the solid lines shows ˆθ B(S). The ground truth (x + x = in logistic regression, y = x in ridge regression) essentially overlaps with the solid lines. Specifically, the super taught models ˆθ B(S) have negligible riss of.5e- 4 and 3.3e-3, whereas models ˆθ S trained from the whole iid sample S incur much larger riss of.3 and.6, respectively. 7 Related Wor There has been several research threads in different communities aimed at reducing a data set while maintaining its utility. The first thread is training set reduction 9, 43, 4, which during training time prunes items in S in an attempt to improve the learned model. The second thread is coresets, 6, a summary of S such that models learned on the summary are provably competitive with models learned on the full data set S. But as they do not now the target model p Z or θ, these methods cannot truly achieve super teaching. The third thread is curriculum learning which showed that smart initialization is useful for nonconvex optimization. In contrast, our teacher can directly encode the true model and therefore obtain faster rates. The final thread is sample compression 7, where a compression function chooses a subset T S and a reconstruction function to form a hypothesis. Our present wor has some similarity with compression, which allows increased accuracy since compression bounds can be used as regularization 6. The theoretical study of machine teaching has focused on the teaching dimension, i.e. the minimum training set size needed to exactly teach a target concept θ, 38, 46, 8, 9, 45, 6, 44, 48, 9, 3, 4,, 3, 8, 7, 5, 5, 36, 3,. Most of the prior wor assumed a synthetic teaching setting where S is the whole item space, which is often unrealistic. Liu et al. considered approximate teaching in the finite S setting 3, though their analysis focused on a specific SGD learner. Our super teaching setting applies to arbitrary

8 learners, and we allow approximate teaching namely we do not require the teacher to teach exactly the target model, which is infeasible in our pool-based teaching setting with a finite S. Machine teaching applications include education 4, 33, 39, 7, 3, 34, computer security,, 3, and interactive machine learning 4,, 4. By establishing the existence of super-teaching, the present paper can guide the process of finding a more effective training set for these applications. 8 Discussions and Conclusion We presented super-teaching: when the teacher already nows the target model, she can often choose from a given training set a smaller subset that trains a learner better. We proved this for two learners, and provided an empirical algorithm based on mixed integer nonlinear programming to find a super teaching set. However, much needs to be done on the theory of super teaching. We give two counterexamples to illustrate that not all learners are super-teachable. Example (MLE of interval). Let X =, θ, where θ R +. p Z (x) = U(X). Given a n-item training set S, the MLE for θ is ˆθ S = A int (S) = max i=:n x i. The ris is defined as R(ˆθ S ) = ˆθ S θ. We show A int is not superteachable. ˆθB(S) = max xi B(S) x i max xi S x i = ˆθ S. Since ˆθ S θ, R(ˆθ B(S) ) = ˆθ B(S) θ ˆθ S θ = R(ˆθ S ). We can generalize this to a classification setting, and show that neither the least nor the greatest consistent hypothesis is not super-teachable: Example (Consistent learners). Let X = x min, x max Z be an interval over the integer grid. The hypothesis space is Θ = {a, b X : y = in a, b and outside}. θ = a, b Θ. p Z is uniform on X and noiseless y labeled according to θ. The ris R(ˆθ S ) is the size of the symmetric difference between the two intervals ˆθ S and θ, normalized by x max x min. Given a sample S, the least consistent learner A lc learns the tightest interval over positive lc items in S: ˆθ S = A lc(s) min i=:n x i, max i=:n x i. y i= y i= ˆθ S lc = if S does not contain positive items. The greatest consistent learner A gc extends the hypothesis interval in both directions as much as possible before hitting negative points in S. If S has no positive we define =, too. ˆθ gc S Proposition. Neither A lc nor A gc is super-teachable. Proof. We first show A lc is not super-teachable. Note that A lc learns the tightest interval consistent with S, thus we lc always have ˆθ S θ. Now we show that ˆθ B(S) lc lc ˆθ S is always true so that R(ˆθ S lc) R(ˆθ B(S) lc ) follows. If θ =, then trivially ˆθ lc B(S) = ˆθ lc S =. Now assume θ. If (x, ) B(S), let a, b = ˆθ B(S) lc lc. Note that ˆθ S because B(S) S and thus S has lc at least one positive point. Let ˆθ S = a, b. Now a = min{x (x, ) B(S)} min{x (x, ) S} = a, and b = max{x (x, ) B(S)} max{x (x, ) S} = b. Thus we have ˆθ B(S) lc lc ˆθ S. If (x, ) B(S), ˆθ B(S) lc = and ˆθ B(S) lc lc ˆθ S is always true. Thus ˆθ lc B(S) ˆθ lc S θ for any B and any S. The proof for A gc is similar by showing θ ˆθ gc S gc ˆθ B(S). This leads to an open question: which family of learners are super teachable? We offer a conjecture here: we speculate that MLEs (and the derived MAP estimates or regularized empirical ris minimizers) which satisfy the asymptotic normality conditions 4 are super teachable. This conjecture is motivated by its similarity to the proof in section 3. Also note that the two counterexamples are classic examples of MLE that do not satisfy the asymptotic normality conditions. Another open question concerns the optimal super-teaching subset size for a given training set of size n. For example, our result on teaching the MLE of Gaussian mean indicates that the rate improves as grows. However, our analysis only applies to a fixed. Further research is needed to identify the optimal. Acnowledgments: R.N. acnowledges support by NSF IIS and CCF P.R. is supported in part by grants NSF DMS-7596, NSF DMS-TRIPODS-7475, DARPA W9NF-6--55, ONR N and a grant from the MIT NEC Corporation. X.Z. is supported in part by NSF CCF-747, IIS-6365, CMMI-565, DGE-54548, and CCF References S. Alfeld, X. Zhu, and P. Barford. Data poisoning attacs against autoregressive models. AAAI, 6. S. Alfeld, X. Zhu, and P. Barford. Explicit defense actions against test-set attacs. In The Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 7. 3 D. Angluin. Queries revisited. Theoretical Computer Science, 33():75 94, 4. 4 D. Angluin and M. Kriis. Teachers, learners and blac boxes. COLT, D. Angluin and M. Kriis. Learning from different teachers. Machine Learning, 5():37 63, 3.

9 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu 6 O. Bachem, M. Lucic, and A. Krause. Practical Coreset Constructions for Machine Learning. ArXiv e- prints, Mar F. J. Balbach. Measuring teachability using variants of the teaching dimension. Theor. Comput. Sci., 397(- 3):94 3, 8. 8 F. J. Balbach and T. Zeugmann. Teaching randomized learners. COLT, pages 9 43, 6. 9 F. J. Balbach and T. Zeugmann. Recent developments in algorithmic teaching. In Proceedings of the 3rd International Conference on Language and Automata Theory and Applications, pages 8, 9. S. Ben-David and N. Eiron. Self-directed learning and its relation to the VC-dimension and to teacherdirected learning. Machine Learning, 33():87 4, 998. Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In L. Bottou and M. Littman, editors, Proceedings of the 6th International Conference on Machine Learning, pages 4 48, Montreal, June 9. Omnipress. M. Cama and M. Lopes. Algorithmic and human teaching of sequential decision tass. In AAAI,. 3 M. Cama and A. Thomaz. Mixed-initiative active learning. ICML Worshop on Combining Learning Strategies to Reduce Label Cost,. 4 B. Clement, P.-Y. Oudeyer, and M. Lopes. A comparison of automatic teaching strategies for heterogeneous student populations. In Educational Data Mining (EDM), 6. 5 J. Czyzy, M. P. Mesnier, and J. J. Moré. The NEOS server. IEEE Computational Science and Engineering, 5(3):68 75, T. Doliwa, G. Fan, H. U. Simon, and S. Zilles. Recursive teaching dimension, VC-dimension and sample compression. Journal of Machine Learning Research, 5:37 33, 4. 7 S. Floyd and M. Warmuth. Sample compression, learnability, and the Vapni-Chervonenis dimension. Machine learning, (3):69 34, Z. Gao, C. Ries, H. U. Simon, and S. Zilles. Preference-based teaching. Journal of Machine Learning Research, 8(3): 3, 7. 9 S. Garcia, J. Derrac, J. Cano, and F. Herrera. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):47 435,. S. Goldman and M. Kearns. On the complexity of teaching. Journal of Computer and Systems Sciences, 5(): 3, 995. S. A. Goldman and H. D. Mathias. Teaching a smarter learner. Journal of Computer and Systems Sciences, 5():55 67, 996. S. Har-Peled. Geometric approximation algorithms, volume 73. American mathematical society Boston,. 3 T. Hegedüs. Generalized teaching dimensions and the query complexity of learning. In Proceedings of the eighth Annual Conference on Computational Learning Theory (COLT), pages 8 7, F. Khan, X. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. NIPS,. 5 H. Kobayashi and A. Shinohara. Complexity of teaching by a restricted number of examples. COLT, pages 93 3, 9. 6 A. Kontorovich, S. Sabato, and R. Weiss. Nearestneighbor sample compression: Efficiency, consistency, infinite dimensions. arxiv preprint arxiv:75.884, 7. 7 R. Lindsey, M. Mozer, W. J. Huggins, and H. Pashler. Optimizing instructional policies. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 6, pages J. Liu and X. Zhu. The teaching dimension of linear learners. Journal of Machine Learning Research, 7(6): 5, 6. 9 J. Liu, X. Zhu, and H. G. Ohannessian. The teaching dimension of linear learners. In The 33rd International Conference on Machine Learning (ICML), 6. 3 W. Liu, B. Dai, J. M. Rehg, and L. Song. Iterative machine teaching. In ICML, 7. 3 H. D. Mathias. A model of interactive teaching. J. Comput. Syst. Sci., 54(3):487 5, S. Mei and X. Zhu. Using machine teaching to identify optimal training-set attacs on machine learners. AAAI, K. Patil, X. Zhu, L. Kopec, and B. C. Love. Optimal teaching for limited-capacity human learners. Advances in Neural Information Processing Systems (NIPS), 4.

10 34 A. N. Rafferty, E. Brunsill, T. L. Griffiths, and P. Shafto. Faster teaching by POMDP planning. In Proceedings of the 5th International Conference on Artificial Intelligence in Education, AIED, pages 8 87, Berlin, Heidelberg,. Springer-Verlag. 48 S. Zilles, S. Lange, R. Holte, and M. Zinevich. Models of cooperative teaching and learning. Journal of Machine Learning Research, : ,. 35 M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages ACM, R. L. Rivest and Y. L. Yin. Being taught can be faster than asing questions. COLT, S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New Yor, NY, USA, A. Shinohara and S. Miyano. Teachability in computational learning. New Generation Computing, 8(4): , A. Singla, I. Bogunovic, G. Barto, A. Karbasi, and A. Krause. Near-optimally teaching the crowd to classify. In ICML, pages 54 6, 4. 4 J. Suh, X. Zhu, and S. Amershi. The label complexity of mixed-initiative classifier training. International Conference on Machine Learning (ICML), 6. 4 H. White. Maximum lielihood estimation of misspecified models. Econometrica: Journal of the Econometric Society, pages 5, D. R. Wilson and T. R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):57 86,. 43 X. Zeng and X.-w. Chen. SMO-based pruning methods for sparse least squares support vector machines. IEEE transactions on Neural Networs, 6(6):54 546, X. Zhu. Machine teaching for bayesian learners in the exponential family. In Advances in Neural Information Processing Systems (NIPS), pages 95 93, X. Zhu. Machine teaching: an inverse problem to machine learning and an approach toward optimal education. AAAI, X. Zhu, J. Liu, and M. Lopes. No learner left behind: On the complexity of teaching multiple learners simultaneously. In The 6th International Joint Conference on Artificial Intelligence (IJCAI), X. Zhu, A. Singla, S. Zilles, and A. N. Rafferty. An Overview of Machine Teaching. ArXiv e-prints, Jan. 8.

11 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu Supplemental Material Lemma 7. For the large margin classifier ˆθ S, we have ( ɛ) n + (ɛ) n < ɛ P R(ˆθ S ) > ɛ = ( )n < ɛ < ɛ =. (9) Proof. The ris is R(ˆθ S ) = ˆθ S. Define event E : { (x, ) S (x, +) S}. P ˆθ S ɛ can be decomposed into two components depending on if E happens as (4) shows. P ˆθ S ɛ = P ˆθ S ɛ, E + P ˆθ S ɛ, E c. (4) P ˆθ S ɛ, E c = P ˆθ S ɛ E c P E c. Note that P E c = ( )n, P ˆθ S ɛ E c is if ɛ < and if ɛ = because ˆθ S = ± always holds given E c happens. Thus, P ˆθ S ɛ, E c = if ɛ < ( )n if ɛ =. Now we compute P ˆθ S ɛ, E. Let n + be the number of positive points in S. Define E i : {n + = i}. Note E i E j = if i j and E = n i= E i, thus n P ˆθ S ɛ, E = i= n P ˆθ S ɛ, E i = i= (4) P ˆθ S ɛ E i P E i. (4) P E i = Ci n( )n. Note that P ˆθ S ɛ E i = P x +x+ ɛ E i. To compute it, we first compute F x,x + (ɛ, ɛ E i ) = P x ɛ, x + ɛ E i. Given E i happens, P x ɛ E i = ( ɛ ) n i and P x + ɛ E i = ( ɛ ) i. Also since x ɛ and x + ɛ are independent given E i happens, thus Tae the derivative of F gives F x,x + (ɛ, ɛ E i ) = P x ɛ, x + ɛ E i = ( ɛ ) n i ( ɛ ) i. (43) f x,x + (ɛ, ɛ E i ) = i(n i)( ɛ ) n i ( ɛ ) i. (44) Note that ˆθ S ɛ x x ɛ. Therefore, we integrate f x,x + (ɛ, ɛ E i ) over the region ɛ ɛ ɛ to obtain P ˆθ S ɛ E i. However, note that ɛ, ɛ, thus for ɛ >, the region ɛ ɛ ɛ becomes the whole,, and the integration is. Then (4) becomes P ˆθ S ɛ, E = n i= P E i = P E = ( )n. For ɛ, by (4) we have n P ˆθ S ɛ, E = P E i i(n i)( ɛ ) n i ( ɛ ) i dɛ dɛ i= ɛ ɛ ɛ n = Ci n ( )n i(n i)( ɛ ) n i ( ɛ ) i dɛ dɛ i= = ( )n ɛ ɛ ɛ ɛ ɛ ɛ i= n Ci n i(n i)( ɛ ) n i ( ɛ ) i dɛ dɛ. (45)

12 Note that C n i i(n i) = n(n )Cn i, (45) becomes P ˆθ S ɛ, E = n(n )( n )n C n i ( ɛ ) n i ( ɛ ) i dɛ dɛ ɛ ɛ ɛ i= = n(n )( n )n C n i ( ɛ ) n i ( ɛ ) i dɛ dɛ = n(n )( )n = n(n )( )n ɛ ɛ ɛ i= ɛ ɛ ɛ,, ( ɛ ɛ ) n dɛ dɛ ( ɛ ɛ ) n dɛ dɛ ɛ ɛ >ɛ ( ɛ ɛ ) n dɛ dɛ. (46) Now we compute the two integration in (46) =,, ( ɛ ɛ ) n dɛ dɛ = n ( ɛ ɛ ) n dɛ n ( ɛ ) n ( ɛ ) n dɛ = n(n ) ( ɛ)n + n(n ) ( ɛ ) n = n n(n ). (47) For the second integration, note that it can decomposed as ɛ ɛ >ɛ ( ɛ ɛ ) n dɛ dɛ = ɛ ɛ +ɛ ( ɛ ɛ ) n dɛ dɛ + ɛ ɛ ɛ ( ɛ ɛ ) n dɛ dɛ. (48) Since the two sub-integration s are identical because the two sub regions are symmetric. We only show the computation for the first. ɛ ɛ ( ɛ ɛ ) n dɛ dɛ = n ( ɛ ɛ ) n ɛ +ɛdɛ Thus we have = ɛ +ɛ ɛ n ( ɛ ) n + n n ( ɛ ɛ) n dɛ = n(n ) ( ɛ ) n n n(n ) ( ɛ ɛ) n ɛ = n n(n ) ɛn + ( ɛ) n n(n ). = ɛ ɛ >ɛ ( ɛ ɛ ) n dɛ dɛ = n n(n ) ɛn + ( ɛ) n n(n ). Combine (47) and (5), we can compute (46) as follows. P ˆθ S ɛ, E = = n(n )( n )n n(n ) ɛ ɛ +ɛ = n n ɛ n ( ɛ) n + ( )n = ɛ n ( ɛ) n ( ɛ ɛ ) n dɛ dɛ n n(n ) (ɛn + ( ɛ) n ) + n(n ) (49) (5) (5)

13 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu Therefore we have Now combine (4) and (5) we have which is equivalent to (9). P ˆθ S ɛ, E = P ˆθ S ɛ = ɛ n ( ɛ) n if ɛ ( )n if < ɛ. (5) ɛ n ( ɛ) n if ɛ ( )n if < ɛ < if ɛ =. Lemma 9. Let n = 4m, where m is an integer. Let S be an n-item iid sample drawn from p Z. ɛ >, δ (, ), M(ɛ, δ) = max{ 3e ln 4 ln 3 δ, ( ɛ ln 3 δ ) } such that m M(ɛ, δ), P R(ˆθ Bms(S)) ɛ > δ. Proof. Let S = {x (x, ) S} and S = {x (x, ) S} respectively. Then we have S + S = 4m. Define event E : { S m S m}. Then we have (53) m P E = Ci 4m ( )4m. (54) where we rule out all possible sequences of 4m points which lead to S < m or S < m. By standard result 37 (Lemma A.5) d = Cm ( em d )d, we have i= P E ( 4em m )m ( )4m = e m 4 m ( m m )m ( e 4 )m ( e 4 )m (55) where the nd-to-last inequality follows from the fact that e ( + m )m. Note that by definition m 3e ln 4 ln 3 δ > ln 4 ln 3 δ, thus ( e 4 )m < δ 3 and P E > δ 3. Since S + S = 4m, then either S m or S m. Without loss of generality we assume S m. We then divide the interval, equally into N = m (ln 3 δ ) segments. The length of each segment is N = O( m ) as Figure 4 shows. Note that m 3e ln 4 ln 3 δ > 3e ln 3 δ, thus N 3em > em > m. 3 N - Figure 4: segments Let N o be the number of segments that are occupied by the points in S. Note that N o is a random variable. Let E be the event that N o m. Now we lower bound P E. This is a variant of the coupon collector s problem: there are N distinct coupons, and in S trials we want to collect at least m distinct coupons. Note that P E = P E c = m i= P N o = i. Let T i be the number of all possible coupon sequences of S such that S occupies exactly i segments (i.e. distinct coupons). We have Ci n ways of choosing i segments among a total of N. Also, for each choice of i segments, the number of all possible coupon sequences of S such that S fully occupies those i segments without empty is upper bounded by i S. Thus T i Ci ni S and we have Since m P N o = i = 3e ln 4 ln 3 δ > log 3 δ, S m, and N > em, thus P E c = < m i= m i= P N o = i m i= T i N S Cn i ( i N ) S. (56) C n i ( i N ) S m i= C n i ( m N )m C n i ( m N )m ( en m )m ( m N )m = ( em N )m < ( em em )m = ( )m < δ 3. (57)

14 Thus P E δ 3. Applying union bound, P E, E δ 3. Let E 3 be the following event: there exist a point x in S such that x, the flipped point, lies in the same segment as some point x in S. If E 3 happens, then x + x = x ( x ) N. Note that P E 3 P E, E, E 3 = P E 3 E, E P E, E. Now we lower bound P E 3 E, E. Given E and E happen, we have S m and N o m. Since N = m (ln 3 δ ) m (ln 3 δ ), we have P E3 c E, E = ( N o N ) S ( m N ) S ( m N )m e m N δ 3. (58) Thus, P E 3 E, E = P E3 c E, E > δ 3. P E 3 P E, E, E 3 = P E 3 E, E P E, E ( δ δ 3 )( 3 ) > δ. Thus with probability at least δ, there exist x S and x S such that x + x N. We now bound N. N = m (ln 3 δ ) m (ln 3 δ ). Therefore N m ln 3 δ. Recall by definition m ( ɛ ln 3 δ ), thus N ɛ. We now have x + x ɛ. Finally, since {s, s + } selected by teacher B ms is the most symmetric pair, it must satisfy s + s + x + x ɛ. Putting together, with probability at least δ, R(ˆθ Bms(S)) = s + s + ɛ.

Optimal Teaching for Online Perceptrons

Optimal Teaching for Online Perceptrons Xuezhou Zhang Hrag Gorune Ohannessian Ayon Sen Scott Alfeld Xiaojin Zhu Department of Computer Sciences, University of Wisconsin Madison {zhangxz1123, gorune, ayonsn,