arxiv: v1 [stat.ml] 25 Feb 2018

Size: px
Start display at page:

Download "arxiv: v1 [stat.ml] 25 Feb 2018"

Transcription

1 Yuzhe Ma Robert Nowa Philippe Rigollet Xuezhou Zhang Xiaojin Zhu University of Wisconsin-Madison Massachusetts Institute of Technology arxiv:8.8946v stat.ml 5 Feb 8 Abstract We call a learner super-teachable if a teacher can trim down an iid training set while maing the learner learn even better. We provide sharp superteaching guarantees on two learners: the maximum lielihood estimator for the mean of a Gaussian, and the large margin classifier in D. For general learners, we provide a mixed-integer nonlinear programming-based algorithm to find a super teaching set. Empirical experiments show that our algorithm is able to find good super-teaching sets for both regression and classification problems. Introduction Consider the following question: a learner receives an iid training set S drawn from a distribution parametrized by θ. There is a teacher who nows θ. Can the teacher select a subset from S so the learner estimates θ better from the subset than from S? This question is distinct from training set reduction (see e.g. 9, 43, 4) in that the teacher can use the nowledge of θ to carefully design the subset. It is, in fact, a coding problem: Can the teacher approximately encode θ using items in S for a nown decoder, which is the learner? As such, the question is not a machine learning tas but rather a machine teaching one 47,, 45. This question is relevant for several nascent applications. One application is in understanding blacbox models such as deep nets. Often observation to a blacbox model is limited to its predicted label y = θ (x) given input x. One way to interpret a blacbox model is to locally train an interpretable model with data points S labeled by the blacbox model around the region of interest 35. We, however, as for more: to reduce the size of the training set S for Proceedings of the st International Conference on Artificial Intelligence and Statistics (AISTATS) 8, Lanzarote, Spain. PMLR: Volume 84. Copyright 8 by the author(s). the local learner while maing the learner approximate the blacbox better. The reduced training set itself also serves as representative examples of local model behavior. Another application is in education. Imagine a teacher who has a teaching goal θ. This is a reasonable assumption in practice: e.g. a geology teacher has the nowledge of the actual decision boundaries between roc categories. However, the teacher is constrained to teach with a given textboo (or a set of courseware) S. To the extent that the student is quantified mathematically, the teacher wants to select pages in the textboo with the guarantee that the student learns better from those pages than from gulping the whole boo. But is the question possible? The following example says yes. Consider learning a threshold classifier on the interval,, with true threshold at θ =. Let S have n items drawn uniformly from the interval and labeled according to θ. Let the learner be a hard margin SVM, which places the estimated threshold in the middle of the inner-most pair in S with different labels: ˆθ S = (x + x + )/ where x is the largest negative training item and x + the smallest positive training item in S. It is well nown that ˆθ S θ converges at a rate of /n: the intuition being that the average space between adjacent items is O(/n). - Figure : The original training set S with n = 6 items (circles and stars; green=negative, purple=positive), and the most-symmetric training set (stars) the teacher selects. The teacher nows everything but cannot tell θ directly to the learner. Instead, it can select the most-symmetric pair in S about θ and give them to the learner as a two-item training set. We will prove later that the ris on the most symmetric pair is O(/n ), that is, learning from the selected subset surpasses learning from S. Thus we observe something interesting: the teacher can turn a larger training set S into a smaller and better subset for the midpoint classifier. We call this phenomenon super-teaching.

2 Formal Definition of Super Teaching Let Z be the data space: for unsupervised learning Z = X, while for supervised learning Z = X Y. Let p Z be the underlying distribution over Z. We tae a function view of the learner: a learner A is a function A : n=z n Θ, where Θ is the learner s hypothesis space. The notation n=z n defines the set of (potentially non-iid) training sets, namely multisets of any size whose elements are in Z. Given any training set T n=z n, we assume A returns a unique hypothesis A(T ) ˆθ T Θ. The learner s ris R(θ) for θ Θ is defined as: R(θ) = E pz l(θ(x), y), or R(θ) = θ θ. () The former is for prediction tass where l() is a loss function and θ(x) denotes the prediction on x made by model θ; the latter is for parameter estimation where we assume a realizable model p Z = p θ for some θ Θ. We now introduce a clairvoyant teacher B who has full nowledge of p Z, A, R. The teacher is also given an iid training set S = {z,..., z n } p Z. If the teacher teaches S to A, the learner will incur a ris R(A(S)) R(ˆθ S ). The teacher s goal is to judiciously select a subset B(S) S to act as a super teaching set for the learner so that R(ˆθ B(S) ) < R(ˆθ S ). Of course, to do so the teacher must utilize her nowledge of the learning tas, thus the subset is actually a function B(S, p Z, A, R). In particular, the teacher nows p Z already, and this sets our problem apart from machine learning. For readability we suppress these extra parameters in the rest of the paper. We formally define super teaching as follows. Definition (Super Teaching). B is a super teacher for learner A if δ >, N such that n N P S R(ˆθ B(S) ) c n R(ˆθ S ) > δ, () where S iid p n Z, B(S) S, and c n is a sequence we call super teaching ratio. Obviously, c n = can be trivially achieved by letting B(S) = S so we are interested in small c n. There are two fundamental questions: () Do super teachers provably exist? () How to compute a super teaching set B(S) in practice? We answer the first question positively by exhibiting super teaching on two learners: maximum lielihood estimator for the mean of a Gaussian in section 3, and D large margin classifier in section 4. Guarantees on super teaching for general learners remain future wor. Nonetheless, empirically we can find a super teaching set for many general learners: We formulate the second question as mixed-integer nonlinear programming in section 5. Empirical experiments in section 6 demonstrates that one can find a good B(S) effectively. 3 Analysis on Super Teaching for the MLE of Gaussian mean In this section, we present our first theoretical result on super teaching, when the learner A MLE is the maximum lielihood estimator (MLE) for the mean of a Gaussian. Let Z = X = R, Θ = R, p Z (x) = N (θ, ). Given a sample S of size n drawn from p Z, the learner computes the MLE for the mean: ˆθ S = A MLE (S) = n n i= x i. We define the ris as R(ˆθ S ) = ˆθ S θ. The teacher we consider is the optimal -subset teacher B, which uses the best subset of size to teach: B (S) argmin T S, T = R(ˆθ T ). (3) To build intuition, it is well-nown that the ris of A MLE under S is O(/ n) because the variance under n items shrins lie /n. Now consider =. Since the teacher B nows θ, under our setting the best teaching strategy is for her to select the item in S closest to θ, which forms the singleton teaching set B (S). One can show that with large probability this closest item is O(/n) away from θ (the central part of a Gaussian density is essentially uniform). Therefore, we already see a super teaching ratio of c n = n. More generally, our main result below shows that B achieves a super teaching ratio c n = O(n + ): Theorem. Let B be the optimal -subset teacher. ɛ (, 4 ), δ (, ), N(, ɛ, δ) such that n N(, ɛ, δ), P R(ˆθ B (S)) c n R(ˆθ S ) > δ, where c n = ɛ n + +ɛ. Toward proving the theorem, we first recall the standard rate R(ˆθ S ) n if A MLE learns from the whole training set S: Proposition. Let S be an n-item iid sample drawn from N (θ, ). ɛ >, δ (, ), N (ɛ, δ) such that n N, P n ɛ < R(ˆθ S ) < n +ɛ > δ. (4) Proof. R(ˆθ S ) = ˆθ S θ and ˆθ S θ N (, n ) = n nx π e. Let α = n ɛ and β = n +ɛ. We have P R(ˆθ S ) α = < α α n dx = α π n π n π = nx e dx π n ɛ, Remar: we introduced an auxiliary variable ɛ which controls the implicit tradeoff between c n, how much super teaching helps, and N, how soon super teaching taes effect. When ɛ the teaching ratio c n approaches O(n + ), but as we will see N(, ɛ, δ). Similarly, also affects the tradeoff: the teaching ratio is smaller as we enlarge, but N(, ɛ, δ) increases. (5)

3 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu P R(ˆθ S ) β = < = β β x n β π β nβ nπ e < β n π nx e dx = nπ = nx e dx β β n ny π e dy π n ɛ. Thus P α < R(ˆθ S ) < β = P R(ˆθ S ) α P R(ˆθ S ) β > π n ɛ. Let N (ɛ, δ) = ( δ 8 π ) ɛ, then n N, P α < R(ˆθ S ) < β > δ. We now wor out the ris of A MLE if it learns from the optimal -subset teacher B. Theorem 4 says that this ris is very small and sharply concentrated around R(ˆθ B (S)) n. To prove Theorem 4, we first give the following lemma. ( ) n Lemma 3. Denote C n =. Let the index set I = {,,...n} where n 4. Consider all subsets of size, then there are at most 4 C Cn ordered pairs of subsets that are overlapping but not identical. Proof. Let I and I be two subsets of size and they overlap on t indexes. Then the total number of distinct indexes that appear in I I is t. There are C t n ways of choosing such t indexes. Next we determine which t indexes are overlapping ones. We have Ct t ways of choosing such t indexes. Finally we have C t t ways of selecting half of the non-overlapping indexes and attribute them to I. Thus in total we have O t = C t n C t t C t t ordered pairs of subsets that overlap on t indexes. By our assumption n 4 we have C t n Cn. Also note that C t t < Ct and C t t < C, thus O t < C n C t C. Therefore the total number of ordered pairs of subsets that are overlapping but not identical is O = O t < C C n t C < t= t= t= C C n t C = 4 C C. n (6) Now we prove the ris of the optimal -subset teacher. Theorem 4. Let B be the optimal -subset teacher. Let S be an n-item iid sample drawn from N (θ, ). ɛ (7) (, ), δ (, ), N (, ɛ, δ) such that n N, P ( n )+ɛ < R(ˆθ B (S)) < ( n ) ɛ > δ. Proof. Let I {,,..., n} and I =, define γ I = i I (x i θ ). Let S I denote the subset indexed by I. Note that ˆθ SI = i I x i and R(ˆθ SI ) = ˆθ SI θ = i I x i θ = γ I. Also note that R(ˆθ B (S)) = inf I R(ˆθ SI ) = inf I γ I. Thus to prove Theorem 4 it suffices to prove P ( n )+ɛ < inf γ I < ( I n ) ɛ. (9) Let α = ( n )+ɛ and β = ( n ) ɛ. We first prove the lower bound. Note that γ I has the same distribution for all I. Thus by the union bound, P inf I (8) γ I α = P I : γ I α C n P γ I α, () where I = {,,..., }. Since γ I N (, ), we have P γ I α = α α Note that C n ( en ). Thus, P inf I e x dx < α. () π π γ I α < ( en ) π α = π e ( n )ɛ. () Thus N (, ɛ, δ) such that n N, P inf γ I α < δ I. (3) To show the upper bound, we define t I = γ I < β, where is the indicator function. Let T = I t I. Then it suffices to show lim n P T = =. Note that P T = = P T E T = E T P (T E T ) (E T ) V T (E T ), (4) where the last inequality follows from the Marov inequality. Now we lower bound E T. E T = E t I = E t I = C n E t I I I (5) β = C n P γ I < β = C n e x dx. π Note that ɛ <, thus β <. For x ( β, β), π e x > β π e = πe. Also note that C n > ( n ), thus E T > ( n ) πe β = πe (n )ɛ. (6)

4 Now we upper bound V T. V T = I,I Cov t I, t I = I,I, I I Cov t I, t I. (7) Note that for Bernoulli random variable t I, V t I E t I. Thus if I = I, then Cov t I, t I = V t I E t I = P γ I < β β = e x dx < β = π π π ( n ) ɛ. β (8) Otherwise I I, that is, I and I overlap but not identical, then Cov t I, t I = E t I t I E t I E t I E t I t I = P γ I < β, γ I < β. (9) Note that γ I and γ I are jointly Gaussian with covariance Cov γ I, γ I = Cov x i θ, x i θ i I,i I = I I = ρ, i I,i I,i=i () where ρ. The joint PDF of two standard normal distributions x, y with covariance ρ is f(x, y) = Note that f(x, y) π ρ, thus P γ I < β, γ I < β = π x ρxy+y ρ e ( ρ ). () x <β, y <β π ρ (β) = π ρ β. Since π ρ π ρ dxdy π ( = ) π π P γ I < β, γ I < β β π (), thus = π ( n ) ɛ. According to Lemma 3, there are at most 4 C of I and I such that I I. Thus, V T = V t I + Cov t I, t I I I I, I I (3) Cn pairs C n π ( n ) ɛ + 4 C C n π ( n ) ɛ π (en ) ( n ) ɛ + 4 C en ( ) π ( n ) ɛ = π e ( n )ɛ + 4 π C ( e ) ( n )ɛ. (4) Now plug (4) and (6) into (4), we have P T = a ()( n ) ɛ + a ()( n ), (5) where a = π e+ and a () = ec ( e ). Thus N (, ɛ, δ) such that n N P inf I, γ I β < δ. (6) Let N (, ɛ, δ) = max{n (, ɛ, δ), N (, ɛ, δ)}, combining (3) and (6) concludes the proof. Now we can conclude super-teaching by comparing Theorem 4 and Proposition : Proof of Theorem. Let α = ( n ) ɛ and β = n ɛ. By Proposition, ɛ (, 4 ), δ (, ), N (ɛ, δ ) such that n N, P R(ˆθ S ) > β > δ. By Theorem 4, N (, ɛ, δ ) such that n N, P R(ˆθ B (S)) < α > δ. Let c n = ɛ n + +ɛ. Since ɛ < 4, c n is a decreasing sequence in n with lim n c n =. Let N 3 (, ɛ) be the first integer such that c N3. Let N(, ɛ, δ) = max{n (ɛ, δ ), N (, ɛ, δ ), N 3(, ɛ)}. By a union bound n N(, ɛ, δ), P Since α β = c n, we have P δ, where c n c N3. R(ˆθ B (S)) < α, R(ˆθ S ) β R(ˆθ B (S)) c n R(ˆθ S ) > δ. > 4 Analysis on Super Teaching for D Large Margin Classifier We present our second theoretical result, this time on teaching a D large margin classifier. Let X =,, Y = {, }, Θ =,, θ =, p Z (x, y) = p Z (x)p Z (y x) where p Z (x) = U(X) and p Z (y = x) = x θ. Let x max i:yi= x i and x + min i:yi=+ x i be the inner-most pair of opposite labels in S if they exist. We formally define the large margin classifier A lm (S) as (x + x + )/ if x, x + exist ˆθ S = A lm (S) = if S all positive if S all negative. (7) The ris is defined as R(ˆθ S ) = ˆθ S θ = ˆθ S. The teacher we consider is the most symmetric teacher, who selects the most symmetric pair about θ in S and gives it to the learner. We define the most-symmetric teacher B ms : { {(s, ), (s B ms (S) = +, )} if s, s + exist, {(x, y )} otherwise. (8)

5 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu where (s, s + ) argmin (x, ),(x,) S x+x θ. Our main result shows that learning from the whole set S achieves the well-nown O(/n) ris, but surprisingly B ms achieves O(/n ) ris, therefore it is an approximately c n = O(n ) super teaching ratio. Theorem 5. Let S be an n-item iid sample drawn from p Z. Then δ (, ), N(δ) such that n N, P R(ˆθ Bms(S)) c n R(ˆθ S ) > δ, where c n = 3 nδ ln 6 δ. Before proving Theorem 5, we first show that B ms is an optimal teacher for the large margin classifier. Proposition 6. B ms is an optimal teacher for the large margin classifier ˆθ S. Proof. We show R(ˆθ Bms(S)) R(ˆθ B(S) ) for any B and any S. If B ms (S) =, then S is either all positive or all negative. In both cases R(ˆθ B(S) ) = for any B by definition. Thus R(ˆθ Bms(S)) R(ˆθ B(S) ). Otherwise B ms (S) =, then if B(S) is all positive or all negative, we have R(ˆθ B(S) ) = and thus R(ˆθ Bms(S)) R(ˆθ B(S) ). Otherwise let x B, x B + be the inner most pair of B(S). Since x B, x B + S, then by definition of B ms, R(ˆθ Bms(S)) = s +s+ θ xb +xb + θ = R(ˆθ B(S) ). Now we wor out the ris of the most symmetric teacher B ms. To bound the ris of B ms we need the following ey lemma, which shows that the sample complexity with the teacher is O(ɛ / ). Lemma 9. Let n = 4m, where m is an integer. Let S be an n-item iid sample drawn from p Z. ɛ >, δ (, ), M(ɛ, δ) = max{ 3e ln 4 ln 3 δ, ( ɛ ln 3 δ ) } such that m M(ɛ, δ), P R(ˆθ Bms(S)) ɛ > δ. Proof. We give a proof setch and the details are in the appendix. Let S = {x (x, ) S} and S = {x (x, ) S} respectively. Then we have S + S = 4m. Define event E : { S m S m}. Given that m 3e ln 4 ln 3 δ, one can show P (E ) > δ 3. Since S + S = 4m, either S m or S m. Without loss of generality we assume S m. We then divide the interval, equally into N = m (ln 3 δ ) segments. The length of each segment is N = O( m ) as Figure shows. 3 N - Figure : segments Now we show that learning on the whole S incurs O(n ) ris. First, we give the following lemma for the exact tail probability of R(ˆθ S ). Lemma 7. For the large margin classifier ˆθ S, we have ( ɛ) n + (ɛ) n < ɛ P R(ˆθ S ) > ɛ = ( )n < ɛ < (9) ɛ =. The proof for Lemma 7 is in the appendix. Now we show that R(ˆθ S ) is O(n ). Theorem 8. Let S be an n-item iid sample drawn from p Z. Then δ (, ) and n, P R(ˆθ S ) > δ > δ. (3) n Proof. According to Lemma 7, for ɛ, we have P R(ˆθ S ) > ɛ > ( ɛ) n > nɛ. (3) Note that n, thus δ n. Let ɛ = δ n in (3) we have P R(ˆθ S ) > δ > n δ = δ. (3) n n Let N o be the number of segments that are occupied by the points in S. Note that N o is a random variable. Let E be the event that N o m. Then one can show P (E ) > δ 3. By union bound, we have P (E, E ) > δ 3. Let E 3 be the following event: there exist a point x in S such that x, the flipped point, lies in the same segment as some point x in S. One can show that P (E 3 E, E ) > δ 3. Thus P (E 3) P (E, E, E 3 ) = P (E 3 E, E )P (E, E ) ( δ δ 3 )( 3 ) > δ. If E 3 happens, then x + x = x ( x ). Note that m ( ɛ ln 3 δ ) and N = m (ln 3 δ ) m (ln 3 δ ), thus N m ln 3 δ ɛ. Therefore R(ˆθ Bms(S)) = s +s+ x+x ɛ. Rewriting ɛ in Lemma 9 as a function of n, we have the following theorem. Theorem. Let S be an n-item iid sample dawn from p Z, then N (δ) = e ln 4 ln 3 δ such that n N, P R(ˆθ Bms(S)) 6 n ln 3 > δ. (33) δ Proof. Note that if n N (δ) = ln 4 ln 3 δ, then m = n 4 3e ln 4 ln 3 δ, thus the minimum ɛ that satisfies m M(ɛ, δ) is m ln 3 δ = 6 n ln 3 δ. e N

6 Now we can conclude super teaching: Proof of Theorem 5. According to Theorem, N ( δ ) such that n N, P R(ˆθ Bms(S)) 6 n ln 6 δ > δ. Note that N, thus according to Theorem 8, n N, P R(ˆθ S ) > δ n > δ. Let c n = 3 nδ ln 6 δ and N (δ) = 3 δ ln 6 δ so that c N =. Let N(δ) = max{n (δ), N (δ)}. By union bound, n N, with probability at least δ, we have both R(ˆθ S ) > δ n and R(ˆθ Bms(S)) 6 n ln 6 δ, which gives P R(ˆθ Bms(S)) c n R(ˆθ S ) > δ, where c n c N =. 5 An MINLP Algorithm for Super Teaching Although the problem of proving super teaching ratios for a specific learner is interesting, we now focus on an algorithm to find a super teaching set for general learners given a training set S. That is, we find a subset B(S) S so that R(ˆθ B(S) ) < R(ˆθ S ). We start by formulating super teaching as a subset selection problem. To this end, we introduce binary indicator variables b,..., b n where b i = means z i S is included in the subset. We consider learners A that can be defined via convex empirical ris minimization: A(S) argmin θ Θ n i= l(θ, z i ) + λ θ. (34) For simplicity we assume there is a unique global minimum which is returned by argmin. Note that we use l in (34) to denote the (surrogate) convex loss used by A in performing empirical ris minimization. For example, l may be the negative log lielihood for logistic regression. l is potentially different from l (e.g. the - loss) used by the teacher to define the teaching ris R in (). We formulate super teaching as the following bilevel combinatorial optimization problem: min b {,} n,ˆθ Θ R(ˆθ) (35) s.t. ˆθ n = argminθ Θ b i l(θ, zi ) + λ θ. (36) i= Under mild conditions, we may replace the lower level optimization problem (i.e. the machine learning problem (36)) by its first order optimality (KKT) conditions: min b {,} n,ˆθ Θ s.t. R(ˆθ) (37) n b i θ l(ˆθ, zi ) + λˆθ =. i= This reduces the bilevel problem but the constraint is nonlinear in general, leading to a mixed-integer nonlinear program (MINLP), for which effective solvers exist. We use the MINLP solver in NEOS 5. 6 Simulations We now apply the framewor in section 5 to logistic regression and ridge regression, and show that the solver indeed selects a super-teaching subset that is far better than the original training set S. 6. Teaching Logistic Regression A lr Let X = R d, Θ = R d, θ = ( d,..., d ), p Z (x) = N (, I). Let p Z (y x) = x θ >, which is deterministic given x. Logistic regression estimates ˆθ S = A lr (S) with (34), where λ =. and l(z i ) = log( + exp( y i x i θ)). In contrast, The teacher s ris is defined to be the expected - loss: R(ˆθ) = E pz ˆθ(x) y, where ˆθ(x) is the label of x predicted by ˆθ. Since p Z is symmetric about the origin, the ris can be rewritten in terms of the angle between ˆθ and θ ˆθ : R(ˆθ) = arccos( θ )/π. ˆθ θ Instantiating (37) we have min b {,} n,ˆθ R d s.t. ˆθ θ arccos( )/π (38) ˆθ θ n b i y i x i λˆθ =. + exp(y i x i ˆθ) i= We run experiments to study the effectiveness and scalability of the NEOS MINLP solver on (38), specifically with respect to the training set size n = S and dimension d. In the first set of experiments we fix d = and vary n = 6, 64, 56 and 4. For each n we run trials. In each trial we draw an n-item iid sample S p Z and call the solver on (38). The solver s solution to b... b n indicates the super teaching set B(S). We then compute an empirical version of the super teaching ratio: ĉ n = R(ˆθ B(S) )/R(ˆθ S ). Logistic Regression Ridge Regression n = S ĉ n B(S) /n time (s) ĉ n B(S) /n time (s) 6 8.5e e- 7.8e e- 64.3e e+ 7.5e e e e+ 5.6e e+ 4.3e-.86.4e+3 4.e e+3 Table : Super teaching as n changes. In the left half of Table we report the median of the following quantities over trials: ĉ n, the fraction of the training items selected for super teaching B(S) /n, and the NEOS server running time. The main result is that ĉ n for all n, which means the solver indeed selects a super-teaching set B(S) that is far

7 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu Logistic Regression Ridge Regression d ĉ n B(S) /n time (s) ĉ n B(S) /n time (s) 3.e e- 3.3e e+ 4.4e e+ 7.e e+ 8.8e e+.5e e e-.4 5.e+ 4.3e e+ 3 8.e-.58.e+ 6.4e e+ Table : Super teaching as d changes better than the original iid training set S. Therefore, MINLP is a valid algorithm for finding a super teaching set. Second, we note that the solver tends to select a large subset since the median B(S) /n /. This is interesting as it is nown that when S is dense, one can select extremely sparse super teaching sets, as small as a few items, to teach effectively 8. Understanding the different regimes remains future wor. Finally, the running time grows fast with n. For example, when n = 4 it taes around half an hour to solve (38). Future wor needs to address this bottlenec in applying MINLP to large problems. In the second set of experiments we fix n = 3 and vary d =, 4, 8, 6, 3. The left half of Table shows the results. The empirical teaching ratio ĉ n is still below in all cases, showing super teaching. But as the dimension of the problem increases ĉ n deteriorates toward. Nonetheless, even when d = n we still see a median super teaching ratio of.8; the corresponding super teaching set B(S) has only 58% training items than the dimension. It is interesting that the MINLP algorithm intentionally created a high dimensional learning problem (as in higher dimension d than selected training items B(S) ) to achieve better teaching, nowing that the learner A lr is regularized. The running time does not change dramatically. 6. Teaching Ridge Regression A rr Let X = R d, Θ = R d, θ = ( d,..., d ), p Z (x) = N (, I), p Z (y x) = N (y; x θ,.). Let the teaching ris be the parameter difference: R(ˆθ) = ˆθ θ. Given a sample S with n iid items drawn from p Z, ridge regression estimates ˆθ S = A rr (S) with λ =. and l(z i ) = (x ˆθ i y i ). The corresponding MINLP is: min ˆθ θ (39) b {,} n,ˆθ R d s.t. λˆθ + n b i (x ˆθ i y i )x i =. i= We run the same set of experiments. Tables and show the results, which are qualitatively similar to teaching logistic regression. Again, we see the empirical super teaching ratio ĉ n, indicating the presence of super teaching. (a) logistic regression (b) ridge regression Figure 3: Typical trials from the MINLP algorithm Finally, Figure 3 visualizes one typical trial each for teaching logistic regression and ridge regression. S consists of both dar and light points, while the dar ones representing B(S) optimized by MINLP. The dashed line shows ˆθ S, while the solid lines shows ˆθ B(S). The ground truth (x + x = in logistic regression, y = x in ridge regression) essentially overlaps with the solid lines. Specifically, the super taught models ˆθ B(S) have negligible riss of.5e- 4 and 3.3e-3, whereas models ˆθ S trained from the whole iid sample S incur much larger riss of.3 and.6, respectively. 7 Related Wor There has been several research threads in different communities aimed at reducing a data set while maintaining its utility. The first thread is training set reduction 9, 43, 4, which during training time prunes items in S in an attempt to improve the learned model. The second thread is coresets, 6, a summary of S such that models learned on the summary are provably competitive with models learned on the full data set S. But as they do not now the target model p Z or θ, these methods cannot truly achieve super teaching. The third thread is curriculum learning which showed that smart initialization is useful for nonconvex optimization. In contrast, our teacher can directly encode the true model and therefore obtain faster rates. The final thread is sample compression 7, where a compression function chooses a subset T S and a reconstruction function to form a hypothesis. Our present wor has some similarity with compression, which allows increased accuracy since compression bounds can be used as regularization 6. The theoretical study of machine teaching has focused on the teaching dimension, i.e. the minimum training set size needed to exactly teach a target concept θ, 38, 46, 8, 9, 45, 6, 44, 48, 9, 3, 4,, 3, 8, 7, 5, 5, 36, 3,. Most of the prior wor assumed a synthetic teaching setting where S is the whole item space, which is often unrealistic. Liu et al. considered approximate teaching in the finite S setting 3, though their analysis focused on a specific SGD learner. Our super teaching setting applies to arbitrary

8 learners, and we allow approximate teaching namely we do not require the teacher to teach exactly the target model, which is infeasible in our pool-based teaching setting with a finite S. Machine teaching applications include education 4, 33, 39, 7, 3, 34, computer security,, 3, and interactive machine learning 4,, 4. By establishing the existence of super-teaching, the present paper can guide the process of finding a more effective training set for these applications. 8 Discussions and Conclusion We presented super-teaching: when the teacher already nows the target model, she can often choose from a given training set a smaller subset that trains a learner better. We proved this for two learners, and provided an empirical algorithm based on mixed integer nonlinear programming to find a super teaching set. However, much needs to be done on the theory of super teaching. We give two counterexamples to illustrate that not all learners are super-teachable. Example (MLE of interval). Let X =, θ, where θ R +. p Z (x) = U(X). Given a n-item training set S, the MLE for θ is ˆθ S = A int (S) = max i=:n x i. The ris is defined as R(ˆθ S ) = ˆθ S θ. We show A int is not superteachable. ˆθB(S) = max xi B(S) x i max xi S x i = ˆθ S. Since ˆθ S θ, R(ˆθ B(S) ) = ˆθ B(S) θ ˆθ S θ = R(ˆθ S ). We can generalize this to a classification setting, and show that neither the least nor the greatest consistent hypothesis is not super-teachable: Example (Consistent learners). Let X = x min, x max Z be an interval over the integer grid. The hypothesis space is Θ = {a, b X : y = in a, b and outside}. θ = a, b Θ. p Z is uniform on X and noiseless y labeled according to θ. The ris R(ˆθ S ) is the size of the symmetric difference between the two intervals ˆθ S and θ, normalized by x max x min. Given a sample S, the least consistent learner A lc learns the tightest interval over positive lc items in S: ˆθ S = A lc(s) min i=:n x i, max i=:n x i. y i= y i= ˆθ S lc = if S does not contain positive items. The greatest consistent learner A gc extends the hypothesis interval in both directions as much as possible before hitting negative points in S. If S has no positive we define =, too. ˆθ gc S Proposition. Neither A lc nor A gc is super-teachable. Proof. We first show A lc is not super-teachable. Note that A lc learns the tightest interval consistent with S, thus we lc always have ˆθ S θ. Now we show that ˆθ B(S) lc lc ˆθ S is always true so that R(ˆθ S lc) R(ˆθ B(S) lc ) follows. If θ =, then trivially ˆθ lc B(S) = ˆθ lc S =. Now assume θ. If (x, ) B(S), let a, b = ˆθ B(S) lc lc. Note that ˆθ S because B(S) S and thus S has lc at least one positive point. Let ˆθ S = a, b. Now a = min{x (x, ) B(S)} min{x (x, ) S} = a, and b = max{x (x, ) B(S)} max{x (x, ) S} = b. Thus we have ˆθ B(S) lc lc ˆθ S. If (x, ) B(S), ˆθ B(S) lc = and ˆθ B(S) lc lc ˆθ S is always true. Thus ˆθ lc B(S) ˆθ lc S θ for any B and any S. The proof for A gc is similar by showing θ ˆθ gc S gc ˆθ B(S). This leads to an open question: which family of learners are super teachable? We offer a conjecture here: we speculate that MLEs (and the derived MAP estimates or regularized empirical ris minimizers) which satisfy the asymptotic normality conditions 4 are super teachable. This conjecture is motivated by its similarity to the proof in section 3. Also note that the two counterexamples are classic examples of MLE that do not satisfy the asymptotic normality conditions. Another open question concerns the optimal super-teaching subset size for a given training set of size n. For example, our result on teaching the MLE of Gaussian mean indicates that the rate improves as grows. However, our analysis only applies to a fixed. Further research is needed to identify the optimal. Acnowledgments: R.N. acnowledges support by NSF IIS and CCF P.R. is supported in part by grants NSF DMS-7596, NSF DMS-TRIPODS-7475, DARPA W9NF-6--55, ONR N and a grant from the MIT NEC Corporation. X.Z. is supported in part by NSF CCF-747, IIS-6365, CMMI-565, DGE-54548, and CCF References S. Alfeld, X. Zhu, and P. Barford. Data poisoning attacs against autoregressive models. AAAI, 6. S. Alfeld, X. Zhu, and P. Barford. Explicit defense actions against test-set attacs. In The Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 7. 3 D. Angluin. Queries revisited. Theoretical Computer Science, 33():75 94, 4. 4 D. Angluin and M. Kriis. Teachers, learners and blac boxes. COLT, D. Angluin and M. Kriis. Learning from different teachers. Machine Learning, 5():37 63, 3.

9 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu 6 O. Bachem, M. Lucic, and A. Krause. Practical Coreset Constructions for Machine Learning. ArXiv e- prints, Mar F. J. Balbach. Measuring teachability using variants of the teaching dimension. Theor. Comput. Sci., 397(- 3):94 3, 8. 8 F. J. Balbach and T. Zeugmann. Teaching randomized learners. COLT, pages 9 43, 6. 9 F. J. Balbach and T. Zeugmann. Recent developments in algorithmic teaching. In Proceedings of the 3rd International Conference on Language and Automata Theory and Applications, pages 8, 9. S. Ben-David and N. Eiron. Self-directed learning and its relation to the VC-dimension and to teacherdirected learning. Machine Learning, 33():87 4, 998. Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In L. Bottou and M. Littman, editors, Proceedings of the 6th International Conference on Machine Learning, pages 4 48, Montreal, June 9. Omnipress. M. Cama and M. Lopes. Algorithmic and human teaching of sequential decision tass. In AAAI,. 3 M. Cama and A. Thomaz. Mixed-initiative active learning. ICML Worshop on Combining Learning Strategies to Reduce Label Cost,. 4 B. Clement, P.-Y. Oudeyer, and M. Lopes. A comparison of automatic teaching strategies for heterogeneous student populations. In Educational Data Mining (EDM), 6. 5 J. Czyzy, M. P. Mesnier, and J. J. Moré. The NEOS server. IEEE Computational Science and Engineering, 5(3):68 75, T. Doliwa, G. Fan, H. U. Simon, and S. Zilles. Recursive teaching dimension, VC-dimension and sample compression. Journal of Machine Learning Research, 5:37 33, 4. 7 S. Floyd and M. Warmuth. Sample compression, learnability, and the Vapni-Chervonenis dimension. Machine learning, (3):69 34, Z. Gao, C. Ries, H. U. Simon, and S. Zilles. Preference-based teaching. Journal of Machine Learning Research, 8(3): 3, 7. 9 S. Garcia, J. Derrac, J. Cano, and F. Herrera. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):47 435,. S. Goldman and M. Kearns. On the complexity of teaching. Journal of Computer and Systems Sciences, 5(): 3, 995. S. A. Goldman and H. D. Mathias. Teaching a smarter learner. Journal of Computer and Systems Sciences, 5():55 67, 996. S. Har-Peled. Geometric approximation algorithms, volume 73. American mathematical society Boston,. 3 T. Hegedüs. Generalized teaching dimensions and the query complexity of learning. In Proceedings of the eighth Annual Conference on Computational Learning Theory (COLT), pages 8 7, F. Khan, X. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. NIPS,. 5 H. Kobayashi and A. Shinohara. Complexity of teaching by a restricted number of examples. COLT, pages 93 3, 9. 6 A. Kontorovich, S. Sabato, and R. Weiss. Nearestneighbor sample compression: Efficiency, consistency, infinite dimensions. arxiv preprint arxiv:75.884, 7. 7 R. Lindsey, M. Mozer, W. J. Huggins, and H. Pashler. Optimizing instructional policies. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 6, pages J. Liu and X. Zhu. The teaching dimension of linear learners. Journal of Machine Learning Research, 7(6): 5, 6. 9 J. Liu, X. Zhu, and H. G. Ohannessian. The teaching dimension of linear learners. In The 33rd International Conference on Machine Learning (ICML), 6. 3 W. Liu, B. Dai, J. M. Rehg, and L. Song. Iterative machine teaching. In ICML, 7. 3 H. D. Mathias. A model of interactive teaching. J. Comput. Syst. Sci., 54(3):487 5, S. Mei and X. Zhu. Using machine teaching to identify optimal training-set attacs on machine learners. AAAI, K. Patil, X. Zhu, L. Kopec, and B. C. Love. Optimal teaching for limited-capacity human learners. Advances in Neural Information Processing Systems (NIPS), 4.

10 34 A. N. Rafferty, E. Brunsill, T. L. Griffiths, and P. Shafto. Faster teaching by POMDP planning. In Proceedings of the 5th International Conference on Artificial Intelligence in Education, AIED, pages 8 87, Berlin, Heidelberg,. Springer-Verlag. 48 S. Zilles, S. Lange, R. Holte, and M. Zinevich. Models of cooperative teaching and learning. Journal of Machine Learning Research, : ,. 35 M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages ACM, R. L. Rivest and Y. L. Yin. Being taught can be faster than asing questions. COLT, S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New Yor, NY, USA, A. Shinohara and S. Miyano. Teachability in computational learning. New Generation Computing, 8(4): , A. Singla, I. Bogunovic, G. Barto, A. Karbasi, and A. Krause. Near-optimally teaching the crowd to classify. In ICML, pages 54 6, 4. 4 J. Suh, X. Zhu, and S. Amershi. The label complexity of mixed-initiative classifier training. International Conference on Machine Learning (ICML), 6. 4 H. White. Maximum lielihood estimation of misspecified models. Econometrica: Journal of the Econometric Society, pages 5, D. R. Wilson and T. R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):57 86,. 43 X. Zeng and X.-w. Chen. SMO-based pruning methods for sparse least squares support vector machines. IEEE transactions on Neural Networs, 6(6):54 546, X. Zhu. Machine teaching for bayesian learners in the exponential family. In Advances in Neural Information Processing Systems (NIPS), pages 95 93, X. Zhu. Machine teaching: an inverse problem to machine learning and an approach toward optimal education. AAAI, X. Zhu, J. Liu, and M. Lopes. No learner left behind: On the complexity of teaching multiple learners simultaneously. In The 6th International Joint Conference on Artificial Intelligence (IJCAI), X. Zhu, A. Singla, S. Zilles, and A. N. Rafferty. An Overview of Machine Teaching. ArXiv e-prints, Jan. 8.

11 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu Supplemental Material Lemma 7. For the large margin classifier ˆθ S, we have ( ɛ) n + (ɛ) n < ɛ P R(ˆθ S ) > ɛ = ( )n < ɛ < ɛ =. (9) Proof. The ris is R(ˆθ S ) = ˆθ S. Define event E : { (x, ) S (x, +) S}. P ˆθ S ɛ can be decomposed into two components depending on if E happens as (4) shows. P ˆθ S ɛ = P ˆθ S ɛ, E + P ˆθ S ɛ, E c. (4) P ˆθ S ɛ, E c = P ˆθ S ɛ E c P E c. Note that P E c = ( )n, P ˆθ S ɛ E c is if ɛ < and if ɛ = because ˆθ S = ± always holds given E c happens. Thus, P ˆθ S ɛ, E c = if ɛ < ( )n if ɛ =. Now we compute P ˆθ S ɛ, E. Let n + be the number of positive points in S. Define E i : {n + = i}. Note E i E j = if i j and E = n i= E i, thus n P ˆθ S ɛ, E = i= n P ˆθ S ɛ, E i = i= (4) P ˆθ S ɛ E i P E i. (4) P E i = Ci n( )n. Note that P ˆθ S ɛ E i = P x +x+ ɛ E i. To compute it, we first compute F x,x + (ɛ, ɛ E i ) = P x ɛ, x + ɛ E i. Given E i happens, P x ɛ E i = ( ɛ ) n i and P x + ɛ E i = ( ɛ ) i. Also since x ɛ and x + ɛ are independent given E i happens, thus Tae the derivative of F gives F x,x + (ɛ, ɛ E i ) = P x ɛ, x + ɛ E i = ( ɛ ) n i ( ɛ ) i. (43) f x,x + (ɛ, ɛ E i ) = i(n i)( ɛ ) n i ( ɛ ) i. (44) Note that ˆθ S ɛ x x ɛ. Therefore, we integrate f x,x + (ɛ, ɛ E i ) over the region ɛ ɛ ɛ to obtain P ˆθ S ɛ E i. However, note that ɛ, ɛ, thus for ɛ >, the region ɛ ɛ ɛ becomes the whole,, and the integration is. Then (4) becomes P ˆθ S ɛ, E = n i= P E i = P E = ( )n. For ɛ, by (4) we have n P ˆθ S ɛ, E = P E i i(n i)( ɛ ) n i ( ɛ ) i dɛ dɛ i= ɛ ɛ ɛ n = Ci n ( )n i(n i)( ɛ ) n i ( ɛ ) i dɛ dɛ i= = ( )n ɛ ɛ ɛ ɛ ɛ ɛ i= n Ci n i(n i)( ɛ ) n i ( ɛ ) i dɛ dɛ. (45)

12 Note that C n i i(n i) = n(n )Cn i, (45) becomes P ˆθ S ɛ, E = n(n )( n )n C n i ( ɛ ) n i ( ɛ ) i dɛ dɛ ɛ ɛ ɛ i= = n(n )( n )n C n i ( ɛ ) n i ( ɛ ) i dɛ dɛ = n(n )( )n = n(n )( )n ɛ ɛ ɛ i= ɛ ɛ ɛ,, ( ɛ ɛ ) n dɛ dɛ ( ɛ ɛ ) n dɛ dɛ ɛ ɛ >ɛ ( ɛ ɛ ) n dɛ dɛ. (46) Now we compute the two integration in (46) =,, ( ɛ ɛ ) n dɛ dɛ = n ( ɛ ɛ ) n dɛ n ( ɛ ) n ( ɛ ) n dɛ = n(n ) ( ɛ)n + n(n ) ( ɛ ) n = n n(n ). (47) For the second integration, note that it can decomposed as ɛ ɛ >ɛ ( ɛ ɛ ) n dɛ dɛ = ɛ ɛ +ɛ ( ɛ ɛ ) n dɛ dɛ + ɛ ɛ ɛ ( ɛ ɛ ) n dɛ dɛ. (48) Since the two sub-integration s are identical because the two sub regions are symmetric. We only show the computation for the first. ɛ ɛ ( ɛ ɛ ) n dɛ dɛ = n ( ɛ ɛ ) n ɛ +ɛdɛ Thus we have = ɛ +ɛ ɛ n ( ɛ ) n + n n ( ɛ ɛ) n dɛ = n(n ) ( ɛ ) n n n(n ) ( ɛ ɛ) n ɛ = n n(n ) ɛn + ( ɛ) n n(n ). = ɛ ɛ >ɛ ( ɛ ɛ ) n dɛ dɛ = n n(n ) ɛn + ( ɛ) n n(n ). Combine (47) and (5), we can compute (46) as follows. P ˆθ S ɛ, E = = n(n )( n )n n(n ) ɛ ɛ +ɛ = n n ɛ n ( ɛ) n + ( )n = ɛ n ( ɛ) n ( ɛ ɛ ) n dɛ dɛ n n(n ) (ɛn + ( ɛ) n ) + n(n ) (49) (5) (5)

13 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu Therefore we have Now combine (4) and (5) we have which is equivalent to (9). P ˆθ S ɛ, E = P ˆθ S ɛ = ɛ n ( ɛ) n if ɛ ( )n if < ɛ. (5) ɛ n ( ɛ) n if ɛ ( )n if < ɛ < if ɛ =. Lemma 9. Let n = 4m, where m is an integer. Let S be an n-item iid sample drawn from p Z. ɛ >, δ (, ), M(ɛ, δ) = max{ 3e ln 4 ln 3 δ, ( ɛ ln 3 δ ) } such that m M(ɛ, δ), P R(ˆθ Bms(S)) ɛ > δ. Proof. Let S = {x (x, ) S} and S = {x (x, ) S} respectively. Then we have S + S = 4m. Define event E : { S m S m}. Then we have (53) m P E = Ci 4m ( )4m. (54) where we rule out all possible sequences of 4m points which lead to S < m or S < m. By standard result 37 (Lemma A.5) d = Cm ( em d )d, we have i= P E ( 4em m )m ( )4m = e m 4 m ( m m )m ( e 4 )m ( e 4 )m (55) where the nd-to-last inequality follows from the fact that e ( + m )m. Note that by definition m 3e ln 4 ln 3 δ > ln 4 ln 3 δ, thus ( e 4 )m < δ 3 and P E > δ 3. Since S + S = 4m, then either S m or S m. Without loss of generality we assume S m. We then divide the interval, equally into N = m (ln 3 δ ) segments. The length of each segment is N = O( m ) as Figure 4 shows. Note that m 3e ln 4 ln 3 δ > 3e ln 3 δ, thus N 3em > em > m. 3 N - Figure 4: segments Let N o be the number of segments that are occupied by the points in S. Note that N o is a random variable. Let E be the event that N o m. Now we lower bound P E. This is a variant of the coupon collector s problem: there are N distinct coupons, and in S trials we want to collect at least m distinct coupons. Note that P E = P E c = m i= P N o = i. Let T i be the number of all possible coupon sequences of S such that S occupies exactly i segments (i.e. distinct coupons). We have Ci n ways of choosing i segments among a total of N. Also, for each choice of i segments, the number of all possible coupon sequences of S such that S fully occupies those i segments without empty is upper bounded by i S. Thus T i Ci ni S and we have Since m P N o = i = 3e ln 4 ln 3 δ > log 3 δ, S m, and N > em, thus P E c = < m i= m i= P N o = i m i= T i N S Cn i ( i N ) S. (56) C n i ( i N ) S m i= C n i ( m N )m C n i ( m N )m ( en m )m ( m N )m = ( em N )m < ( em em )m = ( )m < δ 3. (57)

14 Thus P E δ 3. Applying union bound, P E, E δ 3. Let E 3 be the following event: there exist a point x in S such that x, the flipped point, lies in the same segment as some point x in S. If E 3 happens, then x + x = x ( x ) N. Note that P E 3 P E, E, E 3 = P E 3 E, E P E, E. Now we lower bound P E 3 E, E. Given E and E happen, we have S m and N o m. Since N = m (ln 3 δ ) m (ln 3 δ ), we have P E3 c E, E = ( N o N ) S ( m N ) S ( m N )m e m N δ 3. (58) Thus, P E 3 E, E = P E3 c E, E > δ 3. P E 3 P E, E, E 3 = P E 3 E, E P E, E ( δ δ 3 )( 3 ) > δ. Thus with probability at least δ, there exist x S and x S such that x + x N. We now bound N. N = m (ln 3 δ ) m (ln 3 δ ). Therefore N m ln 3 δ. Recall by definition m ( ɛ ln 3 δ ), thus N ɛ. We now have x + x ɛ. Finally, since {s, s + } selected by teacher B ms is the most symmetric pair, it must satisfy s + s + x + x ɛ. Putting together, with probability at least δ, R(ˆθ Bms(S)) = s + s + ɛ.

Optimal Teaching for Online Perceptrons

Optimal Teaching for Online Perceptrons Optimal Teaching for Online Perceptrons Xuezhou Zhang Hrag Gorune Ohannessian Ayon Sen Scott Alfeld Xiaojin Zhu Department of Computer Sciences, University of Wisconsin Madison {zhangxz1123, gorune, ayonsn,

More information

Machine Teaching. for Personalized Education, Security, Interactive Machine Learning. Jerry Zhu

Machine Teaching. for Personalized Education, Security, Interactive Machine Learning. Jerry Zhu Machine Teaching for Personalized Education, Security, Interactive Machine Learning Jerry Zhu NIPS 2015 Workshop on Machine Learning from and for Adaptive User Technologies Supervised Learning Review D:

More information

PAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers

PAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers PAC-Bayes Ris Bounds for Sample-Compressed Gibbs Classifiers François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Département d informatique et de génie logiciel,

More information

IFT Lecture 7 Elements of statistical learning theory

IFT Lecture 7 Elements of statistical learning theory IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

How can a Machine Learn: Passive, Active, and Teaching

How can a Machine Learn: Passive, Active, and Teaching How can a Machine Learn: Passive, Active, and Teaching Xiaojin Zhu jerryzhu@cs.wisc.edu Department of Computer Sciences University of Wisconsin-Madison NLP&CC 2013 Zhu (Univ. Wisconsin-Madison) How can

More information

Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes

Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes Proceedings of Machine Learning Research vol 65:1 10, 2017 Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes Lunjia Hu Ruihan Wu Tianhong Li Institute for Interdisciplinary Information

More information

10-701/15-781, Machine Learning: Homework 4

10-701/15-781, Machine Learning: Homework 4 10-701/15-781, Machine Learning: Homewor 4 Aarti Singh Carnegie Mellon University ˆ The assignment is due at 10:30 am beginning of class on Mon, Nov 15, 2010. ˆ Separate you answers into five parts, one

More information

Machine Teaching and its Applications

Machine Teaching and its Applications Machine Teaching and its Applications Jerry Zhu University of Wisconsin-Madison Jan. 8, 2018 1/64 Introduction 2/64 Machine teaching Given target model θ, learner A Find the best training set D so that

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Sharp Generalization Error Bounds for Randomly-projected Classifiers

Sharp Generalization Error Bounds for Randomly-projected Classifiers Sharp Generalization Error Bounds for Randomly-projected Classifiers R.J. Durrant and A. Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/ axk

More information

The Power of Random Counterexamples

The Power of Random Counterexamples Proceedings of Machine Learning Research 76:1 14, 2017 Algorithmic Learning Theory 2017 The Power of Random Counterexamples Dana Angluin DANA.ANGLUIN@YALE.EDU and Tyler Dohrn TYLER.DOHRN@YALE.EDU Department

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Discriminative Learning can Succeed where Generative Learning Fails

Discriminative Learning can Succeed where Generative Learning Fails Discriminative Learning can Succeed where Generative Learning Fails Philip M. Long, a Rocco A. Servedio, b,,1 Hans Ulrich Simon c a Google, Mountain View, CA, USA b Columbia University, New York, New York,

More information

Generalization, Overfitting, and Model Selection

Generalization, Overfitting, and Model Selection Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How

More information

arxiv: v1 [cs.lg] 18 Feb 2017

arxiv: v1 [cs.lg] 18 Feb 2017 Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes Lunjia Hu, Ruihan Wu, Tianhong Li and Liwei Wang arxiv:1702.05677v1 [cs.lg] 18 Feb 2017 February 21, 2017 Abstract In this work

More information

Estimating Gaussian Mixture Densities with EM A Tutorial

Estimating Gaussian Mixture Densities with EM A Tutorial Estimating Gaussian Mixture Densities with EM A Tutorial Carlo Tomasi Due University Expectation Maximization (EM) [4, 3, 6] is a numerical algorithm for the maximization of functions of several variables

More information

Introduction: The Perceptron

Introduction: The Perceptron Introduction: The Perceptron Haim Sompolinsy, MIT October 4, 203 Perceptron Architecture The simplest type of perceptron has a single layer of weights connecting the inputs and output. Formally, the perceptron

More information

Unlabeled Data: Now It Helps, Now It Doesn t

Unlabeled Data: Now It Helps, Now It Doesn t institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster

More information

Active Learning: Disagreement Coefficient

Active Learning: Disagreement Coefficient Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Michalis K. Titsias Department of Informatics Athens University of Economics and Business

More information

Learning Bound for Parameter Transfer Learning

Learning Bound for Parameter Transfer Learning Learning Bound for Parameter Transfer Learning Wataru Kumagai Faculty of Engineering Kanagawa University kumagai@kanagawa-u.ac.jp Abstract We consider a transfer-learning problem by using the parameter

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing Afonso S. Bandeira April 9, 2015 1 The Johnson-Lindenstrauss Lemma Suppose one has n points, X = {x 1,..., x n }, in R d with d very

More information

On the Sample Complexity of Noise-Tolerant Learning

On the Sample Complexity of Noise-Tolerant Learning On the Sample Complexity of Noise-Tolerant Learning Javed A. Aslam Department of Computer Science Dartmouth College Hanover, NH 03755 Scott E. Decatur Laboratory for Computer Science Massachusetts Institute

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Sample width for multi-category classifiers

Sample width for multi-category classifiers R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University

More information

CSE 546 Final Exam, Autumn 2013

CSE 546 Final Exam, Autumn 2013 CSE 546 Final Exam, Autumn 0. Personal info: Name: Student ID: E-mail address:. There should be 5 numbered pages in this exam (including this cover sheet).. You can use any material you brought: any book,

More information

Probability and Information Theory. Sargur N. Srihari

Probability and Information Theory. Sargur N. Srihari Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal

More information

Computational Learning Theory

Computational Learning Theory CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful

More information

Being Taught can be Faster than Asking Questions

Being Taught can be Faster than Asking Questions Being Taught can be Faster than Asking Questions Ronald L. Rivest Yiqun Lisa Yin Abstract We explore the power of teaching by studying two on-line learning models: teacher-directed learning and self-directed

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Gaussian Processes (10/16/13)

Gaussian Processes (10/16/13) STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

From Batch to Transductive Online Learning

From Batch to Transductive Online Learning From Batch to Transductive Online Learning Sham Kakade Toyota Technological Institute Chicago, IL 60637 sham@tti-c.org Adam Tauman Kalai Toyota Technological Institute Chicago, IL 60637 kalai@tti-c.org

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectation-Maximization Algorithm Francisco S. Melo In these notes, we provide a brief overview of the formal aspects concerning -means, EM and their relation. We closely follow the presentation in

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

arxiv: v1 [cs.it] 21 Feb 2013

arxiv: v1 [cs.it] 21 Feb 2013 q-ary Compressive Sensing arxiv:30.568v [cs.it] Feb 03 Youssef Mroueh,, Lorenzo Rosasco, CBCL, CSAIL, Massachusetts Institute of Technology LCSL, Istituto Italiano di Tecnologia and IIT@MIT lab, Istituto

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

DS-GA 1002 Lecture notes 2 Fall Random variables

DS-GA 1002 Lecture notes 2 Fall Random variables DS-GA 12 Lecture notes 2 Fall 216 1 Introduction Random variables Random variables are a fundamental tool in probabilistic modeling. They allow us to model numerical quantities that are uncertain: the

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Efficient Semi-supervised and Active Learning of Disjunctions

Efficient Semi-supervised and Active Learning of Disjunctions Maria-Florina Balcan ninamf@cc.gatech.edu Christopher Berlind cberlind@gatech.edu Steven Ehrlich sehrlich@cc.gatech.edu Yingyu Liang yliang39@gatech.edu School of Computer Science, College of Computing,

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

10.1 The Formal Model

10.1 The Formal Model 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

References for online kernel methods

References for online kernel methods References for online kernel methods W. Liu, J. Principe, S. Haykin Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley, 2010. W. Liu, P. Pokharel, J. Principe. The kernel least mean square

More information

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 1 Active Learning Most classic machine learning methods and the formal learning

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

THE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY

THE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY THE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY Dan A. Simovici UMB, Doctoral Summer School Iasi, Romania What is Machine Learning? The Vapnik-Chervonenkis Dimension Probabilistic Learning Potential

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 477 Instructor: Tony Jebara Topic 5 Generalization Guarantees VC-Dimension Nearest Neighbor Classification (infinite VC dimension) Structural Risk Minimization Support Vector Machines

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Toward Adversarial Learning as Control

Toward Adversarial Learning as Control Toward Adversarial Learning as Control Jerry Zhu University of Wisconsin-Madison The 2nd ARO/IARPA Workshop on Adversarial Machine Learning May 9, 2018 Test time attacks Given classifier f : X Y, x X Attacker

More information

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity; CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

12.1 A Polynomial Bound on the Sample Size m for PAC Learning

12.1 A Polynomial Bound on the Sample Size m for PAC Learning 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 12: PAC III Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 In this lecture will use the measure of VC dimension, which is a combinatorial

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Single-tree GMM training

Single-tree GMM training Single-tree GMM training Ryan R. Curtin May 27, 2015 1 Introduction In this short document, we derive a tree-independent single-tree algorithm for Gaussian mixture model training, based on a technique

More information

The Decision List Machine

The Decision List Machine The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca

More information

arxiv: v1 [cs.ds] 3 Feb 2018

arxiv: v1 [cs.ds] 3 Feb 2018 A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Machine Learning in the Data Revolution Era

Machine Learning in the Data Revolution Era Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Lecture 4: Two-point Sampling, Coupon Collector s problem

Lecture 4: Two-point Sampling, Coupon Collector s problem Randomized Algorithms Lecture 4: Two-point Sampling, Coupon Collector s problem Sotiris Nikoletseas Associate Professor CEID - ETY Course 2013-2014 Sotiris Nikoletseas, Associate Professor Randomized Algorithms

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good

More information

Theorem (Special Case of Ramsey s Theorem) R(k, l) is finite. Furthermore, it satisfies,

Theorem (Special Case of Ramsey s Theorem) R(k, l) is finite. Furthermore, it satisfies, Math 16A Notes, Wee 6 Scribe: Jesse Benavides Disclaimer: These notes are not nearly as polished (and quite possibly not nearly as correct) as a published paper. Please use them at your own ris. 1. Ramsey

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

A Large Deviation Bound for the Area Under the ROC Curve

A Large Deviation Bound for the Area Under the ROC Curve A Large Deviation Bound for the Area Under the ROC Curve Shivani Agarwal, Thore Graepel, Ralf Herbrich and Dan Roth Dept. of Computer Science University of Illinois Urbana, IL 680, USA {sagarwal,danr}@cs.uiuc.edu

More information

Optimization Methods for Machine Learning (OMML)

Optimization Methods for Machine Learning (OMML) Optimization Methods for Machine Learning (OMML) 2nd lecture (2 slots) Prof. L. Palagi 16/10/2014 1 What is (not) Data Mining? By Namwar Rizvi - Ad Hoc Query: ad Hoc queries just examines the current data

More information

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems Weinan E 1 and Bing Yu 2 arxiv:1710.00211v1 [cs.lg] 30 Sep 2017 1 The Beijing Institute of Big Data Research,

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

Generalization and Overfitting

Generalization and Overfitting Generalization and Overfitting Model Selection Maria-Florina (Nina) Balcan February 24th, 2016 PAC/SLT models for Supervised Learning Data Source Distribution D on X Learning Algorithm Expert / Oracle

More information

A Result of Vapnik with Applications

A Result of Vapnik with Applications A Result of Vapnik with Applications Martin Anthony Department of Statistical and Mathematical Sciences London School of Economics Houghton Street London WC2A 2AE, U.K. John Shawe-Taylor Department of

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6

More information

Wavelet Transform And Principal Component Analysis Based Feature Extraction

Wavelet Transform And Principal Component Analysis Based Feature Extraction Wavelet Transform And Principal Component Analysis Based Feature Extraction Keyun Tong June 3, 2010 As the amount of information grows rapidly and widely, feature extraction become an indispensable technique

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

Analysis of Spectral Kernel Design based Semi-supervised Learning

Analysis of Spectral Kernel Design based Semi-supervised Learning Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,

More information

Pointwise Exact Bootstrap Distributions of Cost Curves

Pointwise Exact Bootstrap Distributions of Cost Curves Pointwise Exact Bootstrap Distributions of Cost Curves Charles Dugas and David Gadoury University of Montréal 25th ICML Helsinki July 2008 Dugas, Gadoury (U Montréal) Cost curves July 8, 2008 1 / 24 Outline

More information