arxiv: v1 [stat.ml] 25 Feb 2018
|
|
- Ruby Francis
- 5 years ago
- Views:
Transcription
1 Yuzhe Ma Robert Nowa Philippe Rigollet Xuezhou Zhang Xiaojin Zhu University of Wisconsin-Madison Massachusetts Institute of Technology arxiv:8.8946v stat.ml 5 Feb 8 Abstract We call a learner super-teachable if a teacher can trim down an iid training set while maing the learner learn even better. We provide sharp superteaching guarantees on two learners: the maximum lielihood estimator for the mean of a Gaussian, and the large margin classifier in D. For general learners, we provide a mixed-integer nonlinear programming-based algorithm to find a super teaching set. Empirical experiments show that our algorithm is able to find good super-teaching sets for both regression and classification problems. Introduction Consider the following question: a learner receives an iid training set S drawn from a distribution parametrized by θ. There is a teacher who nows θ. Can the teacher select a subset from S so the learner estimates θ better from the subset than from S? This question is distinct from training set reduction (see e.g. 9, 43, 4) in that the teacher can use the nowledge of θ to carefully design the subset. It is, in fact, a coding problem: Can the teacher approximately encode θ using items in S for a nown decoder, which is the learner? As such, the question is not a machine learning tas but rather a machine teaching one 47,, 45. This question is relevant for several nascent applications. One application is in understanding blacbox models such as deep nets. Often observation to a blacbox model is limited to its predicted label y = θ (x) given input x. One way to interpret a blacbox model is to locally train an interpretable model with data points S labeled by the blacbox model around the region of interest 35. We, however, as for more: to reduce the size of the training set S for Proceedings of the st International Conference on Artificial Intelligence and Statistics (AISTATS) 8, Lanzarote, Spain. PMLR: Volume 84. Copyright 8 by the author(s). the local learner while maing the learner approximate the blacbox better. The reduced training set itself also serves as representative examples of local model behavior. Another application is in education. Imagine a teacher who has a teaching goal θ. This is a reasonable assumption in practice: e.g. a geology teacher has the nowledge of the actual decision boundaries between roc categories. However, the teacher is constrained to teach with a given textboo (or a set of courseware) S. To the extent that the student is quantified mathematically, the teacher wants to select pages in the textboo with the guarantee that the student learns better from those pages than from gulping the whole boo. But is the question possible? The following example says yes. Consider learning a threshold classifier on the interval,, with true threshold at θ =. Let S have n items drawn uniformly from the interval and labeled according to θ. Let the learner be a hard margin SVM, which places the estimated threshold in the middle of the inner-most pair in S with different labels: ˆθ S = (x + x + )/ where x is the largest negative training item and x + the smallest positive training item in S. It is well nown that ˆθ S θ converges at a rate of /n: the intuition being that the average space between adjacent items is O(/n). - Figure : The original training set S with n = 6 items (circles and stars; green=negative, purple=positive), and the most-symmetric training set (stars) the teacher selects. The teacher nows everything but cannot tell θ directly to the learner. Instead, it can select the most-symmetric pair in S about θ and give them to the learner as a two-item training set. We will prove later that the ris on the most symmetric pair is O(/n ), that is, learning from the selected subset surpasses learning from S. Thus we observe something interesting: the teacher can turn a larger training set S into a smaller and better subset for the midpoint classifier. We call this phenomenon super-teaching.
2 Formal Definition of Super Teaching Let Z be the data space: for unsupervised learning Z = X, while for supervised learning Z = X Y. Let p Z be the underlying distribution over Z. We tae a function view of the learner: a learner A is a function A : n=z n Θ, where Θ is the learner s hypothesis space. The notation n=z n defines the set of (potentially non-iid) training sets, namely multisets of any size whose elements are in Z. Given any training set T n=z n, we assume A returns a unique hypothesis A(T ) ˆθ T Θ. The learner s ris R(θ) for θ Θ is defined as: R(θ) = E pz l(θ(x), y), or R(θ) = θ θ. () The former is for prediction tass where l() is a loss function and θ(x) denotes the prediction on x made by model θ; the latter is for parameter estimation where we assume a realizable model p Z = p θ for some θ Θ. We now introduce a clairvoyant teacher B who has full nowledge of p Z, A, R. The teacher is also given an iid training set S = {z,..., z n } p Z. If the teacher teaches S to A, the learner will incur a ris R(A(S)) R(ˆθ S ). The teacher s goal is to judiciously select a subset B(S) S to act as a super teaching set for the learner so that R(ˆθ B(S) ) < R(ˆθ S ). Of course, to do so the teacher must utilize her nowledge of the learning tas, thus the subset is actually a function B(S, p Z, A, R). In particular, the teacher nows p Z already, and this sets our problem apart from machine learning. For readability we suppress these extra parameters in the rest of the paper. We formally define super teaching as follows. Definition (Super Teaching). B is a super teacher for learner A if δ >, N such that n N P S R(ˆθ B(S) ) c n R(ˆθ S ) > δ, () where S iid p n Z, B(S) S, and c n is a sequence we call super teaching ratio. Obviously, c n = can be trivially achieved by letting B(S) = S so we are interested in small c n. There are two fundamental questions: () Do super teachers provably exist? () How to compute a super teaching set B(S) in practice? We answer the first question positively by exhibiting super teaching on two learners: maximum lielihood estimator for the mean of a Gaussian in section 3, and D large margin classifier in section 4. Guarantees on super teaching for general learners remain future wor. Nonetheless, empirically we can find a super teaching set for many general learners: We formulate the second question as mixed-integer nonlinear programming in section 5. Empirical experiments in section 6 demonstrates that one can find a good B(S) effectively. 3 Analysis on Super Teaching for the MLE of Gaussian mean In this section, we present our first theoretical result on super teaching, when the learner A MLE is the maximum lielihood estimator (MLE) for the mean of a Gaussian. Let Z = X = R, Θ = R, p Z (x) = N (θ, ). Given a sample S of size n drawn from p Z, the learner computes the MLE for the mean: ˆθ S = A MLE (S) = n n i= x i. We define the ris as R(ˆθ S ) = ˆθ S θ. The teacher we consider is the optimal -subset teacher B, which uses the best subset of size to teach: B (S) argmin T S, T = R(ˆθ T ). (3) To build intuition, it is well-nown that the ris of A MLE under S is O(/ n) because the variance under n items shrins lie /n. Now consider =. Since the teacher B nows θ, under our setting the best teaching strategy is for her to select the item in S closest to θ, which forms the singleton teaching set B (S). One can show that with large probability this closest item is O(/n) away from θ (the central part of a Gaussian density is essentially uniform). Therefore, we already see a super teaching ratio of c n = n. More generally, our main result below shows that B achieves a super teaching ratio c n = O(n + ): Theorem. Let B be the optimal -subset teacher. ɛ (, 4 ), δ (, ), N(, ɛ, δ) such that n N(, ɛ, δ), P R(ˆθ B (S)) c n R(ˆθ S ) > δ, where c n = ɛ n + +ɛ. Toward proving the theorem, we first recall the standard rate R(ˆθ S ) n if A MLE learns from the whole training set S: Proposition. Let S be an n-item iid sample drawn from N (θ, ). ɛ >, δ (, ), N (ɛ, δ) such that n N, P n ɛ < R(ˆθ S ) < n +ɛ > δ. (4) Proof. R(ˆθ S ) = ˆθ S θ and ˆθ S θ N (, n ) = n nx π e. Let α = n ɛ and β = n +ɛ. We have P R(ˆθ S ) α = < α α n dx = α π n π n π = nx e dx π n ɛ, Remar: we introduced an auxiliary variable ɛ which controls the implicit tradeoff between c n, how much super teaching helps, and N, how soon super teaching taes effect. When ɛ the teaching ratio c n approaches O(n + ), but as we will see N(, ɛ, δ). Similarly, also affects the tradeoff: the teaching ratio is smaller as we enlarge, but N(, ɛ, δ) increases. (5)
3 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu P R(ˆθ S ) β = < = β β x n β π β nβ nπ e < β n π nx e dx = nπ = nx e dx β β n ny π e dy π n ɛ. Thus P α < R(ˆθ S ) < β = P R(ˆθ S ) α P R(ˆθ S ) β > π n ɛ. Let N (ɛ, δ) = ( δ 8 π ) ɛ, then n N, P α < R(ˆθ S ) < β > δ. We now wor out the ris of A MLE if it learns from the optimal -subset teacher B. Theorem 4 says that this ris is very small and sharply concentrated around R(ˆθ B (S)) n. To prove Theorem 4, we first give the following lemma. ( ) n Lemma 3. Denote C n =. Let the index set I = {,,...n} where n 4. Consider all subsets of size, then there are at most 4 C Cn ordered pairs of subsets that are overlapping but not identical. Proof. Let I and I be two subsets of size and they overlap on t indexes. Then the total number of distinct indexes that appear in I I is t. There are C t n ways of choosing such t indexes. Next we determine which t indexes are overlapping ones. We have Ct t ways of choosing such t indexes. Finally we have C t t ways of selecting half of the non-overlapping indexes and attribute them to I. Thus in total we have O t = C t n C t t C t t ordered pairs of subsets that overlap on t indexes. By our assumption n 4 we have C t n Cn. Also note that C t t < Ct and C t t < C, thus O t < C n C t C. Therefore the total number of ordered pairs of subsets that are overlapping but not identical is O = O t < C C n t C < t= t= t= C C n t C = 4 C C. n (6) Now we prove the ris of the optimal -subset teacher. Theorem 4. Let B be the optimal -subset teacher. Let S be an n-item iid sample drawn from N (θ, ). ɛ (7) (, ), δ (, ), N (, ɛ, δ) such that n N, P ( n )+ɛ < R(ˆθ B (S)) < ( n ) ɛ > δ. Proof. Let I {,,..., n} and I =, define γ I = i I (x i θ ). Let S I denote the subset indexed by I. Note that ˆθ SI = i I x i and R(ˆθ SI ) = ˆθ SI θ = i I x i θ = γ I. Also note that R(ˆθ B (S)) = inf I R(ˆθ SI ) = inf I γ I. Thus to prove Theorem 4 it suffices to prove P ( n )+ɛ < inf γ I < ( I n ) ɛ. (9) Let α = ( n )+ɛ and β = ( n ) ɛ. We first prove the lower bound. Note that γ I has the same distribution for all I. Thus by the union bound, P inf I (8) γ I α = P I : γ I α C n P γ I α, () where I = {,,..., }. Since γ I N (, ), we have P γ I α = α α Note that C n ( en ). Thus, P inf I e x dx < α. () π π γ I α < ( en ) π α = π e ( n )ɛ. () Thus N (, ɛ, δ) such that n N, P inf γ I α < δ I. (3) To show the upper bound, we define t I = γ I < β, where is the indicator function. Let T = I t I. Then it suffices to show lim n P T = =. Note that P T = = P T E T = E T P (T E T ) (E T ) V T (E T ), (4) where the last inequality follows from the Marov inequality. Now we lower bound E T. E T = E t I = E t I = C n E t I I I (5) β = C n P γ I < β = C n e x dx. π Note that ɛ <, thus β <. For x ( β, β), π e x > β π e = πe. Also note that C n > ( n ), thus E T > ( n ) πe β = πe (n )ɛ. (6)
4 Now we upper bound V T. V T = I,I Cov t I, t I = I,I, I I Cov t I, t I. (7) Note that for Bernoulli random variable t I, V t I E t I. Thus if I = I, then Cov t I, t I = V t I E t I = P γ I < β β = e x dx < β = π π π ( n ) ɛ. β (8) Otherwise I I, that is, I and I overlap but not identical, then Cov t I, t I = E t I t I E t I E t I E t I t I = P γ I < β, γ I < β. (9) Note that γ I and γ I are jointly Gaussian with covariance Cov γ I, γ I = Cov x i θ, x i θ i I,i I = I I = ρ, i I,i I,i=i () where ρ. The joint PDF of two standard normal distributions x, y with covariance ρ is f(x, y) = Note that f(x, y) π ρ, thus P γ I < β, γ I < β = π x ρxy+y ρ e ( ρ ). () x <β, y <β π ρ (β) = π ρ β. Since π ρ π ρ dxdy π ( = ) π π P γ I < β, γ I < β β π (), thus = π ( n ) ɛ. According to Lemma 3, there are at most 4 C of I and I such that I I. Thus, V T = V t I + Cov t I, t I I I I, I I (3) Cn pairs C n π ( n ) ɛ + 4 C C n π ( n ) ɛ π (en ) ( n ) ɛ + 4 C en ( ) π ( n ) ɛ = π e ( n )ɛ + 4 π C ( e ) ( n )ɛ. (4) Now plug (4) and (6) into (4), we have P T = a ()( n ) ɛ + a ()( n ), (5) where a = π e+ and a () = ec ( e ). Thus N (, ɛ, δ) such that n N P inf I, γ I β < δ. (6) Let N (, ɛ, δ) = max{n (, ɛ, δ), N (, ɛ, δ)}, combining (3) and (6) concludes the proof. Now we can conclude super-teaching by comparing Theorem 4 and Proposition : Proof of Theorem. Let α = ( n ) ɛ and β = n ɛ. By Proposition, ɛ (, 4 ), δ (, ), N (ɛ, δ ) such that n N, P R(ˆθ S ) > β > δ. By Theorem 4, N (, ɛ, δ ) such that n N, P R(ˆθ B (S)) < α > δ. Let c n = ɛ n + +ɛ. Since ɛ < 4, c n is a decreasing sequence in n with lim n c n =. Let N 3 (, ɛ) be the first integer such that c N3. Let N(, ɛ, δ) = max{n (ɛ, δ ), N (, ɛ, δ ), N 3(, ɛ)}. By a union bound n N(, ɛ, δ), P Since α β = c n, we have P δ, where c n c N3. R(ˆθ B (S)) < α, R(ˆθ S ) β R(ˆθ B (S)) c n R(ˆθ S ) > δ. > 4 Analysis on Super Teaching for D Large Margin Classifier We present our second theoretical result, this time on teaching a D large margin classifier. Let X =,, Y = {, }, Θ =,, θ =, p Z (x, y) = p Z (x)p Z (y x) where p Z (x) = U(X) and p Z (y = x) = x θ. Let x max i:yi= x i and x + min i:yi=+ x i be the inner-most pair of opposite labels in S if they exist. We formally define the large margin classifier A lm (S) as (x + x + )/ if x, x + exist ˆθ S = A lm (S) = if S all positive if S all negative. (7) The ris is defined as R(ˆθ S ) = ˆθ S θ = ˆθ S. The teacher we consider is the most symmetric teacher, who selects the most symmetric pair about θ in S and gives it to the learner. We define the most-symmetric teacher B ms : { {(s, ), (s B ms (S) = +, )} if s, s + exist, {(x, y )} otherwise. (8)
5 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu where (s, s + ) argmin (x, ),(x,) S x+x θ. Our main result shows that learning from the whole set S achieves the well-nown O(/n) ris, but surprisingly B ms achieves O(/n ) ris, therefore it is an approximately c n = O(n ) super teaching ratio. Theorem 5. Let S be an n-item iid sample drawn from p Z. Then δ (, ), N(δ) such that n N, P R(ˆθ Bms(S)) c n R(ˆθ S ) > δ, where c n = 3 nδ ln 6 δ. Before proving Theorem 5, we first show that B ms is an optimal teacher for the large margin classifier. Proposition 6. B ms is an optimal teacher for the large margin classifier ˆθ S. Proof. We show R(ˆθ Bms(S)) R(ˆθ B(S) ) for any B and any S. If B ms (S) =, then S is either all positive or all negative. In both cases R(ˆθ B(S) ) = for any B by definition. Thus R(ˆθ Bms(S)) R(ˆθ B(S) ). Otherwise B ms (S) =, then if B(S) is all positive or all negative, we have R(ˆθ B(S) ) = and thus R(ˆθ Bms(S)) R(ˆθ B(S) ). Otherwise let x B, x B + be the inner most pair of B(S). Since x B, x B + S, then by definition of B ms, R(ˆθ Bms(S)) = s +s+ θ xb +xb + θ = R(ˆθ B(S) ). Now we wor out the ris of the most symmetric teacher B ms. To bound the ris of B ms we need the following ey lemma, which shows that the sample complexity with the teacher is O(ɛ / ). Lemma 9. Let n = 4m, where m is an integer. Let S be an n-item iid sample drawn from p Z. ɛ >, δ (, ), M(ɛ, δ) = max{ 3e ln 4 ln 3 δ, ( ɛ ln 3 δ ) } such that m M(ɛ, δ), P R(ˆθ Bms(S)) ɛ > δ. Proof. We give a proof setch and the details are in the appendix. Let S = {x (x, ) S} and S = {x (x, ) S} respectively. Then we have S + S = 4m. Define event E : { S m S m}. Given that m 3e ln 4 ln 3 δ, one can show P (E ) > δ 3. Since S + S = 4m, either S m or S m. Without loss of generality we assume S m. We then divide the interval, equally into N = m (ln 3 δ ) segments. The length of each segment is N = O( m ) as Figure shows. 3 N - Figure : segments Now we show that learning on the whole S incurs O(n ) ris. First, we give the following lemma for the exact tail probability of R(ˆθ S ). Lemma 7. For the large margin classifier ˆθ S, we have ( ɛ) n + (ɛ) n < ɛ P R(ˆθ S ) > ɛ = ( )n < ɛ < (9) ɛ =. The proof for Lemma 7 is in the appendix. Now we show that R(ˆθ S ) is O(n ). Theorem 8. Let S be an n-item iid sample drawn from p Z. Then δ (, ) and n, P R(ˆθ S ) > δ > δ. (3) n Proof. According to Lemma 7, for ɛ, we have P R(ˆθ S ) > ɛ > ( ɛ) n > nɛ. (3) Note that n, thus δ n. Let ɛ = δ n in (3) we have P R(ˆθ S ) > δ > n δ = δ. (3) n n Let N o be the number of segments that are occupied by the points in S. Note that N o is a random variable. Let E be the event that N o m. Then one can show P (E ) > δ 3. By union bound, we have P (E, E ) > δ 3. Let E 3 be the following event: there exist a point x in S such that x, the flipped point, lies in the same segment as some point x in S. One can show that P (E 3 E, E ) > δ 3. Thus P (E 3) P (E, E, E 3 ) = P (E 3 E, E )P (E, E ) ( δ δ 3 )( 3 ) > δ. If E 3 happens, then x + x = x ( x ). Note that m ( ɛ ln 3 δ ) and N = m (ln 3 δ ) m (ln 3 δ ), thus N m ln 3 δ ɛ. Therefore R(ˆθ Bms(S)) = s +s+ x+x ɛ. Rewriting ɛ in Lemma 9 as a function of n, we have the following theorem. Theorem. Let S be an n-item iid sample dawn from p Z, then N (δ) = e ln 4 ln 3 δ such that n N, P R(ˆθ Bms(S)) 6 n ln 3 > δ. (33) δ Proof. Note that if n N (δ) = ln 4 ln 3 δ, then m = n 4 3e ln 4 ln 3 δ, thus the minimum ɛ that satisfies m M(ɛ, δ) is m ln 3 δ = 6 n ln 3 δ. e N
6 Now we can conclude super teaching: Proof of Theorem 5. According to Theorem, N ( δ ) such that n N, P R(ˆθ Bms(S)) 6 n ln 6 δ > δ. Note that N, thus according to Theorem 8, n N, P R(ˆθ S ) > δ n > δ. Let c n = 3 nδ ln 6 δ and N (δ) = 3 δ ln 6 δ so that c N =. Let N(δ) = max{n (δ), N (δ)}. By union bound, n N, with probability at least δ, we have both R(ˆθ S ) > δ n and R(ˆθ Bms(S)) 6 n ln 6 δ, which gives P R(ˆθ Bms(S)) c n R(ˆθ S ) > δ, where c n c N =. 5 An MINLP Algorithm for Super Teaching Although the problem of proving super teaching ratios for a specific learner is interesting, we now focus on an algorithm to find a super teaching set for general learners given a training set S. That is, we find a subset B(S) S so that R(ˆθ B(S) ) < R(ˆθ S ). We start by formulating super teaching as a subset selection problem. To this end, we introduce binary indicator variables b,..., b n where b i = means z i S is included in the subset. We consider learners A that can be defined via convex empirical ris minimization: A(S) argmin θ Θ n i= l(θ, z i ) + λ θ. (34) For simplicity we assume there is a unique global minimum which is returned by argmin. Note that we use l in (34) to denote the (surrogate) convex loss used by A in performing empirical ris minimization. For example, l may be the negative log lielihood for logistic regression. l is potentially different from l (e.g. the - loss) used by the teacher to define the teaching ris R in (). We formulate super teaching as the following bilevel combinatorial optimization problem: min b {,} n,ˆθ Θ R(ˆθ) (35) s.t. ˆθ n = argminθ Θ b i l(θ, zi ) + λ θ. (36) i= Under mild conditions, we may replace the lower level optimization problem (i.e. the machine learning problem (36)) by its first order optimality (KKT) conditions: min b {,} n,ˆθ Θ s.t. R(ˆθ) (37) n b i θ l(ˆθ, zi ) + λˆθ =. i= This reduces the bilevel problem but the constraint is nonlinear in general, leading to a mixed-integer nonlinear program (MINLP), for which effective solvers exist. We use the MINLP solver in NEOS 5. 6 Simulations We now apply the framewor in section 5 to logistic regression and ridge regression, and show that the solver indeed selects a super-teaching subset that is far better than the original training set S. 6. Teaching Logistic Regression A lr Let X = R d, Θ = R d, θ = ( d,..., d ), p Z (x) = N (, I). Let p Z (y x) = x θ >, which is deterministic given x. Logistic regression estimates ˆθ S = A lr (S) with (34), where λ =. and l(z i ) = log( + exp( y i x i θ)). In contrast, The teacher s ris is defined to be the expected - loss: R(ˆθ) = E pz ˆθ(x) y, where ˆθ(x) is the label of x predicted by ˆθ. Since p Z is symmetric about the origin, the ris can be rewritten in terms of the angle between ˆθ and θ ˆθ : R(ˆθ) = arccos( θ )/π. ˆθ θ Instantiating (37) we have min b {,} n,ˆθ R d s.t. ˆθ θ arccos( )/π (38) ˆθ θ n b i y i x i λˆθ =. + exp(y i x i ˆθ) i= We run experiments to study the effectiveness and scalability of the NEOS MINLP solver on (38), specifically with respect to the training set size n = S and dimension d. In the first set of experiments we fix d = and vary n = 6, 64, 56 and 4. For each n we run trials. In each trial we draw an n-item iid sample S p Z and call the solver on (38). The solver s solution to b... b n indicates the super teaching set B(S). We then compute an empirical version of the super teaching ratio: ĉ n = R(ˆθ B(S) )/R(ˆθ S ). Logistic Regression Ridge Regression n = S ĉ n B(S) /n time (s) ĉ n B(S) /n time (s) 6 8.5e e- 7.8e e- 64.3e e+ 7.5e e e e+ 5.6e e+ 4.3e-.86.4e+3 4.e e+3 Table : Super teaching as n changes. In the left half of Table we report the median of the following quantities over trials: ĉ n, the fraction of the training items selected for super teaching B(S) /n, and the NEOS server running time. The main result is that ĉ n for all n, which means the solver indeed selects a super-teaching set B(S) that is far
7 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu Logistic Regression Ridge Regression d ĉ n B(S) /n time (s) ĉ n B(S) /n time (s) 3.e e- 3.3e e+ 4.4e e+ 7.e e+ 8.8e e+.5e e e-.4 5.e+ 4.3e e+ 3 8.e-.58.e+ 6.4e e+ Table : Super teaching as d changes better than the original iid training set S. Therefore, MINLP is a valid algorithm for finding a super teaching set. Second, we note that the solver tends to select a large subset since the median B(S) /n /. This is interesting as it is nown that when S is dense, one can select extremely sparse super teaching sets, as small as a few items, to teach effectively 8. Understanding the different regimes remains future wor. Finally, the running time grows fast with n. For example, when n = 4 it taes around half an hour to solve (38). Future wor needs to address this bottlenec in applying MINLP to large problems. In the second set of experiments we fix n = 3 and vary d =, 4, 8, 6, 3. The left half of Table shows the results. The empirical teaching ratio ĉ n is still below in all cases, showing super teaching. But as the dimension of the problem increases ĉ n deteriorates toward. Nonetheless, even when d = n we still see a median super teaching ratio of.8; the corresponding super teaching set B(S) has only 58% training items than the dimension. It is interesting that the MINLP algorithm intentionally created a high dimensional learning problem (as in higher dimension d than selected training items B(S) ) to achieve better teaching, nowing that the learner A lr is regularized. The running time does not change dramatically. 6. Teaching Ridge Regression A rr Let X = R d, Θ = R d, θ = ( d,..., d ), p Z (x) = N (, I), p Z (y x) = N (y; x θ,.). Let the teaching ris be the parameter difference: R(ˆθ) = ˆθ θ. Given a sample S with n iid items drawn from p Z, ridge regression estimates ˆθ S = A rr (S) with λ =. and l(z i ) = (x ˆθ i y i ). The corresponding MINLP is: min ˆθ θ (39) b {,} n,ˆθ R d s.t. λˆθ + n b i (x ˆθ i y i )x i =. i= We run the same set of experiments. Tables and show the results, which are qualitatively similar to teaching logistic regression. Again, we see the empirical super teaching ratio ĉ n, indicating the presence of super teaching. (a) logistic regression (b) ridge regression Figure 3: Typical trials from the MINLP algorithm Finally, Figure 3 visualizes one typical trial each for teaching logistic regression and ridge regression. S consists of both dar and light points, while the dar ones representing B(S) optimized by MINLP. The dashed line shows ˆθ S, while the solid lines shows ˆθ B(S). The ground truth (x + x = in logistic regression, y = x in ridge regression) essentially overlaps with the solid lines. Specifically, the super taught models ˆθ B(S) have negligible riss of.5e- 4 and 3.3e-3, whereas models ˆθ S trained from the whole iid sample S incur much larger riss of.3 and.6, respectively. 7 Related Wor There has been several research threads in different communities aimed at reducing a data set while maintaining its utility. The first thread is training set reduction 9, 43, 4, which during training time prunes items in S in an attempt to improve the learned model. The second thread is coresets, 6, a summary of S such that models learned on the summary are provably competitive with models learned on the full data set S. But as they do not now the target model p Z or θ, these methods cannot truly achieve super teaching. The third thread is curriculum learning which showed that smart initialization is useful for nonconvex optimization. In contrast, our teacher can directly encode the true model and therefore obtain faster rates. The final thread is sample compression 7, where a compression function chooses a subset T S and a reconstruction function to form a hypothesis. Our present wor has some similarity with compression, which allows increased accuracy since compression bounds can be used as regularization 6. The theoretical study of machine teaching has focused on the teaching dimension, i.e. the minimum training set size needed to exactly teach a target concept θ, 38, 46, 8, 9, 45, 6, 44, 48, 9, 3, 4,, 3, 8, 7, 5, 5, 36, 3,. Most of the prior wor assumed a synthetic teaching setting where S is the whole item space, which is often unrealistic. Liu et al. considered approximate teaching in the finite S setting 3, though their analysis focused on a specific SGD learner. Our super teaching setting applies to arbitrary
8 learners, and we allow approximate teaching namely we do not require the teacher to teach exactly the target model, which is infeasible in our pool-based teaching setting with a finite S. Machine teaching applications include education 4, 33, 39, 7, 3, 34, computer security,, 3, and interactive machine learning 4,, 4. By establishing the existence of super-teaching, the present paper can guide the process of finding a more effective training set for these applications. 8 Discussions and Conclusion We presented super-teaching: when the teacher already nows the target model, she can often choose from a given training set a smaller subset that trains a learner better. We proved this for two learners, and provided an empirical algorithm based on mixed integer nonlinear programming to find a super teaching set. However, much needs to be done on the theory of super teaching. We give two counterexamples to illustrate that not all learners are super-teachable. Example (MLE of interval). Let X =, θ, where θ R +. p Z (x) = U(X). Given a n-item training set S, the MLE for θ is ˆθ S = A int (S) = max i=:n x i. The ris is defined as R(ˆθ S ) = ˆθ S θ. We show A int is not superteachable. ˆθB(S) = max xi B(S) x i max xi S x i = ˆθ S. Since ˆθ S θ, R(ˆθ B(S) ) = ˆθ B(S) θ ˆθ S θ = R(ˆθ S ). We can generalize this to a classification setting, and show that neither the least nor the greatest consistent hypothesis is not super-teachable: Example (Consistent learners). Let X = x min, x max Z be an interval over the integer grid. The hypothesis space is Θ = {a, b X : y = in a, b and outside}. θ = a, b Θ. p Z is uniform on X and noiseless y labeled according to θ. The ris R(ˆθ S ) is the size of the symmetric difference between the two intervals ˆθ S and θ, normalized by x max x min. Given a sample S, the least consistent learner A lc learns the tightest interval over positive lc items in S: ˆθ S = A lc(s) min i=:n x i, max i=:n x i. y i= y i= ˆθ S lc = if S does not contain positive items. The greatest consistent learner A gc extends the hypothesis interval in both directions as much as possible before hitting negative points in S. If S has no positive we define =, too. ˆθ gc S Proposition. Neither A lc nor A gc is super-teachable. Proof. We first show A lc is not super-teachable. Note that A lc learns the tightest interval consistent with S, thus we lc always have ˆθ S θ. Now we show that ˆθ B(S) lc lc ˆθ S is always true so that R(ˆθ S lc) R(ˆθ B(S) lc ) follows. If θ =, then trivially ˆθ lc B(S) = ˆθ lc S =. Now assume θ. If (x, ) B(S), let a, b = ˆθ B(S) lc lc. Note that ˆθ S because B(S) S and thus S has lc at least one positive point. Let ˆθ S = a, b. Now a = min{x (x, ) B(S)} min{x (x, ) S} = a, and b = max{x (x, ) B(S)} max{x (x, ) S} = b. Thus we have ˆθ B(S) lc lc ˆθ S. If (x, ) B(S), ˆθ B(S) lc = and ˆθ B(S) lc lc ˆθ S is always true. Thus ˆθ lc B(S) ˆθ lc S θ for any B and any S. The proof for A gc is similar by showing θ ˆθ gc S gc ˆθ B(S). This leads to an open question: which family of learners are super teachable? We offer a conjecture here: we speculate that MLEs (and the derived MAP estimates or regularized empirical ris minimizers) which satisfy the asymptotic normality conditions 4 are super teachable. This conjecture is motivated by its similarity to the proof in section 3. Also note that the two counterexamples are classic examples of MLE that do not satisfy the asymptotic normality conditions. Another open question concerns the optimal super-teaching subset size for a given training set of size n. For example, our result on teaching the MLE of Gaussian mean indicates that the rate improves as grows. However, our analysis only applies to a fixed. Further research is needed to identify the optimal. Acnowledgments: R.N. acnowledges support by NSF IIS and CCF P.R. is supported in part by grants NSF DMS-7596, NSF DMS-TRIPODS-7475, DARPA W9NF-6--55, ONR N and a grant from the MIT NEC Corporation. X.Z. is supported in part by NSF CCF-747, IIS-6365, CMMI-565, DGE-54548, and CCF References S. Alfeld, X. Zhu, and P. Barford. Data poisoning attacs against autoregressive models. AAAI, 6. S. Alfeld, X. Zhu, and P. Barford. Explicit defense actions against test-set attacs. In The Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 7. 3 D. Angluin. Queries revisited. Theoretical Computer Science, 33():75 94, 4. 4 D. Angluin and M. Kriis. Teachers, learners and blac boxes. COLT, D. Angluin and M. Kriis. Learning from different teachers. Machine Learning, 5():37 63, 3.
9 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu 6 O. Bachem, M. Lucic, and A. Krause. Practical Coreset Constructions for Machine Learning. ArXiv e- prints, Mar F. J. Balbach. Measuring teachability using variants of the teaching dimension. Theor. Comput. Sci., 397(- 3):94 3, 8. 8 F. J. Balbach and T. Zeugmann. Teaching randomized learners. COLT, pages 9 43, 6. 9 F. J. Balbach and T. Zeugmann. Recent developments in algorithmic teaching. In Proceedings of the 3rd International Conference on Language and Automata Theory and Applications, pages 8, 9. S. Ben-David and N. Eiron. Self-directed learning and its relation to the VC-dimension and to teacherdirected learning. Machine Learning, 33():87 4, 998. Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In L. Bottou and M. Littman, editors, Proceedings of the 6th International Conference on Machine Learning, pages 4 48, Montreal, June 9. Omnipress. M. Cama and M. Lopes. Algorithmic and human teaching of sequential decision tass. In AAAI,. 3 M. Cama and A. Thomaz. Mixed-initiative active learning. ICML Worshop on Combining Learning Strategies to Reduce Label Cost,. 4 B. Clement, P.-Y. Oudeyer, and M. Lopes. A comparison of automatic teaching strategies for heterogeneous student populations. In Educational Data Mining (EDM), 6. 5 J. Czyzy, M. P. Mesnier, and J. J. Moré. The NEOS server. IEEE Computational Science and Engineering, 5(3):68 75, T. Doliwa, G. Fan, H. U. Simon, and S. Zilles. Recursive teaching dimension, VC-dimension and sample compression. Journal of Machine Learning Research, 5:37 33, 4. 7 S. Floyd and M. Warmuth. Sample compression, learnability, and the Vapni-Chervonenis dimension. Machine learning, (3):69 34, Z. Gao, C. Ries, H. U. Simon, and S. Zilles. Preference-based teaching. Journal of Machine Learning Research, 8(3): 3, 7. 9 S. Garcia, J. Derrac, J. Cano, and F. Herrera. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):47 435,. S. Goldman and M. Kearns. On the complexity of teaching. Journal of Computer and Systems Sciences, 5(): 3, 995. S. A. Goldman and H. D. Mathias. Teaching a smarter learner. Journal of Computer and Systems Sciences, 5():55 67, 996. S. Har-Peled. Geometric approximation algorithms, volume 73. American mathematical society Boston,. 3 T. Hegedüs. Generalized teaching dimensions and the query complexity of learning. In Proceedings of the eighth Annual Conference on Computational Learning Theory (COLT), pages 8 7, F. Khan, X. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. NIPS,. 5 H. Kobayashi and A. Shinohara. Complexity of teaching by a restricted number of examples. COLT, pages 93 3, 9. 6 A. Kontorovich, S. Sabato, and R. Weiss. Nearestneighbor sample compression: Efficiency, consistency, infinite dimensions. arxiv preprint arxiv:75.884, 7. 7 R. Lindsey, M. Mozer, W. J. Huggins, and H. Pashler. Optimizing instructional policies. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 6, pages J. Liu and X. Zhu. The teaching dimension of linear learners. Journal of Machine Learning Research, 7(6): 5, 6. 9 J. Liu, X. Zhu, and H. G. Ohannessian. The teaching dimension of linear learners. In The 33rd International Conference on Machine Learning (ICML), 6. 3 W. Liu, B. Dai, J. M. Rehg, and L. Song. Iterative machine teaching. In ICML, 7. 3 H. D. Mathias. A model of interactive teaching. J. Comput. Syst. Sci., 54(3):487 5, S. Mei and X. Zhu. Using machine teaching to identify optimal training-set attacs on machine learners. AAAI, K. Patil, X. Zhu, L. Kopec, and B. C. Love. Optimal teaching for limited-capacity human learners. Advances in Neural Information Processing Systems (NIPS), 4.
10 34 A. N. Rafferty, E. Brunsill, T. L. Griffiths, and P. Shafto. Faster teaching by POMDP planning. In Proceedings of the 5th International Conference on Artificial Intelligence in Education, AIED, pages 8 87, Berlin, Heidelberg,. Springer-Verlag. 48 S. Zilles, S. Lange, R. Holte, and M. Zinevich. Models of cooperative teaching and learning. Journal of Machine Learning Research, : ,. 35 M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages ACM, R. L. Rivest and Y. L. Yin. Being taught can be faster than asing questions. COLT, S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New Yor, NY, USA, A. Shinohara and S. Miyano. Teachability in computational learning. New Generation Computing, 8(4): , A. Singla, I. Bogunovic, G. Barto, A. Karbasi, and A. Krause. Near-optimally teaching the crowd to classify. In ICML, pages 54 6, 4. 4 J. Suh, X. Zhu, and S. Amershi. The label complexity of mixed-initiative classifier training. International Conference on Machine Learning (ICML), 6. 4 H. White. Maximum lielihood estimation of misspecified models. Econometrica: Journal of the Econometric Society, pages 5, D. R. Wilson and T. R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):57 86,. 43 X. Zeng and X.-w. Chen. SMO-based pruning methods for sparse least squares support vector machines. IEEE transactions on Neural Networs, 6(6):54 546, X. Zhu. Machine teaching for bayesian learners in the exponential family. In Advances in Neural Information Processing Systems (NIPS), pages 95 93, X. Zhu. Machine teaching: an inverse problem to machine learning and an approach toward optimal education. AAAI, X. Zhu, J. Liu, and M. Lopes. No learner left behind: On the complexity of teaching multiple learners simultaneously. In The 6th International Joint Conference on Artificial Intelligence (IJCAI), X. Zhu, A. Singla, S. Zilles, and A. N. Rafferty. An Overview of Machine Teaching. ArXiv e-prints, Jan. 8.
11 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu Supplemental Material Lemma 7. For the large margin classifier ˆθ S, we have ( ɛ) n + (ɛ) n < ɛ P R(ˆθ S ) > ɛ = ( )n < ɛ < ɛ =. (9) Proof. The ris is R(ˆθ S ) = ˆθ S. Define event E : { (x, ) S (x, +) S}. P ˆθ S ɛ can be decomposed into two components depending on if E happens as (4) shows. P ˆθ S ɛ = P ˆθ S ɛ, E + P ˆθ S ɛ, E c. (4) P ˆθ S ɛ, E c = P ˆθ S ɛ E c P E c. Note that P E c = ( )n, P ˆθ S ɛ E c is if ɛ < and if ɛ = because ˆθ S = ± always holds given E c happens. Thus, P ˆθ S ɛ, E c = if ɛ < ( )n if ɛ =. Now we compute P ˆθ S ɛ, E. Let n + be the number of positive points in S. Define E i : {n + = i}. Note E i E j = if i j and E = n i= E i, thus n P ˆθ S ɛ, E = i= n P ˆθ S ɛ, E i = i= (4) P ˆθ S ɛ E i P E i. (4) P E i = Ci n( )n. Note that P ˆθ S ɛ E i = P x +x+ ɛ E i. To compute it, we first compute F x,x + (ɛ, ɛ E i ) = P x ɛ, x + ɛ E i. Given E i happens, P x ɛ E i = ( ɛ ) n i and P x + ɛ E i = ( ɛ ) i. Also since x ɛ and x + ɛ are independent given E i happens, thus Tae the derivative of F gives F x,x + (ɛ, ɛ E i ) = P x ɛ, x + ɛ E i = ( ɛ ) n i ( ɛ ) i. (43) f x,x + (ɛ, ɛ E i ) = i(n i)( ɛ ) n i ( ɛ ) i. (44) Note that ˆθ S ɛ x x ɛ. Therefore, we integrate f x,x + (ɛ, ɛ E i ) over the region ɛ ɛ ɛ to obtain P ˆθ S ɛ E i. However, note that ɛ, ɛ, thus for ɛ >, the region ɛ ɛ ɛ becomes the whole,, and the integration is. Then (4) becomes P ˆθ S ɛ, E = n i= P E i = P E = ( )n. For ɛ, by (4) we have n P ˆθ S ɛ, E = P E i i(n i)( ɛ ) n i ( ɛ ) i dɛ dɛ i= ɛ ɛ ɛ n = Ci n ( )n i(n i)( ɛ ) n i ( ɛ ) i dɛ dɛ i= = ( )n ɛ ɛ ɛ ɛ ɛ ɛ i= n Ci n i(n i)( ɛ ) n i ( ɛ ) i dɛ dɛ. (45)
12 Note that C n i i(n i) = n(n )Cn i, (45) becomes P ˆθ S ɛ, E = n(n )( n )n C n i ( ɛ ) n i ( ɛ ) i dɛ dɛ ɛ ɛ ɛ i= = n(n )( n )n C n i ( ɛ ) n i ( ɛ ) i dɛ dɛ = n(n )( )n = n(n )( )n ɛ ɛ ɛ i= ɛ ɛ ɛ,, ( ɛ ɛ ) n dɛ dɛ ( ɛ ɛ ) n dɛ dɛ ɛ ɛ >ɛ ( ɛ ɛ ) n dɛ dɛ. (46) Now we compute the two integration in (46) =,, ( ɛ ɛ ) n dɛ dɛ = n ( ɛ ɛ ) n dɛ n ( ɛ ) n ( ɛ ) n dɛ = n(n ) ( ɛ)n + n(n ) ( ɛ ) n = n n(n ). (47) For the second integration, note that it can decomposed as ɛ ɛ >ɛ ( ɛ ɛ ) n dɛ dɛ = ɛ ɛ +ɛ ( ɛ ɛ ) n dɛ dɛ + ɛ ɛ ɛ ( ɛ ɛ ) n dɛ dɛ. (48) Since the two sub-integration s are identical because the two sub regions are symmetric. We only show the computation for the first. ɛ ɛ ( ɛ ɛ ) n dɛ dɛ = n ( ɛ ɛ ) n ɛ +ɛdɛ Thus we have = ɛ +ɛ ɛ n ( ɛ ) n + n n ( ɛ ɛ) n dɛ = n(n ) ( ɛ ) n n n(n ) ( ɛ ɛ) n ɛ = n n(n ) ɛn + ( ɛ) n n(n ). = ɛ ɛ >ɛ ( ɛ ɛ ) n dɛ dɛ = n n(n ) ɛn + ( ɛ) n n(n ). Combine (47) and (5), we can compute (46) as follows. P ˆθ S ɛ, E = = n(n )( n )n n(n ) ɛ ɛ +ɛ = n n ɛ n ( ɛ) n + ( )n = ɛ n ( ɛ) n ( ɛ ɛ ) n dɛ dɛ n n(n ) (ɛn + ( ɛ) n ) + n(n ) (49) (5) (5)
13 Yuzhe Ma, Robert Nowa, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu Therefore we have Now combine (4) and (5) we have which is equivalent to (9). P ˆθ S ɛ, E = P ˆθ S ɛ = ɛ n ( ɛ) n if ɛ ( )n if < ɛ. (5) ɛ n ( ɛ) n if ɛ ( )n if < ɛ < if ɛ =. Lemma 9. Let n = 4m, where m is an integer. Let S be an n-item iid sample drawn from p Z. ɛ >, δ (, ), M(ɛ, δ) = max{ 3e ln 4 ln 3 δ, ( ɛ ln 3 δ ) } such that m M(ɛ, δ), P R(ˆθ Bms(S)) ɛ > δ. Proof. Let S = {x (x, ) S} and S = {x (x, ) S} respectively. Then we have S + S = 4m. Define event E : { S m S m}. Then we have (53) m P E = Ci 4m ( )4m. (54) where we rule out all possible sequences of 4m points which lead to S < m or S < m. By standard result 37 (Lemma A.5) d = Cm ( em d )d, we have i= P E ( 4em m )m ( )4m = e m 4 m ( m m )m ( e 4 )m ( e 4 )m (55) where the nd-to-last inequality follows from the fact that e ( + m )m. Note that by definition m 3e ln 4 ln 3 δ > ln 4 ln 3 δ, thus ( e 4 )m < δ 3 and P E > δ 3. Since S + S = 4m, then either S m or S m. Without loss of generality we assume S m. We then divide the interval, equally into N = m (ln 3 δ ) segments. The length of each segment is N = O( m ) as Figure 4 shows. Note that m 3e ln 4 ln 3 δ > 3e ln 3 δ, thus N 3em > em > m. 3 N - Figure 4: segments Let N o be the number of segments that are occupied by the points in S. Note that N o is a random variable. Let E be the event that N o m. Now we lower bound P E. This is a variant of the coupon collector s problem: there are N distinct coupons, and in S trials we want to collect at least m distinct coupons. Note that P E = P E c = m i= P N o = i. Let T i be the number of all possible coupon sequences of S such that S occupies exactly i segments (i.e. distinct coupons). We have Ci n ways of choosing i segments among a total of N. Also, for each choice of i segments, the number of all possible coupon sequences of S such that S fully occupies those i segments without empty is upper bounded by i S. Thus T i Ci ni S and we have Since m P N o = i = 3e ln 4 ln 3 δ > log 3 δ, S m, and N > em, thus P E c = < m i= m i= P N o = i m i= T i N S Cn i ( i N ) S. (56) C n i ( i N ) S m i= C n i ( m N )m C n i ( m N )m ( en m )m ( m N )m = ( em N )m < ( em em )m = ( )m < δ 3. (57)
14 Thus P E δ 3. Applying union bound, P E, E δ 3. Let E 3 be the following event: there exist a point x in S such that x, the flipped point, lies in the same segment as some point x in S. If E 3 happens, then x + x = x ( x ) N. Note that P E 3 P E, E, E 3 = P E 3 E, E P E, E. Now we lower bound P E 3 E, E. Given E and E happen, we have S m and N o m. Since N = m (ln 3 δ ) m (ln 3 δ ), we have P E3 c E, E = ( N o N ) S ( m N ) S ( m N )m e m N δ 3. (58) Thus, P E 3 E, E = P E3 c E, E > δ 3. P E 3 P E, E, E 3 = P E 3 E, E P E, E ( δ δ 3 )( 3 ) > δ. Thus with probability at least δ, there exist x S and x S such that x + x N. We now bound N. N = m (ln 3 δ ) m (ln 3 δ ). Therefore N m ln 3 δ. Recall by definition m ( ɛ ln 3 δ ), thus N ɛ. We now have x + x ɛ. Finally, since {s, s + } selected by teacher B ms is the most symmetric pair, it must satisfy s + s + x + x ɛ. Putting together, with probability at least δ, R(ˆθ Bms(S)) = s + s + ɛ.
Optimal Teaching for Online Perceptrons
Optimal Teaching for Online Perceptrons Xuezhou Zhang Hrag Gorune Ohannessian Ayon Sen Scott Alfeld Xiaojin Zhu Department of Computer Sciences, University of Wisconsin Madison {zhangxz1123, gorune, ayonsn,
More informationMachine Teaching. for Personalized Education, Security, Interactive Machine Learning. Jerry Zhu
Machine Teaching for Personalized Education, Security, Interactive Machine Learning Jerry Zhu NIPS 2015 Workshop on Machine Learning from and for Adaptive User Technologies Supervised Learning Review D:
More informationPAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers
PAC-Bayes Ris Bounds for Sample-Compressed Gibbs Classifiers François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Département d informatique et de génie logiciel,
More informationIFT Lecture 7 Elements of statistical learning theory
IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationHow can a Machine Learn: Passive, Active, and Teaching
How can a Machine Learn: Passive, Active, and Teaching Xiaojin Zhu jerryzhu@cs.wisc.edu Department of Computer Sciences University of Wisconsin-Madison NLP&CC 2013 Zhu (Univ. Wisconsin-Madison) How can
More informationQuadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes
Proceedings of Machine Learning Research vol 65:1 10, 2017 Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes Lunjia Hu Ruihan Wu Tianhong Li Institute for Interdisciplinary Information
More information10-701/15-781, Machine Learning: Homework 4
10-701/15-781, Machine Learning: Homewor 4 Aarti Singh Carnegie Mellon University ˆ The assignment is due at 10:30 am beginning of class on Mon, Nov 15, 2010. ˆ Separate you answers into five parts, one
More informationMachine Teaching and its Applications
Machine Teaching and its Applications Jerry Zhu University of Wisconsin-Madison Jan. 8, 2018 1/64 Introduction 2/64 Machine teaching Given target model θ, learner A Find the best training set D so that
More informationMINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava
MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,
More informationQualifying Exam in Machine Learning
Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationSharp Generalization Error Bounds for Randomly-projected Classifiers
Sharp Generalization Error Bounds for Randomly-projected Classifiers R.J. Durrant and A. Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/ axk
More informationThe Power of Random Counterexamples
Proceedings of Machine Learning Research 76:1 14, 2017 Algorithmic Learning Theory 2017 The Power of Random Counterexamples Dana Angluin DANA.ANGLUIN@YALE.EDU and Tyler Dohrn TYLER.DOHRN@YALE.EDU Department
More informationFORMULATION OF THE LEARNING PROBLEM
FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we
More informationDiscriminative Learning can Succeed where Generative Learning Fails
Discriminative Learning can Succeed where Generative Learning Fails Philip M. Long, a Rocco A. Servedio, b,,1 Hans Ulrich Simon c a Google, Mountain View, CA, USA b Columbia University, New York, New York,
More informationGeneralization, Overfitting, and Model Selection
Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How
More informationarxiv: v1 [cs.lg] 18 Feb 2017
Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes Lunjia Hu, Ruihan Wu, Tianhong Li and Liwei Wang arxiv:1702.05677v1 [cs.lg] 18 Feb 2017 February 21, 2017 Abstract In this work
More informationEstimating Gaussian Mixture Densities with EM A Tutorial
Estimating Gaussian Mixture Densities with EM A Tutorial Carlo Tomasi Due University Expectation Maximization (EM) [4, 3, 6] is a numerical algorithm for the maximization of functions of several variables
More informationIntroduction: The Perceptron
Introduction: The Perceptron Haim Sompolinsy, MIT October 4, 203 Perceptron Architecture The simplest type of perceptron has a single layer of weights connecting the inputs and output. Formally, the perceptron
More informationUnlabeled Data: Now It Helps, Now It Doesn t
institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster
More informationActive Learning: Disagreement Coefficient
Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationIntroduction to Machine Learning Midterm Exam
10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but
More informationCombine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning
Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Michalis K. Titsias Department of Informatics Athens University of Economics and Business
More informationLearning Bound for Parameter Transfer Learning
Learning Bound for Parameter Transfer Learning Wataru Kumagai Faculty of Engineering Kanagawa University kumagai@kanagawa-u.ac.jp Abstract We consider a transfer-learning problem by using the parameter
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationMAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing
MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing Afonso S. Bandeira April 9, 2015 1 The Johnson-Lindenstrauss Lemma Suppose one has n points, X = {x 1,..., x n }, in R d with d very
More informationOn the Sample Complexity of Noise-Tolerant Learning
On the Sample Complexity of Noise-Tolerant Learning Javed A. Aslam Department of Computer Science Dartmouth College Hanover, NH 03755 Scott E. Decatur Laboratory for Computer Science Massachusetts Institute
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationSample width for multi-category classifiers
R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University
More informationCSE 546 Final Exam, Autumn 2013
CSE 546 Final Exam, Autumn 0. Personal info: Name: Student ID: E-mail address:. There should be 5 numbered pages in this exam (including this cover sheet).. You can use any material you brought: any book,
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More informationComputational Learning Theory
CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful
More informationBeing Taught can be Faster than Asking Questions
Being Taught can be Faster than Asking Questions Ronald L. Rivest Yiqun Lisa Yin Abstract We explore the power of teaching by studying two on-line learning models: teacher-directed learning and self-directed
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationGaussian Processes (10/16/13)
STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs
More informationIntroduction to Machine Learning Midterm Exam Solutions
10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationGradient-Based Learning. Sargur N. Srihari
Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation
More informationFrom Batch to Transductive Online Learning
From Batch to Transductive Online Learning Sham Kakade Toyota Technological Institute Chicago, IL 60637 sham@tti-c.org Adam Tauman Kalai Toyota Technological Institute Chicago, IL 60637 kalai@tti-c.org
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationThe Expectation-Maximization Algorithm
The Expectation-Maximization Algorithm Francisco S. Melo In these notes, we provide a brief overview of the formal aspects concerning -means, EM and their relation. We closely follow the presentation in
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationarxiv: v1 [cs.it] 21 Feb 2013
q-ary Compressive Sensing arxiv:30.568v [cs.it] Feb 03 Youssef Mroueh,, Lorenzo Rosasco, CBCL, CSAIL, Massachusetts Institute of Technology LCSL, Istituto Italiano di Tecnologia and IIT@MIT lab, Istituto
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationDS-GA 1002 Lecture notes 2 Fall Random variables
DS-GA 12 Lecture notes 2 Fall 216 1 Introduction Random variables Random variables are a fundamental tool in probabilistic modeling. They allow us to model numerical quantities that are uncertain: the
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More informationEfficient Semi-supervised and Active Learning of Disjunctions
Maria-Florina Balcan ninamf@cc.gatech.edu Christopher Berlind cberlind@gatech.edu Steven Ehrlich sehrlich@cc.gatech.edu Yingyu Liang yliang39@gatech.edu School of Computer Science, College of Computing,
More informationLearning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute
More information10.1 The Formal Model
67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationReferences for online kernel methods
References for online kernel methods W. Liu, J. Principe, S. Haykin Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley, 2010. W. Liu, P. Pokharel, J. Principe. The kernel least mean square
More information1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015
10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 1 Active Learning Most classic machine learning methods and the formal learning
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationTHE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY
THE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY Dan A. Simovici UMB, Doctoral Summer School Iasi, Romania What is Machine Learning? The Vapnik-Chervonenkis Dimension Probabilistic Learning Potential
More informationMachine Learning 4771
Machine Learning 477 Instructor: Tony Jebara Topic 5 Generalization Guarantees VC-Dimension Nearest Neighbor Classification (infinite VC dimension) Structural Risk Minimization Support Vector Machines
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationToward Adversarial Learning as Control
Toward Adversarial Learning as Control Jerry Zhu University of Wisconsin-Madison The 2nd ARO/IARPA Workshop on Adversarial Machine Learning May 9, 2018 Test time attacks Given classifier f : X Y, x X Attacker
More informationLecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;
CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and
More informationDiscriminative Direction for Kernel Classifiers
Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering
More information12.1 A Polynomial Bound on the Sample Size m for PAC Learning
67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 12: PAC III Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 In this lecture will use the measure of VC dimension, which is a combinatorial
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationSingle-tree GMM training
Single-tree GMM training Ryan R. Curtin May 27, 2015 1 Introduction In this short document, we derive a tree-independent single-tree algorithm for Gaussian mixture model training, based on a technique
More informationThe Decision List Machine
The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca
More informationarxiv: v1 [cs.ds] 3 Feb 2018
A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based
More informationUnsupervised Learning with Permuted Data
Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University
More informationMachine Learning in the Data Revolution Era
Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,
More informationLecture 3: Introduction to Complexity Regularization
ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,
More informationLecture 4: Two-point Sampling, Coupon Collector s problem
Randomized Algorithms Lecture 4: Two-point Sampling, Coupon Collector s problem Sotiris Nikoletseas Associate Professor CEID - ETY Course 2013-2014 Sotiris Nikoletseas, Associate Professor Randomized Algorithms
More informationConsistency of Nearest Neighbor Methods
E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study
More information1 Machine Learning Concepts (16 points)
CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationLearning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin
Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good
More informationTheorem (Special Case of Ramsey s Theorem) R(k, l) is finite. Furthermore, it satisfies,
Math 16A Notes, Wee 6 Scribe: Jesse Benavides Disclaimer: These notes are not nearly as polished (and quite possibly not nearly as correct) as a published paper. Please use them at your own ris. 1. Ramsey
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationA Large Deviation Bound for the Area Under the ROC Curve
A Large Deviation Bound for the Area Under the ROC Curve Shivani Agarwal, Thore Graepel, Ralf Herbrich and Dan Roth Dept. of Computer Science University of Illinois Urbana, IL 680, USA {sagarwal,danr}@cs.uiuc.edu
More informationOptimization Methods for Machine Learning (OMML)
Optimization Methods for Machine Learning (OMML) 2nd lecture (2 slots) Prof. L. Palagi 16/10/2014 1 What is (not) Data Mining? By Namwar Rizvi - Ad Hoc Query: ad Hoc queries just examines the current data
More informationThe Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems
The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems Weinan E 1 and Bing Yu 2 arxiv:1710.00211v1 [cs.lg] 30 Sep 2017 1 The Beijing Institute of Big Data Research,
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationGeneralization and Overfitting
Generalization and Overfitting Model Selection Maria-Florina (Nina) Balcan February 24th, 2016 PAC/SLT models for Supervised Learning Data Source Distribution D on X Learning Algorithm Expert / Oracle
More informationA Result of Vapnik with Applications
A Result of Vapnik with Applications Martin Anthony Department of Statistical and Mathematical Sciences London School of Economics Houghton Street London WC2A 2AE, U.K. John Shawe-Taylor Department of
More informationSupport Vector Machines
Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6
More informationWavelet Transform And Principal Component Analysis Based Feature Extraction
Wavelet Transform And Principal Component Analysis Based Feature Extraction Keyun Tong June 3, 2010 As the amount of information grows rapidly and widely, feature extraction become an indispensable technique
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationAnalysis of Spectral Kernel Design based Semi-supervised Learning
Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,
More informationPointwise Exact Bootstrap Distributions of Cost Curves
Pointwise Exact Bootstrap Distributions of Cost Curves Charles Dugas and David Gadoury University of Montréal 25th ICML Helsinki July 2008 Dugas, Gadoury (U Montréal) Cost curves July 8, 2008 1 / 24 Outline
More information