How can a Machine Learn: Passive, Active, and Teaching
  Xiaojin Zhu  jerryzhu@cs.wisc.edu
  Department of Computer Sciences, University of Wisconsin-Madison
  NLP&CC 2013
Item
  x ∈ R^d
Class Label
  y ∈ {-1, +1}
Learning
  You see a training set (x_1, y_1), ..., (x_n, y_n)
  You must learn a good function f : R^d → {-1, +1}
  f must predict the label y on a test item x, which may not be in the training set
  You need some assumptions
The World
  p(x, y)
Independent and Identically Distributed
  (x_1, y_1), ..., (x_n, y_n), (x, y) ~ iid p(x, y)
Example: Noiseless 1D Threshold Classifier
  p(x) = uniform[0, 1]
  p(y = 1 | x) = 0 if x < θ,  1 if x ≥ θ
Example: Noisy 1D Threshold Classifier
  p(x) = uniform[0, 1]
  p(y = 1 | x) = ε if x < θ,  1 - ε if x ≥ θ
Generalization Error
  R(f) = E_{(x,y)~p(x,y)} 1[f(x) ≠ y]
  Approximated by the test-set error (1/m) Σ_{i=n+1}^{n+m} 1[f(x_i) ≠ y_i]
  on a test set (x_{n+1}, y_{n+1}), ..., (x_{n+m}, y_{n+m}) ~ iid p(x, y)
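A minimal Python sketch (not from the talk) of the noisy 1D threshold model and the test-set approximation of R(f); the values θ = 0.6 and ε = 0.1 are assumed for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, eps = 0.6, 0.1                 # assumed true threshold and label noise

    def sample(n):
        """Draw (x, y) iid from the noisy 1D threshold model p(x, y)."""
        x = rng.uniform(0.0, 1.0, n)
        p_y1 = np.where(x >= theta, 1.0 - eps, eps)     # p(y = 1 | x)
        y = np.where(rng.uniform(size=n) < p_y1, 1, -1)
        return x, y

    def f(x, theta_hat):
        """Threshold classifier sign(x - theta_hat) with values in {-1, +1}."""
        return np.where(x >= theta_hat, 1, -1)

    # Approximate the generalization error R(f) by the error on a large test set.
    x_test, y_test = sample(100_000)
    for theta_hat in (0.5, 0.6, 0.7):
        err = np.mean(f(x_test, theta_hat) != y_test)
        print(f"theta_hat = {theta_hat:.1f}   test error ~ {err:.3f}")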
Zero Generalization Error is a Dream
  Speed limit #1: the Bayes error
    R(Bayes) = E_{x~p(x)} [ 1/2 - |p(y = 1 | x) - 1/2| ]
  achieved by the Bayes classifier sign(p(y = 1 | x) - 1/2)
  No learner can do better than the Bayes classifier
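For the noisy threshold model above (assuming ε ≤ 1/2), this speed limit works out to ε itself:

    R(Bayes) = E_{x~p(x)} [ 1/2 - |p(y = 1 | x) - 1/2| ] = E_{x~p(x)} [ min(ε, 1 - ε) ] = ε

since p(y = 1 | x) is either ε or 1 - ε for every x.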
Hypothesis Space
  f ∈ F ⊆ {g : R^d → {-1, +1} measurable}
Approximation Error
  F may include the Bayes classifier ... or not
    e.g. F = {g(x) = sign(x - θ) : θ ∈ [0, 1]}
    e.g. F = {g(x) = sign(sin(αx)) : α > 0}
  Speed limit #2: the approximation error
    inf_{g∈F} R(g) - R(Bayes)
Estimation Error
  Let f* = arg inf_{g∈F} R(g). Can we at least learn f*?
  No. You see a training set (x_1, y_1), ..., (x_n, y_n), not p(x, y)
  You learn f̂_n
  Speed limit #3: the estimation error
    R(f̂_n) - R(f*)
Estimation Error
  As the training set size n increases, the estimation error goes down
  But how quickly?
Paradigm 1: Passive Learning
  (x_1, y_1), ..., (x_n, y_n) ~ iid p(x, y)
  1D example: estimation error O(1/n)
Paradigm 2: Active Learning
  In iteration t:
    1. the learner picks a query x_t
    2. the world (oracle) answers with a label y_t ~ p(y | x_t)
  Pick x_t to maximally reduce the hypothesis space
  1D example: estimation error O(1/2^n)
Paradigm 3: Teaching
  A teacher designs the training set
  1D example: x_1 = θ - ε/2, y_1 = -1;  x_2 = θ + ε/2, y_2 = 1
  n = 2 suffices to drive the estimation error to ε
  (teaching dimension [Goldman & Kearns 95])
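A rough simulation sketch (my own, with an assumed θ = 0.3 and n = 10 queries) of the three paradigms on the noiseless 1D threshold problem, tracking how fast the interval of thresholds consistent with the labels shrinks:

    import numpy as np

    rng = np.random.default_rng(1)
    theta = 0.3                            # assumed true threshold

    def label(x):
        return 1 if x >= theta else -1

    def interval_width(pairs):
        """Width of the set of thresholds consistent with the labeled pairs."""
        lo = max([x for x, y in pairs if y == -1], default=0.0)
        hi = min([x for x, y in pairs if y == 1], default=1.0)
        return hi - lo

    n = 10

    # Passive: iid queries; the width shrinks roughly like O(1/n).
    passive = [(x, label(x)) for x in rng.uniform(0, 1, n)]

    # Active: binary search; the width shrinks like O(1/2^n).
    lo, hi, active = 0.0, 1.0, []
    for _ in range(n):
        x = (lo + hi) / 2
        y = label(x)
        active.append((x, y))
        lo, hi = (lo, x) if y == 1 else (x, hi)

    # Teaching: two examples straddling theta pin it down to width eps right away.
    eps = 1e-3
    teaching = [(theta - eps / 2, -1), (theta + eps / 2, 1)]

    for name, D in [("passive", passive), ("active", active), ("teaching", teaching)]:
        print(f"{name:9s} n = {len(D):2d}   interval width = {interval_width(D):.6f}")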
Teaching as an Optimization Problem
  min_D  loss(f̂_D, θ*) + effort(D)
Teaching Bayesian Learners
  If we choose
    loss(f̂_D, θ*) = KL(δ_{θ*} ‖ p(θ | D)),   effort(D) = cn
  the problem becomes
    min_{n, x_1, ..., x_n}  -log p(θ* | x_1, ..., x_n) + cn
Example 1: Teaching a 1D Threshold Classifier
  [Figure: three regimes around θ: passive learning "waits" (error O(1/n)), active learning "explores" (error O(1/2^n)), teaching "guides"]
  p_0(θ) = 1 (uniform on [0, 1])
  p(y = 1 | x, θ) = 1 if x ≥ θ, and 0 otherwise
  effort(D) = c|D|
  For any D = {(x_1, y_1), ..., (x_n, y_n)},
    p(θ | D) = uniform[ max_{i: y_i = -1} x_i,  min_{i: y_i = 1} x_i ]
  The optimal teaching problem becomes
    min_{n, x_1, y_1, ..., x_n, y_n}  log( min_{i: y_i = 1} x_i - max_{i: y_i = -1} x_i ) + cn
  One solution: D = {(θ* - ε/2, -1), (θ* + ε/2, 1)} as ε → 0, with teaching impedance (objective value) TI = log(ε) + 2c
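A small numeric check of Example 1 (θ* = 0.3 and c = 1 are assumed): for D = {(θ* - ε/2, -1), (θ* + ε/2, 1)} the objective -log p(θ* | D) + c|D| equals log(ε) + 2c, which keeps decreasing as ε → 0.

    import numpy as np

    theta_star, c = 0.3, 1.0                      # assumed target and per-example cost

    def objective(D, c):
        """-log p(theta*|D) + c*|D| for the uniform-prior threshold learner."""
        lo = max([x for x, y in D if y == -1], default=0.0)
        hi = min([x for x, y in D if y == 1], default=1.0)
        return np.log(hi - lo) + c * len(D)       # posterior is uniform on [lo, hi]

    for eps in (1e-1, 1e-3, 1e-6):
        D = [(theta_star - eps / 2, -1), (theta_star + eps / 2, 1)]
        print(f"eps = {eps:.0e}   objective = {objective(D, c):.3f}   log(eps) + 2c = {np.log(eps) + 2 * c:.3f}")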
Example 2: A Learner with Poor Perception
  Same as Example 1, but the learner cannot distinguish similar items
  Encode this in effort():
    effort(D) = c / min_{x_i, x_j ∈ D} |x_i - x_j|
  With D = {(θ* - ε/2, -1), (θ* + ε/2, 1)},
    TI = log(ε) + c/ε,  with minimum at ε = c
  So D = {(θ* - c/2, -1), (θ* + c/2, 1)}
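The same kind of check for Example 2 (c = 0.05 assumed): minimizing log(ε) + c/ε numerically recovers ε = c.

    import numpy as np
    from scipy.optimize import minimize_scalar

    c = 0.05                                       # assumed effort constant

    def ti(eps):
        """TI of D = {(theta*-eps/2, -1), (theta*+eps/2, 1)} under effort = c / eps."""
        return np.log(eps) + c / eps

    res = minimize_scalar(ti, bounds=(1e-9, 1.0), method="bounded")
    print(res.x, c)                                # the optimal eps is ~ c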
Example 3: Teaching to Pick a Model Out of Two
  Θ = {θ_A = N(-1/4, 1/2), θ_B = N(1/4, 1/2)},  p_0(θ_A) = p_0(θ_B) = 1/2,  θ* = θ_A
  Let D = {x_1, ..., x_n}. Then
    loss(D) = log(1 + exp(Σ_{i=1}^n x_i)),
  minimized by x_i → -∞. But suppose box constraints x_i ∈ [-d, d]:
    min_{n, x_1, ..., x_n}  log(1 + exp(Σ_{i=1}^n x_i)) + cn   s.t. |x_i| ≤ d, ∀i
  Solution: all x_i = -d, and n = max(0, [(1/d) log(d/c - 1)])
  Note n = 0 for certain combinations of c, d (e.g. when c ≥ d): the effort of teaching
  outweighs the benefit. The teacher may choose to not teach at all and maintain the
  status quo (prior p_0) of the learner!
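A numeric check of Example 3 (c = 0.1 and d = 1 assumed): with all x_i = -d, scan over n and compare against the closed form.

    import numpy as np

    c, d = 0.1, 1.0                      # assumed per-example cost and box half-width

    def objective(n):
        # with all x_i = -d, the loss is log(1 + exp(-d*n)) and the effort is c*n
        return np.log1p(np.exp(-d * n)) + c * n

    n_cont = max(0.0, np.log(d / c - 1) / d)      # continuous optimum (1/d) log(d/c - 1)
    n_best = min(range(0, 100), key=objective)    # brute-force integer optimum
    print(f"continuous n = {n_cont:.2f}   integer n = {n_best}   objective = {objective(n_best):.3f}")
    # With c >= d the optimum is n = 0: teaching costs more than it helps.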
Teaching Dimension is a Special Case
  Given a concept class C = {c}, define p(y = 1 | x, θ_c) = 1[c(x) = +] and p(x) uniform
  The world has θ* = θ_{c*}
  The learner has Θ = {θ_c : c ∈ C} and p_0(θ) = 1/|C|, so
    p(θ_c | D) = 1 / #{c' ∈ C consistent with D} if c is consistent with D, and 0 otherwise
  The teaching dimension [Goldman & Kearns 95] TD(c*) is the minimum cardinality of a D that
  uniquely identifies the target concept. It is recovered by
    min_D  -log p(θ_{c*} | D) + γ|D|
  for sufficiently small γ (e.g. γ ≪ 1/|C|): the solution D is a minimum teaching set for c*,
  and |D| = TD(c*)
Teaching Bayesian Learners in the Exponential Family
  So far we solved the examples by inspection.
  Exponential family:
    p(x | θ) = h(x) exp(θᵀ T(x) - A(θ))
    T(x) ∈ R^D  sufficient statistics of x
    θ ∈ R^D     natural parameter
    A(θ)        log partition function
    h(x)        base measure
  For D = {x_1, ..., x_n} the likelihood is
    p(D | θ) = (Π_{i=1}^n h(x_i)) exp(θᵀ s - n A(θ))
  with aggregate sufficient statistics s ≡ Σ_{i=1}^n T(x_i)
  Two-step algorithm: find the aggregate sufficient statistics, then unpack them into examples
Step 1: Aggregate Sufficient Statistics from Conjugacy
  The conjugate prior has natural parameters (λ_1, λ_2) ∈ R^D × R:
    p(θ | λ_1, λ_2) = h_0(θ) exp(λ_1ᵀ θ - λ_2 A(θ) - A_0(λ_1, λ_2))
  The posterior is
    p(θ | D, λ_1, λ_2) = h_0(θ) exp((λ_1 + s)ᵀ θ - (λ_2 + n) A(θ) - A_0(λ_1 + s, λ_2 + n))
  D enters the posterior only via s and n
  The optimal teaching problem becomes
    min_{n, s}  -θ*ᵀ(λ_1 + s) + A(θ*)(λ_2 + n) + A_0(λ_1 + s, λ_2 + n) + effort(n, s)
  Convex relaxation: n ∈ R and s ∈ R^D (assuming effort(n, s) is convex)
Step 2: Unpacking
  We cannot teach with the aggregate sufficient statistics themselves
  Round n ← max(0, [n]), then find n teaching examples whose aggregate sufficient statistics equal s:
    Exponential distribution, T(x) = x: set x_1 = ... = x_n = s/n
    Poisson distribution, T(x) = x on integers: round x_1, ..., x_n
    Gaussian distribution, T(x) = (x, x^2): harder. For n = 3, s = (3, 5):
      {x_1 = 0, x_2 = 1, x_3 = 2}  or  {x_1 = 1/2, x_2 = (5 + √13)/4, x_3 = (5 - √13)/4}
  An approximate unpacking algorithm:
    1. initialize x_i ~ iid p(x | θ*), i = 1, ..., n
    2. solve min_{x_1, ..., x_n} ‖s - Σ_{i=1}^n T(x_i)‖^2  (nonconvex)
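A sketch of the approximate unpacking algorithm for the Gaussian case T(x) = (x, x^2) with n = 3 and s = (3, 5) from this slide; scipy.optimize.minimize stands in for a generic local solver, and the initialization mean and variance are assumed.

    import numpy as np
    from scipy.optimize import minimize

    n = 3
    s_target = np.array([3.0, 5.0])        # aggregate sufficient statistics (sum x_i, sum x_i^2)

    def residual(x):
        s = np.array([x.sum(), (x ** 2).sum()])
        return np.sum((s_target - s) ** 2)

    rng = np.random.default_rng(0)
    x0 = rng.normal(1.0, 1.0, n)            # step 1: initialize x_i ~ p(x | theta*), assumed N(1, 1)
    res = minimize(residual, x0)            # step 2: nonconvex, so the answer depends on x0
    x = np.sort(res.x)
    print(x, x.sum(), (x ** 2).sum())       # close to {0, 1, 2} or to the other exact solution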
Example 4: Teaching the Mean of a Univariate Gaussian
  The world is N(x; μ*, σ^2), with σ^2 known to the learner; T(x) = x
  The learner's prior in standard form: μ ~ N(μ_0, σ_0^2)
  Optimal aggregate sufficient statistic:
    s = (σ^2/σ_0^2)(μ* - μ_0) + μ* n
  Note s/n ≠ μ*: it compensates for the learner's initial belief μ_0
  n is the solution to
    n - 1/(2 effort'(n)) + σ^2/σ_0^2 = 0
  e.g. when effort(n) = cn,  n = 1/(2c) - σ^2/σ_0^2
  Not worth teaching if the learner initially had a "narrow mind": σ_0^2 < 2cσ^2
  Unpacking s is trivial, e.g. x_1 = ... = x_n = s/n
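Example 4 in a few lines of Python (the numbers μ* = 0, σ^2 = 1, μ_0 = 2, σ_0^2 = 4, c = 0.01 are assumed): compute n and s, then verify that the learner's posterior mean lands exactly on μ*.

    import numpy as np

    mu_star, sigma2 = 0.0, 1.0              # world (assumed)
    mu0, sigma2_0 = 2.0, 4.0                # learner's prior N(mu0, sigma0^2) (assumed)
    c = 0.01                                # effort(n) = c*n (assumed)

    n = int(round(max(0.0, 1.0 / (2 * c) - sigma2 / sigma2_0)))   # optimal n, rounded
    s = sigma2 / sigma2_0 * (mu_star - mu0) + mu_star * n         # optimal aggregate statistic

    # Posterior after seeing n examples whose sum is s:
    post_mean = (sigma2 * mu0 + sigma2_0 * s) / (sigma2 + n * sigma2_0)
    print(n, s / n if n else None, post_mean)   # posterior mean equals mu_star; note s/n != mu_star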
Example 5: Teaching a Multinomial Distribution
  The world: a multinomial π* = (π*_1, ..., π*_K)
  The learner: a Dirichlet prior p(π | β) = [Γ(Σ_k β_k) / Π_k Γ(β_k)] Π_{k=1}^K π_k^{β_k - 1}
  Step 1: find the aggregate sufficient statistics s = (s_1, ..., s_K)
    min_s  -log Γ(Σ_{k=1}^K (β_k + s_k)) + Σ_{k=1}^K log Γ(β_k + s_k)
           - Σ_{k=1}^K (β_k + s_k - 1) log π*_k + effort(s)
    relaxed to s ∈ R^K, s ≥ 0
  Step 2: unpack s_k ← [s_k] for k = 1, ..., K
Examples of Example 5
  The world: π* = (1/10, 3/10, 6/10); the learner's wrong Dirichlet prior: β = (6, 3, 1)
  If effortless, effort(s) = 0: s = (317, 965, 1933) (found with fmincon)
    The MLE from D is (0.099, 0.300, 0.601), very close to π*
    "Brute-force teaching": use big data to overwhelm the learner's prior
  If costly, effort(s) = 0.3 Σ_{k=1}^K s_k: s = (0, 2, 8), TI = 2.65
    Not s = (1, 3, 6), which mirrors π* but ignores the wrong prior: TI = 4.51
    Not s = (317, 965, 1933): TI = 956.25
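The TI values on this slide can be checked with a short script (my own, using scipy rather than fmincon). It evaluates -log p(π* | β + s) + effort(s) for the Dirichlet posterior, with effort(s) = 0.3 Σ_k s_k as above.

    import numpy as np
    from scipy.special import gammaln

    pi_star = np.array([0.1, 0.3, 0.6])
    beta = np.array([6.0, 3.0, 1.0])            # the learner's (wrong) Dirichlet prior

    def ti(s, cost_per_item=0.3):
        """-log p(pi* | beta + s) + effort(s) for a Dirichlet posterior."""
        a = beta + np.asarray(s, dtype=float)
        log_post = gammaln(a.sum()) - gammaln(a).sum() + ((a - 1) * np.log(pi_star)).sum()
        return -log_post + cost_per_item * np.sum(s)

    for s in ([0, 2, 8], [1, 3, 6], [317, 965, 1933]):
        print(s, round(float(ti(s)), 2))         # ~ 2.65, 4.51, 956.25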
Example 6: Teaching a Multivariate Gaussian
  The world: μ* ∈ R^D and Σ* ∈ R^{D×D}
  The learner: likelihood N(x | μ, Σ) with a Normal-Inverse-Wishart (NIW) prior
  Given x_1, ..., x_n ∈ R^D, the aggregate sufficient statistics are
    s = Σ_{i=1}^n x_i,   S = Σ_{i=1}^n x_i x_iᵀ
  Step 1: optimal aggregate sufficient statistics via an SDP
    min_{n, s, S}  (D log 2 / 2) ν_n + Σ_{i=1}^D log Γ((ν_n + 1 - i)/2) - (ν_n/2) log|Λ_n|
                   - (D/2) log κ_n + (ν_n/2) log|Σ*| + (1/2) tr(Σ*^{-1} Λ_n)
                   + (κ_n/2)(μ* - μ_n)ᵀ Σ*^{-1} (μ* - μ_n) + effort(n, s, S)
    s.t. S ⪰ 0;  S_ii ≥ s_i^2/n, ∀i
  Step 2: unpack s, S
    initialize x_1, ..., x_n ~ iid N(μ*, Σ*)
    solve min_{x_1, ..., x_n} ‖vec(s, S) - Σ_{i=1}^n vec(T(x_i))‖^2
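A sketch of Step 2 for the multivariate case. The target statistics s = 0 and S = n·I are assumed for illustration (they admit an exact solution, a regular tetrahedron for n = 4 in R^3); the x_i are initialized from N(μ*, Σ*) and the squared residual is minimized with a generic local solver.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    D, n = 3, 4
    mu_star, Sigma_star = np.zeros(D), np.eye(D)

    s_target = np.zeros(D)                 # assumed Step 1 output: sum of the x_i
    S_target = n * np.eye(D)               # assumed Step 1 output: sum of x_i x_i^T

    def residual(flat):
        X = flat.reshape(n, D)
        s = X.sum(axis=0)
        S = X.T @ X                        # sum_i x_i x_i^T
        return np.sum((s - s_target) ** 2) + np.sum((S - S_target) ** 2)

    x0 = rng.multivariate_normal(mu_star, Sigma_star, size=n).ravel()   # init from N(mu*, Sigma*)
    res = minimize(residual, x0)
    X = res.x.reshape(n, D)
    print(np.round(X, 3))
    print("residual:", residual(res.x))    # ~0: the four points form (a rotation of) a tetrahedron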
Examples of Example 6
  The target Gaussian: μ* = (0, 0, 0)ᵀ and Σ* = I
  The learner's NIW prior: μ_0 = (1, 1, 1)ᵀ, κ_0 = 1, ν_0 = 2 + 10^-5, Λ_0 = 10^-5 I
  Expensive teaching: effort(n, s, S) = n
  The optimal D has n = 4, unpacked into a tetrahedron
  [Figure: the four optimal teaching points form a tetrahedron around μ*]
  TI(D) = 1.69. By contrast, four points drawn iid from N(μ*, Σ*) have
  mean(TI) = 9.06 ± 3.34, min(TI) = 1.99, max(TI) = 35.51 (over 100,000 trials)
  [Figure: a sample of four iid points, for comparison]