Online Manifold Regularization: A New Learning Setting and Empirical Study

Andrew B. Goldberg¹, Ming Li², Xiaojin Zhu¹
¹ Computer Sciences, University of Wisconsin-Madison, USA. {goldberg, jerryzhu}@cs.wisc.edu
² National Key Laboratory for Novel Software Technology, Nanjing University, China. lim@lamda.nju.edu.cn
Life-long learning

[Figure: a data stream x_1, x_2, ..., x_1000, ..., x_1000000, ...; only a few labels appear: y_1 = 0, y_1000 = 1, y_1000000 = 0, with "n/a" elsewhere]

Unlike standard supervised learning:
- Examples arrive sequentially; we cannot even store them all
- Most examples are unlabeled
- No iid assumption; p(x, y) can change over time
This is how children learn, too

[Figure: the same mostly-unlabeled data stream x_1, x_2, ..., with only occasional labels]
New paradigm: online semi-supervised learning

Main contribution: merging two learning settings
1. Online: learn sequentially from non-iid but fully labeled data
2. Semi-supervised: learn from an iid batch that is (mostly) unlabeled

The protocol (sketched in code below):
1. At time t, the adversary picks x_t \in X, y_t \in Y (not necessarily iid) and shows x_t
2. The learner has a classifier f_t: X \to \mathbb{R} and predicts f_t(x_t)
3. With small probability, the adversary reveals y_t; otherwise it abstains (unlabeled)
4. The learner updates to f_{t+1} based on x_t and y_t (if given). Repeat.

Many batch SSL algorithms exist; we focus on manifold regularization.
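A minimal sketch of this protocol loop (assumptions: a generic learner object exposing predict/update, a stream of (x_t, y_t) pairs, and a label-revealing probability p_label; all names are illustrative):

```python
import numpy as np

def run_online_ssl(learner, stream, p_label=0.02, seed=0):
    """Online SSL protocol: the learner sees every x_t, but the adversary
    reveals y_t only with small probability p_label."""
    rng = np.random.default_rng(seed)
    preds = []
    for x_t, y_t in stream:                                # step 1: adversary shows x_t
        preds.append(learner.predict(x_t))                 # step 2: predict f_t(x_t)
        revealed = y_t if rng.random() < p_label else None # step 3: mostly unlabeled
        learner.update(x_t, revealed)                      # step 4: f_t -> f_{t+1}
    return preds
```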
Review: batch manifold regularization

A form of graph-based semi-supervised learning [Belkin et al., JMLR06]:
- Graph on x_1, ..., x_T
- Edge weights w_{st} encode the similarity between x_s and x_t, e.g., kNN
- Assumption: similar examples have similar labels

Manifold regularization minimizes the risk (a numpy sketch follows):

J(f) = \frac{1}{l} \sum_{t=1}^{T} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \frac{\lambda_2}{2T} \sum_{s,t=1}^{T} (f(x_s) - f(x_t))^2 w_{st}

Here c(f(x), y) is a convex loss function, e.g., the hinge loss. Solution: f^* = \arg\min_f J(f). Generalizes graph mincut and label propagation.
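As a concrete illustration, a numpy sketch of evaluating this batch risk with the hinge loss (f_vals, f_norm_sq, and W are assumed precomputed; labels taken to be in {-1, +1}; all names are mine):

```python
import numpy as np

def batch_mr_risk(f_vals, y, labeled, f_norm_sq, W, lam1, lam2):
    """Batch MR risk J(f) with the hinge loss.
    f_vals: f(x_1..x_T); y: labels in {-1,+1} (ignored where unlabeled);
    labeled: boolean mask playing the role of delta(y_t); W: T x T weights w_st."""
    T, l = len(f_vals), labeled.sum()
    hinge = np.maximum(0.0, 1.0 - y * f_vals)            # c(f(x_t), y_t)
    loss = hinge[labeled].sum() / l                      # labeled-loss term
    sq_diffs = (f_vals[:, None] - f_vals[None, :]) ** 2  # (f(x_s) - f(x_t))^2
    return loss + 0.5 * lam1 * f_norm_sq + lam2 / (2 * T) * (W * sq_diffs).sum()
```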
From batch to online

Batch risk = average of the instantaneous risks: J(f) = \frac{1}{T} \sum_{t=1}^{T} J_t(f)

Batch risk:
J(f) = \frac{1}{l} \sum_{t=1}^{T} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \frac{\lambda_2}{2T} \sum_{s,t=1}^{T} (f(x_s) - f(x_t))^2 w_{st}

Instantaneous risk:
J_t(f) = \frac{T}{l} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \lambda_2 \sum_{i=1}^{t} (f(x_i) - f(x_t))^2 w_{it}
(includes the graph edges between x_t and all previous examples; a sketch follows)
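A matching sketch of J_t under the same assumptions as the batch-risk sketch (f_vals holds the current f evaluated at x_1, ..., x_t; w_t holds the edge weights w_it to the new point):

```python
def instantaneous_risk(f_vals, y_t, labeled, w_t, f_norm_sq, T, l, lam1, lam2):
    """Instantaneous risk J_t(f); f_vals and w_t are numpy arrays of length t."""
    loss = (T / l) * max(0.0, 1.0 - y_t * f_vals[-1]) if labeled else 0.0
    graph = lam2 * (w_t * (f_vals - f_vals[-1]) ** 2).sum()  # edges to x_t
    return loss + 0.5 * lam1 * f_norm_sq + graph
```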
Online convex programming

Instead of minimizing the convex J(f), reduce the convex J_t(f) at each step t:

f_{t+1} = f_t - \eta_t \left. \frac{\partial J_t(f)}{\partial f} \right|_{f_t}

Remarkable no-regret guarantee against the adversary [Zinkevich ICML03]:

regret \equiv \frac{1}{T} \sum_{t=1}^{T} J_t(f_t) - J(f^*)

No regret: \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} J_t(f_t) - J(f^*) \le 0

Note: accuracy can be arbitrarily bad if the adversary flips the target often; if so, no batch learner in hindsight can do well either.

If there is no adversary (iid data), the average classifier \bar{f} = \frac{1}{T} \sum_{t=1}^{T} f_t is good: J(\bar{f}) \to J(f^*).
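To see the update in action, a toy sanity check on a linear predictor with squared loss (this is online gradient descent on a stand-in convex risk, not the MR risk; all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
w, w_true = np.zeros(2), np.array([1.0, -2.0])
avg_risk = 0.0
for t in range(1, 2001):
    x_t = rng.normal(size=2)
    y_t = x_t @ w_true                  # target (an adversary could drift it)
    eta_t = 0.1 / np.sqrt(t)            # decaying step size
    residual = w @ x_t - y_t
    avg_risk += 0.5 * residual ** 2     # instantaneous risk J_t(w_t)
    w -= eta_t * residual * x_t         # w_{t+1} = w_t - eta_t dJ_t/dw
print(avg_risk / 2000, w)               # average risk is small; w approaches w_true
```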
Sparse approximation 1: buffering

The full algorithm is impractical:
- Space O(T): stores all previous examples
- Time O(T^2): each new example is compared to all previous ones
- In reality, T \to \infty

Approximation: keep a buffer of size \tau
- Approximate representers: f_t = \sum_{i=t-\tau}^{t-1} \alpha_i^{(t)} K(x_i, \cdot)
- Approximate instantaneous risk:
J_t(f) = \frac{T}{l} \delta(y_t)\, c(f(x_t), y_t) + \frac{\lambda_1}{2} \|f\|_K^2 + \lambda_2 \frac{t}{\tau} \sum_{i=t-\tau}^{t-1} (f(x_i) - f(x_t))^2 w_{it}
- Dynamic graph on the examples in the buffer
Sparse approximation 2: random projection tree [Dasgupta and Freund, STOC08]
- Discretize the data manifold by online clustering
- When a cluster accumulates enough examples, split it along a random hyperplane
- Extends the k-d tree
- The approximation uses a graph over clusters (each represented by a Gaussian)
Experimental Results

Compare batch manifold regularization (MR) with the online variants (full graph, buffering, random projection tree) on:
- Runtime
- Risk (to assess the no-regret guarantee)
- Generalization error of the average classifier (assumes iid)
Experiment: runtime

Buffering and RPtree scale linearly, enabling life-long learning.

[Figure: runtime in seconds vs. T on Spirals and MNIST 0 vs. 1; curves: Batch MR, Online MR, Online MR (buffer), Online RPtree]
Experiment: risk

The online MR risk J_{air}(T) \equiv \frac{1}{T} \sum_{t=1}^{T} J_t(f_t) approaches the batch risk J(f^*) as T increases.

[Figure: risk vs. T; curves: J(f^*) Batch MR, J_{air}(T) Online MR, J_{air}(T) Online MR (buffer), J_{air}(T) Online RPtree]
Experiment: generalization error of \bar{f} if iid

A variation of buffering is as good as batch MR (preferentially keep labeled examples in the buffer).

[Figure: generalization error rate vs. T on (a) Spirals, (b) Face, (c) MNIST 0 vs. 1, (d) MNIST 1 vs. 2; curves: Batch MR, Online MR, Online MR (buffer), Online MR (buffer U), Online RPtree, plus Online RPtree (PPK) on Spirals]
Conclusions
- Online semi-supervised learning framework
- Sparse approximations: buffering and RPtree
- Future work: new bounds, new algorithms (e.g., S3VM)

Thank you! Any questions?
Experiment: adversarial concept drift

Slowly rotating spirals; both p(x) and p(y|x) change over time. Batch f^* vs. online MR with buffering f_T. The test set is drawn from the current p(x, y) at time T.

[Figure: generalization error rate vs. T; curves: Batch MR, Online MR (buffer)]
Kernelized algorithm

Init: t = 1, f_1 = 0
Repeat:
1. f_t(\cdot) = \sum_{i=1}^{t-1} \alpha_i^{(t)} K(x_i, \cdot); receive x_t, predict f_t(x_t) = \sum_{i=1}^{t-1} \alpha_i^{(t)} K(x_i, x_t)
2. occasionally receive y_t
3. update f_t to f_{t+1} by adjusting the coefficients (a sketch follows):
\alpha_i^{(t+1)} = (1 - \eta_t \lambda_1) \alpha_i^{(t)} - 2 \eta_t \lambda_2 (f_t(x_i) - f_t(x_t)) w_{it}, \quad i < t
\alpha_t^{(t+1)} = 2 \eta_t \lambda_2 \sum_{i=1}^{t} (f_t(x_i) - f_t(x_t)) w_{it} - \eta_t \frac{T}{l} \delta(y_t)\, c'(f(x_t), y_t)
4. store x_t, let t = t + 1
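A Python sketch of this update with an RBF kernel and hinge loss (the class name, the step-size schedule eta_0/sqrt(t), and passing T/l as a fixed ratio are illustrative assumptions, not the paper's exact choices). It plugs into the protocol loop sketched earlier via predict/update:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

class OnlineMR:
    """Kernelized online MR, full-graph version (no sparse approximation)."""
    def __init__(self, lam1, lam2, T_over_l, eta0=0.03):
        self.lam1, self.lam2, self.Tl, self.eta0 = lam1, lam2, T_over_l, eta0
        self.X, self.alpha = [], []

    def predict(self, x):                # f_t(x) = sum_i alpha_i K(x_i, x)
        return sum(a * rbf(xi, x) for a, xi in zip(self.alpha, self.X))

    def update(self, x_t, y_t=None):
        t = len(self.X) + 1
        eta = self.eta0 / np.sqrt(t)
        ft = self.predict(x_t)
        w = np.array([rbf(xi, x_t) for xi in self.X])       # weights w_it
        fi = np.array([self.predict(xi) for xi in self.X])  # f_t(x_i), i < t
        # i < t: (1 - eta lam1) alpha_i - 2 eta lam2 (f_t(x_i) - f_t(x_t)) w_it
        old = (1 - eta * self.lam1) * np.asarray(self.alpha) \
              - 2 * eta * self.lam2 * (fi - ft) * w
        a_t = 2 * eta * self.lam2 * ((fi - ft) * w).sum()   # new coefficient
        if y_t is not None and y_t * ft < 1:  # hinge subgradient c' = -y_t
            a_t += eta * self.Tl * y_t        # -eta (T/l) delta(y_t) c'
        self.X.append(x_t)                    # store x_t
        self.alpha = list(old) + [a_t]
```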
Sparse approximation 1: buffer update

At each step, start with the current \tau representers plus the new point:
f_t = \sum_{i=t-\tau}^{t-1} \alpha_i^{(t)} K(x_i, \cdot) + 0 \cdot K(x_t, \cdot)

Gradient descent on the \tau + 1 terms:
f' = \sum_{i=t-\tau}^{t} \alpha_i' K(x_i, \cdot)

Reduce back to \tau representers
f_{t+1} = \sum_{i=t-\tau+1}^{t} \alpha_i^{(t+1)} K(x_i, \cdot)
by kernel matching pursuit: \min_{\alpha^{(t+1)}} \|f' - f_{t+1}\|^2 (a sketch follows)
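A sketch of the reduction step; for simplicity it solves the RKHS projection onto the kept representers exactly by least squares, rather than greedily by kernel matching pursuit as on the slide (the ridge term is an illustrative stabilizer):

```python
import numpy as np

def reduce_buffer(K_full, coef_full, keep):
    """Given f' = sum_i coef_full[i] K(x_i, .) over tau+1 representers,
    return coefficients over the kept points minimizing ||f' - f_{t+1}||_K^2.
    K_full: (tau+1, tau+1) kernel matrix; keep: indices of the tau kept points."""
    K_kk = K_full[np.ix_(keep, keep)]
    b = K_full[keep, :] @ coef_full          # <K(x_j, .), f'> for each kept j
    # normal equations of the RKHS projection: K_kk @ alpha = b
    return np.linalg.solve(K_kk + 1e-8 * np.eye(len(keep)), b)
```

With keep = [1, ..., tau], this drops the oldest representer, matching the buffer update above.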
Sparse approximation 2: random projection tree

We use the clusters N(\mu_i, \Sigma_i) as representers: f_t = \sum_{i=1}^{s} \beta_i^{(t)} K(\mu_i, \cdot)

The cluster-graph edge weight between a cluster \mu_i and an example x_t is
w_{\mu_i t} = E_{x \sim N(\mu_i, \Sigma_i)}\left[\exp\left(-\frac{\|x - x_t\|^2}{2\sigma^2}\right)\right] = |\Sigma_i|^{-1/2} |\Sigma|^{1/2} \exp\left(-\tfrac{1}{2}\left(\mu_i^\top \Sigma_i^{-1} \mu_i + x_t^\top \Sigma_0^{-1} x_t - \mu^\top \Sigma^{-1} \mu\right)\right)
where \Sigma_0 = \sigma^2 I, \Sigma = (\Sigma_i^{-1} + \Sigma_0^{-1})^{-1}, and \mu = \Sigma(\Sigma_i^{-1} \mu_i + \Sigma_0^{-1} x_t).

A further approximation is w_{\mu_i t} = e^{-\|\mu_i - x_t\|^2 / (2\sigma^2)}.

Update f (i.e., \beta) and the RPtree, then discard x_t. (A sketch of both weights follows.)
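A sketch of both weights in closed form; the exact expectation is computed via the equivalent Gaussian-convolution form |\Sigma_0|^{1/2} |\Sigma_i + \Sigma_0|^{-1/2} \exp(-\tfrac{1}{2}(\mu_i - x_t)^\top (\Sigma_i + \Sigma_0)^{-1} (\mu_i - x_t)), using log-determinants for numerical stability:

```python
import numpy as np

def cluster_edge_weight(mu_i, Sigma_i, x_t, sigma):
    """Exact E_{x ~ N(mu_i, Sigma_i)}[exp(-||x - x_t||^2 / (2 sigma^2))]."""
    d = len(mu_i)
    S = Sigma_i + sigma ** 2 * np.eye(d)     # Sigma_i + Sigma_0
    diff = mu_i - x_t
    quad = diff @ np.linalg.solve(S, diff)
    log_w = 0.5 * (d * np.log(sigma ** 2) - np.linalg.slogdet(S)[1]) - 0.5 * quad
    return np.exp(log_w)

def cluster_edge_weight_approx(mu_i, x_t, sigma):
    """The further approximation: treat the cluster as its mean mu_i."""
    return np.exp(-np.sum((mu_i - x_t) ** 2) / (2 * sigma ** 2))
```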