Online Manifold Regularization: A New Learning Setting and Empirical Study

Online Manifold Regularization: A New Learning Setting and Empirical Study
Andrew B. Goldberg¹, Ming Li², Xiaojin Zhu¹
¹ Computer Sciences, University of Wisconsin–Madison, USA. {goldberg, jerryzhu}@cs.wisc.edu
² National Key Laboratory for Novel Software Technology, Nanjing University, China. lim@lamda.nju.edu.cn

Life-long learning
A data stream arrives one example at a time, with only occasional labels:
x_1, x_2, ..., x_1000, ..., x_1000000, ...
y_1 = 0, n/a, ..., y_1000 = 1, ..., y_1000000 = 0, ...
Unlike standard supervised learning:
- examples arrive sequentially; we cannot even store them all
- most examples are unlabeled
- no iid assumption; p(x, y) can change over time

This is how children learn, too
(The same stream: x_1, x_2, ..., x_1000, ..., x_1000000, ... with labels y_1 = 0, n/a, ..., y_1000 = 1, ..., y_1000000 = 0, ...)

New paradigm: online semi-supervised learning
Main contribution: merging two learning settings
1. Online: learn sequentially from non-iid data, but fully labeled
2. Semi-supervised: learn from an iid batch that is (mostly) unlabeled
The protocol:
1. At time t, an adversary picks x_t ∈ X, y_t ∈ Y (not necessarily iid) and shows x_t
2. The learner has classifier f_t : X → R and predicts f_t(x_t)
3. With small probability, the adversary reveals y_t; otherwise it abstains (unlabeled)
4. The learner updates to f_{t+1} based on x_t and y_t (if given). Repeat.
Many batch SSL algorithms exist; we focus on manifold regularization.
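A minimal sketch of this protocol loop (not from the paper): the learner object, its predict/update interface, and the label_prob parameter are assumptions made here for illustration.

```python
import random

def run_online_ssl(stream, learner, label_prob=0.02, seed=0):
    """Simulate the online semi-supervised protocol.

    stream     : iterable of (x_t, y_t) pairs chosen by the adversary (not necessarily iid)
    learner    : any object exposing predict(x) -> float and update(x, y_or_None)
    label_prob : probability that the adversary reveals y_t at a given step
    """
    rng = random.Random(seed)
    for t, (x_t, y_t) in enumerate(stream, start=1):
        prediction = learner.predict(x_t)                       # learner commits to f_t(x_t)
        revealed = y_t if rng.random() < label_prob else None   # mostly unlabeled
        learner.update(x_t, revealed)                           # learner forms f_{t+1}
        yield t, prediction, revealed
```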

Review: batch manifold regularization
A form of graph-based semi-supervised learning [Belkin et al., JMLR 2006]:
- Build a graph on x_1, ..., x_n
- Edge weights w_st encode the similarity between x_s and x_t, e.g., via kNN
- Assumption: similar examples have similar labels
Manifold regularization minimizes the risk
J(f) = (1/l) Σ_{t=1}^T δ(y_t) c(f(x_t), y_t) + (λ_1/2) ‖f‖²_K + (λ_2/(2T)) Σ_{s,t=1}^T (f(x_s) − f(x_t))² w_st
where l is the number of labeled examples, δ(y_t) = 1 if y_t was revealed (0 otherwise), and c(f(x), y) is a convex loss function, e.g., the hinge loss. The solution is f* = argmin_f J(f). This generalizes graph mincut and label propagation.
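To make the risk concrete, here is a small numeric sketch (mine, not the paper's code) that evaluates J(f) for fixed prediction values, assuming the hinge loss and a precomputed RKHS norm ‖f‖²_K.

```python
import numpy as np

def batch_mr_risk(f_vals, y, W, f_norm_sq, lam1, lam2):
    """Batch manifold regularization risk J(f).

    f_vals    : length-T array of predictions f(x_t)
    y         : length-T array of labels in {-1, +1}, with np.nan where unlabeled
    W         : T x T symmetric graph weight matrix (e.g. kNN or RBF similarities)
    f_norm_sq : precomputed RKHS norm ||f||_K^2
    """
    T = len(f_vals)
    labeled = ~np.isnan(y)
    l = labeled.sum()                                   # number of labeled examples
    loss = np.maximum(0.0, 1.0 - y[labeled] * f_vals[labeled]).sum() / l   # hinge loss term
    diffs = f_vals[:, None] - f_vals[None, :]
    graph = (lam2 / (2 * T)) * np.sum(W * diffs ** 2)   # graph smoothness term
    return loss + (lam1 / 2) * f_norm_sq + graph
```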

From batch to online
The batch risk is the average of the instantaneous risks: J(f) = (1/T) Σ_{t=1}^T J_t(f).
Batch risk:
J(f) = (1/l) Σ_{t=1}^T δ(y_t) c(f(x_t), y_t) + (λ_1/2) ‖f‖²_K + (λ_2/(2T)) Σ_{s,t=1}^T (f(x_s) − f(x_t))² w_st
Instantaneous risk:
J_t(f) = (T/l) δ(y_t) c(f(x_t), y_t) + (λ_1/2) ‖f‖²_K + λ_2 Σ_{i=1}^{t} (f(x_i) − f(x_t))² w_it
(includes the graph edges between x_t and all previous examples)

Online convex programming
Instead of minimizing the convex J(f), reduce the convex J_t(f) at each step t:
f_{t+1} = f_t − η_t ∂J_t(f)/∂f |_{f=f_t}
Remarkable no-regret guarantee against the adversary [Zinkevich, ICML 2003], where
regret = (1/T) Σ_{t=1}^T J_t(f_t) − J(f*)
No regret: limsup_{T→∞} (1/T) Σ_{t=1}^T J_t(f_t) − J(f*) ≤ 0.
Note: accuracy can be arbitrarily bad if the adversary flips the target often; if so, no batch learner in hindsight can do well either.
If there is no adversary (iid data), the average classifier f̄ = (1/T) Σ_{t=1}^T f_t is good: J(f̄) → J(f*).
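The update is plain online (sub)gradient descent. Below is a generic sketch under assumptions not stated on the slide: a caller-supplied gradient oracle grad_Jt and a step size η_t = η_0/√t; it also keeps the running average classifier used in the iid case.

```python
import numpy as np

def online_convex_programming(grad_Jt, w0, T, eta0=0.1):
    """Generic online convex programming: w_{t+1} = w_t - eta_t * grad J_t(w_t).

    grad_Jt(t, w) : gradient of the t-th instantaneous risk at w
    Returns the final iterate and the average of the iterates actually played.
    """
    w = np.array(w0, dtype=float)
    w_bar = np.zeros_like(w)
    for t in range(1, T + 1):
        w_bar += (w - w_bar) / t                     # running mean of f_1, ..., f_T
        w = w - (eta0 / np.sqrt(t)) * grad_Jt(t, w)
    return w, w_bar
```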

Sparse approximation 1: buffering
The exact algorithm is impractical:
- Space O(T): stores all previous examples
- Time O(T²): each new example is compared to all previous ones
- In reality, T → ∞
Approximation: keep a buffer of size τ.
- Approximate representers: f_t = Σ_{i=t−τ}^{t−1} α_i^{(t)} K(x_i, ·)
- Approximate instantaneous risk:
  J_t(f) = (T/l) δ(y_t) c(f(x_t), y_t) + (λ_1/2) ‖f‖²_K + λ_2 (t/τ) Σ_{i=t−τ}^{t−1} (f(x_i) − f(x_t))² w_it
- Dynamic graph on the examples in the buffer
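A sketch of the buffered risk under the reconstruction above (in particular the t/τ rescaling of the graph term), assuming the hinge loss and a caller-supplied edge-weight function; the buffer itself can simply be a collections.deque(maxlen=τ) to which x_t is appended after each update.

```python
def buffered_instantaneous_risk(f, buffer, x_t, y_t, t, T, l, lam1, lam2, tau,
                                f_norm_sq, weight):
    """Approximate instantaneous risk using only the last tau examples.

    f      : callable returning the current prediction f(x)
    buffer : the previous tau inputs x_{t-tau}, ..., x_{t-1}
    y_t    : label in {-1, +1}, or None when the example is unlabeled
    weight : weight(a, b) -> graph edge weight w_ab
    """
    risk = (lam1 / 2) * f_norm_sq
    if y_t is not None:
        risk += (T / l) * max(0.0, 1.0 - y_t * f(x_t))        # hinge loss on labeled steps
    graph = sum(weight(x_i, x_t) * (f(x_i) - f(x_t)) ** 2 for x_i in buffer)
    return risk + lam2 * (t / tau) * graph                    # t/tau stands in for missing edges
```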

Sparse approximation 2: random projection tree [Dasgupta and Freund, STOC 2008]
Discretize the data manifold by online clustering: when a cluster accumulates enough examples, split it along a random hyperplane. This extends the k-d tree.
The approximation uses a graph over clusters, each represented by a Gaussian.
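The core tree operation is splitting an overfull cluster along a random direction. The sketch below (median split on a random projection, then refitting a Gaussian per child) is a simplification of the Dasgupta–Freund splitting rule, and it assumes the cluster holds at least a few distinct points.

```python
import numpy as np

def split_cluster(points, rng=None):
    """Split a cluster's points along a random hyperplane (simplified RPtree split)."""
    rng = rng or np.random.default_rng()
    points = np.asarray(points, dtype=float)
    direction = rng.normal(size=points.shape[1])
    direction /= np.linalg.norm(direction)
    proj = points @ direction                      # project onto the random direction
    left = points[proj <= np.median(proj)]
    right = points[proj > np.median(proj)]
    # summarize each child cluster by a Gaussian N(mu, Sigma) for the cluster graph
    fit = lambda P: (P.mean(axis=0), np.cov(P, rowvar=False))
    return fit(left), fit(right)
```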

Experimental results
Compare batch manifold regularization (MR) with several online variants (full graph, buffering, random projection tree) in terms of:
- runtime
- risk (to assess the no-regret guarantee)
- generalization error of the average classifier (assumes iid data)

Experiment: runtime
Buffering and RPtree scale linearly in T, enabling life-long learning.
[Plots: time (seconds) vs. T on Spirals and MNIST 0 vs. 1, comparing Batch MR, Online MR, Online MR (buffer), and Online RPtree]

Experiment: risk
The online average instantaneous risk J_air(T) = (1/T) Σ_{t=1}^T J_t(f_t) approaches the batch risk J(f*) as T increases.
[Plot: risk vs. T for Batch MR J(f*) and J_air(T) of Online MR, Online MR (buffer), and Online RPtree]

Experiment: generalization error of f̄ in the iid case
A variation of buffering (preferentially keeping labeled examples in the buffer) is as good as batch MR.
[Plots: generalization error rate vs. T on (a) Spirals, (b) Face, (c) MNIST 0 vs. 1, (d) MNIST 1 vs. 2, comparing Batch MR, Online MR, Online MR (buffer), Online MR (buffer U), Online RPtree, and Online RPtree (PPK)]

Conclusions
- Online semi-supervised learning framework
- Sparse approximations: buffering and RPtree
- Future work: new bounds, new algorithms (e.g., S3VM)
Thank you! Any questions?

Experiment: adversarial concept drift
Slowly rotating spirals, with both p(x) and p(y|x) changing over time. Batch f* vs. online MR with buffering (f_T); the test set is drawn from the current p(x, y) at time T.
[Plot: generalization error rate vs. T for Batch MR and Online MR (buffer)]

Kernelized algorithm
Init: t = 1, f_1 = 0
Repeat:
  f_t(·) = Σ_{i=1}^{t−1} α_i^{(t)} K(x_i, ·)
  1. Receive x_t, predict f_t(x_t) = Σ_{i=1}^{t−1} α_i^{(t)} K(x_i, x_t)
  2. Occasionally receive y_t
  3. Update f_t to f_{t+1} by adjusting the coefficients:
     α_i^{(t+1)} = (1 − η_t λ_1) α_i^{(t)} − 2 η_t λ_2 (f_t(x_i) − f_t(x_t)) w_it,   for i < t
     α_t^{(t+1)} = 2 η_t λ_2 Σ_{i=1}^{t} (f_t(x_i) − f_t(x_t)) w_it − η_t (T/l) δ(y_t) c′(f_t(x_t), y_t)
  4. Store x_t, let t = t + 1
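A minimal numpy sketch of one such update (mine, not the authors' code), assuming a Gaussian kernel that doubles as the edge weight w_it and the hinge loss as c; the function and variable names are made up for illustration.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel K(a, b), also used here as the graph edge weight."""
    return float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2 * sigma ** 2)))

def online_mr_step(alpha, X, x_t, y_t, T, l, lam1, lam2, eta_t, sigma=1.0):
    """One exact (un-buffered) kernelized online MR update.

    alpha : coefficients alpha_i^(t) of the t-1 stored representers in X
    y_t   : label in {-1, +1} if revealed, None if the example is unlabeled
    Returns alpha^(t+1); the caller then appends x_t to X.
    """
    f = lambda x: sum(a * rbf(xi, x, sigma) for a, xi in zip(alpha, X))
    f_xt = f(x_t)
    w = [rbf(xi, x_t, sigma) for xi in X]             # edges between x_t and all previous points
    # old coefficients: shrink by the RKHS regularizer, push along the new graph edges
    new_alpha = [(1 - eta_t * lam1) * a - 2 * eta_t * lam2 * (f(xi) - f_xt) * wi
                 for a, xi, wi in zip(alpha, X, w)]
    # coefficient of the brand-new representer K(x_t, .)
    a_t = 2 * eta_t * lam2 * sum((f(xi) - f_xt) * wi for xi, wi in zip(X, w))
    if y_t is not None and y_t * f_xt < 1:            # hinge loss: c'(f, y) = -y when y f < 1
        a_t -= eta_t * (T / l) * (-y_t)
    new_alpha.append(a_t)
    return new_alpha
```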

Sparse approximation 1: buffer update
At each step, start with the current τ representers plus the new point:
  f_t = Σ_{i=t−τ}^{t−1} α_i^{(t)} K(x_i, ·) + 0 · K(x_t, ·)
Gradient descent on the τ + 1 terms gives
  f′ = Σ_{i=t−τ}^{t} α′_i K(x_i, ·)
Reduce back to τ representers,
  f_{t+1} = Σ_{i=t−τ+1}^{t} α_i^{(t+1)} K(x_i, ·),
by minimizing ‖f′ − f_{t+1}‖² over α^{(t+1)} (kernel matching pursuit).
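The reduction can be written as a least-squares projection in the RKHS: over the kept representers S, the normal equations are K_SS β = K_SF α′. The sketch below implements only this projection step (the slide's kernel matching pursuit would additionally choose greedily which representer to drop); the small ridge term is added here purely for numerical stability.

```python
import numpy as np

def project_onto_representers(alpha_full, K_full, keep_idx, ridge=1e-10):
    """Best RKHS-norm approximation of f' = sum_j alpha_full[j] K(x_j, .)
    using only the representers indexed by keep_idx.

    K_full : kernel matrix over all tau+1 current points
    Solves  min_beta || sum_{i in keep} beta_i K(x_i, .) - f' ||_K^2,
    whose normal equations are K_SS beta = K_SF alpha_full.
    """
    keep_idx = np.asarray(keep_idx)
    K_SS = K_full[np.ix_(keep_idx, keep_idx)]
    K_SF = K_full[keep_idx, :]
    beta = np.linalg.solve(K_SS + ridge * np.eye(len(keep_idx)), K_SF @ alpha_full)
    return beta
```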

Sparse approximation 2: random projection tree
We use the clusters N(µ_i, Σ_i) as representers: f_t = Σ_{i=1}^{s} β_i^{(t)} K(µ_i, ·).
The cluster-graph edge weight between a cluster µ_i and an example x_t is the expected RBF weight
  w_{µ_i t} = E_{x∼N(µ_i, Σ_i)}[exp(−‖x − x_t‖² / (2σ²))]
            = |Σ_i|^{−1/2} |Σ|^{1/2} exp(−(1/2)(µ_iᵀ Σ_i^{−1} µ_i + x_tᵀ Σ_0^{−1} x_t − µᵀ Σ^{−1} µ))
where Σ_0 = σ² I, Σ = (Σ_i^{−1} + Σ_0^{−1})^{−1}, and µ = Σ(Σ_i^{−1} µ_i + Σ_0^{−1} x_t).
A further approximation is w_{µ_i t} = exp(−‖µ_i − x_t‖² / (2σ²)).
Update f (i.e., β) and the RPtree, then discard x_t.
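The closed form above can be checked numerically. The sketch below evaluates the algebraically equivalent convolution form |Σ_0|^{1/2} |Σ_i + Σ_0|^{−1/2} exp(−(1/2)(µ_i − x_t)ᵀ(Σ_i + Σ_0)^{−1}(µ_i − x_t)) and compares it with a Monte Carlo estimate; the test numbers are arbitrary.

```python
import numpy as np

def expected_rbf_weight(mu_i, Sigma_i, x_t, sigma):
    """E_{x ~ N(mu_i, Sigma_i)}[exp(-||x - x_t||^2 / (2 sigma^2))]."""
    d = len(mu_i)
    Sigma_0 = sigma ** 2 * np.eye(d)
    S = Sigma_i + Sigma_0                          # covariance of the convolved Gaussian
    diff = np.asarray(mu_i) - np.asarray(x_t)
    quad = diff @ np.linalg.solve(S, diff)
    return np.sqrt(np.linalg.det(Sigma_0) / np.linalg.det(S)) * np.exp(-0.5 * quad)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu, Sig = np.array([1.0, 0.0]), np.array([[0.5, 0.1], [0.1, 0.3]])
    x, sigma = np.array([0.0, 1.0]), 0.8
    closed = expected_rbf_weight(mu, Sig, x, sigma)
    samples = rng.multivariate_normal(mu, Sig, size=200_000)
    mc = np.exp(-np.sum((samples - x) ** 2, axis=1) / (2 * sigma ** 2)).mean()
    print(closed, mc)                              # the two estimates should agree closely
```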