Regularizing objective functionals in semi-supervised learning
1 Regularizing objective functionals in semi-supervised learning
Dejan Slepčev, Carnegie Mellon University
February 9
2 References
- Slepčev and Thorpe, Analysis of p-Laplacian regularization in semi-supervised learning, arXiv.
- Dunlop, Slepčev, Stuart, and Thorpe, Large-data and zero-noise limits of graph-based semi-supervised learning algorithms, in preparation.
- García Trillos, Gerlach, Hein, and Slepčev, Error estimates for spectral convergence of the graph Laplacian on random geometric graphs towards the Laplace-Beltrami operator, arXiv.
- García Trillos and Slepčev, Continuum limit of total variation on point clouds, Arch. Ration. Mech. Anal., 220 (2016), no. 1.
- García Trillos, Slepčev, von Brecht, Laurent, and Bresson, Consistency of Cheeger and ratio graph cuts, J. Mach. Learn. Res., 17 (2016), 1-46.
- García Trillos and Slepčev, A variational approach to the consistency of spectral clustering, Applied and Computational Harmonic Analysis, published online.
- García Trillos and Slepčev, On the rate of convergence of empirical measures in ∞-transportation distance, Canad. J. Math., 67 (2015).
3 Semi-supervised learning
Colors denote real-valued labels.
Task: assign real-valued labels to all of the data points.
4 Semi-supervised learning
A graph is used to represent the geometry of the data set.
5 Semi-supervised learning
Consider graph-based objective functions which reward the regularity of the estimator and impose agreement with the preassigned labels.
6 From point clouds to graphs
Let $V = \{x_1, \dots, x_n\}$ be a point cloud in $\mathbb{R}^d$.
Connect nearby vertices $x_i$ and $x_j$ with edge weights $W_{i,j}$.
7 Graph constructions
Proximity-based graphs: $W_{i,j} = \eta(|x_i - x_j|)$ for a kernel profile $\eta$.
kNN graphs: connect each vertex with its k nearest neighbors.
8 p-Dirichlet energy
$V_n = \{x_1, \dots, x_n\}$, weight matrix $W$: $W_{ij} := \eta(|x_i - x_j|)$.
The p-Dirichlet energy of $f_n : V_n \to \mathbb{R}$ is
$E(f_n) = \frac{1}{2} \sum_{i,j} W_{ij}\, |f_n(x_i) - f_n(x_j)|^p$
For $p = 2$ the associated operator is the (unnormalized) graph Laplacian $L = D - W$, where $D = \mathrm{diag}(d_1, \dots, d_n)$ and $d_i = \sum_j W_{i,j}$.
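The identity between the quadratic form of the graph Laplacian and the p-Dirichlet energy at p = 2 can be checked numerically. A minimal pure-Python sketch, with a hypothetical 1-d point cloud and an indicator kernel (both illustrative choices, not from the talk):

```python
# Build a proximity graph on a small 1-d point cloud, form the unnormalized
# graph Laplacian L = D - W implicitly, and check that for p = 2 the quadratic
# form <f, L f> equals (1/2) * sum_ij W_ij |f(x_i) - f(x_j)|^2.

def laplacian_and_energy(points, f, eps):
    n = len(points)
    # proximity weights with an indicator kernel: eta(t) = 1 if t < eps
    W = [[1.0 if i != j and abs(points[i] - points[j]) < eps else 0.0
          for j in range(n)] for i in range(n)]
    degrees = [sum(row) for row in W]
    # quadratic form <f, (D - W) f>, computed entrywise
    quad = sum(degrees[i] * f[i] * f[i] for i in range(n)) \
         - sum(W[i][j] * f[i] * f[j] for i in range(n) for j in range(n))
    # p-Dirichlet energy with p = 2
    dirichlet = 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2
                          for i in range(n) for j in range(n))
    return quad, dirichlet

points = [0.0, 0.1, 0.25, 0.4, 0.9]
f = [0.0, 0.2, 0.5, 0.7, 1.0]
quad, dirichlet = laplacian_and_energy(points, f, eps=0.3)
assert abs(quad - dirichlet) < 1e-12
```

The agreement follows from expanding the square and using the symmetry of $W$.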
9 p-Laplacian semi-supervised learning
Assume we are given k labeled points $(x_1, y_1), \dots, (x_k, y_k)$ and unlabeled points $x_{k+1}, \dots, x_n$.
Question: how to label the rest of the points?
p-Laplacian SSL: minimize
$E(f_n) = \frac{1}{2} \sum_{i,j} W_{ij}\, |f_n(x_i) - f_n(x_j)|^p$
subject to the constraint $f_n(x_i) = y_i$ for $i = 1, \dots, k$.
Zhu, Ghahramani, and Lafferty '03 introduced the approach with p = 2; Zhou and Schölkopf '05 consider general p.
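For p = 2 the minimizer is the harmonic extension of the labels: it satisfies $Lf = 0$ at unlabeled vertices, so each unlabeled value is the weighted average of its neighbors' values. A hedged pure-Python sketch of this fixed point via plain relaxation (the graph, labels, and iteration count are illustrative, not the talk's setup):

```python
# Harmonic label propagation (p = 2): repeatedly replace each unlabeled value
# by the weighted average of its neighbors, keeping labeled values fixed.

def harmonic_ssl(W, labels, iters=2000):
    """labels: dict {vertex index: label value}; W: symmetric weight matrix."""
    n = len(W)
    f = [labels.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        for i in range(n):
            if i in labels:
                continue  # constraint f(x_i) = y_i stays enforced
            deg = sum(W[i])
            if deg > 0:
                f[i] = sum(W[i][j] * f[j] for j in range(n)) / deg
    return f

# path graph 0 - 1 - 2 - 3 with unit weights; endpoints labeled 0 and 1
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
f = harmonic_ssl(W, {0: 0.0, 3: 1.0})
# on a path the harmonic extension is linear: 0, 1/3, 2/3, 1
```

On a connected graph this iteration converges since the averaging map is a contraction away from the pinned vertices.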
10 p-Laplacian semi-supervised learning: asymptotics
p-Laplacian SSL: minimize $E(f_n) = \frac{1}{2}\sum_{i,j} W_{ij}\,|f_n(x_i) - f_n(x_j)|^p$ subject to the constraint $f_n(x_i) = y_i$ for $i = 1, \dots, k$.
Questions: What happens as $n \to \infty$? Do minimizers $f_n$ converge to a solution of a limiting problem? In what topology should the question be considered?
Remark: we would like to localize $\eta$ as $n \to \infty$.
11 p-Laplacian semi-supervised learning: asymptotics
p-Laplacian SSL: minimize
$E_n(f_n) = \frac{1}{\varepsilon_n^p n^2} \sum_{i,j} \eta_{\varepsilon}(x_i - x_j)\, |f_n(x_i) - f_n(x_j)|^p$
subject to the constraint $f_n(x_i) = y_i$ for $i = 1, \dots, k$, where $\eta_\varepsilon(\cdot) = \frac{1}{\varepsilon^d}\,\eta\!\left(\frac{\cdot}{\varepsilon}\right)$.
Questions: Do minimizers $f_n$ converge to a solution of the limiting problem? In what topology should the question be considered? How should $\varepsilon_n$ scale with $n$ for the convergence to hold?
12 Ground truth assumption
We assume the points $x_1, x_2, \dots$ are drawn i.i.d. from the measure $d\nu = \rho\,dx$. We also assume $\rho$ is supported on a Lipschitz domain $\Omega$ and is bounded above and below by positive constants.
13 Ground truth assumption: manifold version
Assume the points $x_1, x_2, \dots$ are drawn i.i.d. from the measure $d\nu = \rho\,d\mathrm{Vol}_M$, where $M$ is a compact manifold without boundary and $0 < \rho < C$ is continuous.
[Figure: a sample from a parametrized surface in $\mathbb{R}^3$.]
14 Harmonic semi-supervised learning
Nadler, Srebro, and Zhou '09 observed that for p = 2 the minimizers are spiky as $n \to \infty$ [also see Wahba '90].
Figure: graph of the minimizer for p = 2, n = 1280, i.i.d. data on the square; training points (0.5, 0.2) with label 0 and (0.5, 0.8) with label 1.
15 p-Laplacian semi-supervised learning
El Alaoui, Cheng, Ramdas, Wainwright, and Jordan '16 show that spikes can occur for all $p \le d$ and propose using $p > d$.
Heuristics:
$E^{(p)}_n(f) = \frac{1}{\varepsilon_n^p n^2} \sum_{i,j=1}^n \eta_\varepsilon(x_i - x_j)\, |f(x_i) - f(x_j)|^p \approx \frac{1}{\varepsilon^p} \iint \eta_\varepsilon(x - y)\, |f(x) - f(y)|^p\, \rho(x)\rho(y)\, dx\, dy \xrightarrow{\varepsilon \to 0} \sigma_\eta \int |\nabla f(x)|^p\, \rho(x)^2\, dx$
The Sobolev space $W^{1,p}(\Omega)$ embeds into continuous functions iff $p > d$.
16 Continuum p-Laplacian semi-supervised learning
$\mu$: measure with density $\rho$, positive on $\Omega$.
Continuum p-Laplacian SSL: minimize
$E(f) = \int_\Omega |\nabla f(x)|^p\, \rho(x)^2\, dx$
subject to the constraints $f(x_i) = y_i$ for all $i = 1, \dots, k$.
The functional is convex. The problem has a unique minimizer iff $p > d$. The minimizer lies in $W^{1,p}(\Omega)$.
17 p-Laplacian semi-supervised learning
Here: d = 1 and p = 1.5. For $\varepsilon > 0.02$ the minimizers lack the expected regularity.
Figure: (a) error $\mathrm{err}^{(1.5)}_n(f_n)$ and percentage of connected graphs versus $\varepsilon$; (b) minimizers for $\varepsilon = 0.023$, n = 1280, ten realizations. Labeled points are (0, 0) and (1, 1).
18 p-Laplacian semi-supervised learning
Theorem (Thorpe and S. '17). Let $p > 1$. Let $f_n$ be a sequence of minimizers of $E^{(p)}_n$ satisfying the constraints, and let $f$ be a minimizer of $E^{(p)}$ satisfying the constraints.
(i) Let $p > d$. If $d \ge 3$ and $\left(\frac{\log n}{n}\right)^{1/d} \ll \varepsilon_n \ll n^{-1/p}$, then $f$ is continuous and $f_n$ converges locally uniformly to $f$, meaning that for any $\Omega' \subset\subset \Omega$
$\lim_{n\to\infty} \max_{\{k \le n \,:\, x_k \in \Omega'\}} |f(x_k) - f_n(x_k)| = 0.$
(ii) If $n^{-1/p} \ll \varepsilon_n \ll 1$, then there exists a sequence of real numbers $c_n$ such that $f_n - c_n$ converges to zero locally uniformly.
Note that in case (ii) all information about the labels is lost in the limit; the discrete minimizers exhibit spikes.
19 p-Laplacian semi-supervised learning
Figure: (a) discrete minimizer; (b) continuum minimizer. Minimizer for p = 4, n = 1280, $\varepsilon = 0.058$, i.i.d. data on the square, with training points (0.2, 0.5) and (0.8, 0.5) and labels 0 and 1 respectively.
20 p-Laplacian semi-supervised learning
Figure: minimizers for (a) $\varepsilon = 0.058$, (b) $\varepsilon = 0.09$, (c) $\varepsilon = 0.2$. Here p = 4, which in 2D is in the well-posed regime.
21 Improved p-Laplacian semi-supervised learning
Let $p > d$ and labeled points $\{(x_i, y_i) : i = 1, \dots, k\}$.
Improved p-Laplacian SSL: minimize
$E_n(f_n) = \frac{1}{\varepsilon_n^p n^2} \sum_{i,j} \eta_\varepsilon(x_i - x_j)\, |f_n(x_i) - f_n(x_j)|^p$
subject to the constraint $f_n(x_m) = y_i$ whenever $|x_m - x_i| < 2\varepsilon$, for all $i = 1, \dots, k$, where $\eta_\varepsilon(\cdot) = \frac{1}{\varepsilon^d}\,\eta\!\left(\frac{\cdot}{\varepsilon}\right)$.
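The only change from the original model is the dilated constraint set: each label is pinned on a whole $2\varepsilon$-ball of sample points rather than at a single point. A minimal pure-Python sketch of building that set in 1-d (the point cloud, labels, and $\varepsilon$ below are hypothetical):

```python
# Dilated constraint of the improved model: pin the label y_i at every sample
# point within distance 2*eps of the labeled point x_i.

def dilated_constraints(points, labeled, eps):
    """labeled: list of (x, y) pairs; returns {index of pinned point: label}."""
    pinned = {}
    for x_lab, y_lab in labeled:
        for m, x in enumerate(points):
            if abs(x - x_lab) < 2 * eps:
                pinned[m] = y_lab
    return pinned

points = [0.0, 0.05, 0.12, 0.5, 0.88, 0.95, 1.0]
pinned = dilated_constraints(points, [(0.0, 0.0), (1.0, 1.0)], eps=0.04)
# points within 0.08 of a labeled point inherit its label
assert pinned == {0: 0.0, 1: 0.0, 5: 1.0, 6: 1.0}
```

Pinning order-$n\varepsilon^d$ points per label is what prevents the minimizer from forming a cheap spike at a single labeled vertex.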
22 Asymptotics of improved p-Laplacian SSL
Theorem (Thorpe and S. '17). Let $p > d$, let $f_n$ be a sequence of minimizers of the improved p-Laplacian SSL on an n-point sample, and let $f$ be the minimizer of $E^{(p)}$ satisfying the constraints. Since $p > d$, we know $f$ is continuous.
If $d \ge 3$ and $\left(\frac{\log n}{n}\right)^{1/d} \ll \varepsilon_n \ll 1$, then $f_n$ converges locally uniformly to $f$, meaning that for any $\Omega' \subset\subset \Omega$
$\lim_{n\to\infty} \max_{\{k \le n \,:\, x_k \in \Omega'\}} |f(x_k) - f_n(x_k)| = 0.$
23 Comparing the original and improved models
Here: d = 1, p = 2, and n = 1280. Labeled points are (0, 0) and (1, 1).
Figure: error $\mathrm{err}^{(2)}_n(f_n)$ and percentage of connected graphs versus $\varepsilon$ for (a) the original model and (b) the improved model. Note that the axes on the error plots for the two models are not the same.
24 Techniques
General approach developed with García Trillos (ARMA '16).
Γ-convergence: a notion and set of techniques from the calculus of variations for studying asymptotics of functionals (here: random discrete to continuum).
$TL^p$ space: a topology based on optimal transportation which allows one to compare functions defined on different spaces (here $f_n \in L^p(\mu_n)$ and $f \in L^p(\mu)$).
We also need: nonlocal operators and their asymptotics. In SSL, for the constraint to be satisfied we need uniform convergence; this also requires discrete regularity and finer compactness results.
25 Role of nonlocal operators
Heuristics:
$E^{(p)}_n(f) = \frac{1}{\varepsilon_n^p n^2} \sum_{i,j=1}^n \eta_\varepsilon(x_i - x_j)\, |f(x_i) - f(x_j)|^p \approx \frac{1}{\varepsilon^p} \iint \eta_\varepsilon(x - y)\, |f(x) - f(y)|^p\, \rho(x)\rho(y)\, dx\, dy \xrightarrow{\varepsilon \to 0} \sigma_\eta \int |\nabla f(x)|^p\, \rho(x)^2\, dx$
The discrete problem on the graph is closer to a nonlocal functional (at scale $\varepsilon$) than to the limiting differential one. The nonlocal energy does not have the smoothing properties of the differential one.
26 Degeneracy of nonlocal operators
Consider
$E^{(p)}_n(f) = \frac{1}{\varepsilon_n^p n^2} \sum_{i,j=1}^n \eta_{\varepsilon_n}(x_i - x_j)\, |f(x_i) - f(x_j)|^p$ with $f(x_j) = \begin{cases} 1 & \text{if } j = 1 \\ 0 & \text{else.} \end{cases}$
Then
$E^{(p)}_n(f) = \frac{2}{\varepsilon_n^p n^2} \sum_{j=2}^n \eta_{\varepsilon_n}(x_1 - x_j) \approx \frac{1}{\varepsilon_n^p n^2}\, n\varepsilon_n^d\, \frac{1}{\varepsilon_n^d} = \frac{1}{\varepsilon_n^p n} \to 0$ as $n \to \infty$, when $\varepsilon_n^p n \to \infty$,
since roughly $n\varepsilon_n^d$ of the points $x_j$ lie within distance $\varepsilon_n$ of $x_1$, each contributing a weight of order $\varepsilon_n^{-d}$.
27 PDE-based p-Laplacian semi-supervised learning
Manfredi, Oberman, Sviridov '12; Calder '17.
The graph infinity Laplacian is defined by
$L^\infty_n f(x_i) = \max_j w_{ij}(f(x_j) - f(x_i)) + \min_j w_{ij}(f(x_j) - f(x_i))$
and the (game-theoretic) p-Laplacian is defined by
$L^p_n f = \frac{1}{d}\, L^2_n f + \lambda (p-2)\, L^\infty_n f.$
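The graph infinity Laplacian above is just the largest plus the smallest weighted difference to a neighbor. A minimal pure-Python sketch evaluating it at one vertex (the graph, weights, and values are illustrative):

```python
# Graph infinity Laplacian at vertex i, following the displayed formula:
# max over neighbors of w_ij (f_j - f_i) plus min over neighbors of the same.

def graph_inf_laplacian(w, f, i):
    """w: weight matrix (w[i][j] > 0 iff j is a neighbor of i)."""
    diffs = [w[i][j] * (f[j] - f[i]) for j in range(len(f)) if w[i][j] > 0]
    return max(diffs) + min(diffs)

# star graph around vertex 0 with neighbors 1, 2, 3
w = [[0, 1, 1, 1],
     [1, 0, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0]]
f = [0.5, 0.0, 0.4, 1.0]
# max difference is 0.5 (to vertex 3), min is -0.5 (to vertex 1): they cancel
assert graph_inf_laplacian(w, f, 0) == 0.0
```

A vertex where this quantity vanishes sits exactly midway, in value, between its extreme neighbors, which is the discrete analogue of $\infty$-harmonicity.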
28 PDE-based p-Laplacian semi-supervised learning
SSL problem: with $L^p_n f = \frac{1}{d} L^2_n f + \lambda(p-2) L^\infty_n f$, solve
$L^p_n f = 0$ on $\Omega \setminus \Omega_L$, with $f(x_i) = y_i$ for all $i = 1, \dots, k$.
Theorem (Calder '17). Assume $p > d$. If $d \ge 3$ and $\left(\frac{\log n}{n}\right)^{\frac{2}{3d}} \ll \varepsilon_n$, then $f_n$ converges uniformly to $f$, the solution of the limiting problem.
Note that no upper bound on $\varepsilon_n$ is needed.
29 Γ-convergence
$(Y, d_Y)$: metric space, $F_n : Y \to [0, \infty]$.
Definition. The sequence $\{F_n\}_{n\in\mathbb{N}}$ Γ-converges (w.r.t. $d_Y$) to $F : Y \to [0,\infty]$ if:
- Liminf inequality: for every $y \in Y$ and every sequence $y_n \to y$, $\liminf_{n\to\infty} F_n(y_n) \ge F(y)$.
- Limsup inequality: for every $y \in Y$ there exists a sequence $y_n \to y$ such that $\limsup_{n\to\infty} F_n(y_n) \le F(y)$.
Definition (compactness property). $\{F_n\}_{n\in\mathbb{N}}$ satisfies the compactness property if: whenever $\{y_n\}_{n\in\mathbb{N}}$ is bounded and $\{F_n(y_n)\}_{n\in\mathbb{N}}$ is bounded, $\{y_n\}_{n\in\mathbb{N}}$ has a convergent subsequence.
30 Proposition: convergence of minimizers
Γ-convergence and compactness imply: if $y_n$ is a minimizer of $F_n$ and $\{y_n\}_{n\in\mathbb{N}}$ is bounded in $Y$, then along a subsequence $y_n \to y$ as $n \to \infty$, and $y$ is a minimizer of $F$. In particular, if $F$ has a unique minimizer, then the sequence $\{y_n\}_{n\in\mathbb{N}}$ converges to the unique minimizer of $F$.
31 Topology
Consider a domain $D$ and random i.i.d. points $V_n = \{x_1, \dots, x_n\}$. How to compare $f_n : V_n \to \mathbb{R}$ and $u : D \to \mathbb{R}$ in a way consistent with the $L^1$ topology? Note that $u \in L^1(\nu)$ and $f_n \in L^1(\nu_n)$, where $\nu_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$.
32 Topology
Consider a domain $D$ and random i.i.d. points $V_n = \{x_1, \dots, x_n\}$. Let $T_n$ be a transportation map from $\nu$ to $\nu_n$; compare $u$ with $f_n \circ T_n$.
33 Topology
Let $\nu$ be a measure with density $\rho$, supported on the domain $D$. We need to compare values at nearby points; thus we also penalize the transport $|T_n(x) - x|$.
Metric: for $u \in L^p(\nu)$ and $f_n \in L^p(\nu_n)$,
$d((\nu, u), (\nu_n, f_n)) = \inf_{T_n \,:\, T_{n\#}\nu = \nu_n} \left( \int_D \big( |f_n(T_n(x)) - u(x)|^p + |T_n(x) - x|^p \big)\, \rho(x)\, dx \right)^{1/p}$
34 TL^p space
Definition.
$TL^p = \{(\nu, f) : \nu \in \mathcal{P}(D),\ f \in L^p(\nu)\}$
$d_{TL^p}((\nu, f), (\sigma, g)) = \inf_{\pi \in \Pi(\nu,\sigma)} \left( \iint_{D\times D} |x - y|^p + |f(x) - g(y)|^p \, d\pi(x, y) \right)^{1/p}$
where $\Pi(\nu, \sigma) = \{\pi \in \mathcal{P}(D \times D) : \pi(A \times D) = \nu(A),\ \pi(D \times A) = \sigma(A)\}$.
Lemma. $(TL^p, d_{TL^p})$ is a metric space. The topology of $TL^p$ agrees with $L^p$ convergence in the sense that $(\nu, f_n) \xrightarrow{TL^p} (\nu, f)$ iff $f_n \xrightarrow{L^p(\nu)} f$.
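For two empirical measures with the same number of uniformly weighted atoms, the infimum over couplings in the $TL^p$ metric is attained at a permutation (by Birkhoff's theorem on doubly stochastic matrices). A brute-force pure-Python sketch of that special case (points, function values, and p are illustrative; feasible only for tiny n):

```python
# TL^p distance between (nu, f) and (sigma, g) when both measures are uniform
# over n atoms: minimize over permutations the average transport-plus-value
# cost, then take the p-th root.

from itertools import permutations

def tlp_distance(xs, f, ys, g, p=2):
    n = len(xs)
    best = float("inf")
    for sigma in permutations(range(n)):
        cost = sum(abs(xs[i] - ys[sigma[i]]) ** p
                   + abs(f[i] - g[sigma[i]]) ** p for i in range(n)) / n
        best = min(best, cost)
    return best ** (1.0 / p)

xs, f = [0.0, 0.5, 1.0], [0.0, 0.25, 1.0]
ys, g = [0.1, 0.5, 0.9], [0.0, 0.25, 1.0]
d = tlp_distance(xs, f, ys, g, p=2)
assert 0 < d < 0.2  # small: the clouds and functions nearly match
```

Note the cost couples spatial displacement and function mismatch, which is exactly how $TL^p$ compares functions living on different point clouds.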
35 TL^p convergence
$(\nu, f_n) \xrightarrow{TL^p} (\nu, f)$ iff $f_n \xrightarrow{L^p(\nu)} f$.
$(\nu_n, f_n) \xrightarrow{TL^p} (\nu, f)$ iff the measures $(I \times f_n)_\# \nu_n$ converge weakly to $(I \times f)_\# \nu$; that is, if the graphs of the functions, considered as measures, converge weakly.
The space $TL^p$ is not complete; its completion consists of the probability measures on the product space $D \times \mathbb{R}$.
If $(\nu_n, f_n) \xrightarrow{TL^p} (\nu, f)$, then there exists a sequence of transportation plans $\pi_n \in \Pi(\nu_n, \nu)$ such that
(1) $\iint_{D\times D} |x - y|^p \, d\pi_n(x, y) \to 0$ as $n \to \infty$.
We call a sequence of transportation plans $\pi_n \in \Pi(\nu_n, \nu)$ stagnating if it satisfies (1).
36 Stagnating sequences
Stagnating sequence: $\iint_{D\times D} |x - y|^p\, d\pi_n(x, y) \to 0$.
TFAE:
1. $(\nu_n, f_n) \xrightarrow{TL^p} (\nu, f)$ as $n \to \infty$.
2. $\nu_n \rightharpoonup \nu$ and there exists a stagnating sequence of transportation plans $\{\pi_n\}_{n\in\mathbb{N}}$ for which
(2) $\iint_{D\times D} |f(x) - f_n(y)|^p\, d\pi_n(x, y) \to 0$ as $n \to \infty$.
3. $\nu_n \rightharpoonup \nu$ and (2) holds for every stagnating sequence of transportation plans $\pi_n$.
37 Formally, $TL^p(D)$ is a fiber bundle over $\mathcal{P}(D)$.
38 Γ-convergence for the p-Laplacian energy
The energy
$E_n(f_n) = \frac{1}{\varepsilon_n^p n^2} \sum_{i,j} \eta_\varepsilon(x_i - x_j)\, |f_n(x_i) - f_n(x_j)|^p$
Γ-converges in the $TL^p$ space to
$\sigma E(f) = \sigma \int_\Omega |\nabla f(x)|^p\, \rho(x)^2\, dx$
as $n \to \infty$, provided that
$1 \gg \varepsilon_n \gg \begin{cases} \frac{(\log\log n)^{1/2}}{n^{1/2}} & \text{if } d = 1 \\ \frac{(\log n)^{3/4}}{n^{1/2}} & \text{if } d = 2 \\ \left(\frac{\log n}{n}\right)^{1/d} & \text{if } d \ge 3. \end{cases}$
39 Comment on $\varepsilon_n$
We require $\frac{(\log n)^{3/4}}{n^{1/2}} \ll \varepsilon_n$ if $d = 2$ and $\frac{(\log n)^{1/d}}{n^{1/d}} \ll \varepsilon_n$ if $d \ge 3$.
Note that for $d \ge 3$ this means the typical vertex degree is $\gtrsim \log n$. Does convergence hold if fewer than $\log n$ neighbors are connected?
40 Comment on $\varepsilon_n$ (continued)
Does convergence hold if fewer than $\log n$ neighbors are connected? No. There exists $c > 0$ such that if $\varepsilon_n < c\,\frac{(\log n)^{1/d}}{n^{1/d}}$, then with probability one the random geometric graph is asymptotically disconnected. This implies that for large enough $n$, $\min \mathrm{GC}_{n,\varepsilon_n} = 0$, while $\inf C > 0$. So for $d \ge 3$ the condition is optimal in terms of scaling.
41 Optimal transportation for p = ∞
∞-transportation distance:
$d_\infty(\mu, \nu) = \inf_{\pi \in \Pi(\mu,\nu)} \operatorname{esssup}_\pi \{ |x - y| : x \in X,\ y \in Y \}$
If $\mu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ and $\nu = \frac{1}{n}\sum_{j=1}^n \delta_{y_j}$ then
$d_\infty(\mu, \nu) = \min_{\sigma \text{ permutation}} \max_i |x_i - y_{\sigma(i)}|$
If $\mu$ has a density then an optimal transportation map $T$ exists (Champion, De Pascale, Juutinen 2008) and $d_\infty(\mu, \nu) = \|T(x) - x\|_{L^\infty(\mu)}$.
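The empirical formula above is a bottleneck matching: minimize, over permutations, the largest point displacement. A brute-force pure-Python sketch (the point sets are illustrative; only feasible for tiny n):

```python
# d_infinity between two empirical measures with n uniformly weighted atoms:
# min over permutations sigma of max_i |x_i - y_sigma(i)|.

from itertools import permutations

def d_infinity(xs, ys):
    n = len(xs)
    return min(max(abs(xs[i] - ys[sigma[i]]) for i in range(n))
               for sigma in permutations(range(n)))

xs = [0.0, 0.4, 1.0]
ys = [0.1, 0.5, 0.8]
# matching in order gives displacements 0.1, 0.1, 0.2; no permutation does better
assert abs(d_infinity(xs, ys) - 0.2) < 1e-12
```

In one dimension the monotone (sorted) matching is optimal, which the brute-force search confirms here.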
42 ∞-OT between a measure and its random sample
Optimal matchings in dimension $d \ge 3$: Ajtai, Komlós, and Tusnády (1983), Yukich and Shor (1991), García Trillos and Slepčev (2014).
Theorem. There are constants $c > 0$ and $C > 0$ (depending on $d$) such that, with probability one, we can find a sequence of transportation maps $\{T_n\}_{n\in\mathbb{N}}$ from $\nu$ to $\nu_n$ ($T_{n\#}\nu = \nu_n$) such that
$c \le \liminf_{n\to\infty} \frac{n^{1/d}\,\|\mathrm{id} - T_n\|_\infty}{(\log n)^{1/d}} \le \limsup_{n\to\infty} \frac{n^{1/d}\,\|\mathrm{id} - T_n\|_\infty}{(\log n)^{1/d}} \le C.$
43 ∞-OT between a measure and its random sample
Optimal matchings in dimension $d = 2$: Leighton and Shor (1986), new proof by Talagrand (2005), García Trillos and Slepčev (2014).
Theorem. There are constants $c > 0$ and $C > 0$ such that, with probability one, we can find a sequence of transportation maps $\{T_n\}_{n\in\mathbb{N}}$ from $\nu$ to $\nu_n$ ($T_{n\#}\nu = \nu_n$) such that
(3) $c \le \liminf_{n\to\infty} \frac{n^{1/2}\,\|\mathrm{id} - T_n\|_\infty}{(\log n)^{3/4}} \le \limsup_{n\to\infty} \frac{n^{1/2}\,\|\mathrm{id} - T_n\|_\infty}{(\log n)^{3/4}} \le C.$
44 Higher-order regularizations in SSL
With Dunlop, Stuart, and Thorpe; model by Zhou and Belkin '11.
Random sample $x_1, \dots, x_n$. Labels are known if $x_i \in \Omega_L$, an open set.
Using the graph Laplacian $L_n$ we define $A_n = (L_n + \tau^2 I)^\alpha$. The power of a symmetric matrix is defined by $M^\alpha = P D^\alpha P^{-1}$ for $M = P D P^{-1}$.
Higher-order SSL: minimize
$E(f) = \frac{1}{2} \langle f_n, A_n f_n \rangle_{\mu_n}$
subject to the constraint $f_n(x_i) = y_i$ whenever $x_i \in \Omega_L$.
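The fractional power defining $A_n$ is computed through the eigendecomposition: for symmetric $M = P D P^T$ with orthonormal eigenvectors, $M^\alpha = P D^\alpha P^T$. A self-contained pure-Python sketch on a hand-solved 2x2 symmetric matrix (the matrix and $\alpha$ are illustrative; a real implementation would use a numerical eigensolver):

```python
# Power of the symmetric matrix [[a, b], [b, d]] via its spectral
# decomposition: M^alpha = sum_k lambda_k^alpha v_k v_k^T / |v_k|^2.

import math

def sym2x2_power(a, b, d, alpha):
    # eigenvalues of a 2x2 symmetric matrix
    mean, diff = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    lam1, lam2 = mean - diff, mean + diff
    if b == 0:  # already diagonal
        return [[a ** alpha, 0.0], [0.0, d ** alpha]]
    # eigenvectors: (M - lam I) v = 0 gives v = (lam - d, b)
    v1, v2 = (lam1 - d, b), (lam2 - d, b)
    out = [[0.0, 0.0], [0.0, 0.0]]
    for lam, v in ((lam1, v1), (lam2, v2)):
        norm2 = v[0] ** 2 + v[1] ** 2
        for i in range(2):
            for j in range(2):
                out[i][j] += lam ** alpha * v[i] * v[j] / norm2
    return out

# sanity check: the square root of M, squared, recovers M
M = [[2.0, 1.0], [1.0, 3.0]]
R = sym2x2_power(2.0, 1.0, 3.0, 0.5)
M_back = [[sum(R[i][k] * R[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
assert all(abs(M_back[i][j] - M[i][j]) < 1e-9 for i in range(2) for j in range(2))
```

Since $L_n + \tau^2 I$ is symmetric positive definite, its eigenvalues are positive and any real power $\alpha$ is well defined this way.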
45 Higher-order regularizations in SSL
Higher-order SSL: with $A_n = (L_n + \tau^2 I)^\alpha$, minimize $E(f) = \frac{1}{2}\langle f_n, A_n f_n\rangle_{\mu_n}$ subject to the constraint $f_n(x_i) = y_i$ whenever $x_i \in \Omega_L$.
Theorem (Dunlop, Stuart, Slepčev, and Thorpe). For $\alpha > \frac{d}{2}$, under the usual assumptions, minimizers $f_n$ converge in $TL^2$ to the minimizer of
$E(u) = \sigma \int_\Omega u(x)\,(\mathcal{A}u)(x)\,\rho(x)\,dx$
subject to the constraint $u(x_i) = y_i$ whenever $x_i \in \Omega_L$, where $\mathcal{A} = (\sigma \mathcal{L}_c + \tau^2 I)^\alpha$ and $\mathcal{L}_c u = -\frac{1}{\rho}\,\operatorname{div}(\rho^2 \nabla u)$.
46 Higher-order regularizations in SSL
With Dunlop, Stuart, and Thorpe; model by Zhou and Belkin '11.
k labeled points $(x_1, y_1), \dots, (x_k, y_k)$, and a random sample $x_{k+1}, \dots, x_n$. Using the graph Laplacian $L_n$ we define $A_n = (L_n + \tau^2 I)^\alpha$.
Higher-order SSL: minimize
$E(f) = \frac{1}{2} \langle f_n, A_n f_n \rangle_{\mu_n}$
subject to the constraint $f_n(x_i) = y_i$ for $i = 1, \dots, k$.
47 Higher-order regularizations
Higher-order SSL: with $A_n = (L_n + \tau^2 I)^\alpha$, minimize $E(f) = \frac{1}{2}\langle f_n, A_n f_n\rangle_{\mu_n}$ subject to the constraint $f_n(x_i) = y_i$ for $i = 1, \dots, k$.
Lemma (Dunlop, Stuart, Slepčev, and Thorpe). If $n^{-\frac{1}{2\alpha}} \ll \varepsilon_n \ll 1$, then minimizers $f_n$ converge in $TL^2$, along a subsequence, to a constant. That is, spikes occur.
48 Open problems
- Error estimates. In particular, why is the error smallest for rather coarse graphs?
- Pointwise assigned labels for higher-order operators.
- Better discretizations of graphs.
Pré-Publicações do Departamento de Matemática Universidade de Coimbra Preprint Number 18 55 OPTIMAL REGULARITY FOR A TWO-PHASE FREE BOUNDARY PROBLEM RULED BY THE INFINITY LAPLACIAN DAMIÃO J. ARAÚJO, EDUARDO
More informationInequalities of Babuška-Aziz and Friedrichs-Velte for differential forms
Inequalities of Babuška-Aziz and Friedrichs-Velte for differential forms Martin Costabel Abstract. For sufficiently smooth bounded plane domains, the equivalence between the inequalities of Babuška Aziz
More informationFunctional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...
Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................
More informationEverywhere differentiability of infinity harmonic functions
Everywhere differentiability of infinity harmonic functions Lawrence C. Evans and Charles K. Smart Department of Mathematics University of California, Berkeley Abstract We show that an infinity harmonic
More informationOn some weighted fractional porous media equations
On some weighted fractional porous media equations Gabriele Grillo Politecnico di Milano September 16 th, 2015 Anacapri Joint works with M. Muratori and F. Punzo Gabriele Grillo Weighted Fractional PME
More informationOn the uniform Opial property
We consider the noncommutative modular function spaces of measurable operators affiliated with a semifinite von Neumann algebra and show that they are complete with respect to their modular. We prove that
More informationON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS
Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)
More informationEcon Lecture 3. Outline. 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n
Econ 204 2011 Lecture 3 Outline 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n 1 Metric Spaces and Metrics Generalize distance and length notions
More informationMLCC Clustering. Lorenzo Rosasco UNIGE-MIT-IIT
MLCC 2018 - Clustering Lorenzo Rosasco UNIGE-MIT-IIT About this class We will consider an unsupervised setting, and in particular the problem of clustering unlabeled data into coherent groups. MLCC 2018
More informationLecture 4 Lebesgue spaces and inequalities
Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how
More information2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?
MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due 9/5). Prove that every countable set A is measurable and µ(a) = 0. 2 (Bonus). Let A consist of points (x, y) such that either x or y is
More information02. Measure and integral. 1. Borel-measurable functions and pointwise limits
(October 3, 2017) 02. Measure and integral Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2017-18/02 measure and integral.pdf]
More informationGlobal vs. Multiscale Approaches
Harmonic Analysis on Graphs Global vs. Multiscale Approaches Weizmann Institute of Science, Rehovot, Israel July 2011 Joint work with Matan Gavish (WIS/Stanford), Ronald Coifman (Yale), ICML 10' Challenge:
More informationOn the definition and properties of p-harmonious functions
On the definition and properties of p-harmonious functions University of Pittsburgh, UBA, UAM Workshop on New Connections Between Differential and Random Turn Games, PDE s and Image Processing Pacific
More informationSemi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT)
Semi-Supervised Learning in Gigantic Image Collections Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT) Gigantic Image Collections What does the world look like? High
More informationBounded uniformly continuous functions
Bounded uniformly continuous functions Objectives. To study the basic properties of the C -algebra of the bounded uniformly continuous functions on some metric space. Requirements. Basic concepts of analysis:
More informationAn introduction to Mathematical Theory of Control
An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018
More informationInteraction energy between vortices of vector fields on Riemannian surfaces
Interaction energy between vortices of vector fields on Riemannian surfaces Radu Ignat 1 Robert L. Jerrard 2 1 Université Paul Sabatier, Toulouse 2 University of Toronto May 1 2017. Ignat and Jerrard (To(ulouse,ronto)
More informationPROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS
PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please
More informationDyson series for the PDEs arising in Mathematical Finance I
for the PDEs arising in Mathematical Finance I 1 1 Penn State University Mathematical Finance and Probability Seminar, Rutgers, April 12, 2011 www.math.psu.edu/nistor/ This work was supported in part by
More informationPenalized Barycenters in the Wasserstein space
Penalized Barycenters in the Wasserstein space Elsa Cazelles, joint work with Jérémie Bigot & Nicolas Papadakis Université de Bordeaux & CNRS Journées IOP - Du 5 au 8 Juillet 2017 Bordeaux Elsa Cazelles
More informationGraph-Based Semi-Supervised Learning
Graph-Based Semi-Supervised Learning Olivier Delalleau, Yoshua Bengio and Nicolas Le Roux Université de Montréal CIAR Workshop - April 26th, 2005 Graph-Based Semi-Supervised Learning Yoshua Bengio, Olivier
More informationEssential Spectra of complete manifolds
Essential Spectra of complete manifolds Zhiqin Lu Analysis, Complex Geometry, and Mathematical Physics: A Conference in Honor of Duong H. Phong May 7, 2013 Zhiqin Lu, Dept. Math, UCI Essential Spectra
More information215 Problem 1. (a) Define the total variation distance µ ν tv for probability distributions µ, ν on a finite set S. Show that
15 Problem 1. (a) Define the total variation distance µ ν tv for probability distributions µ, ν on a finite set S. Show that µ ν tv = (1/) x S µ(x) ν(x) = x S(µ(x) ν(x)) + where a + = max(a, 0). Show that
More informationRIEMANNIAN GEOMETRY COMPACT METRIC SPACES. Jean BELLISSARD 1. Collaboration:
RIEMANNIAN GEOMETRY of COMPACT METRIC SPACES Jean BELLISSARD 1 Georgia Institute of Technology, Atlanta School of Mathematics & School of Physics Collaboration: I. PALMER (Georgia Tech, Atlanta) 1 e-mail:
More informationREGULARITY OF THE OPTIMAL SETS FOR SOME SPECTRAL FUNCTIONALS
REGULARITY OF THE OPTIMAL SETS FOR SOME SPECTRAL FUNCTIONALS DARIO MAZZOLENI, SUSANNA TERRACINI, BOZHIDAR VELICHKOV Abstract. In this paper we study the regularity of the optimal sets for the shape optimization
More informationModel Selection and Geometry
Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model
More informationPDEs in Image Processing, Tutorials
PDEs in Image Processing, Tutorials Markus Grasmair Vienna, Winter Term 2010 2011 Direct Methods Let X be a topological space and R: X R {+ } some functional. following definitions: The mapping R is lower
More informationOPTIMAL POTENTIALS FOR SCHRÖDINGER OPERATORS. 1. Introduction In this paper we consider optimization problems of the form. min F (V ) : V V, (1.
OPTIMAL POTENTIALS FOR SCHRÖDINGER OPERATORS G. BUTTAZZO, A. GEROLIN, B. RUFFINI, AND B. VELICHKOV Abstract. We consider the Schrödinger operator + V (x) on H 0 (), where is a given domain of R d. Our
More informationMaths 212: Homework Solutions
Maths 212: Homework Solutions 1. The definition of A ensures that x π for all x A, so π is an upper bound of A. To show it is the least upper bound, suppose x < π and consider two cases. If x < 1, then
More informationIFT LAPLACIAN APPLICATIONS. Mikhail Bessmeltsev
IFT 6112 09 LAPLACIAN APPLICATIONS http://www-labs.iro.umontreal.ca/~bmpix/teaching/6112/2018/ Mikhail Bessmeltsev Rough Intuition http://pngimg.com/upload/hammer_png3886.png You can learn a lot about
More informationg 2 (x) (1/3)M 1 = (1/3)(2/3)M.
COMPACTNESS If C R n is closed and bounded, then by B-W it is sequentially compact: any sequence of points in C has a subsequence converging to a point in C Conversely, any sequentially compact C R n is
More informationGraphs, Geometry and Semi-supervised Learning
Graphs, Geometry and Semi-supervised Learning Mikhail Belkin The Ohio State University, Dept of Computer Science and Engineering and Dept of Statistics Collaborators: Partha Niyogi, Vikas Sindhwani In
More informationNote on the Chen-Lin Result with the Li-Zhang Method
J. Math. Sci. Univ. Tokyo 18 (2011), 429 439. Note on the Chen-Lin Result with the Li-Zhang Method By Samy Skander Bahoura Abstract. We give a new proof of the Chen-Lin result with the method of moving
More informationBlow-up Analysis of a Fractional Liouville Equation in dimension 1. Francesca Da Lio
Blow-up Analysis of a Fractional Liouville Equation in dimension 1 Francesca Da Lio Outline Liouville Equation: Local Case Geometric Motivation Brezis & Merle Result Nonlocal Framework The Half-Laplacian
More information