Journal of Mathematical Analysis and Applications

J. Math. Anal. Appl.

Contents lists available at ScienceDirect

Journal of Mathematical Analysis and Applications

www.elsevier.com/locate/jmaa

Semi-Supervised Learning with the help of Parzen Windows

Shao-Gao Lv (a,*), Yun-Long Feng (b)

(a) Statistics School, Southwestern University of Finance and Economics, Chengdu, China
(b) Joint Advanced Research Center of University of Science and Technology of China and City University of Hong Kong, Suzhou, China

Article history: Received 15 April 2011. Available online 29 July 2011. Submitted by U. Stadtmüller.

Keywords: Semi-Supervised Learning; Graph-based models; Support vector machine; Least square regression; Reproducing kernel Hilbert spaces.

Abstract. Semi-Supervised Learning is a family of machine learning techniques that make use of both labeled and unlabeled data for training, typically a small amount of labeled data together with a large number of unlabeled data. In this paper we propose a Semi-Supervised regression algorithm built on a density estimator generated by Parzen window functions. We conduct an error analysis by a capacity-independent technique and obtain satisfactory learning rates in terms of the regularity of the target function and a decay condition on the marginal distribution near the boundary. (c) 2011 Elsevier Inc. All rights reserved.

1. Introduction

Traditional learning from examples uses only labeled data for training. In many real-world applications, however, it is relatively easy to acquire a large amount of unlabeled data. For example, in phonological acquisition a child is exposed to many acoustic utterances, and these utterances do not come with identifiable phonological markers; corrective feedback is the main source of directly labeled examples, yet a small amount of feedback is often sufficient for the child to master the acoustic-to-phonetic mapping of a language. Another instance is a robot with a video camera: the robot continuously takes high frame-rate video of its surroundings and seeks to learn the names of the objects in the video, but names are stored in the system beforehand only very rarely. These phenomena can be formulated as a Semi-Supervised Learning situation: most objects are unlabeled, while only a few are labeled. Semi-Supervised Learning addresses this problem by using a large amount of unlabeled data, together with the labeled data, to build better learners. Requiring less human effort while often giving higher accuracy, Semi-Supervised Learning can be of great practical value.

At first glance, unlabeled data by itself does not carry any information on the mapping from the input space X to the output space Y. How can it help us learn a better predictor $f : X \to Y$? Balcan and Blum pointed out in [1] that the key lies in an implicit ordering of the functions $f \in F$ induced by the unlabeled data. Roughly speaking, if this implicit order happens to rank the target predictor near the top, then one needs less labeled data to learn it. This idea is formalized in [1] using PAC learning bounds. In other contexts, the implicit ordering is interpreted as a prior over F or as a regularizer. Therefore, finding the implicit order of the input space and designing a good learning algorithm are both critical in Semi-Supervised Learning.

The history of Semi-Supervised Learning goes back to at least the 1970s, when self-training, transduction and Gaussian mixtures with the EM algorithm first emerged. A variety of Semi-Supervised techniques have since been developed, all of which can be classified into generative and discriminative methods. A straightforward generative Semi-Supervised method is the Expectation Maximization (EM) algorithm; the EM approach for naive Bayes text classification models is discussed by Nigam [12].
* Corresponding author. E-mail address: kenan76@mail.ustc.edu.cn (S.-G. Lv).

Discriminative Semi-Supervised methods [9] include probabilistic and non-probabilistic approaches, such as transductive or Semi-Supervised Support Vector Machines (TSVMs, S3VMs) and a variety of other graph-based methods [23,2]. Different learning methods for Semi-Supervised Learning rest on different assumptions, and fulfilling these assumptions is crucial for the success of the methods. More discussion can be found in [3,22].

From the learning-kernel point of view, many existing structured learning algorithms (e.g. conditional random fields, max-margin Markov networks) can be endowed with a Semi-Supervised kernel. The key idea is that a graph kernel can be constructed from all the unlabeled data; one then applies the graph kernel to a standard structured learning kernel machine. Such kernel machines include the kernelized conditional random fields [10] and the max-margin Markov networks [17], which differ primarily in the loss function they use.

As is well known, the Parzen window method [13] is a widely used technique for kernel density estimation and regression (see e.g. [19]), and recently it has also been applied to multi-class classification [14] and to regression problems [21] on a general subset of $\mathbb{R}^d$ satisfying a decay condition near the boundary. These works motivate us to study Semi-Supervised regression integrated with density estimation using Parzen windows. Intuitively, the unknown marginal distribution can be characterized well by Parzen windows applied to large amounts of unlabeled data, and this provides additional useful information for supervised learning.

Definition 1. We say that $W : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a basic window function if it is continuous and satisfies:
(i) $\int_{\mathbb{R}^d} W(x, y)\,dy = 1$ for each $x \in \mathbb{R}^d$;
(ii) there exist some $q > d + 2$ and a constant $c_q > 0$ such that
    $W(x, y) \le \dfrac{c_q}{(1 + |x - y|)^q}$ for all $x, y \in \mathbb{R}^d$.

The decay condition is reasonable since it is satisfied by many commonly used functions in multivariate analysis and learning theory.

Suppose that the input space X is a compact subset of $\mathbb{R}^d$ and the output space $Y \subset \mathbb{R}$. Let $\rho$ be a probability measure defined on $X \times Y$. We are given a sample $\mathbf{z} := \{(x_i, y_i)\}_{i=1}^m$ drawn independently according to $\rho$, and $n$ unlabeled data $\{x_i\}_{i=m+1}^{m+n}$ drawn from the marginal distribution, denoted by $\rho_X$; often $n \gg m$. In our setting, the objective is to learn the regression function
    $f_\rho(x) = \int_Y y\,d\rho(y|x)$, $x \in X$,   (1.1)
where $\rho(y|x)$ is the conditional probability measure at $x$ induced by $\rho$.

We consider a learning algorithm in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel $K$, where $K : X \times X \to \mathbb{R}$ is a continuous, symmetric and positive semi-definite function. Without loss of generality, assume that $\sup_{x,y \in X} K(x, y) \le \kappa^2$. On the basis of the density estimator generated by Parzen windows, we learn the regression function by
    $f_{\mathbf{z},\lambda} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{m(m+n)h^d} \sum_{i=1}^{m} \sum_{j=1}^{m+n} W\Big(\frac{x_i}{h}, \frac{x_j}{h}\Big) (f(x_i) - y_i)^2 + \lambda \|f\|_K^2 \Big\}$,   (1.2)
where $h = h(m) > 0$ is a scaling parameter and $\lambda \in (0, 1)$ a regularization parameter controlling the trade-off between the empirical error and the penalty term $\|f\|_K^2$.

Since X is a compact subset of $\mathbb{R}^d$, some regularity conditions on both the marginal distribution and its density are required. Throughout the paper we make the following assumption.

Assumption 1. Suppose that for some $0 < \tau \le 1$ and $c_p > 0$, the marginal distribution $\rho_X$ satisfies
    $\rho_X\big(\{x \in X : \inf_{u \in \mathbb{R}^d \setminus X} |u - x| \le s\}\big) \le c_p^2 s^{2\tau}$ for all $s > 0$,   (1.3)
and the density $p(x)$ of $\rho_X$ satisfies
    $\sup_{x \in X} p(x) \le c_p$ and $|p(x) - p(v)| \le c_p |v - x|^\tau$ for all $v, x \in X$.   (1.4)

Denote by $L^2_\mu$ the $L^2$-space with inner product $\langle f, g \rangle_{L^2_\mu} = \int_X f(x) g(x)\,d\mu$. Our learning rates will be achieved under the regularity assumption on the regression function that $f_\rho$ lies in the range of $L_K^r$ for some $r > 0$. Here $L_K$ is the integral operator on $L^2_\nu$ defined by

    $L_K f(x) = \int_X K(x, y) f(y)\,d\nu(y)$, $x \in X$, $f \in L^2_\nu$,
where $d\nu(x) = \frac{p(x)}{V_p}\,d\rho_X(x)$ with $V_p := \int_X p(x)^2\,dx > 0$. The operator $L_K$ is linear, compact and positive, and it can also be regarded as a self-adjoint operator on $\mathcal{H}_K$. Since the measure $d\nu(x) = \frac{p(x)}{V_p}\,d\rho_X(x)$ is a probability measure on X and $p \le c_p$, the fractional operator $L_K^r$ ($0 < r \le 1$) is also well defined on $L^2_{\rho_X}$; see [4].

We can now state our learning rate for the Semi-Supervised regression algorithm; it will be proved in Section 5.

Theorem 1. Assume $|y| \le M$ almost surely for some constant $M > 0$. Suppose Assumption 1 holds and $L_K^{-r} f_\rho \in L^2_{\rho_X}$ with $0 < r \le 1$. Choose $h = \lambda^{1/\tau}$ and $\lambda = m^{-\frac{\tau}{2\tau r + 1 + d}}$. Then for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
    $\|f_{\mathbf{z},\lambda} - f_\rho\|_{L^2_{\rho_X}} \le C \log\frac{2}{\delta}\, \Big(\frac{1}{m}\Big)^{\frac{\tau r}{2\tau r + 1 + d}}$,
where C is a constant independent of m and δ.

The key technique behind the proof of convergence of the learning scheme is an error decomposition consisting of a sample error term and an approximation error term. The first term, the sample error, is a function of the sample z and is bounded using a concentration inequality in a Hilbert space. The second term, the approximation error, does not depend on the sample, and we use approximation theory to bound it.

The paper is organized as follows. In Section 2 a new Semi-Supervised regression algorithm based on density estimation is presented. In Section 3 we introduce a necessary concentration inequality in a Hilbert space and estimate the sample error in $\mathcal{H}_K$. In Section 4 we bound the approximation error under Assumption 1 and the regularity of the regression function. Finally we obtain the learning rate by combining the sample error with the approximation error.

2. Semi-Supervised algorithm with density estimators

Let us turn our attention to Parzen windows. Let $x_1, x_2, \dots, x_n$ be an i.i.d. sample drawn from some distribution with an unknown density p. We are interested in estimating the shape of this function p. Its kernel density estimator by means of the Parzen window method is given as
    $\hat{p}_h(x) = \frac{1}{n h^d} \sum_{i=1}^n W\Big(\frac{x}{h}, \frac{x_i}{h}\Big)$,   (2.1)
where $h > 0$ is a smoothing parameter called the bandwidth. Parzen showed that $\hat{p}_h(x)$ converges to $p(x)$ whenever $h = h(n)$ satisfies
    $\lim_{n\to\infty} h(n) = 0$ and $\lim_{n\to\infty} n [h(n)]^d = \infty$.
The kernel with subscript h, $W_h(x, y) = \frac{1}{h^d} W(\frac{x}{h}, \frac{y}{h})$, is called the scaled kernel. Intuitively one wants to choose h as small as the data allow; however, there is always a trade-off between the bias of the estimator and its variance. When the boundary of X satisfies a certain decay condition, by choosing a proper parameter $h = h(n)$, a standard convergence rate for density estimation, found in [6,8], can be expressed as
    $\|\hat{p}_h - p\|_{L^2} = O\big(n^{-2/(d+4)}\big)$.   (2.2)
Note that if the unknown density p is smooth enough, the convergence rate can be improved by using higher order Parzen windows [21].

In this paper, our idea is that more accuracy should be demanded at sample points where the density of the marginal distribution $\rho_X$ is much larger than elsewhere. In the classical regression algorithm all sample points are treated equally, which may cause a large deviation when there exist some singular points among the sample data. Since an unseen data point appears with greater confidence in denser regions, our efforts should focus there, exploiting the huge surplus of unlabeled data. Taking the data structure of the input space into account, we propose the kernel-based regression estimator weighted by the density function:
    $f_{\mathbf{z},\lambda} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{m} \sum_{i=1}^m p(x_i) (f(x_i) - y_i)^2 + \lambda \|f\|_K^2 \Big\}$.   (2.3)
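Before replacing the unknown density in (2.3) by its estimate, it may help to see the Parzen window estimator $\hat{p}_h$ of (2.1) in code. The following minimal Python sketch is only an illustration and is not part of the paper: the Gaussian window used here is one convenient basic window function satisfying Definition 1, and the bandwidth rule $h = n^{-1/(d+4)}$ is the classical choice behind the rate (2.2).

```python
# Illustrative sketch (not from the paper): Parzen window density estimation
# with a basic window function in the sense of Definition 1.  The Gaussian
# window and the bandwidth rule h ~ n^(-1/(d+4)) are assumptions made purely
# for this example.
import numpy as np

def gaussian_window(u, v):
    """Basic window W(u, v): integrates to 1 in v and decays faster than any
    polynomial, so conditions (i) and (ii) of Definition 1 hold."""
    d = u.shape[-1]
    sq = np.sum((u - v) ** 2, axis=-1)
    return (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * sq)

def parzen_density(x_eval, x_sample, h):
    """Parzen window estimator  p_hat(x) = 1/(n h^d) * sum_i W(x/h, x_i/h)."""
    n, d = x_sample.shape
    vals = gaussian_window(x_eval[:, None, :] / h, x_sample[None, :, :] / h)
    return vals.sum(axis=1) / (n * h ** d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 2, 2000
    x_sample = rng.normal(size=(n, d))        # unlabeled sample from rho_X
    h = n ** (-1.0 / (d + 4))                 # classical bandwidth choice
    x_eval = np.zeros((1, d))
    print("estimated density at 0:", parzen_density(x_eval, x_sample, h)[0])
    print("true density at 0:     ", (2.0 * np.pi) ** (-d / 2.0))
```

Shrinking h reduces the bias of $\hat{p}_h$ but increases its variance, which is exactly the trade-off discussed above.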

As mentioned above, $p(x)$ is usually unknown, and we shall replace it with a density estimator. Here we make use of the Parzen window method as the kernel density estimate. Thus, given the set of n unlabeled examples $\{x_i\}_{i=m+1}^{m+n}$, we arrive at the optimization problem (1.2).

The optimization scheme (1.2) looks like weighted least squares regression (WLSR) in Supervised Learning [5,7], but here we are mainly concerned with the underlying data shape induced by large amounts of unlabeled data. In the classical WLSR setting there is no surplus of unlabeled data, and the estimated parameters may be sensitive to a weight function built from only a few observations.

By the reproducing property of the RKHS, the Representer Theorem holds, which means that the minimizer can be obtained in a finite dimensional space in terms of both labeled and unlabeled examples. As before, the Representer Theorem shows that the solution has the form
    $f_{\mathbf{z},\lambda}(x) = \sum_{i=1}^m \alpha_i K(x, x_i)$.
Substituting this form into the problem above, we arrive at the following convex differentiable objective function of the m-dimensional variable $a = [a_1, \dots, a_m]^T$:
    $\alpha = \arg\min_{a \in \mathbb{R}^m} \frac{1}{m}(Y - K_{\mathbf{x}} a)^T Q (Y - K_{\mathbf{x}} a) + \lambda a^T K_{\mathbf{x}} a$,
where $K_{\mathbf{x}}$ is the Gram matrix $(K_{\mathbf{x}})_{i,j} = K(x_i, x_j)$, $Q = \mathrm{diag}(\hat{p}_h(x_1), \dots, \hat{p}_h(x_m))$ and Y is the label vector $Y = [y_1, \dots, y_m]^T$. The derivative of the objective function vanishes at the minimizer, which leads to the following solution (a numerical sketch of this closed form is given after (3.4) below):
    $\alpha = (Q K_{\mathbf{x}} + \lambda m I)^{-1} Q Y$.

3. Sample error estimate

In this section we investigate the sample error of the proposed algorithm as the parameters h and λ change. To analyze (1.2), denote by $\mathbf{x}$ the set of inputs $\{x_1, \dots, x_m\}$ and define the sampling operator $S_{\mathbf{x}} : \mathcal{H}_K \to \mathbb{R}^m$ by $S_{\mathbf{x}} f = (f(x_i))_{i=1}^m$. The adjoint of the sampling operator, $S_{\mathbf{x}}^T : \mathbb{R}^m \to \mathcal{H}_K$, can be written as
    $S_{\mathbf{x}}^T c = \sum_{i=1}^m c_i K_{x_i}$, $c \in \mathbb{R}^m$,
where $K_x := K(x, \cdot)$. A method similar to that of [4] applies to algorithm (1.2), whose minimizer can be represented as
    $f_{\mathbf{z},\lambda} = \Big(\frac{1}{m(m+n)h^d} S_{\mathbf{x}}^T G S_{\mathbf{x}} + \lambda I\Big)^{-1} \frac{1}{m(m+n)h^d} S_{\mathbf{x}}^T G Y$,   (3.1)
where $G = \mathrm{diag}(g(x_1), \dots, g(x_m))$ with $g(x_i) = \sum_{j=1}^{m+n} W(\frac{x_i}{h}, \frac{x_j}{h})$, $i = 1, \dots, m$, and Y is defined as above. To study the limiting behavior of $f_{\mathbf{z},\lambda}$, the following notions play a key role in our theoretical analysis.

Definition 2. The regularization error and the regularizing function of the triple $(\mathcal{H}_K, f_\rho, \rho)$ are defined respectively as
    $D(\lambda) = \inf_{f \in \mathcal{H}_K} \Big\{ \frac{1}{h^d} \int_X \int_X W\Big(\frac{x}{h}, \frac{t}{h}\Big) \big(f(x) - f_\rho(x)\big)^2\,d\rho_X(x)\,d\rho_X(t) + \lambda \|f\|_K^2 \Big\}$,   (3.2)
and
    $f_\lambda = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{h^d} \int_X \int_X W\Big(\frac{x}{h}, \frac{t}{h}\Big) \big(f(x) - f_\rho(x)\big)^2\,d\rho_X(x)\,d\rho_X(t) + \lambda \|f\|_K^2 \Big\}$, $\lambda > 0$.   (3.3)

In order to obtain the explicit form of $f_\lambda$, we introduce the operator $\bar{L}_h : L^2_{\rho_X} \to L^2_{\rho_X}$ given by
    $\bar{L}_h f(u) = \frac{1}{h^d} \int_X \int_X W\Big(\frac{x}{h}, \frac{t}{h}\Big) f(x) K(u, x)\,d\rho_X(x)\,d\rho_X(t)$, for $u \in X$.
Taking functional derivatives, we find that $f_\lambda$ can be expressed in terms of $\bar{L}_h$ as
    $f_\lambda = (\lambda I + \bar{L}_h)^{-1} \bar{L}_h f_\rho$.   (3.4)
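The closed form $\alpha = (Q K_{\mathbf{x}} + \lambda m I)^{-1} Q Y$ obtained at the end of Section 2 can be turned into a few lines of code. The sketch below is an illustration only, not the authors' implementation: the Gaussian window, the Gaussian Mercer kernel K and all parameter values are assumptions chosen for the example, and the density weights in Q are computed from the labeled and unlabeled inputs together, as in (1.2).

```python
# Illustrative sketch (not the authors' code): semi-supervised kernel
# regression (1.2) solved through its finite-dimensional form
#     alpha = (Q K_x + lambda * m * I)^{-1} Q Y,
# with Parzen density weights Q built from labeled + unlabeled inputs.
import numpy as np

def sq_dists(a, b):
    return np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)

def scaled_window(a, b, h, d):
    """Scaled window (1/h^d) W(a/h, b/h); Gaussian window as an assumption."""
    return (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * sq_dists(a, b) / h ** 2) / h ** d

def mercer_kernel(a, b, sigma=1.0):
    """Gaussian Mercer kernel K for the RKHS H_K (an illustrative choice)."""
    return np.exp(-sq_dists(a, b) / (2.0 * sigma ** 2))

def fit_semisupervised(x_lab, y_lab, x_unlab, lam, h):
    m, d = x_lab.shape
    x_all = np.vstack([x_lab, x_unlab])                   # all m + n inputs
    # Parzen density weights at the labeled points, using all m + n inputs
    q = scaled_window(x_lab, x_all, h, d).mean(axis=1)    # ~ p_hat(x_i)
    K = mercer_kernel(x_lab, x_lab)
    alpha = np.linalg.solve(np.diag(q) @ K + lam * m * np.eye(m), q * y_lab)
    return alpha

def predict(alpha, x_lab, x_new):
    return mercer_kernel(x_new, x_lab) @ alpha

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    target = lambda x: np.sin(2.0 * x[:, 0])
    x_lab = rng.uniform(-1, 1, size=(30, 1))
    y_lab = target(x_lab) + 0.1 * rng.normal(size=30)
    x_unlab = rng.uniform(-1, 1, size=(500, 1))           # n >> m unlabeled points
    alpha = fit_semisupervised(x_lab, y_lab, x_unlab, lam=1e-3, h=0.2)
    x_test = np.linspace(-1, 1, 5).reshape(-1, 1)
    print(np.c_[target(x_test), predict(alpha, x_lab, x_test)])
```

The only difference from ordinary kernel ridge regression is the diagonal matrix Q: labeled points sitting in dense regions of $\rho_X$, as measured with the help of the unlabeled inputs, receive larger weight in the fit.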

Intuitively, for fixed $f \in \mathcal{H}_K$ there holds
    $E\Big[\frac{1}{m(m+n)h^d} S_{\mathbf{x}}^T G S_{\mathbf{x}} f\Big] = \bar{L}_h f$ and $E\Big[\frac{1}{m(m+n)h^d} S_{\mathbf{x}}^T G Y\Big] = \bar{L}_h f_\rho$.
Therefore one expects $f_{\mathbf{z},\lambda}$ to approximate $f_\lambda$. To bound the sample error $f_{\mathbf{z},\lambda} - f_\lambda$, we need a McDiarmid–Bernstein type probability inequality for vector-valued random variables, which appears in [15].

Lemma 1. Let $\mathbf{x} = \{x_i\}_{i=1}^m$ be i.i.d. draws from a probability distribution ρ on X, $(H, \|\cdot\|)$ a Hilbert space, and $F : X^m \to H$ measurable. If there exists $\tilde{M} \ge 0$ such that $\|F(\mathbf{x}) - E_{x_i} F(\mathbf{x})\| \le \tilde{M}$ for each $1 \le i \le m$ and almost every $\mathbf{x}$, then for every $\varepsilon > 0$,
    $\mathrm{Prob}_{\mathbf{x}}\big\{ \|F(\mathbf{x}) - E F(\mathbf{x})\| \ge \varepsilon \big\} \le 2 \exp\Big\{ -\frac{\varepsilon^2}{2(\tilde{M}\varepsilon + \sigma^2)} \Big\}$,   (3.5)
where $\sigma^2 := \sum_{i=1}^m \sup_{\mathbf{x} \setminus \{x_i\}} E_{x_i} \|F(\mathbf{x}) - E_{x_i} F(\mathbf{x})\|^2$. Consequently, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
    $\|F(\mathbf{x}) - E F(\mathbf{x})\| \le 2 \log\frac{2}{\delta}\,\big(\tilde{M} + \sigma\big)$.

It is worth pointing out that, although estimating the sample error is a standard method in learning theory [11,16], the operator $\bar{L}_h$ is not necessarily Hilbert–Schmidt, so the corresponding techniques are not directly applicable here.

Proposition 1. Suppose that the density $p(x)$ of $\rho_X$ exists and satisfies $\sup_{x \in X} p(x) \le c_p$. Then for any $0 < h \le 1$ and $0 < \delta < 1$, with confidence $1 - \delta$, there holds
    $\|f_{\mathbf{z},\lambda} - f_\lambda\|_K \le \frac{2 c_q \kappa (M + \|f_\lambda\|_K)}{\lambda} \Big( \frac{6}{m h^d} + \frac{6 c_q c_p}{\sqrt{m}\, h^{d/2}} \Big) \log\frac{2}{\delta}$.   (3.6)

Proof. By (3.1), we have
    $f_{\mathbf{z},\lambda} - f_\lambda = \Big(\frac{1}{m(m+n)h^d} S_{\mathbf{x}}^T G S_{\mathbf{x}} + \lambda I\Big)^{-1} \Big[ \frac{1}{m(m+n)h^d} \big( S_{\mathbf{x}}^T G Y - S_{\mathbf{x}}^T G S_{\mathbf{x}} f_\lambda \big) - \lambda f_\lambda \Big]$.
Define a function $F : Z^m \times X^n \to \mathcal{H}_K$ by
    $F(\mathbf{z}, \mathbf{x}) = \frac{1}{m(m+n)h^d} \big( S_{\mathbf{x}}^T G Y - S_{\mathbf{x}}^T G S_{\mathbf{x}} f_\lambda \big) = \frac{1}{m(m+n)h^d} \sum_{i=1}^m \sum_{j=1}^{m+n} W\Big(\frac{x_i}{h}, \frac{x_j}{h}\Big) \big( y_i - f_\lambda(x_i) \big) K_{x_i}$.
By independence, the expectation of $F(\mathbf{z}, \mathbf{x})$ equals
    $\frac{1}{h^d} \int_X \int_X W\Big(\frac{x}{h}, \frac{t}{h}\Big) \big( f_\rho(x) - f_\lambda(x) \big) K_x\,d\rho_X(x)\,d\rho_X(t) = \bar{L}_h (f_\rho - f_\lambda)$.
By (3.4), we see that $\lambda f_\lambda = \bar{L}_h(f_\rho - f_\lambda) = E F(\mathbf{z}, \mathbf{x})$. Therefore,
    $\|f_{\mathbf{z},\lambda} - f_\lambda\|_K \le \frac{1}{\lambda} \big\| F(\mathbf{z}, \mathbf{x}) - E F(\mathbf{z}, \mathbf{x}) \big\|_K$.
When $k \in \{1, \dots, m\}$, the difference $F(\mathbf{z}, \mathbf{x}) - E_{z_k} F(\mathbf{z}, \mathbf{x})$ collects exactly the terms of F in which the point $(x_k, y_k)$ occurs, namely the terms with $i = k$ and the terms with $j = k$, $i \ne k$, together with the corresponding conditional expectations. The reproducing property of the RKHS implies that
    $\|f\|_{L^2_{\rho_X}} \le \|f\|_\infty \le \kappa \|f\|_K$, for $f \in \mathcal{H}_K$.   (3.7)

Hence, when $k \in \{1, \dots, m\}$, we have
    $\|F(\mathbf{z}, \mathbf{x}) - E_{z_k} F(\mathbf{z}, \mathbf{x})\|_K \le \tilde{M} := \frac{6 c_q \kappa (M + \|f_\lambda\|_K)}{m h^d}$.   (3.8)
Clearly, the same conclusion also holds for the case $k \in \{m+1, \dots, m+n\}$. Next we need to estimate the variance $\sigma^2$. The quantity $\|F(\mathbf{z}, \mathbf{x}) - E_{z_k} F(\mathbf{z}, \mathbf{x})\|_K$ can also be bounded by
    $\frac{3 c_q \kappa (M + \|f_\lambda\|_K)}{m(m+n)h^d} \Big( \sum_{j=1}^{m+n} W\Big(\frac{x_k}{h}, \frac{x_j}{h}\Big) + \sum_{i \ne k} W\Big(\frac{x_i}{h}, \frac{x_k}{h}\Big) \Big)$.
Note that from Definition 1 of the Parzen window it follows that $\big( E_{x_k} \|F(\mathbf{z}, \mathbf{x}) - E_{z_k} F(\mathbf{z}, \mathbf{x})\|_K^2 \big)^{1/2}$ is bounded by
    $\frac{6 c_q \kappa (M + \|f_\lambda\|_K)}{m(m+n)h^d} \sum_{j=1}^{m+n} \Big( \int_{\mathbb{R}^d} \frac{c_q^2\, p(x)}{(1 + |x_j - x|/h)^{2q}}\,dx \Big)^{1/2} \le \frac{6 c_q^2 \kappa c_p (M + \|f_\lambda\|_K)}{m h^{d/2}}$.   (3.9)
It follows that for $0 < h \le 1$,
    $\sigma^2 \le \frac{36 c_q^4 \kappa^2 c_p^2 (M + \|f_\lambda\|_K)^2}{m h^d}$.   (3.10)
Thus Proposition 1 follows from Lemma 1. □

4. Approximation error estimate

To estimate the approximation error, we need to consider the convergence of $\bar{L}_h$ as $h \to 0$.

Proposition 2. Under Assumption 1, $V_p \le c_p$, and for any $0 < h \le 1$ we have
    $\|\bar{L}_h - V_p L_K\|_{L^2_{\rho_X} \to L^2_{\rho_X}} \le h^\tau \kappa^2 c_q c_p \big( M_{q+2} + M_{q+\tau} + M_q \big)$.
Here $M_{q+2}$, $M_{q+\tau}$ and $M_q$ are constants independent of h.

Proof. Let $f \in L^2_{\rho_X}$. By Definition 1 we see that $\|\bar{L}_h f - V_p L_K f\|_{L^2_{\rho_X}}$ is bounded by
    $\kappa^2 \int_X |f(x)| \frac{1}{h^d} \int_X W\Big(\frac{x}{h}, \frac{t}{h}\Big) |p(x) - p(t)|\,dt\,d\rho_X(x) + \kappa^2 \int_X |f(x)|\, p(x) \frac{1}{h^d} \int_{\mathbb{R}^d \setminus X} W\Big(\frac{x}{h}, \frac{t}{h}\Big)\,dt\,d\rho_X(x)$.
For the first term, condition (1.4) and Definition 1(ii) give
    $\kappa^2 \int_X |f(x)| \frac{1}{h^d} \int_X W\Big(\frac{x}{h}, \frac{t}{h}\Big) |p(x) - p(t)|\,dt\,d\rho_X(x) \le \kappa^2 c_p c_q \int_X |f(x)| \frac{1}{h^d} \int_{\mathbb{R}^d} \frac{|x - t|^\tau}{(1 + |x - t|/h)^q}\,dt\,d\rho_X(x) \le h^\tau \kappa^2 c_p c_q M_{q+\tau} \|f\|_{L^2_{\rho_X}}$,
where $M_{q+\tau}$ is a constant independent of h and the last inequality is obtained by the Cauchy–Schwarz inequality.

For the second term, separate the domain X into $X_h := \{ x \in X : \inf_{u \in \mathbb{R}^d \setminus X} |u - x| \le h \}$ and its complement $X \setminus X_h$. When $x \in X \setminus X_h$, every $t \in \mathbb{R}^d \setminus X$ satisfies $|t - x| \ge h$, and the decay condition in Definition 1(ii), together with $q > d + 2$ and $h \le 1$, yields
    $\kappa^2 \int_{X \setminus X_h} |f(x)|\, p(x) \frac{1}{h^d} \int_{\mathbb{R}^d \setminus X} W\Big(\frac{x}{h}, \frac{t}{h}\Big)\,dt\,d\rho_X(x) \le h^\tau \kappa^2 c_q c_p M_{q+2} \|f\|_{L^2_{\rho_X}}$,
where $M_{q+2}$ is a constant similar to $M_{q+\tau}$. For the subset $X_h$, we use the Cauchy–Schwarz inequality and obtain
    $\kappa^2 \int_{X_h} |f(x)|\, p(x) \frac{1}{h^d} \int_{\mathbb{R}^d \setminus X} W\Big(\frac{x}{h}, \frac{t}{h}\Big)\,dt\,d\rho_X(x) \le \kappa^2 c_q c_p M_q \sqrt{\rho_X(X_h)}\, \|f\|_{L^2_{\rho_X}}$.

By (1.3), $\rho_X(X_h) \le c_p^2 h^{2\tau}$. Thus, for any $0 < h \le 1$, we have
    $\|\bar{L}_h f - V_p L_K f\|_{L^2_{\rho_X}} \le h^\tau \kappa^2 c_q c_p \big( M_{q+2} + M_{q+\tau} + M_q \big) \|f\|_{L^2_{\rho_X}}$.
This proves the desired result. □

Define the regularizing function independent of h,
    $\tilde{f}_\lambda = \Big( \frac{\lambda}{V_p} I + L_K \Big)^{-1} L_K f_\rho$.

Proposition 3. Under assumptions (1.3) and (1.4), if $L_K^{-r} f_\rho \in L^2_{\rho_X}$ with $0 < r \le 1$, we have
    $\|f_\lambda - \tilde{f}_\lambda\|_{L^2_{\rho_X}} \le C_0\, h^\tau \lambda^{r-1}$,
where $C_0 = \kappa^2 c_q c_p V_p^{-r} \big( M_{q+2} + M_{q+\tau} + M_q \big) \|L_K^{-r} f_\rho\|_{L^2_{\rho_X}}$.

Proof. Observe that
    $f_\lambda - \tilde{f}_\lambda = \lambda \big( \lambda I + \bar{L}_h \big)^{-1} \big( \bar{L}_h - V_p L_K \big) \big( \lambda I + V_p L_K \big)^{-1} f_\rho$.
It follows that
    $\|f_\lambda - \tilde{f}_\lambda\|_{L^2_{\rho_X}} \le \lambda \big\| (\lambda I + \bar{L}_h)^{-1} \big\|\, \big\| \bar{L}_h - V_p L_K \big\|\, \big\| (\lambda I + V_p L_K)^{-1} (V_p L_K)^r \big\|\, V_p^{-r} \big\| L_K^{-r} f_\rho \big\|_{L^2_{\rho_X}} \le h^\tau \lambda^{r-1} C_0$,
where we used $\|(\lambda I + \bar{L}_h)^{-1}\| \le \lambda^{-1}$ and $\|(\lambda I + V_p L_K)^{-1} (V_p L_K)^r\| \le \lambda^{r-1}$. The desired result follows from Proposition 2. □

To get the total error $f_{\mathbf{z},\lambda} - f_\rho$ we also need a bound for the approximation error $\tilde{f}_\lambda - f_\rho$. The following result is well known; it was proved by Smale and Zhou in [16].

Lemma 2. Define $\tilde{f}_\lambda$ as above. If $L_K^{-r} f_\rho \in L^2_{\rho_X}$ with $0 < r \le 1$, then
    $\|\tilde{f}_\lambda - f_\rho\|^2_{L^2_{\rho_X}} + \lambda \|\tilde{f}_\lambda\|^2_K \le \lambda^{2r} V_p^{-2r} \|L_K^{-r} f_\rho\|^2_{L^2_{\rho_X}}$, if $0 < r < \frac{1}{2}$,   (4.1)
and
    $\|\tilde{f}_\lambda - f_\rho\|_K \le \lambda^{r - \frac{1}{2}} V_p^{\frac{1}{2} - r} \|L_K^{-r} f_\rho\|_{L^2_{\rho_X}}$, if $\frac{1}{2} \le r \le 1$.   (4.2)
Moreover, for $0 < r \le 1$, there holds
    $\|\tilde{f}_\lambda - f_\rho\|_{L^2_{\rho_X}} \le \lambda^{r} V_p^{-r} \|L_K^{-r} f_\rho\|_{L^2_{\rho_X}}$.   (4.3)

Combining Proposition 3 with (4.3), the approximation error can be stated as follows.

Proposition 4. Under assumptions (1.3) and (1.4), if $L_K^{-r} f_\rho \in L^2_{\rho_X}$ with $0 < r \le 1$, we have
    $\|f_\lambda - f_\rho\|_{L^2_{\rho_X}} \le \big( h^\tau \lambda^{r-1} + \lambda^{r} \big) \big( C_0 + V_p^{-r} \|L_K^{-r} f_\rho\|_{L^2_{\rho_X}} \big)$,   (4.4)
where $C_0$ is defined as in Proposition 3.

5. Learning rate and conclusion

We are now in a position to derive the learning rate of the Semi-Supervised Learning algorithm (1.2).

Proof of Theorem 1. Combining Propositions 1 and 4 with (3.7), with confidence $1 - \delta$,
    $\|f_{\mathbf{z},\lambda} - f_\rho\|_{L^2_{\rho_X}} \le \|f_{\mathbf{z},\lambda} - f_\lambda\|_{L^2_{\rho_X}} + \|f_\lambda - f_\rho\|_{L^2_{\rho_X}}$
    $\le \frac{2 c_q \kappa^2 (M + \|f_\lambda\|_K)}{\lambda} \Big( \frac{6}{m h^d} + \frac{6 c_q c_p}{\sqrt{m}\, h^{d/2}} \Big) \log\frac{2}{\delta} + \big( h^\tau \lambda^{r-1} + \lambda^{r} \big) \big( C_0 + V_p^{-r} \|L_K^{-r} f_\rho\|_{L^2_{\rho_X}} \big)$
    $\le C \log\frac{2}{\delta}\, \Big(\frac{1}{m}\Big)^{\frac{\tau r}{2\tau r + 1 + d}}$,
provided that $h = \lambda^{1/\tau}$ and $\lambda = m^{-\frac{\tau}{2\tau r + 1 + d}}$, where C is a constant independent of m and δ. □
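To make the exponent in Theorem 1 concrete, the following worked instance plugs one set of example values into the parameter choices as reconstructed above; the specific values $\tau = 1$, $r = \tfrac{1}{2}$, $d = 1$ are only an illustration.

```latex
% Example instance of Theorem 1 (illustrative values; uses the exponents as stated above)
% With \tau = 1, r = \tfrac12, d = 1:
\lambda = m^{-\frac{\tau}{2\tau r + 1 + d}} = m^{-\frac{1}{3}}, \qquad
h = \lambda^{1/\tau} = m^{-\frac{1}{3}}, \qquad
\|f_{\mathbf z,\lambda} - f_\rho\|_{L^2_{\rho_X}}
  = O\!\Big(\log\tfrac{2}{\delta}\; m^{-\frac{\tau r}{2\tau r + 1 + d}}\Big)
  = O\!\Big(\log\tfrac{2}{\delta}\; m^{-\frac{1}{6}}\Big).
```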

We have introduced a Semi-Supervised Learning algorithm with density estimators generated by Parzen window functions. The main difference from weighted Supervised Learning is that large amounts of unlabeled data are employed in the density estimation; the estimated parameters may therefore be less sensitive than in settings where only a few observations are used, such as weighted least squares regression. An error analysis is given for the convergence of the learner to the regression function. The technique we use here is a capacity-independent one. Taking into account the complexity of the hypothesis space, through Rademacher complexity [18] or various covering numbers [20], we expect that the convergence rate can be improved considerably when the hypothesis space is good enough.

References

[1] M.F. Balcan, A. Blum, A discriminative model for semi-supervised learning, J. ACM (2009).
[2] M. Belkin, I. Matveeva, P. Niyogi, Regularization and semi-supervised learning on large graphs, in: COLT, 2004.
[3] O. Chapelle, A. Zien, B. Schölkopf, Semi-Supervised Learning, MIT Press, 2006.
[4] F. Cucker, S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. (N.S.) 39 (2002) 1-49.
[5] F. Cucker, D.X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, Cambridge, 2007.
[6] L. Devroye, G. Lugosi, Combinatorial Methods in Density Estimation, Springer, Heidelberg, 2000.
[7] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000) 1-50.
[8] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, London, 1990.
[9] R.K. Ando, T. Zhang, Two-view feature generation model for semi-supervised learning, in: Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007.
[10] J. Lafferty, X. Zhu, Y. Liu, Kernel conditional random fields: representation and clique selection, in: Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.
[11] S. Mukherjee, D.X. Zhou, Learning coordinate covariances via gradients, J. Mach. Learn. Res. 7 (2006) 519-549.
[12] K. Nigam, T. Mitchell, A. McCallum, S. Thrun, Text classification from labeled and unlabeled documents using EM, Machine Learning 39 (2000) 103-134.
[13] E. Parzen, On the estimation of a probability density function and the mode, Ann. Math. Statist. 33 (1962) 1065-1076.
[14] Z.W. Pan, D.H. Xiang, Q.W. Xiao, D.X. Zhou, Parzen windows for multi-class classification, J. Complexity.
[15] I. Pinelis, Optimum bounds for the distributions of martingales in Banach spaces, Ann. Probab. 22 (1994) 1679-1706.
[16] S. Smale, D.X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2007) 153-172.
[17] B. Taskar, C. Guestrin, D. Koller, Max-margin Markov networks, in: NIPS, 2003.
[18] A.W. van der Vaart, J.A. Wellner, Weak Convergence and Empirical Processes, Springer-Verlag, New York, 1996.
[19] M.P. Wand, M.C. Jones, Kernel Smoothing, Monogr. Statist. Appl. Probab., vol. 60, Chapman & Hall, London, 1995.
[20] D.X. Zhou, Capacity of reproducing kernel spaces in learning theory, IEEE Trans. Inform. Theory.
[21] X.J. Zhou, D.X. Zhou, High order Parzen windows and randomized sampling, Adv. Comput. Math.
[22] X.J. Zhu, Semi-supervised learning literature survey, Computer Sciences Technical Report 1530, University of Wisconsin-Madison.
[23] X.J. Zhu, Semi-supervised learning with graphs, Doctoral Thesis, Carnegie Mellon University, 2005.
