Statistical Analysis of Bayes Optimal Subset Ranking


David Cossock, Yahoo Inc., Santa Clara, CA, USA
Tong Zhang, Yahoo Inc., New York City, USA

Abstract

The ranking problem has become increasingly important in modern applications of statistical methods in automated decision making systems. In particular, we consider a formulation of the statistical ranking problem which we call subset ranking, and focus on the DCG (discounted cumulated gain) criterion that measures the quality of items near the top of the rank-list. Similar to error minimization for binary classification, direct optimization of natural ranking criteria such as DCG leads to non-convex optimization problems that can be NP-hard. Therefore a computationally more tractable approach is needed. We present bounds that relate the approximate optimization of DCG to the approximate minimization of certain regression errors. These bounds justify the use of convex learning formulations for solving the subset ranking problem. The resulting estimation methods are not conventional, in that we focus on the estimation quality in the top portion of the rank-list. We further investigate the generalization ability of these formulations. Under appropriate conditions, the consistency of the estimation schemes with respect to the DCG metric can be derived.

1 Introduction

We consider the general ranking problem, where a computer system is required to rank a set of items based on a given input. In such applications, the system often needs to present only a few top ranked items to the user, so the quality of the system output is determined by the performance near the top of its rank-list. Ranking is especially important in electronic commerce and many internet applications, where personalization and information based decision making are critical to the success of such businesses. The decision making process can often be posed as a problem of selecting top candidates from a set of potential alternatives, leading to a conditional ranking problem. For example, in a recommender system, the computer is asked to choose a few items a user is most likely to buy based on the user's profile and buying history; the selected items are then presented to the user as recommendations. Another important example, one that affects millions of people every day, is internet search, where the user presents a query to the search engine, and the search engine selects a few web-pages that are most relevant to the query from the whole web. The quality of a search engine is largely determined by the top-ranked results it can display on the first page. Internet search is the main motivation of this theoretical study, although the model presented here can be useful for many other applications.

For example, another ranking problem is ad placement in a web-page (either a search result page or a content page) according to revenue-generating potential.

Since for search and many other ranking problems we are only interested in the quality of the top choices, the evaluation of the system output differs from traditional error metrics such as classification error. In this setting, a useful figure of merit should focus on the top portion of the rank-list. To our knowledge, this characteristic of practical ranking problems has not been carefully explored in earlier studies (except for a recent paper [3], which also touched on this issue). The purpose of this paper is to develop theoretical results for converting a ranking problem into convex optimization problems that can be efficiently solved. The resulting formulation focuses on the quality of the top ranked results. The theory can be regarded as an extension of related theory for convex risk minimization formulations for classification, which has drawn much attention recently in the statistical learning literature [4, 18, 9, 8, 4, 5].

We organize the paper as follows. Section 2 discusses earlier work in statistics and machine learning on global and pair-wise ranking. Section 3 introduces the subset ranking problem and defines two ranking metrics: the DCG measure, which we focus on in this paper, and a measure that counts the number of correctly ranked pairs; the latter has been studied recently by several authors in the context of pair-wise preference learning. Section 4 investigates the relationship between subset ranking and global ranking. Section 5 introduces some basic estimation methods for ranking; this paper focuses on the least squares regression based formulation. Section 6 contains the main theoretical results of the paper, where we show that approximate minimization of certain regression errors leads to approximate optimization of the ranking metrics defined earlier. This implies that asymptotically the non-convex ranking problem can be solved using regression methods that are convex. Section 7 presents the regression learning formulation derived from the theoretical results in Section 6; similar methods are currently used to optimize Yahoo's production search engine. Section 8 studies the generalization ability of regression learning, where we focus on the L_2-regularization approach. Together with the earlier theoretical results, we can establish the consistency of regression based ranking under appropriate conditions.

2 Ranking and Pair-wise Preference Learning

The traditional prediction problem in statistical machine learning assumes that we observe an input vector q ∈ Q, so as to predict an unobserved output p ∈ P. In a ranking problem, however, if we assume P = {1,...,m} contains m possible values, then instead of predicting a value in P, we predict a permutation of P that gives an optimal ordering of P. That is, if we denote by P! the set of permutations of P, then the goal is to predict an output in P!. There are two fundamental issues: first, how to measure the quality of a ranking; second, how to learn a good ranking procedure from historical data. At first sight, it may seem that we can simply cast the ranking problem as an ordinary prediction problem whose output space is P!. However, the number of permutations in P! is m!, which can be extremely large even for small m, so it is not practical to solve the ranking problem directly without imposing structure on the search space. Moreover, in practice, given a training point q ∈ Q, we are generally not given an optimal permutation in P! as the observed output. Instead, we may observe another form of output from which the optimal ranking can typically be inferred but which may contain extra information as well. The training procedure should take advantage of such information.

A common idea for generating an optimal permutation in P! is to use a scoring function that takes a pair (q, p) ∈ Q × P and maps it to a real number r(q, p). For each q, the predicted permutation of P induced by this scoring function is defined as the ordering of p ∈ P sorted by non-increasing value of r(q, p). This is the method we focus on in this paper.

Although the ranking problem has received considerable interest in machine learning recently, due to its important applications in modern automated information processing systems, it has not been extensively studied in the traditional statistical literature. A relevant statistical model is ordinal regression [20]. In this model, we are still interested in predicting a single output. We redefine the input space as X = Q × P, and for each x ∈ X we observe an output value y ∈ Y. Moreover, we assume that the values in Y = {1,...,L} are ordered, and that the cumulative probability P(y ≤ j | x) (j = 1,...,L) has the form γ(P(y ≤ j | x)) = θ_j + g_β(x). In this model, both γ(·) and g_β(·) have known functional forms, and θ and β are model parameters. Note that the ordinal regression model induces a stochastic preference relationship on the input space X: consider two samples (x_1, y_1) and (x_2, y_2) on X × Y; we say x_1 ≺ x_2 if and only if y_1 < y_2. This leads to a classification problem that takes a pair of inputs x_1 and x_2 and tries to predict whether x_1 ≺ x_2 or not (that is, whether the corresponding outputs satisfy y_1 < y_2 or not). In this formulation, the optimal prediction rule that minimizes classification error is induced by the ordering of g_β(x) on X, because if g_β(x_1) < g_β(x_2), then P(y_1 < y_2) > 0.5 (based on the ordinal regression model), which is consistent with the Bayes rule. Motivated by this observation, an SVM ranking method was proposed in [16]. The idea is to reformulate ordinal regression as a model for learning a preference relationship on the input space X, which can be learned using pair-wise classification. Given the parameter β̂ learned from training data, the scoring function is simply r(q, p) = g_β̂(x).

The pair-wise preference learning model has become a major trend for ranking in the machine learning literature. For example, in addition to the SVM approach, a similar method based on AdaBoost was proposed in [13], and the idea was also used in optimizing the Microsoft web-search system [7]. A number of researchers have worked on the theoretical analysis of ranking using the pair-wise ranking model. The criterion is to minimize the error of pair-wise preference prediction when two points X_1 and X_2 are drawn randomly from the input space X. That is, given a scoring function g : X → R, the ranking loss is

E_{(X_1,Y_1)} E_{(X_2,Y_2)} [ I(Y_1 < Y_2) I(g(X_1) ≥ g(X_2)) + I(Y_1 > Y_2) I(g(X_1) ≤ g(X_2)) ]   (1)
  = E_{X_1,X_2} [ P(Y_1 < Y_2 | X_1, X_2) I(g(X_1) ≥ g(X_2)) + P(Y_1 > Y_2 | X_1, X_2) I(g(X_1) ≤ g(X_2)) ],

where I(·) denotes the indicator function. For binary output y ∈ {0, 1}, it is known that this metric is equivalent, up to a scaling, to the AUC measure (area under the ROC curve) for binary classifiers, and it is closely related to the Mann-Whitney-Wilcoxon statistic [15]. In the literature, theoretical analysis has focused mainly on this ranking criterion (for example, see [1, 2, 9, 22]).

The pair-wise preference learning model has some limitations. First, although the criterion in (1) measures the global pair-wise ranking quality, it is not the best metric for evaluating practical ranking systems. In most applications, a system does not need to rank all data-pairs, but only a subset of them at a time; moreover, typically only the top few positions of the rank-list are of importance.
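To make the pair-wise criterion in (1) concrete, the following minimal sketch computes its empirical version for a scoring function on a labeled sample. The function name and the convention of counting score ties as errors are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def pairwise_ranking_loss(scores, labels):
    """Empirical version of the pair-wise ranking loss (1): a pair is an error
    when the item with the larger label does not receive a strictly larger score
    (ties in the scores are counted as errors)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(scores)
    errors, pairs = 0, 0
    for i in range(n):
        for l in range(i + 1, n):
            if labels[i] == labels[l]:
                continue  # pairs with equal labels carry no preference
            pairs += 1
            hi, lo = (i, l) if labels[i] > labels[l] else (l, i)
            if scores[hi] <= scores[lo]:
                errors += 1
    return errors / max(pairs, 1)

# toy usage: for binary labels, 1 - loss is the empirical AUC
print(pairwise_ranking_loss([0.9, 0.2, 0.4, 0.8], [1, 0, 0, 1]))
```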
Another issue with the pair-wise preference learning model is that the scoring function is usually learned by minimizing a convex relaxation of the pair-wise classification error, similar to large margin classification. However, if the preference relationship is stochastic, then an important question is whether such a learning algorithm leads to a Bayes optimal ranking function in the large sample limit.

Unfortunately this is difficult to analyze for general risk minimization formulations if the decision rule is induced by a single-variable scoring function of the form r(x). The problem of Bayes optimality in the pair-wise learning model was partially investigated in [9], but with a decision rule of the general form r(x_1, x_2): we predict x_1 ≺ x_2 if r(x_1, x_2) < 0. To our knowledge, this method is not widely used in practice because a naive application can lead to contradictions: we may predict r(x_1, x_2) < 0, r(x_2, x_3) < 0, and r(x_3, x_1) < 0. Therefore, in order to use such a method effectively for ranking, there needs to be a mechanism for resolving such contradictions. One possibility is to define a scoring function f(x) = Σ_{x'} r(x, x') and rank the data accordingly; another is to use a sorting method (such as quick-sort) directly with the comparison function given by r(x_1, x_2). However, in order to show that such contradiction resolution methods are well behaved asymptotically, it is necessary to analyze the corresponding error, and we are not aware of any study of such error analysis.

3 Subset Ranking Model

The global pair-wise preference learning model in Section 2 has some limitations. In this paper, we describe a model more relevant to practical ranking systems such as web-search. We first describe the model, and then use search as an example to illustrate it.

3.1 Problem definition

Let X be the space of observable features, and let Z be the space of variables that are not necessarily directly used in the deployed system. Denote by S the set of all finite subsets of X, where a subset may possibly contain redundant elements. Let y be a non-negative real-valued variable that corresponds to the quality of an x ∈ X. Assume also that we are given a (measurable) feature map F that takes each z ∈ Z and produces a finite subset F(z) = S = {x_1,...,x_m} ∈ S. Note that the order of the items in the set is of no importance; the numerical subscripts are for notational purposes only, so that permutations can be more conveniently defined.

In subset ranking, we randomly draw a variable z ∈ Z according to some underlying distribution on Z. We then create a finite subset F(z) = S = {x_1,...,x_m} ∈ S consisting of feature vectors x_j ∈ X, and at the same time a set of grades {y_j} = {y_1,...,y_m} such that each y_j corresponds to x_j. Whether the size m of the set should be a random variable is of no importance in our analysis; in this paper we assume that it is fixed, for simplicity. Based on the observed subset S = {x_1,...,x_m}, the system is required to output an ordering (ranking) of the items in the set. Using our notation, this ordering can be represented as a permutation J = [j_1,...,j_m] of [1,...,m]. Our goal is to produce a permutation such that y_{j_i} is in decreasing order for i = 1,...,m.

In practical applications, each available position i can be associated with a weight c_i that measures the importance of that position. Given the grades y_j (j = 1,...,m), a very natural measure of the quality of the rank-list J = [j_1,...,j_m] is the weighted sum

DCG(J, [y_j]) = Σ_{i=1}^m c_i y_{j_i}.

We assume that {c_i} is a pre-defined sequence of non-increasing non-negative discount factors that may or may not depend on S. This metric, described in [17] as DCG (discounted cumulated gain), is one of the main metrics used in the evaluation of internet search systems, including the production system of Yahoo and that of Microsoft [7].

In this context, a typical choice is c_i = 1/log_2(1 + i) for i ≤ k and c_i = 0 for i > k, for some k. One may also use other choices, such as letting c_i be the probability that a user looks at (or clicks) the result at position i. Although introduced in the context of web-search, the DCG criterion is clearly natural for many other ranking applications such as recommender systems. Moreover, by choosing a decaying sequence of c_i, this measure naturally focuses on the quality of the top portion of the rank-list. This is in contrast with the pair-wise error criterion in (1), which does not distinguish the top portion of the rank-list from the bottom portion.

For the DCG criterion, our goal is to train a ranking function r that takes a subset S ∈ S as input and produces an output permutation J = r(S) such that the expected DCG is as large as possible:

DCG(r) = E_S DCG(r, S),   (2)

where

DCG(r, S) = Σ_{i=1}^m c_i E_{y_{j_i}|(x_{j_i},S)} y_{j_i},   with [j_1,...,j_m] = r(S).   (3)

The global pair-wise preference learning metric (1) can be adapted to the subset ranking setting. We may consider the following weighted total of correctly ranked pairs minus incorrectly ranked pairs:

T(J, [y_j]) = (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} (y_{j_i} - y_{j_l}).

If the output label y takes binary values and the subset S = X is global (we may assume that it is finite), then this metric is equivalent to (1). Although we pay special attention to the DCG metric, we shall also include some analysis of the T criterion for completeness. Similar to (2) and (3), we can define the following quantities:

T(r) = E_S T(r, S),   (4)

where

T(r, S) = (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} ( E_{y_{j_i}|(x_{j_i},S)} y_{j_i} - E_{y_{j_l}|(x_{j_l},S)} y_{j_l} ),   with [j_1,...,j_m] = r(S).   (5)

Similar to the concept of the Bayes classifier in classification, we can define the Bayes ranking function that optimizes the DCG and T measures. Based on the conditional formulations in (3) and (5), we have the following result.

Theorem 1 Given a set S ∈ S, for each x_j ∈ S define the Bayes-scoring function as

f_B(x_j, S) = E_{y_j|(x_j,S)} y_j.

An optimal Bayes ranking function r_B(S) that maximizes (5) returns a rank-list J = [j_1,...,j_m] such that f_B(x_{j_i}, S) is in descending order: f_B(x_{j_1}, S) ≥ f_B(x_{j_2}, S) ≥ ... ≥ f_B(x_{j_m}, S). An optimal Bayes ranking function r_B(S) that maximizes (3) returns a rank-list J = [j_1,...,j_m] such that c_k > c_{k'} implies f_B(x_{j_k}, S) ≥ f_B(x_{j_{k'}}, S).

Proof Consider any k, k' ∈ {1,...,m}. Define J' = [j'_1,...,j'_m], where j'_i = j_i for i ≠ k, k', j'_k = j_{k'}, and j'_{k'} = j_k. We consider the T criterion first, and let k' = k+1. It is easy to check that

T(J', S) - T(J, S) = 4 ( f_B(x_{j_{k+1}}, S) - f_B(x_{j_k}, S) ) / (m(m-1)).

Therefore T(J, S) ≥ T(J', S) implies that f_B(x_{j_k}, S) ≥ f_B(x_{j_{k+1}}, S); since an optimal J satisfies this for every adjacent pair, f_B must be in descending order along J. Now consider the DCG criterion. We have

DCG(J, S) - DCG(J', S) = (c_k - c_{k'}) ( f_B(x_{j_k}, S) - f_B(x_{j_{k'}}, S) ).

Now c_k > c_{k'} and DCG(J, S) ≥ DCG(J', S) imply f_B(x_{j_k}, S) ≥ f_B(x_{j_{k'}}, S).

The result indicates that the optimal ranking can be induced by a single-variable scoring function of the form r(x, S) : X × S → R with x ∈ S.

3.2 Web-search example

As an example of the subset ranking model, we consider the web-search problem. In this application, a user submits a query q and expects the search engine to return a rank-list of web-pages {p_j} such that a more relevant page is placed before a less relevant page. In a typical internet search engine, the system takes the query and uses a simple ranking formula for initial filtering, which limits the set of web-pages to an initial pool {p_j} of size m. After this initial ranking, the system goes through a more complicated second-stage ranking process which reorders the pool. This critical stage is the focus of this paper. At this step, the system takes the query q, and possibly information from additional resources, to generate a feature vector x_j for each page p_j in the initial pool. The feature vector can encode various types of information, such as the length of the query q, the position of p_j in the initial pool, the number of query terms that match the title of p_j, the number of query terms that match the body of p_j, and so on. The set of all possible feature vectors x_j is X; the ranking algorithm only observes a list of feature vectors {x_1,...,x_m} with each x_j ∈ X.

A human editor is presented with a pair (q, p_j) and assigns a score s_j on a scale, e.g., 1-5 (least relevant to highly relevant). The corresponding target value y_j is defined as a transformation of s_j that maps the grade into the interval [0, 1]; for example, the formula (2^{s_j} - 1)/(2^5 - 1) is used in [7], while Yahoo uses a different transformation based on empirical user surveys. Another possible choice is to normalize, multiplying each y_j by a factor such that the optimal DCG is no more than one.
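As a small illustration of the quantities defined in this section, the sketch below computes the DCG of the rank-list induced by a scoring function, using the common discounts c_i = 1/log_2(1 + i) truncated at position k, together with the grade transformation quoted above from [7]. The helper names and the toy data are our own assumptions.

```python
import numpy as np

def grade_to_target(s, s_max=5):
    """Map an editorial grade s in {1,...,s_max} into [0, 1] via (2^s - 1)/(2^s_max - 1)."""
    return (2.0 ** s - 1.0) / (2.0 ** s_max - 1.0)

def dcg(permutation, y, k=10):
    """DCG(J, [y_j]) = sum_i c_i * y_{j_i} with c_i = 1/log2(1 + i) for i <= k, else 0."""
    total = 0.0
    for i, j in enumerate(permutation[:k], start=1):
        total += y[j] / np.log2(1.0 + i)
    return total

def rank_by_scores(scores):
    """Induced ranking: items sorted by non-increasing score."""
    return list(np.argsort(-np.asarray(scores)))

grades = np.array([3, 5, 1, 2, 4])            # editorial grades for one subset S
targets = grade_to_target(grades)              # the y_j values
scores = np.array([0.4, 0.9, 0.1, 0.2, 0.7])   # f(x_j, S) from some scoring function
print(dcg(rank_by_scores(scores), targets, k=3))
print(dcg(rank_by_scores(targets), targets, k=3))  # DCG of the optimal ordering
```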

4 Some Computational Aspects of Subset Ranking

Due to the dependency of the conditional probability of y on S, and thus of the optimal ranking function on S, a complete solution of the subset ranking problem can be difficult when m is large. In general, without further assumptions, the optimal Bayes ranking function ranks the items using the Bayes scoring function f_B(x, S) for each x ∈ S. The explicit S dependency of f_B(x, S) is one of the differences that distinguish subset ranking from global ranking.

If the size m of S is small, then we may simply represent S as a feature vector [x_1,...,x_m] (although this may not be the best representation), so that we can learn a function of the form f_B(x_j, S) = f([x_j, x_1,...,x_m]). Therefore, by redefining x̃_j = [x_j, x_1,...,x_m] ∈ X^{m+1}, we can remove the subset dependency by embedding the original problem into a higher dimensional space. In the general case when m is large, this approach is not practical. Instead of using the whole set S as a feature, we can project S into a lower dimensional space using a feature map g(·), so that f_B(x, S) ≈ f(x, g(S)). By introducing such a set dependent feature vector g(S), we can remove the set dependency by incorporating g(S) into x: this can be achieved by simply redefining x as x̃ = [x, g(S)]. In this way, f_B(x, S) can be approximated by a function of the form f(x̃).

If the subsets are identical, then subset ranking is equivalent to global ranking. In the more general case where subsets are not identical, the reduction from set-dependent local ranking to set-independent global ranking can be complicated if we do not assume any underlying structure on the problem (we shall discuss such a structure later). However, one may ask: if we only use a set-independent function of the form f(x) as the scoring function, how well can it approximate the Bayes scoring function f_B(x, S), and is it easy to compute such a function f(x)? If the subsets are disjoint (or nearly disjoint), then the effect of f_B(x, S) can be achieved by a global scoring function of the form f(x) exactly (or almost exactly), because x determines S. This can be a good approximation for practical problems, where the feature vectors for different subsets (e.g., queries in web-search) usually do not overlap.

If the subsets overlap significantly but are not exactly the same, the problem can be computationally difficult. To see this, consider for simplicity that X is finite, that each subset contains only two elements, and that one element is (deterministically) preferred over the other. In the subset learning model, such a preference relationship x ≺ x' between two elements x, x' ∈ X can be denoted by a directed edge from x to x'. In this setting, finding a global scoring function that approximates the optimal set-dependent Bayes scoring rule is equivalent to finding a maximum subgraph that is acyclic. In general, this problem is computationally difficult: it is known to be NP-hard (an application of similar arguments in ranking can be found in [1, 3]) as well as APX-hard [11]. The class APX consists of problems having an approximation to within 1+c of the optimum for some c. A polynomial time approximation scheme (PTAS) is an algorithm which runs in polynomial time in the instance size (but not necessarily in poly(1/ε)) and returns a solution approximate to within 1+ε for any given ε > 0. If any APX-hard problem admits a PTAS then P=NP.

The above argument implies that without any assumptions, the reduction of the set-dependent Bayes optimal scoring function f_B(x, S) to a set independent function of the form f(x) is difficult. If we are able to incorporate appropriate set dependent features into x, or if the sets do not overlap significantly, then the reduction is computationally feasible. In the ideal case, we can introduce the following definition.

Definition 1 If for every S ∈ S and x, x' ∈ S, we have f_B(x, S) > f_B(x', S) if and only if f(x) > f(x'), then we say that f is an optimal rank preserving function.

Clearly, an optimal rank preserving function may not always exist (without using set-dependent features). As a simple example, assume that X = {a, b, c} has three elements, with m = 2, c_1 = 1 and c_2 = 0 in the DCG definition. We observe {y_1 = 1, y_2 = 0} for the set {x_1 = a, x_2 = b}, {y_1 = 1, y_2 = 0} for the set {x_1 = b, x_2 = c}, and {y_1 = 1, y_2 = 0} for the set {x_1 = c, x_2 = a}. If an optimal rank preserving function f existed, then by definition we would have f(a) > f(b), f(b) > f(c), and f(c) > f(a), which is impossible.

Under appropriate assumptions, the optimal rank preserving function exists. The following result provides a sufficient condition.

Proposition 1 Assume that for each x_j we observe y_j = a(S) y'_j + b(S), where a(S) ≥ 0 and b(S) are normalization/shifting factors that may depend on S, and {y'_j} is a set of random variables that satisfy

P({y'_j} | S) = E_ξ ∏_{j=1}^m P(y'_j | x_j, ξ),

where ξ is a hidden random variable independent of S. Then E_{y'_j|(x_j,S)} y'_j = E_{y'_j|x_j} y'_j. That is, the conditional expectation f(x) = E_{y'|x} y' is an optimal rank preserving function.

Proof Observe that E_{y_j|(x_j,S)} y_j = a(S) E_{y'_j|(x_j,S)} y'_j + b(S). Therefore the scoring functions E_{y_j|(x_j,S)} y_j and E_{y'_j|(x_j,S)} y'_j lead to identical rankings. Moreover,

E_{y'_j|(x_j,S)} y'_j = E_ξ ∫ y'_j dP(y'_j | x_j, ξ),

which does not depend on S beyond x_j, and therefore equals ∫ y'_j dP(y'_j | x_j) = E_{y'_j|x_j} y'_j. This proves the claim.

This result justifies using an appropriately defined feature function to remove set-dependency. If y'_j is a deterministic function of x_j and ξ, then the assumption (and hence the result) always holds, which implies the optimality of the set-independent conditional expectation. In this case, the optimal global scoring rule gives the optimal Bayes rule for subset ranking. We also note that this equivalence does not require the grade y_j to be independent of S.

In web-search, the model in Proposition 1 has a natural interpretation. Consider a pool of human editors indexed by ξ. For each query q, we randomly pick an editor ξ to grade the set of pages p_j to be ranked, and assume that the grade the editor gives to each page p_j depends only on the pair x_j = (q, p_j). In this case, Proposition 1 can be applied to show that the features x_j are sufficient to determine the optimal Bayes rule.

Proposition 1 (and the discussion thereafter) suggests that regression based learning of the conditional expectation E_{y|x} y is asymptotically optimal under assumptions that are reasonable. We call a method that learns such a conditional expectation E_{y|x} y, or a transformation of it, a regression based approach; this differs from the pair-wise preference learning methods used in earlier work. There are two advantages to using regression: first, the computational complexity is at most O(m) (it can be sub-linear in m with appropriate importance subsampling schemes) instead of O(m^2); second, we are able to prove the consistency of such methods under reasonable assumptions. As discussed at the end of Section 2, this issue is more complicated for pair-wise methods. Furthermore, as we will discuss in the next section, some advantages of pair-wise learning can be incorporated into the regression approach by using set-dependent features.

5 Risk Minimization based Estimation Methods

From the previous section, we know that the optimal scoring function is the conditional expectation of the grades y. We now investigate some basic estimation methods for conditional expectation learning.

5.1 Relation to multi-category classification

The subset ranking problem is a generalization of multi-category classification. In the latter case, we observe an input x_0 and are interested in classifying it into one of m classes. Let the output value be k ∈ {1,...,m}. We encode the input x_0 into m feature vectors {x_1,...,x_m}, where x_i = [0,...,0, x_0, 0,...,0] with the i-th component being x_0 and the other components zero. We then encode the output k into m values {y_j} such that y_k = 1 and y_j = 0 for j ≠ k. In this setting, we try to find a scoring function f such that f(x_k) > f(x_j) for j ≠ k. Consider the DCG criterion with c_1 = 1 and c_j = 0 for j > 1: the DCG then equals one exactly when the classification is correct, so maximizing DCG corresponds to minimizing classification error.

Given any multi-category classification algorithm, one may use it to solve the subset ranking problem as follows. Consider a sample S = {x_1,...,x_m} as input, with a set of outputs {y_j}. We randomly draw k from 1 to m according to the distribution y_k / Σ_j y_j. We then form another sample with weight Σ_j y_j, which has the vector S̄ = [x_1,...,x_m] (where order is important) as input, and label y = k ∈ {1,...,m} as output. This changes the problem formulation into multi-category classification. Since the conditional expectation can be expressed as

E_{y_k|(x_k,S)} y_k = P(y = k | S̄) E_{{y_j}|S} Σ_j y_j,

the order induced by the scoring function E_{y_k|(x_k,S)} y_k is the same as that induced by P(y = k | S̄). Therefore a multi-category classification solver that estimates conditional probability can be used to solve the subset ranking problem. In particular, consider a risk minimization based multi-category classification solver for the m-class problem [8, 5] of the form

f̂ = arg min_{f∈F} Σ_{i=1}^n Φ(f(X_i), Y_i),

where (X_i, Y_i) are training points with Y_i ∈ {1,...,m}, F is a vector function class taking values in R^m, and Φ is some risk functional. For ranking with training points (S̄_i, {y_{i,1},...,y_{i,m}}) and S̄_i = [x_{i,1},...,x_{i,m}], the corresponding learning method becomes

f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{j=1}^m y_{i,j} Φ(f(S̄_i), j),

where the function space F contains a subset of functions {f(S̄): X^m → R^m} of the form f(S̄) = [f(x_1, S),...,f(x_m, S)], and S = {x_1,...,x_m} is the corresponding unordered set. An example is maximum entropy (multi-category logistic regression), which has the loss function Φ(f(S̄), j) = -f(x_j, S) + ln Σ_{k=1}^m e^{f(x_k, S)}.

5.2 Regression based learning

Since in ranking problems y_{i,j} can take values other than 0 or 1, we can have more general formulations than multi-category classification. In particular, we may consider variations of the following regression based learning method to train a scoring function in F ⊆ {X × S → R}:

f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{j=1}^m φ(f(x_{i,j}, S_i), y_{i,j}),   S_i = {x_{i,1},...,x_{i,m}} ∈ S,   (6)

where we assume that φ(a, b) = φ_0(a) + φ_1(a) b + φ_2(b).

The estimation formulation is decoupled across the elements x_{i,j} of a subset S_i, which makes the problem easier to solve. In this method, each training point ((x_{i,j}, S_i), y_{i,j}) is treated as a single sample (for i = 1,...,n and j = 1,...,m). The population version of the risk function is

E_S Σ_{x∈S} [ φ_0(f(x, S)) + φ_1(f(x, S)) E_{y|(x,S)} y + E_{y|(x,S)} φ_2(y) ].

This implies that the optimal population solution is a function that minimizes φ_0(f(x, S)) + φ_1(f(x, S)) E_{y|(x,S)} y, which is a function of E_{y|(x,S)} y. Therefore the estimation method in (6) leads to an estimator of the conditional expectation for a reasonable choice of φ_0(·) and φ_1(·). A simple example is the least squares method, where we pick φ_0(a) = a^2, φ_1(a) = -2a and φ_2(b) = b^2. That is, the learning method (6) becomes least squares estimation:

f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{j=1}^m (f(x_{i,j}, S_i) - y_{i,j})^2.   (7)

This method, and some essential variations which we will introduce later, will be the focus of our analysis.

It was shown in [8] that the only loss function with the conditional expectation as the minimizer (for an arbitrary conditional distribution of y) is least squares. However, for practical purposes, we only need to estimate a monotonic transformation of the conditional expectation, and for this purpose we can use additional loss functions of the form (6). In particular, let φ_0(a) be an arbitrary convex function such that φ'_0(a) is a monotone increasing function of a; then we may simply take φ(a, b) = φ_0(a) - ab in (6). The optimal population solution is uniquely determined by φ'_0(f(x, S)) = E_{y|(x,S)} y. A simple example is φ_0(a) = a^4/4, for which the population optimal solution is f(x, S) = (E_{y|(x,S)} y)^{1/3}. Clearly such a transformation does not affect the ranking.

Moreover, in many ranking problems the range of y is bounded. It is known that additional loss functions can then be used for computing the conditional expectation. As a simple example, if we assume that y ∈ [0, 1], then the following modified least squares can be used:

f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{j=1}^m [ (1 - y_{i,j}) max(0, f(x_{i,j}, S_i))^2 + y_{i,j} max(0, 1 - f(x_{i,j}, S_i))^2 ].   (8)

One may replace this with other loss functions used for binary classification that estimate conditional probability, such as those discussed in [9]. Although such general formulations might be interesting for certain applications, their advantages over the simpler least squares loss (7) are not completely certain, and they are more complicated to deal with. Therefore we will not consider such general formulations in this paper, but rather focus on adapting the least squares method (7) to ranking problems. As we shall see, non-trivial modifications of (7) are necessary to optimize system performance near the top of the rank-list.
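The following minimal sketch spells out the per-item losses behind (7) and (8) and fits a linear scoring function by plain least squares. The linear model, the synthetic data, and the solver are illustrative assumptions; the paper does not prescribe a particular function class at this point.

```python
import numpy as np

def least_squares_loss(f_vals, y):
    """Per-subset contribution to (7): sum_j (f(x_j, S) - y_j)^2."""
    return np.sum((f_vals - y) ** 2)

def modified_least_squares_loss(f_vals, y):
    """Per-subset contribution to (8), valid for y in [0, 1]:
    sum_j (1 - y_j) * max(0, f_j)^2 + y_j * max(0, 1 - f_j)^2."""
    return np.sum((1 - y) * np.maximum(0.0, f_vals) ** 2
                  + y * np.maximum(0.0, 1.0 - f_vals) ** 2)

# fit f(x) = x^T beta with the plain least squares objective (7),
# pooling all (x_{i,j}, y_{i,j}) pairs across subsets
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # stacked feature vectors x_{i,j}
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + 0.05 * rng.normal(size=200)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(least_squares_loss(X @ beta, y))
y01 = np.clip(y, 0.0, 1.0)                        # (8) assumes grades in [0, 1]
print(modified_least_squares_loss(X @ beta, y01))
```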

5.3 Pair-wise preference learning

A popular idea in the recent machine learning literature is to pose the ranking problem as a pair-wise preference relationship learning problem (see Section 2). Using this idea, the scoring function for subset ranking can be trained by the following method:

f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{(j,j')∈E_i} φ( f(x_{i,j}, S_i), f(x_{i,j'}, S_i); y_{i,j}, y_{i,j'} ),   (9)

where each E_i is a subset of {1,...,m} × {1,...,m} such that y_{i,j} < y_{i,j'}. For example, we may use a non-increasing monotone function φ_0 and let φ(a_1, a_2; b_1, b_2) = φ_0((a_2 - a_1) - (b_2 - b_1)) or φ(a_1, a_2; b_1, b_2) = (b_2 - b_1) φ_0(a_2 - a_1). Example loss functions include the SVM loss φ_0(x) = max(0, 1 - x) and the AdaBoost loss φ_0(x) = exp(-x) (see [13, 16, 3]). The approach works well if the ranking problem is noise-free (that is, y_{i,j} is deterministic). However, one difficulty with this approach is that if y_{i,j} is stochastic, then the corresponding population estimator from (9) may not be Bayes optimal, unless a more complicated scheme such as [9] is used. It would be interesting to investigate the error of such an approach, but the analysis is beyond the scope of this paper.

One argument used by the advocates of the pair-wise learning formulation is that we do not have to learn an absolute grade judgment (or its expectation), but only the relative judgment that one item is better than another. In essence, this means that for each subset S, if we shift each judgment by a constant, the ranking is not affected. If invariance with respect to a set-dependent judgment shift is a desirable property, then it can be incorporated into the regression based model [6]. For example, similar to Proposition 1, we may introduce an explicit set dependent shift feature (which is rank-preserving) into (6):

f̂ = arg min_{f∈F} Σ_{i=1}^n min_{b_i∈R} Σ_{j=1}^m φ( f(x_{i,j}, S_i) + b_i, y_{i,j} ).

In particular, for least squares, we have the following method:

f̂ = arg min_{f∈F} Σ_{i=1}^n min_{b_i∈R} Σ_{j=1}^m ( f(x_{i,j}, S_i) + b_i - y_{i,j} )^2.   (10)

More generally, we may introduce more sophisticated set dependent features and hierarchical models into the regression formulation, and obtain effects that may not even be easily incorporated into pair-wise models.

6 Convex Surrogate Bounds

The subset ranking problem defined in Section 3 is combinatorial in nature and very difficult to solve directly. Since the optimal Bayes ranking rule is given by the conditional expectation, in Section 5 we discussed various formulations for estimating the conditional expectation; in particular, we are interested in least squares regression based methods. In this context, a natural question is: if a scoring function approximately minimizes the regression error, how well does it optimize ranking metrics such as DCG or T? This section provides theoretical results that relate the optimization of the ranking metrics defined in Section 3 to the minimization of regression errors.

This allows us to design appropriate convex learning formulations that improve on the simple least squares methods in (7) and (10).

A scoring function f(x, S) maps each x ∈ S to a real-valued score. It induces a ranking function r_f, which ranks the elements {x_j} of S in descending order of f(x_j, S). We are interested in bounding the DCG performance of r_f compared with that of the optimal Bayes ranking function r_B. The following results can be regarded as extensions of Theorem 1 that motivate regression based learning.

Theorem 2 Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. Consider a pair p, q ∈ [1, ∞] such that 1/p + 1/q = 1. We have the following relationship for each S = {x_1,...,x_m}:

DCG(r_B, S) - DCG(r_f, S) ≤ 2 ( Σ_{i=1}^m c_i^p )^{1/p} ( Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|^q )^{1/q}.

Proof Let r_f(S) = J = [j_1,...,j_m], and let J^{-1} = [l_1,...,l_m] be its inverse permutation. Similarly, let r_B(S) = J_B = [j*_1,...,j*_m], and let J_B^{-1} = [l*_1,...,l*_m] be its inverse permutation. We have

DCG(r_f, S) = Σ_{i=1}^m c_i f_B(x_{j_i}, S) = Σ_{i=1}^m c_{l_i} f_B(x_i, S)
  = Σ_{i=1}^m c_{l_i} f(x_i, S) + Σ_{i=1}^m c_{l_i} (f_B(x_i, S) - f(x_i, S))
  ≥ Σ_{i=1}^m c_{l*_i} f(x_i, S) + Σ_{i=1}^m c_{l_i} (f_B(x_i, S) - f(x_i, S))
  = Σ_{i=1}^m c_{l*_i} f_B(x_i, S) + Σ_{i=1}^m c_{l*_i} (f(x_i, S) - f_B(x_i, S)) + Σ_{i=1}^m c_{l_i} (f_B(x_i, S) - f(x_i, S))
  ≥ DCG(r_B, S) - Σ_{i=1}^m c_{l*_i} (f_B(x_i, S) - f(x_i, S))_+ - Σ_{i=1}^m c_{l_i} (f(x_i, S) - f_B(x_i, S))_+
  ≥ DCG(r_B, S) - 2 ( Σ_{i=1}^m c_i^p )^{1/p} ( Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|^q )^{1/q},

where we used the notation (z)_+ = max(0, z). The first inequality holds because r_f sorts the items in descending order of f, so that assigning the non-increasing discounts c according to J maximizes Σ_i c_{l_i} f(x_i, S) over all permutations; the last step applies Hölder's inequality to each of the two error sums. Rearranging proves the bound.
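As a quick numerical illustration of Theorem 2 with p = q = 2, the sketch below draws a random conditional-expectation vector f_B and a perturbed estimate f, and compares the realized DCG gap with the regression-error bound 2 (Σ_i c_i^2)^{1/2} (Σ_j (f - f_B)^2)^{1/2}. The random instance and the truncated log-discount choice are our own assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 20, 5
c = np.array([1.0 / np.log2(1 + i) if i <= k else 0.0 for i in range(1, m + 1)])

def dcg_of(scores, values):
    """DCG of the rank-list induced by `scores`, evaluated on `values` (here f_B)."""
    order = np.argsort(-scores)
    return float(c @ values[order])

for _ in range(5):
    f_B = rng.uniform(0, 1, size=m)             # Bayes scoring function on S
    f = f_B + rng.normal(scale=0.1, size=m)     # approximate regression estimate
    gap = dcg_of(f_B, f_B) - dcg_of(f, f_B)     # DCG(r_B, S) - DCG(r_f, S)
    bound = 2.0 * np.sqrt(np.sum(c ** 2)) * np.sqrt(np.sum((f - f_B) ** 2))
    assert gap <= bound + 1e-12
    print(f"gap = {gap:.4f}  <=  bound = {bound:.4f}")
```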

Theorem 2 shows that the DCG criterion can be bounded through the regression error. Although the theorem applies to any pair p, q such that 1/p + 1/q = 1, the most useful case is p = q = 2, because minimizing Σ_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2 can then be achieved directly by the least squares regression in (7). If the regression error goes to zero, the resulting ranking converges to the optimal DCG. Similarly, we can show the following result for the T criterion.

Theorem 3 Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. We have the following relationship for each S = {x_1,...,x_m}:

T(r_B, S) - T(r_f, S) ≤ 4 [ (1/m) Σ_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2 ]^{1/2}.

Proof Let r_f(S) = J = [j_1,...,j_m] and r_B(S) = J_B = [j*_1,...,j*_m]. We have

T(r_f, S) = (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} ( f_B(x_{j_i}, S) - f_B(x_{j_l}, S) )
  ≥ (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} ( f(x_{j_i}, S) - f(x_{j_l}, S) ) - (2/m) Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
  ≥ (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} ( f(x_{j*_i}, S) - f(x_{j*_l}, S) ) - (2/m) Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
  ≥ (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} ( f_B(x_{j*_i}, S) - f_B(x_{j*_l}, S) ) - (4/m) Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
  = T(r_B, S) - (4/m) Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
  ≥ T(r_B, S) - 4 [ (1/m) Σ_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2 ]^{1/2}.

The first and third inequalities replace each f_B (respectively f) by f (respectively f_B) and bound the change of every pair by the two corresponding absolute errors; since each index appears in at most m-1 pairs, each correction is at most (2/m) Σ_j |f(x_j, S) - f_B(x_j, S)|. The second inequality uses the fact that r_f orders the items by non-increasing f, so the pair-wise sum of f-differences can only decrease when the items are re-ordered by r_B. The last step is the Cauchy-Schwarz inequality.

The above approximation bounds imply that least squares regression can be used to learn the optimal ranking functions: the approximation error converges to zero when f converges to f_B in L_2. In general, however, requiring f to converge to f_B in L_2 is not necessary. More importantly, in real applications we are often only interested in the top portion of the rank-list, and our bounds should reflect this practical consideration. Assume that the coefficients c_i in the DCG criterion decay fast, so that Σ_i c_i is bounded (independently of m). In this case, we may pick p = 1 and q = ∞ in Theorem 2. If sup_j |f(x_j, S) - f_B(x_j, S)| is small, then we obtain a better bound than the least squares error bound with p = q = 2, which depends on m.

However, we cannot ensure that sup_j |f(x_j, S) - f_B(x_j, S)| is small using the simple least squares estimation in (7). Therefore, in the following we develop a more refined bound for the DCG metric, which will then be used to motivate practical learning methods that improve on the simple least squares method.

Theorem 4 Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. Given S = {x_1,...,x_m}, let the optimal ranking order be J_B = [j*_1,...,j*_m], where f_B(x_{j*_i}, S) is arranged in non-increasing order. Assume that c_i = 0 for all i > k. Then we have the following relationship for all γ ∈ (0, 1), p, q ≥ 1 such that 1/p + 1/q = 1, u > 0, and every subset K ⊆ {1,...,m} that contains j*_1,...,j*_k:

DCG(r_B, S) - DCG(r_f, S) ≤ C_p(γ, u) [ Σ_{j∈K} |f(x_j, S) - f_B(x_j, S)|^q + u sup_{j∉K} (f(x_j, S) - f'_B(x_j, S))_+^q ]^{1/q},

where (z)_+ = max(z, 0),

C_p(γ, u) = (1/(1-γ)) [ 2^p Σ_{i=1}^k c_i^p + u^{-p/q} ( Σ_{i=1}^k c_i )^p ]^{1/p},

f'_B(x_j, S) = f_B(x_j, S) + γ ( f_B(x_{j*_k}, S) - f_B(x_j, S) )_+.

Proof Let r_f(S) = J = [j_1,...,j_m] and let M = f_B(x_{j*_k}, S). Since f'_B(x_j, S) = f_B(x_j, S) + γ(M - f_B(x_j, S))_+, we have

(M - f_B(x_j, S))_+ ≤ (1/(1-γ)) (M - f'_B(x_j, S))_+,

and f'_B(x_j, S) = f_B(x_j, S) whenever f_B(x_j, S) ≥ M, so that (f_B(x_j, S) - M)_+ = (f'_B(x_j, S) - M)_+ for all j. Moreover, since the discounts c_i are non-increasing and [f_B(x_{j*_i}, S)] is the non-increasing rearrangement of [f_B(x_j, S)], we have

Σ_{i=1}^m c_i ( (f_B(x_{j_i}, S) - M)_+ - (f_B(x_{j*_i}, S) - M) ) ≤ 0.

Therefore

DCG(r_B, S) - DCG(r_f, S)
  = Σ_{i=1}^m c_i ( (f_B(x_{j*_i}, S) - M) - (f_B(x_{j_i}, S) - M) )
  = Σ_{i=1}^m c_i ( (f_B(x_{j*_i}, S) - M) - (f_B(x_{j_i}, S) - M)_+ ) + Σ_{i=1}^m c_i (M - f_B(x_{j_i}, S))_+
  ≤ (1/(1-γ)) [ Σ_{i=1}^m c_i ( (f_B(x_{j*_i}, S) - M) - (f'_B(x_{j_i}, S) - M)_+ ) + Σ_{i=1}^m c_i (M - f'_B(x_{j_i}, S))_+ ]
  = (1/(1-γ)) [ Σ_{i=1}^m c_i f_B(x_{j*_i}, S) - Σ_{i=1}^m c_i f'_B(x_{j_i}, S) ]
  ≤ (1/(1-γ)) [ Σ_{i=1}^m c_i ( f_B(x_{j*_i}, S) - f(x_{j*_i}, S) ) + Σ_{i=1}^m c_i ( f(x_{j_i}, S) - f'_B(x_{j_i}, S) ) ]
  ≤ (1/(1-γ)) [ Σ_{i=1}^m c_i ( f_B(x_{j*_i}, S) - f(x_{j*_i}, S) )_+ + Σ_{i=1}^m c_i ( f(x_{j_i}, S) - f'_B(x_{j_i}, S) )_+ ].

The first inequality uses (f_B(x_{j_i}, S) - M)_+ = (f'_B(x_{j_i}, S) - M)_+, the bound (M - f_B(x_{j_i}, S))_+ ≤ (1/(1-γ))(M - f'_B(x_{j_i}, S))_+, and the fact that the first bracket is non-negative (by the last display of the previous paragraph), so multiplying it by 1/(1-γ) ≥ 1 does not decrease it. The next equality uses (f'_B - M)_+ - (M - f'_B)_+ = f'_B - M. The second inequality uses Σ_i c_i f(x_{j_i}, S) ≥ Σ_i c_i f(x_{j*_i}, S), which holds because r_f sorts the items in descending order of f.

Since c_i = 0 for i > k and j*_1,...,j*_k ∈ K, Hölder's inequality gives

Σ_{i=1}^m c_i ( f_B(x_{j*_i}, S) - f(x_{j*_i}, S) )_+ ≤ ( Σ_{i=1}^k c_i^p )^{1/p} ( Σ_{j∈K} |f(x_j, S) - f_B(x_j, S)|^q )^{1/q}.

For the second sum, split the positions i ≤ k according to whether j_i ∈ K. For j ∈ K we have (f(x_j, S) - f'_B(x_j, S))_+ ≤ (f(x_j, S) - f_B(x_j, S))_+ because f'_B ≥ f_B, so

Σ_{i=1}^m c_i ( f(x_{j_i}, S) - f'_B(x_{j_i}, S) )_+ ≤ ( Σ_{i=1}^k c_i^p )^{1/p} ( Σ_{j∈K} |f(x_j, S) - f_B(x_j, S)|^q )^{1/q} + ( Σ_{i=1}^k c_i ) sup_{j∉K} ( f(x_j, S) - f'_B(x_j, S) )_+.

Combining the last three displays and applying Hölder's inequality once more to the resulting two-term sum yields the desired bound.

The easiest way to interpret this bound is still to take p = q = 2. Intuitively, the bound says the following: we should estimate the top ranked items using least squares; for the other items, we do not have to estimate their conditional expectations very accurately. The DCG score is not affected as long as we do not over-estimate their conditional expectations to such a degree that some of these items move near the top of the rank-list. This is a very important difference between this bound and Theorem 2, which assumes that we estimate the conditional expectation uniformly well.

The bound in Theorem 4 can still be refined, but the resulting inequalities become more complicated, so we do not include such bounds in this paper. Like Theorem 4, such refined bounds show that we do not have to estimate the conditional expectation uniformly well. We present a simple example as illustration.

Proposition 2 Consider m = 3 and S = {x_1, x_2, x_3}. Let c_1 = 2, c_2 = 1, c_3 = 0, and f_B(x_1, S) = f_B(x_2, S) = 1, f_B(x_3, S) = 0. Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. Then

DCG(r_B, S) - DCG(r_f, S) ≤ 2 |f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| + |f(x_2, S) - f_B(x_2, S)|.

The coefficients on the right hand side cannot be improved.

Proof Note that f is suboptimal only when either f(x_3, S) ≥ f(x_1, S) or f(x_3, S) ≥ f(x_2, S). This gives the bound

DCG(r_B, S) - DCG(r_f, S)
  ≤ I( f(x_3, S) ≥ f(x_1, S) ) + I( f(x_3, S) ≥ f(x_2, S) )
  ≤ I( |f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| ≥ 1 ) + I( |f(x_3, S) - f_B(x_3, S)| + |f(x_2, S) - f_B(x_2, S)| ≥ 1 )
  ≤ 2 |f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| + |f(x_2, S) - f_B(x_2, S)|.

To see that the coefficients cannot be improved, simply note that the bound is tight when either f(x_1, S) = f(x_2, S) = f(x_3, S) = 1, or f(x_1, S) = 1 and f(x_2, S) = f(x_3, S) = 0, or f(x_2, S) = 1 and f(x_1, S) = f(x_3, S) = 0.

The proposition implies that not all errors should be weighted equally: in the example, getting x_3 right is more important than getting x_1 or x_2 right. Conceptually, Theorem 4 and Proposition 2 show the following. Since we are interested in the top portion of the rank-list, we only need to estimate the top rated items accurately, while preventing the bottom items from being over-estimated (their conditional expectations do not have to be estimated accurately). For ranking purposes, some points are more important than others, so we should bias our learning method to produce more accurate conditional expectation estimates at the more important points.

7 Importance Weighted Regression

The key message from the analysis in Section 6 is that we do not have to estimate the conditional expectations equally well for all items. In particular, since we are interested in the top portion of the rank-list, Theorem 4 implies that we need to estimate the top portion more accurately than the bottom portion. Motivated by this analysis, we consider a regression based training method that solves the DCG optimization problem but weights different points differently according to their importance.

We shall not discuss the implementation details of modeling the function f(x, S), which are beyond the scope of this paper. One simple model is to assume the form f(x, S) = f(x); Section 4 discussed the validity of such models. For example, this model is reasonable if we assume that for each x ∈ S and the corresponding y we have E_{y|(x,S)} y = E_{y|x} y (see Proposition 1).

Let F be a function space that contains functions X × S → R. We draw n sets S_1,...,S_n randomly, where S_i = {x_{i,1},...,x_{i,m}}, with the corresponding grades {y_{i,j}}_j = {y_{i,1},...,y_{i,m}}.

Based on Theorem 2, the simple least squares regression (7) can be used to solve the subset ranking problem. However, this direct regression method is not adequate for many practical problems such as web-search, for which there are many items to rank (that is, m is large) but only the top ranked pages are important. The reason is that the method pays equal attention to relevant and irrelevant pages. In reality, one should pay more attention to the top-ranked (relevant) pages; the grades of lower ranked pages do not need to be estimated accurately, as long as we do not over-estimate them so that these pages appear in the top ranked positions. This intuition is captured by Theorem 4 and Proposition 2, which motivate the following alternative training method:

f̂ = arg min_{f∈F} (1/n) Σ_{i=1}^n L(f, S_i, {y_{i,j}}_j),   (11)

where for S = {x_1,...,x_m}, with the corresponding {y_j}_j, we use the importance weighted regression loss

L(f, S, {y_j}_j) = Σ_{j=1}^m w(x_j, S) ( f(x_j, S) - y_j )^2 + u sup_j w'(x_j, S) ( f(x_j, S) - δ(x_j, S) )_+^2,   (12)

where u is a non-negative parameter. A variation of this method is used to optimize the production system of Yahoo's internet search engine (some aspects of the implementation were covered in [10]). The detailed implementation and parameter choices are trade secrets of Yahoo which we cannot completely disclose here; they are also not relevant for the purpose of this paper. In the following, however, we briefly explain the intuition behind (12) using Theorem 4, together with some practical considerations.

The weight function w(x_j, S) in (12) is chosen so that it focuses only on the most important examples (the weight is set to zero for pages that we know are irrelevant). This part of the formulation corresponds to the first part of the bound in Theorem 4 (in that case, we choose w(x_j, S) to be one for the top part of the examples, with index set K, and zero otherwise). The usefulness of non-uniform weighting is also demonstrated by Proposition 2. The specific choice of the weight function involves various engineering considerations that are not important for the purpose of this paper. In general, if there are many items with similar grades, then it is beneficial to give each of the similar items a smaller weight.

In the second part of (12), we choose w'(x_j, S) so that it focuses on the examples not covered by w(x_j, S). In particular, it covers only those data points x_j that are low-ranked with high confidence. We choose δ(x_j, S) to be a small threshold that can be regarded as a lower bound of f'_B(x_j, S) in Theorem 4, such as γ f_B(x_{j*_k}, S). An important observation is that although m is often very large, the number of points for which w(x_j, S) is nonzero is often small. Moreover, (f(x_j, S) - δ(x_j, S))_+ is nonzero only when f(x_j, S) > δ(x_j, S), and in practice the number of such points is usually small (that is, most irrelevant pages will be predicted as irrelevant). Therefore the formulation completely ignores those low-ranked data points for which f(x_j, S) ≤ δ(x_j, S), which makes the learning procedure computationally efficient even when m is large. The analogy here is support vector machines, where only the support vectors are useful in the learning formulation: one can completely ignore samples corresponding to non support vectors.

In a practical implementation of (12), we can use an iterative refinement scheme, in which we start with a small number of samples included in the first part of (12), and then put the low-ranked points into the second part of (12) only when their ranking scores exceed δ(x_j, S).

In fact, one may also put these points into the first part of (12), so that the second part always has zero value (which makes the implementation simpler). In this sense, the formulation in (12) suggests a selective sampling scheme, in which we pay special attention to important and highly ranked data points, while completely ignoring most of the low ranked data points. In this regard, with appropriately chosen w(x, S), the second part of (12) can be ignored altogether.

The empirical risk minimization method in (11) approximately minimizes the criterion

Q(f) = E_S L(f, S),   (13)

where

L(f, S) = E_{{y_j}_j|S} L(f, S, {y_j}_j) = Σ_{j=1}^m w(x_j, S) E_{y_j|(x_j,S)} ( f(x_j, S) - y_j )^2 + u sup_j w'(x_j, S) ( f(x_j, S) - δ(x_j, S) )_+^2.

The following theorem shows that under appropriate assumptions, approximate minimization of (13) leads to approximate optimization of the DCG.

Theorem 5 Assume that c_i = 0 for all i > k, and assume that the following conditions hold for each S = {x_1,...,x_m}:

- Let the optimal ranking order be J_B = [j*_1,...,j*_m], where f_B(x_{j*_i}, S) is arranged in non-increasing order. There exists γ ∈ [0, 1) such that δ(x_j, S) ≤ γ f_B(x_{j*_k}, S).
- For all j with f_B(x_j, S) > δ(x_j, S), we have w(x_j, S) ≥ 1.
- w'(x_j, S) = I(w(x_j, S) < 1).

Then the following results hold:

- A function f* minimizes (13) if f*(x_j, S) = f_B(x_j, S) when w(x_j, S) > 0 and f*(x_j, S) ≤ δ(x_j, S) otherwise.
- For all f, let r_f be the induced ranking function, and let r_B be the optimal Bayes ranking function. Then

DCG(r_f) ≥ DCG(r_B) - C(γ, u) ( Q(f) - Q(f*) )^{1/2},

where C(γ, u) = C_2(γ, u) is the constant of Theorem 4 with p = q = 2.

Proof Note that if f_B(x_j, S) > δ(x_j, S), then w(x_j, S) ≥ 1 and w'(x_j, S) = 0. Therefore the minimizer f* should minimize E_{y_j|(x_j,S)} ( f(x_j, S) - y_j )^2 at such a point, which is achieved at f*(x_j, S) = f_B(x_j, S). If f_B(x_j, S) ≤ δ(x_j, S), then there are two cases:

- w(x_j, S) > 0: f*(x_j, S) should minimize E_{y_j|(x_j,S)} ( f(x_j, S) - y_j )^2, achieved at f*(x_j, S) = f_B(x_j, S) ≤ δ(x_j, S).
- w(x_j, S) = 0: f*(x_j, S) only needs to keep ( f(x_j, S) - δ(x_j, S) )_+ at zero, which is achieved by any f*(x_j, S) ≤ δ(x_j, S).

This proves the first claim. For each S, denote by K the set of x_j such that w'(x_j, S) = 0. The second claim follows from the derivation

Q(f) - Q(f*) = E_S ( L(f, S) - L(f*, S) )
  = E_S [ Σ_{j=1}^m w(x_j, S) ( f(x_j, S) - f_B(x_j, S) )^2 + u sup_j w'(x_j, S) ( f(x_j, S) - δ(x_j, S) )_+^2 ]
  ≥ E_S [ Σ_{j∈K} ( f_B(x_j, S) - f(x_j, S) )^2 + u sup_{j∉K} ( f(x_j, S) - δ(x_j, S) )_+^2 ]
  ≥ E_S ( DCG(r_B, S) - DCG(r_f, S) )^2 / C(γ, u)^2
  ≥ ( DCG(r_B) - DCG(r_f) )^2 / C(γ, u)^2.

The second inequality follows from Theorem 4 with p = q = 2 (note that δ(x_j, S) ≤ γ f_B(x_{j*_k}, S) ≤ f'_B(x_j, S), since the grades are non-negative), and the last inequality is Jensen's inequality.

8 Generalization Analysis

In this section, we analyze the generalization performance of (11). The analysis depends on the underlying function class F. In the literature, one often employs a linear function class with an appropriate regularization condition, such as L_1 or L_2 regularization of the linear weight coefficients. Yahoo's machine learning ranking system employs the gradient boosting method described in [14], which is closely related to L_1 regularization, analyzed in [5, 18, 19]. Although the consistency of boosting for standard least squares regression is known (for example, see [6, 30]), such analysis does not deal with the situation where m is large, and thus is not suitable for analyzing the ranking problem considered in this paper. In this section, we consider a linear function class with L_2 regularization, which is closely related to kernel methods. We employ a relatively simple stability analysis which is suitable for L_2 regularization. Our result does not depend on m explicitly, which is important for large scale ranking problems such as web-search. Although similar results can be obtained for L_1 regularization or gradient boosting, the analysis would become much more complicated.

For L_2 regularization, we consider a feature map ψ : X × S → H, where H is a vector space. We denote by w^T v the L_2 inner product of w and v in H. The function class F considered here is of the form

F = { β^T ψ(x, S) : β ∈ H, β^T β ≤ A^2 } ⊆ { X × S → R },   (14)

where the complexity is controlled by the L_2 regularization of the weight vector, β^T β ≤ A^2. We use (S_i = {x_{i,1},...,x_{i,m}}, {y_{i,j}}_j) to denote a sample point indexed by i. Note that for each sample i, we do not need to assume that the y_{i,j} are independently generated for different j. Using (14), the importance weighted regression in (11) becomes the following regularized empirical risk minimization method:

f_β̂(x, S) = β̂^T ψ(x, S),
β̂ = arg min_{β∈H} [ (1/n) Σ_{i=1}^n L(β, S_i, {y_{i,j}}_j) + λ β^T β ],   (15)
L(β, S, {y_j}_j) = Σ_{j=1}^m w(x_j, S) ( β^T ψ(x_j, S) - y_j )^2 + u sup_j w'(x_j, S) ( β^T ψ(x_j, S) - δ(x_j, S) )_+^2.

In this method we replace the hard regularization in (14), with tuning parameter A, by soft regularization with tuning parameter λ, which is computationally more convenient. The following result is an expected generalization bound for the L_2-regularized empirical risk minimization method (15); it uses the stability analysis in [7], and the proof is given in Appendix A.

Theorem 6 Let M = sup_{x,S} ||ψ(x, S)|| and W = sup_S [ Σ_{x_j∈S} w(x_j, S) + u sup_{x_j∈S} w'(x_j, S) ]. Let f_β̂ be the estimator defined in (15). Then we have

E_{(S_i, {y_{i,j}}_j)_{i=1}^n} Q(f_β̂) ≤ ( 1 + W M^2 / (λ n) )^2 inf_{β∈H} [ Q(f_β) + λ β^T β ].

We have paid special attention to the properties of (15). In particular, the quantity W is usually much smaller than m, which is large for web-search applications. The point we would like to emphasize is that even though m is large, the estimation complexity is only affected by the top portion of the rank-list. If the estimation of the lowest ranked items is relatively easy (as is generally the case), then the learning complexity does not depend on the majority of items near the bottom of the rank-list. Combining Theorem 5 and Theorem 6 gives the following bound.

Theorem 7 Suppose the conditions of Theorem 5 and Theorem 6 hold, with f* minimizing (13). Let f̂ = f_β̂. Then

E_{(S_i, {y_{i,j}}_j)_{i=1}^n} DCG(r_f̂) ≥ DCG(r_B) - C(γ, u) [ ( 1 + W M^2 / (λ n) )^2 inf_{β∈H} ( Q(f_β) + λ β^T β ) - Q(f*) ]^{1/2}.

Proof From Theorem 5, we obtain

E_{(S_i, {y_{i,j}}_j)_{i=1}^n} DCG(r_f̂) ≥ DCG(r_B) - C(γ, u) E_{(S_i, {y_{i,j}}_j)_{i=1}^n} ( Q(f̂) - Q(f*) )^{1/2}
  ≥ DCG(r_B) - C(γ, u) [ E_{(S_i, {y_{i,j}}_j)_{i=1}^n} Q(f_β̂) - Q(f*) ]^{1/2}.

The second inequality is a consequence of Jensen's inequality. Applying Theorem 6 now gives the desired bound.

The theorem implies that if Q(f*) = inf_{β∈H} Q(f_β), then as n → ∞ we can let λ → 0 and λ n → ∞, so that the second term on the right hand side vanishes in the large sample limit. Therefore, asymptotically, we can achieve the optimal DCG score. This implies the consistency of regression based learning methods for the DCG criterion.
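To make (12) and (15) concrete, here is a minimal sketch of the importance weighted, L_2-regularized objective for a linear scoring function f_β(x, S) = β^T ψ(x, S), minimized by plain gradient descent with a numerical gradient. The feature map, the weights w and w', the threshold δ, and all constants are illustrative assumptions, not the production implementation.

```python
import numpy as np

def subset_loss(beta, Psi, y, w, w_prime, delta, u):
    """Importance weighted loss (12) for one subset S:
    sum_j w_j (beta^T psi_j - y_j)^2 + u * max_j w'_j (beta^T psi_j - delta_j)_+^2."""
    f = Psi @ beta
    sq_part = np.sum(w * (f - y) ** 2)
    hinge = w_prime * np.maximum(0.0, f - delta) ** 2
    return sq_part + u * np.max(hinge)

def objective(beta, data, u, lam):
    """Regularized empirical risk (15): average subset loss plus lam * ||beta||^2."""
    risk = np.mean([subset_loss(beta, *args, u) for args in data])
    return risk + lam * beta @ beta

def numerical_gradient(fun, beta, eps=1e-6):
    g = np.zeros_like(beta)
    for i in range(len(beta)):
        e = np.zeros_like(beta)
        e[i] = eps
        g[i] = (fun(beta + e) - fun(beta - e)) / (2 * eps)
    return g

# toy data: n subsets, each with m items and d features
rng = np.random.default_rng(2)
n, m, d, u, lam = 20, 8, 4, 1.0, 0.01
data = []
for _ in range(n):
    Psi = rng.normal(size=(m, d))                    # rows psi(x_j, S)
    y = np.clip(Psi @ np.array([0.4, 0.1, -0.2, 0.3]) + 0.1 * rng.normal(size=m), 0, 1)
    w = (y > 0.5).astype(float)                      # weight only the "relevant" items
    w_prime = (w < 1).astype(float)                  # w'(x_j, S) = I(w(x_j, S) < 1)
    delta = np.full(m, 0.3)                          # small threshold delta(x_j, S)
    data.append((Psi, y, w, w_prime, delta))

beta = np.zeros(d)
for _ in range(200):                                 # gradient descent on (15)
    beta -= 0.1 * numerical_gradient(lambda b: objective(b, data, u, lam), beta)
print(objective(beta, data, u, lam))
```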


MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

Vapnik-Chervonenkis theory

Vapnik-Chervonenkis theory Vapnk-Chervonenks theory Rs Kondor June 13, 2008 For the purposes of ths lecture, we restrct ourselves to the bnary supervsed batch learnng settng. We assume that we have an nput space X, and an unknown

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

COS 521: Advanced Algorithms Game Theory and Linear Programming

COS 521: Advanced Algorithms Game Theory and Linear Programming COS 521: Advanced Algorthms Game Theory and Lnear Programmng Moses Charkar February 27, 2013 In these notes, we ntroduce some basc concepts n game theory and lnear programmng (LP). We show a connecton

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Difference Equations

Difference Equations Dfference Equatons c Jan Vrbk 1 Bascs Suppose a sequence of numbers, say a 0,a 1,a,a 3,... s defned by a certan general relatonshp between, say, three consecutve values of the sequence, e.g. a + +3a +1

More information

Lecture 20: November 7

Lecture 20: November 7 0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k. THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016 CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

Chapter 8 Indicator Variables

Chapter 8 Indicator Variables Chapter 8 Indcator Varables In general, e explanatory varables n any regresson analyss are assumed to be quanttatve n nature. For example, e varables lke temperature, dstance, age etc. are quanttatve n

More information

Support Vector Machines. Vibhav Gogate The University of Texas at dallas

Support Vector Machines. Vibhav Gogate The University of Texas at dallas Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION Advanced Mathematcal Models & Applcatons Vol.3, No.3, 2018, pp.215-222 ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EUATION

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s

More information

Complete subgraphs in multipartite graphs

Complete subgraphs in multipartite graphs Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

Comparison of Regression Lines

Comparison of Regression Lines STATGRAPHICS Rev. 9/13/2013 Comparson of Regresson Lnes Summary... 1 Data Input... 3 Analyss Summary... 4 Plot of Ftted Model... 6 Condtonal Sums of Squares... 6 Analyss Optons... 7 Forecasts... 8 Confdence

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

MAXIMUM A POSTERIORI TRANSDUCTION

MAXIMUM A POSTERIORI TRANSDUCTION MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,

More information

Some modelling aspects for the Matlab implementation of MMA

Some modelling aspects for the Matlab implementation of MMA Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton

More information

x i1 =1 for all i (the constant ).

x i1 =1 for all i (the constant ). Chapter 5 The Multple Regresson Model Consder an economc model where the dependent varable s a functon of K explanatory varables. The economc model has the form: y = f ( x,x,..., ) xk Approxmate ths by

More information

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced,

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced, FREQUENCY DISTRIBUTIONS Page 1 of 6 I. Introducton 1. The dea of a frequency dstrbuton for sets of observatons wll be ntroduced, together wth some of the mechancs for constructng dstrbutons of data. Then

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 30 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 2 Remedes for multcollnearty Varous technques have

More information

A New Refinement of Jacobi Method for Solution of Linear System Equations AX=b

A New Refinement of Jacobi Method for Solution of Linear System Equations AX=b Int J Contemp Math Scences, Vol 3, 28, no 17, 819-827 A New Refnement of Jacob Method for Soluton of Lnear System Equatons AX=b F Naem Dafchah Department of Mathematcs, Faculty of Scences Unversty of Gulan,

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution.

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution. Solutons HW #2 Dual of general LP. Fnd the dual functon of the LP mnmze subject to c T x Gx h Ax = b. Gve the dual problem, and make the mplct equalty constrants explct. Soluton. 1. The Lagrangan s L(x,

More information

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals Smultaneous Optmzaton of Berth Allocaton, Quay Crane Assgnment and Quay Crane Schedulng Problems n Contaner Termnals Necat Aras, Yavuz Türkoğulları, Z. Caner Taşkın, Kuban Altınel Abstract In ths work,

More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Winter 2008 CS567 Stochastic Linear/Integer Programming Guest Lecturer: Xu, Huan

Winter 2008 CS567 Stochastic Linear/Integer Programming Guest Lecturer: Xu, Huan Wnter 2008 CS567 Stochastc Lnear/Integer Programmng Guest Lecturer: Xu, Huan Class 2: More Modelng Examples 1 Capacty Expanson Capacty expanson models optmal choces of the tmng and levels of nvestments

More information

Composite Hypotheses testing

Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

Chapter 5 Multilevel Models

Chapter 5 Multilevel Models Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level

More information

Discussion of Extensions of the Gauss-Markov Theorem to the Case of Stochastic Regression Coefficients Ed Stanek

Discussion of Extensions of the Gauss-Markov Theorem to the Case of Stochastic Regression Coefficients Ed Stanek Dscusson of Extensons of the Gauss-arkov Theorem to the Case of Stochastc Regresson Coeffcents Ed Stanek Introducton Pfeffermann (984 dscusses extensons to the Gauss-arkov Theorem n settngs where regresson

More information

ECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics

ECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics ECOOMICS 35*-A Md-Term Exam -- Fall Term 000 Page of 3 pages QUEE'S UIVERSITY AT KIGSTO Department of Economcs ECOOMICS 35* - Secton A Introductory Econometrcs Fall Term 000 MID-TERM EAM ASWERS MG Abbott

More information

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem.

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem. prnceton u. sp 02 cos 598B: algorthms and complexty Lecture 20: Lft and Project, SDP Dualty Lecturer: Sanjeev Arora Scrbe:Yury Makarychev Today we wll study the Lft and Project method. Then we wll prove

More information

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

Economics 130. Lecture 4 Simple Linear Regression Continued

Economics 130. Lecture 4 Simple Linear Regression Continued Economcs 130 Lecture 4 Contnued Readngs for Week 4 Text, Chapter and 3. We contnue wth addressng our second ssue + add n how we evaluate these relatonshps: Where do we get data to do ths analyss? How do

More information

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015 CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research

More information

3.1 ML and Empirical Distribution

3.1 ML and Empirical Distribution 67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland

More information

Convergence of random processes

Convergence of random processes DS-GA 12 Lecture notes 6 Fall 216 Convergence of random processes 1 Introducton In these notes we study convergence of dscrete random processes. Ths allows to characterze phenomena such as the law of large

More information

Negative Binomial Regression

Negative Binomial Regression STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...

More information