Statistical Analysis of Bayes Optimal Subset Ranking


David Cossock, Yahoo Inc., Santa Clara, CA, USA
Tong Zhang, Yahoo Inc., New York City, USA

Abstract

The ranking problem has become increasingly important in modern applications of statistical methods in automated decision making systems. In particular, we consider a formulation of the statistical ranking problem which we call subset ranking, and focus on the DCG (discounted cumulated gain) criterion that measures the quality of items near the top of the rank-list. Similar to error minimization for binary classification, direct optimization of natural ranking criteria such as DCG leads to non-convex optimization problems that can be NP-hard. Therefore a computationally more tractable approach is needed. We present bounds that relate the approximate optimization of DCG to the approximate minimization of certain regression errors. These bounds justify the use of convex learning formulations for solving the subset ranking problem. The resulting estimation methods are not conventional, in that we focus on the estimation quality in the top portion of the rank-list. We further investigate the generalization ability of these formulations. Under appropriate conditions, the consistency of the estimation schemes with respect to the DCG metric can be derived.

1 Introduction

We consider the general ranking problem, where a computer system is required to rank a set of items based on a given input. In such applications, the system often needs to present only a few top ranked items to the user, so the quality of the system output is determined by the performance near the top of its rank-list. Ranking is especially important in electronic commerce and many internet applications, where personalization and information based decision making are critical to the success of such businesses. The decision making process can often be posed as a problem of selecting top candidates from a set of potential alternatives, leading to a conditional ranking problem. For example, in a recommender system, the computer is asked to choose a few items a user is most likely to buy based on the user's profile and buying history; the selected items are then presented to the user as recommendations. Another important example, one that affects millions of people every day, is internet search, where the user presents a query to the search engine, and the search engine selects a few web-pages that are most relevant to the query from the whole web. The quality of a search engine is largely determined by the top-ranked results it can display on the first page. Internet search is the main motivation of this theoretical study, although the model presented here can be useful for many other applications.

For example, another ranking problem is ad placement in a web-page (either a search result page or a content page) according to revenue-generating potential.

Since for search and many other ranking problems we are only interested in the quality of the top choices, the evaluation of the system output differs from traditional error metrics such as classification error. In this setting, a useful figure of merit should focus on the top portion of the rank-list. To our knowledge, this characteristic of practical ranking problems has not been carefully explored in earlier studies (except for a recent paper [3], which also touched on this issue). The purpose of this paper is to develop theoretical results for converting a ranking problem into convex optimization problems that can be efficiently solved. The resulting formulation focuses on the quality of the top ranked results. The theory can be regarded as an extension of related theory for convex risk minimization formulations for classification, which has drawn much attention recently in the statistical learning literature [4, 18, 9, 8, 4, 5].

We organize the paper as follows. Section 2 discusses earlier work in statistics and machine learning on global and pair-wise ranking. Section 3 introduces the subset ranking problem and defines two ranking metrics: the DCG measure, which we focus on in this paper, and a measure that counts the number of correctly ranked pairs; the latter has been studied recently by several authors in the context of pair-wise preference learning. Section 4 investigates the relationship between subset ranking and global ranking. Section 5 introduces some basic estimation methods for ranking; this paper focuses on the least squares regression based formulation. Section 6 contains the main theoretical results of the paper, where we show that approximate minimization of certain regression errors leads to approximate optimization of the ranking metrics defined earlier. This implies that asymptotically the non-convex ranking problem can be solved using regression methods that are convex. Section 7 presents the regression learning formulation derived from the theoretical results in Section 6; similar methods are currently used to optimize Yahoo's production search engine. Section 8 studies the generalization ability of regression learning, where we focus on the L_2-regularization approach. Together with the earlier theoretical results, we can establish the consistency of regression based ranking under appropriate conditions.

2 Ranking and Pair-wise Preference Learning

The traditional prediction problem in statistical machine learning assumes that we observe an input vector q ∈ Q, so as to predict an unobserved output p ∈ P. In a ranking problem, however, if we assume P = {1,...,m} contains m possible values, then instead of predicting a value in P, we predict a permutation of P that gives an optimal ordering of P. That is, if we denote by P! the set of permutations of P, then the goal is to predict an output in P!. There are two fundamental issues: first, how to measure the quality of a ranking; second, how to learn a good ranking procedure from historical data. At first sight, it may seem that we can simply cast the ranking problem as an ordinary prediction problem whose output space is P!. However, the number of permutations in P! is m!, which can be extremely large even for small m, so it is not practical to solve the ranking problem directly without imposing structure on the search space. Moreover, in practice, given a training point q ∈ Q, we are generally not given an optimal permutation in P! as the observed output. Instead, we may observe another form of output from which the optimal ranking can typically be inferred but which may contain extra information as well. The training procedure should take advantage of such information.

A common idea for generating an optimal permutation in P! is to use a scoring function that takes a pair (q, p) ∈ Q × P and maps it to a real number r(q, p). For each q, the predicted permutation of P induced by this scoring function is defined as the ordering of p ∈ P sorted by non-increasing value of r(q, p). This is the method we focus on in this paper.

Although the ranking problem has received considerable interest in machine learning recently, due to its important applications in modern automated information processing systems, it has not been extensively studied in the traditional statistical literature. A relevant statistical model is ordinal regression [20]. In this model, we are still interested in predicting a single output. We redefine the input space as X = Q × P, and for each x ∈ X we observe an output value y ∈ Y. Moreover, we assume that the values in Y = {1,...,L} are ordered, and that the cumulative probability P(y ≤ j | x) (j = 1,...,L) has the form γ(P(y ≤ j | x)) = θ_j + g_β(x). In this model, both γ(·) and g_β(·) have known functional forms, and θ and β are model parameters. Note that the ordinal regression model induces a stochastic preference relationship on the input space X: consider two samples (x_1, y_1) and (x_2, y_2) on X × Y; we say x_1 ≺ x_2 if and only if y_1 < y_2. This leads to a classification problem that takes a pair of inputs x_1 and x_2 and tries to predict whether x_1 ≺ x_2 or not (that is, whether the corresponding outputs satisfy y_1 < y_2 or not). In this formulation, the optimal prediction rule that minimizes classification error is induced by the ordering of g_β(x) on X, because if g_β(x_1) < g_β(x_2), then P(y_1 < y_2) > 0.5 (based on the ordinal regression model), which is consistent with the Bayes rule. Motivated by this observation, an SVM ranking method was proposed in [16]. The idea is to reformulate ordinal regression as a model for learning a preference relationship on the input space X, which can be learned using pair-wise classification. Given the parameter β̂ learned from training data, the scoring function is simply r(q, p) = g_β̂(x).

The pair-wise preference learning model has become a major trend for ranking in the machine learning literature. For example, in addition to the SVM approach, a similar method based on AdaBoost was proposed in [13], and the idea was also used in optimizing the Microsoft web-search system [7]. A number of researchers have worked on the theoretical analysis of ranking using the pair-wise ranking model. The criterion is to minimize the error of pair-wise preference prediction when two points X_1 and X_2 are drawn randomly from the input space X. That is, given a scoring function g : X → R, the ranking loss is

E_{(X_1,Y_1)} E_{(X_2,Y_2)} [ I(Y_1 < Y_2) I(g(X_1) ≥ g(X_2)) + I(Y_1 > Y_2) I(g(X_1) ≤ g(X_2)) ]   (1)
  = E_{X_1,X_2} [ P(Y_1 < Y_2 | X_1, X_2) I(g(X_1) ≥ g(X_2)) + P(Y_1 > Y_2 | X_1, X_2) I(g(X_1) ≤ g(X_2)) ],

where I(·) denotes the indicator function. For binary output y ∈ {0, 1}, it is known that this metric is equivalent, up to a scaling, to the AUC measure (area under the ROC curve) for binary classifiers, and it is closely related to the Mann-Whitney-Wilcoxon statistic [15]. In the literature, theoretical analysis has focused mainly on this ranking criterion (for example, see [1, 2, 9, 22]).

The pair-wise preference learning model has some limitations. First, although the criterion in (1) measures the global pair-wise ranking quality, it is not the best metric for evaluating practical ranking systems. In most applications, a system does not need to rank all data-pairs, but only a subset of them at a time; moreover, typically only the top few positions of the rank-list are of importance.
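To make the pair-wise criterion in (1) concrete, the following minimal sketch computes its empirical version for a scoring function on a labeled sample. The function name and the convention of counting score ties as errors are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def pairwise_ranking_loss(scores, labels):
    """Empirical version of the pair-wise ranking loss (1): a pair is an error
    when the item with the larger label does not receive a strictly larger score
    (ties in the scores are counted as errors)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(scores)
    errors, pairs = 0, 0
    for i in range(n):
        for l in range(i + 1, n):
            if labels[i] == labels[l]:
                continue  # pairs with equal labels carry no preference
            pairs += 1
            hi, lo = (i, l) if labels[i] > labels[l] else (l, i)
            if scores[hi] <= scores[lo]:
                errors += 1
    return errors / max(pairs, 1)

# toy usage: for binary labels, 1 - loss is the empirical AUC
print(pairwise_ranking_loss([0.9, 0.2, 0.4, 0.8], [1, 0, 0, 1]))
```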
Another issue with the pair-wise preference learning model is that the scoring function is usually learned by minimizing a convex relaxation of the pair-wise classification error, similar to large margin classification. However, if the preference relationship is stochastic, then an important question is whether such a learning algorithm leads to a Bayes optimal ranking function in the large sample limit.

Unfortunately this is difficult to analyze for general risk minimization formulations if the decision rule is induced by a single-variable scoring function of the form r(x). The problem of Bayes optimality in the pair-wise learning model was partially investigated in [9], but with a decision rule of the general form r(x_1, x_2): we predict x_1 ≺ x_2 if r(x_1, x_2) < 0. To our knowledge, this method is not widely used in practice because a naive application can lead to contradictions: we may predict r(x_1, x_2) < 0, r(x_2, x_3) < 0, and r(x_3, x_1) < 0. Therefore, in order to use such a method effectively for ranking, there needs to be a mechanism for resolving such contradictions. One possibility is to define a scoring function f(x) = Σ_{x'} r(x, x') and rank the data accordingly; another is to use a sorting method (such as quick-sort) directly with the comparison function given by r(x_1, x_2). However, in order to show that such contradiction resolution methods are well behaved asymptotically, it is necessary to analyze the corresponding error, and we are not aware of any study of such error analysis.

3 Subset Ranking Model

The global pair-wise preference learning model in Section 2 has some limitations. In this paper, we describe a model more relevant to practical ranking systems such as web-search. We first describe the model, and then use search as an example to illustrate it.

3.1 Problem definition

Let X be the space of observable features, and let Z be the space of variables that are not necessarily directly used in the deployed system. Denote by S the set of all finite subsets of X, where a subset may possibly contain redundant elements. Let y be a non-negative real-valued variable that corresponds to the quality of an x ∈ X. Assume also that we are given a (measurable) feature map F that takes each z ∈ Z and produces a finite subset F(z) = S = {x_1,...,x_m} ∈ S. Note that the order of the items in the set is of no importance; the numerical subscripts are for notational purposes only, so that permutations can be more conveniently defined.

In subset ranking, we randomly draw a variable z ∈ Z according to some underlying distribution on Z. We then create a finite subset F(z) = S = {x_1,...,x_m} ∈ S consisting of feature vectors x_j ∈ X, and at the same time a set of grades {y_j} = {y_1,...,y_m} such that each y_j corresponds to x_j. Whether the size m of the set should be a random variable is of no importance in our analysis; in this paper we assume that it is fixed, for simplicity. Based on the observed subset S = {x_1,...,x_m}, the system is required to output an ordering (ranking) of the items in the set. Using our notation, this ordering can be represented as a permutation J = [j_1,...,j_m] of [1,...,m]. Our goal is to produce a permutation such that y_{j_i} is in decreasing order for i = 1,...,m.

In practical applications, each available position i can be associated with a weight c_i that measures the importance of that position. Given the grades y_j (j = 1,...,m), a very natural measure of the quality of the rank-list J = [j_1,...,j_m] is the weighted sum

DCG(J, [y_j]) = Σ_{i=1}^m c_i y_{j_i}.

We assume that {c_i} is a pre-defined sequence of non-increasing non-negative discount factors that may or may not depend on S. This metric, described in [17] as DCG (discounted cumulated gain), is one of the main metrics used in the evaluation of internet search systems, including the production system of Yahoo and that of Microsoft [7].

In this context, a typical choice is c_i = 1/log_2(1 + i) for i ≤ k and c_i = 0 for i > k, for some k. One may also use other choices, such as letting c_i be the probability that a user looks at (or clicks) the result at position i. Although introduced in the context of web-search, the DCG criterion is clearly natural for many other ranking applications such as recommender systems. Moreover, by choosing a decaying sequence of c_i, this measure naturally focuses on the quality of the top portion of the rank-list. This is in contrast with the pair-wise error criterion in (1), which does not distinguish the top portion of the rank-list from the bottom portion.

For the DCG criterion, our goal is to train a ranking function r that takes a subset S ∈ S as input and produces an output permutation J = r(S) such that the expected DCG is as large as possible:

DCG(r) = E_S DCG(r, S),   (2)

where

DCG(r, S) = Σ_{i=1}^m c_i E_{y_{j_i}|(x_{j_i},S)} y_{j_i},   with [j_1,...,j_m] = r(S).   (3)

The global pair-wise preference learning metric (1) can be adapted to the subset ranking setting. We may consider the following weighted total of correctly ranked pairs minus incorrectly ranked pairs:

T(J, [y_j]) = (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} (y_{j_i} - y_{j_l}).

If the output label y takes binary values and the subset S = X is global (we may assume that it is finite), then this metric is equivalent to (1). Although we pay special attention to the DCG metric, we shall also include some analysis of the T criterion for completeness. Similar to (2) and (3), we can define the following quantities:

T(r) = E_S T(r, S),   (4)

where

T(r, S) = (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} ( E_{y_{j_i}|(x_{j_i},S)} y_{j_i} - E_{y_{j_l}|(x_{j_l},S)} y_{j_l} ),   with [j_1,...,j_m] = r(S).   (5)

Similar to the concept of the Bayes classifier in classification, we can define the Bayes ranking function that optimizes the DCG and T measures. Based on the conditional formulations in (3) and (5), we have the following result.

Theorem 1 Given a set S ∈ S, for each x_j ∈ S define the Bayes-scoring function as

f_B(x_j, S) = E_{y_j|(x_j,S)} y_j.

An optimal Bayes ranking function r_B(S) that maximizes (5) returns a rank-list J = [j_1,...,j_m] such that f_B(x_{j_i}, S) is in descending order: f_B(x_{j_1}, S) ≥ f_B(x_{j_2}, S) ≥ ... ≥ f_B(x_{j_m}, S). An optimal Bayes ranking function r_B(S) that maximizes (3) returns a rank-list J = [j_1,...,j_m] such that c_k > c_{k'} implies f_B(x_{j_k}, S) ≥ f_B(x_{j_{k'}}, S).

Proof Consider any k, k' ∈ {1,...,m}. Define J' = [j'_1,...,j'_m], where j'_i = j_i for i ≠ k, k', j'_k = j_{k'}, and j'_{k'} = j_k. We consider the T criterion first, and let k' = k+1. It is easy to check that

T(J', S) - T(J, S) = 4 ( f_B(x_{j_{k+1}}, S) - f_B(x_{j_k}, S) ) / (m(m-1)).

Therefore T(J, S) ≥ T(J', S) implies that f_B(x_{j_k}, S) ≥ f_B(x_{j_{k+1}}, S); since an optimal J satisfies this for every adjacent pair, f_B must be in descending order along J. Now consider the DCG criterion. We have

DCG(J, S) - DCG(J', S) = (c_k - c_{k'}) ( f_B(x_{j_k}, S) - f_B(x_{j_{k'}}, S) ).

Now c_k > c_{k'} and DCG(J, S) ≥ DCG(J', S) imply f_B(x_{j_k}, S) ≥ f_B(x_{j_{k'}}, S).

The result indicates that the optimal ranking can be induced by a single-variable scoring function of the form r(x, S) : X × S → R with x ∈ S.

3.2 Web-search example

As an example of the subset ranking model, we consider the web-search problem. In this application, a user submits a query q and expects the search engine to return a rank-list of web-pages {p_j} such that a more relevant page is placed before a less relevant page. In a typical internet search engine, the system takes the query and uses a simple ranking formula for initial filtering, which limits the set of web-pages to an initial pool {p_j} of size m. After this initial ranking, the system goes through a more complicated second-stage ranking process which reorders the pool. This critical stage is the focus of this paper. At this step, the system takes the query q, and possibly information from additional resources, to generate a feature vector x_j for each page p_j in the initial pool. The feature vector can encode various types of information, such as the length of the query q, the position of p_j in the initial pool, the number of query terms that match the title of p_j, the number of query terms that match the body of p_j, and so on. The set of all possible feature vectors x_j is X; the ranking algorithm only observes a list of feature vectors {x_1,...,x_m} with each x_j ∈ X.

A human editor is presented with a pair (q, p_j) and assigns a score s_j on a scale, e.g., 1-5 (least relevant to highly relevant). The corresponding target value y_j is defined as a transformation of s_j that maps the grade into the interval [0, 1]; for example, the formula (2^{s_j} - 1)/(2^5 - 1) is used in [7], while Yahoo uses a different transformation based on empirical user surveys. Another possible choice is to normalize, multiplying each y_j by a factor such that the optimal DCG is no more than one.
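As a small illustration of the quantities defined in this section, the sketch below computes the DCG of the rank-list induced by a scoring function, using the common discounts c_i = 1/log_2(1 + i) truncated at position k, together with the grade transformation quoted above from [7]. The helper names and the toy data are our own assumptions.

```python
import numpy as np

def grade_to_target(s, s_max=5):
    """Map an editorial grade s in {1,...,s_max} into [0, 1] via (2^s - 1)/(2^s_max - 1)."""
    return (2.0 ** s - 1.0) / (2.0 ** s_max - 1.0)

def dcg(permutation, y, k=10):
    """DCG(J, [y_j]) = sum_i c_i * y_{j_i} with c_i = 1/log2(1 + i) for i <= k, else 0."""
    total = 0.0
    for i, j in enumerate(permutation[:k], start=1):
        total += y[j] / np.log2(1.0 + i)
    return total

def rank_by_scores(scores):
    """Induced ranking: items sorted by non-increasing score."""
    return list(np.argsort(-np.asarray(scores)))

grades = np.array([3, 5, 1, 2, 4])            # editorial grades for one subset S
targets = grade_to_target(grades)              # the y_j values
scores = np.array([0.4, 0.9, 0.1, 0.2, 0.7])   # f(x_j, S) from some scoring function
print(dcg(rank_by_scores(scores), targets, k=3))
print(dcg(rank_by_scores(targets), targets, k=3))  # DCG of the optimal ordering
```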

4 Some Computational Aspects of Subset Ranking

Due to the dependency of the conditional probability of y on S, and thus of the optimal ranking function on S, a complete solution of the subset ranking problem can be difficult when m is large. In general, without further assumptions, the optimal Bayes ranking function ranks the items using the Bayes scoring function f_B(x, S) for each x ∈ S. The explicit S dependency of f_B(x, S) is one of the differences that distinguish subset ranking from global ranking.

If the size m of S is small, then we may simply represent S as a feature vector [x_1,...,x_m] (although this may not be the best representation), so that we can learn a function of the form f_B(x_j, S) = f([x_j, x_1,...,x_m]). Therefore, by redefining x̃_j = [x_j, x_1,...,x_m] ∈ X^{m+1}, we can remove the subset dependency by embedding the original problem into a higher dimensional space. In the general case when m is large, this approach is not practical. Instead of using the whole set S as a feature, we can project S into a lower dimensional space using a feature map g(·), so that f_B(x, S) ≈ f(x, g(S)). By introducing such a set dependent feature vector g(S), we can remove the set dependency by incorporating g(S) into x: this can be achieved by simply redefining x as x̃ = [x, g(S)]. In this way, f_B(x, S) can be approximated by a function of the form f(x̃).

If the subsets are identical, then subset ranking is equivalent to global ranking. In the more general case where subsets are not identical, the reduction from set-dependent local ranking to set-independent global ranking can be complicated if we do not assume any underlying structure on the problem (we shall discuss such a structure later). However, one may ask: if we only use a set-independent function of the form f(x) as the scoring function, how well can it approximate the Bayes scoring function f_B(x, S), and is it easy to compute such a function f(x)? If the subsets are disjoint (or nearly disjoint), then the effect of f_B(x, S) can be achieved by a global scoring function of the form f(x) exactly (or almost exactly), because x determines S. This can be a good approximation for practical problems, where the feature vectors for different subsets (e.g., queries in web-search) usually do not overlap.

If the subsets overlap significantly but are not exactly the same, the problem can be computationally difficult. To see this, consider for simplicity that X is finite, that each subset contains only two elements, and that one element is (deterministically) preferred over the other. In the subset learning model, such a preference relationship x ≺ x' between two elements x, x' ∈ X can be denoted by a directed edge from x to x'. In this setting, finding a global scoring function that approximates the optimal set-dependent Bayes scoring rule is equivalent to finding a maximum subgraph that is acyclic. In general, this problem is computationally difficult: it is known to be NP-hard (an application of similar arguments in ranking can be found in [1, 3]) as well as APX-hard [11]. The class APX consists of problems having an approximation to within 1+c of the optimum for some c. A polynomial time approximation scheme (PTAS) is an algorithm which runs in polynomial time in the instance size (but not necessarily in poly(1/ε)) and returns a solution approximate to within 1+ε for any given ε > 0. If any APX-hard problem admits a PTAS then P=NP.

The above argument implies that without any assumptions, the reduction of the set-dependent Bayes optimal scoring function f_B(x, S) to a set independent function of the form f(x) is difficult. If we are able to incorporate appropriate set dependent features into x, or if the sets do not overlap significantly, then the reduction is computationally feasible. In the ideal case, we can introduce the following definition.

Definition 1 If for every S ∈ S and x, x' ∈ S, we have f_B(x, S) > f_B(x', S) if and only if f(x) > f(x'), then we say that f is an optimal rank preserving function.

Clearly, an optimal rank preserving function may not always exist (without using set-dependent features). As a simple example, assume that X = {a, b, c} has three elements, with m = 2, c_1 = 1 and c_2 = 0 in the DCG definition. We observe {y_1 = 1, y_2 = 0} for the set {x_1 = a, x_2 = b}, {y_1 = 1, y_2 = 0} for the set {x_1 = b, x_2 = c}, and {y_1 = 1, y_2 = 0} for the set {x_1 = c, x_2 = a}. If an optimal rank preserving function f existed, then by definition we would have f(a) > f(b), f(b) > f(c), and f(c) > f(a), which is impossible.

Under appropriate assumptions, the optimal rank preserving function exists. The following result provides a sufficient condition.

Proposition 1 Assume that for each x_j we observe y_j = a(S) y'_j + b(S), where a(S) ≥ 0 and b(S) are normalization/shifting factors that may depend on S, and {y'_j} is a set of random variables that satisfy

P({y'_j} | S) = E_ξ ∏_{j=1}^m P(y'_j | x_j, ξ),

where ξ is a hidden random variable independent of S. Then E_{y'_j|(x_j,S)} y'_j = E_{y'_j|x_j} y'_j. That is, the conditional expectation f(x) = E_{y'|x} y' is an optimal rank preserving function.

Proof Observe that E_{y_j|(x_j,S)} y_j = a(S) E_{y'_j|(x_j,S)} y'_j + b(S). Therefore the scoring functions E_{y_j|(x_j,S)} y_j and E_{y'_j|(x_j,S)} y'_j lead to identical rankings. Moreover,

E_{y'_j|(x_j,S)} y'_j = E_ξ ∫ y'_j dP(y'_j | x_j, ξ),

which does not depend on S beyond x_j, and therefore equals ∫ y'_j dP(y'_j | x_j) = E_{y'_j|x_j} y'_j. This proves the claim.

This result justifies using an appropriately defined feature function to remove set-dependency. If y'_j is a deterministic function of x_j and ξ, then the assumption (and hence the result) always holds, which implies the optimality of the set-independent conditional expectation. In this case, the optimal global scoring rule gives the optimal Bayes rule for subset ranking. We also note that this equivalence does not require the grade y_j to be independent of S.

In web-search, the model in Proposition 1 has a natural interpretation. Consider a pool of human editors indexed by ξ. For each query q, we randomly pick an editor ξ to grade the set of pages p_j to be ranked, and assume that the grade the editor gives to each page p_j depends only on the pair x_j = (q, p_j). In this case, Proposition 1 can be applied to show that the features x_j are sufficient to determine the optimal Bayes rule.

Proposition 1 (and the discussion thereafter) suggests that regression based learning of the conditional expectation E_{y|x} y is asymptotically optimal under assumptions that are reasonable. We call a method that learns such a conditional expectation E_{y|x} y, or a transformation of it, a regression based approach; this differs from the pair-wise preference learning methods used in earlier work. There are two advantages to using regression: first, the computational complexity is at most O(m) (it can be sub-linear in m with appropriate importance subsampling schemes) instead of O(m^2); second, we are able to prove the consistency of such methods under reasonable assumptions. As discussed at the end of Section 2, this issue is more complicated for pair-wise methods. Furthermore, as we will discuss in the next section, some advantages of pair-wise learning can be incorporated into the regression approach by using set-dependent features.

5 Risk Minimization based Estimation Methods

From the previous section, we know that the optimal scoring function is the conditional expectation of the grades y. We now investigate some basic estimation methods for conditional expectation learning.

5.1 Relation to multi-category classification

The subset ranking problem is a generalization of multi-category classification. In the latter case, we observe an input x_0 and are interested in classifying it into one of m classes. Let the output value be k ∈ {1,...,m}. We encode the input x_0 into m feature vectors {x_1,...,x_m}, where x_i = [0,...,0, x_0, 0,...,0] with the i-th component being x_0 and the other components zero. We then encode the output k into m values {y_j} such that y_k = 1 and y_j = 0 for j ≠ k. In this setting, we try to find a scoring function f such that f(x_k) > f(x_j) for j ≠ k. Consider the DCG criterion with c_1 = 1 and c_j = 0 for j > 1: the DCG then equals one exactly when the classification is correct, so maximizing DCG corresponds to minimizing classification error.

Given any multi-category classification algorithm, one may use it to solve the subset ranking problem as follows. Consider a sample S = {x_1,...,x_m} as input, with a set of outputs {y_j}. We randomly draw k from 1 to m according to the distribution y_k / Σ_j y_j. We then form another sample with weight Σ_j y_j, which has the vector S̄ = [x_1,...,x_m] (where order is important) as input, and label y = k ∈ {1,...,m} as output. This changes the problem formulation into multi-category classification. Since the conditional expectation can be expressed as

E_{y_k|(x_k,S)} y_k = P(y = k | S̄) E_{{y_j}|S} Σ_j y_j,

the order induced by the scoring function E_{y_k|(x_k,S)} y_k is the same as that induced by P(y = k | S̄). Therefore a multi-category classification solver that estimates conditional probability can be used to solve the subset ranking problem. In particular, consider a risk minimization based multi-category classification solver for the m-class problem [8, 5] of the form

f̂ = arg min_{f∈F} Σ_{i=1}^n Φ(f(X_i), Y_i),

where (X_i, Y_i) are training points with Y_i ∈ {1,...,m}, F is a vector function class taking values in R^m, and Φ is some risk functional. For ranking with training points (S̄_i, {y_{i,1},...,y_{i,m}}) and S̄_i = [x_{i,1},...,x_{i,m}], the corresponding learning method becomes

f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{j=1}^m y_{i,j} Φ(f(S̄_i), j),

where the function space F contains a subset of functions {f(S̄): X^m → R^m} of the form f(S̄) = [f(x_1, S),...,f(x_m, S)], and S = {x_1,...,x_m} is the corresponding unordered set. An example is maximum entropy (multi-category logistic regression), which has the loss function Φ(f(S̄), j) = -f(x_j, S) + ln Σ_{k=1}^m e^{f(x_k, S)}.

5.2 Regression based learning

Since in ranking problems y_{i,j} can take values other than 0 or 1, we can have more general formulations than multi-category classification. In particular, we may consider variations of the following regression based learning method to train a scoring function in F ⊆ {X × S → R}:

f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{j=1}^m φ(f(x_{i,j}, S_i), y_{i,j}),   S_i = {x_{i,1},...,x_{i,m}} ∈ S,   (6)

where we assume that φ(a, b) = φ_0(a) + φ_1(a) b + φ_2(b).

The estimation formulation is decoupled across the elements x_{i,j} of a subset S_i, which makes the problem easier to solve. In this method, each training point ((x_{i,j}, S_i), y_{i,j}) is treated as a single sample (for i = 1,...,n and j = 1,...,m). The population version of the risk function is

E_S Σ_{x∈S} [ φ_0(f(x, S)) + φ_1(f(x, S)) E_{y|(x,S)} y + E_{y|(x,S)} φ_2(y) ].

This implies that the optimal population solution is a function that minimizes φ_0(f(x, S)) + φ_1(f(x, S)) E_{y|(x,S)} y, which is a function of E_{y|(x,S)} y. Therefore the estimation method in (6) leads to an estimator of the conditional expectation for a reasonable choice of φ_0(·) and φ_1(·). A simple example is the least squares method, where we pick φ_0(a) = a^2, φ_1(a) = -2a and φ_2(b) = b^2. That is, the learning method (6) becomes least squares estimation:

f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{j=1}^m (f(x_{i,j}, S_i) - y_{i,j})^2.   (7)

This method, and some essential variations which we will introduce later, will be the focus of our analysis.

It was shown in [8] that the only loss function with the conditional expectation as the minimizer (for an arbitrary conditional distribution of y) is least squares. However, for practical purposes, we only need to estimate a monotonic transformation of the conditional expectation, and for this purpose we can use additional loss functions of the form (6). In particular, let φ_0(a) be an arbitrary convex function such that φ'_0(a) is a monotone increasing function of a; then we may simply take φ(a, b) = φ_0(a) - ab in (6). The optimal population solution is uniquely determined by φ'_0(f(x, S)) = E_{y|(x,S)} y. A simple example is φ_0(a) = a^4/4, for which the population optimal solution is f(x, S) = (E_{y|(x,S)} y)^{1/3}. Clearly such a transformation does not affect the ranking.

Moreover, in many ranking problems the range of y is bounded. It is known that additional loss functions can then be used for computing the conditional expectation. As a simple example, if we assume that y ∈ [0, 1], then the following modified least squares can be used:

f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{j=1}^m [ (1 - y_{i,j}) max(0, f(x_{i,j}, S_i))^2 + y_{i,j} max(0, 1 - f(x_{i,j}, S_i))^2 ].   (8)

One may replace this with other loss functions used for binary classification that estimate conditional probability, such as those discussed in [9]. Although such general formulations might be interesting for certain applications, their advantages over the simpler least squares loss (7) are not completely certain, and they are more complicated to deal with. Therefore we will not consider such general formulations in this paper, but rather focus on adapting the least squares method (7) to ranking problems. As we shall see, non-trivial modifications of (7) are necessary to optimize system performance near the top of the rank-list.
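The following minimal sketch spells out the per-item losses behind (7) and (8) and fits a linear scoring function by plain least squares. The linear model, the synthetic data, and the solver are illustrative assumptions; the paper does not prescribe a particular function class at this point.

```python
import numpy as np

def least_squares_loss(f_vals, y):
    """Per-subset contribution to (7): sum_j (f(x_j, S) - y_j)^2."""
    return np.sum((f_vals - y) ** 2)

def modified_least_squares_loss(f_vals, y):
    """Per-subset contribution to (8), valid for y in [0, 1]:
    sum_j (1 - y_j) * max(0, f_j)^2 + y_j * max(0, 1 - f_j)^2."""
    return np.sum((1 - y) * np.maximum(0.0, f_vals) ** 2
                  + y * np.maximum(0.0, 1.0 - f_vals) ** 2)

# fit f(x) = x^T beta with the plain least squares objective (7),
# pooling all (x_{i,j}, y_{i,j}) pairs across subsets
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # stacked feature vectors x_{i,j}
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + 0.05 * rng.normal(size=200)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(least_squares_loss(X @ beta, y))
y01 = np.clip(y, 0.0, 1.0)                        # (8) assumes grades in [0, 1]
print(modified_least_squares_loss(X @ beta, y01))
```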

5.3 Pair-wise preference learning

A popular idea in the recent machine learning literature is to pose the ranking problem as a pair-wise preference relationship learning problem (see Section 2). Using this idea, the scoring function for subset ranking can be trained by the following method:

f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{(j,j')∈E_i} φ( f(x_{i,j}, S_i), f(x_{i,j'}, S_i); y_{i,j}, y_{i,j'} ),   (9)

where each E_i is a subset of {1,...,m} × {1,...,m} such that y_{i,j} < y_{i,j'}. For example, we may use a non-increasing monotone function φ_0 and let φ(a_1, a_2; b_1, b_2) = φ_0((a_2 - a_1) - (b_2 - b_1)) or φ(a_1, a_2; b_1, b_2) = (b_2 - b_1) φ_0(a_2 - a_1). Example loss functions include the SVM loss φ_0(x) = max(0, 1 - x) and the AdaBoost loss φ_0(x) = exp(-x) (see [13, 16, 3]). The approach works well if the ranking problem is noise-free (that is, y_{i,j} is deterministic). However, one difficulty with this approach is that if y_{i,j} is stochastic, then the corresponding population estimator from (9) may not be Bayes optimal, unless a more complicated scheme such as [9] is used. It would be interesting to investigate the error of such an approach, but the analysis is beyond the scope of this paper.

One argument used by the advocates of the pair-wise learning formulation is that we do not have to learn an absolute grade judgment (or its expectation), but only the relative judgment that one item is better than another. In essence, this means that for each subset S, if we shift each judgment by a constant, the ranking is not affected. If invariance with respect to a set-dependent judgment shift is a desirable property, then it can be incorporated into the regression based model [6]. For example, similar to Proposition 1, we may introduce an explicit set dependent shift feature (which is rank-preserving) into (6):

f̂ = arg min_{f∈F} Σ_{i=1}^n min_{b_i∈R} Σ_{j=1}^m φ( f(x_{i,j}, S_i) + b_i, y_{i,j} ).

In particular, for least squares, we have the following method:

f̂ = arg min_{f∈F} Σ_{i=1}^n min_{b_i∈R} Σ_{j=1}^m ( f(x_{i,j}, S_i) + b_i - y_{i,j} )^2.   (10)

More generally, we may introduce more sophisticated set dependent features and hierarchical models into the regression formulation, and obtain effects that may not even be easily incorporated into pair-wise models.

6 Convex Surrogate Bounds

The subset ranking problem defined in Section 3 is combinatorial in nature and very difficult to solve directly. Since the optimal Bayes ranking rule is given by the conditional expectation, in Section 5 we discussed various formulations for estimating the conditional expectation; in particular, we are interested in least squares regression based methods. In this context, a natural question is: if a scoring function approximately minimizes the regression error, how well does it optimize ranking metrics such as DCG or T? This section provides theoretical results that relate the optimization of the ranking metrics defined in Section 3 to the minimization of regression errors.

This allows us to design appropriate convex learning formulations that improve on the simple least squares methods in (7) and (10).

A scoring function f(x, S) maps each x ∈ S to a real-valued score. It induces a ranking function r_f, which ranks the elements {x_j} of S in descending order of f(x_j, S). We are interested in bounding the DCG performance of r_f compared with that of the optimal Bayes ranking function r_B. The following results can be regarded as extensions of Theorem 1 that motivate regression based learning.

Theorem 2 Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. Consider a pair p, q ∈ [1, ∞] such that 1/p + 1/q = 1. We have the following relationship for each S = {x_1,...,x_m}:

DCG(r_B, S) - DCG(r_f, S) ≤ 2 ( Σ_{i=1}^m c_i^p )^{1/p} ( Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|^q )^{1/q}.

Proof Let r_f(S) = J = [j_1,...,j_m], and let J^{-1} = [l_1,...,l_m] be its inverse permutation. Similarly, let r_B(S) = J_B = [j*_1,...,j*_m], and let J_B^{-1} = [l*_1,...,l*_m] be its inverse permutation. We have

DCG(r_f, S) = Σ_{i=1}^m c_i f_B(x_{j_i}, S) = Σ_{i=1}^m c_{l_i} f_B(x_i, S)
  = Σ_{i=1}^m c_{l_i} f(x_i, S) + Σ_{i=1}^m c_{l_i} (f_B(x_i, S) - f(x_i, S))
  ≥ Σ_{i=1}^m c_{l*_i} f(x_i, S) + Σ_{i=1}^m c_{l_i} (f_B(x_i, S) - f(x_i, S))
  = Σ_{i=1}^m c_{l*_i} f_B(x_i, S) + Σ_{i=1}^m c_{l*_i} (f(x_i, S) - f_B(x_i, S)) + Σ_{i=1}^m c_{l_i} (f_B(x_i, S) - f(x_i, S))
  ≥ DCG(r_B, S) - Σ_{i=1}^m c_{l*_i} (f_B(x_i, S) - f(x_i, S))_+ - Σ_{i=1}^m c_{l_i} (f(x_i, S) - f_B(x_i, S))_+
  ≥ DCG(r_B, S) - 2 ( Σ_{i=1}^m c_i^p )^{1/p} ( Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|^q )^{1/q},

where we used the notation (z)_+ = max(0, z). The first inequality holds because r_f sorts the items in descending order of f, so that assigning the non-increasing discounts c according to J maximizes Σ_i c_{l_i} f(x_i, S) over all permutations; the last step applies Hölder's inequality to each of the two error sums. Rearranging proves the bound.
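As a quick numerical illustration of Theorem 2 with p = q = 2, the sketch below draws a random conditional-expectation vector f_B and a perturbed estimate f, and compares the realized DCG gap with the regression-error bound 2 (Σ_i c_i^2)^{1/2} (Σ_j (f - f_B)^2)^{1/2}. The random instance and the truncated log-discount choice are our own assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 20, 5
c = np.array([1.0 / np.log2(1 + i) if i <= k else 0.0 for i in range(1, m + 1)])

def dcg_of(scores, values):
    """DCG of the rank-list induced by `scores`, evaluated on `values` (here f_B)."""
    order = np.argsort(-scores)
    return float(c @ values[order])

for _ in range(5):
    f_B = rng.uniform(0, 1, size=m)             # Bayes scoring function on S
    f = f_B + rng.normal(scale=0.1, size=m)     # approximate regression estimate
    gap = dcg_of(f_B, f_B) - dcg_of(f, f_B)     # DCG(r_B, S) - DCG(r_f, S)
    bound = 2.0 * np.sqrt(np.sum(c ** 2)) * np.sqrt(np.sum((f - f_B) ** 2))
    assert gap <= bound + 1e-12
    print(f"gap = {gap:.4f}  <=  bound = {bound:.4f}")
```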

Theorem 2 shows that the DCG criterion can be bounded through the regression error. Although the theorem applies to any pair p, q such that 1/p + 1/q = 1, the most useful case is p = q = 2, because minimizing Σ_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2 can then be achieved directly by the least squares regression in (7). If the regression error goes to zero, the resulting ranking converges to the optimal DCG. Similarly, we can show the following result for the T criterion.

Theorem 3 Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. We have the following relationship for each S = {x_1,...,x_m}:

T(r_B, S) - T(r_f, S) ≤ 4 [ (1/m) Σ_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2 ]^{1/2}.

Proof Let r_f(S) = J = [j_1,...,j_m] and r_B(S) = J_B = [j*_1,...,j*_m]. We have

T(r_f, S) = (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} ( f_B(x_{j_i}, S) - f_B(x_{j_l}, S) )
  ≥ (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} ( f(x_{j_i}, S) - f(x_{j_l}, S) ) - (2/m) Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
  ≥ (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} ( f(x_{j*_i}, S) - f(x_{j*_l}, S) ) - (2/m) Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
  ≥ (2/(m(m-1))) Σ_{i=1}^{m-1} Σ_{l=i+1}^{m} ( f_B(x_{j*_i}, S) - f_B(x_{j*_l}, S) ) - (4/m) Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
  = T(r_B, S) - (4/m) Σ_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
  ≥ T(r_B, S) - 4 [ (1/m) Σ_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2 ]^{1/2}.

The first and third inequalities replace each f_B (respectively f) by f (respectively f_B) and bound the change of every pair by the two corresponding absolute errors; since each index appears in at most m-1 pairs, each correction is at most (2/m) Σ_j |f(x_j, S) - f_B(x_j, S)|. The second inequality uses the fact that r_f orders the items by non-increasing f, so the pair-wise sum of f-differences can only decrease when the items are re-ordered by r_B. The last step is the Cauchy-Schwarz inequality.

The above approximation bounds imply that least squares regression can be used to learn the optimal ranking functions: the approximation error converges to zero when f converges to f_B in L_2. In general, however, requiring f to converge to f_B in L_2 is not necessary. More importantly, in real applications we are often only interested in the top portion of the rank-list, and our bounds should reflect this practical consideration. Assume that the coefficients c_i in the DCG criterion decay fast, so that Σ_i c_i is bounded (independently of m). In this case, we may pick p = 1 and q = ∞ in Theorem 2. If sup_j |f(x_j, S) - f_B(x_j, S)| is small, then we obtain a better bound than the least squares error bound with p = q = 2, which depends on m.

However, we cannot ensure that sup_j |f(x_j, S) - f_B(x_j, S)| is small using the simple least squares estimation in (7). Therefore, in the following we develop a more refined bound for the DCG metric, which will then be used to motivate practical learning methods that improve on the simple least squares method.

Theorem 4 Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. Given S = {x_1,...,x_m}, let the optimal ranking order be J_B = [j*_1,...,j*_m], where f_B(x_{j*_i}, S) is arranged in non-increasing order. Assume that c_i = 0 for all i > k. Then we have the following relationship for all γ ∈ (0, 1), p, q ≥ 1 such that 1/p + 1/q = 1, u > 0, and every subset K ⊆ {1,...,m} that contains j*_1,...,j*_k:

DCG(r_B, S) - DCG(r_f, S) ≤ C_p(γ, u) [ Σ_{j∈K} |f(x_j, S) - f_B(x_j, S)|^q + u sup_{j∉K} (f(x_j, S) - f'_B(x_j, S))_+^q ]^{1/q},

where (z)_+ = max(z, 0),

C_p(γ, u) = (1/(1-γ)) [ 2^p Σ_{i=1}^k c_i^p + u^{-p/q} ( Σ_{i=1}^k c_i )^p ]^{1/p},

f'_B(x_j, S) = f_B(x_j, S) + γ ( f_B(x_{j*_k}, S) - f_B(x_j, S) )_+.

Proof Let r_f(S) = J = [j_1,...,j_m] and let M = f_B(x_{j*_k}, S). Since f'_B(x_j, S) = f_B(x_j, S) + γ(M - f_B(x_j, S))_+, we have

(M - f_B(x_j, S))_+ ≤ (1/(1-γ)) (M - f'_B(x_j, S))_+,

and f'_B(x_j, S) = f_B(x_j, S) whenever f_B(x_j, S) ≥ M, so that (f_B(x_j, S) - M)_+ = (f'_B(x_j, S) - M)_+ for all j. Moreover, since the discounts c_i are non-increasing and [f_B(x_{j*_i}, S)] is the non-increasing rearrangement of [f_B(x_j, S)], we have

Σ_{i=1}^m c_i ( (f_B(x_{j_i}, S) - M)_+ - (f_B(x_{j*_i}, S) - M) ) ≤ 0.

Therefore

DCG(r_B, S) - DCG(r_f, S)
  = Σ_{i=1}^m c_i ( (f_B(x_{j*_i}, S) - M) - (f_B(x_{j_i}, S) - M) )
  = Σ_{i=1}^m c_i ( (f_B(x_{j*_i}, S) - M) - (f_B(x_{j_i}, S) - M)_+ ) + Σ_{i=1}^m c_i (M - f_B(x_{j_i}, S))_+
  ≤ (1/(1-γ)) [ Σ_{i=1}^m c_i ( (f_B(x_{j*_i}, S) - M) - (f'_B(x_{j_i}, S) - M)_+ ) + Σ_{i=1}^m c_i (M - f'_B(x_{j_i}, S))_+ ]
  = (1/(1-γ)) [ Σ_{i=1}^m c_i f_B(x_{j*_i}, S) - Σ_{i=1}^m c_i f'_B(x_{j_i}, S) ]
  ≤ (1/(1-γ)) [ Σ_{i=1}^m c_i ( f_B(x_{j*_i}, S) - f(x_{j*_i}, S) ) + Σ_{i=1}^m c_i ( f(x_{j_i}, S) - f'_B(x_{j_i}, S) ) ]
  ≤ (1/(1-γ)) [ Σ_{i=1}^m c_i ( f_B(x_{j*_i}, S) - f(x_{j*_i}, S) )_+ + Σ_{i=1}^m c_i ( f(x_{j_i}, S) - f'_B(x_{j_i}, S) )_+ ].

The first inequality uses (f_B(x_{j_i}, S) - M)_+ = (f'_B(x_{j_i}, S) - M)_+, the bound (M - f_B(x_{j_i}, S))_+ ≤ (1/(1-γ))(M - f'_B(x_{j_i}, S))_+, and the fact that the first bracket is non-negative (by the last display of the previous paragraph), so multiplying it by 1/(1-γ) ≥ 1 does not decrease it. The next equality uses (f'_B - M)_+ - (M - f'_B)_+ = f'_B - M. The second inequality uses Σ_i c_i f(x_{j_i}, S) ≥ Σ_i c_i f(x_{j*_i}, S), which holds because r_f sorts the items in descending order of f.

Since c_i = 0 for i > k and j*_1,...,j*_k ∈ K, Hölder's inequality gives

Σ_{i=1}^m c_i ( f_B(x_{j*_i}, S) - f(x_{j*_i}, S) )_+ ≤ ( Σ_{i=1}^k c_i^p )^{1/p} ( Σ_{j∈K} |f(x_j, S) - f_B(x_j, S)|^q )^{1/q}.

For the second sum, split the positions i ≤ k according to whether j_i ∈ K. For j ∈ K we have (f(x_j, S) - f'_B(x_j, S))_+ ≤ (f(x_j, S) - f_B(x_j, S))_+ because f'_B ≥ f_B, so

Σ_{i=1}^m c_i ( f(x_{j_i}, S) - f'_B(x_{j_i}, S) )_+ ≤ ( Σ_{i=1}^k c_i^p )^{1/p} ( Σ_{j∈K} |f(x_j, S) - f_B(x_j, S)|^q )^{1/q} + ( Σ_{i=1}^k c_i ) sup_{j∉K} ( f(x_j, S) - f'_B(x_j, S) )_+.

Combining the last three displays and applying Hölder's inequality once more to the resulting two-term sum yields the desired bound.

The easiest way to interpret this bound is still to take p = q = 2. Intuitively, the bound says the following: we should estimate the top ranked items using least squares; for the other items, we do not have to estimate their conditional expectations very accurately. The DCG score is not affected as long as we do not over-estimate their conditional expectations to such a degree that some of these items move near the top of the rank-list. This is a very important difference between this bound and Theorem 2, which assumes that we estimate the conditional expectation uniformly well.

The bound in Theorem 4 can still be refined, but the resulting inequalities become more complicated, so we do not include such bounds in this paper. Like Theorem 4, such refined bounds show that we do not have to estimate the conditional expectation uniformly well. We present a simple example as illustration.

Proposition 2 Consider m = 3 and S = {x_1, x_2, x_3}. Let c_1 = 2, c_2 = 1, c_3 = 0, and f_B(x_1, S) = f_B(x_2, S) = 1, f_B(x_3, S) = 0. Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. Then

DCG(r_B, S) - DCG(r_f, S) ≤ 2 |f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| + |f(x_2, S) - f_B(x_2, S)|.

The coefficients on the right hand side cannot be improved.

Proof Note that f is suboptimal only when either f(x_3, S) ≥ f(x_1, S) or f(x_3, S) ≥ f(x_2, S). This gives the bound

DCG(r_B, S) - DCG(r_f, S)
  ≤ I( f(x_3, S) ≥ f(x_1, S) ) + I( f(x_3, S) ≥ f(x_2, S) )
  ≤ I( |f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| ≥ 1 ) + I( |f(x_3, S) - f_B(x_3, S)| + |f(x_2, S) - f_B(x_2, S)| ≥ 1 )
  ≤ 2 |f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| + |f(x_2, S) - f_B(x_2, S)|.

To see that the coefficients cannot be improved, simply note that the bound is tight when either f(x_1, S) = f(x_2, S) = f(x_3, S) = 1, or f(x_1, S) = 1 and f(x_2, S) = f(x_3, S) = 0, or f(x_2, S) = 1 and f(x_1, S) = f(x_3, S) = 0.

The proposition implies that not all errors should be weighted equally: in the example, getting x_3 right is more important than getting x_1 or x_2 right. Conceptually, Theorem 4 and Proposition 2 show the following. Since we are interested in the top portion of the rank-list, we only need to estimate the top rated items accurately, while preventing the bottom items from being over-estimated (their conditional expectations do not have to be estimated accurately). For ranking purposes, some points are more important than others, so we should bias our learning method to produce more accurate conditional expectation estimates at the more important points.

7 Importance Weighted Regression

The key message from the analysis in Section 6 is that we do not have to estimate the conditional expectations equally well for all items. In particular, since we are interested in the top portion of the rank-list, Theorem 4 implies that we need to estimate the top portion more accurately than the bottom portion. Motivated by this analysis, we consider a regression based training method that solves the DCG optimization problem but weights different points differently according to their importance.

We shall not discuss the implementation details of modeling the function f(x, S), which are beyond the scope of this paper. One simple model is to assume the form f(x, S) = f(x); Section 4 discussed the validity of such models. For example, this model is reasonable if we assume that for each x ∈ S and the corresponding y we have E_{y|(x,S)} y = E_{y|x} y (see Proposition 1).

Let F be a function space that contains functions X × S → R. We draw n sets S_1,...,S_n randomly, where S_i = {x_{i,1},...,x_{i,m}}, with the corresponding grades {y_{i,j}}_j = {y_{i,1},...,y_{i,m}}.

Based on Theorem 2, the simple least squares regression (7) can be used to solve the subset ranking problem. However, this direct regression method is not adequate for many practical problems such as web-search, for which there are many items to rank (that is, m is large) but only the top ranked pages are important. The reason is that the method pays equal attention to relevant and irrelevant pages. In reality, one should pay more attention to the top-ranked (relevant) pages; the grades of lower ranked pages do not need to be estimated accurately, as long as we do not over-estimate them so that these pages appear in the top ranked positions. This intuition is captured by Theorem 4 and Proposition 2, which motivate the following alternative training method:

f̂ = arg min_{f∈F} (1/n) Σ_{i=1}^n L(f, S_i, {y_{i,j}}_j),   (11)

where for S = {x_1,...,x_m}, with the corresponding {y_j}_j, we use the importance weighted regression loss

L(f, S, {y_j}_j) = Σ_{j=1}^m w(x_j, S) ( f(x_j, S) - y_j )^2 + u sup_j w'(x_j, S) ( f(x_j, S) - δ(x_j, S) )_+^2,   (12)

where u is a non-negative parameter. A variation of this method is used to optimize the production system of Yahoo's internet search engine (some aspects of the implementation were covered in [10]). The detailed implementation and parameter choices are trade secrets of Yahoo which we cannot completely disclose here; they are also not relevant for the purpose of this paper. In the following, however, we briefly explain the intuition behind (12) using Theorem 4, together with some practical considerations.

The weight function w(x_j, S) in (12) is chosen so that it focuses only on the most important examples (the weight is set to zero for pages that we know are irrelevant). This part of the formulation corresponds to the first part of the bound in Theorem 4 (in that case, we choose w(x_j, S) to be one for the top part of the examples, with index set K, and zero otherwise). The usefulness of non-uniform weighting is also demonstrated by Proposition 2. The specific choice of the weight function involves various engineering considerations that are not important for the purpose of this paper. In general, if there are many items with similar grades, then it is beneficial to give each of the similar items a smaller weight.

In the second part of (12), we choose w'(x_j, S) so that it focuses on the examples not covered by w(x_j, S). In particular, it covers only those data points x_j that are low-ranked with high confidence. We choose δ(x_j, S) to be a small threshold that can be regarded as a lower bound of f'_B(x_j, S) in Theorem 4, such as γ f_B(x_{j*_k}, S). An important observation is that although m is often very large, the number of points for which w(x_j, S) is nonzero is often small. Moreover, (f(x_j, S) - δ(x_j, S))_+ is nonzero only when f(x_j, S) > δ(x_j, S), and in practice the number of such points is usually small (that is, most irrelevant pages will be predicted as irrelevant). Therefore the formulation completely ignores those low-ranked data points for which f(x_j, S) ≤ δ(x_j, S), which makes the learning procedure computationally efficient even when m is large. The analogy here is support vector machines, where only the support vectors are useful in the learning formulation: one can completely ignore samples corresponding to non support vectors.

In a practical implementation of (12), we can use an iterative refinement scheme, in which we start with a small number of samples included in the first part of (12), and then put the low-ranked points into the second part of (12) only when their ranking scores exceed δ(x_j, S).

In fact, one may also put these points into the first part of (12), so that the second part always has zero value (which makes the implementation simpler). In this sense, the formulation in (12) suggests a selective sampling scheme, in which we pay special attention to important and highly ranked data points, while completely ignoring most of the low ranked data points. In this regard, with appropriately chosen w(x, S), the second part of (12) can be ignored altogether.

The empirical risk minimization method in (11) approximately minimizes the criterion

Q(f) = E_S L(f, S),   (13)

where

L(f, S) = E_{{y_j}_j|S} L(f, S, {y_j}_j) = Σ_{j=1}^m w(x_j, S) E_{y_j|(x_j,S)} ( f(x_j, S) - y_j )^2 + u sup_j w'(x_j, S) ( f(x_j, S) - δ(x_j, S) )_+^2.

The following theorem shows that under appropriate assumptions, approximate minimization of (13) leads to approximate optimization of the DCG.

Theorem 5 Assume that c_i = 0 for all i > k, and assume that the following conditions hold for each S = {x_1,...,x_m}:

- Let the optimal ranking order be J_B = [j*_1,...,j*_m], where f_B(x_{j*_i}, S) is arranged in non-increasing order. There exists γ ∈ [0, 1) such that δ(x_j, S) ≤ γ f_B(x_{j*_k}, S).
- For all j with f_B(x_j, S) > δ(x_j, S), we have w(x_j, S) ≥ 1.
- w'(x_j, S) = I(w(x_j, S) < 1).

Then the following results hold:

- A function f* minimizes (13) if f*(x_j, S) = f_B(x_j, S) when w(x_j, S) > 0 and f*(x_j, S) ≤ δ(x_j, S) otherwise.
- For all f, let r_f be the induced ranking function, and let r_B be the optimal Bayes ranking function. Then

DCG(r_f) ≥ DCG(r_B) - C(γ, u) ( Q(f) - Q(f*) )^{1/2},

where C(γ, u) = C_2(γ, u) is the constant of Theorem 4 with p = q = 2.

Proof Note that if f_B(x_j, S) > δ(x_j, S), then w(x_j, S) ≥ 1 and w'(x_j, S) = 0. Therefore the minimizer f* should minimize E_{y_j|(x_j,S)} ( f(x_j, S) - y_j )^2 at such a point, which is achieved at f*(x_j, S) = f_B(x_j, S). If f_B(x_j, S) ≤ δ(x_j, S), then there are two cases:

- w(x_j, S) > 0: f*(x_j, S) should minimize E_{y_j|(x_j,S)} ( f(x_j, S) - y_j )^2, achieved at f*(x_j, S) = f_B(x_j, S) ≤ δ(x_j, S).
- w(x_j, S) = 0: f*(x_j, S) only needs to keep ( f(x_j, S) - δ(x_j, S) )_+ at zero, which is achieved by any f*(x_j, S) ≤ δ(x_j, S).

This proves the first claim. For each S, denote by K the set of x_j such that w'(x_j, S) = 0. The second claim follows from the derivation

Q(f) - Q(f*) = E_S ( L(f, S) - L(f*, S) )
  = E_S [ Σ_{j=1}^m w(x_j, S) ( f(x_j, S) - f_B(x_j, S) )^2 + u sup_j w'(x_j, S) ( f(x_j, S) - δ(x_j, S) )_+^2 ]
  ≥ E_S [ Σ_{j∈K} ( f_B(x_j, S) - f(x_j, S) )^2 + u sup_{j∉K} ( f(x_j, S) - δ(x_j, S) )_+^2 ]
  ≥ E_S ( DCG(r_B, S) - DCG(r_f, S) )^2 / C(γ, u)^2
  ≥ ( DCG(r_B) - DCG(r_f) )^2 / C(γ, u)^2.

The second inequality follows from Theorem 4 with p = q = 2 (note that δ(x_j, S) ≤ γ f_B(x_{j*_k}, S) ≤ f'_B(x_j, S), since the grades are non-negative), and the last inequality is Jensen's inequality.

8 Generalization Analysis

In this section, we analyze the generalization performance of (11). The analysis depends on the underlying function class F. In the literature, one often employs a linear function class with an appropriate regularization condition, such as L_1 or L_2 regularization of the linear weight coefficients. Yahoo's machine learning ranking system employs the gradient boosting method described in [14], which is closely related to L_1 regularization, analyzed in [5, 18, 19]. Although the consistency of boosting for standard least squares regression is known (for example, see [6, 30]), such analysis does not deal with the situation where m is large, and thus is not suitable for analyzing the ranking problem considered in this paper. In this section, we consider a linear function class with L_2 regularization, which is closely related to kernel methods. We employ a relatively simple stability analysis which is suitable for L_2 regularization. Our result does not depend on m explicitly, which is important for large scale ranking problems such as web-search. Although similar results can be obtained for L_1 regularization or gradient boosting, the analysis would become much more complicated.

For L_2 regularization, we consider a feature map ψ : X × S → H, where H is a vector space. We denote by w^T v the L_2 inner product of w and v in H. The function class F considered here is of the form

F = { β^T ψ(x, S) : β ∈ H, β^T β ≤ A^2 } ⊆ { X × S → R },   (14)

where the complexity is controlled by the L_2 regularization of the weight vector, β^T β ≤ A^2. We use (S_i = {x_{i,1},...,x_{i,m}}, {y_{i,j}}_j) to denote a sample point indexed by i. Note that for each sample i, we do not need to assume that the y_{i,j} are independently generated for different j. Using (14), the importance weighted regression in (11) becomes the following regularized empirical risk minimization method:

f_β̂(x, S) = β̂^T ψ(x, S),
β̂ = arg min_{β∈H} [ (1/n) Σ_{i=1}^n L(β, S_i, {y_{i,j}}_j) + λ β^T β ],   (15)
L(β, S, {y_j}_j) = Σ_{j=1}^m w(x_j, S) ( β^T ψ(x_j, S) - y_j )^2 + u sup_j w'(x_j, S) ( β^T ψ(x_j, S) - δ(x_j, S) )_+^2.

In this method we replace the hard regularization in (14), with tuning parameter A, by soft regularization with tuning parameter λ, which is computationally more convenient. The following result is an expected generalization bound for the L_2-regularized empirical risk minimization method (15); it uses the stability analysis in [7], and the proof is given in Appendix A.

Theorem 6 Let M = sup_{x,S} ||ψ(x, S)|| and W = sup_S [ Σ_{x_j∈S} w(x_j, S) + u sup_{x_j∈S} w'(x_j, S) ]. Let f_β̂ be the estimator defined in (15). Then we have

E_{(S_i, {y_{i,j}}_j)_{i=1}^n} Q(f_β̂) ≤ ( 1 + W M^2 / (λ n) )^2 inf_{β∈H} [ Q(f_β) + λ β^T β ].

We have paid special attention to the properties of (15). In particular, the quantity W is usually much smaller than m, which is large for web-search applications. The point we would like to emphasize is that even though m is large, the estimation complexity is only affected by the top portion of the rank-list. If the estimation of the lowest ranked items is relatively easy (as is generally the case), then the learning complexity does not depend on the majority of items near the bottom of the rank-list. Combining Theorem 5 and Theorem 6 gives the following bound.

Theorem 7 Suppose the conditions of Theorem 5 and Theorem 6 hold, with f* minimizing (13). Let f̂ = f_β̂. Then

E_{(S_i, {y_{i,j}}_j)_{i=1}^n} DCG(r_f̂) ≥ DCG(r_B) - C(γ, u) [ ( 1 + W M^2 / (λ n) )^2 inf_{β∈H} ( Q(f_β) + λ β^T β ) - Q(f*) ]^{1/2}.

Proof From Theorem 5, we obtain

E_{(S_i, {y_{i,j}}_j)_{i=1}^n} DCG(r_f̂) ≥ DCG(r_B) - C(γ, u) E_{(S_i, {y_{i,j}}_j)_{i=1}^n} ( Q(f̂) - Q(f*) )^{1/2}
  ≥ DCG(r_B) - C(γ, u) [ E_{(S_i, {y_{i,j}}_j)_{i=1}^n} Q(f_β̂) - Q(f*) ]^{1/2}.

The second inequality is a consequence of Jensen's inequality. Applying Theorem 6 now gives the desired bound.

The theorem implies that if Q(f*) = inf_{β∈H} Q(f_β), then as n → ∞ we can let λ → 0 and λ n → ∞, so that the second term on the right hand side vanishes in the large sample limit. Therefore, asymptotically, we can achieve the optimal DCG score. This implies the consistency of regression based learning methods for the DCG criterion.
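To make (12) and (15) concrete, here is a minimal sketch of the importance weighted, L_2-regularized objective for a linear scoring function f_β(x, S) = β^T ψ(x, S), minimized by plain gradient descent with a numerical gradient. The feature map, the weights w and w', the threshold δ, and all constants are illustrative assumptions, not the production implementation.

```python
import numpy as np

def subset_loss(beta, Psi, y, w, w_prime, delta, u):
    """Importance weighted loss (12) for one subset S:
    sum_j w_j (beta^T psi_j - y_j)^2 + u * max_j w'_j (beta^T psi_j - delta_j)_+^2."""
    f = Psi @ beta
    sq_part = np.sum(w * (f - y) ** 2)
    hinge = w_prime * np.maximum(0.0, f - delta) ** 2
    return sq_part + u * np.max(hinge)

def objective(beta, data, u, lam):
    """Regularized empirical risk (15): average subset loss plus lam * ||beta||^2."""
    risk = np.mean([subset_loss(beta, *args, u) for args in data])
    return risk + lam * beta @ beta

def numerical_gradient(fun, beta, eps=1e-6):
    g = np.zeros_like(beta)
    for i in range(len(beta)):
        e = np.zeros_like(beta)
        e[i] = eps
        g[i] = (fun(beta + e) - fun(beta - e)) / (2 * eps)
    return g

# toy data: n subsets, each with m items and d features
rng = np.random.default_rng(2)
n, m, d, u, lam = 20, 8, 4, 1.0, 0.01
data = []
for _ in range(n):
    Psi = rng.normal(size=(m, d))                    # rows psi(x_j, S)
    y = np.clip(Psi @ np.array([0.4, 0.1, -0.2, 0.3]) + 0.1 * rng.normal(size=m), 0, 1)
    w = (y > 0.5).astype(float)                      # weight only the "relevant" items
    w_prime = (w < 1).astype(float)                  # w'(x_j, S) = I(w(x_j, S) < 1)
    delta = np.full(m, 0.3)                          # small threshold delta(x_j, S)
    data.append((Psi, y, w, w_prime, delta))

beta = np.zeros(d)
for _ in range(200):                                 # gradient descent on (15)
    beta -= 0.1 * numerical_gradient(lambda b: objective(b, data, u, lam), beta)
print(objective(beta, data, u, lam))
```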


MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

Vapnik-Chervonenkis theory

Vapnik-Chervonenkis theory Vapnk-Chervonenks theory Rs Kondor June 13, 2008 For the purposes of ths lecture, we restrct ourselves to the bnary supervsed batch learnng settng. We assume that we have an nput space X, and an unknown

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

COS 521: Advanced Algorithms Game Theory and Linear Programming

COS 521: Advanced Algorithms Game Theory and Linear Programming COS 521: Advanced Algorthms Game Theory and Lnear Programmng Moses Charkar February 27, 2013 In these notes, we ntroduce some basc concepts n game theory and lnear programmng (LP). We show a connecton

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Difference Equations

Difference Equations Dfference Equatons c Jan Vrbk 1 Bascs Suppose a sequence of numbers, say a 0,a 1,a,a 3,... s defned by a certan general relatonshp between, say, three consecutve values of the sequence, e.g. a + +3a +1

More information

Lecture 20: November 7

Lecture 20: November 7 0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k. THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016 CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

Chapter 8 Indicator Variables

Chapter 8 Indicator Variables Chapter 8 Indcator Varables In general, e explanatory varables n any regresson analyss are assumed to be quanttatve n nature. For example, e varables lke temperature, dstance, age etc. are quanttatve n

More information

Support Vector Machines. Vibhav Gogate The University of Texas at dallas

Support Vector Machines. Vibhav Gogate The University of Texas at dallas Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION Advanced Mathematcal Models & Applcatons Vol.3, No.3, 2018, pp.215-222 ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EUATION

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s

More information

Complete subgraphs in multipartite graphs

Complete subgraphs in multipartite graphs Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

Comparison of Regression Lines

Comparison of Regression Lines STATGRAPHICS Rev. 9/13/2013 Comparson of Regresson Lnes Summary... 1 Data Input... 3 Analyss Summary... 4 Plot of Ftted Model... 6 Condtonal Sums of Squares... 6 Analyss Optons... 7 Forecasts... 8 Confdence

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

MAXIMUM A POSTERIORI TRANSDUCTION

MAXIMUM A POSTERIORI TRANSDUCTION MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,

More information

Some modelling aspects for the Matlab implementation of MMA

Some modelling aspects for the Matlab implementation of MMA Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton

More information

x i1 =1 for all i (the constant ).

x i1 =1 for all i (the constant ). Chapter 5 The Multple Regresson Model Consder an economc model where the dependent varable s a functon of K explanatory varables. The economc model has the form: y = f ( x,x,..., ) xk Approxmate ths by

More information

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced,

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced, FREQUENCY DISTRIBUTIONS Page 1 of 6 I. Introducton 1. The dea of a frequency dstrbuton for sets of observatons wll be ntroduced, together wth some of the mechancs for constructng dstrbutons of data. Then

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 30 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 2 Remedes for multcollnearty Varous technques have

More information

A New Refinement of Jacobi Method for Solution of Linear System Equations AX=b

A New Refinement of Jacobi Method for Solution of Linear System Equations AX=b Int J Contemp Math Scences, Vol 3, 28, no 17, 819-827 A New Refnement of Jacob Method for Soluton of Lnear System Equatons AX=b F Naem Dafchah Department of Mathematcs, Faculty of Scences Unversty of Gulan,

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution.

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution. Solutons HW #2 Dual of general LP. Fnd the dual functon of the LP mnmze subject to c T x Gx h Ax = b. Gve the dual problem, and make the mplct equalty constrants explct. Soluton. 1. The Lagrangan s L(x,

More information

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals Smultaneous Optmzaton of Berth Allocaton, Quay Crane Assgnment and Quay Crane Schedulng Problems n Contaner Termnals Necat Aras, Yavuz Türkoğulları, Z. Caner Taşkın, Kuban Altınel Abstract In ths work,

More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Winter 2008 CS567 Stochastic Linear/Integer Programming Guest Lecturer: Xu, Huan

Winter 2008 CS567 Stochastic Linear/Integer Programming Guest Lecturer: Xu, Huan Wnter 2008 CS567 Stochastc Lnear/Integer Programmng Guest Lecturer: Xu, Huan Class 2: More Modelng Examples 1 Capacty Expanson Capacty expanson models optmal choces of the tmng and levels of nvestments

More information

Composite Hypotheses testing

Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

Chapter 5 Multilevel Models

Chapter 5 Multilevel Models Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level

More information

Discussion of Extensions of the Gauss-Markov Theorem to the Case of Stochastic Regression Coefficients Ed Stanek

Discussion of Extensions of the Gauss-Markov Theorem to the Case of Stochastic Regression Coefficients Ed Stanek Dscusson of Extensons of the Gauss-arkov Theorem to the Case of Stochastc Regresson Coeffcents Ed Stanek Introducton Pfeffermann (984 dscusses extensons to the Gauss-arkov Theorem n settngs where regresson

More information

ECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics

ECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics ECOOMICS 35*-A Md-Term Exam -- Fall Term 000 Page of 3 pages QUEE'S UIVERSITY AT KIGSTO Department of Economcs ECOOMICS 35* - Secton A Introductory Econometrcs Fall Term 000 MID-TERM EAM ASWERS MG Abbott

More information

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem.

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem. prnceton u. sp 02 cos 598B: algorthms and complexty Lecture 20: Lft and Project, SDP Dualty Lecturer: Sanjeev Arora Scrbe:Yury Makarychev Today we wll study the Lft and Project method. Then we wll prove

More information

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

Economics 130. Lecture 4 Simple Linear Regression Continued

Economics 130. Lecture 4 Simple Linear Regression Continued Economcs 130 Lecture 4 Contnued Readngs for Week 4 Text, Chapter and 3. We contnue wth addressng our second ssue + add n how we evaluate these relatonshps: Where do we get data to do ths analyss? How do

More information

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015 CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research

More information

3.1 ML and Empirical Distribution

3.1 ML and Empirical Distribution 67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland

More information

Convergence of random processes

Convergence of random processes DS-GA 12 Lecture notes 6 Fall 216 Convergence of random processes 1 Introducton In these notes we study convergence of dscrete random processes. Ths allows to characterze phenomena such as the law of large

More information

Negative Binomial Regression

Negative Binomial Regression STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...

More information