Statistical Analysis of Bayes Optimal Subset Ranking
David Cossock, Yahoo Inc., Santa Clara, CA, USA
Tong Zhang, Yahoo Inc., New York City, USA

Abstract

The ranking problem has become increasingly important in modern applications of statistical methods in automated decision making systems. In particular, we consider a formulation of the statistical ranking problem which we call subset ranking, and focus on the DCG (discounted cumulated gain) criterion that measures the quality of items near the top of the rank-list. Similar to error minimization for binary classification, direct optimization of natural ranking criteria such as DCG leads to non-convex optimization problems that can be NP-hard. Therefore a computationally more tractable approach is needed. We present bounds that relate the approximate optimization of DCG to the approximate minimization of certain regression errors. These bounds justify the use of convex learning formulations for solving the subset ranking problem. The resulting estimation methods are not conventional, in that we focus on the estimation quality in the top portion of the rank-list. We further investigate the generalization ability of these formulations. Under appropriate conditions, the consistency of the estimation schemes with respect to the DCG metric can be derived.

1 Introduction

We consider the general ranking problem, where a computer system is required to rank a set of items based on a given input. In such applications, the system often needs to present only a few top-ranked items to the user. Therefore the quality of the system output is determined by the performance near the top of its rank-list. Ranking is especially important in electronic commerce and many internet applications, where personalization and information-based decision making are critical to the success of such businesses. The decision making process can often be posed as a problem of selecting top candidates from a set of potential alternatives, leading to a conditional ranking problem.
For example, in a recommender system, the computer is asked to choose a few items a user is most likely to buy based on the user's profile and buying history. The selected items are then presented to the user as recommendations. Another important example that affects millions of people every day is the internet search problem, where the user presents a query to the search engine, and the search engine then selects a few web-pages that are most relevant to the query from the whole web. The quality of a search engine is largely determined by the top-ranked results the search engine can display on the first page. Internet search is the main motivation of this theoretical study, although the model presented here can be useful for many other applications. For example, another ranking
problem is ad placement in a web-page (either a search result page or some content page) according to revenue-generating potential. Since for search and many other ranking problems we are only interested in the quality of the top choices, the evaluation of the system output is different from many traditional error metrics such as classification error. In this setting, a useful figure of merit should focus on the top portion of the rank-list. To our knowledge, this characteristic of practical ranking problems has not been carefully explored in earlier studies (except for a recent paper [3], which also touched on this issue). The purpose of this paper is to develop some theoretical results for converting a ranking problem into convex optimization problems that can be efficiently solved. The resulting formulation focuses on the quality of the top ranked results. The theory can be regarded as an extension of related theory for convex risk minimization formulations for classification, which has drawn much attention recently in the statistical learning literature [4, 18, 9, 8, 4, 5].

We organize the paper as follows. Section 2 discusses earlier work in statistics and machine learning on global and pair-wise ranking. Section 3 introduces the subset ranking problem. We define two ranking metrics: one is the DCG measure which we focus on in this paper, and the other is a measure that counts the number of correctly ranked pairs. The latter has been studied recently by several authors in the context of pair-wise preference learning. Section 4 investigates the relationship of subset ranking and global ranking. Section 5 introduces some basic estimation methods for ranking. This paper focuses on the least squares regression based formulation. Section 6 contains the main theoretical results of this paper, where we show that the approximate minimization of certain regression errors leads to the approximate optimization of the ranking metrics defined earlier. This implies that asymptotically the non-convex ranking problem can be solved using regression methods that are convex.
Section 7 presents the regression learning formulation derived from the theoretical results in Section 6. Similar methods are currently used to optimize Yahoo's production search engine. Section 8 studies the generalization ability of regression learning, where we focus on the L2-regularization approach. Together with the earlier theoretical results, we can establish the consistency of regression based ranking under appropriate conditions.

2 Ranking and Pair-wise Preference Learning

The traditional prediction problem in statistical machine learning assumes that we observe an input vector q ∈ Q, so as to predict an unobserved output p ∈ P. However, in a ranking problem, if we assume P = {1, ..., m} contains m possible values, then instead of predicting a value in P, we predict a permutation of P that gives an optimal ordering of P. That is, if we denote by P! the set of permutations of P, then the goal is to predict an output in P!. There are two fundamental issues: first, how to measure the quality of a ranking; second, how to learn a good ranking procedure from historical data. At first sight, it may seem that we can simply cast the ranking problem as an ordinary prediction problem where the output space becomes P!. However, the number of permutations in P! is m!, which can be extremely large even for small m. Therefore it is not practical to solve the ranking problem directly without imposing certain structures on the search space. Moreover, in practice, given a training point q ∈ Q, we are generally not given an optimal permutation in P! as the observed output. Instead, we may observe another form of output that typically infers the optimal ranking but may contain extra information as well. The training procedure should take advantage of such information.
A common idea to generate an optimal permutation in P! is to use a scoring function that takes a pair (q, p) in Q × P and maps it to a real valued number r(q, p). For each q, the predicted permutation in P! induced by this scoring function is defined as the ordering of p ∈ P sorted with non-increasing value r(q, p). This is the method we will focus on in this paper.

Although the ranking problem has received considerable interest in machine learning recently due to its important applications in modern automated information processing systems, the problem has not been extensively studied in the traditional statistical literature. A relevant statistical model is ordinal regression [20]. In this model, we are still interested in predicting a single output. We redefine the input space as X = Q × P, and for each x, we observe an output value y ∈ Y. Moreover, we assume that the values in Y = {1, ..., L} are ordered, and the cumulative probability P(y ≤ j | x) (j = 1, ..., L) has the form γ(P(y ≤ j | x)) = θ_j + g_β(x). In this model, both γ(·) and g_β(·) have known functional forms, and θ and β are model parameters. Note that the ordinal regression model induces a stochastic preference relationship on the input space X. Consider two samples (x_1, y_1) and (x_2, y_2) on X × Y. We say x_1 ≺ x_2 if and only if y_1 < y_2. This is a classification problem that takes a pair of inputs x_1 and x_2 and tries to predict whether x_1 ≺ x_2 or not (that is, whether the corresponding outputs satisfy y_1 < y_2 or not). In this formulation, the optimal prediction rule to minimize classification error is induced by the ordering of g_β(x) on X, because if g_β(x_1) < g_β(x_2), then P(y_1 < y_2) > 0.5 (based on the ordinal regression model), which is consistent with the Bayes rule. Motivated by this observation, an SVM ranking method was proposed in [16]. The idea is to reformulate ordinal regression as a model to learn a preference relationship on the input space X, which can be learned using pair-wise classification. Given the parameter β̂ learned from training data, the scoring function is simply r(q, p) = g_β̂(x).
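The sort-by-score rule just described can be sketched in a few lines (a minimal illustration; the item list and score values are hypothetical, not from the paper):

```python
# A scoring function r induces a predicted permutation: sort the candidates
# for a fixed query in non-increasing order of r(q, p).

def rank_by_score(items, score):
    """Return items ordered by non-increasing score (ties keep input order,
    since Python's sort is stable)."""
    return sorted(items, key=score, reverse=True)

# Hypothetical scorer for a fixed query: here just a stored relevance estimate.
score = lambda p: p["relevance_estimate"]

items = [{"id": "p1", "relevance_estimate": 0.2},
         {"id": "p2", "relevance_estimate": 0.9},
         {"id": "p3", "relevance_estimate": 0.5}]

ranked = rank_by_score(items, score)
print([p["id"] for p in ranked])  # ['p2', 'p3', 'p1']
```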
The pair-wise preference learning model has become a major trend for ranking in the machine learning literature. For example, in addition to SVM, a similar method based on AdaBoost was proposed in [13]. The idea was also used in optimizing the Microsoft web-search system [7]. A number of researchers have worked on the theoretical analysis of ranking using the pair-wise ranking model. The criterion is to minimize the error of pair-wise preference prediction when we draw two pairs x_1 and x_2 randomly from the input space X. That is, given a scoring function g : X → R, the ranking loss is:

E_{(X_1,Y_1)} E_{(X_2,Y_2)} [I(Y_1 < Y_2) I(g(X_1) ≥ g(X_2)) + I(Y_1 > Y_2) I(g(X_1) ≤ g(X_2))]   (1)
= E_{X_1,X_2} [P(Y_1 < Y_2 | X_1, X_2) I(g(X_1) ≥ g(X_2)) + P(Y_1 > Y_2 | X_1, X_2) I(g(X_1) ≤ g(X_2))],

where I(·) denotes the indicator function. For binary output y ∈ {0, 1}, it is known that this metric is equivalent, up to a scaling, to the AUC measure (area under ROC) for binary classifiers, and it is closely related to the Mann-Whitney-Wilcoxon statistic [15]. In the literature, theoretical analysis has focused mainly on this ranking criterion (for example, see [1, 2, 9, 22]).

The pair-wise preference learning model has some limitations. First, although the criterion in (1) measures the global pair-wise ranking quality, it is not the best metric to evaluate practical ranking systems. Note that in most applications, a system does not need to rank all data-pairs, but only a subset of them each time. Moreover, typically only the top few positions of the rank-list are of importance. Another issue with the pair-wise preference learning model is that the scoring function is usually learned by minimizing a convex relaxation of the pair-wise classification error, similar to large margin classification. However, if the preference relationship is stochastic, then an important question that should be addressed is whether such a learning algorithm leads to a Bayes optimal ranking function in the large sample limit. Unfortunately this is difficult to analyze for
general risk minimization formulations if the decision rule is induced by a single-variable scoring function of the form r(x). The problem of Bayes optimality in the pair-wise learning model was partially investigated in [9], but with a decision rule of a general form r(x_1, x_2): we predict x_1 ≺ x_2 if r(x_1, x_2) < 0. To our knowledge, this method is not widely used in practice because a naive application can lead to contradictions: we may predict r(x_1, x_2) < 0, r(x_2, x_3) < 0, and r(x_3, x_1) < 0. Therefore in order to use such a method effectively for ranking, there needs to be a mechanism to resolve such contradictions. For example, one possibility is to define a scoring function f(x) = Σ_{x′} r(x, x′), and rank the data accordingly. Another possibility is to use a sorting method (such as quick-sort) directly with the comparison function given by r(x_1, x_2). However, in order to show that such contradiction resolution methods are well behaved asymptotically, it is necessary to analyze the corresponding error. We are not aware of any study on such error analysis.

3 Subset Ranking Model

The global pair-wise preference learning model in Section 2 has some limitations. In this paper, we shall describe a model more relevant to practical ranking systems such as web-search. We first describe the model, and then use search as an example to illustrate it.

3.1 Problem definition

Let X be the space of observable features, and Z be the space of variables that are not necessarily directly used in the deployed system. Denote by S the set of all finite subsets of X that may possibly contain redundant elements. Let y be a non-negative real-valued variable that corresponds to the quality of x ∈ X. Assume also that we are given a (measurable) feature-map F that takes each z ∈ Z and produces a finite subset F(z) = S = {x_1, ..., x_m} ∈ S. Note that the order of the items in the set is of no importance; the numerical subscripts are for notational purposes only, so that permutations can be more conveniently defined.
In subset ranking, we randomly draw a variable z ∈ Z according to some underlying distribution on Z. We then create a finite subset F(z) = S = {x_1, ..., x_m} ∈ S consisting of feature vectors x_j in X, and at the same time a set of grades {y_j} = {y_1, ..., y_m} such that for each j, y_j corresponds to x_j. Whether the size m of the set should be a random variable is of no importance in our analysis; in this paper we assume that it is fixed for simplicity. Based on the observed subset S = {x_1, ..., x_m}, the system is required to output an ordering (ranking) of the items in the set. Using our notation, this ordering can be represented as a permutation J = [j_1, ..., j_m] of [1, ..., m]. Our goal is to produce a permutation such that y_{j_i} is in decreasing order for i = 1, ..., m. In practical applications, each available position i can be associated with a weight c_i that measures the importance of that position. Now, given the grades y_j (j = 1, ..., m), a very natural measure of the quality of the rank-list J = [j_1, ..., j_m] is the following weighted sum:

DCG(J, [y_j]) = Σ_{i=1}^m c_i y_{j_i}.

We assume that {c_i} is a pre-defined sequence of non-increasing non-negative discount factors that may or may not depend on S. This metric, described in [17] as DCG (discounted cumulated gain), is one of the main metrics used in the evaluation of internet search systems, including the
production system of Yahoo and that of Microsoft [7]. In this context, a typical choice is to set c_i = 1/log2(1 + i) when i ≤ k and c_i = 0 when i > k, for some k. One may also use other choices, such as letting c_i be the probability of a user looking at (or clicking) the result at position i. Although introduced in the context of web-search, the DCG criterion is clearly natural for many other ranking applications such as recommender systems. Moreover, by choosing a decaying sequence of c_i, this measure naturally focuses on the quality of the top portion of the rank-list. This is in contrast with the pair-wise error criterion in (1), which does not distinguish the top portion of the rank-list from the bottom portion.

For the DCG criterion, our goal is to train a ranking function r that can take a subset S ∈ S as input, and produce an output permutation J = r(S) such that the expected DCG is as large as possible:

DCG(r) = E_S DCG(r, S),   (2)

where

DCG(r, S) = Σ_{i=1}^m c_i E_{y_{j_i} | (x_{j_i}, S)} y_{j_i}.   (3)

The global pair-wise preference learning metric (1) can be adapted to the subset ranking setting. We may consider the following weighted total of correctly ranked pairs minus incorrectly ranked pairs:

T(J, [y_j]) = (2 / (m(m−1))) Σ_{i=1}^{m−1} Σ_{i′=i+1}^{m} (y_{j_i} − y_{j_{i′}}).

If the output label y takes binary values, and the subset S = X is global (we may assume that it is finite), then this metric is equivalent to (1). Although we pay special attention to the DCG metric, we shall also include some analysis of the T criterion for completeness. Similar to (2) and (3), we can define the following quantities:

T(r) = E_S T(r, S),   (4)

where

T(r, S) = (2 / (m(m−1))) Σ_{i=1}^{m−1} Σ_{i′=i+1}^{m} (E_{y_{j_i} | (x_{j_i}, S)} y_{j_i} − E_{y_{j_{i′}} | (x_{j_{i′}}, S)} y_{j_{i′}}).   (5)

Similar to the concept of the Bayes classifier in classification, we can define the Bayes ranking function that optimizes the DCG and T measures.
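Both metrics can be computed directly from their definitions. Below is a small sketch (the grades are made up; the discount c_i = 1/log2(1+i) is the typical web-search choice mentioned above):

```python
import math

def dcg(ranked_grades, k=10):
    """DCG(J, [y_j]) = sum_i c_i * y_{j_i}, with c_i = 1/log2(1+i) for
    positions i <= k (1-based) and c_i = 0 beyond the truncation level k."""
    return sum(y / math.log2(1 + i)
               for i, y in enumerate(ranked_grades[:k], start=1))

def t_metric(ranked_grades):
    """T(J, [y_j]) = (2/(m(m-1))) * sum_{i < i'} (y_{j_i} - y_{j_{i'}}):
    correctly ordered pairs add, incorrectly ordered pairs subtract."""
    m = len(ranked_grades)
    total = sum(ranked_grades[i] - ranked_grades[j]
                for i in range(m - 1) for j in range(i + 1, m))
    return 2.0 * total / (m * (m - 1))

grades = [3, 0, 2, 1]                 # grades in the order the system ranked them
ideal = sorted(grades, reverse=True)  # best achievable ordering of the same grades
print(dcg(grades) <= dcg(ideal))          # True
print(t_metric(grades) <= t_metric(ideal))  # True
```

Sorting the grades in decreasing order maximizes both quantities, which is the content of Theorem 1 below.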
Based on the conditional formulations in (3) and (5), we have the following result:

Theorem 1 Given a set S ∈ S, for each x_j ∈ S, we define the Bayes-scoring function as

f_B(x_j, S) = E_{y_j | (x_j, S)} y_j.

An optimal Bayes ranking function r_B(S) that maximizes (5) returns a rank list J = [j_1, ..., j_m] such that f_B(x_{j_i}, S) is in descending order: f_B(x_{j_1}, S) ≥ f_B(x_{j_2}, S) ≥ ... ≥ f_B(x_{j_m}, S). An optimal Bayes ranking function r_B(S) that maximizes (3) returns a rank list J = [j_1, ..., j_m] such that c_k > c_{k′} implies f_B(x_{j_k}, S) ≥ f_B(x_{j_{k′}}, S).
Proof Consider any k, k′ ∈ {1, ..., m}. Define J′ = [j′_1, ..., j′_m], where j′_i = j_i when i ≠ k, k′, and j′_k = j_{k′}, j′_{k′} = j_k. We consider the T-criterion first, and let k′ = k + 1. It is easy to check that

T(J′, S) − T(J, S) = 4 (f_B(x_{j_{k+1}}, S) − f_B(x_{j_k}, S)) / (m(m−1)).

Therefore T(J, S) ≥ T(J′, S) implies that f_B(x_{j_{k+1}}, S) ≤ f_B(x_{j_k}, S). Now consider the DCG-criterion. We have

DCG(J, S) − DCG(J′, S) = (c_k − c_{k′}) (f_B(x_{j_k}, S) − f_B(x_{j_{k′}}, S)).

Now c_k > c_{k′} and DCG(J, S) ≥ DCG(J′, S) imply f_B(x_{j_k}, S) ≥ f_B(x_{j_{k′}}, S).

The result indicates that the optimal ranking can be induced by a single-variable ranking function of the form r(x, S) : X × S → R, where x ∈ S.

3.2 Web-search example

As an example of the subset ranking model, we consider the web-search problem. In this application, a user submits a query q and expects the search engine to return a rank-list of web-pages {p_j} such that a more relevant page is placed before a less relevant page. In a typical internet search engine, the system takes a query and uses a simple ranking formula for the initial filtering, which limits the set of web-pages to an initial pool {p_j} of size m (e.g., m = ). After this initial ranking, the system goes through a more complicated second stage ranking process, which reorders the pool. This critical stage is the focus of this paper. At this step, the system takes the query q, and possibly information from additional resources, to generate a feature vector x_j for each page p_j in the initial pool. The feature vector can encode various types of information, such as the length of query q, the position of p_j in the initial pool, the number of query terms that match the title of p_j, the number of query terms that match the body of p_j, etc. The set of all possible feature vectors x_j is X. The ranking algorithm only observes a list of feature vectors {x_1, ..., x_m} with each x_j ∈ X. A human editor is presented with a pair (q, p_j) and assigns a score s_j on a scale, e.g., 1-5 (least relevant to highly relevant).
The corresponding target value y_j is defined as a transformation^1 of the grade s_j, which maps the grade into the interval [0, 1]. Another possible choice of y_j is to normalize it by multiplying each y_j by a factor such that the optimal DCG is no more than one.

4 Some Computational Aspects of Subset Ranking

Due to the dependency of the conditional probability of y on S, and thus of the optimal ranking function on S, a complete solution of the subset ranking problem can be difficult when m is large. In general, without further assumptions, the optimal Bayes ranking function ranks the items using the Bayes scoring function f_B(x, S) for each x ∈ S. The explicit S dependency of f_B(x, S) is one of the differences that distinguish subset ranking from global ranking. If the size m of S is small, then we may simply represent S as a feature vector [x_1, ..., x_m] (although this may not be the best representation), so that we can learn a function of the form f_B(x_j, S) = f([x_j, x_1, ..., x_m]). Therefore by redefining x̃_j = [x_j, x_1, ..., x_m] ∈ X^{m+1}, we can remove the subset dependency by embedding the original problem into a higher dimensional space. In the general case when m is large, this approach is not practical. Instead of using the

^1 For example, the formula (2^{s_j} − 1)/(2^5 − 1) is used in [7]. Yahoo uses a different transformation based on empirical user surveys.
whole set S as a feature, we can project S onto a lower dimensional space using a feature map g(·), so that f_B(x, S) ≈ f(x, g(S)). By introducing such a set dependent feature vector g(S), we can remove the set dependency by incorporating g(S) into x: this can be achieved by simply redefining x̃ = [x, g(S)]. In this way, f_B(x, S) can be approximated by a function of the form f(x̃).

If the subsets are identical, then subset ranking is equivalent to global ranking. In the more general case where subsets are not identical, the reduction from set-dependent local ranking to set-independent global ranking can be complicated if we do not assume any underlying structure for the problem (we shall discuss such a structure later). However, one may ask: if we only use a set-independent function of the form f(x) as the scoring function, how well can it approximate the Bayes scoring function f_B(x, S), and is it easy to compute such a function f(x)? If the subsets are disjoint (or nearly disjoint), then the effect of f_B(x, S) can be achieved by a global scoring function of the form f(x) exactly (or almost exactly), because x determines S. This can be a good approximation for practical problems, where the feature vectors for different subsets (e.g. queries in web-search) usually do not overlap. If the subsets overlap significantly but are not exactly the same, the problem can be computationally difficult. To see this, we may consider for simplicity that X is finite, each subset contains only two elements, and one is preferred over the other (deterministically). Now in the subset learning model, such a preference relationship x ≻ x′ between two elements x, x′ ∈ X can be denoted by a directed edge from x to x′. In this setting, finding a global scoring function that approximates the optimal set dependent Bayes scoring rule is equivalent to finding a maximum subgraph that is acyclic.
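The graph view above can be made concrete: a global scoring function consistent with all deterministic two-element preferences exists exactly when the preference graph has no directed cycle. A small sketch with hypothetical preferences (cycle detection is easy; the hard part, finding a maximum acyclic subgraph when cycles exist, is what makes the problem NP-hard):

```python
def has_cycle(edges, nodes):
    """Detect a directed cycle by depth-first search (white/gray/black coloring)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in nodes}
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)

    def visit(u):
        color[u] = GRAY          # u is on the current DFS path
        for v in adj[u]:
            if color[v] == GRAY or (color[v] == WHITE and visit(v)):
                return True      # back edge found: directed cycle
        color[u] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in nodes)

# Preferences a > b, b > c, c > a form a 3-cycle, so no global scoring
# function f with f(a) > f(b) > f(c) > f(a) can exist.
print(has_cycle([("a", "b"), ("b", "c"), ("c", "a")], "abc"))  # True
print(has_cycle([("a", "b"), ("b", "c")], "abc"))              # False
```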
In general, this problem is computationally difficult, and known to be NP-hard (an application of similar arguments in ranking can be found in [1, 3]) as well as APX-hard [11]: the class APX consists of problems having an approximation to within 1+c of the optimum for some c. A polynomial time approximation scheme (PTAS) is an algorithm which runs in polynomial time in the instance size (but not necessarily poly(1/ε)) and returns a solution approximate to within 1+ε for any given ε > 0. If any APX-hard problem admits a PTAS, then P=NP.

The above argument implies that without any assumptions, the reduction of the set-dependent Bayes optimal scoring function f_B(x, S) to a set independent function of the form f(x) is difficult. If we are able to incorporate appropriate set dependent features into x, or if the sets do not overlap significantly, then this is computationally feasible. In the ideal case, we can introduce the following definition.

Definition 1 If for every S ∈ S and x, x′ ∈ S, we have f_B(x, S) > f_B(x′, S) if and only if f(x) > f(x′), then we say that f is an optimal rank preserving function.

Clearly, an optimal rank preserving function may not always exist (without using set-dependent features). As a simple example, assume that X = {a, b, c} has three elements, with m = 2, c_1 = 1 and c_2 = 0 in the DCG definition. We observe {y_1 = 1, y_2 = 0} for the set {x_1 = a, x_2 = b}, {y_1 = 1, y_2 = 0} for the set {x_1 = b, x_2 = c}, and {y_1 = 1, y_2 = 0} for the set {x_1 = c, x_2 = a}. If an optimal rank preserving function f exists, then by definition we have f(a) > f(b), f(b) > f(c), and f(c) > f(a), which is impossible. Under appropriate assumptions, the optimal rank preserving function exists. The following result provides a sufficient condition.
Proposition 1 Assume that for each x_j, we observe y_j = a(S) y′_j + b(S), where a(S) ≥ 0 and b(S) are normalization/shifting factors that may depend on S, and {y′_j} is a set of random variables that satisfy

P({y′_j} | S) = E_ξ Π_{j=1}^m P(y′_j | x_j, ξ),

where ξ is a hidden random variable independent of S. Then E_{y′_j | (x_j, S)} y′_j = E_{y′_j | x_j} y′_j. That is, the conditional expectation f(x) = E_{y′ | x} y′ is an optimal rank preserving function.

Proof Observe that E_{y_j | (x_j, S)} y_j = a(S) E_{y′_j | (x_j, S)} y′_j + b(S). Therefore the scoring functions E_{y_j | (x_j, S)} y_j and E_{y′_j | (x_j, S)} y′_j lead to identical rankings. Moreover, marginalizing over the other grades in the subset,

E_{y′_j | (x_j, S)} y′_j = E_ξ ∫ y′_j dP(y′_j | x_j, ξ) = ∫ y′_j dP(y′_j | x_j) = E_{y′_j | x_j} y′_j.

This proves the claim.

This result justifies using an appropriately defined feature function to remove set-dependency. If y′_j is a deterministic function of x_j and ξ, then the result always holds, which implies the optimality of the set-independent conditional expectation. In this case, the optimal global scoring rule gives the optimal Bayes rule for subset ranking. We also note that this equivalence does not require the grade y to be independent of S. In web-search, the model in Proposition 1 has a natural interpretation. Consider a pool of human editors indexed by ξ. For each query q, we randomly pick an editor ξ to grade the set of pages p_j to be ranked, and assume that the grade the editor gives to each page p_j depends only on the pair x_j = (q, p_j). In this case, Proposition 1 can be applied to show that the features x_j are sufficient to determine the optimal Bayes rule.

Proposition 1 (and the discussion thereafter) suggests that regression based learning of the conditional expectation E_{y|x} y is asymptotically optimal under some reasonable assumptions. We call a method that learns such a conditional expectation E_{y|x} y, or a transformation of it, a regression based approach, which is different from the pair-wise preference learning methods used in earlier work.
There are two advantages to using regression: first, the computational complexity is at most O(m) (it can be sub-linear in m with appropriate importance subsampling schemes) instead of O(m²); second, we are able to prove the consistency of such methods under reasonable assumptions. As discussed at the end of Section 2, this issue is more complicated for pair-wise methods. Furthermore, as we will discuss in the next section, some advantages of pair-wise learning can be incorporated into the regression approach by using set-dependent features.

5 Risk Minimization based Estimation Methods

From the previous section, we know that the optimal scoring function is the conditional expectation of the grades y. We now investigate some basic estimation methods for conditional expectation learning.
5.1 Relation to multi-category classification

The subset ranking problem is a generalization of multi-category classification. In the latter case, we observe an input x_0 and are interested in classifying it into one of m classes. Let the output value be k ∈ {1, ..., m}. We encode the input x_0 into m feature vectors {x_1, ..., x_m}, where x_i = [0, ..., 0, x_0, 0, ..., 0] with the i-th component being x_0 and the other components being zeros. We then encode the output k into m values {y_j} such that y_k = 1 and y_j = 0 for j ≠ k. In this setting, we try to find a scoring function f such that f(x_k) > f(x_j) for j ≠ k. Consider the DCG criterion with c_1 = 1 and c_j = 0 when j > 1. Then the classification accuracy is given by the corresponding DCG.

Given any multi-category classification algorithm, one may use it to solve the subset ranking problem as follows. Consider a sample S = [x_1, ..., x_m] as input, and a set of outputs {y_j}. We randomly draw k from 1 to m according to the distribution y_k / Σ_j y_j. We then form another sample with weight Σ_j y_j, which has the vector S̄ = [x_1, ..., x_m] (where order is important) as input, and label y = k ∈ {1, ..., m} as output. This changes the problem formulation into multi-category classification. Since the conditional expectation can be expressed as

E_{y_k | (x_k, S)} y_k = P(y = k | S̄) E_{{y_j} | S} Σ_j y_j,

the order induced by the scoring function E_{y_k | (x_k, S)} y_k is the same as that induced by P(y = k | S̄). Therefore a multi-category classification solver that estimates conditional probability can be used to solve the subset ranking problem. In particular, we may consider a risk minimization based multi-category classification solver for the m-class problem [8, 5] of the following form:

f̂ = arg min_{f ∈ F} Σ_{i=1}^n Φ(f(X_i), Y_i),

where (X_i, Y_i) are training points with Y_i ∈ {1, ..., m}, F is a vector function class that takes values in R^m, and Φ is some risk functional.
Then for ranking with training points (S̄_i, {y_{i,1}, ..., y_{i,m}}) and S̄_i = [x_{i,1}, ..., x_{i,m}], the corresponding learning method becomes

f̂ = arg min_{f ∈ F} Σ_{i=1}^n Σ_{j=1}^m y_{i,j} Φ(f(S̄_i), j),

where the function space F contains a subset of functions {f(S̄) : X^m → R^m} of the form f(S̄) = [f(x_1, S), ..., f(x_m, S)], and S = {x_1, ..., x_m} is the unordered set. An example would be maximum entropy (multi-category logistic regression), which has the loss function

Φ(f(S̄), j) = −f(x_j, S) + ln Σ_{k=1}^m e^{f(x_k, S)}.

5.2 Regression based learning

Since in ranking problems y_{i,j} can take values other than 0 or 1, we can have more general formulations than multi-category classification. In particular, we may consider variations of the following regression based learning method to train a scoring function in F ⊂ {X × S → R}:

f̂ = arg min_{f ∈ F} Σ_{i=1}^n Σ_{j=1}^m φ(f(x_{i,j}, S_i), y_{i,j}),   S_i = {x_{i,1}, ..., x_{i,m}} ∈ S,   (6)
where we assume that φ(a, b) = φ_0(a) + φ_1(a) b + φ_2(b). The estimation formulation is decoupled for each element x_{i,j} in a subset S_i, which makes the problem easier to solve. In this method, each training point ((x_{i,j}, S_i), y_{i,j}) is treated as a single sample (for i = 1, ..., n and j = 1, ..., m). The population version of the risk function is

E_S Σ_{x ∈ S} [φ_0(f(x, S)) + φ_1(f(x, S)) E_{y | (x,S)} y + E_{y | (x,S)} φ_2(y)].

This implies that the optimal population solution is a function that minimizes φ_0(f(x, S)) + φ_1(f(x, S)) E_{y | (x,S)} y, which is a function of E_{y | (x,S)} y. Therefore the estimation method in (6) leads to an estimator of the conditional expectation with a reasonable choice of φ_0(·) and φ_1(·). A simple example is the least squares method, where we pick φ_0(a) = a², φ_1(a) = −2a and φ_2(b) = b². That is, the learning method (6) becomes least squares estimation:

f̂ = arg min_{f ∈ F} Σ_{i=1}^n Σ_{j=1}^m (f(x_{i,j}, S_i) − y_{i,j})².   (7)

This method, and some essential variations which we will introduce later, will be the focus of our analysis. It was shown in [8] that the only loss function with the conditional expectation as the minimizer (for an arbitrary conditional distribution of y) is least squares. However, for practical purposes, we only need to estimate a monotonic transformation of the conditional expectation. For this purpose, we can use additional loss functions of the form (6). In particular, let φ_0(a) be an arbitrary convex function such that φ_0′(a) is a monotone increasing function of a; then we may simply take φ(a, b) = φ_0(a) − ab in (6). The optimal population solution is uniquely determined by φ_0′(f(x, S)) = E_{y | (x,S)} y. A simple example is φ_0(a) = a⁴/4, for which the population optimal solution is f(x, S) = (E_{y | (x,S)} y)^{1/3}. Clearly such a transformation does not affect the ranking. Moreover, in many ranking problems the range of y is bounded, and it is known that additional loss functions can then be used for computing the conditional expectation.
As a simple example, if we assume that y ∈ [0, 1], then the following modified least squares can be used:

f̂ = arg min_{f ∈ F} Σ_{i=1}^n Σ_{j=1}^m [(1 − y_{i,j}) max(0, f(x_{i,j}, S_i))² + y_{i,j} max(0, 1 − f(x_{i,j}, S_i))²].   (8)

One may replace this with other loss functions used for binary classification that estimate conditional probability, such as those discussed in [9]. Although such general formulations might be interesting for certain applications, their advantages over the simpler least squares loss of (7) are not completely certain, and they are more complicated to deal with. Therefore we will not consider such general formulations in this paper, but rather focus on adapting the least squares method in (7) to ranking problems. As we shall see, non-trivial modifications of (7) are necessary to optimize system performance near the top of the rank-list.
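The per-item loss inside (8) is easy to state in code (a direct transcription of the formula, assuming grades y in [0, 1]):

```python
def modified_ls_loss(f, y):
    """Per-item loss from (8): (1 - y) * max(0, f)^2 + y * max(0, 1 - f)^2.
    For y = 0 it only penalizes scores above 0; for y = 1 it only penalizes
    scores below 1, so scores outside the target interval are not pulled back."""
    return (1 - y) * max(0.0, f) ** 2 + y * max(0.0, 1.0 - f) ** 2

# Unlike plain squared error, overshooting the target interval is free:
print(modified_ls_loss(-0.5, 0))  # 0.0  (f <= 0 not penalized when y = 0)
print(modified_ls_loss(1.5, 1))   # 0.0  (f >= 1 not penalized when y = 1)
print(modified_ls_loss(0.5, 0))   # 0.25
print(modified_ls_loss(0.5, 1))   # 0.25
```

This one-sided behavior is one reason such losses are attractive when only the ordering of the scores matters.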
5.3 Pair-wise preference learning

A popular idea in the recent machine learning literature is to pose the ranking problem as a pair-wise preference relationship learning problem (see Section 2). Using this idea, the scoring function for subset ranking can be trained by the following method:

f̂ = arg min_{f ∈ F} Σ_{i=1}^n Σ_{(j,j′) ∈ E_i} φ(f(x_{i,j}, S_i), f(x_{i,j′}, S_i); y_{i,j}, y_{i,j′}),   (9)

where each E_i is a subset of {1, ..., m} × {1, ..., m} such that y_{i,j} < y_{i,j′} for (j, j′) ∈ E_i. For example, we may use a non-increasing monotone function φ_0 and let φ(a_1, a_2; b_1, b_2) = φ_0((a_2 − a_1) − (b_2 − b_1)) or φ(a_1, a_2; b_1, b_2) = (b_2 − b_1) φ_0(a_2 − a_1). Example loss functions include the SVM loss φ_0(x) = max(0, 1 − x) and the AdaBoost loss φ_0(x) = exp(−x) (see [13, 16, 3]). The approach works well if the ranking problem is noise-free (that is, y_{i,j} is deterministic). However, one difficulty with this approach is that if y_{i,j} is stochastic, then the corresponding population estimator from (9) may not be Bayes optimal, unless a more complicated scheme such as [9] is used. It would be interesting to investigate the error of such an approach, but the analysis is beyond the scope of this paper.

One argument used by the advocates of the pair-wise learning formulation is that we do not have to learn an absolute grade judgment (or its expectation), but rather only the relative judgment that one item is better than another. In essence, this means that for each subset S, if we shift each judgment by a constant, the ranking is not affected. If invariance with respect to a set-dependent judgment shift is a desirable property, then it can be incorporated into the regression based model [6]. For example, similar to Proposition 1, we may introduce an explicit set dependent shift feature (which is rank-preserving) into (6):

f̂ = arg min_{f ∈ F} Σ_{i=1}^n min_{b ∈ R} Σ_{j=1}^m φ(f(x_{i,j}, S_i) + b, y_{i,j}).

In particular, for least squares, we have the following method:

f̂ = arg min_{f ∈ F} Σ_{i=1}^n min_{b ∈ R} Σ_{j=1}^m (f(x_{i,j}, S_i) + b − y_{i,j})².   (10)
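For the shifted least squares formulation, the inner minimization over b has a closed form: for squared error, the optimal per-subset shift is the mean residual, b* = mean_j (y_{i,j} − f(x_{i,j}, S_i)). A small sketch with made-up scores and grades, illustrating the judgment-shift invariance discussed above:

```python
def shifted_sq_loss(scores, grades):
    """Inner problem of the shifted least squares method:
    min over b of sum_j (scores[j] + b - grades[j])^2.
    Setting the derivative to zero gives b* = mean(grades - scores)."""
    m = len(scores)
    b = sum(y - f for f, y in zip(scores, grades)) / m
    return b, sum((f + b - y) ** 2 for f, y in zip(scores, grades))

# Shifting every grade in a subset by a constant leaves the loss unchanged,
# which is exactly the invariance the pair-wise formulation argues for.
b1, loss1 = shifted_sq_loss([0.1, 0.7, 0.4], [0.0, 1.0, 0.5])
b2, loss2 = shifted_sq_loss([0.1, 0.7, 0.4], [10.0, 11.0, 10.5])
print(abs(loss1 - loss2) < 1e-9)  # True: only relative grades matter
```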
More generally, we may introduce more sophisticated set dependent features and hierarchical models into the regression formulation, and obtain effects that may not even be easily incorporated into pair-wise models.

6 Convex Surrogate Bounds

The subset ranking problem defined in Section 3 is combinatorial in nature, and is very difficult to solve directly. Since the optimal Bayes ranking rule is given by the conditional expectation, in Section 5 we discussed various formulations to estimate this conditional expectation. In particular, we are interested in least squares regression based methods. In this context, a natural question to ask is: if a scoring function approximately minimizes the regression error, how well can it optimize ranking metrics such as DCG or T? This section provides some theoretical results that relate the optimization of
the ranking metrics defined in Section 3 to the minimization of regression errors. This allows us to design appropriate convex learning formulations that improve the simple least squares methods in (7) and (10).

A scoring function f(x, S) maps each x \in S to a real valued score. It induces a ranking function r_f, which ranks the elements \{x_j\} of S in descending order of f(x_j). We are interested in bounding the DCG performance of r_f compared with that of the optimal Bayes ranking function r_B. This can be regarded as an extension of Theorem 1 that motivates regression based learning.

Theorem 2 Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. Consider a pair p, q \in [1, \infty] such that 1/p + 1/q = 1. We have the following relationship for each S = \{x_1, \dots, x_m\}:

    DCG(r_B, S) - DCG(r_f, S) \le 2 \Big(\sum_i c_i^p\Big)^{1/p} \Big(\sum_{j=1}^m |f(x_j, S) - f_B(x_j, S)|^q\Big)^{1/q}.

Proof Let S = \{x_1, \dots, x_m\}. Let r_f(S) = J = [j_1, \dots, j_m], and let J^{-1} = [l_1, \dots, l_m] be its inverse permutation. Similarly, let r_B(S) = J_B = [j^*_1, \dots, j^*_m], and let J_B^{-1} = [l^*_1, \dots, l^*_m] be its inverse permutation. We have

    DCG(r_f, S) = \sum_i c_i f_B(x_{j_i}, S) = \sum_i c_{l_i} f_B(x_i, S)
      = \sum_i c_{l_i} f(x_i, S) + \sum_i c_{l_i} (f_B(x_i, S) - f(x_i, S))
      \ge \sum_i c_{l^*_i} f(x_i, S) + \sum_i c_{l_i} (f_B(x_i, S) - f(x_i, S))
      = \sum_i c_{l^*_i} f_B(x_i, S) + \sum_i c_{l^*_i} (f(x_i, S) - f_B(x_i, S)) + \sum_i c_{l_i} (f_B(x_i, S) - f(x_i, S))
      \ge DCG(r_B, S) - \sum_i c_{l^*_i} (f_B(x_i, S) - f(x_i, S))_+ - \sum_i c_{l_i} (f(x_i, S) - f_B(x_i, S))_+
      \ge DCG(r_B, S) - 2 \Big(\sum_i c_i^p\Big)^{1/p} \Big(\sum_{j=1}^m |f(x_j, S) - f_B(x_j, S)|^q\Big)^{1/q},

where we used the notation (z)_+ = \max(0, z). The first inequality holds because r_f sorts the items in descending order of f, so that \sum_i c_{l_i} f(x_i, S) is maximized among all permutations; the last inequality applies Hölder's inequality to each of the two error terms.
The above theorem shows that the DCG criterion can be bounded through regression error. Although the theorem applies to an arbitrary pair p and q such that 1/p + 1/q = 1, the most useful case is p = q = 2. This is because in this case, the problem of minimizing \sum_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2 can be directly achieved using the least squares regression in (7). If the regression error goes to zero, then the resulting ranking converges to the optimal DCG. Similarly, we can show the following result for the T criterion.

Theorem 3 Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. We have the following relationship for each S = \{x_1, \dots, x_m\}:

    T(r_B, S) - T(r_f, S) \le 4 \Big( \frac{1}{m} \sum_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2 \Big)^{1/2}.

Proof Let S = \{x_1, \dots, x_m\}. Let r_f(S) = J = [j_1, \dots, j_m], and let r_B(S) = J_B = [j^*_1, \dots, j^*_m]. We have

    T(r_f, S) = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^{m} (f_B(x_{j_i}, S) - f_B(x_{j_{i'}}, S))
      \ge \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^{m} (f(x_{j_i}, S) - f(x_{j_{i'}}, S)) - \frac{2}{m} \sum_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
      \ge \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^{m} (f(x_{j^*_i}, S) - f(x_{j^*_{i'}}, S)) - \frac{2}{m} \sum_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
      \ge \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^{m} (f_B(x_{j^*_i}, S) - f_B(x_{j^*_{i'}}, S)) - \frac{4}{m} \sum_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
      = T(r_B, S) - \frac{4}{m} \sum_{j=1}^m |f(x_j, S) - f_B(x_j, S)|
      \ge T(r_B, S) - 4 \Big( \frac{1}{m} \sum_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2 \Big)^{1/2}.

Here the first and third inequalities replace f_B by f (and back) at a cost of at most (2/m) \sum_j |f(x_j, S) - f_B(x_j, S)| each, since every index appears in exactly m - 1 of the pairs; the second inequality holds because r_f sorts the items in descending order of f, which maximizes the ordered pairwise sum; and the last step is the Cauchy–Schwarz inequality.

The above approximation bounds imply that least squares regression can be used to learn the optimal ranking function. The approximation error converges to zero when f converges to f_B in L_2. However, in general, requiring f to converge to f_B in L_2 is not necessary. More importantly, in real applications we are often only interested in the top portion of the rank-list, and our bounds should reflect this practical consideration. Assume that the coefficients c_i in the DCG criterion decay fast, so that \sum_i c_i is bounded (independent of m). In this case, we may pick p = 1 and q = \infty in Theorem 2. If \sup_j |f(x_j, S) - f_B(x_j, S)| is small, then we obtain a better bound than the least squares error bound with p = q = 2, which depends on m.
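The p = q = 2 case of Theorem 2 is easy to check numerically. The sketch below is our own code (the position weights c_i = 1/log_2(i+1) are just one standard choice, not dictated by the paper); it compares the DCG of the ranking induced by a noisy score f against the Bayes ranking, and verifies that the gap never exceeds 2 ||c||_2 ||f - f_B||_2.

```python
import numpy as np

def dcg(sort_key, f_B, c):
    # DCG of the ranking that sorts items by `sort_key` (descending),
    # evaluated with Bayes-optimal grades f_B and position weights c.
    order = np.argsort(-sort_key, kind="stable")
    return float(np.sum(c * f_B[order]))

rng = np.random.default_rng(0)
m = 10
c = 1.0 / np.log2(np.arange(2, m + 2))        # one common DCG weight choice
for _ in range(1000):
    f_B = rng.random(m)
    f = f_B + rng.normal(scale=0.3, size=m)   # imperfect regression estimate
    gap = dcg(f_B, f_B, c) - dcg(f, f_B, c)   # DCG(r_B, S) - DCG(r_f, S)
    bound = 2 * np.linalg.norm(c) * np.linalg.norm(f - f_B)
    assert 0 <= gap <= bound + 1e-9           # Theorem 2 with p = q = 2
```

Ranking by f_B itself maximizes the weighted sum (the weights c are decreasing), so the gap is always non-negative, and the assertion exercises exactly the inequality proved above.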
However, we cannot ensure that \sup_j |f(x_j, S) - f_B(x_j, S)| is small using the simple least squares estimation in (7). Therefore in the following, we develop a more refined bound for the DCG metric, which will then be used to motivate practical learning methods that improve on the simple least squares method.

Theorem 4 Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. Given S = \{x_1, \dots, x_m\}, let the optimal ranking order be J_B = [j^*_1, \dots, j^*_m], where f_B(x_{j^*_i}) is arranged in non-increasing order. Assume that c_i = 0 for all i > k. Then we have the following relationship for all \gamma \in (0, 1), p, q \ge 1 such that 1/p + 1/q = 1, u > 0, and subset K \subset \{1, \dots, m\} that contains j^*_1, \dots, j^*_k:

    DCG(r_B, S) - DCG(r_f, S) \le C_p(\gamma, u) \Big( \sum_{j \in K} |f(x_j, S) - f_B(x_j, S)|^q + u \sup_{j \notin K} (f(x_j, S) - f'_B(x_j, S))_+^q \Big)^{1/q},

where (z)_+ = \max(z, 0), and

    C_p(\gamma, u) = \frac{1}{1 - \gamma} \Big( 2 \sum_{i=1}^k c_i^p + u^{-p/q} \Big(\sum_{i=1}^k c_i\Big)^p \Big)^{1/p},    f'_B(x_j) = f_B(x_j) + \gamma (f_B(x_{j^*_k}) - f_B(x_j))_+.

Proof Let S = \{x_1, \dots, x_m\}. Let r_f(S) = J = [j_1, \dots, j_m], and let J^{-1} = [l_1, \dots, l_m] be its inverse permutation. Similarly, let J_B^{-1} = [l^*_1, \dots, l^*_m] be the inverse permutation of r_B(S) = J_B = [j^*_1, \dots, j^*_m]. Let M = f_B(x_{j^*_k}). By the definition of f'_B, we have

    (M - f_B(x_j, S))_+ \le \frac{1}{1 - \gamma} (M - f'_B(x_j, S))_+.

Moreover, since \sum_i c_i ((f_B(x_{j^*_i}, S) - M) - (f_B(x_{j_i}, S) - M)_+) \ge 0, we have

    \sum_i c_i ((f_B(x_{j^*_i}, S) - M) - (f_B(x_{j_i}, S) - M)_+) \le \frac{1}{1 - \gamma} \sum_i c_i ((f_B(x_{j^*_i}, S) - M) - (f'_B(x_{j_i}, S) - M)_+).
Therefore

    DCG(r_B, S) - DCG(r_f, S)
      = \sum_i c_i ((f_B(x_{j^*_i}, S) - M) - (f_B(x_{j_i}, S) - M))
      = \sum_i c_i ((f_B(x_{j^*_i}, S) - M) - (f_B(x_{j_i}, S) - M)_+) + \sum_i c_i (M - f_B(x_{j_i}, S))_+
      \le \frac{1}{1 - \gamma} \Big[ \sum_i c_i ((f_B(x_{j^*_i}, S) - M) - (f'_B(x_{j_i}, S) - M)_+) + \sum_i c_i (M - f'_B(x_{j_i}, S))_+ \Big]
      = \frac{1}{1 - \gamma} \Big( \sum_i c_i f_B(x_{j^*_i}, S) - \sum_i c_i f'_B(x_{j_i}, S) \Big)
      \le \frac{1}{1 - \gamma} \Big( \sum_i c_i (f_B(x_{j^*_i}, S) - f(x_{j^*_i}, S))_+ + \sum_i c_i (f(x_{j_i}, S) - f'_B(x_{j_i}, S))_+ \Big)
      \le \frac{1}{1 - \gamma} \Big[ \Big(\sum_{i=1}^k c_i^p\Big)^{1/p} \Big( \Big(\sum_{j \in K} (f_B(x_j, S) - f(x_j, S))_+^q\Big)^{1/q} + \Big(\sum_{j \in K} (f(x_j, S) - f_B(x_j, S))_+^q\Big)^{1/q} \Big) + \Big(\sum_{i=1}^k c_i\Big) \sup_{j \notin K} (f(x_j, S) - f'_B(x_j, S))_+ \Big],

where the second inequality uses \sum_i c_i f(x_{j_i}, S) \ge \sum_i c_i f(x_{j^*_i}, S), which holds because r_f sorts the items in descending order of f, and the last inequality applies Hölder's inequality to the top-k terms, using the facts that j^*_i \in K for i \le k and that (f(x_j, S) - f'_B(x_j, S))_+ \le (f(x_j, S) - f_B(x_j, S))_+ for j \in K (since f'_B \ge f_B). From the last inequality, we can apply Hölder's inequality again to obtain the desired bound.

The easiest way to interpret this bound is still to take p = q = 2. Intuitively, the bound says the following: we should estimate the top ranked items using least squares. For the other items, we do not have to estimate their conditional expectations very accurately. The DCG score will not be affected as long as we do not over-estimate their conditional expectations to such a degree that some of these items end up near the top of the rank-list. This point is a very important difference between this bound and Theorem 2, which assumes that we estimate the conditional expectation uniformly well. The bound in Theorem 4 can still be refined, but the resulting inequalities become more complicated, so we do not include such bounds in this paper. Similar to Theorem 4, such refined bounds show that we do not have to estimate the conditional expectation uniformly well. We present a simple example as an illustration.
Proposition 2 Consider m = 3 and S = \{x_1, x_2, x_3\}. Let c_1 = 2, c_2 = 1, c_3 = 0, and f_B(x_1, S) = f_B(x_2, S) = 1, f_B(x_3, S) = 0. Let f(x, S) be a real-valued scoring function, which induces a ranking function r_f. Then

    DCG(r_B, S) - DCG(r_f, S) \le 2 |f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| + |f(x_2, S) - f_B(x_2, S)|.

The coefficients on the right hand side cannot be improved.

Proof Note that f is suboptimal only when either f(x_3, S) \ge f(x_1, S) or f(x_3, S) \ge f(x_2, S). This gives the following bound:

    DCG(r_B, S) - DCG(r_f, S)
      \le I(f(x_3, S) \ge f(x_1, S)) + I(f(x_3, S) \ge f(x_2, S))
      \le I(|f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| \ge 1) + I(|f(x_3, S) - f_B(x_3, S)| + |f(x_2, S) - f_B(x_2, S)| \ge 1)
      \le 2 |f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| + |f(x_2, S) - f_B(x_2, S)|.

To see that the coefficients cannot be improved, we simply note that the bound is tight when either f(x_1, S) = f(x_2, S) = f(x_3, S) = 1, or f(x_1, S) = 1 and f(x_2, S) = f(x_3, S) = 0, or f(x_2, S) = 1 and f(x_1, S) = f(x_3, S) = 0.

The proposition implies that not all errors should be weighted equally: in the example, getting x_3 right is more important than getting x_1 or x_2 right. Conceptually, Theorem 4 and Proposition 2 show the following:

- Since we are interested in the top portion of the rank-list, we only need to estimate the top rated items accurately, while preventing the bottom items from being over-estimated (their conditional expectations do not have to be estimated accurately).
- For ranking purposes, some points are more important than other points. Therefore we should bias our learning method to produce more accurate conditional expectation estimates at the more important points.

7 Importance Weighted Regression

The key message from the analysis in Section 6 is that we do not have to estimate the conditional expectations equally well for all items. In particular, since we are interested in the top portion of the rank-list, Theorem 4 implies that we need to estimate the top portion more accurately than the bottom portion.
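The unequal weighting suggested by Proposition 2 can be checked by brute force. In this sketch (our own code, not the paper's), we sample random scoring functions for the three-item example and confirm that the DCG gap never exceeds 2|f(x_3) - f_B(x_3)| + |f(x_1) - f_B(x_1)| + |f(x_2) - f_B(x_2)|, with the larger coefficient sitting on the low-grade item x_3.

```python
import numpy as np

c = np.array([2.0, 1.0, 0.0])           # position weights of Proposition 2
f_B = np.array([1.0, 1.0, 0.0])         # x_1, x_2 relevant; x_3 irrelevant
opt = float(np.sum(c * np.sort(f_B)[::-1]))   # optimal DCG = 3

def dcg_of(f):
    # DCG of the ranking that sorts the three items by f (descending).
    return float(np.sum(c * f_B[np.argsort(-f, kind="stable")]))

rng = np.random.default_rng(1)
for _ in range(5000):
    f = rng.normal(size=3)
    gap = opt - dcg_of(f)
    bound = (2 * abs(f[2] - f_B[2])
             + abs(f[0] - f_B[0]) + abs(f[1] - f_B[1]))
    assert gap <= bound + 1e-9   # over-estimating x_3 carries double weight
```

The error on x_3 appears twice in the bound because an inflated f(x_3) can displace either relevant item; this is the sense in which getting x_3 right matters more.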
Motivated by this analysis, we consider a regression based training method that solves the DCG optimization problem but weights different points differently according to their importance. We shall not discuss the implementation details of modeling the function f(x, S), which are beyond the scope of this paper. One simple model is to assume a form f(x, S) = f(x). Section 4 discussed the validity of such models. For example, this model is reasonable if we assume that for each x \in S and the corresponding y, we have E_{y|(x,S)} y = E_{y|x} y (see Proposition 1). Let F be a function space that contains functions X \times S \to R. We draw n sets S_1, \dots, S_n randomly, where S_i = \{x_{i,1}, \dots, x_{i,m}\}, with the corresponding grades \{y_{i,j}\}_j = \{y_{i,1}, \dots, y_{i,m}\}.
Based on Theorem 2, the simple least squares regression (7) can be used to solve the subset ranking problem. However, this direct regression method is not adequate for many practical problems such as web-search, for which there are many items to rank (that is, m is large) but only the top ranked pages are important. This is because the method pays equal attention to relevant and irrelevant pages. In reality, one should pay more attention to the top-ranked (relevant) pages. The grades of lower ranked pages do not need to be estimated accurately, as long as we do not over-estimate them so that these pages appear in the top ranked positions. This intuition can be captured by Theorem 4 and Proposition 2, which motivate the following alternative training method:

    \hat f = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^n L(f, S_i, \{y_{i,j}\}_j),    (11)

where for S = \{x_1, \dots, x_m\} with the corresponding \{y_j\}_j, we have the following importance weighted regression loss:

    L(f, S, \{y_j\}_j) = \sum_{j=1}^m w(x_j, S) (f(x_j, S) - y_j)^2 + u \sup_j w'(x_j, S) (f(x_j, S) - \delta(x_j, S))_+^2,    (12)

where u is a non-negative parameter. A variation of this method is used to optimize the production system of Yahoo's internet search engine. The detailed implementation and parameter choices are trade secrets of Yahoo, which we cannot completely disclose here; they are also irrelevant for the purpose of this paper. However, in the following, we briefly explain the intuition behind (12) using Theorem 4, together with some practical considerations.

The weight function w(x_j, S) in (12) is chosen so that it focuses only on the most important examples (the weight is set to zero for pages that we know are irrelevant). This part of the formulation corresponds to the first part of the bound in Theorem 4 (in that case, we choose w(x_j, S) to be one for the top part of the examples with index set K, and zero otherwise). The usefulness of non-uniform weighting is also demonstrated in Proposition 2. The specific choice of the weight function requires various engineering considerations that are not important for the purpose of this paper.
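A direct transcription of the loss (12) for one subset might look as follows. This is a hypothetical sketch under our own naming conventions (w, w2, delta, u stand for the weights, second-stage weights, thresholds, and trade-off parameter of the text), not the production implementation:

```python
import numpy as np

def importance_weighted_loss(scores, y, w, w2, delta, u):
    """Importance weighted regression loss of Eq. (12) for one subset.

    Weighted squared error on the important items, plus a hinge-squared
    penalty that fires only when a low-ranked item's score climbs above
    its threshold delta (only over-estimation is punished)."""
    scores = np.asarray(scores, dtype=float)
    sq = np.sum(np.asarray(w, dtype=float)
                * (scores - np.asarray(y, dtype=float)) ** 2)
    over = (np.asarray(w2, dtype=float)
            * np.maximum(0.0, scores - np.asarray(delta, dtype=float)) ** 2)
    return float(sq + u * np.max(over))
```

With w zero on the irrelevant items and w2 zero on the relevant ones, an irrelevant item contributes nothing as long as its score stays below its threshold, which is the computational point made below: most low-ranked items drop out of the objective entirely.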
In general, if there are many items with similar grades, then it is beneficial to give each of the similar items a smaller weight. In the second part of (12), we choose w'(x_j, S) so that it focuses on the examples not covered by w(x_j, S). In particular, it only covers those data points x_j that are low-ranked with high confidence. We choose \delta(x_j, S) to be a small threshold that can be regarded as a lower bound of f'_B(x_j) in Theorem 4, such as \gamma f_B(x_{j^*_k}). An important observation is that although m is often very large, the number of points such that w(x_j, S) is nonzero is often small. Moreover, (f(x_j, S) - \delta(x_j, S))_+ is nonzero only when f(x_j, S) \ge \delta(x_j, S), and in practice the number of these points is usually small (that is, most irrelevant pages will be predicted as irrelevant). Therefore the formulation completely ignores those low-ranked data points such that f(x_j, S) \le \delta(x_j, S). This makes the learning procedure computationally efficient even when m is large. The analogy here is support vector machines, where only the support vectors are useful in the learning formulation, and one can completely ignore the samples corresponding to non support vectors. (Some aspects of the implementation were covered in [10].)

In the practical implementation of (12), we can use an iterative refinement scheme, where we start with a small number of samples included in the first part of (12), and then put the
low-ranked points into the second part of (12) only when their ranking scores exceed \delta(x_j, S). In fact, one may also put these points into the first part of (12), so that the second part always has zero value (which makes the implementation simpler). In this sense, the formulation in (12) suggests a selective sampling scheme, in which we pay special attention to important and highly ranked data points, while completely ignoring most of the low ranked data points. In this regard, with an appropriately chosen w(x, S), the second part of (12) can be completely ignored.

The empirical risk minimization method in (11) approximately minimizes the following criterion:

    Q(f) = E_S L(f, S),    (13)

where

    L(f, S) = E_{\{y_j\}_j | S} L(f, S, \{y_j\}_j) = \sum_{j=1}^m w(x_j, S) E_{y_j | (x_j, S)} (f(x_j, S) - y_j)^2 + u \sup_j w'(x_j, S) (f(x_j, S) - \delta(x_j, S))_+^2.

The following theorem shows that under appropriate assumptions, approximate minimization of (13) leads to approximate optimization of the DCG.

Theorem 5 Assume that c_i = 0 for all i > k. Let the optimal ranking order be J_B = [j^*_1, \dots, j^*_m], where f_B(x_{j^*_i}) is arranged in non-increasing order. Assume the following conditions hold for each S = \{x_1, \dots, x_m\}:

- There exists \gamma \in [0, 1) such that \delta(x_j, S) \ge \gamma f_B(x_{j^*_k}, S).
- For all j such that f_B(x_j, S) > \delta(x_j, S), we have w(x_j, S) \ge 1.
- w'(x_j, S) = I(w(x_j, S) < 1).

Then the following results hold:

- A function f^* minimizes (13) if f^*(x_j, S) = f_B(x_j, S) when w(x_j, S) > 0 and f^*(x_j, S) \le \delta(x_j, S) otherwise.
- For all f, let r_f be the induced ranking function, and let r_B be the optimal Bayes ranking function; then

    DCG(r_f) \ge DCG(r_B) - C(\gamma, u) (Q(f) - Q(f^*))^{1/2}.

Proof Note that if f_B(x_j, S) > \delta(x_j, S), then w(x_j, S) \ge 1 and w'(x_j, S) = 0. Therefore the minimizer f^*(x_j, S) should minimize E_{y_j | (x_j, S)} (f(x_j, S) - y_j)^2, achieved at f^*(x_j, S) = f_B(x_j, S). If f_B(x_j, S) \le \delta(x_j, S), then there are two cases:

- w(x_j, S) > 0: f^*(x_j, S) should minimize E_{y_j | (x_j, S)} (f(x_j, S) - y_j)^2, achieved at f^*(x_j, S) = f_B(x_j, S).
- w(x_j, S) = 0: in this case w'(x_j, S) = 1, and f^*(x_j, S) should minimize (f(x_j, S) - \delta(x_j, S))_+^2, achieved at any f^*(x_j, S) \le \delta(x_j, S).
This proves the first claim. For each S, denote by K the set of indices j such that w'(x_j, S) = 0. The second claim follows from the derivation

    Q(f) - Q(f^*) = E_S (L(f, S) - L(f^*, S))
      = E_S \Big[ \sum_{j=1}^m w(x_j, S) (f(x_j, S) - f_B(x_j, S))^2 + u \sup_j w'(x_j, S) (f(x_j, S) - \delta(x_j, S))_+^2 \Big]
      \ge E_S \Big[ \sum_{j \in K} (f_B(x_j, S) - f(x_j, S))^2 + u \sup_{j \notin K} (f(x_j, S) - \delta(x_j, S))_+^2 \Big]
      \ge E_S \frac{(DCG(r_B, S) - DCG(r_f, S))^2}{C(\gamma, u)^2}
      \ge \frac{(DCG(r_B) - DCG(r_f))^2}{C(\gamma, u)^2}.

Note that the second inequality follows from Theorem 4, and the last from Jensen's inequality.

8 Generalization Analysis

In this section, we analyze the generalization performance of (11). The analysis depends on the underlying function class F. In the literature, one often employs a linear function class with an appropriate regularization condition, such as L_1 or L_2 regularization of the linear weight coefficients. Yahoo's machine learning ranking system employs the gradient boosting method described in [14], which is closely related to L_1 regularization, analyzed in [5, 18, 19]. Although the consistency of boosting for standard least squares regression is known (for example, see [6, 30]), such analysis does not deal with the situation where m is large, and thus is not suitable for analyzing the ranking problem considered in this paper.

In this section, we consider a linear function class with L_2 regularization, which is closely related to kernel methods. We employ a relatively simple stability analysis which is suitable for L_2 regularization. Our result does not depend on m explicitly, which is important for large scale ranking problems such as web-search. Although similar results can be obtained for L_1 regularization or gradient boosting, the analysis would become much more complicated.

For L_2 regularization, we consider a feature map \psi: X \times S \to H, where H is a vector space. We denote by w^T v the L_2 inner product of w and v in H. The function class F considered here is of the following form:

    F = \{\beta^T \psi(x, S) : \beta \in H, \beta^T \beta \le A^2\} \subset \{X \times S \to R\},    (14)

where the complexity is controlled by the L_2 regularization of the weight vector: \beta^T \beta \le A^2. We use (S_i = \{x_{i,1}, \dots, x_{i,m}\}, \{y_{i,j}\}_j) to indicate a sample point indexed by i.
Note that for each sample i, we do not need to assume that the y_{i,j} are independently generated for different j. Using (14), the importance weighted regression in (11) becomes the following regularized empirical risk
minimization method:

    f_{\hat\beta}(x, S) = \hat\beta^T \psi(x, S),    \hat\beta = \arg\min_{\beta \in H} \Big[ \frac{1}{n} \sum_{i=1}^n L(\beta, S_i, \{y_{i,j}\}_j) + \lambda \beta^T \beta \Big],    (15)

where

    L(\beta, S, \{y_j\}_j) = \sum_{j=1}^m w(x_j, S) (\beta^T \psi(x_j, S) - y_j)^2 + u \sup_j w'(x_j, S) (\beta^T \psi(x_j, S) - \delta(x_j, S))_+^2.

In this method, we replace the hard regularization in (14), with tuning parameter A, by soft regularization with tuning parameter \lambda, which is computationally more convenient. The following result is an expected generalization bound for the L_2-regularized empirical risk minimization method (15), which uses the stability analysis in [7]. The proof is given in Appendix A.

Theorem 6 Let M = \sup_{x,S} \|\psi(x, S)\| and W = \sup_S [\sum_{x_j \in S} w(x_j, S) + u \sup_{x_j \in S} w'(x_j, S)]. Let f_{\hat\beta} be the estimator defined in (15). Then we have

    E_{\{S_i, \{y_{i,j}\}_j\}^n} Q(f_{\hat\beta}) \le \Big(1 + \frac{W M^2}{\lambda n}\Big) \inf_{\beta \in H} [Q(f_\beta) + \lambda \beta^T \beta].

We have paid special attention to the properties of (15). In particular, the quantity W is usually much smaller than m, which is large for web-search applications. The point we would like to emphasize here is that even though the number m is large, the estimation complexity is only affected by the top portion of the rank-list. If the estimation of the lowest ranked items is relatively easy (as is generally the case), then the learning complexity does not depend on the majority of items near the bottom of the rank-list.

We can combine Theorem 5 and Theorem 6, giving the following bound:

Theorem 7 Suppose the conditions of Theorem 5 and Theorem 6 hold, with f^* minimizing (13). Let \hat f = f_{\hat\beta}. We have

    E_{\{S_i, \{y_{i,j}\}_j\}^n} DCG(r_{\hat f}) \ge DCG(r_B) - C(\gamma, u) \Big[ \Big(1 + \frac{W M^2}{\lambda n}\Big) \inf_{\beta \in H} (Q(f_\beta) + \lambda \beta^T \beta) - Q(f^*) \Big]^{1/2}.

Proof From Theorem 5, we obtain

    E_{\{S_i, \{y_{i,j}\}_j\}^n} DCG(r_{\hat f}) \ge DCG(r_B) - C(\gamma, u) E_{\{S_i, \{y_{i,j}\}_j\}^n} (Q(\hat f) - Q(f^*))^{1/2}
      \ge DCG(r_B) - C(\gamma, u) [E_{\{S_i, \{y_{i,j}\}_j\}^n} Q(f_{\hat\beta}) - Q(f^*)]^{1/2}.

The second inequality is a consequence of Jensen's inequality. Applying Theorem 6 now gives the desired bound.

The theorem implies that if Q(f^*) = \inf_{\beta \in H} Q(f_\beta), then as n \to \infty, we can let \lambda \to 0 and \lambda n \to \infty so that the second term on the right hand side vanishes in the large sample limit.
Therefore asymptotically, we can achieve the optimal DCG score. This implies the consistency of regression based learning methods for the DCG criterion.
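Finally, the regularized objective (15) is convex in \beta (a sum of squared terms, a max of convex functions, and a quadratic penalty), so a plain subgradient method suffices for a small-scale sketch. Everything below, including names, shapes, and the step size, is our illustrative assumption, not the production system described above:

```python
import numpy as np

def fit_l2_iwr(psi, y, w, w2, delta, u, lam, lr=0.05, steps=3000):
    """Subgradient descent on Eq. (15) for a linear scorer f(x,S) = beta^T psi(x,S).

    psi: (n, m, d) feature vectors for n subsets of m items each;
    y, w, w2, delta: (n, m) grades, weights, second-stage weights, thresholds."""
    n, m, d = psi.shape
    beta = np.zeros(d)
    for _ in range(steps):
        g = 2.0 * lam * beta                       # gradient of lambda * beta^T beta
        for i in range(n):
            s = psi[i] @ beta                      # scores of subset i
            g += (psi[i] * (2.0 * w[i] * (s - y[i]))[:, None]).sum(axis=0) / n
            over = w2[i] * np.maximum(0.0, s - delta[i]) ** 2
            j = int(np.argmax(over))               # the sup selects one active item
            if over[j] > 0.0:
                g += u * w2[i][j] * 2.0 * (s[j] - delta[i][j]) * psi[i][j] / n
        beta -= lr * g
    return beta
```

On a toy problem where the grades are exactly linear in the features, this recovers the generating coefficient; in the production setting described above, gradient boosting rather than this linear sketch is used.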
More informationLinear Regression Analysis: Terminology and Notation
ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented
More informationCS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016
CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng
More informationA Robust Method for Calculating the Correlation Coefficient
A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal
More informationChapter 8 Indicator Variables
Chapter 8 Indcator Varables In general, e explanatory varables n any regresson analyss are assumed to be quanttatve n nature. For example, e varables lke temperature, dstance, age etc. are quanttatve n
More informationSupport Vector Machines. Vibhav Gogate The University of Texas at dallas
Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest
More informationClassification as a Regression Problem
Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class
More informationParametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010
Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton
More informationModule 9. Lecture 6. Duality in Assignment Problems
Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept
More informationON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION
Advanced Mathematcal Models & Applcatons Vol.3, No.3, 2018, pp.215-222 ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EUATION
More informationPredictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore
Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.
More informationThe Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD
he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s
More informationComplete subgraphs in multipartite graphs
Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G
More informationLinear Feature Engineering 11
Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19
More informationComparison of Regression Lines
STATGRAPHICS Rev. 9/13/2013 Comparson of Regresson Lnes Summary... 1 Data Input... 3 Analyss Summary... 4 Plot of Ftted Model... 6 Condtonal Sums of Squares... 6 Analyss Optons... 7 Forecasts... 8 Confdence
More informationSDMML HT MSc Problem Sheet 4
SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be
More informationMAXIMUM A POSTERIORI TRANSDUCTION
MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,
More informationSome modelling aspects for the Matlab implementation of MMA
Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton
More informationx i1 =1 for all i (the constant ).
Chapter 5 The Multple Regresson Model Consder an economc model where the dependent varable s a functon of K explanatory varables. The economc model has the form: y = f ( x,x,..., ) xk Approxmate ths by
More informationFREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced,
FREQUENCY DISTRIBUTIONS Page 1 of 6 I. Introducton 1. The dea of a frequency dstrbuton for sets of observatons wll be ntroduced, together wth some of the mechancs for constructng dstrbutons of data. Then
More informationStanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011
Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected
More informationANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)
Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of
More informationVQ widely used in coding speech, image, and video
at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng
More information4DVAR, according to the name, is a four-dimensional variational method.
4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The
More informationLINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity
LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 30 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 2 Remedes for multcollnearty Varous technques have
More informationA New Refinement of Jacobi Method for Solution of Linear System Equations AX=b
Int J Contemp Math Scences, Vol 3, 28, no 17, 819-827 A New Refnement of Jacob Method for Soluton of Lnear System Equatons AX=b F Naem Dafchah Department of Mathematcs, Faculty of Scences Unversty of Gulan,
More informationStructure and Drive Paul A. Jensen Copyright July 20, 2003
Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.
More informationSolutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution.
Solutons HW #2 Dual of general LP. Fnd the dual functon of the LP mnmze subject to c T x Gx h Ax = b. Gve the dual problem, and make the mplct equalty constrants explct. Soluton. 1. The Lagrangan s L(x,
More informationSimultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals
Smultaneous Optmzaton of Berth Allocaton, Quay Crane Assgnment and Quay Crane Schedulng Problems n Contaner Termnals Necat Aras, Yavuz Türkoğulları, Z. Caner Taşkın, Kuban Altınel Abstract In ths work,
More informationThe Minimum Universal Cost Flow in an Infeasible Flow Network
Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran
More informationMarkov Chain Monte Carlo Lecture 6
where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways
More information4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA
4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected
More informationOnline Classification: Perceptron and Winnow
E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng
More informationWinter 2008 CS567 Stochastic Linear/Integer Programming Guest Lecturer: Xu, Huan
Wnter 2008 CS567 Stochastc Lnear/Integer Programmng Guest Lecturer: Xu, Huan Class 2: More Modelng Examples 1 Capacty Expanson Capacty expanson models optmal choces of the tmng and levels of nvestments
More informationComposite Hypotheses testing
Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter
More informationChapter 5 Multilevel Models
Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level
More informationDiscussion of Extensions of the Gauss-Markov Theorem to the Case of Stochastic Regression Coefficients Ed Stanek
Dscusson of Extensons of the Gauss-arkov Theorem to the Case of Stochastc Regresson Coeffcents Ed Stanek Introducton Pfeffermann (984 dscusses extensons to the Gauss-arkov Theorem n settngs where regresson
More informationECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics
ECOOMICS 35*-A Md-Term Exam -- Fall Term 000 Page of 3 pages QUEE'S UIVERSITY AT KIGSTO Department of Economcs ECOOMICS 35* - Secton A Introductory Econometrcs Fall Term 000 MID-TERM EAM ASWERS MG Abbott
More informationLecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem.
prnceton u. sp 02 cos 598B: algorthms and complexty Lecture 20: Lft and Project, SDP Dualty Lecturer: Sanjeev Arora Scrbe:Yury Makarychev Today we wll study the Lft and Project method. Then we wll prove
More informationResource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud
Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal
More informationEconomics 130. Lecture 4 Simple Linear Regression Continued
Economcs 130 Lecture 4 Contnued Readngs for Week 4 Text, Chapter and 3. We contnue wth addressng our second ssue + add n how we evaluate these relatonshps: Where do we get data to do ths analyss? How do
More informationCS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015
CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research
More information3.1 ML and Empirical Distribution
67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum
More informationLecture 12: Classification
Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna
More informationA Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach
A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland
More informationConvergence of random processes
DS-GA 12 Lecture notes 6 Fall 216 Convergence of random processes 1 Introducton In these notes we study convergence of dscrete random processes. Ths allows to characterze phenomena such as the law of large
More informationNegative Binomial Regression
STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...
More information