Active Learning Models and Noise

Sara Stolbach
COMS 6253, Advanced CLT
May 3, 2007

Abstract

I study active learning in general pool-based active learning models as well as noisy active learning algorithms, and then compare them for the class of linear separators under the uniform distribution.

1 Introduction

There are often cases where data is abundant and easily accessible but labeling the data is costly. For example, in bioinformatics many DNA sequences are available, but decoding one sequence can take a person many hours or days. This is the scenario addressed by active learning: the labels of the data are hidden, and the learner can pay for the label of any example. This is not captured in the typical PAC supervised learning scenario. Because labeling is expensive, an active learning algorithm will want to minimize the number of examples it needs to label. In this paper I focus on pool-based active learning models, in which the learner can pay for the label of any example in a pool of unlabeled examples, as opposed to models in which query points can be created synthetically. A noisy dataset is difficult in the active learning setting since standard active learning models seek out the most informative examples, which tend to be the most noise-prone. I will first discuss some noiseless active learning models and show why they are noise-prone, then discuss two noisy active learning models and how they compare, and finally discuss the next steps that should be taken.

For the most part, active learning methods fall under three orthogonal techniques: generalized binary search, opportunistic priors (or algorithmic luckiness), and Bayesian assumptions. Opportunistic priors means that a uniform bet over all of Ĥ leads to standard VC generalization bounds; if the algorithm places more weight on a certain hypothesis, it can do excellently if it guesses right but worse than usual if it guesses wrong. This method is not as practical because of its unpredictability. The other two methods are discussed in this paper. Progress can be measured in a number of ways, such as the rate at which the size of the version space decreases or the number of label requests needed. I will be focusing on the number of label requests needed.

2 Preliminaries

Some general definitions used throughout the paper are collected here; the variables have these meanings unless specified otherwise. Let X be the instance space of examples x drawn i.i.d. from the uniform input distribution D, and let V denote the version space. Let n be the number of label requests. Let C be the concept class over distribution P. Let η denote the noise rate in noisy models.

A common application examined in active learning is linear separators through the origin of the unit sphere in R^d. X is the set of all data on the surface of the sphere, so that X = {x ∈ R^d : ‖x‖ = 1}. Each example in X is denoted (x̄, t), where x̄ is the direction of the example and t is the offset; in other words, X = S_d × [−1, +1], where S_d is the unit sphere around the origin of R^d. The distribution D on X is uniform. H is the class of linear separators through the origin, and any h ∈ H is a homogeneous hyperplane. h is represented by a unit vector w ∈ X with the classification rule h(x) = sign(w · x). The distance between two hypotheses u and v in H with respect to D is given by distance_D(u, v) = θ(u, v)/π, where θ(u, v) = arccos(u · v).
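As a concrete illustration of this setup, the following is a minimal Python sketch (with names of my own choosing) of the unit-sphere instance space, the homogeneous classification rule h(x) = sign(w · x), and the distance measure distance_D(u, v) = arccos(u · v)/π.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n i.i.d. examples uniformly from the unit sphere S_d in R^d."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def h(w, x):
    """Homogeneous linear separator represented by the unit vector w."""
    return np.sign(w @ x)

def distance_D(u, v):
    """distance_D(u, v) = theta(u, v) / pi with theta(u, v) = arccos(u . v)."""
    return np.arccos(np.clip(u @ v, -1.0, 1.0)) / np.pi
```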

3 General Models

3.1 Bayesian Model

The Query by Committee (QBC) algorithm [1] is one of the most significant pool-based active learning algorithms. It shows that using queries over random unlabeled examples can accelerate the learning of some concept classes over standard learning approaches. The work is done in the Bayesian model, which differs from the PAC model in that the target concept is assumed to be chosen according to a prior distribution P over C, and this distribution is known to the learner. A Bayesian model has an immediate benefit for active learning: if there is large agreement on unlabeled data, you can stop and output the current hypothesis. The algorithm assumes realizability, meaning a perfect classifier exists. The paper's analysis is based on probabilistic assumptions, and they show that queries can help accelerate learning of concept classes that are deterministic and noiseless. The paper also discusses a generalized form of QBC that uses two distributions; it is more computationally intensive, but the outcome need not be binary or discrete and the inputs can be stochastic.

Algorithm 1 QBC Algorithm
Input: ε > 0, δ > 0, Gibbs, Sample, Label
Initialize: n = 0, V_0 = C
repeat
  Call the Sample oracle to get a random instance x.
  Call Gibbs twice to get two predictions p_1 and p_2 for x.
  if p_1 = p_2 then
    reject the example
  else
    call Label(x) to get c(x), increase n by 1, and set V_n to be all concepts c' ∈ V_{n−1} with c'(x) = c(x)
  end if
until more than t_n consecutive examples are rejected.
Output: the Gibbs prediction hypothesis

The Query by Committee algorithm (Algorithm 1) uses an oracle denoted Gibbs(V, x), which computes the Gibbs prediction rule. It predicts the label of a new example x by randomly choosing an h ∈ C according to the prior restricted to V ⊆ C, and labeling x according to it. Two calls to Gibbs with the same V and x can result in different predictions. The goal is to label x so as to maximize the expected information gain, causing an exponentially fast decrease in the error of the Gibbs prediction rule. However, this is not guaranteed, mainly because the distribution is ignored: if queries of the same type are always made, the prediction error will stay large. The Sample oracle returns an unlabeled example x ∈ X chosen according to D. A call to the Label oracle with input x returns c(x), the label of x according to the target concept. The iterations are done in a batch learning scenario until some termination condition is achieved. The termination condition is met when t_n = (1/ε) ln(π²(n+1)²/(3δ)) consecutive examples are rejected.
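To make the loop concrete, here is a minimal Python sketch of the QBC filtering loop under the realizability assumption, using the termination threshold t_n above. The Gibbs oracle is implemented by naive rejection sampling of unit vectors consistent with the labeled data, which is only meant to illustrate the rule rather than to be an efficient implementation; all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs(labeled, d):
    """Gibbs oracle: draw a homogeneous separator from the (uniform) prior
    restricted to the current version space, by naive rejection sampling."""
    while True:
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        if all(np.sign(w @ x) == y for x, y in labeled):
            return w

def qbc(sample, label, d, eps, delta, max_labels=200):
    """QBC filtering loop (Algorithm 1): query only where two Gibbs
    predictions disagree; stop after t_n consecutive rejections."""
    t_n = lambda n: (1.0 / eps) * np.log(np.pi ** 2 * (n + 1) ** 2 / (3 * delta))
    labeled, n, rejected = [], 0, 0
    while rejected <= t_n(n) and n < max_labels:
        x = sample()
        p1 = np.sign(gibbs(labeled, d) @ x)
        p2 = np.sign(gibbs(labeled, d) @ x)
        if p1 == p2:
            rejected += 1                      # committee agrees: no label cost
        else:
            labeled.append((x, label(x)))      # disagreement: pay for the label
            n, rejected = n + 1, 0
    return gibbs(labeled, d)                   # final Gibbs prediction hypothesis
```

Here sample() would draw a point uniformly from the unit sphere and label(x) would return sign(w* · x) for a hidden target w*. Rejection sampling becomes impractically slow as the version space shrinks, which is part of why the modified perceptron of [2], discussed below, is attractive.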

Theorem 1 If a concept class C has VC-dimension 0 < d < ∞ and the expected information gain of queries made by QBC is uniformly bounded below by g > 0 bits, then with probability larger than 1 − δ over the random choice of the target concept, the sequence of examples, and the choices made by QBC, the number of calls to Sample is smaller than

m_0 = max( 4d/(εδ), (160(d+1)/(εg)) · max(6, ln(80(d+1)/(εδ²g)))² ),

the number of calls to Label is smaller than

n_0 = (10(d+1)/g) ln(4m_0/δ),

and the probability that the Gibbs prediction using the final version space of QBC makes a mistake is smaller than ε. [1]

It is easy to show that if QBC ever stops, then the error of the resulting hypothesis is small with high probability. The real question is whether QBC will stop. The proof of Theorem 1 in [1] shows that it will stop if the number of examples rejected between consecutive queries increases (linearly) with the number of queries. This implies that the probability of accepting a query, or of making a prediction mistake, is exponentially small in the number of queries asked (based on the information gain g > 0).

A common class examined in active learning is uniformly distributed halfspaces through the origin of the unit sphere in R^d. The information gain from random examples vanishes as d goes to infinity because in high dimension the volume of the sphere is concentrated near the equator, and a typical example cuts the sphere away from the equator; this means that query examples are especially important in high dimensions. QBC will likely choose two random points near the equator, so an example that separates them will likely be near the equator, which implies that QBC can attain a finite information gain in high dimensions.

[1] prove a lower bound on this information gain, which implies that Theorem 1 holds and that the number of calls to the Sample oracle is O((d/ε) log(1/δ)) while the number of calls to the Label oracle is O(d log(1/ε)). The paper also proves that the QBC algorithm obtains such results for the perceptron class, by modeling it as a special case of the uniformly distributed halfspaces problem.

Dasgupta et al. [2] show an algorithm which uses a simple modification to the perceptron update to provide even better results. The perceptron algorithm uses the same concept class of linear separators where datapoints lie on the surface of the unit sphere in R^d. It starts with an initial hypothesis v_0 ∈ R^d, and in each iteration it receives an unlabeled point x_t, makes a prediction sign(v_t · x_t), and during the filtering step decides whether to ask for the label based on a threshold s_t applied to |v_t · x_t|. If the label is requested, the update step is called. The regular perceptron update is: if (x_t, y_t) is misclassified, then v_{t+1} = v_t + y_t x_t. With this update the error rate cannot be better than Ω(1/√(l_t)), where l_t is the number of labels queried up to time t. They change the update to: if (x_t, y_t) is misclassified, then v_{t+1} = v_t − 2(v_t · x_t) x_t. This scales the update by a factor of 2|v_t · x_t| to avoid the oscillations caused by points close to the hyperplane. The filtering step is based on s_t, whose choice is crucial; [2] set it adaptively, starting high and repeatedly halving it until a level is reached where enough of the queried points are misclassified. The modified perceptron yields a theorem stating that O(d log(1/ε)) labels suffice when drawing O((d/ε) log(1/ε)) data points at random from the unit sphere in R^d, as opposed to QBC's need of O(d/ε²) datapoints. It will make O(d log(1/ε)) errors and have final error ε. The bound improvements come from the change to the update step and from the threshold s_t used in the filtering step.

The QBC algorithm would have very poor results in a noisy setting because the wrong examples could be queried for labels, producing poor version spaces. In addition, an adversarial noise model could cause the algorithm to never stop.
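A rough Python sketch of the modified perceptron's filtering and update steps is given below. The reflection update v ← v − 2(v · x)x and the |v · x| ≤ s filtering rule come from the description above; the particular adaptive-threshold schedule here (halve s when too few recent queries are mistakes) is only a crude stand-in for the one analyzed in [2], and all names are illustrative.

```python
import numpy as np

def modified_perceptron(stream, label, s0=1.0, budget=100, window=20, min_frac=0.25):
    """Active learning with the modified perceptron update of [2] (sketch).
    stream yields unit vectors in R^d; label(x) returns the +1/-1 label."""
    x0 = next(stream)
    v = label(x0) * x0                       # initial hypothesis v_0 (unit norm)
    s, queries, recent = s0, 1, []
    for x in stream:
        if abs(v @ x) > s:                   # filtering step: skip confident points
            continue
        y = label(x)                         # pay for this label
        queries += 1
        mistake = np.sign(v @ x) != y
        recent.append(mistake)
        if mistake:
            v = v - 2 * (v @ x) * x          # reflection update keeps ||v|| = 1
        if len(recent) == window:            # crude adaptive threshold
            if sum(recent) < min_frac * window:
                s /= 2                       # too few mistakes near the margin: shrink s
            recent = []
        if queries >= budget:
            break
    return v
```

Because x and v are unit vectors, the reflection update preserves ‖v‖ = 1, so no renormalization is needed; the standard update v + yx does not have this property and, as noted above, cannot beat an error rate of Ω(1/√(l_t)).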

3.2 Generalized Binary Search

Active learning can also be viewed as an approach to improving the standard supervised setting. In a supervised setting, for a class with VC dimension d and target error rate ε over distribution P, some m = m(ε, d) labeled points are needed. Dasgupta [3] uses a greedy generalized binary search to examine whether fewer than m labels are sufficient to learn the class when the points arrive unlabeled.

Figure 1: The boundary can be found with just O(log m) labels using binary search. (Figure taken from Dasgupta's paper [3].)

In the case of data lying on the real line with a hypothesis class H of simple threshold functions, it is enough to draw m = O(1/ε) random labeled examples from P and return a classifier consistent with them. However, as in Figure 1, if we use unlabeled examples we can run a simple binary search to find the transition from 0 to 1, which requires only log m labels to infer the rest of them. Hence there is an exponential improvement. But what about the general case: is it possible to pick among O(m^d) possibilities using o(m) labels? If binary search were possible, just O(d log m) labels would be needed. This is not the case: in d ≥ 2 there are cases where the target hypothesis cannot be identified without querying all the labels. However, in the average case the number of labels needed is small.

A variant of a popular greedy scheme is used, where one always asks for the label which most evenly divides the current effective version space, weighted by π. Here π is merely a device for averaging querying counts over some distribution on Ĥ, and Ĥ is used instead of H; it reflects the underlying combinatorial structure of the problem, and π can often be chosen to mask its structure. The expected number of labels needed by this strategy is at most O(ln |Ĥ|) times that of any other strategy, which is a significant performance guarantee. A query tree structure is used, and there is not always a tree of average depth o(m); the best hope is to come close to minimizing the number of queries, and this is done using a greedy approach. Let S ⊆ Ĥ be the current version space. For each unlabeled x_i, let S⁺ be the hypotheses which label x_i positive and S⁻ the ones which label it negative. Pick the x_i for which the positive and negative sets are most nearly equal in π-mass, in other words the x_i for which min{π(S⁺), π(S⁻)} is largest.
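The greedy selection rule just described is straightforward to write down. The sketch below (plain Python, hypothetical names) picks the unlabeled point whose induced split of the version space is most balanced in π-mass.

```python
def greedy_gbs_query(S, pi, unlabeled, predict):
    """Greedy generalized binary search: return the unlabeled point x for
    which min(pi(S+), pi(S-)) is largest, where S+ / S- are the hypotheses
    in the current version space S labeling x positive / negative.
    pi(h) is the weight of hypothesis h; predict(h, x) returns +1 or -1."""
    best_x, best_score = None, -1.0
    for x in unlabeled:
        plus = sum(pi(h) for h in S if predict(h, x) > 0)
        minus = sum(pi(h) for h in S if predict(h, x) < 0)
        score = min(plus, minus)
        if score > best_score:
            best_x, best_score = x, score
    return best_x
```

After querying the chosen point, the hypotheses on the losing side of the split are removed from S and the rule is applied again; Dasgupta's guarantee concerns the expected number of such queries under π.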

Generalized binary search would clearly have poor results in a noisy setting because, as previously mentioned, the datapoints that are chosen to be labeled tend to be the most noise-prone. A small amount of adversarial noise can cause the datapoints that would be chosen to divide the version space to give virtually no help in learning the concept.

4 Noisy Models

There are two active learning models that work with arbitrary classification noise. The only restriction is that samples are drawn i.i.d. from some underlying distribution; the results hold for any mechanism used to generate the noise. The algorithms have different restrictions on η. It is important to note, however, that Kääriäinen [10] shows a lower bound of Ω(η²/ε²) on the sample complexity of any active learner, so speedups cannot be hoped for when η is large.

4.1 Agnostic Active Learning

Balcan et al. [4] describe an algorithm known as the A² algorithm (Algorithm 2), which is noise tolerant. It was the first noise-tolerant active learning algorithm and shows some positive results. They produce bounds for linear threshold functions and for linear separators under the uniform distribution, where the algorithm succeeds for any amount of noise and shows exponential improvements if η < ε/16. However, the algorithm is not very sample efficient.

A² relies on a subroutine, such as the VC bound or the Occam's Razor bound, to compute a lower bound LB(S, h, δ) and an upper bound UB(S, h, δ) on the true error rate err_P(h) of h, using a sample S of examples drawn i.i.d. from P, such that LB(S, h, δ) ≤ err_P(h) ≤ UB(S, h, δ) holds for all h with probability 1 − δ.

The A² algorithm can be viewed as a robust selective sampling algorithm [9]. Selective sampling keeps track of two spaces: the current version space H_i, consistent with all labels queried so far, and the region of uncertainty R_i. The region of uncertainty includes all datapoints x ∈ X on which some two hypotheses in H_i disagree. In each round of the selective sampling algorithm, a random unlabeled example is picked from R_i and queried, eliminating all hypotheses in H_i inconsistent with the received label. In the agnostic case we cannot eliminate a hypothesis based on a single example.
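For reference, here is a minimal sketch of one round of realizable selective sampling over a finite hypothesis set, which is the procedure A² makes robust: query a random point from the region of uncertainty and discard every inconsistent hypothesis. The names and the finite-pool setting are assumptions of the sketch.

```python
import random

def region_of_uncertainty(H, pool, predict):
    """All pool points on which at least two hypotheses in H disagree."""
    return [x for x in pool if len({predict(h, x) for h in H}) > 1]

def selective_sampling_round(H, pool, predict, label):
    """One round of realizable selective sampling [9]: query a random point
    from the region of uncertainty and keep only the consistent hypotheses.
    A^2 replaces this hard elimination with comparisons of UB and LB."""
    R = region_of_uncertainty(H, pool, predict)
    if not R:
        return H                      # every hypothesis agrees everywhere: done
    x = random.choice(R)
    y = label(x)
    return [h for h in H if predict(h, x) == y]
```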

The A² algorithm samples a set of examples S and uses UB and LB to calculate the disagreement of a region,

DISAG_D(G) = Pr_{x∼D}[∃ h_1, h_2 ∈ G : h_1(x) ≠ h_2(x)].

If all h ∈ H_i agree on some region, that region can be safely eliminated, thereby reducing the region of uncertainty. The algorithm also eliminates all hypotheses whose lower bound is greater than the minimum upper bound. Each round completes when S is large enough to cut the region of uncertainty in half; therefore the number of rounds is bounded by log(1/ε). The algorithm stops when

DISAG_D(H_i)(min_{h∈H_i} UB(S, h, δ_k) − min_{h∈H_i} LB(S, h, δ_k)) ≤ ε,

and A² returns ĥ = argmin_{h∈H_i} UB(S, h, δ_k), where i and k are the iteration indices at which the algorithm satisfied the condition. The bounds of A² for the class of linear separators under the uniform distribution over the unit sphere are described in a later section.

Algorithm 2 A² Algorithm
Input: ε, Sample oracle for D, Label oracle O, H
Initialize: i = 1, D_1 = D, H_1 = H, S = ∅, k = 1
while DISAG_D(H_i)(min_{h∈H_i} UB(S, h, δ_k) − min_{h∈H_i} LB(S, h, δ_k)) > ε do
  Set S = ∅, H'_i = H_i, k = k + 1
  while DISAG_D(H'_i) ≥ (1/2) DISAG_D(H_i) do
    if DISAG_D(H_i)(min_{h∈H'_i} UB(S, h, δ_k) − min_{h∈H'_i} LB(S, h, δ_k)) ≤ ε then
      Output: ĥ = argmin_{h∈H'_i} UB(S, h, δ_k)
    else
      S' = a set of 2|S| + 1 samples from D_i satisfying ∃ h_1, h_2 ∈ H'_i : h_1(x) ≠ h_2(x)
      S = S ∪ {(x, O(x)) : x ∈ S'}
      H'_i = {h ∈ H'_i : LB(S, h, δ_k) ≤ min_{h'∈H'_i} UB(S, h', δ_k)}
      k = k + 1
    end if
  end while
  H_{i+1} = H'_i;  D_{i+1} = D_i conditioned on ∃ h_1, h_2 ∈ H'_i : h_1(x) ≠ h_2(x);  i = i + 1
end while
Output: ĥ = argmin_{h∈H_i} UB(S, h, δ_k)
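The UB/LB subroutine that Algorithm 2 repeatedly invokes can be instantiated in several ways; the paper suggests the VC bound or the Occam's Razor bound. As a minimal, illustrative stand-in, the sketch below uses a Hoeffding-plus-union bound over a finite hypothesis set (so it is not the VC-based version used in [4]); the names are mine.

```python
import numpy as np

def empirical_error(h, S, predict):
    """Fraction of the labeled sample S = [(x, y), ...] that h misclassifies."""
    return float(np.mean([predict(h, x) != y for x, y in S])) if S else 1.0

def lb_ub(h, S, delta, n_hypotheses, predict):
    """LB(S, h, delta) and UB(S, h, delta): with probability 1 - delta,
    LB <= err_P(h) <= UB simultaneously for all n_hypotheses hypotheses,
    by Hoeffding's inequality and a union bound."""
    if not S:
        return 0.0, 1.0
    slack = np.sqrt(np.log(2 * n_hypotheses / delta) / (2 * len(S)))
    e = empirical_error(h, S, predict)
    return max(0.0, e - slack), min(1.0, e + slack)
```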

4.2 Teaching Dimension and Active Learning

Hanneke [5] describes a general noise-tolerant active learning algorithm that is based on exact learning with membership queries, and shows the first nontrivial general upper bound on label complexity in a noisy active learning model. In exact learning, the algorithm is required to identify the oracle's actual target function rather than approximating it with high probability; there is no classification noise, and the algorithm can ask for the label of any example. In a sense it is a limiting case of PAC active learning. An additional restriction of the algorithm in [5] is that it works only for (arbitrary but) persistent classification noise, meaning the label of a datapoint cannot change from one query to the next.

The goal of exact learning is to ask for labels f(x) until the only concept in C consistent with the observed labels is the target f ∈ C. Here C ⊆ C_F, where F is the corresponding σ-algebra on the set X and C_F is the set of all F-measurable f : X → {−1, 1}. MembHalving (Hegedüs [6]) is an example of an exact learning algorithm. It uses a majority vote to continuously shrink the version space until only one hypothesis is left. Querying a specifying set for h_maj guarantees that we at least halve the version space each round, because either h_maj makes a mistake or we identify f.

Definition 1 For f ∈ C_F, XTD(f, V, U) = inf{ t : ∃ R ⊆ U such that |{h ∈ V : h(R) = f(R)}| ≤ 1 and |R| ≤ t }.

The teaching dimension is the minimum number of instances a teacher must reveal to uniquely identify any target concept chosen from the class. The extended teaching dimension (XTD) is a more restrictive form: the labels f(R) on a minimal subset R ⊆ U can be matched by at most one hypothesis in V, and |R| is at most the value t = XTD.
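The halving argument behind MembHalving can be made concrete with a small sketch: query the points of a specifying set R for the majority-vote hypothesis; either the oracle contradicts the majority somewhere (so every hypothesis that voted with the majority there is wrong, eliminating at least half of V), or the oracle agrees with h_maj on all of R and, by the definition of a specifying set, the target is pinned down. The code below is an illustration under these realizable exact-learning assumptions, not Hanneke's noise-tolerant procedure; finding a small specifying set is itself a nontrivial step that is simply assumed here.

```python
def majority_label(V, x, predict):
    """Label that the majority of the version space V assigns to x."""
    votes = sum(predict(h, x) for h in V)
    return 1 if votes >= 0 else -1

def membhalving_round(V, R, oracle, predict):
    """One MembHalving-style round [6]: query a specifying set R for h_maj.
    Returns the reduced version space and whether the target is identified."""
    for x in R:
        y = oracle(x)
        if y != majority_label(V, x, predict):
            # The majority was wrong at x, so at least half of V is eliminated.
            return [h for h in V if predict(h, x) == y], False
    # h_maj agreed with the oracle on all of R: at most one h in V is consistent.
    return [h for h in V if all(predict(h, x) == oracle(x) for x in R)], True
```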

The Teaching Dimension and Active learning algorithm (TDA), given as Algorithm 3, works by continuously reducing the size of the version space until it is between specified sizes. The method Reduce achieves this by obtaining a minimal specifying set R_j of a subsequence of U based upon the majority vote h_maj of V; V̂_j is the set of h ∈ V with h(R_j) ≠ Oracle(R_j). Reduce obtains such minimal sets r times, and V̂ is the set of all h ∈ V that appeared in more than θr of the sets V̂_j. It returns V' = V \ V̂; it is unlikely that these eliminations were caused by noisy datapoints. TDA then obtains, via the method Label, the labels from the version space that should be used for the final hypothesis, and returns the hypothesis with the smallest error on them. Label obtains the minimal specifying set for V based upon h_maj, as in Reduce, and labels those points; every example in Ū that was not in the minimal set is labeled by its majority value over the subsets in which h(R_j) = h_maj(R_j) = Oracle(R_j).

Algorithm 3 ReduceAndLabel (TDA)
Input: finite V ⊆ C_F, U = {x_1, x_2, ..., x_m} ∈ X^m, values ε, δ, η̂ ∈ (0, 1].
Initialize: u = |U|/(5 ln |V|), V_0 = V, i = 0
repeat
  i = i + 1
  Let U_i = {x_{1+u(i−1)}, x_{2+u(i−1)}, ..., x_{ui}}
  V_i = Reduce(V_{i−1}, U_i, δ/(48 ln |V|), η̂ + ε/2)
until |V_i| > (3/4)|V_{i−1}| or |V_i| ≤ 1
Let Ū = {x_{ui+1}, x_{ui+2}, ..., x_{ui+l}}, where l = (12/η̂) ln(12|V|²/δ)
L = Label(V_{i−1}, Ū, δ/12, η̂ + ε/2)
Output: the concept h ∈ V_i having smallest er_L(h) (or any h ∈ V_i if V_i = ∅).

It is important to use subsamples of size < 1/(16η) in TDA because the probability of such a subsample containing a noisy example is small.

Theorem 2 Let n = 1/(16(η + 3ε/4)), let N be the size of a minimal ε-cover of C, let l and s be the auxiliary quantities defined in [5], and let t = XTD(C, D, n, δ/(2s)). Then the number of labels queried in (C, D, ε, δ, η) is at most ts = O( t (η²/ε² + 1)(d log(1/ε) + log(1/δ)) log(d/ε) ). [5]

The theorem states that the bound is st, where s is based on n, N, and l, and t is the extended teaching dimension. This implies that the upper bound for any concept class is governed by its extended teaching dimension. The number of datapoints TDA requires is determined by the size of V, and it is known that |V| ≤ N < 2((4e/ε) ln(4e/ε))^d [11]; substituting this bound into the definitions of u, U_i, and Ū in Algorithm 3 gives the datapoint bound listed in Table 1.

The concept class of axis-aligned rectangles is shown as an application in [5], with XTD(C, D, n, δ) ≤ O((2/λ) log(nm/δ)). Results were not shown for any other concept classes; in particular, the most common application in the active learning model, the one described for QBC and A², is not mentioned. Why wasn't this concept class examined? How does this algorithm compare to the other noisy model, A²?

4.3 Linear Separators under the Uniform Distribution

Model               | # of Datapoints                                                 | # of Labels Queried
QBC                 | O((d/ε) log(1/δ))                                               | O(d log(1/ε))
Modified Perceptron | O((d/ε) log(1/ε))                                               | O(d log(1/ε))
A²                  | (64/ε²)(2d ln(12/ε) + ln(4/δ))                                  | O(d(d ln d + ln(1/δ)) ln(1/ε))
TDA                 | (224/(η + ε/2)²) ln(48N/δ) · 5 ln N, N < 2((4e/ε) ln(4e/ε))^d   | > (2^d/√d)(η²/ε² + 1)(d log(1/ε) + log(1/δ)) log(d/ε)

Table 1: Bounds for the models that analyze the class of linear separators under D

A common application analyzed is data drawn from the unit sphere in R^d, where the labels are divided by a linear separator that passes through the origin of the sphere. The Teaching Dimension and Active Learning model did not examine this case, which would have been useful in analyzing the difference between the two noisy models. I will show the upper bounds of each algorithm for this application and provide an analysis of the two relative to each other. Table 1 displays the bounds on datapoints and queries for each of the models that analyzed this classifier. QBC and the Modified Perceptron run the perceptron algorithm on this classifier to produce these bounds.

A²

The A² algorithm shows exponential improvements for the linear separator over the unit sphere. It is well-designed for this application due to the minimization that it does to the area of uncertainty.

Theorem 3 Let X, H, and D be as defined above, and let LB and UB be the VC bound. Then for any 0 < ε < 1/2, 0 < η < ε/(16√d), and δ > 0, with probability 1 − δ the algorithm A² requires

O( d (d ln d + ln(1/δ)) ln(1/ε) )

calls to the labeling oracle for linear separators under the uniform distribution, where δ' = δ/N²(ε, δ, H) and N(ε, δ, H) is as given in [4].

δ' is based on N(ε, δ, H), which is an upper bound on the number of bound (LB and UB) evaluations needed in the algorithm. If H is a set of functions from X to {−1, 1} with finite VC dimension d ≥ 1, then for any ε, δ > 0 the sample size required from D, an arbitrary but fixed probability distribution, is (64/ε²)(2d ln(12/ε) + ln(4/δ)). This is based upon standard sample complexity bounds from Anthony and Bartlett [12], and it implies that with probability at least 1 − δ, |err(h) − êrr(h)| ≤ ε for all h ∈ H.

TDA

The number of queries required in the TDA model is based upon the extended teaching dimension of the concept class. This poses a problem here: regardless of the number of datapoints given by the teacher, the separator cannot be identified exactly, because there are infinitely many linear separators in R^d. The instance space must be discretized to X = {0, 1}^d to produce any results.

Theorem 4 Let XTD be as defined above, let X = {0, 1}^d, and assume no datapoints lie on the separator. The bound on the number of labels queried in (C, D, ε, δ, η) for linear separators under the uniform distribution on X is greater than

(2^d/√d)(η²/ε² + 1)(d log(1/ε) + log(1/δ)) log(d/ε).

Proof: The teaching dimension of linearly separable functions is 2^d [8]. We are only concerned with linear separators through the origin, which implies that we only need to be concerned with the datapoints that lie near the origin. It is possible to shift any one of those datapoints slightly without changing the others and thereby require a different linear separator. The teaching dimension is therefore on the order of 2^d/√d (this value was received from Rocco Servedio). The XTD is even worse, since it is a more restrictive quantity; however, the TD is poor enough by itself, and it is therefore not necessary to find the XTD.

The proof therefore gives an approximate bound, obtained by substituting the TD for t = XTD, with s as defined in Theorem 2.

Based upon Theorem 4, it would not be a good idea to use TDA for linear separators under the uniform distribution. The poor bounds for this class are caused by the dependence of the TDA algorithm on the XTD. It can be assumed that if the XTD for a class is small, the algorithm performs well and is therefore a good algorithm to use; it would appear that this is the reason linear separators over the unit sphere were not examined in [5]. Hanneke shows the classifier of axis-aligned rectangles, which has a low XTD (although it required discretization as well). For the class of linear separators under the uniform distribution, the A² algorithm has a significantly smaller upper bound on the number of queries and would therefore be the better choice.

5 Conclusion and Open Questions

I have described a number of active learning algorithms and analyzed and compared the A² and TDA algorithms. I have produced results showing that A² is better for linear separators over the unit sphere. Some open questions:

1. In order to fully analyze and compare the two algorithms, what bounds does A² have for axis-aligned rectangles (the concept class shown using TDA)?

2. TDA is useful since it is a general active learning and noise model, but it does not do well in the setting analyzed by many other papers on this topic. Can TDA be altered so that it does not depend on the exact teaching dimension?

3. Can a general algorithm be written which would produce reasonable bounds for all of the applications?

4. Can general bounds be given for A²?

6 References

[1] Y. Freund, S. Seung, E. Shamir, and N. Tishby. (1997) Selective sampling using the query by committee algorithm. Machine Learning, 28.

[2] S. Dasgupta, A. Kalai, and C. Monteleoni. (2005) Analysis of perceptron-based active learning. COLT.

[3] S. Dasgupta. (2004) Analysis of a greedy active learning strategy. NIPS.

[4] M.-F. Balcan, A. Beygelzimer, and J. Langford. (2006) Agnostic active learning. Proc. of the 23rd International Conference on Machine Learning.

[5] S. Hanneke. (2007) Teaching dimension and the complexity of active learning. COLT.

[6] T. Hegedüs. (1995) Generalised teaching dimension and the query complexity of learning. Proc. of the 8th Annual Conference on Computational Learning Theory.

[7] S. A. Goldman and M. J. Kearns. (1995) On the complexity of teaching. Journal of Computer and System Sciences, 50.

[8] M. Anthony, G. Brightwell, and J. Shawe-Taylor. (1995) On specifying Boolean functions by labelled examples. Discrete Applied Mathematics, 61:1-25.

[9] D. Cohn, L. Atlas, and R. Ladner. (1994) Improving generalization with active learning. Machine Learning, 15(2).

[10] M. Kääriäinen. (2005) On active learning in the non-realizable case. NIPS Workshop on Foundations of Active Learning.

[11] D. Haussler. (1992) Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100.

[12] M. Anthony and P. Bartlett. (1999) Neural Network Learning: Theoretical Foundations. Cambridge University Press.
