Improved Stumps Combined by Boosting for Text Categorization

1000-9825/2002/13(08)1361-07 2002 Journal of Software Vol.13, No.8

DIAO Li-li, HU Ke-yun, LU Yu-chang, SHI Chun-yi
(State Key Laboratory of Intelligent Technology and System, Tsinghua University, Beijing 100084, China)
(Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
E-mail: dao99@mails.tsinghua.edu.cn
http://www.cs.tsinghua.edu.cn
Received October 15, 2001; accepted February 6, 2002

Abstract: Stumps, classification trees with only one split at the root node, have been shown by Schapire and Singer to be an effective method for text categorization when embedded in a boosting algorithm as its base classifiers. In their experiments, the splitting point (the partition) of each stump is decided by whether a certain term appears in a text document. This criterion is too weak to reach satisfactory accuracy even after the stumps are combined by boosting, and the number of boosting iterations therefore increases sharply, which indicates low efficiency. To improve these base classifiers, this paper proposes deciding the splitting point of each stump by all the terms of a text document. Specifically, it uses the similarity between the VSM vector of a text document and the representational VSM vector of each class as the partition criterion of the base classifiers. To further speed up convergence, the boosting weights assigned to sample documents are dynamically introduced into the computation of the representational VSM vectors of the possible classes. Experimental results show that the algorithm is both more efficient to train and more effective than its predecessor on text categorization tasks. This trend becomes more conspicuous as the problem scale grows.

Key words: text categorization; machine learning; stump; boosting

Boosting is an iterative machine learning procedure that successively classifies a weighted version of the sample and then re-weights the sample depending on how successful the classification was. Its purpose is to find a highly accurate classification rule by combining many weak or base hypotheses (classifiers), many of which may be only moderately accurate [1]. Stumps, which are classification trees [2] with only one split at the root node, have been shown to be effective when embedded in a boosting algorithm.
Schapire and Singer devised a boosting-based text categorization algorithm, called ADABOOST.MH, to efficiently represent and handle sets of labels [3]. In this algorithm, a simple, stump-like one-level classification tree is used as the base hypothesis or weak learner. Its splitting criterion is whether a certain term exists in a document (all words and pairs of adjacent words in a document are potential terms). Although this setting does improve the performance of text categorization, it still has some drawbacks.

Firstly, as a base classifier, it is problematic to decide whether a document belongs to a certain class only by checking whether a single term appears. Text categorization problems are usually very complex. There are often many classes and many intersections among these classes, which makes it nearly impossible to represent a document by one single term. This rule is undoubtedly too weak to be practical. Secondly, the higher the error rate of the base classifier, the more iterations the boosting algorithm needs to achieve reasonable overall performance. Improving the base classifier therefore also reduces the time spent on training. Thirdly, under this splitting rule, complicated computations and comparisons have to be made for each possible term to find the best one for partitioning. A text document typically contains hundreds or even thousands of terms. If the training document set is large, which is very likely, the computational complexity of the algorithm becomes unbearable, reaching $O(m \cdot s \cdot len)$. Here $m$ is the number of documents in the training set, $s$ is the number of possible classes, and $len$ is the average length of the training documents.

This paper introduces an idea to improve the design of stumps, which are specifically shaped by text categorization techniques. Experimental results are presented to show how well the new method performs.

Supported by the National Natural Science Foundation of China under Grant No.79990580; the National Grand Fundamental Research 973 Program of China under Grant No.G1998030414

DIAO Li-li was born in 1974. He is a Ph.D. candidate at the Department of Computer Science and Technology, Tsinghua University. His research interests are text mining, machine learning and KDD. HU Ke-yun was born in 1970. He is a post-doctor at the Department of Computer Science and Technology, Tsinghua University. His research interests are rough set and data mining. LU Yu-chang was born in 1937. He is a professor at the Department of Computer Science and Technology, Tsinghua University. His current research areas are machine learning and KDD. SHI Chun-yi was born in 1935. He is a professor and doctoral supervisor at the Department of Computer Science and Technology, Tsinghua University. His current research areas are AI and MAS.

1 Pre-Processing of Training Documents

VSM (vector space model) is currently the most popular representational model for text documents [4]. Given a set of $m$ training text documents, $D=\{Doc_1, Doc_2, \ldots, Doc_m\}$, any $Doc_i \in D$, $i=1,2,\ldots,m$, can be represented as a formalized feature vector $V(Doc_i) = (val(t_{i1}), \ldots, val(t_{ik}), \ldots, val(t_{in}))$, $k=1,2,\ldots,n$. Here $n$ is the number of all possible terms in the space of the training set, and $t_{ik}$ represents the $k$-th term of $Doc_i$. $val(t_{ik})$ is a numeric value used to measure the importance of $t_{ik}$ in $Doc_i$, $0 \le val(t_{ik}) \le 1$. By this means, the problem of processing text documents is changed into the problem of processing numerical vectors, which is well suited to mathematical methods. $val(t_{ik})$ can be easily computed by

$$val(t_{ik}) = \frac{tf_{ik} \cdot \log(m/d_k + \alpha)}{\sqrt{\sum_{k=1}^{n} tf_{ik}^{2} \cdot \log^{2}(m/d_k + \alpha)}}.$$

$tf_{ik}$ is the appearance frequency of $t_{ik}$ in $Doc_i$. $d_k$ denotes the number of training documents in which $t_{ik}$ appears. $\alpha$ is a constant. In the experiments we choose $\alpha=0.5$.
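As a rough illustration of the weighting formula above, the following sketch computes the normalized weights of one document from its term frequencies. The function name `vsm_vector` and its argument layout are assumptions made for this example, not part of the paper; it also assumes every vocabulary term occurs in at least one training document, so that $d_k \ge 1$.

```python
import math

def vsm_vector(tf, doc_freq, m, alpha=0.5):
    """Return the weights val(t_ik) of one document (hypothetical helper).

    tf[k]       : frequency of term k in the document (tf_ik)
    doc_freq[k] : number of training documents containing term k (d_k >= 1)
    m           : number of training documents
    alpha       : the constant of the weighting formula (0.5 in the paper)
    """
    # Numerator of the formula for every term: tf_ik * log(m / d_k + alpha)
    raw = [f * math.log(m / d + alpha) for f, d in zip(tf, doc_freq)]
    # Denominator: Euclidean length of the raw weight vector
    norm = math.sqrt(sum(w * w for w in raw))
    if norm == 0.0:
        # A document without any vocabulary term gets an all-zero vector
        return [0.0] * len(raw)
    return [w / norm for w in raw]

# Three-term vocabulary, four training documents
vec = vsm_vector(tf=[2, 0, 1], doc_freq=[1, 3, 2], m=4)
```

Because of the normalization, every component lies in $[0,1]$ and the vector has unit length, which is what makes the later cosine comparison cheap.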
While computing term frequencies, a stop list is used to remove function words such as "of" and "the". If the task is to classify Chinese text documents, Chinese word segmentation is needed before this computation. With the values $val(t_{ik})$, feature selection can be performed by defining an importance threshold for each term. Text categorization algorithms without feature selection do not work well on relatively large training sets, but their accuracy is sometimes better than that of algorithms with feature selection, because the latter may omit important terms or include misleading terms during selection. We define a parameter $\rho \in [0,1]$, called the feature reduction factor, to reflect the ratio of selected terms to all the terms in the term space. $\rho=1$ implies no feature selection at all and $\rho=0$ implies that no term is selected. In most cases $\rho$ lies between these two boundary points.

2 Improving Stumps

Now let us observe the most commonly used accompanying classifiers, those that partition the domain of the predictor variables. The best-known example is the classification tree. Each classification tree partitions the training document space $D$ into disjoint blocks $D_0, D_1, \ldots, D_N$ whose union is $D$. All points within a given block are classified identically, so if $Doc, Doc' \in D_j$ then $h(Doc)=h(Doc')$. Here $h(\cdot)$ is the class predicted by the hypothesis. As many classifications need to be made in the overall boosting algorithm, there has been much focus on using stumps to make them. These are just classification trees with only one splitting node, so that the classifier $h(Doc)$ splits the data into only two disjoint regions. These stumps are defined completely by the single splitting question that partitions the data. For the reasons described in the introductory section, we put forward a general form to describe a new splitting criterion:

$$h(Doc, l) = \begin{cases} c_0 & \text{if } Sim(Doc, l) < thrs \\ c_1 & \text{if } Sim(Doc, l) \ge thrs \end{cases}$$

$h(Doc, l)$, one of the base classifiers established by stumps designed for multi-class multi-label settings, aims to predict the relationship between a certain $Doc$ and a certain class $l$ by the real value $c_0$ (which tends to deny) or $c_1$ (which tends to affirm). The general function $Sim(Doc, l)$ represents any function that numerically measures the relationship between a document and a class it may belong to. $thrs$ is the threshold for judging whether or not a document belongs to a certain class. In the design of $Sim(Doc, l)$, we found it natural to employ the cosine of the angle formed between two VSM vectors. Therefore, $Sim(Doc, l)$ is devised as follows:

$$Sim(Doc, l) = \cos(V(Doc), CV(l)) = \frac{\sum_{k=1}^{n} val_{V(Doc)}(k) \cdot val_{CV(l)}(k)}{\sqrt{\sum_{k=1}^{n} val_{V(Doc)}^{2}(k)} \cdot \sqrt{\sum_{k=1}^{n} val_{CV(l)}^{2}(k)}}, \quad 0 \le Sim(Doc, l) \le 1, \quad thrs \in [0,1].$$

$CV(l)$ denotes the central or representational vector of all training documents that belong to class $l$; its detailed computation is introduced in the next section. Since both $V(Doc)$ and $CV(l)$ can be computed, the only problem left to the base classifier is how to choose the value of $thrs$.

In the boosting algorithm, for each possible class, each document in the training set is bound to a real value, its weight. The whole distribution of all weights is denoted as $P$. Suppose $thrs$ splits the training set $D$ into two partitions: $D_0$, where $Sim(Doc, l) < thrs$, and $D_1$, where $Sim(Doc, l) \ge thrs$. Let $W_b^{jl}$ be the weight (with respect to the distribution $P$) of the documents in partition $D_j$ that are (are not) labeled by $l$. For each possible label $l \in y$, for $j \in \{0,1\}$, and for $b \in \{+1,-1\}$ (which is also written as $\{+,-\}$),

$$W_b^{jl}(thrs) = \sum_{i=1}^{m} P(Doc_i, l) \, [[\, Doc_i \in D_j \wedge IsBelonging(Doc_i, l) = b \,]].$$

Here $y$ denotes the set of all possible classes in the training set. $P(Doc_i, l)$ denotes the weight of training document $Doc_i$ related to class (label) $l$. $[[\cdot]]$ is a function that outputs 1 if its content is true and 0 otherwise. $IsBelonging(Doc_i, l)$ outputs $+1$ if $Doc_i$ belongs to class $l$ and $-1$ otherwise. The $W_b^{jl}(thrs)$ defined above is a function of $thrs$ since the partitions differ for different values of $thrs$. For simplicity, we use $W_b^{jl}$ to represent $W_b^{jl}(thrs)$.
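The two quantities introduced in this section, the cosine similarity $Sim(Doc, l)$ and the partition weights $W_b^{jl}$, can be sketched as follows for a single class $l$. The names `cosine_sim` and `partition_weights` and the list-based inputs are assumptions chosen for illustration; the paper does not prescribe an implementation.

```python
import math

def cosine_sim(v, cv):
    """Cosine of the angle between a document vector and a class vector."""
    dot = sum(a * b for a, b in zip(v, cv))
    nv = math.sqrt(sum(a * a for a in v))
    ncv = math.sqrt(sum(b * b for b in cv))
    if nv == 0.0 or ncv == 0.0:
        # Assumption: an all-zero vector is treated as having similarity 0
        return 0.0
    return dot / (nv * ncv)

def partition_weights(sims, weights, belongs, thrs):
    """Accumulate W[j][b] for one class (hypothetical helper).

    sims[i]    : Sim(Doc_i, l)
    weights[i] : boosting weight of Doc_i for class l
    belongs[i] : +1 if Doc_i is labeled with l, -1 otherwise
    thrs       : candidate threshold in [0, 1]
    """
    W = {0: {+1: 0.0, -1: 0.0}, 1: {+1: 0.0, -1: 0.0}}
    for sim, w, b in zip(sims, weights, belongs):
        j = 1 if sim >= thrs else 0   # partition D_1 or D_0
        W[j][b] += w
    return W

W = partition_weights(sims=[0.9, 0.2, 0.6],
                      weights=[0.5, 0.25, 0.25],
                      belongs=[+1, -1, -1],
                      thrs=0.5)
```

In this toy call, the first and third documents fall into $D_1$ and the second into $D_0$, so `W[1]` holds the weight of one positive and one negative document.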
According to the theory of Schapire and Singer [5], the best $thrs$ should be the one that splits the whole training set into two partitions that minimize the value of

$$z = 2 \sum_{j \in \{0,1\}} \sum_{l \in y} \sqrt{W_{+}^{jl} \, W_{-}^{jl}}.$$

Here $z$ is the score function defined for the base classifier. Since $thrs \in [0,1]$, we need to choose discrete points from the domain $[0,1]$ to find the best $thrs$. Let $A \ge 1$, $A \in N$. By picking the series of points $0, \frac{1}{A}, \frac{2}{A}, \ldots, \frac{A-1}{A}, 1$, we obtain $(A+1)$ points as the possible values $thrs$ may take. The larger $A$ is, the more accurate the final $thrs$ is. For each candidate value of $thrs$, the corresponding score $z$ is computed. After the scores of all candidate points have been calculated, the $thrs$ with the lowest score is selected as the threshold of the stump, the base classifier of boosting.
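The threshold search described above amounts to a grid search over $A+1$ candidate points. The sketch below is one possible reading of it, assuming similarities, weights and $\pm 1$ labels are stored as document-by-class nested lists; the names `score` and `best_threshold` are hypothetical, and ties are resolved in favour of the smallest candidate, which the paper does not specify.

```python
import math

def score(sims, weights, belongs, thrs):
    """z = 2 * sum over j and l of sqrt(W_+^{jl} * W_-^{jl}) for one thrs.

    sims[i][l], weights[i][l], belongs[i][l] refer to document i, class l.
    """
    m, s = len(sims), len(sims[0])
    total = 0.0
    for l in range(s):
        # W[j][0] accumulates W_+^{jl}, W[j][1] accumulates W_-^{jl}
        W = [[0.0, 0.0], [0.0, 0.0]]
        for i in range(m):
            j = 1 if sims[i][l] >= thrs else 0
            W[j][0 if belongs[i][l] > 0 else 1] += weights[i][l]
        total += sum(math.sqrt(w_pos * w_neg) for w_pos, w_neg in W)
    return 2.0 * total

def best_threshold(sims, weights, belongs, A=100):
    """Try thrs = 0, 1/A, ..., 1 and keep the candidate with the lowest z."""
    candidates = [a / A for a in range(A + 1)]
    return min(candidates, key=lambda t: score(sims, weights, belongs, t))

# One class, four equally weighted documents; the first two are positive
sims = [[0.9], [0.8], [0.3], [0.1]]
weights = [[0.25], [0.25], [0.25], [0.25]]
belongs = [[+1], [+1], [-1], [-1]]
thrs = best_threshold(sims, weights, belongs, A=10)
```

Each evaluation of `score` is linear in the number of (document, class) pairs, so the whole search costs roughly $(A+1) \cdot m \cdot s$ operations, which is consistent with the paper's later remark that large $A$ becomes expensive.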

The output of the stump therefore should be:

$$h(Doc, l) = \begin{cases} c_0 = \dfrac{1}{2} \log \dfrac{W_{+}^{0l}}{W_{-}^{0l}} & \text{if } Sim(Doc, l) < thrs \\[2ex] c_1 = \dfrac{1}{2} \log \dfrac{W_{+}^{1l}}{W_{-}^{1l}} & \text{if } Sim(Doc, l) \ge thrs \end{cases}$$

3 Boosting Algorithm

Let $\chi$ denote the domain of possible text documents and let $y=\{y_1, y_2, \ldots, y_s\}$ be a finite set of labels or classes. Let $D=\{Doc_1, Doc_2, \ldots, Doc_m\}$ denote the training set of $m$ text documents, $D \subseteq \chi$. In the multi-label case, each document $Doc \in \chi$ may be assigned multiple labels in $y$. Thus, a labeled example is a pair $(Doc, Y(Doc))$ where $Y(Doc) \subseteq y$ is the set of labels assigned to $Doc$. Formally, the training set passed to the boosting algorithm should be $D=\{(Doc_1, Y(Doc_1)), (Doc_2, Y(Doc_2)), \ldots, (Doc_m, Y(Doc_m))\}$.

We employ the improved stumps described in Section 2 as the base classifier repeatedly called by the boosting algorithm. As for the input needed by the stumps, besides the training set $D$ and the distribution $P$, the central (representational) vectors $CV(l)$ of each possible label (class) are also important. The traditional method is, for each class $l \in y$, to compute the arithmetic mean of the VSM vectors of all training documents that have $l$ as one of their classes. Another version assigns different weights to different training documents according to their importance and then averages the weighted VSM vectors to obtain better central vectors. Since the distribution $P$ maintained by the boosting algorithm also reflects the relative importance of the training documents with respect to each possible class, it is natural to use $P$ in computing $CV(l)$ for each $l \in y$. Formally,

$$CV(l) = \frac{\sum_{Doc_i:\, l \in Y(Doc_i)} V(Doc_i) \cdot P(Doc_i, l)}{\sum_{Doc_i:\, l \in Y(Doc_i)} P(Doc_i, l)}.$$

This setting is expected to speed up the convergence of boosting, which, if true, would be a clear advantage over the traditional method for computing central (representational) class vectors.

Now we can illustrate the boosting algorithm, slightly modified to work with the settings defined above. The boosting algorithm is shown in Fig.1. This algorithm maintains a set of weights as a distribution $P$ over example documents and labels (classes). Initially, this distribution is uniform. On each round $t$, the distribution $P_t$ (together with the training set $D$ and the class central vectors $CV_t$) is passed to the base classifier, which computes weak hypotheses $\{h_t(Doc, l)\}$ for all $l \in y=\{y_1, y_2, \ldots, y_s\}$, where $h_t: \chi \times y \to R$. We interpret the sign of $h_t(Doc, l)$ as a prediction as to whether the label $l$ is or is not assigned to $Doc$.
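The weighted central vector $CV(l)$ of this section can be sketched as a weighted mean over the documents labeled with $l$. The helper name `central_vector` and the use of Python sets for $Y(Doc_i)$ are assumptions made for this example.

```python
def central_vector(vectors, weights, label_sets, label):
    """Weighted mean of V(Doc_i) over documents labeled with `label`.

    vectors[i]    : VSM vector V(Doc_i)
    weights[i]    : boosting weight P(Doc_i, label)
    label_sets[i] : set of labels Y(Doc_i)
    """
    n = len(vectors[0])
    acc = [0.0] * n
    total = 0.0
    for v, w, labels in zip(vectors, weights, label_sets):
        if label in labels:
            total += w
            for k in range(n):
                acc[k] += w * v[k]
    if total == 0.0:
        # Assumption: a class without weighted members yields a zero vector
        return acc
    return [a / total for a in acc]

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
label_sets = [{"sports"}, {"sports", "news"}, {"news"}]
cv = central_vector(vectors, [0.75, 0.25, 0.5], label_sets, "sports")
```

With equal weights the result reduces to the arithmetic mean of the traditional method; with boosting weights, documents that are currently hard to classify pull the central vector towards themselves.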
The magnitude of the prediction $|h_t(Doc, l)|$ is interpreted as a measure of confidence in the prediction. Since the base classifier aims to minimize the training set error, according to the theory of R. Schapire, $\alpha_t$ should be set to 1. The final hypotheses rank documents using weighted votes of the base classifiers. This algorithm is derived using a natural decomposition of the multi-class, multi-label problem into $s$ orthogonal binary classification problems. That is, we can think of each observed label set $Y(Doc)$ as specifying $s$ binary labels (depending on whether a label $l$ is or is not included in $Y(Doc)$), and we can then apply binary-prediction boosting algorithms.
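To make the decomposition into $s$ binary problems and the re-weighting step concrete, the sketch below builds the $\pm 1$ label matrix from the label sets and performs one exponential weight update in the style of Schapire and Singer. The function names `binary_labels` and `update_distribution` are hypothetical, and the stump predictions are supplied by hand rather than learned.

```python
import math

def binary_labels(label_sets, classes):
    """+1 if a class is in Y(Doc_i), -1 otherwise: s binary labels per document."""
    return [[+1 if c in labels else -1 for c in classes] for labels in label_sets]

def update_distribution(P, belongs, h, alpha=1.0):
    """One boosting round of re-weighting over (document, label) pairs.

    P[i][l]       : current weight of document i for label l
    belongs[i][l] : +1 / -1 label from binary_labels
    h[i][l]       : real-valued prediction of the base classifier
    Returns the normalized new distribution and the normalizer z.
    """
    unnorm = [[P[i][l] * math.exp(-alpha * belongs[i][l] * h[i][l])
               for l in range(len(P[i]))]
              for i in range(len(P))]
    z = sum(sum(row) for row in unnorm)
    return [[u / z for u in row] for row in unnorm], z

classes = ["a", "b"]
belongs = binary_labels([{"a"}, {"a", "b"}], classes)   # [[1, -1], [1, 1]]
P = [[0.25, 0.25], [0.25, 0.25]]                        # uniform start, 1/(m*s)
h = [[0.5, -0.5], [-0.5, 0.5]]                          # hand-made predictions
P_next, z = update_distribution(P, belongs, h)
```

Only the pair (second document, label "a") is predicted with the wrong sign here, so it is the only pair whose weight grows after the update.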

Given: $D=\{(Doc_1, Y(Doc_1)), (Doc_2, Y(Doc_2)), \ldots, (Doc_m, Y(Doc_m))\}$ where $Doc_i \in \chi$, $Y(Doc_i) \subseteq y=\{y_1, y_2, \ldots, y_s\}$, $s \in N$, $i=1,2,\ldots,m$.
Initialize $P_1(Doc_i, l) = 1/(m \cdot s)$ for all $i=1,2,\ldots,m$ and all $l \in y$.
For $t=1,2,\ldots,T$:
    For each class $l \in y$, compute the corresponding central (representational) VSM vector $CV_t(l)$;
    Pass the distribution $P_t = \{P_t(Doc_i, l)\}$ and the central vectors $CV_t = \{CV_t(l)\}$ to the base classifier, that is, the stump;
    The stump generates base classifiers $\{h_t(Doc, l)\}$ for all $l \in y$, where $h_t: \chi \times y \to R$;
    Set $\alpha_t = 1$ and $z_t = \sum_{i=1}^{m} \sum_{l \in y} P_t(Doc_i, l) \, e^{-\alpha_t \, IsBelonging(Doc_i, l) \, h_t(Doc_i, l)}$;
    Update: $P_{t+1}(Doc_i, l) = \dfrac{P_t(Doc_i, l) \, e^{-\alpha_t \, IsBelonging(Doc_i, l) \, h_t(Doc_i, l)}}{z_t}$ for all $l \in y$ and all $i=1,2,\ldots,m$.
Output the final hypotheses: $f(Doc, l) = \sum_{t=1}^{T} \alpha_t h_t(Doc, l)$.

Fig.1 The boosting algorithm for text categorization

4 Experiments

We choose precision as the main measure for assessing and comparing the performance of text categorization methods. We call the new algorithm proposed here AdaBoost.SZ. Three other algorithms are selected for comparison: TF-IDF, NAÏVE BAYESIAN and ADABOOST.MH. We have conducted a number of experiments to test their validity and to compare their performance. The parameters $m$, $A$, $s$, $T$ and $\rho$, as defined previously, are adjusted to check the performance of these algorithms in different situations. For these experiments we used YAHOO! CHINESE NEWS as training and testing documents, which can be retrieved from http://cn.news.yahoo.com. The classes and corresponding training examples (documents) are selected according to the parameters adjusted. Tables 1 to 5 present the results.
Table 1 Precision on different feature reduction factors (ρ)
Algorithm          ρ=0.01     ρ=0.03     ρ=0.06     ρ=0.10     ρ=1.00
AdaBoost.SZ        0.8 8      0.889 4    0.911      0.934 4    0.900 7
TF-IDF             0.710 0    0.757 3    0.77       0.789 6    0.776 5
NAÏVE BAYESIAN     0.765 0    0.810 4    0.834 4    0.850 5    0.830 3
ADABOOST.MH        0.784 6    0.840 3    0.876 5    0.890 9    0.901

Table 2 Precision on different training document numbers (m)
Algorithm          m=500      m=1000     m=1500     m=2000     m=2500
AdaBoost.SZ        0.703 3    0.810 5    0.887 9    0.905 1    0.93 0
TF-IDF             0.66 0     0.715 1    0.759      0.775 6    0.781 6
NAÏVE BAYESIAN     0.670 9    0.730 6    0.794 4    0.83 8     0.865 9
ADABOOST.MH        0.658 3    0.751 1    0.84 9     0.889 4    0.900 1

Table 3 Precision on different possible class numbers (s)
Algorithm          s=5        s=10       s=15       s=20       s=40
AdaBoost.SZ        0.960 0    0.967 3    0.944 3    0.946 7    0.933 5
TF-IDF             0.84 3     0.819 6    0.800 8    0.79 1     0.77
NAÏVE BAYESIAN     0.889 4    0.875 3    0.866 0    0.854      0.841 7
ADABOOST.MH        0.93 3     0.96 5     0.911 8    0.909 5    0.893 9

Table 4 Precision on different values of A
Algorithm          A=10       A=50       A=100      A=500      A=1000
AdaBoost.SZ        0.897      0.95 6     0.934 4    0.937 3    0.937 6

Table 5 Precision on different iteration times of boosting (T)
Algorithm          T=50       T=100      T=200      T=500      T=1000
AdaBoost.SZ        0.866 3    0.888      0.893 7    0.934 4    0.945 6
ADABOOST.MH        0.801 0    0.834 9    0.859      0.890 9    0.901 7

Generally speaking, TF-IDF is a very simple method for text categorization with relatively low accuracy. NAÏVE BAYESIAN is a little better than TF-IDF, and since it is a probability-based method, its accuracy improves when the training set becomes large enough. ADABOOST.MH is worse than NAÏVE BAYESIAN, and even worse than TF-IDF, when the training set is small and the number of iterations is not large enough. But when the parameters are adjusted to a reasonable setting, its overall performance is slightly better than that of NAÏVE BAYESIAN. In most cases AdaBoost.SZ outperforms ADABOOST.MH.

While adjusting the parameters we observed some interesting phenomena. The performance generally increases with a higher reduction factor $\rho$, which implies selecting more terms into the feature set. But once the reduction factor is higher than 0.1, the improvement in accuracy is too small to be noticed, and when $\rho$ is very large (close to 1) the accuracy is even slightly worse than that of $\rho=0.1$. The reason may be that, as terms are chosen as features, some useless or even misleading terms are likely to be included. When the feature set is small, such bad terms are also sparse and hence have only a trivial influence on the classification results. But when the feature set is large enough, such terms are likely to play a role in making the final decision.

The experiments also show that the accuracy of the classifiers increases with the training set size $m$. Boosting algorithms have an overwhelming advantage over non-boosting algorithms once $m>1500$.

For each boosting algorithm, the number of iterations $T$ is one of the most important parameters for achieving the required accuracy at reasonable cost (CPU time). As $T$ increases, the boosting algorithms increase their accuracy accordingly, and over-fitting is hardly observed, which is consistent with the theoretical and practical analysis of others [3].

Theoretically, the higher the value of parameter $A$, the better.
But according to the experimental results, once $A>100$ the improvement in accuracy becomes trivial, and running such programs becomes an unbearable task with sharply increased consumption of time and space. The reason may be that 0.01 as a scale is already fine enough for distinguishing between useful terms and useless terms.

We also compared the efficiency of our new idea for speeding up the convergence of the boosting algorithm with that of a boosting algorithm that does not employ this technique. Figure 2 presents the results, which clearly show the benefits this new idea can provide.

Fig.2 Efficiency between fast and non-fast boosting algorithms (x-axis: amount of training documents; y-axis: time spent, in minutes; series: fast boosting algorithm, non-fast boosting algorithm)

5 Conclusions

So far we have discussed an idea for improving the performance of the boosting algorithm with stumps as its base classifiers, employed in multi-class multi-label text categorization tasks. The experiments indicate that this idea is beneficial and promising. Further developments might include incorporating the concept of SVM into the formation of stumps and using Bayesian theorem to improve the performance of boosting as a whole. A stump is a kind of base classifier that is easy to implement, but its performance is still far from perfect. A more complicated version, i.e. one with more partitions or more nodes, may be more suitable for integration into boosting algorithms for text categorization problems.

References:
[1] Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997,55(1):119~139.
[2] Breiman, L., Friedman, J., Olshen, R., et al. Classification and Regression Trees. Belmont, CA: Wadsworth, 1984. 1~357.
[3] Schapire, R., Singer, Y. BoosTexter: a boosting-based system for text categorization. Machine Learning, 2000,39(2/3):135~168.
[4] Salton, G., Wong, A., Yang, C. A vector space model for automatic indexing. Communications of the ACM, 1975,18:613~620.
[5] Schapire, R., Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 1999,37(3):297~336.