Learning to model sequences generated by switching distributions


Yoav Freund
AT&T Bell Labs, 600 Mountain Ave., Murray Hill, NJ, USA

Dana Ron
Computer Science Institute, Hebrew University, Jerusalem, Israel

Abstract

We study efficient algorithms for solving the following problem, which we call the switching distributions learning problem. A sequence $S = \sigma_1 \sigma_2 \ldots \sigma_n$ over a finite alphabet $\Sigma$ is generated in the following way. The sequence is a concatenation of $K$ runs, each of which is a consecutive subsequence. Each run is generated by independent random draws from a distribution $p_i$ over $\Sigma$, where $p_i$ is an element in a set of distributions $\{p_1, \ldots, p_N\}$. The learning algorithm is given this sequence, and its goal is to find approximations of the distributions $p_1, \ldots, p_N$ and to give an approximate segmentation of the sequence into its constituting runs. We give an efficient algorithm for solving this problem and show conditions under which the algorithm is guaranteed to work with high probability.

1 Introduction

Our work is motivated by the Hidden Markov Model (HMM). The HMM is a model for the distribution of sequences over a finite alphabet $\Sigma$. An HMM consists of a finite number of hidden states, each of which is associated with a distribution over the alphabet. There is a transition probability associated with each pair of states. The HMM can be seen as a model which generates infinite sequences as follows. At each time step the model generates a single character from $\Sigma$ according to $p_i$, where $i$ is its current hidden state; it then makes a transition to a new state with the transition probability associated with that pair of states.

HMMs are a popular model in the context of speech analysis. One can view the hidden state as representing the state of the vocal tract of the speaker, which is not directly observable but controls the distribution of the observable sounds. The Baum-Welch algorithm [2] is the predominant algorithm for learning HMMs from examples. In many real-world cases this algorithm produces accurate hypotheses after a small number of iterations. There is almost no theory for explaining why Baum-Welch performs so well in some cases and badly in others. The theoretical results regarding the problem of learning HMMs of which we are aware are mostly negative: Abe and Warmuth [1] and Gillman and Sipser [5] show that learning HMMs is NP-hard under various conditions.

The model of sequences that we consider is similar to the HMM, with the restriction that the transition probabilities assign a very high value to the transition from each hidden state to itself. In other words, the model tends to stay at the same hidden state for long periods of time and switches from state to state only infrequently. Such an assumption can be justified in the context of speech analysis, because the time scale in which speech is sampled is usually an order of magnitude smaller than the time scale of changes in the vocal tract. The assumption of infrequent transitions lets us alleviate the problem of estimating the transition probabilities and rephrase the learning problem in a slightly different way. In our new formulation the learning problem consists of two interdependent problems: the modeling problem, which is to estimate the distributions, and the segmentation problem, which is to partition the sequence into short runs that correspond to the different distributions. Given a solution
to the segmentation problem, the transition probabilities can be easily estimated.

We define the switching distributions learning problem as follows. The learning target consists of $N$ distributions and a segmentation sequence over the integers $1, \ldots, N$. The segmentation sequence is a concatenation of $K$ runs, each of which is a repetition of a single index. The $j$-th element of the segmentation sequence is the index of the distribution that generates the $j$-th element of the sequence. We assume that we are given a sequence over a finite alphabet. We assume that the number of runs $K$ is larger than the number of distributions $N$, i.e., that the same distribution is used in many different runs. The learner receives a single sequence of length $n$ that is generated by the target, and its goal is to generate a hypothesis segmentation and a set of hypothesis distributions which are close to those of the target.

This problem is related to the problem of learning switching concepts studied by Blum and Chalasani [3]. However, in their setup the switching entities are concepts, i.e., mappings from some domain to $\{0, 1\}$, while in our setup the switching entities are distributions over a single space.

(For an introduction to HMMs and their use in speech analysis, see Rabiner and Juang [7].)

In this work we give an efficient algorithm for learning switching distributions. We describe several variants of the algorithm, each of which is guaranteed to succeed under slightly different conditions regarding the process which is generating the sequence.

Our algorithm works in the following general way. It starts by finding rough approximations of the distributions. This is done by finding short subsequences of $S$ which, with high probability, are generated mostly by a single distribution. Starting with these approximations of the distributions, the algorithm iterates the following two steps, which are very similar to the expectation and maximization steps of the Baum-Welch algorithm:

1. Using the approximate distributions, the algorithm finds an approximate segmentation of the sequence.
2. From the approximate segmentation, the algorithm finds new estimates of the distributions.

Our analysis shows that if the errors in the initial estimates of the distributions are sufficiently small, then the re-estimation process described above converges rapidly. Specifically, if the errors in the initial estimates are smaller than a constant factor of the distance between the target distributions, then the number of mistakes in the segmentation is, with high probability, smaller than [...], and the error in the re-estimated distributions is [...]. In other words, if the sequence is long enough (with respect to the number of runs and other parameters of the source, which we specify later), then even rough initial estimates of the target distributions are sufficient to achieve very accurate estimates within a single iteration of the algorithm.

The paper is organized as follows. In Section 2 we give the exact statement of the switching distributions problem. In Section 3 we describe the general algorithm that we use to solve the problem. The details of the algorithm depend on the number of distributions and on the amount of information given to the algorithm concerning the different parameters of the problem. In Section 4 we present our main results. Our strongest result is for $N = 2$ and is
given in Section 6. In Section 7 we give a general algorithm for general $N$, and in Section 8 we describe how to treat unspecified parameters.

2 Description of the Problem

We are interested in the following problem. Let $\Sigma$ be an alphabet of size $|\Sigma|$. Let $T = t_1 \ldots t_n$ be the target segmentation sequence, containing at most $K$ runs, where a run is a consecutive subsequence consisting of repetitions of a single index in $\{1, \ldots, N\}$. Let $P = \{p_1, \ldots, p_N\}$ be the set of target probability vectors, where each $p_i$, $1 \le i \le N$, is a $|\Sigma|$-dimensional probability vector defined over $\Sigma$. We denote by $p_i(\sigma)$ the probability assigned to $\sigma$ by $p_i$ (i.e., the $\sigma$-th coordinate of $p_i$).

We assume that a single sequence $S$ of $n$ elements from $\Sigma$ is generated according to the target $(T, P)$ in the following manner. For each $1 \le j \le n$, the element $\sigma_j$ is chosen independently at random according to the distribution defined by $p_{t_j}$. We are interested in algorithms which, given such a sequence, construct a hypothesis $(\hat{T}, \hat{P})$, where $\hat{T}$ is a hypothesis segmentation sequence and $\hat{P}$ is a set of distributions.

The error of a hypothesis $(\hat{T}, \hat{P})$ with respect to the target $(T, P)$ is defined as follows: [...], where $d(\cdot, \cdot)$ is a measure of divergence between distribution vectors. We give results with respect to three divergence measures, $d_1$, $d_2$ and $d_\infty$: [...]. We use $\mathrm{err}_1$, $\mathrm{err}_2$ and $\mathrm{err}_\infty$ to denote the errors of a hypothesis with respect to $d_1$, $d_2$ and $d_\infty$, respectively. As $d_1(p, q) \ge d_2(p, q) \ge d_\infty(p, q)$ for any distribution vectors $p$ and $q$, $\mathrm{err}_1$ is the most sensitive measure of the error of the hypothesis and $\mathrm{err}_\infty$ is the least sensitive measure.

Intuitively, a hypothesis with small error is one which defines a sequence of distributions that is very similar to the one which generated the sequence. If the target distributions are all far from each other, then the fact that the error of a hypothesis is small implies that to each target distribution there corresponds a hypothesis distribution which is close to it, and that this hypothesis distribution is matched to it on most of the sequence. This means that a hypothesis which has small error solves both the segmentation and the modeling problems described above.

A learning algorithm for the switching distributions problem receives as input a
single sequence $S$, together with an accuracy parameter $\epsilon > 0$ and a reliability parameter $\delta > 0$. After time polynomial in $n$, $1/\epsilon$ and $1/\delta$, the algorithm outputs a hypothesis. We require that there exists a polynomial in $1/\epsilon$, $1/\delta$ [...] such that if $n$ is at least the value of this polynomial, then with probability at least $1 - \delta$ the error of the hypothesis is smaller than $\epsilon$.

3 The General Algorithm

In this section we describe an efficient algorithm for solving the switching distributions learning problem. Some elements of the algorithm are left unspecified. These elements are implemented differently for the case of two target distributions ($N = 2$) and for the general case of more than two distributions, and are described in detail in the following sections. The reason for the two different implementations is that we were able to derive better results for the case $N = 2$ by using procedures which exploit the fact that the sequence is generated by no more than two distributions.

The algorithm is described in Figure 1. It consists of two parts. In the first part the algorithm finds rough approximations of the target distributions. This is implemented, as will be shown in more detail in the following sections, by locating short subsequences that appear to be generated each by a single distribution. The second part of the algorithm is an iterative part in which the approximate distributions are used to generate an approximate segmentation, and this segmentation is then used to re-estimate the distributions. These two steps are repeated a fixed number of times. On each iteration, a cost vector is associated with each approximate distribution. This cost vector is used to calculate a total cost for any hypothesis segmentation of the sequence, as defined in step 2(b) of the algorithm. The algorithm selects the segmentation that achieves the minimal cost. Using this segmentation, the algorithm generates new estimates of the distributions, and the process repeats. Finally, after the last such iteration, the hypothesis is output.

    General algorithm for learning switching distributions

    Input: A sequence $S$;
           $|\Sigma|$ - the size of the alphabet;
           $N$ - the number of unknown distributions;
           $K$ - the maximal number of runs;
           $\gamma$ - the minimal fraction of the sequence that corresponds to each distribution.

    1. Initialization: Find initial approximations of the target distributions: [...].
    2. Do for each iteration [...]:
       (a) Set the cost vectors [...] as functions of the current estimates of the distributions.
       (b) Find the segmentation [...] with at most $K$ runs which minimizes the total cost: [...].
       (c) Calculate the new estimates of the distributions: for each index $i$, set the new estimate of the $i$-th distribution to the empirical distribution of the elements of $S$ to which the segmentation assigns the index $i$.
    3. Output the segmentation [...] and the distributions [...].

    Figure 1: The general learning algorithm.

We were not able to demonstrate that several iterations achieve substantially better performance than a single one, so our analysis concentrates on the case of a single iteration. However, in practice it seems likely that additional iterations would improve the accuracy of the hypothesis. We return to this point at the end of Section 6.

Finding the segmentation with at most $K$ runs that minimizes the total cost for a given set of cost vectors can be performed in time [...] using a dynamic programming technique which is essentially the same as the well-known Viterbi algorithm [8]. The cost vectors are chosen so that, with high probability, the segmentation with the lowest total cost does not differ significantly from the target segmentation. More specifically, our goal is to select cost vectors that satisfy the following two properties: (1) the expected cost of the target segmentation is smaller than the expected cost of any other segmentation (with at most $K$ runs); (2) with high probability, the segmentation that minimizes the cost on a sample sequence has a small number of segmentation errors. Once the segmentation with the lowest total cost is found, the probability
distributions are re-estimated and the process is repeated. The key property of this iterative process, as we shall show in the following sections, is that if the error of the initial estimates of the distributions is smaller than some threshold, then the iterative process increases the accuracy of the models very rapidly. We shall show that this threshold need not depend on the approximation parameter $\epsilon$, but rather is of the order of the smallest distance between any pair of target distributions.

4 Summary of Results

Before we present our results we add the following notation. For $1 \le i \le N$ we use $n_i$ to denote the number of elements in $T$ corresponding to the distribution $p_i$, and define $\gamma = \min_i n_i / n$. In other words, $\gamma$ is the minimal fraction of elements in $T$ corresponding to any single distribution. We use $L$ to denote the length of the shortest run in $T$.

We summarize our results in four theorems, which correspond to different variants of the algorithm described in Section 3. In the first variant we assume that there are only two target distributions and that the algorithm receives as input $K$ and $\gamma$. The error of the algorithm's hypothesis is measured with respect to $d_1$, which is the most sensitive distance measure we use.

(It actually suffices that the algorithm receive only an upper bound on $K$ and a lower bound on $\gamma$. In such a case, $K$ and $\gamma$ in the theorem below are simply exchanged by these bounds. A similar statement holds in the case of Theorem 2, where we may assume that the algorithm receives only a lower bound on $L$ as well as an upper bound on $K$.)

Theorem 1 (Main theorem for the case of two distributions). There exists a switching distributions learning algorithm such that for any target defining sequences whose length $n$ satisfies [...], the algorithm, when receiving a sequence generated according to the target together with $K$ and $\gamma$, outputs a hypothesis such that, with probability at least $1 - \delta$, its error $\mathrm{err}_1$ is at most $\epsilon$.

In the second variant we remove the assumption that $N$ is at most two, but we assume that the algorithm is given $K$ and $L$. We would like to note that we do not assume the algorithm knows $N$. In this case we require that the target distributions are well separated. We measure the separation between the target distributions by $\Delta$, the minimal distance between any pair of target distributions: [...]. Our requirements are that $\Delta$ and $\gamma$ are polynomially related to $\epsilon$ and that $L$ grows logarithmically with $n$. The error of the algorithm's hypothesis is measured with respect to $d_2$.

Theorem 2 (Main theorem for general $N$). There exists a switching distributions learning algorithm such that for any target defining sequences whose length $n$ satisfies [...], and which are a concatenation of runs of length at least [...], the algorithm, when receiving a sequence generated according to the target together with $K$ and $L$, outputs a hypothesis such that, with probability at least $1 - \delta$, its error $\mathrm{err}_2$ is at most $\epsilon$.

The third variant of our algorithm needs no additional input and makes only the assumption that $N = 2$. The error of the algorithm's hypothesis is measured with respect to $d_\infty$.

Theorem 3 (Main theorem for two distributions, unspecified parameters). There exists a switching distributions learning algorithm such that for any target defining sequences whose length $n$ satisfies [...], the algorithm, when receiving a sequence generated according to the target, outputs a hypothesis such that, with probability at least $1 - \delta$, its error $\mathrm{err}_\infty$ is at most $\epsilon$.

In our fourth variant we do not assume that the algorithm is given any of the parameters of the problem. However, we still require the existence of a lower bound on $L$ which grows logarithmically with $n$, and that $\Delta$ and $\gamma$ are polynomially related to $\epsilon$. The error of the algorithm's hypothesis is measured with respect to $d_\infty$.

Theorem 4 (Main theorem for general $N$, unspecified parameters). There exists a switching distributions learning algorithm such that for any target defining sequences whose length $n$ satisfies [...], and which are a concatenation of runs of length at least [...], the algorithm, when receiving a sequence generated according to the target, outputs a hypothesis such that, with probability at least $1 - \delta$, its error $\mathrm{err}_\infty$ is at most $\epsilon$.

The theorems presented above result from an analysis of a single iteration of step 2 of the learning algorithm. It is natural to ask whether by increasing the number of iterations we could significantly weaken the requirements on [...]. Our analysis does not support such a claim, and we shall later discuss this question briefly.

5 Useful Inequalities

In the proofs of our theorems and lemmas we apply several well-known inequalities that are given here as lemmas. The first is a Chernoff-Hoeffding type bound derived by Littlestone [6], and the second is due to Sanov ([4], page 292).

Lemma 5. For [...], let [...] be independent random variables, where [...]. Let [...]. Then for [...]: [...].

Lemma 6 (Sanov's Inequality). For an alphabet $\Sigma$ of size $|\Sigma|$, let $p$ be a $|\Sigma|$-dimensional probability vector defined over $\Sigma$. Let $S$ be a random sample of size $m$ generated according to $p$, and let the type of $S$ be a $|\Sigma|$-dimensional probability vector $\hat{p}$ defined as follows: the $\sigma$-th coordinate of $\hat{p}$ is the relative frequency of the symbol $\sigma$ in $S$. Then for any [...]: [...], where $D_{KL}(\hat{p} \| p)$ is the Kullback-Leibler (KL) divergence between the distributions, and is defined as follows: $D_{KL}(\hat{p} \| p) = \sum_{\sigma \in \Sigma} \hat{p}(\sigma) \log (\hat{p}(\sigma) / p(\sigma))$.

One more useful inequality is the following:

Lemma 7. Let $p$ and $q$ be two probability vectors. Then: [...].

6 The Case of Two Distributions

In this section we consider the case in which there are two target probability distributions, $p_1$ and $p_2$, over an alphabet $\Sigma$ of size $|\Sigma|$. The [...] distance between the two vectors plays an important role in our analysis and is denoted by $\Delta$. We assume that the algorithm knows the number of switches in the sequence, and $\gamma$, the minimum between the fraction of elements in the target sequence generated by $p_1$ and the fraction generated by $p_2$. As noted in Section 4, it suffices that the algorithm have only an upper bound on $K$ and a lower bound on $\gamma$. In Section 8 we give bounds on the additional error incurred when we remove this assumption.

In Figure 2 we describe how we get initial approximations of $p_1$ and $p_2$, and we define the pair of cost vectors given approximations of $p_1$ and $p_2$. The initial approximation procedure is based on the following two facts. The first is that, by definition of $K$ and $\gamma$, for both $p_1$ and $p_2$ there exists a subsequence of $S$ of length at least $\gamma n / K$ that was generated solely according to that probability distribution. The second fact is that, in expectation, the distance between pairs of empirical distributions defined based on pairs of subsequences is maximized when one of the subsequences was generated according to $p_1$ and the second according to $p_2$. Using these two facts we show that the pair of distributions chosen are good initial approximations of $p_1$ and $p_2$.

    Initial estimates for two distributions

    1. Set the window length [...] (if the minimum length of any run in $T$ is known, set [...]).
    2. For each window [...] and each symbol $\sigma \in \Sigma$, let [...] be the fraction of the elements in the window which are equal to $\sigma$. Let [...] denote the vector of these fractions.
    3. Find the pair of window indices [...] for which the distance between the two corresponding vectors is maximized.
    4. Set the initial estimates to be the two vectors of this pair.

    Choice of cost vectors for two distributions

    1. Given estimates [...], let [...].
    2. Let [...] be defined as follows: [...].
    3. Set [...] and let [...].

    Figure 2: The initialization procedure and the choice of the cost vectors for the case of two distributions.

Our main result for the case of two distributions is stated in Theorem 1, whose proof is divided into three lemmas. We use the following notation. Similar to the definition of $n_i$, which is used with respect to the target segmentation, we use [...] to denote the number of sequence locations for which [...], and [...] to denote the number of sequence locations for which [...].

Observe that the definition of the distribution of sample sequences is invariant under renaming of the target distributions. It is thus clear that the hypothesis generated by any learning algorithm can be close to the target only up to an arbitrary permutation of the distributions. This issue is side-stepped by our definition of the error of a hypothesis, but the lemmas that constitute the proof of Theorem 1 refer to the one-to-one mapping which defines this permutation. However, as the proofs are identical for any permutation, we shall refer to this mapping only in the statements of the lemmas, and otherwise consider only the case in which the names given to the distributions by the target and by the hypothesis are identical.

First we show that if the length of the sequence is large enough, then the initial estimates of $p_1$ and $p_2$ are guaranteed to have small error.

Lemma 8 (Initialization error for two distributions). If for some [...] the length of the
sequence satisfies [...], then with probability at least [...] there exists a one-to-one mapping [...] such that for [...]: [...].

Proof: For a given window index [...], let [...], and let [...] be such that the fractions of symbols in the window that were generated by $p_1$ and by $p_2$ are [...], respectively. We say that a vector is pure if either [...] or [...]. For every pair of indices [...], let [...]. It is not hard to show that the expected value of [...] achieves its maximal value for a pair of pure vectors satisfying [...]. We shall show that for some [...] (which we set subsequently), for every [...]:

    [...]    (1)

Let [...] be the pair of vectors for which [...] is maximized. Without loss of generality we assume that [...]. Based on our choice of the window length, there must exist a pair of pure vectors having [...]. Since for this pure pair [...], for the maximizing pair [...] as well. It follows that [...].

It remains to show that the selection of [...] assumed in Equation (1) exists. Applying Lemma 6, and using the bound on the KL divergence given in Lemma 7, we have that for every [...]: [...]. There are at most [...] such vectors, and hence by setting [...] as in the statement of the lemma we find that with probability at least [...], for every [...], as required.

Secondly, we show that if the errors of the estimates are not too large, then with high probability only a small number of mistakes exist in the segmentation with the lowest cost (defined using the corresponding cost vectors).
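The initialization procedure of Figure 2, whose error Lemma 8 bounds, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the window length `window_len` is left as a free parameter (the paper derives it from the problem parameters), windows are taken to be consecutive and non-overlapping, and the $L_1$ distance is used to compare windows. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def empirical_distribution(window, alphabet):
    """Relative frequency of each symbol of `alphabet` in `window`."""
    counts = Counter(window)
    return [counts[a] / len(window) for a in alphabet]


def l1_distance(p, q):
    """L1 distance between two probability vectors (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(p, q))


def initial_estimates(seq, alphabet, window_len):
    """Return the two window distributions that are farthest apart.

    `window_len` is a free parameter of this sketch; the sequence must
    contain at least two full windows.
    """
    # Cut the sequence into consecutive, non-overlapping windows.
    windows = [seq[i:i + window_len]
               for i in range(0, len(seq) - window_len + 1, window_len)]
    dists = [empirical_distribution(w, alphabet) for w in windows]
    # Pick the pair of windows whose empirical distributions differ most.
    i, j = max(combinations(range(len(dists)), 2),
               key=lambda ij: l1_distance(dists[ij[0]], dists[ij[1]]))
    return dists[i], dists[j]
```

The intuition matches the two facts stated above: if runs are long enough, some windows are generated by a single distribution, and a pair of such "pure" windows from different distributions tends to be the pair that is farthest apart.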

Lemma 9 (Segmentation error for two distributions). Let

    [...]    (2)

where the mapping ranges over the (two) one-to-one mappings from $\{1, 2\}$ to $\{1, 2\}$. Let [...] be a mapping that achieves the minimum. If [...], then with probability at least [...]: [...], where [...] is the hypothesis segmentation.

Proof: Assume without loss of generality that the mapping is the identity mapping. By our definition of the cost vectors, we have that for any given segmentation [...]: [...], where [...] is as defined in Figure 2. We first verify that [...], and hence for every segmentation [...]: [...]. For [...], let [...]; thus [...], and we get [...]. Similarly, [...].

It remains to bound the probability that the total cost of a segmentation [...] is smaller than that of the target segmentation, for a segmentation such that [...]. The difference in the total cost is a sum of the contributions of the elements of $S$ for which the two segmentations differ. These contributions are independent, and the difference between the two values that are possible for each element is exactly 1. Thus, applying Lemma 5, the probability that the total cost of [...] is smaller than that of [...] is upper bounded by [...], where [...] is a weighted average of [...]. From our assumption on the size of [...], and by substituting our bounds on [...], the probability above is bounded by [...]. Since the total number of possible segmentations containing at most $K$ runs is at most [...], we get the statement of the lemma.

In the third lemma we show that if the segmentation has a small number of errors, then good new estimates of $p_1$ and $p_2$ can be computed using it.

Lemma 10 (Reevaluation error for two distributions). Suppose [...] is a one-to-one mapping such that [...]. Then for [...]: [...], where [...].

Proof: Assume without loss of generality that the mapping is the identity mapping. For a given segmentation [...] and for [...], let [...] be the empirical distribution of the elements of $S$ to which the segmentation assigns the index [...]. We define the deviation of [...] from its mean as follows: [...]. We show that there exists [...] (which is set subsequently) such that for every segmentation having [...]: [...]. Thus, in particular, [...], where the last inequality follows from our assumptions on [...]. The same bound on [...] is obtained analogously.

It remains to bound [...]. Applying Lemma 6, and using the bound on the KL divergence given in Lemma 7, we have that with probability at least [...], for every segmentation [...] having [...], and for [...]: [...].

Proof Sketch of Theorem 1: In order to show that the error $\mathrm{err}_1$ is at most $\epsilon$, we separate our argument into three parts, according to the values of $\Delta$ and $\gamma$.

First we consider the case in which both $\Delta$ and $\gamma$ are greater than [...] $\epsilon$. In this case we show that the segmentation error is $O(\epsilon)$ and that the estimation error is $O(\epsilon)$. It is not hard to verify that, for the right choice of constants, the total error is then bounded by $\epsilon$. The condition on the length of the sequence, together with Lemma 8, guarantees that with high probability the initialization procedure generates estimates whose error is smaller than [...]. Using this in Lemma 9, we get that the segmentation error is [...]. From our assumption on [...], we have that this expression is upper bounded by [...] $\epsilon$, as desired. Using our assumption on [...] and the bound in Lemma 10, we get that the error of the re-estimated probabilities is [...], as required.

If $\gamma$ is smaller than [...] $\epsilon$, then it is not hard to verify that, under our assumption on the size of [...], with probability at least [...] the hypothesis which consists of a single run (together with the corresponding probability distribution, defined based on the complete sequence) has error bounded by $\epsilon$. As $\gamma$ is part of the input to the algorithm, the algorithm checks for this condition and outputs the single-run hypothesis.

If $\Delta$ is smaller than [...] $\epsilon$, then the single-run hypothesis has small error as well. However, as $\Delta$ is not part of the input, the algorithm cannot check for this condition. Nonetheless, it is easy to check that in this situation any hypothesis segmentation for which [...] (together with the corresponding estimated probability distributions) has error at most $\epsilon$. Thus, as a last step of the algorithm, we check if [...]. If this condition holds, then we simply output the hypothesis. Otherwise we output the single-run hypothesis. It follows from the first part of this proof that if [...], then the condition holds with high probability.
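The two steps analyzed in Lemmas 9 and 10, finding the minimum-cost segmentation with a bounded number of runs and then re-estimating the distributions from it, can be sketched as below. This is an illustrative sketch only. The cost vectors are negative log-likelihoods with a little smoothing, which is an assumption of this example and not the cost vectors of Figure 2. The function names, the `smooth` parameter, and the representation of a cost vector as a dictionary keyed by symbol are likewise hypothetical. The dynamic program is Viterbi-like and tracks the number of runs used so far.

```python
import math
from collections import Counter


def neg_log_costs(dists, alphabet, smooth=1e-3):
    """Negative log-likelihood cost vectors (an assumption of this sketch)."""
    return [{a: -math.log((1 - smooth) * p[k] + smooth / len(alphabet))
             for k, a in enumerate(alphabet)} for p in dists]


def min_cost_segmentation(seq, cost, max_runs):
    """Labels for `seq` minimizing total cost with at most `max_runs` runs.

    cost[i][s] is the (finite) cost that model i assigns to symbol s.
    """
    n, m = len(seq), len(cost)
    INF = float("inf")
    # best[r][i]: minimal cost of the prefix processed so far when it
    # uses exactly r + 1 runs and its last run belongs to model i.
    best = [[INF] * m for _ in range(max_runs)]
    for i in range(m):
        best[0][i] = cost[i][seq[0]]
    back = []  # one table of back-pointers per position t >= 1
    for t in range(1, n):
        new = [[INF] * m for _ in range(max_runs)]
        ptr = [[None] * m for _ in range(max_runs)]
        for r in range(max_runs):
            for i in range(m):
                # Option 1: extend the current run of model i.
                cand, prev = best[r][i], (r, i)
                # Option 2: open a new run, switching from model j != i.
                if r > 0:
                    for j in range(m):
                        if j != i and best[r - 1][j] < cand:
                            cand, prev = best[r - 1][j], (r - 1, j)
                if cand < INF:
                    new[r][i] = cand + cost[i][seq[t]]
                    ptr[r][i] = prev
        best = new
        back.append(ptr)
    # Choose the cheapest final state and follow the back-pointers.
    r, i = min(((r, i) for r in range(max_runs) for i in range(m)),
               key=lambda ri: best[ri[0]][ri[1]])
    labels = [i]
    for ptr in reversed(back):
        r, i = ptr[r][i]
        labels.append(i)
    return labels[::-1]


def reestimate(seq, labels, alphabet, n_models):
    """Empirical distribution of the symbols assigned to each model."""
    counts = [Counter() for _ in range(n_models)]
    for s, i in zip(seq, labels):
        counts[i][s] += 1
    result = []
    for c in counts:
        total = sum(c.values())
        # A model that received no symbols falls back to uniform.
        result.append([c[a] / total if total else 1 / len(alphabet)
                       for a in alphabet])
    return result
```

One iteration of step 2 of Figure 1 then amounts to calling `neg_log_costs` on the current estimates, `min_cost_segmentation` on the sequence, and `reestimate` on the resulting labels. The running time of this version is proportional to the sequence length times the run bound times the square of the number of models.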
Theorem 1 summarizes the convergence properties of the algorithm when step 2 of the algorithm is executed once. It is interesting to consider the convergence properties that are implied by Lemmas 9 and 10 in a little more detail. According to Lemma 9, if the errors of the estimates of $p_1$ and $p_2$ are smaller than a constant fraction of $\Delta$, then the number of segmentation errors can be decreased to an arbitrarily small fraction of $n$ by increasing [...]; this in turn decreases the error of the new estimates of the distributions that result from the segmentation. Intuitively, this means that there is a basin of attraction of the estimates of the distributions whose size is proportional to the distance $\Delta$. If the algorithm starts with an estimate of the distributions that is within this basin of attraction, then the estimate it gives in the following iteration is very accurate.

On the other hand, our analysis predicts that iterating the algorithm more than once will not significantly improve the segmentation error. This is because, even if the estimates of the distributions are perfect, there will be segmentation errors as a result of the randomness of the sequence. Next we consider the error in the estimates of the distributions. From our analysis we see that, given our lower bound on the length of the sequence, the estimation error after the initialization step is [...] and the error after a single re-estimation step is [...].

7 A General Number of Distributions

In this section we consider the case in which there are $N$ target probability distributions. In this case the problem of finding good cost vectors to be associated with the different distributions is more complicated. We have to choose $N$ cost vectors, one per distribution, but we also have to satisfy $\binom{N}{2}$ sets of requirements, because each pair of distributions has to be distinguished well by the corresponding pair of cost vectors. The choice of the cost vectors that we have found is described in Figure 3. This choice is a generalization of the squared loss in the binary case ([...]) and allows us to bound the error of the algorithm according to $\mathrm{err}_2$ (which is weaker, i.e., less sensitive, than $\mathrm{err}_1$). The
initialization procedure that is used in the two-distribution case cannot be applied to the general case. The initialization procedure that we suggest requires that the segmentation sequence $T$ is such that all of the runs in $T$ are longer than some integer parameter $L$. This allows the algorithm to assume that in each segment of $S$ of length [...] there is at most one switch between distributions; consequently, it can identify if both parts were generated almost solely by the same probability distribution. The initialization procedure is described in Figure 3. We assume that the algorithm receives the parameters $K$ and $L$ as input. (This assumption is removed, at some additional cost, in the next section.) The proof of Theorem 2 is very similar to the proof of Theorem 1 and follows from the lemmas given below.

    Initial estimates of the distributions (general case)

    1. Set [...].
    2. Set [...].
    3. For each window [...] and each symbol $\sigma \in \Sigma$, let [...] be the fraction of the elements in the window which are equal to $\sigma$. Let [...] denote the vector of these fractions. Let [...] be the set of vectors [...] for all [...] such that [...].
    4. Start with [...] as an empty set and repeat the following until [...] becomes empty: select an element from [...] and add it to [...]; remove from [...] all elements [...] such that [...].
    5. Output the set [...] as the set of initial distribution estimates.

    Choice of cost vectors (general case)

    With each estimate [...] associate the cost vector [...].

    Figure 3: The initialization procedure and the choice of the cost vectors for the general case.

Lemma 11 (Initialization error for general $N$). If for some [...] the minimal length $L$ of runs in $T$ satisfies [...], then with probability at least [...] the initialization procedure described in Figure 3 generates a set of estimates [...] which approximates the set of target distributions in the following sense: there exists a one-to-one mapping from [...] to [...] such that for all [...]: [...].

Proof: The initialization procedure starts by considering all pairs of windows of the form [...]. Ignoring the dependence on [...], [...]. The assumption on the minimal length of runs in $T$ implies that each such pair overlaps with at most one switch between runs in $T$. We concentrate on some particular pair of windows, whose corresponding estimates are [...], and assume without loss of generality that the switch is in the second window. Let [...] be the distribution which generates the elements before the switch and [...] be the distribution after the switch. Then the expected value of [...] is [...], and the expected value of [...] is [...] for some [...].

Using Lemma 6 and our requirement on $L$, we show that for all [...] the actual value of [...] which is associated with each window is close to its expected value. Specifically, with probability at least [...], the [...] distance between each estimate and its expected value is at most [...]. We thus get that the pair [...] is added to the set [...] only if [...], so that [...]. This implies that the estimate that is added to the set in this case satisfies [...]. Thus the estimates in the set [...] are all [...] of actual target distributions.

On the other hand, we are guaranteed that each run contains at least one pair of estimates that are both pure, and it is easy to check that the accuracy of the estimates guarantees that this pair will be accepted into [...]. Thus each target distribution has at least one representative in [...]. The goal of step 4 of the initialization procedure is to find a single representative for each target distribution. Simple arguments show that the distance between two different representatives of the same distribution is at most [...], and the
distane between two representatives o two dierent distributions is at From the requirement on Y in the statement o the lemma we get hus step 4 generates a set o distributions with one representative per target distribution as required in the statement o the lemma emma 12 Segmentation error or general ) et hwj ^]`_ ' ) ) ) where ranges over all one-to-one mappings rom $ to $ minimum J then with probability at least the segmentation that minimizes the total ost satisies? )`k 1) 1 N a - j k d! Assume that : is the mapping that ahieves the roo: Assume without loss o generality that : is the identity mapping Assume some element in the sequene is generated by the distribution then assigned a ost U rom the ost vetor U k whih orresponds to the approximated distribution k Deine U# k hen the expeted value o U is - $ ' k - k k!*)

9 9 U - - W ) hus the expeted ontribution o any element to the total ost is a sum o two terms he irst term depends only on the underlying target distribution the seond term depends on the Y norm o the approximation error `k Similarly to the analysis in the proo o emma 9 we now onsider the dierene between the total osts orresponding to the approximate ost vetors o two dierent segmentations he irst is the orret segmentation the seond is the best segmentation whih minimizes the total ost on learly only the elements on whih dier ontribute to the dierene in the total ost From quation 21) we get that the expeted total dierene is 9 U - = 9 U - 3 = ) k k - j `k = hus i k K or all -3 22) then the expeted ost dierene is guaranteed to be negative Using the triangle inequality or the Y norm we ind that the onditions on the minimal separation on the maximal error imply that -3 `k We thus get that the expeted total dierene in osts is bounded by 9 U - = 9 U - 3 = k - j ) k 23) We want to bound the probability that a segmentation with many errors has a total ost whih is smaller than that o the orret segmentation he ost dierene is a sum o the ost dierenes in the plaes where the segmentations disagree hese are independent rom variables t is easy to hek that the oordinates o any ost vetor are bounded in the range < We an thus apply emma 5 get that or any individual segmentation the probability that the segmentation ahieves smaller total ost than the orret segmentation is upper bounded by thus the ost dierene is bounded in 9 U - 3 = U _ W d 9 U - = k - j )k ' 24) here are less than segmentations with runs ombining this with the last equation we get the statement o the lemma emma 13 ' eevaluation error or general ) Suppose : $ = $ is a one to one mapping suh that ]`_? 1) - j k )`k W U then with probability at least there exists a one-to-one mapping :A# = ' $ suh that or every #? 
) ) where ) * D 8 2U 8 1

The proof of Lemma 13 is the same as the proof of Lemma 10 except for the use of the Y norm in place of the Y norm.

8 Treating Unspecified Parameters

So far we have assumed that our learning algorithms receive as input parameters that describe properties of the sequence. In the two-distribution case the algorithm receives as input U and in the multiple-distribution case the algorithm receives Y In this section we show that these assumptions can be removed with some increase in the error if we measure the error of the hypothesis with - H Recall that -1 H is always smaller than - - and thus the bounds we previously got for cases where the algorithm receives additional input can be used here.

The idea is to try all possible settings of the unknown parameters and then select the best resulting hypothesis. While the different algorithms receive as input different subsets of U Y the only variable that is controlled by these parameters is B, the length of the segments that are used for initialization. As there are at most possible settings for B, we need to run the algorithm times and then compare the hypotheses. The different hypotheses are compared in terms of the total cost defined in Figure 3, and the hypothesis with the minimal total cost is selected.

More formally, assume that we have hypotheses ) For each 9 according to the cost vectors U ) - we calculate the total cost U ) - = as defined in Figure 3 for the case of more than two distributions. We then select the hypothesis with the minimal total cost as our final hypothesis. The following lemma shows that the error of the selected hypothesis is, with high probability, only slightly larger than that of the best hypothesis in the set.

Lemma 14 Let be a sequence generated according to the segmentation and the distributions Let ) - - be a set of hypotheses and let argmin 8 U ) - Then with probability at least with respect to the distribution of -1 H )3 - N FWJ - -1 H ) - ) where )

2

Proof: We first consider the expected value of the total cost of the sequence for some fixed hypothesis = - Let be the th element in Thus is generated by the distribution '1 *) and then assigned a cost U according to the cost vector U ' :) which corresponds to the approximated distribution ' :) Define - ' *) < ' *) Then, similarly to Equation (21), the expected value of U is U ' *) - (25) Summing the expected value of U over we get that the expected total cost of is - - where is a constant independent of the hypothesis. The range of all of the U s is < and they are independent random variables. Thus the deviation of the cost of the sequence from its expected value can be bounded using Lemma 5 as follows: 2 - U - U U _ W \

We now return to the set of hypotheses. As each of these hypotheses is selected from a set of at most segmentations, we find that by setting to be !k we get that with probability at least the total costs of all M possible hypotheses are within of their respective expected values. Thus the expected total cost of the segmentation that appears to be best is within of the expected value of the actual best segmentation, which gives the statement of the lemma.

Applying Lemma 14 to Theorem 1 we get Theorem 3. In this case we add to the candidate hypotheses that correspond to the choices of B the hypothesis * H where H consists of a single run and is the empirical distribution of the sequence. This hypothesis is accurate if either U or are smaller than O and thus we can remove any assumption on U

Applying Lemma 14 to Theorem 2 we get Theorem 4. In this case we still need to assume a lower bound on Y in order to assure that at least one of the executions of the algorithm succeeds.

In both cases the requirement on that we have to make in order that the extra error be smaller than O is Z 9 O 1 = This requirement is subsumed by the requirements in Theorems 1 and 2.

Acknowledgments

We would like to thank Yoram Singer and Naftali Tishby for discussions that motivated this work. Dana Ron would like to thank the Eshkol Fellowship for its support. Part of this work was done while Dana Ron was visiting AT&T Bell Laboratories.
References

[1] N. Abe and M. K. Warmuth. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9(2-3), 1992. Special issue for COLT90.
[2] Leonard E. Baum and J. A. Eagon. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73.
[3] Avrim Blum and Prasad Chalasani. Learning switching concepts. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, July 1992.
[4] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 1991.
[5] David Gillman and Michael Sipser. Inference and minimization of hidden Markov chains. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, July 1994.
[6] Nick Littlestone. Notes on the derivation of Chernoff-type bounds for sums of random variables. Unpublished manuscript, 1990.
[7] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, January 1986.
[8] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inform. Theory, 13.

Open Problems

It is clear that improvements in our upper bounds should be possible. In addition, it would be interesting to search for matching lower bounds, since no lower bounds on the achievable error are known. It would be interesting to show whether our bounds can be improved if we use the likelihood cost, as done in the Baum-Welch algorithm, rather than the costs we invented. A natural scaling of the parameters is to fix $ and the switching rate and to let = Our analysis breaks down in this case.
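
As a rough illustration of the selection step of Section 8 (run the algorithm once per setting of B, then keep the hypothesis with the minimal total cost), the following Python sketch may be helpful. It rests on several assumptions that are not taken from the paper: a hypothesis is represented as a list of approximate distributions plus a list of runs; the total cost is taken to be the plain sum of per-element costs; and the cost vector is the quadratic cost c_q(x) = ||q||^2 - 2 q(x), chosen only because its expectation splits into a term that depends on the target distribution alone plus the squared error of the approximation. The cost actually used in the analysis is the one defined in Figure 3, and the names cost_vector, total_cost and select_hypothesis are hypothetical.

```python
from typing import List, Sequence, Tuple

Distribution = List[float]   # probabilities over symbols 0..n-1
Run = Tuple[int, int]        # (run length, index of the generating distribution)
Hypothesis = Tuple[List[Distribution], List[Run]]


def cost_vector(q: Distribution) -> List[float]:
    """Assumed quadratic cost (not the paper's Figure 3): c_q(x) = ||q||^2 - 2*q(x)."""
    sq_norm = sum(v * v for v in q)
    return [sq_norm - 2.0 * v for v in q]


def total_cost(sequence: Sequence[int], hypothesis: Hypothesis) -> float:
    """Sum the per-element costs of a sequence under one hypothesis."""
    distributions, runs = hypothesis
    if sum(length for length, _ in runs) != len(sequence):
        raise ValueError("the runs must cover the whole sequence")
    costs = [cost_vector(q) for q in distributions]
    total, pos = 0.0, 0
    for length, idx in runs:
        for symbol in sequence[pos:pos + length]:
            total += costs[idx][symbol]
        pos += length
    return total


def select_hypothesis(sequence: Sequence[int],
                      hypotheses: List[Hypothesis]) -> int:
    """Return the index of the candidate hypothesis with minimal total cost."""
    return min(range(len(hypotheses)),
               key=lambda i: total_cost(sequence, hypotheses[i]))


# Toy example: six symbols from one source followed by six from another.
sequence = [0] * 6 + [1] * 6
single_run = ([[0.5, 0.5]], [(12, 0)])                    # one run, empirical distribution
two_runs = ([[0.9, 0.1], [0.1, 0.9]], [(6, 0), (6, 1)])   # two runs, two distributions
best = select_hypothesis(sequence, [single_run, two_runs])
print(best)  # selects index 1, the two-run hypothesis
```

In this toy example the single-run candidate plays the role of the extra hypothesis added when applying Lemma 14 to Theorem 1; under the assumed cost it loses to the two-run candidate because the sequence really does switch.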
