arxiv: v1 [quant-ph] 4 Nov 2008


Training a Binary Classifier with the Quantum Adiabatic Algorithm

Hartmut Neven, Google, neven@google.com
Vasil S. Denchev, Purdue University, vdenchev@purdue.edu
Geordie Rose and William G. Macready, D-Wave Systems, rose,wgm@dwavesys.com

November 4, 2008

Abstract

This paper describes how to make the problem of binary classification amenable to quantum computing. A formulation is employed in which the binary classifier is constructed as a thresholded linear superposition of a set of weak classifiers. The weights in the superposition are optimized in a learning process that strives to minimize the training error as well as the number of weak classifiers used. No efficient solution to this problem is known. To bring it into a format that allows the application of adiabatic quantum computing (AQC), we first show that the bit precision with which the weights need to be represented only grows logarithmically with the ratio of the number of training examples to the number of weak classifiers. This allows us to effectively formulate the training process as a binary optimization problem. Solving it with heuristic solvers such as tabu search, we find that the resulting classifier outperforms a widely used state-of-the-art method, AdaBoost, on a variety of benchmark problems. Moreover, we discovered the interesting fact that bit-constrained learning machines often exhibit lower generalization error rates. Changing the loss function that measures the training error from 0-1 loss to least squares maps the training to quadratic unconstrained binary optimization. This corresponds to the format required by D-Wave's implementation of AQC. Simulations with heuristic solvers again yield results better than those obtained with boosting approaches. Since the resulting quadratic binary program is NP-hard, additional gains can be expected from applying the actual quantum processor.

1 Introduction

Many problems in machine learning map onto optimization problems that are formally NP-hard.^1
Consequently, large areas of the field are concerned with simplifications and relaxations that make the resulting optimization problems computationally tractable. This has resulted in heuristic tools that are useful in practice but whose quality is inferior compared to results obtained by solving the original problem. Moreover, the wealth of heuristic methods requires the practitioner to select the most suitable approach on a case-by-case basis. Adiabatic quantum computing is a new method that draws on quantum mechanical processes and promises to solve hard discrete optimization problems better than is possible with classical algorithms [FGGS00][FGG+01]. Thus it offers an opportunity to tackle hard machine learning problems head-on. This paper investigates how this novel method can be applied to a basic problem in machine learning: constructing a binary classifier from a dictionary of feature detectors.

^1 From the formal NP-hardness of a class of problems it does not follow that problem instances encountered in practice are computationally difficult. However, experience tells us that this is usually the case for the learning problems at hand.

2 Training a binary classifier

We study a classifier of the form

$$y = H(x) = \operatorname{sign}\left( \sum_{i=1}^{N} w_i h_i(x) \right), \tag{1}$$

where $x \in \mathbb{R}^M$ are the input patterns to be classified, $y \in \{-1, 1\}$ is the output of the classifier, the $h_i : x \mapsto \{-1, 1\}$ are so-called weak classifiers or feature detectors, and the $w_i \in [0, 1]$ are a set of weights to be optimized. $H(x)$ is known as a strong classifier.

Training, i.e. the process of choosing the weights $w_i$, proceeds by simultaneously minimizing two terms. One term measures the error over a set of $S$ training examples $\{(x_s, y_s) \mid s = 1, \ldots, S\}$. A natural choice is the 0-1 error, which counts the number of misclassifications over the training set:

$$L(w) = \sum_{s=1}^{S} H\left( -y_s \sum_{i=1}^{N} w_i h_i(x_s) \right), \tag{2}$$

where $H$ here denotes the Heaviside step function. $L(w)$ is referred to as the loss function. The second term is known as regularization, $R(w)$, and it ensures that the classifier does not become too complex. Classifiers with high complexity tend to classify the examples in the training set with low error but do not do well on independent test sets. The phenomenon of a classifier achieving a small training error but yielding a large generalization error is known as overlearning.
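As an illustration of eqns. (1) and (2), the following is a minimal Python sketch, not the authors' Matlab implementation. The function names, the toy weak classifiers and the tie conventions (sign(0) = +1, and a margin of exactly zero not counted as an error) are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
WeakClassifier = Callable[[Vector], int]  # maps an input pattern to -1 or +1


def sign(value: float) -> int:
    """Sign function; sign(0) = +1 is an assumed tie-breaking convention."""
    return 1 if value >= 0 else -1


def strong_classifier(weights: Sequence[float],
                      weak: Sequence[WeakClassifier],
                      x: Vector) -> int:
    """Eqn. (1): thresholded weighted sum of the weak classifier outputs."""
    return sign(sum(w * h(x) for w, h in zip(weights, weak)))


def zero_one_loss(weights: Sequence[float],
                  weak: Sequence[WeakClassifier],
                  samples: Sequence[Tuple[Vector, int]]) -> int:
    """Eqn. (2): number of samples whose margin y_s * sum_i w_i h_i(x_s) is negative.

    A margin of exactly zero is not counted as an error (assumed H(0) = 0).
    """
    errors = 0
    for x, y in samples:
        margin = y * sum(w * h(x) for w, h in zip(weights, weak))
        if margin < 0:
            errors += 1
    return errors


# Hypothetical toy dictionary: two stumps that look at one coordinate each.
weak: List[WeakClassifier] = [lambda x: sign(x[0]), lambda x: sign(x[1])]
samples = [([1.0, 1.0], 1), ([-1.0, -1.0], -1), ([1.0, -1.0], -1)]

loss_first_only = zero_one_loss([1.0, 0.0], weak, samples)
loss_second_only = zero_one_loss([0.0, 1.0], weak, samples)
```

In this toy setting the second stump alone separates the three samples, whereas the first stump alone misclassifies the third one.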
A simple choice for the regularization term is based on the 0-norm, $\|w\|_0$, which gives the number of non-zero weights:

$$R(w) = \lambda \|w\|_0 = \lambda \sum_{i=1}^{N} w_i^0, \tag{3}$$

with the convention $0^0 = 0$. Therefore, training amounts to solving the following minimization problem:

$$w^{opt} = \arg\min_w \left( L(w) + R(w) \right) = \arg\min_w \left( \sum_{s=1}^{S} H\left( -y_s \sum_{i=1}^{N} w_i h_i(x_s) \right) + \lambda \sum_{i=1}^{N} w_i^0 \right), \tag{4}$$

where $\lambda$ controls the relative importance of the regularization. Due to the non-convexity of the loss function, the resulting optimization problem is suspected to be NP-hard. Even if we were to choose a convex loss function, (4) is likely to remain an NP-hard problem due to the choice of the 0-norm for the regularization term [Zha08]. The choice of the 0-norm is attractive since it explicitly enforces sparsity, i.e. it drives many of the $w_i$ to zero. This is associated not only with good generalization but also with fast execution during the performance phase.

Each contribution to the overall loss, i.e. the per-sample loss $H\left( -y_s \sum_{i=1}^{N} w_i h_i(x_s) \right)$, enforces an inequality constraint:

$$y_s \sum_{i=1}^{N} w_i h_i(x_s) \ge 0 \quad \text{for } s = 1, \ldots, S. \tag{5}$$

Thus, each training sample brings about an inequality that demands choosing weights on one side of a diagonal hyperplane in $N$-dimensional space. The hyperplane is defined by a set of coefficients that are $\pm 1$ depending on the responses of the weak classifiers. Fig. 1 illustrates the situation for $N = 3$. The number of regions created by $S$ hyperplanes can be calculated using their characteristic polynomial [OT92]:

$$N_{regions} = (-1)^N \sum_{S_k} (-1)^k (-1)^{\dim(\cap S_k)}, \tag{6}$$

where $S_k$ designates the $k$-element subsets of the $S$ hyperplanes and $\dim(\cap S_k)$ is the dimension of the intersection of $S_k$.^2 Due to linear dependencies among the hyperplanes, which occur for $N \ge 4$, we were not able to find a closed-form expression for $\dim(\cap S_k)$ and instead have to resort to an upper bound for $N_{regions}$, which is known [OT92][Sau72] to be

$$N_{regions} \le \sum_{k=0}^{N} \binom{S}{k}. \tag{7}$$

It is possible that two different training samples generate identical inequality constraints for the $w_i$. In this sense, (7) is a conservative estimate, as the actual number of solution spaces is often lower.

3 Modifications to allow the application of the quantum adiabatic algorithm

To bring (4) to a form that is amenable to AQC as implemented by the D-Wave hardware,^3 we need to effect several modifications. First, we need to transition from continuous weights $w_i \in [0, 1]$ to binary variables. Formally this can always be achieved by a binary expansion of the weights. The question that naturally arises is how many bits in the expansion are needed.
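The upper bound (7) is easy to evaluate numerically. The sketch below is only an illustration; the function name `region_upper_bound` is an assumption. It compares the bound with the region counts quoted in the caption of Fig. 1 (14, 104 and 1882 regions for N = 3, 4, 5), where all 2^(N-1) possible hyperplanes are present, so S = 2^(N-1).

```python
from math import comb


def region_upper_bound(num_samples: int, num_weak: int) -> int:
    """Right-hand side of eqn. (7): sum over k = 0..N of binomial(S, k).

    math.comb(S, k) returns 0 for k > S, so the sum is also valid when S < N.
    """
    return sum(comb(num_samples, k) for k in range(num_weak + 1))


# Region counts quoted for Fig. 1, with all S = 2**(N - 1) hyperplanes present.
quoted_regions = {3: 14, 4: 104, 5: 1882}
bounds = {n: region_upper_bound(2 ** (n - 1), n) for n in quoted_regions}
```

For these three cases the bound evaluates to 15, 163 and 6885, so it holds but becomes loose quickly, in line with the remark that (7) is a conservative estimate.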
^2 In this calculation we ignored the fact that a hyperplane or parts of it can become a solution space itself. This can occur when there are two training samples for which the sets of $\{h_i(x_s)\}$ differ by a global sign but have the same label $y_s$. Since this case is exceedingly unlikely, the probability being of the order $O(S/2^{(2N)})$, we can afford not to consider this situation.

^3 The D-Wave hardware minimizes an Ising function via a physical annealing of thermal and quantum fluctuations.

Figure 1: Arrangement of the diagonal hyperplanes that define the solution spaces for selecting $w^{opt}$. Depicted is the situation for $N = 3$, which yields 14 regions. The number of solution chambers grows rapidly: $N = 4$ leads to 104 and $N = 5$ to 1882 regions. Here all possible hyperplanes are shown. However, in practice $S$ training samples will only invoke a small fraction of the $2^{N-1}$ possible hyperplanes. The blue dots are the vertices of a cube placed in the positive quadrant with one vertex coinciding with the origin. They correspond to weight configurations that can be represented with one bit. Multiple bits would give rise to a cube-shaped lattice.

Since each binary variable is associated with a qubit, it is important that we only use the minimal number necessary. Discrete weight configurations represented by a finite number of bits lie on a hypercubic lattice with edges that have $2^{bits}$ vertices. If each solution region contains a lattice vertex, then all classifiers $H_w(x)$ that can be attained with real-valued weights can also be realized by the discrete weight configurations. Thus one obtains a rough estimate for the required bit depth by demanding that the number of vertices on the $(2^{bits})^N$ lattice is at least as large as the number of solution regions created by the hyperplanes, $N_{regions}$. The weak classifiers are typically constructed in a way that only positive weights are needed. Hence, we only need to hit solution regions in the positive quadrant, of which there are approximately $N_{regions}/2^N$:

$$\frac{\text{Vertices on lattice}}{\text{Regions in positive quadrant}} = \frac{(2^{bits})^N}{N_{regions}/2^N} = \frac{(2^{bits+1})^N}{N_{regions}} \tag{8}$$

$$\frac{2^{(bits+1)N}}{N_{regions}} \ge \frac{2^{(bits+1)N}}{\sum_{k=0}^{N} \binom{S}{k}} \ge \frac{2^{(bits+1)N}}{\left( \frac{eS}{N} \right)^N} \overset{!}{\ge} 1 \tag{9}$$

$$\left( \frac{2^{bits+1} N}{eS} \right)^N = \left( \frac{2^{bits+1}}{ef} \right)^N \overset{!}{\ge} 1 \tag{10}$$

$$bits \ge \log_2(f) + \log_2(e) - 1, \tag{11}$$

where $e$ is Euler's number and $f = S/N$. In eqn. (9) we used a standard result regarding binomial coefficients: $\sum_{k=0}^{N} \binom{S}{k} \le \left( \frac{eS}{N} \right)^N$. This holds in the case that $S \ge N$. Smaller numbers of training examples lead to an even lower bound than (11) for the required bit precision. Equation (11) is an important result, as it shows that the bit precision needed for the weights only grows logarithmically with the ratio of the number of training examples to the number of weak classifiers. Thus for many problems that arise in practice we get away with very few bits, and often we will need only a single bit.

The second modification is not imposed by our desire to apply AQC per se but rather by the limitations of the D-Wave hardware, which calls for a Hamiltonian that has at most quadratic terms. To this end we effect a change in the loss function, now using the quadratic loss, such that finding $w^{opt}$ in (4) amounts to solving a quadratic optimization program:

$$
\begin{aligned}
w^{opt} &= \arg\min_w \left( \sum_{s=1}^{S} \left| \sum_{i=1}^{N} w_i h_i(x_s) - y_s \right|^2 + \lambda \|w\|_0 \right) \\
&= \arg\min_w \left( \sum_{s=1}^{S} \left( \left( \sum_{i=1}^{N} w_i h_i(x_s) \right)^2 - 2 \sum_{i=1}^{N} w_i h_i(x_s) y_s + y_s^2 \right) + \lambda \sum_{i=1}^{N} w_i^0 \right) \\
&= \arg\min_w \left( \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \underbrace{\sum_{s=1}^{S} h_i(x_s) h_j(x_s)}_{\mathrm{Corr}(h_i, h_j)} + \sum_{i=1}^{N} w_i \Big( \lambda - 2 \underbrace{\sum_{s=1}^{S} h_i(x_s) y_s}_{\mathrm{Corr}(h_i, y)} \Big) \right)
\end{aligned} \tag{12}
$$

In the third line we dropped $\sum_{s=1}^{S} y_s^2$ because it represents a constant offset. In order for the square loss to be compatible with the binary decision enforced by the sign in eqn. (1), we scale the $h_i(x)$ such that $h_i : x \mapsto \{-\frac{1}{N}, \frac{1}{N}\}$. Eqn. (12) corresponds to a quadratic unconstrained binary optimization (QUBO) problem. Note that the transition from the second to the third line only holds for weights comprised of a single bit. If we use an arbitrary number of bits, we have to introduce an auxiliary bit $w_{i,aux}$ for each weight to enforce a 0-norm regularization within the framework of quadratic optimization. $R(w)$ of (4) then becomes

$$R(w) = \sum_{i=1}^{N} \left[ \kappa w_i (1 - w_{i,aux}) + \lambda w_{i,aux} \right]. \tag{13}$$

Minimizing $R(w)$ causes the $w_{i,aux}$ to act as indicator bits that are 1 when $w_i > 0$ and 0 otherwise. For this to work, $\kappa$ has to be chosen sufficiently large.
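To get a feeling for the bit-depth estimate (11), the following small sketch evaluates it for a few ratios f = S/N. The function names are assumptions, as is the rounding up to a whole number of bits with a minimum of one bit. The estimate itself is only the rough sufficient condition derived above for S ≥ N.

```python
from math import ceil, e, log2


def min_bits(num_samples: int, num_weak: int) -> float:
    """Right-hand side of eqn. (11): log2(f) + log2(e) - 1 with f = S / N."""
    f = num_samples / num_weak
    return log2(f) + log2(e) - 1.0


def bits_needed(num_samples: int, num_weak: int) -> int:
    """Smallest whole number of bits satisfying eqn. (11), at least one bit (assumed)."""
    return max(1, ceil(min_bits(num_samples, num_weak)))


# Hypothetical sizes: N = 60 weak classifiers and growing training sets.
bits_for_f1 = bits_needed(60, 60)      # f = 1
bits_for_f10 = bits_needed(600, 60)    # f = 10
bits_for_f100 = bits_needed(6000, 60)  # f = 100
```

Each tenfold increase of f adds log2(10), roughly 3.3 bits, which is the logarithmic growth stated in the text.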
There is an intuitive way to look at (12). The weak classifiers $h_i$ whose output is well correlated with the labels $y$ cause the bias term to be lowered, thus causing an increase in the probability that $w_i = 1$. The coupling terms are proportional to the correlation among the weak classifiers. Weak classifiers that are strongly correlated with each other cause the coupling energy to go up, thereby increasing the probability for one of the correlated classifiers to be switched off, i.e. that either $w_i$ or $w_j$ becomes 0.

The matrix $\mathrm{Corr}(h_i, h_j)$ figuring in the quadratic term is positive semi-definite, thereby seemingly making the resulting optimization problem efficiently solvable with classical optimization techniques. However, it has been confirmed that the quadratic unconstrained program with binary weights, an integer programming problem, is NP-hard, which validates the motivation for applying quantum algorithms to find $w^{opt}$ in the above formulation [KN01][HR98]. Moreover, the matrix figuring in the quadratic term ceases to be of Gram type when the $w_i$ are represented by more than one bit and the modified regularization term (13) leads to additional entries in the matrix.

4 Implementation details

We implemented the training formulations given by (4) and (12) in Matlab. The dictionaries of weak classifiers that we employed consist of decision stumps of the form:

$$h_l^{1+}(x) = \operatorname{sign}(x_l - \Theta_l^{+}) \quad \text{for } l = 1, \ldots, M \tag{14}$$

$$h_l^{1-}(x) = \operatorname{sign}(-x_l - \Theta_l^{-}) \quad \text{for } l = 1, \ldots, M \tag{15}$$

$$h_l^{2+}(x) = \operatorname{sign}(x_i x_j - \Theta_{i,j}^{+}) \quad \text{for } l = 1, \ldots, \binom{M}{2}; \; i, j = 1, \ldots, M; \; i < j \tag{16}$$

$$h_l^{2-}(x) = \operatorname{sign}(-x_i x_j - \Theta_{i,j}^{-}) \quad \text{for } l = 1, \ldots, \binom{M}{2}; \; i, j = 1, \ldots, M; \; i < j \tag{17}$$

Here $h_l^{1+}$, $h_l^{1-}$, $h_l^{2+}$ and $h_l^{2-}$ are positive and negative weak classifiers of orders 1 and 2; $M$ is the dimensionality of the input vector $x$; $x_l$, $x_i$, $x_j$ are the elements of the input vector; and $\Theta_l^{+}$, $\Theta_l^{-}$, $\Theta_{i,j}^{+}$ and $\Theta_{i,j}^{-}$ are optimal thresholds of the positive and negative weak classifiers of orders 1 and 2, respectively. The input vectors are normalized using the 2-norm, i.e. we have $\|x\|_2 = 1$. Using the training data, an approximately optimal threshold $\Theta$ is computed for each classifier. The goal is to obtain an operating point that results in the minimum number of errors due to that weak classifier alone when the weak classifier is evaluated on the entire training set.

To minimize (4) for the purpose of determining the optimal weights, we employ simulated annealing. An exponential cooling schedule is used, and the schedule is tuned to the dataset for improved performance. The QUBO from (12) can be rewritten as $w^{opt} = \arg\min_w \left( \sum_{i,j} Q_{i,j} w_i w_j \right)$, where the coefficient matrix has elements $Q_{i,j} = \mathrm{Corr}(h_i, h_j)$ and $Q_{i,i} = \frac{S}{N^2} + \lambda - 2\,\mathrm{Corr}(h_i, y)$. The resultant problem is solved with a multi-start tabu solver tuned to QUBO problems [Pal04].
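The construction of the coefficient matrix Q can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' Matlab code: the function names and the toy data are hypothetical, and the exhaustive search below is only a stand-in for the multi-start tabu solver, usable for very small N.

```python
from itertools import product
from typing import List, Sequence, Tuple


def build_qubo(responses: Sequence[Sequence[float]],
               labels: Sequence[int],
               lam: float) -> List[List[float]]:
    """Build Q for single-bit weights.

    responses[s][i] holds h_i(x_s), already scaled to +-1/N as required for (12).
    Off-diagonal: Q_ij = Corr(h_i, h_j).
    Diagonal:     Q_ii = Corr(h_i, h_i) + lambda - 2 * Corr(h_i, y).
    """
    num_weak = len(responses[0])
    q = [[0.0] * num_weak for _ in range(num_weak)]
    for i in range(num_weak):
        for j in range(num_weak):
            q[i][j] = sum(row[i] * row[j] for row in responses)
        corr_with_labels = sum(row[i] * y for row, y in zip(responses, labels))
        q[i][i] += lam - 2.0 * corr_with_labels
    return q


def qubo_energy(q: Sequence[Sequence[float]], w: Sequence[int]) -> float:
    """Objective sum_{i,j} Q_ij w_i w_j for a binary weight vector w."""
    n = len(w)
    return sum(q[i][j] * w[i] * w[j] for i in range(n) for j in range(n))


def brute_force_minimize(q: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Exhaustive search over all 2^N bit vectors (stand-in for tabu search)."""
    return min(product((0, 1), repeat=len(q)), key=lambda w: qubo_energy(q, w))


# Hypothetical toy problem: h_0 agrees with the labels, h_1 is uncorrelated,
# h_2 is anti-correlated with the labels.
raw = [[1, 1, -1], [1, -1, -1], [-1, -1, 1], [-1, 1, 1]]
labels = [1, 1, -1, -1]
n_weak = 3
responses = [[h / n_weak for h in row] for row in raw]
q = build_qubo(responses, labels, lam=0.1)
best = brute_force_minimize(q)
```

For single-bit weights the energy returned by `qubo_energy` equals the regularized quadratic loss minus the constant offset that was dropped in the third line of (12), which is a convenient sanity check when building Q.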
We noticed that we could achieve enhanced results by adding a post-processing step. The $w^{opt}$ returned by tabu search is used to compute an optimal threshold for the final strong classifier: $T = \frac{1}{S} \sum_{s=1}^{S} \sum_{i=1}^{N} w_i^{opt} h_i(x_s)$, where the $x$ vectors are taken from a validation data set. $T$ represents the average of all computed responses of the strong classifier immediately before the categorical decision is made. We modify eqn. (1) by inserting $T$. Thus the final classifier becomes

$$y = \operatorname{sign}\left( \sum_{i=1}^{N} w_i^{opt} h_i(x) - T \right). \tag{18}$$

After that, the set of test examples is evaluated using the strong classifier configured in this way and the test errors are counted. The output consists of the number of test errors and the number of weak classifiers with non-zero weights that make up the final strong classifier. Since the optimal regularization strength cannot be known a priori for different data, we use 3-fold cross validation in order to find a regularization strength $\lambda$ that results in the best generalization on a validation set. We only consider values of $\lambda$ for which the total number of weak classifiers does not exceed $N/2$.

Figure 2: Visualization of the synthetic test data employed in the benchmark tests. The data consists of two isotropic Gaussian distributions of 30-dimensional inputs with different variances. Shown are the first two dimensions (axes $x_{s1}$ and $x_{s2}$). The two clouds represent the positive and negative data sets, which are translated relative to each other. The position of their means is controlled by an overlap parameter chosen such that the clouds are maximally segregated when this overlap is 0 (left) and maximally overlap when the overlap is 1 (right). Thus, the classification task is hardest when the overlap parameter is 1.

5 Performance measurements on benchmark problems

To assess the performance of binary classifiers of the form (18) trained by solving the optimization problems (4) and (12), respectively, we measured their test errors on 30-dimensional synthetic and natural data sets. Synthetic test data was generated by sampling from

$$P(x, y) = \tfrac{1}{2} \delta(y - 1) \, N(x \mid \mu_{+}, I) + \tfrac{1}{2} \delta(y + 1) \, N(x \mid \mu_{-}, I),$$

where $N(x \mid \mu, \Sigma)$ is a spherical Gaussian having mean $\mu$ and covariance $\Sigma$. An overlap coefficient determines the separation of the two Gaussians. The synthetic data is illustrated in Fig. 2. The natural data consists of vectors of Gabor wavelet amplitudes extracted at eye locations in images showing faces.

For comparison we ran the same tests using different implementations of boosting. First we implemented AdaBoost as formulated by Freund and Schapire in [FS99]. Here we used the same weak classifiers as the ones defined in (14)-(17). Additionally, we compared the performance against the GML AdaBoost Matlab Toolbox [Vez06]. The GML toolbox contains implementations of three different flavors of AdaBoost. We ran the tests on all three of them but only display the test results for the best of the three.
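For concreteness, sampling from the synthetic distribution P(x, y) given above could be sketched as follows. The text does not state how the overlap coefficient maps to the means, so the `separation` parameter, the choice of placing the two means symmetrically along the first coordinate, and the function names are all assumptions made for illustration. The `normalize` helper applies the 2-norm normalization of the inputs mentioned in Section 4.

```python
import math
import random
from typing import List, Optional, Tuple


def sample_synthetic(num_samples: int,
                     dim: int,
                     separation: float,
                     seed: Optional[int] = None) -> List[Tuple[List[float], int]]:
    """Draw (x, y) pairs: y = +-1 with probability 1/2 each, x ~ N(mu_y, I).

    Assumption: mu_+ and mu_- lie at +separation/2 and -separation/2 along
    the first coordinate; all other coordinates have zero mean.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        y = 1 if rng.random() < 0.5 else -1
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # identity covariance
        x[0] += 0.5 * separation * y                   # shift the mean for class y
        samples.append((x, y))
    return samples


def normalize(x: List[float]) -> List[float]:
    """Scale x to unit 2-norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)


data = sample_synthetic(num_samples=200, dim=30, separation=2.0, seed=0)
unit_inputs = [normalize(x) for x, _ in data]
```

A smaller `separation` moves the two clouds closer together and plays the role of a larger overlap coefficient in the benchmark.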
The GML toolbox is limited to the equivalent of our order 1 weak classifiers, i.e. classification and regression trees with branching factor of 1.

In the tests we varied the dictionary of weak classifiers, the bit precision used to represent the weights, and $f$, the ratio of training samples to weak classifiers. We used two dictionaries. The first, called the order 1 dictionary, consists of the set of decision stumps with linear arguments $h^{1+}$, $h^{1-}$ as per eqns. (14) and (15). The second one, called the order 2 dictionary, uses the set of weak classifiers of the order 1 dictionary but adds the order 2 decision stumps $h^{2+}$ and $h^{2-}$ described in eqns. (16) and (17) as well. The order 1 dictionary has 60 weak classifiers while the order 2 dictionary employs 930.

Figure 3: Test errors for the synthetic data set with f = 1 and f = 8 for the order 1 dictionary. Each panel plots the error rate against the overlap coefficient for GML, AB1, QP1 (1 bit and 3 bits) and 0-1 Loss, order 1 (1 bit, 3 bits and 64 bits). Note that due to the different variances of the Gaussian distributions for the positive and negative training samples the generalization error does not necessarily approach 0.5 as the overlap becomes maximal.

Figure 4: Test errors for the synthetic data set with f = 1 and f = 8 for the order 2 dictionary. Each panel plots the error rate against the overlap coefficient for AB2, QP2 (1 bit and 3 bits) and 0-1 Loss, order 2 (1 bit, 3 bits and 64 bits).

Figs. 3 and 4 show the test errors on the synthetic data set we obtained for different configurations. QP1 and QP2 denote the classifiers trained with the quadratic program (12) while using the dictionaries of order 1 and 2, respectively. Similarly, 0-1 Loss 1 and 0-1 Loss 2 stand for the classifiers using dictionaries 1 and 2 trained by solving the optimization problem (4). AB1 and AB2 denote classifiers trained for the same dictionaries but with the AdaBoost algorithm. Finally, GML represents the best result obtained with the GML AdaBoost Matlab toolbox. For GML, only a dictionary equivalent to our order 1 dictionary is available. The figures show test errors obtained on data sets that were not used during the training but were drawn i.i.d. from the same distributions. Test error is plotted against overlap coefficient for the range 0.7 to 1, corresponding to an increasingly harder classification problem. Accordingly, we observe an increasing error rate.

A number of observations can be made. First, the classifiers trained with global optimization outperform those obtained by the greedy feature selection methods employed in boosting. Second, with the exception of QP1, the global optimizations that used fewer bits to represent the weights did better than those that employed more. Classifiers using the richer order 2 dictionary achieved a lower test error, which, given the structure and size of the dictionaries relative to the input dimensionality, is not surprising. We do not draw conclusions from the fact that 0-1 loss fared worse than quadratic loss. This could be an artifact of using simulated annealing to solve (4), while tabu search was employed to optimize (12).

Besides synthetic data, we also did testing with natural data. The table in Fig. 5 shows the results obtained from a test set consisting of vectors of Gabor wavelet amplitudes extracted at eye locations in images showing faces. The data consisted of 2, input vectors, which we divided evenly into a training set, a validation set to fix the parameter $\lambda$, and a test set. For QP the bit-constrained learners always performed better in terms of accuracy and classifier compactness. In the case of 0-1 Loss the performance was similar for the different bit depths, with small trade-offs between accuracy and compactness. The global optimization approach using the quadratic objective function (12) yields the best results. The accuracy is only increased by less than 1% relative to AdaBoost, but this is accomplished with a reduction of more than 5% of the switched-on weak classifiers.

Figure 5: Test results obtained for a natural data set, which consisted of vectors of Gabor wavelet amplitudes extracted at eye locations in images showing faces. Each data cell in the tables contains two numbers: the first represents the respective error rate, and the second gives the number of weak classifiers with a non-zero weight. The values are averages obtained through cross-validation of 1 runs. The shaded cells indicate the most accurate results.

6 Discussion

We have seen an impressive performance of global optimization approaches that minimize a regularized measure of training error to find an optimal combination of weights for constructing a binary classifier. Global optimization competes successfully with greedy methods such as the state-of-the-art method AdaBoost. Further, we discovered that bit-constrained learning machines often exhibit a generalization error that is lower than the one obtained when the weights are represented with higher precision. To the best of our knowledge, this has not been studied before. Bit constraining can be regarded as an intrinsic regularization that contributes to keeping the model complexity low. The finding that the bit precision needed to realize the optimal training error only grows logarithmically with the ratio of the number of training examples to weak classifiers supplies insight into why few-bit learning machines work.

The competitive performance of bit-constrained classifiers suggests that training benefits from being treated as an integer program. This has a twofold implication. First, this is good news for hardware-constrained implementations such as cell phones, sensor networks, or early quantum chips with small numbers of qubits. Second, this renders the training problem manifestly NP-hard, thus further motivating the application of quantum algorithms that may generate better approximate solutions than those classically available.

Our next steps will be to investigate the advantages that global optimization with AQC hardware offers for our problem instances. We plan to use the next generation of D-Wave chips with 128 qubits.
This will involve adjusting our implementation to additional engineering constraints of the existing AQC hardware, such as a sparse connectivity graph among the qubits. Employing AQC during the training phase has the significant benefit that once the optimal set of weights has been computed, it can be used by an entirely classical processor.

In this work we only considered fixed dictionaries of weak classifiers. An important generalization that remains to be studied is to apply this framework to adaptive dictionaries.

We want to conclude with the remark that our finding that bit-constrained learning has good generalization properties may have implications when studying plasticity in the nervous system, where it is still an unresolved problem how a synapse can store information reliably over a long period of time [KSJ00].

Acknowledgments

We would like to thank Hartwig Adam, Jiayong Zhang and Xiaowei Li for their help with preparing the natural test data; Alessandro Bissacco for his assistance with Matlab and reviewing the boosting code; Edward Farhi, Yoram Singer, Ulrich Buddemeier and Vint Cerf for commenting on earlier versions of the paper.

References

[FGG+01] Edward Farhi, Jeffrey Goldstone, Sam Gutmann, Joshua Lapan, Andrew Lundgren, and Daniel Preda. A quantum adiabatic evolution algorithm applied to random instances of an NP-complete problem. Science, 292:472, 2001.
[FGGS00] Edward Farhi, Jeffrey Goldstone, Sam Gutmann, and Michael Sipser. Quantum computation by adiabatic evolution. 2000. Preprint quant-ph/0001106v1.
[FS99] Yoav Freund and Robert E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, 1999.
[HR98] Christoph Helmberg and Franz Rendl. Solving quadratic (0,1)-problems by semidefinite programs and cutting planes. Math. Program., 82(3), 1998.
[KN01] Kengo Katayama and Hiroyuki Narihisa. Performance of simulated annealing-based heuristic for the unconstrained binary quadratic programming problem. European Journal of Operational Research, 134(1):103–119, 2001.
[KSJ00] Eric R. Kandel, James H. Schwartz, and Thomas M. Jessell. Principles of Neural Science. McGraw-Hill, 2000.
[OT92] Peter Orlik and Hiroaki Terao. Arrangements of hyperplanes. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, Germany, 1992.
[Pal04] Gintaras Palubeckis. Multistart tabu search strategies for the unconstrained binary quadratic optimization problem. Ann. Oper. Res., 131, 2004.
[Sau72] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, 13, 1972.
[Vez06] Alexander Vezhnevets. GML AdaBoost Matlab toolbox 0.3. MSU Graphics & Media Lab, Computer Vision Group, Department of Computer Science, Moscow State University, 2006.
[Zha08] Tong Zhang. Forward-backward greedy algorithm for learning sparse representations. Rutgers Statistics Department Technical Report, 2008.
