2. Accelerated Computations

Size: px

Start display at page:

Download "2. Accelerated Computations"

Clinton Hicks
5 years ago
Views:

1 2. Accelerated Computations 2.1. Bent Function Enumeration by a Circular Pipeline Implemented on an FPGA Stuart W. Schneider Jon T. Butler Background A naive approach to encoding a plaintext message is to Exclusive OR one 7-bit key with all 7-bit ASCII characters. In this way, each plaintext letter is converted to a different letter in the ciphertext. Decryption is simple: just Exclusive OR the same key to all 7-bit ASCII characters in the ciphertext. Since the second application of the key annihilates the first application, the plaintext letter emerges. The problem with this is that the distribution of probabilities of the letters across the plaintext also occurs in the ciphertext. For example, the most frequent letters in the ciphertext are likely to be e or t, and the least frequent letters are likely to be z or q. To avoid decryption by an outsider, one seeks a key stream that is random. However, high-speed highly parallel computers can be used to exploit even small variations from pure randomness in the key stream. For example, in a linear attack, a key stream is tried that is generated from a linear Boolean function. If the actual key stream used in encryption is close to linear, there will be errors, but such an attack may be ultimately successful. Against such attacks, one seeks a function that is as far from linear as possible. These are the bent functions.

2 100 Accelerated Computations Properties of Bent Functions Bent functions are the most nonlinear of all n-variable functions, where n is even. They were introduced in 1976 by Rothaus [256]. Definition 2.14 Linear Function. A linear function is the constant 0 function or the Exclusive OR of one or more variables. Example 2.9. There are eight 3-variable linear functions, 0, x 1, x 2, x 3, x 1 x 2, x 1 x 3, x 2 x 3, and x 1 x 2 x 3. Only one of the eight functions actually depends on all three variables. However, we will view all eight functions as functions of 3 variables. Definition 2.15 Affine Function. An affine function is a linear function or the complement of a linear function. Example There are 16 different 3-variable affine functions, 0, x 1, x 2, x 3, x 1 x 2, x 1 x 3, x 2 x 3, x 1 x 2 x 3, 1, x 1 1, x 2 1, x 3 1, x 1 x 2 1, x 1 x 3 1, x 2 x 3 1, x 1 x 2 x 3 1. Affine functions are one extreme type of switching functions. Definition 2.16 Nonlinearity. The nonlinearity NL f of a function fx is the minimum number of truth table entries that must be changed in order to convert fx to an affine function. That is, the nonlinearity of a function fx is the minimum Hamming distance between the truth table of fx and the truth tables of the affine functions. Example Among the 3-variable functions, the function fx = x 1 x 2 x 3, which is not affine, has nonlinearity 1, since converting the single 1 in its truth table to a 0 creates the constant 0 function, which is affine. Definition 2.17 Bent Function. Let fx be a Boolean function on n-variables, where n is even. fx is a bent function if its nonlinearity is as large as possible. There are no bent functions with odd n. However, we use the term bent

3 Bent Function Enumeration Using an FPGA 101 to describe n-variable functions with maximum nonlinearity regardless of whether n is even or odd. The maximum value of the nonlinearity of n-variable bent functions, where n is odd, is known only for specific values of n. We know from [256] that the maximum value of the nonlinearity of n-variable bent functions, where n is even, is 2 n 1 2 n 2 1. Bent functions have the property that they are at a maximum distance from all affine functions. For example, fx = x 1 x 2 x 3 x 4 is a bent function on 4 variables; it is at a distance 6 from 16 of the 32 affine functions on 4 variables and at a distance 10 from the other 16 affine functions. That is, at least six entries of the truth table of fx must be changed to convert it into an affine function. Further, there are 16 affine functions that can be achieved by changing six entries in the truth table of fx. Since there are no 4-variable functions whose nonlinearity is 7 or larger, it follows that fx = x 1 x 2 x 3 x 4 is bent. Bent functions are important because of a cryptanalysis technique in which nonlinear functions used in the encryption process are approximated by linear functions. That is, when the encryption is linear, decryption is straightforward. When the encryption is slightly nonlinear, then a linear approximation can be used, with an understanding that decryption is erroneous but potentially correctable. Indeed, Matsui [212] proposes a linear attack of the Data Encryption Standard DES. Bent functions are valued because they are the most difficult to approximate by linear functions. Bent functions have been used in practical cryptographic algorithms, e.g., in the block ciphers CAST-128 and CAST-256 [3] Architecture for Bent Function Discovery Original Architecture. An interesting question that has remained open since Rothaus seminal contribution [256] is the number of bent functions. Although constructions of specific bent functions exist, there is no known way to effectively generate all bent functions. The only way to generate all bent functions is to sieve through a set of

4 102 Accelerated Computations function index function generator transform counter AlgNF index transeunt triangle switch truth table nonlinearity NL generates function under test ROTS index ROTS generator store yes maximum NL? Figure 2.1. Original bent function enumeration circuit. functions known to contain all bent functions. A reconfigurable computer is ideal for this task. Figure 2.1 shows our original architecture used to enumerate bent functions [60]. A counter, on the left, enumerates indices to functions under test. This drives a transform block that enumerates functions in the set under test. Specifically, three sets can be enumerated: 1. all n-variable functions function generator; 2. all functions with specified degrees transeunt triangle; and 3. all rotation symmetric functions ROTS generator. The transform block, in turn, drives a block labeled nonlinearity which computes the function s nonlinearity, N L. If N L is maximum, the function is bent, and it is stored. This architecture was implemented on a high-end reconfigurable computer, the SRC-6 which uses the Xilinx Virtex-II Field Programmable Gate Array FPGA and two Intel Xeon microprocessors. The Verilog program on this reconfigurable computer achieved a 60,000 times speed-up over a C program running on the SRC-6 s Xeon microprocessor [295]. In this contribution, we show that the computational capability of the SRC-6 is available on a much less expensive Digilent

5 Bent Function Enumeration Using an FPGA 103 ZedBoard development board, which uses a Xilinx Zynq-7000 FPGA and an ARM processor. The Zynq FPGA runs at 100 MHz, the same frequency as the Virtex-II. The ZedBoard system, at about $500, is much less expensive than the SRC-6, at about $500,000. This cost is for a complete SRC-6 with ten high-end Xilinx Virtex II FPGAs and two Xeon microprocessors. Also, the SRC-6 has a large memory and an extensive hardware/software interface. Only one Virtex II FPGA and none of the extra memory was used in our bent function enumeration. Generation of All Functions When the switch block in Figure 2.1 shows that the function generator drives the nonlinearity block the switch is in the up position, the circuit computes the bent functions over all n-variable functions. In this case, the function generator block contains no logic, it simply connects input wires to output wires. Generation of Functions by Algebraic Normal Form AlgNF. When the switch is in the horizontal position, nonlinearity is computed from a truth table of the function that is a transformation from the algebraic normal form of that function. Definition 2.18 Algebraic Normal Form. The Algebraic Normal Form AlgNF 1 of a function fx is fx = c af {0,1} c af x a1 1 xa xan n where is the XOR, and a f = a 1 a 2... a n. Example fx = x 1 x 2 x 3 x 4 has the AlgNF fx = x 1 x 2 x 3 x 4, and a f = gx = x 1 x 2 x 3 x 4 has the AlgNF gx = x 1 x 2 x 3 x 4, and a g = The abbreviation ANF is often used for Algebraic Normal Form in cryptographic literature, but in the Boolean Domain for Antivalence Normal Form. To avoid confusions we use in this book AlgNF as abbreviation of Algebraic Normal Form. An AlgNF is also called positive polarity Reed-Muller form or Zhegalkin polynomial.

6 104 Accelerated Computations Definition 2.19 Degree of a Function. The degree of a function fx is the number of variables in the terms in the AlgNF of fx with the largest number of variables. Example The degrees of f = x 1 x 2 x 3 x 4 and g = x 1 x 2 x 3 x 4 are 4 and 2, respectively. Definition 2.20 Affine Equivalent. Functions fx and gx are affine equivalent if there is an affine function h a x such that fx = gx h a x. The affine equivalent relation is reflexive, symmetric, and transitive, and therefore, is an equivalence relation. As such, it divides the functions into equivalence classes, affine classes. Example fx = x 1 x 2 x 3 x 4, a non-bent function, belongs to an affine class of 32 functions. gx = x 1 x 2 x 3 x 4, a bent function, belongs to an affine class of 32 functions. Each affine class contains the same number of functions, namely 2 n+1. Further, the AlgNFs of all functions in the same class are the same except for the coefficients of the constant term or the linear terms i.e., one-variable terms. From Rothaus [256], we know that functions in an affine class are either all bent or all non-bent. Therefore, to generate all bent functions, it is sufficient to enumerate only one representative of each affine class and then to use that to generate all 2 n+1 functions in that class. For example, we could generate all functions with neither degree 1 nor degree 0 terms. Based on this observation, our approach is to simply generate all bent function representatives across all functions of the degree n 2 or less. In the hardware, this is accomplished by setting the constant, linear, and terms with a degree greater than n 2 to 0 and connecting all other terms to the counter outputs. If we enumerated all 2 26 = variable functions at an FPGA clock rate of 100 MHz, it would take 5,849 years. However, Rothaus [256] showed that all bent functions are contained within the set of functions that have a degree no greater than n 2. By eliminating functions with a degree greater than n 2, it is only necessary to

7 Bent Function Enumeration Using an FPGA 105 enumerate = 2 42 = functions in the case of n = 6. As before, a function in an affine class is bent if and only if all functions in the same affine class are bent [256]. As a result, the number of bent functions is found by multiplying the number of affine classes by the number of functions in each class, = 2 7 = 128. The number of affine classes with degree 3 or less for n = 6 is = 2 35 = If we compute one class function per 100 MHz clock period, this enumeration takes only 5.7 minutes. That is, by enumerating only the affine classes corresponding to functions of degree 3 or less, we achieve a reduction to = 1 536, 870, 912 of the original computation time. However, this requires that we quickly convert between the AlgNF of a function and its truth table form. Indeed, we can do this quickly; we need the transeunt triangle, which is discussed in [295]. In Figure 2.1, the counter enumerates the AlgNFs with degree n 2 3 in the case of n = 6 or less. These are applied to the input of the transeunt triangle, which converts each AlgNF into a truth table form. The truth table form is then applied to the nonlinearity circuit, which computes the nonlinearity of each function. The transeunt triangle and nonlinearity circuit blocks are described in [295]. Generation of Rotation Symmetric Functions. When the switch in Figure 2.1 is in the down position, the nonlinearity is computed for rotation symmetric functions. Introduced in 1999 by Pieprzyk and Qu [237], rotation symmetric functions were shown to be especially efficient with respect to the number of additions and multiplications used in hashing algorithms for cryptographic applications. Stănică and Maitra [87] have counted the bent n-rotation symmetric functions for n = 4, 6, and 8. Kavut et al. [169, 170] have found examples

8 106 Accelerated Computations of 9-variable functions with high nonlinearity, 241, which are rotation symmetric functions. They have found 9-variable functions with nonlinearity 242 that are outside the set of rotation symmetric functions. Definition 2.21 Rotation Symmetric Function. A function fx = fx 1, x 2,..., x n is a rotation symmetric function ROTS if and only if fa 1, a 2,..., a n = fa n, a 1,..., a n 1, for all assignments a 1 a 2... a n of values a i {0, 1} to the variables x 1 x 2... x n. Example The 4-variable AND function f = x 1 x 2 x 3 x 4 ROTS function, since is a f0, 0, 0, 1 = f0, 0, 1, 0 = f0, 1, 0, 0 = f1, 0, 0, 0 = 0, f0, 1, 1, 1 = f1, 1, 1, 0 = f1, 1, 0, 1 = f1, 0, 1, 1 = 0, f0, 1, 0, 1 = f1, 0, 1, 0 = 0, and f0, 0, 1, 1 = f0, 1, 1, 0 = f1, 1, 0, 0 = f1, 0, 0, 1 = 0. The 4-variable bent function f = x 1 x 2 x 3 x 4 is not a ROTS function, since 1 = f0, 0, 1, 1 f0, 1, 1, 0 = Circular Pipeline Architecture In the original architecture of the previous subsection, each potential bent function fx is compared to all affine functions simultaneously, a distance is computed for each, and the minimum is extracted. If this minimum distance is 2 n 1 2 n 2 1, then fx is declared to be bent. Definition 2.22 Bent Distance. 2 n 1 2 n 2 1 and 2 n n 2 1 are bent distances weights. Are there any affine functions α such that the distance to a bent function is more than 2 n 1 2 n 2 1? The answer is no, as shown below.

9 Bent Function Enumeration Using an FPGA 107 Theorem A function fx is bent if and only if fx has a bent distance from all affine functions. Proof. only if On the contrary, assume there is a bent function fx and an affine function h a x whose distance is not a bent distance. Then, the distance between fx and h a x is greater than 2 n 1 2 n 2 1 and smaller than 2 n n 2 1. Since the distance between fx and h a x is greater than 2 n 1 2 n 2 1 and less than 2 n n 2 1, so also is the distance between h ax = h a x h a x = 0 and fx h a x. Thus, it follows that the function fx h a x has a weight greater than 2 n 1 2 n 2 1 and less than 2 n n 2 1, a contradiction. if If the distance of fx to an affine function is 2 n 1 2 n 2 1 or 2 n n 2 1, then the minimum distance is 2 n 1 2 n 2 1 and the function is bent. Note that it is sufficient to compare functions against the 2 n linear functions. Indeed, if some function fx is at a bent distance from any linear function gx, it is also at a bent distance from its complement gx = gx 1. Therefore, Theorem 2.12 says that there are only two values for the distance between all bent functions and any linear function, namely the two bent distances. This allows an efficient test of bent functions based on the bent distances. Specifically, if a function is not at a bent distance to some linear function, it is not a bent function. This forms the basis for the circular pipeline. In its simplest form, there are 2 n stages, each comparing a given function with one fixed linear function. If the given function and the fixed linear function are not at a bent distance apart, the given function is immediately discarded and replaced by another given function. This naturally brings up the question of how many bent distances exist for one given function fx, such that fx is not a bent function. Figure 2.2 from [60] shows the distribution of 4-variable functions according to the number of linear functions for which a bent distance exists 6 or 10 in the case of n = 4. As expected, all 896 bent functions are a bent distance 6 or 10 from all 16 linear functions on 4 variables. A large number, 26,880, have a bent distance from 8 linear functions, and 3,840 have a bent distance from 7 linear functions. A large number, 33,920, more than half of all 4-variable functions, have no bent distances. Also, it can be seen that a function having

10 108 Accelerated Computations 10 4 number of functions ,920 3,840 26, number of bent weights Figure 2.2. Distribution of functions to the number of bent weights for n = 4 [60]. more than 8 bent distances is a bent function. This serves as the basis for another test of bentness. That is, functions can be declared as non-bent if they have 8 or fewer bent distances. In the circular pipeline, we compare one given function with one linear function. If the two are not at a bent distance away from each other, then the given function can be immediately declared to be non-bent without testing it against any other linear functions. Further, if it has no more than 8 bent distances to linear functions, then it is also declared to be not bent. The advantage of this is a significant reduction in the hardware needed to test for bent functions. A MATLAB program was written to simulate the circular pipeline. In this simulation, the linear functions were arranged in the pipeline in natural order. That is, the very first function, the constant 0 function was first matched with the constant 0 linear function, then x n, then x n 1, then x n 1 x n, etc. Then, the function x 1 x 2... x n was inserted, etc. As a function passes through the circular pipeline, it exits at a time that depends on how many bent weights it has. For example, if it has 0 bent weights, it exits after one clock period. And, if it is bent, then it exits after all comparisons with linear functions, which is 2 n. From Figure 2.2, these are 16 comparisons n = 4. However, if the number of bent weights is somewhere in-between, then

11 Bent Function Enumeration Using an FPGA 109 number of functions ,698 8,130 3,930 1, persistence in a circular pipeline Figure 2.3. Distribution of functions to persistence in a circular pipeline for n = 4 [60]. the time it exits is also determined by when the first non-bent weight is detected. For example, from Figure 2.2, 26,880 functions have 8 bent weights. For these functions, the time at which it exits the circular pipeline is after 1, 2,..., or 9 comparisons clock periods. Note that, in a circular pipeline, we cannot measure a function s bent weight, because functions cannot be compared to all linear functions. On the other hand, we can measure how long it stays in the circular pipeline. We call this number the function s persistence. Figure 2.3 from [60] shows a histogram of the persistence values, for n = 4, as computed by this simulation. For example, 49,698 of the 65,536 functions have a persistence of 1. These include all functions with 0 bent distances, as well as functions with at least one bent distance that happen to have a non-bent distance to the first affine function against which it was tested. There is an interesting lack of functions with persistence 10, 11, 12, 13, 14, and 15. We have performed similar analysis for larger values of n. The results are similar. A 4-variable function could be declared to be bent if it has more than 8 bent weights. However, for general n- variable functions, we do not know the limit beyond which the function can be declared to be bent. We have not pursued this question because

12 110 Accelerated Computations it only improves the speed in the case of bent functions, and we know that such functions represent a small fraction of all functions. That is, the time penalty for over-testing bent functions is small Circuit of the Circular Pipeline linear function linear function linear function linear functions linear function circular pipeline cond. gen. transform Ham. dist. ones count main computation test circular pipeline function under test functions under test function under test function under test Figure 2.4. Architecture of a single circular pipeline. Figure 2.4 shows the circular pipeline circuit. Here, the bottom circular structure delivers functions to be tested for bentness. It is implemented as a pipeline of registers where the final stage is fed back to the initial stage. Each stage contains the function s truth table and the number of linear functions whose Hamming distance is a bent distance. The top circular structure delivers linear functions. If a given function is found to have a non-bent distance from a linear function, it is replaced by a new function. The test that determines this occurs in the computation unit near the center of Figure 2.4. Here, a

13 Bent Function Enumeration Using an FPGA 111 five stage pipeline generates the function and applies it to a transform block whose function depends on what is being counted. Specifically, the transform block is a truth table index, an AlgNF index, or a ROTS function index, depending on which function sets, all functions, AlgNFs, or ROTS functions, respectively, are being counted. The transform block then drives the sequence of three nonlinearity blocks that computes the Hamming distance to the current linear function, performs a ones count to determine that distance, and tests its value using a comparator. If that distance is a bent distance, the function is retained and returned to the function generation stage of the circular pipeline. Accompanying each function in the pipeline is a counter that indicates the number of linear functions from which it is a bent distance away. When that number is 16 in the case of 4-variable functions, the function is declared to be bent. Note that the number of stages in the circular pipeline 16 must be relatively prime to the number of stages in the given function circular pipeline 5; thus, each given function may be compared to all linear functions Experimental Results Figure 2.5 plots the speed-up achieved in computing n-variable bent functions for the case of n = 4 all functions and n = 6 & n = 8 rotation symmetric functions. All speed-up values are shown followed by, and are based on the speed-up achieved by the circular pipeline compared to the original circuit. The speed-up values shown on the extreme left correspond to a single pipeline, and all are less than 1.0, which represents a slow-down; that is, the original circuit is faster than a single circular pipeline. For example, for n = 4, 105,773 clock periods are needed to compute the bent functions using a single pipeline, versus 65,536 clock periods in our original design. This yields a speed-up of 65,536/105,773 = , which is a slowdown of 105,773/65,536 = It s interesting that it is close to the Golden ratio, The horizontal axis shows the number of circular pipelines running in parallel. As this number increases, the speed-up increases. For example, on the extreme right for 4-variable functions computed with 256 pipelines, the speed-up is over our original design. This data shows that we were able to implement 128 and 16 pipelines for n = 6 and n = 8, respectively. Figure 2.5 also

14 112 Accelerated Computations LUTs used speed-up n = LUTs n = 8 12,715 speed-up n = 4 speed-up n = ,555 LUTs n = 6 9,802 8,338 5,740 7,120 3,090 3,469 4, ,006 1,300 1, , ,097 27, ,731 51, number of pipelines speed-up 19,189 14, LUTs n = 4 10, Figure 2.5. Speed-Up and number of LUTs versus the number of circular pipelines. shows the number of Look-Up Tables LUTs needed in the implementation. As expected, the number of LUTs doubles approximately for each doubling of the number of circular pipelines. For example, for n = 4 and 256 pipelines, 19,189 LUTs are needed compared to 796 for the case of one pipeline or 19,189/796 = 24.1 times. Table 2.1 shows the number of clock cycles used in various configurations of the circular pipeline architecture for the case of n = 4 variables. The second column, labeled original shows the results from our original circuit, in which each function was compared simultaneously against all linear functions. In this case, there is one clock cycle for each of 65,536 4-variable functions for a total of 65,656 clock cycles. The next column, labeled by one shows the results of a single circular pipeline. In this case, a total of 105,773 clock cycles are needed to process all 65,536 4-variable functions. This is slower than the original method. However, this is an expected result, since, in one clock cycle, only one function under test is compared to one linear function. In the case of the original circuit, one function under

15 Bent Function Enumeration Using an FPGA 113 Table 2.1. Speed and resources used in the circular pipeline for n = 4 compared to the resources available in the Xilinx Zynq FPGA configuration original one four eight total clocks 65, ,773 26,523 13,286 available avg. performance resources speed-up on FPGA LUT * 2,608 2,814 3,108 53,200 LUTRAM * ,400 FF * 5,163 5,515 5, ,400 BRAM * IO * BUFG * test is compared to 16 linear functions during one clock cycle. The next column, labeled four shows the case of four circular pipelines operating simultaneously. In this case, 26,523 clocks are required to process all 65,536 functions under test, which is four times faster than a single circular pipeline and about two times faster than the original circuit. The next column, labeled eight, shows the case of eight circular pipelines operating simultaneously. In this case, 13,286 clocks are needed to process all 65,536 4-variable functions. This is about five times faster than the original circuit and about eight times faster than the circular pipeline with one copy of the circular pipeline. As observed earlier, in our original implementation, where each function under test was compared to all linear functions simultaneously, only 65,536 clocks were required to determine all the 4-variable bent functions. Specifically, there are 65,536 functions and only one clock was needed to test each function. Of course, there are many more tests for bentness, 65, = 1, 048, 576 to be exact, which is about 10 times that of a single circular pipeline. However, the circular pipeline implementation is much more compact compared to the original version in which each function was tested against all linear functions simultaneously. Table 2.1 also shows how multiple copies of the circular pipeline yield a system that is not only much faster but also requires fewer resources. It is interesting that the various resources used increase only slightly as the number of circular pipelines are implemented. This is due to the fact that overhead circuitry amounts

16 114 Accelerated Computations Table 2.2. Simultaneous bent functions found per clock period for n = 4 # simultaneous pipelines # clocks total 105, # simultaneous bent functions to a large fraction of the resources used. For example, the number of LUTs used for one, four, and eight circular pipelines is 2,608, 2,814, and 3,108. The right-most column shows the total resources available. For example, even in the worst case, no more than 6% of the available LUTs are used. Therefore, the circular pipeline is not only faster but also more efficient than the original architecture. Table 2.2 shows how many bent functions are discovered in one clock. This is important because of the overhead needed to collect all bent functions in one clock period. In our circuit, we only counted bent functions. We did not record them. For example, the column headed by 1 shows the case of a single pipeline. Here, at most one bent function can occur in a single clock period. Since there are variable bent functions, the value 896 appears in the column headed by 1 and in the row headed by 1. All of the remaining 4-variable functions, = 64, 640, represent no bent functions; and so the column headed by 1 and the row headed by 0 has 64,640. In the case of a single pipeline, a total of 105,773 clocks are needed, as shown in Table 2.2. Thus, it follows that 105, , 536 = 40, 237

17 Bent Function Enumeration Using an FPGA 115 clock periods produce neither a bent function nor a non-bent function. An important observation is the 1 on the lower right-hand side of Table 2.2. This shows, that, in the case of 256 pipelines, in one clock period, 20 of these 256 pipelines produce bent functions. That is, the maximum number of simultaneous bent functions that must be recorded in a single clock period is Analytical Results It is interesting to examine why a single circular pipeline takes 105,773 clocks, as shown in the third column and second line of Table 2.1. Note that the 33,920 functions in Figure 2.2 with exactly 0 bent weights each, exit the circular pipeline after one clock. On the other hand, 896 bent functions exit after 16 clocks, not 17, since a maximum of 16 tests is sufficient. That is, once a function survives 16 clocks, no additional tests are needed, it is guaranteed to be bent. For the functions with persistence 7 and 8, the situation is more complicated. Consider first the functions with persistence 8. Table 2.3 shows all possible arrangements of functions in the pipeline according to how many clocks they persist in in the pipeline. Table 2.3. Functions with persistence 8 and their contributions arrangements number of clocks clocks of bent weights patterns per function all functions , , , , , total clocks 24, 310 average number of clocks 24, 310/ total number of with persistence 8 26, , 773

18 116 Accelerated Computations For example, in the third row of Table 2.3, the entry 10 represents a function whose first weight is bent 1 and whose second is non-bent 0; such a function persists in the circular pipeline for 2 clocks. Assuming that the bent weight distributions are uniformly distributed and all arrangements are possible which they are not, the contribution to the cumulative clock count is = 6, 864 there are 14 remaining positions, 7 of which have bent weight, for a total of 8 1 s. Summing across all patterns of bent weights yields a total clock count of 24,310. There are 16 8 possible patterns, which we assume to be equally likely. Thus, the average number of clocks per function is 24, 310/ 16 8 = Since there are 26,880 functions with persistence 8, we expect these to contribute a total of 26, = 50, 773 clocks. A similar calculation can be performed on the 3,840 functions with persistence 7, as shown in Table 2.4. In this case, the number of clocks is 19,448, and the average number of clocks per function is 1.70 = 19, 448/ Since there are 3,840 functions with persistence 7, we expect these to contribute a total of 3, = 6, 528 clocks. Table 2.4. Functions with persistence 7 and their contributions arrangements number of clocks clocks of bent weights patterns per function all functions , , , , total clocks 19, 448 average number of clocks 19, 448/ total number of with persistence 7 3, , 528 Table 2.5 summarizes the number of clocks due to the various functions of persistence 0, 7, 8, and 16. Tallying all the contributions yields a total of 105,537, as shown in Table 2.5, which is close to the

19 Bent Function Enumeration Using an FPGA 117 experimentally derived number of clocks, 105,773 clocks. Table 2.5. Clocks used by functions of various persistence values persistence number of average number of clocks functions clocks per function 0 33, , , , , , * ,336 total clocks 105, 537 *bent functions Another interesting observation from this table is the relatively small total number of clocks needed for bent functions. Indeed, all nonbent functions are eliminated early relative to their persistence value. This occurs, of course, because once a non-bent weight is found, the function can be immediately declared to be non-bent. Bent functions, on the other hand, can only be declared as bent if their Hamming distance from all linear functions is a bent weight. Therefore, their persistence values are larger. This is clearly shown in Table Practical Aspects Figure 2.6 shows the ZedBoard on which a Zynq FPGA and an ARM processor are programmed to produce the experimental results shown Figure 2.6. The Xilinx ZedBoard system.

118 Accelerated Computations in this contribution. The ZedBoard connects with a standard laptop where the Xilinx Vivado application is used to describe the hardware that is put on the ZedBoard.

20 118 Accelerated Computations in this contribution. The ZedBoard connects with a standard laptop where the Xilinx Vivado application is used to describe the hardware that is put on the ZedBoard. Most of our effort was devoted to the development of Verilog code that configures the FPGA, as described in this section. However, there is a C program which runs on the ARM processor. The ARM processor executes the C code to interface with the Zynq FPGA and extract data to be displayed on the terminal. Figure 2.7 shows the ZedBoard and a laptop where the software development took place. The display shows the Vivado window and most of the circuit. Figure 2.7. The complete system. We note that this section emphasized tools. This section builds on our previous publications which introduced the reconfigurable computer as a way to enumerate bent functions for use in cryptographic applications [60, 287, 295]. Because the computation occurs in hardware, it is much faster than conventional software methods, of the order of 60,000 times faster. A naive way to improve on this is through multiple copies of the hardware. For example, replicating the original hardware 256 times allows us to increase throughput by 256 times. But, 256 copies consumes more hardware resources than are available. To compensate, we introduce the circular pipeline. We show that a single circular pipeline operating on 4-variable functions uses 105,773 clock periods to compute the bent functions, versus 65,536 clock periods in our original design. This is a reduction of 0.62 in computation

21 Bent Function Enumeration Using an FPGA 119 speed. However, we can replicate the circular pipeline 256 times. In this case, our experimental results show that we can achieve a speedup of times over our original design. At the same time, the complexity of the two circular pipeline circuits, as measured by the number of LUTs, increases from 796 to 19,189, which corresponds to only a 24.1 times increase in the number of LUTs.

Enumeration of Bent Boolean Functions by Reconfigurable Computer

Enumeration of Bent Boolean Functions by Reconfigurable Computer J. L. Shafer S. W. Schneider J. T. Butler P. Stănică ECE Department Department of ECE Department of Applied Math. US Naval Academy Naval