Efficient, portable template attacks


Efficient, portable template attacks. Marios O. Choudary, Markus G. Kuhn, Computer Laboratory. Paper: IEEE Trans. Inf. Foren. Sec. 13(2), Feb. 2018.

[Figure: measured supply-current trace, current (mA) vs. time (µs).]

Side-channel attacks on microcontrollers I

The power-supply current waveform of microprocessors (and the resulting EM emissions) is affected at each clock cycle by:
- the (category of the) executed instruction
- addresses/registers accessed
- operands
- status flags
- result values
- prior state (of wires, bus lines, flip-flops, memory cells)
- intermediate activities (e.g., glitches before ALU results are stable)
- micro-architectural state
- etc.

Side-channel attacks on microcontrollers II

Instruction categories are often easy to distinguish visually, e.g. whether a conditional branch is taken or not ("simple power analysis"). In some cases (e.g., with interpreters) this enables reconstruction of executed application instruction sequences from recordings of a single execution.

Data-dependent variations require more effort to separate from measurement noise:
- repeated measurements
- statistical signal processing
- exploitation of knowledge of the executed algorithms
- low-noise/low-jitter measurement setup

[Figure: current traces for 256 different values of a password byte, mA vs. µs. Legend: wrong inputs: min/max measured currents; wrong inputs: min/max difference to median; correct input: current; correct input: difference to median.]

Side-channel attacks on microcontroller data busses

Many techniques have been demonstrated since 1998 to exploit data-dependent variations in power and EM emissions. Most of these reconstruct subkeys used in known crypto algorithms by observing the operation v_k(p) = S(p ⊕ k) with known plain-text input p and substitution table ("s-box") S, e.g. in the first round of a block cipher.

Differential Power Analysis [Kocher et al., 1998]:
- for all candidate subkey bytes k ∈ S and each observed input p, predict one bit b of v_k(p)
- estimate the leakage trace x̄_{k,b}(t) as a function of b (by averaging many traces with different p but identical k and b)
- only the correct candidate key k will cause a significant peak at some time t in the difference-of-means trace x̄_{k,1}(t) - x̄_{k,0}(t)

Only if the assumed k was correct will we have split our set of recorded traces correctly into two piles, one for b = 0 and one for b = 1, such that the two average traces, one for each pile, show a difference (contributed by b).
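To make the difference-of-means step concrete, here is a minimal NumPy sketch (not from the slides); the array names traces and plaintexts, the sbox lookup table and the targeted bit index are illustrative assumptions.

import numpy as np

def dpa_difference_of_means(traces, plaintexts, sbox, bit=0):
    """traces: (N, T) array of leakage samples; plaintexts: (N,) known input bytes;
    sbox: length-256 NumPy array. Returns a (256, T) array of difference-of-means
    traces, one per candidate subkey."""
    diffs = np.empty((256, traces.shape[1]))
    for k in range(256):
        v = sbox[plaintexts ^ k]            # predicted intermediate value v_k(p) = S(p XOR k)
        b = (v >> bit) & 1                  # predicted bit b
        diffs[k] = traces[b == 1].mean(axis=0) - traces[b == 0].mean(axis=0)
    return diffs

# The correct k should show the largest peak:
# k_guess = np.argmax(np.max(np.abs(diffs), axis=1))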

Side-channel attacks on microcontroller data busses (continued)

Correlation Power Analysis:
- for all candidate subkeys k ∈ S, predict a value f(v_k(p)) that is expected to be proportional to some samples in the leakage traces, e.g. the Hamming weight of v_k(p)
- the correct candidate key k will cause the highest Pearson correlation coefficient between f(v_k(p)) and some sample positions in the recorded leakage traces
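A corresponding CPA sketch, again only illustrative: it assumes the same hypothetical traces, plaintexts and sbox arrays as above and uses the Hamming weight of v_k(p) as the leakage model.

import numpy as np

def cpa_correlation(traces, plaintexts, sbox):
    """Pearson correlation between the Hamming weight of the predicted v_k(p)
    and every sample position, for each candidate subkey k."""
    hw = np.array([bin(x).count('1') for x in range(256)])
    X = traces - traces.mean(axis=0)          # centred traces, shape (N, T)
    corr = np.empty((256, traces.shape[1]))
    for k in range(256):
        h = hw[sbox[plaintexts ^ k]].astype(float)
        h -= h.mean()
        corr[k] = (h @ X) / (np.linalg.norm(h) * np.linalg.norm(X, axis=0))
    return corr   # best key guess: np.argmax(np.max(np.abs(corr), axis=1))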

Side-channel attacks on microcontroller data busses (continued)

Mutual Information Analysis:
- the correct candidate key k will cause the highest mutual information between (some function f of) v_k(p) and some sample positions in the recorded leakage traces

Side-channel attacks on microcontroller data busses (continued)

Template Attack [Chari et al., 2003]:
- profiling phase: build a Gaussian multivariate model (pdf) of the leakage trace for each result byte v; requires access to a test chip/mode where k, and hence v, is known
- attack phase: find the maximum-likelihood candidate key k given n_a leakage traces x_{p_1}, x_{p_2}, ..., x_{p_{n_a}} and associated inputs p, using the probability density function f(x_p | v) built during the profiling phase

Side-channel attacks on microcontroller data busses (continued)

Stochastic Model [Schindler et al., 2005]:
- profiled, like the template attack, but rather than building a pdf for each possible value v, model the leakage trace of v as a linear combination of traces for its individual bits (or pairs of bits)
- shorter profiling phase due to the reduced number of parameters to be estimated; more practical for 16-bit busses
- can be less accurate than a full template attack, especially with small design sizes (more non-linear effects, capacitive coupling between bus traces, etc.)

Side-channel attacks on microcontroller data busses (continued)

Deep Learning:
- profiled attack that trains a neural network to classify traces according to v
- very compute intensive, very large number of parameters
- convolutional layers may learn to auto-align traces, whereas template attacks rely strongly on low-jitter alignment
- all magic

Objectives here:
- Use the template attack independent of any cryptographic algorithm (no known s-box, etc.).
- Directly eavesdrop on 8-bit parallel bus lines (or 32-bit busses that handle 8-bit data).
- Demonstration attack target: a single 8-bit load instruction (e.g., RAM to register) in a microcontroller.
- Example targets: data parsers handling secrets, string-processing functions, instruction fetch cycles, loading keys into cryptographic hardware, etc. ("sub-cryptographic algorithms"). Such code may still lack masking/hiding countermeasures.
- Much more demanding than DPA-style crypto attacks, as we now depend on all bits being distinguishable (rather than on cruder leakage models, such as Hamming weights).
- Signal pre-processing and dimensionality reduction to maximize the signal-to-noise ratio and reduce the number of parameters to estimate become crucial.

Template attack (basics, notation)

Hopefully identical hardware: profiling device, attacked device.

Goal: infer some secret value k ∈ S, processed by the attacked device at some point. For an 8-bit microcontroller: S = {0, ..., 255}.

Required: ability to sample supply-current or electro-magnetic waveforms ("raw leakage vectors" x^r ∈ R^{m^r}) at times {t_1, ..., t_{m^r}} during and near the point in time where k is processed.

Profiling phase: record n_p raw leakage vectors x^r_{k,i} ∈ R^{m^r} (1 ≤ i ≤ n_p) from the profiling device for each possible candidate value k ∈ S. Result: one raw leakage matrix X^r_k ∈ R^{n_p × m^r} for each k ∈ S, containing the (transposed) vectors x^r_{k,i} as rows.

Trace compression (basics, notation)

Raw leakage vectors x^r_{k,i} may contain m^r = hundreds or thousands of samples, due to the high sampling rates used. We may compress them before further processing, either by
- sample selection: keep only a subset of m ≪ m^r samples, or
- dimensionality reduction: Principal Component Analysis (PCA) or Fisher's Linear Discriminant Analysis (LDA).

Compressed leakage vectors: x^r_{k,i} ∈ R^{m^r} → x_{k,i} ∈ R^m. Combine these as rows into the compressed leakage matrix X_k ∈ R^{n_p × m}. Without any such compression step: X_k = X^r_k and m = m^r.

Template parameters (basics, notation)

Now use the compressed leakage matrices X_k to estimate, for each possible value k ∈ S:

Mean trace: x̄_k = (1/n_p) Σ_{i=1}^{n_p} x_{k,i}

Covariance matrix: S_k = (1/(n_p - 1)) Σ_{i=1}^{n_p} (x_{k,i} - x̄_k)(x_{k,i} - x̄_k)^T

Note: Σ_{i=1}^{n_p} (x_{k,i} - x̄_k)(x_{k,i} - x̄_k)^T = X̃_k^T X̃_k, where X̃_k is X_k with x̄_k subtracted from each row.

Side-channel leakage traces can generally be modelled well by a Gaussian multivariate distribution, meaning that x̄_k and S_k are sufficient statistics defining the underlying distribution (probability density function)

f(x | k) = (2π)^{-m/2} |S_k|^{-1/2} exp(-(1/2) (x - x̄_k)^T S_k^{-1} (x - x̄_k))
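A minimal sketch of this profiling step in NumPy, assuming the compressed profiling traces are available as a hypothetical dict X_by_k mapping each candidate value k to an (n_p, m) array:

import numpy as np

def estimate_templates(X_by_k):
    """Estimate the template parameters (mean vector x̄_k, covariance S_k) per value k."""
    means, covs = {}, {}
    for k, Xk in X_by_k.items():
        means[k] = Xk.mean(axis=0)                     # x̄_k
        Xc = Xk - means[k]                             # X̃_k: mean-free rows
        covs[k] = (Xc.T @ Xc) / (Xk.shape[0] - 1)      # S_k = X̃_k^T X̃_k / (n_p - 1)
    return means, covs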

Illustrative example

[Figure] Each dot represents a trace x (with just m = 2 samples, colour indicates k), red circles represent the mean traces x̄_k, red lines represent the eigenvectors of the covariance matrices S_k, and the green ellipses are equiprobability lines of f(x | k).

Attack phase (basics, notation)

Infer the secret value k ∈ S processed by the attacked device:
- Trigger repeated processing of k, n_a times.
- Use the same recording technique and compression method as in the profiling phase.
- Obtain n_a leakage vectors x_i ∈ R^m, stored in the leakage matrix X_k ∈ R^{n_a × m}.
- For each k ∈ S compute a discriminant score D(k | X_k).
- Finally try all k ∈ S on the attacked device, in order of decreasing score (optimized brute-force search, e.g. for a password or cryptographic key), until the correct k is found.

Discriminant function

Given a trace x_i from X_k, Bayes' rule suggests D(k | x_i) = f(x_i | k) P(k), or, if P(k) is independent of k (P(k) = |S|^{-1}), then D(k | x_i) = f(x_i | k).

The full Bayes likelihood is

L(k | x_i) = f(x_i | k) P(k) / Σ_{k'} f(x_i | k') P(k')

but we can omit here factors that are the same for each k and therefore do not affect the relative order of the discriminant scores.

With more than one measurement, assuming the noise is independent across repeated measurements, the joint likelihood over all attack traces x_i in X_k is

L(k | X_k) = Π_{x_i in X_k} L(k | x_i)

Is this a better discriminant than L(k | (1/n_a) Σ_{i=1}^{n_a} x_i), i.e. averaging all attack traces first before looking up a pdf? Yes, but ...

Numerical problems

So far so simple. But in practice the pdf

f(x | k) = (2π)^{-m/2} |S_k|^{-1/2} exp(-(1/2) (x - x̄_k)^T S_k^{-1} (x - x̄_k))

can easily cause numerical problems that require attention:
- S_k may not be invertible (|S_k| ≈ 0). In fact S_k cannot be invertible if n_p ≤ m: this is because S_k is essentially X̃_k^T X̃_k, and therefore X̃_k ∈ R^{n_p × m} and S_k ∈ R^{m × m} have the same rank.
- |S_k| may also overflow easily.
- e^x may overflow easily: IEEE double covers e^x only for x < 710, easily exceeded for large m.

Pooled covariance matrix

The template mean vectors x̄_k characterize the signal. The covariance matrices S_k characterize the noise. If the measured noise is independent of the signal, then the underlying covariances estimated by the S_k will be identical ("homoscedasticity"). We can then average the S_k into a single pooled covariance matrix:

S_pooled = (1/|S|) Σ_{k∈S} S_k

This has many advantages:
- better noise model (more data)
- relaxation of the necessary condition for S_pooled being invertible: m < |S| · n_p, or n_p > m / |S|
- enables compression with Linear Discriminant Analysis (LDA)
- enables faster and more stable discriminant functions

But: some side-channel countermeasures can result in data-dependent noise.
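A sketch of the pooling step, under the same assumptions as before (the dicts covs and X_by_k from the previous sketch are hypothetical):

import numpy as np

def pooled_covariance(covs):
    """Average the per-value covariance matrices S_k into S_pooled,
    assuming homoscedastic (value-independent) noise."""
    return sum(covs.values()) / len(covs)

def pooled_covariance_from_traces(X_by_k):
    """Equivalent pooling computed directly from the mean-free profiling traces."""
    num, denom = 0, 0
    for Xk in X_by_k.values():
        Xc = Xk - Xk.mean(axis=0)
        num = num + Xc.T @ Xc
        denom += Xk.shape[0] - 1
    return num / denom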

Illustrative example

[Figure] All |S| = 8 error ellipses are identically sized and orientated, and do not depend on k.

Compression: sample selection I

- keeping the dimension m of the multivariate pdf model small helps avoid numerical problems
- many samples in x^r_i will contain no data-dependent variation
- discarding too much information will reduce the success rate

Data-dependent variation is characterized by the between-groups vectors

τ_k = x̄^r_k - x̄^r, where x̄^r = (1/|S|) Σ_{k∈S} x̄^r_k.

Various per-sample signal-strength estimates have been proposed: Difference of Means (DOM), Sum of Squared Differences (SOSD), Signal-to-Noise Ratio (SNR) and SOST. Example:

s_DOM(t) = Σ_{1 ≤ k < k' ≤ |S|} |x̄^r_k(t) - x̄^r_{k'}(t)|
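An illustrative sketch of s_DOM and of the simple "keep the m strongest samples" selection; mean_by_k, a hypothetical dict of per-value raw mean traces x̄^r_k, is an assumption here:

import numpy as np

def dom_signal_strength(mean_by_k):
    """s_DOM(t): sum of absolute pairwise differences of the per-value mean traces."""
    M = np.stack(list(mean_by_k.values()))        # (|S|, m^r) matrix of mean traces
    s = np.zeros(M.shape[1])
    for a in range(M.shape[0]):
        for b in range(a + 1, M.shape[0]):
            s += np.abs(M[a] - M[b])
    return s

def select_samples(s, m):
    """Indices of the m samples with the highest signal strength."""
    return np.sort(np.argsort(s)[-m:])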

Compression: sample selection II

[Figure: normalized signal-strength estimates from DOM, SOSD and SNR on our reference data set (Grizzly Beta), plotted over clock cycles.]

Simplest techniques: take the m samples with the highest signal strength s(t), or all samples above some threshold. But these may all come from the same clock cycle and be highly correlated with each other (i.e., not say much new).

Alternative strategy: take a maximum number of samples (e.g., 1, 3, 20) from each clock cycle.

Covariance of the between-groups vectors

[Figure] The between-groups vectors τ_k = x̄^r_k - x̄^r are shown in blue.

Principal Component Analysis [Archambeau et al., 2006]

Sample between-groups matrix:

B = Σ_{k∈S} (x̄^r_k - x̄^r)(x̄^r_k - x̄^r)^T

Singular value decomposition: B = U D U^T
- each column of the orthonormal matrix U ∈ R^{m^r × m^r} is an eigenvector u_j of B
- the diagonal matrix D ∈ R^{m^r × m^r} contains the corresponding eigenvalues δ_j, with δ_1 ≥ δ_2 ≥ ... ≥ δ_{m^r}.

Only the first m < |S| eigenvectors (u_1 ... u_m) = U_m are needed to preserve most of the variability of the mean vectors x̄^r_k.

Compression step: X_k = X^r_k U_m. This projects each raw trace x^r_i in X^r_k onto just the m largest eigenvectors of B: x_i = x^r_i U_m.
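A sketch of this PCA compression, again assuming the hypothetical mean_by_k dict of raw mean traces; it uses a symmetric eigendecomposition (numpy.linalg.eigh) of B, which for this symmetric positive semidefinite matrix yields the same eigenvectors as the SVD on the slide:

import numpy as np

def pca_projection(mean_by_k, m):
    """Build the between-groups matrix B from the raw mean traces x̄^r_k and
    return the projection matrix U_m of its m largest eigenvectors."""
    M = np.stack(list(mean_by_k.values()))     # (|S|, m^r)
    D = M - M.mean(axis=0)                     # rows: x̄^r_k - x̄^r
    B = D.T @ D                                # between-groups matrix, (m^r, m^r)
    w, U = np.linalg.eigh(B)                   # eigenvalues in ascending order
    order = np.argsort(w)[::-1]
    return U[:, order[:m]]                     # keep the m leading eigenvectors

# Compression step: X_k_compressed = X_k_raw @ U_m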

PCA example: eigenvectors of B

[Figure: the first six eigenvectors u_1 ... u_6 of B.]

PCA example: eigenvalues of B

[Figure: eigenvalue spectrum of B.]

Linear discriminant analysis: maximising SNR

LDA uses two covariance matrices: B for the signal and S^r_pooled for the noise, and projects the x^r_i onto the largest eigenvectors of the signal-to-noise matrix (S^r_pooled)^{-1} B.

Linear discriminant analysis I [Standaert/Archambeau, 2008]

PCA finds directions where the signal is strong, to project onto, but ignores the noise. Fisher's LDA instead considers projections y_j = a_j^T x^r and finds directions a_j ∈ R^{m^r} that maximize

(between-groups variance) / (within-groups variance)
= Σ_{k∈S} (E(y_{j,k}) - E(y_j))^2 / Σ_{k∈S} Var(y_{j,k})
= Σ_{k∈S} (a_j^T (E(x^r_k) - E(x^r)))^2 / Σ_{k∈S} Var(a_j^T x^r_k)

which, up to a constant factor that does not affect the maximization, can be estimated from the profiling traces as

(a_j^T B a_j) / (a_j^T S^r_pooled a_j)

Linear discriminant analysis II

The coefficient vector a_j that maximises (a_j^T B a_j) / (a_j^T S^r_pooled a_j) is the first eigenvector (i.e., the one with the largest associated eigenvalue) of (S^r_pooled)^{-1} B. With the constraint Cov(y_{i,k}, y_{j,k}) = 0, the other a_j that maximise the above ratio are the eigenvectors with the next largest eigenvalues.

Note that (S^r_pooled)^{-1} B is not necessarily symmetric, so we cannot directly apply singular-value decomposition to obtain orthonormal eigenvectors. Instead, we can first compute the eigenvectors u_j of the symmetric matrix (S^r_pooled)^{-1/2} B (S^r_pooled)^{-1/2}, which has the same eigenvalues as (S^r_pooled)^{-1} B, and from which we can then obtain the coefficients a_j = (S^r_pooled)^{-1/2} u_j.

There are at most s = min(m^r, |S| - 1) non-zero eigenvalues, as that is the maximum number of independent linear combinations available in B.
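A sketch of the LDA projection computed via the symmetric matrix (S^r_pooled)^{-1/2} B (S^r_pooled)^{-1/2}, as described above; the inputs B and S_pooled are assumed to come from the earlier sketches, and S_pooled is assumed well conditioned (|S| · n_p > m^r):

import numpy as np

def lda_projection(B, S_pooled, m):
    """Fisher LDA directions: the m leading eigenvectors of S_pooled^{-1} B."""
    w, V = np.linalg.eigh(S_pooled)                      # S_pooled = V diag(w) V^T
    S_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T     # S_pooled^{-1/2}
    wB, U = np.linalg.eigh(S_inv_sqrt @ B @ S_inv_sqrt)  # same eigenvalues as S_pooled^{-1} B
    order = np.argsort(wB)[::-1][:m]
    A = S_inv_sqrt @ U[:, order]                         # columns a_j = S_pooled^{-1/2} u_j
    return A                                             # note: a_j^T S_pooled a_j = 1 by construction

# Compression: X_k_compressed = X_k_raw @ A (the pooled covariance then becomes ~identity)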

LDA example: eigenvectors of B

[Figure: the first six eigenvectors u_1 ... u_6 of B.]

LDA example: eigenvectors of S^r_pooled

[Figure: the first six eigenvectors u_1 ... u_6 of S^r_pooled.]

LDA example: eigenvectors of (S^r_pooled)^{-1} B

[Figure: the first six eigenvectors u_1 ... u_6 of (S^r_pooled)^{-1} B.]

Linear discriminant analysis III

Like with PCA, pick m such that the first m eigenvalues of (S^r_pooled)^{-1} B cover e.g. 95% of the sum of all eigenvalues. Let A = (a_1 ... a_m) be the matrix of the first m eigenvectors of (S^r_pooled)^{-1} B, then project each leakage matrix as X_k = X^r_k A.

LDA generally outperforms all other compression methods, but relies on homoscedasticity; therefore PCA remains useful where the noise is not easily characterized.

When we scale the coefficients a_j such that a_j^T S^r_pooled a_j = 1, the covariance in the discriminant function becomes the identity matrix, i.e. S_k = I, which greatly reduces computation and storage requirements.

After linear discriminant analysis

[Figure]

The log-likelihood discriminant

Recall the numerical problems with

f(x | k) = (2π)^{-m/2} |S_k|^{-1/2} exp(-(1/2) (x - x̄_k)^T S_k^{-1} (x - x̄_k))

Avoid overflowing e^x and |S_k| by using instead the log-likelihood

log f(x | k) = -(m/2) log 2π - (1/2) log|S_k| - (1/2) (x - x̄_k)^T S_k^{-1} (x - x̄_k)

Compute log|S_k| = 2 Σ_{i=1}^{m} log c_{ii} using the Cholesky decomposition S_k = C^T C. Since C is triangular, its determinant is the product of its diagonal elements c_{ii}.

Dropping the first term (constant across all k) gives us a robust discriminant based on the log-likelihood:

D_log(k | x_i) = -(1/2) log|S_k| - (1/2) (x_i - x̄_k)^T S_k^{-1} (x_i - x̄_k)
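A sketch of D_log for a single trace, using a Cholesky factorization as described above (NumPy's cholesky returns the lower-triangular factor, which is equivalent here for both the determinant and the quadratic form); mean_k and S_k are the hypothetical template parameters from the profiling sketch:

import numpy as np

def d_log(x, mean_k, S_k):
    """Log-likelihood discriminant for one trace x and candidate k
    (the constant term -(m/2) log 2*pi is dropped)."""
    C = np.linalg.cholesky(S_k)                 # S_k = C C^T, C lower-triangular
    logdet = 2.0 * np.sum(np.log(np.diag(C)))   # log|S_k|
    d = x - mean_k
    z = np.linalg.solve(C, d)                   # so that z^T z = d^T S_k^{-1} d
    return -0.5 * logdet - 0.5 * (z @ z)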

The linear discriminant

Using S_pooled, we can discard log|S_k| as well. This leaves the Mahalanobis distance

d²_M(x, x̄_k) = (x - x̄_k)^T S_pooled^{-1} (x - x̄_k) ≥ 0

to compare candidates k. (A covariance matrix is positive semidefinite.) Rewrite it as

d²_M(x, x̄_k) = x^T S_pooled^{-1} x - 2 x̄_k^T S_pooled^{-1} x + x̄_k^T S_pooled^{-1} x̄_k

and drop the first term (constant for all candidates k) to obtain a discriminant that depends linearly on x_i:

D_linear(k | x_i) = x̄_k^T S_pooled^{-1} x_i - (1/2) x̄_k^T S_pooled^{-1} x̄_k

Joint discriminants

Recall that to combine n_a attack traces (essential for the success of many side-channel attacks), we need to compute a discriminant based on their joint likelihood

L(k | X_k) = Π_{i=1}^{n_a} L(k | x_i)   or   log L(k | X_k) = Σ_{x_i in X_k} log L(k | x_i)

This costs O(n_a · m²) for

D_log(k | X_k) = -(n_a/2) log|S_k| - (1/2) Σ_{i=1}^{n_a} (x_i - x̄_k)^T S_k^{-1} (x_i - x̄_k)

but only O(n_a · m + m²) for

D_linear(k | X_k) = x̄_k^T S_pooled^{-1} (Σ_{i=1}^{n_a} x_i) - (n_a/2) x̄_k^T S_pooled^{-1} x̄_k

since x̄_k^T S_pooled^{-1} and x̄_k^T S_pooled^{-1} x̄_k only need to be computed once.

Practical evaluation example: D_log ≈ 3.5 days, D_linear ≈ 30 min!
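A sketch of the joint linear discriminant with the per-candidate terms precomputed once, as suggested above; means and S_pooled are the hypothetical profiling outputs from earlier sketches, and X_attack an (n_a, m) array of attack traces:

import numpy as np

def precompute_linear(means, S_pooled):
    """Per-candidate constants: w_k = S_pooled^{-1} x̄_k and c_k = x̄_k^T S_pooled^{-1} x̄_k."""
    Sinv = np.linalg.inv(S_pooled)
    w = {k: Sinv @ mk for k, mk in means.items()}
    c = {k: mk @ w[k] for k, mk in means.items()}
    return w, c

def d_linear_joint(X_attack, w, c):
    """Joint linear discriminant over all n_a attack traces (rows of X_attack)."""
    n_a = X_attack.shape[0]
    x_sum = X_attack.sum(axis=0)            # Σ_i x_i, computed once
    return {k: w[k] @ x_sum - 0.5 * n_a * c[k] for k in w}

# Candidate ranking: sorted(scores, key=scores.get, reverse=True)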

Example: comparison of different compression methods

Our test dataset Grizzly (available online):
- Atmel XMEGA 256 A3U processor
- 10 ohm resistor in the ground line, powered from a 3.3 V battery via a voltage regulator
- 1 MHz sine-wave clock
- 250 MHz sampling frequency, 8-bit samples
- 3072 traces for each byte value, m^r = 2500 samples per trace
- sequence of LOAD instructions, where only one handles k; all others handle the constant value zero

Guessing entropy: binary logarithm of the rank of the correct k in the list of k values sorted by decreasing discriminant score, averaged over 10 attacks.

Sample selections: 1 sample/clock (1ppc, m = 8), 3 samples/clock (3ppc, m = 25), 20 samples/clock (20ppc, m = 77) and allap (m = 125; all selected samples above the 95th percentile of s(t)).
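An illustrative computation of this guessing-entropy metric; the score_lists structure is hypothetical, and the averaging convention (mean of the log2 ranks rather than log2 of the mean rank) is an assumption here:

import numpy as np

def guessing_entropy(score_lists, correct_k):
    """score_lists: list of dicts {k: discriminant score}, one per independent attack.
    Returns the average log2 rank of the correct candidate (rank 1 -> 0 bits)."""
    log_ranks = []
    for scores in score_lists:
        ranking = sorted(scores, key=scores.get, reverse=True)
        rank = ranking.index(correct_k) + 1
        log_ranks.append(np.log2(rank))
    return float(np.mean(log_ranks))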

[Figure: guessing entropy (bits) vs. n_a (log axis), for n_p = 200 and n_p = 2000; top row: individual S_k with D_log, bottom row: S_pooled with D_linear; curves for LDA (m = 4), PCA (m = 4) and sample selections 1ppc, 3ppc, 20ppc, allap.]

Attacks on AES software/hardware implementations

[Figure: guessing entropy (bits) vs. n_a (log axis); curves for LDA, PCA and sample selections 1ppc, 3ppc, 20ppc, allap.]

Left: guessing entropy after a template attack on the Grizzly dataset in an AES S-box scenario (simulated). A DPA-style attack on AES is much easier than direct eavesdropping of a single LOAD instruction.

Right: template attack on an AES engine (Polar dataset). The software implementation is much easier to attack than the hardware implementation.

Attacks on different devices

[Figure] Four XMEGA PCB devices used in our experiments.

Classic template attacks in different scenarios

[Figure: guessing entropy (bits) vs. n_a (log axis); curves for LDA (m = 4), PCA (m = 4) and sample selections 1ppc, 3ppc, 20ppc, allap.]

Left: using device Alpha for profiling and device Beta for the attack. Right: using the same device (Beta) but different acquisition campaigns for profiling (Beta) and attack (Beta Bis).

All compression techniques (except for LDA!) failed badly across different devices, or even across different campaigns on the same device.

Major cause: DC drift across devices, boards, campaigns

[Figure, two panels (mA): single trace from Beta; curves for Alpha, Beta, Beta bis, Gamma, Delta, Beta ± confidence interval, SNR of Beta.]

Top: trace from Beta (first clock cycle of the target LOAD). Bottom: overall mean vectors x̄^r for all campaigns minus the overall mean vector of Beta.

LDA gets this: S^r_pooled (noise) has a DC eigenvector

[Figure: the first six eigenvectors u_1 ... u_6 of S^r_pooled.]

No major incompatibility of the underlying leakage model

[Figure] Normal distributions (mA) at sample index j = 884, based on the template parameters (x̄^r_k, S^r_pooled) for k ∈ {0, ..., 9}, on Alpha (left) and Beta (right).

Template attacks are very sensitive to changes in DC bias

Changes in DC bias can also happen within a single campaign (e.g. due to temperature changes). This causes a DC eigenvector to emerge in S^r_pooled, which LDA utilizes to ignore DC drift as noise.

Workarounds:
- Use different devices during profiling campaigns.
- Allow temperature variation during profiling campaigns (can also affect switching thresholds).
- Use LDA.
- Where LDA is not applicable: use PCA with random DC offsets added to the mean vectors before calculating B, to push most of the DC signal into a single eigenvector and keep the rest DC-free (see the sketch after this list).
- Apply a DC-block filter: this happens already automatically if EM sensors or other high-pass filters are used. However, this can also significantly increase noise, by spreading nearby variability via the filter impulse response.
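A sketch of the random-DC-offset variant of the PCA compression mentioned above; the offset scale dc_scale and the mean_by_k dict are illustrative assumptions:

import numpy as np

def pca_projection_dc_robust(mean_by_k, m, dc_scale=1.0, seed=0):
    """PCA compression with a random constant offset added to each mean vector
    before computing B, pushing most DC (drift) signal into one eigenvector."""
    rng = np.random.default_rng(seed)
    M = np.stack(list(mean_by_k.values()))                     # (|S|, m^r)
    M = M + dc_scale * rng.standard_normal((M.shape[0], 1))    # same offset for every sample of a trace
    D = M - M.mean(axis=0)
    B = D.T @ D
    w, U = np.linalg.eigh(B)
    order = np.argsort(w)[::-1]
    return U[:, order[:m]]   # keep or drop the DC eigenvector depending on the attack scenario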

Profiling on Alpha, attack on Beta

[Figure: guessing entropy (bits) vs. n_a (log axis); curves for LDA and PCA with various m, and sample selections 1ppc, 3ppc, 20ppc, allap.]

Left: using various compressions with the classic method; DC eigenvector of B: j = 5. Right: using PCA and LDA after adding a random DC offset; DC eigenvector of B: j = 1.

PCA benefits from including the DC eigenvector in the projection, LDA does not.
