An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8))


Tung Chou
January 5, 2012

QUAD
The Provably-secure QUAD(q, n, r) Stream Cipher
Proposed by Berbain, Gilbert, and Patarin at Eurocrypt 2006.
Stream cipher. Security relies on MQ (Multivariate Quadratics).
P_i's, Q_j's: randomly chosen, public quadratic polynomials.
State: n-tuple x = (x_1, x_2, ..., x_n) ∈ F_q^n
Output: r-tuple (P_1(x), P_2(x), ..., P_r(x))
Update: x ← (Q_1(x), Q_2(x), ..., Q_n(x))
With multivariate quadratic systems P, Q, generate the key stream y_0, y_1, y_2, ...:
  x_0, x_1 = Q(x_0), x_2 = Q(x_1), x_3 = Q(x_2), ...
  y_0 = P(x_0), y_1 = P(x_1), y_2 = P(x_2), y_3 = P(x_3), ...
Simply speaking, QUAD is polynomial evaluations.
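The output and update rules of QUAD can be sketched in a few lines of Python. This is only an illustration of the iteration y_t = P(x_t), x_{t+1} = Q(x_t); the toy parameters (n = r = 8), the dense coefficient layout, and the helper names `random_quadratic`, `evaluate`, and `quad_keystream` are assumptions made for the example, not part of the slides.

```python
import random

def random_quadratic(n, q, rng):
    """Random quadratic polynomial over F_q: (quadratic coeffs, linear coeffs, constant)."""
    quad = {(i, j): rng.randrange(q) for i in range(n) for j in range(i, n)}
    lin = [rng.randrange(q) for _ in range(n)]
    return quad, lin, rng.randrange(q)

def evaluate(poly, x, q):
    """Evaluate one quadratic polynomial at the state x, reducing mod q."""
    quad, lin, const = poly
    acc = const
    for (i, j), c in quad.items():
        acc += c * x[i] * x[j]
    for i, c in enumerate(lin):
        acc += c * x[i]
    return acc % q

def quad_keystream(P, Q, x0, q, rounds):
    """Return [P(x_0), P(x_1), ...] where x_{t+1} = Q(x_t)."""
    x = list(x0)
    out = []
    for _ in range(rounds):
        out.append([evaluate(p, x, q) for p in P])  # output: y_t = P(x_t)
        x = [evaluate(p, x, q) for p in Q]          # update: x_{t+1} = Q(x_t)
    return out

# Toy parameters (hypothetical; far too small to be secure).
rng = random.Random(2012)
q, n, r = 31, 8, 8
P = [random_quadratic(n, q, rng) for _ in range(r)]
Q = [random_quadratic(n, q, rng) for _ in range(n)]
x0 = [rng.randrange(q) for _ in range(n)]
stream = quad_keystream(P, Q, x0, q, rounds=4)
```

Each round costs r + n polynomial evaluations, which is why the rest of the talk concentrates on making a single evaluation cheap.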

SPELT
Security relies on SMP; i.e., P, Q are sparse.
Usually of higher degree than QUAD.
Example: SPELT(31, 4, 96, 96, (32, 16, 8)):
  Field: F_31, Degree: 4, #Variables: 96, #Equations: 96 (for each of P, Q)
  Each equation has only 32 degree-2 terms, 16 degree-3 terms, and 8 degree-4 terms.
More efficient than QUAD in practice.
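To make the parameter set concrete, the sketch below builds one random sparse equation with the stated term counts and evaluates it over F_31 while counting multiplications. It assumes that every one of the 96 linear terms is present (which matches the 96 in the multiplication count given later in the talk), that there is no constant term, and that a variable may repeat inside a term; the helper names are hypothetical.

```python
import random

Q_FIELD = 31                             # field size
N_VARS = 96                              # number of variables
TERMS_PER_DEGREE = {2: 32, 3: 16, 8 // 2: 8}  # degree -> number of terms: (32, 16, 8)

def random_sparse_equation(rng):
    """Return a list of (coefficient, variable-index tuple) terms."""
    # Assumption: one linear term per variable, all with nonzero coefficients.
    terms = [(rng.randrange(1, Q_FIELD), (i,)) for i in range(N_VARS)]
    for degree, count in TERMS_PER_DEGREE.items():
        for _ in range(count):
            idx = tuple(rng.randrange(N_VARS) for _ in range(degree))
            terms.append((rng.randrange(1, Q_FIELD), idx))
    return terms

def evaluate(terms, x):
    """Evaluate term by term; a degree-d term costs d multiplications."""
    acc = 0
    mults = 0
    for coeff, idx in terms:
        prod = coeff
        for i in idx:
            prod = prod * x[i] % Q_FIELD
            mults += 1
        acc = (acc + prod) % Q_FIELD
    return acc, mults

rng = random.Random(31)
equation = random_sparse_equation(rng)
x = [rng.randrange(Q_FIELD) for _ in range(N_VARS)]
value, mults = evaluate(equation, x)
```

With this term-by-term evaluation, one equation has 152 terms and costs 96 + 32 × 2 + 16 × 3 + 8 × 4 = 240 multiplications.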

Implementation Platform: GTX480
15 × 32 = 480 SPs (cores) running at 1.4 GHz (32 SPs in each MP).
Each MP has 16 KB of L1 cache and 48 KB of shared memory (the two sizes can be swapped).
Each MP has 32K registers.
Shared memory has 32 banks.
The maximal number of registers assigned to each thread is 64.

Implementation Details
Threads in a warp deal with the same equation(s) but different sets of x_i's. In other words, each block generates 32 key streams at the same time.
The information for each term is written directly into the instructions.
Values of x_i are stored in shared memory. We need 96 × 32 bytes; this is padded to 100 × 32 bytes to avoid bank conflicts.
There are two buffers in shared memory, serving as source and destination.
The results of the last 96 equations (Q) are written to global memory.
DIMGRID = 30, DIMBLOCK = 512. Each block therefore has 512/32 = 16 warps, so each warp has to deal with 192/16 = 12 equations.
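The work partition and shared-memory footprint follow from the launch parameters by simple arithmetic, sketched below. The warp size of 32 and one byte per field element are taken from the slides; the total of 960 key streams is derived here from "32 key streams per block" and is not stated in the talk.

```python
WARP_SIZE = 32          # threads per warp
DIMGRID = 30            # blocks in the grid
DIMBLOCK = 512          # threads per block
N_VARS = 96             # state size
N_EQUATIONS = 2 * 96    # 96 equations for P plus 96 for Q
PADDED_VARS = 100       # rows after padding against bank conflicts

warps_per_block = DIMBLOCK // WARP_SIZE             # 16
equations_per_warp = N_EQUATIONS // warps_per_block  # 12
streams_per_block = WARP_SIZE                       # one key stream per lane
total_streams = DIMGRID * streams_per_block         # derived, not from the slides

unpadded_buffer_bytes = N_VARS * WARP_SIZE          # 96 x 32 bytes
buffer_bytes = PADDED_VARS * WARP_SIZE              # 100 x 32 bytes
shared_bytes_per_block = 2 * buffer_bytes           # source + destination buffers

print(warps_per_block, equations_per_warp, total_streams, shared_bytes_per_block)
```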

Experiment Results
Each block uses 6.4 KB of shared memory (two 100 × 32-byte buffers). Each thread uses 32 registers, so a 512-thread block needs 16K of the 32K registers in an MP. Therefore each MP should be able to run two blocks (32 warps) simultaneously.
Performance: 1.38 Gbps.
Good news: better than the previous result of 0.91 Gbps.
Bad news: peak performance should be 6.99 Gbps if we consider multiplications only.
Mysterious behaviors of nvcc make it hard to find the bottleneck.
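The "two blocks per MP" estimate can be checked with a rough occupancy calculation. The sketch assumes the shared-memory footprint is the two 100 × 32-byte buffers described under Implementation Details (6,400 bytes per block) and considers only the register and shared-memory limits; other hardware limits on resident threads or blocks are ignored.

```python
REGS_PER_MP = 32 * 1024        # 32K registers per MP
SHARED_PER_MP = 48 * 1024      # 48 KB shared-memory configuration
NUM_MPS = 15
WARP_SIZE = 32

THREADS_PER_BLOCK = 512
REGS_PER_THREAD = 32
SHARED_PER_BLOCK = 2 * 100 * 32  # assumption: two padded buffers, 1 byte per value

regs_per_block = THREADS_PER_BLOCK * REGS_PER_THREAD   # 16384
blocks_by_regs = REGS_PER_MP // regs_per_block          # limit from registers
blocks_by_shared = SHARED_PER_MP // SHARED_PER_BLOCK    # limit from shared memory
blocks_per_mp = min(blocks_by_regs, blocks_by_shared)
warps_per_mp = blocks_per_mp * THREADS_PER_BLOCK // WARP_SIZE
resident_blocks = blocks_per_mp * NUM_MPS

print(blocks_by_regs, blocks_by_shared, blocks_per_mp, warps_per_mp, resident_blocks)
```

Under these assumptions registers are the binding limit, and 15 MPs × 2 blocks matches DIMGRID = 30, so the whole grid can be resident at once.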

Tweaks to Accelerate the Evaluations
Total number of mults per equation: 96 + 32 × 2 + 16 × 3 + 8 × 4 = 240.
Classifying terms by x_i's: 7 x_0 x_1 x_4 + 29 x_1 → x_1 (7 x_0 x_4 + 29)
  Saving at least 32 + 16 + 8 = 56 mults.
Classifying terms by coefficients: 14 x_0 x_1 + 14 x_3 x_9 → 14 (x_0 x_1 + x_3 x_9)
  Saving at least (32 + 16 + 8) + (96 − 30) = 122 mults.
A mixed approach
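Both rewrites can be checked numerically over F_31, together with the savings arithmetic. The reading of the second bound used here is an interpretation: grouping by coefficient replaces one coefficient multiplication per term (152 in total) with at most one per distinct nonzero coefficient (30 in F_31). The function names are hypothetical.

```python
import random

Q_FIELD = 31

def by_variable_naive(x0, x1, x4):
    return (7 * x0 * x1 * x4 + 29 * x1) % Q_FIELD      # 3 + 1 = 4 mults

def by_variable_factored(x0, x1, x4):
    return (x1 * (7 * x0 * x4 + 29)) % Q_FIELD          # 3 mults

def by_coeff_naive(x0, x1, x3, x9):
    return (14 * x0 * x1 + 14 * x3 * x9) % Q_FIELD      # 2 + 2 = 4 mults

def by_coeff_factored(x0, x1, x3, x9):
    return (14 * (x0 * x1 + x3 * x9)) % Q_FIELD         # 3 mults

# Spot-check both identities on random field elements.
rng = random.Random(0)
identities_hold = True
for _ in range(1000):
    a, b, c, d = (rng.randrange(Q_FIELD) for _ in range(4))
    identities_hold &= by_variable_naive(a, b, c) == by_variable_factored(a, b, c)
    identities_hold &= by_coeff_naive(a, b, c, d) == by_coeff_factored(a, b, c, d)

# Savings arithmetic from the slide.
total_mults = 96 + 32 * 2 + 16 * 3 + 8 * 4
saving_by_variable = 32 + 16 + 8             # one mult per non-linear term
coefficient_mults = 96 + 32 + 16 + 8         # one per term when evaluated naively
nonzero_coefficients = Q_FIELD - 1           # 30 possible nonzero values
saving_by_coefficient = coefficient_mults - nonzero_coefficients

print(identities_hold, total_mults, saving_by_variable, saving_by_coefficient)
```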

Future Works
asfermi: An assembler for the NVIDIA Fermi Instruction Set, http://code.google.com/p/asfermi/
AMD GPUs