FPGA Implementation of a Predictive Controller
SIAM Conference on Optimization 2011, Darmstadt, Germany
Minisymposium on Embedded Optimization
Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan
May 18, 2011
Contents
- MPC Problem Formulation
- Field Programmable Gate Array (FPGA)
- Algorithms for Quadratic Programming
- Implementation Details
- Results
- Related Work
Optimal control problem

$$\min_{\theta}\; x_N^T Q x_N + \sum_{k=0}^{N-1} \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q & S \\ S^T & R \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \qquad (1)$$

subject to

$$x_0 = x, \qquad (2a)$$
$$x_{k+1} = A x_k + B u_k \quad \text{for } k = 0, 1, \ldots, N-1, \qquad (2b)$$
$$J x_k + E u_k \le d \quad \text{for } k = 0, 1, \ldots, N-1, \qquad (2c)$$

with $x_k \in \mathbb{R}^n$, $u_k \in \mathbb{R}^m$, and $x$ the current state.

Goal: accelerate the computation of the optimal value $\theta$ so that MPC can be implemented at faster sampling rates.
Quadratic Programming Formulation

$$\min_{\theta}\; \tfrac{1}{2}\theta^T H \theta \quad \text{subject to} \quad F\theta = f,\; G\theta \le g,$$

where

$$\theta := [x_0^T\; u_0^T\; x_1^T\; u_1^T\; x_2^T\; u_2^T\; \cdots\; x_{N-1}^T\; u_{N-1}^T\; x_N^T]^T \in \mathbb{R}^{N(n+m)+n},$$

$$H := \begin{bmatrix} I_N \otimes \begin{bmatrix} Q & S \\ S^T & R \end{bmatrix} & 0 \\ 0 & Q \end{bmatrix}, \qquad
F := \begin{bmatrix} I_n & & & & \\ A & B & -I_n & & \\ & & \ddots & \ddots & \\ & & A \;\; B & & -I_n \end{bmatrix}, \qquad
f := \begin{bmatrix} x \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$

$$G := \begin{bmatrix} I_N \otimes [J\; E] & 0 \end{bmatrix}, \qquad g := \mathbf{1}_N \otimes d.$$
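The definitions above can be sanity-checked in software. A minimal NumPy sketch of the assembly follows; the function name `build_qp` and the exact sign convention in `F` (chosen so that $x_0 = x$ and $x_{k+1} = Ax_k + Bu_k$) are illustrative assumptions, not taken from the talk:

```python
import numpy as np
from scipy.linalg import block_diag

def build_qp(A, B, Q, R, S, J, E, d, N, x):
    """Illustrative assembly of the condensed QP data (H, F, f, G, g)."""
    n, m = B.shape
    nz = N * (n + m) + n                      # size of theta
    stage = np.block([[Q, S], [S.T, R]])      # per-stage Hessian block
    H = block_diag(*([stage] * N + [Q]))      # terminal block is Q, as on the slide
    # Equality constraints F theta = f: initial condition and dynamics
    F = np.zeros(((N + 1) * n, nz))
    f = np.concatenate([x, np.zeros(N * n)])
    F[:n, :n] = np.eye(n)                     # x_0 = x
    for k in range(N):
        r, c = (k + 1) * n, k * (n + m)
        F[r:r+n, c:c+n] = A                   # A x_k
        F[r:r+n, c+n:c+n+m] = B               # + B u_k
        F[r:r+n, c+n+m:c+2*n+m] = -np.eye(n)  # - x_{k+1} = 0
    # Inequality constraints G theta <= g: one [J E] block per stage
    G = np.hstack([block_diag(*([np.hstack([J, E])] * N)),
                   np.zeros((N * d.size, n))])
    g = np.tile(d, N)
    return H, F, f, G, g
```

Any state/input trajectory that satisfies the dynamics should satisfy $F\theta = f$ exactly, which gives a cheap consistency check on the assembly.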
What is an FPGA?
- Reconfigurable logic blocks
- Reconfigurable interconnect
- Other reconfigurable hard blocks: on-chip memories, embedded multipliers

Advantages for embedded real-time applications
- Deterministic execution time
- Computational/energy efficiency
- Much reduced low-volume cost compared to an ASIC

Disadvantages
- Clock frequency below 350 MHz
- Hardware design process
Is MPC suitable for FPGA computation?
- Parallelisation opportunities: Level-2 BLAS operations
- Deep pipelining is necessary to maintain a high clock frequency
Is MPC suitable for FPGA computation?
- Cycle-accurate completion guarantee: no jitter
- Compute-bound application: $O((n+m)^3)$ compute operations versus $O(n+m)$ I/O operations
- Fixed-point computation is faster and uses fewer resources
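The fixed-point point can be made concrete in simulation before committing to hardware. A small sketch of quantising to a fixed-point grid (the helper name `to_fixed` and the word-length choices are illustrative assumptions, not the talk's actual datapath format):

```python
import numpy as np

def to_fixed(x, frac_bits=24):
    """Quantise to a signed fixed-point grid with frac_bits fractional bits
    (round to nearest), modelling one candidate FPGA datapath word format."""
    scale = float(1 << frac_bits)
    return np.round(np.asarray(x, dtype=float) * scale) / scale
```

The quantisation error is bounded by half an LSB, i.e. $2^{-(f+1)}$ for $f$ fractional bits, which is the kind of bound one trades against resource usage when choosing word lengths.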
Algorithms for Quadratic Programming

Active-set methods
- Worst-case exponential complexity
- Varying matrix structure

Interior-point methods
- Polynomial complexity
- Predictable matrix structure
- S. Mehrotra: solves two systems of linear equations every iteration
- S. Wright [1]: solves one system of linear equations

[1] Applying new optimization algorithms to model predictive control. In Proc. Int. Conf. Chemical Process Control, Jan 1996.
Why iterative linear solvers?
- Small number of division operations
- Matrix-vector multiplications are easy to parallelise
- Trade-off between computation time and accuracy
- Conserve matrix structure (no fill-in), which allows exploiting fine structure to reduce memory requirements

Examples
- Conjugate Gradient (CG) for symmetric positive definite (SPD) matrices
- Minimum Residual (MINRES) for symmetric indefinite matrices
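A minimal conjugate-gradient loop illustrates why these methods suit hardware: apart from a couple of scalar divisions per iteration, the only interaction with $A$ is a matrix-vector product, which parallelises well and never modifies $A$ (no fill-in). This is a textbook CG sketch, not the talk's implementation:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve Ax = b for symmetric positive definite A without factorising A.
    The only use of A is the product A @ p."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)        # one division per iteration
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:    # accuracy vs. iteration-count trade-off
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

The `tol` parameter exposes the computation-time/accuracy trade-off mentioned above: fewer iterations, looser solution.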
Infeasible Primal-Dual Interior-Point algorithm

Initialization: $(\theta_0, \nu_0, \lambda_0, s_0)$ with $[\lambda_0^T\; s_0^T]^T > 0$

for $k = 0$ to $I_{IP} - 1$ do

Linearization (with scaling matrix $W_k := S_k^{-1}\Lambda_k$, where $S_k := \operatorname{diag}(s_k)$ and $\Lambda_k := \operatorname{diag}(\lambda_k)$):
$$A_k := \begin{bmatrix} H + G^T W_k G & F^T \\ F & 0 \end{bmatrix}, \qquad
b_k := \begin{bmatrix} -(H + G^T W_k G)\theta_k - F^T \nu_k - G^T(\lambda_k - W_k g + \sigma\mu s_k^{-1}) \\ -F\theta_k + f \end{bmatrix}$$

Solve $A_k \Delta z_k = b_k$ for $\Delta z_k =: \begin{bmatrix} \Delta\theta_k \\ \Delta\nu_k \end{bmatrix}$

Compute
$$\Delta\lambda_k := W_k\bigl(G(\theta_k + \Delta\theta_k) - g\bigr) + \sigma\mu s_k^{-1}, \qquad
\Delta s_k := W_k^{-1}\bigl(\sigma\mu s_k^{-1} - \lambda_k - \Delta\lambda_k\bigr)$$

Line search
$$\alpha_k := \max\left\{\alpha \in (0, 1] : \begin{bmatrix} \lambda_k + \alpha\Delta\lambda_k \\ s_k + \alpha\Delta s_k \end{bmatrix} > 0 \right\}$$

Update $(\theta_{k+1}, \nu_{k+1}, \lambda_{k+1}, s_{k+1}) := (\theta_k, \nu_k, \lambda_k, s_k) + \alpha_k(\Delta\theta_k, \Delta\nu_k, \Delta\lambda_k, \Delta s_k)$

end for
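The line-search step of the algorithm reduces to a simple ratio test over the negative components of the search direction. A sketch (the function name and the fraction-to-the-boundary factor `tau` are illustrative; the slide's rule corresponds to taking the step right up to strict positivity):

```python
import numpy as np

def step_length(lam, dlam, s, ds, tau=0.995):
    """Largest alpha in (0, 1] keeping lam + alpha*dlam > 0 and
    s + alpha*ds > 0, backed off by the factor tau to stay strictly
    inside the positive orthant."""
    alpha = 1.0
    for v, dv in ((lam, dlam), (s, ds)):
        neg = dv < 0                      # only shrinking components bind
        if np.any(neg):
            alpha = min(alpha, tau * np.min(v[neg] / -dv[neg]))
    return alpha
```

Only divisions over the binding components are needed, which is consistent with the goal of keeping the division count low in hardware.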
Coefficient Matrix $A_k$

After variable re-ordering (interleaving the multiplier, state and input blocks along the horizon), $A_k$ becomes block banded:

$$A_k = \begin{bmatrix}
 & I & & & & & & \\
I & Q_0 & S & A^T & & & & \\
 & S^T & R_0 & B^T & & & & \\
 & A & B & & -I & & & \\
 & & & -I & Q_1 & S & A^T & \\
 & & & & S^T & R_1 & B^T & \\
 & & & & A & B & \ddots & \\
 & & & & & & \ddots & -I \\
 & & & & & & -I & Q_N
\end{bmatrix}$$

where $Q_k$, $R_k$ denote the stage blocks of $H + G^T W_k G$.

Properties:
- Banded, symmetric, indefinite
- Size $Z := N(2n+m) + 2n$
- Half-band $M := 2n+m$
Matrix storage

[Figure: sparsity pattern of the banded symmetric matrix and its column-wise compressed diagonal storage (CDS) layout]

- Columns of the symmetric CDS matrix are stored in separate on-chip memories
- In-band zeros and ones do not need to be stored
- Constant columns consist of repeated blocks and are constant for all problems being solved simultaneously
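The storage scheme can be prototyped in software: in compressed diagonal storage each column holds one diagonal of the band, and the matrix-vector product touches only those columns, so there is no fill-in and each row needs at most $2M-1$ multiply-accumulates. The function names below are illustrative, not from the talk:

```python
import numpy as np

def to_cds(A, half_bw):
    """Store a banded symmetric matrix by diagonals: CDS column j holds
    the diagonal at offset j - half_bw, padded with zeros at the edges."""
    Z = A.shape[0]
    cds = np.zeros((Z, 2 * half_bw + 1))
    for j in range(-half_bw, half_bw + 1):
        for i in range(Z):
            if 0 <= i + j < Z:
                cds[i, j + half_bw] = A[i, i + j]
    return cds

def cds_matvec(cds, x):
    """y = A x using only the stored diagonals (no fill-in; the per-diagonal
    products are independent and map onto parallel multipliers)."""
    Z, W = cds.shape
    half_bw = (W - 1) // 2
    y = np.zeros(Z)
    for j in range(-half_bw, half_bw + 1):
        lo, hi = max(0, -j), min(Z, Z - j)
        y[lo:hi] += cds[lo:hi, j + half_bw] * x[lo + j:hi + j]
    return y
```

The dense matrix needs $Z^2$ words, the CDS form only $Z(2M+1)$, before even exploiting the repeated and constant in-band blocks mentioned above.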
Reduction in storage requirements

[Figure: storage savings obtained by exploiting the banded structure]
MINRES implementation

Hardware architecture for computing $Aq_i$: the $M$ stored matrix columns (RAM column 1, ..., RAM column $M$) are aligned with the vector through delay lines $z^{-(M-1)}, z^{-(M-2)}, \ldots$, multiplied element-wise ($\times_1, \times_2, \ldots, \times_{2M-1}$), and summed in an adder tree of depth $\lceil\log_2(2M-1)\rceil$.

$$\text{latency} = 2Z + M + k_1\lceil\log_2(2M-1)\rceil + k_2, \qquad \text{throughput} = Z,$$
$$\#\text{problems} = \left\lceil\frac{2Z + M + k_1\lceil\log_2(2M-1)\rceil + k_2}{Z}\right\rceil \ge 3.$$

MINRES (Lanczos) iteration:

$q_1 = b$, $\beta_1 = \|q_1\|_2$
for $i = 1$ to $I_{MR}$ do
  $q_i = q_i/\beta_i$
  $z = Aq_i$
  $\alpha = q_i^T z$
  $q_{i+1} = z - \alpha q_i - \beta_i q_{i-1}$
  $\beta_{i+1} = \|q_{i+1}\|_2$
  $\gamma_{i+1} = \delta/\rho_1$
  $\sigma_{i+1} = \beta_{i+1}/\rho_1$
  $w_i = (q_i - \rho_3 w_{i-2} - \rho_2 w_{i-1})/\rho_1$
  $x_i = x_{i-1} + \gamma_{i+1}\,\eta\, w_i$
  $\eta = \sigma_{i+1}\eta$
end for

($\delta$, $\rho_1$, $\rho_2$, $\rho_3$ are the scalars from the Givens QR update of the Lanczos tridiagonal matrix)
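As a software cross-check, the same Lanczos-based recurrence is available off the shelf in SciPy, and works on exactly the class of systems targeted here: symmetric, indefinite, banded. The matrix below is a stand-in chosen to be provably nonsingular (diagonally dominant with mixed-sign diagonal), not the talk's actual $A_k$:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import minres

Z = 50
# Symmetric indefinite tridiagonal matrix: alternating +/-2 on the diagonal,
# 0.5 off-diagonals => diagonally dominant, hence nonsingular.
main = np.where(np.arange(Z) % 2 == 0, 2.0, -2.0)
A = diags([0.5 * np.ones(Z - 1), main, 0.5 * np.ones(Z - 1)], [-1, 0, 1])
b = np.ones(Z)
z, info = minres(A, b)   # info == 0 signals convergence
```

Note that CG would not be appropriate here: it requires positive definiteness, which the saddle-point matrix $A_k$ does not have.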
QP solver design overview

- Maximise throughput: $\text{latency}_{IP} = 2\,\text{latency}_{\text{Stage 2}}$ (solves $2 \cdot \#\text{problems}$ problems). For large problems, a sequential implementation of Stage 1 is sufficient, since $\text{latency}_{\text{Stage 1}} < \text{latency}_{\text{Stage 2}}$.
- Minimise latency: $\text{latency}_{IP} = \text{latency}_{\text{Stage 1}} + \text{latency}_{\text{Stage 2}}$ (solves 1 problem).
Number of free parallel channels

[Figure: number of free parallel channels (up to about 25) as a function of the number of states (n) and the number of inputs (m), each ranging from 0 to 30]

[1] An FPGA Implementation of a Sparse Quadratic Programming Solver for Constrained Predictive Control. In Proc. ACM/SIGDA Symposium on Field Programmable Gate Arrays, Mar 2011.
Performance

Hardware: Xilinx Virtex-6 SX475T at 250 MHz (40 nm)
Software: Intel Core2 Q8300 at 2.5 GHz, 3 GB RAM, 4 MB L2 cache (45 nm)

[Figure: time per interior-point iteration (seconds, log scale) versus the number of states n, for the measured CPU and three FPGA design points: latency (2 · #problems), throughput (2 · #problems), latency (1 problem)]

Benchmark: 3 inputs, 3 outputs, horizon of 20 steps, state and input constraints.

For small problems there is no performance improvement. For the largest problem, the improvement is 14x (red curve), 36x (black curve) and 85x (blue curve).
Filling the pipeline

Parallel Multiplexed MPC [1][2]
- Each thread optimizes over a subset of the $m$ inputs, assuming a fixed value for the rest.
- Effect on the size of the problem: $m \to m/(2 \cdot \#\text{problems})$

Parallel Move Blocking MPC [3]
- The horizon $N$ is split into blocks.
- Each independent thread solves a problem with a different splitting pattern, to guarantee recursive feasibility.
- Effect on the size of the problem: $N \to N/(2 \cdot \#\text{problems})$

[1] MPC for Deeply Pipelined FPGA Implementation: Algorithms and Circuitry. IET Control Theory and Applications, 2011.
[2] Parallel MPC for Real-time FPGA-based Implementation. In Proc. IFAC World Congress, Aug 2011.
[3] Parallel Move Blocking Model Predictive Control. Submitted to Conference on Decision and Control, Dec 2011.
Filling the pipeline

Other possible strategies:
- Distributed algorithms
- Sampling faster than the computational delay
- Moving horizon estimation
Questions