Optimization techniques for developing embedded applications on single-chip multiprocessor platforms


Luca Benini, DEIS, Università di Bologna

Embedded Systems
- Microprocessor market shares: general-purpose systems vs. embedded systems (chart values: 99%, 1%)


Example Area: Automotive Electronics
- What is automotive electronics? Vehicle functions implemented with electronics:
  - Body electronics
  - System electronics: chassis, engine
  - Information/entertainment
- Automotive electronics market size [chart: cost of electronics per car ($) and market ($ billions)]
- 2006: 25% of the total cost of a car will be electronics
- [...]% of future innovations in vehicles: based on electronic embedded systems

Automotive Electronics Platform Example
- Source: "Expanding automotive electronic systems", IEEE Computer, Jan. 2002

Digital Convergence: Mobile Example
- Communication, entertainment, computing, broadcasting, imaging, telematics
- One device, multiple functions
- Center of the ubiquitous media network
- Smart mobile device: the next driver for the semiconductor industry

4th-Gen and Next-Gen Networks
- Includes: [...], WiMAX (802.16), HSDPA, TDD UMTS, UMTS and future versions of UMTS

SoC: Enabler for Digital Convergence
- 4G/5G, DMB, WiBro, etc.
- [diagram: from today's SoC to the future SoC, > 100X; axes: performance, low power, complexity, storage]

Systems on chip
- Moore's law provides exponential growth of resources
- But design does not become easier
  - Deep submicron (DSM) problems: wire vs. transistor speed, power, signal integrity
  - Design productivity gap: IP re-use, platforms, NoCs
  - Verification technologies

The evolution of SoC platforms
- 2 cores: Philips Nexperia PNX8850 SoC platform for high-end digital video (2001)
- General-purpose scalable RISC processor (MIPS PRxxxx): 50 to 300 MHz, 32-bit or 64-bit
- Scalable VLIW media processor (TriMedia TM-xxxx): 100 to 300 MHz, 32-bit or 64-bit
- Library of device IP blocks: image coprocessors, DSPs, UART, 1394, USB
- Nexperia system buses: [...]-bit
- [block diagram: MIPS CPU and TriMedia CPU, each with I$ and D$, connected through PI buses to device IP blocks; DVP memory bus to SDRAM through the MMI]


Running forward
- 6 cores: Motorola's MSC8126 SoC platform for 3G base stations (late 2003)
- Four 350/400 MHz StarCore SC140 DSP extended cores
- 16 ALUs: 5600/6400 MMACS
- 1436 KB of internal SRAM and multi-level memory hierarchy
- Internal DMA controller supports 16 TDM unidirectional channels
- Two internal coprocessors (TCOP and VCOP) provide special-purpose processing capability in parallel with the core processors

Application pull
- [chart, source IMEC: applications by year of introduction against compute efficiency, with levels at 5 GOPS/W, 100 GOPS/W and 1 TOPS/W]
- Applications shown include mobile base-band, UWB, Gbit radio, A/V streaming, H264 encoding and decoding, image recognition, gesture recognition, expression recognition, emotion recognition, dictation, adaptive route, collision avoidance, 3D gaming, 3D TV, auto personalization


MPSoC Platform Evolution
- Software layers: applications; software optimization; middleware, RTOS, API, run-time controller; mapping
- Hardware knobs: V, Vt, Fclk, IL
- From bus-based multiprocessors to router-based, tile-based design
- Tile: processor, local memory hierarchy, network interface, power management, test management
- 45 nm tile: < 4 mm², 30 Mtr, < 1 GHz
- Today's SoCs could fit in 1 tile!
- I/O peripherals; 3D stacked main memory

SoC: Solution-on-a-Chip
- [layer diagram: target system (mobile terminal), application, application S/W, system/service middleware, RTOS, HAL, module, chip S/W, IP, process, chip, e-sw]
- Requires design of hardware AND software

Design as optimization
- Design space: the set of all possible design choices
- Constraints: solutions that we are not willing to accept
- Cost function: a property we are interested in (execution time, power, reliability, ...)

Hardware synthesis
- [figure: from APPLICATION (amplitude in dB vs. frequency) to ALGORITHM (filter data-flow graph with delays D and coefficients c1 ... c8) to ARCHITECTURE (interconnect, GP processor, signal processor, ASIC, memory, MCM)]
- High-level synthesis, followed by logic and physical synthesis
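The three ingredients above (design space, constraints, cost function) can be made concrete with a toy exhaustive search. This is only an illustrative sketch: the core counts, frequencies, workload, time limit and the linear power model are all hypothetical values chosen for the example, not data from the slides.

```python
from itertools import product

# Hypothetical design space: every (number of cores, clock in MHz) pair.
CORES = [1, 2, 4]
FREQS_MHZ = [100, 200, 400]
WORK_MCYCLES = 800   # assumed workload, in millions of cycles
MAX_TIME_S = 2.5     # constraint: slower designs are not acceptable


def exec_time_s(cores, f_mhz):
    """Idealized execution time, assuming perfect parallel speed-up."""
    return WORK_MCYCLES / (cores * f_mhz)


def power_cost(cores, f_mhz):
    """Toy cost function: power grows linearly with cores and frequency."""
    return cores * f_mhz


def best_design():
    """Enumerate the design space, drop infeasible points, minimize cost."""
    feasible = [(c, f) for c, f in product(CORES, FREQS_MHZ)
                if exec_time_s(c, f) <= MAX_TIME_S]
    # Ties on cost are broken by the (cores, frequency) tuple itself.
    return min(feasible, key=lambda d: (power_cost(*d), d))


print(best_design())
```

Real design spaces are far too large for enumeration, which is why the following slides turn to ILP, constraint programming and heuristics.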

Behavioral synthesis
- From control/data flow graph (CDFG) to implementation (registers, multiplier, adder)
- Allocation, assignment, and scheduling
  - Allocation: how much? 2 adders, 1 shifter, 24 registers
  - Assignment: where? Shifter 1
  - Schedule: when? Time slot 4
- Techniques well understood and mature

Resource constraints
- [figure: two schedules (Schedule 1, Schedule 2) of the same multiplications over 4 control steps under different resource constraints]

Scheduling under resource constraints
- Intractable problem
- Algorithms
  - Exact: integer linear program; Hu (restrictive assumptions)
  - Approximate: list scheduling; force-directed scheduling
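List scheduling, named above as an approximate algorithm, can be sketched in a few lines. The version below assumes unit-delay operations and a static priority order given by the caller; the operation names, the small graph and the resource counts are hypothetical and only serve to show how fewer units stretch the schedule.

```python
def list_schedule(ops, preds, op_type, units):
    """Resource-constrained list scheduling for unit-delay operations.

    ops:     operation names in priority order (earlier = higher priority)
    preds:   dict mapping an operation to the set of its predecessors
    op_type: dict mapping an operation to its resource type
    units:   dict mapping a resource type to the number of available units
    Returns a dict mapping each operation to its control step (1-based).
    Assumes an acyclic graph and at least one unit of every used type.
    """
    start = {}
    step = 1
    while len(start) < len(ops):
        free = dict(units)  # units still unused in this control step
        for op in ops:
            if op in start:
                continue
            # Ready only if every predecessor finished in an earlier step.
            ready = all(p in start and start[p] < step
                        for p in preds.get(op, ()))
            if ready and free[op_type[op]] > 0:
                start[op] = step
                free[op_type[op]] -= 1
        step += 1
    return start


# Hypothetical example: three multiplications feeding two ALU operations.
OPS = ["m1", "m2", "m3", "a1", "a2"]
PREDS = {"a1": {"m1", "m2"}, "a2": {"a1", "m3"}}
TYPES = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "alu", "a2": "alu"}

one_mul = list_schedule(OPS, PREDS, TYPES, {"mul": 1, "alu": 1})
two_mul = list_schedule(OPS, PREDS, TYPES, {"mul": 2, "alu": 1})
print(max(one_mul.values()), max(two_mul.values()))
```

The heuristic is greedy, so its result depends on the priority order and is not guaranteed to be optimal in general.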

ILP formulation
- Binary decision variables: $X = \{ x_{il},\ i = 1, 2, \ldots, n;\ l = 1, 2, \ldots, \lambda + 1 \}$
- $x_{il}$ is TRUE only when operation $v_i$ starts in step $l$ of the schedule (i.e. $l = t_i$)
- $\lambda$ is an upper bound on latency
- Start time of operation $v_i$: $t_i = \sum_l l \cdot x_{il}$

ILP formulation constraints
- Operations start only once: $\sum_l x_{il} = 1, \quad i = 1, 2, \ldots, n$
- Sequencing relations must be satisfied: $t_i \ge t_j + d_j \ \ \forall (v_j, v_i) \in E$, that is $\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0 \ \ \forall (v_j, v_i) \in E$
- Resource bounds must be satisfied; simple case (unit delay): $\sum_{i:\, T(v_i) = k} x_{il} \le a_k, \quad k = 1, 2, \ldots, n_{res};\ \forall l$

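ILP Formulation
$\min \sum_l l \cdot x_{nl}$ such that
- $\sum_l x_{il} = 1, \quad i = 1, 2, \ldots, n$
- $\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0, \quad i, j = 1, 2, \ldots, n,\ (v_j, v_i) \in E$
- $\sum_{i:\, T(v_i) = k} \ \sum_{m = l - d_i + 1}^{l} x_{im} \le a_k, \quad k = 1, 2, \ldots, n_{res};\ l = 0, 1, \ldots, \lambda$

Example
- Resource constraints: 2 ALUs, 2 multipliers: $a_1 = 2$, $a_2 = 2$
- Single-cycle operations: $d_i = 1 \ \ \forall i$
- [figure: sequencing graph from a source NOP to a sink NOP ($v_n$) with multiplications (*), subtractions (-) and a comparison (<)]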

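Example
- Operations start only once: $x_{1,1} = 1$; $x_{6,1} + x_{6,2} = 1$; ...
- Sequencing relations must be satisfied: $2 x_{7,2} + 3 x_{7,3} - x_{6,1} - 2 x_{6,2} - 1 \ge 0$; ...; $5 x_{n,5} - 2 x_{9,2} - 3 x_{9,3} - 4 x_{9,4} - 1 \ge 0$
- Resource bounds must be satisfied: $x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} \le 2$; $x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} \le 2$; ...
- [figure: resulting schedule with the source NOP at time 0, operations in time steps 1 to 4, and the sink NOP $v_n$]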
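To see how the three constraint families of the ILP fit together, the sketch below builds the binary variables x[i][l] from a candidate schedule and checks "start once", sequencing and resource bounds for unit-delay operations. It then finds a minimum-latency schedule by brute-force enumeration, which stands in for a real ILP solver here. The small graph is hypothetical (it is not the slide's example), and latency is measured as the largest start step rather than the start step of an explicit sink node.

```python
from itertools import product


def check_ilp(start, edges, op_type, a, lam):
    """Check the ILP constraints for unit-delay operations (d_j = 1)."""
    ops = list(start)
    # x[i][l] = 1 iff operation i starts in step l, for l = 1..lam
    x = {i: {l: int(start[i] == l) for l in range(1, lam + 1)} for i in ops}
    # Each operation starts exactly once.
    once = all(sum(x[i].values()) == 1 for i in ops)
    # Start times recovered from the binary variables: t_i = sum_l l * x_il
    t = {i: sum(l * x[i][l] for l in x[i]) for i in ops}
    # Sequencing: t_i - t_j - d_j >= 0 for every edge (v_j, v_i).
    seq = all(t[i] - t[j] - 1 >= 0 for (j, i) in edges)
    # Resource bounds: at most a[k] operations of type k in any step.
    res = all(sum(x[i][l] for i in ops if op_type[i] == k) <= a[k]
              for k in a for l in range(1, lam + 1))
    return once and seq and res


def min_latency(ops, edges, op_type, a, lam):
    """Exhaustive search over all start-step assignments (tiny cases only)."""
    best = None
    for steps in product(range(1, lam + 1), repeat=len(ops)):
        start = dict(zip(ops, steps))
        if check_ilp(start, edges, op_type, a, lam):
            if best is None or max(steps) < best[0]:
                best = (max(steps), start)
    return best


# Hypothetical instance: edges are (predecessor, successor) pairs.
OPS = ["m1", "m2", "m3", "a1", "a2"]
EDGES = [("m1", "a1"), ("m2", "a1"), ("a1", "a2"), ("m3", "a2")]
TYPES = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "alu", "a2": "alu"}

print(min_latency(OPS, EDGES, TYPES, {"mul": 1, "alu": 1}, lam=5)[0])
```

Enumeration grows as lam to the power of the number of operations, which illustrates why the problem is called intractable on the earlier slide.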

Resource-Efficient Application Mapping for MPSoCs
- Multimedia applications: given a platform,
  1. Achieve a specified throughput
  2. Minimize usage of shared resources
- Resource assignment and scheduling

The system
- Nodes 1 ... N, each with a processor, a tightly-coupled memory (limited size) and a bus interface
- Each node runs tasks A, B, ..., N (WCET Ta, Tb, ..., Tn) within a max time wheel period T
- Shared system bus: max bus bandwidth
- On-chip memory: assumed to be infinite

The application
- Signal processing pipeline: T0, T1, T2, T3, ..., T7, with a throughput constraint
- Each task is characterized by: WCET, memory requirements
- Queues for inter-processor communication: in TCM for efficiency reasons
- Program data: in TCM (if space) or on-chip memory
- Internal state: in TCM (if space) or on-chip memory

Communication-aware Allocation and Scheduling for Stream-Oriented MPSoCs
- [figure: signal processing pipeline T0, T1, T2, ..., T7 to be mapped (?) onto a message-oriented MPSoC architecture: ARM7 cores, each with a local scratchpad memory, connected by a bus to private memories]
- Simplifying assumptions vs. predictability
- Efficient solutions in reasonable time
- Pure ILP formulations suitable for small task sets
- Widespread use of heuristics

Master Problem model
Assignment of tasks and memory slots (master problem)
- $T_{ij} = 1$ if task $i$ executes on processor $j$, 0 otherwise
- $Y_{ij} = 1$ if task $i$ allocates program data on processor $j$ memory, 0 otherwise
- $Z_{ij} = 1$ if task $i$ allocates the internal state on processor $j$ memory, 0 otherwise
- $X_{ij} = 1$ if task $i$ executes on processor $j$ and task $i+1$ does not, 0 otherwise
- Each process should be allocated to one processor: $\sum_j T_{ij} = 1$ for all $i$
- Link between variables X and T: $X_{ij} = |T_{ij} - T_{i+1,j}|$ for all $i$ and $j$ (can be linearized)
- If a task is NOT allocated to a processor, neither are its required memories: $T_{ij} = 0 \Rightarrow Y_{ij} = 0$ and $Z_{ij} = 0$
- Objective function: $\sum_j \sum_i \left[ mem_i (T_{ij} - Y_{ij}) + state_i (T_{ij} - Z_{ij}) + data_i X_{ij} / 2 \right]$

Improvement of the model
- With the proposed model, the allocation problem solver tends to pack all tasks on a single processor and all required memory on the local memory, so as to have ZERO communication cost: TRIVIAL SOLUTION
- To improve the model we add a relaxation of the subproblem to the master problem model
- For each set $S$ of consecutive tasks whose sum of durations exceeds the real-time requirement, we impose that their processors should not all be the same: $\sum_{i \in S} WCET_i > RT \ \Rightarrow\ \sum_{i \in S} T_{ij} \le |S| - 1 \ \ \forall j$
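The communication-cost objective of the master problem can be evaluated directly for a given allocation, which makes the "trivial solution" visible. The sketch below assumes the objective sums, over processors j and tasks i, mem_i (T_ij - Y_ij) + state_i (T_ij - Z_ij) + data_i X_ij / 2, with X_ij taken as |T_ij - T_(i+1)j|; that reading of X, the function name and the numbers are assumptions made for illustration.

```python
def objective(task_proc, data_local, state_local, mem, state, data, n_proc):
    """Communication cost of an allocation.

    task_proc[i]   : processor index that executes task i
    data_local[i]  : True if task i keeps its program data in local memory
    state_local[i] : True if task i keeps its internal state in local memory
    mem, state, data : per-task sizes of program data, internal state and
                       data passed to the next task in the pipeline
    """
    n = len(task_proc)
    cost = 0.0
    for j in range(n_proc):
        for i in range(n):
            t = int(task_proc[i] == j)             # T_ij
            y = t if data_local[i] else 0          # Y_ij (implies T_ij = 1)
            z = t if state_local[i] else 0         # Z_ij (implies T_ij = 1)
            # The last task has no successor, so its X term is zero.
            nxt = int(task_proc[i + 1] == j) if i + 1 < n else t
            x = abs(t - nxt)                       # assumed X_ij
            cost += mem[i] * (t - y) + state[i] * (t - z) + data[i] * x / 2
    return cost


MEM, STATE, DATA = [8, 8, 8], [2, 2, 2], [10, 20, 30]
ALL_LOCAL = [True, True, True]

packed = objective([0, 0, 0], ALL_LOCAL, ALL_LOCAL, MEM, STATE, DATA, 2)
split = objective([0, 0, 1], ALL_LOCAL, ALL_LOCAL, MEM, STATE, DATA, 2)
print(packed, split)
```

Packing every task and all memory on one processor gives cost zero, which is exactly why the model needs the extra cut on sets of consecutive tasks whose total duration exceeds the real-time requirement.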

Sub-Problem model
Task scheduling with static resource assignment (subproblem)
- We have to schedule tasks, so we have to decide when they start
- Activity starting time: $Start_i :: [0..Deadline_i]$
- Precedence constraints: $Start_i + Dur_i \le Start_j$
- Real-time constraints: for all activities running on the same processor, $\sum_i (Start_i + Dur_i) \le RT$
- Cumulative constraints on resources
  - Processors are unary resources: cumulative([Start], [Dur], [1], 1)
  - Memories are additive resources: cumulative([Start], [Dur], [MR], C)
- What about the bus?
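The cumulative constraint used for processors and memories has a simple meaning that can be checked by hand: at any instant, the requirements of the activities running at that instant must not exceed the capacity. The function below is a checker for given start times, not a CP propagator; its name mirrors the constraint but the implementation and sample values are an illustrative assumption.

```python
def cumulative(starts, durs, reqs, capacity):
    """Return True iff the resource is never over-subscribed.

    Activity k occupies the half-open interval [starts[k], starts[k] + durs[k])
    and needs reqs[k] units while it runs. The load can only increase at a
    start time, so checking those instants is sufficient.
    """
    for t in sorted(set(starts)):
        load = sum(r for s, d, r in zip(starts, durs, reqs) if s <= t < s + d)
        if load > capacity:
            return False
    return True


# Processor as a unary resource: every task needs 1 unit, capacity is 1.
print(cumulative([0, 2], [2, 3], [1, 1], 1))   # back-to-back tasks
print(cumulative([0, 1], [2, 3], [1, 1], 1))   # overlapping tasks

# Memory as an additive resource: requirements add up against capacity C.
print(cumulative([0, 1], [2, 3], [4, 5], 8))
```

A CP solver does the reverse job: it prunes the domains of the start variables so that only assignments passing this check remain.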

Bus model
- Unary resource: granularity of a clock cycle
- [figure: bandwidth (bit/s) vs. time over the execution time of task i and task j; each state read and state write of task i and task j uses the max bus bandwidth in turn]
- An arbitration mechanism decides the bus allocation

Bus model
- Additive bus model
- [figure: bandwidth (bit/s) vs. time; state reads and writes of task i and task j overlap and their bandwidths add up below the max bus bandwidth; bandwidth label: size of program data / TaskExecTime]
- Task0 accesses input data: BW = MaxBW / NoProc
- The model does not hold under heavy bus congestion (more than 65% of total bandwidth)
- Bus traffic has to be minimized
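Under the additive bus model, each task can be treated as if it spread its traffic evenly over its execution time, and the bus is an additive resource whose total demand must stay below the validity threshold of the model. The sketch below applies the 65% threshold quoted above; the task tuples, the bandwidth unit and the function name are hypothetical.

```python
def bus_within_model(tasks, max_bw, threshold=0.65):
    """Check that the summed average bandwidth stays below threshold * max_bw.

    tasks: list of (start, exec_time, amount_transferred) tuples.
    A task's average bandwidth is amount_transferred / exec_time, assumed
    constant over [start, start + exec_time).
    """
    limit = threshold * max_bw
    for t0, _, _ in tasks:
        # Demand can only rise when a task starts, so check those instants.
        demand = sum(b / d for s, d, b in tasks if s <= t0 < s + d)
        if demand > limit:
            return False
    return True


MAX_BW = 100  # hypothetical units of data per time unit

light = [(0, 10, 300), (5, 10, 300)]          # 30 + 30 while both run
heavy = light + [(5, 10, 100)]                # 30 + 30 + 10 while all run
print(bus_within_model(light, MAX_BW), bus_within_model(heavy, MAX_BW))
```

With communication hotspots a lower threshold would be passed through the `threshold` parameter.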


No-good generation
- Assignment of tasks and memory slots (master problem); task scheduling with static resource assignment (subproblem)
- If no feasible schedule exists for the allocation provided by the master, a no-good is generated
- We use the simple BUT EFFECTIVE one: identify the CONFLICTING RESOURCES $CR$; for each $R \in CR$, with $ST_R$ the set of tasks allocated on $R$: $\sum_{i \in ST_R} T_{iR} \le |ST_R| - 1$
- Other cuts are also possible [Hooker, Constraints 2005], but these are enough for our case and easy to extract
- Loop: the master problem (IP solver) passes its solution to the subproblem (CP solver), which returns either a solution or a no-good

Computational efficiency
- CP and IP formulations simplified
- Hybrid approach clearly outperforms pure CP and IP techniques
- Search time bounded to 15 minutes
- CP and IP found a solution in only 50% of the instances
- Hybrid approach always found a solution
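The interaction between master and subproblem can be illustrated with a toy loop. This is a deliberately simplified sketch: the master is a brute-force enumeration instead of an IP solver, the subproblem only checks that the summed WCET on each processor fits the real-time requirement instead of running a CP scheduler, and all names and numbers are hypothetical. The no-good it adds is the one described above: the tasks found together on a conflicting resource R may not all be placed on R again.

```python
from itertools import product


def solve(wcet, n_proc, rt, comm):
    """Toy master/subproblem loop with no-good cuts.

    wcet[i] : duration of task i
    comm[i] : cost paid if tasks i and i+1 run on different processors
    Returns (allocation, number_of_master_iterations).
    """
    n = len(wcet)
    nogoods = []  # list of (processor, frozenset_of_tasks)

    def cost(alloc):
        return sum(comm[i] for i in range(n - 1) if alloc[i] != alloc[i + 1])

    def violates(alloc):
        # A no-good (r, st) forbids placing every task of st on r.
        return any(all(alloc[i] == r for i in st) for r, st in nogoods)

    iterations = 0
    while True:
        iterations += 1
        # Master: cheapest allocation not excluded by a no-good.
        candidates = [a for a in product(range(n_proc), repeat=n)
                      if not violates(a)]
        if not candidates:
            return None, iterations
        alloc = min(candidates, key=lambda a: (cost(a), a))
        # Subproblem (simplified): find the conflicting resources.
        conflicts = []
        for r in range(n_proc):
            st = frozenset(i for i in range(n) if alloc[i] == r)
            if sum(wcet[i] for i in st) > rt:
                conflicts.append((r, st))
        if not conflicts:
            return alloc, iterations
        nogoods.extend(conflicts)


print(solve(wcet=[4, 4, 4], n_proc=2, rt=8, comm=[5, 1]))
```

In this run the master first proposes the two zero-cost allocations that pack everything on one processor; both are rejected and cut off before a feasible split is found.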


Validation of bus model
- Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail
- Lower threshold (50%) in the presence of communication hotspots
- Benefits of the additive model
  - Task execution time almost independent of bus utilization
  - Performance predictability greatly enhanced

Validation of optimizer solutions
- MAX error lower than 10%
- AVG error equal to 4.7%, with standard deviation of 0.08
- Optimizer turns out to be conservative in predicting infeasibility
- The flow was successfully applied to a GSM benchmark


Energy-Efficient Application Mapping for MPSoCs
- Multimedia applications: given a platform,
  1. Achieve a specified throughput
  2. Minimize power consumption

Exploiting Voltage Supply
- Supply voltage impacts power and performance
- Circuit slowdown: $T = 1/f = K / (V_{dd} - V_t)^a$
- Cubic power savings: $P = C_{eff} \cdot V_{dd}^2 \cdot f$
- Just-in-time computation: stretch execution time up to the max tolerable
- [figure: power vs. available time for fixed voltage with shutdown and for variable voltage]
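The two relations above, f = (Vdd - Vt)^a / K and P = Ceff · Vdd² · f, are enough to sketch just-in-time computation: pick the lowest supply voltage whose clock still finishes the work before the deadline. All constants below (Vt, a, K, Ceff, the voltage levels and the workload) are hypothetical values chosen for readability, not figures from the slides.

```python
V_T = 0.4      # assumed threshold voltage (V)
A = 2.0        # assumed exponent of the delay model
K = 1e-9       # assumed technology constant
C_EFF = 1e-9   # assumed effective switched capacitance (F)


def frequency(vdd):
    """Clock frequency from the delay model T = 1/f = K / (Vdd - Vt)^a."""
    return (vdd - V_T) ** A / K


def power(vdd):
    """Dynamic power P = Ceff * Vdd^2 * f."""
    return C_EFF * vdd ** 2 * frequency(vdd)


def energy(cycles, vdd):
    """Energy for a fixed number of cycles: power times execution time."""
    return power(vdd) * cycles / frequency(vdd)


def just_in_time(cycles, deadline, levels):
    """Lowest supply voltage whose execution time still meets the deadline."""
    for vdd in sorted(levels):
        if cycles / frequency(vdd) <= deadline:
            return vdd
    return None


LEVELS = [0.9, 1.4, 1.9]
chosen = just_in_time(cycles=5e8, deadline=1.0, levels=LEVELS)
print(chosen, energy(5e8, chosen) / energy(5e8, max(LEVELS)))
```

Because energy per cycle scales with Vdd², running slower at a lower voltage saves energy even though the task takes longer, whereas running fast and then shutting down at fixed voltage does not.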

Platform with frequency scaling
- [block diagram: cores A, B, C and D with clocks f1, f2, f3; FIFOs; slaves A and B; AHB; bridge; programmable device; clock tree generator fed by f_ref]
- Multiple clocks across the platform
- Memory-mapped device allows dynamic scaling

Scheduling & Voltage Scaling
- Different voltages: different frequencies (CPU with V_dd and V_bs inputs; f1, f2, f3)
- Energy/speed trade-offs: varying the voltages
- Mapping and scheduling: given (fastest frequency)
- [figure: power vs. time for τ1, τ2, τ3 finishing with slack before the deadline, and the same tasks stretched up to the deadline at lower power]

Continuous Voltage Scaling
Given
- Application: set of interacting processes
- Platform: set of nodes, each having supply voltage ($V_{dd}$) and body bias voltage ($V_{bs}$) inputs
- Mapping and schedule table (including timing constraints)
- [figure: architecture and mapping (CPU1, CPU2, CPU3 with V_dd and V_bs inputs, and a bus) with the schedule table per processor for τ0 ... τ5; power vs. time for τ1, τ2, τ3 with slack before the deadline]

Solving CVS
Determine
- Voltage levels $V_{dd}$ and $V_{bs}$ for each process
Such that
- System energy is minimized, and
- Deadlines are satisfied
Assessment
- Convex nonlinear problem
- Polynomial-time solvable with arbitrarily good precision
- [figure: input and output power profiles of τ1, τ2, τ3 up to the deadline; height: voltage level; area: energy]
- A. Andrei, "Overhead-Conscious Voltage Selection for Dynamic and Leakage Energy Reduction of Time-Constrained Systems", technical report, Linköping University, 2004

Formulation
- Minimize $E[\tau_0] + E[\tau_1] + E[\tau_2] + E_{OH}[\tau_1\text{-}\tau_2]$ (energy due to processes plus overhead due to voltage changes)
- Such that
  - Precedence relationships: $T_{start}[\tau_0] + T_{exe}[\tau_0] \le T_{start}[\tau_1]$ and $T_{start}[\tau_1] + T_{exe}[\tau_1] + T_{oh}[\tau_1\text{-}\tau_2] \le T_{start}[\tau_2]$
  - Deadlines: $T_{start}[\tau_2] + T_{exe}[\tau_2] \le DL[\tau_2]$
- [figure: τ0 on CPU1, the bus, τ1 and τ2 on CPU2]

More realistic setting
- Discrete f and V, with variable transition delays
- Multiple processors: MPSoC platforms (computation, communication, storage)
- Optimize allocation and scheduling: allocate tasks to processors, set frequency-voltage, schedule
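With discrete frequency/voltage pairs and transition overheads, the selection problem becomes combinatorial. The sketch below enumerates every mode assignment for a chain of tasks on one processor, charging a time and energy overhead whenever two consecutive tasks use different modes, and keeps the cheapest assignment that meets the deadline. The mode table, the capacitance, and the constant overheads are hypothetical; the slides mention variable transition delays, which this sketch simplifies to fixed ones.

```python
from itertools import product

MODES = [(100e6, 0.8), (200e6, 1.2), (400e6, 1.8)]  # assumed (Hz, V) pairs
C_EFF = 1e-9             # assumed effective capacitance (F)
T_OH, E_OH = 1e-3, 1e-4  # assumed fixed cost of one voltage change (s, J)


def select_modes(cycles, deadline):
    """Return (energy, mode_indices) of the cheapest feasible assignment,
    or None if no assignment meets the deadline."""
    best = None
    for choice in product(range(len(MODES)), repeat=len(cycles)):
        time = energy = 0.0
        for idx, (n, m) in enumerate(zip(cycles, choice)):
            f, v = MODES[m]
            if idx > 0 and choice[idx - 1] != m:
                time += T_OH       # overhead due to the voltage change
                energy += E_OH
            time += n / f          # execution time at this frequency
            energy += C_EFF * v * v * n
        if time <= deadline and (best is None or energy < best[0]):
            best = (energy, choice)
    return best


print(select_modes([2e6, 2e6], deadline=0.05))
print(select_modes([2e6, 2e6], deadline=0.025))
```

With the loose deadline both tasks run in the slowest mode; with the tighter one, mixing the slowest and fastest modes misses the deadline once the transition time is counted, so both tasks move to the middle mode.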

Problem Decomposition
- Allocation and frequency assignment: INTEGER PROGRAMMING; memory constraints; objective function: power; avoid trivially infeasible solutions
- Scheduling: CONSTRAINT PROGRAMMING; real-time constraint; objective function: makespan; threshold for bus utilization
- The master passes a valid allocation to the subproblem, which returns no-goods
- Much tougher: many more decision variables, and the cost function depends on both master and sub-problem

Solver Performance
- Hundreds of decision variables
- Much beyond ILP solver or CP solver capability

Accuracy of results
- Against cycle-accurate functional simulation
- Including power estimation (STM 0.13 µm)

Pitfalls
- Beware of abstraction! Abstraction noise even of exact results
- The cost of modeling: how to determine the coefficients
- Validation: do not forget to validate via functional simulation
