SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga, Dr. Hyesoon Kim, Dr. Saibal Mukhopadhyay

Motivation: Power is a first-order design constraint, especially for embedded devices. Certain applications are prevalent in embedded devices: image processing, audio processing, context awareness, etc. There is a trend of including more specialized, energy-efficient accelerators. (Example: Moto X, 2013)

Cellular Neural Networks: We investigated using Cellular Neural Networks (CNNs) as a specialized accelerator. CNNs are used for a variety of applications, particularly image processing, and hardware implementations are known to consume very little power; Lee et al. (2008) implemented a CNN chip that consumed 84 mW at 130 nm. Issue: the processing capabilities of current hardware Cellular Neural Networks have not scaled with the growth in image sizes.

Outline Cellular Neural Network (CNN) Background Multiplexing SP-CNN Architecture Results

Neural Networks: Radial Basis Function Network, Self-Organizing Map, Recurrent Neural Network, Adaptive Resonance Theory, Pulse-Coupled Neural Network, Spiking Neural Network, Artificial Neural Network, Hopfield Neural Network, Convolutional Neural Network, Cellular Neural Network (CNN)

CNN Background: Introduced by Leon Chua and Lin Yang in 1988. A CNN is characterized by a spatial arrangement of locally-coupled cells; the set of cells a cell is coupled with is known as the cell's neighborhood. A typical Cellular Neural Network consists of an MxN array of cells C(i,j).

CNN Applications: Image processing (edge detection, image segmentation, movement detection, etc.), pattern recognition, associative memory, solving partial differential equations, and data pre-processing for other neural networks. The CNN Universal Machine is Turing complete.

CNN Cell: Each cell is a dynamical system with an input, an output, and a state that evolves according to some prescribed laws. Most CNNs follow the standard CNN state and output equations:

$$\frac{dx_{ij}(t)}{dt} = -x_{ij}(t) + \sum_{C(k,l) \in N_r(i,j)} a_{kl}\, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl}\, u_{kl} + z$$

$$y_{ij}(t) = \frac{1}{2}\left(\left|x_{ij}(t) + 1\right| - \left|x_{ij}(t) - 1\right|\right)$$

Here $N_r(i,j)$ is the neighborhood of the cell at location (i,j), x is the cell state, y is the cell output, u is the cell input, z is the threshold value, and a and b are the feedback and feed-forward (control) weights, respectively.

Digital CNN Cell: For our work, we focused on digital CNN implementations. In these implementations, the state equation simplifies to:

$$x_{ij}(t+1) = \sum_{C(k,l) \in N_r(i,j)} a_{kl}\, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl}\, u_{kl} + z$$

[Cell diagram: neighborhood outputs weighted by A and neighborhood inputs weighted by B are summed with z to produce the new state x_ij and output y_ij.]
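As a concrete illustration of this update, the following minimal Python/NumPy sketch computes one digital CNN step for a single cell over a 3x3 neighborhood. The function names and array layout are illustrative assumptions, not the hardware implementation.

```python
import numpy as np

def cell_state_update(y, u, A, B, z, i, j):
    """One digital CNN state update x_ij(t+1) for the cell at (i, j).

    y and u are 2-D output/input arrays already padded with boundary values,
    so (i, j) indexes the padded arrays; A and B are 3x3 templates; z is the
    threshold (bias) value.
    """
    y_nbr = y[i - 1:i + 2, j - 1:j + 2]   # neighborhood outputs
    u_nbr = u[i - 1:i + 2, j - 1:j + 2]   # neighborhood inputs
    return np.sum(A * y_nbr) + np.sum(B * u_nbr) + z

def cell_output(x):
    """Standard CNN output nonlinearity: y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))
```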

CNN Gene: The programmable unit of a CNN is known as the CNN gene. The CNN gene specifies the threshold value z and the a and b coefficients of the CNN state equation. Furthermore, the programmer specifies the initial state and boundary conditions for the CNN.

Hole-Filling gene:

$$A = \begin{bmatrix} 0.0 & 1.0 & 0.0 \\ 1.0 & 2.0 & 1.0 \\ 0.0 & 1.0 & 0.0 \end{bmatrix}, \quad B = \begin{bmatrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 4.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{bmatrix}, \quad z = -1.0$$

Initial state: $x_{ij}(0) = 1$. Boundary condition: $y_{kl} = 0$, $u_{kl} = 0$.
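Packaged as data, the hole-filling gene above might look like this (the dictionary layout is an illustrative assumption; the template values, bias, initial state, and boundary condition are taken from the slide):

```python
import numpy as np

hole_filling_gene = {
    "A": np.array([[0.0, 1.0, 0.0],      # feedback template
                   [1.0, 2.0, 1.0],
                   [0.0, 1.0, 0.0]]),
    "B": np.array([[0.0, 0.0, 0.0],      # feed-forward (control) template
                   [0.0, 4.0, 0.0],
                   [0.0, 0.0, 0.0]]),
    "z": -1.0,                           # threshold value
    "initial_state": 1.0,                # x_ij(0) = 1 for every cell
    "boundary": {"y": 0.0, "u": 0.0},    # y_kl = 0, u_kl = 0 outside the array
}
```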

Example Application: The following is an example of running the Hole-Filling gene, an operation commonly used in character recognition, on an input image of the number eight.

Hardware CNNs: Different companies and research institutions have implemented CNNs in hardware.

Name | Array Size | Technology Node
VAE | 80x60 | 130 nm
Flak+ | 4x4 | 180 nm
SCAMP | 21x21 | 600 nm
4-layer Gabor Filter | 32x64 | 250 nm
ACE16k | 128x128 | 350 nm
QCIF | 176x144 | 150 nm
Kingetset+ (estimated size) | 128x128 | 240 nm

The array sizes of most implementations are typically small.

Issue: Image Dimensions vs. CNN Array Dimensions. Generally, most applications assume the CNN array's dimensions match the input (image) dimensions. But what if the CNN array is smaller? Option 1: shrink the image. Issue: the image-size to CNN-size ratio; e.g., scaling a 1024x1024 image onto a 128x128 CNN is a 64:1 ratio! Option 2: emulate an ideal CNN through multiplexing.

Ideal Multiplexing: With ideal multiplexing, we transfer and run each partition on the CNN array for one CNN unit of time. [Timeline diagram: Rd. P0 u, Rd. P0 x(0), Comp (1 CNN time unit), Wr. P0 x(1), Rd. P1 u, ... as partitions P0-P3 are cycled through the CNN array.] This approach is heavily memory bound: the number of input/state data transfers is proportional to the convergence time, and there is too little work to utilize memory-level parallelism (MLP). What if we add another CNN array to take advantage of the independence between computations?
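The memory-bound pattern can be sketched as a loop (my own sketch with hypothetical helper names such as read_input, read_state, step, and write_state; the point is only that every CNN time unit pays a full read/compute/write round trip per partition):

```python
def ideal_multiplexing(partitions, cnn_array, memory):
    """Emulate an ideal full-size CNN by running each partition for exactly
    one CNN time unit per pass (hypothetical interfaces)."""
    t = 0
    converged = False
    while not converged:
        converged = True
        for p in partitions:
            memory.read_input(p, cnn_array)     # Rd. P u
            memory.read_state(p, cnn_array)     # Rd. P x(t)
            changed = cnn_array.step()          # Comp: 1 CNN time unit
            memory.write_state(p, cnn_array)    # Wr. P x(t+1)
            converged = converged and not changed
        t += 1                                  # transfers grow with t
    return t
```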

Key Insight: Let's take advantage of a CNN's ability to still converge to the correct solution in the presence of small errors. So, instead of running a partition on the CNN array for 1 unit of time, run the partition for an interval T >> 1 unit of time. This is the key insight behind our Scalable and Programmable CNN (SP-CNN) architecture.

SP-CNN Multiplexing: With SP-CNN multiplexing, we run each partition on the CNN array for a certain interval of time. [Timeline diagram comparing a run of 1 CNN unit of time (Rd. P0 u, Rd. P0 x(0), Comp, Wr. P0 x(1)) against a run of INTERVAL CNN units of time (Comp with INTERVAL=4); the SP-CNN timeline with INTERVAL=4 spans cnn-time 0 to 32 over the 1st and 2nd iterations, marked in intervals.]

Boundary Conditions: When running a partition on the CNN array, what should the boundary conditions be set to? To ensure that information propagates between partitions, we set the boundary condition for a given partition to the values of its neighboring partitions' cells; where the partition borders the image edge, the true boundary condition for the image is used. [Figure legend: partition 0 pixels, partition 1 pixels, partition 0 boundary pixels taken from partition 1 data, and partition 0 boundary pixels that are also part of the true image boundary.]
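A minimal NumPy sketch of this boundary handling (my own, assuming the global output array is held in memory and the true boundary value is 0): the one-cell border around a partition is filled with neighboring partitions' data where available, and with the true boundary condition along the image edge.

```python
import numpy as np

def padded_partition(full_y, r0, r1, c0, c1, true_boundary=0.0):
    """Return partition full_y[r0:r1, c0:c1] with a one-cell border.

    Border cells come from neighboring partitions' data; cells outside the
    image take the true boundary condition.
    """
    H, W = full_y.shape
    out = np.full((r1 - r0 + 2, c1 - c0 + 2), true_boundary, dtype=full_y.dtype)
    rs, re = max(r0 - 1, 0), min(r1 + 1, H)   # rows available in the image
    cs, ce = max(c0 - 1, 0), min(c1 + 1, W)   # cols available in the image
    pr, pc = rs - (r0 - 1), cs - (c0 - 1)     # where they land in `out`
    out[pr:pr + (re - rs), pc:pc + (ce - cs)] = full_y[rs:re, cs:ce]
    return out
```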

Boundary Conditions Example: [Figure: partitions P0-P3 are multiplexed onto the CNN array one interval at a time; each partition's next state at T2 = T1 + 4*INTERVAL is computed from its neighbors' states at T1 together with the true boundary condition, and the process repeats toward T3 = T2 + 4*INTERVAL along the cnn-time axis.]

Total Time and Virtual Time: To compare the convergence of Ideal-CNN vs. SP-CNN, we introduce the concept of virtual time. [Timeline diagram: on the Ideal-CNN axis, v(t)=4 at t=4 and v(t)=8 at t=8; on the SP-CNN axis with INTERVAL=4, cnn-time 0 to 32 is marked in intervals spanning the 1st and 2nd iterations.]

Reduced Memory Transfers: If the virtual convergence time does not significantly increase in the SP-CNN case, we can significantly reduce the number of memory transfers. [Diagram: Ideal-CNN performs T memory transfers, cycling through P0-P3 every CNN time unit until its virtual convergence time T; SP-CNN with INTERVAL=4 performs T2/INTERVAL memory transfers until its virtual convergence time T2.] If T ≈ T2, then the number of memory transfers is reduced to roughly 1/INTERVAL of the ideal case, i.e., by a factor of about INTERVAL.
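A back-of-envelope check (illustrative numbers only; T_ideal and T_spcnn are hypothetical convergence times, not measured results):

```python
num_partitions = 64        # a 1024x1024 image on a 128x128 CNN array
interval = 128             # CNN time units per pass of a partition
T_ideal = 1000             # hypothetical ideal-CNN virtual convergence time
T_spcnn = 1100             # hypothetical SP-CNN virtual convergence time

transfers_ideal = num_partitions * T_ideal               # one round trip per time unit
transfers_spcnn = num_partitions * (T_spcnn / interval)  # one round trip per interval
print(transfers_ideal / transfers_spcnn)   # ~116x here, i.e. roughly INTERVAL
```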

Increased MLP: By increasing the computation time, we can take advantage of parallel computation within an iteration. [Timeline diagram: CNN0 reads P0's input and state, computes for INTERVAL=4, and writes back, while CNN1 reads P1 and CNN2 reads P2 in parallel.] Memory-level parallelism increases because of the lengthier computation.

SP-CNN Architecture: The host processor sends work to SP-CNN. Global memory stores the input and CNN state. The scheduler assigns partitions to CNN-P units. Each CNN-P is a CNN processing unit that processes a partition.
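A rough sketch of the scheduler's role (hypothetical is_free()/run() interface on the units; the real scheduler is a hardware block and its dispatch is event-driven rather than polled): pending partitions are handed to whichever CNN-P unit is free.

```python
from collections import deque

def schedule_iteration(partitions, cnn_p_units):
    """Dispatch one iteration's partitions across the CNN-P units."""
    pending = deque(partitions)
    while pending:
        for unit in cnn_p_units:
            if pending and unit.is_free():
                unit.run(pending.popleft())   # run partition for one interval
```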

Methodology: 6 benchmarks with 10 input images. The test images are of size 1024x1024 (~720p) or 2048x2048 (~1440p).

Name | Mechanism
Ideal | CNN array size equal to image size
SP-CNN | Proposed architecture

Application | Type
Corner Detection | Local*
Edge Detection | Local*
Connected Component | Global
Hole Filling | Global
Rotation Detector | Global
Shadow Creator | Global

*Local applications are essentially simple convolution operations that only care about neighborhood values.

Simulators: Functional simulator specifications: CNN array size = 128x128, interval time = 128 CNN time units. Timing simulator (DRAMSim2 for memory): CNN timing parameters based on the VAE architecture* (200 MHz at 130 nm), 1 processing element per 32 cells, 64 MSHR entries per CNN-P unit (based on NVIDIA Fermi), and 2 GB of DRAM with DDR3 timing parameters. *VAE is an existing hardware CNN chip implementation [Lee et al. 2008].

Virtual Convergence Results: [Bar chart: virtual convergence time (CNN time units, 0-4000) for Ideal and SP-CNN at 1024x1024 and 2048x2048 across Crn-Detect, Edge-Detect, Conn-Comp, Hole-Fill, Rot-Detect, and Shadow.] The virtual convergence time of SP-CNN is comparable to the ideal case for both image dimensions.

Timing Results for 1024x1024: [Bar chart: time (ms, 0-128) for CNN-P = 1, 2, 4, and 8 across the six benchmarks, with 30 FPS and 60 FPS boundaries marked.] We meet 60 FPS for most applications when CNN-P = 1, and for all applications when CNN-P = 8.

Timing Results for 2048x2048: [Bar chart: time (ms, 0-128) for CNN-P = 1, 2, 4, and 8 across the six benchmarks, with 30 FPS and 60 FPS boundaries marked; two off-scale bars are labeled 375 and 195 ms.] We meet 30 FPS for most applications when CNN-P = 8; 60 FPS is difficult to meet even with CNN-P = 8.

Future Work: Try more computationally intensive CNN applications (video processing, face recognition, pattern recognition, etc.). An FPGA implementation of SP-CNN will allow more rigorous performance and power comparisons against CPUs, GPUs, etc. Further SP-CNN optimizations: determine optimal interval values, study boundary-condition propagation time, and determine whether a given CNN gene is suitable for SP-CNN.

Conclusion: Our proposed SP-CNN architecture brings scalability to small hardware CNN arrays. SP-CNN shows performance comparable to the ideal case in terms of virtual convergence time, with energy consumption of around 10 to 30 mJ. SP-CNN can meet the 30 FPS and 60 FPS standards for most benchmarks, and this can be further improved with better scaling of the CNN technology.

Questions?


Digital CNN Operation:

$$x_{ij}(t+1) = \sum_{C(k,l) \in N_r(i,j)} a_{kl}\, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl}\, u_{kl} + z$$

$$A = \begin{bmatrix} 0.0 & 1.0 & 0.0 \\ 1.0 & 2.0 & 1.0 \\ 0.0 & 1.0 & 0.0 \end{bmatrix}, \quad B = \begin{bmatrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 4.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{bmatrix}, \quad z = -1.0$$

[Worked example on the slide: the A template is substituted into the first sum and evaluated for neighborhood cells with example values such as y(t) = 1.0, u = -1.0 and y(t) = 0.75, u = -1.0.]

The basis of the CNN operation is that we eventually converge to the correct solution. However, there are no guarantees on the convergence time, and different inputs can converge at different rates. [Chart: output error over time when running the Hole-Filling gene on a CNN for various 1024x1024 test images (capitala, circle, eight, filledsquare, lowera, nine, rect, seven, vertline, zero), with error ranging from 100% down to 0%.]

Slow Propagation: Updated information in the boundary conditions is seen at the beginning of the next iteration. [Figure: partition 0 and partition 1 each step from y(0) through y(0)'s output to y(1); a partition's boundary update only becomes visible to its neighbor in the following iteration.]

The SP-CNN algorithm can be viewed below:

while change do
    change = false
    for p in partitions do
        loadStateAndInputToCNN(p)
        for n = 0; n < interval; n++, t++ do
            change = parallel computation of state
            parallel computation of output
        end for
        saveToNextStateFromCNN(p)
    end for
    swapStateAndNextStatePointers()
    iter += 1, vt += interval
end while
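Below is a minimal, runnable NumPy sketch of this loop (my own functional model, assuming slow propagation, square partitions, and a zero true boundary; it is not the authors' simulator). Each partition runs `interval` digital CNN steps with its boundary frozen to the neighbors' values from the previous pass.

```python
import numpy as np

def saturate(x):
    """Standard CNN output nonlinearity y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def conv3x3(pad, T):
    """Apply a 3x3 template T to a padded (H+2, W+2) array; returns HxW."""
    H, W = pad.shape[0] - 2, pad.shape[1] - 2
    out = np.zeros((H, W))
    for di in range(3):
        for dj in range(3):
            out += T[di, dj] * pad[di:di + H, dj:dj + W]
    return out

def sp_cnn(u, A, B, z, x0=1.0, part=128, interval=128, max_iters=100):
    """SP-CNN multiplexing loop: each partition runs `interval` digital CNN
    steps per pass with its boundary frozen (slow propagation)."""
    H, W = u.shape
    x = np.full((H, W), float(x0))
    y = saturate(x)
    u_pad_full = np.pad(u.astype(float), 1)   # true boundary: u = 0
    vt = 0
    for _ in range(max_iters):
        y_pad_full = np.pad(y, 1)             # true boundary: y = 0
        next_x, next_y = np.empty_like(x), np.empty_like(y)
        for r in range(0, H, part):
            for c in range(0, W, part):
                yp = y_pad_full[r:r + part + 2, c:c + part + 2].copy()
                up = u_pad_full[r:r + part + 2, c:c + part + 2]
                for _t in range(interval):
                    xp = conv3x3(yp, A) + conv3x3(up, B) + z
                    yp[1:-1, 1:-1] = saturate(xp)   # boundary stays frozen
                next_x[r:r + part, c:c + part] = xp
                next_y[r:r + part, c:c + part] = yp[1:-1, 1:-1]
        vt += interval
        if np.array_equal(next_x, x):         # no state change: converged
            return next_x, next_y, vt
        x, y = next_x, next_y
    return x, y, vt
```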

Fast Propagation: Updated information in the boundary conditions is seen at the beginning of the next interval. [Figure: partition 0 steps from y(0) through y(0)'s output to y(1), and partition 1, run later in the same iteration, already sees partition 0's updated output at its boundary.]

Using fast propagation, the partition order can improve program convergence time. For example, if data is passed through the cells from right to left, then we can converge faster by moving through the partitions in reverse column-major order. Partition ordering only matters if fast propagation is used.

[Bar chart: average speedup in total convergence time (0 to 1.5x) from fast propagation for 1024x1024 and 2048x2048 images across the six benchmarks.] For Connected Component, Hole Filling, and Rotation Detector, fast propagation provides an average speedup of 15% to 30%. Shadow Creator shows no benefit since information in that benchmark propagates from right to left.

[Bar chart: average speedup in total convergence time (0 to 2x) for 1024x1024 and 2048x2048 images across the six benchmarks.] With Shadow Creator, using reverse row-major order provides an average speedup of 13% and 30% for the 1024x1024 and 2048x2048 images, respectively.

Virtual Convergence Time: Originally, $\text{virtualConvTime} = n \cdot T$, where $n$ is the number of iterations. Now,

$$\text{virtualConvTime} = \sum_{i=1}^{n} \min\Big(T, \max_{p \in \text{Partitions}} \text{convtime}(p, i)\Big)$$

where convtime(p, i) is the convergence time for partition p during iteration i.
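In code form (a direct transcription of the formula; `conv_times[i][p]` is assumed to hold the convergence time of partition p in iteration i):

```python
def virtual_conv_time(conv_times, T):
    """Sum over iterations of min(T, slowest partition in that iteration)."""
    return sum(min(T, max(per_partition)) for per_partition in conv_times)

# Example: virtual_conv_time([[40, 128, 90], [12, 30, 7]], T=128) == 128 + 30
```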

[Bar chart: average and maximum percent error (0-100%) for the No-Share and Share naïve mechanisms across the six benchmarks.] Both naïve mechanisms show average errors of around 20%, with maximum errors as high as 90% for global applications. Local applications only show error with the No-Share mechanism. SP-CNN converged to the correct solution for all benchmarks and tests.

[Bar chart: average execution-time speedup (0-6x) for CNN-P = 2, 4, and 8 at 1024x1024 and 2048x2048 across the six benchmarks.] For the global applications, memory-bandwidth contention limits the ability to achieve linear scaling. Shadow shows very poor scaling because many of its partitions converge quickly, so the CNN occupation time is low.

[Line chart: average execution-time speedup from prefetching (0-2x) versus the number of CNN-P units (1, 2, 4, 8) for the six benchmarks.] Prefetching only provides some benefit when the number of CNN-P units equals 1. As the number of CNN-P units scales up, memory-bandwidth contention becomes too large and prefetching can actually cause slowdowns.

We also evaluated our SP-CNN mechanism against CPU and GPU implementations of the benchmarks. For two benchmarks, Hole Filling and Rotation Detector, we could not easily develop a CPU/GPU algorithm, so we collect their timing results by emulating the CNN's operation on the corresponding platform.

Name | Model | Power | Frequency | Technology
CPU | Intel Core i5-3550 | 77 W (TDP) | 3.3 GHz | 22 nm
GPU | NVIDIA K1 mobile | 1.5 W | 900 MHz | 28 nm
SP-CNN | — | 0.73 mW per PE, 35.6 µW per node | 200 MHz | 45 nm

[Bar chart: time (ms, 0-96) for CPU, GPU, SP-CNN (CNN-P=1), SP-CNN (CNN-P=4), Ideal-Mul (CNN-P=1), and Ideal-Mul (CNN-P=4) across the six benchmarks, with 30 FPS and 60 FPS boundaries marked; the emulated Hole-Fill* and Rot-Detect* bars are off-scale at CPU*: 2923/3004 ms and GPU*: 1006/1038 ms.] For simple global applications like Connected Component and Shadow, the CPU/GPU versions perform better than the SP-CNN versions. However, emulation of a CNN is prohibitively slow, and for those applications SP-CNN provides much better performance.

[Bar chart: energy (mJ, 0-60) for CPU, GPU, SP-CNN (CNN-P=1), and SP-CNN (CNN-P=4) across the six benchmarks.] For Connected Component and Shadow, execution time dominates over SP-CNN's low power consumption, making its overall energy larger than the GPU case. Hole Filling and Rotation Detector do show cases where SP-CNN can perform complex tasks at a low energy cost.

Optimization: Early-Finish. Originally, each partition runs on the CNN for the entire interval time T. We saw cases where a partition converged on the CNN before the interval finished, so the rest of the execution essentially did nothing. We therefore introduced Early-Finish, where a partition runs on the CNN until it either converges or the interval time T is reached.
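The change is just an early exit from the inner interval loop of the algorithm shown earlier (sketch with a hypothetical per-step step() method that reports whether any cell changed):

```python
def run_partition_early_finish(cnn_array, interval):
    """Run a loaded partition until it converges or the interval expires."""
    for t in range(interval):
        changed = cnn_array.step()   # one CNN time unit on the array
        if not changed:              # Early-Finish: partition converged
            return t + 1             # time units actually used
    return interval
```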