SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay
1 SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay
2 Motivation Power is a first-order design constraint, especially for embedded devices. Certain applications prevalent in embedded devices Image processing, audio processing, context awareness, etc. Trend of including more specialized, energy-efficient accelerators. Moto X 2013
3 Cellular Neural Networks Investigated using Cellular Neural Networks (CNN) as a specialized accelerator. Used for a variety of applications, particularly image processing. Hardware implementations are known to consume very little power: Lee et al. (2008) implemented a CNN chip that consumed 84 mW at 130 nm. Issue: the processing capabilities of current hardware Cellular Neural Networks have not scaled with the growth in image sizes.
4 Outline Cellular Neural Network (CNN) Background Multiplexing SP-CNN Architecture Results
5 Neural Networks Radial Basis Function Network, Self-Organizing Map, Recurrent Neural Network, Adaptive Resonance Theory, Pulse-Coupled Neural Network, Spiking Neural Network, Artificial Neural Network, Hopfield Neural Network, Convolutional Neural Network, Cellular Neural Network (CNN)
6 CNN Background Introduced by Leon Chua and Lin Yang in 1988. Characterized by a spatial arrangement of locally-coupled cells. The set of cells a cell is coupled with is known as the cell's neighborhood. A typical Cellular Neural Network consists of an MxN array of cells C(i,j).
7 CNN Applications Image Processing Edge Detection, Image Segmentation, Movement Detection, etc. Pattern Recognition Associative Memory Solving Partial Differential Equations Pre-data processing for other neural networks CNN Universal Machine is Turing Complete
8 CNN Cell Each cell is a dynamical system with an input, output, and a state that evolves according to some prescribed laws. Most CNNs follow the standard CNN state and output equations:

dx_ij(t)/dt = -x_ij(t) + Σ_{C(k,l) ∈ N_r(i,j)} a_kl y_kl(t) + Σ_{C(k,l) ∈ N_r(i,j)} b_kl u_kl + z

y_ij(t) = (1/2) ( |x_ij(t) + 1| - |x_ij(t) - 1| )

Here N_r(i,j) is the neighborhood of the cell at location (i,j), x is the cell state, y is the cell output, u is the cell input, z is the threshold value, and a and b are the feedback and feed-forward weights, respectively.
9 Digital CNN Cell For our work, we focused on digital CNN implementations. In these implementations, the state equation simplifies to:

x_ij(t+1) = Σ_{C(k,l) ∈ N_r(i,j)} a_kl y_kl(t) + Σ_{C(k,l) ∈ N_r(i,j)} b_kl u_kl + z

[Figure: block diagram of a digital CNN cell — neighborhood outputs weighted by A and neighborhood inputs weighted by B are summed with z to produce x_ij and then y_ij.]
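The discrete update above can be sketched in a few lines of Python. This is our own illustration of the state equation, not the paper's implementation; the 3x3 templates A and B and the threshold z are placeholders, and a zero boundary condition is assumed.

```python
import numpy as np

def cnn_step(x, u, A, B, z):
    """One synchronous digital-CNN update: for every cell, accumulate the
    feedback template A over the neighborhood outputs y(t) and the control
    template B over the neighborhood inputs u, then add the threshold z."""
    y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))   # standard output nonlinearity
    x_next = np.full_like(x, z)
    M, N = x.shape
    yp = np.pad(y, 1)                           # zero boundary condition
    up = np.pad(u, 1)
    for di in range(3):
        for dj in range(3):
            x_next += A[di, dj] * yp[di:di + M, dj:dj + N]
            x_next += B[di, dj] * up[di:di + M, dj:dj + N]
    return x_next
```

With A = 0 and B a centered identity template, the update simply copies the input into the state, which is a quick sanity check of the indexing.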
10 CNN Gene The programmable unit of a CNN is known as the CNN gene. The CNN gene specifies: the threshold value z of the CNN state equation, and the a and b coefficients of the CNN state equation. Furthermore, the programmer specifies the initial state and boundary conditions for the CNN. Hole Filling Gene: A = [3x3 template], B = [3x3 template], z = 1.0; Initial State: x_ij(0) = 1; Boundary Condition: y_kl = 0, u_kl = 0.
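A gene is naturally represented as a small record of templates, threshold, initial state, and boundary values. A minimal sketch of such a record follows; the field names and the placeholder template values are ours (the slide's A and B coefficients were not transcribed), only z = 1.0, x_ij(0) = 1, and the zero boundary come from the slide.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CNNGene:
    A: np.ndarray          # 3x3 feedback template (on outputs y)
    B: np.ndarray          # 3x3 control template (on inputs u)
    z: float               # threshold
    initial_state: float   # x_ij(0)
    boundary_y: float = 0.0
    boundary_u: float = 0.0

# hole-filling parameters from the slide; templates are placeholders
hole_fill = CNNGene(A=np.zeros((3, 3)), B=np.zeros((3, 3)),
                    z=1.0, initial_state=1.0)
```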
11 Example Application The following is an example of running the Hole Filling gene, an application commonly used in character recognition, on an input image of the number eight.
12 Hardware CNNs Different companies and research institutions have implemented CNNs in hardware.

Name                   Array Size   Technology Node
VAE                    80x          nm
Flak+                  4x4          180 nm
SCAMP                  21x          nm
4-layer Gabor Filter   32x          nm
ACE16k                 128x         nm
QCIF                   176x         nm
Kingetset+ (est. size) 128x         nm

The array sizes for most implementations are typically small.
13 Issue: Image Dim vs CNN Array Dim Generally, most applications assume the CNN array's dimensions match the input (image) dimensions. But what if the CNN array is smaller? 1. Shrink the image. Issue: the image-size to CNN-size ratio. Ex. a 1024x1024 image on a 128x128 CNN — that's a 64:1 ratio! 2. Emulate the ideal CNN through multiplexing.
14 Ideal Multiplexing [Figure: four partitions P0-P3 of the CNN state multiplexed onto one CNN array, with a memory timeline of reads (P0 u, P0 x(0)), compute, and writes (P0 x(1)) per partition.] With ideal multiplexing, we transfer and run each partition on the CNN array for 1 CNN unit of time. This is heavily memory bound: the number of input/state data transfers is proportional to the convergence time, and there is too little work to utilize memory-level parallelism (MLP). What if we add another CNN array to take advantage of the independence between computations?
15 Key Insight Let's take advantage of the CNN's ability to still converge to the correct solution in the presence of small errors. So, instead of running a partition on the CNN array for 1 unit of time, run the partition for an interval T >> 1 unit of time. This is the key insight behind our Scalable and Programmable CNN (SP-CNN) architecture.
16 SP-CNN Multiplexing With SP-CNN multiplexing, we run each partition on the CNN array for a certain interval of time. [Figure: timeline comparing a run of 1 CNN unit of time (read P0 u, read P0 x(0), compute, write P0 x(1)) against a run of INTERVAL=4 CNN units of time, shown across the 1st and 2nd iterations.]
17 Boundary Conditions When running a partition on the CNN array, what should we set the boundary conditions to? To ensure that information propagates between partitions, we set the boundary condition for a given partition from its neighboring cells. [Figure: Partition 0's boundary pixels are taken from Partition 1's data where the partitions abut, and from the image's true boundary condition elsewhere.]
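One way to realize this is to ship each partition to the array together with a one-cell halo read from the current global state, so that abutting partitions supply each other's boundary while cells outside the image fall back to the true boundary value. A sketch under that assumption (the helper name and interface are ours, not from the paper):

```python
import numpy as np

def partition_with_halo(state, r0, c0, size, boundary=0.0):
    """Extract a size x size partition starting at (r0, c0) plus a 1-cell
    halo from the global state; halo cells that fall outside the image
    use the image's true boundary value."""
    padded = np.pad(state, 1, constant_values=boundary)
    # indices shift by +1 because of the pad; halo cells inside the image
    # carry the neighboring partitions' most recent data from `state`
    return padded[r0:r0 + size + 2, c0:c0 + size + 2]
```

For a 4x4 state split into 2x2 partitions, the tile for partition (0, 0) is 4x4: its interior is the partition, its bottom-right halo cells come from the neighboring partitions, and its top-left halo cells are the true boundary.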
18 Boundary Conditions Example [Figure: partitions P0-P3 visit the CNN array in turn under the true boundary condition; each partition runs for INTERVAL units, so a full pass advances cnn-time by 4*INTERVAL, e.g. T_2 = T_1 + 4*INTERVAL and T_3 = T_2 + 4*INTERVAL.]
19 Total Time and Virtual Time To compare the convergence of Ideal-CNN vs SP-CNN, we introduce the concept of Virtual Time. [Figure: timelines for Ideal-CNN and SP-CNN with INTERVAL=4; after the 1st SP-CNN iteration the virtual time is v(t)=4, and after the 2nd iteration v(t)=8.]
20 Reduced Memory Transfers If the virtual convergence time does not significantly increase in the SP-CNN case, we can significantly reduce the number of memory transfers. [Figure: Ideal-CNN performs T memory transfers per partition over virtual convergence time T, while SP-CNN with INTERVAL=4 performs about T_2/INTERVAL transfers over virtual convergence time T_2.] If T ≈ T_2, then memory transfers are reduced to roughly 1/INTERVAL of the ideal case.
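A back-of-the-envelope count makes the claim concrete. The numbers below are illustrative (64 partitions, a virtual convergence time of 1024, and the 128-unit interval used later in the methodology); the per-visit transfer count of three (read input, read state, write state) is our simplification.

```python
def transfers(partitions, virtual_time, interval=1):
    # each visit to the array reads the partition's input and state and
    # writes the state back: ~3 transfers per visit
    visits = partitions * (virtual_time // interval)
    return 3 * visits

ideal = transfers(partitions=64, virtual_time=1024)               # interval = 1
spcnn = transfers(partitions=64, virtual_time=1024, interval=128)
print(ideal // spcnn)   # reduction factor ≈ INTERVAL when T is unchanged
```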
21 Increased MLP By increasing the computation time, we can take advantage of parallel computation within an iteration. [Figure: CNN0, CNN1, and CNN2 overlap their partition reads with each other's INTERVAL=4 computations.] Memory-level parallelism increases due to the lengthier computation.
22 SP-CNN Architecture Host Processor: sends work to SP-CNN. Global memory: stores the input and CNN state. Scheduler: assigns partitions to CNN-P units. CNN-P: a CNN processing unit that processes a partition.
23 Methodology 6 benchmarks with 10 input images. The test images are of size 1024x1024 (~720p) or 2048x2048 (~1440p).

Name     Mechanism
Ideal    CNN array size equal to image size
SP-CNN   Proposed architecture

Application          Type
Corner Detection     Local*
Edge Detection       Local*
Connected Component  Global
Hole Filling         Global
Rotation Detector    Global
Shadow Creator       Global

*Local applications are essentially simple convolution operations that only care about neighborhood values.
24 Simulators Functional Simulator specifications: CNN array size = 128x128, interval time = 128 CNN time units. Timing simulator (DRAMSim2 for memory): CNN timing parameters based on the VAE architecture* (130 nm), 1 processing element per 32 cells. Each CNN-P unit has 64 MSHR entries (based on NVIDIA Fermi). Used 2 GB of DRAM with DDR3 timing parameters. *VAE is an existing hardware CNN chip implementation [Lee et al. 2008].
25 Virtual Convergence Results [Chart: virtual convergence time (CNN time units) for Ideal and SP-CNN at 1024x1024 and 2048x2048 across Crn-Detect, Edge-Detect, Conn-Comp, Hole-Fill, Rot-Detect, and Shadow.] The virtual convergence time of SP-CNN is comparable to the ideal case for both image dimensions.
26 Timing Results for 1024x1024 [Chart: time (ms) for CNN-P=1, 2, 4, and 8 across the six benchmarks, with the 30 FPS and 60 FPS boundaries marked.] We meet 60 FPS for most applications when CNN-P = 1, and for all applications when CNN-P = 8.
27 Timing Results for 2048x2048 [Chart: time (ms) for CNN-P=1, 2, 4, and 8 across the six benchmarks, with the 30 FPS and 60 FPS boundaries marked.] We meet 30 FPS for most applications when CNN-P=8; 60 FPS is difficult to meet even with CNN-P=8.
28 Future Work Try more computationally-intensive CNN applications: video processing, face recognition, pattern recognition, etc. FPGA implementation of SP-CNN: will allow us to perform better performance and power comparisons against CPUs, GPUs, etc. Further SP-CNN optimizations: determine optimal interval values, account for boundary condition propagation time, and determine whether a given CNN gene is suitable for SP-CNN.
29 Conclusion Our proposed SP-CNN architecture brings scalability to small hardware CNN arrays. SP-CNN shows performance comparable to the ideal case in terms of virtual convergence time. Energy consumption is around 10 to 30 mJ. SP-CNN can meet the 30 FPS and 60 FPS standards for most benchmarks. This can be further improved with better scaling of the CNN technology.
30 Questions?
31
32 Digital CNN Operation

x_ij(t+1) = Σ_{C(k,l) ∈ N_r(i,j)} a_kl y_kl(t) + Σ_{C(k,l) ∈ N_r(i,j)} b_kl u_kl + z

[Figure: worked example applying templates A and B and threshold z to sample neighborhoods, e.g. y(t) = 1.0, u = -1.0 and y(t) = 0.75, u = -1.0.]
33 The basis of the CNN operation is that we eventually converge to the correct solution. However, there are no guarantees on the convergence time, and different inputs can converge at different rates. [Chart: output error (0-100%) over time when running the Hole Filling gene on a CNN, for various 1024x1024 test images: capitala, circle, eight, filledsquare, lowera, nine, rect, seven, vertline, zero.]
34 Slow Propagation Updated information in the boundary conditions is seen at the beginning of the next iteration. [Figure: partition 0 and partition 1 exchange y(0) outputs; each partition only sees the other's updated output when computing y(1) in the following iteration.]
35 The SP-CNN algorithm can be viewed below:

while change do
    change = false
    for p in partitions do
        loadStateAndInputToCNN(p)
        for n = 0; n < interval; n++, t++ do
            change |= parallel computation of state
            parallel computation of output
        end for
        saveToNextStateFromCNN(p)
    end for
    swapStateAndNextStatePointers()
    iter += 1, vt += interval
end while
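The loop structure is easy to render as runnable Python. In this sketch the per-partition computation is a stand-in (a toy step that moves the state toward the input), not the CNN state update, and the partition/halo handling is omitted; only the outer loop mirrors the pseudocode.

```python
import numpy as np

def sp_cnn_run(state, inputs, step, partitions, interval):
    """SP-CNN outer loop: visit each partition, run it for `interval`
    steps, write its result to the next-state buffer, then swap buffers;
    repeat until no partition changes. Returns (state, virtual time)."""
    vt = 0
    changed = True
    while changed:
        changed = False
        next_state = state.copy()
        for (rs, cs) in partitions:                  # load partition
            tile = state[rs, cs].copy()
            for _ in range(interval):                # run for the interval
                new = step(tile, inputs[rs, cs])
                changed |= not np.array_equal(new, tile)
                tile = new
            next_state[rs, cs] = tile                # save next state
        state = next_state                           # swap state pointers
        vt += interval
    return state, vt

# toy usage: 4x4 image split into four 2x2 partitions; the stand-in
# "computation" increments the state toward the input each step
parts = [(slice(r, r + 2), slice(c, c + 2)) for r in (0, 2) for c in (0, 2)]
step = lambda tile, u: np.minimum(tile + 1, u)
final, vt = sp_cnn_run(np.zeros((4, 4)), np.full((4, 4), 3.0), step,
                       parts, interval=2)
```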
36 Fast Propagation Updated information in the boundary conditions is seen at the beginning of the next interval. [Figure: partition 0 computes y(1) from y(0)'s output, and partition 1 already sees partition 0's updated output within the same iteration.]
37 Using fast propagation, the partition order can improve the program convergence time. For example, if data is passed through the cells from right to left, then we can converge faster by moving through the partitions in reverse column-major order. Partition ordering only matters if fast propagation is used.
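The orderings themselves are cheap to generate. A sketch of a helper (ours, not from the paper) producing row-major versus reverse column-major visit orders for a grid of partitions:

```python
def partition_order(rows, cols, order="row-major"):
    """Return the list of (row, col) partition ids in visit order."""
    if order == "reverse-column-major":
        # visit columns right-to-left so information flowing leftward
        # crosses partition boundaries within a single outer iteration
        return [(r, c) for c in reversed(range(cols)) for r in range(rows)]
    return [(r, c) for r in range(rows) for c in range(cols)]
```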
38 [Chart: average speedup in total convergence time from fast propagation, for 1024x1024 and 2048x2048 images.] For Connected Component, Hole Filling, and Rotation Detector, fast propagation provides an average speedup of 15% to 30%. Shadow Creator shows no benefit since the information propagation in that benchmark is from right to left.
39 [Chart: average speedup in total convergence time using reverse row-major order, for 1024x1024 and 2048x2048 images.] With Shadow Creator, using reverse row-major order provides average speedups of 13% and 30%, respectively.
40 Virtual Convergence Time Originally:

virtualConvTime = n * T, where n is the number of iterations

Now:

virtualConvTime = Σ_{i=1}^{n} min( T, max_{p ∈ Partitions} convTime(p, i) )

where convTime(p, i) is the convergence time for partition p during iteration i.
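With per-partition convergence times recorded each iteration, the refined definition is directly computable. The values below are illustrative, not measurements from the paper:

```python
def virtual_conv_time(conv_times, interval):
    """conv_times[i][p] = steps partition p actually needed in iteration i;
    each iteration contributes at most `interval` virtual time units."""
    return sum(min(interval, max(per_part)) for per_part in conv_times)

# three iterations, four partitions, interval T = 4: the last iteration
# contributes only 2 units because every partition converged early
print(virtual_conv_time([[4, 4, 4, 4], [4, 3, 4, 2], [1, 2, 1, 1]], 4))
```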
41 [Chart: percent error (No-Share avg., No-Share max., Share avg., Share max.) across the six benchmarks.] Both naïve mechanisms show errors of 20% on average, with max errors as high as 90% for global applications. Local-type applications only show error with the No-Share mechanism. SP-CNN converged correctly to the solution for all benchmarks and tests.
42 [Chart: average execution-time speedup for CNN-P=2, 4, and 8 at 1024x1024 and 2048x2048 across the six benchmarks.] For the global applications, memory bandwidth contention limits the ability to achieve linear scaling. Shadow shows very poor scaling since many of its partitions converge quickly, so the CNN occupation time is low.
43 [Chart: average execution-time speedup from prefetching versus the number of CNN-P units, across the six benchmarks.] Prefetching only provides some benefit when the number of CNN-Ps is 1. When the number of CNN-Ps scales up, memory bandwidth contention becomes too large, and prefetching can actually cause slowdowns.
44 We also evaluated our SP-CNN mechanisms against CPU and GPU implementations of the benchmarks. For two benchmarks, Hole Filling and Rotation Detector, we could not easily develop a CPU/GPU algorithm, so we collected their timing results by emulating the CNN's operation on the corresponding platform.

Name    Model             Power                              Frequency  Technology
CPU     Intel Core i      W (TDP)                            3.3 GHz    22 nm
GPU     NVIDIA K1 mobile  1.5 W                              900 MHz    28 nm
SP-CNN                    0.73 mW per PE, 35.6 uW per node   200 MHz    45 nm
45 [Chart: time (ms) for CPU, GPU, SP-CNN (CNN-P=1, 4), and Ideal-Mul (CNN-P=1, 4) across the six benchmarks, with the 30 FPS and 60 FPS boundaries marked; off-scale values for the emulated benchmarks: CPU*: 2923 and 3004 ms, GPU*: 1006 and 1038 ms.] For simple global applications like Connected Component and Shadow, the CPU/GPU versions perform better than the SP-CNN versions. However, emulation of a CNN is prohibitively slow, and for those applications SP-CNN provides much better performance.
46 [Chart: energy (mJ) for CPU, GPU, SP-CNN (CNN-P=1), and SP-CNN (CNN-P=4) across the six benchmarks.] For Connected Component and Shadow, execution time dominates over the low power consumption of SP-CNN, causing its overall energy to be larger than in the GPU case. Hole Filling and Rotation Detector do show cases where SP-CNN can perform complex tasks at a low energy cost.
47 Optimization: Early-Finish Originally, each partition runs on the CNN for the entire interval time T. We saw cases where a partition converged on the CNN before the interval finished, so the rest of the execution essentially does nothing. We therefore introduced Early-Finish, where a partition runs on the CNN until it either converges or the interval time T is reached.
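Early-Finish only changes the inner loop's exit condition. A minimal sketch, again with a stand-in per-partition computation rather than the real CNN update:

```python
import numpy as np

def run_partition_early_finish(tile, u, step, interval):
    """Run a partition until it converges or the interval expires,
    returning the new tile and the number of steps actually used."""
    for n in range(interval):
        new = step(tile, u)
        if np.array_equal(new, tile):   # converged before interval ended
            return tile, n
        tile = new
    return tile, interval

tile = np.zeros(3)
step = lambda t, u: np.minimum(t + 1, u)   # toy computation
out, used = run_partition_early_finish(tile, np.full(3, 2.0), step, interval=10)
```

Here the toy partition converges after 2 useful steps, so the remaining 8 steps of the interval are skipped instead of being wasted.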
UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric
More informationUsing a Hopfield Network: A Nuts and Bolts Approach
Using a Hopfield Network: A Nuts and Bolts Approach November 4, 2013 Gershon Wolfe, Ph.D. Hopfield Model as Applied to Classification Hopfield network Training the network Updating nodes Sequencing of
More informationSynaptic Devices and Neuron Circuits for Neuron-Inspired NanoElectronics
Synaptic Devices and Neuron Circuits for Neuron-Inspired NanoElectronics Byung-Gook Park Inter-university Semiconductor Research Center & Department of Electrical and Computer Engineering Seoul National
More informationFPGA Implementation of a Predictive Controller
FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan
More informationA Deep Convolutional Neural Network Based on Nested Residue Number System
A Deep Convolutional Neural Network Based on Nested Residue Number System Hiroki Nakahara Tsutomu Sasao Ehime University, Japan Meiji University, Japan Outline Background Deep convolutional neural network
More informationFaster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)
Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine
More informationMassively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem
Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem Katharina Kormann 1 Klaus Reuter 2 Markus Rampp 2 Eric Sonnendrücker 1 1 Max Planck Institut für Plasmaphysik 2 Max Planck Computing
More informationArtificial Neural Network and Fuzzy Logic
Artificial Neural Network and Fuzzy Logic 1 Syllabus 2 Syllabus 3 Books 1. Artificial Neural Networks by B. Yagnanarayan, PHI - (Cover Topologies part of unit 1 and All part of Unit 2) 2. Neural Networks
More informationSections 18.6 and 18.7 Artificial Neural Networks
Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs artifical neural networks
More informationJacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA
Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is
More informationAnalytical Modeling of Parallel Programs (Chapter 5) Alexandre David
Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David 1.2.05 1 Topic Overview Sources of overhead in parallel programs. Performance metrics for parallel systems. Effect of granularity on
More informationIntroduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen
Neural Networks - I Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Neural Networks 1 /
More informationCS 4700: Foundations of Artificial Intelligence
CS 4700: Foundations of Artificial Intelligence Prof. Bart Selman selman@cs.cornell.edu Machine Learning: Neural Networks R&N 18.7 Intro & perceptron learning 1 2 Neuron: How the brain works # neurons
More informationChristian Mohr
Christian Mohr 20.12.2011 Recurrent Networks Networks in which units may have connections to units in the same or preceding layers Also connections to the unit itself possible Already covered: Hopfield
More informationParallel Transposition of Sparse Data Structures
Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing
More informationPerformance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So
Performance, Power & Energy ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So Recall: Goal of this class Performance Reconfiguration Power/ Energy H. So, Sp10 Lecture 3 - ELEC8106/6102 2 PERFORMANCE EVALUATION
More informationScalable and Power-Efficient Data Mining Kernels
Scalable and Power-Efficient Data Mining Kernels Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the
More informationOn the Computational Complexity of the Discrete Pascal Transform
6 th International Conference Logic and Applications LAP 207, September 8-22, 207, Dubrovnik, Croatia On the Computational Complexity of the Discrete Pascal Transform Dušan B. Gajić, Radomir S. Stanković
More informationGPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications
GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign
More informationConvolutional Neural Networks
Convolutional Neural Networks Books» http://www.deeplearningbook.org/ Books http://neuralnetworksanddeeplearning.com/.org/ reviews» http://www.deeplearningbook.org/contents/linear_algebra.html» http://www.deeplearningbook.org/contents/prob.html»
More informationSections 18.6 and 18.7 Artificial Neural Networks
Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs. artifical neural
More informationLevel-3 BLAS on a GPU
Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón
More informationSPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics
SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS
More informationNEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0
NEC PerforCache Influence on M-Series Disk Array Behavior and Performance. Version 1.0 Preface This document describes L2 (Level 2) Cache Technology which is a feature of NEC M-Series Disk Array implemented
More informationVLSI Signal Processing
VLSI Signal Processing Lecture 1 Pipelining & Retiming ADSP Lecture1 - Pipelining & Retiming (cwliu@twins.ee.nctu.edu.tw) 1-1 Introduction DSP System Real time requirement Data driven synchronized by data
More informationThe Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization
The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization Jia Wang, Shiyan Hu Department of Electrical and Computer Engineering Michigan Technological University Houghton, Michigan
More informationAccelerating computation of eigenvectors in the nonsymmetric eigenvalue problem
Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National
More informationTensorFlow: A Framework for Scalable Machine Learning
TensorFlow: A Framework for Scalable Machine Learning You probably Outline want to know... What is TensorFlow? Why did we create TensorFlow? How does Tensorflow Work? Example: Linear Regression Example:
More informationLecture 7 Artificial neural networks: Supervised learning
Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in
More informationPractical Combustion Kinetics with CUDA
Funded by: U.S. Department of Energy Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton Practical Combustion Kinetics with CUDA GPU Technology Conference March 20, 2015 Russell Whitesides
More informationLast update: October 26, Neural networks. CMSC 421: Section Dana Nau
Last update: October 26, 207 Neural networks CMSC 42: Section 8.7 Dana Nau Outline Applications of neural networks Brains Neural network units Perceptrons Multilayer perceptrons 2 Example Applications
More informationRetiming. delay elements in a circuit without affecting the input/output characteristics of the circuit.
Chapter Retiming NCU EE -- SP VLSI esign. Chap. Tsung-Han Tsai 1 Retiming & A transformation techniques used to change the locations of delay elements in a circuit without affecting the input/output characteristics
More informationA High-Yield Area-Power Efficient DWT Hardware for Implantable Neural Interface Applications
Neural Engineering 27 A High-Yield Area-Power Efficient DWT Hardware for Implantable Neural Interface Applications Awais M. Kamboh, Andrew Mason, Karim Oweiss {Kambohaw, Mason, Koweiss} @msu.edu Department
More informationERLANGEN REGIONAL COMPUTING CENTER
ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,
More informationCS:4420 Artificial Intelligence
CS:4420 Artificial Intelligence Spring 2018 Neural Networks Cesare Tinelli The University of Iowa Copyright 2004 18, Cesare Tinelli and Stuart Russell a a These notes were originally developed by Stuart
More information