
1 SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

2 Motivation Power is a first-order design constraint, especially for embedded devices. Certain applications are prevalent in embedded devices: image processing, audio processing, context awareness, etc. There is a trend of including more specialized, energy-efficient accelerators (e.g., the 2013 Moto X).

3 Cellular Neural Networks We investigated using Cellular Neural Networks (CNNs) as a specialized accelerator. They are used for a variety of applications, particularly image processing, and hardware implementations are known to consume very little power: Lee et al. (2008) implemented a CNN chip that consumed 84 mW at 130 nm. Issue: the processing capabilities of current hardware Cellular Neural Networks have not scaled with the growth in image sizes.

4 Outline Cellular Neural Network (CNN) Background Multiplexing SP-CNN Architecture Results

5 Neural Networks Radial Basis Function Network, Self-Organizing Map, Recurrent Neural Network, Adaptive Resonance Theory, Pulse-Coupled Neural Network, Spiking Neural Network, Artificial Neural Network, Hopfield Neural Network, Convolutional Neural Network, Cellular Neural Network (CNN)

6 CNN Background Introduced by Leon Chua and Lin Yang in 1988. Characterized by a spatial arrangement of locally-coupled cells; the set of cells a cell is coupled with is known as the cell's neighborhood. A typical Cellular Neural Network consists of an MxN array of cells C(i,j).

7 CNN Applications Image processing (edge detection, image segmentation, movement detection, etc.), pattern recognition, associative memory, solving partial differential equations, and data pre-processing for other neural networks. The CNN Universal Machine is Turing complete.

8 CNN Cell Each cell is a dynamical system with an input, output, and a state that evolves according to some prescribed laws. Most CNNs follow the standard CNN state and output equations:

$$\frac{dx_{ij}(t)}{dt} = -x_{ij}(t) + \sum_{C(k,l) \in N_r(i,j)} a_{kl}\, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl}\, u_{kl} + z$$

$$y_{ij}(t) = \frac{1}{2}\left(\,\left|x_{ij}(t) + 1\right| - \left|x_{ij}(t) - 1\right|\,\right)$$

where N_r(i,j) is the neighborhood of the cell at location (i,j), x is the cell state, y is the cell output, u is the cell input, z is the threshold value, and a and b are the feedback and feed-forward weights, respectively.

9 Digital CNN Cell For our work, we focused on digital CNN implementations. In these implementations, the state equation simplifies to:

$$x_{ij}(t+1) = \sum_{C(k,l) \in N_r(i,j)} a_{kl}\, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl}\, u_{kl} + z$$

[Figure: block diagram of a digital CNN cell — neighborhood-cell outputs feed template A, neighborhood-cell inputs feed template B, and their sum with z produces the state x_ij and output y_ij.]
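To make the update concrete, here is a minimal NumPy sketch of one synchronous digital CNN step over the whole array, assuming a 3x3 neighborhood (r = 1) and a fixed zero boundary condition; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def cnn_step(x, u, A, B, z):
    """One synchronous digital CNN update over an M x N cell array.

    x: current cell states, shape (M, N)
    u: cell inputs, shape (M, N)
    A, B: 3x3 feedback and feed-forward templates; z: threshold.
    Cells outside the array use a fixed zero boundary condition.
    """
    # Standard output nonlinearity: y = 0.5*(|x+1| - |x-1|), i.e. clip(x, -1, 1).
    y = np.clip(x, -1.0, 1.0)
    yp = np.pad(y, 1)  # zero-pad to supply boundary outputs
    up = np.pad(u, 1)  # zero-pad to supply boundary inputs
    M, N = x.shape
    x_next = np.full_like(x, z, dtype=float)
    for dk in range(3):  # accumulate weighted neighborhood contributions
        for dl in range(3):
            x_next += A[dk, dl] * yp[dk:dk + M, dl:dl + N]
            x_next += B[dk, dl] * up[dk:dk + M, dl:dl + N]
    return x_next
```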

10 CNN Gene The programmable unit of a CNN is known as the CNN gene. The CNN gene specifies the threshold value z and the a and b coefficients of the CNN state equation. Furthermore, the programmer specifies the initial state and boundary conditions for the CNN. Hole-Filling gene: A = …, B = …, z = 1.0; initial state: x_ij(0) = 1; boundary condition: y_kl = 0, u_kl = 0.
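A gene is thus just a small bundle of parameters. Below is a minimal sketch of how one might be represented; the field names are my own, and the 3x3 template entries are zero placeholders since only z = 1.0, the initial state, and the boundary condition survive in the transcription.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CNNGene:
    A: np.ndarray       # 3x3 feedback template (weights on neighbor outputs)
    B: np.ndarray       # 3x3 feed-forward template (weights on neighbor inputs)
    z: float            # threshold
    x0: float           # initial state assigned to every cell
    boundary_y: float   # output value assumed outside the array
    boundary_u: float   # input value assumed outside the array

# Hole-Filling gene as stated on the slide, with placeholder template entries.
hole_filling = CNNGene(A=np.zeros((3, 3)), B=np.zeros((3, 3)),
                       z=1.0, x0=1.0, boundary_y=0.0, boundary_u=0.0)
```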

11 Example Application The following is an example of running the Hole-Filling gene, an application commonly used in character recognition, on an input image of the number eight.

12 Hardware CNNs Different companies and research institutions have implemented CNNs in hardware. The array sizes for most implementations are typically small.

Name                  Array Size   Technology Node
VAE                   80x…         … nm
Flak+                 4x4          180 nm
SCAMP                 21x…         … nm
4-layer Gabor Filter  32x…         … nm
ACE16k                128x128      … nm
QCIF                  176x144      … nm
Kingetset+ (est.)     128x…        … nm

13 Issue: Image Dimensions vs. CNN Array Dimensions Generally, most applications assume the CNN array's dimensions match the input (image) dimensions. But what if the CNN array is smaller? Option 1: shrink the image. Issue: the image-size-to-CNN-size ratio — shrinking a 1024x1024 image onto a 128x128 CNN is a 64:1 ratio! Option 2: emulate the ideal CNN through multiplexing.

14 Ideal Multiplexing [Figure: the CNN state at T=0 split into Partitions 0–3, each run in turn on the CNN array; timeline of Rd. P0 u, Rd. P0 x(0), Comp, Wr. P0 x(1), Rd. P1 u, … over time t.] With ideal multiplexing, we transfer and run each partition on the CNN array for one CNN unit of time. This is heavily memory bound: the number of input/state data transfers is proportional to the convergence time, and there is too little work per transfer to utilize memory-level parallelism (MLP). What if we add another CNN array to take advantage of the independence between computations?
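As a point of reference, ideal multiplexing reduces to the loop below: one load/compute/store round per partition per CNN time unit. This is a sketch assuming a hypothetical cnn array object, partition records, and a converged() predicate, not the paper's code.

```python
def ideal_multiplex(partitions, cnn, converged):
    """Emulate a full-size CNN on a small array, one time unit at a time."""
    t = 0
    while not converged(partitions):
        for p in partitions:      # Rd. P u, Rd. P x(t) -> Comp -> Wr. P x(t+1)
            cnn.load(p.u, p.state)
            cnn.run(steps=1)      # only one CNN unit of work per transfer round
            p.state = cnn.store()
        t += 1                    # memory traffic grows with convergence time
    return t
```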

15 Key Insight Let's take advantage of a CNN's ability to still converge to the correct solution in the presence of small errors. So, instead of running a partition on the CNN array for one unit of time, we run the partition for an interval T >> 1 units of time. This is the key insight behind our Scalable and Programmable CNN (SP-CNN) architecture.

16 SP-CNN Multiplexing With SP-CNN multiplexing, we run each partition on the CNN array for a certain interval of time rather than a single CNN unit. [Figure: timelines contrasting a run of 1 CNN unit of time (Rd. P0 u, Rd. P0 x(0), Comp, Wr. P0 x(1)) with a run of INTERVAL CNN units of time (INTERVAL = 4), shown across the 1st and 2nd iterations in cnn-time t.]

17 Boundary Conditions When running a partition on the CNN array, what should we set the boundary conditions to? To ensure that information propagates between partitions, we set the boundary condition for a given partition to the values of its neighboring cells: boundary pixels that border another partition take that partition's data, while boundary pixels on the edge of the image take the image's true boundary condition. [Figure: Partition 0 and Partition 1 pixels, with Partition 0's boundary ring built from Partition 1 data where the partitions meet and from the true boundary condition along the image edge.]
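A sketch of how a partition's boundary might be assembled, assuming the full state image lives in global memory: the partition is copied out with a one-cell halo whose cells come from neighboring partitions where they exist, and from the image's true boundary value elsewhere. Names and layout are illustrative.

```python
import numpy as np

def partition_with_halo(state, r0, c0, P, true_boundary=0.0):
    """Extract the P x P partition at (r0, c0) plus a 1-cell boundary ring."""
    M, N = state.shape
    halo = np.full((P + 2, P + 2), true_boundary)
    rs, re = max(r0 - 1, 0), min(r0 + P + 1, M)   # clamp the ring to the image
    cs, ce = max(c0 - 1, 0), min(c0 + P + 1, N)
    # Ring cells inside the image carry neighboring-partition data;
    # ring cells outside the image keep the true boundary value.
    halo[rs - r0 + 1:re - r0 + 1, cs - c0 + 1:ce - c0 + 1] = state[rs:re, cs:ce]
    return halo
```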

18 Boundary Conditions Example [Figure: a four-partition image (P0–P3) multiplexed onto the CNN array, surrounded by the true boundary condition; successive passes over the four partitions advance cnn-time in INTERVAL steps, with T_2 = T_1 + 4*INTERVAL and T_3 = T_2 + 4*INTERVAL.]

19 Total Time and Virtual Time To compare the convergence of the Ideal-CNN vs. SP-CNN, we introduce the concept of virtual time. [Figure: timelines for the Ideal-CNN (annotated v(t) = 4) and for SP-CNN with INTERVAL = 4 (annotated v(t) = 8 across its 1st and 2nd iterations).]

20 Reduced Memory Transfers If the virtual convergence time does not significantly increase in the SP-CNN case, we can significantly reduce the number of memory transfers. [Figure: the Ideal-CNN performs T memory-transfer rounds over partitions P0–P3, one per virtual time unit up to its virtual convergence time T, while SP-CNN with INTERVAL = 4 performs only T_2/INTERVAL rounds up to its virtual convergence time T_2.] If T ≈ T_2, then memory transfers are reduced by a factor of roughly INTERVAL.
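A quick back-of-the-envelope check of that claim, with made-up but representative numbers; T, T_2, INTERVAL, and the partition count below are assumptions for illustration only.

```python
T, T2 = 512, 512          # virtual convergence times (assumed roughly equal)
INTERVAL = 128            # CNN time units per partition visit under SP-CNN
PARTITIONS = 64           # e.g., a 1024x1024 image on a 128x128 array

ideal_rounds = T * PARTITIONS                 # one transfer round per time unit
spcnn_rounds = (T2 // INTERVAL) * PARTITIONS  # one transfer round per INTERVAL
print(ideal_rounds / spcnn_rounds)            # -> 128.0, i.e. ~INTERVAL
```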

21 Increased MLP By increasing the computation time per partition, we can take advantage of parallel computation within an iteration. [Figure: CNN0 computes on P0 (INTERVAL = 4) while CNN1 reads P1's input and state and CNN2 begins reading P2's input.] Memory-level parallelism increases because the lengthier computation gives other units time to stream their data.

22 SP-CNN Architecture Host processor: sends work to the SP-CNN. Global memory: stores the input and CNN state. Scheduler: assigns partitions to the CNN-P units. CNN-P: a CNN processing unit that processes a partition.
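In code form the scheduler's job is simple; one plausible policy (the slides do not specify it, so this static round-robin assignment is purely illustrative) is shown below.

```python
def assign_round_robin(partitions, num_cnn_p):
    """Statically assign partitions across CNN-P units, round-robin."""
    plan = [[] for _ in range(num_cnn_p)]
    for i, p in enumerate(partitions):
        plan[i % num_cnn_p].append(p)   # unit i % num_cnn_p gets partition i
    return plan

# Example: 8 partitions over CNN-P = 4 units -> 2 partitions per unit.
print(assign_round_robin(list(range(8)), 4))
```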

23 Methodology 6 benchmarks with 10 input images. The test images are of size 1024x1024 (~720p) or 2048x2048 (~1440p).

Name     Mechanism
Ideal    CNN array size equal to the image size
SP-CNN   Proposed architecture

Application          Type
Corner Detection     Local*
Edge Detection       Local*
Connected Component  Global
Hole Filling         Global
Rotation Detector    Global
Shadow Creator       Global

*Local applications are essentially simple convolution operations that only care about neighborhood values.

24 Simulators Functional simulator specifications: CNN array size = 128x128; interval time = 128 CNN time units. Timing simulator (DRAMSim2 for memory): CNN timing parameters based on the VAE architecture* (130 nm); 1 processing element per 32 cells; each CNN-P unit has 64 MSHR entries (based on NVIDIA Fermi); 2 GB DRAM with DDR3 timing parameters. *VAE is an existing hardware CNN chip implementation [Lee et al. 2008].

25 Virtual Convergence Results [Chart: virtual convergence time (CNN time units) for Ideal (1024x1024), Ideal (2048x2048), SP-CNN (1024x1024), and SP-CNN (2048x2048) across Crn-Detect, Edge-Detect, Conn-Comp, Hole-Fill, Rot-Detect, and Shadow.] The virtual convergence time of SP-CNN is comparable to the ideal case for both image dimensions.

26 Timing Results for 1024x1024 [Chart: time (ms) for CNN-P = 1, 2, 4, 8 across the six benchmarks, with the 30 FPS and 60 FPS boundaries marked.] SP-CNN meets 60 FPS for most applications when CNN-P = 1, and for all applications when CNN-P = 8.

27 Timing Results for 2048x2048 [Chart: time (ms) for CNN-P = 1, 2, 4, 8 across the six benchmarks, with the 30 FPS and 60 FPS boundaries marked.] SP-CNN meets 30 FPS for most applications when CNN-P = 8; 60 FPS is difficult to meet even with CNN-P = 8.

28 Future Work Try more computationally intensive CNN applications: video processing, face recognition, pattern recognition, etc. FPGA implementation of SP-CNN, which will allow more accurate performance and power comparisons against CPUs, GPUs, etc. Further SP-CNN optimizations: determine optimal interval values, study boundary-condition propagation time, and determine whether a given CNN gene is suitable for SP-CNN.

29 Conclusion Our proposed SP-CNN architecture brings scalability to small hardware CNN arrays. SP-CNN shows performance comparable to the ideal case in terms of virtual convergence time, with energy consumption of around 10 to 30 mJ. SP-CNN can meet the 30 FPS and 60 FPS standards for most benchmarks, and this can be further improved with better scaling of the CNN technology.

30 Questions?


32 Digital CNN Operation $$x_{ij}(t+1) = \sum_{C(k,l) \in N_r(i,j)} a_{kl}\, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl}\, u_{kl} + z$$ [Worked example: the slide evaluates this update for a specific gene (A = …, B = …, z = …) on cells with values such as y(t) = 1.0, u = -1.0 and y(t) = 0.75, u = -1.0.]
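A one-cell numeric replay of that update, with illustrative center weights standing in for the slide's lost template values (only the center weights matter for a single isolated cell):

```python
# Assumed center weights a_c, b_c and threshold z; y(t) and u as on the slide.
a_c, b_c, z = 1.0, 2.0, -1.0
y_t, u_c = 1.0, -1.0
x_next = a_c * y_t + b_c * u_c + z   # 1.0*1.0 + 2.0*(-1.0) + (-1.0)
print(x_next)                        # -2.0
```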

33 The basis of the CNN operation is that we eventually converge to the correct solution. However, there are no guarantees on the convergence time, and different inputs can converge at different rates. [Chart: output error (100% down to 0%) of running the Hole-Filling gene on a CNN for various 1024x1024 test images: capitala, circle, eight, filledsquare, lowera, nine, rect, seven, vertline, zero.]

34 Slow Propagation With slow propagation, updated information in the boundary conditions is seen at the beginning of the next iteration. [Figure: partitions 0 and 1 each step from y(0) through y(0)'s output to y(1); partition 1 sees partition 0's updated boundary outputs only when the next iteration starts.]

35 The SP-CNN algorithm can be viewed below:

while change do
    change = false
    for p in partitions do
        loadStateAndInputToCNN(p)
        for n = 0; n < interval; n++, t++ do
            change |= parallel computation of state
            parallel computation of output
        end for
        saveToNextStateFromCNN(p)
    end for
    swapStateAndNextStatePointers()
    iter += 1, vt += interval
end while
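The same algorithm rendered as runnable Python, with the hardware interactions stubbed behind a hypothetical cnn object; the load/step/store method names are mine, not the paper's.

```python
def sp_cnn(partitions, cnn, interval):
    """SP-CNN multiplexing loop: each partition runs `interval` steps per visit."""
    t = vt = iters = 0
    change = True
    while change:                                # until no cell changes anywhere
        change = False
        for p in partitions:
            cnn.load(p.u, p.state)               # loadStateAndInputToCNN(p)
            for _ in range(interval):
                change |= cnn.step_state()       # parallel state computation
                cnn.step_output()                # parallel output computation
                t += 1
            p.next_state = cnn.store()           # saveToNextStateFromCNN(p)
        for p in partitions:                     # swapStateAndNextStatePointers
            p.state, p.next_state = p.next_state, p.state
        iters += 1
        vt += interval
    return vt, iters
```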

36 Fast Propagation With fast propagation, updated information in the boundary conditions is seen at the beginning of the next interval. [Figure: partition 0 steps from y(0) through y(0)'s output to y(1); partition 1, run next within the same iteration, already sees partition 0's updated boundary outputs.]

37 Using fast propagation, the partition order can improve program convergence time. For example, if data is passed through the cells from right to left, then we can converge faster by moving through partitions in reverse column-major order. Partition ordering only matters if fast propagation is used.

38 [Chart: average speedup in total convergence time from fast propagation, for 1024x1024 and 2048x2048 images, across the six benchmarks.] For Connected Component, Hole Filling, and Rotation Detector, fast propagation can provide an average speedup of 15% to 30%. Shadow Creator shows no benefit since information propagation in that benchmark is from right to left.

39 [Chart: average speedup in total convergence time for 1024x1024 and 2048x2048 images.] With Shadow Creator, using reverse row-major order can provide average speedups of 13% and 30% for the two image sizes, respectively.

40 Virtual Convergence Time Originally: $\text{virtualConvTime} = n \cdot T$, where $n$ is the number of iterations. Now: $\text{virtualConvTime} = \sum_{i=1}^{n} \min\!\left(T,\ \max_{p \in \text{Partitions}} \text{convtime}(p, i)\right)$, where convtime(p, i) is the convergence time for partition p during iteration i.
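Stated as code, assuming the per-iteration, per-partition convergence times have been collected into a nested list:

```python
def virtual_conv_time(conv_times, T):
    """conv_times[i][p] = convtime(p, i); T is the interval length."""
    return sum(min(T, max(per_partition)) for per_partition in conv_times)

# Example: min(128, 128) + min(128, 60) = 188 virtual time units.
print(virtual_conv_time([[100, 128], [40, 60]], T=128))
```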

41 [Chart: percent error (0–100%) for No-Share avg., No-Share max., Share avg., and Share max. across the six benchmarks.] Both naïve mechanisms show errors averaging about 20%, with maximum errors as high as 90% for global applications. Local-type applications only show error under the No-Share mechanism. SP-CNN converged to the correct solution for all benchmarks and tests.

42 [Chart: average execution-time speedup for CNN-P = 2, 4, 8 at 1024x1024 and 2048x2048 across the six benchmarks.] For the global applications, memory bandwidth contention limits the ability to achieve linear scaling. Shadow shows very poor scaling because many of its partitions converge quickly, so CNN occupation time is low.

43 [Chart: average execution-time speedup from prefetching vs. the number of CNN-P units across the six benchmarks.] Prefetching only provides some benefit when the number of CNN-Ps is 1. As the number of CNN-Ps scales up, memory bandwidth contention becomes too large, and prefetching can actually cause slowdowns.

44 We also evaluated our SP-CNN mechanisms against CPU and GPU implementations of the benchmarks. For two benchmarks, Hole Filling and Rotation Detector, we could not easily develop a CPU/GPU algorithm, so we collected their timing results by emulating the CNN's operation on the corresponding platform.

Name    Model             Power                             Frequency  Technology
CPU     Intel Core i…     … W (TDP)                         3.3 GHz    22 nm
GPU     NVIDIA K1 mobile  1.5 W                             900 MHz    28 nm
SP-CNN  —                 0.73 mW per PE, 35.6 µW per node  200 MHz    45 nm

45 [Chart: time (ms) for CPU, GPU, SP-CNN (CNN-P = 1), SP-CNN (CNN-P = 4), Ideal-Mul (CNN-P = 1), and Ideal-Mul (CNN-P = 4) across the six benchmarks, with the 30 and 60 FPS boundaries marked; the emulated results, off the chart, are CPU*: 2923 / GPU*: 1006 and CPU*: 3004 / GPU*: 1038 for Hole-Fill* and Rot-Detect*.] For simple global applications like Connected Component and Shadow, the CPU/GPU versions perform better than the SP-CNN versions. However, emulation of a CNN is prohibitively slow, and for those applications SP-CNN provides much better performance.

46 [Chart: energy (mJ) for CPU, GPU, SP-CNN (CNN-P = 1), and SP-CNN (CNN-P = 4) across the six benchmarks.] For Connected Component and Shadow, execution time outweighs the low power consumption of SP-CNN, causing overall energy to be larger than in the GPU case. Hole Filling and Rotation Detector do show cases where SP-CNN can perform complex tasks at a low energy cost.

47 Optimization: Early-Finish Originally, each partition runs on the CNN for the entire interval time T. We saw cases where a partition converged on the CNN before the interval finished, so the rest of the execution essentially does nothing. We therefore introduced Early-Finish, where a partition runs on the CNN until it either converges or the interval time T is reached.
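In terms of the earlier loop sketch, Early-Finish replaces the fixed-length inner loop with one that can exit as soon as the array reports no change; as before, the cnn object and its step_state() return value are assumptions of the sketch.

```python
def run_partition_early_finish(cnn, interval):
    """Run one partition until it converges or the interval T expires."""
    for n in range(interval):
        changed = cnn.step_state()   # one parallel state update; True if any cell changed
        cnn.step_output()
        if not changed:              # converged early: stop wasting cycles
            return n + 1
    return interval
```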
