Code Generation for GPU Accelerators in the Domain of Image Preprocessing

1 Code Generation for GPU Accelerators in the Domain of Image Preprocessing
Oliver Reiche, Richard Membarth, Frank Hannig, and Jürgen Teich
Hardware/Software Co-Design, University of Erlangen-Nuremberg
Dagstuhl, April 4, 2013

2 Motivation: Medical Image Preprocessing
Keep X-ray dosage and contrast agent as low as possible, which results in noisy images.
Improve the quality of medical images: remove noise, detect edges, compensate for detector defects.
An implementation must be as efficient as possible.
Most algorithms are well known. So why do we need code generation?

3 Challenge: How To Target Multiple Architectures?
Efficient code generation for different target architectures.
Domain-specific languages:
  performance
    portable: high performance on different target hardware
    competitive: performance comparable to hand-written code
  productivity
    algorithm description at a high level
    low-level details hidden from the programmer
  portability
    support different target architectures from the same algorithm description
    support different target languages from the same algorithm description
Domain-specific languages offer both functional and performance portability.

4 Agenda: HIPAcc, Results, Summary

5 HIPAcc

6 HIPAcc: The Heterogeneous Image Processing Acceleration Framework
Compilation flow (diagram): a C++ embedded DSL is fed to a source-to-source compiler based on Clang/LLVM, which draws on domain knowledge and architecture knowledge.
Generated targets: CUDA (GPU), OpenCL (x86/GPU), C/C++ (x86), Renderscript (x86/ARM/GPU).
The generated code runs on top of a CUDA/OpenCL/Renderscript runtime library.

7 Domain Analysis: Image Processing Kernel Categorization
Identified three groups of kernels:
  Point operators [HPPC 11]
    each pixel is updated independently of all other pixels
  Local operators [IPDPS 12]
    centered at the pixel they are applied to, [0,0]
    bounded to the neighborhood [-m,+m] × [-n,+n]
    the operator can be applied in parallel
  Global operators [ISPDC 12]
    pixels of the whole image contribute to the result
    for instance, reduction operators
[HPPC 11] Richard Membarth, Anton Lokhmotov, and Jürgen Teich. Generating GPU Code from a High-level Representation for Image Processing Kernels. In: Proceedings of the 5th Workshop on Highly Parallel Processing on a Chip (HPPC). Springer. Bordeaux, France, Aug. 30, 2011, pp DOI: / _31.
[IPDPS 12] Richard Membarth et al. Generating Device-specific GPU Code for Local Operators in Medical Imaging. In: Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS). IEEE. Shanghai, China, May 21-25, 2012, pp DOI: /IPDPS
[ISPDC 12] Richard Membarth et al. Automatic Optimization of In-Flight Memory Transactions for GPU Accelerators based on a Domain-Specific Language for Medical Imaging. In: Proceedings of the 11th International Symposium on Parallel and Distributed Computing (ISPDC). IEEE. Munich, Germany, June 25-29, 2012, pp DOI: /ISPDC
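The three kernel categories can be illustrated with plain C++ loops. This is only a sketch for orientation and not HIPAcc code: the `Image` struct and the functions `scale`, `box_sum`, and `max_value` are hypothetical names chosen for this example, and border pixels of the local operator are simply copied.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical row-major grayscale image, used only for this illustration.
struct Image {
    int width;
    int height;
    std::vector<float> data;
    float at(int x, int y) const { return data[y * width + x]; }
};

// Point operator: each output pixel depends only on the same input pixel.
Image scale(const Image& in, float factor) {
    Image out = in;
    for (float& v : out.data) v *= factor;
    return out;
}

// Local operator: each output pixel depends on the [-m,+m] x [-n,+n]
// neighborhood around it. Border pixels are copied unchanged here.
Image box_sum(const Image& in, int m, int n) {
    assert(m >= 0 && n >= 0);
    Image out = in;
    for (int y = n; y < in.height - n; ++y) {
        for (int x = m; x < in.width - m; ++x) {
            float sum = 0.0f;
            for (int dy = -n; dy <= n; ++dy)
                for (int dx = -m; dx <= m; ++dx)
                    sum += in.at(x + dx, y + dy);
            out.data[y * in.width + x] = sum;
        }
    }
    return out;
}

// Global operator: all pixels contribute to one result (a reduction).
float max_value(const Image& in) {
    assert(!in.data.empty());
    return *std::max_element(in.data.begin(), in.data.end());
}
```

In the point and local cases every output pixel can be computed independently, which is what makes them natural candidates for one GPU thread per pixel; the reduction needs cooperation between threads.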

8 HIPAcc: The Heterogeneous Image Processing Acceleration Framework
Domain-specific extensions:
  IterationSpace: defines the ROI of the output image
  Accessor: input ROI with filtering (nearest, bilinear, bicubic, ...)
  BoundaryCondition: boundary handling modes
  Mask: convolution mask
IterationSpace (figure panels): Output image. Crop of output image. Crop of output image with offset.
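A region of interest given by a size and an offset can be pictured as a restricted loop nest over the output image. The following is a minimal sketch in plain C++ under that assumption; `Roi` and `fill_roi` are hypothetical names for this example and are not part of the framework.

```cpp
#include <cassert>
#include <vector>

// Hypothetical region of interest: size plus offset into the image.
struct Roi {
    int width;
    int height;
    int offset_x;
    int offset_y;
};

// Writes `value` to every pixel inside the ROI of a row-major image and
// returns the number of pixels written. Pixels outside stay untouched.
int fill_roi(std::vector<float>& img, int img_width, int img_height,
             const Roi& roi, float value) {
    assert(roi.offset_x >= 0 && roi.offset_y >= 0);
    assert(roi.offset_x + roi.width <= img_width);
    assert(roi.offset_y + roi.height <= img_height);
    int written = 0;
    for (int y = 0; y < roi.height; ++y) {
        for (int x = 0; x < roi.width; ++x) {
            img[(roi.offset_y + y) * img_width + (roi.offset_x + x)] = value;
            ++written;
        }
    }
    return written;
}
```

A ROI of half the image size with an offset of a quarter of the image size, for example, covers the centered crop of the output.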

9 Accessor (figure panels): Image and boundary. Image crop. Image crop with offset. Image offset.

10 BoundaryCondition (figure): a 4×4 image with pixels A to P, extended beyond its borders under the four boundary handling modes Repeat, Clamp, Mirror, and Constant (constant value Q).
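The four boundary handling modes amount to different ways of mapping an out-of-range coordinate back into the image. The sketch below shows one possible index mapping in plain C++, with Mirror duplicating the edge pixel as in the figure (the pixel left of A is A, then B, then C). The names `Boundary`, `map_index`, and `read_pixel` are assumptions made for this example; the code that HIPAcc actually generates may look different.

```cpp
#include <cassert>
#include <vector>

enum class Boundary { Clamp, Repeat, Mirror, Constant };

// Maps a coordinate i that may lie outside [0, n) back into [0, n).
// Constant mode is handled by the caller, since it returns a value
// instead of an index.
int map_index(int i, int n, Boundary mode) {
    assert(n > 0);
    if (i >= 0 && i < n) return i;
    assert(mode != Boundary::Constant);
    if (mode == Boundary::Clamp) return i < 0 ? 0 : n - 1;
    if (mode == Boundary::Repeat) {
        int r = i % n;
        return r < 0 ? r + n : r;
    }
    // Mirror: reflect with period 2n, duplicating the edge pixel.
    int period = 2 * n;
    int r = i % period;
    if (r < 0) r += period;
    return r < n ? r : period - 1 - r;
}

// Reads pixel (x, y) of a row-major image with the given boundary mode.
float read_pixel(const std::vector<float>& img, int width, int height,
                 int x, int y, Boundary mode, float constant = 0.0f) {
    bool outside = x < 0 || x >= width || y < 0 || y >= height;
    if (mode == Boundary::Constant && outside) return constant;
    int xi = map_index(x, width, mode);
    int yi = map_index(y, height, mode);
    return img[yi * width + xi];
}
```

With a 4×4 image holding the values 0 to 15 in place of A to P, reading at (-1, 0) yields 0 for Clamp, 3 for Repeat, 0 for Mirror, and the constant for Constant.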

11 Mask (figure): the convolution mask plotted as f(x,y) over x and y.

12 HIPAcc Example: Gaussian Blur

    /* ... */
    // input image, output image, and filter mask
    Image<uchar> in(width, height);
    Image<float> out(width, height);
    Mask<float> mask(size, size);

    // assign host data to the DSL objects
    in = in_image;
    out = out_image;
    mask = filter_mask;

    // read the input with clamping at the image border
    BoundaryCondition<uchar> bound(in, mask, BOUNDARY_CLAMP);
    AccessorLF<uchar> acc(bound, width, height, 0, 0);

    // compute only a crop of the output: half the size, offset by a quarter
    IterationSpace<float> iter(out, width/2, height/2, width/4, height/4);

    GaussianBlur filter(iter, acc, mask, size/2);
    filter.execute();

    // retrieve the result
    out_image = out;

13 HIPAcc Example: Gaussian Blur Kernel

    class GaussianBlur : public Kernel<float> {
        Mask<float> mask;
        Accessor<uchar> input;
        size_t range;

      public:
        GaussianBlur(IterationSpace<float> iter, Accessor<uchar> acc,
                     Mask<float> mask, size_t range)
            : Kernel(iter), input(acc), mask(mask), range(range) {
            addAccessor(acc);
        }

        void kernel() {
            // signed copy of the radius, so that the comparisons against
            // negative loop indices are not done in unsigned arithmetic
            const int r = static_cast<int>(range);
            float sum = 0.0f;
            for (int yf = -r; yf <= r; ++yf)
                for (int xf = -r; xf <= r; ++xf)
                    sum += input(xf, yf) * mask(xf, yf);
            output() = sum;
        }
    };

14 HIPAcc Example: Gaussian Blur Kernel + Lambda Function

    class GaussianBlur : public Kernel<float> {
        Mask<float> mask;
        Accessor<uchar> input;
        size_t range;

      public:
        GaussianBlur(IterationSpace<float> iter, Accessor<uchar> acc,
                     Mask<float> mask, size_t range)
            : Kernel(iter), input(acc), mask(mask), range(range) {
            addAccessor(acc);
        }

        // lambda function for convolution
        void kernel() {
            output() = convolve(mask, HipaccSUM, [&]() {
                return input(mask) * mask();
            });
        }
    };
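To make the lambda variant easier to follow, here is a plain C++ analogue of a sum-reducing convolution helper. It is a sketch under stated assumptions, not the framework's `convolve`: the names `Mask` (redefined here as a simple struct) and `convolve_sum` are hypothetical, and the lambda receives the offset and coefficient explicitly, whereas in the slide they are implicit in `input(mask)` and `mask()`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical square mask with odd side length, stored row-major.
struct Mask {
    int size;
    std::vector<float> coeff;
    float at(int dx, int dy) const {
        const int r = size / 2;
        return coeff[(dy + r) * size + (dx + r)];
    }
};

// Calls body(dx, dy, coefficient) for every mask entry and sums the results.
template <typename Body>
float convolve_sum(const Mask& mask, Body body) {
    assert(mask.size % 2 == 1);
    const int r = mask.size / 2;
    float sum = 0.0f;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx)
            sum += body(dx, dy, mask.at(dx, dy));
    return sum;
}
```

A caller would pass a lambda such as `[&](int dx, int dy, float c) { return pixel(x + dx, y + dy) * c; }`, where `pixel` stands for whatever image access the caller uses. Because the loop bounds come from the mask and not from hand-written loops, a code generator is free to unroll or specialize them.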

15 Efficient Code Generation for Boundary Handling
Figure: the image, extended with clamping, is partitioned into nine regions: BH_TL, BH_T, BH_TR, BH_L, BH_NO, BH_R, BH_BL, BH_B, and BH_BR.
generates 10 different code variants
  minimize executed conditionals
  minimize divergence
block index determines code variant
limit necessary boundary handling with respect to mask size and image padding
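How a block index could determine the code variant is sketched below in plain C++. This is an assumption about the selection logic, not the generated code: `select_region` and its parameters are hypothetical, and `BH_ALL` is a catch-all added for this example (blocks that touch two opposite borders) that the slide does not name.

```cpp
#include <cassert>

enum Region {
    BH_TL, BH_T, BH_TR,
    BH_L, BH_NO, BH_R,
    BH_BL, BH_B, BH_BR,
    BH_ALL  // hypothetical fallback, not named in the slide
};

// Picks the boundary handling region for the block (bx, by), given the
// block size, the image size, and the mask half-sizes (half_x, half_y).
Region select_region(int bx, int by, int block_w, int block_h,
                     int width, int height, int half_x, int half_y) {
    assert(block_w > 0 && block_h > 0);
    const int x0 = bx * block_w, x1 = x0 + block_w - 1;
    const int y0 = by * block_h, y1 = y0 + block_h - 1;
    // A side needs handling only if the mask can reach past that border.
    const bool left = x0 - half_x < 0;
    const bool right = x1 + half_x >= width;
    const bool top = y0 - half_y < 0;
    const bool bottom = y1 + half_y >= height;
    if ((left && right) || (top && bottom)) return BH_ALL;
    if (top) return left ? BH_TL : (right ? BH_TR : BH_T);
    if (bottom) return left ? BH_BL : (right ? BH_BR : BH_B);
    if (left) return BH_L;
    if (right) return BH_R;
    return BH_NO;
}
```

Since the decision depends only on the block index, all threads of a block take the same path, and interior blocks (`BH_NO`) execute no boundary conditionals at all.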

16 Mapping of GPU Memory Accesses
Image preprocessing is mostly load, compute, store, and therefore memory bound.
Architecture model:
  Memory type: global memory, constant memory, texture memory, surface memory, local memory
  Optimizations: memory access alignment, unrolling
  Target (e.g., Kepler35, SouthernIsland, Midgard, ...)
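One common form of memory access alignment is padding each image row so that every row starts at an aligned address. The helper below is a minimal sketch of that arithmetic; `aligned_stride` is a hypothetical name, and the alignment value a given target needs is an assumption the caller has to supply.

```cpp
#include <cassert>
#include <cstddef>

// Returns the row stride in elements such that each row of `width` elements
// of `elem_size` bytes starts at a multiple of `alignment` bytes.
std::size_t aligned_stride(std::size_t width, std::size_t elem_size,
                           std::size_t alignment) {
    assert(elem_size > 0 && alignment > 0 && alignment % elem_size == 0);
    const std::size_t row_bytes = width * elem_size;
    const std::size_t padded =
        (row_bytes + alignment - 1) / alignment * alignment;
    return padded / elem_size;
}
```

For example, a row of 1000 `float` pixels with an assumed 256-byte alignment is padded from 4000 to 4096 bytes, which gives a stride of 1024 elements.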

17 Results

18 Results
Gaussian Blur 5×5 (separated): 12 vs. 238 lines of CUDA code
  Columns: Undef., Clamp, Repeat, Mirror, Const.
  naïve: crash
  OpenCV: n/a
  RapidMind: crash, n/a
  Halide: n/a, 8.93, n/a, n/a, n/a
  NPP (8-bit): 6.86, n/a, n/a, n/a, n/a
  HIPAcc CUDA:
  Image of pixels on a Tesla C2050. Times in ms.
Bilateral Grid Filter: 62 vs. 386 lines of CUDA code
  Columns: GTX 680, i
  Handtuned CUDA: n/a
  Halide:
  HIPAcc CUDA: n/a
  HIPAcc OpenCL:
  Image of pixels. Times in ms.

19 Summary

20 Summary
HIPAcc:
  domain-specific language for image preprocessing
    optimizations tailored to the application domain
  architecture model for GPU accelerators
    target-specific code generator for CUDA and OpenCL
  transformations based on
    domain knowledge
    architecture information
  provides performance, productivity, and portability
Recent work:
  three new target backends: Renderscript, Renderscript GPU, Filterscript
  support for the Midgard architecture (ARM Mali T604)

21 Questions? HIPAcc framework sources released under Simplified BSD License.

22 Results: Gaussian Blur, 5×5 window (figure): two charts of execution time [ms], one for the Samsung Exynos 5 (CPU and GPU; CL, RS, FS, CV, SI) and one for the Qualcomm Snapdragon S4 Pro (CPU and GPU; RS, FS, CV, SI).

A CUDA Solver for Helmholtz Equation

A CUDA Solver for Helmholtz Equation Journal of Computational Information Systems 11: 24 (2015) 7805 7812 Available at http://www.jofcis.com A CUDA Solver for Helmholtz Equation Mingming REN 1,2,, Xiaoguang LIU 1,2, Gang WANG 1,2 1 College

More information

arxiv: v1 [hep-lat] 7 Oct 2010

arxiv: v1 [hep-lat] 7 Oct 2010 arxiv:.486v [hep-lat] 7 Oct 2 Nuno Cardoso CFTP, Instituto Superior Técnico E-mail: nunocardoso@cftp.ist.utl.pt Pedro Bicudo CFTP, Instituto Superior Técnico E-mail: bicudo@ist.utl.pt We discuss the CUDA

More information

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance

More information

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters

Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Jonathan Lifflander, G. Carl Evans, Anshu Arya, Laxmikant Kale University of Illinois Urbana-Champaign May 25, 2012 Work is overdecomposed

More information

11 Parallel programming models

11 Parallel programming models 237 // Program Design 10.3 Assessing parallel programs 11 Parallel programming models Many different models for expressing parallelism in programming languages Actor model Erlang Scala Coordination languages

More information

The Panel: What does the future look like for NPW application development? 17 th ECMWF Workshop on High Performance Computing in Meteorology

The Panel: What does the future look like for NPW application development? 17 th ECMWF Workshop on High Performance Computing in Meteorology The Panel: What does the future look like for NPW application development? 17 th ECMWF Workshop on High Performance Computing in Meteorology 16:00-17:30 27 October 2016 Panelists John Michalakes (UCAR,

More information

Automatic Star-tracker Optimization Framework. Andrew Tennenbaum The State University of New York at Buffalo

Automatic Star-tracker Optimization Framework. Andrew Tennenbaum The State University of New York at Buffalo SSC17-VIII-6 Automatic Star-tracker Optimization Framework Andrew Tennenbaum The State University of New York at Buffalo aztennen@buffalo.edu Faculty Advisor: John Crassidis The State University of New

More information

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

Accelerating Quantum Chromodynamics Calculations with GPUs

Accelerating Quantum Chromodynamics Calculations with GPUs Accelerating Quantum Chromodynamics Calculations with GPUs Guochun Shi, Steven Gottlieb, Aaron Torok, Volodymyr Kindratenko NCSA & Indiana University National Center for Supercomputing Applications University

More information

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics Jorge González-Domínguez Parallel and Distributed Architectures Group Johannes Gutenberg University of Mainz, Germany j.gonzalez@uni-mainz.de

More information

sri 2D Implicit Charge- and Energy- Conserving Particle-in-cell Application Using CUDA Christopher Leibs Karthik Murthy

sri 2D Implicit Charge- and Energy- Conserving Particle-in-cell Application Using CUDA Christopher Leibs Karthik Murthy 2D Implicit Charge- and Energy- Conserving sri Particle-in-cell Application Using CUDA Christopher Leibs Karthik Murthy Mentors Dana Knoll and Allen McPherson IS&T CoDesign Summer School 2012, Los Alamos

More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method

Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method NUCLEAR SCIENCE AND TECHNIQUES 25, 0501 (14) Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method XU Qi ( 徐琪 ), 1, YU Gang-Lin ( 余纲林 ), 1 WANG Kan ( 王侃 ),

More information

Introduction to numerical computations on the GPU

Introduction to numerical computations on the GPU Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming

More information

Introduction GNURadio software implementation RFNoC implementation The End. RFNoC: fosphor. How to apply RFNoC to RTSA display acceleration

Introduction GNURadio software implementation RFNoC implementation The End. RFNoC: fosphor. How to apply RFNoC to RTSA display acceleration How to apply RFNoC to RTSA display acceleration FOSDEM 2015, February 1st, 2015 About the speaker Linux and free software enthusiast since 1999 M.Sc. in C.S. + some E.E. General orientation towards low

More information

Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign

Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign Hydra: Generation and Tuning of parallel solutions for linear algebra equations Alexandre X. Duchâteau University of Illinois at Urbana Champaign Collaborators Thesis Advisors Denis Barthou (Labri/INRIA

More information

Lecture 3: Linear Filters

Lecture 3: Linear Filters Lecture 3: Linear Filters Professor Fei Fei Li Stanford Vision Lab 1 What we will learn today? Images as functions Linear systems (filters) Convolution and correlation Discrete Fourier Transform (DFT)

More information

Histogram Processing

Histogram Processing Histogram Processing The histogram of a digital image with gray levels in the range [0,L-] is a discrete function h ( r k ) = n k where r k n k = k th gray level = number of pixels in the image having

More information

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign

More information

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library High Performance Modular Exponentiation A^K mod P Where A,

More information

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!

More information

COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD

COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD XVIII International Conference on Water Resources CMWR 2010 J. Carrera (Ed) c CIMNE, Barcelona, 2010 COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD James.E. McClure, Jan F. Prins

More information

Lecture 3: Linear Filters

Lecture 3: Linear Filters Lecture 3: Linear Filters Professor Fei Fei Li Stanford Vision Lab 1 What we will learn today? Images as functions Linear systems (filters) Convolution and correlation Discrete Fourier Transform (DFT)

More information

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge González-Domínguez*, Bertil Schmidt*, Jan C. Kässens**, Lars Wienbrandt** *Parallel and Distributed Architectures

More information

Marawacc: A Framework for Heterogeneous Computing in Java

Marawacc: A Framework for Heterogeneous Computing in Java : A Framework for Heterogeneous Computing in Java Juan Fumero, Michel Steuwer, Christophe Dubach Code The University of Edinburgh UK Many-Core Developer Conference 2016 1 / 23 Code 2 / 23 Code 3 / 23 Code

More information

Image Filtering. Slides, adapted from. Steve Seitz and Rick Szeliski, U.Washington

Image Filtering. Slides, adapted from. Steve Seitz and Rick Szeliski, U.Washington Image Filtering Slides, adapted from Steve Seitz and Rick Szeliski, U.Washington The power of blur All is Vanity by Charles Allen Gillbert (1873-1929) Harmon LD & JuleszB (1973) The recognition of faces.

More information

LoG Blob Finding and Scale. Scale Selection. Blobs (and scale selection) Achieving scale covariance. Blob detection in 2D. Blob detection in 2D

LoG Blob Finding and Scale. Scale Selection. Blobs (and scale selection) Achieving scale covariance. Blob detection in 2D. Blob detection in 2D Achieving scale covariance Blobs (and scale selection) Goal: independently detect corresponding regions in scaled versions of the same image Need scale selection mechanism for finding characteristic region

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

Deep Learning. Convolutional Neural Networks Applications

Deep Learning. Convolutional Neural Networks Applications Deep Learning Using a Convolutional Neural Network Dr. Ing. Morris Riedel Adjunct Associated Professor School of Engineering and Natural Sciences, University of Iceland Research Group Leader, Juelich Supercomputing

More information

The Fast Multipole Method in molecular dynamics

The Fast Multipole Method in molecular dynamics The Fast Multipole Method in molecular dynamics Berk Hess KTH Royal Institute of Technology, Stockholm, Sweden ADAC6 workshop Zurich, 20-06-2018 Slide BioExcel Slide Molecular Dynamics of biomolecules

More information

Panorama des modèles et outils de programmation parallèle

Panorama des modèles et outils de programmation parallèle Panorama des modèles et outils de programmation parallèle Sylvain HENRY sylvain.henry@inria.fr University of Bordeaux - LaBRI - Inria - ENSEIRB April 19th, 2013 1/45 Outline Introduction Accelerators &

More information

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015 Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done

More information

Embedded MCSoC Architecture and Period-Peak Detection (PPD) Algorithm for ECG/EKG Processing

Embedded MCSoC Architecture and Period-Peak Detection (PPD) Algorithm for ECG/EKG Processing The 19 th Intelligent System Symposium (FAN2009), Aizu-Wakamatsu, Sep.17-18, 2009 Embedded MCSoC Architecture and Period-Peak Detection (PPD) Algorithm for ECG/EKG Processing Yasuyoshi Haga, Abderazek

More information

Achieving scale covariance

Achieving scale covariance Achieving scale covariance Goal: independently detect corresponding regions in scaled versions of the same image Need scale selection mechanism for finding characteristic region size that is covariant

More information

Practical Free-Start Collision Attacks on 76-step SHA-1

Practical Free-Start Collision Attacks on 76-step SHA-1 Practical Free-Start Collision Attacks on 76-step SHA-1 Inria and École polytechnique, France Nanyang Technological University, Singapore Joint work with Thomas Peyrin and Marc Stevens CWI, Amsterdam 2015

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

Scalable and Power-Efficient Data Mining Kernels

Scalable and Power-Efficient Data Mining Kernels Scalable and Power-Efficient Data Mining Kernels Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the

More information

Digital Image Processing COSC 6380/4393

Digital Image Processing COSC 6380/4393 Digital Image Processing COSC 6380/4393 Lecture 11 Oct 3 rd, 2017 Pranav Mantini Slides from Dr. Shishir K Shah, and Frank Liu Review: 2D Discrete Fourier Transform If I is an image of size N then Sin

More information

Laplacian Filters. Sobel Filters. Laplacian Filters. Laplacian Filters. Laplacian Filters. Laplacian Filters

Laplacian Filters. Sobel Filters. Laplacian Filters. Laplacian Filters. Laplacian Filters. Laplacian Filters Sobel Filters Note that smoothing the image before applying a Sobel filter typically gives better results. Even thresholding the Sobel filtered image cannot usually create precise, i.e., -pixel wide, edges.

More information

Review Smoothing Spatial Filters Sharpening Spatial Filters. Spatial Filtering. Dr. Praveen Sankaran. Department of ECE NIT Calicut.

Review Smoothing Spatial Filters Sharpening Spatial Filters. Spatial Filtering. Dr. Praveen Sankaran. Department of ECE NIT Calicut. Spatial Filtering Dr. Praveen Sankaran Department of ECE NIT Calicut January 7, 203 Outline 2 Linear Nonlinear 3 Spatial Domain Refers to the image plane itself. Direct manipulation of image pixels. Figure:

More information

Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm)

Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Alexander Smith & Khashayar Khavari Department of Electrical and Computer Engineering University of Toronto April 15, 2009 Alexander

More information

GPU Accelerated Markov Decision Processes in Crowd Simulation

GPU Accelerated Markov Decision Processes in Crowd Simulation GPU Accelerated Markov Decision Processes in Crowd Simulation Sergio Ruiz Computer Science Department Tecnológico de Monterrey, CCM Mexico City, México sergio.ruiz.loza@itesm.mx Benjamín Hernández National

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks Yufei Ma, Yu Cao, Sarma Vrudhula,

More information

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 8. Sinan Kalkan

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 8. Sinan Kalkan CENG 783 Special topics in Deep Learning AlchemyAPI Week 8 Sinan Kalkan Loss functions Many correct labels case: Binary prediction for each label, independently: L i = σ j max 0, 1 y ij f j y ij = +1 if

More information

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)

More information

Two case studies of Monte Carlo simulation on GPU

Two case studies of Monte Carlo simulation on GPU Two case studies of Monte Carlo simulation on GPU National Institute for Computational Sciences University of Tennessee Seminar series on HPC, Feb. 27, 2014 Outline 1 Introduction 2 Discrete energy lattice

More information

Colorado School of Mines Image and Multidimensional Signal Processing

Colorado School of Mines Image and Multidimensional Signal Processing Image and Multidimensional Signal Processing Professor William Hoff Department of Electrical Engineering and Computer Science Spatial Filtering Main idea Spatial filtering Define a neighborhood of a pixel

More information

Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS

Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS Berk Hess, Szilárd Páll KTH Royal Institute of Technology GTC 2012 GROMACS: fast, scalable, free Classical molecular dynamics package

More information

Strassen s Algorithm for Tensor Contraction

Strassen s Algorithm for Tensor Contraction Strassen s Algorithm for Tensor Contraction Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn The University of Texas at Austin September 14-15, 2017 Tensor Computation Workshop Flatiron Institute,

More information

Sunrise: Patrik Jonsson. Panchromatic SED Models of Simulated Galaxies. Lecture 2: Working with Sunrise. Harvard-Smithsonian Center for Astrophysics

Sunrise: Patrik Jonsson. Panchromatic SED Models of Simulated Galaxies. Lecture 2: Working with Sunrise. Harvard-Smithsonian Center for Astrophysics Sunrise: Panchromatic SED Models of Simulated Galaxies Lecture 2: Working with Sunrise Patrik Jonsson Harvard-Smithsonian Center for Astrophysics Lecture outline Lecture 1: Why Sunrise? What does it do?

More information

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr

More information

COMPLETE FEATURE COMPARISON LIST

COMPLETE FEATURE COMPARISON LIST COMPLETE FEATURE COMPARISON LIST General Moho Debut Moho Pro Advanced Bone Rigging Ultimate Bone Rigging Smart Bones Read Only Frame-By-Frame Animation Read Only Bezier Handles optimized for animation

More information
