Lecture 16 More Profiling: gperftools, systemwide tools: oprofile, perf, DTrace, etc. ECE 459: Programming for Performance March 6, 2014

Part I gperftools 2 / 49

Introduction to gperftools Google Performance Tools include: CPU profiler. Heap profiler. Heap checker. Faster (multithreaded) malloc. We'll mostly use the CPU profiler: purely statistical sampling; no recompilation, at most relinking; and built-in visual output. 3 / 49

Google Perf Tools profiler usage You can use the profiler without any recompilation. Not recommended: you get worse data.

LD_PRELOAD="/usr/lib/libprofiler.so" \
    CPUPROFILE=test.prof ./test

The other option is to link to the profiler: -lprofiler Both options read the CPUPROFILE environment variable: it states the location to write the profile data. 4 / 49

Other Usage You can use the profiling library directly as well: #include <gperftools/profiler.h> Bracket the code you want profiled with ProfilerStart() and ProfilerStop(). You can change the sampling frequency with the CPUPROFILE_FREQUENCY environment variable. Default value: 100 samples per second. 5 / 49

pprof Usage Like gprof, it will analyze profiling results.

% pprof test test.prof
    Enters "interactive" mode
% pprof --text test test.prof
    Outputs one line per procedure
% pprof --gv test test.prof
    Displays annotated call graph via gv
% pprof --gv --focus=Mutex test test.prof
    Restricts to code paths including a Mutex entry
% pprof --gv --focus=Mutex --ignore=string test test.prof
    Code paths including Mutex but not string
% pprof --list=getdir test test.prof
    (Per-line) annotated source listing for getdir()
% pprof --disasm=getdir test test.prof
    (Per-PC) annotated disassembly for getdir()

Can also output dot, ps, pdf or gif instead of gv. 6 / 49

Text Output Similar to the flat profile in gprof

jon@riker examples master % pprof --text test test.prof
Using local file test.
Using local file test.prof.
Removing killpg from all stack traces.
Total: 300 samples
      95  31.7%  31.7%      102  34.0%  int_power
      58  19.3%  51.0%       58  19.3%  float_power
      51  17.0%  68.0%       96  32.0%  float_math_helper
      50  16.7%  84.7%      137  45.7%  int_math_helper
      18   6.0%  90.7%      131  43.7%  float_math
      14   4.7%  95.3%      159  53.0%  int_math
      14   4.7% 100.0%      300 100.0%  main
       0   0.0% 100.0%      300 100.0%  __libc_start_main
       0   0.0% 100.0%      300 100.0%  _start
7 / 49

Text Output Explained Columns, from left to right: Number of checks (samples) in this function. Percentage of checks in this function. Same as time in gprof. Percentage of checks in the functions printed so far. Equivalent to cumulative (but in %). Number of checks in this function and its callees. Percentage of checks in this function and its callees. Function name. 8 / 49

Graphical Output [Call-graph figure, garbled in transcription: a.out, total samples: 300; focusing on: 300; dropped nodes with <= 1 abs(samples); dropped edges with <= 0 samples. Nodes: _start 0 (0.0%), __libc_start_main 0 (0.0%), main 14 (4.7%) of 300 (100.0%), float_math 18 (6.0%) of 131 (43.7%), int_math 14 (4.7%) of 159 (53.0%), int_math_helper 50 (16.7%) of 137 (45.7%), float_math_helper 51 (17.0%) of 96 (32.0%), int_power 95 (31.7%) of 102 (34.0%), float_power 58 (19.3%); directed edges labelled with sample counts.] 9 / 49

Graphical Output Explained Output was too small to read on the slide. Shows the same numbers as the text output. Directed edges denote function calls. Note: number of samples in callees = (number in this function + callees) - (number in this function). Example: float_math_helper: 51 (local) of 96 (cumulative). 96 - 51 = 45 (callees). Callee int_power = 7 (bogus); callee float_power = 38; callees total = 45. 10 / 49

Things You May Notice Call graph is not exact. In fact, it shows many bogus relations which clearly don't exist. For instance, we know that there are no cross-calls between int and float functions. As with gprof, optimizations will change the graph. You'll probably want to look at the text profile first, then use the focus flag to look at individual functions. 11 / 49

Part II System profiling: oprofile, perf, DTrace, WAIT 12 / 49

Introduction: oprofile http://oprofile.sourceforge.net Sampling-based tool. Uses CPU performance counters. Tracks currently-running function; records profiling data for every application run. Can work system-wide (across processes). Technology: Linux Kernel Performance Events (formerly a Linux kernel module). 13 / 49

Setting up oprofile Must run as root to use system-wide; otherwise can use per-process.

% sudo opcontrol \
    --vmlinux=/usr/src/linux-3.2.7-1-ARCH/vmlinux
% echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog
% sudo opcontrol --start
Using default event: CPU_CLK_UNHALTED:100000:0:1:1
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.

Per-process:

[plam@lynch nm-morph]$ operf ./test_harness
operf: Profiler started
Profiling done.

14 / 49

oprofile Usage (1) Pass your executable to opreport.

% sudo opreport -l ./test
CPU: Intel Core/i7, speed 1595.78 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted)
    with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
7550     26.0749  int_math_helper
5982     20.6596  int_power
5859     20.2348  float_power
3605     12.4504  float_math
3198     11.0447  int_math
2601      8.9829  float_math_helper
160       0.5526  main

If you have debug symbols (-g) you could use:

% sudo opannotate --source \
    --output-dir=/path/to/annotated-source /path/to/mybinary

15 / 49

oprofile Usage (2) Use opreport by itself for a whole-system view. You can also reset and stop the profiling.

% sudo opcontrol --reset
Signalling daemon... done
% sudo opcontrol --stop
Stopping profiling.

16 / 49

Perf: Introduction https://perf.wiki.kernel.org/index.php/tutorial Interface to Linux kernel built-in sampling-based profiling. Per-process, per-cpu, or system-wide. Can even report the cost of each line of code. 17 / 49

Perf: Usage Example On last year's Assignment 3 code:

[plam@lynch nm-morph]$ perf stat ./test_harness

Performance counter stats for './test_harness':

    6562.501429 task-clock                # 0.997 CPUs utilized
            666 context-switches          # 0.101 K/sec
              0 cpu-migrations            # 0.000 K/sec
          3,791 page-faults               # 0.578 K/sec
 24,874,267,078 cycles                    # 3.790 GHz [83.32%]
 12,565,457,337 stalled-cycles-frontend   # 50.52% frontend cycles idle [83.31%]
  5,874,853,028 stalled-cycles-backend    # 23.62% backend cycles idle [66.63%]
 33,787,408,650 instructions              # 1.36 insns per cycle
                                          # 0.37 stalled cycles per insn [83.32%]
  5,271,501,213 branches                  # 803.276 M/sec [83.38%]
    155,568,356 branch-misses             # 2.95% of all branches [83.36%]

    6.580225847 seconds time elapsed

18 / 49

Perf: Source-level Analysis perf can tell you which instructions are taking time, or which lines of code. Compile with -ggdb to enable source code viewing.

% perf record ./test_harness
% perf annotate

perf annotate is interactive. Play around with it. 19 / 49

DTrace: Introduction http://queue.acm.org/detail.cfm?id=1117401 Instrumentation-based tool. System-wide. Meant to be used on production systems. (Eh?) 20 / 49

DTrace: Introduction http://queue.acm.org/detail.cfm?id=1117401 Instrumentation-based tool. System-wide. Meant to be used on production systems. (Eh?) (Typical instrumentation can have a slowdown of 100x (Valgrind).) Design goals: 1 No overhead when not in use; 2 Guarantee safety: must not crash (strict limits on expressiveness of probes). 21 / 49

DTrace: Operation How does DTrace achieve 0 overhead? Only when activated, it dynamically rewrites code by placing a branch to instrumentation code. Uninstrumented code runs as if nothing changed. Most instrumentation is at function entry or exit points. You can also instrument kernel functions, locking, or instrument based on other events. Can express sampling as instrumentation-based events also. 22 / 49

DTrace Example You write this:

syscall::read:entry
{
    self->t = timestamp;
}

syscall::read:return
/self->t/
{
    printf("%d/%d spent %d nsecs in read\n",
        pid, tid, timestamp - self->t);
}

t is a thread-local variable. This code prints how long each call to read takes, along with context. To ensure safety, DTrace limits what you write; e.g. no loops. (Hence, no infinite loops!) 23 / 49

Other Tools AMD CodeAnalyst based on oprofile; leverages AMD processor features. WAIT IBM s tool tells you what operations your JVM is waiting on while idle. Non-free and not available. 24 / 49

WAIT: Introduction Built for production environments. Specialized for profiling JVMs; uses JVM hooks to analyze idle time. Sampling-based analysis; infrequent samples (1-2 per minute!) 26 / 49

WAIT: Operation At each sample: records each thread's state, call stack, and participation in system locks. This enables WAIT to compute a wait state (using expert-written rules): what the process is currently doing or waiting on, e.g. disk; GC; network; blocked; etc. 27 / 49

WAIT: Workflow You: run your application; collect data (using a script or manually); and upload the data to the server. The server provides a report. You fix the performance problems. The report indicates processor utilization (idle, your application, GC, etc.); runnable threads; waiting threads (and why they are waiting); thread states; and a stack viewer. The paper presents 6 case studies where WAIT identified performance problems: deadlocks, server underloads, memory leaks, database bottlenecks, and excess filesystem activity. 28 / 49

Other Profiling Tools Profiling: Not limited to C/C++, or even code. You can profile Python using cProfile; standard profiling technology. Google's Page Speed Tool: profiling for web pages how can you make your page faster? Reducing the number of DNS lookups; leveraging browser caching; combining images; plus, traditional JavaScript profiling. 29 / 49

Part III Reentrancy 30 / 49

Reentrancy A function can be suspended in the middle and re-entered (called again) before the previous execution returns. Does not always mean thread-safe (although it usually is). Recall: thread-safe is essentially no data races. Moot point if the function only modifies local data, e.g. sin(). 31 / 49

Reentrancy Example Courtesy of Wikipedia (with modifications):

int t;

void swap(int *x, int *y)
{
    t = *x;
    *x = *y;
    // hardware interrupt might invoke isr() here!
    *y = t;
}

void isr()
{
    int x = 1, y = 2;
    swap(&x, &y);
}

...
int a = 3, b = 4;
...
swap(&a, &b);

32 / 49

Reentrancy Example Explained (a trace)

call swap(&a, &b);
t = *x;            // t = 3 (a)
*x = *y;           // a = 4 (b)
    call isr();
    x = 1; y = 2;
    call swap(&x, &y)
    t = *x;        // t = 1 (x)
    *x = *y;       // x = 2 (y)
    *y = t;        // y = 1
*y = t;            // b = 1

Final values:    a = 4, b = 1
Expected values: a = 4, b = 3

33 / 49

Reentrancy Example, Fixed

int t;

void swap(int *x, int *y)
{
    int s;

    s = t;    // save global variable
    t = *x;
    *x = *y;
    // hardware interrupt might invoke isr() here!
    *y = t;
    t = s;    // restore global variable
}

void isr()
{
    int x = 1, y = 2;
    swap(&x, &y);
}

...
int a = 3, b = 4;
...
swap(&a, &b);

34 / 49

Reentrancy Example, Fixed Explained (a trace)

call swap(&a, &b);
s = t;             // s = UNDEFINED
t = *x;            // t = 3 (a)
*x = *y;           // a = 4 (b)
    call isr();
    x = 1; y = 2;
    call swap(&x, &y)
    s = t;         // s = 3
    t = *x;        // t = 1 (x)
    *x = *y;       // x = 2 (y)
    *y = t;        // y = 1
    t = s;         // t = 3
*y = t;            // b = 3
t = s;             // t = UNDEFINED

Final values:    a = 4, b = 3
Expected values: a = 4, b = 3

35 / 49

Previous Example: thread-safety Is the previous reentrant code also thread-safe? (This is more what we're concerned about in this course.) Let's see:

int t;

void swap(int *x, int *y)
{
    int s;

    s = t;    // save global variable
    t = *x;
    *x = *y;
    // hardware interrupt might invoke isr() here!
    *y = t;
    t = s;    // restore global variable
}

Consider two calls: swap(a, b), swap(c, d) with a = 1, b = 2, c = 3, d = 4. 36 / 49

Previous Example: thread-safety trace

global: t
/* thread 1 */              /* thread 2 */
a = 1, b = 2;
s = t;  // s = UNDEFINED
t = a;  // t = 1
                            c = 3, d = 4;
                            s = t;  // s = 1
                            t = c;  // t = 3
                            c = d;  // c = 4
                            d = t;  // d = 3
a = b;  // a = 2
b = t;  // b = 3
t = s;  // t = UNDEFINED
                            t = s;  // t = 1

Final values:    a = 2, b = 3, c = 4, d = 3, t = 1
Expected values: a = 2, b = 1, c = 4, d = 3, t = UNDEFINED

37 / 49

Reentrancy vs Thread-Safety (1) Re-entrant does not always mean thread-safe (as we saw). But, for most sane implementations, it is thread-safe. OK, but are thread-safe functions reentrant? 38 / 49

Reentrancy vs Thread-Safety (2) Are thread-safe functions reentrant? Nope. Consider:

int f()
{
    lock();
    // protected code
    unlock();
}

Recall: Reentrant functions can be suspended in the middle of execution and called again before the previous execution completes. f() obviously isn't reentrant. Plus, it will deadlock. Interrupt handling is more for systems programming, so the topic of reentrancy may or may not come up again. 39 / 49

Summary of Reentrancy vs Thread-Safety Difference between reentrant and thread-safe functions: Reentrancy: Has nothing to do with threads; assumes a single thread. Reentrant means the execution can context switch at any point in a function, call the same function, and complete before returning to the original function call. The function's result does not depend on where the context switch happens. Thread-safety: The result does not depend on any interleaving of threads from concurrency or parallelism. No unexpected results from multiple concurrent executions of the function. 40 / 49

Another Definition of Thread-Safe Functions A function whose effect, when called by two or more threads, is guaranteed to be as if the threads each executed the function one after another, in an undefined order, even if the actual execution is interleaved. 41 / 49

Good Example of an Exam Question

void swap(int *x, int *y)
{
    int t;

    t = *x;
    *x = *y;
    *y = t;
}

Is the above code thread-safe? Write some expected results for running two calls in parallel. Argue these expected results always hold, or show an example where they do not. 42 / 49

Part IV Good Practices 43 / 49

Inlining We have seen the notion of inlining: Instructs the compiler to just insert the function code in-place, instead of calling the function. Hence, no function call overhead! Compilers can also do better context-sensitive operations they couldn't have done before. No overhead... sounds like better performance... let's inline everything! 44 / 49

Inlining in C++ Implicit inlining (defining a function inside a class definition):

class P {
public:
    int get_x() const { return x; }
    ...
private:
    int x;
};

Explicit inlining:

inline int max(const int& x, const int& y)
{
    return x < y ? y : x;
}

45 / 49

The Other Side of Inlining One big downside: Your program size is going to increase. This is worse than you think: Fewer cache hits. More trips to memory. Some inlines can grow very rapidly (C++ extended constructors). Just from this your performance may go down easily. 46 / 49

Compilers on Inlining Inlining is merely a suggestion to compilers. They may ignore you. For example: taking the address of an inline function and using it, or using virtual functions (in C++), will get you ignored quite fast. 47 / 49

From a Usability Point-of-View Debugging is more difficult (e.g. you can't set a breakpoint in a function that doesn't actually exist). Most compilers simply won't inline code with debugging symbols on. Some do, but typically it's more of a pain. Library design: if you change any inline function in your library, any users of that library have to recompile their program when the library updates (a non-binary-compatible change!). Not a problem for non-inlined functions: programs execute the new function dynamically at runtime. 48 / 49

Summary System profiling tools: oprofile, DTrace, WAIT, perf Reentrancy vs thread-safety. Inlining: limit your inlining to trivial functions: makes debugging easier and improves usability; won t slow down your program before you even start optimizing it. Tell the compiler high-level information but think low-level. 49 / 49