Lecture 16: More Profiling. gperftools; system-wide tools: oprofile, perf, DTrace, etc. ECE 459: Programming for Performance. March 6, 2014
Part I gperftools 2 / 49
Introduction to gperftools Google Performance Tools include: CPU profiler. Heap profiler. Heap checker. Faster (multithreaded) malloc. We'll mostly use the CPU profiler: purely statistical sampling; no recompilation, at most linking; and built-in visual output. 3 / 49
Google Perf Tools profiler usage You can use the profiler without any recompilation. Not recommended: worse data.

    LD_PRELOAD="/usr/lib/libprofiler.so" \
        CPUPROFILE=test.prof ./test

The other option is to link to the profiler: -lprofiler
Both options read the CPUPROFILE environment variable: it states the location to write the profile data. 4 / 49
Other Usage You can use the profiling library directly as well: #include <gperftools/profiler.h> Bracket code you want profiled with:

    ProfilerStart()
    ProfilerStop()

You can change the sampling frequency with the CPUPROFILE_FREQUENCY environment variable. Default value: 100 (samples per second). 5 / 49
pprof Usage Like gprof, it will analyze profiling results.

    % pprof test test.prof
        Enters "interactive" mode
    % pprof --text test test.prof
        Outputs one line per procedure
    % pprof --gv test test.prof
        Displays annotated call graph via gv
    % pprof --gv --focus=Mutex test test.prof
        Restricts to code paths including a .*Mutex.* entry
    % pprof --gv --focus=Mutex --ignore=string test test.prof
        Code paths including Mutex but not string
    % pprof --list=getdir test test.prof
        (Per-line) annotated source listing for getdir()
    % pprof --disasm=getdir test test.prof
        (Per-PC) annotated disassembly for getdir()

Can also output dot, ps, pdf or gif instead of gv. 6 / 49
Text Output Similar to the flat profile in gprof.

    jon@riker examples master % pprof --text test test.prof
    Using local file test.
    Using local file test.prof.
    Removing killpg from all stack traces.
    Total: 300 samples
          95  31.7%  31.7%      102  34.0% int_power
          58  19.3%  51.0%       58  19.3% float_power
          51  17.0%  68.0%       96  32.0% float_math_helper
          50  16.7%  84.7%      137  45.7% int_math_helper
          18   6.0%  90.7%      131  43.7% float_math
          14   4.7%  95.3%      159  53.0% int_math
          14   4.7% 100.0%      300 100.0% main
           0   0.0% 100.0%      300 100.0% __libc_start_main
           0   0.0% 100.0%      300 100.0% _start

7 / 49
Text Output Explained Columns, from left to right: Number of checks (samples) in this function. Percentage of checks in this function. Same as time in gprof. Percentage of checks in the functions printed so far. Equivalent to cumulative (but in %). Number of checks in this function and its callees. Percentage of checks in this function and its callees. Function name. 8 / 49
Graphical Output [Call-graph figure: a.out; total samples: 300; focusing on: 300; dropped nodes with <= 1 abs(samples); dropped edges with <= 0 samples. Nodes show local and cumulative sample counts:

    _start 0 (0.0%) of 300 (100.0%)
    __libc_start_main 0 (0.0%) of 300 (100.0%)
    main 14 (4.7%) of 300 (100.0%)
    int_math 14 (4.7%) of 159 (53.0%)
    float_math 18 (6.0%) of 131 (43.7%)
    int_math_helper 50 (16.7%) of 137 (45.7%)
    float_math_helper 51 (17.0%) of 96 (32.0%)
    int_power 95 (31.7%) of 102 (34.0%)
    float_power 58 (19.3%)

Directed edges, weighted by sample counts, connect callers to callees.] 9 / 49
Graphical Output Explained Output was too small to read on the slide. Shows the same numbers as the text output. Directed edges denote function calls. Note: number of samples in callees = cumulative samples (this function + callees) - samples in this function alone. Example: float_math_helper, 51 (local) of 96 (cumulative):

    96 - 51 = 45 (callees)
    callee int_power = 7 (bogus)
    callee float_power = 38
    callees total = 45

10 / 49
Things You May Notice The call graph is not exact. In fact, it shows many bogus relations which clearly don't exist. For instance, we know that there are no cross-calls between the int and float functions. As with gprof, optimizations will change the graph. You'll probably want to look at the text profile first, then use the --focus flag to look at individual functions. 11 / 49
Part II System profiling: oprofile, perf, DTrace, WAIT 12 / 49
Introduction: oprofile http://oprofile.sourceforge.net Sampling-based tool. Uses CPU performance counters. Tracks currently-running function; records profiling data for every application run. Can work system-wide (across processes). Technology: Linux Kernel Performance Events (formerly a Linux kernel module). 13 / 49
Setting up oprofile Must run as root to use system-wide, otherwise can use per-process.

    % sudo opcontrol \
        --vmlinux=/usr/src/linux-3.2.7-1-ARCH/vmlinux
    % echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog
    % sudo opcontrol --start
    Using default event: CPU_CLK_UNHALTED:100000:0:1:1
    Using 2.6+ OProfile kernel interface.
    Reading module info.
    Using log file /var/lib/oprofile/samples/oprofiled.log
    Daemon started.
    Profiler running.

Per-process:

    [plam@lynch nm-morph]$ operf ./test_harness
    operf: Profiler started
    Profiling done.

14 / 49
oprofile Usage (1) Pass your executable to opreport.

    % sudo opreport -l ./test
    CPU: Intel Core/i7, speed 1595.78 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not
    halted) with a unit mask of 0x00 (No unit mask) count 100000
    samples  %        symbol name
    7550     26.0749  int_math_helper
    5982     20.6596  int_power
    5859     20.2348  float_power
    3605     12.4504  float_math
    3198     11.0447  int_math
    2601      8.9829  float_math_helper
    160       0.5526  main

If you have debug symbols (-g) you could use:

    % sudo opannotate --source \
        --output-dir=/path/to/annotated-source /path/to/mybinary

15 / 49
oprofile Usage (2) Use opreport by itself for a whole-system view. You can also reset and stop the profiling.

    % sudo opcontrol --reset
    Signalling daemon... done
    % sudo opcontrol --stop
    Stopping profiling.

16 / 49
Perf: Introduction https://perf.wiki.kernel.org/index.php/tutorial Interface to the Linux kernel's built-in sampling-based profiling. Per-process, per-CPU, or system-wide. Can even report the cost of each line of code. 17 / 49
Perf: Usage Example On last year's Assignment 3 code:

    [plam@lynch nm-morph]$ perf stat ./test_harness

     Performance counter stats for './test_harness':

           6562.501429 task-clock              #    0.997 CPUs utilized
                   666 context-switches        #    0.101 K/sec
                     0 cpu-migrations          #    0.000 K/sec
                 3,791 page-faults             #    0.578 K/sec
        24,874,267,078 cycles                  #    3.790 GHz                  [83.32%]
        12,565,457,337 stalled-cycles-frontend #   50.52% frontend cycles idle [83.31%]
         5,874,853,028 stalled-cycles-backend  #   23.62% backend cycles idle  [66.63%]
        33,787,408,650 instructions            #    1.36  insns per cycle
                                               #    0.37  stalled cycles per insn [83.32%]
         5,271,501,213 branches                #  803.276 M/sec                [83.38%]
           155,568,356 branch-misses           #    2.95% of all branches      [83.36%]

           6.580225847 seconds time elapsed

18 / 49
Perf: Source-level Analysis perf can tell you which instructions are taking time, or which lines of code. Compile with -ggdb to enable source code viewing.

    % perf record ./test_harness
    % perf annotate

perf annotate is interactive. Play around with it. 19 / 49
DTrace: Introduction http://queue.acm.org/detail.cfm?id=1117401 Instrumentation-based tool. System-wide. Meant to be used on production systems. (Eh?) (Typical instrumentation can have a slowdown of 100x (Valgrind).) Design goals: 1 No overhead when not in use; 2 Guarantee safety: must not crash (strict limits on the expressiveness of probes). 21 / 49
DTrace: Operation How does DTrace achieve zero overhead? Only when activated does it dynamically rewrite code, by placing a branch to instrumentation code. Uninstrumented, the code runs as if nothing changed. Most instrumentation is at function entry or exit points. You can also instrument kernel functions, locking, or instrument based on other events. Can express sampling as instrumentation-based events as well. 22 / 49
DTrace Example You write this:

    syscall::read:entry {
        self->t = timestamp;
    }

    syscall::read:return
    /self->t/
    {
        printf("%d/%d spent %d nsecs in read\n",
            pid, tid, timestamp - self->t);
    }

self->t is a thread-local variable. This code prints how long each call to read takes, along with context. To ensure safety, DTrace limits what you write; e.g. no loops. (Hence, no infinite loops!) 23 / 49
Other Tools AMD CodeAnalyst: based on oprofile; leverages AMD processor features. WAIT: IBM's tool; tells you what operations your JVM is waiting on while idle. Non-free and not available. 24 / 49
WAIT: Introduction Built for production environments. Specialized for profiling JVMs; uses JVM hooks to analyze idle time. Sampling-based analysis; infrequent samples (1-2 per minute!) 26 / 49
WAIT: Operation At each sample: records each thread's state, call stack, and participation in system locks. Enables WAIT to compute a wait state (using expert-written rules): what the process is currently doing or waiting on, e.g. disk; GC; network; blocked; etc. 27 / 49
WAIT: Workflow You: run your application; collect data (using a script or manually); and upload the data to the server. Server provides a report. You fix the performance problems. Report indicates processor utilization (idle, your application, GC, etc); runnable threads; waiting threads (and why they are waiting); thread states; and a stack viewer. Paper presents 6 case studies where WAIT identified performance problems: deadlocks, server underloads, memory leaks, database bottlenecks, and excess filesystem activity. 28 / 49
Other Profiling Tools Profiling: not limited to C/C++, or even code. You can profile Python using cProfile; standard profiling technology. Google's Page Speed Tool: profiling for web pages; how can you make your page faster? Reducing the number of DNS lookups; leveraging browser caching; combining images; plus traditional JavaScript profiling. 29 / 49
Part III Reentrancy 30 / 49
Reentrancy A function can be suspended in the middle and re-entered (called again) before the previous execution returns. Does not always mean thread-safe (although it usually is). Recall: thread-safe is essentially no data races. Moot point if the function only modifies local data, e.g. sin(). 31 / 49
Reentrancy Example Courtesy of Wikipedia (with modifications):

    int t;

    void swap(int *x, int *y) {
        t = *x;
        *x = *y;
        // hardware interrupt might invoke isr() here!
        *y = t;
    }

    void isr() {
        int x = 1, y = 2;
        swap(&x, &y);
    }

    ...
    int a = 3, b = 4;
    ...
    swap(&a, &b);

32 / 49
Reentrancy Example Explained (a trace)

    call swap(&a, &b);
      t = *x;       // t = 3 (a)
      *x = *y;      // a = 4 (b)
      call isr();
        x = 1; y = 2;
        call swap(&x, &y);
          t = *x;   // t = 1 (x)
          *x = *y;  // x = 2 (y)
          *y = t;   // y = 1
      *y = t;       // b = 1

    Final values:    a = 4, b = 1
    Expected values: a = 4, b = 3

33 / 49
Reentrancy Example, Fixed

    int t;

    void swap(int *x, int *y) {
        int s;

        s = t;  // save global variable
        t = *x;
        *x = *y;
        // hardware interrupt might invoke isr() here!
        *y = t;
        t = s;  // restore global variable
    }

    void isr() {
        int x = 1, y = 2;
        swap(&x, &y);
    }

    ...
    int a = 3, b = 4;
    ...
    swap(&a, &b);

34 / 49
Reentrancy Example, Fixed Explained (a trace)

    call swap(&a, &b);
      s = t;        // s = UNDEFINED
      t = *x;       // t = 3 (a)
      *x = *y;      // a = 4 (b)
      call isr();
        x = 1; y = 2;
        call swap(&x, &y);
          s = t;    // s = 3
          t = *x;   // t = 1 (x)
          *x = *y;  // x = 2 (y)
          *y = t;   // y = 1
          t = s;    // t = 3
      *y = t;       // b = 3
      t = s;        // t = UNDEFINED

    Final values:    a = 4, b = 3
    Expected values: a = 4, b = 3

35 / 49
Previous Example: thread-safety Is the previous reentrant code also thread-safe? (This is more what we're concerned about in this course.) Let's see:

    int t;

    void swap(int *x, int *y) {
        int s;

        s = t;  // save global variable
        t = *x;
        *x = *y;
        // hardware interrupt might invoke isr() here!
        *y = t;
        t = s;  // restore global variable
    }

Consider two calls: swap(a, b), swap(c, d) with a = 1, b = 2, c = 3, d = 4. 36 / 49
Previous Example: thread-safety trace

    global: t
    /* thread 1 */              /* thread 2 */
    a = 1, b = 2;
    s = t;  // s = UNDEFINED
    t = a;  // t = 1
                                c = 3, d = 4;
                                s = t;  // s = 1
                                t = c;  // t = 3
                                c = d;  // c = 4
                                d = t;  // d = 3
    a = b;  // a = 2
    b = t;  // b = 3
    t = s;  // t = UNDEFINED
                                t = s;  // t = 1

    Final values:    a = 2, b = 3, c = 4, d = 3, t = 1
    Expected values: a = 2, b = 1, c = 4, d = 3, t = UNDEFINED

37 / 49
Reentrancy vs Thread-Safety (1) Re-entrant does not always mean thread-safe (as we saw) But, for most sane implementations, it is thread-safe Ok, but are thread-safe functions reentrant? 38 / 49
Reentrancy vs Thread-Safety (2) Are thread-safe functions reentrant? Nope. Consider:

    int f() {
        lock();
        // protected code
        unlock();
    }

Recall: reentrant functions can be suspended in the middle of execution and called again before the previous execution completes. f() obviously isn't reentrant. Plus, it will deadlock. Interrupt handling is more for systems programming, so the topic of reentrancy may or may not come up again. 39 / 49
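To make the non-reentrancy observable without actually deadlocking, here is a sketch using pthread_mutex_trylock in place of a blocking lock(). The name f_try and the simulated re-entry are my inventions (link with -pthread); on Linux the default mutex is non-recursive, so a trylock by the thread that already holds it fails with EBUSY, exactly where a blocking lock() would deadlock:

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for f(): take the lock, run the "protected code", and on
 * the outermost call simulate an interrupt re-entering the function
 * mid-critical-section. Returns 0 on success, -1 if re-entry was
 * blocked (where a blocking lock() would have deadlocked). */
int f_try(int reentered) {
    if (pthread_mutex_trylock(&m) != 0)
        return -1;            /* lock already held: not reentrant */
    int result = 0;
    if (!reentered)
        result = f_try(1);    /* simulate re-entry from an interrupt */
    pthread_mutex_unlock(&m);
    return result;
}
```

A plain call, f_try(1), returns 0; but f_try(0), which re-enters itself inside the critical section, returns -1: the second entry could not proceed.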
Summary of Reentrancy vs Thread-Safety Difference between reentrant and thread-safe functions: Reentrancy: has nothing to do with threads; assumes a single thread. Reentrant means the execution can context switch at any point in a function, call the same function, and complete before returning to the original function call. The function's result does not depend on where the context switch happens. Thread-safety: the result does not depend on any interleaving of threads from concurrency or parallelism. No unexpected results from multiple concurrent executions of the function. 40 / 49
Another Definition of Thread-Safe Functions A function whose effect, when called by two or more threads, is guaranteed to be as if the threads each executed the function one after another, in an undefined order, even if the actual execution is interleaved. 41 / 49
Good Example of an Exam Question

    void swap(int *x, int *y) {
        int t;
        t = *x;
        *x = *y;
        *y = t;
    }

Is the above code thread-safe? Write some expected results for running two calls in parallel. Argue these expected results always hold, or show an example where they do not. 42 / 49
Part IV Good Practices 43 / 49
Inlining We have seen the notion of inlining: instructs the compiler to just insert the function code in-place, instead of calling the function. Hence, no function call overhead! Compilers can also perform context-sensitive optimizations they couldn't have done before. No overhead... sounds like better performance... let's inline everything! 44 / 49
Inlining in C++ Implicit inlining (defining a function inside a class definition):

    class P {
    public:
        int get_x() const { return x; }
        ...
    private:
        int x;
    };

Explicit inlining:

    inline int max(const int& x, const int& y) {
        return x < y ? y : x;
    }

45 / 49
The Other Side of Inlining One big downside: your program size is going to increase. This is worse than you think: fewer cache hits. More trips to memory. Some inlines can grow very rapidly (C++ extended constructors). Just from this, your performance can easily go down. 46 / 49
Compilers on Inlining Inlining is merely a suggestion to compilers. They may ignore you. For example, taking the address of an inline function and using it, or calling a virtual function (in C++), will get your suggestion ignored quite fast. 47 / 49
From a Usability Point-of-View Debugging is more difficult (e.g. you can't set a breakpoint in a function that doesn't actually exist). Most compilers simply won't inline code with debugging symbols on. Some do, but typically it's more of a pain. Library design: if you change any inline function in your library, all users of that library have to recompile their programs when the library updates (a non-binary-compatible change!). Not a problem for non-inlined functions: programs execute the new function dynamically at runtime. 48 / 49
Summary System profiling tools: oprofile, DTrace, WAIT, perf. Reentrancy vs thread-safety. Inlining: limit your inlining to trivial functions: makes debugging easier and improves usability; won't slow down your program before you even start optimizing it. Tell the compiler high-level information, but think low-level. 49 / 49