Better Living Through Statistics Monitoring Doesn't (Have To) Suck

Similar documents
Aberrant Behavior Detection in Time Series for Monitoring Business-Critical Metrics (DRAFT)

Jug: Executing Parallel Tasks in Python

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Meteor Burst Communications: Theory and Practice. Click here if your download doesn"t start automatically

MITOCW watch?v=fkfsmwatddy

TOWARDS THE DEVELOPMENT OF A MONITORING SYSTEM FOR PLANNING POLICY Residential Land Uses Case study of Brisbane, Melbourne, Chicago and London

Mathematical Logic Part Three

EQ: How do I convert between standard form and scientific notation?

Performance Evaluation

Alex s Guide to Word Problems and Linear Equations Following Glencoe Algebra 1

Solving Equations with Addition and Subtraction

(Refer Slide Time: 00:10)

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0

Latches. October 13, 2003 Latches 1

MITOCW ocw-18_02-f07-lec02_220k

USING GIS IN WATER SUPPLY AND SEWER MODELLING AND MANAGEMENT

Hierarchical Anomaly Detection in Load Testing with StormRunner Load

Introduction to Algebra: The First Week

MITOCW watch?v=vjzv6wjttnc

MITOCW watch?v=ed_xr1bzuqs

MITOCW Investigation 6, Part 6

Instructor (Brad Osgood)

MITOCW big_picture_derivatives_512kb-mp4

Solution to Proof Questions from September 1st

MITOCW ocw-18_02-f07-lec17_220k

Iowa Department of Transportation Office of Transportation Data GIS / CAD Integration

Math 300: Foundations of Higher Mathematics Northwestern University, Lecture Notes

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels.

Laboratory 3: Steered molecular dynamics

Texas A&M University

Ted Pedersen. University of Minnesota, Duluth

Review of the 4 th and 8 th grade algebra and functions items on NAEP By Hyman Bass Brookings Institution September 15, 2005

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)

Lecture 19: Introduction to Kinetics First a CH 302 Kinetics Study Guide (Memorize these first three pages, they are all the background you need)

Hunting for Anomalies in PMU Data

Predicting New Search-Query Cluster Volume

AP Physics 1 Kinematics 1D

AP Calculus Summer Homework Worksheet Instructions

Unit 22: Sampling Distributions

LECTURE 10 MONITORING

Deep dive into analytics using Aggregation. Boaz

Introducing GIS analysis

Solving Equations with Addition and Subtraction. Solving Equations with Addition and Subtraction. Solving Equations with Addition and Subtraction

CS 124 Math Review Section January 29, 2018

Intelligent NMR productivity tools

Today s lecture. Scientific method Hypotheses, models, theories... Occam' razor Examples Diet coke and menthos

5th Grade. Slide 1 / 67. Slide 2 / 67. Slide 3 / 67. Matter and Its Interactions. Table of Contents: Matter and Its Interactions

Possible Prelab Questions.

MITOCW MITRES18_005S10_DiffEqnsGrowth_300k_512kb-mp4

MITOCW watch?v=0usje5vtiks

Fog Chamber Testing the Label: Photo of Fog. Joshua Gutwill 10/29/1999

17/07/ Pick up Lecture Notes... WEBSITE FOR ASSIGNMENTS AND TOOLBOX DEFINITION DEFINITIONS AND CONCEPTS OF GIS

The Venus-Sun distance. Teacher s Guide Intermediate Level CESAR s Science Case

Road to GIS, PSE s past, present and future

To: Amanda From: Daddy Date: 2004 February 19 About: How to solve math problems

Robert D. Borchert GIS Technician

Member Level Forecasting Using SAS Enterprise Guide and SAS Forecast Studio

Physics. Nov Title: Nov 3 8:52 AM (1 of 45)

MITOCW MIT6_041F11_lec17_300k.mp4

MITOCW watch?v=ws1mx-c2v9w

Summarizing Measured Data

15 Skepticism of quantum computing

Large-Scale Behavioral Targeting

GIS = Geographic Information Systems;

Instructor (Brad Osgood)

Welcome to GST 101: Introduction to Geospatial Technology. This course will introduce you to Geographic Information Systems (GIS), cartography,

Mathematical Logic Part Three

The Venus-Sun distance. Teacher s Guide Advanced Level CESAR s Science Case

CMPSCI 105 Final Exam Spring 2016 Professor William T. Verts CMPSCI 105 Final Exam Spring 2016 May 5, 2016 Professor William T. Verts Solution Key

PARTICLE MEASUREMENT IN CLEAN ROOM TECHNOLOGY

Assignment 9. Multiple Choice Identify the letter of the choice that best completes the statement or answers the question.

A Tale of Two Erasure Codes in HDFS

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

Mathematical Logic Part Three

Announcements. Problem Set 1 out. Checkpoint due Monday, September 30. Remaining problems due Friday, October 4.

MITOCW watch?v=ko0vmalkgj8

Appendix 4 Weather. Weather Providers

MITOCW Investigation 4, Part 3

SSSC Discovery Series NMR2 Multidimensional NMR Spectroscopy

Chapter 1 Review of Equations and Inequalities

173 Logic and Prolog 2013 With Answers and FFQ

A Note on Turing Machine Design

Qualitative vs Quantitative metrics

Scales & Measurements. Joe Celko copyright 2015

California State Science Fair

8.1 Solutions of homogeneous linear differential equations

Symmetry And Spectroscopy: An Introduction To Vibrational And Electronic Spectroscopy (Dover Books On Chemistry) PDF

RAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response

Today s Agenda: 1) Why Do We Need To Measure The Memory Component? 2) Machine Pool Memory / Best Practice Guidelines

Sequential Logic Optimization. Optimization in Context. Algorithmic Approach to State Minimization. Finite State Machine Optimization

Some hints for the Radioactive Decay lab

Student: We have to buy a new access code? I'm afraid you have to buy a new one. Talk to the bookstore about that.

- measures the center of our distribution. In the case of a sample, it s given by: y i. y = where n = sample size.

MITOCW watch?v=7q32wnm4dew

Calculus II. Calculus II tends to be a very difficult course for many students. There are many reasons for this.

Topic 1. Definitions

Machine Learning. Neural Networks

MEASURING THE COMPLEXITY AND IMPACT OF DESIGN CHANGES

Kinematics II Mathematical Analysis of Motion

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Transcription:

Better Living Through Statistics Monitoring Doesn't (Have To) Suck Jamie Wilkinson Site Reliability Engineer, Google jaq@{spacepants.org,google.com} @jaqpants

#monitoringsucks I love monitoring. This hashtag makes me sad. But what to talk about? I have no idea what has gone on in the real world since going into the black hole. These #monitoringsucks people seem to be on the right path... do I have anything new to add?

Validation http://blog.lusis.org/blog/2012/06/05/monitoringsucking-just-a-little-bit-less/ "Instead of alerting on data and then storing it as an afterthought (perfdata anyone?) let s start collecting the data, storing it and then alerting based on it."

Monitoring systems automate boring parts of experimental method measuring, recording, alerting, visualisation so you have more time to do fun things... and debug the occasional emergency.

The current state of monitoring 1. blackbox monitoring resulting in alerts 2. whitebox monitoring resulting in charts

Blackbox vs Whitebox Blackbox: treat the system as opaque: you can only use it as a user would. a.k.a. "probing" c.f. cucumber-nagios fuel indicator light

Blackbox vs Whitebox Whitebox: expose the internal state of the system for inspection a.k.a. instrumentation, telemetry,... ROCKET SCIENCE c.f.... graphite? new relic? metrics? tacho, fuel gauge, water meter

What's wrong with blackbox? Only boolean: no visibility into why Why is the site slow? Why has image serving stopped working? No predictive capability How long until we need more disks? cpus? datacenters?

Problems with "check+alert" model Thresholds vary among instances, tuning difficult. Adding new targets, new checks is lots of effort. Check script logic performs the measurement and the "judgement" all in one. Alerts for things you can't act on. Duplicates Physical resource limits.

So, Why does #monitoringsuck? TL;DR: when the cost of maintenance is too high to improve the quality of alerts

An idea... http://blog.lusis.org/blog/2012/06/05/monitoringsucking-just-a-little-bit-less/ "Instead of alerting on data and then storing it as an afterthought (perfdata anyone?) let s start collecting the data, storing it and then alerting based on it."

Structure of timeseries

Alerting on thresholds

Alert when beer supply low if cases - 1-1 <= 1: alert BarneyWorriedAboutBeerSupply

Disk full alert Alert when 90% full Different filesystems have different sizes 10% of 2TB is 200GB False positive! Alert on absolute space, < 500MB Arbitrary number Different workloads with different needs: 500MB might not be enough warning

Disk full alert More general alert: How long before the disk is full? and How long will it take for a human to fix an (almost) full disk?

Alerting on rates of change

Dennis Hopper's Alert if speed >= 50mph: alert BombArmed if BombArmed && speed < 50mph:...

Keanu's Alert if speed >= 50mph: alert SaveTheBus!

Keanu's alert v - a * t = 50 50 - v = - a * t (v - 50)/a = t if (v - 50)/a <= time to save bus: alert StartSavingTheBus!

#1 calculate rates of change of timeseries

error count Example: Error spike

Rate of errors vs nominal rate errors per second rate of change increases greater than expected

Rates and Derivatives In a discreet timeseries: δx/δt =(x t - x t-1 )/Δt (assuming no missing samples)

#2 observe timeseries history to gain context

Example: Traffic spike worth getting out of bed for?

Duration of traffic spike NO maybe? worth getting out of bed for? Δt Δt

Duration of abnormality Not just looking at the latest data point, or the derivative at the latest point. (Otherwise we have just reinvented check scripts.) Look back 5 minutes, 1 hour, 7 days, back to the dawn of time. Heuristic: 2.5x the sampling interval to work around missing values

#3 prefer counters over gauges

Timeseries Have Types Counter: monotonically nondecreasing "preserves the order" i.e. UP "nondecreasing" can be flat

Timeseries Have Types Gauge: everything else... not monotonic

Counters FTW Δt

Counters FTW Δt no loss of meaning after sampling

Gauges FTL Δt

Gauges FTL Δt lose spike events shorter than sampling interval

#4 aggregate to each grouping in the system

Example: Aggregation cluster rate t = rate(instance 1 t + instance 2 t )

Aggregation of ensembles Instances in a cluster don't work alone. How many queries per second is your cluster receiving? Take the rates of each counter, and then sum the rates.

#5 ratios of rates

Timeseries Operations: Ratios counter / counter = counter: instant means gauge / gauge = gauge: rate comparisons e.g. New deployment δ(errors) / δ(queries) > threshold?

What to alert on? Rate of change of QPS outside normal cycles Ratio of errors to queries Latency (median, 95th percentile) too high Rate of change of bandwidth Whatever is important to the business!

When to alert? Perhaps the topic of a whole other talk. Alerts are not logs. Make sure it's ACTIONABLE... then DOCUMENT IT 3-month-in-the-future-you will thank you.

Blackbox testing still necessary Blackbox tests are end-to-end tests. End-to-end testing by definition covers everything you have missed. You still have charts in the timeseries to inspect, right?

TL;DL Do maths on your timeseries Keep counters instead of gauges, derive rates Compare them to one another Do historical analysis (compare values over time) Alert only when action can be taken

HOW? Attach a statistical package to your timeseries database, and experiment R numpy your favourite here Make smarter alerts! (AWKWARD DEMO TIME)

Questions? Demo code: http://github.com/jaqx0r/blts