The Fundamentals of Heavy Tails: Properties, Emergence, & Identification. Jayakrishnan Nair, Adam Wierman, Bert Zwart


Why am I doing a tutorial on heavy tails? Because we're writing a book on the topic. Why are we writing a book on the topic? Because heavy-tailed phenomena are everywhere! BUT, they are extremely misunderstood.

Heavy-tailed phenomena are treated as something Mysterious, Surprising, & Controversial. Our intuition is flawed because intro probability classes focus on light-tailed distributions, and simple, appealing statistical approaches have BIG problems.

Heavy-tailed phenomena are treated as something Mysterious, Surprising, & Controversial. BUT: similar stories in electricity nets, citation nets, ... (cf. 2005, ToN)

Heavy-tailed phenomena are treated as something Mysterious, Surprising, & Controversial 1. Properties 2. Emergence 3. Identification

What is a heavy-tailed distribution? A distribution with a tail that is heavier than an Exponential: F̄(x) = Pr(X > x) decays more slowly than e^{−μx}.

Canonical example: the Pareto distribution, a.k.a. the power-law distribution: Pr(X > x) = (x/x_m)^{−α} for x ≥ x_m, with density f(x) = α x_m^α x^{−(α+1)} for x ≥ x_m. On a log-log scale the tail is linear: log Pr(X > x) = −α log(x/x_m).

Many other examples: LogNormal, Weibull, Zipf, Cauchy, Student's t, Fréchet, ...

Many subclasses: Regularly varying, Subexponential, Long-tailed, Fat-tailed, ...

Heavy-tailed distributions have many strange & beautiful properties. The "Pareto principle": 80% of the wealth is owned by 20% of the population, etc. Infinite variance, or even infinite mean. Events that are much larger than the mean happen frequently. These are driven by 3 defining properties: 1) scale invariance, 2) the catastrophe principle, 3) the residual life blows up.

Scale invariance

Scale invariance: F̄ is scale invariant if there exist x_0 and a function c(λ) such that F̄(λx) = c(λ) F̄(x) for all x, λ with x, λx ≥ x_0. (λ is a change of scale.)

Scale invariance: F̄ is scale invariant if there exist x_0 and a function c(λ) such that F̄(λx) = c(λ) F̄(x) for all x, λ with x, λx ≥ x_0. Example: Pareto distributions: F̄(λx) = (λx/x_m)^{−α} = λ^{−α} F̄(x), so c(λ) = λ^{−α}. Theorem: A distribution is scale invariant if and only if it is Pareto.
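The scale-invariance identity above can be checked numerically. A minimal Python sketch (the x_min, α, and λ values are arbitrary choices for illustration):

```python
# Check scale invariance of the Pareto CCDF: F(lam * x) = c(lam) * F(x).
def pareto_ccdf(x, x_min=1.0, alpha=2.0):
    """Pr(X > x) for a Pareto(x_min, alpha) distribution."""
    return (x / x_min) ** (-alpha) if x >= x_min else 1.0

alpha, lam = 2.0, 3.0            # tail index and change of scale (illustrative)
c = lam ** (-alpha)              # predicted c(lam) = lam^(-alpha)
for x in (1.0, 2.5, 10.0, 100.0):
    assert abs(pareto_ccdf(lam * x) - c * pareto_ccdf(x)) < 1e-12
```

The same check fails for an Exponential tail, since e^{−μλx} is not a fixed multiple of e^{−μx}.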

Scale invariance: F̄ is scale invariant if there exist x_0 and c(λ) such that F̄(λx) = c(λ) F̄(x) for all x, λ with x, λx ≥ x_0. Asymptotic scale invariance: F̄ is asymptotically scale invariant if there exists a continuous, finite c(λ) such that lim_{x→∞} F̄(λx)/F̄(x) = c(λ) for all λ.

Example: Regularly varying distributions. F̄ is regularly varying (with index α) if F̄(x) = x^{−α} L(x), where L(x) is slowly varying, i.e., lim_{x→∞} L(λx)/L(x) = 1 for all λ > 0. Theorem: A distribution is asymptotically scale invariant iff it is regularly varying.

Example: Regularly varying distributions. F̄ is regularly varying if F̄(x) = x^{−α} L(x), where L(x) is slowly varying. Regularly varying distributions are extremely useful: they basically behave like Pareto distributions with respect to the tail (Karamata theorems, Tauberian theorems).

Heavy-tailed distributions have many strange & beautiful properties. The "Pareto principle": 80% of the wealth is owned by 20% of the population, etc. Infinite variance, or even infinite mean. Events that are much larger than the mean happen frequently. These are driven by 3 defining properties: 1) scale invariance, 2) the catastrophe principle, 3) the residual life blows up.

A thought experiment: During lecture I polled my 50 students about their heights and the number of twitter followers they have. The sum of the heights was ~300 feet; the sum of the numbers of twitter followers was 1,025,000. What led to these large values?

A thought experiment: During lecture I polled my 50 students about their heights and the number of twitter followers they have. The sum of the heights was ~300 feet; the sum of the numbers of twitter followers was 1,025,000. A bunch of people were probably just over 6' tall (maybe the basketball teams were in the class): the conspiracy principle. One person was probably a twitter celebrity and had ~1 million followers: the catastrophe principle.

Example: Consider X_1 + X_2 with X_1, X_2 i.i.d. Weibull. Given X_1 + X_2 = t, what is the conditional density of X_1? [Figure: for a light-tailed Weibull the density peaks at t/2 (conspiracy principle); for the Exponential it is flat on [0, t]; for a heavy-tailed Weibull it peaks at 0 and t (catastrophe principle).]

Extremely useful for random walks, queues, etc.: the principle of a single big jump.

Subexponential distributions: F is subexponential if, for X_i i.i.d. with distribution F, Pr(X_1 + ... + X_n > t) ~ n Pr(X_1 > t) as t → ∞.
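A small Monte Carlo sketch of the subexponential property for a Pareto distribution (the values of α, t, and the sample size are arbitrary illustrative choices; for finite t the ratio sits somewhat above its limit of 1):

```python
import random
random.seed(0)

def pareto_sample(alpha):
    """Inverse-transform sample from Pareto(x_min = 1, alpha)."""
    return random.random() ** (-1.0 / alpha)

alpha, t, n = 1.5, 20.0, 200_000
tail = t ** (-alpha)                                  # Pr(X > t), exactly
hits = sum(pareto_sample(alpha) + pareto_sample(alpha) > t for _ in range(n))
ratio = (hits / n) / (2 * tail)                       # -> 1 as t grows
print(f"Pr(X1 + X2 > t) / (2 Pr(X > t)) = {ratio:.2f}")
```

For a light-tailed distribution such as the Exponential, the same ratio blows up with t instead of approaching 1.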

[Venn diagram: Pareto ⊂ Regularly Varying ⊂ Subexponential ⊂ Heavy-tailed, with Weibull and LogNormal inside Subexponential.] Subexponential distributions: F is subexponential if, for X_i i.i.d. with distribution F, Pr(X_1 + ... + X_n > t) ~ n Pr(X_1 > t) as t → ∞.

Heavy-tailed distributions have many strange & beautiful properties. The "Pareto principle": 80% of the wealth is owned by 20% of the population, etc. Infinite variance, or even infinite mean. Events that are much larger than the mean happen frequently. These are driven by 3 defining properties: 1) scale invariance, 2) the catastrophe principle, 3) the residual life blows up.

A thought experiment: residual life. What happens to the expected remaining waiting time as we wait for a table at a restaurant? For a bus? For the response to an email? Either the remaining wait drops as you wait, or: if you don't get it quickly, you never will.

The distribution of residual life: the distribution of the remaining waiting time, given you have already waited x time, is R_x(t) = F̄(x + t)/F̄(x). Examples: Exponential: R_x(t) = e^{−μt} = F̄(t), memoryless. Pareto: R_x(t) = (1 + t/x)^{−α}, increasing in x.

The distribution of residual life: the distribution of the remaining waiting time, given you have already waited x time, is R_x(t) = F̄(x + t)/F̄(x). Mean residual life: m(x) = E[X − x | X > x] = ∫_0^∞ R_x(t) dt. Hazard rate: q(x) = f(x)/F̄(x) = −R_x'(0).
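The two residual-life examples can be evaluated in closed form. A small Python sketch contrasting the memoryless Exponential with the Pareto, whose residual life grows with the time already waited (the μ and α defaults are illustrative):

```python
import math

def resid_exp(x, t, mu=1.0):
    """Exponential residual life R_x(t) = F(x+t)/F(x) = exp(-mu t): memoryless."""
    return math.exp(-mu * t)

def resid_pareto(x, t, alpha=2.0):
    """Pareto residual life R_x(t) = (1 + t/x)^(-alpha): increasing in x."""
    return (1.0 + t / x) ** (-alpha)

t = 5.0
assert resid_exp(1.0, t) == resid_exp(100.0, t)       # waiting longer changes nothing
assert resid_pareto(100.0, t) > resid_pareto(1.0, t)  # the wait grows as you wait
```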

What happens to the expected remaining waiting time as we wait for a table at a restaurant? For a bus? For the response to an email? BUT: not all heavy-tailed distributions have DHR (decreasing hazard rate) / IMRL (increasing mean residual life), and some light-tailed distributions are DHR / IMRL.

Long-tailed distributions: F̄ is long-tailed if lim_{x→∞} R_x(t) = lim_{x→∞} F̄(x + t)/F̄(x) = 1 for all t. BUT: not all heavy-tailed distributions have DHR / IMRL, and some light-tailed distributions are DHR / IMRL.

Long-tailed distributions: F̄ is long-tailed if lim_{x→∞} F̄(x + t)/F̄(x) = 1 for all t. [Venn diagram: Pareto ⊂ Regularly Varying ⊂ Subexponential ⊂ Long-tailed ⊂ Heavy-tailed, with Weibull and LogNormal inside Subexponential.]

Heavy-tailed distributions are difficult to work with in general. [Venn diagram: Pareto ⊂ Regularly Varying ⊂ Subexponential ⊂ Long-tailed ⊂ Heavy-tailed, with Weibull and LogNormal.]

Regularly varying: asymptotically scale invariant, with useful analytic properties. Heavy-tailed distributions in general are difficult to work with. [Venn diagram with Regularly Varying highlighted.]

Subexponential: the catastrophe principle holds; useful for studying random walks. [Venn diagram with Subexponential highlighted.]

Long-tailed: the residual life blows up; useful for studying extremes. [Venn diagram with Long-tailed highlighted.]

Heavy-tailed phenomena are treated as something Mysterious, Surprising, & Controversial 1. Properties 2. Emergence 3. Identification

We've all been taught that the Normal is "normal" because of the Central Limit Theorem. But the Central Limit Theorem we're taught is not complete!

A quick review: Consider X_i i.i.d. How does S_n = X_1 + ... + X_n grow? Law of Large Numbers (LLN): S_n / n → E[X] almost surely when E[|X|] < ∞; that is, S_n = n E[X] + o(n).

A quick review: Consider X_i i.i.d. How does S_n = X_1 + ... + X_n grow? Central Limit Theorem (CLT): (S_n − n E[X]) / (σ √n) → N(0, 1) when Var(X) = σ² < ∞; that is, S_n = n E[X] + √n σ N(0, 1) + o(√n).

Two key assumptions: a finite mean and a finite variance. What if Var(X) = ∞?

The Generalized Central Limit Theorem (GCLT): (S_n − n E[X]) / n^{1/α} converges to a stable distribution with index α ∈ (0, 2]; the Normal is the stable law with α = 2, and for α < 2 the limit itself has a power-law tail. That is, S_n = n E[X] + n^{1/α} S_α + o(n^{1/α}).

Returning to our original question: Consider X_i i.i.d. How does S_n grow? Either the Normal distribution OR a power-law distribution can emerge! But this isn't the only question one can ask about S_n: what is the distribution of the ruin time? The ruin time is always heavy-tailed!

What is the distribution of the ruin time? Consider a symmetric 1-D random walk (steps ±1, each with probability 1/2). The distribution of the ruin time T satisfies Pr(T > t) ~ c / √t: the ruin time is always heavy-tailed!
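A quick simulation sketch of the ruin-time claim: if Pr(T > t) ~ c/√t, then quadrupling t should roughly halve the tail probability (the walk count and horizon are arbitrary choices):

```python
import random
random.seed(1)

def ruin_time(max_steps):
    """Steps until a symmetric +/-1 walk started at 0 first goes below 0."""
    pos = 0
    for step in range(1, max_steps + 1):
        pos += random.choice((-1, 1))
        if pos < 0:
            return step
    return max_steps + 1   # censored: the walk is still alive at the horizon

n, horizon = 20_000, 400
times = [ruin_time(horizon) for _ in range(n)]
p100 = sum(t > 100 for t in times) / n
p400 = sum(t > 400 for t in times) / n
# Pr(T > t) ~ c / sqrt(t): quadrupling t should roughly halve the tail
print(p100 / p400)
```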

We've all been taught that the Normal is "normal" because of the Central Limit Theorem. Heavy tails are more "normal" than the Normal! 1. Additive Processes 2. Multiplicative Processes 3. Extremal Processes

A simple multiplicative process: Y_n = X_n Y_{n−1}, where the X_i are i.i.d. and positive. Ex: incomes, populations, fragmentation, twitter popularity. Rich get richer.

Multiplicative processes almost always lead to heavy tails. An example: since Pr(Y_n > y) = Pr(log X_1 + ... + log X_n > log y), the CLT makes log Y_n approximately Normal, so the tail of Y_n behaves like a LogNormal tail, roughly exp(−c (log y)²), which decays more slowly than any exponential: Y_n is heavy-tailed!

Multiplicative processes almost always lead to heavy tails: log Y_n = log X_1 + log X_2 + ... + log X_n. Central Limit Theorem: (log Y_n − nμ) / (σ √n) → N(0, 1), where μ = E[log X] and Var(log X) = σ² < ∞.

Multiplicative central limit theorem: (Y_n e^{−nμ})^{1/(σ√n)} → LogNormal(0, 1), where μ = E[log X] and Var(log X) = σ² < ∞.

Satisfied by all distributions with finite mean and many with infinite mean. Multiplicative central limit theorem: (Y_n e^{−nμ})^{1/(σ√n)} → LogNormal(0, 1), where μ = E[log X] and Var(log X) = σ² < ∞.
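The multiplicative CLT can be illustrated by simulation. In this sketch each factor is X = e^Z with Z ~ Uniform(−1, 1), so μ = E[log X] = 0 and σ² = 1/3; if log Y_n is approximately Normal, about 68% of standardized values should fall within one standard deviation:

```python
import math, random
random.seed(2)

n, trials = 400, 5_000
sigma = math.sqrt(1.0 / 3.0)       # Var(Z) for Z ~ Uniform(-1, 1); E[log X] = 0

zs = []
for _ in range(trials):
    log_y = sum(random.uniform(-1, 1) for _ in range(n))  # log Y_n = sum of log X_i
    zs.append(log_y / (sigma * math.sqrt(n)))             # standardize

within_1sd = sum(abs(z) < 1 for z in zs) / trials
print(within_1sd)   # near 0.68 if log Y_n is approximately Normal
```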

A simple multiplicative process: Y_n = X_n Y_{n−1}, where the X_i are i.i.d. and positive. Ex: incomes, populations, fragmentation, twitter popularity. Rich get richer: LogNormals emerge, and LogNormals are heavy-tailed.

A simple multiplicative process: Y_n = X_n Y_{n−1}, where the X_i are i.i.d. and positive. Variations: a multiplicative process with a lower barrier, Y_n = max(ε, X_n Y_{n−1}); a multiplicative process with noise, Y_n = X_n Y_{n−1} + Q_n. Distributions that are approximately power-law emerge.

Multiplicative process with a lower barrier: Y_n = max(ε, X_n Y_{n−1}). Under minor technical conditions, there exists c > 0 such that lim_{y→∞} y^α Pr(Y > y) = c, where α = sup{ s ≥ 0 : E[X^s] ≤ 1 }: the limit is nearly regularly varying.

We've all been taught that the Normal is "normal" because of the Central Limit Theorem. Heavy tails are more "normal" than the Normal! 1. Additive Processes 2. Multiplicative Processes 3. Extremal Processes

A simple extremal process: M_n = max(X_1, X_2, ..., X_n). Ex: engineering for floods, earthquakes, etc.; the progression of world records. This is the subject of extreme value theory.

M_n = max(X_1, ..., X_n). How does M_n scale? A simple example: X_i i.i.d. Exponential(1). With a_n = 1 and b_n = log n, Pr( max(X_1, ..., X_n) ≤ x + log n ) = (1 − e^{−x}/n)^n → e^{−e^{−x}}: the Gumbel distribution.
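The Gumbel limit in this example can be checked exactly, since Pr(M_n ≤ log n + x) = (1 − e^{−x}/n)^n has a closed form:

```python
import math

def max_cdf(n, x):
    """Exact Pr(max of n i.i.d. Exp(1) <= log n + x) = (1 - exp(-x)/n)^n."""
    return (1.0 - math.exp(-x) / n) ** n

for x in (-1.0, 0.0, 1.0, 2.0):
    gumbel = math.exp(-math.exp(-x))                  # the limiting Gumbel CDF
    assert abs(max_cdf(10_000, x) - gumbel) < 1e-3
```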

M_n = max(X_1, ..., X_n). How does M_n scale? Extremal Central Limit Theorem: if (M_n − b_n)/a_n has a distributional limit, it is Fréchet (heavy-tailed), Gumbel (heavy- or light-tailed), or Weibull (light-tailed).

M_n = max(X_1, ..., X_n). How does M_n scale? Extremal Central Limit Theorem: the limit is Fréchet iff the X_i are regularly varying; Weibull, e.g., when the X_i are Uniform; Gumbel, e.g., when the X_i are LogNormal.

A simple extremal process: M_n = max(X_1, ..., X_n). Ex: engineering for floods, earthquakes, etc.; the progression of world records. Either heavy-tailed or light-tailed distributions can emerge as n → ∞... but this isn't the only question one can ask about M_n. What is the distribution of the time until a new record is set? The time until a record is always heavy-tailed!

Let T_k be the time between the k-th and (k+1)-th record. Pr(T_k > t) decays like 1/t (for example, Pr(T_1 > t) = 1/(t + 1) for i.i.d. continuous samples), so E[T_k] = ∞. What is the distribution of the time until a new record is set? The time until a record is always heavy-tailed!

We've all been taught that the Normal is "normal" because of the Central Limit Theorem. Heavy tails are more "normal" than the Normal! 1. Additive Processes 2. Multiplicative Processes 3. Extremal Processes

Heavy-tailed phenomena are treated as something Mysterious, Surprising, & Controversial 1. Properties 2. Emergence 3. Identification

Heavy-tailed phenomena are treated as something Mysterious, Surprising, & Controversial. [Example: a 1999 Sigcomm paper with 4500+ citations claimed a power law, BUT a 2005 ToN paper tells a different story. Similar stories in electricity nets, citation nets, ...]

A typical approach for identifying heavy tails: linear regression. Heavy-tailed or light-tailed? [Frequency plot: frequency vs. value.]

A typical approach for identifying heavy tails: linear regression. Heavy-tailed or light-tailed? [Frequency plot on a log-linear scale: log(frequency) vs. value.]

A typical approach for identifying heavy tails: linear regression. On a log-linear scale, a linear frequency plot suggests an exponential tail: f(x) = μ e^{−μx} ⇒ log f(x) = log μ − μx.

A typical approach for identifying heavy tails: linear regression. On a log-linear scale, linear ⇒ exponential tail: log f(x) = log μ − μx. On a log-log scale, linear ⇒ power-law tail: f(x) = α x_m^α x^{−(α+1)} ⇒ log f(x) = log(α x_m^α) − (α + 1) log x.

log f(x) = log(α x_m^α) − (α + 1) log x: linear on a log-log scale ⇒ power-law tail. Regressing log(frequency) on log(value) gives an estimate of the tail index α.
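A sketch of the regression idea applied to the (better-behaved) rank plot: fit an ordinary least-squares line to log Pr(X > x) vs. log x for synthetic Pareto data (α = 2 and the sample size are illustrative choices):

```python
import math, random
random.seed(5)

def ls_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

alpha = 2.0
data = sorted((random.random() ** (-1.0 / alpha) for _ in range(10_000)), reverse=True)
log_x = [math.log(x) for x in data]
log_ccdf = [math.log((i + 1) / len(data)) for i in range(len(data))]  # rank plot
alpha_ls = -ls_slope(log_x, log_ccdf)
print(alpha_ls)   # regression estimate of the tail index, near 2
```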

log f(x) = log(α x_m^α) − (α + 1) log x: linear on a log-log scale ⇒ power-law tail, and regression gives an estimate of the tail index α. But: is it really linear? Is the estimate of α accurate?

[Figure: the same data (true α = 2.0) on a log-log frequency plot, where regression gives α̂ = 1.4, and on a log-log rank plot of Pr(X > x), where regression gives α̂ = 1.9.]

This simple change is extremely important. [Figure: rank plot (the empirical Pr(X > x)) vs. frequency plot, both on a log-log scale.]

This simple change is extremely important. Does this look like a power law? [Figure: log-log rank plot vs. log-log frequency plot.] The data is from an Exponential!

This mistake has happened A LOT! Electricity grid degree distribution: [Figure: the log-log frequency plot was fit with α̂ = 3 (from Science), but the rank plot Pr(X > x) is linear on a log-linear scale, i.e., the tail looks Exponential.]

This mistake has happened A LOT! WWW degree distribution: [Figure: the log-log frequency plot gives α̂ = 1.1 (from Science), while the log-log rank plot Pr(X > x) gives α̂ = 1.7.]

This simple change is extremely important, but this is still an error-prone approach: other distributions can be nearly linear on a log-log rank plot too. [Figures: LogNormal and Weibull rank plots on a log-log scale look nearly linear.]

This simple change is extremely important, but this is still an error-prone approach: the assumptions of regression are not met, and the tail is much noisier than the body.

A completely different approach: Maximum Likelihood Estimation (MLE). What is the α for which the data is most likely? L(α; x) = Σ_i log f(x_i; α) = n log α + n α log x_min − (α + 1) Σ_i log x_i. Maximizing gives α̂ = n / Σ_i log(x_i / x_min). This has many nice properties: α̂ is the minimal-variance unbiased estimator, and α̂ is asymptotically efficient.
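The MLE formula above is easy to apply. A Python sketch on synthetic Pareto(x_min = 1, α = 2) data, sampled by inverse transform:

```python
import math, random
random.seed(3)

def pareto_mle(data, x_min=1.0):
    """MLE of the Pareto tail index: alpha_hat = n / sum(log(x_i / x_min))."""
    return len(data) / sum(math.log(x / x_min) for x in data)

alpha_true = 2.0
data = [random.random() ** (-1.0 / alpha_true) for _ in range(50_000)]
alpha_hat = pareto_mle(data)
print(alpha_hat)   # close to the true value 2.0
```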

Not so "completely different" after all: MLE is equivalent to weighted least squares (WLS) regression on the rank plot, asymptotically for large data sets, when the weights are chosen as w_i = 1/(log x_i − log x_min).

[Figure: log-log rank plot, true α = 2.0; WLS gives α̂ = 2.2 while unweighted least squares (LS) gives α̂ = 1.5.] "Listen to your body."

A quick summary of where we are: Suppose data comes from a power-law (Pareto) distribution, F̄(x) = (x/x_min)^{−α}. Then we can identify this visually with a log-log rank plot, and we can estimate α using either MLE or WLS.

What if the data is not exactly a power law? What if only the tail is power law? Can we just use MLE/WLS on the tail? But where does the tail start? [Figure: log-log rank plot.] Impossible to answer...

An example: suppose we have a mixture of power laws, F̄(x) = p x^{−α} + (1 − p) x^{−β} for x ≥ 1, with α < β. We want α̂ → α as n → ∞... but suppose we use x_min = 1 as our cutoff: the MLE converges to 1 / ( p/α + (1 − p)/β ), a biased combination of the two exponents.

Identifying power-law distributions: listen to your body (MLE/WLS) vs. identifying power-law tails: let the tail do the talking (extreme value theory).

Returning to our example: F̄(x) = p x^{−α} + (1 − p) x^{−β} with α < β. We want α̂ → α as n → ∞. If we instead apply the MLE only to the data above a cutoff x_min, the estimate is still a biased combination of α and β, but the bias disappears as x_min → ∞!

The idea: improve robustness by throwing away nearly all the data! Use only the k_n largest observations, where k_n / n → 0 as n → ∞. + Larger cutoff: small bias. − Larger cutoff: larger variance.

The Hill estimator: α̂_{k,n} = ( (1/k) Σ_{i=1}^{k} log( x_(i) / x_(k+1) ) )^{−1}, where x_(i) is the i-th largest data point. It looks almost like the MLE, but uses the (k+1)-th order statistic as the cutoff.

How do we choose k? Take k_n → ∞ with k_n / n → 0 as n → ∞: throw away nearly all the data, but keep enough data for consistency. Throw away everything except the outliers!
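A sketch of the Hill estimator on synthetic data whose tail is Pareto (α = 2) but whose body is perturbed by additive noise (the choice of noise, sample size, and k are illustrative):

```python
import math, random
random.seed(4)

def hill_estimator(data, k):
    """Hill estimate using the k largest points: k / sum_i log(x_(i) / x_(k+1))."""
    xs = sorted(data, reverse=True)
    return k / sum(math.log(xs[i] / xs[k]) for i in range(k))

alpha = 2.0   # tail index of the Pareto part
data = [random.random() ** (-1.0 / alpha) + random.random() for _ in range(100_000)]
alpha_hill = hill_estimator(data, k=1_000)
print(alpha_hill)   # near 2, despite the non-Pareto body
```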

Choosing k in practice: the Hill plot. Plot the Hill estimate α̂_{k,n} against k and look for a stable region.

Choosing k in practice: the Hill plot. [Figure: for Pareto data with α = 2 the Hill plot settles near 2 (axis 1.6–2.4); for Exponential data it never settles (axis 0–6).]

Choosing k in practice: the Hill plot. [Figure: log-log rank plots and Hill plots for Pareto data with α = 2 and for a mixture with a Pareto tail (α = 2); both Hill plots settle near 2.]

...but the Hill estimator has problems too. [Hill "horror" plot: the estimate never settles, e.g., α̂ = 1.4 at k = 300 but α̂ = 0.92 at k = 12000.] This data is from TCP flow sizes!

Identifying power-law distributions: listen to your body (MLE/WLS). Identifying power-law tails: let the tail do the talking (the Hill estimator). It's dangerous to rely on any one technique! (See our forthcoming book for other approaches.)

Heavy-tailed phenomena are treated as something Mysterious, Surprising, & Controversial 1. Properties 2. Emergence 3. Identification

Heavy-tailed phenomena are treated as something Mysterious, Surprising, & Controversial 1. Properties 2. Emergence 3. Identification [Venn diagram: Pareto ⊂ Regularly Varying ⊂ Subexponential ⊂ Long-tailed ⊂ Heavy-tailed, with Weibull and LogNormal inside Subexponential.]
