
SOCIETY FOR AMERICAN BASEBALL RESEARCH
The Pythagorean Won-Loss Formula in Baseball: An Introduction to Statistics and Modeling
Steven J. Miller (sjmiller@math.brown.edu)
Brown University
Boston, MA, May 20th, 2006
http://www.math.brown.edu/~sjmiller

Acknowledgements
For many useful discussions: Sal Baxamusa, Phil Birnbaum, Ray Ciccolella, Steve Johnston, Michelle Manes, Russ Mann.
Dedicated to my great uncle Newt Bromberg: a lifetime Sox fan who promised me that I would live to see a World Series Championship in Boston. 1

Goals of the Talk
Derive James' Pythagorean Won-Loss formula from a reasonable model.
Introduce some of the techniques of modeling.
Discuss the mathematics behind the models and model testing.
Discuss the 2004 World Champion Boston Red Sox! 3

Pythagorean Won-Loss Formula
Parameters:
RS_obs: average number of runs scored per game;
RA_obs: average number of runs allowed per game;
γ: some parameter, constant for a sport.
Bill James' Won-Loss Formula (numerical observation):
$$\text{Won-Loss Percentage} = \frac{RS_{obs}^{\gamma}}{RS_{obs}^{\gamma} + RA_{obs}^{\gamma}}$$
γ was originally taken as 2; numerical studies show the best γ is about 1.82. 4

Applications of the Pythagorean Won-Loss Formula
Extrapolation: use the formula half-way through the season to predict a team's performance.
Evaluation: see whether a team consistently over-performs or under-performs.
There are other statistics / formulas (example: run differential per game); the advantage of the Pythagorean formula is that it is very easy to apply and depends on only two simple numbers for a team. 5
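
A minimal sketch of how the formula is applied in practice; the per-game averages below are made-up numbers, and γ = 1.82 is the numerically observed value quoted above.

```python
def pythagorean_wpct(rs_obs: float, ra_obs: float, gamma: float = 1.82) -> float:
    """Bill James' Pythagorean winning percentage from average runs scored/allowed per game."""
    return rs_obs**gamma / (rs_obs**gamma + ra_obs**gamma)

# Extrapolation example with hypothetical mid-season averages.
rs, ra = 5.1, 4.4                       # made-up runs scored / allowed per game
wpct = pythagorean_wpct(rs, ra)
print(f"expected winning percentage: {wpct:.3f}")
print(f"projected wins over a 162-game season: {162 * wpct:.1f}")
```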

Probability Review
X is a random variable with density p(x):
Probability density: p(x) ≥ 0 and $\int_{-\infty}^{\infty} p(x)\,dx = 1$;
$\text{Prob}(X \in [a, b]) = \int_a^b p(x)\,dx$.
Mean (average value): $\mu = \int x\,p(x)\,dx$.
Variance (how spread out): $\sigma^2 = \int (x - \mu)^2\,p(x)\,dx$.
Independence: two random variables are independent if knowledge of one does not give knowledge of the other. 8
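
As a quick numerical illustration of these definitions (not part of the talk), the snippet below approximates the integrals for the exponential density p(x) = e^(-x) on [0, ∞) with a simple Riemann sum; the mean and variance should both come out near 1.

```python
import numpy as np

x = np.linspace(0.0, 50.0, 500_001)     # [0, 50] is effectively the whole support here
dx = x[1] - x[0]
p = np.exp(-x)                          # exponential density as a concrete example

total = np.sum(p) * dx                              # integral of p(x) dx, ~1
mu = np.sum(x * p) * dx                             # mean, ~1
var = np.sum((x - mu) ** 2 * p) * dx                # variance, ~1
prob_12 = np.sum(p[(x >= 1) & (x <= 2)]) * dx       # Prob(X in [1, 2]), ~ e^-1 - e^-2
print(total, mu, var, prob_12)
```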

Modeling the Real World
Guidelines for Modeling:
Model should capture key features of the system;
Model should be mathematically tractable (solvable).
In general these are conflicting goals. How should we try and model baseball games?
Possible Model: Runs Scored and Runs Allowed are independent random variables; f_RS(x), g_RA(y): probability density functions for runs scored (allowed).
Reduced to calculating
$$\int_{x} \left[ \int_{y \le x} f_{RS}(x)\, g_{RA}(y)\, dy \right] dx \qquad \text{or} \qquad \sum_{i} \sum_{j < i} f_{RS}(i)\, g_{RA}(j).$$ 10
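
A small sketch of the discrete version of this calculation; the run distributions f_RS and g_RA below are hypothetical (truncated, Poisson-shaped numbers), used only to show the double sum over i and j < i.

```python
import numpy as np
from math import exp, factorial

# Hypothetical per-game run distributions on 0..15 runs (made-up, Poisson-shaped).
f_RS = np.array([exp(-5.0) * 5.0**k / factorial(k) for k in range(16)])
g_RA = np.array([exp(-4.5) * 4.5**k / factorial(k) for k in range(16)])
f_RS /= f_RS.sum()
g_RA /= g_RA.sum()                      # renormalize after truncating at 15 runs

# P(runs scored > runs allowed): sum over i of f_RS(i) * sum_{j < i} g_RA(j).
p_win = sum(f_RS[i] * g_RA[:i].sum() for i in range(len(f_RS)))
print(f"P(RS > RA) = {p_win:.3f}")      # ties (i == j) are excluded, as in the sum above
```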

Problems with the Model
Reduced to calculating
$$\int_{x} \left[ \int_{y \le x} f_{RS}(x)\, g_{RA}(y)\, dy \right] dx \qquad \text{or} \qquad \sum_{i} \sum_{j < i} f_{RS}(i)\, g_{RA}(j).$$
Problems with the model:
Can the integral (or sum) be completed in closed form?
Are the runs scored and allowed independent random variables?
What are f_RS and g_RA? 11

Three Parameter Weibull
Weibull distribution:
$$f(x; \alpha, \beta, \gamma) = \begin{cases} \frac{\gamma}{\alpha}\left(\frac{x-\beta}{\alpha}\right)^{\gamma-1} e^{-\left((x-\beta)/\alpha\right)^{\gamma}} & \text{if } x \ge \beta \\ 0 & \text{otherwise.} \end{cases}$$
α: scale (variance: meters versus centimeters);
β: origin (mean: translation, zero point);
γ: shape (behavior near β and at infinity).
Various values give different shapes, but can we find α, β, γ such that it fits observed data? Is the Weibull theoretically tractable? 12
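
For concreteness, a sketch of this three-parameter density in code; scipy's weibull_min with shape γ, location β, and scale α appears to be the same family, so it is used as a cross-check (that equivalence is an assumption of this sketch, not a statement from the talk).

```python
import numpy as np
from scipy.stats import weibull_min

def weibull_pdf(x, alpha, beta, gamma):
    """(gamma/alpha) * ((x-beta)/alpha)**(gamma-1) * exp(-((x-beta)/alpha)**gamma) for x >= beta."""
    z = np.maximum((np.asarray(x, dtype=float) - beta) / alpha, 0.0)
    return (gamma / alpha) * z ** (gamma - 1) * np.exp(-(z ** gamma))

x = np.linspace(0.5, 10.0, 5)
print(weibull_pdf(x, alpha=2.0, beta=0.0, gamma=1.82))
print(weibull_min.pdf(x, c=1.82, loc=0.0, scale=2.0))   # should agree with the line above
```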

Weibull Plots: Parameters (α, β, γ)
[Figure: Weibull density curves. Red: (1, 0, 1) (exponential); Green: (1, 0, 2); Cyan: (1, 2, 2); Blue: (1, 2, 4).] 13
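
A short matplotlib sketch that should reproduce curves like the ones described in the figure, again assuming scipy's weibull_min (shape γ, location β, scale α) matches the density defined above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import weibull_min

x = np.linspace(0.0, 5.0, 500)
curves = [((1, 0, 1), "red"), ((1, 0, 2), "green"), ((1, 2, 2), "cyan"), ((1, 2, 4), "blue")]
for (alpha, beta, gamma), color in curves:
    plt.plot(x, weibull_min.pdf(x, c=gamma, loc=beta, scale=alpha),
             color=color, label=f"(alpha, beta, gamma) = ({alpha}, {beta}, {gamma})")
plt.legend()
plt.title("Weibull densities for several (alpha, beta, gamma)")
plt.show()
```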

Weibull Integrations
Let f(x; α, β, γ) be the probability density of a Weibull(α, β, γ):
$$f(x; \alpha, \beta, \gamma) = \begin{cases} \frac{\gamma}{\alpha}\left(\frac{x-\beta}{\alpha}\right)^{\gamma-1} e^{-\left((x-\beta)/\alpha\right)^{\gamma}} & \text{if } x \ge \beta \\ 0 & \text{otherwise.} \end{cases}$$
For s ∈ ℂ with the real part of s greater than 0, recall the Γ-function:
$$\Gamma(s) = \int_0^{\infty} e^{-u} u^{s-1}\, du = \int_0^{\infty} e^{-u} u^{s}\, \frac{du}{u},$$
with Γ(n) = (n − 1)! (i.e., Γ(4) = 3 · 2 · 1).
Let µ_{α,β,γ} denote the mean of f(x; α, β, γ). 14

Weibull Integrations (Continued)
$$\mu_{\alpha,\beta,\gamma} = \int_{\beta}^{\infty} x \cdot \frac{\gamma}{\alpha}\left(\frac{x-\beta}{\alpha}\right)^{\gamma-1} e^{-\left((x-\beta)/\alpha\right)^{\gamma}} dx = \int_{\beta}^{\infty} \alpha\,\frac{x-\beta}{\alpha} \cdot \frac{\gamma}{\alpha}\left(\frac{x-\beta}{\alpha}\right)^{\gamma-1} e^{-\left((x-\beta)/\alpha\right)^{\gamma}} dx + \beta.$$
Change variables: $u = \left(\frac{x-\beta}{\alpha}\right)^{\gamma}$. Then $du = \frac{\gamma}{\alpha}\left(\frac{x-\beta}{\alpha}\right)^{\gamma-1} dx$ and
$$\mu_{\alpha,\beta,\gamma} = \int_0^{\infty} \alpha u^{1/\gamma} e^{-u}\, du + \beta = \alpha \int_0^{\infty} e^{-u} u^{1+\gamma^{-1}}\, \frac{du}{u} + \beta = \alpha\,\Gamma(1+\gamma^{-1}) + \beta.$$
A similar calculation determines the variance. 15
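
A quick numerical check of the closed form αΓ(1 + γ⁻¹) + β against scipy's mean for the same distribution; the parameter values are arbitrary and only illustrative.

```python
from scipy.special import gamma as Gamma
from scipy.stats import weibull_min

alpha, beta, g = 4.0, -0.5, 1.82        # arbitrary illustrative parameters
closed_form = alpha * Gamma(1 + 1 / g) + beta
scipy_mean = weibull_min.mean(c=g, loc=beta, scale=alpha)
print(closed_form, scipy_mean)          # the two values should agree
```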

Pythagorean Won-Loss Formula
Theorem (Pythagorean Won-Loss Formula): Let the runs scored and allowed per game be two independent random variables drawn from Weibull distributions with parameters (α_RS, β, γ) and (α_RA, β, γ), where α_RS and α_RA are chosen so that the means are RS and RA. If γ > 0 then
$$\text{Won-Loss Percentage}(RS, RA, \beta, \gamma) = \frac{(RS-\beta)^{\gamma}}{(RS-\beta)^{\gamma} + (RA-\beta)^{\gamma}}.$$
In baseball take β = −.5 (since runs must be integers); RS − β estimates the average runs scored and RA − β estimates the average runs allowed. 16
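
A Monte Carlo sanity check of the theorem (a sketch, not part of the talk): draw many independent (runs scored, runs allowed) pairs from Weibulls whose means are RS and RA, and compare the simulated P(X > Y) with the closed form. RS and RA below are made-up per-game averages.

```python
import numpy as np
from scipy.special import gamma as Gamma

rng = np.random.default_rng(0)
RS, RA, beta, g = 5.5, 4.8, -0.5, 1.82          # hypothetical averages and parameters

# Scale parameters chosen so the Weibull means equal RS and RA (mean = alpha*Gamma(1 + 1/gamma) + beta).
a_RS = (RS - beta) / Gamma(1 + 1 / g)
a_RA = (RA - beta) / Gamma(1 + 1 / g)

n = 1_000_000
X = beta + a_RS * rng.weibull(g, n)             # runs scored
Y = beta + a_RA * rng.weibull(g, n)             # runs allowed

simulated = np.mean(X > Y)
formula = (RS - beta) ** g / ((RS - beta) ** g + (RA - beta) ** g)
print(simulated, formula)                       # should agree to a few decimal places
```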

Best Fit Weibulls to Data: Method of Least Squares
Minimize the sum of squares of the error from the runs scored data plus the sum of squares of the error from the runs allowed data.
Bin(k) is the k-th bin; RS_obs(k) (resp. RA_obs(k)) is the observed number of games with the number of runs scored (allowed) in Bin(k); A(α, β, γ, k) is the area under the Weibull with parameters (α, β, γ) in Bin(k).
Find the values of (α_RS, α_RA, γ) that minimize
$$\sum_{k=1}^{\#\text{Bins}} \Big( RS_{obs}(k) - \#\text{Games}\cdot A(\alpha_{RS}, -.5, \gamma, k)\Big)^2 + \sum_{k=1}^{\#\text{Bins}} \Big( RA_{obs}(k) - \#\text{Games}\cdot A(\alpha_{RA}, -.5, \gamma, k)\Big)^2.$$ 17
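
A rough sketch of this least-squares fit using scipy; the bin edges follow the bins listed on the next slide, but the per-bin game counts are made-up stand-ins for a team's observed runs scored/allowed histograms, and the optimizer choice is mine rather than the talk's.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

# Bins [-.5, .5), [.5, 1.5), ..., [8.5, 9.5), [9.5, 11.5), [11.5, inf).
edges = np.array([-0.5, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 11.5, np.inf])
rs_obs = np.array([5, 10, 18, 22, 24, 21, 17, 14, 10, 8, 8, 5])   # hypothetical counts, 162 games
ra_obs = np.array([7, 13, 20, 24, 23, 20, 16, 13, 9, 7, 6, 4])
n_games = rs_obs.sum()

def bin_areas(alpha, gamma, beta=-0.5):
    """A(alpha, beta, gamma, k): area under the Weibull density in each bin."""
    return np.diff(weibull_min.cdf(edges, c=gamma, loc=beta, scale=alpha))

def loss(params):
    a_rs, a_ra, gamma = params
    if min(a_rs, a_ra, gamma) <= 0:
        return 1e12                     # keep the search inside the valid parameter region
    return (np.sum((rs_obs - n_games * bin_areas(a_rs, gamma)) ** 2)
            + np.sum((ra_obs - n_games * bin_areas(a_ra, gamma)) ** 2))

res = minimize(loss, x0=[5.0, 5.0, 1.8], method="Nelder-Mead")
print(res.x)                            # fitted (alpha_RS, alpha_RA, gamma)
```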

Best Fit Weibulls to Data (Method of Maximum Likelihood)
[Figure: plots of RS predicted vs. observed and RA predicted vs. observed for the Boston Red Sox.]
Using as bins [-.5, .5], [.5, 1.5], ..., [7.5, 8.5], [8.5, 9.5], [9.5, 11.5], [11.5, ∞). 18

[Figure: plots of RS predicted vs. observed and RA predicted vs. observed for the New York Yankees.] 19

[Figure: plots of RS predicted vs. observed and RA predicted vs. observed for the Baltimore Orioles.] 20

[Figure: plots of RS predicted vs. observed and RA predicted vs. observed for the Tampa Bay Devil Rays.] 21

[Figure: plots of RS predicted vs. observed and RA predicted vs. observed for the Toronto Blue Jays.] 22

Bonferroni Adjustments
Imagine a fair coin. If we flip it 1,000,000 times we expect 500,000 heads; approximately 95% of the time, 499,000 ≤ #Heads ≤ 501,000.
Consider N independent experiments of flipping a fair coin 1,000,000 times each. What is the probability that at least one of the experiments doesn't have 499,000 ≤ #Heads ≤ 501,000?
We see unlikely events happen as N increases!

  N    Prob (%)
  5     22.62
 10     40.13
 14     51.23
 20     64.15
 50     92.31
25
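
The percentages in the table follow from 1 − 0.95^N, assuming each experiment independently lands in its 95% interval; this short loop reproduces them.

```python
# Probability that at least one of N independent experiments misses its own 95% interval.
for N in (5, 10, 14, 20, 50):
    print(N, round(100 * (1 - 0.95 ** N), 2))
```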

Team                    RS & RA χ²: 20 d.f.   Indep. χ²: 109 d.f.
Boston Red Sox                15.63                  83.19
New York Yankees              12.60                 129.13
Baltimore Orioles             29.11                 116.88
Tampa Bay Devil Rays          13.67                 111.08
Toronto Blue Jays             41.18                 100.11
Minnesota Twins               17.46                  97.93
Chicago White Sox             22.51                 153.07
Cleveland Indians             17.88                 107.14
Detroit Tigers                12.50                 131.27
Kansas City Royals            28.18                 111.45
Los Angeles Angels            23.19                 125.13
Oakland Athletics             30.22                 133.72
Texas Rangers                 16.57                 111.96
Seattle Mariners              21.57                 141.00

20 d.f.: 31.41 (at the 95% level) and 37.57 (at the 99% level).
109 d.f.: 134.4 (at the 95% level) and 146.3 (at the 99% level).
Bonferroni Adjustment:
20 d.f.: 41.14 (at the 95% level) and 46.38 (at the 99% level).
109 d.f.: 152.9 (at the 95% level) and 162.2 (at the 99% level). 26
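
A sketch of how the quoted critical values and their Bonferroni adjustments can be recomputed, assuming the adjustment divides the significance level by the 14 teams tested; the output should match the slide up to rounding.

```python
from scipy.stats import chi2

n_tests = 14                         # one test per American League team (assumption)
for df in (20, 109):
    for level in (0.95, 0.99):
        plain = chi2.ppf(level, df)                            # ordinary critical value
        bonferroni = chi2.ppf(1 - (1 - level) / n_tests, df)   # Bonferroni-adjusted value
        print(f"{df} d.f., {level:.0%}: {plain:.2f} (adjusted: {bonferroni:.2f})")
```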

Testing the Model: Data from Method of Maximum Likelihood

Team                   ObsWins  PredWins  ObsPerc  PredPerc  GamesDiff    γ
Boston Red Sox            98      93.0     0.605    0.574       5.03    1.82
New York Yankees         101      87.5     0.623    0.540      13.49    1.78
Baltimore Orioles         78      83.1     0.481    0.513      -5.08    1.66
Tampa Bay Devil Rays      70      69.6     0.435    0.432       0.38    1.83
Toronto Blue Jays         67      74.6     0.416    0.464      -7.65    1.97
Minnesota Twins           92      84.7     0.568    0.523       7.31    1.79
Chicago White Sox         83      85.3     0.512    0.527      -2.33    1.73
Cleveland Indians         80      80.0     0.494    0.494       0.00    1.79
Detroit Tigers            72      80.0     0.444    0.494      -8.02    1.78
Kansas City Royals        58      68.7     0.358    0.424     -10.65    1.76
Los Angeles Angels        92      87.5     0.568    0.540       4.53    1.71
Oakland Athletics         91      84.0     0.562    0.519       6.99    1.76
Texas Rangers             89      87.3     0.549    0.539       1.71    1.90
Seattle Mariners          63      70.7     0.389    0.436      -7.66    1.78

γ: mean = 1.74, standard deviation = .06, median = 1.76; close to the numerically observed value of 1.82.
The mean of the difference between observed and predicted wins was -.13 with a standard deviation of 7.11 (and a median of 0.19). If we consider just the absolute value of the difference then we have a mean of 5.77 with a standard deviation of 3.85 (and a median of 6.04). 27
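
The win-difference summary statistics quoted above can be recomputed directly from the GamesDiff column (observed minus predicted wins); a small sketch whose output should land close to those values.

```python
import numpy as np

diff = np.array([5.03, 13.49, -5.08, 0.38, -7.65, 7.31, -2.33, 0.00,
                 -8.02, -10.65, 4.53, 6.99, 1.71, -7.66])

# Signed differences: mean, sample standard deviation, median.
print(diff.mean(), diff.std(ddof=1), np.median(diff))
# Absolute differences: mean, sample standard deviation, median.
print(np.abs(diff).mean(), np.abs(diff).std(ddof=1), np.median(np.abs(diff)))
```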

Conclusions Can find parameters such that the Weibulls are good fits to the data; The runs scored and allowed per game are statistically independent; The Pythagorean Won-Loss Formula is a consequence of our model; Method of Least Squares: best value of γ of about 1.79; Method of Maximum Likelihood: best value of γ of about 1.74; both are close to the observed best 1.82. 28

Future Work Micro-analysis: runs scored and allowed are not entirely independent (big lead, close game), run production smaller for inter-league games in NL parks, et cetera. What about other sports? Does the same model work? How does γ depend on the sport? Are there other probability distributions that give integrals which can be determined in closed form? 29

Bibliography
1. Baxamusa, Sal: Weibull worksheet: http://www.beyondtheboxscore.com/story/2006/4/30/114737/251
   Run distribution plots for various teams: http://www.beyondtheboxscore.com/story/2006/2/23/164417/484
2. Miller, Steven J.: A Derivation of James' Pythagorean projection. By The Numbers, The Newsletter of the SABR Statistical Analysis Committee, vol. 16 (February 2006), no. 1, 17-22. http://www.philbirnbaum.com/btn2006-02.pdf
   A derivation of the Pythagorean Won-Loss Formula in baseball (preprint, 13 pgs). http://www.math.brown.edu/~sjmiller/math/papers/PythagWonLoss_Paper.pdf 30

Appendix: Proof of the Pythagorean Won-Loss Formula
Proof. Let X and Y be independent random variables with Weibull distributions (α_RS, β, γ) and (α_RA, β, γ) respectively, where
$$\alpha_{RS} = \frac{RS-\beta}{\Gamma(1+\gamma^{-1})}, \qquad \alpha_{RA} = \frac{RA-\beta}{\Gamma(1+\gamma^{-1})}.$$
We need only calculate the probability that X exceeds Y, and we use the fact that the integral of a probability density is 1. 31

$$\begin{aligned}
\text{Prob}(X > Y) &= \int_{x=\beta}^{\infty} \int_{y=\beta}^{x} f(x; \alpha_{RS}, \beta, \gamma)\, f(y; \alpha_{RA}, \beta, \gamma)\, dy\, dx \\
&= \int_{x=\beta}^{\infty} \frac{\gamma}{\alpha_{RS}} \left(\frac{x-\beta}{\alpha_{RS}}\right)^{\gamma-1} e^{-\left(\frac{x-\beta}{\alpha_{RS}}\right)^{\gamma}} \left[ \int_{y=\beta}^{x} \frac{\gamma}{\alpha_{RA}} \left(\frac{y-\beta}{\alpha_{RA}}\right)^{\gamma-1} e^{-\left(\frac{y-\beta}{\alpha_{RA}}\right)^{\gamma}} dy \right] dx \\
&= \int_{x=0}^{\infty} \frac{\gamma}{\alpha_{RS}} \left(\frac{x}{\alpha_{RS}}\right)^{\gamma-1} e^{-\left(\frac{x}{\alpha_{RS}}\right)^{\gamma}} \left[ \int_{y=0}^{x} \frac{\gamma}{\alpha_{RA}} \left(\frac{y}{\alpha_{RA}}\right)^{\gamma-1} e^{-\left(\frac{y}{\alpha_{RA}}\right)^{\gamma}} dy \right] dx \\
&= \int_{x=0}^{\infty} \frac{\gamma}{\alpha_{RS}} \left(\frac{x}{\alpha_{RS}}\right)^{\gamma-1} e^{-(x/\alpha_{RS})^{\gamma}} \left[ 1 - e^{-(x/\alpha_{RA})^{\gamma}} \right] dx \\
&= 1 - \int_{x=0}^{\infty} \frac{\gamma}{\alpha_{RS}} \left(\frac{x}{\alpha_{RS}}\right)^{\gamma-1} e^{-(x/\alpha)^{\gamma}}\, dx,
\end{aligned}$$
where we have set
$$\frac{1}{\alpha^{\gamma}} = \frac{1}{\alpha_{RS}^{\gamma}} + \frac{1}{\alpha_{RA}^{\gamma}} = \frac{\alpha_{RS}^{\gamma} + \alpha_{RA}^{\gamma}}{\alpha_{RS}^{\gamma}\, \alpha_{RA}^{\gamma}}.$$
32

$$\begin{aligned}
\text{Prob}(X > Y) &= 1 - \frac{\alpha^{\gamma}}{\alpha_{RS}^{\gamma}} \int_0^{\infty} \frac{\gamma}{\alpha} \left(\frac{x}{\alpha}\right)^{\gamma-1} e^{-(x/\alpha)^{\gamma}}\, dx \\
&= 1 - \frac{\alpha^{\gamma}}{\alpha_{RS}^{\gamma}} = 1 - \frac{1}{\alpha_{RS}^{\gamma}} \cdot \frac{\alpha_{RS}^{\gamma}\, \alpha_{RA}^{\gamma}}{\alpha_{RS}^{\gamma} + \alpha_{RA}^{\gamma}} = \frac{\alpha_{RS}^{\gamma}}{\alpha_{RS}^{\gamma} + \alpha_{RA}^{\gamma}}.
\end{aligned}$$
We substitute the relations for α_RS and α_RA and find that
$$\text{Prob}(X > Y) = \frac{(RS-\beta)^{\gamma}}{(RS-\beta)^{\gamma} + (RA-\beta)^{\gamma}}.$$
Note that RS − β estimates RS_obs and RA − β estimates RA_obs. 33