The Pythagorean Won-Loss Theorem: An introduction to modeling

Size: px

Start display at page:

Download "The Pythagorean Won-Loss Theorem: An introduction to modeling"

Laureen Mason
5 years ago
Views:

1 The Pythagorean Won-Loss Theorem: An introduction to modeling Steven J Miller Brown University sjmiller@math.brown.edu sjmiller Williams College, January 15, 2008.

2 Acknowledgements Sal Baxamusa, Phil Birnbaum, Ray Ciccolella, Steve Johnston, Michelle Manes, Russ Mann, students of Math 162 at Brown. Dedicated to my great uncle Newt Bromberg (a lifetime Red Sox fan who promised me that I would live to see a World Series Championship in Boston). Chris Long and the San Diego Padres.

3 Goals of the Talk Derive James Pythagorean Won-Loss formula from a reasonable model.

4 Goals of the Talk Derive James Pythagorean Won-Loss formula from a reasonable model. Introduce some of the techniques of modeling.

5 5 Intro Prob. & Modeling Pythg. Thm Analysis of 04 Adv. Theory Summary Refs Appendices Goals of the Talk Derive James Pythagorean Won-Loss formula from a reasonable model. Introduce some of the techniques of modeling. Discuss the mathematics behind the models and model testing.

6 Goals of the Talk Derive James Pythagorean Won-Loss formula from a reasonable model. Introduce some of the techniques of modeling. Discuss the mathematics behind the models and model testing. Show how advanced theory enters in simple problems.

7 7 Intro Prob. & Modeling Pythg. Thm Analysis of 04 Adv. Theory Summary Refs Appendices Goals of the Talk Derive James Pythagorean Won-Loss formula from a reasonable model. Introduce some of the techniques of modeling. Discuss the mathematics behind the models and model testing. Show how advanced theory enters in simple problems. Further avenues for research for students.

8 Numerical Observation: Pythagorean Won-Loss Formula Parameters RS obs : average number of runs scored per game; RA obs : average number of runs allowed per game; γ: some parameter, constant for a sport.

9 Numerical Observation: Pythagorean Won-Loss Formula Parameters RS obs : average number of runs scored per game; RA obs : average number of runs allowed per game; γ: some parameter, constant for a sport. James Won-Loss Formula (NUMERICAL Observation) Won Loss Percentage = RS obs γ RS obs γ + RA obs γ γ originally taken as 2, numerical studies show best γ is about 1.82.

10 Applications of the Pythagorean Won-Loss Formula Extrapolation: use half-way through season to predict a team s performance.

11 Applications of the Pythagorean Won-Loss Formula Extrapolation: use half-way through season to predict a team s performance. Evaluation: see if consistently over-perform or under-perform.

12 Applications of the Pythagorean Won-Loss Formula Extrapolation: use half-way through season to predict a team s performance. Evaluation: see if consistently over-perform or under-perform. Advantage: Other statistics / formulas (run-differential per game); this is easy to use, depends only on two simple numbers for a team.

13 Probability Review Probability density: p(x) 0; p(x)dx = 1; X random variable with density p(x): Prob (X [a, b]) = b p(x)dx. a

14 Probability Review Probability density: p(x) 0; p(x)dx = 1; X random variable with density p(x): Prob (X [a, b]) = b p(x)dx. a Mean µ = xp(x)dx.

15 Probability Review Probability density: p(x) 0; p(x)dx = 1; X random variable with density p(x): Prob (X [a, b]) = b p(x)dx. a Mean µ = xp(x)dx. Variance σ 2 = (x µ)2 p(x)dx.

16 Probability Review Probability density: p(x) 0; p(x)dx = 1; X random variable with density p(x): Prob (X [a, b]) = b p(x)dx. a Mean µ = xp(x)dx. Variance σ 2 = (x µ)2 p(x)dx. Independence: two random variables are independent if knowledge of one does not give knowledge of the other.

17 Modeling the Real World Guidelines for Modeling: Model should capture key features of the system; Model should be mathematically tractable (solvable).

18 Modeling the Real World Guidelines for Modeling: Model should capture key features of the system; Model should be mathematically tractable (solvable). In general these are conflicting goals. How should we try and model baseball games?

19 Modeling the Real World (cont) Possible Model: Runs Scored and Runs Allowed independent random variables; f RS (x), g RA (y): probability density functions for runs scored (allowed).

20 Modeling the Real World (cont) Possible Model: Runs Scored and Runs Allowed independent random variables; f RS (x), g RA (y): probability density functions for runs scored (allowed). Reduced to calculating x [ y x ] f RS (x)g RA (y)dy dx or j<i i f RS (i)g RA (j).

21 Problems with the Model Reduced to calculating x [ y x ] f RS (x)g RA (y)dy dx or i j<i f RS (i)g RA (j).

22 Problems with the Model Reduced to calculating x [ y x ] f RS (x)g RA (y)dy dx or i j<i f RS (i)g RA (j). Problems with the model: Can the integral (or sum) be completed in closed form? Are the runs scored and allowed independent random variables? What are f RS and g RA?

23 Three Parameter Weibull Weibull distribution: f (x; α, β, γ) = { γ α ( x β ) γ 1 α e ((x β)/α) γ if x β 0 otherwise. α: scale (variance: meters versus centimeters); β: origin (mean: translation, zero point); γ: shape (behavior near β and at infinity). Various values give different shapes, but can we find α, β, γ such that it fits observed data? Is the Weibull theoretically tractable?

24 Weibull Plots: Parameters (α, β, γ) Red:(1, 0, 1) (exponential); Green:(1, 0, 2); Cyan:(1, 2, 2); Blue:(1, 2, 4)

25 Gamma Distribution For s C with the real part of s greater than 0, define the Γ-function: Γ(s) = e u u s 1 du = 0 0 e u u s du u. Generalizes factorial function: Γ(n) = (n 1)! for n 1 an integer.

26 Weibull Integrations µ α,β,γ = β x γ α ( ) γ 1 x β e ((x β)/α)γ dx α

27 Weibull Integrations µ α,β,γ = = β β x γ α α x β α ( ) γ 1 x β e ((x β)/α)γ dx α γ ( ) γ 1 x β e ((x β)/α)γ dx + β. α α

28 Weibull Integrations µ α,β,γ = x γ ( ) γ 1 x β e ((x β)/α)γ dx β α α = α x β β α γ ( ) γ 1 x β e ((x β)/α)γ dx + β. α α Change variables: u = ( ) x β γ. α ( Then du = γ x β ) γ 1 α α dx

29 Weibull Integrations µ α,β,γ = x γ ( ) γ 1 x β e ((x β)/α)γ dx β α α = α x β β α γ ( ) γ 1 x β e ((x β)/α)γ dx + β. α α Change variables: u = ( ) x β γ. α ( Then du = γ x β ) γ 1 α α dx and µ α,β,γ = 0 αu γ 1 e u du + β

30 Weibull Integrations µ α,β,γ = x γ ( ) γ 1 x β e ((x β)/α)γ dx β α α = α x β β α γ ( ) γ 1 x β e ((x β)/α)γ dx + β. α α Change variables: u = ( ) x β γ. α ( Then du = γ x β ) γ 1 α α dx and µ α,β,γ = = α 0 0 αu γ 1 e u du + β e u u 1+γ 1 du u + β

31 Weibull Integrations µ α,β,γ = x γ ( ) γ 1 x β e ((x β)/α)γ dx β α α = α x β β α γ ( ) γ 1 x β e ((x β)/α)γ dx + β. α α Change variables: u = ( ) x β γ. α ( Then du = γ x β ) γ 1 α α dx and µ α,β,γ = = α 0 0 αu γ 1 e u du + β e u u 1+γ 1 du u + β = αγ(1 + γ 1 ) + β.

32 Pythagorean Won-Loss Formula Theorem (Pythagorean Won-Loss Formula) Let the runs scored and allowed per game be two independent random variables drawn from Weibull distributions (α RS, β, γ) and (α RA, β, γ); α RS and α RA are chosen so that the means are RS and RA. If γ > 0 then Won-Loss Percentage(RS, RA, β, γ) = (RS β) γ (RS β) γ + (RA β) γ

33 Pythagorean Won-Loss Formula Theorem (Pythagorean Won-Loss Formula) Let the runs scored and allowed per game be two independent random variables drawn from Weibull distributions (α RS, β, γ) and (α RA, β, γ); α RS and α RA are chosen so that the means are RS and RA. If γ > 0 then Won-Loss Percentage(RS, RA, β, γ) = (RS β) γ (RS β) γ + (RA β) γ In baseball take β = 1/2 (from runs must be integers). RS β estimates average runs scored, RA β estimates average runs allowed.

34 Best Fit Weibulls to Data: Method of Least Squares Bin(k) is the k th bin;

35 Best Fit Weibulls to Data: Method of Least Squares Bin(k) is the k th bin; RS obs (k) (resp. RA obs (k)) the observed number of games with the number of runs scored (allowed) in Bin(k);

36 Best Fit Weibulls to Data: Method of Least Squares Bin(k) is the k th bin; RS obs (k) (resp. RA obs (k)) the observed number of games with the number of runs scored (allowed) in Bin(k); A(α, β, γ, k) the area under the Weibull with parameters (α, β, γ) in Bin(k).

37 Best Fit Weibulls to Data: Method of Least Squares Bin(k) is the k th bin; RS obs (k) (resp. RA obs (k)) the observed number of games with the number of runs scored (allowed) in Bin(k); A(α, β, γ, k) the area under the Weibull with parameters (α, β, γ) in Bin(k). Find the values of (α RS, α RA, γ) that minimize #Bins k=1 + (RS obs (k) #Games A(α RS, 1/2, γ, k)) 2 #Bins k=1 (RA obs (k) #Games A(α RA, 1/2, γ, k)) 2.

38 Best Fit Weibulls to Data (Method of Maximum Likelihood) Plots of RS predicted vs observed and RA predicted vs observed for the Boston Red Sox Using as bins [.5,.5] [.5, 1.5] [7.5, 8.5] [8.5, 9.5] [9.5, 11.5] [11.5, ).

39 Best Fit Weibulls to Data (Method of Maximum Likelihood) Plots of RS predicted vs observed and RA predicted vs observed for the New York Yankees Using as bins [.5,.5] [.5, 1.5] [7.5, 8.5] [8.5, 9.5] [9.5, 11.5] [11.5, ).

40 Best Fit Weibulls to Data (Method of Maximum Likelihood) Plots of RS predicted vs observed and RA predicted vs observed for the Baltimore Orioles Using as bins [.5,.5] [.5, 1.5] [7.5, 8.5] [8.5, 9.5] [9.5, 11.5] [11.5, ).

41 Best Fit Weibulls to Data (Method of Maximum Likelihood) Plots of RS predicted vs observed and RA predicted vs observed for the Tampa Bay Devil Rays Using as bins [.5,.5] [.5, 1.5] [7.5, 8.5] [8.5, 9.5] [9.5, 11.5] [11.5, ).

42 Best Fit Weibulls to Data (Method of Maximum Likelihood) Plots of RS predicted vs observed and RA predicted vs observed for the Toronto Blue Jays Using as bins [.5,.5] [.5, 1.5] [7.5, 8.5] [8.5, 9.5] [9.5, 11.5] [11.5, ).

43 Bonferroni Adjustments Fair coin: 1,000,000 flips, expect 500,000 heads.

44 Bonferroni Adjustments Fair coin: 1,000,000 flips, expect 500,000 heads. About 95% have 499, 000 #Heads 501, 000.

45 Bonferroni Adjustments Fair coin: 1,000,000 flips, expect 500,000 heads. About 95% have 499, 000 #Heads 501, 000. Consider N independent experiments of flipping a fair coin 1,000,000 times. What is the probability that at least one of set doesn t have 499, 000 #Heads 501, 000? N Probability See unlikely events happen as N increases!

46 Data Analysis: χ 2 Tests Team RS RA Χ2: 20 d.f. Indep Χ2: 109 d.f Boston Red Sox New York Yankees Baltimore Orioles Tampa Bay Devil Rays Toronto Blue Jays Minnesota Twins Chicago White Sox Cleveland Indians Detroit Tigers Kansas City Royals Los Angeles Angels Oakland Athletics Texas Rangers Seattle Mariners d.f.: (at the 95% level) and (at the 99% level). 109 d.f.: (at the 95% level) and (at the 99% level). Bonferroni Adjustment: 20 d.f.: (at the 95% level) and (at the 99% level). 109 d.f.: (at the 95% level) and (at the 99% level).

47 Data Analysis: Structural Zeros For independence of runs scored and allowed, use bins [0, 1) [1, 2) [2, 3) [8, 9) [9, 10) [10, 11) [11, ). Have an r c contingency table with structural zeros (runs scored and allowed per game are never equal). (Essentially) O r,r = 0 for all r, use an iterative fitting procedure to obtain maximum likelihood estimators for E r,c (expected frequency of cell (r, c) assuming that, given runs scored and allowed are distinct, the runs scored and allowed are independent).

48 Testing the Model: Data from Method of Maximum Likelihood Team Obs Wins Pred Wins ObsPerc PredPerc GamesDiff Γ Boston Red Sox New York Yankees Baltimore Orioles Tampa Bay Devil Rays Toronto Blue Jays Minnesota Twins Chicago White Sox Cleveland Indians Detroit Tigers Kansas City Royals Los Angeles Angels Oakland Athletics Texas Rangers Seattle Mariners γ: mean = 1.74, standard deviation =.06, median = 1.76; close to numerically observed value of 1.82.

49 Conclusions Find parameters such that Weibulls are good fits;

50 Conclusions Find parameters such that Weibulls are good fits; Runs scored and allowed per game are statistically independent;

51 Conclusions Find parameters such that Weibulls are good fits; Runs scored and allowed per game are statistically independent; Pythagorean Won-Loss Formula is a consequence of our model;

52 Conclusions Find parameters such that Weibulls are good fits; Runs scored and allowed per game are statistically independent; Pythagorean Won-Loss Formula is a consequence of our model; Best γ (both close to observed best 1.82): Method of Least Squares: 1.79; Method of Maximum Likelihood: 1.74.

53 Future Work Micro-analysis: runs scored and allowed are not entirely independent (big lead, close game), run production smaller for inter-league games in NL parks, et cetera. Other sports: Does the same model work? How does γ depend on the sport? Closed forms: Are there other probability distributions that give integrals which can be determined in closed form? Valuing Runs: Pythagorean formula used to value players (10 runs equals 1 win); better model leads to better team.

54 References Baxamusa, Sal: Weibull worksheet: Run distribution plots for various teams: Miller, Steven J.: A Derivation of James Pythagorean projection, By The Numbers The Newsletter of the SABR Statistical Analysis Committee, vol. 16 (February 2006), no. 1, A derivation of the Pythagorean Won-Loss Formula in baseball, Chance Magazine 20 (2007), no. 1, sjmiller/math/papers/pythagwonloss Paper.pdf

55 55 Intro Prob. & Modeling Pythg. Thm Analysis of 04 Adv. Theory Summary Refs Appendices Appendix I: Proof of the Pythagorean Won-Loss Formula Let X and Y be independent random variables with Weibull distributions (α RS, β, γ) and (α RA, β, γ) respectively. To have means of RS β and RA β our calculations for the means imply α RS = RS β Γ(1 + γ 1 ), α RA = RA β Γ(1 + γ 1 ). We need only calculate the probability that X exceeds Y. We use the integral of a probability density is 1.

56 Appendix I: Proof of the Pythagorean Won-Loss Formula (cont) Prob(X > Y ) = = = = x β β x=0 α RS = 1 γ γ x=0 α RS γ α RS ( x x=β ( x β α RS ( x γ x=0 α RS α RS α RS x y=β f (x; α RS, β, γ)f (y; α RA, β, γ)dy dx ) γ 1 e ( x β α RS ) γ ) [ γ 1 ( ) γ e x x α RS γ α RA γ y=0 α RA ( y β α RA ( y α RA ) γ 1 e (x/αrs) γ [ 1 e (x/αra) γ ] dx ( x α RS ) γ 1 e (x/α)γ dx, where we have set 1 α γ = 1 α γ + 1 RS α γ RA = αγ RS + αγ RA α γ. RS αγ RA ) γ 1 e ( y β α RA ) γ dydx ) ] γ 1 ( ) e y γ α RA dy dx

57 Appendix I: Proof of the Pythagorean Won-Loss Formula (cont) Prob(X > Y ) = 1 αγ γ ( x ) γ 1 e α γ (x/α) γ dx RS 0 α α = 1 αγ α γ RS = 1 1 α γ RS α γ RS αγ RA α γ RS + αγ RA = α γ RS α γ RS +. αγ RA We substitute the relations for α RS and α RA and find that Prob(X > Y ) = (RS β) γ (RS β) γ + (RA β) γ. 57 Note RS β estimates RS obs, RA β estimates RA obs.

58 Appendix II: Best Fit Weibulls and Structural Zeros The fits look good, but are they? Do χ 2 -tests: Let Bin(k) denote the k th bin. O r,c : the observed number of games where the team s runs scored is in Bin(r) and the runs allowed are in Bin(c). E r,c = (r, c). Then c O r,c r O r,c #Games #Rows r=1 is the expected frequency of cell #Columns c=1 (O r,c E r,c ) 2 E r,c is a χ 2 distribution with (#Rows 1)(#Columns 1) degrees of freedom.

59 Appendix II: Best Fit Weibulls and Structural Zeros (cont) For independence of runs scored and allowed, use bins [0, 1) [1, 2) [2, 3) [8, 9) [9, 10) [10, 11) [11, ). Have an r c contingency table (with r = c = 12); however, there are structural zeros (runs scored and allowed per game can never be equal). (Essentially) O r,r = 0 for all r. We use the iterative fitting procedure to obtain maximum likelihood estimators for the E r,c, the expected frequency of cell (r, c) under the assumption that, given that the runs scored and allowed are distinct, the runs scored and allowed are independent. For 1 r, c 12, let E (0) r,c = 1 if r c and 0 if r = c. Set Then and X r,+ = O r,c, X +,c = O r,c. c=1 r=1 E (l 1) E (l) r,c X r,+ / 12 c=1 E(l 1) r,c if l is odd r,c = E (l 1) r,c X +,c / 12 r=1 E(l 1) r,c if l is even, E r,c = lim l E(l) r,c ; the iterations converge very quickly. (If we had a complete two-dimensional contingency table, then the iteration reduces to the standard values, namely E r,c = c O r,c r O r,c / #Games.). Note (O r,c E r,c) 2 r=1 c=1 c r E r,c

60 Appendix III: Central Limit Theorem Convolution of f and g: h(y) = (f g)(y) = f (x)g(y x)dx = f (x y)g(x)dx. R R X 1 and X 2 independent random variables with probability density p. x+ x Prob(X i [x, x + x]) = p(t)dt p(x) x. x x+ x x1 Prob(X 1 + X 2 ) [x, x + x] = p(x 1 )p(x 2 )dx 2 dx 1. x 1 = x 2 =x x 1 As x 0 we obtain the convolution of p with itself: b Prob(X 1 + X 2 [a, b]) = (p p)(z)dz. a Exercise to show non-negative and integrates to 1.

61 Appendix III: Statement of Central Limit Theorem For simplicity, assume p has mean zero, variance one, finite third moment and is of sufficiently rapid decay so that all convolution integrals that arise converge: p an infinitely differentiable function satisfying xp(x)dx = 0, x 2 p(x)dx = 1, x 3 p(x)dx <. Assume X 1, X 2,... are independent identically distributed random variables drawn from p. Define S N = N i=1 X i. Standard Gaussian (mean zero, variance one) is 1 2π e x2 /2. Central Limit Theorem Let X i, S N be as above and assume the third moment of each X i is finite. Then S N / N converges in probability to the standard Gaussian: ( ) SN lim Prob [a, b] = N N 1 b e x2 /2 dx. 2π a

62 Appendix III: Proof of the Central Limit Theorem The Fourier transform of p is p(y) = p(x)e 2πixy dx. Derivative of ĝ is the Fourier transform of 2πixg(x); differentiation (hard) is converted to multiplication (easy). ĝ (y) = 2πix g(x)e 2πixy dx. If g is a probability density, ĝ (0) = 2πiE[x] and ĝ (0) = 4π 2 E[x 2 ]. Natural to use the Fourier transform to analyze probability distributions. The mean and variance are simple multiples of the derivatives of p at zero: p (0) = 0, p (0) = 4π 2. We Taylor expand p (need technical conditions on p): p(y) = 1 + p (0) y 2 + = 1 2π 2 y 2 + O(y 3 ). 2 Near the origin, the above shows p looks like a concave down parabola.

63 Appendix III: Proof of the Central Limit Theorem (cont) Prob(X X N [a, b]) = b a (p p)(z)dz. The Fourier transform converts convolution to multiplication. If FT[f ](y) denotes the Fourier transform of f evaluated at y: FT[p p](y) = p(y) p(y). Do not want the distribution of X X N = x, but rather S N = X 1 + +X N = x. N If B(x) = A(cx) for some fixed c 0, then B(y) ( ) = 1 c Â y. c Prob ( X1 + +X N N ) = x = ( Np Np)(x N). [ FT ( Np Np)(x ] [ ( )] y N N) (y) = p. N

64 Appendix III: Proof of the Central Limit Theorem (cont) Can find the Fourier transform of the distribution of S N : [ ( )] y N p. N Take the limit as N for fixed y. Know p(y) = 1 2π 2 y 2 + O(y 3 ). Thus study [ 1 2π2 y 2 ( )] y 3 N + O N N 3/2. For any fixed y, [ lim 1 2π2 y 2 ( )] y 3 N + O N N N 3/2 = e 2πy2. Fourier transform of e 2πy2 at x is 1 2π e x2 /2.

65 Appendix III: Proof of the Central Limit Theorem (cont) We have shown: the Fourier transform of the distribution of S N converges to e 2πy2 ; the Fourier transform of e 2πy2 is 1 2π e x2 /2. Therefore the distribution of S N equalling x converges to 1 2π e x2 /2. We need complex analysis to justify this conclusion. Must be careful: Consider g(x) = { e 1/x2 if x 0 0 if x = 0. All the Taylor coefficients about x = 0 are zero, but the function is not identically zero in a neighborhood of x = 0.

66 Appendix IV: Best Fit Weibulls from Method of Maximum Likelihood The likelihood function depends on: α RS, α RA, β =.5, γ. Let A(α,.5, γ, k) denote the area in Bin(k) of the Weibull with parameters α,.5, γ. The sample likelihood function L(α RS, α RA,.5, γ) is ( ) #Bins #Games A(α RS,.5, γ, k) RSobs(k) RS obs (1),..., RS obs (#Bins) ( #Games RA obs (1),..., RA obs (#Bins) k=1 ) #Bins k=1 A(α RA,.5, γ, k) RAobs(k). For each team we find the values of the parameters α RS, α RA and γ that maximize the likelihood. Computationally, it is equivalent to maximize the logarithm of the likelihood, and we may ignore the multinomial coefficients are they are independent of the parameters.

The Pythagorean Won-Loss Theorem: An introduction to modeling

The Pythagorean Won-Loss Theorem: An introduction to modeling Steven J Miller Williams College Steven.J.Miller@williams.edu http://www.williams.edu/go/math/sjmiller/ Bennington College, December 5, 2008