Nearest-Neighbor Matchup Effects: Predicting March Madness

Size: px

Start display at page:

Download "Nearest-Neighbor Matchup Effects: Predicting March Madness"

Laura Miles
6 years ago
Views:

1 Nearest-Neighbor Matchup Effects: Predicting March Madness Andrew Hoegh, Marcos Carzolio, Ian Crandell, Xinran Hu, Lucas Roberts, Yuhyun Song, Scotland Leman Department of Statistics, Virginia Tech September 26, 2015

2 NCAA Bracket Competition Who is has filled out an NCAA bracket before? Introduction Motivation 2 / 36

3 NCAA Bracket Competition Who has won an NCAA bracket competition? Introduction Motivation 3 / 36

4 NCAA Bracket Competition Who has lost a bracket competition to someone that does not know what a basketball looks like? Introduction Motivation 4 / 36

5 Talk Overview General modeling strategies for NCAA tournament competitions Matchup effects modeling framework Uncertainty inherent in NCAA tournament & competitions Introduction Overview 5 / 36

6 Competition Types General Modeling Competition Types 6 / 36

7 Loss Functions: Bracket L(y, ŷ) = cδ(y ŷ) L(y, ŷ, r) = c r δ(y ŷ) General Modeling Decision Theory 7 / 36

8 Loss Functions: Probabilistic Prediction Kaggle loss function L(y, p) = y log(p) (1 y) log(1 p) This is a proper scoring rule. L(y=1,p) P General Modeling Decision Theory 8 / 36

9 Relevant Data: Team Characteristics There are many important characteristics useful for modeling the strength of an NCAA basketball team: Winning percentage, Point differential, Strength of schedule, Conference affiliation,... Rebounding percentage, Adjusted offensive efficiency, and Adjusted defensive efficiency. General Modeling Data 9 / 36

10 Relevant Data: Ratings and Rankings Rather than using team characteristics, there are many available rating and ranking systems: ESPN BPI, Sagarin, RPI, Pomeroy, and Logistic Regression/Markov Chain (LRMC). General Modeling Data 10 / 36

11 Relative Strength Models Typically the model framework for predicting winner of games (or winning probabilities) can be formulated as a relative strength model. Formally this can be expressed as: y ij = f (θ i, θ j ) General Modeling Data 11 / 36

12 Relative Strength Models Typically the model framework for predicting winner of games (or winning probabilities) can be formulated as a relative strength model. Formally this can be expressed as: y ij = f (θ i, θ j ) such as linear model: y ij = β home + (θ i θ j )β D + ɛ ij, where ɛ ij N (0, σ 2 ) General Modeling Data 11 / 36

13 Relative Strength Models Typically the model framework for predicting winner of games (or winning probabilities) can be formulated as a relative strength model. Formally this can be expressed as: y ij = f (θ i, θ j ) such as linear model: y ij = β home + (θ i θ j )β D + ɛ ij, where ɛ ij N (0, σ 2 ) logistic regression: y ij Bernoulli(p ij ) where or logit(p ij ) = β home + (θ i θ j )β D General Modeling Data 11 / 36

14 Transitivity Note that relative strength models of this type are strictly transitive, where P A>B denotes the probability that team A beats team B. Under transitive models, {P A>B > 0.5 P B>C > 0.5} P A>C > 0.5. General Modeling Data 12 / 36

15 Bayesian Relative Strength Models Bayesian Linear Model Prior β N (β; m, s) Likelihood β Y, X N (β; ˆβ, Σ) Posterior β Y, X N (β; µ, Σ β ) General Modeling Data 13 / 36

16 Bayesian Relative Strength Models Bayesian Linear Model Prior β N (β; m, s) Likelihood β Y, X N (β; ˆβ, Σ) Posterior β Y, X N (β; µ, Σ β ) General Modeling Data 14 / 36

17 Bayesian Relative Strength Models Bayesian Linear Model Prior β N (β; m, s) Likelihood β Y, X N (β; ˆβ, Σ) Posterior β Y, X N (β; µ, Σ β ) General Modeling Data 15 / 36

18 Bayesian Relative Strength Models Bayesian Linear Model Predictive: p(y X, Y, X) = p(y X, β)p(β Y, X)dβ P(Y < 0) = 0 p(y X, Y, X)dY General Modeling Data 16 / 36

19 Analytics for Bracket Competitions General Modeling Data 17 / 36

20 Analytics for Kaggle-Style Competitions A substantial challenge in predictive modeling of sports competitions, and major discussion point, is predicting upsets. Consider predicting upsets as a comparison of the probabilistic predictions for competitors. General Modeling Data 18 / 36

21 Analytics for Kaggle-Style Competitions Distribution of Predictions A substantial challenge in predictive modeling of sports competitions, and major discussion point, is predicting upsets. Consider predicting upsets as a comparison of the probabilistic predictions for competitors. Your Prediction P(Upset) General Modeling Data 18 / 36

22 Matchup Effects Motivating Quote Well, it s hard to predict a particular team. You have to look at the higher seed and see, do they overwhelm the smaller school or the lower seed? Can they overwhelm someone with their athleticism or length or size or quickness or speed? Do they play a particular style of the pressure defense? And I think I look at the lower seed then and say, can they counter the higher seed s strengths? Do they shoot the three point shot well? Do they have athleticism at certain positions? Do they play a particular style that will give the higher seed trouble? Andy Enfield Nearest Neighbor Matchup Effects motivation 19 / 36

23 NNME Intuition Observed Results Games Nearest Neighbor Matchup Effects motivation 20 / 36

24 NNME Intuition Observed Results Estimated Strength Games Nearest Neighbor Matchup Effects motivation 21 / 36

25 NNME Intuition Observed Results Games Nearest Neighbor Matchup Effects motivation 22 / 36

26 NNME Intuition Observed Results Games Nearest Neighbor Matchup Effects motivation 23 / 36

27 Outline of the NNME Estimation of the nearest-neighbor matchup effects has three components: 1. Fit a relative strength model, 2. Identify neighbors for the matchup, and 3. Calibrate the matchup adjustment. Nearest Neighbor Matchup Effects Outline 24 / 36

28 Relative Strength Model Consider the simple relative strength model using the Sagarin ratings and home court effect. Y ij = β home + D ij β D + ɛ ij, ɛ ij N (0, σ 2 ) where D ij = Sagarin i Sagarin j Nearest Neighbor Matchup Effects Relative Strength Model 25 / 36

29 Relative Strength Model Consider the simple relative strength model using the Sagarin ratings and home court effect. Y ij = β home + D ij β D + ɛ ij, ɛ ij N (0, σ 2 ) where D ij = Sagarin i Sagarin j Note that relative strength models of this type are strictly transitive. Nearest Neighbor Matchup Effects Relative Strength Model 25 / 36

30 Identifying Neighbors Identifying neighbors first requires specifying team characteristics and a distance function between teams. We used a collection of data from Ken Pomeroy s including: Effective Height Adjusted Tempo Effective Field Goal Percentage Defense Offensive Rebound Percentage Block Percentage Steal Rate Three Point Field Goal Contribution Nearest Neighbor Matchup Effects Identifying Neighbors 26 / 36

31 Uniformly Weighting Characteristics Nearest Neighbor Matchup Effects Identifying Neighbors 27 / 36

32 Bayesian Visual Analytics Nearest Neighbor Matchup Effects Identifying Neighbors 28 / 36

33 Bayesian Visual Analytics Nearest Neighbor Matchup Effects Identifying Neighbors 29 / 36

34 Finding Neighbors For each particular matchup, say Dayton vs. Stanford, identify past opponents to see who is most like the current opponent. Teams Dayton played similar to Stanford Teams Stanford played similar to Dayton California George Mason Georgia Tech George Washington Gonzaga Cal Poly California Oregon Pittsburgh Utah Nearest Neighbor Matchup Effects Identifying Neighbors 30 / 36

35 Matchup Adjustment Notation Let R j (i) = 1 K k N j (Y ik µ ik ), where N j are the neighbors for team i with respect to team j, Y ik is the observed point differential between team i, and team k and µ ik is the expected point differential between team i and team k. Then φ ij = ρ(r i (j) R j (i)), where ρ controls how much information is passed from the neighbors. Nearest Neighbor Matchup Effects Matchup Adjustment 31 / 36

36 Predictive Distribution Then using the relative strength model previously described, the predictive distribution becomes: Y ij X ij = β home + D ij β D + φ ij + ɛ ij Effectively φ ij shifts the predictive distribution and results in a non-transitive model. Nearest Neighbor Matchup Effects Matchup Adjustment 32 / 36

37 Estimating Model Parameters Using data across NCAA tournaments from , the model parameters are estimated. β home β D σ 2 ρ Posterior mean Credible interval (3.83,3.91) (0.909,0.916) (120.0,123.2) (0.012,0.454) The positive credible interval suggests a moderate, but meaningful result from the matchup effect. Nearest Neighbor Matchup Effects Results 33 / 36

38 Demonstration: 2014 NCAA Tournament Largest shifts in expected point spread (φ ij ). Team 1 Team 1 R 1 (2) R 2 (1) φ 12 Cal Poly Wichita St UConn St. Joes Dayton Stanford Dayton Syracuse Kentucky Michigan UMass Tennessee Memphis Virginia Michigan Tennessee Michigan Texas Syracuse W.Mich Nearest Neighbor Matchup Effects Demonstration 34 / 36

39 Closer Look: 2014 NCAA Tournament Team Neighbor Point Diff. E[Point Diff] Residual Dayton California Dayton Gonzaga Dayton George Mason Dayton Georgia Tech Dayton George Washington Stanford California Stanford California Stanford Oregon Stanford Pittsburgh Stanford Cal Poly Stanford Utah Nearest Neighbor Matchup Effects Demonstration 35 / 36

40 NNME Concluding Thoughts 1. Evaluated on small number of data points leads to very uncertain outcomes, even with quality predictions 2. Does this contest actually have a proper scoring rule? 3. Alternative strategies (maximizing expected return). Nearest Neighbor Matchup Effects Demonstration 36 / 36

41 NNME Concluding Thoughts 1. Evaluated on small number of data points leads to very uncertain outcomes, even with quality predictions 2. Does this contest actually have a proper scoring rule? 3. Alternative strategies (maximizing expected return). Nearest Neighbor Matchup Effects Demonstration 36 / 36

Lesson Plan. AM 121: Introduction to Optimization Models and Methods. Lecture 17: Markov Chains. Yiling Chen SEAS. Stochastic process Markov Chains

Lesson Plan. AM 121: Introduction to Optimization Models and Methods. Lecture 17: Markov Chains. Yiling Chen SEAS. Stochastic process Markov Chains AM : Introduction to Optimization Models and Methods Lecture 7: Markov Chains Yiling Chen SEAS Lesson Plan Stochastic process Markov Chains n-step probabilities Communicating states, irreducibility Recurrent