Predicting the Winner of the 2017 Big 12/SEC Challenge

The Big 12/SEC challenge tips off tomorrow. This will be the 4th year of this competition, and the Big 12 has never lost. In this post, we’ll use a Monte Carlo simulation to estimate the Big 12’s chances of continuing this streak for another year.

As always, the code and data for this post are available on my Github page.

The Ratings

The team ratings come from my sports analytics website, Hawklytics. We first have to adjust these ratings for home court advantage1, and then we can calculate the probability that the Big 12 will win each game using the Log-5 formula.

library(dplyr)
library(ggplot2)
library(purrr)
library(readr)

big12sec <- read_csv("datafiles/big12sec_2017.csv", col_types = "ccn")
big12sec
#> # A tibble: 10 x 3
#>               Home            Away Big12_Win
#>              <chr>           <chr>     <dbl>
#>  1   West Virginia       Texas A&M 0.9435388
#>  2        Oklahoma         Florida 0.2869581
#>  3       Tennessee    Kansas State 0.4904035
#>  4      Texas Tech Louisiana State 0.9121202
#>  5         Georgia           Texas 0.3345746
#>  6      Vanderbilt      Iowa State 0.6583537
#>  7  Oklahoma State        Arkansas 0.7371545
#>  8        Kentucky          Kansas 0.2822114
#>  9     Mississippi          Baylor 0.8425315
#> 10 Texas Christian          Auburn 0.8387957

After accounting for the location of each game, we can see, for example, that Kansas has around a 28% chance of beating Kentucky, and Oklahoma State has a 74% chance of beating Arkansas. To get the expected number of wins for the Big 12, we can sum the probabilities of the Big 12 winning each game. Doing this, we get an expected number of wins of 6.33. This means that if we repeated the Big 12/SEC challenge multiple times, on average, the Big 12 would get about 6.33 wins.

Monte Carlo Simulation

It’s not possible for us to repeat the Big 12/SEC challenge multiple times in real life, but we can do this through a process called Monte Carlo simulation. In a Monte Carlo simulation we can generate data for an event multiple times, and then average the results over all replications. To illustrate we can simulate the winner of the Kansas vs. Kentucky game. According to the model, Kansas has a 28% chance of winning. We generate a random number between 0 and 1. If that number is less than 0.28, then Kansas is the winner of the simulation, otherwise, Kentucky is the winner. We do this for every game, and then count the number of winners that come from the Big 12 to determine which conference won the challenge (or if it was a tie). We then repeat this process over and over again to simulate many replications of the challenge.

To enact this process in R, we’ll first need to define some functions. The first function will take in the probability of a team winning (in our case the Big 12 team), and return a 1 if that team wins, and a 0 otherwise.

sim_game <- function(prob) {
  ifelse(runif(1, min = 0, max = 1) < prob, 1, 0)
}

We can then define a function that takes in a vector of probabilities, and returns the number of wins.

sim_challenge <- function(prob_vec) {
  map_dbl(.x = prob_vec, .f = sim_game) %>% sum()
}

Finally, we can use the purrr package to simulate the Big 12/SEC challenge 10,000 times. Importantly, we need to set the random seed generator in order to make this analysis replicable.

set.seed(71715)
num_sim <- 10000

challenge_wins <- map_dbl(.x = seq_len(num_sim), .f = function(x, prob_vec) {
  sim_challenge(prob_vec)
}, prob_vec = big12sec$Big12_Win)

sim_results <- data_frame(big12_wins = challenge_wins) %>%
  group_by(big12_wins) %>%
  summarize(n = n())
sim_results
#> # A tibble: 10 x 2
#>    big12_wins     n
#>         <dbl> <int>
#>  1          1     1
#>  2          2    12
#>  3          3   158
#>  4          4   626
#>  5          5  1785
#>  6          6  2878
#>  7          7  2755
#>  8          8  1380
#>  9          9   367
#> 10         10    38

As we would expect, the most common outcome was the Big 12 winning 6 games. This occured 2,878 times. The second most common outcome was the Big 12 winning 7 games, which occured 2,755 times. Also note that although there were 38 simulations where the Big 12 went undefeated, there no simulations where the Big 12 failed to win a game, and only 1 simulation where the Big 12 won a single game.

Plotting the Results

Now that we have a distribution for the number of wins for the Big 12, we can plot the distribution using ggplot2. First we assign a conference winner to each number of wins.

sim_results$winner <- case_when(
  sim_results$big12_wins < 5 ~ "SEC",
  sim_results$big12_wins > 5 ~ "Big 12",
  TRUE ~ "Tie"
)
sim_results
#> # A tibble: 10 x 3
#>    big12_wins     n winner
#>         <dbl> <int>  <chr>
#>  1          1     1    SEC
#>  2          2    12    SEC
#>  3          3   158    SEC
#>  4          4   626    SEC
#>  5          5  1785    Tie
#>  6          6  2878 Big 12
#>  7          7  2755 Big 12
#>  8          8  1380 Big 12
#>  9          9   367 Big 12
#> 10         10    38 Big 12

Then we can calulate the probability of each outcome.

outcome <- sim_results %>%
  group_by(winner) %>%
  summarize(probability = sum(n) / num_sim) %>%
  mutate(probability = paste0(sprintf("%0.1f", probability * 100), "%"))
outcome
#> # A tibble: 3 x 2
#>   winner probability
#>    <chr>       <chr>
#> 1 Big 12       74.2%
#> 2    SEC        8.0%
#> 3    Tie       17.8%

Finally, we put all of that information together and plot the distribution!

sim_results %>%
  mutate(n = n / num_sim) %>%
  ggplot(aes(x = factor(big12_wins, levels = 0:10), y = n, fill = winner)) +
  geom_col() +
  scale_x_discrete(drop = FALSE) +
  scale_y_continuous(breaks = seq(0, 1, 0.1),
    labels = paste0(seq(0, 100, 10), "%")) +
  scale_fill_brewer(name = "Winner", type = "qual", palette = 7) +
  labs(title = "Big 12/SEC Challenge Probabilities",
    subtitle = paste(paste0(outcome$winner, ": ", outcome$probability),
      collapse = "  |  "), x = "Big 12 Wins", y = "Probability") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(title.position = "top", title.hjust = 0.5))

Conclusion

The results of the simulation show that the Big 12 has about a 74% chance of winning the challenge and continuing their streak. I’ll be tweeting out updated probabilities throughout the day tomorrow as the games finish, so follow me there for updates!

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.1 (2017-06-30)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Chicago             
#>  date     2017-08-30
#> Packages -----------------------------------------------------------------
#>  package      * version    date       source                            
#>  animation    * 2.5        2017-06-06 Github (yihui/animation@21f635c)  
#>  assertthat     0.2.0      2017-04-11 CRAN (R 3.4.0)                    
#>  backports      1.1.0      2017-05-22 CRAN (R 3.4.0)                    
#>  base         * 3.4.1      2017-07-07 local                             
#>  bindr          0.1        2016-11-13 cran (@0.1)                       
#>  bindrcpp     * 0.2        2017-06-17 cran (@0.2)                       
#>  blogdown       0.1.3      2017-08-30 Github (rstudio/blogdown@45860c7) 
#>  bookdown       0.5        2017-08-20 CRAN (R 3.4.1)                    
#>  codetools      0.2-15     2016-10-05 CRAN (R 3.4.1)                    
#>  colorspace     1.3-2      2016-12-14 CRAN (R 3.4.0)                    
#>  compiler       3.4.1      2017-07-07 local                             
#>  datasets     * 3.4.1      2017-07-07 local                             
#>  devtools       1.13.3     2017-08-02 cran (@1.13.3)                    
#>  digest         0.6.12     2017-01-27 CRAN (R 3.4.0)                    
#>  dplyr        * 0.7.2      2017-07-20 CRAN (R 3.4.1)                    
#>  evaluate       0.10.1     2017-06-24 cran (@0.10.1)                    
#>  ggplot2      * 2.2.1      2016-12-30 CRAN (R 3.3.2)                    
#>  glue           1.1.1      2017-06-21 cran (@1.1.1)                     
#>  graphics     * 3.4.1      2017-07-07 local                             
#>  grDevices    * 3.4.1      2017-07-07 local                             
#>  grid           3.4.1      2017-07-07 local                             
#>  gtable         0.2.0      2016-02-26 CRAN (R 3.4.0)                    
#>  hms            0.3        2016-11-22 CRAN (R 3.4.0)                    
#>  htmltools      0.3.6      2017-04-28 CRAN (R 3.4.0)                    
#>  knitr        * 1.17       2017-08-10 cran (@1.17)                      
#>  lazyeval       0.2.0      2016-06-12 CRAN (R 3.4.0)                    
#>  magrittr       1.5        2014-11-22 CRAN (R 3.4.0)                    
#>  memoise        1.1.0      2017-04-21 CRAN (R 3.4.0)                    
#>  methods      * 3.4.1      2017-07-07 local                             
#>  munsell        0.4.3      2016-02-13 CRAN (R 3.4.0)                    
#>  pkgconfig      2.0.1      2017-03-21 cran (@2.0.1)                     
#>  plyr           1.8.4      2016-06-08 CRAN (R 3.4.0)                    
#>  purrr        * 0.2.3      2017-08-02 cran (@0.2.3)                     
#>  R6             2.2.2      2017-06-17 cran (@2.2.2)                     
#>  RColorBrewer   1.1-2      2014-12-07 CRAN (R 3.4.0)                    
#>  Rcpp           0.12.12    2017-07-15 CRAN (R 3.4.1)                    
#>  readr        * 1.1.1      2017-05-16 CRAN (R 3.4.0)                    
#>  rlang          0.1.2.9000 2017-08-30 Github (hadley/rlang@f20124b)     
#>  rmarkdown      1.6.0.9001 2017-08-30 Github (rstudio/rmarkdown@e22703a)
#>  rprojroot      1.2        2017-01-16 CRAN (R 3.4.0)                    
#>  scales         0.5.0.9000 2017-08-30 Github (hadley/scales@d767915)    
#>  stats        * 3.4.1      2017-07-07 local                             
#>  stringi        1.1.5      2017-04-07 CRAN (R 3.4.0)                    
#>  stringr        1.2.0      2017-02-18 CRAN (R 3.4.0)                    
#>  tibble         1.3.4      2017-08-22 cran (@1.3.4)                     
#>  tools          3.4.1      2017-07-07 local                             
#>  utils        * 3.4.1      2017-07-07 local                             
#>  withr          2.0.0      2017-08-13 Github (jimhester/withr@190d293)  
#>  yaml           2.1.14     2016-11-12 CRAN (R 3.4.0)


  1. The home team’s offense is increased by 30% and defense is decreased by 30%, and the reverse is done for the away team.

comments powered by Disqus