Last week I premiered my in game win probabilities for KU basketball. These have been available for a while on Hawklytics, but were always made after the game rather than in real time. Now that they are going live, I thought it would helpful to document how these are made using R and the `ggplot2`

package.

## Calculating Win Probability

The win probabilities are based on the Elo ratings that I calculate for the team ratings on Hawkytics. Elo ratings provide an estimate of team strength or ability *at the current point in time*. This makes it straightforward to determine the ratings of each team on the day the game was played. For more about Elo ratings and how they are calculated go here, or check out this explainer from FiveThirtyEight, which is what my Elo ratings are based off of.

For this example, we’ll use the Kansas vs. Kansas State game on January 3, 2017. After giving KU a boost for home court advantage Kansas had a pre-game Elo rating of 2163, and Kansas State had an Elo rating of 1761. This difference in Elo ratings translates to Kansas being favored by ~15 points.

We can likewise calculate the predicted point spread for every game in the data set (my data set includes all games between D1 opponents going back to 1980). This allows us to look at the difference between the predicted point spread, and the actual margin of victory. This prediction error is normally distributed with a mean of 0 and a standard deviation of 11.36. So the distribution of possible margins of victory for the Kansas vs. Kansas State game should look like this:

```
library(ggplot2)
library(dplyr)
data_frame(
x = seq(-50, 50, 0.5),
y = dnorm(seq(-50, 50, 0.5), mean = -15, sd = 11.36)
) %>%
mutate(winner = ifelse(x <= 0, "Kansas", "Kansas State")) %>%
ggplot() +
geom_ribbon(aes(x = x, ymin = 0, ymax = y, fill = winner)) +
labs(x = "away team's margin of victory")
```

The distribution peaks at -15, which is what we calculated as the most likely outcome. By convention, point spreads are given in terms of the home team, and a negative point spread means that team is the favorite. Because this game was played at Kansas, the point spread is **Kansas -15**. If the game were being played at Kansas State, the point spread would be written as **Kansas State + 15**. Because of this a negative margin of victory indicates a win for the home team. Therefore, a negative margin of victory is associated with Kansas winning, and a positive margin of victory is associated with Kansas State winning. To get the probability of Kansas winning, we can simply look at the proportion of the curve that is less than zero.

```
pnorm(0, mean = -15, sd = 11.36, lower.tail = TRUE)
#> [1] 0.906653
```

So at the beginning of the game, we estimate Kansas to have a 90.7% chance of winning. As the game progresses, we calculate win probability in the exact same way, we just have to adjust for the current score and the amount of time remaining^{1}. The mean of the distribution gets defined as so that as the game progresses, the point spread gets less weight, and the current margin get more weight.

Similarly, the standard deviation is adjusted so that the distribution gets more narrow as the game progresses.

\[\begin{equation} \sigma = \frac{11.36}{\sqrt{\frac{40}{minutes\_remain}}} \tag{2} \end{equation}\]As the time remaining approaches 0, the denominator increases, making the standard deviation smaller and smaller.

The only we need at this point is the score at each moment of the game, in order to calculate the mean and standard deviation. To get this information, we can scrape play-by-play data from the web.

## Scraping Play-By-Play Data

There are many places we could scrape play-by-play information from, and many different packages we could use, but I’ll use the `rvest`

package to scrape play-by-play data from ESPN. With `rvest`

, getting the data from ESPN is fairly straightforward.

```
library(rvest)
game_data <- read_html("http://www.espn.com/mens-college-basketball/playbyplay?gameId=400916199")
tables <- html_nodes(game_data, css = "table")
tables <- html_table(tables, fill = TRUE)
```

The data we want is in tables 2 and 3, so we can select those and do some formatting.

```
half_1 <- tables[[2]]
colnames(half_1) <- make.names(colnames(half_1))
half_1 <- half_1 %>%
mutate(
minute = gsub(":.*", "", time) %>% as.numeric(),
second = gsub(".*:", "", time) %>% as.numeric(),
min_played = (20 - (minute + (second / 60))),
min_remain = 40 - min_played,
SCORE = gsub(" ", "", SCORE),
away_score = gsub("-.*", "", SCORE) %>% as.numeric(),
home_score = gsub(".*-", "", SCORE) %>% as.numeric(),
period = "H1"
) %>%
select(period, minute, second, min_played, min_remain, away_score,
home_score, play = PLAY)
half_2 <- tables[[3]]
colnames(half_2) <- make.names(colnames(half_2))
half_2 <- half_2 %>%
mutate(
minute = gsub(":.*", "", time) %>% as.numeric(),
second = gsub(".*:", "", time) %>% as.numeric(),
min_played = 20 + (20 - (minute + (second / 60))),
min_remain = 40 - min_played,
SCORE = gsub(" ", "", SCORE),
away_score = gsub("-.*", "", SCORE) %>% as.numeric(),
home_score = gsub(".*-", "", SCORE) %>% as.numeric(),
period = "H2"
) %>%
select(period, minute, second, min_played, min_remain, away_score,
home_score, play = PLAY)
full_pbp <- bind_rows(list(half_1, half_2))
full_pbp
```

```
#> # A tibble: 336 x 8
#> period minute second min_played min_remain away_score home_score
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 H1 20 0 0.0000000 40.00000 0 0
#> 2 H1 19 50 0.1666667 39.83333 0 0
#> 3 H1 19 50 0.1666667 39.83333 0 0
#> 4 H1 19 43 0.2833333 39.71667 0 2
#> 5 H1 19 27 0.5500000 39.45000 0 2
#> 6 H1 19 27 0.5500000 39.45000 0 2
#> 7 H1 19 12 0.8000000 39.20000 0 2
#> 8 H1 19 12 0.8000000 39.20000 0 2
#> 9 H1 18 48 1.2000000 38.80000 2 2
#> 10 H1 18 33 1.4500000 38.55000 2 2
#> # ... with 326 more rows, and 1 more variables: play <chr>
```

Now we can create a data frame of all possible time points in the game, and fill in the scores.

```
library(tidyr)
minute <- 0:40
second <- 0:59
full_game <- crossing(minute, second) %>%
arrange(desc(minute), desc(second)) %>%
mutate(min_remain = minute + (second / 60), min_played = 40 - min_remain,
home = 0, away = 0) %>%
filter(min_remain <= 40)
for (i in seq_len(nrow(full_pbp))) {
cur_time <- round(full_pbp$min_remain[i], digits = 2)
cur_row <- which(round(full_game$min_remain, digits = 2) == cur_time)
full_game$home[cur_row:nrow(full_game)] <- full_pbp$home_score[i]
full_game$away[cur_row:nrow(full_game)] <- full_pbp$away_score[i]
}
```

Now that we have the data we want in a workable form, we can move on to calculating the win probabilities and creating the plot.

## Plotting the Win Probabilities

The first thing we have to do is calculate the mean and standard deviation of the distribution at every second of the game, and the corresponding win probability.

```
full_game <- full_game %>%
mutate(
away_margin = away - home,
mean = (-15 * (min_remain / 40)) + (away_margin * (min_played / 40)),
sd = 11.36 / sqrt(40 / min_remain),
home_winprob = pnorm(0, mean = mean, sd = sd, lower.tail = TRUE),
away_winprob = 1 - home_winprob
)
full_game
#> # A tibble: 2,401 x 11
#> minute second min_remain min_played home away away_margin mean
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 40 0 40.00000 0.00000000 0 0 0 -15.00000
#> 2 39 59 39.98333 0.01666667 0 0 0 -14.99375
#> 3 39 58 39.96667 0.03333333 0 0 0 -14.98750
#> 4 39 57 39.95000 0.05000000 0 0 0 -14.98125
#> 5 39 56 39.93333 0.06666667 0 0 0 -14.97500
#> 6 39 55 39.91667 0.08333333 0 0 0 -14.96875
#> 7 39 54 39.90000 0.10000000 0 0 0 -14.96250
#> 8 39 53 39.88333 0.11666667 0 0 0 -14.95625
#> 9 39 52 39.86667 0.13333333 0 0 0 -14.95000
#> 10 39 51 39.85000 0.15000000 0 0 0 -14.94375
#> # ... with 2,391 more rows, and 3 more variables: sd <dbl>,
#> # home_winprob <dbl>, away_winprob <dbl>
```

We can then put the data into long format using the `gather`

function from the `tidyr`

package, and plot the probabilities!

```
full_game %>%
gather(team, winprob, home_winprob:away_winprob) %>%
ggplot(aes(x = min_played, y = winprob, color = team)) +
geom_line()
```

Looks pretty good! We can see that even though Kansas wasn’t leading on the score board the whole game, they were always favored to win. Although Kansas State was able to make it close at the end of the game. Now we can add some formatting to make it look prettier.

```
full_game %>%
gather(team, winprob, home_winprob:away_winprob) %>%
ggplot(aes(x = min_played, y = winprob, color = team)) +
geom_line(size = 1) +
scale_color_manual(values = c("#512888", "#0051BA"),
labels = c("Kansas State", "Kansas")) +
geom_hline(aes(yintercept = 0.5), color = "#000000", linetype = "dashed",
size = 1) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1),
labels = paste0(seq(0, 100, by = 10), "%")) +
scale_x_continuous(limits = c(0, 40), breaks = seq(0, 40, 4),
labels = paste0(seq(40, 0, -4))) +
labs(y = "Win Probability", x = "Minutes Remaining") +
theme_minimal() +
theme(legend.position = "bottom", legend.title = element_blank())
```

And there you have our final product! For future Kansas games, I will be tweeting out real time win probability graphs, and as always, previous games can be found on Hawklytics.

## Bonus: Animate the Plots

We could go one step further and animate the win probability plot using David Robinson’s `gganimate`

package. Our code looks the same, except we add a `frame`

aesthetic and the `gg_animate`

function at the end.

```
library(gganimate)
p <- full_game %>%
filter(second %% 20 == 0) %>%
gather(team, winprob, home_winprob:away_winprob) %>%
ggplot(aes(x = min_played, y = winprob, color = team, frame = min_played)) +
geom_line(aes(cumulative = TRUE), size = 1) +
scale_color_manual(values = c("#512888", "#0051BA"),
labels = c("Kansas State", "Kansas")) +
geom_hline(aes(yintercept = 0.5), color = "#000000", linetype = "dashed",
size = 1) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1),
labels = paste0(seq(0, 100, by = 10), "%")) +
scale_x_continuous(limits = c(0, 40), breaks = seq(0, 40, 4),
labels = paste0(seq(40, 0, -4))) +
labs(y = "Win Probability", x = "Minutes Remaining") +
theme_minimal() +
theme(legend.position = "bottom", legend.title = element_blank())
gganimate(p, interval = 0.2, title_frame = FALSE)
```

We could also animate the distribution to show exactly how the distribution is changing as we alter the mean and standard deviation.

```
library(purrr)
dist <- full_game %>%
filter(second %% 20 == 0) %>%
select(min_played, mean, sd) %>%
as.list() %>%
pmap_df(.l = ., .f = function(min_played, mean, sd) {
data_frame(
min_played = min_played,
x = seq(-50, 50, 0.5),
y = dnorm(seq(-50, 50, 0.5), mean = mean, sd = sd)
) %>%
mutate(winner = ifelse(x <= 0, "home_win", "away_win"))
}) %>%
mutate(min_played = round(min_played, digits = 2))
d <- ggplot(dist, aes(frame = min_played)) +
geom_ribbon(aes(x = x, ymin = 0, ymax = y, fill = winner)) +
scale_fill_manual(values = c("#512888", "#0051BA"),
labels = c("Kansas State", "Kansas")) +
scale_x_continuous(breaks = seq(-50, 50, 10)) +
labs(x = "Kansas State Margin of Victory", title = "Minutes Played: ") +
theme_minimal() +
theme(legend.position = "bottom", legend.title = element_blank())
gganimate(d, interval = 0.2)
```

## Limitations

There are several limitations to the way these win probabilities are calculated. First, the calculations assume that each team has a 50% chance of winning if the game goes into overtime. This is entirely accurate, as a team favored before the game would still be favored in overtime (but not by as much). Secondly, I don’t factor in who has possession of the ball. For example, if a team is down by 1 with 25 seconds to go and the ball, the model probably underestimates their chance of winning. In reality, when calculating the mean of the distribution, expected points on the current possession should be factored into the current margin. However, this model provides a nice starting place, and I think provides a pretty good general idea of how a teams probability of winning changed throughout the game.

## Session info

```
devtools::session_info()
#> setting value
#> version R version 3.4.1 (2017-06-30)
#> system x86_64, darwin15.6.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> tz America/Chicago
#> date 2017-08-30
#>
#> package * version date source
#> animation * 2.5 2017-06-06 Github (yihui/animation@21f635c)
#> assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
#> backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
#> base * 3.4.1 2017-07-07 local
#> bindr 0.1 2016-11-13 cran (@0.1)
#> bindrcpp * 0.2 2017-06-17 cran (@0.2)
#> blogdown 0.1.3 2017-08-30 Github (rstudio/blogdown@45860c7)
#> bookdown 0.5 2017-08-20 CRAN (R 3.4.1)
#> broom 0.4.2 2017-02-13 CRAN (R 3.4.0)
#> cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
#> codetools 0.2-15 2016-10-05 CRAN (R 3.4.1)
#> colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
#> compiler 3.4.1 2017-07-07 local
#> datasets * 3.4.1 2017-07-07 local
#> devtools 1.13.3 2017-08-02 cran (@1.13.3)
#> digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
#> dplyr * 0.7.2 2017-07-20 CRAN (R 3.4.1)
#> evaluate 0.10.1 2017-06-24 cran (@0.10.1)
#> forcats 0.2.0 2017-01-23 CRAN (R 3.4.0)
#> foreign 0.8-69 2017-06-22 CRAN (R 3.4.1)
#> gganimate * 0.1.0.9000 2017-06-06 Github (dgrtwo/gganimate@bf82002)
#> ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2)
#> glue 1.1.1 2017-06-21 cran (@1.1.1)
#> graphics * 3.4.1 2017-07-07 local
#> grDevices * 3.4.1 2017-07-07 local
#> grid 3.4.1 2017-07-07 local
#> gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
#> haven 1.1.0 2017-07-09 CRAN (R 3.4.1)
#> hms 0.3 2016-11-22 CRAN (R 3.4.0)
#> htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
#> httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
#> jsonlite 1.5 2017-06-01 cran (@1.5)
#> knitr * 1.17 2017-08-10 cran (@1.17)
#> labeling 0.3 2014-08-23 CRAN (R 3.4.0)
#> lattice 0.20-35 2017-03-25 CRAN (R 3.4.1)
#> lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0)
#> lubridate 1.6.0 2016-09-13 CRAN (R 3.4.0)
#> magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
#> memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
#> methods * 3.4.1 2017-07-07 local
#> mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
#> modelr 0.1.1 2017-07-24 cran (@0.1.1)
#> munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
#> nlme 3.1-131 2017-02-06 CRAN (R 3.4.0)
#> parallel 3.4.1 2017-07-07 local
#> pkgconfig 2.0.1 2017-03-21 cran (@2.0.1)
#> plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
#> psych 1.7.5 2017-05-03 CRAN (R 3.4.1)
#> purrr * 0.2.3 2017-08-02 cran (@0.2.3)
#> R6 2.2.2 2017-06-17 cran (@2.2.2)
#> Rcpp 0.12.12 2017-07-15 CRAN (R 3.4.1)
#> readr * 1.1.1 2017-05-16 CRAN (R 3.4.0)
#> readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
#> reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
#> rlang 0.1.2.9000 2017-08-30 Github (hadley/rlang@f20124b)
#> rmarkdown 1.6.0.9001 2017-08-30 Github (rstudio/rmarkdown@e22703a)
#> rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
#> rvest * 0.3.2 2016-06-17 CRAN (R 3.4.0)
#> scales 0.5.0.9000 2017-08-30 Github (hadley/scales@d767915)
#> stats * 3.4.1 2017-07-07 local
#> stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
#> stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
#> tibble * 1.3.4 2017-08-22 cran (@1.3.4)
#> tidyr * 0.7.0 2017-08-16 CRAN (R 3.4.1)
#> tidyselect 0.1.1 2017-07-24 CRAN (R 3.4.1)
#> tidyverse * 1.1.1 2017-01-27 CRAN (R 3.4.0)
#> tools 3.4.1 2017-07-07 local
#> utils * 3.4.1 2017-07-07 local
#> withr 2.0.0 2017-08-13 Github (jimhester/withr@190d293)
#> xml2 * 1.1.1 2017-01-24 CRAN (R 3.4.0)
#> yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
```

For details on the where these formulas come from, see Wayne Winston’s book,

*Mathletics*, and Neil Paine’s explainer.↩