## The Approach

By the time this post goes live, the draft will be only 10 days away! With that in mind, it's time for the second post in my series on using mock draft data to predict the NFL Draft. In my last blog post, I laid out my theory of mock drafts and how I think about using them to assess a player's stock throughout the NFL Draft process. This post digs deeper into the data and statistical aspects of using mock draft data to predict the actual outcomes of the NFL Draft.

I started down the road on this project by exploring mock draft data for the 2018 NFL Draft, which means 2018 is the only draft I have data for. That's a small sample, but it's all I've got and as good a place to start as any. For 2018, I collected the data by hand (more on that choice another time), with a focus on getting a diverse sample of mock drafts from media members, fans, and draft experts. Ultimately, I collected *395 mock drafts*: *229* from fans, *90* from the media, and *76* from experts. That might sound like a lot, but for the 2019 draft I've already collected more than double that number, which should make any estimates I build from the mock draft data less noisy in terms of the spread of mock draft selections for each draft-eligible player.

Because we have to start somewhere, I thought it would be a good idea to explore how well the basic measures of central tendency predict the draft: the average, the median, and a weighted average that weights each mock draft by how close it is (in days) to the date of the actual NFL Draft. From there, we can move on to model-based estimates using two basic statistical estimation tools: linear regression and Loess regression.
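To make the recency weighting concrete, here's a toy sketch with made-up picks and dates (not the real dataset): each mock gets a weight of one over the number of days it was published before draft day, so later mocks count for more.

```r
# Hypothetical example: three mock-draft picks for one player. Mocks closer
# to draft day get larger weights (weight = 1 / days until the draft).
picks <- c(5, 3, 2)              # projected pick in each of three mocks
days_until_draft <- c(60, 14, 1) # how far out each mock was published
w <- 1 / days_until_draft        # the day-before mock dominates the weights

mean(picks)             # simple average of the three picks
median(picks)           # median pick
weighted.mean(picks, w) # recency-weighted average, pulled toward the latest mock
```

The weighted average lands much closer to the final, day-before projection than the plain average does, which is exactly the behavior we want from a recency-weighted stock measure.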

A linear regression takes a series of points and fits a line of best fit through the data, minimizing the difference between your predictions and the actual outcomes you're trying to predict; in this case, we're using mock draft data to predict actual NFL Draft selections. A Loess regression takes a slightly different approach: it also fits a line through your data, but it *smooths* the data you provide it to account for change over time, which is why it's often known as a "moving regression".
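A minimal illustration of the difference, using simulated data rather than the mock-draft dataset: one player's stock improves over the pre-draft process, and we fit both a straight line and a Loess curve to the noisy picks.

```r
set.seed(42)
# Simulated stock for one player: mocks drift from around pick 20 toward
# pick 10 as draft day approaches, with some noise.
days_out <- 90:1
pick <- 20 - 10 * (90 - days_out) / 90 + rnorm(90, sd = 2)

fit_lm    <- lm(pick ~ days_out)    # one straight line through all 90 points
fit_loess <- loess(pick ~ days_out) # locally weighted "moving" fit

# Predict the pick for the day before the draft (inside the observed range,
# since loess will not extrapolate beyond the data it was fit on).
predict(fit_lm,    data.frame(days_out = 1))
predict(fit_loess, data.frame(days_out = 1))
```

The linear fit summarizes the whole trend with one slope, while the Loess fit bends to follow late movement, which is why it can react differently to a player whose stock surges or craters in the final weeks.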

Let’s begin by setting up the data and calculating the basic summary statistics for our draft eligible players in mock drafts.

```
options(stringsAsFactors = FALSE, scipen = 7)
library(tidyr)
library(dplyr)
```

`## Warning: package 'dplyr' was built under R version 3.4.4`

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
## filter, lag
```

```
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

`library(ggplot2)`

`## Warning: package 'ggplot2' was built under R version 3.4.4`

`library(plotly)`

`## Warning: package 'plotly' was built under R version 3.4.4`

```
##
## Attaching package: 'plotly'
```

```
## The following object is masked from 'package:ggplot2':
##
## last_plot
```

```
## The following object is masked from 'package:stats':
##
## filter
```

```
## The following object is masked from 'package:graphics':
##
## layout
```

`library(teamcolors)`

`## Warning: package 'teamcolors' was built under R version 3.4.4`

`library(ggrepel)`

`## Warning: package 'ggrepel' was built under R version 3.4.4`

`library(ggthemes)`

`## Warning: package 'ggthemes' was built under R version 3.4.4`

```
library(broom)
`%notin%` <- function(x, y) !(x %in% y)
read.csv("https://github.com/benjaminrobinson/2018NFLMockDrafts/raw/master/data/2018%20NFL%20Mock%20Draft%20Data%20-%20Projections.csv") %>%
filter(
name %notin% c(
"Christian Wilkins",
"Clelin Ferrell",
"Mitch Hyatt",
"Austin Bryant",
"Trey Adams",
"Dre'Mont Jones",
"Adonis Alexander",
"Bryce Love",
"Iman Marshall",
"Cam Smith",
"Martez Ivey",
"Clayton Thorson",
"Jarrett Stidham",
"Ken Webster",
"Parris Campbell",
"Damien Harris",
"Dante Booker",
"Beau Benzschawel",
"Jake Browning",
"Porter Gustin",
"Brian Hill",
"Daylon Mack",
"Grant Newsome",
"LJ Scott",
"Michael Dieter",
"Nick Fitzgerald",
"TJ Edwards",
"Will Grier",
"Kendall Joseph",
"Jerry Tillery",
"Brock Ruble",
"Andre Dillard",
"Byron Cowart",
"CJ Conrad",
"George Panos",
"Caleb Wilson",
"Dontavius Russell",
"Sam Beal",
"Chase Hansen",
"Adam Breneman",
"Jaylon Ferguson",
"Casey Tucker",
""
) & !(name == 'Josh Allen' & position == 'LB')
) %>%
mutate(
date = as.Date(date, format = "%m/%d/%Y"),
draft_weight = 1 / ((max(date) + 1) - date) %>% as.numeric,
draft_year = 2018
) -> prj
read.csv("https://github.com/benjaminrobinson/2018NFLMockDrafts/raw/master/data/2018%20NFL%20Mock%20Draft%20Data%20-%20Actuals.csv") %>%
mutate(date = as.Date(date, format = "%m/%d/%Y"),
draft_year = 2018) -> act
prj %>%
left_join(
act %>%
rename(actual = pick) %>%
distinct(round, actual, name, position, school, team)
) %>%
mutate(team = ifelse(is.na(actual), "Undrafted", team),
round = ifelse(is.na(actual), 8, round),
actual = ifelse(is.na(actual), 257, actual)) -> pick
```

`## Joining, by = c("name", "position", "school")`

```
prj %>%
mutate(n_drafts = n_distinct(paste0(site, date))) %>%
group_by(name, position, school, draft_year, n_drafts) %>%
summarize(
draft_count = n(),
average_draft_position = mean(pick, na.rm = TRUE),
median_draft_position = median(pick, na.rm = TRUE),
weighted_average_draft_position = weighted.mean(pick, draft_weight, na.rm = TRUE),
sd = sd(pick, na.rm = TRUE),
sd = ifelse(is.na(sd), NA, sd)
) %>%
ungroup %>%
mutate(
draft_share = draft_count / n_drafts,
draft_share = ifelse(draft_share > 1, 1, draft_share)
) %>%
select(-n_drafts) %>%
left_join(act %>% distinct(round, pick, name, position, school, team)) %>%
mutate(pick = ifelse(is.na(pick), max(act$pick) + 1, pick),
round = ifelse(is.na(round), max(act$round) + 1, round)) %>%
gather(metric,
value,
-name,
-position,
-school,
-draft_year,
-pick,
-round,
-team,
-draft_share,
-draft_count,
-sd) %>%
mutate(
metric = gsub("_", " ", metric),
metric = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", metric, perl = TRUE),
lwr = ifelse(is.na(sd), value, value - 1.96*sd),
lwr = ifelse(lwr <= 1, 1, lwr),
upr = ifelse(is.na(sd), value, value + 1.96*sd),
upr = ifelse(upr >= 256, 256, upr)
) %>%
left_join(
teamcolors %>%
filter(league == 'nfl') %>%
rename(team = name)
) %>%
group_by(metric) %>%
mutate(rank = dense_rank(value)) %>%
as.data.frame -> agg
```

`## Joining, by = c("name", "position", "school")`

`## Joining, by = "team"`

## The Metrics

Now that we’ve done that, let’s use a univariate linear regression to compare the three basic measures of mock draft position against actual draft position and see which one is most accurate:

```
agg %>%
filter(!is.na(sd)) %>%
group_by(metric) %>%
do(mock = lm(pick ~ value, data = .)) %>%
glance(mock) %>%
as.data.frame
```

```
## metric r.squared adj.r.squared sigma
## 1 Average Draft Position 0.5423606 0.5409077 59.61236
## 2 Median Draft Position 0.5063135 0.5047462 61.91562
## 3 Weighted Average Draft Position 0.6332899 0.6321257 53.36249
## statistic p.value df logLik AIC BIC deviance
## 1 373.3148 2.072960e-55 2 -1744.653 3495.306 3506.583 1119394.7
## 2 323.0567 3.292161e-50 2 -1756.670 3519.340 3530.617 1207566.5
## 3 543.9892 1.354256e-70 2 -1709.544 3425.087 3436.364 896979.8
## df.residual
## 1 315
## 2 315
## 3 315
```

So it turns out that in a simple linear regression the best of the basic metrics, Weighted Average Draft Position, explains about **TWO-THIRDS** of the variation in actual draft selections. That’s pretty great on its face for a single metric!

However, this takes into account the whole universe of mock drafts, and most folks only care about the first round of the NFL Draft; that’s probably the only part of the draft where it’s worth using mock draft data to predict the outcome en masse. Let’s focus on the first round of the draft:

```
agg %>%
filter(!is.na(sd)) %>%
ggplot(aes(
value,
pick,
color = factor(team, levels = team %>% unique %>% sort),
fill = factor(team, levels = team %>% unique %>% sort)
)) +
geom_point(size = 3) +
geom_text_repel(aes(label = name)) +
geom_abline(slope = 1,
intercept = 0,
size = .5) +
geom_smooth(aes(group = metric), method = 'lm', formula = 'y ~ x') +
scale_color_manual(
values = agg %>% filter(!is.na(team)) %>% distinct(team, primary) %>% arrange(team) %>% select(-team) %>% unlist %>% unname
) +
scale_fill_manual(
values = agg %>% filter(!is.na(team)) %>% distinct(team, secondary) %>% arrange(team) %>% select(-team) %>% unlist %>% unname
) +
scale_x_continuous(limits = c(1, 32), breaks = c(1, 16, 32)) +
scale_y_continuous(limits = c(-10, 174),
breaks = c(1, 32, 64, 96, 137, 174)) +
facet_wrap(~metric, ncol = 1) +
theme_fivethirtyeight() +
theme(legend.position = "none") +
labs(x = "Mock Draft Position",
y = "Actual Draft Position",
title = "2018 NFL Draft 1st Round Projections",
subtitle = "Compared to Actual Draft Position",
caption = "Data and Graph by @benj_robinson"
)
```

`## Warning: Removed 828 rows containing non-finite values (stat_smooth).`

`## Warning: Removed 828 rows containing missing values (geom_point).`

`## Warning: Removed 828 rows containing missing values (geom_text_repel).`

```
agg %>%
filter(!is.na(sd) & rank <= 32) %>%
ggplot(aes(
value,
pick,
color = factor(team, levels = team %>% unique %>% sort),
fill = factor(team, levels = team %>% unique %>% sort)
)) +
geom_point(size = 3) +
geom_text_repel(aes(label = name)) +
geom_abline(slope = 1,
intercept = 0,
size = .5) +
geom_smooth(aes(group = metric), method = 'lm', formula = 'y ~ x') +
scale_color_manual(
values = agg %>% filter(!is.na(team)) %>% distinct(team, primary) %>% arrange(team) %>% select(-team) %>% unlist %>% unname
) +
scale_fill_manual(
values = agg %>% filter(!is.na(team)) %>% distinct(team, secondary) %>% arrange(team) %>% select(-team) %>% unlist %>% unname
) +
scale_x_continuous(limits = c(1, 32), breaks = c(1, 16, 32)) +
scale_y_continuous(limits = c(-10, 174),
breaks = c(1, 32, 64, 96, 137, 174)) +
facet_wrap(~metric, ncol = 1) +
theme_fivethirtyeight() +
theme(legend.position = "none") +
labs(x = "Mock Draft Position",
y = "Actual Draft Position",
title = "2018 NFL Draft 1st Round Projections",
subtitle = "Compared to Actual Draft Position",
caption = "Data and Graph by @benj_robinson"
)
```

`## Warning: Removed 8 rows containing non-finite values (stat_smooth).`

`## Warning: Removed 8 rows containing missing values (geom_point).`

`## Warning: Removed 8 rows containing missing values (geom_text_repel).`

As we can see, even within the first round there is quite a bit of noise in the data. Thanks to a number of outliers, mostly players projected to go in the first round who went later (as well as players projected for later rounds who went in the first), these metrics do not predict the first round as well overall.

It seems that if we use the Weighted Average Draft Position metric, we can remove some of those NFL Draft false positives: players we thought might go in the 1st round based on data points from earlier in the draft process, but who really didn’t have much business going in the 1st round as the process played out. Let’s look at some common performance metrics before we move on to modeled estimates of mock draft position:

```
## Player Subset Metric
## 1 First Round Actual Draft Picks Average Draft Position
## 2 First Round Actual Draft Picks Median Draft Position
## 3 First Round Actual Draft Picks Weighted Average Draft Position
## Mock Draft Position Mean Squared Error
## 1 191.8729
## 2 168.0625
## 3 283.6368
## Mock Draft Ranking Mean Squared Error Mock Draft Metric Correlation
## 1 364.59375 0.6918107
## 2 98.84375 0.6947786
## 3 423.12500 0.6252490
## Mock Draft Ranking Correlation
## 1 0.6831410
## 2 0.6671132
## 3 0.6017523
```

```
## Player Subset Metric
## 1 First Round Mock Draft Rank Average Draft Position
## 2 First Round Mock Draft Rank Median Draft Position
## 3 First Round Mock Draft Rank Weighted Average Draft Position
## Mock Draft Position Mean Squared Error
## 1 754.2557
## 2 2809.6447
## 3 110.7990
## Mock Draft Ranking Mean Squared Error Mock Draft Metric Correlation
## 1 756.9375 0.6165475
## 2 3183.6491 0.4558563
## 3 103.8125 0.7820824
## Mock Draft Ranking Correlation
## 1 0.5905570
## 2 0.4680832
## 3 0.8011319
```

```
## Player Subset Metric
## 1 First Round Mock Draft Position Average Draft Position
## 2 First Round Mock Draft Position Median Draft Position
## 3 First Round Mock Draft Position Weighted Average Draft Position
## Mock Draft Position Mean Squared Error
## 1 692.6951
## 2 1920.4643
## 3 437.5016
## Mock Draft Ranking Mean Squared Error Mock Draft Metric Correlation
## 1 686.3590 0.5530910
## 2 2220.3265 0.4988623
## 3 407.5833 0.6105045
## Mock Draft Ranking Correlation
## 1 0.4949965
## 2 0.4920950
## 3 0.6331873
```

What we see confirms the need for multiple metrics of comparison. While Weighted Average Draft Position minimizes the Mean Squared Error for a player’s estimated draft position and draft rank, the Median Draft Position metric, with the benefit of hindsight, best predicts the players who actually went in the 1st round, mostly because by definition the median does not react as strongly to outliers. However, based on my assumptions and the quantity and quality of my 2019 NFL Mock Draft data, I will use Weighted Average Draft Position going forward as my central tendency metric of choice for predicting actual draft position.
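The comparison tables above boil down to a couple of simple computations. A small sketch with hypothetical projections and picks (not values from the real tables) shows how position error, ranking error, and correlation are each calculated:

```r
# Hypothetical projections vs. actual picks for five players.
projected <- c(1.5, 3.2, 2.8, 6.0, 10.5)
actual    <- c(1, 2, 4, 5, 12)

mse_position <- mean((projected - actual)^2)  # error in raw draft position
proj_rank    <- rank(projected)               # convert projections to a ranking
mse_rank     <- mean((proj_rank - actual)^2)  # error of the implied ranking
metric_cor   <- cor(projected, actual)        # correlation of metric vs. actual

mse_position # 1.276
mse_rank     # 11
metric_cor   # ~0.97
```

Note how the ranking-based error can diverge from the position-based error: a metric can order players well while still missing their exact slots, which is why the tables report both.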

## The Models

From the world of aggregated metrics, let’s move to the more abstract world of models. I’ll test two models on the raw mock draft data (not the aggregates): good old linear regression and Loess regression. First, let’s put together the dataset:

```
suppressWarnings(
bind_rows(
prj %>%
group_by(name, position, school, draft_year, metric = "Linear Regression") %>%
do(mock = lm(pick ~ date, data = .)) %>%
mutate(value = predict(mock, data.frame(date = as.Date("2018-04-26")), interval = 'confidence', level = .95)[1],
lwr = predict(mock, data.frame(date = as.Date("2018-04-26")), interval = 'confidence', level = .95)[2],
upr = predict(mock, data.frame(date = as.Date("2018-04-26")), interval = 'confidence', level = .95)[3],
lwr = ifelse(lwr <= 1, 1, lwr),
upr = ifelse(upr >= 256, 256, upr)) %>%
select(-mock) %>%
filter(!is.na(lwr)),
prj %>%
left_join(
agg %>%
distinct(name, position, school, draft_count)
) %>%
filter(draft_count > 2) %>%
group_by(name, position, school, draft_year, metric = "Loess Regression") %>%
do(mock = loess(pick ~ date %>% as.numeric, data = .)) %>%
mutate(value = predict(mock, data.frame(date = as.Date("2018-04-26")), se = TRUE)[1] %>% unlist,
se = predict(mock, data.frame(date = as.Date("2018-04-26")), se = TRUE)[2] %>% unlist,
lwr = value - 1.96*se,
lwr = ifelse(lwr <= 1, 1, lwr),
upr = value + 1.96*se,
upr = ifelse(upr >= 256, 256, upr)
) %>%
select(-mock, -se) %>%
filter(!is.na(value) & !is.na(lwr))
) %>%
left_join(
act
) %>%
left_join(teamcolors %>%
filter(league == 'nfl') %>%
rename(team = name)
) %>%
group_by(metric) %>%
mutate(rank = dense_rank(value),
residual_pick = abs(value - pick),
residual_rank = abs(rank - pick)) %>%
as.data.frame
) -> models
```

`## Joining, by = c("name", "position", "school")`

`## Joining, by = c("name", "position", "school", "draft_year")`

`## Joining, by = "team"`

Let’s run the same diagnostics we did for the aggregate metrics:

```
models %>%
group_by(metric) %>%
do(mock = lm(pick ~ value, data = .)) %>%
glance(mock) %>%
as.data.frame
```

```
## metric r.squared adj.r.squared sigma statistic
## 1 Linear Regression 0.6005095 0.5985981 44.04117 314.1664
## 2 Loess Regression 0.6071465 0.6049760 40.99118 279.7315
## p.value df logLik AIC BIC deviance df.residual
## 1 1.614930e-43 2 -1097.0526 2200.105 2210.161 405381.6 209
## 2 1.436171e-38 2 -938.2045 1882.409 1892.038 304130.0 181
```

It looks like, overall, the linear and Loess regression models perform similarly well, with R-squared values explaining about 60% of the variation in the actual 2018 NFL Draft selections.

```
models %>%
ggplot(aes(
value,
pick,
color = factor(team, levels = team %>% unique %>% sort),
fill = factor(team, levels = team %>% unique %>% sort)
)) +
geom_point(size = 3) +
geom_text_repel(aes(label = name)) +
geom_abline(slope = 1,
intercept = 0,
size = .5) +
geom_smooth(aes(group = metric), method = 'lm', formula = 'y ~ x') +
scale_color_manual(
values = models %>% filter(!is.na(team)) %>% distinct(team, primary) %>% arrange(team) %>% select(-team) %>% unlist %>% unname
) +
scale_fill_manual(
values = models %>% filter(!is.na(team)) %>% distinct(team, secondary) %>% arrange(team) %>% select(-team) %>% unlist %>% unname
) +
scale_x_continuous(limits = c(1, 32), breaks = c(1, 16, 32)) +
scale_y_continuous(limits = c(-10, 174),
breaks = c(1, 32, 64, 96, 137, 174)) +
facet_wrap(~metric, ncol = 1) +
theme_fivethirtyeight() +
theme(legend.position = "none") +
labs(x = "Mock Draft Position",
y = "Actual Draft Position",
title = "2018 NFL Draft 1st Round Projections",
subtitle = "Compared to Actual Draft Position",
caption = "Data and Graph by @benj_robinson"
)
```

`## Warning: Removed 446 rows containing non-finite values (stat_smooth).`

`## Warning: Removed 446 rows containing missing values (geom_point).`

`## Warning: Removed 446 rows containing missing values (geom_text_repel).`

That conclusion from the regression summaries is borne out in the graph above, but what about when we focus on the first round, where the majority of the mock draft data comes from? Suddenly, the Loess regression looks a lot stronger.

```
models %>%
filter(rank <= 32) %>%
ggplot(aes(
value,
pick,
color = factor(team, levels = team %>% unique %>% sort),
fill = factor(team, levels = team %>% unique %>% sort)
)) +
geom_point(size = 3) +
geom_text_repel(aes(label = name)) +
geom_abline(slope = 1,
intercept = 0,
size = .5) +
geom_smooth(aes(group = metric), method = 'lm', formula = 'y ~ x') +
scale_color_manual(
values = models %>% filter(!is.na(team)) %>% distinct(team, primary) %>% arrange(team) %>% select(-team) %>% unlist %>% unname
) +
scale_fill_manual(
values = models %>% filter(!is.na(team)) %>% distinct(team, secondary) %>% arrange(team) %>% select(-team) %>% unlist %>% unname
) +
scale_x_continuous(limits = c(1, 32), breaks = c(1, 16, 32)) +
scale_y_continuous(limits = c(-10, 174),
breaks = c(1, 32, 64, 96, 137, 174)) +
facet_wrap(~metric, ncol = 1) +
theme_fivethirtyeight() +
theme(legend.position = "none") +
labs(x = "Mock Draft Position",
y = "Actual Draft Position",
title = "2018 NFL Draft 1st Round Projections",
subtitle = "Compared to Actual Draft Position",
caption = "Data and Graph by @benj_robinson"
)
```

```
## Player Subset Metric
## 1 First Round Actual Draft Picks Linear Regression
## 2 First Round Actual Draft Picks Loess Regression
## Mock Draft Position Mean Squared Error
## 1 238.9097
## 2 308.9240
## Mock Draft Ranking Mean Squared Error Mock Draft Metric Correlation
## 1 303.7188 0.6654195
## 2 384.0625 0.6060257
## Mock Draft Ranking Correlation
## 1 0.6601037
## 2 0.6113017
```

```
## Player Subset Metric
## 1 First Round Mock Draft Rank Linear Regression
## 2 First Round Mock Draft Rank Loess Regression
## Mock Draft Position Mean Squared Error
## 1 510.6507
## 2 97.4431
## Mock Draft Ranking Mean Squared Error Mock Draft Metric Correlation
## 1 477.0312 0.6424860
## 2 97.7500 0.7558404
## Mock Draft Ranking Correlation
## 1 0.6878918
## 2 0.7460792
```

```
## Player Subset Metric
## 1 First Round Mock Draft Position Linear Regression
## 2 First Round Mock Draft Position Loess Regression
## Mock Draft Position Mean Squared Error
## 1 485.4605
## 2 438.5455
## Mock Draft Ranking Mean Squared Error Mock Draft Metric Correlation
## 1 458.9706 0.6021054
## 2 391.7027 0.6245101
## Mock Draft Ranking Correlation
## 1 0.6200909
## 2 0.6628927
```

While it seems that the Linear Regression is best for the ex post facto analysis of looking back at the actual 2018 NFL Draft 1st-round picks, the Loess Regression is the better of the two models at minimizing the Mean Squared Error of predicted draft position relative to actuals among the players the models projected to go in the first round.

## The Conclusion

I don’t want to belabor the point too much, since that was a lot of math (but this is how we do exploratory analysis!). The math does tell us quite a bit (in the 60% range) about what the actual draft looks like, but that still leaves 40% of the variation in actual draft picks unexplained by the mock drafts. To me, this means we should not use data alone to predict the actual draft; combined with human intelligence, though, this data can do quite a bit to inform us about when a player might go in the draft.

Trades can and will occur a lot more than mock drafts likely predict (an analysis for another day), but I still think that, in an uncertain world, the NFL Draft is not as unpredictable as we would like to think. In fact, I’m especially bullish about my projections for the 2019 NFL Draft, given a mock draft dataset more than twice as large to work from and the knowledge I’ve gathered from working on this post about which metrics to pay closest attention to.

Look for my own (very meta) data-informed 2019 NFL Mock Draft on April 25th, the day of the draft, as well as a healthy amount of Twitter content the first two days of the draft. I’m as excited as you all to see what happens!