It’s been a while since my last update and let me just say that I am starting to understand the struggle of writer’s block.
Anyways, despite my lack of updates I have not been idle over the past couple months. I’ve built out version 1 of my draft model and have been tinkering with the results. Before I get into that, let me take a step back and document my model inputs, outputs and limitations.
The dataset I am using (The Stepian’s College Basketball Draft Model Starter Kit) includes 63 variables derived from NCAA data. These variables can be roughly grouped into four categories:
- Per game box score stats – points, turnovers, assists, etc.
- Season total stats – total points, total turnovers, total assists, etc.
- Advanced stats – true shooting %, Win Shares, offensive/defensive rating, block rate, etc.
- Miscellaneous stats – Age, height, wins, etc.
As you have probably noticed, many of these stats feature heavy overlap. For example, the dataset includes blocks per game, total blocks, and block rate. Additionally, some stats are directly related to each other such as total rebounds, defensive rebounds, and offensive rebounds per game. This presents challenges in dimensionality and multicollinearity in my draft model that I will need to adjust for, but more on that when I discuss the model methodologies.
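To make the overlap concrete, here is a minimal sketch of one way to flag redundant inputs: compute the pairwise Pearson correlation between columns and list the pairs above a cutoff. The column names, the toy numbers, and the 0.9 threshold are all made up for illustration; this is not the actual model code or the real dataset.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy columns, one value per player. Numbers are invented.
players = {
    "blocks_per_game": [0.5, 1.0, 2.0, 3.0],
    "total_blocks": [15, 32, 60, 93],
    "points_per_game": [18.0, 9.0, 14.0, 11.0],
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged

flagged = highly_correlated_pairs(players)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.3f}")
```

In this toy data only the two block columns get flagged, which is the kind of redundancy described above. Pairwise correlation is just a first pass; it will not catch a stat that is a combination of several others, such as total rebounds being the sum of offensive and defensive rebounds.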
A key question that I had to grapple with was how to approach model inputs for players playing multiple college seasons. Take Ivan Rabb for example; he played 2 seasons at my alma mater Cal before entering the 2017 NBA draft. Should I:
- Use Rabb’s sophomore stats only
- Use an average between Rabb’s freshman and sophomore stats
- Find a way to incorporate both seasons with a weighted adjustment
Intuitively I leaned towards the third option, but that raises the question of how to apply the weighting. Presumably the season prior to entering the NBA should “count” for the most, but how much? Any split that I tried to come up with seemed both arbitrary and subjective. Eventually I decided to do what all good millennials do and simply put off making this decision. I didn’t want to fall down the rabbit hole of data weighting when I didn’t even have a draft model built out yet. Additionally, with the prevalence of one-and-dones entering the NBA, it felt wrong to mix weighted and unweighted data without solid justification. For now, I decided to use only each player’s final college season (though I do plan to revisit this topic in the future). One lesson I have taken to heart from my market research / consulting work is to always try to keep projects in-scope and progressing.
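For reference, the three options can be sketched in a few lines. The per-game numbers and the 1:2 weighting below are hypothetical, picked only to show the mechanics; they are not Rabb’s actual stats or a recommended split.

```python
def final_season_only(seasons):
    """Use only the last college season (the approach used for now)."""
    return seasons[-1]

def simple_average(seasons):
    """Give every college season equal weight."""
    return sum(seasons) / len(seasons)

def weighted_average(seasons, weights):
    """Weight each season; seasons and weights are ordered oldest first."""
    if len(seasons) != len(weights):
        raise ValueError("need exactly one weight per season")
    return sum(w * s for w, s in zip(weights, seasons)) / sum(weights)

# Hypothetical points per game, oldest season first.
two_year_player = [12.0, 14.0]
one_and_done = [15.0]

print(final_season_only(two_year_player))
print(simple_average(two_year_player))
print(weighted_average(two_year_player, weights=[1, 2]))  # arbitrary 1:2 split
print(weighted_average(one_and_done, weights=[1]))
```

For a one-and-done player all three options return the same number, so the choice only changes the inputs for multi-year players. That is the weighted versus unweighted mixing concern mentioned above.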
The output that the model is designed to predict is extremely important and worth diving into. The choice of output in many ways depends on the answer to a fundamental question – how do you define a successful draft pick? Is it most important that a player be able to contribute right away, or have long term upside? Depending on the answer, the metric you would want to use could differ significantly.
For my model I decided my output would be the Win Shares accumulated by each player in their first 4 seasons. I chose Win Shares because it is a metric that tries to measure a player’s cumulative value produced in terms of generating wins. The specific formula for calculating Win Shares is complicated, but at a high level it tries to credit the players on a team for the actual wins the team produced in a given season. For example, if you were to add up the Win Shares of all the players on a 50-win team, the sum would be roughly equal to 50 Win Shares.
There are other all-in-one metrics I could have used as outputs (PIPM, PER, and Net Rating, to name a few). However, I found Win Shares to be the most intuitive and liked the way it tied directly into team performance.
For more about Win Shares, check out this explainer from Basketball-Reference.
The timeframe I decided to use for Win Shares was the 4 seasons following a player entering the NBA draft. This is because 4 seasons is the standard contract length for first-round picks. Due to restricted free agency rules, teams can typically retain these players beyond their rookie contracts, but a determined player can force his way to unrestricted free agency following his 5th season by taking the qualifying offer. Additionally, the first 4 seasons are often the most valuable from a team-building perspective because the rookie scale contract rules essentially prevent negotiation on that first contract.
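As a rough sketch, the output for one player could be computed like this. The function name, the dictionary layout (season starting year mapped to Win Shares), and the sample values are all assumptions made for illustration. Counting a season with no NBA record as 0 is also an assumption of this sketch.

```python
def win_shares_target(win_shares_by_season, draft_year, n_seasons=4):
    """Sum Win Shares over the n_seasons seasons that follow the draft.

    win_shares_by_season maps a season's starting year to the Win Shares
    recorded that season. Seasons with no entry are counted as 0.
    """
    return sum(
        win_shares_by_season.get(draft_year + offset, 0.0)
        for offset in range(n_seasons)
    )

# Hypothetical player drafted in 2017 who has no record for 2019-20.
example = {2017: 1.5, 2018: 3.0, 2020: 4.5, 2021: 6.0}
target = win_shares_target(example, draft_year=2017)
print(target)
```

Here the 2021-22 season is ignored because it falls outside the 4-season window, and the missing 2019-20 season contributes nothing to the total.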
Limitations & Considerations
As with any model, there are inevitably going to be limitations and considerations that need to be accounted for when interpreting results. For this model, the training data used is NCAA college basketball data between the 2002-03 season and the 2013-14 season, which accounts for ~4,600 individual players. However, this dataset is by no means comprehensive. Players with data missing in any of the 63 categories had to be excluded for the model to run properly. This was most common for players from smaller schools. Because the dataset consists solely of former NCAA players, players entering the draft via high school or international play also had to be excluded.
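The missing-data exclusion amounts to keeping only complete rows. A minimal sketch, assuming each player is a dictionary and a missing value is stored as None (the column names and values here are hypothetical):

```python
def complete_cases(rows, required_columns):
    """Split rows into those with every required column present and the rest."""
    kept, dropped = [], []
    for row in rows:
        if all(row.get(col) is not None for col in required_columns):
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped

# Hypothetical rows with 3 columns standing in for the full set of 63.
rows = [
    {"name": "Player A", "height": 80, "block_rate": 4.1},
    {"name": "Player B", "height": None, "block_rate": 2.3},  # missing height
    {"name": "Player C", "height": 77},                       # no block_rate at all
]
kept, dropped = complete_cases(rows, ["name", "height", "block_rate"])
print(len(kept), "kept,", len(dropped), "dropped")
```

Returning the dropped rows as well makes it easy to check who is being excluded, for example whether they skew towards smaller schools.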
One other key challenge I had to grapple with was what to do with college players who never played in the NBA in any of those first 4 seasons. These players would not have recorded any NBA Win Shares, so there would be no actual output which could be used by the model.
One option was to exclude these players as well and thereby avoid making any assumptions on their hypothetical Win Shares produced. I feared that this approach would introduce significant bias to my model. By including only players that reached the NBA in the training data, there would be an implicit assumption that almost all the players I would be projecting for are also NBA-level players. I instead decided to artificially set these non-NBA player outputs to be 0 Win Shares produced. The underlying assumption is that these players would probably not have played significant minutes (if at all) had they made the NBA.
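A toy comparison shows why the two choices differ. The player names and Win Shares values below are invented; the point is only that dropping non-NBA players shifts the distribution of the output upwards.

```python
# Hypothetical college players and the NBA Win Shares found for them.
college_players = ["A", "B", "C", "D"]
nba_win_shares = {"A": 12.0, "C": 2.0}  # B and D never appeared in the NBA

# Zero-fill: every college player gets an output, defaulting to 0.
targets_zero_filled = {
    name: nba_win_shares.get(name, 0.0) for name in college_players
}

# Exclude: keep only players with an NBA record.
targets_nba_only = {
    name: nba_win_shares[name] for name in college_players if name in nba_win_shares
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

mean_zero_filled = mean(targets_zero_filled.values())
mean_nba_only = mean(targets_nba_only.values())
print(mean_zero_filled, mean_nba_only)
```

With only NBA players in the training data, the average output in this toy example doubles, so a model fit to that data would tend to start from the assumption that a typical college player is an NBA contributor.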
My next post will document my methodologies used to relate the inputs and outputs discussed above along with some preliminary draft model results. I promise that this next entry will not take nearly as long as this one did so stay tuned. Thanks for reading!