In my last post, I laid out my initial design for the NBA draft model that I am trying to develop. The next step was to get my hands on some real basketball data. I trawled the web for several days and ended up finding a couple of options. For my model to have the proper inputs, I will need robust data spanning multiple years (ideally >20 seasons, but I’ll take what I can get) for both the NBA and NCAA. You’d think that in today’s digital age this wouldn’t be much of a challenge, but having access to box scores (via NBA.com for example) is quite different from having years of box scores in an organized and accessible format. I will include all the useful links that I came across during my search in the hope that they prove useful to anyone who stumbles upon this post in the future.
Robust NBA data was much easier for me to locate than robust NCAA data. I suspect that this is due to a difference in the volume of data generated each season. There are only 30 teams in the NBA, each with 15 roster spots (excluding 2-way players), for a total of ~450 active NBA players in any given season. In contrast, there are ~350 NCAA Division 1 basketball teams with ~13 players per team, which works out to roughly 4,500 players, or about ten times as many.
Anyway, there are several good options for collecting usable NBA data without going to the trouble of web scraping. Basketball-reference is a great resource with just about every cut of data you could want (aside from Second Spectrum player tracking data). The site even has a useful tool that allows downloads of its various tables in CSV format. Depending on your data needs, you may be able to get by with manually downloading the tables you need, but the number of tables I would have to pull for this project makes this infeasible without a web scraper. To make matters even more difficult, Basketball-reference has apparently implemented measures to stop excessive scraping, so I opted to look in a different direction.
I also considered going directly to stats.NBA.com for data. The web client is great if you are just looking for a specific stat or playing around with all the various filters, but unlike basketball-reference, there does not appear to be an option on the front end to export any of this data. I did find some documentation of the site’s HTTP endpoints (here, and here), whose responses could be saved as JSON files and parsed. Here is an example link which shows the draft combine results from the 2016-17 draft class (https://stats.nba.com/stats/draftcombinedrillresults/?LeagueID=00&SeasonYear=2016-17). I actually began experimenting with pulling data from this source until my roommate randomly found a dataset on Kaggle that fit my needs exactly.
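For anyone who wants to experiment with that endpoint, here is a minimal sketch of how the example link could be assembled and how a response might be flattened into records. The "resultSets" / "headers" / "rowSet" layout is my assumption about how the JSON is organized, the column names and values in the sample are made up for illustration, and the function names are hypothetical. The sketch does not actually hit the server, which may also expect browser-like request headers.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://stats.nba.com/stats"


def build_url(endpoint, **params):
    """Assemble an endpoint URL shaped like the draft combine link above."""
    return f"{BASE_URL}/{endpoint}/?{urlencode(params)}"


def result_set_to_records(payload, index=0):
    """Pair each row with the column headers to get one dict per row.

    Assumes the payload holds a "resultSets" list whose entries carry
    "headers" and "rowSet" keys (an assumption, not confirmed here).
    """
    result_set = payload["resultSets"][index]
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]


# Made-up sample in the assumed layout; these are not real combine results.
SAMPLE_RESPONSE = json.loads("""
{
  "resultSets": [
    {
      "name": "Results",
      "headers": ["PLAYER_ID", "PLAYER_NAME", "STANDING_VERTICAL_LEAP"],
      "rowSet": [
        [1, "Player A", 30.5],
        [2, "Player B", 28.0]
      ]
    }
  ]
}
""")

url = build_url("draftcombinedrillresults", LeagueID="00", SeasonYear="2016-17")
records = result_set_to_records(SAMPLE_RESPONSE)

print(url)
for record in records:
    print(record["PLAYER_NAME"], record["STANDING_VERTICAL_LEAP"])
```

Turning each row into a dict keyed by header keeps the code readable if the column order ever changes, at the cost of a little memory.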
Kaggle is a data science community owned by Google with a variety of publicly available datasets. The NBA dataset can be found here and was created by Omri Goldstein (and further supplemented by user AbidR) via basketball-reference. Although I do eventually want to develop a scraper for stats.nba.com, I will table that project for now in the interest of time. The dataset is easily downloadable and already formatted as about 2 MB of CSV files. This should serve as a solid starting point for my purposes.
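As a rough illustration of what a first pass over one of those CSV files might look like, here is a small sketch using only Python’s standard library. The column names (Year, Player, Tm, G) and the sample rows are assumptions made up for the example and may not match the actual files, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_rows(csv_file):
    """Read every row of an open CSV file into a list of dicts."""
    return list(csv.DictReader(csv_file))


def games_by_season(rows):
    """Sum games played per season, skipping rows with blank values."""
    totals = defaultdict(int)
    for row in rows:
        if not row["Year"] or not row["G"]:
            continue  # blank separator rows carry no data
        # Seasons may be stored as floats such as "2016.0"
        totals[int(float(row["Year"]))] += int(float(row["G"]))
    return dict(totals)


# Made-up sample with assumed column names; not real player data.
SAMPLE_CSV = """Year,Player,Tm,G
2016.0,Player A,AAA,82
2016.0,Player B,BBB,70
2017.0,Player A,AAA,65
,,,
"""

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(games_by_season(rows))
```

To run this against a downloaded file instead of the inline sample, open the file with newline="" as the csv module recommends and pass the file object to load_rows.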
Robust college data, on the other hand, was quite a bit trickier to find. I could not find any similar endpoint documentation for the NCAA website, and there were no easily downloadable datasets on Kaggle. I did, however, find a non-downloadable dataset on Kaggle which seems to have been uploaded jointly by Google and the NCAA. The “catch” is that you have to use the BigQuery API to query the specific data that you want instead of having direct access to the full database. This seems to be a solid option (and I may go back to it at some point), but I opted for a simpler solution.
As a quick aside, I would like to highlight a blog series from Patrick Howell which I found very informative. He discusses his ESPN box score scraper and several unforeseen issues that he had to account for before he could get his code working properly. His experience informed my decision to find an established dataset to work with instead of jumping directly into building my own (at least for now).
Anyway, the dataset that I am planning to proceed with is the College Basketball “Draft Model Starter Kit” database from Will Schreefer at The Stepien. This “kit” includes data from as far back as 2002-03 in addition to some extra goodies such as combine measurements and recruit rankings. When I contacted Will, he was very quick to respond, and I had my hands on all the data I should need within an hour of my first email to him. All he asks for is a small donation of $3 (or more if you’re feeling generous), which will go toward maintaining the apps/servers he uses.
Now that I have my hands on the necessary data, my next step will be to begin working with it. I’ll need to give myself a bit of a refresher on statistical models before I feel comfortable proceeding from this point. For now, I’ll start by reading select chapters from An Introduction to Statistical Learning with Applications in R. In my next post I will try to share some of my notes from the text as well as any preliminary results I may have. Until next time, thanks for reading!