In case you missed it, read this earlier post on what The Extra Pass is - in short, a fly-on-the-wall view as I build a data project with Python.

What am I building?

Photo by Sergey Zolkin on Unsplash

Let's start with the project goals.

It's no surprise that I love to watch basketball. I am not an expert or even close to it, but I've done some amateur analysis in the area before, and built some basketball-related dashboards in the past for myself and for my clients.

Some basketball-related dashboards I've built before (basketboards?)

But one thing I have not done is build something that automatically updates its underlying data and reacts to any 'surprises' that stand out.

I also wrote earlier about how I wished that there were more metrics for consistency in sports.

Luckily for me these two factors align, and I hope to build an app which will:

  • Automatically fetch NBA game data
  • Evaluate the latest individual and team performances
  • Flag standout performances (good or bad) over one game or an extended stretch
  • Produce eye-catching visualisations
  • Post them on social media

Along with that, I'll have to define what I think are good measures for 'standout' performances. This will also be used to:

  • Build a dashboard capturing the outputs and metrics of player / team consistency, and
  • Deploy it somewhere online.

If my past experiences are anything to go by, each of these tasks will involve its own challenges and require some problem-solving.

Getting the data

Data source

The majority of my NBA data projects in the past have used bulk data products from various sources. One of the best sources of data (and frankly, NBA analysis) has been Darryl Blackport's PBPStats site; he offers his data through his Patreon.

If you're looking for historical data or snapshots, exporting data from Basketball-reference is a great option too.

But that isn't quite suitable for what I want to do here. As much as I love the NBA, updating the data manually on a daily basis is not going to happen. So that leaves scraping the data or using an API. Scraping data is not a preferred option for me. Although it's probably not illegal, it is against most sites' TOS, and it's just not something I want to do.

I have long heard about a magical, official NBA API, but it's undocumented. So I went looking for convenient wrappers on GitHub. A search for "NBA" on GitHub (see https://github.com/search?q=nba) reveals a number of popular packages.

Side note: GitHub topics is also a great way of discovering projects - like this https://github.com/topics/nba-stats

There's a Node.js-based client for JavaScript users, NBAStatR will help R wizards talk to the NBA's API, and there are many more. There's even a CLI-based tool for... "watching" text-based live updates on your console.

Live NBA text updates on your console 🤷🏻‍♂️! (NBA-Go)

I'm a Python man as you know, so I ended up here:

GitHub - swar/nba_api: An API Client package to access the APIs for NBA.com

Great, job done, right? Well, actually...

Um, have you ever been to a restaurant and been overwhelmed by the menu? These are the available API endpoints, and this list only runs up to the end of those starting with C.

Just a few (A-C) of the available API endpoints...

And although the endpoints are sensibly named, it was pretty difficult for me to ascertain the difference between the playergamelogs (plural) endpoint and the playergamelog endpoint.

What I ended up doing was to just run a Google search for "nba_api game logs", which pointed me to these two (first, second) GitHub threads on exactly what I was looking for.

I tested the code from the first thread, seeing what different API arguments did, and built these two functions.

To fetch a list of players for a season:

To fetch the game logs:

You might notice that the second function takes a test_mode argument. I do this from time to time when I don't want to repeat unnecessary volumes of work. There are a LOT of game logs to download, and if, for example, I had screwed up the code for saving them, I would have wasted a lot of time and made a bunch of requests for no good reason.

So I've set up a test mode which will break out of the loop after grabbing a few.
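As a rough illustration of the pattern (a sketch, not the project's actual code), the loop below stops after a few players when test_mode is on. The names fetch_gamelogs, fetch_one, test_limit and pause_s are hypothetical; the default fetcher assumes nba_api's PlayerGameLog endpoint and imports it lazily so the loop itself can be exercised without touching the network.

```python
import time


def fetch_player_gamelog(player_id, season):
    """Default fetcher: one player's game log via nba_api (imported lazily)."""
    from nba_api.stats.endpoints import playergamelog

    endpoint = playergamelog.PlayerGameLog(player_id=player_id, season=season)
    return endpoint.get_data_frames()[0]


def fetch_gamelogs(player_ids, season, fetch_one=fetch_player_gamelog,
                   test_mode=False, test_limit=3, pause_s=0.0):
    """Fetch game logs for many players, stopping early in test mode."""
    logs = {}
    for i, player_id in enumerate(player_ids):
        if test_mode and i >= test_limit:
            break  # a handful is enough to exercise the saving code
        logs[player_id] = fetch_one(player_id, season)
        time.sleep(pause_s)  # optional pause between requests
    return logs
```

Passing a stand-in for fetch_one (for example a lambda that returns a dummy value) makes it possible to check the loop and the saving logic before making any real requests.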


Note: Sanity-check your data!

At this point I should point out how important it is to sanity-check any outputs. When I tested the code from the GitHub thread, it only returned 115 players via len(df_cap)! That's way too few for a season - it should be about 450-500 at 15 players per team x 30 teams.

As it turns out, the code applies the flag "is_only_current_season" to the CommonAllPlayers call, and as best I can gather, it filters the returned data for players who are still active in the current season (as in 2021-22).

If I had used this code, it'd be returning a correct answer for the current season, but would not return players from the past seasons who are no longer active. Given the high volume of game logs per season, it's not obvious that I would have noticed it if looking just at the game logs later.
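A check like this can be automated so that an implausible player count gets flagged straight away. The following is only a sketch: check_player_count and its max_factor ceiling are hypothetical, and the lower bound is simply the 15 players x 30 teams estimate from above.

```python
def check_player_count(n_players, n_teams=30, roster_size=15, max_factor=1.5):
    """Return a warning string if the count looks implausible, else None."""
    low = n_teams * roster_size    # 450: every roster full, no churn
    high = int(low * max_factor)   # loose ceiling (an assumption) for mid-season churn
    if n_players < low:
        return f"Only {n_players} players; expected at least {low}"
    if n_players > high:
        return f"{n_players} players; expected at most {high}"
    return None
```

Calling check_player_count(len(df_cap)) right after the download would have flagged the 115-player result before any game logs were fetched.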


Anyway, these functions are used in broader functions to update the local data.

To update the list of players for all specified seasons:

And to update the game logs:

Link to code

These scripts are also set up to check whether the data has already been saved, and they skip re-downloading historical data for every season except the current one. So, running:

update_season_pl_list(2015, 2021)
update_season_pl_gamelogs()

will update the data for the seasons from 2015 to 2021.
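The skip logic can be sketched roughly as follows. This is an illustration under assumptions rather than the repo's code: seasons_to_download, the one-CSV-per-season layout and the "pl_gamelogs" prefix are all hypothetical, and the end year is treated as inclusive.

```python
from pathlib import Path


def seasons_to_download(start_year, end_year, current_year, data_dir,
                        prefix="pl_gamelogs"):
    """List the seasons (by starting year) that still need fetching."""
    todo = []
    for year in range(start_year, end_year + 1):  # end_year assumed inclusive
        path = Path(data_dir) / f"{prefix}_{year}.csv"
        # Always refresh the current season; otherwise fetch only if missing.
        if year == current_year or not path.exists():
            todo.append(year)
    return todo
```

Historical seasons never change, so a file that already exists can be trusted, while the current season keeps growing and is re-fetched on every run.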

A quirk of the NBA season that doesn't always translate well to code is its naming. Seasons straddle two calendar years (like 2021-2022), making it a little bit awkward from time to time.

So I wrote a couple of helper functions to convert the starting year (e.g. 2021) to the equivalent text suffix ('2021-22'), and to get the relevant "present" season from the current date (e.g. March 2022 -> '2021-22', September 2022 -> '2022-23').
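A minimal sketch of what such helpers might look like is below. The function names and the rollover_month cutoff are assumptions; September is chosen only because it matches the two examples given above.

```python
from datetime import date


def year_to_season_suffix(start_year):
    """Convert a starting year to NBA season text, e.g. 2021 -> '2021-22'."""
    return f"{start_year}-{(start_year + 1) % 100:02d}"


def current_season_start_year(today=None, rollover_month=9):
    """Return the starting year of the season that a given date falls in."""
    if today is None:
        today = date.today()
    # From the rollover month onwards, the date belongs to the new season.
    if today.month >= rollover_month:
        return today.year
    return today.year - 1


def current_season_suffix(today=None):
    """Season text for a date, e.g. March 2022 -> '2021-22'."""
    return year_to_season_suffix(current_season_start_year(today))
```

The modulo and zero-padding handle century boundaries, so 1999 becomes '1999-00' rather than '1999-100'.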

These live in the utils.py module and get called as needed. I'm trying to get better at practicing DRY (don't repeat yourself). The same utils module houses variables that get reused, such as filename prefixes shared across multiple functions, so that I don't accidentally break the library by renaming something in one place but not another.

All of that brings us to the project code at the v0.1 tag. The project code lives in this GitHub repo:

GitHub - databyjp/TheExtraPass: Sit alongside as a real Python-based data project is built

I've set up the project, familiarised myself with what I need from the nba_api wrapper, and written scripts to download the necessary data along the way.

In the next issue I'll talk a bit about the tech setup for the project, like Python version management, dependency/package management, my IDE, Python logging, using Git/GitHub effectively, and so on, which I think will be helpful.

As mentioned in the earlier post, the first couple of issues here will be free but the rest will be published for paying subscribers only.

At $5/mth, the price is going to be quite low for early subscribers, and I think it's great value for what you'll get from this set of tutorials and code. But let me know on Twitter what you think.

(If you sign up in the next few weeks it's actually only $4/mth!)

Visual/Noise - The Extra Pass
Learn data vis / analytics through practical, fun real-life examples. NBA and sports data visualisations, data dashboards with Python.

Thanks for your support.
