print("Hello, world!")
Hello, world!
👋 Hi, we're Modelbit, and this is our blog.
Our blog is a little different. Every post is a downloadable, runnable Jupyter notebook. We hope to use it to share useful tips and techniques with the data science community. That way, you can download these notebooks and try the tips yourself!
Our first post is about something near and dear to our heart: Model deployment. It's hard. It's complicated. We want to make it simple, and here we'll show you how.
It's true, this post shows off some of our own product. Not all of them will. We hope most will just be data science techniques we learned along the way. And of course there will be some corporate announcements that don't have much data science at all. But we promise to embed some runnable code in those too, because that's just our style. 😎
Let's make a prediction 🤔
To show off how to deploy a model, we'll need a model! At Modelbit, we're Golden State Warriors fans. (We believe! 😉) So for our model, we'll use Nathan Lauga's basketball dataset. For simplicity, we're going to download it from Modelbit so we get type inference and other simplifiers. But you could also download it from Kaggle and load it up yourself.
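If you go the Kaggle route, the load itself is a one-liner with pandas. Here's a sketch, assuming the stat lines live in a file named games_details.csv (the file name may differ in your download):
import pandas as pd

# Hypothetical alternative: load the stat lines straight from the Kaggle
# download. The file name is an assumption; adjust it to match yours.
df = pd.read_csv("games_details.csv")
For the rest of this post, though, we'll grab the dataset from Modelbit: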
import modelbit
mb = modelbit.login()
Connect to Modelbit
Open modelbit.com/t/e2386a5506ed406c to authenticate this kernel, then re-run this cell.
df = mb.get_dataset("nba games")
Let's see what we've got:
df
| | GAME_ID | TEAM_ID | TEAM_ABBR | TEAM_CITY | PLAYER_ID | PLAYER_NAME | PLAYER_NICKNAME | START_POSITION | COMMENTS | MINUTES_PLAYED | ... | DREB | REB | AST | STL | BLK | TURNOVERS | PF | PTS | PLUS_MINUS | PREDICTED_POINTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22100213 | 1610612766 | CHA | Charlotte | 1630176 | Vernon Carey Jr. | Vernon | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 1 | 22100213 | 1610612766 | CHA | Charlotte | 202397 | Ish Smith | Ish | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 2 | 22100213 | 1610612766 | CHA | Charlotte | 1630550 | JT Thor | JT | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 3 | 22100214 | 1610612754 | IND | Indiana | 1629052 | Oshae Brissett | Oshae | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 4 | 22100214 | 1610612754 | IND | Indiana | 202954 | Brad Wanamaker | Brad | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 626106 | 21200003 | 1610612747 | LAL | Los Angeles | 203135 | Robert Sacre | NaN | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 626107 | 21200001 | 1610612764 | WAS | Washington | 201858 | Cartier Martin | NaN | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 626108 | 21200001 | 1610612739 | CLE | Cleveland | 201956 | Omri Casspi | NaN | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 626109 | 21200001 | 1610612739 | CLE | Cleveland | 202720 | Jon Leuer | NaN | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 626110 | 21200001 | 1610612739 | CLE | Cleveland | 202396 | Samardo Samuels | NaN | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
626111 rows × 30 columns
This dataset has a stat line for every player in every NBA game since 2004.
So let's build a model to predict the number of points a player will score in a game. To keep things simple, we'll fit a linear regression with a single feature: the number of baskets they made.
In order to do that, we'll have to do a bit of data cleaning. If a player didn't make a basket or score a point (like the "DNP" rows above), they'll have a null value in the respective column, and our regression will crash. Let's just drop those rows for now:
df.dropna(inplace=True, subset=["FGM", "PTS"])
df
| | GAME_ID | TEAM_ID | TEAM_ABBR | TEAM_CITY | PLAYER_ID | PLAYER_NAME | PLAYER_NICKNAME | START_POSITION | COMMENTS | MINUTES_PLAYED | ... | DREB | REB | AST | STL | BLK | TURNOVERS | PF | PTS | PLUS_MINUS | PREDICTED_POINTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | 22100213 | 1610612764 | WAS | Washington | 203484 | Kentavious Caldwell-Pope | Kentavious | F | NaN | 27:41 | ... | 5.0 | 6.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 2.0 | 3 |
| 48 | 22100213 | 1610612764 | WAS | Washington | 1628398 | Kyle Kuzma | Kyle | F | NaN | 30:28 | ... | 4.0 | 5.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | 5.0 | -14.0 | 6 |
| 49 | 22100213 | 1610612764 | WAS | Washington | 1629655 | Daniel Gafford | Daniel | C | NaN | 24:21 | ... | 7.0 | 9.0 | 1.0 | 2.0 | 1.0 | 1.0 | 4.0 | 20.0 | -2.0 | 24 |
| 50 | 22100213 | 1610612764 | WAS | Washington | 203078 | Bradley Beal | Bradley | G | NaN | 35:07 | ... | 3.0 | 3.0 | 7.0 | 2.0 | 0.0 | 2.0 | 3.0 | 24.0 | -9.0 | 24 |
| 51 | 22100213 | 1610612764 | WAS | Washington | 203915 | Spencer Dinwiddie | Spencer | G | NaN | 28:34 | ... | 3.0 | 3.0 | 2.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | -5.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 625953 | 11200005 | 1610612743 | DEN | Denver | 202706 | Jordan Hamilton | NaN | NaN | NaN | 19 | ... | 2.0 | 2.0 | 0.0 | 2.0 | 0.0 | 1.0 | 3.0 | 17.0 | NaN | 11 |
| 625954 | 11200005 | 1610612743 | DEN | Denver | 202702 | Kenneth Faried | NaN | NaN | NaN | 23 | ... | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 3.0 | 18.0 | NaN | 18 |
| 625955 | 11200005 | 1610612743 | DEN | Denver | 201585 | Kosta Koufos | NaN | NaN | NaN | 15 | ... | 5.0 | 8.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 6.0 | NaN | 8 |
| 625956 | 11200005 | 1610612743 | DEN | Denver | 202389 | Timofey Mozgov | NaN | NaN | NaN | 19 | ... | 2.0 | 3.0 | 1.0 | 0.0 | 0.0 | 4.0 | 2.0 | 2.0 | NaN | 3 |
| 625957 | 11200005 | 1610612743 | DEN | Denver | 201951 | Ty Lawson | NaN | NaN | NaN | 27 | ... | 2.0 | 2.0 | 6.0 | 2.0 | 0.0 | 6.0 | 1.0 | 8.0 | NaN | 8 |
523751 rows × 30 columns
Over 500,000 rows left! FGM, or field goals made, is the number of baskets a player made, and PTS, or points, is the number of points they scored.
So let's build our model:
from sklearn.linear_model import LinearRegression
l = LinearRegression()
l.fit(df[["FGM"]].values, df["PTS"])
LinearRegression()
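As a quick sanity check, let's peek at what it learned. On this data the slope works out to roughly 2.56 points per basket (threes and free throws push it above 2), with an intercept near 0.5:
# The slope and intercept of the fitted line
l.coef_[0], l.intercept_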
OK, now we've got a trained scikit-learn linear regression! Let's graph it to see how we did:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(df["FGM"], df["PTS"])
plt.plot(df["FGM"], l.predict(df[["FGM"]].values), c="g")
[<matplotlib.lines.Line2D at 0x7f99b3b68820>]
The blue dots are our dataset, and the green line is our model. Looking good! This model is ready to deploy.
It's time to deploy 🚀
Every model needs a little input sanitization, data cleaning, or feature encoding. Ours is no different: we're going to have to handle the null case explicitly, because our linear regression isn't expecting null values.
To do that, we'll write a little Python function that checks for nulls and then calls the model. We could also use this function to scale the data, call out to a feature encoder, or any other business logic we might need.
Our little model only expects one input: the number of baskets made. That'll be the argument to our function.
def predictPoints(baskets_made):
    # Handle the null case explicitly
    if baskets_made is None:
        return 0
    # Call l, our linear regression
    return l.predict([[baskets_made]])[0]
predictPoints(5)
13.285483450179342
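And the null case short-circuits to 0, just as we designed it:
predictPoints(None)
0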
Looks like it's working! Let's deploy it. As promised, only one line of code:
mb.deploy(predictPoints)
Deployment "predictPoints" will be ready in a few seconds!
Huzzah! This link takes us to Modelbit, where we can see the code that was deployed, as well as the logs, model version information, and Python environment. (Since we didn’t specify one, we got the default Python 3.9 environment.)
We've also got the URL where the model lives! (Astute readers will note that my co-founder, Tom, named our workspace "Tom's House." 🙄)
Anyway, let's test the deployed model!
import requests
response = requests.post(
    "https://api.modelbit.com/api/d/toms_house/predictPoints/latest",
    json={"data": [[1, 5], [2, 10], [3, None]]}
)
response.json()
{'data': [[1, 13.285483450179342], [2, 26.069509622647892], [3, 0]]}
Here we called it in batch, asking for three predictions. Each inner list is an ID followed by the model's input, and each ID comes back paired with its prediction. They all worked, including the explicitly handled null case!
But what if we want to call this from our data warehouse? After all, our predictions might be lead scores or customer health scores. Those are best written back to the database...
Calling prediction models from your Warehouse ❄️
If we head back to Modelbit and click the "Snowflake" tab, we get some sample code for Snowflake. You'll see we also support other warehouse types, as well as deploying the code directly into Snowflake itself. For this example, though, let's keep it simple.
If we copy that sample code into Snowflake and run it, we get a SQL function that can call our model directly from the warehouse!
In fact, we can update the whole table that way! Remember, the column with the number of baskets a player made is called FGM. So let's go ahead and update the table:
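The statement looks something like this. (The function and table names here are assumptions for illustration; the sample code in your Snowflake tab shows the real ones.)
UPDATE nba_games SET predicted_points = predictpoints(fgm);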
That statement called our model 626,111 times, which took a little less than 16 seconds. So we could call out for predictions as part of building a dbt model directly in the warehouse, or call the model every time a new row is inserted into the table. This way, our table would always have the predictions living alongside the data!
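For instance, a dbt model that scores every row during the build might look something like this sketch (the model and source names are hypothetical):
-- models/nba_games_scored.sql: a hypothetical dbt model that scores
-- each row with our deployed function
select
  *,
  predictpoints(fgm) as predicted_points
from {{ source('nba', 'games') }}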
Obviously we're pretty excited about this. We hope you find it useful. If you have questions, or feedback, or just want to chat data science, please don't hesitate to reach out. See you around.