print("Hello, world!")
Hello, world!
👋 Hi, we're Modelbit, and this is our blog.
Our blog is a little different. Every post is a downloadable, runnable Jupyter notebook. We hope to use it to share useful tips and techniques with the data science community. That way, you can download these notebooks and try the tips yourself!
Our first post is about something near and dear to our heart: Model deployment. It's hard. It's complicated. We want to make it simple, and here we'll show you how.
It's true, this post shows off some of our own product. Not all of them will. We hope most will just be data science techniques we learned along the way. And of course there will be some corporate announcements that don't have much data science at all. But we promise to embed some runnable code in those too, because that's just our style. 😎
Let's make a prediction 🤔
To show off how to deploy a model, we'll need a model! At Modelbit, we're Golden State Warriors fans. (We believe! 😉) So for our model, we'll use Nathan Lauga's basketball dataset. For simplicity, we're going to download it from Modelbit so we get type inference and other simplifiers. But you could also download it from Kaggle and load it up yourself.
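If you go the Kaggle route, the load itself is a one-liner with pandas. Here's a sketch, assuming the stat lines live in a file named games_details.csv (the file name may differ in your download):
import pandas as pd

# Hypothetical alternative: load the stat lines straight from the Kaggle
# download. The file name is an assumption; adjust it to match yours.
df = pd.read_csv("games_details.csv")
For the rest of this post, though, we'll grab the dataset from Modelbit: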
import modelbit
mb = modelbit.login()
Connect to Modelbit
Open modelbit.com/t/e2386a5506ed406c to authenticate this kernel, then re-run this cell.
df = mb.get_dataset("nba games")
Let's see what we've got:
df
| | GAME_ID | TEAM_ID | TEAM_ABBR | TEAM_CITY | PLAYER_ID | PLAYER_NAME | PLAYER_NICKNAME | START_POSITION | COMMENTS | MINUTES_PLAYED | ... | DREB | REB | AST | STL | BLK | TURNOVERS | PF | PTS | PLUS_MINUS | PREDICTED_POINTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22100213 | 1610612766 | CHA | Charlotte | 1630176 | Vernon Carey Jr. | Vernon | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 1 | 22100213 | 1610612766 | CHA | Charlotte | 202397 | Ish Smith | Ish | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 2 | 22100213 | 1610612766 | CHA | Charlotte | 1630550 | JT Thor | JT | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 3 | 22100214 | 1610612754 | IND | Indiana | 1629052 | Oshae Brissett | Oshae | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 4 | 22100214 | 1610612754 | IND | Indiana | 202954 | Brad Wanamaker | Brad | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 626106 | 21200003 | 1610612747 | LAL | Los Angeles | 203135 | Robert Sacre | NaN | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 626107 | 21200001 | 1610612764 | WAS | Washington | 201858 | Cartier Martin | NaN | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 626108 | 21200001 | 1610612739 | CLE | Cleveland | 201956 | Omri Casspi | NaN | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 626109 | 21200001 | 1610612739 | CLE | Cleveland | 202720 | Jon Leuer | NaN | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 626110 | 21200001 | 1610612739 | CLE | Cleveland | 202396 | Samardo Samuels | NaN | NaN | DNP - Coach's Decision | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
626111 rows × 30 columns
This dataset has a stat line for every player in every NBA game since 2004.
So let's build a model to predict the number of points a player will score in a game. To keep things simple, we'll fit a linear regression with a single feature: the number of baskets they made.
In order to do that, we'll have to do a bit of data cleaning. If a player didn't make a basket or score a point (like the "DNP" rows above), they'll have a null value in the respective column, and our regression will crash. Let's just drop those rows for now:
df.dropna(inplace=True, subset=["FGM", "PTS"])
df
| | GAME_ID | TEAM_ID | TEAM_ABBR | TEAM_CITY | PLAYER_ID | PLAYER_NAME | PLAYER_NICKNAME | START_POSITION | COMMENTS | MINUTES_PLAYED | ... | DREB | REB | AST | STL | BLK | TURNOVERS | PF | PTS | PLUS_MINUS | PREDICTED_POINTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | 22100213 | 1610612764 | WAS | Washington | 203484 | Kentavious Caldwell-Pope | Kentavious | F | NaN | 27:41 | ... | 5.0 | 6.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 2.0 | 3 |
| 48 | 22100213 | 1610612764 | WAS | Washington | 1628398 | Kyle Kuzma | Kyle | F | NaN | 30:28 | ... | 4.0 | 5.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | 5.0 | -14.0 | 6 |
| 49 | 22100213 | 1610612764 | WAS | Washington | 1629655 | Daniel Gafford | Daniel | C | NaN | 24:21 | ... | 7.0 | 9.0 | 1.0 | 2.0 | 1.0 | 1.0 | 4.0 | 20.0 | -2.0 | 24 |
| 50 | 22100213 | 1610612764 | WAS | Washington | 203078 | Bradley Beal | Bradley | G | NaN | 35:07 | ... | 3.0 | 3.0 | 7.0 | 2.0 | 0.0 | 2.0 | 3.0 | 24.0 | -9.0 | 24 |
| 51 | 22100213 | 1610612764 | WAS | Washington | 203915 | Spencer Dinwiddie | Spencer | G | NaN | 28:34 | ... | 3.0 | 3.0 | 2.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | -5.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 625953 | 11200005 | 1610612743 | DEN | Denver | 202706 | Jordan Hamilton | NaN | NaN | NaN | 19 | ... | 2.0 | 2.0 | 0.0 | 2.0 | 0.0 | 1.0 | 3.0 | 17.0 | NaN | 11 |
| 625954 | 11200005 | 1610612743 | DEN | Denver | 202702 | Kenneth Faried | NaN | NaN | NaN | 23 | ... | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 3.0 | 18.0 | NaN | 18 |
| 625955 | 11200005 | 1610612743 | DEN | Denver | 201585 | Kosta Koufos | NaN | NaN | NaN | 15 | ... | 5.0 | 8.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 6.0 | NaN | 8 |
| 625956 | 11200005 | 1610612743 | DEN | Denver | 202389 | Timofey Mozgov | NaN | NaN | NaN | 19 | ... | 2.0 | 3.0 | 1.0 | 0.0 | 0.0 | 4.0 | 2.0 | 2.0 | NaN | 3 |
| 625957 | 11200005 | 1610612743 | DEN | Denver | 201951 | Ty Lawson | NaN | NaN | NaN | 27 | ... | 2.0 | 2.0 | 6.0 | 2.0 | 0.0 | 6.0 | 1.0 | 8.0 | NaN | 8 |
523751 rows × 30 columns
Over 500,000 rows left! FGM, or field goals made, is the number of baskets a player made, and PTS, or points, is the number of points they scored.
So let's build our model:
from sklearn.linear_model import LinearRegression
l = LinearRegression()
l.fit(df[["FGM"]].values, df["PTS"])
LinearRegression()
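As a quick sanity check, let's peek at what it learned. On this data the slope works out to roughly 2.56 points per basket (threes and free throws push it above 2), with an intercept near 0.5:
# The slope and intercept of the fitted line
l.coef_[0], l.intercept_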
OK, now we've got a trained scikit-learn linear regression! Let's graph it to see how we did:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(df["FGM"], df["PTS"])
plt.plot(df["FGM"], l.predict(df[["FGM"]].values), c="g")
[<matplotlib.lines.Line2D at 0x7f99b3b68820>]
The blue dots are our dataset, and the green line is our model. Looking good! This model is ready to deploy.
It's time to deploy 🚀
Every model needs a little input sanitization, data cleaning, or feature encoding. Ours is no different: we're going to have to handle the null case explicitly, because our linear regression isn't expecting null values.
To do that, we'll write a little Python function that checks for nulls and then calls the model. We could also use this function to scale the data, call out to a feature encoder, or any other business logic we might need.
Our little model only expects one input: the number of baskets made. That'll be the argument to our function.
def predictPoints(baskets_made):
    # Handle the null case explicitly
    if baskets_made is None:
        return 0
    # Call l, our linear regression
    return l.predict([[baskets_made]])[0]
predictPoints(5)
13.285483450179342
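And the null case short-circuits to 0, just as we designed it:
predictPoints(None)
0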
Looks like it's working! Let's deploy it. As promised, only one line of code:
mb.deploy(predictPoints)
Deployment "predictPoints" will be ready in a few seconds!
Huzzah! This link takes us to Modelbit, where we can see the code that was deployed, as well as the logs, model version information, and Python environment. (Since we didn’t specify one, we got the default Python 3.9 environment.)
We've also got the URL where the model lives! (Astute readers will note that my co-founder, Tom, named our workspace "Tom's House." 🙄)
Anyway, let's test the deployed model!
import requests
response = requests.post(
    "https://api.modelbit.com/api/d/toms_house/predictPoints/latest",
    json={"data": [[1, 5], [2, 10], [3, None]]}
)
response.json()
{'data': [[1, 13.285483450179342], [2, 26.069509622647892], [3, 0]]}
Here we called it in batch, asking for three predictions. Each inner list is an ID followed by the model's input, and each ID comes back paired with its prediction. They all worked, including the explicitly handled null case!
But what if we want to call this from our data warehouse? After all, our predictions might be lead scores or customer health scores. Those are best written back to the database...
Calling prediction models from your Warehouse ❄️
If we head back to Modelbit and click the "Snowflake" tab, we get some sample code for Snowflake. You'll see we also support other warehouse types, as well as deploying the code directly into Snowflake itself. For this example, though, let's keep it simple.
If we copy that sample code into Snowflake and run it, we get a SQL function that can call our model directly from the warehouse!
In fact, we can update the whole table that way! Remember, the column with the number of baskets a player made is called FGM. So let's go ahead and update the table:
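The statement looks something like this. (The function and table names here are assumptions for illustration; the sample code in your Snowflake tab shows the real ones.)
UPDATE nba_games SET predicted_points = predictpoints(fgm);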
That statement called our model 626,111 times, which took a little less than 16 seconds. So we could call out for predictions as part of building a dbt model directly in the warehouse, or call the model every time a new row is inserted into the table. This way, our table would always have the predictions living alongside the data!
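For instance, a dbt model that scores every row during the build might look something like this sketch (the model and source names are hypothetical):
-- models/nba_games_scored.sql: a hypothetical dbt model that scores
-- each row with our deployed function
select
  *,
  predictpoints(fgm) as predicted_points
from {{ source('nba', 'games') }}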
Obviously we're pretty excited about this. We hope you find it useful. If you have questions, or feedback, or just want to chat data science, please don't hesitate to reach out. See you around.