Hex is a platform for collaborative analytics and data science. Users can connect to data, write SQL and Python in a collaborative notebook interface, and build interactive data apps to share with anyone. Modelbit allows data scientists to deploy machine learning models directly from their data platforms. When used together, Hex and Modelbit enable data scientists to build machine learning models directly in Hex, and then deploy them to production with just one line of code using Modelbit.
In this blog post, we'll create a simple classification model in Hex that predicts which position an NBA player started at in a game, based on their box-score stats. Once trained, we'll deploy our model to Modelbit so that we can make predictions on some unseen data. Without further ado, let's get started!
We'll first import the packages required for this project and log in to Modelbit by calling "modelbit.login()", keeping the returned client as "mb" for later. Here's what that looks like in Hex:
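If you're following along outside Hex, a minimal version of that setup cell might look like the sketch below; the package list is inferred from the code used later in this post.

{%CODE python%}
# Packages used throughout this post
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Authenticate with Modelbit; "mb" is the client we'll deploy with later
import modelbit
mb = modelbit.login()
{%/CODE%}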
To pull our data into Hex, we'll import Hex's Snowflake data connection that contains our NBA stats data, then write a quick query to return all of the rows. You'll notice that in Hex, the table is automatically returned to us as a DataFrame and displayed.
Since we're trying to predict a player's "start_position", we've intentionally removed any rows where "start_position" is null.
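Assuming the query result landed in a DataFrame named nba_stats (the same name used in the training code below), that filter is a one-liner in pandas:

{%CODE python%}
# Keep only the rows where our target, START_POSITION, is present
nba_stats = nba_stats[nba_stats["START_POSITION"].notnull()]
{%/CODE%}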
We already know our target, so let's select some features from our dataset that might help us make predictions. Specifically, we'll look at:
- How many three-point shots a player attempted: "FG3A"
- How many points they scored: "PTS"
- How many rebounds they got: "REB"
- How many shots they blocked: "BLK"
- How many turnovers they committed: "TURNOVERS"
Having selected our features, we have one more step to complete before we can start training: encoding our target column. Machine learning models don't often play well with raw text, so we need a way to transform text labels into numbers. For this, we've chosen a label encoder. Label encoders work by assigning a number to each unique label, starting at 0 and increasing by one until the last label is assigned a number. For example, we'd transform ['A', 'B', 'C', 'A'] into [0, 1, 2, 0]. This format is both smaller in size and much easier for computers to manipulate.
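As a quick illustration (separate from the model code), scikit-learn's LabelEncoder does exactly this, and can also reverse the mapping:

{%CODE python%}
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
print(encoder.fit_transform(["A", "B", "C", "A"]))  # [0 1 2 0]
print(encoder.inverse_transform([0, 1, 2, 0]))      # ['A' 'B' 'C' 'A']
{%/CODE%}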
The model we've chosen is XGBoost's XGBClassifier, a gradient-boosted tree classifier with built-in regularization that helps it avoid overfitting. In other words, this model learns to classify targets while being less prone to becoming overly complex and fitting unintentional artifacts of the training data. Here's the code implementation of these steps:
{%CODE python%}
# Label encoder so that we can encode our target
position_encoder = LabelEncoder()

# Select our features and target
X = nba_stats[["FG3A", "PTS", "REB", "BLK", "TURNOVERS"]]
y = position_encoder.fit_transform(nba_stats["START_POSITION"])

# Split our data into a train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Train the model
position_guesser = XGBClassifier(verbosity=0, silent=True)
position_guesser.fit(X_train, y_train)

# Evaluate model accuracy on the held-out test set
score = accuracy_score(y_test, position_guesser.predict(X_test))
{%/CODE%}
And we can see the code and its output as Hex cells!
Looks like we got a 62% accuracy score when evaluated on the testing data! Not too bad, so let's ship this model to production 🚀. To deploy our model, we need to bundle this logic into a function that we can send to Modelbit. We'll start by writing a Python function that does the inference. Here's the code of the function:
{%CODE python%}
from typing import Optional
from pandas import DataFrame

def infer_position(fg3a: int, pts: int, reb: int, blk: int, turnovers: int) -> Optional[str]:
    '''Takes our features as arguments, converts them into a DataFrame, and returns the predicted position'''
    # Guard against missing inputs
    if fg3a is None or pts is None or reb is None or blk is None or turnovers is None:
        return None
    # Build a single-row DataFrame with the same column names used in training
    input_df = DataFrame({
        "FG3A": [fg3a],
        "PTS": [pts],
        "REB": [reb],
        "BLK": [blk],
        "TURNOVERS": [turnovers]
    })
    # Predict the encoded position, then decode it back to the original label
    inferred_position = position_guesser.predict(input_df)
    return position_encoder.inverse_transform(inferred_position)[0]
{%/CODE%}
And here's what it looks like in Hex:
This function takes our features as its arguments and uses them to create a pandas DataFrame. Then, with our newly trained model, the function predicts, decodes, and returns a player's position. Let's see it in action.
A player who takes 15 three-pointers, scores 30 points, grabs only one rebound, blocks no shots, and commits 5 turnovers is probably a Guard, whereas a player with no three-point attempts or turnovers, but 3 points, 10 rebounds, and 4 blocked shots is probably a Center. Makes sense!
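Reproducing those two checks in code looks like the following; the exact label strings returned (for example, 'G' and 'C') depend on the values in the START_POSITION column.

{%CODE python%}
# A high-volume shooter with few rebounds or blocks: likely a guard
print(infer_position(fg3a=15, pts=30, reb=1, blk=0, turnovers=5))

# A low-scoring rebounder with several blocks: likely a center
print(infer_position(fg3a=0, pts=3, reb=10, blk=4, turnovers=0))
{%/CODE%}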
Now that we have a model that predicts position with decent accuracy, we can call "mb.deploy" to deploy our inference function to Modelbit. Here's that line of code. Notice that we specify a custom Python production environment to match the Hex development environment!
{%CODE python%}
mb.deploy(infer_position, python_packages=["scikit-learn==1.1.2", "pandasql==0.7.3", "xgboost==0.81", "pandas==1.4.2"], python_version="3.9")
{%/CODE%}
Let's see it run:
Modelbit will capture the function code and its dependencies (the encoder and the model itself) and ship them all to the cloud. The remarkable thing is that models deployed to Modelbit can be called from anywhere, even directly from Snowflake.
If we click the link in the deploy response, we’ll head to Modelbit, where we can see the model that was captured, its dependencies, and examples of how to call it via a REST API as well as from Snowflake:
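For instance, calling the REST endpoint from Python might look roughly like the sketch below. The exact URL comes from your Modelbit deployment page, so treat the one here as a placeholder, and the request body follows Modelbit's batch convention of a row ID followed by the function's arguments.

{%CODE python%}
import requests

# Placeholder URL: copy the real endpoint from your Modelbit deployment page
url = "https://<your-workspace>.app.modelbit.com/v1/infer_position/latest"

# Each inner list is one inference: a row ID followed by the function's arguments
response = requests.post(url, json={"data": [[1, 15, 30, 1, 0, 5]]})
print(response.json())
{%/CODE%}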
Let's test it out in Snowflake. Modelbit gives us a handy SQL function we can call to get our inferences: "ext_infer_position_1()". We can run a simple query using that function and pass in our predictive features, which could also be passed directly from another query or Snowflake table. As we can see, we’re getting the same predictions in Snowflake that we got moments ago in Python.
Now the XGBoost model that we built and trained moments ago in Hex is available to use right in Snowflake at effectively limitless cloud scale. This is especially useful for running a prediction every time a new row is inserted in a table, every time a dbt model is built, or when building views on top of a raw table.
With just one update call, we made and stored over 600,000 predictions in less than 11 seconds, without having to write any extra code or provision any cloud infrastructure or GPUs.
That's the power of Hex and Modelbit. Exploring data and creating a machine learning model in Hex takes just a few minutes, and deploying it to Modelbit means that in just a few seconds, you can be making batch predictions right in your database with simple, straightforward SQL commands. No GPUs, no mucking around with servers or lambda functions, just quick and easy predictions in the cloud.