First Impressions: Snowpark Python
Snowpark is Snowflake's general-purpose compute environment for running arbitrary user code right inside your data warehouse. Until now, Spark environments were better suited to general-compute tasks like data pipelines and machine learning workflows; Snowpark is Snowflake's response.
Snowpark's support for Python is currently in private preview. The Modelbit team has been part of that private preview, testing out new versions of the Python support and getting our feet wet. Snowpark Python allows Modelbit customers to deploy machine learning models right into Snowflake, where they run as UDFs in the customer's own Snowflake compute environment!
Having tested it out for a couple months now, we wanted to share our early findings.
The excitement
So much of Snowflake's Python support in Snowpark is a game changer for data scientists. Here's what we love the most:
Built for Jupyter Notebooks
The API is clearly intended to be used in Jupyter notebooks, where data scientists live and breathe. The API is still changing in the private preview — if they're anything like us, they'll be tinkering up until launch day! 😉 — so we won't document it here. But it's clearly intended to be used from a notebook, making it a first-class citizen of the data science world.
Native DataFrame support
The API works with DataFrames as first-class citizens, just like data scientists do. The fact that Snowflake DataFrames are distinct from Pandas DataFrames is a bit confusing. But converting back and forth is a single API call. This lets data scientists work with a lot more data in their DataFrames than they're used to in their notebooks. The best part? They'll do that right from their notebooks!
Oh, and saving a DataFrame back as a table in Snowflake with a single API call? Slick. 😎
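To make that concrete, here's a rough sketch of the round trip, based on the Snowpark Python API as we've seen it in the preview. The table names and connection parameters are placeholders, and the details may shift before launch:
{%CODE python%}
from snowflake.snowpark import Session

# Placeholder connection parameters -- fill in your own account details.
session = Session.builder.configs({
    "account": "...", "user": "...", "password": "...",
    "warehouse": "...", "database": "...", "schema": "...",
}).create()

# Work with a table as a Snowpark DataFrame -- the computation stays in the warehouse.
snowpark_df = session.table("RAW_EVENTS").filter("EVENT_TYPE = 'purchase'")

# Pull the results into a Pandas DataFrame for local analysis in the notebook.
pandas_df = snowpark_df.to_pandas()

# ...and save a DataFrame back to Snowflake as a table in one call.
session.create_dataframe(pandas_df).write.save_as_table("PURCHASE_EVENTS")
{%/CODE%}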
Python decorators for deployment
Python's decorators are clever syntactic sugar for higher-order functions. They look like this:
{%CODE python%}
def do_twice(func):
    def inner():
        func()
        func()
    return inner

@do_twice  # Look, ma, my first decorator!
def print_hello():
    print("hello!")

print_hello()
{%/CODE%}
Because we decorated print_hello with do_twice, calling print_hello runs it twice! Decorators are commonly used for everything from simple tasks like input checking to complex processes like lock synchronization.
Snowpark Python has a fresh use for decorators: Shipping Python code to your warehouse! A simple @udf decorator on top of a function will ensure that function is deployed to your warehouse and executed there. Very cool.
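As a rough illustration, and assuming the session from the earlier sketch, deploying a toy scoring function looks something like this. The function name and registration parameters are ours, and the preview API may differ:
{%CODE python%}
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

# Decorating a plain Python function registers it in the warehouse as a UDF.
# "predict_score" is a hypothetical name; a real model would replace the math.
@udf(name="predict_score", return_type=FloatType(), input_types=[FloatType()], replace=True)
def predict_score(x: float) -> float:
    return x * 0.42
{%/CODE%}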
The opportunities
As with everything so new and potentially groundbreaking, the opportunities for expansion and improvement seem limitless! Here's where we see the biggest near-term opportunities:
Production-grade tooling
As the warehouse-based compute environment is brand new, there's a huge opportunity to bring the tooling up to par with what we're used to for web, mobile and PC development. Logging, network access, CI/CD, and the ability to test code in a production-like environment are all trails that have been blazed in software engineering. They let engineers know their code will work before deploying to users. Data scientists will need that same confidence.
We'll be interested to see whether Snowflake takes the approach of building all of this themselves, or opens up the environment enough to let startups compete here.
Libraries and dependencies
Snowflake protects you (and itself) from dangerous code by whitelisting certain conda packages that you can import and use. But you can of course ship as much of your own code as you like! This may lead to users getting around the package whitelist by vendoring the desired packages into their own codebase. We may even see companies rebuild dependency management by bundling all the code and dependencies into a big fat ball of Python, and then shipping it all as a single Python UDF!
There's a balance to be struck between safety and developer friendliness. We'd guess that Snowflake will be threading this needle for a long time to come. It'll be fascinating to see what tack they take here.
The future: Arbitrary compute
Snowpark Python is one of those features that seems obvious as soon as you touch it, and immediately unlocks possibilities for future directions. At its core, it's a notebook-friendly Python API for compiling and executing SQL statements, and for deploying Python UDFs. You might write a Python UDF, deploy it, and then write more Python to generate SQL to call the UDF.
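Continuing the sketches above (and assuming the hypothetical predict_score UDF and session from earlier), that loop looks roughly like:
{%CODE python%}
# Python generates SQL that calls the deployed UDF; the warehouse runs it
# row by row, and the results come back as a DataFrame.
scores = session.sql(
    "SELECT predict_score(FEATURE_VALUE) AS SCORE FROM RAW_EVENTS"
).to_pandas()
{%/CODE%}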
This means your in-warehouse Python is limited to code that a SQL statement can invoke on a single row of data at a time. Obviously, this is a big limitation compared to arbitrary compute environments like Spark and Spark-based platforms like Databricks.
The next strategic direction for Snowflake? Flip the script. Provide compute environments that can run Python and call your SQL environment, rather than be called by it. That ML model you're developing in the notebook? What if you didn't just deploy it to Snowpark at the end? What if you trained it in Snowpark? What if you ran the whole notebook in Snowpark? What if the notebook was hosted in Snowpark?
Achieving this would mean rethinking the relationship between compute and data inside Snowflake itself. Will they do it? We have no idea. But we wouldn't bet against it.
One more thing...
Are you a Snowpark Python Private Preview user? Reach out to start deploying your machine learning models directly to Snowflake Python UDFs!
Are you planning on using Snowpark Python once it launches? Reach out for a demo of how it'll all work for you, end-to-end!