By Eric Schrock, CTO & Ryan Boch, Senior Software Engineer
OM1™ is a leader in building deep clinical Real World Data (RWD) datasets to better understand patient journeys and outcomes. We use a combination of data sources, including EHR data, to create billions of data points for more than 330 million patients. However, EHRs primarily support operations and billing, so most deep clinical information remains in unstructured clinical notes. For example, you can tell from standard structured medical records that a patient was diagnosed with psoriasis. But you won't know what part of the body is affected, what symptoms the patient may be experiencing, or whether their condition has improved or worsened.
At OM1™, we use machine learning to process the raw text of hundreds of millions of doctors’ notes, patient records, and medical histories into novel insights. Over the last decade, advancements in machine learning have helped usher in a new era of clinical insights. We can now analyze vast amounts of unstructured data in ways no human can, identifying correlations between symptoms and outcomes that might elude even the most seasoned clinicians. The resulting data provides unprecedented insight into real-world patient journeys, bridging evidence gaps from bench to bedside.
These insights lead to increased medical treatment effectiveness, improved access to care, and personalized medical insights that predict patient trajectories and inform clinical decision-making. Our innovations are helping improve health outcomes and save lives every day.
At OM1™, we run large-scale data pipelines that process structured and unstructured data. These pipelines first standardize and organize the raw information, placing the raw text of doctors' notes, patient records, and other unstructured text-based data into a standard data store.
When we look at clinical notes, we have found that off-the-shelf natural language processing (NLP) models, even ones trained on clinical text, don’t fare particularly well. They struggle because clinicians often don’t write in “natural language,” using abbreviations, esoteric shorthand, and semi-structured patterns that vary across institutions and providers.
We use a collection of proprietary models and approaches to understand the semantic context of clinical text so we can extract and estimate structured clinical concepts, such as patient symptoms, disease severity, and patient outcomes.
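The models themselves are proprietary, but conceptually the output of this step can be pictured as structured records derived from free text. The sketch below is purely illustrative: the class, field names, and keyword matching are ours for this post, not the production system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedConcept:
    """One structured clinical concept estimated from free-text notes.
    Field names are illustrative, not OM1's actual schema."""
    patient_id: str
    concept: str                       # e.g. "psoriasis_severity"
    value: str                         # e.g. "moderate"
    confidence: float                  # model confidence in [0, 1]
    source_span: Optional[str] = None  # note snippet the estimate came from


def extract_concepts(patient_id: str, note_text: str) -> list[ExtractedConcept]:
    """Toy stand-in for the proprietary extraction models: a real pipeline
    uses trained NLP models, not keyword matching."""
    lowered = note_text.lower()
    concepts: list[ExtractedConcept] = []
    if "psoriasis" in lowered:
        severity = "moderate" if "moderate" in lowered else "unspecified"
        concepts.append(ExtractedConcept(
            patient_id=patient_id,
            concept="psoriasis_severity",
            value=severity,
            confidence=0.5,  # placeholder confidence
        ))
    return concepts


# Example: extract_concepts("patient-123", "Moderate plaque psoriasis on elbows")
```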
This approach creates a powerful engine for understanding deep clinical narratives.
Our data pipelines are built primarily with dbt on top of Snowflake. Historically, when we needed to run non-SQL code, we had to build, deploy, and operate multiple compute environments that could run Python or other code. We made this work, but it came with additional costs.
These costs were consuming precious engineering bandwidth and dragging down productivity. We knew we wanted a more modern approach and were delighted when we learned how seamlessly Modelbit could operate within our existing Snowflake infrastructure and processes.
As an early adopter of Snowflake and dbt, we wanted to keep our data and compute environments as simple as possible. Writing and deploying non-SQL code should be an enjoyable experience and work seamlessly with our infrastructure-as-code and continuous integration foundation.
After learning about Modelbit through the Snowflake ecosystem, we were able to rapidly prototype, evaluate, and deploy new models without any of the legacy overhead of managing third-party compute environments. Within a few months, we developed and deployed new models into production using Modelbit. Here is how things look today:
We have multiple ML development environments, but they all leverage notebook concepts for developing and organizing our code. Modelbit's ability to deploy from anywhere makes it easy to use in every development environment we rely on, from local Python to SaaS notebook environments. All production code must be reviewed and versioned in Git, and Modelbit's deep Git integration makes versioning code and model artifacts straightforward. By defining all our infrastructure as code in our Git repo, we can leverage critical components of our development lifecycle, such as peer code review, continuous integration testing, and separate development/stage/production branches.
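As a simple illustration of that deploy-from-anywhere workflow, the sketch below trains a toy model in a notebook and deploys it with Modelbit's Python API (`modelbit.login()` and `mb.deploy()`). The model, features, and names are illustrative, not one of our production deployments.

```python
import modelbit
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Authenticate from a notebook or local Python session.
mb = modelbit.login()

# Illustrative model: a toy classifier over two made-up features.
train = pd.DataFrame({
    "feature_a": [0.1, 0.7, 0.3, 0.9],
    "feature_b": [1.0, 0.2, 0.8, 0.1],
    "outcome":   [0, 1, 0, 1],
})
clf = LogisticRegression().fit(train[["feature_a", "feature_b"]], train["outcome"])


def predict_outcome(feature_a: float, feature_b: float) -> float:
    """Inference function that Modelbit wraps as a deployment."""
    return float(clf.predict_proba([[feature_a, feature_b]])[0][1])


# Deploying captures the function, its dependencies (clf), and the Python
# environment, and versions everything through the connected Git repository.
mb.deploy(predict_outcome)
```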
dbt is our primary data orchestration tool at OM1. Standardizing on SQL has made it easier to develop, test, and collaborate on data pipelines. However, invoking Python-based transformations and ML models has always been a challenge. Modelbit produces a SQL function in Snowflake for each model and handles naming, versioning, and data marshaling. This enables seamless integration of Python models into our dbt environment and gives us flexibility in structuring our pipelines. When there is a series of Python steps to execute, we can place them together in a single model or package each one separately. When packaged independently, we can use Snowflake and dbt to manage orchestration and maximize parallelism without burdening users.
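To make that integration concrete, the sketch below queries a hypothetical Modelbit-generated SQL function from Python using the Snowflake connector. The function name, table, and connection details are illustrative; in a real pipeline the same SELECT sits inside a dbt model and references upstream models instead of a raw table.

```python
import os
import snowflake.connector

# Connection parameters are placeholders; in practice they come from dbt
# profiles or secrets management rather than hard-coded values.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="ML",
)

# predict_outcome_latest is a hypothetical Modelbit-generated SQL function
# wrapping a deployed model; the real name and signature depend on the
# deployment's name and version.
query = """
    select
        patient_id,
        predict_outcome_latest(feature_a, feature_b) as predicted_outcome
    from analytics.ml.patient_features
"""

cur = conn.cursor()
try:
    cur.execute(query)
    for patient_id, predicted_outcome in cur.fetchall():
        print(patient_id, predicted_outcome)
finally:
    cur.close()
    conn.close()
```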
Snowflake is our primary data store at OM1. As Snowflake has expanded into broader computing capabilities, we have increasingly sought to keep processing within the Snowflake environment whenever possible. Snowpark provides Snowflake-native compute environments, but managing Python packages and deployments has been cumbersome. Modelbit can transparently deploy to Snowpark when models are compatible with the Snowpark runtime, allowing us to run Snowpark and non-Snowpark models through a common development and deployment framework. Everything is accessible through SQL functions, without consumers needing to manage the details.
Modelbit has helped mature our MLOps needs by providing a centralized registry of all our models, configurations, and versions. It provides one source of truth for what has been deployed and what is being used without having to comb through individual Git repositories or Snowflake logs. With Modelbit’s integrated observability, we can quickly see how models are executing and rapidly debug issues through centralized logs.
At OM1, we believe that personalized medicine is the way of the future, and AI is the path to get there. With access to large-scale, clinically rich Real-World Data, we are able to extract insights and predict outcomes that were previously thought impossible. These insights help the development and adoption of new therapies and inform clinical decision-making to improve patient health outcomes.
Rapidly developing and deploying new ML models with minimal overhead is critical to our success. What we have achieved so far came through the hard work of our engineering team, which developed and maintained custom tooling for executing Python models at scale. Modelbit alleviates that burden, giving us more bandwidth to focus on our strategic capabilities. More importantly, Modelbit provides a richer and more streamlined toolset so that we can iterate and deliver more quickly.
Tools like Modelbit are helping us experiment with new models faster and allowing our teams to spend more time doing what they do best: building transformational technologies for the healthcare industry.