Enhancing Healthcare Revenue Forecasting with DataFrames.jl and MemPool.jl: A Case Study
Introduction
The Great Lakes Consulting team collaborated with a major healthcare client to develop a Revenue Forecasting application using the Julia framework. This application allows in-memory processing of large healthcare claims datasets, enabling real-time scenario modeling and quick analysis.
Recently, our team upgraded the application, replacing JuliaDB.jl with DataFrames.jl. During this project, JuliaDB.jl `NDSparse` objects were replaced with a custom `DCTable` object. This new custom object wraps a `DataFrame` and manages disk caching with MemPool.jl. High-level design decisions and lessons learned are detailed in this blog post.
As a bonus, our Healthcare client is looking to hire new developers to continue development and support of the Net Revenue Forecasting application and its Julia framework. There are two (2) positions posted on their Jobs Portal. Please consider applying if you are interested.
Julia Developer Specialist (REMOTE)
Senior Applications Programmer / Analyst (REMOTE)
Background
In 2018, the GLCS team developed a Net Revenue and Mid-Month Forecasting application (NRF) for one of the largest not-for-profit healthcare systems in the U.S. This application is part of a suite of revenue tools that provide A/R Valuation, Revenue Forecasting, Analytics, and Reporting. It helps the health system create accurate revenue forecasts for over 200 hospitals, continuing care facilities, and urgent care locations nationwide.
The most recent generation of the NRF application leverages a services-based architecture powered by Julia. The platform facilitates in-memory processing for very large multi-dimensional datasets (8 GB – 12 GB) to enable real-time scenario modeling for users. Core data manipulation was originally performed with JuliaDB.jl.
When the NRF application was developed in 2018, JuliaDB.jl was the best solution for:
Loading multi-dimensional datasets quickly.
Indexing data and performing FILTER, AGGREGATE, SORT and JOIN operations.
Saving results and loading them back efficiently.
Leveraging Julia's built-in parallelism to fully utilize available resources.
Now for some bad news. The last official release of this package was August 3, 2020 (v0.13.1). JuliaDB.jl is effectively abandoned and no longer receiving support or maintenance updates. The developers recommended DTables.jl and DataFrames.jl as the preferred alternatives to JuliaDB.jl.
Earlier this year, the GLCS team set out to upgrade the NRF application and replace JuliaDB.jl components within the framework. The first design leveraged DTables.jl, an abstraction layer on top of Dagger.jl that allows for manipulation of table-like structures in a distributed environment.
Issues Encountered
The initial builds showed some promise. In general, query and processing performance improved. However, when deployed to the test environment (which mimicked the production server configuration) with real-time user interactions and very large datasets, individual processes occasionally failed or hung without explanation. Under the constraints of that environment, a C-library function call within Dagger.jl, a dependency of DTables.jl, failed when scheduling parallel tasks. Additionally, the new framework had excessive memory (RAM) demands, which exhausted available resources and resulted in poor performance.
The initial versions were less stable for single-threaded, multi-process applications like NRF. The GLCS team reported several issues to the maintainers of DTables.jl and Dagger.jl to enhance their stability for applications like NRF. The resulting bug fixes and improvements increased effectiveness and efficiency. Local testing ran smoothly, but the application would still occasionally hang or crash when deployed to the official test environment.
Root Cause Analysis
DTables.jl does not allow for modifying data in-place. As a result, data often had to be copied to another structure, modified, and then stored again using DTables.jl. This frequent copying of very large datasets led to high memory usage and slow performance.
Despite working closely with the maintainers of DTables.jl and Dagger.jl, NRF still faced occasional issues in the official test environment that did not occur in personal test setups. This suggests an incompatibility between Dagger.jl and the operating system used in the official test environment (RHEL 7.9).
Solution
Because the only feature NRF needed from DTables.jl was disk caching (i.e., being able to swap data seamlessly from memory to disk, and vice versa), the team dropped DTables.jl and instead used MemPool.jl. The team replaced JuliaDB.jl `NDSparse` objects with custom `DCTable` objects (DC meaning disk-cacheable). The new custom object wraps a DataFrames.jl `DataFrame` object and manages disk caching with MemPool.jl. In code, a `DCTable` is used the same way one would use a `DataFrame`. The difference is that whenever a `DCTable` is used, the underlying `DataFrame` must first be grabbed from MemPool.jl, fetching it from disk if it has been cached there. The `DataFrame` is then used as normal, and the result of the operation (if it is a `DataFrame`) is wrapped in a new `DCTable` to be managed by MemPool.jl.
"""
DCTable --> Disk-cacheable table (`DataFrame`).
Disk caching is managed via MemPool.jl, and a `DCTable` stores a `DRef`
returned by calling `poolset` on the `DataFrame` to store.
The `DataFrame` can be retrieved by calling `fetch` on a `DCTable`
(which calls `poolget` on the stored `DRef`).
"""
struct DCTable
ref::DRef
DCTable(df::DataFrame) = new(poolset(df; size = nbytes(df)))
end
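As a minimal sketch of the unwrap-operate-rewrap pattern described above (the `dc_filter` helper and the `fetch` method are illustrative, not the production API):

```julia
using DataFrames
using MemPool: poolget

# Retrieve the cached DataFrame, pulling it back from disk if
# MemPool.jl has swapped it out of memory.
Base.fetch(t::DCTable) = poolget(t.ref)::DataFrame

# Illustrative pattern: unwrap, operate with ordinary DataFrames.jl
# calls, then wrap the resulting DataFrame in a new disk-cacheable table.
function dc_filter(pred, t::DCTable)
    df = fetch(t)              # may hit disk
    result = filter(pred, df)  # plain DataFrames.jl operation
    return DCTable(result)     # new DRef managed by MemPool.jl
end
```

Any DataFrames.jl operation can be wrapped this way, which is how the team achieved feature parity with the old JuliaDB.jl back-end.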
The `DCTable` objects resolved the issues experienced with DTables.jl. First, because operating on a `DCTable` works on the underlying `DataFrame`, all the benefits of using `DataFrame`s apply, including being able to modify data in-place. As a result, using `DCTable`s requires much less memory. And second, relying on MemPool.jl drops the complex task-scheduling logic of Dagger.jl that seemed to be incompatible with the official NRF test environment.
With `DCTable` objects on hand, the team was able to replace all JuliaDB.jl function calls with equivalent operations on `DCTable`/`DataFrame` objects, ensuring feature parity between the new and old back-ends. Tests were developed to confirm that results from the new back-end were consistent with the old one, and benchmarks were created to compare the performance of the two. Next, the team deployed the new codebase to the official test environment. Finally, after extensive internal testing and user-acceptance testing, the new codebase was officially deployed to the production environment.
Follow-up Opportunities
Are you a Julia programmer looking for a new opportunity? Our client is seeking additional help to continue developing and supporting the Net Revenue Forecasting application and its Julia framework. They have posted two positions on their Jobs Portal. Links to those positions are below. Please consider applying if you are interested.