Data helping guide decisions in COVID-19 response | Hennepin County
Established an analytics infrastructure to monitor
Number of cases, hospitalizations and deaths
Key county response activities (vaccine administration, PPE delivery, small business support, etc.)
"Official statistics" - referencing ACS / census data here?
Had an inkling that graphs could help us with this. In early 2020 we piloted a few graph technologies
CosmosDB (Microsoft shop)
TigerGraph
Internuntius consulting
Partnership with Carlson Analytics Lab (University of Minnesota)
Successfully modelled interaction between SNAP (food stamp) recipients and community demographics to inform the placement of food shelves.
With these promising results we began to work on a database to collect metrics on the impact of COVID in our community.
It was obvious to everyone that the situation with COVID was changing quickly. Day to day changes in data availability are hard to handle for any organization, and we don't have the most mature tech stack in the world. We knew we needed to follow an iterative process with fast cycles.
Step 1) Implement a new idea, like adding new data, a new summary measure, or improved functionality to the user-facing dashboard
Step 2) Gather feedback on our implementation, and identify gaps alongside subject matter experts
Step 3) Use that feedback to fix issues or plan future improvements
As the project came together, these iterations got closer and closer together. The schemaless nature of the graph storage was a key factor in increasing our development speed.
Here's a look at our final product: an interactive dashboard built in Power BI, using a Neo4j database as its main back-end data store.
This was a win in multiple ways.
First, it confirms our ability to write Cypher queries efficient enough, in both execution time and storage space, to support these types of reports (two resources an interactive dashboard has in short supply).
Second, it showcases our ability to aggregate geography and date hierarchies at any desired summary level.
On the right side, we show a monthly indicator at city-level geographies, but that same rollup can be done for quarters or years, for the full county or for commissioner districts, simply by attaching relationships to those "dimensional" nodes.
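As a sketch of what that rollup looks like in Cypher (the node labels, relationship types, and property names here are illustrative, not our exact schema):

```cypher
// Monthly indicator values rolled up to the city level.
// Swapping :City for :CommissionerDistrict, or :Month for :Quarter,
// changes the summary level without restructuring any tables.
MATCH (v:IndicatorValue)-[:FOR_GEOGRAPHY]->(c:City),
      (v)-[:FOR_PERIOD]->(m:Month)
WHERE v.indicator = 'covid_cases'
RETURN c.name AS city, m.name AS month, sum(v.value) AS total
ORDER BY city, month
```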
One of the first benefits we realized with our COVID graph DB is that a storage model that closely follows the logical structure of the data is easier to use. Development is faster, and technical details are easier to explain to business partners. Questions like "can we summarize this at the county level?" or "can I see all the housing measures for this city?" become a simple modification to the query rather than a complicated join or, worse, a complete rewrite of tables to fit unexpected schemas.
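For example, a question like "all the housing measures for this city" becomes a one-clause pattern change rather than a new join (again, labels and names are illustrative):

```cypher
// All housing-category measures attached to a single city.
MATCH (c:City {name: 'Minneapolis'})<-[:FOR_GEOGRAPHY]-(v:IndicatorValue)
      -[:IN_CATEGORY]->(:Category {name: 'Housing'})
RETURN v.indicator AS measure, v.value AS value
```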
1) Graph relationships - Easier to communicate the capabilities and limitations of each data point
https://commons.wikimedia.org/wiki/File:Jenga_distorted.jpg
We said earlier that the volatile nature of the problem, especially early in the pandemic, impressed upon us the need to iterate rapidly. The schemaless data storage of the graph db was instrumental in this.
We have some measures that have been around for years, some that became available just as the first COVID cases appeared in Minnesota, and others that weren't available for months after.
As the pandemic wound down, some indicators stopped reporting. Others changed summary levels or collection methodology.
Since we weren't pinned down by a particular table schema, we could quickly add new indicators, remove old ones, and modify relationships to keep ourselves on track.
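In practice, adding a new indicator or retiring an old one is a couple of Cypher statements rather than a schema migration (indicator and label names below are hypothetical):

```cypher
// Add a newly available indicator and attach it to the level it reports at.
MERGE (i:Indicator {name: 'vaccine_doses_administered'})
MERGE (g:GeographyLevel {name: 'zip_code'})
MERGE (i)-[:REPORTED_AT]->(g)
```

```cypher
// Retire an indicator that stopped reporting, relationships included.
MATCH (i:Indicator {name: 'discontinued_measure'})
DETACH DELETE i
```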
Cypher / APOC library to load CSVs from "import" folder
Shifted to Python scripts running on the database's server, executing Cypher via the transactional API
Today we run Python scripts remotely, loading data using the Neo4j Python driver
Soon: moving those scripts to the cloud (Databricks on Azure), integrating more closely with existing data pipelines and cloud storage
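A minimal sketch of the current remote-loading approach with the official `neo4j` Python driver; the Cypher query, CSV layout, and connection details are placeholders, not our production pipeline:

```python
import csv
from itertools import islice

# Cypher executed once per batch; UNWIND turns the parameter list into rows.
LOAD_QUERY = """
UNWIND $rows AS row
MERGE (v:IndicatorValue {indicator: row.indicator, period: row.period})
SET v.value = toFloat(row.value)
"""

def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def load_csv(driver, path, batch_size=1000):
    """Stream a CSV into Neo4j, one write transaction per batch."""
    with open(path, newline="") as f, driver.session() as session:
        for rows in batched(csv.DictReader(f), batch_size):
            session.execute_write(
                lambda tx, rows=rows: tx.run(LOAD_QUERY, rows=rows)
            )

# Usage (assumes a running Neo4j instance and real credentials):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "..."))
# load_csv(driver, "indicators.csv")
```

Batching keeps transactions small, which matters when a daily file has hundreds of thousands of rows.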