This is the sixth in a series of posts chronicling my reflections on participating in the 2014 Data Science for Social Good Fellowship at the University of Chicago.
You can read my last post here:
To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.
Throughout the DSSG Fellowship, it’s been clear that my team is quite unique – unlike other groups, we were tasked with two separate projects with two different partners: Health Leads and The Chicago Alliance to End Homelessness.
However, after spending a few weeks wrestling with the challenges of context-switching between different projects, tangoing with multiple parties through different communication channels, and wading through raw and smelly data, our team decided to break up into two sub-groups that would each tackle a different project.
I gravitated toward the problems presented by Health Leads.
Now that it’s been just over six weeks since the Fellowship began, it seems worthwhile to assess what we’ve accomplished so far and what is still left to be done.
I’ve written briefly about Health Leads before, recounting how Rebecca Onie founded the organization upon discovering the hidden link between social services and debilitating health conditions. To briefly summarize Health Leads’ mission: many health clinic patients experience health concerns that are brought on more by social ills than medical ones. Asthma can be treated with medication, but not if the root cause is a mold-infested apartment.
Health Leads trains college students to work with patients referred by health service providers, identifying each patient’s needs and helping them acquire the resources that satisfy those needs.
Unfortunately, Health Leads is seeing a large number of their patients drop off. After one or two successful contacts, many patients stop returning phone calls. They don’t reply to emails and may live transitory lives, which makes direct mail a difficult channel for reaching them.
Our team is sifting through the interaction data Health Leads has provided and trying to bring to light the possible reasons why a patient may disengage. In the end, we also hope to provide insights into how Health Leads could direct their energy and activities to boost patient responsiveness in a way that increases the chances patients will receive the resources they need.
What we quickly realized upon tackling this project was that Health Leads had yet to seriously determine exactly what it means for a patient to be “engaged.” To be fair, even in the world of technology product management this definition can be difficult to pin down. Groupon and Zynga, both struggling companies, certainly saw high usage numbers in their heyday by the usual measures of engagement. However, unlike in the web world, where companies can track every iota of data down to the click, nonprofits often have to make do with infrequently collected data that must be actively (and sometimes painfully) recorded rather than passively accumulated.
Translating this into practice packs a painful twofold punch. Not only do we not have a great deal of data (our entire dataset totals less than 250 MB), but a large portion of it is afflicted with data quality issues. We see fields with low coverage, values clearly generated by user error, and other cleanliness issues that raise our eyebrows.
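To make "low coverage" concrete, here is a minimal sketch of the kind of check we mean: for each field, what fraction of records actually holds a usable value? The records and field names below are made up for illustration and are not Health Leads' actual schema.

```python
def field_coverage(records, fields):
    """Return, for each field, the fraction of records with a non-empty value."""
    total = len(records)
    coverage = {}
    for field in fields:
        # Treat missing keys, None, and empty strings as "not filled in".
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        coverage[field] = filled / total if total else 0.0
    return coverage


# Hypothetical toy records; the real data looks nothing like this.
records = [
    {"patient_id": 1, "phone": "555-0100", "zip": "60637"},
    {"patient_id": 2, "phone": "", "zip": None},
    {"patient_id": 3, "phone": "555-0102", "zip": ""},
    {"patient_id": 4, "phone": None, "zip": "60615"},
]

coverage = field_coverage(records, ["patient_id", "phone", "zip"])
for field, share in coverage.items():
    print(f"{field}: {share:.0%} filled")
```

A table like this is a cheap first pass for deciding which fields are trustworthy enough to build features on.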
All this presents a rather challenging scenario. After all, it’s hard to do data science without good data.
In addition to data concerns, I also mentioned earlier that nailing down the exact definition of engagement is proving to be a challenge. The difficulty lies in translating a nebulous human intuition into a rigorous formulation. If we were to proceed with the wrong calculation of engagement, any statistical or machine learning methods we build to model it would become suspect.
Our team initially ran a logistic regression attempting to predict patient outcome as a function of responsiveness, only to discover that my calculation of responsiveness was off. After recalculating it, however, I found that my predictions had actually been more accurate with the erroneous calculation, which presents quite a conundrum.
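For readers who want to see the shape of that model, here is a rough sketch of a one-feature logistic regression fit by plain gradient descent. Everything in it is an assumption for illustration: the definition of responsiveness (share of contact attempts the patient answered), the toy numbers, and the helper names are not our actual calculation or data.

```python
import math


def responsiveness(answered, attempts):
    """Hypothetical definition: share of contact attempts the patient answered."""
    return answered / attempts if attempts else 0.0


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(success) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    """Share of patients whose predicted outcome matches the recorded one."""
    preds = [1 if w * x + b > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


# Made-up (answered, attempts, outcome) triples; outcome 1 = need resolved.
patients = [(0, 4, 0), (1, 4, 0), (1, 5, 0), (2, 5, 0),
            (2, 4, 1), (3, 4, 1), (4, 5, 1), (4, 4, 1)]
xs = [responsiveness(a, n) for a, n, _ in patients]
ys = [outcome for _, _, outcome in patients]

w, b = fit_logistic(xs, ys)
acc = accuracy(w, b, xs, ys)
print(f"weight={w:.2f}, intercept={b:.2f}, training accuracy={acc:.2f}")
```

Swapping in a different `responsiveness` function and refitting is all it takes to compare two definitions, which is exactly how a flawed calculation can end up looking better on accuracy alone.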
Furthermore, even beyond the practical implementation concerns our team has, there are higher-level questions that we’re asking ourselves. Namely, we’re questioning the underlying assumption of the entire problem: does higher engagement actually increase the chance of a successful patient outcome?
After all, if the answer is a resounding ‘No’, then the entire foundation upon which we’ve been working crumbles into sand. Unfortunately, we’re finding some small inklings that point in this direction. Tentatively, we believe this surprising finding is due more to low-quality data and an iffy definition of engagement than to actual causal processes in the real world. Nevertheless, it raises a red flag in our minds.
Finally, one last challenge is that the factors we end up discovering as influencing patient outcome may be ones outside of Health Leads’ control. Perhaps the most significant indicators of patients successfully acquiring necessary resources are variables such as gender or age. It may be quite difficult for Health Leads to translate findings like these into operationalizable steps their Advocates can take.
The previous section might have come off as pessimistic, but I didn’t mean it to be. None of the challenges in my list are insurmountable or dead ends. In fact, there are a number of reasons to be quite positive about what I, as a data scientist, can do to help Health Leads achieve their vision of creating a healthcare system in which all patients’ basic resource needs are adequately addressed.
For starters, our team has already started to think about ways to more carefully redefine engagement and incorporate it as a predictor of patient outcome. We think that our initial finding of a disconnect between patient outcome and engagement is due more to faulty wiring at the definition level than to an actual lack of relationship between the two.
With Health Leads’ help, we’re also thinking of ways to engineer more substantive and accurate features from the data we have available, features that can tell a more nuanced and informative story about how a patient moves through the process of getting the resources they need. As an example, one road we’re exploring is to vectorize engagement: rather than summarizing it with one single number averaged across multiple interactions, we would calculate it at various points in a patient’s relationship with Health Leads.[1]
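As a rough sketch of what vectorizing engagement could look like, the snippet below splits a patient's chronologically ordered contact attempts into a few equal windows and computes a response rate for each. The three-window split, the True/False encoding of a response, and the function name are all assumptions for illustration, not a settled design.

```python
def engagement_vector(interactions, n_windows=3):
    """Response rate per window of a patient's contact history.

    interactions: chronologically ordered list of booleans,
    True when the patient responded to that contact attempt.
    Windows with no interactions are reported as None.
    """
    n = len(interactions)
    vector = []
    for i in range(n_windows):
        start = i * n // n_windows
        end = (i + 1) * n // n_windows
        window = interactions[start:end]
        vector.append(sum(window) / len(window) if window else None)
    return vector


# Two made-up patients with the same overall response rate (50%).
fading = [True, True, True, False, False, False]
steady = [False, True, False, True, False, True]

for name, history in [("fading", fading), ("steady", steady)]:
    overall = sum(history) / len(history)
    print(name, "overall:", overall, "by window:", engagement_vector(history))
```

Both patients collapse to the same single number, but the vector separates someone who is drifting away from someone who answers every other call.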
Even if engagement ends up proving a dud – bearing little to no predictive significance on a patient’s outcome – this itself would be a landmark discovery for Health Leads. And based upon their stellar team and impressive organizational quality that I’ve observed up to this point, I have no doubt that they’ll thoughtfully incorporate this finding into their model so as to better continue serving the health and social service needs of individuals all over America.
[1] Another opportunity we’ll be exploring as we continue working with Health Leads will be to directly predict the outcome of a patient, rather than using engagement as a proxy.