Carl Shan home my books quotes

Week 2: Why Doing Data Science in Non Profits is Different from Industry

15 June 2014 - Chicago

This is the second in a series of posts chronicling my reflections on participating in the 2014 Data Science for Social Good Fellowship at the University of Chicago. You can read the first post here.

To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.

One of the nonprofit partners my team is working with came in to the office this past week. They shared with us the overarching problem they’re facing as an organization, and how they hoped our team can shed some light onto potential answers.[1]

After listening to and reflecting upon the needs of our partner, I’ve come to see that the solutions to their problems are likely to not be all that technically sophisticated. What the non profit needs is not some complex model feeding on thousands of variables and millions of data points. After speaking with a few of the other Fellows, this observation seems to be widely applicable to many of the other projects.

Since then, I’ve been crystallizing a series of observations about how data science as practiced in the non profit space can be remarkably different from industry data science.


In the technology industry, startups and companies can leverage data science to huge payoffs. Airbnb’s data science team mines the rich set of host-visitor data to build conditional-probability models estimating the likelihood that a user will book in a particular neighborhood. Education startup Knewton uses Kalman filters to estimate student ability.

However, generally speaking non profits interested in data science are not typically looking for technical wizardry. Rather, the needs of the these organizations seem very straightforward and can be solved with common techniques.

My project partner is asking for an evaluation of the social support programs they put together, so that they can make informed decisions about where to allocate federal funding. Much of that means doing simple statistical analyses, such as looking at percentages, doing cohort analysis and calculating survival rates.

There doesn’t seem to be any whiz-bang math to introduce here. There’s no need to break out the most cutting edge machine learning algorithms.

To be fair this isn’t something that’s true only of the non profit space. Many for-profit companies also don’t need sophisticated data science teams. I’ve heard that even now, Dropbox’s data team comprises of only three people. Nevertheless, I would wager that having a PhD in machine learning would help you solve more problems at, say, Google than it would at the National Breast Cancer Foundation.

Where’s The Data?

Beyond the question of needs, a key component of effectively conducting data science is having a wealth of quality data. Even if the goals of non profits necessitated complex techniques, machine learning algorithms really succeed in finding patterns when they’re being fed large volumes (think giga or petabytes) of high-quality data.

Unfortunately for many non profits (and for-profit companies as well), this is a dealbreaker.

Non profits are often launched to advance a praiseworthy social cause. They raise money from individual donors, foundation grants or other charitable giving to fund their organization. However, as non profit are oftentimes not directly selling a product or service, they need to appeal to the emotional pathos or moralistic beliefs of donors. As a result, powerful personal stories and anecdotes are more potent forces in a non profits arsenal than metrics or data.

And to be fair, this is very understandable. After all, it’s hard to measure exactly how much effective mentoring is happening as a result of your organization.[2] Or whether women are being effectively empowered. Or if people are living more self-actualized lives. These missions just don’t easily lend themselves to measurement.

Nevertheless, all this poses a problem to applying data science.

Our partner has only been focusing on reliably collecting data for the past two years. Before then, there was some data entry, but it was performed as more of an afterthought. The data collection process also only happens on rare occasions: when a patient either enters or leaves a social service program.

There is little information about how an individual patient is doing during the program itself.

As a result, the data we have is fairly sparse, dirty and can only provide a limited perspective in evaluating the effectiveness of social programs.

(Fortunately, there are a few movements to attach greater importance to metrics in the non profit space. These include ones such as outcomes-driven philanthropy and effective altruism.)

Black Boxes

In a seminal 1993 paper, machine learning researcher Robert Holte analyzed the performance of a variety of different machine learning techniques on common datasets. He surprisingly revealed that simple methods he studied, based off of analyzing only one variable in the data, “are almost as accurate as more complex rules” looking at many more variables! This led him to conclude that any “additional complexity must be justified” in machine learning models.[3]

Nowhere has this statement been made more clear to me than working with non profits.

It’s likely that most non profit organizations have never heard of any of the algorithms that are all the rage in the data science field today. As a result, individuals in these non profits may be reluctant to touch and intimidated by black box methods, potentially resulting in situations such as my partner never touching anything I build for them.

Holte’s warnings about complexity without justification rings even more truly here. Even the most impressive tool is only valuable if others trust it enough to use it. This understanding motivates me to create something that meets non profits at their level, rather than indulging in sophistication for sophistication’s sake.


As a DSSG Fellow this summer, I hope to create and ship something that can be successfully used in advancing the causes of the visionary non profits I’m working with.

As I progress throughout my summer, it will important for me to keep in mind that there will be many difficulties particular to the non profit space that my team and I will need to carefully navigate. We must learn how to effectively deal with sparse and impoverished datasets, as well as negotiate the tension of building something that’s complex enough to solve a challenging problem, while being simple enough to be trusted by decision-makers.


[1] I’ll be able to share more details about both my project and my partners in future posts.

[2] This comes from personal experience as someone who founded and directed a national non-profit peer mentoring program for 3 years.

[3] Robert Holte, “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets” (1993)

I write posts about data science applied to social causes. If you want to be notified when my next reflection is published, subscribe by clicking here.