Data Science for Civic Hackers

Before we start, I would like to set the context by introducing a quote by an authoritative voice in the field of data science applied to civic and social problems:


In [1]:
from IPython.display import Image

In [2]:
i = Image(filename='Pics/ghani.png')

In [3]:
i


Out[3]:

For those who don't know, Rayid Ghani is a Research Director at the Computation Institute and a Senior Fellow at the Harris School of Public Policy at the University of Chicago. He is more famously known as the Chief Scientist at Obama for America during the 2012 campaign. If you do a Google search on him, you will likely encounter articles claiming he used "Big Data" to help win the election. This is a gross simplification of the work Ghani and his team did during that time. A more accurate account would be that he effectively leveraged the technology available at the time to get people who were going to vote for Obama to convince their friends and neighbors to vote for him too (to hear it from Ghani himself, go here). Now, he leads The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship to mentor and foster the new generation of data scientists interested in the social problems we face today. The main points to take away from this prologue, however, are the insights and lessons he learned making use of data to solve social issues.

(1) It's not about the data, it's about the problems you're solving.

This is just the quote in the tweet above, but it deserves an explanation. It is easy to get distracted by the "data science" hype and all the cool tools that are now available. When you want to seriously analyze a problem using data, you have to know what the problem is BEFORE you start doing the analysis. For example, if you want to do serious research using government data, asking yourself "How can I use Deep Learning or Random Forests to make interesting conclusions?" is the WRONG way to go. This is harmless if you want to learn how to use these methods on your own. However, if you go in this direction when you are trying to make an impact on people's lives, you will likely waste your time or, at worst, draw conclusions that do not accurately represent reality.

(2) Not all data is good data.

Rayid Ghani has no qualms about admitting his distaste for the "big data" hype.

(3) The most important thing 'data scientists' can do is get the science part right.

This is an important quote by Ghani in [1]. The image below provides a simplified, yet important, look at the relevant data science skills and how they interact with each other.


In [1]:
from wand.image import Image as WImage

In [2]:
img = WImage(filename='Pics/dsvenn.pdf')

In [3]:
img


Out[3]:

The people who volunteer for organizations like Code for America are some of the most awesome and engaged people you will ever meet. The typical CfA hacker has great hacking skills and is passionate about the social issues affecting the community. However, as awesome as these civic hackers are, if they do not have adequate mathematical and statistical knowledge to draw well-reasoned conclusions, they risk falling into the danger zone!

Motivation:

My plan is to provide a set of IPython notebooks on how civic hackers can do data science effectively. We are currently experiencing a surge of new data and tools that can help us derive conclusions from the data. Software packages containing methods from Statistics, Machine Learning, and Artificial Intelligence have been open-sourced and are available for all to use. Like all tools, however, you have to know how to use these methods effectively. There are other great IPython notebooks out there related to statistics and machine learning, but most of them require some level of advanced mathematical training. Civic hackers who want to use data to solve social problems may be overwhelmed by the prerequisite knowledge needed to make sense of the references that experts and academics usually mention. There is also the danger of consulting references that are too simplistic and don’t emphasize good statistical practice. My goal is to fill that gap; that is, to provide a set of IPython notebooks that are rigorous but not mathematically overwhelming. This set of notebooks attempts to go beyond the recent “Big Data” hype and focuses on the social problems we face today. It is meant to be a resource from which all civic hackers can learn the computational and statistical skills needed to tackle a social issue when adequate data are available.

How is this book different from other resources?

(1) Designed with the civic hacker in mind. ALL the examples will use open civic data or be relevant to social issues.

Don't expect any "Assume the cow is a sphere..." theoretical nonsense. I'll try my best to find useful datasets from Open Data websites. If I

(2) Makes use of excellent Python Data Packages so that you can incorporate the code into your App.


(3) Minimizes the need to reinvent the wheel. Emphasis on high level concepts instead of creating everything from scratch.


(4) An emphasis on good statistical practice and avoiding common pitfalls.
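
As a small taste of the kind of pitfall meant here, the sketch below illustrates Simpson's paradox, which Chapter 1 discusses. The counts are invented purely for illustration (they are not real civic data), and the program names, subgroup labels, and helper functions are all hypothetical.

```python
# Hypothetical counts, invented for illustration only (not real civic data).
# Each entry is (successes, participants) for a job program within a subgroup.
outcomes = {
    "Program A": {
        "short-term unemployed": (90, 100),
        "long-term unemployed": (300, 500),
    },
    "Program B": {
        "short-term unemployed": (425, 500),
        "long-term unemployed": (50, 100),
    },
}


def success_rate(successes, participants):
    """Fraction of participants who succeeded."""
    return successes / participants


def subgroup_rates(groups):
    """Success rate computed separately for each subgroup."""
    return {name: success_rate(*counts) for name, counts in groups.items()}


def overall_rate(groups):
    """Success rate after pooling all subgroups together."""
    total_successes = sum(s for s, _ in groups.values())
    total_participants = sum(n for _, n in groups.values())
    return success_rate(total_successes, total_participants)


for program, groups in outcomes.items():
    for name, rate in subgroup_rates(groups).items():
        print(f"{program}, {name}: {rate:.2f}")
    print(f"{program}, overall: {overall_rate(groups):.2f}")
```

With these made-up counts, Program A has the higher success rate within each subgroup, yet Program B has the higher rate once the subgroups are pooled, because Program B mostly served the group that is easier to place. Comparing only the pooled rates would point to the wrong program.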


(5) Teaches you how to be a competent data scientist without assuming a Ph.D. in Statistics.


Contents:

Prologue

Chapter 1: What is Statistics? (or HOW THE HECK DO I MAKE SENSE OF THIS DATA?) This introduction emphasizes the need for good statistical practice when drawing conclusions from your data. It discusses Simpson's paradox, some bad, misleading visualizations, and how to get out of the "Danger Zone".

Part I: Descriptive Statistics

Chapter 2: Handling and Exploring Your Data. Introduction to pandas and StatsModels. What is an "average" (mean, median, mode)?

Chapter 3: Doing Visualization Right. Introduction to Matplotlib and ggplot.

Part II: Probability

Chapter 4: Quantifying Uncertainty. Introduction to probability.

Chapter 5: Conditional Probability. Introduction to conditional probability and Bayes' Theorem.

Chapter 6: First Steps Toward Modeling the World. Discrete distributions.

Part III: Statistical Inference

Part IV: Machine Learning

Part V: Toward More Advanced Material

Chapter V1: Behavior Change

References:

