Before we start, I would like to set the context by introducing a quote by an authoritative voice in the field of data science applied to civic and social problems:
In [1]:
from IPython.display import Image
In [2]:
i = Image(filename='Pics/ghani.png')
In [3]:
i
Out[3]:
For those who don't know, Rayid Ghani is a Research Director at the Computation Institute and a Senior Fellow at the Harris School of Public Policy at the University of Chicago. He is more famously known as the Chief Scientist at Obama for America during the 2012 campaign. If you do a Google search on him, you will likely encounter articles claiming he used "Big Data" to help win the election. This is a gross simplification of the work Ghani and his team did during that time. A more accurate account would be that he effectively leveraged the technology available at the time to get people who were going to vote for Obama to convince their friends and neighbors to vote for him too (to hear it from Ghani himself, go here). Now, he leads The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship to mentor and foster the new generation of data scientists interested in the social problems we face today. The main points to take away from this prologue, however, are the insights and lessons he learned from using data to solve social issues.
This is just the quote in the tweet above, but it deserves an explanation. It is easy to get distracted by the "data science" hype and all the cool tools that are now available. When you want to seriously analyze a problem using data, you have to know what the problem is BEFORE you start doing the analysis. For example, if you want to do serious research using government data, asking yourself "How can I use Deep Learning or Random Forests to draw interesting conclusions?" is the WRONG way to go. This is harmless if you just want to learn how to use these methods on your own. However, if you go in this direction when you are trying to make an impact on people's lives, you will at best be wasting your time and, at worst, drawing conclusions that do not accurately represent reality.
Rayid Ghani has no qualms about admitting his distaste for the "big data" hype.
This is an important quote by Ghani in [1]. The image below provides a simplified, yet important, look at the relevant data science skills and how they interact with each other.
In [1]:
from wand.image import Image as WImage
In [2]:
img = WImage(filename='Pics/dsvenn.pdf')
In [3]:
img
Out[3]:
The people who volunteer for organizations like Code for America are some of the most awesome and engaged people you will ever meet. The typical CfA hacker has great hacking skills and is passionate about the social issues affecting the community. However, as awesome as these civic hackers are, if they do not have adequate mathematical and statistical knowledge to draw well-reasoned conclusions, they risk falling into the danger zone!
My plan is to provide a set of IPython notebooks on how civic hackers can do data science effectively. We are currently experiencing a surge of new data and of tools that can help us derive conclusions from the data. Software packages containing methods from Statistics, Machine Learning, and Artificial Intelligence have been open-sourced and are available for all to use. Like all tools, however, you have to know how to use these methods effectively. There are other great IPython notebooks out there related to statistics and machine learning, but most of them require some level of advanced mathematical training. Civic hackers who want to use data to solve social problems may be overwhelmed by the prerequisite knowledge needed to make sense of the references usually mentioned by experts and academics. There is also the danger of consulting references that are too simplistic and don’t emphasize good statistical practice. My goal is to fill that gap; that is, to provide a set of IPython notebooks that are rigorous but not mathematically overwhelming. This set of notebooks attempts to go beyond the recent “Big Data” hype and focuses on the social problems we face today. Ultimately, it should serve as a resource from which all civic hackers can learn the computational and statistical skills needed to tackle any social issue when adequate data are available.
Don't expect any "Assume the cow is a sphere..." theoretical nonsense. I'll try my best to find useful datasets from Open Data websites. If I
Prologue
Chapter 1: What is Statistics? (or HOW THE HECK DO I MAKE SENSE OF THIS DATA?). This introduction emphasizes the need for good statistical practice when drawing conclusions from your data. It discusses Simpson's paradox and some bad, misleading visualizations, and how to get out of the "Danger Zone".
Part I: Descriptive Statistics
Chapter 2: Handling and Exploring Your Data. Introduction to pandas and StatsModels. What is an "average" (mean, median, mode)?
Chapter 3: Doing Visualization Right. Introduction to Matplotlib and ggplot.
Part II: Probability
Chapter 4: Quantifying Uncertainty. Introduction to Probability.
Chapter 5: Conditional Probability. Introduction to Conditional Probability and Bayes' Theorem.
Chapter 6: First Steps Toward Modeling the World. Discrete Distributions.
Part III: Statistical Inference
Part IV: Machine Learning
Part V: Toward More Advanced Material
Chapter V1: Behavior Change
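As a small taste of the question Chapter 2 raises ("What is an 'average'?"), here is a minimal sketch using only Python's standard `statistics` module. The `incomes` list is made up purely for illustration and does not come from any Open Data source; the point is only that the three common "averages" can tell very different stories about the same data.

```python
from statistics import mean, median, mode

# Hypothetical household incomes (in thousands of dollars) on one block.
# These numbers are invented for illustration, not taken from a real dataset.
incomes = [28, 30, 30, 32, 35, 38, 250]

avg = mean(incomes)          # arithmetic mean, pulled upward by the 250 outlier
mid = median(incomes)        # middle value of the sorted list
most_common = mode(incomes)  # most frequently occurring value

print(f"mean   = {avg:.1f}")
print(f"median = {mid}")
print(f"mode   = {most_common}")
```

With this toy data the mean is roughly twice the median, because a single very large value drags it upward. Which "average" is appropriate depends on the question you are asking, which is exactly why the problem should come before the tool.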