This course is a practical introduction to the methods and tools that social scientists can use to make sense of big data, so programming resources are central to it. We make extensive use of the Python programming language and of SQL for database management. We recommend that any social scientist who aspires to work with large datasets become proficient in both, as well as in GitHub. All three, fortunately, are quite accessible and supported by excellent online resources.
Why version control?
Version control is like a lab notebook for your digital projects. It enables you and your collaborators to work on a project simultaneously, rather than waiting for someone else to make edits before you can add yours, or poring over every line to try to assimilate changes from multiple editors. It allows you to return to previous versions of documents, like an "undo" button. It allows you to free your file system of names like paper_final_FINAL_JZ_edits.doc without fear that your work will be lost, overwritten, or duplicated. Version control is also useful to keep track of what you've done and when on your own research projects; if, in a year's time, you discover an error in your analysis, or want to investigate or change your process in any way, Future You will be grateful to have a record of what you did and when.
GitHub is one common service that hosts projects using the Git system of version control; in this class, we will be using GitLab.
The Git section of this tutorial is based on the Software Carpentry version control tutorial.
You should have a GitLab account and be able to access the yellow environment.
Why Python instead of SAS, Stata, and similar packages? Python is free, open-source software, which supports reproducibility: anyone can rerun your analysis without needing a proprietary license.
Before coming to class, you should have completed the DataCamp Intro to Python for Data Science course. It is free and takes about four hours.
First, we import the packages we'll use throughout this notebook.
In [ ]:
import numpy
import pandas
import psycopg2
import gc  # used later to free memory after discarding large objects
%matplotlib inline
In this lesson, we'll use the pandas package to read in and manipulate data. pandas reads data from the PostgreSQL database and stores it in a special table format called a "data frame," which will be a familiar format if you are used to using R or Stata for data analysis. Data frames allow for easy statistical analysis and can be used directly for machine learning. pandas connects to a database through a database connection object, which we create next.
In the code cell below, we'll use psycopg2 to connect to the database. Replace your_db_name_here with the name of the class database, then run the cell.
In [ ]:
db_name = "your_db_name_here"
pgsql_connection = psycopg2.connect( database = db_name )
Next, we will use this database connection to tell pandas where to retrieve the data. pandas has a set of Input/Output tools that let it read from and write to a large variety of tabular data formats, including CSV and Excel files, databases via SQL, JSON files, and even SAS and Stata data files. In the example below, we'll use the pandas.read_sql() function to read the results of an SQL query into a data frame.
In [ ]:
# read the results of an SQL query into a data frame, using the connection opened above
data_frame = pandas.read_sql( "SELECT * FROM table;", pgsql_connection )
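pandas' other Input/Output functions work much the same way. As a minimal sketch of reading and writing flat files (the file name example_data.csv below is just a hypothetical placeholder), we could round-trip the data frame through a CSV file:
In [ ]:
# write the data frame out to a CSV file (example_data.csv is a hypothetical file name)
data_frame.to_csv( "example_data.csv", index = False )
# read it back in with pandas.read_csv()
csv_data_frame = pandas.read_csv( "example_data.csv" )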
Now, let's take a look at the data. The pandas.DataFrame method 'data_frame.head( number_of_rows )' outputs the first number_of_rows rows of a data frame. Let's look at the first five rows of our data. The code cell below shows two ways to output this information. If you just call the method, you'll get an HTML table rendered directly in the ipython notebook. If you pass the result of the method to the "print()" function, you'll get text output that also works outside of jupyter/ipython.
In [ ]:
# to get a pretty tabular view, just call the method.
data_frame.head( 5 )
# to get a text-based view, print() the call to the method.
#print( data_frame.head( 5 ) )
In pandas, our data is represented by a DataFrame. You can think of a data frame as a giant spreadsheet that you can program: the data for each column is stored in its own list, which pandas calls a Series (or vector of values), and the data frame comes with a set of methods (another name for functions that are tied to objects) that make managing data in pandas easy.
A Series is a list of values, each of which can also have a label; pandas calls these labels the index. When you retrieve a Series that represents a row, the index generally holds the column names; when you retrieve a Series that represents a column, the index generally holds the row IDs.
While DataFrames and Series are separate objects, they may share the same methods where those methods make sense in both a table and a list context (for example, head() and tail(), used elsewhere in this notebook).
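As a quick sketch of this (assuming the data_frame loaded above has at least one row), we can pull out the first row as a Series and inspect its index, which will hold the column names:
In [ ]:
# retrieve the first row of the data frame as a Series
first_row_series = data_frame.iloc[ 0 ]
# its index holds the column names
print( first_row_series.index )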
More details on pandas data structures:
In [ ]:
# get vector of "ORG_DEPT" column values from data frame
org_dept_column_series = data_frame[ "ORG_DEPT" ]
# see the last 5 values in the vector.
print( org_dept_column_series.tail( 5 ) )
# It is also OK to chain together, but I did not above for clarity's sake, and in
# general, be wary of doing too many things on one line.
# data_frame[ "ORG_DEPT" ].tail( 5 )
# empty org_dept_column_series variable and garbage collect, to conserve memory
org_dept_column_series = None
gc.collect()
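We can also check the data type pandas has assigned to each column by looking at the data frame's dtypes attribute: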
In [ ]:
# data type of each column in the data frame
data_frame.dtypes
Pandas provides some great functions for descriptive statistics. Some examples:
describe() - "computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course)" (see documentation).
head() and tail(), shown above - "To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number." (see documentation).
value_counts() - this "Series method and top-level function computes a histogram of a one-dimensional array of values" (see documentation). It returns a Series of the number of times each unique value appears in the column (also known as frequencies), ordered from largest count to smallest, with the values themselves as the row labels. A short example using describe() and value_counts() appears below.
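As a minimal sketch (assuming the data_frame and the "ORG_DEPT" column used earlier in this notebook), the cell below tries out describe() and value_counts():
In [ ]:
# summary statistics for the data frame's numeric columns (all columns if none are numeric)
print( data_frame.describe() )
# frequency of each unique value in the "ORG_DEPT" column, most common first
print( data_frame[ "ORG_DEPT" ].value_counts() )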
Alex Bell's Python for Economists provides a wonderful 30-page introduction to the use of Python in the social sciences, complete with XKCD cartoons.