This course is a practical introduction to the methods and tools that social scientists can use to make sense of big data, so programming resources are central to it. We make extensive use of the Python programming language and of SQL for database management. We recommend that any social scientist who aspires to work with large datasets become proficient in both, as well as in GitHub. All three, fortunately, are quite accessible and supported by excellent online resources.
Why version control?
Version control is like a lab notebook for your digital projects. It enables you and your collaborators to work on a project simultaneously, rather than waiting for someone else to make edits before you can add yours, or poring over every line to try to assimilate changes from multiple editors. It allows you to return to previous versions of documents, like an "undo" button. It allows you to free your file system of names like paper_final_FINAL_JZ_edits.doc without fear that your work will be lost, overwritten, or duplicated. Version control is also useful to keep track of what you've done and when on your own research projects; if, in a year's time, you discover an error in your analysis, or want to investigate or change your process in any way, Future You will be grateful to have a record of what you did and when.
GitHub is one common service that hosts projects using the Git system of version control; in this class, we will be using GitLab.
The Git section of this tutorial is based on the Software Carpentry version control tutorial.
You should have a GitLab account and be able to access the yellow environment.
Why Python instead of SAS, Stata, and similar packages? Python is free, open-source software, which supports reproducibility: anyone can rerun your analysis without needing a proprietary license.
Before coming to class, you should have completed the DataCamp Intro to Python for Data Science course. It is free and takes about four hours.
First, we import the packages we'll use throughout this notebook.
In [ ]:
import numpy
import pandas
import psycopg2
import gc  # used later to free memory after discarding large objects
%matplotlib inline
In this lesson, we'll use the pandas package to read in and manipulate data. pandas reads data from the PostgreSQL database and stores it in a special table format called a "data frame," which will be a familiar format if you are used to using R or Stata for data analysis. Data frames allow for easy statistical analysis and can be used directly for machine learning. pandas connects to a database through a database connection object, which we create next.
In the code cell below, we'll use psycopg2 to connect to the database. Replace your_db_name_here with the name of the class database, then run the cell.
In [ ]:
db_name = "your_db_name_here"
pgsql_connection = psycopg2.connect( database = db_name )
Next, we will use this database connection to tell pandas where to retrieve the data. pandas has a set of Input/Output tools that let it read from and write to a large variety of tabular data formats, including CSV and Excel files, databases via SQL, JSON files, and even SAS and Stata data files. In the example below, we'll use the pandas.read_sql() function to read the results of an SQL query into a data frame.
In [ ]:
# read the results of an SQL query into a data frame, using the connection opened above
data_frame = pandas.read_sql( "SELECT * FROM table;", pgsql_connection )
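pandas' other Input/Output functions work much the same way. As a minimal sketch of reading and writing flat files (the file name example_data.csv below is just a hypothetical placeholder), we could round-trip the data frame through a CSV file:
In [ ]:
# write the data frame out to a CSV file (example_data.csv is a hypothetical file name)
data_frame.to_csv( "example_data.csv", index = False )
# read it back in with pandas.read_csv()
csv_data_frame = pandas.read_csv( "example_data.csv" )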
Now, let's take a look at the data. The pandas.DataFrame method 'data_frame.head( number_of_rows )' outputs the first number_of_rows rows of a data frame. Let's look at the first five rows of our data. The code cell below shows two ways to output this information. If you just call the method, you'll get an HTML table rendered directly in the ipython notebook. If you pass the result of the method to the "print()" function, you'll get text output that also works outside of jupyter/ipython.
In [ ]:
# to get a pretty tabular view, just call the method.
data_frame.head( 5 )
# to get a text-based view, print() the call to the method.
#print( data_frame.head( 5 ) )
In pandas, our data is represented by a DataFrame. You can think of a data frame as a giant spreadsheet that you can program: the data for each column is stored in its own list, which pandas calls a Series (or vector of values), and the data frame comes with a set of methods (another name for functions that are tied to objects) that make managing data in pandas easy.
A Series is a list of values, each of which can also have a label; pandas calls these labels the index. When you retrieve a Series that represents a row, the index generally holds the column names; when you retrieve a Series that represents a column, the index generally holds the row IDs.
While DataFrames and Series are separate objects, they may share the same methods where those methods make sense in both a table and a list context (for example, head() and tail(), used elsewhere in this notebook).
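As a quick sketch of this (assuming the data_frame loaded above has at least one row), we can pull out the first row as a Series and inspect its index, which will hold the column names:
In [ ]:
# retrieve the first row of the data frame as a Series
first_row_series = data_frame.iloc[ 0 ]
# its index holds the column names
print( first_row_series.index )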
More details on pandas data structures:
In [ ]:
# get vector of "ORG_DEPT" column values from data frame
org_dept_column_series = data_frame[ "ORG_DEPT" ]
# see the last 5 values in the vector.
print( org_dept_column_series.tail( 5 ) )
# It is also OK to chain together, but I did not above for clarity's sake, and in
# general, be wary of doing too many things on one line.
# data_frame[ "ORG_DEPT" ].tail( 5 )
# empty org_dept_column_series variable and garbage collect, to conserve memory
org_dept_column_series = None
gc.collect()
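We can also check the data type pandas has assigned to each column by looking at the data frame's dtypes attribute: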
In [ ]:
# data type of each column in the data frame
data_frame.dtypes
Pandas provides some great functions for descriptive statistics. Some examples:
describe() - "computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course)" (see documentation).
head() and tail(), shown above - "To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number." (see documentation).
value_counts() - this "Series method and top-level function computes a histogram of a one-dimensional array of values" (see documentation). It returns a Series of the number of times each unique value appears in the column (also known as frequencies), ordered from largest count to smallest, with the values themselves as the row labels. A short example using describe() and value_counts() appears below.
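As a minimal sketch (assuming the data_frame and the "ORG_DEPT" column used earlier in this notebook), the cell below tries out describe() and value_counts():
In [ ]:
# summary statistics for the data frame's numeric columns (all columns if none are numeric)
print( data_frame.describe() )
# frequency of each unique value in the "ORG_DEPT" column, most common first
print( data_frame[ "ORG_DEPT" ].value_counts() )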
Alex Bell's Python for Economists provides a wonderful 30-page introduction to the use of Python in the social sciences, complete with XKCD cartoons.