Manipulating Data in Python

Laila A. Wahedi

Massive Data Institute Postdoctoral Fellow
McCourt School of Public Policy

Follow along: Wahedi.us, Current Presentation

Installing packages:

On a Mac:

  • Open terminal

On Windows:

  • Type cmd into the start menu

What is this?

  • Like an explorer window. The text at the start of the line tells you where you are.
  • To see what's there, type:
    • On Mac: ls (That's an L, not an i)
    • On Windows: dir

Installing Packages:

Install Packages

Install some packages so they are done by the time we need them. Type:

  • pip install pandas

When that's done, type:

  • pip install matplotlib
  • pip install pickle
  • pip install statsmodels

Note: if you installed the Continuum Python distribution, you may already have some of these packages installed.


What just happened?

  • Pip is a download manager, it automatically looks online to find the package and installs it on your computer
  • Similar to the CRAN Mirror
  • Bonus: if you followed the link I sent to install Anaconda, you have the conda download manager too.
    • conda install works similarly to pip, and also works on many R packages when used with Jupyter notebooks.

Opening Jupyter

  • Open up another terminal/command prompt while your packages download
  • Navigate to the directory you want your script to be in
    • Type ls to see what folders are in the current folder
    • Type cd .. to go one folder up
    • Type cd folder_name to go to a sub-folder
      • Type mkdir folder_name to make a folder
    • Type:
      • jupyter notebook

What happened?

  • A browser window popped up
  • Address something along the lines of localhost:8888/tree
  • Address is listed in the terminal/command prompt
  • Go to this address from any browser to open your tree

Your Jupyter Tree

  • You should be in the folder you navigated to in terminal/cmd
  • You can make folders here too
  • Top right, select new notebook, python3

Open this notebook:

  • Move the file into the directory for your Jupyter Tree
  • Select the file from teh Jupyter Tree
  • A new tab should open in your browser.
  • Add new cells to take notes as you go along.
    • Change the type of note cells to Markdown

Some Python Basics

What is code?

  • Set of clear instructions to the computer
  • Start with basic set of building blocks
  • Put those building blocks together

Goal of programming:

  • Break your task into building blocks
  • Put building blocks together into an algorithm

Our tasks:

  • Move and manipulate data using these building blocks.

Building Blocks

  • Ball and urn example

Our First Algorithm

Basic Data Structures

Strings

  • Text variables
  • Declare with either '', or ""

In [20]:
my_string = 'Hello World'
print(my_string)


Hello World

Numbers:

  • ints: integers
  • floats: numbers with decimals
  • Combine them to get a new float

In [25]:
my_int = 2
my_float = 2.2
new_float = my_int+my_float
print(new_float)
type(new_float)


4.2
Out[25]:
float

Lists

  • Declare with: []
  • Ordered list of objects

In [ ]:
my_list = [0,1,2,3,4]
  • Zero indexed

In [12]:
print(my_list[1])


1
  • Lists are Mutable:

In [13]:
my_list[2]='hello'
print(my_list)


[0, 1, 'hello', 3, 4]

Dictionaries

  • key value pairs indexed by key
  • Declared by {key:value}

In [14]:
my_dictionary = {'apple':4,
                'pear':'yum'}
print(my_dictionary['apple'])


4
  • Add key value pairs later using indexing with brackets []
  • You can store data type you want

In [17]:
my_dictionary['numbers'] = my_list
print(my_dictionary['numbers'])


[0, 1, 'hello', 3, 4]

Sets

  • Like a list, but:
    • Contains unique values
    • Unordered
  • Declare with {}

In [18]:
my_set = {'thing1','thing2','cat in hat','thing1', 4,4}
print(my_set)


{'thing2', 4, 'thing1', 'cat in hat'}

Tuples

  • Like an ordered list, but can't be changed
  • Declare with ()

In [19]:
my_tuple = (1,3,2)
print(my_tuple)


(1, 3, 2)

Our First Algorithm: Strings


In [23]:
# Declare Data
my_data = 'hello '
my_other_data = 'world'
#Manipulate it
manipulated_data = my_data+my_other_data
#Output it:
print(manipulated_data)


hello world

Our First Algorithm: Numbers


In [26]:
# Declare Data
my_data =  1
my_other_data = 5
#Manipulate it
manipulated_data = 1/5
#Output it:
print(manipulated_data)


0.2

Adding Choices

Writing Instructions Python Can Understand:

Whitespace matters

  • tabs indicate a block of code within an if statement or loop
  • : marks the start of the block
  • \ for a linebreak mid-line
  • Line breaks allowed inside (), [], {}

If statements

  • = to assign
    • Like <- in R
  • == to evaluate

In [2]:
my_variable = 5
print(my_variable)
print(my_variable == 5)


5
True
  • Compare values with:
    • <, <=, >, >=, ==, !=, in, not in,

In [5]:
print(my_variable > 6)
print(my_variable in [1,4,7])


False
False

Booleans and Indicators

  • booleans are indicators of True or False
  • In Python3, True = 1, and False = 0

In [6]:
True + True


Out[6]:
2
  • None is a null value
  • None evaluates to false in an if statement.

If statements:

  • Used to change behavior depending on conditionals
  • Declare the statement
  • Declare the conditional action within a code block, or indentation:
  • Declare alternative actions in else
  • Stack conditionals with elif

In [7]:
my_bool = 'ice cream'
if my_bool == 'ice cream': 
    print('yay')
elif my_bool == 'cake':
    print('woo!')
else:
    print('Woe and great tragedy!')


yay

Play around yourself:


In [9]:
check = True
# check = False
# check = None
# check = 'monkey'
# check = 0
# check = 10
print('Check is:', check)
if check == 'monkey':
    print('banana')
elif check:
    print('yes')
else:
    print('no')
    
if 1 not in [1,2,3]:
    print('not not in')
if 1 in [1,2,3]:
    print('in')


Check is: True
yes
in

Repetition

While loops

  • Do something in block of code until condition met
  • Make sure you change the condition in every loop, or it will go forever

In [23]:
n = 0
while n < 5:
    print(n)
    n= n+1


0
1
2
3
4

For loops

  • repeat block of code for:
    • a certain number of iterations
    • For every element in a list
  • range() gives you an iterator

In [22]:
print('use a range:')
for i in range(3):
    print(i)
print('use a range slice:')
for i in range(3,6):
    print(i)
print('iterate throubh a list:')
for i in my_list:
    print(i)


use a range:
0
1
2
use a range slice:
3
4
5
iterate throubh a list:
0
1
hello
3
4

Put them together:

  • Play around on your own

In [10]:
my_list = [0,1,'cat',None,'frog',3]
animals = []
nums = []
for i in my_list:
    if type(i)==str:
        animals.append(i)
    elif type(i)==int:
        nums.append(i)
    else:
        pass
print(animals)
print(nums)


['cat', 'frog']
[0, 1, 3]

Learn More:

These are the basic building blocks for scripting. To learn more about how to put them together to greater effect:

  • Take an introductory computer science course online
  • Check out GU Women Coders.

Import packages

  • Like library(package) in R
  • Pandas is a dataframe data structure like dataframes in R
  • matplotlib is a plotting package similar to plotting in Matlab
    • you can view plots inline in pandas notebooks with the inline command
  • ggplot is a plotting package built on matplotlib like ggplot2 in R
  • Pickle lets you save your workspace
  • Statsmodels contains many of the statistical models you know and love

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# from ggplot import *
import pickle
import statsmodels.api as sm

Pandas Data Frames

Using Documentation

  • Pandas website
  • Stack Overflow
  • Copy errors into google
  • Look up syntax differences with R ## Load Data from a comma separated file
  • Start by googling it: http://lmgtfy.com/?q=pandas+load+csv

We will use the Big Allied and Dangerous Data from START


In [3]:
baad_covars = pd.read_csv('BAAD_1_Lethality_Data.tab',sep='\t')

Look at the data


In [85]:
baad_covars.head()


Out[85]:
mastertccode3606 group statespond cowmastercountry masterccode fatalities19982005 OrgAge ordsize terrStrong degree ContainRelig ContainEthno LeftNoReligEthno PureRelig PureEthno ReligEthno ContainRelig2 ContainEthno2 Islam
0 50 Animal Liberation Front (ALF) 0 United States of America 2 0 30 0 0 1 0 0 0 0 0 0 0 0 0
1 89 Army of God 0 United States of America 2 1 24 0 0 0 1 0 0 1 0 0 1 0 0
2 113 Cambodian Freedom Fighters (CFF) 0 United States of America 2 0 8 0 0 0 0 1 0 0 0 0 0 1 0
3 126 Coalition to Save the Preserves (CSP) 0 United States of America 2 0 6 0 0 0 0 0 0 0 0 0 0 0 0
4 153 Earth Liberation Front (ELF) 0 United States of America 2 0 14 0 0 1 0 0 0 0 0 0 0 0 0

Rename things and adjust values


In [8]:
baad_covars.rename(columns = {'cowmastercountry':'country',
                              'masterccode':'ccode',
                       'mastertccode3606':'group_code',
                       'fatalities19982005':'fatalities'},
            inplace = True)
baad_covars.replace({'country':{'United States of America':'US'}},
            inplace = True)
print('Dimensions: ',baad_covars.shape)
baad_covars.head()


('Dimensions: ', (395, 19))
Out[8]:
group_code group statespond country ccode fatalities OrgAge ordsize terrStrong degree ContainRelig ContainEthno LeftNoReligEthno PureRelig PureEthno ReligEthno ContainRelig2 ContainEthno2 Islam
0 50 Animal Liberation Front (ALF) 0 US 2 0 30 0 0 1 0 0 0 0 0 0 0 0 0
1 89 Army of God 0 US 2 1 24 0 0 0 1 0 0 1 0 0 1 0 0
2 113 Cambodian Freedom Fighters (CFF) 0 US 2 0 8 0 0 0 0 1 0 0 0 0 0 1 0
3 126 Coalition to Save the Preserves (CSP) 0 US 2 0 6 0 0 0 0 0 0 0 0 0 0 0 0
4 153 Earth Liberation Front (ELF) 0 US 2 0 14 0 0 1 0 0 0 0 0 0 0 0 0

Set a useful index


In [9]:
#Set the index
baad_covars.set_index(['group_code'],inplace = True)
baad_covars.head()


Out[9]:
group statespond country ccode fatalities OrgAge ordsize terrStrong degree ContainRelig ContainEthno LeftNoReligEthno PureRelig PureEthno ReligEthno ContainRelig2 ContainEthno2 Islam
group_code
50 Animal Liberation Front (ALF) 0 US 2 0 30 0 0 1 0 0 0 0 0 0 0 0 0
89 Army of God 0 US 2 1 24 0 0 0 1 0 0 1 0 0 1 0 0
113 Cambodian Freedom Fighters (CFF) 0 US 2 0 8 0 0 0 0 1 0 0 0 0 0 1 0
126 Coalition to Save the Preserves (CSP) 0 US 2 0 6 0 0 0 0 0 0 0 0 0 0 0 0
153 Earth Liberation Front (ELF) 0 US 2 0 14 0 0 1 0 0 0 0 0 0 0 0 0

Slicing

  • Get specific values from the dataframe.
  • Pandas has several slice operators.
    • iloc can be used to index the row by ordered integer. i.e. first row is 0, second row is 1, etc. Use this option sparingly. Better practice to use the index you have created.
    • loc uses the named index and columns.
  • Index using [row, columns]
  • For multiple columns, put your column names in a list
  • Use : for all values
  • Notice that the output keeps the index names.

In [10]:
baad_covars.loc[:, 'fatalities'].head()


Out[10]:
group_code
50     0
89     1
113    0
126    0
153    0
Name: fatalities, dtype: int64

Look at the data


In [15]:
baad_covars.loc[:,['OrgAge']].plot.density()
print(baad_covars.loc[:,['OrgAge']].mean())
baad_covars.loc[:,['fatalities']].plot.hist(bins=20)


OrgAge    11.270886
dtype: float64
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d5cb00>

Slicing Using Conditionals

  • Put conditionals in parentheses
  • Stack multiple conditionals using:
    • & when both conditions must always apply
    • | when at least one condition must apply

In [89]:
baad_covars.loc[(baad_covars.fatalities>1) | (baad_covars.degree>=1),
               ['group','country']].head()


Out[89]:
group country
group_code
50 Animal Liberation Front (ALF) US
153 Earth Liberation Front (ELF) US
30035 Maras Salvatruchas US
10042 Group of Guerilla Combatants of Jose Maria Mor... Mexico
246 Justice Army of the Defenseless People Mexico

Handling Missing Values

First lets make some:


In [90]:
baad_covars.loc[(baad_covars.fatalities>1) | (baad_covars.degree>=1),
               ['terrStrong']] = None
baad_covars.loc[(baad_covars.fatalities>1) | (baad_covars.degree>=1),
               ['terrStrong']].head()


Out[90]:
terrStrong
group_code
50 NaN
153 NaN
30035 NaN
10042 NaN
246 NaN

Handling Missing Values

We could index by them


In [91]:
baad_covars.loc[baad_covars.terrStrong.isnull(),'terrStrong'].head()


Out[91]:
group_code
50      NaN
153     NaN
30035   NaN
10042   NaN
246     NaN
Name: terrStrong, dtype: float64

Handling Missing Values

We could fill them:


In [92]:
baad_covars['terrStrong'] = baad_covars.terrStrong.fillna(-77)
baad_covars.terrStrong.head()


Out[92]:
group_code
50    -77.0
89      0.0
113     0.0
126     0.0
153   -77.0
Name: terrStrong, dtype: float64

Making New Columns

Assign values to a new column based on other columns:


In [93]:
baad_covars['big'] = 0
baad_covars.loc[(baad_covars.fatalities>1) | 
                (baad_covars.degree>=1),
               'big']=1
baad_covars.big.head()


Out[93]:
group_code
50     1
89     0
113    0
126    0
153    1
Name: big, dtype: int64

Reindexing: Pop the index out without losing it


In [95]:
baad_covars.reset_index(inplace=True)
baad_covars.head()


Out[95]:
group_code group statespond country ccode fatalities OrgAge ordsize terrStrong degree ContainRelig ContainEthno LeftNoReligEthno PureRelig PureEthno ReligEthno ContainRelig2 ContainEthno2 Islam big
0 50 Animal Liberation Front (ALF) 0 US 2 0 30 0 -77.0 1 0 0 0 0 0 0 0 0 0 1
1 89 Army of God 0 US 2 1 24 0 0.0 0 1 0 0 1 0 0 1 0 0 0
2 113 Cambodian Freedom Fighters (CFF) 0 US 2 0 8 0 0.0 0 0 1 0 0 0 0 0 1 0 0
3 126 Coalition to Save the Preserves (CSP) 0 US 2 0 6 0 0.0 0 0 0 0 0 0 0 0 0 0 0
4 153 Earth Liberation Front (ELF) 0 US 2 0 14 0 -77.0 1 0 0 0 0 0 0 0 0 0 1

Set a multi-index


In [96]:
baad_covars.set_index(['group','country'],inplace = True)
baad_covars.head()


Out[96]:
group_code statespond ccode fatalities OrgAge ordsize terrStrong degree ContainRelig ContainEthno LeftNoReligEthno PureRelig PureEthno ReligEthno ContainRelig2 ContainEthno2 Islam big
group country
Animal Liberation Front (ALF) US 50 0 2 0 30 0 -77.0 1 0 0 0 0 0 0 0 0 0 1
Army of God US 89 0 2 1 24 0 0.0 0 1 0 0 1 0 0 1 0 0 0
Cambodian Freedom Fighters (CFF) US 113 0 2 0 8 0 0.0 0 0 1 0 0 0 0 0 1 0 0
Coalition to Save the Preserves (CSP) US 126 0 2 0 6 0 0.0 0 0 0 0 0 0 0 0 0 0 0
Earth Liberation Front (ELF) US 153 0 2 0 14 0 -77.0 1 0 0 0 0 0 0 0 0 0 1

Using the new index, make a new dataframe

  • Note the new slicing operator for multi-index

In [103]:
indonesia_grps = baad_covars.xs('Indonesia',level = 'country',drop_level=False)
indonesia_grps = indonesia_grps.loc[indonesia_grps.fatalities>=1,['degree','ContainRelig',
                                            'ContainEthno','terrStrong',
                                            'ordsize','OrgAge']]
indonesia_grps.head()


Out[103]:
degree ContainRelig ContainEthno terrStrong ordsize OrgAge
group country
Free Aceh Movement (GAM) Indonesia 1 1 1 -77.0 2 31
Jemaah Islamiya (JI) Indonesia 2 1 0 -77.0 1 13
Laskar Jihad Indonesia 3 1 1 -77.0 0 6
South Maluku Republic (RMS) Indonesia 0 0 1 -77.0 2 8

Warning: Making copies

  • If you set a variable as equal to an object, Python creates a reference rather than copying the whole object. More efficient, unless you really want to make a copy

In [67]:
little_df = pd.DataFrame([1,2,3,4,5],columns = ['A'])
little_df['B']=[0,1,0,1,1]
copied_df = little_df
print('before:')
print(copied_df)
little_df.loc[little_df.A == 3,'B'] = 7
print('after')
copied_df


before:
   A  B
0  1  0
1  2  1
2  3  0
3  4  1
4  5  1
after
Out[67]:
A B
0 1 0
1 2 1
2 3 7
3 4 1
4 5 1

What happened?

  • copied_df changed when little_df changed.
  • Let's fix that: import "copy"

In [68]:
import copy
little_df = pd.DataFrame([1,2,3,4,5],columns = ['A'])
little_df['B']=[0,1,0,1,1]
copied_df = little_df.copy()
print('before:')
print(copied_df)
little_df.loc[little_df.A == 3,'B'] = 7
print('after')
copied_df


before:
   A  B
0  1  0
1  2  1
2  3  0
3  4  1
4  5  1
after
Out[68]:
A B
0 1 0
1 2 1
2 3 0
3 4 1
4 5 1

Saving

  • Unlike R or Stata, can't just save your workspace
  • Save as a csv now that we have the data we want
  • Pickle a variable to recreate it without having to reset indexes, etc.

In [107]:
indonesia_grps.to_csv('indonesia.csv')
pickle.dump(indonesia_grps, open('indonesia.p','wb'))
indonesia_grps = pickle.load(open('indonesia.p','rb'))

Next time:

  • Scrape Data from the internet
  • Clean the data
  • Merge that data into our data
  • Run a basic stats and ML model

In [ ]:


In [ ]:

Until then, here's some reference code on merging your data sets

Merging and Concatenating

  • Merges automatically if shared index

In [73]:
C = pd.DataFrame(['apple','orange','grape','pear','banana'],
                 columns = ['C'],
                 index = [2,4,3,0,1])
little_df['C'] = C
little_df


Out[73]:
A B C
0 1 0 pear
1 2 1 banana
2 3 7 apple
3 4 1 grape
4 5 1 orange

Joins

  • Same as SQL, inner and outer

In [85]:
C = pd.DataFrame(['apple','orange','grape','apple'],
                 columns = ['C'],
                 index = [2,4,3,'a'])
C['cuts']=['slices','wedges','whole','spirals']
print('C:')
print(C)
print('Inner: Intersection')
print(little_df.merge(right=C,
                how='inner',
                on=None,
                left_index = True,
                right_index =True))
print('Outer: Keep all rows')
print(little_df.merge(right=C,
                how='outer',
                on=None,
                left_index = True,
                right_index =True))

print('Left: Keep little_df')
print(little_df.merge(right=C,
                how='left',
                on=None,
                left_index = True,
                right_index =True))
print('Right: Keep C')
print(little_df.merge(right=C,
                how='right',
                on=None,
                left_index = True,
                right_index =True))

print('Outer, merging on column instead of index')
print(little_df.merge(right=C,
                how='outer',
                on='C',
                left_index = True,
                right_index =True))


C:
        C     cuts
2   apple   slices
4  orange   wedges
3   grape    whole
a   apple  spirals
Inner: Intersection
   A  B     C_x     C_y    cuts
2  3  7   apple   apple  slices
3  4  1   grape   grape   whole
4  5  1  orange  orange  wedges
Outer: Keep all rows
     A    B     C_x     C_y     cuts
0  1.0  0.0    pear     NaN      NaN
1  2.0  1.0  banana     NaN      NaN
2  3.0  7.0   apple   apple   slices
3  4.0  1.0   grape   grape    whole
4  5.0  1.0  orange  orange   wedges
a  NaN  NaN     NaN   apple  spirals
Left: Keep little_df
   A  B     C_x     C_y    cuts
0  1  0    pear     NaN     NaN
1  2  1  banana     NaN     NaN
2  3  7   apple   apple  slices
3  4  1   grape   grape   whole
4  5  1  orange  orange  wedges
Right: Keep C
     A    B     C_x     C_y     cuts
2  3.0  7.0   apple   apple   slices
4  5.0  1.0  orange  orange   wedges
3  4.0  1.0   grape   grape    whole
a  NaN  NaN     NaN   apple  spirals
Outer, merging on column instead of index
     A    B       C     cuts
0  1.0  0.0    pear      NaN
1  2.0  1.0  banana      NaN
2  3.0  7.0   apple   slices
3  4.0  1.0   grape    whole
4  5.0  1.0  orange   wedges
a  NaN  NaN   apple  spirals
/home/law98/anaconda/envs/p3env/lib/python3.6/site-packages/pandas/core/indexes/base.py:3033: RuntimeWarning: '<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects
  return this.join(other, how=how, return_indexers=return_indexers)

Concatenate

  • Stack dataframes on top of one another
  • Stack dataframes beside one another

In [94]:
add_df = pd.DataFrame({'A':[6],'B':[7],'C':'peach'},index= ['p'])
little_df = pd.concat([little_df,add_df])
little_df


Out[94]:
A B C
0 1 0 pear
1 2 1 banana
2 3 7 apple
3 4 1 grape
4 5 1 orange
0 6 7 peach
p 6 7 peach