Python Exercise: Use Pandas to Merge CSV, JSON, SQL files

Data Analysis with Python: Pandas Data Aggregation and Group Operations, by Wes McKinney

Based on an assignment by JT Wolohan

  • This exercise integrates files from three different formats (CSV, JSON, SQL) with pandas
  • Data: Reports of hate crime incidents across agencies in 50 U.S. states by quarter for 2013
    Hate Crimes Dataset from U.S. Department of Justice and FBI [Table 13]
    State               Str   State which reported the data
    Agency type         Str   Type of agency reporting the data
    Religion            Int   Count of hate crimes motivated by religion
    Sexual orientation  Int   Count of hate crimes motivated by sexual orientation
    Ethnicity           Int   Count of hate crimes motivated by ethnicity
    Disability          Int   Count of hate crimes motivated by disability
    Gender              Int   Count of hate crimes motivated by gender
    Gender identity     Int   Count of hate crimes motivated by gender identity
    1Q                  Int   Count of hate crimes reported in Q1
    2Q                  Int   Count of hate crimes reported in Q2
    3Q                  Int   Count of hate crimes reported in Q3
    4Q                  Int   Count of hate crimes reported in Q4
    Population          Int   Population of area for which reporting agency is responsible

Part I: Merge three files into single DF and export as CSV

  • Combine the data from three files into a single data frame and write it to a CSV file.
  • Each file contains a different subset of the variables above.
    hate_crimes_2013.csv
    hate_crimes_2013.json
    hatecrimes.db

Import Libraries: Pandas, SQLite3


In [1]:
import pandas as pd
import sqlite3

Step 1. Import data from CSV file

  • Use the pandas read_csv function to import the data as a DataFrame object
  • Create a new index "ID" that will be comparable across data frames

In [2]:
df1 = pd.read_csv('hate_crimes_2013.csv')
df1['ID'] = df1.index
df1.head()


Out[2]:
State Agency_type Agency_name Ethnicity Disability Gender Gender_Identity ID
0 Alabama Cities Florence 0 0 0 0 0
1 Alabama Cities Hoover 0 0 0 0 1
2 Alabama Cities Prattville 0 0 0 0 2
3 Alabama Cities Tuscaloosa 0 0 0 0 3
4 Alaska Cities Anchorage 0 0 0 0 4
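As an aside, the same ID column can also be produced as a method chain with reset_index; a minimal sketch on invented toy data:

```python
import pandas as pd

# Toy frame standing in for df1
df = pd.DataFrame({'State': ['Alabama', 'Alaska']})

# Equivalent to df['ID'] = df.index, expressed as a chain:
# reset_index() materializes the index as a column named 'index',
# which is then renamed to 'ID'
df = df.reset_index().rename(columns={'index': 'ID'})
```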

Step 2. Import data from JSON file

  • Use the pandas read_json function to import the data as a DataFrame object
  • Create a new index "ID" for comparison between data frames

In [3]:
df2 = pd.read_json('hate_crimes_2013.json')
df2['ID'] = df2.index
df2.head()


Out[3]:
Agency_name Agency_type Population State quarter_1 quarter_2 quarter_3 quarter_4 ID
0 Florence Cities 39481 Alabama 0 1 0 1 0
1 Hoover Cities 84139 Alabama 0 1 0 0 1
2 Prattville Cities 35154 Alabama 1 0 1 0 2
3 Tuscaloosa Cities 94126 Alabama 0 0 1 0 3
4 Anchorage Cities 299455 Alaska 2 3 3 0 4

Step 3. Merge two DataFrames into New DF

  • Left merge with the ID variable as the key

In [4]:
df3 = pd.merge(df1, df2, how='left', left_on='ID', right_on='ID')
df3.head()


Out[4]:
State_x Agency_type_x Agency_name_x Ethnicity Disability Gender Gender_Identity ID Agency_name_y Agency_type_y Population State_y quarter_1 quarter_2 quarter_3 quarter_4
0 Alabama Cities Florence 0 0 0 0 0 Florence Cities 39481 Alabama 0 1 0 1
1 Alabama Cities Hoover 0 0 0 0 1 Hoover Cities 84139 Alabama 0 1 0 0
2 Alabama Cities Prattville 0 0 0 0 2 Prattville Cities 35154 Alabama 1 0 1 0
3 Alabama Cities Tuscaloosa 0 0 0 0 3 Tuscaloosa Cities 94126 Alabama 0 0 1 0
4 Alaska Cities Anchorage 0 0 0 0 4 Anchorage Cities 299455 Alaska 2 3 3 0
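Because df1 and df2 share column names (State, Agency_type, Agency_name), pandas appends _x/_y suffixes to disambiguate them. A minimal sketch on invented toy data showing the on= shorthand and the suffixes parameter, which names the overlaps explicitly:

```python
import pandas as pd

left = pd.DataFrame({'State': ['Alabama'], 'ID': [0]})
right = pd.DataFrame({'State': ['Alabama'], 'Population': [39481], 'ID': [0]})

# on='ID' is shorthand for left_on='ID', right_on='ID';
# suffixes replaces the default ('_x', '_y') on overlapping columns
merged = pd.merge(left, right, how='left', on='ID', suffixes=('_csv', '_json'))
```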

Step 4. Read data from SQL file into Pandas DF

  • Create a connection to the database with sqlite3
  • Read the result of a SQL query on the crimes table into a DataFrame
  • Create an ID variable based on the index

In [5]:
# Create sqlite connection
cnx = sqlite3.connect('hatecrimes.db')

df4 = pd.read_sql_query("SELECT * FROM crimes", cnx)

In [6]:
df4['ID'] = df4.index
df4.head()


Out[6]:
state agType agName vRace vRel vSexOr ID
0 Alabama Cities Florence 2 0 0 0
1 Alabama Cities Hoover 0 0 1 1
2 Alabama Cities Prattville 2 0 0 2
3 Alabama Cities Tuscaloosa 1 0 0 3
4 Alaska Cities Anchorage 8 0 0 4
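read_sql_query accepts any DB-API connection, and it is good practice to close the connection once the data is in memory. A self-contained sketch using an in-memory SQLite database whose table and column names mirror the ones above (the values are invented):

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database so the sketch is self-contained
cnx = sqlite3.connect(':memory:')
cnx.execute("CREATE TABLE crimes (state TEXT, vRace INTEGER)")
cnx.execute("INSERT INTO crimes VALUES ('Alabama', 2)")
cnx.commit()

df = pd.read_sql_query("SELECT * FROM crimes", cnx)
cnx.close()  # the DataFrame is an in-memory copy, so the handle can be released
```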

Step 5. Merge DF4 and create new 'crimesDF'

  • Merge left based on ID variable to create new data frame
  • crimesDF contains data from all three data files

In [7]:
crimesDF = pd.merge(df3, df4, how='left', left_on='ID', right_on='ID')
crimesDF.head()


Out[7]:
State_x Agency_type_x Agency_name_x Ethnicity Disability Gender Gender_Identity ID Agency_name_y Agency_type_y ... quarter_1 quarter_2 quarter_3 quarter_4 state agType agName vRace vRel vSexOr
0 Alabama Cities Florence 0 0 0 0 0 Florence Cities ... 0 1 0 1 Alabama Cities Florence 2 0 0
1 Alabama Cities Hoover 0 0 0 0 1 Hoover Cities ... 0 1 0 0 Alabama Cities Hoover 0 0 1
2 Alabama Cities Prattville 0 0 0 0 2 Prattville Cities ... 1 0 1 0 Alabama Cities Prattville 2 0 0
3 Alabama Cities Tuscaloosa 0 0 0 0 3 Tuscaloosa Cities ... 0 0 1 0 Alabama Cities Tuscaloosa 1 0 0
4 Alaska Cities Anchorage 0 0 0 0 4 Anchorage Cities ... 2 3 3 0 Alaska Cities Anchorage 8 0 0

5 rows × 22 columns

Step 6. Consolidate Merged Dataset and Rename Columns

  • Remove redundant columns from the DF: 'Agency_name_y', 'Agency_type_y', 'State_y', 'state', 'agType', 'agName'
  • Rename columns as needed to create a single inclusive dataset: 'State', 'Agency_Type', 'Agency_Name', 'Race', 'Religion', 'SexOrient'

In [8]:
columns = ['Agency_name_y', 'Agency_type_y', 'State_y', 'state', 'agType', 'agName']
crimesDF.drop(columns, inplace=True, axis=1)

crimesDF.rename(columns={'State_x': 'State', 'Agency_type_x': 'Agency_Type', 'Agency_name_x':'Agency_Name',
                         'vRace': 'Race', 'vRel': 'Religion', 'vSexOr': 'SexOrient',
                          }, inplace=True)
crimesDF.head()


Out[8]:
State Agency_Type Agency_Name Ethnicity Disability Gender Gender_Identity ID Population quarter_1 quarter_2 quarter_3 quarter_4 Race Religion SexOrient
0 Alabama Cities Florence 0 0 0 0 0 39481 0 1 0 1 2 0 0
1 Alabama Cities Hoover 0 0 0 0 1 84139 0 1 0 0 0 0 1
2 Alabama Cities Prattville 0 0 0 0 2 35154 1 0 1 0 2 0 0
3 Alabama Cities Tuscaloosa 0 0 0 0 3 94126 0 0 1 0 1 0 0
4 Alaska Cities Anchorage 0 0 0 0 4 299455 2 3 3 0 8 0 0
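The drop-and-rename pair can also be chained without inplace=True, which leaves the original frame untouched; a sketch with invented toy columns:

```python
import pandas as pd

df = pd.DataFrame({'State_x': ['Alabama'], 'State_y': ['Alabama'], 'vRace': [2]})

# Chained version: each call returns a new DataFrame
clean = (df.drop(columns=['State_y'])
           .rename(columns={'State_x': 'State', 'vRace': 'Race'}))
```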

Step 7. Export Merged DataFrame to CSV file

  • Use comma as separator
  • Use 'utf-8' encoding

In [9]:
crimesDF.to_csv("hate_crimes_13.csv", sep=',', encoding='utf-8')
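Note that to_csv also writes the positional index as an unnamed first column by default; passing index=False drops it. A sketch writing toy data to an in-memory buffer instead of a file:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'State': ['Alabama'], 'Race': [2]})

# index=False omits the unnamed index column from the output
buf = StringIO()
df.to_csv(buf, index=False)
```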

Part II. Grouping By Agency

Question A: "What percentage of Annual Crime happened in each of the four Quarters for 2013?"

Step 1. Select columns, create new DF based on Agency_Type and quarter_1 through quarter_4

  • Calculate new variable "annual" as the sum of the four quarters
  • Create grouped variable based on Agency_Type

In [10]:
columns = ['Agency_Type', 'quarter_1', 'quarter_2', 'quarter_3', 'quarter_4']

agencyDF = crimesDF[columns].copy()

agencyDF['annual'] = agencyDF.loc[:,('quarter_1', 'quarter_2', 'quarter_3', 'quarter_4')].sum(axis=1)
agencyDF.head()


Out[10]:
Agency_Type quarter_1 quarter_2 quarter_3 quarter_4 annual
0 Cities 0 1 0 1 2
1 Cities 0 1 0 0 1
2 Cities 1 0 1 0 2
3 Cities 0 0 1 0 1
4 Cities 2 3 3 0 8

In [11]:
# Select the Cities subset by boolean indexing on Agency_Type
agencyDF[agencyDF.Agency_Type == 'Cities'].head()


Out[11]:
Agency_Type quarter_1 quarter_2 quarter_3 quarter_4 annual
0 Cities 0 1 0 1 2
1 Cities 0 1 0 0 1
2 Cities 1 0 1 0 2
3 Cities 0 0 1 0 1
4 Cities 2 3 3 0 8

Step 2. Group Agency dataframe by Agency_Type

  • Creates a GroupBy object that is used to group the data frame; no aggregation is computed yet

In [12]:
agencyDF.groupby('Agency_Type')


Out[12]:
<pandas.core.groupby.DataFrameGroupBy object at 0x1165bba90>
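The GroupBy object is lazy: nothing is computed until an aggregation method is called on it. A toy sketch (values invented):

```python
import pandas as pd

df = pd.DataFrame({'Agency_Type': ['Cities', 'Cities', 'Universities and Colleges'],
                   'annual': [2, 1, 3]})

grouped = df.groupby('Agency_Type')   # lazy: just records the grouping
sizes = grouped.size()                # rows per group
totals = grouped['annual'].sum()      # per-group aggregate
```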

Step 3. Obtain Quarterly and Annual sums for each Agency_Type

  • Use the groupby method to group by Agency_Type and the sum method to aggregate
  • Returns a data frame with the sum of crime reports by Agency_Type for each quarter and the year

In [13]:
agencyDF[['quarter_1', 'quarter_2', 'quarter_3', 'quarter_4', 'annual']].groupby(agencyDF['Agency_Type']).sum()


Out[13]:
quarter_1 quarter_2 quarter_3 quarter_4 annual
Agency_Type
Cities 1003 1328 1328 1027 4686
Metropolitan Counties 151 218 221 183 773
Nonmetropolitan Counties 23 33 46 36 138
Other Agencies 10 14 16 12 52
State Police Agencies 19 22 20 10 71
Universities and Colleges 43 42 44 58 187

Step 4. Iterate over groups to calculate the percentage for each quarter:

  • Calculate the percentage of annual crime for each quarter
  • Append each group's percentages to an output list
  • Build a DataFrame from the output list, with column labels, and print it

In [14]:
output = []
for name, group in agencyDF.groupby('Agency_Type'):
    prctQ1 = (group['quarter_1'].sum() / group['annual'].sum())*100
    prctQ2 = (group['quarter_2'].sum() / group['annual'].sum())*100
    prctQ3 = (group['quarter_3'].sum() / group['annual'].sum())*100
    prctQ4 = (group['quarter_4'].sum() / group['annual'].sum())*100
    
    output.append({'Agency_Type': name, 'Q1':prctQ1, 'Q2':prctQ2, 'Q3':prctQ3, 'Q4':prctQ4})

DF = pd.DataFrame(output)
print("Percentage of Annual Crime by Annual Quarters for 2013")  
print(DF)


Percentage of Annual Crime by Annual Quarters for 2013
                 Agency_Type         Q1         Q2         Q3         Q4
0                     Cities  21.404183  28.339735  28.339735  21.916347
1      Metropolitan Counties  19.534282  28.201811  28.589909  23.673997
2   Nonmetropolitan Counties  16.666667  23.913043  33.333333  26.086957
3             Other Agencies  19.230769  26.923077  30.769231  23.076923
4      State Police Agencies  26.760563  30.985915  28.169014  14.084507
5  Universities and Colleges  22.994652  22.459893  23.529412  31.016043
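The loop above can also be written without explicit iteration: sum once per group, then divide the quarter columns by the annual column. A sketch on invented toy data:

```python
import pandas as pd

quarters = ['quarter_1', 'quarter_2', 'quarter_3', 'quarter_4']
agencyDF = pd.DataFrame({
    'Agency_Type': ['Cities', 'Cities'],
    'quarter_1': [1, 0], 'quarter_2': [1, 1],
    'quarter_3': [0, 0], 'quarter_4': [1, 0],
})
agencyDF['annual'] = agencyDF[quarters].sum(axis=1)

# Per-group sums, then each quarter as a percentage of the annual total
sums = agencyDF.groupby('Agency_Type').sum()
pct = sums[quarters].div(sums['annual'], axis=0) * 100
```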

Question B: "Which agency type reported maximum number of crimes in 2013?"

  • Use the groupby() and sum() methods to obtain counts for each Agency_Type
  • "The 'Cities' agency type reported the maximum number of crimes in 2013."

In [15]:
agencyDF.groupby('Agency_Type').sum()


Out[15]:
quarter_1 quarter_2 quarter_3 quarter_4 annual
Agency_Type
Cities 1003 1328 1328 1027 4686
Metropolitan Counties 151 218 221 183 773
Nonmetropolitan Counties 23 33 46 36 138
Other Agencies 10 14 16 12 52
State Police Agencies 19 22 20 10 71
Universities and Colleges 43 42 44 58 187
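To extract the winning agency type programmatically rather than reading it off the table, idxmax on the grouped annual sums can be used; toy values for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Agency_Type': ['Cities', 'State Police Agencies', 'Cities'],
                   'annual': [2, 5, 4]})

# idxmax returns the index label (the Agency_Type) with the largest sum
top = df.groupby('Agency_Type')['annual'].sum().idxmax()
```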

Question C: "Which state(s) have the most crimes committed against ethnic or racial groups?"

Step 1. Create data subset for State, Ethnicity, and Race

  • Aggregate crimes based on Ethnicity and Race into new variable: 'Ethnic'

In [16]:
columns = ['State', 'Ethnicity', 'Race']
ethnicDF = crimesDF[columns].copy()

ethnicDF['Ethnic'] = ethnicDF.loc[:,('Ethnicity', 'Race')].sum(axis=1)
ethnicDF.head()


Out[16]:
State Ethnicity Race Ethnic
0 Alabama 0 2 2
1 Alabama 0 0 0
2 Alabama 0 2 2
3 Alabama 0 1 1
4 Alaska 0 8 8

Step 2. Group by 'State' and sum the number of crimes


In [17]:
# Group ethnicDF by State and sum the crime counts per group
ethnicDF.groupby('State').sum()


Out[17]:
Ethnicity Race Ethnic
State
Alabama 0 5 5
Alaska 0 8 8
Arizona 24 73 97
Arkansas 3 17 20
California 115 374 489
Colorado 17 54 71
Connecticut 26 72 98
DC 4 18 22
Delaware 0 9 9
Florida 8 40 48
Georgia 5 29 34
Idaho 3 19 22
Illinois 8 61 69
Indiana 9 49 58
Iowa 1 5 6
Kansas 5 38 43
Kentucky 15 113 128
Louisiana 1 7 8
Maine 0 12 12
Maryland 3 26 29
Massachusetts 43 146 189
Michigan 19 211 230
Minnesota 11 83 94
Mississippi 2 2 4
Missouri 6 51 57
Montana 1 25 26
Nebraska 6 17 23
Nevada 7 31 38
New_Hampshire 1 11 12
New_Jersey 38 187 225
New_Mexico 1 6 7
New_York 36 150 186
North_Carolina 11 67 78
North_Dakota 10 27 37
Ohio 43 226 269
Oklahoma 3 22 25
Oregon 9 32 41
Pennsylvania 1 44 45
Rhode_Island 1 4 5
South_Carolina 3 33 36
South_Dakota 0 10 10
Tennessee 81 59 140
Texas 26 54 80
Utah 9 47 56
Vermont 1 8 9
Virginia 8 70 78
Washington 33 158 191
West_Virginia 2 40 42
Wisconsin 10 22 32
Wyoming 1 0 1
"California had the most crimes committed against ethnic or racial groups in 2013, followed by Ohio, Michigan, New Jersey, Washington, Massachusetts, and New York."
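Rather than scanning the full table, the leading states can be ranked directly with nlargest; a sketch on a small invented subset:

```python
import pandas as pd

ethnicDF = pd.DataFrame({'State': ['California', 'Ohio', 'Wyoming'],
                         'Ethnic': [489, 269, 1]})

# Top two states by combined ethnicity/race-motivated crime counts
top_states = ethnicDF.groupby('State')['Ethnic'].sum().nlargest(2)
```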

Question D: "What percentage of crimes were committed on the basis of disability?"

(Percent of crimes in Disability column over the total number of hate crimes)

Step 1. Subset DF for Disability and calculate the total hate crimes for 2013


In [18]:
columns = ['Disability', 'quarter_1', 'quarter_2', 'quarter_3', 'quarter_4']
disabledDF = crimesDF[columns].copy()

disabledDF['tot2013'] = disabledDF.loc[:,('quarter_1', 'quarter_2', 'quarter_3', 'quarter_4')].sum(axis=1)
disabledDF.head()


Out[18]:
Disability quarter_1 quarter_2 quarter_3 quarter_4 tot2013
0 0 0 1 0 1 2
1 0 0 1 0 0 1
2 0 1 0 1 0 2
3 0 0 0 1 0 1
4 0 2 3 3 0 8

Step 2. Sum disability-motivated hate crimes, sum total hate crimes, calculate the percentage


In [19]:
disabled = disabledDF['Disability'].sum()
disabled


Out[19]:
86

In [20]:
total = disabledDF['tot2013'].sum()
total


Out[20]:
5907

In [21]:
prctDisabled = (disabled/total)*100
prctDisabled


Out[21]:
1.4558997799221263
"Only 1.46% of hate crimes were committed on the basis of disability."