Python Exercise: Use Pandas to Merge CSV, JSON, SQL files

Data Analysis with Python: Pandas Data Aggregation and Group Operations, by Wes McKinney

Based on an assignment by JT Wolohan

  • This exercise integrates files from three different formats (CSV, JSON, SQL) with pandas
  • Data: Reports of hate crime incidents across agencies in 50 U.S. states by quarter for 2013
    Hate Crimes Dataset from U.S. Department of Justice and FBI [Table 13]
    State               Str   State which reported the data
    Agency type         Str   Type of agency reporting the data
    Religion            Int   Count of hate crimes motivated by religion
    Sexual orientation  Int   Count of hate crimes motivated by sexual orientation
    Ethnicity           Int   Count of hate crimes motivated by ethnicity
    Disability          Int   Count of hate crimes motivated by disability
    Gender              Int   Count of hate crimes motivated by gender
    Gender identity     Int   Count of hate crimes motivated by gender identity
    1Q                  Int   Count of hate crimes reported in Q1
    2Q                  Int   Count of hate crimes reported in Q2
    3Q                  Int   Count of hate crimes reported in Q3
    4Q                  Int   Count of hate crimes reported in Q4
    Population          Int   Population of area for which reporting agency is responsible

Part I: Merge three files into single DF and export as CSV

  • Combine the data from three files into a single data frame and write it to a CSV file.
  • Each file contains a different subset of the variables above.
    hate_crimes_2013.csv
    hate_crimes_2013.json
    hatecrimes.db

Import Libraries: Pandas, SQLite3


In [1]:
import pandas as pd
import sqlite3

Step 1. Import data from CSV file

  • Use the pandas read_csv function to import the data as a DataFrame object
  • Create a new index "ID" that will be comparable across data frames

In [2]:
df1 = pd.read_csv('hate_crimes_2013.csv')
df1['ID'] = df1.index
df1.head()


Out[2]:
State Agency_type Agency_name Ethnicity Disability Gender Gender_Identity ID
0 Alabama Cities Florence 0 0 0 0 0
1 Alabama Cities Hoover 0 0 0 0 1
2 Alabama Cities Prattville 0 0 0 0 2
3 Alabama Cities Tuscaloosa 0 0 0 0 3
4 Alaska Cities Anchorage 0 0 0 0 4
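As an aside, the same ID column can also be produced as a method chain with reset_index; a minimal sketch on invented toy data:

```python
import pandas as pd

# Toy frame standing in for df1
df = pd.DataFrame({'State': ['Alabama', 'Alaska']})

# Equivalent to df['ID'] = df.index, expressed as a chain:
# reset_index() materializes the index as a column named 'index',
# which is then renamed to 'ID'
df = df.reset_index().rename(columns={'index': 'ID'})
```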

Step 2. Import data from JSON file

  • Use the pandas read_json function to import the data as a DataFrame object
  • Create a new index "ID" for comparison between data frames

In [3]:
df2 = pd.read_json('hate_crimes_2013.json')
df2['ID'] = df2.index
df2.head()


Out[3]:
Agency_name Agency_type Population State quarter_1 quarter_2 quarter_3 quarter_4 ID
0 Florence Cities 39481 Alabama 0 1 0 1 0
1 Hoover Cities 84139 Alabama 0 1 0 0 1
2 Prattville Cities 35154 Alabama 1 0 1 0 2
3 Tuscaloosa Cities 94126 Alabama 0 0 1 0 3
4 Anchorage Cities 299455 Alaska 2 3 3 0 4

Step 3. Merge two DataFrames into New DF

  • Left merge with the ID variable as the key

In [4]:
df3 = pd.merge(df1, df2, how='left', left_on='ID', right_on='ID')
df3.head()


Out[4]:
State_x Agency_type_x Agency_name_x Ethnicity Disability Gender Gender_Identity ID Agency_name_y Agency_type_y Population State_y quarter_1 quarter_2 quarter_3 quarter_4
0 Alabama Cities Florence 0 0 0 0 0 Florence Cities 39481 Alabama 0 1 0 1
1 Alabama Cities Hoover 0 0 0 0 1 Hoover Cities 84139 Alabama 0 1 0 0
2 Alabama Cities Prattville 0 0 0 0 2 Prattville Cities 35154 Alabama 1 0 1 0
3 Alabama Cities Tuscaloosa 0 0 0 0 3 Tuscaloosa Cities 94126 Alabama 0 0 1 0
4 Alaska Cities Anchorage 0 0 0 0 4 Anchorage Cities 299455 Alaska 2 3 3 0
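Because df1 and df2 share column names (State, Agency_type, Agency_name), pandas appends _x/_y suffixes to disambiguate them. A minimal sketch on invented toy data showing the on= shorthand and the suffixes parameter, which names the overlaps explicitly:

```python
import pandas as pd

left = pd.DataFrame({'State': ['Alabama'], 'ID': [0]})
right = pd.DataFrame({'State': ['Alabama'], 'Population': [39481], 'ID': [0]})

# on='ID' is shorthand for left_on='ID', right_on='ID';
# suffixes replaces the default ('_x', '_y') on overlapping columns
merged = pd.merge(left, right, how='left', on='ID', suffixes=('_csv', '_json'))
```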

Step 4. Read data from SQL file into Pandas DF

  • Create a connection to the database with sqlite3
  • Read the result of a SQL query on the crimes table into a DataFrame
  • Create an ID variable based on the index

In [5]:
# Create sqlite connection
cnx = sqlite3.connect('hatecrimes.db')

df4 = pd.read_sql_query("SELECT * FROM crimes", cnx)

In [6]:
df4['ID'] = df4.index
df4.head()


Out[6]:
state agType agName vRace vRel vSexOr ID
0 Alabama Cities Florence 2 0 0 0
1 Alabama Cities Hoover 0 0 1 1
2 Alabama Cities Prattville 2 0 0 2
3 Alabama Cities Tuscaloosa 1 0 0 3
4 Alaska Cities Anchorage 8 0 0 4
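read_sql_query accepts any DB-API connection, and it is good practice to close the connection once the data is in memory. A self-contained sketch using an in-memory SQLite database whose table and column names mirror the ones above (the values are invented):

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database so the sketch is self-contained
cnx = sqlite3.connect(':memory:')
cnx.execute("CREATE TABLE crimes (state TEXT, vRace INTEGER)")
cnx.execute("INSERT INTO crimes VALUES ('Alabama', 2)")
cnx.commit()

df = pd.read_sql_query("SELECT * FROM crimes", cnx)
cnx.close()  # the DataFrame is an in-memory copy, so the handle can be released
```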

Step 5. Merge DF4 and create new 'crimesDF'

  • Merge left based on ID variable to create new data frame
  • crimesDF contains data from all three data files

In [7]:
crimesDF = pd.merge(df3, df4, how='left', left_on='ID', right_on='ID')
crimesDF.head()


Out[7]:
State_x Agency_type_x Agency_name_x Ethnicity Disability Gender Gender_Identity ID Agency_name_y Agency_type_y ... quarter_1 quarter_2 quarter_3 quarter_4 state agType agName vRace vRel vSexOr
0 Alabama Cities Florence 0 0 0 0 0 Florence Cities ... 0 1 0 1 Alabama Cities Florence 2 0 0
1 Alabama Cities Hoover 0 0 0 0 1 Hoover Cities ... 0 1 0 0 Alabama Cities Hoover 0 0 1
2 Alabama Cities Prattville 0 0 0 0 2 Prattville Cities ... 1 0 1 0 Alabama Cities Prattville 2 0 0
3 Alabama Cities Tuscaloosa 0 0 0 0 3 Tuscaloosa Cities ... 0 0 1 0 Alabama Cities Tuscaloosa 1 0 0
4 Alaska Cities Anchorage 0 0 0 0 4 Anchorage Cities ... 2 3 3 0 Alaska Cities Anchorage 8 0 0

5 rows × 22 columns

Step 6. Consolidate Merged Dataset and Rename Columns

  • Remove redundant columns from the DF: 'Agency_name_y', 'Agency_type_y', 'State_y', 'state', 'agType', 'agName'
  • Rename columns as needed to create a single inclusive dataset: 'State', 'Agency_Type', 'Agency_Name', 'Race', 'Religion', 'SexOrient'

In [8]:
columns = ['Agency_name_y', 'Agency_type_y', 'State_y', 'state', 'agType', 'agName']
crimesDF.drop(columns, inplace=True, axis=1)

crimesDF.rename(columns={'State_x': 'State', 'Agency_type_x': 'Agency_Type', 'Agency_name_x':'Agency_Name',
                         'vRace': 'Race', 'vRel': 'Religion', 'vSexOr': 'SexOrient',
                          }, inplace=True)
crimesDF.head()


Out[8]:
State Agency_Type Agency_Name Ethnicity Disability Gender Gender_Identity ID Population quarter_1 quarter_2 quarter_3 quarter_4 Race Religion SexOrient
0 Alabama Cities Florence 0 0 0 0 0 39481 0 1 0 1 2 0 0
1 Alabama Cities Hoover 0 0 0 0 1 84139 0 1 0 0 0 0 1
2 Alabama Cities Prattville 0 0 0 0 2 35154 1 0 1 0 2 0 0
3 Alabama Cities Tuscaloosa 0 0 0 0 3 94126 0 0 1 0 1 0 0
4 Alaska Cities Anchorage 0 0 0 0 4 299455 2 3 3 0 8 0 0
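The drop-and-rename pair can also be chained without inplace=True, which leaves the original frame untouched; a sketch with invented toy columns:

```python
import pandas as pd

df = pd.DataFrame({'State_x': ['Alabama'], 'State_y': ['Alabama'], 'vRace': [2]})

# Chained version: each call returns a new DataFrame
clean = (df.drop(columns=['State_y'])
           .rename(columns={'State_x': 'State', 'vRace': 'Race'}))
```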

Step 7. Export Merged DataFrame to CSV file

  • Use comma as separator
  • Use 'utf-8' encoding

In [9]:
crimesDF.to_csv("hate_crimes_13.csv", sep=',', encoding='utf-8')
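Note that to_csv also writes the positional index as an unnamed first column by default; passing index=False drops it. A sketch writing toy data to an in-memory buffer instead of a file:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'State': ['Alabama'], 'Race': [2]})

# index=False omits the unnamed index column from the output
buf = StringIO()
df.to_csv(buf, index=False)
```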

Part II. Grouping By Agency

Question A: "What percentage of Annual Crime happened in each of the four Quarters for 2013?"

Step 1. Select columns, create new DF based on Agency_Type and quarter_1 through quarter_4

  • Calculate new variable "annual" as the sum of the four quarters
  • Create grouped variable based on Agency_Type

In [10]:
columns = ['Agency_Type', 'quarter_1', 'quarter_2', 'quarter_3', 'quarter_4']

agencyDF = crimesDF[columns].copy()

agencyDF['annual'] = agencyDF.loc[:,('quarter_1', 'quarter_2', 'quarter_3', 'quarter_4')].sum(axis=1)
agencyDF.head()


Out[10]:
Agency_Type quarter_1 quarter_2 quarter_3 quarter_4 annual
0 Cities 0 1 0 1 2
1 Cities 0 1 0 0 1
2 Cities 1 0 1 0 2
3 Cities 0 0 1 0 1
4 Cities 2 3 3 0 8

In [11]:
# Select the Cities subset by boolean indexing on Agency_Type
agencyDF[agencyDF.Agency_Type == 'Cities'].head()


Out[11]:
Agency_Type quarter_1 quarter_2 quarter_3 quarter_4 annual
0 Cities 0 1 0 1 2
1 Cities 0 1 0 0 1
2 Cities 1 0 1 0 2
3 Cities 0 0 1 0 1
4 Cities 2 3 3 0 8

Step 2. Group Agency dataframe by Agency_Type

  • Creates a GroupBy object that is used to group the data frame; no aggregation is computed yet

In [12]:
agencyDF.groupby('Agency_Type')


Out[12]:
<pandas.core.groupby.DataFrameGroupBy object at 0x1165bba90>
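The GroupBy object is lazy: nothing is computed until an aggregation method is called on it. A toy sketch (values invented):

```python
import pandas as pd

df = pd.DataFrame({'Agency_Type': ['Cities', 'Cities', 'Universities and Colleges'],
                   'annual': [2, 1, 3]})

grouped = df.groupby('Agency_Type')   # lazy: just records the grouping
sizes = grouped.size()                # rows per group
totals = grouped['annual'].sum()      # per-group aggregate
```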

Step 3. Obtain Quarterly and Annual sums for each Agency_Type

  • Use the groupby method to group by Agency_Type and the sum method to aggregate
  • Returns a data frame with the sum of crime reports by Agency_Type for each quarter and the year

In [13]:
agencyDF[['quarter_1', 'quarter_2', 'quarter_3', 'quarter_4', 'annual']].groupby(agencyDF['Agency_Type']).sum()


Out[13]:
quarter_1 quarter_2 quarter_3 quarter_4 annual
Agency_Type
Cities 1003 1328 1328 1027 4686
Metropolitan Counties 151 218 221 183 773
Nonmetropolitan Counties 23 33 46 36 138
Other Agencies 10 14 16 12 52
State Police Agencies 19 22 20 10 71
Universities and Colleges 43 42 44 58 187

Step 4. Iterate over groups to calculate the percentage for each quarter:

  • Calculate the percentage of annual crime for each quarter
  • Append each group's percentages to an output list
  • Build a DataFrame from the output list, with column labels, and print it

In [14]:
output = []
for name, group in agencyDF.groupby('Agency_Type'):
    prctQ1 = (group['quarter_1'].sum() / group['annual'].sum())*100
    prctQ2 = (group['quarter_2'].sum() / group['annual'].sum())*100
    prctQ3 = (group['quarter_3'].sum() / group['annual'].sum())*100
    prctQ4 = (group['quarter_4'].sum() / group['annual'].sum())*100
    
    output.append({'Agency_Type': name, 'Q1':prctQ1, 'Q2':prctQ2, 'Q3':prctQ3, 'Q4':prctQ4})

DF = pd.DataFrame(output)
print("Percentage of Annual Crime by Annual Quarters for 2013")  
print(DF)


Percentage of Annual Crime by Annual Quarters for 2013
                 Agency_Type         Q1         Q2         Q3         Q4
0                     Cities  21.404183  28.339735  28.339735  21.916347
1      Metropolitan Counties  19.534282  28.201811  28.589909  23.673997
2   Nonmetropolitan Counties  16.666667  23.913043  33.333333  26.086957
3             Other Agencies  19.230769  26.923077  30.769231  23.076923
4      State Police Agencies  26.760563  30.985915  28.169014  14.084507
5  Universities and Colleges  22.994652  22.459893  23.529412  31.016043
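The loop above can also be written without explicit iteration: sum once per group, then divide the quarter columns by the annual column. A sketch on invented toy data:

```python
import pandas as pd

quarters = ['quarter_1', 'quarter_2', 'quarter_3', 'quarter_4']
agencyDF = pd.DataFrame({
    'Agency_Type': ['Cities', 'Cities'],
    'quarter_1': [1, 0], 'quarter_2': [1, 1],
    'quarter_3': [0, 0], 'quarter_4': [1, 0],
})
agencyDF['annual'] = agencyDF[quarters].sum(axis=1)

# Per-group sums, then each quarter as a percentage of the annual total
sums = agencyDF.groupby('Agency_Type').sum()
pct = sums[quarters].div(sums['annual'], axis=0) * 100
```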

Question B: "Which agency type reported maximum number of crimes in 2013?"

  • Use the groupby() and sum() methods to obtain counts for each Agency_Type
  • "The 'Cities' agency type reported the maximum number of crimes in 2013."

In [15]:
agencyDF.groupby('Agency_Type').sum()


Out[15]:
quarter_1 quarter_2 quarter_3 quarter_4 annual
Agency_Type
Cities 1003 1328 1328 1027 4686
Metropolitan Counties 151 218 221 183 773
Nonmetropolitan Counties 23 33 46 36 138
Other Agencies 10 14 16 12 52
State Police Agencies 19 22 20 10 71
Universities and Colleges 43 42 44 58 187
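To extract the winning agency type programmatically rather than reading it off the table, idxmax on the grouped annual sums can be used; toy values for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Agency_Type': ['Cities', 'State Police Agencies', 'Cities'],
                   'annual': [2, 5, 4]})

# idxmax returns the index label (the Agency_Type) with the largest sum
top = df.groupby('Agency_Type')['annual'].sum().idxmax()
```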

Question C: "Which state(s) have the most crimes committed against ethnic or racial groups?"

Step 1. Create data subset for State, Ethnicity, and Race

  • Aggregate crimes based on Ethnicity and Race into new variable: 'Ethnic'

In [16]:
columns = ['State', 'Ethnicity', 'Race']
ethnicDF = crimesDF[columns].copy()

ethnicDF['Ethnic'] = ethnicDF.loc[:,('Ethnicity', 'Race')].sum(axis=1)
ethnicDF.head()


Out[16]:
State Ethnicity Race Ethnic
0 Alabama 0 2 2
1 Alabama 0 0 0
2 Alabama 0 2 2
3 Alabama 0 1 1
4 Alaska 0 8 8

Step 2. Group by 'State' and sum the number of crimes


In [17]:
# Group ethnicDF by State and sum the crime counts per group
ethnicDF.groupby('State').sum()


Out[17]:
Ethnicity Race Ethnic
State
Alabama 0 5 5
Alaska 0 8 8
Arizona 24 73 97
Arkansas 3 17 20
California 115 374 489
Colorado 17 54 71
Connecticut 26 72 98
DC 4 18 22
Delaware 0 9 9
Florida 8 40 48
Georgia 5 29 34
Idaho 3 19 22
Illinois 8 61 69
Indiana 9 49 58
Iowa 1 5 6
Kansas 5 38 43
Kentucky 15 113 128
Louisiana 1 7 8
Maine 0 12 12
Maryland 3 26 29
Massachusetts 43 146 189
Michigan 19 211 230
Minnesota 11 83 94
Mississippi 2 2 4
Missouri 6 51 57
Montana 1 25 26
Nebraska 6 17 23
Nevada 7 31 38
New_Hampshire 1 11 12
New_Jersey 38 187 225
New_Mexico 1 6 7
New_York 36 150 186
North_Carolina 11 67 78
North_Dakota 10 27 37
Ohio 43 226 269
Oklahoma 3 22 25
Oregon 9 32 41
Pennsylvania 1 44 45
Rhode_Island 1 4 5
South_Carolina 3 33 36
South_Dakota 0 10 10
Tennessee 81 59 140
Texas 26 54 80
Utah 9 47 56
Vermont 1 8 9
Virginia 8 70 78
Washington 33 158 191
West_Virginia 2 40 42
Wisconsin 10 22 32
Wyoming 1 0 1
"California had the most crimes committed against ethnic or racial groups in 2013, followed by Ohio, Michigan, New Jersey, Washington, Massachusetts, and New York."
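Rather than scanning the full table, the leading states can be ranked directly with nlargest; a sketch on a small invented subset:

```python
import pandas as pd

ethnicDF = pd.DataFrame({'State': ['California', 'Ohio', 'Wyoming'],
                         'Ethnic': [489, 269, 1]})

# Top two states by combined ethnicity/race-motivated crime counts
top_states = ethnicDF.groupby('State')['Ethnic'].sum().nlargest(2)
```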

Question D: "What percentage of crimes were committed on the basis of disability?"

(Percent of crimes in Disability column over the total number of hate crimes)

Step 1. Subset DF for Disability and calculate the total hate crimes for 2013


In [18]:
columns = ['Disability', 'quarter_1', 'quarter_2', 'quarter_3', 'quarter_4']
disabledDF = crimesDF[columns].copy()

disabledDF['tot2013'] = disabledDF.loc[:,('quarter_1', 'quarter_2', 'quarter_3', 'quarter_4')].sum(axis=1)
disabledDF.head()


Out[18]:
Disability quarter_1 quarter_2 quarter_3 quarter_4 tot2013
0 0 0 1 0 1 2
1 0 0 1 0 0 1
2 0 1 0 1 0 2
3 0 0 0 1 0 1
4 0 2 3 3 0 8

Step 2. Sum disability-motivated hate crimes, sum total hate crimes, calculate the percentage


In [19]:
disabled = disabledDF['Disability'].sum()
disabled


Out[19]:
86

In [20]:
total = disabledDF['tot2013'].sum()
total


Out[20]:
5907

In [21]:
prctDisabled = (disabled/total)*100
prctDisabled


Out[21]:
1.4558997799221263
"Only 1.46% of hate crimes were committed on the basis of disability."