Pandas and SQL

Pandas is a widely used Python package for data analysis. We will mainly focus on pandas.DataFrame. Informally, you can think of a DataFrame as an advanced relational table (or a spreadsheet in an Excel file) on which you can quickly perform many useful tasks, such as aggregate analysis, data enrichment, and selection.

In this notebook, we focus on expressing common SQL operations on DataFrames. More documentation can be found at the official pandas site (https://pandas.pydata.org/docs/).

We also use the plotly plotting library. You need to install it, register an account, and perform the initialization (cf. https://plot.ly/python/getting-started/) before the following code will work. Alternatively, simply comment out the plotly imports below and the code that generates the plot.

Import Modules


In [13]:
import pandas as pd
import numpy as np
import plotly.plotly as py  # legacy online API; moved to the chart_studio package in plotly >= 4
import plotly.graph_objs as go

Import data

We can easily create a DataFrame from a CSV or a tab-delimited file. You can also create a DataFrame object in other ways, e.g., from multiple Series or from a dictionary (see the sketch below).

Then we can

  • use shape to see the number of rows and columns
  • use head(x) or tail(x) to view the first or last x rows of the DataFrame. The default value of x is 5.
  • note that pandas will intelligently print only part of the DataFrame if it is too large.
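
For example, a minimal sketch of building a small DataFrame from a Python dictionary (the values here are made up for illustration):

toy = pd.DataFrame({'city': ['Vancouver', 'Toronto'],
                    'population': [630500, 2615000]})  # each key becomes a column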

In [14]:
df = pd.read_csv('./asset/lecture_data.txt', sep='\t')  # to read an Excel file, use read_excel()
df.head()


Out[14]:
location time item dollars_sold
0 Vancouver Q1 home entertainment 605
1 Vancouver Q2 home entertainment 680
2 Vancouver Q3 home entertainment 812
3 Vancouver Q4 home entertainment 927
4 New York Q1 home entertainment 1087

In [15]:
df


Out[15]:
location time item dollars_sold
0 Vancouver Q1 home entertainment 605
1 Vancouver Q2 home entertainment 680
2 Vancouver Q3 home entertainment 812
3 Vancouver Q4 home entertainment 927
4 New York Q1 home entertainment 1087
5 New York Q2 home entertainment 1130
6 New York Q3 home entertainment 1034
7 New York Q4 home entertainment 1142
8 Toronto Q1 home entertainment 818
9 Toronto Q2 home entertainment 894
10 Toronto Q3 home entertainment 940
11 Toronto Q4 home entertainment 978
12 Vancouver Q1 computer 825
13 Vancouver Q2 computer 952
14 Vancouver Q3 computer 1023
15 Vancouver Q4 computer 1038
16 New York Q1 computer 968
17 New York Q2 computer 1024
18 New York Q3 computer 1048
19 New York Q4 computer 1091
20 Toronto Q1 computer 746
21 Toronto Q2 computer 769
22 Toronto Q3 computer 795
23 Toronto Q4 computer 864
24 Vancouver Q1 phone 14
25 Vancouver Q2 phone 31
26 Vancouver Q3 phone 30
27 Vancouver Q4 phone 38
28 New York Q1 phone 38
29 New York Q2 phone 41
30 New York Q3 phone 45
31 New York Q4 phone 54
32 Toronto Q1 phone 43
33 Toronto Q2 phone 52
34 Toronto Q3 phone 58
35 Toronto Q4 phone 59
36 Vancouver Q1 security 400
37 Vancouver Q2 security 512
38 Vancouver Q3 security 501
39 Vancouver Q4 security 580
40 New York Q1 security 872
41 New York Q2 security 925
42 New York Q3 security 1002
43 New York Q4 security 984
44 Toronto Q1 security 591
45 Toronto Q2 security 682
46 Toronto Q3 security 728
47 Toronto Q4 security 784

In [16]:
df.describe()


Out[16]:
dollars_sold
count 48.000000
mean 651.125000
std 392.167443
min 14.000000
25% 314.750000
50% 789.500000
75% 970.500000
max 1142.000000
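
By default, describe() summarizes only the numeric columns. To also cover the categorical columns (location, time, item), you can pass include='all':

df.describe(include='all')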

SELECT

Columns can be selected by name; to select by (0-based) position, use iloc.


In [17]:
df.location.head()


Out[17]:
0    Vancouver
1    Vancouver
2    Vancouver
3    Vancouver
4     New York
Name: location, dtype: object

In [18]:
# show location and dollars_sold
df.iloc[:, [0, 3]].head()  # select by (0-based) column position
# df[['location','dollars_sold']].head() # select by name; also okay


Out[18]:
location dollars_sold
0 Vancouver 605
1 Vancouver 680
2 Vancouver 812
3 Vancouver 927
4 New York 1087
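
The analogue of SQL's SELECT DISTINCT is unique() on a single column, or drop_duplicates() on a column selection; a minimal sketch:

df['location'].unique()                      # distinct values of one column
df[['location', 'item']].drop_duplicates()   # distinct (location, item) pairs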

WHERE


In [19]:
df[df['location'] == 'Vancouver'].head()


Out[19]:
location time item dollars_sold
0 Vancouver Q1 home entertainment 605
1 Vancouver Q2 home entertainment 680
2 Vancouver Q3 home entertainment 812
3 Vancouver Q4 home entertainment 927
12 Vancouver Q1 computer 825

When combining multiple conditions, pandas uses & and | rather than and and or. Each condition must be wrapped in parentheses, because & and | bind more tightly than the comparison operators.


In [20]:
# location at Vancouver and dollars_sold more than 500, except the 1st quarter
df[(df['location'] == 'Vancouver') & (df['time'] != 'Q1') & (df['dollars_sold'] > 500)]


Out[20]:
location time item dollars_sold
1 Vancouver Q2 home entertainment 680
2 Vancouver Q3 home entertainment 812
3 Vancouver Q4 home entertainment 927
13 Vancouver Q2 computer 952
14 Vancouver Q3 computer 1023
15 Vancouver Q4 computer 1038
37 Vancouver Q2 security 512
38 Vancouver Q3 security 501
39 Vancouver Q4 security 580

Under the hood, each condition evaluates to a Boolean Series, which is then used to mask the rows.


In [21]:
df['time'] == 'Q1'


Out[21]:
0      True
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12     True
13    False
14    False
15    False
16     True
17    False
18    False
19    False
20     True
21    False
22    False
23    False
24     True
25    False
26    False
27    False
28     True
29    False
30    False
31    False
32     True
33    False
34    False
35    False
36     True
37    False
38    False
39    False
40     True
41    False
42    False
43    False
44     True
45    False
46    False
47    False
Name: time, dtype: bool
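
Similarly, a SQL IN condition maps to isin(), and ~ negates a condition:

df[df['location'].isin(['Toronto', 'Vancouver'])].head()   # WHERE location IN ('Toronto', 'Vancouver')
df[~(df['time'] == 'Q1')].head()                           # WHERE NOT (time = 'Q1')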

GROUP BY

In pandas, we use groupby() to split a dataset into groups; we can then apply a function (e.g., an aggregate) to each group and combine the results.

We use size() to return the number of rows in each group (like COUNT in SQL).


In [22]:
df.groupby('location').size()


Out[22]:
location
New York     16
Toronto      16
Vancouver    16
dtype: int64

We use agg() to apply multiple aggregate functions at once, and pass a list of columns to groupby() to group by multiple columns.


In [23]:
df.groupby(['location', 'item']).agg({'dollars_sold': ['mean', 'sum']})


Out[23]:
dollars_sold
mean sum
location item
New York computer 1032.75 4131
home entertainment 1098.25 4393
phone 44.50 178
security 945.75 3783
Toronto computer 793.50 3174
home entertainment 907.50 3630
phone 53.00 212
security 696.25 2785
Vancouver computer 959.50 3838
home entertainment 756.00 3024
phone 28.25 113
security 498.25 1993
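
SQL's HAVING clause (a filter on aggregates) has no dedicated keyword in pandas; one simple approach is to filter the aggregated result. A minimal sketch that keeps only the groups whose total sales exceed 3000:

totals = df.groupby(['location', 'item'])['dollars_sold'].sum()
totals[totals > 3000]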

JOIN


In [24]:
# import another table about the population of each city
df2 = pd.read_csv('./asset/population_1.txt', sep='\t')
df2.head()


Out[24]:
location population
0 New York 8406000
1 Vancouver 630500
2 Sydney 4293000

We use pd.merge() to join two DataFrames; the on parameter specifies the join key.


In [25]:
pd.merge(df,df2,on='location').head()


Out[25]:
location time item dollars_sold population
0 Vancouver Q1 home entertainment 605 630500
1 Vancouver Q2 home entertainment 680 630500
2 Vancouver Q3 home entertainment 812 630500
3 Vancouver Q4 home entertainment 927 630500
4 Vancouver Q1 computer 825 630500

We can also specify the join type via the how parameter (the default is an inner join). With a left join, every row of df is kept; Toronto does not appear in df2, so its population comes out as NaN:


In [26]:
pd.merge(df,df2,on='location',how='left').tail()


Out[26]:
location time item dollars_sold population
43 New York Q4 security 984 8406000.0
44 Toronto Q1 security 591 NaN
45 Toronto Q2 security 682 NaN
46 Toronto Q3 security 728 NaN
47 Toronto Q4 security 784 NaN
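
If the join keys had different names in the two tables, merge() accepts left_on and right_on instead of on. For illustration only (the column name city below is hypothetical):

pd.merge(df, df2.rename(columns={'location': 'city'}),
         left_on='location', right_on='city').head()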

UNION


In [27]:
# import another part of the population table
df3=pd.read_csv('./asset/population_2.txt', sep='\t')
df3


Out[27]:
location population
0 New York 8406000
1 Toronto 2615000

Use pd.concat() to union two tables without removing duplicates (i.e., UNION ALL in SQL).


In [28]:
pd.concat([df2,df3])


Out[28]:
location population
0 New York 8406000
1 Vancouver 630500
2 Sydney 4293000
0 New York 8406000
1 Toronto 2615000

Use drop_duplicates() to remove duplicate rows (turning UNION ALL into UNION).


In [29]:
pd.concat([df2,df3]).drop_duplicates()


Out[29]:
location population
0 New York 8406000
1 Vancouver 630500
2 Sydney 4293000
1 Toronto 2615000
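
Note that concat() keeps the original row labels, which is why the index above reads 0, 1, 2, 1. Pass ignore_index=True to renumber the rows:

pd.concat([df2, df3], ignore_index=True).drop_duplicates()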

Adding Columns


In [30]:
df_city = pd.concat([df2,df3]).drop_duplicates()
df_city['big city'] = df_city['population'] > 1000000  # Boolean column: population above one million
df_city


Out[30]:
location population big city
0 New York 8406000 True
1 Vancouver 630500 False
2 Sydney 4293000 True
1 Toronto 2615000 True
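
An equivalent, chainable way to add the column is assign(); since the column name contains a space, it must be passed via keyword unpacking:

df_city.assign(**{'big city': df_city['population'] > 1000000})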

Pivoting and Visualization


In [31]:
table = pd.pivot_table(df, index='location', columns='time', values='dollars_sold', aggfunc='sum')
table


Out[31]:
time Q1 Q2 Q3 Q4
location
New York 2965 3120 3129 3271
Toronto 2198 2397 2521 2685
Vancouver 1844 2175 2366 2583

In [32]:
pd.pivot_table(df, index=['location', 'item'], columns='time', values='dollars_sold', aggfunc='sum', margins=True)


Out[32]:
time Q1 Q2 Q3 Q4 All
location item
New York computer 968.0 1024.0 1048.0 1091.0 4131.0
home entertainment 1087.0 1130.0 1034.0 1142.0 4393.0
phone 38.0 41.0 45.0 54.0 178.0
security 872.0 925.0 1002.0 984.0 3783.0
Toronto computer 746.0 769.0 795.0 864.0 3174.0
home entertainment 818.0 894.0 940.0 978.0 3630.0
phone 43.0 52.0 58.0 59.0 212.0
security 591.0 682.0 728.0 784.0 2785.0
Vancouver computer 825.0 952.0 1023.0 1038.0 3838.0
home entertainment 605.0 680.0 812.0 927.0 3024.0
phone 14.0 31.0 30.0 38.0 113.0
security 400.0 512.0 501.0 580.0 1993.0
All 7007.0 7692.0 8016.0 8539.0 31254.0

In [33]:
# one bar trace per quarter, grouped by location
data = [go.Bar(x=table.index, y=table[q], name=q)
        for q in ['Q1', 'Q2', 'Q3', 'Q4']]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='sales-plot')


Out[33]:
[grouped bar chart: quarterly dollars_sold for each location]

Exercise

For each item in each city (location), add one extra column giving the average dollars_sold per person (i.e., the total dollars sold for that item in that city divided by the city's population), and sort the rows by this average.

Your output should look like the example below.
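
As a hint for the sorting part, the ORDER BY equivalent in pandas is sort_values():

df.sort_values('dollars_sold', ascending=False).head()   # ORDER BY dollars_sold DESC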


In [ ]:


In [ ]:


In [ ]: