Pandas and SQL

Pandas is a widely used Python package for data analysis. We will mainly focus on pandas.DataFrame. Informally, you can think of a DataFrame as an advanced relational table (or a spreadsheet in an Excel file) on which you can quickly perform many useful tasks, such as aggregate analysis, data enrichment, and selection.

In this notebook, we focus on expressing common SQL operations on DataFrames. More documentation can be found at the official pandas site (https://pandas.pydata.org/docs/).

We also use the plotly plotting library. You need to install it, register an account, and perform the initialization (cf. https://plot.ly/python/getting-started/) before the following code will work. Alternatively, simply comment out the plotly imports below and the code that generates the plot.

Import Modules


In [13]:
import pandas as pd
import numpy as np
import plotly.plotly as py  # legacy online API; moved to the chart_studio package in plotly >= 4
import plotly.graph_objs as go

Import data

We can easily create a DataFrame from a CSV or a tab-delimited file. You can also create a DataFrame object in other ways, e.g., from multiple Series or from a dictionary (see the sketch below).

Then we can

  • use shape to see the number of rows and columns
  • use head(x) or tail(x) to view the first or last x rows of the DataFrame. The default value of x is 5.
  • note that pandas will intelligently print only part of the DataFrame if it is too large.
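
For example, a minimal sketch of building a small DataFrame from a Python dictionary (the values here are made up for illustration):

toy = pd.DataFrame({'city': ['Vancouver', 'Toronto'],
                    'population': [630500, 2615000]})  # each key becomes a column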

In [14]:
df = pd.read_csv('./asset/lecture_data.txt', sep='\t')  # to read an Excel file, use read_excel()
df.head()


Out[14]:
location time item dollars_sold
0 Vancouver Q1 home entertainment 605
1 Vancouver Q2 home entertainment 680
2 Vancouver Q3 home entertainment 812
3 Vancouver Q4 home entertainment 927
4 New York Q1 home entertainment 1087

In [15]:
df


Out[15]:
location time item dollars_sold
0 Vancouver Q1 home entertainment 605
1 Vancouver Q2 home entertainment 680
2 Vancouver Q3 home entertainment 812
3 Vancouver Q4 home entertainment 927
4 New York Q1 home entertainment 1087
5 New York Q2 home entertainment 1130
6 New York Q3 home entertainment 1034
7 New York Q4 home entertainment 1142
8 Toronto Q1 home entertainment 818
9 Toronto Q2 home entertainment 894
10 Toronto Q3 home entertainment 940
11 Toronto Q4 home entertainment 978
12 Vancouver Q1 computer 825
13 Vancouver Q2 computer 952
14 Vancouver Q3 computer 1023
15 Vancouver Q4 computer 1038
16 New York Q1 computer 968
17 New York Q2 computer 1024
18 New York Q3 computer 1048
19 New York Q4 computer 1091
20 Toronto Q1 computer 746
21 Toronto Q2 computer 769
22 Toronto Q3 computer 795
23 Toronto Q4 computer 864
24 Vancouver Q1 phone 14
25 Vancouver Q2 phone 31
26 Vancouver Q3 phone 30
27 Vancouver Q4 phone 38
28 New York Q1 phone 38
29 New York Q2 phone 41
30 New York Q3 phone 45
31 New York Q4 phone 54
32 Toronto Q1 phone 43
33 Toronto Q2 phone 52
34 Toronto Q3 phone 58
35 Toronto Q4 phone 59
36 Vancouver Q1 security 400
37 Vancouver Q2 security 512
38 Vancouver Q3 security 501
39 Vancouver Q4 security 580
40 New York Q1 security 872
41 New York Q2 security 925
42 New York Q3 security 1002
43 New York Q4 security 984
44 Toronto Q1 security 591
45 Toronto Q2 security 682
46 Toronto Q3 security 728
47 Toronto Q4 security 784

In [16]:
df.describe()


Out[16]:
dollars_sold
count 48.000000
mean 651.125000
std 392.167443
min 14.000000
25% 314.750000
50% 789.500000
75% 970.500000
max 1142.000000
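
By default, describe() summarizes only the numeric columns. To also cover the categorical columns (location, time, item), you can pass include='all':

df.describe(include='all')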

SELECT

Columns can be selected by name; to select by (0-based) position, use iloc.


In [17]:
df.location.head()


Out[17]:
0    Vancouver
1    Vancouver
2    Vancouver
3    Vancouver
4     New York
Name: location, dtype: object

In [18]:
# show location and dollars_sold
df.iloc[:, [0, 3]].head()  # select by (0-based) column position
# df[['location','dollars_sold']].head() # select by name; also okay


Out[18]:
location dollars_sold
0 Vancouver 605
1 Vancouver 680
2 Vancouver 812
3 Vancouver 927
4 New York 1087
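
The analogue of SQL's SELECT DISTINCT is unique() on a single column, or drop_duplicates() on a column selection; a minimal sketch:

df['location'].unique()                      # distinct values of one column
df[['location', 'item']].drop_duplicates()   # distinct (location, item) pairs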

WHERE


In [19]:
df[df['location'] == 'Vancouver'].head()


Out[19]:
location time item dollars_sold
0 Vancouver Q1 home entertainment 605
1 Vancouver Q2 home entertainment 680
2 Vancouver Q3 home entertainment 812
3 Vancouver Q4 home entertainment 927
12 Vancouver Q1 computer 825

When combining multiple conditions, pandas uses & and | rather than and and or. Each condition must be wrapped in parentheses, because & and | bind more tightly than the comparison operators.


In [20]:
# location at Vancouver and dollars_sold more than 500, except the 1st quarter
df[(df['location'] == 'Vancouver') & (df['time'] != 'Q1') & (df['dollars_sold'] > 500)]


Out[20]:
location time item dollars_sold
1 Vancouver Q2 home entertainment 680
2 Vancouver Q3 home entertainment 812
3 Vancouver Q4 home entertainment 927
13 Vancouver Q2 computer 952
14 Vancouver Q3 computer 1023
15 Vancouver Q4 computer 1038
37 Vancouver Q2 security 512
38 Vancouver Q3 security 501
39 Vancouver Q4 security 580

Under the hood, each condition evaluates to a Boolean Series, which is then used to mask the rows.


In [21]:
df['time'] == 'Q1'


Out[21]:
0      True
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12     True
13    False
14    False
15    False
16     True
17    False
18    False
19    False
20     True
21    False
22    False
23    False
24     True
25    False
26    False
27    False
28     True
29    False
30    False
31    False
32     True
33    False
34    False
35    False
36     True
37    False
38    False
39    False
40     True
41    False
42    False
43    False
44     True
45    False
46    False
47    False
Name: time, dtype: bool
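
Similarly, a SQL IN condition maps to isin(), and ~ negates a condition:

df[df['location'].isin(['Toronto', 'Vancouver'])].head()   # WHERE location IN ('Toronto', 'Vancouver')
df[~(df['time'] == 'Q1')].head()                           # WHERE NOT (time = 'Q1')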

GROUP BY

In pandas, we use groupby() to split a dataset into groups; we can then apply a function (e.g., an aggregate) to each group and combine the results.

We use size() to return the number of rows in each group (like COUNT in SQL).


In [22]:
df.groupby('location').size()


Out[22]:
location
New York     16
Toronto      16
Vancouver    16
dtype: int64

We use agg() to apply multiple aggregate functions at once, and pass a list of columns to groupby() to group by multiple columns.


In [23]:
df.groupby(['location', 'item']).agg({'dollars_sold': ['mean', 'sum']})


Out[23]:
dollars_sold
mean sum
location item
New York computer 1032.75 4131
home entertainment 1098.25 4393
phone 44.50 178
security 945.75 3783
Toronto computer 793.50 3174
home entertainment 907.50 3630
phone 53.00 212
security 696.25 2785
Vancouver computer 959.50 3838
home entertainment 756.00 3024
phone 28.25 113
security 498.25 1993
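
SQL's HAVING clause (a filter on aggregates) has no dedicated keyword in pandas; one simple approach is to filter the aggregated result. A minimal sketch that keeps only the groups whose total sales exceed 3000:

totals = df.groupby(['location', 'item'])['dollars_sold'].sum()
totals[totals > 3000]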

JOIN


In [24]:
# import another table about the population of each city
df2 = pd.read_csv('./asset/population_1.txt', sep='\t')
df2.head()


Out[24]:
location population
0 New York 8406000
1 Vancouver 630500
2 Sydney 4293000

We use pd.merge() to join two DataFrames; the on parameter specifies the join key.


In [25]:
pd.merge(df,df2,on='location').head()


Out[25]:
location time item dollars_sold population
0 Vancouver Q1 home entertainment 605 630500
1 Vancouver Q2 home entertainment 680 630500
2 Vancouver Q3 home entertainment 812 630500
3 Vancouver Q4 home entertainment 927 630500
4 Vancouver Q1 computer 825 630500

We can also specify the join type via the how parameter (the default is an inner join). With a left join, every row of df is kept; Toronto does not appear in df2, so its population comes out as NaN:


In [26]:
pd.merge(df,df2,on='location',how='left').tail()


Out[26]:
location time item dollars_sold population
43 New York Q4 security 984 8406000.0
44 Toronto Q1 security 591 NaN
45 Toronto Q2 security 682 NaN
46 Toronto Q3 security 728 NaN
47 Toronto Q4 security 784 NaN
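
If the join keys had different names in the two tables, merge() accepts left_on and right_on instead of on. For illustration only (the column name city below is hypothetical):

pd.merge(df, df2.rename(columns={'location': 'city'}),
         left_on='location', right_on='city').head()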

UNION


In [27]:
# import another part of the population table
df3=pd.read_csv('./asset/population_2.txt', sep='\t')
df3


Out[27]:
location population
0 New York 8406000
1 Toronto 2615000

Use pd.concat() to union two tables without removing duplicates (i.e., UNION ALL in SQL).


In [28]:
pd.concat([df2,df3])


Out[28]:
location population
0 New York 8406000
1 Vancouver 630500
2 Sydney 4293000
0 New York 8406000
1 Toronto 2615000

Use drop_duplicates() to remove duplicate rows (turning UNION ALL into UNION).


In [29]:
pd.concat([df2,df3]).drop_duplicates()


Out[29]:
location population
0 New York 8406000
1 Vancouver 630500
2 Sydney 4293000
1 Toronto 2615000
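
Note that concat() keeps the original row labels, which is why the index above reads 0, 1, 2, 1. Pass ignore_index=True to renumber the rows:

pd.concat([df2, df3], ignore_index=True).drop_duplicates()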

Adding Columns


In [30]:
df_city = pd.concat([df2,df3]).drop_duplicates()
df_city['big city'] = df_city['population'] > 1000000  # Boolean column: population above one million
df_city


Out[30]:
location population big city
0 New York 8406000 True
1 Vancouver 630500 False
2 Sydney 4293000 True
1 Toronto 2615000 True
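
An equivalent, chainable way to add the column is assign(); since the column name contains a space, it must be passed via keyword unpacking:

df_city.assign(**{'big city': df_city['population'] > 1000000})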

Pivoting and Visualization


In [31]:
table = pd.pivot_table(df, index='location', columns='time', values='dollars_sold', aggfunc='sum')
table


Out[31]:
time Q1 Q2 Q3 Q4
location
New York 2965 3120 3129 3271
Toronto 2198 2397 2521 2685
Vancouver 1844 2175 2366 2583

In [32]:
pd.pivot_table(df, index=['location', 'item'], columns='time', values='dollars_sold', aggfunc='sum', margins=True)


Out[32]:
time Q1 Q2 Q3 Q4 All
location item
New York computer 968.0 1024.0 1048.0 1091.0 4131.0
home entertainment 1087.0 1130.0 1034.0 1142.0 4393.0
phone 38.0 41.0 45.0 54.0 178.0
security 872.0 925.0 1002.0 984.0 3783.0
Toronto computer 746.0 769.0 795.0 864.0 3174.0
home entertainment 818.0 894.0 940.0 978.0 3630.0
phone 43.0 52.0 58.0 59.0 212.0
security 591.0 682.0 728.0 784.0 2785.0
Vancouver computer 825.0 952.0 1023.0 1038.0 3838.0
home entertainment 605.0 680.0 812.0 927.0 3024.0
phone 14.0 31.0 30.0 38.0 113.0
security 400.0 512.0 501.0 580.0 1993.0
All 7007.0 7692.0 8016.0 8539.0 31254.0

In [33]:
# one bar trace per quarter, grouped by location
data = [go.Bar(x=table.index, y=table[q], name=q)
        for q in ['Q1', 'Q2', 'Q3', 'Q4']]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='sales-plot')


Out[33]:
[grouped bar chart: quarterly dollars_sold for each location]

Exercise

For each item in each city (location), add one extra column giving the average dollars_sold per person (i.e., the total dollars sold for that item in that city divided by the city's population), and sort the rows by this average.

Your output should look like the example below.
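
As a hint for the sorting part, the ORDER BY equivalent in pandas is sort_values():

df.sort_values('dollars_sold', ascending=False).head()   # ORDER BY dollars_sold DESC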


In [ ]:


In [ ]:


In [ ]: