Pandas is a widely used Python package for data analysis. We will mainly focus on pandas.DataFrame. Informally, you can think of a dataframe as an advanced relational table (or a spreadsheet in an Excel file) on which you can quickly perform many useful tasks, such as aggregate analysis, data enrichment, and selection.
In this notebook, we focus on supporting common SQL operations on dataframes. More documentation can be found at the official pandas website (https://pandas.pydata.org/).
We also use the plotly plotting library. You need to install it, register an account, and perform the initialization (cf. https://plot.ly/python/getting-started/) before the following code can work. Alternatively, comment out the plotly imports below and the code that generates the plot.
In [13]:
import pandas as pd
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go
We can easily create a DataFrame from a CSV or a tab-delimited file. You can also create a dataframe object in other ways, e.g., from multiple Series or from a dictionary.
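For instance, here is a minimal sketch (with made-up toy values) of building a dataframe from a dictionary, where each key becomes a column:
In [ ]:
# a minimal sketch with toy values: each dict key becomes a column
pd.DataFrame({'item': ['TV', 'radio'], 'units_sold': [10, 20]})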
Then we can use
- shape to see the number of rows and columns
- head(x) or tail(x) to view the first or last x rows of the DataFrame; the default value of x is 5
pandas is also smart in that it will print only part of the dataframe's content if it is too large.
In [14]:
df = pd.read_csv('./asset/lecture_data.txt', sep='\t') # to read an excel file, use read_excel()
df.head()
Out[14]:
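Now that df is loaded, shape and tail() work as described above; a quick check:
In [ ]:
print(df.shape)  # (number of rows, number of columns)
df.tail(3)       # the last 3 rows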
In [15]:
df
Out[15]:
In [16]:
df.describe()
Out[16]:
In [17]:
df.location.head() # equivalent to df['location'].head()
Out[17]:
In [18]:
# show location and dollars_sold (columns 0 and 3), selected by position
df.iloc[:, [0, 3]].head()
# df[['location','dollars_sold']].head() # also okay, selecting by name
Out[18]:
In [19]:
df[df['location'] == 'Vancouver'].head()
Out[19]:
When dealing with multiple conditions, pandas uses & and | (instead of SQL's AND and OR). Each condition must be parenthesized.
In [20]:
# location at Vancouver and dollars_sold more than 500, except the 1st quarter
df[(df['location'] == 'Vancouver') & (df['time'] != 'Q1') & (df['dollars_sold'] > 500)]
Out[20]:
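SQL's IN can be expressed with isin(); prefixing the mask with ~ gives NOT IN. A small sketch using the columns above:
In [ ]:
# rows whose time is Q1 or Q2 (like SQL's time IN ('Q1','Q2')); put ~ in front of the mask for NOT IN
df[df['time'].isin(['Q1', 'Q2'])].head()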
Under the hood, each condition evaluates to a Boolean array (a mask) that selects the matching rows.
In [21]:
df['time'] == 'Q1'
Out[21]:
We use size() to return the number of rows in each group (like COUNT(*) in SQL).
In [22]:
df.groupby('location').size()
Out[22]:
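For SQL's COUNT(column), which skips NULLs, count() on a single column is the closer analogue:
In [ ]:
# number of non-null dollars_sold values per location
df.groupby('location')['dollars_sold'].count()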
We use agg() to apply multiple aggregate functions at once, and pass a list of columns to groupby() to group by multiple columns.
In [23]:
df.groupby(['location','item']).agg({'dollars_sold': ['mean', 'sum']}) # 'mean'/'sum' are the idiomatic spellings; np.mean/np.sum also work
Out[23]:
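SQL's HAVING, which filters groups by an aggregate, can be approximated with filter() on the grouped object; the threshold below is made up for illustration:
In [ ]:
# keep only locations whose total dollars_sold exceeds 2000 (like HAVING SUM(dollars_sold) > 2000)
df.groupby('location').filter(lambda g: g['dollars_sold'].sum() > 2000)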
In [24]:
# import another table about the population of each city
df2 = pd.read_csv('./asset/population_1.txt', sep='\t')
df2.head()
Out[24]:
We use pd.merge() to join two DataFrames, where you can specify the join key via the on parameter.
In [25]:
pd.merge(df,df2,on='location').head()
Out[25]:
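If the join columns had different names in the two tables, left_on and right_on could be used instead of on; here we rename df2's key purely for illustration:
In [ ]:
df2_renamed = df2.rename(columns={'location': 'city'})  # pretend the key column has a different name
pd.merge(df, df2_renamed, left_on='location', right_on='city').head()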
We can also specify the join type via the how parameter; the default is 'inner', and 'left', 'right', and 'outer' are also supported.
In [26]:
pd.merge(df,df2,on='location',how='left').tail()
Out[26]:
In [27]:
# import another part of the population table
df3=pd.read_csv('./asset/population_2.txt', sep='\t')
df3
Out[27]:
Use pd.concat() to union two tables without removing duplicates (i.e., UNION ALL in SQL).
In [28]:
pd.concat([df2,df3])
Out[28]:
Use drop_duplicates() to remove duplicate rows; concat() followed by drop_duplicates() thus corresponds to UNION in SQL.
In [29]:
pd.concat([df2,df3]).drop_duplicates()
Out[29]:
We can also add a new column computed from existing ones, e.g., a Boolean flag marking big cities:
In [30]:
df_city = pd.concat([df2,df3]).drop_duplicates()
df_city['big city'] = pd.Series(df_city['population'] > 1000000, index=df_city.index)
df_city
Out[30]:
pivot_table() cross-tabulates the data like a spreadsheet pivot table; here we aggregate over location (rows) and time (columns):
In [31]:
table = pd.pivot_table(df, index='location', columns='time', aggfunc='sum')
table
Out[31]:
Passing margins=True adds the subtotal row/column (labelled All):
In [32]:
pd.pivot_table(df, index=['location', 'item'], columns='time', aggfunc='sum', margins=True)
Out[32]:
In [33]:
# build one bar trace per quarter: x is the location, y is that quarter's dollars_sold
trace1 = go.Bar(
    x=table.index,
    y=table.dollars_sold.Q1,
    name='Q1'
)
trace2 = go.Bar(
    x=table.index,
    y=table.dollars_sold.Q2,
    name='Q2'
)
trace3 = go.Bar(
    x=table.index,
    y=table.dollars_sold.Q3,
    name='Q3'
)
trace4 = go.Bar(
    x=table.index,
    y=table.dollars_sold.Q4,
    name='Q4'
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    barmode='group'  # place the quarters side by side within each location
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='sales-plot')
Out[33]:
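SQL's ORDER BY corresponds to sort_values(), which the exercise below also needs; a quick sketch:
In [ ]:
# sort rows by dollars_sold, largest first
df.sort_values('dollars_sold', ascending=False).head()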
For each item in each city (location), add one extra column giving the average dollars_sold per person (i.e., the total dollars sold for the item in the city divided by the city's population), and sort the rows by this average.
Your output should look like the one below.
In [ ]:
In [ ]:
In [ ]: