In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
Databases are designed to minimize redundancy and maintain data integrity, particularly when data is added, changed, or deleted.
Side note: you may also have to think about isolation level when working with a database where someone may be updating data as you're trying to read it. The isolation level determines the database read behavior in this situation. See https://en.wikipedia.org/wiki/Isolation_(database_systems).
key
values that can be matched. Usually one table has a primary key
and the other table has a foreign key
.In this class, we will cover table joining, merging and concatenation. We will also go over using some of the time-series handling capabilities in Pandas.
In [2]:
import pandas as pd
import numpy as np
To introduce join operations, we will be working with the AdventureWorks dataset, a standard dataset from Microsoft SLQ Server for learing to work with databases. It contains data for the fictitious bicycle manufacturer (Adventure Works Cycles).
Let's starts by importing some tables from AdventureWorks in /home/data/AdventureWorks. These tables contain data on AdventureWorks employees, sales territories, customers, and orders placed by the customers.
In [3]:
Employees = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
Territory = pd.read_excel('/home/data/AdventureWorks/SalesTerritory.xls')
Customers = pd.read_excel('/home/data/AdventureWorks/Customers.xls')
Orders = pd.read_excel('/home/data/AdventureWorks/ItemsOrdered.xls')
Let's take a look at the data we'll be working with:
In [4]:
Employees.head()
Out[4]:
In [5]:
Territory.head()
Out[5]:
In [6]:
Customers.head()
Out[6]:
In [7]:
Orders.head()
Out[7]:
Let's construct a slightly artificial example. Suppose that AdventureWorks was formed by merging two companies, AdventuresUSA which operated in the US and AdventuresWorld, which operated in other countries. Now we want information on their combined sales territories.
The Pandas "concat" function is good for stacking tables on top of each other. We will use it to combine the AdventuresUSA and AdventuresWorld territories data tables.
In [8]:
help(pd.concat)
In [9]:
# constructing the territory tables... as noted, this is an artificial example
TerritoryUSA = Territory[Territory.CountryCode=='US']; TerritoryUSA['RepID'] = np.random.randint(1,1000,5)
TerritoryWorld = Territory[Territory.CountryCode!='US']
In [10]:
TerritoryUSA
Out[10]:
In [11]:
TerritoryWorld
Out[11]:
In [12]:
# we'll concatenate the databases, but keep separate keys so that we can keep track of which entries came from AdventuresUSA and
# which from AdventuresWorld.
# We'll use "join='inner'" to only keep colunms that are common to both tables;
# that is, we will drop the no-longer needed RepID in AdventuresUSA.
Territory2 = pd.concat([TerritoryUSA, TerritoryWorld], keys=['usa', 'world'], join='inner')
In [13]:
Territory2
Out[13]:
Pandas "append" behaves just like "concat" with axis=0 and join='outer' (i.e., keep all column names). Missing values are set to NaN.
In [14]:
help(pd.DataFrame.append)
In [15]:
Territory3 = TerritoryUSA.append(TerritoryWorld)
In [16]:
Territory3
Out[16]:
Join and merge are powerful tools for working with multiple tables. We will use them to answer some questions about the AdventureWorks dataset that you might encounter in real-life situations.
Join does fast table joining on a shared index. Merge does the same thing, but gives you the option to specify columns to join on. The idea of joining on a column will become clearer with some examples.
From AdventureWorks, we have a table "Employees" that gives a lot of information about AdventureWorks employees, like 'EmployeeID', 'ManagerID', 'TerritoryID', 'Title', 'FirstName','MiddleName', 'LastName', 'Suffix', 'JobTitle', 'NationalIDNumber', 'BirthDate', 'MaritalStatus', 'Gender', 'HireDate', 'SalariedFlag', 'VacationHours', 'SickLeaveHours', 'PhoneNumber', 'PhoneNumberType', 'EmailAddress', 'AddressLine1', 'AddressLine2', 'City', 'StateProvinceName', 'PostalCode', 'CountryName'. \
Since we're just being asked for a list of employees, we'll give the EmployeeID and their first, middle, and last names, and their role in the company (since additional information is requested for salespeople only). Then, for the salespeople, we must attach information about their sales territories, which is contained in the Territories table.
Notice that the Employees table has a column 'TerritoryID', which corresponds to the primary key in the 'Territory' table (in 'Territory', each territory has a unique 'TerritoryID'). We'll do a join on TerritoryID.
In [17]:
help(pd.merge)
In [18]:
Ans = pd.merge(Employees.loc[:,["EmployeeID","FirstName","MiddleName","LastName","JobTitle","TerritoryID"]],
Territory,
how='left', on='TerritoryID')
Ans.head()
Out[18]:
In [19]:
# Overachiever answer:
Ans['EmployeeName'] = Ans[["FirstName","MiddleName","LastName"]].apply(lambda x: x.LastName+", "+x.FirstName+" "+str(x.MiddleName), axis=1)
Ans = Ans[['EmployeeName', 'EmployeeID', 'JobTitle', 'TerritoryID', 'Name', 'CountryCode', 'Region', 'SalesYTD', 'SalesLastYear']]
Ans
Out[19]:
In [20]:
Ans2 = Ans[Ans.JobTitle=='Sales Representative']
Ans2
Out[20]:
In [21]:
# Overachiever: What about *all* employees associated with sales?
Ans2 = Ans[Ans["JobTitle"].apply(lambda x: 'Sales' in x)]
Ans2
Out[21]:
This looks like another question for "merge"! We have a list of customers with their addresses, and we have a list of territories, but they are in separate tables.
Let's recover a list of customer names and IDs, together with corresponding sales territory names.
This time, we have to be careful, because "TerritoryID" in the Territory table matches "SalesTerritoryID" in the table Customers. So, we'll have to specify different columns names to merge on for the two tables.
In [22]:
Ans3 = pd.merge(Customers[["CustomerID","FirstName","LastName","SalesTerritoryID"]],
Territory[["TerritoryID","Name"]],
how='left',
left_on='SalesTerritoryID', right_on='TerritoryID', )
Ans3
Out[22]:
In [23]:
Ans = pd.merge(Territory, Customers, how="inner", left_on="TerritoryID", right_on="SalesTerritoryID")
In [24]:
Ans
Out[24]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [25]:
# In-class exercise! :)
In [34]:
Customers[Customers.StateName=="North Carolina"].CustomerID.count()
Out[34]:
In [ ]:
In [ ]:
In [26]:
# We'll use the Orders table for this! In-class exercise :)
In [41]:
Orders['TotalItemPrice'] = Orders.Quantity * Orders.Price
In [ ]:
In [ ]:
In [27]:
# In-class exercise! :)
In [52]:
pd.merge(Customers[["FirstName","LastName","CustomerID"]],Orders[["CustomerID","TotalItemPrice"]], how="inner", on="CustomerID").groupby(["CustomerID","FirstName","LastName"]).sum()
Out[52]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [28]:
help(pd.DataFrame.combine_first)
In [29]:
help(pd.DataFrame.update)
In [30]:
Customers
Out[30]: