In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))


Advanced Tables

Why are databases so complex?

  • Data stored in a database may be split into multiple tables, each containing multiple columns. A column stores a single attribute of the data; a table stores a collection of related attributes.
  • The database also keeps track of the relationships between different tables.
  • Databases are designed to minimize redundancy and maintain data integrity, particularly when data is added, changed, or deleted.

    1. Consistency when updating: no duplicate places for information https://en.wikipedia.org/wiki/Database_normalization
    2. Performance https://en.wikipedia.org/wiki/Star_schema
  • Side note: you may also have to think about isolation level when working with a database where someone may be updating data as you're trying to read it. The isolation level determines the database read behavior in this situation. See https://en.wikipedia.org/wiki/Isolation_(database_systems).

Working with multiple tables

  • Two tables can be joined at a time. 'Join' is a binary operator. See https://en.wikipedia.org/wiki/Join_(SQL).
  • Tables must have key values that can be matched. Usually one table has a primary key and the other table has a foreign key.

Pandas

In this class, we will cover table joining, merging and concatenation. We will also go over using some of the time-series handling capabilities in Pandas.


In [2]:
import pandas as pd
import numpy as np

Concatenating tables in Pandas

To introduce join operations, we will be working with the AdventureWorks dataset, a standard dataset from Microsoft SLQ Server for learing to work with databases. It contains data for the fictitious bicycle manufacturer (Adventure Works Cycles).

Let's starts by importing some tables from AdventureWorks in /home/data/AdventureWorks. These tables contain data on AdventureWorks employees, sales territories, customers, and orders placed by the customers.


In [3]:
Employees = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
Territory = pd.read_excel('/home/data/AdventureWorks/SalesTerritory.xls')
Customers = pd.read_excel('/home/data/AdventureWorks/Customers.xls')
Orders = pd.read_excel('/home/data/AdventureWorks/ItemsOrdered.xls')

Let's take a look at the data we'll be working with:


In [4]:
Employees.head()


Out[4]:
EmployeeID ManagerID TerritoryID Title FirstName MiddleName LastName Suffix JobTitle NationalIDNumber ... SickLeaveHours PhoneNumber PhoneNumberType EmailAddress AddressLine1 AddressLine2 City StateProvinceName PostalCode CountryName
0 259 250.0 NaN NaN Ben T Miller NaN Buyer 20269531 ... 47 151-555-0113 Work ben0@adventure-works.com 101 Candy Rd. NaN Redmond Washington 98052 United States
1 278 274.0 6.0 NaN Garrett R Vargas NaN Sales Representative 234474252 ... 36 922-555-0165 Work garrett1@mapleleafmail.ca 10203 Acorn Avenue NaN Calgary Alberta T2P 2G8 Canada
2 204 26.0 NaN NaN Gabe B Mares NaN Production Technician - WC40 440379437 ... 48 310-555-0117 Work gabe0@adventure-works.com 1061 Buskrik Avenue NaN Edmonds Washington 98020 United States
3 78 26.0 NaN NaN Reuben H D'sa NaN Production Supervisor - WC40 370989364 ... 56 191-555-0112 Work reuben0@adventure-works.com 1064 Slow Creek Road NaN Seattle Washington 98104 United States
4 255 250.0 NaN NaN Gordon L Hee NaN Buyer 466142721 ... 46 230-555-0144 Cell gordon0@adventure-works.com 108 Lakeside Court NaN Bellevue Washington 98004 United States

5 rows × 26 columns


In [5]:
Territory.head()


Out[5]:
TerritoryID Name CountryCode Region SalesYTD SalesLastYear
0 1 Northwest US North America 7887186.79 3298694.49
1 2 Northeast US North America 2402176.85 3607148.94
2 3 Central US North America 3072175.12 3205014.08
3 4 Southwest US North America 10510853.87 5366575.71
4 5 Southeast US North America 2538667.25 3925071.43

In [6]:
Customers.head()


Out[6]:
CustomerID SalesTerritoryID FirstName LastName City StateName
0 10101 1 John Gray Lynden Washington
1 10298 4 Leroy Brown Pinetop Arizona
2 10299 1 Elroy Keller Snoqualmie Washington
3 10315 3 Lisa Jones Oshkosh Wisconsin
4 10325 1 Ginger Schultz Pocatello Idaho

In [7]:
Orders.head()


Out[7]:
CustomerID OrderDate Item Quantity Price
0 10330 2004-06-30 Pogo stick 1 28.0
1 10101 2004-06-30 Raft 1 58.0
2 10298 2004-07-01 Skateboard 1 33.0
3 10101 2004-07-01 Life Vest 4 125.0
4 10299 2004-07-06 Parachute 1 1250.0

Let's construct a slightly artificial example. Suppose that AdventureWorks was formed by merging two companies, AdventuresUSA which operated in the US and AdventuresWorld, which operated in other countries. Now we want information on their combined sales territories.

The Pandas "concat" function is good for stacking tables on top of each other. We will use it to combine the AdventuresUSA and AdventuresWorld territories data tables.


In [8]:
help(pd.concat)


Help on function concat in module pandas.tools.merge:

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
    Concatenate pandas objects along a particular axis with optional set logic
    along the other axes. Can also add a layer of hierarchical indexing on the
    concatenation axis, which may be useful if the labels are the same (or
    overlapping) on the passed axis number
    
    Parameters
    ----------
    objs : a sequence or mapping of Series, DataFrame, or Panel objects
        If a dict is passed, the sorted keys will be used as the `keys`
        argument, unless it is passed, in which case the values will be
        selected (see below). Any None objects will be dropped silently unless
        they are all None in which case a ValueError will be raised
    axis : {0, 1, ...}, default 0
        The axis to concatenate along
    join : {'inner', 'outer'}, default 'outer'
        How to handle indexes on other axis(es)
    join_axes : list of Index objects
        Specific indexes to use for the other n - 1 axes instead of performing
        inner/outer set logic
    verify_integrity : boolean, default False
        Check whether the new concatenated axis contains duplicates. This can
        be very expensive relative to the actual data concatenation
    keys : sequence, default None
        If multiple levels passed, should contain tuples. Construct
        hierarchical index using the passed keys as the outermost level
    levels : list of sequences, default None
        Specific levels (unique values) to use for constructing a
        MultiIndex. Otherwise they will be inferred from the keys
    names : list, default None
        Names for the levels in the resulting hierarchical index
    ignore_index : boolean, default False
        If True, do not use the index values along the concatenation axis. The
        resulting axis will be labeled 0, ..., n - 1. This is useful if you are
        concatenating objects where the concatenation axis does not have
        meaningful indexing information. Note the index values on the other
        axes are still respected in the join.
    copy : boolean, default True
        If False, do not copy data unnecessarily
    
    Notes
    -----
    The keys, levels, and names arguments are all optional
    
    Returns
    -------
    concatenated : type of objects


In [9]:
# constructing the territory tables... as noted, this is an artificial example
TerritoryUSA = Territory[Territory.CountryCode=='US']; TerritoryUSA['RepID'] = np.random.randint(1,1000,5)
TerritoryWorld = Territory[Territory.CountryCode!='US']


/usr/lib/python3.4/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

In [10]:
TerritoryUSA


Out[10]:
TerritoryID Name CountryCode Region SalesYTD SalesLastYear RepID
0 1 Northwest US North America 7887186.79 3298694.49 445
1 2 Northeast US North America 2402176.85 3607148.94 717
2 3 Central US North America 3072175.12 3205014.08 328
3 4 Southwest US North America 10510853.87 5366575.71 206
4 5 Southeast US North America 2538667.25 3925071.43 886

In [11]:
TerritoryWorld


Out[11]:
TerritoryID Name CountryCode Region SalesYTD SalesLastYear
5 6 Canada CA North America 6771829.14 5693988.860
6 7 France FR Europe 4772398.31 2396539.760
7 8 Germany DE Europe 3805202.35 1307949.790
8 9 Australia AU Pacific 5977814.92 2278548.980
9 10 United Kingdom GB Europe 5012905.37 1635823.400
10 11 Brazil BR South America 0.00 261589.958
11 12 Mexico MX North America 0.00 0.000

In [12]:
# we'll concatenate the databases, but keep separate keys so that we can keep track of which entries came from AdventuresUSA and 
# which from AdventuresWorld.
# We'll use "join='inner'" to only keep colunms that are common to both tables; 
# that is, we will drop the no-longer needed RepID in AdventuresUSA. 
Territory2 = pd.concat([TerritoryUSA, TerritoryWorld], keys=['usa', 'world'], join='inner')

In [13]:
Territory2


Out[13]:
TerritoryID Name CountryCode Region SalesYTD SalesLastYear
usa 0 1 Northwest US North America 7887186.79 3298694.490
1 2 Northeast US North America 2402176.85 3607148.940
2 3 Central US North America 3072175.12 3205014.080
3 4 Southwest US North America 10510853.87 5366575.710
4 5 Southeast US North America 2538667.25 3925071.430
world 5 6 Canada CA North America 6771829.14 5693988.860
6 7 France FR Europe 4772398.31 2396539.760
7 8 Germany DE Europe 3805202.35 1307949.790
8 9 Australia AU Pacific 5977814.92 2278548.980
9 10 United Kingdom GB Europe 5012905.37 1635823.400
10 11 Brazil BR South America 0.00 261589.958
11 12 Mexico MX North America 0.00 0.000

Pandas "append" behaves just like "concat" with axis=0 and join='outer' (i.e., keep all column names). Missing values are set to NaN.


In [14]:
help(pd.DataFrame.append)


Help on function append in module pandas.core.frame:

append(self, other, ignore_index=False, verify_integrity=False)
    Append rows of `other` to the end of this frame, returning a new
    object. Columns not in this frame are added as new columns.
    
    Parameters
    ----------
    other : DataFrame or Series/dict-like object, or list of these
        The data to append.
    ignore_index : boolean, default False
        If True, do not use the index labels.
    verify_integrity : boolean, default False
        If True, raise ValueError on creating index with duplicates.
    
    Returns
    -------
    appended : DataFrame
    
    Notes
    -----
    If a list of dict/series is passed and the keys are all contained in
    the DataFrame's index, the order of the columns in the resulting
    DataFrame will be unchanged.
    
    See also
    --------
    pandas.concat : General function to concatenate DataFrame, Series
        or Panel objects
    
    Examples
    --------
    
    >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
    >>> df
       A  B
    0  1  2
    1  3  4
    >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
    >>> df.append(df2)
       A  B
    0  1  2
    1  3  4
    0  5  6
    1  7  8
    
    With `ignore_index` set to True:
    
    >>> df.append(df2, ignore_index=True)
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8


In [15]:
Territory3 = TerritoryUSA.append(TerritoryWorld)

In [16]:
Territory3


Out[16]:
CountryCode Name Region RepID SalesLastYear SalesYTD TerritoryID
0 US Northwest North America 445.0 3298694.490 7887186.79 1
1 US Northeast North America 717.0 3607148.940 2402176.85 2
2 US Central North America 328.0 3205014.080 3072175.12 3
3 US Southwest North America 206.0 5366575.710 10510853.87 4
4 US Southeast North America 886.0 3925071.430 2538667.25 5
5 CA Canada North America NaN 5693988.860 6771829.14 6
6 FR France Europe NaN 2396539.760 4772398.31 7
7 DE Germany Europe NaN 1307949.790 3805202.35 8
8 AU Australia Pacific NaN 2278548.980 5977814.92 9
9 GB United Kingdom Europe NaN 1635823.400 5012905.37 10
10 BR Brazil South America NaN 261589.958 0.00 11
11 MX Mexico North America NaN 0.000 0.00 12

Joining and merging tables in Pandas

Join and merge are powerful tools for working with multiple tables. We will use them to answer some questions about the AdventureWorks dataset that you might encounter in real-life situations.

Join does fast table joining on a shared index. Merge does the same thing, but gives you the option to specify columns to join on. The idea of joining on a column will become clearer with some examples.

Example 1. "I want a list of all employees, and if any are salespeople then show me the details about their sales territory"

From AdventureWorks, we have a table "Employees" that gives a lot of information about AdventureWorks employees, like 'EmployeeID', 'ManagerID', 'TerritoryID', 'Title', 'FirstName','MiddleName', 'LastName', 'Suffix', 'JobTitle', 'NationalIDNumber', 'BirthDate', 'MaritalStatus', 'Gender', 'HireDate', 'SalariedFlag', 'VacationHours', 'SickLeaveHours', 'PhoneNumber', 'PhoneNumberType', 'EmailAddress', 'AddressLine1', 'AddressLine2', 'City', 'StateProvinceName', 'PostalCode', 'CountryName'. \

Since we're just being asked for a list of employees, we'll give the EmployeeID and their first, middle, and last names, and their role in the company (since additional information is requested for salespeople only). Then, for the salespeople, we must attach information about their sales territories, which is contained in the Territories table.

Notice that the Employees table has a column 'TerritoryID', which corresponds to the primary key in the 'Territory' table (in 'Territory', each territory has a unique 'TerritoryID'). We'll do a join on TerritoryID.


In [17]:
help(pd.merge)


Help on function merge in module pandas.tools.merge:

merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
    Merge DataFrame objects by performing a database-style join operation by
    columns or indexes.
    
    If joining columns on columns, the DataFrame indexes *will be
    ignored*. Otherwise if joining indexes on indexes or indexes on a column or
    columns, the index will be passed on.
    
    Parameters
    ----------
    left : DataFrame
    right : DataFrame
    how : {'left', 'right', 'outer', 'inner'}, default 'inner'
        * left: use only keys from left frame (SQL: left outer join)
        * right: use only keys from right frame (SQL: right outer join)
        * outer: use union of keys from both frames (SQL: full outer join)
        * inner: use intersection of keys from both frames (SQL: inner join)
    on : label or list
        Field names to join on. Must be found in both DataFrames. If on is
        None and not merging on indexes, then it merges on the intersection of
        the columns by default.
    left_on : label or list, or array-like
        Field names to join on in left DataFrame. Can be a vector or list of
        vectors of the length of the DataFrame to use a particular vector as
        the join key instead of columns
    right_on : label or list, or array-like
        Field names to join on in right DataFrame or vector/list of vectors per
        left_on docs
    left_index : boolean, default False
        Use the index from the left DataFrame as the join key(s). If it is a
        MultiIndex, the number of keys in the other DataFrame (either the index
        or a number of columns) must match the number of levels
    right_index : boolean, default False
        Use the index from the right DataFrame as the join key. Same caveats as
        left_index
    sort : boolean, default False
        Sort the join keys lexicographically in the result DataFrame
    suffixes : 2-length sequence (tuple, list, ...)
        Suffix to apply to overlapping column names in the left and right
        side, respectively
    copy : boolean, default True
        If False, do not copy data unnecessarily
    indicator : boolean or string, default False
        If True, adds a column to output DataFrame called "_merge" with
        information on the source of each row.
        If string, column with information on source of each row will be added to
        output DataFrame, and column will be named value of string.
        Information column is Categorical-type and takes on a value of "left_only"
        for observations whose merge key only appears in 'left' DataFrame,
        "right_only" for observations whose merge key only appears in 'right'
        DataFrame, and "both" if the observation's merge key is found in both.
    
        .. versionadded:: 0.17.0
    
    Examples
    --------
    
    >>> A              >>> B
        lkey value         rkey value
    0   foo  1         0   foo  5
    1   bar  2         1   bar  6
    2   baz  3         2   qux  7
    3   foo  4         3   bar  8
    
    >>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
       lkey  value_x  rkey  value_y
    0  foo   1        foo   5
    1  foo   4        foo   5
    2  bar   2        bar   6
    3  bar   2        bar   8
    4  baz   3        NaN   NaN
    5  NaN   NaN      qux   7
    
    Returns
    -------
    merged : DataFrame
        The output type will the be same as 'left', if it is a subclass
        of DataFrame.


In [18]:
Ans = pd.merge(Employees.loc[:,["EmployeeID","FirstName","MiddleName","LastName","JobTitle","TerritoryID"]], 
               Territory, 
               how='left', on='TerritoryID')
Ans.head()


Out[18]:
EmployeeID FirstName MiddleName LastName JobTitle TerritoryID Name CountryCode Region SalesYTD SalesLastYear
0 259 Ben T Miller Buyer NaN NaN NaN NaN NaN NaN
1 278 Garrett R Vargas Sales Representative 6.0 Canada CA North America 6771829.14 5693988.86
2 204 Gabe B Mares Production Technician - WC40 NaN NaN NaN NaN NaN NaN
3 78 Reuben H D'sa Production Supervisor - WC40 NaN NaN NaN NaN NaN NaN
4 255 Gordon L Hee Buyer NaN NaN NaN NaN NaN NaN

In [19]:
# Overachiever answer:
Ans['EmployeeName'] = Ans[["FirstName","MiddleName","LastName"]].apply(lambda x: x.LastName+", "+x.FirstName+" "+str(x.MiddleName), axis=1)
Ans = Ans[['EmployeeName', 'EmployeeID', 'JobTitle', 'TerritoryID', 'Name', 'CountryCode', 'Region', 'SalesYTD', 'SalesLastYear']]
Ans


Out[19]:
EmployeeName EmployeeID JobTitle TerritoryID Name CountryCode Region SalesYTD SalesLastYear
0 Miller, Ben T 259 Buyer NaN NaN NaN NaN NaN NaN
1 Vargas, Garrett R 278 Sales Representative 6.0 Canada CA North America 6771829.14 5693988.86
2 Mares, Gabe B 204 Production Technician - WC40 NaN NaN NaN NaN NaN NaN
3 D'sa, Reuben H 78 Production Supervisor - WC40 NaN NaN NaN NaN NaN NaN
4 Hee, Gordon L 255 Buyer NaN NaN NaN NaN NaN NaN
5 Khanna, Karan R 66 Production Technician - WC60 NaN NaN NaN NaN NaN NaN
6 Ajenstat, François P 270 Database Administrator NaN NaN NaN NaN NaN NaN
7 Harnpadoungsataya, Sariya E 22 Marketing Specialist NaN NaN NaN NaN NaN NaN
8 Koenigsbauer, Kirk J 161 Production Technician - WC45 NaN NaN NaN NaN NaN NaN
9 Ralls, Kim T 124 Stocker NaN NaN NaN NaN NaN NaN
10 Raheem, Michael nan 10 Research and Development Manager NaN NaN NaN NaN NaN NaN
11 Seamans, Mike K 248 Accountant NaN NaN NaN NaN NaN NaN
12 Koch, Reed T 175 Production Technician - WC30 NaN NaN NaN NaN NaN NaN
13 Fakhouri, Fadi K 155 Production Technician - WC20 NaN NaN NaN NaN NaN NaN
14 Singh, Paul R 144 Production Technician - WC20 NaN NaN NaN NaN NaN NaN
15 Diaz, Brenda M 192 Production Supervisor - WC40 NaN NaN NaN NaN NaN NaN
16 Richins, Jack S 166 Production Supervisor - WC30 NaN NaN NaN NaN NaN NaN
17 Evans, John P 112 Production Technician - WC50 NaN NaN NaN NaN NaN NaN
18 Myer, Ken L 203 Production Technician - WC40 NaN NaN NaN NaN NaN NaN
19 Moreland, Barbara C 245 Accountant NaN NaN NaN NaN NaN NaN
20 Kurjan, Eric S 257 Buyer NaN NaN NaN NaN NaN NaN
21 Walton, Bryan A 244 Accounts Receivable Specialist NaN NaN NaN NaN NaN NaN
22 Masters, Steve F 69 Production Technician - WC60 NaN NaN NaN NaN NaN NaN
23 Newman, Belinda M 210 Production Technician - WC45 NaN NaN NaN NaN NaN NaN
24 Nayberg, Alex M 163 Production Technician - WC45 NaN NaN NaN NaN NaN NaN
25 Hall, Don L 118 Production Technician - WC50 NaN NaN NaN NaN NaN NaN
26 Lawrence, David Oliver 176 Production Technician - WC30 NaN NaN NaN NaN NaN NaN
27 Reátegui Alayo, Sandra nan 147 Production Technician - WC30 NaN NaN NaN NaN NaN NaN
28 Smith, Samantha H 138 Production Technician - WC20 NaN NaN NaN NaN NaN NaN
29 Scardelis, Jim H 95 Production Technician - WC50 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
261 Ortiz, David J 70 Production Technician - WC60 NaN NaN NaN NaN NaN NaN
262 Chai, Sean N 219 Document Control Assistant NaN NaN NaN NaN NaN NaN
263 Watters, Jason M 148 Production Technician - WC30 NaN NaN NaN NaN NaN NaN
264 Samarawickrama, Prasanna E 140 Production Technician - WC20 NaN NaN NaN NaN NaN NaN
265 Varkey Chudukatil, Ranjit R 290 Sales Representative 7.0 France FR Europe 4772398.31 2396539.76
266 Erickson, Gail A 5 Design Engineer NaN NaN NaN NaN NaN NaN
267 Rao, Arvind B 252 Buyer NaN NaN NaN NaN NaN NaN
268 Connelly, Peter I 266 Network Administrator NaN NaN NaN NaN NaN NaN
269 Hedlund, Magnus E 233 Facilities Administrative Assistant NaN NaN NaN NaN NaN NaN
270 Cook, Patrick M 83 Production Technician - WC40 NaN NaN NaN NaN NaN NaN
271 Sandberg, Mikael Q 251 Buyer NaN NaN NaN NaN NaN NaN
272 Liu, David J 241 Accounts Manager NaN NaN NaN NaN NaN NaN
273 Hamilton, James R 25 Vice President of Production NaN NaN NaN NaN NaN NaN
274 Liu, Jinghao K 108 Production Supervisor - WC50 NaN NaN NaN NaN NaN NaN
275 Martin, Mindy C 239 Benefits Specialist NaN NaN NaN NaN NaN NaN
276 Brown, Jo A 27 Production Supervisor - WC60 NaN NaN NaN NaN NaN NaN
277 Kim, Shane S 186 Production Supervisor - WC45 NaN NaN NaN NaN NaN NaN
278 Abercrombie, Kim B 38 Production Technician - WC60 NaN NaN NaN NaN NaN NaN
279 Sam, Raymond K 154 Production Technician - WC20 NaN NaN NaN NaN NaN NaN
280 Krapauskas, Mindaugas J 114 Production Technician - WC50 NaN NaN NaN NaN NaN NaN
281 Kuppa, Vamsi N 123 Shipping and Receiving Clerk NaN NaN NaN NaN NaN NaN
282 McArthur, Mark K 29 Production Technician - WC60 NaN NaN NaN NaN NaN NaN
283 Pournasseh, Houman N 101 Production Technician - WC50 NaN NaN NaN NaN NaN NaN
284 Norred, Chris K 221 Control Specialist NaN NaN NaN NaN NaN NaN
285 Wood, John L 18 Marketing Specialist NaN NaN NaN NaN NaN NaN
286 Kogan, Eugene O 46 Production Technician - WC60 NaN NaN NaN NaN NaN NaN
287 Nusbaum, Tawana G 202 Production Technician - WC40 NaN NaN NaN NaN NaN NaN
288 Pak, Jae B 289 Sales Representative 10.0 United Kingdom GB Europe 5012905.37 1635823.40
289 Valdez, Rachel B 288 Sales Representative 8.0 Germany DE Europe 3805202.35 1307949.79
290 Smith, Chadwick nan 999 BI Professor NaN NaN NaN NaN NaN NaN

291 rows × 9 columns

"For the list above, limit the results to just salespeople"


In [20]:
Ans2 = Ans[Ans.JobTitle=='Sales Representative']
Ans2


Out[20]:
EmployeeName EmployeeID JobTitle TerritoryID Name CountryCode Region SalesYTD SalesLastYear
1 Vargas, Garrett R 278 Sales Representative 6.0 Canada CA North America 6771829.14 5693988.86
54 Campbell, David R 283 Sales Representative 1.0 Northwest US North America 7887186.79 3298694.49
63 Mitchell, Linda C 276 Sales Representative 4.0 Southwest US North America 10510853.87 5366575.71
93 Tsoflias, Lynn N 286 Sales Representative 9.0 Australia AU Pacific 5977814.92 2278548.98
114 Mensa-Annan, Tete A 284 Sales Representative 1.0 Northwest US North America 7887186.79 3298694.49
163 Ito, Shu K 281 Sales Representative 4.0 Southwest US North America 10510853.87 5366575.71
176 Ansman-Wolfe, Pamela O 280 Sales Representative 1.0 Northwest US North America 7887186.79 3298694.49
230 Carson, Jillian nan 277 Sales Representative 3.0 Central US North America 3072175.12 3205014.08
235 Blythe, Michael G 275 Sales Representative 2.0 Northeast US North America 2402176.85 3607148.94
240 Reiter, Tsvi Michael 279 Sales Representative 5.0 Southeast US North America 2538667.25 3925071.43
256 Saraiva, José Edvaldo 282 Sales Representative 6.0 Canada CA North America 6771829.14 5693988.86
265 Varkey Chudukatil, Ranjit R 290 Sales Representative 7.0 France FR Europe 4772398.31 2396539.76
288 Pak, Jae B 289 Sales Representative 10.0 United Kingdom GB Europe 5012905.37 1635823.40
289 Valdez, Rachel B 288 Sales Representative 8.0 Germany DE Europe 3805202.35 1307949.79

In [21]:
# Overachiever: What about *all* employees associated with sales?
Ans2 = Ans[Ans["JobTitle"].apply(lambda x: 'Sales' in x)]
Ans2


Out[21]:
EmployeeName EmployeeID JobTitle TerritoryID Name CountryCode Region SalesYTD SalesLastYear
1 Vargas, Garrett R 278 Sales Representative 6.0 Canada CA North America 6771829.14 5693988.86
54 Campbell, David R 283 Sales Representative 1.0 Northwest US North America 7887186.79 3298694.49
59 Jiang, Stephen Y 274 North American Sales Manager NaN NaN NaN NaN NaN NaN
63 Mitchell, Linda C 276 Sales Representative 4.0 Southwest US North America 10510853.87 5366575.71
93 Tsoflias, Lynn N 286 Sales Representative 9.0 Australia AU Pacific 5977814.92 2278548.98
114 Mensa-Annan, Tete A 284 Sales Representative 1.0 Northwest US North America 7887186.79 3298694.49
135 Alberts, Amy E 287 European Sales Manager NaN NaN NaN NaN NaN NaN
163 Ito, Shu K 281 Sales Representative 4.0 Southwest US North America 10510853.87 5366575.71
176 Ansman-Wolfe, Pamela O 280 Sales Representative 1.0 Northwest US North America 7887186.79 3298694.49
212 Abbas, Syed E 285 Pacific Sales Manager NaN NaN NaN NaN NaN NaN
222 Welcker, Brian S 273 Vice President of Sales NaN NaN NaN NaN NaN NaN
230 Carson, Jillian nan 277 Sales Representative 3.0 Central US North America 3072175.12 3205014.08
235 Blythe, Michael G 275 Sales Representative 2.0 Northeast US North America 2402176.85 3607148.94
240 Reiter, Tsvi Michael 279 Sales Representative 5.0 Southeast US North America 2538667.25 3925071.43
256 Saraiva, José Edvaldo 282 Sales Representative 6.0 Canada CA North America 6771829.14 5693988.86
265 Varkey Chudukatil, Ranjit R 290 Sales Representative 7.0 France FR Europe 4772398.31 2396539.76
288 Pak, Jae B 289 Sales Representative 10.0 United Kingdom GB Europe 5012905.37 1635823.40
289 Valdez, Rachel B 288 Sales Representative 8.0 Germany DE Europe 3805202.35 1307949.79

"Give me a list of our customers, and also tell me which sales territory they fall in."

This looks like another question for "merge"! We have a list of customers with their addresses, and we have a list of territories, but they are in separate tables.

Let's recover a list of customer names and IDs, together with corresponding sales territory names.

This time, we have to be careful, because "TerritoryID" in the Territory table matches "SalesTerritoryID" in the table Customers. So, we'll have to specify different columns names to merge on for the two tables.


In [22]:
Ans3 = pd.merge(Customers[["CustomerID","FirstName","LastName","SalesTerritoryID"]], 
                Territory[["TerritoryID","Name"]], 
                how='left', 
                left_on='SalesTerritoryID', right_on='TerritoryID', )
Ans3


Out[22]:
CustomerID FirstName LastName SalesTerritoryID TerritoryID Name
0 10101 John Gray 1 1 Northwest
1 10298 Leroy Brown 4 4 Southwest
2 10299 Elroy Keller 1 1 Northwest
3 10315 Lisa Jones 3 3 Central
4 10325 Ginger Schultz 1 1 Northwest
5 10329 Kelly Mendoza 5 5 Southeast
6 10330 Shawn Dalton 1 1 Northwest
7 10338 Michael Howell 1 1 Northwest
8 10339 Anthony Sanchez 4 4 Southwest
9 10408 Elroy Cleaver 4 4 Southwest
10 10410 Mary Ann Howell 5 5 Southeast
11 10413 Donald Davids 4 4 Southwest
12 10419 Linda Sakahara 4 4 Southwest
13 10429 Sarah Graham 5 5 Southeast
14 10438 Kevin Smith 3 3 Central
15 10439 Conrad Giles 3 3 Central
16 10449 Isabela Moore 4 4 Southwest

"Give me a list of all sales territories, also show what customers fall under them"


In [23]:
Ans = pd.merge(Territory, Customers, how="inner", left_on="TerritoryID", right_on="SalesTerritoryID")

In [24]:
Ans


Out[24]:
TerritoryID Name CountryCode Region SalesYTD SalesLastYear CustomerID SalesTerritoryID FirstName LastName City StateName
0 1 Northwest US North America 7887186.79 3298694.49 10101 1 John Gray Lynden Washington
1 1 Northwest US North America 7887186.79 3298694.49 10299 1 Elroy Keller Snoqualmie Washington
2 1 Northwest US North America 7887186.79 3298694.49 10325 1 Ginger Schultz Pocatello Idaho
3 1 Northwest US North America 7887186.79 3298694.49 10330 1 Shawn Dalton Cannon Beach Oregon
4 1 Northwest US North America 7887186.79 3298694.49 10338 1 Michael Howell Tillamook Oregon
5 3 Central US North America 3072175.12 3205014.08 10315 3 Lisa Jones Oshkosh Wisconsin
6 3 Central US North America 3072175.12 3205014.08 10438 3 Kevin Smith Durango Colorado
7 3 Central US North America 3072175.12 3205014.08 10439 3 Conrad Giles Telluride Colorado
8 4 Southwest US North America 10510853.87 5366575.71 10298 4 Leroy Brown Pinetop Arizona
9 4 Southwest US North America 10510853.87 5366575.71 10339 4 Anthony Sanchez Winslow Arizona
10 4 Southwest US North America 10510853.87 5366575.71 10408 4 Elroy Cleaver Globe Arizona
11 4 Southwest US North America 10510853.87 5366575.71 10413 4 Donald Davids Gila Bend Arizona
12 4 Southwest US North America 10510853.87 5366575.71 10419 4 Linda Sakahara Nogales Arizona
13 4 Southwest US North America 10510853.87 5366575.71 10449 4 Isabela Moore Yuma Arizona
14 5 Southeast US North America 2538667.25 3925071.43 10329 5 Kelly Mendoza Kailua Hawaii
15 5 Southeast US North America 2538667.25 3925071.43 10410 5 Mary Ann Howell Charleston South Carolina
16 5 Southeast US North America 2538667.25 3925071.43 10429 5 Sarah Graham Greensboro North Carolina

In [ ]:


In [ ]:


In [ ]:


In [ ]:

"Give me a list of the customers we have in North Carolina, and tell me how many there are."


In [25]:
# In-class exercise! :)

In [34]:
Customers[Customers.StateName=="North Carolina"].CustomerID.count()


Out[34]:
1

In [ ]:


In [ ]:

"For each of the items ordered, show the total price (sometimes they ordered more than 1 item)"


In [26]:
# We'll use the Orders table for this! In-class exercise :)

In [41]:
Orders['TotalItemPrice'] = Orders.Quantity * Orders.Price

In [ ]:


In [ ]:

"Show a list of customers, and the total amount of money they have spent with AdventureWorks. I want the highest spenders to appear first!"


In [27]:
# In-class exercise! :)

In [52]:
pd.merge(Customers[["FirstName","LastName","CustomerID"]],Orders[["CustomerID","TotalItemPrice"]], how="inner", on="CustomerID").groupby(["CustomerID","FirstName","LastName"]).sum()


Out[52]:
TotalItemPrice
CustomerID FirstName LastName
10101 John Gray 813.95
10298 Leroy Brown 147.88
10299 Elroy Keller 1288.00
10315 Lisa Jones 8.00
10330 Shawn Dalton 156.75
10339 Anthony Sanchez 4.50
10410 Mary Ann Howell 281.72
10413 Donald Davids 128.00
10438 Kevin Smith 95.24
10439 Conrad Giles 139.00
10449 Isabela Moore 970.79

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Another side note:


In [28]:
help(pd.DataFrame.combine_first)


Help on function combine_first in module pandas.core.frame:

combine_first(self, other)
    Combine two DataFrame objects and default to non-null values in frame
    calling the method. Result index columns will be the union of the
    respective indexes and columns
    
    Parameters
    ----------
    other : DataFrame
    
    Examples
    --------
    a's values prioritized, use values from b to fill holes:
    
    >>> a.combine_first(b)
    
    
    Returns
    -------
    combined : DataFrame


In [29]:
help(pd.DataFrame.update)


Help on function update in module pandas.core.frame:

update(self, other, join='left', overwrite=True, filter_func=None, raise_conflict=False)
    Modify DataFrame in place using non-NA values from passed
    DataFrame. Aligns on indices
    
    Parameters
    ----------
    other : DataFrame, or object coercible into a DataFrame
    join : {'left'}, default 'left'
    overwrite : boolean, default True
        If True then overwrite values for common keys in the calling frame
    filter_func : callable(1d-array) -> 1d-array<boolean>, default None
        Can choose to replace values other than NA. Return True for values
        that should be updated
    raise_conflict : boolean
        If True, will raise an error if the DataFrame and other both
        contain data in the same place.


In [30]:
Customers


Out[30]:
CustomerID SalesTerritoryID FirstName LastName City StateName
0 10101 1 John Gray Lynden Washington
1 10298 4 Leroy Brown Pinetop Arizona
2 10299 1 Elroy Keller Snoqualmie Washington
3 10315 3 Lisa Jones Oshkosh Wisconsin
4 10325 1 Ginger Schultz Pocatello Idaho
5 10329 5 Kelly Mendoza Kailua Hawaii
6 10330 1 Shawn Dalton Cannon Beach Oregon
7 10338 1 Michael Howell Tillamook Oregon
8 10339 4 Anthony Sanchez Winslow Arizona
9 10408 4 Elroy Cleaver Globe Arizona
10 10410 5 Mary Ann Howell Charleston South Carolina
11 10413 4 Donald Davids Gila Bend Arizona
12 10419 4 Linda Sakahara Nogales Arizona
13 10429 5 Sarah Graham Greensboro North Carolina
14 10438 3 Kevin Smith Durango Colorado
15 10439 3 Conrad Giles Telluride Colorado
16 10449 4 Isabela Moore Yuma Arizona