Check Homework HW04

Use this notebook to check your solutions. This notebook will not be graded.


In [ ]:
import pandas as pd
import numpy as np

Now, import your solutions from hw4_answers.py. The following code looks a bit redundant. However, we do this to allow reloading the hw4_answers.py in case you made some changes. Normally, Python assumes that modules don't change and therefore does not try to import them again.


In [ ]:
import hw4_answers
reload(hw4_answers)
from hw4_answers import *

Problem 1

Create a function load_employees that loads the employees table from the file /home/data/AdventureWorks/Employees.xls and sets the index of the DataFrame to the EmployeeID. The function should return a table with the EmployeeID as the index and the remaining 25 columns.


In [ ]:
employees_df = load_employees()
print "Number of rows: %d\nNumber of cols: %d\n" % (employees_df.shape[0], employees_df.shape[1])
print "Head of index: %s\n" % (employees_df.index[:10])
print "Record of employee with ID=999\n"
print employees_df.loc[999]

The output should be

Number of rows: 291
Number of cols: 25

Head of index: Int64Index([259, 278, 204, 78, 255, 66, 270, 22, 161, 124], dtype='int64', name=u'EmployeeID')

Record of employee with ID=999

ManagerID                                      1
TerritoryID                                  NaN
Title                                        NaN
FirstName                               Chadwick
MiddleName                                   NaN
LastName                                   Smith
Suffix                                       NaN
JobTitle                            BI Professor
NationalIDNumber                       123456789
BirthDate                             1967-07-05
MaritalStatus                                  M
Gender                                         M
HireDate              2003-12-31 23:59:59.997000
SalariedFlag                                   0
VacationHours                                 55
SickLeaveHours                                47
PhoneNumber                         555-887-9788
PhoneNumberType                             Work
EmailAddress         chadwick.smith@rentpath.com
AddressLine1                   565 Peachtree Rd.
AddressLine2                                 NaN
City                                     Atlanta
StateProvinceName                        Georgia
PostalCode                                 30084
CountryName                        United States
Name: 999, dtype: object

Problem 2

Define a function getFullName which takes the employees table and a single employee ID as arguments, and returns a string with the full name of the employee in the format "LAST, FIRST MIDDLE". If the given ID does not belong to any employee return the string "UNKNOWN" (in all caps) If no middle name is given only return "LAST, FIRST". Make sure there are not trailing spaces! If only the middle initial is given the return the full name in the format "LAST, FIRST M." with the middle initial followed by a '.'.

Arguments:

  • df (DataFrame): Employee Table
  • empid (int): Employee ID

Returns:

  • String with full name

In [ ]:
for eid in [274, 999, 102]:
    print '%d, "%s"' %(eid, getFullName(employees_df, eid))

The output should be

274, "Jiang, Stephen Y."
999, "Smith, Chadwick"
102, "Mu, Zheng W."

Problem 3

Define a function isSales that takes the job title of an employee as string as an argument and return either True if the job title indicates this person works in sales, and False otherwise.

Argument:

  • jobtitle (str)

Returns:

  • True or False

In [ ]:
for jt in ['Chief Data Scientist', 'Sales Manager', 'Vice President of Sales']:
    if isSales(jt):
        print "The job title '%s' is part of the Sales Department." % jt
    else:
        print "The job title '%s' belongs to a different department." % jt

The output should be

The job title 'Chief Data Scientist' belongs to a different department.
The job title 'Sales Manager' is part of the Sales Department.
The job title 'Vice President of Sales' is part of the Sales Department.

Problem 4

Define a function filterSales with the employee tables as an argument, that returns a new table of the same schema (i.e. columns and index) containing only row of sales people. You should use the isSales function from the previous problem.

Arguments:

  • employees (DataFrame)

Returns:

  • DataFrame with only people form the Sales Department

In [ ]:
sales_df = filterSales(employees_df)
print "Number of rows: %d\nNumber of cols: %d\n" % (sales_df.shape[0], sales_df.shape[1])
print "Head of index: %s\n" % (sales_df.index[:10])
print "Record of sales employee with ID=280\n"
print sales_df.loc[280]

The output should be

Number of rows: 18
Number of cols: 25

Head of index: Int64Index([278, 283, 274, 276, 286, 284, 287, 281, 280, 285], dtype='int64', name=u'EmployeeID')

Record of sales employee with ID=280

ManagerID                             274
TerritoryID                             1
Title                                 NaN
FirstName                          Pamela
MiddleName                              O
LastName                     Ansman-Wolfe
Suffix                                NaN
JobTitle             Sales Representative
NationalIDNumber                 61161660
BirthDate                      1969-01-06
MaritalStatus                           S
Gender                                  F
HireDate              2005-10-01 00:00:00
SalariedFlag                            1
VacationHours                          22
SickLeaveHours                         31
PhoneNumber                  340-555-0193
PhoneNumberType                      Cell
EmailAddress            pamela0@yahoo.com
AddressLine1            636 Vine Hill Way
AddressLine2                          NaN
City                             Portland
StateProvinceName                  Oregon
PostalCode                          97205
CountryName                 United States
Name: 280, dtype: object

Problem 5

Define a function getEmailList with that returns a Series of strings of all email addresses of employees in this state or province. The email addresses should be separated by a given character, usually a comma ',' or semicolon ';'.

Arguments:

  • employees (DataFrame)
  • delimiter (str)

Returns:

  • Series of email addresses, concatenated by the given delimiter. The Series is indexed by the state or province.

In [ ]:
emails = getEmailListByState(sales_df, ", ")
for state in sorted(emails.index):
    print "%15s: %s" % (state, emails[state])

The output should be

        Alberta: garrett1@mapleleafmail.ca
     California: shu0@adventure-works.com
        England: jae0@aol.co.uk
        Gironde: ranjit0@adventure-works.com
        Hamburg: rachel0@adventure-works.com
  Massachusetts: tete0@adventure-works.com
       Michigan: michael9@adventure-works.com
      Minnesota: jillian0@adventure-works.com
        Ontario: josé1@safe-mail.net
         Oregon: pamela0@yahoo.com
      Tennessee: tsvi0@adventure-works.com
           Utah: linda3@adventure-works.com
       Victoria: lynn0@adventure-works.com
     Washington: david8@adventure-works.com, stephen0@adventure-works.com, amy0@yahoo.com, syed0@yahoo.com, brian3@aol.com

Problem 6 (Bonus)

Define a function managementCounts which produces a Series of how many employees report to a manager. The Series is indexed by the ManagerID, the count should be performed on the EmployeeID because this is the only field that is guaranteed to be unique. The resulting Series should be order by the number of employees in descending order.

Arguments:

  • employees (DataFrame)

Returns:

  • Series of counts (int), indexed by ManagerID

In [ ]:
print managementCounts(employees_df)

The output should be

ManagerID
26.0     178
25.0      30
250.0     17
274.0     10
263.0      9
249.0      9
16.0       8
1.0        8
227.0      6
235.0      5
287.0      3
273.0      3
234.0      3
285.0      1
Name: EmployeeID, dtype: int64