The *2015_white_house.csv** file contains data on White House employees in 2015, and their salares. Here are the columns:
In [1]:
import pandas as pd
white_house = pd.read_csv("../data/2015_white_house.csv")
print(white_house.shape)
In [2]:
print(white_house.iloc[-1])
In [3]:
white_house
Out[3]:
In [13]:
%matplotlib notebook
import matplotlib.pyplot as plt
plt.hist(white_house["Salary"])
plt.show()
So far we have imported a dataset from a CSV file into a Pandas DataFrame using the read_csv() function. Then we displayed the data, first as a table, and secondly as a historgram.
There are a near infinite number of questions we could possibly ask about this data. But to get started, here are a few example questions that could be asked:
Steps for figuring this out may look like the following:
In [5]:
# Calculate the length of each employee's title and add to the DataFrame
white_house['LengthOfTitle'] = white_house['Position Title'].apply(len)
white_house.head()
Out[5]:
In [6]:
# Plot the length of employee title versus salary to look for correlation
plt.plot(white_house['LengthOfTitle'], white_house['Salary'])
plt.title('How does length of employee titles correlate to salary?')
plt.xlabel('Length of Employee Title')
plt.ylabel('Salary ($)')
Out[6]:
Uh ok, maybe I was wrong about visuallizing being great for detecting correlation ;-)
It looks like there may be a weak positive correlation. But it is really hard to tell.
Maybe we should just numerically calculate the correlation.
Also, it looks like there are some low salary outliers. Should we check to make sure we aren't mixing in monthly salaries with yearly ones?
In [7]:
# Get the values in Pay Basis and figure out how many unique ones there are
types_of_pay_basis = set(white_house['Pay Basis'])
types_of_pay_basis
Out[7]:
Ok, only one pay basis, annually. So that wasn't an issue.
In [8]:
# Compute pairwise correlation of columns, excluding NA/null values
correlations = white_house.corr()
correlations
Out[8]:
In [9]:
# Linear Regression using ordinary least squares
import statsmodels.api as sm
model = sm.OLS(white_house['Salary'], white_house['LengthOfTitle'])
residuals = model.fit()
print(residuals.summary())
So yea, there is a real positive correlation between length of employee title and salary!
In [10]:
total_salary = sum(white_house['Salary'])
total_salary
Out[10]:
The white house pays about $40 Million per year in total salary.
In [11]:
highest_paid = white_house[white_house['Salary'] == max(white_house['Salary'])]
highest_paid
Out[11]:
In [12]:
lowest_paid = white_house[white_house['Salary'] == min(white_house['Salary'])]
lowest_paid
Out[12]:
Wow, who are these poor unpaid schmucks?
This is another multi-step one that is a bit more involved. One approach to solving it might go like the following:
In [ ]: