This notebook demonstrates how to read and manipulate data from Excel using Pandas:
First, import the Pandas library:
In [1]:
# The library for handling tabular data is called 'pandas'
# Everyone shortens this to 'pd' for convenience.
import pandas as pd
The IRS website has some aggregated statistics on business returns in Excel files. We will use the Selected Income and Tax Items for Selected Years.
The original data is from the file linked here: https://www.irs.gov/pub/irs-soi/14intaba.xls, but I cleaned it up by hand to remove footnotes and reformat the column and row headers. You can get the cleaned file in this repository data/14intaba_cleaned.xls. It looks like this:
We will use the read_excel function inside of the Pandas library (accessed using pd.read_excel) to get the data. We need to:
When referring to files on your computer from Jupyer, the path you use is relative to the current Jupyter notebook. My directory looks like this:
.
|-- notebooks
|-- input_output.ipynb
|-- data
|- 14intaba_cleaned.xls
so, the relative path from the notebook input_output.ipynb to the dataset 14intaba_cleaned.xls is:
data/14intaba_cleaned.xls
In [2]:
raw = pd.read_excel('data/14intaba_cleaned.xls', skiprows=2)
In [3]:
# Look at the last 3 rows
raw.tail(3)
Out[3]:
There are two sets of data — for the actual dollars for each variable, and also for constant dollars (accounting for inflation). We will split the raw dataset into two and then index the rows by the units (whether they're number of returns or amount paid/claimed).
The columns we care about are Variable, Units, and the current or constant dollars from each year. (You can view them all with raw.columns.)
We can subset the dataset with the columns we want using raw.ix[:, <desired_cols>].
There are a lot of commands in this section...we will do a better job explaining later. For now, ['braces', 'denote', 'a list'], you can add lists, and you can write a shorthand for loop inside of a list (that's called a "list comprehension").
In [4]:
index_cols = ['Units', 'Variable']
current_dollars_cols = index_cols + [
c for c in raw.columns if c.startswith('Current')
]
constant_dollars_cols = index_cols + [
c for c in raw.columns if c.startswith('Constant')
]
current_dollars_data = raw[current_dollars_cols][9:]
current_dollars_data.set_index(keys=index_cols, inplace=True)
constant_dollars_data = raw[constant_dollars_cols][9:]
constant_dollars_data.set_index(keys=index_cols, inplace=True)
In [5]:
years = [int(c[-4:]) for c in constant_dollars_data.columns]
constant_dollars_data.columns = years
In [6]:
per_entry = (
constant_dollars_data.transpose()['Amount (thousand USD)'] * 1000 /
constant_dollars_data.transpose()['Number of returns']
)
per_entry.dropna(axis=1).describe().round()
Out[6]:
The library that provides plot functions is called Matplotlib. To show the plots in this notebook you need to use the "magic method" %matplotlib inline. It should be used at the beginning of the notebook for clarity.
In [7]:
# This should always be at the beginning of the notebook,
# like all magic statements and import statements.
# It's only here because I didn't want to describe it earlier.
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 12)
The data are (I think) for every form filed, not really per capita, but since we're not interpreting it for anything important we can conflate the two.
Per capita income (Blue line) rose a lot with the tech bubble, then sunk with its crash, and then followed the housing bubble and crash. It also looks like small business income (Red dashed line) hasn't really come back since the crash, but that unemployment (Magenta dots) has gone down.
In [8]:
styles = ['b-', 'g-.', 'r--', 'c-', 'm:']
axes = per_entry[[
'Total income',
'Total social security benefits (not in income)',
'Business or profession net income less loss',
'Total payments',
'Unemployment compensation']].plot(style=styles)
plt.suptitle('Average USD per return (when stated)')
Out[8]:
In [9]:
styles = ['b-', 'r--', 'g-.', 'c-', 'm:']
axes = constant_dollars_data.transpose()['Amount (thousand USD)'][[
'Total income',
'Total payments',
'Total social security benefits (not in income)',
'Business or profession net income less loss',
'Unemployment compensation']].plot(logy=True, style=styles)
plt.legend(bbox_to_anchor=(1, 1),
bbox_transform=plt.gcf().transFigure)
plt.suptitle('Total USD (constant 1990 basis)')
Out[9]: