Reading Excel files

This notebook demonstrates how to read and manipulate data from Excel using Pandas:

Input / Output
summaries
plotting

First, import the Pandas library:



In [1]:

    
# The library for handling tabular data is called 'pandas'
# Everyone shortens this to 'pd' for convenience.
import pandas as pd

Get IRS data on businesses

The IRS website has some aggregated statistics on business returns in Excel files. We will use the Selected Income and Tax Items for Selected Years.

The original data is from the file linked here: https://www.irs.gov/pub/irs-soi/14intaba.xls, but I cleaned it up by hand to remove footnotes and reformat the column and row headers. You can get the cleaned file in this repository data/14intaba_cleaned.xls. It looks like this:

Read the data!

We will use the read_excel function inside of the Pandas library (accessed using pd.read_excel) to get the data. We need to:

skip the first 2 rows
Split out the 'Current dollars' and 'Constant 1990 dollars' subsets
use the left two columns to split out the number of returns and their dollar amounts

When referring to files on your computer from Jupyer, the path you use is relative to the current Jupyter notebook. My directory looks like this:

.
|-- notebooks
  |-- input_output.ipynb
  |-- data
    |- 14intaba_cleaned.xls

so, the relative path from the notebook input_output.ipynb to the dataset 14intaba_cleaned.xls is:

data/14intaba_cleaned.xls



In [2]:

    
raw = pd.read_excel('data/14intaba_cleaned.xls', skiprows=2)

Look at the last 3 rows

The function pd.read_excel returns an object called a 'Data Frame', that is defined inside of the Pandas library. It has associated functions that access and manipulate the data inside. For example:



In [3]:

    
# Look at the last 3 rows
raw.tail(3)









    Out[3]:






  
    
      
      Variable
      Units
      Current dollars 1990
      Current dollars 1991
      Current dollars 1992
      Current dollars 1993
      Current dollars 1994
      Current dollars 1995
      Current dollars 1996
      Current dollars 1997
      ...
      Constant 1990 dollars 2006
      Constant 1990 dollars 2007
      Constant 1990 dollars 2008
      Constant 1990 dollars 2009
      Constant 1990 dollars 2010
      Constant 1990 dollars 2011
      Constant 1990 dollars 2012
      Constant 1990 dollars 2013
      Constant 1990 dollars 2014
      Percent change, 2013 to 2014.1
    
  
  
    
      221
      Tax due at time of filing
      Amount (thousand USD)
      56561150.0
      53046236.0
      50411381.0
      50330900.0
      60507961.0
      71153015.0
      85337201.0
      93909641.0
      ...
      8.190006e+07
      8.235046e+07
      6.088502e+07
      5.248614e+07
      6.042705e+07
      6.131351e+07
      7.163618e+07
      7.769037e+07
      8.623847e+07
      11.0028
    
    
      222
      Predetermined estimated tax penalty
      Number of returns
      3792027.0
      3798637.0
      4555184.0
      4972115.0
      4707696.0
      5177201.0
      5530812.0
      5920839.0
      ...
      6.496924e+06
      7.549807e+06
      6.355582e+06
      6.548380e+06
      6.670019e+06
      6.702687e+06
      7.619267e+06
      8.543611e+06
      9.327132e+06
      9.17084
    
    
      223
      Predetermined estimated tax penalty
      Amount (thousand USD)
      636676.0
      596943.0
      581476.0
      626890.0
      705246.0
      856106.0
      955207.0
      1080062.0
      ...
      1.148484e+06
      1.180518e+06
      7.425282e+05
      5.239647e+05
      4.830977e+05
      4.463376e+05
      4.781212e+05
      5.715477e+05
      6.497465e+05
      13.6819
    
  

3 rows × 54 columns

Split out the 'Current dollars' and 'Constant 1990 dollars'

There are two sets of data — for the actual dollars for each variable, and also for constant dollars (accounting for inflation). We will split the raw dataset into two and then index the rows by the units (whether they're number of returns or amount paid/claimed).

The columns we care about are Variable, Units, and the current or constant dollars from each year. (You can view them all with raw.columns.) We can subset the dataset with the columns we want using raw.ix[:, <desired_cols>].

There are a lot of commands in this section...we will do a better job explaining later. For now, ['braces', 'denote', 'a list'], you can add lists, and you can write a shorthand for loop inside of a list (that's called a "list comprehension").



In [4]:

    
index_cols = ['Units', 'Variable']
current_dollars_cols = index_cols + [
    c for c in raw.columns if c.startswith('Current')
]
constant_dollars_cols = index_cols + [
    c for c in raw.columns if c.startswith('Constant')
]

current_dollars_data = raw[current_dollars_cols][9:]
current_dollars_data.set_index(keys=index_cols, inplace=True)

constant_dollars_data = raw[constant_dollars_cols][9:]
constant_dollars_data.set_index(keys=index_cols, inplace=True)



In [5]:

    
years = [int(c[-4:]) for c in constant_dollars_data.columns]
constant_dollars_data.columns = years

Statistics

Pandas provides methods for statistical summaries. The describe method gives an overall summary. dropna(axis=1) deletes columns containing null values. If it were axis=0 it would be deleting rows.



In [6]:

    
per_entry = (
    constant_dollars_data.transpose()['Amount (thousand USD)'] * 1000 /
    constant_dollars_data.transpose()['Number of returns']
)
per_entry.dropna(axis=1).describe().round()









    Out[6]:






  
    
      Variable
      Additional taxes
      Alimony paid adjustment
      Alimony received
      Alternative minimum tax
      Basic standard deduction
      Business or profession net income less loss
      Child care credit
      Credit for elderly or disabled
      Deductible part of self-employment tax
      Earned income credit
      ...
      Total credits (including all not enumerated in this table)
      Total income
      Total itemized deductions
      Total payments
      Total pensions and annuities (not in income)
      Total social security benefits (not in income)
      Total statutory adjustments
      Total tax liability (including all not enumerated in this table)
      Total taxpayer IRA adjustment
      Unemployment compensation
    
  
  
    
      count
      25.0
      25.0
      25.0
      25.0
      25.0
      25.0
      25.0
      25.0
      25.0
      25.0
      ...
      25.0
      25.0
      25.0
      25.0
      25.0
      25.0
      25.0
      25.0
      25.0
      25.0
    
    
      mean
      4310.0
      9151.0
      10668.0
      4291.0
      4300.0
      8424.0
      347.0
      130.0
      883.0
      1183.0
      ...
      775.0
      34259.0
      14749.0
      5784.0
      16409.0
      10494.0
      2012.0
      6164.0
      2108.0
      2682.0
    
    
      std
      2814.0
      1156.0
      1378.0
      947.0
      220.0
      888.0
      31.0
      46.0
      37.0
      194.0
      ...
      93.0
      2931.0
      1027.0
      612.0
      2493.0
      657.0
      89.0
      748.0
      420.0
      809.0
    
    
      min
      658.0
      7553.0
      8317.0
      2607.0
      3980.0
      6752.0
      305.0
      63.0
      820.0
      598.0
      ...
      549.0
      29288.0
      13184.0
      4643.0
      11941.0
      9252.0
      1868.0
      4796.0
      1585.0
      1893.0
    
    
      25%
      1707.0
      7910.0
      9208.0
      3698.0
      4113.0
      7516.0
      322.0
      88.0
      847.0
      1234.0
      ...
      695.0
      31720.0
      13815.0
      5443.0
      13902.0
      9940.0
      1931.0
      5755.0
      1713.0
      2077.0
    
    
      50%
      4198.0
      9282.0
      11129.0
      4404.0
      4196.0
      8628.0
      346.0
      120.0
      891.0
      1249.0
      ...
      765.0
      34596.0
      14649.0
      5865.0
      17196.0
      10546.0
      2025.0
      6162.0
      2079.0
      2433.0
    
    
      75%
      5994.0
      10185.0
      11862.0
      4905.0
      4490.0
      9059.0
      362.0
      164.0
      914.0
      1299.0
      ...
      855.0
      36366.0
      15467.0
      6224.0
      18287.0
      10885.0
      2083.0
      6736.0
      2516.0
      3033.0
    
    
      max
      9889.0
      10725.0
      12745.0
      6285.0
      4685.0
      9662.0
      415.0
      199.0
      933.0
      1335.0
      ...
      901.0
      38961.0
      16668.0
      6733.0
      20000.0
      11590.0
      2180.0
      7408.0
      2698.0
      4827.0
    
  

8 rows × 64 columns

Plot

The library that provides plot functions is called Matplotlib. To show the plots in this notebook you need to use the "magic method" %matplotlib inline. It should be used at the beginning of the notebook for clarity.



In [7]:

    
# This should always be at the beginning of the notebook,
# like all magic statements and import statements.
# It's only here because I didn't want to describe it earlier.
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 12)

The per-entry data

The data are (I think) for every form filed, not really per capita, but since we're not interpreting it for anything important we can conflate the two.

Per capita income (Blue line) rose a lot with the tech bubble, then sunk with its crash, and then followed the housing bubble and crash. It also looks like small business income (Red dashed line) hasn't really come back since the crash, but that unemployment (Magenta dots) has gone down.



In [8]:

    
styles = ['b-', 'g-.', 'r--', 'c-', 'm:']
axes = per_entry[[
        'Total income',
        'Total social security benefits (not in income)',
        'Business or profession net income less loss',
        'Total payments',
        'Unemployment compensation']].plot(style=styles)
plt.suptitle('Average USD per return (when stated)')









    Out[8]:





<matplotlib.text.Text at 0x111f7f550>

Also with log-y

We can see the total social security benefits payout (Green dot dash) increase as the baby boomers come of age, and we see the unemployment compensation (Magenta dots) spike after the 2008 crisis and then fall off.



In [9]:

    
styles = ['b-', 'r--', 'g-.', 'c-', 'm:']
axes = constant_dollars_data.transpose()['Amount (thousand USD)'][[
        'Total income',
        'Total payments',
        'Total social security benefits (not in income)',
        'Business or profession net income less loss',
        'Unemployment compensation']].plot(logy=True, style=styles)
plt.legend(bbox_to_anchor=(1, 1),
           bbox_transform=plt.gcf().transFigure)
plt.suptitle('Total USD (constant 1990 basis)')









    Out[9]:





<matplotlib.text.Text at 0x111fbb588>

We did it!

That was I/O with a little statistical summarization and plotting. ❤ Yay.

	Variable	Units	Current dollars 1990	Current dollars 1991	Current dollars 1992	Current dollars 1993	Current dollars 1994	Current dollars 1995	Current dollars 1996	Current dollars 1997	...	Constant 1990 dollars 2006	Constant 1990 dollars 2007	Constant 1990 dollars 2008	Constant 1990 dollars 2009	Constant 1990 dollars 2010	Constant 1990 dollars 2011	Constant 1990 dollars 2012	Constant 1990 dollars 2013	Constant 1990 dollars 2014	Percent change, 2013 to 2014.1
221	Tax due at time of filing	Amount (thousand USD)	56561150.0	53046236.0	50411381.0	50330900.0	60507961.0	71153015.0	85337201.0	93909641.0	...	8.190006e+07	8.235046e+07	6.088502e+07	5.248614e+07	6.042705e+07	6.131351e+07	7.163618e+07	7.769037e+07	8.623847e+07	11.0028
222	Predetermined estimated tax penalty	Number of returns	3792027.0	3798637.0	4555184.0	4972115.0	4707696.0	5177201.0	5530812.0	5920839.0	...	6.496924e+06	7.549807e+06	6.355582e+06	6.548380e+06	6.670019e+06	6.702687e+06	7.619267e+06	8.543611e+06	9.327132e+06	9.17084
223	Predetermined estimated tax penalty	Amount (thousand USD)	636676.0	596943.0	581476.0	626890.0	705246.0	856106.0	955207.0	1080062.0	...	1.148484e+06	1.180518e+06	7.425282e+05	5.239647e+05	4.830977e+05	4.463376e+05	4.781212e+05	5.715477e+05	6.497465e+05	13.6819

Variable	Additional taxes	Alimony paid adjustment	Alimony received	Alternative minimum tax	Basic standard deduction	Business or profession net income less loss	Child care credit	Credit for elderly or disabled	Deductible part of self-employment tax	Earned income credit	...	Total credits (including all not enumerated in this table)	Total income	Total itemized deductions	Total payments	Total pensions and annuities (not in income)	Total social security benefits (not in income)	Total statutory adjustments	Total tax liability (including all not enumerated in this table)	Total taxpayer IRA adjustment	Unemployment compensation
count	25.0	25.0	25.0	25.0	25.0	25.0	25.0	25.0	25.0	25.0	...	25.0	25.0	25.0	25.0	25.0	25.0	25.0	25.0	25.0	25.0
mean	4310.0	9151.0	10668.0	4291.0	4300.0	8424.0	347.0	130.0	883.0	1183.0	...	775.0	34259.0	14749.0	5784.0	16409.0	10494.0	2012.0	6164.0	2108.0	2682.0
std	2814.0	1156.0	1378.0	947.0	220.0	888.0	31.0	46.0	37.0	194.0	...	93.0	2931.0	1027.0	612.0	2493.0	657.0	89.0	748.0	420.0	809.0
min	658.0	7553.0	8317.0	2607.0	3980.0	6752.0	305.0	63.0	820.0	598.0	...	549.0	29288.0	13184.0	4643.0	11941.0	9252.0	1868.0	4796.0	1585.0	1893.0
25%	1707.0	7910.0	9208.0	3698.0	4113.0	7516.0	322.0	88.0	847.0	1234.0	...	695.0	31720.0	13815.0	5443.0	13902.0	9940.0	1931.0	5755.0	1713.0	2077.0
50%	4198.0	9282.0	11129.0	4404.0	4196.0	8628.0	346.0	120.0	891.0	1249.0	...	765.0	34596.0	14649.0	5865.0	17196.0	10546.0	2025.0	6162.0	2079.0	2433.0
75%	5994.0	10185.0	11862.0	4905.0	4490.0	9059.0	362.0	164.0	914.0	1299.0	...	855.0	36366.0	15467.0	6224.0	18287.0	10885.0	2083.0	6736.0	2516.0	3033.0
max	9889.0	10725.0	12745.0	6285.0	4685.0	9662.0	415.0	199.0	933.0	1335.0	...	901.0	38961.0	16668.0	6733.0	20000.0	11590.0	2180.0	7408.0	2698.0	4827.0