Pandas -- Love and Hate

If you've never used pandas before, it's amazing. It will also frustrate you to tears.

High level tip -- try to represent data in the proper format: floats as floats; ints as ints; etc. Especially if you have dates, or timestamps, or datetimestamps, keep them in that format. The temptation to operate on them like strings may be overwhelming, but resist! In the long run you might prevail. :\


In [1]:
# %install_ext http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information
%version_information numpy, scipy, matplotlib, pandas


Out[1]:
SoftwareVersion
Python2.7.10 64bit [GCC 4.2.1 (Apple Inc. build 5577)]
IPython4.0.0
OSDarwin 14.5.0 x86_64 i386 64bit
numpy1.9.2
scipy0.16.0
matplotlib1.4.3
pandas0.16.2
Wed Sep 09 18:08:35 2015 PDT

In [3]:
# Doing this in python 2.7 code allows for most of the code to be python 3 portable.
# But you have to write your print functions: print("Hello world.")
# from __future__ import division, absolute_import, print_function, unicode_literals
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('talk')
sns.set_style('darkgrid') 
plt.rcParams['figure.figsize'] = 12, 8  # plotsize 


import numpy as np
import pandas as pd

Note

Using cleaned data from Data Cleaning Notebook. See Notebook for details.


In [4]:
dframe = pd.read_csv("../data/coal_prod_cleaned.csv")

Notebook Extensions -- qgrid


In [5]:
# Check out http://nbviewer.ipython.org/github/quantopian/qgrid/blob/master/qgrid_demo.ipynb for more (including demo)
import qgrid # Best practices is to put imports at the top of the Notebook.
qgrid.nbinstall(overwrite=True)


/Users/jonathan/anaconda/lib/python2.7/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)

In [6]:
dframe.head()


Out[6]:
MSHA_ID Average_Employees Company_Type Labor_Hours Mine_Basin Mine_County Mine_Name Mine_State Mine_Status Mine_Type Operating_Company Operating_Company_Address Operation_Type Production_short_tons Union_Code Year
0 102838 4 Independent Producer Operator 2712 Appalachia Southern Bibb Hebron Mine Alabama Permanently abandoned Surface Birmingham Coal & Coke Company 2477 Valleydale Rd. S. B3, Birmingham, AL 35244 Mine only 10572 NaN 2002
1 103184 5 Independent Producer Operator 2480 Appalachia Southern Fayette Berry Mine Alabama Temporarily closed Surface Midas Coal Company Incorporate 401 10th Avenue, S. E, Cullman, AL 35055 Mine only 9725 NaN 2002
2 100329 55 Operating Subsidiary 123618 Appalachia Southern Jefferson Concord Mine Alabama Active Underground U S Steel Mining Company Llc 8800 Oak Grove Mine Road, Adger, AL 35006 Preparation Plant 0 United Mine Workers of America 2002
3 100851 331 Operating Subsidiary 748182 Appalachia Southern Jefferson Oak Grove Mine Alabama Active Underground U S Steel Mining Company Llc 8800 Oak Grove Mine Rd, Adger, AL 35006 Mine only 1942153 United Mine Workers of America 2002
4 102354 28 Independent Producer Operator 55306 Appalachia Southern Jefferson Lindbergh Alabama Active Surface C & H Mining Company Inc P.O. Box 70250, Tuscaloosa, AL 35407 Mine only 168446 NaN 2002

In [7]:
qgrid.show_grid(dframe[['MSHA_ID', 'Year', 'Mine_Name', 'Mine_State', 'Mine_County']], remote_js=True)



In [8]:
plt.scatter(dframe.Average_Employees, dframe.Labor_Hours)
plt.xlabel("Number of Employees")
plt.ylabel("Total Hours Worked")


Out[8]:
<matplotlib.text.Text at 0x10e288090>

In [9]:
plt.scatter(dframe.Labor_Hours, dframe.Production_short_tons, )
plt.xlabel("Total Hours Worked")
plt.ylabel("Total Amount Produced")


Out[9]:
<matplotlib.text.Text at 0x10d260350>

In [10]:
colors = sns.color_palette(n_colors=11)

In [11]:
color_dict = {key: value for key, value in zip(sorted(dframe.Year.unique()), colors)}

In [12]:
color_dict


Out[12]:
{2002: (0.2980392156862745, 0.4470588235294118, 0.6901960784313725),
 2003: (0.3333333333333333, 0.6588235294117647, 0.40784313725490196),
 2004: (0.7686274509803922, 0.3058823529411765, 0.3215686274509804),
 2005: (0.5058823529411764, 0.4470588235294118, 0.6980392156862745),
 2006: (0.8, 0.7254901960784313, 0.4549019607843137),
 2007: (0.39215686274509803, 0.7098039215686275, 0.803921568627451),
 2008: (0.2980392156862745, 0.4470588235294118, 0.6901960784313725),
 2009: (0.3333333333333333, 0.6588235294117647, 0.40784313725490196),
 2010: (0.7686274509803922, 0.3058823529411765, 0.3215686274509804),
 2011: (0.5058823529411764, 0.4470588235294118, 0.6980392156862745),
 2012: (0.8, 0.7254901960784313, 0.4549019607843137)}

In [13]:
for year in sorted(dframe.Year.unique()[[0,2, 5, -1]]):
    plt.scatter(dframe[dframe.Year == year].Labor_Hours,
                dframe[dframe.Year == year].Production_short_tons, 
                c=color_dict[year],
                s=50,
                label=year,
               )
plt.xlabel("Total Hours Worked")
plt.ylabel("Total Amount Produced")
plt.legend()
plt.savefig("ex1.png")



In [ ]:
# facet grid

In [14]:
for col in dframe.columns:
    print col


MSHA_ID
Average_Employees
Company_Type
Labor_Hours
Mine_Basin
Mine_County
Mine_Name
Mine_State
Mine_Status
Mine_Type
Operating_Company
Operating_Company_Address
Operation_Type
Production_short_tons
Union_Code
Year

In [ ]:

SQL connections

You will often use and interact with databases of some kind or another. Having the queries you ran to create the dataframes in a Notebook is great for future reference. There are many python/IPython connections to databases of all kinds: sqlite, mysql, impala, etc.


In [15]:
# An updated implementation from Christian Perez at SVDS https://github.com/cfperez/ipython-sql
%load_ext sql
%reload_ext sql


/Users/jonathan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.
  "You should import from traitlets.config instead.", ShimWarning)
/Users/jonathan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.
  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")

In [16]:
coalproduction = dframe.copy()

In [17]:
%config SqlMagic.autopandas=True

In [18]:
%%sql sqlite://
PERSIST coalproduction


Out[18]:
u'Persisted coalproduction'

In [19]:
%%sql sqlite://
SELECT DISTINCT company_type FROM coalproduction 
WHERE msha_id = 5000030


Done.
Out[19]:
Company_Type
0 Independent Producer Operator

In [20]:
dbtest = %sql SELECT * FROM coalproduction


Done.

In [21]:
type(dbtest)


Out[21]:
pandas.core.frame.DataFrame

In [22]:
dbtest.head()


Out[22]:
index MSHA_ID Average_Employees Company_Type Labor_Hours Mine_Basin Mine_County Mine_Name Mine_State Mine_Status Mine_Type Operating_Company Operating_Company_Address Operation_Type Production_short_tons Union_Code Year
0 0 102838 4 Independent Producer Operator 2712 Appalachia Southern Bibb Hebron Mine Alabama Permanently abandoned Surface Birmingham Coal & Coke Company 2477 Valleydale Rd. S. B3, Birmingham, AL 35244 Mine only 10572 None 2002
1 1 103184 5 Independent Producer Operator 2480 Appalachia Southern Fayette Berry Mine Alabama Temporarily closed Surface Midas Coal Company Incorporate 401 10th Avenue, S. E, Cullman, AL 35055 Mine only 9725 None 2002
2 2 100329 55 Operating Subsidiary 123618 Appalachia Southern Jefferson Concord Mine Alabama Active Underground U S Steel Mining Company Llc 8800 Oak Grove Mine Road, Adger, AL 35006 Preparation Plant 0 United Mine Workers of America 2002
3 3 100851 331 Operating Subsidiary 748182 Appalachia Southern Jefferson Oak Grove Mine Alabama Active Underground U S Steel Mining Company Llc 8800 Oak Grove Mine Rd, Adger, AL 35006 Mine only 1942153 United Mine Workers of America 2002
4 4 102354 28 Independent Producer Operator 55306 Appalachia Southern Jefferson Lindbergh Alabama Active Surface C & H Mining Company Inc P.O. Box 70250, Tuscaloosa, AL 35407 Mine only 168446 None 2002

Use Cases for the Jupyter Notebook

  • Use Case 1: Teaching (Some basics to start!)
  • Use Case 2: Exploratory Data Analysis
  • Use Case 3: Running remotely (server)
  • Use Case 4: Sharing results
  • Use Case 5: Presentations

In [ ]: