Pandas

If you've never used pandas before, it's amazingly useful, and at times frustrating.

Recommended links:

Read through this full series of excellent blog posts by Tom Augspurger.

High level tip

  • try to represent data in the proper format
    • floats as floats; ints as ints; etc.
    • Especially if you have dates, or timestamps, or datetimestamps, keep them in that format.

This pdf Tidy Data by Hadley Wickham is an excellent read with a lot that relates to data analysis in any language.


In [ ]:
from __future__ import absolute_import, division, print_function

%matplotlib inline
import matplotlib.pyplot as plt

In [ ]:
import seaborn as sns
sns.set_context('poster')
sns.set_style('whitegrid') 
# sns.set_style('darkgrid') 
plt.rcParams['figure.figsize'] = 12, 8  # plotsize

In [ ]:
import numpy as np
import pandas as pd
from pandas.tools.plotting import scatter_matrix
from sklearn.datasets import load_boston

import warnings
warnings.filterwarnings('ignore')

Note

Using cleaned data from Data Cleaning Notebook. See Notebook for details.


In [ ]:
df = pd.read_csv("../data/coal_prod_cleaned.csv")

In [ ]:
df.head()

In [ ]:
plt.scatter(df['Average_Employees'], 
            df.Labor_Hours)
plt.xlabel("Number of Employees")
plt.ylabel("Total Hours Worked");

In [ ]:
colors = sns.color_palette(n_colors=df.Year.nunique())

In [ ]:
color_dict = {key: value for key, value in zip(sorted(df.Year.unique()), colors)}

In [ ]:
color_dict

In [ ]:
for year in sorted(df.Year.unique()[[0, 2, -1]]):
    plt.scatter(df[df.Year == year].Labor_Hours,
                df[df.Year == year].Production_short_tons, 
                c=color_dict[year],
                s=50,
                label=year,
               )
plt.xlabel("Total Hours Worked")
plt.ylabel("Total Amount Produced")
plt.legend()
plt.savefig("ex1.png")

In [ ]:
import matplotlib as mpl

In [ ]:
mpl.style.use('seaborn-colorblind')

In [ ]:
plt.style.available

In [ ]:
for year in sorted(df.Year.unique()[[0, 2, -1]]):
    plt.scatter(df[df.Year == year].Labor_Hours,
                df[df.Year == year].Production_short_tons, 
                c=color_dict[year],
                s=50,
                label=year,
               )
plt.xlabel("Total Hours Worked")
plt.ylabel("Total Amount Produced")
plt.legend();
# plt.savefig("ex1.png")

In [ ]:
df_dict = load_boston()
features = pd.DataFrame(data=df_dict.data, columns = df_dict.feature_names)
target = pd.DataFrame(data=df_dict.target, columns = ['MEDV'])
df = pd.concat([features, target], axis=1)
df.head()

In [ ]:
# Target variable
fig, ax = plt.subplots(figsize=(6, 4))
sns.distplot(df.MEDV, ax=ax, rug=True, hist=False)

In [ ]:
fig, ax = plt.subplots(figsize=(10,7))
sns.kdeplot(df.LSTAT,
            df.MEDV,
            ax=ax)

In [ ]:
fig, ax = plt.subplots(figsize=(10, 10))
scatter_matrix(df[['MEDV', 'LSTAT', 'CRIM', 'RM', 'NOX', 'DIS']], alpha=0.2, diagonal='hist', ax=ax);

In [ ]:
pd.cut()

In [ ]:


In [ ]:


In [ ]: