Pandas and data wrangling

Pandas is a Python library for loading, inspecting, and manipulating tabular data, such as the contents of SQL tables or CSV files.



In [ ]:

    
# convention recommended in documentation
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# enable inline plotting in the notebook

%matplotlib inline

Let's start by reading in a dataset. It contains measurements of flowers from different species of iris.

We'll use it for another exercise later on.



In [ ]:

    
df = pd.read_csv("../data/iris.data")
df = df.sample(frac=0.2) # only use 20% of the data so the results aren't so long
type(df)

DataFrame is the basic building block of Pandas. It represents two-dimensional data with labeled rows and columns.
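
To make the idea of labeled rows and columns concrete, here is a minimal sketch that builds a small DataFrame by hand instead of reading a file. The name `toy` and all values are made up for illustration; only the column names mimic those of the iris file.

```python
import pandas as pd

# Hypothetical toy data; the column names mimic the iris file
toy = pd.DataFrame(
    {
        "sepal.length": [5.1, 7.0, 6.3],
        "petal.width": [0.2, 1.4, 2.5],
        "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels
print(list(toy.index))    # row labels, 0..n-1 unless set otherwise
```

Because no index was given, pandas numbers the rows from zero. The sampled `df` above keeps the row labels of the rows that were drawn, so its index will usually not be contiguous.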



In [ ]:

    
# columns can have different types;
# dtypes lists the data type of each column
df.dtypes



In [ ]:

    
# you can access the dataframe with a single column name
df["petal.width"]
# this leaves the original unmodified



In [ ]:

    
# then the returned type is a Series, the second major concept in Pandas
type(df["petal.width"])
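
The difference between indexing with a single label and with a list of labels can be sketched on a tiny, made-up frame. The name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not taken from the iris file
toy = pd.DataFrame({"petal.width": [0.2, 1.4, 2.5], "class": ["a", "b", "c"]})

col = toy["petal.width"]    # single label: a one-dimensional Series
sub = toy[["petal.width"]]  # list of labels: a DataFrame, even with one column

print(type(col).__name__, col.name, col.dtype)
print(type(sub).__name__, sub.shape)
```

A Series carries the name of the column it came from and shares the row index of the DataFrame.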






In [ ]:

    
# alternately you can index a dataframe with a list of column names
df[["sepal.length", "petal.width", "class"]]



In [ ]:

    
# comparing two columns returns a boolean Series (one True/False per row)
matching = df["sepal.width"] > df["petal.length"]
matching



In [ ]:

    
# which can also be used to filter the rows of the dataframe
df[matching]

# or, more idiomatically, in a single expression

df[df["sepal.width"] > df["petal.length"]]
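
Boolean filtering can be checked by hand on a frame small enough to read at a glance. The following is a minimal sketch on made-up values; `toy`, `mask`, `wide`, and `wide_and_short` are illustrative names, not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data; the column names mimic the iris file
toy = pd.DataFrame(
    {
        "sepal.width": [3.5, 3.2, 3.3],
        "petal.length": [1.4, 4.7, 6.0],
    }
)

# Elementwise comparison yields a boolean Series aligned on the row index
mask = toy["sepal.width"] > toy["petal.length"]

# Indexing with the mask keeps only the rows where it is True
wide = toy[mask]

# Conditions combine with & (and), | (or), ~ (not);
# each comparison needs its own parentheses
wide_and_short = toy[mask & (toy["petal.length"] < 2.0)]

print(mask.tolist())
print(wide)
```

Python's `and` and `or` do not work elementwise on a Series, which is why the bitwise operators are used here.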



In [ ]:

    
# you can compute aggregates of a single column

df["sepal.width"].var() # try min, max, mean, median, sum, var



In [ ]:

    
# or of every column at once

# numeric_only=True skips the text column "class"
df.sum(numeric_only=True) # same operations as above



In [ ]:

    
# simple plots are available directly through the .plot accessor

df["sepal.width"].plot.box()



In [ ]:

    
df["sepal.width"].plot.hist()



In [ ]:

    
df.boxplot(column="sepal.width", by="class")



In [ ]:

    
df.plot.scatter(x="sepal.length", y="sepal.width")



In [ ]:

    
df.groupby("class").mean()
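
Grouping is not limited to a single aggregate. As a minimal sketch on made-up values (the name `toy` and the numbers are assumptions), several statistics can be requested per group at once with `agg`:

```python
import pandas as pd

# Hypothetical toy data; the column names mimic the iris file
toy = pd.DataFrame(
    {
        "sepal.length": [5.0, 5.2, 6.4, 6.6],
        "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    }
)

# One row per class, one column per requested aggregate
summary = toy.groupby("class")["sepal.length"].agg(["mean", "min", "max"])
print(summary)
```

The group key becomes the index of the result, so a single group can be looked up with `summary.loc["Iris-setosa"]`.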



In [ ]:

    
# drawing one curve per group requires a loop over the groups
# (kind="kde" needs SciPy to be installed)
fig, ax = plt.subplots(figsize=(8,6))
for label, df_ in df.groupby('class'):
    df_["sepal.length"].plot(kind="kde", ax=ax, label=label)
plt.legend()

Exercises: Davis data set

Read in the Davis data set of self-reported heights and weights from "../data/davis.data".



In [ ]:

Plot box plots of the data. Is there a value that looks off (an outlier)?



In [ ]:

How do you remove it?