Pandas and data wrangling

Pandas is a Python library for working with tabular data, like that in SQL tables or CSV files.


In [ ]:
# convention recommended in documentation
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# enable inline plotting in the notebook

%matplotlib inline

Let's start by reading in a dataset. It contains measurements of flowers from different classes (species) of iris.

We'll use it for another exercise later on.


In [ ]:
df = pd.read_csv("../data/iris.data")
df = df.sample(frac=0.2) # keep a random 20% of the rows so the output stays short
type(df)

The DataFrame is the basic building block of Pandas. It represents two-dimensional data with labeled rows and columns.
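
To make "labeled rows and columns" concrete, here is a minimal sketch that builds a small DataFrame by hand. The values, the row labels and the name `toy` are made up for illustration and are not taken from the iris file.

```python
import pandas as pd

# Hypothetical measurements, invented for this example
toy = pd.DataFrame(
    {
        "sepal.length": [5.1, 6.4, 5.9],
        "petal.width": [0.2, 1.5, 2.1],
        "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    },
    index=["a", "b", "c"],  # row labels
)

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels
print(list(toy.index))    # row labels
```

If no `index` is given, Pandas labels the rows with integers starting from 0.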


In [ ]:
# Columns can have different types.
# dtypes lists the data type of each column
df.dtypes

In [ ]:
# you can access the dataframe with a single column name
df["petal.width"]
# this leaves the original unmodified

In [ ]:
# indexing with a single column name returns a Series, the second major concept in Pandas
type(df["petal.width"])
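
A Series is a single column of values that carries the row labels with it. The sketch below uses a small made-up DataFrame (the name `toy` and its values are assumptions for illustration) to show label-based and position-based access, and that arithmetic on a Series returns a new object.

```python
import pandas as pd

# Hypothetical values, invented for this example
toy = pd.DataFrame({"petal.width": [0.2, 1.5, 2.1]}, index=["a", "b", "c"])

widths = toy["petal.width"]

print(widths.name)      # the column name the Series came from
print(widths.loc["b"])  # access by row label
print(widths.iloc[0])   # access by position

# Arithmetic returns a new Series; toy itself is not changed
doubled = widths * 2
print(doubled)
```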

In [ ]:
# alternatively you can index a dataframe with a list of column names, which returns a dataframe
df[["sepal.length", "petal.width", "class"]]

In [ ]:
# the comparison operator returns a Series of booleans, one per row
matching = df["sepal.width"] > df["petal.length"]
matching

In [ ]:
# which can also be used as a mask to select the matching rows of the dataframe
df[matching]

# or more idiomatically

df[df["sepal.width"] > df["petal.length"]]
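
Boolean masks can also be combined. The following is a small sketch on made-up data (the name `toy`, the values and the thresholds are assumptions for illustration). Note that Pandas uses `&`, `|` and `~` here, not the Python keywords `and`, `or` and `not`.

```python
import pandas as pd

# Hypothetical values, invented for this example
toy = pd.DataFrame(
    {
        "sepal.width": [3.5, 2.8, 3.2, 3.4],
        "petal.length": [1.4, 4.7, 5.1, 1.5],
    }
)

wide = toy["sepal.width"] > 3.0
short = toy["petal.length"] < 2.0

both = toy[wide & short]        # rows where both conditions hold
either = toy[wide | short]      # rows where at least one holds
neither = toy[~(wide | short)]  # rows where neither holds

print(both)
print(either)
print(neither)
```

When the comparisons are written inline, each one needs its own parentheses, for example `toy[(toy["sepal.width"] > 3.0) & (toy["petal.length"] < 2.0)]`.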

In [ ]:
# one can get aggregates of single columns

df["sepal.width"].var() # also try min, max, mean, median, sum

In [ ]:
# or of every numeric column at once

df.sum(numeric_only=True) # same operations as above; numeric_only skips the text column "class"
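
Several aggregates can be computed in one call with `agg`. The sketch below runs on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration) and first selects the numeric columns, since a mean of text labels is not meaningful.

```python
import pandas as pd

# Hypothetical values, invented for this example
toy = pd.DataFrame(
    {
        "sepal.width": [3.0, 3.5, 2.5],
        "petal.length": [1.0, 4.0, 4.0],
        "class": ["a", "b", "b"],
    }
)

# Keep only the numeric columns before aggregating
numeric = toy.select_dtypes(include="number")

# One row per aggregate, one column per numeric column
summary = numeric.agg(["min", "max", "mean"])
print(summary)
```

`numeric.describe()` gives a similar overview with a fixed set of statistics.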

In [ ]:
# it's also possible to plot simple graphs with a fairly simple syntax

df["sepal.width"].plot.box()

In [ ]:
df["sepal.width"].plot.hist()

In [ ]:
df.boxplot(column="sepal.width", by="class")

In [ ]:
df.plot.scatter(x="sepal.length", y="sepal.width")

In [ ]:
df.groupby("class").mean()
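
`groupby` splits the rows by the values of a column, applies an operation to each group and combines the results. Here is a minimal sketch on made-up data (the name `toy`, the class labels and the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical values, invented for this example
toy = pd.DataFrame(
    {
        "class": ["a", "a", "b", "b"],
        "sepal.length": [5.0, 5.5, 6.0, 7.0],
    }
)

# Mean of one column per group; the result is a Series indexed by class
means = toy.groupby("class")["sepal.length"].mean()
print(means)

# Iterating over a groupby yields (label, sub-dataframe) pairs
sizes = {label: len(group) for label, group in toy.groupby("class")}
print(sizes)

# Several aggregates per group at once
stats = toy.groupby("class")["sepal.length"].agg(["count", "mean", "max"])
print(stats)
```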

In [ ]:
# drawing one curve per group on the same axes takes a loop over the groups
fig, ax = plt.subplots(figsize=(8,6))
for label, df_ in df.groupby('class'):
    df_["sepal.length"].plot(kind="kde", ax=ax, label=label)
plt.legend()

Exercises: Davis data set

Read in the Davis data set of self-reported heights and weights from "../data/davis.data".


In [ ]:

Plot box plots of the data. Is there a value that looks wrong?


In [ ]:

How do you remove it?