In today's class, we will first introduce a data analysis tool called Pandas, and then show how to visualize the data using a module called Seaborn.
Most of the examples come from the Pandas tutorial and the Seaborn tutorial.
In [12]:
import pandas as pd
In [2]:
# The initial set of baby names and birth rates
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
To merge these two lists together we will use the zip function.
In [3]:
BabyDataSet = list(zip(names, births))
BabyDataSet
Out[3]:
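As a minimal sketch of what zip produces here: in Python 3 it returns a lazy iterator, so wrapping it in list() materializes the pairs and makes them visible when displayed. The variable name pairs is used only for illustration.

```python
# Same lists as above
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

# zip pairs the i-th name with the i-th birth count;
# list() turns the lazy iterator into a list of tuples
pairs = list(zip(names, births))
print(pairs[:2])  # [('Bob', 968), ('Jessica', 155)]
```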
We are basically done creating the data set. We will now use the pandas library to export this data set to a CSV file.
We will create a DataFrame object. You can think of this object as holding the contents of BabyDataSet in a format similar to an Excel spreadsheet.
In [4]:
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df
Out[4]:
Export the DataFrame to a CSV file. We can name the file births1880.csv. The to_csv function will be used to export the file. Passing index=False and header=False leaves out the row index and the column names. The file will be saved in the same location as the notebook unless specified otherwise.
In [5]:
df.to_csv('births1880.csv',index=False,header=False)
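To see exactly what index=False and header=False write, one option is to call to_csv without a path, in which case pandas returns the CSV text as a string instead of writing a file. A small sketch that rebuilds the same DataFrame (csv_text and lines are illustrative names):

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
df = pd.DataFrame(list(zip(names, births)), columns=['Names', 'Births'])

# With no path argument, to_csv returns the CSV content as a string
csv_text = df.to_csv(index=False, header=False)
lines = csv_text.splitlines()
print(lines[0])  # Bob,968
```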
In [6]:
df = pd.read_csv("births1880.csv")
df
Out[6]:
This brings us to the first problem of the exercise. The read_csv function treated the first record in the CSV file as the header names. This is obviously not correct, since the file did not provide us with header names.
To correct this we will pass the header parameter to the read_csv function and set it to None (Python's null value).
In [7]:
df = pd.read_csv("births1880.csv", header=None)
df
Out[7]:
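The header problem can be reproduced without touching the disk. The sketch below feeds the same five records to read_csv through io.StringIO (an in-memory stand-in for the file, used here only for illustration) and compares the default behavior with header=None.

```python
import io
import pandas as pd

csv_text = "Bob,968\nJessica,155\nMary,77\nJohn,578\nMel,973\n"

# Default: the first record is consumed as column names, leaving 4 rows
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None: all 5 records are data and the columns are numbered
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(len(with_default), with_default.columns.tolist())  # 4 ['Bob', '968']
print(len(no_header), no_header.columns.tolist())        # 5 [0, 1]
```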
If we want to give the columns specific names, we pass another parameter called names. In that case we can omit the header parameter, because read_csv then assumes the file has no header row.
In [8]:
df = pd.read_csv("births1880.csv", names=['Names','Births'])
df
Out[8]:
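A quick way to confirm that no record is lost when names is passed on its own is to check the shape and a column total. This sketch again uses io.StringIO as an in-memory stand-in for the file; the name named is illustrative.

```python
import io
import pandas as pd

csv_text = "Bob,968\nJessica,155\nMary,77\nJohn,578\nMel,973\n"

# Passing names without header keeps every record as data
named = pd.read_csv(io.StringIO(csv_text), names=['Names', 'Births'])
print(named.shape)            # (5, 2)
print(named['Births'].sum())  # 2751
```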
It is also possible to read a CSV file by passing a URL. Here we use the famous Iris dataset.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals.
In [10]:
df = pd.read_csv("https://raw.githubusercontent.com/bigmlcom/bigmler/master/data/iris.csv")
df.head(10)
Out[10]:
In [11]:
# show basic statistics
df.describe()
Out[11]:
In [12]:
# Select a column
df["sepal length"].head()
Out[12]:
In [13]:
# select columns
df[["sepal length", "petal width"]].head()
Out[13]:
In [14]:
# select rows by index label (both endpoints are included)
df.loc[5:10]
Out[14]:
In [15]:
# select rows by integer position (the end position is excluded)
df.iloc[5:10]
Out[15]:
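The two selections above do not return the same rows. With the default integer index, loc slices by label and includes both endpoints, while iloc slices by position and excludes the end. A sketch on a small made-up DataFrame (toy and its column are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"value": range(100, 112)})  # index labels 0..11

by_label = toy.loc[5:10]      # labels 5 through 10, inclusive: 6 rows
by_position = toy.iloc[5:10]  # positions 5 through 9: 5 rows

print(len(by_label), len(by_position))  # 6 5
```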
In [16]:
# select rows by condition
df[df["sepal length"] > 5.0]
Out[16]:
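Conditions can be combined with & (and) and | (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A sketch on made-up measurements (the values are illustrative, not taken from the Iris file):

```python
import pandas as pd

flowers = pd.DataFrame({
    "sepal length": [4.9, 5.1, 6.3, 5.8],
    "petal width": [0.2, 0.4, 1.8, 1.2],
})

# Each comparison yields a boolean Series; & combines them row by row
mask = (flowers["sepal length"] > 5.0) & (flowers["petal width"] < 1.5)
selected = flowers[mask]
print(selected.index.tolist())  # [1, 3]
```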
We can get the maximum sepal length with:
In [17]:
df["sepal length"].max()
Out[17]:
If we want the full information for the flower with the maximum sepal length, we can sort by that column in descending order and take the first row:
In [18]:
df.sort_values("sepal length", ascending=False).head(1)
Out[18]:
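An alternative that avoids sorting the whole table is idxmax, which returns the index label of the first maximum; passing that label to loc retrieves the row. A sketch on made-up data (the flowers table is hypothetical):

```python
import pandas as pd

flowers = pd.DataFrame({
    "sepal length": [5.1, 7.9, 6.3],
    "species": ["setosa", "virginica", "versicolor"],
})

# idxmax gives the label of the row holding the largest value
label = flowers["sepal length"].idxmax()
print(flowers.loc[label, "species"])  # virginica
```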
In [25]:
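# sort_values returns a new, sorted DataFrame and leaves df unchanged,
# so df.iloc[1] below is still the second row in the original order
df.sort_values("petal length", ascending=True)
df.iloc[1]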
Out[25]:
Pandas also has some basic plotting functions.
In [39]:
import matplotlib.pyplot as plt
%matplotlib inline
df.hist()
Out[39]:
In [1]:
import seaborn as sns
# make the plots show right below the code
%matplotlib inline
In [2]:
import numpy as np
x = np.random.normal(size=100)
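# distplot is deprecated in recent seaborn; histplot with a KDE overlay is a close replacement
sns.histplot(x, kde=True, stat="density")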
Out[2]:
In [10]:
import random
x = [random.normalvariate(0, 1) for _ in range(1000)]
sns.histplot(x, kde=True, stat="density")
Out[10]:
In [16]:
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
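The covariance matrix above gives both variables unit variance and a covariance of 0.5, so the samples should show a correlation near 0.5. A quick check, assuming NumPy's Generator API with a fixed seed (the seed and the larger sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mean, cov = [0, 1], [(1, .5), (.5, 1)]
sample = rng.multivariate_normal(mean, cov, 5000)

# The off-diagonal entry of the correlation matrix estimates corr(x, y)
corr = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]
print(sample.shape)  # (5000, 2)
print(round(corr, 2))
```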
In [17]:
sns.jointplot(x="x", y="y", data=df)
Out[17]:
In [18]:
sns.jointplot(x="x", y="y", data=df, kind="hex")
Out[18]:
In [19]:
sns.jointplot(x="x", y="y", data=df, kind="kde")
Out[19]:
To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes:
In [21]:
iris = sns.load_dataset("iris")
sns.pairplot(iris)
Out[21]:
In [22]:
# we can add colors to different species
sns.pairplot(iris, hue="species")
Out[22]:
In [23]:
tips = sns.load_dataset("tips")
tips.head()
Out[23]:
We can use the function regplot to show the linear relationship between total_bill and tip. It also shows the 95% confidence interval.
In [24]:
sns.regplot(x="total_bill", y="tip", data=tips)
Out[24]:
In [25]:
anscombe = sns.load_dataset("anscombe")
sns.regplot(x="x", y="y", data=anscombe[anscombe["dataset"] == "II"])
Out[25]:
The plot clearly shows that this is not a good model. Let's try to fit a polynomial regression model with degree 2.
In [164]:
sns.regplot(x="x", y="y", data=anscombe[anscombe["dataset"] == "II"], order=2)
Out[164]:
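The improvement can also be checked numerically. The sketch below uses NumPy's polyfit rather than seaborn, with the dataset II values of Anscombe's quartet typed in by hand, and compares the residual sum of squares of a degree-1 and a degree-2 fit. The helper name rss is hypothetical.

```python
import numpy as np

# Anscombe's quartet, dataset II
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def rss(degree):
    """Residual sum of squares of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

print(rss(1) > rss(2))  # True: the quadratic fits far better
```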
In [168]:
sns.stripplot(x="day", y="total_bill", data=tips)
Out[168]:
In [169]:
sns.boxplot(x="day", y="total_bill", hue="time", data=tips)
Out[169]:
In [170]:
titanic = sns.load_dataset("titanic")
sns.barplot(x="sex", y="survived", hue="class", data=titanic)
Out[170]:
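By default barplot draws the mean of y within each group, so with a 0/1 column such as survived the bar height is the survival rate. The same kind of number can be sketched with a pandas groupby on a tiny made-up table (the rows below are invented for illustration and are not the Titanic data):

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "female", "male"],
    "survived": [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of ones in each group
rates = toy.groupby("sex")["survived"].mean()
print(rates["female"], rates["male"])
```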