Exploratory Data Analysis

First, import the tools and load the data


In [ ]:
import pandas as pd
from pprint import pprint
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

In [ ]:
df = pd.read_table('data/preprocessed.tsv')

Exercise 1: Go through each column and figure out its characteristics


In [ ]:
pprint(df.columns.tolist())
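
Beyond the column names, it can help to see each column's type, how many values are missing, and how many distinct values it holds. The sketch below uses a small made-up DataFrame (called toy, with invented values) as a stand-in for df. The column names n_votes, section and selected come from this notebook, but everything else is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df; the notebook itself loads data/preprocessed.tsv.
toy = pd.DataFrame({
    "n_votes": [3, 15, 0, 42],
    "section": ["Web", "Data", "Web", None],
    "selected": [0, 1, 0, 1],
})

# One row per column: its dtype, missing-value count, and distinct-value count.
overview = pd.DataFrame({
    "dtype": toy.dtypes,
    "n_missing": toy.isna().sum(),
    "n_unique": toy.nunique(),  # nunique() ignores missing values by default
})
print(overview)
```

Running the same three calls on df should give a quick first impression of which columns are numerical and which are categorical.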

Exercise 1.1: "Describe" the columns which have numerical types

Hint: use the describe method

Sample code:


In [ ]:
df['n_votes'].describe()
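
To describe every numerical column in one call rather than one at a time, one option is to select the numeric columns first and then call describe on the result. This is a minimal sketch on a made-up DataFrame (toy, with invented values), not the real data.

```python
import pandas as pd

# Hypothetical stand-in for df, with two numeric columns and one text column.
toy = pd.DataFrame({
    "n_votes": [3, 15, 0, 42],
    "section": ["Web", "Data", "Web", "Data"],
    "selected": [0, 1, 0, 1],
})

# Keep only numeric columns, then summarize them all at once.
numeric_summary = toy.select_dtypes(include="number").describe()
print(numeric_summary)
```

The result has one column per numeric feature and one row per statistic (count, mean, std, min, quartiles, max).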

Try it out on other columns


In [ ]:
# ENTER CODE HERE

Exercise 1.2: Is there a way to compactly visualize the distribution of a column?

Hint: Use a histogram

Sample code:


In [ ]:
df['n_votes'].hist(bins=20)
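
A box plot is another compact view: it draws the median and quartiles that describe reports, and marks points far from the bulk of the data. The sketch below assumes a made-up n_votes column (invented values) and also computes the quartiles numerically for comparison.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical vote counts standing in for df['n_votes'].
toy = pd.DataFrame({"n_votes": [0, 3, 15, 42]})

# The box spans the 25th to 75th percentile; the line inside is the median.
ax = toy["n_votes"].plot(kind="box")
ax.set_ylabel("n_votes")

# The same quartiles, computed directly.
quartiles = toy["n_votes"].quantile([0.25, 0.5, 0.75])
print(quartiles)
```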

Try it out on other columns


In [ ]:
# ENTER CODE HERE

Exercise 2: Problem - What about columns (features) that are non-numerical?

Hint: Try different kinds of plots

Sample code:


In [ ]:
df['section'].value_counts().plot(kind="bar")

In [ ]:
df['section'].value_counts().plot(kind="pie")
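
The numbers behind these plots are often worth reading directly. As a sketch, value_counts can report proportions instead of raw counts, and can include missing values as their own category. The section values below are made up for illustration.

```python
import pandas as pd

# Hypothetical section labels standing in for df['section'].
toy = pd.DataFrame({"section": ["Web", "Data", "Web", "Web", None]})

# Raw counts, with missing values kept as their own category.
counts = toy["section"].value_counts(dropna=False)

# Share of each category among the non-missing values.
shares = toy["section"].value_counts(normalize=True)

print(counts)
print(shares)
```

Proportions make it easier to compare categories across columns with different numbers of missing values.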

Try it out on other (non-numerical) columns


In [ ]:
### ENTER CODE HERE

Exercise 3: What about multivariate analysis? How does one column relate to another?

Exercise 3.1: Is there a relation between the number of votes and whether the talk is selected?

Hint: Use a scatterplot

Sample code:


In [ ]:
df.plot.scatter(x="n_votes", y="selected")
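
If selected only takes two values, the scatterplot collapses into two horizontal bands and can be hard to read. A complementary check, sketched here on made-up data and assuming selected is coded as 0/1, is to summarize n_votes separately for each outcome.

```python
import pandas as pd

# Hypothetical data; selected is assumed to be 0 (not selected) or 1 (selected).
toy = pd.DataFrame({
    "n_votes": [3, 15, 0, 42, 7, 30],
    "selected": [0, 1, 0, 1, 0, 1],
})

# Summary statistics of n_votes within each value of selected.
by_outcome = toy.groupby("selected")["n_votes"].describe()
print(by_outcome)
```

A clear gap between the two rows (for example in the mean or median) would suggest a relation worth investigating further.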

Exercise 3.2: Find the correlation between other pairs of columns
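
One possible starting point is the pairwise correlation matrix of all numeric columns. The sketch below uses made-up values, and the n_comments column is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Hypothetical data; n_comments is an invented column, not taken from the notebook.
toy = pd.DataFrame({
    "n_votes": [3, 15, 0, 42, 7, 30],
    "n_comments": [1, 4, 0, 9, 2, 5],
    "selected": [0, 1, 0, 1, 0, 1],
    "section": ["Web", "Data", "Web", "Data", "Web", "Data"],
})

# Pearson correlation between every pair of numeric columns.
corr_matrix = toy.select_dtypes(include="number").corr()
print(corr_matrix)
```

Values near 1 or -1 indicate a strong linear relation, and values near 0 indicate a weak one. Correlation alone does not show that one column causes the other.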


In [ ]:
### ENTER CODE HERE