# 3. Explore the Data

"I don't know what I don't know"

• Why do visual exploration?
• Understand Data Structure & Types
• Explore single variable graphs - Quantitative, Categorical
• Explore dual variable graphs - (Q & Q, Q & C, C & C)
• Explore multi variable graphs

import pandas as pd
import numpy as np

# Sort the price data and fill the missing values (the year is added further below)
# Note: `df` must already hold the price data; the loading step is not shown here
df.sort_values(by=['State', 'date'], inplace=True)
df.ffill(inplace=True)
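One thing worth checking about the sort-then-fill step: a plain forward fill runs straight down the sorted frame, so a missing value in the first row of one state is filled from the last row of the previous state. The sketch below uses a small made-up frame (the values are invented, only the column names come from this notebook) to compare the plain fill with a per-state fill via `groupby`. Whether this matters for the real data depends on where its gaps are.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; only the column names mirror the notebook
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "date": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-01", "2014-01-02"]),
    "LowQ": [150.0, 149.0, np.nan, 400.0],
})

toy = toy.sort_values(by=["State", "date"])

# Plain forward fill: Alaska's first row inherits Alabama's last value
plain = toy["LowQ"].ffill()

# Per-state forward fill: the gap stays NaN because Alaska has no earlier value
per_state = toy.groupby("State")["LowQ"].ffill()

print(plain.tolist())      # [150.0, 149.0, 149.0, 400.0]
print(per_state.tolist())  # [150.0, 149.0, nan, 400.0]
```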

Let's load the libraries required for visual exploration.

# Load the visualisation libraries - Matplotlib and Seaborn
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)

## 3.1 Quantitative Variable - Single Variable

# Calculate the year, then filter the data for California
df['year'] = pd.DatetimeIndex(df['date']).year
# .copy() gives an independent frame, so changing its index later does not touch df
df_cal = df[df["State"] == "California"].copy()

# Plot
df_cal.plot(x = "date", y = "HighQ")

# Set the index to date - this is important to get the labels in the plots automatically
df_cal.index = df_cal.date

# Let's plot the HighQ prices
df_cal.HighQ.plot()

# Let's plot HighQ as a histogram to see the most common price
df_cal.HighQ.plot(kind = "hist")

# Let's increase the number of bins to see more granularity
df_cal.HighQ.plot(kind = "hist", bins = 40)
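To see what changing `bins` actually does, here is a minimal sketch that computes the histogram counts directly with `np.histogram` instead of drawing them. The prices are randomly generated for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration; not the notebook's data
rng = np.random.default_rng(0)
prices = pd.Series(rng.normal(loc=245, scale=3, size=500), name="HighQ")

# np.histogram returns the count in each bin and the bin edges
counts_10, edges_10 = np.histogram(prices, bins=10)  # 10 is the pandas default for kind="hist"
counts_40, edges_40 = np.histogram(prices, bins=40)

# Same value range and same total count; only the bin width changes
width_10 = edges_10[1] - edges_10[0]
width_40 = edges_40[1] - edges_40[0]

print(len(counts_10), len(counts_40))
print(counts_10.sum(), counts_40.sum())
print(round(width_10 / width_40, 6))
```

More bins means narrower bins over the same range, so each bar holds fewer observations and the shape gets noisier as well as more detailed.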

## 3.2 Quantitative - Multi Variable

# Let's plot all three prices in California
df_cal[["HighQ", "MedQ", "LowQ"]].plot()

# Let's see the distribution of these prices using a histogram
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "hist", bins = 50, alpha = 0.5)

### Exercise

Filter the data for 2014 and Alaska

Plot the HighQ, MedQ and LowQ prices for Alaska in 2014

Plot the histogram of HighQ, MedQ and LowQ prices for Alaska in 2014
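One possible way to approach the filtering part of the exercise is sketched below on a few made-up rows (the values are invented, only the column names come from this notebook). It combines the two conditions into a single boolean mask.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the notebook
toy = pd.DataFrame({
    "State": ["Alaska", "Alaska", "Alaska", "California"],
    "date": pd.to_datetime(["2013-12-31", "2014-01-01", "2014-06-01", "2014-01-01"]),
    "HighQ": [300.0, 303.0, 301.0, 248.0],
    "MedQ": [270.0, 271.0, 269.0, 193.0],
    "LowQ": [400.0, 402.0, 399.0, 190.0],
})
toy["year"] = toy["date"].dt.year

# Combine both conditions with & (each condition needs its own parentheses)
mask = (toy["State"] == "Alaska") & (toy["year"] == 2014)
toy_alaska_2014 = toy[mask].set_index("date")

print(toy_alaska_2014[["HighQ", "MedQ", "LowQ"]])
```

From here, `.plot()` and `.plot(kind = "hist")` on the three price columns work the same way as in the California example.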

### Box Plots

# Let's look at the summary statistics that a box plot shows visually
df_cal.describe()

# Let's plot a box plot for the prices
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "box")

# Let's plot a box plot for the sample sizes
df_cal[["HighQN", "MedQN", "LowQN"]].plot(kind = "box")
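To connect the `describe()` table with the box plot, the sketch below computes the pieces of a box plot by hand: the quartiles, the interquartile range (IQR) and the 1.5 * IQR fences that matplotlib uses for the whiskers by default. The prices are made up for illustration.

```python
import pandas as pd

# Made-up prices for illustration; not the notebook's data
prices = pd.Series([240.0, 242.0, 243.0, 244.0, 245.0, 246.0, 247.0, 260.0], name="HighQ")

# The box spans q1 to q3, with a line at the median
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers reach the furthest points inside these fences; points beyond them are drawn individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```

The three quartiles are the same numbers that `describe()` reports in its 25%, 50% and 75% rows.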

What if we want to show the price in all the states in the year 2014?

# Select only the year 2014
df_2014 = df[df["year"] == 2014]

# Let's use a pivot table to get the HighQ values for each date by each state
df_states = pd.pivot_table(df_2014, values = "HighQ", index = "date", columns = "State")

# Let's plot all of these lines
df_states.plot()

# Plot only a subset of the states (columns 1 to 9) for readability
df_states.iloc[:,1:10].plot()

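To see what the pivot step does to the shape of the data, here is a small sketch on made-up rows (the values are invented, only the column names come from this notebook). It turns the long format, one row per state and date, into a wide format with one column per state.

```python
import pandas as pd

# Made-up long-format rows for illustration; only the column names mirror the notebook
long_df = pd.DataFrame({
    "State": ["Alaska", "Alaska", "Texas", "Texas"],
    "date": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-01", "2014-01-02"]),
    "HighQ": [303.0, 302.0, 337.0, 336.0],
})

# One row per date, one column per state, HighQ in the cells
wide_df = pd.pivot_table(long_df, values="HighQ", index="date", columns="State")

print(wide_df)
print(wide_df.shape)  # (2, 2)
```

Calling `.plot()` on the wide frame draws one line per column, which is why pivoting comes first. Note that `pivot_table` averages the values by default if a state has more than one row for the same date.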
# What if we group by State and plot
# df_2014.groupby("State").plot(x = "date", y = "HighQ")

# Arrange in a grid fashion
grid = sns.FacetGrid(df_2014, col = "State", col_wrap = 7)
grid.map(plt.plot, "date", "HighQ")
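For intuition about what the grid is doing, the sketch below builds a similar small-multiples layout by hand with `plt.subplots`, one panel per state with shared axes. It is only an approximation of the idea, not how seaborn implements `FacetGrid`, and the data is made up.

```python
import math

import matplotlib
matplotlib.use("Agg")  # for running outside a notebook; drop this line when using %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration; only the column names mirror the notebook
toy = pd.DataFrame({
    "State": ["Alaska"] * 2 + ["Texas"] * 2 + ["Utah"] * 2,
    "date": pd.to_datetime(["2014-01-01", "2014-01-02"] * 3),
    "HighQ": [303.0, 302.0, 337.0, 336.0, 289.0, 290.0],
})

n_cols = 2  # plays the role of col_wrap
groups = list(toy.groupby("State"))
n_rows = math.ceil(len(groups) / n_cols)

# squeeze=False keeps axes two-dimensional even for a single row or column
fig, axes = plt.subplots(n_rows, n_cols, sharex=True, sharey=True, squeeze=False)
for ax, (state, group) in zip(axes.flat, groups):
    ax.plot(group["date"], group["HighQ"])
    ax.set_title(state)

# Hide any unused panels in the last row
for ax in list(axes.flat)[len(groups):]:
    ax.set_visible(False)
```

Sharing the y-axis across panels is what makes the states comparable at a glance.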


## 3.3 Single Variable - Categorical

# Create an index in the demographic data to ease the labels
# Note: `df_demo` must already hold the demographic data; the loading step is not shown here
df_demo.index = df_demo.region

# DO NOT make pie charts, especially when the number of categories is greater than 6
df_demo.total_population.plot(kind = "pie")
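A common alternative to the pie chart is a sorted horizontal bar chart, where lengths are easier to compare than angles. The sketch below is one way to do that, using made-up regions and populations rather than the notebook's demographic data.

```python
import matplotlib
matplotlib.use("Agg")  # for running outside a notebook; drop this line when using %matplotlib inline
import pandas as pd

# Made-up populations for illustration; not the notebook's demographic data
toy_demo = pd.DataFrame({
    "region": ["a", "b", "c", "d"],
    "total_population": [500, 4000, 1500, 2000],
}).set_index("region")

# Sorting first makes the ranking readable at a glance
sorted_pop = toy_demo["total_population"].sort_values()
ax = sorted_pop.plot(kind="barh")

print(sorted_pop.index.tolist())  # ['a', 'c', 'd', 'b']
```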

