3. Explore the Data

"I don't know, what I don't know"

  • Why do visual exploration?
  • Understand Data Structure & Types
  • Explore single variable graphs - Quantitative, Categorical
  • Explore dual variable graphs - (Q & Q, Q & C, C & C)
  • Explore multi variable graphs

In [1]:
# Load the libraries
import pandas as pd
import numpy as np

In [2]:
# Load the price data again and fill the missing values, Add year
df = pd.read_csv("data/Weed_Price.csv", parse_dates=[-1])
df.sort(columns=['State','date'], inplace=True)
df.fillna(method = "ffill", inplace=True)

In [3]:
# Load the demographic data
df_demo = pd.read_csv("data/Demographics_State.csv")

Lets load the libraries required for Visual Exploration


In [4]:
# Load the visualisation libraries - Matplotlib and Seaborn
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)

3.1 Quantiative Variable - Single Variable


In [6]:
# Filter data for location California and calculate the Year
df['year'] = pd.DatetimeIndex(df['date']).year
df_cal = df[df["State"] == "California"]
df_cal.head()


Out[6]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year
20098 California 248.77 12021 193.44 12724 193.88 770 2013-12-27 2013
20863 California 248.74 12025 193.44 12728 193.88 770 2013-12-28 2013
21577 California 248.76 12047 193.55 12760 193.60 772 2013-12-29 2013
22291 California 248.82 12065 193.54 12779 193.80 773 2013-12-30 2013
22801 California 248.76 12082 193.54 12792 193.80 773 2013-12-31 2013

In [7]:
# Plot
df_cal.plot(x = "date", y = "HighQ")


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a1e4f50>

In [8]:
# Set index as date - this is important to get the labels in the plots automatically
df_cal.index = df_cal.date
df_cal.head()


Out[8]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year
date
2013-12-27 California 248.77 12021 193.44 12724 193.88 770 2013-12-27 2013
2013-12-28 California 248.74 12025 193.44 12728 193.88 770 2013-12-28 2013
2013-12-29 California 248.76 12047 193.55 12760 193.60 772 2013-12-29 2013
2013-12-30 California 248.82 12065 193.54 12779 193.80 773 2013-12-30 2013
2013-12-31 California 248.76 12082 193.54 12792 193.80 773 2013-12-31 2013

In [9]:
# Lets plot the HighQ prices
df_cal.HighQ.plot()


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a3a3d90>

In [10]:
# Lets plot this HighQ as a histogram to see the most common price
df_cal.HighQ.plot(kind = "hist")


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a632210>

In [11]:
# Lets increase the bins to see some granularity
df_cal.HighQ.plot(kind = "hist", bins = 40)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a6b1190>

3.2 Quantiative - Multi Variable


In [12]:
# Lets plot all the three prices in California
df_cal[["HighQ", "MedQ", "LowQ"]].plot()


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a6b11d0>

In [13]:
# Lets see the distribution of these prices by using a histogram
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "hist", bins = 50, alpha = 0.5)


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b204750>

Exercise

Filter the data for 2014 and Alaska


In [ ]:

Plot the HighQ, MedQ and LowQ prices for Alaska in 2014


In [ ]:

Plot the histogram of HighQ, MedQ and LowQ prices for Alaska in 2014


In [ ]:

Box Plots


In [14]:
# Lets plot a box plot for the HighQ, MedQ and LowQ
df_cal.describe()


Out[14]:
HighQ HighQN MedQ MedQN LowQ LowQN year
count 449.000000 449.000000 449.000000 449.000000 449.000000 449.000000 449.000000
mean 245.376125 14947.073497 191.268909 16769.821826 189.783586 976.298441 2014.167038
std 1.727046 1656.133565 1.524028 2433.943191 1.598252 120.246714 0.402204
min 241.840000 12021.000000 187.850000 12724.000000 187.830000 770.000000 2013.000000
25% 244.480000 13610.000000 190.260000 14826.000000 188.600000 878.000000 2014.000000
50% 245.310000 15037.000000 191.570000 16793.000000 188.600000 982.000000 2014.000000
75% 246.220000 16090.000000 192.550000 18435.000000 191.320000 1060.000000 2014.000000
max 248.820000 18492.000000 193.630000 22027.000000 193.880000 1232.000000 2015.000000

In [15]:
# Lets plot a Box Plot for the prices
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "box")


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b687110>

In [16]:
# Lets plot a Box Plot for the sample size
df_cal[["HighQN", "MedQN", "LowQN"]].plot(kind = "box")


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bac04d0>

What if we want to show the price in all the states in the year 2014?


In [17]:
# Select only the year 2014
df_2014 = df[df["year"] == 2014]
df_2014.head()


Out[17]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year
0 Alabama 339.06 1042 198.64 933 149.49 123 2014-01-01 2014
765 Alabama 339.20 1043 198.64 933 149.49 123 2014-01-02 2014
1479 Alabama 339.20 1043 198.64 933 148.48 124 2014-01-03 2014
2244 Alabama 339.20 1043 198.43 934 148.48 124 2014-01-04 2014
3009 Alabama 339.32 1046 198.13 936 148.48 124 2014-01-05 2014

In [18]:
# Lets use pivot tables to get HighQ values for each Date by each State
df_states = pd.pivot_table(df_2014, values = "HighQ", index = "date", columns = "State")
df_states.head()


Out[18]:
State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida ... South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming
date
2014-01-01 339.06 288.75 303.31 361.85 248.78 236.31 347.90 373.18 352.26 306.43 ... 387.8 359.75 343.33 291.54 379.88 374.01 236.03 360.80 357.99 355.90
2014-01-02 339.20 289.67 303.36 361.85 248.67 236.20 347.84 373.18 352.02 306.35 ... 387.8 359.75 343.35 291.70 379.88 373.62 236.13 359.80 358.19 355.90
2014-01-03 339.20 289.14 303.31 361.37 248.67 236.31 347.89 373.18 351.40 306.21 ... 387.8 359.84 343.26 291.61 379.76 373.77 236.03 360.72 358.08 354.03
2014-01-04 339.20 289.14 303.22 361.21 248.65 236.07 347.89 373.35 351.40 306.11 ... 387.8 359.74 343.23 291.78 379.76 373.72 236.03 361.11 358.04 354.03
2014-01-05 339.32 289.14 303.22 361.21 248.68 236.14 347.79 373.35 352.18 306.12 ... 387.8 359.68 343.14 291.56 379.76 373.77 235.97 361.11 358.18 354.03

5 rows × 51 columns


In [19]:
# Lets plot of these lines
df_states.plot()


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bc97fd0>

In [20]:
df_states.iloc[:,1:10].plot()


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x106ca3350>

In [21]:
# What if we group by State and plot
# df_2014.groupby("State").plot(x = "date", y = "HighQ")

In [22]:
# Arrange in a grid fashion
grid = sns.FacetGrid(df_2014, col = "State", col_wrap = 7)
grid.map(plt.plot, "date", "HighQ")


Out[22]:
<seaborn.axisgrid.FacetGrid at 0x1072c7d90>

3.3 Single Variable - Categorical


In [23]:
df_demo.head()


Out[23]:
region total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age
0 alabama 4799277 67 26 1 4 23680 501 38.1
1 alaska 720316 63 3 5 6 32651 978 33.6
2 arizona 6479703 57 4 3 30 25358 747 36.3
3 arkansas 2933369 74 15 1 7 22170 480 37.5
4 california 37659181 40 6 13 38 29527 1119 35.4

In [24]:
# Create an index in the demographic data to ease the labels 
df_demo.index = df_demo.region
df_demo.head()


Out[24]:
region total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age
region
alabama alabama 4799277 67 26 1 4 23680 501 38.1
alaska alaska 720316 63 3 5 6 32651 978 33.6
arizona arizona 6479703 57 4 3 30 25358 747 36.3
arkansas arkansas 2933369 74 15 1 7 22170 480 37.5
california california 37659181 40 6 13 38 29527 1119 35.4

In [25]:
# DO NOT make pie charts, especially when the number of category is greater than 6
df_demo.total_population.plot(kind = "pie")


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x106cac090>