3. Explore the Data

"I don't know, what I don't know"

  • Why do visual exploration?
  • Understand Data Structure & Types
  • Explore single variable graphs - Quantitative, Categorical
  • Explore dual variable graphs - (Q & Q, Q & C, C & C)
  • Explore multi variable graphs

In [6]:
# Load the libraries
import pandas as pd
import numpy as np

In [7]:
# Load the price data again and fill the missing values, Add year
df = pd.read_csv("data/Weed_Price.csv", parse_dates=[-1])
df.sort(columns=['State','date'], inplace=True)
df.fillna(method = "ffill", inplace=True)

In [8]:
# Load the demographic data
df_demo = pd.read_csv("data/Demographics_State.csv")

Lets load the libraries required for Visual Exploration


In [9]:
# Load the visualisation libraries - Matplotlib and Seaborn
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [10]:
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)

3.1 Quantiative Variable - Single Variable


In [11]:
# Filter data for location California and calculate the Year
df['year'] = pd.DatetimeIndex(df['date']).year
df_cal = df[df["State"] == "California"]
df_cal.head()


Out[11]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year
20098 California 248.77 12021 193.44 12724 193.88 770 2013-12-27 2013
20863 California 248.74 12025 193.44 12728 193.88 770 2013-12-28 2013
21577 California 248.76 12047 193.55 12760 193.60 772 2013-12-29 2013
22291 California 248.82 12065 193.54 12779 193.80 773 2013-12-30 2013
22801 California 248.76 12082 193.54 12792 193.80 773 2013-12-31 2013

In [12]:
# Plot
df_cal.plot(x = "date", y = "HighQ")


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1098d3890>

In [13]:
# Set index as date - this is important to get the labels in the plots automatically
df_cal.index = df_cal.date
df_cal.head()


Out[13]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year
date
2013-12-27 California 248.77 12021 193.44 12724 193.88 770 2013-12-27 2013
2013-12-28 California 248.74 12025 193.44 12728 193.88 770 2013-12-28 2013
2013-12-29 California 248.76 12047 193.55 12760 193.60 772 2013-12-29 2013
2013-12-30 California 248.82 12065 193.54 12779 193.80 773 2013-12-30 2013
2013-12-31 California 248.76 12082 193.54 12792 193.80 773 2013-12-31 2013

In [14]:
# Lets plot the HighQ prices
df_cal.HighQ.plot()


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x106d65110>

In [15]:
# Lets plot this HighQ as a histogram to see the most common price
df_cal.HighQ.plot(kind = "hist")


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x106fed090>

In [16]:
# Lets increase the bins to see some granularity
df_cal.HighQ.plot(kind = "hist", bins = 40)


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x107104210>

3.2 Quantiative - Multi Variable


In [17]:
# Lets plot all the three prices in California
df_cal[["HighQ", "MedQ", "LowQ"]].plot()


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x107228c90>

In [18]:
# Lets see the distribution of these prices by using a histogram
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "hist", bins = 50, alpha = 0.5)


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b6b2890>

Exercise

Filter the data for 2014 and Alaska


In [ ]:

Plot the HighQ, MedQ and LowQ prices for Alaska in 2014


In [ ]:

Plot the histogram of HighQ, MedQ and LowQ prices for Alaska in 2014


In [ ]:

Box Plots


In [19]:
# Lets plot a box plot for the HighQ, MedQ and LowQ
df_cal.describe()


Out[19]:
HighQ HighQN MedQ MedQN LowQ LowQN year
count 449.000000 449.000000 449.000000 449.000000 449.000000 449.000000 449.000000
mean 245.376125 14947.073497 191.268909 16769.821826 189.783586 976.298441 2014.167038
std 1.727046 1656.133565 1.524028 2433.943191 1.598252 120.246714 0.402204
min 241.840000 12021.000000 187.850000 12724.000000 187.830000 770.000000 2013.000000
25% 244.480000 13610.000000 190.260000 14826.000000 188.600000 878.000000 2014.000000
50% 245.310000 15037.000000 191.570000 16793.000000 188.600000 982.000000 2014.000000
75% 246.220000 16090.000000 192.550000 18435.000000 191.320000 1060.000000 2014.000000
max 248.820000 18492.000000 193.630000 22027.000000 193.880000 1232.000000 2015.000000

In [20]:
# Lets plot a Box Plot for the prices
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "box")


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bc0c510>

In [21]:
# Lets plot a Box Plot for the sample size
df_cal[["HighQN", "MedQN", "LowQN"]].plot(kind = "box")


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bf8e890>

What if we want to show the price in all the states in the year 2014?


In [22]:
# Select only the year 2014
df_2014 = df[df["year"] == 2014]
df_2014.head()


Out[22]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year
0 Alabama 339.06 1042 198.64 933 149.49 123 2014-01-01 2014
765 Alabama 339.20 1043 198.64 933 149.49 123 2014-01-02 2014
1479 Alabama 339.20 1043 198.64 933 148.48 124 2014-01-03 2014
2244 Alabama 339.20 1043 198.43 934 148.48 124 2014-01-04 2014
3009 Alabama 339.32 1046 198.13 936 148.48 124 2014-01-05 2014

In [23]:
# Lets use pivot tables to get HighQ values for each Date by each State
df_states = pd.pivot_table(df_2014, values = "HighQ", index = "date", columns = "State")
df_states.head()


Out[23]:
State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida ... South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming
date
2014-01-01 339.06 288.75 303.31 361.85 248.78 236.31 347.90 373.18 352.26 306.43 ... 387.8 359.75 343.33 291.54 379.88 374.01 236.03 360.80 357.99 355.90
2014-01-02 339.20 289.67 303.36 361.85 248.67 236.20 347.84 373.18 352.02 306.35 ... 387.8 359.75 343.35 291.70 379.88 373.62 236.13 359.80 358.19 355.90
2014-01-03 339.20 289.14 303.31 361.37 248.67 236.31 347.89 373.18 351.40 306.21 ... 387.8 359.84 343.26 291.61 379.76 373.77 236.03 360.72 358.08 354.03
2014-01-04 339.20 289.14 303.22 361.21 248.65 236.07 347.89 373.35 351.40 306.11 ... 387.8 359.74 343.23 291.78 379.76 373.72 236.03 361.11 358.04 354.03
2014-01-05 339.32 289.14 303.22 361.21 248.68 236.14 347.79 373.35 352.18 306.12 ... 387.8 359.68 343.14 291.56 379.76 373.77 235.97 361.11 358.18 354.03

5 rows × 51 columns


In [24]:
# Lets plot of these lines
df_states.plot()


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bf9c8d0>

In [25]:
df_states.iloc[:,1:10].plot()


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x109b9b750>

In [26]:
# What if we group by State and plot
# df_2014.groupby("State").plot(x = "date", y = "HighQ")

In [27]:
# Arrange in a grid fashion
grid = sns.FacetGrid(df_2014, col = "State", col_wrap = 7)
grid.map(plt.plot, "date", "HighQ")


Out[27]:
<seaborn.axisgrid.FacetGrid at 0x107043ad0>

3.3 Single Variable - Categorical


In [28]:
df_demo.head()


Out[28]:
region total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age
0 alabama 4799277 67 26 1 4 23680 501 38.1
1 alaska 720316 63 3 5 6 32651 978 33.6
2 arizona 6479703 57 4 3 30 25358 747 36.3
3 arkansas 2933369 74 15 1 7 22170 480 37.5
4 california 37659181 40 6 13 38 29527 1119 35.4

In [29]:
# Create an index in the demographic data to ease the labels 
df_demo.index = df_demo.region
df_demo.head()


Out[29]:
region total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age
region
alabama alabama 4799277 67 26 1 4 23680 501 38.1
alaska alaska 720316 63 3 5 6 32651 978 33.6
arizona arizona 6479703 57 4 3 30 25358 747 36.3
arkansas arkansas 2933369 74 15 1 7 22170 480 37.5
california california 37659181 40 6 13 38 29527 1119 35.4

In [30]:
# DO NOT make pie charts, especially when the number of category is greater than 6
df_demo.total_population.plot(kind = "pie")


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x113368910>

Exercise

Draw a pie chart of the total population of the six largest states in US?


In [ ]:


In [31]:
# Lets plot this in a simple bar chart
df_demo.total_population.plot(kind ="bar")


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x113e77f50>

In [32]:
# Lets sort the columns to make them in ascending order
df_demo.sort(columns = 'total_population', ascending = True, inplace = True)
df_demo.head()


Out[32]:
region total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age
region
wyoming wyoming 570134 85 1 1 9 28902 647 36.8
district of columbia district of columbia 619371 35 49 3 10 45290 1154 33.8
vermont vermont 625904 94 1 1 2 29167 754 42.0
north dakota north dakota 689781 88 1 1 2 29732 564 36.4
alaska alaska 720316 63 3 5 6 32651 978 33.6

In [33]:
# Lets now plot again 
df_demo.total_population.plot(kind = "barh")


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x11425d7d0>

In [34]:
# Lets select the percent population
df_demo.iloc[:,2:6].head()


Out[34]:
percent_white percent_black percent_asian percent_hispanic
region
wyoming 85 1 1 9
district of columbia 35 49 3 10
vermont 94 1 1 2
north dakota 88 1 1 2
alaska 63 3 5 6

In [35]:
# Lets plot the percentage population
df_demo.iloc[:,2:6].plot(kind = "barh")


Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x11457bed0>

In [36]:
# Lets stack the percentage population
df_demo.iloc[:,2:6].plot(kind = "barh", stacked = True)


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x114e3ca50>

In [37]:
# Lets plot the Per Capita Income 
df_demo.sort(columns='per_capita_income', ascending=True).per_capita_income.plot(kind = "barh")


Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x11599e090>

Lets see if there is a relationship between per_capita_income and percent_white


In [38]:
# Scatter - per_capita_income and percent_white
df_demo.plot(kind = "scatter", x = "per_capita_income", y = "percent_white", s = 100)


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1160de110>

Generate hypothesis on demographic data

Is there a reltaionship between per_capita_income and percentage of others?


In [ ]:

Is there a relationship between median_age and percentage of asians?


In [ ]:

Lets plot in the Geographic Coordinates (without using maps)


In [39]:
# Lets get the latitude and longitude for each of the state
df_geo = pd.read_csv("data/State_Location.csv")
df_geo.head()


Out[39]:
region state latitude longitude status
0 alaska AK 61.3850 -152.2683 legal
1 alabama AL 32.7990 -86.8073 medical-limited
2 arkansas AR 34.9513 -92.3809 illegal
3 arizona AZ 33.7712 -111.3877 medical
4 california CA 36.1700 -119.7462 decriminalized+medical

In [40]:
# Lets draw the states as a point
df_geo.plot(kind = "scatter", x = "longitude", y = "latitude")


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x116763050>

In [41]:
# Lets change the color of the dot and increase the size
df_geo.plot(kind = "scatter", x = "longitude", y = "latitude", c = "red", s = 100)


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1167cfdd0>

In [42]:
# Lets zoom in to the mainland USA

df_geo.plot(kind = "scatter", x = "longitude", y = "latitude", c = "red", s = 100,
            xlim = (-140,-60), ylim = (20,60))


Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x11674c4d0>

In [43]:
# Lets plot some data on this plot
df_demo_geo = pd.merge(df_demo, df_geo, on = "region")
df_demo_geo.index = df_demo_geo.region
df_demo_geo.head()


Out[43]:
region total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age state latitude longitude status
region
wyoming wyoming 570134 85 1 1 9 28902 647 36.8 WY 42.7475 -107.2085 illegal
district of columbia district of columbia 619371 35 49 3 10 45290 1154 33.8 DC 38.8964 -77.0262 legal
vermont vermont 625904 94 1 1 2 29167 754 42.0 VT 44.0407 -72.7093 decriminalized+medical
north dakota north dakota 689781 88 1 1 2 29732 564 36.4 ND 47.5362 -99.7930 illegal
alaska alaska 720316 63 3 5 6 32651 978 33.6 AK 61.3850 -152.2683 legal

In [44]:
df_demo_geo.index = df_demo_geo.region
df_demo_geo.iloc[:,10:12].head()


Out[44]:
latitude longitude
region
wyoming 42.7475 -107.2085
district of columbia 38.8964 -77.0262
vermont 44.0407 -72.7093
north dakota 47.5362 -99.7930
alaska 61.3850 -152.2683

In [45]:
# Lets draw a scatter plot
df_demo_geo.plot(kind = "scatter", x = "longitude", y = "latitude", c = "per_capita_income", s = 400,
                 xlim = (-140,-60), ylim = (20,60), colormap = "Oranges")


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x1179ff210>

Exercise

Draw a scatterplot of latitude and longitude with color as median age.


In [ ]:

Lets plot it on a true geographic system using folium


In [46]:
from IPython.display import HTML
import folium

# Temporary code to embed in Jupyter Notebook  - will be fixed in version 0.16
def display(m, height=500):
    """Takes a folium instance and embed HTML."""
    m._build_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{0}" '
                 'style="width: 100%; height: {1}px; '
                 'border: none"></iframe>'.format(srcdoc, height))
    return embed

In [47]:
map = folium.Map(location=[48, -102],  zoom_start= 4)
display(map)


Out[47]:

In [ ]:
# Import a Geo JSON
state_geo = r'data/us-states.json'

In [ ]:
# Plot it as an overlay on the map
map = folium.Map(location=[48, -102], zoom_start=3)
map.geo_json(geo_path = state_geo)
display(map)

In [ ]:
# Lets bind some data to this
df_demo_geo.sort("region", inplace = True)
df_demo_geo.head()

In [ ]:
mapa = folium.Map(location=[48, -102], zoom_start=3)

mapa.geo_json(geo_path=state_geo, 
              data=df_demo_geo,
              columns=['state', 'total_population'],
              fill_color='BuPu',
              key_on='feature.id', reset = True)

# We will not use inline version due to bug in 0.15.0

mapa.create_map()

# Go to http://localhost:8888/files/map.html

Exercise

Create a choropleth using the per_capita_income status


In [ ]: