3. Explore the Data

"I don't know, what I don't know"

Why do visual exploration?
Understand Data Structure & Types
Explore single variable graphs - Quantitative, Categorical
Explore dual variable graphs - (Q & Q, Q & C, C & C)
Explore multi variable graphs



In [6]:

    
# Load the libraries
import pandas as pd
import numpy as np



In [7]:

    
# Load the price data again and fill the missing values, Add year
df = pd.read_csv("data/Weed_Price.csv", parse_dates=[-1])
df.sort(columns=['State','date'], inplace=True)
df.fillna(method = "ffill", inplace=True)



In [8]:

    
# Load the demographic data
df_demo = pd.read_csv("data/Demographics_State.csv")

Lets load the libraries required for Visual Exploration



In [9]:

    
# Load the visualisation libraries - Matplotlib and Seaborn
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns



In [10]:

    
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)

3.1 Quantiative Variable - Single Variable



In [11]:

    
# Filter data for location California and calculate the Year
df['year'] = pd.DatetimeIndex(df['date']).year
df_cal = df[df["State"] == "California"]
df_cal.head()



In [12]:

    
# Plot
df_cal.plot(x = "date", y = "HighQ")









    Out[12]:





<matplotlib.axes._subplots.AxesSubplot at 0x1098d3890>



In [13]:

    
# Set index as date - this is important to get the labels in the plots automatically
df_cal.index = df_cal.date
df_cal.head()



In [14]:

    
# Lets plot the HighQ prices
df_cal.HighQ.plot()









    Out[14]:





<matplotlib.axes._subplots.AxesSubplot at 0x106d65110>



In [15]:

    
# Lets plot this HighQ as a histogram to see the most common price
df_cal.HighQ.plot(kind = "hist")









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x106fed090>



In [16]:

    
# Lets increase the bins to see some granularity
df_cal.HighQ.plot(kind = "hist", bins = 40)









    Out[16]:





<matplotlib.axes._subplots.AxesSubplot at 0x107104210>

3.2 Quantiative - Multi Variable



In [17]:

    
# Lets plot all the three prices in California
df_cal[["HighQ", "MedQ", "LowQ"]].plot()









    Out[17]:





<matplotlib.axes._subplots.AxesSubplot at 0x107228c90>



In [18]:

    
# Lets see the distribution of these prices by using a histogram
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "hist", bins = 50, alpha = 0.5)









    Out[18]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b6b2890>

Exercise

Filter the data for 2014 and Alaska



In [ ]:

Plot the HighQ, MedQ and LowQ prices for Alaska in 2014



In [ ]:

Plot the histogram of HighQ, MedQ and LowQ prices for Alaska in 2014



In [ ]:

Box Plots



In [19]:

    
# Lets plot a box plot for the HighQ, MedQ and LowQ
df_cal.describe()









    Out[19]:






  
    
      
      HighQ
      HighQN
      MedQ
      MedQN
      LowQ
      LowQN
      year
    
  
  
    
      count
      449.000000
      449.000000
      449.000000
      449.000000
      449.000000
      449.000000
      449.000000
    
    
      mean
      245.376125
      14947.073497
      191.268909
      16769.821826
      189.783586
      976.298441
      2014.167038
    
    
      std
      1.727046
      1656.133565
      1.524028
      2433.943191
      1.598252
      120.246714
      0.402204
    
    
      min
      241.840000
      12021.000000
      187.850000
      12724.000000
      187.830000
      770.000000
      2013.000000
    
    
      25%
      244.480000
      13610.000000
      190.260000
      14826.000000
      188.600000
      878.000000
      2014.000000
    
    
      50%
      245.310000
      15037.000000
      191.570000
      16793.000000
      188.600000
      982.000000
      2014.000000
    
    
      75%
      246.220000
      16090.000000
      192.550000
      18435.000000
      191.320000
      1060.000000
      2014.000000
    
    
      max
      248.820000
      18492.000000
      193.630000
      22027.000000
      193.880000
      1232.000000
      2015.000000



In [20]:

    
# Lets plot a Box Plot for the prices
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "box")









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0x10bc0c510>



In [21]:

    
# Lets plot a Box Plot for the sample size
df_cal[["HighQN", "MedQN", "LowQN"]].plot(kind = "box")









    Out[21]:





<matplotlib.axes._subplots.AxesSubplot at 0x10bf8e890>

What if we want to show the price in all the states in the year 2014?



In [22]:

    
# Select only the year 2014
df_2014 = df[df["year"] == 2014]
df_2014.head()



In [23]:

    
# Lets use pivot tables to get HighQ values for each Date by each State
df_states = pd.pivot_table(df_2014, values = "HighQ", index = "date", columns = "State")
df_states.head()









    Out[23]:






  
    
      State
      Alabama
      Alaska
      Arizona
      Arkansas
      California
      Colorado
      Connecticut
      Delaware
      District of Columbia
      Florida
      ...
      South Dakota
      Tennessee
      Texas
      Utah
      Vermont
      Virginia
      Washington
      West Virginia
      Wisconsin
      Wyoming
    
    
      date
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2014-01-01
      339.06
      288.75
      303.31
      361.85
      248.78
      236.31
      347.90
      373.18
      352.26
      306.43
      ...
      387.8
      359.75
      343.33
      291.54
      379.88
      374.01
      236.03
      360.80
      357.99
      355.90
    
    
      2014-01-02
      339.20
      289.67
      303.36
      361.85
      248.67
      236.20
      347.84
      373.18
      352.02
      306.35
      ...
      387.8
      359.75
      343.35
      291.70
      379.88
      373.62
      236.13
      359.80
      358.19
      355.90
    
    
      2014-01-03
      339.20
      289.14
      303.31
      361.37
      248.67
      236.31
      347.89
      373.18
      351.40
      306.21
      ...
      387.8
      359.84
      343.26
      291.61
      379.76
      373.77
      236.03
      360.72
      358.08
      354.03
    
    
      2014-01-04
      339.20
      289.14
      303.22
      361.21
      248.65
      236.07
      347.89
      373.35
      351.40
      306.11
      ...
      387.8
      359.74
      343.23
      291.78
      379.76
      373.72
      236.03
      361.11
      358.04
      354.03
    
    
      2014-01-05
      339.32
      289.14
      303.22
      361.21
      248.68
      236.14
      347.79
      373.35
      352.18
      306.12
      ...
      387.8
      359.68
      343.14
      291.56
      379.76
      373.77
      235.97
      361.11
      358.18
      354.03
    
  

5 rows × 51 columns



In [24]:

    
# Lets plot of these lines
df_states.plot()









    Out[24]:





<matplotlib.axes._subplots.AxesSubplot at 0x10bf9c8d0>



In [25]:

    
df_states.iloc[:,1:10].plot()









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x109b9b750>



In [26]:

    
# What if we group by State and plot
# df_2014.groupby("State").plot(x = "date", y = "HighQ")



In [27]:

    
# Arrange in a grid fashion
grid = sns.FacetGrid(df_2014, col = "State", col_wrap = 7)
grid.map(plt.plot, "date", "HighQ")









    Out[27]:





<seaborn.axisgrid.FacetGrid at 0x107043ad0>

3.3 Single Variable - Categorical



In [28]:

    
df_demo.head()









    Out[28]:






  
    
      
      region
      total_population
      percent_white
      percent_black
      percent_asian
      percent_hispanic
      per_capita_income
      median_rent
      median_age
    
  
  
    
      0
      alabama
      4799277
      67
      26
      1
      4
      23680
      501
      38.1
    
    
      1
      alaska
      720316
      63
      3
      5
      6
      32651
      978
      33.6
    
    
      2
      arizona
      6479703
      57
      4
      3
      30
      25358
      747
      36.3
    
    
      3
      arkansas
      2933369
      74
      15
      1
      7
      22170
      480
      37.5
    
    
      4
      california
      37659181
      40
      6
      13
      38
      29527
      1119
      35.4



In [29]:

    
# Create an index in the demographic data to ease the labels 
df_demo.index = df_demo.region
df_demo.head()









    Out[29]:






  
    
      
      region
      total_population
      percent_white
      percent_black
      percent_asian
      percent_hispanic
      per_capita_income
      median_rent
      median_age
    
    
      region
      
      
      
      
      
      
      
      
      
    
  
  
    
      alabama
      alabama
      4799277
      67
      26
      1
      4
      23680
      501
      38.1
    
    
      alaska
      alaska
      720316
      63
      3
      5
      6
      32651
      978
      33.6
    
    
      arizona
      arizona
      6479703
      57
      4
      3
      30
      25358
      747
      36.3
    
    
      arkansas
      arkansas
      2933369
      74
      15
      1
      7
      22170
      480
      37.5
    
    
      california
      california
      37659181
      40
      6
      13
      38
      29527
      1119
      35.4



In [30]:

    
# DO NOT make pie charts, especially when the number of category is greater than 6
df_demo.total_population.plot(kind = "pie")









    Out[30]:





<matplotlib.axes._subplots.AxesSubplot at 0x113368910>

Exercise

Draw a pie chart of the total population of the six largest states in US?



In [ ]:



In [31]:

    
# Lets plot this in a simple bar chart
df_demo.total_population.plot(kind ="bar")









    Out[31]:





<matplotlib.axes._subplots.AxesSubplot at 0x113e77f50>



In [32]:

    
# Lets sort the columns to make them in ascending order
df_demo.sort(columns = 'total_population', ascending = True, inplace = True)
df_demo.head()









    Out[32]:






  
    
      
      region
      total_population
      percent_white
      percent_black
      percent_asian
      percent_hispanic
      per_capita_income
      median_rent
      median_age
    
    
      region
      
      
      
      
      
      
      
      
      
    
  
  
    
      wyoming
      wyoming
      570134
      85
      1
      1
      9
      28902
      647
      36.8
    
    
      district of columbia
      district of columbia
      619371
      35
      49
      3
      10
      45290
      1154
      33.8
    
    
      vermont
      vermont
      625904
      94
      1
      1
      2
      29167
      754
      42.0
    
    
      north dakota
      north dakota
      689781
      88
      1
      1
      2
      29732
      564
      36.4
    
    
      alaska
      alaska
      720316
      63
      3
      5
      6
      32651
      978
      33.6



In [33]:

    
# Lets now plot again 
df_demo.total_population.plot(kind = "barh")









    Out[33]:





<matplotlib.axes._subplots.AxesSubplot at 0x11425d7d0>



In [34]:

    
# Lets select the percent population
df_demo.iloc[:,2:6].head()









    Out[34]:






  
    
      
      percent_white
      percent_black
      percent_asian
      percent_hispanic
    
    
      region
      
      
      
      
    
  
  
    
      wyoming
      85
      1
      1
      9
    
    
      district of columbia
      35
      49
      3
      10
    
    
      vermont
      94
      1
      1
      2
    
    
      north dakota
      88
      1
      1
      2
    
    
      alaska
      63
      3
      5
      6



In [35]:

    
# Lets plot the percentage population
df_demo.iloc[:,2:6].plot(kind = "barh")









    Out[35]:





<matplotlib.axes._subplots.AxesSubplot at 0x11457bed0>



In [36]:

    
# Lets stack the percentage population
df_demo.iloc[:,2:6].plot(kind = "barh", stacked = True)









    Out[36]:





<matplotlib.axes._subplots.AxesSubplot at 0x114e3ca50>



In [37]:

    
# Lets plot the Per Capita Income 
df_demo.sort(columns='per_capita_income', ascending=True).per_capita_income.plot(kind = "barh")









    Out[37]:





<matplotlib.axes._subplots.AxesSubplot at 0x11599e090>

Lets see if there is a relationship between per_capita_income and percent_white



In [38]:

    
# Scatter - per_capita_income and percent_white
df_demo.plot(kind = "scatter", x = "per_capita_income", y = "percent_white", s = 100)









    Out[38]:





<matplotlib.axes._subplots.AxesSubplot at 0x1160de110>

Generate hypothesis on demographic data

Is there a reltaionship between per_capita_income and percentage of others?



In [ ]:

Is there a relationship between median_age and percentage of asians?



In [ ]:

Lets plot in the Geographic Coordinates (without using maps)



In [39]:

    
# Lets get the latitude and longitude for each of the state
df_geo = pd.read_csv("data/State_Location.csv")
df_geo.head()









    Out[39]:






  
    
      
      region
      state
      latitude
      longitude
      status
    
  
  
    
      0
      alaska
      AK
      61.3850
      -152.2683
      legal
    
    
      1
      alabama
      AL
      32.7990
      -86.8073
      medical-limited
    
    
      2
      arkansas
      AR
      34.9513
      -92.3809
      illegal
    
    
      3
      arizona
      AZ
      33.7712
      -111.3877
      medical
    
    
      4
      california
      CA
      36.1700
      -119.7462
      decriminalized+medical



In [40]:

    
# Lets draw the states as a point
df_geo.plot(kind = "scatter", x = "longitude", y = "latitude")









    Out[40]:





<matplotlib.axes._subplots.AxesSubplot at 0x116763050>



In [41]:

    
# Lets change the color of the dot and increase the size
df_geo.plot(kind = "scatter", x = "longitude", y = "latitude", c = "red", s = 100)









    Out[41]:





<matplotlib.axes._subplots.AxesSubplot at 0x1167cfdd0>



In [42]:

    
# Lets zoom in to the mainland USA

df_geo.plot(kind = "scatter", x = "longitude", y = "latitude", c = "red", s = 100,
            xlim = (-140,-60), ylim = (20,60))









    Out[42]:





<matplotlib.axes._subplots.AxesSubplot at 0x11674c4d0>



In [43]:

    
# Lets plot some data on this plot
df_demo_geo = pd.merge(df_demo, df_geo, on = "region")
df_demo_geo.index = df_demo_geo.region
df_demo_geo.head()









    Out[43]:






  
    
      
      region
      total_population
      percent_white
      percent_black
      percent_asian
      percent_hispanic
      per_capita_income
      median_rent
      median_age
      state
      latitude
      longitude
      status
    
    
      region
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      wyoming
      wyoming
      570134
      85
      1
      1
      9
      28902
      647
      36.8
      WY
      42.7475
      -107.2085
      illegal
    
    
      district of columbia
      district of columbia
      619371
      35
      49
      3
      10
      45290
      1154
      33.8
      DC
      38.8964
      -77.0262
      legal
    
    
      vermont
      vermont
      625904
      94
      1
      1
      2
      29167
      754
      42.0
      VT
      44.0407
      -72.7093
      decriminalized+medical
    
    
      north dakota
      north dakota
      689781
      88
      1
      1
      2
      29732
      564
      36.4
      ND
      47.5362
      -99.7930
      illegal
    
    
      alaska
      alaska
      720316
      63
      3
      5
      6
      32651
      978
      33.6
      AK
      61.3850
      -152.2683
      legal



In [44]:

    
df_demo_geo.index = df_demo_geo.region
df_demo_geo.iloc[:,10:12].head()









    Out[44]:






  
    
      
      latitude
      longitude
    
    
      region
      
      
    
  
  
    
      wyoming
      42.7475
      -107.2085
    
    
      district of columbia
      38.8964
      -77.0262
    
    
      vermont
      44.0407
      -72.7093
    
    
      north dakota
      47.5362
      -99.7930
    
    
      alaska
      61.3850
      -152.2683



In [45]:

    
# Lets draw a scatter plot
df_demo_geo.plot(kind = "scatter", x = "longitude", y = "latitude", c = "per_capita_income", s = 400,
                 xlim = (-140,-60), ylim = (20,60), colormap = "Oranges")









    Out[45]:





<matplotlib.axes._subplots.AxesSubplot at 0x1179ff210>

Exercise

Draw a scatterplot of latitude and longitude with color as median age.



In [ ]:

Lets plot it on a true geographic system using folium



In [46]:

    
from IPython.display import HTML
import folium

# Temporary code to embed in Jupyter Notebook  - will be fixed in version 0.16
def display(m, height=500):
    """Takes a folium instance and embed HTML."""
    m._build_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{0}" '
                 'style="width: 100%; height: {1}px; '
                 'border: none"></iframe>'.format(srcdoc, height))
    return embed



In [47]:

    
map = folium.Map(location=[48, -102],  zoom_start= 4)
display(map)









    Out[47]:



In [ ]:

    
# Import a Geo JSON
state_geo = r'data/us-states.json'



In [ ]:

    
# Plot it as an overlay on the map
map = folium.Map(location=[48, -102], zoom_start=3)
map.geo_json(geo_path = state_geo)
display(map)



In [ ]:

    
# Lets bind some data to this
df_demo_geo.sort("region", inplace = True)
df_demo_geo.head()



In [ ]:

    
mapa = folium.Map(location=[48, -102], zoom_start=3)

mapa.geo_json(geo_path=state_geo, 
              data=df_demo_geo,
              columns=['state', 'total_population'],
              fill_color='BuPu',
              key_on='feature.id', reset = True)

# We will not use inline version due to bug in 0.15.0

mapa.create_map()

# Go to http://localhost:8888/files/map.html

Exercise

Create a choropleth using the per_capita_income status



In [ ]:

	State	HighQ	HighQN	MedQ	MedQN	LowQ	LowQN	date	year
20098	California	248.77	12021	193.44	12724	193.88	770	2013-12-27	2013
20863	California	248.74	12025	193.44	12728	193.88	770	2013-12-28	2013
21577	California	248.76	12047	193.55	12760	193.60	772	2013-12-29	2013
22291	California	248.82	12065	193.54	12779	193.80	773	2013-12-30	2013
22801	California	248.76	12082	193.54	12792	193.80	773	2013-12-31	2013

	State	HighQ	HighQN	MedQ	MedQN	LowQ	LowQN	date	year
date
2013-12-27	California	248.77	12021	193.44	12724	193.88	770	2013-12-27	2013
2013-12-28	California	248.74	12025	193.44	12728	193.88	770	2013-12-28	2013
2013-12-29	California	248.76	12047	193.55	12760	193.60	772	2013-12-29	2013
2013-12-30	California	248.82	12065	193.54	12779	193.80	773	2013-12-30	2013
2013-12-31	California	248.76	12082	193.54	12792	193.80	773	2013-12-31	2013

	HighQ	HighQN	MedQ	MedQN	LowQ	LowQN	year
count	449.000000	449.000000	449.000000	449.000000	449.000000	449.000000	449.000000
mean	245.376125	14947.073497	191.268909	16769.821826	189.783586	976.298441	2014.167038
std	1.727046	1656.133565	1.524028	2433.943191	1.598252	120.246714	0.402204
min	241.840000	12021.000000	187.850000	12724.000000	187.830000	770.000000	2013.000000
25%	244.480000	13610.000000	190.260000	14826.000000	188.600000	878.000000	2014.000000
50%	245.310000	15037.000000	191.570000	16793.000000	188.600000	982.000000	2014.000000
75%	246.220000	16090.000000	192.550000	18435.000000	191.320000	1060.000000	2014.000000
max	248.820000	18492.000000	193.630000	22027.000000	193.880000	1232.000000	2015.000000

	State	HighQ	HighQN	MedQ	MedQN	LowQ	LowQN	date	year
0	Alabama	339.06	1042	198.64	933	149.49	123	2014-01-01	2014
765	Alabama	339.20	1043	198.64	933	149.49	123	2014-01-02	2014
1479	Alabama	339.20	1043	198.64	933	148.48	124	2014-01-03	2014
2244	Alabama	339.20	1043	198.43	934	148.48	124	2014-01-04	2014
3009	Alabama	339.32	1046	198.13	936	148.48	124	2014-01-05	2014

	region	total_population	percent_white	percent_black	percent_asian	percent_hispanic	per_capita_income	median_rent	median_age
0	alabama	4799277	67	26	1	4	23680	501	38.1
1	alaska	720316	63	3	5	6	32651	978	33.6
2	arizona	6479703	57	4	3	30	25358	747	36.3
3	arkansas	2933369	74	15	1	7	22170	480	37.5
4	california	37659181	40	6	13	38	29527	1119	35.4

	region	total_population	percent_white	percent_black	percent_asian	percent_hispanic	per_capita_income	median_rent	median_age
region
alabama	alabama	4799277	67	26	1	4	23680	501	38.1
alaska	alaska	720316	63	3	5	6	32651	978	33.6
arizona	arizona	6479703	57	4	3	30	25358	747	36.3
arkansas	arkansas	2933369	74	15	1	7	22170	480	37.5
california	california	37659181	40	6	13	38	29527	1119	35.4

	region	total_population	percent_white	percent_black	percent_asian	percent_hispanic	per_capita_income	median_rent	median_age
region
wyoming	wyoming	570134	85	1	1	9	28902	647	36.8
district of columbia	district of columbia	619371	35	49	3	10	45290	1154	33.8
vermont	vermont	625904	94	1	1	2	29167	754	42.0
north dakota	north dakota	689781	88	1	1	2	29732	564	36.4
alaska	alaska	720316	63	3	5	6	32651	978	33.6

	region	state	latitude	longitude	status
0	alaska	AK	61.3850	-152.2683	legal
1	alabama	AL	32.7990	-86.8073	medical-limited
2	arkansas	AR	34.9513	-92.3809	illegal
3	arizona	AZ	33.7712	-111.3877	medical
4	california	CA	36.1700	-119.7462	decriminalized+medical

	latitude	longitude
region
wyoming	42.7475	-107.2085
district of columbia	38.8964	-77.0262
vermont	44.0407	-72.7093
north dakota	47.5362	-99.7930
alaska	61.3850	-152.2683