# 3. Explore the Data

"I don't know, what I don't know"

• Why do visual exploration?
• Understand Data Structure & Types
• Explore single variable graphs - Quantitative, Categorical
• Explore dual variable graphs - (Q & Q, Q & C, C & C)
• Explore multi variable graphs
``````

In [1]:

import pandas as pd
import numpy as np

``````
``````

In [2]:

# Load the price data again and fill the missing values, Add year
df.sort(columns=['State','date'], inplace=True)
df.fillna(method = "ffill", inplace=True)

``````
``````

In [3]:

``````

Lets load the libraries required for Visual Exploration

``````

In [4]:

# Load the visualisation libraries - Matplotlib and Seaborn
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

``````
``````

In [5]:

# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)

``````

## 3.1 Quantiative Variable - Single Variable

``````

In [6]:

# Filter data for location California and calculate the Year
df['year'] = pd.DatetimeIndex(df['date']).year
df_cal = df[df["State"] == "California"]

``````
``````

Out[6]:

State
HighQ
HighQN
MedQ
MedQN
LowQ
LowQN
date
year

20098
California
248.77
12021
193.44
12724
193.88
770
2013-12-27
2013

20863
California
248.74
12025
193.44
12728
193.88
770
2013-12-28
2013

21577
California
248.76
12047
193.55
12760
193.60
772
2013-12-29
2013

22291
California
248.82
12065
193.54
12779
193.80
773
2013-12-30
2013

22801
California
248.76
12082
193.54
12792
193.80
773
2013-12-31
2013

``````
``````

In [7]:

# Plot
df_cal.plot(x = "date", y = "HighQ")

``````
``````

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a1e4f50>

``````
``````

In [8]:

# Set index as date - this is important to get the labels in the plots automatically
df_cal.index = df_cal.date

``````
``````

Out[8]:

State
HighQ
HighQN
MedQ
MedQN
LowQ
LowQN
date
year

date

2013-12-27
California
248.77
12021
193.44
12724
193.88
770
2013-12-27
2013

2013-12-28
California
248.74
12025
193.44
12728
193.88
770
2013-12-28
2013

2013-12-29
California
248.76
12047
193.55
12760
193.60
772
2013-12-29
2013

2013-12-30
California
248.82
12065
193.54
12779
193.80
773
2013-12-30
2013

2013-12-31
California
248.76
12082
193.54
12792
193.80
773
2013-12-31
2013

``````
``````

In [9]:

# Lets plot the HighQ prices
df_cal.HighQ.plot()

``````
``````

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a3a3d90>

``````
``````

In [10]:

# Lets plot this HighQ as a histogram to see the most common price
df_cal.HighQ.plot(kind = "hist")

``````
``````

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a632210>

``````
``````

In [11]:

# Lets increase the bins to see some granularity
df_cal.HighQ.plot(kind = "hist", bins = 40)

``````
``````

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a6b1190>

``````

## 3.2 Quantiative - Multi Variable

``````

In [12]:

# Lets plot all the three prices in California
df_cal[["HighQ", "MedQ", "LowQ"]].plot()

``````
``````

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a6b11d0>

``````
``````

In [13]:

# Lets see the distribution of these prices by using a histogram
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "hist", bins = 50, alpha = 0.5)

``````
``````

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x10b204750>

``````

### Exercise

Filter the data for 2014 and Alaska

``````

In [ ]:

``````

Plot the HighQ, MedQ and LowQ prices for Alaska in 2014

``````

In [ ]:

``````

Plot the histogram of HighQ, MedQ and LowQ prices for Alaska in 2014

``````

In [ ]:

``````

### Box Plots

``````

In [14]:

# Lets plot a box plot for the HighQ, MedQ and LowQ
df_cal.describe()

``````
``````

Out[14]:

HighQ
HighQN
MedQ
MedQN
LowQ
LowQN
year

count
449.000000
449.000000
449.000000
449.000000
449.000000
449.000000
449.000000

mean
245.376125
14947.073497
191.268909
16769.821826
189.783586
976.298441
2014.167038

std
1.727046
1656.133565
1.524028
2433.943191
1.598252
120.246714
0.402204

min
241.840000
12021.000000
187.850000
12724.000000
187.830000
770.000000
2013.000000

25%
244.480000
13610.000000
190.260000
14826.000000
188.600000
878.000000
2014.000000

50%
245.310000
15037.000000
191.570000
16793.000000
188.600000
982.000000
2014.000000

75%
246.220000
16090.000000
192.550000
18435.000000
191.320000
1060.000000
2014.000000

max
248.820000
18492.000000
193.630000
22027.000000
193.880000
1232.000000
2015.000000

``````
``````

In [15]:

# Lets plot a Box Plot for the prices
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "box")

``````
``````

Out[15]:

<matplotlib.axes._subplots.AxesSubplot at 0x10b687110>

``````
``````

In [16]:

# Lets plot a Box Plot for the sample size
df_cal[["HighQN", "MedQN", "LowQN"]].plot(kind = "box")

``````
``````

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x10bac04d0>

``````

What if we want to show the price in all the states in the year 2014?

``````

In [17]:

# Select only the year 2014
df_2014 = df[df["year"] == 2014]

``````
``````

Out[17]:

State
HighQ
HighQN
MedQ
MedQN
LowQ
LowQN
date
year

0
Alabama
339.06
1042
198.64
933
149.49
123
2014-01-01
2014

765
Alabama
339.20
1043
198.64
933
149.49
123
2014-01-02
2014

1479
Alabama
339.20
1043
198.64
933
148.48
124
2014-01-03
2014

2244
Alabama
339.20
1043
198.43
934
148.48
124
2014-01-04
2014

3009
Alabama
339.32
1046
198.13
936
148.48
124
2014-01-05
2014

``````
``````

In [18]:

# Lets use pivot tables to get HighQ values for each Date by each State
df_states = pd.pivot_table(df_2014, values = "HighQ", index = "date", columns = "State")

``````
``````

Out[18]:

State
Alabama
Arizona
Arkansas
California
Connecticut
Delaware
District of Columbia
Florida
...
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming

date

2014-01-01
339.06
288.75
303.31
361.85
248.78
236.31
347.90
373.18
352.26
306.43
...
387.8
359.75
343.33
291.54
379.88
374.01
236.03
360.80
357.99
355.90

2014-01-02
339.20
289.67
303.36
361.85
248.67
236.20
347.84
373.18
352.02
306.35
...
387.8
359.75
343.35
291.70
379.88
373.62
236.13
359.80
358.19
355.90

2014-01-03
339.20
289.14
303.31
361.37
248.67
236.31
347.89
373.18
351.40
306.21
...
387.8
359.84
343.26
291.61
379.76
373.77
236.03
360.72
358.08
354.03

2014-01-04
339.20
289.14
303.22
361.21
248.65
236.07
347.89
373.35
351.40
306.11
...
387.8
359.74
343.23
291.78
379.76
373.72
236.03
361.11
358.04
354.03

2014-01-05
339.32
289.14
303.22
361.21
248.68
236.14
347.79
373.35
352.18
306.12
...
387.8
359.68
343.14
291.56
379.76
373.77
235.97
361.11
358.18
354.03

5 rows × 51 columns

``````
``````

In [19]:

# Lets plot of these lines
df_states.plot()

``````
``````

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x10bc97fd0>

``````
``````

In [20]:

df_states.iloc[:,1:10].plot()

``````
``````

Out[20]:

<matplotlib.axes._subplots.AxesSubplot at 0x106ca3350>

``````
``````

In [21]:

# What if we group by State and plot
# df_2014.groupby("State").plot(x = "date", y = "HighQ")

``````
``````

In [22]:

# Arrange in a grid fashion
grid = sns.FacetGrid(df_2014, col = "State", col_wrap = 7)
grid.map(plt.plot, "date", "HighQ")

``````
``````

Out[22]:

<seaborn.axisgrid.FacetGrid at 0x1072c7d90>

``````

## 3.3 Single Variable - Categorical

``````

In [23]:

``````
``````

Out[23]:

region
total_population
percent_white
percent_black
percent_asian
percent_hispanic
per_capita_income
median_rent
median_age

0
alabama
4799277
67
26
1
4
23680
501
38.1

1
720316
63
3
5
6
32651
978
33.6

2
arizona
6479703
57
4
3
30
25358
747
36.3

3
arkansas
2933369
74
15
1
7
22170
480
37.5

4
california
37659181
40
6
13
38
29527
1119
35.4

``````
``````

In [24]:

# Create an index in the demographic data to ease the labels
df_demo.index = df_demo.region

``````
``````

Out[24]:

region
total_population
percent_white
percent_black
percent_asian
percent_hispanic
per_capita_income
median_rent
median_age

region

alabama
alabama
4799277
67
26
1
4
23680
501
38.1

720316
63
3
5
6
32651
978
33.6

arizona
arizona
6479703
57
4
3
30
25358
747
36.3

arkansas
arkansas
2933369
74
15
1
7
22170
480
37.5

california
california
37659181
40
6
13
38
29527
1119
35.4

``````
``````

In [25]:

# DO NOT make pie charts, especially when the number of category is greater than 6
df_demo.total_population.plot(kind = "pie")

``````
``````

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x106cac090>

``````