Big Data Applications and Analytics: Term Project

Sean M. Shiverick Fall 2017

Data Visualization

Resources:

Dataset: 2015 NSDUH

1. Import modules and load the data

  • Import Python modules
  • Load the data file and save it as a DataFrame object
  • Subset the DataFrame by column

In [1]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Read the survey extract into a DataFrame
df = pd.read_csv('~/project-data.csv')
# Drop the first two columns by position
df.drop(df.columns[[0,1]], axis=1, inplace=True)
df.shape


Out[2]:
(57146, 21)
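
The load-and-subset step can also be written without modifying the frame in place. The following is a minimal sketch, assuming a made-up three-row CSV held in memory instead of project-data.csv; the names `csv_text`, `raw`, `df_small`, `idx`, and `row_id` are hypothetical and used only for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for project-data.csv: two leading index-like
# columns followed by two survey variables (values are made up).
csv_text = """idx,row_id,AGECAT,SEX
0,101,2,1
1,102,3,2
2,103,1,1
"""

raw = pd.read_csv(io.StringIO(csv_text))

# Keep everything from the third column onward; `raw` itself is not modified.
df_small = raw.iloc[:, 2:]

print(df_small.shape)          # (rows, columns) after the subset
print(list(df_small.columns))
```

Selecting by position with `iloc` returns a new object, so the original frame stays available if the dropped columns are needed later.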

Explore the data frame to check headers

  • Look at column headers, variable information, types, etc.

In [3]:
df.columns


Out[3]:
Index(['AGECAT', 'SEX', 'MARRIED', 'EDUCAT', 'EMPLOY18', 'CTYMETRO', 'HEALTH',
       'MENTHLTH', 'SUICATT', 'PRLMISEVR', 'PRLMISAB', 'PRLANY', 'HEROINEVR',
       'HEROINUSE', 'HEROINFQY', 'TRQLZRS', 'SEDATVS', 'COCAINE', 'AMPHETMN',
       'TRTMENT', 'MHTRTMT'],
      dtype='object')

In [4]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57146 entries, 0 to 57145
Data columns (total 21 columns):
AGECAT       57146 non-null int64
SEX          57146 non-null int64
MARRIED      57146 non-null float64
EDUCAT       57146 non-null int64
EMPLOY18     57146 non-null float64
CTYMETRO     57146 non-null int64
HEALTH       57146 non-null float64
MENTHLTH     57146 non-null float64
SUICATT      57146 non-null float64
PRLMISEVR    57146 non-null int64
PRLMISAB     57146 non-null float64
PRLANY       57146 non-null int64
HEROINEVR    57146 non-null int64
HEROINUSE    57146 non-null int64
HEROINFQY    57146 non-null float64
TRQLZRS      57146 non-null int64
SEDATVS      57146 non-null int64
COCAINE      57146 non-null int64
AMPHETMN     57146 non-null int64
TRTMENT      57146 non-null float64
MHTRTMT      57146 non-null float64
dtypes: float64(9), int64(12)
memory usage: 9.2 MB
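
The same checks that `df.info()` prints can be collected as objects for later use. This is a small sketch on a made-up four-row frame (the name `demo` and its values are assumptions, not the survey data); it counts columns per dtype and missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the survey frame (values are made up).
demo = pd.DataFrame({
    "AGECAT": [1, 2, 3, 4],
    "MARRIED": [0.0, 1.0, np.nan, 1.0],
    "HEROINUSE": [0, 0, 1, 0],
})

dtype_counts = demo.dtypes.astype(str).value_counts()  # columns per dtype
missing_per_column = demo.isna().sum()                 # NaN count per column

print(dtype_counts)
print(missing_per_column)
```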

First plot: scatterplot with linear correlation

  • Compare Y = PRLMISAB and X = HEROINUSE.
  • Pass PRLMISAB as the Y variable and HEROINUSE as the X variable to seaborn's lmplot (linear model plot).
  • It plots the points, axes, and regression line, and also draws a confidence band around the line. Super handy!
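
To see the kind of numbers behind a regression line like the one lmplot draws, here is a minimal sketch using NumPy on synthetic data. The data, the seed, and the names `x`, `y`, `slope`, `intercept`, and `r` are all assumptions for illustration; the sketch fits a straight line by least squares and computes the Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y rises by about 2 per unit of x.
x = rng.integers(0, 5, size=500).astype(float)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=500)

# Degree-1 least-squares fit: returns slope first, then intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```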

In [5]:
sns.set(style='ticks')
sns.lmplot(y='PRLMISAB',x='HEROINUSE',data=df)


Out[5]:
<seaborn.axisgrid.FacetGrid at 0x11595d518>

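Second plot: PRLMISAB against HEROINUSE, split by CTYMETRO.

No real hypothesis here; this just shows how to fit a separate regression line for each level of a grouping variable. The hue variable in this plot is CTYMETRO, not race, so the 1=white, 2=black, 3=other coding does not apply here. Use the savefig command below to save the plot.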


In [6]:
# Fit one regression line per CTYMETRO level; draw the plot once, then save it
p = sns.lmplot(y='PRLMISAB', x='HEROINUSE', hue='CTYMETRO', data=df)
p.savefig('fancy-regression-chart.png')
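
Splitting a regression plot by a hue variable amounts to fitting one line per group. A rough sketch of that idea follows, on synthetic data; the column names `group`, `x`, `y` and the true slopes are made up for illustration and do not come from the survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic frame; 'group' plays the role of the hue variable.
n = 300
demo = pd.DataFrame({
    "group": np.repeat([1, 2, 3], n // 3),
    "x": rng.integers(0, 5, size=n).astype(float),
})
# Give each group a different true slope (0.5, 1.0, 1.5) plus noise.
demo["y"] = demo["group"] * 0.5 * demo["x"] + rng.normal(0.0, 0.3, size=n)

# One least-squares slope per group, mirroring one regression line per hue level.
slopes = {
    level: np.polyfit(part["x"], part["y"], deg=1)[0]
    for level, part in demo.groupby("group")
}

print(slopes)
```

Comparing the per-group slopes numerically is a useful companion to the plot when the lines overlap visually.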


Third plot: factorplot

  • Compare the interaction of HEROINEVR, PRLMISEVR, and SEX using bar charts.

In [7]:
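# Count plot of HEROINEVR, colored by PRLMISEVR, one panel per SEX
# (seaborn 0.9 and later name this function catplot)
sns.factorplot(x='HEROINEVR', hue='PRLMISEVR',col='SEX',kind='count',data=df)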


Out[7]:
<seaborn.axisgrid.FacetGrid at 0x115972cf8>
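
Each bar in a count plot like this is the number of rows for one combination of the panel, x, and hue variables. The sketch below tabulates those counts with pandas on a made-up eight-row sample (the values in `demo` are hypothetical, chosen only to make the counts easy to check by hand).

```python
import pandas as pd

# Hypothetical eight-respondent sample (values are made up).
demo = pd.DataFrame({
    "SEX":       [1, 1, 1, 1, 2, 2, 2, 2],
    "HEROINEVR": [0, 0, 0, 1, 0, 1, 1, 1],
    "PRLMISEVR": [0, 0, 1, 1, 0, 0, 1, 1],
})

# Rows per (SEX, HEROINEVR, PRLMISEVR) combination: one bar each in the count plot.
counts = demo.groupby(["SEX", "HEROINEVR", "PRLMISEVR"]).size()

print(counts)
```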

Fourth Plot: Pairplots

Pair plots show the distribution of each variable and plot it against every other variable to show their relationships. The graph can be split by the values of a chosen 'hue' variable.


In [8]:
'AGECAT', 'SEX', 'MARRIED', 'EDUCAT', 'EMPLOY18', 'CTYMETRO', 'HEALTH',
'MENTHLTH', 'SUICATT', 'PRLMISEVR', 'PRLMISAB', 'PRLANY', 'HEROINEVR',
'HEROINUSE', 'HEROINFQY', 'TRQLZRS', 'SEDATVS', 'COCAINE', 'AMPHETMN',
'TRTMENT', 'MHTRTMT'


Out[8]:
('TRTMENT', 'MHTRTMT')

In [9]:
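# Pair plot of three variables, colored by CTYMETRO
# (seaborn 0.9 and later rename the size= argument to height=)
df1 = df[['MENTHLTH','PRLMISAB','HEROINUSE','CTYMETRO']]
sns.pairplot(df1, hue = 'CTYMETRO',size=2.5);
plt.savefig('Figure3.png', bbox_inches='tight')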



In [10]:
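# Pair plot of three variables, colored by SEX
# (seaborn 0.9 and later rename the size= argument to height=)
df1 = df[['AGECAT','SEX','PRLMISAB','HEROINUSE']]
sns.pairplot(df1, hue = 'SEX',size=2.5);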
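
A pairwise correlation matrix is a compact numeric companion to a pair plot: each off-diagonal scatter panel corresponds to one coefficient. The following sketch uses synthetic columns `a`, `b`, and `c` (assumptions for illustration, not survey variables), where `b` is built to correlate with `a` and `c` is independent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic stand-in columns (not the survey data).
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),  # built to correlate with a
    "c": rng.normal(size=200),                 # independent of a and b
})

# Pairwise Pearson correlations: one number per off-diagonal scatter panel.
corr = demo.corr()

print(corr.round(2))
```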


