Exploratory Data Analysis

When placed in Metapack data package, this notebook will load the package and run a variety of common EDA operations on the first resource.


In [1]:
import matplotlib.pyplot as plt 
import seaborn as sns
import metapack as mp
import pandas as pd
import numpy as np
from IPython.display import display 

%matplotlib inline
sns.set_context('notebook')

In [2]:
pd.__version__


Out[2]:
'0.23.4'

In [3]:
pkg = mp.jupyter.open_package()

# For testing and development
#pkg = mp.open_package('http://s3.amazonaws.com/library.metatab.org/cde.ca.gov-accountability_dashboard-2.zip')

pkg


Out[3]:

Cornell National Social Survey (CNSS), 2017

Cornell National Social Survey is a random-sample survey of adults aged 18 and over. In 2017, participants were asked their opinions on a range of topics.

cornell.edu-cnss-2017-1 from metapack+file:///Users/eric/proj/virt-proj/data-project/sdrdl-data-projects/cornell-social-survey/cornell.edu-cnss-2017/_packages/cornell.edu-cnss-2017-1/metadata.csv

To use this package, you will have to download the research file manually, placing the file in the data directory. Cornell requires you to accept terms and conditions before downloading the file, so we can't redistribute it.

I used Stat Transfer to convert the SPSS file to Stata, but there is one variable, KRq1, "Number of Dogs Owned", that StatTransfer converts with duplicate labels.

Documentation

Contacts

Wrangler: Eric Busboom Civic Knowledge

Resources

  1. cnss_2017 CNSS 2017, with categorical values

References

  1. cnss_2017 Cornell National Social Survey 2017, in converted Stata forma


In [4]:
resource_name='cnss_2017'

In [5]:
pkg.resource(resource_name)


Out[5]:

cnss_2017

file:///Users/eric/proj/virt-proj/data-project/sdrdl-data-projects/cornell-social-survey/cornell.edu-cnss-2017/_packages/cornell.edu-cnss-2017-1/data/cnss_2017.csv

HeaderTypeDescription
caseidintegerCase identification number (assigned by SRI)
survidintegerCase identification number (assigned by SRI)
timezonestringTime zone (provided by MSG)
statestringState (provided by MSG)
msastringMetropolitan Statistical Area (provided by MSG)
mscintegerMetropolitan Status Code (provided by MSG)
censusrintegerCensus Region (provided by MSG)
censusdintegerCensus Division (provided by MSG)
cbsamsaintegerCBSA MSA Met Status Code (provided by MSG)
cbsamcsaintegerCBSA MCSA Met Status Code (provided by MSG)
ahq1stringCollege degrees for prisoners
ahq2stringState funded college for prisoners
ahq3stringPrisoners should repay education costs
kenq1stringCountry needs strong leader
kenq2stringCourts get in way of leaders
kenq3stringMedia get in way of leaders
js_versionstringRandomized text within JSq1
jsq1stringContraceptive policy under Trump
lcq1stringChina rise a threat or opportunity
skq1stringMens rights participant
skq2_astringRed pill - Heard of term
skq2_bstringAlt right - Heard of term
skq3stringMen should be alpha
jbq1stringResource to find new surgeon
jbq2stringUsed internet to find physician
jbq3stringHelpfulness of internet w/ finding physician
rvq1string# of healthcare visits in the past year
rvq2stringRate customer service last healthcare visit
rvq3stringMost urgent healthcare issue
rvq4stringEmployer offers wellness program
rvq5stringDescribe workplace wellness program
rvq6stringRate workplace wellness program
jaq1stringTeen access to patient portal
rq_seq_istringJAq2, JAq3 - Sequence Index Variable
jaq2_rqstringParent access teen medical record
jaq3_rqstringSensitive issue avoidance if parent views record
pdq1stringCosmetic or health reasons - Bariatric surgery
pdq2stringQuick fix - Bariatric surgery
pdq3stringShould insurance cover - Bariatric surgery
krq2stringCanine clinical study interest
mmq1stringAware of bee health concern
mmq2stringPersonal concern about bee health
mmq3stringProduce protecting bees - Pay more
mmq4stringOrganic food - Pay more
riq1stringWhat do antibiotics kill
riq2stringCow antibiotics threaten human health
riq3stringMilk without antibiotics - Pay more
riq4stringCow treatment on conventional/organic farms
szq1stringAttractive natural sights - Neighborhood
szq2stringPeople look out for each other - Neighborhood
szq3stringMental health a priority - Community
szq4stringSufficient mental health services - Community
szq5stringPhysical environment impacts mental health
ebq1stringWorry about crime - Workplace
ebq2stringWorry about crime - Neighborhood
ebq3stringRecidivism due to criminal record
drugq1stringMarijuana legalization
mfq1stringRate distraction at work
mfq2stringWork from home frequency
ssq1stringHomework completion rate in high school
kwq1_astringWork harassment - Experienced
kwq1_bstringDomestic violence - Experienced
kwq2stringExperience w/ violence impacted work
skq4stringHarder for men to be successful
skq5stringMen should protect women
skq6stringFeminism good or bad for America
eycq1stringAsked to do favor outside of work
eycq2stringWho asked for favor outside of work
eycq3stringTime spent doing favor outside of work
eycq4stringRules about doing favors outside of work
lvq1_astringRestless - Past 30 days
lvq1_bstringEverything an effort - Past 30 days
mjq1stringUndocumented farmworkers community impact
idq1stringOccupational category
idq2stringLabor union member
idq3stringVoted in 2016 presidential election
idq4stringElected officials represent the rich
idq5stringMinorities get government advantages
employstringEmployed
jbtypestringMain job type
selfemplstringSelf-employed
lkworkstringLooking for new work
lvq2_astringPerformance raise - Eligible at work
lvq2_bstringPerformance bonus - Eligible at work
hhsize_ainteger# adults 65+ in household
hhsize_binteger# adults 18-64 in household
hhsize_cinteger# children in household
hhadultsinteger
ph_totlinteger# phones for household
ph_cellstringCell/Landline for survey
yobintegerYear born
ageintegerAge (computed from yob)
borninusstringBorn in US
marriedstringMarital status
ideostringSocial ideology
partystringPolitical party
educstringEducation level
ownrentstringHome ownership status
hispstringHispanic or Latino
race_astringCaucasian - Race
race_bstringAfrican-American - Race
race_cstringNative American - Race
race_dstringAsian - Race
race_estringOther - Race
numracesinteger
religstringReligious affiliation
churchstringHow often attend religious services
hhincestringExact household income 2016
hhinc50kstringOver/Under $50k - Household income 2016
hhincustringRange under $50k - Household income 2016
hhincostringRange over $50k - Household income 2016
hhincstringHousehold income 2016 - Coded value
genderstringGender

In [6]:
df = pkg.resource(resource_name).read_csv(parse_dates=True)

In [7]:
df.head()


Out[7]:
caseid survid timezone state msa msc censusr censusd cbsamsa cbsamcsa ... race_e numraces relig church hhince hhinc50k hhincu hhinco hhinc gender
0 80007 80007 C TX 2920.0 1 3 7 3 5 ... No 1.0 No religion / Atheist / Agnostic A few times a year NaN $50,000 or over NaN $150,000 or more $150,000 or more Male
1 80027 80027 C AL NaN 5 3 6 5 1 ... No 1.0 Protestant A few times a year NaN $50,000 or over NaN 50 to under $75,000 50 to under $75,000 Female
2 80029 80029 C LA 5560.0 3 3 7 3 5 ... No 1.0 Catholic A few times a year 196000.0 NaN NaN NaN $150,000 or more Male
3 80037 80037 C IN NaN 5 2 3 5 1 ... No 1.0 No religion / Atheist / Agnostic Never 75000.0 NaN NaN NaN 75 to under $100,000 Male
4 80041 80041 C MO 3760.0 1 2 4 1 5 ... No 1.0 No religion / Atheist / Agnostic Never 60000.0 NaN NaN NaN 50 to under $75,000 Female

5 rows × 113 columns


In [8]:
empty_col_names = [cn for cn in df.columns if df[cn].nunique() == 0]
const_col_names= [cn for cn in df.columns if df[cn].nunique() == 1]
ignore_cols = empty_col_names+const_col_names
dt_col_names= list(df.select_dtypes(include=[np.datetime64]).columns)

number_col_names = [ cn for cn in df.select_dtypes(include=[np.number]).columns if cn not in ignore_cols ]
other_col_names = [cn for cn in df.columns if cn not in (empty_col_names+const_col_names+dt_col_names+number_col_names)]

In [9]:
pd.DataFrame.from_dict({'empty':[len(empty_col_names)], 
                        'const':[len(const_col_names)],
                        'datetime':[len(dt_col_names)],
                        'number':[len(number_col_names)],
                        'other':[len(other_col_names)],
                       }, 
                       orient='index', columns=['count'])


Out[9]:
count
empty 0
const 2
datetime 0
number 16
other 95

Constant Columns


In [10]:
if const_col_names:
    display(df[const_col_names].drop_duplicates().T)


0 1 4 36
rvq3 Answer Answer NaN NaN
rvq5 NaN Answer Answer NaN

Empty Columns


In [11]:
if empty_col_names:
    display(df[empty_col_names].drop_duplicates().T)

Date and Time Columns


In [12]:
if dt_col_names:
    display(df[dt_col_names].info())
    display(df[dt_col_names].describe().T)

Number Columns


In [13]:
if number_col_names:
    display(df[number_col_names].info())
    display(df[number_col_names].describe().T)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
caseid      1000 non-null int64
survid      1000 non-null int64
msa         830 non-null float64
msc         1000 non-null int64
censusr     1000 non-null int64
censusd     1000 non-null int64
cbsamsa     1000 non-null int64
cbsamcsa    1000 non-null int64
hhsize_a    999 non-null float64
hhsize_b    999 non-null float64
hhsize_c    998 non-null float64
hhadults    999 non-null float64
ph_totl     998 non-null float64
yob         1000 non-null int64
age         1000 non-null int64
numraces    991 non-null float64
dtypes: float64(7), int64(9)
memory usage: 125.1 KB
None
count mean std min 25% 50% 75% max
caseid 1000.0 67918.285000 31335.857247 10011.0 80093.5 83635.0 86935.5 90998.0
survid 1000.0 67918.285000 31335.857247 10011.0 80093.5 83635.0 86935.5 90998.0
msa 830.0 4446.293976 2542.086945 80.0 1960.0 4920.0 6280.0 9320.0
msc 1000.0 2.196000 1.553060 1.0 1.0 1.0 3.0 5.0
censusr 1000.0 2.590000 1.005441 1.0 2.0 3.0 3.0 4.0
censusd 1000.0 5.048000 2.473017 1.0 3.0 5.0 7.0 9.0
cbsamsa 1000.0 1.880000 1.429561 1.0 1.0 1.0 3.0 5.0
cbsamcsa 1000.0 4.636000 1.147528 1.0 5.0 5.0 5.0 5.0
hhsize_a 999.0 0.468468 0.760206 0.0 0.0 0.0 1.0 3.0
hhsize_b 999.0 1.866867 1.329974 0.0 1.0 2.0 2.0 10.0
hhsize_c 998.0 0.630261 1.090014 0.0 0.0 0.0 1.0 7.0
hhadults 999.0 2.335335 1.207302 1.0 2.0 2.0 3.0 11.0
ph_totl 998.0 2.641283 1.480593 1.0 2.0 2.0 3.0 10.0
yob 1000.0 1968.829000 18.057459 1922.0 1955.0 1969.0 1984.0 1999.0
age 1000.0 48.171000 18.057459 18.0 33.0 48.0 62.0 95.0
numraces 991.0 1.109990 0.374720 1.0 1.0 1.0 1.0 4.0

Distributions


In [14]:
def plot_histograms(df):

    col_names = list(df.columns)

    n_cols = np.ceil(np.sqrt(len(col_names)))
    n_rows = np.ceil(np.sqrt(len(col_names)))

    #plt.figure(figsize=(3*n_cols,3*n_rows))
    fig, ax = plt.subplots(figsize=(3*n_cols,3*n_rows))
    
    for i in range(0,len(col_names)):
        plt.subplot(n_rows + 1,n_cols,i+1)
        try:
            g = sns.distplot(df[col_names[i]].dropna(),kde=True) 
            g.set(xticklabels=[])
            g.set(yticklabels=[])
        except:
            pass
        
    plt.tight_layout()

In [15]:
plot_histograms(df[number_col_names])


/Users/eric/proj/virt/data-project/lib/python3.6/site-packages/scipy/stats/stats.py:1706: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

Box Plots


In [16]:
def plot_boxes(df):

    col_names = list(df.columns)

    n_cols = np.ceil(np.sqrt(len(col_names)))
    n_rows = np.ceil(np.sqrt(len(col_names)))

    #plt.figure(figsize=(2*n_cols,3*n_rows))
    fig, ax = plt.subplots(figsize=(2*n_cols,5*n_rows))
    
    for i in range(0,len(col_names)):
        plt.subplot(n_rows + 1,n_cols,i+1)
        try:
            g = sns.boxplot(df[col_names[i]].dropna(),orient='v') 
        except:
            pass
        
    plt.tight_layout()

In [17]:
plot_boxes(df[number_col_names])



In [18]:
## Correlations

In [19]:
cm = df[number_col_names].corr()

mask = np.zeros_like(cm, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(.5*len(number_col_names),.5*len(number_col_names)))
sns.heatmap(cm, mask=mask, cmap = 'viridis')


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x124a34f60>

Other Columns


In [20]:
if other_col_names:
    display(df[other_col_names].info())
    display(df[other_col_names].describe().T)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 95 columns):
timezone      1000 non-null object
state         1000 non-null object
ahq1          1000 non-null object
ahq2          998 non-null object
ahq3          999 non-null object
kenq1         998 non-null object
kenq2         996 non-null object
kenq3         997 non-null object
js_version    1000 non-null object
jsq1          989 non-null object
lcq1          983 non-null object
skq1          1000 non-null object
skq2_a        1000 non-null object
skq2_b        1000 non-null object
skq3          987 non-null object
jbq1          998 non-null object
jbq2          999 non-null object
jbq3          450 non-null object
rvq1          999 non-null object
rvq2          999 non-null object
rvq4          998 non-null object
rvq6          356 non-null object
jaq1          998 non-null object
rq_seq_i      1000 non-null object
jaq2_rq       995 non-null object
jaq3_rq       993 non-null object
pdq1          986 non-null object
pdq2          993 non-null object
pdq3          1000 non-null object
krq2          487 non-null object
mmq1          1000 non-null object
mmq2          998 non-null object
mmq3          999 non-null object
mmq4          999 non-null object
riq1          979 non-null object
riq2          983 non-null object
riq3          996 non-null object
riq4          982 non-null object
szq1          997 non-null object
szq2          998 non-null object
szq3          997 non-null object
szq4          979 non-null object
szq5          997 non-null object
ebq1          1000 non-null object
ebq2          1000 non-null object
ebq3          999 non-null object
drugq1        996 non-null object
mfq1          1000 non-null object
mfq2          783 non-null object
ssq1          999 non-null object
kwq1_a        1000 non-null object
kwq1_b        999 non-null object
kwq2          393 non-null object
skq4          999 non-null object
skq5          1000 non-null object
skq6          983 non-null object
eycq1         999 non-null object
eycq2         450 non-null object
eycq3         450 non-null object
eycq4         976 non-null object
lvq1_a        999 non-null object
lvq1_b        999 non-null object
mjq1          997 non-null object
idq1          997 non-null object
idq2          998 non-null object
idq3          969 non-null object
idq4          1000 non-null object
idq5          997 non-null object
employ        1000 non-null object
jbtype        652 non-null object
selfempl      651 non-null object
lkwork        978 non-null object
lvq2_a        650 non-null object
lvq2_b        651 non-null object
ph_cell       999 non-null object
borninus      1000 non-null object
married       999 non-null object
ideo          992 non-null object
party         996 non-null object
educ          999 non-null object
ownrent       998 non-null object
hisp          999 non-null object
race_a        993 non-null object
race_b        993 non-null object
race_c        993 non-null object
race_d        993 non-null object
race_e        991 non-null object
relig         996 non-null object
church        996 non-null object
hhince        502 non-null object
hhinc50k      485 non-null object
hhincu        99 non-null object
hhinco        142 non-null object
hhinc         986 non-null object
gender        1000 non-null object
dtypes: object(95)
memory usage: 742.3+ KB
None
count unique top freq
timezone 1000 4 E 473
state 1000 48 CA 98
ahq1 1000 5 Strongly agree 462
ahq2 998 5 Somewhat agree 258
ahq3 999 5 Strongly agree 450
kenq1 998 5 Strongly disagree 311
kenq2 996 5 Strongly disagree 265
kenq3 997 5 Strongly agree 332
js_version 1000 2 Provided no notice 511
jsq1 989 4 Some of the time 615
lcq1 983 3 Equal parts threat and opportunity 600
skq1 1000 2 No 945
skq2_a 1000 2 No 815
skq2_b 1000 2 Yes 604
skq3 987 3 No 653
jbq1 998 4 Talking to your current doctor or another heal... 585
jbq2 999 2 No 546
jbq3 450 4 Somewhat helpful 237
rvq1 999 7 1-2 times 363
rvq2 999 5 Very positive 518
rvq4 998 3 Yes 357
rvq6 356 6 Very positive 131
jaq1 998 3 Only with parental permission 602
rq_seq_i 1000 2 Question order JAq2_rq, JAq3_rq 532
jaq2_rq 995 3 Always 733
jaq3_rq 993 2 Yes 830
pdq1 986 2 Mostly for health reasons 498
pdq2 993 2 No 604
pdq3 1000 3 Yes, but only for health benefits 729
krq2 487 2 Yes 255
... ... ... ... ...
idq3 969 6 Hillary Clinton 322
idq4 1000 5 Strongly agree 312
idq5 997 5 Disagree 239
employ 1000 5 Yes 652
jbtype 652 5 Full-time, all year round 519
selfempl 651 2 No 546
lkwork 978 2 No 800
lvq2_a 650 2 Yes 402
lvq2_b 651 2 No 381
ph_cell 999 3 Cell 764
borninus 1000 2 Yes 898
married 999 6 Married 499
ideo 992 7 Moderate or middle of the road 363
party 996 8 Independent (close to Neither) 245
educ 999 7 College graduate (BS, BA, or other 4-year degree) 266
ownrent 998 3 Own 629
hisp 999 2 No 872
race_a 993 2 Yes 815
race_b 993 2 No 849
race_c 993 2 No 918
race_d 993 2 No 952
race_e 991 2 No 966
relig 996 7 Protestant 474
church 996 6 Once a week 232
hhince 502 95 50000.0 38
hhinc50k 485 2 $50,000 or over 283
hhincu 99 5 20 to under $30,000 26
hhinco 142 4 50 to under $75,000 45
hhinc 986 9 50 to under $75,000 290
gender 1000 2 Female 502

95 rows × 4 columns

Nulls


In [21]:
cols = dt_col_names + number_col_names + other_col_names

fig, ax = plt.subplots(figsize=(15,.5*len(cols)))
sns.heatmap(df[cols].isnull().T,cbar=False,xticklabels=False,cmap = 'viridis', ax=ax )


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x124cba6a0>