Predicting Blood Donations: Initial Data Exploration

To do:

Import data
Clean data
Visualize data

Import Data

Functions used:

pandas.read_csv
[pandas df].head()



In [26]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy  as np



In [27]:

    
data_dir      = '../data/raw/'
data_filename = 'blood_train.csv'
df_blood      = pd.read_csv(data_dir+data_filename)
df_blood.head()









    Out[27]:

	Unnamed: 0	Months since Last Donation	Number of Donations	Total Volume Donated (c.c.)	Months since First Donation	Made Donation in March 2007
0	619	2	50	12500	98	1
1	664	0	13	3250	28	1
2	441	1	16	4000	35	1
3	160	2	20	5000	45	1
4	358	1	24	6000	77	0
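
The 'Unnamed: 0' column is the name pandas gives to a column whose header cell is empty, which usually means a row identifier was saved with the CSV. A minimal sketch of reading it as the index instead, assuming the file's first header cell really is blank; `sample_csv` and `df_sample` are hypothetical stand-ins built from the five rows shown above, not the real file:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for blood_train.csv (first header cell left blank)
sample_csv = io.StringIO(
    ",Months since Last Donation,Number of Donations,"
    "Total Volume Donated (c.c.),Months since First Donation,"
    "Made Donation in March 2007\n"
    "619,2,50,12500,98,1\n"
    "664,0,13,3250,28,1\n"
    "441,1,16,4000,35,1\n"
    "160,2,20,5000,45,1\n"
    "358,1,24,6000,77,0\n"
)

# index_col=0 turns the first column into the row index instead of a feature
df_sample = pd.read_csv(sample_csv, index_col=0)
print(df_sample.columns.tolist())
```

With the identifier in the index, a later `.iloc[:, 1:]` slice would no longer be needed to skip it.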

Clean Data

Are there any missing values?



In [ ]:

    
# FILL IN TEST
# FILL IN ACTION
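
One way the test and action placeholders above could be filled in is sketched here. It runs on a hypothetical toy frame (`df_toy`, with one value deliberately blanked out) rather than on `df_blood`, and dropping rows is only one possible action; imputation would be another.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real data) with one missing value
df_toy = pd.DataFrame({
    "Months since Last Donation": [2, 0, 1, 2, 1],
    "Number of Donations": [50, 13, np.nan, 20, 24],
    "Made Donation in March 2007": [1, 1, 1, 1, 0],
})

# Test: count missing values in each column
missing_per_column = df_toy.isnull().sum()
print(missing_per_column)

# Action: drop any row that contains a missing value
df_clean = df_toy.dropna()
print(len(df_toy), "->", len(df_clean), "rows")
```

On the real data, the same `isnull().sum()` call on `df_blood` would answer the question directly.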

Visualize Data

Table: Summary Statistics

To get a feel for the data as a whole.

Functions Used:

[pandas df].iloc[]
[pandas df].describe()



In [44]:

    
df_blood.iloc[:, 1:].describe()









    Out[44]:

	Months since Last Donation	Number of Donations	Total Volume Donated (c.c.)	Months since First Donation	Made Donation in March 2007
count	576.000000	576.000000	576.000000	576.000000	576.000000
mean	9.439236	5.427083	1356.770833	34.050347	0.239583
std	8.175454	5.740010	1435.002556	24.227672	0.427200
min	0.000000	1.000000	250.000000	2.000000	0.000000
25%	2.000000	2.000000	500.000000	16.000000	0.000000
50%	7.000000	4.000000	1000.000000	28.000000	0.000000
75%	14.000000	7.000000	1750.000000	49.250000	0.000000
max	74.000000	50.000000	12500.000000	98.000000	1.000000

Insights from Summary stats table:

Variable	Value	Interpretation
Number of data points (N)	576	A small dataset
Mean of Made Donation in March 2007	0.2396	About 24% of people (138 of 576) donated in March 2007, so donors are the minority class
Max Months since First Donation	98	The earliest first donation was 98 months (~8 years) ago
Mean Number of Donations	5.427	People in the dataset donated ~5.4 times on average
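
Because 'Made Donation in March 2007' is coded 0/1, its mean is the share of people who donated. A small sketch of that reading, using only the five target values from the `head()` output (the series `made_donation` is a hypothetical stand-in for the full column) plus the mean and count reported by `describe()`:

```python
import pandas as pd

# Target values from the five rows shown by head(); not the full column
made_donation = pd.Series([1, 1, 1, 1, 0], name="Made Donation in March 2007")

# For a 0/1 column, the mean equals the proportion of 1s
donation_rate = made_donation.mean()
class_shares = made_donation.value_counts(normalize=True)
print(donation_rate)
print(class_shares)

# Applying the same idea to the describe() output: mean * count = number of donors
donors_in_full_data = round(0.239583 * 576)
print(donors_in_full_data)
```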

Plot: Scatter Matrix of all of the variables + histograms

Note:

Number of Donations & Total Volume Donated are perfectly correlated (each donation is 250 c.c.)
- thus one of the two variables can probably be dropped
More likely to NOT have donated in March 2007 (from the Made Donation histogram)
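
The perfect correlation noted above can be checked numerically rather than read off the plot. This is a sketch on the five rows from the `head()` output only; `df_head` is a hypothetical stand-in, and which of the two columns to drop is a judgment call, not something the data decides.

```python
import pandas as pd

# The two columns from the five rows shown by head()
df_head = pd.DataFrame({
    "Number of Donations": [50, 13, 16, 20, 24],
    "Total Volume Donated (c.c.)": [12500, 3250, 4000, 5000, 6000],
})

# If volume is a fixed multiple of the count, this ratio is constant
volume_per_donation = (df_head["Total Volume Donated (c.c.)"]
                       / df_head["Number of Donations"])
print(volume_per_donation.unique())

# Pearson correlation between the two columns
correlation = df_head["Number of Donations"].corr(
    df_head["Total Volume Donated (c.c.)"])
print(correlation)

# Keep one of the redundant columns (here: the donation count)
df_reduced = df_head.drop(columns=["Total Volume Donated (c.c.)"])
```

The min, mean, and max rows of the summary table show the same factor of 250 between the two columns.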



In [45]:

    
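# scatter_matrix lives in pd.plotting; the top-level pd.scatter_matrix
# alias is not available in newer pandas versions
plot_scatter = pd.plotting.scatter_matrix(df_blood.iloc[:, 1:],
                                          figsize=(20, 20))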









    




Plot data as a scatter plot (with respect to 'Made Donation in March 2007')

To visually inspect whether the data are linearly separable, create scatter plots of the data (like those in Abu-Mostafa et al., 2012).

2-dim Scatterplot: Number of Donations + Months since First Donation ~ Made Donation in March 2007

With 2-dimensions/factors (Number of Donations & Months since First Donation), can we linearly separate whether a donation was made in March, 2007?
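
Besides eyeballing the plot, linear separability in two dimensions can be probed with the perceptron learning algorithm, which stops with zero errors only if a separating line exists. The following is a minimal sketch, not part of the original analysis: the function name `perceptron_separates`, the update budget, and both toy datasets are assumptions for illustration. A `False` result only means no separator was found within the budget.

```python
import numpy as np

def perceptron_separates(X, y, max_updates=10000):
    """Return True if the perceptron finds a line with zero training errors."""
    # Prepend a bias column and map labels {0, 1} to {-1, +1}
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    signs = np.where(y == 1, 1.0, -1.0)
    w = np.zeros(Xb.shape[1])
    for _ in range(max_updates):
        # A point is misclassified when it lies on the wrong side (or on the line)
        misclassified = np.where(signs * (Xb @ w) <= 0)[0]
        if misclassified.size == 0:
            return True
        i = misclassified[0]
        w += signs[i] * Xb[i]  # standard perceptron update
    return False

# Hypothetical toy data: two well-separated clusters
X_sep = np.array([[1.0, 1.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]])
y_sep = np.array([0, 0, 1, 1])

# Hypothetical toy data: XOR layout, which no single line can separate
X_xor = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_xor = np.array([0, 0, 1, 1])

print(perceptron_separates(X_sep, y_sep))
print(perceptron_separates(X_xor, y_xor))
```

To try it on the notebook's data, `X` would be the two feature columns of `df_blood` as a NumPy array and `y` the 'Made Donation in March 2007' column.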



In [29]:

    
import seaborn as sns



In [70]:

    
# sns.set_context("notebook", font_scale=1.1)
# sns.set_style("ticks")

sns.set_context("notebook", font_scale=1.5, rc={'figure.figsize': [11, 8]})
sns.set_style("darkgrid", {"axes.facecolor": ".9"})



In [77]:

    
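# Scatter only (fit_reg=False), colored by the March 2007 outcome
g = sns.lmplot(data=df_blood,
               x='Number of Donations',
               y='Months since First Donation',
               hue='Made Donation in March 2007',
               fit_reg=False,
               palette='RdYlBu',
               aspect=3/1,
               # marker shape goes through lmplot's markers argument;
               # a "marker" key inside scatter_kws gets overridden
               markers='D',
               scatter_kws={"s": 50})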









    





