Predicting Blood Donations: Initial Data Exploration

To do:


Import Data

Functions used:

  • pandas.read_csv
  • [pandas df].head()

In [26]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy  as np

In [27]:
data_dir      = '../data/raw/'
data_filename = 'blood_train.csv'
df_blood      = pd.read_csv(data_dir+data_filename)
df_blood.head()


Out[27]:
   Unnamed: 0  Months since Last Donation  Number of Donations  Total Volume Donated (c.c.)  Months since First Donation  Made Donation in March 2007
0         619                           2                   50                        12500                           98                            1
1         664                           0                   13                         3250                           28                            1
2         441                           1                   16                         4000                           35                            1
3         160                           2                   20                         5000                           45                            1
4         358                           1                   24                         6000                           77                            0
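
Side note on the Unnamed: 0 column: pandas assigns that name when the first column of a CSV has an empty header. A minimal sketch of reading such a column as the index instead, assuming (not confirmed by the source) that it is a donor ID rather than a feature. The inline csv_text is a hypothetical stand-in for blood_train.csv built from the first two rows shown above.

```python
import io

import pandas as pd

# Hypothetical stand-in for '../data/raw/blood_train.csv':
# the header row starts with an empty field, followed by two rows from head().
csv_text = (
    ",Months since Last Donation,Number of Donations,"
    "Total Volume Donated (c.c.),Months since First Donation,"
    "Made Donation in March 2007\n"
    "619,2,50,12500,98,1\n"
    "664,0,13,3250,28,1\n"
)

# index_col=0 tells read_csv to use the first column as the row index,
# so no 'Unnamed: 0' column is created.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_demo.head())
```

With the ID in the index, later cells would not need iloc[:, 1:] to skip it.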

Clean Data

  • Are there any missing values?

In [ ]:
# FILL IN TEST
# FILL IN ACTION
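
One possible way to fill in the test and action above, as a sketch only. df_demo is a hypothetical stand-in for df_blood built from the head() rows, with one value blanked out on purpose so the check has something to find. Dropping rows is just one option; imputing values would be another.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_blood: three columns from the head() rows,
# with one value deliberately replaced by NaN for illustration.
df_demo = pd.DataFrame({
    'Months since Last Donation':  [2, 0, 1, 2, 1],
    'Number of Donations':         [50, 13, 16, 20, np.nan],
    'Months since First Donation': [98, 28, 35, 45, 77],
})

# Test: count missing values in each column.
missing_per_column = df_demo.isnull().sum()
print(missing_per_column)

# Action (one option among several): drop every row that has a missing value.
df_clean = df_demo.dropna()
print(len(df_demo), 'rows before,', len(df_clean), 'rows after')
```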

Visualize Data

Table: Summary Statistics

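Summary statistics give a feel for the data as a whole. The first column (Unnamed: 0) is skipped with iloc[:, 1:] because it looks like an ID rather than a feature.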

Functions Used:

  • [pandas df].iloc[]
  • [pandas df].describe()

In [44]:
df_blood.iloc[:, 1:].describe()


Out[44]:
       Months since Last Donation  Number of Donations  Total Volume Donated (c.c.)  Months since First Donation  Made Donation in March 2007
count                  576.000000           576.000000                   576.000000                   576.000000                   576.000000
mean                     9.439236             5.427083                  1356.770833                    34.050347                     0.239583
std                      8.175454             5.740010                  1435.002556                    24.227672                     0.427200
min                      0.000000             1.000000                   250.000000                     2.000000                     0.000000
25%                      2.000000             2.000000                   500.000000                    16.000000                     0.000000
50%                      7.000000             4.000000                  1000.000000                    28.000000                     0.000000
75%                     14.000000             7.000000                  1750.000000                    49.250000                     0.000000
max                     74.000000            50.000000                 12500.000000                    98.000000                     1.000000

Insights from Summary stats table:

Variable                             Value   Interpretation
Number of data points (N)            576     A fairly small dataset
Mean of Made Donation in March 2007  0.2396  Only about 24% of donors gave blood in March 2007
Max Months since First Donation      98      The earliest first donation was 98 months (~8 years) ago
Mean Number of Donations             5.427   Donors in the dataset have given blood ~5.4 times on average
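
Because Made Donation in March 2007 is a 0/1 column, its mean is simply the share of donors who gave blood that month. On the full data that is 0.2396, so a model that always predicts "no donation" would already be right about 76% of the time. A small sketch of the same calculation, using only the five head() rows as a hypothetical stand-in for the full column (so the numbers differ from the table above):

```python
import pandas as pd

# Hypothetical stand-in: the target column from the five head() rows.
target = pd.Series([1, 1, 1, 1, 0], name='Made Donation in March 2007')

# Mean of a 0/1 column = fraction of rows equal to 1.
donor_share = target.mean()

# Share of each class, and the accuracy of always guessing the larger class.
class_shares = target.value_counts(normalize=True)
majority_baseline = class_shares.max()

print('share who donated:', donor_share)
print('majority-class baseline accuracy:', majority_baseline)
```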

Plot: Scatter Matrix of all of the variables + histograms

Note:

  • Number of Donations & Total Volume Donated (c.c.) are perfectly correlated (total volume is always 250 c.c. per donation)
    • thus one of the two variables can probably be dropped
  • More likely to NOT have donated in March 2007 (from Made Donation histogram)
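
The perfect correlation can also be checked numerically rather than read off the plot. A minimal sketch, using the two columns from the five head() rows as a hypothetical stand-in for df_blood:

```python
import pandas as pd

# Hypothetical stand-in for df_blood: two columns from the head() rows.
df_demo = pd.DataFrame({
    'Number of Donations':         [50, 13, 16, 20, 24],
    'Total Volume Donated (c.c.)': [12500, 3250, 4000, 5000, 6000],
})

# Pearson correlation between the two columns.
corr = df_demo['Number of Donations'].corr(
    df_demo['Total Volume Donated (c.c.)'])

# If volume is a fixed amount per donation, this ratio has one unique value.
volume_per_donation = (df_demo['Total Volume Donated (c.c.)']
                       / df_demo['Number of Donations'])
is_redundant = volume_per_donation.nunique() == 1

print('correlation:', corr)
print('c.c. per donation:', volume_per_donation.unique())
```

If is_redundant also holds on the full data, dropping Total Volume Donated (c.c.) loses no information.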

In [45]:
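# scatter_matrix lives in pandas.plotting (pandas >= 0.20); pd.scatter_matrix is deprecated
plot_scatter = pd.plotting.scatter_matrix(df_blood.iloc[:, 1:],
                                          figsize=(20, 20))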


/Users/jason/anaconda/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

Plot data as a scatter plot (with respect to 'Made Donation in March 2007')

To visually inspect whether the given data is linearly separable, we create scatter plots of the data colored by class (like those in Abu-Mostafa, et al., 2012).

2-dim Scatterplot: Number of Donations + Months since First Donation ~ Made Donation in March 2007

Using just two features (Number of Donations & Months since First Donation), can we linearly separate donors who gave blood in March 2007 from those who did not?
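
Besides eyeballing the plot, the question can be probed numerically. The sketch below is an illustration, not the notebook's method: a plain perceptron that reports whether it reaches zero training mistakes. The function name perceptron_separates and the max_epochs cutoff are hypothetical choices. A True result shows the data is linearly separable. A False result only means no separating line was found within max_epochs, which suggests but does not prove non-separability. df_demo again stands in for df_blood using the five head() rows.

```python
import numpy as np
import pandas as pd


def perceptron_separates(X, y, max_epochs=1000):
    """Return True if a perceptron makes zero training mistakes in some epoch."""
    X = np.asarray(X, dtype=float)
    y = np.where(np.asarray(y) == 1, 1.0, -1.0)   # map {0, 1} labels to {-1, +1}

    # Standardize each feature; rescaling does not change separability,
    # but it usually speeds up convergence.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    X = (X - X.mean(axis=0)) / std

    Xb = np.hstack([np.ones((len(X), 1)), X])     # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:                # misclassified or on the boundary
                w += yi * xi                      # perceptron update
                mistakes += 1
        if mistakes == 0:
            return True
    return False


# Hypothetical stand-in for df_blood: the five head() rows.
df_demo = pd.DataFrame({
    'Number of Donations':         [50, 13, 16, 20, 24],
    'Months since First Donation': [98, 28, 35, 45, 77],
    'Made Donation in March 2007': [1, 1, 1, 1, 0],
})
features = ['Number of Donations', 'Months since First Donation']
found = perceptron_separates(df_demo[features].values,
                             df_demo['Made Donation in March 2007'].values)
print('separating line found:', found)
```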


In [29]:
import seaborn as sns

In [70]:
# sns.set_context("notebook", font_scale=1.1)
# sns.set_style("ticks")

sns.set_context("notebook", font_scale=1.5, rc={'figure.figsize': [11, 8]})
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

In [77]:
g = sns.lmplot(data=df_blood,
               x='Number of Donations',
               y='Months since First Donation',
               hue='Made Donation in March 2007',
               fit_reg=False,      # scatter only, no regression line
               palette='RdYlBu',
               aspect=3/1,
               markers='D',        # lmplot overrides a "marker" set in scatter_kws
               scatter_kws={"s": 50})


