Predicting Blood Donations: Initial Data Exploration

To do:

Import Data

Functions used:

• `pandas.read_csv`
• `[pandas df].head()`
``````

In [26]:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy  as np

``````
``````

In [27]:

data_dir      = '../data/raw/'
data_filename = 'blood_train.csv'
df_blood      = pd.read_csv(data_dir+data_filename)
df_blood.head()

``````
``````

Out[27]:

   Unnamed: 0  Months since Last Donation  Number of Donations  Total Volume Donated (c.c.)  Months since First Donation  Made Donation in March 2007
0         619                           2                   50                        12500                           98                            1
1         664                           0                   13                         3250                           28                            1
2         441                           1                   16                         4000                           35                            1
3         160                           2                   20                         5000                           45                            1
4         358                           1                   24                         6000                           77                            0

``````
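The `Unnamed: 0` column in the output above is the name pandas gives to a column whose header cell is empty, which suggests the first column of the raw file holds row IDs. The following is a minimal sketch of how `index_col=0` would turn such a column into the index. The `csv_text` string is a made-up two-row stand-in, not the real `blood_train.csv`, and the assumption that the file's first header cell is empty has not been checked against the file itself.

```python
import io
import pandas as pd

# Hypothetical miniature CSV whose first header cell is empty,
# mimicking what the 'Unnamed: 0' column suggests about the raw file.
csv_text = (
    ",Months since Last Donation,Number of Donations\n"
    "619,2,50\n"
    "664,0,13\n"
)

# Default behaviour: the unnamed first column becomes 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

If the file were loaded this way, later cells could call `describe()` on the whole frame rather than slicing the ID column off with `iloc[:, 1:]`.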

Clean Data

• Are there any missing values?
``````

In [ ]:

# FILL IN TEST
# FILL IN ACTION

``````
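The cell above leaves the missing-value test and action open. One possible way to fill it in is sketched here on a toy frame (`toy` is a hypothetical stand-in for `df_blood`, with a `NaN` planted on purpose). Dropping rows is only one of several reasonable actions and is an assumption, not a choice made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_blood; the NaN is planted deliberately.
toy = pd.DataFrame({
    'Months since Last Donation': [2, 0, np.nan, 2],
    'Number of Donations': [50, 13, 16, 20],
})

# TEST: count missing entries in each column.
missing_per_column = toy.isnull().sum()
has_missing = bool(missing_per_column.any())

# ACTION (one option among several): drop rows with any missing entry.
toy_clean = toy.dropna()

print(missing_per_column)
print(has_missing, len(toy_clean))
```

On the real data the corresponding test would be `df_blood.isnull().sum()`.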

Visualize Data

Table: Summary Statistics

To get a feel for the data as a whole.

Functions Used:

• `[pandas df].iloc[]`
• `[pandas df].describe()`
``````

In [44]:

df_blood.iloc[:, 1:].describe()

``````
``````

Out[44]:

       Months since Last Donation  Number of Donations  Total Volume Donated (c.c.)  Months since First Donation  Made Donation in March 2007
count                  576.000000           576.000000                   576.000000                   576.000000                   576.000000
mean                     9.439236             5.427083                  1356.770833                    34.050347                     0.239583
std                      8.175454             5.740010                  1435.002556                    24.227672                     0.427200
min                      0.000000             1.000000                   250.000000                     2.000000                     0.000000
25%                      2.000000             2.000000                   500.000000                    16.000000                     0.000000
50%                      7.000000             4.000000                  1000.000000                    28.000000                     0.000000
75%                     14.000000             7.000000                  1750.000000                    49.250000                     0.000000
max                     74.000000            50.000000                 12500.000000                    98.000000                     1.000000

``````

Insights from Summary stats table:

| Variable | Value | Interpretation |
| --- | --- | --- |
| Number of data points (N) | 576 | A fairly small dataset |
| Mean of `Made Donation in March 2007` | 0.2396 | About 24% of donors gave blood in March 2007, so most did not |
| Max `Months since First Donation` | 98 | The earliest first donation was 98 months (~8 years) ago |
| Mean `Number of Donations` | 5.427 | Donors in the dataset have donated ~5.4 times on average |
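Because `Made Donation in March 2007` only takes the values 0 and 1, its mean is the share of donors who gave blood that month. A short worked calculation, using only the `count` and `mean` values from the summary table, converts that share into head counts (the variable names are illustrative):

```python
n_rows = 576           # 'count' row of the describe() output
mean_flag = 0.239583   # mean of 'Made Donation in March 2007'

# The mean of a 0/1 column times the row count gives the number of ones.
n_donated = round(n_rows * mean_flag)
n_not_donated = n_rows - n_donated
share_not_donated = n_not_donated / n_rows

print(n_donated, n_not_donated, round(share_not_donated, 4))
```

The share of non-donors is a useful baseline: a model that always predicts "no donation" would already be right for that fraction of this dataset.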

Plot: Scatter Matrix of all of the variables + histograms

Note:

• `Number of Donations` & `Total Volume Donated (c.c.)` are perfectly correlated
• thus can probably drop one of the variables
• More likely to NOT have donated in March 2007 (from the `Made Donation in March 2007` histogram)
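The perfect correlation noted above can be checked numerically. The sketch below uses only the five rows shown in the `head()` output (so it is an illustration, not a check of the full dataset) and tests whether the volume is a constant multiple of the donation count before dropping the redundant column. The name `sample` is hypothetical.

```python
import pandas as pd

# The five rows visible in the head() output.
sample = pd.DataFrame({
    'Number of Donations': [50, 13, 16, 20, 24],
    'Total Volume Donated (c.c.)': [12500, 3250, 4000, 5000, 6000],
})

# Pearson correlation between the two columns.
corr = sample['Number of Donations'].corr(sample['Total Volume Donated (c.c.)'])

# If volume is always the same multiple of the count, the ratio has one unique value.
ratio = sample['Total Volume Donated (c.c.)'] / sample['Number of Donations']
is_constant_multiple = ratio.nunique() == 1

# Drop the redundant column.
reduced = sample.drop(columns=['Total Volume Donated (c.c.)'])

print(corr, ratio.iloc[0], is_constant_multiple)
```

In these five rows every donation corresponds to 250 c.c., which would explain why the two columns carry the same information.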
``````

In [45]:

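# scatter_matrix lives in pandas.plotting; the top-level pd.scatter_matrix alias was removed
plot_scatter = pd.plotting.scatter_matrix(df_blood.iloc[:, 1:],
                                          figsize=(20, 20))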

``````

Plot data as a scatter plot (with respect to `Made Donation in March 2007`)

In order to visually inspect whether the given data is linearly separable

• want to create scatter plots of the data (like those in Abu-Mostafa, et al., 2012)

2-dim Scatterplot: Number of Donations + Months since First Donation ~ Made Donation in March 2007

Using two features (`Number of Donations` and `Months since First Donation`), can we linearly separate donors who gave blood in March 2007 from those who did not?
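Besides inspecting the plot, the question can also be probed numerically. The sketch below swaps in the perceptron learning algorithm, which is guaranteed to stop when two classes are linearly separable. The function `perceptron_separates`, its `max_updates` cap, and the toy points are all illustrative assumptions and not part of this notebook. A `False` result only means no separating line was found within the cap.

```python
import numpy as np

def perceptron_separates(X, y, max_updates=1000):
    """Return True if the perceptron finds a separating line within max_updates."""
    X = np.asarray(X, dtype=float)
    signs = np.where(np.asarray(y) == 1, 1.0, -1.0)   # map labels {0, 1} to {-1, +1}
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(max_updates):
        misclassified = np.flatnonzero(signs * (Xb @ w) <= 0)
        if misclassified.size == 0:
            return True                               # every point is on its correct side
        i = misclassified[0]
        w += signs[i] * Xb[i]                         # standard perceptron update
    return False

# Toy points: the first set can be split by a line, the XOR layout cannot.
separable = perceptron_separates([[0, 0], [1, 0], [3, 3], [4, 3]], [0, 0, 1, 1])
xor_like = perceptron_separates([[0, 0], [1, 1], [0, 1], [1, 0]], [0, 0, 1, 1])
print(separable, xor_like)
```

On the blood data, `X` would be the two feature columns and `y` the `Made Donation in March 2007` column. Features on very different scales may need many more updates than the default cap allows.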

``````

In [29]:

import seaborn as sns

``````
``````

In [70]:

# sns.set_context("notebook", font_scale=1.1)
# sns.set_style("ticks")

sns.set_context("notebook", font_scale=1.5, rc={'figure.figsize': [11, 8]})
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

``````
``````

In [77]:

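g = sns.lmplot(data=df_blood,
               x='Number of Donations',
               y='Months since First Donation',
               hue='Made Donation in March 2007',   # colour points by the target
               fit_reg=False,                       # scatter only, no regression line
               palette='RdYlBu',
               aspect=3/1,
               markers='D',                         # lmplot takes the marker style via `markers`
               scatter_kws={"s": 50})               # marker size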

``````