In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
In [3]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data'
data_df = pd.read_csv(url)
To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center sends its blood transfusion service bus to one university in Hsin-Chu City to collect donated blood about every three months. To build the RFMTC model, we selected 748 donors at random from the donor database. Each of these 748 donor records includes R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated in c.c.), T (Time: months since first donation), and a binary variable indicating whether the donor gave blood in March 2007 (1 = donated; 0 = did not donate).
Attribute Information:
For each attribute we give the name, type, measurement unit, and a brief description. The "Blood Transfusion Service Center" data set is a classification problem, and the listing below follows the order of the columns in the database:
- Recency: months since last donation (numerical)
- Frequency: total number of donations (numerical)
- Monetary: total blood donated, in c.c. (numerical)
- Time: months since first donation (numerical)
- Class: whether the donor gave blood in March 2007 (binary; 1 = yes, 0 = no)
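Before exploring, it is worth a quick sanity check that the file loaded as expected: we expect 748 rows, four numeric features plus the target, and ideally no missing values. A minimal check (output omitted):
In [ ]:
# Shape, dtypes, and missing-value counts for the loaded frame.
print(data_df.shape)
print(data_df.dtypes)
print(data_df.isnull().sum())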
In [4]:
data_df.head()
Out[4]:
In [5]:
# Shorten each column name to its first word, lower-cased,
# and give the target column an explicit name.
cols = [c.lower().split()[0] for c in data_df.columns]
cols[-1] = 'class_name'
data_df.columns = cols
In [6]:
data_df.head()
Out[6]:
In [8]:
counts = data_df['class_name'].value_counts()
counts.plot(kind='bar')
Out[8]:
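The classes are clearly imbalanced: roughly three quarters of the donors did not donate in March 2007. Quantifying this up front also explains why the Naive Bayes model below is given flat priors. A minimal check:
In [ ]:
# Fraction of donors in each class; the positive class is a clear minority.
data_df['class_name'].value_counts(normalize=True)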
In [9]:
import seaborn as sns
sns.pairplot(data_df, hue='class_name')
Out[9]:
In [10]:
features = data_df.drop('class_name', axis=1)
labels = data_df['class_name']
corrs = features.corr()
sns.heatmap(corrs, annot=True)
Out[10]:
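The heatmap should show frequency and monetary perfectly correlated: in this data set each donation appears to be a fixed volume of blood, so monetary is just a scaled copy of frequency. A quick check of that assumption:
In [ ]:
# If monetary is a fixed multiple of frequency, this ratio is constant
# (zero standard deviation) and the two features are redundant.
(features['monetary'] / features['frequency']).describe()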
In [11]:
features.corrwith(labels).plot(kind='bar')
plt.xticks(rotation=30)
Out[11]:
In [12]:
from sklearn.feature_selection import f_classif
f_stats, p_vals = f_classif(features, labels)  # ANOVA F statistics, not t statistics
test_results = pd.DataFrame(np.column_stack([f_stats, p_vals]),
                            index=features.columns.copy(),
                            columns=['f_stats', 'p_vals'])
test_results.plot(kind='bar', subplots=True)
Out[12]:
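Low p-values indicate features whose class means differ by more than chance would explain; a conventional 0.05 cutoff is one way to shortlist them:
In [ ]:
# Features whose ANOVA p-value falls below a conventional 0.05 cutoff.
test_results[test_results['p_vals'] < 0.05]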
In [13]:
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Flat priors offset the class imbalance noted above.
estimator = GaussianNB(priors=[0.5, 0.5])
kbest = SelectKBest(f_classif, k=2)
pipeline = Pipeline([('selector', kbest), ('model', estimator)])
cv = StratifiedKFold(n_splits=10, shuffle=True)
nb_scores = cross_val_score(pipeline, features, labels, cv=cv)  # default scoring: accuracy
nb_scores = pd.Series(nb_scores)
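To see which two features the selector actually keeps, we can fit it once on the full data (inside cross-validation it is refit per training fold, so the choice there could differ):
In [ ]:
# Inspect the selector's choice on the full data set.
selected = SelectKBest(f_classif, k=2).fit(features, labels)
features.columns[selected.get_support()]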
In [30]:
nb_scores.plot(kind='bar')
Out[30]:
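The per-fold bar chart is noisy; the mean and spread summarize it better:
In [ ]:
# Mean and standard deviation of the ten per-fold accuracies.
print('accuracy: %.3f +/- %.3f' % (nb_scores.mean(), nb_scores.std()))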
Next, we compare train vs. test scores for each fold. If the train scores are noticeably higher than the test scores, the model is overfitting, and we either need to give more data to each fold or use a different model. If both scores are nearly equal, the model is not overfitting.
In [14]:
from sklearn.metrics import f1_score

def calc_scores(train_idx, test_idx):
    """Fit the pipeline on one CV fold's training split and return (train, test) F1 scores."""
    train_data, train_labels = features.iloc[train_idx], labels.iloc[train_idx]
    test_data, test_labels = features.iloc[test_idx], labels.iloc[test_idx]
    estimator = GaussianNB(priors=[0.5, 0.5])
    kbest = SelectKBest(f_classif, k=2)
    pipeline = Pipeline([('selector', kbest), ('model', estimator)])
    pipeline = pipeline.fit(train_data, train_labels)
    train_score = f1_score(train_labels, pipeline.predict(train_data))
    test_score = f1_score(test_labels, pipeline.predict(test_data))
    return (train_score, test_score)

# Stratify on the real labels so each fold keeps the class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True)
fold_scores = [calc_scores(train_idx, test_idx)
               for train_idx, test_idx
               in cv.split(features, labels)]
fold_scores = pd.DataFrame(fold_scores, columns=['train_score', 'test_score'])
fold_scores.plot(kind='bar')
Out[14]:
In [15]:
fold_scores.mean()
Out[15]:
In [16]:
cv = StratifiedKFold(n_splits=5, shuffle=True)
fold_scores = [calc_scores(train_idx, test_idx)
               for train_idx, test_idx
               in cv.split(features, labels)]
fold_scores = pd.DataFrame(fold_scores, columns=['train_score', 'test_score'])
fold_scores.plot(kind='bar')
Out[16]:
In [79]:
fold_scores.mean()
Out[79]:
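One way to test the "give more data to each fold" remedy directly is a learning curve. A minimal sketch using scikit-learn's learning_curve on the same pipeline (the train sizes and scoring choice here are illustrative):
In [ ]:
from sklearn.model_selection import learning_curve

# F1 on train and test splits at increasing training-set sizes;
# curves that have already converged suggest more data alone won't help.
sizes, train_sc, test_sc = learning_curve(
    pipeline, features, labels,
    cv=StratifiedKFold(n_splits=5, shuffle=True),
    train_sizes=np.linspace(0.2, 1.0, 5), scoring='f1')
curve_df = pd.DataFrame({'train_score': train_sc.mean(axis=1),
                         'test_score': test_sc.mean(axis=1)},
                        index=sizes)
curve_df.plot()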
In [ ]: