Blood Transfusion Service Center

Data Set Characteristics: Multivariate

Number of Instances: 748

Area: Business

Attribute Characteristics: Real

Number of Attributes: 5

Date Donated: 2008-10-03

Associated Tasks: Classification

Number of Web Hits: 140894

Source:

Original Owner and Donor: Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan 30067, R.O.C.
E-mail: icyeh '@' chu.edu.tw
Tel: 886-3-5186511

Date Donated: October 3, 2008

Data Set Information:

To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center sends its blood transfusion service bus to a university in Hsin-Chu City roughly every three months to collect donated blood. To build the RFMTC model, 748 donors were selected at random from the donor database. Each donor record includes R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated, in c.c.), T (Time: months since first donation), and a binary variable indicating whether the donor gave blood in March 2007 (1 = donated, 0 = did not donate).

Attribute Information:

Listed below are the variable name, variable type, measurement unit, and a brief description of each attribute. The "Blood Transfusion Service Center" data set is a classification problem. The order of this listing corresponds to the order of the fields along the rows of the database.

R (Recency: months since last donation),
F (Frequency: total number of donations),
M (Monetary: total blood donated, in c.c.),
T (Time: months since first donation), and
a binary variable indicating whether the donor gave blood in March 2007 (1 = donated, 0 = did not donate).


Table 1 shows the descriptive statistics of the data. 500 instances were selected at random as the training set, and the remaining 248 as the test set.

Table 1. Descriptive statistics of the data

Variable                                     Data Type     Measurement  Description  min    max    mean     std
Recency                                      quantitative  months       Input        0.03   74.4   9.74     8.07
Frequency                                    quantitative  times        Input        1      50     5.51     5.84
Monetary                                     quantitative  c.c. blood   Input        250    12500  1378.68  1459.83
Time                                         quantitative  months       Input        2.27   98.3   34.42    24.32
Whether he/she donated blood in March 2007   binary        1=yes, 0=no  Output       0      1      1 (24%)  0 (76%)
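
The figures in Table 1 can be reproduced from the raw data with pandas. A minimal sketch, assuming a local copy of the UCI file saved as data/transfusion.data with its original header row (the path and column layout are assumptions; adjust them to your copy):

import pandas as pd

raw = pd.read_csv('data/transfusion.data')  # hypothetical local path to the UCI file
print(raw.describe().T[['min', 'max', 'mean', 'std']])   # numeric part of Table 1
print(raw.iloc[:, -1].value_counts(normalize=True))      # target balance (~24% positives, 76% negatives)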


Relevant Papers:

Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, "Knowledge discovery on RFM model using Bernoulli sequence," Expert Systems with Applications, 2008, [Web Link]



In [2]:
import warnings
import itertools
warnings.filterwarnings('ignore')
from functools import lru_cache

# standard tools
import numpy as np
import pandas as pd

# %load_ext autoreload

seed = 7 * 9
np.random.seed(seed)

import xgboost
import sklearn.ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

In [4]:
scale_cols = {}

def rename_cols(name):
    # 'Months since Last Donation' -> 'msld': drop any '(...)' unit, keep initials
    if '(' in name:
        name = name.split('(')[0]
    return ''.join(map(lambda x: x[0], name.lower().split()))

@lru_cache(maxsize=128)
def get_data():
    df = pd.read_csv('data/BloodDonation.csv', index_col=0)
    test_df = pd.read_csv('data/BloodDonationTest.csv', index_col=0)

    # total volume is a fixed multiple of the number of donations, so drop it
    df.drop(['Total Volume Donated (c.c.)'], inplace=True, axis=1)
    test_df.drop(['Total Volume Donated (c.c.)'], inplace=True, axis=1)
    
    # rename cols
    new_cols_names = df.columns.map(rename_cols)
    for old_name, new_name in zip(df.columns, new_cols_names):
        print('Rename:', old_name, '\t\tNewname:', new_name)
    df.columns = new_cols_names
    test_df.columns = test_df.columns.map(rename_cols)
    
    # standardise the feature columns: fit on train, apply to train and test
    global scale_cols
    for col in df.columns[:-1]:
        scale_cols[col] = StandardScaler(copy=True, with_mean=True, with_std=True).fit(df[[col]])
        df[col] = scale_cols[col].transform(df[[col]]).ravel()
        test_df[col] = scale_cols[col].transform(test_df[[col]]).ravel()

    return (df, test_df)

In [5]:
## Data Modelling
def get_test_train(df):
    X = df.drop('mdim2', axis=1)
    y = df['mdim2']
    X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=1234)
    return (X_train, X_validation, y_train, y_validation)


def test_train_validation_splt(X, y):
    # https://stackoverflow.com/questions/40829137/stratified-train-validation-test-split-in-scikit-learn
    from sklearn.model_selection import train_test_split as tts
    SEED = 2000
    # 60% train, then split the remaining 40% evenly into validation and test
    x_train, x_validation_and_test, y_train, y_validation_and_test = tts(X, y, test_size=.4, random_state=SEED)
    x_validation, x_test, y_validation, y_test = tts(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)
    return (x_train, x_test, x_validation,
            y_train, y_test, y_validation)


## save preds
def save_preds(preds, filename='submit.csv'):
    pd.DataFrame(preds.astype(np.float64),
                 index=test_df.index,
                 columns=['Made Donation in March 2007']
                ).to_csv(filename)
    print('stored file as', filename)

In [6]:
df, test_df = get_data()


Rename: Months since Last Donation 		Newname: msld
Rename: Number of Donations 		Newname: nod
Rename: Months since First Donation 		Newname: msfd
Rename: Made Donation in March 2007 		Newname: mdim2

In [7]:
df.columns


Out[7]:
Index(['msld', 'nod', 'msfd', 'mdim2'], dtype='object')

In [8]:
df['nod_per_msfd'] = df['nod'] / df['msfd']
df['msfd_per_nod'] = 1/df['nod_per_msfd']

test_df['nod_per_msfd'] = test_df['nod'] / test_df['msfd']
test_df['msfd_per_nod'] = 1/test_df['nod_per_msfd']
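
Since msld, nod, and msfd were standardised in get_data, the ratios above can blow up (or go non-finite) wherever a standardised value sits at or near zero. A quick sanity check, as a sketch that was not part of the original run:

# count non-finite values introduced by the ratio features
np.isinf(df[['nod_per_msfd', 'msfd_per_nod']]).sum(), df[['nod_per_msfd', 'msfd_per_nod']].isna().sum()
# one option if any show up:
# df.replace([np.inf, -np.inf], np.nan, inplace=True); df.fillna(df.median(), inplace=True)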

In [9]:
df.columns


Out[9]:
Index(['msld', 'nod', 'msfd', 'mdim2', 'nod_per_msfd', 'msfd_per_nod'], dtype='object')

Test-train split


In [10]:
X_train, X_validation, y_train, y_validation = get_test_train(df)

In [11]:
X_train.shape, X_validation.shape


Out[11]:
((432, 5), (144, 5))

Bernoulli naive Bayes


In [12]:
from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB(alpha=0.5, binarize=0.5)
clf.fit(X_train, y_train)
log_loss(y_train, clf.predict(X_train)), log_loss(y_validation, clf.predict(X_validation))


Out[12]:
(8.7946162014164528, 6.7158953989390273)

In [13]:
clf


Out[13]:
BernoulliNB(alpha=0.5, binarize=0.5, class_prior=None, fit_prior=True)
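
A note on the metric: log_loss above is computed on the hard 0/1 labels from predict, which is why the values are so large (each wrong label is clipped to roughly -log(1e-15) ≈ 34.5 by sklearn). To score predicted probabilities instead, a small sketch using the same clf:

# log loss on predicted probabilities rather than hard labels
train_proba = clf.predict_proba(X_train)[:, 1]
valid_proba = clf.predict_proba(X_validation)[:, 1]
log_loss(y_train, train_proba), log_loss(y_validation, valid_proba)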

Gradient Boosting Classifier


In [14]:
clf = sklearn.ensemble.GradientBoostingClassifier(
    warm_start=True, subsample=.8,
    n_estimators=500,
#     learning_rate=0.0001,
    presort=True, verbose=0).fit(X_train, y_train)
# log_loss(y, clf.predict(X))

# results = cross_val_score(clf, X, y, cv=kfold, scoring='log_loss')
log_loss(y_train, clf.predict(X_train)), log_loss(y_validation, clf.predict(X_validation))


Out[14]:
(2.0787411625971761, 7.9152251013259409)

In [15]:
clf


Out[15]:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=500, presort=True, random_state=None,
              subsample=0.8, verbose=0, warm_start=True)
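
The large gap between the train and validation losses above suggests 500 trees overfit this small data set. One way to pick a smaller n_estimators is to score staged predictions on the validation set; a sketch reusing the clf fitted above, not part of the original run:

# validation log loss after each boosting stage; the minimum marks a reasonable tree count
staged_losses = [log_loss(y_validation, proba[:, 1])
                 for proba in clf.staged_predict_proba(X_validation)]
best_stage = int(np.argmin(staged_losses)) + 1
best_stage, staged_losses[best_stage - 1]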
def special_preprocessing(X):
    new_X = X.copy()
    cols = X.columns.tolist()
    # check if its already processed
    if len(X.columns) > 5:
        return new_X
    for col in cols:
        new_X['tan_' + col] = np.tan(X[col])
    for col1, col2 in itertools.combinations(cols, 2):
        new_X['BC_' + col1 + ':' + col2] = 6 * X[col1] - 3 * X[col2]
        new_X['BC1_' + col1 + ':' + col2] = X[col1] / X[col2]
        new_X['BC2_' + col1 + ':' + col2] = X[col1] * X[col1]
    for col1, col2, col3 in itertools.combinations(cols, 3):
        new_X['TC_' + col1 + ':' + col2 + ':' + col3] = 5 * X[col1] - 3 * X[col2] + X[col3] + 12
        new_X['TC2_' + col1 + ':' + col2 + ':' + col3] = X[col1] * X[col2] * X[col3]
    return new_X

new_X_train = special_preprocessing(X_train).copy()
new_X_validation = special_preprocessing(X_validation).copy()

# clf = sklearn.ensemble.GradientBoostingClassifier(n_estimators=500).fit(new_X, y)
# log_loss(y_train, clf.predict(X_train)), log_loss(y_validation, clf.predict(X_validation))

clf = sklearn.ensemble.GradientBoostingClassifier(
    warm_start=True, subsample=.8,
    n_estimators=500,
#     learning_rate=0.0001,
    presort=True, verbose=0).fit(new_X_train, y_train)
# log_loss(y, clf.predict(X))

# results = cross_val_score(clf, X, y, cv=kfold, scoring='log_loss')
log_loss(y_train, clf.predict(new_X_train)), log_loss(y_validation, clf.predict(new_X_validation))

XGBOOST


In [16]:
from xgboost import XGBClassifier

clf = XGBClassifier(max_depth=4,
                      learning_rate=0.05,
                      reg_alpha=0.1,
                      reg_lambda=0.5,
                      seed=12,
#                       eta=0.02,
                      colsample_bylevel=0.5,
                      objective= 'binary:logistic'
#                       n_estimators=800
                     )

clf.fit(X_train, y_train)
log_loss(y_train, clf.predict(X_train)), log_loss(y_validation, clf.predict(X_validation))


Out[16]:
(5.5166304787513178, 5.9963653211780485)

In [17]:
xgb = xgboost

params = {}
params['objective'] = 'binary:logistic'
params['eval_metric'] = 'logloss'
params['eta'] = 0.02
params['max_depth'] = 5

d_train = xgb.DMatrix(X_train, label=y_train)
d_test = xgb.DMatrix(X_validation, label=y_validation)

# note: both entries are labelled 'train', so the second logloss column in the log
# below is actually the validation set; the last watchlist entry drives early stopping
watchlist = [(d_train, 'train'),
            (d_test, 'train')]

bst = xgb.train(params, d_train, 1000, watchlist, early_stopping_rounds=50, verbose_eval=20)
# this re-evaluates the XGBClassifier `clf` fitted above, not `bst`
log_loss(y_train, clf.predict(X_train)), log_loss(y_validation, clf.predict(X_validation))


[0]	train-logloss:0.68503	train-logloss:0.68421
Multiple eval metrics have been passed: 'train-logloss' will be used for early stopping.

Will train until train-logloss hasn't improved in 50 rounds.
[20]	train-logloss:0.566094	train-logloss:0.565548
[40]	train-logloss:0.498244	train-logloss:0.506693
[60]	train-logloss:0.457584	train-logloss:0.480979
[80]	train-logloss:0.427996	train-logloss:0.468697
[100]	train-logloss:0.407399	train-logloss:0.464057
[120]	train-logloss:0.391243	train-logloss:0.462123
[140]	train-logloss:0.376362	train-logloss:0.455048
[160]	train-logloss:0.365766	train-logloss:0.450661
[180]	train-logloss:0.357882	train-logloss:0.449149
[200]	train-logloss:0.350592	train-logloss:0.449524
[220]	train-logloss:0.343504	train-logloss:0.450387
Stopping. Best iteration:
[183]	train-logloss:0.356856	train-logloss:0.448933

Out[17]:
(5.5166304787513178, 5.9963653211780485)
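
The tuple above simply re-scores the earlier XGBClassifier clf. To evaluate the booster trained with early stopping, predict with bst directly; a sketch (best_ntree_limit is set by xgboost of this era when early stopping triggers, while newer releases use iteration_range instead):

# probabilities from the booster, truncated at the best early-stopping iteration
train_proba = bst.predict(d_train, ntree_limit=bst.best_ntree_limit)
valid_proba = bst.predict(d_test, ntree_limit=bst.best_ntree_limit)
log_loss(y_train, train_proba), log_loss(y_validation, valid_proba)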

Neural nets


In [18]:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(5, 5, 5), max_iter=500)

clf.fit(X_train,y_train)
log_loss(y_train, clf.predict(X_train)), log_loss(y_validation, clf.predict(X_validation))


Out[18]:
(7.1156553081215366, 7.1956394959656444)

In [19]:
# %%time
clf = MLPClassifier(hidden_layer_sizes=(30, 18, 12, 5),
                    max_iter=1250,
                    solver='lbfgs', # 'lbfgs', 'adam'
                    learning_rate_init=0.01,
                    learning_rate='adaptive',
                    activation='tanh',
                    alpha=0.4,
                    validation_fraction=0.25,
                    early_stopping=True,
                    verbose=True,
                    random_state=7)

clf.fit(X_train, y_train)
log_loss(y_train, clf.predict(X_train)), log_loss(y_validation, clf.predict(X_validation))


Out[19]:
(2.9581951939669144, 10.793534206207546)

catboost


In [20]:
from catboost import Pool, CatBoostClassifier, cv, CatboostIpythonWidget

In [21]:
model = CatBoostClassifier(
    custom_loss=['Logloss'],
    random_seed=42
)

In [22]:
# indices of non-float columns to treat as categorical (empty here, since every
# feature was standardised to float earlier)
categorical_features_indices = np.where(X_train.dtypes != float)[0]

model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
#     verbose=True,  # you can uncomment this for text output
#     plot=True
)

log_loss(y_train, model.predict(X_train)), log_loss(y_validation, model.predict(X_validation))


Out[22]:
(3.4379004216740072, 6.4760816544050046)

Keras 1


In [50]:
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.wrappers.scikit_learn import KerasClassifier

In [52]:
# For a single-input model with 2 classes (binary classification):

model = Sequential()
model.add(Dense(5, activation='tanh', input_dim=5))
model.add(Dense(5, activation='relu'))
model.add(Dense(5, activation='tanh'))
model.add(Dense(1, activation='relu'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])



# Train the model, iterating on the data in batches of 32 samples
model.fit(X_train.values, y_train.values, epochs=100, batch_size=32, verbose=0)

log_loss(y_train.values, model.predict(X_train.values)), log_loss(y_validation.values, model.predict(X_validation.values))


Out[52]:
(0.63126560618864591, 0.70500663543336684)
save_preds(model.predict(test_df.values), 'keras_submit.csv')

KERAS 2


In [49]:
model = Sequential([
    Dense(8, input_dim=(5)),
    Dense(6),
    Activation('tanh'),
#     Dense(6),
#     Activation('relu'),
    Dense(6),
    Activation('relu'),
    Dense(1),
    Activation('sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X_train.values, y_train.values, epochs=1000, batch_size=32, verbose=0)

log_loss(y_train.values, model.predict(X_train.values)), log_loss(y_validation.values, model.predict(X_validation.values))


Out[49]:
(0.42249499242623439, 0.47635158023331314)

KERAS 3


In [35]:
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(9, input_dim=5, kernel_initializer='normal', activation='relu'))
#     model.add(Dense(5))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)

In [37]:
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X_train.values, y_train.values, cv=kfold)

# note: this line scores whatever `model` currently refers to, not the cross-validated pipeline
log_loss(y_train.values, model.predict(X_train.values)), log_loss(y_validation.values, model.predict(X_validation.values))


Out[37]:
(0.73735096146938972, 0.63999319029971957)

In [38]:
results


Out[38]:
array([ 0.79545455,  0.79545456,  0.8139535 ,  0.83720931,  0.79069768,
        0.79069768,  0.69767443,  0.76744187,  0.67441862,  0.86046512])
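
These are the ten per-fold scores returned by cross_val_score; with no scoring argument they come from KerasClassifier.score, i.e. accuracy here (the model was compiled with metrics=['accuracy']). A one-line summary:

results.mean(), results.std()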