Data Set Characteristics: Multivariate
Number of Instances: 748
Area: Business
Attribute Characteristics: Real
Number of Attributes: 5
Date Donated: 2008-10-03
Associated Tasks: Classification
Number of Web Hits: 140894
Original Owner and Donor: Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan 30067, R.O.C. E-mail: icyeh '@' chu.edu.tw, Tel: 886-3-5186511
Date Donated: October 3, 2008
To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center sends its blood transfusion service bus to one of the universities in Hsin-Chu City to collect donated blood roughly every three months. To build the RFMTC model, we selected 748 donors at random from the donor database. Each of these 748 donor records includes R (Recency - months since last donation), F (Frequency - total number of donations), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable indicating whether the donor gave blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).
Listed below are the variable name, variable type, measurement unit, and a brief description of each attribute. The "Blood Transfusion Service Center" data set is a classification problem. The order of this listing corresponds to the order of the values along the rows of the database.
R (Recency - months since last donation),
F (Frequency - total number of donations),
M (Monetary - total blood donated in c.c.),
T (Time - months since first donation), and
a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).
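As a quick orientation, the raw data can be loaded along the following lines. This is only a sketch: the file name transfusion.data and the short column names are assumptions (the UCI distribution ships the file with its own header row) and are not part of the original notebook, which instead reads pre-split data/BloodDonation.csv files further below.

import pandas as pd

# assumed file name and short column names for the five attributes described above
cols = ['recency_months', 'frequency_times', 'monetary_cc', 'time_months', 'donated_march_2007']
raw = pd.read_csv('transfusion.data', header=0, names=cols)
raw.shape   # expected (748, 5)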
Table 1 shows the descriptive statistics of the data. We selected 500 records at random as the training set and used the remaining 248 as the test set.
Table 1. Descriptive statistics of the data
Variable                                   | Data type    | Unit        | Role   | Min  | Max   | Mean    | Std
Recency                                    | quantitative | months      | input  | 0.03 | 74.4  | 9.74    | 8.07
Frequency                                  | quantitative | times       | input  | 1    | 50    | 5.51    | 5.84
Monetary                                   | quantitative | c.c. blood  | input  | 250  | 12500 | 1378.68 | 1459.83
Time                                       | quantitative | months      | input  | 2.27 | 98.3  | 34.42   | 24.32
Whether he/she donated blood in March 2007 | binary       | 1=yes, 0=no | output | 0    | 1     | 1 (24%) | 0 (76%)
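The figures in Table 1, as well as the 500/248 split, can be approximated from the raw frame loaded above. This is only a sketch with an arbitrary random_state, so the exact partition will not match the one used in the paper.

raw.describe().T[['min', 'max', 'mean', 'std']]   # compare against Table 1
train = raw.sample(n=500, random_state=0)         # 500 records at random for training
test = raw.drop(train.index)                      # remaining 248 for testing
len(train), len(test)                             # (500, 248)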
Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, "Knowledge discovery on RFM model using Bernoulli sequence," Expert Systems with Applications, 2008. [Web Link]
In [2]:
import warnings
import itertools
warnings.filterwarnings('ignore')
from functools import lru_cache
# standard tools
import numpy as np
import pandas as pd
# %load_ext autoreload
seed = 7 * 9
np.random.seed(seed)
import xgboost
import sklearn.ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4
In [4]:
scale_cols = {}
def rename_cols(name):
    # drop any parenthesised unit, then abbreviate to the first letter of each word
    if '(' in name:
        name = name.split('(')[0]
    return ''.join(map(lambda x: x[0], name.lower().split()))

@lru_cache(maxsize=128)
def get_data():
    df = pd.read_csv('data/BloodDonation.csv', index_col=0)
    test_df = pd.read_csv('data/BloodDonationTest.csv', index_col=0)
    # drop Monetary: it is a constant multiple (250 c.c.) of the number of donations
    df.drop(['Total Volume Donated (c.c.)'], inplace=True, axis=1)
    test_df.drop(['Total Volume Donated (c.c.)'], inplace=True, axis=1)
    # rename cols
    new_cols_names = df.columns.map(rename_cols)
    for old_name, new_name in zip(df.columns, new_cols_names):
        print('Rename:', old_name, '\t\tNewname:', new_name)
    df.columns = new_cols_names
    test_df.columns = test_df.columns.map(rename_cols)
    # standardise every input column; scalers are fit on the training data only
    global scale_cols
    for col in df.columns[:-1]:
        scale_cols[col] = StandardScaler(copy=True, with_mean=True, with_std=True).fit(df[[col]])
        df[col] = scale_cols[col].transform(df[[col]]).ravel()
        test_df[col] = scale_cols[col].transform(test_df[[col]]).ravel()
    return (df, test_df)
In [5]:
## Data Modelling
def get_test_train(df):
    # 'mdim2' is "Made Donation in March 2007" after column renaming
    X = df.drop('mdim2', axis=1)
    y = df['mdim2']
    X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=1234)
    return (X_train, X_validation, y_train, y_validation)

def test_train_validation_splt(X, y):
    # https://stackoverflow.com/questions/40829137/stratified-train-validation-test-split-in-scikit-learn
    from sklearn.model_selection import train_test_split as tts
    SEED = 2000
    x_train, x_validation_and_test, y_train, y_validation_and_test = tts(X, y, test_size=.4, random_state=SEED)
    x_validation, x_test, y_validation, y_test = tts(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)
    return (x_train, x_test, x_validation,
            y_train, y_test, y_validation)

## save preds
def save_preds(preds, filename='submit.csv'):
    pd.DataFrame(preds.astype(np.float64),
                 index=test_df.index,
                 columns=['Made Donation in March 2007']
                 ).to_csv(filename)
    print('stored file as', filename)
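save_preds is not exercised anywhere in this section; a hypothetical call, assuming a fitted classifier clf whose feature columns match test_df, would look like:

# hypothetical example: write positive-class probabilities for the competition test set
preds = clf.predict_proba(test_df)[:, 1]
save_preds(preds, filename='submit.csv')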
In [6]:
df, test_df = get_data()
In [7]:
df.columns
Out[7]:
In [8]:
# ratio of the number of donations to the months since first donation (both already scaled), plus its inverse
df['nod_per_msfd'] = df['nod'] / df['msfd']
df['msfd_per_nod'] = 1 / df['nod_per_msfd']
test_df['nod_per_msfd'] = test_df['nod'] / test_df['msfd']
test_df['msfd_per_nod'] = 1 / test_df['nod_per_msfd']
In [9]:
df.columns
Out[9]:
In [10]:
X_train, X_validation, y_train, y_validation = get_test_train(df)
In [11]:
X_train.shape, X_validation.shape
Out[11]:
In [12]:
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB(alpha=0.5, binarize=0.5)
clf.fit(X_train, y_train)
# score with predicted probabilities rather than hard 0/1 labels so the log-loss is meaningful
log_loss(y_train, clf.predict_proba(X_train)[:, 1]), log_loss(y_validation, clf.predict_proba(X_validation)[:, 1])
Out[12]:
In [13]:
clf
Out[13]:
In [14]:
clf = sklearn.ensemble.GradientBoostingClassifier(
warm_start=True, subsample=.8,
n_estimators=500,
# learning_rate=0.0001,
presort=True, verbose=0).fit(X_train, y_train)
# log_loss(y, clf.predict(X))
# results = cross_val_score(clf, X, y, cv=kfold, scoring='log_loss')
log_loss(y_train, clf.predict_proba(X_train)[:, 1]), log_loss(y_validation, clf.predict_proba(X_validation)[:, 1])
Out[14]:
In [15]:
clf
Out[15]:
In [16]:
from xgboost import XGBClassifier
clf = XGBClassifier(max_depth=4,
learning_rate=0.05,
reg_alpha=0.1,
reg_lambda=0.5,
seed=12,
# eta=0.02,
colsample_bylevel=0.5,
objective= 'binary:logistic'
# n_estimators=800
)
clf.fit(X_train, y_train)
log_loss(y_train, clf.predict_proba(X_train)[:, 1]), log_loss(y_validation, clf.predict_proba(X_validation)[:, 1])
Out[16]:
In [17]:
xgb = xgboost
params = {}
params['objective'] = 'binary:logistic'
params['eval_metric'] = 'logloss'
params['eta'] = 0.02
params['max_depth'] = 5
d_train = xgb.DMatrix(X_train, label=y_train)
d_test = xgb.DMatrix(X_validation, label=y_validation)
watchlist = [(d_train, 'train'),
             (d_test, 'valid')]
bst = xgb.train(params, d_train, 1000, watchlist, early_stopping_rounds=50, verbose_eval=20)
# score the trained booster (bst) itself; its predictions are already probabilities
log_loss(y_train, bst.predict(d_train)), log_loss(y_validation, bst.predict(d_test))
Out[17]:
In [18]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(5, 5, 5), max_iter=500)
clf.fit(X_train,y_train)
log_loss(y_train, clf.predict_proba(X_train)[:, 1]), log_loss(y_validation, clf.predict_proba(X_validation)[:, 1])
Out[18]:
In [19]:
# %%time
clf = MLPClassifier(hidden_layer_sizes=(30, 18, 12, 5),
max_iter=1250,
solver='lbfgs', # 'lbfgs', 'adam'
learning_rate_init=0.01,
learning_rate='adaptive',
activation='tanh',
alpha=0.4,
validation_fraction=0.25,
early_stopping=True,
verbose=True,
random_state=7)
clf.fit(X_train, y_train)
log_loss(y_train, clf.predict_proba(X_train)[:, 1]), log_loss(y_validation, clf.predict_proba(X_validation)[:, 1])
Out[19]:
In [20]:
from catboost import Pool, CatBoostClassifier, cv, CatboostIpythonWidget
In [21]:
model = CatBoostClassifier(
custom_loss=['Logloss'],
random_seed=42
)
In [22]:
categorical_features_indices = np.where(X_train.dtypes != np.float64)[0]
model.fit(
X_train, y_train,
cat_features=categorical_features_indices,
eval_set=(X_validation, y_validation),
# verbose=True, # you can uncomment this for text output
# plot=True
)
log_loss(y_train, model.predict_proba(X_train)[:, 1]), log_loss(y_validation, model.predict_proba(X_validation)[:, 1])
Out[22]:
In [50]:
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.wrappers.scikit_learn import KerasClassifier
In [52]:
# For a single-input model with 2 classes (binary classification):
model = Sequential()
model.add(Dense(5, activation='tanh', input_dim=5))
model.add(Dense(5, activation='relu'))
model.add(Dense(5, activation='tanh'))
# sigmoid output so the network emits valid probabilities for binary cross-entropy
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])
# Train the model, iterating on the data in batches of 32 samples
model.fit(X_train.values, y_train.values, epochs=100, batch_size=32, verbose=0)
log_loss(y_train.values, model.predict(X_train.values)), log_loss(y_validation.values, model.predict(X_validation.values))
Out[52]:
In [49]:
model = Sequential([
Dense(8, input_dim=5),
Dense(6),
Activation('tanh'),
# Dense(6),
# Activation('relu'),
Dense(6),
Activation('relu'),
Dense(1),
Activation('sigmoid'),
])
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
model.fit(X_train.values, y_train.values, epochs=1000, batch_size=32, verbose=0)
log_loss(y_train.values, model.predict(X_train.values)), log_loss(y_validation.values, model.predict(X_validation.values))
Out[49]:
In [35]:
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(9, input_dim=5, kernel_initializer='normal', activation='relu'))
    # model.add(Dense(5))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
In [37]:
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X_train.values, y_train.values, cv=kfold)
# fit the pipeline itself before scoring, rather than reusing the earlier Keras model
pipeline.fit(X_train.values, y_train.values)
log_loss(y_train.values, pipeline.predict_proba(X_train.values)[:, 1]), log_loss(y_validation.values, pipeline.predict_proba(X_validation.values)[:, 1])
Out[37]:
In [38]:
results
Out[38]: