Welcome! This Jupyter notebook walks through a typical data analysis of the customer activity data from the Kaggle competition Predicting Red Hat Business Value (https://www.kaggle.com/c/predicting-red-hat-business-value).
Questions, comments, suggestions, and corrections can be sent to mgebhard@gmail.com.
Red Hat teamed up with Kaggle to challenge teams to create a classification algorithm that identifies which customers have the most potential business value for Red Hat based on their characteristics and activities. Red Hat plans to use these prediction models to efficiently prioritize resources to generate future business and serve customers.
Red Hat provided four CSV files for the challenge (https://www.kaggle.com/c/predicting-red-hat-business-value/data). The people.csv file lists each customer, with a unique people_id and various characteristics. The files act_train.csv and act_test.csv contain the activities performed by customers, along with the characteristics of each activity. The sample_submission.csv file gives the format that Kaggle expects our predictions to take. Let's read the three relevant files with pandas.
In [3]:
import pandas as pd
act_train = pd.read_csv('act_train.csv')
act_test = pd.read_csv('act_test.csv')
people = pd.read_csv('people.csv')
In [4]:
def prepare_acts(data, train_set=True):
    # Drop the columns that are not used as features.
    data = data.drop(['date', 'activity_id'], axis=1)
    if train_set:
        data = data.drop(['outcome'], axis=1)
    # Keep the part of the id after the underscore and store it as an integer.
    data['people_id'] = data['people_id'].apply(lambda x: x.split('_')[1])
    data['people_id'] = pd.to_numeric(data['people_id']).astype(int)
    columns = list(data.columns)
    # Turn 'type N' strings into the integer N; missing values become 0.
    for col in columns[1:]:
        data[col] = data[col].fillna('type 0')
        data[col] = data[col].apply(lambda x: x.split(' ')[1])
        data[col] = pd.to_numeric(data[col]).astype(int)
    return data
def prepare_people(data):
    data = data.drop(['date'], axis=1)
    # Keep the part of the id after the underscore and store it as an integer.
    data['people_id'] = data['people_id'].apply(lambda x: x.split('_')[1])
    data['people_id'] = pd.to_numeric(data['people_id']).astype(int)
    columns = list(data.columns)
    # Columns 1-10 hold strings; the remaining columns are cast straight to int.
    bools = columns[11:]
    strings = columns[1:11]
    for col in bools:
        data[col] = pd.to_numeric(data[col]).astype(int)
    # Turn the second space-separated token into an integer; missing values become 0.
    for col in strings:
        data[col] = data[col].fillna('type 0')
        data[col] = data[col].apply(lambda x: x.split(' ')[1])
        data[col] = pd.to_numeric(data[col]).astype(int)
    return data
people_prepared = prepare_people(people)
actions_train = prepare_acts(act_train)
actions_test = prepare_acts(act_test, train_set=False)
features = actions_train.merge(people_prepared, how='left', on='people_id')
labels = act_train['outcome']
test = actions_test.merge(people_prepared, how='left', on='people_id')
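To make the preparation step concrete, here is a minimal sketch on a made-up two-row frame. The values and the names `toy_acts`, `toy_people`, and `to_int_codes` are invented for illustration and are not taken from the competition files; the parsing and merge logic mirrors the cell above. Note that when both frames share a column name other than the join key, pandas appends the suffixes `_x` and `_y` by default.

```python
import pandas as pd

# Hypothetical toy data shaped like the competition files (values invented).
toy_acts = pd.DataFrame({
    'people_id': ['ppl_100', 'ppl_200'],
    'char_1': ['type 3', None],   # a missing value in the second row
})
toy_people = pd.DataFrame({
    'people_id': ['ppl_100', 'ppl_200'],
    'char_1': ['type 2', 'type 1'],
})

def to_int_codes(frame):
    """Turn 'ppl_100' into 100 and 'type 3' into 3; missing values become 0."""
    frame = frame.copy()
    frame['people_id'] = frame['people_id'].str.split('_').str[1].astype(int)
    frame['char_1'] = (frame['char_1'].fillna('type 0')
                       .str.split(' ').str[1].astype(int))
    return frame

# Left merge on the shared key, as in the cell above.
toy_features = to_int_codes(toy_acts).merge(
    to_int_codes(toy_people), how='left', on='people_id')
print(toy_features)
```

The merged frame keeps one row per activity, with the activity's `char_1` and the person's `char_1` side by side under suffixed names.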
Here are some resources from the web discussing the Random Forests model:
And here are resources discussing the RandomForestClassifier from sklearn.ensemble:
In [5]:
from sklearn.ensemble import RandomForestClassifier
rfclassifier = RandomForestClassifier()
rfclassifier.fit(features, labels)
test_proba = rfclassifier.predict_proba(test)
test_preds = test_proba[:,1]
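The slice `test_proba[:,1]` relies on how scikit-learn orders the columns of `predict_proba`: one column per class, in the order given by the fitted model's `classes_` attribute. A small sketch on synthetic data shows that column 1 holds the probability of outcome 1. The arrays, the `n_estimators` setting, and the `random_state` value are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real features and labels (invented values).
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)   # binary labels, 0 or 1

# random_state fixes the seed so repeated runs build the same forest.
clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])
print(clf.classes_)   # the column order used by predict_proba
print(proba.shape)    # one row per sample, one column per class
p_outcome_1 = proba[:, 1]   # probability of the class stored at classes_[1]
```

Passing `random_state` is optional, but without it two runs of the notebook can produce slightly different predictions.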
In [6]:
test_ids = act_test['activity_id']
output = pd.DataFrame({'activity_id': test_ids, 'outcome': test_preds})
output.to_csv('redhat.csv', index=False)
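Before uploading, it can help to confirm that the file has the two columns written above and that every prediction is a valid probability. The sketch below writes a made-up frame to an in-memory buffer instead of disk; the ids, the probabilities, and the name `toy_output` are invented for illustration.

```python
import io
import pandas as pd

# Invented ids and probabilities standing in for test_ids and test_preds.
toy_output = pd.DataFrame({
    'activity_id': ['act1_1', 'act2_2', 'act2_3'],
    'outcome': [0.9, 0.1, 0.5],
})

buffer = io.StringIO()
toy_output.to_csv(buffer, index=False)   # index=False keeps the row index out of the file

# Read the file back and run a few sanity checks.
buffer.seek(0)
check = pd.read_csv(buffer)
header_ok = list(check.columns) == ['activity_id', 'outcome']
rows_ok = len(check) == len(toy_output)
in_range = bool(check['outcome'].between(0, 1).all())
print(header_ok, rows_ok, in_range)
```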