Predicting Red Hat Business Value

Welcome! This Jupyter notebook walks through a basic analysis of the customer activity data from the Kaggle competition of the same name (https://www.kaggle.com/c/predicting-red-hat-business-value).

Questions, comments, suggestions, and corrections can be sent to mgebhard@gmail.com.

Business Challenge

Red Hat teamed up with Kaggle to challenge participants to build a classification algorithm that identifies which customers have the most potential business value for Red Hat, based on their characteristics and activities. Red Hat plans to use these prediction models to prioritize resources more efficiently, both to generate future business and to serve customers.

Data

Red Hat provided four CSV files for the challenge (https://www.kaggle.com/c/predicting-red-hat-business-value/data). The people.csv file lists every customer, each with a unique people_id and various characteristics. The files act_train.csv and act_test.csv contain the activities performed by customers, along with the characteristics of each activity. The sample_submission.csv file shows the format that Kaggle expects our predictions to take. Let's read the three files we need (everything except the sample submission) with pandas.



In [3]:

    
import pandas as pd

# The three data files are assumed to be in the working directory.
act_train = pd.read_csv('act_train.csv')
act_test = pd.read_csv('act_test.csv')
people = pd.read_csv('people.csv')

Preparing the Data

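Before modelling, we convert the data to a purely numeric form. The date and activity_id columns are dropped, identifiers and categorical strings are reduced to their integer part, and missing values are encoded as 0. We then merge the people data into the activity data on people_id, so that every activity row also carries the characteristics of the customer who performed it.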
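To make the encoding concrete, here is a minimal sketch on made-up values. It assumes the raw categorical entries look like 'type 3' and the identifiers like 'ppl_100', which is what the string splitting in this notebook implies; the series names and values are purely illustrative.

```python
import pandas as pd

# Toy column mimicking the raw categorical format; the values are made up.
raw = pd.Series(['type 3', None, 'type 12'], name='char_1')

# Missing entries become the placeholder 'type 0', then only the number is kept.
encoded = raw.fillna('type 0').str.split(' ').str[1].astype(int)

# Identifiers are handled the same way, splitting on '_' instead of ' '.
ids = pd.Series(['ppl_100', 'ppl_25'], name='people_id')
id_numbers = pd.to_numeric(ids.str.split('_').str[1]).astype(int)

print(encoded.tolist())     # [3, 0, 12]
print(id_numbers.tolist())  # [100, 25]
```

Note that the resulting integers are category codes, not measurements: the order between them is arbitrary, which tree-based models tolerate better than linear ones.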



In [4]:

    
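def prepare_acts(data, train_set=True):
    """Convert a raw activity table to an all-integer feature table."""
    # Drop the columns that are not used as features.
    data = data.drop(['date', 'activity_id'], axis=1)
    if train_set:
        # The label only exists in the training file; it is handled separately.
        data = data.drop(['outcome'], axis=1)
    # Keep only the numeric part of the identifier, e.g. 'ppl_100' -> 100.
    data['people_id'] = data['people_id'].apply(lambda x: x.split('_')[1])
    data['people_id'] = pd.to_numeric(data['people_id']).astype(int)
    columns = list(data.columns)
    # Every remaining column holds strings such as 'type 3'; missing values become 0.
    for col in columns[1:]:
        data[col] = data[col].fillna('type 0')
        data[col] = data[col].apply(lambda x: x.split(' ')[1])
        data[col] = pd.to_numeric(data[col]).astype(int)
    return data

def prepare_people(data):
    """Convert the raw people table to an all-integer feature table."""
    data = data.drop(['date'], axis=1)
    # Keep only the numeric part of the identifier, e.g. 'ppl_100' -> 100.
    data['people_id'] = data['people_id'].apply(lambda x: x.split('_')[1])
    data['people_id'] = pd.to_numeric(data['people_id']).astype(int)
    columns = list(data.columns)
    # Columns are selected by position: 1-10 hold 'type 3'-style strings,
    # everything from 11 onwards is treated as boolean.
    bools = columns[11:]
    strings = columns[1:11]
    for col in bools:
        data[col] = pd.to_numeric(data[col]).astype(int)
    for col in strings:
        data[col] = data[col].fillna('type 0')
        data[col] = data[col].apply(lambda x: x.split(' ')[1])
        data[col] = pd.to_numeric(data[col]).astype(int)
    return data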

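people_prepared = prepare_people(people)
actions_train = prepare_acts(act_train)
actions_test = prepare_acts(act_test, train_set=False)

# A left join keeps every activity row and attaches the matching customer's characteristics.
features = actions_train.merge(people_prepared, how='left', on='people_id')
# The label comes from the raw training frame, since prepare_acts dropped it.
labels = act_train['outcome']
test = actions_test.merge(people_prepared, how='left', on='people_id')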
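One detail of the merge is worth a small illustration. If the activity and people tables share column names other than people_id (the Kaggle files appear to, for example char_1), pandas keeps both copies and appends the suffixes '_x' and '_y'. The sketch below uses made-up frames to show how more readable suffixes can be chosen, and how the validate argument can confirm that each people_id occurs only once in the people table. The frame names and values are hypothetical.

```python
import pandas as pd

# Toy frames with made-up values; both contain a column named 'char_1'.
acts = pd.DataFrame({'people_id': [1, 1, 2], 'char_1': [5, 7, 9]})
ppl = pd.DataFrame({'people_id': [1, 2], 'char_1': [2, 3], 'char_38': [36, 76]})

merged = acts.merge(
    ppl,
    how='left',
    on='people_id',
    suffixes=('_act', '_ppl'),  # replaces the default '_x' / '_y'
    validate='many_to_one',     # raises MergeError if a people_id repeats in ppl
)

print(list(merged.columns))
# ['people_id', 'char_1_act', 'char_1_ppl', 'char_38']
```

The row count of the result equals that of the activity frame, which is what we want when each row is one activity to be scored.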

A Random Forest Model

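A random forest is an ensemble of decision trees, each fit on a bootstrap sample of the training data; for classification, the predicted class probabilities are averaged across the trees.

Here we use the RandomForestClassifier from sklearn.ensemble with its default settings and take the predicted probability of outcome 1 as our prediction for each test activity.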



In [5]:

    
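from sklearn.ensemble import RandomForestClassifier

# Default hyperparameters; no random_state is set, so results vary slightly between runs.
rfclassifier = RandomForestClassifier()
rfclassifier.fit(features, labels)

# predict_proba returns one column per class; column 1 is the probability of outcome 1.
test_proba = rfclassifier.predict_proba(test)
test_preds = test_proba[:,1]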
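The notebook relies on the Kaggle leaderboard for evaluation. A local estimate can be obtained by holding out part of the training data, as in the sketch below. It runs on synthetic data rather than the Red Hat files, assumes the competition scores submissions by area under the ROC curve, and splits by a hypothetical customer id so that no customer appears in both the training and the validation part. All variable names here are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-ins for features, labels and people_id; not the Red Hat data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
groups = rng.integers(0, 50, size=400)  # hypothetical customer ids

# Hold out 25% of the customers, keeping each customer's rows on one side only.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(splitter.split(X, y, groups))

model = RandomForestClassifier(random_state=0)
model.fit(X[train_idx], y[train_idx])

# Score the held-out rows with the probability of class 1.
valid_proba = model.predict_proba(X[valid_idx])[:, 1]
auc = roc_auc_score(y[valid_idx], valid_proba)
print(f'validation AUC: {auc:.3f}')
```

With the real data, the same pattern would use features, labels, and features['people_id'] in place of X, y, and groups. Grouping matters here because rows from the same customer share all of that customer's characteristics, so a purely random row split would likely give an optimistic estimate.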

Output

Finally, we write our predictions to a CSV file that can be uploaded to Kaggle: one row per test activity, with its activity_id and the predicted probability as outcome.



In [6]:

    
test_ids = act_test['activity_id']
output = pd.DataFrame({'activity_id': test_ids, 'outcome': test_preds})
output.to_csv('redhat.csv', index=False)
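Before uploading, it can be worth checking the file for obvious problems. The helper below is a hypothetical sketch, not part of the original analysis: it checks the column names used above, the row count, duplicate ids, and that every prediction is a probability between 0 and 1.

```python
import pandas as pd

def submission_problems(df, expected_rows):
    """Hypothetical helper: return a list of problems found in a submission frame."""
    if list(df.columns) != ['activity_id', 'outcome']:
        return ['unexpected columns: ' + ', '.join(map(str, df.columns))]
    problems = []
    if len(df) != expected_rows:
        problems.append('wrong number of rows')
    if not df['activity_id'].is_unique:
        problems.append('duplicate activity_id values')
    # between() is False for NaN, so missing predictions are flagged as well.
    if not df['outcome'].between(0, 1).all():
        problems.append('outcome outside [0, 1] or missing')
    return problems

# Toy frames with made-up ids and probabilities.
good = pd.DataFrame({'activity_id': ['a1', 'a2'], 'outcome': [0.9, 0.1]})
bad = pd.DataFrame({'activity_id': ['a1', 'a1'], 'outcome': [1.2, 0.1]})

print(submission_problems(good, expected_rows=2))  # []
print(submission_problems(bad, expected_rows=2))
```

For the real file, the call would be submission_problems(output, expected_rows=len(act_test)).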

Conclusion

Submitting this file to Kaggle gives a public leaderboard score of 0.946164, which is not a bad starting point for a random forest with default settings.