Using scikit-learn to guess what I'm doing:

Predictive modeling with my personal finance data

These are the packages that I used:


In [14]:
import pandas as pd
import numpy as np
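#sklearn.lda was removed in later scikit-learn releases; LDA now lives here:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA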
import datetime
import random

Loading the data:


In [2]:
#specify my working directory:
path = r"C:\mycomputer\Bill\{}"
#I set it up this way so that I can just add CSV files as they come in.
myfiles = ["transactions2014.csv",
         "transactions2015.csv"]
frames = [pd.read_csv(path.format(x)) for x in myfiles]
df = pd.concat(frames)
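
One way to avoid editing the file list by hand each time a new export arrives is to glob the directory. This is only a sketch: it assumes the files keep the `transactions<year>.csv` naming used above, `find_transaction_files` is a hypothetical helper, and the demo runs against a temporary directory rather than the real data folder.

```python
import tempfile
from pathlib import Path

def find_transaction_files(directory):
    """Return every transactions*.csv file in `directory`, sorted by name."""
    return sorted(Path(directory).glob("transactions*.csv"))

# Demo on a throwaway directory (a stand-in for the real data folder).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["transactions2015.csv", "transactions2014.csv", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in find_transaction_files(tmp)]

print(found)
```

The returned paths could then be passed to `pd.read_csv` in the same list comprehension as above.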

Some quick cleaning:


In [4]:
keep_cats = ['Arts&Crafts', 'Coffee', 'Eating at Home', 'Eating out', 'Education', 'Gaming', 'Grocery',
       'Lunch', 'Moving', 'Music', 'Out for drinks', 'Out of town',
       'Technology', 'Transportation', 'Uncategorized']
df['filters'] = df['Category'].isin(keep_cats)
df = df[df['filters']]
df = df.dropna().reset_index(drop=True)

In [5]:
df.head()


Out[5]:
   Date of pull  Unique mail id    Reciept text          Date   Amount  Time EST    Month  Day  Hour  Category        filters
0  7/10/14       1471cefe61af52ac  PACIFIC SUPPLY CO...  41829  18.51   0.7104861   7      9    17    Arts&Crafts     True
1  7/10/14       1471dd507b0c3b83  RHINO ROOM            41829  10.00   0.8842014   7      9    21    Out for drinks  True
2  7/10/14       14718cb3d1bc998c  CENTURY BALLROOM      41828  55.00   0.9058912   7      8    21    Out for drinks  True
3  7/10/14       1471966915ab72a3  RHINO ROOM            41829  20.00   0.02373843  7      9    0     Out for drinks  True
4  7/10/14       147125097a342e9f  KOHL'S #1053          41827  29.99   0.6478125   7      7    15    Arts&Crafts     True

Adding a day of the week:


In [6]:
def get_day_of_week(x):
    try:
        mydate = datetime.datetime.strptime(x, '%m/%d/%Y')
    except ValueError:
        #I was inconsistent with my datestrings
        #Why didn't I just use ISO format!
        mydate = datetime.datetime.strptime(x, '%m/%d/%y')
    return mydate.strftime('%A')

df['dayOfWeek'] = df['Date of pull'].apply(get_day_of_week)
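
The try/except fallback above can be generalized to a list of candidate formats, which makes it easier to add a third format later. A minimal sketch, assuming only the two formats seen in the data; `parse_date` and `DATE_FORMATS` are hypothetical names, not part of the original notebook.

```python
import datetime

# Formats to try, in order: 4-digit year first, then 2-digit year.
DATE_FORMATS = ('%m/%d/%Y', '%m/%d/%y')

def parse_date(text, formats=DATE_FORMATS):
    """Return a datetime for `text` using the first format that matches."""
    for fmt in formats:
        try:
            return datetime.datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError("unrecognized date string: {!r}".format(text))

print(parse_date('7/10/14').strftime('%A'))
print(parse_date('7/10/2014').strftime('%A'))
```

Both calls resolve to the same date, July 10, 2014, so the two date styles in the data end up with the same day of the week.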

Then I use the "distance from Saturday" as a numeric proxy for the categorical day of the week:


In [8]:
def dist_from_sat(x):
    myvalues = {'Friday': 1,
                'Monday': 2,
                'Saturday': 0,
                'Sunday': 1,
                'Thursday': 2,
                'Tuesday': 3,
                'Wednesday': 3}
    return myvalues[x]
df['distFromSat'] = df['dayOfWeek'].apply(dist_from_sat)
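
The lookup table above is the circular distance to the nearest Saturday, so it can also be computed directly from a date. A small sketch of that arithmetic; `days_from_saturday` is a hypothetical helper that takes a date object rather than a day name.

```python
import datetime

SATURDAY = 5  # datetime.weekday(): Monday is 0, Sunday is 6

def days_from_saturday(date):
    """Number of days to the nearest Saturday, wrapping around the week."""
    offset = (date.weekday() - SATURDAY) % 7  # days since the most recent Saturday
    return min(offset, 7 - offset)            # going back or going forward, whichever is shorter

# July 5, 2014 was a Saturday; walk one full week from there.
start = datetime.date(2014, 7, 5)
week = [start + datetime.timedelta(days=i) for i in range(7)]
for day in week:
    print(day.strftime('%A'), days_from_saturday(day))
```

The values it produces for Saturday through Friday match the dictionary: 0, 1, 2, 3, 3, 2, 1.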

In [15]:
df[['distFromSat','dayOfWeek']].loc[random.sample(list(df.index), 10)]


Out[15]:
distFromSat dayOfWeek
842 1 Sunday
829 2 Thursday
805 3 Wednesday
952 1 Friday
237 1 Friday
271 2 Monday
6 2 Thursday
835 0 Saturday
847 3 Tuesday
633 2 Thursday

Transform for my model:


In [16]:
X = df.loc[:,['distFromSat','Hour','Amount']].values
y = df.loc[:,'Category'].values

Now I can run my model:


In [17]:
clf = LDA()
clf.fit(X, y)


Out[17]:
LDA(n_components=None, priors=None)

Now see if my model has any validity:


In [19]:
df['predictions'] = clf.predict(X)
accuracy_of_model = (df['predictions'] == df['Category']).mean()
accuracy_of_random_guess = 1./len(np.unique(y))

In [20]:
accuracy_of_model


Out[20]:
0.28330206378986866

In [21]:
accuracy_of_random_guess


Out[21]:
0.07142857142857142
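
The 1/k figure assumes every category is equally likely. When some categories are much more common than others, always predicting the most frequent one can be a tougher baseline to beat. The sketch below computes both; `baseline_accuracies` is a hypothetical helper, and the labels are made up for illustration, not taken from the real transactions.

```python
from collections import Counter

def baseline_accuracies(labels):
    """Return (uniform_guess, majority_class) accuracies for a list of labels."""
    counts = Counter(labels)
    uniform = 1.0 / len(counts)                           # pick a class uniformly at random
    majority = counts.most_common(1)[0][1] / len(labels)  # always pick the most common class
    return uniform, majority

# Made-up labels for illustration only.
toy_labels = ['Coffee'] * 5 + ['Lunch'] * 3 + ['Grocery'] * 2
uniform, majority = baseline_accuracies(toy_labels)
print(uniform, majority)
```

On the real data, passing `y` converted to a list would give both numbers to compare against `accuracy_of_model`.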

Accurate about 28% of the time, roughly four times better than a random guess. That's not bad considering the number of categories. I can't wait to train it on current data to see how well it predicts what I am going to do going forward.
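
The accuracy above was computed on the same rows the model was fit on. One way to get a sense of how it might do on unseen transactions is to hold out part of the data before fitting. This is only a sketch on synthetic stand-in data (three made-up clusters), assuming a scikit-learn version that provides `sklearn.model_selection` and `sklearn.discriminant_analysis`; the 25% split and the seeds are arbitrary choices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: three well-separated clusters in three features.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [10.0, 0.0, 5.0]])
X = np.vstack([rng.normal(c, 1.0, size=(100, 3)) for c in centers])
y = np.repeat(['a', 'b', 'c'], 100)

# Hold out 25% of the rows; stratify keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)  # fraction of held-out rows predicted correctly
print(test_accuracy)
```

With the real `X` and `y` from the notebook, the held-out score would likely come out lower than the score on the training rows.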