These are the packages that I used:
In [14]:
import pandas as pd
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import datetime
import random
Loading the data:
In [2]:
#specify my working directory:
path = r"C:\mycomputer\Bill\{}"
#I put it in this way so that I could just add CSV files as they come in.
myfiles = ["transactions2014.csv",
"transactions2015.csv"]
frames = [pd.read_csv(path.format(x)) for x in myfiles]
df = pd.concat(frames)
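If the list of files keeps growing, it could be discovered rather than typed by hand. This is a minimal sketch, assuming every CSV follows the `transactions<year>.csv` naming pattern seen above; `find_transaction_files` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path


def find_transaction_files(directory):
    """Return the transactions*.csv paths in `directory`, sorted by name."""
    # Hypothetical helper: the glob pattern is an assumption based on the
    # two file names listed above.
    return sorted(Path(directory).glob("transactions*.csv"))

# Each returned path could then be passed to pd.read_csv, as in the cell above.
```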
Some quick cleaning:
In [4]:
keep_cats = ['Arts&Crafts', 'Coffee', 'Eating at Home', 'Eating out', 'Education', 'Gaming', 'Grocery',
'Lunch', 'Moving', 'Music', 'Out for drinks', 'Out of town',
'Technology', 'Transportation', 'Uncategorized']
df['filters'] = df['Category'].apply(lambda x: x in keep_cats)
df = df[df['filters']]
df = df.dropna().reset_index(drop=True)
In [5]:
df.head()
Out[5]:
Adding a day of the week:
In [6]:
def get_day_of_week(x):
    try:
        mydate = datetime.datetime.strptime(x, '%m/%d/%Y')
    except ValueError:
        #I was inconsistent with my datestrings
        #Why didn't I just use ISO format!
        mydate = datetime.datetime.strptime(x, '%m/%d/%y')
    return mydate.strftime('%A')
df['dayOfWeek'] = df['Date of pull'].apply(lambda x: get_day_of_week(x))
Then I use the "distance from Saturday" as a numeric proxy for the categorical day-of-week value:
In [8]:
def dist_from_sat(x):
    myvalues = {'Friday': 1,
                'Monday': 2,
                'Saturday': 0,
                'Sunday': 1,
                'Thursday': 2,
                'Tuesday': 3,
                'Wednesday': 3}
    return myvalues[x]
df['distFromSat'] = df['dayOfWeek'].apply(lambda x: dist_from_sat(x))
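The lookup table in `dist_from_sat` is the circular distance on a seven-day week, so it can also be computed instead of typed out. This is a sketch only; `DAYS` and `circular_dist_from_sat` are illustrative names that do not appear in the original notebook.

```python
# Day names in calendar order; the index positions are what matter here.
DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']


def circular_dist_from_sat(day_name):
    """Number of days to the nearest Saturday, going forwards or backwards."""
    diff = abs(DAYS.index(day_name) - DAYS.index('Saturday'))
    # The week wraps around, so the shorter of the two directions wins.
    return min(diff, 7 - diff)


print({day: circular_dist_from_sat(day) for day in DAYS})
```

This gives the same values as the dictionary above: 0 for Saturday, 1 for Friday and Sunday, 2 for Thursday and Monday, and 3 for Tuesday and Wednesday.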
In [15]:
df[['distFromSat','dayOfWeek']].sample(10)
Out[15]:
Transform for my model:
In [16]:
X = df.loc[:,['distFromSat','Hour','Amount']].values
y = df.loc[:,'Category'].values
Now I can run my model:
In [17]:
clf = LDA()
clf.fit(X, y)
Out[17]:
Now see if my model has any validity:
In [19]:
df['predictions'] = clf.predict(X)
accuracy_of_model = len(df[df['predictions'] == df['Category']])/(len(df)*1.)
accuracy_of_random_guess = 1./len(np.unique(y))
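A uniform random guess is the weakest baseline. When some categories are much more frequent than others, always predicting the most common category is a tougher one to beat. This is a minimal sketch of that baseline on made-up labels (not my data); `majority_class_accuracy` is a hypothetical helper.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up example labels, for illustration only.
example_labels = ['Coffee', 'Coffee', 'Coffee', 'Lunch', 'Grocery', 'Gaming']
print(majority_class_accuracy(example_labels))  # 0.5
```

With four distinct labels the uniform guess would score 0.25 here, while the majority-class guess scores 0.5. On the real data it could be computed as `majority_class_accuracy(list(y))`.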
In [20]:
accuracy_of_model
Out[20]:
In [21]:
accuracy_of_random_guess
Out[21]:
Accurate almost a third of the time. That's not bad considering the number of categories, although this accuracy is measured on the same data the model was trained on. I can't wait to train it on current data to see how well it predicts what I am going to do going forward.
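Because the score above comes from predicting on the rows the model was fitted to, it may be optimistic. This is a minimal sketch of a held-out check using scikit-learn's `train_test_split`. The data is a synthetic stand-in (not my transactions), and the 25% test size and the random seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (distFromSat, Hour, Amount) features and the
# Category labels; the class-dependent shift makes the classes separable.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 4, size=400)
X_demo = rng.normal(size=(400, 3)) + y_demo[:, None]

# Hold out a quarter of the rows, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

clf_demo = LinearDiscriminantAnalysis()
clf_demo.fit(X_train, y_train)

# Accuracy on rows the model never saw during fitting.
held_out_accuracy = clf_demo.score(X_test, y_test)
print(held_out_accuracy)
```

For transaction data, splitting by date (train on earlier rows, test on later ones) might be closer to the "predict going forward" goal than a random split.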