Using scikit-learn to guess what I'm doing:

Predictive modeling with my personal finance data

These are the packages that I used:



In [14]:

    
import pandas as pd
import numpy as np
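# sklearn.lda was removed from scikit-learn; LDA now lives in sklearn.discriminant_analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA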
import datetime
import random

Loading the data:



In [2]:

    
# Specify my working directory (raw string so the backslashes are kept literally):
path = r"C:\mycomputer\Bill\{}"
# I set it up this way so that I can just add CSV files as they come in.
myfiles = ["transactions2014.csv",
           "transactions2015.csv"]
frames = [pd.read_csv(path.format(x)) for x in myfiles]
df = pd.concat(frames)
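
If the goal is to pick up new CSV files without editing the list by hand, one option is to glob the folder instead. This is only a sketch: `load_transactions` and the `transactions*.csv` pattern are hypothetical names chosen for illustration, and the demo writes throwaway files to a temporary directory rather than touching the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_transactions(folder, pattern="transactions*.csv"):
    """Read every CSV in `folder` matching `pattern` into one DataFrame."""
    files = sorted(Path(folder).glob(pattern))  # sorted for a stable row order
    frames = [pd.read_csv(f) for f in files]
    # ignore_index=True gives a clean 0..n-1 index across all files
    return pd.concat(frames, ignore_index=True)


# Demo with throwaway files (made-up amounts, not the real transactions)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Amount": [18.51, 10.00]}).to_csv(
        Path(tmp) / "transactions2014.csv", index=False)
    pd.DataFrame({"Amount": [55.00]}).to_csv(
        Path(tmp) / "transactions2015.csv", index=False)
    demo = load_transactions(tmp)
```

With this approach, dropping a new file such as `transactions2016.csv` into the folder would be enough for it to be included on the next run.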

Some quick cleaning:



In [4]:

    
keep_cats = ['Arts&Crafts', 'Coffee', 'Eating at Home', 'Eating out', 'Education', 'Gaming', 'Grocery',
       'Lunch', 'Moving', 'Music', 'Out for drinks', 'Out of town',
       'Technology', 'Transportation', 'Uncategorized']
df['filters'] = df['Category'].isin(keep_cats)
df = df[df['filters']]
df = df.dropna().reset_index(drop=True)



In [5]:

    
df.head()









    Out[5]:

       Date of pull    Unique mail id          Reciept text   Date  Amount    Time EST  Month  Day  Hour        Category  filters
    0       7/10/14  1471cefe61af52ac  PACIFIC SUPPLY CO...  41829   18.51   0.7104861      7    9    17     Arts&Crafts     True
    1       7/10/14  1471dd507b0c3b83            RHINO ROOM  41829   10.00   0.8842014      7    9    21  Out for drinks     True
    2       7/10/14  14718cb3d1bc998c      CENTURY BALLROOM  41828   55.00   0.9058912      7    8    21  Out for drinks     True
    3       7/10/14  1471966915ab72a3            RHINO ROOM  41829   20.00  0.02373843      7    9     0  Out for drinks     True
    4       7/10/14  147125097a342e9f          KOHL'S #1053  41827   29.99   0.6478125      7    7    15     Arts&Crafts     True

Adding a day of the week:



In [6]:

    
def get_day_of_week(x):
    try:
        mydate = datetime.datetime.strptime(x, '%m/%d/%Y')
    except ValueError:
        # I was inconsistent with my date strings (some have two-digit years).
        # Why didn't I just use ISO format!
        mydate = datetime.datetime.strptime(x, '%m/%d/%y')
    return mydate.strftime('%A')

df['dayOfWeek'] = df['Date of pull'].apply(get_day_of_week)
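
The `Date` column in the table above looks like an Excel-style serial day count. That is an assumption, although 41829 does line up with Month 7, Day 9 under that reading. If it holds, the weekday of each transaction could also be derived from that column. A minimal sketch, using three serials copied from the sample rows and the usual 1899-12-30 origin for Excel's 1900 date system:

```python
import pandas as pd

# Serial day counts copied from the sample rows above
serials = pd.Series([41829, 41828, 41827])

# Assumption: these are Excel serials, so day 0 corresponds to 1899-12-30
dates = pd.to_datetime(serials, unit='D', origin=pd.Timestamp('1899-12-30'))

# Full weekday names, in the same style as strftime('%A')
day_names = dates.dt.day_name()
```

Whether the pull date or the transaction date is the better input for the model is a separate question; this only shows how the serial could be decoded.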

Then I use the "distance from Saturday" as a numeric proxy for the categorical day of the week:



In [8]:

    
def dist_from_sat(x):
    # Number of days to the nearest Saturday, in either direction
    myvalues = {'Saturday': 0,
                'Sunday': 1,
                'Monday': 2,
                'Tuesday': 3,
                'Wednesday': 3,
                'Thursday': 2,
                'Friday': 1}
    return myvalues[x]
df['distFromSat'] = df['dayOfWeek'].apply(dist_from_sat)
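
The lookup table encodes a circular distance: how many days to the nearest Saturday, going either forward or backward through the week. As a sketch, the same values can be computed from `date.weekday()` with modular arithmetic. `circular_dist_from_sat` is a hypothetical helper, not part of the notebook.

```python
import datetime

SATURDAY = 5  # date.weekday() numbers Monday=0 ... Sunday=6


def circular_dist_from_sat(date):
    """Days to the nearest Saturday, counting forward or backward."""
    wd = date.weekday()
    # Take the shorter way around the 7-day cycle
    return min((wd - SATURDAY) % 7, (SATURDAY - wd) % 7)


# One full week, Monday 2014-07-07 through Sunday 2014-07-13
monday = datetime.date(2014, 7, 7)
week = [circular_dist_from_sat(monday + datetime.timedelta(days=i))
        for i in range(7)]
```

For that week the result agrees with the dictionary: Monday 2, Tuesday 3, Wednesday 3, Thursday 2, Friday 1, Saturday 0, Sunday 1.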



In [15]:

    
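# .ix has been removed from pandas; .loc does the same label-based lookup here.
# random.sample needs a real sequence, so convert the index to a list first.
df[['distFromSat','dayOfWeek']].loc[random.sample(list(df.index), 10)]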









    Out[15]:

         distFromSat  dayOfWeek
    842            1     Sunday
    829            2   Thursday
    805            3  Wednesday
    952            1     Friday
    237            1     Friday
    271            2     Monday
      6            2   Thursday
    835            0   Saturday
    847            3    Tuesday
    633            2   Thursday

Transform for my model:



In [16]:

    
X = df.loc[:,['distFromSat','Hour','Amount']].values
y = df.loc[:,'Category'].values

Now I can run my model:



In [17]:

    
clf = LDA()
clf.fit(X, y)









    Out[17]:





LDA(n_components=None, priors=None)

Now see if my model has any validity:



In [19]:

    
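df['predictions'] = clf.predict(X)
# Fraction of rows where the prediction matches the label (on the training data)
accuracy_of_model = (df['predictions'] == df['Category']).mean()
# Baseline: picking one of the observed categories uniformly at random
accuracy_of_random_guess = 1./len(np.unique(y))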



In [20]:

    
accuracy_of_model









    Out[20]:





0.28330206378986866



In [21]:

    
accuracy_of_random_guess









    Out[21]:





0.07142857142857142

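Accurate a bit more than a quarter of the time (about 28%), against about 7% for a uniform random guess. That's not bad considering the number of categories, though it is measured on the same data the model was trained on. I can't wait to train it on current data to see how well it can predict what I am going to do going forward.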
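
A held-out split is one way to check how the model does on rows it has not seen, and a most-frequent-class baseline is a stricter comparison than a uniform random guess. The sketch below uses synthetic stand-in data with random labels, not the real transactions, so its scores mean nothing in themselves. The variable names and the three categories are chosen only for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (distFromSat, Hour, Amount) features
rng = np.random.RandomState(0)
n = 600
X_demo = np.column_stack([
    rng.randint(0, 4, n),    # distFromSat: 0..3
    rng.randint(0, 24, n),   # Hour: 0..23
    rng.uniform(1, 100, n),  # Amount
])
y_demo = rng.choice(['Coffee', 'Grocery', 'Lunch'], size=n)  # random labels

# Hold out a quarter of the rows; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)

# Accuracy on rows neither model saw during fitting
held_out_accuracy = lda.score(X_test, y_test)
baseline_accuracy = baseline.score(X_test, y_test)
```

On the real data, `X` and `y` from the notebook would take the place of `X_demo` and `y_demo`. Stratifying may fail if any category has only a single row.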
