In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (15, 5)
plt.rcParams['font.family'] = 'sans-serif'
# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
In [2]:
# No header row in the file, so columns are referenced by integer position
df = pd.read_csv('sample2v.csv', header=None)
In [3]:
df
Out[3]:
In [5]:
df[8].value_counts()
Out[5]:
In [6]:
# Fraction of Fail events in the sample
114./10400
Out[6]:
In this dataset, Fail events make up roughly 1% of the rows. If this is a representative sample of the real data, then running machine learning on the whole set makes little sense: any classifier that simply predicts "Success" for every line attains 99% accuracy.
This means that I need to collect the "Fail" cases, randomly sample a roughly equal number of "Success" cases, and then run a classifier on the balanced set to see the algorithm's true accuracy.
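A minimal sketch of that balancing step, assuming column 8 holds the outcome (the exact 'Success'/'Fail' label strings are an assumption based on the counts above):
In [ ]:
# Downsample the majority class to match the number of Fail rows
fails = df[df[8] == 'Fail']
successes = df[df[8] == 'Success']
idx = np.random.choice(successes.index, size=len(fails), replace=False)
balanced = pd.concat([fails, successes.loc[idx]])
balanced[8].value_counts()  # should now be roughly 50/50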
I want to examine this dataset to see if there are any obvious correlations and to understand what the columns contain.
In [9]:
df[5].unique()
Out[9]:
In [10]:
df[6].unique()
Out[10]:
In [11]:
df[7].unique()
Out[11]:
In [14]:
df.groupby([5,8]).count()
Out[14]:
In [15]:
df.groupby([6,8]).count()
Out[15]:
In [13]:
df.groupby([7,8]).count()
Out[13]:
In [16]:
df.groupby([6,7]).count()
Out[16]:
This is a simple way to see whether any labels in columns 5-7 predict the outcome (answer: not really, as the counts for most events that could be interpreted this way are too low). I am also trying to see whether there are any interesting correlations between the labels.
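A contingency table shows the same label-vs-outcome counts more compactly than groupby().count(); a minimal sketch for column 5, assuming column 8 is the outcome:
In [ ]:
# Rows: labels from column 5; columns: counts of each outcome
pd.crosstab(df[5], df[8])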
In [18]:
# How many distinct source (3) and destination (4) computers are there?
print len(df[3].unique()), len(df[4].unique())
Potentially too many distinct values to be used in the analysis.
In [19]:
df["source_user"], df["source_domain"] = zip(*df[1].str.split('@').tolist())
In [20]:
df["source_user"]=df["source_user"].str.rstrip('$')
In [21]:
df["destination_user"], df["destination_domain"] = zip(*df[2].str.split('@').tolist())
df["destination_user"]=df["destination_user"].str.rstrip('$')
In [22]:
df['same_user'] = (df['destination_user'] == df['source_user'])
df['same_domain'] = (df['destination_domain'] == df['source_domain'])
In [23]:
df['same_user'].value_counts()
Out[23]:
In [24]:
df['same_domain'].value_counts()
Out[24]:
In [25]:
df['source_domain'].unique()
Out[25]:
In [26]:
df['destination_domain'].unique()
Out[26]:
In [27]:
df['source_user'].unique()
Out[27]:
In [28]:
df['destination_user'].unique()
Out[28]:
Potentially too many variables. I now want to explore what users appear in addition to the C-numbers and U-numbers. (C = computer and U = user?)
In [29]:
good = df[~df.source_user.str.startswith("U")]
good = good.source_user[~good.source_user.str.startswith('C')]
good.unique()
Out[29]:
In [30]:
good = df[~np.logical_or(df.destination_user.str.startswith("U"), df.destination_user.str.startswith("C"))]
good.destination_user.unique()
Out[30]:
Idea: one can expand this column into 6 categories: C-users, U-users, 'ANONYMOUS LOGON', 'LOCAL_SERVICE', 'SYSTEM', 'NETWORK SERVICE'.
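A minimal sketch of that expansion, using the cleaned destination_user column from above (the helper function and the destination_user_cat column name are hypothetical):
In [ ]:
# Map each user name onto one of the 6 proposed categories
def user_category(name):
    if name in ('ANONYMOUS LOGON', 'LOCAL_SERVICE', 'SYSTEM', 'NETWORK SERVICE'):
        return name
    if name.startswith('C'):
        return 'C-user'
    if name.startswith('U'):
        return 'U-user'
    return 'other'

df['destination_user_cat'] = df['destination_user'].map(user_category)
df['destination_user_cat'].value_counts()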
In [31]:
# Numeric range of the C-named destination domains, then any non-C names
dd = df['destination_domain'].str.startswith('C')
print min(df['destination_domain'][dd].str.slice(1).astype(int)), max(df['destination_domain'][dd].str.slice(1).astype(int))
dd = df[~df.destination_domain.str.startswith('C')]
print dd.destination_domain.unique()
In [33]:
# Same check for the source domains
sd = df['source_domain'].str.startswith('C')
print min(df['source_domain'][sd].str.slice(1).astype(int)), max(df['source_domain'][sd].str.slice(1).astype(int))
sd = df[~df.source_domain.str.startswith('C')]
print sd.source_domain.unique()
This dataset contains columns of categorical data (aside from time). To work with this data, each label should be converted into its own column with value 1 (True) if the label applies and 0 (False) otherwise. Some columns (5-7) contain ~10 labels, whereas other columns contain tens of thousands. I will ignore the second class of labels on the first pass; instead, I will consider when these labels coincide. This keeps the feature set from exploding. Besides, the second class of labels most likely comes from some ordering of computers and users in the lab: whether one authenticates to the same computer or to a different one should matter more for authentication success than the specific computer label.
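For the low-cardinality columns 5-7, a minimal sketch of this label-to-indicator conversion with pd.get_dummies (one 0/1 column per label, prefixed with the source column number):
In [ ]:
# One indicator column per label in columns 5-7
dummies = pd.concat([pd.get_dummies(df[c], prefix=str(c)) for c in [5, 6, 7]], axis=1)
dummies.head()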
In [34]:
# Flags for whether user/domain names coincide with the source (3) or destination (4) computer
df['source_user_comp_same'] = (df[3] == df['source_user'])
df['destination_user_comp_same'] = (df['destination_user'] == df[4])
df['same_comp'] = (df[3] == df[4])
df['source_domain_comp_same'] = (df[3] == df['source_domain'])
df['destination_domain_comp_same'] = (df['destination_domain'] == df[4])
In [35]:
df['source_user_comp_same'].value_counts()
Out[35]:
In [36]:
df['destination_user_comp_same'].value_counts()
Out[36]:
In [37]:
df['same_comp'].value_counts()
Out[37]:
In [38]:
df['source_domain_comp_same'].value_counts()
Out[38]:
In [39]:
df['destination_domain_comp_same'].value_counts()
Out[39]:
In [ ]: