Graph Construction and Feature Engineering

Load libraries


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import networkx as nx
import pygraphviz as pgv
import pydot as pyd
from networkx.drawing.nx_agraph import graphviz_layout
from networkx.drawing.nx_agraph import write_dot

Load data


In [2]:
%%time
edges = pd.read_csv('../data/edges.csv').drop('Unnamed: 0', axis=1)
nodes = pd.read_csv('../data/nodes.csv').drop('Unnamed: 0', axis=1)
rogues = pd.read_csv('../data/rogues.csv')


CPU times: user 31.1 s, sys: 6.51 s, total: 37.6 s
Wall time: 37.6 s

Graph generation & analysis

Build proper edge array


In [3]:
%%time
#Simple way (no parallel computing)
edge_array = []
for i in range(0, 1000000):
    edge_array.append((edges['from'][i], edges['to'][i],
                       {'value': edges['value'][i],
                        'time': edges['timestamp'][i],
                        'hash': edges['hash'][i]}))


CPU times: user 2min 20s, sys: 872 ms, total: 2min 21s
Wall time: 2min 21s
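
Indexing the DataFrame column-by-column inside the loop is what makes this cell slow. A sketch of a faster equivalent that builds the same tuples in a single pass with zip (same column names as above; the intermediate name sub is just for illustration):

#Faster equivalent of the loop above: one pass over the columns with zip
sub = edges.head(1000000)
edge_array = [(f, t, {'value': v, 'time': ts, 'hash': h})
              for f, t, v, ts, h in zip(sub['from'], sub['to'], sub['value'],
                                        sub['timestamp'], sub['hash'])]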

Generate a MultiDiGraph with networkx and the edge array


In [4]:
%%time
TG=nx.MultiDiGraph()
TG.add_weighted_edges_from(edge_array)


CPU times: user 5.41 s, sys: 224 ms, total: 5.63 s
Wall time: 5.62 s
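
One detail worth noting: add_weighted_edges_from treats the third element of each tuple as the edge weight, so the attribute dict built above ends up nested under the 'weight' key rather than stored as top-level attributes (add_edges_from would store 'value', 'time' and 'hash' directly). The feature functions below rely on this nesting. A quick check, assuming the graph built above:

#Inspect one edge: the whole attribute dict sits under the 'weight' key
u, v, data = next(TG.edges_iter(data=True))
print data['weight']['value'], data['weight']['time'], data['weight']['hash']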

In [5]:
%%time
# Network Characteristics
print 'Number of nodes:', TG.number_of_nodes() 
print 'Number of edges:', TG.number_of_edges() 
print 'Number of connected components:', nx.number_connected_components(TG.to_undirected())

# Degree
degree_sequence = TG.degree().values()
degree_out_sequence = TG.out_degree().values()
degree_in_sequence = TG.in_degree().values()

print "Min degree ", np.min(degree_sequence)
print "Max degree ", np.max(degree_sequence)
print "Median degree ", np.median(degree_sequence)
print "Mean degree ", np.mean(degree_sequence)

print "Min degree IN", np.min(degree_in_sequence)
print "Max degree IN", np.max(degree_in_sequence)
print "Median degree IN", np.median(degree_in_sequence)
print "Mean degree IN", np.mean(degree_in_sequence)

print "Min degree OUT", np.min(degree_out_sequence)
print "Max degree OUT", np.max(degree_out_sequence)
print "Median degree OUT", np.median(degree_out_sequence)
print "Mean degree OUT", np.mean(degree_out_sequence)


Number of nodes: 29355
Number of edges: 1000000
Number of connected components: 96
Min degree  1
Max degree  232035
Median degree  3.0
Mean degree  68.131493783
Min degree IN 0
Max degree IN 222682
Median degree IN 1.0
Mean degree IN 34.0657468915
Min degree OUT 0
Max degree OUT 137271
Median degree OUT 1.0
Mean degree OUT 34.0657468915
CPU times: user 21.8 s, sys: 956 ms, total: 22.8 s
Wall time: 22.8 s

In [6]:
%%time
# Degree distribution
y=nx.degree_histogram(TG)
plt.figure(1)
plt.loglog(y,'b-',marker='o')
plt.ylabel("Frequency")
plt.xlabel("Degree")
plt.draw()
plt.show()


CPU times: user 1.48 s, sys: 828 ms, total: 2.31 s
Wall time: 1.43 s

Feature Engineering


In [7]:
#New dataframe for feature engineering
df = pd.DataFrame()
df['nodes']=TG.nodes()

Features description

  • Roman (I-XX) : node-intrinsic characteristics feature set
  • Arabic (1-2) : neighbour behaviour feature set
  • Greek (α-ε) : external data features

  #      Description                                                Variable
  I      Degree (total)                                             total_degree
  II     Degree in                                                  degree_in
  III    Degree out                                                 degree_out
  IV     Number of unique predecessors                              unique_predecessors
  V      Number of unique successors                                unique_successors
  VI     Mean ether amount in incoming transactions                 mean_value_in
  VII    Mean ether amount in outgoing transactions                 mean_value_out
  VIII   Std of ether amount in incoming transactions               std_value_in
  IX     Std of ether amount in outgoing transactions               std_value_out
  X      Ratio of incoming transactions to unique timestamps        ratio_in_timestamp
  XI     Ratio of outgoing transactions to unique timestamps        ratio_out_timestamp
  XII    Frequency of incoming transactions                         frequency_in
  XIII   Frequency of outgoing transactions                         frequency_out
  XIV    Ether balance of the node                                  balance
  XV     Average velocity in                                        mean_velocity_in
  XVI    Average velocity out                                       mean_velocity_out
  XVII   Std of velocity in                                         std_velocity_in
  XVIII  Std of velocity out                                        std_velocity_out
  XIX    Average acceleration in                                    mean_acceleration_in
  XX     Average acceleration out                                   mean_acceleration_out
  α      Min path length to a rogue node                            min_path_to_rogue
  β      Min path length from a rogue node                          min_path_from_rogue
  δ      Amount of ether on the min path to a rogue node            amount_to_rogue
  ε      Amount of ether on the min path from a rogue node          amount_from_rogue
  1      Average neighbours velocity out                            -
  2      Average neighbours acceleration out                        -

  • Other possible features:
    • average/std path length to/from a rogue node
    • average/std amount to/from a rogue node
    • neighbours description


Add total degree [I]


In [8]:
df['total_degree']=df['nodes'].map(lambda x: TG.degree(x))


Add degree in and degree out [II] [III]


In [9]:
df['degree_in']=df['nodes'].map(lambda x: TG.in_degree(x))
df['degree_out']=df['nodes'].map(lambda x: TG.out_degree(x))
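
Since TG.degree(), TG.in_degree() and TG.out_degree() return the full node-to-degree dicts in networkx 1.x, the three degree features can also be filled by mapping those dicts directly, which avoids one graph lookup per row. A sketch producing the same columns as above:

#Map the full degree dicts instead of querying the graph once per node
df['total_degree'] = df['nodes'].map(TG.degree())
df['degree_in'] = df['nodes'].map(TG.in_degree())
df['degree_out'] = df['nodes'].map(TG.out_degree())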


Add unique predecessors and unique successors (must be ≤ degree_in and degree_out respectively) [IV][V]


In [10]:
df['unique_successors']=df['nodes'].map(lambda x: len((TG.successors(x))))
df['unique_predecessors']=df['nodes'].map(lambda x: len((TG.predecessors(x))))


Add mean ether value going in the node [VI]

Write a function


In [11]:
def get_mean_value_in(node):
    '''
    Return the mean value of all the in transactions of a given node
    '''
    #Get the in edges list
    edges = TG.in_edges_iter(node, keys=False, data=True)
    #Build a list of all the values of the in edges list
    values=[]
    for edge in edges:
        values.append(float(edge[2]['weight']['value']))
    #Compute the mean of this list
    mean = np.average(values)
    
    return mean

In [12]:
%%time
#Add the feature
df['mean_value_in']=df['nodes'].map(lambda x: get_mean_value_in(x))


/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
CPU times: user 2.09 s, sys: 20 ms, total: 2.11 s
Wall time: 2.09 s


Add mean ether value going out the node [VII]


In [13]:
#Write a function
def get_mean_value_out(node):
    '''
    Return the mean value of all the out transactions of a given node
    '''
    #Get the out edges list
    edges = TG.out_edges_iter(node, keys=False, data=True)
    #Build a list of all the values of the out edges list
    values=[]
    for edge in edges:
        values.append(float(edge[2]['weight']['value']))
    #Compute the mean of this list
    mean = np.average(values)
    return mean

In [14]:
%%time
#Add the feature
df['mean_value_out']=df['nodes'].map(lambda x: get_mean_value_out(x))


CPU times: user 2.06 s, sys: 8 ms, total: 2.06 s
Wall time: 2.05 s


Add std ether value going in the node [VIII]


In [15]:
#Write a function
def get_std_value_in(node):
    '''
    Return the std value of all the in transactions of a given node
    '''
    #Get the in edges list
    edges = TG.in_edges_iter(node, keys=False, data=True)
    #Build a list of all the values of the in edges list
    values=[]
    for edge in edges:
        values.append(float(edge[2]['weight']['value']))
    #Compute the std of this list
    std = np.std(values)
    
    return std

In [16]:
%%time
#Add the feature
df['std_value_in']=df['nodes'].map(lambda x: get_std_value_in(x))


/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py:82: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
CPU times: user 2.98 s, sys: 44 ms, total: 3.02 s
Wall time: 2.96 s


Add std ether value going out the node [IX]


In [17]:
#Write a function
def get_std_value_out(node):
    '''
    Return the std value of all the out transactions of a given node
    '''
    #Get the out edges list
    edges = TG.out_edges_iter(node, keys=False, data=True)
    #Build a list of all the values of the out edges list
    values=[]
    for edge in edges:
        values.append(float(edge[2]['weight']['value']))
    #Compute the std of this list
    std = np.std(values)
    
    return std

In [18]:
%%time
#Add the feature
df['std_value_out']=df['nodes'].map(lambda x: get_std_value_out(x))


CPU times: user 2.87 s, sys: 24 ms, total: 2.89 s
Wall time: 2.86 s
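
Features VI-IX can also be obtained in one pass by aggregating the edges DataFrame and mapping the result onto df. A sketch that should match the values above, assuming edges holds exactly the rows used to build the graph (ddof=0 matches np.std, and values are cast to float as in the functions):

#Per-node mean/std of transaction values via groupby instead of per-node traversals
values = edges['value'].astype(float)
in_group = values.groupby(edges['to'])
out_group = values.groupby(edges['from'])
df['mean_value_in'] = df['nodes'].map(in_group.mean())
df['std_value_in'] = df['nodes'].map(in_group.std(ddof=0))
df['mean_value_out'] = df['nodes'].map(out_group.mean())
df['std_value_out'] = df['nodes'].map(out_group.std(ddof=0))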


Add the ratio of the number of incoming transactions to the number of unique timestamps for those transactions [X]


In [19]:
#Write a function
def get_ratio_in_timestamp(node):
    '''
    Return the ratio of the number of incoming transactions to the number of unique timestamps for these transactions
    '''
    #Get the list of incoming transactions
    edges = TG.in_edges(node,keys=False, data=True)
    #Build the list of timestamps
    timestamps=[]
    for edge in edges:
        timestamps.append(edge[2]['weight']['time'])
    #Compute the ratio
    unique_time = float(len(np.unique(timestamps)))
    transactions = float(len(edges))
    
    if unique_time !=0:
        ratio = transactions / unique_time
    else:
        ratio = np.nan
    
    return ratio

In [20]:
%%time
#Add the feature
df['ratio_in_timestamp']=df['nodes'].map(lambda x: get_ratio_in_timestamp(x))


CPU times: user 3 s, sys: 24 ms, total: 3.02 s
Wall time: 3.02 s


Add the ratio of the number of outgoing transactions to the number of unique timestamps for those transactions [XI]


In [21]:
#Write a function
def get_ratio_out_timestamp(node):
    '''
    Return the ratio of the number of outgoing transactions to the number of unique timestamps for these transactions
    '''
    #Get the list of outgoing transactions
    edges = TG.out_edges(node,keys=False, data=True)
    #Build the list of timestamps
    timestamps=[]
    for edge in edges:
        timestamps.append(edge[2]['weight']['time'])
    #Compute the ratio
    unique_time = float(len(np.unique(timestamps)))
    transactions = float(len(edges))
    
    if unique_time !=0:
        ratio = transactions / unique_time
    else:
        ratio = np.nan
    
    return ratio

In [22]:
%%time
#Add the feature
df['ratio_out_timestamp']=df['nodes'].map(lambda x: get_ratio_out_timestamp(x))


CPU times: user 2.82 s, sys: 16 ms, total: 2.84 s
Wall time: 2.84 s


Add the incoming transaction frequency for the user (#in transactions / (max date - min date)) [XII]


In [23]:
#write function
def get_in_frequency(node):
    '''
    Return the incoming transaction frequency for the user (#in transactions / max date - min date)
    '''
    #Get the list of incoming transactions
    edges = TG.in_edges(node,keys=False, data=True)
    #Build the list of timestamps
    timestamps=[]
    for edge in edges:
        timestamps.append(edge[2]['weight']['time'])
    #Build the delta in seconds
    date = pd.to_datetime(pd.Series(timestamps))
    dt = date.max()-date.min()
    #deltaseconds = dt.item().total_seconds()
    
    if dt.total_seconds()!=0:
        ratio = len(edges)/dt.total_seconds()
    else:
        ratio = np.nan
    
    return ratio

In [24]:
%%time
#Add the feature
df['frequency_in']=df['nodes'].map(lambda x: get_in_frequency(x))


CPU times: user 14.9 s, sys: 20 ms, total: 14.9 s
Wall time: 14.9 s


Add the outgoing transaction frequency for the user (#out transactions / (max date - min date)) [XIII]


In [25]:
#write function
def get_out_frequency(node):
    '''
    Return the outgoing transaction frequency for the user (#out transactions / (max date - min date))
    '''
    #Get the list of outgoing transactions
    edges = TG.out_edges(node,keys=False, data=True)
    #Build the list of timestamps
    timestamps=[]
    for edge in edges:
        timestamps.append(edge[2]['weight']['time'])
    #Build the delta in seconds
    date = pd.to_datetime(pd.Series(timestamps))
    dt = date.max()-date.min()
    #deltaseconds = dt.item().total_seconds()
    
    if dt.total_seconds()!=0:
        ratio = len(edges)/dt.total_seconds()
    else:
        ratio = np.nan
    
    return ratio

In [26]:
%%time
#Add the feature
df['frequency_out']=df['nodes'].map(lambda x: get_out_frequency(x))


CPU times: user 14.3 s, sys: 8 ms, total: 14.3 s
Wall time: 14.3 s


Add the ether balance [XIV]


In [27]:
#write function
def get_balance(node):
    '''
    Return the balance (in wei) of a given node
    '''
    #Get edges in and edges out
    edges_in = TG.in_edges(node,keys=False, data=True)
    edges_out = TG.out_edges(node,keys=False, data=True)
    #Build value in array and value out array
    values_in=[]
    for edge in edges_in:
        values_in.append(float(edge[2]['weight']['value']))
    values_out=[]
    for edge in edges_out:
        values_out.append(float(edge[2]['weight']['value']))
    #Compute balance
    balance = np.sum(values_in)-np.sum(values_out)
    
    return balance

In [28]:
%%time
#Add the feature
df['balance']=df['nodes'].map(lambda x: get_balance(x))


CPU times: user 5.59 s, sys: 8 ms, total: 5.6 s
Wall time: 5.62 s


Average Velocity In [XV]


In [29]:
#write function
def get_mean_velocity_in(node):
    """
    Return the average ether velocity incoming into the node in wei/s
    """
    #Get edges in collection
    edges_in = TG.in_edges(node,keys=False, data=True)
    values_in=[]
    timestamps=[]
    #Collect values and timestamps
    for edge in edges_in:
        values_in.append(float(edge[2]['weight']['value']))
        timestamps.append(edge[2]['weight']['time'])
    #Create Velocity list
    velocities = []
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array 
    for i in range(1,(len(edges_in)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_in[i+1]-values_in[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
    
    #Return the velocities average
    return np.average(np.absolute(velocities))

In [30]:
%%time
#Add the feature
df['mean_velocity_in']=df['nodes'].map(lambda x: get_mean_velocity_in(x))


CPU times: user 2min 19s, sys: 448 ms, total: 2min 20s
Wall time: 2min 19s
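
A caveat for the velocity and acceleration features: in_edges/out_edges return edges in insertion order, which is not guaranteed to be chronological, so consecutive differences can mix time directions. A sketch of a helper that sorts values and dates by timestamp before differencing (get_sorted_values_and_dates is hypothetical and not used elsewhere in this notebook):

def get_sorted_values_and_dates(edges_list):
    '''
    Return the transaction values and dates of an edge list, sorted chronologically
    '''
    values = [float(edge[2]['weight']['value']) for edge in edges_list]
    dates = pd.to_datetime(pd.Series([edge[2]['weight']['time'] for edge in edges_list]))
    order = dates.argsort()
    sorted_dates = dates.iloc[order].reset_index(drop=True)
    sorted_values = [values[i] for i in order]
    return sorted_values, sorted_dates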


Average Velocity Out [XVI]


In [31]:
#write function
def get_mean_velocity_out(node):
    """
    Return the average ether velocity outgoing from the node in wei/s
    """
    #Get edges out collection
    edges_out = TG.out_edges(node,keys=False, data=True)
    values_out=[]
    timestamps=[]
    #Collect values and timestamps
    for edge in edges_out:
        values_out.append(float(edge[2]['weight']['value']))
        timestamps.append(edge[2]['weight']['time'])
    #Create Velocity list
    velocities = []
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array 
    for i in range(1,(len(edges_out)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_out[i+1]-values_out[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
    
    #Return the velocities average
    return np.average(np.absolute(velocities))

In [32]:
%%time
#Add the feature
df['mean_velocity_out']=df['nodes'].map(lambda x: get_mean_velocity_out(x))


CPU times: user 2min 20s, sys: 508 ms, total: 2min 21s
Wall time: 2min 20s


Std Velocity In [XVII]


In [38]:
#write function
def get_std_velocity_in(node):
    """
    Return the std of the ether velocity incoming into the node in wei/s
    """
    #Get edges in collection
    edges_in = TG.in_edges(node,keys=False, data=True)
    values_in=[]
    timestamps=[]
    #Collect values and timestamps
    for edge in edges_in:
        values_in.append(float(edge[2]['weight']['value']))
        timestamps.append(edge[2]['weight']['time'])
    #Create Velocity list
    velocities = []
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array 
    for i in range(1,(len(edges_in)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_in[i+1]-values_in[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
    
    #Return the velocities average
    return np.std(np.absolute(velocities))

In [39]:
%%time
#Add the feature
df['std_velocity_in']=df['nodes'].map(lambda x: get_std_velocity_in(x))


CPU times: user 2min 21s, sys: 472 ms, total: 2min 22s
Wall time: 2min 21s


Std Velocity Out [XVIII]


In [40]:
#write function
def get_std_velocity_out(node):
    """
    Return the std of the ether velocity outgoing from the node in wei/s
    """
    #Get edges out collection
    edges_out = TG.out_edges(node,keys=False, data=True)
    values_out=[]
    timestamps=[]
    #Collect values and timestamps
    for edge in edges_out:
        values_out.append(float(edge[2]['weight']['value']))
        timestamps.append(edge[2]['weight']['time'])
    #Create Velocity list
    velocities = []
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array 
    for i in range(1,(len(edges_out)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_out[i+1]-values_out[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
    
    #Return the velocities average
    return np.std(np.absolute(velocities))

In [41]:
%%time
#Add the feature
df['std_velocity_out']=df['nodes'].map(lambda x: get_std_velocity_out(x))


CPU times: user 2min 24s, sys: 564 ms, total: 2min 25s
Wall time: 2min 24s


Average Acceleration In [XIX]


In [52]:
#write function
def get_mean_acceleration_in(node):
    """
    Return the average ether acceleration incoming into the node in wei/s^2
    """
    #Get edges in collection
    edges_in = TG.in_edges(node,keys=False, data=True)
    values_in=[]
    timestamps=[]
    #Collect values and timestamps
    for edge in edges_in:
        values_in.append(float(edge[2]['weight']['value']))
        timestamps.append(edge[2]['weight']['time'])
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array, keeping the date attached to each velocity
    #so that the acceleration differences below use the matching time deltas
    velocities = []
    velocity_dates = []
    for i in range(1,(len(edges_in)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_in[i+1]-values_in[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
            velocity_dates.append(dates[i])
    #Build the acceleration array from consecutive velocities
    accelerations=[]
    for i in range(1,(len(velocities)-1)):
        if velocity_dates[i+1]!=velocity_dates[i-1]:
            acceleration = np.absolute(velocities[i+1]-velocities[i-1])/(velocity_dates[i+1]-velocity_dates[i-1]).total_seconds()
            accelerations.append(acceleration)
    #Return the accelerations average
    return np.average(np.absolute(accelerations))

In [43]:
%%time
#Add the feature
df['mean_acceleration_in']=df['nodes'].map(lambda x: get_mean_acceleration_in(x))


CPU times: user 4min 23s, sys: 1.05 s, total: 4min 24s
Wall time: 4min 23s


Average Acceleration Out [XX]


In [44]:
#write function
def get_mean_acceleration_out(node):
    """
    Return the average ether acceleration outgoing from the node in wei/s^2
    """
    #Get edges out collection
    edges_out = TG.out_edges(node,keys=False, data=True)
    values_out=[]
    timestamps=[]
    #Collect values and timestamps
    for edge in edges_out:
        values_out.append(float(edge[2]['weight']['value']))
        timestamps.append(edge[2]['weight']['time'])
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array, keeping the date attached to each velocity
    #so that the acceleration differences below use the matching time deltas
    velocities = []
    velocity_dates = []
    for i in range(1,(len(edges_out)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_out[i+1]-values_out[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
            velocity_dates.append(dates[i])
    #Build the acceleration array from consecutive velocities
    accelerations=[]
    for i in range(1,(len(velocities)-1)):
        if velocity_dates[i+1]!=velocity_dates[i-1]:
            acceleration = np.absolute(velocities[i+1]-velocities[i-1])/(velocity_dates[i+1]-velocity_dates[i-1]).total_seconds()
            accelerations.append(acceleration)
    #Return the accelerations average
    return np.average(np.absolute(accelerations))

In [45]:
%%time
#Add the feature
df['mean_acceleration_out']=df['nodes'].map(lambda x: get_mean_acceleration_out(x))


CPU times: user 4min 23s, sys: 1.05 s, total: 4min 25s
Wall time: 4min 23s
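
The neighbour-behaviour features (rows 1 and 2 of the table above) are not computed in this notebook. A minimal sketch of feature 1, assuming the mean_velocity_out column built earlier; the names get_mean_neighbour_velocity_out and mean_neighbour_velocity_out are hypothetical, and feature 2 would mirror this with mean_acceleration_out:

#Feature 1: average of the neighbours' mean outgoing velocity
velocity_out = df.set_index('nodes')['mean_velocity_out']

def get_mean_neighbour_velocity_out(node):
    #Neighbours = predecessors and successors of the node
    neighbours = set(TG.predecessors(node)) | set(TG.successors(node))
    if len(neighbours) == 0:
        return np.nan
    return np.nanmean(velocity_out.loc[list(neighbours)].values)

df['mean_neighbour_velocity_out'] = df['nodes'].map(lambda x: get_mean_neighbour_velocity_out(x))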


Getting Rogue nodes


In [46]:
rogues = pd.read_csv("../data/rogues.csv")
rogues_id = np.array(rogues['id'])
#Small test set of addresses, used below in place of the full rogues_id list
fake_rogues = ['0x223294182093bfc6b11e8ef5722d496f066036c2','0xec1ebac9da3430213281c80fa6d46378341a96ae','0xe6447ae67346b5fb7ebd65ebfc4c7e6521b21f8a']


Min path to a rogue node [α]


In [47]:
#write function
def min_path_to_rogue(node,rogues):
    paths_lengths=[]
    for rogue in rogues:
        if nx.has_path(TG,node,rogue):
            paths_lengths.append(nx.shortest_path_length(TG,node,rogue))
    if len(paths_lengths)!=0:
        return np.min(paths_lengths)
    else:
        return np.nan

In [48]:
%%time
#Add the feature
df['min_path_to_rogue']=df['nodes'].map(lambda x: min_path_to_rogue(x,fake_rogues))


CPU times: user 33 s, sys: 0 ns, total: 33 s
Wall time: 33 s


Min path from a rogue node [β]


In [51]:
#write function
def min_path_from_rogue(node,rogues):
    paths_lengths=[]
    for rogue in rogues:
        if nx.has_path(TG,rogue,node):
            paths_lengths.append(nx.shortest_path_length(TG,rogue,node))
    if len(paths_lengths)!=0:
        return np.min(paths_lengths)
    else:
        return np.nan

In [50]:
%%time
#Add the feature
df['min_path_from_rogue']=df['nodes'].map(lambda x: min_path_from_rogue(x,fake_rogues))


CPU times: user 10 s, sys: 4 ms, total: 10 s
Wall time: 10 s
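
Both rogue-distance features can be computed much faster with one breadth-first search per rogue instead of one search per (node, rogue) pair: distances from a rogue come from a BFS on TG, and distances to a rogue from a BFS on the reversed graph. A sketch, assuming networkx 1.x as above:

#One BFS per rogue instead of one search per node
TG_rev = TG.reverse(copy=True)
dist_from_rogue = {}   #node -> min distance from any rogue
dist_to_rogue = {}     #node -> min distance to any rogue
for rogue in fake_rogues:
    if rogue not in TG:
        continue
    for node, d in nx.single_source_shortest_path_length(TG, rogue).items():
        if node not in dist_from_rogue or d < dist_from_rogue[node]:
            dist_from_rogue[node] = d
    for node, d in nx.single_source_shortest_path_length(TG_rev, rogue).items():
        if node not in dist_to_rogue or d < dist_to_rogue[node]:
            dist_to_rogue[node] = d
df['min_path_to_rogue'] = df['nodes'].map(lambda x: dist_to_rogue.get(x, np.nan))
df['min_path_from_rogue'] = df['nodes'].map(lambda x: dist_from_rogue.get(x, np.nan))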


Amount of ether flowing from the node to the closest rogue [δ]


Amount of ether flowing to the node from the closest rogue [ε]
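
These two features (δ and ε in the table) are left unimplemented in this notebook. A minimal sketch for δ, summing the transferred values along one shortest path from the node to its closest rogue; get_amount_to_rogue and amount_to_rogue are hypothetical names, keeping only the largest parallel edge between consecutive nodes is a design choice, and ε would mirror this with the edge direction reversed:

def get_amount_to_rogue(node, rogues):
    '''
    Sum of ether values along one shortest path from the node to its closest rogue
    '''
    best_path = None
    for rogue in rogues:
        if nx.has_path(TG, node, rogue):
            path = nx.shortest_path(TG, node, rogue)
            if best_path is None or len(path) < len(best_path):
                best_path = path
    if best_path is None or len(best_path) < 2:
        return np.nan
    amount = 0.0
    for u, v in zip(best_path[:-1], best_path[1:]):
        #Several parallel edges can link u and v; keep the largest value
        amount += max(float(d['weight']['value']) for d in TG[u][v].values())
    return amount

df['amount_to_rogue'] = df['nodes'].map(lambda x: get_amount_to_rogue(x, fake_rogues))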

Write the features dataframe to csv


In [53]:
df.to_csv('../data/features.csv')
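
Writing without index=False is what produces the 'Unnamed: 0' column that had to be dropped when loading edges.csv and nodes.csv above; dropping the index here avoids the same cleanup when the features are reloaded:

df.to_csv('../data/features.csv', index=False)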

In [ ]: