Graph Construction and Feature Engineering

Load libraries

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import networkx as nx
import pygraphviz as pgv
import pydot as pyd
from networkx.drawing.nx_agraph import graphviz_layout
from networkx.drawing.nx_agraph import write_dot

Load data

In [2]:
edges = pd.read_csv('../data/edges.csv').drop('Unnamed: 0', axis=1)
nodes = pd.read_csv('../data/nodes.csv').drop('Unnamed: 0', axis=1)
rogues = pd.read_csv('../data/rogues.csv')

CPU times: user 31.1 s, sys: 6.51 s, total: 37.6 s
Wall time: 37.6 s

Graph generation & analysis

Build proper edge array

In [3]:
#Simple way (non parallel computing)
edge_array = []
for i in range(0,1000000):

CPU times: user 2min 20s, sys: 872 ms, total: 2min 21s
Wall time: 2min 21s

Generate a MultiDiGraph with networkx from the edge array

In [4]:

CPU times: user 5.41 s, sys: 224 ms, total: 5.63 s
Wall time: 5.62 s
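
The body of the graph-building cell did not survive in this export. The sketch below only illustrates the general idea of turning an edge array into a networkx MultiDiGraph. The layout of `edge_array` (source, target, attribute dict) and the attribute names `value` and `timestamp` are assumptions made for the example, not the notebook's actual columns.

```python
import networkx as nx

# Hypothetical edge array: (source, target, attributes) triples.
# The attribute names 'value' and 'timestamp' are placeholders.
edge_array = [
    ("a", "b", {"value": 5.0, "timestamp": "2016-01-01 00:00:00"}),
    ("a", "b", {"value": 2.0, "timestamp": "2016-01-01 00:05:00"}),
    ("b", "c", {"value": 1.0, "timestamp": "2016-01-02 12:00:00"}),
]

# A MultiDiGraph keeps edge direction and parallel edges, so two
# transactions between the same pair of addresses stay two edges.
TG = nx.MultiDiGraph()
for source, target, attributes in edge_array:
    TG.add_edge(source, target, **attributes)

print("nodes: %d" % TG.number_of_nodes())
print("edges: %d" % TG.number_of_edges())
```

A plain DiGraph would merge the two a-to-b transactions into a single edge, which would distort every per-transaction feature computed later.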

In [5]:
# Network Characteristics
print 'Number of nodes:', TG.number_of_nodes() 
print 'Number of edges:', TG.number_of_edges() 
print 'Number of connected components:', nx.number_connected_components(TG.to_undirected())

# Degree
degree_sequence = TG.degree().values()
degree_out_sequence = TG.out_degree().values()
degree_in_sequence = TG.in_degree().values()

print "Min degree ", np.min(degree_sequence)
print "Max degree ", np.max(degree_sequence)
print "Median degree ", np.median(degree_sequence)
print "Mean degree ", np.mean(degree_sequence)

print "Min degree IN", np.min(degree_in_sequence)
print "Max degree IN", np.max(degree_in_sequence)
print "Median degree IN", np.median(degree_in_sequence)
print "Mean degree IN", np.mean(degree_in_sequence)

print "Min degree OUT", np.min(degree_out_sequence)
print "Max degree OUT", np.max(degree_out_sequence)
print "Median degree OUT", np.median(degree_out_sequence)
print "Mean degree OUT", np.mean(degree_out_sequence)

Number of nodes: 29355
Number of edges: 1000000
Number of connected components: 96
Min degree  1
Max degree  232035
Median degree  3.0
Mean degree  68.131493783
Min degree IN 0
Max degree IN 222682
Median degree IN 1.0
Mean degree IN 34.0657468915
Min degree OUT 0
Max degree OUT 137271
Median degree OUT 1.0
Mean degree OUT 34.0657468915
CPU times: user 21.8 s, sys: 956 ms, total: 22.8 s
Wall time: 22.8 s

In [6]:
# Degree distribution

CPU times: user 1.48 s, sys: 828 ms, total: 2.31 s
Wall time: 1.43 s
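
The degree-distribution cell is empty in this export. As a rough sketch of what such a cell usually computes, the snippet below counts how many nodes share each degree. The `degree_sequence` list here is made up for the example and is not the notebook's data.

```python
from collections import Counter

# Made-up degree sequence standing in for the real one.
degree_sequence = [1, 1, 1, 2, 3, 3, 7]

# Map each degree to the number of nodes that have it.
distribution = Counter(degree_sequence)
degrees = sorted(distribution)
counts = [distribution[d] for d in degrees]

for d, c in zip(degrees, counts):
    print("degree %d: %d node(s)" % (d, c))
```

With a median degree of 3 and a maximum above 200000, as reported above, plotting `counts` against `degrees` on log-log axes would probably be more readable than linear axes.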

Feature engineering

In [7]:
#New dataframe for feature engineering
df = pd.DataFrame()

Features description

  • Roman (I-XX) : Node intrinsic characteristics feature set
  • Arabic (1-5) : Neighbours' behaviour feature set
  • Greek (α-ω) : External data features

# | Description | Variable
I | Degree | total_degree
II | Degree in | degree_in
III | Degree out | degree_out
IV | Number of unique predecessors | unique_predecessors
V | Number of unique successors | unique_successors
VI | Mean ether amount in incoming transactions | mean_value_in
VII | Mean ether amount in outgoing transactions | mean_value_out
VIII | Std ether amount in incoming transactions | std_value_in
IX | Std ether amount in outgoing transactions | std_value_out
X | Ratio of the number of incoming transactions to the number of unique timestamps | ratio_in_timestamp
XI | Ratio of the number of outgoing transactions to the number of unique timestamps | ratio_out_timestamp
XII | Frequency of incoming transactions | frequency_in
XIII | Frequency of outgoing transactions | frequency_out
XIV | Ether balance of the node | balance
XV | Average velocity in | mean_velocity_in
XVI | Average velocity out | mean_velocity_out
XVII | Std velocity in | std_velocity_in
XVIII | Std velocity out | std_velocity_out
XIX | Average acceleration in | mean_acceleration_in
XX | Average acceleration out | mean_acceleration_out
α | Min path to a rogue node | min_path_to_rogue
β | Min path from a rogue node | min_path_from_rogue
δ | Amount of ether on the min path to a rogue node | amount_to_rogue
ε | Amount of ether on the min path from a rogue node | amount_from_rogue
1 | Average neighbours velocity out | -
2 | Average neighbours acceleration out | -

  • Other possible features:
    • average/std path size to/from rogue
    • average amount/std amount to/from rogue
    • neighbours description

Add total degree [I]

In [8]:
df['total_degree']=df['nodes'].map(lambda x: TG.degree(x))

Add degree in and degree out [II] [III]

In [9]:
df['degree_in']=df['nodes'].map(lambda x: TG.in_degree(x))
df['degree_out']=df['nodes'].map(lambda x: TG.out_degree(x))

Add unique predecessors and unique successors (each must be ≤ degree_in and degree_out respectively) [IV][V]

In [10]:
df['unique_successors']=df['nodes'].map(lambda x: len((TG.successors(x))))
df['unique_predecessors']=df['nodes'].map(lambda x: len((TG.predecessors(x))))
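
In a multigraph, parallel edges count toward a node's degree but not toward its number of distinct neighbours, so the unique counts can never exceed the degrees. A tiny library-free illustration, using made-up (sender, receiver) pairs for the incoming transactions of a single node `n`:

```python
# Made-up incoming transactions of node 'n' as (sender, receiver) pairs.
incoming = [("a", "n"), ("a", "n"), ("b", "n")]

# In-degree counts every transaction, including repeats from one sender.
degree_in = len(incoming)

# Unique predecessors counts each sender once.
unique_predecessors = len(set(sender for sender, _ in incoming))

print("degree_in: %d" % degree_in)
print("unique_predecessors: %d" % unique_predecessors)
```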

Add mean ether value going into the node [VI]

Write a function

In [11]:
def get_mean_value_in(node):
    """Return the mean value of all the incoming transactions of a given node."""
    #Get the in edges list
    edges = TG.in_edges_iter(node, keys=False, data=True)
    #Build a list of all the values of the in edges list
    #(assumption: the ether amount is stored under the 'value' edge attribute)
    values = [edge[2]['value'] for edge in edges]
    #Compute the mean of this list
    mean = np.average(values)
    return mean

In [12]:
#Add the feature
df['mean_value_in']=df['nodes'].map(lambda x: get_mean_value_in(x))

/usr/local/lib/python2.7/dist-packages/numpy/core/ RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
CPU times: user 2.09 s, sys: 20 ms, total: 2.11 s
Wall time: 2.09 s
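
The "Mean of empty slice" warning above appears for nodes that have no incoming transaction, where the mean is taken over an empty list. One possible way to make that case explicit is a small guard like the one sketched below. `safe_mean` is a hypothetical helper, not part of the notebook or of NumPy.

```python
def safe_mean(values):
    """Mean of a list, or NaN when the list is empty (hypothetical helper)."""
    if len(values) == 0:
        # No transactions: the mean is undefined, so return NaN explicitly.
        return float("nan")
    return sum(values) / float(len(values))

print(safe_mean([1.0, 2.0, 3.0]))
print(safe_mean([]))
```

The resulting feature values are the same as before (NaN for such nodes); the guard only avoids the warning.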

Add mean ether value going out of the node [VII]

In [13]:
#Write a function
def get_mean_value_out(node):
    """Return the mean value of all the outgoing transactions of a given node."""
    #Get the out edges list
    edges = TG.out_edges_iter(node, keys=False, data=True)
    #Build a list of all the values of the out edges list
    #(assumption: the ether amount is stored under the 'value' edge attribute)
    values = [edge[2]['value'] for edge in edges]
    #Compute the mean of this list
    mean = np.average(values)
    return mean

In [14]:
#Add the feature
df['mean_value_out']=df['nodes'].map(lambda x: get_mean_value_out(x))

CPU times: user 2.06 s, sys: 8 ms, total: 2.06 s
Wall time: 2.05 s

Add std ether value going into the node [VIII]

In [15]:
#Write a function
def get_std_value_in(node):
    """Return the standard deviation of the values of all the incoming transactions of a given node."""
    #Get the in edges list
    edges = TG.in_edges_iter(node, keys=False, data=True)
    #Build a list of all the values of the in edges list
    #(assumption: the ether amount is stored under the 'value' edge attribute)
    values = [edge[2]['value'] for edge in edges]
    #Compute the std of this list
    std = np.std(values)
    return std

In [16]:
#Add the feature
df['std_value_in']=df['nodes'].map(lambda x: get_std_value_in(x))

/usr/local/lib/python2.7/dist-packages/numpy/core/ RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
CPU times: user 2.98 s, sys: 44 ms, total: 3.02 s
Wall time: 2.96 s

Add std ether value going out of the node [IX]

In [17]:
#Write a function
def get_std_value_out(node):
    """Return the standard deviation of the values of all the outgoing transactions of a given node."""
    #Get the out edges list
    edges = TG.out_edges_iter(node, keys=False, data=True)
    #Build a list of all the values of the out edges list
    #(assumption: the ether amount is stored under the 'value' edge attribute)
    values = [edge[2]['value'] for edge in edges]
    #Compute the std of this list
    std = np.std(values)
    return std

In [18]:
#Add the feature
df['std_value_out']=df['nodes'].map(lambda x: get_std_value_out(x))

CPU times: user 2.87 s, sys: 24 ms, total: 2.89 s
Wall time: 2.86 s

Add the ratio of the number of incoming transactions to the number of unique timestamps for those transactions [X]

In [19]:
#Write a function
def get_ratio_in_timestamp(node):
    """Return the ratio of the number of incoming transactions to the number of unique timestamps of these transactions."""
    #Get the list of incoming transactions
    edges = TG.in_edges(node,keys=False, data=True)
    #Build the list of timestamps
    #(assumption: the date is stored under the 'timestamp' edge attribute)
    timestamps = [edge[2]['timestamp'] for edge in edges]
    #Compute the ratio
    unique_time = float(len(np.unique(timestamps)))
    transactions = float(len(edges))
    if unique_time !=0:
        ratio = transactions / unique_time
    else:
        ratio = np.nan
    return ratio

In [20]:
#Add the feature
df['ratio_in_timestamp']=df['nodes'].map(lambda x: get_ratio_in_timestamp(x))

CPU times: user 3 s, sys: 24 ms, total: 3.02 s
Wall time: 3.02 s
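
As a worked example of this ratio, consider four transactions of which two share the same timestamp. The timestamps below are made up; a ratio above 1 means several transactions landed at the same instant.

```python
# Made-up timestamps of four incoming transactions; two are simultaneous.
timestamps = [
    "2016-01-01 00:00:00",
    "2016-01-01 00:00:00",
    "2016-01-01 00:05:00",
    "2016-01-02 12:00:00",
]

transactions = float(len(timestamps))
unique_time = float(len(set(timestamps)))

# Same guard as in the feature: undefined when there is no transaction.
if unique_time != 0:
    ratio = transactions / unique_time
else:
    ratio = float("nan")

print("ratio: %.3f" % ratio)
```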

Add the ratio of the number of outgoing transactions to the number of unique timestamps for those transactions [XI]

In [21]:
#Write a function
def get_ratio_out_timestamp(node):
    """Return the ratio of the number of outgoing transactions to the number of unique timestamps of these transactions."""
    #Get the list of outgoing transactions
    edges = TG.out_edges(node,keys=False, data=True)
    #Build the list of timestamps
    #(assumption: the date is stored under the 'timestamp' edge attribute)
    timestamps = [edge[2]['timestamp'] for edge in edges]
    #Compute the ratio
    unique_time = float(len(np.unique(timestamps)))
    transactions = float(len(edges))
    if unique_time !=0:
        ratio = transactions / unique_time
    else:
        ratio = np.nan
    return ratio

In [22]:
#Add the feature
df['ratio_out_timestamp']=df['nodes'].map(lambda x: get_ratio_out_timestamp(x))

CPU times: user 2.82 s, sys: 16 ms, total: 2.84 s
Wall time: 2.84 s

Add the incoming transaction frequency of the node (# incoming transactions / (max date - min date)) [XII]

In [23]:
#write function
def get_in_frequency(node):
    """Return the incoming transaction frequency of the node, in transactions per second
    (# incoming transactions / (max date - min date))."""
    #Get the list of incoming transactions
    edges = TG.in_edges(node,keys=False, data=True)
    #Build the list of timestamps
    #(assumption: the date is stored under the 'timestamp' edge attribute)
    timestamps = [edge[2]['timestamp'] for edge in edges]
    #Build the delta in seconds
    date = pd.to_datetime(pd.Series(timestamps))
    dt = date.max()-date.min()
    #deltaseconds = dt.item().total_seconds()
    if dt.total_seconds()!=0:
        ratio = len(edges)/dt.total_seconds()
    else:
        ratio = np.nan
    return ratio

In [24]:
#Add the feature
df['frequency_in']=df['nodes'].map(lambda x: get_in_frequency(x))

CPU times: user 14.9 s, sys: 20 ms, total: 14.9 s
Wall time: 14.9 s
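
The frequency is the number of transactions divided by the time span they cover, in seconds. The sketch below reproduces that arithmetic with the standard library only. The function name `transaction_frequency` and the timestamp format are assumptions made for the example.

```python
from datetime import datetime

def transaction_frequency(timestamps):
    """Transactions per second over the observed span.

    Returns NaN when the list is empty or all timestamps are equal.
    The '%Y-%m-%d %H:%M:%S' format is an assumption for this sketch.
    """
    if not timestamps:
        return float("nan")
    dates = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
    span = (max(dates) - min(dates)).total_seconds()
    if span == 0:
        return float("nan")
    return len(dates) / span

# Three transactions spread over ten minutes (600 seconds).
example = ["2016-01-01 00:00:00", "2016-01-01 00:05:00", "2016-01-01 00:10:00"]
print(transaction_frequency(example))
```

A node with a single transaction has a zero time span, so its frequency is undefined and stays NaN, as in the feature above.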

Add the outgoing transaction frequency of the node (# outgoing transactions / (max date - min date)) [XIII]

In [25]:
#write function
def get_out_frequency(node):
    """Return the outgoing transaction frequency of the node, in transactions per second
    (# outgoing transactions / (max date - min date))."""
    #Get the list of outgoing transactions
    edges = TG.out_edges(node,keys=False, data=True)
    #Build the list of timestamps
    #(assumption: the date is stored under the 'timestamp' edge attribute)
    timestamps = [edge[2]['timestamp'] for edge in edges]
    #Build the delta in seconds
    date = pd.to_datetime(pd.Series(timestamps))
    dt = date.max()-date.min()
    #deltaseconds = dt.item().total_seconds()
    if dt.total_seconds()!=0:
        ratio = len(edges)/dt.total_seconds()
    else:
        ratio = np.nan
    return ratio

In [26]:
#Add the feature
df['frequency_out']=df['nodes'].map(lambda x: get_out_frequency(x))

CPU times: user 14.3 s, sys: 8 ms, total: 14.3 s
Wall time: 14.3 s

Add the ether balance [XIV]

In [27]:
#write function
def get_balance(node):
    """Return the balance (in wei) of a given node."""
    #Get edges in and edges out
    edges_in = TG.in_edges(node,keys=False, data=True)
    edges_out = TG.out_edges(node,keys=False, data=True)
    #Build value in array and value out array
    #(assumption: the ether amount is stored under the 'value' edge attribute)
    values_in = [edge[2]['value'] for edge in edges_in]
    values_out = [edge[2]['value'] for edge in edges_out]
    #Compute balance
    balance = np.sum(values_in)-np.sum(values_out)
    return balance

In [28]:
#Add the feature
df['balance']=df['nodes'].map(lambda x: get_balance(x))

CPU times: user 5.59 s, sys: 8 ms, total: 5.6 s
Wall time: 5.62 s

Average Velocity In [XV]

In [29]:
#write function
def get_mean_velocity_in(node):
    """Return the average ether velocity incoming into the node in wei/s."""
    #Get edges in collection
    edges_in = TG.in_edges(node,keys=False, data=True)
    #Collect values and timestamps
    #(assumption: they are stored under the 'value' and 'timestamp' edge attributes)
    values_in = [edge[2]['value'] for edge in edges_in]
    timestamps = [edge[2]['timestamp'] for edge in edges_in]
    #Create Velocity list
    velocities = []
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array
    for i in range(1,(len(edges_in)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_in[i+1]-values_in[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
    #Return the velocities average
    return np.average(np.absolute(velocities))

In [30]:
#Add the feature
df['mean_velocity_in']=df['nodes'].map(lambda x: get_mean_velocity_in(x))

CPU times: user 2min 19s, sys: 448 ms, total: 2min 20s
Wall time: 2min 19s
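
The velocity used here is a central difference: for each interior transaction i, the change in value between its two neighbours divided by the time between them. The sketch below works through that formula on made-up data. The function name `central_velocities`, the timestamp format and the assumption that transactions are already in chronological order are all illustrative.

```python
from datetime import datetime

def central_velocities(values, timestamps):
    """|v[i+1] - v[i-1]| / (t[i+1] - t[i-1]) for each interior index i."""
    dates = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
    velocities = []
    for i in range(1, len(values) - 1):
        dt = (dates[i + 1] - dates[i - 1]).total_seconds()
        if dt != 0:  # skip pairs of neighbours with identical timestamps
            velocities.append(abs(values[i + 1] - values[i - 1]) / dt)
    return velocities

# Made-up amounts (wei) received at 10-second intervals.
values = [10, 40, 70, 10]
timestamps = [
    "2016-01-01 00:00:00",
    "2016-01-01 00:00:10",
    "2016-01-01 00:00:20",
    "2016-01-01 00:00:30",
]
velocities = central_velocities(values, timestamps)
mean_velocity = sum(velocities) / len(velocities)
print(velocities)
print(mean_velocity)
```

A node needs at least three transactions to have any interior point, so nodes with fewer transactions end up with an empty velocity list and a NaN feature.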

Average Velocity Out [XVI]

In [31]:
#write function
def get_mean_velocity_out(node):
    """Return the average ether velocity outgoing from the node in wei/s."""
    #Get edges out collection
    edges_out = TG.out_edges(node,keys=False, data=True)
    #Collect values and timestamps
    #(assumption: they are stored under the 'value' and 'timestamp' edge attributes)
    values_out = [edge[2]['value'] for edge in edges_out]
    timestamps = [edge[2]['timestamp'] for edge in edges_out]
    #Create Velocity list
    velocities = []
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array
    for i in range(1,(len(edges_out)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_out[i+1]-values_out[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
    #Return the velocities average
    return np.average(np.absolute(velocities))

In [32]:
#Add the feature
df['mean_velocity_out']=df['nodes'].map(lambda x: get_mean_velocity_out(x))

CPU times: user 2min 20s, sys: 508 ms, total: 2min 21s
Wall time: 2min 20s

Std Velocity In [XVII]

In [38]:
#write function
def get_std_velocity_in(node):
    """Return the standard deviation of the ether velocity incoming into the node in wei/s."""
    #Get edges in collection
    edges_in = TG.in_edges(node,keys=False, data=True)
    #Collect values and timestamps
    #(assumption: they are stored under the 'value' and 'timestamp' edge attributes)
    values_in = [edge[2]['value'] for edge in edges_in]
    timestamps = [edge[2]['timestamp'] for edge in edges_in]
    #Create Velocity list
    velocities = []
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array
    for i in range(1,(len(edges_in)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_in[i+1]-values_in[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
    #Return the velocities standard deviation
    return np.std(np.absolute(velocities))

In [39]:
#Add the feature
df['std_velocity_in']=df['nodes'].map(lambda x: get_std_velocity_in(x))

CPU times: user 2min 21s, sys: 472 ms, total: 2min 22s
Wall time: 2min 21s

Std Velocity Out [XVIII]

In [40]:
#write function
def get_std_velocity_out(node):
    """Return the standard deviation of the ether velocity outgoing from the node in wei/s."""
    #Get edges out collection
    edges_out = TG.out_edges(node,keys=False, data=True)
    #Collect values and timestamps
    #(assumption: they are stored under the 'value' and 'timestamp' edge attributes)
    values_out = [edge[2]['value'] for edge in edges_out]
    timestamps = [edge[2]['timestamp'] for edge in edges_out]
    #Create Velocity list
    velocities = []
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array
    for i in range(1,(len(edges_out)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_out[i+1]-values_out[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
    #Return the velocities standard deviation
    return np.std(np.absolute(velocities))

In [41]:
#Add the feature
df['std_velocity_out']=df['nodes'].map(lambda x: get_std_velocity_out(x))

CPU times: user 2min 24s, sys: 564 ms, total: 2min 25s
Wall time: 2min 24s

Average Acceleration In [XIX]

In [52]:
#write function
def get_mean_acceleration_in(node):
    """Return the average ether acceleration incoming into the node in wei/s^2."""
    #Get edges in collection
    edges_in = TG.in_edges(node,keys=False, data=True)
    #Collect values and timestamps
    #(assumption: they are stored under the 'value' and 'timestamp' edge attributes)
    values_in = [edge[2]['value'] for edge in edges_in]
    timestamps = [edge[2]['timestamp'] for edge in edges_in]
    #Create Velocity list
    velocities = []
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array
    for i in range(1,(len(edges_in)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_in[i+1]-values_in[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
    #Make sure we have abs ...
    #Velocities range from 1 to N-1 (no 0 and N)
    #Accelerations range from 2 to N-2
    #Build the acceleration array
    accelerations = []
    for i in range(1,(len(velocities)-1)):
        if dates[i+1]!=dates[i-1]:
            acceleration = np.absolute(velocities[i+1]-velocities[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            accelerations.append(acceleration)
    #Return the accelerations average
    return np.average(np.absolute(accelerations))

In [43]:
#Add the feature
df['mean_acceleration_in']=df['nodes'].map(lambda x: get_mean_acceleration_in(x))

CPU times: user 4min 23s, sys: 1.05 s, total: 4min 24s
Wall time: 4min 23s
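
The acceleration is the same central difference applied a second time, to the velocities instead of the raw values. The sketch below is one way to write this so that each velocity keeps the time of the transaction it is centred on; this matters when some velocities are skipped because of identical timestamps. The helper name `central_differences` and the use of plain seconds instead of dates are assumptions for the example, not the notebook's implementation.

```python
def central_differences(values, times):
    """Return (rates, rate_times) with rates[i] = |y[i+1] - y[i-1]| / (t[i+1] - t[i-1]).

    Only interior points with a non-zero time gap are kept, and each
    rate is paired with the time of the point it is centred on.
    """
    rates, rate_times = [], []
    for i in range(1, len(values) - 1):
        dt = times[i + 1] - times[i - 1]
        if dt != 0:
            rates.append(abs(values[i + 1] - values[i - 1]) / float(dt))
            rate_times.append(times[i])
    return rates, rate_times

# Made-up amounts (wei) observed at one-second intervals.
values = [0, 10, 40, 90, 160]
times = [0, 1, 2, 3, 4]

velocities, velocity_times = central_differences(values, times)
accelerations, _ = central_differences(velocities, velocity_times)
print(velocities)
print(accelerations)
```

Each pass drops the first and last points, so N transactions give at most N-2 velocities and N-4 accelerations.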

Average Acceleration Out [XX]

In [44]:
#write function
def get_mean_acceleration_out(node):
    """Return the average ether acceleration outgoing from the node in wei/s^2."""
    #Get edges out collection
    edges_out = TG.out_edges(node,keys=False, data=True)
    #Collect values and timestamps
    #(assumption: they are stored under the 'value' and 'timestamp' edge attributes)
    values_out = [edge[2]['value'] for edge in edges_out]
    timestamps = [edge[2]['timestamp'] for edge in edges_out]
    #Create Velocity list
    velocities = []
    #Convert date str to datetime
    dates = pd.to_datetime(pd.Series(timestamps))
    #Build the velocity array
    for i in range(1,(len(edges_out)-1)):
        if dates[i+1]!=dates[i-1]:
            velocity = np.absolute(values_out[i+1]-values_out[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            velocities.append(velocity)
    #Make sure we have abs ...
    #Velocities range from 1 to N-1 (no 0 and N)
    #Accelerations range from 2 to N-2
    #Build the acceleration array
    accelerations = []
    for i in range(1,(len(velocities)-1)):
        if dates[i+1]!=dates[i-1]:
            acceleration = np.absolute(velocities[i+1]-velocities[i-1])/(dates[i+1]-dates[i-1]).total_seconds()
            accelerations.append(acceleration)
    #Return the accelerations average
    return np.average(np.absolute(accelerations))

In [45]:
#Add the feature
df['mean_acceleration_out']=df['nodes'].map(lambda x: get_mean_acceleration_out(x))

CPU times: user 4min 23s, sys: 1.05 s, total: 4min 25s
Wall time: 4min 23s

Getting Rogue nodes

In [46]:
rogues = pd.read_csv("../data/rogues.csv")
rogues_id = np.array(rogues['id'])
fake_rogues = ['0x223294182093bfc6b11e8ef5722d496f066036c2','0xec1ebac9da3430213281c80fa6d46378341a96ae','0xe6447ae67346b5fb7ebd65ebfc4c7e6521b21f8a']

Min path to a rogue node [α]

In [47]:
#write function
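def min_path_to_rogue(node,rogues):
    """Return the length of the shortest path from the node to any rogue node (NaN if no rogue is reachable)."""
    paths_lengths = []
    for rogue in rogues:
        if nx.has_path(TG,node,rogue):
            paths_lengths.append(nx.shortest_path_length(TG,node,rogue))
    if len(paths_lengths)!=0:
        return np.min(paths_lengths)
    else:
        return np.nan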

In [48]:
#Add the feature
df['min_path_to_rogue']=df['nodes'].map(lambda x: min_path_to_rogue(x,fake_rogues))

CPU times: user 33 s, sys: 0 ns, total: 33 s
Wall time: 33 s
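
Checking reachability and then the path length for every rogue means several graph traversals per node. A single breadth-first search that stops at the first rogue it meets gives the same minimum hop count. The library-free sketch below illustrates the idea on a made-up adjacency dict; `min_hops_to_any` is a hypothetical name, and this is an alternative technique rather than the notebook's code.

```python
from collections import deque

def min_hops_to_any(adjacency, source, targets):
    """Fewest directed hops from source to any node in targets, or NaN."""
    targets = set(targets)
    if source in targets:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for successor in adjacency.get(node, ()):
            if successor in seen:
                continue
            if successor in targets:
                # BFS visits nodes in order of distance, so the first
                # target reached is also the closest one.
                return dist + 1
            seen.add(successor)
            queue.append((successor, dist + 1))
    return float("nan")

# Made-up directed graph: node -> list of successors.
adjacency = {"a": ["b"], "b": ["c", "d"], "c": ["r1"], "d": [], "r2": ["a"]}
rogue_nodes = ["r1", "r2"]
print(min_hops_to_any(adjacency, "a", rogue_nodes))
```

Running the same search on the reversed adjacency gives the minimum path from a rogue to the node.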

Min path from a rogue node [β]

In [51]:
#write function
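def min_path_from_rogue(node,rogues):
    """Return the length of the shortest path from any rogue node to the node (NaN if no rogue reaches it)."""
    paths_lengths = []
    for rogue in rogues:
        if nx.has_path(TG,rogue,node):
            paths_lengths.append(nx.shortest_path_length(TG,rogue,node))
    if len(paths_lengths)!=0:
        return np.min(paths_lengths)
    else:
        return np.nan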

In [50]:
#Add the feature
df['min_path_from_rogue']=df['nodes'].map(lambda x: min_path_from_rogue(x,fake_rogues))

CPU times: user 10 s, sys: 4 ms, total: 10 s
Wall time: 10 s

Amount of ether that flowed from the node to the closest rogue [δ]

Amount of ether that flowed to the node from the closest rogue [ε]

Write the features dataframe to CSV

In [53]:

In [ ]: