In [93]:
import matplotlib.pyplot as plt
#%matplotlib inline
import pandas as pd
import numpy as np
from math import ceil
In [94]:
import io
import requests
Content = requests.get("https://raw.githubusercontent.com/flmath-dirty/matrixes_in_erlang/master/results.cvs").content
# Parse the downloaded CSV, supplying the column names explicitly.
LoadedTable = pd.read_csv(io.StringIO(Content.decode('utf-8')), skipinitialspace=True,
names=['Test type', 'Width', 'Height', 'Matrix representation', 'No. of calls', 'Execution time'])
LoadedTable contains statistics gathered by a script that generates a matrix (Width x Height) and then runs the following tests under fprof for each matrix representation and size.
The test categories:
one_rows_sums - sums the values in each row of the matrix
one_cols_sums - sums the values in each column of the matrix
get_value - retrieves each value from the matrix once, separately
set_value - sets each value in the matrix once, separately
The matrices are represented as follows:
sofs: the sets-of-sets module is a way of representing relations (in the strictly mathematical sense). In this case the matrix is represented as a relation (in fact a function) F whose domain is built from the coordinates, each of which is related to the corresponding matrix entry value ({x,y} F Value).
To get the values in a specific row we define the equivalence relation aRb iff there exist {x,y1} F a and {x,y2} F b. This relation partitions the range into equivalence classes. If we sum the elements belonging to the class of an element related to some {x,_}, we obtain the sum of the elements of that specific row. The same logic gives us the equivalence classes for columns.
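As a rough illustration of this idea, here is a hypothetical Python analogue (it mirrors the grouping, not the Erlang sofs API): partitioning the coordinate-to-value relation by the first coordinate produces exactly those equivalence classes, and summing each class gives the row sums.
from collections import defaultdict

# The matrix as a relation F: (x, y) -> value (illustrative data).
relation = {(0, 0): 1, (0, 1): 2,
            (1, 0): 3, (1, 1): 4}

# Partition by the first coordinate: (x, y1) ~ (x, y2) for any y1, y2.
classes = defaultdict(list)
for (x, y), value in relation.items():
    classes[x].append(value)

row_sums = {x: sum(values) for x, values in classes.items()}
print(row_sums)   # {0: 3, 1: 7}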
I ran the tests on square matrices of sizes 50x50, 100x100 and 1000x1000. For some of the 1000x1000 tests the fprof logs exceeded the dedicated 2 TB of disk space; for those the values are missing.
In [95]:
LoadedTable.head()
Out[95]:
In [96]:
from scipy import stats
from functools import reduce
AggregatedTable = (LoadedTable.groupby(['Test type','Matrix representation'])['Execution time']
.apply(list).reset_index(name='Execution Times List'))
Create an aggregated table in which the execution times for the different matrix sizes are stored in a single list.
In [97]:
AggregatedTable.head()
Out[97]:
Each list should contain 3 values; if one is missing, the test exceeded the capability of my equipment and its execution time can be considered infinite. Let's also add columns with the minimum values, the maximum values and the harmonic averages (a standard average would be too susceptible to outliers here).
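As a quick illustration of why the harmonic mean is used here (hypothetical numbers, not taken from the measurements): a single large outlier dominates the arithmetic mean but barely moves the harmonic mean.
from scipy import stats
import numpy as np

times = [1.0, 10.0, 100.0]     # hypothetical execution times with one outlier
print(np.mean(times))          # 37.0   -- the arithmetic mean is pulled up by the outlier
print(stats.hmean(times))      # ~2.70  -- the harmonic mean stays near the typical values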
In [98]:
AggregatedTable['Harmonic Average'] = AggregatedTable['Execution Times List'].apply(
lambda x: stats.hmean(x) if (len(x) == 3) else np.inf)
In [99]:
AggregatedTable['Max'] = AggregatedTable['Execution Times List'].apply(
lambda x: max(x) if (len(x) == 3) else np.inf)
In [100]:
AggregatedTable['Min'] = AggregatedTable['Execution Times List'].apply(min)
In [101]:
AggregatedTable.head()
Out[101]:
Let's compare the execution times for retrieving values.
In [102]:
GetValue = (
AggregatedTable
.loc[AggregatedTable['Test type']=='get_value']
.loc[:,['Matrix representation','Harmonic Average','Max','Min']]
)
GetValue = GetValue.set_index('Matrix representation')
In [103]:
from jupyter_extensions.matplotlib.sawed_bar_plot import sawed_bar_plot
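The sawed_bar_plot helper is imported from a separate jupyter_extensions package and is not shown here. As a rough, purely hypothetical sketch of what such a broken-axis ('sawed') bar chart does, the bars can be drawn on two stacked axes, one showing the range below a cutoff and one the range above it (the 'rift'), so that huge outliers do not flatten the rest of the plot. The actual helper may differ:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def sawed_bar_plot_sketch(df, cut=None):
    # Hypothetical broken-axis bar plot; `cut` is the height at which the rift sits.
    data = df.replace(np.inf, np.nan)
    top_limit = np.nanmax(data.values) * 1.05
    if cut is None:
        cut = top_limit / 2
    fig, (upper, lower) = plt.subplots(
        2, 1, sharex=True, figsize=(12, 9),
        gridspec_kw={'height_ratios': [1, 2]})
    data.plot(kind='bar', ax=upper, legend=False)  # the same bars on both axes
    data.plot(kind='bar', ax=lower, legend=True)
    upper.set_ylim(cut, top_limit)                 # outliers above the rift
    lower.set_ylim(0, cut)                         # zoomed-in view below it
    upper.spines['bottom'].set_visible(False)
    lower.spines['top'].set_visible(False)
    upper.tick_params(labelbottom=False)
    plt.show()

# Hypothetical usage with made-up data:
example = pd.DataFrame({'Min': [3.0, 5.0], 'Max': [40000.0, 8.0]},
                       index=['repr_a', 'repr_b'])
sawed_bar_plot_sketch(example, cut=100)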
In [104]:
sawed_bar_plot(GetValue)
It turns out that the list-of-lists and sets-of-sets implementations are extremely inefficient at retrieving values.
For the list of lists this is obvious: to get an element in the middle we have to traverse the outer list to reach the middle row and then traverse that row to reach the middle element, so a single access costs time linear in the matrix dimension instead of being constant.
The sofs module turns out to be a trap: if you read the documentation you will see elegant (from a mathematician's point of view) relation algebra, but if you look inside the source code you will find that the module makes extensive use of list folds, which in the best case reduces it to the list-of-lists scenario. We can probably draw the first conclusion and piece of advice here: don't abuse sofs, and remember that a map is a relation too (precisely, a function, but no one can forbid you from returning lists or other containers as values).
The other implementations look reasonable. There are some surprises, like big tuples performing better than arrays.
Let's see how the set_value functions perform.
In [105]:
SetValue = (
AggregatedTable
.loc[AggregatedTable['Test type']=='set_value']
.loc[:,['Matrix representation','Harmonic Average','Max','Min']]
)
SetValue = SetValue.set_index('Matrix representation')
In [106]:
sawed_bar_plot(SetValue)
The list of lists and sofs are outclassed even for the smallest matrices.
As we suspected, the big_tuple is much worse at setting values, since it is immutable and setting one value means copying the whole tuple. We can also notice that the array, although more efficient, has some values above the rift, so let's change perspective a little:
In [107]:
sawed_bar_plot(SetValue,50000)
Let's zoom in a little to see the small values.
In [108]:
sawed_bar_plot(SetValue,500)
The array is much better than the tuple, but both are an order of magnitude worse than the directed graph and the map. It is worth mentioning here that digraph is implemented on top of an ets table, which suggests a fixed overhead up front but good efficiency for big tables. Still, the implementation with maps (built into ERTS) is faster even for 1000x1000 matrices. We will investigate the growth of the execution times further on in this document.
Since a digraph backed by an ets table is quite efficient, it is worth asking, and keeping in mind, whether we can craft a matrix representation on top of an ets table that is more efficient than a map. I will come back to this question at the end of this document.
The need for this small piece of research arose during the implementation of a statistics-gathering module, where it was important to be able to calculate sums of specific counters. Usually the best way to do that is to interface with some C or Python application that supports fast number crunching, but that increases the complexity of the product, which should normally be avoided. Let's check the performance of column summation.
In [109]:
OneColsSums = (
AggregatedTable
.loc[AggregatedTable['Test type']=='one_cols_sums']
.loc[:,['Matrix representation','Harmonic Average','Max','Min']]
)
OneColsSums = OneColsSums.set_index('Matrix representation')
In [110]:
sawed_bar_plot(OneColsSums,200)
This is a very confusing result: both the array and the tuple compute the needed n indexes (in a very C-like style) and sum the extracted values, which should be O(n^2) (n lookups per column times n columns), yet for big matrices the array is slower than the map, which has to locate every value through its hash-based structure (roughly O(n^2 log n)), suggesting some inefficiency in the array implementation.
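For reference, the index arithmetic being described looks roughly like this (a hypothetical Python sketch of a flat, row-major layout; it is not the benchmark's Erlang code):
def column_sum(flat, width, height, col):
    # In a row-major flat container the cells of one column sit at
    # indexes col, col + width, col + 2*width, ...: n lookups per column.
    return sum(flat[row * width + col] for row in range(height))

flat = list(range(12))                                # a 3x4 matrix stored flat, row-major
print(column_sum(flat, width=4, height=3, col=1))     # 1 + 5 + 9 = 15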
Let's look at summing rows:
In [111]:
OneRowsSums = (
AggregatedTable
.loc[AggregatedTable['Test type']=='one_rows_sums']
.loc[:,['Matrix representation','Harmonic Average','Max','Min']]
)
OneRowsSums = OneRowsSums.set_index('Matrix representation')
In [112]:
sawed_bar_plot(OneRowsSums)
The results are similar to the previous ones, with the exception of the list of lists, which has an obvious advantage here since the dominating operation is the summation of a row stored as a linked list. Still, since setting a value is extremely expensive, this implementation is not a really good choice anyway.
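That advantage is easy to see in a hypothetical Python analogue of the layout: summing a row is a single linear pass over one inner list.
matrix = [[1, 2, 3],
          [4, 5, 6]]                            # list of rows
row_sums = [sum(row) for row in matrix]         # one linear walk per row
print(row_sums)                                 # [6, 15]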
The next thing we should consider is which implementation is best for small matrices.
In [113]:
AggregatedTableRestricted = AggregatedTable[
(AggregatedTable['Matrix representation']!='matrix_as_list_of_lists') &
(AggregatedTable['Matrix representation']!='matrix_as_sofs')]
AggregatedTableRestricted.head()
Out[113]:
In [114]:
Minimums = (
AggregatedTableRestricted
.loc[:,['Test type','Matrix representation','Min']]
)
#Minimums.groupby(['Test type','Matrix representation'])
Minimums = Minimums.set_index(['Test type','Matrix representation'])
Minimums.head()
Out[114]:
In [115]:
TestTypes = Minimums.index.get_level_values('Test type').unique().tolist()
#MatrixRepresentations = Minimums.index.get_level_values('Matrix representation').unique().tolist()
Plot = Minimums.unstack().plot(kind='bar',figsize=(12,9))
Plot.set_xticklabels(TestTypes,rotation="horizontal")
handles, labels=Plot.get_legend_handles_labels()
labels = [Label[6:-1] for Label in labels]  # strip the "(Min, " prefix and the trailing ")" from the MultiIndex labels
Plot.legend(handles,labels)
plt.show()
It turns out that for small matrices a big tuple is quite a good implementation, unless we are going to use our matrix for write-heavy tasks; in that case it is better to use the map implementation.
Now we can see how the efficiency of those implementations looks for relatively(*) big 1000x1000 matrices.
(*) If you really want something big, you need to interface with another programming language.
In [116]:
Maximums = (
AggregatedTableRestricted
.loc[:,['Test type','Matrix representation','Max']]
)
Maximums = Maximums.set_index(['Test type','Matrix representation'])
Maximums.head()
Out[116]:
In [117]:
TestTypes = Maximums.index.get_level_values('Test type').unique().tolist()
Plot = Maximums.unstack().plot(kind='bar',figsize=(12,9))
Plot.set_xticklabels(TestTypes,rotation="horizontal")
handles, labels = Plot.get_legend_handles_labels()
labels = [Label[6:-1] for Label in labels]
Plot.legend(handles,labels)
plt.show()
It looks like the set_value operations on the big tuple are overwhelmingly slow, to the point where they obscure our plot. Let's remove them from the results and compare the operations that are left.
In [118]:
Maximums.loc['set_value','matrix_as_big_tuple']=np.nan
In [119]:
TestTypes = Maximums.index.get_level_values('Test type').unique().tolist()
Plot = Maximums.unstack().plot(kind='bar',figsize=(12,9))
Plot.set_xticklabels(TestTypes,rotation="horizontal")
handles, labels=Plot.get_legend_handles_labels()
labels = [Label[6:-1] for Label in labels]
Plot.legend(handles,labels)
plt.show()
So the big tuple with calculated index values turns out not to scale well. This is not surprising: since a tuple is immutable, every time you set a new value you need to copy all of the values, which puts this representation beyond redemption when it comes to write operations.
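For intuition, here is a hypothetical Python analogue (Python tuples are immutable too; this is not the benchmark's Erlang code): updating a single slot means rebuilding the whole tuple, so every set_value costs time linear in the number of stored elements.
def set_tuple_value(t, index, value):
    # Rebuilding the tuple copies every element: O(len(t)) per single update.
    return t[:index] + (value,) + t[index + 1:]

matrix = tuple(range(1000 * 1000))              # a flat 1000x1000 "big tuple"
matrix = set_tuple_value(matrix, 123, 42)       # one write copies a million slots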
Overall, the map implementation is the winner here, and it is the recommended implementation. But before we congratulate ourselves, we can notice that the digraph implementation, even though it is slower, might be slower only because of the overhead it adds on top of its ets table, and with a further increase in matrix size it might actually become better than maps.
To get a little more insight that would help us estimate whether this scenario can happen, we should find a way to compare how quickly the time-consumption functions grow.
Let's try to estimate the acceleration of the growth of the resource consumption.
For each index ([Test type, Matrix representation]) we get a function F that maps the matrix sizes to execution times.
See the picture below:
We want a hint of how fast the growth of each particular function will accelerate.
We will start by visualising the scale of the distances we are dealing with here:
For the picture above we want to find out the increase of the increase, or the angle R bound to it. In our simplified calculus we can treat finding an increase as finding a derivative. Since one interpretation of the derivative is the tangent of the slope angle, if we want comparable estimates of R we can use tan(R) (the tangent is monotonic on $(\frac{-\pi}2, \frac\pi2)$).
But before we try to find the R values, we need to notice that the ratio of the increase of the increase to the projected increase depends on the value of the first measurement (at 2500 cells, i.e. the 50x50 matrix), and so does R.
We want to remove this sensitivity to the initial conditions; to do that we scale all the mappings so that the value at 2500 cells is the same (100) for each of them.
In [120]:
pd.set_option('mode.chained_assignment',None)
AggregatedTableRestricted['Scaled Execution Times List'] = AggregatedTableRestricted['Execution Times List'].apply(
lambda X: [(z * 100)/X[0] for z in X])
pd.set_option('mode.chained_assignment','warn')
In [121]:
AggregatedTableRestricted.head()
Out[121]:
Now, by definition:
tan(B) = ProjectedIncrease / (396 x 2500)
tan(R+B) = ActualIncrease / (396 x 2500)
where ProjectedIncrease extrapolates the 50x50 -> 100x100 increase, ActualIncrease is the measured increase from the 100x100 to the 1000x1000 matrix, and 396 x 2500 = 990,000 is the horizontal distance in cells between those two measurements. Since the denominator is the same in both cases, we do not need to take it into account here. Let's calculate those increases.
In [122]:
pd.set_option('mode.chained_assignment',None)
AggregatedTableRestricted['ProjectedIncrease'] = AggregatedTableRestricted['Scaled Execution Times List'].apply(
lambda x: 396*(x[1]-x[0]))
AggregatedTableRestricted['ActualIncrease'] = AggregatedTableRestricted['Scaled Execution Times List'].apply(
lambda x: x[2]-x[1])
pd.set_option('mode.chained_assignment','warn')
In [123]:
AggregatedTableRestricted.head()
Out[123]:
We have tan(R+B) and tan(B), but we want to calculate tan(R).
Fortunately there are identities that can help us with that.
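Namely, the tangent difference formula (a standard trigonometric identity): $\tan(R) = \tan\bigl((R+B)-B\bigr) = \frac{\tan(R+B)-\tan(B)}{1+\tan(R+B)\tan(B)}$. The SymPy cells below recover the same expression symbolically.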
In [124]:
import sympy as sp
R, B = sp.symbols('R B')
tanRplusB = sp.expand_trig(sp.tan(R + B))
tanRplusB
Out[124]:
We want to express tan(R).
SymPy's trigonometric expansion/simplification functions would give quite complicated answers (involving the atan function). But we just want to treat tan(R) as a rational function of the other tangents, so we make a substitution.
In [125]:
Ratio= tanRplusB.subs(sp.tan(R),R).subs(sp.tan(B),B)
C = sp.symbols('C')
RatioSolution = sp.solve(sp.Eq(Ratio , C),R)
RatioSolution
Out[125]:
In [126]:
FinalSolution = RatioSolution[0].subs(B,sp.tan(B)).subs(C,sp.tan(R+B))
FinalSolution
Out[126]:
Double-check that the result is actually valid:
In [127]:
sp.simplify(FinalSolution-sp.tan(R)) == 0
Out[127]:
Let's apply the resulting equation to our data:
In [128]:
pd.set_option('mode.chained_assignment',None)
AggregatedTableRestricted['TanB'] = AggregatedTableRestricted['ProjectedIncrease'].apply(
lambda x: x/396)
AggregatedTableRestricted['TanAplusB'] = AggregatedTableRestricted['ActualIncrease'].apply(
lambda x: x/396)
AggregatedTableRestricted['TanA'] = (
(AggregatedTableRestricted['TanAplusB']-AggregatedTableRestricted['TanB'])/((AggregatedTableRestricted['TanAplusB']*AggregatedTableRestricted['TanB'])+1))
pd.set_option('mode.chained_assignment','warn')
In [129]:
Diffs = (
AggregatedTableRestricted
.loc[:,['Test type','Matrix representation','TanA']]
)
#Minimums.groupby(['Test type','Matrix representation'])
Diffs = Diffs.set_index(['Test type','Matrix representation'])
In [130]:
TestTypes = Diffs.index.get_level_values('Test type').unique().tolist()
Plot = Diffs.unstack().plot(kind='bar',figsize=(12,9))
Plot.set_xticklabels(TestTypes,rotation="horizontal")
handles, labels=Plot.get_legend_handles_labels()
labels = [Label[7:-1] for Label in labels]
Plot.legend(handles,labels)
plt.show()
Bars in this plot represent growth acceleration, which is related to the second derivative of the function.
It looks like the map implementation is indeed the best choice. It is also heart-warming that for the majority of operations the execution time grows more slowly than linearly.
What is left for other articles are implementations based on ets tables with different key formats, which I will compare to our map representation, and also maps with different key formats (a list or a binary as the key, etc., instead of a tuple). I am also considering implementing matrix multiplication instead of column and row sums (those operations are comparable when it comes to complexity).
After that I plan more sophisticated implementations that use Erlang's main advantage: the possibility of storing values in many small processes.