Java 1.8 must be installed:
In [2]:
%%bash
java -version
The bin folder of Spark 2.2.1 for Hadoop 2.7 or later must also be added to the PATH (download).
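A minimal sketch of that setup, assuming Spark was unpacked under /opt (a hypothetical location; point it at your actual install; the export only lasts for this cell, so add it to your shell profile to make it permanent):

In [ ]:
%%bash
# Hypothetical install location: adjust to wherever spark-2.2.1-bin-hadoop2.7 was extracted
export PATH=$PATH:/opt/spark-2.2.1-bin-hadoop2.7/bin
spark-submit --version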
In [3]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
from matplotlib import pylab
import numpy as np
pylab.rcParams['figure.figsize'] = (16.0, 8.0)
sns.set(style="whitegrid")
This function creates, in the input folder, a set of JSON files containing collections of elements:
In [3]:
def createTestFileCollection(elements=120, entities=2, versions=2, depth=2, fields=2, batch=12):
    !rm -rf input
    !mkdir -p input
    out = !java -jar es.um.nosql.streaminginference.benchmark-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
        --elements $elements \
        --entities $entities \
        --versions $versions \
        --depth $depth \
        --fields $fields \
        --mode file \
        --flow stream \
        --batch $batch \
        --output input/collection.json \
        --delay 10
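As a quick sanity check, generating a small collection and listing the input folder should show the batch files (a sketch; the exact file names depend on the generator):

In [ ]:
createTestFileCollection(elements=120, batch=12)
!ls -lh input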
This function populates the benchmark database with test entities:
In [4]:
def createTestMongoCollection(elements=120, entities=2, versions=2, depth=2, fields=2):
    out = !java -jar es.um.nosql.streaminginference.benchmark-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
        --elements $elements \
        --entities $entities \
        --versions $versions \
        --depth $depth \
        --fields $fields \
        --mode mongo \
        --host localhost \
        --port 27017 \
        --database benchmark
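Optionally, the inserted test data can be verified from the notebook; a sketch assuming pymongo 3.7 or later (for list_collection_names/count_documents) is installed and mongod is listening on localhost:27017, as configured above:

In [ ]:
from pymongo import MongoClient
# Connect to the same database the generator writes to
client = MongoClient("localhost", 27017)
db = client["benchmark"]
for name in db.list_collection_names():
    print(name, db[name].count_documents({}))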
This function creates a single JSON file named "collection" with a collection of elements in the input folder:
In [5]:
def createTestSingleCollection(elements=120, entities=2, versions=2, depth=2, fields=2):
    !rm -rf input
    !mkdir -p input
    out = !java -jar es.um.nosql.streaminginference.benchmark-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
        --elements $elements \
        --entities $entities \
        --versions $versions \
        --depth $depth \
        --fields $fields \
        --mode file \
        --output input/collection.json
This function selects the command to run depending on the operating mode:
In [6]:
def createTestCollection(mode="file", elements=120, entities=2, versions=2, depth=2, fields=2, batch=12):
    !mkdir -p output
    if mode == "file":
        createTestFileCollection(elements, entities, versions, depth, fields, batch)
    elif mode == "mongo":
        createTestMongoCollection(elements, entities, versions, depth, fields)
    elif mode == "single":
        createTestSingleCollection(elements, entities, versions, depth, fields)
This function runs the inference application on a previously created series of collections and dumps the results into stats.csv:
In [7]:
def benchmarkFile(interval=1000, kryo="true"):
    out = !spark-submit --driver-memory 8g --master local[*] es.um.nosql.streaminginference.json2dbschema-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
        --mode file \
        --input input \
        --benchmark true \
        --interval $interval \
        --kryo $kryo
This function runs the inference application on the previously created database and generates the stats.csv file:
In [8]:
def benchmarkMongo(interval=1000, block=200, kryo="true"):
    out = !spark-submit --driver-memory 8g --master local[*] es.um.nosql.streaminginference.json2dbschema-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
        --mode mongo \
        --database benchmark \
        --host localhost \
        --port 27017 \
        --benchmark true \
        --interval $interval \
        --block-interval $block \
        --kryo $kryo
This function runs the inference application on the created collection and generates the stats.csv file; in this case only the processing time is reported:
In [9]:
def benchmarkSingle():
    out = !spark-submit --driver-memory 8g --master local[*] es.um.nosql.streaminginference.json2dbschema-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
        --mode single \
        --input input/collection.json \
        --benchmark true
This function selects the command to run depending on the operating mode:
In [10]:
def benchmarkSparkApp(mode="file", interval=1000, block=200, kryo="true"):
    if mode == "file":
        benchmarkFile(interval, kryo)
    elif mode == "mongo":
        benchmarkMongo(interval, block, kryo)
    elif mode == "single":
        benchmarkSingle()
The following function composes the previous ones to run a test with the given parameters:
In [11]:
def benchmark(mode="file", interval=1000, block=200, elements=120, entities=2, versions=2, depth=2, fields=2, batch=12, kryo="true"):
    global benchmarked
    !rm -f output/stats.csv
    createTestCollection(mode, elements, entities, versions, depth, fields, batch)
    # Each run of the Spark app appends its measurements to output/stats.csv,
    # so 10 repetitions yield 10 rows per test configuration
    for x in range(0, 10):
        benchmarkSparkApp(mode, interval, block, kryo)
    benchmarked = pd.read_csv("output/stats.csv")
    return benchmarked
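After a call to benchmark(...), the collected runs can be summarized directly; a quick sketch, assuming only that stats.csv contains a TOTAL_PROCESSING column with one measurement (in milliseconds) per run, as the plots below rely on:

In [ ]:
# Summary statistics over the 10 runs collected in output/stats.csv
benchmarked["TOTAL_PROCESSING"].describe()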
Creation of a collection of 60000 elements split into 5 files:
In [135]:
createTestCollection(mode="file", elements=60000, batch=12000)
Creation of a single file with 60000 elements:
In [131]:
createTestCollection(mode="single", elements=60000)
Insertion of 60000 elements into the MongoDB database "benchmark":
In [132]:
createTestCollection(mode="mongo", elements=60000)
Test run with 60000 elements in file mode, in batches of 12000 elements:
In [134]:
benchmark(mode="file",elements=60000, batch=12000)
Out[134]:
Test run with 60000 elements in single mode:
In [136]:
benchmark(mode="single",elements=60000)
Out[136]:
Test run with 60000 elements in mongo mode:
In [69]:
benchmark(mode="mongo", elements=60000)
Out[69]:
Study of the effect of Kryo serialization on the application:
In [58]:
results = pd.DataFrame()
df = benchmark(mode="file", elements=2400000, batch=80000, entities=30, versions=30, depth=5, fields=4, kryo="true")
df.to_csv("kryo-enabled.csv")
results["kryo enabled"] = df["TOTAL_PROCESSING"]
df = benchmark(mode="file", elements=2400000, batch=80000, entities=30, versions=30, depth=5, fields=4, kryo="false")
df.to_csv("kryo-disabled.csv")
results["kryo disabled"] = df["TOTAL_PROCESSING"]
ax = sns.barplot(data=results)
ax.set_ylabel("Milisegundos de procesamiento")
Out[58]:
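To put a number on the effect beyond the bar plot, the ratio of mean processing times gives an approximate speedup; a small sketch over the results DataFrame built above:

In [ ]:
# Approximate speedup from Kryo: mean time without Kryo over mean time with Kryo
results["kryo disabled"].mean() / results["kryo enabled"].mean()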
Study of the effect of the number of entities on processing time:
In [13]:
ents = np.array([])
mode = np.array([])
millis = np.array([])
for entities in [1, 50, 100, 200, 400]:
    df = benchmark(mode="file", elements=2400000, batch=80000, entities=entities, versions=1, depth=2, fields=2, kryo="true")
    df.to_csv("file-entities-"+str(entities)+".csv")
    length = df["TOTAL_PROCESSING"].size
    ents = np.append(ents, np.repeat(entities, length))
    mode = np.append(mode, np.repeat("Paralelo", length))
    millis = np.append(millis, df["TOTAL_PROCESSING"].as_matrix())
    df = benchmark(mode="single", elements=2400000, entities=entities, versions=1, depth=2, fields=2)
    df.to_csv("original-file-entities-"+str(entities)+".csv")
    length = df["TOTAL_PROCESSING"].size
    ents = np.append(ents, np.repeat(entities, length))
    mode = np.append(mode, np.repeat("Original", length))
    millis = np.append(millis, df["TOTAL_PROCESSING"].as_matrix())
results = pd.DataFrame({"Entidades": ents, "Modo": mode, "Milisegundos de procesamiento": millis})
sns.factorplot(x="Entidades", y="Milisegundos de procesamiento", col="Modo", data=results, kind="bar", size=7)
Out[13]:
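The same data can also be inspected numerically; a sketch grouping the results DataFrame built above:

In [ ]:
# Mean processing time (ms) per mode and number of entities
results.groupby(["Modo", "Entidades"])["Milisegundos de procesamiento"].mean().unstack("Modo")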
Study of the effect of the number of versions on processing time:
In [14]:
vers = np.array([])
mode = np.array([])
millis = np.array([])
for versions in [1, 50, 100, 200, 400]:
    df = benchmark(mode="file", elements=2400000, batch=80000, entities=1, versions=versions, depth=2, fields=2, kryo="true")
    df.to_csv("file-versions-"+str(versions)+".csv")
    length = df["TOTAL_PROCESSING"].size
    vers = np.append(vers, np.repeat(versions, length))
    mode = np.append(mode, np.repeat("Paralelo", length))
    millis = np.append(millis, df["TOTAL_PROCESSING"].as_matrix())
    df = benchmark(mode="single", elements=2400000, entities=1, versions=versions, depth=2, fields=2)
    df.to_csv("original-file-versions-"+str(versions)+".csv")
    length = df["TOTAL_PROCESSING"].size  # recompute for the second set of runs
    vers = np.append(vers, np.repeat(versions, length))
    mode = np.append(mode, np.repeat("Original", length))
    millis = np.append(millis, df["TOTAL_PROCESSING"].as_matrix())
results = pd.DataFrame({"Versiones": vers, "Modo": mode, "Milisegundos de procesamiento": millis})
sns.factorplot(x="Versiones", y="Milisegundos de procesamiento", col="Modo", data=results, kind="bar", size=7)
Out[14]:
Study of the effect of the number of elements on processing time:
In [15]:
elems = np.array([])
mode = np.array([])
micros = np.array([])
for elements in [60000, 120000, 480000, 1200000, 2400000, 3600000]:
    df = benchmark(mode="file", elements=elements, batch=(elements/30), entities=1, versions=1, depth=2, fields=2, kryo="true")
    df.to_csv("light-file-elements-"+str(elements)+".csv")
    length = df["TOTAL_PROCESSING"].size
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("Paralelo", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
    df = benchmark(mode="single", elements=elements, entities=1, versions=1, depth=2, fields=2)
    df.to_csv("light-original-file-elements-"+str(elements)+".csv")
    length = df["TOTAL_PROCESSING"].size  # recompute for the second set of runs
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("Original", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
results = pd.DataFrame({"Elementos": elems, "Modo": mode, "Microsegundos por elemento": micros})
sns.factorplot(x="Elementos", y="Microsegundos por elemento", col="Modo", data=results, kind="bar", size=7)
Out[15]:
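For a numeric view of the per-element cost, the results built above can be grouped by mode and collection size; a small sketch:

In [ ]:
# Mean microseconds per element, per mode and collection size
results.groupby(["Modo", "Elementos"])["Microsegundos por elemento"].mean().unstack("Modo")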
In [16]:
elems = np.array([])
mode = np.array([])
micros = np.array([])
for elements in [60000, 120000, 480000, 1200000, 2400000, 3600000]:
    df = benchmark(mode="file", elements=elements, batch=(elements/30), entities=20, versions=20, depth=2, fields=2, kryo="true")
    df.to_csv("medium-file-elements-"+str(elements)+".csv")
    length = df["TOTAL_PROCESSING"].size
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("Paralelo", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
    df = benchmark(mode="single", elements=elements, entities=20, versions=20, depth=2, fields=2)
    df.to_csv("medium-original-file-elements-"+str(elements)+".csv")
    length = df["TOTAL_PROCESSING"].size  # recompute for the second set of runs
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("Original", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
results = pd.DataFrame({"Elementos": elems, "Modo": mode, "Microsegundos por elemento": micros})
sns.factorplot(x="Elementos", y="Microsegundos por elemento", col="Modo", data=results, kind="bar", size=7)
Out[16]:
In [17]:
elems = np.array([])
mode = np.array([])
micros = np.array([])
for elements in [60000, 120000, 480000, 1200000, 2400000, 3600000]:
    df = benchmark(mode="file", elements=elements, batch=(elements/30), entities=50, versions=50, depth=2, fields=2, kryo="true")
    df.to_csv("hard-file-elements-"+str(elements)+".csv")
    length = df["TOTAL_PROCESSING"].size
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("Paralelo", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
    df = benchmark(mode="single", elements=elements, entities=50, versions=50, depth=2, fields=2)
    df.to_csv("hard-original-file-elements-"+str(elements)+".csv")
    length = df["TOTAL_PROCESSING"].size  # recompute for the second set of runs
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("Original", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
results = pd.DataFrame({"Elementos": elems, "Modo": mode, "Microsegundos por elemento": micros})
sns.factorplot(x="Elementos", y="Microsegundos por elemento", col="Modo", data=results, kind="bar", size=7)
Out[17]:
Study of the effect of the number of partitions on processing time:
In [18]:
parts = np.array([])
millis = np.array([])
elements = 2400000
for partitions in [1, 2, 4, 8, 16]:
    # batch = elements/partitions, so the collection is split into `partitions` input files
    df = benchmark(mode="file", elements=elements, batch=(elements/partitions), entities=1, versions=1, depth=2, fields=2, kryo="true")
    df.to_csv("file-partitions-"+str(partitions)+".csv")
    length = df["TOTAL_PROCESSING"].size
    parts = np.append(parts, np.repeat(partitions, length))
    millis = np.append(millis, df["TOTAL_PROCESSING"].as_matrix())
results = pd.DataFrame({"Particiones": parts, "Milisegundos de procesamiento": millis})
sns.factorplot(x="Particiones", y="Milisegundos de procesamiento", data=results, kind="bar", size=7)
Out[18]:
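A numeric summary of the same experiment; a small sketch over the results DataFrame built above:

In [ ]:
# Mean processing time (ms) per number of partitions
results.groupby("Particiones")["Milisegundos de procesamiento"].mean()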
In [145]:
elems = np.array([])
mode = np.array([])
micros = np.array([])
for elements in [480000, 1200000, 2400000, 3600000]:
    for executors in [4, 16]:
        df = pd.read_csv("cesga/results-"+str(elements)+"-1-1-"+str(elements/30)+"-"+str(executors)+"-1.csv")
        length = df["TOTAL_PROCESSING"].size
        elems = np.append(elems, np.repeat(elements, length))
        mode = np.append(mode, np.repeat("CESGA-1-"+str(executors), length))
        micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
results = pd.DataFrame({"Elementos": elems, "Modo": mode, "Microsegundos por elemento": micros})
sns.factorplot(x="Elementos", y="Microsegundos por elemento", col="Modo", col_wrap=3, data=results, kind="bar", size=5)
Out[145]:
In [84]:
import matplotlib.pyplot as plt
import os.path
f, ax = plt.subplots(1, 3, figsize=(11, 7))
f.tight_layout()
cmap = sns.color_palette("Blues", n_colors=1000)
row = 0
for version in [1, 20, 50]:
    elems = np.array([])
    mode = np.array([])
    micros = np.array([])
    for elements in [480000, 1200000, 2400000, 3600000]:
        if version == 1:
            strVersion = "light"
        elif version == 20:
            strVersion = "medium"
        elif version == 50:
            strVersion = "hard"
        df = pd.read_csv("local/"+strVersion+"-file-elements-"+str(elements)+".csv")
        length = df["TOTAL_PROCESSING"].size
        elems = np.append(elems, np.repeat(elements, length))
        mode = np.append(mode, np.repeat("PARALELO", length))
        micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
        df = pd.read_csv("local/"+strVersion+"-original-file-elements-"+str(elements)+".csv")
        length = df["TOTAL_PROCESSING"].size
        elems = np.append(elems, np.repeat(elements, length))
        mode = np.append(mode, np.repeat("ORIGINAL", length))
        micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
        for executors in [4, 16]:
            df = pd.read_csv("cesga/results-"+str(elements)+"-"+str(version)+"-"+str(version)+"-"+str(elements/30)+"-"+str(executors)+"-1.csv")
            length = df["TOTAL_PROCESSING"].size
            elems = np.append(elems, np.repeat(elements, length))
            mode = np.append(mode, np.repeat("CESGA-"+str(executors).zfill(2)+"-1", length))
            micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
        df = pd.read_csv("cesga/results-"+str(elements)+"-"+str(version)+"-"+str(version)+"-"+str(elements/30)+"-2-8.csv")
        length = df["TOTAL_PROCESSING"].size
        elems = np.append(elems, np.repeat(elements, length))
        mode = np.append(mode, np.repeat("CESGA-02-8", length))
        micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
        df = pd.read_csv("cesga/results-"+str(elements)+"-"+str(version)+"-"+str(version)+"-"+str(elements/30)+"-8-2.csv")
        length = df["TOTAL_PROCESSING"].size
        elems = np.append(elems, np.repeat(elements, length))
        mode = np.append(mode, np.repeat("CESGA-08-2", length))
        micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
    results = pd.DataFrame({"Documentos": elems.astype(int), "Modo": mode, "Microsegundos por documento": micros})
    grouped = results.groupby(['Documentos', 'Modo'], as_index=False).mean()
    pivoted = grouped.pivot("Modo", "Documentos", "Microsegundos por documento")
    sns.heatmap(pivoted, annot=True, linewidths=.5, fmt="1.2f", ax=ax[row], cmap=cmap, cbar=False, annot_kws={"size": 14})
    row += 1
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.3)
plt.show()
In [16]:
import matplotlib.pyplot as plt
import os.path
cmap = sns.color_palette("Blues", n_colors=1000)
f, ax = plt.subplots(1,1, figsize=(12.95, 4.5))
labels = pd.DataFrame(columns=["Modo", "Documentos", "Candidato"])
results = pd.DataFrame(columns=["Modo", "Documentos", "Speedup"])
for version in [1, 20, 50]:
    for elements in [480000, 1200000, 2400000]:
        if version == 1:
            strVersion = "light"
            labelVersion = u"1 entidad\n1 versión"
        elif version == 20:
            strVersion = "medium"
            labelVersion = "20 entidades\n20 versiones"
        elif version == 50:
            strVersion = "hard"
            labelVersion = "50 entidades\n50 versiones"
        # Local parallel run: mean microseconds per document
        df = pd.read_csv("local/"+strVersion+"-file-elements-"+str(elements)+".csv")
        bestMode = "Local"
        bestMicros = (df["TOTAL_PROCESSING"]*1000/elements).mean()
        # Original (sequential) implementation: the speedup baseline
        df = pd.read_csv("local/"+strVersion+"-original-file-elements-"+str(elements)+".csv")
        originalMicros = (df["TOTAL_PROCESSING"]*1000/elements).mean()
        if originalMicros < bestMicros:
            bestMicros = originalMicros
            bestMode = "Original"
        # CESGA runs with 4 and 16 executors, 1 core each
        for executors in [4, 16]:
            df = pd.read_csv("cesga/results-"+str(elements)+"-"+str(version)+"-"+str(version)+"-"+str(elements/30)+"-"+str(executors)+"-1.csv")
            candidateMicros = (df["TOTAL_PROCESSING"]*1000/elements).mean()
            if candidateMicros < bestMicros:
                bestMicros = candidateMicros
                bestMode = "CESGA\n" + str(executors) + " executors 1 core"
        # CESGA run with 2 executors and 8 cores
        df = pd.read_csv("cesga/results-"+str(elements)+"-"+str(version)+"-"+str(version)+"-"+str(elements/30)+"-2-8.csv")
        candidateMicros = (df["TOTAL_PROCESSING"]*1000/elements).mean()
        if candidateMicros < bestMicros:
            bestMicros = candidateMicros
            bestMode = "CESGA\n2 executors 8 cores"
        # CESGA run with 8 executors and 2 cores
        df = pd.read_csv("cesga/results-"+str(elements)+"-"+str(version)+"-"+str(version)+"-"+str(elements/30)+"-8-2.csv")
        candidateMicros = (df["TOTAL_PROCESSING"]*1000/elements).mean()
        if candidateMicros < bestMicros:
            bestMicros = candidateMicros
            bestMode = "CESGA\n8 executors 2 cores"
        speedup = originalMicros/bestMicros
        bestMode += "\nSpeedup: " + "{0:.2f}".format(speedup)
        results = results.append({"Modo": labelVersion, "Documentos": elements, "Speedup": speedup}, ignore_index=True)
        labels = labels.append({"Modo": labelVersion, "Documentos": elements, "Candidato": bestMode}, ignore_index=True)
results["Documentos"] = results["Documentos"].astype(int)
results = results.pivot("Modo", "Documentos", "Speedup")
labels = labels.pivot("Modo", "Documentos", "Candidato")
sns.heatmap(results, annot=labels, linewidths=.5, fmt="", cmap=cmap, cbar=False, annot_kws={"size": 16}, ax=ax)
ax.set_ylabel('')
ax.set_xlabel("Documentos", fontsize=14)
ax.tick_params(labelsize="large")
plt.yticks(rotation=0)
plt.show()
In [13]:
elems = np.array([])
mode = np.array([])
micros = np.array([])
for elements in [480000, 1200000, 2400000, 3600000]:
    for executors in [4, 16]:
        df = pd.read_csv("cesga/results-"+str(elements)+"-1-1-"+str(elements/30)+"-"+str(executors)+"-1.csv")
        length = df["TOTAL_PROCESSING"].size
        elems = np.append(elems, np.repeat(elements, length))
        mode = np.append(mode, np.repeat("CESGA-"+str(executors).zfill(2)+"-1", length))
        micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
results = pd.DataFrame({"Documentos": elems.astype(int), "Modo": mode, "Microsegundos por documento": micros})
sns.factorplot(x="Documentos", y="Microsegundos por documento", col="Modo", col_wrap=3, data=results, kind="bar", size=3)
Out[13]:
In [128]:
elems = np.array([])
mode = np.array([])
micros = np.array([])
for elements in [480000, 1200000, 2400000, 3600000]:
    for executors in [4, 16]:
        df = pd.read_csv("cesga/results-"+str(elements)+"-50-50-"+str(elements/30)+"-"+str(executors)+"-1.csv")
        length = df["TOTAL_PROCESSING"].size
        elems = np.append(elems, np.repeat(elements, length))
        mode = np.append(mode, np.repeat("CESGA "+str(executors)+" Executors", length))
        micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
results = pd.DataFrame({"Documentos": elems.astype(int), "Modo": mode, "Microsegundos por documento": micros})
sns.factorplot(x="Documentos", y="Microsegundos por documento", col="Modo", col_wrap=3, data=results, kind="bar", size=3.5)
Out[128]:
In [15]:
elems = np.array([])
mode = np.array([])
micros = np.array([])
for elements in [480000, 1200000, 2400000, 3600000]:
    df = pd.read_csv("cesga/results-"+str(elements)+"-1-1-"+str(elements/30)+"-16-1.csv")
    length = df["TOTAL_PROCESSING"].size
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("CESGA-16-1", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
    df = pd.read_csv("cesga/results-"+str(elements)+"-1-1-"+str(elements/30)+"-8-2.csv")
    length = df["TOTAL_PROCESSING"].size
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("CESGA-08-2", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
    df = pd.read_csv("cesga/results-"+str(elements)+"-1-1-"+str(elements/30)+"-2-8.csv")
    length = df["TOTAL_PROCESSING"].size
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("CESGA-02-8", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
results = pd.DataFrame({"Documentos": elems.astype(int), "Modo": mode, "Microsegundos por documento": micros})
sns.factorplot(x="Documentos", y="Microsegundos por documento", col="Modo", col_wrap=3, data=results, kind="bar", size=4)
Out[15]:
In [8]:
ents = np.array([])
mode = np.array([])
millis = np.array([])
for entities in [1, 50, 100, 200, 400]:
    df = pd.read_csv("local/file-entities-"+str(entities)+".csv")
    length = df["TOTAL_PROCESSING"].size
    ents = np.append(ents, np.repeat(entities, length))
    mode = np.append(mode, np.repeat("Paralelo", length))
    millis = np.append(millis, df["TOTAL_PROCESSING"].as_matrix())
    df = pd.read_csv("local/original-file-entities-"+str(entities)+".csv")
    length = df["TOTAL_PROCESSING"].size
    ents = np.append(ents, np.repeat(entities, length))
    mode = np.append(mode, np.repeat("Original", length))
    millis = np.append(millis, df["TOTAL_PROCESSING"].as_matrix())
results = pd.DataFrame({"Entidades": ents.astype(int), "Modo": mode, "Milisegundos de procesamiento": millis})
sns.factorplot(x="Entidades", y="Milisegundos de procesamiento", col="Modo", data=results, kind="bar", size=3.5)
Out[8]:
In [9]:
vers = np.array([])
mode = np.array([])
millis = np.array([])
for versions in [1, 50, 100, 200, 400]:
    df = pd.read_csv("local/file-versions-"+str(versions)+".csv")
    length = df["TOTAL_PROCESSING"].size
    vers = np.append(vers, np.repeat(versions, length))
    mode = np.append(mode, np.repeat("Paralelo", length))
    millis = np.append(millis, df["TOTAL_PROCESSING"].as_matrix())
    df = pd.read_csv("local/original-file-versions-"+str(versions)+".csv")
    length = df["TOTAL_PROCESSING"].size  # recompute: the two files need not have the same number of rows
    vers = np.append(vers, np.repeat(versions, length))
    mode = np.append(mode, np.repeat("Original", length))
    millis = np.append(millis, df["TOTAL_PROCESSING"].as_matrix())
results = pd.DataFrame({"Versiones": vers.astype(int), "Modo": mode, "Milisegundos de procesamiento": millis})
sns.factorplot(x="Versiones", y="Milisegundos de procesamiento", col="Modo", data=results, kind="bar", size=3.5)
Out[9]:
In [10]:
elems = np.array([])
mode = np.array([])
micros = np.array([])
for elements in [60000, 120000, 480000, 1200000, 2400000]:
    df = pd.read_csv("local/light-file-elements-"+str(elements)+".csv")
    length = df["TOTAL_PROCESSING"].size
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("Paralelo", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
    df = pd.read_csv("local/light-original-file-elements-"+str(elements)+".csv")
    length = df["TOTAL_PROCESSING"].size  # recompute: the two files need not have the same number of rows
    elems = np.append(elems, np.repeat(elements, length))
    mode = np.append(mode, np.repeat("Original", length))
    micros = np.append(micros, (df["TOTAL_PROCESSING"]*1000/elements).as_matrix())
results = pd.DataFrame({"Documentos": elems.astype(int), "Modo": mode, "Microsegundos por documento": micros})
sns.factorplot(x="Documentos", y="Microsegundos por documento", col="Modo", data=results, kind="bar", size=3.5)
Out[10]:
In [11]:
parts = np.array([])
millis = np.array([])
for partitions in [1, 2, 4, 8, 16]:
    df = pd.read_csv("local/file-partitions-"+str(partitions)+".csv")
    length = df["TOTAL_PROCESSING"].size
    parts = np.append(parts, np.repeat(partitions, length))
    millis = np.append(millis, df["TOTAL_PROCESSING"].as_matrix())
results = pd.DataFrame({"Ficheros de entrada": parts.astype(int), "Milisegundos de procesamiento": millis})
sns.factorplot(x="Ficheros de entrada", y="Milisegundos de procesamiento", data=results, kind="bar", size=3.5)
Out[11]: