Profiler performance

We use part of the Instacart dataset, available at https://www.instacart.com/datasets/grocery-shopping-2017

Specifically, we use order_products__prior.csv, a CSV file with 4 columns and 32.4 million rows.

Processing the entire dataset took 355.58 seconds on a Windows 10 machine.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("..")

In [3]:
# Create optimus
from optimus import Optimus
op = Optimus(master="local[*]", app_name="optimus", verbose=True, checkpoint=True)


Just check that Spark and all necessary environments vars are present...
-----
SPARK_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7
HADOOP_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7
You don't have PYSPARK_PYTHON set
You don't have PYSPARK_DRIVER_PYTHON set
JAVA_HOME=C:\Program Files\Java\jdk1.8.0_181
Pyarrow Installed
-----
Starting or getting SparkSession and SparkContext...

                             ____        __  _                     
                            / __ \____  / /_(_)___ ___  __  _______
                           / / / / __ \/ __/ / __ `__ \/ / / / ___/
                          / /_/ / /_/ / /_/ / / / / / / /_/ (__  ) 
                          \____/ .___/\__/_/_/ /_/ /_/\__,_/____/  
                              /_/                                  
                              
Transform and Roll out...
Setting checkpoint folder local. If you are in a cluster initialize Optimus with master='your_ip' as param
Deleting previous folder if exists...
Creating the checkpoint directory...
Optimus successfully imported. Have fun :).

Benchmark


In [4]:
df = op.load.csv("C:\\Users\\argenisleon\\Desktop\\order_products__prior.csv")

In [5]:
df.table()


Viewing 100 of 32.4 million rows / 4 columns
8 partition(s)
order_id (int, nullable)  product_id (int, nullable)  add_to_cart_order (int, nullable)  reordered (int, nullable)
2 33120 1 1
2 28985 2 1
2 9327 3 0
2 45918 4 1
2 30035 5 0
2 17794 6 1
2 40141 7 1
2 1819 8 1
2 43668 9 0
3 33754 1 1
3 24838 2 1
3 17704 3 1
3 21903 4 1
3 17668 5 1
3 46667 6 1
3 17461 7 1
3 32665 8 1
4 46842 1 0
4 26434 2 1
4 39758 3 1
4 27761 4 1
4 10054 5 1
4 21351 6 1
4 22598 7 1
4 34862 8 1
4 40285 9 1
4 17616 10 1
4 25146 11 1
4 32645 12 1
4 41276 13 1
5 13176 1 1
5 15005 2 1
5 47329 3 1
5 27966 4 1
5 23909 5 1
5 48370 6 1
5 13245 7 1
5 9633 8 1
5 27360 9 1
5 6348 10 1
5 40878 11 1
5 6184 12 1
5 48002 13 1
5 20914 14 1
5 37011 15 1
5 12962 16 1
5 45698 17 1
5 24773 18 1
5 18569 19 1
5 41176 20 1
5 48366 21 1
5 47209 22 0
5 46522 23 0
5 38693 24 0
5 48825 25 0
5 8479 26 0
6 40462 1 0
6 15873 2 0
6 41897 3 0
7 34050 1 0
7 46802 2 0
8 23423 1 1
9 21405 1 0
9 47890 2 1
9 11182 3 0
9 2014 4 1
9 29193 5 1
9 34203 6 1
9 14992 7 1
9 31506 8 1
9 23288 9 0
9 44533 10 1
9 18362 11 0
9 27366 12 1
9 432 13 1
9 3990 14 1
9 14183 15 0
10 24852 1 1
10 4796 2 1
10 31717 3 0
10 47766 4 1
10 4605 5 1
10 1529 6 0
10 21137 7 1
10 22122 8 1
10 34134 9 1
10 27156 10 0
10 14992 11 0
10 49235 12 1
10 26842 13 0
10 3464 14 0
10 25720 15 0
11 30162 1 1
11 27085 2 1
11 5994 3 1
11 1313 4 1
11 31506 5 1
12 30597 1 1
12 15221 2 1
12 43772 3 1

In [10]:
op.profiler.run(df, "order_id", infer=False, relative_error=1)


Processing column 'order_id'...
_count_data_types() executed in 18.69 sec
count_data_types() executed in 18.69 sec
cast_columns() executed in 0.01 sec
_exprs() executed in 16.04 sec
general_stats() executed in 16.05 sec
------------------------------
Processing column 'order_id'...
frequency() executed in 23.65 sec
stats_by_column() executed in 8.83 sec
percentile() executed in 12.21 sec
extra_numeric_stats() executed in 37.45 sec
bucketizer() executed in 0.29 sec
hist() executed in 14.6 sec
dataset_info() executed in 22.43 sec

Overview

Dataset info

Number of columns 4
Number of rows 32434489
Total Missing (%) 0.0%
Total size in memory 188.4 MB

Column types

String 0
Numeric 1
Date 0
Bool 0
Array 0
Not available 0

order_id

numeric
Unique 3025302
Unique (%) 9.327
Missing 0.0
Missing (%) 0

Datatypes

String 0
Integer 32434489
Float 0
Bool 0
Date 0
Missing 0
Null 0

Basic Stats

Mean 1710748.5189427834
Minimum 2
Maximum 3421083
Zeros(%) 0

Frequency

Value Count Frequency (%)
1564244 145 0.0%
790903 137 0.0%
61355 127 0.0%
2970392 121 0.0%
2069920 116 0.0%
3308010 115 0.0%
2753324 114 0.0%
2499774 112 0.0%
2621625 109 0.0%
77151 109 0.0%
"Missing" 0 0.0%

Quantile statistics

Minimum 2
5-th percentile 2.0
Q1 2.0
Median 2.0
Q3 2.0
95-th percentile 2.0
Maximum 3421083
Range 3421081
Interquartile range 0.0

Descriptive statistics

Standard deviation 987300.6964529774
Coef of variation 0.57712
Kurtosis -1.199128348852751
Mean 1710748.5189427834
MAD 0.0
Skewness 0
Sum 55487254019416
Variance 974762665216.534
Viewing 10 of 32.4 million rows / 4 columns
8 partition(s)
order_id (int, nullable)  product_id (int, nullable)  add_to_cart_order (int, nullable)  reordered (int, nullable)
2 33120 1 1
2 28985 2 1
2 9327 3 0
2 45918 4 1
2 30035 5 0
2 17794 6 1
2 40141 7 1
2 1819 8 1
2 43668 9 0
3 33754 1 1
run() executed in 186.8 sec
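
Every quantile reported above, from the 5-th to the 95-th percentile, equals the column minimum (2.0), which also makes the interquartile range 0.0. One plausible explanation, assuming Optimus forwards relative_error to Spark's DataFrame.approxQuantile (this notebook does not confirm that), is that relative_error=1 places no constraint on the result. Spark documents the guarantee as floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The pure-Python sketch below applies that bound to a toy sample; acceptable_values is a hypothetical helper, not an Optimus or Spark function.

```python
import math


def acceptable_values(sorted_values, p, relative_error):
    """Return the values whose 1-based rank satisfies the documented bound
    floor((p - err) * N) <= rank <= ceil((p + err) * N)."""
    n = len(sorted_values)
    lowest_rank = math.floor((p - relative_error) * n)
    highest_rank = math.ceil((p + relative_error) * n)
    return [
        value
        for rank, value in enumerate(sorted_values, start=1)
        if lowest_rank <= rank <= highest_rank
    ]


# Toy sample: the ten distinct order_id values visible in the preview table.
sample = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# With relative_error=0 only the exact median rank is acceptable.
exact_median = acceptable_values(sample, p=0.5, relative_error=0)

# With relative_error=1 every value in the column is an acceptable "median",
# so an implementation is free to return the minimum.
loose_median = acceptable_values(sample, p=0.5, relative_error=1)

print(exact_median)
print(len(loose_median) == len(sample))
```

If that assumption holds, a smaller relative_error would yield meaningful quantiles, at the cost of a longer percentile() step.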

In [11]:
op.profiler.run(df, "order_id", infer=True, relative_error=1)


Processing column 'order_id'...
_count_data_types() executed in 21.72 sec
count_data_types() executed in 21.72 sec
cast_columns() executed in 0.01 sec
_exprs() executed in 17.72 sec
general_stats() executed in 17.73 sec
------------------------------
Processing column 'order_id'...
frequency() executed in 25.8 sec
stats_by_column() executed in 9.99 sec
percentile() executed in 13.46 sec
extra_numeric_stats() executed in 39.63 sec
bucketizer() executed in 0.3 sec
hist() executed in 14.25 sec
dataset_info() executed in 22.55 sec

Overview

Dataset info

Number of columns 4
Number of rows 32434489
Total Missing (%) 0.0%
Total size in memory 8.3 MB

Column types

String 0
Numeric 1
Date 0
Bool 0
Array 0
Not available 0

order_id

numeric
Unique 3025302
Unique (%) 9.327
Missing 0.0
Missing (%) 0

Datatypes

String 0
Integer 32434489
Float 0
Bool 0
Date 0
Missing 0
Null 0

Basic Stats

Mean 1710748.5189427834
Minimum 2
Maximum 3421083
Zeros(%) 0

Frequency

Value Count Frequency (%)
1564244 145 0.0%
790903 137 0.0%
61355 127 0.0%
2970392 121 0.0%
2069920 116 0.0%
3308010 115 0.0%
2753324 114 0.0%
2499774 112 0.0%
2621625 109 0.0%
77151 109 0.0%
"Missing" 0 0.0%

Quantile statistics

Minimum 2
5-th percentile 2.0
Q1 2.0
Median 2.0
Q3 2.0
95-th percentile 2.0
Maximum 3421083
Range 3421081
Interquartile range 0.0

Descriptive statistics

Standard deviation 987300.6964529774
Coef of variation 0.57712
Kurtosis -1.199128348852751
Mean 1710748.5189427834
MAD 0.0
Skewness 0
Sum 55487254019416
Variance 974762665216.534
Viewing 10 of 32.4 million rows / 4 columns
8 partition(s)
order_id (int, nullable)  product_id (int, nullable)  add_to_cart_order (int, nullable)  reordered (int, nullable)
2 33120 1 1
2 28985 2 1
2 9327 3 0
2 45918 4 1
2 30035 5 0
2 17794 6 1
2 40141 7 1
2 1819 8 1
2 43668 9 0
3 33754 1 1
run() executed in 199.09 sec
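
The two runs differ only in the infer flag, so their step timings can be compared line by line. The sketch below is not part of the original notebook: it parses the "executed in" lines copied from the two outputs above and prints the difference per step. The names parse_timings and compare are made up for illustration.

```python
import re

# Timing lines copied verbatim from the infer=False run (In [10]).
LOG_INFER_FALSE = """\
_count_data_types() executed in 18.69 sec
count_data_types() executed in 18.69 sec
cast_columns() executed in 0.01 sec
_exprs() executed in 16.04 sec
general_stats() executed in 16.05 sec
frequency() executed in 23.65 sec
stats_by_column() executed in 8.83 sec
percentile() executed in 12.21 sec
extra_numeric_stats() executed in 37.45 sec
bucketizer() executed in 0.29 sec
hist() executed in 14.6 sec
dataset_info() executed in 22.43 sec
run() executed in 186.8 sec
"""

# Timing lines copied verbatim from the infer=True run (In [11]).
LOG_INFER_TRUE = """\
_count_data_types() executed in 21.72 sec
count_data_types() executed in 21.72 sec
cast_columns() executed in 0.01 sec
_exprs() executed in 17.72 sec
general_stats() executed in 17.73 sec
frequency() executed in 25.8 sec
stats_by_column() executed in 9.99 sec
percentile() executed in 13.46 sec
extra_numeric_stats() executed in 39.63 sec
bucketizer() executed in 0.3 sec
hist() executed in 14.25 sec
dataset_info() executed in 22.55 sec
run() executed in 199.09 sec
"""

TIMING_LINE = re.compile(r"^(\w+)\(\) executed in ([\d.]+) sec$", re.MULTILINE)


def parse_timings(log):
    """Map each step name to its reported duration in seconds."""
    return {name: float(seconds) for name, seconds in TIMING_LINE.findall(log)}


def compare(before, after):
    """Return (step, before, after, delta) tuples, in the order steps were logged."""
    return [
        (name, t0, after[name], round(after[name] - t0, 2))
        for name, t0 in before.items()
    ]


base = parse_timings(LOG_INFER_FALSE)
inferred = parse_timings(LOG_INFER_TRUE)

print(f"{'step':22s} {'False':>8s} {'True':>8s} {'delta':>7s}")
for name, t0, t1, delta in compare(base, inferred):
    print(f"{name:22s} {t0:8.2f} {t1:8.2f} {delta:+7.2f}")
```

On these logs run() goes from 186.8 to 199.09 sec, that is 12.29 sec (roughly 6.6%) slower with infer=True. With a single run per setting, the gap may well be within run-to-run noise.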