Dataset info

| Metric | Value |
|---|---|
| Number of columns | 4 |
| Number of rows | 32434489 |
| Total Missing (%) | 0.0% |
| Total size in memory | 188.4 MB |

Column types

| Type | Count |
|---|---|
| String | 0 |
| Numeric | 1 |
| Date | 0 |
| Bool | 0 |
| Array | 0 |
| Not available | 0 |
We use part of the Instacart data set, which you can find here: https://www.instacart.com/datasets/grocery-shopping-2017
Specifically, we use order_products__prior.csv, a CSV file with 4 columns and 32.4 million rows.
It took 355.58 seconds to process the whole data set on a Windows 10 machine.
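To put these figures in perspective, the following minimal sketch derives an approximate throughput from the reported row count and the stated processing time. The names `TOTAL_ROWS`, `ELAPSED_SECONDS` and `rows_per_second` are illustrative, and the result depends on the hardware, so it should be read as a rough order of magnitude rather than a benchmark.

```python
# Figures quoted in this notebook.
TOTAL_ROWS = 32_434_489      # number of rows reported in the dataset info
ELAPSED_SECONDS = 355.58     # processing time stated in the introduction


def rows_per_second(rows: int, seconds: float) -> float:
    """Average number of rows processed per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows / seconds


throughput = rows_per_second(TOTAL_ROWS, ELAPSED_SECONDS)
print(f"{TOTAL_ROWS / 1e6:.1f} million rows, about {throughput:,.0f} rows/s")
```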
In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
import sys
sys.path.append("..")
In [3]:
# Create optimus
from optimus import Optimus
op = Optimus(master="local[*]", app_name="optimus", verbose=True, checkpoint=True)
Just check that Spark and all necessary environments vars are present...
-----
SPARK_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7
HADOOP_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7
You don't have PYSPARK_PYTHON set
You don't have PYSPARK_DRIVER_PYTHON set
JAVA_HOME=C:\Program Files\Java\jdk1.8.0_181
Pyarrow Installed
-----
Starting or getting SparkSession and SparkContext...
   ____        __  _
  / __ \____  / /_(_)___ ___  __  _______
 / / / / __ \/ __/ / __ `__ \/ / / / ___/
/ /_/ / /_/ / /_/ / / / / / / /_/ (__  )
\____/ .___/\__/_/_/ /_/ /_/\__,_/____/
    /_/
Transform and Roll out...
Setting checkpoint folder local. If you are in a cluster initialize Optimus with master='your_ip' as param
Deleting previous folder if exists...
Creating the checkpoint directory...
Optimus successfully imported. Have fun :).
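The startup log above reports which environment variables were found. A similar check can be sketched in plain Python before creating the session. The helper name `missing_spark_env` and the list of variables are assumptions taken from the log output, not part of the Optimus API.

```python
import os

# Variables mentioned in the startup log (an assumed list, not an official one).
SPARK_ENV_VARS = [
    "SPARK_HOME",
    "HADOOP_HOME",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
    "JAVA_HOME",
]


def missing_spark_env(env=None):
    """Return the variables from SPARK_ENV_VARS that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in SPARK_ENV_VARS if not env.get(name)]


# Report anything missing from the current process environment.
for name in missing_spark_env():
    print(f"You don't have {name} set")
```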
In [4]:
df = op.load.csv("C:\\Users\\argenisleon\\Desktop\\order_products__prior.csv")
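The path in this cell uses doubled backslashes because a single backslash starts an escape sequence in a regular Python string. As a small illustration, not specific to Optimus, a raw string denotes the same value, and `pathlib` can parse it on any operating system:

```python
from pathlib import PureWindowsPath

escaped = "C:\\Users\\argenisleon\\Desktop\\order_products__prior.csv"
raw = r"C:\Users\argenisleon\Desktop\order_products__prior.csv"

# Both literals denote the same string.
same = escaped == raw

# PureWindowsPath parses Windows-style paths without touching the file system.
path = PureWindowsPath(raw)
print(same, path.name, path.suffix)
```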
In [5]:
df.table()
Viewing 100 of 32.4 million rows / 4 columns
8 partition(s)
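| order_id (int, nullable) | product_id (int, nullable) | add_to_cart_order (int, nullable) | reordered (int, nullable) |
|---|---|---|---|
| 2 | 33120 | 1 | 1 |
| 2 | 28985 | 2 | 1 |
| 2 | 9327 | 3 | 0 |
| 2 | 45918 | 4 | 1 |
| 2 | 30035 | 5 | 0 |
| 2 | 17794 | 6 | 1 |
| 2 | 40141 | 7 | 1 |
| 2 | 1819 | 8 | 1 |
| 2 | 43668 | 9 | 0 |
| 3 | 33754 | 1 | 1 |
| 3 | 24838 | 2 | 1 |
| 3 | 17704 | 3 | 1 |
| 3 | 21903 | 4 | 1 |
| 3 | 17668 | 5 | 1 |
| 3 | 46667 | 6 | 1 |
| 3 | 17461 | 7 | 1 |
| 3 | 32665 | 8 | 1 |
| 4 | 46842 | 1 | 0 |
| 4 | 26434 | 2 | 1 |
| 4 | 39758 | 3 | 1 |
| 4 | 27761 | 4 | 1 |
| 4 | 10054 | 5 | 1 |
| 4 | 21351 | 6 | 1 |
| 4 | 22598 | 7 | 1 |
| 4 | 34862 | 8 | 1 |
| 4 | 40285 | 9 | 1 |
| 4 | 17616 | 10 | 1 |
| 4 | 25146 | 11 | 1 |
| 4 | 32645 | 12 | 1 |
| 4 | 41276 | 13 | 1 |
| 5 | 13176 | 1 | 1 |
| 5 | 15005 | 2 | 1 |
| 5 | 47329 | 3 | 1 |
| 5 | 27966 | 4 | 1 |
| 5 | 23909 | 5 | 1 |
| 5 | 48370 | 6 | 1 |
| 5 | 13245 | 7 | 1 |
| 5 | 9633 | 8 | 1 |
| 5 | 27360 | 9 | 1 |
| 5 | 6348 | 10 | 1 |
| 5 | 40878 | 11 | 1 |
| 5 | 6184 | 12 | 1 |
| 5 | 48002 | 13 | 1 |
| 5 | 20914 | 14 | 1 |
| 5 | 37011 | 15 | 1 |
| 5 | 12962 | 16 | 1 |
| 5 | 45698 | 17 | 1 |
| 5 | 24773 | 18 | 1 |
| 5 | 18569 | 19 | 1 |
| 5 | 41176 | 20 | 1 |
| 5 | 48366 | 21 | 1 |
| 5 | 47209 | 22 | 0 |
| 5 | 46522 | 23 | 0 |
| 5 | 38693 | 24 | 0 |
| 5 | 48825 | 25 | 0 |
| 5 | 8479 | 26 | 0 |
| 6 | 40462 | 1 | 0 |
| 6 | 15873 | 2 | 0 |
| 6 | 41897 | 3 | 0 |
| 7 | 34050 | 1 | 0 |
| 7 | 46802 | 2 | 0 |
| 8 | 23423 | 1 | 1 |
| 9 | 21405 | 1 | 0 |
| 9 | 47890 | 2 | 1 |
| 9 | 11182 | 3 | 0 |
| 9 | 2014 | 4 | 1 |
| 9 | 29193 | 5 | 1 |
| 9 | 34203 | 6 | 1 |
| 9 | 14992 | 7 | 1 |
| 9 | 31506 | 8 | 1 |
| 9 | 23288 | 9 | 0 |
| 9 | 44533 | 10 | 1 |
| 9 | 18362 | 11 | 0 |
| 9 | 27366 | 12 | 1 |
| 9 | 432 | 13 | 1 |
| 9 | 3990 | 14 | 1 |
| 9 | 14183 | 15 | 0 |
| 10 | 24852 | 1 | 1 |
| 10 | 4796 | 2 | 1 |
| 10 | 31717 | 3 | 0 |
| 10 | 47766 | 4 | 1 |
| 10 | 4605 | 5 | 1 |
| 10 | 1529 | 6 | 0 |
| 10 | 21137 | 7 | 1 |
| 10 | 22122 | 8 | 1 |
| 10 | 34134 | 9 | 1 |
| 10 | 27156 | 10 | 0 |
| 10 | 14992 | 11 | 0 |
| 10 | 49235 | 12 | 1 |
| 10 | 26842 | 13 | 0 |
| 10 | 3464 | 14 | 0 |
| 10 | 25720 | 15 | 0 |
| 11 | 30162 | 1 | 1 |
| 11 | 27085 | 2 | 1 |
| 11 | 5994 | 3 | 1 |
| 11 | 1313 | 4 | 1 |
| 11 | 31506 | 5 | 1 |
| 12 | 30597 | 1 | 1 |
| 12 | 15221 | 2 | 1 |
| 12 | 43772 | 3 | 1 |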
In [10]:
op.profiler.run(df, "order_id", infer=False, relative_error=1)
Processing column 'order_id'...
_count_data_types() executed in 18.69 sec
count_data_types() executed in 18.69 sec
cast_columns() executed in 0.01 sec
_exprs() executed in 16.04 sec
general_stats() executed in 16.05 sec
------------------------------
Processing column 'order_id'...
frequency() executed in 23.65 sec
stats_by_column() executed in 8.83 sec
percentile() executed in 12.21 sec
extra_numeric_stats() executed in 37.45 sec
bucketizer() executed in 0.29 sec
hist() executed in 14.6 sec
dataset_info() executed in 22.43 sec
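Overview

Dataset info

| Metric | Value |
|---|---|
| Number of columns | 4 |
| Number of rows | 32434489 |
| Total Missing (%) | 0.0% |
| Total size in memory | 188.4 MB |

Column types

| Type | Count |
|---|---|
| String | 0 |
| Numeric | 1 |
| Date | 0 |
| Bool | 0 |
| Array | 0 |
| Not available | 0 |

order_id (numeric)

| Metric | Value |
|---|---|
| Unique | 3025302 |
| Unique (%) | 9.327 |
| Missing | 0.0 |
| Missing (%) | 0 |

Datatypes

| Type | Count |
|---|---|
| String | 0 |
| Integer | 32434489 |
| Float | 0 |
| Bool | 0 |
| Date | 0 |
| Missing | 0 |
| Null | 0 |

Basic Stats

| Statistic | Value |
|---|---|
| Mean | 1710748.5189427834 |
| Minimum | 2 |
| Maximum | 3421083 |
| Zeros(%) | 0 |

Frequency

| Value | Count | Frequency (%) |
|---|---|---|
| 1564244 | 145 | 0.0% |
| 790903 | 137 | 0.0% |
| 61355 | 127 | 0.0% |
| 2970392 | 121 | 0.0% |
| 2069920 | 116 | 0.0% |
| 3308010 | 115 | 0.0% |
| 2753324 | 114 | 0.0% |
| 2499774 | 112 | 0.0% |
| 2621625 | 109 | 0.0% |
| 77151 | 109 | 0.0% |
| "Missing" | 0 | 0.0% |

Quantile statistics

| Statistic | Value |
|---|---|
| Minimum | 2 |
| 5-th percentile | 2.0 |
| Q1 | 2.0 |
| Median | 2.0 |
| Q3 | 2.0 |
| 95-th percentile | 2.0 |
| Maximum | 3421083 |
| Range | 3421081 |
| Interquartile range | 0.0 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 987300.6964529774 |
| Coef of variation | 0.57712 |
| Kurtosis | -1.199128348852751 |
| Mean | 1710748.5189427834 |
| MAD | 0.0 |
| Skewness | 0 |
| Sum | 55487254019416 |
| Variance | 974762665216.534 |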
Viewing 10 of 32.4 million rows / 4 columns
8 partition(s)
| order_id (int, nullable) | product_id (int, nullable) | add_to_cart_order (int, nullable) | reordered (int, nullable) |
|---|---|---|---|
| 2 | 33120 | 1 | 1 |
| 2 | 28985 | 2 | 1 |
| 2 | 9327 | 3 | 0 |
| 2 | 45918 | 4 | 1 |
| 2 | 30035 | 5 | 0 |
| 2 | 17794 | 6 | 1 |
| 2 | 40141 | 7 | 1 |
| 2 | 1819 | 8 | 1 |
| 2 | 43668 | 9 | 0 |
| 3 | 33754 | 1 | 1 |
run() executed in 186.8 sec
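Every quantile in this report equals the minimum value, 2.0, even though the mean is about 1.7 million. That is what one would expect if `relative_error` is forwarded to Spark's `approxQuantile`, which is an assumption about Optimus internals and is not confirmed here. Spark documents the guarantee of that method as a rank bound: for N rows, probability p and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below, with a hypothetical helper `acceptable_rank_range`, shows that an error of 1 makes every rank acceptable, so any value, including the minimum, is a valid answer.

```python
import math


def acceptable_rank_range(n_rows: int, p: float, err: float) -> tuple:
    """Rank interval allowed by the documented approxQuantile bound,
    clamped to the valid ranks [0, n_rows]."""
    low = max(0, math.floor((p - err) * n_rows))
    high = min(n_rows, math.ceil((p + err) * n_rows))
    return low, high


N = 32_434_489  # row count taken from the report

for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    loose = acceptable_rank_range(N, p, err=1.0)
    tight = acceptable_rank_range(N, p, err=0.01)
    print(f"p={p}: err=1 -> {loose}, err=0.01 -> {tight}")
```

Under the same assumption, a smaller value such as `relative_error=0.01` should yield more informative quantiles, and with them a non-zero interquartile range, at the cost of more computation. That configuration was not measured in this notebook.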
In [11]:
op.profiler.run(df, "order_id", infer=True, relative_error=1)
Processing column 'order_id'...
_count_data_types() executed in 21.72 sec
count_data_types() executed in 21.72 sec
cast_columns() executed in 0.01 sec
_exprs() executed in 17.72 sec
general_stats() executed in 17.73 sec
------------------------------
Processing column 'order_id'...
frequency() executed in 25.8 sec
stats_by_column() executed in 9.99 sec
percentile() executed in 13.46 sec
extra_numeric_stats() executed in 39.63 sec
bucketizer() executed in 0.3 sec
hist() executed in 14.25 sec
dataset_info() executed in 22.55 sec
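Overview

Dataset info

| Metric | Value |
|---|---|
| Number of columns | 4 |
| Number of rows | 32434489 |
| Total Missing (%) | 0.0% |
| Total size in memory | 8.3 MB |

Column types

| Type | Count |
|---|---|
| String | 0 |
| Numeric | 1 |
| Date | 0 |
| Bool | 0 |
| Array | 0 |
| Not available | 0 |

order_id (numeric)

| Metric | Value |
|---|---|
| Unique | 3025302 |
| Unique (%) | 9.327 |
| Missing | 0.0 |
| Missing (%) | 0 |

Datatypes

| Type | Count |
|---|---|
| String | 0 |
| Integer | 32434489 |
| Float | 0 |
| Bool | 0 |
| Date | 0 |
| Missing | 0 |
| Null | 0 |

Basic Stats

| Statistic | Value |
|---|---|
| Mean | 1710748.5189427834 |
| Minimum | 2 |
| Maximum | 3421083 |
| Zeros(%) | 0 |

Frequency

| Value | Count | Frequency (%) |
|---|---|---|
| 1564244 | 145 | 0.0% |
| 790903 | 137 | 0.0% |
| 61355 | 127 | 0.0% |
| 2970392 | 121 | 0.0% |
| 2069920 | 116 | 0.0% |
| 3308010 | 115 | 0.0% |
| 2753324 | 114 | 0.0% |
| 2499774 | 112 | 0.0% |
| 2621625 | 109 | 0.0% |
| 77151 | 109 | 0.0% |
| "Missing" | 0 | 0.0% |

Quantile statistics

| Statistic | Value |
|---|---|
| Minimum | 2 |
| 5-th percentile | 2.0 |
| Q1 | 2.0 |
| Median | 2.0 |
| Q3 | 2.0 |
| 95-th percentile | 2.0 |
| Maximum | 3421083 |
| Range | 3421081 |
| Interquartile range | 0.0 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 987300.6964529774 |
| Coef of variation | 0.57712 |
| Kurtosis | -1.199128348852751 |
| Mean | 1710748.5189427834 |
| MAD | 0.0 |
| Skewness | 0 |
| Sum | 55487254019416 |
| Variance | 974762665216.534 |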
Viewing 10 of 32.4 million rows / 4 columns
8 partition(s)
| order_id (int, nullable) | product_id (int, nullable) | add_to_cart_order (int, nullable) | reordered (int, nullable) |
|---|---|---|---|
| 2 | 33120 | 1 | 1 |
| 2 | 28985 | 2 | 1 |
| 2 | 9327 | 3 | 0 |
| 2 | 45918 | 4 | 1 |
| 2 | 30035 | 5 | 0 |
| 2 | 17794 | 6 | 1 |
| 2 | 40141 | 7 | 1 |
| 2 | 1819 | 8 | 1 |
| 2 | 43668 | 9 | 0 |
| 3 | 33754 | 1 | 1 |
run() executed in 199.09 sec
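The two runs can be compared step by step by parsing the timing lines of the form `name() executed in N sec` from the logs. This is a minimal sketch using only the standard library. The helper `parse_timings` is illustrative, and the log excerpts are copied from the two profiler cells.

```python
import re

# Matches lines such as "hist() executed in 14.6 sec".
TIMING_RE = re.compile(r"^(\w+)\(\) executed in ([\d.]+) sec$")


def parse_timings(log: str) -> dict:
    """Map each function name to its reported duration in seconds."""
    timings = {}
    for line in log.splitlines():
        match = TIMING_RE.match(line.strip())
        if match:
            timings[match.group(1)] = float(match.group(2))
    return timings


LOG_INFER_FALSE = """
frequency() executed in 23.65 sec
percentile() executed in 12.21 sec
extra_numeric_stats() executed in 37.45 sec
hist() executed in 14.6 sec
run() executed in 186.8 sec
"""

LOG_INFER_TRUE = """
frequency() executed in 25.8 sec
percentile() executed in 13.46 sec
extra_numeric_stats() executed in 39.63 sec
hist() executed in 14.25 sec
run() executed in 199.09 sec
"""

first = parse_timings(LOG_INFER_FALSE)
second = parse_timings(LOG_INFER_TRUE)

# Print both durations and the difference for every step.
for name in first:
    delta = second[name] - first[name]
    print(f"{name:22s} {first[name]:8.2f} {second[name]:8.2f} {delta:+8.2f}")
```

In this single pair of runs, `infer=True` took about 12 seconds longer overall. With only one measurement per setting, that difference may be within run-to-run noise.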
Content source: ironmussa/Optimus