NOTE: We are working hard on version 3.0 (https://github.com/ironmussa/Optimus/tree/develop-3.0), in which you will be able to use Dask or Dask/cuDF as a backend through a unified API.

To launch a live notebook server to test Optimus using Binder or Colab, click on one of the following badges:

Optimus is the missing framework to profile, clean, process and do ML in a distributed fashion using Apache Spark (PySpark).

Installation (pip):

In your terminal just type: pip install optimuspyspark

Requirements

  • Apache Spark >= 2.4.0
  • Python >= 3.6

Examples

You can go to the 10 minutes to Optimus notebook where you can find the basics to start working.

You can also go to the examples folder to find specific notebooks about data cleaning, data munging, profiling, data enrichment and how to create ML and DL models.

Also, check out the Cheat Sheet.

Documentation

Feedback

Feedback is what drives Optimus' future, so please take a couple of minutes to help shape the Optimus Roadmap: http://bit.ly/optimus_survey

Also, if you have a suggestion or feature request, use https://github.com/ironmussa/optimus/issues

Start Optimus


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("..")

In [3]:
from optimus import Optimus
op = Optimus(verbose=True)


INFO:optimus:Operative System:Windows
INFO:optimus:Just check that Spark and all necessary environments vars are present...
INFO:optimus:-----
INFO:optimus:SPARK_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7
INFO:optimus:HADOOP_HOME=C:\opt\hadoop-2.7.7
INFO:optimus:PYSPARK_PYTHON=C:\Users\argenisleon\Anaconda3\python.exe
INFO:optimus:PYSPARK_DRIVER_PYTHON=jupyter
INFO:optimus:PYSPARK_SUBMIT_ARGS=--jars "file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --driver-class-path "C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --conf "spark.sql.catalogImplementation=hive" pyspark-shell
INFO:optimus:JAVA_HOME=C:\java
INFO:optimus:Pyarrow Installed
INFO:optimus:-----
INFO:optimus:Starting or getting SparkSession and SparkContext...
INFO:optimus:Spark Version:2.3.1
INFO:optimus:
                             ____        __  _                     
                            / __ \____  / /_(_)___ ___  __  _______
                           / / / / __ \/ __/ / __ `__ \/ / / / ___/
                          / /_/ / /_/ / /_/ / / / / / / /_/ (__  ) 
                          \____/ .___/\__/_/_/ /_/ /_/\__,_/____/  
                              /_/                                  
                              
INFO:optimus:Transform and Roll out...
INFO:optimus:Optimus successfully imported. Have fun :).
INFO:optimus:Config.ini not found

You can also use an already created Spark session:


In [4]:
from pyspark.sql import SparkSession
from optimus import Optimus

spark = SparkSession.builder.appName('optimus').getOrCreate()
op = Optimus(spark)


INFO:optimus:
                             ____        __  _                     
                            / __ \____  / /_(_)___ ___  __  _______
                           / / / / __ \/ __/ / __ `__ \/ / / / ___/
                          / /_/ / /_/ / /_/ / / / / / / /_/ (__  ) 
                          \____/ .___/\__/_/_/ /_/ /_/\__,_/____/  
                              /_/                                  
                              
INFO:optimus:Transform and Roll out...
INFO:optimus:Optimus successfully imported. Have fun :).
INFO:optimus:Config.ini not found

Loading data

Optimus can load data in CSV, JSON, Parquet, Avro and Excel formats, from a local file or a URL.


In [5]:
#csv
df = op.load.csv("../examples/data/foo.csv")

#json
# Use a local file
df = op.load.json("../examples/data/foo.json")

# Use a URL
df = op.load.json("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json")

# parquet
df = op.load.parquet("../examples/data/foo.parquet")

# avro
# df = op.load.avro("../examples/data/foo.avro").table(5)

# excel 
df = op.load.excel("../examples/data/titanic3.xls")


INFO:optimus:Downloading foo.json from https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json
INFO:optimus:Downloaded 2596 bytes
INFO:optimus:Creating DataFrame for foo.json. Please wait...
INFO:optimus:Successfully created DataFrame for 'foo.json'

You can also load data from Oracle, Redshift, MySQL and Postgres. See Database connection.

Saving Data


In [6]:
#csv
df.save.csv("data/foo.csv")

# json
df.save.json("data/foo.json")

# parquet
df.save.parquet("data/foo.parquet")

# avro
#df.save.avro("examples/data/foo.avro")


INFO:optimus:`pclass`,`survived`,`name`,`sex`,`age`,`sibsp`,`parch`,`ticket`,`fare`,`cabin`,`embarked`,`boat`,`body`,`home_dest` column(s) was not processed because is/are not date,array,vector,binary,null
INFO:optimus:`pclass`,`survived`,`name`,`sex`,`age`,`sibsp`,`parch`,`ticket`,`fare`,`cabin`,`embarked`,`boat`,`body`,`home_dest` column(s) was not processed because is/are not null

You can also save data to Oracle, Redshift, MySQL and Postgres. See Database connection.

Handling Spark jars, packages and repositories

With Optimus it is easy to load jars, packages and repos. You can init Optimus/Spark like this:


In [7]:
op = Optimus(repositories="myrepo", packages="org.apache.spark:spark-avro_2.12:2.4.3", jars="my.jar", driver_class_path="this_is_a_jar_class_path.jar", verbose=True)


INFO:optimus:Operative System:Windows
INFO:optimus:Just check that Spark and all necessary environments vars are present...
INFO:optimus:-----
INFO:optimus:SPARK_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7
INFO:optimus:HADOOP_HOME=C:\opt\hadoop-2.7.7
INFO:optimus:PYSPARK_PYTHON=C:\Users\argenisleon\Anaconda3\python.exe
INFO:optimus:PYSPARK_DRIVER_PYTHON=jupyter
INFO:optimus:PYSPARK_SUBMIT_ARGS=--repositories myrepo --packages org.apache.spark:spark-avro_2.12:2.4.3 --jars "my.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --driver-class-path "this_is_a_jar_class_path.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --conf "spark.sql.catalogImplementation=hive" pyspark-shell
INFO:optimus:JAVA_HOME=C:\java
INFO:optimus:Pyarrow Installed
INFO:optimus:-----
INFO:optimus:Starting or getting SparkSession and SparkContext...
INFO:optimus:Spark Version:2.3.1
INFO:optimus:
                             ____        __  _                     
                            / __ \____  / /_(_)___ ___  __  _______
                           / / / / __ \/ __/ / __ `__ \/ / / / ___/
                          / /_/ / /_/ / /_/ / / / / / / /_/ (__  ) 
                          \____/ .___/\__/_/_/ /_/ /_/\__,_/____/  
                              /_/                                  
                              
INFO:optimus:Transform and Roll out...
INFO:optimus:Optimus successfully imported. Have fun :).
INFO:optimus:Config.ini not found

Create dataframes

You can also create a dataframe from scratch:


In [8]:
from pyspark.sql.types import *
from datetime import date, datetime

df = op.create.df(
    [
        ("names", "str", True), 
        ("height(ft)","int", True), 
        ("function", "str", True), 
        ("rank", "int", True), 
        ("age","int",True),
        ("weight(t)","float",True),
        ("japanese name", ArrayType(StringType()), True),
        ("last position seen", "str", True),
        ("date arrival", "str", True),
        ("last date seen", "str", True),
        ("attributes", ArrayType(FloatType()), True),
        ("DateType"),
        ("Tiemstamp"),
        ("Cybertronian", "bool", True), 
        ("NullType", "null", True),
    ],
    [
        ("Optim'us", 28, "Leader", 10, 5000000, 4.3, ["Inochi", "Convoy"], "19.442735,-99.201111", "1980/04/10",
         "2016/09/10", [8.5344, 4300.0], date(2016, 9, 10), datetime(2014, 6, 24), True,
         None),
        ("bumbl#ebéé  ", 17, "Espionage", 7, 5000000, 2.0, ["Bumble", "Goldback"], "10.642707,-71.612534", "1980/04/10",
         "2015/08/10", [5.334, 2000.0], date(2015, 8, 10), datetime(2014, 6, 24), True,
         None),
        ("ironhide&", 26, "Security", 7, 5000000, 4.0, ["Roadbuster"], "37.789563,-122.400356", "1980/04/10",
         "2014/07/10", [7.9248, 4000.0], date(2014, 6, 24), datetime(2014, 6, 24), True,
         None),
        ("Jazz", 13, "First Lieutenant", 8, 5000000, 1.80, ["Meister"], "33.670666,-117.841553", "1980/04/10",
         "2013/06/10", [3.9624, 1800.0], date(2013, 6, 24), datetime(2014, 6, 24), True, None),
        ("Megatron", None, "None", 10, 5000000, 5.70, ["Megatron"], None, "1980/04/10", "2012/05/10", [None, 5700.0],
         date(2012, 5, 10), datetime(2014, 6, 24), True, None),
        ("Metroplex_)^$", 300, "Battle Station", 8, 5000000, None, ["Metroflex"], None, "1980/04/10", "2011/04/10",
         [91.44, None], date(2011, 4, 10), datetime(2014, 6, 24), True, None),

    ], infer_schema = True).h_repartition(1)

With .table() you have a beautiful way to show your data, with extra information like column number, column data type and marked white spaces.
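
In a notebook you would normally call it directly, for example:

df.table(5)   # render the first 5 rows with types and highlighted white spaces

In this README the examples call table_image() instead, so the rendered table can be saved as an image.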


In [9]:
df.table_image("images/table.png")


Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

You can also create a dataframe from a pandas dataframe:


In [10]:
import pandas as pd
pdf = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c', 3: 'd'},
                    'B': {0: 1, 1: 3, 2: 5, 3: 7},
                    'C': {0: 2, 1: 4, 2: 6, 3: None},
                    'D': {0: '1980/04/10', 1: '1980/04/10', 2: '1980/04/10', 3: '1980/04/10'},
                    })

s_pdf = op.create.df(pdf=pdf)
s_pdf.table_image("images/pandas.png")


Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

Cleaning and Processing

Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people coming from pandas. Optimus expands the Spark DataFrame functionality by adding .rows and .cols attributes.

For example, you can load data from a URL, transform it and apply some predefined cleaning functions:


In [11]:
# This is a custom function
def func(value, arg):
    return "this was a number"
    
new_df = df\
    .rows.sort("rank","desc")\
    .withColumn('new_age', df.age)\
    .cols.lower(["names","function"])\
    .cols.date_transform("date arrival", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("date arrival", "dd-MM-YYYY", output_cols = "from arrival")\
    .cols.remove_accents("names")\
    .cols.remove_special_chars("names")\
    .rows.drop(df["rank"]>8)\
    .cols.rename(str.lower)\
    .cols.trim("*")\
    .cols.unnest("japanese name", output_cols="other names")\
    .cols.unnest("last position seen",separator=",", output_cols="pos")\
    .cols.drop(["last position seen", "japanese name","date arrival", "cybertronian", "nulltype"])


INFO:optimus:Using 'column_exp' to process column 'names' with function _lower
INFO:optimus:Using 'column_exp' to process column 'function' with function _lower
INFO:optimus:Using 'column_exp' to process column 'date arrival' with function _date_transform
INFO:optimus:Using 'column_exp' to process column 'date arrival' with function _years_between
INFO:optimus:Using 'column_exp' to process column 'from arrival' with function _cast_to
INFO:optimus:Using 'pandas_udf' to process column 'names' with function _remove_accents
INFO:optimus:Using 'pandas_udf' to process column 'names' with function multiple_replace
INFO:optimus:`japanese name`,`attributes`,`tiemstamp`,`nulltype` column(s) was not processed because is/are not byte,short,big,int,double,float,string,date,bool
INFO:optimus:Using 'column_exp' to process column 'names' with function _trim
INFO:optimus:Using 'column_exp' to process column 'height(ft)' with function _trim
INFO:optimus:Using 'column_exp' to process column 'function' with function _trim
INFO:optimus:Using 'column_exp' to process column 'rank' with function _trim
INFO:optimus:Using 'column_exp' to process column 'age' with function _trim
INFO:optimus:Using 'column_exp' to process column 'weight(t)' with function _trim
INFO:optimus:Using 'column_exp' to process column 'last position seen' with function _trim
INFO:optimus:Using 'column_exp' to process column 'date arrival' with function _trim
INFO:optimus:Using 'column_exp' to process column 'last date seen' with function _trim
INFO:optimus:Using 'column_exp' to process column 'datetype' with function _trim
INFO:optimus:Using 'column_exp' to process column 'cybertronian' with function _trim
INFO:optimus:Using 'column_exp' to process column 'new_age' with function _trim
INFO:optimus:Using 'column_exp' to process column 'from arrival' with function _trim

You transform this


In [12]:
df.table_image("images/table1.png")


Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

Into this


In [13]:
new_df.table_image("images/table2.png")


Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

Note that you can use Optimus functions, Spark functions (.withColumn()) and all the functions available in a Spark DataFrame at the same time. To learn about all the Optimus functionality please go to these notebooks.

Handling column output

With Optimus you can control how the output column from a transformation is going to be handled.


In [14]:
from pyspark.sql import functions as F

def func(col_name, attr):
    return F.upper(F.col(col_name))

If a string is passed to input_cols and output_cols is not defined, the result of the operation is saved in the same input column:


In [15]:
output_df = df.cols.apply(input_cols="names", output_cols=None,func=func)
output_df.table_image("images/column_output_1.png")


INFO:optimus:Using 'column_exp' to process column 'names' with function func
Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

If a string is passed to input_cols and a string is passed to output_cols, the output is saved in the output column:


In [16]:
output_df = df.cols.apply(input_cols="names", output_cols="names_up",func=func)
output_df.table_image("images/column_output_2.png")


INFO:optimus:Using 'column_exp' to process column 'names' with function func
Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

If a list is passed to input_cols and a string is passed to output_cols, Optimus appends the string to every input column name to create the output column names:


In [17]:
output_df = df.cols.apply(input_cols=["names","function"], output_cols="_up",func=func)
output_df.table_image("images/column_output_3.png")


INFO:optimus:Using 'column_exp' to process column 'names' with function func
INFO:optimus:Using 'column_exp' to process column 'function' with function func
Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

If a list is passed to input_cols and a list is passed to output_cols, Optimus writes every input column to the respective output column:


In [18]:
output_df = df.cols.apply(input_cols=["names","function"], output_cols=["names_up","function_up"],func=func)
output_df.table_image("images/column_output_4.png")


INFO:optimus:Using 'column_exp' to process column 'names' with function func
INFO:optimus:Using 'column_exp' to process column 'function' with function func
Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

Custom functions

Spark has multiple ways to transform your data: RDDs, column expressions, UDFs and pandas UDFs. In Optimus we created apply() and apply_expr(), which handle all the implementation complexity.

Here you apply a function to the "height(ft)" column, adding the arguments 1 and 2 to the current column value. All powered by pandas UDFs:


In [19]:
def func(value, args):
    return value + args[0] + args[1]

df.cols.apply("height(ft)",func,"int", [1,2]).table_image("images/table3.png")


INFO:optimus:Using 'pandas_udf' to process column 'height(ft)' with function func
Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

If you want to apply a column expression, use apply_expr() like this. In this case we divide the actual column value by 20:


In [20]:
from pyspark.sql import functions as F

def func(col_name, args):
    return F.col(col_name)/20

df.cols.apply("height(ft)", func=func, args=20).table_image("images/table4.png")


INFO:optimus:Using 'column_exp' to process column 'height(ft)' with function func
Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

You can change the table output back to ASCII if you wish:


In [21]:
op.output("ascii")

To return to HTML just:


In [22]:
op.output("html")

Data profiling

Optimus comes with a powerful and unique data profiler. Besides basic and advanced stats like min, max, kurtosis and MAD, it also lets you know what type of data every column has. For example, if a string column contains strings, integers, floats, bools and dates, Optimus can give you a unique overview of your data. Just run df.profile("*") to profile all the columns. For more info about the profiler please go to this notebook.

Let's load a "big" dataset


In [23]:
df = op.load.csv("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/Meteorite_Landings.csv").h_repartition()


INFO:optimus:Downloading Meteorite_Landings.csv from https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/Meteorite_Landings.csv
INFO:optimus:Downloaded 4978151 bytes
INFO:optimus:Creating DataFrame for Meteorite_Landings.csv. Please wait...
INFO:optimus:Successfully created DataFrame for 'Meteorite_Landings.csv'

Numeric


In [24]:
op.profiler.run(df, "mass (g)", infer=False)


INFO:optimus:Processing column 'mass (g)'...
INFO:optimus:_count_data_types() executed in 2.31 sec
INFO:optimus:count_data_types() executed in 2.31 sec
INFO:optimus:cast_columns() executed in 0.0 sec
INFO:optimus:agg_exprs() executed in 3.29 sec
INFO:optimus:general_stats() executed in 3.3 sec
INFO:optimus:------------------------------
INFO:optimus:Processing column 'mass (g)'...
INFO:optimus:frequency() executed in 4.77 sec
INFO:optimus:stats_by_column() executed in 0.0 sec
INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to
INFO:optimus:percentile() executed in 0.39 sec
INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to
INFO:optimus:extra_numeric_stats() executed in 0.9 sec
INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.47 sec
INFO:optimus:hist() executed in 2.54 sec
Including 'nan' as Null in processing 'name'
Including 'nan' as Null in processing 'nametype'
Including 'nan' as Null in processing 'recclass'
Including 'nan' as Null in processing 'fall'
Including 'nan' as Null in processing 'year'
Including 'nan' as Null in processing 'GeoLocation'
INFO:optimus:dataset_info() executed in 2.28 sec

Overview

Dataset info

Number of columns 10
Number of rows 45716
Total Missing (%) 0.49%
Total size in memory 88.2 MB

Column types

String 0
Numeric 1
Date 0
Bool 0
Array 0
Not available 0

mass (g)

numeric
Unique 12497
Unique (%) 27.336
Missing 0.0
Missing (%) 0

Datatypes

String 0
Integer 0
Float 0
Bool 0
Date 0
Missing 0
Null 131

Basic Stats

Mean 13278.07855
Minimum 0.0
Maximum 60000000.0
Zeros(%)

Frequency

Value Count Frequency (%)
1.3 171 0.374%
1.2 140 0.306%
1.4 138 0.302%
None 131 0.287%
2.1 130 0.284%
2.4 126 0.276%
1.6 120 0.262%
0.5 119 0.26%
1.1 116 0.254%
3.8 114 0.249%
"Missing" 0 0.0%

Quantile statistics

Minimum 0.0
5-th percentile
Q1
Median
Q3
95-th percentile
Maximum 60000000.0
Range 60000000.0
Interquartile range 0.0

Descriptive statistics

Standard deviation 574988.87641
Coef of variation 43.30362
Kurtosis 6796.17061
Mean 13278.07855
MAD 0.0
Skewness
Sum 605281210.638
Variance 330612207995.7785
Viewing 10 of 45716 rows / 10 columns, 32 partition(s); all columns nullable

| name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation |
| (string) | (int) | (string) | (string) | (double) | (string) | (string) | (double) | (double) | (string) |
| Acfer 232 | 240 | Valid | H5 | 725.0 | Found | 01/01/1991 12:00:00 AM | 27.73944 | 4.32833 | (27.739440, 4.328330) |
| Asuka 87197 | 2554 | Valid | H4 | 124.99 | Found | 01/01/1987 12:00:00 AM | -72.0 | 26.0 | (-72.000000, 26.000000) |
| Gladstone (iron) | 10920 | Valid | Iron, IAB-MG | 736600.0 | Found | 01/01/1915 12:00:00 AM | -23.9 | 151.3 | (-23.900000, 151.300000) |
| Nullarbor 015 | 17955 | Valid | L6 | 3986.0 | Found | 01/01/1980 12:00:00 AM | None | None | None |
| Lewis Cliff 86533 | 13461 | Valid | H5 | 15.7 | Found | 01/01/1986 12:00:00 AM | -84.26756 | 161.3631 | (-84.267560, 161.363100) |
| Grove Mountains 053589 | 48447 | Valid | L5 | 1.4 | Found | 01/01/2006 12:00:00 AM | -72.7825 | 75.300278 | (-72.782500, 75.300278) |
| Sayh al Uhaymir 108 | 23300 | Valid | H6 | 16.0 | Found | 01/01/2001 12:00:00 AM | 21.06667 | 57.31667 | (21.066670, 57.316670) |
| Northwest Africa 3088 | 31218 | Valid | L6 | 171.0 | Found | 01/01/2003 12:00:00 AM | None | None | None |
| Reckling Peak 92423 | 22432 | Valid | L6 | 3.8 | Found | 01/01/1992 12:00:00 AM | -76.22029 | 158.37967 | (-76.220290, 158.379670) |
| Sweetwater | 23770 | Valid | H5 | 1760.0 | Found | 01/01/1961 12:00:00 AM | 32.55 | -100.41667 | (32.550000, -100.416670) |
INFO:optimus:run() executed in 19.68 sec

In [25]:
op.profiler.to_image(output_path="images/profiler_numeric.png")


Loading page (1/2)
Warning: Failed to load file:///C:/Users/ARGENI~1/AppData/Local/Temp/optimus/styles/styles.css (ignore)
Rendering (2/2)                                                    
Done                                                               

In [26]:
op.profiler.run(df, "name", infer=False)


INFO:optimus:Processing column 'name'...
Including 'nan' as Null in processing 'name'
INFO:optimus:_count_data_types() executed in 1.43 sec
INFO:optimus:count_data_types() executed in 1.43 sec
INFO:optimus:Using 'column_exp' to process column 'name' with function _cast_to
INFO:optimus:cast_columns() executed in 0.02 sec
INFO:optimus:agg_exprs() executed in 1.94 sec
INFO:optimus:general_stats() executed in 1.95 sec
INFO:optimus:------------------------------
INFO:optimus:Processing column 'name'...
INFO:optimus:frequency() executed in 4.21 sec
INFO:optimus:stats_by_column() executed in 0.0 sec
INFO:optimus:Using 'column_exp' to process column 'name_len' with function func_col_exp
INFO:optimus:Using 'column_exp' to process column 'name_len' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.35 sec
INFO:optimus:hist() executed in 3.02 sec
INFO:optimus:hist_string() executed in 5.39 sec
Including 'nan' as Null in processing 'name'
Including 'nan' as Null in processing 'nametype'
Including 'nan' as Null in processing 'recclass'
Including 'nan' as Null in processing 'fall'
Including 'nan' as Null in processing 'year'
Including 'nan' as Null in processing 'GeoLocation'
INFO:optimus:dataset_info() executed in 1.78 sec

Overview

Dataset info

Number of columns 10
Number of rows 45716
Total Missing (%) 0.49%
Total size in memory 92.1 MB

Column types

String 1
Numeric 0
Date 0
Bool 0
Array 0
Not available 0

name

categorical
Unique 45515
Unique (%) 99.56
Missing 0.0
Missing (%) 0

Datatypes

String 45716
Integer 0
Float 0
Bool 0
Date 0
Missing 0
Null 0

Frequency

Value Count Frequency (%)
Święcany 1 0.002%
Łowicz 1 0.002%
Österplana 064 1 0.002%
Österplana 063 1 0.002%
Österplana 062 1 0.002%
Österplana 061 1 0.002%
Österplana 060 1 0.002%
Österplana 059 1 0.002%
Österplana 058 1 0.002%
Österplana 057 1 0.002%
"Missing" 0 0.0%
Viewing 10 of 45716 rows / 10 columns, 32 partition(s) (same 10-row preview table as above)
INFO:optimus:run() executed in 17.73 sec

In [63]:
op.profiler.to_image(output_path="images/profiler.png")


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-63-1304007aee00> in <module>
----> 1 op.profiler.to_image(output_path="images/profiler.png")

~\Documents\Optimus\optimus\profiler\profiler.py in to_image(self, output_path)
    267         """
    268         css = absolute_path("/css/styles.css")
--> 269         imgkit.from_string(self.html, output_path, css=css)
    270 
    271         print_html("<img src='" + output_path + "'>")

~\Anaconda3\lib\site-packages\imgkit\api.py in from_string(string, output_path, options, toc, cover, css, config, cover_first)
     87     """
     88     rtn = IMGKit(string, 'string', options=options, toc=toc, cover=cover, css=css,
---> 89                  config=config, cover_first=cover_first)
     90     return rtn.to_img(output_path)
     91 

~\Anaconda3\lib\site-packages\imgkit\imgkit.py in __init__(self, url_or_file, source_type, options, toc, cover, css, config, cover_first)
     42         self.options = {}
     43         if self.source.isString():
---> 44             self.options.update(self._find_options_in_meta(url_or_file))
     45 
     46         if options:

~\Anaconda3\lib\site-packages\imgkit\imgkit.py in _find_options_in_meta(self, content)
    199         found = {}
    200 
--> 201         for x in re.findall('<meta [^>]*>', content):
    202             if re.search('name=["\']%s' % self.config.meta_tag_prefix, x):
    203                 name = re.findall('name=["\']%s([^"\']*)' %

~\Anaconda3\lib\re.py in findall(pattern, string, flags)
    221 
    222     Empty matches are included in the result."""
--> 223     return _compile(pattern, flags).findall(string)
    224 
    225 def finditer(pattern, string, flags=0):

TypeError: expected string or bytes-like object

Processing Dates

For date data types, Optimus can give you extra information:


In [28]:
op.profiler.run(df, "year", infer=True)


INFO:optimus:Processing column 'year'...
INFO:optimus:_count_data_types() executed in 23.75 sec
INFO:optimus:count_data_types() executed in 23.75 sec
INFO:optimus:cast_columns() executed in 0.0 sec
INFO:optimus:agg_exprs() executed in 1.59 sec
INFO:optimus:general_stats() executed in 1.6 sec
INFO:optimus:------------------------------
INFO:optimus:Processing column 'year'...
INFO:optimus:frequency() executed in 3.26 sec
INFO:optimus:stats_by_column() executed in 0.0 sec
INFO:optimus:Using 'pandas_udf' to process column 'year' with function func_infer_date
INFO:optimus:Using 'column_exp' to process column 'year_0' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.81 sec
INFO:optimus:hist() executed in 3.44 sec
INFO:optimus:Using 'column_exp' to process column 'year_1' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.12 sec
INFO:optimus:hist() executed in 1.61 sec
INFO:optimus:Using 'column_exp' to process column 'year_2' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.1 sec
INFO:optimus:hist() executed in 1.76 sec
INFO:optimus:Using 'column_exp' to process column 'year_3' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.22 sec
INFO:optimus:hist() executed in 1.7 sec
INFO:optimus:Using 'column_exp' to process column 'year_4' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.52 sec
INFO:optimus:hist() executed in 1.97 sec
INFO:optimus:hist_date() executed in 62.45 sec
Including 'nan' as Null in processing 'name'
Including 'nan' as Null in processing 'nametype'
Including 'nan' as Null in processing 'recclass'
Including 'nan' as Null in processing 'fall'
Including 'nan' as Null in processing 'year'
Including 'nan' as Null in processing 'GeoLocation'
INFO:optimus:dataset_info() executed in 1.79 sec

Overview

Dataset info

Number of columns 10
Number of rows 45716
Total Missing (%) 0.49%
Total size in memory 97.6 MB

Column types

String 0
Numeric 0
Date 1
Bool 0
Array 0
Not available 0

year

date
Unique 265
Unique (%) 0.58
Missing 0.0
Missing (%) 0

Datatypes

String 0
Integer 0
Float 0
Bool 0
Date 45428
Missing 0
Null 288

Frequency

Value Count Frequency (%)
01/01/2003 12:00:00 AM 3323 7.269%
01/01/1979 12:00:00 AM 3046 6.663%
01/01/1998 12:00:00 AM 2697 5.899%
01/01/2006 12:00:00 AM 2456 5.372%
01/01/1988 12:00:00 AM 2296 5.022%
01/01/2002 12:00:00 AM 2078 4.545%
01/01/2004 12:00:00 AM 1940 4.244%
01/01/2000 12:00:00 AM 1792 3.92%
01/01/1997 12:00:00 AM 1696 3.71%
01/01/1999 12:00:00 AM 1691 3.699%
"Missing" 0 0.0%
Viewing 10 of 45716 rows / 10 columns, 32 partition(s) (same 10-row preview table as above)
INFO:optimus:run() executed in 96.23 sec

In [29]:
op.profiler.to_image(output_path="images/profiler1.png")


Loading page (1/2)
Warning: Failed to load file:///C:/Users/ARGENI~1/AppData/Local/Temp/optimus/styles/styles.css (ignore)
Rendering (2/2)                                                    
Done                                                               

Profiler Speed

With the relative_error and approx_count params you can control how some operations are calculated, so you can speed up the profiling if needed.

relative_error: relative error for the quantile discretizer calculation. 1 is faster, 0 is slower.

approx_count: use approx_count_distinct or countDistinct. approx_count_distinct is faster.


In [30]:
op.profiler.run(df, "mass (g)", infer=False, relative_error =1, approx_count=True)


INFO:optimus:Processing column 'mass (g)'...
INFO:optimus:_count_data_types() executed in 1.33 sec
INFO:optimus:count_data_types() executed in 1.33 sec
INFO:optimus:cast_columns() executed in 0.0 sec
INFO:optimus:agg_exprs() executed in 1.64 sec
INFO:optimus:general_stats() executed in 1.65 sec
INFO:optimus:------------------------------
INFO:optimus:Processing column 'mass (g)'...
INFO:optimus:frequency() executed in 3.56 sec
INFO:optimus:stats_by_column() executed in 0.0 sec
INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to
INFO:optimus:percentile() executed in 0.23 sec
INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to
INFO:optimus:extra_numeric_stats() executed in 0.58 sec
INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.3 sec
INFO:optimus:hist() executed in 2.04 sec
Including 'nan' as Null in processing 'name'
Including 'nan' as Null in processing 'nametype'
Including 'nan' as Null in processing 'recclass'
Including 'nan' as Null in processing 'fall'
Including 'nan' as Null in processing 'year'
Including 'nan' as Null in processing 'GeoLocation'
INFO:optimus:dataset_info() executed in 1.75 sec

Overview

Dataset info

Number of columns 10
Number of rows 45716
Total Missing (%) 0.49%
Total size in memory 98.2 MB

Column types

String 0
Numeric 1
Date 0
Bool 0
Array 0
Not available 0

mass (g)

numeric
Unique 12497
Unique (%) 27.336
Missing 0.0
Missing (%) 0

Datatypes

String 0
Integer 0
Float 0
Bool 0
Date 0
Missing 0
Null 131

Basic Stats

Mean 13278.07855
Minimum 0.0
Maximum 60000000.0
Zeros(%)

Frequency

Value Count Frequency (%)
1.3 171 0.374%
1.2 140 0.306%
1.4 138 0.302%
None 131 0.287%
2.1 130 0.284%
2.4 126 0.276%
1.6 120 0.262%
0.5 119 0.26%
1.1 116 0.254%
3.8 114 0.249%
"Missing" 0 0.0%

Quantile statistics

Minimum 0.0
5-th percentile
Q1
Median
Q3
95-th percentile
Maximum 60000000.0
Range 60000000.0
Interquartile range 0.0

Descriptive statistics

Standard deviation 574988.87641
Coef of variation 43.30362
Kurtosis 6796.17061
Mean 13278.07855
MAD 0.0
Skewness
Sum 605281210.638
Variance 330612207995.7785
Viewing 10 of 45716 rows / 10 columns, 32 partition(s) (same 10-row preview table as above)
INFO:optimus:run() executed in 13.74 sec

Plots

Besides histograms and frequency plots you also have scatter plots and box plots. All powered by PySpark.


In [31]:
df = op.load.excel("../examples/data/titanic3.xls")
df = df.rows.drop_na(["age","fare"])

You can output to the notebook or as an image:


In [32]:
# Output as an image
df.plot.hist("fare", output_format="image", output_path="images/hist.png")


INFO:optimus:Using 'column_exp' to process column 'fare' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.11 sec
INFO:optimus:hist() executed in 5.16 sec
INFO:optimus:hist() executed in 9.98 sec

In [33]:
df.plot.frequency("age")
df.plot.frequency("age", output_format="image", output_path="images/frequency.png")



In [34]:
df.plot.scatter(["fare", "age"], buckets=30)
df.plot.scatter(["fare", "age"], buckets=30, output_format="image", output_path="images/scatter.png")


INFO:optimus:Using 'column_exp' to process column 'fare' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.29 sec
INFO:optimus:Using 'column_exp' to process column 'age' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.38 sec
INFO:optimus:Using 'column_exp' to process column 'fare' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.29 sec
INFO:optimus:Using 'column_exp' to process column 'age' with function _bucketizer
INFO:optimus:bucketizer() executed in 0.3 sec

In [35]:
df.plot.box("age")
df.plot.box("age", output_format="image", output_path="images/box.png")


INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:percentile() executed in 6.89 sec
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:percentile() executed in 3.92 sec
<Figure size 864x360 with 0 Axes>

In [36]:
df.plot.correlation("*")
df.plot.correlation("*", output_format="image", output_path="images/correlation.png")


INFO:optimus:`name`,`sex`,`ticket`,`cabin`,`embarked`,`boat`,`home_dest` column(s) was not processed because is/are not byte,short,big,int,double,float
INFO:optimus:Using 'column_exp' to process column 'pclass' with function _cast_to
INFO:optimus:Casting pclass to float...
INFO:optimus:Using 'column_exp' to process column 'survived' with function _cast_to
INFO:optimus:Casting survived to float...
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:Casting age to float...
INFO:optimus:Using 'column_exp' to process column 'sibsp' with function _cast_to
INFO:optimus:Casting sibsp to float...
INFO:optimus:Using 'column_exp' to process column 'parch' with function _cast_to
INFO:optimus:Casting parch to float...
INFO:optimus:Using 'column_exp' to process column 'fare' with function _cast_to
INFO:optimus:Casting fare to float...
INFO:optimus:Using 'column_exp' to process column 'body' with function _cast_to
INFO:optimus:Casting body to float...
object of type <class 'float'> cannot be safely interpreted as an integer.
INFO:optimus:`name`,`sex`,`ticket`,`cabin`,`embarked`,`boat`,`home_dest` column(s) was not processed because is/are not byte,short,big,int,double,float
INFO:optimus:Using 'column_exp' to process column 'pclass' with function _cast_to
INFO:optimus:Casting pclass to float...
INFO:optimus:Using 'column_exp' to process column 'survived' with function _cast_to
INFO:optimus:Casting survived to float...
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:Casting age to float...
INFO:optimus:Using 'column_exp' to process column 'sibsp' with function _cast_to
INFO:optimus:Casting sibsp to float...
INFO:optimus:Using 'column_exp' to process column 'parch' with function _cast_to
INFO:optimus:Casting parch to float...
INFO:optimus:Using 'column_exp' to process column 'fare' with function _cast_to
INFO:optimus:Casting fare to float...
INFO:optimus:Using 'column_exp' to process column 'body' with function _cast_to
INFO:optimus:Casting body to float...

Using other plotting libraries

Optimus has a tiny API, so you can use any plotting library. For example, you can use df.cols.scatter(), df.cols.frequency(), df.cols.boxplot() or df.cols.hist() to output JSON that you can process to adapt the data to any plotting library.
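
For instance, here is a minimal sketch that pulls one column into pandas and plots it with matplotlib (assumptions for illustration: the column fits in driver memory and matplotlib is installed; it is not required by Optimus):

import matplotlib.pyplot as plt

# Collect a single column to the driver and hand it to any plotting library
pdf = df.select("fare").toPandas()
pdf["fare"].plot.hist(bins=30)
plt.xlabel("fare")
plt.show()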

Outliers

Get the outliers using Tukey's method:


In [37]:
df.outliers.tukey("age").select().table_image("images/table5.png")


INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:percentile() executed in 4.02 sec
Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

Remove the outliers using Tukey's method:


In [38]:
df.outliers.tukey("age").drop().table_image("images/table6.png")


INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:percentile() executed in 4.1 sec
Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

In [39]:
df.outliers.tukey("age").info()


INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:percentile() executed in 3.94 sec
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:percentile() executed in 4.09 sec
Out[39]:
{'count_outliers': 1045,
 'count_non_outliers': 1036,
 'lower_bound': -6.0,
 'upper_bound': 66.0,
 'iqr1': 21.0,
 'iqr3': 39.0}
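
These bounds follow the standard Tukey rule: with iqr1 = 21.0 and iqr3 = 39.0 from the output above, IQR = 39.0 - 21.0 = 18.0, so:

lower_bound = 21.0 - 1.5 * 18.0   # -6.0
upper_bound = 39.0 + 1.5 * 18.0   # 66.0

Values outside [lower_bound, upper_bound] are counted as outliers.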

You can also use z_score, modified_z_score or mad:


In [40]:
df.outliers.z_score("age", threshold=2).drop()
df.outliers.modified_z_score("age", threshold = 2).drop()
df.outliers.mad("age", threshold = 2).drop()


INFO:optimus:Using 'column_exp' to process column 'age' with function _z_score
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
Out[40]:
DataFrame[pclass: bigint, survived: bigint, name: string, sex: string, age: double, sibsp: bigint, parch: bigint, ticket: string, fare: double, cabin: string, embarked: string, boat: string, body: double, home_dest: string]

Database connection

Optimus has handy tools to connect to databases and extract information. Optimus can handle Redshift, Postgres, Oracle and MySQL.


In [41]:
import sys
sys.path.append("..")

from optimus import Optimus
op = Optimus(verbose=True)


INFO:optimus:Operative System:Windows
INFO:optimus:Just check that Spark and all necessary environments vars are present...
INFO:optimus:-----
INFO:optimus:SPARK_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7
INFO:optimus:HADOOP_HOME=C:\opt\hadoop-2.7.7
INFO:optimus:PYSPARK_PYTHON=C:\Users\argenisleon\Anaconda3\python.exe
INFO:optimus:PYSPARK_DRIVER_PYTHON=jupyter
INFO:optimus:PYSPARK_SUBMIT_ARGS=--jars "file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --driver-class-path "C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --conf "spark.sql.catalogImplementation=hive" pyspark-shell
INFO:optimus:JAVA_HOME=C:\java
INFO:optimus:Pyarrow Installed
INFO:optimus:-----
INFO:optimus:Starting or getting SparkSession and SparkContext...
INFO:optimus:Spark Version:2.3.1
INFO:optimus:
                             ____        __  _                     
                            / __ \____  / /_(_)___ ___  __  _______
                           / / / / __ \/ __/ / __ `__ \/ / / / ___/
                          / /_/ / /_/ / /_/ / / / / / / /_/ (__  ) 
                          \____/ .___/\__/_/_/ /_/ /_/\__,_/____/  
                              /_/                                  
                              
INFO:optimus:Transform and Roll out...
INFO:optimus:Optimus successfully imported. Have fun :).
INFO:optimus:Config.ini not found

In [42]:
# This import is only to hide the credentials
from credentials import *

# db_type accepts 'oracle', 'mysql', 'redshift' or 'postgres'

db = op.connect(
    db_type=DB_TYPE,
    host=HOST,
    database=DATABASE,
    user=USER,
    password=PASSWORD,
    port=PORT)
    
# Show all tables names
db.tables(limit="all")


INFO:optimus:jdbc:redshift://yopter.ccyy9l6xansm.us-east-1.redshift.amazonaws.com:5439/yopterdwh?currentSchema=public
INFO:optimus:(
            SELECT relname as table_name,cast (reltuples as integer) AS count 
            FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) 
            WHERE nspname IN ('public') AND relkind='r' ORDER BY reltuples DESC) AS t
INFO:optimus:jdbc:redshift://yopter.ccyy9l6xansm.us-east-1.redshift.amazonaws.com:5439/yopterdwh?currentSchema=public
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-42-f35140062f08> in <module>
     13 
     14 # Show all tables names
---> 15 db.tables(limit="all")

~\Documents\Optimus\optimus\io\jdbc.py in tables(self, schema, database, limit)
    123                     FROM user_tables ORDER BY table_name"""
    124 
--> 125         df = self.execute(query, limit)
    126         return df.table(limit)
    127 

~\Documents\Optimus\optimus\io\jdbc.py in execute(self, query, limit)
    230             conf.option("driver", self.driver_option)
    231 
--> 232         return conf.load()
    233 
    234     def df_to_table(self, df, table, mode="overwrite"):

~\Anaconda3\lib\site-packages\pyspark\sql\readwriter.py in load(self, path, format, schema, **options)
    170             return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    171         else:
--> 172             return self._df(self._jreader.load())
    173 
    174     @since(1.4)

~\Anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

~\Anaconda3\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

~\Anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o6140.load.
: java.sql.SQLException: No suitable driver
	at java.sql.DriverManager.getDriver(DriverManager.java:315)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:84)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

In [ ]:
# Show a summary of every table
db.table.show("*",20)

In [ ]:
# Get a table as a dataframe
df_ = db.table_to_df("places_interest").table()

In [ ]:
# Create a new table in the database
db.df_to_table(df, "new_table")

Data enrichment

You can connect to any external API to enrich your data using Optimus. Optimus uses MongoDB to download the data and then merges it with the Spark DataFrame. You need to install MongoDB.

Let's load a tiny dataset we can enrich:


In [ ]:
df = op.load.json("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json")

In [ ]:
import requests

def func_request(params):
    # You can use here whatever header or auth info you need to send. 
    # For more information see the requests library
    
    url= "https://jsonplaceholder.typicode.com/todos/" + str(params["id"])
    return requests.get(url)

def func_response(response):
    # Here you can parse the response
    return response["title"]


e = op.enrich(host="localhost", port=27017, db_name="jazz")

df_result = e.run(df, func_request, func_response, calls=60, period=60, max_tries=8)

In [ ]:
df_result.table("all")

In [ ]:
df_result.table_image("images/table7.png")

Clustering Strings

Optimus implements some functions to cluster strings. We took great inspiration from OpenRefine.

Here is a quote from its site:

"In OpenRefine, clustering refers to the operation of "finding groups of different values that might be alternative representations of the same thing". For example, the two strings "New York" and "new york" are very likely to refer to the same concept and just have capitalization differences. Likewise, "Gödel" and "Godel" probably refer to the same person."

For more information see: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
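
As a rough illustration of the key-collision idea, here is a plain-Python sketch (not Optimus' actual implementation; the real pipeline, visible in the logs below, also includes a split/sort/join step):

import re
import unicodedata

def fingerprint(value):
    # Trim, lowercase, strip accents, then drop punctuation and white space
    value = value.strip().lower()
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", value)

fingerprint("México D.F.")  # 'mexicodf'
fingerprint("Mexico D.F.")  # 'mexicodf' -> same key, so both values end up in one cluster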

Key collision


In [43]:
df = op.read.csv("../examples/data/random.csv", header=True, sep=";")

In [44]:
from optimus.ml import keycollision as keyCol

In [45]:
df_kc = keyCol.fingerprint_cluster(df, 'STATE')
df_kc.table()
df_kc.table_image("images/table8.png")


INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
Viewing 5 of 5 rows / 4 columns, 200 partition(s)

| STATE_CLUSTER_SIZE | STATE_CLUSTER | STATE_COUNT | STATE_RECOMMENDED |
| (int) | (array<string>) | (bigint) | (string) |
| 1 | ['Estado de México'] | 810 | Estado de México |
| 2 | ['México D.F.', 'Mexico D.F.'] | 2495 | Mexico D.F. |
| 1 | ['D.F.'] | 66 | D.F. |
| 1 | ['Distriro Federal'] | 259 | Distriro Federal |
| 3 | ['Distrito Federal', 'DISTRITO FEDERAL', 'distrito federal'] | 11930 | Distrito Federal |
Loading page (1/2)
Rendering (2/2)                                                    
Done                                                               

In [46]:
keyCol.fingerprint_cluster(df, "STATE").to_json()


INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
Out[46]:
[{'STATE_CLUSTER_SIZE': 1,
  'STATE_CLUSTER': ['Estado de México'],
  'STATE_COUNT': 810,
  'STATE_RECOMMENDED': 'Estado de México'},
 {'STATE_CLUSTER_SIZE': 2,
  'STATE_CLUSTER': ['México D.F.', 'Mexico D.F.'],
  'STATE_COUNT': 2495,
  'STATE_RECOMMENDED': 'Mexico D.F.'},
 {'STATE_CLUSTER_SIZE': 1,
  'STATE_CLUSTER': ['D.F.'],
  'STATE_COUNT': 66,
  'STATE_RECOMMENDED': 'D.F.'},
 {'STATE_CLUSTER_SIZE': 1,
  'STATE_CLUSTER': ['Distriro Federal'],
  'STATE_COUNT': 259,
  'STATE_RECOMMENDED': 'Distriro Federal'},
 {'STATE_CLUSTER_SIZE': 3,
  'STATE_CLUSTER': ['Distrito Federal',
   'DISTRITO FEDERAL',
   'distrito federal'],
  'STATE_COUNT': 11930,
  'STATE_RECOMMENDED': 'Distrito Federal'}]
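
n_gram_fingerprint_cluster works like the fingerprint method, but builds the key from the sorted set of character n-grams (here 2-grams) of the normalized string, which tolerates small spelling differences. A rough sketch of the idea (again, not Optimus' exact implementation):

import re
import unicodedata

def ngram_fingerprint(value, n=2):
    # Normalize as before, then key on the sorted unique n-grams
    value = value.strip().lower()
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^a-z0-9]", "", value)
    return "".join(sorted({value[i:i + n] for i in range(len(value) - n + 1)}))

ngram_fingerprint("México D.F.") == ngram_fingerprint("Mexico D.F.")  # True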

In [47]:
df_kc = keyCol.n_gram_fingerprint_cluster(df, "STATE" , 2)
df_kc.table()
df_kc.table_image("images/table9.png")


INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _lower
INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _remove_white_spaces
INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM' with function multiple_replace
INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM' with function _remove_accents
INFO:optimus:Using 'column_exp' to process column 'count' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'STATE' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function func_col_exp
INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM_FINGERPRINT' with function remote_white_spaces_remove_sort_join
Viewing 1 of 1 rows / 4 columns
200 partition(s)

| STATE_CLUSTER_SIZE (int, not nullable) | STATE_CLUSTER (array&lt;string&gt;, nullable) | STATE_COUNT (double, nullable) | STATE_RECOMMENDED (string, nullable) |
|---|---|---|---|
| 8 | ['Distrito Federal', 'México D.F.', 'DISTRITO FEDERAL', 'Mexico D.F.', 'Distr... | 15560.0 | Mexico D.F. |

In [48]:
keyCol.n_gram_fingerprint_cluster(df, "STATE", 2).to_json()


INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _lower
INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _remove_white_spaces
INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM' with function multiple_replace
INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM' with function _remove_accents
INFO:optimus:Using 'column_exp' to process column 'count' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'STATE' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _cast_to
INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function func_col_exp
INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM_FINGERPRINT' with function remote_white_spaces_remove_sort_join
Out[48]:
[{'STATE_CLUSTER_SIZE': 8,
  'STATE_CLUSTER': ['Distrito Federal',
   'México D.F.',
   'DISTRITO FEDERAL',
   'Mexico D.F.',
   'Distriro Federal',
   'D.F.',
   'Estado de México',
   'distrito federal'],
  'STATE_COUNT': 15560.0,
  'STATE_RECOMMENDED': 'Mexico D.F.'}]
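
The 2-gram variant works the same way but keys each value by its sorted, de-duplicated character bigrams, which makes it more tolerant of small spelling differences than the whole-token fingerprint. A rough pure-Python sketch of the key (again, not the Optimus implementation):

```python
import unicodedata

def ngram_fingerprint(s, n=2):
    # Lowercase, strip accents and whitespace (simplified normalization)
    s = "".join(ch for ch in unicodedata.normalize("NFKD", s.lower())
                if not unicodedata.combining(ch))
    s = "".join(s.split())
    # Key the value by its sorted set of character n-grams
    grams = {s[i:i + n] for i in range(len(s) - n + 1)}
    return "".join(sorted(grams))

print(ngram_fingerprint("México D.F."))  # same key as...
print(ngram_fingerprint("mexico d.f."))  # ...this one
```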

Nearest Neighbor Methods


In [49]:
from optimus.ml import distancecluster as dc
df_dc = dc.levenshtein_matrix(df, "STATE")
df_dc.table_image("images/table10.png")


INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join

In [50]:
df_dc = dc.levenshtein_filter(df, "STATE")
df_dc.table()
df_dc.table_image("images/table11.png")


INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
Viewing 5 of 5 rows / 3 columns
200 partition(s)

| STATE_FROM (string, nullable) | STATE_LEVENSHTEIN_DISTANCE (int, nullable) | STATE_TO (string, nullable) |
|---|---|---|
| estadodemexico | 10 | mexicodf |
| df | 6 | mexicodf |
| distrirofederal | 1 | distritofederal |
| distritofederal | 1 | distrirofederal |
| mexicodf | 6 | df |
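
Note that the distances are computed between the fingerprints of the values (hence the lowercased, space-stripped strings above). Spark ships a built-in levenshtein function if you want to experiment directly; a quick sketch, assuming `spark` is your active SparkSession:

```python
from pyspark.sql import functions as F

# 'distrirofederal' -> 'distritofederal' is a single substitution away
pair = spark.createDataFrame([("distrirofederal", "distritofederal")], ["a", "b"])
pair.select(F.levenshtein("a", "b").alias("distance")).show()  # distance = 1
```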

In [51]:
df_dc = dc.levenshtein_cluster(df, "STATE")
df_dc.table()
df_dc.table_image("images/table12.png")


INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
Viewing 5 of 5 rows / 4 columns
1 partition(s)

| STATE_CLUSTER (array&lt;string&gt;, nullable) | STATE_CLUSTER_SIZE (int, nullable) | STATE_RECOMMENDED (string, nullable) | STATE_COUNT (bigint, nullable) |
|---|---|---|---|
| ['Estado de México'] | 1 | Estado de México | 810 |
| ['D.F.'] | 1 | D.F. | 66 |
| ['Distriro Federal'] | 1 | Distriro Federal | 259 |
| ['Distrito Federal', 'DISTRITO FEDERAL', 'distrito federal'] | 3 | Distrito Federal | 11930 |
| ['Mexico D.F.', 'México D.F.'] | 2 | Mexico D.F. | 2495 |

In [52]:
dc.to_json(df, "STATE")


INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents
INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
Out[52]:
[{'STATE_CLUSTER': ['Estado de México'],
  'STATE_CLUSTER_SIZE': 1,
  'STATE_RECOMMENDED': 'Estado de México',
  'STATE_COUNT': 810},
 {'STATE_CLUSTER': ['D.F.'],
  'STATE_CLUSTER_SIZE': 1,
  'STATE_RECOMMENDED': 'D.F.',
  'STATE_COUNT': 66},
 {'STATE_CLUSTER': ['Distriro Federal'],
  'STATE_CLUSTER_SIZE': 1,
  'STATE_RECOMMENDED': 'Distriro Federal',
  'STATE_COUNT': 259},
 {'STATE_CLUSTER': ['Distrito Federal',
   'DISTRITO FEDERAL',
   'distrito federal'],
  'STATE_CLUSTER_SIZE': 3,
  'STATE_RECOMMENDED': 'Distrito Federal',
  'STATE_COUNT': 11930},
 {'STATE_CLUSTER': ['Mexico D.F.', 'México D.F.'],
  'STATE_CLUSTER_SIZE': 2,
  'STATE_RECOMMENDED': 'Mexico D.F.',
  'STATE_COUNT': 2495}]

Machine Learning

Machine Learning is one of the last steps, and the goal, of most Data Science workflows.

Apache Spark provides a library called MLlib with solid implementations of Machine Learning algorithms. With the newer ML library we can take advantage of the DataFrame API and its optimizations to create Machine Learning pipelines easily.

Even though this task is not extremely hard, it is not easy either. The way most Machine Learning models work on Spark is not straightforward, and they need lots of feature engineering. That's why we created the feature engineering section inside Optimus.

One of the best tree-based models for machine learning is Random Forest. What about creating an RF model with just one line? With Optimus it's really easy.


In [53]:
df_cancer = op.load.csv("https://raw.githubusercontent.com/ironmussa/Optimus/master/tests/data_cancer.csv")


INFO:optimus:Downloading data_cancer.csv from https://raw.githubusercontent.com/ironmussa/Optimus/master/tests/data_cancer.csv
INFO:optimus:Downloaded 125205 bytes
INFO:optimus:Creating DataFrame for data_cancer.csv. Please wait...
INFO:optimus:Successfully created DataFrame for 'data_cancer.csv'

In [54]:
columns = ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
           'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean',
           'fractal_dimension_mean']

df_predict, rf_model = op.ml.random_forest(df_cancer, columns, "diagnosis")

This will create a DataFrame with the predictions of the Random Forest model.

So let's see the predictions compared with the actual labels:


In [55]:
df_predict.cols.select(["label","prediction"]).table_image("images/table13.png")



The rf_model variable contains the Random Forest model for analysis.
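
For example, assuming rf_model exposes the standard Spark ML RandomForestClassificationModel attributes, you could inspect it and score the predictions (a sketch, not part of the Optimus API):

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

print(rf_model.featureImportances)  # relative weight of each input feature
print(rf_model.getNumTrees)         # number of trees in the ensemble

# Score the predictions DataFrame returned above
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Accuracy:", evaluator.evaluate(df_predict))
```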

Contributing to Optimus

Contributions go far beyond pull requests and commits. We are very happy to receive any kind of contribution.

Backers

[Become a backer] and get your image on our README on GitHub with a link to your site.

Sponsors

[Become a sponsor] and get your image on our README on GitHub with a link to your site.

Core Team

Argenis Leon and Favio Vazquez

Contributors:

Here are the amazing people who make Optimus possible:

License:

Apache 2.0 © Iron

Post-process README script. Always run this if you modify the notebook.

This will recreate README.md.

The script below processes the readme_.md file output by this notebook: it removes the jupytext header and Python helper lines, converts tables to image links, and writes the result to README.md.

For the table_image() function to work, be sure to install imgkit (pip install imgkit) and wkhtmltopdf (https://wkhtmltopdf.org/downloads.html); these are responsible for generating the Optimus tables as images.
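
For reference, imgkit's basic API looks like this (a minimal sketch; wkhtmltoimage must be on the PATH, and the output directory must already exist):

```python
import imgkit

# Render an HTML snippet straight to a PNG file
imgkit.from_string("<table><tr><td>Optimus</td></tr></table>", "images/example.png")
```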


In [5]:
import sys
import fileinput
import re
from shutil import copyfile

# Start from a fresh copy of the notebook-generated markdown
output_file = "../README.md"
copyfile("readme_.md", output_file)

# Helper lines that should not appear in the README
remove = ["load_ext", "autoreload", "import sys", "sys.path.append"]

jupytext_header = False
flag_remove = False

buffer = None
for line in fileinput.input(output_file, inplace=1):
    done = False
    try:
        # Skip helper lines
        for r in remove:
            if re.search(r, line):
                done = True

        # Drop everything from the post-process section onward
        if re.search("Post-process", line):
            flag_remove = True

        if flag_remove is True:
            done = True

        # Skip the jupytext header block (delimited by '---' lines)
        if jupytext_header is True:
            done = True

        if "---\n" == line:
            jupytext_header = not jupytext_header

        elif done is False:
            # Detect calls that produce images
            chars_table = re.search(".table_image", line)
            chars_image = re.search(".to_image", line)
            chars_plot = len(re.findall('(.plot.|output_path=)', line)) == 2

            path = "readme/"
            if chars_table:
                # Replace .table_image("...") with .table() and queue the image link
                print(line[0:int(chars_table.start())] + ".table()")

                m = re.search(r'table_image\("(.*?)"\)', line).group(1)
                if m:
                    buffer = "![](" + path + m + ")"
            elif chars_image:
                m = re.search(r'to_image\(output_path="(.*?)"\)', line).group(1)
                if m:
                    buffer = "![](" + path + m + ")"
            elif chars_plot:
                m = re.search('output_path="(.*?)"', line).group(1)
                if m:
                    buffer = "![](" + path + m + ")"
            else:
                sys.stdout.write(line)

            # Emit the queued image link right after the closing code fence
            if "```\n" == line and buffer:
                print(buffer)
                buffer = None

    except Exception as e:
        print(e)

fileinput.close()


# Remove empty python cells
flag = False
for line in fileinput.input(output_file, inplace=1):

    if re.search("```python", line):
        # Defer printing the opening fence until we know the cell has content
        flag = True
    elif re.search("```", line) and flag is True:
        # Opening fence followed immediately by a closing fence: drop both
        flag = False
    elif flag is True:
        flag = False
        print("```python")
        print(line, end="")
    else:
        print(line, end="")

fileinput.close()

In [6]:
line = 'op.profiler.to_image(output_path="images/profiler.png")'
m = re.search(r'to_image\(output_path="(.*?)"\)', line).group(1)
print(m)


images/profiler.png
