To make a first evaluation of the given datasets, we compute some basic metrics.
For more information on the metrics, and on the extraction of metrics for the smaller datasets, see:
`Evaluation metrics for picking an appropriate data set for our goals.ipynb`
For importing the four largest datasets into PostgreSQL and evaluating their metrics, see:
`Importing the large data sets to psql and computing their metrics.ipynb`
Finally, the evaluated metrics of all datasets are exported to the metadata directory and imported here for visualization.
In [1]:
    
def percentage(some_float):
    # Format a ratio in [0, 1] as a whole-number percentage, e.g. 0.42 -> '42%'.
    return '%i%%' % int(100 * some_float)

def metrics_comparison_matrix(reviews_df):
    # Format each row for comparison: the first five columns hold ratios and
    # are rendered as percentages, the sixth is cast to an integer count, and
    # the last two are passed through unchanged. Positional access via .iloc
    # keeps this independent of the actual column names.
    return reviews_df.apply(
        lambda row:
            [ percentage(row.iloc[i]) for i in range(0, 5) ]
            + [ int(row.iloc[5]), row.iloc[6], row.iloc[7] ],
        axis=1)
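
A minimal sketch of the expected input shape, using a hypothetical one-row frame (the column names and values below are made up for illustration; the real inputs are the metrics CSVs loaded in the next cell): five ratio columns, one count column, and two pass-through columns, indexed by dataset name.

import pandas as pd

# Hypothetical example frame -- names and numbers are illustrative only.
example = pd.DataFrame(
    [[0.91, 0.42, 0.05, 0.33, 0.87, 12345.0, 'yes', 'no']],
    columns=['m0', 'm1', 'm2', 'm3', 'm4', 'count', 'extra1', 'extra2'],
    index=pd.Index(['toy-dataset'], name='dataset_name'))

metrics_comparison_matrix(example)
# Returning a list from apply(axis=1) yields a Series of lists, roughly:
# dataset_name
# toy-dataset    ['91%', '42%', '5%', '33%', '87%', 12345, 'yes', 'no']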
    
In [2]:
    
import pandas as pd

# Load the precomputed metrics: one CSV for the smaller datasets and one for
# the four large datasets that were imported into PostgreSQL.
small_data_metrics = pd.read_csv('./metadata/initial-data-evaluation-metrics.csv')
large_data_metrics = pd.read_csv('./metadata/large-datasets-evaluation-metrics.csv')
    
In [3]:
    
# Stack the small- and large-dataset metrics row-wise (both CSVs share the
# same column layout), key the result by dataset name, and format it.
metrics = metrics_comparison_matrix(
    pd.concat([ small_data_metrics, large_data_metrics ])
        .set_index('dataset_name'))
    
In [5]:
    
# Persist the formatted comparison and display it.
metrics.to_csv('./metadata/all-metrics-formatted.csv')
metrics
    
    Out[5]: (formatted metrics comparison table for all datasets; output omitted)