This post presents the revised functionalities offered by the summary
function described previously in the Summary of Data. The functions live mainly in the preprocess.py file. We use a simple dataset, 'beijing_201802_201803_aq.csv', from the KDD CUP of Fresh Air, which provides air quality measured by weather stations in Beijing during February and March 2018.
The following mainly presents two functionalities: warning about missing data (warn_missing) and summarizing numeric data with outlier handling (summary).
To test the execution of these functions, please clone the repository jqlearning and work through the Jupyter notebook: Summary-of-data2.ipynb.
In [1]:
# Import functions and load data into a dataframe
import sys
sys.path.append("../")
import pandas as pd
from script.preprocess import summary, warn_missing
kwargs = {"parse_dates": ["utc_time"]}
bj_aq_df = pd.read_csv("beijing_201802_201803_aq.csv", **kwargs)
In [2]:
warn_missing(bj_aq_df, "beijing_201802_201803_aq.csv")
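The implementation of warn_missing is not shown in this post, but its role can be illustrated with a minimal sketch. The function name warn_missing_sketch and its behavior (printing a warning per column that contains missing values) are assumptions for illustration, not the actual code in preprocess.py:

```python
import numpy as np
import pandas as pd

def warn_missing_sketch(df, name):
    """Illustrative only: report columns of `df` that contain missing values.

    Returns the per-column missing counts, restricted to columns
    where at least one value is missing.
    """
    missing = df.isnull().sum()
    missing = missing[missing > 0]
    for col, count in missing.items():
        print(f"Warning: {name}: column '{col}' has {count} missing value(s)")
    return missing

# Usage on a tiny frame with one gap in the PM2.5 column
df = pd.DataFrame({"PM2.5": [10.0, np.nan, 30.0], "station": ["a", "b", "c"]})
warn_missing_sketch(df, "demo.csv")
```

A helper like this is useful to run right after loading, so gaps in the data are known before any summary statistics are computed.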
The following summary treats the top 1% and bottom 1% of the numeric data as outliers. The summary
function gives us a quick view of each column's data type and the distribution of its numeric values.
In [3]:
summary(bj_aq_df, quantile=(0.01, 0.99), outlier_as_nan=True, filter_outlier=True)
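The quantile-based masking behind outlier_as_nan=True can be sketched as follows. The helper name mask_outliers is hypothetical; it only illustrates the idea of replacing values outside the 1%-99% band with NaN, not the actual summary implementation:

```python
import pandas as pd

def mask_outliers(series, quantile=(0.01, 0.99)):
    """Sketch: replace values outside the given quantile band with NaN."""
    lo = series.quantile(quantile[0])
    hi = series.quantile(quantile[1])
    # `where` keeps values for which the condition holds, NaN elsewhere
    return series.where(series.between(lo, hi))

# Usage: on the values 0..100, the 1%-99% band is [1, 99],
# so the extremes 0 and 100 are masked
s = pd.Series(range(101))
masked = mask_outliers(s)
```

Masking rather than dropping keeps the index aligned with the original data, which matters when the rows are time-stamped measurements.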
Users can also set outlier_as_nan=False to display the outliers of the data. This setting basically sets the non-outliers to NaN (not a number), so only the outlying values remain visible. Together, these views offer a convenient way to understand the bigger picture of our available data before we proceed to uncover additional insights!
In [4]:
summary(bj_aq_df, quantile=(0.01, 0.99), outlier_as_nan=False, filter_outlier=True)
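The inverted setting can be sketched the same way: masking the non-outliers instead, so only the extreme values survive. Again, keep_only_outliers is a hypothetical name used for illustration, not the actual code in preprocess.py:

```python
import pandas as pd

def keep_only_outliers(series, quantile=(0.01, 0.99)):
    """Sketch: mask the non-outliers so only outlier values remain."""
    lo = series.quantile(quantile[0])
    hi = series.quantile(quantile[1])
    # Negate the in-band condition: values inside [lo, hi] become NaN
    return series.where(~series.between(lo, hi))

# Usage: on 0..100, only the extremes 0 and 100 remain visible
s = pd.Series(range(101))
outliers = keep_only_outliers(s)
```

Inspecting only the outliers like this makes it easy to judge whether the extreme readings are plausible measurements or sensor glitches.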