Summary of Data (Part 2)

This post presents the revised functionality of the summary function described previously in Summary of Data. The functions are mainly available in the preprocess.py file. We use a simple dataset, 'beijing_201802_201803_aq.csv', from the KDD CUP of Fresh Air, which contains air quality measurements from the weather stations in Beijing during February and March 2018.

This post covers two functions:

  1. warn_missing - Reports the proportion of missing values in each column.
  2. summary - Provides a descriptive summary of the data distribution. The visualization of the distribution can exclude outliers (filter_outlier=True), which are specified by a pair of quantiles.

To run the functions yourself, please clone the jqlearning repository and work through the Jupyter notebook Summary-of-data2.ipynb.


In [1]:
# Import functions and load data into a dataframe
import sys
sys.path.append("../")
import pandas as pd
from script.preprocess import summary, warn_missing

kwargs = {"parse_dates": ["utc_time"]}
bj_aq_df = pd.read_csv("beijing_201802_201803_aq.csv", **kwargs)

In [2]:
warn_missing(bj_aq_df, "beijing_201802_201803_aq.csv")


Warning: Column PM2.5 has (6.2%) 3070 missing values in beijing_201802_201803_aq.csv!
Warning: Column PM10 has (26.1%) 12912 missing values in beijing_201802_201803_aq.csv!
Warning: Column NO2 has (6.2%) 3069 missing values in beijing_201802_201803_aq.csv!
Warning: Column CO has (6.7%) 3331 missing values in beijing_201802_201803_aq.csv!
Warning: Column O3 has (6.7%) 3311 missing values in beijing_201802_201803_aq.csv!
Warning: Column SO2 has (6.3%) 3116 missing values in beijing_201802_201803_aq.csv!
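
To make the idea behind these warnings concrete, here is a minimal sketch of how such a report could be computed with pandas. It is not the code from preprocess.py: the function name missing_report, the returned list of strings, and the toy data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def missing_report(df, name):
    """Return one warning line per column that contains missing values.

    Hypothetical helper for illustration; not the actual warn_missing.
    """
    lines = []
    n_rows = len(df)
    for col in df.columns:
        # isna() marks NaN/None/NaT entries; sum() counts them.
        n_missing = int(df[col].isna().sum())
        if n_missing > 0:
            pct = 100.0 * n_missing / n_rows
            lines.append(
                f"Warning: Column {col} has ({pct:.1f}%) {n_missing} "
                f"missing values in {name}!"
            )
    return lines


# Toy data standing in for the air quality table (values are made up).
toy = pd.DataFrame({
    "PM2.5": [10.0, np.nan, 30.0, 40.0],
    "PM10": [np.nan, np.nan, 50.0, 60.0],
    "NO2": [1.0, 2.0, 3.0, 4.0],
})
for line in missing_report(toy, "toy.csv"):
    print(line)
```

Columns without missing values (NO2 in the toy data) produce no line, which matches the behavior seen above, where stationId and utc_time are not mentioned.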

The following summary treats the top 1% and bottom 1% of the numeric data as outliers. The summary function gives a quick overview of the data types and of the distribution of the numeric data.


In [2]:
summary(bj_aq_df, quantile=(0.01, 0.99), outlier_as_nan=True, filter_outlier=True)


--------------------------------------------------------------------------------
********************    Begin of the summary of text data   ********************
--------------------------------------------------------------------------------
count          49420
unique            35
top       tiantan_aq
freq            1412
Name: stationId, dtype: object


count                   49420
unique                   1412
top       2018-02-23 16:00:00
freq                       35
first     2018-01-31 16:00:00
last      2018-03-31 15:00:00
Name: utc_time, dtype: object


--------------------------------------------------------------------------------
********************    End of the summary of text data     ********************
--------------------------------------------------------------------------------
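
The quantile-based outlier handling can be sketched as follows. This is an assumed reading of what quantile=(0.01, 0.99) with outlier_as_nan=True does, not the code from preprocess.py; the helper name mask_outliers and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd


def mask_outliers(series, quantile=(0.01, 0.99)):
    """Replace values outside the given quantile range with NaN.

    Hypothetical helper for illustration; not part of preprocess.py.
    """
    lower = series.quantile(quantile[0])
    upper = series.quantile(quantile[1])
    # between() is inclusive on both ends; where() keeps the values for
    # which the condition holds and writes NaN everywhere else.
    return series.where(series.between(lower, upper))


# Toy data: the values 1.0, 2.0, ..., 101.0 (made up for illustration).
values = pd.Series(np.arange(1.0, 102.0))
masked = mask_outliers(values)
print(masked.isna().sum(), masked.min(), masked.max())
```

With pandas' default linear interpolation, the 1% and 99% quantiles of this toy series are 2.0 and 100.0, so only the two extreme values are replaced by NaN.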

Users can also set outlier_as_nan=False to display the outliers of the data instead. This setting sets the non-outliers to NaN (not a number), so that only the outliers remain. It is a convenient way to understand the bigger picture of the available data before we proceed to uncover additional insights.


In [3]:
summary(bj_aq_df, quantile=(0.01, 0.99), outlier_as_nan=False, filter_outlier=True)


--------------------------------------------------------------------------------
********************    Begin of the summary of text data   ********************
--------------------------------------------------------------------------------
count          49420
unique            35
top       tiantan_aq
freq            1412
Name: stationId, dtype: object


count                   49420
unique                   1412
top       2018-02-23 16:00:00
freq                       35
first     2018-01-31 16:00:00
last      2018-03-31 15:00:00
Name: utc_time, dtype: object


--------------------------------------------------------------------------------
********************    End of the summary of text data     ********************
--------------------------------------------------------------------------------
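
The inverse selection described for outlier_as_nan=False, where the non-outliers become NaN, can be sketched in the same spirit. Again, the helper name keep_outliers and the toy data are assumptions for illustration; the actual implementation in preprocess.py may differ.

```python
import numpy as np
import pandas as pd


def keep_outliers(series, quantile=(0.01, 0.99)):
    """Replace values inside the given quantile range with NaN.

    Hypothetical helper for illustration; not part of preprocess.py.
    """
    lower = series.quantile(quantile[0])
    upper = series.quantile(quantile[1])
    # mask() is the opposite of where(): it writes NaN wherever the
    # condition holds, so only values outside [lower, upper] survive.
    return series.mask(series.between(lower, upper))


# Toy data: the values 1.0, 2.0, ..., 101.0 (made up for illustration).
values = pd.Series(np.arange(1.0, 102.0))
outliers = keep_outliers(values)
print(outliers.dropna().tolist())
```

Dropping the NaN entries afterwards leaves just the extreme values, which is handy for inspecting them on their own.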