Improving Data Quality

Learning Objectives

Resolve missing values
Convert the Date feature column to a datetime format
Rename a feature column, remove a value from a feature column
Create one-hot encoding features
Understand temporal feature conversions

Introduction

Recall that machine learning models can only consume numeric data, and that numeric data should be "1"s or "0"s. Data is said to be "messy" or "untidy" if it is missing attribute values, contains noise or outliers, has duplicates, wrong data, upper/lower case column names, and is essentially not ready for ingestion by a machine learning algorithm.

This notebook presents and solves some of the most common issues of "untidy" data. Note that different problems will require different methods, and they are beyond the scope of this notebook.

Each learning objective will correspond to a #TODO in this student lab notebook -- try to complete this notebook first and then review the solution notebook.



In [ ]:

    
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst



In [ ]:

    
# Ensure the right version of Tensorflow is installed.
!pip freeze | grep tensorflow==2.1 || pip install tensorflow==2.1

Start by importing the necessary libraries for this lab.

Import Libraries



In [1]:

    
import os
import pandas as pd  # First, we'll import Pandas, a data processing and CSV file I/O library
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load the Dataset

The dataset is based on California's Vehicle Fuel Type Count by Zip Code report. The dataset has been modified to make the data "untidy" and is thus a synthetic representation that can be used for learning purposes.

Let's download the raw .csv data by copying the data from a cloud storage bucket.



In [2]:

    
if not os.path.isdir("../data/transport"):
    os.makedirs("../data/transport")



In [3]:

    
!gsutil cp gs://cloud-training-demos/feat_eng/transport/untidy_vehicle_data.csv ../data/transport









    



Copying gs://cloud-training-demos/feat_eng/transport/untidy_vehicle_data.csv...
/ [1 files][ 47.2 KiB/ 47.2 KiB]                                                
Operation completed over 1 objects/47.2 KiB.



In [4]:

    
!ls -l ../data/transport









    



total 48
-rw-r--r-- 1 jupyter jupyter 48343 May 15 04:48 untidy_vehicle_data.csv

Read Dataset into a Pandas DataFrame

Next, let's read in the dataset just copied from the cloud storage bucket and create a Pandas DataFrame. We also add a Pandas .head() function to show you the top 5 rows of data in the DataFrame. Head() and Tail() are "best-practice" functions used to investigate datasets.



In [5]:

    
df_transport = pd.read_csv('../data/transport/untidy_vehicle_data.csv')
df_transport.head() # Output the first five rows.









    Out[5]:







  
    
      
      Date
      Zip Code
      Model Year
      Fuel
      Make
      Light_Duty
      Vehicles
    
  
  
    
      0
      10/1/2018
      90000
      2006
      Gasoline
      OTHER/UNK
      NaN
      1.0
    
    
      1
      10/1/2018
      NaN
      2014
      Gasoline
      NaN
      Yes
      1.0
    
    
      2
      NaN
      90000
      NaN
      Gasoline
      OTHER/UNK
      Yes
      NaN
    
    
      3
      10/1/2018
      90000
      2017
      Gasoline
      OTHER/UNK
      Yes
      1.0
    
    
      4
      10/1/2018
      90000
      <2006
      Diesel and Diesel Hybrid
      OTHER/UNK
      No
      55.0

DataFrame Column Data Types

DataFrames may have heterogenous or "mixed" data types, that is, some columns are numbers, some are strings, and some are dates etc. Because CSV files do not contain information on what data types are contained in each column, Pandas infers the data types when loading the data, e.g. if a column contains only numbers, Pandas will set that column’s data type to numeric: integer or float.

Run the next cell to see information on the DataFrame.



In [6]:

    
df_transport.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 7 columns):
Date          997 non-null object
Zip Code      997 non-null object
Model Year    997 non-null object
Fuel          996 non-null object
Make          996 non-null object
Light_Duty    996 non-null object
Vehicles      996 non-null float64
dtypes: float64(1), object(6)
memory usage: 54.8+ KB

From what the .info() function shows us, we have six string objects and one float object. Let's print out the first and last five rows of each column. We can definitely see more of the "string" object values now!



In [7]:

    
print(df_transport,5)









    



          Date Zip Code Model Year                      Fuel       Make  \
0    10/1/2018    90000       2006                  Gasoline  OTHER/UNK   
1    10/1/2018      NaN       2014                  Gasoline        NaN   
2          NaN    90000        NaN                  Gasoline  OTHER/UNK   
3    10/1/2018    90000       2017                  Gasoline  OTHER/UNK   
4    10/1/2018    90000      <2006  Diesel and Diesel Hybrid  OTHER/UNK   
..         ...      ...        ...                       ...        ...   
994   6/7/2019    90003       2012                  Gasoline     Type_R   
995   6/8/2019    90003       2012           Hybrid Gasoline  OTHER/UNK   
996   6/9/2019    90003       2012           Hybrid Gasoline     Type_Q   
997  6/10/2019    90003       2012               Natural Gas  OTHER/UNK   
998  6/11/2019    90003       2012            Plug-in Hybrid  OTHER/UNK   

    Light_Duty  Vehicles  
0          NaN       1.0  
1          Yes       1.0  
2          Yes       NaN  
3          Yes       1.0  
4           No      55.0  
..         ...       ...  
994        Yes      26.0  
995        Yes       4.0  
996        Yes      25.0  
997        Yes       1.0  
998        Yes       3.0  

[999 rows x 7 columns] 5

Summary Statistics

At this point, we have only one column which contains a numerical value (e.g. Vehicles). For features which contain numerical values, we are often interested in various statistical measures relating to those values. We can use .describe() to see some summary statistics for the numeric fields in our dataframe. Note, that because we only have one numeric feature, we see only one summary stastic - for now.



In [8]:

    
df_transport.describe()









    Out[8]:







  
    
      
      Vehicles
    
  
  
    
      count
      996.000000
    
    
      mean
      72.878514
    
    
      std
      229.696895
    
    
      min
      1.000000
    
    
      25%
      13.000000
    
    
      50%
      23.000000
    
    
      75%
      57.250000
    
    
      max
      3178.000000

Let's investigate a bit more of our data by using the .groupby() function.



In [9]:

    
grouped_data = df_transport.groupby(['Zip Code','Model Year','Fuel','Make','Light_Duty','Vehicles'])
df_transport.groupby('Fuel').first() # Get the first entry for each month.









    Out[9]:







  
    
      
      Date
      Zip Code
      Model Year
      Make
      Light_Duty
      Vehicles
    
    
      Fuel
      
      
      
      
      
      
    
  
  
    
      Battery Electric
      10/1/2018
      90000
      <2006
      OTHER/UNK
      No
      4.0
    
    
      Diesel and Diesel Hybrid
      10/1/2018
      90000
      <2006
      OTHER/UNK
      No
      55.0
    
    
      Flex-Fuel
      10/14/2018
      90001
      2007
      Type_A
      Yes
      78.0
    
    
      Gasoline
      10/1/2018
      90000
      2006
      OTHER/UNK
      Yes
      1.0
    
    
      Hybrid Gasoline
      10/24/2018
      90001
      2009
      OTHER/UNK
      Yes
      18.0
    
    
      Natural Gas
      10/25/2018
      90001
      2009
      OTHER/UNK
      No
      2.0
    
    
      Other
      10/8/2018
      90000
      <2006
      OTHER/UNK
      Yes
      6.0
    
    
      Plug-in Hybrid
      11/2/2018
      90001
      2012
      OTHER/UNK
      Yes
      1.0

Checking for Missing Values

Missing values adversely impact data quality, as they can lead the machine learning model to make inaccurate inferences about the data. Missing values can be the result of numerous factors, e.g. "bits" lost during streaming transmission, data entry, or perhaps a user forgot to fill in a field. Note that Pandas recognizes both empty cells and “NaN” types as missing values.

Let's show the null values for all features in the DataFrame.



In [10]:

    
df_transport.isnull().sum()









    Out[10]:





Date          2
Zip Code      2
Model Year    2
Fuel          3
Make          3
Light_Duty    3
Vehicles      3
dtype: int64

To see a sampling of which values are missing, enter the feature column name. You'll notice that "False" and "True" correpond to the presence or abscence of a value by index number.



In [11]:

    
print (df_transport['Date'])
print (df_transport['Date'].isnull())









    



0      10/1/2018
1      10/1/2018
2            NaN
3      10/1/2018
4      10/1/2018
         ...    
994     6/7/2019
995     6/8/2019
996     6/9/2019
997    6/10/2019
998    6/11/2019
Name: Date, Length: 999, dtype: object
0      False
1      False
2       True
3      False
4      False
       ...  
994    False
995    False
996    False
997    False
998    False
Name: Date, Length: 999, dtype: bool



In [12]:

    
print (df_transport['Make'])
print (df_transport['Make'].isnull())









    



0      OTHER/UNK
1            NaN
2      OTHER/UNK
3      OTHER/UNK
4      OTHER/UNK
         ...    
994       Type_R
995    OTHER/UNK
996       Type_Q
997    OTHER/UNK
998    OTHER/UNK
Name: Make, Length: 999, dtype: object
0      False
1       True
2      False
3      False
4      False
       ...  
994    False
995    False
996    False
997    False
998    False
Name: Make, Length: 999, dtype: bool



In [13]:

    
print (df_transport['Model Year'])
print (df_transport['Model Year'].isnull())









    



0       2006
1       2014
2        NaN
3       2017
4      <2006
       ...  
994     2012
995     2012
996     2012
997     2012
998     2012
Name: Model Year, Length: 999, dtype: object
0      False
1      False
2       True
3      False
4      False
       ...  
994    False
995    False
996    False
997    False
998    False
Name: Model Year, Length: 999, dtype: bool

What can we deduce about the data at this point?

First, let's summarize our data by row, column, features, unique, and missing values,



In [14]:

    
print ("Rows     : " ,df_transport.shape[0])
print ("Columns  : " ,df_transport.shape[1])
print ("\nFeatures : \n" ,df_transport.columns.tolist())
print ("\nUnique values :  \n",df_transport.nunique())
print ("\nMissing values :  ", df_transport.isnull().sum().values.sum())









    



Rows     :  999
Columns  :  7

Features : 
 ['Date', 'Zip Code', 'Model Year', 'Fuel', 'Make', 'Light_Duty', 'Vehicles']

Unique values :  
 Date          248
Zip Code        6
Model Year     15
Fuel            8
Make           43
Light_Duty      3
Vehicles      210
dtype: int64

Missing values :   18

Let's see the data again -- this time the last five rows in the dataset.



In [15]:

    
df_transport.tail()









    Out[15]:







  
    
      
      Date
      Zip Code
      Model Year
      Fuel
      Make
      Light_Duty
      Vehicles
    
  
  
    
      994
      6/7/2019
      90003
      2012
      Gasoline
      Type_R
      Yes
      26.0
    
    
      995
      6/8/2019
      90003
      2012
      Hybrid Gasoline
      OTHER/UNK
      Yes
      4.0
    
    
      996
      6/9/2019
      90003
      2012
      Hybrid Gasoline
      Type_Q
      Yes
      25.0
    
    
      997
      6/10/2019
      90003
      2012
      Natural Gas
      OTHER/UNK
      Yes
      1.0
    
    
      998
      6/11/2019
      90003
      2012
      Plug-in Hybrid
      OTHER/UNK
      Yes
      3.0

What Are Our Data Quality Issues?

Data Quality Issue #1:
Missing Values: Each feature column has multiple missing values. In fact, we have a total of 18 missing values.
Data Quality Issue #2:
Date DataType: Date is shown as an "object" datatype and should be a datetime. In addition, Date is in one column. Our business requirement is to see the Date parsed out to year, month, and day.
Data Quality Issue #3:
Model Year: We are only interested in years greater than 2006, not "<2006".
Data Quality Issue #4:
Categorical Columns: The feature column "Light_Duty" is categorical and has a "Yes/No" choice. We cannot feed values like this into a machine learning model. In addition, we need to "one-hot encode the remaining "string"/"object" columns.
Data Quality Issue #5:
Temporal Features: How do we handle year, month, and day?

Data Quality Issue #1:

Resolving Missing Values

Most algorithms do not accept missing values. Yet, when we see missing values in our dataset, there is always a tendency to just "drop all the rows" with missing values. Although Pandas will fill in the blank space with “NaN", we should "handle" them in some way.

While all the methods to handle missing values is beyond the scope of this lab, there are a few methods you should consider. For numeric columns, use the "mean" values to fill in the missing numeric values. For categorical columns, use the "mode" (or most frequent values) to fill in missing categorical values.

In this lab, we use the .apply and Lambda functions to fill every column with its own most frequent value. You'll learn more about Lambda functions later in the lab.

Let's check again for missing values by showing how many rows contain NaN values for each feature column.

Lab Task #1a: Check for missing values by showing how many rows contain NaN values for each feature column.



In [16]:

    
# TODO 1a
# TODO -- Your code here.









    Out[16]:





Date          2
Zip Code      2
Model Year    2
Fuel          3
Make          3
Light_Duty    3
Vehicles      3
dtype: int64

Lab Task #1b: Apply the lambda function.



In [17]:

    
# TODO 1b
# TODO -- Your code here.

Lab Task #1c: Check again for missing values.



In [18]:

    
# TODO 1c
# TODO -- Your code here.









    Out[18]:





Date          0
Zip Code      0
Model Year    0
Fuel          0
Make          0
Light_Duty    0
Vehicles      0
dtype: int64

Data Quality Issue #2:

Convert the Date Feature Column to a Datetime Format

The date column is indeed shown as a string object.

Lab Task #2a: Convert the datetime datatype with the to_datetime() function in Pandas.



In [19]:

    
# TODO 2a
# TODO -- Your code here.

Lab Task #2b: Show the converted Date.



In [20]:

    
# TODO 2b
# TODO -- Your code here.









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 7 columns):
Date          999 non-null datetime64[ns]
Zip Code      999 non-null object
Model Year    999 non-null object
Fuel          999 non-null object
Make          999 non-null object
Light_Duty    999 non-null object
Vehicles      999 non-null float64
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 54.8+ KB

Let's parse Date into three columns, e.g. year, month, and day.



In [21]:

    
df_transport['year'] = df_transport['Date'].dt.year
df_transport['month'] = df_transport['Date'].dt.month
df_transport['day'] = df_transport['Date'].dt.day
#df['hour'] = df['date'].dt.hour - you could use this if your date format included hour.
#df['minute'] = df['date'].dt.minute - you could use this if your date format included minute.
df_transport.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 10 columns):
Date          999 non-null datetime64[ns]
Zip Code      999 non-null object
Model Year    999 non-null object
Fuel          999 non-null object
Make          999 non-null object
Light_Duty    999 non-null object
Vehicles      999 non-null float64
year          999 non-null int64
month         999 non-null int64
day           999 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(3), object(5)
memory usage: 78.2+ KB

Next, let's confirm the Date parsing. This will also give us a another visualization of the data.



In [22]:

    
grouped_data = df_transport.groupby(['Make'])
df_transport.groupby('Fuel').first() # Get the first entry for each month.









    Out[22]:







  
    
      
      Date
      Zip Code
      Model Year
      Make
      Light_Duty
      Vehicles
      year
      month
      day
    
    
      Fuel
      
      
      
      
      
      
      
      
      
    
  
  
    
      Battery Electric
      2018-10-01
      90000
      <2006
      OTHER/UNK
      No
      4.0
      2018
      10
      1
    
    
      Diesel and Diesel Hybrid
      2018-10-01
      90000
      <2006
      OTHER/UNK
      No
      55.0
      2018
      10
      1
    
    
      Flex-Fuel
      2018-10-14
      90001
      2007
      Type_A
      Yes
      78.0
      2018
      10
      14
    
    
      Gasoline
      2018-10-01
      90000
      2006
      OTHER/UNK
      Yes
      1.0
      2018
      10
      1
    
    
      Hybrid Gasoline
      2018-10-24
      90001
      2009
      OTHER/UNK
      Yes
      18.0
      2018
      10
      24
    
    
      Natural Gas
      2018-10-25
      90001
      2009
      OTHER/UNK
      No
      2.0
      2018
      10
      25
    
    
      Other
      2018-10-08
      90000
      <2006
      OTHER/UNK
      Yes
      6.0
      2018
      10
      8
    
    
      Plug-in Hybrid
      2018-11-02
      90001
      2012
      OTHER/UNK
      Yes
      1.0
      2018
      11
      2

Now that we have Dates as a integers, let's do some additional plotting.



In [23]:

    
plt.figure(figsize=(10,6))
sns.jointplot(x='month',y='Vehicles',data=df_transport)
#plt.title('Vehicles by Month')









    Out[23]:





<seaborn.axisgrid.JointGrid at 0x7f559d06e668>






    





<Figure size 720x432 with 0 Axes>

Data Quality Issue #3:

Rename a Feature Column and Remove a Value.

Our feature columns have different "capitalizations" in their names, e.g. both upper and lower "case". In addition, there are "spaces" in some of the column names. In addition, we are only interested in years greater than 2006, not "<2006".

Lab Task #3a: Remove all the spaces for feature columns by renaming them.



In [24]:

    
# TODO 3a
# TODO -- Your code here.









    Out[24]:







  
    
      
      date
      zipcode
      modelyear
      fuel
      make
      lightduty
      vehicles
      year
      month
      day
    
  
  
    
      0
      2018-10-01
      90000
      2006
      Gasoline
      OTHER/UNK
      Yes
      1.0
      2018
      10
      1
    
    
      1
      2018-10-01
      90002
      2014
      Gasoline
      OTHER/UNK
      Yes
      1.0
      2018
      10
      1

Note: Next we create a copy of the dataframe to avoid the "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame" warning. Run the cell to remove the value '<2006' from the modelyear feature column.

Lab Task #3b: Create a copy of the dataframe to avoid copy warning issues.



In [25]:

    
# TODO 3b
# TODO -- Your code here.

Next, confirm that the modelyear value '<2006' has been removed by doing a value count.



In [26]:

    
df['modelyear'].value_counts(0)









    Out[26]:





2007    87
2012    81
2008    79
2011    77
2010    71
2006    70
2015    61
2014    59
2016    57
2017    57
2009    53
2013    52
2018    42
2019     6
Name: modelyear, dtype: int64

Data Quality Issue #4:

Handling Categorical Columns

The feature column "lightduty" is categorical and has a "Yes/No" choice. We cannot feed values like this into a machine learning model. We need to convert the binary answers from strings of yes/no to integers of 1/0. There are various methods to achieve this. We will use the "apply" method with a lambda expression. Pandas. apply() takes a function and applies it to all values of a Pandas series.

What is a Lambda Function?

Typically, Python requires that you define a function using the def keyword. However, lambda functions are anonymous -- which means there is no need to name them. The most common use case for lambda functions is in code that requires a simple one-line function (e.g. lambdas only have a single expression).

As you progress through the Course Specialization, you will see many examples where lambda functions are being used. Now is a good time to become familiar with them.

First, lets count the number of "Yes" and"No's" in the 'lightduty' feature column.



In [27]:

    
df['lightduty'].value_counts(0)









    Out[27]:





Yes    771
No      80
--       1
Name: lightduty, dtype: int64

Let's convert the Yes to 1 and No to 0. Pandas. apply() . apply takes a function and applies it to all values of a Pandas series (e.g. lightduty).



In [28]:

    
df.loc[:,'lightduty'] = df['lightduty'].apply(lambda x: 0 if x=='No' else 1)
df['lightduty'].value_counts(0)









    Out[28]:





1    772
0     80
Name: lightduty, dtype: int64



In [29]:

    
# Confirm that "lightduty" has been converted.

df.head()









    Out[29]:







  
    
      
      date
      zipcode
      modelyear
      fuel
      make
      lightduty
      vehicles
      year
      month
      day
    
  
  
    
      0
      2018-10-01
      90000
      2006
      Gasoline
      OTHER/UNK
      1
      1.0
      2018
      10
      1
    
    
      1
      2018-10-01
      90002
      2014
      Gasoline
      OTHER/UNK
      1
      1.0
      2018
      10
      1
    
    
      3
      2018-10-01
      90000
      2017
      Gasoline
      OTHER/UNK
      1
      1.0
      2018
      10
      1
    
    
      16
      2018-10-09
      90001
      2006
      Diesel and Diesel Hybrid
      Type_C
      0
      16.0
      2018
      10
      9
    
    
      17
      2018-10-10
      90001
      2006
      Diesel and Diesel Hybrid
      OTHER/UNK
      0
      23.0
      2018
      10
      10

One-Hot Encoding Categorical Feature Columns

Machine learning algorithms expect input vectors and not categorical features. Specifically, they cannot handle text or string values. Thus, it is often useful to transform categorical features into vectors.

One transformation method is to create dummy variables for our categorical features. Dummy variables are a set of binary (0 or 1) variables that each represent a single class from a categorical feature. We simply encode the categorical variable as a one-hot vector, i.e. a vector where only one element is non-zero, or hot. With one-hot encoding, a categorical feature becomes an array whose size is the number of possible choices for that feature.

Panda provides a function called "get_dummies" to convert a categorical variable into dummy/indicator variables.



In [30]:

    
# Making dummy variables for categorical data with more inputs.  

data_dummy = pd.get_dummies(df[['zipcode','modelyear', 'fuel', 'make']], drop_first=True)
data_dummy.head()









    Out[30]:







  
    
      
      zipcode_90001
      zipcode_90002
      zipcode_90003
      zipcode_9001
      zipcode_na
      modelyear_2007
      modelyear_2008
      modelyear_2009
      modelyear_2010
      modelyear_2011
      ...
      make_Type_Q
      make_Type_R
      make_Type_S
      make_Type_T
      make_Type_U
      make_Type_V
      make_Type_W
      make_Type_X
      make_Type_Y
      make_Type_Z
    
  
  
    
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      3
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      16
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      17
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
  

5 rows × 56 columns

Lab Task #4a: Merge (concatenate) original data frame with 'dummy' dataframe.



In [31]:

    
# TODO 4a
# TODO -- Your code here.









    Out[31]:







  
    
      
      date
      zipcode
      modelyear
      fuel
      make
      lightduty
      vehicles
      year
      month
      day
      ...
      make_Type_Q
      make_Type_R
      make_Type_S
      make_Type_T
      make_Type_U
      make_Type_V
      make_Type_W
      make_Type_X
      make_Type_Y
      make_Type_Z
    
  
  
    
      0
      2018-10-01
      90000
      2006
      Gasoline
      OTHER/UNK
      1
      1.0
      2018
      10
      1
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1
      2018-10-01
      90002
      2014
      Gasoline
      OTHER/UNK
      1
      1.0
      2018
      10
      1
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      3
      2018-10-01
      90000
      2017
      Gasoline
      OTHER/UNK
      1
      1.0
      2018
      10
      1
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      16
      2018-10-09
      90001
      2006
      Diesel and Diesel Hybrid
      Type_C
      0
      16.0
      2018
      10
      9
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      17
      2018-10-10
      90001
      2006
      Diesel and Diesel Hybrid
      OTHER/UNK
      0
      23.0
      2018
      10
      10
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
  

5 rows × 66 columns

Lab Task #4b: Drop attributes for which we made dummy variables.



In [32]:

    
# TODO 4b
# TODO -- Your code here.



In [33]:

    
# Confirm that 'zipcode','modelyear', 'fuel', and 'make' have been dropped.

df.head()









    Out[33]:







  
    
      
      lightduty
      vehicles
      year
      month
      day
      zipcode_90001
      zipcode_90002
      zipcode_90003
      zipcode_9001
      zipcode_na
      ...
      make_Type_Q
      make_Type_R
      make_Type_S
      make_Type_T
      make_Type_U
      make_Type_V
      make_Type_W
      make_Type_X
      make_Type_Y
      make_Type_Z
    
  
  
    
      0
      1
      1.0
      2018
      10
      1
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1
      1
      1.0
      2018
      10
      1
      0
      1
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      3
      1
      1.0
      2018
      10
      1
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      16
      0
      16.0
      2018
      10
      9
      1
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      17
      0
      23.0
      2018
      10
      10
      1
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
  

5 rows × 61 columns

Data Quality Issue #5:

Temporal Feature Columns

Our dataset now contains year, month, and day feature columns. Let's convert the month and day feature columns to meaningful representations as a way to get us thinking about changing temporal features -- as they are sometimes overlooked.

Note that the Feature Engineering course in this Specialization will provide more depth on methods to handle year, month, day, and hour feature columns.

First, let's print the unique values for "month" and "day" in our dataset.



In [34]:

    
print ('Unique values of month:',df.month.unique())
print ('Unique values of day:',df.day.unique())
print ('Unique values of year:',df.year.unique())









    



Unique values of month: [10 11 12  1  2  3  4  5  6  7]
Unique values of day: [ 1  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
  2  3  4  5  6  7  8]
Unique values of year: [2018 2019]

Next, we map each temporal variable onto a circle such that the lowest value for that variable appears right next to the largest value. We compute the x- and y- component of that point using sin and cos trigonometric functions. Don't worry, this is the last time we will use this code, as you can develop an input pipeline to address these temporal feature columns in TensorFlow and Keras - and it is much easier! But, sometimes you need to appreciate what you're not going to encounter as you move through the course!

Run the cell to view the output.

Lab Task #5: Drop month, and day



In [ ]:

    
df['day_sin'] = np.sin(df.day*(2.*np.pi/31))
df['day_cos'] = np.cos(df.day*(2.*np.pi/31))
df['month_sin'] = np.sin((df.month-1)*(2.*np.pi/12))
df['month_cos'] = np.cos((df.month-1)*(2.*np.pi/12))

# TODO 5
# TODO -- Your code here.



In [39]:

    
# scroll left to see the converted month and day coluumns.
df.tail(4)









    Out[39]:







  
    
      
      lightduty
      vehicles
      zipcode_90001
      zipcode_90002
      zipcode_90003
      zipcode_9001
      zipcode_na
      modelyear_2007
      modelyear_2008
      modelyear_2009
      ...
      make_Type_U
      make_Type_V
      make_Type_W
      make_Type_X
      make_Type_Y
      make_Type_Z
      day_sin
      day_cos
      month_sin
      month_cos
    
  
  
    
      995
      1
      4.0
      0
      0
      1
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0.998717
      -0.050649
      0.5
      -0.866025
    
    
      996
      1
      25.0
      0
      0
      1
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0.968077
      -0.250653
      0.5
      -0.866025
    
    
      997
      1
      1.0
      0
      0
      1
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0.897805
      -0.440394
      0.5
      -0.866025
    
    
      998
      1
      3.0
      0
      0
      1
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0.790776
      -0.612106
      0.5
      -0.866025
    
  

4 rows × 62 columns

Conclusion

This notebook introduced a few concepts to improve data quality. We resolved missing values, converted the Date feature column to a datetime format, renamed feature columns, removed a value from a feature column, created one-hot encoding features, and converted temporal features to meaningful representations. By the end of our lab, we gained an understanding as to why data should be "cleaned" and "pre-processed" before input into a machine learning model.

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

	Date	Zip Code	Model Year	Fuel	Make	Light_Duty	Vehicles
0	10/1/2018	90000	2006	Gasoline	OTHER/UNK	NaN	1.0
1	10/1/2018	NaN	2014	Gasoline	NaN	Yes	1.0
2	NaN	90000	NaN	Gasoline	OTHER/UNK	Yes	NaN
3	10/1/2018	90000	2017	Gasoline	OTHER/UNK	Yes	1.0
4	10/1/2018	90000	<2006	Diesel and Diesel Hybrid	OTHER/UNK	No	55.0

	Vehicles
count	996.000000
mean	72.878514
std	229.696895
min	1.000000
25%	13.000000
50%	23.000000
75%	57.250000
max	3178.000000

	Date	Zip Code	Model Year	Make	Light_Duty	Vehicles
Fuel
Battery Electric	10/1/2018	90000	<2006	OTHER/UNK	No	4.0
Diesel and Diesel Hybrid	10/1/2018	90000	<2006	OTHER/UNK	No	55.0
Flex-Fuel	10/14/2018	90001	2007	Type_A	Yes	78.0
Gasoline	10/1/2018	90000	2006	OTHER/UNK	Yes	1.0
Hybrid Gasoline	10/24/2018	90001	2009	OTHER/UNK	Yes	18.0
Natural Gas	10/25/2018	90001	2009	OTHER/UNK	No	2.0
Other	10/8/2018	90000	<2006	OTHER/UNK	Yes	6.0
Plug-in Hybrid	11/2/2018	90001	2012	OTHER/UNK	Yes	1.0

	Date	Zip Code	Model Year	Fuel	Make	Light_Duty	Vehicles
994	6/7/2019	90003	2012	Gasoline	Type_R	Yes	26.0
995	6/8/2019	90003	2012	Hybrid Gasoline	OTHER/UNK	Yes	4.0
996	6/9/2019	90003	2012	Hybrid Gasoline	Type_Q	Yes	25.0
997	6/10/2019	90003	2012	Natural Gas	OTHER/UNK	Yes	1.0
998	6/11/2019	90003	2012	Plug-in Hybrid	OTHER/UNK	Yes	3.0

	Date	Zip Code	Model Year	Make	Light_Duty	Vehicles	year	month	day
Fuel
Battery Electric	2018-10-01	90000	<2006	OTHER/UNK	No	4.0	2018	10	1
Diesel and Diesel Hybrid	2018-10-01	90000	<2006	OTHER/UNK	No	55.0	2018	10	1
Flex-Fuel	2018-10-14	90001	2007	Type_A	Yes	78.0	2018	10	14
Gasoline	2018-10-01	90000	2006	OTHER/UNK	Yes	1.0	2018	10	1
Hybrid Gasoline	2018-10-24	90001	2009	OTHER/UNK	Yes	18.0	2018	10	24
Natural Gas	2018-10-25	90001	2009	OTHER/UNK	No	2.0	2018	10	25
Other	2018-10-08	90000	<2006	OTHER/UNK	Yes	6.0	2018	10	8
Plug-in Hybrid	2018-11-02	90001	2012	OTHER/UNK	Yes	1.0	2018	11	2

	zipcode_90001	zipcode_90002	...
0	0	0	...
1	0	1	...
3	0	0	...
16	1	0	...
17	1	0	...

	lightduty	vehicles	zipcode_90003	...	day_sin	day_cos	month_sin	month_cos
995	1	4.0	1	...	0.998717	-0.050649	0.5	-0.866025
996	1	25.0	1	...	0.968077	-0.250653	0.5	-0.866025
997	1	1.0	1	...	0.897805	-0.440394	0.5	-0.866025
998	1	3.0	1	...	0.790776	-0.612106	0.5	-0.866025

	zipcode_90001	zipcode_90002	...
0	0	0	...
1	0	1	...
3	0	0	...
16	1	0	...
17	1	0	...

	zipcode_90001	zipcode_90002	...
0	0	0	...
1	0	1	...
3	0	0	...
16	1	0	...
17	1	0	...