In the last two weeks I have been working on a project for my Business Intelligence class, and I learned why people always say they spend 80% of their time cleaning the data; it couldn't be more true.
To practice a little more I decided to try the Kaggle bulldozers competition. The first thing I noticed was that the data is huge (450+ MB), by far the biggest dataset I have dealt with; for most Big Data experts it is probably tiny, but for me it was huge xD.
I was curious to see if Python could handle it and do some basic cleaning, and along the way I ended up adding some functionality to copper: date cleaning and joining Datasets.
I started by looking at (and getting scared of) the data in Excel, then imported it into Python (pandas): 401k+ rows and 53 columns.
In [1]:
import copper
copper.project.path = '../'
In [2]:
train = copper.read_csv('train.csv')
In [3]:
len(train), len(train.columns)
Out[3]:
I decided to divide the cleaning into 4 sections of approximately 12 columns each and create 4 different copper.Datasets to modify the metadata of each section. Mainly I needed to decide which columns are useful and which have too many missing values to be useful, while making sure that the useful columns do not duplicate each other (categorical variables) and are usable for machine learning (dates).
Select the first 12 columns and print their first values.
In [4]:
cols1 = train[train.columns[0:12]]
In [5]:
cols1.head()
Out[5]:
Most of the columns in this section are only IDs, which are not useful for prediction, and the target is the SalePrice variable, so I created a Dataset and set the correct metadata.
In [6]:
ds1 = copper.Dataset(cols1)
In [7]:
ds1.role['SalesID'] = ds1.ID
ds1.role['MachineID'] = ds1.ID
ds1.role['ModelID'] = ds1.ID
ds1.role['datasource'] = ds1.ID
ds1.role['auctioneerID'] = ds1.ID
ds1.role['SalePrice'] = ds1.TARGET
In [8]:
ds1.percent_missing()
Out[8]:
Columns with more than 50% missing values are automatically rejected, but the ones that are not rejected (and are not IDs) look good.
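For reference, the same check is easy to reproduce in plain pandas. This is only a rough sketch of what the 50% rule amounts to; the helper name and the threshold constant here are mine, not copper's internals.

# Rough pandas equivalent of percent_missing() and the 50% auto-reject rule
def percent_missing(df):
    # fraction of missing values per column
    return df.isnull().mean()

missing = percent_missing(cols1)
auto_rejected = missing[missing > 0.5].index  # columns copper would treat as rejected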
The date field could be useful, but it is text, so it needs to be converted into a Python datetime; and for machine learning it is useful to convert dates into numbers, the most common approach being the Julian date method.
Original data:
In [9]:
ds1['saledate'].head(2)
Out[9]:
Convert it into a datetime by giving the correct format.
In [10]:
ds1['saledate'] = ds1['saledate'].apply(copper.transform.strptime, args='%m/%d/%Y %H:%M')
In [11]:
ds1['saledate'].head(2)
Out[11]:
Convert the dates into numbers.
In [12]:
ds1['saledate'] = ds1['saledate'].apply(copper.transform.date_to_number)
In [13]:
ds1['saledate'].head(2)
Out[13]:
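I have not checked exactly what copper.transform.date_to_number returns, but the idea is the standard one: parse the text with an explicit format, then map each date to a single number. A rough plain-pandas sketch of the same two steps, using toordinal() as the number:

import pandas as pd

# Parse the text dates, then turn each one into a number.
# toordinal() counts days since 0001-01-01; adding 1721424.5 gives the Julian day.
parsed = pd.to_datetime(cols1['saledate'], format='%m/%d/%Y %H:%M')
as_number = parsed.map(lambda d: d.toordinal())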
Change the type metadata of saledate because it is now a Number, not a Category.
In [14]:
ds1.type['saledate'] = ds1.NUMBER
In [15]:
ds1.metadata
Out[15]:
Select the next set of columns, print the first rows, and check the percent of missing values.
In [16]:
cols2 = train[train.columns[12:20]]
cols2.head(7)
Out[16]:
In [17]:
ds2 = copper.Dataset(cols2)
ds2.percent_missing()
Out[17]:
The first dataset has one column (fiModelDesc) that is already split into two variables here (fiBaseModel and fiSecondaryDesc), so we do not need the original variable. We also have ProductGroup and ProductGroupDesc, which carry the same information, so I reject one of them.
Finally there is fiProductClassDesc, which could be useful but needs more transformation. Initially I was thinking of extracting the numbers from the field, but some values are Lbs and some are Horsepower, so it is not that simple; for now I reject it.
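If I come back to fiProductClassDesc later, one option would be to extract the number together with the unit that follows it, so Horsepower and Lbs values end up in separate columns instead of being mixed together. A sketch of the idea; the regex and the unit spellings are assumptions that would need to be checked against the real values:

import re

# Pull out the first number of the range and the unit word right after it,
# e.g. '110.0 to 120.0 Horsepower' -> class_value 110.0, class_unit 'Horsepower'.
extracted = cols2['fiProductClassDesc'].str.extract(
    r'(?P<class_value>[\d.]+)(?:\s*to\s*[\d.]+)?\s*(?P<class_unit>Horsepower|Lbs?)',
    flags=re.IGNORECASE)
extracted['class_value'] = extracted['class_value'].astype(float)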
In [18]:
ds1.role['fiModelDesc'] = ds1.REJECTED
ds2.role['ProductGroupDesc'] = ds2.REJECTED
ds2.role['fiProductClassDesc'] = ds2.REJECTED
Since I have rejected so many variables, I gave 'ProductSize' a chance, since its percentage of missing values is just above the rejection threshold.
In [19]:
ds2.role['ProductSize'] = ds2.INPUT
In [20]:
set(ds2['ProductSize'])
Out[20]:
It looks quite clean, so no problems there. The final metadata for this section looks like this.
In [21]:
ds2.metadata
Out[21]:
Print the first rows of the next section and see the missing values.
In [22]:
cols3 = train[train.columns[20:31]]
cols3.head(5)
Out[22]:
In [23]:
ds3 = copper.Dataset(cols3)
ds3.percent_missing()
Out[23]:
By default only Enclosure will be an input, so let's see if it needs to be cleaned.
In [24]:
set(ds3['Enclosure'])
Out[24]:
I believe 'EROPS w AC' and 'EROPS AC' are the same; I should probably check the competition forum, but for now let's say I am right. I also change 'None or Unspecified' to NaN.
In [25]:
import numpy as np
ds3['Enclosure'][ds3['Enclosure'] == 'EROPS w AC'] = 'EROPS AC'
ds3['Enclosure'][ds3['Enclosure'] == 'None or Unspecified'] = np.nan
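The chained indexing above works here, but in plain pandas the same recoding is usually done with replace(), which avoids chained-assignment warnings; a small sketch on the raw cols3 frame, purely as an illustration:

import numpy as np

# Merge the two AC labels and treat the placeholder category as missing
enclosure = cols3['Enclosure'].replace(
    {'EROPS w AC': 'EROPS AC', 'None or Unspecified': np.nan})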
Since I created some new NaNs, let's see if that changes the missing values.
In [26]:
ds3.percent_missing()['Enclosure']
Out[26]:
Let's take a quick look at 'Forks', which is just above the rejection limit.
In [27]:
set(ds3['Forks'])
Out[27]:
In [28]:
ds3['Forks'][ds3['Forks'] == 'None or Unspecified'] = np.nan
In [29]:
ds3.percent_missing()['Forks']
Out[29]:
Changing 'None or Unspecified' to NaN makes the variable almost all missing values, so it is not useful at all. The final metadata for this section is:
In [30]:
ds3.metadata
Out[30]:
Let's see the remaining columns and their missing values.
In [31]:
ds4 = copper.Dataset(train[train.columns[31:]])
ds4.percent_missing()
Out[31]:
Almost everything is rejected. Let's take a look at what is not rejected.
In [32]:
set(ds4['Coupler'])
Out[32]:
In [33]:
set(ds4['Hydraulics'])
Out[33]:
We need to change 'None or Unspecified' to NaN and see what happens.
In [34]:
ds4['Coupler'][ds4['Coupler'] == 'None or Unspecified'] = np.nan
ds4['Hydraulics'][ds4['Hydraulics'] == 'None or Unspecified'] = np.nan
In [35]:
ds4.percent_missing()[['Coupler', 'Hydraulics']]
Out[35]:
Hydraulics keeps a low percentage of missing values, but Coupler now has a huge percentage of missing values, so I reject it.
In [36]:
ds4.role['Coupler'] = ds4.REJECTED
With all the datasets ready we can join them into one huge dataset.
In [37]:
ds = copper.join(ds1, ds2, others=[ds3, ds4])
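I have not looked at the internals of copper.join, but conceptually this step is a column-wise concatenation of the four pieces, plus the role/type metadata that copper keeps track of. Roughly, in plain pandas:

import pandas as pd

# Glue the four column slices back together side by side; axis=1 works here
# because they all share the original row index from train.
joined = pd.concat([cols1, cols2, cols3, train[train.columns[31:]]], axis=1)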
Let's just check the number of rows and columns.
In [38]:
len(ds), len(ds.columns)
Out[38]:
We have everything, but I am not going to use the rejected data for machine learning, so let's filter the data, taking only the Inputs and Targets.
Note: By default the filter method returns a pandas.DataFrame, but in this case I want a Dataset, so I use the ret_ds parameter.
In [39]:
ds = ds.filter(role=['Input', 'Target'], ret_ds=True)
Just a final check on the first values and missing values.
In [40]:
ds.head()
Out[40]:
In [41]:
ds.percent_missing()
Out[41]:
Finally save the (pickled) Dataset for future use.
In [42]:
copper.save(ds, 'cleaned')
Removing and adding columns using copper Datasets is really easy, and the new join feature makes this kind of task even easier.
What I don't like is that I removed 43 columns, and I don't have very high expectations using only the remaining 10 for machine learning. Usually dates are not that useful, but we will see; I prefer to have fewer columns with good information than 43 columns with 95% missing values. But that is the data available, so I guess there is not much to do.
I created a repo for this Kaggle competition; there are only 30 days left, so I want to make a few submissions. Stay tuned for imputation and some machine learning. The code for copper is on GitHub too.