In the last two weeks I have been working on a project for my Business Intelligence class, and I learned why people always say they spend 80% of their time cleaning the data; it couldn't be more true.

To get a little more practice I decided to try the Kaggle bulldozers competition. The first thing I noticed was that the data is big (450+ MB); it is by far the biggest dataset I have dealt with. For most Big Data experts it is probably tiny, but for me it was huge xD.

I was curious to see whether Python could handle it and do some basic cleaning. Along the way I also ended up adding some new functionality to copper: date cleaning and joining Datasets.

I started by looking at (and getting scared of) the data in Excel, then imported it into Python (pandas): 401k+ rows and 53 columns.


In [1]:
import copper
copper.project.path = '../'

In [2]:
train = copper.read_csv('train.csv')

In [3]:
len(train), len(train.columns)


Out[3]:
(401125, 53)
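
As an aside, even if the 450+ MB file did not fit comfortably in memory, pandas can read it in chunks. A minimal sketch (I did not need this here; the path and chunk size are just placeholders):

import pandas as pd

# Read the CSV in 50,000-row pieces instead of all at once
chunks = pd.read_csv('train.csv', chunksize=50000)
n_rows = sum(len(chunk) for chunk in chunks)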

I decided to divide the cleaning into 4 sections of approximately 12 columns each, and to create 4 different copper.Datasets to modify the metadata of each section. The main task was to decide which columns are useful and which have too many missing values to be useful, while also making sure that the useful columns do not have duplicates (categorical variables) and are suitable for machine learning (dates).
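
For reference, these are the four column slices used in the sections below, gathered in one place:

# Purely illustrative: the hand-picked column ranges used in the four sections below
boundaries = [(0, 12), (12, 20), (20, 31), (31, len(train.columns))]
sections = [train[train.columns[start:end]] for start, end in boundaries]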

Columns 1

Select the first 12 columns and print their first values.


In [4]:
cols1 = train[train.columns[0:12]]

In [5]:
cols1.head()


Out[5]:
SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter UsageBand saledate fiModelDesc fiBaseModel
0 1139246 66000 999089 3157 121 3 2004 68 Low 11/16/2006 0:00 521D 521
1 1139248 57000 117657 77 121 3 1996 4640 Low 3/26/2004 0:00 950FII 950
2 1139249 10000 434808 7009 121 3 2001 2838 High 2/26/2004 0:00 226 226
3 1139251 38500 1026470 332 121 3 2001 3486 High 5/19/2011 0:00 PC120-6E PC120
4 1139253 11000 1057373 17311 121 3 2007 722 Medium 7/23/2009 0:00 S175 S175

Most of the columns in this section are just IDs, which are not useful for prediction, and the target is the SalePrice variable, so I created a Dataset and set the correct metadata.


In [6]:
ds1 = copper.Dataset(cols1)

In [7]:
ds1.role['SalesID'] = ds1.ID
ds1.role['MachineID'] = ds1.ID
ds1.role['ModelID'] = ds1.ID
ds1.role['datasource'] = ds1.ID
ds1.role['auctioneerID'] = ds1.ID
ds1.role['SalePrice'] = ds1.TARGET

In [8]:
ds1.percent_missing()


Out[8]:
UsageBand                   0.826391
MachineHoursCurrentMeter    0.644089
auctioneerID                0.050199
fiBaseModel                 0.000000
fiModelDesc                 0.000000
saledate                    0.000000
YearMade                    0.000000
datasource                  0.000000
ModelID                     0.000000
MachineID                   0.000000
SalePrice                   0.000000
SalesID                     0.000000

Columns with more than 50% missing values are automatically rejected, but the ones that are not rejected (and are not IDs) look good.
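
For reference, the same 50% rule is easy to express in plain pandas. A rough sketch (not necessarily how copper implements it):

# Fraction of missing values per column
missing = cols1.isnull().sum() / float(len(cols1))

# Columns that would be rejected under a 0.5 threshold
rejected = missing[missing > 0.5].index.tolist()
# -> ['MachineHoursCurrentMeter', 'UsageBand'] for this block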

Transforming a date

The saledate field could be useful, but it is text, so it needs to be converted into a Python datetime. For machine learning it is also useful to convert dates into numbers; the most common approach is the Julian Date method.

Original data:


In [9]:
ds1['saledate'].head(2)


Out[9]:
0    11/16/2006 0:00
1     3/26/2004 0:00
Name: saledate

Convert it into a datetime, specifying the correct format.


In [10]:
ds1['saledate'] = ds1['saledate'].apply(copper.transform.strptime, args='%m/%d/%Y %H:%M')

In [11]:
ds1['saledate'].head(2)


Out[11]:
0    2006-11-16 00:00:00
1    2004-03-26 00:00:00
Name: saledate

Convert the dates into numbers


In [12]:
ds1['saledate'] = ds1['saledate'].apply(copper.transform.date_to_number)

In [13]:
ds1['saledate'].head(2)


Out[13]:
0    13468
1    12503
Name: saledate
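
For comparison, the numbers above line up with days elapsed since 1970-01-01 (13468 days after that date is 2006-11-16). Under that assumption about the reference date, a plain pandas sketch of the same transformation would be:

import pandas as pd

# Parse the raw text dates and count days elapsed since the Unix epoch
saledate = pd.to_datetime(train['saledate'], format='%m/%d/%Y %H:%M')
days = (saledate - pd.Timestamp('1970-01-01')).dt.days
# days.head(2) -> 13468, 12503, matching the output above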

Change the type metadata of saledate because it is now a Number, not a Category.


In [14]:
ds1.type['saledate'] = ds1.NUMBER

In [15]:
ds1.metadata


Out[15]:
Role Type dtype
SalesID ID Number int64
SalePrice Target Number int64
MachineID ID Number int64
ModelID ID Number int64
datasource ID Number int64
auctioneerID ID Number float64
YearMade Input Number int64
MachineHoursCurrentMeter Reject Number float64
UsageBand Reject Category object
saledate Input Number int64
fiModelDesc Input Category object
fiBaseModel Input Category object

Columns 2

Select the next columns, print their first values, and check the percentage of missing values.


In [16]:
cols2 = train[train.columns[12:20]]
cols2.head(7)


Out[16]:
fiSecondaryDesc fiModelSeries fiModelDescriptor ProductSize fiProductClassDesc state ProductGroup ProductGroupDesc
0 D NaN NaN NaN Wheel Loader - 110.0 to 120.0 Horsepower Alabama WL Wheel Loader
1 F II NaN Medium Wheel Loader - 150.0 to 175.0 Horsepower North Carolina WL Wheel Loader
2 NaN NaN NaN NaN Skid Steer Loader - 1351.0 to 1601.0 Lb Operat... New York SSL Skid Steer Loaders
3 NaN -6E NaN Small Hydraulic Excavator, Track - 12.0 to 14.0 Metr... Texas TEX Track Excavators
4 NaN NaN NaN NaN Skid Steer Loader - 1601.0 to 1751.0 Lb Operat... New York SSL Skid Steer Loaders
5 G NaN NaN NaN Backhoe Loader - 14.0 to 15.0 Ft Standard Digg... Arizona BL Backhoe Loaders
6 E NaN LC Large / Medium Hydraulic Excavator, Track - 21.0 to 24.0 Metr... Florida TEX Track Excavators

In [17]:
ds2 = copper.Dataset(cols2)
ds2.percent_missing()


Out[17]:
fiModelSeries         0.858129
fiModelDescriptor     0.820707
ProductSize           0.525460
fiSecondaryDesc       0.342016
ProductGroupDesc      0.000000
ProductGroup          0.000000
state                 0.000000
fiProductClassDesc    0.000000

In the first dataset we had a column (fiModelDesc) that is split into two variables (fiBaseModel and fiSecondaryDesc), so we do not need the original variable. We also have ProductGroup and ProductGroupDesc, which hold the same information, so I reject one of them.

Finally, there is fiProductClassDesc, which could be useful but needs more transformation. Initially I was thinking of extracting the numbers from the field, but some values are in Lbs and some in Horsepower, so it is not that simple; for now I reject it.
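
If I come back to it, a regex could split the field into a category, a numeric range and a unit, keeping the Lb and Horsepower values apart. A rough sketch (hypothetical helper, not part of copper):

import re

pattern = re.compile(r'^(?P<category>.+?) - (?P<low>[\d.]+) to (?P<high>[\d.]+) (?P<unit>.+)$')

def parse_product_class(value):
    # 'Wheel Loader - 110.0 to 120.0 Horsepower' ->
    # {'category': 'Wheel Loader', 'low': 110.0, 'high': 120.0, 'unit': 'Horsepower'}
    match = pattern.match(value)
    if match is None:
        return None  # some values may not follow the 'low to high unit' pattern
    parts = match.groupdict()
    parts['low'], parts['high'] = float(parts['low']), float(parts['high'])
    return parts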


In [18]:
ds1.role['fiModelDesc'] = ds1.REJECTED
ds2.role['ProductGroupDesc'] = ds2.REJECTED
ds2.role['fiProductClassDesc'] = ds2.REJECTED

Since I have rejected so many variables, I gave 'ProductSize' a chance, since it is just above the missing-values threshold.


In [19]:
ds2.role['ProductSize'] = ds2.INPUT

In [20]:
set(ds2['ProductSize'])


Out[20]:
set([nan, 'Mini', 'Medium', 'Large / Medium', 'Compact', 'Large', 'Small'])

It looks quite clean, so no problems there. The final metadata for this section looks like this.


In [21]:
ds2.metadata


Out[21]:
Role Type dtype
fiSecondaryDesc Input Category object
fiModelSeries Reject Category object
fiModelDescriptor Reject Category object
ProductSize Input Category object
fiProductClassDesc Reject Category object
state Input Category object
ProductGroup Input Category object
ProductGroupDesc Reject Category object

Columns 3

Print the first rows of the next section and see the missing values.


In [22]:
cols3 = train[train.columns[20:31]]
cols3.head(5)


Out[22]:
Drive_System Enclosure Forks Pad_Type Ride_Control Stick Transmission Turbocharged Blade_Extension Blade_Width Enclosure_Type
0 NaN EROPS w AC None or Unspecified NaN None or Unspecified NaN NaN NaN NaN NaN NaN
1 NaN EROPS w AC None or Unspecified NaN None or Unspecified NaN NaN NaN NaN NaN NaN
2 NaN OROPS None or Unspecified NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN EROPS w AC NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN EROPS None or Unspecified NaN NaN NaN NaN NaN NaN NaN NaN

In [23]:
ds3 = copper.Dataset(cols3)
ds3.percent_missing()


Out[23]:
Enclosure_Type     0.937129
Blade_Width        0.937129
Blade_Extension    0.937129
Turbocharged       0.802720
Stick              0.802720
Pad_Type           0.802720
Drive_System       0.739829
Ride_Control       0.629527
Transmission       0.543210
Forks              0.521154
Enclosure          0.000810

By default only Enclosure will be an input, so let's see if it needs to be cleaned.


In [24]:
set(ds3['Enclosure'])


Out[24]:
set([nan,
     'None or Unspecified',
     'OROPS',
     'EROPS w AC',
     'NO ROPS',
     'EROPS AC',
     'EROPS'])

I believe 'EROPS w AC' and 'EROPS AC' are the same; I should probably check the competition forum, but for now let's say I am right. I also change 'None or Unspecified' to nan.


In [25]:
import numpy as np  # needed for np.nan below
ds3['Enclosure'][ds3['Enclosure'] == 'EROPS w AC'] = 'EROPS AC'
ds3['Enclosure'][ds3['Enclosure'] == 'None or Unspecified'] = np.nan
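
The same two edits could also be written as a single replace, assuming the column behaves like a regular pandas Series (as the saledate cells above suggest):

import numpy as np

ds3['Enclosure'] = ds3['Enclosure'].replace(
    {'EROPS w AC': 'EROPS AC', 'None or Unspecified': np.nan})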

Since I created some new nans, let's see if that changes the missing values.


In [26]:
ds3.percent_missing()['Enclosure']


Out[26]:
0.00081520722966654802

Let's take a quick look at 'Forks', which is just above the rejection limit.


In [27]:
set(ds3['Forks'])


Out[27]:
set([nan, 'Yes', 'None or Unspecified'])

In [28]:
ds3['Forks'][ds3['Forks'] == 'None or Unspecified'] = np.nan

In [29]:
ds3.percent_missing()['Forks']


Out[29]:
0.96565409784979739

Changing 'None or Unspecified' to nan makes the variable almost entirely missing values, so it is not useful at all. The final metadata for this section is:


In [30]:
ds3.metadata


Out[30]:
Role Type dtype
Drive_System Reject Category object
Enclosure Input Category object
Forks Reject Category object
Pad_Type Reject Category object
Ride_Control Reject Category object
Stick Reject Category object
Transmission Reject Category object
Turbocharged Reject Category object
Blade_Extension Reject Category object
Blade_Width Reject Category object
Enclosure_Type Reject Category object

Columns 4

Let's see the remaining columns and their missing values.


In [31]:
ds4 = copper.Dataset(train[train.columns[31:]])
ds4.percent_missing()


Out[31]:
Tip_Control                0.937129
Pushblock                  0.937129
Engine_Horsepower          0.937129
Scarifier                  0.937102
Hydraulics_Flow            0.891899
Grouser_Tracks             0.891899
Coupler_System             0.891660
Steering_Controls          0.827064
Differential_Type          0.826959
Backhoe_Mounting           0.803872
Blade_Type                 0.800977
Travel_Controls            0.800975
Tire_Size                  0.763869
Grouser_Type               0.752813
Track_Type                 0.752813
Pattern_Changer            0.752651
Stick_Length               0.752651
Thumb                      0.752476
Undercarriage_Pad_Width    0.751020
Ripper                     0.740388
Coupler                    0.466620
Hydraulics                 0.200823

Almost everything is rejected. Let's take a look at what is not rejected.


In [32]:
set(ds4['Coupler'])


Out[32]:
set([nan, 'None or Unspecified', 'Manual', 'Hydraulic'])

In [33]:
set(ds4['Hydraulics'])


Out[33]:
set([nan,
     'None or Unspecified',
     'Base + 3 Function',
     'Auxiliary',
     'Base + 4 Function',
     'Base + 1 Function',
     'Standard',
     'Base + 6 Function',
     '4 Valve',
     '3 Valve',
     '2 Valve',
     'Base + 5 Function',
     'Base + 2 Function'])

I need to change 'None or Unspecified' to nan and see what happens.


In [34]:
ds4['Coupler'][ds4['Coupler'] == 'None or Unspecified'] = np.nan
ds4['Hydraulics'][ds4['Hydraulics'] == 'None or Unspecified'] = np.nan

In [35]:
ds4.percent_missing()[['Coupler', 'Hydraulics']]


Out[35]:
Coupler       0.926781
Hydraulics    0.200848

Hydraulics keeps a low percentage of missing values, but Coupler now has a huge number of missing values, so I reject it.


In [36]:
ds4.role['Coupler'] = ds4.REJECTED

Join

With all the datasets ready we can join them into one huge dataset.


In [37]:
ds = copper.join(ds1, ds2, others=[ds3, ds4])
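I have not dug into the implementation, but conceptually the frame part of the join is a column-wise concatenation. In plain pandas the raw column blocks would be glued together roughly like this (this ignores the metadata and the cleaning done on the Dataset copies above; it only shows the shape of the operation):

import pandas as pd

# Conceptual equivalent using the original DataFrame slices
raw = pd.concat([cols1, cols2, cols3, train[train.columns[31:]]], axis=1)
# raw.shape -> (401125, 53)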

Let's just check the number of rows and columns.


In [38]:
len(ds), len(ds.columns)


Out[38]:
(401125, 53)

We have everything, but I am not going to use the rejected columns for machine learning, so let's filter the data, keeping only the Inputs and the Target.

Note: by default the filter method returns a pandas.DataFrame, but in this case I want a Dataset, so I use the ret_ds parameter.
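
For instance, calling it without the parameter would give back a plain DataFrame holding only the selected columns, without the role/type metadata:

# Default behaviour: returns a pandas.DataFrame instead of a copper.Dataset
df = ds.filter(role=['Input', 'Target'])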


In [39]:
ds = ds.filter(role=['Input', 'Target'], ret_ds=True)

Just a final check on the first values and missing values.


In [40]:
ds.head()


Out[40]:
SalePrice YearMade saledate fiBaseModel fiSecondaryDesc ProductSize state ProductGroup Enclosure Hydraulics
0 66000 2004 13468 521 D NaN Alabama WL EROPS AC 2 Valve
1 57000 1996 12503 950 F Medium North Carolina WL EROPS AC 2 Valve
2 10000 2001 12474 226 NaN NaN New York SSL OROPS Auxiliary
3 38500 2001 15113 PC120 NaN Small Texas TEX EROPS AC 2 Valve
4 11000 2007 14448 S175 NaN NaN New York SSL EROPS Auxiliary

In [41]:
ds.percent_missing()


Out[41]:
ProductSize        0.525460
fiSecondaryDesc    0.342016
Hydraulics         0.200848
Enclosure          0.000815
ProductGroup       0.000000
state              0.000000
fiBaseModel        0.000000
saledate           0.000000
YearMade           0.000000
SalePrice          0.000000

Finally save the (pickled) Dataset for future use.


In [42]:
copper.save(ds, 'cleaned')

Conclusion

Removing and adding columns using copper Datasets is really easy, and the new join feature makes this kind of task even easier.

What I don't like is that I removed 43 columns, and I don't have very high expectations using only the remaining 10 for machine learning. Usually dates are not that useful, but we will see; I prefer having fewer columns with good information over 43 columns with 95% missing values. But that is the data available, so I guess there is not much else to do.

I created a repo for this Kaggle competition; there are only 30 days left, so I want to make a few submissions. Stay tuned for imputation and some machine learning. The code for copper is on GitHub too.