In the last two weeks I have been working on a project for my Business Intelligence class, and I learned why people always say they spend 80% of their time cleaning the data; it couldn't be more true.
To practice a little more I decided to try the Kaggle bulldozers competition. The first thing I noticed was that the data is huge (450+ MB), by far the biggest dataset I have dealt with; for most Big Data experts it is probably tiny, but for me it was huge xD.
I was curious to see if Python could handle it and do some basic cleaning, and along the way I ended up adding some functionality to copper: date cleaning and joining Datasets.
I started by looking at (and getting scared of) the data in Excel, then imported it into Python (pandas): 401k+ rows and 53 columns.
In [1]:
import copper
copper.project.path = '../'
In [2]:
train = copper.read_csv('train.csv')
In [3]:
len(train), len(train.columns)
Out[3]:
I decided to divide the cleaning into 4 sections of approximately 12 columns each and create 4 different copper.Datasets to modify the metadata of each section. Mainly I needed to decide which columns are useful and which have too many missing values to be useful, while making sure that the useful columns do not duplicate each other (categorical variables) and are usable for machine learning (dates).
Select the first 12 columns and print their first values.
In [4]:
cols1 = train[train.columns[0:12]]
In [5]:
cols1.head()
Out[5]:
Most of the columns in this section are only IDs, which are not useful for prediction, and the target is the SalePrice variable, so I created a Dataset and set the correct metadata.
In [6]:
ds1 = copper.Dataset(cols1)
In [7]:
ds1.role['SalesID'] = ds1.ID
ds1.role['MachineID'] = ds1.ID
ds1.role['ModelID'] = ds1.ID
ds1.role['datasource'] = ds1.ID
ds1.role['auctioneerID'] = ds1.ID
ds1.role['SalePrice'] = ds1.TARGET
In [8]:
ds1.percent_missing()
Out[8]:
Columns with more than 50% missing values are automatically rejected, but the ones that are not rejected (and are not IDs) look good.
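For reference, the same check is easy to reproduce in plain pandas. This is only a rough sketch of what the 50% rule amounts to; the helper name and the threshold constant here are mine, not copper's internals.

# Rough pandas equivalent of percent_missing() and the 50% auto-reject rule
def percent_missing(df):
    # fraction of missing values per column
    return df.isnull().mean()

missing = percent_missing(cols1)
auto_rejected = missing[missing > 0.5].index  # columns copper would treat as rejected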
The date field could be useful, but it is text, so it needs to be converted into a Python datetime; and for machine learning it is useful to convert dates into numbers, the most common approach being the Julian date method.
Original data:
In [9]:
ds1['saledate'].head(2)
Out[9]:
Convert it into a datetime by giving the correct format.
In [10]:
ds1['saledate'] = ds1['saledate'].apply(copper.transform.strptime, args='%m/%d/%Y %H:%M')
In [11]:
ds1['saledate'].head(2)
Out[11]:
Convert the dates into numbers.
In [12]:
ds1['saledate'] = ds1['saledate'].apply(copper.transform.date_to_number)
In [13]:
ds1['saledate'].head(2)
Out[13]:
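I have not checked exactly what copper.transform.date_to_number returns, but the idea is the standard one: parse the text with an explicit format, then map each date to a single number. A rough plain-pandas sketch of the same two steps, using toordinal() as the number:

import pandas as pd

# Parse the text dates, then turn each one into a number.
# toordinal() counts days since 0001-01-01; adding 1721424.5 gives the Julian day.
parsed = pd.to_datetime(cols1['saledate'], format='%m/%d/%Y %H:%M')
as_number = parsed.map(lambda d: d.toordinal())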
Change the type metadata of saledate because it is now a Number, not a Category.
In [14]:
ds1.type['saledate'] = ds1.NUMBER
In [15]:
ds1.metadata
Out[15]:
Select the next set of columns, print the first rows, and check the percent of missing values.
In [16]:
cols2 = train[train.columns[12:20]]
cols2.head(7)
Out[16]:
In [17]:
ds2 = copper.Dataset(cols2)
ds2.percent_missing()
Out[17]:
The first dataset has one column (fiModelDesc) that is already split into two variables here (fiBaseModel and fiSecondaryDesc), so we do not need the original variable. We also have ProductGroup and ProductGroupDesc, which carry the same information, so I reject one of them.
Finally there is fiProductClassDesc, which could be useful but needs more transformation. Initially I was thinking of extracting the numbers from the field, but some values are Lbs and some are Horsepower, so it is not that simple; for now I reject it.
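If I come back to fiProductClassDesc later, one option would be to extract the number together with the unit that follows it, so Horsepower and Lbs values end up in separate columns instead of being mixed together. A sketch of the idea; the regex and the unit spellings are assumptions that would need to be checked against the real values:

import re

# Pull out the first number of the range and the unit word right after it,
# e.g. '110.0 to 120.0 Horsepower' -> class_value 110.0, class_unit 'Horsepower'.
extracted = cols2['fiProductClassDesc'].str.extract(
    r'(?P<class_value>[\d.]+)(?:\s*to\s*[\d.]+)?\s*(?P<class_unit>Horsepower|Lbs?)',
    flags=re.IGNORECASE)
extracted['class_value'] = extracted['class_value'].astype(float)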
In [18]:
ds1.role['fiModelDesc'] = ds1.REJECTED
ds2.role['ProductGroupDesc'] = ds2.REJECTED
ds2.role['fiProductClassDesc'] = ds2.REJECTED
Since I have rejected so many variables, I gave 'ProductSize' a chance, since its percentage of missing values is just above the rejection threshold.
In [19]:
ds2.role['ProductSize'] = ds2.INPUT
In [20]:
set(ds2['ProductSize'])
Out[20]:
It looks quite clean, so no problems there. The final metadata for this section looks like this.
In [21]:
ds2.metadata
Out[21]:
Print the first rows of the next section and see the missing values.
In [22]:
cols3 = train[train.columns[20:31]]
cols3.head(5)
Out[22]:
In [23]:
ds3 = copper.Dataset(cols3)
ds3.percent_missing()
Out[23]:
By default only Enclosure will be an input, so let's see if it needs to be cleaned.
In [24]:
set(ds3['Enclosure'])
Out[24]:
I believe 'EROPS w AC' and 'EROPS AC' are the same; I should probably check the competition forum, but for now let's say I am right. I also change 'None or Unspecified' to NaN.
In [25]:
import numpy as np
ds3['Enclosure'][ds3['Enclosure'] == 'EROPS w AC'] = 'EROPS AC'
ds3['Enclosure'][ds3['Enclosure'] == 'None or Unspecified'] = np.nan
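The chained indexing above works here, but in plain pandas the same recoding is usually done with replace(), which avoids chained-assignment warnings; a small sketch on the raw cols3 frame, purely as an illustration:

import numpy as np

# Merge the two AC labels and treat the placeholder category as missing
enclosure = cols3['Enclosure'].replace(
    {'EROPS w AC': 'EROPS AC', 'None or Unspecified': np.nan})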
Since I created some new NaNs, let's see if that changes the missing values.
In [26]:
ds3.percent_missing()['Enclosure']
Out[26]:
Let's take a quick look at 'Forks', which is just above the rejection limit.
In [27]:
set(ds3['Forks'])
Out[27]:
In [28]:
ds3['Forks'][ds3['Forks'] == 'None or Unspecified'] = np.nan
In [29]:
ds3.percent_missing()['Forks']
Out[29]:
Changing 'None or Unspecified' to NaN makes the variable almost all missing values, so it is not useful at all. The final metadata for this section is:
In [30]:
ds3.metadata
Out[30]:
Let's see the remaining columns and their missing values.
In [31]:
ds4 = copper.Dataset(train[train.columns[31:]])
ds4.percent_missing()
Out[31]:
Almost everything is rejected. Let's take a look at what is not rejected.
In [32]:
set(ds4['Coupler'])
Out[32]:
In [33]:
set(ds4['Hydraulics'])
Out[33]:
We need to change 'None or Unspecified' to NaN and see what happens.
In [34]:
ds4['Coupler'][ds4['Coupler'] == 'None or Unspecified'] = np.nan
ds4['Hydraulics'][ds4['Hydraulics'] == 'None or Unspecified'] = np.nan
In [35]:
ds4.percent_missing()[['Coupler', 'Hydraulics']]
Out[35]:
Hydraulics keeps a low percentage of missing values, but Coupler now has a huge percentage of missing values, so I reject it.
In [36]:
ds4.role['Coupler'] = ds4.REJECTED
With all the datasets ready we can join them into one huge dataset.
In [37]:
ds = copper.join(ds1, ds2, others=[ds3, ds4])
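I have not looked at the internals of copper.join, but conceptually this step is a column-wise concatenation of the four pieces, plus the role/type metadata that copper keeps track of. Roughly, in plain pandas:

import pandas as pd

# Glue the four column slices back together side by side; axis=1 works here
# because they all share the original row index from train.
joined = pd.concat([cols1, cols2, cols3, train[train.columns[31:]]], axis=1)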
Let's just check the number of rows and columns.
In [38]:
len(ds), len(ds.columns)
Out[38]:
We have everything, but I am not going to use the rejected data for machine learning, so let's filter the data, taking only the Inputs and Targets.
Note: By default the filter method returns a pandas.DataFrame, but in this case I want a Dataset, so I use the ret_ds parameter.
In [39]:
ds = ds.filter(role=['Input', 'Target'], ret_ds=True)
Just a final check on the first values and missing values.
In [40]:
ds.head()
Out[40]:
In [41]:
ds.percent_missing()
Out[41]:
Finally save the (pickled) Dataset for future use.
In [42]:
copper.save(ds, 'cleaned')
Removing and adding columns using copper Datasets is really easy, and the new join feature makes this kind of task even easier.
What I don't like is that I removed 43 columns, and I don't have very high expectations using only the remaining 10 for machine learning. Usually dates are not that useful, but we will see; I prefer to have fewer columns with good information than 43 columns with 95% missing values. But that is the data available, so I guess there is not much to do.
I created a repo for this Kaggle competition; there are only 30 days left, so I want to make a few submissions. Stay tuned for imputation and some machine learning. The code for copper is on GitHub too.