Today I start working on the first assignment for the Coursera's Data Analysis course and I needed to make some basic transformation on the data so decide to include them on copper.
I realize I was making some wrong decisions on the Type metadata on the Dataset. Since the money type is just a number (similar to a percent number on the case of this example) it is not necessary to have it as a different type, just make the transformations I was doing so I decide to remove the Money type, now the only types are Number and Category.
Lets dive into the example: dataset: loansData.csv
Import copper and set the project path
In [1]:
import copper
copper.project.path = '../'
Load the data into a pandas DataFrame and see the first rows
In [2]:
loans = copper.read_csv('loansData.csv')
In [3]:
loans.head()
Out[3]:
There are some transformations to be done:
Notice a pattern? xD. Thats where I realize I was doing the wrong about a type only for money columns on the previous examples, is just a number so just define it like number.
Lets load the data into a Dataset and see the metadata
In [4]:
dataset = copper.Dataset(loans)
In [5]:
dataset.metadata
Out[5]:
Now all columns with dtype of float or int are Number and the rest are Category no other unnecessay types such as Money
Let's convert Interest.Rate, Loan.Lenght, Debt.To.Income.Ratio to Numbers
In [6]:
dataset.type['Interest.Rate'] = dataset.NUMBER
dataset.type['Loan.Length'] = dataset.NUMBER
dataset.type['Debt.To.Income.Ratio'] = dataset.NUMBER
In [7]:
dataset.update()
In [8]:
dataset[['Interest.Rate', 'Loan.Length', 'Debt.To.Income.Ratio']].head()
Out[8]:
We can see that the values are transformed. Inside copper uses a very simple regular expression to find the first number on each entry so it works for many cases, such as money values, percent values or even custom values such as '36 months' to '36'.
Let's check the metadata,
In [9]:
dataset.metadata
Out[9]:
So far so good. Now is necessary to convert Employment.Length
to number to but this column is special. It has two values that the regular expression is going to match as the same: "< 1 years" and "1 year" are both going to be converted to just '1'. We need to fix that.
So I convert the '< 1 year' values to '0 years' and then use the copper transformation.
In [10]:
dataset['Employment.Length'][dataset['Employment.Length'] == '< 1 year'] = '0 years'
In [11]:
dataset[['Employment.Length']].head(3) # Checking
Out[11]:
In [12]:
dataset.type['Employment.Length'] = dataset.NUMBER
In [13]:
dataset.update()
In [14]:
dataset[['Employment.Length']].head(3)
Out[14]:
The final transformation is the 'Fico.Range' column, on this example is possible to see that the second number is just the first number plus four, so we can just use the same regular expression and get the first number.
But lets say we want to see what would happen on case of using it.
In [15]:
copper.transform.to_number(dataset['FICO.Range']).head()
Out[15]:
That is exactly what we want so lets use the transformation.
In [16]:
dataset.type['FICO.Range'] = dataset.NUMBER
In [17]:
dataset.update()
Finally let's see the final table and metadata
In [18]:
dataset.head()
Out[18]:
In [19]:
dataset.metadata
Out[19]:
The values are correct and the metadata matches the information on the table, we are done. Now is possible to do some machine learning on the data.
Finally let's save the dataset for future use.
In [22]:
copper.save(dataset, 'loans')
As usual the code is on github: copper.