Today I start working on the first assignment for the Coursera's Data Analysis course and I needed to make some basic transformation on the data so decide to include them on copper.

I realize I was making some wrong decisions on the Type metadata on the Dataset. Since the money type is just a number (similar to a percent number on the case of this example) it is not necessary to have it as a different type, just make the transformations I was doing so I decide to remove the Money type, now the only types are Number and Category.

Lets dive into the example: dataset: loansData.csv

Import copper and set the project path



In [1]:

    
import copper
copper.project.path = '../'

Load the data into a pandas DataFrame and see the first rows



In [2]:

    
loans = copper.read_csv('loansData.csv')



In [3]:

    
loans.head()









    Out[3]:






  
    
      
      Amount.Requested
      Amount.Funded.By.Investors
      Interest.Rate
      Loan.Length
      Loan.Purpose
      Debt.To.Income.Ratio
      State
      Home.Ownership
      Monthly.Income
      FICO.Range
      Open.CREDIT.Lines
      Revolving.CREDIT.Balance
      Inquiries.in.the.Last.6.Months
      Employment.Length
    
  
  
    
      81174
       20000
       20000
        8.90%
       36 months
       debt_consolidation
       14.90%
       SC
       MORTGAGE
        6541.67
       735-739
       14
       14272
       2
       < 1 year
    
    
      99592
       19200
       19200
       12.12%
       36 months
       debt_consolidation
       28.36%
       TX
       MORTGAGE
        4583.33
       715-719
       12
       11140
       1
        2 years
    
    
      80059
       35000
       35000
       21.98%
       60 months
       debt_consolidation
       23.81%
       CA
       MORTGAGE
       11500.00
       690-694
       14
       21977
       1
        2 years
    
    
      15825
       10000
        9975
        9.99%
       36 months
       debt_consolidation
       14.30%
       KS
       MORTGAGE
        3833.33
       695-699
       10
        9346
       0
        5 years
    
    
      33182
       12000
       12000
       11.71%
       36 months
              credit_card
       18.78%
       NJ
           RENT
        3195.00
       695-699
       11
       14469
       0
        9 years

There are some transformations to be done:

Interest.Rate to Number
Loan.Lenght to Number
Debt.To.Income.Ratio to Number
FICO.Rate to Number
Employment.Length to Number

Notice a pattern? xD. Thats where I realize I was doing the wrong about a type only for money columns on the previous examples, is just a number so just define it like number.

Lets load the data into a Dataset and see the metadata



In [4]:

    
dataset = copper.Dataset(loans)



In [5]:

    
dataset.metadata









    Out[5]:






  
    
      
      Role
      Type
      dtype
    
  
  
    
      Amount.Requested
       Input
         Number
         int64
    
    
      Amount.Funded.By.Investors
       Input
         Number
       float64
    
    
      Interest.Rate
       Input
       Category
        object
    
    
      Loan.Length
       Input
       Category
        object
    
    
      Loan.Purpose
       Input
       Category
        object
    
    
      Debt.To.Income.Ratio
       Input
       Category
        object
    
    
      State
       Input
       Category
        object
    
    
      Home.Ownership
       Input
       Category
        object
    
    
      Monthly.Income
       Input
         Number
       float64
    
    
      FICO.Range
       Input
       Category
        object
    
    
      Open.CREDIT.Lines
       Input
         Number
       float64
    
    
      Revolving.CREDIT.Balance
       Input
         Number
       float64
    
    
      Inquiries.in.the.Last.6.Months
       Input
         Number
       float64
    
    
      Employment.Length
       Input
       Category
        object

Now all columns with dtype of float or int are Number and the rest are Category no other unnecessay types such as Money

Let's convert Interest.Rate, Loan.Lenght, Debt.To.Income.Ratio to Numbers



In [6]:

    
dataset.type['Interest.Rate'] = dataset.NUMBER
dataset.type['Loan.Length'] = dataset.NUMBER
dataset.type['Debt.To.Income.Ratio'] = dataset.NUMBER



In [7]:

    
dataset.update()



In [8]:

    
dataset[['Interest.Rate', 'Loan.Length', 'Debt.To.Income.Ratio']].head()









    Out[8]:






  
    
      
      Interest.Rate
      Loan.Length
      Debt.To.Income.Ratio
    
  
  
    
      81174
        8.90
       36
       14.90
    
    
      99592
       12.12
       36
       28.36
    
    
      80059
       21.98
       60
       23.81
    
    
      15825
        9.99
       36
       14.30
    
    
      33182
       11.71
       36
       18.78

We can see that the values are transformed. Inside copper uses a very simple regular expression to find the first number on each entry so it works for many cases, such as money values, percent values or even custom values such as '36 months' to '36'.

Let's check the metadata,



In [9]:

    
dataset.metadata









    Out[9]:






  
    
      
      Role
      Type
      dtype
    
  
  
    
      Amount.Requested
       Input
         Number
         int64
    
    
      Amount.Funded.By.Investors
       Input
         Number
       float64
    
    
      Interest.Rate
       Input
         Number
       float64
    
    
      Loan.Length
       Input
         Number
       float64
    
    
      Loan.Purpose
       Input
       Category
        object
    
    
      Debt.To.Income.Ratio
       Input
         Number
       float64
    
    
      State
       Input
       Category
        object
    
    
      Home.Ownership
       Input
       Category
        object
    
    
      Monthly.Income
       Input
         Number
       float64
    
    
      FICO.Range
       Input
       Category
        object
    
    
      Open.CREDIT.Lines
       Input
         Number
       float64
    
    
      Revolving.CREDIT.Balance
       Input
         Number
       float64
    
    
      Inquiries.in.the.Last.6.Months
       Input
         Number
       float64
    
    
      Employment.Length
       Input
       Category
        object

So far so good. Now is necessary to convert Employment.Length to number to but this column is special. It has two values that the regular expression is going to match as the same: "< 1 years" and "1 year" are both going to be converted to just '1'. We need to fix that. So I convert the '< 1 year' values to '0 years' and then use the copper transformation.



In [10]:

    
dataset['Employment.Length'][dataset['Employment.Length'] == '< 1 year'] = '0 years'



In [11]:

    
dataset[['Employment.Length']].head(3) # Checking









    Out[11]:






  
    
      
      Employment.Length
    
  
  
    
      81174
       0 years
    
    
      99592
       2 years
    
    
      80059
       2 years



In [12]:

    
dataset.type['Employment.Length'] = dataset.NUMBER



In [13]:

    
dataset.update()



In [14]:

    
dataset[['Employment.Length']].head(3)









    Out[14]:






  
    
      
      Employment.Length
    
  
  
    
      81174
       0
    
    
      99592
       2
    
    
      80059
       2

The final transformation is the 'Fico.Range' column, on this example is possible to see that the second number is just the first number plus four, so we can just use the same regular expression and get the first number.

But lets say we want to see what would happen on case of using it.



In [15]:

    
copper.transform.to_number(dataset['FICO.Range']).head()









    Out[15]:





81174    735
99592    715
80059    690
15825    695
33182    695
Name: FICO.Range

That is exactly what we want so lets use the transformation.



In [16]:

    
dataset.type['FICO.Range'] = dataset.NUMBER



In [17]:

    
dataset.update()

Finally let's see the final table and metadata



In [18]:

    
dataset.head()









    Out[18]:






  
    
      
      Amount.Requested
      Amount.Funded.By.Investors
      Interest.Rate
      Loan.Length
      Loan.Purpose
      Debt.To.Income.Ratio
      State
      Home.Ownership
      Monthly.Income
      FICO.Range
      Open.CREDIT.Lines
      Revolving.CREDIT.Balance
      Inquiries.in.the.Last.6.Months
      Employment.Length
    
  
  
    
      81174
       20000
       20000
        8.90
       36
       debt_consolidation
       14.90
       SC
       MORTGAGE
        6541.67
       735
       14
       14272
       2
       0
    
    
      99592
       19200
       19200
       12.12
       36
       debt_consolidation
       28.36
       TX
       MORTGAGE
        4583.33
       715
       12
       11140
       1
       2
    
    
      80059
       35000
       35000
       21.98
       60
       debt_consolidation
       23.81
       CA
       MORTGAGE
       11500.00
       690
       14
       21977
       1
       2
    
    
      15825
       10000
        9975
        9.99
       36
       debt_consolidation
       14.30
       KS
       MORTGAGE
        3833.33
       695
       10
        9346
       0
       5
    
    
      33182
       12000
       12000
       11.71
       36
              credit_card
       18.78
       NJ
           RENT
        3195.00
       695
       11
       14469
       0
       9



In [19]:

    
dataset.metadata









    Out[19]:






  
    
      
      Role
      Type
      dtype
    
  
  
    
      Amount.Requested
       Input
         Number
         int64
    
    
      Amount.Funded.By.Investors
       Input
         Number
       float64
    
    
      Interest.Rate
       Input
         Number
       float64
    
    
      Loan.Length
       Input
         Number
       float64
    
    
      Loan.Purpose
       Input
       Category
        object
    
    
      Debt.To.Income.Ratio
       Input
         Number
       float64
    
    
      State
       Input
       Category
        object
    
    
      Home.Ownership
       Input
       Category
        object
    
    
      Monthly.Income
       Input
         Number
       float64
    
    
      FICO.Range
       Input
         Number
       float64
    
    
      Open.CREDIT.Lines
       Input
         Number
       float64
    
    
      Revolving.CREDIT.Balance
       Input
         Number
       float64
    
    
      Inquiries.in.the.Last.6.Months
       Input
         Number
       float64
    
    
      Employment.Length
       Input
         Number
       float64

The values are correct and the metadata matches the information on the table, we are done. Now is possible to do some machine learning on the data.

Finally let's save the dataset for future use.



In [22]:

    
copper.save(dataset, 'loans')

As usual the code is on github: copper.

	Amount.Requested	Amount.Funded.By.Investors	Interest.Rate	Loan.Length	Loan.Purpose	Debt.To.Income.Ratio	State	Home.Ownership	Monthly.Income	FICO.Range	Open.CREDIT.Lines	Revolving.CREDIT.Balance	Inquiries.in.the.Last.6.Months	Employment.Length
81174	20000	20000	8.90%	36 months	debt_consolidation	14.90%	SC	MORTGAGE	6541.67	735-739	14	14272	2	< 1 year
99592	19200	19200	12.12%	36 months	debt_consolidation	28.36%	TX	MORTGAGE	4583.33	715-719	12	11140	1	2 years
80059	35000	35000	21.98%	60 months	debt_consolidation	23.81%	CA	MORTGAGE	11500.00	690-694	14	21977	1	2 years
15825	10000	9975	9.99%	36 months	debt_consolidation	14.30%	KS	MORTGAGE	3833.33	695-699	10	9346	0	5 years
33182	12000	12000	11.71%	36 months	credit_card	18.78%	NJ	RENT	3195.00	695-699	11	14469	0	9 years

	Role	Type	dtype
Amount.Requested	Input	Number	int64
Amount.Funded.By.Investors	Input	Number	float64
Interest.Rate	Input	Category	object
Loan.Length	Input	Category	object
Loan.Purpose	Input	Category	object
Debt.To.Income.Ratio	Input	Category	object
State	Input	Category	object
Home.Ownership	Input	Category	object
Monthly.Income	Input	Number	float64
FICO.Range	Input	Category	object
Open.CREDIT.Lines	Input	Number	float64
Revolving.CREDIT.Balance	Input	Number	float64
Inquiries.in.the.Last.6.Months	Input	Number	float64
Employment.Length	Input	Category	object

	Interest.Rate	Loan.Length	Debt.To.Income.Ratio
81174	8.90	36	14.90
99592	12.12	36	28.36
80059	21.98	60	23.81
15825	9.99	36	14.30
33182	11.71	36	18.78