In [1]:
import datapot as dp
from datapot import datasets
In [2]:
from __future__ import print_function
import sys
import bz2
import time
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from datapot.utils import csv_to_jsonlines
In [3]:
transactions = pd.read_csv('../data/transactions.csv')
transactions.head()
Out[3]:
Creating the DataPot object.
In [4]:
datapot = dp.DataPot()
In [5]:
csv_to_jsonlines('../data/transactions.csv', '../data/transactions.jsonlines')
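csv_to_jsonlines converts the CSV into the JSON Lines format that DataPot consumes. For reference, an equivalent conversion can be sketched with plain pandas (an illustration of the format, not DataPot's implementation):
In [ ]:
# Sketch of an equivalent CSV -> JSON Lines conversion using pandas:
# one JSON object per line, one line per row.
pd.read_csv('../data/transactions.csv').to_json(
    '../data/transactions.jsonlines', orient='records', lines=True)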
In [6]:
data_trns = open('../data/transactions.jsonlines')
data_trns.readline()
Out[6]:
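Each line of a JSON Lines file is a standalone JSON object. A quick way to inspect the field names of the first record, using only the standard library:
In [ ]:
import json
# Parse the first line and list its keys (the fields DataPot will detect transformers for).
with open('../data/transactions.jsonlines') as f:
    print(sorted(json.loads(f.readline()).keys()))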
Let's call the detect method. It automatically finds appropriate transformers for the fields of the jsonlines file. The 'limit' parameter sets how many objects are read to pick the right transformers.
In [7]:
datapot.detect(data_trns, limit=100)
Out[7]:
In [8]:
t0 = time.time()
datapot.fit(data_trns, verbose=True)
print('fit time:', time.time()-t0)
In [9]:
datapot
Out[9]:
Let's remove the SVDOneHotTransformer from the 'merchant_id' field (transformer index 0).
In [10]:
datapot.remove_transformer('merchant_id', 0)
Out[10]:
In [11]:
t0 = time.time()
df_trns = datapot.transform(data_trns)
print('transform time:', time.time()-t0)
In [12]:
df_trns.head()
Out[12]:
Sentiment analysis on IMDB movie reviews: https://www.kaggle.com/c/word2vec-nlp-tutorial/data
The datapot.fit method subsamples the data to detect the language and choose the corresponding stopwords and stemmer.
For each review, datapot.transform generates an SVD-compressed 12-dimensional TF-IDF vector representation.
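Conceptually, this is a TF-IDF vectorizer followed by truncated SVD. A minimal scikit-learn sketch of the idea (the corpus below is illustrative, and the component count is shrunk to fit it; datapot uses 12 components on the real reviews):
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Illustrative only: a tiny corpus standing in for the IMDB reviews.
docs = ['a great movie', 'a terrible movie', 'great acting, great plot']

tfidf = TfidfVectorizer(stop_words='english')  # stopword choice mirrors datapot's language detection step
X_tfidf = tfidf.fit_transform(docs)

# Compress the sparse TF-IDF matrix to a fixed-size dense representation.
svd = TruncatedSVD(n_components=2)             # 2 here because the toy corpus is tiny; datapot uses 12
X_dense = svd.fit_transform(X_tfidf)
print(X_dense.shape)                           # (3, 2): one compressed vector per document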
In [13]:
import datapot as dp
from datapot import datasets
Load data from datapot.datasets
In [14]:
data_imdb = datasets.load_imdb()
Or load directly from file
In [15]:
data_imdb = bz2.BZ2File('data/imdb.jsonlines.bz2')
In [16]:
datapot_imdb = dp.DataPot()
In [17]:
t0 = time.time()
datapot_imdb.detect(data_imdb)
print('detect time:', time.time()-t0)
datapot_imdb
Out[17]:
In [18]:
datapot_imdb.remove_transformer('sentiment', 0)
Out[18]:
In [19]:
t0 = time.time()
datapot_imdb.fit(data_imdb, verbose=True)
Out[19]:
In [20]:
print('fit time:', time.time()-t0)
In [21]:
t0 = time.time()
df_imdb = datapot_imdb.transform(data_imdb)
print('transform time:', time.time()-t0)
In [22]:
df_imdb.head()
Out[22]:
In [23]:
X = df_imdb.drop(['sentiment'], axis=1)
y = df_imdb['sentiment']
In [24]:
model = xgb.XGBClassifier()
cv_score = cross_val_score(model, X, y, cv=5)
assert all(i > 0.5 for i in cv_score), 'Low score!'
print('Cross-val score:', cv_score)
model.fit(X, y)
fi = model.feature_importances_
print('Feature importance:')
print(*zip(X.columns, fi), sep='\n')
In [25]:
from datapot import datasets
data_job = datasets.load_job_salary()
# Or load from file:
# data_job = bz2.BZ2File('datapot/data/job.jsonlines.bz2')
In [26]:
datapot_job = dp.DataPot()
In [27]:
t0 = time.time()
datapot_job.detect(data_job)
print('detect time:', time.time()-t0)
datapot_job
Out[27]:
In [28]:
t0 = time.time()
datapot_job.fit(data_job, verbose=True)
print('fit time:', time.time()-t0)
In [29]:
t0 = time.time()
df_job = datapot_job.transform(data_job)
print('transform time:', time.time()-t0)
In [30]:
print(df_job.columns)
print(df_job.shape)
df_job.head()
Out[30]:
In [31]:
X_job = df_job.drop(['SalaryNormalized', 'Id'], axis=1)
y_job = pd.qcut(df_job['SalaryNormalized'], q=2, labels=[0, 1]).astype(int)  # median split into a binary target
model = xgb.XGBClassifier()
cv_score_job = cross_val_score(model, X_job, y_job, cv=5)
print('Cross-val score:', cv_score_job)
assert all(i > 0.5 for i in cv_score_job), 'Low score!'
model.fit(X_job, y_job)
fi_job = model.feature_importances_
print('Feature importance:')
print(*zip(X_job.columns, fi_job), sep='\n')
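Since pd.qcut splits SalaryNormalized at the median, the two classes should come out close to balanced; a quick sanity check:
In [ ]:
# Expect roughly equal counts for labels 0 and 1.
print(y_job.value_counts())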