Datapot Usage Examples


In [1]:
import datapot as dp
from datapot import datasets

In [2]:
from __future__ import print_function  # must be the first statement in the cell
import sys
import bz2
import time

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import cross_val_score

from datapot.utils import csv_to_jsonlines

Extracting timestamp features from a dataset

Start by previewing the CSV file that will be converted to JSON lines.


In [3]:
transactions = pd.read_csv('../data/transactions.csv')
transactions.head()


Out[3]:
merchant_id latitude longitude real_transaction_dttm record_date
0 178 0.000000 0.000000 9:34:47 9:30:36
1 178 55.055995 82.912991 17:49:50 17:54:24
2 178 0.000000 0.000000 9:34:47 9:31:22
3 178 55.056034 82.912734 17:49:50 17:43:01
4 178 55.056034 82.912734 17:49:50 17:45:17

Creating the DataPot object.


In [4]:
datapot = dp.DataPot()

In [5]:
from datapot.utils import csv_to_jsonlines

csv_to_jsonlines('../data/transactions.csv', '../data/transactions.jsonlines')

In [6]:
data_trns = open('../data/transactions.jsonlines')
data_trns.readline()


Out[6]:
'{"merchant_id":178,"latitude":0.0,"longitude":0.0,"real_transaction_dttm":"9:34:47","record_date":"9:30:36"}\n'

Let's call the detect method. It automatically finds appropriate transformers for the fields of the jsonlines file. The 'limit' parameter sets how many objects are used to detect the right transformers.


In [7]:
datapot.detect(data_trns, limit=100)


Out[7]:
DataPot class instance
 - number of features without transformation: 5
 - number of new features: 13
features to transform: 
	('merchant_id', [SVDOneHotTransformer, NumericTransformer])
	('latitude', [NumericTransformer])
	('longitude', [NumericTransformer])
	('real_transaction_dttm', [TimestampTransformer])
	('record_date', [TimestampTransformer])
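The idea behind 'limit' is that only the head of the stream is inspected. A rough sketch of that sampling step follows. It is not datapot's detection logic: sample_records and guess_kind are hypothetical helpers, and the classification rule is deliberately crude. The sketch also rewinds the stream, which any later full pass over the same file object would need.

```python
import io
import itertools
import json


def sample_records(stream, limit):
    """Parse at most `limit` JSON lines from the start of `stream`, then rewind."""
    stream.seek(0)
    records = [json.loads(line) for line in itertools.islice(stream, limit)]
    stream.seek(0)
    return records


def guess_kind(values):
    """Very rough field classification based on the sampled values only."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return 'numeric'
    return 'other'


stream = io.StringIO(
    '{"merchant_id": 178, "record_date": "9:30:36"}\n'
    '{"merchant_id": 178, "record_date": "17:54:24"}\n'
    '{"merchant_id": 179, "record_date": "9:31:22"}\n'
)
sample = sample_records(stream, limit=2)
kinds = {field: guess_kind([r[field] for r in sample]) for field in sample[0]}
print(len(sample), kinds)
```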

In [8]:
t0 = time.time()
datapot.fit(data_trns, verbose=True)
print('fit time:', time.time()-t0)


fit transformers...
fit: ('merchant_id', [SVDOneHotTransformer, NumericTransformer])
fit: ('latitude', [NumericTransformer])
fit: ('longitude', [NumericTransformer])
fit: ('real_transaction_dttm', [TimestampTransformer])
fit: ('record_date', [TimestampTransformer])
fit transformers...OK
num of new features: 23
fit time: 4.036453008651733

In [9]:
datapot


Out[9]:
DataPot class instance
 - number of features without transformation: 5
 - number of new features: 23
features to transform: 
	('merchant_id', [SVDOneHotTransformer, NumericTransformer])
	('latitude', [NumericTransformer])
	('longitude', [NumericTransformer])
	('real_transaction_dttm', [TimestampTransformer])
	('record_date', [TimestampTransformer])

Let's remove the SVDOneHotTransformer from 'merchant_id'. It is the transformer at index 0 in that field's list.


In [10]:
datapot.remove_transformer('merchant_id', 0)


Out[10]:
DataPot class instance
 - number of features without transformation: 5
 - number of new features: 23
features to transform: 
	('merchant_id', [NumericTransformer])
	('latitude', [NumericTransformer])
	('longitude', [NumericTransformer])
	('real_transaction_dttm', [TimestampTransformer])
	('record_date', [TimestampTransformer])

In [11]:
t0 = time.time()
df_trns = datapot.transform(data_trns)
print('transform time:', time.time()-t0)


transform time: 42.444371938705444
/usr/local/lib/python3.6/site-packages/datapot/__init__.py:137: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  return pd.DataFrame(data=np.hstack(columns), columns=names).convert_objects(convert_numeric=True)

In [12]:
df_trns.head()


Out[12]:
merchant_id latitude longitude real_transaction_dttm_timestamp_unixtime real_transaction_dttm_timestamp_week_day real_transaction_dttm_timestamp_month_day real_transaction_dttm_timestamp_hour real_transaction_dttm_timestamp_minute record_date_timestamp_unixtime record_date_timestamp_week_day record_date_timestamp_month_day record_date_timestamp_hour record_date_timestamp_minute
0 178.0 0.000000 0.000000 1.496299e+09 3.0 1.0 9.0 34.0 1.496299e+09 3.0 1.0 9.0 30.0
1 178.0 55.055996 82.912991 1.496329e+09 3.0 1.0 17.0 49.0 1.496329e+09 3.0 1.0 17.0 54.0
2 178.0 0.000000 0.000000 1.496299e+09 3.0 1.0 9.0 34.0 1.496299e+09 3.0 1.0 9.0 31.0
3 178.0 55.056034 82.912734 1.496329e+09 3.0 1.0 17.0 49.0 1.496328e+09 3.0 1.0 17.0 43.0
4 178.0 55.056034 82.912734 1.496329e+09 3.0 1.0 17.0 49.0 1.496328e+09 3.0 1.0 17.0 45.0
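The column names in Out[12] follow the pattern field_timestamp_part. The sketch below shows how such parts can be derived from a single timestamp with the standard library. It is an illustration only, not the TimestampTransformer code: the function name, the explicit date format, and the use of UTC are assumptions. The source data holds only times of day, so the date part in the output above presumably comes from a parser default, whereas the sketch takes a full date.

```python
import calendar
from datetime import datetime


def timestamp_features(value, prefix, fmt='%Y-%m-%d %H:%M:%S'):
    """Expand one timestamp string into numeric calendar features (UTC assumed)."""
    ts = datetime.strptime(value, fmt)
    return {
        prefix + '_timestamp_unixtime': calendar.timegm(ts.timetuple()),
        prefix + '_timestamp_week_day': ts.weekday(),  # Monday is 0
        prefix + '_timestamp_month_day': ts.day,
        prefix + '_timestamp_hour': ts.hour,
        prefix + '_timestamp_minute': ts.minute,
    }


features = timestamp_features('2017-06-01 09:34:47', 'real_transaction_dttm')
for name, value in features.items():
    print(name, value)
```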


Bag of Words Meets Bags of Popcorn

Usage example for unstructured textual bzip2-compressed data

https://www.kaggle.com/c/word2vec-nlp-tutorial/data

The datapot.fit method subsamples the data to detect the language and to choose the corresponding stopwords and stemmer.

For each review, datapot.transform generates a 12-dimensional, SVD-compressed tf-idf vector.
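To make the tf-idf part concrete, here is a minimal standard-library sketch of the classic weighting, tf * log(N / df). It covers only the weighting step: language detection, stopwords, stemming, and the SVD compression to 12 dimensions are left out, and the tfidf helper is an illustrative assumption rather than datapot's TfidfTransformer. Production libraries typically use smoothed variants of this formula.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


reviews = ['great movie great cast', 'boring movie', 'great fun']
vectors = tfidf(reviews)
print(vectors[0])
```

Terms that occur in every document get weight zero, which is why very common words contribute little to the resulting vectors.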


In [13]:
import datapot as dp
from datapot import datasets

Load data from datapot.datasets


In [14]:
data_imdb = datasets.load_imdb()

Or load directly from file


In [15]:
data_imdb = bz2.BZ2File('data/imdb.jsonlines.bz2')
# imdb.jsonlines example: {"id":"5814_8", "sentiment":1, "review":"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.
#
# Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.
#
# The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.
#
# Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.
#
# Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."}

In [16]:
datapot_imdb = dp.DataPot()

In [17]:
t0 = time.time()
datapot_imdb.detect(data_imdb)
print('detect time:', time.time()-t0)
datapot_imdb


detect time: 0.04420304298400879
Out[17]:
DataPot class instance
 - number of features without transformation: 3
 - number of new features: Unknown
features to transform: 
	('id', [NumericTransformer])
	('sentiment', [SVDOneHotTransformer, NumericTransformer])
	('review', [TfidfTransformer])

In [18]:
datapot_imdb.remove_transformer('sentiment', 0)


Out[18]:
DataPot class instance
 - number of features without transformation: 3
 - number of new features: Unknown
features to transform: 
	('id', [NumericTransformer])
	('sentiment', [NumericTransformer])
	('review', [TfidfTransformer])

In [19]:
t0 = time.time()
datapot_imdb.fit(data_imdb, verbose=True)


fit transformers...
fit: ('id', [NumericTransformer])
fit: ('sentiment', [NumericTransformer])
fit: ('review', [TfidfTransformer])
fit transformers...OK
num of new features: 14
Out[19]:
DataPot class instance
 - number of features without transformation: 3
 - number of new features: 14
features to transform: 
	('id', [NumericTransformer])
	('sentiment', [NumericTransformer])
	('review', [TfidfTransformer])

In [20]:
print('fit time:', time.time()-t0)


fit time: 4.17433500289917

In [21]:
t0 = time.time()
df_imdb = datapot_imdb.transform(data_imdb)
print('transform time:', time.time()-t0)


transform time: 3.3115808963775635
/usr/local/lib/python3.6/site-packages/datapot/__init__.py:137: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  return pd.DataFrame(data=np.hstack(columns), columns=names).convert_objects(convert_numeric=True)

In [22]:
df_imdb.head()


Out[22]:
id sentiment review_0 review_1 review_2 review_3 review_4 review_5 review_6 review_7 review_8 review_9 review_10 review_11
0 58148.0 1.0 0.033939 0.066220 0.045984 0.000000 0.030910 0.117753 0.039371 0.034749 0.013392 0.046078 0.110713 0.013378
1 23819.0 1.0 0.063591 0.000000 0.021630 0.005718 0.019691 0.021786 0.042178 0.076461 0.014525 0.000000 0.013750 0.000000
2 77593.0 0.0 0.097556 0.018326 0.003088 0.007263 0.000000 0.000000 0.020470 0.000000 0.173276 0.005671 0.000000 0.000000
3 36304.0 0.0 0.126620 0.035640 0.011742 0.006708 0.000000 0.027994 0.082361 0.053935 0.054434 0.001473 0.008279 0.000000
4 94958.0 1.0 0.064286 0.000287 0.010754 0.039657 0.000336 0.035009 0.001940 0.016348 0.118498 0.046068 0.022181 0.001115

In [23]:
X = df_imdb.drop(['sentiment'], axis=1)
y = df_imdb['sentiment']

In [24]:
model = xgb.XGBClassifier()
cv_score = cross_val_score(model, X, y, cv=5)
assert all(i > 0.5 for i in cv_score), 'Low score!'
print('Cross-val score:', cv_score)

model.fit(X, y)
fi = model.feature_importances_

print('Feature importance:')
print(*(list(zip(X.columns, fi))), sep='\n')


Cross-val score: [ 0.72427572  0.73226773  0.726       0.72772773  0.70870871]
Feature importance:
('id', 0.16129032)
('review_0', 0.05882353)
('review_1', 0.068311192)
('review_2', 0.060721062)
('review_3', 0.064516127)
('review_4', 0.072106265)
('review_5', 0.062618598)
('review_6', 0.070208728)
('review_7', 0.089184061)
('review_8', 0.1309298)
('review_9', 0.066413663)
('review_10', 0.051233396)
('review_11', 0.043643262)

Job Salary Prediction

Usage example for bzip2-compressed data that mixes free text with categorical and numeric fields


In [25]:
from datapot import datasets

data_job = datasets.load_job_salary()

# Or load from file:
# data_job = bz2.BZ2File('datapot/data/job.jsonlines.bz2')
# jobs.jsonlines example: {"Id":12612628, "Title":"Engineering Systems Analyst","FullDescription":"Engineering Systems Analyst Dorking Surrey Salary ****K Our client is located in Dorking, Surrey and are looking for Engineering Systems Analyst our client provides specialist software development Keywords Mathematical Modelling, Risk Analysis, System Modelling, Optimisation, MISER, PIONEEER Engineering Systems Analyst Dorking Surrey Salary ****K", "LocationNormalized":"Dorking", "ContractType":null, "ContractTime":"permanent", "Company":"Gregory Martin International", "Category":"Engineering Jobs", "SalaryNormalized":25000}

In [26]:
datapot_job = dp.DataPot()

In [27]:
t0 = time.time()
datapot_job.detect(data_job)
print('detect time:', time.time()-t0)
datapot_job


detect time: 0.03157186508178711
Out[27]:
DataPot class instance
 - number of features without transformation: 9
 - number of new features: Unknown
features to transform: 
	('Id', [NumericTransformer])
	('FullDescription', [TfidfTransformer])
	('ContractType', [SVDOneHotTransformer])
	('ContractTime', [SVDOneHotTransformer])
	('Category', [SVDOneHotTransformer])
	('SalaryNormalized', [NumericTransformer])

In [28]:
t0 = time.time()
datapot_job.fit(data_job, verbose=True)
print('fit time:', time.time()-t0)


fit transformers...
fit: ('Id', [NumericTransformer])
fit: ('FullDescription', [TfidfTransformer])
fit: ('ContractType', [SVDOneHotTransformer])
fit: ('ContractTime', [SVDOneHotTransformer])
fit: ('Category', [SVDOneHotTransformer])
fit: ('SalaryNormalized', [NumericTransformer])
fit transformers...OK
num of new features: 38
fit time: 1.8940820693969727

In [29]:
t0 = time.time()
df_job = datapot_job.transform(data_job)
print('transform time:', time.time()-t0)


transform time: 2.0284600257873535
/usr/local/lib/python3.6/site-packages/datapot/__init__.py:137: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  return pd.DataFrame(data=np.hstack(columns), columns=names).convert_objects(convert_numeric=True)

In [30]:
print(df_job.columns)
print(df_job.shape)
df_job.head()


Index(['Id', 'FullDescription_0', 'FullDescription_1', 'FullDescription_2',
       'FullDescription_3', 'FullDescription_4', 'FullDescription_5',
       'FullDescription_6', 'FullDescription_7', 'FullDescription_8',
       'FullDescription_9', 'FullDescription_10', 'FullDescription_11',
       'ContractType_None', 'ContractType_full_time', 'ContractType_part_time',
       'ContractTime_permanent', 'ContractTime_None', 'ContractTime_contract',
       'Category_Engineering Jobs', 'Category_HR & Recruitment Jobs',
       'Category_Accounting & Finance Jobs',
       'Category_Healthcare & Nursing Jobs', 'Category_Other/General Jobs',
       'Category_Hospitality & Catering Jobs', 'Category_IT Jobs',
       'Category_Customer Services Jobs', 'Category_Travel Jobs',
       'Category_Sales Jobs', 'Category_Manufacturing Jobs',
       'Category_Teaching Jobs', 'Category_Creative & Design Jobs',
       'Category_Trade & Construction Jobs', 'Category_Property Jobs',
       'Category_Admin Jobs', 'Category_Legal Jobs', 'Category_Retail Jobs',
       'SalaryNormalized'],
      dtype='object')
(2000, 38)
Out[30]:
Id FullDescription_0 FullDescription_1 FullDescription_2 FullDescription_3 FullDescription_4 FullDescription_5 FullDescription_6 FullDescription_7 FullDescription_8 ... Category_Sales Jobs Category_Manufacturing Jobs Category_Teaching Jobs Category_Creative & Design Jobs Category_Trade & Construction Jobs Category_Property Jobs Category_Admin Jobs Category_Legal Jobs Category_Retail Jobs SalaryNormalized
0 12612628.0 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.150115 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 25000.0
1 12612830.0 0.013077 0.000000 0.0 0.007214 0.010782 0.016549 0.0 0.221792 0.016945 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 30000.0
2 12612844.0 0.040371 0.000186 0.0 0.000000 0.003483 0.000266 0.0 0.098020 0.011783 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 30000.0
3 12613049.0 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.142823 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 27500.0
4 12613647.0 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.116813 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 25000.0

5 rows × 38 columns
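Column names such as ContractTime_permanent and ContractTime_None in Out[30] suggest one indicator column per category value. A minimal sketch of that kind of one-hot encoding is shown below. The one_hot helper is hypothetical, and whatever SVD step the SVDOneHotTransformer applies is not reproduced here.

```python
def one_hot(records, field):
    """Encode `field` as 0/1 columns named '<field>_<value>', in first-seen order."""
    values = []
    for record in records:
        if record.get(field) not in values:
            values.append(record.get(field))
    return [
        {'{}_{}'.format(field, v): int(record.get(field) == v) for v in values}
        for record in records
    ]


jobs = [
    {'ContractTime': 'permanent'},
    {'ContractTime': None},
    {'ContractTime': 'contract'},
]
encoded = one_hot(jobs, 'ContractTime')
for row in encoded:
    print(row)
```

A missing value is treated as its own category here, which matches the presence of the _None columns in the output above.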


In [31]:
X_job = df_job.drop(['SalaryNormalized', 'Id'], axis=1)
# Binarize the salary at its median; astype(int) yields plain 0/1 labels.
y_job = pd.qcut(df_job['SalaryNormalized'], q=2, labels=[0, 1]).astype(int)

model = xgb.XGBClassifier()
cv_score_job = cross_val_score(model, X_job, y_job, cv=5)
print('Cross-val score:', cv_score_job)
assert all(i > 0.5 for i in cv_score_job), 'Low score!'

model.fit(X_job, y_job)
fi_job = model.feature_importances_

print('Feature importance:')
print(*(list(zip(X_job.columns, fi_job))), sep='\n')


Cross-val score: [ 0.71072319  0.84538653  0.715       0.72431078  0.72932331]
Feature importance:
('FullDescription_0', 0.072026804)
('FullDescription_1', 0.14237855)
('FullDescription_2', 0.082077049)
('FullDescription_3', 0.046901174)
('FullDescription_4', 0.038525961)
('FullDescription_5', 0.095477387)
('FullDescription_6', 0.14405361)
('FullDescription_7', 0.070351757)
('FullDescription_8', 0.072026804)
('FullDescription_9', 0.070351757)
('FullDescription_10', 0.046901174)
('FullDescription_11', 0.048576213)
('ContractType_None', 0.0)
('ContractType_full_time', 0.01675042)
('ContractType_part_time', 0.01675042)
('ContractTime_permanent', 0.0083752098)
('ContractTime_None', 0.0050251256)
('ContractTime_contract', 0.0)
('Category_Engineering Jobs', 0.0016750419)
('Category_HR & Recruitment Jobs', 0.0)
('Category_Accounting & Finance Jobs', 0.0)
('Category_Healthcare & Nursing Jobs', 0.0)
('Category_Other/General Jobs', 0.0)
('Category_Hospitality & Catering Jobs', 0.0)
('Category_IT Jobs', 0.011725293)
('Category_Customer Services Jobs', 0.0083752098)
('Category_Travel Jobs', 0.0016750419)
('Category_Sales Jobs', 0.0)
('Category_Manufacturing Jobs', 0.0)
('Category_Teaching Jobs', 0.0)
('Category_Creative & Design Jobs', 0.0)
('Category_Trade & Construction Jobs', 0.0)
('Category_Property Jobs', 0.0)
('Category_Admin Jobs', 0.0)
('Category_Legal Jobs', 0.0)
('Category_Retail Jobs', 0.0)
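The target in cell 31 comes from pd.qcut with q=2, which splits the salaries at their median: values at or below the median get label 0, values above it get label 1. The same rule can be sketched without pandas. The median_split helper and the sample salaries (taken from the first rows of Out[30]) are for illustration only.

```python
import statistics


def median_split(values):
    """Label each value 1 if it lies above the median, else 0."""
    median = statistics.median(values)
    return [1 if v > median else 0 for v in values]


salaries = [25000, 30000, 30000, 27500, 25000]
labels = median_split(salaries)
print(labels)
```

Turning the regression target into two roughly balanced classes is what makes the 0.5 threshold in the cross-validation assert a meaningful baseline.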
