Preparation phase

Objective for training dataframe:

  • Cleaning the data (missing data imputation/replacement, change the string to numeric, and so on)
  • Split the data frame into subsets (if the size of the training data is very large) and perform feature extraction/feature engineering and then merge the result
  • Split training dataframe into two part; 80% of the data for the training itself, and other 20% is for validation (repeat the process using different subset and it become cross-validation)

Objective for test dataframe:

  • Cleaning the data (missing data imputation/replacement, change the string to numeric, and so on)
  • Split the data frame into subsets (if the size of the training data is very large) and perform feature extraction/feature engineering and then merge the result. Create the same features for the test set

In [50]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('preparation.csv',delimiter=',')

In [3]:
df.shape


Out[3]:
(59400, 19)

In [4]:
!ls


label.csv	      submission1.csv	 SubmissionFormat.csv	trainingset.csv
Preparatin_TzW.ipynb  submission2.csv	 Tanzanian Water.ipynb
preparation.csv       submission3rf.csv  testset.csv

take 80% as train dataframe, and leave 20% as a validator/to check the accuracy of our prediction before applying the model to the real test data.


In [5]:
59400*0.8


Out[5]:
47520.0

In [6]:
dftrain = pd.read_csv('preparation.csv',delimiter=',',nrows=47520)

In [131]:
dfpayment = pd.read_csv('trainingset.csv',delimiter=',',nrows=47520)

In [132]:
dftrain.tail()


Out[132]:
id amount_tsh gps_height longitude latitude num_private region_code district_code population construction_year ... quantity_group source_type source_class waterpoint_type status_group wp_age daysdiff terrain water_availability payment
47515 34684 0.0 0 33.457381 -3.552051 0 17 7 0 0 ... 1 4 1 3 1 2017 1655.0 0 0 3
47516 8766 0.0 1321 30.592280 -3.733459 0 16 1 130 2003 ... 1 6 2 1 3 14 1529.0 2 0 2
47517 22662 0.0 1306 35.262273 -9.831219 0 10 2 0 1994 ... 3 6 2 2 2 23 1512.0 2 0 4
47518 39854 50.0 -13 38.979195 -6.519993 0 6 1 60 1995 ... 1 6 2 1 1 22 2216.0 0 1 3
47519 45078 0.0 0 33.420152 -2.795071 0 19 4 0 0 ... 1 5 1 3 1 2017 2080.0 0 0 2

5 rows × 22 columns


In [133]:
dftrain['terrain'] = 'dataran tinggi'

In [139]:
dftrain['terrain'][dftrain['gps_height']<=500] = 'dataran rendah'


/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

In [140]:
dftrain['terrain'][(dftrain['gps_height']>500) & (dftrain['gps_height']<=1000)] = 'dataran sedang'


/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

In [141]:
dftrain['terrain'].replace('dataran rendah',0,inplace=True)
dftrain['terrain'].replace('dataran sedang',1,inplace=True)
dftrain['terrain'].replace('dataran tinggi',2,inplace=True)

In [144]:
dftrain.head()


Out[144]:
id amount_tsh gps_height longitude latitude num_private region_code district_code population construction_year ... quantity_group source_type source_class waterpoint_type status_group wp_age daysdiff terrain water_availability payment
0 69572 6000.0 1390 34.938093 -9.856322 0 11 5 109 1999 ... 1 1 1 1 1 18 2225.0 2 1 1
1 8776 0.0 1399 34.698766 -2.147466 0 20 2 280 2010 ... 2 2 2 1 1 7 1502.0 2 0 2
2 34310 25.0 686 37.460664 -3.821329 0 21 4 250 2009 ... 1 3 2 2 1 8 1511.0 1 1 3
3 67743 0.0 263 38.486161 -11.155298 0 90 63 58 1986 ... 3 4 1 2 2 31 1539.0 0 0 2
4 19728 0.0 0 31.130847 -1.825359 0 18 1 0 0 ... 4 2 2 1 1 2017 2104.0 0 0 2

5 rows × 22 columns


In [145]:
print dftrain.columns


Index([u'id', u'amount_tsh', u'gps_height', u'longitude', u'latitude',
       u'num_private', u'region_code', u'district_code', u'population',
       u'construction_year', u'water_quality', u'quality_group',
       u'quantity_group', u'source_type', u'source_class', u'waterpoint_type',
       u'status_group', u'wp_age', u'daysdiff', u'terrain',
       u'water_availability', u'payment'],
      dtype='object')

In [146]:
dftrain['water_availability'] = dftrain.apply(lambda row: 0 if row['amount_tsh']==0.0 else 1,axis=1)

In [147]:
dftrain['payment'] = dfpayment['payment']

In [148]:
dftrain['payment'].unique()


Out[148]:
array(['pay annually', 'never pay', 'pay per bucket', 'unknown',
       'pay when scheme fails', 'other', 'pay monthly'], dtype=object)

In [149]:
dftrain['payment'].replace('pay annually',1,inplace=True)
dftrain['payment'].replace('never pay',2,inplace=True)
dftrain['payment'].replace('pay per bucket',3,inplace=True)
dftrain['payment'].replace('unknown',0,inplace=True)
dftrain['payment'].replace('pay when scheme fails',4,inplace=True)
dftrain['payment'].replace('other',5,inplace=True)
dftrain['payment'].replace('pay monthly',6,inplace=True)

#prepare the training dataset for export


In [150]:
dftrain2 = dftrain[['id','water_availability','terrain','num_private','region_code','district_code','population','water_quality','quality_group','quantity_group','source_type','source_class','waterpoint_type','wp_age','daysdiff','payment','status_group']].copy()

In [151]:
dftrain2.tail()


Out[151]:
id water_availability terrain num_private region_code district_code population water_quality quality_group quantity_group source_type source_class waterpoint_type wp_age daysdiff payment status_group
47515 34684 0 0 0 17 7 0 2 2 1 4 1 3 2017 1655.0 3 1
47516 8766 0 2 0 16 1 130 1 1 1 6 2 1 14 1529.0 2 3
47517 22662 0 2 0 10 2 0 1 1 3 6 2 2 23 1512.0 4 2
47518 39854 1 0 0 6 1 60 1 1 1 6 2 1 22 2216.0 3 1
47519 45078 0 0 0 19 4 0 1 1 1 5 1 3 2017 2080.0 2 1

In [152]:
dftrain2.to_csv('dftrain.csv',sep=',',header=True,index=False)

#prepare the validation dataset

The validation dataset must have same table structure with the training dataset so that we can calculate the accuracy.


In [153]:
dfvalidation = pd.read_csv('preparation.csv',delimiter=',',skiprows=range(1,47520),nrows=11880)

In [154]:
dfv_payment = pd.read_csv('trainingset.csv',delimiter=',',skiprows=range(1,47520),nrows=11880)

In [156]:
print dfvalidation.shape
dfvalidation.tail()


(11880, 19)
Out[156]:
id amount_tsh gps_height longitude latitude num_private region_code district_code population construction_year water_quality quality_group quantity_group source_type source_class waterpoint_type status_group wp_age daysdiff
11875 11164 500.0 351 37.634053 -6.124830 0 5 6 89 2007 1 1 1 4 1 1 2 10 2230.0
11876 60739 10.0 1210 37.169807 -3.253847 0 3 5 125 1999 1 1 1 1 1 1 1 18 1444.0
11877 27263 4700.0 1212 35.249991 -9.070629 0 11 4 56 1996 1 1 1 6 2 1 1 21 2171.0
11878 37057 0.0 0 34.017087 -8.750434 0 12 7 0 0 4 4 1 4 1 3 1 2017 2197.0
11879 31282 0.0 0 35.861315 -6.378573 0 1 4 0 0 1 1 2 5 1 3 1 2017 2231.0

In [157]:
dfvalidation['terrain'] = 'dataran tinggi'

In [158]:
dfvalidation['terrain'][dftrain['gps_height']<=500] = 'dataran rendah'
dfvalidation['terrain'][(dfvalidation['gps_height']>500) & (dfvalidation['gps_height']<=1000)] = 'dataran sedang'


/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

In [159]:
dfvalidation['terrain'].replace('dataran rendah',0,inplace=True)
dfvalidation['terrain'].replace('dataran sedang',1,inplace=True)
dfvalidation['terrain'].replace('dataran tinggi',2,inplace=True)

In [160]:
dfvalidation.head()


Out[160]:
id amount_tsh gps_height longitude latitude num_private region_code district_code population construction_year water_quality quality_group quantity_group source_type source_class waterpoint_type status_group wp_age daysdiff terrain
0 45078 0.0 0 33.420152 -2.795071 0 19 4 0 0 1 1 1 5 1 3 1 2017 2080.0 2
1 31768 50.0 99 38.991021 -9.255090 0 80 13 860 1985 1 1 1 5 1 3 1 32 1546.0 2
2 36242 0.0 1518 37.314627 -3.218825 0 3 4 30 1976 1 1 1 1 1 0 3 41 1504.0 2
3 18510 1000.0 0 38.803419 -4.865751 0 4 8 40 1990 1 1 2 1 1 1 1 27 2223.0 0
4 45518 0.0 0 32.561960 -9.174393 0 12 6 0 0 1 1 1 6 2 1 1 2017 2108.0 0

In [161]:
dfvalidation['water_availability'] = dfvalidation.apply(lambda row: 0 if row['amount_tsh']==0.0 else 1,axis=1)

In [162]:
dfvalidation['payment'] = dfv_payment['payment']

In [163]:
dfvalidation['payment'].replace('pay annually',1,inplace=True)
dfvalidation['payment'].replace('never pay',2,inplace=True)
dfvalidation['payment'].replace('pay per bucket',3,inplace=True)
dfvalidation['payment'].replace('unknown',0,inplace=True)
dfvalidation['payment'].replace('pay when scheme fails',4,inplace=True)
dfvalidation['payment'].replace('other',5,inplace=True)
dfvalidation['payment'].replace('pay monthly',6,inplace=True)

In [177]:
dfvalidation2 = dfvalidation[['id','water_availability','terrain','num_private','region_code','district_code','population','water_quality','quality_group','quantity_group','source_type','source_class','waterpoint_type','wp_age','daysdiff','payment','status_group']].copy()

In [178]:
dfvalidation2.head()


Out[178]:
id water_availability terrain num_private region_code district_code population water_quality quality_group quantity_group source_type source_class waterpoint_type wp_age daysdiff payment status_group
0 45078 0 2 0 19 4 0 1 1 1 5 1 3 2017 2080.0 2 1
1 31768 1 2 0 80 13 860 1 1 1 5 1 3 32 1546.0 3 1
2 36242 0 2 0 3 4 30 1 1 1 1 1 0 41 1504.0 2 3
3 18510 1 0 0 4 8 40 1 1 2 1 1 1 27 2223.0 6 1
4 45518 0 0 0 12 6 0 1 1 1 6 2 1 2017 2108.0 2 1

In [179]:
dfvalidation2.to_csv('validation.csv',sep=',',header=True,index=False)

In [167]:
!ls -l


total 35184
-rw-r--r-- 1 febrifahmi febrifahmi  2189312 Apr 16 08:57 dftrain.csv
-rw-r--r-- 1 febrifahmi febrifahmi  1148327 Apr 14 21:12 label.csv
-rw-r--r-- 1 febrifahmi febrifahmi    94598 Apr 16 07:34 Modelling_TzW.ipynb
-rw-r--r-- 1 febrifahmi febrifahmi  4495565 Apr 16 04:49 preparation.csv
-rw-r--r-- 1 febrifahmi febrifahmi    86334 Apr 16 09:00 Preparation_TzW.ipynb
-rw-r--r-- 1 febrifahmi febrifahmi   443312 Apr 15 07:43 submission1.csv
-rw-r--r-- 1 febrifahmi febrifahmi   278500 Apr 15 07:55 submission2.csv
-rw-r--r-- 1 febrifahmi febrifahmi   278077 Apr 15 08:35 submission3rf.csv
-rw-r--r-- 1 febrifahmi febrifahmi   283084 Apr 16 07:34 submission4.csv
-rw-r--r-- 1 febrifahmi febrifahmi   324512 Apr 14 21:14 SubmissionFormat.csv
-rw-r--r-- 1 febrifahmi febrifahmi   135699 Apr 16 04:51 Tanzanian Water.ipynb
-rw-r--r-- 1 febrifahmi febrifahmi  5016337 Apr 14 21:15 testset.csv
-rw-r--r-- 1 febrifahmi febrifahmi   624805 Apr 16 07:25 testsetfix.csv
-rw-r--r-- 1 febrifahmi febrifahmi 20069199 Apr 14 20:59 trainingset.csv
-rw-r--r-- 1 febrifahmi febrifahmi   523733 Apr 16 09:00 validation.csv

In [168]:
print "done"


done

#preparation of test dataset


In [169]:
dftest = pd.read_csv('testset.csv',delimiter=',')

In [170]:
dftest.head()


Out[170]:
id amount_tsh date_recorded funder gps_height installer longitude latitude wpt_name num_private ... payment_type water_quality quality_group quantity quantity_group source source_type source_class waterpoint_type waterpoint_type_group
0 50785 0.0 2013-02-04 Dmdd 1996 DMDD 35.290799 -4.059696 Dinamu Secondary School 0 ... never pay soft good seasonal seasonal rainwater harvesting rainwater harvesting surface other other
1 51630 0.0 2013-02-04 Government Of Tanzania 1569 DWE 36.656709 -3.309214 Kimnyak 0 ... never pay soft good insufficient insufficient spring spring groundwater communal standpipe communal standpipe
2 17168 0.0 2013-02-01 NaN 1567 NaN 34.767863 -5.004344 Puma Secondary 0 ... never pay soft good insufficient insufficient rainwater harvesting rainwater harvesting surface other other
3 45559 0.0 2013-01-22 Finn Water 267 FINN WATER 38.058046 -9.418672 Kwa Mzee Pange 0 ... unknown soft good dry dry shallow well shallow well groundwater other other
4 49871 500.0 2013-03-27 Bruder 1260 BRUDER 35.006123 -10.950412 Kwa Mzee Turuka 0 ... monthly soft good enough enough spring spring groundwater communal standpipe communal standpipe

5 rows × 40 columns


In [171]:
dftest.isnull().sum().sum()


Out[171]:
11464

In [172]:
print list(dftest.columns)


['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height', 'installer', 'longitude', 'latitude', 'wpt_name', 'num_private', 'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga', 'ward', 'population', 'public_meeting', 'recorded_by', 'scheme_management', 'scheme_name', 'permit', 'construction_year', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'quantity_group', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group']

In [96]:
dftest['source_class'].isnull().sum().sum()


Out[96]:
0

In [80]:
dftest['public_meeting'].fillna('unknown',inplace=True)

In [83]:
dftest['payment'].unique()


Out[83]:
array(['never pay', 'unknown', 'pay monthly', 'pay per bucket',
       'pay annually', 'pay when scheme fails', 'other'], dtype=object)

In [84]:
dftest['wp_age'] = dftest.apply(lambda row: (2017-row['construction_year']),axis=1)

In [85]:
dftest['water_quality'].replace('soft',1,inplace=True)
dftest['water_quality'].replace('salty',2,inplace=True)
dftest['water_quality'].replace('milky',3,inplace=True)
dftest['water_quality'].replace('fluoride',4,inplace=True)
dftest['water_quality'].replace('coloured',5,inplace=True)
dftest['water_quality'].replace('salty abandoned',6,inplace=True)
dftest['water_quality'].replace('fluoride abandoned',7,inplace=True)
dftest['water_quality'].replace('unknown',0,inplace=True)

In [89]:
dftest['quantity_group'].replace('enough',1,inplace=True)
dftest['quantity_group'].replace('insufficient',2,inplace=True)
dftest['quantity_group'].replace('dry',3,inplace=True)
dftest['quantity_group'].replace('seasonal',4,inplace=True)
dftest['quantity_group'].replace('unknown',0,inplace=True)

In [91]:
dftest['waterpoint_type'].replace('communal standpipe',1,inplace=True)
dftest['waterpoint_type'].replace('communal standpipe multiple',2,inplace=True)
dftest['waterpoint_type'].replace('hand pump',3,inplace=True)
dftest['waterpoint_type'].replace('other',0,inplace=True)
dftest['waterpoint_type'].replace('improved spring',4,inplace=True)
dftest['waterpoint_type'].replace('cattle trough',5,inplace=True)
dftest['waterpoint_type'].replace('dam',6,inplace=True)

In [93]:
dftest['source_type'].replace('spring',1,inplace=True)
dftest['source_type'].replace('rainwater harvesting',2,inplace=True)
dftest['source_type'].replace('dam',3,inplace=True)
dftest['source_type'].replace('borehole',4,inplace=True)
dftest['source_type'].replace('other',0,inplace=True)
dftest['source_type'].replace('shallow well',5,inplace=True)
dftest['source_type'].replace('river/lake',6,inplace=True)

In [95]:
dftest['quality_group'].replace('good',1,inplace=True)
dftest['quality_group'].replace('salty',2,inplace=True)
dftest['quality_group'].replace('milky',3,inplace=True)
dftest['quality_group'].replace('unknown',0,inplace=True)
dftest['quality_group'].replace('fluoride',4,inplace=True)
dftest['quality_group'].replace('colored',5,inplace=True)

In [97]:
dftest['source_class'].replace('groundwater',1,inplace=True)
dftest['source_class'].replace('surface',2,inplace=True)
dftest['source_class'].replace('unknown',0,inplace=True)

In [98]:
import datetime as dt

In [99]:
dftest['today'] = dftest.apply(lambda row: dt.datetime.today().strftime('%Y-%m-%d'),axis=1)

In [100]:
dftest['date_recorded'] = [time.date() for time in dftest['date_recorded'].astype('datetime64[ns]')]
dftest['today'] = [time.date() for time in dftest['today'].astype('datetime64[ns]')]

In [101]:
dftest['lastcheckdays'] = abs(dftest['date_recorded'].sub(dftest['today'], axis=0))

In [102]:
dftest['daysdiff'] = dftest['lastcheckdays'] / np.timedelta64(1, 'D')

In [114]:
dftest['terrain'] = 'dataran tinggi'

In [115]:
dftest['terrain'][dftest['gps_height']<=500] = 'dataran rendah'
dftest['terrain'][(dftest['gps_height']>500) & (dftest['gps_height']<=1000)] = 'dataran sedang'


/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

In [116]:
dftest['terrain'].replace('dataran rendah',0,inplace=True)
dftest['terrain'].replace('dataran sedang',1,inplace=True)
dftest['terrain'].replace('dataran tinggi',2,inplace=True)

In [117]:
dftest['water_availability'] = dftest.apply(lambda row: 0 if row['amount_tsh']==0.0 else 1,axis=1)

In [118]:
dftest2 = dftest._get_numeric_data()

In [119]:
dftest2.head()


Out[119]:
id amount_tsh gps_height longitude latitude num_private region_code district_code population construction_year water_quality quality_group quantity_group source_type source_class waterpoint_type wp_age daysdiff terrain water_availability
0 50785 0.0 1996 35.290799 -4.059696 0 21 3 321 2012 1 1 4 2 2 0 5 1532.0 2 0
1 51630 0.0 1569 36.656709 -3.309214 0 2 2 300 2000 1 1 2 1 1 1 17 1532.0 2 0
2 17168 0.0 1567 34.767863 -5.004344 0 13 2 500 2010 1 1 2 2 2 0 7 1535.0 2 0
3 45559 0.0 267 38.058046 -9.418672 0 80 43 250 1987 1 1 3 5 1 0 30 1545.0 0 0
4 49871 500.0 1260 35.006123 -10.950412 0 10 3 60 2000 1 1 1 1 1 1 17 1481.0 2 1

In [120]:
dftest3 = dftest2[['id','water_availability','terrain','num_private','region_code','district_code','population','water_quality','quality_group','quantity_group','source_type','source_class','waterpoint_type','wp_age','daysdiff']].copy()

In [121]:
dftest3.head()


Out[121]:
id water_availability terrain num_private region_code district_code population water_quality quality_group quantity_group source_type source_class waterpoint_type wp_age daysdiff
0 50785 0 2 0 21 3 321 1 1 4 2 2 0 5 1532.0
1 51630 0 2 0 2 2 300 1 1 2 1 1 1 17 1532.0
2 17168 0 2 0 13 2 500 1 1 2 2 2 0 7 1535.0
3 45559 0 0 0 80 43 250 1 1 3 5 1 0 30 1545.0
4 49871 1 2 0 10 3 60 1 1 1 1 1 1 17 1481.0

In [122]:
dftest3.to_csv('testsetfix.csv',sep=',',header=True,index=False)

In [173]:
dftest3 = pd.read_csv('testsetfix.csv',delimiter=',')

In [174]:
dftest3['payment'] = dftest['payment']

In [175]:
dftest3['payment'].unique()


Out[175]:
array(['never pay', 'unknown', 'pay monthly', 'pay per bucket',
       'pay annually', 'pay when scheme fails', 'other'], dtype=object)

In [176]:
dftest3['payment'].replace('pay annually',1,inplace=True)
dftest3['payment'].replace('never pay',2,inplace=True)
dftest3['payment'].replace('pay per bucket',3,inplace=True)
dftest3['payment'].replace('unknown',0,inplace=True)
dftest3['payment'].replace('pay when scheme fails',4,inplace=True)
dftest3['payment'].replace('other',5,inplace=True)
dftest3['payment'].replace('pay monthly',6,inplace=True)

In [180]:
dftest3 = dftest3[['id','water_availability','terrain','num_private','region_code','district_code','population','water_quality','quality_group','quantity_group','source_type','source_class','waterpoint_type','wp_age','daysdiff','payment']].copy()

In [181]:
dftest3.head()


Out[181]:
id water_availability terrain num_private region_code district_code population water_quality quality_group quantity_group source_type source_class waterpoint_type wp_age daysdiff payment
0 50785 0 2 0 21 3 321 1 1 4 2 2 0 5 1532.0 2
1 51630 0 2 0 2 2 300 1 1 2 1 1 1 17 1532.0 2
2 17168 0 2 0 13 2 500 1 1 2 2 2 0 7 1535.0 2
3 45559 0 0 0 80 43 250 1 1 3 5 1 0 30 1545.0 0
4 49871 1 2 0 10 3 60 1 1 1 1 1 1 17 1481.0 6

In [182]:
dftest3.to_csv('testsetfix.csv',sep=',',header=True,index=False)

In [ ]: