Preparation phase

Objective for training dataframe:

Cleaning the data (missing data imputation/replacement, change the string to numeric, and so on)
Split the data frame into subsets (if the size of the training data is very large) and perform feature extraction/feature engineering and then merge the result
Split training dataframe into two part; 80% of the data for the training itself, and other 20% is for validation (repeat the process using different subset and it become cross-validation)

Objective for test dataframe:

Cleaning the data (missing data imputation/replacement, change the string to numeric, and so on)
Split the data frame into subsets (if the size of the training data is very large) and perform feature extraction/feature engineering and then merge the result. Create the same features for the test set



In [50]:

    
import pandas as pd
import numpy as np



In [2]:

    
df = pd.read_csv('preparation.csv',delimiter=',')



In [3]:

    
df.shape









    Out[3]:





(59400, 19)



In [4]:

    
!ls









    



label.csv	      submission1.csv	 SubmissionFormat.csv	trainingset.csv
Preparatin_TzW.ipynb  submission2.csv	 Tanzanian Water.ipynb
preparation.csv       submission3rf.csv  testset.csv

take 80% as train dataframe, and leave 20% as a validator/to check the accuracy of our prediction before applying the model to the real test data.



In [5]:

    
59400*0.8









    Out[5]:





47520.0



In [6]:

    
dftrain = pd.read_csv('preparation.csv',delimiter=',',nrows=47520)



In [131]:

    
dfpayment = pd.read_csv('trainingset.csv',delimiter=',',nrows=47520)



In [132]:

    
dftrain.tail()









    Out[132]:






  
    
      
      id
      amount_tsh
      gps_height
      longitude
      latitude
      num_private
      region_code
      district_code
      population
      construction_year
      ...
      quantity_group
      source_type
      source_class
      waterpoint_type
      status_group
      wp_age
      daysdiff
      terrain
      water_availability
      payment
    
  
  
    
      47515
      34684
      0.0
      0
      33.457381
      -3.552051
      0
      17
      7
      0
      0
      ...
      1
      4
      1
      3
      1
      2017
      1655.0
      0
      0
      3
    
    
      47516
      8766
      0.0
      1321
      30.592280
      -3.733459
      0
      16
      1
      130
      2003
      ...
      1
      6
      2
      1
      3
      14
      1529.0
      2
      0
      2
    
    
      47517
      22662
      0.0
      1306
      35.262273
      -9.831219
      0
      10
      2
      0
      1994
      ...
      3
      6
      2
      2
      2
      23
      1512.0
      2
      0
      4
    
    
      47518
      39854
      50.0
      -13
      38.979195
      -6.519993
      0
      6
      1
      60
      1995
      ...
      1
      6
      2
      1
      1
      22
      2216.0
      0
      1
      3
    
    
      47519
      45078
      0.0
      0
      33.420152
      -2.795071
      0
      19
      4
      0
      0
      ...
      1
      5
      1
      3
      1
      2017
      2080.0
      0
      0
      2
    
  

5 rows × 22 columns



In [133]:

    
dftrain['terrain'] = 'dataran tinggi'



In [139]:

    
dftrain['terrain'][dftrain['gps_height']<=500] = 'dataran rendah'









    



/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':



In [140]:

    
dftrain['terrain'][(dftrain['gps_height']>500) & (dftrain['gps_height']<=1000)] = 'dataran sedang'









    



/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':



In [141]:

    
dftrain['terrain'].replace('dataran rendah',0,inplace=True)
dftrain['terrain'].replace('dataran sedang',1,inplace=True)
dftrain['terrain'].replace('dataran tinggi',2,inplace=True)



In [144]:

    
dftrain.head()









    Out[144]:






  
    
      
      id
      amount_tsh
      gps_height
      longitude
      latitude
      num_private
      region_code
      district_code
      population
      construction_year
      ...
      quantity_group
      source_type
      source_class
      waterpoint_type
      status_group
      wp_age
      daysdiff
      terrain
      water_availability
      payment
    
  
  
    
      0
      69572
      6000.0
      1390
      34.938093
      -9.856322
      0
      11
      5
      109
      1999
      ...
      1
      1
      1
      1
      1
      18
      2225.0
      2
      1
      1
    
    
      1
      8776
      0.0
      1399
      34.698766
      -2.147466
      0
      20
      2
      280
      2010
      ...
      2
      2
      2
      1
      1
      7
      1502.0
      2
      0
      2
    
    
      2
      34310
      25.0
      686
      37.460664
      -3.821329
      0
      21
      4
      250
      2009
      ...
      1
      3
      2
      2
      1
      8
      1511.0
      1
      1
      3
    
    
      3
      67743
      0.0
      263
      38.486161
      -11.155298
      0
      90
      63
      58
      1986
      ...
      3
      4
      1
      2
      2
      31
      1539.0
      0
      0
      2
    
    
      4
      19728
      0.0
      0
      31.130847
      -1.825359
      0
      18
      1
      0
      0
      ...
      4
      2
      2
      1
      1
      2017
      2104.0
      0
      0
      2
    
  

5 rows × 22 columns



In [145]:

    
print dftrain.columns









    



Index([u'id', u'amount_tsh', u'gps_height', u'longitude', u'latitude',
       u'num_private', u'region_code', u'district_code', u'population',
       u'construction_year', u'water_quality', u'quality_group',
       u'quantity_group', u'source_type', u'source_class', u'waterpoint_type',
       u'status_group', u'wp_age', u'daysdiff', u'terrain',
       u'water_availability', u'payment'],
      dtype='object')



In [146]:

    
dftrain['water_availability'] = dftrain.apply(lambda row: 0 if row['amount_tsh']==0.0 else 1,axis=1)



In [147]:

    
dftrain['payment'] = dfpayment['payment']



In [148]:

    
dftrain['payment'].unique()









    Out[148]:





array(['pay annually', 'never pay', 'pay per bucket', 'unknown',
       'pay when scheme fails', 'other', 'pay monthly'], dtype=object)



In [149]:

    
dftrain['payment'].replace('pay annually',1,inplace=True)
dftrain['payment'].replace('never pay',2,inplace=True)
dftrain['payment'].replace('pay per bucket',3,inplace=True)
dftrain['payment'].replace('unknown',0,inplace=True)
dftrain['payment'].replace('pay when scheme fails',4,inplace=True)
dftrain['payment'].replace('other',5,inplace=True)
dftrain['payment'].replace('pay monthly',6,inplace=True)

#prepare the training dataset for export



In [150]:

    
dftrain2 = dftrain[['id','water_availability','terrain','num_private','region_code','district_code','population','water_quality','quality_group','quantity_group','source_type','source_class','waterpoint_type','wp_age','daysdiff','payment','status_group']].copy()



In [151]:

    
dftrain2.tail()









    Out[151]:






  
    
      
      id
      water_availability
      terrain
      num_private
      region_code
      district_code
      population
      water_quality
      quality_group
      quantity_group
      source_type
      source_class
      waterpoint_type
      wp_age
      daysdiff
      payment
      status_group
    
  
  
    
      47515
      34684
      0
      0
      0
      17
      7
      0
      2
      2
      1
      4
      1
      3
      2017
      1655.0
      3
      1
    
    
      47516
      8766
      0
      2
      0
      16
      1
      130
      1
      1
      1
      6
      2
      1
      14
      1529.0
      2
      3
    
    
      47517
      22662
      0
      2
      0
      10
      2
      0
      1
      1
      3
      6
      2
      2
      23
      1512.0
      4
      2
    
    
      47518
      39854
      1
      0
      0
      6
      1
      60
      1
      1
      1
      6
      2
      1
      22
      2216.0
      3
      1
    
    
      47519
      45078
      0
      0
      0
      19
      4
      0
      1
      1
      1
      5
      1
      3
      2017
      2080.0
      2
      1



In [152]:

    
dftrain2.to_csv('dftrain.csv',sep=',',header=True,index=False)

#prepare the validation dataset

The validation dataset must have same table structure with the training dataset so that we can calculate the accuracy.



In [153]:

    
dfvalidation = pd.read_csv('preparation.csv',delimiter=',',skiprows=range(1,47520),nrows=11880)



In [154]:

    
dfv_payment = pd.read_csv('trainingset.csv',delimiter=',',skiprows=range(1,47520),nrows=11880)



In [156]:

    
print dfvalidation.shape
dfvalidation.tail()









    



(11880, 19)






    Out[156]:






  
    
      
      id
      amount_tsh
      gps_height
      longitude
      latitude
      num_private
      region_code
      district_code
      population
      construction_year
      water_quality
      quality_group
      quantity_group
      source_type
      source_class
      waterpoint_type
      status_group
      wp_age
      daysdiff
    
  
  
    
      11875
      11164
      500.0
      351
      37.634053
      -6.124830
      0
      5
      6
      89
      2007
      1
      1
      1
      4
      1
      1
      2
      10
      2230.0
    
    
      11876
      60739
      10.0
      1210
      37.169807
      -3.253847
      0
      3
      5
      125
      1999
      1
      1
      1
      1
      1
      1
      1
      18
      1444.0
    
    
      11877
      27263
      4700.0
      1212
      35.249991
      -9.070629
      0
      11
      4
      56
      1996
      1
      1
      1
      6
      2
      1
      1
      21
      2171.0
    
    
      11878
      37057
      0.0
      0
      34.017087
      -8.750434
      0
      12
      7
      0
      0
      4
      4
      1
      4
      1
      3
      1
      2017
      2197.0
    
    
      11879
      31282
      0.0
      0
      35.861315
      -6.378573
      0
      1
      4
      0
      0
      1
      1
      2
      5
      1
      3
      1
      2017
      2231.0



In [157]:

    
dfvalidation['terrain'] = 'dataran tinggi'



In [158]:

    
dfvalidation['terrain'][dftrain['gps_height']<=500] = 'dataran rendah'
dfvalidation['terrain'][(dfvalidation['gps_height']>500) & (dfvalidation['gps_height']<=1000)] = 'dataran sedang'









    



/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app



In [159]:

    
dfvalidation['terrain'].replace('dataran rendah',0,inplace=True)
dfvalidation['terrain'].replace('dataran sedang',1,inplace=True)
dfvalidation['terrain'].replace('dataran tinggi',2,inplace=True)



In [160]:

    
dfvalidation.head()









    Out[160]:






  
    
      
      id
      amount_tsh
      gps_height
      longitude
      latitude
      num_private
      region_code
      district_code
      population
      construction_year
      water_quality
      quality_group
      quantity_group
      source_type
      source_class
      waterpoint_type
      status_group
      wp_age
      daysdiff
      terrain
    
  
  
    
      0
      45078
      0.0
      0
      33.420152
      -2.795071
      0
      19
      4
      0
      0
      1
      1
      1
      5
      1
      3
      1
      2017
      2080.0
      2
    
    
      1
      31768
      50.0
      99
      38.991021
      -9.255090
      0
      80
      13
      860
      1985
      1
      1
      1
      5
      1
      3
      1
      32
      1546.0
      2
    
    
      2
      36242
      0.0
      1518
      37.314627
      -3.218825
      0
      3
      4
      30
      1976
      1
      1
      1
      1
      1
      0
      3
      41
      1504.0
      2
    
    
      3
      18510
      1000.0
      0
      38.803419
      -4.865751
      0
      4
      8
      40
      1990
      1
      1
      2
      1
      1
      1
      1
      27
      2223.0
      0
    
    
      4
      45518
      0.0
      0
      32.561960
      -9.174393
      0
      12
      6
      0
      0
      1
      1
      1
      6
      2
      1
      1
      2017
      2108.0
      0



In [161]:

    
dfvalidation['water_availability'] = dfvalidation.apply(lambda row: 0 if row['amount_tsh']==0.0 else 1,axis=1)



In [162]:

    
dfvalidation['payment'] = dfv_payment['payment']



In [163]:

    
dfvalidation['payment'].replace('pay annually',1,inplace=True)
dfvalidation['payment'].replace('never pay',2,inplace=True)
dfvalidation['payment'].replace('pay per bucket',3,inplace=True)
dfvalidation['payment'].replace('unknown',0,inplace=True)
dfvalidation['payment'].replace('pay when scheme fails',4,inplace=True)
dfvalidation['payment'].replace('other',5,inplace=True)
dfvalidation['payment'].replace('pay monthly',6,inplace=True)



In [177]:

    
dfvalidation2 = dfvalidation[['id','water_availability','terrain','num_private','region_code','district_code','population','water_quality','quality_group','quantity_group','source_type','source_class','waterpoint_type','wp_age','daysdiff','payment','status_group']].copy()



In [178]:

    
dfvalidation2.head()









    Out[178]:






  
    
      
      id
      water_availability
      terrain
      num_private
      region_code
      district_code
      population
      water_quality
      quality_group
      quantity_group
      source_type
      source_class
      waterpoint_type
      wp_age
      daysdiff
      payment
      status_group
    
  
  
    
      0
      45078
      0
      2
      0
      19
      4
      0
      1
      1
      1
      5
      1
      3
      2017
      2080.0
      2
      1
    
    
      1
      31768
      1
      2
      0
      80
      13
      860
      1
      1
      1
      5
      1
      3
      32
      1546.0
      3
      1
    
    
      2
      36242
      0
      2
      0
      3
      4
      30
      1
      1
      1
      1
      1
      0
      41
      1504.0
      2
      3
    
    
      3
      18510
      1
      0
      0
      4
      8
      40
      1
      1
      2
      1
      1
      1
      27
      2223.0
      6
      1
    
    
      4
      45518
      0
      0
      0
      12
      6
      0
      1
      1
      1
      6
      2
      1
      2017
      2108.0
      2
      1



In [179]:

    
dfvalidation2.to_csv('validation.csv',sep=',',header=True,index=False)



In [167]:

    
!ls -l









    



total 35184
-rw-r--r-- 1 febrifahmi febrifahmi  2189312 Apr 16 08:57 dftrain.csv
-rw-r--r-- 1 febrifahmi febrifahmi  1148327 Apr 14 21:12 label.csv
-rw-r--r-- 1 febrifahmi febrifahmi    94598 Apr 16 07:34 Modelling_TzW.ipynb
-rw-r--r-- 1 febrifahmi febrifahmi  4495565 Apr 16 04:49 preparation.csv
-rw-r--r-- 1 febrifahmi febrifahmi    86334 Apr 16 09:00 Preparation_TzW.ipynb
-rw-r--r-- 1 febrifahmi febrifahmi   443312 Apr 15 07:43 submission1.csv
-rw-r--r-- 1 febrifahmi febrifahmi   278500 Apr 15 07:55 submission2.csv
-rw-r--r-- 1 febrifahmi febrifahmi   278077 Apr 15 08:35 submission3rf.csv
-rw-r--r-- 1 febrifahmi febrifahmi   283084 Apr 16 07:34 submission4.csv
-rw-r--r-- 1 febrifahmi febrifahmi   324512 Apr 14 21:14 SubmissionFormat.csv
-rw-r--r-- 1 febrifahmi febrifahmi   135699 Apr 16 04:51 Tanzanian Water.ipynb
-rw-r--r-- 1 febrifahmi febrifahmi  5016337 Apr 14 21:15 testset.csv
-rw-r--r-- 1 febrifahmi febrifahmi   624805 Apr 16 07:25 testsetfix.csv
-rw-r--r-- 1 febrifahmi febrifahmi 20069199 Apr 14 20:59 trainingset.csv
-rw-r--r-- 1 febrifahmi febrifahmi   523733 Apr 16 09:00 validation.csv



In [168]:

    
print "done"









    



done

#preparation of test dataset



In [169]:

    
dftest = pd.read_csv('testset.csv',delimiter=',')



In [170]:

    
dftest.head()









    Out[170]:






  
    
      
      id
      amount_tsh
      date_recorded
      funder
      gps_height
      installer
      longitude
      latitude
      wpt_name
      num_private
      ...
      payment_type
      water_quality
      quality_group
      quantity
      quantity_group
      source
      source_type
      source_class
      waterpoint_type
      waterpoint_type_group
    
  
  
    
      0
      50785
      0.0
      2013-02-04
      Dmdd
      1996
      DMDD
      35.290799
      -4.059696
      Dinamu Secondary School
      0
      ...
      never pay
      soft
      good
      seasonal
      seasonal
      rainwater harvesting
      rainwater harvesting
      surface
      other
      other
    
    
      1
      51630
      0.0
      2013-02-04
      Government Of Tanzania
      1569
      DWE
      36.656709
      -3.309214
      Kimnyak
      0
      ...
      never pay
      soft
      good
      insufficient
      insufficient
      spring
      spring
      groundwater
      communal standpipe
      communal standpipe
    
    
      2
      17168
      0.0
      2013-02-01
      NaN
      1567
      NaN
      34.767863
      -5.004344
      Puma Secondary
      0
      ...
      never pay
      soft
      good
      insufficient
      insufficient
      rainwater harvesting
      rainwater harvesting
      surface
      other
      other
    
    
      3
      45559
      0.0
      2013-01-22
      Finn Water
      267
      FINN WATER
      38.058046
      -9.418672
      Kwa Mzee Pange
      0
      ...
      unknown
      soft
      good
      dry
      dry
      shallow well
      shallow well
      groundwater
      other
      other
    
    
      4
      49871
      500.0
      2013-03-27
      Bruder
      1260
      BRUDER
      35.006123
      -10.950412
      Kwa Mzee Turuka
      0
      ...
      monthly
      soft
      good
      enough
      enough
      spring
      spring
      groundwater
      communal standpipe
      communal standpipe
    
  

5 rows × 40 columns



In [171]:

    
dftest.isnull().sum().sum()









    Out[171]:





11464



In [172]:

    
print list(dftest.columns)









    



['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height', 'installer', 'longitude', 'latitude', 'wpt_name', 'num_private', 'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga', 'ward', 'population', 'public_meeting', 'recorded_by', 'scheme_management', 'scheme_name', 'permit', 'construction_year', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'quantity_group', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group']



In [96]:

    
dftest['source_class'].isnull().sum().sum()









    Out[96]:





0



In [80]:

    
dftest['public_meeting'].fillna('unknown',inplace=True)



In [83]:

    
dftest['payment'].unique()









    Out[83]:





array(['never pay', 'unknown', 'pay monthly', 'pay per bucket',
       'pay annually', 'pay when scheme fails', 'other'], dtype=object)



In [84]:

    
dftest['wp_age'] = dftest.apply(lambda row: (2017-row['construction_year']),axis=1)



In [85]:

    
dftest['water_quality'].replace('soft',1,inplace=True)
dftest['water_quality'].replace('salty',2,inplace=True)
dftest['water_quality'].replace('milky',3,inplace=True)
dftest['water_quality'].replace('fluoride',4,inplace=True)
dftest['water_quality'].replace('coloured',5,inplace=True)
dftest['water_quality'].replace('salty abandoned',6,inplace=True)
dftest['water_quality'].replace('fluoride abandoned',7,inplace=True)
dftest['water_quality'].replace('unknown',0,inplace=True)



In [89]:

    
dftest['quantity_group'].replace('enough',1,inplace=True)
dftest['quantity_group'].replace('insufficient',2,inplace=True)
dftest['quantity_group'].replace('dry',3,inplace=True)
dftest['quantity_group'].replace('seasonal',4,inplace=True)
dftest['quantity_group'].replace('unknown',0,inplace=True)



In [91]:

    
dftest['waterpoint_type'].replace('communal standpipe',1,inplace=True)
dftest['waterpoint_type'].replace('communal standpipe multiple',2,inplace=True)
dftest['waterpoint_type'].replace('hand pump',3,inplace=True)
dftest['waterpoint_type'].replace('other',0,inplace=True)
dftest['waterpoint_type'].replace('improved spring',4,inplace=True)
dftest['waterpoint_type'].replace('cattle trough',5,inplace=True)
dftest['waterpoint_type'].replace('dam',6,inplace=True)



In [93]:

    
dftest['source_type'].replace('spring',1,inplace=True)
dftest['source_type'].replace('rainwater harvesting',2,inplace=True)
dftest['source_type'].replace('dam',3,inplace=True)
dftest['source_type'].replace('borehole',4,inplace=True)
dftest['source_type'].replace('other',0,inplace=True)
dftest['source_type'].replace('shallow well',5,inplace=True)
dftest['source_type'].replace('river/lake',6,inplace=True)



In [95]:

    
dftest['quality_group'].replace('good',1,inplace=True)
dftest['quality_group'].replace('salty',2,inplace=True)
dftest['quality_group'].replace('milky',3,inplace=True)
dftest['quality_group'].replace('unknown',0,inplace=True)
dftest['quality_group'].replace('fluoride',4,inplace=True)
dftest['quality_group'].replace('colored',5,inplace=True)



In [97]:

    
dftest['source_class'].replace('groundwater',1,inplace=True)
dftest['source_class'].replace('surface',2,inplace=True)
dftest['source_class'].replace('unknown',0,inplace=True)



In [98]:

    
import datetime as dt



In [99]:

    
dftest['today'] = dftest.apply(lambda row: dt.datetime.today().strftime('%Y-%m-%d'),axis=1)



In [100]:

    
dftest['date_recorded'] = [time.date() for time in dftest['date_recorded'].astype('datetime64[ns]')]
dftest['today'] = [time.date() for time in dftest['today'].astype('datetime64[ns]')]



In [101]:

    
dftest['lastcheckdays'] = abs(dftest['date_recorded'].sub(dftest['today'], axis=0))



In [102]:

    
dftest['daysdiff'] = dftest['lastcheckdays'] / np.timedelta64(1, 'D')



In [114]:

    
dftest['terrain'] = 'dataran tinggi'



In [115]:

    
dftest['terrain'][dftest['gps_height']<=500] = 'dataran rendah'
dftest['terrain'][(dftest['gps_height']>500) & (dftest['gps_height']<=1000)] = 'dataran sedang'









    



/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app



In [116]:

    
dftest['terrain'].replace('dataran rendah',0,inplace=True)
dftest['terrain'].replace('dataran sedang',1,inplace=True)
dftest['terrain'].replace('dataran tinggi',2,inplace=True)



In [117]:

    
dftest['water_availability'] = dftest.apply(lambda row: 0 if row['amount_tsh']==0.0 else 1,axis=1)



In [118]:

    
dftest2 = dftest._get_numeric_data()



In [119]:

    
dftest2.head()









    Out[119]:






  
    
      
      id
      amount_tsh
      gps_height
      longitude
      latitude
      num_private
      region_code
      district_code
      population
      construction_year
      water_quality
      quality_group
      quantity_group
      source_type
      source_class
      waterpoint_type
      wp_age
      daysdiff
      terrain
      water_availability
    
  
  
    
      0
      50785
      0.0
      1996
      35.290799
      -4.059696
      0
      21
      3
      321
      2012
      1
      1
      4
      2
      2
      0
      5
      1532.0
      2
      0
    
    
      1
      51630
      0.0
      1569
      36.656709
      -3.309214
      0
      2
      2
      300
      2000
      1
      1
      2
      1
      1
      1
      17
      1532.0
      2
      0
    
    
      2
      17168
      0.0
      1567
      34.767863
      -5.004344
      0
      13
      2
      500
      2010
      1
      1
      2
      2
      2
      0
      7
      1535.0
      2
      0
    
    
      3
      45559
      0.0
      267
      38.058046
      -9.418672
      0
      80
      43
      250
      1987
      1
      1
      3
      5
      1
      0
      30
      1545.0
      0
      0
    
    
      4
      49871
      500.0
      1260
      35.006123
      -10.950412
      0
      10
      3
      60
      2000
      1
      1
      1
      1
      1
      1
      17
      1481.0
      2
      1



In [120]:

    
dftest3 = dftest2[['id','water_availability','terrain','num_private','region_code','district_code','population','water_quality','quality_group','quantity_group','source_type','source_class','waterpoint_type','wp_age','daysdiff']].copy()



In [121]:

    
dftest3.head()









    Out[121]:






  
    
      
      id
      water_availability
      terrain
      num_private
      region_code
      district_code
      population
      water_quality
      quality_group
      quantity_group
      source_type
      source_class
      waterpoint_type
      wp_age
      daysdiff
    
  
  
    
      0
      50785
      0
      2
      0
      21
      3
      321
      1
      1
      4
      2
      2
      0
      5
      1532.0
    
    
      1
      51630
      0
      2
      0
      2
      2
      300
      1
      1
      2
      1
      1
      1
      17
      1532.0
    
    
      2
      17168
      0
      2
      0
      13
      2
      500
      1
      1
      2
      2
      2
      0
      7
      1535.0
    
    
      3
      45559
      0
      0
      0
      80
      43
      250
      1
      1
      3
      5
      1
      0
      30
      1545.0
    
    
      4
      49871
      1
      2
      0
      10
      3
      60
      1
      1
      1
      1
      1
      1
      17
      1481.0



In [122]:

    
dftest3.to_csv('testsetfix.csv',sep=',',header=True,index=False)



In [173]:

    
dftest3 = pd.read_csv('testsetfix.csv',delimiter=',')



In [174]:

    
dftest3['payment'] = dftest['payment']



In [175]:

    
dftest3['payment'].unique()









    Out[175]:





array(['never pay', 'unknown', 'pay monthly', 'pay per bucket',
       'pay annually', 'pay when scheme fails', 'other'], dtype=object)



In [176]:

    
dftest3['payment'].replace('pay annually',1,inplace=True)
dftest3['payment'].replace('never pay',2,inplace=True)
dftest3['payment'].replace('pay per bucket',3,inplace=True)
dftest3['payment'].replace('unknown',0,inplace=True)
dftest3['payment'].replace('pay when scheme fails',4,inplace=True)
dftest3['payment'].replace('other',5,inplace=True)
dftest3['payment'].replace('pay monthly',6,inplace=True)



In [180]:

    
dftest3 = dftest3[['id','water_availability','terrain','num_private','region_code','district_code','population','water_quality','quality_group','quantity_group','source_type','source_class','waterpoint_type','wp_age','daysdiff','payment']].copy()



In [181]:

    
dftest3.head()









    Out[181]:






  
    
      
      id
      water_availability
      terrain
      num_private
      region_code
      district_code
      population
      water_quality
      quality_group
      quantity_group
      source_type
      source_class
      waterpoint_type
      wp_age
      daysdiff
      payment
    
  
  
    
      0
      50785
      0
      2
      0
      21
      3
      321
      1
      1
      4
      2
      2
      0
      5
      1532.0
      2
    
    
      1
      51630
      0
      2
      0
      2
      2
      300
      1
      1
      2
      1
      1
      1
      17
      1532.0
      2
    
    
      2
      17168
      0
      2
      0
      13
      2
      500
      1
      1
      2
      2
      2
      0
      7
      1535.0
      2
    
    
      3
      45559
      0
      0
      0
      80
      43
      250
      1
      1
      3
      5
      1
      0
      30
      1545.0
      0
    
    
      4
      49871
      1
      2
      0
      10
      3
      60
      1
      1
      1
      1
      1
      1
      17
      1481.0
      6



In [182]:

    
dftest3.to_csv('testsetfix.csv',sep=',',header=True,index=False)



In [ ]:

	id	amount_tsh	gps_height	longitude	latitude	region_code	district_code	population	construction_year	...	quantity_group	source_type	source_class	waterpoint_type	status_group	wp_age	daysdiff	terrain	water_availability	payment
47515	34684	0.0	0	33.457381	-3.552051	17	7	0	0	...	1	4	1	3	1	2017	1655.0	0	0	3
47516	8766	0.0	1321	30.592280	-3.733459	16	1	130	2003	...	1	6	2	1	3	14	1529.0	2	0	2
47517	22662	0.0	1306	35.262273	-9.831219	10	2	0	1994	...	3	6	2	2	2	23	1512.0	2	0	4
47518	39854	50.0	-13	38.979195	-6.519993	6	1	60	1995	...	1	6	2	1	1	22	2216.0	0	1	3
47519	45078	0.0	0	33.420152	-2.795071	19	4	0	0	...	1	5	1	3	1	2017	2080.0	0	0	2

	id	amount_tsh	gps_height	longitude	latitude	region_code	district_code	population	construction_year	...	quantity_group	source_type	source_class	waterpoint_type	status_group	wp_age	daysdiff	terrain	water_availability	payment
0	69572	6000.0	1390	34.938093	-9.856322	11	5	109	1999	...	1	1	1	1	1	18	2225.0	2	1	1
1	8776	0.0	1399	34.698766	-2.147466	20	2	280	2010	...	2	2	2	1	1	7	1502.0	2	0	2
2	34310	25.0	686	37.460664	-3.821329	21	4	250	2009	...	1	3	2	2	1	8	1511.0	1	1	3
3	67743	0.0	263	38.486161	-11.155298	90	63	58	1986	...	3	4	1	2	2	31	1539.0	0	0	2
4	19728	0.0	0	31.130847	-1.825359	18	1	0	0	...	4	2	2	1	1	2017	2104.0	0	0	2

	id	amount_tsh	gps_height	longitude	latitude	region_code	district_code	population	construction_year	water_quality	quality_group	quantity_group	source_type	source_class	waterpoint_type	status_group	wp_age	daysdiff
11875	11164	500.0	351	37.634053	-6.124830	5	6	89	2007	1	1	1	4	1	1	2	10	2230.0
11876	60739	10.0	1210	37.169807	-3.253847	3	5	125	1999	1	1	1	1	1	1	1	18	1444.0
11877	27263	4700.0	1212	35.249991	-9.070629	11	4	56	1996	1	1	1	6	2	1	1	21	2171.0
11878	37057	0.0	0	34.017087	-8.750434	12	7	0	0	4	4	1	4	1	3	1	2017	2197.0
11879	31282	0.0	0	35.861315	-6.378573	1	4	0	0	1	1	2	5	1	3	1	2017	2231.0

	id	amount_tsh	date_recorded	funder	gps_height	installer	longitude	latitude	wpt_name	...	payment_type	water_quality	quality_group	quantity	quantity_group	source	source_type	source_class	waterpoint_type	waterpoint_type_group
0	50785	0.0	2013-02-04	Dmdd	1996	DMDD	35.290799	-4.059696	Dinamu Secondary School	...	never pay	soft	good	seasonal	seasonal	rainwater harvesting	rainwater harvesting	surface	other	other
1	51630	0.0	2013-02-04	Government Of Tanzania	1569	DWE	36.656709	-3.309214	Kimnyak	...	never pay	soft	good	insufficient	insufficient	spring	spring	groundwater	communal standpipe	communal standpipe
2	17168	0.0	2013-02-01	NaN	1567	NaN	34.767863	-5.004344	Puma Secondary	...	never pay	soft	good	insufficient	insufficient	rainwater harvesting	rainwater harvesting	surface	other	other
3	45559	0.0	2013-01-22	Finn Water	267	FINN WATER	38.058046	-9.418672	Kwa Mzee Pange	...	unknown	soft	good	dry	dry	shallow well	shallow well	groundwater	other	other
4	49871	500.0	2013-03-27	Bruder	1260	BRUDER	35.006123	-10.950412	Kwa Mzee Turuka	...	monthly	soft	good	enough	enough	spring	spring	groundwater	communal standpipe	communal standpipe

	id	water_availability	terrain	region_code	district_code	population	water_quality	quality_group	quantity_group	source_type	source_class	waterpoint_type	wp_age	daysdiff
0	50785	0	2	21	3	321	1	1	4	2	2	0	5	1532.0
1	51630	0	2	2	2	300	1	1	2	1	1	1	17	1532.0
2	17168	0	2	13	2	500	1	1	2	2	2	0	7	1535.0
3	45559	0	0	80	43	250	1	1	3	5	1	0	30	1545.0
4	49871	1	2	10	3	60	1	1	1	1	1	1	17	1481.0