2. Refine the Data

"Data is messy"

We will be performing the following operation on our Onion price to refine it

  • Remove e.g. remove redundant data from the data frame
  • Derive e.g. State and City from the market field
  • Parse e.g. extract date from year and month column

Other stuff you may need to do to refine are...

  • Missing e.g. Check for missing or incomplete data
  • Quality e.g. Check for duplicates, accuracy, unusual data
  • Convert e.g. free text to coded value
  • Calculate e.g. percentages, proportion
  • Merge e.g. first and surname for full name
  • Aggregate e.g. rollup by year, cluster by area
  • Filter e.g. exclude based on location
  • Sample e.g. extract a representative data
  • Summary e.g. show summary stats like mean

In [1]:
# Import the two library we need, which is Pandas and Numpy
import pandas as pd
import numpy as np

In [2]:
# Read the csv file of Month Wise Market Arrival data that has been scraped.
df = pd.read_csv('MonthWiseMarketArrivals.csv')

In [3]:
df.head()


Out[3]:
market month year quantity priceMin priceMax priceMod
0 ABOHAR(PB) January 2005 2350 404 493 446
1 ABOHAR(PB) January 2006 900 487 638 563
2 ABOHAR(PB) January 2010 790 1283 1592 1460
3 ABOHAR(PB) January 2011 245 3067 3750 3433
4 ABOHAR(PB) January 2012 1035 523 686 605

In [4]:
df.tail()


Out[4]:
market month year quantity priceMin priceMax priceMod
10223 YEOLA(MS) December 2012 207066 485 1327 1136
10224 YEOLA(MS) December 2013 215883 472 1427 1177
10225 YEOLA(MS) December 2014 201077 446 1654 1456
10226 YEOLA(MS) December 2015 223315 609 1446 1126
10227 NaN NaN Total 783438108 647(Avg) 1213(Avg) 984(Avg)

Remove the redundant data


In [5]:
df.dtypes


Out[5]:
market      object
month       object
year        object
quantity     int64
priceMin    object
priceMax    object
priceMod    object
dtype: object

In [6]:
# Delete the last row from the dataframe
df.tail(1)


Out[6]:
market month year quantity priceMin priceMax priceMod
10227 NaN NaN Total 783438108 647(Avg) 1213(Avg) 984(Avg)

In [7]:
# Delete a row from the dataframe
df.drop(df.tail(1).index, inplace = True)

In [8]:
df.head()


Out[8]:
market month year quantity priceMin priceMax priceMod
0 ABOHAR(PB) January 2005 2350 404 493 446
1 ABOHAR(PB) January 2006 900 487 638 563
2 ABOHAR(PB) January 2010 790 1283 1592 1460
3 ABOHAR(PB) January 2011 245 3067 3750 3433
4 ABOHAR(PB) January 2012 1035 523 686 605

In [56]:
df.tail()


Out[56]:
market month year quantity priceMin priceMax priceMod
10222 YEOLA(MS) December 2011 131326 282 612 526
10223 YEOLA(MS) December 2012 207066 485 1327 1136
10224 YEOLA(MS) December 2013 215883 472 1427 1177
10225 YEOLA(MS) December 2014 201077 446 1654 1456
10226 YEOLA(MS) December 2015 223315 609 1446 1126

In [57]:
df.dtypes


Out[57]:
market      object
month       object
year        object
quantity     int64
priceMin    object
priceMax    object
priceMod    object
dtype: object

In [58]:
df.iloc[:,4:7].head()


Out[58]:
priceMin priceMax priceMod
0 404 493 446
1 487 638 563
2 1283 1592 1460
3 3067 3750 3433
4 523 686 605

In [59]:
df.iloc[:,2:7] = df.iloc[:,2:7].astype(int)

In [60]:
df.dtypes


Out[60]:
market      object
month       object
year         int64
quantity     int64
priceMin     int64
priceMax     int64
priceMod     int64
dtype: object

In [61]:
df.head()


Out[61]:
market month year quantity priceMin priceMax priceMod
0 ABOHAR(PB) Jan 2005 2350 404 493 446
1 ABOHAR(PB) Jan 2006 900 487 638 563
2 ABOHAR(PB) Jan 2010 790 1283 1592 1460
3 ABOHAR(PB) Jan 2011 245 3067 3750 3433
4 ABOHAR(PB) Jan 2012 1035 523 686 605

In [62]:
df.describe()


Out[62]:
year quantity priceMin priceMax priceMod
count 10227.000000 10227.000000 10227.000000 10227.000000 10227.000000
mean 2009.022294 76604.880023 646.944363 1212.760731 984.284345
std 4.372841 124408.698759 673.121850 979.658874 818.471498
min 1996.000000 20.000000 16.000000 145.000000 80.000000
25% 2006.000000 8898.000000 209.000000 557.000000 448.000000
50% 2009.000000 27460.000000 440.000000 923.000000 747.000000
75% 2013.000000 88356.500000 828.000000 1527.000000 1248.000000
max 2016.000000 1639032.000000 6000.000000 8192.000000 6400.000000

Extracting the states from market names


In [63]:
df.market.value_counts().head()


Out[63]:
LASALGAON(MS)     242
PIMPALGAON(MS)    224
MANMAD(MS)        218
LONAND(MS)        211
MAHUVA(GUJ)       210
Name: market, dtype: int64

In [64]:
df['state'] = df.market.str.split('(').str[-1]

In [65]:
df.head()


Out[65]:
market month year quantity priceMin priceMax priceMod state
0 ABOHAR(PB) Jan 2005 2350 404 493 446 PB)
1 ABOHAR(PB) Jan 2006 900 487 638 563 PB)
2 ABOHAR(PB) Jan 2010 790 1283 1592 1460 PB)
3 ABOHAR(PB) Jan 2011 245 3067 3750 3433 PB)
4 ABOHAR(PB) Jan 2012 1035 523 686 605 PB)

In [66]:
df['city'] = df.market.str.split('(').str[0]

In [67]:
df.head()


Out[67]:
market month year quantity priceMin priceMax priceMod state city
0 ABOHAR(PB) Jan 2005 2350 404 493 446 PB) ABOHAR
1 ABOHAR(PB) Jan 2006 900 487 638 563 PB) ABOHAR
2 ABOHAR(PB) Jan 2010 790 1283 1592 1460 PB) ABOHAR
3 ABOHAR(PB) Jan 2011 245 3067 3750 3433 PB) ABOHAR
4 ABOHAR(PB) Jan 2012 1035 523 686 605 PB) ABOHAR

In [68]:
df.state.unique()


Out[68]:
array(['PB)', 'UP)', 'GUJ)', 'MS)', 'RAJ)', 'BANGALORE', 'KNT)', 'BHOPAL',
       'OR)', 'BHR)', 'WB)', 'CHANDIGARH', 'CHENNAI', 'bellary)',
       'podisu)', 'UTT)', 'DELHI', 'MP)', 'TN)', 'Podis', 'GUWAHATI',
       'HYDERABAD', 'JAIPUR', 'WHITE)', 'JAMMU', 'HR)', 'KOLKATA', 'AP)',
       'LUCKNOW', 'MUMBAI', 'NAGPUR', 'KER)', 'PATNA', 'CHGARH)', 'JH)',
       'SHIMLA', 'SRINAGAR', 'TRIVENDRUM'], dtype=object)

In [69]:
df['state'] = df.state.str.split(')').str[0]

In [70]:
df.state.unique()


Out[70]:
array(['PB', 'UP', 'GUJ', 'MS', 'RAJ', 'BANGALORE', 'KNT', 'BHOPAL', 'OR',
       'BHR', 'WB', 'CHANDIGARH', 'CHENNAI', 'bellary', 'podisu', 'UTT',
       'DELHI', 'MP', 'TN', 'Podis', 'GUWAHATI', 'HYDERABAD', 'JAIPUR',
       'WHITE', 'JAMMU', 'HR', 'KOLKATA', 'AP', 'LUCKNOW', 'MUMBAI',
       'NAGPUR', 'KER', 'PATNA', 'CHGARH', 'JH', 'SHIMLA', 'SRINAGAR',
       'TRIVENDRUM'], dtype=object)

In [71]:
dfState = df.groupby(['state', 'market'], as_index=False).count()

In [72]:
dfState.market.unique()


Out[72]:
array(['KURNOOL(AP)', 'RAJAHMUNDRY(AP)', 'BANGALORE', 'BHOPAL',
       'BIHARSHARIF(BHR)', 'CHANDIGARH', 'CHENNAI', 'RAIPUR(CHGARH)',
       'DELHI', 'AHMEDABAD(GUJ)', 'BHAVNAGAR(GUJ)', 'DEESA(GUJ)',
       'GONDAL(GUJ)', 'JAMNAGAR(GUJ)', 'MAHUVA(GUJ)', 'RAJKOT(GUJ)',
       'SURAT(GUJ)', 'GUWAHATI', 'KARNAL(HR)', 'HYDERABAD', 'JAIPUR',
       'JAMMU', 'RANCHI(JH)', 'PALAYAM(KER)', 'BELGAUM(KNT)',
       'BIJAPUR(KNT)', 'CHALLAKERE(KNT)', 'CHICKBALLAPUR(KNT)',
       'DHAVANGERE(KNT)', 'HASSAN(KNT)', 'HUBLI(KNT)', 'KOLAR(KNT)',
       'RAICHUR(KNT)', 'KOLKATA', 'LUCKNOW', 'DEWAS(MP)', 'INDORE(MP)',
       'MANDSOUR(MP)', 'NEEMUCH(MP)', 'SAGAR(MP)', 'UJJAIN(MP)',
       'AHMEDNAGAR(MS)', 'BOMBORI(MS)', 'CHAKAN(MS)', 'CHANDVAD(MS)',
       'DEVALA(MS)', 'DHULIA(MS)', 'DINDORI(MS)', 'JALGAON(MS)',
       'JUNNAR(MS)', 'KALVAN(MS)', 'KOLHAPUR(MS)', 'KOPERGAON(MS)',
       'LASALGAON(MS)', 'LONAND(MS)', 'MALEGAON(MS)', 'MANMAD(MS)',
       'NANDGAON(MS)', 'NASIK(MS)', 'NEWASA(MS)', 'NIPHAD(MS)',
       'PHALTAN (MS)', 'PIMPALGAON(MS)', 'PUNE(MS)', 'RAHATA(MS)',
       'RAHURI(MS)', 'SAIKHEDA(MS)', 'SANGALI(MS)', 'SANGAMNER(MS)',
       'SATANA(MS)', 'SHRIRAMPUR(MS)', 'SINNAR(MS)', 'SOLAPUR(MS)',
       'SRIRAMPUR(MS)', 'VANI(MS)', 'YEOLA(MS)', 'MUMBAI', 'NAGPUR',
       'BHUBNESWER(OR)', 'PATNA', 'ABOHAR(PB)', 'AMRITSAR(PB)',
       'BHATINDA(PB)', 'HOSHIARPUR(PB)', 'JALANDHAR(PB)', 'KHANNA(PB)',
       'LUDHIANA(PB)', 'PATIALA(PB)', 'DINDIGUL(TN)(Podis', 'AJMER(RAJ)',
       'ALWAR(RAJ)', 'BIKANER(RAJ)', 'JODHPUR(RAJ)', 'KOTA(RAJ)',
       'SRIGANGANAGAR(RAJ)', 'UDAIPUR(RAJ)', 'SHIMLA', 'SRINAGAR',
       'DINDIGUL(TN)', 'MADURAI(TN)', 'TRIVENDRUM', 'AGRA(UP)',
       'ALIGARH(UP)', 'BALLIA(UP)', 'BAREILLY(UP)', 'DEORIA(UP)',
       'ETAWAH(UP)', 'GORAKHPUR(UP)', 'KANPUR(UP)', 'MEERUT(UP)',
       'VARANASI(UP)', 'DEHRADOON(UTT)', 'HALDWANI(UTT)', 'BURDWAN(WB)',
       'MIDNAPUR(WB)', 'PURULIA(WB)', 'SHEROAPHULY(WB)', 'JALGAON(WHITE)',
       'COIMBATORE(TN) (bellary)', 'COIMBATORE(TN) (podisu)'], dtype=object)

In [73]:
state_now = ['PB', 'UP', 'GUJ', 'MS', 'RAJ', 'BANGALORE', 'KNT', 'BHOPAL', 'OR',
       'BHR', 'WB', 'CHANDIGARH', 'CHENNAI', 'bellary', 'podisu', 'UTT',
       'DELHI', 'MP', 'TN', 'Podis', 'GUWAHATI', 'HYDERABAD', 'JAIPUR',
       'WHITE', 'JAMMU', 'HR', 'KOLKATA', 'AP', 'LUCKNOW', 'MUMBAI',
       'NAGPUR', 'KER', 'PATNA', 'CHGARH', 'JH', 'SHIMLA', 'SRINAGAR',
       'TRIVENDRUM']

In [74]:
state_new =['PB', 'UP', 'GUJ', 'MS', 'RAJ', 'KNT', 'KNT', 'MP', 'OR',
       'BHR', 'WB', 'CH', 'TN', 'KNT', 'TN', 'UP',
       'DEL', 'MP', 'TN', 'TN', 'ASM', 'AP', 'RAJ',
       'MS', 'JK', 'HR', 'WB', 'AP', 'UP', 'MS',
       'MS', 'KER', 'BHR', 'HR', 'JH', 'HP', 'JK',
       'KEL']

In [75]:
df.state = df.state.replace(state_now, state_new)

In [76]:
df.state.unique()


Out[76]:
array(['PB', 'UP', 'GUJ', 'MS', 'RAJ', 'KNT', 'MP', 'OR', 'BHR', 'WB',
       'CH', 'TN', 'DEL', 'ASM', 'AP', 'JK', 'HR', 'KER', 'JH', 'HP', 'KEL'], dtype=object)

Getting the Dates


In [77]:
df.head()


Out[77]:
market month year quantity priceMin priceMax priceMod state city
0 ABOHAR(PB) Jan 2005 2350 404 493 446 PB ABOHAR
1 ABOHAR(PB) Jan 2006 900 487 638 563 PB ABOHAR
2 ABOHAR(PB) Jan 2010 790 1283 1592 1460 PB ABOHAR
3 ABOHAR(PB) Jan 2011 245 3067 3750 3433 PB ABOHAR
4 ABOHAR(PB) Jan 2012 1035 523 686 605 PB ABOHAR

In [78]:
df.index


Out[78]:
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            10217, 10218, 10219, 10220, 10221, 10222, 10223, 10224, 10225,
            10226],
           dtype='int64', length=10227)

In [79]:
pd.to_datetime('January 2012')


Out[79]:
Timestamp('2012-01-01 00:00:00')

In [80]:
df['date'] = df['month'] + '-' + df['year'].map(str)

In [82]:
??map

In [81]:
df.head()


Out[81]:
market month year quantity priceMin priceMax priceMod state city date
0 ABOHAR(PB) Jan 2005 2350 404 493 446 PB ABOHAR Jan-2005
1 ABOHAR(PB) Jan 2006 900 487 638 563 PB ABOHAR Jan-2006
2 ABOHAR(PB) Jan 2010 790 1283 1592 1460 PB ABOHAR Jan-2010
3 ABOHAR(PB) Jan 2011 245 3067 3750 3433 PB ABOHAR Jan-2011
4 ABOHAR(PB) Jan 2012 1035 523 686 605 PB ABOHAR Jan-2012

In [85]:
index = pd.to_datetime(df.date)

In [86]:
df.index = pd.PeriodIndex(df.date, freq='M')

In [ ]:
df.columns

In [87]:
df.index


Out[87]:
PeriodIndex(['2005-01', '2006-01', '2010-01', '2011-01', '2012-01', '2013-01',
             '2014-01', '2015-01', '2005-02', '2006-02',
             ...
             '2006-12', '2007-12', '2008-12', '2009-12', '2010-12', '2011-12',
             '2012-12', '2013-12', '2014-12', '2015-12'],
            dtype='int64', length=10227, freq='M')

In [88]:
df.head()


Out[88]:
market month year quantity priceMin priceMax priceMod state city date
2005-01 ABOHAR(PB) Jan 2005 2350 404 493 446 PB ABOHAR Jan-2005
2006-01 ABOHAR(PB) Jan 2006 900 487 638 563 PB ABOHAR Jan-2006
2010-01 ABOHAR(PB) Jan 2010 790 1283 1592 1460 PB ABOHAR Jan-2010
2011-01 ABOHAR(PB) Jan 2011 245 3067 3750 3433 PB ABOHAR Jan-2011
2012-01 ABOHAR(PB) Jan 2012 1035 523 686 605 PB ABOHAR Jan-2012

In [ ]:
df.to_csv('MonthWiseMarketArrivals_Clean.csv', index = False)