4. Explore the Data

"I don't know, what I don't know"

We want to first visually explore the data to see if we can confirm some of our initial hypotheses as well as make new hypothesis about the problem we are trying to solve.

For this we will start by loading the data and understanding the data structure of the dataframe we have.

Lets read the data


In [1]:
# Import the library we need, which is Pandas
import pandas as pd

You will find the variable df used quite often to store a dataframe


In [2]:
# Read the csv file of Monthwise Quantity and Price csv file we have.
df = pd.read_csv('MonthWiseMarketArrivals_clean.csv')

Understand Data Structure and Types


In [3]:
df.shape


Out[3]:
(10227, 10)

In [4]:
df.head()


Out[4]:
market month year quantity priceMin priceMax priceMod state city date
0 ABOHAR(PB) January 2005 2350 404 493 446 PB ABOHAR January-2005
1 ABOHAR(PB) January 2006 900 487 638 563 PB ABOHAR January-2006
2 ABOHAR(PB) January 2010 790 1283 1592 1460 PB ABOHAR January-2010
3 ABOHAR(PB) January 2011 245 3067 3750 3433 PB ABOHAR January-2011
4 ABOHAR(PB) January 2012 1035 523 686 605 PB ABOHAR January-2012

Data Structure

So we have ten columns in our dataset. Let us understand what each one is.

Three are about the location of the Wholesale Market where Onion where sold.

  • state: This is the 2/3 letter abbreviation for the state in India (PB is Punjab and so on)
  • city: This is the city in India (ABOHAR, BANGALORE and so on)
  • market: This is a string with the combination of the state and city

Three are related to the

  • month: Month in January, February and so on.
  • year: Year in YYYY representastion
  • date: The combination of the two above.

Four are about quantity and price in these wholesale market.

  • quantity: The quanity of Onion arriving in the market in that month in quintals (100 kg)
  • priceMin: The minimum price in the month in Rs./quintal
  • priceMax: The maximum price in the month in Rs./quintal
  • priceMod: The modal price in the month in Rs./quintal

We would expect the following the columns to be of the following type

  • CATEGORICAL: state, city, market
  • TIME INTERVAL: month, year, date
  • QUANTITATIVE: quantity, priceMin, priceMax, priceModal

Let us see what pandas dataframe is reading these columns as.


In [5]:
# Get the typeof each column
df.dtypes


Out[5]:
market      object
month       object
year         int64
quantity     int64
priceMin     int64
priceMax     int64
priceMod     int64
state       object
city        object
date        object
dtype: object

So we are getting the quantitive columns are correctly being shown as integers and the categorical columns are showing as objects(strings) which is fine. However, the date columns is being read as an object and not a Time-Interval. Let us at least fix the date column and make it into a datetime object


In [6]:
# Changing the date column to a Time Interval columnn
df.date = pd.DatetimeIndex(df.date)

In [13]:
df.shape


Out[13]:
(10227, 10)

In [14]:
# Now checking for type of each column
df.dtypes


Out[14]:
market              object
month               object
year                 int64
quantity             int64
priceMin             int64
priceMax             int64
priceMod             int64
state               object
city                object
date        datetime64[ns]
dtype: object

In [15]:
# Let us see the dataframe again now
df.head()


Out[15]:
market month year quantity priceMin priceMax priceMod state city date
0 ABOHAR(PB) January 2005 2350 404 493 446 PB ABOHAR 2005-01-01
1 ABOHAR(PB) January 2006 900 487 638 563 PB ABOHAR 2006-01-01
2 ABOHAR(PB) January 2010 790 1283 1592 1460 PB ABOHAR 2010-01-01
3 ABOHAR(PB) January 2011 245 3067 3750 3433 PB ABOHAR 2011-01-01
4 ABOHAR(PB) January 2012 1035 523 686 605 PB ABOHAR 2012-01-01

In [21]:
# df.city.unique()
# Functional Approach
pd.unique(df.city)


Out[21]:
array(['ABOHAR', 'AGRA', 'AHMEDABAD', 'AHMEDNAGAR', 'AJMER', 'ALIGARH',
       'ALWAR', 'AMRITSAR', 'BALLIA', 'BANGALORE', 'BAREILLY', 'BELGAUM',
       'BHATINDA', 'BHAVNAGAR', 'BHOPAL', 'BHUBNESWER', 'BIHARSHARIF',
       'BIJAPUR', 'BIKANER', 'BOMBORI', 'BURDWAN', 'CHAKAN', 'CHALLAKERE',
       'CHANDIGARH', 'CHANDVAD', 'CHENNAI', 'CHICKBALLAPUR', 'COIMBATORE',
       'DEESA', 'DEHRADOON', 'DELHI', 'DEORIA', 'DEVALA', 'DEWAS',
       'DHAVANGERE', 'DHULIA', 'DINDIGUL', 'DINDORI', 'ETAWAH', 'GONDAL',
       'GORAKHPUR', 'GUWAHATI', 'HALDWANI', 'HASSAN', 'HOSHIARPUR',
       'HUBLI', 'HYDERABAD', 'INDORE', 'JAIPUR', 'JALANDHAR', 'JALGAON',
       'JAMMU', 'JAMNAGAR', 'JODHPUR', 'JUNNAR', 'KALVAN', 'KANPUR',
       'KARNAL', 'KHANNA', 'KOLAR', 'KOLHAPUR', 'KOLKATA', 'KOPERGAON',
       'KOTA', 'KURNOOL', 'LASALGAON', 'LONAND', 'LUCKNOW', 'LUDHIANA',
       'MADURAI', 'MAHUVA', 'MALEGAON', 'MANDSOUR', 'MANMAD', 'MEERUT',
       'MIDNAPUR', 'MUMBAI', 'NAGPUR', 'NANDGAON', 'NASIK', 'NEEMUCH',
       'NEWASA', 'NIPHAD', 'PALAYAM', 'PATIALA', 'PATNA', 'PHALTAN ',
       'PIMPALGAON', 'PUNE', 'PURULIA', 'RAHATA', 'RAHURI', 'RAICHUR',
       'RAIPUR', 'RAJAHMUNDRY', 'RAJKOT', 'RANCHI', 'SAGAR', 'SAIKHEDA',
       'SANGALI', 'SANGAMNER', 'SATANA', 'SHEROAPHULY', 'SHIMLA',
       'SHRIRAMPUR', 'SINNAR', 'SOLAPUR', 'SRIGANGANAGAR', 'SRINAGAR',
       'SRIRAMPUR', 'SURAT', 'TRIVENDRUM', 'UDAIPUR', 'UJJAIN', 'VANI',
       'VARANASI', 'YEOLA'], dtype=object)

Question 1 - How big is the Bangalore onion market compared to other cities in India?

Let us try to do this examination for one of the year only. So we want to reduce our dataframe for only where the year = 2010. This process is called subset.

PRINCIPLE: Subset a Dataframe

How do you subset a dataframe on a given criteria

newDataframe = df[ <subset condition> ]


In [22]:
df2010 = df[df.year == 2010]

In [23]:
df2010.head()


Out[23]:
market month year quantity priceMin priceMax priceMod state city date
2 ABOHAR(PB) January 2010 790 1283 1592 1460 PB ABOHAR 2010-01-01
10 ABOHAR(PB) February 2010 555 1143 1460 1322 PB ABOHAR 2010-02-01
18 ABOHAR(PB) March 2010 385 510 878 688 PB ABOHAR 2010-03-01
25 ABOHAR(PB) April 2010 840 466 755 611 PB ABOHAR 2010-04-01
32 ABOHAR(PB) May 2010 2050 391 578 494 PB ABOHAR 2010-05-01

In [24]:
# We can also subset on multiple criterias
df2010Bang = df[(df.year == 2010) & (df.city == 'BANGALORE')]

In [25]:
df2010Bang.head()


Out[25]:
market month year quantity priceMin priceMax priceMod state city date
722 BANGALORE January 2010 423649 502 1837 1267 KNT BANGALORE 2010-01-01
735 BANGALORE February 2010 316685 495 1429 1011 KNT BANGALORE 2010-02-01
748 BANGALORE March 2010 368644 386 895 661 KNT BANGALORE 2010-03-01
760 BANGALORE April 2010 404805 287 730 545 KNT BANGALORE 2010-04-01
772 BANGALORE May 2010 395519 306 827 564 KNT BANGALORE 2010-05-01

Principle: Split Apply Combine

How do we get the sum of quantity for each city.

We need to SPLIT the data by each city, APPLY the sum to the quantity row and then COMBINE the data again

In pandas, we use the groupby function to do this.


In [35]:
# Group by using city
df2010City = df2010.groupby(['city']).sum()

In [36]:
df2010City.head()


Out[36]:
year quantity priceMin priceMax priceMod
city
ABOHAR 24120 11835 10896 16130 14012
AGRA 24120 756755 13213 14827 14149
AHMEDABAD 24120 1135418 8910 15819 13539
AHMEDNAGAR 24120 1678032 3955 20984 12923
ALWAR 24120 561145 8850 14717 11317

In [37]:
type(df2010City)


Out[37]:
pandas.core.frame.DataFrame

In [38]:
# If we only want to apply the sum function on quantity, then we specify the quantity column
df2010City = df2010.groupby(['city']).quantity.sum()

In [39]:
# Let us see this dataframe
df2010City.head()


Out[39]:
city
ABOHAR          11835
AGRA           756755
AHMEDABAD     1135418
AHMEDNAGAR    1678032
ALWAR          561145
Name: quantity, dtype: int64

In [40]:
type(df2010City)


Out[40]:
pandas.core.series.Series

In [41]:
# To create a dataframe again, it is best to specify index as false
df2010City = df2010.groupby(['city'], as_index=False).quantity.sum()

In [42]:
df2010City.head()


Out[42]:
city quantity
0 ABOHAR 11835
1 AGRA 756755
2 AHMEDABAD 1135418
3 AHMEDNAGAR 1678032
4 ALWAR 561145

In [45]:
sorted(df2010City.quantity)


Out[45]:
[250,
 4207,
 11835,
 15350,
 34141,
 42952,
 46683,
 54841,
 59566,
 64092,
 69568,
 70473,
 96600,
 114417,
 115600,
 122625,
 124822,
 129035,
 131830,
 141575,
 148570,
 161366,
 199684,
 201851,
 251785,
 262695,
 274420,
 275334,
 283269,
 312585,
 314257,
 353421,
 362637,
 385838,
 390480,
 396100,
 429710,
 467125,
 494081,
 495760,
 508965,
 521640,
 532340,
 556041,
 561145,
 561234,
 564360,
 568239,
 570690,
 634993,
 728121,
 756755,
 794301,
 901114,
 917394,
 1135418,
 1280122,
 1303838,
 1343987,
 1363172,
 1404700,
 1678032,
 1774613,
 1833914,
 2020484,
 2519226,
 2650760,
 2796128,
 2921005,
 3310419,
 3326024,
 3495320,
 3508582,
 6079067]

In [46]:
# Sort the Dataframe by Quantity to see which one is on top
df2010City = df2010City.sort_values(by = "quantity", ascending = False)
df2010City.head()


Out[46]:
city quantity
6 BANGALORE 6079067
16 DELHI 3508582
39 KOLKATA 3495320
56 PUNE 3326024
67 SOLAPUR 3310419

In [47]:
%timeit sorted(df2010City.quantity)


The slowest run took 92.81 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 25.5 µs per loop

In [49]:
%timeit df2010City.quantity.sort_values()


The slowest run took 20.41 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 170 µs per loop

In [48]:
%timeit df2010City.sort_values(by = "quantity", ascending = False)


1000 loops, best of 3: 381 µs per loop

In [ ]:

PRINCIPLE: Visual Exploration

Lets load the libraries required for plotting in python


In [50]:
# Load the visualisation libraries - Matplotlib
import matplotlib.pyplot as plt

In [51]:
# Let us see the output plots in the notebook itself
%matplotlib inline

In [52]:
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)

In [54]:
# Plot the Data
df2010City.plot(kind ="barh", x = 'city', y = 'quantity')


Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x117bdac18>

Exercise: Find the State with Highest Quantity Sales in 2015?


In [ ]:
df2015 = df[]

In [ ]:

Exercise: Plot the State and Cities with Quantity Sales in 2015


In [ ]:


In [ ]:


In [ ]:

Question 2 - Have the price variation in Onion prices in Bangalore really gone up over the years?


In [55]:
df.head()


Out[55]:
market month year quantity priceMin priceMax priceMod state city date
0 ABOHAR(PB) January 2005 2350 404 493 446 PB ABOHAR 2005-01-01
1 ABOHAR(PB) January 2006 900 487 638 563 PB ABOHAR 2006-01-01
2 ABOHAR(PB) January 2010 790 1283 1592 1460 PB ABOHAR 2010-01-01
3 ABOHAR(PB) January 2011 245 3067 3750 3433 PB ABOHAR 2011-01-01
4 ABOHAR(PB) January 2012 1035 523 686 605 PB ABOHAR 2012-01-01

In [56]:
dfBang = df[df.city == 'BANGALORE']

In [57]:
dfBang.head()


Out[57]:
market month year quantity priceMin priceMax priceMod state city date
716 BANGALORE January 2004 227832 916 1066 991 KNT BANGALORE 2004-01-01
717 BANGALORE January 2005 335679 470 597 522 KNT BANGALORE 2005-01-01
718 BANGALORE January 2006 412185 286 617 537 KNT BANGALORE 2006-01-01
719 BANGALORE January 2007 268268 586 1167 942 KNT BANGALORE 2007-01-01
720 BANGALORE January 2008 393806 174 671 472 KNT BANGALORE 2008-01-01

In [58]:
dfBang.describe()


Out[58]:
year quantity priceMin priceMax priceMod
count 146.000000 146.000000 146.000000 146.000000 146.000000
mean 2009.589041 523917.424658 557.541096 1313.883562 1043.547945
std 3.522678 302369.030403 381.415623 911.101944 667.509329
min 2004.000000 63824.000000 145.000000 338.000000 320.000000
25% 2007.000000 329690.000000 307.000000 685.000000 551.000000
50% 2010.000000 403530.500000 448.000000 1019.000000 836.000000
75% 2013.000000 661252.000000 653.000000 1636.000000 1325.500000
max 2016.000000 1639032.000000 2377.000000 4698.000000 3430.000000

In [59]:
# Reduce the precision of numbers - so that it is easy to read
pd.set_option('precision', 0)

In [60]:
dfBang.describe()


Out[60]:
year quantity priceMin priceMax priceMod
count 146 146 146 146 146
mean 2010 523917 558 1314 1044
std 4 302369 381 911 668
min 2004 63824 145 338 320
25% 2007 329690 307 685 551
50% 2010 403530 448 1019 836
75% 2013 661252 653 1636 1326
max 2016 1639032 2377 4698 3430

PRINCIPLE: Setting Index


In [61]:
dfBang.head()


Out[61]:
market month year quantity priceMin priceMax priceMod state city date
716 BANGALORE January 2004 227832 916 1066 991 KNT BANGALORE 2004-01-01
717 BANGALORE January 2005 335679 470 597 522 KNT BANGALORE 2005-01-01
718 BANGALORE January 2006 412185 286 617 537 KNT BANGALORE 2006-01-01
719 BANGALORE January 2007 268268 586 1167 942 KNT BANGALORE 2007-01-01
720 BANGALORE January 2008 393806 174 671 472 KNT BANGALORE 2008-01-01

In [62]:
dfBang.index


Out[62]:
Int64Index([716, 717, 718, 719, 720, 721, 722, 723, 724, 725,
            ...
            852, 853, 854, 855, 856, 857, 858, 859, 860, 861],
           dtype='int64', length=146)

In [63]:
# Set the index as date
dfBang = dfBang.sort_values(by = "date")
dfBang.head()


Out[63]:
market month year quantity priceMin priceMax priceMod state city date
716 BANGALORE January 2004 227832 916 1066 991 KNT BANGALORE 2004-01-01
729 BANGALORE February 2004 225133 741 870 793 KNT BANGALORE 2004-02-01
742 BANGALORE March 2004 221952 527 586 556 KNT BANGALORE 2004-03-01
754 BANGALORE April 2004 185150 419 518 465 KNT BANGALORE 2004-04-01
766 BANGALORE May 2004 137390 400 516 455 KNT BANGALORE 2004-05-01

In [64]:
# Set the Index for the Dataframe
dfBang.index = pd.PeriodIndex(dfBang.date, freq='M')

In [65]:
dfBang.head()


Out[65]:
market month year quantity priceMin priceMax priceMod state city date
2004-01 BANGALORE January 2004 227832 916 1066 991 KNT BANGALORE 2004-01-01
2004-02 BANGALORE February 2004 225133 741 870 793 KNT BANGALORE 2004-02-01
2004-03 BANGALORE March 2004 221952 527 586 556 KNT BANGALORE 2004-03-01
2004-04 BANGALORE April 2004 185150 419 518 465 KNT BANGALORE 2004-04-01
2004-05 BANGALORE May 2004 137390 400 516 455 KNT BANGALORE 2004-05-01

In [66]:
dfBang.priceMod.plot()


Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x1187e8438>

In [67]:
dfBang.plot(kind = "line", y = ['priceMin', 'priceMod', 'priceMax'])


Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x108c8c400>

To calculate the range of change, we will create a new price difference variable - which is the difference between the priceMin and priceMax


In [68]:
dfBang['priceDiff'] = dfBang['priceMax'] - dfBang['priceMin']

In [69]:
dfBang.head()


Out[69]:
market month year quantity priceMin priceMax priceMod state city date priceDiff
2004-01 BANGALORE January 2004 227832 916 1066 991 KNT BANGALORE 2004-01-01 150
2004-02 BANGALORE February 2004 225133 741 870 793 KNT BANGALORE 2004-02-01 129
2004-03 BANGALORE March 2004 221952 527 586 556 KNT BANGALORE 2004-03-01 59
2004-04 BANGALORE April 2004 185150 419 518 465 KNT BANGALORE 2004-04-01 99
2004-05 BANGALORE May 2004 137390 400 516 455 KNT BANGALORE 2004-05-01 116

In [70]:
dfBang.plot(kind = 'line', y = 'priceDiff')


Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x1189360f0>

PRINCIPLE: Pivot Table

Pivot table is a way to summarize data frame data into index (rows), columns and value


In [71]:
# Create new variable for Integer Month
dfBang['monthVal'] = pd.DatetimeIndex(dfBang['date']).month

In [72]:
dfBang.head()


Out[72]:
market month year quantity priceMin priceMax priceMod state city date priceDiff monthVal
2004-01 BANGALORE January 2004 227832 916 1066 991 KNT BANGALORE 2004-01-01 150 1
2004-02 BANGALORE February 2004 225133 741 870 793 KNT BANGALORE 2004-02-01 129 2
2004-03 BANGALORE March 2004 221952 527 586 556 KNT BANGALORE 2004-03-01 59 3
2004-04 BANGALORE April 2004 185150 419 518 465 KNT BANGALORE 2004-04-01 99 4
2004-05 BANGALORE May 2004 137390 400 516 455 KNT BANGALORE 2004-05-01 116 5

In [73]:
dfBangPivot = pd.pivot_table(dfBang, values = "priceDiff", 
                             columns = "year", index = "monthVal")

In [75]:
dfBangPivot


Out[75]:
year 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
monthVal
1 150 127 331 581 497 706 1335 1138 575 973 892 1654 1743
2 129 110 209 611 373 663 934 218 511 907 721 1409 1305
3 59 79 147 398 334 387 509 137 484 745 784 1022 NaN
4 99 75 169 181 274 341 443 196 421 634 859 1012 NaN
5 116 56 142 196 331 288 521 186 447 764 1245 1491 NaN
6 135 107 157 159 350 466 634 378 498 997 1675 1416 NaN
7 167 176 189 248 348 374 629 292 514 1143 1446 1608 NaN
8 145 169 192 381 317 379 668 465 609 2849 1580 3256 NaN
9 98 219 290 382 537 495 324 939 687 2793 1377 2864 NaN
10 111 602 319 914 588 1259 405 826 724 2485 1432 2980 NaN
11 120 1131 337 824 604 1616 725 866 1089 3083 2164 2933 NaN
12 177 521 189 643 604 1914 913 679 978 1463 1900 1813 NaN

In [76]:
dfBangPivot.plot()


Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x108d23080>

In [77]:
dfBangPivot.plot(subplots = True, figsize=(15, 15), layout=(3, 5), sharey=True)


Out[77]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x108d565c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x111b92d30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x111be05c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x111c1ba20>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x111c6b358>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x111ca7128>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x112b8ac18>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1195aca90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x119ad39e8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x119fcf550>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x11a1346d8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11a17ebe0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11a1baf60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11a3079b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11a344cc0>]], dtype=object)

Exercise: Find the price variation for LASALGAON city?


In [78]:
dfBangPivot.plot(subplots = True, figsize=(15, 15), layout=(3, 5), 
                 sharey=False)


Out[78]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11af325f8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11b3f02e8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11b543d68>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11b584198>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11b5cdb38>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x11b60ac50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11b659908>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11b695748>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11b6de828>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11b71b518>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x11b768828>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11b7b2cc0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x119595be0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11ba03390>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11ba44048>]], dtype=object)

In [ ]: