4. Explore the Data

"I don't know what I don't know"

We first explore the data visually, both to check our initial hypotheses and to form new hypotheses about the problem we are trying to solve.

We start by loading the data and understanding the structure of the dataframe.

Let's read the data



In [1]:

    
sessionInfo()









    Out[1]:





R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-apple-darwin15.3.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] magrittr_1.5    IRdisplay_0.3   tools_3.2.4     base64enc_0.1-3
 [5] uuid_0.1-2      stringi_1.0-1   rzmq_0.7.7      IRkernel_0.5   
 [9] jsonlite_0.9.19 stringr_1.0.0   digest_0.6.8    repr_0.4       
[13] evaluate_0.8



In [3]:

    
install.packages('dplyr',repos='http://ftp.iitm.ac.in/cran')









    



Installing package into '/usr/local/lib/R/3.2/site-library'
(as 'lib' is unspecified)






    



The downloaded source packages are in
	'/private/var/folders/04/r20f0_4n2m7cv23lr8t97wp00000gn/T/RtmpRVFUJm/downloaded_packages'



In [ ]:

    
install.packages('ggplot2',repos='http://ftp.iitm.ac.in/cran')



In [2]:

    
# Load the libraries we need: dplyr and ggplot2
library(dplyr)
library(ggplot2)









    



Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [5]:

    
# Configuring jupyter plotting
options(repr.plot.width=10, repr.plot.height=6)

You will find the variable df used quite often to store a dataframe



In [6]:

    
# Read the CSV file of month-wise quantity and price data
df <- read.csv('MonthWiseMarketArrivals_Clean.csv')

Understand Data Structure and Types



In [7]:

    
dim(df)









    Out[7]:





	10320
	10



In [8]:

    
head(df)









    Out[8]:





market month year quantity priceMin priceMax priceMod city state date

	1 ABOHAR(PB) January 2005 2350 404 493 446 ABOHAR PB 2005-01-01
	2 ABOHAR(PB) January 2006 900 487 638 563 ABOHAR PB 2006-01-01
	3 ABOHAR(PB) January 2010 790 1283 1592 1460 ABOHAR PB 2010-01-01
	4 ABOHAR(PB) January 2011 245 3067 3750 3433 ABOHAR PB 2011-01-01
	5 ABOHAR(PB) January 2012 1035 523 686 605 ABOHAR PB 2012-01-01
	6 ABOHAR(PB) January 2013 675 1327 1900 1605 ABOHAR PB 2013-01-01

Data Structure

So we have ten columns in our dataset. Let us understand what each one is.

Three are about the location of the wholesale market where the onions were sold.

state: The 2/3-letter abbreviation for the state in India (PB is Punjab and so on)
city: The city in India (ABOHAR, BANGALORE and so on)
market: A string combining the city and the state

Three are related to the time period.

month: Month name (January, February and so on)
year: Year in YYYY representation
date: The combination of the two above

Four are about quantity and price in these wholesale markets.

quantity: The quantity of onions arriving in the market in that month, in quintals (100 kg)
priceMin: The minimum price in the month in Rs./quintal
priceMax: The maximum price in the month in Rs./quintal
priceMod: The modal price in the month in Rs./quintal

We would expect the columns to be of the following types

CATEGORICAL: state, city, market
TIME INTERVAL: month, year, date
QUANTITATIVE: quantity, priceMin, priceMax, priceMod

Let us see how R has read these columns.



In [9]:

    
# Get the structure of the data frame
str(df)









    



'data.frame':	10320 obs. of  10 variables:
 $ market  : Factor w/ 122 levels "ABOHAR(PB)","AGRA(UP)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ month   : Factor w/ 12 levels "April","August",..: 5 5 5 5 5 5 5 5 4 4 ...
 $ year    : int  2005 2006 2010 2011 2012 2013 2014 2015 2005 2006 ...
 $ quantity: int  2350 900 790 245 1035 675 440 1305 1400 1800 ...
 $ priceMin: int  404 487 1283 3067 523 1327 1025 1309 286 343 ...
 $ priceMax: int  493 638 1592 3750 686 1900 1481 1858 365 411 ...
 $ priceMod: int  446 563 1460 3433 605 1605 1256 1613 324 380 ...
 $ city    : Factor w/ 119 levels "ABOHAR","AGRA",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ state   : Factor w/ 22 levels "AP","ASM","BHR",..: 17 17 17 17 17 17 17 17 17 17 ...
 $ date    : Factor w/ 243 levels "1996-01-01","1996-02-01",..: 109 121 169 181 193 205 217 229 110 122 ...

So the quantitative columns are correctly read as integers, and the categorical columns are read as factors, which is fine. However, the date column is also read as a factor and not as a date. Let us fix the date column and convert it to a Date object.
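As a minimal sketch of why the conversion matters, the toy data frame below (hypothetical, with values borrowed from the ABOHAR rows for illustration) stores its dates as a factor, converts them, and then uses date arithmetic that a factor would not support. `stringsAsFactors = TRUE` is set explicitly here to mimic how the notebook's R version reads strings.

```r
# Toy data frame standing in for the real one (illustrative values only).
toy <- data.frame(
  date     = c("2010-01-01", "2010-03-01", "2010-02-01"),
  priceMod = c(1460L, 688L, 1322L),
  stringsAsFactors = TRUE   # mimic R 3.x, where strings are read as factors
)

class_before <- class(toy$date)   # "factor"

# Convert factor -> character -> Date
toy$date <- as.Date(as.character(toy$date), format = "%Y-%m-%d")

class_after <- class(toy$date)    # "Date"

# Date columns can be ordered and subtracted
date_range <- range(toy$date)
span_days  <- as.numeric(diff(date_range))   # days between first and last date
```

Once the column is a Date, sorting and plotting against a time axis behave chronologically rather than by factor level.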



In [10]:

    
as.Date("2016-04-01", "%Y-%m-%d")









    Out[10]:





[1] "2016-04-01"



In [12]:

    
# Convert the date column from a factor to a Date
df$date <- as.Date(as.character(df$date), "%Y-%m-%d")



In [13]:

    
# Now checking for type of each column
str(df)









    



'data.frame':	10320 obs. of  10 variables:
 $ market  : Factor w/ 122 levels "ABOHAR(PB)","AGRA(UP)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ month   : Factor w/ 12 levels "April","August",..: 5 5 5 5 5 5 5 5 4 4 ...
 $ year    : int  2005 2006 2010 2011 2012 2013 2014 2015 2005 2006 ...
 $ quantity: int  2350 900 790 245 1035 675 440 1305 1400 1800 ...
 $ priceMin: int  404 487 1283 3067 523 1327 1025 1309 286 343 ...
 $ priceMax: int  493 638 1592 3750 686 1900 1481 1858 365 411 ...
 $ priceMod: int  446 563 1460 3433 605 1605 1256 1613 324 380 ...
 $ city    : Factor w/ 119 levels "ABOHAR","AGRA",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ state   : Factor w/ 22 levels "AP","ASM","BHR",..: 17 17 17 17 17 17 17 17 17 17 ...
 $ date    : Date, format: "2005-01-01" "2006-01-01" ...
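The `str` output shows that the `month` factor has its levels in alphabetical order ("April", "August", ...). If calendar order is wanted, one possible approach (not something this notebook does) is to rebuild the factor with the built-in `month.name` constant as its levels. A small sketch on a hypothetical vector:

```r
# A few month names, read as a factor the way read.csv did for df$month
months_seen <- factor(c("January", "April", "August", "January"))

alpha_levels <- levels(months_seen)   # "April" "August" "January"

# Rebuild the factor with calendar-ordered levels (month.name ships with base R)
months_cal <- factor(months_seen, levels = month.name)

# The underlying integer codes now match the month number
month_codes <- as.integer(months_cal)   # 1 4 8 1
```

Applied to the real data this would look like `df$month <- factor(df$month, levels = month.name)`, assuming every value in the column is a full English month name.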



In [14]:

    
# Let us see the dataframe again now
head(df)









    Out[14]:





market month year quantity priceMin priceMax priceMod city state date

	1 ABOHAR(PB) January 2005 2350 404 493 446 ABOHAR PB 2005-01-01
	2 ABOHAR(PB) January 2006 900 487 638 563 ABOHAR PB 2006-01-01
	3 ABOHAR(PB) January 2010 790 1283 1592 1460 ABOHAR PB 2010-01-01
	4 ABOHAR(PB) January 2011 245 3067 3750 3433 ABOHAR PB 2011-01-01
	5 ABOHAR(PB) January 2012 1035 523 686 605 ABOHAR PB 2012-01-01
	6 ABOHAR(PB) January 2013 675 1327 1900 1605 ABOHAR PB 2013-01-01

Question 1 - How big is the Bangalore onion market compared to other cities in India?

Let us do this examination for one year only. We want to reduce our dataframe to the rows where year is 2010. This process is called subsetting.

PRINCIPLE: `filter` for rows and/or `select` columns in a dataframe

verb: filter for rows

verb: select for columns



In [15]:

    
df2010 <- filter(df, year == 2010)

It is easier to write chained function calls using the pipe operator, %>%



In [16]:

    
df2010 <- df %>% 
          filter(year == 2010)
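To make the equivalence concrete, here is a small sketch on a hypothetical data frame (the non-Bangalore quantities are made up) that writes the same two-step subset once as nested calls and once as a pipe chain, then checks that both give the same result.

```r
library(dplyr)

# Toy data; only the shape matters here
toy <- data.frame(
  city     = c("BANGALORE", "DELHI", "BANGALORE", "PUNE"),
  year     = c(2010L, 2010L, 2011L, 2010L),
  quantity = c(423649L, 300000L, 316685L, 250000L),
  stringsAsFactors = FALSE
)

# Nested form: read from the inside out
nested <- select(filter(toy, year == 2010), city, quantity)

# Piped form: read top to bottom, each step feeds the next
piped <- toy %>%
         filter(year == 2010) %>%
         select(city, quantity)

same_result <- identical(nested, piped)
```

The pipe passes the left-hand side as the first argument of the function on the right, which is why `filter(year == 2010)` needs no explicit data argument.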



In [60]:

    
head(df2010)









    Out[60]:





market month year quantity priceMin priceMax priceMod city state date

	1 ABOHAR(PB) January 2010 790 1283 1592 1460 ABOHAR PB 2010-01-01
	2 ABOHAR(PB) February 2010 555 1143 1460 1322 ABOHAR PB 2010-02-01
	3 ABOHAR(PB) March 2010 385 510 878 688 ABOHAR PB 2010-03-01
	4 ABOHAR(PB) April 2010 840 466 755 611 ABOHAR PB 2010-04-01
	5 ABOHAR(PB) May 2010 2050 391 578 494 ABOHAR PB 2010-05-01
	6 ABOHAR(PB) June 2010 2075 363 515 460 ABOHAR PB 2010-06-01



In [17]:

    
# We can also filter on multiple criteria and select only particular columns
df2010Bang <- df %>% 
              filter((year == 2010) & (city == 'BANGALORE')) %>%
              select(market, year, quantity)



In [18]:

    
head(df2010Bang)









    Out[18]:





market year quantity

	1 BANGALORE 2010 423649
	2 BANGALORE 2010 316685
	3 BANGALORE 2010 368644
	4 BANGALORE 2010 404805
	5 BANGALORE 2010 395519
	6 BANGALORE 2010 362618

Exercise: Filter for market `Lasalgaon` and select on market and price columns



In [19]:

    
unique(df$city)









    Out[19]:





	ABOHAR
	AGRA
	AHMEDABAD
	AHMEDNAGAR
	AJMER
	ALIGARH
	ALWAR
	AMRITSAR
	BALLIA
	BANGALORE
	BAREILLY
	BELGAUM
	BHATINDA
	BHAVNAGAR
	BHOPAL
	BHUBNESWER
	BIHARSHARIF
	BIJAPUR
	BIKANER
	BOMBORI
	BURDWAN
	CHAKAN
	CHALLAKERE
	CHANDIGARH
	CHANDVAD
	CHENNAI
	CHICKBALLAPUR
	COIMBATORE
	DEESA
	DEHRADOON
	DELHI
	DEORIA
	DEVALA
	DEWAS
	DHAVANGERE
	DHULIA
	DINDIGUL
	DINDORI
	ETAWAH
	FARUKHABAD
	GONDAL
	GORAKHPUR
	GUWAHATI
	HALDWANI
	HASSAN
	HOSHIARPUR
	HUBLI
	HYDERABAD
	INDORE
	JAIPUR
	JALANDHAR
	JALGAON
	JAMMU
	JAMNAGAR
	JHANSI
	JODHPUR
	JUNNAR
	KALVAN
	KANPUR
	KARNAL
	KHANNA
	KOLAR
	KOLHAPUR
	KOLKATA
	KOPERGAON
	KOTA
	KURNOOL
	LASALGAON
	LONAND
	LUCKNOW
	LUDHIANA
	MADURAI
	MAHUVA
	MALEGAON
	MANDSOUR
	MANMAD
	MEERUT
	MIDNAPUR
	MUMBAI
	NAGPUR
	NANDGAON
	NASIK
	NEEMUCH
	NEWASA
	NIPHAD
	PALAYAM
	PATIALA
	PATNA
	PHALTAN 
	PIMPALGAON
	PUNE
	PURULIA
	RAHATA
	RAHURI
	RAICHUR
	RAIPUR
	RAJAHMUNDRY
	RAJKOT
	RANCHI
	SAGAR
	SAIKHEDA
	SANGALI
	SANGAMNER
	SATANA
	SHEROAPHULY
	SHIMLA
	SHRIRAMPUR
	SINNAR
	SOLAPUR
	SRIGANGANAGAR
	SRINAGAR
	SRIRAMPUR
	SURAT
	TRIVENDRUM
	UDAIPUR
	UJJAIN
	VANI
	VARANASI
	YEOLA



In [23]:

    
df_lasalgaon <- df %>% filter(city == 'LASALGAON')



In [24]:

    
dim(df_lasalgaon)









    Out[24]:





	243
	10
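The cells above cover the filter half of the exercise. One way to finish it, sketched on a hypothetical data frame, is to add a `select` step that keeps `market` plus every column whose name starts with "price". The market label "LASALGAON(MS)" is an assumption made for illustration; check the real value in `df$market`.

```r
library(dplyr)

# Toy rows with the same column names as df (values are illustrative)
toy <- data.frame(
  market   = c("LASALGAON(MS)", "BANGALORE", "LASALGAON(MS)"),
  city     = c("LASALGAON", "BANGALORE", "LASALGAON"),
  quantity = c(225063L, 423649L, 196164L),
  priceMin = c(160L, 916L, 130L),
  priceMax = c(257L, 1066L, 347L),
  priceMod = c(226L, 991L, 186L),
  stringsAsFactors = FALSE
)

lasalgaon_prices <- toy %>%
                    filter(city == "LASALGAON") %>%
                    select(market, starts_with("price"))
```

`starts_with()` is a dplyr selection helper, so the three price columns do not need to be listed by name.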

Principle: Split Apply Combine (use `group_by` and `summarize`)

How do we get the sum of quantity for each city?

We need to SPLIT the data by city, APPLY the sum to the quantity column within each group, and then COMBINE the results again.

In dplyr, we use the `group_by` function to do the grouping and `summarize` to do the apply part.
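A minimal sketch of the three steps on a toy data frame (hypothetical cities and quantities), with a base R `tapply` call alongside as a cross-check of the same sums:

```r
library(dplyr)

toy <- data.frame(
  city     = c("A", "B", "A", "B", "C"),
  quantity = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# SPLIT by city, APPLY sum() to quantity, COMBINE into one row per city
by_city <- toy %>%
           group_by(city) %>%
           summarize(quantity_year = sum(quantity))

# The same computation in base R, for comparison
base_sums <- tapply(toy$quantity, toy$city, sum)
```

Here `by_city` has one row per city (A = 40, B = 60, C = 50), and `base_sums` holds the same totals as a named vector.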



In [25]:

    
# Group by using city
df2010City = df2010 %>%
             group_by(city) %>%
             summarize(quantity_year = sum(quantity))



In [26]:

    
head(df2010City)









    Out[26]:





city quantity_year

	1 ABOHAR 11835
	2 AGRA 756755
	3 AHMEDABAD 1135418
	4 AHMEDNAGAR 1678032
	5 ALWAR 561145
	6 AMRITSAR 114417




Exercise: Find the sum of quantity for 2015 for each state



In [29]:

    
head(df)









    Out[29]:





market month year quantity priceMin priceMax priceMod city state date

	1 ABOHAR(PB) January 2005 2350 404 493 446 ABOHAR PB 2005-01-01
	2 ABOHAR(PB) January 2006 900 487 638 563 ABOHAR PB 2006-01-01
	3 ABOHAR(PB) January 2010 790 1283 1592 1460 ABOHAR PB 2010-01-01
	4 ABOHAR(PB) January 2011 245 3067 3750 3433 ABOHAR PB 2011-01-01
	5 ABOHAR(PB) January 2012 1035 523 686 605 ABOHAR PB 2012-01-01
	6 ABOHAR(PB) January 2013 675 1327 1900 1605 ABOHAR PB 2013-01-01



In [30]:

    
sum_quantity_df <- df %>% 
                   filter(year == 2015) %>% 
                   group_by(state) %>% 
                   summarize(sum_quantity = sum(quantity), avg_price = mean(priceMax))



In [31]:

    
head(sum_quantity_df)









    Out[31]:





state sum_quantity avg_price

	1 AP 2324618 2458
	2 ASM 36013 2986.917
	3 BHR 476800 2603.643
	4 DEL 3272139 2543.083
	5 GUJ 9752460 2151.518
	6 HR 146028 2464.435

PRINCIPLE: Arrange the rows

To sort the rows we use the verb arrange



In [34]:

    
# Sort the Dataframe by Quantity to see which one is on top
df2010City <- df2010City %>%
              arrange(desc(quantity_year))



In [35]:

    
head(df2010City)









    Out[35]:





city quantity_year

	1 BANGALORE 6079067
	2 DELHI 3508582
	3 KOLKATA 3495320
	4 PUNE 3326024
	5 SOLAPUR 3310419
	6 MUMBAI 2921005



In [36]:

    
df2010CitySmall <- df2010City %>% 
                   filter(quantity_year > 500000) %>%
                   arrange(desc(quantity_year))



In [37]:

    
head(df2010CitySmall)









    Out[37]:





city quantity_year

	1 BANGALORE 6079067
	2 DELHI 3508582
	3 KOLKATA 3495320
	4 PUNE 3326024
	5 SOLAPUR 3310419
	6 MUMBAI 2921005

Exercise: Sort the sum of quantity for 2015 for each state in descending order
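One possible shape for a solution, sketched on a hypothetical data frame so that it runs on its own: filter to 2015, sum by state, then sort with `arrange(desc(...))`. The state codes and quantities below are made up.

```r
library(dplyr)

toy <- data.frame(
  state    = c("KNT", "MS", "KNT", "DEL", "MS"),
  year     = c(2015L, 2015L, 2015L, 2015L, 2014L),
  quantity = c(100, 300, 150, 200, 999),
  stringsAsFactors = FALSE
)

state_2015 <- toy %>%
              filter(year == 2015) %>%
              group_by(state) %>%
              summarize(sum_quantity = sum(quantity)) %>%
              arrange(desc(sum_quantity))
```

On the real data, replacing `toy` with `df` should give the per-state totals for 2015 with the largest state first.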




Packages to be installed for Mac users

install.packages('tidyr',repos='http://ftp.iitm.ac.in/cran')
install.packages('lubridate',repos='http://ftp.iitm.ac.in/cran')
install.packages('stringr',repos='http://ftp.iitm.ac.in/cran')
install.packages('rvest',repos='http://ftp.iitm.ac.in/cran')



In [39]:

    
library(tidyr)
library(lubridate)
library(stringr)
library(rvest)









    



Loading required package: xml2

PRINCIPLE: Visual Exploration

We will be using ggplot2 for doing visual exploration in R



In [41]:

    
# Plot the Data
ggplot(df2010CitySmall) + 
aes(city,weight = quantity_year) + 
geom_bar() + 
coord_flip()



In [48]:

    
head(reorder(df2010CitySmall$city,df2010CitySmall$quantity_year),10)









    Out[48]:





	BANGALORE
	DELHI
	KOLKATA
	PUNE
	SOLAPUR
	MUMBAI
	PIMPALGAON
	MAHUVA
	LASALGAON
	MALEGAON
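The output above lists the values of the factor, which `reorder` leaves untouched; what it changes is the order of the factor levels, and that is what ggplot2 uses to position the bars. A small base R sketch using the top three totals from the table above:

```r
city          <- factor(c("BANGALORE", "DELHI", "KOLKATA"))
quantity_year <- c(6079067, 3508582, 3495320)

levels_before <- levels(city)          # alphabetical order

# reorder() sorts the levels by quantity_year, ascending
city_sorted  <- reorder(city, quantity_year)
levels_after <- levels(city_sorted)    # "KOLKATA" "DELHI" "BANGALORE"

# The values themselves are unchanged
values_after <- as.character(city_sorted)
```

With `coord_flip()`, the first level is drawn at the bottom, so ascending levels put the largest city at the top of the chart.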



In [43]:

    
# Plot the Data
ggplot(df2010CitySmall) + 
aes(reorder(city, quantity_year), weight = quantity_year) + 
geom_bar() +
coord_flip()

Exercise: Show the quantity sold in each state in 2015



In [51]:

    
df2015 <- df %>% filter(year==2015)



In [52]:

    
ggplot(df2015) +
aes(state,weight=quantity)+
geom_bar() +
coord_flip()

Exercise: Which state had the highest price in 2015?



In [55]:

    
head(df2015)









    Out[55]:





market month year quantity priceMin priceMax priceMod city state date

	1 ABOHAR(PB) January 2015 1305 1309 1858 1613 ABOHAR PB 2015-01-01
	2 ABOHAR(PB) February 2015 1115 1200 1946 1688 ABOHAR PB 2015-02-01
	3 ABOHAR(PB) March 2015 920 1260 1980 1745 ABOHAR PB 2015-03-01
	4 ABOHAR(PB) May 2015 940 1020 1620 1310 ABOHAR PB 2015-05-01
	5 ABOHAR(PB) June 2015 610 957 1829 1457 ABOHAR PB 2015-06-01
	6 ABOHAR(PB) July 2015 795 1008 2000 1517 ABOHAR PB 2015-07-01



In [58]:

    
# Note: df (all years) is grouped here, not df2015; start from df2015 for a 2015-only result
df2015_priceMax <- df %>% group_by(state) %>% summarise(maxPrice=max(priceMax))



In [59]:

    
head(df2015_priceMax)









    Out[59]:





state maxPrice

	1 AP 5305
	2 ASM 5733
	3 BHR 5504
	4 DEL 5181
	5 GUJ 5286
	6 HP 2200



In [61]:

    
ggplot(df2015_priceMax)+
aes(reorder(state,maxPrice),weight=maxPrice)+
geom_bar() +
coord_flip()
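Since the exercise asks about 2015 only, the year filter has to be applied before grouping. The sketch below uses a hypothetical data frame (made-up prices) to show how the per-state maximum differs with and without that filter.

```r
library(dplyr)

toy <- data.frame(
  state    = c("AP", "AP", "HP", "DEL"),
  year     = c(2015L, 2013L, 2013L, 2015L),
  priceMax = c(2000, 5305, 2200, 2500),
  stringsAsFactors = FALSE
)

# Maximum over all years: picks up the 2013 values as well
max_all_years <- toy %>%
                 group_by(state) %>%
                 summarise(maxPrice = max(priceMax))

# Maximum for 2015 only: filter first, then group
max_2015 <- toy %>%
            filter(year == 2015) %>%
            group_by(state) %>%
            summarise(maxPrice = max(priceMax)) %>%
            arrange(desc(maxPrice))
```

In this toy example the all-years summary reports AP at 5305 and includes HP, while the 2015-only summary has just DEL and AP.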

Question 2 - Has the price variation in Onion prices in Bangalore really gone up over the years?



In [62]:

    
head(df)









    Out[62]:





market month year quantity priceMin priceMax priceMod city state date

	1 ABOHAR(PB) January 2005 2350 404 493 446 ABOHAR PB 2005-01-01
	2 ABOHAR(PB) January 2006 900 487 638 563 ABOHAR PB 2006-01-01
	3 ABOHAR(PB) January 2010 790 1283 1592 1460 ABOHAR PB 2010-01-01
	4 ABOHAR(PB) January 2011 245 3067 3750 3433 ABOHAR PB 2011-01-01
	5 ABOHAR(PB) January 2012 1035 523 686 605 ABOHAR PB 2012-01-01
	6 ABOHAR(PB) January 2013 675 1327 1900 1605 ABOHAR PB 2013-01-01



In [63]:

    
dfBang <- df %>% filter(city == 'BANGALORE')



In [64]:

    
head(dfBang)









    Out[64]:





market month year quantity priceMin priceMax priceMod city state date

	1 BANGALORE January 2004 227832 916 1066 991 BANGALORE KNT 2004-01-01
	2 BANGALORE January 2005 335679 470 597 522 BANGALORE KNT 2005-01-01
	3 BANGALORE January 2006 412185 286 617 537 BANGALORE KNT 2006-01-01
	4 BANGALORE January 2007 268268 586 1167 942 BANGALORE KNT 2007-01-01
	5 BANGALORE January 2008 393806 174 671 472 BANGALORE KNT 2008-01-01
	6 BANGALORE January 2009 374380 848 1554 1328 BANGALORE KNT 2009-01-01



In [65]:

    
summary(dfBang)









    Out[65]:





            market         month         year         quantity      
 BANGALORE     :147   February:13   Min.   :2004   Min.   :  63824  
 ABOHAR(PB)    :  0   January :13   1st Qu.:2007   1st Qu.: 329750  
 AGRA(UP)      :  0   March   :13   Median :2010   Median : 405716  
 AHMEDABAD(GUJ):  0   April   :12   Mean   :2010   Mean   : 523630  
 AHMEDNAGAR(MS):  0   August  :12   3rd Qu.:2013   3rd Qu.: 660674  
 AJMER(RAJ)    :  0   December:12   Max.   :2016   Max.   :1639032  
 (Other)       :  0   (Other) :72                                   
    priceMin         priceMax       priceMod            city         state    
 Min.   : 145.0   Min.   : 338   Min.   : 320   BANGALORE :147   KNT    :147  
 1st Qu.: 306.5   1st Qu.: 687   1st Qu.: 551   ABOHAR    :  0   AP     :  0  
 Median : 441.0   Median :1021   Median : 828   AGRA      :  0   ASM    :  0  
 Mean   : 555.2   Mean   :1312   Mean   :1041   AHMEDABAD :  0   BHR    :  0  
 3rd Qu.: 651.0   3rd Qu.:1612   3rd Qu.:1323   AHMEDNAGAR:  0   DEL    :  0  
 Max.   :2377.0   Max.   :4698   Max.   :3430   AJMER     :  0   GUJ    :  0  
                                                (Other)   :  0   (Other):  0  
      date           
 Min.   :2004-01-01  
 1st Qu.:2007-01-16  
 Median :2010-02-01  
 Mean   :2010-01-30  
 3rd Qu.:2013-02-15  
 Max.   :2016-03-01



In [66]:

    
# Sort the rows by date
dfBang <- dfBang %>% 
          arrange(date)
head(dfBang)









    Out[66]:





market month year quantity priceMin priceMax priceMod city state date

	1 BANGALORE January 2004 227832 916 1066 991 BANGALORE KNT 2004-01-01
	2 BANGALORE February 2004 225133 741 870 793 BANGALORE KNT 2004-02-01
	3 BANGALORE March 2004 221952 527 586 556 BANGALORE KNT 2004-03-01
	4 BANGALORE April 2004 185150 419 518 465 BANGALORE KNT 2004-04-01
	5 BANGALORE May 2004 137390 400 516 455 BANGALORE KNT 2004-05-01
	6 BANGALORE June 2004 311445 486 621 551 BANGALORE KNT 2004-06-01



In [67]:

    
ggplot(dfBang) + aes(date, priceMod) + geom_line()

PRINCIPLE: Convert from Wide format to Tall format using `gather`

Many times during exploration, we will need to convert the data frame from wide format to tall format (and vice versa).



In [68]:

    
head(dfBang)









    Out[68]:





market month year quantity priceMin priceMax priceMod city state date

	1 BANGALORE January 2004 227832 916 1066 991 BANGALORE KNT 2004-01-01
	2 BANGALORE February 2004 225133 741 870 793 BANGALORE KNT 2004-02-01
	3 BANGALORE March 2004 221952 527 586 556 BANGALORE KNT 2004-03-01
	4 BANGALORE April 2004 185150 419 518 465 BANGALORE KNT 2004-04-01
	5 BANGALORE May 2004 137390 400 516 455 BANGALORE KNT 2004-05-01
	6 BANGALORE June 2004 311445 486 621 551 BANGALORE KNT 2004-06-01



In [69]:

    
library(tidyr)



In [71]:

    
dim(dfBang)









    Out[71]:





	147
	10



In [85]:

    
dfBangTall <- dfBang %>%
              gather("priceType", "priceValue",5:7) %>%
              arrange(date)



In [86]:

    
dim(dfBangTall)









    Out[86]:





	441
	9



In [87]:

    
head(dfBangTall)









    Out[87]:





market month year quantity city state date priceType priceValue

	1 BANGALORE January 2004 227832 BANGALORE KNT 2004-01-01 priceMin 916
	2 BANGALORE January 2004 227832 BANGALORE KNT 2004-01-01 priceMax 1066
	3 BANGALORE January 2004 227832 BANGALORE KNT 2004-01-01 priceMod 991
	4 BANGALORE February 2004 225133 BANGALORE KNT 2004-02-01 priceMin 741
	5 BANGALORE February 2004 225133 BANGALORE KNT 2004-02-01 priceMax 870
	6 BANGALORE February 2004 225133 BANGALORE KNT 2004-02-01 priceMod 793
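To see the reshaping in isolation, here is a sketch on a two-row data frame built from the first two Bangalore rows above. It names the price columns explicitly instead of using positions `5:7`, which is a stylistic choice here, not what the notebook cell does.

```r
library(tidyr)

wide <- data.frame(
  date     = as.Date(c("2004-01-01", "2004-02-01")),
  priceMin = c(916, 741),
  priceMax = c(1066, 870),
  priceMod = c(991, 793)
)

# Stack the three price columns into a key column and a value column
tall <- gather(wide, key = "priceType", value = "priceValue",
               priceMin, priceMax, priceMod)

dim_wide <- dim(wide)   # 2 rows, 4 columns
dim_tall <- dim(tall)   # 6 rows, 3 columns

# Newer tidyr releases (1.0.0 and later) also offer pivot_longer() for this reshaping.
```

Each original row becomes three rows, one per price type, which matches the jump from 147 to 441 rows seen for `dfBangTall`.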



In [88]:

    
ggplot(dfBangTall) + aes(date, y = priceValue, color = priceType) + geom_line()

PRINCIPLE: Create new variables using `mutate`

To measure the price range within each month, we create a new price difference variable: the difference between priceMax and priceMin



In [89]:

    
dfBang <- dfBang %>% 
          mutate(priceDiff = priceMax - priceMin)



In [90]:

    
head(dfBang)









    Out[90]:





market month year quantity priceMin priceMax priceMod city state date priceDiff

	1 BANGALORE January 2004 227832 916 1066 991 BANGALORE KNT 2004-01-01 150
	2 BANGALORE February 2004 225133 741 870 793 BANGALORE KNT 2004-02-01 129
	3 BANGALORE March 2004 221952 527 586 556 BANGALORE KNT 2004-03-01 59
	4 BANGALORE April 2004 185150 419 518 465 BANGALORE KNT 2004-04-01 99
	5 BANGALORE May 2004 137390 400 516 455 BANGALORE KNT 2004-05-01 116
	6 BANGALORE June 2004 311445 486 621 551 BANGALORE KNT 2004-06-01 135



In [95]:

    
ggplot(dfBang) + aes(date, priceDiff) + geom_line() + geom_point(aes(date,priceMod))
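One way to turn the monthly `priceDiff` into a yearly view of Question 2 is to average it per year. This is a sketch of that idea and not a step taken in this notebook; the toy rows below are illustrative and `meanDiff` and `maxDiff` are hypothetical column names.

```r
library(dplyr)

toy <- data.frame(
  year     = c(2004L, 2004L, 2005L, 2005L),
  priceMin = c(916, 741, 470, 400),
  priceMax = c(1066, 870, 597, 1000)
)

yearly_spread <- toy %>%
                 mutate(priceDiff = priceMax - priceMin) %>%
                 group_by(year) %>%
                 summarize(meanDiff = mean(priceDiff),
                           maxDiff  = max(priceDiff))
```

A rising `meanDiff` across years would suggest that the monthly price range has widened, though it says nothing about the cause.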

PRINCIPLE: Pivot Table

A pivot table summarizes a data frame into rows, columns and values



In [96]:

    
head(dfBang)









    Out[96]:





market month year quantity priceMin priceMax priceMod city state date priceDiff

	1 BANGALORE January 2004 227832 916 1066 991 BANGALORE KNT 2004-01-01 150
	2 BANGALORE February 2004 225133 741 870 793 BANGALORE KNT 2004-02-01 129
	3 BANGALORE March 2004 221952 527 586 556 BANGALORE KNT 2004-03-01 59
	4 BANGALORE April 2004 185150 419 518 465 BANGALORE KNT 2004-04-01 99
	5 BANGALORE May 2004 137390 400 516 455 BANGALORE KNT 2004-05-01 116
	6 BANGALORE June 2004 311445 486 621 551 BANGALORE KNT 2004-06-01 135



In [97]:

    
library(lubridate)



In [98]:

    
# Create new variable for Integer Month
dfBang <- dfBang %>%
          mutate(monthVal = month(date))



In [99]:

    
head(dfBang)









    Out[99]:





market month year quantity priceMin priceMax priceMod city state date priceDiff monthVal

	1 BANGALORE January 2004 227832 916 1066 991 BANGALORE KNT 2004-01-01 150 1
	2 BANGALORE February 2004 225133 741 870 793 BANGALORE KNT 2004-02-01 129 2
	3 BANGALORE March 2004 221952 527 586 556 BANGALORE KNT 2004-03-01 59 3
	4 BANGALORE April 2004 185150 419 518 465 BANGALORE KNT 2004-04-01 99 4
	5 BANGALORE May 2004 137390 400 516 455 BANGALORE KNT 2004-05-01 116 5
	6 BANGALORE June 2004 311445 486 621 551 BANGALORE KNT 2004-06-01 135 6



In [100]:

    
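# Each (year, monthVal) group holds a single row, so sum() just carries the value through
dfBangGroup <- dfBang %>%
               group_by(year, monthVal) %>% 
               summarize(priceDiff = sum(priceDiff))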



In [101]:

    
head(dfBangGroup)









    Out[101]:





year monthVal priceDiff

	1 2004 1 150
	2 2004 2 129
	3 2004 3 59
	4 2004 4 99
	5 2004 5 116
	6 2004 6 135



In [91]:

    
str(dfBangGroup)









    



Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':	147 obs. of  3 variables:
 $ year     : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
 $ monthVal : num  1 2 3 4 5 6 7 8 9 10 ...
 $ priceDiff: int  150 129 59 99 116 135 167 145 98 111 ...
 - attr(*, "vars")=List of 1
  ..$ : symbol year
 - attr(*, "drop")= logi TRUE

PRINCIPLE: Convert from Tall format to Wide format using `spread`

Many times during exploration, we will need to convert the data frame from tall format to wide format.



In [102]:

    
dfBangPivot <- dfBangGroup %>%
               spread(monthVal, priceDiff)



In [103]:

    
dfBangPivot <- dfBang %>%
               group_by(year, monthVal) %>% 
               summarize(priceDiff = sum(priceDiff)) %>%
               spread(monthVal, priceDiff)



In [104]:

    
head(dfBangPivot)









    Out[104]:





year 1 2 3 4 5 6 7 8 9 10 11 12

	1 2004 150 129 59 99 116 135 167 145 98 111 120 177
	2 2005 127 110 79 75 56 107 176 169 219 602 1131 521
	3 2006 331 209 147 169 142 157 189 192 290 319 337 189
	4 2007 581 611 398 181 196 159 248 381 382 914 824 643
	5 2008 497 373 334 274 331 350 348 317 537 588 604 604
	6 2009 706 663 387 341 288 466 374 379 495 1259 1616 1914



In [108]:

    
ggplot(dfBang) + aes(monthVal, weight = priceDiff) + geom_bar() + facet_wrap(~year)



In [ ]:

Exercise: Find the price variation for LASALGAON city?
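A possible starting point for this exercise is to wrap the Bangalore steps in a small function that takes the city as an argument. `price_variation` is a hypothetical helper name, and the toy rows below are made up so that the sketch runs without the CSV file.

```r
library(dplyr)

# Hypothetical helper: monthly price range for one city, sorted by date
price_variation <- function(data, city_name) {
  data %>%
    filter(city == city_name) %>%
    arrange(date) %>%
    mutate(priceDiff = priceMax - priceMin) %>%
    select(date, priceMin, priceMax, priceDiff)
}

toy <- data.frame(
  city     = c("LASALGAON", "BANGALORE", "LASALGAON"),
  date     = as.Date(c("2010-02-01", "2010-01-01", "2010-01-01")),
  priceMin = c(500, 900, 400),
  priceMax = c(800, 1100, 650),
  stringsAsFactors = FALSE
)

lasalgaon <- price_variation(toy, "LASALGAON")
```

On the real data, `price_variation(df, "LASALGAON")` could then be plotted with `ggplot(...) + aes(date, priceDiff) + geom_line()`, in the same way as the Bangalore chart.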


